linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
@ 2022-08-29  5:54 Dietmar Eggemann
  2022-08-29  5:54 ` [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier Dietmar Eggemann
  2022-09-20 14:07 ` [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Jian-Min Liu
  0 siblings, 2 replies; 61+ messages in thread
From: Dietmar Eggemann @ 2022-08-29  5:54 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Morten Rasmussen,
	Vincent Donnefort
  Cc: Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Jian-Min, Qais Yousef, linux-kernel

Many Android devices still prefer to run PELT with a shorter halflife
than the hardcoded 32ms in mainline.

The Android folks claim better response time for display pipeline tasks
(higher min and avg fps at 60, 90 or 120Hz refresh rates). Some
benchmarks, like PCMark web-browsing, show higher scores when running
with a 16ms or 8ms PELT halflife. The gain in response time and
performance is considered to outweigh the increase in energy
consumption in these cases.

The original idea of introducing a compile-time PELT halflife option
for 32, 16 or 8ms, from Patrick Bellasi in 2018
https://lkml.kernel.org/r/20180409165134.707-1-patrick.bellasi@arm.com
wasn't integrated into mainline, mainly because it breaks the PELT
stability requirement (see (1) below).

We have been experimenting with a new idea from Morten Rasmussen:
instead, introduce an additional clock between the task clock and the
pelt clock. This way the effect of a shorter PELT halflife of 16ms or
8ms can be achieved by left-shifting the elapsed time. This is similar
to the time shifting of the pelt clock used to achieve scale invariance
in PELT. The implementation is from Vincent Donnefort, with some minor
modifications to align with current tip sched/core.
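
For illustration only (not part of the patch): left-shifting the
elapsed time by s is equivalent to running with a halflife of
32ms >> s, since with a per-ms decay factor y = 2^(-1/32) we get
y^(delta << s) = (y^(2^s))^delta, and y^2 = 2^(-1/16), y^4 = 2^(-1/8).
A standalone check of this identity:

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y32 = pow(0.5, 1.0 / 32.0);	/* per-ms decay, halflife 32ms */
	double t = 10.0;			/* arbitrary elapsed time in ms */
	int s;

	for (s = 0; s <= 2; s++) {
		/* decay over (t << s) ms at halflife 32ms ... */
		double shifted = pow(y32, t * (1 << s));
		/* ... equals decay over t ms at halflife 32ms >> s */
		double shorter = pow(0.5, t / (32.0 / (1 << s)));

		printf("shift %d: %.10f == %.10f (halflife %2dms)\n",
		       s, shifted, shorter, 32 >> s);
	}
	return 0;
}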

---

Known potential issues:

(1) PELT stability requirement

PELT halflife has to be larger than or equal to the scheduling period.

The sched_period (sysctl_sched_latency) of a typical mobile device with
8 CPUs and the default logarithmic tuning is 24ms, so only the 32ms
PELT halflife meets this requirement; a shorter halflife of 16ms or
even 8ms would break it.
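
For reference, the 24ms figure follows from the default logarithmic
scaling of sysctl_sched_latency (6ms base scaled by 1 + ilog2(nr_cpus));
a standalone sketch of that arithmetic:

#include <stdio.h>

int main(void)
{
	unsigned int base_ms = 6;	/* normalized_sysctl_sched_latency */
	unsigned int cpus = 8;
	unsigned int factor = 1;	/* SCHED_TUNABLESCALING_LOG: 1 + ilog2(cpus) */

	while (cpus >>= 1)
		factor++;

	printf("sched_latency = %u ms\n", base_ms * factor);	/* 6 * 4 = 24 ms */
	return 0;
}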

It looks like this problem might not exist anymore because of the PELT
rewrite in 2015, i.e. commit 9d89c257dfb9
("sched/fair: Rewrite runnable load and utilization average tracking").
Since then, sched entities (tasks & task groups) and cfs_rq's are
maintained independently, rather than each entity update maintaining
the cfs_rq at the same time.

This seems to mitigate the issue that the cfs_rq signal is not correct
when not all runnable entities are able to do a self-update during a
PELT halflife window.

That said, I'm not entirely sure whether the entity-cfs_rq
synchronization is the only issue behind this PELT stability requirement.


(2) PELT utilization versus util_est (estimated utilization)

The PELT signal of a periodic task oscillates with a higher peak
amplitude when using a smaller halflife. For a typical periodic display
pipeline task with a runtime/period of 8ms/16ms, the peak amplitude is
~40 for 32ms, ~80 for 16ms and ~160 for 8ms. Util_est stores the
util_avg peak as util_est.enqueued per task.

With an additional exponentially weighted moving average (ewma) to
smooth decreases in task utilization, the util_est values of the
runnable tasks are aggregated on the root cfs_rq. The CPU and task
utilization used for CPU frequency selection and task placement is the
max of util_est and util_avg. I.e. because of how util_est is
implemented, higher CPU Operating Performance Points and more capable
CPUs are already chosen when using a smaller PELT halflife.
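
The ~40/~80/~160 peak amplitudes can be reproduced with a very
simplified model of the 8ms/16ms task (pure per-ms geometric decay,
ignoring PELT's segment accounting and the util_sum/divider rounding);
this standalone program is only an illustration and lands in the same
ballpark:

#include <math.h>
#include <stdio.h>

int main(void)
{
	const int run = 8, period = 16;		/* 8ms runtime / 16ms period */
	const int hl[] = { 32, 16, 8 };
	int i, t;

	for (i = 0; i < 3; i++) {
		double y = pow(0.5, 1.0 / hl[i]);	/* decay per ms */
		double u = 0.0, lo = 1024.0, hi = 0.0;

		for (t = 0; t < 4000; t++) {		/* settle into steady state */
			int running = (t % period) < run;

			u = u * y + (running ? 1024.0 * (1.0 - y) : 0.0);
			if (t >= 4000 - period) {	/* sample the last period */
				if (u < lo) lo = u;
				if (u > hi) hi = u;
			}
		}
		printf("halflife %2dms: util_avg swings ~%.0f around its mean\n",
		       hl[i], (hi - lo) / 2.0);
	}
	return 0;
}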


(3) Wrong PELT history when switching PELT multiplier

The PELT history becomes stale the moment the PELT multiplier is
changed at runtime. So all decisions based on PELT are skewed for the
time interval needed to build up LOAD_AVG_MAX (the sum of the infinite
geometric series), which is ~345ms for halflife=32ms (smaller for 16ms
or 8ms).

Rate limiting the PELT multiplier change to this interval does not
solve the issue here. So the user would have to live with possibly
incorrect decisions during these PELT multiplier transition times.
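
A rough way to see this timescale (not the exact derivation of
LOAD_AVG_MAX, just an order-of-magnitude check): the weight of the
pre-switch history decays as 2^(-t/halflife), so it drops below 1/1024
(one utilization unit) after halflife * log2(1024) = 10 * halflife ms,
i.e. ~320ms for a 32ms halflife, in line with the ~345ms above:

#include <math.h>
#include <stdio.h>

int main(void)
{
	int hl[] = { 32, 16, 8 }, i;

	for (i = 0; i < 3; i++)
		printf("halflife %2dms: old history weight < 1/1024 after ~%.0f ms\n",
		       hl[i], hl[i] * log2(1024.0));
	return 0;
}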

---

It looks like individual task boosting, e.g. via uclamp_min, possibly
abstracted by middleware frameworks like the Android Dynamic Performance
Framework (ADPF), would be the way to go here. But until this is fully
available and adopted, some Android folks will still prefer the overall
system boosting they achieve by running with a shorter PELT halflife.

Vincent Donnefort (1):
  sched/pelt: Introduce PELT multiplier

 kernel/sched/core.c  |  2 +-
 kernel/sched/pelt.c  | 60 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/pelt.h  | 42 ++++++++++++++++++++++++++++---
 kernel/sched/sched.h |  1 +
 4 files changed, 100 insertions(+), 5 deletions(-)

-- 
2.25.1



* [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-08-29  5:54 [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Dietmar Eggemann
@ 2022-08-29  5:54 ` Dietmar Eggemann
  2022-08-29  8:08   ` Peter Zijlstra
  2022-09-20 14:07 ` [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Jian-Min Liu
  1 sibling, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2022-08-29  5:54 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Morten Rasmussen,
	Vincent Donnefort
  Cc: Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Jian-Min, Qais Yousef, linux-kernel

From: Vincent Donnefort <vincent.donnefort@arm.com>

The new sysctl sched_pelt_multiplier allows a user to set a clock
multiplier to x2 or x4 (x1 being the default). This clock multiplier
artificially speeds up PELT ramp up/down similarly to use a faster
half-life than the default 32ms.

  - x1: 32ms half-life
  - x2: 16ms half-life
  - x4: 8ms  half-life

Internally, a new clock is created: rq->clock_task_mult. It sits in the
clock hierarchy between rq->clock_task and rq->clock_pelt.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/core.c  |  2 +-
 kernel/sched/pelt.c  | 60 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/pelt.h  | 42 ++++++++++++++++++++++++++++---
 kernel/sched/sched.h |  1 +
 4 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 603a80ec9b0e..6203cead4ad3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -727,7 +727,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
 		update_irq_load_avg(rq, irq_delta + steal);
 #endif
-	update_rq_clock_pelt(rq, delta);
+	update_rq_clock_task_mult(rq, delta);
 }
 
 void update_rq_clock(struct rq *rq)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 9adfc4744544..9fa853a64269 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -472,3 +472,63 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	return ret;
 }
 #endif
+
+__read_mostly unsigned int sched_pelt_lshift;
+
+#ifdef CONFIG_SYSCTL
+static unsigned int sysctl_sched_pelt_multiplier = 1;
+
+int sched_pelt_multiplier(struct ctl_table *table, int write, void *buffer,
+			  size_t *lenp, loff_t *ppos)
+{
+	static DEFINE_MUTEX(mutex);
+	unsigned int old;
+	int ret;
+
+	mutex_lock(&mutex);
+	old = sysctl_sched_pelt_multiplier;
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (ret)
+		goto undo;
+	if (!write)
+		goto done;
+
+	switch (sysctl_sched_pelt_multiplier)  {
+	case 1:
+		fallthrough;
+	case 2:
+		fallthrough;
+	case 4:
+		WRITE_ONCE(sched_pelt_lshift,
+			   sysctl_sched_pelt_multiplier >> 1);
+		goto done;
+	default:
+		ret = -EINVAL;
+	}
+
+undo:
+	sysctl_sched_pelt_multiplier = old;
+done:
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
+static struct ctl_table sched_pelt_sysctls[] = {
+	{
+		.procname       = "sched_pelt_multiplier",
+		.data           = &sysctl_sched_pelt_multiplier,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = sched_pelt_multiplier,
+	},
+	{}
+};
+
+static int __init sched_pelt_sysctl_init(void)
+{
+	register_sysctl_init("kernel", sched_pelt_sysctls);
+	return 0;
+}
+late_initcall(sched_pelt_sysctl_init);
+#endif
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 3a0e0dc28721..9b35b5072bae 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
 	WRITE_ONCE(avg->util_est.enqueued, enqueued);
 }
 
+static inline u64 rq_clock_task_mult(struct rq *rq)
+{
+	lockdep_assert_rq_held(rq);
+	assert_clock_updated(rq);
+
+	return rq->clock_task_mult;
+}
+
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
 	lockdep_assert_rq_held(rq);
@@ -72,7 +80,7 @@ static inline u64 rq_clock_pelt(struct rq *rq)
 /* The rq is idle, we can sync to clock_task */
 static inline void _update_idle_rq_clock_pelt(struct rq *rq)
 {
-	rq->clock_pelt  = rq_clock_task(rq);
+	rq->clock_pelt = rq_clock_task_mult(rq);
 
 	u64_u32_store(rq->clock_idle, rq_clock(rq));
 	/* Paired with smp_rmb in migrate_se_pelt_lag() */
@@ -121,6 +129,27 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
 	rq->clock_pelt += delta;
 }
 
+extern unsigned int sched_pelt_lshift;
+
+/*
+ * absolute time   |1      |2      |3      |4      |5      |6      |
+ * @ mult = 1      --------****************--------****************-
+ * @ mult = 2      --------********----------------********---------
+ * @ mult = 4      --------****--------------------****-------------
+ * clock task mult
+ * @ mult = 2      |   |   |2  |3  |   |   |   |   |5  |6  |   |   |
+ * @ mult = 4      | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
+ *
+ */
+static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
+{
+	delta <<= READ_ONCE(sched_pelt_lshift);
+
+	rq->clock_task_mult += delta;
+
+	update_rq_clock_pelt(rq, delta);
+}
+
 /*
  * When rq becomes idle, we have to check if it has lost idle time
  * because it was fully busy. A rq is fully used when the /Sum util_sum
@@ -147,7 +176,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
 	 * rq's clock_task.
 	 */
 	if (util_sum >= divider)
-		rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
+		rq->lost_idle_time += rq_clock_task_mult(rq) - rq->clock_pelt;
 
 	_update_idle_rq_clock_pelt(rq);
 }
@@ -218,13 +247,18 @@ update_irq_load_avg(struct rq *rq, u64 running)
 	return 0;
 }
 
-static inline u64 rq_clock_pelt(struct rq *rq)
+static inline u64 rq_clock_task_mult(struct rq *rq)
 {
 	return rq_clock_task(rq);
 }
 
+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+	return rq_clock_task_mult(rq);
+}
+
 static inline void
-update_rq_clock_pelt(struct rq *rq, s64 delta) { }
+update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
 
 static inline void
 update_idle_rq_clock_pelt(struct rq *rq) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index da17be6f27fd..62fc3cf10ea7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1007,6 +1007,7 @@ struct rq {
 	u64			clock;
 	/* Ensure that all clocks are in the same cache line */
 	u64			clock_task ____cacheline_aligned;
+	u64			clock_task_mult;
 	u64			clock_pelt;
 	unsigned long		lost_idle_time;
 	u64			clock_pelt_idle;
-- 
2.25.1



* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-08-29  5:54 ` [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier Dietmar Eggemann
@ 2022-08-29  8:08   ` Peter Zijlstra
  2022-08-29 10:02     ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2022-08-29  8:08 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Vincent Donnefort, Quentin Perret, Patrick Bellasi,
	Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel

On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote:
> From: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> The new sysctl sched_pelt_multiplier allows a user to set a clock
> multiplier to x2 or x4 (x1 being the default). This clock multiplier
> artificially speeds up PELT ramp up/down similarly to use a faster
> half-life than the default 32ms.
> 
>   - x1: 32ms half-life
>   - x2: 16ms half-life
>   - x4: 8ms  half-life
> 
> Internally, a new clock is created: rq->clock_task_mult. It sits in the
> clock hierarchy between rq->clock_task and rq->clock_pelt.
> 
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

> +extern unsigned int sched_pelt_lshift;
> +
> +/*
> + * absolute time   |1      |2      |3      |4      |5      |6      |
> + * @ mult = 1      --------****************--------****************-
> + * @ mult = 2      --------********----------------********---------
> + * @ mult = 4      --------****--------------------****-------------
> + * clock task mult
> + * @ mult = 2      |   |   |2  |3  |   |   |   |   |5  |6  |   |   |
> + * @ mult = 4      | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
> + *
> + */
> +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
> +{
> +	delta <<= READ_ONCE(sched_pelt_lshift);
> +
> +	rq->clock_task_mult += delta;
> +
> +	update_rq_clock_pelt(rq, delta);
> +}

Hurmph... I'd almost go write you something like
static_call()/static_branch() but for immediates.

That said; given there's only like 3 options, perhaps a few
static_branch() instances work just fine ?
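
(For illustration, a rough and untested sketch of the static_branch()
variant, reusing update_rq_clock_task_mult() from the patch above; the
key names here are made up:)

DEFINE_STATIC_KEY_FALSE(sched_pelt_m2);
DEFINE_STATIC_KEY_FALSE(sched_pelt_m4);

static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
{
	if (static_branch_unlikely(&sched_pelt_m2))
		delta *= 2;
	else if (static_branch_unlikely(&sched_pelt_m4))
		delta *= 4;

	rq->clock_task_mult += delta;

	update_rq_clock_pelt(rq, delta);
}

/* whatever knob ends up selecting the multiplier would then call e.g.
 * static_branch_enable(&sched_pelt_m2)
 */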


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-08-29  8:08   ` Peter Zijlstra
@ 2022-08-29 10:02     ` Peter Zijlstra
  2022-08-29 10:13       ` Vincent Guittot
  2022-09-02  7:53       ` Dietmar Eggemann
  0 siblings, 2 replies; 61+ messages in thread
From: Peter Zijlstra @ 2022-08-29 10:02 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Vincent Donnefort, Quentin Perret, Patrick Bellasi,
	Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel

On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote:
> On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote:
> > From: Vincent Donnefort <vincent.donnefort@arm.com>
> > 
> > The new sysctl sched_pelt_multiplier allows a user to set a clock
> > multiplier to x2 or x4 (x1 being the default). This clock multiplier
> > artificially speeds up PELT ramp up/down similarly to use a faster
> > half-life than the default 32ms.
> > 
> >   - x1: 32ms half-life
> >   - x2: 16ms half-life
> >   - x4: 8ms  half-life
> > 
> > Internally, a new clock is created: rq->clock_task_mult. It sits in the
> > clock hierarchy between rq->clock_task and rq->clock_pelt.
> > 
> > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> > +extern unsigned int sched_pelt_lshift;
> > +
> > +/*
> > + * absolute time   |1      |2      |3      |4      |5      |6      |
> > + * @ mult = 1      --------****************--------****************-
> > + * @ mult = 2      --------********----------------********---------
> > + * @ mult = 4      --------****--------------------****-------------
> > + * clock task mult
> > + * @ mult = 2      |   |   |2  |3  |   |   |   |   |5  |6  |   |   |
> > + * @ mult = 4      | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
> > + *
> > + */
> > +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
> > +{
> > +	delta <<= READ_ONCE(sched_pelt_lshift);
> > +
> > +	rq->clock_task_mult += delta;
> > +
> > +	update_rq_clock_pelt(rq, delta);
> > +}
> 
> Hurmph... I'd almost go write you something like
> static_call()/static_branch() but for immediates.
> 
> That said; given there's only like 3 options, perhaps a few
> static_branch() instances work just fine ?

Also, I'm not at all sure about exposing that as an official sysctl.

How about something like so?

---
Subject: sched/pelt: Introduce PELT multiplier
From: Vincent Donnefort <vincent.donnefort@arm.com>
Date: Mon, 29 Aug 2022 07:54:50 +0200

From: Vincent Donnefort <vincent.donnefort@arm.com>

The new sysctl sched_pelt_multiplier allows a user to set a clock
multiplier to x2 or x4 (x1 being the default). This clock multiplier
artificially speeds up PELT ramp up/down similarly to use a faster
half-life than the default 32ms.

  - x1: 32ms half-life
  - x2: 16ms half-life
  - x4: 8ms  half-life

Internally, a new clock is created: rq->clock_task_mult. It sits in the
clock hierarchy between rq->clock_task and rq->clock_pelt.

[peterz: Use sched_feat()]
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c     |    2 +-
 kernel/sched/features.h |    3 +++
 kernel/sched/pelt.h     |   45 +++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h    |    1 +
 4 files changed, 46 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -727,7 +727,7 @@ static void update_rq_clock_task(struct
 	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
 		update_irq_load_avg(rq, irq_delta + steal);
 #endif
-	update_rq_clock_pelt(rq, delta);
+	update_rq_clock_task_mult(rq, delta);
 }
 
 void update_rq_clock(struct rq *rq)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,6 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(PELT_M2, false)
+SCHED_FEAT(PELT_M4, false)
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -61,6 +61,14 @@ static inline void cfs_se_util_change(st
 	WRITE_ONCE(avg->util_est.enqueued, enqueued);
 }
 
+static inline u64 rq_clock_task_mult(struct rq *rq)
+{
+	lockdep_assert_rq_held(rq);
+	assert_clock_updated(rq);
+
+	return rq->clock_task_mult;
+}
+
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
 	lockdep_assert_rq_held(rq);
@@ -72,7 +80,7 @@ static inline u64 rq_clock_pelt(struct r
 /* The rq is idle, we can sync to clock_task */
 static inline void _update_idle_rq_clock_pelt(struct rq *rq)
 {
-	rq->clock_pelt  = rq_clock_task(rq);
+	rq->clock_pelt = rq_clock_task_mult(rq);
 
 	u64_u32_store(rq->clock_idle, rq_clock(rq));
 	/* Paired with smp_rmb in migrate_se_pelt_lag() */
@@ -121,6 +129,30 @@ static inline void update_rq_clock_pelt(
 	rq->clock_pelt += delta;
 }
 
+extern unsigned int sched_pelt_lshift;
+
+/*
+ * absolute time   |1      |2      |3      |4      |5      |6      |
+ * @ mult = 1      --------****************--------****************-
+ * @ mult = 2      --------********----------------********---------
+ * @ mult = 4      --------****--------------------****-------------
+ * clock task mult
+ * @ mult = 2      |   |   |2  |3  |   |   |   |   |5  |6  |   |   |
+ * @ mult = 4      | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
+ *
+ */
+static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
+{
+	if (sched_feat(PELT_M2))
+		delta *= 2;
+	if (sched_feat(PELT_M4))
+		delta *= 4;
+
+	rq->clock_task_mult += delta;
+
+	update_rq_clock_pelt(rq, delta);
+}
+
 /*
  * When rq becomes idle, we have to check if it has lost idle time
  * because it was fully busy. A rq is fully used when the /Sum util_sum
@@ -147,7 +179,7 @@ static inline void update_idle_rq_clock_
 	 * rq's clock_task.
 	 */
 	if (util_sum >= divider)
-		rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
+		rq->lost_idle_time += rq_clock_task_mult(rq) - rq->clock_pelt;
 
 	_update_idle_rq_clock_pelt(rq);
 }
@@ -218,13 +250,18 @@ update_irq_load_avg(struct rq *rq, u64 r
 	return 0;
 }
 
-static inline u64 rq_clock_pelt(struct rq *rq)
+static inline u64 rq_clock_task_mult(struct rq *rq)
 {
 	return rq_clock_task(rq);
 }
 
+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+	return rq_clock_task_mult(rq);
+}
+
 static inline void
-update_rq_clock_pelt(struct rq *rq, s64 delta) { }
+update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
 
 static inline void
 update_idle_rq_clock_pelt(struct rq *rq) { }
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1009,6 +1009,7 @@ struct rq {
 	u64			clock;
 	/* Ensure that all clocks are in the same cache line */
 	u64			clock_task ____cacheline_aligned;
+	u64			clock_task_mult;
 	u64			clock_pelt;
 	unsigned long		lost_idle_time;
 	u64			clock_pelt_idle;


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-08-29 10:02     ` Peter Zijlstra
@ 2022-08-29 10:13       ` Vincent Guittot
  2022-08-29 14:23         ` Quentin Perret
  2022-09-02  7:53         ` Dietmar Eggemann
  2022-09-02  7:53       ` Dietmar Eggemann
  1 sibling, 2 replies; 61+ messages in thread
From: Vincent Guittot @ 2022-08-29 10:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Ingo Molnar, Morten Rasmussen,
	Vincent Donnefort, Quentin Perret, Patrick Bellasi,
	Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel

On Mon, 29 Aug 2022 at 12:03, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote:
> > On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote:
> > > From: Vincent Donnefort <vincent.donnefort@arm.com>
> > >
> > > The new sysctl sched_pelt_multiplier allows a user to set a clock
> > > multiplier to x2 or x4 (x1 being the default). This clock multiplier
> > > artificially speeds up PELT ramp up/down similarly to use a faster
> > > half-life than the default 32ms.
> > >
> > >   - x1: 32ms half-life
> > >   - x2: 16ms half-life
> > >   - x4: 8ms  half-life
> > >
> > > Internally, a new clock is created: rq->clock_task_mult. It sits in the
> > > clock hierarchy between rq->clock_task and rq->clock_pelt.
> > >
> > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >
> > > +extern unsigned int sched_pelt_lshift;
> > > +
> > > +/*
> > > + * absolute time   |1      |2      |3      |4      |5      |6      |
> > > + * @ mult = 1      --------****************--------****************-
> > > + * @ mult = 2      --------********----------------********---------
> > > + * @ mult = 4      --------****--------------------****-------------
> > > + * clock task mult
> > > + * @ mult = 2      |   |   |2  |3  |   |   |   |   |5  |6  |   |   |
> > > + * @ mult = 4      | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
> > > + *
> > > + */
> > > +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
> > > +{
> > > +   delta <<= READ_ONCE(sched_pelt_lshift);
> > > +
> > > +   rq->clock_task_mult += delta;
> > > +
> > > +   update_rq_clock_pelt(rq, delta);
> > > +}
> >
> > Hurmph... I'd almost go write you something like
> > static_call()/static_branch() but for immediates.
> >
> > That said; given there's only like 3 options, perhaps a few
> > static_branch() instances work just fine ?
>
> Also, I'm not at all sure about exposing that as an official sysctl.

Me too, I would even make it a boot time parameter so we can remove
the new clock_task_mult clock and left shift clock_task or the delta
before passing it to clock_pelt.

>
> How about something like so?
>
> ---
> Subject: sched/pelt: Introduce PELT multiplier
> From: Vincent Donnefort <vincent.donnefort@arm.com>
> Date: Mon, 29 Aug 2022 07:54:50 +0200
>
> From: Vincent Donnefort <vincent.donnefort@arm.com>
>
> The new sysctl sched_pelt_multiplier allows a user to set a clock
> multiplier to x2 or x4 (x1 being the default). This clock multiplier
> artificially speeds up PELT ramp up/down similarly to use a faster
> half-life than the default 32ms.
>
>   - x1: 32ms half-life
>   - x2: 16ms half-life
>   - x4: 8ms  half-life
>
> Internally, a new clock is created: rq->clock_task_mult. It sits in the
> clock hierarchy between rq->clock_task and rq->clock_pelt.
>
> [peterz: Use sched_feat()]
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c     |    2 +-
>  kernel/sched/features.h |    3 +++
>  kernel/sched/pelt.h     |   45 +++++++++++++++++++++++++++++++++++++++++----
>  kernel/sched/sched.h    |    1 +
>  4 files changed, 46 insertions(+), 5 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -727,7 +727,7 @@ static void update_rq_clock_task(struct
>         if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
>                 update_irq_load_avg(rq, irq_delta + steal);
>  #endif
> -       update_rq_clock_pelt(rq, delta);
> +       update_rq_clock_task_mult(rq, delta);
>  }
>
>  void update_rq_clock(struct rq *rq)
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -101,3 +101,6 @@ SCHED_FEAT(LATENCY_WARN, false)
>
>  SCHED_FEAT(ALT_PERIOD, true)
>  SCHED_FEAT(BASE_SLICE, true)
> +
> +SCHED_FEAT(PELT_M2, false)
> +SCHED_FEAT(PELT_M4, false)
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -61,6 +61,14 @@ static inline void cfs_se_util_change(st
>         WRITE_ONCE(avg->util_est.enqueued, enqueued);
>  }
>
> +static inline u64 rq_clock_task_mult(struct rq *rq)
> +{
> +       lockdep_assert_rq_held(rq);
> +       assert_clock_updated(rq);
> +
> +       return rq->clock_task_mult;
> +}
> +
>  static inline u64 rq_clock_pelt(struct rq *rq)
>  {
>         lockdep_assert_rq_held(rq);
> @@ -72,7 +80,7 @@ static inline u64 rq_clock_pelt(struct r
>  /* The rq is idle, we can sync to clock_task */
>  static inline void _update_idle_rq_clock_pelt(struct rq *rq)
>  {
> -       rq->clock_pelt  = rq_clock_task(rq);
> +       rq->clock_pelt = rq_clock_task_mult(rq);
>
>         u64_u32_store(rq->clock_idle, rq_clock(rq));
>         /* Paired with smp_rmb in migrate_se_pelt_lag() */
> @@ -121,6 +129,30 @@ static inline void update_rq_clock_pelt(
>         rq->clock_pelt += delta;
>  }
>
> +extern unsigned int sched_pelt_lshift;
> +
> +/*
> + * absolute time   |1      |2      |3      |4      |5      |6      |
> + * @ mult = 1      --------****************--------****************-
> + * @ mult = 2      --------********----------------********---------
> + * @ mult = 4      --------****--------------------****-------------
> + * clock task mult
> + * @ mult = 2      |   |   |2  |3  |   |   |   |   |5  |6  |   |   |
> + * @ mult = 4      | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
> + *
> + */
> +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
> +{
> +       if (sched_feat(PELT_M2))
> +               delta *= 2;
> +       if (sched_feat(PELT_M4))
> +               delta *= 4;
> +
> +       rq->clock_task_mult += delta;
> +
> +       update_rq_clock_pelt(rq, delta);
> +}
> +
>  /*
>   * When rq becomes idle, we have to check if it has lost idle time
>   * because it was fully busy. A rq is fully used when the /Sum util_sum
> @@ -147,7 +179,7 @@ static inline void update_idle_rq_clock_
>          * rq's clock_task.
>          */
>         if (util_sum >= divider)
> -               rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> +               rq->lost_idle_time += rq_clock_task_mult(rq) - rq->clock_pelt;
>
>         _update_idle_rq_clock_pelt(rq);
>  }
> @@ -218,13 +250,18 @@ update_irq_load_avg(struct rq *rq, u64 r
>         return 0;
>  }
>
> -static inline u64 rq_clock_pelt(struct rq *rq)
> +static inline u64 rq_clock_task_mult(struct rq *rq)
>  {
>         return rq_clock_task(rq);
>  }
>
> +static inline u64 rq_clock_pelt(struct rq *rq)
> +{
> +       return rq_clock_task_mult(rq);
> +}
> +
>  static inline void
> -update_rq_clock_pelt(struct rq *rq, s64 delta) { }
> +update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
>
>  static inline void
>  update_idle_rq_clock_pelt(struct rq *rq) { }
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1009,6 +1009,7 @@ struct rq {
>         u64                     clock;
>         /* Ensure that all clocks are in the same cache line */
>         u64                     clock_task ____cacheline_aligned;
> +       u64                     clock_task_mult;
>         u64                     clock_pelt;
>         unsigned long           lost_idle_time;
>         u64                     clock_pelt_idle;


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-08-29 10:13       ` Vincent Guittot
@ 2022-08-29 14:23         ` Quentin Perret
  2022-08-29 14:34           ` Peter Zijlstra
  2022-09-02  7:53         ` Dietmar Eggemann
  1 sibling, 1 reply; 61+ messages in thread
From: Quentin Perret @ 2022-08-29 14:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Dietmar Eggemann, Ingo Molnar, Morten Rasmussen,
	Vincent Donnefort, Patrick Bellasi, Abhijeet Dharmapurikar,
	Jian-Min, Qais Yousef, linux-kernel

On Monday 29 Aug 2022 at 12:13:26 (+0200), Vincent Guittot wrote:
> On Mon, 29 Aug 2022 at 12:03, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote:
> > > On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote:
> > > > From: Vincent Donnefort <vincent.donnefort@arm.com>
> > > >
> > > > The new sysctl sched_pelt_multiplier allows a user to set a clock
> > > > multiplier to x2 or x4 (x1 being the default). This clock multiplier
> > > > artificially speeds up PELT ramp up/down similarly to use a faster
> > > > half-life than the default 32ms.
> > > >
> > > >   - x1: 32ms half-life
> > > >   - x2: 16ms half-life
> > > >   - x4: 8ms  half-life
> > > >
> > > > Internally, a new clock is created: rq->clock_task_mult. It sits in the
> > > > clock hierarchy between rq->clock_task and rq->clock_pelt.
> > > >
> > > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> > > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > >
> > > > +extern unsigned int sched_pelt_lshift;
> > > > +
> > > > +/*
> > > > + * absolute time   |1      |2      |3      |4      |5      |6      |
> > > > + * @ mult = 1      --------****************--------****************-
> > > > + * @ mult = 2      --------********----------------********---------
> > > > + * @ mult = 4      --------****--------------------****-------------
> > > > + * clock task mult
> > > > + * @ mult = 2      |   |   |2  |3  |   |   |   |   |5  |6  |   |   |
> > > > + * @ mult = 4      | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
> > > > + *
> > > > + */
> > > > +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
> > > > +{
> > > > +   delta <<= READ_ONCE(sched_pelt_lshift);
> > > > +
> > > > +   rq->clock_task_mult += delta;
> > > > +
> > > > +   update_rq_clock_pelt(rq, delta);
> > > > +}
> > >
> > > Hurmph... I'd almost go write you something like
> > > static_call()/static_branch() but for immediates.
> > >
> > > That said; given there's only like 3 options, perhaps a few
> > > static_branch() instances work just fine ?
> >
> > Also, I'm not at all sure about exposing that as an official sysctl.
> 
> Me too, I would even make it a boot time parameter so we can remove
> the new clock_task_mult clock and left shift clock_taslk or the delta
> before passing it to clock_pelt

I'll let folks in CC comment about their use-case in more detail, but
there's definitely been an interest in tuning this thing at run-time
FWIW. Typically a larger half-life will be fine for predictable
workloads with little input from users (e.g. fullscreen video playback)
while a lower one can be preferred in highly interactive cases (games,
...). The transient state is fun to reason about, but it really
shouldn't be too common AFAIK.

With that said I'd quite like to see numbers to back that claim.
Measuring power while running a video (for instance) with various HL
values should help. And similarly it shouldn't be too hard to get
performance numbers.


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-08-29 14:23         ` Quentin Perret
@ 2022-08-29 14:34           ` Peter Zijlstra
  2022-08-29 15:31             ` Quentin Perret
  2022-08-29 15:48             ` Quentin Perret
  0 siblings, 2 replies; 61+ messages in thread
From: Peter Zijlstra @ 2022-08-29 14:34 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Morten Rasmussen,
	Vincent Donnefort, Patrick Bellasi, Abhijeet Dharmapurikar,
	Jian-Min, Qais Yousef, linux-kernel

On Mon, Aug 29, 2022 at 02:23:17PM +0000, Quentin Perret wrote:

> I'll let folks in CC comment about their use-case in more details, but
> there's definitely been an interest in tuning this thing at run-time

An interest and it making sense are two very distinct things that bear
no relation to one another in any way.

> FWIW. Typically a larger half-life will be fine with predictable
> workloads with little inputs from users (e.g. fullscreen video playback)
> while a lower one can be preferred in highly interactive cases (games,

As per always; consider the combined workload.

> ...). The transient state is fun to reason about, but it really
> shouldn't be too common AFAIK.

Once you give away control there is no taking it back, and userspace
*will* do stupid things and expect unicorns.

> With that said I'd quite like to see numbers to back that claim.
> Measuring power while running a video (for instance) with various HL
> values should help. And similarly it shouldn't be too hard to get
> performance numbers.

I'm thinking this all needs far more than mere numbers to justify.


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-08-29 14:34           ` Peter Zijlstra
@ 2022-08-29 15:31             ` Quentin Perret
  2022-08-29 15:48             ` Quentin Perret
  1 sibling, 0 replies; 61+ messages in thread
From: Quentin Perret @ 2022-08-29 15:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Morten Rasmussen,
	Vincent Donnefort, Patrick Bellasi, Abhijeet Dharmapurikar,
	Jian-Min, Qais Yousef, linux-kernel

On Monday 29 Aug 2022 at 16:34:39 (+0200), Peter Zijlstra wrote:
> On Mon, Aug 29, 2022 at 02:23:17PM +0000, Quentin Perret wrote:
> > FWIW. Typically a larger half-life will be fine with predictable
> > workloads with little inputs from users (e.g. fullscreen video playback)
> > while a lower one can be preferred in highly interactive cases (games,
> 
> As per always; consider the combined workload.

That's the problem of whoever is going to decide which HL should be
used, which would all fall in userspace policy land. And yes that
userspace agent would have to know quite a bit about what is going on in
the system to make 'sane' decisions, but that is the case in Android.

The fact that choosing the right HL for a combined workload is hard
(that userspace agent might want to just default to a safe value?)
doesn't mean there is no value in other cases though...

I mean, if we agree that tuning the PELT HL makes sense at all (even as
a cmdline arg, or sched_feat or whatever), there is only one small step
to say the right PELT HL should indeed depend on the workload you're
currently running (especially with _vastly_ different usage patterns
seen in mobile).

> > ...). The transient state is fun to reason about, but it really
> > shouldn't be too common AFAIK.
> 
> Once you give away control there is no taking it back, and userspace
> *will* do stupid things and expect unicorns.

Sure. If you do stupid, at some point you get(/deserve) what you asked
for... We could perhaps rate-limit changes or something if we want to
make that clear?


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-08-29 14:34           ` Peter Zijlstra
  2022-08-29 15:31             ` Quentin Perret
@ 2022-08-29 15:48             ` Quentin Perret
  1 sibling, 0 replies; 61+ messages in thread
From: Quentin Perret @ 2022-08-29 15:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Morten Rasmussen,
	Vincent Donnefort, Patrick Bellasi, Abhijeet Dharmapurikar,
	Jian-Min, Qais Yousef, linux-kernel

On Monday 29 Aug 2022 at 16:34:39 (+0200), Peter Zijlstra wrote:
> On Mon, Aug 29, 2022 at 02:23:17PM +0000, Quentin Perret wrote:
> 
> > I'll let folks in CC comment about their use-case in more details, but
> > there's definitely been an interest in tuning this thing at run-time
> 
> An interest and it making sense are two very distinct things that bear
> no relation to one another in any way.

And just to clarify something here as my first message was misleading,
the run-time tuning of PELT HLs has been used in some Android products
for a little while already, so it's more than just a mere 'interest'.

That doesn't close the debate on whether this makes any sense, but I
figured this should be mentioned :-)


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-08-29 10:02     ` Peter Zijlstra
  2022-08-29 10:13       ` Vincent Guittot
@ 2022-09-02  7:53       ` Dietmar Eggemann
  2022-09-02  8:45         ` Peter Zijlstra
  1 sibling, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2022-09-02  7:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Vincent Donnefort, Quentin Perret, Patrick Bellasi,
	Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel

On 29/08/2022 12:02, Peter Zijlstra wrote:
> On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote:
>> On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote:
>>> From: Vincent Donnefort <vincent.donnefort@arm.com>

[...]

>>> +extern unsigned int sched_pelt_lshift;
>>> +
>>> +/*
>>> + * absolute time   |1      |2      |3      |4      |5      |6      |
>>> + * @ mult = 1      --------****************--------****************-
>>> + * @ mult = 2      --------********----------------********---------
>>> + * @ mult = 4      --------****--------------------****-------------
>>> + * clock task mult
>>> + * @ mult = 2      |   |   |2  |3  |   |   |   |   |5  |6  |   |   |
>>> + * @ mult = 4      | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
>>> + *
>>> + */
>>> +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
>>> +{
>>> +	delta <<= READ_ONCE(sched_pelt_lshift);
>>> +
>>> +	rq->clock_task_mult += delta;
>>> +
>>> +	update_rq_clock_pelt(rq, delta);
>>> +}
>>
>> Hurmph... I'd almost go write you something like
>> static_call()/static_branch() but for immediates.
>>
>> That said; given there's only like 3 options, perhaps a few
>> static_branch() instances work just fine ?
> 
> Also, I'm not at all sure about exposing that as an official sysctl.
> 
> How about something like so?

[...]

>  void update_rq_clock(struct rq *rq)
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -101,3 +101,6 @@ SCHED_FEAT(LATENCY_WARN, false)
>  
>  SCHED_FEAT(ALT_PERIOD, true)
>  SCHED_FEAT(BASE_SLICE, true)
> +
> +SCHED_FEAT(PELT_M2, false)
> +SCHED_FEAT(PELT_M4, false)

The sched features interface would definitely be much less official but
I think it's not useful for Android since they don't mount debugfs anymore.

[...]


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-08-29 10:13       ` Vincent Guittot
  2022-08-29 14:23         ` Quentin Perret
@ 2022-09-02  7:53         ` Dietmar Eggemann
  2022-09-02  8:45           ` Peter Zijlstra
  2022-09-06  5:49           ` Vincent Guittot
  1 sibling, 2 replies; 61+ messages in thread
From: Dietmar Eggemann @ 2022-09-02  7:53 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef,
	linux-kernel

On 29/08/2022 12:13, Vincent Guittot wrote:
> On Mon, 29 Aug 2022 at 12:03, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote:
>>> On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote:
>>>> From: Vincent Donnefort <vincent.donnefort@arm.com>

[...]

>>> Hurmph... I'd almost go write you something like
>>> static_call()/static_branch() but for immediates.
>>>
>>> That said; given there's only like 3 options, perhaps a few
>>> static_branch() instances work just fine ?
>>
>> Also, I'm not at all sure about exposing that as an official sysctl.
> 
> Me too, I would even make it a boot time parameter so we can remove

Isn't a sched feature even less official than a boot parameter?
But AFAIK at least some of the Android folks want to change this during
runtime and they don't have debugfs mounted.

> the new clock_task_mult clock and left shift clock_taslk or the delta
> before passing it to clock_pelt

We still need rq_clock_task_mult(rq), i.e. `rq->clock_task_mult` in
_update_idle_rq_clock_pelt() though.


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-09-02  7:53       ` Dietmar Eggemann
@ 2022-09-02  8:45         ` Peter Zijlstra
  0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2022-09-02  8:45 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Vincent Donnefort, Quentin Perret, Patrick Bellasi,
	Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel

On Fri, Sep 02, 2022 at 09:53:22AM +0200, Dietmar Eggemann wrote:
> >  void update_rq_clock(struct rq *rq)
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -101,3 +101,6 @@ SCHED_FEAT(LATENCY_WARN, false)
> >  
> >  SCHED_FEAT(ALT_PERIOD, true)
> >  SCHED_FEAT(BASE_SLICE, true)
> > +
> > +SCHED_FEAT(PELT_M2, false)
> > +SCHED_FEAT(PELT_M4, false)
> 
> The sched features interface would definitely be much less official but
> I think it's not useful for Android since they don't mount debugfs anymore.

Well, they can haz a small patch that sets either one or both to true,
no?


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-09-02  7:53         ` Dietmar Eggemann
@ 2022-09-02  8:45           ` Peter Zijlstra
  2022-09-06  5:49           ` Vincent Guittot
  1 sibling, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2022-09-02  8:45 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Ingo Molnar, Morten Rasmussen,
	Vincent Donnefort, Quentin Perret, Patrick Bellasi,
	Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel

On Fri, Sep 02, 2022 at 09:53:50AM +0200, Dietmar Eggemann wrote:

> But AFAIK at least some of the Android folks want to change this during
> runtime and they don't have debugfs mounted.

They want; I want an explanation of what exact problem is fixed how ;-)


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-09-02  7:53         ` Dietmar Eggemann
  2022-09-02  8:45           ` Peter Zijlstra
@ 2022-09-06  5:49           ` Vincent Guittot
  2022-09-08  6:50             ` Dietmar Eggemann
  1 sibling, 1 reply; 61+ messages in thread
From: Vincent Guittot @ 2022-09-06  5:49 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Jian-Min, Qais Yousef, linux-kernel

On Fri, 2 Sept 2022 at 09:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 29/08/2022 12:13, Vincent Guittot wrote:
> > On Mon, 29 Aug 2022 at 12:03, Peter Zijlstra <peterz@infradead.org> wrote:
> >>
> >> On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote:
> >>> On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote:
> >>>> From: Vincent Donnefort <vincent.donnefort@arm.com>
>
> [...]
>
> >>> Hurmph... I'd almost go write you something like
> >>> static_call()/static_branch() but for immediates.
> >>>
> >>> That said; given there's only like 3 options, perhaps a few
> >>> static_branch() instances work just fine ?
> >>
> >> Also, I'm not at all sure about exposing that as an official sysctl.
> >
> > Me too, I would even make it a boot time parameter so we can remove
>
> Isn't a sched feature even less official than a boot parameter?
> But AFAIK at least some of the Android folks want to change this during
> runtime and they don't have debugfs mounted.
>
> > the new clock_task_mult clock and left shift clock_taslk or the delta
> > before passing it to clock_pelt
>
> We still need rq_clock_task_mult(rq), i.e. `rq->clock_task_mult` in
> _update_idle_rq_clock_pelt() though.

Why ? If the mult is defined at boot we just have to use
"rq_clock_task(rq) << mult" instead of rq_clock_task(rq) when updating
clock_pelt
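
(A rough, untested sketch of that boot-time variant, for illustration
only; the parameter name and handler below are made up:)

static unsigned int sched_pelt_lshift __read_mostly;

static int __init set_sched_pelt_multiplier(char *str)
{
	unsigned int mult;

	if (kstrtouint(str, 0, &mult) || (mult != 1 && mult != 2 && mult != 4))
		return -EINVAL;

	sched_pelt_lshift = mult >> 1;	/* 1 -> 0, 2 -> 1, 4 -> 2 */
	return 0;
}
early_param("sched_pelt_multiplier", set_sched_pelt_multiplier);

/* update_rq_clock_task() would then do
 *	update_rq_clock_pelt(rq, delta << sched_pelt_lshift);
 * and _update_idle_rq_clock_pelt() would sync against
 *	rq_clock_task(rq) << sched_pelt_lshift
 * so no extra rq->clock_task_mult clock is needed.
 */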


* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier
  2022-09-06  5:49           ` Vincent Guittot
@ 2022-09-08  6:50             ` Dietmar Eggemann
  0 siblings, 0 replies; 61+ messages in thread
From: Dietmar Eggemann @ 2022-09-08  6:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Jian-Min, Qais Yousef, linux-kernel

On 06/09/2022 07:49, Vincent Guittot wrote:
> On Fri, 2 Sept 2022 at 09:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 29/08/2022 12:13, Vincent Guittot wrote:
>>> On Mon, 29 Aug 2022 at 12:03, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>> On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote:
>>>>> On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote:
>>>>>> From: Vincent Donnefort <vincent.donnefort@arm.com>

[...]

>> We still need rq_clock_task_mult(rq), i.e. `rq->clock_task_mult` in
>> _update_idle_rq_clock_pelt() though.
> 
> Why ? If the mult is defined at boot we just have to use
> "rq_clock_task(rq) << mult" instead of rq_clock_task(rq) when updating
> clock_pelt

Makes sense! Agreed.


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-08-29  5:54 [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Dietmar Eggemann
  2022-08-29  5:54 ` [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier Dietmar Eggemann
@ 2022-09-20 14:07 ` Jian-Min Liu
  2022-09-28 17:09   ` Dietmar Eggemann
  2022-09-29  9:47   ` Peter Zijlstra
  1 sibling, 2 replies; 61+ messages in thread
From: Jian-Min Liu @ 2022-09-20 14:07 UTC (permalink / raw)
  To: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort
  Cc: Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen


Here is some updated test data from an Android phone showing that
switching the PELT HL at runtime is helpful functionality.

We switch the PELT HL at runtime depending on the scenario, e.g. an 8ms
HL while playing a game and a 32ms HL for camera video recording.
Supporting runtime switching of the PELT HL gives flexibility for
different workloads.

The table below shows performance & power data points:

-----------------------------------------------------------------------
|                      |                 PELT halflife                |
|                      |----------------------------------------------|
|                      |       32      |       16      |       8      |
|                      |----------------------------------------------|
|                      | avg  min  avg | avg  min  avg | avg  min  avg|
| Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr|
|---------------------------------------------------------------------|
| HOK game 60fps       | 100  100  100 | 105 *134* 102 | 104 *152* 106|
| HOK game 90fps       | 100  100  100 | 101 *114* 101 | 103 *129* 105|
| HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105|
| FHD video rec. 60fps | 100  100  100 | n/a  n/a  n/a | 100  100  103|
| Camera snapshot      | 100  100  100 | n/a  n/a  n/a | 100  100  102|
-----------------------------------------------------------------------

HOK ... Honour Of Kings, Video game
FHD ... Full High Definition
fps ... frame per second
pwr ... power consumption

table values are in %


On Mon, 2022-08-29 at 07:54 +0200, Dietmar Eggemann wrote:
> Many of the Android devices still prefer to run PELT with a shorter
> halflife than the hardcoded value of 32ms in mainline.
> 
> The Android folks claim better response time of display pipeline
> tasks
> (higher min and avg fps for 60, 90 or 120Hz refresh rate). Some of
> the
> benchmarks like PCmark web-browsing show higher scores when running
> with 16ms or 8ms PELT halflife. The gain in response time and
> performance is considered to outweigh the increase of energy
> consumption in these cases.
> 
> The original idea of introducing a PELT halflife compile time option
> for 32, 16, 8ms from Patrick Bellasi in 2018
> 
> https://lkml.kernel.org/r/20180409165134.707-1-patrick.bellasi@arm.com
> wasn't integrated into mainline mainly because of breaking the PELT
> stability requirement (see (1) below).
> 
> We have been experimenting with a new idea from Morten Rasmussen to
> instead introduce an additional clock between task and pelt clock.
> This
> way the effect of a shorter PELT halflife of 8ms or 16ms can be
> achieved by left-shifting the elapsed time. This is similar to the
> use
> of time shifting of the pelt clock to achieve scale invariance in
> PELT.
> The implementation is from Vincent Donnefort with some minor
> modifications to align with current tip sched/core.
> 
> ---
> 
> Known potential issues:
> 
> (1) PELT stability requirement
> 
> PELT halflife has to be larger than or equal to the scheduling
> period.
> 
> The sched_period (sysctl_sched_latency) of a typical mobile device
> with
> 8 CPUs with the default logarithmical tuning is 24ms so only the 32
> ms
> PELT halflife met this. Shorter halflife of 16ms or even 8ms would
> break
> this.
> 
> It looks like that this problem might not exist anymore because of
> the
> PELT rewrite in 2015, i.e. with commit 9d89c257dfb9
> ("sched/fair: Rewrite runnable load and utilization average
> tracking").
> Since then sched entities (task & task groups) and cfs_rq's are
> independently maintained rather than each entity update maintains the
> cfs_rq at the same time.
> 
> This seems to mitigate the issue that the cfs_rq signal is not
> correct
> when there are not all runnable entities able ot do a self update
> during
> a PELT halflife window.
> 
> That said, I'm not entirely sure whether the entity-cfs_rq
> synchronization is the only issue behind this PELT stability
> requirement.
> 
> 
> (2) PELT utilization versus util_est (estimated utilization)
> 
> The PELT signal of a periodic task oscillates with higher peak
> amplitude
> when using smaller halflife. For a typical periodic task of the
> display
> pipeline with a runtime/period of 8ms/16ms the peak amplitude is at
> ~40
> for 32ms, at ~80 for 16ms and at ~160 for 8ms. Util_est stores the
> util_avg peak as util_est.enqueued per task.
> 
> With an additional exponential weighted moving average (ewma) to
> smooth
> task utilization decreases, util_est values of the runnable tasks are
> aggregated on the root cfs_rq.
> CPU and task utilization for CPU frequency selection and task
> placement
> is the max value out of util_est and util_avg. 
> I.e. because of how util_est is implemented higher CPU Operating
> Performance Points and more capable CPUs are already chosen when
> using
> smaller PELT halflife.
> 
> 
> (3) Wrong PELT history when switching PELT multiplier
> 
> The PELT history becomes stale the moment the PELT multiplier is
> changed
> during runtime. So all decisions based on PELT are skewed for the
> time
> interval to produce LOAD_MAX_AVG (the sum of the infinite geometric
> series) which value is ~345ms for halflife=32ms (smaller for 8ms or
> 16ms).
> 
> Rate limiting the PELT multiplier change to this value is not solving
> the issue here. So the user would have to live with possible
> incorrect
> discussions during these PELT multiplier transition times.
> 
> ---
> 
> It looks like that individual task boosting e.g. via uclamp_min,
> possibly abstracted by middleware frameworks like Android Dynamic
> Performance Framework (ADPF) would be the way to go here but until
> this
> is fully available and adopted some Android folks will still prefer
> the
> overall system boosting they achieve by running with a shorter PELT
> halflife.
> 
> Vincent Donnefort (1):
>   sched/pelt: Introduce PELT multiplier
> 
>  kernel/sched/core.c  |  2 +-
>  kernel/sched/pelt.c  | 60
> ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/pelt.h  | 42 ++++++++++++++++++++++++++++---
>  kernel/sched/sched.h |  1 +
>  4 files changed, 100 insertions(+), 5 deletions(-)
> 



* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-20 14:07 ` [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Jian-Min Liu
@ 2022-09-28 17:09   ` Dietmar Eggemann
  2022-09-29  9:47   ` Peter Zijlstra
  1 sibling, 0 replies; 61+ messages in thread
From: Dietmar Eggemann @ 2022-09-28 17:09 UTC (permalink / raw)
  To: Jian-Min Liu, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort
  Cc: Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen, Kajetan Puchalski

+ Kajetan Puchalski <kajetan.puchalski@arm.com>

On 20/09/2022 16:07, Jian-Min Liu wrote:
> 
> Update some test data in android phone to support switching PELT HL 
> is helpful functionality.
> 
> We switch runtime PELT HL during runtime by difference scenario e.g.
> pelt8 in playing game, pelt32 in camera video. Support runntime
> switching PELT HL is flexible for different workloads.
> 
> the below table show performance & power data points: 
> 
> -----------------------------------------------------------------------
> |                      |                 PELT halflife                |
> |                      |----------------------------------------------|
> |                      |       32      |       16      |       8      |
> |                      |----------------------------------------------|
> |                      | avg  min  avg | avg  min  avg | avg  min  avg|
> | Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr|
> |---------------------------------------------------------------------|
> | HOK game 60fps       | 100  100  100 | 105 *134* 102 | 104 *152* 106|
> | HOK game 90fps       | 100  100  100 | 101 *114* 101 | 103 *129* 105|
> | HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105|
> | FHD video rec. 60fps | 100  100  100 | n/a  n/a  n/a | 100  100  103|
> | Camera snapshot      | 100  100  100 | n/a  n/a  n/a | 100  100  102|
> -----------------------------------------------------------------------
> 
> HOK ... Honour Of Kings, Video game
> FHD ... Full High Definition
> fps ... frame per second
> pwr ... power consumption
> 
> table values are in %

I assume that you are specifically interested in those higher min fps
numbers which can be achieved with a tolerable energy consumption
increase when running the game with 16ms or even 8ms PELT halflife.

We see a similar effect when running the UI performance benchmark
Jankbench.

So you need this runtime-switchable PELT multiplier. Would this sched
feature interface:

https://lkml.kernel.org/r/YwyOzgbbUbB+JmSH@hirez.programming.kicks-ass.net

be sufficient for you? People don't like to support `changing PELT
halflife` via an official sysctl.

[...]


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-20 14:07 ` [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Jian-Min Liu
  2022-09-28 17:09   ` Dietmar Eggemann
@ 2022-09-29  9:47   ` Peter Zijlstra
  2022-09-29 11:07     ` Dietmar Eggemann
                       ` (2 more replies)
  1 sibling, 3 replies; 61+ messages in thread
From: Peter Zijlstra @ 2022-09-29  9:47 UTC (permalink / raw)
  To: Jian-Min Liu
  Cc: Dietmar Eggemann, Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Vincent Donnefort, Quentin Perret, Patrick Bellasi,
	Abhijeet Dharmapurikar, Qais Yousef, linux-kernel,
	Jonathan JMChen


Because it messes up the order in which people normally read text.
Why is top-posting such a bad thing?
Top-posting.
What is the most annoying thing in e-mail?

On Tue, Sep 20, 2022 at 10:07:59PM +0800, Jian-Min Liu wrote:
> 
> Here is some updated test data from an Android phone showing that
> switching the PELT HL at runtime is helpful functionality.
> 
> We switch the PELT HL at runtime depending on the scenario, e.g.
> pelt8 when playing a game, pelt32 for camera video. Supporting
> runtime switching of the PELT HL gives flexibility for different
> workloads.
> 
> The table below shows performance & power data points:
> 
> -----------------------------------------------------------------------
> |                      | PELT halflife                                |
> |                      |----------------------------------------------|
> |                      |       32      |       16      |       8      |
> |                      |----------------------------------------------|
> |                      | avg  min  avg | avg  min  avg | avg  min  avg|
> | Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr|
> |---------------------------------------------------------------------|
> | HOK game 60fps       | 100  100  100 | 105 *134* 102 | 104 *152* 106|
> | HOK game 90fps       | 100  100  100 | 101 *114* 101 | 103 *129* 105|
> | HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105|

You have your min and avg fps columns mixed up, your min cannot be larger
than avg.

Also, with min fps mostly above the actual screen fps, who cares. And
seriously 120fps on a phone !?!? for worse power usage! you gotta be
kidding me.

And I googled this game; it is some top-down tactical thing with
real-time combat (as opposed to turn-based) (DOTA like I suppose),
60 fps locked should be plenty fine.

> | FHD video rec. 60fps | 100  100  100 | n/a  n/a  n/a | 100  100  103|
> | Camera snapshot      | 100  100  100 | n/a  n/a  n/a | 100  100  102|

Mostly I think you've demonstrated that none of this is worth it.

> -----------------------------------------------------------------------
> 
> HOK ... Honour Of Kings, Video game
> FHD ... Full High Definition
> fps ... frame per second
> pwr ... power consumption
> 
> table values are in %

Oh... that's bloody insane; that's why none of it makes sense.


How is any of that an answer to:

  "They want; I want an explanation of what exact problem is fixed how ;-)"

This is just random numbers showing poking the number has some effect;
it has zero explanation of why poking the number changes the workload
and if that is in fact the right way to go about solving that particular
issue.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-29  9:47   ` Peter Zijlstra
@ 2022-09-29 11:07     ` Dietmar Eggemann
  2022-09-29 11:10     ` Kajetan Puchalski
  2022-11-07  9:41     ` Jian-Min Liu (劉建旻)
  2 siblings, 0 replies; 61+ messages in thread
From: Dietmar Eggemann @ 2022-09-29 11:07 UTC (permalink / raw)
  To: Peter Zijlstra, Jian-Min Liu
  Cc: Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Vincent Donnefort, Quentin Perret, Patrick Bellasi,
	Abhijeet Dharmapurikar, Qais Yousef, linux-kernel,
	Jonathan JMChen

On 29/09/2022 11:47, Peter Zijlstra wrote:

[...]

>> -----------------------------------------------------------------------
>> |                      | PELT halflife                                |
>> |                      |----------------------------------------------|
>> |                      |       32      |       16      |       8      |
>> |                      |----------------------------------------------|
>> |                      | avg  min  avg | avg  min  avg | avg  min  avg|
>> | Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr|
>> |---------------------------------------------------------------------|
>> | HOK game 60fps       | 100  100  100 | 105 *134* 102 | 104 *152* 106|
>> | HOK game 90fps       | 100  100  100 | 101 *114* 101 | 103 *129* 105|
>> | HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105|
> 
> You have your min and avg fps columns mixed up, your min cannot be larger
> than avg.
> 
> Also, with min fps mostly above the actual screen fps, who cares. And
> seriously 120fps on a phone !?!? for worse power usage! you gotta be
> kidding me.

I agree that, since we don't know what 100% at 32 means, it's unclear what
problem actually gets solved here by running with 16 or 8.

> And I googled this game; it is some top-down tactical thing with
> real-time combat (as opposed to turn-based) (DOTA like I suppose),
> 60 fps locked should be plenty fine.
> 
>> | FHD video rec. 60fps | 100  100  100 | n/a  n/a  n/a | 100  100  103|
>> | Camera snapshot      | 100  100  100 | n/a  n/a  n/a | 100  100  102|
> 
> Mostly I think you've demonstrated that none of this is worth it.

I assume Jian-Min added those two lines to demonstrate that they would
need the run-time switch.

>> -----------------------------------------------------------------------
>>
>> HOK ... Honour Of Kings, Video game
>> FHD ... Full High Definition
>> fps ... frame per second
>> pwr ... power consumption
>>
>> table values are in %
> 
> Oh... that's bloody insane; that's why none of it makes sense.
> 
> 
> How is any of that an answer to:
> 
>   "They want; I want an explanation of what exact problem is fixed how ;-)"
> 
> This is just random numbers showing poking the number has some effect;
> it has zero explanation of why poking the number changes the workload
> and if that is in fact the right way to go about solving that particular
> issue.

Jian-Min, would you be able to show real numbers in comparison to the
chosen fps here, and explain what the problem is that gets solved? What
is the effect of these higher min fps values when running with 16 or 8? And
why is the default 32 not sufficient here?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-29  9:47   ` Peter Zijlstra
  2022-09-29 11:07     ` Dietmar Eggemann
@ 2022-09-29 11:10     ` Kajetan Puchalski
  2022-09-29 11:21       ` Peter Zijlstra
  2022-11-07  9:41     ` Jian-Min Liu (劉建旻)
  2 siblings, 1 reply; 61+ messages in thread
From: Kajetan Puchalski @ 2022-09-29 11:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Thu, Sep 29, 2022 at 11:47:23AM +0200, Peter Zijlstra wrote:

[...]

> Mostly I think you've demonstrated that none of this is worth it.
> 
> > -----------------------------------------------------------------------
> > 
> > HOK ... Honour Of Kings, Video game
> > FHD ... Full High Definition
> > fps ... frame per second
> > pwr ... power consumption
> > 
> > table values are in %
> 
> Oh... that's bloody insane; that's why none of it makes sense.

Hi,

We have seen similar results to the ones provided by MTK while running
Jankbench, a UI performance benchmark.

For the following tables, the pelt numbers refer to multiplier values so
pelt_1 -> 32ms, pelt_2 -> 16ms, pelt_4 -> 8ms.

We can see the max frame durations decreasing significantly in line with
changing the pelt multiplier. Having a faster-responding pelt lets us
improve the worst-case scenario by a large margin, which is why it can be
useful in some cases where that worst-case scenario is important.

Max frame duration (ms)

+------------------+----------+
| kernel           |   value  |
|------------------+----------|
| pelt_1           | 157.426  |
| pelt_2           | 111.975  |
| pelt_4           | 85.2713  |
+------------------+----------+

However, it is accompanied by a very noticeable increase in power usage.
We have seen even bigger power usage increases for different workloads.
This is why we think it makes much more sense as something that can be
changed at runtime - if set at boot time the energy consumption increase
would nullify any of the potential benefits. For limited workloads or
scenarios, the tradeoff might be worth it.

Power usage [mW]

+------------------+---------+-------------+
| kernel           |   value | perc_diff   |
|------------------+---------+-------------|
| pelt_1           |   139.9 | 0.0%        |
| pelt_2           |   146.4 | 4.62%       |
| pelt_4           |   158.5 | 13.25%      |
+------------------+---------+-------------+

At the same time we see that the average case can improve slightly as
well in the process, and the consistency either doesn't get worse or
improves a bit too.

Mean frame duration (ms)

+---------------+------------------+---------+-------------+
| variable      | kernel           |   value | perc_diff   |
|---------------+------------------+---------+-------------|
| mean_duration | pelt_1           |    14.6 | 0.0%        |
| mean_duration | pelt_2           |    13.8 | -5.43%      |
| mean_duration | pelt_4           |    14.5 | -0.58%      |
+---------------+------------------+---------+-------------+

Jank percentage

+------------+------------------+---------+-------------+
| variable   | kernel           |   value | perc_diff   |
|------------+------------------+---------+-------------|
| jank_perc  | pelt_1           |     2.1 | 0.0%        |
| jank_perc  | pelt_2           |     2.1 | 0.11%       |
| jank_perc  | pelt_4           |     2   | -3.46%      |
+------------+------------------+---------+-------------+

> How is any of that an answer to:
>
>   "They want; I want an explanation of what exact problem is fixed how ;-)"
>
> This is just random numbers showing poking the number has some effect;
> it has zero explanation of why poking the number changes the workload
> and if that is in fact the right way to go about solving that particular
> issue.

Overall, the problem being solved here is that based on our testing the
PELT half life can occasionally be too slow to keep up in scenarios
where many frames need to be rendered quickly, especially on high-refresh
rate phones and similar devices. While it's not a problem most of the
time and so it doesn't warrant changing the default or having it set at
boot time, introducing this pelt multiplier would be very useful as a
tool to be able to avoid the worst-case in limited scenarios.

----
Kajetan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-29 11:10     ` Kajetan Puchalski
@ 2022-09-29 11:21       ` Peter Zijlstra
  2022-09-29 14:41         ` Kajetan Puchalski
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2022-09-29 11:21 UTC (permalink / raw)
  To: Kajetan Puchalski
  Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:

> Overall, the problem being solved here is that based on our testing the
> PELT half life can occasionally be too slow to keep up in scenarios
> where many frames need to be rendered quickly, especially on high-refresh
> rate phones and similar devices.

But is it a problem of DVFS not ramping up quickly enough, or of the
load-balancer not reacting to the increase in load? Or which aspect
controlled by PELT is responsible for the improvement seen?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-29 11:21       ` Peter Zijlstra
@ 2022-09-29 14:41         ` Kajetan Puchalski
  2022-10-03 22:57           ` Wei Wang
  2022-11-07 13:41           ` Peter Zijlstra
  0 siblings, 2 replies; 61+ messages in thread
From: Kajetan Puchalski @ 2022-09-29 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:
> 
> > Overall, the problem being solved here is that based on our testing the
> > PELT half life can occasionally be too slow to keep up in scenarios
> > where many frames need to be rendered quickly, especially on high-refresh
> > rate phones and similar devices.
> 
> But is it a problem of DVFS not ramping up quickly enough, or of the
> load-balancer not reacting to the increase in load? Or which aspect
> controlled by PELT is responsible for the improvement seen?

Based on all the tests we've seen, jankbench or otherwise, the
improvement can mainly be attributed to the faster ramp up of frequency
caused by the shorter PELT window while using schedutil. Alongside that
the signals rising faster also mean that the task would get migrated
faster to bigger CPUs on big.LITTLE systems which improves things too
but it's mostly the frequency aspect of it.
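
(As a rough illustration of that ramp-up effect, here is a hedged userspace
sketch, not kernel code: it uses the closed-form approximation
util(t) ~= 1024 * (1 - 2^(-t/halflife)) for a task that starts with no
history and then runs continuously, ignoring the discrete ~1ms PELT
periods, scale invariance and util_est. The halflife is taken as 32ms
divided by the multiplier, i.e. pelt_1/pelt_2/pelt_4 -> 32/16/8ms.)

/* pelt_ramp.c - hedged sketch of PELT-like ramp-up vs halflife */
#include <math.h>
#include <stdio.h>

/* approximate util of a continuously running task with no prior history */
static double util(double t_ms, double halflife_ms)
{
	return 1024.0 * (1.0 - pow(2.0, -t_ms / halflife_ms));
}

int main(void)
{
	const double hl[] = { 32.0, 16.0, 8.0 };	/* pelt_1, pelt_2, pelt_4 */

	for (int i = 0; i < 3; i++)
		printf("halflife %2.0fms: util(8ms)=%3.0f util(16ms)=%3.0f util(32ms)=%3.0f\n",
		       hl[i], util(8.0, hl[i]), util(16.0, hl[i]),
		       util(32.0, hl[i]));
	return 0;
}

(With the shorter halflives the signal reaches a given utilization - and
hence a given schedutil frequency request - in proportionally less time,
which is the "faster ramp up of frequency" described above.)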

To establish that this benchmark is sensitive to frequency I ran some
tests using the 'performance' cpufreq governor.

Max frame duration (ms)

+------------------+-------------+----------+
| kernel           |   iteration |    value |
|------------------+-------------+----------|
| pelt_1           |          10 | 157.426  |
| pelt_4           |          10 |  85.2713 |
| performance      |          10 |  40.9308 |
+------------------+-------------+----------+

Mean frame duration (ms)

+---------------+------------------+---------+-------------+
| variable      | kernel           |   value | perc_diff   |
|---------------+------------------+---------+-------------|
| mean_duration | pelt_1           |    14.6 | 0.0%        |
| mean_duration | pelt_4           |    14.5 | -0.58%      |
| mean_duration | performance      |     4.4 | -69.75%     |
+---------------+------------------+---------+-------------+

Jank percentage

+------------+------------------+---------+-------------+
| variable   | kernel           |   value | perc_diff   |
|------------+------------------+---------+-------------|
| jank_perc  | pelt_1           |     2.1 | 0.0%        |
| jank_perc  | pelt_4           |     2   | -3.46%      |
| jank_perc  | performance      |     0.1 | -97.25%     |
+------------+------------------+---------+-------------+

As you can see, bumping up frequency can hugely improve the results
here. This is what's happening when we decrease the PELT window, just on
a much smaller and not as drastic scale. It also explains specifically
where the increased power usage is coming from.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-29 14:41         ` Kajetan Puchalski
@ 2022-10-03 22:57           ` Wei Wang
  2022-10-04  9:33             ` Dietmar Eggemann
  2022-11-07 13:41           ` Peter Zijlstra
  1 sibling, 1 reply; 61+ messages in thread
From: Wei Wang @ 2022-10-03 22:57 UTC (permalink / raw)
  To: Kajetan Puchalski
  Cc: Peter Zijlstra, Jian-Min Liu, Dietmar Eggemann, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen,
	Chung-Kai (Michael) Mei

We have some data on an earlier build of Pixel 6a, which also runs a
slightly modified "sched" governor. The tuning definitely has both
performance and power impact on UX. With some additional user space
hints such as ADPF (Android Dynamic Performance Framework) and/or the
old-fashioned INTERACTION power hint, different trade-offs can be
achieved with this sort of tuning.


+---------------------------------------------------------+----------+----------+
|                         Metrics                         |   32ms   |   8ms    |
+---------------------------------------------------------+----------+----------+
| Sum of gfxinfo_com.android.test.uibench_deadline_missed |   185.00 |   112.00 |
| Sum of SFSTATS_GLOBAL_MISSEDFRAMES                      |    62.00 |    49.00 |
| CPU Power                                               | 6,204.00 | 7,040.00 |
| Sum of Gfxinfo.frame.95th                               |   582.00 |   506.00 |
| Avg of Gfxinfo.frame.95th                               |    18.19 |    15.81 |
+---------------------------------------------------------+----------+----------+





On Thu, Sep 29, 2022 at 11:59 PM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:
> >
> > > Overall, the problem being solved here is that based on our testing the
> > > PELT half life can occasionally be too slow to keep up in scenarios
> > > where many frames need to be rendered quickly, especially on high-refresh
> > > rate phones and similar devices.
> >
> > But is it a problem of DVFS not ramping up quickly enough, or of the
> > load-balancer not reacting to the increase in load? Or which aspect
> > controlled by PELT is responsible for the improvement seen?
>
> Based on all the tests we've seen, jankbench or otherwise, the
> improvement can mainly be attributed to the faster ramp up of frequency
> caused by the shorter PELT window while using schedutil. Alongside that
> the signals rising faster also mean that the task would get migrated
> faster to bigger CPUs on big.LITTLE systems which improves things too
> but it's mostly the frequency aspect of it.
>
> To establish that this benchmark is sensitive to frequency I ran some
> tests using the 'performance' cpufreq governor.
>
> Max frame duration (ms)
>
> +------------------+-------------+----------+
> | kernel           |   iteration |    value |
> |------------------+-------------+----------|
> | pelt_1           |          10 | 157.426  |
> | pelt_4           |          10 |  85.2713 |
> | performance      |          10 |  40.9308 |
> +------------------+-------------+----------+
>
> Mean frame duration (ms)
>
> +---------------+------------------+---------+-------------+
> | variable      | kernel           |   value | perc_diff   |
> |---------------+------------------+---------+-------------|
> | mean_duration | pelt_1           |    14.6 | 0.0%        |
> | mean_duration | pelt_4           |    14.5 | -0.58%      |
> | mean_duration | performance      |     4.4 | -69.75%     |
> +---------------+------------------+---------+-------------+
>
> Jank percentage
>
> +------------+------------------+---------+-------------+
> | variable   | kernel           |   value | perc_diff   |
> |------------+------------------+---------+-------------|
> | jank_perc  | pelt_1           |     2.1 | 0.0%        |
> | jank_perc  | pelt_4           |     2   | -3.46%      |
> | jank_perc  | performance      |     0.1 | -97.25%     |
> +------------+------------------+---------+-------------+
>
> As you can see, bumping up frequency can hugely improve the results
> here. This is what's happening when we decrease the PELT window, just on
> a much smaller and not as drastic scale. It also explains specifically
> where the increased power usage is coming from.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-10-03 22:57           ` Wei Wang
@ 2022-10-04  9:33             ` Dietmar Eggemann
  2022-10-05 16:57               ` Wei Wang
  0 siblings, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2022-10-04  9:33 UTC (permalink / raw)
  To: Wei Wang, Kajetan Puchalski
  Cc: Peter Zijlstra, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen, Chung-Kai (Michael) Mei

Hi Wei,

On 04/10/2022 00:57, Wei Wang wrote:

Please don't do top-posting.

> We have some data on an earlier build of Pixel 6a, which also runs a
> slightly modified "sched" governor. The tuning definitely has both
> performance and power impact on UX. With some additional user space
> hints such as ADPF (Android Dynamic Performance Framework) and/or the
> old-fashioned INTERACTION power hint, different trade-offs can be
> achieved with this sort of tuning.
> 
> 
> +---------------------------------------------------------+----------+----------+
> |                         Metrics                         |   32ms   |   8ms    |
> +---------------------------------------------------------+----------+----------+
> | Sum of gfxinfo_com.android.test.uibench_deadline_missed |   185.00 |   112.00 |
> | Sum of SFSTATS_GLOBAL_MISSEDFRAMES                      |    62.00 |    49.00 |
> | CPU Power                                               | 6,204.00 | 7,040.00 |
> | Sum of Gfxinfo.frame.95th                               |   582.00 |   506.00 |
> | Avg of Gfxinfo.frame.95th                               |    18.19 |    15.81 |
> +---------------------------------------------------------+----------+----------+

Which App is package `gfxinfo_com.android.test`? Is this UIBench? Never
ran it.

I'm familiar with `dumpsys gfxinfo <PACKAGE_NAME>`.

# adb shell dumpsys gfxinfo <PACKAGE_NAME>

...
** Graphics info for pid XXXX [<PACKAGE_NAME>] **
...
95th percentile: XXms            <-- (a)
...
Number Frame deadline missed: XX <-- (b)
...


I assume that `Gfxinfo.frame.95th` is related to (a) and
`gfxinfo_com.android.test.uibench_deadline_missed` to (b)? Not sure
where `SFSTATS_GLOBAL_MISSEDFRAMES` is coming from?

What's the Sum here? Is it that you ran the test 32 times (582/18.19 = 32)?

[...]

> On Thu, Sep 29, 2022 at 11:59 PM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
>>
>> On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote:
>>> On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:
>>>
>>>> Overall, the problem being solved here is that based on our testing the
>>>> PELT half life can occasionally be too slow to keep up in scenarios
>>>> where many frames need to be rendered quickly, especially on high-refresh
>>>> rate phones and similar devices.
>>>
>>> But is it a problem of DVFS not ramping up quickly enough, or of the
>>> load-balancer not reacting to the increase in load? Or which aspect
>>> controlled by PELT is responsible for the improvement seen?
>>
>> Based on all the tests we've seen, jankbench or otherwise, the
>> improvement can mainly be attributed to the faster ramp up of frequency
>> caused by the shorter PELT window while using schedutil. Alongside that
>> the signals rising faster also mean that the task would get migrated
>> faster to bigger CPUs on big.LITTLE systems which improves things too
>> but it's mostly the frequency aspect of it.
>>
>> To establish that this benchmark is sensitive to frequency I ran some
>> tests using the 'performance' cpufreq governor.
>>
>> Max frame duration (ms)
>>
>> +------------------+-------------+----------+
>> | kernel           |   iteration |    value |
>> |------------------+-------------+----------|
>> | pelt_1           |          10 | 157.426  |
>> | pelt_4           |          10 |  85.2713 |
>> | performance      |          10 |  40.9308 |
>> +------------------+-------------+----------+
>>
>> Mean frame duration (ms)
>>
>> +---------------+------------------+---------+-------------+
>> | variable      | kernel           |   value | perc_diff   |
>> |---------------+------------------+---------+-------------|
>> | mean_duration | pelt_1           |    14.6 | 0.0%        |
>> | mean_duration | pelt_4           |    14.5 | -0.58%      |
>> | mean_duration | performance      |     4.4 | -69.75%     |
>> +---------------+------------------+---------+-------------+
>>
>> Jank percentage
>>
>> +------------+------------------+---------+-------------+
>> | variable   | kernel           |   value | perc_diff   |
>> |------------+------------------+---------+-------------|
>> | jank_perc  | pelt_1           |     2.1 | 0.0%        |
>> | jank_perc  | pelt_4           |     2   | -3.46%      |
>> | jank_perc  | performance      |     0.1 | -97.25%     |
>> +------------+------------------+---------+-------------+
>>
>> As you can see, bumping up frequency can hugely improve the results
>> here. This is what's happening when we decrease the PELT window, just on
>> a much smaller and not as drastic scale. It also explains specifically
>> where the increased power usage is coming from.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-10-04  9:33             ` Dietmar Eggemann
@ 2022-10-05 16:57               ` Wei Wang
  0 siblings, 0 replies; 61+ messages in thread
From: Wei Wang @ 2022-10-05 16:57 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Kajetan Puchalski, Peter Zijlstra, Jian-Min Liu, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen,
	Chung-Kai (Michael) Mei

On Tue, Oct 4, 2022 at 2:33 AM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> Hi Wei,
>
> On 04/10/2022 00:57, Wei Wang wrote:
>
> Please don't do top-posting.
>

Sorry, forgot this was posted to the list...

> > We have some data on an earlier build of Pixel 6a, which also runs a
> > slightly modified "sched" governor. The tuning definitely has both
> > performance and power impact on UX. With some additional user space
> > hints such as ADPF (Android Dynamic Performance Framework) and/or the
> > old-fashioned INTERACTION power hint, different trade-offs can be
> > achieved with this sort of tuning.
> >
> >
> > +---------------------------------------------------------+----------+----------+
> > |                         Metrics                         |   32ms   |   8ms    |
> > +---------------------------------------------------------+----------+----------+
> > | Sum of gfxinfo_com.android.test.uibench_deadline_missed |   185.00 |   112.00 |
> > | Sum of SFSTATS_GLOBAL_MISSEDFRAMES                      |    62.00 |    49.00 |
> > | CPU Power                                               | 6,204.00 | 7,040.00 |
> > | Sum of Gfxinfo.frame.95th                               |   582.00 |   506.00 |
> > | Avg of Gfxinfo.frame.95th                               |    18.19 |    15.81 |
> > +---------------------------------------------------------+----------+----------+
>
> Which App is package `gfxinfo_com.android.test`? Is this UIBench? Never
> ran it.
>

Yes.

> I'm familiar with `dumpsys gfxinfo <PACKAGE_NAME>`.
>
> # adb shell dumpsys gfxinfo <PACKAGE_NAME>
>
> ...
> ** Graphics info for pid XXXX [<PACKAGE_NAME>] **
> ...
> 95th percentile: XXms            <-- (a)
> ...
> Number Frame deadline missed: XX <-- (b)
> ...
>
>
> I assume that `Gfxinfo.frame.95th` is related to (a) and
> `gfxinfo_com.android.test.uibench_deadline_missed` to (b)? Not sure
> where `SFSTATS_GLOBAL_MISSEDFRAMES` is coming from?
>

(a) is correct; (b) is from surfaceflinger. The Android display pipeline
involves both (a) the app (generation) and (b) surfaceflinger
(presentation).

> What's the Sum here? Is it that you ran the test 32 times (582/18.19 = 32)?
>

UIBench [1] has several micro tests and it is the sum over those tests.


[1]: https://cs.android.com/android/platform/superproject/+/master:platform_testing/tests/microbenchmarks/uibench/src/com/android/uibench/microbenchmark/


> [...]
>
> > On Thu, Sep 29, 2022 at 11:59 PM Kajetan Puchalski
> > <kajetan.puchalski@arm.com> wrote:
> >>
> >> On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote:
> >>> On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:
> >>>
> >>>> Overall, the problem being solved here is that based on our testing the
> >>>> PELT half life can occasionally be too slow to keep up in scenarios
> >>>> where many frames need to be rendered quickly, especially on high-refresh
> >>>> rate phones and similar devices.
> >>>
> >>> But is it a problem of DVFS not ramping up quickly enough, or of the
> >>> load-balancer not reacting to the increase in load? Or which aspect
> >>> controlled by PELT is responsible for the improvement seen?
> >>
> >> Based on all the tests we've seen, jankbench or otherwise, the
> >> improvement can mainly be attributed to the faster ramp up of frequency
> >> caused by the shorter PELT window while using schedutil. Alongside that
> >> the signals rising faster also mean that the task would get migrated
> >> faster to bigger CPUs on big.LITTLE systems which improves things too
> >> but it's mostly the frequency aspect of it.
> >>
> >> To establish that this benchmark is sensitive to frequency I ran some
> >> tests using the 'performance' cpufreq governor.
> >>
> >> Max frame duration (ms)
> >>
> >> +------------------+-------------+----------+
> >> | kernel           |   iteration |    value |
> >> |------------------+-------------+----------|
> >> | pelt_1           |          10 | 157.426  |
> >> | pelt_4           |          10 |  85.2713 |
> >> | performance      |          10 |  40.9308 |
> >> +------------------+-------------+----------+
> >>
> >> Mean frame duration (ms)
> >>
> >> +---------------+------------------+---------+-------------+
> >> | variable      | kernel           |   value | perc_diff   |
> >> |---------------+------------------+---------+-------------|
> >> | mean_duration | pelt_1           |    14.6 | 0.0%        |
> >> | mean_duration | pelt_4           |    14.5 | -0.58%      |
> >> | mean_duration | performance      |     4.4 | -69.75%     |
> >> +---------------+------------------+---------+-------------+
> >>
> >> Jank percentage
> >>
> >> +------------+------------------+---------+-------------+
> >> | variable   | kernel           |   value | perc_diff   |
> >> |------------+------------------+---------+-------------|
> >> | jank_perc  | pelt_1           |     2.1 | 0.0%        |
> >> | jank_perc  | pelt_4           |     2   | -3.46%      |
> >> | jank_perc  | performance      |     0.1 | -97.25%     |
> >> +------------+------------------+---------+-------------+
> >>
> >> As you can see, bumping up frequency can hugely improve the results
> >> here. This is what's happening when we decrease the PELT window, just on
> >> a much smaller and not as drastic scale. It also explains specifically
> >> where the increased power usage is coming from.
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-29  9:47   ` Peter Zijlstra
  2022-09-29 11:07     ` Dietmar Eggemann
  2022-09-29 11:10     ` Kajetan Puchalski
@ 2022-11-07  9:41     ` Jian-Min Liu (劉建旻)
  2 siblings, 0 replies; 61+ messages in thread
From: Jian-Min Liu (劉建旻) @ 2022-11-07  9:41 UTC (permalink / raw)
  To: peterz
  Cc: dietmar.eggemann, mingo, linux-kernel, patrick.bellasi, adharmap,
	qais.yousef, Jonathan JMChen (陳家明),
	vdonnefort, qperret, morten.rasmussen, vincent.guittot

Hi, 

On Thu, 2022-09-29 at 11:47 +0200, Peter Zijlstra wrote:
> Because it messes up the order in which people normally read text.
> Why is top-posting such a bad thing?
> Top-posting.
> What is the most annoying thing in e-mail?
> 

Sorry for top-posting... 

> On Tue, Sep 20, 2022 at 10:07:59PM +0800, Jian-Min Liu wrote:
> > 
> > Here is some updated test data from an Android phone showing that
> > switching the PELT HL at runtime is helpful functionality.
> > 
> > We switch the PELT HL at runtime depending on the scenario, e.g.
> > pelt8 when playing a game, pelt32 for camera video. Supporting
> > runtime switching of the PELT HL gives flexibility for different
> > workloads.
> > 
> > The table below shows performance & power data points:
> > 
> > -----------------------------------------------------------------------
> > |                      | PELT halflife                                |
> > |                      |----------------------------------------------|
> > |                      |       32      |       16      |       8      |
> > |                      |----------------------------------------------|
> > |                      | avg  min  avg | avg  min  avg | avg  min  avg|
> > | Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr|
> > |---------------------------------------------------------------------|
> > | HOK game 60fps       | 100  100  100 | 105 *134* 102 | 104 *152* 106|
> > | HOK game 90fps       | 100  100  100 | 101 *114* 101 | 103 *129* 105|
> > | HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105|
> 
> You have your min and avg fps columns mixed up, your min cannot be larger
> than avg.
> 
> Also, with min fps mostly above the actual screen fps, who cares. And
> seriously 120fps on a phone !?!? for worse power usage! you gotta be
> kidding me.
> 
> And I googled this game; it is some top-down tactical thing with
> real-time combat (as opposed to turn-based) (DOTA like I suppose),
> 60 fps locked should be plenty fine.
> 
> > | FHD video rec. 60fps | 100  100  100 | n/a  n/a  n/a | 100  100  103|
> > | Camera snapshot      | 100  100  100 | n/a  n/a  n/a | 100  100  102|
> 
> Mostly I think you've demonstrated that none of this is worth it.
> 
> > -----------------------------------------------------------------------
> > 
> > HOK ... Honour Of Kings, Video game
> > FHD ... Full High Definition
> > fps ... frame per second
> > pwr ... power consumption
> > 
> > table values are in %
> 
> Oh... that's bloody insane; that's why none of it makes sense.
> 
> 
> How is any of that an answer to:
> 
>   "They want; I want an explanation of what exact problem is fixed
> how ;-)"
> 
> This is just random numbers showing poking the number has some effect;
> it has zero explanation of why poking the number changes the workload
> and if that is in fact the right way to go about solving that particular
> issue.


Sorry that the data wasn't easy to understand. I'll try again with
absolute FPS numbers and some additional explanation, as well as a
summary of why we need to have the PELT halflife tunable at runtime.

HOK* 60FPS

+-------+-----------------------------------------+
|       |     avg. FPS   |     min. FPS  | power  |
+-------+--------+-------+-------+-------+--------+
|kernel | value  |diff(%)| value |diff(%)| diff(%)|
+-------+--------+-------+-------+-------+--------+
|pelt_1 | 54.1   | 0.0%  | 21.8  |  0.0% |  0.0%  |
+-------+--------+-------+-------+-------+--------+
|pelt_2 | 56.9   | 5.2%  | 29.2  | 34.0% |  2.2%  |
+-------+--------+-------+-------+-------+--------+
|pelt_4 | 56.6   | 4.5%  | 33.2  | 52.4% |  6.3%  |
+-------+--------+-------+-------+-------+--------+

*Honour Of Kings, video game

Test methodology:

We choose 60FPS in the game setup. Android's systrace (similar to
ftrace) then provides the real FPS from which we take the average and
minimum value.

Sorry, but we can't share absolute numbers for power from our test
device since this is still considered sensitive information.


FHD 60fps video recording

+-------+-----------------------------------------+
|       |    avg. FPS    |    min. FPS   | power  |
+-------+--------+-------+-------+-------+--------+
|kernel | value  |diff(%)| value |diff(%)| diff(%)|
+-------+--------+-------+-------+-------+--------+
|pelt_1 | 60.0   | 0.0%  | 60.0  |  0.0% |  0.0%  |
+-------+--------+-------+-------+-------+--------+
|pelt_4 | 60.0   | 0.0%  | 60.0  |  0.0% |  2.1%  |
+-------+--------+-------+-------+-------+--------+


To summarize, we need a smaller PELT halflife to reach higher avg. FPS
and min. FPS values for video gaming and achieve a smoother game-play
experience, even if it comes with slightly higher power consumption.
The improvement in min. FPS is especially important here to minimize
situations in which the game would otherwise stutter.
Since not all use cases profit from this behaviour (e.g. video
recording), the PELT halflife should be tunable at runtime.



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-29 14:41         ` Kajetan Puchalski
  2022-10-03 22:57           ` Wei Wang
@ 2022-11-07 13:41           ` Peter Zijlstra
  2022-11-08 19:48             ` Qais Yousef
                               ` (3 more replies)
  1 sibling, 4 replies; 61+ messages in thread
From: Peter Zijlstra @ 2022-11-07 13:41 UTC (permalink / raw)
  To: Kajetan Puchalski
  Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:

> Based on all the tests we've seen, jankbench or otherwise, the
> improvement can mainly be attributed to the faster ramp up of frequency
> caused by the shorter PELT window while using schedutil.

Would something terrible like the below help some?

If not, I suppose it could be modified to take the current state as
history. But basically it runs a faster pelt sum along side the regular
signal just for ramping up the frequency.

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..9ba07a1d19f6 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,6 +96,7 @@ SCHED_FEAT(WA_BIAS, true)
  */
 SCHED_FEAT(UTIL_EST, true)
 SCHED_FEAT(UTIL_EST_FASTUP, true)
+SCHED_FEAT(UTIL_EST_FASTER, true)
 
 SCHED_FEAT(LATENCY_WARN, false)
 
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 0f310768260c..13cd9e27ce3e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -148,6 +148,22 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	return periods;
 }
 
+/*
+ * Compute a pelt util_avg assuming no history and @delta runtime.
+ */
+unsigned long faster_est_approx(u64 delta)
+{
+	unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
+	u64 periods = delta / 1024;
+
+	if (periods) {
+		delta %= 1024;
+		contrib = __accumulate_pelt_segments(periods, 1024, delta);
+	}
+
+	return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
+}
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a4a20046e586..99827d5dda27 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2922,6 +2922,8 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
 	return READ_ONCE(rq->avg_dl.util_avg);
 }
 
+extern unsigned long faster_est_approx(u64 runtime);
+
 /**
  * cpu_util_cfs() - Estimates the amount of CPU capacity used by CFS tasks.
  * @cpu: the CPU to get the utilization for.
@@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
  */
 static inline unsigned long cpu_util_cfs(int cpu)
 {
+	struct rq *rq = cpu_rq(cpu);
 	struct cfs_rq *cfs_rq;
 	unsigned long util;
 
-	cfs_rq = &cpu_rq(cpu)->cfs;
+	cfs_rq = &rq->cfs;
 	util = READ_ONCE(cfs_rq->avg.util_avg);
 
 	if (sched_feat(UTIL_EST)) {
+		if (sched_feat(UTIL_EST_FASTER)) {
+			struct task_struct *curr;
+
+			rcu_read_lock();
+			curr = rcu_dereference(rq->curr);
+			if (likely(curr->sched_class == &fair_sched_class)) {
+				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
+				util = max_t(unsigned long, util,
+					     faster_est_approx(runtime * 2));
+			}
+			rcu_read_unlock();
+		}
 		util = max_t(unsigned long, util,
 			     READ_ONCE(cfs_rq->avg.util_est.enqueued));
 	}




^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-07 13:41           ` Peter Zijlstra
@ 2022-11-08 19:48             ` Qais Yousef
  2022-11-09 15:49               ` Peter Zijlstra
  2022-11-09 15:18             ` Lukasz Luba
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 61+ messages in thread
From: Qais Yousef @ 2022-11-08 19:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kajetan Puchalski, Jian-Min Liu, Dietmar Eggemann, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen

On 11/07/22 14:41, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> 
> > Based on all the tests we've seen, jankbench or otherwise, the
> > improvement can mainly be attributed to the faster ramp up of frequency
> > caused by the shorter PELT window while using schedutil.
> 
> Would something terrible like the below help some?
> 
> If not, I suppose it could be modified to take the current state as
> history. But basically it runs a faster pelt sum along side the regular
> signal just for ramping up the frequency.

A bit of a tangent, but this reminded me of this old patch:

	https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/

I think we have a bit too many moving cogs that might be creating undesired
compound effect.

Should we consider removing margins in favour of improving util ramp up/down?
(whether via util_est or pelt hf).

My only worry is lower-end devices; but it seems better to me to improve util
response and get rid of these magic numbers - if we can. Having the ability to
adjust them at runtime will help defer the trade-offs to sys admins.
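
(For context on the margins: as of kernels around this time, schedutil
applies roughly 25% headroom when mapping utilization to a frequency
request, i.e. f ~= 1.25 * f_max * util / capacity - presumably one of the
magic numbers meant above. Below is a hedged userspace sketch of that
arithmetic, approximating map_util_perf()/map_util_freq(); it is not the
actual kernel code.)

/* sugov_headroom.c - hedged sketch of the schedutil frequency headroom */
#include <stdio.h>

static unsigned long next_freq(unsigned long util, unsigned long cap,
			       unsigned long max_freq)
{
	util += util >> 2;		/* ~25% headroom, cf. map_util_perf() */
	if (util > cap)
		util = cap;
	return max_freq * util / cap;	/* cf. map_util_freq() */
}

int main(void)
{
	/* hypothetical CPU: capacity 1024, f_max 2 GHz (2000000 kHz) */
	printf("util 512 -> %lu kHz\n", next_freq(512, 1024, 2000000));
	printf("util 819 -> %lu kHz\n", next_freq(819, 1024, 2000000));
	return 0;
}

(Both the headroom and the util ramp rate end up pulling on the same
frequency request, which is the kind of compound effect mentioned above.)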


Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-07 13:41           ` Peter Zijlstra
  2022-11-08 19:48             ` Qais Yousef
@ 2022-11-09 15:18             ` Lukasz Luba
  2022-11-10 11:16             ` Dietmar Eggemann
  2022-11-10 12:45             ` Kajetan Puchalski
  3 siblings, 0 replies; 61+ messages in thread
From: Lukasz Luba @ 2022-11-09 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen, Kajetan Puchalski

Hi Peter,

On 11/7/22 13:41, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> 
>> Based on all the tests we've seen, jankbench or otherwise, the
>> improvement can mainly be attributed to the faster ramp up of frequency
>> caused by the shorter PELT window while using schedutil.
> 
> Would something terrible like the below help some?
> 
> If not, I suppose it could be modified to take the current state as
> history. But basically it runs a faster pelt sum along side the regular
> signal just for ramping up the frequency.

[snip]

> +
> +			rcu_read_lock();
> +			curr = rcu_dereference(rq->curr);
> +			if (likely(curr->sched_class == &fair_sched_class)) {
> +				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
> +				util = max_t(unsigned long, util,
> +					     faster_est_approx(runtime * 2));

That's a really nice hack :)

I wonder why we end up in such a situation on Android. Maybe there is
a different solution?
Maybe a shorter tick (and then also an aligned PELT half-life)?

The problem is mostly in those high-FPS phones. You know, we now have
phones with 144Hz displays and even games > 100FPS (which wasn't the
case a few years ago when we invested a lot of effort into this
PELT+EAS). We also have a lot faster CPUs (~2x in 3-4 years).

IMO those games (and the OS mechanisms assisting them) probably have
different needs if they do these 'soft-real-time simulations' at such
high granularity (~120/s -> every ~8ms).

IMO one old setting might not fit well here: the 4ms tick (which is the
Android use case), which then implies the scheduler min granularity,
which we also align with the y^32 PELT.
Is this a correct chain of thinking?

Would it make sense to ask Android phone vendors to experiment with a
1ms tick (w/ an aligned PELT HL)? With that change, there might be some
power spike issues, though.

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-08 19:48             ` Qais Yousef
@ 2022-11-09 15:49               ` Peter Zijlstra
  2022-11-10 13:25                 ` Qais Yousef
  2023-02-07 10:29                 ` Dietmar Eggemann
  0 siblings, 2 replies; 61+ messages in thread
From: Peter Zijlstra @ 2022-11-09 15:49 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Kajetan Puchalski, Jian-Min Liu, Dietmar Eggemann, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen

On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> On 11/07/22 14:41, Peter Zijlstra wrote:
> > On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> > 
> > > Based on all the tests we've seen, jankbench or otherwise, the
> > > improvement can mainly be attributed to the faster ramp up of frequency
> > > caused by the shorter PELT window while using schedutil.
> > 
> > Would something terrible like the below help some?
> > 
> > If not, I suppose it could be modified to take the current state as
> > history. But basically it runs a faster pelt sum along side the regular
> > signal just for ramping up the frequency.
> 
> A bit of a tangent, but this reminded me of this old patch:
> 
> 	https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/
> 
> I think we have a bit too many moving cogs that might be creating undesired
> compound effect.
> 
> Should we consider removing margins in favour of improving util ramp up/down?
> (whether via util_est or pelt hf).

Yeah, possibly.

So one thing that was key to that hack I proposed is that it is
per-task. This means we can either set or detect the task activation
period and use that to select an appropriate PELT multiplier.

But please explain; once tasks are in a steady state (60HZ, 90HZ or god
forbid higher), the utilization should be the same between the various
PELT window sizes, provided the activation period isn't *much* larger
than the window.
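
(For reference, a hedged back-of-the-envelope check of that point,
userspace only, ignoring scale invariance and the discrete ~1ms periods:
for a periodic task that runs r ms out of every p ms, the steady-state
signal oscillates between peak = 1024 * (1 - y^r) / (1 - y^p) and
trough = peak * y^(p-r), with y^x = 2^(-x/halflife). The midpoint stays
near r/p * 1024 for all halflives; only the size of the swing changes.)

/* pelt_steady.c - hedged sketch of the steady state of a periodic task */
#include <math.h>
#include <stdio.h>

static double decay(double ms, double halflife_ms)
{
	return pow(2.0, -ms / halflife_ms);
}

int main(void)
{
	const double r = 8.0, p = 16.0;			/* run / period in ms */
	const double hl[] = { 32.0, 16.0, 8.0 };

	for (int i = 0; i < 3; i++) {
		double yr = decay(r, hl[i]), yp = decay(p, hl[i]);
		double peak = 1024.0 * (1.0 - yr) / (1.0 - yp);
		double trough = peak * decay(p - r, hl[i]);

		printf("halflife %2.0fms: peak %3.0f trough %3.0f mid %3.0f\n",
		       hl[i], peak, trough, (peak + trough) / 2.0);
	}
	return 0;
}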

Are these things running a ton of single shot tasks or something daft
like that?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-07 13:41           ` Peter Zijlstra
  2022-11-08 19:48             ` Qais Yousef
  2022-11-09 15:18             ` Lukasz Luba
@ 2022-11-10 11:16             ` Dietmar Eggemann
  2022-11-10 13:05               ` Peter Zijlstra
  2022-11-10 12:45             ` Kajetan Puchalski
  3 siblings, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2022-11-10 11:16 UTC (permalink / raw)
  To: Peter Zijlstra, Kajetan Puchalski
  Cc: Jian-Min Liu, Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Vincent Donnefort, Quentin Perret, Patrick Bellasi,
	Abhijeet Dharmapurikar, Qais Yousef, linux-kernel,
	Jonathan JMChen

On 07/11/2022 14:41, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:

[...]

> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>   */
>  static inline unsigned long cpu_util_cfs(int cpu)
>  {
> +	struct rq *rq = cpu_rq(cpu);
>  	struct cfs_rq *cfs_rq;
>  	unsigned long util;
>  
> -	cfs_rq = &cpu_rq(cpu)->cfs;
> +	cfs_rq = &rq->cfs;
>  	util = READ_ONCE(cfs_rq->avg.util_avg);
>  
>  	if (sched_feat(UTIL_EST)) {
> +		if (sched_feat(UTIL_EST_FASTER)) {
> +			struct task_struct *curr;
> +
> +			rcu_read_lock();
> +			curr = rcu_dereference(rq->curr);
> +			if (likely(curr->sched_class == &fair_sched_class)) {
> +				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;

Don't we end up with gigantic runtime numbers here?

root@juno:~# cat /proc/1676/task/1676/schedstat
36946300 1150620 11
root@juno:~# cat /proc/1676/task/1676/sched
rt-app (1676, #threads: 2)
-------------------------------------------------------------------
se.exec_start                                :         77766.964240 <- !
se.vruntime                                  :           563.587883
se.sum_exec_runtime                          :            36.946300  <- !
se.nr_migrations                             :                    0
...

I expect cpu_util_cfs() to be ~1024 almost all the time now.

> +				util = max_t(unsigned long, util,
> +					     faster_est_approx(runtime * 2));
> +			}
> +			rcu_read_unlock();
> +		}
>  		util = max_t(unsigned long, util,
>  			     READ_ONCE(cfs_rq->avg.util_est.enqueued));
>  	}

[...]


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-07 13:41           ` Peter Zijlstra
                               ` (2 preceding siblings ...)
  2022-11-10 11:16             ` Dietmar Eggemann
@ 2022-11-10 12:45             ` Kajetan Puchalski
  3 siblings, 0 replies; 61+ messages in thread
From: Kajetan Puchalski @ 2022-11-10 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

Hi,

> Would something terrible like the below help some?
> 
> If not, I suppose it could be modified to take the current state as
> history. But basically it runs a faster pelt sum along side the regular
> signal just for ramping up the frequency.

As Dietmar mentioned in the other email, there seems to be an issue with
how the patch computes 'runtime'. Nevertheless I tested it just to see
what would happen so here are the results if you're interested.

Here's a comparison of Jankbench results on a normal system vs pelt_4 vs
performance cpufreq governor vs your pelt_rampup patch.

Max frame duration (ms)

+-----------------------+-----------+------------+
|        kernel         | iteration |   value    |
+-----------------------+-----------+------------+
|       menu            |    10     | 142.973401 |
|   menu_pelt_4         |    10     | 85.271279  |
|   menu_pelt_rampup    |    10     | 61.494636  |
|   menu_performance    |    10     | 40.930829  |
+-----------------------+-----------+------------+

Power usage [mW]

+--------------+-----------------------+-------+-----------+
|  chan_name   |        kernel         | value | perc_diff |
+--------------+-----------------------+-------+-----------+
| total_power  |       menu            | 144.6 |   0.0%    |
| total_power  |   menu_pelt_4         | 158.5 |   9.63%   |
| total_power  |   menu_pelt_rampup    | 272.1 |  88.23%   |
| total_power  |   menu_performance    | 485.6 |  235.9%   |
+--------------+-----------------------+-------+-----------+


Mean frame duration (ms)

+---------------+-----------------------+-------+-----------+
|   variable    |        kernel         | value | perc_diff |
+---------------+-----------------------+-------+-----------+
| mean_duration |       menu            | 13.9  |   0.0%    |
| mean_duration |   menu_pelt_4         | 14.5  |   4.74%   |
| mean_duration |   menu_pelt_rampup    |  8.3  |  -40.31%  |
| mean_duration |   menu_performance    |  4.4  |  -68.13%  |
+---------------+-----------------------+-------+-----------+

Jank percentage

+-----------+-----------------------+-------+-----------+
| variable  |        kernel         | value | perc_diff |
+-----------+-----------------------+-------+-----------+
| jank_perc |       menu            |  1.5  |   0.0%    |
| jank_perc |   menu_pelt_4         |  2.0  |  30.08%   |
| jank_perc |   menu_pelt_rampup    |  0.1  |  -93.09%  |
| jank_perc |   menu_performance    |  0.1  |  -96.29%  |
+-----------+-----------------------+-------+-----------+

[...]

Some variant of this that's tunable at runtime could be workable for the
purposes described before. At least this further proves that it's manipulating
frequency that's responsible for the results here.

---
Kajetan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-10 11:16             ` Dietmar Eggemann
@ 2022-11-10 13:05               ` Peter Zijlstra
  2022-11-10 14:59                 ` Dietmar Eggemann
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2022-11-10 13:05 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
> On 07/11/2022 14:41, Peter Zijlstra wrote:
> > On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> 
> [...]
> 
> > @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> >   */
> >  static inline unsigned long cpu_util_cfs(int cpu)
> >  {
> > +	struct rq *rq = cpu_rq(cpu);
> >  	struct cfs_rq *cfs_rq;
> >  	unsigned long util;
> >  
> > -	cfs_rq = &cpu_rq(cpu)->cfs;
> > +	cfs_rq = &rq->cfs;
> >  	util = READ_ONCE(cfs_rq->avg.util_avg);
> >  
> >  	if (sched_feat(UTIL_EST)) {
> > +		if (sched_feat(UTIL_EST_FASTER)) {
> > +			struct task_struct *curr;
> > +
> > +			rcu_read_lock();
> > +			curr = rcu_dereference(rq->curr);
> > +			if (likely(curr->sched_class == &fair_sched_class)) {
> > +				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
> 
> Don't we end up with gigantic runtime numbers here?
> 
> root@juno:~# cat /proc/1676/task/1676/schedstat
> 36946300 1150620 11
> root@juno:~# cat /proc/1676/task/1676/sched
> rt-app (1676, #threads: 2)
> -------------------------------------------------------------------
> se.exec_start                                :         77766.964240 <- !
> se.vruntime                                  :           563.587883
> se.sum_exec_runtime                          :            36.946300  <- !
> se.nr_migrations                             :                    0
> ...
> 
> I expect cpu_util_cfs() to be ~1024 almost all the time now.

Duh, obviously I meant to measure the runtime of the current activation
and messed up.

We don't appear to have the right information to compute this atm :/


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-09 15:49               ` Peter Zijlstra
@ 2022-11-10 13:25                 ` Qais Yousef
  2023-02-07 10:29                 ` Dietmar Eggemann
  1 sibling, 0 replies; 61+ messages in thread
From: Qais Yousef @ 2022-11-10 13:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kajetan Puchalski, Jian-Min Liu, Dietmar Eggemann, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen

On 11/09/22 16:49, Peter Zijlstra wrote:
> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> > On 11/07/22 14:41, Peter Zijlstra wrote:
> > > On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> > > 
> > > > Based on all the tests we've seen, jankbench or otherwise, the
> > > > improvement can mainly be attributed to the faster ramp up of frequency
> > > > caused by the shorter PELT window while using schedutil.
> > > 
> > > Would something terrible like the below help some?
> > > 
> > > If not, I suppose it could be modified to take the current state as
> > > history. But basically it runs a faster pelt sum along side the regular
> > > signal just for ramping up the frequency.
> > 
> > A bit of a tangent, but this reminded me of this old patch:
> > 
> > 	https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/
> > 
> > I think we have a bit too many moving cogs that might be creating undesired
> > compound effect.
> > 
> > Should we consider removing margins in favour of improving util ramp up/down?
> > (whether via util_est or pelt hf).
> 
> Yeah, possibly.
> 
> So one thing that was key to that hack I proposed is that it is
> per-task. This means we can either set or detect the task activation
> period and use that to select an appropriate PELT multiplier.

Note that a big difference compared to a shorter PELT halflife is that
util_est only biases towards ramping up faster; not being able to come
down as quickly could impact power, as our residency at higher
frequencies will be longer. Only testing can show how big of a problem
this is in practice.

> 
> But please explain; once tasks are in a steady state (60HZ, 90HZ or god
> forbid higher), the utilization should be the same between the various
> PELT window sizes, provided the activation period isn't *much* larger
> than the window.

It is steady state for a short period of time, before something else happens
that changes the nature of the workload.

For example, standing still in an empty room, then an explosion suddenly
happens, causing lots of activity to appear on the screen.

We can have a steady state at 20%, but an action on the screen could suddenly
change the demand to 100%.

By the way, you can find a lot of videos on YouTube on how to tweak CPU
frequencies and the governor to improve gaming performance:

	https://www.youtube.com/results?search_query=android+gaming+cpu+boost

And this ancient video from Google about the impact of frequency scaling on games:

	https://www.youtube.com/watch?v=AZ97b2nT-Vo

This is truly ancient, and the advice given then (over 8 years ago) does
not reflect the current state of affairs.

The problem is not new; I guess expectations just keep going higher about
what one can do on a phone, in spite of all the past improvements :-)

> 
> Are these things running a ton of single shot tasks or something daft
> like that?

I'm not sure how all game engines behave, but the few I've seen don't tend
to do that.

I've seen apps like Instagram using single shot tasks sometime in the
(distant) past to retrieve images. Generally I'm not sure how the Java-based
APIs behave. There is a JobScheduler API that allows apps to schedule
background and foreground work; that could end up reusing a pool of tasks or
creating new ones, I'm not sure. Game engines tend to be written with the
NDK; but simpler games might not be.


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-10 13:05               ` Peter Zijlstra
@ 2022-11-10 14:59                 ` Dietmar Eggemann
  2022-11-10 17:51                   ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2022-11-10 14:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On 10/11/2022 14:05, Peter Zijlstra wrote:
> On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
>> On 07/11/2022 14:41, Peter Zijlstra wrote:
>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>>
>> [...]
>>
>>> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>>>   */
>>>  static inline unsigned long cpu_util_cfs(int cpu)
>>>  {
>>> +	struct rq *rq = cpu_rq(cpu);
>>>  	struct cfs_rq *cfs_rq;
>>>  	unsigned long util;
>>>  
>>> -	cfs_rq = &cpu_rq(cpu)->cfs;
>>> +	cfs_rq = &rq->cfs;
>>>  	util = READ_ONCE(cfs_rq->avg.util_avg);
>>>  
>>>  	if (sched_feat(UTIL_EST)) {
>>> +		if (sched_feat(UTIL_EST_FASTER)) {
>>> +			struct task_struct *curr;
>>> +
>>> +			rcu_read_lock();
>>> +			curr = rcu_dereference(rq->curr);
>>> +			if (likely(curr->sched_class == &fair_sched_class)) {
>>> +				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
>>
>> Don't we end up with gigantic runtime numbers here?
>>
>> root@juno:~# cat /proc/1676/task/1676/schedstat
>> 36946300 1150620 11
>> root@juno:~# cat /proc/1676/task/1676/sched
>> rt-app (1676, #threads: 2)
>> -------------------------------------------------------------------
>> se.exec_start                                :         77766.964240 <- !
>> se.vruntime                                  :           563.587883
>> se.sum_exec_runtime                          :            36.946300  <- !
>> se.nr_migrations                             :                    0
>> ...
>>
>> I expect cpu_util_cfs() to be ~1024 almost all the time now.
> 
> Duh, obviously I meant to measure the runtime of the current activation
> and messed up.
> 
> We don't appear to have the right information to compute this atm :/

This would be:

u64 now = rq_clock_task(rq);
u64 runtime = now - curr->se.exec_start;

but we don't hold the rq lock so we can't get `now`?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-10 14:59                 ` Dietmar Eggemann
@ 2022-11-10 17:51                   ` Peter Zijlstra
  2022-11-30 18:14                     ` Dietmar Eggemann
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2022-11-10 17:51 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Thu, Nov 10, 2022 at 03:59:01PM +0100, Dietmar Eggemann wrote:
> On 10/11/2022 14:05, Peter Zijlstra wrote:
> > On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
> >> On 07/11/2022 14:41, Peter Zijlstra wrote:
> >>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> >>
> >> [...]
> >>
> >>> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> >>>   */
> >>>  static inline unsigned long cpu_util_cfs(int cpu)
> >>>  {
> >>> +	struct rq *rq = cpu_rq(cpu);
> >>>  	struct cfs_rq *cfs_rq;
> >>>  	unsigned long util;
> >>>  
> >>> -	cfs_rq = &cpu_rq(cpu)->cfs;
> >>> +	cfs_rq = &rq->cfs;
> >>>  	util = READ_ONCE(cfs_rq->avg.util_avg);
> >>>  
> >>>  	if (sched_feat(UTIL_EST)) {
> >>> +		if (sched_feat(UTIL_EST_FASTER)) {
> >>> +			struct task_struct *curr;
> >>> +
> >>> +			rcu_read_lock();
> >>> +			curr = rcu_dereference(rq->curr);
> >>> +			if (likely(curr->sched_class == &fair_sched_class)) {
> >>> +				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
> >>
> >> Don't we end up with gigantic runtime numbers here?
> >>
> >> root@juno:~# cat /proc/1676/task/1676/schedstat
> >> 36946300 1150620 11
> >> root@juno:~# cat /proc/1676/task/1676/sched
> >> rt-app (1676, #threads: 2)
> >> -------------------------------------------------------------------
> >> se.exec_start                                :         77766.964240 <- !
> >> se.vruntime                                  :           563.587883
> >> se.sum_exec_runtime                          :            36.946300  <- !
> >> se.nr_migrations                             :                    0
> >> ...
> >>
> >> I expect cpu_util_cfs() to be ~1024 almost all the time now.
> > 
> > Duh, obviously I meant to measure the runtime of the current activation
> > and messed up.
> > 
> > We don't appear to have the right information to compute this atm :/
> 
> This would be:
> 
> u64 now = rq_clock_task(rq);
> u64 runtime = now - curr->se.exec_start;
> 
> but we don't hold the rq lock so we can't get `now`?

Not quite the same; that's the time since we got on-cpu last, but that's
not the same as the runtime of this activation (it is when you discount
preemption).

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-10 17:51                   ` Peter Zijlstra
@ 2022-11-30 18:14                     ` Dietmar Eggemann
  2022-12-01 13:37                       ` Kajetan Puchalski
  0 siblings, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2022-11-30 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On 10/11/2022 18:51, Peter Zijlstra wrote:
> On Thu, Nov 10, 2022 at 03:59:01PM +0100, Dietmar Eggemann wrote:
>> On 10/11/2022 14:05, Peter Zijlstra wrote:
>>> On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
>>>> On 07/11/2022 14:41, Peter Zijlstra wrote:
>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>>>>
>>>> [...]
>>>>
>>>>> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>>>>>   */
>>>>>  static inline unsigned long cpu_util_cfs(int cpu)
>>>>>  {
>>>>> +	struct rq *rq = cpu_rq(cpu);
>>>>>  	struct cfs_rq *cfs_rq;
>>>>>  	unsigned long util;
>>>>>  
>>>>> -	cfs_rq = &cpu_rq(cpu)->cfs;
>>>>> +	cfs_rq = &rq->cfs;
>>>>>  	util = READ_ONCE(cfs_rq->avg.util_avg);
>>>>>  
>>>>>  	if (sched_feat(UTIL_EST)) {
>>>>> +		if (sched_feat(UTIL_EST_FASTER)) {
>>>>> +			struct task_struct *curr;
>>>>> +
>>>>> +			rcu_read_lock();
>>>>> +			curr = rcu_dereference(rq->curr);
>>>>> +			if (likely(curr->sched_class == &fair_sched_class)) {
>>>>> +				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
>>>>
> >>>> Don't we end up with gigantic runtime numbers here?
>>>>
> >>>> root@juno:~# cat /proc/1676/task/1676/schedstat
>>>> 36946300 1150620 11
>>>> root@juno:~# cat /proc/1676/task/1676/sched
>>>> rt-app (1676, #threads: 2)
>>>> -------------------------------------------------------------------
>>>> se.exec_start                                :         77766.964240 <- !
>>>> se.vruntime                                  :           563.587883
> >>>> se.sum_exec_runtime                          :            36.946300  <- !
>>>> se.nr_migrations                             :                    0
>>>> ...
>>>>
>>>> I expect cpu_util_cfs() to be ~1024 almost all the time now.
>>>
>>> Duh, obviously I meant to measure the runtime of the current activation
>>> and messed up.
>>>
>>> We don't appear to have the right information to compute this atm :/
>>
>> This would be:
>>
>> u64 now = rq_clock_task(rq);
>> u64 runtime = now - curr->se.exec_start;
>>
>> but we don't hold the rq lock so we can't get `now`?
> 
> Not quite the same; that's the time since we got on-cpu last, but that's
> not the same as the runtime of this activation (it is when you discount
> preemption).


----|----|----|----|----|----|--->
    a    s1   p1   s2   p2   d

a ... activate_task() -> enqueue_task()

s ... set_next_entity()

p ... put_prev_entity()

d ... deactivate_task() -> dequeue_task()

By `runtime of the activation` you refer to `curr->sum_exec_runtime -
time(a)` ? And the latter we don't have?

And `runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime`
is only covering the time since we got onto the cpu, right?

With the missing `runtime >>= 10` (from __update_load_sum()) added and using
`runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime`
for a 1-task workload (so no preemption), with factor 2 or 4 I get at
least close to the original rq->cfs.avg.util_avg and util_est.enqueued
signals (cells (5)-(8) in the notebook below).

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/UTIL_EST_FASTER.ipynb?flush_cache=true

----

set_next_entity()
    update_stats_curr_start()
        se->exec_start = rq_clock_task()

    cfs_rq->curr = se                                (1)

    se->prev_sum_exec_runtime = se->sum_exec_runtime (2)

update_curr()

    now = rq_clock_task(rq_of(cfs_rq))
    delta_exec = now - curr->exec_start              (3)
    curr->exec_start = now
    curr->sum_exec_runtime += delta_exec;            (4)

put_prev_entity()

    cfs_rq->curr = NULL                              (5)
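
For reference, a minimal sketch of the bookkeeping that would be needed
(the patch linked later in this thread introduces a field
prev_sum_exec_runtime_vol for this; the exact hook where it is stamped may
differ, so treat the placement below as illustrative only):

    /* stamped when a new activation starts, i.e. at point 'a' above */
    se->prev_sum_exec_runtime_vol = se->sum_exec_runtime;

    /* runtime of the current activation, at any later point */
    runtime = se->sum_exec_runtime - se->prev_sum_exec_runtime_vol;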

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-30 18:14                     ` Dietmar Eggemann
@ 2022-12-01 13:37                       ` Kajetan Puchalski
  0 siblings, 0 replies; 61+ messages in thread
From: Kajetan Puchalski @ 2022-12-01 13:37 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Wed, Nov 30, 2022 at 07:14:51PM +0100, Dietmar Eggemann wrote:

> By `runtime of the activation` you refer to `curr->sum_exec_runtime -
> time(a)` ? And the latter we don't have?
> 
> And `runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime`
> is only covering the time since we got onto the cpu, right?
> 
> With a missing `runtime >>= 10` (from __update_load_sum()) and using
> `runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime`
> for a 1 task-workload (so no preemption) with factor 2 or 4 I get at
> least close to the original rq->cfs.avg.util_avg and util_est.enqueued
> signals (cells (5)-(8) in the notebook below).

> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/UTIL_EST_FASTER.ipynb?flush_cache=true
> 

With those two changes as described above the comparative results are as
follows:

Max frame durations (worst case scenario)

+--------------------------------+-----------+------------+
|            kernel              | iteration |   value    |
+--------------------------------+-----------+------------+
|         baseline_60hz          |    10     | 149.935514 |
| pelt_rampup_runtime_shift_60hz |    10     | 108.126862 |
+--------------------------------+-----------+------------+

Power usage [mW]

+--------------+--------------------------------+-------+-----------+
|  chan_name   |             kernel             | value | perc_diff |
+--------------+--------------------------------+-------+-----------+
| total_power  |         baseline_60hz          | 141.6 |   0.0%    |
| total_power  | pelt_rampup_runtime_shift_60hz | 168.0 |  18.61%   |
+--------------+--------------------------------+-------+-----------+

Mean frame duration (average case)

+---------------+--------------------------------+-------+-----------+
|   variable    |             kernel             | value | perc_diff |
+---------------+--------------------------------+-------+-----------+
| mean_duration |         baseline_60hz          | 16.7  |   0.0%    |
| mean_duration | pelt_rampup_runtime_shift_60hz | 13.6  |  -18.9%   |
+---------------+--------------------------------+-------+-----------+

Jank percentage

+-----------+--------------------------------+-------+-----------+
| variable  |             kernel             | value | perc_diff |
+-----------+--------------------------------+-------+-----------+
| jank_perc |         baseline_60hz          |  4.0  |   0.0%    |
| jank_perc | pelt_rampup_runtime_shift_60hz |  1.5  |  -64.04%  |
+-----------+--------------------------------+-------+-----------+

Meaning it's a middle ground of sorts - instead of a 90% increase in
power usage it's 'just' 19%. At the same time though the fastest PELT
multiplier (pelt_4) was getting better max frame durations (85ms vs
108ms) for about half the power increase (9.6% vs 18.6%).

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-09 15:49               ` Peter Zijlstra
  2022-11-10 13:25                 ` Qais Yousef
@ 2023-02-07 10:29                 ` Dietmar Eggemann
  2023-02-09 16:16                   ` Vincent Guittot
  1 sibling, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2023-02-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Qais Yousef
  Cc: Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On 09/11/2022 16:49, Peter Zijlstra wrote:
> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
>> On 11/07/22 14:41, Peter Zijlstra wrote:
>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:

[...]

> So one thing that was key to that hack I proposed is that it is
> per-task. This means we can either set or detect the task activation
> period and use that to select an appropriate PELT multiplier.
> 
> But please explain; once tasks are in a steady state (60HZ, 90HZ or god
> forbid higher), the utilization should be the same between the various
> PELT window sizes, provided the activation period isn't *much* larger
> than the window.
> 
> Are these things running a ton of single shot tasks or something daft
> like that?

This investigation tries to answer these questions. The results can
be found in chapters (B) and (C).

I ran 'util_est_faster' with delta equal to 'duration of the current 
activation'. I.e. the following patch is needed:

https://lkml.kernel.org/r/ec049fd9b635f76a9e1d1ad380fd9184ebeeca53.1671158588.git.yu.c.chen@intel.com 

The testcase is Jankbench on Android 12 on Pixel6, CPU orig capacity 
= [124 124 124 124 427 427 1024 1024], w/ mainline v5.18 kernel and 
forward ported task scheduler patches.

(A) *** 'util_est_faster' vs. 'scaled util_est_faster' ***

The initial approach didn't scale the runtime duration. It is based
on task clock and not PELT clock but it should be scaled by uArch
and frequency to align with the PELT time used for util tracking.

The original approach, though, shows better results than the scaled one:
the even more aggressive boosting on non-big CPUs helps to raise the
frequency even quicker in the scenario described under (B).
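
To illustrate the scaling (only an example; cap_scale() is the helper used
in the patch below, defined in kernel/sched/fair.c as
((v)*(s) >> SCHED_CAPACITY_SHIFT)):

    delta = cap_scale(delta, arch_scale_cpu_capacity(cpu));  /* uArch */
    delta = cap_scale(delta, arch_scale_freq_capacity(cpu)); /* frequency */

    /* e.g. 10ms of task clock on a little CPU (capacity 124) running at
     * half its max frequency: 10ms * 124/1024 * 512/1024 ~= 0.6ms of
     * PELT-equivalent time, hence the much weaker boost of the scaled
     * variant on non-big CPUs.
     */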

All tests ran 10 iterations of all Jankbench sub-tests.

Max_frame_duration:
+------------------------+------------+
|             kernel     |    value   |
+------------------------+------------+
|   base-a30b17f016b0    | 147.571352 |
|    util_est_faster     | 84.834999  |
| scaled_util_est_faster | 127.72855  |
+------------------------+------------+

Mean_frame_duration:
+------------------------+-------+-----------+
|             kernel     | value | perc_diff |
+------------------------+-------+-----------+
|   base-a30b17f016b0    | 14.7  |   0.0%    |
|    util_est_faster     | 12.6  |  -14.01%  |
| scaled_util_est_faster | 13.5  |  -8.45%   |
+------------------------+-------+-----------+

Jank percentage (Jank deadline 16ms):
+------------------------+-------+-----------+
|             kernel     | value | perc_diff |
+------------------------+-------+-----------+
|   base-a30b17f016b0    |  1.8  |   0.0%    |
|    util_est_faster     |  0.8  |  -57.8%   |
| scaled_util_est_faster |  1.4  |  -25.89%  |
+------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+------------------------+-------+-----------+
|             kernel     | value | perc_diff |
+------------------------+-------+-----------+
|   base-a30b17f016b0    | 144.4 |   0.0%    |
|    util_est_faster     | 150.9 |   4.45%   |
| scaled_util_est_faster | 152.2 |   5.4%    |
+------------------------+-------+-----------+

'scaled util_est_faster' is used as the base for all following tests. 

(B) *** Where does util_est_faster help exactly? ***

It turns out that the score improvement comes from the more aggressive 
DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util() 
-> effective_cpu_util(..., cpu_util_cfs(), ...).

At the beginning of an episode (e.g. beginning of an image list view
fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
frequency') of the Android Graphics Pipeline (AGP) start to run, the
CPU Operating Performance Point (OPP) is often so low that those tasks
run more like 10/16ms which let the test application count a lot of
Jankframes at those moments.

And there is where this util_est_faster approach helps by boosting CPU
util according to the 'runtime of the current activation'.
Moreover it could also be that the tasks have simply more work to do in
these first activations at the beginning of an episode.

All the other places in which cpu_util_cfs() is used:

(2) CFS load balance ('_lb')
(3) CPU overutilization ('_ou')
(4) CFS fork/exec task placement ('_slowpath')

when tested individually, show no improvement or even a regression.

Max_frame_duration:
+---------------------------------+------------+
|             kernel              |    value   |
+---------------------------------+------------+
|     scaled_util_est_faster      | 127.72855  |
|   scaled_util_est_faster_freq   | 126.646506 |
|    scaled_util_est_faster_lb    | 162.596249 |
|    scaled_util_est_faster_ou    | 166.59519  |
| scaled_util_est_faster_slowpath | 153.966638 |
+---------------------------------+------------+

Mean_frame_duration:
+---------------------------------+-------+-----------+
|             kernel              | value | perc_diff |
+---------------------------------+-------+-----------+
|     scaled_util_est_faster      | 13.5  |   0.0%    |
|   scaled_util_est_faster_freq   | 13.7  |   1.79%   |
|    scaled_util_est_faster_lb    | 14.8  |   9.87%   |
|    scaled_util_est_faster_ou    | 14.5  |   7.46%   |
| scaled_util_est_faster_slowpath | 16.2  |  20.45%   |
+---------------------------------+-------+-----------+

Jank percentage (Jank deadline 16ms):
+---------------------------------+-------+-----------+
|             kernel              | value | perc_diff |
+---------------------------------+-------+-----------+
|     scaled_util_est_faster      |  1.4  |   0.0%    |
|   scaled_util_est_faster_freq   |  1.3  |  -2.34%   |
|    scaled_util_est_faster_lb    |  1.7  |  27.42%   |
|    scaled_util_est_faster_ou    |  2.1  |  50.33%   |
| scaled_util_est_faster_slowpath |  2.8  |  102.39%  |
+---------------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+---------------------------------+-------+-----------+
|             kernel              | value | perc_diff |
+---------------------------------+-------+-----------+
|     scaled_util_est_faster      | 152.2 |   0.0%    |
|   scaled_util_est_faster_freq   | 132.3 |  -13.1%   |
|    scaled_util_est_faster_lb    | 137.1 |  -9.96%   |
|    scaled_util_est_faster_ou    | 132.4 |  -13.04%  |
| scaled_util_est_faster_slowpath | 141.3 |  -7.18%   |
+---------------------------------+-------+-----------+

(C) *** Which tasks contribute the most to the score improvement? ***

A trace_event capturing the cases in which a task's util_est_fast trumps
the CPU util was added to cpu_util_cfs(). This is 1 iteration of Jankbench
and the base is (1) 'scaled_util_est_faster_freq':

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_6.ipynb

'Cell [6]' shows the tasks of the Jankbench process
'[com.an]droid.benchmark' which are boosting the CPU frequency request.

Among them are 2 main threads of the AGP, '[com.an]droid.benchmark' and
'RenderThread'.
The spikes in util_est_fast are congruent with the aforementioned
beginning of an episode in which these periodic tasks are running and
when their runtime/period is rather ~10/16ms and not ~1-2/16ms since
the CPU OPP is still low. 

Very few other Jankbench tasks 'Cell [6]' show the same behaviour. The
Surfaceflinger process 'Cell [8]' is not affected, and from the kernel
tasks only kcompactd0 creates a mild boost 'Cell [9]'.

As expected, running a non-scaled version of (1) shows more aggressive 
boosting on non-big CPUs:

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_5.ipynb

Looks like 'util_est_faster' can prevent Jankframes by boosting CPU
util when periodic tasks have a longer runtime compared to when they reach
steady-state.

The result is very similar to PELT halflife reduction. The advantage is
that 'util_est_faster' is only activated selectively when the runtime of
the current task in its current activation is long enough to create this
CPU util boost.  

Original patch:
https://lkml.kernel.org/r/Y2kLA8x40IiBEPYg@hirez.programming.kicks-ass.net

Changes applied:
- use 'duration of the current activation' as delta
- delta >>= 10
- uArch and frequency scaling of delta

-->%--

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index efdc29c42161..76d146d06bbe 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -97,6 +97,7 @@ SCHED_FEAT(WA_BIAS, true)
  */  
 SCHED_FEAT(UTIL_EST, true)
 SCHED_FEAT(UTIL_EST_FASTUP, true)
+SCHED_FEAT(UTIL_EST_FASTER, true)
 
 SCHED_FEAT(LATENCY_WARN, false)
 
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 0f310768260c..13cd9e27ce3e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -148,6 +148,22 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
        return periods;
 }
 
+/*
+ * Compute a pelt util_avg assuming no history and @delta runtime.
+ */
+unsigned long faster_est_approx(u64 delta)
+{
+       unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
+       u64 periods = delta / 1024;
+
+       if (periods) {
+               delta %= 1024;
+               contrib = __accumulate_pelt_segments(periods, 1024, delta);
+       }
+
+       return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
+}
+
 /*
  * We can represent the historical contribution to runnable average as the 
  * coefficients of a geometric series.  To do this we sub-divide our runnable
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1072502976df..7cb45f1d8062 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2961,6 +2961,8 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
        return READ_ONCE(rq->avg_dl.util_avg);
 }
 
+extern unsigned long faster_est_approx(u64 runtime);
+
 /** 
  * cpu_util_cfs() - Estimates the amount of CPU capacity used by CFS tasks.
  * @cpu: the CPU to get the utilization for.
@@ -2995,13 +2997,39 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
  */  
 static inline unsigned long cpu_util_cfs(int cpu)
 {
+       struct rq *rq = cpu_rq(cpu);
        struct cfs_rq *cfs_rq;
        unsigned long util;
 
-       cfs_rq = &cpu_rq(cpu)->cfs;
+       cfs_rq = &rq->cfs;
        util = READ_ONCE(cfs_rq->avg.util_avg);
 
        if (sched_feat(UTIL_EST)) {
+               if (sched_feat(UTIL_EST_FASTER)) {
+                       struct task_struct *curr;
+
+                       rcu_read_lock();
+                       curr = rcu_dereference(rq->curr);
+                       if (likely(curr->sched_class == &fair_sched_class)) {
+                               unsigned long util_est_fast;
+                               u64 delta;
+
+                               delta = curr->se.sum_exec_runtime -
+                                       curr->se.prev_sum_exec_runtime_vol;
+
+                               delta >>= 10;
+                               if (!delta)
+                                       goto unlock;
+
+                               delta = cap_scale(delta, arch_scale_cpu_capacity(cpu));
+                               delta = cap_scale(delta, arch_scale_freq_capacity(cpu));
+
+                               util_est_fast = faster_est_approx(delta * 2);
+                               util = max(util, util_est_fast);
+                       }
+unlock:
+                       rcu_read_unlock();
+               }
                util = max_t(unsigned long, util,
                             READ_ONCE(cfs_rq->avg.util_est.enqueued));
        }
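
For a rough feel of what faster_est_approx() returns, a back-of-the-envelope
closed form (ignoring the PELT_MIN_DIVIDER details, so the numbers are only
approximate):

    /* no history, task running continuously for t ms, 32ms halflife:
     *
     *   util(t) ~= 1024 * (1 - 2^(-t/32))
     *
     *   t =  4ms -> ~85     t =  8ms -> ~163
     *   t = 16ms -> ~300    t = 32ms -> ~512
     *
     * so faster_est_approx(delta * 2) with delta = 8ms lands near the
     * 16ms value above, i.e. it behaves like halving the PELT halflife.
     */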

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-07 10:29                 ` Dietmar Eggemann
@ 2023-02-09 16:16                   ` Vincent Guittot
  2023-02-17 13:54                     ` Dietmar Eggemann
                                       ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Vincent Guittot @ 2023-02-09 16:16 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 09/11/2022 16:49, Peter Zijlstra wrote:
> > On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> >> On 11/07/22 14:41, Peter Zijlstra wrote:
> >>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
> [...]
>
> > So one thing that was key to that hack I proposed is that it is
> > per-task. This means we can either set or detect the task activation
> > period and use that to select an appropriate PELT multiplier.
> >
> > But please explain; once tasks are in a steady state (60HZ, 90HZ or god
> > forbid higher), the utilization should be the same between the various
> > PELT window sizes, provided the activation period isn't *much* larger
> > than the window.
> >
> > Are these things running a ton of single shot tasks or something daft
> > like that?
>
> This investigation tries to answer these questions. The results can
> be found in chapter (B) and (C).
>
> I ran 'util_est_faster' with delta equal to 'duration of the current
> activation'. I.e. the following patch is needed:
>
> https://lkml.kernel.org/r/ec049fd9b635f76a9e1d1ad380fd9184ebeeca53.1671158588.git.yu.c.chen@intel.com
>
> The testcase is Jankbench on Android 12 on Pixel6, CPU orig capacity
> = [124 124 124 124 427 427 1024 1024], w/ mainline v5.18 kernel and
> forward ported task scheduler patches.
>
> (A) *** 'util_est_faster' vs. 'scaled util_est_faster' ***
>
> The initial approach didn't scale the runtime duration. It is based
> on task clock and not PELT clock but it should be scaled by uArch
> and frequency to align with the PELT time used for util tracking.
>
> Although the original approach shows better results than the scaled
> one. Even more aggressive boosting on non-big CPUs helps to raise the
> frequency even quicker in the scenario described under (B).
>
> All tests ran 10 iterations of all Jankbench sub-tests.
>
> Max_frame_duration:
> +------------------------+------------+
> |             kernel     |    value   |
> +------------------------+------------+
> |   base-a30b17f016b0    | 147.571352 |
> |    util_est_faster     | 84.834999  |
> | scaled_util_est_faster | 127.72855  |
> +------------------------+------------+
>
> Mean_frame_duration:
> +------------------------+-------+-----------+
> |             kernel     | value | perc_diff |
> +------------------------+-------+-----------+
> |   base-a30b17f016b0    | 14.7  |   0.0%    |
> |    util_est_faster     | 12.6  |  -14.01%  |
> | scaled_util_est_faster | 13.5  |  -8.45%   |
> +------------------------+-------+-----------+
>
> Jank percentage (Jank deadline 16ms):
> +------------------------+-------+-----------+
> |             kernel     | value | perc_diff |
> +------------------------+-------+-----------+
> |   base-a30b17f016b0    |  1.8  |   0.0%    |
> |    util_est_faster     |  0.8  |  -57.8%   |
> | scaled_util_est_faster |  1.4  |  -25.89%  |
> +------------------------+-------+-----------+
>
> Power usage [mW] (total - all CPUs):
> +------------------------+-------+-----------+
> |             kernel     | value | perc_diff |
> +------------------------+-------+-----------+
> |   base-a30b17f016b0    | 144.4 |   0.0%    |
> |    util_est_faster     | 150.9 |   4.45%   |
> | scaled_util_est_faster | 152.2 |   5.4%    |
> +------------------------+-------+-----------+
>
> 'scaled util_est_faster' is used as the base for all following tests.
>
> (B) *** Where does util_est_faster help exactly? ***
>
> It turns out that the score improvement comes from the more aggressive
> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
> -> effective_cpu_util(..., cpu_util_cfs(), ...).
>
> At the beginning of an episode (e.g. beginning of an image list view
> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
> frequency') of the Android Graphics Pipeline (AGP) start to run, the
> CPU Operating Performance Point (OPP) is often so low that those tasks
> run more like 10/16ms which let the test application count a lot of
> Jankframes at those moments.

I don't see how util_est_faster can help this 1ms task here? It most
probably never gets preempted during this 1ms. For such a short Android
Graphics Pipeline task, hasn't uclamp_min been designed for exactly this,
and wouldn't it be a better solution?

>
> And there is where this util_est_faster approach helps by boosting CPU
> util according to the 'runtime of the current activation'.
> Moreover it could also be that the tasks have simply more work to do in
> these first activations at the beginning of an episode.
>
> All the other places in which cpu_util_cfs() is used:
>
> (2) CFS load balance ('_lb')
> (3) CPU overutilization ('_ou')
> (4) CFS fork/exec task placement ('_slowpath')
>
> when tested individually don't show any improvement or even regression.
>
> Max_frame_duration:
> +---------------------------------+------------+
> |             kernel              |    value   |
> +---------------------------------+------------+
> |     scaled_util_est_faster      | 127.72855  |
> |   scaled_util_est_faster_freq   | 126.646506 |
> |    scaled_util_est_faster_lb    | 162.596249 |
> |    scaled_util_est_faster_ou    | 166.59519  |
> | scaled_util_est_faster_slowpath | 153.966638 |
> +---------------------------------+------------+
>
> Mean_frame_duration:
> +---------------------------------+-------+-----------+
> |             kernel              | value | perc_diff |
> +---------------------------------+-------+-----------+
> |     scaled_util_est_faster      | 13.5  |   0.0%    |
> |   scaled_util_est_faster_freq   | 13.7  |   1.79%   |
> |    scaled_util_est_faster_lb    | 14.8  |   9.87%   |
> |    scaled_util_est_faster_ou    | 14.5  |   7.46%   |
> | scaled_util_est_faster_slowpath | 16.2  |  20.45%   |
> +---------------------------------+-------+-----------+
>
> Jank percentage (Jank deadline 16ms):
> +---------------------------------+-------+-----------+
> |             kernel              | value | perc_diff |
> +---------------------------------+-------+-----------+
> |     scaled_util_est_faster      |  1.4  |   0.0%    |
> |   scaled_util_est_faster_freq   |  1.3  |  -2.34%   |
> |    scaled_util_est_faster_lb    |  1.7  |  27.42%   |
> |    scaled_util_est_faster_ou    |  2.1  |  50.33%   |
> | scaled_util_est_faster_slowpath |  2.8  |  102.39%  |
> +---------------------------------+-------+-----------+
>
> Power usage [mW] (total - all CPUs):
> +---------------------------------+-------+-----------+
> |             kernel              | value | perc_diff |
> +---------------------------------+-------+-----------+
> |     scaled_util_est_faster      | 152.2 |   0.0%    |
> |   scaled_util_est_faster_freq   | 132.3 |  -13.1%   |
> |    scaled_util_est_faster_lb    | 137.1 |  -9.96%   |
> |    scaled_util_est_faster_ou    | 132.4 |  -13.04%  |
> | scaled_util_est_faster_slowpath | 141.3 |  -7.18%   |
> +---------------------------------+-------+-----------+
>
> (C) *** Which tasks contribute the most to the score improvement? ***
>
> A trace_event capturing the cases in which task's util_est_fast trumps
> CPU util was added to cpu_util_cfs(). This is 1 iteration of Jankbench
> and the base is (1) 'scaled_util_est_faster_freq':
>
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_6.ipynb
>
> 'Cell [6]' shows the tasks of the Jankbench process
> '[com.an]droid.benchmark' which are boosting the CPU frequency request.
>
> Among them are 2 main threads of the AGP, '[com.an]droid.benchmark' and
> 'RenderThread'.
> The spikes in util_est_fast are congruent with the aforementioned
> beginning of an episode in which these periodic tasks are running and
> when their runtime/period is rather ~10/16ms and not ~1-2/16ms since
> the CPU OPP is still low.
>
> Very few other Jankbench tasks 'Cell [6] show the same behaviour. The
> Surfaceflinger process 'Cell [8]' is not affected and from the kernel
> tasks only kcompctd0 creates a mild boost 'Cell [9]'.
>
> As expected, running a non-scaled version of (1) shows more aggressive
> boosting on non-big CPUs:
>
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_5.ipynb
>
> Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU
> util when periodic tasks have a longer runtime compared to when they reach
> steady-state.
>
> The results is very similar to PELT halflife reduction. The advantage is
> that 'util_est_faster' is only activated selectively when the runtime of
> the current task in its current activation is long enough to create this
> CPU util boost.

IIUC how util_est_faster works, it removes the waiting time when
sharing cpu time with other tasks. So as long as there is no (runnable
but not running time), the result is the same as current util_est.
util_est_faster makes a difference only when the task alternates
between runnable and running slices.
Have you considered using the runnable_avg metric in the increase of cpu
freq? This takes into account the runnable slice and not only the running
time, and increases faster than util_avg when tasks compete for the same
CPU.

>
> Original patch:
> https://lkml.kernel.org/r/Y2kLA8x40IiBEPYg@hirez.programming.kicks-ass.net
>
> Changes applied:
> - use 'duration of the current activation' as delta
> - delta >>= 10
> - uArch and frequency scaling of delta
>
> -->%--
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index efdc29c42161..76d146d06bbe 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -97,6 +97,7 @@ SCHED_FEAT(WA_BIAS, true)
>   */
>  SCHED_FEAT(UTIL_EST, true)
>  SCHED_FEAT(UTIL_EST_FASTUP, true)
> +SCHED_FEAT(UTIL_EST_FASTER, true)
>
>  SCHED_FEAT(LATENCY_WARN, false)
>
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 0f310768260c..13cd9e27ce3e 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -148,6 +148,22 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
>         return periods;
>  }
>
> +/*
> + * Compute a pelt util_avg assuming no history and @delta runtime.
> + */
> +unsigned long faster_est_approx(u64 delta)
> +{
> +       unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
> +       u64 periods = delta / 1024;
> +
> +       if (periods) {
> +               delta %= 1024;
> +               contrib = __accumulate_pelt_segments(periods, 1024, delta);
> +       }
> +
> +       return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
> +}
> +
>  /*
>   * We can represent the historical contribution to runnable average as the
>   * coefficients of a geometric series.  To do this we sub-divide our runnable
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1072502976df..7cb45f1d8062 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2961,6 +2961,8 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>         return READ_ONCE(rq->avg_dl.util_avg);
>  }
>
> +extern unsigned long faster_est_approx(u64 runtime);
> +
>  /**
>   * cpu_util_cfs() - Estimates the amount of CPU capacity used by CFS tasks.
>   * @cpu: the CPU to get the utilization for.
> @@ -2995,13 +2997,39 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>   */
>  static inline unsigned long cpu_util_cfs(int cpu)
>  {
> +       struct rq *rq = cpu_rq(cpu);
>         struct cfs_rq *cfs_rq;
>         unsigned long util;
>
> -       cfs_rq = &cpu_rq(cpu)->cfs;
> +       cfs_rq = &rq->cfs;
>         util = READ_ONCE(cfs_rq->avg.util_avg);
>
>         if (sched_feat(UTIL_EST)) {
> +               if (sched_feat(UTIL_EST_FASTER)) {
> +                       struct task_struct *curr;
> +
> +                       rcu_read_lock();
> +                       curr = rcu_dereference(rq->curr);
> +                       if (likely(curr->sched_class == &fair_sched_class)) {
> +                               unsigned long util_est_fast;
> +                               u64 delta;
> +
> +                               delta = curr->se.sum_exec_runtime -
> +                                       curr->se.prev_sum_exec_runtime_vol;
> +
> +                               delta >>= 10;
> +                               if (!delta)
> +                                       goto unlock;
> +
> +                               delta = cap_scale(delta, arch_scale_cpu_capacity(cpu));
> +                               delta = cap_scale(delta, arch_scale_freq_capacity(cpu));
> +
> +                               util_est_fast = faster_est_approx(delta * 2);
> +                               util = max(util, util_est_fast);
> +                       }
> +unlock:
> +                       rcu_read_unlock();
> +               }
>                 util = max_t(unsigned long, util,
>                              READ_ONCE(cfs_rq->avg.util_est.enqueued));
>         }

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-09 16:16                   ` Vincent Guittot
@ 2023-02-17 13:54                     ` Dietmar Eggemann
  2023-02-20 13:54                       ` Vincent Guittot
  2023-02-20 10:13                     ` Peter Zijlstra
  2023-02-23 15:37                     ` Qais Yousef
  2 siblings, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2023-02-17 13:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On 09/02/2023 17:16, Vincent Guittot wrote:
> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 09/11/2022 16:49, Peter Zijlstra wrote:
>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
>>>> On 11/07/22 14:41, Peter Zijlstra wrote:
>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:

[...]

>> (B) *** Where does util_est_faster help exactly? ***
>>
>> It turns out that the score improvement comes from the more aggressive
>> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
>> -> effective_cpu_util(..., cpu_util_cfs(), ...).
>>
>> At the beginning of an episode (e.g. beginning of an image list view
>> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
>> frequency') of the Android Graphics Pipeline (AGP) start to run, the
>> CPU Operating Performance Point (OPP) is often so low that those tasks
>> run more like 10/16ms which let the test application count a lot of
>> Jankframes at those moments.
> 
> I don't see how util_est_faster can help this 1ms task here ? It's
> most probably never be preempted during this 1ms. For such an Android

It's 1/16ms at max CPU frequency and on a big CPU. It could be a longer
runtime at min CPU frequency on a little CPU. I see runtimes of up to 10ms
at the beginning of a test episode.

Like I mentioned below, it could also be that the tasks have more work
to do at the beginning. It's easy to spot using Google's perfetto and
those moments also correlate with the occurrence of jankframes. I'm not
yet sure how much this has to do with the perfetto instrumentation though.

But you're right, on top of that, there is preemption (e.g. of the UI
thread) by other threads (render thread, involved binder threads,
surfaceflinger, etc.) going on. So the UI thread could be
running+runnable for > 20ms, again marked as a jankframe.

> Graphics Pipeline short task, hasn't uclamp_min been designed for and
> a better solution ?

Yes, it has. I'm not sure how feasible this is to do for all tasks
involved. I'm thinking about the Binder threads here for instance.

[...]

>> Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU
>> util when periodic tasks have a longer runtime compared to when they reach
>> steady-state.
>>
>> The results is very similar to PELT halflife reduction. The advantage is
>> that 'util_est_faster' is only activated selectively when the runtime of
>> the current task in its current activation is long enough to create this
>> CPU util boost.
> 
> IIUC how util_est_faster works, it removes the waiting time when
> sharing cpu time with other tasks. So as long as there is no (runnable
> but not running time), the result is the same as current util_est.
> util_est_faster makes a difference only when the task alternates
> between runnable and running slices.
> Have you considered using runnable_avg metrics in the increase of cpu
> freq ? This takes into the runnable slice and not only the running
> time and increase faster than util_avg when tasks compete for the same
> CPU

Good idea! No, I haven't.

I just glanced over the code; there shouldn't be an advantage in terms
of a more recent update between `curr->sum_exec_runtime` and
update_load_avg(cfs_rq), even in the taskgroup case.

Per-task view:

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/cpu_runnable_avg_boost.ipynb


All tests ran 10 iterations of all Jankbench sub-tests. (I reran
`max_util_scaled_util_est_faster_rbl_freq` once with very similar
results, just to make sure the numbers are sound.)

Max_frame_duration:
+------------------------------------------+------------+
|             kernel                       |    value   |
+------------------------------------------+------------+
|            base-a30b17f016b0             | 147.571352 |
|                pelt-hl-m2                | 119.416351 |
|                pelt-hl-m4                | 96.473412  |
|       scaled_util_est_faster_freq        | 126.646506 |
| max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!!
+------------------------------------------+------------+

Mean_frame_duration:
+------------------------------------------+-------+-----------+
|                  kernel                  | value | perc_diff |
+------------------------------------------+-------+-----------+
|            base-a30b17f016b0             | 14.7  |   0.0%    |
|                pelt-hl-m2                | 13.6  |   -7.5%   |
|                pelt-hl-m4                | 13.0  |  -11.68%  |
|       scaled_util_est_faster_freq        | 13.7  |  -6.81%   |
| max_util_scaled_util_est_faster_rbl_freq | 12.1  |  -17.85%  |
+------------------------------------------+-------+-----------+

Jank percentage (Jank deadline 16ms):
+------------------------------------------+-------+-----------+
|                  kernel                  | value | perc_diff |
+------------------------------------------+-------+-----------+
|            base-a30b17f016b0             |  1.8  |   0.0%    |
|                pelt-hl-m2                |  1.8  |  -4.91%   |
|                pelt-hl-m4                |  1.2  |  -36.61%  |
|       scaled_util_est_faster_freq        |  1.3  |  -27.63%  |
| max_util_scaled_util_est_faster_rbl_freq |  0.8  |  -54.86%  |
+------------------------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+------------------------------------------+-------+-----------+
|             kernel                       | value | perc_diff |
+------------------------------------------+-------+-----------+
|            base-a30b17f016b0             | 144.4 |   0.0%    |
|                pelt-hl-m2                | 141.6 |  -1.97%   |
|                pelt-hl-m4                | 163.2 |  12.99%   |
|       scaled_util_est_faster_freq        | 132.3 |  -8.41%   |
| max_util_scaled_util_est_faster_rbl_freq | 133.4 |  -7.67%   |
+------------------------------------------+-------+-----------+

There is a regression in `Max_frame_duration`, but `Mean_frame_duration`,
`Jank percentage` and `Power usage` are better.

So maybe DVFS boosting in preemption scenarios is really the thing here to
further improve the Android Graphics Pipeline.

I ran the same test (boosting only for DVFS requests) with:

-->8--

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dbc56e8b85f9..7a4bf38f2920 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
                             READ_ONCE(cfs_rq->avg.util_est.enqueued));
        }

+       util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
+
        return min(util, capacity_orig_of(cpu));
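
To make the util_avg vs. runnable_avg difference concrete, a stand-alone toy
model (user-space C, deliberately much simpler than the kernel's PELT code)
of one of two always-runnable tasks sharing a CPU in 1ms slices:

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);   /* decay per 1ms, 32ms halflife */
	double util = 0.0, runnable = 0.0; /* signals of one of the tasks */
	int ms;

	for (ms = 0; ms < 1000; ms++) {
		int running = !(ms % 2);   /* on the CPU every other 1ms */

		util = util * y + (running ? (1.0 - y) * 1024.0 : 0.0);
		runnable = runnable * y + (1.0 - y) * 1024.0; /* always runnable */
	}

	/* prints util ~500 and runnable ~1024: with two such tasks the
	 * cfs_rq runnable sum is ~2048 while the CPU util sum is only
	 * ~1024, which is why max(util, runnable_avg) raises the DVFS
	 * request under contention but not in the uncontended case.
	 */
	printf("util ~%.0f runnable ~%.0f\n", util, runnable);
	return 0;
}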

Thanks!

-- Dietmar







^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-09 16:16                   ` Vincent Guittot
  2023-02-17 13:54                     ` Dietmar Eggemann
@ 2023-02-20 10:13                     ` Peter Zijlstra
  2023-02-20 13:39                       ` Vincent Guittot
  2023-02-23 15:37                     ` Qais Yousef
  2 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2023-02-20 10:13 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Qais Yousef, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Thu, Feb 09, 2023 at 05:16:46PM +0100, Vincent Guittot wrote:

> > The results is very similar to PELT halflife reduction. The advantage is
> > that 'util_est_faster' is only activated selectively when the runtime of
> > the current task in its current activation is long enough to create this
> > CPU util boost.
> 
> IIUC how util_est_faster works, it removes the waiting time when
> sharing cpu time with other tasks. So as long as there is no (runnable
> but not running time), the result is the same as current util_est.

Uh.. it's double the speed, no? Even if there is no contention, the
fake/in-situ pelt sum runs at double time and thus will ramp up faster
than normal.

> util_est_faster makes a difference only when the task alternates
> between runnable and running slices.

UTIL_EST was supposed to help mitigate some of that, but yes. Also note
that _FASTER sorta sucks here because it starts from 0 every time; if it
were to start from the state saved by util_est_dequeue(), it would ramp
up faster still.

Patch has a comment along those lines I think.

> Have you considered using runnable_avg metrics in the increase of cpu
> freq ? This takes into the runnable slice and not only the running
> time and increase faster than util_avg when tasks compete for the same
> CPU

Interesting! Indeed, that's boosting the DVFS for contention. And as
deggeman's reply shows, it seems to work well.

I wonder if that one place where it regresses is exactly the case
without contention.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-20 10:13                     ` Peter Zijlstra
@ 2023-02-20 13:39                       ` Vincent Guittot
  0 siblings, 0 replies; 61+ messages in thread
From: Vincent Guittot @ 2023-02-20 13:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Qais Yousef, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Mon, 20 Feb 2023 at 11:13, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Feb 09, 2023 at 05:16:46PM +0100, Vincent Guittot wrote:
>
> > > The results is very similar to PELT halflife reduction. The advantage is
> > > that 'util_est_faster' is only activated selectively when the runtime of
> > > the current task in its current activation is long enough to create this
> > > CPU util boost.
> >
> > IIUC how util_est_faster works, it removes the waiting time when
> > sharing cpu time with other tasks. So as long as there is no (runnable
> > but not running time), the result is the same as current util_est.
>
> Uh.. it's double the speed, no? Even if there is no contention, the
> fake/in-situ pelt sum runs at double time and thus will ramp up faster
> than normal.

Ah yes. I hadn't noticed it was (delta * 2) and not delta.

>
> > util_est_faster makes a difference only when the task alternates
> > between runnable and running slices.
>
> UTIL_EST was supposed to help mitigate some of that, but yes. Also note
> that _FASTER sorta sucks here because it starts from 0 every time, if it
> were to start from the state saved by util_est_dequeue(), it would ramp
> up faster still.

Yes.

>
> Patch has a comment along those lines I think.
>
> > Have you considered using runnable_avg metrics in the increase of cpu
> > freq ? This takes into the runnable slice and not only the running
> > time and increase faster than util_avg when tasks compete for the same
> > CPU
>
> Interesting! Indeed, that's boosting the DVFS for contention. And as
> deggeman's reply shows, it seems to work well.
>
> I wonder if that one place where it regresses is exactly the case
> without contention.

Yes, that might be the case indeed. I would expect uclamp_min to help
ensure a min frequency in such a scenario.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-17 13:54                     ` Dietmar Eggemann
@ 2023-02-20 13:54                       ` Vincent Guittot
  2023-02-21  9:29                         ` Vincent Guittot
  2023-02-22 20:13                         ` Dietmar Eggemann
  0 siblings, 2 replies; 61+ messages in thread
From: Vincent Guittot @ 2023-02-20 13:54 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 09/02/2023 17:16, Vincent Guittot wrote:
> > On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 09/11/2022 16:49, Peter Zijlstra wrote:
> >>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> >>>> On 11/07/22 14:41, Peter Zijlstra wrote:
> >>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
> [...]
>
> >> (B) *** Where does util_est_faster help exactly? ***
> >>
> >> It turns out that the score improvement comes from the more aggressive
> >> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
> >> -> effective_cpu_util(..., cpu_util_cfs(), ...).
> >>
> >> At the beginning of an episode (e.g. beginning of an image list view
> >> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
> >> frequency') of the Android Graphics Pipeline (AGP) start to run, the
> >> CPU Operating Performance Point (OPP) is often so low that those tasks
> >> run more like 10/16ms which let the test application count a lot of
> >> Jankframes at those moments.
> >
> > I don't see how util_est_faster can help this 1ms task here ? It's
> > most probably never be preempted during this 1ms. For such an Android
>
> It's 1/16ms at max CPU frequency and on a big CPU. Could be a longer
> runtime with min CPU frequency at little CPU. I see runtime up to 10ms
> at the beginning of a test episode.
>
> Like I mentioned below, it could also be that the tasks have more work
> to do at the beginning. It's easy to spot using Google's perfetto and
> those moments also correlate with the occurrence of jankframes. I'm not
> yet sure how much this has to do with the perfetto instrumentation though.
>
> But you're right, on top of that, there is preemption (e.g. of the UI
> thread) by other threads (render thread, involved binder threads,
> surfaceflinger, etc.) going on. So the UI thread could be
> running+runnable for > 20ms, again marked as a jankframe.
>
> > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > a better solution ?
>
> Yes, it has. I'm not sure how feasible this is to do for all tasks
> involved. I'm thinking about the Binder threads here for instance.

Yes, that can probably not help for all threads, but some system
threads like surfaceflinger and the graphics composer should probably
benefit from a min uclamp.
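
For reference, a minimal user-space sketch of setting such a floor on one
task (needs CONFIG_UCLAMP_TASK; struct layout and flag values mirror the
sched_setattr(2) uapi headers, and the helper name here is made up):

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

struct sched_attr {		/* subset matching uapi linux/sched/types.h */
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime, sched_deadline, sched_period;
	uint32_t sched_util_min, sched_util_max;
};

#define SCHED_FLAG_KEEP_ALL		0x18	/* keep policy and params */
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20

/* ask for a utilization floor of util_min (0..1024) for task pid */
static int set_uclamp_min(pid_t pid, unsigned int util_min)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MIN;
	attr.sched_util_min = util_min;

	return syscall(SYS_sched_setattr, pid, &attr, 0);
}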

>
> [...]
>
> >> Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU
> >> util when periodic tasks have a longer runtime compared to when they reach
> >> steady-state.
> >>
> >> The results is very similar to PELT halflife reduction. The advantage is
> >> that 'util_est_faster' is only activated selectively when the runtime of
> >> the current task in its current activation is long enough to create this
> >> CPU util boost.
> >
> > IIUC how util_est_faster works, it removes the waiting time when
> > sharing cpu time with other tasks. So as long as there is no (runnable
> > but not running time), the result is the same as current util_est.
> > util_est_faster makes a difference only when the task alternates
> > between runnable and running slices.
> > Have you considered using runnable_avg metrics in the increase of cpu
> > freq ? This takes into the runnable slice and not only the running
> > time and increase faster than util_avg when tasks compete for the same
> > CPU
>
> Good idea! No, I haven't.
>
> I just glanced over the code, there shouldn't be an advantage in terms
> of more recent update between `curr->sum_exec_runtime` and
> update_load_avg(cfs_rq) even in the taskgroup case.
>
> Per-task view:
>
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/cpu_runnable_avg_boost.ipynb
>
>
> All tests ran 10 iterations of all Jankbench sub-tests. (Reran the
> `max_util_scaled_util_est_faster_rbl_freq` once with very similar
> results. Just to make sure the results are somehow correct).
>
> Max_frame_duration:
> +------------------------------------------+------------+
> |             kernel                       |    value   |
> +------------------------------------------+------------+
> |            base-a30b17f016b0             | 147.571352 |
> |                pelt-hl-m2                | 119.416351 |
> |                pelt-hl-m4                | 96.473412  |
> |       scaled_util_est_faster_freq        | 126.646506 |
> | max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!!
> +------------------------------------------+------------+
>
> Mean_frame_duration:
> +------------------------------------------+-------+-----------+
> |                  kernel                  | value | perc_diff |
> +------------------------------------------+-------+-----------+
> |            base-a30b17f016b0             | 14.7  |   0.0%    |
> |                pelt-hl-m2                | 13.6  |   -7.5%   |
> |                pelt-hl-m4                | 13.0  |  -11.68%  |
> |       scaled_util_est_faster_freq        | 13.7  |  -6.81%   |
> | max_util_scaled_util_est_faster_rbl_freq | 12.1  |  -17.85%  |
> +------------------------------------------+-------+-----------+
>
> Jank percentage (Jank deadline 16ms):
> +------------------------------------------+-------+-----------+
> |                  kernel                  | value | perc_diff |
> +------------------------------------------+-------+-----------+
> |            base-a30b17f016b0             |  1.8  |   0.0%    |
> |                pelt-hl-m2                |  1.8  |  -4.91%   |
> |                pelt-hl-m4                |  1.2  |  -36.61%  |
> |       scaled_util_est_faster_freq        |  1.3  |  -27.63%  |
> | max_util_scaled_util_est_faster_rbl_freq |  0.8  |  -54.86%  |
> +------------------------------------------+-------+-----------+
>
> Power usage [mW] (total - all CPUs):
> +------------------------------------------+-------+-----------+
> |             kernel                       | value | perc_diff |
> +------------------------------------------+-------+-----------+
> |            base-a30b17f016b0             | 144.4 |   0.0%    |
> |                pelt-hl-m2                | 141.6 |  -1.97%   |
> |                pelt-hl-m4                | 163.2 |  12.99%   |
> |       scaled_util_est_faster_freq        | 132.3 |  -8.41%   |
> | max_util_scaled_util_est_faster_rbl_freq | 133.4 |  -7.67%   |
> +------------------------------------------+-------+-----------+
>
> There is a regression in `Max_frame_duration` but `Mean_frame_duration`,
> `Jank percentage` and `Power usage` are better.

The max frame duration is interesting. Could it be the very 1st frame
of the test?
It's interesting that it's even worse than baseline, whereas it should
take the max of baseline and runnable_avg.

>
> So maybe DVFS boosting in preempt-scenarios is really the thing here to
> further improve the Android Graphics Pipeline.
>
> I ran the same test (boosting only for DVFS requests) with:
>
> -->8--
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index dbc56e8b85f9..7a4bf38f2920 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
>                              READ_ONCE(cfs_rq->avg.util_est.enqueued));
>         }
>
> +       util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
> +
>         return min(util, capacity_orig_of(cpu));
>
> Thanks!
>
> -- Dietmar
>
>
>
>
>
>


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-20 13:54                       ` Vincent Guittot
@ 2023-02-21  9:29                         ` Vincent Guittot
  2023-02-22 20:28                           ` Dietmar Eggemann
  2023-02-22 20:13                         ` Dietmar Eggemann
  1 sibling, 1 reply; 61+ messages in thread
From: Vincent Guittot @ 2023-02-21  9:29 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Mon, 20 Feb 2023 at 14:54, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >
> > On 09/02/2023 17:16, Vincent Guittot wrote:
> > > On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > >>
> > >> On 09/11/2022 16:49, Peter Zijlstra wrote:
> > >>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> > >>>> On 11/07/22 14:41, Peter Zijlstra wrote:
> > >>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> >
> > [...]
> >
> > >> (B) *** Where does util_est_faster help exactly? ***
> > >>
> > >> It turns out that the score improvement comes from the more aggressive
> > >> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
> > >> -> effective_cpu_util(..., cpu_util_cfs(), ...).
> > >>
> > >> At the beginning of an episode (e.g. beginning of an image list view
> > >> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
> > >> frequency') of the Android Graphics Pipeline (AGP) start to run, the
> > >> CPU Operating Performance Point (OPP) is often so low that those tasks
> > >> run more like 10/16ms which let the test application count a lot of
> > >> Jankframes at those moments.
> > >
> > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > most probably never be preempted during this 1ms. For such an Android
> >
> > It's 1/16ms at max CPU frequency and on a big CPU. Could be a longer
> > runtime with min CPU frequency at little CPU. I see runtime up to 10ms
> > at the beginning of a test episode.
> >
> > Like I mentioned below, it could also be that the tasks have more work
> > to do at the beginning. It's easy to spot using Google's perfetto and
> > those moments also correlate with the occurrence of jankframes. I'm not
> > yet sure how much this has to do with the perfetto instrumentation though.
> >
> > But you're right, on top of that, there is preemption (e.g. of the UI
> > thread) by other threads (render thread, involved binder threads,
> > surfaceflinger, etc.) going on. So the UI thread could be
> > running+runnable for > 20ms, again marked as a jankframe.
> >
> > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > a better solution ?
> >
> > Yes, it has. I'm not sure how feasible this is to do for all tasks
> > involved. I'm thinking about the Binder threads here for instance.
>
> Yes, that can probably not  help for all threads but some system
> threads like surfaceflinger and graphic composer should probably
> benefit from min uclamp
>
> >
> > [...]
> >
> > >> Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU
> > >> util when periodic tasks have a longer runtime compared to when they reach
> > >> steady-sate.
> > >>
> > >> The results is very similar to PELT halflife reduction. The advantage is
> > >> that 'util_est_faster' is only activated selectively when the runtime of
> > >> the current task in its current activation is long enough to create this
> > >> CPU util boost.
> > >
> > > IIUC how util_est_faster works, it removes the waiting time when
> > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > but not running time), the result is the same as current util_est.
> > > util_est_faster makes a difference only when the task alternates
> > > between runnable and running slices.
> > > Have you considered using runnable_avg metrics in the increase of cpu
> > > freq ? This takes into the runnable slice and not only the running
> > > time and increase faster than util_avg when tasks compete for the same
> > > CPU
> >
> > Good idea! No, I haven't.
> >
> > I just glanced over the code, there shouldn't be an advantage in terms
> > of more recent update between `curr->sum_exec_runtime` and
> > update_load_avg(cfs_rq) even in the taskgroup case.
> >
> > Per-task view:
> >
> > https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/cpu_runnable_avg_boost.ipynb
> >
> >
> > All tests ran 10 iterations of all Jankbench sub-tests. (Reran the
> > `max_util_scaled_util_est_faster_rbl_freq` once with very similar
> > results. Just to make sure the results are somehow correct).
> >
> > Max_frame_duration:
> > +------------------------------------------+------------+
> > |             kernel                       |    value   |
> > +------------------------------------------+------------+
> > |            base-a30b17f016b0             | 147.571352 |
> > |                pelt-hl-m2                | 119.416351 |
> > |                pelt-hl-m4                | 96.473412  |
> > |       scaled_util_est_faster_freq        | 126.646506 |
> > | max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!!
> > +------------------------------------------+------------+
> >
> > Mean_frame_duration:
> > +------------------------------------------+-------+-----------+
> > |                  kernel                  | value | perc_diff |
> > +------------------------------------------+-------+-----------+
> > |            base-a30b17f016b0             | 14.7  |   0.0%    |
> > |                pelt-hl-m2                | 13.6  |   -7.5%   |
> > |                pelt-hl-m4                | 13.0  |  -11.68%  |
> > |       scaled_util_est_faster_freq        | 13.7  |  -6.81%   |
> > | max_util_scaled_util_est_faster_rbl_freq | 12.1  |  -17.85%  |
> > +------------------------------------------+-------+-----------+
> >
> > Jank percentage (Jank deadline 16ms):
> > +------------------------------------------+-------+-----------+
> > |                  kernel                  | value | perc_diff |
> > +------------------------------------------+-------+-----------+
> > |            base-a30b17f016b0             |  1.8  |   0.0%    |
> > |                pelt-hl-m2                |  1.8  |  -4.91%   |
> > |                pelt-hl-m4                |  1.2  |  -36.61%  |
> > |       scaled_util_est_faster_freq        |  1.3  |  -27.63%  |
> > | max_util_scaled_util_est_faster_rbl_freq |  0.8  |  -54.86%  |
> > +------------------------------------------+-------+-----------+
> >
> > Power usage [mW] (total - all CPUs):
> > +------------------------------------------+-------+-----------+
> > |             kernel                       | value | perc_diff |
> > +------------------------------------------+-------+-----------+
> > |            base-a30b17f016b0             | 144.4 |   0.0%    |
> > |                pelt-hl-m2                | 141.6 |  -1.97%   |
> > |                pelt-hl-m4                | 163.2 |  12.99%   |
> > |       scaled_util_est_faster_freq        | 132.3 |  -8.41%   |
> > | max_util_scaled_util_est_faster_rbl_freq | 133.4 |  -7.67%   |
> > +------------------------------------------+-------+-----------+
> >
> > There is a regression in `Max_frame_duration` but `Mean_frame_duration`,
> > `Jank percentage` and `Power usage` are better.
>
> The max frame duration is interesting. Could it be the very 1st frame
> of the test ?
> It's interesting that it's even worse than baseline whereas it should
> take the max of baseline and runnable_avg
>
> >
> > So maybe DVFS boosting in preempt-scenarios is really the thing here to
> > further improve the Android Graphics Pipeline.
> >
> > I ran the same test (boosting only for DVFS requests) with:
> >
> > -->8--
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index dbc56e8b85f9..7a4bf38f2920 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
> >                              READ_ONCE(cfs_rq->avg.util_est.enqueued));
> >         }
> >
> > +       util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
> > +

Another reason why it gives better results could be that
cpu_util_cfs() is not only used for DVFS selection but also to track
the cpu utilization in load balance and EAS, so the cpu will be seen
as overloaded sooner and tasks will be spread around when there is
contention.

Could you try to take cfs_rq->avg.runnable_avg into account only when
selecting frequency?

That being said, I can see some places in load balance where
cfs_rq->avg.runnable_avg could give some benefit, like in
find_busiest_queue(), where it could be better to take contention
into account when selecting the busiest queue.

> >         return min(util, capacity_orig_of(cpu));
> >
> > Thanks!
> >
> > -- Dietmar
> >
> >
> >
> >
> >
> >


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-20 13:54                       ` Vincent Guittot
  2023-02-21  9:29                         ` Vincent Guittot
@ 2023-02-22 20:13                         ` Dietmar Eggemann
  2023-03-02 19:36                           ` Dietmar Eggemann
  1 sibling, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2023-02-22 20:13 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On 20/02/2023 14:54, Vincent Guittot wrote:
> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 09/02/2023 17:16, Vincent Guittot wrote:
>>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 09/11/2022 16:49, Peter Zijlstra wrote:
>>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
>>>>>> On 11/07/22 14:41, Peter Zijlstra wrote:
>>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:

[...]

>>> Graphics Pipeline short task, hasn't uclamp_min been designed for and
>>> a better solution ?
>>
>> Yes, it has. I'm not sure how feasible this is to do for all tasks
>> involved. I'm thinking about the Binder threads here for instance.
> 
> Yes, that can probably not help for all threads but some system
> threads like surfaceflinger and graphic composer should probably
> benefit from min uclamp

Yes, and it looks like the Android version I'm using,
SQ1D.220205.004 (Feb '22) (automatic system updates turned off), is
already using uclamp_min != 0 for tasks like the UI thread. It's not one
particular value but different values from [0 .. 512] over the runtime
of a Jankbench iteration. I have to take a closer look.

[...]

>> Max_frame_duration:
>> +------------------------------------------+------------+
>> |             kernel                       |    value   |
>> +------------------------------------------+------------+
>> |            base-a30b17f016b0             | 147.571352 |
>> |                pelt-hl-m2                | 119.416351 |
>> |                pelt-hl-m4                | 96.473412  |
>> |       scaled_util_est_faster_freq        | 126.646506 |
>> | max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!!
>> +------------------------------------------+------------+
>>
>> Mean_frame_duration:
>> +------------------------------------------+-------+-----------+
>> |                  kernel                  | value | perc_diff |
>> +------------------------------------------+-------+-----------+
>> |            base-a30b17f016b0             | 14.7  |   0.0%    |
>> |                pelt-hl-m2                | 13.6  |   -7.5%   |
>> |                pelt-hl-m4                | 13.0  |  -11.68%  |
>> |       scaled_util_est_faster_freq        | 13.7  |  -6.81%   |
>> | max_util_scaled_util_est_faster_rbl_freq | 12.1  |  -17.85%  |
>> +------------------------------------------+-------+-----------+
>>
>> Jank percentage (Jank deadline 16ms):
>> +------------------------------------------+-------+-----------+
>> |                  kernel                  | value | perc_diff |
>> +------------------------------------------+-------+-----------+
>> |            base-a30b17f016b0             |  1.8  |   0.0%    |
>> |                pelt-hl-m2                |  1.8  |  -4.91%   |
>> |                pelt-hl-m4                |  1.2  |  -36.61%  |
>> |       scaled_util_est_faster_freq        |  1.3  |  -27.63%  |
>> | max_util_scaled_util_est_faster_rbl_freq |  0.8  |  -54.86%  |
>> +------------------------------------------+-------+-----------+
>>
>> Power usage [mW] (total - all CPUs):
>> +------------------------------------------+-------+-----------+
>> |             kernel                       | value | perc_diff |
>> +------------------------------------------+-------+-----------+
>> |            base-a30b17f016b0             | 144.4 |   0.0%    |
>> |                pelt-hl-m2                | 141.6 |  -1.97%   |
>> |                pelt-hl-m4                | 163.2 |  12.99%   |
>> |       scaled_util_est_faster_freq        | 132.3 |  -8.41%   |
>> | max_util_scaled_util_est_faster_rbl_freq | 133.4 |  -7.67%   |
>> +------------------------------------------+-------+-----------+
>>
>> There is a regression in `Max_frame_duration` but `Mean_frame_duration`,
>> `Jank percentage` and `Power usage` are better.
> 
> The max frame duration is interesting. Could it be the very 1st frame
> of the test ?
> It's interesting that it's even worse than baseline whereas it should
> take the max of baseline and runnable_avg

Since you asked in the following email: I just used the boosting for CPU
frequency selection (from sugov_get_util()). I added the `_freq`
suffix in the kernel name to indicate this.

I don't have any helpful `ftrace` or `perfetto` data for these test runs
though.

That's why I ran another iteration with perfetto on
`max_util_scaled_util_est_faster_rbl_freq`.

`Max frame duration` = 121ms (< 158ms but this was over 10 iterations)
happened at the beginning of the 3/8 `List View Fling` episode.

The UI thread (com.android.benchmark) runs on CPU1. Just before the
start of this episode the CPU freq is 0.3GHz. It takes 43ms for the CPU
freq to go up to 1.1GHz.

  oriole:/sys # cat devices/system/cpu/cpu1/cpu_capacity

  124

  oriole:/sys # cat devices/system/cpu/cpu1/cpufreq/scaling_available_frequencies

  300000 574000 738000 930000 1098000 1197000 1328000 1401000 1598000
  1704000 1803000

So the combination of little CPU and low CPU frequency is the reason
why. But I can't see how using `max(max(util_avg, util_est.enq),
rbl_avg)` can make `max frame duration` worse?
I don't understand how asking for higher CPU frequencies under contention
would favor the UI thread being scheduled on little CPUs at the beginning
of an episode.

Also the particular uclamp_min settings of the runnable tasks at this
moment can have an influence on this `max frame duration` value.

[...]


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-21  9:29                         ` Vincent Guittot
@ 2023-02-22 20:28                           ` Dietmar Eggemann
  2023-03-01 10:24                             ` Vincent Guittot
  0 siblings, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2023-02-22 20:28 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On 21/02/2023 10:29, Vincent Guittot wrote:
> On Mon, 20 Feb 2023 at 14:54, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
>>
>> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>
>>> On 09/02/2023 17:16, Vincent Guittot wrote:
>>>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>>
>>>>> On 09/11/2022 16:49, Peter Zijlstra wrote:
>>>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
>>>>>>> On 11/07/22 14:41, Peter Zijlstra wrote:
>>>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:

[...]

>>> I ran the same test (boosting only for DVFS requests) with:
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ *
>>>
>>> -->8--
>>>
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index dbc56e8b85f9..7a4bf38f2920 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
>>>                              READ_ONCE(cfs_rq->avg.util_est.enqueued));
>>>         }
>>>
>>> +       util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
>>> +
> 
> Another reason why it gives better results could be that
> cpu_util_cfs() is not only used for DVFS selection but also to track
> the cpu utilization in load balance and EAS so the cpu will be faster
> seen as overloaded and tasks will be spread around when there are
> contentions.
> 
> Could you try to take cfs_rq->avg.runnable_avg into account only when
> selecting frequency ?

I actually did exactly this (* but not shown in the code snippet).
I just used the boosting for CPU frequency selection (from
sugov_get_util()). I added the `_freq` suffix in the kernel name to
indicate this.
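
In other words, roughly something like the below (a sketch only: the
cpu_util_cfs_boost() name and the exact hook point are illustrative, not the
exact patch used for these runs). Task placement and load balance keep seeing
the unboosted cpu_util_cfs():

  /* boosted CPU utilization, only used for schedutil's frequency requests */
  static inline unsigned long cpu_util_cfs_boost(int cpu)
  {
          struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
          unsigned long util = cpu_util_cfs(cpu);

          /* max-compare with contention (runnable_avg) for DVFS only */
          util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));

          return min(util, capacity_orig_of(cpu));
  }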

> That being said I can see some place in load balance where
> cfs_rq->avg.runnable_avg could give some benefits like in
> find_busiest_queue() where it could be better to take into account the
> contention when selecting the busiest queue

Could be. Looks like so far we only use it in group_has_capacity(),
group_is_overloaded() and for NUMA.

[...]


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-09 16:16                   ` Vincent Guittot
  2023-02-17 13:54                     ` Dietmar Eggemann
  2023-02-20 10:13                     ` Peter Zijlstra
@ 2023-02-23 15:37                     ` Qais Yousef
  2023-03-01 10:39                       ` Vincent Guittot
  2 siblings, 1 reply; 61+ messages in thread
From: Qais Yousef @ 2023-02-23 15:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski,
	Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen

On 02/09/23 17:16, Vincent Guittot wrote:

> I don't see how util_est_faster can help this 1ms task here ? It's
> most probably never be preempted during this 1ms. For such an Android
> Graphics Pipeline short task, hasn't uclamp_min been designed for and
> a better solution ?

uclamp_min is being used in UI and helping there. But your mileage might vary
with adoption still.

The major motivation behind this is to help things like gaming, as the original
thread started out with. It can help UI and other use cases too. The Android
framework has a lot of context on the type of workload, which can help it make
a decision on when this helps. And OEMs get the chance to tune and apply it
based on the characteristics of their device.

> IIUC how util_est_faster works, it removes the waiting time when
> sharing cpu time with other tasks. So as long as there is no (runnable
> but not running time), the result is the same as current util_est.
> util_est_faster makes a difference only when the task alternates
> between runnable and running slices.
> Have you considered using runnable_avg metrics in the increase of cpu
> freq ? This takes into the runnable slice and not only the running
> time and increase faster than util_avg when tasks compete for the same
> CPU

Just to understand why we're heading in this direction now.

AFAIU the desired outcome is to have faster rampup time (and, on HMP, faster up
migration), both of which are tied to the utilization signal.

Wouldn't making the util response time faster help not just for rampup, but
rampdown too?

If we improve util response time, couldn't this mean we can remove util_est or
am I missing something?

Currently we have util response which is tweaked by util_est and then that is
tweaked further by schedutil with that 25% margin when mapping util to
frequency.

I think if we can allow improving general util response time by tweaking PELT
HALFLIFE we can potentially remove util_est and potentially that magic 25%
margin too.

Why the approach of further tweaking util_est is better?

Recently phoronix reported that schedutil behavior is suboptimal and I wonder
if the response time is contributing to that

	https://www.phoronix.com/review/schedutil-quirky-2023


Cheers

--
Qais Yousef


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-22 20:28                           ` Dietmar Eggemann
@ 2023-03-01 10:24                             ` Vincent Guittot
  0 siblings, 0 replies; 61+ messages in thread
From: Vincent Guittot @ 2023-03-01 10:24 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Wed, 22 Feb 2023 at 21:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 21/02/2023 10:29, Vincent Guittot wrote:
> > On Mon, 20 Feb 2023 at 14:54, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> >>
> >> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>>
> >>> On 09/02/2023 17:16, Vincent Guittot wrote:
> >>>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>>>>
> >>>>> On 09/11/2022 16:49, Peter Zijlstra wrote:
> >>>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> >>>>>>> On 11/07/22 14:41, Peter Zijlstra wrote:
> >>>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
> [...]
>
> >>> I ran the same test (boosting only for DVFS requests) with:
>                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ *
> >>>
> >>> -->8--
> >>>
> >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >>> index dbc56e8b85f9..7a4bf38f2920 100644
> >>> --- a/kernel/sched/sched.h
> >>> +++ b/kernel/sched/sched.h
> >>> @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
> >>>                              READ_ONCE(cfs_rq->avg.util_est.enqueued));
> >>>         }
> >>>
> >>> +       util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
> >>> +
> >
> > Another reason why it gives better results could be that
> > cpu_util_cfs() is not only used for DVFS selection but also to track
> > the cpu utilization in load balance and EAS so the cpu will be faster
> > seen as overloaded and tasks will be spread around when there are
> > contentions.
> >
> > Could you try to take cfs_rq->avg.runnable_avg into account only when
> > selecting frequency ?
>
> I actually did exactly this. (* but not shown in the code snippet).
> I just used the boosting for CPU frequency selection (from
> sugov_get_util()). I added the the `_freq` suffix in the kernel name to
> indicate this.

Ok. So the improvement that you are seeing is really related to
better freq selection.

>
> > That being said I can see some place in load balance where
> > cfs_rq->avg.runnable_avg could give some benefits like in
> > find_busiest_queue() where it could be better to take into account the
> > contention when selecting the busiest queue
>
> Could be. Looks like so far we only use it in group_has_capacity(),
> group_is_overloaded() and for NUMA.

I think it could be interesting to use runnable_avg in
find_busiest_queue() for the migrate_util case to select the rq with the
highest contention, as an example.
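
A rough illustration (not a tested patch; the helper name is made up): the
migrate_util case of find_busiest_queue() would then compare something like
the value below instead of plain cpu_util_cfs(i):

  /* illustrative helper: CPU utilization including contention */
  static inline unsigned long cpu_util_and_contention(int cpu)
  {
          struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;

          /* runnable_avg >= util_avg as soon as tasks wait on the runqueue */
          return max(cpu_util_cfs(cpu),
                     READ_ONCE(cfs_rq->avg.runnable_avg));
  }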

>
> [...]


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-23 15:37                     ` Qais Yousef
@ 2023-03-01 10:39                       ` Vincent Guittot
  2023-03-01 17:24                         ` Qais Yousef
  0 siblings, 1 reply; 61+ messages in thread
From: Vincent Guittot @ 2023-03-01 10:39 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski,
	Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen

On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 02/09/23 17:16, Vincent Guittot wrote:
>
> > I don't see how util_est_faster can help this 1ms task here ? It's
> > most probably never be preempted during this 1ms. For such an Android
> > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > a better solution ?
>
> uclamp_min is being used in UI and helping there. But your mileage might vary
> with adoption still.
>
> The major motivation behind this is to help things like gaming as the original
> thread started. It can help UI and other use cases too. Android framework has
> a lot of context on the type of workload that can help it make a decision when
> this helps. And OEMs can have the chance to tune and apply based on the
> characteristics of their device.
>
> > IIUC how util_est_faster works, it removes the waiting time when
> > sharing cpu time with other tasks. So as long as there is no (runnable
> > but not running time), the result is the same as current util_est.
> > util_est_faster makes a difference only when the task alternates
> > between runnable and running slices.
> > Have you considered using runnable_avg metrics in the increase of cpu
> > freq ? This takes into the runnable slice and not only the running
> > time and increase faster than util_avg when tasks compete for the same
> > CPU
>
> Just to understand why we're heading into this direction now.
>
> AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> migration) which both are tied to utilization signal.
>
> Wouldn't make the util response time faster help not just for rampup, but
> rampdown too?
>
> If we improve util response time, couldn't this mean we can remove util_est or
> am I missing something?

not sure because you still have a ramping step whereas util_est
directly gives you the final tager

>
> Currently we have util response which is tweaked by util_est and then that is
> tweaked further by schedutil with that 25% margin when maping util to
> frequency.

the 25% is not related to the ramping time but to the fact that you
always need some margin to cover unexpected events and estimation
error

>
> I think if we can allow improving general util response time by tweaking PELT
> HALFLIFE we can potentially remove util_est and potentially that magic 25%
> margin too.
>
> Why the approach of further tweaking util_est is better?

note that in this case it doesn't really tweak util_est but Dietmar
has taken into account runnable_avg to increase the freq in case of
contention

Also IIUC Dietmar's results, the problem seems more linked to the
selection of a higher freq than to increasing the utilization;
the runnable_avg tests give similar perf results to a shorter half life
and better power consumption.

>
> Recently phoronix reported that schedutil behavior is suboptimal and I wonder
> if the response time is contributing to that
>
>         https://www.phoronix.com/review/schedutil-quirky-2023
>
>
> Cheers
>
> --
> Qais Yousef


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-03-01 10:39                       ` Vincent Guittot
@ 2023-03-01 17:24                         ` Qais Yousef
  2023-03-02  8:00                           ` Vincent Guittot
  2023-03-23 16:29                           ` Dietmar Eggemann
  0 siblings, 2 replies; 61+ messages in thread
From: Qais Yousef @ 2023-03-01 17:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski,
	Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen

On 03/01/23 11:39, Vincent Guittot wrote:
> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 02/09/23 17:16, Vincent Guittot wrote:
> >
> > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > most probably never be preempted during this 1ms. For such an Android
> > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > a better solution ?
> >
> > uclamp_min is being used in UI and helping there. But your mileage might vary
> > with adoption still.
> >
> > The major motivation behind this is to help things like gaming as the original
> > thread started. It can help UI and other use cases too. Android framework has
> > a lot of context on the type of workload that can help it make a decision when
> > this helps. And OEMs can have the chance to tune and apply based on the
> > characteristics of their device.
> >
> > > IIUC how util_est_faster works, it removes the waiting time when
> > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > but not running time), the result is the same as current util_est.
> > > util_est_faster makes a difference only when the task alternates
> > > between runnable and running slices.
> > > Have you considered using runnable_avg metrics in the increase of cpu
> > > freq ? This takes into the runnable slice and not only the running
> > > time and increase faster than util_avg when tasks compete for the same
> > > CPU
> >
> > Just to understand why we're heading into this direction now.
> >
> > AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> > migration) which both are tied to utilization signal.
> >
> > Wouldn't make the util response time faster help not just for rampup, but
> > rampdown too?
> >
> > If we improve util response time, couldn't this mean we can remove util_est or
> > am I missing something?
> 
> not sure because you still have a ramping step whereas util_est
> directly gives you the final tager

I didn't get you. tager?

> 
> >
> > Currently we have util response which is tweaked by util_est and then that is
> > tweaked further by schedutil with that 25% margin when maping util to
> > frequency.
> 
> the 25% is not related to the ramping time but to the fact that you
> always need some margin to cover unexpected events and estimation
> error

At the moment we have

	util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection

I think we have too many transformations before deciding the current
frequency, which makes it hard to tweak the system response.
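
For reference, the last two mapping steps boil down to roughly this in current
kernels (simplified from the mainline helpers used by schedutil's
get_next_freq(); the result is then run through the rate-limit filter):

  /* ~25% headroom on top of the estimated utilization */
  static inline unsigned long map_util_perf(unsigned long util)
  {
          return util + (util >> 2);
  }

  /* linear mapping from utilization to a raw frequency */
  static inline unsigned long map_util_freq(unsigned long util,
                                            unsigned long freq,
                                            unsigned long cap)
  {
          return freq * util / cap;
  }

  /* i.e. next_freq ~= 1.25 * max_freq * util / max, then rate limited */

So the 25% margin and the rate-limit filter are two separate knobs sitting
after util_avg/util_est.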

> 
> >
> > I think if we can allow improving general util response time by tweaking PELT
> > HALFLIFE we can potentially remove util_est and potentially that magic 25%
> > margin too.
> >
> > Why the approach of further tweaking util_est is better?
> 
> note that in this case it doesn't really tweak util_est but Dietmar
> has taken into account runnable_avg to increase the freq in case of
> contention
> 
> Also IIUC Dietmar's results, the problem seems more linked to the
> selection of a higher freq than increasing the utilization;
> runnable_avg tests give similar perf results than shorter half life
> and better power consumption.

Does it ramp down faster too?


Thanks

--
Qais Yousef

> 
> >
> > Recently phoronix reported that schedutil behavior is suboptimal and I wonder
> > if the response time is contributing to that
> >
> >         https://www.phoronix.com/review/schedutil-quirky-2023
> >
> >
> > Cheers
> >
> > --
> > Qais Yousef


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-03-01 17:24                         ` Qais Yousef
@ 2023-03-02  8:00                           ` Vincent Guittot
  2023-03-02 19:39                             ` Dietmar Eggemann
  2023-03-06 19:11                             ` Qais Yousef
  2023-03-23 16:29                           ` Dietmar Eggemann
  1 sibling, 2 replies; 61+ messages in thread
From: Vincent Guittot @ 2023-03-02  8:00 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski,
	Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen

On Wed, 1 Mar 2023 at 18:25, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 03/01/23 11:39, Vincent Guittot wrote:
> > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 02/09/23 17:16, Vincent Guittot wrote:
> > >
> > > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > > most probably never be preempted during this 1ms. For such an Android
> > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > > a better solution ?
> > >
> > > uclamp_min is being used in UI and helping there. But your mileage might vary
> > > with adoption still.
> > >
> > > The major motivation behind this is to help things like gaming as the original
> > > thread started. It can help UI and other use cases too. Android framework has
> > > a lot of context on the type of workload that can help it make a decision when
> > > this helps. And OEMs can have the chance to tune and apply based on the
> > > characteristics of their device.
> > >
> > > > IIUC how util_est_faster works, it removes the waiting time when
> > > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > > but not running time), the result is the same as current util_est.
> > > > util_est_faster makes a difference only when the task alternates
> > > > between runnable and running slices.
> > > > Have you considered using runnable_avg metrics in the increase of cpu
> > > > freq ? This takes into the runnable slice and not only the running
> > > > time and increase faster than util_avg when tasks compete for the same
> > > > CPU
> > >
> > > Just to understand why we're heading into this direction now.
> > >
> > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> > > migration) which both are tied to utilization signal.
> > >
> > > Wouldn't make the util response time faster help not just for rampup, but
> > > rampdown too?
> > >
> > > If we improve util response time, couldn't this mean we can remove util_est or
> > > am I missing something?
> >
> > not sure because you still have a ramping step whereas util_est
> > directly gives you the final tager
>
> I didn't get you. tager?

target

>
> >
> > >
> > > Currently we have util response which is tweaked by util_est and then that is
> > > tweaked further by schedutil with that 25% margin when maping util to
> > > frequency.
> >
> > the 25% is not related to the ramping time but to the fact that you
> > always need some margin to cover unexpected events and estimation
> > error
>
> At the moment we have
>
>         util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
>
> I think we have too many transformations before deciding the current
> frequencies. Which makes it hard to tweak the system response.

What is proposed here with runnable_avg is more to take a new input
when selecting a frequency: the level of contention on the cpu. But
this is not used to modify the utilization seen by the scheduler.

>
> >
> > >
> > > I think if we can allow improving general util response time by tweaking PELT
> > > HALFLIFE we can potentially remove util_est and potentially that magic 25%
> > > margin too.
> > >
> > > Why the approach of further tweaking util_est is better?
> >
> > note that in this case it doesn't really tweak util_est but Dietmar
> > has taken into account runnable_avg to increase the freq in case of
> > contention
> >
> > Also IIUC Dietmar's results, the problem seems more linked to the
> > selection of a higher freq than increasing the utilization;
> > runnable_avg tests give similar perf results than shorter half life
> > and better power consumption.
>
> Does it ramp down faster too?

I don't think so.

To be honest, I'm not convinced that modifying the half life is the
right way to solve this. If it was only a matter of the half life not
being suitable for a system, the half life would be set once at boot
and people would not ask to modify it at run time.

>
>
> Thanks
>
> --
> Qais Yousef
>
> >
> > >
> > > Recently phoronix reported that schedutil behavior is suboptimal and I wonder
> > > if the response time is contributing to that
> > >
> > >         https://www.phoronix.com/review/schedutil-quirky-2023
> > >
> > >
> > > Cheers
> > >
> > > --
> > > Qais Yousef


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-02-22 20:13                         ` Dietmar Eggemann
@ 2023-03-02 19:36                           ` Dietmar Eggemann
  0 siblings, 0 replies; 61+ messages in thread
From: Dietmar Eggemann @ 2023-03-02 19:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On 22/02/2023 21:13, Dietmar Eggemann wrote:
> On 20/02/2023 14:54, Vincent Guittot wrote:
>> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>
>>> On 09/02/2023 17:16, Vincent Guittot wrote:
>>>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>>
>>>>> On 09/11/2022 16:49, Peter Zijlstra wrote:
>>>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
>>>>>>> On 11/07/22 14:41, Peter Zijlstra wrote:
>>>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> 
> [...]
> 
>>>> Graphics Pipeline short task, hasn't uclamp_min been designed for and
>>>> a better solution ?
>>>
>>> Yes, it has. I'm not sure how feasible this is to do for all tasks
>>> involved. I'm thinking about the Binder threads here for instance.
>>
>> Yes, that can probably not help for all threads but some system
>> threads like surfaceflinger and graphic composer should probably
>> benefit from min uclamp
> 
> Yes, and it looks like that the Android version I'm using
> SQ1D.220205.004 (Feb '22) (automatic system updates turned off) is
> already using uclamp_min != 0 for tasks like UI thread. It's not one
> particular value but different values  from [0 .. 512] over the runtime
> of a Jankbench iteration. I have to have a closer look.

I did more Jankbench and Speedometer testing especially to understand
the influence of the already used uclamp_min boosting (Android Dynamic
Performance Framework (ADPF) `CPU performance hints` feature:
https://developer.android.com/games/optimize/adpf#cpu-hints) for some
App tasks.

The following notebooks show which of the App tasks are uclamp_min
boosted (their diagram title carries an additional 'uclamp_min_boost'
tag) and how the uclamp_min boost relates to the other boost values.
This is probably not a fixed mapping and could change between test runs.
I assume that Android will issue performance hints in the form of uclamp_min
boosting when it detects certain scenarios like a specific jankframe
threshold or something similar.

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/jankbench_uclamp_min_boost.ipynb

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/speedometer_uclamp_min_boost.ipynb

`base` has changed compared to `base-a30b17f016b0`. It now also
contains: e5ed0550c04c - sched/fair: unlink misfit task from cpu
overutilized (2023-02-11 Vincent Guittot)

Former `max_util_scaled_util_est_faster_rbl_freq` has been renamed to
`cpu_rbl_freq`.

Jankbench:

Max_frame_duration:
+-----------------------------+------------+
|             kernel          |    value   |
+-----------------------------+------------+
|            base             | 156.299159 |
|       base_wo_uclamp        | 171.063764 | uclamp disabled*
|         pelt-hl-m2          | 126.190232 |
|         pelt-hl-m4          | 100.865171 |
| scaled_util_est_faster_freq | 126.074194 |
|        cpu_rbl_freq         | 153.123089 |
+-----------------------------+------------+

* We still let Android set the uclamp_min values.
Just the uclamp setters are bypassed now.

Mean_frame_duration:
+-----------------------------+-------+-----------+
|           kernel            | value | perc_diff |
+-----------------------------+-------+-----------+
|            base             | 15.5  |   0.0%    |
|       base_wo_uclamp        | 16.6  |   7.76%   |
|         pelt-hl-m2          | 14.9  |  -3.27%   |
|         pelt-hl-m4          | 13.6  |  -12.16%  |
| scaled_util_est_faster_freq | 14.7  |  -4.88%   |
|        cpu_rbl_freq         | 12.2  |  -20.84%  |
+-----------------------------+-------+-----------+

Jank percentage (Jank deadline 16ms):
+-----------------------------+-------+-----------+
|           kernel            | value | perc_diff |
+-----------------------------+-------+-----------+
|            base             |  2.6  |   0.0%    |
|       base_wo_uclamp        |  3.0  |  17.47%   |
|         pelt-hl-m2          |  2.0  |  -23.33%  |
|         pelt-hl-m4          |  1.3  |  -48.55%  |
| scaled_util_est_faster_freq |  1.7  |  -32.21%  |
|        cpu_rbl_freq         |  0.7  |  -71.36%  |
+-----------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+-----------------------------+-------+-----------+
|           kernel            | value | perc_diff |
+-----------------------------+-------+-----------+
|            base             | 141.1 |   0.0%    |
|       base_wo_uclamp        | 116.6 |  -17.4%   |
|         pelt-hl-m2          | 138.7 |   -1.7%   |
|         pelt-hl-m4          | 156.5 |  10.87%   |
| scaled_util_est_faster_freq | 147.6 |   4.57%   |
|        cpu_rbl_freq         | 135.0 |  -4.33%   |
+-----------------------------+-------+-----------+

Speedometer:

Score:
+-----------------------------+-------+-----------+
|           kernel            | value | perc_diff |
+-----------------------------+-------+-----------+
|            base             | 108.4 |   0.0%    |
|       base_wo_uclamp        | 95.2  |  -12.17%  |
|         pelt-hl-m2          | 112.9 |   4.13%   |
| scaled_util_est_faster_freq | 114.7 |   5.77%   |
|        cpu_rbl_freq         | 127.7 |  17.75%   |
+-----------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+-----------------------------+--------+-----------+
|           kernel            | value  | perc_diff |
+-----------------------------+--------+-----------+
|            base             | 2268.4 |   0.0%    |
|       base_wo_uclamp        | 1789.5 |  -21.11%  |
|         pelt-hl-m2          | 2386.5 |   5.21%   |
| scaled_util_est_faster_freq | 2292.3 |   1.05%   |
|        cpu_rbl_freq         | 2198.3 |  -3.09%   |
+-----------------------------+--------+-----------+

The explanation I have is that the `CPU performance hints` feature
tries to recreate the information about contention for a specific set of
tasks. Since there is also contention in which only non-uclamp_min-boosted
tasks are runnable, mechanisms like `util_est_faster` or
`cpu_runnable boosting` can help on top of what's already provided by
uclamp_min boosting from userspace.

[...]



* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-03-02  8:00                           ` Vincent Guittot
@ 2023-03-02 19:39                             ` Dietmar Eggemann
  2023-03-06 19:11                             ` Qais Yousef
  1 sibling, 0 replies; 61+ messages in thread
From: Dietmar Eggemann @ 2023-03-02 19:39 UTC (permalink / raw)
  To: Vincent Guittot, Qais Yousef
  Cc: Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On 02/03/2023 09:00, Vincent Guittot wrote:
> On Wed, 1 Mar 2023 at 18:25, Qais Yousef <qyousef@layalina.io> wrote:
>>
>> On 03/01/23 11:39, Vincent Guittot wrote:
>>> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
>>>>
>>>> On 02/09/23 17:16, Vincent Guittot wrote:

[...]

>>>> Just to understand why we're heading into this direction now.
>>>>
>>>> AFAIU the desired outcome to have faster rampup time (and on HMP faster up
>>>> migration) which both are tied to utilization signal.
>>>>
>>>> Wouldn't make the util response time faster help not just for rampup, but
>>>> rampdown too?
>>>>
>>>> If we improve util response time, couldn't this mean we can remove util_est or
>>>> am I missing something?
>>>
>>> not sure because you still have a ramping step whereas util_est
>>> directly gives you the final tager
>>
>> I didn't get you. tager?
> 
> target

uclamp_min boosting (ADPF's `CPU performance hints` feature) could
eclipse util_est, but only if it's higher and only for those tasks
affected by the feature.

[...]

>>>> I think if we can allow improving general util response time by tweaking PELT
>>>> HALFLIFE we can potentially remove util_est and potentially that magic 25%
>>>> margin too.
>>>>
>>>> Why the approach of further tweaking util_est is better?
>>>
>>> note that in this case it doesn't really tweak util_est but Dietmar
>>> has taken into account runnable_avg to increase the freq in case of
>>> contention
>>>
>>> Also IIUC Dietmar's results, the problem seems more linked to the
>>> selection of a higher freq than increasing the utilization;
>>> runnable_avg tests give similar perf results than shorter half life
>>> and better power consumption.
>>
>> Does it ramp down faster too?
> 
> I don't think so.
> 
> To be honest, I'm not convinced that modifying the half time is the
> right way to solve this. If it was only a matter of half life not
> being suitable for a system, the halk life would be set once at boot
> and people would not ask to modify it at run time.

IMHO, what people don't like about PELT halflife mods is the fact that
all sched entities and every functionality based on PELT signals would
be affected, even though it might not be beneficial, or might even be
harmful, for system behaviour not covered by the specific benchmark
numbers shown.

That's why we wanted to figure out what the actual reason is which
improves those Jankbench (or Speedometer resp. game FPS) numbers. In
this case we would be able to boost more selectively than PELT halflife
modding can do.

Util_est_faster (1) is an approach to only boost the CPU util signal
depending on the current task's activation duration (sum of task's
running time). This time is multiplied by 2 when calculating the fake
PELT signal which is then max-compared with the existing CPU util.

And the idea to max-compare CPU util and CPU runnable (2) is to help
tasks under contention. Android testing showed that contention very
often accompanies jankframe occurrences for example.

I only applied (1) and (2) to DVFS requests in my testing.
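
To make (1) a bit more concrete, here is a rough standalone model of the idea
(plain C with floating point; the real implementation reuses the PELT
accumulation code instead of pow(), and the function names below are made up
for this sketch):

  #include <math.h>

  #define SCHED_CAPACITY_SCALE 1024
  #define PELT_HALFLIFE_MS     32.0

  /* utilization a task would have after running uninterrupted for t ms */
  static unsigned long util_from_runtime_ms(double t)
  {
          return (unsigned long)(SCHED_CAPACITY_SCALE *
                                 (1.0 - pow(2.0, -t / PELT_HALFLIFE_MS)));
  }

  /*
   * (1) util_est_faster: turn twice the current task's activation runtime
   * into a "fake" utilization and max-compare it with the CPU utilization
   * used for the DVFS request.
   */
  static unsigned long boosted_cpu_util(unsigned long cpu_util,
                                        double activation_runtime_ms)
  {
          unsigned long fake = util_from_runtime_ms(2.0 * activation_runtime_ms);

          return cpu_util > fake ? cpu_util : fake;
  }

(2) then additionally max-compares the result with cfs_rq->avg.runnable_avg,
as in the cpu_util_cfs() snippet earlier in the thread.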

[...]


* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-03-02  8:00                           ` Vincent Guittot
  2023-03-02 19:39                             ` Dietmar Eggemann
@ 2023-03-06 19:11                             ` Qais Yousef
  2023-03-07 13:22                               ` Vincent Guittot
  1 sibling, 1 reply; 61+ messages in thread
From: Qais Yousef @ 2023-03-06 19:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski,
	Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen

On 03/02/23 09:00, Vincent Guittot wrote:
> On Wed, 1 Mar 2023 at 18:25, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 03/01/23 11:39, Vincent Guittot wrote:
> > > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
> > > >
> > > > On 02/09/23 17:16, Vincent Guittot wrote:
> > > >
> > > > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > > > most probably never be preempted during this 1ms. For such an Android
> > > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > > > a better solution ?
> > > >
> > > > uclamp_min is being used in UI and helping there. But your mileage might vary
> > > > with adoption still.
> > > >
> > > > The major motivation behind this is to help things like gaming as the original
> > > > thread started. It can help UI and other use cases too. Android framework has
> > > > a lot of context on the type of workload that can help it make a decision when
> > > > this helps. And OEMs can have the chance to tune and apply based on the
> > > > characteristics of their device.
> > > >
> > > > > IIUC how util_est_faster works, it removes the waiting time when
> > > > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > > > but not running time), the result is the same as current util_est.
> > > > > util_est_faster makes a difference only when the task alternates
> > > > > between runnable and running slices.
> > > > > Have you considered using runnable_avg metrics in the increase of cpu
> > > > > freq ? This takes into the runnable slice and not only the running
> > > > > time and increase faster than util_avg when tasks compete for the same
> > > > > CPU
> > > >
> > > > Just to understand why we're heading into this direction now.
> > > >
> > > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> > > > migration) which both are tied to utilization signal.
> > > >
> > > > Wouldn't make the util response time faster help not just for rampup, but
> > > > rampdown too?
> > > >
> > > > If we improve util response time, couldn't this mean we can remove util_est or
> > > > am I missing something?
> > >
> > > not sure because you still have a ramping step whereas util_est
> > > directly gives you the final tager
> >
> > I didn't get you. tager?
> 
> target

It seems you're referring to the holding function of util_est? I.e. keep the
util high to avoid 'spurious' decays?

Isn't this a duplication of schedutil's filter, which is also a holding
function to prevent rapid frequency changes?

FWIW, that schedutil filter does get tweaked a lot in the Android world. Many
add an additional down_filter to prevent this premature drop in freq (AFAICT),
which tells me util_est is not delivering completely on that front in practice.

> 
> >
> > >
> > > >
> > > > Currently we have util response which is tweaked by util_est and then that is
> > > > tweaked further by schedutil with that 25% margin when maping util to
> > > > frequency.
> > >
> > > the 25% is not related to the ramping time but to the fact that you
> > > always need some margin to cover unexpected events and estimation
> > > error
> >
> > At the moment we have
> >
> >         util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> >
> > I think we have too many transformations before deciding the current
> > frequencies. Which makes it hard to tweak the system response.
> 
> What is proposed here with runnable_avg is more to take a new input
> when selecting a frequency: the level of contention on the cpu. But

What if there's no contention on the CPU and it's just a single task running
there that suddenly becomes always running for a number of frames?

> this is not used to modify the utilization seen by the scheduler
> 
> >
> > >
> > > >
> > > > I think if we can allow improving general util response time by tweaking PELT
> > > > HALFLIFE we can potentially remove util_est and potentially that magic 25%
> > > > margin too.
> > > >
> > > > Why the approach of further tweaking util_est is better?
> > >
> > > note that in this case it doesn't really tweak util_est but Dietmar
> > > has taken into account runnable_avg to increase the freq in case of
> > > contention
> > >
> > > Also IIUC Dietmar's results, the problem seems more linked to the
> > > selection of a higher freq than increasing the utilization;
> > > runnable_avg tests give similar perf results than shorter half life
> > > and better power consumption.
> >
> > Does it ramp down faster too?
> 
> I don't think so.
> 
> To be honest, I'm not convinced that modifying the half time is the
> right way to solve this. If it was only a matter of half life not
> being suitable for a system, the halk life would be set once at boot
> and people would not ask to modify it at run time.

I'd like to understand more the reason behind these concerns. What is the
problem with modifying the halflife?

The way I see it, it is an important metric of how responsive the system is
to how loaded it is. Which drives a lot of important decisions.

32ms means the system needs approximately 200ms to detect an always running
task (from idle).

16ms halves it to 100ms. And 8ms halves it further to 50ms.

Or you can phrase it the opposite way: it takes 200ms to detect that the
system is now idle after an always-busy state, etc.
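
Just to sanity check those numbers, here is a back-of-the-envelope sketch
(a continuous approximation of the PELT geometric series, not the kernel's
fixed-point implementation), assuming util saturates at 1024:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* time until an always-running task, starting from util 0, gets
	 * within ~1.5% of max; that is ~6 halflives, which is where the
	 * ~200ms / ~100ms / ~50ms figures above come from.
	 */
	const double halflife_ms[] = { 32.0, 16.0, 8.0 };

	for (int i = 0; i < 3; i++) {
		double hl = halflife_ms[i];
		double t = 0.0, util = 0.0;

		while (util < 1024.0 * (63.0 / 64.0)) {
			t += 1.0;	/* 1ms steps */
			util = 1024.0 * (1.0 - pow(0.5, t / hl));
		}
		printf("halflife %2.0fms -> ~%3.0fms to look always-running\n",
		       hl, t);
	}
	return 0;
}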

Why is it bad for a sys admin to have the ability to adjust this response time
as they see fit?

What goes wrong?

AFAICS the two natural places to control the response time of the system are
the pelt halflife for overall system responsiveness, and the mapping function
in schedutil for more fine grained frequency response.

There are issues with current filtering mechanism in schedutil too:

	1. It drops any requests during the filtering window. At CFS enqueue we
	   could end up with multiple calls to cpufreq_update_util(); or if we
	   do multiple consecutive enqueues. In a shared domain, there's a race
	   over which cpu issues the updated freq request first, which might
	   not be the best for the domain during this window.
	2. Maybe it needs asymmetric values for up and down (sketched below).
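
As a purely illustrative sketch of (2) - the names are made up, this is
not the actual schedutil code - an asymmetric rate limit could look like:

#include <stdbool.h>
#include <stdint.h>

/* short delay when raising the frequency, longer (holding) delay when
 * lowering it
 */
struct freq_filter {
	uint64_t	last_update_ns;
	int64_t		up_delay_ns;	/* e.g. driver transition latency */
	int64_t		down_delay_ns;	/* e.g. 20ms hold at high OPPs */
	unsigned int	last_freq;
};

static bool should_update_freq(struct freq_filter *f, uint64_t now_ns,
			       unsigned int next_freq)
{
	int64_t delta = (int64_t)(now_ns - f->last_update_ns);
	int64_t delay = next_freq >= f->last_freq ? f->up_delay_ns
						  : f->down_delay_ns;

	return delta >= delay;
}

With something like this, up_delay_ns could stay at whatever the driver
needs while down_delay_ns provides the longer hold that vendors add out of
tree today.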

I could be naive, but I see util_est as something we should strive to remove to
be honest. I think there are too many moving cogs.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-03-06 19:11                             ` Qais Yousef
@ 2023-03-07 13:22                               ` Vincent Guittot
  2023-03-11 16:55                                 ` Qais Yousef
  0 siblings, 1 reply; 61+ messages in thread
From: Vincent Guittot @ 2023-03-07 13:22 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski,
	Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen

On Mon, 6 Mar 2023 at 20:11, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 03/02/23 09:00, Vincent Guittot wrote:
> > On Wed, 1 Mar 2023 at 18:25, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 03/01/23 11:39, Vincent Guittot wrote:
> > > > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
> > > > >
> > > > > On 02/09/23 17:16, Vincent Guittot wrote:
> > > > >
> > > > > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > > > > most probably never be preempted during this 1ms. For such an Android
> > > > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > > > > a better solution ?
> > > > >
> > > > > uclamp_min is being used in UI and helping there. But your mileage might vary
> > > > > with adoption still.
> > > > >
> > > > > The major motivation behind this is to help things like gaming as the original
> > > > > thread started. It can help UI and other use cases too. Android framework has
> > > > > a lot of context on the type of workload that can help it make a decision when
> > > > > this helps. And OEMs can have the chance to tune and apply based on the
> > > > > characteristics of their device.
> > > > >
> > > > > > IIUC how util_est_faster works, it removes the waiting time when
> > > > > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > > > > but not running time), the result is the same as current util_est.
> > > > > > util_est_faster makes a difference only when the task alternates
> > > > > > between runnable and running slices.
> > > > > > Have you considered using runnable_avg metrics in the increase of cpu
> > > > > > freq ? This takes into the runnable slice and not only the running
> > > > > > time and increase faster than util_avg when tasks compete for the same
> > > > > > CPU
> > > > >
> > > > > Just to understand why we're heading into this direction now.
> > > > >
> > > > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> > > > > migration) which both are tied to utilization signal.
> > > > >
> > > > > Wouldn't make the util response time faster help not just for rampup, but
> > > > > rampdown too?
> > > > >
> > > > > If we improve util response time, couldn't this mean we can remove util_est or
> > > > > am I missing something?
> > > >
> > > > not sure because you still have a ramping step whereas util_est
> > > > directly gives you the final tager
> > >
> > > I didn't get you. tager?
> >
> > target
>
> It seems you're referring to the holding function of util_est? ie: keep the
> util high to avoid 'spurious' decays?

I mean whatever the half life, you will have to wait for the utilization
to increase.

>
> Isn't this a duplication of the schedutil's filter which is also a holding
> function to prevent rapid frequency changes?

util_est is used by the scheduler to estimate the final utilization of the cfs

>
> FWIW, that schedutil filter does get tweaked a lot in android world. Many add
> an additional down_filter to prevent this premature drop in freq (AFAICT).
> Which tells me util_est is not delivering completely on that front in practice.
>
> >
> > >
> > > >
> > > > >
> > > > > Currently we have util response which is tweaked by util_est and then that is
> > > > > tweaked further by schedutil with that 25% margin when maping util to
> > > > > frequency.
> > > >
> > > > the 25% is not related to the ramping time but to the fact that you
> > > > always need some margin to cover unexpected events and estimation
> > > > error
> > >
> > > At the moment we have
> > >
> > >         util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> > >
> > > I think we have too many transformations before deciding the current
> > > frequencies. Which makes it hard to tweak the system response.
> >
> > What is proposed here with runnable_avg is more to take a new input
> > when selecting a frequency: the level of contention on the cpu. But
>
> What if there's no contention on the CPU and it's just a single task running
> there that suddenly becomes always running for a number of frames?
>
> > this is not used to modify the utilization seen by the scheduler
> >
> > >
> > > >
> > > > >
> > > > > I think if we can allow improving general util response time by tweaking PELT
> > > > > HALFLIFE we can potentially remove util_est and potentially that magic 25%
> > > > > margin too.
> > > > >
> > > > > Why the approach of further tweaking util_est is better?
> > > >
> > > > note that in this case it doesn't really tweak util_est but Dietmar
> > > > has taken into account runnable_avg to increase the freq in case of
> > > > contention
> > > >
> > > > Also IIUC Dietmar's results, the problem seems more linked to the
> > > > selection of a higher freq than increasing the utilization;
> > > > runnable_avg tests give similar perf results than shorter half life
> > > > and better power consumption.
> > >
> > > Does it ramp down faster too?
> >
> > I don't think so.
> >
> > To be honest, I'm not convinced that modifying the half time is the
> > right way to solve this. If it was only a matter of half life not
> > being suitable for a system, the halk life would be set once at boot
> > and people would not ask to modify it at run time.
>
> I'd like to understand more the reason behind these concerns. What is the
> problem with modifying the halflife?

I can somehow understand that some systems would like a different half
life than the current one because of the number of cpus, the pace of
the system... But this should be fixed at boot. The fact that people
need to dynamically change the half life means for me that even after
changing it they still don't get the correct utilization. And I
think that the problem is not really related (or at least not only) to
the correctness of utilization tracking but to a lack of taking into
account other inputs when selecting a frequency. And the contention
(runnable_avg) is a good input to take into account when selecting a
frequency because it reflects that some tasks are waiting to run on
the cpu.

>
> The way I see it it is an important metric of how responsive the system to how
> loaded it is. Which drives a lot of important decisions.
>
> 32ms means the system needs approximately 200ms to detect an always running
> task (from idle).
>
> 16ms halves it to 100ms. And 8ms halves it further to 50ms.
>
> Or you can phrase it the opposite way, it takes 200ms to detect the system is
> now idle from always busy state. etc.
>
> Why is it bad for a sys admin to have the ability to adjust this response time
> as they see fit?

because they will use it to bias the response of the system and abuse it
at runtime instead of identifying the root cause.

>
> What goes wrong?
>
> AFAICS the two natural places to control the response time of the system is
> pelt halflife for overall system responsiveness, and the mapping function in
> schedutil for more fine grained frequency response.
>
> There are issues with current filtering mechanism in schedutil too:
>
>         1. It drops any requests during the filtering window. At CFS enqueue we
>            could end up with multiple calls to cpufreq_update_util(); or if we
>            do multiple consecutive enqueues. In a shared domain, there's a race
>            which cpu issues the updated freq request first. Which might not be
>            the best for the domain during this window.
>         2. Maybe it needs asymmetric values for up and down.
>
> I could be naive, but I see util_est as something we should strive to remove to
> be honest. I think there are too many moving cogs.
>
>
> Thanks!
>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-03-07 13:22                               ` Vincent Guittot
@ 2023-03-11 16:55                                 ` Qais Yousef
  0 siblings, 0 replies; 61+ messages in thread
From: Qais Yousef @ 2023-03-11 16:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski,
	Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort,
	Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
	Qais Yousef, linux-kernel, Jonathan JMChen

On 03/07/23 14:22, Vincent Guittot wrote:
> On Mon, 6 Mar 2023 at 20:11, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 03/02/23 09:00, Vincent Guittot wrote:
> > > On Wed, 1 Mar 2023 at 18:25, Qais Yousef <qyousef@layalina.io> wrote:
> > > >
> > > > On 03/01/23 11:39, Vincent Guittot wrote:
> > > > > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
> > > > > >
> > > > > > On 02/09/23 17:16, Vincent Guittot wrote:
> > > > > >
> > > > > > > I don't see how util_est_faster can help this 1ms task here ? It's
> > > > > > > most probably never be preempted during this 1ms. For such an Android
> > > > > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and
> > > > > > > a better solution ?
> > > > > >
> > > > > > uclamp_min is being used in UI and helping there. But your mileage might vary
> > > > > > with adoption still.
> > > > > >
> > > > > > The major motivation behind this is to help things like gaming as the original
> > > > > > thread started. It can help UI and other use cases too. Android framework has
> > > > > > a lot of context on the type of workload that can help it make a decision when
> > > > > > this helps. And OEMs can have the chance to tune and apply based on the
> > > > > > characteristics of their device.
> > > > > >
> > > > > > > IIUC how util_est_faster works, it removes the waiting time when
> > > > > > > sharing cpu time with other tasks. So as long as there is no (runnable
> > > > > > > but not running time), the result is the same as current util_est.
> > > > > > > util_est_faster makes a difference only when the task alternates
> > > > > > > between runnable and running slices.
> > > > > > > Have you considered using runnable_avg metrics in the increase of cpu
> > > > > > > freq ? This takes into the runnable slice and not only the running
> > > > > > > time and increase faster than util_avg when tasks compete for the same
> > > > > > > CPU
> > > > > >
> > > > > > Just to understand why we're heading into this direction now.
> > > > > >
> > > > > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> > > > > > migration) which both are tied to utilization signal.
> > > > > >
> > > > > > Wouldn't make the util response time faster help not just for rampup, but
> > > > > > rampdown too?
> > > > > >
> > > > > > If we improve util response time, couldn't this mean we can remove util_est or
> > > > > > am I missing something?
> > > > >
> > > > > not sure because you still have a ramping step whereas util_est
> > > > > directly gives you the final tager
> > > >
> > > > I didn't get you. tager?
> > >
> > > target
> >
> > It seems you're referring to the holding function of util_est? ie: keep the
> > util high to avoid 'spurious' decays?
> 
> I mean whatever the half life, you will have to wait the utilization
> to increase.

Yes - which is the ramp-up delay that is unacceptable in some cases and seems
to have been raised several times over the years.

> 
> >
> > Isn't this a duplication of the schedutil's filter which is also a holding
> > function to prevent rapid frequency changes?
> 
> util_est is used by scheduler to estimate the final utilization of the cfs

If I remember the commit message that introduced it correctly, it talks about
ramp-up delays - and issues with premature decay for periodic tasks.

So it is a mechanism to speed up util_avg response time. The same issue we're
trying to address again now.

> 
> >
> > FWIW, that schedutil filter does get tweaked a lot in android world. Many add
> > an additional down_filter to prevent this premature drop in freq (AFAICT).
> > Which tells me util_est is not delivering completely on that front in practice.
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Currently we have util response which is tweaked by util_est and then that is
> > > > > > tweaked further by schedutil with that 25% margin when maping util to
> > > > > > frequency.
> > > > >
> > > > > the 25% is not related to the ramping time but to the fact that you
> > > > > always need some margin to cover unexpected events and estimation
> > > > > error
> > > >
> > > > At the moment we have
> > > >
> > > >         util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> > > >
> > > > I think we have too many transformations before deciding the current
> > > > frequencies. Which makes it hard to tweak the system response.
> > >
> > > What is proposed here with runnable_avg is more to take a new input
> > > when selecting a frequency: the level of contention on the cpu. But
> >
> > What if there's no contention on the CPU and it's just a single task running
> > there that suddenly becomes always running for a number of frames?
> >
> > > this is not used to modify the utilization seen by the scheduler
> > >
> > > >
> > > > >
> > > > > >
> > > > > > I think if we can allow improving general util response time by tweaking PELT
> > > > > > HALFLIFE we can potentially remove util_est and potentially that magic 25%
> > > > > > margin too.
> > > > > >
> > > > > > Why the approach of further tweaking util_est is better?
> > > > >
> > > > > note that in this case it doesn't really tweak util_est but Dietmar
> > > > > has taken into account runnable_avg to increase the freq in case of
> > > > > contention
> > > > >
> > > > > Also IIUC Dietmar's results, the problem seems more linked to the
> > > > > selection of a higher freq than increasing the utilization;
> > > > > runnable_avg tests give similar perf results than shorter half life
> > > > > and better power consumption.
> > > >
> > > > Does it ramp down faster too?
> > >
> > > I don't think so.
> > >
> > > To be honest, I'm not convinced that modifying the half time is the
> > > right way to solve this. If it was only a matter of half life not
> > > being suitable for a system, the halk life would be set once at boot
> > > and people would not ask to modify it at run time.
> >
> > I'd like to understand more the reason behind these concerns. What is the
> > problem with modifying the halflife?
> 
> I can somehow understand that some systems would like a different half
> life than the current one because of the number of cpus, the pace of
> the system... But this should be fixed at boot. The fact that people

A boot time setting might be the only thing required. I think some systems
only need this already. The difficulty in practice is that on some systems
this might result in worse power over a day of use. So it'll all depend, hence
the desire to have it as a runtime knob. Why invent more crystal balls that
may or may not be the best thing depending on who you ask?

> needs to dynamically change the half life means for me that even after
> changing it then they still don't get the correct  utilization. And I

What is the correct utilization? It is just a signal in an attempt to crystal
ball the future. It can't be correct in general IMHO. It's a best effort that
we know already fails occasionally.

As I said above - there's a trade-off in perf/power and that will highly depend
on the system.

The proposed high contention detection doesn't address this trade-off; rather
it biases the system further towards perf-first, which is not always the right
trade-off. It could be a useful addition - but it needs to be a tunable too.

> think that the problem is not really related (or at least not only) to
> the correctness of utilization tracking but a lack of taking into

It's not a correctness issue. It's a response time issue. It's a simple
matter of improving the reactiveness of the system, which has a power cost
that some users don't want to incur when not necessary.

> account other input when selecting a frequency. And the contention
> (runnable_avg) is a good input to take into account when selecting a
> frequency because it reflects that some tasks are waiting to run on
> the cpu

You did not answer my question above. What if there's no contention and
a single task on a cpu suddenly moves from mostly idle to always running for
a number of frames? There's no contention in there. How will this be improved?

> 
> >
> > The way I see it it is an important metric of how responsive the system to how
> > loaded it is. Which drives a lot of important decisions.
> >
> > 32ms means the system needs approximately 200ms to detect an always running
> > task (from idle).
> >
> > 16ms halves it to 100ms. And 8ms halves it further to 50ms.
> >
> > Or you can phrase it the opposite way, it takes 200ms to detect the system is
> > now idle from always busy state. etc.
> >
> > Why is it bad for a sys admin to have the ability to adjust this response time
> > as they see fit?
> 
> because it will use it to bias the response of the system and abuse it
> at runtime instead of identifying the root cause.

No one wants to abuse anything. But the one size fits all approach is not
always right either. And sys admins and end users have the right to tune their
systems the way they see fit. There are too many variations out there to hard
code the system response. I view this like the right to repair - it's their
system, why do they have to hack the kernel to tune it?

The root cause is that the system reactiveness is controlled by this value.
And there's a trade-off between perf/power that is highly dependent on the
system characteristics. In some cases a boot time setting is all that one
needs. In others, it might be desirable to improve specific use cases like
gaming only, as speeding things up at boot time can hurt overall battery life
in normal use cases.

I think the story is simple :)

In my view util_est is borderline a hack. We just need to enable controlling
pelt ramp-up/down response times + improve schedutil. I highlighted a few
shortcomings that are already known in practice below. And that phoronix
article about schedutil not being better than ondemand demonstrates that this
is an issue outside of mobile too.

schedutil - as the name says - depends on the util signal, which in turn
depends on the pelt halflife. I really think this is the most natural and
predictable way to tune the system. I can't see the drawbacks.

I think we need to distinguish between picking sensible default behavior; and
enforcing policies or restricting user's choice. AFAICS the discussion is going
towards the latter.

On the topic of defaults - I do think 16ms is a more sensible default for
modern day hardware and use cases.

/me runs and hides :)


Cheers

--
Qais Yousef

> 
> >
> > What goes wrong?
> >
> > AFAICS the two natural places to control the response time of the system is
> > pelt halflife for overall system responsiveness, and the mapping function in
> > schedutil for more fine grained frequency response.
> >
> > There are issues with current filtering mechanism in schedutil too:
> >
> >         1. It drops any requests during the filtering window. At CFS enqueue we
> >            could end up with multiple calls to cpufreq_update_util(); or if we
> >            do multiple consecutive enqueues. In a shared domain, there's a race
> >            which cpu issues the updated freq request first. Which might not be
> >            the best for the domain during this window.
> >         2. Maybe it needs asymmetric values for up and down.
> >
> > I could be naive, but I see util_est as something we should strive to remove to
> > be honest. I think there are too many moving cogs.
> >
> >
> > Thanks!
> >
> > --
> > Qais Yousef

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-03-01 17:24                         ` Qais Yousef
  2023-03-02  8:00                           ` Vincent Guittot
@ 2023-03-23 16:29                           ` Dietmar Eggemann
  2023-04-03 14:45                             ` Qais Yousef
  1 sibling, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2023-03-23 16:29 UTC (permalink / raw)
  To: Qais Yousef, Vincent Guittot
  Cc: Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On 01/03/2023 18:24, Qais Yousef wrote:
> On 03/01/23 11:39, Vincent Guittot wrote:
>> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
>>>
>>> On 02/09/23 17:16, Vincent Guittot wrote:
>>>
>>>> I don't see how util_est_faster can help this 1ms task here ? It's
>>>> most probably never be preempted during this 1ms. For such an Android
>>>> Graphics Pipeline short task, hasn't uclamp_min been designed for and
>>>> a better solution ?
>>>
>>> uclamp_min is being used in UI and helping there. But your mileage might vary
>>> with adoption still.
>>>
>>> The major motivation behind this is to help things like gaming as the original
>>> thread started. It can help UI and other use cases too. Android framework has
>>> a lot of context on the type of workload that can help it make a decision when
>>> this helps. And OEMs can have the chance to tune and apply based on the
>>> characteristics of their device.
>>>
>>>> IIUC how util_est_faster works, it removes the waiting time when
>>>> sharing cpu time with other tasks. So as long as there is no (runnable
>>>> but not running time), the result is the same as current util_est.
>>>> util_est_faster makes a difference only when the task alternates
>>>> between runnable and running slices.
>>>> Have you considered using runnable_avg metrics in the increase of cpu
>>>> freq ? This takes into the runnable slice and not only the running
>>>> time and increase faster than util_avg when tasks compete for the same
>>>> CPU
>>>
>>> Just to understand why we're heading into this direction now.
>>>
>>> AFAIU the desired outcome to have faster rampup time (and on HMP faster up
>>> migration) which both are tied to utilization signal.
>>>
>>> Wouldn't make the util response time faster help not just for rampup, but
>>> rampdown too?
>>>
>>> If we improve util response time, couldn't this mean we can remove util_est or
>>> am I missing something?
>>
>> not sure because you still have a ramping step whereas util_est
>> directly gives you the final tager

util_est gives us an instantaneous signal at enqueue for periodic tasks,
something PELT will never be able to do.
 
> I didn't get you. tager?
> 
>>
>>>
>>> Currently we have util response which is tweaked by util_est and then that is
>>> tweaked further by schedutil with that 25% margin when maping util to
>>> frequency.
>>
>> the 25% is not related to the ramping time but to the fact that you
>> always need some margin to cover unexpected events and estimation
>> error
> 
> At the moment we have
> 
> 	util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> 
> I think we have too many transformations before deciding the current
> frequencies. Which makes it hard to tweak the system response.

To me it looks more like this:

max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection
                             ^^^^^^^^^^^^ 
                             new proposal to factor in root cfs_rq contention
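
Expressed as code, that first stage is roughly (a sketch of the idea only,
helper name made up, not the actual diff):

static unsigned long cfs_freq_util(unsigned long util_avg,
				   unsigned long util_est,
				   unsigned long runnable_avg)
{
	unsigned long util = util_avg > util_est ? util_avg : util_est;

	/* factor in root cfs_rq contention */
	return util > runnable_avg ? util : runnable_avg;
}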


Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only
there to create the safety margin used by schedutil & EAS.
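
For reference, that margin is roughly a quarter of the utilization; IIRC
mainline's map_util_perf() boils down to something like (sketch, hypothetical
helper name):

static unsigned long apply_dvfs_headroom(unsigned long util)
{
	return util + (util >> 2);	/* +25%, i.e. util * 1.25 */
}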

* The schedutil up/down filter thing was already nacked back in Nov 2016.
IMHO, this is where util_est was initially discussed as an alternative.
We have it in mainline as well, but one value (default 10ms) for both
directions. There was discussion to map it to the driver's
transition_latency instead.

In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`.

So on `up` your rate is as small as possible (only respecting the
driver's transition_latency) but on `down` you use much more than that.

Why exactly do you have this higher value on `down`? My hunch is
scenarios in which the CPU (all CPUs in the freq. domain) goes idle,
so util_est is 0 and the blocked utilization is decaying (too fast,
4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency
again when the CPU wakes up in those 20ms?   

>>> I think if we can allow improving general util response time by tweaking PELT
>>> HALFLIFE we can potentially remove util_est and potentially that magic 25%
>>> margin too.
>>>
>>> Why the approach of further tweaking util_est is better?
>>
>> note that in this case it doesn't really tweak util_est but Dietmar
>> has taken into account runnable_avg to increase the freq in case of
>> contention
>>
>> Also IIUC Dietmar's results, the problem seems more linked to the
>> selection of a higher freq than increasing the utilization;
>> runnable_avg tests give similar perf results than shorter half life
>> and better power consumption.
> 
> Does it ramp down faster too?

Not sure why you are interested in this? Can't be related to the
`driving DVFS` functionality discussed above.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-03-23 16:29                           ` Dietmar Eggemann
@ 2023-04-03 14:45                             ` Qais Yousef
  2023-04-06 15:58                               ` Dietmar Eggemann
  0 siblings, 1 reply; 61+ messages in thread
From: Qais Yousef @ 2023-04-03 14:45 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

Hi Dietmar

On 03/23/23 17:29, Dietmar Eggemann wrote:
> On 01/03/2023 18:24, Qais Yousef wrote:
> > On 03/01/23 11:39, Vincent Guittot wrote:
> >> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
> >>>
> >>> On 02/09/23 17:16, Vincent Guittot wrote:
> >>>
> >>>> I don't see how util_est_faster can help this 1ms task here ? It's
> >>>> most probably never be preempted during this 1ms. For such an Android
> >>>> Graphics Pipeline short task, hasn't uclamp_min been designed for and
> >>>> a better solution ?
> >>>
> >>> uclamp_min is being used in UI and helping there. But your mileage might vary
> >>> with adoption still.
> >>>
> >>> The major motivation behind this is to help things like gaming as the original
> >>> thread started. It can help UI and other use cases too. Android framework has
> >>> a lot of context on the type of workload that can help it make a decision when
> >>> this helps. And OEMs can have the chance to tune and apply based on the
> >>> characteristics of their device.
> >>>
> >>>> IIUC how util_est_faster works, it removes the waiting time when
> >>>> sharing cpu time with other tasks. So as long as there is no (runnable
> >>>> but not running time), the result is the same as current util_est.
> >>>> util_est_faster makes a difference only when the task alternates
> >>>> between runnable and running slices.
> >>>> Have you considered using runnable_avg metrics in the increase of cpu
> >>>> freq ? This takes into the runnable slice and not only the running
> >>>> time and increase faster than util_avg when tasks compete for the same
> >>>> CPU
> >>>
> >>> Just to understand why we're heading into this direction now.
> >>>
> >>> AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> >>> migration) which both are tied to utilization signal.
> >>>
> >>> Wouldn't make the util response time faster help not just for rampup, but
> >>> rampdown too?
> >>>
> >>> If we improve util response time, couldn't this mean we can remove util_est or
> >>> am I missing something?
> >>
> >> not sure because you still have a ramping step whereas util_est
> >> directly gives you the final tager
> 
> util_est gives us instantaneous signal at enqueue for periodic tasks,

How do you define instantaneous and periodic here? How would you describe the
behavior for non periodic tasks?

> something PELT will never be able to do.

Why? Don't we achieve something similar by selecting a lower pelt halflife?

>  
> > I didn't get you. tager?
> > 
> >>
> >>>
> >>> Currently we have util response which is tweaked by util_est and then that is
> >>> tweaked further by schedutil with that 25% margin when maping util to
> >>> frequency.
> >>
> >> the 25% is not related to the ramping time but to the fact that you
> >> always need some margin to cover unexpected events and estimation
> >> error
> > 
> > At the moment we have
> > 
> > 	util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> > 
> > I think we have too many transformations before deciding the current
> > frequencies. Which makes it hard to tweak the system response.
> 
> To me it looks more like this:
> 
> max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection
>                              ^^^^^^^^^^^^ 
>                              new proposal to factor in root cfs_rq contention

These are still 5 stages even if written differently.

What if it's background tasks that are causing the contention? How can you
tell it to ignore that and NOT drive the frequency up unnecessarily for those
non-important ones? If userspace were fully aware of uclamp - this whole
discussion wouldn't be necessary. And I still have a bunch of fixes to push
before uclamp_max is actually usable in production.

> Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only
> there to create the safety margin used by schedutil & EAS.

Yes I know and that's not the point. The point is that it's a chain reaction.
25% headroom is already very aggressive and causes issues at the inefficient
top end of the cores. And when util is high, you might end up in a situation
where you skip frequencies. Making everything go up faster without balancing
it by either enabling faster ramp-down too or tuning this value can lead to
power and thermal issues on powerful systems.

I think all we need is controlling the pelt halflife and this margin to tune
the system to the desired trade-off.

> 
> * The schedutil up/down filter thing has been already naked in Nov 2016.
> IMHO, this is where util_est was initially discussed as an alternative.

Well, I don't see anyone not using a down filter. So I'm not sure util_est has
been a true alternative.

> We have it in mainline as well, but one value (default 10ms) for both
> directions. There was discussion to map it to the driver's
> translation_latency instead.

Which can be filled wrong sometimes :(

> 
> In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`.
> 
> So on `up` your rate is as small as possible (only respecting the
> driver's translation_latency) but on `down` you use much more than that. 
> 
> Why exactly do you have this higher value on `down`? My hunch is
> scenarios in which the CPU (all CPUs in the freq. domain) goes idle,
> so util_est is 0 and the blocked utilization is decaying (too fast,
> 4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency
> again when the CPU wakes up in those 20ms?   

The down filter prevents changing the frequency to a lower value. So it's
a holding function to keep the residency at a higher frequency for at least
20ms. It is, sort of, similar to the max() functions you used above. The max
function will allow you to follow the fastest ramping-up signal on the way up,
and the slowest ramping-down one on the way down.

I think this is a more deterministic way to do it.

> 
> >>> I think if we can allow improving general util response time by tweaking PELT
> >>> HALFLIFE we can potentially remove util_est and potentially that magic 25%
> >>> margin too.
> >>>
> >>> Why the approach of further tweaking util_est is better?
> >>
> >> note that in this case it doesn't really tweak util_est but Dietmar
> >> has taken into account runnable_avg to increase the freq in case of
> >> contention
> >>
> >> Also IIUC Dietmar's results, the problem seems more linked to the
> >> selection of a higher freq than increasing the utilization;
> >> runnable_avg tests give similar perf results than shorter half life
> >> and better power consumption.
> > 
> > Does it ramp down faster too?
> 
> Not sure why you are interested in this? Can't be related to the
> `driving DVFS` functionality discussed above.

If you change the reaction time to be more aggressive in going up, then it's
only natural to have it symmetrical so your residency on the power hungry OPPs
doesn't go through the roof and you don't end up with thermal and power issues.

I am concerned about us biasing towards perf-first too much and not enabling
sys admins to select a proper trade-off for their system and use case, which
are not static. The workloads the system needs to accommodate are abundant
and operating conditions could change. And the diversity of hardware available
out there is huge - I am not sure how we can expect to have one response that
accommodates all of them.

What I'm trying to push for here is that we should look at the chain as one
unit. And we should consider that there's important trade-off to be had here;
having a sensible default doesn't mean the user shouldn't be allowed to select
a different trade-off. I'm not sure the problem can be generalized and fixed
automatically. But happy to be proven wrong of course :-)

FWIW, I'm trying to tweak all these knobs and study their impact. Do you mind
pasting the patch for load_avg consideration so I can take it into account too
in my experiments?


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-04-03 14:45                             ` Qais Yousef
@ 2023-04-06 15:58                               ` Dietmar Eggemann
  2023-04-11 17:51                                 ` Qais Yousef
  0 siblings, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2023-04-06 15:58 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Vincent Guittot, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

Hi Qais,

On 03/04/2023 16:45, Qais Yousef wrote:
> Hi Diemtar
> 
> On 03/23/23 17:29, Dietmar Eggemann wrote:
>> On 01/03/2023 18:24, Qais Yousef wrote:
>>> On 03/01/23 11:39, Vincent Guittot wrote:
>>>> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
>>>>>
>>>>> On 02/09/23 17:16, Vincent Guittot wrote:

[...]

>>>>> If we improve util response time, couldn't this mean we can remove util_est or
>>>>> am I missing something?
>>>>
>>>> not sure because you still have a ramping step whereas util_est
>>>> directly gives you the final tager
>>
>> util_est gives us instantaneous signal at enqueue for periodic tasks,
> 
> How do you define instantaneous and periodic here? How would you describe the
> behavior for non periodic tasks?

Instantaneous is when the max value is available already @wakeup. That
is the main use case for util_est: providing this boost to periodic tasks.
A non-periodic task doesn't benefit from this. The working assumption back
then was that the important tasks involved here are the periodic (back then
60Hz, 16.67 ms period) tasks of the Android display pipeline.
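
Roughly, and ignoring the ewma part of mainline util_est, the idea is (the
names below are made up, just to illustrate):

struct est_state {
	unsigned long util_avg;	/* PELT running average */
	unsigned long util_est;	/* peak snapshot from last dequeue */
};

/* at dequeue: remember the peak the task reached while running */
static void est_snapshot_at_dequeue(struct est_state *s)
{
	if (s->util_avg > s->util_est)
		s->util_est = s->util_avg;
}

/* at enqueue: a periodic task is seen at its previous peak right away,
 * instead of waiting for util_avg to ramp up again
 */
static unsigned long est_util_at_enqueue(const struct est_state *s)
{
	return s->util_avg > s->util_est ? s->util_avg : s->util_est;
}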

>> something PELT will never be able to do.
> 
> Why? Isn't by selecting a lower pelt halflife we achieve something similar?

You get closer but you still would need time to ramp-up. That's without 
util_est.

[...]

>>>> the 25% is not related to the ramping time but to the fact that you
>>>> always need some margin to cover unexpected events and estimation
>>>> error
>>>
>>> At the moment we have
>>>
>>> 	util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
>>>
>>> I think we have too many transformations before deciding the current
>>> frequencies. Which makes it hard to tweak the system response.
>>
>> To me it looks more like this:
>>
>> max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection
>>                              ^^^^^^^^^^^^ 
>>                              new proposal to factor in root cfs_rq contention
> 
> These are still 5 stages even if written differently.
> 
> What if background tasks that are causing the contention? How can you tell it
> to ignore that and NOT drive the frequency up unnecessary for those non
> important ones? If userspace is fully aware of uclamp - this whole discussion
> wouldn't be necessary. And I still have a bunch of fixes to push before
> uclamp_max is actually usable in production.

You're hinting at the other open discussion we have on uclamp in feec():

https://lkml.kernel.org/r/20230205224318.2035646-1-qyousef@layalina.io

IMHO, this is a different discussion. No classification of tasks here. 

>> Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only
>> there to create the safety margin used by schedutil & EAS.
> 
> Yes I know and that's not the point. The point is that it's a chain reaction.
> 25% percent headroom is already very aggressive and causes issues on the top
> inefficient ends of the cores. And when util is high, you might end up in
> a situation where you skip frequencies. Making everything go up faster without
> balancing it with either enabling going down faster too or tune this value can
> lead to power and thermal issues on powerful systems.

I try to follow here but I fail. You're saying that the safety margin is 
too wide and in case util is within the safety margin, the logic is 
eclipsed by going max or choosing a CPU from a higher CPU capacity 
Perf-domain?

Wouldn't `going down faster` contradict with schedutil's 20ms down rate 
limit?

> 
> I think all we need is controlling pelt halflife and this one to tune the
> system to the desired trade-off.
> 
>>
>> * The schedutil up/down filter thing has been already naked in Nov 2016.
>> IMHO, this is where util_est was initially discussed as an alternative.
> 
> Well, I don't see anyone not using a down filter. So I'm not sure util_est has
> been a true alternative.

Definitely not in the down direction. util_est is 0 w/o any runnable tasks.
And blocked utilization is decaying much faster than your 20ms down rate
limit.
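
(Rough numbers, assuming the default 32ms halflife: blocked utilization
decays as 2^(-t/32ms), so within a single 20ms window it already drops to
~2^(-20/32) ~= 65% of its value.)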

>> We have it in mainline as well, but one value (default 10ms) for both
>> directions. There was discussion to map it to the driver's
>> translation_latency instead.
> 
> Which can be filled wrong sometimes :(
> 
>>
>> In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`.
>>
>> So on `up` your rate is as small as possible (only respecting the
>> driver's translation_latency) but on `down` you use much more than that. 
>>
>> Why exactly do you have this higher value on `down`? My hunch is
>> scenarios in which the CPU (all CPUs in the freq. domain) goes idle,
>> so util_est is 0 and the blocked utilization is decaying (too fast,
>> 4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency
>> again when the CPU wakes up in those 20ms?   
> 
> The down filter prevents changing the frequency to a lower value. So it's
> a holding function to keep the residency at a higher frequency for at least
> 20ms. It is, sort of, similar to the max() functions you used above. The max
> function will allow you to follow the fasting ramping up signal on the way up,
> and the slowest ramping down one on the way down.
> 
> I think this is more deterministic way to do it.

But a faster PELT wouldn't help here, quite the opposite.
[...]

>>>> Also IIUC Dietmar's results, the problem seems more linked to the
>>>> selection of a higher freq than increasing the utilization;
>>>> runnable_avg tests give similar perf results than shorter half life
>>>> and better power consumption.
>>>
>>> Does it ramp down faster too?
>>
>> Not sure why you are interested in this? Can't be related to the
>> `driving DVFS` functionality discussed above.
> 
> If you change the reaction time to be more aggressive in going up, then it's
> only natural to have it symmetrical so your residency on the power hungry OPPs
> don't go over the roof and end up with thermal and power issues.

But you apply this 20ms down rate limit on the big cores too?

> I am concerned about us biasing towrads perf first too much and not enabling
> sys admins to select a proper trade off for their system and use case. Which
> are not static. The workloads the system needs to accommodate to are abundant
> and operating conditions could change. And the diversity of hardware available
> out there is huge - I am not sure how can we expect we can have one response to
> accommodate for all of them.
> 
> What I'm trying to push for here is that we should look at the chain as one
> unit. And we should consider that there's important trade-off to be had here;
> having a sensible default doesn't mean the user shouldn't be allowed to select
> a different trade-off. I'm not sure the problem can be generalized and fixed
> automatically. But happy to be proven wrong of course :-)
> 
> FWIW, I'm trying to tweak all these knobs and study their impact. Do you mind
> pasting the patch for load_avg consideration so I can take it into account too
> in my experiments?

Just posted it:

https://lkml.kernel.org/r/20230406155030.1989554-1-dietmar.eggemann@arm.com


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-04-06 15:58                               ` Dietmar Eggemann
@ 2023-04-11 17:51                                 ` Qais Yousef
  0 siblings, 0 replies; 61+ messages in thread
From: Qais Yousef @ 2023-04-11 17:51 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

Hi Dietmar!

On 04/06/23 17:58, Dietmar Eggemann wrote:
> Hi Qais,
> 
> On 03/04/2023 16:45, Qais Yousef wrote:
> > Hi Diemtar
> > 
> > On 03/23/23 17:29, Dietmar Eggemann wrote:
> >> On 01/03/2023 18:24, Qais Yousef wrote:
> >>> On 03/01/23 11:39, Vincent Guittot wrote:
> >>>> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
> >>>>>
> >>>>> On 02/09/23 17:16, Vincent Guittot wrote:
> 
> [...]
> 
> >>>>> If we improve util response time, couldn't this mean we can remove util_est or
> >>>>> am I missing something?
> >>>>
> >>>> not sure because you still have a ramping step whereas util_est
> >>>> directly gives you the final tager
> >>
> >> util_est gives us instantaneous signal at enqueue for periodic tasks,
> > 
> > How do you define instantaneous and periodic here? How would you describe the
> > behavior for non periodic tasks?
> 
> Instantaneous is when the max value is available already @wakeup. That 
> is the main use case for util_est, provide this boost to periodic tasks. 
> A non-periodic task doesn't benefit from this. Work assumption back then 
> was that the important task involved here are the periodic (back then 
> 60Hz, 16.67 ms period) tasks of the Android display pipeline.

Not all tasks in the system are periodic...

Note that the main use case that was brought up here is gaming - which is not
the same as the Android display pipeline.

> 
> >> something PELT will never be able to do.
> > 
> > Why? Isn't by selecting a lower pelt halflife we achieve something similar?
> 
> You get closer but you still would need time to ramp-up. That's without 
> util_est.

Yes we'll always need time to ramp up. Even for util_est, no?

> 
> [...]
> 
> >>>> the 25% is not related to the ramping time but to the fact that you
> >>>> always need some margin to cover unexpected events and estimation
> >>>> error
> >>>
> >>> At the moment we have
> >>>
> >>> 	util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> >>>
> >>> I think we have too many transformations before deciding the current
> >>> frequencies. Which makes it hard to tweak the system response.
> >>
> >> To me it looks more like this:
> >>
> >> max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection
> >>                              ^^^^^^^^^^^^ 
> >>                              new proposal to factor in root cfs_rq contention
> > 
> > These are still 5 stages even if written differently.
> > 
> > What if background tasks that are causing the contention? How can you tell it
> > to ignore that and NOT drive the frequency up unnecessary for those non
> > important ones? If userspace is fully aware of uclamp - this whole discussion
> > wouldn't be necessary. And I still have a bunch of fixes to push before
> > uclamp_max is actually usable in production.
> 
> You're hinting to the other open discussion we have on uclamp in feec():

No, no I am not.

> 
> https://lkml.kernel.org/r/20230205224318.2035646-1-qyousef@layalina.io
> 
> IMHO, this is a different discussion. No classification of tasks here. 

That patch has nothing to do with what I'm trying to say here. You say looking
at load_avg helps with contention. My point was: what if the contention is
caused by background tasks? They'll cause the frequency to go higher, which is
not the desired effect.

So it'll not distinguish between cases that matter and cases that don't
matter; and with no ability to control this behavior.

As you know, cpuset is used to keep background tasks on the little cores,
whose top frequencies on the latest parts are very expensive. This could lead
to higher residency on those expensive frequencies with your change.

We need to be selective - which is the whole point behind wanting a runtime
control. Not all workloads are equal. And not all systems handle the same
workload similarly. There are trade-offs.

> 
> >> Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only
> >> there to create the safety margin used by schedutil & EAS.
> > 
> > Yes I know and that's not the point. The point is that it's a chain reaction.
> > 25% percent headroom is already very aggressive and causes issues on the top
> > inefficient ends of the cores. And when util is high, you might end up in
> > a situation where you skip frequencies. Making everything go up faster without
> > balancing it with either enabling going down faster too or tune this value can
> > lead to power and thermal issues on powerful systems.
> 
> I try to follow here but I fail. You're saying that the safety margin is 
> too wide and in case util is within the safety margin, the logic is 
> eclipsed by going max or choosing a CPU from a higher CPU capacity 
> Perf-domain?
> 
> Wouldn't `going down faster` contradict with schedutil's 20ms down rate 
> limit?

No. 200ms is a far cry from 20ms.

> 
> > 
> > I think all we need is controlling pelt halflife and this one to tune the
> > system to the desired trade-off.
> > 
> >>
> >> * The schedutil up/down filter thing has been already naked in Nov 2016.
> >> IMHO, this is where util_est was initially discussed as an alternative.
> > 
> > Well, I don't see anyone not using a down filter. So I'm not sure util_est has
> > been a true alternative.
> 
> Definitely not in down direction. util_est is 0 w/o any runnable tasks. 
> And blocked utilization is decaying much faster than your 20ms down rate 
> limit.

Okay I'll keep this in mind when looking at this in the future. Maybe there's
something fishy in there that we could improve.

> 
> >> We have it in mainline as well, but one value (default 10ms) for both
> >> directions. There was discussion to map it to the driver's
> >> translation_latency instead.
> > 
> > Which can be filled wrong sometimes :(
> > 
> >>
> >> In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`.
> >>
> >> So on `up` your rate is as small as possible (only respecting the
> >> driver's translation_latency) but on `down` you use much more than that. 
> >>
> >> Why exactly do you have this higher value on `down`? My hunch is
> >> scenarios in which the CPU (all CPUs in the freq. domain) goes idle,
> >> so util_est is 0 and the blocked utilization is decaying (too fast,
> >> 4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency
> >> again when the CPU wakes up in those 20ms?   
> > 
> > The down filter prevents changing the frequency to a lower value. So it's
> > a holding function to keep the residency at a higher frequency for at least
> > 20ms. It is, sort of, similar to the max() functions you used above. The max
> > function will allow you to follow the fasting ramping up signal on the way up,
> > and the slowest ramping down one on the way down.
> > 
> > I think this is more deterministic way to do it.
> 
> But a faster PELT wouldn't help here, quite the opposite.

I didn't mention PELT here. I was comparing util_est max() to the filter in
schedutil.

> [...]
> 
> >>>> Also IIUC Dietmar's results, the problem seems more linked to the
> >>>> selection of a higher freq than increasing the utilization;
> >>>> runnable_avg tests give similar perf results than shorter half life
> >>>> and better power consumption.
> >>>
> >>> Does it ramp down faster too?
> >>
> >> Not sure why you are interested in this? Can't be related to the
> >> `driving DVFS` functionality discussed above.
> > 
> > If you change the reaction time to be more aggressive in going up, then it's
> > only natural to have it symmetrical so your residency on the power hungry OPPs
> > don't go over the roof and end up with thermal and power issues.
> 
> But you apply this 20ms down rate limit on the big cores too?
> 
> > I am concerned about us biasing towrads perf first too much and not enabling
> > sys admins to select a proper trade off for their system and use case. Which
> > are not static. The workloads the system needs to accommodate to are abundant
> > and operating conditions could change. And the diversity of hardware available
> > out there is huge - I am not sure how can we expect we can have one response to
> > accommodate for all of them.
> > 
> > What I'm trying to push for here is that we should look at the chain as one
> > unit. And we should consider that there's important trade-off to be had here;
> > having a sensible default doesn't mean the user shouldn't be allowed to select
> > a different trade-off. I'm not sure the problem can be generalized and fixed
> > automatically. But happy to be proven wrong of course :-)
> > 
> > FWIW, I'm trying to tweak all these knobs and study their impact. Do you mind
> > pasting the patch for load_avg consideration so I can take it into account too
> > in my experiments?
> 
> Just posted it:
> 
> https://lkml.kernel.org/r/20230406155030.1989554-1-dietmar.eggemann@arm.com

Thanks a lot! I'll revisit the whole story taking into account the relationship
with all these other controls. I will need some time though. But I will get back
with some data hopefully to help us pave the right way. I think we shredded
this thread to pieces enough :)


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2023-04-11 17:51 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-29  5:54 [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Dietmar Eggemann
2022-08-29  5:54 ` [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier Dietmar Eggemann
2022-08-29  8:08   ` Peter Zijlstra
2022-08-29 10:02     ` Peter Zijlstra
2022-08-29 10:13       ` Vincent Guittot
2022-08-29 14:23         ` Quentin Perret
2022-08-29 14:34           ` Peter Zijlstra
2022-08-29 15:31             ` Quentin Perret
2022-08-29 15:48             ` Quentin Perret
2022-09-02  7:53         ` Dietmar Eggemann
2022-09-02  8:45           ` Peter Zijlstra
2022-09-06  5:49           ` Vincent Guittot
2022-09-08  6:50             ` Dietmar Eggemann
2022-09-02  7:53       ` Dietmar Eggemann
2022-09-02  8:45         ` Peter Zijlstra
2022-09-20 14:07 ` [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Jian-Min Liu
2022-09-28 17:09   ` Dietmar Eggemann
2022-09-29  9:47   ` Peter Zijlstra
2022-09-29 11:07     ` Dietmar Eggemann
2022-09-29 11:10     ` Kajetan Puchalski
2022-09-29 11:21       ` Peter Zijlstra
2022-09-29 14:41         ` Kajetan Puchalski
2022-10-03 22:57           ` Wei Wang
2022-10-04  9:33             ` Dietmar Eggemann
2022-10-05 16:57               ` Wei Wang
2022-11-07 13:41           ` Peter Zijlstra
2022-11-08 19:48             ` Qais Yousef
2022-11-09 15:49               ` Peter Zijlstra
2022-11-10 13:25                 ` Qais Yousef
2023-02-07 10:29                 ` Dietmar Eggemann
2023-02-09 16:16                   ` Vincent Guittot
2023-02-17 13:54                     ` Dietmar Eggemann
2023-02-20 13:54                       ` Vincent Guittot
2023-02-21  9:29                         ` Vincent Guittot
2023-02-22 20:28                           ` Dietmar Eggemann
2023-03-01 10:24                             ` Vincent Guittot
2023-02-22 20:13                         ` Dietmar Eggemann
2023-03-02 19:36                           ` Dietmar Eggemann
2023-02-20 10:13                     ` Peter Zijlstra
2023-02-20 13:39                       ` Vincent Guittot
2023-02-23 15:37                     ` Qais Yousef
2023-03-01 10:39                       ` Vincent Guittot
2023-03-01 17:24                         ` Qais Yousef
2023-03-02  8:00                           ` Vincent Guittot
2023-03-02 19:39                             ` Dietmar Eggemann
2023-03-06 19:11                             ` Qais Yousef
2023-03-07 13:22                               ` Vincent Guittot
2023-03-11 16:55                                 ` Qais Yousef
2023-03-23 16:29                           ` Dietmar Eggemann
2023-04-03 14:45                             ` Qais Yousef
2023-04-06 15:58                               ` Dietmar Eggemann
2023-04-11 17:51                                 ` Qais Yousef
2022-11-09 15:18             ` Lukasz Luba
2022-11-10 11:16             ` Dietmar Eggemann
2022-11-10 13:05               ` Peter Zijlstra
2022-11-10 14:59                 ` Dietmar Eggemann
2022-11-10 17:51                   ` Peter Zijlstra
2022-11-30 18:14                     ` Dietmar Eggemann
2022-12-01 13:37                       ` Kajetan Puchalski
2022-11-10 12:45             ` Kajetan Puchalski
2022-11-07  9:41     ` Jian-Min Liu (劉建旻)
