* [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime @ 2022-08-29 5:54 Dietmar Eggemann 2022-08-29 5:54 ` [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier Dietmar Eggemann 2022-09-20 14:07 ` [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Jian-Min Liu 0 siblings, 2 replies; 61+ messages in thread From: Dietmar Eggemann @ 2022-08-29 5:54 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Morten Rasmussen, Vincent Donnefort Cc: Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel Many Android devices still prefer to run PELT with a shorter halflife than the hardcoded value of 32ms in mainline. The Android folks claim better response time of display pipeline tasks (higher min and avg fps for 60, 90 or 120Hz refresh rates). Some benchmarks, like PCMark web browsing, show higher scores when running with a 16ms or 8ms PELT halflife. The gain in response time and performance is considered to outweigh the increase in energy consumption in these cases. The original idea of introducing a PELT halflife compile-time option for 32, 16, 8ms from Patrick Bellasi in 2018 https://lkml.kernel.org/r/20180409165134.707-1-patrick.bellasi@arm.com wasn't integrated into mainline, mainly because it breaks the PELT stability requirement (see (1) below). We have been experimenting with a new idea from Morten Rasmussen to instead introduce an additional clock between the task clock and the pelt clock. This way the effect of a shorter PELT halflife of 8ms or 16ms can be achieved by left-shifting the elapsed time. This is similar to the use of time-shifting of the pelt clock to achieve scale invariance in PELT. The implementation is from Vincent Donnefort with some minor modifications to align with current tip sched/core. --- Known potential issues: (1) PELT stability requirement The PELT halflife has to be larger than or equal to the scheduling period.
The sched_period (sysctl_sched_latency) of a typical mobile device with 8 CPUs with the default logarithmic tuning is 24ms, so only the 32ms PELT halflife meets this. A shorter halflife of 16ms or even 8ms would break it. It looks like this problem might not exist anymore because of the PELT rewrite in 2015, i.e. with commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking"). Since then, sched entities (tasks & task groups) and cfs_rq's are maintained independently rather than each entity update maintaining the cfs_rq at the same time. This seems to mitigate the issue that the cfs_rq signal is not correct when not all runnable entities are able to do a self-update during a PELT halflife window. That said, I'm not entirely sure whether the entity-cfs_rq synchronization is the only issue behind this PELT stability requirement. (2) PELT utilization versus util_est (estimated utilization) The PELT signal of a periodic task oscillates with a higher peak amplitude when using a smaller halflife. For a typical periodic task of the display pipeline with a runtime/period of 8ms/16ms, the peak amplitude is at ~40 for 32ms, at ~80 for 16ms and at ~160 for 8ms. Util_est stores the util_avg peak as util_est.enqueued per task. With an additional exponentially weighted moving average (ewma) to smooth task utilization decreases, util_est values of the runnable tasks are aggregated on the root cfs_rq. CPU and task utilization for CPU frequency selection and task placement is the max value out of util_est and util_avg. I.e., because of how util_est is implemented, higher CPU Operating Performance Points and more capable CPUs are already chosen when using a smaller PELT halflife. (3) Wrong PELT history when switching the PELT multiplier The PELT history becomes stale the moment the PELT multiplier is changed during runtime.
So all decisions based on PELT are skewed for the time interval needed to produce LOAD_AVG_MAX (the sum of the infinite geometric series), which is ~345ms for halflife=32ms (smaller for 8ms or 16ms). Rate limiting the PELT multiplier change to this interval does not solve the issue here, so the user would have to live with possibly incorrect decisions during these PELT multiplier transition times. --- It looks like individual task boosting, e.g. via uclamp_min, possibly abstracted by middleware frameworks like the Android Dynamic Performance Framework (ADPF), would be the way to go here, but until this is fully available and adopted, some Android folks will still prefer the overall system boosting they achieve by running with a shorter PELT halflife. Vincent Donnefort (1): sched/pelt: Introduce PELT multiplier kernel/sched/core.c | 2 +- kernel/sched/pelt.c | 60 ++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/pelt.h | 42 ++++++++++++++++++++++++++++--- kernel/sched/sched.h | 1 + 4 files changed, 100 insertions(+), 5 deletions(-) -- 2.25.1 ^ permalink raw reply [flat|nested] 61+ messages in thread
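The oscillation amplitudes in (2) and the decay horizon in (3) can be sanity-checked with a toy model of the PELT geometric series. The sketch below is a simplified floating-point, per-millisecond model, not the kernel's fixed-point implementation (which works in 1024us segments with integer math), and reading "peak amplitude" as half the peak-to-trough swing is an assumption here; with that reading the model lands close to the quoted numbers.

```python
# Toy per-millisecond model of PELT (float math, not the kernel's
# fixed-point implementation): each ms the signal decays by
# y = 0.5**(1/halflife) and, while the task runs, accumulates towards 1024.

def pelt_amplitude(halflife_ms, run_ms=8, period_ms=16, total_ms=2000):
    """Steady-state oscillation amplitude (half the peak-to-trough swing)
    of util for a periodic task, e.g. the 8ms/16ms display pipeline task."""
    y = 0.5 ** (1.0 / halflife_ms)
    util = 0.0
    samples = []
    for t in range(total_ms):
        running = (t % period_ms) < run_ms
        util = util * y + (1024.0 * (1.0 - y) if running else 0.0)
        if t >= total_ms - period_ms:   # sample one steady-state period
            samples.append(util)
    return (max(samples) - min(samples)) / 2.0

def decay_horizon_ms(halflife_ms=32):
    """First ms at which a full-weight (1024) contribution decays below 1."""
    y = 0.5 ** (1.0 / halflife_ms)
    n = 0
    while 1024.0 * y ** n >= 1.0:
        n += 1
    return n

for hl in (32, 16, 8):
    print(f"halflife {hl:2d}ms -> amplitude ~{pelt_amplitude(hl):.0f}")
print(f"decay horizon at 32ms halflife: ~{decay_horizon_ms()}ms")
```

Under this model the amplitudes come out at roughly 44, 88 and 171 for 32/16/8ms — the same halving-the-halflife-doubles-the-swing pattern as the ~40/~80/~160 above — and the float decay horizon (~320ms) is in the same ballpark as the ~345ms that the kernel's integer series produces.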
* [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-08-29 5:54 [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Dietmar Eggemann @ 2022-08-29 5:54 ` Dietmar Eggemann 2022-08-29 8:08 ` Peter Zijlstra 2022-09-20 14:07 ` [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Jian-Min Liu 1 sibling, 1 reply; 61+ messages in thread From: Dietmar Eggemann @ 2022-08-29 5:54 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Morten Rasmussen, Vincent Donnefort Cc: Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel From: Vincent Donnefort <vincent.donnefort@arm.com> The new sysctl sched_pelt_multiplier allows a user to set a clock multiplier to x2 or x4 (x1 being the default). This clock multiplier artificially speeds up PELT ramp up/down similarly to use a faster half-life than the default 32ms. - x1: 32ms half-life - x2: 16ms half-life - x4: 8ms half-life Internally, a new clock is created: rq->clock_task_mult. It sits in the clock hierarchy between rq->clock_task and rq->clock_pelt. 
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> --- kernel/sched/core.c | 2 +- kernel/sched/pelt.c | 60 ++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/pelt.h | 42 ++++++++++++++++++++++++++++--- kernel/sched/sched.h | 1 + 4 files changed, 100 insertions(+), 5 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 603a80ec9b0e..6203cead4ad3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -727,7 +727,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) update_irq_load_avg(rq, irq_delta + steal); #endif - update_rq_clock_pelt(rq, delta); + update_rq_clock_task_mult(rq, delta); } void update_rq_clock(struct rq *rq) diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c index 9adfc4744544..9fa853a64269 100644 --- a/kernel/sched/pelt.c +++ b/kernel/sched/pelt.c @@ -472,3 +472,63 @@ int update_irq_load_avg(struct rq *rq, u64 running) return ret; } #endif + +__read_mostly unsigned int sched_pelt_lshift; + +#ifdef CONFIG_SYSCTL +static unsigned int sysctl_sched_pelt_multiplier = 1; + +int sched_pelt_multiplier(struct ctl_table *table, int write, void *buffer, + size_t *lenp, loff_t *ppos) +{ + static DEFINE_MUTEX(mutex); + unsigned int old; + int ret; + + mutex_lock(&mutex); + old = sysctl_sched_pelt_multiplier; + ret = proc_dointvec(table, write, buffer, lenp, ppos); + if (ret) + goto undo; + if (!write) + goto done; + + switch (sysctl_sched_pelt_multiplier) { + case 1: + fallthrough; + case 2: + fallthrough; + case 4: + WRITE_ONCE(sched_pelt_lshift, + sysctl_sched_pelt_multiplier >> 1); + goto done; + default: + ret = -EINVAL; + } + +undo: + sysctl_sched_pelt_multiplier = old; +done: + mutex_unlock(&mutex); + + return ret; +} + +static struct ctl_table sched_pelt_sysctls[] = { + { + .procname = "sched_pelt_multiplier", + .data = &sysctl_sched_pelt_multiplier, + .maxlen = sizeof(unsigned 
int), + .mode = 0644, + .proc_handler = sched_pelt_multiplier, + }, + {} +}; + +static int __init sched_pelt_sysctl_init(void) +{ + register_sysctl_init("kernel", sched_pelt_sysctls); + return 0; +} +late_initcall(sched_pelt_sysctl_init); +#endif diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h index 3a0e0dc28721..9b35b5072bae 100644 --- a/kernel/sched/pelt.h +++ b/kernel/sched/pelt.h @@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct sched_avg *avg) WRITE_ONCE(avg->util_est.enqueued, enqueued); } +static inline u64 rq_clock_task_mult(struct rq *rq) +{ + lockdep_assert_rq_held(rq); + assert_clock_updated(rq); + + return rq->clock_task_mult; +} + static inline u64 rq_clock_pelt(struct rq *rq) { lockdep_assert_rq_held(rq); @@ -72,7 +80,7 @@ static inline u64 rq_clock_pelt(struct rq *rq) /* The rq is idle, we can sync to clock_task */ static inline void _update_idle_rq_clock_pelt(struct rq *rq) { - rq->clock_pelt = rq_clock_task(rq); + rq->clock_pelt = rq_clock_task_mult(rq); u64_u32_store(rq->clock_idle, rq_clock(rq)); /* Paired with smp_rmb in migrate_se_pelt_lag() */ @@ -121,6 +129,27 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta) rq->clock_pelt += delta; } +extern unsigned int sched_pelt_lshift; + +/* + * absolute time |1 |2 |3 |4 |5 |6 | + * @ mult = 1 --------****************--------****************- + * @ mult = 2 --------********----------------********--------- + * @ mult = 4 --------****--------------------****------------- + * clock task mult + * @ mult = 2 | | |2 |3 | | | | |5 |6 | | | + * @ mult = 4 | | | | |2|3| | | | | | | | | | |5|6| | | | | | | + * + */ +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta) +{ + delta <<= READ_ONCE(sched_pelt_lshift); + + rq->clock_task_mult += delta; + + update_rq_clock_pelt(rq, delta); +} + /* * When rq becomes idle, we have to check if it has lost idle time * because it was fully busy. 
A rq is fully used when the /Sum util_sum @@ -147,7 +176,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq) * rq's clock_task. */ if (util_sum >= divider) - rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt; + rq->lost_idle_time += rq_clock_task_mult(rq) - rq->clock_pelt; _update_idle_rq_clock_pelt(rq); } @@ -218,13 +247,18 @@ update_irq_load_avg(struct rq *rq, u64 running) return 0; } -static inline u64 rq_clock_pelt(struct rq *rq) +static inline u64 rq_clock_task_mult(struct rq *rq) { return rq_clock_task(rq); } +static inline u64 rq_clock_pelt(struct rq *rq) +{ + return rq_clock_task_mult(rq); +} + static inline void -update_rq_clock_pelt(struct rq *rq, s64 delta) { } +update_rq_clock_task_mult(struct rq *rq, s64 delta) { } static inline void update_idle_rq_clock_pelt(struct rq *rq) { } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index da17be6f27fd..62fc3cf10ea7 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1007,6 +1007,7 @@ struct rq { u64 clock; /* Ensure that all clocks are in the same cache line */ u64 clock_task ____cacheline_aligned; + u64 clock_task_mult; u64 clock_pelt; unsigned long lost_idle_time; u64 clock_pelt_idle; -- 2.25.1 ^ permalink raw reply related [flat|nested] 61+ messages in thread
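The core trick of the patch — left-shifting the task-clock delta before it feeds the PELT clock — is mathematically the same as decaying with a halved (x2) or quartered (x4) halflife, because 0.5**((delta << s)/32) == 0.5**(delta/(32 >> s)). A minimal standalone check of that identity (plain Python over the exponential decay math, not kernel code):

```python
# Decaying over a left-shifted time delta with the default 32ms halflife
# equals decaying over the real delta with a 16ms (shift 1) or 8ms
# (shift 2) halflife. Pure-math sketch of the patch's clock trick.

def decay(value, delta_ms, halflife_ms):
    return value * 0.5 ** (delta_ms / halflife_ms)

for shift, halflife in ((0, 32), (1, 16), (2, 8)):
    for delta in (1, 4, 24, 345):
        shifted = decay(1024.0, delta << shift, 32)   # multiplied clock
        direct = decay(1024.0, delta, halflife)       # true shorter halflife
        assert abs(shifted - direct) < 1e-9, (shift, delta)
print("left-shifted clock matches shorter halflife")
```

This is why a single extra clock (rq->clock_task_mult) is enough: no per-halflife decay tables are needed, only a faster-running time base.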
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-08-29 5:54 ` [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier Dietmar Eggemann @ 2022-08-29 8:08 ` Peter Zijlstra 2022-08-29 10:02 ` Peter Zijlstra 0 siblings, 1 reply; 61+ messages in thread From: Peter Zijlstra @ 2022-08-29 8:08 UTC (permalink / raw) To: Dietmar Eggemann Cc: Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote: > From: Vincent Donnefort <vincent.donnefort@arm.com> > > The new sysctl sched_pelt_multiplier allows a user to set a clock > multiplier to x2 or x4 (x1 being the default). This clock multiplier > artificially speeds up PELT ramp up/down similarly to use a faster > half-life than the default 32ms. > > - x1: 32ms half-life > - x2: 16ms half-life > - x4: 8ms half-life > > Internally, a new clock is created: rq->clock_task_mult. It sits in the > clock hierarchy between rq->clock_task and rq->clock_pelt. > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > +extern unsigned int sched_pelt_lshift; > + > +/* > + * absolute time |1 |2 |3 |4 |5 |6 | > + * @ mult = 1 --------****************--------****************- > + * @ mult = 2 --------********----------------********--------- > + * @ mult = 4 --------****--------------------****------------- > + * clock task mult > + * @ mult = 2 | | |2 |3 | | | | |5 |6 | | | > + * @ mult = 4 | | | | |2|3| | | | | | | | | | |5|6| | | | | | | > + * > + */ > +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta) > +{ > + delta <<= READ_ONCE(sched_pelt_lshift); > + > + rq->clock_task_mult += delta; > + > + update_rq_clock_pelt(rq, delta); > +} Hurmph... I'd almost go write you something like static_call()/static_branch() but for immediates. 
That said; given there's only like 3 options, perhaps a few static_branch() instances work just fine ?
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-08-29 8:08 ` Peter Zijlstra @ 2022-08-29 10:02 ` Peter Zijlstra 2022-08-29 10:13 ` Vincent Guittot 2022-09-02 7:53 ` Dietmar Eggemann 0 siblings, 2 replies; 61+ messages in thread From: Peter Zijlstra @ 2022-08-29 10:02 UTC (permalink / raw) To: Dietmar Eggemann Cc: Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote: > On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote: > > From: Vincent Donnefort <vincent.donnefort@arm.com> > > > > The new sysctl sched_pelt_multiplier allows a user to set a clock > > multiplier to x2 or x4 (x1 being the default). This clock multiplier > > artificially speeds up PELT ramp up/down similarly to use a faster > > half-life than the default 32ms. > > > > - x1: 32ms half-life > > - x2: 16ms half-life > > - x4: 8ms half-life > > > > Internally, a new clock is created: rq->clock_task_mult. It sits in the > > clock hierarchy between rq->clock_task and rq->clock_pelt. > > > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > > > +extern unsigned int sched_pelt_lshift; > > + > > +/* > > + * absolute time |1 |2 |3 |4 |5 |6 | > > + * @ mult = 1 --------****************--------****************- > > + * @ mult = 2 --------********----------------********--------- > > + * @ mult = 4 --------****--------------------****------------- > > + * clock task mult > > + * @ mult = 2 | | |2 |3 | | | | |5 |6 | | | > > + * @ mult = 4 | | | | |2|3| | | | | | | | | | |5|6| | | | | | | > > + * > > + */ > > +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta) > > +{ > > + delta <<= READ_ONCE(sched_pelt_lshift); > > + > > + rq->clock_task_mult += delta; > > + > > + update_rq_clock_pelt(rq, delta); > > +} > > Hurmph... 
I'd almost go write you something like > static_call()/static_branch() but for immediates. > > That said; given there's only like 3 options, perhaps a few > static_branch() instances work just fine ? Also, I'm not at all sure about exposing that as an official sysctl. How about something like so? --- Subject: sched/pelt: Introduce PELT multiplier From: Vincent Donnefort <vincent.donnefort@arm.com> Date: Mon, 29 Aug 2022 07:54:50 +0200 From: Vincent Donnefort <vincent.donnefort@arm.com> The new sysctl sched_pelt_multiplier allows a user to set a clock multiplier to x2 or x4 (x1 being the default). This clock multiplier artificially speeds up PELT ramp up/down similarly to use a faster half-life than the default 32ms. - x1: 32ms half-life - x2: 16ms half-life - x4: 8ms half-life Internally, a new clock is created: rq->clock_task_mult. It sits in the clock hierarchy between rq->clock_task and rq->clock_pelt. [peterz: Use sched_feat()] Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/core.c | 2 +- kernel/sched/features.h | 3 +++ kernel/sched/pelt.h | 45 +++++++++++++++++++++++++++++++++++++++++---- kernel/sched/sched.h | 1 + 4 files changed, 46 insertions(+), 5 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -727,7 +727,7 @@ static void update_rq_clock_task(struct if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) update_irq_load_avg(rq, irq_delta + steal); #endif - update_rq_clock_pelt(rq, delta); + update_rq_clock_task_mult(rq, delta); } void update_rq_clock(struct rq *rq) --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -101,3 +101,6 @@ SCHED_FEAT(LATENCY_WARN, false) SCHED_FEAT(ALT_PERIOD, true) SCHED_FEAT(BASE_SLICE, true) + +SCHED_FEAT(PELT_M2, false) +SCHED_FEAT(PELT_M4, false) --- a/kernel/sched/pelt.h +++ b/kernel/sched/pelt.h @@ -61,6 +61,14 @@ static inline void 
cfs_se_util_change(st WRITE_ONCE(avg->util_est.enqueued, enqueued); } +static inline u64 rq_clock_task_mult(struct rq *rq) +{ + lockdep_assert_rq_held(rq); + assert_clock_updated(rq); + + return rq->clock_task_mult; +} + static inline u64 rq_clock_pelt(struct rq *rq) { lockdep_assert_rq_held(rq); @@ -72,7 +80,7 @@ static inline u64 rq_clock_pelt(struct r /* The rq is idle, we can sync to clock_task */ static inline void _update_idle_rq_clock_pelt(struct rq *rq) { - rq->clock_pelt = rq_clock_task(rq); + rq->clock_pelt = rq_clock_task_mult(rq); u64_u32_store(rq->clock_idle, rq_clock(rq)); /* Paired with smp_rmb in migrate_se_pelt_lag() */ @@ -121,6 +129,30 @@ static inline void update_rq_clock_pelt( rq->clock_pelt += delta; } +extern unsigned int sched_pelt_lshift; + +/* + * absolute time |1 |2 |3 |4 |5 |6 | + * @ mult = 1 --------****************--------****************- + * @ mult = 2 --------********----------------********--------- + * @ mult = 4 --------****--------------------****------------- + * clock task mult + * @ mult = 2 | | |2 |3 | | | | |5 |6 | | | + * @ mult = 4 | | | | |2|3| | | | | | | | | | |5|6| | | | | | | + * + */ +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta) +{ + if (sched_feat(PELT_M2)) + delta *= 2; + if (sched_feat(PELT_M4)) + delta *= 4; + + rq->clock_task_mult += delta; + + update_rq_clock_pelt(rq, delta); +} + /* * When rq becomes idle, we have to check if it has lost idle time * because it was fully busy. A rq is fully used when the /Sum util_sum @@ -147,7 +179,7 @@ static inline void update_idle_rq_clock_ * rq's clock_task. 
*/ if (util_sum >= divider) - rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt; + rq->lost_idle_time += rq_clock_task_mult(rq) - rq->clock_pelt; _update_idle_rq_clock_pelt(rq); } @@ -218,13 +250,18 @@ update_irq_load_avg(struct rq *rq, u64 r return 0; } -static inline u64 rq_clock_pelt(struct rq *rq) +static inline u64 rq_clock_task_mult(struct rq *rq) { return rq_clock_task(rq); } +static inline u64 rq_clock_pelt(struct rq *rq) +{ + return rq_clock_task_mult(rq); +} + static inline void -update_rq_clock_pelt(struct rq *rq, s64 delta) { } +update_rq_clock_task_mult(struct rq *rq, s64 delta) { } static inline void update_idle_rq_clock_pelt(struct rq *rq) { } --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1009,6 +1009,7 @@ struct rq { u64 clock; /* Ensure that all clocks are in the same cache line */ u64 clock_task ____cacheline_aligned; + u64 clock_task_mult; u64 clock_pelt; unsigned long lost_idle_time; u64 clock_pelt_idle; ^ permalink raw reply [flat|nested] 61+ messages in thread
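For comparison: the sysctl version derives the shift as sysctl_sched_pelt_multiplier >> 1 (so 1 -> x1, 2 -> x2, 4 -> x4), while this sched_feat() variant multiplies the delta directly. A standalone sketch of the two delta scalings (plain Python with made-up helper names, not kernel code) shows they agree for the valid multipliers, and also the one behavioural quirk of the sched_feat() variant: enabling PELT_M2 and PELT_M4 together compounds to x8.

```python
# Userspace sketch of the two delta-scaling schemes; the function names
# are illustrative only and do not exist in the kernel.

def delta_sysctl(delta, multiplier):
    """Original patch: multiplier in {1, 2, 4} becomes a left shift."""
    lshift = multiplier >> 1          # 1 -> 0, 2 -> 1, 4 -> 2
    return delta << lshift

def delta_feat(delta, m2=False, m4=False):
    """sched_feat() variant: independent x2/x4 feature multiplies."""
    if m2:
        delta *= 2
    if m4:
        delta *= 4
    return delta

assert delta_sysctl(1000, 1) == delta_feat(1000) == 1000
assert delta_sysctl(1000, 2) == delta_feat(1000, m2=True) == 2000
assert delta_sysctl(1000, 4) == delta_feat(1000, m4=True) == 4000
assert delta_feat(1000, m2=True, m4=True) == 8000   # both features: x8
print("scalings agree for the valid multipliers")
```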
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-08-29 10:02 ` Peter Zijlstra @ 2022-08-29 10:13 ` Vincent Guittot 2022-08-29 14:23 ` Quentin Perret 2022-09-02 7:53 ` Dietmar Eggemann 2022-09-02 7:53 ` Dietmar Eggemann 1 sibling, 2 replies; 61+ messages in thread From: Vincent Guittot @ 2022-08-29 10:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Dietmar Eggemann, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On Mon, 29 Aug 2022 at 12:03, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote: > > On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote: > > > From: Vincent Donnefort <vincent.donnefort@arm.com> > > > > > > The new sysctl sched_pelt_multiplier allows a user to set a clock > > > multiplier to x2 or x4 (x1 being the default). This clock multiplier > > > artificially speeds up PELT ramp up/down similarly to use a faster > > > half-life than the default 32ms. > > > > > > - x1: 32ms half-life > > > - x2: 16ms half-life > > > - x4: 8ms half-life > > > > > > Internally, a new clock is created: rq->clock_task_mult. It sits in the > > > clock hierarchy between rq->clock_task and rq->clock_pelt. 
> > > > > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> > > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > > > > > +extern unsigned int sched_pelt_lshift; > > > + > > > +/* > > > + * absolute time |1 |2 |3 |4 |5 |6 | > > > + * @ mult = 1 --------****************--------****************- > > > + * @ mult = 2 --------********----------------********--------- > > > + * @ mult = 4 --------****--------------------****------------- > > > + * clock task mult > > > + * @ mult = 2 | | |2 |3 | | | | |5 |6 | | | > > > + * @ mult = 4 | | | | |2|3| | | | | | | | | | |5|6| | | | | | | > > > + * > > > + */ > > > +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta) > > > +{ > > > + delta <<= READ_ONCE(sched_pelt_lshift); > > > + > > > + rq->clock_task_mult += delta; > > > + > > > + update_rq_clock_pelt(rq, delta); > > > +} > > > > Hurmph... I'd almost go write you something like > > static_call()/static_branch() but for immediates. > > > > That said; given there's only like 3 options, perhaps a few > > static_branch() instances work just fine ? > > Also, I'm not at all sure about exposing that as an official sysctl. Me too, I would even make it a boot time parameter so we can remove the new clock_task_mult clock and left shift clock_taslk or the delta before passing it to clock_pelt > > How about something like so? > > --- > Subject: sched/pelt: Introduce PELT multiplier > From: Vincent Donnefort <vincent.donnefort@arm.com> > Date: Mon, 29 Aug 2022 07:54:50 +0200 > > From: Vincent Donnefort <vincent.donnefort@arm.com> > > The new sysctl sched_pelt_multiplier allows a user to set a clock > multiplier to x2 or x4 (x1 being the default). This clock multiplier > artificially speeds up PELT ramp up/down similarly to use a faster > half-life than the default 32ms. > > - x1: 32ms half-life > - x2: 16ms half-life > - x4: 8ms half-life > > Internally, a new clock is created: rq->clock_task_mult. 
It sits in the > clock hierarchy between rq->clock_task and rq->clock_pelt. > > [peterz: Use sched_feat()] > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > kernel/sched/core.c | 2 +- > kernel/sched/features.h | 3 +++ > kernel/sched/pelt.h | 45 +++++++++++++++++++++++++++++++++++++++++---- > kernel/sched/sched.h | 1 + > 4 files changed, 46 insertions(+), 5 deletions(-) > > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -727,7 +727,7 @@ static void update_rq_clock_task(struct > if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) > update_irq_load_avg(rq, irq_delta + steal); > #endif > - update_rq_clock_pelt(rq, delta); > + update_rq_clock_task_mult(rq, delta); > } > > void update_rq_clock(struct rq *rq) > --- a/kernel/sched/features.h > +++ b/kernel/sched/features.h > @@ -101,3 +101,6 @@ SCHED_FEAT(LATENCY_WARN, false) > > SCHED_FEAT(ALT_PERIOD, true) > SCHED_FEAT(BASE_SLICE, true) > + > +SCHED_FEAT(PELT_M2, false) > +SCHED_FEAT(PELT_M4, false) > --- a/kernel/sched/pelt.h > +++ b/kernel/sched/pelt.h > @@ -61,6 +61,14 @@ static inline void cfs_se_util_change(st > WRITE_ONCE(avg->util_est.enqueued, enqueued); > } > > +static inline u64 rq_clock_task_mult(struct rq *rq) > +{ > + lockdep_assert_rq_held(rq); > + assert_clock_updated(rq); > + > + return rq->clock_task_mult; > +} > + > static inline u64 rq_clock_pelt(struct rq *rq) > { > lockdep_assert_rq_held(rq); > @@ -72,7 +80,7 @@ static inline u64 rq_clock_pelt(struct r > /* The rq is idle, we can sync to clock_task */ > static inline void _update_idle_rq_clock_pelt(struct rq *rq) > { > - rq->clock_pelt = rq_clock_task(rq); > + rq->clock_pelt = rq_clock_task_mult(rq); > > u64_u32_store(rq->clock_idle, rq_clock(rq)); > /* Paired with smp_rmb in migrate_se_pelt_lag() */ > @@ -121,6 +129,30 @@ static inline void update_rq_clock_pelt( > rq->clock_pelt += delta; > 
} > > +extern unsigned int sched_pelt_lshift; > + > +/* > + * absolute time |1 |2 |3 |4 |5 |6 | > + * @ mult = 1 --------****************--------****************- > + * @ mult = 2 --------********----------------********--------- > + * @ mult = 4 --------****--------------------****------------- > + * clock task mult > + * @ mult = 2 | | |2 |3 | | | | |5 |6 | | | > + * @ mult = 4 | | | | |2|3| | | | | | | | | | |5|6| | | | | | | > + * > + */ > +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta) > +{ > + if (sched_feat(PELT_M2)) > + delta *= 2; > + if (sched_feat(PELT_M4)) > + delta *= 4; > + > + rq->clock_task_mult += delta; > + > + update_rq_clock_pelt(rq, delta); > +} > + > /* > * When rq becomes idle, we have to check if it has lost idle time > * because it was fully busy. A rq is fully used when the /Sum util_sum > @@ -147,7 +179,7 @@ static inline void update_idle_rq_clock_ > * rq's clock_task. > */ > if (util_sum >= divider) > - rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt; > + rq->lost_idle_time += rq_clock_task_mult(rq) - rq->clock_pelt; > > _update_idle_rq_clock_pelt(rq); > } > @@ -218,13 +250,18 @@ update_irq_load_avg(struct rq *rq, u64 r > return 0; > } > > -static inline u64 rq_clock_pelt(struct rq *rq) > +static inline u64 rq_clock_task_mult(struct rq *rq) > { > return rq_clock_task(rq); > } > > +static inline u64 rq_clock_pelt(struct rq *rq) > +{ > + return rq_clock_task_mult(rq); > +} > + > static inline void > -update_rq_clock_pelt(struct rq *rq, s64 delta) { } > +update_rq_clock_task_mult(struct rq *rq, s64 delta) { } > > static inline void > update_idle_rq_clock_pelt(struct rq *rq) { } > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1009,6 +1009,7 @@ struct rq { > u64 clock; > /* Ensure that all clocks are in the same cache line */ > u64 clock_task ____cacheline_aligned; > + u64 clock_task_mult; > u64 clock_pelt; > unsigned long lost_idle_time; > u64 clock_pelt_idle; ^ permalink raw reply 
[flat|nested] 61+ messages in thread
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-08-29 10:13 ` Vincent Guittot @ 2022-08-29 14:23 ` Quentin Perret 2022-08-29 14:34 ` Peter Zijlstra 2022-09-02 7:53 ` Dietmar Eggemann 1 sibling, 1 reply; 61+ messages in thread From: Quentin Perret @ 2022-08-29 14:23 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, Dietmar Eggemann, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On Monday 29 Aug 2022 at 12:13:26 (+0200), Vincent Guittot wrote: > On Mon, 29 Aug 2022 at 12:03, Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote: > > > On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote: > > > > From: Vincent Donnefort <vincent.donnefort@arm.com> > > > > > > > > The new sysctl sched_pelt_multiplier allows a user to set a clock > > > > multiplier to x2 or x4 (x1 being the default). This clock multiplier > > > > artificially speeds up PELT ramp up/down similarly to use a faster > > > > half-life than the default 32ms. > > > > > > > > - x1: 32ms half-life > > > > - x2: 16ms half-life > > > > - x4: 8ms half-life > > > > > > > > Internally, a new clock is created: rq->clock_task_mult. It sits in the > > > > clock hierarchy between rq->clock_task and rq->clock_pelt. 
> > > > > > > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> > > > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > > > > > > > +extern unsigned int sched_pelt_lshift; > > > > + > > > > +/* > > > > + * absolute time |1 |2 |3 |4 |5 |6 | > > > > + * @ mult = 1 --------****************--------****************- > > > > + * @ mult = 2 --------********----------------********--------- > > > > + * @ mult = 4 --------****--------------------****------------- > > > > + * clock task mult > > > > + * @ mult = 2 | | |2 |3 | | | | |5 |6 | | | > > > > + * @ mult = 4 | | | | |2|3| | | | | | | | | | |5|6| | | | | | | > > > > + * > > > > + */ > > > > +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta) > > > > +{ > > > > + delta <<= READ_ONCE(sched_pelt_lshift); > > > > + > > > > + rq->clock_task_mult += delta; > > > > + > > > > + update_rq_clock_pelt(rq, delta); > > > > +} > > > > > > Hurmph... I'd almost go write you something like > > > static_call()/static_branch() but for immediates. > > > > > > That said; given there's only like 3 options, perhaps a few > > > static_branch() instances work just fine ? > > > > Also, I'm not at all sure about exposing that as an official sysctl. > > Me too, I would even make it a boot time parameter so we can remove > the new clock_task_mult clock and left shift clock_taslk or the delta > before passing it to clock_pelt I'll let folks in CC comment about their use-case in more detail, but there's definitely been an interest in tuning this thing at run-time FWIW. Typically a larger half-life will be fine with predictable workloads with little input from users (e.g. fullscreen video playback) while a lower one can be preferred in highly interactive cases (games, ...). The transient state is fun to reason about, but it really shouldn't be too common AFAIK. With that said I'd quite like to see numbers to back that claim.
Measuring power while running a video (for instance) with various HL values should help. And similarly it shouldn't be too hard to get performance numbers.
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-08-29 14:23 ` Quentin Perret @ 2022-08-29 14:34 ` Peter Zijlstra 2022-08-29 15:31 ` Quentin Perret 2022-08-29 15:48 ` Quentin Perret 0 siblings, 2 replies; 61+ messages in thread From: Peter Zijlstra @ 2022-08-29 14:34 UTC (permalink / raw) To: Quentin Perret Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On Mon, Aug 29, 2022 at 02:23:17PM +0000, Quentin Perret wrote: > I'll let folks in CC comment about their use-case in more detail, but > there's definitely been an interest in tuning this thing at run-time An interest and it making sense are two very distinct things that bear no relation to one another in any way. > FWIW. Typically a larger half-life will be fine with predictable > workloads with little input from users (e.g. fullscreen video playback) > while a lower one can be preferred in highly interactive cases (games, As per always; consider the combined workload. > ...). The transient state is fun to reason about, but it really > shouldn't be too common AFAIK. Once you give away control there is no taking it back, and userspace *will* do stupid things and expect unicorns. > With that said I'd quite like to see numbers to back that claim. > Measuring power while running a video (for instance) with various HL > values should help. And similarly it shouldn't be too hard to get > performance numbers. I'm thinking this all needs far more than mere numbers to justify.
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-08-29 14:34 ` Peter Zijlstra @ 2022-08-29 15:31 ` Quentin Perret 2022-08-29 15:48 ` Quentin Perret 1 sibling, 0 replies; 61+ messages in thread From: Quentin Perret @ 2022-08-29 15:31 UTC (permalink / raw) To: Peter Zijlstra Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On Monday 29 Aug 2022 at 16:34:39 (+0200), Peter Zijlstra wrote: > On Mon, Aug 29, 2022 at 02:23:17PM +0000, Quentin Perret wrote: > > FWIW. Typically a larger half-life will be fine with predictable > > workloads with little inputs from users (e.g. fullscreen video playback) > > while a lower one can be preferred in highly interactive cases (games, > > As per always; consider the combined workload. That's the problem of whoever is going to decide which HL should be used, which would all fall in userspace policy land. And yes that userspace agent would have to know quite a bit about what is going on in the system to make 'sane' decisions, but that is the case in Android. The fact that choosing the right HL for a combined workload is hard (that userspace agent might want to just default to a safe value?) doesn't mean there is no value in other cases though... I mean, if we agree that tuning the PELT HL makes sense at all (even as a cmdline arg, or sched_feat or whatever), there is only one small step to say the right PELT HL should indeed depend on the workload you're currently running (especially with _vastly_ different usage patterns seen in mobile). > > ...). The transient state is fun to reason about, but it really > > shouldn't be too common AFAIK. > > Once you give away control there is no taking it back, and userspace > *will* do stupid things and expect unicorns. Sure. If you do stupid, at some point you get(/deserve) what you asked for...
We could perhaps rate-limit changes or something if we want to make that clear? ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-08-29 14:34 ` Peter Zijlstra 2022-08-29 15:31 ` Quentin Perret @ 2022-08-29 15:48 ` Quentin Perret 1 sibling, 0 replies; 61+ messages in thread From: Quentin Perret @ 2022-08-29 15:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On Monday 29 Aug 2022 at 16:34:39 (+0200), Peter Zijlstra wrote: > On Mon, Aug 29, 2022 at 02:23:17PM +0000, Quentin Perret wrote: > > > I'll let folks in CC comment about their use-case in more details, but > > there's definitely been an interest in tuning this thing at run-time > > An interest and it making sense are two very distinct things that bear > no relation to one another in any way. And just to clarify something here as my first message was misleading, the run-time tuning of PELT HLs has been used in some Android products for a little while already, so it's more than just a mere 'interest'. That doesn't close the debate on whether this makes any sense, but I figured this should be mentioned :-) ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-08-29 10:13 ` Vincent Guittot 2022-08-29 14:23 ` Quentin Perret @ 2022-09-02 7:53 ` Dietmar Eggemann 2022-09-02 8:45 ` Peter Zijlstra 2022-09-06 5:49 ` Vincent Guittot 1 sibling, 2 replies; 61+ messages in thread From: Dietmar Eggemann @ 2022-09-02 7:53 UTC (permalink / raw) To: Vincent Guittot, Peter Zijlstra Cc: Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On 29/08/2022 12:13, Vincent Guittot wrote: > On Mon, 29 Aug 2022 at 12:03, Peter Zijlstra <peterz@infradead.org> wrote: >> >> On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote: >>> On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote: >>>> From: Vincent Donnefort <vincent.donnefort@arm.com> [...] >>> Hurmph... I'd almost go write you something like >>> static_call()/static_branch() but for immediates. >>> >>> That said; given there's only like 3 options, perhaps a few >>> static_branch() instances work just fine ? >> >> Also, I'm not at all sure about exposing that as an official sysctl. > > Me too, I would even make it a boot time parameter so we can remove Isn't a sched feature even less official than a boot parameter? But AFAIK at least some of the Android folks want to change this during runtime and they don't have debugfs mounted. > the new clock_task_mult clock and left shift clock_task or the delta > before passing it to clock_pelt We still need rq_clock_task_mult(rq), i.e. `rq->clock_task_mult` in _update_idle_rq_clock_pelt() though. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-09-02 7:53 ` Dietmar Eggemann @ 2022-09-02 8:45 ` Peter Zijlstra 2022-09-06 5:49 ` Vincent Guittot 1 sibling, 0 replies; 61+ messages in thread From: Peter Zijlstra @ 2022-09-02 8:45 UTC (permalink / raw) To: Dietmar Eggemann Cc: Vincent Guittot, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On Fri, Sep 02, 2022 at 09:53:50AM +0200, Dietmar Eggemann wrote: > But AFAIK at least some of the Android folks want to change this during > runtime and they don't have debugfs mounted. They want; I want an explanation of what exact problem is fixed how ;-) ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-09-02 7:53 ` Dietmar Eggemann @ 2022-09-02 8:45 ` Peter Zijlstra 2022-09-06 5:49 ` Vincent Guittot 1 sibling, 1 reply; 61+ messages in thread From: Vincent Guittot @ 2022-09-06 5:49 UTC (permalink / raw) To: Dietmar Eggemann Cc: Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On Fri, 2 Sept 2022 at 09:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 29/08/2022 12:13, Vincent Guittot wrote: > > On Mon, 29 Aug 2022 at 12:03, Peter Zijlstra <peterz@infradead.org> wrote: > >> > >> On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote: > >>> On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote: > >>>> From: Vincent Donnefort <vincent.donnefort@arm.com> [...] >> We still need rq_clock_task_mult(rq), i.e. `rq->clock_task_mult` in >> _update_idle_rq_clock_pelt() though. Why ? If the mult is defined at boot we just have to use "rq_clock_task(rq) << mult" instead of rq_clock_task(rq) when updating clock_pelt ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-09-06 5:49 ` Vincent Guittot @ 2022-09-08 6:50 ` Dietmar Eggemann 0 siblings, 0 replies; 61+ messages in thread From: Dietmar Eggemann @ 2022-09-08 6:50 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On 06/09/2022 07:49, Vincent Guittot wrote: > On Fri, 2 Sept 2022 at 09:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >> >> On 29/08/2022 12:13, Vincent Guittot wrote: >>> On Mon, 29 Aug 2022 at 12:03, Peter Zijlstra <peterz@infradead.org> wrote: >>>> >>>> On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote: >>>>> On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote: >>>>>> From: Vincent Donnefort <vincent.donnefort@arm.com> [...] >> We still need rq_clock_task_mult(rq), i.e. `rq->clock_task_mult` in >> _update_idle_rq_clock_pelt() though. > > Why ? If the mult is defined at boot we just have to use > "rq_clock_task(rq) << mult" instead of rq_clock_task(rq) when updating > clock_pelt Makes sense! Agreed. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-08-29 10:02 ` Peter Zijlstra 2022-08-29 10:13 ` Vincent Guittot @ 2022-09-02 7:53 ` Dietmar Eggemann 2022-09-02 8:45 ` Peter Zijlstra 1 sibling, 1 reply; 61+ messages in thread From: Dietmar Eggemann @ 2022-09-02 7:53 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On 29/08/2022 12:02, Peter Zijlstra wrote: > On Mon, Aug 29, 2022 at 10:08:13AM +0200, Peter Zijlstra wrote: >> On Mon, Aug 29, 2022 at 07:54:50AM +0200, Dietmar Eggemann wrote: >>> From: Vincent Donnefort <vincent.donnefort@arm.com> [...] >>> +extern unsigned int sched_pelt_lshift; >>> + >>> +/* >>> + * absolute time |1 |2 |3 |4 |5 |6 | >>> + * @ mult = 1 --------****************--------****************- >>> + * @ mult = 2 --------********----------------********--------- >>> + * @ mult = 4 --------****--------------------****------------- >>> + * clock task mult >>> + * @ mult = 2 | | |2 |3 | | | | |5 |6 | | | >>> + * @ mult = 4 | | | | |2|3| | | | | | | | | | |5|6| | | | | | | >>> + * >>> + */ >>> +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta) >>> +{ >>> + delta <<= READ_ONCE(sched_pelt_lshift); >>> + >>> + rq->clock_task_mult += delta; >>> + >>> + update_rq_clock_pelt(rq, delta); >>> +} >> >> Hurmph... I'd almost go write you something like >> static_call()/static_branch() but for immediates. >> >> That said; given there's only like 3 options, perhaps a few >> static_branch() instances work just fine ? > > Also, I'm not at all sure about exposing that as an official sysctl. > > How about something like so? [...] 
> void update_rq_clock(struct rq *rq)
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -101,3 +101,6 @@ SCHED_FEAT(LATENCY_WARN, false)
>
>  SCHED_FEAT(ALT_PERIOD, true)
>  SCHED_FEAT(BASE_SLICE, true)
> +
> +SCHED_FEAT(PELT_M2, false)
> +SCHED_FEAT(PELT_M4, false)

The sched features interface would definitely be much less official but I think it's not useful for Android since they don't mount debugfs anymore.

[...] ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier 2022-09-02 7:53 ` Dietmar Eggemann @ 2022-09-02 8:45 ` Peter Zijlstra 0 siblings, 0 replies; 61+ messages in thread From: Peter Zijlstra @ 2022-09-02 8:45 UTC (permalink / raw) To: Dietmar Eggemann Cc: Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Jian-Min, Qais Yousef, linux-kernel On Fri, Sep 02, 2022 at 09:53:22AM +0200, Dietmar Eggemann wrote:

> > void update_rq_clock(struct rq *rq)
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -101,3 +101,6 @@ SCHED_FEAT(LATENCY_WARN, false)
> >
> >  SCHED_FEAT(ALT_PERIOD, true)
> >  SCHED_FEAT(BASE_SLICE, true)
> > +
> > +SCHED_FEAT(PELT_M2, false)
> > +SCHED_FEAT(PELT_M4, false)
>
> The sched features interface would definitely be much less official but
> I think it's not useful for Android since they don't mount debugfs anymore.

Well, they can haz a small patch that sets either one or both to true, no? ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-08-29 5:54 [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Dietmar Eggemann 2022-08-29 5:54 ` [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier Dietmar Eggemann @ 2022-09-20 14:07 ` Jian-Min Liu 2022-09-28 17:09 ` Dietmar Eggemann 2022-09-29 9:47 ` Peter Zijlstra 1 sibling, 2 replies; 61+ messages in thread From: Jian-Min Liu @ 2022-09-20 14:07 UTC (permalink / raw) To: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Morten Rasmussen, Vincent Donnefort Cc: Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen

Here is some test data from an Android phone to show that being able to switch the PELT HL at runtime is helpful functionality.

We switch the PELT HL at runtime depending on the scenario, e.g. pelt8 when playing a game, pelt32 for camera video. Supporting runtime switching of the PELT HL gives flexibility for different workloads.

The table below shows performance & power data points:

------------------------------------------------------------------------
|                       |                PELT halflife                 |
|                       |----------------------------------------------|
|                       |      32      |      16       |       8       |
|                       |----------------------------------------------|
|                       | avg  min avg | avg  min  avg | avg  min  avg |
| Scenarios             | fps  fps pwr | fps  fps  pwr | fps  fps  pwr |
|----------------------------------------------------------------------|
| HOK game 60fps        | 100  100 100 | 105 *134* 102 | 104 *152* 106 |
| HOK game 90fps        | 100  100 100 | 101 *114* 101 | 103 *129* 105 |
| HOK game 120fps       | 100  100 100 | 102 *124* 102 | 105 *134* 105 |
| FHD video rec. 60fps  | 100  100 100 | n/a  n/a  n/a | 100  100  103 |
| Camera snapshot       | 100  100 100 | n/a  n/a  n/a | 100  100  102 |
------------------------------------------------------------------------

HOK ... Honour Of Kings, Video game
FHD ... Full High Definition
fps ... frames per second
pwr ...
power consumption table values are in % On Mon, 2022-08-29 at 07:54 +0200, Dietmar Eggemann wrote: > Many of the Android devices still prefer to run PELT with a shorter > halflife than the hardcoded value of 32ms in mainline. > > The Android folks claim better response time of display pipeline > tasks > (higher min and avg fps for 60, 90 or 120Hz refresh rate). Some of > the > benchmarks like PCmark web-browsing show higher scores when running > with 16ms or 8ms PELT halflife. The gain in response time and > performance is considered to outweigh the increase of energy > consumption in these cases. > > The original idea of introducing a PELT halflife compile time option > for 32, 16, 8ms from Patrick Bellasi in 2018 > https://lkml.kernel.org/r/20180409165134.707-1-patrick.bellasi@arm.com > > wasn't integrated into mainline mainly because of breaking the PELT > stability requirement (see (1) below). > > We have been experimenting with a new idea from Morten Rasmussen to > instead introduce an additional clock between task and pelt clock. > This > way the effect of a shorter PELT halflife of 8ms or 16ms can be > achieved by left-shifting the elapsed time. This is similar to the > use > of time shifting of the pelt clock to achieve scale invariance in > PELT. > The implementation is from Vincent Donnefort with some minor > modifications to align with current tip sched/core. > > --- > > Known potential issues: > > (1) PELT stability requirement > > PELT halflife has to be larger than or equal to the scheduling > period. > > The sched_period (sysctl_sched_latency) of a typical mobile device > with > 8 CPUs with the default logarithmical tuning is 24ms so only the 32 > ms > PELT halflife met this. Shorter halflife of 16ms or even 8ms would > break > this.
> > It looks like that this problem might not exist anymore because of > the > PELT rewrite in 2015, i.e. with commit 9d89c257dfb9 > ("sched/fair: Rewrite runnable load and utilization average > tracking"). > Since then sched entities (task & task groups) and cfs_rq's are > independently maintained rather than each entity update maintains the > cfs_rq at the same time. > > This seems to mitigate the issue that the cfs_rq signal is not > correct > when there are not all runnable entities able to do a self update > during > a PELT halflife window. > > That said, I'm not entirely sure whether the entity-cfs_rq > synchronization is the only issue behind this PELT stability > requirement. > > > (2) PELT utilization versus util_est (estimated utilization) > > The PELT signal of a periodic task oscillates with higher peak > amplitude > when using smaller halflife. For a typical periodic task of the > display > pipeline with a runtime/period of 8ms/16ms the peak amplitude is at > ~40 > for 32ms, at ~80 for 16ms and at ~160 for 8ms. Util_est stores the > util_avg peak as util_est.enqueued per task. > > With an additional exponential weighted moving average (ewma) to > smooth > task utilization decreases, util_est values of the runnable tasks are > aggregated on the root cfs_rq. > CPU and task utilization for CPU frequency selection and task > placement > is the max value out of util_est and util_avg. > I.e. because of how util_est is implemented higher CPU Operating > Performance Points and more capable CPUs are already chosen when > using > smaller PELT halflife. > > > (3) Wrong PELT history when switching PELT multiplier > > The PELT history becomes stale the moment the PELT multiplier is > changed > during runtime. So all decisions based on PELT are skewed for the > time > interval to produce LOAD_MAX_AVG (the sum of the infinite geometric > series) which value is ~345ms for halflife=32ms (smaller for 8ms or > 16ms).
> > Rate limiting the PELT multiplier change to this value is not solving > the issue here. So the user would have to live with possible > incorrect > decisions during these PELT multiplier transition times. > > --- > > It looks like that individual task boosting e.g. via uclamp_min, > possibly abstracted by middleware frameworks like Android Dynamic > Performance Framework (ADPF) would be the way to go here but until > this > is fully available and adopted some Android folks will still prefer > the > overall system boosting they achieve by running with a shorter PELT > halflife. > > Vincent Donnefort (1): > sched/pelt: Introduce PELT multiplier
>
>  kernel/sched/core.c  |  2 +-
>  kernel/sched/pelt.c  | 60 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/pelt.h  | 42 ++++++++++++++++++++++++++++---
>  kernel/sched/sched.h |  1 +
>  4 files changed, 100 insertions(+), 5 deletions(-)

^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-09-20 14:07 ` [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Jian-Min Liu @ 2022-09-28 17:09 ` Dietmar Eggemann 2022-09-29 9:47 ` Peter Zijlstra 1 sibling, 0 replies; 61+ messages in thread From: Dietmar Eggemann @ 2022-09-28 17:09 UTC (permalink / raw) To: Jian-Min Liu, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Morten Rasmussen, Vincent Donnefort Cc: Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen, Kajetan Puchalski

+ Kajetan Puchalski <kajetan.puchalski@arm.com>

On 20/09/2022 16:07, Jian-Min Liu wrote:
>
> Here is some test data from an Android phone to show that being able to
> switch the PELT HL at runtime is helpful functionality.
>
> We switch the PELT HL at runtime depending on the scenario, e.g.
> pelt8 when playing a game, pelt32 for camera video. Supporting runtime
> switching of the PELT HL gives flexibility for different workloads.
>
> The table below shows performance & power data points:
>
> ------------------------------------------------------------------------
> |                       |                PELT halflife                 |
> |                       |----------------------------------------------|
> |                       |      32      |      16       |       8       |
> |                       |----------------------------------------------|
> |                       | avg  min avg | avg  min  avg | avg  min  avg |
> | Scenarios             | fps  fps pwr | fps  fps  pwr | fps  fps  pwr |
> |----------------------------------------------------------------------|
> | HOK game 60fps        | 100  100 100 | 105 *134* 102 | 104 *152* 106 |
> | HOK game 90fps        | 100  100 100 | 101 *114* 101 | 103 *129* 105 |
> | HOK game 120fps       | 100  100 100 | 102 *124* 102 | 105 *134* 105 |
> | FHD video rec. 60fps  | 100  100 100 | n/a  n/a  n/a | 100  100  103 |
> | Camera snapshot       | 100  100 100 | n/a  n/a  n/a | 100  100  102 |
> ------------------------------------------------------------------------
>
> HOK ... Honour Of Kings, Video game
> FHD ... Full High Definition
> fps ... frames per second
> pwr ...
power consumption > > table values are in % I assume that you are specifically interested in those higher min fps numbers which can be achieved with a tolerable energy consumption increase when running the game with 16ms or even 8ms PELT halflife. We see a similar effect when running the UI performance benchmark Jankbench. So you need this runtime-switchable PELT multiplier. Would this sched feature interface: https://lkml.kernel.org/r/YwyOzgbbUbB+JmSH@hirez.programming.kicks-ass.net be sufficient for you? People don't like to support `changing PELT halflife` via an official sysctl. [...] ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-09-20 14:07 ` [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Jian-Min Liu 2022-09-28 17:09 ` Dietmar Eggemann @ 2022-09-29 9:47 ` Peter Zijlstra 2022-09-29 11:07 ` Dietmar Eggemann ` (2 more replies) 1 sibling, 3 replies; 61+ messages in thread From: Peter Zijlstra @ 2022-09-29 9:47 UTC (permalink / raw) To: Jian-Min Liu Cc: Dietmar Eggemann, Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen

Because it messes up the order in which people normally read text. Why is top-posting such a bad thing? Top-posting. What is the most annoying thing in e-mail?

On Tue, Sep 20, 2022 at 10:07:59PM +0800, Jian-Min Liu wrote:
>
> Here is some test data from an Android phone to show that being able to
> switch the PELT HL at runtime is helpful functionality.
>
> We switch the PELT HL at runtime depending on the scenario, e.g.
> pelt8 when playing a game, pelt32 for camera video. Supporting runtime
> switching of the PELT HL gives flexibility for different workloads.
>
> The table below shows performance & power data points:
>
> ------------------------------------------------------------------------
> |                       |                PELT halflife                 |
> |                       |----------------------------------------------|
> |                       |      32      |      16       |       8       |
> |                       |----------------------------------------------|
> |                       | avg  min avg | avg  min  avg | avg  min  avg |
> | Scenarios             | fps  fps pwr | fps  fps  pwr | fps  fps  pwr |
> |----------------------------------------------------------------------|
> | HOK game 60fps        | 100  100 100 | 105 *134* 102 | 104 *152* 106 |
> | HOK game 90fps        | 100  100 100 | 101 *114* 101 | 103 *129* 105 |
> | HOK game 120fps       | 100  100 100 | 102 *124* 102 | 105 *134* 105 |

You have your min and avg fps columns mixed up, your min cannot be larger than avg.

Also, with min fps mostly above the actual screen fps, who cares. And seriously 120fps on a phone !?!? for worse power usage!
you gotta be kidding me. And I googled this game; it is some top-down tactical thing with real-time combat (as opposed to turn-based) (DOTA like I suppose), 60 fps locked should be plenty fine.

> | FHD video rec. 60fps  | 100  100 100 | n/a  n/a  n/a | 100  100  103 |
> | Camera snapshot       | 100  100 100 | n/a  n/a  n/a | 100  100  102 |

Mostly I think you've demonstrated that none of this is worth it.

> ------------------------------------------------------------------------
>
> HOK ... Honour Of Kings, Video game
> FHD ... Full High Definition
> fps ... frames per second
> pwr ... power consumption
>
> table values are in %

Oh... that's bloody insane; that's why none of it makes sense.

How is any of that an answer to:

"They want; I want an explanation of what exact problem is fixed how ;-)"

This is just random numbers showing poking the number has some effect; it has zero explanation of why poking the number changes the workload and if that is in fact the right way to go about solving that particular issue. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-09-29 9:47 ` Peter Zijlstra @ 2022-09-29 11:07 ` Dietmar Eggemann 2022-09-29 11:10 ` Kajetan Puchalski 2022-11-07 9:41 ` Jian-Min Liu (劉建旻) 2 siblings, 0 replies; 61+ messages in thread From: Dietmar Eggemann @ 2022-09-29 11:07 UTC (permalink / raw) To: Peter Zijlstra, Jian-Min Liu Cc: Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen

On 29/09/2022 11:47, Peter Zijlstra wrote:

[...]

>> ------------------------------------------------------------------------
>> |                       |                PELT halflife                 |
>> |                       |----------------------------------------------|
>> |                       |      32      |      16       |       8       |
>> |                       |----------------------------------------------|
>> |                       | avg  min avg | avg  min  avg | avg  min  avg |
>> | Scenarios             | fps  fps pwr | fps  fps  pwr | fps  fps  pwr |
>> |----------------------------------------------------------------------|
>> | HOK game 60fps        | 100  100 100 | 105 *134* 102 | 104 *152* 106 |
>> | HOK game 90fps        | 100  100 100 | 101 *114* 101 | 103 *129* 105 |
>> | HOK game 120fps       | 100  100 100 | 102 *124* 102 | 105 *134* 105 |
>
> You have your min and avg fps columns mixed up, your min cannot be larger
> than avg.
>
> Also, with min fps mostly above the actual screen fps, who cares. And
> seriously 120fps on a phone !?!? for worse power usage! you gotta be
> kidding me.

I agree that since we don't know what 100% at 32 means it's unclear what problem gets actually solved here by running with 16 or 8.

> And I googled this game; it is some top-down tactical thing with
> real-time combat (as opposed to turn-based) (DOTA like I suppose),
> 60 fps locked should be plenty fine.
>
>> | FHD video rec. 60fps  | 100  100 100 | n/a  n/a  n/a | 100  100  103 |
>> | Camera snapshot       | 100  100 100 | n/a  n/a  n/a | 100  100  102 |
>
> Mostly I think you've demonstrated that none of this is worth it.
I assume Jian-Min added those two lines to demonstrate that they would need the run-time switch.

>> ------------------------------------------------------------------------
>>
>> HOK ... Honour Of Kings, Video game
>> FHD ... Full High Definition
>> fps ... frames per second
>> pwr ... power consumption
>>
>> table values are in %
>
> Oh... that's bloody insane; that's why none of it makes sense.
>
> How is any of that an answer to:
>
> "They want; I want an explanation of what exact problem is fixed how ;-)"
>
> This is just random numbers showing poking the number has some effect;
> it has zero explanation of why poking the number changes the workload
> and if that is in fact the right way to go about solving that particular
> issue.

Jian-Min, would you be able to show real numbers in comparison to the chosen fps here? And explain what the problem is which gets solved. What is the effect of these higher min fps values when running 16 or 8? And why is the default 32 not sufficient here? ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-09-29 9:47 ` Peter Zijlstra 2022-09-29 11:07 ` Dietmar Eggemann @ 2022-09-29 11:10 ` Kajetan Puchalski 2022-09-29 11:21 ` Peter Zijlstra 2022-11-07 9:41 ` Jian-Min Liu (劉建旻) 2 siblings, 1 reply; 61+ messages in thread From: Kajetan Puchalski @ 2022-09-29 11:10 UTC (permalink / raw) To: Peter Zijlstra Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen

On Thu, Sep 29, 2022 at 11:47:23AM +0200, Peter Zijlstra wrote:

[...]

> Mostly I think you've demonstrated that none of this is worth it.
>
> > ------------------------------------------------------------------------
> >
> > HOK ... Honour Of Kings, Video game
> > FHD ... Full High Definition
> > fps ... frames per second
> > pwr ... power consumption
> >
> > table values are in %
>
> Oh... that's bloody insane

Hi,

We have seen similar results to the ones provided by MTK while running Jankbench, a UI performance benchmark. For the following tables, the pelt numbers refer to multiplier values so pelt_1 -> 32ms, pelt_2 -> 16ms, pelt_4 -> 8ms.

We can see the max frame durations decreasing significantly in line with changing the pelt multiplier. Having a faster-responding pelt lets us improve the worst-case scenario by a large margin which is why it can be useful in some cases where that worst-case scenario is important.

Max frame duration (ms)
+------------------+----------+
| kernel           | value    |
|------------------+----------|
| pelt_1           | 157.426  |
| pelt_2           | 111.975  |
| pelt_4           | 85.2713  |
+------------------+----------+

However, it is accompanied by a very noticeable increase in power usage. We have seen even bigger power usage increases for different workloads.
This is why we think it makes much more sense as something that can be changed at runtime - if set at boot time the energy consumption increase would nullify any of the potential benefits. For limited workloads or scenarios, the tradeoff might be worth it.

Power usage [mW]
+------------------+---------+-------------+
| kernel           | value   | perc_diff   |
|------------------+---------+-------------|
| pelt_1           | 139.9   | 0.0%        |
| pelt_2           | 146.4   | 4.62%       |
| pelt_4           | 158.5   | 13.25%      |
+------------------+---------+-------------+

At the same time we see that the average-case can improve slightly as well in the process and the consistency either doesn't get worse or improves a bit too.

Mean frame duration (ms)
+---------------+------------------+---------+-------------+
| variable      | kernel           | value   | perc_diff   |
|---------------+------------------+---------+-------------|
| mean_duration | pelt_1           | 14.6    | 0.0%        |
| mean_duration | pelt_2           | 13.8    | -5.43%      |
| mean_duration | pelt_4           | 14.5    | -0.58%      |
+---------------+------------------+---------+-------------+

Jank percentage
+------------+------------------+---------+-------------+
| variable   | kernel           | value   | perc_diff   |
|------------+------------------+---------+-------------|
| jank_perc  | pelt_1           | 2.1     | 0.0%        |
| jank_perc  | pelt_2           | 2.1     | 0.11%       |
| jank_perc  | pelt_4           | 2       | -3.46%      |
+------------+------------------+---------+-------------+

> How is any of that an answer to:
>
> "They want; I want an explanation of what exact problem is fixed how ;-)"
>
> This is just random numbers showing poking the number has some effect;
> it has zero explanation of why poking the number changes the workload
> and if that is in fact the right way to go about solving that particular
> issue.

Overall, the problem being solved here is that based on our testing the PELT half life can occasionally be too slow to keep up in scenarios where many frames need to be rendered quickly, especially on high-refresh rate phones and similar devices.
While it's not a problem most of the time and so it doesn't warrant changing the default or having it set at boot time, introducing this pelt multiplier would be very useful as a tool to be able to avoid the worst-case in limited scenarios. ---- Kajetan ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-09-29 11:10 ` Kajetan Puchalski @ 2022-09-29 11:21 ` Peter Zijlstra 2022-09-29 14:41 ` Kajetan Puchalski 0 siblings, 1 reply; 61+ messages in thread From: Peter Zijlstra @ 2022-09-29 11:21 UTC (permalink / raw) To: Kajetan Puchalski Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote: > Overall, the problem being solved here is that based on our testing the > PELT half life can occasionally be too slow to keep up in scenarios > where many frames need to be rendered quickly, especially on high-refresh > rate phones and similar devices. But it is a problem of DVFS not ramping up quick enough; or of the load-balancer not reacting to the increase in load, or what aspect controlled by PELT is responsible for the improvement seen? ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-29 11:21 ` Peter Zijlstra
@ 2022-09-29 14:41   ` Kajetan Puchalski
  2022-10-03 22:57     ` Wei Wang
  2022-11-07 13:41     ` Peter Zijlstra
  0 siblings, 2 replies; 61+ messages in thread
From: Kajetan Puchalski @ 2022-09-29 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot,
      Morten Rasmussen, Vincent Donnefort, Quentin Perret,
      Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
      linux-kernel, Jonathan JMChen

On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:
>
> > Overall, the problem being solved here is that based on our testing the
> > PELT half life can occasionally be too slow to keep up in scenarios
> > where many frames need to be rendered quickly, especially on high-refresh
> > rate phones and similar devices.
>
> But it is a problem of DVFS not ramping up quick enough; or of the
> load-balancer not reacting to the increase in load, or what aspect
> controlled by PELT is responsible for the improvement seen?

Based on all the tests we've seen, jankbench or otherwise, the
improvement can mainly be attributed to the faster ramp-up of frequency
caused by the shorter PELT window while using schedutil. Alongside that,
the signals rising faster also mean that tasks get migrated sooner to
bigger CPUs on big.LITTLE systems, which improves things too, but it's
mostly the frequency aspect.

To establish that this benchmark is sensitive to frequency, I ran some
tests using the 'performance' cpufreq governor.
Max frame duration (ms)

+------------------+-------------+----------+
| kernel           |   iteration |    value |
|------------------+-------------+----------|
| pelt_1           |          10 |  157.426 |
| pelt_4           |          10 |  85.2713 |
| performance      |          10 |  40.9308 |
+------------------+-------------+----------+

Mean frame duration (ms)

+---------------+------------------+---------+-------------+
| variable      | kernel           | value   |   perc_diff |
|---------------+------------------+---------+-------------|
| mean_duration | pelt_1           | 14.6    |        0.0% |
| mean_duration | pelt_4           | 14.5    |      -0.58% |
| mean_duration | performance      | 4.4     |     -69.75% |
+---------------+------------------+---------+-------------+

Jank percentage

+------------+------------------+---------+-------------+
| variable   | kernel           | value   |   perc_diff |
|------------+------------------+---------+-------------|
| jank_perc  | pelt_1           | 2.1     |        0.0% |
| jank_perc  | pelt_4           | 2       |      -3.46% |
| jank_perc  | performance      | 0.1     |     -97.25% |
+------------+------------------+---------+-------------+

As you can see, bumping up frequency can hugely improve the results
here. This is what's happening when we decrease the PELT window, just
on a much smaller and not as drastic scale. It also explains
specifically where the increased power usage is coming from.

^ permalink raw reply	[flat|nested] 61+ messages in thread
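The ramp-up effect being measured in this sub-thread can be sketched with a small closed-form model of the PELT geometric series. This is an illustration only (it ignores scale invariance and the kernel's fixed-point arithmetic), but for a periodic 8 ms/16 ms task it reproduces the steady-state peak amplitudes quoted in the cover letter (roughly 40/80/160 above the 512 mean for halflives of 32/16/8 ms):

```python
def pelt_peak_util(halflife_ms, run_ms, period_ms, scale=1024):
    """Steady-state utilization at the end of the running phase of a
    periodic task: scale * (1 - y^run) / (1 - y^period), with y^halflife = 1/2."""
    y = 0.5 ** (1.0 / halflife_ms)  # per-millisecond decay factor
    return scale * (1 - y ** run_ms) / (1 - y ** period_ms)

if __name__ == "__main__":
    mean = 1024 * 8 / 16  # 50% duty cycle -> mean utilization 512
    for halflife in (32, 16, 8):  # pelt_1 / pelt_2 / pelt_4
        peak = pelt_peak_util(halflife, 8, 16)
        print(f"halflife {halflife:2d} ms: peak {peak:6.1f} "
              f"(+{peak - mean:5.1f} above mean)")
```

A shorter halflife overshoots the mean further on every activation, which is what lets schedutil pick a higher OPP sooner, at the cost of the extra power shown in the tables above.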
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-29 14:41 ` Kajetan Puchalski
@ 2022-10-03 22:57   ` Wei Wang
  2022-10-04  9:33     ` Dietmar Eggemann
  0 siblings, 1 reply; 61+ messages in thread
From: Wei Wang @ 2022-10-03 22:57 UTC (permalink / raw)
  To: Kajetan Puchalski
  Cc: Peter Zijlstra, Jian-Min Liu, Dietmar Eggemann, Ingo Molnar,
      Vincent Guittot, Morten Rasmussen, Vincent Donnefort,
      Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar,
      Qais Yousef, linux-kernel, Jonathan JMChen,
      Chung-Kai (Michael) Mei

We have some data on an earlier build of Pixel 6a, which also runs a
slightly modified "sched" governor. The tuning definitely has both
performance and power impact on UX. With some additional user space
hints such as ADPF (Android Dynamic Performance Framework) and/or the
old-fashioned INTERACTION power hint, different trade-offs can be
achieved with this sort of tuning.

+---------------------------------------------------------+----------+----------+
| Metrics                                                 | 32ms     | 8ms      |
+---------------------------------------------------------+----------+----------+
| Sum of gfxinfo_com.android.test.uibench_deadline_missed | 185.00   | 112.00   |
| Sum of SFSTATS_GLOBAL_MISSEDFRAMES                      | 62.00    | 49.00    |
| CPU Power                                               | 6,204.00 | 7,040.00 |
| Sum of Gfxinfo.frame.95th                               | 582.00   | 506.00   |
| Avg of Gfxinfo.frame.95th                               | 18.19    | 15.81    |
+---------------------------------------------------------+----------+----------+

On Thu, Sep 29, 2022 at 11:59 PM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote:
> >
> > > Overall, the problem being solved here is that based on our testing the
> > > PELT half life can occasionally be too slow to keep up in scenarios
> > > where many frames need to be rendered quickly, especially on high-refresh
> > > rate phones and similar devices.
> > > > But it is a problem of DVFS not ramping up quick enough; or of the > > load-balancer not reacting to the increase in load, or what aspect > > controlled by PELT is responsible for the improvement seen? > > Based on all the tests we've seen, jankbench or otherwise, the > improvement can mainly be attributed to the faster ramp up of frequency > caused by the shorter PELT window while using schedutil. Alongside that > the signals rising faster also mean that the task would get migrated > faster to bigger CPUs on big.LITTLE systems which improves things too > but it's mostly the frequency aspect of it. > > To establish that this benchmark is sensitive to frequency I ran some > tests using the 'performance' cpufreq governor. > > Max frame duration (ms) > > +------------------+-------------+----------+ > | kernel | iteration | value | > |------------------+-------------+----------| > | pelt_1 | 10 | 157.426 | > | pelt_4 | 10 | 85.2713 | > | performance | 10 | 40.9308 | > +------------------+-------------+----------+ > > Mean frame duration (ms) > > +---------------+------------------+---------+-------------+ > | variable | kernel | value | perc_diff | > |---------------+------------------+---------+-------------| > | mean_duration | pelt_1 | 14.6 | 0.0% | > | mean_duration | pelt_4 | 14.5 | -0.58% | > | mean_duration | performance | 4.4 | -69.75% | > +---------------+------------------+---------+-------------+ > > Jank percentage > > +------------+------------------+---------+-------------+ > | variable | kernel | value | perc_diff | > |------------+------------------+---------+-------------| > | jank_perc | pelt_1 | 2.1 | 0.0% | > | jank_perc | pelt_4 | 2 | -3.46% | > | jank_perc | performance | 0.1 | -97.25% | > +------------+------------------+---------+-------------+ > > As you can see, bumping up frequency can hugely improve the results > here. This is what's happening when we decrease the PELT window, just on > a much smaller and not as drastic scale. 
It also explains specifically > where the increased power usage is coming from. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-10-03 22:57 ` Wei Wang @ 2022-10-04 9:33 ` Dietmar Eggemann 2022-10-05 16:57 ` Wei Wang 0 siblings, 1 reply; 61+ messages in thread From: Dietmar Eggemann @ 2022-10-04 9:33 UTC (permalink / raw) To: Wei Wang, Kajetan Puchalski Cc: Peter Zijlstra, Jian-Min Liu, Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen, Chung-Kai (Michael) Mei Hi Wei, On 04/10/2022 00:57, Wei Wang wrote: Please don't do top-posting. > We have some data on an earlier build of Pixel 6a, which also runs a > slightly modified "sched" governor. The tuning definitely has both > performance and power impact on UX. With some additional user space > hints such as ADPF (Android Dynamic Performance Framework) and/or the > old-fashioned INTERACTION power hint, different trade-offs can be > archived with this sort of tuning. > > > +---------------------------------------------------------+----------+----------+ > | Metrics | 32ms | > 8ms | > +---------------------------------------------------------+----------+----------+ > | Sum of gfxinfo_com.android.test.uibench_deadline_missed | 185.00 | > 112.00 | > | Sum of SFSTATS_GLOBAL_MISSEDFRAMES | 62.00 | > 49.00 | > | CPU Power | 6,204.00 | > 7,040.00 | > | Sum of Gfxinfo.frame.95th | 582.00 | > 506.00 | > | Avg of Gfxinfo.frame.95th | 18.19 | > 15.81 | > +---------------------------------------------------------+----------+----------+ Which App is package `gfxinfo_com.android.test`? Is this UIBench? Never ran it. I'm familiar with `dumpsys gfxinfo <PACKAGE_NAME>`. # adb shell dumpsys gfxinfo <PACKAGE_NAME> ... ** Graphics info for pid XXXX [<PACKAGE_NAME>] ** ... 95th percentile: XXms <-- (a) ... Number Frame deadline missed: XX <-- (b) ... I assume that `Gfxinfo.frame.95th` is related to (a) and `gfxinfo_com.android.test.uibench_deadline_missed` to (b)? 
Not sure where `SFSTATS_GLOBAL_MISSEDFRAMES` is coming from? What's the Sum here? Is it that you ran the test 32 times (582/18.19 = 32)? [...] > On Thu, Sep 29, 2022 at 11:59 PM Kajetan Puchalski > <kajetan.puchalski@arm.com> wrote: >> >> On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote: >>> On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote: >>> >>>> Overall, the problem being solved here is that based on our testing the >>>> PELT half life can occasionally be too slow to keep up in scenarios >>>> where many frames need to be rendered quickly, especially on high-refresh >>>> rate phones and similar devices. >>> >>> But it is a problem of DVFS not ramping up quick enough; or of the >>> load-balancer not reacting to the increase in load, or what aspect >>> controlled by PELT is responsible for the improvement seen? >> >> Based on all the tests we've seen, jankbench or otherwise, the >> improvement can mainly be attributed to the faster ramp up of frequency >> caused by the shorter PELT window while using schedutil. Alongside that >> the signals rising faster also mean that the task would get migrated >> faster to bigger CPUs on big.LITTLE systems which improves things too >> but it's mostly the frequency aspect of it. >> >> To establish that this benchmark is sensitive to frequency I ran some >> tests using the 'performance' cpufreq governor. 
>> >> Max frame duration (ms) >> >> +------------------+-------------+----------+ >> | kernel | iteration | value | >> |------------------+-------------+----------| >> | pelt_1 | 10 | 157.426 | >> | pelt_4 | 10 | 85.2713 | >> | performance | 10 | 40.9308 | >> +------------------+-------------+----------+ >> >> Mean frame duration (ms) >> >> +---------------+------------------+---------+-------------+ >> | variable | kernel | value | perc_diff | >> |---------------+------------------+---------+-------------| >> | mean_duration | pelt_1 | 14.6 | 0.0% | >> | mean_duration | pelt_4 | 14.5 | -0.58% | >> | mean_duration | performance | 4.4 | -69.75% | >> +---------------+------------------+---------+-------------+ >> >> Jank percentage >> >> +------------+------------------+---------+-------------+ >> | variable | kernel | value | perc_diff | >> |------------+------------------+---------+-------------| >> | jank_perc | pelt_1 | 2.1 | 0.0% | >> | jank_perc | pelt_4 | 2 | -3.46% | >> | jank_perc | performance | 0.1 | -97.25% | >> +------------+------------------+---------+-------------+ >> >> As you can see, bumping up frequency can hugely improve the results >> here. This is what's happening when we decrease the PELT window, just on >> a much smaller and not as drastic scale. It also explains specifically >> where the increased power usage is coming from. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-10-04 9:33 ` Dietmar Eggemann @ 2022-10-05 16:57 ` Wei Wang 0 siblings, 0 replies; 61+ messages in thread From: Wei Wang @ 2022-10-05 16:57 UTC (permalink / raw) To: Dietmar Eggemann Cc: Kajetan Puchalski, Peter Zijlstra, Jian-Min Liu, Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen, Chung-Kai (Michael) Mei On Tue, Oct 4, 2022 at 2:33 AM Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > Hi Wei, > > On 04/10/2022 00:57, Wei Wang wrote: > > Please don't do top-posting. > Sorry, forgot this was posted to the list... > > We have some data on an earlier build of Pixel 6a, which also runs a > > slightly modified "sched" governor. The tuning definitely has both > > performance and power impact on UX. With some additional user space > > hints such as ADPF (Android Dynamic Performance Framework) and/or the > > old-fashioned INTERACTION power hint, different trade-offs can be > > archived with this sort of tuning. > > > > > > +---------------------------------------------------------+----------+----------+ > > | Metrics | 32ms | > > 8ms | > > +---------------------------------------------------------+----------+----------+ > > | Sum of gfxinfo_com.android.test.uibench_deadline_missed | 185.00 | > > 112.00 | > > | Sum of SFSTATS_GLOBAL_MISSEDFRAMES | 62.00 | > > 49.00 | > > | CPU Power | 6,204.00 | > > 7,040.00 | > > | Sum of Gfxinfo.frame.95th | 582.00 | > > 506.00 | > > | Avg of Gfxinfo.frame.95th | 18.19 | > > 15.81 | > > +---------------------------------------------------------+----------+----------+ > > Which App is package `gfxinfo_com.android.test`? Is this UIBench? Never > ran it. > Yes. > I'm familiar with `dumpsys gfxinfo <PACKAGE_NAME>`. > > # adb shell dumpsys gfxinfo <PACKAGE_NAME> > > ... > ** Graphics info for pid XXXX [<PACKAGE_NAME>] ** > ... 
> 95th percentile: XXms <-- (a) > ... > Number Frame deadline missed: XX <-- (b) > ... > > > I assume that `Gfxinfo.frame.95th` is related to (a) and > `gfxinfo_com.android.test.uibench_deadline_missed` to (b)? Not sure > where `SFSTATS_GLOBAL_MISSEDFRAMES` is coming from? > a) is correct b) is from surfaceflinger. Android display pipeline involves both a) app (generation) and b) surfaceflinger (presentation). > What's the Sum here? Is it that you ran the test 32 times (582/18.19 = 32)? > Uibench[1] has several micro tests and it is the sum of those tests. [1]: https://cs.android.com/android/platform/superproject/+/master:platform_testing/tests/microbenchmarks/uibench/src/com/android/uibench/microbenchmark/ > [...] > > > On Thu, Sep 29, 2022 at 11:59 PM Kajetan Puchalski > > <kajetan.puchalski@arm.com> wrote: > >> > >> On Thu, Sep 29, 2022 at 01:21:45PM +0200, Peter Zijlstra wrote: > >>> On Thu, Sep 29, 2022 at 12:10:17PM +0100, Kajetan Puchalski wrote: > >>> > >>>> Overall, the problem being solved here is that based on our testing the > >>>> PELT half life can occasionally be too slow to keep up in scenarios > >>>> where many frames need to be rendered quickly, especially on high-refresh > >>>> rate phones and similar devices. > >>> > >>> But it is a problem of DVFS not ramping up quick enough; or of the > >>> load-balancer not reacting to the increase in load, or what aspect > >>> controlled by PELT is responsible for the improvement seen? > >> > >> Based on all the tests we've seen, jankbench or otherwise, the > >> improvement can mainly be attributed to the faster ramp up of frequency > >> caused by the shorter PELT window while using schedutil. Alongside that > >> the signals rising faster also mean that the task would get migrated > >> faster to bigger CPUs on big.LITTLE systems which improves things too > >> but it's mostly the frequency aspect of it. 
> >> > >> To establish that this benchmark is sensitive to frequency I ran some > >> tests using the 'performance' cpufreq governor. > >> > >> Max frame duration (ms) > >> > >> +------------------+-------------+----------+ > >> | kernel | iteration | value | > >> |------------------+-------------+----------| > >> | pelt_1 | 10 | 157.426 | > >> | pelt_4 | 10 | 85.2713 | > >> | performance | 10 | 40.9308 | > >> +------------------+-------------+----------+ > >> > >> Mean frame duration (ms) > >> > >> +---------------+------------------+---------+-------------+ > >> | variable | kernel | value | perc_diff | > >> |---------------+------------------+---------+-------------| > >> | mean_duration | pelt_1 | 14.6 | 0.0% | > >> | mean_duration | pelt_4 | 14.5 | -0.58% | > >> | mean_duration | performance | 4.4 | -69.75% | > >> +---------------+------------------+---------+-------------+ > >> > >> Jank percentage > >> > >> +------------+------------------+---------+-------------+ > >> | variable | kernel | value | perc_diff | > >> |------------+------------------+---------+-------------| > >> | jank_perc | pelt_1 | 2.1 | 0.0% | > >> | jank_perc | pelt_4 | 2 | -3.46% | > >> | jank_perc | performance | 0.1 | -97.25% | > >> +------------+------------------+---------+-------------+ > >> > >> As you can see, bumping up frequency can hugely improve the results > >> here. This is what's happening when we decrease the PELT window, just on > >> a much smaller and not as drastic scale. It also explains specifically > >> where the increased power usage is coming from. > ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-29 14:41 ` Kajetan Puchalski
  2022-10-03 22:57   ` Wei Wang
@ 2022-11-07 13:41   ` Peter Zijlstra
  2022-11-08 19:48     ` Qais Yousef
                       ` (3 more replies)
  1 sibling, 4 replies; 61+ messages in thread
From: Peter Zijlstra @ 2022-11-07 13:41 UTC (permalink / raw)
  To: Kajetan Puchalski
  Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot,
      Morten Rasmussen, Vincent Donnefort, Quentin Perret,
      Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
      linux-kernel, Jonathan JMChen

On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:

> Based on all the tests we've seen, jankbench or otherwise, the
> improvement can mainly be attributed to the faster ramp up of frequency
> caused by the shorter PELT window while using schedutil.

Would something terrible like the below help some?

If not, I suppose it could be modified to take the current state as
history. But basically it runs a faster pelt sum along side the regular
signal just for ramping up the frequency.

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..9ba07a1d19f6 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,6 +96,7 @@ SCHED_FEAT(WA_BIAS, true)
  */
 SCHED_FEAT(UTIL_EST, true)
 SCHED_FEAT(UTIL_EST_FASTUP, true)
+SCHED_FEAT(UTIL_EST_FASTER, true)
 
 SCHED_FEAT(LATENCY_WARN, false)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 0f310768260c..13cd9e27ce3e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -148,6 +148,22 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	return periods;
 }
 
+/*
+ * Compute a pelt util_avg assuming no history and @delta runtime.
+ */
+unsigned long faster_est_approx(u64 delta)
+{
+	unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
+	u64 periods = delta / 1024;
+
+	if (periods) {
+		delta %= 1024;
+		contrib = __accumulate_pelt_segments(periods, 1024, delta);
+	}
+
+	return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
+}
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series. To do this we sub-divide our runnable
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a4a20046e586..99827d5dda27 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2922,6 +2922,8 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
 	return READ_ONCE(rq->avg_dl.util_avg);
 }
 
+extern unsigned long faster_est_approx(u64 runtime);
+
 /**
  * cpu_util_cfs() - Estimates the amount of CPU capacity used by CFS tasks.
  * @cpu: the CPU to get the utilization for.
@@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
  */
 static inline unsigned long cpu_util_cfs(int cpu)
 {
+	struct rq *rq = cpu_rq(cpu);
 	struct cfs_rq *cfs_rq;
 	unsigned long util;
 
-	cfs_rq = &cpu_rq(cpu)->cfs;
+	cfs_rq = &rq->cfs;
 	util = READ_ONCE(cfs_rq->avg.util_avg);
 
 	if (sched_feat(UTIL_EST)) {
+		if (sched_feat(UTIL_EST_FASTER)) {
+			struct task_struct *curr;
+
+			rcu_read_lock();
+			curr = rcu_dereference(rq->curr);
+			if (likely(curr->sched_class == &fair_sched_class)) {
+				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
+				util = max_t(unsigned long, util,
+					     faster_est_approx(runtime * 2));
+			}
+			rcu_read_unlock();
+		}
 		util = max_t(unsigned long, util,
 			     READ_ONCE(cfs_rq->avg.util_est.enqueued));
 	}

^ permalink raw reply related	[flat|nested] 61+ messages in thread
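In rough terms, faster_est_approx() above evaluates the PELT sum of a task that has been running continuously for @delta with no prior history, i.e. util is approximately 1024 * (1 - y^t); passing `runtime * 2` then behaves like halving the halflife for this one estimate. A floating-point sketch of that closed form (illustrative only; the kernel uses the fixed-point segment sum):

```python
def faster_est_approx(runtime_ms, halflife_ms=32, scale=1024):
    """Utilization a task would have after runtime_ms of solid running,
    starting from an empty history: scale * (1 - y^t), y^halflife = 1/2."""
    y = 0.5 ** (1.0 / halflife_ms)
    return scale * (1.0 - y ** runtime_ms)

# After one halflife of solid running the estimate reaches half of max:
print(faster_est_approx(32))  # ~512
# Doubling the runtime argument is equivalent to halving the halflife:
print(faster_est_approx(20), faster_est_approx(10, halflife_ms=16))
```

Because the estimate starts from zero history, it can only ever boost the CPU utilization seen by sugov (via the max_t()), never lower it; it fades into irrelevance once the regular util_avg/util_est catch up.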
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-11-07 13:41 ` Peter Zijlstra @ 2022-11-08 19:48 ` Qais Yousef 2022-11-09 15:49 ` Peter Zijlstra 2022-11-09 15:18 ` Lukasz Luba ` (2 subsequent siblings) 3 siblings, 1 reply; 61+ messages in thread From: Qais Yousef @ 2022-11-08 19:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Kajetan Puchalski, Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 11/07/22 14:41, Peter Zijlstra wrote: > On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: > > > Based on all the tests we've seen, jankbench or otherwise, the > > improvement can mainly be attributed to the faster ramp up of frequency > > caused by the shorter PELT window while using schedutil. > > Would something terrible like the below help some? > > If not, I suppose it could be modified to take the current state as > history. But basically it runs a faster pelt sum along side the regular > signal just for ramping up the frequency. A bit of a tangent, but this reminded me of this old patch: https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/ I think we have a bit too many moving cogs that might be creating undesired compound effect. Should we consider removing margins in favour of improving util ramp up/down? (whether via util_est or pelt hf). Only worry is lower end devices; but it seems to me better improve util response and get rid of these magic numbers - if we can. Having the ability to adjust them at runtime will help defer the trade-offs to sys admins. Thanks -- Qais Yousef ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-11-08 19:48 ` Qais Yousef @ 2022-11-09 15:49 ` Peter Zijlstra 2022-11-10 13:25 ` Qais Yousef 2023-02-07 10:29 ` Dietmar Eggemann 0 siblings, 2 replies; 61+ messages in thread From: Peter Zijlstra @ 2022-11-09 15:49 UTC (permalink / raw) To: Qais Yousef Cc: Kajetan Puchalski, Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote: > On 11/07/22 14:41, Peter Zijlstra wrote: > > On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: > > > > > Based on all the tests we've seen, jankbench or otherwise, the > > > improvement can mainly be attributed to the faster ramp up of frequency > > > caused by the shorter PELT window while using schedutil. > > > > Would something terrible like the below help some? > > > > If not, I suppose it could be modified to take the current state as > > history. But basically it runs a faster pelt sum along side the regular > > signal just for ramping up the frequency. > > A bit of a tangent, but this reminded me of this old patch: > > https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/ > > I think we have a bit too many moving cogs that might be creating undesired > compound effect. > > Should we consider removing margins in favour of improving util ramp up/down? > (whether via util_est or pelt hf). Yeah, possibly. So one thing that was key to that hack I proposed is that it is per-task. This means we can either set or detect the task activation period and use that to select an appropriate PELT multiplier. 
But please explain; once tasks are in a steady state (60HZ, 90HZ or god
forbid higher), the utilization should be the same between the various
PELT window sizes, provided the activation period isn't *much* larger
than the window.

Are these things running a ton of single shot tasks or something daft
like that?

^ permalink raw reply	[flat|nested] 61+ messages in thread
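The steady-state point can be checked numerically: for a periodic task, the time-averaged PELT signal settles at duty_cycle * 1024 regardless of halflife; what the halflife changes is the swing around that mean and how quickly the signal climbs at the start of an episode. A millisecond-step sketch (an illustration, not the kernel code):

```python
def simulate(halflife_ms, run_ms=8, period_ms=16, cycles=100, scale=1024.0):
    """Return one steady-state period of a PELT-like signal for a task
    that runs run_ms out of every period_ms."""
    y = 0.5 ** (1.0 / halflife_ms)
    u, trace = 0.0, []
    for _ in range(cycles):
        for ms in range(period_ms):
            # decay every ms; accumulate only while running
            u = u * y + (scale * (1.0 - y) if ms < run_ms else 0.0)
            trace.append(u)
    return trace[-period_ms:]

if __name__ == "__main__":
    for halflife in (32, 8):
        cycle = simulate(halflife)
        print(f"halflife {halflife:2d} ms: mean {sum(cycle) / len(cycle):5.1f} "
              f"min {min(cycle):5.1f} max {max(cycle):5.1f}")
```

Both halflives average out to about 512 for an 8 ms/16 ms task, consistent with the "utilization should be the same" observation; the 8 ms halflife just oscillates much more widely around it, which is where the DVFS-boosting effect at episode boundaries comes from.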
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-11-09 15:49 ` Peter Zijlstra @ 2022-11-10 13:25 ` Qais Yousef 2023-02-07 10:29 ` Dietmar Eggemann 1 sibling, 0 replies; 61+ messages in thread From: Qais Yousef @ 2022-11-10 13:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Kajetan Puchalski, Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 11/09/22 16:49, Peter Zijlstra wrote: > On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote: > > On 11/07/22 14:41, Peter Zijlstra wrote: > > > On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: > > > > > > > Based on all the tests we've seen, jankbench or otherwise, the > > > > improvement can mainly be attributed to the faster ramp up of frequency > > > > caused by the shorter PELT window while using schedutil. > > > > > > Would something terrible like the below help some? > > > > > > If not, I suppose it could be modified to take the current state as > > > history. But basically it runs a faster pelt sum along side the regular > > > signal just for ramping up the frequency. > > > > A bit of a tangent, but this reminded me of this old patch: > > > > https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/ > > > > I think we have a bit too many moving cogs that might be creating undesired > > compound effect. > > > > Should we consider removing margins in favour of improving util ramp up/down? > > (whether via util_est or pelt hf). > > Yeah, possibly. > > So one thing that was key to that hack I proposed is that it is > per-task. This means we can either set or detect the task activation > period and use that to select an appropriate PELT multiplier. 
Note that a big difference compared to changing the PELT HF is that
util_est only biases towards going up faster; not being able to go down
as quickly could impact power, as our residency at higher frequencies
will be higher. Only testing can show how big of a problem this is in
practice.

>
> But please explain; once tasks are in a steady state (60HZ, 90HZ or god
> forbid higher), the utilization should be the same between the various
> PELT window sizes, provided the activation period isn't *much* larger
> than the window.

It is a steady state only for a short period of time, before something
else happens that changes the nature of the workload. For example,
standing still in an empty room when an explosion suddenly happens,
causing lots of activity to appear on the screen. We can have a steady
state at 20%, but an action on the screen could suddenly change the
demand to 100%.

You can find a lot of videos on how to tweak cpu frequencies and
governor to improve gaming performance on youtube, by the way:

https://www.youtube.com/results?search_query=android+gaming+cpu+boost

And this ancient video from google about the impact of frequency
scaling on games:

https://www.youtube.com/watch?v=AZ97b2nT-Vo

this is truly ancient and the advice given then (over 8 years ago) is
not a reflection of the current state of affairs. The problem is not
new; and I guess expectations just keep going higher on what one can do
on their phone in spite of all the past improvements :-)

> Are these things running a ton of single shot tasks or something daft
> like that?

I'm not sure how all game engines behave; but the few I've seen don't
tend to do that.

I've seen apps like instagram using single shot tasks sometime in the
(distant) past to retrieve images. Generally I'm not sure how the Java
based APIs behave. There is an API for Job Scheduler that allows apps
to schedule background and foreground work; that could end up reusing a
pool of tasks or creating new ones. I'm not sure.
Game engines tend to be written with the NDK (native code); but simpler
games might not be.

Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 61+ messages in thread
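The up/down asymmetry of util_est mentioned earlier in this reply can be sketched as follows. This is a deliberate simplification (the kernel keeps separate `enqueued` and `ewma` fields, and with UTIL_EST_FASTUP increases are taken directly while decreases are EWMA-smoothed):

```python
def util_est_update(ewma, dequeued_util, shift=2):
    """Simplified util_est: ramp up immediately, decay via an EWMA."""
    if dequeued_util >= ewma:
        return dequeued_util  # increases are taken as-is
    # decreases only move 1/2^shift of the way to the new sample
    return ewma + (dequeued_util - ewma) // (1 << shift)

print(util_est_update(100, 400))  # jumps straight to 400
print(util_est_update(400, 100))  # only drops to 325
```

This is why a util_est-based boost trades differently against power than a shorter PELT halflife: the halflife change speeds up both directions symmetrically, while util_est keeps the estimate (and hence the chosen OPP) high for several activations after demand drops.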
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2022-11-09 15:49 ` Peter Zijlstra 2022-11-10 13:25 ` Qais Yousef @ 2023-02-07 10:29 ` Dietmar Eggemann 2023-02-09 16:16 ` Vincent Guittot 1 sibling, 1 reply; 61+ messages in thread From: Dietmar Eggemann @ 2023-02-07 10:29 UTC (permalink / raw) To: Peter Zijlstra, Qais Yousef Cc: Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Vincent Guittot, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 09/11/2022 16:49, Peter Zijlstra wrote: > On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote: >> On 11/07/22 14:41, Peter Zijlstra wrote: >>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: [...] > So one thing that was key to that hack I proposed is that it is > per-task. This means we can either set or detect the task activation > period and use that to select an appropriate PELT multiplier. > > But please explain; once tasks are in a steady state (60HZ, 90HZ or god > forbit higher), the utilization should be the same between the various > PELT window sizes, provided the activation period isn't *much* larger > than the window. > > Are these things running a ton of single shot tasks or something daft > like that? This investigation tries to answer these questions. The results can be found in chapter (B) and (C). I ran 'util_est_faster' with delta equal to 'duration of the current activation'. I.e. the following patch is needed: https://lkml.kernel.org/r/ec049fd9b635f76a9e1d1ad380fd9184ebeeca53.1671158588.git.yu.c.chen@intel.com The testcase is Jankbench on Android 12 on Pixel6, CPU orig capacity = [124 124 124 124 427 427 1024 1024], w/ mainline v5.18 kernel and forward ported task scheduler patches. (A) *** 'util_est_faster' vs. 'scaled util_est_faster' *** The initial approach didn't scale the runtime duration. 
It is based on task clock and not PELT clock, but it should be scaled
by uArch and frequency to align with the PELT time used for util
tracking.

The original approach, however, shows better results than the scaled
one. The even more aggressive boosting on non-big CPUs helps to raise
the frequency even quicker in the scenario described under (B).

All tests ran 10 iterations of all Jankbench sub-tests.

Max_frame_duration:
+------------------------+------------+
| kernel                 | value      |
+------------------------+------------+
| base-a30b17f016b0      | 147.571352 |
| util_est_faster        | 84.834999  |
| scaled_util_est_faster | 127.72855  |
+------------------------+------------+

Mean_frame_duration:
+------------------------+-------+-----------+
| kernel                 | value | perc_diff |
+------------------------+-------+-----------+
| base-a30b17f016b0      | 14.7  | 0.0%      |
| util_est_faster        | 12.6  | -14.01%   |
| scaled_util_est_faster | 13.5  | -8.45%    |
+------------------------+-------+-----------+

Jank percentage (Jank deadline 16ms):
+------------------------+-------+-----------+
| kernel                 | value | perc_diff |
+------------------------+-------+-----------+
| base-a30b17f016b0      | 1.8   | 0.0%      |
| util_est_faster        | 0.8   | -57.8%    |
| scaled_util_est_faster | 1.4   | -25.89%   |
+------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+------------------------+-------+-----------+
| kernel                 | value | perc_diff |
+------------------------+-------+-----------+
| base-a30b17f016b0      | 144.4 | 0.0%      |
| util_est_faster        | 150.9 | 4.45%     |
| scaled_util_est_faster | 152.2 | 5.4%      |
+------------------------+-------+-----------+

'scaled util_est_faster' is used as the base for all following tests.

(B) *** Where does util_est_faster help exactly? ***

It turns out that the score improvement comes from the more aggressive
DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
-> effective_cpu_util(..., cpu_util_cfs(), ...).

At the beginning of an episode (e.g. beginning of an image list view
fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
frequency') of the Android Graphics Pipeline (AGP) start to run, the
CPU Operating Performance Point (OPP) is often so low that those tasks
run more like 10/16ms, which lets the test application count a lot of
Jankframes at those moments.

And this is where the util_est_faster approach helps, by boosting CPU
util according to the 'runtime of the current activation'.
Moreover, it could also be that the tasks simply have more work to do
in these first activations at the beginning of an episode.

All the other places in which cpu_util_cfs() is used:

(2) CFS load balance ('_lb')
(3) CPU overutilization ('_ou')
(4) CFS fork/exec task placement ('_slowpath')

show no improvement, or even a regression, when tested individually.

Max_frame_duration:
+---------------------------------+------------+
| kernel                          | value      |
+---------------------------------+------------+
| scaled_util_est_faster          | 127.72855  |
| scaled_util_est_faster_freq     | 126.646506 |
| scaled_util_est_faster_lb       | 162.596249 |
| scaled_util_est_faster_ou       | 166.59519  |
| scaled_util_est_faster_slowpath | 153.966638 |
+---------------------------------+------------+

Mean_frame_duration:
+---------------------------------+-------+-----------+
| kernel                          | value | perc_diff |
+---------------------------------+-------+-----------+
| scaled_util_est_faster          | 13.5  | 0.0%      |
| scaled_util_est_faster_freq     | 13.7  | 1.79%     |
| scaled_util_est_faster_lb       | 14.8  | 9.87%     |
| scaled_util_est_faster_ou       | 14.5  | 7.46%     |
| scaled_util_est_faster_slowpath | 16.2  | 20.45%    |
+---------------------------------+-------+-----------+

Jank percentage (Jank deadline 16ms):
+---------------------------------+-------+-----------+
| kernel                          | value | perc_diff |
+---------------------------------+-------+-----------+
| scaled_util_est_faster          | 1.4   | 0.0%      |
| scaled_util_est_faster_freq     | 1.3   | -2.34%    |
| scaled_util_est_faster_lb       | 1.7   | 27.42%    |
| scaled_util_est_faster_ou       | 2.1   | 50.33%    |
| scaled_util_est_faster_slowpath | 2.8   | 102.39%   |
+---------------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+---------------------------------+-------+-----------+
| kernel                          | value | perc_diff |
+---------------------------------+-------+-----------+
| scaled_util_est_faster          | 152.2 | 0.0%      |
| scaled_util_est_faster_freq     | 132.3 | -13.1%    |
| scaled_util_est_faster_lb       | 137.1 | -9.96%    |
| scaled_util_est_faster_ou       | 132.4 | -13.04%   |
| scaled_util_est_faster_slowpath | 141.3 | -7.18%    |
+---------------------------------+-------+-----------+

(C) *** Which tasks contribute the most to the score improvement? ***

A trace_event capturing the cases in which a task's util_est_fast trumps
CPU util was added to cpu_util_cfs(). This is 1 iteration of Jankbench
and the base is (1) 'scaled_util_est_faster_freq':

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_6.ipynb

'Cell [6]' shows the tasks of the Jankbench process
'[com.an]droid.benchmark' which are boosting the CPU frequency request.

Among them are 2 main threads of the AGP, '[com.an]droid.benchmark' and
'RenderThread'. The spikes in util_est_fast are congruent with the
aforementioned beginning of an episode in which these periodic tasks
are running and when their runtime/period is rather ~10/16ms and not
~1-2/16ms since the CPU OPP is still low.

Very few other Jankbench tasks ('Cell [6]') show the same behaviour.
The Surfaceflinger process ('Cell [8]') is not affected, and from the
kernel tasks only kcompactd0 creates a mild boost ('Cell [9]').

As expected, running a non-scaled version of (1) shows more aggressive
boosting on non-big CPUs:

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_5.ipynb

It looks like 'util_est_faster' can prevent Jankframes by boosting CPU
util when periodic tasks have a longer runtime compared to when they
reach steady-state.
The result is very similar to PELT halflife reduction. The advantage is
that 'util_est_faster' is only activated selectively when the runtime of
the current task in its current activation is long enough to create this
CPU util boost.

Original patch:
https://lkml.kernel.org/r/Y2kLA8x40IiBEPYg@hirez.programming.kicks-ass.net

Changes applied:
- use 'duration of the current activation' as delta
- delta >>= 10
- uArch and frequency scaling of delta

-->%--

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index efdc29c42161..76d146d06bbe 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -97,6 +97,7 @@ SCHED_FEAT(WA_BIAS, true)
  */
 SCHED_FEAT(UTIL_EST, true)
 SCHED_FEAT(UTIL_EST_FASTUP, true)
+SCHED_FEAT(UTIL_EST_FASTER, true)

 SCHED_FEAT(LATENCY_WARN, false)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 0f310768260c..13cd9e27ce3e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -148,6 +148,22 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	return periods;
 }

+/*
+ * Compute a pelt util_avg assuming no history and @delta runtime.
+ */
+unsigned long faster_est_approx(u64 delta)
+{
+	unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
+	u64 periods = delta / 1024;
+
+	if (periods) {
+		delta %= 1024;
+		contrib = __accumulate_pelt_segments(periods, 1024, delta);
+	}
+
+	return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
+}
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series. To do this we sub-divide our runnable

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1072502976df..7cb45f1d8062 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2961,6 +2961,8 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
 	return READ_ONCE(rq->avg_dl.util_avg);
 }

+extern unsigned long faster_est_approx(u64 runtime);
+
 /**
  * cpu_util_cfs() - Estimates the amount of CPU capacity used by CFS tasks.
  * @cpu: the CPU to get the utilization for.
@@ -2995,13 +2997,39 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
  */
 static inline unsigned long cpu_util_cfs(int cpu)
 {
+	struct rq *rq = cpu_rq(cpu);
 	struct cfs_rq *cfs_rq;
 	unsigned long util;

-	cfs_rq = &cpu_rq(cpu)->cfs;
+	cfs_rq = &rq->cfs;
 	util = READ_ONCE(cfs_rq->avg.util_avg);

 	if (sched_feat(UTIL_EST)) {
+		if (sched_feat(UTIL_EST_FASTER)) {
+			struct task_struct *curr;
+
+			rcu_read_lock();
+			curr = rcu_dereference(rq->curr);
+			if (likely(curr->sched_class == &fair_sched_class)) {
+				unsigned long util_est_fast;
+				u64 delta;
+
+				delta = curr->se.sum_exec_runtime -
+					curr->se.prev_sum_exec_runtime_vol;
+
+				delta >>= 10;
+				if (!delta)
+					goto unlock;
+
+				delta = cap_scale(delta, arch_scale_cpu_capacity(cpu));
+				delta = cap_scale(delta, arch_scale_freq_capacity(cpu));
+
+				util_est_fast = faster_est_approx(delta * 2);
+				util = max(util, util_est_fast);
+			}
+unlock:
+			rcu_read_unlock();
+		}
 		util = max_t(unsigned long, util,
 			     READ_ONCE(cfs_rq->avg.util_est.enqueued));
 	}

^ permalink raw reply related	[flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-07 10:29 ` Dietmar Eggemann @ 2023-02-09 16:16 ` Vincent Guittot 2023-02-17 13:54 ` Dietmar Eggemann ` (2 more replies) 0 siblings, 3 replies; 61+ messages in thread From: Vincent Guittot @ 2023-02-09 16:16 UTC (permalink / raw) To: Dietmar Eggemann Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 09/11/2022 16:49, Peter Zijlstra wrote: > > On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote: > >> On 11/07/22 14:41, Peter Zijlstra wrote: > >>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: > > [...] > > > So one thing that was key to that hack I proposed is that it is > > per-task. This means we can either set or detect the task activation > > period and use that to select an appropriate PELT multiplier. > > > > But please explain; once tasks are in a steady state (60HZ, 90HZ or god > > forbit higher), the utilization should be the same between the various > > PELT window sizes, provided the activation period isn't *much* larger > > than the window. > > > > Are these things running a ton of single shot tasks or something daft > > like that? > > This investigation tries to answer these questions. The results can > be found in chapter (B) and (C). > > I ran 'util_est_faster' with delta equal to 'duration of the current > activation'. I.e. the following patch is needed: > > https://lkml.kernel.org/r/ec049fd9b635f76a9e1d1ad380fd9184ebeeca53.1671158588.git.yu.c.chen@intel.com > > The testcase is Jankbench on Android 12 on Pixel6, CPU orig capacity > = [124 124 124 124 427 427 1024 1024], w/ mainline v5.18 kernel and > forward ported task scheduler patches. > > (A) *** 'util_est_faster' vs. 
'scaled util_est_faster' *** > > The initial approach didn't scale the runtime duration. It is based > on task clock and not PELT clock but it should be scaled by uArch > and frequency to align with the PELT time used for util tracking. > > Although the original approach shows better results than the scaled > one. Even more aggressive boosting on non-big CPUs helps to raise the > frequency even quicker in the scenario described under (B). > > All tests ran 10 iterations of all Jankbench sub-tests. > > Max_frame_duration: > +------------------------+------------+ > | kernel | value | > +------------------------+------------+ > | base-a30b17f016b0 | 147.571352 | > | util_est_faster | 84.834999 | > | scaled_util_est_faster | 127.72855 | > +------------------------+------------+ > > Mean_frame_duration: > +------------------------+-------+-----------+ > | kernel | value | perc_diff | > +------------------------+-------+-----------+ > | base-a30b17f016b0 | 14.7 | 0.0% | > | util_est_faster | 12.6 | -14.01% | > | scaled_util_est_faster | 13.5 | -8.45% | > +------------------------+-------------------+ > > Jank percentage (Jank deadline 16ms): > +------------------------+-------+-----------+ > | kernel | value | perc_diff | > +------------------------+-------+-----------+ > | base-a30b17f016b0 | 1.8 | 0.0% | > | util_est_faster | 0.8 | -57.8% | > | scaled_util_est_faster | 1.4 | -25.89% | > +------------------------+-------+-----------+ > > Power usage [mW] (total - all CPUs): > +------------------------+-------+-----------+ > | kernel | value | perc_diff | > +------------------------+-------+-----------+ > | base-a30b17f016b0 | 144.4 | 0.0% | > | util_est_faster | 150.9 | 4.45% | > | scaled_util_est_faster | 152.2 | 5.4% | > +------------------------+-------+-----------+ > > 'scaled util_est_faster' is used as the base for all following tests. > > (B) *** Where does util_est_faster help exactly? 
*** > > It turns out that the score improvement comes from the more aggressive > DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util() > -> effective_cpu_util(..., cpu_util_cfs(), ...). > > At the beginning of an episode (e.g. beginning of an image list view > fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU > frequency') of the Android Graphics Pipeline (AGP) start to run, the > CPU Operating Performance Point (OPP) is often so low that those tasks > run more like 10/16ms which let the test application count a lot of > Jankframes at those moments. I don't see how util_est_faster can help this 1ms task here ? It's most probably never be preempted during this 1ms. For such an Android Graphics Pipeline short task, hasn't uclamp_min been designed for and a better solution ? > > And there is where this util_est_faster approach helps by boosting CPU > util according to the 'runtime of the current activation'. > Moreover it could also be that the tasks have simply more work to do in > these first activations at the beginning of an episode. > > All the other places in which cpu_util_cfs() is used: > > (2) CFS load balance ('_lb') > (3) CPU overutilization ('_ou') > (4) CFS fork/exec task placement ('_slowpath') > > when tested individually don't show any improvement or even regression. 
> > Max_frame_duration: > +---------------------------------+------------+ > | kernel | value | > +---------------------------------+------------+ > | scaled_util_est_faster | 127.72855 | > | scaled_util_est_faster_freq | 126.646506 | > | scaled_util_est_faster_lb | 162.596249 | > | scaled_util_est_faster_ou | 166.59519 | > | scaled_util_est_faster_slowpath | 153.966638 | > +---------------------------------+------------+ > > Mean_frame_duration: > +---------------------------------+-------+-----------+ > | kernel | value | perc_diff | > +---------------------------------+-------+-----------+ > | scaled_util_est_faster | 13.5 | 0.0% | > | scaled_util_est_faster_freq | 13.7 | 1.79% | > | scaled_util_est_faster_lb | 14.8 | 9.87% | > | scaled_util_est_faster_ou | 14.5 | 7.46% | > | scaled_util_est_faster_slowpath | 16.2 | 20.45% | > +---------------------------------+-------+-----------+ > > Jank percentage (Jank deadline 16ms): > +---------------------------------+-------+-----------+ > | kernel | value | perc_diff | > +---------------------------------+-------+-----------+ > | scaled_util_est_faster | 1.4 | 0.0% | > | scaled_util_est_faster_freq | 1.3 | -2.34% | > | scaled_util_est_faster_lb | 1.7 | 27.42% | > | scaled_util_est_faster_ou | 2.1 | 50.33% | > | scaled_util_est_faster_slowpath | 2.8 | 102.39% | > +---------------------------------+-------+-----------+ > > Power usage [mW] (total - all CPUs): > +---------------------------------+-------+-----------+ > | kernel | value | perc_diff | > +---------------------------------+-------+-----------+ > | scaled_util_est_faster | 152.2 | 0.0% | > | scaled_util_est_faster_freq | 132.3 | -13.1% | > | scaled_util_est_faster_lb | 137.1 | -9.96% | > | scaled_util_est_faster_ou | 132.4 | -13.04% | > | scaled_util_est_faster_slowpath | 141.3 | -7.18% | > +---------------------------------+-------+-----------+ > > (C) *** Which tasks contribute the most to the score improvement? 
*** > > A trace_event capturing the cases in which task's util_est_fast trumps > CPU util was added to cpu_util_cfs(). This is 1 iteration of Jankbench > and the base is (1) 'scaled_util_est_faster_freq': > > https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_6.ipynb > > 'Cell [6]' shows the tasks of the Jankbench process > '[com.an]droid.benchmark' which are boosting the CPU frequency request. > > Among them are 2 main threads of the AGP, '[com.an]droid.benchmark' and > 'RenderThread'. > The spikes in util_est_fast are congruent with the aforementioned > beginning of an episode in which these periodic tasks are running and > when their runtime/period is rather ~10/16ms and not ~1-2/16ms since > the CPU OPP is still low. > > Very few other Jankbench tasks 'Cell [6] show the same behaviour. The > Surfaceflinger process 'Cell [8]' is not affected and from the kernel > tasks only kcompctd0 creates a mild boost 'Cell [9]'. > > As expected, running a non-scaled version of (1) shows more aggressive > boosting on non-big CPUs: > > https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/util_est_faster_5.ipynb > > Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU > util when periodic tasks have a longer runtime compared to when they reach > steady-sate. > > The results is very similar to PELT halflife reduction. The advantage is > that 'util_est_faster' is only activated selectively when the runtime of > the current task in its current activation is long enough to create this > CPU util boost. IIUC how util_est_faster works, it removes the waiting time when sharing cpu time with other tasks. So as long as there is no (runnable but not running time), the result is the same as current util_est. util_est_faster makes a difference only when the task alternates between runnable and running slices. Have you considered using runnable_avg metrics in the increase of cpu freq ? 
This takes into the runnable slice and not only the running time and increase faster than util_avg when tasks compete for the same CPU > > Original patch: > https://lkml.kernel.org/r/Y2kLA8x40IiBEPYg@hirez.programming.kicks-ass.net > > Changes applied: > - use 'duration of the current activation' as delta > - delta >>= 10 > - uArch and frequency scaling of delta > > -->%-- > > diff --git a/kernel/sched/features.h b/kernel/sched/features.h > index efdc29c42161..76d146d06bbe 100644 > --- a/kernel/sched/features.h > +++ b/kernel/sched/features.h > @@ -97,6 +97,7 @@ SCHED_FEAT(WA_BIAS, true) > */ > SCHED_FEAT(UTIL_EST, true) > SCHED_FEAT(UTIL_EST_FASTUP, true) > +SCHED_FEAT(UTIL_EST_FASTER, true) > > SCHED_FEAT(LATENCY_WARN, false) > > diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c > index 0f310768260c..13cd9e27ce3e 100644 > --- a/kernel/sched/pelt.c > +++ b/kernel/sched/pelt.c > @@ -148,6 +148,22 @@ accumulate_sum(u64 delta, struct sched_avg *sa, > return periods; > } > > +/* > + * Compute a pelt util_avg assuming no history and @delta runtime. > + */ > +unsigned long faster_est_approx(u64 delta) > +{ > + unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */ > + u64 periods = delta / 1024; > + > + if (periods) { > + delta %= 1024; > + contrib = __accumulate_pelt_segments(periods, 1024, delta); > + } > + > + return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER; > +} > + > /* > * We can represent the historical contribution to runnable average as the > * coefficients of a geometric series. 
To do this we sub-divide our runnable > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 1072502976df..7cb45f1d8062 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -2961,6 +2961,8 @@ static inline unsigned long cpu_util_dl(struct rq *rq) > return READ_ONCE(rq->avg_dl.util_avg); > } > > +extern unsigned long faster_est_approx(u64 runtime); > + > /** > * cpu_util_cfs() - Estimates the amount of CPU capacity used by CFS tasks. > * @cpu: the CPU to get the utilization for. > @@ -2995,13 +2997,39 @@ static inline unsigned long cpu_util_dl(struct rq *rq) > */ > static inline unsigned long cpu_util_cfs(int cpu) > { > + struct rq *rq = cpu_rq(cpu); > struct cfs_rq *cfs_rq; > unsigned long util; > > - cfs_rq = &cpu_rq(cpu)->cfs; > + cfs_rq = &rq->cfs; > util = READ_ONCE(cfs_rq->avg.util_avg); > > if (sched_feat(UTIL_EST)) { > + if (sched_feat(UTIL_EST_FASTER)) { > + struct task_struct *curr; > + > + rcu_read_lock(); > + curr = rcu_dereference(rq->curr); > + if (likely(curr->sched_class == &fair_sched_class)) { > + unsigned long util_est_fast; > + u64 delta; > + > + delta = curr->se.sum_exec_runtime - > + curr->se.prev_sum_exec_runtime_vol; > + > + delta >>= 10; > + if (!delta) > + goto unlock; > + > + delta = cap_scale(delta, arch_scale_cpu_capacity(cpu)); > + delta = cap_scale(delta, arch_scale_freq_capacity(cpu)); > + > + util_est_fast = faster_est_approx(delta * 2); > + util = max(util, util_est_fast); > + } > +unlock: > + rcu_read_unlock(); > + } > util = max_t(unsigned long, util, > READ_ONCE(cfs_rq->avg.util_est.enqueued)); > } ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-09 16:16 ` Vincent Guittot @ 2023-02-17 13:54 ` Dietmar Eggemann 2023-02-20 13:54 ` Vincent Guittot 2023-02-20 10:13 ` Peter Zijlstra 2023-02-23 15:37 ` Qais Yousef 2 siblings, 1 reply; 61+ messages in thread From: Dietmar Eggemann @ 2023-02-17 13:54 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 09/02/2023 17:16, Vincent Guittot wrote: > On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >> >> On 09/11/2022 16:49, Peter Zijlstra wrote: >>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote: >>>> On 11/07/22 14:41, Peter Zijlstra wrote: >>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: [...] >> (B) *** Where does util_est_faster help exactly? *** >> >> It turns out that the score improvement comes from the more aggressive >> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util() >> -> effective_cpu_util(..., cpu_util_cfs(), ...). >> >> At the beginning of an episode (e.g. beginning of an image list view >> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU >> frequency') of the Android Graphics Pipeline (AGP) start to run, the >> CPU Operating Performance Point (OPP) is often so low that those tasks >> run more like 10/16ms which let the test application count a lot of >> Jankframes at those moments. > > I don't see how util_est_faster can help this 1ms task here ? It's > most probably never be preempted during this 1ms. For such an Android It's 1/16ms at max CPU frequency and on a big CPU. Could be a longer runtime with min CPU frequency at little CPU. I see runtime up to 10ms at the beginning of a test episode. 
Like I mentioned below, it could also be that the tasks have more work to do at the beginning. It's easy to spot using Google's perfetto and those moments also correlate with the occurrence of jankframes. I'm not yet sure how much this has to do with the perfetto instrumentation though. But you're right, on top of that, there is preemption (e.g. of the UI thread) by other threads (render thread, involved binder threads, surfaceflinger, etc.) going on. So the UI thread could be running+runnable for > 20ms, again marked as a jankframe. > Graphics Pipeline short task, hasn't uclamp_min been designed for and > a better solution ? Yes, it has. I'm not sure how feasible this is to do for all tasks involved. I'm thinking about the Binder threads here for instance. [...] >> Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU >> util when periodic tasks have a longer runtime compared to when they reach >> steady-sate. >> >> The results is very similar to PELT halflife reduction. The advantage is >> that 'util_est_faster' is only activated selectively when the runtime of >> the current task in its current activation is long enough to create this >> CPU util boost. > > IIUC how util_est_faster works, it removes the waiting time when > sharing cpu time with other tasks. So as long as there is no (runnable > but not running time), the result is the same as current util_est. > util_est_faster makes a difference only when the task alternates > between runnable and running slices. > Have you considered using runnable_avg metrics in the increase of cpu > freq ? This takes into the runnable slice and not only the running > time and increase faster than util_avg when tasks compete for the same > CPU Good idea! No, I haven't. I just glanced over the code, there shouldn't be an advantage in terms of more recent update between `curr->sum_exec_runtime` and update_load_avg(cfs_rq) even in the taskgroup case. 
Per-task view:

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/cpu_runnable_avg_boost.ipynb

All tests ran 10 iterations of all Jankbench sub-tests. (Reran the
`max_util_scaled_util_est_faster_rbl_freq` once with very similar
results. Just to make sure the results are somehow correct).

Max_frame_duration:
+------------------------------------------+------------+
| kernel                                   | value      |
+------------------------------------------+------------+
| base-a30b17f016b0                        | 147.571352 |
| pelt-hl-m2                               | 119.416351 |
| pelt-hl-m4                               | 96.473412  |
| scaled_util_est_faster_freq              | 126.646506 |
| max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!!
+------------------------------------------+------------+

Mean_frame_duration:
+------------------------------------------+-------+-----------+
| kernel                                   | value | perc_diff |
+------------------------------------------+-------+-----------+
| base-a30b17f016b0                        | 14.7  | 0.0%      |
| pelt-hl-m2                               | 13.6  | -7.5%     |
| pelt-hl-m4                               | 13.0  | -11.68%   |
| scaled_util_est_faster_freq              | 13.7  | -6.81%    |
| max_util_scaled_util_est_faster_rbl_freq | 12.1  | -17.85%   |
+------------------------------------------+-------+-----------+

Jank percentage (Jank deadline 16ms):
+------------------------------------------+-------+-----------+
| kernel                                   | value | perc_diff |
+------------------------------------------+-------+-----------+
| base-a30b17f016b0                        | 1.8   | 0.0%      |
| pelt-hl-m2                               | 1.8   | -4.91%    |
| pelt-hl-m4                               | 1.2   | -36.61%   |
| scaled_util_est_faster_freq              | 1.3   | -27.63%   |
| max_util_scaled_util_est_faster_rbl_freq | 0.8   | -54.86%   |
+------------------------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+------------------------------------------+-------+-----------+
| kernel                                   | value | perc_diff |
+------------------------------------------+-------+-----------+
| base-a30b17f016b0                        | 144.4 | 0.0%      |
| pelt-hl-m2                               | 141.6 | -1.97%    |
| pelt-hl-m4                               | 163.2 | 12.99%    |
| scaled_util_est_faster_freq              | 132.3 | -8.41%    |
| max_util_scaled_util_est_faster_rbl_freq | 133.4 | -7.67%    |
+------------------------------------------+-------+-----------+

There is a regression in `Max_frame_duration` but `Mean_frame_duration`,
`Jank percentage` and `Power usage` are better.

So maybe DVFS boosting in preempt-scenarios is really the thing here to
further improve the Android Graphics Pipeline.

I ran the same test (boosting only for DVFS requests) with:

-->8--

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dbc56e8b85f9..7a4bf38f2920 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
 			     READ_ONCE(cfs_rq->avg.util_est.enqueued));
 	}

+	util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
+
 	return min(util, capacity_orig_of(cpu));

Thanks!

-- Dietmar

^ permalink raw reply related	[flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-17 13:54 ` Dietmar Eggemann @ 2023-02-20 13:54 ` Vincent Guittot 2023-02-21 9:29 ` Vincent Guittot 2023-02-22 20:13 ` Dietmar Eggemann 0 siblings, 2 replies; 61+ messages in thread From: Vincent Guittot @ 2023-02-20 13:54 UTC (permalink / raw) To: Dietmar Eggemann Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 09/02/2023 17:16, Vincent Guittot wrote: > > On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > >> > >> On 09/11/2022 16:49, Peter Zijlstra wrote: > >>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote: > >>>> On 11/07/22 14:41, Peter Zijlstra wrote: > >>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: > > [...] > > >> (B) *** Where does util_est_faster help exactly? *** > >> > >> It turns out that the score improvement comes from the more aggressive > >> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util() > >> -> effective_cpu_util(..., cpu_util_cfs(), ...). > >> > >> At the beginning of an episode (e.g. beginning of an image list view > >> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU > >> frequency') of the Android Graphics Pipeline (AGP) start to run, the > >> CPU Operating Performance Point (OPP) is often so low that those tasks > >> run more like 10/16ms which let the test application count a lot of > >> Jankframes at those moments. > > > > I don't see how util_est_faster can help this 1ms task here ? It's > > most probably never be preempted during this 1ms. For such an Android > > It's 1/16ms at max CPU frequency and on a big CPU. Could be a longer > runtime with min CPU frequency at little CPU. 
I see runtime up to 10ms > at the beginning of a test episode. > > Like I mentioned below, it could also be that the tasks have more work > to do at the beginning. It's easy to spot using Google's perfetto and > those moments also correlate with the occurrence of jankframes. I'm not > yet sure how much this has to do with the perfetto instrumentation though. > > But you're right, on top of that, there is preemption (e.g. of the UI > thread) by other threads (render thread, involved binder threads, > surfaceflinger, etc.) going on. So the UI thread could be > running+runnable for > 20ms, again marked as a jankframe. > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and > > a better solution ? > > Yes, it has. I'm not sure how feasible this is to do for all tasks > involved. I'm thinking about the Binder threads here for instance. Yes, that can probably not help for all threads but some system threads like surfaceflinger and graphic composer should probably benefit from min uclamp > > [...] > > >> Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU > >> util when periodic tasks have a longer runtime compared to when they reach > >> steady-sate. > >> > >> The results is very similar to PELT halflife reduction. The advantage is > >> that 'util_est_faster' is only activated selectively when the runtime of > >> the current task in its current activation is long enough to create this > >> CPU util boost. > > > > IIUC how util_est_faster works, it removes the waiting time when > > sharing cpu time with other tasks. So as long as there is no (runnable > > but not running time), the result is the same as current util_est. > > util_est_faster makes a difference only when the task alternates > > between runnable and running slices. > > Have you considered using runnable_avg metrics in the increase of cpu > > freq ? 
This takes into the runnable slice and not only the running > > time and increase faster than util_avg when tasks compete for the same > > CPU > > Good idea! No, I haven't. > > I just glanced over the code, there shouldn't be an advantage in terms > of more recent update between `curr->sum_exec_runtime` and > update_load_avg(cfs_rq) even in the taskgroup case. > > Per-task view: > > https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/cpu_runnable_avg_boost.ipynb > > > All tests ran 10 iterations of all Jankbench sub-tests. (Reran the > `max_util_scaled_util_est_faster_rbl_freq` once with very similar > results. Just to make sure the results are somehow correct). > > Max_frame_duration: > +------------------------------------------+------------+ > | kernel | value | > +------------------------------------------+------------+ > | base-a30b17f016b0 | 147.571352 | > | pelt-hl-m2 | 119.416351 | > | pelt-hl-m4 | 96.473412 | > | scaled_util_est_faster_freq | 126.646506 | > | max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!! 
> +------------------------------------------+------------+ > > Mean_frame_duration: > +------------------------------------------+-------+-----------+ > | kernel | value | perc_diff | > +------------------------------------------+-------+-----------+ > | base-a30b17f016b0 | 14.7 | 0.0% | > | pelt-hl-m2 | 13.6 | -7.5% | > | pelt-hl-m4 | 13.0 | -11.68% | > | scaled_util_est_faster_freq | 13.7 | -6.81% | > | max_util_scaled_util_est_faster_rbl_freq | 12.1 | -17.85% | > +------------------------------------------+-------+-----------+ > > Jank percentage (Jank deadline 16ms): > +------------------------------------------+-------+-----------+ > | kernel | value | perc_diff | > +------------------------------------------+-------+-----------+ > | base-a30b17f016b0 | 1.8 | 0.0% | > | pelt-hl-m2 | 1.8 | -4.91% | > | pelt-hl-m4 | 1.2 | -36.61% | > | scaled_util_est_faster_freq | 1.3 | -27.63% | > | max_util_scaled_util_est_faster_rbl_freq | 0.8 | -54.86% | > +------------------------------------------+-------+-----------+ > > Power usage [mW] (total - all CPUs): > +------------------------------------------+-------+-----------+ > | kernel | value | perc_diff | > +------------------------------------------+-------+-----------+ > | base-a30b17f016b0 | 144.4 | 0.0% | > | pelt-hl-m2 | 141.6 | -1.97% | > | pelt-hl-m4 | 163.2 | 12.99% | > | scaled_util_est_faster_freq | 132.3 | -8.41% | > | max_util_scaled_util_est_faster_rbl_freq | 133.4 | -7.67% | > +------------------------------------------+-------+-----------+ > > There is a regression in `Max_frame_duration` but `Mean_frame_duration`, > `Jank percentage` and `Power usage` are better. The max frame duration is interesting. Could it be the very 1st frame of the test ? It's interesting that it's even worse than baseline whereas it should take the max of baseline and runnable_avg > > So maybe DVFS boosting in preempt-scenarios is really the thing here to > further improve the Android Graphics Pipeline. 
> > I ran the same test (boosting only for DVFS requests) with: > > -->8-- > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index dbc56e8b85f9..7a4bf38f2920 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu) > READ_ONCE(cfs_rq->avg.util_est.enqueued)); > } > > + util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg)); > + > return min(util, capacity_orig_of(cpu)); > > Thanks! > > -- Dietmar > > > > > > ^ permalink raw reply [flat|nested] 61+ messages in thread
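The effect of that two-line diff can be sketched in a hedged userspace model of cpu_util_cfs() (the `_model` names are made up for illustration; util_est details such as the UTIL_AVG_UNCHANGED flag and the sched_feat(UTIL_EST) gate are omitted):

```c
/* Hedged userspace model of cpu_util_cfs() with the runnable_avg boost
 * from the diff above. Not kernel code.
 */
static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Baseline: max of util_avg and util_est.enqueued, clamped to capacity. */
unsigned long cpu_util_cfs_model(unsigned long util_avg,
				 unsigned long util_est_enqueued,
				 unsigned long capacity_orig)
{
	unsigned long util = max_ul(util_avg, util_est_enqueued);

	return util < capacity_orig ? util : capacity_orig;
}

/* Boosted variant: additionally take cfs_rq->avg.runnable_avg into account. */
unsigned long cpu_util_cfs_boosted_model(unsigned long util_avg,
					 unsigned long util_est_enqueued,
					 unsigned long runnable_avg,
					 unsigned long capacity_orig)
{
	unsigned long util = max_ul(util_avg, util_est_enqueued);

	util = max_ul(util, runnable_avg);
	return util < capacity_orig ? util : capacity_orig;
}
```

Under contention runnable_avg exceeds both util_avg and util_est, so the boosted value drives DVFS higher; without contention the two variants return the same value.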
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-20 13:54 ` Vincent Guittot @ 2023-02-21 9:29 ` Vincent Guittot 2023-02-22 20:28 ` Dietmar Eggemann 2023-02-22 20:13 ` Dietmar Eggemann 1 sibling, 1 reply; 61+ messages in thread From: Vincent Guittot @ 2023-02-21 9:29 UTC (permalink / raw) To: Dietmar Eggemann Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On Mon, 20 Feb 2023 at 14:54, Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > > > On 09/02/2023 17:16, Vincent Guittot wrote: > > > On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > >> > > >> On 09/11/2022 16:49, Peter Zijlstra wrote: > > >>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote: > > >>>> On 11/07/22 14:41, Peter Zijlstra wrote: > > >>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: > > > > [...] > > > > >> (B) *** Where does util_est_faster help exactly? *** > > >> > > >> It turns out that the score improvement comes from the more aggressive > > >> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util() > > >> -> effective_cpu_util(..., cpu_util_cfs(), ...). > > >> > > >> At the beginning of an episode (e.g. beginning of an image list view > > >> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU > > >> frequency') of the Android Graphics Pipeline (AGP) start to run, the > > >> CPU Operating Performance Point (OPP) is often so low that those tasks > > >> run more like 10/16ms which let the test application count a lot of > > >> Jankframes at those moments. > > > > > > I don't see how util_est_faster can help this 1ms task here ? It's > > > most probably never be preempted during this 1ms. 
For such an Android > > > > It's 1/16ms at max CPU frequency and on a big CPU. Could be a longer > > runtime with min CPU frequency at little CPU. I see runtime up to 10ms > > at the beginning of a test episode. > > > > Like I mentioned below, it could also be that the tasks have more work > > to do at the beginning. It's easy to spot using Google's perfetto and > > those moments also correlate with the occurrence of jankframes. I'm not > > yet sure how much this has to do with the perfetto instrumentation though. > > > > But you're right, on top of that, there is preemption (e.g. of the UI > > thread) by other threads (render thread, involved binder threads, > > surfaceflinger, etc.) going on. So the UI thread could be > > running+runnable for > 20ms, again marked as a jankframe. > > > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and > > > a better solution ? > > > > Yes, it has. I'm not sure how feasible this is to do for all tasks > > involved. I'm thinking about the Binder threads here for instance. > > Yes, that can probably not help for all threads but some system > threads like surfaceflinger and graphic composer should probably > benefit from min uclamp > > > > > [...] > > > > >> Looks like that 'util_est_faster' can prevent Jankframes by boosting CPU > > >> util when periodic tasks have a longer runtime compared to when they reach > > >> steady-sate. > > >> > > >> The results is very similar to PELT halflife reduction. The advantage is > > >> that 'util_est_faster' is only activated selectively when the runtime of > > >> the current task in its current activation is long enough to create this > > >> CPU util boost. > > > > > > IIUC how util_est_faster works, it removes the waiting time when > > > sharing cpu time with other tasks. So as long as there is no (runnable > > > but not running time), the result is the same as current util_est. 
> > > util_est_faster makes a difference only when the task alternates > > > between runnable and running slices. > > > Have you considered using runnable_avg metrics in the increase of cpu > > > freq ? This takes into the runnable slice and not only the running > > > time and increase faster than util_avg when tasks compete for the same > > > CPU > > > > Good idea! No, I haven't. > > > > I just glanced over the code, there shouldn't be an advantage in terms > > of more recent update between `curr->sum_exec_runtime` and > > update_load_avg(cfs_rq) even in the taskgroup case. > > > > Per-task view: > > > > https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/cpu_runnable_avg_boost.ipynb > > > > > > All tests ran 10 iterations of all Jankbench sub-tests. (Reran the > > `max_util_scaled_util_est_faster_rbl_freq` once with very similar > > results. Just to make sure the results are somehow correct). > > > > Max_frame_duration: > > +------------------------------------------+------------+ > > | kernel | value | > > +------------------------------------------+------------+ > > | base-a30b17f016b0 | 147.571352 | > > | pelt-hl-m2 | 119.416351 | > > | pelt-hl-m4 | 96.473412 | > > | scaled_util_est_faster_freq | 126.646506 | > > | max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!! 
> > +------------------------------------------+------------+ > > > > Mean_frame_duration: > > +------------------------------------------+-------+-----------+ > > | kernel | value | perc_diff | > > +------------------------------------------+-------+-----------+ > > | base-a30b17f016b0 | 14.7 | 0.0% | > > | pelt-hl-m2 | 13.6 | -7.5% | > > | pelt-hl-m4 | 13.0 | -11.68% | > > | scaled_util_est_faster_freq | 13.7 | -6.81% | > > | max_util_scaled_util_est_faster_rbl_freq | 12.1 | -17.85% | > > +------------------------------------------+-------+-----------+ > > > > Jank percentage (Jank deadline 16ms): > > +------------------------------------------+-------+-----------+ > > | kernel | value | perc_diff | > > +------------------------------------------+-------+-----------+ > > | base-a30b17f016b0 | 1.8 | 0.0% | > > | pelt-hl-m2 | 1.8 | -4.91% | > > | pelt-hl-m4 | 1.2 | -36.61% | > > | scaled_util_est_faster_freq | 1.3 | -27.63% | > > | max_util_scaled_util_est_faster_rbl_freq | 0.8 | -54.86% | > > +------------------------------------------+-------+-----------+ > > > > Power usage [mW] (total - all CPUs): > > +------------------------------------------+-------+-----------+ > > | kernel | value | perc_diff | > > +------------------------------------------+-------+-----------+ > > | base-a30b17f016b0 | 144.4 | 0.0% | > > | pelt-hl-m2 | 141.6 | -1.97% | > > | pelt-hl-m4 | 163.2 | 12.99% | > > | scaled_util_est_faster_freq | 132.3 | -8.41% | > > | max_util_scaled_util_est_faster_rbl_freq | 133.4 | -7.67% | > > +------------------------------------------+-------+-----------+ > > > > There is a regression in `Max_frame_duration` but `Mean_frame_duration`, > > `Jank percentage` and `Power usage` are better. > > The max frame duration is interesting. Could it be the very 1st frame > of the test ? 
> It's interesting that it's even worse than baseline whereas it should > take the max of baseline and runnable_avg > > > > So maybe DVFS boosting in preempt-scenarios is really the thing here to > further improve the Android Graphics Pipeline. > > > > I ran the same test (boosting only for DVFS requests) with: > > > > -->8-- > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > index dbc56e8b85f9..7a4bf38f2920 100644 > > --- a/kernel/sched/sched.h > > +++ b/kernel/sched/sched.h > > @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu) > > READ_ONCE(cfs_rq->avg.util_est.enqueued)); > > } > > > > + util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg)); > > + Another reason why it gives better results could be that cpu_util_cfs() is not only used for DVFS selection but also to track CPU utilization in load balancing and EAS, so the CPU will be seen as overloaded sooner and tasks will be spread around when there is contention. Could you try to take cfs_rq->avg.runnable_avg into account only when selecting frequency ? That being said, I can see some places in load balancing where cfs_rq->avg.runnable_avg could give some benefit, like in find_busiest_queue() where it could be better to take the contention into account when selecting the busiest queue > > return min(util, capacity_orig_of(cpu)); > > > > Thanks! > > > > -- Dietmar > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-21 9:29 ` Vincent Guittot @ 2023-02-22 20:28 ` Dietmar Eggemann 2023-03-01 10:24 ` Vincent Guittot 0 siblings, 1 reply; 61+ messages in thread From: Dietmar Eggemann @ 2023-02-22 20:28 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 21/02/2023 10:29, Vincent Guittot wrote: > On Mon, 20 Feb 2023 at 14:54, Vincent Guittot > <vincent.guittot@linaro.org> wrote: >> >> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >>> >>> On 09/02/2023 17:16, Vincent Guittot wrote: >>>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >>>>> >>>>> On 09/11/2022 16:49, Peter Zijlstra wrote: >>>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote: >>>>>>> On 11/07/22 14:41, Peter Zijlstra wrote: >>>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: [...] >>> I ran the same test (boosting only for DVFS requests) with: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * >>> >>> -->8-- >>> >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>> index dbc56e8b85f9..7a4bf38f2920 100644 >>> --- a/kernel/sched/sched.h >>> +++ b/kernel/sched/sched.h >>> @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu) >>> READ_ONCE(cfs_rq->avg.util_est.enqueued)); >>> } >>> >>> + util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg)); >>> + > > Another reason why it gives better results could be that > cpu_util_cfs() is not only used for DVFS selection but also to track > the cpu utilization in load balance and EAS so the cpu will be faster > seen as overloaded and tasks will be spread around when there are > contentions. > > Could you try to take cfs_rq->avg.runnable_avg into account only when > selecting frequency ? 
I actually did exactly this. (* but not shown in the code snippet). I just used the boosting for CPU frequency selection (from sugov_get_util()). I added the `_freq` suffix in the kernel name to indicate this. > That being said I can see some place in load balance where > cfs_rq->avg.runnable_avg could give some benefits like in > find_busiest_queue() where it could be better to take into account the > contention when selecting the busiest queue Could be. Looks like so far we only use it in group_has_capacity(), group_is_overloaded() and for NUMA. [...] ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-22 20:28 ` Dietmar Eggemann @ 2023-03-01 10:24 ` Vincent Guittot 0 siblings, 0 replies; 61+ messages in thread From: Vincent Guittot @ 2023-03-01 10:24 UTC (permalink / raw) To: Dietmar Eggemann Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On Wed, 22 Feb 2023 at 21:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 21/02/2023 10:29, Vincent Guittot wrote: > > On Mon, 20 Feb 2023 at 14:54, Vincent Guittot > > <vincent.guittot@linaro.org> wrote: > >> > >> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > >>> > >>> On 09/02/2023 17:16, Vincent Guittot wrote: > >>>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > >>>>> > >>>>> On 09/11/2022 16:49, Peter Zijlstra wrote: > >>>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote: > >>>>>>> On 11/07/22 14:41, Peter Zijlstra wrote: > >>>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: > > [...] 
> > >>> I ran the same test (boosting only for DVFS requests) with: > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * > >>> > >>> -->8-- > >>> > >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > >>> index dbc56e8b85f9..7a4bf38f2920 100644 > >>> --- a/kernel/sched/sched.h > >>> +++ b/kernel/sched/sched.h > >>> @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu) > >>> READ_ONCE(cfs_rq->avg.util_est.enqueued)); > >>> } > >>> > >>> + util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg)); > >>> + > > > > Another reason why it gives better results could be that > > cpu_util_cfs() is not only used for DVFS selection but also to track > > the cpu utilization in load balance and EAS so the cpu will be faster > > seen as overloaded and tasks will be spread around when there are > > contentions. > > > > Could you try to take cfs_rq->avg.runnable_avg into account only when > > selecting frequency ? > > I actually did exactly this. (* but not shown in the code snippet). > I just used the boosting for CPU frequency selection (from > sugov_get_util()). I added the the `_freq` suffix in the kernel name to > indicate this. Ok. So the improvement that you are seeing, is really related to better freq selection > > > That being said I can see some place in load balance where > > cfs_rq->avg.runnable_avg could give some benefits like in > > find_busiest_queue() where it could be better to take into account the > > contention when selecting the busiest queue > > Could be. Looks like so far we only use it in group_has_capacity(), > group_is_overloaded() and for NUMA. I think it could be interesting to use runnable_avg in find_busiest_queue() for migrate_util case to select the rq with highest contention as an example > > [...] ^ permalink raw reply [flat|nested] 61+ messages in thread
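Vincent's find_busiest_queue()/migrate_util suggestion can be sketched as follows. This is a hedged toy model, not the kernel function: `rq_model` and `pick_busiest` are invented names, and the real code weighs capacity, imbalance type and many other factors. The point is only the selection criterion: runnable_avg (running + waiting time) versus util_avg (running time only).

```c
/* Toy model: select the "busiest" rq either by util_avg or, to capture
 * contention, by runnable_avg. Illustrative only.
 */
struct rq_model {
	unsigned long util_avg;     /* running time only */
	unsigned long runnable_avg; /* running + runnable-waiting time */
};

int pick_busiest(const struct rq_model *rqs, int nr, int use_runnable)
{
	unsigned long best = 0;
	int busiest = -1;

	for (int i = 0; i < nr; i++) {
		unsigned long load = use_runnable ? rqs[i].runnable_avg :
						    rqs[i].util_avg;
		if (load > best) {
			best = load;
			busiest = i;
		}
	}
	return busiest;
}
```

With two rqs at util 600/550 but runnable 650/900, the util-based pick is CPU0 while the runnable-based pick is the heavily contended CPU1.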
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-20 13:54 ` Vincent Guittot 2023-02-21 9:29 ` Vincent Guittot @ 2023-02-22 20:13 ` Dietmar Eggemann 2023-03-02 19:36 ` Dietmar Eggemann 1 sibling, 1 reply; 61+ messages in thread From: Dietmar Eggemann @ 2023-02-22 20:13 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 20/02/2023 14:54, Vincent Guittot wrote: > On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >> >> On 09/02/2023 17:16, Vincent Guittot wrote: >>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >>>> >>>> On 09/11/2022 16:49, Peter Zijlstra wrote: >>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote: >>>>>> On 11/07/22 14:41, Peter Zijlstra wrote: >>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: [...] >>> Graphics Pipeline short task, hasn't uclamp_min been designed for and >>> a better solution ? >> >> Yes, it has. I'm not sure how feasible this is to do for all tasks >> involved. I'm thinking about the Binder threads here for instance. > > Yes, that can probably not help for all threads but some system > threads like surfaceflinger and graphic composer should probably > benefit from min uclamp Yes, and it looks like that the Android version I'm using SQ1D.220205.004 (Feb '22) (automatic system updates turned off) is already using uclamp_min != 0 for tasks like UI thread. It's not one particular value but different values from [0 .. 512] over the runtime of a Jankbench iteration. I have to have a closer look. [...] 
>> Max_frame_duration: >> +------------------------------------------+------------+ >> | kernel | value | >> +------------------------------------------+------------+ >> | base-a30b17f016b0 | 147.571352 | >> | pelt-hl-m2 | 119.416351 | >> | pelt-hl-m4 | 96.473412 | >> | scaled_util_est_faster_freq | 126.646506 | >> | max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!! >> +------------------------------------------+------------+ >> >> Mean_frame_duration: >> +------------------------------------------+-------+-----------+ >> | kernel | value | perc_diff | >> +------------------------------------------+-------+-----------+ >> | base-a30b17f016b0 | 14.7 | 0.0% | >> | pelt-hl-m2 | 13.6 | -7.5% | >> | pelt-hl-m4 | 13.0 | -11.68% | >> | scaled_util_est_faster_freq | 13.7 | -6.81% | >> | max_util_scaled_util_est_faster_rbl_freq | 12.1 | -17.85% | >> +------------------------------------------+-------+-----------+ >> >> Jank percentage (Jank deadline 16ms): >> +------------------------------------------+-------+-----------+ >> | kernel | value | perc_diff | >> +------------------------------------------+-------+-----------+ >> | base-a30b17f016b0 | 1.8 | 0.0% | >> | pelt-hl-m2 | 1.8 | -4.91% | >> | pelt-hl-m4 | 1.2 | -36.61% | >> | scaled_util_est_faster_freq | 1.3 | -27.63% | >> | max_util_scaled_util_est_faster_rbl_freq | 0.8 | -54.86% | >> +------------------------------------------+-------+-----------+ >> >> Power usage [mW] (total - all CPUs): >> +------------------------------------------+-------+-----------+ >> | kernel | value | perc_diff | >> +------------------------------------------+-------+-----------+ >> | base-a30b17f016b0 | 144.4 | 0.0% | >> | pelt-hl-m2 | 141.6 | -1.97% | >> | pelt-hl-m4 | 163.2 | 12.99% | >> | scaled_util_est_faster_freq | 132.3 | -8.41% | >> | max_util_scaled_util_est_faster_rbl_freq | 133.4 | -7.67% | >> +------------------------------------------+-------+-----------+ >> >> There is a regression in `Max_frame_duration` but 
`Mean_frame_duration`, >> `Jank percentage` and `Power usage` are better. > > The max frame duration is interesting. Could it be the very 1st frame > of the test ? > It's interesting that it's even worse than baseline whereas it should > take the max of baseline and runnable_avg Since you asked in the following email: I just used the boosting for CPU frequency selection (from sugov_get_util()). I added the `_freq` suffix in the kernel name to indicate this. I don't have any helpful `ftrace` or `perfetto` data for these test runs though. That's why I ran another iteration with perfetto on `max_util_scaled_util_est_faster_rbl_freq`. `Max frame duration` = 121ms (< 158ms, but that was over 10 iterations) happened at the beginning of the 3/8 `List View Fling` episode. The UI thread (com.android.benchmark) runs on CPU1. Just before the start of this episode the CPU frequency is 0.3GHz. It takes 43ms for the CPU frequency to go up to 1.1GHz.

oriole:/sys # cat devices/system/cpu/cpu1/cpu_capacity
124
oriole:/sys # cat devices/system/cpu/cpu1/cpufreq/scaling_available_frequencies
300000 574000 738000 930000 1098000 1197000 1328000 1401000 1598000 1704000 1803000

So the combination of little CPU and low CPU frequency is the reason. But I can't see how using `max(max(util_avg, util_est.enq), rbl_avg)` can make `max frame duration` worse? I don't understand how asking for higher CPU frequencies under contention would favor the UI thread being scheduled on little CPUs at the beginning of an episode? Also the particular uclamp_min settings of the runnable tasks at this moment can have an influence on this `max frame duration` value. [...] ^ permalink raw reply [flat|nested] 61+ messages in thread
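The 43ms frequency ramp on this little CPU can be illustrated with a toy model of schedutil's frequency selection (cpu_capacity = 124, frequency table from the `scaling_available_frequencies` output above). The 1.25 headroom factor mirrors get_next_freq()'s "capacity + capacity/4"; rate limiting, util aggregation and clamping are left out, so this is only a sketch under those assumptions:

```c
#include <stddef.h>

/* Frequency table (kHz) of CPU1 from the log above. */
static const unsigned long freqs[] = {
	300000, 574000, 738000, 930000, 1098000, 1197000,
	1328000, 1401000, 1598000, 1704000, 1803000,
};
#define NR_FREQS (sizeof(freqs) / sizeof(freqs[0]))

/* Toy schedutil: target = 1.25 * fmax * util / capacity, then pick the
 * lowest OPP at or above the target. Illustrative only.
 */
unsigned long pick_freq(unsigned long util, unsigned long capacity)
{
	unsigned long fmax = freqs[NR_FREQS - 1];
	unsigned long target = (fmax + (fmax >> 2)) * util / capacity;

	for (size_t i = 0; i < NR_FREQS; i++)
		if (freqs[i] >= target)
			return freqs[i];
	return fmax;
}
```

A near-idle util of 10 (out of capacity 124) maps to the 300MHz floor; util has to climb past half capacity before the ~1.2GHz OPPs are requested. That is why a task starting on a cold, low-OPP little CPU needs tens of milliseconds of PELT ramp before the frequency follows.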
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-22 20:13 ` Dietmar Eggemann @ 2023-03-02 19:36 ` Dietmar Eggemann 0 siblings, 0 replies; 61+ messages in thread From: Dietmar Eggemann @ 2023-03-02 19:36 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, Qais Yousef, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 22/02/2023 21:13, Dietmar Eggemann wrote: > On 20/02/2023 14:54, Vincent Guittot wrote: >> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >>> >>> On 09/02/2023 17:16, Vincent Guittot wrote: >>>> On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >>>>> >>>>> On 09/11/2022 16:49, Peter Zijlstra wrote: >>>>>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote: >>>>>>> On 11/07/22 14:41, Peter Zijlstra wrote: >>>>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote: > > [...] > >>>> Graphics Pipeline short task, hasn't uclamp_min been designed for and >>>> a better solution ? >>> >>> Yes, it has. I'm not sure how feasible this is to do for all tasks >>> involved. I'm thinking about the Binder threads here for instance. >> >> Yes, that can probably not help for all threads but some system >> threads like surfaceflinger and graphic composer should probably >> benefit from min uclamp > > Yes, and it looks like that the Android version I'm using > SQ1D.220205.004 (Feb '22) (automatic system updates turned off) is > already using uclamp_min != 0 for tasks like UI thread. It's not one > particular value but different values from [0 .. 512] over the runtime > of a Jankbench iteration. I have to have a closer look. 
I did more Jankbench and Speedometer testing, especially to understand the influence of the already used uclamp_min boosting (Android Dynamic Performance Framework (ADPF) `CPU performance hints` feature: https://developer.android.com/games/optimize/adpf#cpu-hints) for some App tasks. The following notebooks show which of the App tasks are uclamp_min boosted (their diagram title carries an additional 'uclamp_min_boost' tag) and how uclamp_min boost relates to the other boost values. This is probably not a fixed mapping and could change between test runs. I assume that Android will issue performance hints in the form of uclamp_min boosting when it detects certain scenarios like a specific jankframe threshold or something similar. https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/jankbench_uclamp_min_boost.ipynb https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/speedometer_uclamp_min_boost.ipynb `base` has changed compared to `base-a30b17f016b0`. It now also contains: e5ed0550c04c - sched/fair: unlink misfit task from cpu overutilized (2023-02-11 Vincent Guittot) Former `max_util_scaled_util_est_faster_rbl_freq` has been renamed to `cpu_rbl_freq`.

Jankbench:

Max_frame_duration:
+-----------------------------+------------+
| kernel                      | value      |
+-----------------------------+------------+
| base                        | 156.299159 |
| base_wo_uclamp              | 171.063764 | uclamp disabled*
| pelt-hl-m2                  | 126.190232 |
| pelt-hl-m4                  | 100.865171 |
| scaled_util_est_faster_freq | 126.074194 |
| cpu_rbl_freq                | 153.123089 |
+-----------------------------+------------+

* We still let Android set the uclamp_min values. Just the uclamp setters are bypassed now.
Mean_frame_duration:
+-----------------------------+-------+-----------+
| kernel                      | value | perc_diff |
+-----------------------------+-------+-----------+
| base                        | 15.5  | 0.0%      |
| base_wo_uclamp              | 16.6  | 7.76%     |
| pelt-hl-m2                  | 14.9  | -3.27%    |
| pelt-hl-m4                  | 13.6  | -12.16%   |
| scaled_util_est_faster_freq | 14.7  | -4.88%    |
| cpu_rbl_freq                | 12.2  | -20.84%   |
+-----------------------------+-------+-----------+

Jank percentage (Jank deadline 16ms):
+-----------------------------+-------+-----------+
| kernel                      | value | perc_diff |
+-----------------------------+-------+-----------+
| base                        | 2.6   | 0.0%      |
| base_wo_uclamp              | 3.0   | 17.47%    |
| pelt-hl-m2                  | 2.0   | -23.33%   |
| pelt-hl-m4                  | 1.3   | -48.55%   |
| scaled_util_est_faster_freq | 1.7   | -32.21%   |
| cpu_rbl_freq                | 0.7   | -71.36%   |
+-----------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+-----------------------------+-------+-----------+
| kernel                      | value | perc_diff |
+-----------------------------+-------+-----------+
| base                        | 141.1 | 0.0%      |
| base_wo_uclamp              | 116.6 | -17.4%    |
| pelt-hl-m2                  | 138.7 | -1.7%     |
| pelt-hl-m4                  | 156.5 | 10.87%    |
| scaled_util_est_faster_freq | 147.6 | 4.57%     |
| cpu_rbl_freq                | 135.0 | -4.33%    |
+-----------------------------+-------+-----------+

Speedometer:

Score:
+-----------------------------+-------+-----------+
| kernel                      | value | perc_diff |
+-----------------------------+-------+-----------+
| base                        | 108.4 | 0.0%      |
| base_wo_uclamp              | 95.2  | -12.17%   |
| pelt-hl-m2                  | 112.9 | 4.13%     |
| scaled_util_est_faster_freq | 114.7 | 5.77%     |
| cpu_rbl_freq                | 127.7 | 17.75%    |
+-----------------------------+-------+-----------+

Power usage [mW] (total - all CPUs):
+-----------------------------+--------+-----------+
| kernel                      | value  | perc_diff |
+-----------------------------+--------+-----------+
| base                        | 2268.4 | 0.0%      |
| base_wo_uclamp              | 1789.5 | -21.11%   |
| pelt-hl-m2                  | 2386.5 | 5.21%     |
| scaled_util_est_faster_freq | 2292.3 | 1.05%     |
| cpu_rbl_freq                | 2198.3 | -3.09%    |
+-----------------------------+--------+-----------+

The explanation I have is that the `CPU performance hints` feature tries to recreate the information about contention for a specific set of tasks. Since there is also contention in which only non-uclamp_min-boosted tasks are runnable, mechanisms like `util_est_faster` or `cpu_runnable` boosting can help on top of what's already provided with uclamp_min boosting from userspace. [...] ^ permalink raw reply [flat|nested] 61+ messages in thread
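Vincent's point that runnable_avg rises faster than util_avg when tasks compete for a CPU can be checked with a toy PELT model (1ms steps, 32ms-halflife geometric decay; the kernel's 1024us windows and scale factors are simplified away, so this is only a sketch). A task that alternates 1ms running / 1ms runnable-but-waiting, i.e. shares the CPU with one other task, converges to a util_avg of roughly half scale while its runnable_avg saturates near 1024:

```c
#include <math.h>

/* Toy per-task PELT under 50% contention. util tracks running time only;
 * runnable also tracks the runnable-waiting time.
 */
void simulate_contended_task(double *util, double *runnable)
{
	const double y = pow(0.5, 1.0 / 32.0); /* per-ms decay, y^32 = 0.5 */
	double u = 0.0, r = 0.0;

	for (int ms = 0; ms < 500; ms++) {
		int running = !(ms & 1); /* on the CPU every other ms */

		u = u * y + (running ? 1024.0 * (1.0 - y) : 0.0);
		r = r * y + 1024.0 * (1.0 - y); /* runnable the whole time */
	}
	*util = u;
	*runnable = r;
}
```

This is exactly the situation where boosting DVFS by runnable_avg asks for a higher OPP than util_avg or util_est would, while a task running alone sees no difference.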
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-09 16:16 ` Vincent Guittot 2023-02-17 13:54 ` Dietmar Eggemann @ 2023-02-20 10:13 ` Peter Zijlstra 2023-02-20 13:39 ` Vincent Guittot 2023-02-23 15:37 ` Qais Yousef 2 siblings, 1 reply; 61+ messages in thread From: Peter Zijlstra @ 2023-02-20 10:13 UTC (permalink / raw) To: Vincent Guittot Cc: Dietmar Eggemann, Qais Yousef, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On Thu, Feb 09, 2023 at 05:16:46PM +0100, Vincent Guittot wrote: > > The results is very similar to PELT halflife reduction. The advantage is > > that 'util_est_faster' is only activated selectively when the runtime of > > the current task in its current activation is long enough to create this > > CPU util boost. > > IIUC how util_est_faster works, it removes the waiting time when > sharing cpu time with other tasks. So as long as there is no (runnable > but not running time), the result is the same as current util_est. Uh.. it's double the speed, no? Even if there is no contention, the fake/in-situ pelt sum runs at double time and thus will ramp up faster than normal. > util_est_faster makes a difference only when the task alternates > between runnable and running slices. UTIL_EST was supposed to help mitigate some of that, but yes. Also note that _FASTER sorta sucks here because it starts from 0 every time, if it were to start from the state saved by util_est_dequeue(), it would ramp up faster still. Patch has a comment along those lines I think. > Have you considered using runnable_avg metrics in the increase of cpu > freq ? This takes into the runnable slice and not only the running > time and increase faster than util_avg when tasks compete for the same > CPU Interesting! Indeed, that's boosting the DVFS for contention. And as deggeman's reply shows, it seems to work well. 
I wonder if that one place where it regresses is exactly the case without contention. ^ permalink raw reply [flat|nested] 61+ messages in thread
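Peter's "double the speed" point can be made concrete with a toy model: util_est_faster feeds (delta * 2) into the same geometric accumulation, so even without contention the in-situ sum ramps as if twice the runtime had elapsed. Hedged sketch, not the kernel implementation:

```c
#include <math.h>

/* Accumulate a PELT-like sum over runtime_ms of running time, with the
 * elapsed time multiplied by time_mult (util_est_faster uses 2).
 */
double pelt_ramp(int runtime_ms, int time_mult)
{
	const double y = pow(0.5, 1.0 / 32.0); /* 32ms halflife */
	double sum = 0.0;

	for (int ms = 0; ms < runtime_ms * time_mult; ms++)
		sum = sum * y + 1024.0 * (1.0 - y);
	return sum;
}
```

After 8ms of real runtime the doubled clock yields the 16ms value (roughly 300 instead of roughly 163 out of 1024), i.e. the boost exists even in the uncontended case.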
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-20 10:13 ` Peter Zijlstra @ 2023-02-20 13:39 ` Vincent Guittot 0 siblings, 0 replies; 61+ messages in thread From: Vincent Guittot @ 2023-02-20 13:39 UTC (permalink / raw) To: Peter Zijlstra Cc: Dietmar Eggemann, Qais Yousef, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On Mon, 20 Feb 2023 at 11:13, Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Feb 09, 2023 at 05:16:46PM +0100, Vincent Guittot wrote: > > > > The results is very similar to PELT halflife reduction. The advantage is > > > that 'util_est_faster' is only activated selectively when the runtime of > > > the current task in its current activation is long enough to create this > > > CPU util boost. > > > > IIUC how util_est_faster works, it removes the waiting time when > > sharing cpu time with other tasks. So as long as there is no (runnable > > but not running time), the result is the same as current util_est. > > Uh.. it's double the speed, no? Even if there is no contention, the > fake/in-situ pelt sum runs at double time and thus will ramp up faster > than normal. Ah yes. I haven't noticed it was (delta * 2) and not delta > > > util_est_faster makes a difference only when the task alternates > > between runnable and running slices. > > UTIL_EST was supposed to help mitigate some of that, but yes. Also note > that _FASTER sorta sucks here because it starts from 0 every time, if it > were to start from the state saved by util_est_dequeue(), it would ramp > up faster still. Yes. > > Patch has a comment along those lines I think. > > > Have you considered using runnable_avg metrics in the increase of cpu > > freq ? This takes into the runnable slice and not only the running > > time and increase faster than util_avg when tasks compete for the same > > CPU > > Interesting! 
Indeed, that's boosting the DVFS for contention. And as > deggeman's reply shows, it seems to work well. > > I wonder if that one place where it regresses is exactly the case > without contention. Yes that might be the case indeed. I would expect uclamp_min to help for ensuring a min frequency such scenario ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-09 16:16 ` Vincent Guittot 2023-02-17 13:54 ` Dietmar Eggemann 2023-02-20 10:13 ` Peter Zijlstra @ 2023-02-23 15:37 ` Qais Yousef 2023-03-01 10:39 ` Vincent Guittot 2 siblings, 1 reply; 61+ messages in thread From: Qais Yousef @ 2023-02-23 15:37 UTC (permalink / raw) To: Vincent Guittot Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 02/09/23 17:16, Vincent Guittot wrote: > I don't see how util_est_faster can help this 1ms task here ? It's > most probably never be preempted during this 1ms. For such an Android > Graphics Pipeline short task, hasn't uclamp_min been designed for and > a better solution ? uclamp_min is being used in UI and helping there. But your mileage might vary with adoption still. The major motivation behind this is to help things like gaming as the original thread started. It can help UI and other use cases too. Android framework has a lot of context on the type of workload that can help it make a decision when this helps. And OEMs can have the chance to tune and apply based on the characteristics of their device. > IIUC how util_est_faster works, it removes the waiting time when > sharing cpu time with other tasks. So as long as there is no (runnable > but not running time), the result is the same as current util_est. > util_est_faster makes a difference only when the task alternates > between runnable and running slices. > Have you considered using runnable_avg metrics in the increase of cpu > freq ? This takes into the runnable slice and not only the running > time and increase faster than util_avg when tasks compete for the same > CPU Just to understand why we're heading into this direction now. 
AFAIU the desired outcome is to have faster rampup time (and on HMP faster up migration), both of which are tied to the utilization signal. Wouldn't making the util response time faster help not just for rampup, but rampdown too? If we improve util response time, couldn't this mean we can remove util_est, or am I missing something? Currently we have util response which is tweaked by util_est and then that is tweaked further by schedutil with that 25% margin when mapping util to frequency. I think if we can allow improving general util response time by tweaking PELT HALFLIFE we can potentially remove util_est and potentially that magic 25% margin too. Why is the approach of further tweaking util_est better? Recently Phoronix reported that schedutil behavior is suboptimal and I wonder if the response time is contributing to that https://www.phoronix.com/review/schedutil-quirky-2023 Cheers -- Qais Yousef ^ permalink raw reply [flat|nested] 61+ messages in thread
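[Editorial note: a minimal sketch, with illustrative numbers, of what util_est holds against. Even with a shorter halflife, a periodic task's raw util_avg decays during every sleep, while util_est keeps the pre-sleep peak; this is why removing util_est is not obviously free.]

```python
def pelt_decay(util, sleep_ms, halflife_ms=32):
    """util_avg decays geometrically while the task sleeps."""
    return util * 0.5 ** (sleep_ms / halflife_ms)

# A periodic task that ramped to util ~300, then sleeps 8ms of its period:
util_avg = pelt_decay(300, 8)        # decays to ~252 with the 32ms halflife
util_est = 300                       # util_est.enqueued holds the peak
effective = max(util_avg, util_est)  # util_est wins at the next wakeup
```

A shorter halflife would make the decay during the sleep even deeper, so a faster-responding util_avg alone would undershoot more at wakeup, not less.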
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-02-23 15:37 ` Qais Yousef @ 2023-03-01 10:39 ` Vincent Guittot 2023-03-01 17:24 ` Qais Yousef 0 siblings, 1 reply; 61+ messages in thread From: Vincent Guittot @ 2023-03-01 10:39 UTC (permalink / raw) To: Qais Yousef Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote: > > On 02/09/23 17:16, Vincent Guittot wrote: > > > I don't see how util_est_faster can help this 1ms task here ? It's > > most probably never be preempted during this 1ms. For such an Android > > Graphics Pipeline short task, hasn't uclamp_min been designed for and > > a better solution ? > > uclamp_min is being used in UI and helping there. But your mileage might vary > with adoption still. > > The major motivation behind this is to help things like gaming as the original > thread started. It can help UI and other use cases too. Android framework has > a lot of context on the type of workload that can help it make a decision when > this helps. And OEMs can have the chance to tune and apply based on the > characteristics of their device. > > > IIUC how util_est_faster works, it removes the waiting time when > > sharing cpu time with other tasks. So as long as there is no (runnable > > but not running time), the result is the same as current util_est. > > util_est_faster makes a difference only when the task alternates > > between runnable and running slices. > > Have you considered using runnable_avg metrics in the increase of cpu > > freq ? This takes into the runnable slice and not only the running > > time and increase faster than util_avg when tasks compete for the same > > CPU > > Just to understand why we're heading into this direction now. 
> > AFAIU the desired outcome to have faster rampup time (and on HMP faster up > migration) which both are tied to utilization signal. > > Wouldn't make the util response time faster help not just for rampup, but > rampdown too? > > If we improve util response time, couldn't this mean we can remove util_est or > am I missing something? not sure because you still have a ramping step whereas util_est directly gives you the final tager > > Currently we have util response which is tweaked by util_est and then that is > tweaked further by schedutil with that 25% margin when maping util to > frequency. the 25% is not related to the ramping time but to the fact that you always need some margin to cover unexpected events and estimation error > > I think if we can allow improving general util response time by tweaking PELT > HALFLIFE we can potentially remove util_est and potentially that magic 25% > margin too. > > Why the approach of further tweaking util_est is better? note that in this case it doesn't really tweak util_est but Dietmar has taken into account runnable_avg to increase the freq in case of contention Also IIUC Dietmar's results, the problem seems more linked to the selection of a higher freq than increasing the utilization; runnable_avg tests give similar perf results than shorter half life and better power consumption. > > Recently phoronix reported that schedutil behavior is suboptimal and I wonder > if the response time is contributing to that > > https://www.phoronix.com/review/schedutil-quirky-2023 > > > Cheers > > -- > Qais Yousef ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-03-01 10:39 ` Vincent Guittot @ 2023-03-01 17:24 ` Qais Yousef 2023-03-02 8:00 ` Vincent Guittot 2023-03-23 16:29 ` Dietmar Eggemann 0 siblings, 2 replies; 61+ messages in thread From: Qais Yousef @ 2023-03-01 17:24 UTC (permalink / raw) To: Vincent Guittot Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 03/01/23 11:39, Vincent Guittot wrote: > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote: > > > > On 02/09/23 17:16, Vincent Guittot wrote: > > > > > I don't see how util_est_faster can help this 1ms task here ? It's > > > most probably never be preempted during this 1ms. For such an Android > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and > > > a better solution ? > > > > uclamp_min is being used in UI and helping there. But your mileage might vary > > with adoption still. > > > > The major motivation behind this is to help things like gaming as the original > > thread started. It can help UI and other use cases too. Android framework has > > a lot of context on the type of workload that can help it make a decision when > > this helps. And OEMs can have the chance to tune and apply based on the > > characteristics of their device. > > > > > IIUC how util_est_faster works, it removes the waiting time when > > > sharing cpu time with other tasks. So as long as there is no (runnable > > > but not running time), the result is the same as current util_est. > > > util_est_faster makes a difference only when the task alternates > > > between runnable and running slices. > > > Have you considered using runnable_avg metrics in the increase of cpu > > > freq ? 
This takes into the runnable slice and not only the running > > > time and increase faster than util_avg when tasks compete for the same > > > CPU > > > > Just to understand why we're heading into this direction now. > > > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up > > migration) which both are tied to utilization signal. > > > > Wouldn't make the util response time faster help not just for rampup, but > > rampdown too? > > > > If we improve util response time, couldn't this mean we can remove util_est or > > am I missing something? > > not sure because you still have a ramping step whereas util_est > directly gives you the final tager I didn't get you. tager? > > > > > Currently we have util response which is tweaked by util_est and then that is > > tweaked further by schedutil with that 25% margin when maping util to > > frequency. > > the 25% is not related to the ramping time but to the fact that you > always need some margin to cover unexpected events and estimation > error At the moment we have util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection I think we have too many transformations before deciding the current frequencies. Which makes it hard to tweak the system response. > > > > > I think if we can allow improving general util response time by tweaking PELT > > HALFLIFE we can potentially remove util_est and potentially that magic 25% > > margin too. > > > > Why the approach of further tweaking util_est is better? > > note that in this case it doesn't really tweak util_est but Dietmar > has taken into account runnable_avg to increase the freq in case of > contention > > Also IIUC Dietmar's results, the problem seems more linked to the > selection of a higher freq than increasing the utilization; > runnable_avg tests give similar perf results than shorter half life > and better power consumption. Does it ramp down faster too? 
Thanks -- Qais Yousef > > > > > Recently phoronix reported that schedutil behavior is suboptimal and I wonder > > if the response time is contributing to that > > > > https://www.phoronix.com/review/schedutil-quirky-2023 > > > > > > Cheers > > > > -- > > Qais Yousef ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-03-01 17:24 ` Qais Yousef @ 2023-03-02 8:00 ` Vincent Guittot 2023-03-02 19:39 ` Dietmar Eggemann 2023-03-06 19:11 ` Qais Yousef 2023-03-23 16:29 ` Dietmar Eggemann 1 sibling, 2 replies; 61+ messages in thread From: Vincent Guittot @ 2023-03-02 8:00 UTC (permalink / raw) To: Qais Yousef Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On Wed, 1 Mar 2023 at 18:25, Qais Yousef <qyousef@layalina.io> wrote: > > On 03/01/23 11:39, Vincent Guittot wrote: > > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > On 02/09/23 17:16, Vincent Guittot wrote: > > > > > > > I don't see how util_est_faster can help this 1ms task here ? It's > > > > most probably never be preempted during this 1ms. For such an Android > > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and > > > > a better solution ? > > > > > > uclamp_min is being used in UI and helping there. But your mileage might vary > > > with adoption still. > > > > > > The major motivation behind this is to help things like gaming as the original > > > thread started. It can help UI and other use cases too. Android framework has > > > a lot of context on the type of workload that can help it make a decision when > > > this helps. And OEMs can have the chance to tune and apply based on the > > > characteristics of their device. > > > > > > > IIUC how util_est_faster works, it removes the waiting time when > > > > sharing cpu time with other tasks. So as long as there is no (runnable > > > > but not running time), the result is the same as current util_est. > > > > util_est_faster makes a difference only when the task alternates > > > > between runnable and running slices. 
> > > > Have you considered using runnable_avg metrics in the increase of cpu > > > > freq ? This takes into the runnable slice and not only the running > > > > time and increase faster than util_avg when tasks compete for the same > > > > CPU > > > > > > Just to understand why we're heading into this direction now. > > > > > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up > > > migration) which both are tied to utilization signal. > > > > > > Wouldn't make the util response time faster help not just for rampup, but > > > rampdown too? > > > > > > If we improve util response time, couldn't this mean we can remove util_est or > > > am I missing something? > > > > not sure because you still have a ramping step whereas util_est > > directly gives you the final tager > > I didn't get you. tager? target > > > > > > > > > Currently we have util response which is tweaked by util_est and then that is > > > tweaked further by schedutil with that 25% margin when maping util to > > > frequency. > > > > the 25% is not related to the ramping time but to the fact that you > > always need some margin to cover unexpected events and estimation > > error > > At the moment we have > > util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection > > I think we have too many transformations before deciding the current > frequencies. Which makes it hard to tweak the system response. What is proposed here with runnable_avg is more to take a new input when selecting a frequency: the level of contention on the cpu. But this is not used to modify the utilization seen by the scheduler > > > > > > > > > I think if we can allow improving general util response time by tweaking PELT > > > HALFLIFE we can potentially remove util_est and potentially that magic 25% > > > margin too. > > > > > > Why the approach of further tweaking util_est is better? 
> > > > note that in this case it doesn't really tweak util_est but Dietmar > > has taken into account runnable_avg to increase the freq in case of > > contention > > > > Also IIUC Dietmar's results, the problem seems more linked to the > > selection of a higher freq than increasing the utilization; > > runnable_avg tests give similar perf results than shorter half life > > and better power consumption. > > Does it ramp down faster too? I don't think so. To be honest, I'm not convinced that modifying the half life is the right way to solve this. If it was only a matter of the half life not being suitable for a system, the half life would be set once at boot and people would not ask to modify it at run time. > > > Thanks > > -- > Qais Yousef > > > > > > > > > Recently phoronix reported that schedutil behavior is suboptimal and I wonder > > > if the response time is contributing to that > > > > > > https://www.phoronix.com/review/schedutil-quirky-2023 > > > > > > > > > Cheers > > > > > > -- > > > Qais Yousef ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-03-02 8:00 ` Vincent Guittot @ 2023-03-02 19:39 ` Dietmar Eggemann 2023-03-06 19:11 ` Qais Yousef 1 sibling, 0 replies; 61+ messages in thread From: Dietmar Eggemann @ 2023-03-02 19:39 UTC (permalink / raw) To: Vincent Guittot, Qais Yousef Cc: Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 02/03/2023 09:00, Vincent Guittot wrote: > On Wed, 1 Mar 2023 at 18:25, Qais Yousef <qyousef@layalina.io> wrote: >> >> On 03/01/23 11:39, Vincent Guittot wrote: >>> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote: >>>> >>>> On 02/09/23 17:16, Vincent Guittot wrote: [...] >>>> Just to understand why we're heading into this direction now. >>>> >>>> AFAIU the desired outcome to have faster rampup time (and on HMP faster up >>>> migration) which both are tied to utilization signal. >>>> >>>> Wouldn't make the util response time faster help not just for rampup, but >>>> rampdown too? >>>> >>>> If we improve util response time, couldn't this mean we can remove util_est or >>>> am I missing something? >>> >>> not sure because you still have a ramping step whereas util_est >>> directly gives you the final tager >> >> I didn't get you. tager? > > target uclamp_min boosting (ADPF's `CPU performance hints` feature) could eclipse util_est but only if it's higher and only for those tasks affected the feature, [...] >>>> I think if we can allow improving general util response time by tweaking PELT >>>> HALFLIFE we can potentially remove util_est and potentially that magic 25% >>>> margin too. >>>> >>>> Why the approach of further tweaking util_est is better? 
>>> >>> note that in this case it doesn't really tweak util_est but Dietmar >>> has taken into account runnable_avg to increase the freq in case of >>> contention >>> >>> Also IIUC Dietmar's results, the problem seems more linked to the >>> selection of a higher freq than increasing the utilization; >>> runnable_avg tests give similar perf results than shorter half life >>> and better power consumption. >> >> Does it ramp down faster too? > > I don't think so. > > To be honest, I'm not convinced that modifying the half time is the > right way to solve this. If it was only a matter of half life not > being suitable for a system, the halk life would be set once at boot > and people would not ask to modify it at run time. IMHO, what people don't like about PELT halflife mods is the fact that all sched entities and every functionality based on PELT signals would be affected even though it might not be beneficial or even harmful for system behaviour not covered by the specific benchmark numbers shown. That's why we wanted to figure out what is the actual reason which improves those Jankbench (or Speedometer resp. game FPS numbers). In this case we would be able to boost more selectively than PELT halflife modding can do. Util_est_faster (1) is an approach to only boost the CPU util signal depending on the current task's activation duration (sum of task's running time). This time is multiplied by 2 when calculating the fake PELT signal which is then max-compared with the existing CPU util. And the idea to max-compare CPU util and CPU runnable (2) is to help tasks under contention. Android testing showed that contention very often accompanies jankframe occurrences for example. I only applied (1) and (2) to DVFS requests in my testing. [...] ^ permalink raw reply [flat|nested] 61+ messages in thread
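[Editorial note: a minimal sketch of the two boosting mechanisms (1) and (2) described above, for DVFS requests only. The 2x factor and the max-comparisons follow the description in this message; the function names, signal shape, and numbers are illustrative assumptions, not the actual kernel implementation.]

```python
def pelt_ramp(run_ms, halflife_ms=32, max_util=1024):
    """PELT-style geometric ramp from zero after run_ms of continuous running."""
    return max_util * (1.0 - 0.5 ** (run_ms / halflife_ms))

def dvfs_util(cpu_util, cpu_runnable, curr_runtime_ms):
    # (1) util_est_faster: a fake PELT signal built from 2x the current
    #     task's activation runtime, max-compared with the CPU util.
    faster = pelt_ramp(2.0 * curr_runtime_ms)
    # (2) contention: max-compare with CPU runnable, which exceeds util
    #     when tasks are waiting for the CPU.
    return max(cpu_util, faster, cpu_runnable)
```

For example, a task that has been running 8ms produces a fake signal worth 16ms of ramp (~300), lifting the frequency request well before the real util_avg gets there; under contention, runnable takes over instead.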
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-03-02 8:00 ` Vincent Guittot 2023-03-02 19:39 ` Dietmar Eggemann @ 2023-03-06 19:11 ` Qais Yousef 2023-03-07 13:22 ` Vincent Guittot 1 sibling, 1 reply; 61+ messages in thread From: Qais Yousef @ 2023-03-06 19:11 UTC (permalink / raw) To: Vincent Guittot Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 03/02/23 09:00, Vincent Guittot wrote: > On Wed, 1 Mar 2023 at 18:25, Qais Yousef <qyousef@layalina.io> wrote: > > > > On 03/01/23 11:39, Vincent Guittot wrote: > > > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > On 02/09/23 17:16, Vincent Guittot wrote: > > > > > > > > > I don't see how util_est_faster can help this 1ms task here ? It's > > > > > most probably never be preempted during this 1ms. For such an Android > > > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and > > > > > a better solution ? > > > > > > > > uclamp_min is being used in UI and helping there. But your mileage might vary > > > > with adoption still. > > > > > > > > The major motivation behind this is to help things like gaming as the original > > > > thread started. It can help UI and other use cases too. Android framework has > > > > a lot of context on the type of workload that can help it make a decision when > > > > this helps. And OEMs can have the chance to tune and apply based on the > > > > characteristics of their device. > > > > > > > > > IIUC how util_est_faster works, it removes the waiting time when > > > > > sharing cpu time with other tasks. So as long as there is no (runnable > > > > > but not running time), the result is the same as current util_est. > > > > > util_est_faster makes a difference only when the task alternates > > > > > between runnable and running slices. 
> > > > > Have you considered using runnable_avg metrics in the increase of cpu > > > > > freq ? This takes into the runnable slice and not only the running > > > > > time and increase faster than util_avg when tasks compete for the same > > > > > CPU > > > > > > > > Just to understand why we're heading into this direction now. > > > > > > > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up > > > > migration) which both are tied to utilization signal. > > > > > > > > Wouldn't make the util response time faster help not just for rampup, but > > > > rampdown too? > > > > > > > > If we improve util response time, couldn't this mean we can remove util_est or > > > > am I missing something? > > > > > > not sure because you still have a ramping step whereas util_est > > > directly gives you the final tager > > > > I didn't get you. tager? > > target It seems you're referring to the holding function of util_est? ie: keep the util high to avoid 'spurious' decays? Isn't this a duplication of the schedutil's filter which is also a holding function to prevent rapid frequency changes? FWIW, that schedutil filter does get tweaked a lot in android world. Many add an additional down_filter to prevent this premature drop in freq (AFAICT). Which tells me util_est is not delivering completely on that front in practice. > > > > > > > > > > > > > > Currently we have util response which is tweaked by util_est and then that is > > > > tweaked further by schedutil with that 25% margin when maping util to > > > > frequency. > > > > > > the 25% is not related to the ramping time but to the fact that you > > > always need some margin to cover unexpected events and estimation > > > error > > > > At the moment we have > > > > util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection > > > > I think we have too many transformations before deciding the current > > frequencies. 
Which makes it hard to tweak the system response. > > What is proposed here with runnable_avg is more to take a new input > when selecting a frequency: the level of contention on the cpu. But What if there's no contention on the CPU and it's just a single task running there that suddenly becomes always running for a number of frames? > this is not used to modify the utilization seen by the scheduler > > > > > > > > > > > > > > I think if we can allow improving general util response time by tweaking PELT > > > > HALFLIFE we can potentially remove util_est and potentially that magic 25% > > > > margin too. > > > > > > > > Why the approach of further tweaking util_est is better? > > > > > > note that in this case it doesn't really tweak util_est but Dietmar > > > has taken into account runnable_avg to increase the freq in case of > > > contention > > > > > > Also IIUC Dietmar's results, the problem seems more linked to the > > > selection of a higher freq than increasing the utilization; > > > runnable_avg tests give similar perf results than shorter half life > > > and better power consumption. > > > > Does it ramp down faster too? > > I don't think so. > > To be honest, I'm not convinced that modifying the half time is the > right way to solve this. If it was only a matter of half life not > being suitable for a system, the halk life would be set once at boot > and people would not ask to modify it at run time. I'd like to understand more the reason behind these concerns. What is the problem with modifying the halflife? The way I see it it is an important metric of how responsive the system to how loaded it is. Which drives a lot of important decisions. 32ms means the system needs approximately 200ms to detect an always running task (from idle). 16ms halves it to 100ms. And 8ms halves it further to 50ms. Or you can phrase it the opposite way, it takes 200ms to detect the system is now idle from always busy state. etc. 
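[Editorial note: the detection times above fall out of PELT's geometric series: from idle, a continuously running task's signal reaches 1 - 0.5^(t/hl) of max after t ms. A small sketch, taking "detected" to mean reaching 99% of max, which is an illustrative threshold rather than anything the kernel defines:]

```python
import math

def time_to_reach(frac, halflife_ms):
    """Solve 1 - 0.5**(t / hl) = frac for t:
    t = hl * log2(1 / (1 - frac))."""
    return halflife_ms * math.log2(1.0 / (1.0 - frac))

# Time for a task running continuously from idle to reach 99% of max:
t32 = time_to_reach(0.99, 32)  # ~213ms -> the "approximately 200ms"
t16 = time_to_reach(0.99, 16)  # ~106ms
t8 = time_to_reach(0.99, 8)    # ~53ms
```

Because the ramp time is linear in the halflife, halving the halflife exactly halves every detection time, which is the scaling described above.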
Why is it bad for a sys admin to have the ability to adjust this response time as they see fit? What goes wrong? AFAICS the two natural places to control the response time of the system is pelt halflife for overall system responsiveness, and the mapping function in schedutil for more fine grained frequency response. There are issues with current filtering mechanism in schedutil too: 1. It drops any requests during the filtering window. At CFS enqueue we could end up with multiple calls to cpufreq_update_util(); or if we do multiple consecutive enqueues. In a shared domain, there's a race which cpu issues the updated freq request first. Which might not be the best for the domain during this window. 2. Maybe it needs asymmetric values for up and down. I could be naive, but I see util_est as something we should strive to remove to be honest. I think there are too many moving cogs. Thanks! -- Qais Yousef ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-03-06 19:11 ` Qais Yousef @ 2023-03-07 13:22 ` Vincent Guittot 2023-03-11 16:55 ` Qais Yousef 0 siblings, 1 reply; 61+ messages in thread From: Vincent Guittot @ 2023-03-07 13:22 UTC (permalink / raw) To: Qais Yousef Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On Mon, 6 Mar 2023 at 20:11, Qais Yousef <qyousef@layalina.io> wrote: > > On 03/02/23 09:00, Vincent Guittot wrote: > > On Wed, 1 Mar 2023 at 18:25, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > On 03/01/23 11:39, Vincent Guittot wrote: > > > > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > > > On 02/09/23 17:16, Vincent Guittot wrote: > > > > > > > > > > > I don't see how util_est_faster can help this 1ms task here ? It's > > > > > > most probably never be preempted during this 1ms. For such an Android > > > > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and > > > > > > a better solution ? > > > > > > > > > > uclamp_min is being used in UI and helping there. But your mileage might vary > > > > > with adoption still. > > > > > > > > > > The major motivation behind this is to help things like gaming as the original > > > > > thread started. It can help UI and other use cases too. Android framework has > > > > > a lot of context on the type of workload that can help it make a decision when > > > > > this helps. And OEMs can have the chance to tune and apply based on the > > > > > characteristics of their device. > > > > > > > > > > > IIUC how util_est_faster works, it removes the waiting time when > > > > > > sharing cpu time with other tasks. So as long as there is no (runnable > > > > > > but not running time), the result is the same as current util_est. 
> > > > > > util_est_faster makes a difference only when the task alternates > > > > > > between runnable and running slices. > > > > > > Have you considered using runnable_avg metrics in the increase of cpu > > > > > > freq ? This takes into the runnable slice and not only the running > > > > > > time and increase faster than util_avg when tasks compete for the same > > > > > > CPU > > > > > > > > > > Just to understand why we're heading into this direction now. > > > > > > > > > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up > > > > > migration) which both are tied to utilization signal. > > > > > > > > > > Wouldn't make the util response time faster help not just for rampup, but > > > > > rampdown too? > > > > > > > > > > If we improve util response time, couldn't this mean we can remove util_est or > > > > > am I missing something? > > > > not sure because you still have a ramping step whereas util_est > > > > directly gives you the final tager > > > > > > I didn't get you. tager? > > > > target > > It seems you're referring to the holding function of util_est? ie: keep the > util high to avoid 'spurious' decays? I mean whatever the half life, you will have to wait for the utilization to increase. > > Isn't this a duplication of the schedutil's filter which is also a holding > function to prevent rapid frequency changes? util_est is used by the scheduler to estimate the final utilization of the cfs > > FWIW, that schedutil filter does get tweaked a lot in android world. Many add > an additional down_filter to prevent this premature drop in freq (AFAICT). > Which tells me util_est is not delivering completely on that front in practice. > > > > > > > > > > > > > > > > > > > > Currently we have util response which is tweaked by util_est and then that is > > > > > tweaked further by schedutil with that 25% margin when maping util to > > > > > frequency.
> > > > > > > > the 25% is not related to the ramping time but to the fact that you > > > > always need some margin to cover unexpected events and estimation > > > > error > > > > > > At the moment we have > > > > > > util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection > > > > > > I think we have too many transformations before deciding the current > > > frequencies. Which makes it hard to tweak the system response. > > > > What is proposed here with runnable_avg is more to take a new input > > when selecting a frequency: the level of contention on the cpu. But > > What if there's no contention on the CPU and it's just a single task running > there that suddenly becomes always running for a number of frames? > > > this is not used to modify the utilization seen by the scheduler > > > > > > > > > > > > > > > > > > > I think if we can allow improving general util response time by tweaking PELT > > > > > HALFLIFE we can potentially remove util_est and potentially that magic 25% > > > > > margin too. > > > > > > > > > > Why the approach of further tweaking util_est is better? > > > > > > > > note that in this case it doesn't really tweak util_est but Dietmar > > > > has taken into account runnable_avg to increase the freq in case of > > > > contention > > > > > > > > Also IIUC Dietmar's results, the problem seems more linked to the > > > > selection of a higher freq than increasing the utilization; > > > > runnable_avg tests give similar perf results than shorter half life > > > > and better power consumption. > > > > > > Does it ramp down faster too? > > > > I don't think so. > > > > To be honest, I'm not convinced that modifying the half time is the > > right way to solve this. If it was only a matter of half life not > > being suitable for a system, the halk life would be set once at boot > > and people would not ask to modify it at run time. > > I'd like to understand more the reason behind these concerns. 
What is the > problem with modifying the halflife? I can somehow understand that some systems would like a different half life than the current one because of the number of cpus, the pace of the system... But this should be fixed at boot. The fact that people need to dynamically change the half life means to me that even after changing it they still don't get the correct utilization. And I think that the problem is not really related (or at least not only) to the correctness of utilization tracking but to a lack of taking other inputs into account when selecting a frequency. And the contention (runnable_avg) is a good input to take into account when selecting a frequency because it reflects that some tasks are waiting to run on the cpu > > The way I see it it is an important metric of how responsive the system to how > loaded it is. Which drives a lot of important decisions. > > 32ms means the system needs approximately 200ms to detect an always running > task (from idle). > > 16ms halves it to 100ms. And 8ms halves it further to 50ms. > > Or you can phrase it the opposite way, it takes 200ms to detect the system is > now idle from always busy state. etc. > > Why is it bad for a sys admin to have the ability to adjust this response time > as they see fit? because they will use it to bias the response of the system and abuse it at runtime instead of identifying the root cause. > > What goes wrong? > > AFAICS the two natural places to control the response time of the system is > pelt halflife for overall system responsiveness, and the mapping function in > schedutil for more fine grained frequency response. > > There are issues with current filtering mechanism in schedutil too: > > 1. It drops any requests during the filtering window. At CFS enqueue we > could end up with multiple calls to cpufreq_update_util(); or if we > do multiple consecutive enqueues. In a shared domain, there's a race > which cpu issues the updated freq request first.
Which might not be > the best for the domain during this window. > 2. Maybe it needs asymmetric values for up and down. > > I could be naive, but I see util_est as something we should strive to remove to > be honest. I think there are too many moving cogs. > > > Thanks! > > -- > Qais Yousef ^ permalink raw reply [flat|nested] 61+ messages in thread
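The ramp-up arithmetic quoted above (a 32ms halflife taking roughly 200ms to detect an always-running task, 16ms taking ~100ms, 8ms taking ~50ms) follows directly from PELT's geometric decay. A minimal sketch, not kernel code, modeling util as a continuous exponential:

```python
import math

def pelt_ramp_ms(halflife_ms, frac=0.99):
    """Time for an always-running task's util_avg to climb from 0 to
    `frac` of its maximum, modeling PELT as a continuous exponential:
    util(t) = max * (1 - 0.5 ** (t / halflife))."""
    return halflife_ms * math.log2(1.0 / (1.0 - frac))

for hl_ms in (32, 16, 8):
    print(f"halflife {hl_ms:2d}ms -> ~{pelt_ramp_ms(hl_ms):.0f}ms to reach 99% util")
```

Reaching 99% of the final value takes about 6.6 halflives, i.e. roughly 213ms, 106ms and 53ms, matching the approximate figures in the message above.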
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-03-07 13:22 ` Vincent Guittot @ 2023-03-11 16:55 ` Qais Yousef 0 siblings, 0 replies; 61+ messages in thread From: Qais Yousef @ 2023-03-11 16:55 UTC (permalink / raw) To: Vincent Guittot Cc: Dietmar Eggemann, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 03/07/23 14:22, Vincent Guittot wrote: > On Mon, 6 Mar 2023 at 20:11, Qais Yousef <qyousef@layalina.io> wrote: > > > > On 03/02/23 09:00, Vincent Guittot wrote: > > > On Wed, 1 Mar 2023 at 18:25, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > On 03/01/23 11:39, Vincent Guittot wrote: > > > > > On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > > > > > On 02/09/23 17:16, Vincent Guittot wrote: > > > > > > > > > > > > > I don't see how util_est_faster can help this 1ms task here ? It's > > > > > > > most probably never be preempted during this 1ms. For such an Android > > > > > > > Graphics Pipeline short task, hasn't uclamp_min been designed for and > > > > > > > a better solution ? > > > > > > > > > > > > uclamp_min is being used in UI and helping there. But your mileage might vary > > > > > > with adoption still. > > > > > > > > > > > > The major motivation behind this is to help things like gaming as the original > > > > > > thread started. It can help UI and other use cases too. Android framework has > > > > > > a lot of context on the type of workload that can help it make a decision when > > > > > > this helps. And OEMs can have the chance to tune and apply based on the > > > > > > characteristics of their device. > > > > > > > > > > > > > IIUC how util_est_faster works, it removes the waiting time when > > > > > > > sharing cpu time with other tasks. 
So as long as there is no (runnable > > > > > > > but not running time), the result is the same as current util_est. > > > > > > > util_est_faster makes a difference only when the task alternates > > > > > > > between runnable and running slices. > > > > > > > Have you considered using runnable_avg metrics in the increase of cpu > > > > > > > freq ? This takes into the runnable slice and not only the running > > > > > > > time and increase faster than util_avg when tasks compete for the same > > > > > > > CPU > > > > > > > > > > > > Just to understand why we're heading into this direction now. > > > > > > > > > > > > AFAIU the desired outcome to have faster rampup time (and on HMP faster up > > > > > > migration) which both are tied to utilization signal. > > > > > > > > > > > > Wouldn't make the util response time faster help not just for rampup, but > > > > > > rampdown too? > > > > > > > > > > > > If we improve util response time, couldn't this mean we can remove util_est or > > > > > > am I missing something? > > > > > > > > > > not sure because you still have a ramping step whereas util_est > > > > > directly gives you the final tager > > > > I didn't get you. tager? > > > target > > It seems you're referring to the holding function of util_est? ie: keep the > > util high to avoid 'spurious' decays? > > I mean whatever the half life, you will have to wait the utilization > to increase. Yes - which is the ramp-up delay that is unacceptable in some cases and seems to have been raised several times over the years > > > > > Isn't this a duplication of the schedutil's filter which is also a holding > > function to prevent rapid frequency changes? > > util_est is used by scheduler to estimate the final utilization of the cfs IIRC the commit message that introduced it, it talks about ramp-up delays - and issues with premature decaying for periodic tasks. So it is a mechanism to speed up util_avg response time.
The same issue we're trying to address again now. > > > > > FWIW, that schedutil filter does get tweaked a lot in android world. Many add > > an additional down_filter to prevent this premature drop in freq (AFAICT). > > Which tells me util_est is not delivering completely on that front in practice. > > > > > > > > > > > > > > > > > > > > > > > > > > Currently we have util response which is tweaked by util_est and then that is > > > > > > tweaked further by schedutil with that 25% margin when maping util to > > > > > > frequency. > > > > > > > > > > the 25% is not related to the ramping time but to the fact that you > > > > > always need some margin to cover unexpected events and estimation > > > > > error > > > > > > > > At the moment we have > > > > > > > > util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection > > > > > > > > I think we have too many transformations before deciding the current > > > > frequencies. Which makes it hard to tweak the system response. > > > > > > What is proposed here with runnable_avg is more to take a new input > > > when selecting a frequency: the level of contention on the cpu. But > > > > What if there's no contention on the CPU and it's just a single task running > > there that suddenly becomes always running for a number of frames? > > > > > this is not used to modify the utilization seen by the scheduler > > > > > > > > > > > > > > > > > > > > > > > > I think if we can allow improving general util response time by tweaking PELT > > > > > > HALFLIFE we can potentially remove util_est and potentially that magic 25% > > > > > > margin too. > > > > > > > > > > > > Why the approach of further tweaking util_est is better? 
> > > > > > > > > > note that in this case it doesn't really tweak util_est but Dietmar > > > > > has taken into account runnable_avg to increase the freq in case of > > > > > contention > > > > > > > > > > Also IIUC Dietmar's results, the problem seems more linked to the > > > > > selection of a higher freq than increasing the utilization; > > > > > runnable_avg tests give similar perf results than shorter half life > > > > > and better power consumption. > > > > > > > > Does it ramp down faster too? > > > > > > I don't think so. > > > > > > To be honest, I'm not convinced that modifying the half time is the > > > right way to solve this. If it was only a matter of half life not > > > being suitable for a system, the halk life would be set once at boot > > > and people would not ask to modify it at run time. > > > > I'd like to understand more the reason behind these concerns. What is the > > problem with modifying the halflife? > > I can somehow understand that some systems would like a different half > life than the current one because of the number of cpus, the pace of > the system... But this should be fixed at boot. The fact that people The boot time might be the only thing required. I think some systems only need this already. The difficulty in practice is that on some systems this might result in worse power over a day of use. So it'll all depend, hence the desire to have it as a runtime knob. Why invent more crystal balls that may or may not be the best thing depending on who you ask? > need to dynamically change the half life means to me that even after > changing it they still don't get the correct utilization. And I What is the correct utilization? It is just a signal in an attempt to crystal ball the future. It can't be correct in general IMHO. It's a best effort that we know already fails occasionally. As I said above - there's a trade-off in perf/power and that will highly depend on the system.
The proposed high contention detection doesn't address this trade-off; rather it biases the system further towards perf-first. Which is not always the right trade-off. It could be a useful addition - but it needs to be a tunable too. > think that the problem is not really related (or at least not only) to > the correctness of utilization tracking but to a lack of taking into It's not a correctness issue. It's a response time issue. It's a simple task of improving the reactiveness of the system. Which has a power cost that some users don't want to incur when not necessary. > account other input when selecting a frequency. And the contention > (runnable_avg) is a good input to take into account when selecting a > frequency because it reflects that some tasks are waiting to run on > the cpu You did not answer my question above. What if there's no contention and a single task on a cpu suddenly moves from mostly idle to always running for a number of frames? There's no contention in there. How will this be improved? > > > > > The way I see it it is an important metric of how responsive the system to how > > loaded it is. Which drives a lot of important decisions. > > > > 32ms means the system needs approximately 200ms to detect an always running > > task (from idle). > > > > 16ms halves it to 100ms. And 8ms halves it further to 50ms. > > > > Or you can phrase it the opposite way, it takes 200ms to detect the system is > > now idle from always busy state. etc. > > > > Why is it bad for a sys admin to have the ability to adjust this response time > > as they see fit? > > because they will use it to bias the response of the system and abuse it > at runtime instead of identifying the root cause. No one wants to abuse anything. But the one-size-fits-all approach is not always right either. And sys admins and end users have the right to tune their systems the way they see fit. There are too many variations out there to hard-code the system response.
I view this like the right to repair - it's their system, why do they have to hack the kernel to tune it? The root cause is that the system reactiveness is controlled by this value. And there's a trade-off between perf/power that is highly dependent on the system's characteristics. In some cases a boot-time setting is all that one needs. In others, it might be desirable to improve only specific use cases like gaming, since a boot-time-only speed-up can hurt overall battery life in normal use cases. I think the story is simple :) In my view util_est is borderline a hack. We just need to enable controlling PELT ramp-up/down response times + improve schedutil. I highlight a few shortcomings that are already known in practice below. And that phoronix article about schedutil not being better than ondemand demonstrates that this is an issue outside of mobile too. schedutil - as the name says - depends on the util signal. Which also depends on the PELT halflife. I really think this is the most natural and predictable way to tune the system. I can't see the drawbacks. I think we need to distinguish between picking sensible default behavior and enforcing policies or restricting the user's choice. AFAICS the discussion is going towards the latter. On the topic of defaults - I do think 16ms is a more sensible default for modern day hardware and use cases. /me runs and hides :) Cheers -- Qais Yousef > > > > > What goes wrong? > > > > AFAICS the two natural places to control the response time of the system is > > pelt halflife for overall system responsiveness, and the mapping function in > > schedutil for more fine grained frequency response. > > > > There are issues with current filtering mechanism in schedutil too: > > > > 1. It drops any requests during the filtering window. At CFS enqueue we > > could end up with multiple calls to cpufreq_update_util(); or if we > > do multiple consecutive enqueues. In a shared domain, there's a race > > which cpu issues the updated freq request first.
Which might not be > > the best for the domain during this window. > > 2. Maybe it needs asymmetric values for up and down. > > > > I could be naive, but I see util_est as something we should strive to remove to > > be honest. I think there are too many moving cogs. > > > > > > Thanks! > > > > -- > > Qais Yousef ^ permalink raw reply [flat|nested] 61+ messages in thread
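The two schedutil filtering shortcomings listed above (requests dropped inside the window, and the possible need for asymmetric up/down values) can be made concrete with a toy model. This is a sketch only, not the kernel's code; the class name is made up and the 0.5ms/20ms windows are illustrative, borrowed from the vendor numbers mentioned elsewhere in the thread:

```python
class RateLimitFilter:
    """Toy model of a schedutil-style rate limit with asymmetric up/down
    windows. Any request arriving inside the active window is simply
    dropped, which is the first shortcoming listed above."""

    def __init__(self, up_ms=0.5, down_ms=20.0):
        self.up_ms = up_ms          # minimum delay before raising frequency
        self.down_ms = down_ms      # minimum delay before lowering frequency
        self.cur_freq = 0
        self.last_change_ms = float("-inf")

    def request(self, now_ms, freq):
        window = self.up_ms if freq > self.cur_freq else self.down_ms
        if now_ms - self.last_change_ms < window:
            return self.cur_freq    # dropped: hold the current frequency
        self.cur_freq = freq
        self.last_change_ms = now_ms
        return self.cur_freq
```

With these numbers an up request is honored almost immediately, while a down request within 20ms of the last change is ignored, acting as a holding function that keeps residency at the higher frequency.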
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-03-01 17:24 ` Qais Yousef 2023-03-02 8:00 ` Vincent Guittot @ 2023-03-23 16:29 ` Dietmar Eggemann 2023-04-03 14:45 ` Qais Yousef 1 sibling, 1 reply; 61+ messages in thread From: Dietmar Eggemann @ 2023-03-23 16:29 UTC (permalink / raw) To: Qais Yousef, Vincent Guittot Cc: Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen On 01/03/2023 18:24, Qais Yousef wrote: > On 03/01/23 11:39, Vincent Guittot wrote: >> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote: >>> >>> On 02/09/23 17:16, Vincent Guittot wrote: >>> >>>> I don't see how util_est_faster can help this 1ms task here ? It's >>>> most probably never be preempted during this 1ms. For such an Android >>>> Graphics Pipeline short task, hasn't uclamp_min been designed for and >>>> a better solution ? >>> >>> uclamp_min is being used in UI and helping there. But your mileage might vary >>> with adoption still. >>> >>> The major motivation behind this is to help things like gaming as the original >>> thread started. It can help UI and other use cases too. Android framework has >>> a lot of context on the type of workload that can help it make a decision when >>> this helps. And OEMs can have the chance to tune and apply based on the >>> characteristics of their device. >>> >>>> IIUC how util_est_faster works, it removes the waiting time when >>>> sharing cpu time with other tasks. So as long as there is no (runnable >>>> but not running time), the result is the same as current util_est. >>>> util_est_faster makes a difference only when the task alternates >>>> between runnable and running slices. >>>> Have you considered using runnable_avg metrics in the increase of cpu >>>> freq ? 
This takes into the runnable slice and not only the running >>>> time and increase faster than util_avg when tasks compete for the same >>>> CPU >>> >>> Just to understand why we're heading into this direction now. >>> >>> AFAIU the desired outcome to have faster rampup time (and on HMP faster up >>> migration) which both are tied to utilization signal. >>> >>> Wouldn't make the util response time faster help not just for rampup, but >>> rampdown too? >>> >>> If we improve util response time, couldn't this mean we can remove util_est or >>> am I missing something? >> >> not sure because you still have a ramping step whereas util_est >> directly gives you the final tager util_est gives us instantaneous signal at enqueue for periodic tasks, something PELT will never be able to do. > I didn't get you. tager? > >> >>> >>> Currently we have util response which is tweaked by util_est and then that is >>> tweaked further by schedutil with that 25% margin when maping util to >>> frequency. >> >> the 25% is not related to the ramping time but to the fact that you >> always need some margin to cover unexpected events and estimation >> error > > At the moment we have > > util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection > > I think we have too many transformations before deciding the current > frequencies. Which makes it hard to tweak the system response. To me it looks more like this: max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection ^^^^^^^^^^^^ new proposal to factor in root cfs_rq contention Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only there to create the safety margin used by schedutil & EAS. * The schedutil up/down filter thing was already NAKed in Nov 2016. IMHO, this is where util_est was initially discussed as an alternative. We have it in mainline as well, but one value (default 10ms) for both directions.
There was discussion to map it to the driver's translation_latency instead. In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`. So on `up` your rate is as small as possible (only respecting the driver's translation_latency) but on `down` you use much more than that. Why exactly do you have this higher value on `down`? My hunch is scenarios in which the CPU (all CPUs in the freq. domain) goes idle, so util_est is 0 and the blocked utilization is decaying (too fast, 4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency again when the CPU wakes up in those 20ms? >>> I think if we can allow improving general util response time by tweaking PELT >>> HALFLIFE we can potentially remove util_est and potentially that magic 25% >>> margin too. >>> >>> Why the approach of further tweaking util_est is better? >> >> note that in this case it doesn't really tweak util_est but Dietmar >> has taken into account runnable_avg to increase the freq in case of >> contention >> >> Also IIUC Dietmar's results, the problem seems more linked to the >> selection of a higher freq than increasing the utilization; >> runnable_avg tests give similar perf results than shorter half life >> and better power consumption. > > Does it ramp down faster too? Not sure why you are interested in this? Can't be related to the `driving DVFS` functionality discussed above. ^ permalink raw reply [flat|nested] 61+ messages in thread
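The pipeline Dietmar describes above can be sketched in a few lines. The helper names are made up for illustration; only the `freq + (freq >> 2)` headroom mirrors the kernel's map_util_freq(), the rest is a simplified model:

```python
def effective_util(util_avg, util_est, runnable_avg):
    # Aggregation as described above: root cfs_rq contention (runnable_avg)
    # can only raise the demand seen by cpufreq, never lower it.
    return max(max(util_avg, util_est), runnable_avg)

def map_util_freq(util, max_freq, max_cap=1024):
    # Mirrors the kernel's map_util_freq(): ~25% headroom via freq + freq/4.
    return (max_freq + (max_freq >> 2)) * util // max_cap

# A contended CPU (runnable_avg above util_avg/util_est) requests a
# higher frequency than the utilization signals alone would justify:
uncontended = map_util_freq(effective_util(300, 350, 350), 2_000_000)
contended = map_util_freq(effective_util(300, 350, 600), 2_000_000)
print(uncontended, contended)
```

The point of the proposal is visible in the last two lines: with identical util_avg/util_est, a higher runnable_avg alone pushes the frequency request up.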
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-03-23 16:29 ` Dietmar Eggemann @ 2023-04-03 14:45 ` Qais Yousef 2023-04-06 15:58 ` Dietmar Eggemann 0 siblings, 1 reply; 61+ messages in thread From: Qais Yousef @ 2023-04-03 14:45 UTC (permalink / raw) To: Dietmar Eggemann Cc: Vincent Guittot, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen Hi Dietmar On 03/23/23 17:29, Dietmar Eggemann wrote: > On 01/03/2023 18:24, Qais Yousef wrote: > > On 03/01/23 11:39, Vincent Guittot wrote: > >> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote: > >>> > >>> On 02/09/23 17:16, Vincent Guittot wrote: > >>> > >>>> I don't see how util_est_faster can help this 1ms task here ? It's > >>>> most probably never be preempted during this 1ms. For such an Android > >>>> Graphics Pipeline short task, hasn't uclamp_min been designed for and > >>>> a better solution ? > >>> > >>> uclamp_min is being used in UI and helping there. But your mileage might vary > >>> with adoption still. > >>> > >>> The major motivation behind this is to help things like gaming as the original > >>> thread started. It can help UI and other use cases too. Android framework has > >>> a lot of context on the type of workload that can help it make a decision when > >>> this helps. And OEMs can have the chance to tune and apply based on the > >>> characteristics of their device. > >>> > >>>> IIUC how util_est_faster works, it removes the waiting time when > >>>> sharing cpu time with other tasks. So as long as there is no (runnable > >>>> but not running time), the result is the same as current util_est. > >>>> util_est_faster makes a difference only when the task alternates > >>>> between runnable and running slices. > >>>> Have you considered using runnable_avg metrics in the increase of cpu > >>>> freq ?
This takes into the runnable slice and not only the running > >>>> time and increase faster than util_avg when tasks compete for the same > >>>> CPU > >>> > >>> Just to understand why we're heading into this direction now. > >>> > >>> AFAIU the desired outcome to have faster rampup time (and on HMP faster up > >>> migration) which both are tied to utilization signal. > >>> > >>> Wouldn't make the util response time faster help not just for rampup, but > >>> rampdown too? > >>> > >>> If we improve util response time, couldn't this mean we can remove util_est or > >>> am I missing something? > >> > >> not sure because you still have a ramping step whereas util_est > >> directly gives you the final tager > > util_est gives us instantaneous signal at enqueue for periodic tasks, How do you define instantaneous and periodic here? How would you describe the behavior for non periodic tasks? > something PELT will never be able to do. Why? Isn't by selecting a lower pelt halflife we achieve something similar? > > > I didn't get you. tager? > > > >> > >>> > >>> Currently we have util response which is tweaked by util_est and then that is > >>> tweaked further by schedutil with that 25% margin when maping util to > >>> frequency. > >> > >> the 25% is not related to the ramping time but to the fact that you > >> always need some margin to cover unexpected events and estimation > >> error > > > > At the moment we have > > > > util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection > > > > I think we have too many transformations before deciding the current > > frequencies. Which makes it hard to tweak the system response. > > To me it looks more like this: > > max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection > ^^^^^^^^^^^^ > new proposal to factor in root cfs_rq contention These are still 5 stages even if written differently. What if background tasks that are causing the contention? 
How can you tell it to ignore that and NOT drive the frequency up unnecessarily for those non-important ones? If userspace is fully aware of uclamp - this whole discussion wouldn't be necessary. And I still have a bunch of fixes to push before uclamp_max is actually usable in production. > Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only > there to create the safety margin used by schedutil & EAS. Yes I know and that's not the point. The point is that it's a chain reaction. The 25% headroom is already very aggressive and causes issues at the top, inefficient end of the cores. And when util is high, you might end up in a situation where you skip frequencies. Making everything go up faster without balancing it with either enabling going down faster too or tuning this value can lead to power and thermal issues on powerful systems. I think all we need is controlling the PELT halflife and this one value to tune the system to the desired trade-off. > > * The schedutil up/down filter thing was already NAKed in Nov 2016. > IMHO, this is where util_est was initially discussed as an alternative. Well, I don't see anyone not using a down filter. So I'm not sure util_est has been a true alternative. > We have it in mainline as well, but one value (default 10ms) for both > directions. There was discussion to map it to the driver's > translation_latency instead. Which can be filled wrong sometimes :( > > In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`. > > So on `up` your rate is as small as possible (only respecting the > driver's translation_latency) but on `down` you use much more than that. > > Why exactly do you have this higher value on `down`? My hunch is > scenarios in which the CPU (all CPUs in the freq. domain) goes idle, > so util_est is 0 and the blocked utilization is decaying (too fast, > 4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency > again when the CPU wakes up in those 20ms?
The down filter prevents changing the frequency to a lower value. So it's a holding function to keep the residency at a higher frequency for at least 20ms. It is, sort of, similar to the max() functions you used above. The max function will allow you to follow the fastest ramping-up signal on the way up, and the slowest ramping-down one on the way down. I think this is a more deterministic way to do it. > > >>> I think if we can allow improving general util response time by tweaking PELT > >>> HALFLIFE we can potentially remove util_est and potentially that magic 25% > >>> margin too. > >>> > >>> Why the approach of further tweaking util_est is better? > >> > >> note that in this case it doesn't really tweak util_est but Dietmar > >> has taken into account runnable_avg to increase the freq in case of > >> contention > >> > >> Also IIUC Dietmar's results, the problem seems more linked to the > >> selection of a higher freq than increasing the utilization; > >> runnable_avg tests give similar perf results than shorter half life > >> and better power consumption. > > > > Does it ramp down faster too? > > Not sure why you are interested in this? Can't be related to the > `driving DVFS` functionality discussed above. If you change the reaction time to be more aggressive in going up, then it's only natural to have it symmetrical so your residency on the power hungry OPPs doesn't go through the roof and end up with thermal and power issues. I am concerned about us biasing towards perf-first too much and not enabling sys admins to select a proper trade-off for their system and use case. Which are not static. The workloads the system needs to accommodate are abundant and operating conditions could change. And the diversity of hardware available out there is huge - I am not sure how we can expect one response to accommodate all of them. What I'm trying to push for here is that we should look at the chain as one unit.
And we should consider that there's an important trade-off to be had here; having a sensible default doesn't mean the user shouldn't be allowed to select a different trade-off. I'm not sure the problem can be generalized and fixed automatically. But happy to be proven wrong of course :-) FWIW, I'm trying to tweak all these knobs and study their impact. Do you mind pasting the patch for load_avg consideration so I can take it into account too in my experiments? Thanks! -- Qais Yousef ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime 2023-04-03 14:45 ` Qais Yousef @ 2023-04-06 15:58 ` Dietmar Eggemann 2023-04-11 17:51 ` Qais Yousef 0 siblings, 1 reply; 61+ messages in thread From: Dietmar Eggemann @ 2023-04-06 15:58 UTC (permalink / raw) To: Qais Yousef Cc: Vincent Guittot, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret, Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen Hi Qais, On 03/04/2023 16:45, Qais Yousef wrote: > Hi Dietmar > > On 03/23/23 17:29, Dietmar Eggemann wrote: >> On 01/03/2023 18:24, Qais Yousef wrote: >>> On 03/01/23 11:39, Vincent Guittot wrote: >>>> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote: >>>>> >>>>> On 02/09/23 17:16, Vincent Guittot wrote: [...] >>>>> If we improve util response time, couldn't this mean we can remove util_est or >>>>> am I missing something? >>>> >>>> not sure because you still have a ramping step whereas util_est >>>> directly gives you the final tager >> >> util_est gives us instantaneous signal at enqueue for periodic tasks, > > How do you define instantaneous and periodic here? How would you describe the > behavior for non periodic tasks? Instantaneous is when the max value is available already @wakeup. That is the main use case for util_est: to provide this boost to periodic tasks. A non-periodic task doesn't benefit from this. The working assumption back then was that the important tasks involved here are the periodic (back then 60Hz, 16.67 ms period) tasks of the Android display pipeline. >> something PELT will never be able to do. > > Why? Isn't by selecting a lower pelt halflife we achieve something similar? You get closer but you still would need time to ramp-up. That's without util_est. [...] 
>>>> the 25% is not related to the ramping time but to the fact that you >>>> always need some margin to cover unexpected events and estimation >>>> error >>> >>> At the moment we have >>> >>> util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection >>> >>> I think we have too many transformations before deciding the current >>> frequencies. Which makes it hard to tweak the system response. >> >> To me it looks more like this: >> >> max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection >> ^^^^^^^^^^^^ >> new proposal to factor in root cfs_rq contention > > These are still 5 stages even if written differently. > > What if background tasks that are causing the contention? How can you tell it > to ignore that and NOT drive the frequency up unnecessary for those non > important ones? If userspace is fully aware of uclamp - this whole discussion > wouldn't be necessary. And I still have a bunch of fixes to push before > uclamp_max is actually usable in production. You're hinting to the other open discussion we have on uclamp in feec(): https://lkml.kernel.org/r/20230205224318.2035646-1-qyousef@layalina.io IMHO, this is a different discussion. No classification of tasks here. >> Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only >> there to create the safety margin used by schedutil & EAS. > > Yes I know and that's not the point. The point is that it's a chain reaction. > 25% percent headroom is already very aggressive and causes issues on the top > inefficient ends of the cores. And when util is high, you might end up in > a situation where you skip frequencies. Making everything go up faster without > balancing it with either enabling going down faster too or tune this value can > lead to power and thermal issues on powerful systems. I try to follow here but I fail. 
You're saying that the safety margin is too wide and in case util is within the safety margin, the logic is eclipsed by going max or choosing a CPU from a higher CPU capacity Perf-domain? Wouldn't `going down faster` contradict schedutil's 20ms down rate limit? > > I think all we need is controlling the PELT halflife and this one value to tune the > system to the desired trade-off. > >> >> * The schedutil up/down filter thing was already NAKed in Nov 2016. >> IMHO, this is where util_est was initially discussed as an alternative. > > Well, I don't see anyone not using a down filter. So I'm not sure util_est has > been a true alternative. Definitely not in the down direction. util_est is 0 w/o any runnable tasks. And blocked utilization is decaying much faster than your 20ms down rate limit. >> We have it in mainline as well, but one value (default 10ms) for both >> directions. There was discussion to map it to the driver's >> translation_latency instead. > > Which can be filled wrong sometimes :( > >> >> In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`. >> >> So on `up` your rate is as small as possible (only respecting the >> driver's translation_latency) but on `down` you use much more than that. >> >> Why exactly do you have this higher value on `down`? My hunch is >> scenarios in which the CPU (all CPUs in the freq. domain) goes idle, >> so util_est is 0 and the blocked utilization is decaying (too fast, >> 4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency >> again when the CPU wakes up in those 20ms? > > The down filter prevents changing the frequency to a lower value. So it's > a holding function to keep the residency at a higher frequency for at least > 20ms. It is, sort of, similar to the max() functions you used above. The max > function will allow you to follow the fastest ramping-up signal on the way up, > and the slowest ramping-down one on the way down. > > I think this is a more deterministic way to do it.
But a faster PELT wouldn't help here, quite the opposite. [...] >>>> Also IIUC Dietmar's results, the problem seems more linked to the >>>> selection of a higher freq than increasing the utilization; >>>> runnable_avg tests give similar perf results than shorter half life >>>> and better power consumption. >>> >>> Does it ramp down faster too? >> >> Not sure why you are interested in this? Can't be related to the >> `driving DVFS` functionality discussed above. > > If you change the reaction time to be more aggressive in going up, then it's > only natural to have it symmetrical so your residency on the power hungry OPPs > doesn't go through the roof and end up with thermal and power issues. But you apply this 20ms down rate limit on the big cores too? > I am concerned about us biasing towards perf-first too much and not enabling > sys admins to select a proper trade-off for their system and use case. Which > are not static. The workloads the system needs to accommodate are abundant > and operating conditions could change. And the diversity of hardware available > out there is huge - I am not sure how we can expect one response to > accommodate all of them. > > What I'm trying to push for here is that we should look at the chain as one > unit. And we should consider that there's an important trade-off to be had here; > having a sensible default doesn't mean the user shouldn't be allowed to select > a different trade-off. I'm not sure the problem can be generalized and fixed > automatically. But happy to be proven wrong of course :-) > > FWIW, I'm trying to tweak all these knobs and study their impact. Do you mind > pasting the patch for load_avg consideration so I can take it into account too > in my experiments? Just posted it: https://lkml.kernel.org/r/20230406155030.1989554-1-dietmar.eggemann@arm.com ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2023-04-06 15:58           ` Dietmar Eggemann
@ 2023-04-11 17:51             ` Qais Yousef
  0 siblings, 0 replies; 61+ messages in thread
From: Qais Yousef @ 2023-04-11 17:51 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Peter Zijlstra, Kajetan Puchalski, Jian-Min Liu,
	Ingo Molnar, Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef, linux-kernel,
	Jonathan JMChen

Hi Dietmar!

On 04/06/23 17:58, Dietmar Eggemann wrote:
> Hi Qais,
>
> On 03/04/2023 16:45, Qais Yousef wrote:
> > Hi Dietmar
> >
> > On 03/23/23 17:29, Dietmar Eggemann wrote:
> >> On 01/03/2023 18:24, Qais Yousef wrote:
> >>> On 03/01/23 11:39, Vincent Guittot wrote:
> >>>> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@layalina.io> wrote:
> >>>>>
> >>>>> On 02/09/23 17:16, Vincent Guittot wrote:
>
> [...]
>
> >>>>> If we improve util response time, couldn't this mean we can remove util_est or
> >>>>> am I missing something?
> >>>>
> >>>> not sure because you still have a ramping step whereas util_est
> >>>> directly gives you the final target
> >>
> >> util_est gives us instantaneous signal at enqueue for periodic tasks,
> >
> > How do you define instantaneous and periodic here? How would you describe the
> > behavior for non-periodic tasks?
>
> Instantaneous is when the max value is available already @wakeup. That
> is the main use case for util_est, provide this boost to periodic tasks.
> A non-periodic task doesn't benefit from this. Work assumption back then
> was that the important tasks involved here are the periodic (back then
> 60Hz, 16.67 ms period) tasks of the Android display pipeline.

Not all tasks in the system are periodic. Note that the main use case that
was brought up here is gaming - which is not the same as the Android display
pipeline.

>
> >> something PELT will never be able to do.
> >
> > Why? Isn't it that by selecting a lower pelt halflife we achieve something
> > similar?
>
> You get closer but you still would need time to ramp-up. That's without
> util_est.

Yes, we'll always need time to ramp up. Even for util_est, no?

>
> [...]
>
> >>>> the 25% is not related to the ramping time but to the fact that you
> >>>> always need some margin to cover unexpected events and estimation
> >>>> error
> >>>
> >>> At the moment we have
> >>>
> >>> util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> >>>
> >>> I think we have too many transformations before deciding the current
> >>> frequencies. Which makes it hard to tweak the system response.
> >>
> >> To me it looks more like this:
> >>
> >> max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection
> >>                              ^^^^^^^^^^^^
> >>                              new proposal to factor in root cfs_rq contention
> >
> > These are still 5 stages even if written differently.
> >
> > What if it is background tasks that are causing the contention? How can you
> > tell it to ignore that and NOT drive the frequency up unnecessarily for those
> > non-important ones? If userspace is fully aware of uclamp - this whole
> > discussion wouldn't be necessary. And I still have a bunch of fixes to push
> > before uclamp_max is actually usable in production.
>
> You're hinting at the other open discussion we have on uclamp in feec():

No, no I am not.

>
> https://lkml.kernel.org/r/20230205224318.2035646-1-qyousef@layalina.io
>
> IMHO, this is a different discussion. No classification of tasks here.

That patch has nothing to do with what I'm trying to say here. You say
looking at load_avg helps with contention. My point was: what if the
contention is caused by background tasks? They'll cause the frequency to go
higher, which is not the desired effect. So it'll not distinguish between
cases that matter and cases that don't; and with no ability to control this
behavior.
As you know, cpuset is used to keep background tasks on little cores, whose
top frequencies on the latest ones are very expensive. This could lead to
higher residency on those expensive frequencies with your change. We need to
be selective - which is the whole point behind wanting a runtime control.
Not all workloads are equal. And not all systems handle the same workload
similarly. There are trade-offs.

>
> >> Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only
> >> there to create the safety margin used by schedutil & EAS.
> >
> > Yes I know and that's not the point. The point is that it's a chain reaction.
> > 25% headroom is already very aggressive and causes issues on the top
> > inefficient ends of the cores. And when util is high, you might end up in
> > a situation where you skip frequencies. Making everything go up faster without
> > balancing it with either enabling going down faster too or tuning this value
> > can lead to power and thermal issues on powerful systems.
>
> I try to follow here but I fail. You're saying that the safety margin is
> too wide and in case util is within the safety margin, the logic is
> eclipsed by going max or choosing a CPU from a higher CPU capacity
> Perf-domain?
>
> Wouldn't `going down faster` contradict schedutil's 20ms down rate
> limit?

No. 200ms is a far cry from 20ms.

>
> >
> > I think all we need is controlling pelt halflife and this one to tune the
> > system to the desired trade-off.
> >
> >>
> >> * The schedutil up/down filter thing has been already nacked in Nov 2016.
> >> IMHO, this is where util_est was initially discussed as an alternative.
> >
> > Well, I don't see anyone not using a down filter. So I'm not sure util_est has
> > been a true alternative.
>
> Definitely not in the down direction. util_est is 0 w/o any runnable tasks.
> And blocked utilization is decaying much faster than your 20ms down rate
> limit.

Okay, I'll keep this in mind when looking at this in the future.
Maybe there's something fishy in there that we could improve.

>
> >> We have it in mainline as well, but one value (default 10ms) for both
> >> directions. There was discussion to map it to the driver's
> >> translation_latency instead.
> >
> > Which can be filled wrong sometimes :(
> >
> >>
> >> In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`.
> >>
> >> So on `up` your rate is as small as possible (only respecting the
> >> driver's translation_latency) but on `down` you use much more than that.
> >>
> >> Why exactly do you have this higher value on `down`? My hunch is
> >> scenarios in which the CPU (all CPUs in the freq. domain) goes idle,
> >> so util_est is 0 and the blocked utilization is decaying (too fast,
> >> 4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency
> >> again when the CPU wakes up in those 20ms?
> >
> > The down filter prevents changing the frequency to a lower value. So it's
> > a holding function to keep the residency at a higher frequency for at least
> > 20ms. It is, sort of, similar to the max() functions you used above. The max
> > function will allow you to follow the fastest ramping-up signal on the way up,
> > and the slowest ramping-down one on the way down.
> >
> > I think this is a more deterministic way to do it.
>
> But a faster PELT wouldn't help here, quite the opposite.

I didn't mention PELT here. I was comparing util_est max() to the filter in
schedutil.

> [...]
>
> >>>> Also IIUC Dietmar's results, the problem seems more linked to the
> >>>> selection of a higher freq than increasing the utilization;
> >>>> runnable_avg tests give similar perf results to a shorter halflife
> >>>> and better power consumption.
> >>>
> >>> Does it ramp down faster too?
> >>
> >> Not sure why you are interested in this? Can't be related to the
> >> `driving DVFS` functionality discussed above.
> >
> > If you change the reaction time to be more aggressive in going up, then it's
> > only natural to have it symmetrical so your residency on the power-hungry OPPs
> > doesn't go over the roof and end up with thermal and power issues.
>
> But you apply this 20ms down rate limit on the big cores too?
>
> > I am concerned about us biasing towards perf first too much and not enabling
> > sys admins to select a proper trade-off for their system and use case. Which
> > are not static. The workloads the system needs to accommodate are abundant
> > and operating conditions could change. And the diversity of hardware available
> > out there is huge - I am not sure how we can expect to have one response to
> > accommodate all of them.
> >
> > What I'm trying to push for here is that we should look at the chain as one
> > unit. And we should consider that there's an important trade-off to be had
> > here; having a sensible default doesn't mean the user shouldn't be allowed to
> > select a different trade-off. I'm not sure the problem can be generalized and
> > fixed automatically. But happy to be proven wrong of course :-)
> >
> > FWIW, I'm trying to tweak all these knobs and study their impact. Do you mind
> > pasting the patch for load_avg consideration so I can take it into account too
> > in my experiments?
>
> Just posted it:
>
> https://lkml.kernel.org/r/20230406155030.1989554-1-dietmar.eggemann@arm.com

Thanks a lot! I'll revisit the whole story taking into account the
relationship with all these other controls. I will need some time though.
But I will get back with some data, hopefully to help us pave the right way.

I think we shredded this thread to pieces enough :)


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 61+ messages in thread
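The `holding function` semantics described in this exchange can be made concrete with a small sketch. This is a hypothetical model, not the actual schedutil code - all names and the structure are illustrative: a request for a lower frequency is ignored until the down-rate-limit window since the last change has expired, while a request for a higher frequency is honoured immediately.

```c
#include <assert.h>

/* Hypothetical sketch of a down-rate-limit "holding" filter. */
struct freq_filter {
	unsigned long long last_change_ns;
	unsigned long long down_rate_limit_ns;	/* e.g. 20ms on big cores */
	unsigned int cur_freq;
};

static unsigned int filter_next_freq(struct freq_filter *f,
				     unsigned long long now_ns,
				     unsigned int req_freq)
{
	/* hold: ignore a ramp-down request inside the window */
	if (req_freq < f->cur_freq &&
	    now_ns - f->last_change_ns < f->down_rate_limit_ns)
		return f->cur_freq;

	if (req_freq != f->cur_freq) {
		f->cur_freq = req_freq;
		f->last_change_ns = now_ns;
	}
	return f->cur_freq;
}
```

The effect is the residency guarantee discussed above: once raised, the frequency stays up for at least the window length, similar in outcome to taking the max() of a fast-ramping and a slow-decaying utilization signal.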
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-07 13:41 ` Peter Zijlstra
  2022-11-08 19:48   ` Qais Yousef
@ 2022-11-09 15:18   ` Lukasz Luba
  2022-11-10 11:16   ` Dietmar Eggemann
  2022-11-10 12:45   ` Kajetan Puchalski
  3 siblings, 0 replies; 61+ messages in thread
From: Lukasz Luba @ 2022-11-09 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen, Kajetan Puchalski

Hi Peter,

On 11/7/22 13:41, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
>> Based on all the tests we've seen, jankbench or otherwise, the
>> improvement can mainly be attributed to the faster ramp up of frequency
>> caused by the shorter PELT window while using schedutil.
>
> Would something terrible like the below help some?
>
> If not, I suppose it could be modified to take the current state as
> history. But basically it runs a faster pelt sum alongside the regular
> signal just for ramping up the frequency.

[snip]

> +
> +		rcu_read_lock();
> +		curr = rcu_dereference(rq->curr);
> +		if (likely(curr->sched_class == &fair_sched_class)) {
> +			u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
> +			util = max_t(unsigned long, util,
> +				     faster_est_approx(runtime * 2));

That's a really nice hack :)

I wonder why we end up in such a situation on Android. Maybe there is a
different solution? Maybe a shorter tick (then also align the PELT
halflife)?

The problem is mostly in those high-FPS phones. You know, we now have
phones with 144Hz displays and even games > 100FPS (which wasn't the
case a few years ago when we invested a lot of effort into this
PELT+EAS). We also have a lot faster CPUs (~2x in 3-4 years).
IMO those games (and the OS mechanisms assisting them) would probably have
different needs (if they do these 'soft-real-time simulations' with such
high granularity, ~120/s -> every ~8ms).

IMO one old setting might not fit well into this: the 4ms tick (which is
the Android use case), which then implies the scheduler min granularity,
which we also align with the y^32 PELT. Is this a correct chain of
thinking?

Would it make sense to ask Android phone vendors to experiment with a 1ms
tick (w/ aligned PELT HL)? With that change, there might be some power
spike issues, though.

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 61+ messages in thread
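The halflife arithmetic behind this whole thread can be illustrated with a toy model (not the kernel's fixed-point PELT implementation): for a continuously running task, the signal closes half of the remaining gap to its maximum per halflife window. This also shows why the `runtime * 2` scaling in the hack quoted above behaves like halving the halflife.

```c
#include <assert.h>

/*
 * Toy model: utilization (out of 1024) of a task that has been running
 * continuously for n halflife windows. Each window halves the remaining
 * gap: 0 -> 512 -> 768 -> 896 -> 960 -> ...
 */
static unsigned long pelt_ramp(unsigned int n_windows)
{
	double remaining = 1.0;

	while (n_windows--)
		remaining *= 0.5;

	return (unsigned long)(1024.0 * (1.0 - remaining));
}
```

With a 32ms halflife, 64ms of runtime spans 2 windows (util ~768); feeding the same 64ms scaled by 2 spans 4 windows (util ~960), which is what a 16ms halflife would have produced for the unscaled runtime.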
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-07 13:41 ` Peter Zijlstra
  2022-11-08 19:48   ` Qais Yousef
  2022-11-09 15:18   ` Lukasz Luba
@ 2022-11-10 11:16   ` Dietmar Eggemann
  2022-11-10 13:05     ` Peter Zijlstra
  2022-11-10 12:45   ` Kajetan Puchalski
  3 siblings, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2022-11-10 11:16 UTC (permalink / raw)
  To: Peter Zijlstra, Kajetan Puchalski
  Cc: Jian-Min Liu, Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Vincent Donnefort, Quentin Perret, Patrick Bellasi,
	Abhijeet Dharmapurikar, Qais Yousef, linux-kernel, Jonathan JMChen

On 07/11/2022 14:41, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:

[...]

> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>   */
>  static inline unsigned long cpu_util_cfs(int cpu)
>  {
> +	struct rq *rq = cpu_rq(cpu);
>  	struct cfs_rq *cfs_rq;
>  	unsigned long util;
>
> -	cfs_rq = &cpu_rq(cpu)->cfs;
> +	cfs_rq = &rq->cfs;
>  	util = READ_ONCE(cfs_rq->avg.util_avg);
>
>  	if (sched_feat(UTIL_EST)) {
> +		if (sched_feat(UTIL_EST_FASTER)) {
> +			struct task_struct *curr;
> +
> +			rcu_read_lock();
> +			curr = rcu_dereference(rq->curr);
> +			if (likely(curr->sched_class == &fair_sched_class)) {
> +				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;

Don't we end up with gigantic runtime numbers here?

root@juno:~# cat /proc/1676/task/1676/schedstat
36946300 1150620 11
root@juno:~# cat /proc/1676/task/1676/sched
rt-app (1676, #threads: 2)
-------------------------------------------------------------------
se.exec_start                    : 77766.964240  <- !
se.vruntime                      : 563.587883
se.sum_exec_runtime              : 36.946300     <- !
se.nr_migrations                 : 0
...

I expect cpu_util_cfs() to be ~1024 almost all the time now.

> +				util = max_t(unsigned long, util,
> +					     faster_est_approx(runtime * 2));
> +			}
> +			rcu_read_unlock();
> +		}
>  		util = max_t(unsigned long, util,
>  			     READ_ONCE(cfs_rq->avg.util_est.enqueued));
>  	}

[...]
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-10 11:16 ` Dietmar Eggemann
@ 2022-11-10 13:05   ` Peter Zijlstra
  2022-11-10 14:59     ` Dietmar Eggemann
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2022-11-10 13:05 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
> On 07/11/2022 14:41, Peter Zijlstra wrote:
> > On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>
> [...]
>
> > @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> >   */
> >  static inline unsigned long cpu_util_cfs(int cpu)
> >  {
> > +	struct rq *rq = cpu_rq(cpu);
> >  	struct cfs_rq *cfs_rq;
> >  	unsigned long util;
> >
> > -	cfs_rq = &cpu_rq(cpu)->cfs;
> > +	cfs_rq = &rq->cfs;
> >  	util = READ_ONCE(cfs_rq->avg.util_avg);
> >
> >  	if (sched_feat(UTIL_EST)) {
> > +		if (sched_feat(UTIL_EST_FASTER)) {
> > +			struct task_struct *curr;
> > +
> > +			rcu_read_lock();
> > +			curr = rcu_dereference(rq->curr);
> > +			if (likely(curr->sched_class == &fair_sched_class)) {
> > +				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
>
> Don't we end up with gigantic runtime numbers here?
>
> root@juno:~# cat /proc/1676/task/1676/schedstat
> 36946300 1150620 11
> root@juno:~# cat /proc/1676/task/1676/sched
> rt-app (1676, #threads: 2)
> -------------------------------------------------------------------
> se.exec_start                    : 77766.964240  <- !
> se.vruntime                      : 563.587883
> se.sum_exec_runtime              : 36.946300     <- !
> se.nr_migrations                 : 0
> ...
>
> I expect cpu_util_cfs() to be ~1024 almost all the time now.

Duh, obviously I meant to measure the runtime of the current activation
and messed up.
We don't appear to have the right information to compute this atm :/

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-10 13:05 ` Peter Zijlstra
@ 2022-11-10 14:59   ` Dietmar Eggemann
  2022-11-10 17:51     ` Peter Zijlstra
  0 siblings, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2022-11-10 14:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On 10/11/2022 14:05, Peter Zijlstra wrote:
> On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
>> On 07/11/2022 14:41, Peter Zijlstra wrote:
>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>>
>> [...]
>>
>>> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>>>   */
>>>  static inline unsigned long cpu_util_cfs(int cpu)
>>>  {
>>> +	struct rq *rq = cpu_rq(cpu);
>>>  	struct cfs_rq *cfs_rq;
>>>  	unsigned long util;
>>>
>>> -	cfs_rq = &cpu_rq(cpu)->cfs;
>>> +	cfs_rq = &rq->cfs;
>>>  	util = READ_ONCE(cfs_rq->avg.util_avg);
>>>
>>>  	if (sched_feat(UTIL_EST)) {
>>> +		if (sched_feat(UTIL_EST_FASTER)) {
>>> +			struct task_struct *curr;
>>> +
>>> +			rcu_read_lock();
>>> +			curr = rcu_dereference(rq->curr);
>>> +			if (likely(curr->sched_class == &fair_sched_class)) {
>>> +				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
>>
>> Don't we end up with gigantic runtime numbers here?
>>
>> root@juno:~# cat /proc/1676/task/1676/schedstat
>> 36946300 1150620 11
>> root@juno:~# cat /proc/1676/task/1676/sched
>> rt-app (1676, #threads: 2)
>> -------------------------------------------------------------------
>> se.exec_start                    : 77766.964240  <- !
>> se.vruntime                      : 563.587883
>> se.sum_exec_runtime              : 36.946300     <- !
>> se.nr_migrations                 : 0
>> ...
>>
>> I expect cpu_util_cfs() to be ~1024 almost all the time now.
>
> Duh, obviously I meant to measure the runtime of the current activation
> and messed up.
>
> We don't appear to have the right information to compute this atm :/

This would be:

u64 now = rq_clock_task(rq);
u64 runtime = now - curr->se.exec_start;

but we don't hold the rq lock so we can't get `now`?

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-10 14:59 ` Dietmar Eggemann
@ 2022-11-10 17:51   ` Peter Zijlstra
  2022-11-30 18:14     ` Dietmar Eggemann
  0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2022-11-10 17:51 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Thu, Nov 10, 2022 at 03:59:01PM +0100, Dietmar Eggemann wrote:
> On 10/11/2022 14:05, Peter Zijlstra wrote:
> > On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
> >> On 07/11/2022 14:41, Peter Zijlstra wrote:
> >>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> >>
> >> [...]
> >>
> >>> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> >>>   */
> >>>  static inline unsigned long cpu_util_cfs(int cpu)
> >>>  {
> >>> +	struct rq *rq = cpu_rq(cpu);
> >>>  	struct cfs_rq *cfs_rq;
> >>>  	unsigned long util;
> >>>
> >>> -	cfs_rq = &cpu_rq(cpu)->cfs;
> >>> +	cfs_rq = &rq->cfs;
> >>>  	util = READ_ONCE(cfs_rq->avg.util_avg);
> >>>
> >>>  	if (sched_feat(UTIL_EST)) {
> >>> +		if (sched_feat(UTIL_EST_FASTER)) {
> >>> +			struct task_struct *curr;
> >>> +
> >>> +			rcu_read_lock();
> >>> +			curr = rcu_dereference(rq->curr);
> >>> +			if (likely(curr->sched_class == &fair_sched_class)) {
> >>> +				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
> >>
> >> Don't we end up with gigantic runtime numbers here?
> >>
> >> root@juno:~# cat /proc/1676/task/1676/schedstat
> >> 36946300 1150620 11
> >> root@juno:~# cat /proc/1676/task/1676/sched
> >> rt-app (1676, #threads: 2)
> >> -------------------------------------------------------------------
> >> se.exec_start                    : 77766.964240  <- !
> >> se.vruntime                      : 563.587883
> >> se.sum_exec_runtime              : 36.946300     <- !
> >> se.nr_migrations                 : 0
> >> ...
> >>
> >> I expect cpu_util_cfs() to be ~1024 almost all the time now.
> >
> > Duh, obviously I meant to measure the runtime of the current activation
> > and messed up.
> >
> > We don't appear to have the right information to compute this atm :/
>
> This would be:
>
> u64 now = rq_clock_task(rq);
> u64 runtime = now - curr->se.exec_start;
>
> but we don't hold the rq lock so we can't get `now`?

Not quite the same; that's the time since we got on-cpu last, but that's
not the same as the runtime of this activation (it is when you discount
preemption).

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-10 17:51 ` Peter Zijlstra
@ 2022-11-30 18:14   ` Dietmar Eggemann
  2022-12-01 13:37     ` Kajetan Puchalski
  0 siblings, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2022-11-30 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kajetan Puchalski, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On 10/11/2022 18:51, Peter Zijlstra wrote:
> On Thu, Nov 10, 2022 at 03:59:01PM +0100, Dietmar Eggemann wrote:
>> On 10/11/2022 14:05, Peter Zijlstra wrote:
>>> On Thu, Nov 10, 2022 at 12:16:26PM +0100, Dietmar Eggemann wrote:
>>>> On 07/11/2022 14:41, Peter Zijlstra wrote:
>>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
>>>>
>>>> [...]
>>>>
>>>>> @@ -2956,13 +2958,26 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>>>>>   */
>>>>>  static inline unsigned long cpu_util_cfs(int cpu)
>>>>>  {
>>>>> +	struct rq *rq = cpu_rq(cpu);
>>>>>  	struct cfs_rq *cfs_rq;
>>>>>  	unsigned long util;
>>>>>
>>>>> -	cfs_rq = &cpu_rq(cpu)->cfs;
>>>>> +	cfs_rq = &rq->cfs;
>>>>>  	util = READ_ONCE(cfs_rq->avg.util_avg);
>>>>>
>>>>>  	if (sched_feat(UTIL_EST)) {
>>>>> +		if (sched_feat(UTIL_EST_FASTER)) {
>>>>> +			struct task_struct *curr;
>>>>> +
>>>>> +			rcu_read_lock();
>>>>> +			curr = rcu_dereference(rq->curr);
>>>>> +			if (likely(curr->sched_class == &fair_sched_class)) {
>>>>> +				u64 runtime = curr->se.sum_exec_runtime - curr->se.exec_start;
>>>>
>>>> Don't we end up with gigantic runtime numbers here?
>>>>
>>>> root@juno:~# cat /proc/1676/task/1676/schedstat
>>>> 36946300 1150620 11
>>>> root@juno:~# cat /proc/1676/task/1676/sched
>>>> rt-app (1676, #threads: 2)
>>>> -------------------------------------------------------------------
>>>> se.exec_start                    : 77766.964240  <- !
>>>> se.vruntime                      : 563.587883
>>>> se.sum_exec_runtime              : 36.946300     <- !
>>>> se.nr_migrations                 : 0
>>>> ...
>>>>
>>>> I expect cpu_util_cfs() to be ~1024 almost all the time now.
>>>
>>> Duh, obviously I meant to measure the runtime of the current activation
>>> and messed up.
>>>
>>> We don't appear to have the right information to compute this atm :/
>>
>> This would be:
>>
>> u64 now = rq_clock_task(rq);
>> u64 runtime = now - curr->se.exec_start;
>>
>> but we don't hold the rq lock so we can't get `now`?
>
> Not quite the same; that's the time since we got on-cpu last, but that's
> not the same as the runtime of this activation (it is when you discount
> preemption).

  ----|----|----|----|----|----|--->
      a    s1   p1   s2   p2   d

  a ... activate_task() -> enqueue_task()
  s ... set_next_entity()
  p ... put_prev_entity()
  d ... deactivate_task() -> dequeue_task()

By `runtime of the activation` you refer to `curr->sum_exec_runtime -
time(a)`? And the latter we don't have?

And `runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime`
is only covering the time since we got onto the cpu, right?

With a missing `runtime >>= 10` (from __update_load_sum()) and using
`runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime`
for a 1 task-workload (so no preemption) with factor 2 or 4 I get at
least close to the original rq->cfs.avg.util_avg and util_est.enqueued
signals (cells (5)-(8) in the notebook below).

https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/UTIL_EST_FASTER.ipynb?flush_cache=true

----

set_next_entity()
    update_stats_curr_start()
        se->exec_start = rq_clock_task()
    cfs_rq->curr = se                                    (1)
    se->prev_sum_exec_runtime = se->sum_exec_runtime     (2)

update_curr()
    now = rq_clock_task(rq_of(cfs_rq))
    delta_exec = now - curr->exec_start                  (3)
    curr->exec_start = now
    curr->sum_exec_runtime += delta_exec;                (4)

put_prev_entity()
    cfs_rq->curr = NULL                                  (5)

^ permalink raw reply	[flat|nested] 61+ messages in thread
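The bookkeeping sketched in that sequence can be modelled in a few lines to show why `sum_exec_runtime - prev_sum_exec_runtime` only covers the current on-CPU slice rather than the whole activation (toy code, not the kernel implementation):

```c
#include <assert.h>

/*
 * Toy model of the sequence above: prev_sum_exec_runtime is snapshotted
 * when the task goes on-CPU (set_next_entity), so its difference to
 * sum_exec_runtime is the runtime of the current scheduling slice only -
 * runtime accumulated in earlier slices of the same activation is not
 * included. Illustrative only.
 */
struct toy_se {
	unsigned long long exec_start;
	unsigned long long sum_exec_runtime;
	unsigned long long prev_sum_exec_runtime;
};

static void toy_set_next_entity(struct toy_se *se, unsigned long long now)
{
	se->exec_start = now;
	se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

static void toy_update_curr(struct toy_se *se, unsigned long long now)
{
	se->sum_exec_runtime += now - se->exec_start;
	se->exec_start = now;
}

static unsigned long long toy_slice_runtime(const struct toy_se *se)
{
	return se->sum_exec_runtime - se->prev_sum_exec_runtime;
}
```

Two slices of 50 and 20 time units separated by a preemption leave sum_exec_runtime at 70, but the slice runtime after the second toy_set_next_entity() is only 20 - which matches the point that this expression equals the activation runtime only when preemption is discounted.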
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-30 18:14 ` Dietmar Eggemann
@ 2022-12-01 13:37   ` Kajetan Puchalski
  0 siblings, 0 replies; 61+ messages in thread
From: Kajetan Puchalski @ 2022-12-01 13:37 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Jian-Min Liu, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

On Wed, Nov 30, 2022 at 07:14:51PM +0100, Dietmar Eggemann wrote:

> By `runtime of the activation` you refer to `curr->sum_exec_runtime -
> time(a)`? And the latter we don't have?
>
> And `runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime`
> is only covering the time since we got onto the cpu, right?
>
> With a missing `runtime >>= 10` (from __update_load_sum()) and using
> `runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime`
> for a 1 task-workload (so no preemption) with factor 2 or 4 I get at
> least close to the original rq->cfs.avg.util_avg and util_est.enqueued
> signals (cells (5)-(8) in the notebook below).
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/UTIL_EST_FASTER.ipynb?flush_cache=true
>

With those two changes as described above the comparative results are as
follows:

Max frame durations (worst case scenario)
+--------------------------------+-----------+------------+
| kernel                         | iteration | value      |
+--------------------------------+-----------+------------+
| baseline_60hz                  | 10        | 149.935514 |
| pelt_rampup_runtime_shift_60hz | 10        | 108.126862 |
+--------------------------------+-----------+------------+

Power usage [mW]
+--------------+--------------------------------+-------+-----------+
| chan_name    | kernel                         | value | perc_diff |
+--------------+--------------------------------+-------+-----------+
| total_power  | baseline_60hz                  | 141.6 | 0.0%      |
| total_power  | pelt_rampup_runtime_shift_60hz | 168.0 | 18.61%    |
+--------------+--------------------------------+-------+-----------+

Mean frame duration (average case)
+---------------+--------------------------------+-------+-----------+
| variable      | kernel                         | value | perc_diff |
+---------------+--------------------------------+-------+-----------+
| mean_duration | baseline_60hz                  | 16.7  | 0.0%      |
| mean_duration | pelt_rampup_runtime_shift_60hz | 13.6  | -18.9%    |
+---------------+--------------------------------+-------+-----------+

Jank percentage
+-----------+--------------------------------+-------+-----------+
| variable  | kernel                         | value | perc_diff |
+-----------+--------------------------------+-------+-----------+
| jank_perc | baseline_60hz                  | 4.0   | 0.0%      |
| jank_perc | pelt_rampup_runtime_shift_60hz | 1.5   | -64.04%   |
+-----------+--------------------------------+-------+-----------+

Meaning it's a middle ground of sorts - instead of a 90% increase in power
usage it's 'just' 19%. At the same time though the fastest PELT multiplier
(pelt_4) was getting better max frame durations (85ms vs 108ms) for about
half the power increase (9.6% vs 18.6%).
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-11-07 13:41 ` Peter Zijlstra
                     ` (2 preceding siblings ...)
  2022-11-10 11:16   ` Dietmar Eggemann
@ 2022-11-10 12:45   ` Kajetan Puchalski
  3 siblings, 0 replies; 61+ messages in thread
From: Kajetan Puchalski @ 2022-11-10 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jian-Min Liu, Dietmar Eggemann, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Vincent Donnefort, Quentin Perret,
	Patrick Bellasi, Abhijeet Dharmapurikar, Qais Yousef,
	linux-kernel, Jonathan JMChen

Hi,

> Would something terrible like the below help some?
>
> If not, I suppose it could be modified to take the current state as
> history. But basically it runs a faster pelt sum alongside the regular
> signal just for ramping up the frequency.

As Dietmar mentioned in the other email, there seems to be an issue with
how the patch computes 'runtime'. Nevertheless, I tested it just to see
what would happen, so here are the results if you're interested.

Here's a comparison of Jankbench results on a normal system vs pelt_4 vs
the performance cpufreq governor vs your pelt_rampup patch.
Max frame duration (ms)
+-----------------------+-----------+------------+
| kernel                | iteration | value      |
+-----------------------+-----------+------------+
| menu                  | 10        | 142.973401 |
| menu_pelt_4           | 10        | 85.271279  |
| menu_pelt_rampup      | 10        | 61.494636  |
| menu_performance      | 10        | 40.930829  |
+-----------------------+-----------+------------+

Power usage [mW]
+--------------+-----------------------+-------+-----------+
| chan_name    | kernel                | value | perc_diff |
+--------------+-----------------------+-------+-----------+
| total_power  | menu                  | 144.6 | 0.0%      |
| total_power  | menu_pelt_4           | 158.5 | 9.63%     |
| total_power  | menu_pelt_rampup      | 272.1 | 88.23%    |
| total_power  | menu_performance      | 485.6 | 235.9%    |
+--------------+-----------------------+-------+-----------+

Mean frame duration (ms)
+---------------+-----------------------+-------+-----------+
| variable      | kernel                | value | perc_diff |
+---------------+-----------------------+-------+-----------+
| mean_duration | menu                  | 13.9  | 0.0%      |
| mean_duration | menu_pelt_4           | 14.5  | 4.74%     |
| mean_duration | menu_pelt_rampup      | 8.3   | -40.31%   |
| mean_duration | menu_performance      | 4.4   | -68.13%   |
+---------------+-----------------------+-------+-----------+

Jank percentage
+-----------+-----------------------+-------+-----------+
| variable  | kernel                | value | perc_diff |
+-----------+-----------------------+-------+-----------+
| jank_perc | menu                  | 1.5   | 0.0%      |
| jank_perc | menu_pelt_4           | 2.0   | 30.08%    |
| jank_perc | menu_pelt_rampup      | 0.1   | -93.09%   |
| jank_perc | menu_performance      | 0.1   | -96.29%   |
+-----------+-----------------------+-------+-----------+

[...]

Some variant of this that's tunable at runtime could be workable for the
purposes described before. At least this further proves that it's
manipulating frequency that's responsible for the results here.

---
Kajetan

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime
  2022-09-29  9:47 ` Peter Zijlstra
  2022-09-29 11:07   ` Dietmar Eggemann
  2022-09-29 11:10   ` Kajetan Puchalski
@ 2022-11-07  9:41   ` Jian-Min Liu (劉建旻)
  2 siblings, 0 replies; 61+ messages in thread
From: Jian-Min Liu (劉建旻) @ 2022-11-07 9:41 UTC (permalink / raw)
  To: peterz
  Cc: dietmar.eggemann, mingo, linux-kernel, patrick.bellasi, adharmap,
	qais.yousef, Jonathan JMChen (陳家明), vdonnefort, qperret,
	morten.rasmussen, vincent.guittot

Hi,

On Thu, 2022-09-29 at 11:47 +0200, Peter Zijlstra wrote:
> Because it messes up the order in which people normally read text.
> Why is top-posting such a bad thing?
> Top-posting.
> What is the most annoying thing in e-mail?

> Sorry for top-posting...

> On Tue, Sep 20, 2022 at 10:07:59PM +0800, Jian-Min Liu wrote:
> >
> > Update some test data in android phone to support that switching the
> > PELT HL is helpful functionality.
> >
> > We switch the PELT HL at runtime depending on the scenario, e.g.
> > pelt8 when playing a game, pelt32 for camera video. Supporting runtime
> > switching of the PELT HL is flexible for different workloads.
> >
> > the below table shows performance & power data points:
> >
> > ---------------------------------------------------------------------|
> >                      |               PELT halflife                   |
> >                      |-----------------------------------------------|
> >                      |      32       |      16       |       8       |
> >                      |-----------------------------------------------|
> >                      | avg  min  avg | avg  min  avg | avg  min  avg |
> > Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr |
> > ---------------------------------------------------------------------|
> > HOK game 60fps       | 100  100  100 | 105 *134* 102 | 104 *152* 106 |
> > HOK game 90fps       | 100  100  100 | 101 *114* 101 | 103 *129* 105 |
> > HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105 |
>
> You have your min and avg fps columns mixed up, your min cannot be larger
> than avg.
> Also, with min fps mostly above the actual screen fps, who cares. And
> seriously 120fps on a phone!?!? For worse power usage! You gotta be
> kidding me.
>
> And I googled this game; it is some top-down tactical thing with
> real-time combat (as opposed to turn-based) (DOTA-like I suppose),
> 60 fps locked should be plenty fine.
>
> > FHD video rec. 60fps | 100  100  100 | n/a  n/a  n/a | 100  100 103|
> > Camera snapshot      | 100  100  100 | n/a  n/a  n/a | 100  100 102|
>
> Mostly I think you've demonstrated that none of this is worth it.
>
> > --------------------------------------------------------------------|
> >
> > HOK ... Honour Of Kings, video game
> > FHD ... Full High Definition
> > fps ... frames per second
> > pwr ... power consumption
> >
> > Table values are in %.
>
> Oh... that's bloody insane; that's why none of it makes sense.
>
> How is any of that an answer to:
>
> "They want; I want an explanation of what exact problem is fixed
> how ;-)"
>
> This is just random numbers showing that poking the number has some
> effect; it has zero explanation of why poking the number changes the
> workload and whether that is in fact the right way to go about
> solving that particular issue.

Sorry that the data wasn't clear to understand. I'll try again with
absolute FPS numbers and some additional explanation, as well as a
summary of why we need the PELT halflife to be tunable at runtime.

HOK* 60FPS

+--------+----------------+---------------+--------+
|        |    avg. FPS    |    min. FPS   | power  |
+--------+--------+-------+-------+-------+--------+
| kernel | value  |diff(%)| value |diff(%)| diff(%)|
+--------+--------+-------+-------+-------+--------+
| pelt_1 |  54.1  |  0.0% |  21.8 |  0.0% |  0.0%  |
+--------+--------+-------+-------+-------+--------+
| pelt_2 |  56.9  |  5.2% |  29.2 | 34.0% |  2.2%  |
+--------+--------+-------+-------+-------+--------+
| pelt_4 |  56.6  |  4.5% |  33.2 | 52.4% |  6.3%  |
+--------+--------+-------+-------+-------+--------+

*Honour Of Kings, video game

Test methodology: We chose 60FPS in the game setup. Android's systrace
(similar to ftrace) then provides the real FPS, from which we take the
average and minimum value. Sorry, but we can't share absolute numbers
for power from our test device since this is still considered
sensitive information.

FHD 60fps video recording

+--------+----------------+---------------+--------+
|        |    avg. FPS    |    min. FPS   | power  |
+--------+--------+-------+-------+-------+--------+
| kernel | value  |diff(%)| value |diff(%)| diff(%)|
+--------+--------+-------+-------+-------+--------+
| pelt_1 |  60.0  |  0.0% |  60.0 |  0.0% |  0.0%  |
+--------+--------+-------+-------+-------+--------+
| pelt_4 |  60.0  |  0.0% |  60.0 |  0.0% |  2.1%  |
+--------+--------+-------+-------+-------+--------+

To summarize, we need a smaller PELT halflife to reach higher avg. FPS
and min. FPS values for video gaming, to achieve a smoother game-play
experience even when it comes with slightly higher power consumption.
The improvement in min. FPS is especially important here, to minimize
situations in which the game would otherwise stutter. Since not all
use cases profit from this behaviour (e.g. video recording), the PELT
halflife should be tunable at runtime.

^ permalink raw reply	[flat|nested] 61+ messages in thread
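[Editorial note: the min. FPS gains reported above are consistent with how fast PELT utilization ramps up after a burst of work starts: a shorter halflife lets util_avg, and hence the requested CPU frequency, rise sooner. A back-of-the-envelope sketch using the standard geometric-decay model util(t) = 1 - 0.5^(t/halflife) (not kernel code; the 80% utilization target is an arbitrary illustration, not a real schedutil threshold):]

```python
import math

# Illustrative model, not kernel code: time for the util_avg of a
# continuously running task to ramp from 0 up to target_util, given
# util(t) = 1 - 0.5 ** (t / halflife_ms). Solving for t gives:
def ramp_time_ms(target_util: float, halflife_ms: float) -> float:
    return -halflife_ms * math.log2(1.0 - target_util)

# Time to reach 80% utilization with the three halflives discussed in
# the thread. At 60 FPS one frame is ~16.7 ms, so only the shorter
# halflives react within roughly a frame.
for hl in (32, 16, 8):
    print(f"halflife {hl:2d} ms -> ramp to 80% in "
          f"{ramp_time_ms(0.8, hl):5.1f} ms")
```

Under this model the 32 ms halflife needs roughly four 60 FPS frame periods to see 80% utilization, while 8 ms needs little more than one, which matches the intuition that a shorter halflife reduces the stutter captured by the min. FPS column, at the cost of the faster decay (and higher frequency residency) behind the power regression.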
end of thread, other threads:[~2023-04-11 17:51 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-29  5:54 [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Dietmar Eggemann
2022-08-29  5:54 ` [RFC PATCH 1/1] sched/pelt: Introduce PELT multiplier Dietmar Eggemann
2022-08-29  8:08   ` Peter Zijlstra
2022-08-29 10:02   ` Peter Zijlstra
2022-08-29 10:13     ` Vincent Guittot
2022-08-29 14:23   ` Quentin Perret
2022-08-29 14:34     ` Peter Zijlstra
2022-08-29 15:31       ` Quentin Perret
2022-08-29 15:48         ` Quentin Perret
2022-09-02  7:53           ` Dietmar Eggemann
2022-09-02  8:45             ` Peter Zijlstra
2022-09-06  5:49               ` Vincent Guittot
2022-09-08  6:50                 ` Dietmar Eggemann
2022-09-02  7:53         ` Dietmar Eggemann
2022-09-02  8:45           ` Peter Zijlstra
2022-09-20 14:07 ` [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime Jian-Min Liu
2022-09-28 17:09   ` Dietmar Eggemann
2022-09-29  9:47     ` Peter Zijlstra
2022-09-29 11:07       ` Dietmar Eggemann
2022-09-29 11:10       ` Kajetan Puchalski
2022-09-29 11:21         ` Peter Zijlstra
2022-09-29 14:41           ` Kajetan Puchalski
2022-10-03 22:57             ` Wei Wang
2022-10-04  9:33               ` Dietmar Eggemann
2022-10-05 16:57                 ` Wei Wang
2022-11-07 13:41             ` Peter Zijlstra
2022-11-08 19:48               ` Qais Yousef
2022-11-09 15:49                 ` Peter Zijlstra
2022-11-10 13:25                   ` Qais Yousef
2023-02-07 10:29                   ` Dietmar Eggemann
2023-02-09 16:16                     ` Vincent Guittot
2023-02-17 13:54                       ` Dietmar Eggemann
2023-02-20 13:54                         ` Vincent Guittot
2023-02-21  9:29                           ` Vincent Guittot
2023-02-22 20:28                             ` Dietmar Eggemann
2023-03-01 10:24                               ` Vincent Guittot
2023-02-22 20:13                           ` Dietmar Eggemann
2023-03-02 19:36                             ` Dietmar Eggemann
2023-02-20 10:13                       ` Peter Zijlstra
2023-02-20 13:39                         ` Vincent Guittot
2023-02-23 15:37                           ` Qais Yousef
2023-03-01 10:39                             ` Vincent Guittot
2023-03-01 17:24                               ` Qais Yousef
2023-03-02  8:00                                 ` Vincent Guittot
2023-03-02 19:39                                   ` Dietmar Eggemann
2023-03-06 19:11                                   ` Qais Yousef
2023-03-07 13:22                                     ` Vincent Guittot
2023-03-11 16:55                                       ` Qais Yousef
2023-03-23 16:29                                         ` Dietmar Eggemann
2023-04-03 14:45                                           ` Qais Yousef
2023-04-06 15:58                                             ` Dietmar Eggemann
2023-04-11 17:51                                               ` Qais Yousef
2022-11-09 15:18                 ` Lukasz Luba
2022-11-10 11:16                   ` Dietmar Eggemann
2022-11-10 13:05                     ` Peter Zijlstra
2022-11-10 14:59                       ` Dietmar Eggemann
2022-11-10 17:51                         ` Peter Zijlstra
2022-11-30 18:14                           ` Dietmar Eggemann
2022-12-01 13:37                             ` Kajetan Puchalski
2022-11-10 12:45                 ` Kajetan Puchalski
2022-11-07  9:41     ` Jian-Min Liu (劉建旻)