* [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
@ 2023-08-27 23:31 Qais Yousef
  2023-08-27 23:31 ` [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
                   ` (8 more replies)
  0 siblings, 9 replies; 64+ messages in thread
From: Qais Yousef @ 2023-08-27 23:31 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Qais Yousef

Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25
margins applied in fits_capacity() and apply_dvfs_headroom().

As reported two years ago in

	https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/

these values are not a good fit for all systems and people do feel the need to
modify them regularly out of tree.

Equally, the recent discussion in the PELT HALFLIFE thread highlighted the need
for a way to tune system response time to achieve better perf, power and
thermal characteristics for a given system

	https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/

fits_capacity() and apply_dvfs_headroom() are not suitable tunables. The
attempt to use PELT HALFLIFE highlighted that there's room to do better, which
I hope my proposal helps to achieve.

This series attempts to address these issues by first removing the magic
'margins' from those two areas, which have proved to be problematic in
practice; at least in the Android world, they're being modified out of tree on
a regular basis.

I attempted to tackle the problem by trying to answer the question: what would
really go wrong if we didn't have these margins or headrooms?

The simplest answer I found is that for fits_capacity(), if we make a bad
decision, the task will become misfit and will have to wait for the next load
balance to move it to the correct CPU. Hence I thought a reasonable definition
is that fits_capacity() should be a function of the tick, and the headroom
should cater for the fact that if a task continues to run without sleep, then
as long as it still fits by the time we hit the tick (load balance), it should
not be considered a misfit and should be acceptable to run on this CPU.

For the dvfs headroom, the worst that can happen is that util grows above
capacity@OPP before we get a chance to send a request to hardware to modify the
freq. Which means we should make sure the frequency selection provides enough
headroom to cater for the fact that if the task continues to run without sleep,
then the current frequency should provide a capacity@OPP higher than util after
rate_limit_us of cpufreq transition delay.

To achieve this, we need a new function to help us predict, or approximate,
the util given a delta of runtime. This is what is introduced in patches
1 and 2.
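
To make the intent concrete, below is a standalone user-space sketch (not part
of the series; it ignores PELT's discrete 1024us periods and util_est) of the
quantity approximate_util_avg() computes, i.e. how util_avg would grow if the
task keeps running for delta_ms with the default 32ms halflife:

	#include <math.h>

	/*
	 * Rough closed-form approximation of PELT growth: the old signal
	 * decays by 0.5^(delta_ms/32) while the newly accrued running time
	 * pushes the signal towards the 1024 ceiling. Illustrative only; the
	 * kernel patch uses the real accumulate_sum() machinery instead.
	 */
	static unsigned long predicted_util(unsigned long util, double delta_ms)
	{
		double decay = pow(0.5, delta_ms / 32.0);

		return (unsigned long)(util * decay + 1024.0 * (1.0 - decay));
	}

For example, predicted_util(0, 26) comes out at roughly 440, in line with the
~26ms ramp-up to a little core capacity of ~446 discussed later in the thread.
The dvfs headroom then becomes a question of picking a frequency whose
capacity@OPP covers predicted_util(util, rate_limit_us / 1000.0), and the fits
check becomes a comparison of the predicted util after one tick against the
CPU's capacity.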

Removing these margins doesn't actually fix the problem of being able to tune
the system response. To do that we introduce a new tunable to schedutil called
response_time_ms which dictates how long it takes the policy to go from minimum
to maximum performance point. This value reflects the time it takes PELT to
grow to the capacity of the CPUs in that policy (which can be different in case
of HMP). It should be a direct representation of the PELT ramp-up time, hence
more meaningful from a tuning perspective as an absolute value of how long it
takes to saturate the policy. It should be much easier for userspace to reason
about an appropriate number given this absolute value. It can be expanded or
shrunk to slow down or speed up the response time, ultimately leading to the
appropriate perf, power and thermal trade-off for the system.
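
For illustration, the tunable boils down to a simple linear scaling of util
before it is mapped to a frequency; a sketch of the idea (the names here are
illustrative, the real code is in the schedutil patch below):

	/*
	 * Sketch only: scale util by the ratio of the default (PELT-derived)
	 * response time to the user requested one. E.g. with a default of
	 * 176ms, requesting 88ms doubles util so the policy saturates twice
	 * as fast; requesting 352ms halves it, trading responsiveness for
	 * power.
	 */
	static unsigned long scale_util_sketch(unsigned long util,
					       unsigned int default_response_ms,
					       unsigned int requested_response_ms)
	{
		return util * default_response_ms / requested_response_ms;
	}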

In case of slowing the response time down, there's an inherent limitation in
that util_avg saturates at 1024. This means that slowing down beyond a certain
degree effectively chops off the top freqs. I think this limitation can be
overcome, but I'm not sure how yet. Suggestions would be appreciated meanwhile.

To further help tune the system, we introduce the PELT HALFLIFE multiplier as
a boot time parameter. This parameter has an impact on how fast we migrate, so
it should compensate for whoever needed to tune fits_capacity(); and it has
a great impact on the default response_time_ms. In particular it gives
a naturally faster rise time when the system gets busy, AND fall time when the
system goes back to idle. It is a coarse grain response control that can be
coupled with finer grain control via schedutil's response_time_ms.

I believe (hope) that making the behavior of fits_capacity() and
apply_dvfs_headroom() more deterministic and scalable across systems, so that
they become a true function of their natural limitations, combined with the
new, and hopefully sensible, tunable for managing the reactiveness of the
system, should address this class of problems in a deterministic and
sensible/user friendly manner, letting the user/sysadmin achieve what they
perceive as the best perf, power and thermal trade-off.

I'm not a PELT guru, so help in making sure that approximate_util_avg() and
approximate_runtime() are reasonable and correct would be appreciated.

The remainder of the patches should hopefully be straightforward. There are
some pending questions that you'll find in various TODOs/XXXs that I'd
appreciate feedback on.

Not tested comprehensively, but booted on a Pixel 6 running a mainline-ish
kernel.

I could see the following as default output for response_time_ms:

	# grep . /sys/devices/system/cpu/cpufreq/policy*/schedutil/*
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/response_time_ms:13
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/response_time_ms:42
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/response_time_ms:176

Note how the little core has a very short saturation time given its small
capacity in practice. fits_capacity() being defined as a function of TICK_USEC
means that 1/3rd of its top performance would be ignored (when EAS is active
- !overutilized) - which is desirable since a lot of workloads suffer in terms
of perf by staying for too long on the littles, and given our relatively high
TICK_USEC values, the earlier move is good.
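
As a rough way to see how much of the top range gets cut off, the new fits
check can be inverted into the largest util that still fits a CPU of a given
capacity (same continuous approximation and assumptions as the earlier sketch;
not part of the series):

	#include <math.h>

	/*
	 * Sketch: the largest util that would still grow to no more than
	 * 'cap' after tick_ms of continuous running, assuming the 32ms PELT
	 * halflife. Anything above this is treated as misfit by the new
	 * fits_capacity().
	 */
	static unsigned long max_fitting_util(unsigned long cap, double tick_ms)
	{
		double decay = pow(0.5, tick_ms / 32.0);

		return (unsigned long)((cap - 1024.0 * (1.0 - decay)) / decay);
	}

Plugging in the tick length and a CPU's capacity gives a rough idea of the
fraction of its range that ends up unused.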

The biggest policy though has a saturation time of 176ms which I didn't expect.
My measurements in the past were that we need at least 200ms with 32ms PELT HF.
Maybe I have a bug, or my old measurements are now invalid for some reason.

When I set PELT HALFLIFE to 16ms I get:

	# grep . /sys/devices/system/cpu/cpufreq/policy*/schedutil/*
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/response_time_ms:7
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/response_time_ms:21
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/response_time_ms:79

and for 8ms:

	# grep . /sys/devices/system/cpu/cpufreq/policy*/schedutil/*
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/response_time_ms:4
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/response_time_ms:10
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/response_time_ms:34

policy6 (big core) numbers aren't halving properly. Something to investigate.

I ran Speedometer tests too and I could see the score change as I make
response_time_ms faster/slower or modify the PELT HF. I could also see the freq
residency shift according to my changes, where the top frequencies get higher
residencies as I speed it up, or are never reached/get reduced residency when
I slow it down.

Finally, at the end of the series I modify the default cpufreq transition delay
to be 2ms. I found that on several of my Arm based systems I end up with this
default value, and 10ms is too high nowadays even for a low end system.
I haven't done a full survey to be honest, but I think 10ms is way too high for
the majority of the systems out there - even 2ms can be a bit high for a large
class of systems. Suggestions for other values are welcome!

This series is based on the tip/sched/core with the below series applied

	https://lore.kernel.org/lkml/20230820210640.585311-1-qyousef@layalina.io/

Many thanks

--
Qais Yousef

Qais Yousef (6):
  sched/pelt: Add a new function to approximate the future util_avg
    value
  sched/pelt: Add a new function to approximate runtime to reach given
    util
  sched/fair: Remove magic margin in fits_capacity()
  sched: cpufreq: Remove magic 1.25 headroom from apply_dvfs_headroom()
  sched/schedutil: Add a new tunable to dictate response time
  cpufreq: Change default transition delay to 2ms

Vincent Donnefort (1):
  sched/pelt: Introduce PELT multiplier

 Documentation/admin-guide/pm/cpufreq.rst | 19 ++++-
 drivers/cpufreq/cpufreq.c                |  4 +-
 kernel/sched/core.c                      |  5 +-
 kernel/sched/cpufreq_schedutil.c         | 80 ++++++++++++++++++++-
 kernel/sched/fair.c                      | 21 +++++-
 kernel/sched/pelt.c                      | 89 ++++++++++++++++++++++++
 kernel/sched/pelt.h                      | 42 +++++++++--
 kernel/sched/sched.h                     | 30 +++++---
 8 files changed, 265 insertions(+), 25 deletions(-)

-- 
2.34.1



* [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value
  2023-08-27 23:31 [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Qais Yousef
@ 2023-08-27 23:31 ` Qais Yousef
  2023-09-06 12:56   ` Dietmar Eggemann
  2023-08-27 23:31 ` [RFC PATCH 2/7] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 64+ messages in thread
From: Qais Yousef @ 2023-08-27 23:31 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Qais Yousef

Given a util_avg value, the new function will return the future one
given a runtime delta.

This will be useful in later patches to help replace some magic margins
with more deterministic behavior.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/pelt.c  | 22 +++++++++++++++++++++-
 kernel/sched/sched.h |  3 +++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 0f310768260c..50322005a0ae 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -466,4 +466,24 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 
 	return ret;
 }
-#endif
+#endif /* CONFIG_HAVE_SCHED_AVG_IRQ */
+
+/*
+ * Approximate the new util_avg value assuming an entity has continued to run
+ * for @delta us.
+ */
+unsigned long approximate_util_avg(unsigned long util, u64 delta)
+{
+	struct sched_avg sa = {
+		.util_sum = util * PELT_MIN_DIVIDER,
+		.util_avg = util,
+	};
+
+	if (unlikely(!delta))
+		return util;
+
+	accumulate_sum(delta, &sa, 0, 0, 1);
+	___update_load_avg(&sa, 0);
+
+	return sa.util_avg;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 56eeb5b05b50..5f76b8a75a9f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2997,6 +2997,9 @@ enum cpu_util_type {
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 				 enum cpu_util_type type,
 				 struct task_struct *p);
+
+unsigned long approximate_util_avg(unsigned long util, u64 delta);
+
 /*
  * DVFS decision are made at discrete points. If CPU stays busy, the util will
  * continue to grow, which means it could need to run at a higher frequency
-- 
2.34.1



* [RFC PATCH 2/7] sched/pelt: Add a new function to approximate runtime to reach given util
  2023-08-27 23:31 [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Qais Yousef
  2023-08-27 23:31 ` [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
@ 2023-08-27 23:31 ` Qais Yousef
  2023-09-06 12:56   ` Dietmar Eggemann
  2023-09-15  9:15   ` Hongyan Xia
  2023-08-27 23:31 ` [RFC PATCH 3/7] sched/fair: Remove magic margin in fits_capacity() Qais Yousef
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 64+ messages in thread
From: Qais Yousef @ 2023-08-27 23:31 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Qais Yousef

It is basically the ramp-up time from 0 to a given value. Will be used
later to implement a new tunable to control response time for schedutil.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/pelt.c  | 21 +++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 50322005a0ae..f673b9ab92dc 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -487,3 +487,24 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
 
 	return sa.util_avg;
 }
+
+/*
+ * Approximate the required amount of runtime in ms required to reach @util.
+ */
+u64 approximate_runtime(unsigned long util)
+{
+	struct sched_avg sa = {};
+	u64 delta = 1024; // period = 1024 = ~1ms
+	u64 runtime = 0;
+
+	if (unlikely(!util))
+		return runtime;
+
+	while (sa.util_avg < util) {
+		accumulate_sum(delta, &sa, 0, 0, 1);
+		___update_load_avg(&sa, 0);
+		runtime++;
+	}
+
+	return runtime;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5f76b8a75a9f..2b889ad399de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2999,6 +2999,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 				 struct task_struct *p);
 
 unsigned long approximate_util_avg(unsigned long util, u64 delta);
+u64 approximate_runtime(unsigned long util);
 
 /*
  * DVFS decision are made at discrete points. If CPU stays busy, the util will
-- 
2.34.1



* [RFC PATCH 3/7] sched/fair: Remove magic margin in fits_capacity()
  2023-08-27 23:31 [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Qais Yousef
  2023-08-27 23:31 ` [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
  2023-08-27 23:31 ` [RFC PATCH 2/7] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
@ 2023-08-27 23:31 ` Qais Yousef
  2023-09-06 14:38   ` Dietmar Eggemann
  2023-08-27 23:32 ` [RFC PATCH 4/7] sched: cpufreq: Remove magic 1.25 headroom from apply_dvfs_headroom() Qais Yousef
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 64+ messages in thread
From: Qais Yousef @ 2023-08-27 23:31 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Qais Yousef

The 80% margin is a magic value that has served its purpose for now, but
it no longer fits the variety of systems that exist today. If a system is
specifically over powered, this 80% means we leave a lot of capacity
unused before we decide to upmigrate on an HMP system.

The upmigration behavior should rely on the fact that a bad decision will
need load balance to kick in to perform a misfit migration. And I think
this is an adequate definition of how much headroom to require when
deciding whether a util fits a capacity or not.

Use the new approximate_util_avg() function to predict the util if the
task continues to run for TICK_USEC. If the value is not strictly less
than the capacity, then the task must not be placed there, i.e. it is
considered misfit.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/fair.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b7445cd5af9..facbf3eb7141 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -109,16 +109,31 @@ int __weak arch_asym_cpu_priority(int cpu)
 }
 
 /*
- * The margin used when comparing utilization with CPU capacity.
+ * The util will fit the capacity if it has enough headroom to grow within the
+ * next tick - which is when any load balancing activity happens to do the
+ * correction.
  *
- * (default: ~20%)
+ * If util stays within the capacity before tick has elapsed, then it should be
+ * fine. If not, then a correction action must happen shortly after it starts
+ * running, hence we treat it as !fit.
+ *
+ * TODO: TICK is not actually accurate enough. balance_interval is the correct
+ * one to use as the next load balance doesn't happen religiously at tick.
+ * Accessing balance_interval might be tricky and will require some refactoring
+ * first.
  */
-#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
+static inline bool fits_capacity(unsigned long util, unsigned long capacity)
+{
+	return approximate_util_avg(util, TICK_USEC) < capacity;
+}
 
 /*
  * The margin used when comparing CPU capacities.
  * is 'cap1' noticeably greater than 'cap2'
  *
+ * TODO: use approximate_util_avg() to give something more quantifiable based
+ * on time? Like 1ms?
+ *
  * (default: ~5%)
  */
 #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
-- 
2.34.1



* [RFC PATCH 4/7] sched: cpufreq: Remove magic 1.25 headroom from apply_dvfs_headroom()
  2023-08-27 23:31 [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Qais Yousef
                   ` (2 preceding siblings ...)
  2023-08-27 23:31 ` [RFC PATCH 3/7] sched/fair: Remove magic margin in fits_capacity() Qais Yousef
@ 2023-08-27 23:32 ` Qais Yousef
  2023-09-07 11:34   ` Peter Zijlstra
  2023-08-27 23:32 ` [RFC PATCH 5/7] sched/schedutil: Add a new tunable to dictate response time Qais Yousef
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 64+ messages in thread
From: Qais Yousef @ 2023-08-27 23:32 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Qais Yousef

Instead of the magical 1.25 headroom, use the new approximate_util_avg()
to provide headroom based on the dvfs_update_delay, which is the period
at which the cpufreq governor will send DVFS updates to the hardware.

Add a new percpu dvfs_update_delay that can be cheaply accessed whenever
apply_dvfs_headroom() is called. We expect cpufreq governors that rely
on util to drive their DVFS logic/algorithm to populate these percpu
variables. schedutil is the only such governor at the moment.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/core.c              |  3 ++-
 kernel/sched/cpufreq_schedutil.c | 10 +++++++++-
 kernel/sched/sched.h             | 25 ++++++++++++++-----------
 3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 602e369753a3..f56eb44745a8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -116,6 +116,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(u64, dvfs_update_delay);
 
 #ifdef CONFIG_SCHED_DEBUG
 /*
@@ -7439,7 +7440,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * frequency will be gracefully reduced with the utilization decay.
 	 */
 	if (type == FREQUENCY_UTIL) {
-		util = apply_dvfs_headroom(util_cfs) + cpu_util_rt(rq);
+		util = apply_dvfs_headroom(util_cfs, cpu) + cpu_util_rt(rq);
 		util = uclamp_rq_util_with(rq, util, p);
 	} else {
 		util = util_cfs + cpu_util_rt(rq);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 0c7565ac31fb..04aa06846f31 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -519,15 +519,21 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
 	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
 	struct sugov_policy *sg_policy;
 	unsigned int rate_limit_us;
+	int cpu;
 
 	if (kstrtouint(buf, 10, &rate_limit_us))
 		return -EINVAL;
 
 	tunables->rate_limit_us = rate_limit_us;
 
-	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+
 		sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
 
+		for_each_cpu(cpu, sg_policy->policy->cpus)
+			per_cpu(dvfs_update_delay, cpu) = rate_limit_us;
+	}
+
 	return count;
 }
 
@@ -772,6 +778,8 @@ static int sugov_start(struct cpufreq_policy *policy)
 		memset(sg_cpu, 0, sizeof(*sg_cpu));
 		sg_cpu->cpu			= cpu;
 		sg_cpu->sg_policy		= sg_policy;
+
+		per_cpu(dvfs_update_delay, cpu) = sg_policy->tunables->rate_limit_us;
 	}
 
 	if (policy_is_shared(policy))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2b889ad399de..e06e512af192 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3001,6 +3001,15 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 unsigned long approximate_util_avg(unsigned long util, u64 delta);
 u64 approximate_runtime(unsigned long util);
 
+/*
+ * Any governor that relies on the util signal to drive DVFS must populate
+ * these percpu dvfs_update_delay variables.
+ *
+ * They should describe the rate/delay in us at which the governor sends DVFS
+ * frequency updates to the hardware.
+ */
+DECLARE_PER_CPU_SHARED_ALIGNED(u64, dvfs_update_delay);
+
 /*
  * DVFS decision are made at discrete points. If CPU stays busy, the util will
  * continue to grow, which means it could need to run at a higher frequency
@@ -3010,20 +3019,14 @@ u64 approximate_runtime(unsigned long util);
  * to run at adequate performance point.
  *
  * This function provides enough headroom to provide adequate performance
- * assuming the CPU continues to be busy.
- *
- * At the moment it is a constant multiplication with 1.25.
+ * assuming the CPU continues to be busy. This headroom is based on the
+ * dvfs_update_delay of the cpufreq governor.
  *
- * TODO: The headroom should be a function of the delay. 25% is too high
- * especially on powerful systems. For example, if the delay is 500us, it makes
- * more sense to give a small headroom as the next decision point is not far
- * away and will follow the util if it continues to rise. On the other hand if
- * the delay is 10ms, then we need a bigger headroom so the CPU won't struggle
- * at a lower frequency if it never goes to idle until then.
+ * XXX: Should we provide headroom when the util is decaying?
  */
-static inline unsigned long apply_dvfs_headroom(unsigned long util)
+static inline unsigned long apply_dvfs_headroom(unsigned long util, int cpu)
 {
-	return util + (util >> 2);
+	return approximate_util_avg(util, per_cpu(dvfs_update_delay, cpu));
 }
 
 /*
-- 
2.34.1



* [RFC PATCH 5/7] sched/schedutil: Add a new tunable to dictate response time
  2023-08-27 23:31 [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Qais Yousef
                   ` (3 preceding siblings ...)
  2023-08-27 23:32 ` [RFC PATCH 4/7] sched: cpufreq: Remove magic 1.25 headroom from apply_dvfs_headroom() Qais Yousef
@ 2023-08-27 23:32 ` Qais Yousef
  2023-09-06 21:13   ` Dietmar Eggemann
  2023-09-07 11:44   ` Peter Zijlstra
  2023-08-27 23:32 ` [RFC PATCH 6/7] sched/pelt: Introduce PELT multiplier Qais Yousef
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 64+ messages in thread
From: Qais Yousef @ 2023-08-27 23:32 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Qais Yousef

The new tunable, response_time_ms, allows us to speed up or slow down
the response time of the policy to meet the perf, power and thermal
characteristics desired by the user/sysadmin. There's no single universal
trade-off that we can apply to all systems, even if they use the same
SoC. The form factor of the system, the dominant use case, and, in case
of battery powered systems, the size of the battery and the presence or
absence of active cooling can play a big role in what would be best to
use.

The new tunable provides a sensible default, yet gives the user/sysadmin
the power to control the response time, if they wish to.

This tunable is applied when we map the util into a frequency.

TODO: to retain the previous behavior, we must multiply the default time
by 80%.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 Documentation/admin-guide/pm/cpufreq.rst | 19 ++++++-
 kernel/sched/cpufreq_schedutil.c         | 70 +++++++++++++++++++++++-
 2 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index 6adb7988e0eb..c43df0e716a7 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -417,7 +417,7 @@ is passed by the scheduler to the governor callback which causes the frequency
 to go up to the allowed maximum immediately and then draw back to the value
 returned by the above formula over time.
 
-This governor exposes only one tunable:
+This governor exposes two tunables:
 
 ``rate_limit_us``
 	Minimum time (in microseconds) that has to pass between two consecutive
@@ -427,6 +427,23 @@ This governor exposes only one tunable:
 	The purpose of this tunable is to reduce the scheduler context overhead
 	of the governor which might be excessive without it.
 
+``response_time_ms``
+	Amount of time (in milliseconds) required to ramp the policy from
+	lowest to highest frequency. Can be decreased to speed up the
+	responsiveness of the system, or increased to slow the system down in
+	the hope of saving power. The best perf/watt will depend on the system
+	characteristics and the dominant workload you expect to run. For
+	userspace that has smart context on the type of workload running (like
+	in Android), one can tune this to suit the demand of that workload.
+
+	Note that when slowing the response down, you can end up effectively
+	chopping off the top frequencies for that policy as the util is capped
+	to 1024. On HMP systems where some CPUs have a capacity less than 1024,
+	unless affinity is used, the task will probably have migrated to
+	a bigger core before the policy reaches its max performance. If the
+	task is locked to that policy, then it should reach the max
+	performance after the specified time.
+
 This governor generally is regarded as a replacement for the older `ondemand`_
 and `conservative`_ governors (described below), as it is simpler and more
 tightly integrated with the CPU scheduler, its overhead in terms of CPU context
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 04aa06846f31..42f4c4100902 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -11,6 +11,7 @@
 struct sugov_tunables {
 	struct gov_attr_set	attr_set;
 	unsigned int		rate_limit_us;
+	unsigned int		response_time_ms;
 };
 
 struct sugov_policy {
@@ -22,6 +23,7 @@ struct sugov_policy {
 	raw_spinlock_t		update_lock;
 	u64			last_freq_update_time;
 	s64			freq_update_delay_ns;
+	unsigned int		freq_response_time_ms;
 	unsigned int		next_freq;
 	unsigned int		cached_raw_freq;
 
@@ -59,6 +61,45 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
 
 /************************ Governor internals ***********************/
 
+static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
+{
+	int cpu = cpumask_first(sg_policy->policy->cpus);
+	unsigned long cap = capacity_orig_of(cpu);
+
+	return approximate_runtime(cap);
+}
+
+/*
+ * Shrink or expand how long it takes to reach the maximum performance of the
+ * policy.
+ *
+ * sg_policy->freq_response_time_ms is a constant value defined by PELT
+ * HALFLIFE and the capacity of the policy (assuming HMP systems).
+ *
+ * sg_policy->tunables->response_time_ms is a user defined response time. By
+ * setting it lower than sg_policy->freq_response_time_ms, the system will
+ * respond faster to changes in util, which will result in reaching maximum
+ * performance point quicker. By setting it higher, it'll increase the amount
+ * of time required to reach the maximum OPP.
+ *
+ * This should be applied when selecting the frequency. By default no
+ * conversion is done and we should return util as-is.
+ */
+static inline unsigned long
+sugov_apply_response_time(struct sugov_policy *sg_policy, unsigned long util)
+{
+	unsigned long mult;
+
+	if (sg_policy->freq_response_time_ms == sg_policy->tunables->response_time_ms)
+		return util;
+
+	mult = sg_policy->freq_response_time_ms * SCHED_CAPACITY_SCALE;
+	mult /=	sg_policy->tunables->response_time_ms;
+	mult *= util;
+
+	return mult >> SCHED_CAPACITY_SHIFT;
+}
+
 static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 {
 	s64 delta_ns;
@@ -143,6 +184,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
+	util = sugov_apply_response_time(sg_policy, util);
 	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
@@ -539,8 +581,32 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
 
 static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
 
+static ssize_t response_time_ms_show(struct gov_attr_set *attr_set, char *buf)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+	return sprintf(buf, "%u\n", tunables->response_time_ms);
+}
+
+static ssize_t
+response_time_ms_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+	unsigned int response_time_ms;
+
+	if (kstrtouint(buf, 10, &response_time_ms))
+		return -EINVAL;
+
+	tunables->response_time_ms = response_time_ms;
+
+	return count;
+}
+
+static struct governor_attr response_time_ms = __ATTR_RW(response_time_ms);
+
 static struct attribute *sugov_attrs[] = {
 	&rate_limit_us.attr,
+	&response_time_ms.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(sugov);
@@ -704,6 +770,7 @@ static int sugov_init(struct cpufreq_policy *policy)
 	}
 
 	tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
+	tunables->response_time_ms = sugov_calc_freq_response_ms(sg_policy);
 
 	policy->governor_data = sg_policy;
 	sg_policy->tunables = tunables;
@@ -763,7 +830,8 @@ static int sugov_start(struct cpufreq_policy *policy)
 	void (*uu)(struct update_util_data *data, u64 time, unsigned int flags);
 	unsigned int cpu;
 
-	sg_policy->freq_update_delay_ns	= sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+	sg_policy->freq_update_delay_ns		= sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+	sg_policy->freq_response_time_ms	= sugov_calc_freq_response_ms(sg_policy);
 	sg_policy->last_freq_update_time	= 0;
 	sg_policy->next_freq			= 0;
 	sg_policy->work_in_progress		= false;
-- 
2.34.1



* [RFC PATCH 6/7] sched/pelt: Introduce PELT multiplier
  2023-08-27 23:31 [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Qais Yousef
                   ` (4 preceding siblings ...)
  2023-08-27 23:32 ` [RFC PATCH 5/7] sched/schedutil: Add a new tunable to dictate response time Qais Yousef
@ 2023-08-27 23:32 ` Qais Yousef
  2023-08-27 23:32 ` [RFC PATCH 7/7] cpufreq: Change default transition delay to 2ms Qais Yousef
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-08-27 23:32 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Qais Yousef

From: Vincent Donnefort <vincent.donnefort@arm.com>

The new sched_pelt_multiplier boot param allows a user to set a clock
multiplier to x2 or x4 (x1 being the default). This clock multiplier
artificially speeds up PELT ramp up/down, similarly to using a faster
half-life than the default 32ms.

  - x1: 32ms half-life
  - x2: 16ms half-life
  - x4: 8ms  half-life

Internally, a new clock is created: rq->clock_task_mult. It sits in the
clock hierarchy between rq->clock_task and rq->clock_pelt.

The param is set as read only and can only be changed at boot time via

	kernel.sched_pelt_multiplier=[1, 2, 4]

PELT has a big impact on the overall system response and reactiveness to
change. A smaller PELT HF means it'll require less time to reach the
maximum performance point of the system when the system becomes fully
busy; and equally a shorter time to go back to the lowest performance
point when the system goes back to idle.

This faster reaction impacts both the dvfs response and the migration time
between clusters in HMP systems.

Smaller PELT values are expected to give better performance at the cost
of more power. Underpowered systems can particularly benefit from
smaller values. Powerful systems can still benefit from smaller values
if they want to be tuned more towards perf and power is not the major
concern for them.

This, combined with response_time_ms from schedutil, should give the user
and sysadmin a deterministic way to control the power, perf and thermal
triangle for their system. The default response_time_ms will halve
as the PELT HF halves.

Update approximate_{util_avg, runtime}() to take into account the PELT
HALFLIFE multiplier.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[Converted from sysctl to boot param and updated commit message]
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/core.c  |  2 +-
 kernel/sched/pelt.c  | 52 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/pelt.h  | 42 +++++++++++++++++++++++++++++++----
 kernel/sched/sched.h |  1 +
 4 files changed, 90 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f56eb44745a8..42ed86a6ad3c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -745,7 +745,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
 		update_irq_load_avg(rq, irq_delta + steal);
 #endif
-	update_rq_clock_pelt(rq, delta);
+	update_rq_clock_task_mult(rq, delta);
 }
 
 void update_rq_clock(struct rq *rq)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index f673b9ab92dc..24886bab0f91 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -468,6 +468,54 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 }
 #endif /* CONFIG_HAVE_SCHED_AVG_IRQ */
 
+__read_mostly unsigned int sched_pelt_lshift;
+static unsigned int sched_pelt_multiplier = 1;
+
+static int set_sched_pelt_multiplier(const char *val, const struct kernel_param *kp)
+{
+	int ret;
+
+	ret = param_set_int(val, kp);
+	if (ret)
+		goto error;
+
+	switch (sched_pelt_multiplier)  {
+	case 1:
+		fallthrough;
+	case 2:
+		fallthrough;
+	case 4:
+		WRITE_ONCE(sched_pelt_lshift,
+			   sched_pelt_multiplier >> 1);
+		break;
+	default:
+		ret = -EINVAL;
+		goto error;
+	}
+
+	return 0;
+
+error:
+	sched_pelt_multiplier = 1;
+	return ret;
+}
+
+static const struct kernel_param_ops sched_pelt_multiplier_ops = {
+	.set = set_sched_pelt_multiplier,
+	.get = param_get_int,
+};
+
+#ifdef MODULE_PARAM_PREFIX
+#undef MODULE_PARAM_PREFIX
+#endif
+/* XXX: should we use sched as prefix? */
+#define MODULE_PARAM_PREFIX "kernel."
+module_param_cb(sched_pelt_multiplier, &sched_pelt_multiplier_ops, &sched_pelt_multiplier, 0444);
+MODULE_PARM_DESC(sched_pelt_multiplier, "PELT HALFLIFE helps control the responsiveness of the system.");
+MODULE_PARM_DESC(sched_pelt_multiplier, "Accepted value: 1 32ms PELT HALIFE - roughly 200ms to go from 0 to max performance point (default).");
+MODULE_PARM_DESC(sched_pelt_multiplier, "                2 16ms PELT HALIFE - roughly 100ms to go from 0 to max performance point.");
+MODULE_PARM_DESC(sched_pelt_multiplier, "                4  8ms PELT HALIFE - roughly  50ms to go from 0 to max performance point.");
+
 /*
  * Approximate the new util_avg value assuming an entity has continued to run
  * for @delta us.
@@ -482,7 +530,7 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
 	if (unlikely(!delta))
 		return util;
 
-	accumulate_sum(delta, &sa, 0, 0, 1);
+	accumulate_sum(delta << sched_pelt_lshift, &sa, 0, 0, 1);
 	___update_load_avg(&sa, 0);
 
 	return sa.util_avg;
@@ -494,7 +542,7 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
 u64 approximate_runtime(unsigned long util)
 {
 	struct sched_avg sa = {};
-	u64 delta = 1024; // period = 1024 = ~1ms
+	u64 delta = 1024 << sched_pelt_lshift; // period = 1024 = ~1ms
 	u64 runtime = 0;
 
 	if (unlikely(!util))
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 3a0e0dc28721..9b35b5072bae 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
 	WRITE_ONCE(avg->util_est.enqueued, enqueued);
 }
 
+static inline u64 rq_clock_task_mult(struct rq *rq)
+{
+	lockdep_assert_rq_held(rq);
+	assert_clock_updated(rq);
+
+	return rq->clock_task_mult;
+}
+
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
 	lockdep_assert_rq_held(rq);
@@ -72,7 +80,7 @@ static inline u64 rq_clock_pelt(struct rq *rq)
 /* The rq is idle, we can sync to clock_task */
 static inline void _update_idle_rq_clock_pelt(struct rq *rq)
 {
-	rq->clock_pelt  = rq_clock_task(rq);
+	rq->clock_pelt = rq_clock_task_mult(rq);
 
 	u64_u32_store(rq->clock_idle, rq_clock(rq));
 	/* Paired with smp_rmb in migrate_se_pelt_lag() */
@@ -121,6 +129,27 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
 	rq->clock_pelt += delta;
 }
 
+extern unsigned int sched_pelt_lshift;
+
+/*
+ * absolute time   |1      |2      |3      |4      |5      |6      |
+ * @ mult = 1      --------****************--------****************-
+ * @ mult = 2      --------********----------------********---------
+ * @ mult = 4      --------****--------------------****-------------
+ * clock task mult
+ * @ mult = 2      |   |   |2  |3  |   |   |   |   |5  |6  |   |   |
+ * @ mult = 4      | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
+ *
+ */
+static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
+{
+	delta <<= READ_ONCE(sched_pelt_lshift);
+
+	rq->clock_task_mult += delta;
+
+	update_rq_clock_pelt(rq, delta);
+}
+
 /*
  * When rq becomes idle, we have to check if it has lost idle time
  * because it was fully busy. A rq is fully used when the /Sum util_sum
@@ -147,7 +176,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
 	 * rq's clock_task.
 	 */
 	if (util_sum >= divider)
-		rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
+		rq->lost_idle_time += rq_clock_task_mult(rq) - rq->clock_pelt;
 
 	_update_idle_rq_clock_pelt(rq);
 }
@@ -218,13 +247,18 @@ update_irq_load_avg(struct rq *rq, u64 running)
 	return 0;
 }
 
-static inline u64 rq_clock_pelt(struct rq *rq)
+static inline u64 rq_clock_task_mult(struct rq *rq)
 {
 	return rq_clock_task(rq);
 }
 
+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+	return rq_clock_task_mult(rq);
+}
+
 static inline void
-update_rq_clock_pelt(struct rq *rq, s64 delta) { }
+update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
 
 static inline void
 update_idle_rq_clock_pelt(struct rq *rq) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e06e512af192..896b6655397c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1023,6 +1023,7 @@ struct rq {
 	u64			clock;
 	/* Ensure that all clocks are in the same cache line */
 	u64			clock_task ____cacheline_aligned;
+	u64			clock_task_mult;
 	u64			clock_pelt;
 	unsigned long		lost_idle_time;
 	u64			clock_pelt_idle;
-- 
2.34.1



* [RFC PATCH 7/7] cpufreq: Change default transition delay to 2ms
  2023-08-27 23:31 [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Qais Yousef
                   ` (5 preceding siblings ...)
  2023-08-27 23:32 ` [RFC PATCH 6/7] sched/pelt: Introduce PELT multiplier Qais Yousef
@ 2023-08-27 23:32 ` Qais Yousef
  2023-09-06  9:18 ` [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Lukasz Luba
  2023-09-07 13:08 ` Peter Zijlstra
  8 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-08-27 23:32 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Qais Yousef

10ms is too high for today's hardware, even low end ones. This default
ends up being used a lot on Arm machines at least. Pine64, Mac mini and
Pixel 6 all end up with a 10ms rate_limit_us when using schedutil, and
it's too high for all of them.

Change the default to 2ms, which should be 'pessimistic' enough for the
worst case scenario, but not too high for platforms with fast DVFS hardware.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 drivers/cpufreq/cpufreq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 50bbc969ffe5..d8fc33b7f2d2 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -579,11 +579,11 @@ unsigned int cpufreq_policy_transition_delay_us(struct cpufreq_policy *policy)
 		 * for platforms where transition_latency is in milliseconds, it
 		 * ends up giving unrealistic values.
 		 *
-		 * Cap the default transition delay to 10 ms, which seems to be
+		 * Cap the default transition delay to 2 ms, which seems to be
 		 * a reasonable amount of time after which we should reevaluate
 		 * the frequency.
 		 */
-		return min(latency * LATENCY_MULTIPLIER, (unsigned int)10000);
+		return min(latency * LATENCY_MULTIPLIER, (unsigned int)(2*MSEC_PER_SEC));
 	}
 
 	return LATENCY_MULTIPLIER;
-- 
2.34.1



* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-08-27 23:31 [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Qais Yousef
                   ` (6 preceding siblings ...)
  2023-08-27 23:32 ` [RFC PATCH 7/7] cpufreq: Change default transition delay to 2ms Qais Yousef
@ 2023-09-06  9:18 ` Lukasz Luba
  2023-09-06 21:18   ` Qais Yousef
  2023-09-07 13:08 ` Peter Zijlstra
  8 siblings, 1 reply; 64+ messages in thread
From: Lukasz Luba @ 2023-09-06  9:18 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki, Ingo Molnar,
	Dietmar Eggemann, Vincent Guittot, Viresh Kumar, Peter Zijlstra

Hi Qais,

On 8/28/23 00:31, Qais Yousef wrote:
> Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25
> margins applied in fits_capacity() and apply_dvfs_headroom().
> 
> As reported two years ago in
> 
> 	https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/
> 
> these values are not a good fit for all systems and people do feel the need to
> modify them regularly out of tree.

That is true, in the Android kernel those are known 'features'. Furthermore,
in my game testing it looks like higher margins do help to shrink
the number of dropped frames, while on other types of workloads (e.g.
those that you have in the link above) the 0% margin shows better energy.

I remember also the results from MTK regarding the PELT HALF_LIFE

https://lore.kernel.org/all/0f82011994be68502fd9833e499749866539c3df.camel@mediatek.com/

The numbers for 8ms half_life were showing a really nice improvement
for the 'min fps' metric. I got similar results with a higher margin.

IMO we can derive quite important information from those different
experiments:
More sustainable workloads like "Yahoo browser" don't need a margin.
More unpredictable workloads like "Fortnite" (a shooter game with an 'open
world') need some decent margin.

The problem is that a periodic task can be 'noisy'. The low-pass
filter which is our exponentially weighted moving avg PELT will
'smooth' the measured values. It will block sudden 'spikes' since
they are high-frequency changes. Those sudden 'spikes' are
the task activations where we need to compute a bit longer, e.g.
there was an explosion in the game. The 25% margin helps us to
be ready for this 'noisy' task - the CPU frequency is higher
(and so is the capacity). So if a sudden need for longer computation
is seen, then we have enough 'idle time' (~25% idle) to serve this
properly and not lose the frame.

The margin helps in two ways for 'noisy' workloads:
1. in fits_capacity() to avoid a CPU which couldn't handle it
    and prefers CPUs with higher capacity
2. it asks for longer 'idle time' e.g. 25-40% (depends on margin) to
    serve sudden computation need

IIUC, your proposal is to:
1. extend the low-pass filter to some higher frequency, so we
    could see those 'spikes' - that's the PELT HALF_LIFE boot
    parameter for 8ms
1.1. You are likely to have a 'gift' from the Util_est
      which picks the max util_avg values and maintains them
      for a while. That's why the 8ms PELT information can last longer
      and you can get higher frequency and longer idle time.
2. Plumb in this new idea of dvfs_update_delay as the new
    'margin' - this I don't understand

For 2., I don't see that the DVFS HW characteristics are the best fit
for this problem. We can have really fast DVFS HW, but we still need
some decent spare idle time in some workloads; these are two
independent issues IMO. You might get the higher idle time thanks
to 1.1, but this is a 'side effect'.

Could you explain a bit more why this dvfs_update_delay is
crucial here?

Regards,
Lukasz



* Re: [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value
  2023-08-27 23:31 ` [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
@ 2023-09-06 12:56   ` Dietmar Eggemann
  2023-09-06 21:19     ` Qais Yousef
  0 siblings, 1 reply; 64+ messages in thread
From: Dietmar Eggemann @ 2023-09-06 12:56 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Vincent Guittot
  Cc: linux-kernel, linux-pm, Lukasz Luba

On 28/08/2023 01:31, Qais Yousef wrote:
> Given a util_avg value, the new function will return the future one
> given a runtime delta.
> 
> This will be useful in later patches to help replace some magic margins
> with more deterministic behavior.
> 
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  kernel/sched/pelt.c  | 22 +++++++++++++++++++++-
>  kernel/sched/sched.h |  3 +++
>  2 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 0f310768260c..50322005a0ae 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -466,4 +466,24 @@ int update_irq_load_avg(struct rq *rq, u64 running)
>  
>  	return ret;
>  }
> -#endif
> +#endif /* CONFIG_HAVE_SCHED_AVG_IRQ */
> +
> +/*
> + * Approximate the new util_avg value assuming an entity has continued to run
> + * for @delta us.
> + */
> +unsigned long approximate_util_avg(unsigned long util, u64 delta)
> +{
> +	struct sched_avg sa = {
> +		.util_sum = util * PELT_MIN_DIVIDER,
> +		.util_avg = util,
> +	};
> +
> +	if (unlikely(!delta))
> +		return util;
> +
> +	accumulate_sum(delta, &sa, 0, 0, 1);

IMHO, you miss the handling of `periods != 0`. load = 0 eclipses this
code in accumulate_sum().

> +	___update_load_avg(&sa, 0);
> +
> +	return sa.util_avg;
> +}

We already discussed something similar to this in Nov 22, the
so-called UTIL_EST_FASTER thing.

https://lkml.kernel.org/r/Y2kLA8x40IiBEPYg@hirez.programming.kicks-ass.net

+/*
+ * Compute a pelt util_avg assuming no history and @delta runtime.
+ */
+unsigned long faster_est_approx(u64 delta)
+{
+	unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
+	u64 periods = delta / 1024;
+
+	if (periods) {
+		delta %= 1024;
+		contrib = __accumulate_pelt_segments(periods, 1024, delta);
+	}
+
+	return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
+}
+

[...]


* Re: [RFC PATCH 2/7] sched/pelt: Add a new function to approximate runtime to reach given util
  2023-08-27 23:31 ` [RFC PATCH 2/7] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
@ 2023-09-06 12:56   ` Dietmar Eggemann
  2023-09-06 20:44     ` Dietmar Eggemann
  2023-09-15  9:15   ` Hongyan Xia
  1 sibling, 1 reply; 64+ messages in thread
From: Dietmar Eggemann @ 2023-09-06 12:56 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Vincent Guittot
  Cc: linux-kernel, linux-pm, Lukasz Luba

On 28/08/2023 01:31, Qais Yousef wrote:
> It is basically the ramp-up time from 0 to a given value. Will be used
> later to implement a new tunable to control response time for schedutil.
> 
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  kernel/sched/pelt.c  | 21 +++++++++++++++++++++
>  kernel/sched/sched.h |  1 +
>  2 files changed, 22 insertions(+)
> 
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 50322005a0ae..f673b9ab92dc 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -487,3 +487,24 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
>  
>  	return sa.util_avg;
>  }
> +
> +/*
> + * Approximate the required amount of runtime in ms required to reach @util.
> + */
> +u64 approximate_runtime(unsigned long util)
> +{
> +	struct sched_avg sa = {};
> +	u64 delta = 1024; // period = 1024 = ~1ms
> +	u64 runtime = 0;
> +
> +	if (unlikely(!util))
> +		return runtime;
> +
> +	while (sa.util_avg < util) {
> +		accumulate_sum(delta, &sa, 0, 0, 1);
> +		___update_load_avg(&sa, 0);
> +		runtime++;
> +	}
> +
> +	return runtime;
> +}

S_n = S_inv * (1 - 0.5^(t/hl))

t = hl * ln(1 - Sn/S_inv)/ln(0.5)

(1) for a little CPU (capacity_orig = 446)

t = 32ms * ln(1 - 446/1024)/ln(0.5)

t = 26ms

(2) for a big CPU (capacity = 1023, instead of 1024 since ln(0) is not
    defined)

t = 32ms * ln(1 - 1023/1024)/ln(0.5)

t = 320ms
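
(And a tiny user-space sketch of the same closed form, for anyone who wants to
play with the numbers; the 32ms halflife and the 1024 scale are hard-coded
assumptions:)

	#include <math.h>
	#include <stdio.h>

	/* t = hl * ln(1 - S_n/S_inv) / ln(0.5), hl = 32ms, S_inv = 1024 */
	static double ramp_time_ms(double target_util)
	{
		return 32.0 * log(1.0 - target_util / 1024.0) / log(0.5);
	}

	int main(void)
	{
		printf("little (446): %.0f ms\n", ramp_time_ms(446));   /* ~26 ms */
		printf("big  (1023): %.0f ms\n", ramp_time_ms(1023));   /* ~320 ms */
		return 0;
	}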

[...]


* Re: [RFC PATCH 3/7] sched/fair: Remove magic margin in fits_capacity()
  2023-08-27 23:31 ` [RFC PATCH 3/7] sched/fair: Remove magic margin in fits_capacity() Qais Yousef
@ 2023-09-06 14:38   ` Dietmar Eggemann
  2023-09-06 21:45     ` Qais Yousef
  0 siblings, 1 reply; 64+ messages in thread
From: Dietmar Eggemann @ 2023-09-06 14:38 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Vincent Guittot
  Cc: linux-kernel, linux-pm, Lukasz Luba

On 28/08/2023 01:31, Qais Yousef wrote:
> The 80% margin is a magic value that has served its purpose for now, but
> it no longer fits the variety of systems that exist today. If a system is
> specifically over powered, this 80% means we leave a lot of capacity
> unused before we decide to upmigrate on an HMP system.
> 
> The upmigration behavior should rely on the fact that a bad decision will
> need load balance to kick in to perform a misfit migration. And I think
> this is an adequate definition of how much headroom to require when
> deciding whether a util fits a capacity or not.
> 
> Use the new approximate_util_avg() function to predict the util if the
> task continues to run for TICK_USEC. If the value is not strictly less
> than the capacity, then the task must not be placed there, i.e. it is
> considered misfit.
> 
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  kernel/sched/fair.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0b7445cd5af9..facbf3eb7141 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -109,16 +109,31 @@ int __weak arch_asym_cpu_priority(int cpu)
>  }
>  
>  /*
> - * The margin used when comparing utilization with CPU capacity.
> + * The util will fit the capacity if it has enough headroom to grow within the
> + * next tick - which is when any load balancing activity happens to do the
> + * correction.
>   *
> - * (default: ~20%)
> + * If util stays within the capacity before tick has elapsed, then it should be
> + * fine. If not, then a correction action must happen shortly after it starts
> + * running, hence we treat it as !fit.
> + *
> + * TODO: TICK is not actually accurate enough. balance_interval is the correct
> + * one to use as the next load balance doesn't happen religiously at tick.
> + * Accessing balance_interval might be tricky and will require some refactoring
> + * first.
>   */

I understand that you want to have a more intelligent margin (depending
on the util value), but why do you want to use the time value of TICK_USEC
(or the balance_interval)?

We call fits_capacity() e.g. in wakeup and the next lb can just happen
immediately after it.

> -#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
> +static inline bool fits_capacity(unsigned long util, unsigned long capacity)
> +{
> +	return approximate_util_avg(util, TICK_USEC) < capacity;
> +}

[...]



* Re: [RFC PATCH 2/7] sched/pelt: Add a new function to approximate runtime to reach given util
  2023-09-06 12:56   ` Dietmar Eggemann
@ 2023-09-06 20:44     ` Dietmar Eggemann
  2023-09-06 21:38       ` Qais Yousef
  0 siblings, 1 reply; 64+ messages in thread
From: Dietmar Eggemann @ 2023-09-06 20:44 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Vincent Guittot
  Cc: linux-kernel, linux-pm, Lukasz Luba

On 06/09/2023 14:56, Dietmar Eggemann wrote:
> On 28/08/2023 01:31, Qais Yousef wrote:
>> It is basically the ramp-up time from 0 to a given value. Will be used
>> later to implement a new tunable to control response time for schedutil.
>>
>> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
>> ---
>>  kernel/sched/pelt.c  | 21 +++++++++++++++++++++
>>  kernel/sched/sched.h |  1 +
>>  2 files changed, 22 insertions(+)
>>
>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>> index 50322005a0ae..f673b9ab92dc 100644
>> --- a/kernel/sched/pelt.c
>> +++ b/kernel/sched/pelt.c
>> @@ -487,3 +487,24 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
>>  
>>  	return sa.util_avg;
>>  }
>> +
>> +/*
>> + * Approximate the required amount of runtime in ms required to reach @util.
>> + */
>> +u64 approximate_runtime(unsigned long util)
>> +{
>> +	struct sched_avg sa = {};
>> +	u64 delta = 1024; // period = 1024 = ~1ms
>> +	u64 runtime = 0;
>> +
>> +	if (unlikely(!util))
>> +		return runtime;
>> +
>> +	while (sa.util_avg < util) {
>> +		accumulate_sum(delta, &sa, 0, 0, 1);
>> +		___update_load_avg(&sa, 0);
>> +		runtime++;
>> +	}
>> +
>> +	return runtime;
>> +}
> 
> S_n = S_inv * (1 - 0.5^(t/hl))
> 
> t = hl * ln(1 - Sn/S_inv)/ln(0.5)
> 
> (1) for a little CPU (capacity_orig = 446)
> 
> t = 32ms * ln(1 - 446/1024)/ln(0.5)
> 
> t = 26ms
> 
> (2) for a big CPU (capacity = 1023, instead of 1024 since ln(0) is not
>     defined)
> 
> t = 32ms * ln(1 - 1023/1024)/ln(0.5)
> 
> t = 320ms

Forgot half of what I wanted to ask:

And you want to be able to have a schedutil interface:

/sys/devices/system/cpu/cpufreq/policy*/schedutil/response_time_ms

in which by default we have 26ms for a CPU with the capacity_orig of 446.

I.e. you want to have a time-based interface there? Which the user can
override, say with 52ms, and this then will lower the return value of
get_next_freq() so the system will respond slower?

And the time based interface is more intuitive than staying in the
capacity world of [0-1024]?




* Re: [RFC PATCH 5/7] sched/schedutil: Add a new tunable to dictate response time
  2023-08-27 23:32 ` [RFC PATCH 5/7] sched/schedutil: Add a new tunable to dictate response time Qais Yousef
@ 2023-09-06 21:13   ` Dietmar Eggemann
  2023-09-06 21:52     ` Qais Yousef
  2023-09-07 11:44   ` Peter Zijlstra
  1 sibling, 1 reply; 64+ messages in thread
From: Dietmar Eggemann @ 2023-09-06 21:13 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Vincent Guittot
  Cc: linux-kernel, linux-pm, Lukasz Luba

On 28/08/2023 01:32, Qais Yousef wrote:

[...]

> @@ -427,6 +427,23 @@ This governor exposes only one tunable:
>  	The purpose of this tunable is to reduce the scheduler context overhead
>  	of the governor which might be excessive without it.
>  
> +``response_time_ms``
> +	Amount of time (in milliseconds) required to ramp the policy from
> +	lowest to highest frequency. Can be decreased to speed up the
> +	responsiveness of the system, or increased to slow the system down in
> +	the hope of saving power. The best perf/watt will depend on the system
> +	characteristics and the dominant workload you expect to run. For
> +	userspace that has smart context on the type of workload running (like
> +	in Android), one can tune this to suit the demand of that workload.
> +
> +	Note that when slowing the response down, you can end up effectively
> +	chopping off the top frequencies for that policy as the util is capped
> +	to 1024. On HMP systems where some CPUs have a capacity less than 1024,

HMP isn't used in mainline AFAIK. IMHO, the term `asymmetric CPU
capacity` systems is used.

[...]

> @@ -59,6 +61,45 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>  
>  /************************ Governor internals ***********************/
>  
> +static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
> +{
> +	int cpu = cpumask_first(sg_policy->policy->cpus);
> +	unsigned long cap = capacity_orig_of(cpu);
> +
> +	return approximate_runtime(cap);
> +}

I can see the potential issue of schedutil being earlier initialized
than the `max frequency scaling of cpu_capacity_orig` happens in
drivers/base/arch_topology.c.

So the response_time_ms setup for a little CPU on Juno-r0 wouldn't
happen on cpu_capacity_orig = 446 -> 26ms but on the raw capacity
value from dt:

    capacity-dmips-mhz = <578>

So I would expect to see t = 32ms * ln(1 - 578/1024)/ln(0.5) = 38ms instead.

We have a similar dependency between `max frequency scaled
cpu_capacity_orig` and the EM setup code.

[...]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-06  9:18 ` [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Lukasz Luba
@ 2023-09-06 21:18   ` Qais Yousef
  2023-09-07  7:48     ` Lukasz Luba
  2023-09-07 13:26     ` Peter Zijlstra
  0 siblings, 2 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-06 21:18 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki, Ingo Molnar,
	Dietmar Eggemann, Vincent Guittot, Viresh Kumar, Peter Zijlstra

Hi Lukasz

On 09/06/23 10:18, Lukasz Luba wrote:
> Hi Qais,
> 
> On 8/28/23 00:31, Qais Yousef wrote:
> > Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25
> > margins applied in fits_capacity() and apply_dvfs_headroom().
> > 
> > As reported two years ago in
> > 
> > 	https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/
> > 
> > these values are not good fit for all systems and people do feel the need to
> > modify them regularly out of tree.
> 
> That is true, in Android kernel those are known 'features'. Furthermore,
> in my game testing it looks like higher margins do help to shrink
> number of dropped frames, while on other types of workloads (e.g.
> those that you have in the link above) the 0% shows better energy.

Do you keep margins high for all types of CPU? I think the littles are the
problematic ones, where higher margins help as this means you move away from
them quickly.

> 
> I remember also the results from MTK regarding the PELT HALF_LIFE
> 
> https://lore.kernel.org/all/0f82011994be68502fd9833e499749866539c3df.camel@mediatek.com/
> 
> The numbers for 8ms half_life where showing really nice improvement
> for the 'min fps' metric. I got similar with higher margin.
> 
> IMO we can derive quite important information from those different
> experiments:
> More sustainable workloads like "Yahoo browser" don't need margin.
> More unpredictable workloads like "Fortnite" (shooter game with 'open
> world') need some decent margin.

Yeah. So the point is that while we should have a sensible default, there
isn't a one size fits all. But the question is how the user/sysadmin should
control this? This series is what I propose of course :)

I also think the current forced/fixed margin values enforce a policy that is
clearly not a good default on many systems, with no alternative in hand but to
hack their own solutions.

> 
> The problem is that the periodic task can be 'noisy'. The low-pass

Hehe. That's because they're not really periodic ;-)

I think the model of a periodic task is not suitable for most workloads. All
of them are dynamic and how much they need to do at each wake up can vary
significantly over 10s of ms.

> filter which is our exponentially weighted moving avg PELT will
> 'smooth' the measured values. It will block sudden 'spikes' since
> they are high-frequency changes. Those sudden 'spikes' are
> the task activations where we need to compute a bit longer, e.g.
> there was explosion in the game. The 25% margin helps us to
> be ready for this 'noisy' task - the CPU frequency is higher
> (and capacity). So if a sudden need for longer computation
> is seen, then we have enough 'idle time' (~25% idle) to serve this
> properly and not lose the frame.
> 
> The margin helps in two ways for 'noisy' workloads:
> 1. in fits_capacity() to avoid a CPU which couldn't handle it
>    and prefers CPUs with higher capacity
> 2. it asks for longer 'idle time' e.g. 25-40% (depends on margin) to
>    serve sudden computation need
> 
> IIUC, your proposal is to:
> 1. extend the low-pass filter to some higher frequency, so we
>    could see those 'spikes' - that's the PELT HALF_LIFE boot
>    parameter for 8ms

That's another way to look at it, yes. We can control how reactive we'd like
the system to be for changes.

> 1.1. You are likely to have a 'gift' from the Util_est
>      which picks the max util_avg values and maintains them
>      for a while. That's why the 8ms PELT information can last longer
>      and you can get higher frequency and longer idle time.

This is probably controversial statement. But I am not in favour of util_est.
I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
default instead. But I will need to do a separate investigation on that.

> 2. Plumb in this new idea of dvfs_update_delay as the new
>    'margin' - this I don't understand
> 
> For the 2. I don't see that the dvfs HW characteristics are best
> for this problem purpose. We can have a really fast DVFS HW,
> but we need some decent spare idle time in some workloads, which
> are two independent issues IMO. You might get the higher
> idle time thanks to 1.1. but this is a 'side effect'.
> 
> Could you explain a bit more why this dvfs_update_delay is
> crucial here?

I'm not sure why you relate this to idle time. And the word margin is a bit
overloaded here. So I suppose you're referring to the one we have in
map_util_perf() or apply_dvfs_headroom(). And I suppose you assume this extra
headroom will result in idle time, but this is not necessarily true IMO.

My rationale is simply that DVFS based on util should follow util_avg as-is.
But as pointed out in discussions that happened elsewhere, we need to
provide headroom for this util to grow: if we were to be exact and the task
continues to run, then the util will likely go above the current OPP before we
get a chance to change it again. If we do have ideal hardware that takes
0 time to change frequency, then this headroom IMO is not needed because
frequency will follow us as util grows. Assuming here that util updates
instantaneously as the task continues to run.

So instead of a constant 25% headroom, I redefine this to be a function of the
hardware delay. If we take a decision now to choose which OPP, then it should
be based on util_avg value after taking into account how much it'll grow before
we take the next decision (which the dvfs_update_delay). We don't need any more
than that.

Maybe we need to take into account how often we call update_load_avg(). I'm not
sure about this yet.

If the user wants to have faster response time, then the new knobs are the way
to control that. But the headroom should be small enough to make sure we don't
overrun until the next decision point. Not less, and not more.
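
In code, the idea is roughly the below (a sketch of what patch 4 attempts; the
signature and the per-policy dvfs_update_delay plumbing are illustrative only):

	/*
	 * Sketch: rather than a fixed 1.25x, give util exactly the headroom
	 * it needs to keep growing until the next DVFS update can be issued.
	 */
	static unsigned long apply_dvfs_headroom(unsigned long util, u64 dvfs_update_delay)
	{
		return approximate_util_avg(util, dvfs_update_delay);
	}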


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value
  2023-09-06 12:56   ` Dietmar Eggemann
@ 2023-09-06 21:19     ` Qais Yousef
  2023-09-07 11:12       ` Dietmar Eggemann
  0 siblings, 1 reply; 64+ messages in thread
From: Qais Yousef @ 2023-09-06 21:19 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 09/06/23 14:56, Dietmar Eggemann wrote:
> On 28/08/2023 01:31, Qais Yousef wrote:
> > Given a util_avg value, the new function will return the future one
> > given a runtime delta.
> > 
> > This will be useful in later patches to help replace some magic margins
> > with more deterministic behavior.
> > 
> > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > ---
> >  kernel/sched/pelt.c  | 22 +++++++++++++++++++++-
> >  kernel/sched/sched.h |  3 +++
> >  2 files changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> > index 0f310768260c..50322005a0ae 100644
> > --- a/kernel/sched/pelt.c
> > +++ b/kernel/sched/pelt.c
> > @@ -466,4 +466,24 @@ int update_irq_load_avg(struct rq *rq, u64 running)
> >  
> >  	return ret;
> >  }
> > -#endif
> > +#endif /* CONFIG_HAVE_SCHED_AVG_IRQ */
> > +
> > +/*
> > + * Approximate the new util_avg value assuming an entity has continued to run
> > + * for @delta us.
> > + */
> > +unsigned long approximate_util_avg(unsigned long util, u64 delta)
> > +{
> > +	struct sched_avg sa = {
> > +		.util_sum = util * PELT_MIN_DIVIDER,
> > +		.util_avg = util,
> > +	};
> > +
> > +	if (unlikely(!delta))
> > +		return util;
> > +
> > +	accumulate_sum(delta, &sa, 0, 0, 1);
> 
> IMHO, you miss the handling of `periods != 0`. load = 0 eclipses this
> code in accumulate_sum().

Yes. For some reason I drew a blank on whether this codepath can
impact util_avg..

> 
> > +	___update_load_avg(&sa, 0);
> > +
> > +	return sa.util_avg;
> > +}
> 
> We already discussed something similar like this in Nov 22, the so
> called UTIL_EST_FASTER thing.
> 
> https://lkml.kernel.org/r/Y2kLA8x40IiBEPYg@hirez.programming.kicks-ass.net
> 
> +/*
> + * Compute a pelt util_avg assuming no history and @delta runtime.
> + */
> +unsigned long faster_est_approx(u64 delta)
> +{
> +	unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
> +	u64 periods = delta / 1024;
> +
> +	if (periods) {
> +		delta %= 1024;
> +		contrib = __accumulate_pelt_segments(periods, 1024, delta);
> +	}
> +
> +	return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
> +}
> +

I could look at using this version instead. This misses the decay part though?


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 2/7] sched/pelt: Add a new function to approximate runtime to reach given util
  2023-09-06 20:44     ` Dietmar Eggemann
@ 2023-09-06 21:38       ` Qais Yousef
  0 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-06 21:38 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 09/06/23 22:44, Dietmar Eggemann wrote:
> On 06/09/2023 14:56, Dietmar Eggemann wrote:
> > On 28/08/2023 01:31, Qais Yousef wrote:
> >> It is basically the ramp-up time from 0 to a given value. Will be used
> >> later to implement a new tunable to control response time for schedutil.
> >>
> >> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> >> ---
> >>  kernel/sched/pelt.c  | 21 +++++++++++++++++++++
> >>  kernel/sched/sched.h |  1 +
> >>  2 files changed, 22 insertions(+)
> >>
> >> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> >> index 50322005a0ae..f673b9ab92dc 100644
> >> --- a/kernel/sched/pelt.c
> >> +++ b/kernel/sched/pelt.c
> >> @@ -487,3 +487,24 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
> >>  
> >>  	return sa.util_avg;
> >>  }
> >> +
> >> +/*
> >> + * Approximate the amount of runtime in ms required to reach @util.
> >> + */
> >> +u64 approximate_runtime(unsigned long util)
> >> +{
> >> +	struct sched_avg sa = {};
> >> +	u64 delta = 1024; // period = 1024 = ~1ms
> >> +	u64 runtime = 0;
> >> +
> >> +	if (unlikely(!util))
> >> +		return runtime;
> >> +
> >> +	while (sa.util_avg < util) {
> >> +		accumulate_sum(delta, &sa, 0, 0, 1);
> >> +		___update_load_avg(&sa, 0);
> >> +		runtime++;
> >> +	}
> >> +
> >> +	return runtime;
> >> +}
> > 
> > S_n = S_inv * (1 - 0.5^(t/hl))
> > 
> > t = hl * ln(1 - Sn/S_inv)/ln(0.5)
> > 
> > (1) for a little CPU (capacity_orig = 446)
> > 
> > t = 32ms * ln(1 - 446/1024)/ln(0.5)
> > 
> > t = 26ms
> > 
> > > (2) for a big CPU (capacity = 1023 instead of 1024, since ln(0) is not
> > >     defined)
> > 
> > t = 32ms * ln(1 - 1023/1024)/ln(0.5)
> > 
> > t = 320ms
> 
> Forgot half of what I wanted to ask:
> 
> And you want to be able to have a schedutil interface:
> 
> /sys/devices/system/cpu/cpufreq/policy*/schedutil/response_time_ms
> 
> in which by default we have 26ms for a CPU with the capacity_orig of 446.

Note that this *is* the default. I'm just exposing it not really changing it :)

It is actually much less than that if you take into account the current 25%
headroom.

> 
> I.e. you want to have a time-based interface there? Which the user can
> overwrite, say with 52ms and this then will lower the return value of
> get_next_freq() so the system will respond slower?
> 
> And the time based interface is more intuitive than staying in the
> capacity world of [0-1024]?

Yes this is exactly how I am defining the interface :-) I think this is generic
and will give users what they need and hopefully should stand the test of time.

The slow down aspect has a limitation though, as I highlight in the cover
letter. I haven't figured out how to resolve it yet, or whether it's worth the
effort. If anyone has thoughts on that, that'd be useful to learn about. It
should be fixable though.

Generally perf first is not always the desired outcome. Power and thermal play
bigger roles in a lot of systems today, and I can see even the server market
paying more attention to them now.

Hence I didn't see why I should limit it to improving perf only and disregard
that there are situations where the system might be more concerned about power
or thermals; this could allow finer tuning than abruptly limiting max
frequencies. They just get harder to reach, so on average we get different
residencies (for the same workload), but the freqs are still reachable. There
will be a perf hit of course, but it's userspace's problem to decide whether
it's worth it or not.

Generally I am a big advocate of userspace taking the leap and being smarter
about what it needs and when. Fixing everything automagically has its appeal,
but I don't think this is sustainable anymore.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 3/7] sched/fair: Remove magic margin in fits_capacity()
  2023-09-06 14:38   ` Dietmar Eggemann
@ 2023-09-06 21:45     ` Qais Yousef
  0 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-06 21:45 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 09/06/23 16:38, Dietmar Eggemann wrote:
> On 28/08/2023 01:31, Qais Yousef wrote:
> > 80% margin is a magic value that has served its purpose for now, but it
> > no longer fits the variety of systems that exist today. If a system is over
> > powered specifically, this 80% will mean we leave a lot of capacity
> > unused before we decide to upmigrate on an HMP system.
> > 
> > The upmigration behavior should rely on the fact that a bad decision
> > made will need load balance to kick in to perform misfit migration. And
> > I think this is an adequate definition for what to consider as enough
> > headroom to consider whether a util fits capacity or not.
> > 
> > Use the new approximate_util_avg() function to predict the util if the
> > task continues to run for TICK_USEC. If the value is not strictly less
> > than the capacity, then it must not be placed there, i.e. considered
> > misfit.
> > 
> > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > ---
> >  kernel/sched/fair.c | 21 ++++++++++++++++++---
> >  1 file changed, 18 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0b7445cd5af9..facbf3eb7141 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -109,16 +109,31 @@ int __weak arch_asym_cpu_priority(int cpu)
> >  }
> >  
> >  /*
> > - * The margin used when comparing utilization with CPU capacity.
> > + * The util will fit the capacity if it has enough headroom to grow within the
> > + * next tick - which is when any load balancing activity happens to do the
> > + * correction.
> >   *
> > - * (default: ~20%)
> > + * If util stays within the capacity before tick has elapsed, then it should be
> > + * fine. If not, then a correction action must happen shortly after it starts
> > + * running, hence we treat it as !fit.
> > + *
> > + * TODO: TICK is not actually accurate enough. balance_interval is the correct
> > + * one to use as the next load balance doesn't happen religiously at tick.
> > + * Accessing balance_interval might be tricky and will require some refactoring
> > + * first.
> >   */
> 
> I understand that you want to have a more intelligent margin (depending
> on the util value) but why you want to use the time value of TICK_USEC
> (or the balance_interval)?
> 
> We call fits_capacity() e.g. in wakeup and the next lb can just happen
> immediately after it.

If it happens immediately, then the current values we're considering without
a margin are enough to make a correct decision. But in the worst case scenario,
if the task doesn't go to sleep shortly after and continues to run, then we'll
have to wait TICK_USEC for the lb to kick off again and handle a misfit lb.

So we only need to add margin (or headroom which I think is a better word) to
account for the fact that a worst case scenario is that the task will run for
a full tick on this CPU. And what I'm trying to say/do here is that as long as
the task doesn't grow beyond the capacity of the CPU within tick, then it's
fine for it to run there as it won't cause misfit or require misfit lb to run.

If the value goes beyond the capacity of the CPU before the end of the tick,
then this means the task had to run at a lower capacity for some time, which
is what we're trying to avoid IIUC.


Thanks!

--
Qais Yousef

> 
> > -#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
> > +static inline bool fits_capacity(unsigned long util, unsigned long capacity)
> > +{
> > +	return approximate_util_avg(util, TICK_USEC) < capacity;
> > +}
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 5/7] sched/schedutil: Add a new tunable to dictate response time
  2023-09-06 21:13   ` Dietmar Eggemann
@ 2023-09-06 21:52     ` Qais Yousef
  0 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-06 21:52 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 09/06/23 23:13, Dietmar Eggemann wrote:
> On 28/08/2023 01:32, Qais Yousef wrote:
> 
> [...]
> 
> > @@ -427,6 +427,23 @@ This governor exposes only one tunable:
> >  	The purpose of this tunable is to reduce the scheduler context overhead
> >  	of the governor which might be excessive without it.
> >  
> > +``response_time_ms``
> > +	Amount of time (in milliseconds) required to ramp the policy from
> > +	lowest to highest frequency. Can be decreased to speed up the
> > +	responsiveness of the system, or increased to slow the system down in
> > +	the hope of saving power. The best perf/watt will depend on the system
> > +	characteristics and the dominant workload you expect to run. For
> > +	userspace that has smart context on the type of workload running (like
> > +	in Android), one can tune this to suit the demand of that workload.
> > +
> > +	Note that when slowing the response down, you can end up effectively
> > +	chopping off the top frequencies for that policy as the util is capped
> > +	to 1024. On HMP systems where some CPUs have a capacity less than 1024,
> 
> HMP isn't used in mainline AFAIK. IMHO, the term `asymmetric CPU
> capacity` systems is used.

It's a shorter name, less of a mouthful, and easier to type; I think we should
start to use it :)

> 
> [...]
> 
> > @@ -59,6 +61,45 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> >  
> >  /************************ Governor internals ***********************/
> >  
> > +static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
> > +{
> > +	int cpu = cpumask_first(sg_policy->policy->cpus);
> > +	unsigned long cap = capacity_orig_of(cpu);
> > +
> > +	return approximate_runtime(cap);
> > +}
> 
> I can see the potential issue of schedutil being earlier initialized
> than the `max frequency scaling of cpu_capacity_orig` happens in
> drivers/base/arch_topology.c.
> 
> So the response_time_ms setup for a little CPU on Juno-r0 wouldn't
> happen on cpu_capacity_orig = 446 -> 26ms but on the raw capacity
> value from dt:
> 
>     capacity-dmips-mhz = <578>
> 
> So I would expect to see t = 32ms * ln(1 - 578/1024)/ln(0.5) = 38ms instead.
> 
> We have a similar dependency between `max frequency scaled
> cpu_capacity_orig` and the EM setup code.

Hmm thanks for the pointer! That might help explain why I see wrong values for
the big core in my setup.

Would using arch_scale_cpu_capacity() help instead? Or do I need to find a way
to plug the race?


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-06 21:18   ` Qais Yousef
@ 2023-09-07  7:48     ` Lukasz Luba
  2023-09-07 11:53       ` Peter Zijlstra
  2023-09-10 18:14       ` Qais Yousef
  2023-09-07 13:26     ` Peter Zijlstra
  1 sibling, 2 replies; 64+ messages in thread
From: Lukasz Luba @ 2023-09-07  7:48 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki, Ingo Molnar,
	Dietmar Eggemann, Vincent Guittot, Viresh Kumar, Peter Zijlstra



On 9/6/23 22:18, Qais Yousef wrote:
> Hi Lukasz
> 
> On 09/06/23 10:18, Lukasz Luba wrote:
>> Hi Qais,
>>
>> On 8/28/23 00:31, Qais Yousef wrote:
>>> Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25
>>> margins applied in fits_capacity() and apply_dvfs_headroom().
>>>
>>> As reported two years ago in
>>>
>>> 	https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/
>>>
>>> these values are not good fit for all systems and people do feel the need to
>>> modify them regularly out of tree.
>>
>> That is true, in Android kernel those are known 'features'. Furthermore,
>> in my game testing it looks like higher margins do help to shrink
>> number of dropped frames, while on other types of workloads (e.g.
>> those that you have in the link above) the 0% shows better energy.
> 
> Do you keep margins high for all types of CPU? I think the littles are the
> problematic ones, where higher margins help as this means you move away from
> them quickly.

That's true, for the Littles higher margins help to evacuate tasks
sooner. I have experiments showing good results with a 60% margin on
the Littles, while on Big & Mid 20%, 30%. The Littles also still have
tasks in cgroup cpumasks which are quite random, so they cannot
migrate, but do get a bit higher 'idle time' headroom.


> 
>>
>> I remember also the results from MTK regarding the PELT HALF_LIFE
>>
>> https://lore.kernel.org/all/0f82011994be68502fd9833e499749866539c3df.camel@mediatek.com/
>>
>> The numbers for 8ms half_life where showing really nice improvement
>> for the 'min fps' metric. I got similar with higher margin.
>>
>> IMO we can derive quite important information from those different
>> experiments:
>> More sustainable workloads like "Yahoo browser" don't need margin.
>> More unpredictable workloads like "Fortnite" (shooter game with 'open
>> world') need some decent margin.
> 
> Yeah. So the point is that while we should have a sensible default, there
> isn't a one size fits all. But the question is how the user/sysadmin should
> control this? This series is what I propose of course :)
> 
> I also think the current forced/fixed margin values enforce a policy that is
> clearly not a good default on many systems, with no alternative in hand but to
> hack their own solutions.

I see.

> 
>>
>> The problem is that the periodic task can be 'noisy'. The low-pass
> 
> Hehe. That's because they're not really periodic ;-)

They are periodic in a sense, they wake up every 16ms, but sometimes
they have more work. It depends what is currently going in the game
and/or sometimes the data locality (might not be in cache).

Although, that's for games, other workloads like youtube play or this
one 'Yahoo browser' (from your example) are more 'predictable' (after
the start up period). And I really like the potential energy saving
there :)

> 
> I think the model of a periodic task is not suitable for most workloads. All
> of them are dynamic and how much they need to do at each wake up can vary
> significantly over 10s of ms.

Might be true, the model was built a few years ago when there weren't
such dynamic game scenarios with high FPS on mobiles. This could still
be tuned with your new design IIUC (no need for extra hooks in Android).

> 
>> filter which is our exponentially weighted moving avg PELT will
>> 'smooth' the measured values. It will block sudden 'spikes' since
>> they are high-frequency changes. Those sudden 'spikes' are
>> the task activations where we need to compute a bit longer, e.g.
>> there was explosion in the game. The 25% margin helps us to
>> be ready for this 'noisy' task - the CPU frequency is higher
>> (and capacity). So if a sudden need for longer computation
>> is seen, then we have enough 'idle time' (~25% idle) to serve this
>> properly and not lose the frame.
>>
>> The margin helps in two ways for 'noisy' workloads:
>> 1. in fits_capacity() to avoid a CPU which couldn't handle it
>>     and prefers CPUs with higher capacity
>> 2. it asks for longer 'idle time' e.g. 25-40% (depends on margin) to
>>     serve sudden computation need
>>
>> IIUC, your proposal is to:
>> 1. extend the low-pass filter to some higher frequency, so we
>>     could see those 'spikes' - that's the PELT HALF_LIFE boot
>>     parameter for 8ms
> 
> That's another way to look at it, yes. We can control how reactive we'd like
> the system to be for changes.

Which make sense in context to what I said above (newer gaming).

> 
>> 1.1. You are likely to have a 'gift' from the Util_est
>>       which picks the max util_avg values and maintains them
>>       for a while. That's why the 8ms PELT information can last longer
>>       and you can get higher frequency and longer idle time.
> 
> This is probably controversial statement. But I am not in favour of util_est.
> I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
> default instead. But I will need to do a separate investigation on that.

I like util_est, sometimes it helps ;)

> 
>> 2. Plumb in this new idea of dvfs_update_delay as the new
>>     'margin' - this I don't understand
>>
>> For the 2. I don't see that the dvfs HW characteristics are best
>> for this problem purpose. We can have a really fast DVFS HW,
>> but we need some decent spare idle time in some workloads, which
>> are two independent issues IMO. You might get the higher
>> idle time thanks to 1.1. but this is a 'side effect'.
>>
>> Could you explain a bit more why this dvfs_update_delay is
>> crucial here?
> 
> I'm not sure why you relate this to idle time. And the word margin is a bit
> overloaded here. So I suppose you're referring to the one we have in
> map_util_perf() or apply_dvfs_headroom(). And I suppose you assume this extra
> headroom will result in idle time, but this is not necessarily true IMO.
> 
> My rationale is simply that DVFS based on util should follow util_avg as-is.
> But as pointed out in discussions that happened elsewhere, we need to
> provide headroom for this util to grow: if we were to be exact and the task
> continues to run, then the util will likely go above the current OPP before we
> get a chance to change it again. If we do have ideal hardware that takes

Yes, this is another requirement to have a +X% margin. When the tasks are
growing, we don't know their final util_avg and we give them a bit more
cycles.
IMO we always have to be ready for such a situation in the scheduler,
don't we?

> 0 time to change frequency, then this headroom IMO is not needed because
> frequency will follow us as util grows. Assuming here that util updates
> instantaneously as the task continues to run.
> 
> So instead of a constant 25% headroom, I redefine this to be a function of the
> hardware delay. If we take a decision now to choose which OPP, then it should
> be based on util_avg value after taking into account how much it'll grow before
> we take the next decision (which the dvfs_update_delay). We don't need any more
> than that.
> 
> Maybe we need to take into account how often we call update_load_avg(). I'm not
> sure about this yet.
> 
> If the user wants to have faster response time, then the new knobs are the way
> to control that. But the headroom should be small enough to make sure we don't
> overrun until the next decision point. Not less, and not more.

For ideal workloads (rt-app) or the 'calm' ones, yes, we could save energy
(as you pointed out with the 0% margin energy values). I do like this 10%
energy saving in some DoU scenarios. I couldn't catch the idea of
feeding the dvfs response information into this equation. We might
discuss this offline ;)

Cheers,
Lukasz

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value
  2023-09-06 21:19     ` Qais Yousef
@ 2023-09-07 11:12       ` Dietmar Eggemann
  2023-09-10 19:58         ` Qais Yousef
  0 siblings, 1 reply; 64+ messages in thread
From: Dietmar Eggemann @ 2023-09-07 11:12 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 06/09/2023 23:19, Qais Yousef wrote:
> On 09/06/23 14:56, Dietmar Eggemann wrote:
>> On 28/08/2023 01:31, Qais Yousef wrote:

[...]

>>> +/*
>>> + * Approximate the new util_avg value assuming an entity has continued to run
>>> + * for @delta us.
>>> + */
>>> +unsigned long approximate_util_avg(unsigned long util, u64 delta)
>>> +{
>>> +	struct sched_avg sa = {
>>> +		.util_sum = util * PELT_MIN_DIVIDER,
>>> +		.util_avg = util,
>>> +	};
>>> +
>>> +	if (unlikely(!delta))
>>> +		return util;
>>> +
>>> +	accumulate_sum(delta, &sa, 0, 0, 1);
>>
>> IMHO, you miss the handling of `periods != 0`. load = 0 eclipses this
>> code in accumulate_sum().

You could call accumulate_sum(delta, &sa, 1, 0, 1);
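
IOW the full helper would then look roughly like this (sketch only, and it
leaves aside the clock_pelt question below):

	unsigned long approximate_util_avg(unsigned long util, u64 delta)
	{
		struct sched_avg sa = {
			.util_sum = util * PELT_MIN_DIVIDER,
			.util_avg = util,
		};

		if (unlikely(!delta))
			return util;

		/* load = 1 so accumulate_sum() handles periods != 0 for util too */
		accumulate_sum(delta, &sa, 1, 0, 1);
		___update_load_avg(&sa, 0);

		return sa.util_avg;
	}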

> 
> Yes. For some reason I drew a blank on whether this codepath can
> impact util_avg..

Another thing ... I guess if you call accumulate_sum with delta the PELT
machinery assumes `delta = now - sa->last_update_time` which means you
would have to use `clock_pelt + TICK_USEC` as delta.
>>
>>> +	___update_load_avg(&sa, 0);
>>> +
>>> +	return sa.util_avg;
>>> +}
>>
>> We already discussed something similar like this in Nov 22, the so
>> called UTIL_EST_FASTER thing.
>>
>> https://lkml.kernel.org/r/Y2kLA8x40IiBEPYg@hirez.programming.kicks-ass.net
>>
>> +/*
>> + * Compute a pelt util_avg assuming no history and @delta runtime.
>> + */
>> +unsigned long faster_est_approx(u64 delta)
>> +{
>> +	unsigned long contrib = (unsigned long)delta; /* p == 0 -> delta < 1024 */
>> +	u64 periods = delta / 1024;
>> +
>> +	if (periods) {
>> +		delta %= 1024;
>> +		contrib = __accumulate_pelt_segments(periods, 1024, delta);
>> +	}
>> +
>> +	return (contrib << SCHED_CAPACITY_SHIFT) / PELT_MIN_DIVIDER;
>> +}
>> +
> 
> I could look at using this version instead. This misses the decay part though?

__accumulate_pelt_segments(periods, ...) decays the periods. But
obviously not the util you pass into approximate_util_avg().

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 4/7] sched: cpufreq: Remove magic 1.25 headroom from apply_dvfs_headroom()
  2023-08-27 23:32 ` [RFC PATCH 4/7] sched: cpufreq: Remove magic 1.25 headroom from apply_dvfs_headroom() Qais Yousef
@ 2023-09-07 11:34   ` Peter Zijlstra
  2023-09-10 19:23     ` Qais Yousef
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-07 11:34 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On Mon, Aug 28, 2023 at 12:32:00AM +0100, Qais Yousef wrote:
> Instead of the magical 1.25 headroom, use the new approximate_util_avg()
> to provide headroom based on the dvfs_update_delay; which is the period
> at which the cpufreq governor will send DVFS updates to the hardware.
> 
> Add a new percpu dvfs_update_delay that can be cheaply accessed whenever
> apply_dvfs_headroom() is called. We expect cpufreq governors that rely
> on util to drive its DVFS logic/algorithm to populate these percpu
> variables. schedutil is the only such governor at the moment.
> 
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  kernel/sched/core.c              |  3 ++-
>  kernel/sched/cpufreq_schedutil.c | 10 +++++++++-
>  kernel/sched/sched.h             | 25 ++++++++++++++-----------
>  3 files changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 602e369753a3..f56eb44745a8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -116,6 +116,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
>  
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> +DEFINE_PER_CPU_SHARED_ALIGNED(u64, dvfs_update_delay);

This makes no sense, why are you using SHARED_ALIGNED and thus wasting
an entire cacheline for the one variable?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 5/7] sched/schedutil: Add a new tunable to dictate response time
  2023-08-27 23:32 ` [RFC PATCH 5/7] sched/schedutil: Add a new tunable to dictate response time Qais Yousef
  2023-09-06 21:13   ` Dietmar Eggemann
@ 2023-09-07 11:44   ` Peter Zijlstra
  2023-09-10 19:25     ` Qais Yousef
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-07 11:44 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On Mon, Aug 28, 2023 at 12:32:01AM +0100, Qais Yousef wrote:
> +static inline unsigned long
> +sugov_apply_response_time(struct sugov_policy *sg_policy, unsigned long util)
> +{
> +	unsigned long mult;
> +
> +	if (sg_policy->freq_response_time_ms == sg_policy->tunables->response_time_ms)
> +		return util;
> +
> +	mult = sg_policy->freq_response_time_ms * SCHED_CAPACITY_SCALE;
> +	mult /=	sg_policy->tunables->response_time_ms;
> +	mult *= util;
> +
> +	return mult >> SCHED_CAPACITY_SHIFT;
> +}
> +
>  static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>  {
>  	s64 delta_ns;
> @@ -143,6 +184,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>  	unsigned int freq = arch_scale_freq_invariant() ?
>  				policy->cpuinfo.max_freq : policy->cur;
>  
> +	util = sugov_apply_response_time(sg_policy, util);
>  	freq = map_util_freq(util, freq, max);
>  
>  	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)

Urgh, so instead of caching the multiplier you keep computing what is
essentially a constant over and over and over and over again :/

That is, compute the whole 'freq_response_time_ms * SCHED_CAPACITY_SCALE
/ response_time_ms' thing *once*, when that file is written to, and then
reduce the whole thing to:

	return (freq_response_mult * util) >> SCHED_CAPACITY_SHIFT;

No need for that special case, no need for divisions, just go.
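
Something along these lines (sketch only; response_time_mult is a made-up
field name):

	/* When response_time_ms is written, after validating the new value: */
	tunables->response_time_mult = sg_policy->freq_response_time_ms *
				       SCHED_CAPACITY_SCALE /
				       tunables->response_time_ms;

	/* ... and sugov_apply_response_time() collapses to: */
	return (sg_policy->tunables->response_time_mult * util) >> SCHED_CAPACITY_SHIFT;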

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07  7:48     ` Lukasz Luba
@ 2023-09-07 11:53       ` Peter Zijlstra
  2023-09-07 13:06         ` Lukasz Luba
  2023-09-10 18:20         ` Qais Yousef
  2023-09-10 18:14       ` Qais Yousef
  1 sibling, 2 replies; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-07 11:53 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Qais Yousef, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Viresh Kumar

On Thu, Sep 07, 2023 at 08:48:08AM +0100, Lukasz Luba wrote:

> > Hehe. That's because they're not really periodic ;-)
> 
> They are periodic in a sense, they wake up every 16ms, but sometimes
> they have more work. It depends what is currently going in the game
> and/or sometimes the data locality (might not be in cache).
> 
> Although, that's for games, other workloads like youtube play or this
> one 'Yahoo browser' (from your example) are more 'predictable' (after
> the start up period). And I really like the potential energy saving
> there :)

So everything media is fundamentally periodic, you're hard tied to the
framerate / audio-buffer size etc..

Also note that the traditional periodic task model from the real-time
community has the notion of WCET, which completely covers this
fluctuation in frame-to-frame work, it only considers the absolute worst
case.

Now, practically, that stinks, esp. when you care about batteries, but
it does not mean these tasks are not periodic.

Many extentions to the periodic task model are possible, including
things like average runtime with bursts etc.. all have their trade-offs.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 11:53       ` Peter Zijlstra
@ 2023-09-07 13:06         ` Lukasz Luba
  2023-09-07 13:29           ` Peter Zijlstra
  2023-09-10 18:20         ` Qais Yousef
  1 sibling, 1 reply; 64+ messages in thread
From: Lukasz Luba @ 2023-09-07 13:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qais Yousef, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Viresh Kumar



On 9/7/23 12:53, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 08:48:08AM +0100, Lukasz Luba wrote:
> 
>>> Hehe. That's because they're not really periodic ;-)
>>
>> They are periodic in a sense, they wake up every 16ms, but sometimes
>> they have more work. It depends what is currently going in the game
>> and/or sometimes the data locality (might not be in cache).
>>
>> Although, that's for games, other workloads like youtube play or this
>> one 'Yahoo browser' (from your example) are more 'predictable' (after
>> the start up period). And I really like the potential energy saving
>> there :)
> 
> So everything media is fundamentally periodic, you're hard tied to the
> framerate / audio-buffer size etc..

Agree

> 
> Also note that the traditional periodic task model from the real-time
> community has the notion of WCET, which completely covers this
> fluctuation in frame-to-frame work, it only considers the absolute worst
> case.

That's a good point, the WCET here. IMO a shorter PELT, e.g. 8ms, allows us
to 'see' a bit more of that information: the worst case in the fluctuation of
a particular task. Then this 'seen' value is maintained in util_est
for a while. That's why (probably) I see better 95th and 99th percentile
numbers for frame rendering times.

> 
> Now, practically, that stinks, esp. when you care about batteries, but
> it does not mean these tasks are not periodic.

Totally agree they are periodic.

> 
> Many extentions to the periodic task model are possible, including
> things like average runtime with bursts etc.. all have their trade-offs.

Was that maybe proposed somewhere on LKML (the other models)?

I can recall one idea - WALT.
IIRC ~2016/2017 the WALT proposal got some discussion at conferences, but it
didn't get positive feedback [1].

I don't know if you remember those numbers back then, e.g. video 1080p
playback was using ~10% less energy... Those 10%-15% are still important
for us ;)

Regards,
Lukasz

[1] https://lore.kernel.org/all/1477638642-17428-1-git-send-email-markivx@codeaurora.org/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-08-27 23:31 [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Qais Yousef
                   ` (7 preceding siblings ...)
  2023-09-06  9:18 ` [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Lukasz Luba
@ 2023-09-07 13:08 ` Peter Zijlstra
  2023-09-08  0:17   ` Qais Yousef
  8 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-07 13:08 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:

> Equally recent discussion in PELT HALFLIFE thread highlighted the need for
> a way to tune system response time to achieve better perf, power and thermal
> characteristic for a given system
> 
> 	https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/
> 

> To further help tune the system, we introduce PELT HALFLIFE multiplier as
> a boot time parameter. This parameter has an impact on how fast we migrate, so
> should compensate for whoever needed to tune fits_capacity(); and it has great
> impact on default response_time_ms. Particularly it gives a natural faster rise
> time when the system gets busy, AND fall time when the system goes back to
> idle. It is coarse grain response control that can be coupled with finer grain
> control via schedutil's response_time_ms.

You're misrepresenting things... The outcome of that thread above was
that PELT halftime was not the primary problem. Specifically:

  https://lore.kernel.org/lkml/424e2c81-987d-f10e-106d-8b4c611768bc@arm.com/

mentions that the only thing that gaming nonsense cares about is DVFS
ramp-up.

None of the other PELT users mattered one bit.

Also, ISTR a fair amount of this was workload dependent. So a solution
that has per-task configurability -- like UTIL_EST_FASTER, seems more
suitable.


I'm *really* hesitant on adding all these mostly random knobs -- esp.
without strong justification -- which you don't present. You mostly seem
to justify things with: people do random hack, we should legitimize them
hacks.

Like the last time around, I want the actual problem explained. The
problem is not that random people on the internet do random things to
their kernel.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-06 21:18   ` Qais Yousef
  2023-09-07  7:48     ` Lukasz Luba
@ 2023-09-07 13:26     ` Peter Zijlstra
  2023-09-07 13:57       ` Lukasz Luba
  2023-09-10 18:46       ` Qais Yousef
  1 sibling, 2 replies; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-07 13:26 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Lukasz Luba, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Viresh Kumar

On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:

> This is probably controversial statement. But I am not in favour of util_est.
> I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
> default instead. But I will need to do a separate investigation on that.

I think util_est makes perfect sense, where PELT has to fundamentally
decay non-running / non-runnable tasks in order to provide a temporal
average, DVFS might be best served with a termporal max filter.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 13:06         ` Lukasz Luba
@ 2023-09-07 13:29           ` Peter Zijlstra
  2023-09-07 13:33             ` Lukasz Luba
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-07 13:29 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Qais Yousef, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Viresh Kumar

On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote:

> > Many extentions to the periodic task model are possible, including
> > things like average runtime with bursts etc.. all have their trade-offs.
> 
> Was that maybe proposed somewhere on LKML (the other models)?

RT literature mostly, methinks. Replacing WCET with a statistical model of
sorts is not uncommon, the argument goes that not everybody will have
their worst case at the same time and lows and highs can commonly cancel
out and this way we can cram a little more on the system.

Typically this is proposed in the context of soft-realtime systems.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 13:29           ` Peter Zijlstra
@ 2023-09-07 13:33             ` Lukasz Luba
  2023-09-07 13:38               ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: Lukasz Luba @ 2023-09-07 13:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qais Yousef, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Viresh Kumar



On 9/7/23 14:29, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote:
> 
>>> Many extentions to the periodic task model are possible, including
>>> things like average runtime with bursts etc.. all have their trade-offs.
>>
>> Was that maybe proposed somewhere on LKML (the other models)?
> 
> RT literature mostly, methinks. Replacing WCET with a statistical model of
> sorts is not uncommon, the argument goes that not everybody will have
> their worst case at the same time and lows and highs can commonly cancel
> out and this way we can cram a little more on the system.
> 
> Typically this is proposed in the context of soft-realtime systems.

Thanks Peter, I will dive into some books...

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 13:33             ` Lukasz Luba
@ 2023-09-07 13:38               ` Peter Zijlstra
  2023-09-07 13:45                 ` Lukasz Luba
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-07 13:38 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Qais Yousef, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Viresh Kumar,
	bristot, juri.lelli

On Thu, Sep 07, 2023 at 02:33:49PM +0100, Lukasz Luba wrote:
> 
> 
> On 9/7/23 14:29, Peter Zijlstra wrote:
> > On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote:
> > 
> > > > Many extentions to the periodic task model are possible, including
> > > > things like average runtime with bursts etc.. all have their trade-offs.
> > > 
> > > Was that maybe proposed somewhere on LKML (the other models)?
> > 
> > RT literature mostly, methinks. Replacing WCET with a statistical model of
> > sorts is not uncommon, the argument goes that not everybody will have
> > their worst case at the same time and lows and highs can commonly cancel
> > out and this way we can cram a little more on the system.
> > 
> > Typically this is proposed in the context of soft-realtime systems.
> 
> Thanks Peter, I will dive into some books...

I would look at academic papers, not sure any of that ever made it to
books, Daniel would know I suppose.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 13:38               ` Peter Zijlstra
@ 2023-09-07 13:45                 ` Lukasz Luba
  2023-09-08 12:51                   ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 64+ messages in thread
From: Lukasz Luba @ 2023-09-07 13:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qais Yousef, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Viresh Kumar,
	bristot, juri.lelli



On 9/7/23 14:38, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 02:33:49PM +0100, Lukasz Luba wrote:
>>
>>
>> On 9/7/23 14:29, Peter Zijlstra wrote:
>>> On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote:
>>>
>>>>> Many extentions to the periodic task model are possible, including
>>>>> things like average runtime with bursts etc.. all have their trade-offs.
>>>>
>>>> Was that maybe proposed somewhere on LKML (the other models)?
>>>
>>> RT literature mostly, methinks. Replacing WCET with a statistical model of
>>> sorts is not uncommon, the argument goes that not everybody will have
>>> their worst case at the same time and lows and highs can commonly cancel
>>> out and this way we can cram a little more on the system.
>>>
>>> Typically this is proposed in the context of soft-realtime systems.
>>
>> Thanks Peter, I will dive into some books...
> 
> I would look at academic papers, not sure any of that ever made it to
> books, Daniel would know I suppose.

Good hint, thanks!

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 13:26     ` Peter Zijlstra
@ 2023-09-07 13:57       ` Lukasz Luba
  2023-09-07 14:29         ` Peter Zijlstra
  2023-09-10 18:46       ` Qais Yousef
  1 sibling, 1 reply; 64+ messages in thread
From: Lukasz Luba @ 2023-09-07 13:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki, Ingo Molnar,
	Dietmar Eggemann, Vincent Guittot, Viresh Kumar, Qais Yousef



On 9/7/23 14:26, Peter Zijlstra wrote:
> On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
> 
>> This is probably controversial statement. But I am not in favour of util_est.
>> I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
>> default instead. But I will need to do a separate investigation on that.
> 
> I think util_est makes perfect sense, where PELT has to fundamentally
> decay non-running / non-runnable tasks in order to provide a temporal
> average, DVFS might be best served with a termporal max filter.
> 
> 

Since we are here...
Would you allow to have a configuration for
the util_est shifter: UTIL_EST_WEIGHT_SHIFT ?

I've found other values than '2' better in some scenarios. That helps
to prevent a big task to 'down' migrate from a Big CPU (1024) to some
Mid CPU (~500-700 capacity) or even Little (~120-300).


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 13:57       ` Lukasz Luba
@ 2023-09-07 14:29         ` Peter Zijlstra
  2023-09-07 14:42           ` Lukasz Luba
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-07 14:29 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki, Ingo Molnar,
	Dietmar Eggemann, Vincent Guittot, Viresh Kumar, Qais Yousef

On Thu, Sep 07, 2023 at 02:57:26PM +0100, Lukasz Luba wrote:
> 
> 
> On 9/7/23 14:26, Peter Zijlstra wrote:
> > On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
> > 
> > > This is probably controversial statement. But I am not in favour of util_est.
> > > I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
> > > default instead. But I will need to do a separate investigation on that.
> > 
> > I think util_est makes perfect sense, where PELT has to fundamentally
> > decay non-running / non-runnable tasks in order to provide a temporal
> > average, DVFS might be best served with a termporal max filter.
> > 
> > 
> 
> Since we are here...
> Would you allow to have a configuration for
> the util_est shifter: UTIL_EST_WEIGHT_SHIFT ?
> 
> I've found other values than '2' better in some scenarios. That helps
> to prevent a big task to 'down' migrate from a Big CPU (1024) to some
> Mid CPU (~500-700 capacity) or even Little (~120-300).

Larger values, I'm thinking you're after? Those would cause the new
contribution to weight less, making the function more smooth, right?

What task characteristic is tied to this? That is, this seems trivial to
modify per-task.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 14:29         ` Peter Zijlstra
@ 2023-09-07 14:42           ` Lukasz Luba
  2023-09-07 20:16             ` Peter Zijlstra
  2023-09-10 19:06             ` Qais Yousef
  0 siblings, 2 replies; 64+ messages in thread
From: Lukasz Luba @ 2023-09-07 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki, Ingo Molnar,
	Dietmar Eggemann, Vincent Guittot, Viresh Kumar, Qais Yousef



On 9/7/23 15:29, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 02:57:26PM +0100, Lukasz Luba wrote:
>>
>>
>> On 9/7/23 14:26, Peter Zijlstra wrote:
>>> On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
>>>
>>>> This is probably controversial statement. But I am not in favour of util_est.
>>>> I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
>>>> default instead. But I will need to do a separate investigation on that.
>>>
>>> I think util_est makes perfect sense, where PELT has to fundamentally
>>> decay non-running / non-runnable tasks in order to provide a temporal
>>> average, DVFS might be best served with a termporal max filter.
>>>
>>>
>>
>> Since we are here...
>> Would you allow to have a configuration for
>> the util_est shifter: UTIL_EST_WEIGHT_SHIFT ?
>>
>> I've found other values than '2' better in some scenarios. That helps
>> to prevent a big task to 'down' migrate from a Big CPU (1024) to some
>> Mid CPU (~500-700 capacity) or even Little (~120-300).
> 
> Larger values, I'm thinking you're after? Those would cause the new
> contribution to weight less, making the function more smooth, right?

Yes, smoother, because we only use the 'ewma' goodness for the decaying
part (not the rising part [1]).

> 
> What task characteristic is tied to this? That is, this seems trivial to
> modify per-task.

In particular Speedometer test and the main browser task, which reaches
~900util, but sometimes vanish and waits for other background tasks
to do something. In the meantime it can decay and wake-up on
Mid/Little (which can cause a penalty to score up to 5-10% vs. if
we pin the task to big CPUs). So, a longer util_est helps to avoid
at least very bad down migration to Littles...

[1] https://elixir.bootlin.com/linux/v6.5.1/source/kernel/sched/fair.c#L4442
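
(For context, a rough standalone sketch of what the shift controls, following
the util_est EWMA update at [1]; with UTIL_EST_WEIGHT_SHIFT = 2 a new sample
gets a 1/4 weight, so a larger shift weights new samples less and decays more
slowly:)

	static unsigned long util_est_ewma_update(unsigned long ewma,
						  unsigned long enqueued,
						  unsigned int shift)
	{
		long diff = (long)enqueued - (long)ewma;

		/* Increases are taken directly (the UTIL_EST_FASTUP behaviour). */
		if (diff >= 0)
			return enqueued;

		/* ewma(t) = w * enqueued + (1 - w) * ewma(t-1), with w = 1/2^shift */
		ewma <<= shift;
		ewma += diff;
		ewma >>= shift;

		return ewma;
	}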

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 14:42           ` Lukasz Luba
@ 2023-09-07 20:16             ` Peter Zijlstra
  2023-09-12 11:51               ` Lukasz Luba
  2023-09-10 19:06             ` Qais Yousef
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-07 20:16 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki, Ingo Molnar,
	Dietmar Eggemann, Vincent Guittot, Viresh Kumar, Qais Yousef

On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:

> > What task characteristic is tied to this? That is, this seems trivial to
> > modify per-task.
> 
> In particular Speedometer test and the main browser task, which reaches
> ~900util, but sometimes vanish and waits for other background tasks
> to do something. In the meantime it can decay and wake-up on
> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
> we pin the task to big CPUs). So, a longer util_est helps to avoid
> at least very bad down migration to Littles...

Do they do a few short activations (wakeup/sleeps) while waiting? That
would indeed completely ruin things since the EWMA thing is activation
based.

I wonder if there's anything sane we can do here...

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 13:08 ` Peter Zijlstra
@ 2023-09-08  0:17   ` Qais Yousef
  2023-09-08  7:40     ` Dietmar Eggemann
  2023-09-08 10:25     ` Peter Zijlstra
  0 siblings, 2 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-08  0:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On 09/07/23 15:08, Peter Zijlstra wrote:
> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:
> 
> > Equally recent discussion in PELT HALFLIFE thread highlighted the need for
> > a way to tune system response time to achieve better perf, power and thermal
> > characteristic for a given system
> > 
> > 	https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/
> > 
> 
> > To further help tune the system, we introduce PELT HALFLIFE multiplier as
> > a boot time parameter. This parameter has an impact on how fast we migrate, so
> > should compensate for whoever needed to tune fits_capacity(); and it has great
> > impact on default response_time_ms. Particularly it gives a natural faster rise
> > time when the system gets busy, AND fall time when the system goes back to
> > idle. It is coarse grain response control that can be coupled with finer grain
> > control via schedutil's response_time_ms.
> 
> You're misrepresenting things... The outcome of that thread above was

Sorry if I did. My PoV might have gotten skewed. I'm certainly not intending
to mislead. I actually was hesitant about adding the PELT patch initially, but
it did feel that the two topics are connected. Margins are causing problems
because they end up wasting power, so there's a desire to slow the current
response down. But this PELT story wanted to speed things up. This polar
opposite is, I think, the distilled problem.

> that PELT halftime was not the primary problem. Specifically:
> 
>   https://lore.kernel.org/lkml/424e2c81-987d-f10e-106d-8b4c611768bc@arm.com/
> 
> mentions that the only thing that gaming nonsense cares about is DVFS
> ramp-up.
> 
> None of the other PELT users mattered one bit.

I actually latched onto Vincent's response that a boot time parameter makes sense.

Just to be clear, my main issue here is with the current hardcoded values of the
'margins'. The fact that they go too fast is my main problem.

The way I saw PELT fitting into this story is to help lower end systems that
don't have a lot of oomph. For a reasonably powerful system I don't see
a necessity to change this and DVFS is what matters, I agree.

It was my attempt to draw a full picture and cover the full spectrum. I don't
think PELT halflife plays a role in powerful systems. But for under-powered
ones, I think it will help; and that's why I was depicting it as coarse grain
control.

I think I did try to present similar arguments on that thread.

> 
> Also, ISTR a fair amount of this was workload dependent. So a solution
> that has per-task configurability -- like UTIL_EST_FASTER, seems more
> suitable.

But for the 0.8 and 1.25 margin problems, actually the problem is that 25% is
too aggressive/fast and wastes power. I'm actually slowing things down as
a result of this series. And I'm expecting some not to be happy about it on
their systems. The response_time_ms was my way to give back control. I didn't
see how I can make things faster and slower at the same time without making
decisions on behalf of the user/sysadmin.

So the connection I see between PELT and the margins or headrooms in
fits_capacity() and map_util_perf()/dvfs_headroom is that they expose the need
to manage the perf/power trade-off of the system.

In particular, the default is not good for modern systems: Cortex-X is too
powerful but we still operate within the same power and thermal budgets.

And what was a high end A78 is a mid core today. So if you look at today's
mobile world topology we really have a tiny+big+huge combination of cores.
The bigs are called mids, but they're very capable. fits_capacity() forces
migration to the 'huge' cores too soon with that 80% margin, while the 80%
might be too small for the tiny ones, as some workloads really struggle there
if they hang on for too long. It doesn't help that these systems ship with
a 4ms tick. Something more to consider changing I guess.
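
To put numbers on it, the current check is roughly the below (sketch of the
mainline macro; the capacities are just an illustrative example):

        /* fits_capacity(): util must stay below ~80% of the CPU's capacity */
        #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

        /*
         * e.g. a task at util 640 on a mid CPU of capacity 768:
         *
         *      640 * 1280 = 819200  >  768 * 1024 = 786432
         *
         * so it is flagged as misfit and pushed towards the huge cores
         * although 640 is still well below 768.
         */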

And the 25% headroom forces a near max frequency to be used when the workload
is happily hovering in the 750 region. I did force the frequency to be lower
and the workload still ran happily - we don't need the extra 25% headroom
enforced unconditionally.
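
Same exercise on the DVFS side (sketching the mainline helper; the numbers
are just an example):

        /* dvfs headroom: request capacity for util + 25% */
        static inline unsigned long map_util_perf(unsigned long util)
        {
                return util + (util >> 2);
        }

        /*
         * A workload hovering at util ~750 on a 1024-capacity CPU:
         *
         *      map_util_perf(750) = 750 + 187 = 937   (~92% of 1024)
         *
         * i.e. schedutil requests a near-max OPP although the task only
         * occupies ~73% of the CPU.
         */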

UTIL_EST_FASTER moves in one direction. And it's a constant response too, no?
I didn't get the per-task configurability part. AFAIU we can't turn off these
sched-features if they end up causing power issues. That's what makes me
hesitant about them. There's a bias towards perf. But some systems prefer to
save power at the expense of perf. There's a lot of grey area in between in
what's perceived as a suitable trade-off for perf vs power. There are cases
like the above where you can actually lower freqs without a hit on perf. But
most of the time it's a trade-off; and some do decide to drop perf in favour
of power. Keep in mind battery capacity differs even between systems with the
same SoC. Some ship to enable more perf, others are more constrained and opt
to be more efficient.

Sorry I didn't explain things well in the cover letter.

> I'm *really* hesitant on adding all these mostly random knobs -- esp.
> without strong justification -- which you don't present. You mostly seem
> to justify things with: people do random hack, we should legitimize them
> hacks.

I share your sentiment and I am trying to find out what's the right thing to do
really. I am open to explore other territories. But from what I see there's
a real need to give users the power to tune how responsive their system needs
to be. I can't see how we can have one size that fits all here given the
different capabilities of the systems and the desired outcome (I want more perf
vs more efficiency).

> Like the last time around, I want the actual problem explained. The
> problem is not that random people on the internet do random things to
> their kernel.

The problem is that those 0.8 and 1.25 margins force an unsuitable default.
The case I see the most is that they cause wasted power, and tuning them down
regains this power at no perf cost, or a small one. Others actually do tune
them for faster response, but I can't cover this case in detail. All I know
is lower end systems do struggle as they don't have enough oomph. I also saw
a comparison on Phoronix where schedutil is still not doing as well - which
tells me it seems server systems do prefer to ramp up faster too. I think that
PELT thread is a variation of the same problem.

So one of the things I saw is a workload that spends the majority of its time
in the 600-750 util_avg range and rarely ramps up to max. But the workload
under-uses the medium cores and runs at a lot higher freqs than it really
needs on the bigs. We don't end up utilizing our resources properly.

Happy to go and dig for more data/info if this is not good enough :)

There's a question that I'm struggling with, if I may ask. Why is our constant
response time (practically ~200ms to go from 0 to max) perceived as a good
fit for all use cases? The capability of systems differs widely in terms of
what performance you get at, say, a util of 512. Or in other words, how much
work is done in a unit of time differs between systems, but we still represent
that work in a constant way. A task that ran for 10ms on a powerful System A
would have done a lot more work than one running on a poor System B for the
same 10ms. But util will still rise the same in both cases. If someone wants
to allow this task to be able to do more in those 10ms, it seems natural to be
able to control this response time. It seems this thinking is flawed for some
reason and I'd appreciate help to understand why. I think a lot of us perceive
this problem this way.
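
(For reference, the ~200ms figure is just the PELT geometric series with the
default 32ms halflife; a quick standalone approximation, ignoring the discrete
1ms windows:)

        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
                double halflife = 32.0; /* default PELT halflife in ms */

                /* util of a task running flat out starting from util 0:
                 * util(t) ~= 1024 * (1 - 2^(-t/halflife)) */
                for (int t = 0; t <= 200; t += 50)
                        printf("t=%3dms util~=%4.0f\n",
                               t, 1024.0 * (1.0 - pow(2.0, -t / halflife)));
                return 0;
        }

        /* prints roughly 0, 677, 906, 984, 1011 -> close to max at ~200ms */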

Hopefully uclamp will help address these issues in a better way. ADPF gives
apps a way to access it reasonably now. Unity announced support for ADPF, so
hopefully games and other workloads can learn to be smarter over time. But the
spectrum of workloads to cover is still big, and adoption will take time. And
there are still lessons to be learnt and improvements to make. I expect this
effort to take time before it's the norm.  And thinking of desktop systems;
distros like Debian for example still don't enable uclamp by default on their
kernels. I sent a request asking to enable it and it got added to the
wishlist.. Actually even schedutil is not enabled by default on my Pine64
running Armbian nor on my Mac Mini with M1 chip running Asahi Linux Ubuntu.
You'd think big.LITTLE systems should have EAS written all over them, but I'm
not sure if this is an accidental omission or ondemand is actually perceived
as better. I think my Intel systems also still don't run schedutil by default.
They're asleep and I'm not going to wake them up over LAN now to double check
though (yep, saving power :D).

Happy to go and try to dig more info if this is still not clear enough.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-08  0:17   ` Qais Yousef
@ 2023-09-08  7:40     ` Dietmar Eggemann
  2023-09-08 14:07       ` Qais Yousef
  2023-09-08 10:25     ` Peter Zijlstra
  1 sibling, 1 reply; 64+ messages in thread
From: Dietmar Eggemann @ 2023-09-08  7:40 UTC (permalink / raw)
  To: Qais Yousef, Peter Zijlstra
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	linux-kernel, linux-pm, Lukasz Luba

On 08/09/2023 02:17, Qais Yousef wrote:
> On 09/07/23 15:08, Peter Zijlstra wrote:
>> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:

[...]

> But for the 0.8 and 1.25 margin problems, actually the problem is that 25% is
> too aggressive/fast and wastes power. I'm actually slowing things down as
> a result of this series. And I'm expecting some not to be happy about it on
> their systems. The response_time_ms was my way to give back control. I didn't
> see how I can make things faster and slower at the same time without making
> decisions on behalf of the user/sysadmin.
> 
> So the connection I see between PELT and the margins or headrooms in
> fits_capacity() and map_util_perf()/dvfs_headroom is that they expose the need
> to manage the perf/power trade-off of the system.
> 
> Particularly the default is not good for the modern systems, Cortex-X is too
> powerful but we still operate within the same power and thermal budgets.
> 
> And what was a high end A78 is a mid core today. So if you look at today's
> mobile world topology we really have a tiny+big+huge combination of cores. The
> bigs are called mids, but they're very capable. Fits capacity forces migration
> to the 'huge' cores too soon with that 80% margin. While the 80% might be too
> small for the tiny ones as some workloads really struggle there if they hang on
> for too long. It doesn't help that these systems ship with 4ms tick. Something
> more to consider changing I guess.

If this is the problem then you could simply make the margin (headroom)
a function of cpu_capacity_orig?
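
Something along these lines, purely illustrative (the constants are made up,
not a tested proposal):

        /*
         * Hypothetical: scale the headroom with the CPU's original capacity,
         * so little CPUs keep a large margin while the biggest CPUs get
         * almost none.
         */
        static inline unsigned long fits_margin(int cpu)
        {
                unsigned long cap = arch_scale_cpu_capacity(cpu);

                /* ~1.22x at cap=128, ~1.06x at cap=768, 1.0x at cap=1024 */
                return SCHED_CAPACITY_SCALE +
                       ((SCHED_CAPACITY_SCALE - cap) >> 2);
        }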

[...]

> There's a question that I'm struggling with if I may ask. Why is it perceived
> our constant response time (practically ~200ms to go from 0 to max) as a good
> fit for all use cases? Capability of systems differs widely in terms of what
> performance you get at say a util of 512. Or in other words how much work is
> done in a unit of time differs between system, but we still represent that work
> in a constant way. A task ran for 10ms on powerful System A would have done

PELT (util_avg) is uarch & frequency invariant.

So e.g. a task with util_avg = 256 could have a runtime/period

on big CPU (capacity = 1024) of 4ms/16ms

on little CPU (capacity = 512) of 8ms/16ms

The amount of work is invariant (so we can compare between asymmetric
capacity CPUs) but the runtime obviously differs according to the capacity.
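
(Roughly, the invariance comes from scaling the elapsed time before it is
accumulated into the PELT sums; sketching what kernel/sched/pelt does, not
the exact code:)

        /* scale the wall-clock delta by uarch capacity and current freq */
        delta = cap_scale(delta, arch_scale_cpu_capacity(cpu));  /* uarch */
        delta = cap_scale(delta, arch_scale_freq_capacity(cpu)); /* freq  */

        /*
         * So the 8ms on the little (capacity 512) and the 4ms on the big
         * (capacity 1024) both accumulate ~4ms of invariant time, hence the
         * same util_avg, while the observed runtime differs.
         */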

[...]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-08  0:17   ` Qais Yousef
  2023-09-08  7:40     ` Dietmar Eggemann
@ 2023-09-08 10:25     ` Peter Zijlstra
  2023-09-08 13:33       ` Qais Yousef
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-08 10:25 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On Fri, Sep 08, 2023 at 01:17:25AM +0100, Qais Yousef wrote:

> Just to be clear, my main issue here with the current hardcoded values of the
> 'margins'. And the fact they go too fast is my main problem.

So I stripped the whole margin thing from my reply because I didn't want
to comment on that yet, but yes, I can see how those might be a problem,
and you're changing them into something dynamic, not just removing them.

The tunables is what I worry most about. The moment we expose knobs it
becomes really hard to change things later.

> UTIL_EST_FASTER moves in one direction. And it's a constant response too, no?

The idea of UTIL_EST_FASTER was that we run a PELT sum on the current
activation runtime, all runtime since wakeup and take the max of this
extra sum and the regular thing.

On top of that this extra PELT sum can/has a time multiplier and thus
ramps up faster (this multiplies could be per task). Nb.:

	util_est_fast = faster_est_approx(delta * 2);

is a state-less expression -- by making

	util_est_fast = faster_est_approx(delta * curr->se.faster_mult);

only the current task is affected.
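
To sketch the idea (illustrative only -- runtime_since_wakeup() is a made-up
helper name here, the real prototype lives in the UTIL_EST_FASTER patches):

        delta = runtime_since_wakeup(p);  /* runtime of this activation */
        util_est_fast = faster_est_approx(delta * p->se.faster_mult);

        /*
         * max filter: with faster_mult <= 1 the extra sum can never beat the
         * regular estimate, so the feature is effectively off for that task.
         */
        util = max(task_util_est(p), util_est_fast);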

> I didn't get the per-task configurability part. AFAIU we can't turn off these
> sched-features if they end up causing power issues. That what makes me hesitant
> about them. 

See above, the extra sum is (fundamentally) per task, the multiplier
could be per task, if you set the multiplier to <=1, you'll never gain on
the existing sum and the max filter makes that the feature is
effectively disabled for the one task.

It of course gets us the problem of how to set the new multiplier... ;-)

> There's a bias towards perf. But some systems prefer to save power
> at the expense of perf. There's a lot of grey areas in between to what
> perceived as a suitable trade-off for perf vs power. There are cases like above
> where actually you can lower freqs without hit on perf. But most of the time
> it's a trade-off; and some do decide to drop perf in favour of power. Keep in
> mind battery capacity differs between systems with the same SoC even. Some ship
> to enable more perf, others are more constrained and opt to be more efficient.

It always depends on the workload too -- you want different trade-offs
for different tasks.

> > I'm *really* hesitant on adding all these mostly random knobs -- esp.
> > without strong justification -- which you don't present. You mostly seem
> > to justify things with: people do random hack, we should legitimize them
> > hacks.
> 
> I share your sentiment and I am trying to find out what's the right thing to do
> really. I am open to explore other territories. But from what I see there's
> a real need to give users the power to tune how responsive their system needs
> to be. I can't see how we can have one size that fits all here given the
> different capabilities of the systems and the desired outcome (I want more perf
> vs more efficiency).

This is true; but we also cannot keep adding random knobs. Knobs that
are very specific are hard constraints we've got to live with. Take for
instance uclamp, that's not something we can ever get rid of I think
(randomly picking on uclamp, not saying I'm hating on it).

From an actual interface POV, the unit-less generic energy-vs-perf knob
is of course ideal, one global and one per task and then we can fill out
the details as we see fit.  System integrators (you say users, but
really, not a single actual user will use any of this) can muck about
and see what works for them.

(even hardware has these things today, we get 0-255 values that do
'something' uarch specific)

> The problem is that those 0.8 and 1.25 margins forces unsuitable default. The
> case I see the most is it is causing wasting power that tuning it down regains
> this power at no perf cost or small one. Others actually do tune it for faster
> response, but I can't cover this case in detail. All I know is lower end
> systems do struggle as they don't have enough oomph. I also saw comparison on
> phoronix where schedutil is not doing as good still - which tells me it seems
> server systems do prefer to ramp up faster too. I think that PELT thread is
> a variation of the same problem.
> 
> So one of the things I saw is a workload where it spends majority of the time
> in 600-750 util_avg range. Rarely ramps up to max. But the workload under uses
> the medium cores and runs at a lot higher freqs than it really needs on bigs.
> We don't end up utilizing our resources properly.

So that is actually a fairly solid argument for changing things up, if
the margin causes us to neglect mid cores then that needs fixing. But I
don't think that means we need a tunable. After all, the system knows it
has 3 capacities, it just needs to be better at mapping workloads to
them.

It knows how much 'room' there is between a mid and a large.
If 1.25*mid > large we're in trouble etc..

> There's a question that I'm struggling with if I may ask. Why is it perceived
> our constant response time (practically ~200ms to go from 0 to max) as a good
> fit for all use cases? Capability of systems differs widely in terms of what
> performance you get at say a util of 512. Or in other words how much work is
> done in a unit of time differs between system, but we still represent that work
> in a constant way. A task ran for 10ms on powerful System A would have done
> a lot more work than running on poor System B for the same 10ms. But util will
> still rise the same for both cases. If someone wants to allow this task to be
> able to do more on those 10ms, it seems natural to be able to control this
> response time. It seems this thinking is flawed for some reason and I'd
> appreciate a help to understand why. I think a lot of us perceive this problem
> this way.

I think part of the problem is that today's servers are tomorrow's
smartphones. Back when we started all this PELT nonsense computers in
general were less powerful than they are now, yet today's servers are no
less busy than they were back then.

Give us compute, we'll fill it.

Now, smartphones in particular are media devices, but a large part of
the server farms are indirectly interactive too, you don't want your
search query to take too long, or your bookface page stuck loading, or
you twatter message about your latest poop not being insta read by your
mates.

That is, much of what we do with the computers, ever more powerful or
not, is eventually measured in human time perception.

So yeah, 200ms.

Remember, all this PELT nonsense was created for cgroups, to distribute
shares between CPUs as load demanded. I think for that purpose it still
sorta makes sense.

Ofc we've added a few more users over time, because if you have this
data, might as well use it etc. I'm not sure we really sat down and
analyzed if the timing all made sense.

And as I argued elsewhere, PELT is a running average, but DVFS might be
better suited with a max filter.

> Happy to go and try to dig more info if this is still not clear enough.

So I'm not generally opposed to changing things -- but I much prefer to
have an actual problem driving that change :-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 13:45                 ` Lukasz Luba
@ 2023-09-08 12:51                   ` Daniel Bristot de Oliveira
  2023-09-12 11:57                     ` Lukasz Luba
  0 siblings, 1 reply; 64+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-09-08 12:51 UTC (permalink / raw)
  To: Lukasz Luba, Peter Zijlstra
  Cc: Qais Yousef, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Viresh Kumar,
	juri.lelli

On 9/7/23 15:45, Lukasz Luba wrote:
>>>> RT literatur mostly methinks. Replacing WCET with a statistical model of
>>>> sorts is not uncommon, the argument goes that not everybody will have
>>>> their worst case at the same time and lows and highs can commonly cancel
>>>> out and this way we can cram a little more on the system.
>>>>
>>>> Typically this is proposed in the context of soft-realtime systems.
>>>
>>> Thanks Peter, I will dive into some books...
>>
>> I would look at academic papers, not sure any of that ever made it to
>> books, Daniel would know I suppose.
> 
> Good hint, thanks!

The key-words that came to my mind are:

	- mk-firm, where you accept m tasks will make their deadline
	           every k execution - like, because you run too long.
	- mixed criticality with pWCET (probabilistic execution time) or
		  average execution time + an sporadic tail execution time for
		  the low criticality part.

mk-firm smells like 2005's.. mixed criticality as 2015's..present.

You will probably find more papers than books. Read the papers
as a source for inspiration... not necessarily as a definitive
solution. They generally proposed too restrictive task models.

-- Daniel


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-08 10:25     ` Peter Zijlstra
@ 2023-09-08 13:33       ` Qais Yousef
  2023-09-08 13:58         ` Peter Zijlstra
                           ` (2 more replies)
  0 siblings, 3 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-08 13:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On 09/08/23 12:25, Peter Zijlstra wrote:
> On Fri, Sep 08, 2023 at 01:17:25AM +0100, Qais Yousef wrote:
> 
> > Just to be clear, my main issue here with the current hardcoded values of the
> > 'margins'. And the fact they go too fast is my main problem.
> 
> So I stripped the whole margin thing from my reply because I didn't want
> to comment on that yet, but yes, I can see how those might be a problem,
> and you're changing them into something dynamic, not just removing them.

The main difficulty is that if you try to apply those patches on their own, I'm
sure you'll notice a difference. So if we were to take this alone and put them
on linux-next; I expect a few regression reports for those who run with
schedutil. Any ST oriented workload will not be happy. But if we compensate to
reduce the regression, my problem will re-appear, just for a different reason.
So whack-a-mole.

I didn't know how to make both happy without being dynamic, hence the RFC to
hopefully get some help and insights on how to resolve this. I think I'm
hovering around the right solutions, but not sure if I'm there yet. Some
implementation details certainly still need ironing out.

I genuinely think that we should be more conservative in adding those hardcoded
numbers without making them a function of real limitation.

The TEO util threshold for instance has a similar problem to these margins.
I backported it to 5.10 and 5.15 and not long after I had to introduce knobs
to allow tuning it as power regression reports surfaced. The good news is it
wasn't a full revert; the bad news is those numbers seemed best for a class of
workloads on a particular system, but on another system with different
workloads the reality will be different. And of course because Android has out
of tree patches, I need to spend a good amount of time before I can report
back properly to ensure the root cause is identified correctly. I will risk
a potentially incorrect statement, but I do question the validity of these
hardcoded numbers on all systems and all workloads.

I am not sure we can avoid being dynamic; and personally I prefer to delegate
more to userspace and make it their problem to manage this dynamism. But by
providing the right tools of course :) I think they need to earn their best
perf/watt too; not let the kernel do all the dirty work, hehe.

> The tunables is what I worry most about. The moment we expose knobs it
> becomes really hard to change things later.

I'm not particularly attached to them to be honest. But at the same time I am
not sure if we can get away without giving the user some power to decide. What
I propose is what seemed the most sensible way to do it. But I'm really open
to exploring alternatives and I do indeed need help to find them.

Generally; I think userspace expects too much automagic and the programming
model is ancient and not portable and we end up overcompensating for that in
the kernel. So giving them some power is necessary the way I see it, but the
shape and form it should take is debatable for sure. I don't claim to have the
right answer but happy to explore and experiment to get the right ones
identified and done :-)

> 
> > UTIL_EST_FASTER moves in one direction. And it's a constant response too, no?
> 
> The idea of UTIL_EST_FASTER was that we run a PELT sum on the current
> activation runtime, all runtime since wakeup and take the max of this
> extra sum and the regular thing.
> 
> On top of that this extra PELT sum can/has a time multiplier and thus
> ramps up faster (this multiplies could be per task). Nb.:
> 
> 	util_est_fast = faster_est_approx(delta * 2);
> 
> is a state-less expression -- by making
> 
> 	util_est_fast = faster_est_approx(delta * curr->se.faster_mult);
> 
> only the current task is affected.

Okay; maybe I didn't understand this fully and will go back and study it more.

Maybe the word faster is what makes me worried, as I really think faster is
not what people want on a class of systems; or at least a class of CPUs if you
think of HMP. Taming the beast is a more difficult problem on these systems.

So if I get it correctly; we will slow things down by removing these margins,
but people who suffer from this slow down will need to use util_est_faster to
regain the difference, right?

> 
> > I didn't get the per-task configurability part. AFAIU we can't turn off these
> > sched-features if they end up causing power issues. That what makes me hesitant
> > about them. 
> 
> See above, the extra sum is (fundamentally) per task, the multiplier
> could be per task, if you set the multiplier to <=1, you'll never gain on
> the existing sum and the max filter makes that the feature is
> effectively disabled for the one task.

Gotcha. I think this could work, but it also seems to overlap with what we
can get already with uclamp. If we can tell this task needs a faster
multiplier, we can tell that it needs better uclamp_min and do that instead?
When should we use one over the other if we add both?

The challenging bit in practice is when we need to get some generic auto
response for all these workloads that just expect the system to give them what
they want without collaboration. I really do hope we can provide an
alternative to make these expectations obsolete and just be able to tell
userspace: your app is not portable, go fix it; but we're not there yet.

And another selfish reason; analysing workloads is harder with these. We have
a lot of mechanisms on top of each other, and reasoning about the cause of
a power issue in particular becomes a lot harder when something goes wrong in
one of these and gets bubbled up in subtle ways. Perf issues tend to be more
obvious; but if something causes power or thermal problems, then finding out
whether there's sub-optimality is hard. And if I find one, fixing it will be
hard too.

> It of course gets us the problem of how to set the new multiplier... ;-)

I am actually trying to write a proposal for a generic QoS interface that we
can potentially plumb these things into (the main motivation is wake up
latency control with eevdf - but it seems you might be pushing something out
soon). My perception of reality is that userspace is stuck on an old
programming model and *a lot* of bad habits. But I think it is about time for
it to get smarter and more collaborative. And this necessitates that we give
it some mechanisms to enable this smarter approach.

> 
> > There's a bias towards perf. But some systems prefer to save power
> > at the expense of perf. There's a lot of grey areas in between to what
> > perceived as a suitable trade-off for perf vs power. There are cases like above
> > where actually you can lower freqs without hit on perf. But most of the time
> > it's a trade-off; and some do decide to drop perf in favour of power. Keep in
> > mind battery capacity differs between systems with the same SoC even. Some ship
> > to enable more perf, others are more constrained and opt to be more efficient.
> 
> It always depends on the workload too -- you want different trade-offs
> for different tasks.

Indeed. We are trying to push for better classification of workloads so that
we can tell with reasonable confidence that a certain trade-off is better for
them. What this really helps with is enabling better use of resources with the
pre-knowledge that the current user experience won't be impacted.

Again, I really would ultimately love to see userspace becoming smarter, and
to delegate to it the task of writing portable and scalable software that
works across systems without the need for guesswork and hand tuning. I think
we're making good steps in that direction, but we still need a lot more
effort.

> 
> > > I'm *really* hesitant on adding all these mostly random knobs -- esp.
> > > without strong justification -- which you don't present. You mostly seem
> > > to justify things with: people do random hack, we should legitimize them
> > > hacks.
> > 
> > I share your sentiment and I am trying to find out what's the right thing to do
> > really. I am open to explore other territories. But from what I see there's
> > a real need to give users the power to tune how responsive their system needs
> > to be. I can't see how we can have one size that fits all here given the
> > different capabilities of the systems and the desired outcome (I want more perf
> > vs more efficiency).
> 
> This is true; but we also cannot keep adding random knobs. Knobs that
> are very specific are hard constraints we've got to live with. Take for
> instance uclamp, that's not something we can ever get rid of I think
> (randomly picking on uclamp, not saying I'm hating on it).

I'm really open to exploring alternatives, but I need help to find them. I'm
also trying to simplify kernel responsibilities by delegating more to
userspace. It could be a personal mental hang-up, but I can't see how we can
have one size fits all. Almost all types of systems are expected to run a lot
of varying workloads, and both hardware and software are moving at a faster
pace, but the programming model is pretty much the same.

The response_time_ms in schedutil seemed a reasonable knob to me as it
directly tells the user how fast they ramp up for that policy. It can be set
once at boot, or if someone has knowledge about their workloads they can be
smart and find the best values for them on a particular system. The good news
for us in the kernel is that we won't care. uclamp for really smart per task
control, and this knob for some hand tuning for those who don't have
alternatives, is the way I see it.

> 
> From an actual interface POV, the unit-less generic energy-vs-perf knob

I can see this working for mobile as SoC vendors/OEMs can get energy data for
their systems and define these curves properly.

But the average Joe will lose out. For example the M1 Mac mini doesn't
actually have an energy model defined. I do have an energy meter so I hope to
be able to do some measurements, but I'm not sure if I can get accurate
numbers out.

x86 and other archs don't tend to produce energy-vs-perf curves as good as
the ones we tend to see in the mobile world (maybe they do and I'm just
ignorant, apologies if this ends up being a bad blanket statement).

Don't you think we could end up setting the bar too high for defining this
knob? It is less intuitive too, but this is less of a problem maybe.

> is of course ideal, one global and one per task and then we can fill out
> the details as we see fit.  System integrators (you say users, but

I can certainly look at that and it sounds reasonable to me, bar the issues
above about it requiring more effort; and a good class of Linux users might
not see these definitions on their systems as there are no real system
integrators for a large class of desktop/laptop systems. It'd be nice to make
the programming experience coherent and readily available, if possible.
I think these systems are losing out.

> really, not a single actual user will use any of this) can muck about
> and see what works for them.

Yes, I mean system integrators. I say users maybe because I think of desktops
too, where the integrator is the end user. I do hope to see more vendors ship
tuned Linux desktops/laptops like we see in the Android world. Servers
probably have an army of people managing them anyway.

> 
> (even hardware has these things today, we get 0-255 values that do
> 'something' uarch specific)

Ah, could I get some pointers please?

> 
> > The problem is that those 0.8 and 1.25 margins forces unsuitable default. The
> > case I see the most is it is causing wasting power that tuning it down regains
> > this power at no perf cost or small one. Others actually do tune it for faster
> > response, but I can't cover this case in detail. All I know is lower end
> > systems do struggle as they don't have enough oomph. I also saw comparison on
> > phoronix where schedutil is not doing as good still - which tells me it seems
> > server systems do prefer to ramp up faster too. I think that PELT thread is
> > a variation of the same problem.
> > 
> > So one of the things I saw is a workload where it spends majority of the time
> > in 600-750 util_avg range. Rarely ramps up to max. But the workload under uses
> > the medium cores and runs at a lot higher freqs than it really needs on bigs.
> > We don't end up utilizing our resources properly.
> 
> So that is actually a fairly solid argument for changing things up, if
> the margin causes us to neglect mid cores then that needs fixing. But I
> don't think that means we need a tunable. After all, the system knows it
> has 3 capacities, it just needs to be better at mapping workloads to
> them.

We can fix the misfit capacity check without a tunable, I believe. I just
know from past discussions that those low end systems like these margins to be
large. And the PELT boot time parameter is there to help address this
potential issue. Happy to leave it out and leave it to someone who cares to
come and complain. But from a theoretical point of view I can see the problem
of slow response on those systems. And capacities don't tell us whether this
is a high end SoC or a lower end SoC. Nor does util or anything else we have
in the system today, to my knowledge at least.

> 
> It knows how much 'room' there is between a mid and a large. If 1.25*mid

Ideally we should end up distributing on mids and bigs for the capacity region
that overlaps.

I do see that the need to have the margin is related to misfit migration and
we can fix it by improving the definition of this relationship. I'm not sure
if I implemented it in the best way, but I think the definition I'm proposing
makes sense and removes guesswork. If a task is at 600 and fits in both mids
and bigs, why should we skip the mids as candidates if no misfit can happen
before the next tick? If the current implementation is expensive I think I can
make it cheaper. But if no misfit can happen within a tick, I think we need to
consider those CPUs as candidates.

On a slightly related problem that I avoided bringing up, but maybe now is
a good time: I see the definition of overutilized as stale too. It is
a wrapper around fits_capacity(), i.e. misfit detection. It is very easy for
a single busy task to trigger overutilized. And if this task is background and
capped by cpuset to the little cores, then we end up overutilized until it
decides to go back to sleep. Not ideal. I think the definition needs
revisiting too, but I have no idea how yet. It should be more of a function of
the current system state rather than tightly coupled with misfit detection.
EAS is disabled when we're overutilized and the default spreading behaviour
can be expensive in terms of power.
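
For reference, the detection today is roughly the below (paraphrasing the
mainline code, uclamp handling omitted):

        /*
         * A single CPU whose util no longer 'fits' under ~80% of its
         * capacity marks the whole root domain as overutilized, which
         * disables EAS until it clears.
         */
        static inline int cpu_overutilized(int cpu)
        {
                return !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
        }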

> > large we in trouble etc..
> 
> > There's a question that I'm struggling with if I may ask. Why is it perceived
> > our constant response time (practically ~200ms to go from 0 to max) as a good
> > fit for all use cases? Capability of systems differs widely in terms of what
> > performance you get at say a util of 512. Or in other words how much work is
> > done in a unit of time differs between system, but we still represent that work
> > in a constant way. A task ran for 10ms on powerful System A would have done
> > a lot more work than running on poor System B for the same 10ms. But util will
> > still rise the same for both cases. If someone wants to allow this task to be
> > able to do more on those 10ms, it seems natural to be able to control this
> > response time. It seems this thinking is flawed for some reason and I'd
> > appreciate a help to understand why. I think a lot of us perceive this problem
> > this way.
> 
> I think part of the problem is that todays servers are tomorrow's
> smartphones. Back when we started all this PELT nonsense computers in
> general were less powerful than they are now, yet todays servers are no
> less busy than they were back then.
> 
> Give us compute, we'll fill it.

Hehe, yep!

> 
> Now, smartphones in particular are media devices, but a large part of
> the server farms are indirectly interactive too, you don't want your
> search query to take too long, or your bookface page stuck loading, or
> you twatter message about your latest poop not being insta read by your
> mates.
> 
> That is, much of what we do with the computers, ever more powerful or
> not, is eventually measured in human time perception.

Sadly I do think a lot of workloads make bad assumptions about hardware and
kernel services, and I think the past trend has been to compensate for this in
the kernel. But the true problem IMHO is that our current programming model is
stale and programs are carrying old bad habits that are no longer valid.

As a simple example, a lot have struggled with HMP systems as they were
assuming that if they have X cores then they can spawn X threads and do their
awesome parallel work.

They of course got caught out badly. They later used affinity to be smart
about which cores, but then as noted earlier the bigs are power hungry and now
they can easily end up with power and thermal issues because the past
assumptions are no longer true.

By the way, even the littles can be power hungry at top frequencies. So any
form of pinning causes problems. They just can't make assumptions. But what to
do instead then?

ADPF and uclamp are the way to address this and make software portable, and
that's what is being pushed for. But flushing these old habits out will take
time. Besides, I think we still have ironing-out work to do.

Generally, even on desktop/laptop/server, programmers seem to think they're
the only active app and get greedy when creating tasks, ending up tripping
over themselves. We need a smart middleware to manage these, or a new
programming model to abstract these details. I don't know how, but the status
quo is that the programming model is lagging behind.

I think Windows and macOS/iOS do provide some more tightly integrated
interfaces for apps; see Grand Central Dispatch for instance on Apple OSes.

> 
> So yeah, 200ms.
> 
> Remember, all this PELT nonsense was created for cgroups, to distribute
> shared between CPUs as load demanded. I think for that purpose it still
> sorta makes sense.
> 
> Ofc we've added a few more users over time, because if you have this
> data, might as well use it etc. I'm not sure we really sat down and
> analyzed if the timing all made sense.

I think if we want to distill the problem to its basic form, it's a timing
issue. Too fast, we lose power. Too slow, we lose perf. And we don't have
a way to scale perf per system, i.e. we don't know what absolute perf we end
up getting, and I'm not sure we can provide that at all without hardware
extensions. So that's why we end up scaling time, and end up with those
related knobs.

> 
> And as I argued elsewhere, PELT is a running average, but DVFS might be
> better suited with a max filter.

Sorry, I haven't caught up with all the other replies yet, but I will think
about how to incorporate all of that. I think the major issue is that we do
need to both speed up and slow down. And as long as we are able to achieve
that I'm really fine with exploring options. What I'm presenting here is what
truly seemed the best to me. But I need help and feedback to do better :-)

> 
> > Happy to go and try to dig more info if this is still not clear enough.
> 
> So I'm not generally opposed to changing things -- but I much prefer to
> have an actual problem driving that change :-)

Good to know. If you think the info I shared is still not good enough, I can
look for more examples. I think my main goal here is really to discuss the
problem, and my proposed solution is a way to demonstrate both the problem and
a possible way to fix it so I'm not just complaining, but actively looking for
fixes too :-) I don't claim to have all the right answers, but certainly happy
to follow this through to make sure we fix the problem properly. Hopefully not
just for me, but for all those who've been struggling with similar problems.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-08 13:33       ` Qais Yousef
@ 2023-09-08 13:58         ` Peter Zijlstra
  2023-09-08 13:59         ` Peter Zijlstra
  2023-09-10 21:17         ` Qais Yousef
  2 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-08 13:58 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On Fri, Sep 08, 2023 at 02:33:36PM +0100, Qais Yousef wrote:

> > > UTIL_EST_FASTER moves in one direction. And it's a constant response too, no?
> > 
> > The idea of UTIL_EST_FASTER was that we run a PELT sum on the current
> > activation runtime, all runtime since wakeup and take the max of this
> > extra sum and the regular thing.
> > 
> > On top of that this extra PELT sum can/has a time multiplier and thus
> > ramps up faster (this multiplies could be per task). Nb.:
> > 
> > 	util_est_fast = faster_est_approx(delta * 2);
> > 
> > is a state-less expression -- by making
> > 
> > 	util_est_fast = faster_est_approx(delta * curr->se.faster_mult);
> > 
> > only the current task is affected.
> 
> Okay; maybe I didn't understand this fully and will go back and study it more.
> 
> Maybe the word faster is what makes me worried as I really see faster is not
> what people want on a class of systems; or at least CPUs if you think of HMP.
> Taming the beast is a more difficult problem in this class of systems.

The 'faster' refers to the ramp-up, which was the issue identified in that
earlier thread. The game thing wanted to ramp up more aggressively.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-08 13:33       ` Qais Yousef
  2023-09-08 13:58         ` Peter Zijlstra
@ 2023-09-08 13:59         ` Peter Zijlstra
  2023-09-08 14:11           ` Qais Yousef
  2023-09-10 21:17         ` Qais Yousef
  2 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-08 13:59 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On Fri, Sep 08, 2023 at 02:33:36PM +0100, Qais Yousef wrote:

> > (even hardware has these things today, we get 0-255 values that do
> > 'something' uarch specific)
> 
> Ah, could I get some pointers please?

Intel HWP.EPP and AMD CPPC EPP I think.. both intel-pstate and
amd-pstate have EPP thingies.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-08  7:40     ` Dietmar Eggemann
@ 2023-09-08 14:07       ` Qais Yousef
  2023-09-12 17:18         ` Dietmar Eggemann
  0 siblings, 1 reply; 64+ messages in thread
From: Qais Yousef @ 2023-09-08 14:07 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 09/08/23 09:40, Dietmar Eggemann wrote:
> On 08/09/2023 02:17, Qais Yousef wrote:
> > On 09/07/23 15:08, Peter Zijlstra wrote:
> >> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:
> 
> [...]
> 
> > But for the 0.8 and 1.25 margin problems, actually the problem is that 25% is
> > too aggressive/fast and wastes power. I'm actually slowing things down as
> > a result of this series. And I'm expecting some not to be happy about it on
> > their systems. The response_time_ms was my way to give back control. I didn't
> > see how I can make things faster and slower at the same time without making
> > decisions on behalf of the user/sysadmin.
> > 
> > So the connection I see between PELT and the margins or headrooms in
> > fits_capacity() and map_util_perf()/dvfs_headroom is that they expose the need
> > to manage the perf/power trade-off of the system.
> > 
> > Particularly the default is not good for the modern systems, Cortex-X is too
> > powerful but we still operate within the same power and thermal budgets.
> > 
> > And what was a high end A78 is a mid core today. So if you look at today's
> > mobile world topology we really have a tiny+big+huge combination of cores. The
> > bigs are called mids, but they're very capable. Fits capacity forces migration
> > to the 'huge' cores too soon with that 80% margin. While the 80% might be too
> > small for the tiny ones as some workloads really struggle there if they hang on
> > for too long. It doesn't help that these systems ship with 4ms tick. Something
> > more to consider changing I guess.
> 
> If this is the problem then you could simply make the margin (headroom)
> a function of cpu_capacity_orig?

I don't see what you mean. Instead of capacity_of() but keep the 80%?

Again, I could be delusional and misunderstanding everything, but what
I really see fits_capacity() being about is misfit detection. But a task is
not really misfit until it actually has a util above the capacity of the CPU.
Now, due to implementation details, there can be a delay between the task
crossing this capacity and us being able to move it. Which is what I believe
this headroom is trying to cater for.

I think we can better define this by tying the headroom to the worst case time
it takes to actually move this misfit task to the right CPU. If the task can
continue to run through this delay without crossing the capacity of the CPU it
is on, then we should not trigger misfit IMO.
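
I.e. something along these lines (a sketch; approximate_util_avg() is the
helper proposed in patch 1 of this series, the exact signature is assumed
here):

        /*
         * Only flag misfit if the task would outgrow this CPU before the
         * next tick, i.e. before load balance gets a chance to move it.
         */
        future_util = approximate_util_avg(task_util(p), TICK_NSEC);
        misfit = future_util > capacity_of(cpu);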

> 
> [...]
> 
> > There's a question that I'm struggling with if I may ask. Why is it perceived
> > our constant response time (practically ~200ms to go from 0 to max) as a good
> > fit for all use cases? Capability of systems differs widely in terms of what
> > performance you get at say a util of 512. Or in other words how much work is
> > done in a unit of time differs between system, but we still represent that work
> > in a constant way. A task ran for 10ms on powerful System A would have done
> 
> PELT (util_avg) is uarch & frequency invariant.
> 
> So e.g. a task with util_avg = 256 could have a runtime/period
> 
> on big CPU (capacity = 1024) of 4ms/16ms
> 
> on little CPU (capacity = 512) of 8ms/16ms
> 
> The amount of work is invariant (so we can compare between asymmetric
> capacity CPUs) but the runtime obviously differs according to the capacity.

Yep!


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-08 13:59         ` Peter Zijlstra
@ 2023-09-08 14:11           ` Qais Yousef
  0 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-08 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On 09/08/23 15:59, Peter Zijlstra wrote:
> On Fri, Sep 08, 2023 at 02:33:36PM +0100, Qais Yousef wrote:
> 
> > > (even hardware has these things today, we get 0-255 values that do
> > > 'something' uarch specific)
> > 
> > Ah, could I get some pointers please?
> 
> Intel HWP.EPP and AMD CPPC EPP I think.. both intel-pstate and
> amd-pstate have EPP thingies.

Okay, thanks!

So do you see tying this to the presence of some hardware mechanisms, and
providing a fallback for the other systems to define it somehow, as the best
way to explore this?


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07  7:48     ` Lukasz Luba
  2023-09-07 11:53       ` Peter Zijlstra
@ 2023-09-10 18:14       ` Qais Yousef
  1 sibling, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-10 18:14 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki, Ingo Molnar,
	Dietmar Eggemann, Vincent Guittot, Viresh Kumar, Peter Zijlstra

On 09/07/23 08:48, Lukasz Luba wrote:

> They are periodic in a sense, they wake up every 16ms, but sometimes
> they have more work. It depends what is currently going in the game
> and/or sometimes the data locality (might not be in cache).
> 
> Although, that's for games, other workloads like youtube play or this
> one 'Yahoo browser' (from your example) are more 'predictable' (after
> the start up period). And I really like the potential energy saving
> there :)

It is more complicated than that from what I've seen. Userspace is sadly
bloated and the relationships between the tasks are a lot more complex. They
talk to other framework elements, other hardware, have network events coming
in, and specifically for gaming, could be preparing multiple frames in
parallel. The task wake up and sleep times are not that periodic. A task can
busy loop for periods of time, or wake up for short periods of time (the
pattern of which might not be on point, as it interacts with other elements in
a serial manner where one prepares something and can take a variable time
every wake up before handing it over to the next task).

Browsers can be tricky as well: when the user scrolls, it matters what
elements appear, what JavaScript will execute and how heavy it is, how many
tabs have webpages open and how the user moves between them.

It is organized chaos :-)

> 
> > 
> > I think the model of a periodic task is not suitable for most workloads. All
> > of them are dynamic and how much they need to do at each wake up can very
> > significantly over 10s of ms.
> 
> Might be true, the model was built a few years ago when there wasn't
> such dynamic game scenario with high FPS on mobiles. This could still
> be tuned with your new design IIUC (no need extra hooks in Android).

It is my perception of course. But I think generally, not just for gaming,
there are a lot of elements interacting with each other in a complex way.
The wake up time and length are determined by these complex interactions; and
it is a very dynamic interplay where they could get into a steady state for
a very short period of time but could change quickly. As an extreme example,
a player could be standing in an empty room doing nothing, but another player
in another part of the world launches a rocket at this room, and only when the
network packet arrives do we learn that we have to draw a big explosion.

A lot of workloads are interactive and these moments of interaction are the
challenging ones.

> 
> > 
> > > 2. Plumb in this new idea of dvfs_update_delay as the new
> > >     'margin' - this I don't understand
> > > 
> > > For the 2. I don't see that the dvfs HW characteristics are best
> > > for this problem purpose. We can have a really fast DVFS HW,
> > > but we need some decent spare idle time in some workloads, which
> > > are two independent issues IMO. You might get the higher
> > > idle time thanks to 1.1. but this is a 'side effect'.
> > > 
> > > Could you explain a bit more why this dvfs_update_delay is
> > > crucial here?
> > 
> > I'm not sure why you relate this to idle time. And the word margin is a bit
> > overloaded here. so I suppose you're referring to the one we have in
> > map_util_perf() or apply_dvfs_headroom(). And I suppose you assume this extra
> > headroom will result in idle time, but this is not necessarily true IMO.
> > 
> > My rationale is simply that DVFS based on util should follow util_avg as-is.
> > But as pointed out in different discussions happened elsewhere, we need to
> > provide a headroom for this util to grow as if we were to be exact and the task
> > continues to run, then likely the util will go above the current OPP before we
> > get a chance to change it again. If we do have an ideal hardware that takes
> 
> Yes, this is another requirement to have +X% margin. When the tasks are
> growing, we don't know their final util_avg and we give them a bit more
> cycles.
> IMO we have to be ready always for such situation in the scheduler,
> haven't we?

Yes we should. I think I am not ignoring this part. Hope I clarified things
offline.


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 11:53       ` Peter Zijlstra
  2023-09-07 13:06         ` Lukasz Luba
@ 2023-09-10 18:20         ` Qais Yousef
  1 sibling, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-10 18:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lukasz Luba, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Viresh Kumar

On 09/07/23 13:53, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 08:48:08AM +0100, Lukasz Luba wrote:
> 
> > > Hehe. That's because they're not really periodic ;-)
> > 
> > They are periodic in a sense, they wake up every 16ms, but sometimes
> > they have more work. It depends what is currently going in the game
> > and/or sometimes the data locality (might not be in cache).
> > 
> > Although, that's for games, other workloads like youtube play or this
> > one 'Yahoo browser' (from your example) are more 'predictable' (after
> > the start up period). And I really like the potential energy saving
> > there :)
> 
> So everything media is fundamentally periodic, you're hard tied to the
> framerate / audio-buffer size etc..
> 
> Also note that the traditional periodic task model from the real-time
> community has the notion of WCET, which completely covers this
> fluctuation in frame-to-frame work, it only considers the absolute worst
> case.
> 
> Now, practically, that stinks, esp. when you care about batteries, but
> it does not mean these tasks are not periodic.

piecewise periodic?

> Many extentions to the periodic task model are possible, including
> things like average runtime with bursts etc.. all have their trade-offs.

The challenge we have is the endless number of workloads we need to cater
for.. Or do you think one of these models can actually scale to that?


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 13:26     ` Peter Zijlstra
  2023-09-07 13:57       ` Lukasz Luba
@ 2023-09-10 18:46       ` Qais Yousef
  1 sibling, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-10 18:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lukasz Luba, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Viresh Kumar

On 09/07/23 15:26, Peter Zijlstra wrote:
> On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
> 
> > This is probably a controversial statement. But I am not in favour of util_est.
> > I need to collect the data, but I think we're better off with a 16ms PELT
> > HALFLIFE as the default instead. But I will need to do a separate investigation on that.
> 
> I think util_est makes perfect sense, where PELT has to fundamentally

My concern about it is that it has an inherent bias towards perf. In the soup
of tasks running in the system, not all of them care about perf; the key tasks
tend to be the minority, I'd say. Again, I need to do more investigations, but
my worry mainly comes from that and what impact it could have on power.

In an ideal world where userspace is fully uclamp aware, we shouldn't need it.
The task can set uclamp_min to make sure it sees the right performance at
wake-up.
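
For reference, the readily available interface for this is sched_setattr()
with the util clamp flags. A minimal userspace sketch (header locations and
error handling simplified; treat this as illustrative rather than a
recommended snippet):

	#define _GNU_SOURCE
	#include <linux/sched.h>        /* SCHED_FLAG_UTIL_CLAMP_MIN */
	#include <linux/sched/types.h>  /* struct sched_attr */
	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <unistd.h>

	/* Request a minimum performance level (0..1024) for a task. */
	static int set_uclamp_min(pid_t pid, unsigned int min)
	{
		struct sched_attr attr = {
			.size		= sizeof(attr),
			.sched_flags	= SCHED_FLAG_KEEP_ALL |
					  SCHED_FLAG_UTIL_CLAMP_MIN,
			.sched_util_min	= min,
		};

		return syscall(SYS_sched_setattr, pid, &attr, 0);
	}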

And depending on the outcome of this discussion, we might need to introduce
something to help speed up/slow down migration more selectively. So it could
become a redundant control.

> decay non-running / non-runnable tasks in order to provide a temporal
> average, DVFS might be best served with a temporal max filter.

Ah, I certainly don't think DVFS needs PELT HALFLIFE to be controlled. I only
see it being useful on HMP systems, under-powered ones specifically, that really
need faster *migration* times. Maybe we can find a better way to control this.
I'll think about it.

Not sure about the temporal max. Isn't this a bias towards perf first too?
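
To make sure we're talking about the same thing, here is a rough illustration
of what I understand by a "temporal max filter" for the DVFS input (purely
illustrative, not existing kernel code): remember the highest util seen
recently and only let it drop once it is older than some window.

	struct util_max_filter {
		unsigned long	max_util;
		u64		max_ts;		/* when max_util was observed */
		u64		window_ns;	/* how long the max is held */
	};

	static unsigned long util_max_filter(struct util_max_filter *f,
					     unsigned long util, u64 now)
	{
		/* take a new max, or let the old one expire */
		if (util >= f->max_util || (now - f->max_ts) > f->window_ns) {
			f->max_util = util;
			f->max_ts = now;
		}

		return f->max_util;
	}

If that is the idea, then the window length plays the same 'perf first' role
I'm asking about above.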


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 14:42           ` Lukasz Luba
  2023-09-07 20:16             ` Peter Zijlstra
@ 2023-09-10 19:06             ` Qais Yousef
  1 sibling, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-10 19:06 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Peter Zijlstra, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Viresh Kumar

On 09/07/23 15:42, Lukasz Luba wrote:
> 
> 
> On 9/7/23 15:29, Peter Zijlstra wrote:
> > On Thu, Sep 07, 2023 at 02:57:26PM +0100, Lukasz Luba wrote:
> > > 
> > > 
> > > On 9/7/23 14:26, Peter Zijlstra wrote:
> > > > On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
> > > > 
> > > > > This is probably a controversial statement. But I am not in favour of util_est.
> > > > > I need to collect the data, but I think we're better off with a 16ms PELT
> > > > > HALFLIFE as the default instead. But I will need to do a separate investigation on that.
> > > > 
> > > > I think util_est makes perfect sense, where PELT has to fundamentally
> > > > decay non-running / non-runnable tasks in order to provide a temporal
> > > > average, DVFS might be best served with a temporal max filter.
> > > > 
> > > > 
> > > 
> > > Since we are here...
> > > Would you allow having a configuration for
> > > the util_est shifter, UTIL_EST_WEIGHT_SHIFT?
> > > 
> > > I've found values other than '2' better in some scenarios. That helps
> > > to prevent a big task from 'down' migrating from a Big CPU (1024) to some
> > > Mid CPU (~500-700 capacity) or even a Little (~120-300).
> > 
> > Larger values, I'm thinking, are what you're after? Those would cause the new
> > contribution to weigh less, making the function smoother, right?
> 
> Yes, smoother, because we only use the 'ewma' goodness for the decaying
> part (not the rising one [1]).
> 
> > 
> > What task characteristic is tied to this? That is, this seems trivial to
> > modify per-task.
> 
> In particular Speedometer test and the main browser task, which reaches
> ~900util, but sometimes vanish and waits for other background tasks
> to do something. In the meantime it can decay and wake-up on
> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
> we pin the task to big CPUs). So, a longer util_est helps to avoid
> at least very bad down migration to Littles...

Warning, this is not a global win! We do want tasks in general to down-migrate
when they sleep. It would be great to avoid biasing towards perf first by
default just to fix these special cases.

As I mentioned in another reply, there's a perf/power/thermal impact to these
decisions and it's not a global win. Some might want this to improve their
scores; others might not want that and would rather get the worse score but keep
their power budget in check. And it will highly depend on the workload and the
system, and we can't test every one of them :(

We did give this power to userspace via uclamp, which should make this problem
fixable. And this is readily available. We basically can't know in the kernel
when one way is better than the other without being told explicitly, IMHO.

If you try to boot with a faster PELT HALFLIFE, would this give you the same
perf/power trade-off?


Thanks

--
Qais Yousef

> 
> [1] https://elixir.bootlin.com/linux/v6.5.1/source/kernel/sched/fair.c#L4442

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 4/7] sched: cpufreq: Remove magic 1.25 headroom from apply_dvfs_headroom()
  2023-09-07 11:34   ` Peter Zijlstra
@ 2023-09-10 19:23     ` Qais Yousef
  0 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-10 19:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On 09/07/23 13:34, Peter Zijlstra wrote:
> On Mon, Aug 28, 2023 at 12:32:00AM +0100, Qais Yousef wrote:
> > Instead of the magical 1.25 headroom, use the new approximate_util_avg()
> > to provide headroom based on the dvfs_update_delay; which is the period
> > at which the cpufreq governor will send DVFS updates to the hardware.
> > 
> > Add a new percpu dvfs_update_delay that can be cheaply accessed whenever
> > apply_dvfs_headroom() is called. We expect cpufreq governors that rely
> > on util to drive their DVFS logic/algorithm to populate these percpu
> > variables. schedutil is the only such governor at the moment.
> > 
> > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > ---
> >  kernel/sched/core.c              |  3 ++-
> >  kernel/sched/cpufreq_schedutil.c | 10 +++++++++-
> >  kernel/sched/sched.h             | 25 ++++++++++++++-----------
> >  3 files changed, 25 insertions(+), 13 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 602e369753a3..f56eb44745a8 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -116,6 +116,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
> >  
> >  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> > +DEFINE_PER_CPU_SHARED_ALIGNED(u64, dvfs_update_delay);
> 
> This makes no sense, why are you using SHARED_ALIGNED and thus wasting
> an entire cacheline for the one variable?

Err, brain fart. Sorry.
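
Presumably the fix is simply dropping the cacheline alignment for this single
u64, i.e. something like:

	/* plain per-cpu storage; no need to pad one u64 out to a cacheline */
	DEFINE_PER_CPU(u64, dvfs_update_delay);

I'll do that (or move it somewhere more suitable) in the next version.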


Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 5/7] sched/schedutil: Add a new tunable to dictate response time
  2023-09-07 11:44   ` Peter Zijlstra
@ 2023-09-10 19:25     ` Qais Yousef
  0 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-10 19:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On 09/07/23 13:44, Peter Zijlstra wrote:
> On Mon, Aug 28, 2023 at 12:32:01AM +0100, Qais Yousef wrote:
> > +static inline unsigned long
> > +sugov_apply_response_time(struct sugov_policy *sg_policy, unsigned long util)
> > +{
> > +	unsigned long mult;
> > +
> > +	if (sg_policy->freq_response_time_ms == sg_policy->tunables->response_time_ms)
> > +		return util;
> > +
> > +	mult = sg_policy->freq_response_time_ms * SCHED_CAPACITY_SCALE;
> > +	mult /=	sg_policy->tunables->response_time_ms;
> > +	mult *= util;
> > +
> > +	return mult >> SCHED_CAPACITY_SHIFT;
> > +}
> > +
> >  static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> >  {
> >  	s64 delta_ns;
> > @@ -143,6 +184,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> >  	unsigned int freq = arch_scale_freq_invariant() ?
> >  				policy->cpuinfo.max_freq : policy->cur;
> >  
> > +	util = sugov_apply_response_time(sg_policy, util);
> >  	freq = map_util_freq(util, freq, max);
> >  
> >  	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> 
> Urgh, so instead of caching the multiplier you keep computing what is
> essentially a constant over and over and over and over again :/
> 
> That is, compute the whole 'freq_response_time_ms * SCHED_CAPACITY_SCALE
> / response_time_ms' thing *once*, when that file is written to, and then
> reduce the whole thing to:
> 
> 	return (freq_response_mult * util) >> SCHED_CAPACITY_SHIFT;
> 
> No need for that special case, no need for divisions, just go.

Yes! I was too focused on whether I was doing the right thing; I didn't stop to
think that this is actually a constant and can be computed once too. I will fix
it if this knob ends up hanging around.
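
For the record, the cached-multiplier variant would look something like the
below (sketch only; freq_response_mult would be a new field on sugov_policy,
recomputed whenever response_time_ms is written):

	/* in the response_time_ms store path */
	sg_policy->freq_response_mult = sg_policy->freq_response_time_ms *
					SCHED_CAPACITY_SCALE /
					sg_policy->tunables->response_time_ms;

	/* hot path */
	static inline unsigned long
	sugov_apply_response_time(struct sugov_policy *sg_policy, unsigned long util)
	{
		return (sg_policy->freq_response_mult * util) >> SCHED_CAPACITY_SHIFT;
	}

When the two times are equal the multiplier is just SCHED_CAPACITY_SCALE, so
the special case indeed disappears.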


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value
  2023-09-07 11:12       ` Dietmar Eggemann
@ 2023-09-10 19:58         ` Qais Yousef
  2023-09-13 17:22           ` Dietmar Eggemann
  0 siblings, 1 reply; 64+ messages in thread
From: Qais Yousef @ 2023-09-10 19:58 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 09/07/23 13:12, Dietmar Eggemann wrote:
> On 06/09/2023 23:19, Qais Yousef wrote:
> > On 09/06/23 14:56, Dietmar Eggemann wrote:
> >> On 28/08/2023 01:31, Qais Yousef wrote:
> 
> [...]
> 
> >>> +/*
> >>> + * Approximate the new util_avg value assuming an entity has continued to run
> >>> + * for @delta us.
> >>> + */
> >>> +unsigned long approximate_util_avg(unsigned long util, u64 delta)
> >>> +{
> >>> +	struct sched_avg sa = {
> >>> +		.util_sum = util * PELT_MIN_DIVIDER,
> >>> +		.util_avg = util,
> >>> +	};
> >>> +
> >>> +	if (unlikely(!delta))
> >>> +		return util;
> >>> +
> >>> +	accumulate_sum(delta, &sa, 0, 0, 1);
> >>
> >> IMHO, you miss the handling of `periods != 0`. load = 0 eclipses this
> >> code in accumulate_sum().
> 
> You could call accumulate_sum(delta, &sa, 1, 0, 1);

Yes. I initially thought the load was not necessary, but good catch. I didn't
get a chance to rerun and look at the numbers, but hopefully this should fix the
wrong ones I was seeing. Thanks!
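
So the helper would then read (following the structure from the patch, just
with the load argument set as you suggested):

	unsigned long approximate_util_avg(unsigned long util, u64 delta)
	{
		struct sched_avg sa = {
			.util_sum = util * PELT_MIN_DIVIDER,
			.util_avg = util,
		};

		if (unlikely(!delta))
			return util;

		/* load != 0 so accumulate_sum() accrues the geometric contrib */
		accumulate_sum(delta, &sa, 1, 0, 1);
		___update_load_avg(&sa, 0);

		return sa.util_avg;
	}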

> 
> > 
> > Yes. For some reason I got blank registered when I saw if this codepath can
> > impact util_avg..
> 
> Another thing ... I guess if you call accumulate_sum with delta the PELT
> machinery assumes `delta = now - sa->last_update_time` which means you
> would have to use `clock_pelt + TICK_USEC` as delta.

Right.

The way I understood it is that at TICK we should do update_load_avg() which
would call __update_load_sum() which uses

	delta = now - sa->last_update_time

which passes this delta to accumulate_sum()

I can see we are not very accurate since there will be a small additional time
besides TICK_USEC that we are not accounting for. But I can't see how this can
cause a big error.

	predicted (assumed) tick time/delta

		sa->last_update_time = now
		tick_time = TICK_USEC + now

		delta = tick_time - sa->last_update_time
		delta = TICK_USEC + now - now
		delta = TICK_USEC

	but actual tick time/delta

		sa->last_update_time = now - x
		tick_time = TICK_USEC + now

		delta = tick_time - sa->last_update_time
		delta = TICK_USEC + now - (now - x)
		delta = TICK_USEC + x

So the delta I am using might be slightly shorter than it should be.

IIUC, what you're saying that the `x` in my equation above is clock_pelt,
right?


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-08 13:33       ` Qais Yousef
  2023-09-08 13:58         ` Peter Zijlstra
  2023-09-08 13:59         ` Peter Zijlstra
@ 2023-09-10 21:17         ` Qais Yousef
  2 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-10 21:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba

On 09/08/23 14:33, Qais Yousef wrote:
> On 09/08/23 12:25, Peter Zijlstra wrote:
> > On Fri, Sep 08, 2023 at 01:17:25AM +0100, Qais Yousef wrote:
> > 
> > > Just to be clear, my main issue here is with the current hardcoded values of
> > > the 'margins'. And the fact that they go too fast is my main problem.
> > 
> > So I stripped the whole margin thing from my reply because I didn't want
> > to comment on that yet, but yes, I can see how those might be a problem,
> > and you're changing them into something dynamic, not just removing them.
> 
> The main difficulty is that if you try to apply those patches on their own, I'm
> sure you'll notice a difference. So if we were to take this alone and put it on
> linux-next, I expect a few regression reports from those who run with
> schedutil. Any ST-oriented workload will not be happy. But if we compensate to
> reduce the regression, my problem will re-appear, just for a different reason.
> So whack-a-mole.

Sorry I just realized that the dynamic thing was about the margin, not the new
knob.

My answer above still holds to some extent. But yes, I meant to write that I'm
removing magic hardcoded numbers from the margins.


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-07 20:16             ` Peter Zijlstra
@ 2023-09-12 11:51               ` Lukasz Luba
  2023-09-12 14:01                 ` Vincent Guittot
  2023-09-12 14:28                 ` Peter Zijlstra
  0 siblings, 2 replies; 64+ messages in thread
From: Lukasz Luba @ 2023-09-12 11:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki, Ingo Molnar,
	Dietmar Eggemann, Vincent Guittot, Viresh Kumar, Qais Yousef,
	Chris Redpath

Hi Peter,

On 9/7/23 21:16, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:
> 
>>> What task characteristic is tied to this? That is, this seems trivial to
>>> modify per-task.
>>
>> In particular Speedometer test and the main browser task, which reaches
>> ~900util, but sometimes vanish and waits for other background tasks
>> to do something. In the meantime it can decay and wake-up on
>> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
>> we pin the task to big CPUs). So, a longer util_est helps to avoid
>> at least very bad down migration to Littles...
> 
> Do they do a few short activations (wakeup/sleeps) while waiting? That
> would indeed completely ruin things since the EWMA thing is activation
> based.
> 
> I wonder if there's anything sane we can do here...

My apologies for the delay; I have tried to push the graphs for you.

The experiment is on pixel7*. It's running the browser on the phone
with the test 'Speedometer 2.0'. It's a web test (you can also run it on
your phone) available here, no need to install anything:
https://browserbench.org/Speedometer2.0/

Here is the Jupyter notebook [1], with plots of the signals:
- top 20 tasks' (based on runtime) utilization
- Util EST signals for the top 20 tasks, with the longer-decaying ewma
   filter (which is the 'red' plot called 'ewma')
- the main task (comm=CrRendererMain) Util, Util EST and task residency
   (which tries to stick to CPUs 6,7*)
- the test score was 144.6 (while with the fast-decay ewma it is ~134), so
   staying on big CPUs helps the score in this case

(the plots are interactive, you can zoom in with the icon 'Box Zoom')
(e.g. you can zoom into the task activation plot, which is also linked
with the 'Util EST' on top, for this main task)

You can see the util signal of that 'CrRendererMain' task and those
utilization drops over time, which I was referring to. When the util
drops below some threshold, the task might 'fit' into a smaller CPU,
which could be prevented automatically by maintaining the util est
for longer (but not for all tasks).

I do like your idea that Util EST might be per-task. I'm going to
check this, because that might help to get rid of the overutilized state,
which is probably there because small tasks also stay 'bigger' for longer.

If this util_est change has a chance to fly upstream, I could send an RFC, if
you don't mind.

Regards,
Lukasz

*CPUs 6,7 - big (1024 capacity), CPUs 4,5 Mid (768 capacity), CPUs 0-3
Littles (~150 capacity)

[1] 
https://nbviewer.org/github/lukaszluba-arm/lisa/blob/public_tests/p7_wa_speedometer2_small_size.ipynb

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-08 12:51                   ` Daniel Bristot de Oliveira
@ 2023-09-12 11:57                     ` Lukasz Luba
  0 siblings, 0 replies; 64+ messages in thread
From: Lukasz Luba @ 2023-09-12 11:57 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Qais Yousef, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Vincent Guittot, Peter Zijlstra,
	Viresh Kumar, juri.lelli

Hi Daniel,

On 9/8/23 13:51, Daniel Bristot de Oliveira wrote:
> On 9/7/23 15:45, Lukasz Luba wrote:
>>>>> RT literature mostly, methinks. Replacing WCET with a statistical model of
>>>>> sorts is not uncommon, the argument goes that not everybody will have
>>>>> their worst case at the same time and lows and highs can commonly cancel
>>>>> out and this way we can cram a little more on the system.
>>>>>
>>>>> Typically this is proposed in the context of soft-realtime systems.
>>>>
>>>> Thanks Peter, I will dive into some books...
>>>
>>> I would look at academic papers, not sure any of that ever made it to
>>> books, Daniel would know I suppose.
>>
>> Good hint, thanks!
> 
> The key-words that came to my mind are:
> 
> 	- mk-firm, where you accept that m activations make their deadline
> 	           out of every k executions - like, because you run too long.
> 	- mixed criticality with pWCET (probabilistic execution time) or
> 		  average execution time + a sporadic tail execution time for
> 		  the low-criticality part.
> 
> mk-firm smells like 2005's.. mixed criticality as 2015's..present.
> 
> You will probably find more papers than books. Read the papers
> as a source for inspiration... not necessarily as a definitive
> solution. They generally proposed too restrictive task models.
> 
> -- Daniel
> 

Thanks for describing this context! That will save me time and hopefully keep
me from sinking in these unknown waters. As you said, I might treat that as
inspiration, since I'm not fighting with a life-critical system, but a phone
which needs a 'nice user experience' (hopefully there are no people who
disagree) ;)

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-12 11:51               ` Lukasz Luba
@ 2023-09-12 14:01                 ` Vincent Guittot
  2023-09-13  9:53                   ` Lukasz Luba
  2023-09-12 14:28                 ` Peter Zijlstra
  1 sibling, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2023-09-12 14:01 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Peter Zijlstra, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Viresh Kumar, Qais Yousef,
	Chris Redpath

Hi Lukasz,

On Tue, 12 Sept 2023 at 13:51, Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Peter,
>
> On 9/7/23 21:16, Peter Zijlstra wrote:
> > On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:
> >
> >>> What task characteristic is tied to this? That is, this seems trivial to
> >>> modify per-task.
> >>
> >> In particular Speedometer test and the main browser task, which reaches
> >> ~900util, but sometimes vanish and waits for other background tasks
> >> to do something. In the meantime it can decay and wake-up on
> >> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
> >> we pin the task to big CPUs). So, a longer util_est helps to avoid
> >> at least very bad down migration to Littles...
> >
> > Do they do a few short activations (wakeup/sleeps) while waiting? That
> > would indeed completely ruin things since the EWMA thing is activation
> > based.
> >
> > I wonder if there's anything sane we can do here...
>
> My apologies for a delay, I have tried to push the graphs for you.
>
> The experiment is on pixel7*. It's running the browser on the phone
> with the test 'Speedometer 2.0'. It's a web test (you can also run on
> your phone) available here, no need to install anything:
> https://browserbench.org/Speedometer2.0/
>
> Here is the Jupiter notebook [1], with plots of the signals:
> - top 20 tasks' (based on runtime) utilization
> - Util EST signals for the top 20 tasks, with the longer decaying ewma
>    filter (which is the 'red' plot called 'ewma')
> - the main task (comm=CrRendererMain) Util, Util EST and task residency
>    (which tires to stick to CPUs 6,7* )
> - the test score was 144.6 (while with fast decay ewma is ~134), so
>    staying at big cpus (helps the score in this case)
>
> (the plots are interactive, you can zoom in with the icon 'Box Zoom')
> (e.g. you can zoom in the task activation plot which is also linked
> with the 'Util EST' on top, for this main task)
>
> You can see the util signal of that 'CrRendererMain' task and those
> utilization drops in time, which I was referring to. When the util
> drops below some threshold, the task might 'fit' into smaller CPU,
> which could be prevented automatically byt maintaining the util est
> for longer (but not for all).

I was looking at your nice chart and I wonder if you could also add
the runnable_avg of the tasks?

My 1st impression is that the decrease happens when your task starts
to share the CPU with some other tasks, and this ends up with a
decrease of its utilization because util_avg doesn't take the waiting
time into account. So typically a task with a utilization of 1024 will
see its utilization decrease because of other tasks running on the
same CPU. This would explain the drop that you can see.

I wonder if we should not take the runnable_avg into account when
applying the ewma on util_est, so the util_est will not decrease
because of time sharing with other tasks.
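
Something like the below inside util_est_update(), purely as an illustration
of the idea (the threshold and the exact condition are made up; 'ru' would be
the task's runnable_avg and 'dequeued' its util at dequeue):

	/*
	 * If the task spent a lot of its time runnable-but-not-running,
	 * its util_avg was suppressed by time sharing; don't let that
	 * activation drag the ewma down.
	 */
	if (ru > dequeued + (dequeued >> 2))
		goto done;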

>
> I do like your idea that Util EST might be per-task. I'm going to
> check this, because that might help to get rid of the overutilized state
> which is probably because small tasks are also 'bigger' for longer.
>
> If this util est have chance to fly upstream, I could send an RFC if
> you don't mind.
>
> Regards,
> Lukasz
>
> *CPUs 6,7 - big (1024 capacity), CPUs 4,5 Mid (768 capacity), CPUs 0-3
> Littles (~150 capacity)
>
> [1]
> https://nbviewer.org/github/lukaszluba-arm/lisa/blob/public_tests/p7_wa_speedometer2_small_size.ipynb

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-12 11:51               ` Lukasz Luba
  2023-09-12 14:01                 ` Vincent Guittot
@ 2023-09-12 14:28                 ` Peter Zijlstra
  1 sibling, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2023-09-12 14:28 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki, Ingo Molnar,
	Dietmar Eggemann, Vincent Guittot, Viresh Kumar, Qais Yousef,
	Chris Redpath

On Tue, Sep 12, 2023 at 12:51:52PM +0100, Lukasz Luba wrote:

> You can see the util signal of that 'CrRendererMain' task and those
> utilization drops in time, which I was referring to. When the util
> drops below some threshold, the task might 'fit' into smaller CPU,
> which could be prevented automatically byt maintaining the util est
> for longer (but not for all).

Right, so right at those util_est dips it has some small activations.
It's like a poll loop or something instead of a full block waiting for
things to happen.

And yeah, that'll destroy util_est in a hurry :/
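
To put a rough number on it (assuming the default UTIL_EST_WEIGHT_SHIFT of 2,
i.e. the dequeue update is roughly ewma += (dequeued - ewma) / 4):

	/*
	 * Back-of-the-envelope: start at ewma = 900 and let a few tiny
	 * activations (dequeued util ~= 50) hit the filter:
	 *
	 *	after 1 dequeue:  900 + (50 - 900)/4 ~= 688
	 *	after 2 dequeues: ~= 528
	 *	after 3 dequeues: ~= 409
	 *
	 * so three short wakeup/sleep cycles are enough for the task to
	 * look like it fits a mid CPU again.
	 */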

> I do like your idea that Util EST might be per-task. I'm going to
> check this, because that might help to get rid of the overutilized state
> which is probably because small tasks are also 'bigger' for longer.
> 
> If this util est have chance to fly upstream, I could send an RFC if
> you don't mind.

The biggest stumbling block I see is the user interface; some generic
QoS-hints based thing that allows us to do random things -- like tuning
the above -- might do, dunno.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-08 14:07       ` Qais Yousef
@ 2023-09-12 17:18         ` Dietmar Eggemann
  2023-09-16 19:38           ` Qais Yousef
  0 siblings, 1 reply; 64+ messages in thread
From: Dietmar Eggemann @ 2023-09-12 17:18 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 08/09/2023 16:07, Qais Yousef wrote:
> On 09/08/23 09:40, Dietmar Eggemann wrote:
>> On 08/09/2023 02:17, Qais Yousef wrote:
>>> On 09/07/23 15:08, Peter Zijlstra wrote:
>>>> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:

[...]

>>> And what was a high end A78 is a mid core today. So if you look at today's
>>> mobile world topology we really have a tiny+big+huge combination of cores. The
>>> bigs are called mids, but they're very capable. Fits capacity forces migration
>>> to the 'huge' cores too soon with that 80% margin. While the 80% might be too
>>> small for the tiny ones as some workloads really struggle there if they hang on
>>> for too long. It doesn't help that these systems ship with 4ms tick. Something
>>> more to consider changing I guess.
>>
>> If this is the problem then you could simply make the margin (headroom)
>> a function of cpu_capacity_orig?
> 
> I don't see what you mean. Instead of capacity_of() but keep the 80%?
> 
> Again, I could be delusional and misunderstanding everything, but what I really
> see fits_capacity() is about is misfit detection. But a task is not really
> misfit until it actually has a util above the capacity of the CPU. Now due to
> implementation details there can be a delay between the task crossing this
> capacity and being able to move it. Which is what I believe this headroom is
> trying to achieve.
> 
> I think we can better define this by tying this headroom to the worst case
> scenario it takes to actually move this misfit task to the right CPU. If it can
> continue to run without being impacted by this delay and crossing the
> capacity of the CPU it is on, then we should not trigger misfit IMO.


Instead of:

  fits_capacity(unsigned long util, unsigned long capacity)

      return approximate_util_avg(util, TICK_USEC) < capacity;

just make 1280 in:

  #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

dependent on cpu's capacity_orig or the capacity diff to the next higher
capacity_orig.

Typical example today: {little-medium-big capacity_orig} = {128, 896, 1024}

896÷128 = 7

1024/896 = 1.14

to achieve higher margin on little and lower margin on medium.
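
Just so we're looking at the same thing, one way to put that into code could
be (sketch only; next_bigger_capacity() is a hypothetical helper returning the
capacity_orig of the next bigger CPU class, or 1024 for the biggest):

	/* headroom factor in [1024..2048], derived from the capacity gap */
	static unsigned long fits_margin(int cpu)
	{
		unsigned long cap = arch_scale_cpu_capacity(cpu);
		unsigned long next = next_bigger_capacity(cpu);

		return min(2048UL, 1024UL + (1024UL * (next - cap)) / next);
	}

	#define fits_capacity(cap, max, cpu) ((cap) * fits_margin(cpu) < (max) * 1024)

With the {128, 896, 1024} example this gives a much bigger headroom on the
littles than on the mediums, which is the asymmetry described above.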

[...]


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-12 14:01                 ` Vincent Guittot
@ 2023-09-13  9:53                   ` Lukasz Luba
  0 siblings, 0 replies; 64+ messages in thread
From: Lukasz Luba @ 2023-09-13  9:53 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, linux-kernel, linux-pm, Rafael J. Wysocki,
	Ingo Molnar, Dietmar Eggemann, Viresh Kumar, Qais Yousef,
	Chris Redpath

Hi Vincent,

On 9/12/23 15:01, Vincent Guittot wrote:
> Hi Lukasz,
> 
> On Tue, 12 Sept 2023 at 13:51, Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi Peter,
>>
>> On 9/7/23 21:16, Peter Zijlstra wrote:
>>> On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:
>>>
>>>>> What task characteristic is tied to this? That is, this seems trivial to
>>>>> modify per-task.
>>>>
>>>> In particular Speedometer test and the main browser task, which reaches
>>>> ~900util, but sometimes vanish and waits for other background tasks
>>>> to do something. In the meantime it can decay and wake-up on
>>>> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
>>>> we pin the task to big CPUs). So, a longer util_est helps to avoid
>>>> at least very bad down migration to Littles...
>>>
>>> Do they do a few short activations (wakeup/sleeps) while waiting? That
>>> would indeed completely ruin things since the EWMA thing is activation
>>> based.
>>>
>>> I wonder if there's anything sane we can do here...
>>
>> My apologies for a delay, I have tried to push the graphs for you.
>>
>> The experiment is on pixel7*. It's running the browser on the phone
>> with the test 'Speedometer 2.0'. It's a web test (you can also run on
>> your phone) available here, no need to install anything:
>> https://browserbench.org/Speedometer2.0/
>>
>> Here is the Jupiter notebook [1], with plots of the signals:
>> - top 20 tasks' (based on runtime) utilization
>> - Util EST signals for the top 20 tasks, with the longer decaying ewma
>>     filter (which is the 'red' plot called 'ewma')
>> - the main task (comm=CrRendererMain) Util, Util EST and task residency
>>     (which tires to stick to CPUs 6,7* )
>> - the test score was 144.6 (while with fast decay ewma is ~134), so
>>     staying at big cpus (helps the score in this case)
>>
>> (the plots are interactive, you can zoom in with the icon 'Box Zoom')
>> (e.g. you can zoom in the task activation plot which is also linked
>> with the 'Util EST' on top, for this main task)
>>
>> You can see the util signal of that 'CrRendererMain' task and those
>> utilization drops in time, which I was referring to. When the util
>> drops below some threshold, the task might 'fit' into smaller CPU,
>> which could be prevented automatically byt maintaining the util est
>> for longer (but not for all).
> 
> I was looking at your nice chart and I wonder if you could also add
> the runnable _avg of the tasks ?

Yes, I will try today or tomorrow to add such plots as well.

> 
> My 1st impression is that the decrease happens when your task starts
> to share the CPU with some other tasks and this ends up with a
> decrease of its utilization because util_avg doesn't take into account
> the waiting time so typically task with an utilization of 1024, will
> see its utilization decrease because of other tasks running on the
> same cpu. This would explain the drop that you can see.
> 
>   I wonder if we should not take into account the runnable_avg when
> applying the ewm on util_est ? so the util_est will not decrease
> because of time sharing with other

Yes, that sounds like a good idea. Let me provide those plots so we can
go further with the analysis. I will try to capture whether that happens
to that particular task on the CPU (and if there are some others as well).


Thanks for jumping into the discussion!

Lukasz

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value
  2023-09-10 19:58         ` Qais Yousef
@ 2023-09-13 17:22           ` Dietmar Eggemann
  2023-09-16 19:49             ` Qais Yousef
  0 siblings, 1 reply; 64+ messages in thread
From: Dietmar Eggemann @ 2023-09-13 17:22 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 10/09/2023 21:58, Qais Yousef wrote:
> On 09/07/23 13:12, Dietmar Eggemann wrote:
>> On 06/09/2023 23:19, Qais Yousef wrote:
>>> On 09/06/23 14:56, Dietmar Eggemann wrote:
>>>> On 28/08/2023 01:31, Qais Yousef wrote:

[...]

>> Another thing ... I guess if you call accumulate_sum with delta the PELT
>> machinery assumes `delta = now - sa->last_update_time` which means you
>> would have to use `clock_pelt + TICK_USEC` as delta.
> 
> Right.
> 
> The way I understood it is that at TICK we should do update_load_avg() which
> would call __update_load_sum() which uses
> 
> 	delta = now - sa->last_update_time
> 
> which passes this delta to accumulate_sum()
> 
> I can see we are not very accurate since there will be a small additional time
> besides TICK_USEC that we are not accounting for. But I can't see how this can
> cause a big error.
> 
> 	predicted (assumed) tick time/delta
> 
> 		sa->last_update_time = now
> 		tick_time = TICK_USEC + now
> 
> 		delta = tick_time - sa->last_update_time
> 		delta = TICK_USEC + now - now
> 		delta = TICK_USEC
> 
> 	but actual tick time/delta
> 
> 		sa->last_update_time = now - x
> 		tick_time = TICK_USEC + now
> 
> 		delta = tick_time - sa->last_update_time
> 		delta = TICK_USEC + now - (now - x)
> 		delta = TICK_USEC + x
> 
> So the delta I am using might be slightly shorter than it should be.
> 
> IIUC, what you're saying that the `x` in my equation above is clock_pelt,
> right?

No, I was wrong here. Calling accumulate_sum with `delta = TICK_USEC` is
fine.

accumulate_sum() will accrue `sa->util_sum` and ___update_load_avg()
will then adjust `sa->util_avg` accordingly.

delta should be 4000 on Arm64 boards, so you will cross period
boundaries. In case `delta < 1024` you might want to not call
___update_load_avg(), to stay in line with __update_load_avg_cfs_rq().

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 2/7] sched/pelt: Add a new function to approximate runtime to reach given util
  2023-08-27 23:31 ` [RFC PATCH 2/7] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
  2023-09-06 12:56   ` Dietmar Eggemann
@ 2023-09-15  9:15   ` Hongyan Xia
  2023-09-16 19:56     ` Qais Yousef
  1 sibling, 1 reply; 64+ messages in thread
From: Hongyan Xia @ 2023-09-15  9:15 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba

On 28/08/2023 00:31, Qais Yousef wrote:
> It is basically the ramp-up time from 0 to a given value. Will be used
> later to implement new tunable to control response time  for schedutil.
> 
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>   kernel/sched/pelt.c  | 21 +++++++++++++++++++++
>   kernel/sched/sched.h |  1 +
>   2 files changed, 22 insertions(+)
> 
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 50322005a0ae..f673b9ab92dc 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -487,3 +487,24 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
>   
>   	return sa.util_avg;
>   }
> +
> +/*
> + * Approximate the required amount of runtime in ms required to reach @util.
> + */
> +u64 approximate_runtime(unsigned long util)
> +{
> +	struct sched_avg sa = {};
> +	u64 delta = 1024; // period = 1024 = ~1ms
> +	u64 runtime = 0;
> +
> +	if (unlikely(!util))
> +		return runtime;
> +
> +	while (sa.util_avg < util) {
> +		accumulate_sum(delta, &sa, 0, 0, 1);

This looks a bit uncomfortable as the existing comment says that we assume:

	if (!load)
		runnable = running = 0;

I haven't looked at the math in detail, but if this is okay, maybe add a
comment saying why this is okay despite the existing comment saying otherwise?

> +		___update_load_avg(&sa, 0);
> +		runtime++;
> +	}
> +
> +	return runtime;
> +}
> [...]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins
  2023-09-12 17:18         ` Dietmar Eggemann
@ 2023-09-16 19:38           ` Qais Yousef
  0 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-16 19:38 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 09/12/23 19:18, Dietmar Eggemann wrote:
> On 08/09/2023 16:07, Qais Yousef wrote:
> > On 09/08/23 09:40, Dietmar Eggemann wrote:
> >> On 08/09/2023 02:17, Qais Yousef wrote:
> >>> On 09/07/23 15:08, Peter Zijlstra wrote:
> >>>> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:
> 
> [...]
> 
> >>> And what was a high end A78 is a mid core today. So if you look at today's
> >>> mobile world topology we really have a tiny+big+huge combination of cores. The
> >>> bigs are called mids, but they're very capable. Fits capacity forces migration
> >>> to the 'huge' cores too soon with that 80% margin. While the 80% might be too
> >>> small for the tiny ones as some workloads really struggle there if they hang on
> >>> for too long. It doesn't help that these systems ship with 4ms tick. Something
> >>> more to consider changing I guess.
> >>
> >> If this is the problem then you could simply make the margin (headroom)
> >> a function of cpu_capacity_orig?
> > 
> > I don't see what you mean. Instead of capacity_of() but keep the 80%?
> > 
> > Again, I could be delusional and misunderstanding everything, but what I really
> > see fits_capacity() is about is misfit detection. But a task is not really
> > misfit until it actually has a util above the capacity of the CPU. Now due to
> > implementation details there can be a delay between the task crossing this
> > capacity and being able to move it. Which is what I believe this headroom is
> > trying to achieve.
> > 
> > I think we can better define this by tying this headroom to the worst case
> > scenario it takes to actually move this misfit task to the right CPU. If it can
> > continue to run without being impacted by this delay and crossing the
> > capacity of the CPU it is on, then we should not trigger misfit IMO.
> 
> 
> Instead of:
> 
>   fits_capacity(unsigned long util, unsigned long capacity)
> 
>       return approximate_util_avg(util, TICK_USEC) < capacity;
> 
> just make 1280 in:
> 
>   #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
> 
> dependent on cpu's capacity_orig or the capacity diff to the next higher
> capacity_orig.
> 
> Typical example today: {little-medium-big capacity_orig} = {128, 896, 1024}
> 
> 896÷128 = 7
> 
> 1024/896 = 1.14
> 
> to achieve higher margin on little and lower margin on medium.

I am not keen on this personally. These numbers seem arbitrary to me, and why
they help (or don't) is not clear to me at least.

I do believe that the only reason we want to move a task before its util
crosses the capacity of the CPU is tied to the misfit load balance being able
to move the task. Because until the task crosses the capacity, it is getting
its computational demand per our PELT representation. But since load balance
is not an immediate action (especially on our platforms where it is 4ms,
something I hope we can change), we need to preemptively mark the task as
misfit on this CPU when we know it will get 'stuck' there and not get its
computational demand (as per our representation, of course).

I think this removes all guesswork and provides a very meaningful
decision-making process that I think will scale transparently, so we utilize
our resources the best we can.

We can probably optimize the code to avoid the call to approximate_util_avg()
if this is a problem.
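
One possible way (just a sketch, not in the series): precompute per CPU the
largest util that would still fit after one tick worth of growth, so the hot
path stays a plain compare. fits_threshold below is a hypothetical per-cpu
variable, refreshed whenever capacity changes:

	static void update_fits_threshold(int cpu)
	{
		unsigned long cap = arch_scale_cpu_capacity(cpu);
		unsigned long util = cap;

		/* linear scan kept simple here; a binary search would do */
		while (util && approximate_util_avg(util, TICK_USEC) > cap)
			util--;

		per_cpu(fits_threshold, cpu) = util;
	}

	/* misfit check then becomes: util <= per_cpu(fits_threshold, cpu) */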

Why do you think the ratio of CPU capacities gives a more meaningful method
for judging misfit?


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value
  2023-09-13 17:22           ` Dietmar Eggemann
@ 2023-09-16 19:49             ` Qais Yousef
  2023-09-16 19:52               ` Qais Yousef
  0 siblings, 1 reply; 64+ messages in thread
From: Qais Yousef @ 2023-09-16 19:49 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 09/13/23 19:22, Dietmar Eggemann wrote:
> On 10/09/2023 21:58, Qais Yousef wrote:
> > On 09/07/23 13:12, Dietmar Eggemann wrote:
> >> On 06/09/2023 23:19, Qais Yousef wrote:
> >>> On 09/06/23 14:56, Dietmar Eggemann wrote:
> >>>> On 28/08/2023 01:31, Qais Yousef wrote:
> 
> [...]
> 
> >> Another thing ... I guess if you call accumulate_sum with delta the PELT
> >> machinery assumes `delta = now - sa->last_update_time` which means you
> >> would have to use `clock_pelt + TICK_USEC` as delta.
> > 
> > Right.
> > 
> > The way I understood it is that at TICK we should do update_load_avg() which
> > would call __update_load_sum() which uses
> > 
> > 	delta = now - sa->last_update_time
> > 
> > which passes this delta to accumulate_sum()
> > 
> > I can see we are not very accurate since there will be a small additional time
> > besides TICK_USEC that we are not accounting for. But I can't see how this can
> > cause a big error.
> > 
> > 	predicted (assumed) tick time/delta
> > 
> > 		sa->last_update_time = now
> > 		tick_time = TICK_USEC + now
> > 
> > 		delta = tick_time - sa->last_update_time
> > 		delta = TICK_USEC + now - now
> > 		delta = TICK_USEC
> > 
> > 	but actual tick time/delta
> > 
> > 		sa->last_update_time = now - x
> > 		tick_time = TICK_USEC + now
> > 
> > 		delta = tick_time - sa->last_update_time
> > 		delta = TICK_USEC + now - (now - x)
> > 		delta = TICK_USEC + x
> > 
> > So the delta I am using might be slightly shorter than it should be.
> > 
> > IIUC, what you're saying that the `x` in my equation above is clock_pelt,
> > right?
> 
> No, I was wrong here. Calling accumulate_sum with `delta = TICK_USEC` is
> fine.
> 
> accumulate_sum() will accrue `sa->util.sum` and ___update_load_avg()
> will then adjust `sa->util_avg` accordingly.
> 
> delta should be 4000 on Arm64 boards so you will cross period
> boundaries. In case `delta < 1024` you might want to not call
> ___update_load_avg() to be in pair with __update_load_avg_cfs_rq().

You mean *not* call, or actually *do* call ___update_load_avg() if delta
< 1024? I am certainly not calling it now and I think you're suggesting to
actually call it when period is less than 1024.

This area is not my strength, so I sure do appreciate any suggestions to make it
better! :-) I will look into that for the next version.


Many thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value
  2023-09-16 19:49             ` Qais Yousef
@ 2023-09-16 19:52               ` Qais Yousef
  0 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-16 19:52 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba

On 09/16/23 20:49, Qais Yousef wrote:
> On 09/13/23 19:22, Dietmar Eggemann wrote:
> > On 10/09/2023 21:58, Qais Yousef wrote:
> > > On 09/07/23 13:12, Dietmar Eggemann wrote:
> > >> On 06/09/2023 23:19, Qais Yousef wrote:
> > >>> On 09/06/23 14:56, Dietmar Eggemann wrote:
> > >>>> On 28/08/2023 01:31, Qais Yousef wrote:
> > 
> > [...]
> > 
> > >> Another thing ... I guess if you call accumulate_sum with delta the PELT
> > >> machinery assumes `delta = now - sa->last_update_time` which means you
> > >> would have to use `clock_pelt + TICK_USEC` as delta.
> > > 
> > > Right.
> > > 
> > > The way I understood it is that at TICK we should do update_load_avg() which
> > > would call __update_load_sum() which uses
> > > 
> > > 	delta = now - sa->last_update_time
> > > 
> > > which passes this delta to accumulate_sum()
> > > 
> > > I can see we are not very accurate since there will be a small additional time
> > > besides TICK_USEC that we are not accounting for. But I can't see how this can
> > > cause a big error.
> > > 
> > > 	predicted (assumed) tick time/delta
> > > 
> > > 		sa->last_update_time = now
> > > 		tick_time = TICK_USEC + now
> > > 
> > > 		delta = tick_time - sa->last_update_time
> > > 		delta = TICK_USEC + now - now
> > > 		delta = TICK_USEC
> > > 
> > > 	but actual tick time/delta
> > > 
> > > 		sa->last_update_time = now - x
> > > 		tick_time = TICK_USEC + now
> > > 
> > > 		delta = tick_time - sa->last_update_time
> > > 		delta = TICK_USEC + now - (now - x)
> > > 		delta = TICK_USEC + x
> > > 
> > > So the delta I am using might be slightly shorter than it should be.
> > > 
> > > IIUC, what you're saying that the `x` in my equation above is clock_pelt,
> > > right?
> > 
> > No, I was wrong here. Calling accumulate_sum with `delta = TICK_USEC` is
> > fine.
> > 
> > accumulate_sum() will accrue `sa->util.sum` and ___update_load_avg()
> > will then adjust `sa->util_avg` accordingly.
> > 
> > delta should be 4000 on Arm64 boards so you will cross period
> > boundaries. In case `delta < 1024` you might want to not call
> > ___update_load_avg() to be in pair with __update_load_avg_cfs_rq().
> 
> You mean *not* call, or actually *do* call ___update_load_avg() if delta
> < 1024? I am certainly not calling it now and I think you're suggesting to
> actually call it when period is less than 1024.

Oops my bad, I got confused. I am calling it. Ignore me!


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH 2/7] sched/pelt: Add a new function to approximate runtime to reach given util
  2023-09-15  9:15   ` Hongyan Xia
@ 2023-09-16 19:56     ` Qais Yousef
  0 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2023-09-16 19:56 UTC (permalink / raw)
  To: Hongyan Xia
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann, linux-kernel, linux-pm,
	Lukasz Luba

On 09/15/23 10:15, Hongyan Xia wrote:
> On 28/08/2023 00:31, Qais Yousef wrote:
> > It is basically the ramp-up time from 0 to a given value. Will be used
> > later to implement new tunable to control response time  for schedutil.
> > 
> > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > ---
> >   kernel/sched/pelt.c  | 21 +++++++++++++++++++++
> >   kernel/sched/sched.h |  1 +
> >   2 files changed, 22 insertions(+)
> > 
> > diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> > index 50322005a0ae..f673b9ab92dc 100644
> > --- a/kernel/sched/pelt.c
> > +++ b/kernel/sched/pelt.c
> > @@ -487,3 +487,24 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
> >   	return sa.util_avg;
> >   }
> > +
> > +/*
> > + * Approximate the required amount of runtime in ms required to reach @util.
> > + */
> > +u64 approximate_runtime(unsigned long util)
> > +{
> > +	struct sched_avg sa = {};
> > +	u64 delta = 1024; // period = 1024 = ~1ms
> > +	u64 runtime = 0;
> > +
> > +	if (unlikely(!util))
> > +		return runtime;
> > +
> > +	while (sa.util_avg < util) {
> > +		accumulate_sum(delta, &sa, 0, 0, 1);
> 
> This looks a bit uncomfortable as the existing comment says that we assume:
> 
> 	if (!load)
> 		runnable = running = 0;
> 
> I haven't looked at the math in detail, but if this is okay, maybe a comment
> saying why this is okay despite the existing comment says otherwise?

Yeah, as Dietmar highlighted, I should pass 1 for load; it was my bad
misreading of the code.

So it should be

	accumulate_sum(delta, &sa, 1, 0, 1);

If that's what you meant, yes.
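
The corrected helper would then presumably read (same structure as the patch,
only the load argument changed):

	/*
	 * Approximate the required amount of runtime in ms to reach @util.
	 */
	u64 approximate_runtime(unsigned long util)
	{
		struct sched_avg sa = {};
		u64 delta = 1024;	/* period = 1024us = ~1ms */
		u64 runtime = 0;

		if (unlikely(!util))
			return runtime;

		while (sa.util_avg < util) {
			/* load = 1 so the geometric contributions are accrued */
			accumulate_sum(delta, &sa, 1, 0, 1);
			___update_load_avg(&sa, 0);
			runtime++;
		}

		return runtime;
	}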


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2023-09-16 19:57 UTC | newest]

Thread overview: 64+ messages
2023-08-27 23:31 [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Qais Yousef
2023-08-27 23:31 ` [RFC PATCH 1/7] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
2023-09-06 12:56   ` Dietmar Eggemann
2023-09-06 21:19     ` Qais Yousef
2023-09-07 11:12       ` Dietmar Eggemann
2023-09-10 19:58         ` Qais Yousef
2023-09-13 17:22           ` Dietmar Eggemann
2023-09-16 19:49             ` Qais Yousef
2023-09-16 19:52               ` Qais Yousef
2023-08-27 23:31 ` [RFC PATCH 2/7] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
2023-09-06 12:56   ` Dietmar Eggemann
2023-09-06 20:44     ` Dietmar Eggemann
2023-09-06 21:38       ` Qais Yousef
2023-09-15  9:15   ` Hongyan Xia
2023-09-16 19:56     ` Qais Yousef
2023-08-27 23:31 ` [RFC PATCH 3/7] sched/fair: Remove magic margin in fits_capacity() Qais Yousef
2023-09-06 14:38   ` Dietmar Eggemann
2023-09-06 21:45     ` Qais Yousef
2023-08-27 23:32 ` [RFC PATCH 4/7] sched: cpufreq: Remove magic 1.25 headroom from apply_dvfs_headroom() Qais Yousef
2023-09-07 11:34   ` Peter Zijlstra
2023-09-10 19:23     ` Qais Yousef
2023-08-27 23:32 ` [RFC PATCH 5/7] sched/schedutil: Add a new tunable to dictate response time Qais Yousef
2023-09-06 21:13   ` Dietmar Eggemann
2023-09-06 21:52     ` Qais Yousef
2023-09-07 11:44   ` Peter Zijlstra
2023-09-10 19:25     ` Qais Yousef
2023-08-27 23:32 ` [RFC PATCH 6/7] sched/pelt: Introduce PELT multiplier Qais Yousef
2023-08-27 23:32 ` [RFC PATCH 7/7] cpufreq: Change default transition delay to 2ms Qais Yousef
2023-09-06  9:18 ` [RFC PATCH 0/7] sched: cpufreq: Remove magic margins Lukasz Luba
2023-09-06 21:18   ` Qais Yousef
2023-09-07  7:48     ` Lukasz Luba
2023-09-07 11:53       ` Peter Zijlstra
2023-09-07 13:06         ` Lukasz Luba
2023-09-07 13:29           ` Peter Zijlstra
2023-09-07 13:33             ` Lukasz Luba
2023-09-07 13:38               ` Peter Zijlstra
2023-09-07 13:45                 ` Lukasz Luba
2023-09-08 12:51                   ` Daniel Bristot de Oliveira
2023-09-12 11:57                     ` Lukasz Luba
2023-09-10 18:20         ` Qais Yousef
2023-09-10 18:14       ` Qais Yousef
2023-09-07 13:26     ` Peter Zijlstra
2023-09-07 13:57       ` Lukasz Luba
2023-09-07 14:29         ` Peter Zijlstra
2023-09-07 14:42           ` Lukasz Luba
2023-09-07 20:16             ` Peter Zijlstra
2023-09-12 11:51               ` Lukasz Luba
2023-09-12 14:01                 ` Vincent Guittot
2023-09-13  9:53                   ` Lukasz Luba
2023-09-12 14:28                 ` Peter Zijlstra
2023-09-10 19:06             ` Qais Yousef
2023-09-10 18:46       ` Qais Yousef
2023-09-07 13:08 ` Peter Zijlstra
2023-09-08  0:17   ` Qais Yousef
2023-09-08  7:40     ` Dietmar Eggemann
2023-09-08 14:07       ` Qais Yousef
2023-09-12 17:18         ` Dietmar Eggemann
2023-09-16 19:38           ` Qais Yousef
2023-09-08 10:25     ` Peter Zijlstra
2023-09-08 13:33       ` Qais Yousef
2023-09-08 13:58         ` Peter Zijlstra
2023-09-08 13:59         ` Peter Zijlstra
2023-09-08 14:11           ` Qais Yousef
2023-09-10 21:17         ` Qais Yousef
