* [PATCH 0/4] sched: cpufreq: Remove uclamp max-aggregation
From: Qais Yousef @ 2023-12-08  1:52 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia, Qais Yousef

One of the practical issues that has come up when trying to deploy uclamp in
practice was making uclamp_max effective. Due to max-aggregation at the rq
level, the effective uclamp_max value for a capped task can easily be lifted
by any other task that gets enqueued, which happens often enough in practice
to render the cap ineffective and leave much to be desired.

One of the solutions attempted out of tree was to filter out short running
tasks based on sched_slice(), and this proved to be effective enough:

	https://android-review.googlesource.com/c/kernel/common/+/1914696

But that solution is not upstream friendly as it introduces more magic
thresholds, and it's not clear how it would work after the EEVDF changes got
merged in.

In principle, max-aggregation is only required to address the frequency hint
part of uclamp. We don't need it in the wakeup path, as the per-task value
already tells us whether the task fits a CPU or not on HMP systems.

As pointed out in the remove-magic-hardcoded-margins series [1], the
limitation is actually in the governor/hardware, which are constrained in how
fast they can change frequencies, while uclamp usage can induce frequency
changes at a high rate.

To address these problems, we can turn this into a cpufreq governor issue and
let the governor deal with whatever limitations it has in responding to
a task's perf requirement hints. This means we need to tell the governor when
a frequency update is needed to cater for a task's perf hints, hence the new
SCHED_CPUFREQ_PERF_HINTS flag.

With the new flag, we can then send special updates to the cpufreq governor
on context switch *if* the incoming task has perf requirements that aren't
already met.
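
Roughly, the hook added at the end of the context switch looks like this
(a simplified copy of uclamp_context_switch() from patch 2; see the patch
for the real thing):

```
/*
 * Request a freq update on context switch only if the incoming task's
 * uclamp hints aren't already satisfied by the CPU's utilization.
 */
static inline void uclamp_context_switch(struct rq *rq)
{
	unsigned long uclamp_min, uclamp_max, util;

	/* Only RT and fair tasks are aware of uclamp */
	if (!rt_policy(current->policy) && !fair_policy(current->policy))
		return;

	uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
	uclamp_max = uclamp_eff_value(current, UCLAMP_MAX);
	util = rq->cfs.avg.util_avg;

	if (uclamp_min > util || uclamp_max < util)
		cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
}
```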

The governor can then choose to honour or ignore these requests based on
whatever criteria it sees fit. For schedutil I made use of the new
approximate_util_avg() to create a threshold based on rate_limit_us and
ignore perf requirements from tasks whose util_avg tells us they historically
run for less time than the hardware needs to change frequency. Such tasks
will go back to sleep before they see the freq change, so honouring their
request is pointless, hence we ignore it.

Instead of being exact, I assumed the task has to run for at least 500us
above rate_limit_us. The number is magical, but it matches what I've seen in
practice as a runtime where a task can actually do meaningful work that is
worthwhile. The exact definition of the threshold is subject to debate,
though; we could make it 1.5 * rate_limit_us for example. I preferred the
absolute number because, even on lower end hardware, I think this is
a meaningful unit of time for producing results that can make an impact.
There's a hidden assumption that most modern hardware already has fast enough
DVFS. The absolute number fails for super fast hardware though.
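
In pseudo-C, the resulting filter in schedutil looks roughly like the sketch
below. This is a simplified sketch of what patch 3 does: the real
ignore_short_tasks() has a slightly different signature, and rate_limit_util
is assumed here to be precomputed from rate_limit_us + 500us via
approximate_util_avg() whenever rate_limit_us changes.

```
/*
 * Simplified sketch of patch 3's filter, not the exact implementation.
 * rate_limit_util approximates the util_avg a task accumulates after
 * running for rate_limit_us + 500us.
 */
static bool ignore_short_task(struct rq *rq, struct sugov_policy *sg_policy,
			      unsigned int flags)
{
	struct task_struct *p = rq->curr;

	/* Only uclamp hint requests (context switch/tick) are filtered */
	if (!(flags & SCHED_CPUFREQ_PERF_HINTS))
		return false;

	/* Only fair tasks; util is our proxy for historical runtime */
	if (!fair_policy(p->policy))
		return false;

	/*
	 * If the task historically runs for less than the hardware needs
	 * to change frequency (plus the 500us margin), it will be asleep
	 * again before it sees the new frequency: ignore its request.
	 */
	return task_util_est(p) < sg_policy->tunables->rate_limit_util;
}
```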

This allows us to remove the uclamp max-aggregation logic completely and
makes it a cpufreq governor problem instead, which I think is the right thing
to do as the scheduler was overcompensating for what is in reality a cpufreq
governor limitation and policy. We just need to improve the interface with
the governor.

Implementing different policies/strategies to deal with the problem would be
easier now that the problem space has moved to the governor. It also keeps
the scheduler code clean and focused on the things that matter from
a scheduling point of view only.

For example, max-aggregation could be implemented in the governor by adding
a new flag when sending cpufreq_update_util() at enqueue/dequeue_task(). Not
that I think it's a good idea, but the possibility is there, especially for
platforms like x86 that have a lot of intelligence in firmware and might want
to implement something smarter at that level. They'd just need to improve the
interface with the governor.

===

This series is based on the remove margins series [1], and data was collected
against it as a baseline.

Testing on Pixel 6 with a mainline(ish) kernel

==

Speedometer browser benchmark

       | baseline  | 1.25 headroom |   patch   | patch + 1.25 headroom
-------+-----------+---------------+-----------+---------------------
score  |  108.03   |     135.72    |   108.09  |    137.47
-------+-----------+---------------+-----------+---------------------
power  |  1204.75  |    1451.79    |  1216.17  |    1484.73
-------+-----------+---------------+-----------+---------------------

No regressions.

===

UiBench

       | baseline  | 1.25 headroom |   patch   | patch + 1.25 headroom
-------+-----------+---------------+-----------+---------------------
jank   |    68.0   |      56.0     |    65.6   |    57.3
-------+-----------+---------------+-----------+---------------------
power  |   146.54  |     164.49    |   144.91  |    167.57
-------+-----------+---------------+-----------+---------------------

No regressions.

===

Spawning 8 busyloop threads, each pinned to a CPU, with uclamp_max set to
a low-ish OPP

```
adb shell "uclampset -M 90 taskset -a 01 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 90 taskset -a 02 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 90 taskset -a 04 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 90 taskset -a 08 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 270 taskset -a 10 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 270 taskset -a 20 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 670 taskset -a 40 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 670 taskset -a 80 cat /dev/zero > /dev/null &" &
```

And running Speedometer for a single iteration

       | baseline  |   patch   |
-------+-----------+-----------+
score  |   73.44   |   75.62   |
-------+-----------+-----------+
power  |   520.46  |   489.49  |
-------+-----------+-----------+

Similar score at lower power.

Little's Freq Residency:

         | baseline  |   patch   |
---------+-----------+-----------+
OPP@90   |   29.59%  |   49.25%  |
---------+-----------+-----------+
OPP@max  |   40.02%  |   12.31%  |
---------+-----------+-----------+

Mid's Freq Residency:

         | baseline  |   patch   |
---------+-----------+-----------+
OPP@270  |   50.02%  |   77.53%  |
---------+-----------+-----------+
OPP@max  |   49.17%  |   22.46%  |
---------+-----------+-----------+

Big's Freq Residency:

         | baseline  |   patch   |
---------+-----------+-----------+
OPP@670  |   46.43%  |   54.44%  |
---------+-----------+-----------+
OPP@max  |   1.76%   |   4.57%   |
---------+-----------+-----------+

As you can see, the residency at the capped frequency has increased
considerably for all clusters. The time spent running at max frequency is
reduced significantly for little and mid. For big there's a slight increase,
but both numbers are suspiciously low. With the busy loops in the background
these cores are more subject to throttling, so the higher number indicates
we've been throttled less.

---

Patch 1 cleans up the code that sends cpufreq_update_util() so updates are
more intentional and less noisy.

Patch 2 removes uclamp max-aggregation and sends cpufreq_update_util()
updates at context switch instead.

Patch 3 implements the logic to filter out tasks whose runtime is short
compared to rate_limit_us and adds the perf hint flag at task_tick_fair().

Patch 4 updates uclamp docs to reflect the new changes.

[1] https://lore.kernel.org/lkml/20231208002342.367117-1-qyousef@layalina.io/

Thanks!

--
Qais Yousef

Qais Yousef (4):
  sched/fair: Be less aggressive in calling cpufreq_update_util()
  sched/uclamp: Remove rq max aggregation
  sched/schedutil: Ignore update requests for short running tasks
  sched/documentation: Remove reference to max aggregation

 Documentation/scheduler/sched-util-clamp.rst | 239 ++---------
 include/linux/sched.h                        |  16 -
 include/linux/sched/cpufreq.h                |   3 +-
 init/Kconfig                                 |  31 --
 kernel/sched/core.c                          | 426 +++----------------
 kernel/sched/cpufreq_schedutil.c             |  61 ++-
 kernel/sched/fair.c                          | 157 ++-----
 kernel/sched/rt.c                            |   4 -
 kernel/sched/sched.h                         | 150 ++-----
 9 files changed, 244 insertions(+), 843 deletions(-)

-- 
2.34.1



* [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
From: Qais Yousef @ 2023-12-08  1:52 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia, Qais Yousef

Due to the way the code is structured, it makes a lot of sense to trigger
cpufreq_update_util() from update_load_avg(). But this is too aggressive,
as in most cases we call update_load_avg() while iterating through the
entities in the hierarchy, so we end up sending too many requests in
a loop as we're updating the hierarchy.

Combined with the rate limit in schedutil, we could end up prematurely
sending a wrong frequency update before we have actually updated all
entities appropriately.

Be smarter about it by limiting the trigger to perform frequency updates
only after all accounting logic is done. This ended up being at the
following points:

1. enqueue/dequeue_task_fair()
2. throttle/unthrottle_cfs_rq()
3. attach/detach_task_cfs_rq()
4. task_tick_fair()
5. __sched_group_set_shares()

This is still not 100% ideal due to other limitations that might be a bit
harder to handle. Namely, we can end up with a premature update request in
the following situations:

a. Simultaneous task enqueues on the same CPU where the 2nd task is bigger
   and requires a higher freq. The trigger of cpufreq_update_util() by the
   first task will lead to dropping the 2nd request until the tick, or
   until another CPU in the same policy triggers a freq update.

b. CPUs sharing a policy can hit the same race as in (a), but with the
   simultaneous enqueues happening on different CPUs in the same policy.

The above, though, are limitations in the governor/hardware, and from the
scheduler point of view at least that's the best we can do. The governor
might consider smarter logic to aggregate near-simultaneous requests and
honour the higher one.
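
A purely hypothetical sketch of what such aggregation could look like on
the governor side; the pending_max_util field below is invented for
illustration and is not part of this series:

```
/*
 * Hypothetical only: instead of dropping requests that arrive inside the
 * rate limit window, remember the highest one and honour it once the
 * window expires.
 */
static unsigned long sugov_aggregate_pending(struct sugov_policy *sg_policy,
					     unsigned long util, u64 time)
{
	if (!sugov_should_update_freq(sg_policy, time)) {
		/* Blocked by the rate limit: remember the largest request */
		sg_policy->pending_max_util =
			max(sg_policy->pending_max_util, util);
		return 0;
	}

	/* Window expired: pick the highest request seen in the meantime */
	util = max(util, sg_policy->pending_max_util);
	sg_policy->pending_max_util = 0;

	return util;
}
```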

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/fair.c | 55 ++++++++++++---------------------------------
 1 file changed, 14 insertions(+), 41 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b83448be3f79..f99910fc6705 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3997,29 +3997,6 @@ static inline void update_cfs_group(struct sched_entity *se)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
-{
-	struct rq *rq = rq_of(cfs_rq);
-
-	if (&rq->cfs == cfs_rq) {
-		/*
-		 * There are a few boundary cases this might miss but it should
-		 * get called often enough that that should (hopefully) not be
-		 * a real problem.
-		 *
-		 * It will not get called when we go idle, because the idle
-		 * thread is a different class (!fair), nor will the utilization
-		 * number include things like RT tasks.
-		 *
-		 * As is, the util number is not freq-invariant (we'd have to
-		 * implement arch_scale_freq_capacity() for that).
-		 *
-		 * See cpu_util_cfs().
-		 */
-		cpufreq_update_util(rq, flags);
-	}
-}
-
 #ifdef CONFIG_SMP
 static inline bool load_avg_is_decayed(struct sched_avg *sa)
 {
@@ -4648,8 +4625,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
 
-	cfs_rq_util_change(cfs_rq, 0);
-
 	trace_pelt_cfs_tp(cfs_rq);
 }
 
@@ -4678,8 +4653,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
 
-	cfs_rq_util_change(cfs_rq, 0);
-
 	trace_pelt_cfs_tp(cfs_rq);
 }
 
@@ -4726,11 +4699,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		 */
 		detach_entity_load_avg(cfs_rq, se);
 		update_tg_load_avg(cfs_rq);
-	} else if (decayed) {
-		cfs_rq_util_change(cfs_rq, 0);
-
-		if (flags & UPDATE_TG)
-			update_tg_load_avg(cfs_rq);
+	} else if (decayed && (flags & UPDATE_TG)) {
+		update_tg_load_avg(cfs_rq);
 	}
 }
 
@@ -5114,7 +5084,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
-	cfs_rq_util_change(cfs_rq, 0);
 }
 
 static inline void remove_entity_load_avg(struct sched_entity *se) {}
@@ -5807,6 +5776,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	sub_nr_running(rq, task_delta);
 
 done:
+	cpufreq_update_util(rq, 0);
+
 	/*
 	 * Note: distribution will already see us throttled via the
 	 * throttled-list.  rq->lock protects completion.
@@ -5899,6 +5870,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 unthrottle_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
+	cpufreq_update_util(rq, 0);
+
 	/* Determine whether we need to wake up potentially idle CPU: */
 	if (rq->curr == rq->idle && rq->cfs.nr_running)
 		resched_curr(rq);
@@ -6704,14 +6677,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
-	/*
-	 * If in_iowait is set, the code below may not trigger any cpufreq
-	 * utilization updates, so do it here explicitly with the IOWAIT flag
-	 * passed.
-	 */
-	if (p->in_iowait)
-		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
-
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
@@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
+	cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
+
 	hrtick_update(rq);
 }
 
@@ -6849,6 +6816,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	cpufreq_update_util(rq, 0);
 	hrtick_update(rq);
 }
 
@@ -8482,6 +8450,7 @@ done: __maybe_unused;
 
 	update_misfit_status(p, rq);
 	sched_fair_update_stop_tick(rq, p);
+	cpufreq_update_util(rq, 0);
 
 	return p;
 
@@ -12615,6 +12584,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+	cpufreq_update_util(rq, 0);
 
 	task_tick_core(rq, curr);
 }
@@ -12739,6 +12709,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
 	struct sched_entity *se = &p->se;
 
 	detach_entity_cfs_rq(se);
+	cpufreq_update_util(task_rq(p), 0);
 }
 
 static void attach_task_cfs_rq(struct task_struct *p)
@@ -12746,6 +12717,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	struct sched_entity *se = &p->se;
 
 	attach_entity_cfs_rq(se);
+	cpufreq_update_util(task_rq(p), 0);
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
@@ -12991,6 +12963,7 @@ static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
 			update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
 			update_cfs_group(se);
 		}
+		cpufreq_update_util(rq, 0);
 		rq_unlock_irqrestore(rq, &rf);
 	}
 
-- 
2.34.1



* [PATCH 2/4] sched/uclamp: Remove rq max aggregation
From: Qais Yousef @ 2023-12-08  1:52 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia, Qais Yousef

Max aggregation was intended to facilitate combining the different uclamp
performance hints from all RUNNABLE tasks on the rq. But in practice this
ended up being a pain, especially when it comes to making uclamp_max
effective.

What we've seen in practice is that the capping of those busy tasks whose
performance (or power consumption) we really want to limit can easily be
disrupted by a lot of small tasks waking up during their busy run period.
Combined with the schedutil rate limit, we end up effectively running at
full throttle without the capping taking effect, as these small tasks keep
being added to the rq, making the max aggregated value 1024 most (if not
effectively all) of the time.

Instead of max aggregation, we can go with the simpler approach of applying
the uclamp values of the RUNNING task.

Note that max aggregation is important only for freq selection. For task
placement, the load balancer already looks only at the task's own uclamp
values. In EAS we look at the rq, but EAS cares about frequency selection
too, so it needed to care about max aggregation for that reason when
deciding where to place the task.

To make sure the cpufreq governor follows the new uclamp settings on
context switch, we add a new call to cpufreq_update_util() at the end of
the context switch. The governor should then clamp the util with the
uclamp values of whatever task is currently running on each CPU in the
policy. A new flag, SCHED_CPUFREQ_PERF_HINTS, was added to help distinguish
these requests.

This approach allows us to simplify the code by removing all notion of
buckets, the uclamp static key, and the need to inc/dec refcounts in the
enqueue/dequeue path, which was considered an expensive addition to the
fast path.

The downside is that we still have the issue of small tasks waking up and
momentarily causing the CPU to spike to the high frequency instigated by
the capped task. It is less severe though, as the busy task will be capped
for the majority of the time it is RUNNING (limited by the DVFS hardware
response time to change frequency at each context switch), compared to
previously, where the cap was lifted whenever an uncapped task was
enqueued.

The problem of these small tasks causing the capping to be removed (or
even causing a boost if they have a high uclamp_min value) should be
shifted to the governor. If a task's runtime is shorter than the DVFS
hardware's ability to change frequency, then the task's requirements
should be ignored, as it won't run long enough to observe the change due
to the hardware limitation.

To keep access to uclamp_eff_value() fast, we continue to store the
effective value in p->uclamp - but instead of relying on the active flag,
which was set at enqueue and cleared at dequeue, we update it
unconditionally whenever uclamp_min/max is changed via a sched_setattr()
syscall or when uclamp_min/max is changed anywhere in the hierarchy,
including the root group. A new update_uclamp_req() helper keeps both
p->uclamp_req and p->uclamp in sync.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 include/linux/sched.h            |  16 --
 include/linux/sched/cpufreq.h    |   3 +-
 init/Kconfig                     |  31 ---
 kernel/sched/core.c              | 426 +++++--------------------------
 kernel/sched/cpufreq_schedutil.c |   2 +-
 kernel/sched/fair.c              |  80 ++----
 kernel/sched/rt.c                |   4 -
 kernel/sched/sched.h             | 128 ++--------
 8 files changed, 104 insertions(+), 586 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d258162deb0..73b0c6ef8fd9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -700,25 +700,11 @@ struct sched_dl_entity {
 };
 
 #ifdef CONFIG_UCLAMP_TASK
-/* Number of utilization clamp buckets (shorter alias) */
-#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
-
 /*
  * Utilization clamp for a scheduling entity
  * @value:		clamp value "assigned" to a se
- * @bucket_id:		bucket index corresponding to the "assigned" value
- * @active:		the se is currently refcounted in a rq's bucket
  * @user_defined:	the requested clamp value comes from user-space
  *
- * The bucket_id is the index of the clamp bucket matching the clamp value
- * which is pre-computed and stored to avoid expensive integer divisions from
- * the fast path.
- *
- * The active bit is set whenever a task has got an "effective" value assigned,
- * which can be different from the clamp value "requested" from user-space.
- * This allows to know a task is refcounted in the rq's bucket corresponding
- * to the "effective" bucket_id.
- *
  * The user_defined bit is set whenever a task has got a task-specific clamp
  * value requested from userspace, i.e. the system defaults apply to this task
  * just as a restriction. This allows to relax default clamps when a less
@@ -728,8 +714,6 @@ struct sched_dl_entity {
  */
 struct uclamp_se {
 	unsigned int value		: bits_per(SCHED_CAPACITY_SCALE);
-	unsigned int bucket_id		: bits_per(UCLAMP_BUCKETS);
-	unsigned int active		: 1;
 	unsigned int user_defined	: 1;
 };
 #endif /* CONFIG_UCLAMP_TASK */
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index d01755d3142f..185f2470597b 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -8,7 +8,8 @@
  * Interface between cpufreq drivers and the scheduler:
  */
 
-#define SCHED_CPUFREQ_IOWAIT	(1U << 0)
+#define SCHED_CPUFREQ_IOWAIT		(1U << 0)
+#define SCHED_CPUFREQ_PERF_HINTS	(1U << 1)
 
 #ifdef CONFIG_CPU_FREQ
 struct cpufreq_policy;
diff --git a/init/Kconfig b/init/Kconfig
index 9ffb103fc927..8f0cf36cd577 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -809,37 +809,6 @@ config UCLAMP_TASK
 
 	  If in doubt, say N.
 
-config UCLAMP_BUCKETS_COUNT
-	int "Number of supported utilization clamp buckets"
-	range 5 20
-	default 5
-	depends on UCLAMP_TASK
-	help
-	  Defines the number of clamp buckets to use. The range of each bucket
-	  will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
-	  number of clamp buckets the finer their granularity and the higher
-	  the precision of clamping aggregation and tracking at run-time.
-
-	  For example, with the minimum configuration value we will have 5
-	  clamp buckets tracking 20% utilization each. A 25% boosted tasks will
-	  be refcounted in the [20..39]% bucket and will set the bucket clamp
-	  effective value to 25%.
-	  If a second 30% boosted task should be co-scheduled on the same CPU,
-	  that task will be refcounted in the same bucket of the first task and
-	  it will boost the bucket clamp effective value to 30%.
-	  The clamp effective value of a bucket is reset to its nominal value
-	  (20% in the example above) when there are no more tasks refcounted in
-	  that bucket.
-
-	  An additional boost/capping margin can be added to some tasks. In the
-	  example above the 25% task will be boosted to 30% until it exits the
-	  CPU. If that should be considered not acceptable on certain systems,
-	  it's always possible to reduce the margin by increasing the number of
-	  clamp buckets to trade off used memory for run-time tracking
-	  precision.
-
-	  If in doubt, use the default value.
-
 endmenu
 
 #
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9c8626b4ddff..5095f40edf0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1389,37 +1389,9 @@ static unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCA
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
-/*
- * This static key is used to reduce the uclamp overhead in the fast path. It
- * primarily disables the call to uclamp_rq_{inc, dec}() in
- * enqueue/dequeue_task().
- *
- * This allows users to continue to enable uclamp in their kernel config with
- * minimum uclamp overhead in the fast path.
- *
- * As soon as userspace modifies any of the uclamp knobs, the static key is
- * enabled, since we have an actual users that make use of uclamp
- * functionality.
- *
- * The knobs that would enable this static key are:
- *
- *   * A task modifying its uclamp value with sched_setattr().
- *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
- *   * An admin modifying the cgroup cpu.uclamp.{min, max}
- */
-DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);
-
-/* Integer rounded range for each bucket */
-#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
-
 #define for_each_clamp_id(clamp_id) \
 	for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++)
 
-static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
-{
-	return min_t(unsigned int, clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1);
-}
-
 static inline unsigned int uclamp_none(enum uclamp_id clamp_id)
 {
 	if (clamp_id == UCLAMP_MIN)
@@ -1431,73 +1403,25 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
 				 unsigned int value, bool user_defined)
 {
 	uc_se->value = value;
-	uc_se->bucket_id = uclamp_bucket_id(value);
 	uc_se->user_defined = user_defined;
 }
 
-static inline unsigned int
-uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
-		  unsigned int clamp_value)
-{
-	/*
-	 * Avoid blocked utilization pushing up the frequency when we go
-	 * idle (which drops the max-clamp) by retaining the last known
-	 * max-clamp.
-	 */
-	if (clamp_id == UCLAMP_MAX) {
-		rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
-		return clamp_value;
-	}
-
-	return uclamp_none(UCLAMP_MIN);
-}
-
-static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
-				     unsigned int clamp_value)
-{
-	/* Reset max-clamp retention only on idle exit */
-	if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
-		return;
-
-	uclamp_rq_set(rq, clamp_id, clamp_value);
-}
-
-static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
-				   unsigned int clamp_value)
-{
-	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
-	int bucket_id = UCLAMP_BUCKETS - 1;
-
-	/*
-	 * Since both min and max clamps are max aggregated, find the
-	 * top most bucket with tasks in.
-	 */
-	for ( ; bucket_id >= 0; bucket_id--) {
-		if (!bucket[bucket_id].tasks)
-			continue;
-		return bucket[bucket_id].value;
-	}
-
-	/* No tasks -- default clamp values */
-	return uclamp_idle_value(rq, clamp_id, clamp_value);
-}
+static inline void update_uclamp_req(struct task_struct *p,
+				     enum uclamp_id clamp_id,
+				     unsigned int value, bool user_defined);
 
 static void __uclamp_update_util_min_rt_default(struct task_struct *p)
 {
 	unsigned int default_util_min;
-	struct uclamp_se *uc_se;
 
 	lockdep_assert_held(&p->pi_lock);
 
-	uc_se = &p->uclamp_req[UCLAMP_MIN];
-
 	/* Only sync if user didn't override the default */
-	if (uc_se->user_defined)
+	if (p->uclamp_req[UCLAMP_MIN].user_defined)
 		return;
 
 	default_util_min = sysctl_sched_uclamp_util_min_rt_default;
-	uclamp_se_set(uc_se, default_util_min, false);
+	update_uclamp_req(p, UCLAMP_MIN, default_util_min, false);
 }
 
 static void uclamp_update_util_min_rt_default(struct task_struct *p)
@@ -1545,7 +1469,7 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
  *   group or in an autogroup
  * - the system default clamp value, defined by the sysadmin
  */
-static inline struct uclamp_se
+static inline unsigned long
 uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
@@ -1553,241 +1477,27 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 
 	/* System default restrictions always apply */
 	if (unlikely(uc_req.value > uc_max.value))
-		return uc_max;
+		return uc_max.value;
 
-	return uc_req;
+	return uc_req.value;
 }
 
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
-	struct uclamp_se uc_eff;
-
-	/* Task currently refcounted: use back-annotated (effective) value */
-	if (p->uclamp[clamp_id].active)
-		return (unsigned long)p->uclamp[clamp_id].value;
-
-	uc_eff = uclamp_eff_get(p, clamp_id);
-
-	return (unsigned long)uc_eff.value;
+	/* This should be kept up-to-date whenever uclamp value is changed */
+	return (unsigned long)p->uclamp[clamp_id].value;
 }
 
-/*
- * When a task is enqueued on a rq, the clamp bucket currently defined by the
- * task's uclamp::bucket_id is refcounted on that rq. This also immediately
- * updates the rq's clamp value if required.
- *
- * Tasks can have a task-specific value requested from user-space, track
- * within each bucket the maximum value for tasks refcounted in it.
- * This "local max aggregation" allows to track the exact "requested" value
- * for each bucket when all its RUNNABLE tasks require the same clamp.
- */
-static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
-				    enum uclamp_id clamp_id)
+/* Update task's p->uclamp_req and effective p->uclamp in one go */
+static inline void update_uclamp_req(struct task_struct *p,
+				     enum uclamp_id clamp_id,
+				     unsigned int value, bool user_defined)
 {
-	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
-	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
-	struct uclamp_bucket *bucket;
-
-	lockdep_assert_rq_held(rq);
-
-	/* Update task effective clamp */
-	p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
-
-	bucket = &uc_rq->bucket[uc_se->bucket_id];
-	bucket->tasks++;
-	uc_se->active = true;
-
-	uclamp_idle_reset(rq, clamp_id, uc_se->value);
-
-	/*
-	 * Local max aggregation: rq buckets always track the max
-	 * "requested" clamp value of its RUNNABLE tasks.
-	 */
-	if (bucket->tasks == 1 || uc_se->value > bucket->value)
-		bucket->value = uc_se->value;
-
-	if (uc_se->value > uclamp_rq_get(rq, clamp_id))
-		uclamp_rq_set(rq, clamp_id, uc_se->value);
-}
-
-/*
- * When a task is dequeued from a rq, the clamp bucket refcounted by the task
- * is released. If this is the last task reference counting the rq's max
- * active clamp value, then the rq's clamp value is updated.
- *
- * Both refcounted tasks and rq's cached clamp values are expected to be
- * always valid. If it's detected they are not, as defensive programming,
- * enforce the expected state and warn.
- */
-static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
-				    enum uclamp_id clamp_id)
-{
-	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
-	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
-	struct uclamp_bucket *bucket;
-	unsigned int bkt_clamp;
-	unsigned int rq_clamp;
-
-	lockdep_assert_rq_held(rq);
-
-	/*
-	 * If sched_uclamp_used was enabled after task @p was enqueued,
-	 * we could end up with unbalanced call to uclamp_rq_dec_id().
-	 *
-	 * In this case the uc_se->active flag should be false since no uclamp
-	 * accounting was performed at enqueue time and we can just return
-	 * here.
-	 *
-	 * Need to be careful of the following enqueue/dequeue ordering
-	 * problem too
-	 *
-	 *	enqueue(taskA)
-	 *	// sched_uclamp_used gets enabled
-	 *	enqueue(taskB)
-	 *	dequeue(taskA)
-	 *	// Must not decrement bucket->tasks here
-	 *	dequeue(taskB)
-	 *
-	 * where we could end up with stale data in uc_se and
-	 * bucket[uc_se->bucket_id].
-	 *
-	 * The following check here eliminates the possibility of such race.
-	 */
-	if (unlikely(!uc_se->active))
-		return;
-
-	bucket = &uc_rq->bucket[uc_se->bucket_id];
-
-	SCHED_WARN_ON(!bucket->tasks);
-	if (likely(bucket->tasks))
-		bucket->tasks--;
-
-	uc_se->active = false;
-
-	/*
-	 * Keep "local max aggregation" simple and accept to (possibly)
-	 * overboost some RUNNABLE tasks in the same bucket.
-	 * The rq clamp bucket value is reset to its base value whenever
-	 * there are no more RUNNABLE tasks refcounting it.
-	 */
-	if (likely(bucket->tasks))
-		return;
-
-	rq_clamp = uclamp_rq_get(rq, clamp_id);
-	/*
-	 * Defensive programming: this should never happen. If it happens,
-	 * e.g. due to future modification, warn and fixup the expected value.
-	 */
-	SCHED_WARN_ON(bucket->value > rq_clamp);
-	if (bucket->value >= rq_clamp) {
-		bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
-		uclamp_rq_set(rq, clamp_id, bkt_clamp);
-	}
-}
-
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
-{
-	enum uclamp_id clamp_id;
-
-	/*
-	 * Avoid any overhead until uclamp is actually used by the userspace.
-	 *
-	 * The condition is constructed such that a NOP is generated when
-	 * sched_uclamp_used is disabled.
-	 */
-	if (!static_branch_unlikely(&sched_uclamp_used))
-		return;
-
-	if (unlikely(!p->sched_class->uclamp_enabled))
-		return;
-
-	for_each_clamp_id(clamp_id)
-		uclamp_rq_inc_id(rq, p, clamp_id);
-
-	/* Reset clamp idle holding when there is one RUNNABLE task */
-	if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
-		rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
-}
-
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
-{
-	enum uclamp_id clamp_id;
-
-	/*
-	 * Avoid any overhead until uclamp is actually used by the userspace.
-	 *
-	 * The condition is constructed such that a NOP is generated when
-	 * sched_uclamp_used is disabled.
-	 */
-	if (!static_branch_unlikely(&sched_uclamp_used))
-		return;
-
-	if (unlikely(!p->sched_class->uclamp_enabled))
-		return;
-
-	for_each_clamp_id(clamp_id)
-		uclamp_rq_dec_id(rq, p, clamp_id);
-}
-
-static inline void uclamp_rq_reinc_id(struct rq *rq, struct task_struct *p,
-				      enum uclamp_id clamp_id)
-{
-	if (!p->uclamp[clamp_id].active)
-		return;
-
-	uclamp_rq_dec_id(rq, p, clamp_id);
-	uclamp_rq_inc_id(rq, p, clamp_id);
-
-	/*
-	 * Make sure to clear the idle flag if we've transiently reached 0
-	 * active tasks on rq.
-	 */
-	if (clamp_id == UCLAMP_MAX && (rq->uclamp_flags & UCLAMP_FLAG_IDLE))
-		rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
-}
-
-static inline void
-uclamp_update_active(struct task_struct *p)
-{
-	enum uclamp_id clamp_id;
-	struct rq_flags rf;
-	struct rq *rq;
-
-	/*
-	 * Lock the task and the rq where the task is (or was) queued.
-	 *
-	 * We might lock the (previous) rq of a !RUNNABLE task, but that's the
-	 * price to pay to safely serialize util_{min,max} updates with
-	 * enqueues, dequeues and migration operations.
-	 * This is the same locking schema used by __set_cpus_allowed_ptr().
-	 */
-	rq = task_rq_lock(p, &rf);
-
-	/*
-	 * Setting the clamp bucket is serialized by task_rq_lock().
-	 * If the task is not yet RUNNABLE and its task_struct is not
-	 * affecting a valid clamp bucket, the next time it's enqueued,
-	 * it will already see the updated clamp bucket value.
-	 */
-	for_each_clamp_id(clamp_id)
-		uclamp_rq_reinc_id(rq, p, clamp_id);
-
-	task_rq_unlock(rq, p, &rf);
+	uclamp_se_set(&p->uclamp_req[clamp_id], value, user_defined);
+	uclamp_se_set(&p->uclamp[clamp_id], uclamp_eff_get(p, clamp_id), user_defined);
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
-static inline void
-uclamp_update_active_tasks(struct cgroup_subsys_state *css)
-{
-	struct css_task_iter it;
-	struct task_struct *p;
-
-	css_task_iter_start(css, 0, &it);
-	while ((p = css_task_iter_next(&it)))
-		uclamp_update_active(p);
-	css_task_iter_end(&it);
-}
-
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 #endif
 
@@ -1874,15 +1584,11 @@ static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 		update_root_tg = true;
 	}
 
-	if (update_root_tg) {
-		static_branch_enable(&sched_uclamp_used);
+	if (update_root_tg)
 		uclamp_update_root_tg();
-	}
 
-	if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default) {
-		static_branch_enable(&sched_uclamp_used);
+	if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default)
 		uclamp_sync_util_min_rt_default();
-	}
 
 	/*
 	 * We update all RUNNABLE tasks only when task groups are in use.
@@ -1923,15 +1629,6 @@ static int uclamp_validate(struct task_struct *p,
 	if (util_min != -1 && util_max != -1 && util_min > util_max)
 		return -EINVAL;
 
-	/*
-	 * We have valid uclamp attributes; make sure uclamp is enabled.
-	 *
-	 * We need to do that here, because enabling static branches is a
-	 * blocking operation which obviously cannot be done while holding
-	 * scheduler locks.
-	 */
-	static_branch_enable(&sched_uclamp_used);
-
 	return 0;
 }
 
@@ -1981,7 +1678,7 @@ static void __setscheduler_uclamp(struct task_struct *p,
 		else
 			value = uclamp_none(clamp_id);
 
-		uclamp_se_set(uc_se, value, false);
+		update_uclamp_req(p, clamp_id, value, false);
 
 	}
 
@@ -1990,14 +1687,14 @@ static void __setscheduler_uclamp(struct task_struct *p,
 
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN &&
 	    attr->sched_util_min != -1) {
-		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
-			      attr->sched_util_min, true);
+
+		update_uclamp_req(p, UCLAMP_MIN, attr->sched_util_min, true);
 	}
 
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
 	    attr->sched_util_max != -1) {
-		uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
-			      attr->sched_util_max, true);
+
+		update_uclamp_req(p, UCLAMP_MAX, attr->sched_util_max, true);
 	}
 }
 
@@ -2009,15 +1706,12 @@ static void uclamp_fork(struct task_struct *p)
 	 * We don't need to hold task_rq_lock() when updating p->uclamp_* here
 	 * as the task is still at its early fork stages.
 	 */
-	for_each_clamp_id(clamp_id)
-		p->uclamp[clamp_id].active = false;
-
 	if (likely(!p->sched_reset_on_fork))
 		return;
 
 	for_each_clamp_id(clamp_id) {
-		uclamp_se_set(&p->uclamp_req[clamp_id],
-			      uclamp_none(clamp_id), false);
+		update_uclamp_req(&init_task, clamp_id,
+				  uclamp_none(clamp_id), false);
 	}
 }
 
@@ -2026,32 +1720,17 @@ static void uclamp_post_fork(struct task_struct *p)
 	uclamp_update_util_min_rt_default(p);
 }
 
-static void __init init_uclamp_rq(struct rq *rq)
-{
-	enum uclamp_id clamp_id;
-	struct uclamp_rq *uc_rq = rq->uclamp;
-
-	for_each_clamp_id(clamp_id) {
-		uc_rq[clamp_id] = (struct uclamp_rq) {
-			.value = uclamp_none(clamp_id)
-		};
-	}
-
-	rq->uclamp_flags = UCLAMP_FLAG_IDLE;
-}
-
 static void __init init_uclamp(void)
 {
 	struct uclamp_se uc_max = {};
 	enum uclamp_id clamp_id;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		init_uclamp_rq(cpu_rq(cpu));
 
 	for_each_clamp_id(clamp_id) {
 		uclamp_se_set(&init_task.uclamp_req[clamp_id],
 			      uclamp_none(clamp_id), false);
+
+		uclamp_se_set(&init_task.uclamp[clamp_id],
+			      uclamp_none(clamp_id), false);
 	}
 
 	/* System defaults allow max clamp values for both indexes */
@@ -2066,8 +1745,6 @@ static void __init init_uclamp(void)
 }
 
 #else /* CONFIG_UCLAMP_TASK */
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
 static inline int uclamp_validate(struct task_struct *p,
 				  const struct sched_attr *attr)
 {
@@ -2114,7 +1791,6 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 		psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
 	}
 
-	uclamp_rq_inc(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
 
 	if (sched_core_enabled(rq))
@@ -2134,7 +1810,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 		psi_dequeue(p, flags & DEQUEUE_SLEEP);
 	}
 
-	uclamp_rq_dec(rq, p);
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
@@ -5150,6 +4825,7 @@ static inline void finish_lock_switch(struct rq *rq)
 	 * prev into current:
 	 */
 	spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+	uclamp_context_switch(rq);
 	__balance_callbacks(rq);
 	raw_spin_rq_unlock_irq(rq);
 }
@@ -7469,13 +7145,17 @@ int sched_core_idle_cpu(int cpu)
  */
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 				 unsigned long *min,
-				 unsigned long *max)
+				 unsigned long *max,
+				 struct task_struct *p)
 {
 	unsigned long util, irq, scale;
 	struct rq *rq = cpu_rq(cpu);
 
 	scale = arch_scale_cpu_capacity(cpu);
 
+	if (!p)
+		p = rq->curr;
+
 	/*
 	 * Early check to see if IRQ/steal time saturates the CPU, can be
 	 * because of inaccuracies in how we track these -- see
@@ -7497,13 +7177,13 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 		 *   steals time to the deadline task.
 		 * - The minimum performance requirement for CFS and/or RT.
 		 */
-		*min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
+		*min = max(irq + cpu_bw_dl(rq), uclamp_eff_value(p, UCLAMP_MIN));
 
 		/*
 		 * When an RT task is runnable and uclamp is not used, we must
 		 * ensure that the task will run at maximum compute capacity.
 		 */
-		if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
+		if (!IS_ENABLED(CONFIG_UCLAMP_TASK) && rt_rq_is_runnable(&rq->rt))
 			*min = max(*min, scale);
 	}
 
@@ -7521,7 +7201,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * than the actual utilization because of uclamp_max requirements.
 	 */
 	if (max)
-		*max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+		*max = min(scale, uclamp_eff_value(p, UCLAMP_MAX));
 
 	if (util >= scale)
 		return scale;
@@ -7543,7 +7223,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 
 unsigned long sched_cpu_util(int cpu)
 {
-	return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
+	return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL, NULL);
 }
 #endif /* CONFIG_SMP */
 
@@ -10585,6 +10265,30 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static inline void
+uclamp_update_css_tasks(struct cgroup_subsys_state *css)
+{
+	enum uclamp_id clamp_id;
+	struct css_task_iter it;
+	struct task_struct *p;
+
+	/* Update effective uclamp value for all the css tasks */
+	css_task_iter_start(css, 0, &it);
+	while ((p = css_task_iter_next(&it))) {
+
+		/* Only RT and fair support uclamp */
+		if (!rt_policy(p->policy) && !fair_policy(p->policy))
+			continue;
+
+		/* Update task effect value */
+		for_each_clamp_id(clamp_id) {
+			uclamp_se_set(&p->uclamp[clamp_id],
+				      uclamp_eff_get(p, clamp_id), false);
+		}
+	}
+	css_task_iter_end(&it);
+}
+
 static void cpu_util_update_eff(struct cgroup_subsys_state *css)
 {
 	struct cgroup_subsys_state *top_css = css;
@@ -10620,7 +10324,6 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
 			if (eff[clamp_id] == uc_se[clamp_id].value)
 				continue;
 			uc_se[clamp_id].value = eff[clamp_id];
-			uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
 			clamps |= (0x1 << clamp_id);
 		}
 		if (!clamps) {
@@ -10628,8 +10331,7 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
 			continue;
 		}
 
-		/* Immediately update descendants RUNNABLE tasks */
-		uclamp_update_active_tasks(css);
+		uclamp_update_css_tasks(css);
 	}
 }
 
@@ -10687,8 +10389,6 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
 	if (req.ret)
 		return req.ret;
 
-	static_branch_enable(&sched_uclamp_used);
-
 	guard(mutex)(&uclamp_mutex);
 	guard(rcu)();
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 788208becc13..137636f62593 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -244,7 +244,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
 {
 	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
 
-	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
+	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max, NULL);
 	util = max(util, boost);
 	sg_cpu->bw_min = min;
 	sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f99910fc6705..74326179658c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4931,7 +4931,7 @@ static inline int util_fits_cpu(unsigned long util,
 	 */
 	fits = fits_capacity(util, capacity);
 
-	if (!uclamp_is_used())
+	if (!IS_ENABLED(CONFIG_UCLAMP_TASK))
 		return fits;
 
 	/*
@@ -6624,11 +6624,18 @@ static inline void hrtick_update(struct rq *rq)
 #ifdef CONFIG_SMP
 static inline bool cpu_overutilized(int cpu)
 {
-	unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
-	unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
+	struct task_struct *p = cpu_rq(cpu)->curr;
+	unsigned long util_min = uclamp_eff_value(p, UCLAMP_MIN);
+	unsigned long util_max = uclamp_eff_value(p, UCLAMP_MAX);
 
-	/* Return true only if the utilization doesn't fit CPU's capacity */
-	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+	/*
+	 * Return true only if the utilization doesn't fit CPU's capacity
+	 *
+	 * TODO: this definition of overutilized = misfit() is stale and no
+	 * longer fit for purpose and must be updated to reflect when the
+	 * system or a cluster is truly overloaded.
+	 */
+	return !util_fits_cpu(cpu_util_cfs(cpu), util_min, util_max, cpu);
 }
 
 static inline void update_overutilized_status(struct rq *rq)
@@ -7776,7 +7783,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
 	for_each_cpu(cpu, pd_cpus) {
 		unsigned long util = cpu_util(cpu, p, -1, 0);
 
-		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+		busy_time += effective_cpu_util(cpu, util, NULL, NULL, NULL);
 	}
 
 	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7801,30 +7808,9 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
 		unsigned long eff_util, min, max;
 
-		/*
-		 * Performance domain frequency: utilization clamping
-		 * must be considered since it affects the selection
-		 * of the performance domain frequency.
-		 * NOTE: in case RT tasks are running, by default the
-		 * FREQUENCY_UTIL's utilization can be max OPP.
-		 */
-		eff_util = effective_cpu_util(cpu, util, &min, &max);
-
-		/* Task's uclamp can modify min and max value */
-		if (tsk && uclamp_is_used()) {
-			min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
-			/*
-			 * If there is no active max uclamp constraint,
-			 * directly use task's one, otherwise keep max.
-			 */
-			if (uclamp_rq_is_idle(cpu_rq(cpu)))
-				max = uclamp_eff_value(p, UCLAMP_MAX);
-			else
-				max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
-		}
-
+		eff_util = effective_cpu_util(cpu, util, &min, &max, tsk);
 		eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+
 		max_util = max(max_util, eff_util);
 	}
 
@@ -7897,8 +7883,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
-	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
-	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
+	unsigned long p_util_min = uclamp_eff_value(p, UCLAMP_MIN);
+	unsigned long p_util_max = uclamp_eff_value(p, UCLAMP_MAX);
 	struct root_domain *rd = this_rq()->rd;
 	int cpu, best_energy_cpu, target = -1;
 	int prev_fits = -1, best_fits = -1;
@@ -7932,10 +7918,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	eenv_task_busy_time(&eenv, p, prev_cpu);
 
 	for (; pd; pd = pd->next) {
-		unsigned long util_min = p_util_min, util_max = p_util_max;
 		unsigned long cpu_cap, cpu_thermal_cap, util;
 		long prev_spare_cap = -1, max_spare_cap = -1;
-		unsigned long rq_util_min, rq_util_max;
 		unsigned long cur_delta, base_energy;
 		int max_spare_cap_cpu = -1;
 		int fits, max_fits = -1;
@@ -7954,8 +7938,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		eenv.pd_cap = 0;
 
 		for_each_cpu(cpu, cpus) {
-			struct rq *rq = cpu_rq(cpu);
-
 			eenv.pd_cap += cpu_thermal_cap;
 
 			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
@@ -7967,29 +7949,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			util = cpu_util(cpu, p, cpu, 0);
 			cpu_cap = capacity_of(cpu);
 
-			/*
-			 * Skip CPUs that cannot satisfy the capacity request.
-			 * IOW, placing the task there would make the CPU
-			 * overutilized. Take uclamp into account to see how
-			 * much capacity we can get out of the CPU; this is
-			 * aligned with sched_cpu_util().
-			 */
-			if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
-				/*
-				 * Open code uclamp_rq_util_with() except for
-				 * the clamp() part. Ie: apply max aggregation
-				 * only. util_fits_cpu() logic requires to
-				 * operate on non clamped util but must use the
-				 * max-aggregated uclamp_{min, max}.
-				 */
-				rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
-				rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
-
-				util_min = max(rq_util_min, p_util_min);
-				util_max = max(rq_util_max, p_util_max);
-			}
-
-			fits = util_fits_cpu(util, util_min, util_max, cpu);
+			fits = util_fits_cpu(util, p_util_min, p_util_max, cpu);
 			if (!fits)
 				continue;
 
@@ -13135,10 +13095,6 @@ DEFINE_SCHED_CLASS(fair) = {
 #ifdef CONFIG_SCHED_CORE
 	.task_is_throttled	= task_is_throttled_fair,
 #endif
-
-#ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 1,
-#endif
 };
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..86733bed0e3c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2681,10 +2681,6 @@ DEFINE_SCHED_CLASS(rt) = {
 #ifdef CONFIG_SCHED_CORE
 	.task_is_throttled	= task_is_throttled_rt,
 #endif
-
-#ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 1,
-#endif
 };
 
 #ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7c89c623250..f05a0674a036 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -912,50 +912,6 @@ extern void rto_push_irq_work_func(struct irq_work *work);
 #endif
 #endif /* CONFIG_SMP */
 
-#ifdef CONFIG_UCLAMP_TASK
-/*
- * struct uclamp_bucket - Utilization clamp bucket
- * @value: utilization clamp value for tasks on this clamp bucket
- * @tasks: number of RUNNABLE tasks on this clamp bucket
- *
- * Keep track of how many tasks are RUNNABLE for a given utilization
- * clamp value.
- */
-struct uclamp_bucket {
-	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
-	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
-};
-
-/*
- * struct uclamp_rq - rq's utilization clamp
- * @value: currently active clamp values for a rq
- * @bucket: utilization clamp buckets affecting a rq
- *
- * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
- * A clamp value is affecting a rq when there is at least one task RUNNABLE
- * (or actually running) with that value.
- *
- * There are up to UCLAMP_CNT possible different clamp values, currently there
- * are only two: minimum utilization and maximum utilization.
- *
- * All utilization clamping values are MAX aggregated, since:
- * - for util_min: we want to run the CPU at least at the max of the minimum
- *   utilization required by its currently RUNNABLE tasks.
- * - for util_max: we want to allow the CPU to run up to the max of the
- *   maximum utilization allowed by its currently RUNNABLE tasks.
- *
- * Since on each system we expect only a limited number of different
- * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
- * the metrics required to compute all the per-rq utilization clamp values.
- */
-struct uclamp_rq {
-	unsigned int value;
-	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
-};
-
-DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
-#endif /* CONFIG_UCLAMP_TASK */
-
 struct rq;
 struct balance_callback {
 	struct balance_callback *next;
@@ -994,13 +950,6 @@ struct rq {
 #endif
 	u64			nr_switches;
 
-#ifdef CONFIG_UCLAMP_TASK
-	/* Utilization clamp values based on CPU's RUNNABLE tasks */
-	struct uclamp_rq	uclamp[UCLAMP_CNT] ____cacheline_aligned;
-	unsigned int		uclamp_flags;
-#define UCLAMP_FLAG_IDLE 0x01
-#endif
-
 	struct cfs_rq		cfs;
 	struct rt_rq		rt;
 	struct dl_rq		dl;
@@ -2248,10 +2197,6 @@ extern s64 update_curr_common(struct rq *rq);
 
 struct sched_class {
 
-#ifdef CONFIG_UCLAMP_TASK
-	int uclamp_enabled;
-#endif
-
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task)   (struct rq *rq);
@@ -2997,7 +2942,8 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #ifdef CONFIG_SMP
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 				 unsigned long *min,
-				 unsigned long *max);
+				 unsigned long *max,
+				 struct task_struct *p);
 
 unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 				 unsigned long min,
@@ -3072,49 +3018,35 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
 #ifdef CONFIG_UCLAMP_TASK
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
 
-static inline unsigned long uclamp_rq_get(struct rq *rq,
-					  enum uclamp_id clamp_id)
-{
-	return READ_ONCE(rq->uclamp[clamp_id].value);
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
-				 unsigned int value)
-{
-	WRITE_ONCE(rq->uclamp[clamp_id].value, value);
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
-	return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
-}
-
 /* Is the rq being capped/throttled by uclamp_max? */
 static inline bool uclamp_rq_is_capped(struct rq *rq)
 {
 	unsigned long rq_util;
 	unsigned long max_util;
 
-	if (!static_branch_likely(&sched_uclamp_used))
-		return false;
-
 	rq_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
-	max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+	max_util = uclamp_eff_value(rq->curr, UCLAMP_MAX);
 
 	return max_util != SCHED_CAPACITY_SCALE && rq_util >= max_util;
 }
 
-/*
- * When uclamp is compiled in, the aggregation at rq level is 'turned off'
- * by default in the fast path and only gets turned on once userspace performs
- * an operation that requires it.
- *
- * Returns true if userspace opted-in to use uclamp and aggregation at rq level
- * hence is active.
- */
-static inline bool uclamp_is_used(void)
+/* Request freq update on context switch if necessary */
+static inline void uclamp_context_switch(struct rq *rq)
 {
-	return static_branch_likely(&sched_uclamp_used);
+	unsigned long uclamp_min;
+	unsigned long uclamp_max;
+	unsigned long util;
+
+	/* Only RT and FAIR tasks are aware of uclamp */
+	if (!rt_policy(current->policy) && !fair_policy(current->policy))
+		return;
+
+	uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
+	uclamp_max = uclamp_eff_value(current, UCLAMP_MAX);
+	util = rq->cfs.avg.util_avg;
+
+	if (uclamp_min > util || uclamp_max < util)
+		cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
 }
 #else /* CONFIG_UCLAMP_TASK */
 static inline unsigned long uclamp_eff_value(struct task_struct *p,
@@ -3128,28 +3060,8 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,
 
 static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }
 
-static inline bool uclamp_is_used(void)
+static inline void uclamp_context_switch(struct rq *rq)
 {
-	return false;
-}
-
-static inline unsigned long uclamp_rq_get(struct rq *rq,
-					  enum uclamp_id clamp_id)
-{
-	if (clamp_id == UCLAMP_MIN)
-		return 0;
-
-	return SCHED_CAPACITY_SCALE;
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
-				 unsigned int value)
-{
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
-	return false;
 }
 #endif /* CONFIG_UCLAMP_TASK */
 
-- 
2.34.1



* [PATCH 3/4] sched/schedutil: Ignore update requests for short running tasks
From: Qais Yousef @ 2023-12-08  1:52 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia, Qais Yousef

Ignore freq updates to honour uclamp requests if the task is short
running. It won't run long enough to see the changes, so avoid the
unnecessary work and noise.

Make sure the SCHED_CPUFREQ_PERF_HINTS flag is set in task_tick_fair() so
that we can take corrective action if the task continues to run such that
it is no longer considered a short task.

This should address the problem of noisy short running tasks unnecessarily
causing frequency spikes when waking up on a CPU that is running a busy
task capped by UCLAMP_MAX.

Move the helper functions to access task_util_est() and related attributes
to sched.h to enable using them from cpufreq_schedutil.c.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/cpufreq_schedutil.c | 59 ++++++++++++++++++++++++++++++++
 kernel/sched/fair.c              | 24 +------------
 kernel/sched/sched.h             | 22 ++++++++++++
 3 files changed, 82 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 137636f62593..04a5cfcdbbf2 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -8,11 +8,18 @@
 
 #define IOWAIT_BOOST_MIN	(SCHED_CAPACITY_SCALE / 8)
 
+/*
+ * Min runtime in us a task should run for above rate_limit_us so that we don't
+ * ignore it in ignore_short_tasks().
+ */
+#define SHORT_TASK_MIN		500
+
 DEFINE_PER_CPU_READ_MOSTLY(unsigned long, response_time_mult);
 
 struct sugov_tunables {
 	struct gov_attr_set	attr_set;
 	unsigned int		rate_limit_us;
+	unsigned long		rate_limit_util;
 	unsigned int		response_time_ms;
 };
 
@@ -127,6 +134,49 @@ sugov_apply_response_time(unsigned long util, int cpu)
 	return mult >> SCHED_CAPACITY_SHIFT;
 }
 
+/*
+ * Ignore updates if the current task's runtime is too short for the rate
+ * limit. The task must run for an average of rate_limit_us + SHORT_TASK_MIN
+ * for it not to be ignored.
+ *
+ * If we made a wrong decision and the task has changed characteristics such
+ * that it is no longer a short task, we should detect that at tick. Which can
+ * be a high penalty if the tick period is too long.
+ *
+ * XXX: can we take TICK_US into account somehow when verifying if we can
+ * ignore it?
+ *
+ * This is only valid for requests containing the SCHED_CPUFREQ_PERF_HINTS
+ * flag, i.e. uclamp hint requests at context switches.
+ *
+ * This flag is expected to be passed on context switch and tick. Only fair
+ * tasks are considered for now as we use util to approximate their average
+ * runtime. We can't do the same for RT without tracking the average runtime
+ * of RT tasks in our accounting. And it might be risky to temporarily ignore
+ * an RT task's perf requirements as a mistake could have higher consequences.
+ *
+ * Once fair gains the concept of latency sensitive tasks, we might need to
+ * consider the consequence of ignoring them here too, for the same reason
+ * that ignoring RT tasks is risky.
+ */
+static inline bool ignore_short_tasks(int cpu,
+				      struct sugov_policy *sg_policy,
+				      unsigned int flags)
+{
+	struct task_struct *p = cpu_rq(cpu)->curr;
+	unsigned long task_util;
+
+	if (!(flags & SCHED_CPUFREQ_PERF_HINTS))
+		return false;
+
+	if (!fair_policy(p->policy))
+		return false;
+
+	task_util = task_util_est(p);
+
+	return task_util < sg_policy->tunables->rate_limit_util;
+}
+
 static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 {
 	s64 delta_ns;
@@ -396,8 +446,12 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 					      u64 time, unsigned long max_cap,
 					      unsigned int flags)
 {
+	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	unsigned long boost;
 
+	if (ignore_short_tasks(sg_cpu->cpu, sg_policy, flags))
+		return false;
+
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
@@ -526,6 +580,9 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	unsigned int next_f;
 
+	if (ignore_short_tasks(sg_cpu->cpu, sg_policy, flags))
+		return;
+
 	raw_spin_lock(&sg_policy->update_lock);
 
 	sugov_iowait_boost(sg_cpu, time, flags);
@@ -612,6 +669,7 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
 		return -EINVAL;
 
 	tunables->rate_limit_us = rate_limit_us;
+	tunables->rate_limit_util = approximate_util_avg(0, rate_limit_us + SHORT_TASK_MIN);
 
 	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
 
@@ -853,6 +911,7 @@ static int sugov_init(struct cpufreq_policy *policy)
 	sg_policy->tunables = tunables;
 
 	tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
+	tunables->rate_limit_util = approximate_util_avg(0, tunables->rate_limit_us + SHORT_TASK_MIN);
 	tunables->response_time_ms = sugov_calc_freq_response_ms(sg_policy);
 	sugov_update_response_time_mult(sg_policy);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 74326179658c..b824e633ac2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4754,28 +4754,6 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 
 static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
 
-static inline unsigned long task_util(struct task_struct *p)
-{
-	return READ_ONCE(p->se.avg.util_avg);
-}
-
-static inline unsigned long task_runnable(struct task_struct *p)
-{
-	return READ_ONCE(p->se.avg.runnable_avg);
-}
-
-static inline unsigned long _task_util_est(struct task_struct *p)
-{
-	struct util_est ue = READ_ONCE(p->se.avg.util_est);
-
-	return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
-}
-
-static inline unsigned long task_util_est(struct task_struct *p)
-{
-	return max(task_util(p), _task_util_est(p));
-}
-
 static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
 				    struct task_struct *p)
 {
@@ -12544,7 +12522,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
-	cpufreq_update_util(rq, 0);
+	cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
 
 	task_tick_core(rq, curr);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f05a0674a036..b7a8cd768bef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2952,6 +2952,28 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 unsigned long approximate_util_avg(unsigned long util, u64 delta);
 u64 approximate_runtime(unsigned long util);
 
+static inline unsigned long task_util(struct task_struct *p)
+{
+	return READ_ONCE(p->se.avg.util_avg);
+}
+
+static inline unsigned long task_runnable(struct task_struct *p)
+{
+	return READ_ONCE(p->se.avg.runnable_avg);
+}
+
+static inline unsigned long _task_util_est(struct task_struct *p)
+{
+	struct util_est ue = READ_ONCE(p->se.avg.util_est);
+
+	return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
+}
+
+static inline unsigned long task_util_est(struct task_struct *p)
+{
+	return max(task_util(p), _task_util_est(p));
+}
+
 /*
  * Any governor that relies on util signal to drive DVFS, must populate these
  * percpu dvfs_update_delay variables.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/4] sched/documentation: Remove reference to max aggregation
  2023-12-08  1:52 [PATCH 0/4] sched: cpufreq: Remove uclamp max-aggregation Qais Yousef
                   ` (2 preceding siblings ...)
  2023-12-08  1:52 ` [PATCH 3/4] sched/schedutil: Ignore update requests for short running tasks Qais Yousef
@ 2023-12-08  1:52 ` Qais Yousef
  2023-12-18  8:19 ` [PATCH 0/4] sched: cpufreq: Remove uclamp max-aggregation Dietmar Eggemann
  4 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-12-08  1:52 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia, Qais Yousef

Now that the max aggregation complexity has been removed, update the doc to
reflect the new design.

And since we fixed one of the limitations, remove its entry as well.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 Documentation/scheduler/sched-util-clamp.rst | 239 ++++---------------
 1 file changed, 45 insertions(+), 194 deletions(-)

diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
index 74d5b7c6431d..642e5f386f8e 100644
--- a/Documentation/scheduler/sched-util-clamp.rst
+++ b/Documentation/scheduler/sched-util-clamp.rst
@@ -129,172 +129,50 @@ and the scheduler needs to select a suitable CPU for it to run on.
 
 Since the goal of util clamp is to allow requesting a minimum and maximum
 performance point for a task to run on, it must be able to influence the
-frequency selection as well as task placement to be most effective. Both of
-which have implications on the utilization value at CPU runqueue (rq for short)
-level, which brings us to the main design challenge.
-
-When a task wakes up on an rq, the utilization signal of the rq will be
-affected by the uclamp settings of all the tasks enqueued on it. For example if
-a task requests to run at UTIL_MIN = 512, then the util signal of the rq needs
-to respect to this request as well as all other requests from all of the
-enqueued tasks.
-
-To be able to aggregate the util clamp value of all the tasks attached to the
-rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the
-scheduler hot path. Hence care must be taken since any slow down will have
-significant impact on a lot of use cases and could hinder its usability in
-practice.
-
-The way this is handled is by dividing the utilization range into buckets
-(struct uclamp_bucket) which allows us to reduce the search space from every
-task on the rq to only a subset of tasks on the top-most bucket.
-
-When a task is enqueued, the counter in the matching bucket is incremented,
-and on dequeue it is decremented. This makes keeping track of the effective
-uclamp value at rq level a lot easier.
-
-As tasks are enqueued and dequeued, we keep track of the current effective
-uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on
-how this works.
-
-Later at any path that wants to identify the effective uclamp value of the rq,
-it will simply need to read this effective uclamp value of the rq at that exact
-moment of time it needs to take a decision.
+frequency selection as well as task placement to be most effective.
 
 For task placement case, only Energy Aware and Capacity Aware Scheduling
 (EAS/CAS) make use of uclamp for now, which implies that it is applied on
-heterogeneous systems only.
-When a task wakes up, the scheduler will look at the current effective uclamp
-value of every rq and compare it with the potential new value if the task were
-to be enqueued there. Favoring the rq that will end up with the most energy
-efficient combination.
+heterogeneous systems only. When a task wakes up, the scheduler will look at
+the effective uclamp values of the woken task to check whether it fits the
+capacity of a CPU, favouring the most energy efficient CPU that satisfies the
+task's performance request hints. This enables favouring a bigger CPU if
+uclamp_min is high even if the utilization of the task is low, or running the
+task on a smaller CPU if uclamp_max is low, even if the utilization of the
+task is high.
 
 Similarly in schedutil, when it needs to make a frequency update it will look
-at the current effective uclamp value of the rq which is influenced by the set
-of tasks currently enqueued there and select the appropriate frequency that
-will satisfy constraints from requests.
+at the effective uclamp values of the currently running task on the rq and
+select the appropriate frequency that will satisfy the performance request
+hints of the task, taking into account the current utilization of the rq.
 
 Other paths like setting overutilization state (which effectively disables EAS)
 make use of uclamp as well. Such cases are considered necessary housekeeping to
 allow the 2 main use cases above and will not be covered in detail here as they
 could change with implementation details.
 
-.. _uclamp-buckets:
+2.1. Frequency hints are applied at context switch
+--------------------------------------------------
 
-2.1. Buckets
-------------
+At context switch, we tell schedutil the new uclamp values (or min/max perf
+requirements) of the newly RUNNING task. It is up to the governor to try its
+best to honour these requests.
 
-::
-
-                           [struct rq]
-
-  (bottom)                                                    (top)
-
-    0                                                          1024
-    |                                                           |
-    +-----------+-----------+-----------+----   ----+-----------+
-    |  Bucket 0 |  Bucket 1 |  Bucket 2 |    ...    |  Bucket N |
-    +-----------+-----------+-----------+----   ----+-----------+
-       :           :                                   :
-       +- p0       +- p3                               +- p4
-       :                                               :
-       +- p1                                           +- p5
-       :
-       +- p2
-
-
-.. note::
-  The diagram above is an illustration rather than a true depiction of the
-  internal data structure.
-
-To reduce the search space when trying to decide the effective uclamp value of
-an rq as tasks are enqueued/dequeued, the whole utilization range is divided
-into N buckets where N is configured at compile time by setting
-CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.
-
-The rq has a bucket for each uclamp_id tunables: [UCLAMP_MIN, UCLAMP_MAX].
-
-The range of each bucket is 1024/N. For example, for the default value of
-5 there will be 5 buckets, each of which will cover the following range:
-
-::
-
-        DELTA = round_closest(1024/5) = 204.8 = 205
-
-        Bucket 0: [0:204]
-        Bucket 1: [205:409]
-        Bucket 2: [410:614]
-        Bucket 3: [615:819]
-        Bucket 4: [820:1024]
-
-When a task p with following tunable parameters
-
-::
-
-        p->uclamp[UCLAMP_MIN] = 300
-        p->uclamp[UCLAMP_MAX] = 1024
-
-is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket
-4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a task in
-this range.
-
-The rq then keeps track of its current effective uclamp value for each
-uclamp_id.
-
-When a task p is enqueued, the rq value changes to:
-
-::
-
-        // update bucket logic goes here
-        rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
-        // repeat for UCLAMP_MAX
+For uclamp to be effective, hardware with reasonably fast DVFS is desired
+(i.e. a short rate_limit_us).
 
-Similarly, when p is dequeued the rq value changes to:
-
-::
-
-        // update bucket logic goes here
-        rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value()
-        // repeat for UCLAMP_MAX
-
-When all buckets are empty, the rq uclamp values are reset to system defaults.
-See :ref:`section 3.4 <uclamp-default-values>` for details on default values.
+It is believed that most modern hardware (including lower end ones) has
+rate_limit_us <= 2ms which should be good enough. Having 500us or below would
+be ideal so the hardware can reasonably catch up with each task's performance
+constraints.
 
+Schedutil might ignore the task's performance request if its historical runtime
+has been shorter than the rate_limit_us.
 
-2.2. Max aggregation
---------------------
+See :ref:`Section 5.2 <schedutil-response-time-issues>` for more details on
+schedutil related limitations.
 
-Util clamp is tuned to honour the request for the task that requires the
-highest performance point.
-
-When multiple tasks are attached to the same rq, then util clamp must make sure
-the task that needs the highest performance point gets it even if there's
-another task that doesn't need it or is disallowed from reaching this point.
-
-For example, if there are multiple tasks attached to an rq with the following
-values:
-
-::
-
-        p0->uclamp[UCLAMP_MIN] = 300
-        p0->uclamp[UCLAMP_MAX] = 900
-
-        p1->uclamp[UCLAMP_MIN] = 500
-        p1->uclamp[UCLAMP_MAX] = 500
-
-then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN
-and UCLAMP_MAX become:
-
-::
-
-        rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
-        rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900
-
-As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max
-aggregation is the cause of one of limitations when using util clamp, in
-particular for UCLAMP_MAX hint when user space would like to save power.
-
-2.3. Hierarchical aggregation
+2.2. Hierarchical aggregation
 -----------------------------
 
 As stated earlier, util clamp is a property of every task in the system. But
@@ -324,7 +202,7 @@ In other words, this aggregation will not cause an error when a task changes
 its uclamp values, but rather the system may not be able to satisfy requests
 based on those factors.
 
-2.4. Range
+2.3. Range
 ----------
 
 Uclamp performance request has the range of 0 to 1024 inclusive.
@@ -332,6 +210,14 @@ Uclamp performance request has the range of 0 to 1024 inclusive.
 For cgroup interface percentage is used (that is 0 to 100 inclusive).
 Just like other cgroup interfaces, you can use 'max' instead of 100.
 
+2.4. Older design
+-----------------
+
+The older design behaved differently due to what was called the
+max-aggregation rule, which added a lot of complexity and had some
+limitations. Please consult the documentation matching your kernel version, as
+parts of this document might not reflect how things work on your kernel.
+
 .. _uclamp-interfaces:
 
 3. Interfaces
@@ -594,39 +480,7 @@ possible.
 5. Limitations
 ==============
 
-.. _uclamp-capping-fail:
-
-5.1. Capping frequency with uclamp_max fails under certain conditions
----------------------------------------------------------------------
-
-If task p0 is capped to run at 512:
-
-::
-
-        p0->uclamp[UCLAMP_MAX] = 512
-
-and it shares the rq with p1 which is free to run at any performance point:
-
-::
-
-        p1->uclamp[UCLAMP_MAX] = 1024
-
-then due to max aggregation the rq will be allowed to reach max performance
-point:
-
-::
-
-        rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024
-
-Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for
-the rq will depend on the actual utilization value of the tasks.
-
-If p1 is a small task but p0 is a CPU intensive task, then due to the fact that
-both are running at the same rq, p1 will cause the frequency capping to be left
-from the rq although p1, which is allowed to run at any performance point,
-doesn't actually need to run at that frequency.
-
-5.2. UCLAMP_MAX can break PELT (util_avg) signal
+5.1. UCLAMP_MAX can break PELT (util_avg) signal
 ------------------------------------------------
 
 PELT assumes that frequency will always increase as the signals grow to ensure
@@ -650,10 +504,6 @@ CPU is capable of. The max CPU frequency (Fmax) matters here as well,
 since it designates the shortest computational time to finish the task's
 work on this CPU.
 
-::
-
-        rq->uclamp[UCLAMP_MAX] = 0
-
 If the ratio of Fmax/Fmin is 3, then maximum value will be:
 
 ::
@@ -689,9 +539,8 @@ If task p1 wakes up on this CPU, which have:
         p1->util_avg = 200
         p1->uclamp[UCLAMP_MAX] = 1024
 
-then the effective UCLAMP_MAX for the CPU will be 1024 according to max
-aggregation rule. But since the capped p0 task was running and throttled
-severely, then the rq->util_avg will be:
+Since the capped p0 task was running and throttled severely, the
+rq->util_avg will be:
 
 ::
 
@@ -699,9 +548,9 @@ severely, then the rq->util_avg will be:
         p1->util_avg = 200
 
         rq->util_avg = 1024
-        rq->uclamp[UCLAMP_MAX] = 1024
 
-Hence lead to a frequency spike since if p0 wasn't throttled we should get:
+This leads to a frequency spike when p1 is running. If p0 wasn't throttled we
+should get:
 
 ::
 
@@ -710,9 +559,11 @@ Hence lead to a frequency spike since if p0 wasn't throttled we should get:
 
         rq->util_avg = 500
 
-and run somewhere near mid performance point of that CPU, not the Fmax we get.
+and run somewhere near mid performance point of that CPU, not the Fmax p1 gets.
+
+.. _schedutil-response-time-issues:
 
-5.3. Schedutil response time issues
+5.2. Schedutil response time issues
 -----------------------------------
 
 schedutil has three limitations:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-08  1:52 ` [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util() Qais Yousef
@ 2023-12-08 10:05   ` Lukasz Luba
  2023-12-10 20:51     ` Qais Yousef
  2023-12-11 18:47   ` Christian Loehle
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Lukasz Luba @ 2023-12-08 10:05 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, Vincent Guittot, Viresh Kumar, Rafael J. Wysocki,
	Dietmar Eggemann, linux-pm, Ingo Molnar, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia, Peter Zijlstra

Hi Qais,

On 12/8/23 01:52, Qais Yousef wrote:

[snip]

> @@ -6704,14 +6677,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   	 */
>   	util_est_enqueue(&rq->cfs, p);
>   
> -	/*
> -	 * If in_iowait is set, the code below may not trigger any cpufreq
> -	 * utilization updates, so do it here explicitly with the IOWAIT flag
> -	 * passed.
> -	 */
> -	if (p->in_iowait)
> -		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> -

Why this io wait boost is considered as the $subject says 'aggressive'
calling?

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] sched/schedutil: Ignore update requests for short running tasks
  2023-12-08  1:52 ` [PATCH 3/4] sched/schedutil: Ignore update requests for short running tasks Qais Yousef
@ 2023-12-08 10:42   ` Hongyan Xia
  2023-12-10 22:22     ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Hongyan Xia @ 2023-12-08 10:42 UTC (permalink / raw)
  To: Qais Yousef, Vincent Guittot
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Ingo Molnar, Rafael J. Wysocki, Peter Zijlstra,
	Viresh Kumar, Dietmar Eggemann

Hi Qais,

On 08/12/2023 01:52, Qais Yousef wrote:
> Ignore freq updates to honour uclamp requests if the task is short
> running. It won't run long enough to see the changes, so avoid the
> unnecessary work and noise.
> 
> Make sure SCHED_CPUFREQ_PERF_HINTS flag is set in task_tick_fair() so
> that we can do correction action if the task continued to run such that
> it is no longer considered a short task.
> 
> Should address the problem of noisy short running tasks unnecessary
> causing frequency spikes when waking up on a CPU that is running a busy
> task capped by UCLAMP_MAX.

Actually, an occasional spike is not a big problem to me.

What is a big concern is a normal task and a uclamp_max task running on 
the same rq. If the uclamp_max task is 1024 but capped by uclamp_max at 
the lowest OPP, and the normal task has no uclamp but a duty cycle, then 
when the normal task wakes up on the rq, it'll be the highest OPP. When 
it sleeps, the ulamp_max is back and at the lowest OPP. This square-wave 
problem to me is a much bigger concern than an infrequent spike. If 
CONFIG_HZ is 1000, this square wave's frequency is 500 switching between 
highest and lowest OPP, which is definitely unacceptable.
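
To make the pattern concrete under the current max aggregation (a simplified
sketch, assuming the capped task's uclamp_max corresponds to the lowest OPP):

   normal task enqueued -> rq uclamp_max = max(1024, low cap) = 1024 -> high OPP
   normal task sleeping -> rq uclamp_max = low cap                   -> low OPP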

The problem I think with filtering is, under this condition, should we 
filter out the lowest OPP or the highest? Neither sounds like a good 
answer because neither is a short-running task and the correct answer 
might be somewhere in between.

Sorry to ramble on this again and again, but I think filtering is 
addressing the symptom, not the cause. The cause is we have no idea 
under what condition a util_avg was achieved. The 1024 task in the 
previous example would be much better if we extend it into

[1024, achieved at uclamp_min 0, achieved at uclamp_max 300]

If we know 1024 was done under uclamp_max of 300, then we know we don't 
need to raise to the max OPP. So far, we carry around a lot of different 
new variables but not these two which we really need.

> 
> Move helper functions to access task_util_est() and related attributes
> to sched.h to enable using it from cpufreq_schedutil.c
> 
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>   kernel/sched/cpufreq_schedutil.c | 59 ++++++++++++++++++++++++++++++++
>   kernel/sched/fair.c              | 24 +------------
>   kernel/sched/sched.h             | 22 ++++++++++++
>   3 files changed, 82 insertions(+), 23 deletions(-)
> 
> [...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-08 10:05   ` Lukasz Luba
@ 2023-12-10 20:51     ` Qais Yousef
  2023-12-11  7:56       ` Lukasz Luba
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-12-10 20:51 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, Vincent Guittot, Viresh Kumar, Rafael J. Wysocki,
	Dietmar Eggemann, linux-pm, Ingo Molnar, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia, Peter Zijlstra

On 12/08/23 10:05, Lukasz Luba wrote:
> Hi Qais,
> 
> On 12/8/23 01:52, Qais Yousef wrote:
> 
> [snip]
> 
> > @@ -6704,14 +6677,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >   	 */
> >   	util_est_enqueue(&rq->cfs, p);
> > -	/*
> > -	 * If in_iowait is set, the code below may not trigger any cpufreq
> > -	 * utilization updates, so do it here explicitly with the IOWAIT flag
> > -	 * passed.
> > -	 */
> > -	if (p->in_iowait)
> > -		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> > -
> 
> Why this io wait boost is considered as the $subject says 'aggressive'
> calling?

This will trigger a frequency update along with the iowait boost. Did I miss
something?


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] sched/schedutil: Ignore update requests for short running tasks
  2023-12-08 10:42   ` Hongyan Xia
@ 2023-12-10 22:22     ` Qais Yousef
  2023-12-11 11:15       ` Hongyan Xia
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-12-10 22:22 UTC (permalink / raw)
  To: Hongyan Xia
  Cc: Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei, Ingo Molnar, Rafael J. Wysocki,
	Peter Zijlstra, Viresh Kumar, Dietmar Eggemann

Hi Hongyan

On 12/08/23 10:42, Hongyan Xia wrote:

> What is a big concern is a normal task and a uclamp_max task running on the
> same rq. If the uclamp_max task is 1024 but capped by uclamp_max at the
> lowest OPP, and the normal task has no uclamp but a duty cycle, then when

You mean util_avg is 1024 but capped to lowest OPP? uclamp_max is repeated but
couldn't decipher what you meant to write instead.

> the normal task wakes up on the rq, it'll be the highest OPP. When it
> sleeps, the ulamp_max is back and at the lowest OPP. This square-wave
> problem to me is a much bigger concern than an infrequent spike. If
> CONFIG_HZ is 1000, this square wave's frequency is 500 switching between

If the rq->util_avg is 1024, then for any task that is running, the requested
frequency should be max. If there's a task that is capped by uclamp_max, then
this task should not run at max.

So the other tasks should run at max; why don't you want them to run at max?

> highest and lowest OPP, which is definitely unacceptable.

How come so definitive? How did you reach this definitive conclusion?

> The problem I think with filtering is, under this condition, should we
> filter out the lowest OPP or the highest? Neither sounds like a good answer
> because neither is a short-running task and the correct answer might be
> somewhere in between.

We only ignore the uclamp requirement with the filter. schedutil is driven by
the rq utilization signal as normal. The only question is whether we apply
uclamp_min/max or not.

It seems you think we need to modify the rq->util_avg. And this should not be
the case. uclamp should not change how PELT accounting works; just modify how
some decisions based on it are taken.
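
To illustrate what I mean (a rough sketch of the intent, not the actual code
in this series):

        unsigned long util, uclamp_min, uclamp_max;

        /* PELT accounting is left untouched */
        util = READ_ONCE(rq->cfs.avg.util_avg);

        /* uclamp only biases the value the governor acts on */
        uclamp_min = uclamp_eff_value(rq->curr, UCLAMP_MIN);
        uclamp_max = uclamp_eff_value(rq->curr, UCLAMP_MAX);
        util = clamp(util, uclamp_min, uclamp_max);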

It is true there's a corner case where util_avg could be wrong under the
documented limitation. But this is not related to max-aggregation and its
solution needs some creativity in handling pelt accounting under these
conditions.

Generally, capping that hard raises the question of why userspace is doing
this. We don't want to cripple tasks, but to prevent them from running at
inefficient energy points. Otherwise they should make adequate progress. I'd
expect uclamp_max to be more meaningful for tasks that actually need to run at
those higher, expensive frequencies.

So the corner case warrants fixing, but it is not a nuisance in practice yet. And
it is a problem of failing to calculate the stolen idle time as we don't have
any under these corner conditions (Vincent can make a more accurate statement
than me here). It has nothing to do with how to handle performance requirements
of multiple RUNNABLE tasks.

> Sorry to ramble on this again and again, but I think filtering is addressing
> the symptom, not the cause. The cause is we have no idea under what
> condition a util_avg was achieved. The 1024 task in the previous example
> would be much better if we extend it into

I think the other way around :-) I think you're mixing the root cause of that
limitation with how uclamp hints for multiple tasks should be handled - which
is what is being fixed here.

I wrote the documentation and did the experiments to see how PELT behaves under
extreme conditions. And it says *it can break PELT*.

> [1024, achieved at uclamp_min 0, achieved at uclamp_max 300]

Why do you think this is the dominant use case? And why do you think this is
caused by max-aggregation? This is a limitation of PELT accounting and has
nothing to do with max-aggregation which is how multiple uclamp hints for
RUNNABLE tasks are handled.

Have you actually seen it in practice? We haven't come across this problem yet.
We just want to avoid using expensive OPPs, but capping too hard is actually
a problem as it can cause starvation for those tasks.

Is it only the documentation that triggered this concern about this corner
case? I'm curious what you have seen.

> If we know 1024 was done under uclamp_max of 300, then we know we don't need
> to raise to the max OPP. So far, we carry around a lot of different new
> variables but not these two which we really need.

This problem is independent of how the uclamp hints of multiple tasks should be
accounted for by the governor. This is a limitation of how PELT accounting
works. And to be honest, if you think more about it, this 300 task is already
a 1024 on current littles that have a capacity of 200 or less. And the capacity
of the mids at the lowest OPP usually starts at around 100 or something. I
can't see it hitting this problem while running on the mids. I think this 300
task will already run near the lowest OPP on the big even without uclamp_max
being 0 - it is that small for it.

So I'm not sure what systems you saw this problem on, and whether this is
a problem in practice at all. Like priority/nice and other sched attributes,
you can pick a wrong combination and shoot yourself in the foot.

As I put in the documentation, this limitation will only hit if the actual task
capacity reaches some magical ratio. I'd expect practically these tasks to
still see idle time and get their util_avg corrected eventually.

So worth a fix, not related to handling performance requests for multiple
tasks, and not urgently needed as nothing is falling apart because of it for
the time being at least.


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched/uclamp: Remove rq max aggregation
  2023-12-08  1:52 ` [PATCH 2/4] sched/uclamp: Remove rq max aggregation Qais Yousef
@ 2023-12-11  0:08   ` Qais Yousef
  0 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-12-11  0:08 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia

On 12/08/23 01:52, Qais Yousef wrote:

> +/* Request freq update on context switch if necessary */
> +static inline void uclamp_context_switch(struct rq *rq)
>  {
> -	return static_branch_likely(&sched_uclamp_used);
> +	unsigned long uclamp_min;
> +	unsigned long uclamp_max;
> +	unsigned long util;
> +
> +	/* Only RT and FAIR tasks are aware of uclamp */
> +	if (!rt_policy(current->policy) && !fair_policy(current->policy))
> +		return;

We have a dependency on min_granularity_ns (or base_slice_ns) here that
I forgot to add before posting.

If our base_slice_ns is smaller than dvfs_update_delay, then tasks won't run
long enough for the hardware to apply their performance hints before they get
context switched out.

Besides the newly proposed sched_runtime being able to request a smaller slice,
in practice the default base_slice_ns is okay-ish.
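
Something like the below early bail-out in uclamp_context_switch() is what
I have in mind (just a sketch on top of this patch; it assumes the per-cpu
dvfs_update_delay introduced earlier in the series is visible here and has
been converted to the same unit as base_slice):

        /*
         * If the slice is shorter than the hardware's DVFS response time,
         * the task will likely be switched out before the hint takes
         * effect, so sending it is pointless.
         */
        if (sysctl_sched_base_slice < per_cpu(dvfs_update_delay, cpu_of(rq)))
                return;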

> +
> +	uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
> +	uclamp_max = uclamp_eff_value(current, UCLAMP_MAX);
> +	util = rq->cfs.avg.util_avg;
> +
> +	if (uclamp_min > util || uclamp_max < util)
> +		cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
>  }

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-10 20:51     ` Qais Yousef
@ 2023-12-11  7:56       ` Lukasz Luba
  2023-12-12 12:10         ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Lukasz Luba @ 2023-12-11  7:56 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, Vincent Guittot, Viresh Kumar, Rafael J. Wysocki,
	Dietmar Eggemann, linux-pm, Ingo Molnar, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia, Peter Zijlstra



On 12/10/23 20:51, Qais Yousef wrote:
> On 12/08/23 10:05, Lukasz Luba wrote:
>> Hi Qais,
>>
>> On 12/8/23 01:52, Qais Yousef wrote:
>>
>> [snip]
>>
>>> @@ -6704,14 +6677,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>    	 */
>>>    	util_est_enqueue(&rq->cfs, p);
>>> -	/*
>>> -	 * If in_iowait is set, the code below may not trigger any cpufreq
>>> -	 * utilization updates, so do it here explicitly with the IOWAIT flag
>>> -	 * passed.
>>> -	 */
>>> -	if (p->in_iowait)
>>> -		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
>>> -
>>
>> Why this io wait boost is considered as the $subject says 'aggressive'
>> calling?
> 
> This will trigger a frequency update along with the iowait boost. Did I miss
> something?

Yes, it will change CPU freq and it was the main goal for this code
path. We have tests which check how that works on different memory
types.

Why do you want to remove it?

Did you run some tests (e.g. fio, gallery, etc) to check if you still
get decent performance out of some new ufs/nvme storage?

Beata & Dietmar presented a proposal at LPC2021 to have a per-task
io boost, with a more controllable way of trading off power vs.
performance [1]. IMO the iowait boost could evolve, not simply die.

Regards,
Lukasz

[1] https://lpc.events/event/11/contributions/1042/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] sched/schedutil: Ignore update requests for short running tasks
  2023-12-10 22:22     ` Qais Yousef
@ 2023-12-11 11:15       ` Hongyan Xia
  2023-12-12 12:23         ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Hongyan Xia @ 2023-12-11 11:15 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei, Ingo Molnar, Rafael J. Wysocki,
	Peter Zijlstra, Viresh Kumar, Dietmar Eggemann

On 10/12/2023 22:22, Qais Yousef wrote:
> Hi Hongyan
> 
> On 12/08/23 10:42, Hongyan Xia wrote:
> 
>> What is a big concern is a normal task and a uclamp_max task running on the
>> same rq. If the uclamp_max task is 1024 but capped by uclamp_max at the
>> lowest OPP, and the normal task has no uclamp but a duty cycle, then when
> 
> You mean util_avg is 1024 but capped to lowest OPP? uclamp_max is repeated but
> couldn't decipher what you meant to write instead.
> 
>> the normal task wakes up on the rq, it'll be the highest OPP. When it
>> sleeps, the ulamp_max is back and at the lowest OPP. This square-wave
>> problem to me is a much bigger concern than an infrequent spike. If
>> CONFIG_HZ is 1000, this square wave's frequency is 500 switching between
> 
> If the rq->util_avg is 1024, then for any task that is running, the requested
> frequency should be max. If there's a task that is capped by uclamp_max, then
> this task should not run at max.
> 
> So other tasks should have run at max; why you don't want them to run at max?

Because it saves power. If there's a 1024 task capped at 300 and a true 
300 task without uclamp on the same rq, there's no need to run the CPU 
at more than 600. Running it at 1024 ignores the uclamp_max and burns 
battery when it's not needed.
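
Roughly, the sum I have in mind (illustrative arithmetic only):

   capped task's contribution:   clamp(1024, 0, 300) = 300
   normal task's contribution:                         300
   sensible request for the rq:           300 + 300  = 600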

>> highest and lowest OPP, which is definitely unacceptable.
> 
> How come so definitive? How did you reach this definitive conclusion?

You are right. After talking to our firmware and silicon engineers they 
don't think switching between the highest and lowest OPP 500 times a 
second can have damaging effects, so I retract the 'unacceptable' comment.

>> The problem I think with filtering is, under this condition, should we
>> filter out the lowest OPP or the highest? Neither sounds like a good answer
>> because neither is a short-running task and the correct answer might be
>> somewhere in between.
> 
> We only ignore uclamp requirement with the filter. schedutil is drive by the rq
> utilization signal as normal. It is only the fact should we apply
> uclamp_min/max or not.
> 
> It seems you think we need to modify the rq->util_avg. And this should not be
> the case. uclamp should not change how PELT accounting works; just modify how
> some decisions based on it are taken.

I agree, uclamp shouldn't change PELT, and my series doesn't. Just like
util_est, which biases util_avg, my patches don't touch util_avg but
simply introduce util_min and util_max alongside util_avg. I fail to
see why it's okay for util_est to bias util_avg but not okay for me to
do so. If this is the case, then the 'util_guest' proposal should also
be rejected outright on the same grounds.
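
For reference, this is roughly how util_est already biases the value used for
frequency and placement decisions without touching util_avg itself (simplified
from cpu_util()):

        unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
        unsigned long util_est;

        if (sched_feat(UTIL_EST)) {
                util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
                util = max(util, util_est);
        }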

> It is true there's a corner case where util_avg could be wrong under the
> documented limitation. But this is not related to max-aggregation and its
> solution needs some creativity in handling pelt accounting under these
> conditions.
> 
> Generally; capping that hard stirs question why userspace is doing this. We
> don't want to cripple tasks, but prevent them from consuming inefficient energy
> points. Otherwise they should make adequate progress. I'd expect uclamp_max to
> be more meaningful for tasks that actually need to run at those higher
> expensive frequencies.
> 
> So the corner case warrants fixing, but it is not a nuance in practice yet. And
> it is a problem of failing to calculate the stolen idle time as we don't have
> any under these corner conditions (Vincent can make a more accurate statement
> than me here). It has nothing to do with how to handle performance requirements
> of multiple RUNNABLE tasks.
> 
>> Sorry to ramble on this again and again, but I think filtering is addressing
>> the symptom, not the cause. The cause is we have no idea under what
>> condition a util_avg was achieved. The 1024 task in the previous example
>> would be much better if we extend it into
> 
> I think the other way around :-) I think you're mixing the root cause of that
> limitation with how uclamp hints for multiple tasks should be handled - which
> is what is being fixed here.
> 
> I wrote the documentation and did the experiments to see how PELT behaves under
> extreme conditions. And it says *it can break PELT*.
> 
>> [1024, achieved at uclamp_min 0, achieved at uclamp_max 300]
> 
> Why you think this is the dominant use case? And why do you think this is
> caused by max-aggregation? This is a limitation of PELT accounting and has
> nothing to do with max-aggregation which is how multiple uclamp hints for
> RUNNABLE tasks are handled.
> 
> Have you actually seen it practice? We haven't come across this problem yet. We
> just want to avoid using expensive OPPs, but capping too had is actually
> a problem as it can cause starvation for those tasks.
> 
> Is it only the documentation what triggered this concern about this corner
> case? I'm curious what have you seen.

This is not a corner case. Having a uclamp_max task and a normal task on
the same rq is fairly common. My concern isn't the 'frequency spike'
problem from the documentation. My concern comes from benchmarks, where I
see high-frequency square waves. An occasional spike isn't worrying, but
the square waves are.

>> If we know 1024 was done under uclamp_max of 300, then we know we don't need
>> to raise to the max OPP. So far, we carry around a lot of different new
>> variables but not these two which we really need.
> 
> This problem is independent of how uclamp hint of multiple tasks should be
> accounted for by the governor. This is a limitation of how PELT accounting
> works. And to be honest, if you think more about it, this 300 tasks is already
> a 1024 on current littles that has a capacity of 200 or less. And the capacity
> of the mids at lowest OPP usually starts at a capacity around 100 or something.
> Can't see it hit this problem while running on middle.  I think this 300 tasks
> will already run near lowest OPP at the big even without uclamp_max being
> 0 - it is that small for it.
> 
> So not sure on what systems you saw this problem on, and whether at all this is
> a problem in practice. Like priority/nice and other sched attributes; you can
> pick a wrong combination and shoot your self in the foot.
> 
> As I put in the documentation, this limitation will only hit if the actual task
> capacity reaches some magical ratio. I'd expect practically these tasks to
> still see idle time and get their util_avg corrected eventually.

Like in the previous comment, it's the square waves happening 500 times a
second that I saw in benchmarks that are worrying, not the occasional spike in
the documentation. I doubt we can say that a uclamp_max task and a normal
task running on the same rq is a corner case.

> So worth a fix, not related to handling performance requests for multiple
> tasks, and not urgently needed as nothing is falling apart because of it for
> the time being at least.

Also, I think there's still an unanswered question. If there's a 1024
task with a uclamp_max of 300 and a 300-util task with default uclamp on
the same rq, which currently under max aggregation switches between the
highest and lowest OPP, should we filter out the high OPP or the low
one? Neither is a short-running task. We could designate this as a
corner case (though I don't think so), but wouldn't it be nicer if we
didn't have any of these problems in the first place?

Hongyan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-08  1:52 ` [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util() Qais Yousef
  2023-12-08 10:05   ` Lukasz Luba
@ 2023-12-11 18:47   ` Christian Loehle
  2023-12-12 12:34     ` Qais Yousef
  2023-12-12 10:46   ` Dietmar Eggemann
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Christian Loehle @ 2023-12-11 18:47 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia

On 08/12/2023 01:52, Qais Yousef wrote:
> Due to the way code is structured, it makes a lot of sense to trigger
> cpufreq_update_util() from update_load_avg(). But this is too aggressive
> as in most cases we are iterating through entities in a loop to
> update_load_avg() in the hierarchy. So we end up sending too many
> request in an loop as we're updating the hierarchy.

Whether this is actually less aggressive heavily depends on the workload;
I could argue the patch is more aggressive, as you call cpufreq_update_util()
at every enqueue and dequeue, instead of just at enqueue.
For an I/O workload it is definitely more aggressive, see below.

> 
> Combine this with the rate limit in schedutil, we could end up
> prematurely send up a wrong frequency update before we have actually
> updated all entities appropriately.
> [SNIP]


> @@ -6704,14 +6677,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	 */
>  	util_est_enqueue(&rq->cfs, p);
>  
> -	/*
> -	 * If in_iowait is set, the code below may not trigger any cpufreq
> -	 * utilization updates, so do it here explicitly with the IOWAIT flag
> -	 * passed.
> -	 */
> -	if (p->in_iowait)
> -		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> -
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
>  			break;
> @@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  enqueue_throttle:
>  	assert_list_leaf_cfs_rq(rq);
>  
> +	cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
> +
>  	hrtick_update(rq);
>  }
>  
> @@ -6849,6 +6816,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  
>  dequeue_throttle:
>  	util_est_update(&rq->cfs, p, task_sleep);
> +	cpufreq_update_util(rq, 0);

This is quite critical: instead of only calling the update
at enqueue (with SCHED_CPUFREQ_IOWAIT if applicable) it is
now called at every enqueue and dequeue. The only way for
schedutil (intel_pstate too?) to build up a value of
iowait_boost > 128 is a large enough rate_limit_us, as even
for just an in_iowait task the enqueue increases the boost and
its own dequeue could already reduce it. For just a basic
benchmark workload and 2000 rate_limit_us this doesn't seem
to be that critical; anything below 200 rate_limit_us didn't
show any iowait boosting > 128 anymore on my system.
Of course, if the workload does more between enqueue and
dequeue (time until the task issues its next I/O), already
larger values of rate_limit_us will remove any significant
iowait boost benefit.
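
For context, the boost build-up I mean works roughly like this (heavily
simplified from sugov_iowait_boost()/sugov_iowait_apply(), ignoring the
reset and pending handling):

        if (flags & SCHED_CPUFREQ_IOWAIT) {
                /* each iowait wakeup doubles the boost, capped at max */
                sg_cpu->iowait_boost = sg_cpu->iowait_boost ?
                        min_t(unsigned int, sg_cpu->iowait_boost << 1,
                              SCHED_CAPACITY_SCALE) :
                        IOWAIT_BOOST_MIN;
        } else if (sg_cpu->iowait_boost) {
                /* any other update decays the boost until it is gone */
                sg_cpu->iowait_boost >>= 1;
                if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN)
                        sg_cpu->iowait_boost = 0;
        }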

Just to add some numbers to the story:
fio --time_based --name=fiotest --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1
fio --time_based --name=fiotest --filename=/dev/mmcblk2 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1

All results are sorted. In each block, the first line is with iowait boosting
and the second line without:

With this patch and rate_limit_us=2000:
/dev/nvme0n1
[3883, 3980, 3997, 4018, 4019]
[2732, 2745, 2782, 2837, 2841]
/dev/mmcblk2
[4136, 4144, 4198, 4275, 4329]
[2753, 2975, 2975, 2975, 2976]

Without this patch and rate_limit_us=2000:
/dev/nvme0n1
[3918, 4021, 4043, 4081, 4085]
[2850, 2859, 2863, 2873, 2887]
/dev/mmcblk2
[4277, 4358, 4380, 4421, 4425]
[2796, 3103, 3128, 3180, 3200]

With this patch and rate_limit_us=200:
/dev/nvme0n1
[2470, 2480, 2481, 2484, 2520]
[2473, 2510, 2517, 2534, 2572]
/dev/mmcblk2
[2286, 2338, 2440, 2504, 2535]
[2360, 2462, 2484, 2503, 2707]

Without this patch and rate_limit_us=200:
/dev/nvme0n1
[3880, 3956, 4010, 4013, 4016]
[2732, 2867, 2937, 2937, 2939]
/dev/mmcblk2
[4783, 4791, 4821, 4855, 4860]
[2653, 3091, 3095, 3166, 3202]

I'm currently working on iowait boosting and seeing where it's
actually needed and how it could be improved, so always interested
in anyone's thoughts.

(The second line here doesn't provide additional
information, I left it in to compare for reproducibility).
All with CONFIG_HZ=100 on an rk3399.

Best Regards,
Christian

> [SNIP]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-08  1:52 ` [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util() Qais Yousef
  2023-12-08 10:05   ` Lukasz Luba
  2023-12-11 18:47   ` Christian Loehle
@ 2023-12-12 10:46   ` Dietmar Eggemann
  2023-12-12 12:35     ` Qais Yousef
  2023-12-12 10:47   ` Hongyan Xia
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Dietmar Eggemann @ 2023-12-12 10:46 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Vincent Guittot
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia

On 08/12/2023 02:52, Qais Yousef wrote:
> Due to the way code is structured, it makes a lot of sense to trigger
> cpufreq_update_util() from update_load_avg(). But this is too aggressive
> as in most cases we are iterating through entities in a loop to
> update_load_avg() in the hierarchy. So we end up sending too many
> request in an loop as we're updating the hierarchy.

But update_load_avg() calls cfs_rq_util_change() which only issues a
cpufreq_update_util() call for the root cfs_rq?

So the 'iterating through entities' should be for a task in a non-root
taskgroup which the condition (1) takes care of.

cfs_rq_util_change()

    ...
    if (&rq->cfs == cfs_rq) (1)

        cpufreq_update_util()

[...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-08  1:52 ` [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util() Qais Yousef
                     ` (2 preceding siblings ...)
  2023-12-12 10:46   ` Dietmar Eggemann
@ 2023-12-12 10:47   ` Hongyan Xia
  2023-12-12 11:06   ` Vincent Guittot
  2023-12-18  8:51   ` Dietmar Eggemann
  5 siblings, 0 replies; 32+ messages in thread
From: Hongyan Xia @ 2023-12-12 10:47 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei

On 08/12/2023 01:52, Qais Yousef wrote:
> Due to the way code is structured, it makes a lot of sense to trigger
> cpufreq_update_util() from update_load_avg(). But this is too aggressive
> as in most cases we are iterating through entities in a loop to
> update_load_avg() in the hierarchy. So we end up sending too many
> request in an loop as we're updating the hierarchy.

Do you mean the for_each_sched_entity(se) loop? I think we update CPU 
frequency only once at the root CFS?

> Combine this with the rate limit in schedutil, we could end up
> prematurely send up a wrong frequency update before we have actually
> updated all entities appropriately.
> 
> Be smarter about it by limiting the trigger to perform frequency updates
> after all accounting logic has done. This ended up being in the
> following points:
> 
> 1. enqueue/dequeue_task_fair()
> 2. throttle/unthrottle_cfs_rq()
> 3. attach/detach_task_cfs_rq()
> 4. task_tick_fair()
> 5. __sched_group_set_shares()
> 
> This is not 100% ideal still due to other limitations that might be
> a bit harder to handle. Namely we can end up with premature update
> request in the following situations:
> 
> a. Simultaneous task enqueue on the CPU where 2nd task is bigger and
>     requires higher freq. The trigger to cpufreq_update_util() by the
>     first task will lead to dropping the 2nd request until tick. Or
>     another CPU in the same policy trigger a freq update.
> 
> b. CPUs sharing a policy can end up with the same race in a but the
>     simultaneous enqueue happens on different CPUs in the same policy.
> 
> The above though are limitations in the governor/hardware, and from
> scheduler point of view at least that's the best we can do. The
> governor might consider smarter logic to aggregate near simultaneous
> request and honour the higher one.
> 
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>   kernel/sched/fair.c | 55 ++++++++++++---------------------------------
>   1 file changed, 14 insertions(+), 41 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b83448be3f79..f99910fc6705 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3997,29 +3997,6 @@ static inline void update_cfs_group(struct sched_entity *se)
>   }
>   #endif /* CONFIG_FAIR_GROUP_SCHED */
>   
> -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> -{
> -	struct rq *rq = rq_of(cfs_rq);
> -
> -	if (&rq->cfs == cfs_rq) {

Here. I think this restricts frequency updates to the root CFS?

> -		/*
> -		 * There are a few boundary cases this might miss but it should
> -		 * get called often enough that that should (hopefully) not be
> -		 * a real problem.
> -		 *
> -		 * It will not get called when we go idle, because the idle
> -		 * thread is a different class (!fair), nor will the utilization
> -		 * number include things like RT tasks.
> -		 *
> -		 * As is, the util number is not freq-invariant (we'd have to
> -		 * implement arch_scale_freq_capacity() for that).
> -		 *
> -		 * See cpu_util_cfs().
> -		 */
> -		cpufreq_update_util(rq, flags);
> -	}
> -}
> -
>   #ifdef CONFIG_SMP
>   static inline bool load_avg_is_decayed(struct sched_avg *sa)
>   {
> [...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-08  1:52 ` [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util() Qais Yousef
                     ` (3 preceding siblings ...)
  2023-12-12 10:47   ` Hongyan Xia
@ 2023-12-12 11:06   ` Vincent Guittot
  2023-12-12 12:40     ` Qais Yousef
  2023-12-18  8:51   ` Dietmar Eggemann
  5 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2023-12-12 11:06 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei, Hongyan Xia

On Fri, 8 Dec 2023 at 02:52, Qais Yousef <qyousef@layalina.io> wrote:
>
> Due to the way code is structured, it makes a lot of sense to trigger
> cpufreq_update_util() from update_load_avg(). But this is too aggressive
> as in most cases we are iterating through entities in a loop to
> update_load_avg() in the hierarchy. So we end up sending too many
> request in an loop as we're updating the hierarchy.
>
> Combine this with the rate limit in schedutil, we could end up
> prematurely send up a wrong frequency update before we have actually
> updated all entities appropriately.
>
> Be smarter about it by limiting the trigger to perform frequency updates
> after all accounting logic has done. This ended up being in the
> following points:
>
> 1. enqueue/dequeue_task_fair()
> 2. throttle/unthrottle_cfs_rq()
> 3. attach/detach_task_cfs_rq()
> 4. task_tick_fair()
> 5. __sched_group_set_shares()
>
> This is not 100% ideal still due to other limitations that might be
> a bit harder to handle. Namely we can end up with premature update
> request in the following situations:
>
> a. Simultaneous task enqueue on the CPU where 2nd task is bigger and
>    requires higher freq. The trigger to cpufreq_update_util() by the
>    first task will lead to dropping the 2nd request until tick. Or
>    another CPU in the same policy trigger a freq update.
>
> b. CPUs sharing a policy can end up with the same race in a but the
>    simultaneous enqueue happens on different CPUs in the same policy.
>
> The above though are limitations in the governor/hardware, and from
> scheduler point of view at least that's the best we can do. The
> governor might consider smarter logic to aggregate near simultaneous
> request and honour the higher one.
>
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  kernel/sched/fair.c | 55 ++++++++++++---------------------------------
>  1 file changed, 14 insertions(+), 41 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b83448be3f79..f99910fc6705 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3997,29 +3997,6 @@ static inline void update_cfs_group(struct sched_entity *se)
>  }
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> -{
> -       struct rq *rq = rq_of(cfs_rq);
> -
> -       if (&rq->cfs == cfs_rq) {
> -               /*
> -                * There are a few boundary cases this might miss but it should
> -                * get called often enough that that should (hopefully) not be
> -                * a real problem.
> -                *
> -                * It will not get called when we go idle, because the idle
> -                * thread is a different class (!fair), nor will the utilization
> -                * number include things like RT tasks.
> -                *
> -                * As is, the util number is not freq-invariant (we'd have to
> -                * implement arch_scale_freq_capacity() for that).
> -                *
> -                * See cpu_util_cfs().
> -                */
> -               cpufreq_update_util(rq, flags);
> -       }
> -}
> -
>  #ifdef CONFIG_SMP
>  static inline bool load_avg_is_decayed(struct sched_avg *sa)
>  {
> @@ -4648,8 +4625,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
>         add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
>
> -       cfs_rq_util_change(cfs_rq, 0);
> -
>         trace_pelt_cfs_tp(cfs_rq);
>  }
>
> @@ -4678,8 +4653,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
>         add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
>
> -       cfs_rq_util_change(cfs_rq, 0);
> -
>         trace_pelt_cfs_tp(cfs_rq);
>  }
>
> @@ -4726,11 +4699,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>                  */
>                 detach_entity_load_avg(cfs_rq, se);
>                 update_tg_load_avg(cfs_rq);
> -       } else if (decayed) {
> -               cfs_rq_util_change(cfs_rq, 0);
> -
> -               if (flags & UPDATE_TG)
> -                       update_tg_load_avg(cfs_rq);
> +       } else if (decayed && (flags & UPDATE_TG)) {
> +               update_tg_load_avg(cfs_rq);
>         }
>  }
>
> @@ -5114,7 +5084,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
>  {
> -       cfs_rq_util_change(cfs_rq, 0);
>  }
>
>  static inline void remove_entity_load_avg(struct sched_entity *se) {}
> @@ -5807,6 +5776,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>         sub_nr_running(rq, task_delta);
>
>  done:
> +       cpufreq_update_util(rq, 0);
> +
>         /*
>          * Note: distribution will already see us throttled via the
>          * throttled-list.  rq->lock protects completion.
> @@ -5899,6 +5870,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  unthrottle_throttle:
>         assert_list_leaf_cfs_rq(rq);
>
> +       cpufreq_update_util(rq, 0);
> +
>         /* Determine whether we need to wake up potentially idle CPU: */
>         if (rq->curr == rq->idle && rq->cfs.nr_running)
>                 resched_curr(rq);
> @@ -6704,14 +6677,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>          */
>         util_est_enqueue(&rq->cfs, p);
>
> -       /*
> -        * If in_iowait is set, the code below may not trigger any cpufreq
> -        * utilization updates, so do it here explicitly with the IOWAIT flag
> -        * passed.
> -        */
> -       if (p->in_iowait)
> -               cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> -
>         for_each_sched_entity(se) {
>                 if (se->on_rq)
>                         break;
> @@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  enqueue_throttle:
>         assert_list_leaf_cfs_rq(rq);
>

Here and in the other places below, you lose:

 -       } else if (decayed) {

The decayed condition ensures a rate limit (~1ms) in the number of
calls to cpufreq_update_util.

enqueue/dequeue/tick don't create any sudden change in the PELT
signals that would require updating cpufreq, unlike
attach/detach.
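
To make that concrete, here is a toy model of the gating (plain C, just an
illustration of the behaviour, not the kernel code):

	#include <stdbool.h>
	#include <stdint.h>

	#define PELT_PERIOD_NS	(1024 * 1000ULL)	/* one PELT period is 1024us */

	struct toy_cfs_rq {
		uint64_t period_end_ns;	/* end of the last completed PELT period */
	};

	/* "decayed" can only flip to true once per 1024us period */
	static bool toy_pelt_decayed(struct toy_cfs_rq *cfs_rq, uint64_t now_ns)
	{
		if (now_ns < cfs_rq->period_end_ns + PELT_PERIOD_NS)
			return false;

		cfs_rq->period_end_ns = now_ns -
			(now_ns - cfs_rq->period_end_ns) % PELT_PERIOD_NS;
		return true;
	}

	static void toy_update_load_avg(struct toy_cfs_rq *cfs_rq, uint64_t now_ns,
					void (*cpufreq_update)(void))
	{
		/* enqueue/dequeue/tick all funnel through here ... */
		if (toy_pelt_decayed(cfs_rq, now_ns))
			cpufreq_update();	/* ... at most roughly one request per ms */
	}

Calling cpufreq_update_util() unconditionally at enqueue/dequeue loses that
implicit rate limit.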


> +       cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
> +
>         hrtick_update(rq);
>  }
>
> @@ -6849,6 +6816,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>  dequeue_throttle:
>         util_est_update(&rq->cfs, p, task_sleep);
> +       cpufreq_update_util(rq, 0);
>         hrtick_update(rq);
>  }
>
> @@ -8482,6 +8450,7 @@ done: __maybe_unused;
>
>         update_misfit_status(p, rq);
>         sched_fair_update_stop_tick(rq, p);
> +       cpufreq_update_util(rq, 0);
>
>         return p;
>
> @@ -12615,6 +12584,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
>         update_misfit_status(curr, rq);
>         update_overutilized_status(task_rq(curr));
> +       cpufreq_update_util(rq, 0);
>
>         task_tick_core(rq, curr);
>  }
> @@ -12739,6 +12709,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
>         struct sched_entity *se = &p->se;
>
>         detach_entity_cfs_rq(se);
> +       cpufreq_update_util(task_rq(p), 0);
>  }
>
>  static void attach_task_cfs_rq(struct task_struct *p)
> @@ -12746,6 +12717,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
>         struct sched_entity *se = &p->se;
>
>         attach_entity_cfs_rq(se);
> +       cpufreq_update_util(task_rq(p), 0);
>  }
>
>  static void switched_from_fair(struct rq *rq, struct task_struct *p)
> @@ -12991,6 +12963,7 @@ static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
>                         update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
>                         update_cfs_group(se);
>                 }
> +               cpufreq_update_util(rq, 0);
>                 rq_unlock_irqrestore(rq, &rf);
>         }
>
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-11  7:56       ` Lukasz Luba
@ 2023-12-12 12:10         ` Qais Yousef
  2023-12-14  8:19           ` Lukasz Luba
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-12-12 12:10 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, Vincent Guittot, Viresh Kumar, Rafael J. Wysocki,
	Dietmar Eggemann, linux-pm, Ingo Molnar, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia, Peter Zijlstra

On 12/11/23 07:56, Lukasz Luba wrote:
> 
> 
> On 12/10/23 20:51, Qais Yousef wrote:
> > On 12/08/23 10:05, Lukasz Luba wrote:
> > > Hi Qais,
> > > 
> > > On 12/8/23 01:52, Qais Yousef wrote:
> > > 
> > > [snip]
> > > 
> > > > @@ -6704,14 +6677,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >    	 */
> > > >    	util_est_enqueue(&rq->cfs, p);
> > > > -	/*
> > > > -	 * If in_iowait is set, the code below may not trigger any cpufreq
> > > > -	 * utilization updates, so do it here explicitly with the IOWAIT flag
> > > > -	 * passed.
> > > > -	 */
> > > > -	if (p->in_iowait)
> > > > -		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> > > > -
> > > 
> > > Why this io wait boost is considered as the $subject says 'aggressive'
> > > calling?
> > 
> > This will trigger a frequency update along with the iowait boost. Did I miss
> > something?
> 
> Yes, it will change CPU freq and it was the main goal for this code
> path. We have tests which check how that works on different memory
> types.
> 
> Why do you want to remove it?

It seems you missed this hunk? I of course didn't remove it altogether if
that's what you mean :)

	@@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
	 enqueue_throttle:
		assert_list_leaf_cfs_rq(rq);

	+       cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
	+
		hrtick_update(rq);
	 }

> 
> Did you run some tests (e.g. fio, gallery, etc) to check if you still
> have a decent performance out of some new ufs/nvme memories?

PCMark storage gives

before*: 29681
after: 30014

* no patches applied including remove-margins one


Cheers

--
Qais Yousef

> 
> Beata & Dietmar have presented at LPC2021 a proposal to have a per-task
> io boost, with a bit more controllable way of the trade off power vs.
> performance [1]. IMO the io wait boost could evolve, not simply die.
> 
> Regards,
> Lukasz
> 
> [1] https://lpc.events/event/11/contributions/1042/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] sched/schedutil: Ignore update requests for short running tasks
  2023-12-11 11:15       ` Hongyan Xia
@ 2023-12-12 12:23         ` Qais Yousef
  0 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-12-12 12:23 UTC (permalink / raw)
  To: Hongyan Xia
  Cc: Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei, Ingo Molnar, Rafael J. Wysocki,
	Peter Zijlstra, Viresh Kumar, Dietmar Eggemann

On 12/11/23 11:15, Hongyan Xia wrote:
> > If the rq->util_avg is 1024, then for any task that is running, the requested
> > frequency should be max. If there's a task that is capped by uclamp_max, then
> > this task should not run at max.
> > 
> > So other tasks should have run at max; why you don't want them to run at max?
> 
> Because it saves power. If there's a 1024 task capped at 300 and a true 300
> task without uclamp on the same rq, there's no need to run the CPU at more
> than 600. Running it at 1024 ignores the uclamp_max and burns battery when
> it's not needed.

To fix this problem of tasks that are not 1024 but appear as 1024, the solution
doesn't lie in how the combined tasks' perf hints are treated.

Note that tasks stuck on a CPU with small capacity (due to affinity or
extremely long balance_interval) can still appear as 1024 in the same way
UCLAMP_MAX does.
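
For concreteness, the two aggregation schemes being debated work out like this
for Hongyan's example (illustrative arithmetic only, with made-up helpers, not
the kernel implementation):

	#include <stdio.h>

	static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }
	static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }

	int main(void)
	{
		/* busy task capped at 300 + a true 300 task with the default clamp */
		unsigned long util[]       = { 1024, 300 };
		unsigned long uclamp_max[] = {  300, 1024 };

		/* rq-level max-aggregation: clamp the rq util by the max of the clamps */
		unsigned long rq_util  = min_ul(util[0] + util[1], 1024);
		unsigned long rq_clamp = max_ul(uclamp_max[0], uclamp_max[1]);
		printf("max-aggregation request: %lu\n", min_ul(rq_util, rq_clamp)); /* 1024 */

		/* clamping each task first and then summing, as argued for above */
		unsigned long summed = min_ul(util[0], uclamp_max[0]) +
				       min_ul(util[1], uclamp_max[1]);
		printf("per-task clamp then sum: %lu\n", summed); /* 600 */

		return 0;
	}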

> > Is it only the documentation what triggered this concern about this corner
> > case? I'm curious what have you seen.
> 
> This is not a corner case. Having a uclamp_max task and a normal task on the
> same rq is fairly common. My concern isn't the 'frequency spike' problem
> from documentation. My concern comes from benchmarks, which is
> high-frequency square waves. An occasional spike isn't worrying, but the
> square waves are.

Fairly common in practice, or in your synthetic test setup to trigger it? We haven't
hit this problem in practice. Again, the solution is not related to how the
performance hints are applied.

Note that if you have busy tasks running and sharing the CPU, the CPU should run
at max for non-capped tasks. Please differentiate between the two problems.

> 
> > So worth a fix, not related to handling performance requests for multiple
> > tasks, and not urgently needed as nothing is falling apart because of it for
> > the time being at least.
> 
> Also, I think there's still an unanswered question. If there's a 1024 task

If there's a 1024 task on the rq then it'd run at max frequency, the system
will be overutilized, EAS will be disabled and work will be spread according to load.

Cheers

--
Qais Yousef

> with a uclamp_min of 300 and a 300-util task with default uclamp on the same
> rq, which currently under max aggregation switches between highest and
> lowest OPP, should we filter out the high OPP or the low one? Neither is a
> short-running task. We could designate this as a corner case (though I don't
> think so), but wouldn't it be nice if we don't have any of these problems in
> the first place?
> 
> Hongyan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-11 18:47   ` Christian Loehle
@ 2023-12-12 12:34     ` Qais Yousef
  2023-12-12 13:09       ` Christian Loehle
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-12-12 12:34 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann, linux-kernel, linux-pm,
	Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei, Hongyan Xia

On 12/11/23 18:47, Christian Loehle wrote:
> On 08/12/2023 01:52, Qais Yousef wrote:
> > Due to the way code is structured, it makes a lot of sense to trigger
> > cpufreq_update_util() from update_load_avg(). But this is too aggressive
> > as in most cases we are iterating through entities in a loop to
> > update_load_avg() in the hierarchy. So we end up sending too many
> > requests in a loop as we're updating the hierarchy.
> 
> Whether this is actually less aggressive heavily depends on the workload;
> I can argue the patch is more aggressive, as you call cpufreq_update_util
> at every enqueue and dequeue, instead of just at enqueue.
> For an I/O workload it is definitely more aggressive, see below.

I could have unwittingly broken something. Thanks for the report!

> 
> > 
> > Combine this with the rate limit in schedutil, we could end up
> > prematurely send up a wrong frequency update before we have actually
> > updated all entities appropriately.
> > [SNIP]
> 
> 
> > @@ -6704,14 +6677,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  	 */
> >  	util_est_enqueue(&rq->cfs, p);
> >  
> > -	/*
> > -	 * If in_iowait is set, the code below may not trigger any cpufreq
> > -	 * utilization updates, so do it here explicitly with the IOWAIT flag
> > -	 * passed.
> > -	 */
> > -	if (p->in_iowait)
> > -		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> > -
> >  	for_each_sched_entity(se) {
> >  		if (se->on_rq)
> >  			break;
> > @@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  enqueue_throttle:
> >  	assert_list_leaf_cfs_rq(rq);
> >  
> > +	cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
> > +
> >  	hrtick_update(rq);
> >  }
> >  
> > @@ -6849,6 +6816,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  
> >  dequeue_throttle:
> >  	util_est_update(&rq->cfs, p, task_sleep);
> > +	cpufreq_update_util(rq, 0);
> 
> This is quite critical, instead of only calling the update
> at enqueue (with SCHED_CPUFREQ_IOWAIT if applicable) it is
> now called at every enqueue and dequeue. The only way for

I think it was called at enqueue/dequeue before, but now it is done
unconditionally as I don't check for decay like before. It shouldn't change the
behavior, as if there's no frequency change then the governor will do nothing,
including not updating last_update_time IIRC.

> schedutil (intel_pstate too?) to build up a value of
> iowait_boost > 128 is a large enough rate_limit_us, as even
> for just a in_iowait task the enqueue increases the boost and
> its own dequeue could reduce it already. For just a basic
> benchmark workload and 2000 rate_limit_us this doesn't seem
> to be that critical, anything below 200 rate_limit_us didn't

200us is too low. Does rk3399 support this? My pine64 has this SoC and
I remember it doesn't support fastswitch and the time to wake up the sugov
thread will be comparable to this before even trying to talk to the hardware.

Not necessarily means that I don't have a bug in my code of course! :)

> show any iowait boosting > 128 anymore on my system.
> Of course if the workload does more between enqueue and
> dequeue (time until task issues next I/O) already larger
> values of rate_limit_us will disable any significant
> iowait boost benefit.

Hmm. It seems sugov_iowait_reset() is being called at the dequeue?

Tweaking the rate limit means a short-living task's freq update at dequeue is
likely to be ignored by the governor.

A short value means it is likely to be taken into account.

Not sure if this is uncovering a bug somewhere else or if I broke something.

> 
> Just to add some numbers to the story:
> fio --time_based --name=fiotest --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1
> fio --time_based --name=fiotest --filename=/dev/mmcblk2 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1
> 
> All results are sorted:
> With this patch and rate_limit_us=2000:
> (Second line is without iowait boosting, results are sorted):
> [3883, 3980, 3997, 4018, 4019]
> [2732, 2745, 2782, 2837, 2841]
> /dev/mmcblk2
> [4136, 4144, 4198, 4275, 4329]
> [2753, 2975, 2975, 2975, 2976]
> 
> Without this patch and rate_limit_us=2000:
> [3918, 4021, 4043, 4081, 4085]
> [2850, 2859, 2863, 2873, 2887]
> /dev/mmcblk2
> [4277, 4358, 4380, 4421, 4425]
> [2796, 3103, 3128, 3180, 3200]
> 
> With this patch and rate_limit_us=200:
> /dev/nvme0n1
> [2470, 2480, 2481, 2484, 2520]
> [2473, 2510, 2517, 2534, 2572]
> /dev/mmcblk2
> [2286, 2338, 2440, 2504, 2535]
> [2360, 2462, 2484, 2503, 2707]
> 
> Without this patch and rate_limit_us=200:
> /dev/nvme0n1
> [3880, 3956, 4010, 4013, 4016]
> [2732, 2867, 2937, 2937, 2939]
> /dev/mmcblk2
> [4783, 4791, 4821, 4855, 4860]
> [2653, 3091, 3095, 3166, 3202]

Was any other patch in this series or remove margin series applied or just this
one?

> 
> I'm currently working on iowait boosting and seeing where it's
> actually needed and how it could be improved, so always interested
> in anyone's thoughts.

One of the problems identified with iowait boost is that it is per-CPU, which
means tasks that are causing the iowait to happen will lose this boost when
migrated.

Arm was working on a way to help convert it to per-task. See Lukasz email.

> 
> (The second line here doesn't provide additional
> information, I left it in to compare for reproducibility).
> All with CONFIG_HZ=100 on an rk3399.

Your tick is 10ms?! sugov_iowait_reset() should return false then. I see now,
we undo the boost in sugov_iowait_apply().

There's room for improvement for sure. Thanks for the feedback!


Cheers

--
Qais Yousef

> 
> Best Regards,
> Christian
> 
> > [SNIP]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-12 10:46   ` Dietmar Eggemann
@ 2023-12-12 12:35     ` Qais Yousef
  2023-12-12 18:22       ` Hongyan Xia
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-12-12 12:35 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei, Hongyan Xia

On 12/12/23 11:46, Dietmar Eggemann wrote:
> On 08/12/2023 02:52, Qais Yousef wrote:
> > Due to the way code is structured, it makes a lot of sense to trigger
> > cpufreq_update_util() from update_load_avg(). But this is too aggressive
> > as in most cases we are iterating through entities in a loop to
> > update_load_avg() in the hierarchy. So we end up sending too many
> > requests in a loop as we're updating the hierarchy.
> 
> But update_load_avg() calls cfs_rq_util_change() which only issues a
> cpufreq_update_util() call for the root cfs_rq?

Yes I've noticed that and wondered. Maybe my analysis was flawed and I was just
hitting the issue of iowait boost request conflicting with update_load_avg()
request.

Let me have another look. I think we'll still end up needing to take the update
out of util_avg to be able to combine the two calls.


Cheers

--
Qais Yousef

> 
> So the 'iterating through entities' should be for a task in a non-root
> taskgroup which the condition (1) takes care of.
> 
> cfs_rq_util_change()
> 
>     ...
>     if (&rq->cfs == cfs_rq) (1)
> 
>         cpufreq_update_util()
> 
> [...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-12 11:06   ` Vincent Guittot
@ 2023-12-12 12:40     ` Qais Yousef
  2023-12-29  0:25       ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-12-12 12:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei, Hongyan Xia

On 12/12/23 12:06, Vincent Guittot wrote:

> > @@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  enqueue_throttle:
> >         assert_list_leaf_cfs_rq(rq);
> >
> 
> Here and in the other places below,  you lose :
> 
>  -       } else if (decayed) {
> 
> The decayed condition ensures a rate limit (~1ms) in the number of
> calls to cpufreq_update_util.
> 
> enqueue/dequeue/tick don't create any sudden change in the PELT
> signals that would require to update cpufreq of the change unlike
> attach/detach

Okay, thanks for the clue. Let me rethink this again.


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-12 12:34     ` Qais Yousef
@ 2023-12-12 13:09       ` Christian Loehle
  2023-12-12 13:29         ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Christian Loehle @ 2023-12-12 13:09 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann, linux-kernel, linux-pm,
	Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei, Hongyan Xia

On 12/12/2023 12:34, Qais Yousef wrote:
> On 12/11/23 18:47, Christian Loehle wrote:
>> On 08/12/2023 01:52, Qais Yousef wrote:
>>> Due to the way code is structured, it makes a lot of sense to trigger
>>> cpufreq_update_util() from update_load_avg(). But this is too aggressive
>>> as in most cases we are iterating through entities in a loop to
>>> update_load_avg() in the hierarchy. So we end up sending too many
>>> requests in a loop as we're updating the hierarchy.
>>
>> Whether this is actually less aggressive heavily depends on the workload;
>> I can argue the patch is more aggressive, as you call cpufreq_update_util
>> at every enqueue and dequeue, instead of just at enqueue.
>> For an I/O workload it is definitely more aggressive, see below.
> 
> I could have unwittingly broken something. Thanks for the report!
> 
> [SNIP]
>>>  dequeue_throttle:
>>>  	util_est_update(&rq->cfs, p, task_sleep);
>>> +	cpufreq_update_util(rq, 0);
>>
>> This is quite critical, instead of only calling the update
>> at enqueue (with SCHED_CPUFREQ_IOWAIT if applicable) it is
>> now called at every enqueue and dequeue. The only way for
> 
> I think it was called at enqueue/dequeue before, but now it is done
> unconditionally as I don't check for decay like before. It shouldn't change the
> behavior as if there's no frequency change, then the governor will do nothing

Well the governor will update the fields by calling sugov_iowait_apply().
What exactly are you referring to when you say
"I think it was called at dequeue before"?
From what I can see this patch calls cpufreq_update_util much more
on an enqueue/dequeue workload like fio.

> including not update last_update_time IIRC.

sched_avg maybe, but the iowait boost's last_update is changed;
I'll look into it, see below.

> 
>> schedutil (intel_pstate too?) to build up a value of
>> iowait_boost > 128 is a large enough rate_limit_us, as even
>> for just a in_iowait task the enqueue increases the boost and
>> its own dequeue could reduce it already. For just a basic
>> benchmark workload and 2000 rate_limit_us this doesn't seem
>> to be that critical, anything below 200 rate_limit_us didn't
> 
> 200us is too low. Does rk3399 support this? My pine64 has this SoC and
> I remember it doesn't support fastswitch and the time to wake up the sugov
> thread will be comparable to this before even trying to talk tot he hardware.

Unlikely it is actually supported, but I'm mostly concerned with the
actual iowait_boost value, so this limitation here actually is beside
my point. (The fact that tip:sched/core results are very different at 200us
from this patch gives me some confidence here.)

> 
> Not necessarily means that I don't have a bug in my code of course! :)
> 
>> show any iowait boosting > 128 anymore on my system.
>> Of course if the workload does more between enqueue and
>> dequeue (time until task issues next I/O) already larger
>> values of rate_limit_us will disable any significant
>> iowait boost benefit.
> 
> Hmm. It seems sugov_iowait_reset() is being called at the dequeue?

Also yes, but I'm actually worried about the reduce / halving in
iowait_boost_apply().
With a one-to-one correspondence of enqueue (inc boost if in_iowait) and
dequeue (dec boost regardless) with cpufreq_update_util() you would expect
there to never be any boost apart from the minimal between the enqueue and
the dequeue. It does build up anyway, but that is only because the reduce
in iowait_boost_apply() is never hit if the last update time delta is < rate_limit_us.
(and rate_limit_us=2000 gives us still some headroom for read->read userspace
workloads, for read->think->read this could be more problematic, see also below.)
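
A toy model of that interplay (my simplification, not the actual schedutil
code):

	#include <stdbool.h>
	#include <stdint.h>

	#define BOOST_MIN	128U
	#define BOOST_MAX	1024U

	struct toy_sg_cpu {
		uint64_t	last_update_ns;
		unsigned int	iowait_boost;
		bool		boost_pending;
	};

	static void toy_cpufreq_update(struct toy_sg_cpu *sg, uint64_t now_ns,
				       uint64_t rate_limit_ns, bool iowait_wakeup)
	{
		if (iowait_wakeup) {
			/* enqueue path: the boost is bumped regardless of the rate limit */
			if (!sg->iowait_boost)
				sg->iowait_boost = BOOST_MIN;
			else if (sg->iowait_boost < BOOST_MAX)
				sg->iowait_boost <<= 1;
			sg->boost_pending = true;
		}

		/* the rest of the update, including the halving, is rate limited ... */
		if (now_ns - sg->last_update_ns < rate_limit_ns)
			return;

		/* ... so the dequeue-time decay only happens past rate_limit_us */
		if (!sg->boost_pending && sg->iowait_boost > BOOST_MIN)
			sg->iowait_boost >>= 1;

		sg->boost_pending = false;
		sg->last_update_ns = now_ns;
	}

With a large enough rate_limit_us the dequeues between two I/O wakeups land
inside the rate limit and never halve the boost, so it can still build up;
shrink the rate limit and every dequeue gets to decay it again.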

(Now thinking about it, the fact that last_update is updated before determining
if the frequency should be changed could be an issue; I'll look into it some
more. Anyway, at worst it's an issue with larger impact with your patch.)

> 
> Tweaking the rate limit means short living tasks freq update at dequeue is
> likely to be ignored by the governor.
> 
> The short value means it is likely to be taken into account.
> 
> Not sure if this is uncovering a bug somewhere else or if I broke something.
>>
>> Just to add some numbers to the story:
>> fio --time_based --name=fiotest --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1
>> fio --time_based --name=fiotest --filename=/dev/mmcblk2 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1
>>
>> All results are sorted:
>> With this patch and rate_limit_us=2000:
>> (Second line is without iowait boosting, results are sorted):
>> [3883, 3980, 3997, 4018, 4019]
>> [2732, 2745, 2782, 2837, 2841]
>> /dev/mmcblk2
>> [4136, 4144, 4198, 4275, 4329]
>> [2753, 2975, 2975, 2975, 2976]
>>
>> Without this patch and rate_limit_us=2000:
>> [3918, 4021, 4043, 4081, 4085]
>> [2850, 2859, 2863, 2873, 2887]
>> /dev/mmcblk2
>> [4277, 4358, 4380, 4421, 4425]
>> [2796, 3103, 3128, 3180, 3200]
>>
>> With this patch and rate_limit_us=200:
>> /dev/nvme0n1
>> [2470, 2480, 2481, 2484, 2520]
>> [2473, 2510, 2517, 2534, 2572]
>> /dev/mmcblk2
>> [2286, 2338, 2440, 2504, 2535]
>> [2360, 2462, 2484, 2503, 2707]
>>
>> Without this patch and rate_limit_us=200:
>> /dev/nvme0n1
>> [3880, 3956, 4010, 4013, 4016]
>> [2732, 2867, 2937, 2937, 2939]
>> /dev/mmcblk2
>> [4783, 4791, 4821, 4855, 4860]
>> [2653, 3091, 3095, 3166, 3202]
> 
> Was any other patch in this series or remove margin series applied or just this
> one?

These specific numbers were just with this one.
I did a couple of tests to get a feel for the entire series (both),
but found no drastic change that would be uncovered by n=3 runs
anyway.

>>
>> I'm currently working on iowait boosting and seeing where it's
>> actually needed and how it could be improved, so always interested
>> in anyone's thoughts.
> 
> One of the problems identified with iowait boost is that it is per-cpu; which
> means tasks that are causing the iowait to happen will lose this boost when
> migrated.
> 
> Arm was working on a way to help convert it to per-task. See Lukasz email.

I guess that would be me now :)
Apart from considering per-task, I'd like to get a reasonable scope designed for
the feature anyway.
Like in your patch, assuming rate_limit_us=2000, on my platform and scenarios
the context-switches + the minimum stuff fio does between read syscalls
took around 200us (that's where the boost benefit disappears).
If we're considering workloads that do just a little more in-between than what
fio does (maybe actually looking at the data?) and therefore take maybe >2000us
in-between, you have disabled iowait boost for that workload with this patch.
My view right now is that this might not be critical: any attempts at recreating
such workloads (realistically) basically lead to them contributing enough util to
not need the iowait boost, but of course you can also create synthetic workloads
where this is not true.
If you want to play around with this too, I have recently added a --thinkcycles
parameter to fio; you will have to build it from source though as it hasn't seen
a release since.

> 
>>
>> (The second line here doesn't provide additional
>> information, I left it in to compare for reproducibility).
>> All with CONFIG_HZ=100 on an rk3399.
> 
> Your tick is 10ms?! sugov_iowait_reset() should return false then. I see now,
> we undo the boost in sugov_iowait_apply().

Again, just to emphasize, the disabling of iowait boost then does not come from
sugov_iowait_reset, but sugov_iowait_apply, which will be called in dequeue regardless
in your patch; plus you're lowering the rate_limit_us, which right now acts as
an 'iowait boost protection' for your patch, if that makes sense.

Best Regards,
Christian


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-12 13:09       ` Christian Loehle
@ 2023-12-12 13:29         ` Qais Yousef
  0 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-12-12 13:29 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, Dietmar Eggemann, linux-kernel, linux-pm,
	Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei, Hongyan Xia

On 12/12/23 13:09, Christian Loehle wrote:

> > Arm was working on a way to help convert it to per-task. See Lukasz email.
> 
> I guess that would be me now :)

Ah, sorry, I hadn't noticed the email address :-)

> Apart from considering per-task I'd like to get a reasonable scope for the
> feature designed anyway.

Besides the iowait boost being completely ignored at migration, there's the desire
to disable it for some tasks. Not every task's io performance is important to
honour. Being able to control this via cgroup would be helpful so it can be
enabled/disabled for background tasks for example. Generally controlling the default
behavior could be useful too.

> If you want to play around with this too, I have recently added --thinkcycles
> parameter to fio, you will have to build it from source though as it hasn't seen
> a release since.

Thanks. Might reach out if I need this.

> > Your tick is 10ms?! sugov_iowait_reset() should return false then. I see now,
> > we undo the boost in sugov_iowait_apply().
> 
> Again, just to emphasize, the disabling of iowait boost then does not come from
> sugov_iowait_reset, but sugov_iowait_apply, which will be called in dequeue regardless
> in your patch, plus you're lowering the rate_limit_us, which right now acts as
> a 'iowait boost protection' for your patch, if that makes sense.

Maybe I should have re-edited my reply. I meant that I can now see how we can end
up undoing the boost in sugov_iowait_apply() under the conditions you pointed
out. So yep, I can see the problem.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-12 12:35     ` Qais Yousef
@ 2023-12-12 18:22       ` Hongyan Xia
  0 siblings, 0 replies; 32+ messages in thread
From: Hongyan Xia @ 2023-12-12 18:22 UTC (permalink / raw)
  To: Qais Yousef, Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei

On 12/12/2023 12:35, Qais Yousef wrote:
> On 12/12/23 11:46, Dietmar Eggemann wrote:
>> On 08/12/2023 02:52, Qais Yousef wrote:
>>> Due to the way code is structured, it makes a lot of sense to trigger
>>> cpufreq_update_util() from update_load_avg(). But this is too aggressive
>>> as in most cases we are iterating through entities in a loop to
>>> update_load_avg() in the hierarchy. So we end up sending too many
>>> requests in a loop as we're updating the hierarchy.
>>
>> But update_load_avg() calls cfs_rq_util_change() which only issues a
>> cpufreq_update_util() call for the root cfs_rq?
> 
> Yes I've noticed that and wondered. Maybe my analysis was flawed and I was just
> hitting the issue of iowait boost request conflicting with update_load_avg()
> request.
> 
> Let me have another look. I think we'll still end up needing to take the update
> out of util_avg to be able to combine the two calls.

I agree. Currently it does not express the intention clearly. We only 
want to update the root CFS but the code was written in a misleading way 
that suggests we want to update for every cfs_rq. A single update at the 
end looks much nicer and makes other patches easier.

Hongyan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-12 12:10         ` Qais Yousef
@ 2023-12-14  8:19           ` Lukasz Luba
  0 siblings, 0 replies; 32+ messages in thread
From: Lukasz Luba @ 2023-12-14  8:19 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, Vincent Guittot, Viresh Kumar, Rafael J. Wysocki,
	Dietmar Eggemann, linux-pm, Ingo Molnar, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia, Peter Zijlstra



On 12/12/23 12:10, Qais Yousef wrote:
> On 12/11/23 07:56, Lukasz Luba wrote:
>>
>>
>> On 12/10/23 20:51, Qais Yousef wrote:
>>> On 12/08/23 10:05, Lukasz Luba wrote:
>>>> Hi Qais,
>>>>
>>>> On 12/8/23 01:52, Qais Yousef wrote:
>>>>
>>>> [snip]
>>>>
>>>>> @@ -6704,14 +6677,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>     	 */
>>>>>     	util_est_enqueue(&rq->cfs, p);
>>>>> -	/*
>>>>> -	 * If in_iowait is set, the code below may not trigger any cpufreq
>>>>> -	 * utilization updates, so do it here explicitly with the IOWAIT flag
>>>>> -	 * passed.
>>>>> -	 */
>>>>> -	if (p->in_iowait)
>>>>> -		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
>>>>> -
>>>>
>>>> Why this io wait boost is considered as the $subject says 'aggressive'
>>>> calling?
>>>
>>> This will trigger a frequency update along with the iowait boost. Did I miss
>>> something?
>>
>> Yes, it will change CPU freq and it was the main goal for this code
>> path. We have tests which check how that works on different memory
>> types.
>>
>> Why do you want to remove it?
> 
> It seems you missed this hunk? I of course didn't remove it altogether if
> that's what you mean :)
> 
> 	@@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> 	 enqueue_throttle:
> 		assert_list_leaf_cfs_rq(rq);
> 
> 	+       cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
> 	+
> 		hrtick_update(rq);
> 	 }
> 

Yes, you're right, I missed that change. I will have to spend some time
to figure out this new flow in the whole patch set.


>>
>> Did you run some tests (e.g. fio, gallery, etc) to check if you still
>> have a decent performance out of some new ufs/nvme memories?
> 
> PCMark storage gives
> 
> before*: 29681
> after: 30014

The result looks good.

Thanks,
Lukasz

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] sched: cpufreq: Remove uclamp max-aggregation
  2023-12-18  8:19 ` [PATCH 0/4] sched: cpufreq: Remove uclamp max-aggregation Dietmar Eggemann
@ 2023-12-17 21:23   ` Qais Yousef
  0 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-12-17 21:23 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei, Hongyan Xia

On 12/18/23 09:19, Dietmar Eggemann wrote:
> On 08/12/2023 02:52, Qais Yousef wrote:
> 
> [...]
> 
> > ===
> > 
> > This patch is based on the remove margins series [1] and data is collected
> > against it as a baseline.
> > 
> > Testing on pixel 6 with mainline(ish) kernel
> 
> How is the Pixel6 configured in terms of per-policy rate_limit_us and
> response_time_ms ? Is this the now default 2ms and whatever the system
> calculates for response_time_ms ?

Yes.

> 
> Pixel6 is still a slow switching device, right?
> 
> root           297     2 1 08:58:01 ?     00:00:13 [sugov:0]
> root           298     2 0 08:58:01 ?     00:00:03 [sugov:4]
> root           299     2 1 08:58:01 ?     00:00:05 [sugov:6]

Yes.

> 
> > ==
> > 
> > Speedometer browser benchmark
> > 
> >        | baseline  | 1.25 headroom |   patch   | patch + 1.25 headroom
> > -------+-----------+---------------+-----------+---------------------
> > score  |  108.03   |     135.72    |   108.09  |    137.47
> > -------+-----------+---------------+-----------+---------------------
> > power  |  1204.75  |    1451.79    |  1216.17  |    1484.73
> > -------+-----------+---------------+-----------+---------------------
> 
> What's the difference between baseline & 1.25 headroom? IMHO, we have:

Baseline is the remove-margins patches, as specified in the quoted text above:

	> > This patch is based on the remove margins series [1] and data is collected
	> > against it as a baseline.

The series were stacked on top of each other. Results from this run should be
compared to the remove-margins [1] tables too.

> 
>  static inline unsigned long map_util_perf(unsigned long util)
>  {
>    return util + (util >> 2);
>  }
> 
> on baseline?

This is not baseline. See above.

> 
> By patch you refer to the whole patch-set + [1]?
> 
> And I assume 'patch + 1.25 headroom' is 'response_time_ms' tuned to
> reach 1.25 ?

Yes. Which is done by multiplying the response_time_ms by 0.8 (i.e. 1/1.25).


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-18  8:51   ` Dietmar Eggemann
@ 2023-12-17 21:44     ` Qais Yousef
  0 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-12-17 21:44 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Vincent Guittot, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei, Hongyan Xia

On 12/18/23 09:51, Dietmar Eggemann wrote:
> On 08/12/2023 02:52, Qais Yousef wrote:
> > Due to the way code is structured, it makes a lot of sense to trigger
> > cpufreq_update_util() from update_load_avg(). But this is too aggressive
> > as in most cases we are iterating through entities in a loop to
> > update_load_avg() in the hierarchy. So we end up sending too many
> > requests in a loop as we're updating the hierarchy.
> > 
> > Combined with the rate limit in schedutil, we could end up
> > prematurely sending a wrong frequency update before we have actually
> > updated all entities appropriately.
> > 
> > Be smarter about it by limiting the trigger to perform frequency updates
> > after all accounting logic has been done. This ended up being in the
> 
> What are the boundaries of the 'accounting logic' here? Is this related
> to the update of all sched_entities and cfs_rq's involved when a task is
> attached/detached (or enqueued/dequeued)?

Yes.

> 
> I can't see that there are any premature cfs_rq_util_change() in the
> current code when we consider this.

Thanks for checking. I'll revisit the problem as indicated previously. This
patch is still needed; I'll update the rationale at least and fix the highlighted
issues with decay.

> 
> And avoiding updates for a smaller task to make sure updates for a
> bigger task go through is IMHO not feasible.

Where did this line of thought come from? This patch is about consolidating
how the scheduler requests frequency updates. And later patches require the single
update at tick to pass the new SCHED_CPUFREQ_PERF_HINTS flag.

If you're referring to the logic in later patches about ignore_short_tasks(),
then we only ignore the performance hints for that task.

Why not feasible? What's the rationale?

> 
> I wonder how much influence this patch has on the test results
> presented in the patch header?

The only change of behavior is how we deal with decay. Which I thought wouldn't
introduce a functional change, but as caught by Christian, it did. No
functional changes are supposed to happen that can affect the test results
AFAICT.


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] sched: cpufreq: Remove uclamp max-aggregation
  2023-12-08  1:52 [PATCH 0/4] sched: cpufreq: Remove uclamp max-aggregation Qais Yousef
                   ` (3 preceding siblings ...)
  2023-12-08  1:52 ` [PATCH 4/4] sched/documentation: Remove reference to max aggregation Qais Yousef
@ 2023-12-18  8:19 ` Dietmar Eggemann
  2023-12-17 21:23   ` Qais Yousef
  4 siblings, 1 reply; 32+ messages in thread
From: Dietmar Eggemann @ 2023-12-18  8:19 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Vincent Guittot
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia

On 08/12/2023 02:52, Qais Yousef wrote:

[...]

> ===
> 
> This patch is based on the remove margins series [1] and data is collected
> against it as a baseline.
> 
> Testing on pixel 6 with mainline(ish) kernel

How is the Pixel6 configured in terms of per-policy rate_limit_us and
response_time_ms ? Is this the now default 2ms and whatever the system
calculates for response_time_ms ?

Pixel6 is still a slow switching device, right?

root           297     2 1 08:58:01 ?     00:00:13 [sugov:0]
root           298     2 0 08:58:01 ?     00:00:03 [sugov:4]
root           299     2 1 08:58:01 ?     00:00:05 [sugov:6]

> ==
> 
> Speedometer browser benchmark
> 
>        | baseline  | 1.25 headroom |   patch   | patch + 1.25 headroom
> -------+-----------+---------------+-----------+---------------------
> score  |  108.03   |     135.72    |   108.09  |    137.47
> -------+-----------+---------------+-----------+---------------------
> power  |  1204.75  |    1451.79    |  1216.17  |    1484.73
> -------+-----------+---------------+-----------+---------------------

What's the difference between baseline & 1.25 headroom? IMHO, we have:

 static inline unsigned long map_util_perf(unsigned long util)
 {
   return util + (util >> 2);
 }

on baseline?

By patch you refer to the whole patch-set + [1]?

And I assume 'patch + 1.25 headroom' is 'response_time_ms' tuned to
reach 1.25 ?

[...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-08  1:52 ` [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util() Qais Yousef
                     ` (4 preceding siblings ...)
  2023-12-12 11:06   ` Vincent Guittot
@ 2023-12-18  8:51   ` Dietmar Eggemann
  2023-12-17 21:44     ` Qais Yousef
  5 siblings, 1 reply; 32+ messages in thread
From: Dietmar Eggemann @ 2023-12-18  8:51 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki,
	Viresh Kumar, Vincent Guittot
  Cc: linux-kernel, linux-pm, Lukasz Luba, Wei Wang, Rick Yiu,
	Chung-Kai Mei, Hongyan Xia

On 08/12/2023 02:52, Qais Yousef wrote:
> Due to the way code is structured, it makes a lot of sense to trigger
> cpufreq_update_util() from update_load_avg(). But this is too aggressive
> as in most cases we are iterating through entities in a loop to
> update_load_avg() in the hierarchy. So we end up sending too many
> requests in a loop as we're updating the hierarchy.
> 
> Combined with the rate limit in schedutil, we could end up
> prematurely sending a wrong frequency update before we have actually
> updated all entities appropriately.
> 
> Be smarter about it by limiting the trigger to perform frequency updates
> after all accounting logic has been done. This ended up being in the

What are the boundaries of the 'accounting logic' here? Is this related
to the update of all sched_entities and cfs_rq's involved when a task is
attached/detached (or enqueued/dequeued)?

I can't see that there are any premature cfs_rq_util_change() in the
current code when we consider this.

And avoiding updates for a smaller task to make sure updates for a
bigger task go through is IMHO not feasible.

I wonder how much influence this patch has on the test results
presented in the patch header?

> following points:
> 
> 1. enqueue/dequeue_task_fair()
> 2. throttle/unthrottle_cfs_rq()
> 3. attach/detach_task_cfs_rq()
> 4. task_tick_fair()
> 5. __sched_group_set_shares()
> 
> This is not 100% ideal still due to other limitations that might be
> a bit harder to handle. Namely we can end up with premature update
> request in the following situations:
> 
> a. Simultaneous task enqueue on the CPU where 2nd task is bigger and
>    requires higher freq. The trigger to cpufreq_update_util() by the
>    first task will lead to dropping the 2nd request until tick. Or
>    another CPU in the same policy trigger a freq update.
> 
> b. CPUs sharing a policy can end up with the same race in a but the
>    simultaneous enqueue happens on different CPUs in the same policy.
> 
> The above though are limitations in the governor/hardware, and from
> scheduler point of view at least that's the best we can do. The
> governor might consider smarter logic to aggregate near simultaneous
> request and honour the higher one.

[...]


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-12 12:40     ` Qais Yousef
@ 2023-12-29  0:25       ` Qais Yousef
  2024-01-03 13:41         ` Vincent Guittot
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-12-29  0:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei, Hongyan Xia

On 12/12/23 12:40, Qais Yousef wrote:
> On 12/12/23 12:06, Vincent Guittot wrote:
> 
> > > @@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  enqueue_throttle:
> > >         assert_list_leaf_cfs_rq(rq);
> > >
> > 
> > Here and in the other places below,  you lose :
> > 
> >  -       } else if (decayed) {
> > 
> > The decayed condition ensures a rate limit (~1ms) in the number of
> > calls to cpufreq_update_util.
> > 
> > enqueue/dequeue/tick don't create any sudden change in the PELT
> > signals that would require to update cpufreq of the change unlike
> > attach/detach
> 
> Okay, thanks for the clue. Let me rethink this again.

Thinking more about this. Do we really need to send freq updates at
enqueue/attach etc?

I did an experiment to remove all the updates except in three places:

1. Context switch (done unconditionally)
2. Tick
3. update_blocked_averages()

I tried the below patch on a mac mini with an M1 SoC, 6.6 kernel; Speedometer
scores were the same (against this series).

I ran perf bench sched pipe to see if the addition in context switch will make
things worse

before (this series applied):

	# Running 'sched/pipe' benchmark:
	# Executed 1000000 pipe operations between two processes

	     Total time: 20.505 [sec]

	      20.505628 usecs/op
		  48767 ops/sec

after (proposed patch below applied on top):

	# Running 'sched/pipe' benchmark:
	# Executed 1000000 pipe operations between two processes

	     Total time: 19.475 [sec]

	      19.475838 usecs/op
		  51345 ops/sec

I tried against the vanilla kernel (without any patches, including some dependency
backports) using the schedutil governor

	# Running 'sched/pipe' benchmark:
	# Executed 1000000 pipe operations between two processes

	     Total time: 22.428 [sec]

	      22.428166 usecs/op
		  44586 ops/sec

(I got higher run to run variance against this kernel)

So things got better. I think we can actually unify all updates to be at
context switch and tick for all classes.

The one hole is for long running CFS tasks without a context switch; no updates
until tick, which means the dvfs headroom needs to cater for that as the worst
case scenario now. I think this is the behavior today actually, but accidental
updates due to enqueue/dequeue could have helped to issue more updates. If none
of that happens, then updating load_avg at entity_tick() is what would have
triggered a frequency update.

I'm not sure if the blocked averages one is required, to be honest. But I feared
that when the cpu goes idle we might miss reducing the frequency vote for that
CPU - which matters on shared cpufreq policies.

I haven't done comprehensive testing of course. But I would love to hear any
thoughts on how we can be more strategic and reduce the number of cpufreq
updates we send. And iowait boost state needs to be verified.

While testing this series more I did notice that short kworker context switches
still can cause the frequency to go high. I traced this back to
__balance_callbacks() in finish_lock_switch() after calling
uclamp_context_switch(). So I think we do have a problem of updates being
'accidental' and we need to improve the interaction with the governor to be
more intentional keeping in mind all the constraints we have today in h/w and
software.

--->8---


From 6deed09be1d075afa3858ca62dd54826cdb60d44 Mon Sep 17 00:00:00 2001
From: Qais Yousef <qyousef@layalina.io>
Date: Tue, 26 Dec 2023 01:23:57 +0000
Subject: [PATCH] sched/fair: Update freq on tick and context switch and
 blocked avgs

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/cpufreq_schedutil.c |  3 ---
 kernel/sched/fair.c              | 13 +------------
 kernel/sched/sched.h             | 15 +--------------
 3 files changed, 2 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index c0879a985097..553a3d7f02d8 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -166,9 +166,6 @@ static inline bool ignore_short_tasks(int cpu,
 	struct task_struct *p = cpu_rq(cpu)->curr;
 	unsigned long task_util;
 
-	if (!(flags & SCHED_CPUFREQ_PERF_HINTS))
-		return false;
-
 	if (!fair_policy(p->policy))
 		return false;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d63eae534cec..3a30f78b37d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5717,8 +5717,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	sub_nr_running(rq, task_delta);
 
 done:
-	cpufreq_update_util(rq, 0);
-
 	/*
 	 * Note: distribution will already see us throttled via the
 	 * throttled-list.  rq->lock protects completion.
@@ -5811,8 +5809,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 unthrottle_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
-	cpufreq_update_util(rq, 0);
-
 	/* Determine whether we need to wake up potentially idle CPU: */
 	if (rq->curr == rq->idle && rq->cfs.nr_running)
 		resched_curr(rq);
@@ -6675,8 +6671,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
-	cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
-
 	hrtick_update(rq);
 }
 
@@ -6754,7 +6748,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
-	cpufreq_update_util(rq, 0);
 	hrtick_update(rq);
 }
 
@@ -8338,7 +8331,6 @@ done: __maybe_unused;
 
 	update_misfit_status(p, rq);
 	sched_fair_update_stop_tick(rq, p);
-	cpufreq_update_util(rq, 0);
 
 	return p;
 
@@ -12460,7 +12452,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
-	cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
+	cpufreq_update_util(rq, current->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
 
 	task_tick_core(rq, curr);
 }
@@ -12585,7 +12577,6 @@ static void detach_task_cfs_rq(struct task_struct *p)
 	struct sched_entity *se = &p->se;
 
 	detach_entity_cfs_rq(se);
-	cpufreq_update_util(task_rq(p), 0);
 }
 
 static void attach_task_cfs_rq(struct task_struct *p)
@@ -12593,7 +12584,6 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	struct sched_entity *se = &p->se;
 
 	attach_entity_cfs_rq(se);
-	cpufreq_update_util(task_rq(p), 0);
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
@@ -12839,7 +12829,6 @@ static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
 			update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
 			update_cfs_group(se);
 		}
-		cpufreq_update_util(rq, 0);
 		rq_unlock_irqrestore(rq, &rf);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 516187ea2b81..e1622e2b82be 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3076,20 +3076,7 @@ static inline bool uclamp_rq_is_capped(struct rq *rq)
 /* Request freq update on context switch if necessary */
 static inline void uclamp_context_switch(struct rq *rq)
 {
-	unsigned long uclamp_min;
-	unsigned long uclamp_max;
-	unsigned long util;
-
-	/* Only RT and FAIR tasks are aware of uclamp */
-	if (!rt_policy(current->policy) && !fair_policy(current->policy))
-		return;
-
-	uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
-	uclamp_max = uclamp_eff_value(current, UCLAMP_MAX);
-	util = rq->cfs.avg.util_avg;
-
-	if (uclamp_min > util || uclamp_max < util)
-		cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
+	cpufreq_update_util(rq, current->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
 }
 #else /* CONFIG_UCLAMP_TASK */
 static inline unsigned long uclamp_eff_value(struct task_struct *p,
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2023-12-29  0:25       ` Qais Yousef
@ 2024-01-03 13:41         ` Vincent Guittot
  2024-01-04 19:40           ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2024-01-03 13:41 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei, Hongyan Xia

On Fri, 29 Dec 2023 at 01:25, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 12/12/23 12:40, Qais Yousef wrote:
> > On 12/12/23 12:06, Vincent Guittot wrote:
> >
> > > > @@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >  enqueue_throttle:
> > > >         assert_list_leaf_cfs_rq(rq);
> > > >
> > >
> > > Here and in the other places below,  you lose :
> > >
> > >  -       } else if (decayed) {
> > >
> > > The decayed condition ensures a rate limit (~1ms) in the number of
> > > calls to cpufreq_update_util.
> > >
> > > enqueue/dequeue/tick don't create any sudden change in the PELT
> > > signals that would require to update cpufreq of the change unlike
> > > attach/detach
> >
> > Okay, thanks for the clue. Let me rethink this again.
>
> Thinking more about this. Do we really need to send freq updates at
> enqueue/attach etc?

Yes, attach and detach are the 2 events which can make abrupt and
significant changes in the utilization of the CPU.

>
> I did an experiment to remove all the updates except in three places:
>
> 1. Context switch (done unconditionally)
> 2. Tick
> 3. update_blocked_averages()

From the PoV of util_avg, attach, detach, tick and
update_blocked_averages are mandatory events to report to cpufreq to
correctly follow utilization.

>
> I tried the below patch on a mac mini with an M1 SoC, 6.6 kernel; Speedometer
> scores were the same (against this series).
>
> I ran perf bench sched pipe to see if the addition in context switch will make
> things worse
>
> before (this series applied):
>
>         # Running 'sched/pipe' benchmark:
>         # Executed 1000000 pipe operations between two processes
>
>              Total time: 20.505 [sec]
>
>               20.505628 usecs/op
>                   48767 ops/sec
>
> after (proposed patch below applied on top):
>
>         # Running 'sched/pipe' benchmark:
>         # Executed 1000000 pipe operations between two processes
>
>              Total time: 19.475 [sec]
>
>               19.475838 usecs/op
>                   51345 ops/sec
>
> I tried against the vanilla kernel (without any patches, including some dependency
> backports) using the schedutil governor
>
>         # Running 'sched/pipe' benchmark:
>         # Executed 1000000 pipe operations between two processes
>
>              Total time: 22.428 [sec]
>
>               22.428166 usecs/op
>                   44586 ops/sec
>
> (I got higher run-to-run variance against this kernel)
>
> So things got better. I think we can actually unify all updates to be at
> context switch and tick for all classes.
>
> The one hole is for long running CFS tasks without a context switch; there are
> no updates until the tick, which means the dvfs headroom needs to cater for
> that as the worst case scenario now. I think this is the behavior today
> actually; but accidental updates due to enqueue/dequeue could have helped to
> issue more updates. If none of that happens, then updating load_avg at
> entity_tick() is what would have triggered a frequency update.
>
> I'm not sure if the blocked averages one is required, to be honest. But I
> feared that when the CPU goes idle we might miss reducing the frequency vote
> for that CPU - which matters on shared cpufreq policies.
>
> I haven't done comprehensive testing, of course. But I would love to hear any
> thoughts on how we can be more strategic and reduce the number of cpufreq
> updates we send. The iowait boost state also needs to be verified.
>
> While testing this series more, I did notice that short kworker context
> switches can still cause the frequency to go high. I traced this back to
> __balance_callbacks() in finish_lock_switch() after calling
> uclamp_context_switch(). So I think we do have a problem of updates being
> 'accidental', and we need to improve the interaction with the governor to be
> more intentional, keeping in mind all the constraints we have today in h/w
> and software.
>
> --->8---
>
>
> From 6deed09be1d075afa3858ca62dd54826cdb60d44 Mon Sep 17 00:00:00 2001
> From: Qais Yousef <qyousef@layalina.io>
> Date: Tue, 26 Dec 2023 01:23:57 +0000
> Subject: [PATCH] sched/fair: Update freq on tick and context switch and
>  blocked avgs
>
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  kernel/sched/cpufreq_schedutil.c |  3 ---
>  kernel/sched/fair.c              | 13 +------------
>  kernel/sched/sched.h             | 15 +--------------
>  3 files changed, 2 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index c0879a985097..553a3d7f02d8 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -166,9 +166,6 @@ static inline bool ignore_short_tasks(int cpu,
>         struct task_struct *p = cpu_rq(cpu)->curr;
>         unsigned long task_util;
>
> -       if (!(flags & SCHED_CPUFREQ_PERF_HINTS))
> -               return false;
> -
>         if (!fair_policy(p->policy))
>                 return false;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d63eae534cec..3a30f78b37d3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5717,8 +5717,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>         sub_nr_running(rq, task_delta);
>
>  done:
> -       cpufreq_update_util(rq, 0);
> -
>         /*
>          * Note: distribution will already see us throttled via the
>          * throttled-list.  rq->lock protects completion.
> @@ -5811,8 +5809,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  unthrottle_throttle:
>         assert_list_leaf_cfs_rq(rq);
>
> -       cpufreq_update_util(rq, 0);
> -
>         /* Determine whether we need to wake up potentially idle CPU: */
>         if (rq->curr == rq->idle && rq->cfs.nr_running)
>                 resched_curr(rq);
> @@ -6675,8 +6671,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  enqueue_throttle:
>         assert_list_leaf_cfs_rq(rq);
>
> -       cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
> -
>         hrtick_update(rq);
>  }
>
> @@ -6754,7 +6748,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>  dequeue_throttle:
>         util_est_update(&rq->cfs, p, task_sleep);
> -       cpufreq_update_util(rq, 0);
>         hrtick_update(rq);
>  }
>
> @@ -8338,7 +8331,6 @@ done: __maybe_unused;
>
>         update_misfit_status(p, rq);
>         sched_fair_update_stop_tick(rq, p);
> -       cpufreq_update_util(rq, 0);
>
>         return p;
>
> @@ -12460,7 +12452,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
>         update_misfit_status(curr, rq);
>         update_overutilized_status(task_rq(curr));
> -       cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
> +       cpufreq_update_util(rq, current->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
>
>         task_tick_core(rq, curr);
>  }
> @@ -12585,7 +12577,6 @@ static void detach_task_cfs_rq(struct task_struct *p)
>         struct sched_entity *se = &p->se;
>
>         detach_entity_cfs_rq(se);
> -       cpufreq_update_util(task_rq(p), 0);
>  }
>
>  static void attach_task_cfs_rq(struct task_struct *p)
> @@ -12593,7 +12584,6 @@ static void attach_task_cfs_rq(struct task_struct *p)
>         struct sched_entity *se = &p->se;
>
>         attach_entity_cfs_rq(se);
> -       cpufreq_update_util(task_rq(p), 0);
>  }
>
>  static void switched_from_fair(struct rq *rq, struct task_struct *p)
> @@ -12839,7 +12829,6 @@ static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
>                         update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
>                         update_cfs_group(se);
>                 }
> -               cpufreq_update_util(rq, 0);
>                 rq_unlock_irqrestore(rq, &rf);
>         }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 516187ea2b81..e1622e2b82be 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3076,20 +3076,7 @@ static inline bool uclamp_rq_is_capped(struct rq *rq)
>  /* Request freq update on context switch if necessary */
>  static inline void uclamp_context_switch(struct rq *rq)
>  {
> -       unsigned long uclamp_min;
> -       unsigned long uclamp_max;
> -       unsigned long util;
> -
> -       /* Only RT and FAIR tasks are aware of uclamp */
> -       if (!rt_policy(current->policy) && !fair_policy(current->policy))
> -               return;
> -
> -       uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
> -       uclamp_max = uclamp_eff_value(current, UCLAMP_MAX);
> -       util = rq->cfs.avg.util_avg;
> -
> -       if (uclamp_min > util || uclamp_max < util)
> -               cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
> +       cpufreq_update_util(rq, current->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
>  }
>  #else /* CONFIG_UCLAMP_TASK */
>  static inline unsigned long uclamp_eff_value(struct task_struct *p,
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()
  2024-01-03 13:41         ` Vincent Guittot
@ 2024-01-04 19:40           ` Qais Yousef
  0 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2024-01-04 19:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Dietmar Eggemann, linux-kernel, linux-pm, Lukasz Luba, Wei Wang,
	Rick Yiu, Chung-Kai Mei, Hongyan Xia

On 01/03/24 14:41, Vincent Guittot wrote:
> On Fri, 29 Dec 2023 at 01:25, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 12/12/23 12:40, Qais Yousef wrote:
> > > On 12/12/23 12:06, Vincent Guittot wrote:
> > >
> > > > > @@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > >  enqueue_throttle:
> > > > >         assert_list_leaf_cfs_rq(rq);
> > > > >
> > > >
> > > > Here and in the other places below,  you lose :
> > > >
> > > >  -       } else if (decayed) {
> > > >
> > > > The decayed condition ensures a rate limit (~1ms) in the number of
> > > > calls to cpufreq_update_util.
> > > >
> > > > enqueue/dequeue/tick don't create any sudden change in the PELT
> > > > signals that would require to update cpufreq of the change unlike
> > > > attach/detach
> > >
> > > Okay, thanks for the clue. Let me rethink this again.
> >
> > Thinking more about this. Do we really need to send freq updates at
> > enqueue/attach etc?
> 
> Yes, attach and detach are the 2 events which can make abrupt and
> significant changes in the utilization of the CPU.
> 
> >
> > I did an experiment to remove all the updates except in three places:
> >
> > 1. Context switch (done unconditionally)
> > 2. Tick
> > 3. update_blocked_averages()
> 
> From the PoV of util_avg, attach, detach, tick and
> update_blocked_averages are mandatory events to report to cpufreq to
> correctly follow utilization.

Okay, I'll re-instate the attach/detach updates.
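
i.e. keep the explicit update in the attach/detach paths, along the lines of
this sketch:

	static void attach_task_cfs_rq(struct task_struct *p)
	{
		attach_entity_cfs_rq(&p->se);
		cpufreq_update_util(task_rq(p), 0);
	}

	static void detach_task_cfs_rq(struct task_struct *p)
	{
		detach_entity_cfs_rq(&p->se);
		cpufreq_update_util(task_rq(p), 0);
	}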

Worth noting that calling it unconditionally is not a good idea after all; I
did notice some overhead against the reverse-misfit patches. So I'll make sure
that context switch updates are protected with a static key for governors that
don't register a hook, and that the update is only issued when we think it's
necessary.
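
Something along these lines is what I have in mind (just a sketch to show the
intent; the key name and exact plumbing are illustrative, not final):

	/* Enabled from cpufreq_add_update_util_hook() and disabled again when
	 * the last governor hook goes away, so this path is a NOP for
	 * governors that don't register a hook. */
	DEFINE_STATIC_KEY_FALSE(sched_cpufreq_ctx_switch_update);

	static inline void uclamp_context_switch(struct rq *rq)
	{
		if (!static_branch_unlikely(&sched_cpufreq_ctx_switch_update))
			return;

		cpufreq_update_util(rq, current->in_iowait ?
					SCHED_CPUFREQ_IOWAIT : 0);
	}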


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2024-01-04 19:40 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-08  1:52 [PATCH 0/4] sched: cpufreq: Remove uclamp max-aggregation Qais Yousef
2023-12-08  1:52 ` [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util() Qais Yousef
2023-12-08 10:05   ` Lukasz Luba
2023-12-10 20:51     ` Qais Yousef
2023-12-11  7:56       ` Lukasz Luba
2023-12-12 12:10         ` Qais Yousef
2023-12-14  8:19           ` Lukasz Luba
2023-12-11 18:47   ` Christian Loehle
2023-12-12 12:34     ` Qais Yousef
2023-12-12 13:09       ` Christian Loehle
2023-12-12 13:29         ` Qais Yousef
2023-12-12 10:46   ` Dietmar Eggemann
2023-12-12 12:35     ` Qais Yousef
2023-12-12 18:22       ` Hongyan Xia
2023-12-12 10:47   ` Hongyan Xia
2023-12-12 11:06   ` Vincent Guittot
2023-12-12 12:40     ` Qais Yousef
2023-12-29  0:25       ` Qais Yousef
2024-01-03 13:41         ` Vincent Guittot
2024-01-04 19:40           ` Qais Yousef
2023-12-18  8:51   ` Dietmar Eggemann
2023-12-17 21:44     ` Qais Yousef
2023-12-08  1:52 ` [PATCH 2/4] sched/uclamp: Remove rq max aggregation Qais Yousef
2023-12-11  0:08   ` Qais Yousef
2023-12-08  1:52 ` [PATCH 3/4] sched/schedutil: Ignore update requests for short running tasks Qais Yousef
2023-12-08 10:42   ` Hongyan Xia
2023-12-10 22:22     ` Qais Yousef
2023-12-11 11:15       ` Hongyan Xia
2023-12-12 12:23         ` Qais Yousef
2023-12-08  1:52 ` [PATCH 4/4] sched/documentation: Remove reference to max aggregation Qais Yousef
2023-12-18  8:19 ` [PATCH 0/4] sched: cpufreq: Remove uclamp max-aggregation Dietmar Eggemann
2023-12-17 21:23   ` Qais Yousef

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).