linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v3 0/2] cpuidle: teo: Introduce util-awareness
@ 2022-10-31 12:13 Kajetan Puchalski
  2022-10-31 12:13 ` [RFC PATCH v3 1/2] cpuidle: teo: Optionally skip polling states in teo_find_shallower_state() Kajetan Puchalski
  2022-10-31 12:13 ` [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness Kajetan Puchalski
  0 siblings, 2 replies; 11+ messages in thread
From: Kajetan Puchalski @ 2022-10-31 12:13 UTC (permalink / raw)
  To: rafael
  Cc: daniel.lezcano, lukasz.luba, Dietmar.Eggemann, dsmythies,
	yu.chen.surf, kajetan.puchalski, linux-pm, linux-kernel

Hi,

At the moment, all the available idle governors operate mainly based on their own past correctness
metrics along with timer events without taking into account any scheduling information. Especially
on interactive systems, this results in them frequently selecting a deeper idle state and then
waking up before its target residency is hit, thus leading to increased wakeup latency and lower
performance with no power saving. For 'menu' while web browsing on Android for instance, those types
of wakeups ('too deep') account for over 24% of all wakeups.

At the same time, on some platforms C0 can be power efficient enough to warrant wanting to prefer
it over C1. This is because the power usage of the two states can be so close that sufficient
amounts of too deep C1 sleeps can completely offset the C1 power saving to the point where it
would've been more power efficient to just use C0 instead.

Sleeps that happened in C0 while they could have used C1 ('too shallow') only save
less power than they otherwise could have. Too deep sleeps, on the other hand, harm performance
and nullify the potential power saving from using C1 in the first place. While taking this into
account, it is clear that on balance it is preferable for an idle governor to have more too shallow
sleeps instead of more too deep sleeps on those kinds of platforms.

Currently the best available governor under this metric is TEO which on average results in less than
half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
increased performance in the process.

This proposed optional extension to TEO would specifically tune it for minimising too deep
sleeps and minimising latency to achieve better performance. To this end, before selecting the next
idle state it uses the avg_util signal of a CPU's runqueue in order to determine to what extent the
CPU is being utilized. This util value is then compared to a threshold defined as a percentage of
the cpu's capacity (capacity >> 6 ie. ~1.5% in the current implementation). If the util is above the
threshold, the idle state selected by TEO metrics will be reduced by 1, thus selecting a shallower
state. If the util is below the threshold, the governor defaults to the TEO metrics mechanism to try
to select the deepest available idle state based on the closest timer event and its own correctness.

As of v2 the patch includes a 'fast exit' path for arm-based and similar systems where only 2 idle
states are present. If there's just 2 idle states and the CPU is utilized, we can directly select
the shallowest state and save cycles by skipping the entire metrics mechanism.

As of v3 it also includes an adjustment where on systems with more than 2 idle states the state will
only be reduced if the selected candidate is C1 and C0 is not a polling state. This will effectively
make the patch have no effect on most Intel systems.

Initially I am sending this as a patch for TEO to visualize the proposed mechanism and simplify
the review process. An alternative way of implementing it while not interfering
with existing TEO code would be to fork TEO into a separate but mostly identical for the time being
governor (working name 'idleutil') and then implement util-awareness there, so that the two
approaches can coexist and both be available at runtime instead of relying on a compile-time option.
I am happy to send a patchset doing that if you think it's a cleaner approach than doing it this way.

This approach can outperform all the other currently available governors, at least on mobile device
workloads, which is why I think it is worth keeping as an option.

Additionally, in my view, the reason why it makes more sense to implement this type of mechanism
inside a governor rather than outside using something like QoS or some other way to disable certain
idle states on the fly are the governor's metrics. If we were disabling idle states and reenabling
them without the governor 'knowing' about it, the governor's metrics would end up being updated
based on state selections not caused by the governor itself. This could interfere with the
correctness of said metrics as that's not what they were designed for as far as I understand.
This approach skips metrics updates whenever a state was selected based on the util and not based
on the metrics.

There is no particular attachment or reliance on TEO for this mechanism, I simply chose to base
it on TEO because it performs the best out of all the available options and I didn't think there was
any point in reinventing the wheel on the side of computing governor metrics. If a
better approach comes along at some point, there's no reason why the same idle aware mechanism
couldn't be used with any other metrics algorithm. That would, however, require implemeting it as
a separate governor rather than a TEO add-on.

As for how the extension performs in practice, below I'll add some benchmark results I got while
testing this patchset.

Pixel 6 (Android 12, mainline kernel 5.18):

1. Geekbench 5 (latency-sensitive, heavy load test)

The values below are gmean values across 3 back to back iteration of Geekbench 5.
As GB5 is a heavy benchmark, after more than 3 iterations intense throttling kicks in on mobile devices
resulting in skewed benchmark scores, which makes it difficult to collect reliable results. The actual
values for all of the governors can change between runs as the benchmark might be affected by factors
other than just latency. Nevertheless, on the runs I've seen, util-aware TEO frequently achieved better
scores than all the other governors.

'shallow' is a trivial governor that only ever selects the shallowest available state, included here
for reference and to establish the lower bound of latency possible to achieve through cpuidle.

'gmean too deep %' and 'gmean too shallow %' are percentages of too deep and too shallow sleeps
computed using the new trace event - cpu_idle_miss. The percentage is obtained by counting the two
types of misses over the course of a run and then dividing them by the total number of wakeups.

| metric                                | menu           | teo               | shallow           | teo + util-aware  |
| ------------------------------------- | -------------  | ---------------   | ---------------   | ---------------   |
| gmean score                           | 2716.4 (0.0%)  | 2795 (+2.89%)     | 2780.5 (+2.36%)   | 2830.8 (+4.21%)   |
| gmean too deep %                      | 16.64%         | 9.61%             | 0%                | 4.19%             |
| gmean too shallow %                   | 2.66%          | 5.54%             | 31.47%            | 15.3%             |
| gmean task wakeup latency (gb5)       | 82.05μs (0.0%) | 73.97μs (-9.85%)  | 42.05μs (-48.76%) | 66.91μs (-18.45%) |
| gmean task wakeup latency (asynctask) | 75.66μs (0.0%) | 56.58μs (-25.22%) | 65.78μs (-13.06%) | 55.35μs (-26.84%) |

In case of this benchmark, the difference in latency does seem to translate into better scores.

Additionally, here's a set of runs of Geekbench done after holding the phone in
the fridge for exactly an hour each time in order to minimise the impact of thermal issues.

| metric                                | menu           | teo               | teo + util-aware  |
| ------------------------------------- | -------------  | ---------------   | ---------------   |
| gmean multicore score                 | 2792.1 (0.0%)  | 2845.2 (+1.9%)    | 2857.4 (+2.34%)   |
| gmean single-core score               | 1048.3 (0.0%)  | 1052.6 (+0.41%)   | 1055.3 (+0.67%)   |

2. PCMark Web Browsing (non latency-sensitive, normal usage test)

The table below contains gmean values across 20 back to back iterations of PCMark 2 Web Browsing.

| metric                    | menu           | teo               | shallow          | teo + util-aware  |
| ------------------------- | -------------  | ---------------   | ---------------  | ---------------   |
| gmean score               | 6283.0 (0.0%)  | 6262.9 (-0.32%)   | 6258.4 (-0.39%)  | 6323.7 (+0.65%)   |
| gmean too deep %          | 24.15%         | 10.32%            | 0%               | 3.2%              |
| gmean too shallow %       | 2.81%          | 7.68%             | 27.69%           | 17.189%           |
| gmean power usage [mW]    | 209.1 (0.0%)   | 187.8 (-10.17%)   | 205.5 (-1.71%)   | 205 (-1.96%)      |
| gmean task wakeup latency | 204.6μs (0.0%) | 184.39μs (-9.87%) | 95.55μs (-53.3%) | 95.98μs (-53.09%) |

As this is a web browsing benchmark, the task for which the wakeup latency was recorded was Chrome's
rendering task, ie CrRendererMain. The latency improvement for the actual benchmark task was very
similar.

In this case the large latency improvement does not translate into a notable increase in benchmark score as
this particular benchmark mainly responds to changes in operating frequency. Nevertheless, the small power
saving compared to menu with no decrease in benchmark score indicate that there are no regressions for this
type of workload while using this governor.

Note: The results above were as mentioned obtained on the 5.18 kernel. Results for Geekbench obtained after
backporting CFS patches from the most recent mainline can be found in the pdf linked below [1].
The results and improvements still hold up but the numbers change slightly. Additionally, the pdf contains
plots for all the relevant results obtained with this and other idle governors, alongside power usage numbers.

At the very least this approach seems promising so I wanted to discuss it in RFC form first.
Thank you for taking your time to read this!

--
Kajetan

[1] https://github.com/mrkajetanp/lisa-notebooks/blob/a2361a5b647629bfbfc676b942c8e6498fb9bd03/idle_util_aware.pdf

v2 -> v3:
- add a patch adding an option to skip polling states in teo_find_shallower_state()
- only reduce the state if the candidate state is C1 and C0 is not a polling state
- add a check for polling states in the 2-states fast-exit path
- removed the ifdefs and Kconfig option

v1 -> v2:
- rework the mechanism to reduce selected state by 1 instead of directly selecting C0 (suggested by Doug Smythies)
- add a fast-exit path for systems with 2 idle states to not waste cycles on metrics when utilized
- fix typos in comments
- include a missing header

Kajetan Puchalski (2):
  cpuidle: teo: Optionally skip polling states in teo_find_shallower_state()
  cpuidle: teo: Introduce optional util-awareness

 drivers/cpuidle/governors/teo.c | 92 +++++++++++++++++++++++++++++++--
 1 file changed, 87 insertions(+), 5 deletions(-)

-- 
2.37.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH v3 1/2] cpuidle: teo: Optionally skip polling states in teo_find_shallower_state()
  2022-10-31 12:13 [RFC PATCH v3 0/2] cpuidle: teo: Introduce util-awareness Kajetan Puchalski
@ 2022-10-31 12:13 ` Kajetan Puchalski
  2022-10-31 12:13 ` [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness Kajetan Puchalski
  1 sibling, 0 replies; 11+ messages in thread
From: Kajetan Puchalski @ 2022-10-31 12:13 UTC (permalink / raw)
  To: rafael
  Cc: daniel.lezcano, lukasz.luba, Dietmar.Eggemann, dsmythies,
	yu.chen.surf, kajetan.puchalski, linux-pm, linux-kernel

Add a no_poll flag to teo_find_shallower_state() that will let the
function optionally not consider polling states.
This allows the caller to guard against the function inadvertently
resulting in TEO putting the CPU in a polling state when that
behaviour is undesirable.

Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
---
 drivers/cpuidle/governors/teo.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index d9262db79cae..e2864474a98d 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -258,15 +258,17 @@ static s64 teo_middle_of_bin(int idx, struct cpuidle_driver *drv)
  * @dev: Target CPU.
  * @state_idx: Index of the capping idle state.
  * @duration_ns: Idle duration value to match.
+ * @no_poll: Don't consider polling states.
  */
 static int teo_find_shallower_state(struct cpuidle_driver *drv,
 				    struct cpuidle_device *dev, int state_idx,
-				    s64 duration_ns)
+				    s64 duration_ns, bool no_poll)
 {
 	int i;
 
 	for (i = state_idx - 1; i >= 0; i--) {
-		if (dev->states_usage[i].disable)
+		if (dev->states_usage[i].disable ||
+				(no_poll && drv->states[i].flags & CPUIDLE_FLAG_POLLING))
 			continue;
 
 		state_idx = i;
@@ -469,7 +471,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		 */
 		if (idx > idx0 &&
 		    drv->states[idx].target_residency_ns > delta_tick)
-			idx = teo_find_shallower_state(drv, dev, idx, delta_tick);
+			idx = teo_find_shallower_state(drv, dev, idx, delta_tick, false);
 	}
 
 	return idx;
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness
  2022-10-31 12:13 [RFC PATCH v3 0/2] cpuidle: teo: Introduce util-awareness Kajetan Puchalski
  2022-10-31 12:13 ` [RFC PATCH v3 1/2] cpuidle: teo: Optionally skip polling states in teo_find_shallower_state() Kajetan Puchalski
@ 2022-10-31 12:13 ` Kajetan Puchalski
  2022-11-01  6:24   ` Doug Smythies
  2022-11-25 18:27   ` Rafael J. Wysocki
  1 sibling, 2 replies; 11+ messages in thread
From: Kajetan Puchalski @ 2022-10-31 12:13 UTC (permalink / raw)
  To: rafael
  Cc: daniel.lezcano, lukasz.luba, Dietmar.Eggemann, dsmythies,
	yu.chen.surf, kajetan.puchalski, linux-pm, linux-kernel

Modern interactive systems, such as recent Android phones, tend to have
power efficient shallow idle states. Selecting deeper idle states on a
device while a latency-sensitive workload is running can adversely impact
performance due to increased latency. Additionally, if the CPU wakes up
from a deeper sleep before its target residency as is often the case, it
results in a waste of energy on top of that.

This patch extends the TEO governor with a mechanism adding util-awareness,
effectively providing a way for the governor to reduce the selected idle
state by 1 when the CPU is being utilized over a certain threshold while
still trying to select the deepest possible state using TEO's metrics when
the CPU is not being utilized. This is now possible since the CPU
utilization is exported from the scheduler with the sched_cpu_util function
and already used e.g. in the thermal governor IPA.

Under this implementation, when the CPU is being utilised and the
selected candidate state is C1, it will be reduced to C0 as long as C0
is not a polling state. This effectively should make the patch have no
effect on most Intel systems.

This can provide drastically decreased latency and performance benefits in
certain types of mobile workloads that are sensitive to latency,
such as Geekbench 5.

Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
---
 drivers/cpuidle/governors/teo.c | 84 ++++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index e2864474a98d..0f38b3ecab3c 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -2,8 +2,13 @@
 /*
  * Timer events oriented CPU idle governor
  *
+ * TEO governor:
  * Copyright (C) 2018 - 2021 Intel Corporation
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * Util-awareness mechanism:
+ * Copyright (C) 2022 Arm Ltd.
+ * Author: Kajetan Puchalski <kajetan.puchalski@arm.com>
  */
 
 /**
@@ -99,14 +104,49 @@
  *      select the given idle state instead of the candidate one.
  *
  * 3. By default, select the candidate state.
+ *
+ * Util-awareness mechanism:
+ *
+ * The idea behind the util-awareness extension is that there are two distinct
+ * scenarios for the CPU which should result in two different approaches to idle
+ * state selection - utilized and not utilized.
+ *
+ * In this case, 'utilized' means that the average runqueue util of the CPU is
+ * above a certain threshold.
+ *
+ * When the CPU is utilized while going into idle, more likely than not it will
+ * be woken up to do more work soon and so a shallower idle state should be
+ * selected to minimise latency and maximise performance. When the CPU is not
+ * being utilized, the usual metrics-based approach to selecting the deepest
+ * available idle state should be preferred to take advantage of the power
+ * saving.
+ *
+ * In order to achieve this, the governor uses a utilization threshold.
+ * The threshold is computed per-cpu as a percentage of the CPU's capacity
+ * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
+ * seems to be getting the best results.
+ *
+ * Before selecting the next idle state, the governor compares the current CPU
+ * util to the precomputed util threshold. If it's below, it defaults to the
+ * TEO metrics mechanism. If it's above and the currently selected candidate is
+ * C1, the idle state will be reduced to C0 as long as C0 is not a polling state.
  */
 
 #include <linux/cpuidle.h>
 #include <linux/jiffies.h>
 #include <linux/kernel.h>
+#include <linux/sched.h>
 #include <linux/sched/clock.h>
+#include <linux/sched/topology.h>
 #include <linux/tick.h>
 
+/*
+ * The number of bits to shift the cpu's capacity by in order to determine
+ * the utilized threshold
+ */
+#define UTIL_THRESHOLD_SHIFT 6
+
+
 /*
  * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
  * is used for decreasing metrics on a regular basis.
@@ -137,9 +177,11 @@ struct teo_bin {
  * @time_span_ns: Time between idle state selection and post-wakeup update.
  * @sleep_length_ns: Time till the closest timer event (at the selection time).
  * @state_bins: Idle state data bins for this CPU.
- * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
+ * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
  * @next_recent_idx: Index of the next @recent_idx entry to update.
  * @recent_idx: Indices of bins corresponding to recent "intercepts".
+ * @util_threshold: Threshold above which the CPU is considered utilized
+ * @utilized: Whether the last sleep on the CPU happened while utilized
  */
 struct teo_cpu {
 	s64 time_span_ns;
@@ -148,10 +190,24 @@ struct teo_cpu {
 	unsigned int total;
 	int next_recent_idx;
 	int recent_idx[NR_RECENT];
+	unsigned long util_threshold;
+	bool utilized;
 };
 
 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
 
+/**
+ * teo_get_util - Update the CPU utilized status
+ * @dev: Target CPU
+ * @cpu_data: Governor CPU data for the target CPU
+ */
+static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
+{
+	unsigned long util = sched_cpu_util(dev->cpu);
+
+	cpu_data->utilized = util > cpu_data->util_threshold;
+}
+
 /**
  * teo_update - Update CPU metrics after wakeup.
  * @drv: cpuidle driver containing state data.
@@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	int i;
 
 	if (dev->last_state_idx >= 0) {
-		teo_update(drv, dev);
+		/* don't update metrics if the cpu was utilized during the last sleep */
+		if (!cpu_data->utilized)
+			teo_update(drv, dev);
 		dev->last_state_idx = -1;
 	}
 
@@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 			goto end;
 	}
 
+	teo_get_util(dev, cpu_data);
+	/* the cpu is being utilized and there's only 2 states to choose from */
+	/* no need to consider metrics, choose the shallowest non-polling state and exit */
+	if (drv->state_count < 3 && cpu_data->utilized) {
+		for (i = 0; i < drv->state_count; ++i) {
+			if (dev->states_usage[i].disable ||
+					drv->states[i].flags & CPUIDLE_FLAG_POLLING)
+				continue;
+			break;
+		}
+
+		idx = i;
+		goto end;
+	}
+
 	/*
 	 * Find the deepest idle state whose target residency does not exceed
 	 * the current sleep length and the deepest idle state not deeper than
@@ -454,6 +527,11 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	if (idx > constraint_idx)
 		idx = constraint_idx;
 
+	/* if the CPU is being utilized and C1 is the selected candidate */
+	/* choose a shallower non-polling state to improve latency */
+	if (cpu_data->utilized && idx == 1)
+		idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
+
 end:
 	/*
 	 * Don't stop the tick if the selected state is a polling one or if the
@@ -510,9 +588,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
 			     struct cpuidle_device *dev)
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+	unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
 	int i;
 
 	memset(cpu_data, 0, sizeof(*cpu_data));
+	cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
 
 	for (i = 0; i < NR_RECENT; i++)
 		cpu_data->recent_idx[i] = -1;
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness
  2022-10-31 12:13 ` [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness Kajetan Puchalski
@ 2022-11-01  6:24   ` Doug Smythies
  2022-11-01 23:02     ` Doug Smythies
  2022-11-02 14:01     ` Kajetan Puchalski
  2022-11-25 18:27   ` Rafael J. Wysocki
  1 sibling, 2 replies; 11+ messages in thread
From: Doug Smythies @ 2022-11-01  6:24 UTC (permalink / raw)
  To: Kajetan Puchalski
  Cc: rafael, daniel.lezcano, lukasz.luba, Dietmar.Eggemann,
	yu.chen.surf, linux-pm, linux-kernel, Doug Smythies

Hi Kajetan,

On Mon, Oct 31, 2022 at 5:14 AM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:

... [delete some]...

>  /**
>   * teo_update - Update CPU metrics after wakeup.
>   * @drv: cpuidle driver containing state data.
> @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>         int i;
>
>         if (dev->last_state_idx >= 0) {
> -               teo_update(drv, dev);
> +               /* don't update metrics if the cpu was utilized during the last sleep */
> +               if (!cpu_data->utilized)
> +                       teo_update(drv, dev);
>                 dev->last_state_idx = -1;
>         }

Ignoring the metrics is not the correct thing to do.
Depending on the workflow, it can severely bias the idle states deeper
than they should be because most of the needed information to select
the appropriate shallow state is tossed out.

Example 1:
2 pairs of ping pongs = 4 threads
Parameters chosen such that idle state 2 would be a most used state.
CPU frequency governor: Schedutil.
CPU frequency scaling driver: intel_cpufreq.
HWP: Disabled
Processor: i5-10600K (6 cores 12 cpus).
Kernel: 6.1-rc3
Run length: 1e8 cycles
Idle governor:
teo: 11.73 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
menu: 12.1 uSecs/loop ; idle state 1 ~3.3e6 exits/sec
util-v3: 15.2 uSecs/loop ; idle state 1 ~200 exits/sec
util-v4: 11.63 uSecs/loop ; idle state 1 ~3.5e6 exits/sec

Where util-v4 is the same as this patch (util-v3) with the above code reverted.

Note: less time per loop is better.

Example 2: Same but parameters selected such that idle state 0 would
be a most used idle state.
Run Length: 4e8 cycles
Idle governor:
teo: 3.1 uSecs/loop ; idle state 0 ~1.2e6 exits/sec
menu:  3.1 uSecs/loop ; idle state 0 ~1.3e6 exits/sec
util-v3: 5.1 uSecs/loop ; idle state 0 ~4 exits/sec
util-v4: ? uSecs/loop ; idle state 0 ~1.2e6 exits/sec (partial result)

Note: the util-v4 test is still in progress, but it is late in my time
zone. But I can tell from the idle state usage, which I can observe
once per minute, that the issue is, at least mostly, fixed.

... Doug

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness
  2022-11-01  6:24   ` Doug Smythies
@ 2022-11-01 23:02     ` Doug Smythies
  2022-11-02 14:01     ` Kajetan Puchalski
  1 sibling, 0 replies; 11+ messages in thread
From: Doug Smythies @ 2022-11-01 23:02 UTC (permalink / raw)
  To: Kajetan Puchalski
  Cc: rafael, daniel.lezcano, lukasz.luba, Dietmar.Eggemann,
	yu.chen.surf, linux-pm, linux-kernel, Doug Smythies

On Mon, Oct 31, 2022 at 11:24 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Kajetan,
>
> On Mon, Oct 31, 2022 at 5:14 AM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
>
> ... [delete some]...
>
> >  /**
> >   * teo_update - Update CPU metrics after wakeup.
> >   * @drv: cpuidle driver containing state data.
> > @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> >         int i;
> >
> >         if (dev->last_state_idx >= 0) {
> > -               teo_update(drv, dev);
> > +               /* don't update metrics if the cpu was utilized during the last sleep */
> > +               if (!cpu_data->utilized)
> > +                       teo_update(drv, dev);
> >                 dev->last_state_idx = -1;
> >         }
>
> Ignoring the metrics is not the correct thing to do.
> Depending on the workflow, it can severely bias the idle states deeper
> than they should be because most of the needed information to select
> the appropriate shallow state is tossed out.
>
> Example 1:
> 2 pairs of ping pongs = 4 threads
> Parameters chosen such that idle state 2 would be a most used state.

Sorry, typo, I meant idle state 1 would be most used.

> CPU frequency governor: Schedutil.
> CPU frequency scaling driver: intel_cpufreq.
> HWP: Disabled
> Processor: i5-10600K (6 cores 12 cpus).
> Kernel: 6.1-rc3
> Run length: 1e8 cycles
> Idle governor:
> teo: 11.73 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
> menu: 12.1 uSecs/loop ; idle state 1 ~3.3e6 exits/sec
> util-v3: 15.2 uSecs/loop ; idle state 1 ~200 exits/sec
> util-v4: 11.63 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
>
> Where util-v4 is the same as this patch (util-v3) with the above code reverted.
>
> Note: less time per loop is better.
>
> Example 2: Same but parameters selected such that idle state 0 would
> be a most used idle state.
> Run Length: 4e8 cycles
> Idle governor:
> teo: 3.1 uSecs/loop ; idle state 0 ~1.2e6 exits/sec
> menu:  3.1 uSecs/loop ; idle state 0 ~1.3e6 exits/sec
> util-v3: 5.1 uSecs/loop ; idle state 0 ~4 exits/sec
> util-v4: ? uSecs/loop ; idle state 0 ~1.2e6 exits/sec (partial result)

util-v4: 3.15 uSecs/loop ; idle state 0 ~1.2e6 exits/sec

For the above tests we do not expect teo-util to have much impact.

For completeness:

Test 3: Same but parameters selected such that idle states 2 and 3
would be used the most.
Run Length: 3.42e5 cycles
CPU frequency scaling governor: schedutil.
CPU frequency scaling driver: intel_cpufreq.
Idle governor:
teo: 4005 uSecs/loop ; IS 2: 1917 IS 3: 107.4 exits/sec
menu: 3113 uSecs/loop ;  IS 2: 1020 IS 3: 1576 exits/sec
util-v3: 3457 uSecs/loop ;  IS 2: 1139 IS 3: 1000 exits/sec
util-v4: 3526 uSecs/loop ;  IS 2: 2029 IS 3: 109 exits/sec

Now, things are very noisy with the schedutil governor, so...

Test 4: Same as test 3 except for frequency scaling governor.
Run Length: 3.42e5 cycles
CPU frequency scaling governor: performance.
CPU frequency scaling driver: intel_pstate.
Idle governor:
teo: 2688 uSecs/loop ; IS 2: 2489 IS 3: 16 exits/sec
menu: 2596 uSecs/loop ;  IS 2: 865 IS 3: 2005 exits/sec
util-v3: 2766 uSecs/loop ;  IS 2: 1049 IS 3: 1394 exits/sec
util-v4: 2756 uSecs/loop ;  IS 2: 2440 IS 3: 24 exits/sec

... Doug

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness
  2022-11-01  6:24   ` Doug Smythies
  2022-11-01 23:02     ` Doug Smythies
@ 2022-11-02 14:01     ` Kajetan Puchalski
  1 sibling, 0 replies; 11+ messages in thread
From: Kajetan Puchalski @ 2022-11-02 14:01 UTC (permalink / raw)
  To: Doug Smythies
  Cc: rafael, daniel.lezcano, lukasz.luba, Dietmar.Eggemann,
	yu.chen.surf, linux-pm, linux-kernel

On Mon, Oct 31, 2022 at 11:24:06PM -0700, Doug Smythies wrote:

> Ignoring the metrics is not the correct thing to do.
> Depending on the workflow, it can severely bias the idle states deeper
> than they should be because most of the needed information to select
> the appropriate shallow state is tossed out.

Interesting, thanks for flagging this actually. Arguably it made sense
originally but with v3 changes it probably causes more harm than good.

From testing it on my own side it still looks all right without the
skipping metrics part so I'm fine with just dropping it from the patch.
V4 it is.

I'll include my updated numbers and more tests in the new cover letter.

> Example 1:
> 2 pairs of ping pongs = 4 threads
> Parameters chosen such that idle state 2 would be a most used state.
> CPU frequency governor: Schedutil.
> CPU frequency scaling driver: intel_cpufreq.
> HWP: Disabled
> Processor: i5-10600K (6 cores 12 cpus).
> Kernel: 6.1-rc3
> Run length: 1e8 cycles
> Idle governor:
> teo: 11.73 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
> menu: 12.1 uSecs/loop ; idle state 1 ~3.3e6 exits/sec
> util-v3: 15.2 uSecs/loop ; idle state 1 ~200 exits/sec
> util-v4: 11.63 uSecs/loop ; idle state 1 ~3.5e6 exits/sec
> 
> Where util-v4 is the same as this patch (util-v3) with the above code reverted.
> 
> Note: less time per loop is better.
> 
> Example 2: Same but parameters selected such that idle state 0 would
> be a most used idle state.
> Run Length: 4e8 cycles
> Idle governor:
> teo: 3.1 uSecs/loop ; idle state 0 ~1.2e6 exits/sec
> menu:  3.1 uSecs/loop ; idle state 0 ~1.3e6 exits/sec
> util-v3: 5.1 uSecs/loop ; idle state 0 ~4 exits/sec
> util-v4: ? uSecs/loop ; idle state 0 ~1.2e6 exits/sec (partial result)

Thanks a lot for the testing!

---
Kajetan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness
  2022-10-31 12:13 ` [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness Kajetan Puchalski
  2022-11-01  6:24   ` Doug Smythies
@ 2022-11-25 18:27   ` Rafael J. Wysocki
  2022-11-25 20:00     ` Doug Smythies
                       ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Rafael J. Wysocki @ 2022-11-25 18:27 UTC (permalink / raw)
  To: Kajetan Puchalski
  Cc: rafael, daniel.lezcano, lukasz.luba, Dietmar.Eggemann, dsmythies,
	yu.chen.surf, linux-pm, linux-kernel

On Mon, Oct 31, 2022 at 1:14 PM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> Modern interactive systems, such as recent Android phones, tend to have
> power efficient shallow idle states. Selecting deeper idle states on a
> device while a latency-sensitive workload is running can adversely impact
> performance due to increased latency. Additionally, if the CPU wakes up
> from a deeper sleep before its target residency as is often the case, it
> results in a waste of energy on top of that.
>
> This patch extends the TEO governor with a mechanism adding util-awareness,
> effectively providing a way for the governor to reduce the selected idle
> state by 1 when the CPU is being utilized over a certain threshold while
> still trying to select the deepest possible state using TEO's metrics when
> the CPU is not being utilized. This is now possible since the CPU
> utilization is exported from the scheduler with the sched_cpu_util function
> and already used e.g. in the thermal governor IPA.
>
> Under this implementation, when the CPU is being utilised and the
> selected candidate state is C1, it will be reduced to C0 as long as C0
> is not a polling state. This effectively should make the patch have no
> effect on most Intel systems.
>
> This can provide drastically decreased latency and performance benefits in
> certain types of mobile workloads that are sensitive to latency,
> such as Geekbench 5.
>
> Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
> ---
>  drivers/cpuidle/governors/teo.c | 84 ++++++++++++++++++++++++++++++++-
>  1 file changed, 82 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index e2864474a98d..0f38b3ecab3c 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -2,8 +2,13 @@
>  /*
>   * Timer events oriented CPU idle governor
>   *
> + * TEO governor:
>   * Copyright (C) 2018 - 2021 Intel Corporation
>   * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> + *
> + * Util-awareness mechanism:
> + * Copyright (C) 2022 Arm Ltd.
> + * Author: Kajetan Puchalski <kajetan.puchalski@arm.com>
>   */
>
>  /**
> @@ -99,14 +104,49 @@
>   *      select the given idle state instead of the candidate one.
>   *
>   * 3. By default, select the candidate state.
> + *
> + * Util-awareness mechanism:
> + *
> + * The idea behind the util-awareness extension is that there are two distinct
> + * scenarios for the CPU which should result in two different approaches to idle
> + * state selection - utilized and not utilized.
> + *
> + * In this case, 'utilized' means that the average runqueue util of the CPU is
> + * above a certain threshold.
> + *
> + * When the CPU is utilized while going into idle, more likely than not it will
> + * be woken up to do more work soon and so a shallower idle state should be
> + * selected to minimise latency and maximise performance. When the CPU is not
> + * being utilized, the usual metrics-based approach to selecting the deepest
> + * available idle state should be preferred to take advantage of the power
> + * saving.
> + *
> + * In order to achieve this, the governor uses a utilization threshold.
> + * The threshold is computed per-cpu as a percentage of the CPU's capacity
> + * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
> + * seems to be getting the best results.
> + *
> + * Before selecting the next idle state, the governor compares the current CPU
> + * util to the precomputed util threshold. If it's below, it defaults to the
> + * TEO metrics mechanism. If it's above and the currently selected candidate is
> + * C1, the idle state will be reduced to C0 as long as C0 is not a polling state.
>   */
>
>  #include <linux/cpuidle.h>
>  #include <linux/jiffies.h>
>  #include <linux/kernel.h>
> +#include <linux/sched.h>
>  #include <linux/sched/clock.h>
> +#include <linux/sched/topology.h>
>  #include <linux/tick.h>
>
> +/*
> + * The number of bits to shift the cpu's capacity by in order to determine
> + * the utilized threshold
> + */
> +#define UTIL_THRESHOLD_SHIFT 6

Why is this particular number regarded as the best one?

> +
> +
>  /*
>   * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
>   * is used for decreasing metrics on a regular basis.
> @@ -137,9 +177,11 @@ struct teo_bin {
>   * @time_span_ns: Time between idle state selection and post-wakeup update.
>   * @sleep_length_ns: Time till the closest timer event (at the selection time).
>   * @state_bins: Idle state data bins for this CPU.
> - * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
> + * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
>   * @next_recent_idx: Index of the next @recent_idx entry to update.
>   * @recent_idx: Indices of bins corresponding to recent "intercepts".
> + * @util_threshold: Threshold above which the CPU is considered utilized
> + * @utilized: Whether the last sleep on the CPU happened while utilized
>   */
>  struct teo_cpu {
>         s64 time_span_ns;
> @@ -148,10 +190,24 @@ struct teo_cpu {
>         unsigned int total;
>         int next_recent_idx;
>         int recent_idx[NR_RECENT];
> +       unsigned long util_threshold;
> +       bool utilized;
>  };
>
>  static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
>
> +/**
> + * teo_get_util - Update the CPU utilized status
> + * @dev: Target CPU
> + * @cpu_data: Governor CPU data for the target CPU
> + */
> +static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
> +{
> +       unsigned long util = sched_cpu_util(dev->cpu);
> +
> +       cpu_data->utilized = util > cpu_data->util_threshold;

Why exactly do you need the local variable here?

Then, if there's only one caller, maybe this could be folded into it?

> +}
> +
>  /**
>   * teo_update - Update CPU metrics after wakeup.
>   * @drv: cpuidle driver containing state data.
> @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>         int i;
>
>         if (dev->last_state_idx >= 0) {
> -               teo_update(drv, dev);
> +               /* don't update metrics if the cpu was utilized during the last sleep */

Why?

The metrics are related to idle duration and cpu_data->utilized
indicates whether or not latency should be reduced.  These are
different things.

Moreover, this is just one data point and there may not be any direct
connection between it and the decision made in this particular cycle.

> +               if (!cpu_data->utilized)
> +                       teo_update(drv, dev);
>                 dev->last_state_idx = -1;
>         }
>
> @@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>                         goto end;
>         }
>
> +       teo_get_util(dev, cpu_data);
> +       /* the cpu is being utilized and there's only 2 states to choose from */
> +       /* no need to consider metrics, choose the shallowest non-polling state and exit */

A proper kernel-coding-style 2-line comment, please!

Also I would say "utilized beyond the threshold" instead of just
"utilized" and "there are only 2 states" (plural).

> +       if (drv->state_count < 3 && cpu_data->utilized) {
> +               for (i = 0; i < drv->state_count; ++i) {
> +                       if (dev->states_usage[i].disable ||
> +                                       drv->states[i].flags & CPUIDLE_FLAG_POLLING)
> +                               continue;
> +                       break;

This looks odd.  It would be more straightforward to do

if (!dev->states_usage[i].disable && !(drv->states[i].flags &
CPUIDLE_FLAG_POLLING)) {
        idx = i;
        goto end;
}

without the "break" and "continue".

> +               }
> +
> +               idx = i;
> +               goto end;
> +       }
> +
>         /*
>          * Find the deepest idle state whose target residency does not exceed
>          * the current sleep length and the deepest idle state not deeper than
> @@ -454,6 +527,11 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>         if (idx > constraint_idx)
>                 idx = constraint_idx;
>
> +       /* if the CPU is being utilized and C1 is the selected candidate */
> +       /* choose a shallower non-polling state to improve latency */

Again, the kernel coding style for multi-line comments is different
from the above.

> +       if (cpu_data->utilized && idx == 1)

I've changed my mind with respect to adding the idx == 1 check to
this.  If the goal is to reduce latency for the "loaded" CPUs, this
applies to deeper idle states too.

> +               idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
> +
>  end:
>         /*
>          * Don't stop the tick if the selected state is a polling one or if the
> @@ -510,9 +588,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
>                              struct cpuidle_device *dev)
>  {
>         struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> +       unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
>         int i;
>
>         memset(cpu_data, 0, sizeof(*cpu_data));
> +       cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
>
>         for (i = 0; i < NR_RECENT; i++)
>                 cpu_data->recent_idx[i] = -1;
> --

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness
  2022-11-25 18:27   ` Rafael J. Wysocki
@ 2022-11-25 20:00     ` Doug Smythies
  2022-11-27  4:29     ` Doug Smythies
  2022-11-28 14:32     ` Kajetan Puchalski
  2 siblings, 0 replies; 11+ messages in thread
From: Doug Smythies @ 2022-11-25 20:00 UTC (permalink / raw)
  To: 'Rafael J. Wysocki', 'Kajetan Puchalski'
  Cc: daniel.lezcano, lukasz.luba, Dietmar.Eggemann, yu.chen.surf,
	linux-pm, linux-kernel, Doug Smythies

On 2022.11.25 10:27 Rafael wrote:
> On Mon, Oct 31, 2022 at 1:14 PM Kajetan wrote:

... [delete some] ...

>> +}
>> +
>> /**
>>   * teo_update - Update CPU metrics after wakeup.
>>   * @drv: cpuidle driver containing state data.
>> @@ -303,7 +359,9 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>         int i;
>>
>>         if (dev->last_state_idx >= 0) {
>> -               teo_update(drv, dev);
>> +               /* don't update metrics if the cpu was utilized during the last sleep */
>
> Why?
>
> The metrics are related to idle duration and cpu_data->utilized
> indicates whether or not latency should be reduced.  These are
> different things.
>
> Moreover, this is just one data point and there may not be any direct
> connection between it and the decision made in this particular cycle.

Hi Rafael,

Yes, agreed. And version 4 of this patch set deleted this stuff.
See also my similar feedback, with accompanying test data
that verified the issue in [1].

[1] https://lore.kernel.org/linux-pm/CAAYoRsUgm6KyJCDowGKFVuMwJepnVN8NFEenjd3O-FN7+BETSw@mail.gmail.com/

... Doug



^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness
  2022-11-25 18:27   ` Rafael J. Wysocki
  2022-11-25 20:00     ` Doug Smythies
@ 2022-11-27  4:29     ` Doug Smythies
  2022-11-28 14:32     ` Kajetan Puchalski
  2 siblings, 0 replies; 11+ messages in thread
From: Doug Smythies @ 2022-11-27  4:29 UTC (permalink / raw)
  To: 'Rafael J. Wysocki', 'Kajetan Puchalski'
  Cc: daniel.lezcano, lukasz.luba, Dietmar.Eggemann, yu.chen.surf,
	linux-pm, linux-kernel, Doug Smythies

On 2022.11.25 10:27 Rafael wrote:
> On Mon, Oct 31, 2022 at 1:14 PM Kajetan wrote:

... [delete some] ...

>>         /*
>>          * Find the deepest idle state whose target residency does not exceed
>>          * the current sleep length and the deepest idle state not deeper than
>> @@ -454,6 +527,11 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>         if (idx > constraint_idx)
>>                 idx = constraint_idx;
>>
>> +       /* if the CPU is being utilized and C1 is the selected candidate */
>> +       /* choose a shallower non-polling state to improve latency */
>
> Again, the kernel coding style for multi-line comments is different
> from the above.
>
>> +       if (cpu_data->utilized && idx == 1)
>
> I've changed my mind with respect to adding the idx == 1 check to
> this.  If the goal is to reduce latency for the "loaded" CPUs, this
> applies to deeper idle states too.

After taking idle state 0 (POLL) out of it, the energy cost for reducing
the selected idle state by 1 was still high in some cases, at least on my
Intel processor. That was mainly for idle state 2 being bumped to idle
state 1. I don't recall significant differences bumping idle state 3 to idle
state 2, but I don't know about other Intel processors.

So, there is a trade-off here where we might want to accept this higher
energy consumption for no gain in some workflows verses the higher
energy for gain in other workflows, or not.

Example 1: Higher energy, for no benefit:
Workflow: a medium load at 211 work/sleep frequency.
This data is for one thread, but I looked at up to 6 threads.
No performance metric, the work just has to finish before
the next cycle begins.
CPU frequency scaling driver: intel_pstate
CPU frequency scaling governor: powersave
No HWP.
Kernel 6.1-rc3

teo: ~14.8 watts
util-v4 without the "idx == 1" above: 16.1 watts (+8.8%)
More info:
http://smythies.com/~doug/linux/idle/teo-util/consume/dwell-v4/

Example 2: Lower energy, but no loss in performance:
Workflow: 500 threads, light load per thread,
approximately 10 hertz work/sleep frequency per thread.
CPU frequency scaling driver: intel_cpufreq
CPU frequency scaling governor: schedutil
No HWP.
Kernel 6.1-rc3

teo: ~70 watts
util-v4 without the "idx == 1" above: ~59 watts (-16%)
Execution times were the same
More info:
http://smythies.com/~doug/linux/idle/teo-util/waiter/

Note: legend util-v4-1 is util-v4 without the "idx == 1".
I have also added util-v4-1 to some of the previous results.

For reference, my testing processor:

Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3_ACPI

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/desc
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH MWAIT 0x0
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH MWAIT 0x30
/sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI FFH MWAIT 0x60

... Doug


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness
  2022-11-25 18:27   ` Rafael J. Wysocki
  2022-11-25 20:00     ` Doug Smythies
  2022-11-27  4:29     ` Doug Smythies
@ 2022-11-28 14:32     ` Kajetan Puchalski
  2022-11-28 14:54       ` Rafael J. Wysocki
  2 siblings, 1 reply; 11+ messages in thread
From: Kajetan Puchalski @ 2022-11-28 14:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: daniel.lezcano, lukasz.luba, Dietmar.Eggemann, dsmythies,
	yu.chen.surf, kajetan.puchalski, linux-pm, linux-kernel

On Fri, Nov 25, 2022 at 07:27:13PM +0100, Rafael J. Wysocki wrote:
> > +/*
> > + * The number of bits to shift the cpu's capacity by in order to determine
> > + * the utilized threshold
> > + */
> > +#define UTIL_THRESHOLD_SHIFT 6
>
> Why is this particular number regarded as the best one?

Based on my testing this number achieved the best balance of power and
performance on average. It also makes sense from looking at the util
plots. The resulting threshold is high enough to not be triggered by
background noise and low enough to react quickly when activity starts
ramping up.

> > +static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
> > +{
> > +       unsigned long util = sched_cpu_util(dev->cpu);
> > +
> > +       cpu_data->utilized = util > cpu_data->util_threshold;
>
> Why exactly do you need the local variable here?

It's not necessarily needed, I can replace it with comparing the result
of the call directly.

> Then, if there's only one caller, maybe this could be folded into it?

I do think it's nicer to have it separated in its own helper function, that
way if anything more has to be done with the util it'll all be
self-contained. Having only one caller shouldn't be a big issue, it's
also the case for teo_middle_of_bin and teo_find_shallower_state in the
current TEO implementation.

> > +               /* don't update metrics if the cpu was utilized during the last sleep */
> 
> Why?
> 
> The metrics are related to idle duration and cpu_data->utilized
> indicates whether or not latency should be reduced.  These are
> different things.
> 
> Moreover, this is just one data point and there may not be any direct
> connection between it and the decision made in this particular cycle.

Agreed, v4 already has this part removed.

> > +               if (!cpu_data->utilized)
> > +                       teo_update(drv, dev);
> >                 dev->last_state_idx = -1;
> >         }
> >
> > @@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> >                         goto end;
> >         }
> >
> > +       teo_get_util(dev, cpu_data);
> > +       /* the cpu is being utilized and there's only 2 states to choose from */
> > +       /* no need to consider metrics, choose the shallowest non-polling state and exit */
> 
> A proper kernel-coding-style 2-line comment, please!
> 
> Also I would say "utilized beyond the threshold" instead of just
> "utilized" and "there are only 2 states" (plural).

Both good points, I'll fix that.

> > +       if (drv->state_count < 3 && cpu_data->utilized) {
> > +               for (i = 0; i < drv->state_count; ++i) {
> > +                       if (dev->states_usage[i].disable ||
> > +                                       drv->states[i].flags & CPUIDLE_FLAG_POLLING)
> > +                               continue;
> > +                       break;
> 
> This looks odd.  It would be more straightforward to do
> 
> if (!dev->states_usage[i].disable && !(drv->states[i].flags &
> CPUIDLE_FLAG_POLLING)) {
>         idx = i;
>         goto end;
> }
> 
> without the "break" and "continue".

Fair enough, this works as well.

> I've changed my mind with respect to adding the idx == 1 check to
> this.  If the goal is to reduce latency for the "loaded" CPUs, this
> applies to deeper idle states too.

I see, this has no effect on arm devices one way or the other so I don't
mind, it's completely up to you. In light of Doug's test results
regarding this change, should I remove the check in v5?

Thanks,
Kajetan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness
  2022-11-28 14:32     ` Kajetan Puchalski
@ 2022-11-28 14:54       ` Rafael J. Wysocki
  0 siblings, 0 replies; 11+ messages in thread
From: Rafael J. Wysocki @ 2022-11-28 14:54 UTC (permalink / raw)
  To: Kajetan Puchalski
  Cc: Rafael J. Wysocki, daniel.lezcano, lukasz.luba, Dietmar.Eggemann,
	dsmythies, yu.chen.surf, linux-pm, linux-kernel

On Mon, Nov 28, 2022 at 3:33 PM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> On Fri, Nov 25, 2022 at 07:27:13PM +0100, Rafael J. Wysocki wrote:
> > > +/*
> > > + * The number of bits to shift the cpu's capacity by in order to determine
> > > + * the utilized threshold
> > > + */
> > > +#define UTIL_THRESHOLD_SHIFT 6
> >
> > Why is this particular number regarded as the best one?
>
> Based on my testing this number achieved the best balance of power and
> performance on average. It also makes sense from looking at the util
> plots. The resulting threshold is high enough to not be triggered by
> background noise and low enough to react quickly when activity starts
> ramping up.

It would be good to put some of this information (or maybe even all of
it) into the comment above the symbol definition.

> > > +static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
> > > +{
> > > +       unsigned long util = sched_cpu_util(dev->cpu);
> > > +
> > > +       cpu_data->utilized = util > cpu_data->util_threshold;
> >
> > Why exactly do you need the local variable here?
>
> It's not necessarily needed, I can replace it with comparing the result
> of the call directly.
>
> > Then, if there's only one caller, maybe this could be folded into it?
>
> I do think it's nicer to have it separated in its own helper function, that
> way if anything more has to be done with the util it'll all be
> self-contained. Having only one caller shouldn't be a big issue, it's
> also the case for teo_middle_of_bin and teo_find_shallower_state in the
> current TEO implementation.

OK

> > > +               /* don't update metrics if the cpu was utilized during the last sleep */
> >
> > Why?
> >
> > The metrics are related to idle duration and cpu_data->utilized
> > indicates whether or not latency should be reduced.  These are
> > different things.
> >
> > Moreover, this is just one data point and there may not be any direct
> > connection between it and the decision made in this particular cycle.
>
> Agreed, v4 already has this part removed.
>
> > > +               if (!cpu_data->utilized)
> > > +                       teo_update(drv, dev);
> > >                 dev->last_state_idx = -1;
> > >         }
> > >
> > > @@ -323,6 +381,21 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > >                         goto end;
> > >         }
> > >
> > > +       teo_get_util(dev, cpu_data);
> > > +       /* the cpu is being utilized and there's only 2 states to choose from */
> > > +       /* no need to consider metrics, choose the shallowest non-polling state and exit */
> >
> > A proper kernel-coding-style 2-line comment, please!
> >
> > Also I would say "utilized beyond the threshold" instead of just
> > "utilized" and "there are only 2 states" (plural).
>
> Both good points, I'll fix that.
>
> > > +       if (drv->state_count < 3 && cpu_data->utilized) {
> > > +               for (i = 0; i < drv->state_count; ++i) {
> > > +                       if (dev->states_usage[i].disable ||
> > > +                                       drv->states[i].flags & CPUIDLE_FLAG_POLLING)
> > > +                               continue;
> > > +                       break;
> >
> > This looks odd.  It would be more straightforward to do
> >
> > if (!dev->states_usage[i].disable && !(drv->states[i].flags &
> > CPUIDLE_FLAG_POLLING)) {
> >         idx = i;
> >         goto end;
> > }
> >
> > without the "break" and "continue".
>
> Fair enough, this works as well.
>
> > I've changed my mind with respect to adding the idx == 1 check to
> > this.  If the goal is to reduce latency for the "loaded" CPUs, this
> > applies to deeper idle states too.
>
> I see, this has no effect on arm devices one way or the other so I don't
> mind, it's completely up to you. In light of Doug's test results
> regarding this change, should I remove the check in v5?

Yes, please.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-11-28 14:54 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-31 12:13 [RFC PATCH v3 0/2] cpuidle: teo: Introduce util-awareness Kajetan Puchalski
2022-10-31 12:13 ` [RFC PATCH v3 1/2] cpuidle: teo: Optionally skip polling states in teo_find_shallower_state() Kajetan Puchalski
2022-10-31 12:13 ` [RFC PATCH v3 2/2] cpuidle: teo: Introduce util-awareness Kajetan Puchalski
2022-11-01  6:24   ` Doug Smythies
2022-11-01 23:02     ` Doug Smythies
2022-11-02 14:01     ` Kajetan Puchalski
2022-11-25 18:27   ` Rafael J. Wysocki
2022-11-25 20:00     ` Doug Smythies
2022-11-27  4:29     ` Doug Smythies
2022-11-28 14:32     ` Kajetan Puchalski
2022-11-28 14:54       ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).