linux-kernel.vger.kernel.org archive mirror
* [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
@ 2021-06-02 18:14 Rafael J. Wysocki
  2021-06-02 18:15 ` [PATCH v1 1/5] cpuidle: teo: Cosmetic modifications of teo_update() Rafael J. Wysocki
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-06-02 18:14 UTC (permalink / raw)
  To: Linux PM; +Cc: LKML

Hi All,

This series of patches addresses some theoretical shortcomings in the
TEO (Timer Events Oriented) cpuidle governor by reworking its idle
state selection logic to some extent.

Patches [1-2/5] are introductory cleanups and the substantial changes are
made in patches [3-4/5] (please refer to the changelogs of these two
patches for details).  The last patch only deals with documentation.

Even though this work is mostly based on theoretical considerations, it
shows a measurable reduction of the number of cases in which the shallowest
idle state is selected while it would be more beneficial to select a deeper
one, or in which the deepest idle state is selected while it would be more
beneficial to select a shallower one, which should be a noticeable
improvement.

Thanks!




^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v1 1/5] cpuidle: teo: Cosmetic modifications of teo_update()
  2021-06-02 18:14 [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Rafael J. Wysocki
@ 2021-06-02 18:15 ` Rafael J. Wysocki
  2021-06-02 18:15 ` [PATCH v1 2/5] cpuidle: teo: Cosmetic modification of teo_select() Rafael J. Wysocki
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-06-02 18:15 UTC (permalink / raw)
  To: Linux PM; +Cc: LKML

From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>

Rename a local variable in teo_update() so that its name better
reflects its purpose, and use one more local variable in the loop
over the CPU idle states in that function to make the code somewhat
easier to read.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/teo.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/teo.c
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -117,7 +117,7 @@ static DEFINE_PER_CPU(struct teo_cpu, te
 static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
-	int i, idx_hit = 0, idx_timer = 0;
+	int i, idx_timer = 0, idx_duration = 0;
 	unsigned int hits, misses;
 	u64 measured_ns;
 
@@ -156,14 +156,15 @@ static void teo_update(struct cpuidle_dr
 	 * states matching the sleep length and the measured idle duration.
 	 */
 	for (i = 0; i < drv->state_count; i++) {
+		s64 target_residency_ns = drv->states[i].target_residency_ns;
 		unsigned int early_hits = cpu_data->states[i].early_hits;
 
 		cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;
 
-		if (drv->states[i].target_residency_ns <= cpu_data->sleep_length_ns) {
+		if (target_residency_ns <= cpu_data->sleep_length_ns) {
 			idx_timer = i;
-			if (drv->states[i].target_residency_ns <= measured_ns)
-				idx_hit = i;
+			if (target_residency_ns <= measured_ns)
+				idx_duration = i;
 		}
 	}
 
@@ -181,11 +182,11 @@ static void teo_update(struct cpuidle_dr
 	misses = cpu_data->states[idx_timer].misses;
 	misses -= misses >> DECAY_SHIFT;
 
-	if (idx_timer == idx_hit) {
+	if (idx_timer == idx_duration) {
 		hits += PULSE;
 	} else {
 		misses += PULSE;
-		cpu_data->states[idx_hit].early_hits += PULSE;
+		cpu_data->states[idx_duration].early_hits += PULSE;
 	}
 
 	cpu_data->states[idx_timer].misses = misses;





* [PATCH v1 2/5] cpuidle: teo: Cosmetic modification of teo_select()
  2021-06-02 18:14 [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Rafael J. Wysocki
  2021-06-02 18:15 ` [PATCH v1 1/5] cpuidle: teo: Cosmetic modifications of teo_update() Rafael J. Wysocki
@ 2021-06-02 18:15 ` Rafael J. Wysocki
  2021-06-02 18:16 ` [PATCH v1 3/5] cpuidle: teo: Change the main idle state selection logic Rafael J. Wysocki
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-06-02 18:15 UTC (permalink / raw)
  To: Linux PM; +Cc: LKML

From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>

Initialize local variables in teo_select() where they are declared.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/teo.c |   18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/teo.c
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -241,10 +241,15 @@ static int teo_select(struct cpuidle_dri
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
 	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
-	int max_early_idx, prev_max_early_idx, constraint_idx, idx0, idx, i;
-	unsigned int hits, misses, early_hits;
+	int constraint_idx = drv->state_count;
+	unsigned int hits = 0, misses = 0;
+	unsigned int early_hits = 0;
+	int prev_max_early_idx = -1;
+	int max_early_idx = -1;
+	int idx0 = -1, idx = -1;
 	ktime_t delta_tick;
 	s64 duration_ns;
+	int i;
 
 	if (dev->last_state_idx >= 0) {
 		teo_update(drv, dev);
@@ -256,15 +261,6 @@ static int teo_select(struct cpuidle_dri
 	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
 	cpu_data->sleep_length_ns = duration_ns;
 
-	hits = 0;
-	misses = 0;
-	early_hits = 0;
-	max_early_idx = -1;
-	prev_max_early_idx = -1;
-	constraint_idx = drv->state_count;
-	idx = -1;
-	idx0 = idx;
-
 	for (i = 0; i < drv->state_count; i++) {
 		struct cpuidle_state *s = &drv->states[i];
 





* [PATCH v1 3/5] cpuidle: teo: Change the main idle state selection logic
  2021-06-02 18:14 [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Rafael J. Wysocki
  2021-06-02 18:15 ` [PATCH v1 1/5] cpuidle: teo: Cosmetic modifications of teo_update() Rafael J. Wysocki
  2021-06-02 18:15 ` [PATCH v1 2/5] cpuidle: teo: Cosmetic modification of teo_select() Rafael J. Wysocki
@ 2021-06-02 18:16 ` Rafael J. Wysocki
  2021-06-02 18:17 ` [PATCH v1 4/5] cpuidle: teo: Rework most recent idle duration values treatment Rafael J. Wysocki
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-06-02 18:16 UTC (permalink / raw)
  To: Linux PM; +Cc: LKML

From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>

Two aspects of the current main idle state selection logic in the
TEO (Timer Events Oriented) cpuidle governor are quite questionable.

First of all, the "hits" and "misses" metrics used by it are only
updated for a given idle state if the time till the next timer event
("sleep length") is between the target residency of that state and
the target residency of the next one.  Consequently, they are likely
to become stale if the sleep length tends to fall outside that
interval, which increases the likelihood of suboptimal idle state
selection.

Second, the decision on whether or not to select the idle state
"matching" the sleep length is based on the metrics collected for
that state alone, whereas in principle the metrics collected for
the other idle states should be taken into consideration when that
decision is made.  For example, if the measured idle duration is less
than the target residency of the idle state "matching" the sleep
length, then it is also less than the target residency of any deeper
idle state and that should be taken into account when considering
whether or not to select any of those states, but currently it is
not.

In order to address the above shortcomings, modify the main idle
state selection logic in the TEO governor to take the metrics
collected for all of the idle states into account when deciding
whether or not to select the one "matching" the sleep length.
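The decision described above can be sketched in plain C as follows (a
simplified standalone illustration with assumed names, not the kernel
function itself; "total" stands for the grand total of both metrics over
all bins, as maintained by the governor):

```c
struct bin { unsigned int intercepts, hits; };

/*
 * Return nonzero when the candidate state is likely too deep: the
 * intercepts accumulated in the bins below the candidate outweigh the
 * hits and intercepts in the candidate bin and all of the deeper ones.
 * Since total - idx_hit_sum - idx_intercept_sum is exactly the sum of
 * both metrics from the candidate up, the test below is equivalent to
 * comparing the two sums from step 1 of the selection algorithm.
 */
static int likely_too_deep(const struct bin *bins, int candidate,
			   unsigned int total)
{
	unsigned int idx_intercept_sum = 0, idx_hit_sum = 0;
	int i;

	for (i = 0; i < candidate; i++) {
		idx_intercept_sum += bins[i].intercepts;
		idx_hit_sum += bins[i].hits;
	}

	return 2 * idx_intercept_sum > total - idx_hit_sum;
}
```

When this returns nonzero, the governor walks down through the shallower
states looking for one whose bin (and the bins above it, up to the
candidate) collected more than half of those intercepts.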

Moreover, drop the "misses" metric that becomes redundant after the
above change and rename the "early_hits" metric to "intercepts" so
that its role is better reflected by its name (the idea being that
if a CPU wakes up earlier than indicated by the sleep length, then
it must be a result of a non-timer interrupt that "intercepts" the
CPU).

Also rename the states[] array in struct teo_cpu to
state_bins[] to avoid confusing it with the states[] array in
struct cpuidle_driver and update the documentation to match the
new code (and make it more comprehensive while at it).
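The pulse-and-decay bookkeeping behind the renamed metrics can be
illustrated as follows (a minimal sketch using the governor's DECAY_SHIFT
and PULSE constants; the function name and the returned grand total are
assumptions for the example, not the kernel code):

```c
#define DECAY_SHIFT	3
#define PULSE		1024

struct bin { unsigned int intercepts, hits; };

/*
 * Decay both metrics for every bin, then credit PULSE to "hits" if the
 * measured idle duration fell into the same bin as the sleep length
 * (idx_timer == idx_duration), or to the "intercepts" metric of the bin
 * the measured idle duration actually fell into.  Returns the grand
 * total of both metrics over all bins after the update.
 */
static unsigned int update_bins(struct bin *bins, int count,
				int idx_timer, int idx_duration)
{
	unsigned int total = 0;
	int i;

	for (i = 0; i < count; i++) {
		bins[i].hits -= bins[i].hits >> DECAY_SHIFT;
		bins[i].intercepts -= bins[i].intercepts >> DECAY_SHIFT;
		total += bins[i].hits + bins[i].intercepts;
	}

	if (idx_timer == idx_duration)
		bins[idx_timer].hits += PULSE;
	else
		bins[idx_duration].intercepts += PULSE;

	return total + PULSE;
}
```

The shift-based decay keeps the metrics bounded while weighting recent
wakeups more heavily than old ones.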

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/teo.c |  380 +++++++++++++++++++++-------------------
 1 file changed, 206 insertions(+), 174 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/teo.c
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -2,47 +2,90 @@
 /*
  * Timer events oriented CPU idle governor
  *
- * Copyright (C) 2018 Intel Corporation
+ * Copyright (C) 2018 - 2021 Intel Corporation
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  *
  * The idea of this governor is based on the observation that on many systems
  * timer events are two or more orders of magnitude more frequent than any
- * other interrupts, so they are likely to be the most significant source of CPU
+ * other interrupts, so they are likely to be the most significant cause of CPU
  * wakeups from idle states.  Moreover, information about what happened in the
  * (relatively recent) past can be used to estimate whether or not the deepest
- * idle state with target residency within the time to the closest timer is
- * likely to be suitable for the upcoming idle time of the CPU and, if not, then
- * which of the shallower idle states to choose.
- *
- * Of course, non-timer wakeup sources are more important in some use cases and
- * they can be covered by taking a few most recent idle time intervals of the
- * CPU into account.  However, even in that case it is not necessary to consider
- * idle duration values greater than the time till the closest timer, as the
- * patterns that they may belong to produce average values close enough to
- * the time till the closest timer (sleep length) anyway.
- *
- * Thus this governor estimates whether or not the upcoming idle time of the CPU
- * is likely to be significantly shorter than the sleep length and selects an
- * idle state for it in accordance with that, as follows:
- *
- * - Find an idle state on the basis of the sleep length and state statistics
- *   collected over time:
- *
- *   o Find the deepest idle state whose target residency is less than or equal
- *     to the sleep length.
- *
- *   o Select it if it matched both the sleep length and the observed idle
- *     duration in the past more often than it matched the sleep length alone
- *     (i.e. the observed idle duration was significantly shorter than the sleep
- *     length matched by it).
- *
- *   o Otherwise, select the shallower state with the greatest matched "early"
- *     wakeups metric.
- *
- * - If the majority of the most recent idle duration values are below the
- *   target residency of the idle state selected so far, use those values to
- *   compute the new expected idle duration and find an idle state matching it
- *   (which has to be shallower than the one selected so far).
+ * idle state with target residency within the (known) time till the closest
+ * timer event, referred to as the sleep length, is likely to be suitable for
+ * the upcoming CPU idle period and, if not, then which of the shallower idle
+ * states to choose instead of it.
+ *
+ * Of course, non-timer wakeup sources are more important in some use cases
+ * which can be covered by taking a few most recent idle time intervals of the
+ * CPU into account.  However, even in that context it is not necessary to
+ * consider idle duration values greater than the sleep length, because the
+ * closest timer will ultimately wake up the CPU anyway unless it is woken up
+ * earlier.
+ *
+ * Thus this governor estimates whether or not the prospective idle duration of
+ * a CPU is likely to be significantly shorter than the sleep length and selects
+ * an idle state for it accordingly.
+ *
+ * The computations carried out by this governor are based on using bins whose
+ * boundaries are aligned with the target residency parameter values of the CPU
+ * idle states provided by the cpuidle driver in the ascending order.  That is,
+ * the first bin spans from 0 up to, but not including, the target residency of
+ * the second idle state (idle state 1), the second bin spans from the target
+ * residency of idle state 1 up to, but not including, the target residency of
+ * idle state 2, the third bin spans from the target residency of idle state 2
+ * up to, but not including, the target residency of idle state 3 and so on.
+ * The last bin spans from the target residency of the deepest idle state
+ * supplied by the driver to infinity.
+ *
+ * Two metrics called "hits" and "intercepts" are associated with each bin.
+ * They are updated every time before selecting an idle state for the given CPU
+ * in accordance with what happened last time.
+ *
+ * The "hits" metric reflects the relative frequency of situations in which the
+ * sleep length and the idle duration measured after CPU wakeup fall into the
+ * same bin (that is, the CPU appears to wake up "on time" relative to the sleep
+ * length).  In turn, the "intercepts" metric reflects the relative frequency of
+ * situations in which the measured idle duration is so much shorter than the
+ * sleep length that the bin it falls into corresponds to an idle state
+ * shallower than the one whose bin is fallen into by the sleep length.
+ *
+ * In order to select an idle state for a CPU, the governor takes the following
+ * steps (modulo the possible latency constraint that must be taken into account
+ * too):
+ *
+ * 1. Find the deepest CPU idle state whose target residency does not exceed
+ *    the current sleep length (the candidate idle state) and compute two sums
+ *    as follows:
+ *
+ *    - The sum of the "hits" and "intercepts" metrics for the candidate state
+ *      and all of the deeper idle states (it represents the cases in which the
+ *      CPU was idle long enough to avoid being intercepted if the sleep length
+ *      had been equal to the current one).
+ *
+ *    - The sum of the "intercepts" metrics for all of the idle states shallower
+ *      than the candidate one (it represents the cases in which the CPU was not
+ *      idle long enough to avoid being intercepted if the sleep length had been
+ *      equal to the current one).
+ *
+ * 2. If the second sum is greater than the first one, look for an alternative
+ *    idle state to select.
+ *
+ *    - Traverse the idle states shallower than the candidate one in the
+ *      descending order.
+ *
+ *    - For each of them compute the sum of the "intercepts" metrics over all of
+ *      the idle  states between it and the candidate one (including the former
+ *      and excluding the latter).
+ *
+ *    - If that sum is greater than a half of the second sum computed in step 1
+ *      (which means that the target residency of the state in question had not
+ *      exceeded the idle duration in over a half of the relevant cases), select
+ *      the given idle state instead of the candidate one.
+ *
+ * 3. If the majority of the most recent idle duration values are below the
+ *    current anticipated idle duration, use those values to compute the new
+ *    expected idle duration and find an idle state matching it (which has to
+ *    be shallower than the current candidate one).
  */
 
 #include <linux/cpuidle.h>
@@ -65,44 +108,29 @@
 #define INTERVALS	8
 
 /**
- * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
- * @early_hits: "Early" CPU wakeups "matching" this state.
- * @hits: "On time" CPU wakeups "matching" this state.
- * @misses: CPU wakeups "missing" this state.
- *
- * A CPU wakeup is "matched" by a given idle state if the idle duration measured
- * after the wakeup is between the target residency of that state and the target
- * residency of the next one (or if this is the deepest available idle state, it
- * "matches" a CPU wakeup when the measured idle duration is at least equal to
- * its target residency).
- *
- * Also, from the TEO governor perspective, a CPU wakeup from idle is "early" if
- * it occurs significantly earlier than the closest expected timer event (that
- * is, early enough to match an idle state shallower than the one matching the
- * time till the closest timer event).  Otherwise, the wakeup is "on time", or
- * it is a "hit".
- *
- * A "miss" occurs when the given state doesn't match the wakeup, but it matches
- * the time till the closest timer event used for idle state selection.
+ * struct teo_bin - Metrics used by the TEO cpuidle governor.
+ * @intercepts: The "intercepts" metric.
+ * @hits: The "hits" metric.
  */
-struct teo_idle_state {
-	unsigned int early_hits;
+struct teo_bin {
+	unsigned int intercepts;
 	unsigned int hits;
-	unsigned int misses;
 };
 
 /**
  * struct teo_cpu - CPU data used by the TEO cpuidle governor.
  * @time_span_ns: Time between idle state selection and post-wakeup update.
  * @sleep_length_ns: Time till the closest timer event (at the selection time).
- * @states: Idle states data corresponding to this CPU.
+ * @state_bins: Idle state data bins for this CPU.
+ * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
  * @interval_idx: Index of the most recent saved idle interval.
  * @intervals: Saved idle duration values.
  */
 struct teo_cpu {
 	s64 time_span_ns;
 	s64 sleep_length_ns;
-	struct teo_idle_state states[CPUIDLE_STATE_MAX];
+	struct teo_bin state_bins[CPUIDLE_STATE_MAX];
+	unsigned int total;
 	int interval_idx;
 	u64 intervals[INTERVALS];
 };
@@ -110,7 +138,7 @@ struct teo_cpu {
 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
 
 /**
- * teo_update - Update CPU data after wakeup.
+ * teo_update - Update CPU metrics after wakeup.
  * @drv: cpuidle driver containing state data.
  * @dev: Target CPU.
  */
@@ -118,7 +146,6 @@ static void teo_update(struct cpuidle_dr
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
 	int i, idx_timer = 0, idx_duration = 0;
-	unsigned int hits, misses;
 	u64 measured_ns;
 
 	if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
@@ -151,15 +178,21 @@ static void teo_update(struct cpuidle_dr
 			measured_ns /= 2;
 	}
 
+	cpu_data->total = 0;
+
 	/*
-	 * Decay the "early hits" metric for all of the states and find the
-	 * states matching the sleep length and the measured idle duration.
+	 * Decay the "hits" and "intercepts" metrics for all of the bins and
+	 * find the bins that the sleep length and the measured idle duration
+	 * fall into.
 	 */
 	for (i = 0; i < drv->state_count; i++) {
 		s64 target_residency_ns = drv->states[i].target_residency_ns;
-		unsigned int early_hits = cpu_data->states[i].early_hits;
+		struct teo_bin *bin = &cpu_data->state_bins[i];
 
-		cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;
+		bin->hits -= bin->hits >> DECAY_SHIFT;
+		bin->intercepts -= bin->intercepts >> DECAY_SHIFT;
+
+		cpu_data->total += bin->hits + bin->intercepts;
 
 		if (target_residency_ns <= cpu_data->sleep_length_ns) {
 			idx_timer = i;
@@ -169,28 +202,17 @@ static void teo_update(struct cpuidle_dr
 	}
 
 	/*
-	 * Update the "hits" and "misses" data for the state matching the sleep
-	 * length.  If it matches the measured idle duration too, this is a hit,
-	 * so increase the "hits" metric for it then.  Otherwise, this is a
-	 * miss, so increase the "misses" metric for it.  In the latter case
-	 * also increase the "early hits" metric for the state that actually
-	 * matches the measured idle duration.
+	 * If the measured idle duration falls into the same bin as the sleep
+	 * length, this is a "hit", so update the "hits" metric for that bin.
+	 * Otherwise, update the "intercepts" metric for the bin fallen into by
+	 * the measured idle duration.
 	 */
-	hits = cpu_data->states[idx_timer].hits;
-	hits -= hits >> DECAY_SHIFT;
-
-	misses = cpu_data->states[idx_timer].misses;
-	misses -= misses >> DECAY_SHIFT;
-
-	if (idx_timer == idx_duration) {
-		hits += PULSE;
-	} else {
-		misses += PULSE;
-		cpu_data->states[idx_duration].early_hits += PULSE;
-	}
+	if (idx_timer == idx_duration)
+		cpu_data->state_bins[idx_timer].hits += PULSE;
+	else
+		cpu_data->state_bins[idx_duration].intercepts += PULSE;
 
-	cpu_data->states[idx_timer].misses = misses;
-	cpu_data->states[idx_timer].hits = hits;
+	cpu_data->total += PULSE;
 
 	/*
 	 * Save idle duration values corresponding to non-timer wakeups for
@@ -206,6 +228,12 @@ static bool teo_time_ok(u64 interval_ns)
 	return !tick_nohz_tick_stopped() || interval_ns >= TICK_NSEC;
 }
 
+static s64 teo_middle_of_bin(int idx, struct cpuidle_driver *drv)
+{
+	return (drv->states[idx].target_residency_ns +
+		drv->states[idx+1].target_residency_ns) / 2;
+}
+
 /**
  * teo_find_shallower_state - Find shallower idle state matching given duration.
  * @drv: cpuidle driver containing state data.
@@ -241,12 +269,12 @@ static int teo_select(struct cpuidle_dri
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
 	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
-	int constraint_idx = drv->state_count;
-	unsigned int hits = 0, misses = 0;
-	unsigned int early_hits = 0;
-	int prev_max_early_idx = -1;
-	int max_early_idx = -1;
-	int idx0 = -1, idx = -1;
+	unsigned int idx_intercept_sum = 0;
+	unsigned int intercept_sum = 0;
+	unsigned int idx_hit_sum = 0;
+	unsigned int hit_sum = 0;
+	int constraint_idx = 0;
+	int idx0 = 0, idx = -1;
 	ktime_t delta_tick;
 	s64 duration_ns;
 	int i;
@@ -261,119 +289,122 @@ static int teo_select(struct cpuidle_dri
 	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
 	cpu_data->sleep_length_ns = duration_ns;
 
-	for (i = 0; i < drv->state_count; i++) {
-		struct cpuidle_state *s = &drv->states[i];
-
-		if (dev->states_usage[i].disable) {
-			/*
-			 * Ignore disabled states with target residencies beyond
-			 * the anticipated idle duration.
-			 */
-			if (s->target_residency_ns > duration_ns)
-				continue;
-
-			/*
-			 * This state is disabled, so the range of idle duration
-			 * values corresponding to it is covered by the current
-			 * candidate state, but still the "hits" and "misses"
-			 * metrics of the disabled state need to be used to
-			 * decide whether or not the state covering the range in
-			 * question is good enough.
-			 */
-			hits = cpu_data->states[i].hits;
-			misses = cpu_data->states[i].misses;
-
-			if (early_hits >= cpu_data->states[i].early_hits ||
-			    idx < 0)
-				continue;
+	/* Check if there is any choice in the first place. */
+	if (drv->state_count < 2) {
+		idx = 0;
+		goto end;
+	}
+	if (!dev->states_usage[0].disable) {
+		idx = 0;
+		if (drv->states[1].target_residency_ns > duration_ns)
+			goto end;
+	}
 
-			/*
-			 * If the current candidate state has been the one with
-			 * the maximum "early hits" metric so far, the "early
-			 * hits" metric of the disabled state replaces the
-			 * current "early hits" count to avoid selecting a
-			 * deeper state with lower "early hits" metric.
-			 */
-			if (max_early_idx == idx) {
-				early_hits = cpu_data->states[i].early_hits;
-				continue;
-			}
+	/*
+	 * Find the deepest idle state whose target residency does not exceed
+	 * the current sleep length and the deepest idle state not deeper than
+	 * the former whose exit latency does not exceed the current latency
+	 * constraint.  Compute the sums of metrics for early wakeup pattern
+	 * detection.
+	 */
+	for (i = 1; i < drv->state_count; i++) {
+		struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
+		struct cpuidle_state *s = &drv->states[i];
 
-			/*
-			 * The current candidate state is closer to the disabled
-			 * one than the current maximum "early hits" state, so
-			 * replace the latter with it, but in case the maximum
-			 * "early hits" state index has not been set so far,
-			 * check if the current candidate state is not too
-			 * shallow for that role.
-			 */
-			if (teo_time_ok(drv->states[idx].target_residency_ns)) {
-				prev_max_early_idx = max_early_idx;
-				early_hits = cpu_data->states[i].early_hits;
-				max_early_idx = idx;
-			}
+		/*
+		 * Update the sums of idle state metrics for all of the states
+		 * shallower than the current one.
+		 */
+		intercept_sum += prev_bin->intercepts;
+		hit_sum += prev_bin->hits;
 
+		if (dev->states_usage[i].disable)
 			continue;
-		}
 
 		if (idx < 0) {
 			idx = i; /* first enabled state */
-			hits = cpu_data->states[i].hits;
-			misses = cpu_data->states[i].misses;
 			idx0 = i;
 		}
 
 		if (s->target_residency_ns > duration_ns)
 			break;
 
-		if (s->exit_latency_ns > latency_req && constraint_idx > i)
+		idx = i;
+
+		if (s->exit_latency_ns <= latency_req)
 			constraint_idx = i;
 
-		idx = i;
-		hits = cpu_data->states[i].hits;
-		misses = cpu_data->states[i].misses;
+		idx_intercept_sum = intercept_sum;
+		idx_hit_sum = hit_sum;
+	}
 
-		if (early_hits < cpu_data->states[i].early_hits &&
-		    teo_time_ok(drv->states[i].target_residency_ns)) {
-			prev_max_early_idx = max_early_idx;
-			early_hits = cpu_data->states[i].early_hits;
-			max_early_idx = i;
-		}
+	/* Avoid unnecessary overhead. */
+	if (idx < 0) {
+		idx = 0; /* No states enabled, must use 0. */
+		goto end;
+	} else if (idx == idx0) {
+		goto end;
 	}
 
 	/*
-	 * If the "hits" metric of the idle state matching the sleep length is
-	 * greater than its "misses" metric, that is the one to use.  Otherwise,
-	 * it is more likely that one of the shallower states will match the
-	 * idle duration observed after wakeup, so take the one with the maximum
-	 * "early hits" metric, but if that cannot be determined, just use the
-	 * state selected so far.
+	 * If the sum of the intercepts metric for all of the idle states
+	 * shallower than the current candidate one (idx) is greater than the
+	 * sum of the intercepts and hits metrics for the candidate state and
+	 * all of the deeper states, the CPU is likely to wake up early, so find
+	 * an alternative idle state to select.
 	 */
-	if (hits <= misses) {
+	if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
+		s64 last_enabled_span_ns = duration_ns;
+		int last_enabled_idx = idx;
+
 		/*
-		 * The current candidate state is not suitable, so take the one
-		 * whose "early hits" metric is the maximum for the range of
-		 * shallower states.
+		 * Look for the deepest idle state whose target residency had
+		 * not exceeded the idle duration in over a half of the relevant
+		 * cases in the past.
+		 *
+		 * Take the possible latency constraint and duration limitation
+		 * present if the tick has been stopped already into account.
 		 */
-		if (idx == max_early_idx)
-			max_early_idx = prev_max_early_idx;
+		intercept_sum = 0;
 
-		if (max_early_idx >= 0) {
-			idx = max_early_idx;
-			duration_ns = drv->states[idx].target_residency_ns;
+		for (i = idx - 1; i >= idx0; i--) {
+			s64 span_ns;
+
+			intercept_sum += cpu_data->state_bins[i].intercepts;
+
+			if (dev->states_usage[i].disable)
+				continue;
+
+			span_ns = teo_middle_of_bin(i, drv);
+			if (!teo_time_ok(span_ns)) {
+				/*
+				 * The current state is too shallow, so select
+				 * the first enabled deeper state.
+				 */
+				duration_ns = last_enabled_span_ns;
+				idx = last_enabled_idx;
+				break;
+			}
+
+			if (2 * intercept_sum > idx_intercept_sum) {
+				idx = i;
+				duration_ns = span_ns;
+				break;
+			}
+
+			last_enabled_span_ns = span_ns;
+			last_enabled_idx = i;
 		}
 	}
 
 	/*
-	 * If there is a latency constraint, it may be necessary to use a
-	 * shallower idle state than the one selected so far.
+	 * If there is a latency constraint, it may be necessary to select an
+	 * idle state shallower than the current candidate one.
 	 */
-	if (constraint_idx < idx)
+	if (idx > constraint_idx)
 		idx = constraint_idx;
 
-	if (idx < 0) {
-		idx = 0; /* No states enabled. Must use 0. */
-	} else if (idx > idx0) {
+	if (idx > idx0) {
 		unsigned int count = 0;
 		u64 sum = 0;
 
@@ -416,6 +447,7 @@ static int teo_select(struct cpuidle_dri
 		}
 	}
 
+end:
 	/*
 	 * Don't stop the tick if the selected state is a polling one or if the
 	 * expected idle duration is shorter than the tick period length.





* [PATCH v1 4/5] cpuidle: teo: Rework most recent idle duration values treatment
  2021-06-02 18:14 [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Rafael J. Wysocki
                   ` (2 preceding siblings ...)
  2021-06-02 18:16 ` [PATCH v1 3/5] cpuidle: teo: Change the main idle state selection logic Rafael J. Wysocki
@ 2021-06-02 18:17 ` Rafael J. Wysocki
  2021-06-02 18:18 ` [PATCH v1 5/5] cpuidle: teo: Use kerneldoc documentation in admin-guide Rafael J. Wysocki
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-06-02 18:17 UTC (permalink / raw)
  To: Linux PM; +Cc: LKML

From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>

The TEO (Timer Events Oriented) cpuidle governor uses several most
recent idle duration values for a given CPU to refine the idle state
selection in case the previous long-term trends have not been
followed recently and a new trend appears to be forming.  That is
done by computing the average of the most recent idle duration
values falling below the time till the next timer event ("sleep
length"), provided that they are the majority of the most recent
idle duration values taken into account, and using it as the new
expected idle duration value.

However, idle state selection based on that value may not be optimal,
because the average does not really indicate which of the idle states
with target residencies less than or equal to it is likely to be the
best fit.

Thus, instead of computing the average, make the governor carry out
computations based on the distribution of the most recent idle
duration values among the bins corresponding to different idle
states.  Namely, if the majority of the most recent idle duration
values taken into consideration are less than the current sleep
length (which means that the CPU is likely to wake up early), find
the idle state closest to the "candidate" one "matching" the sleep
length whose target residency is less than or equal to the majority
of the most recent idle duration values that have fallen below the
current sleep length (which means that it is likely to be "shallow
enough" this time).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/teo.c |  157 ++++++++++++++++++----------------------
 1 file changed, 72 insertions(+), 85 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/teo.c
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -47,15 +47,20 @@
  * length).  In turn, the "intercepts" metric reflects the relative frequency of
  * situations in which the measured idle duration is so much shorter than the
  * sleep length that the bin it falls into corresponds to an idle state
- * shallower than the one whose bin is fallen into by the sleep length.
+ * shallower than the one whose bin is fallen into by the sleep length (these
+ * situations are referred to as "intercepts" below).
+ *
+ * In addition to the metrics described above, the governor counts recent
+ * intercepts (that is, intercepts that have occurred during the last NR_RECENT
+ * invocations of it for the given CPU) for each bin.
  *
  * In order to select an idle state for a CPU, the governor takes the following
  * steps (modulo the possible latency constraint that must be taken into account
  * too):
  *
  * 1. Find the deepest CPU idle state whose target residency does not exceed
- *    the current sleep length (the candidate idle state) and compute two sums
- *    as follows:
+ *    the current sleep length (the candidate idle state) and compute 3 sums as
+ *    follows:
  *
  *    - The sum of the "hits" and "intercepts" metrics for the candidate state
  *      and all of the deeper idle states (it represents the cases in which the
@@ -67,25 +72,29 @@
  *      idle long enough to avoid being intercepted if the sleep length had been
  *      equal to the current one).
  *
- * 2. If the second sum is greater than the first one, look for an alternative
- *    idle state to select.
+ *    - The sum of the numbers of recent intercepts for all of the idle states
+ *      shallower than the candidate one.
+ *
+ * 2. If the second sum is greater than the first one or the third sum is
+ *    greater than NR_RECENT / 2, the CPU is likely to wake up early, so look
+ *    for an alternative idle state to select.
  *
  *    - Traverse the idle states shallower than the candidate one in the
  *      descending order.
  *
- *    - For each of them compute the sum of the "intercepts" metrics over all of
- *      the idle  states between it and the candidate one (including the former
- *      and excluding the latter).
- *
- *    - If that sum is greater than a half of the second sum computed in step 1
- *      (which means that the target residency of the state in question had not
- *      exceeded the idle duration in over a half of the relevant cases), select
- *      the given idle state instead of the candidate one.
- *
- * 3. If the majority of the most recent idle duration values are below the
- *    current anticipated idle duration, use those values to compute the new
- *    expected idle duration and find an idle state matching it (which has to
- *    be shallower than the current candidate one).
+ *    - For each of them compute the sum of the "intercepts" metrics and the sum
+ *      of the numbers of recent intercepts over all of the idle states between
+ *      it and the candidate one (including the former and excluding the
+ *      latter).
+ *
+ *    - If each of these sums that needs to be taken into account (because the
+ *      check related to it has indicated that the CPU is likely to wake up
+ *      early) is greater than a half of the corresponding sum computed in step
+ *      1 (which means that the target residency of the state in question had
+ *      not exceeded the idle duration in over a half of the relevant cases),
+ *      select the given idle state instead of the candidate one.
+ *
+ * 3. By default, select the candidate state.
  */
 
 #include <linux/cpuidle.h>
@@ -103,18 +112,20 @@
 
 /*
  * Number of the most recent idle duration values to take into consideration for
- * the detection of wakeup patterns.
+ * the detection of recent early wakeup patterns.
  */
-#define INTERVALS	8
+#define NR_RECENT	9
 
 /**
  * struct teo_bin - Metrics used by the TEO cpuidle governor.
  * @intercepts: The "intercepts" metric.
  * @hits: The "hits" metric.
+ * @recent: The number of recent "intercepts".
  */
 struct teo_bin {
 	unsigned int intercepts;
 	unsigned int hits;
+	unsigned int recent;
 };
 
 /**
@@ -123,16 +134,16 @@ struct teo_bin {
  * @sleep_length_ns: Time till the closest timer event (at the selection time).
  * @state_bins: Idle state data bins for this CPU.
  * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
- * @interval_idx: Index of the most recent saved idle interval.
- * @intervals: Saved idle duration values.
+ * @next_recent_idx: Index of the next @recent_idx entry to update.
+ * @recent_idx: Indices of bins corresponding to recent "intercepts".
  */
 struct teo_cpu {
 	s64 time_span_ns;
 	s64 sleep_length_ns;
 	struct teo_bin state_bins[CPUIDLE_STATE_MAX];
 	unsigned int total;
-	int interval_idx;
-	u64 intervals[INTERVALS];
+	int next_recent_idx;
+	int recent_idx[NR_RECENT];
 };
 
 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
@@ -201,26 +212,29 @@ static void teo_update(struct cpuidle_dr
 		}
 	}
 
+	i = cpu_data->next_recent_idx++;
+	if (cpu_data->next_recent_idx >= NR_RECENT)
+		cpu_data->next_recent_idx = 0;
+
+	if (cpu_data->recent_idx[i] >= 0)
+		cpu_data->state_bins[cpu_data->recent_idx[i]].recent--;
+
 	/*
 	 * If the measured idle duration falls into the same bin as the sleep
 	 * length, this is a "hit", so update the "hits" metric for that bin.
 	 * Otherwise, update the "intercepts" metric for the bin fallen into by
 	 * the measured idle duration.
 	 */
-	if (idx_timer == idx_duration)
+	if (idx_timer == idx_duration) {
 		cpu_data->state_bins[idx_timer].hits += PULSE;
-	else
+		cpu_data->recent_idx[i] = -1;
+	} else {
 		cpu_data->state_bins[idx_duration].intercepts += PULSE;
+		cpu_data->state_bins[idx_duration].recent++;
+		cpu_data->recent_idx[i] = idx_duration;
+	}
 
 	cpu_data->total += PULSE;
-
-	/*
-	 * Save idle duration values corresponding to non-timer wakeups for
-	 * pattern detection.
-	 */
-	cpu_data->intervals[cpu_data->interval_idx++] = measured_ns;
-	if (cpu_data->interval_idx >= INTERVALS)
-		cpu_data->interval_idx = 0;
 }
 
 static bool teo_time_ok(u64 interval_ns)
@@ -271,10 +285,13 @@ static int teo_select(struct cpuidle_dri
 	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
 	unsigned int idx_intercept_sum = 0;
 	unsigned int intercept_sum = 0;
+	unsigned int idx_recent_sum = 0;
+	unsigned int recent_sum = 0;
 	unsigned int idx_hit_sum = 0;
 	unsigned int hit_sum = 0;
 	int constraint_idx = 0;
 	int idx0 = 0, idx = -1;
+	bool alt_intercepts, alt_recent;
 	ktime_t delta_tick;
 	s64 duration_ns;
 	int i;
@@ -317,6 +334,7 @@ static int teo_select(struct cpuidle_dri
 		 */
 		intercept_sum += prev_bin->intercepts;
 		hit_sum += prev_bin->hits;
+		recent_sum += prev_bin->recent;
 
 		if (dev->states_usage[i].disable)
 			continue;
@@ -336,6 +354,7 @@ static int teo_select(struct cpuidle_dri
 
 		idx_intercept_sum = intercept_sum;
 		idx_hit_sum = hit_sum;
+		idx_recent_sum = recent_sum;
 	}
 
 	/* Avoid unnecessary overhead. */
@@ -350,27 +369,36 @@ static int teo_select(struct cpuidle_dri
 	 * If the sum of the intercepts metric for all of the idle states
 	 * shallower than the current candidate one (idx) is greater than the
 	 * sum of the intercepts and hits metrics for the candidate state and
-	 * all of the deeper states, the CPU is likely to wake up early, so find
-	 * an alternative idle state to select.
+	 * all of the deeper states, or the sum of the numbers of recent
+	 * intercepts over all of the states shallower than the candidate one
+	 * is greater than a half of the number of recent events taken into
+	 * account, the CPU is likely to wake up early, so find an alternative
+	 * idle state to select.
 	 */
-	if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
+	alt_intercepts = 2 * idx_intercept_sum > cpu_data->total - idx_hit_sum;
+	alt_recent = idx_recent_sum > NR_RECENT / 2;
+	if (alt_recent || alt_intercepts) {
 		s64 last_enabled_span_ns = duration_ns;
 		int last_enabled_idx = idx;
 
 		/*
 		 * Look for the deepest idle state whose target residency had
 		 * not exceeded the idle duration in over a half of the relevant
-		 * cases in the past.
+		 * cases (both with respect to intercepts overall and with
+		 * respect to the recent intercepts only) in the past.
 		 *
 		 * Take the possible latency constraint and duration limitation
 		 * present if the tick has been stopped already into account.
 		 */
 		intercept_sum = 0;
+		recent_sum = 0;
 
 		for (i = idx - 1; i >= idx0; i--) {
+			struct teo_bin *bin = &cpu_data->state_bins[i];
 			s64 span_ns;
 
-			intercept_sum += cpu_data->state_bins[i].intercepts;
+			intercept_sum += bin->intercepts;
+			recent_sum += bin->recent;
 
 			if (dev->states_usage[i].disable)
 				continue;
@@ -386,7 +414,9 @@ static int teo_select(struct cpuidle_dri
 				break;
 			}
 
-			if (2 * intercept_sum > idx_intercept_sum) {
+			if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
+			    (!alt_intercepts ||
+			     2 * intercept_sum > idx_intercept_sum)) {
 				idx = i;
 				duration_ns = span_ns;
 				break;
@@ -404,49 +434,6 @@ static int teo_select(struct cpuidle_dri
 	if (idx > constraint_idx)
 		idx = constraint_idx;
 
-	if (idx > idx0) {
-		unsigned int count = 0;
-		u64 sum = 0;
-
-		/*
-		 * The target residencies of at least two different enabled idle
-		 * states are less than or equal to the current expected idle
-		 * duration.  Try to refine the selection using the most recent
-		 * measured idle duration values.
-		 *
-		 * Count and sum the most recent idle duration values less than
-		 * the current expected idle duration value.
-		 */
-		for (i = 0; i < INTERVALS; i++) {
-			u64 val = cpu_data->intervals[i];
-
-			if (val >= duration_ns)
-				continue;
-
-			count++;
-			sum += val;
-		}
-
-		/*
-		 * Give up unless the majority of the most recent idle duration
-		 * values are in the interesting range.
-		 */
-		if (count > INTERVALS / 2) {
-			u64 avg_ns = div64_u64(sum, count);
-
-			/*
-			 * Avoid spending too much time in an idle state that
-			 * would be too shallow.
-			 */
-			if (teo_time_ok(avg_ns)) {
-				duration_ns = avg_ns;
-				if (drv->states[idx].target_residency_ns > avg_ns)
-					idx = teo_find_shallower_state(drv, dev,
-								       idx, avg_ns);
-			}
-		}
-	}
-
 end:
 	/*
 	 * Don't stop the tick if the selected state is a polling one or if the
@@ -507,8 +494,8 @@ static int teo_enable_device(struct cpui
 
 	memset(cpu_data, 0, sizeof(*cpu_data));
 
-	for (i = 0; i < INTERVALS; i++)
-		cpu_data->intervals[i] = U64_MAX;
+	for (i = 0; i < NR_RECENT; i++)
+		cpu_data->recent_idx[i] = -1;
 
 	return 0;
 }




^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v1 5/5] cpuidle: teo: Use kerneldoc documentation in admin-guide
  2021-06-02 18:14 [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Rafael J. Wysocki
                   ` (3 preceding siblings ...)
  2021-06-02 18:17 ` [PATCH v1 4/5] cpuidle: teo: Rework most recent idle duration values treatment Rafael J. Wysocki
@ 2021-06-02 18:18 ` Rafael J. Wysocki
  2021-07-04 21:01 ` [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Doug Smythies
  2021-07-27 20:06 ` Doug Smythies
  6 siblings, 0 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-06-02 18:18 UTC (permalink / raw)
  To: Linux PM; +Cc: LKML

From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>

There are two descriptions of the TEO (Timer Events Oriented) cpuidle
governor in the kernel source tree, one in the C file containing its
code and one in cpuidle.rst which is part of admin-guide.

Instead of trying to keep them both in sync and in order to reduce
text duplication, include the governor description from the C file
directly into cpuidle.rst.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 Documentation/admin-guide/pm/cpuidle.rst |   77 -------------------------------
 drivers/cpuidle/governors/teo.c          |   12 +++-
 2 files changed, 10 insertions(+), 79 deletions(-)

Index: linux-pm/Documentation/admin-guide/pm/cpuidle.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/cpuidle.rst
+++ linux-pm/Documentation/admin-guide/pm/cpuidle.rst
@@ -347,81 +347,8 @@ for tickless systems.  It follows the sa
 <menu-gov_>`_: it always tries to find the deepest idle state suitable for the
 given conditions.  However, it applies a different approach to that problem.
 
-First, it does not use sleep length correction factors, but instead it attempts
-to correlate the observed idle duration values with the available idle states
-and use that information to pick up the idle state that is most likely to
-"match" the upcoming CPU idle interval.   Second, it does not take the tasks
-that were running on the given CPU in the past and are waiting on some I/O
-operations to complete now at all (there is no guarantee that they will run on
-the same CPU when they become runnable again) and the pattern detection code in
-it avoids taking timer wakeups into account.  It also only uses idle duration
-values less than the current time till the closest timer (with the scheduler
-tick excluded) for that purpose.
-
-Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain
-the *sleep length*, which is the time until the closest timer event with the
-assumption that the scheduler tick will be stopped (that also is the upper bound
-on the time until the next CPU wakeup).  That value is then used to preselect an
-idle state on the basis of three metrics maintained for each idle state provided
-by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``.
-
-The ``hits`` and ``misses`` metrics measure the likelihood that a given idle
-state will "match" the observed (post-wakeup) idle duration if it "matches" the
-sleep length.  They both are subject to decay (after a CPU wakeup) every time
-the target residency of the idle state corresponding to them is less than or
-equal to the sleep length and the target residency of the next idle state is
-greater than the sleep length (that is, when the idle state corresponding to
-them "matches" the sleep length).  The ``hits`` metric is increased if the
-former condition is satisfied and the target residency of the given idle state
-is less than or equal to the observed idle duration and the target residency of
-the next idle state is greater than the observed idle duration at the same time
-(that is, it is increased when the given idle state "matches" both the sleep
-length and the observed idle duration).  In turn, the ``misses`` metric is
-increased when the given idle state "matches" the sleep length only and the
-observed idle duration is too short for its target residency.
-
-The ``early_hits`` metric measures the likelihood that a given idle state will
-"match" the observed (post-wakeup) idle duration if it does not "match" the
-sleep length.  It is subject to decay on every CPU wakeup and it is increased
-when the idle state corresponding to it "matches" the observed (post-wakeup)
-idle duration and the target residency of the next idle state is less than or
-equal to the sleep length (i.e. the idle state "matching" the sleep length is
-deeper than the given one).
-
-The governor walks the list of idle states provided by the ``CPUIdle`` driver
-and finds the last (deepest) one with the target residency less than or equal
-to the sleep length.  Then, the ``hits`` and ``misses`` metrics of that idle
-state are compared with each other and it is preselected if the ``hits`` one is
-greater (which means that that idle state is likely to "match" the observed idle
-duration after CPU wakeup).  If the ``misses`` one is greater, the governor
-preselects the shallower idle state with the maximum ``early_hits`` metric
-(or if there are multiple shallower idle states with equal ``early_hits``
-metric which also is the maximum, the shallowest of them will be preselected).
-[If there is a wakeup latency constraint coming from the `PM QoS framework
-<cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the
-target residency within the sleep length, the deepest idle state with the exit
-latency within the constraint is preselected without consulting the ``hits``,
-``misses`` and ``early_hits`` metrics.]
-
-Next, the governor takes several idle duration values observed most recently
-into consideration and if at least a half of them are greater than or equal to
-the target residency of the preselected idle state, that idle state becomes the
-final candidate to ask for.  Otherwise, the average of the most recent idle
-duration values below the target residency of the preselected idle state is
-computed and the governor walks the idle states shallower than the preselected
-one and finds the deepest of them with the target residency within that average.
-That idle state is then taken as the final candidate to ask for.
-
-Still, at this point the governor may need to refine the idle state selection if
-it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_.  That
-generally happens if the target residency of the idle state selected so far is
-less than the tick period and the tick has not been stopped already (in a
-previous iteration of the idle loop).  Then, like in the ``menu`` governor
-`case <menu-gov_>`_, the sleep length used in the previous computations may not
-reflect the real time until the closest timer event and if it really is greater
-than that time, a shallower state with a suitable target residency may need to
-be selected.
-
+.. kernel-doc:: drivers/cpuidle/governors/teo.c
+   :doc: teo-description
 
 .. _idle-states-representation:
 
Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/teo.c
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -4,6 +4,10 @@
  *
  * Copyright (C) 2018 - 2021 Intel Corporation
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ */
+
+/**
+ * DOC: teo-description
  *
  * The idea of this governor is based on the observation that on many systems
  * timer events are two or more orders of magnitude more frequent than any
@@ -28,7 +32,7 @@
  *
  * The computations carried out by this governor are based on using bins whose
  * boundaries are aligned with the target residency parameter values of the CPU
- * idle states provided by the cpuidle driver in the ascending order.  That is,
+ * idle states provided by the %CPUIdle driver in the ascending order.  That is,
  * the first bin spans from 0 up to, but not including, the target residency of
  * the second idle state (idle state 1), the second bin spans from the target
  * residency of idle state 1 up to, but not including, the target residency of
@@ -51,8 +55,8 @@
  * situations are referred to as "intercepts" below).
  *
  * In addition to the metrics described above, the governor counts recent
- * intercepts (that is, intercepts that have occurred during the last NR_RECENT
- * invocations of it for the given CPU) for each bin.
+ * intercepts (that is, intercepts that have occurred during the last
+ * %NR_RECENT invocations of it for the given CPU) for each bin.
  *
  * In order to select an idle state for a CPU, the governor takes the following
  * steps (modulo the possible latency constraint that must be taken into account
@@ -76,7 +80,7 @@
  *      shallower than the candidate one.
  *
  * 2. If the second sum is greater than the first one or the third sum is
- *    greater than NR_RECENT / 2, the CPU is likely to wake up early, so look
+ *    greater than %NR_RECENT / 2, the CPU is likely to wake up early, so look
  *    for an alternative idle state to select.
  *
  *    - Traverse the idle states shallower than the candidate one in the




^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
  2021-06-02 18:14 [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Rafael J. Wysocki
                   ` (4 preceding siblings ...)
  2021-06-02 18:18 ` [PATCH v1 5/5] cpuidle: teo: Use kerneldoc documentation in admin-guide Rafael J. Wysocki
@ 2021-07-04 21:01 ` Doug Smythies
  2021-07-05 13:24   ` Rafael J. Wysocki
  2021-07-27 20:06 ` Doug Smythies
  6 siblings, 1 reply; 17+ messages in thread
From: Doug Smythies @ 2021-07-04 21:01 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'LKML', 'Linux PM', Doug Smythies

Hi Rafael,

On 2021.06.02 11:14 Rafael J. Wysocki wrote:

> Hi All,
>
> This series of patches addresses some theoretical shortcoming in the
> TEO (Timer Events Oriented) cpuidle governor by reworking its idle
> state selection logic to some extent.
>
> Patches [1-2/5] are introductory cleanups and the substantial changes are
> made in patches [3-4/5] (please refer to the changelogs of these two
> patches for details).  The last patch only deals with documentation.
>
> Even though this work is mostly based on theoretical considerations, it
> shows a measurable reduction of the number of cases in which the
> shallowest idle state is selected while it would be more beneficial to
> select a deeper one or the deepest idle state is selected while it
> would be more beneficial to select a shallower one, which should be a
> noticeable improvement.

Do you have any test results to share? Or test methods that I can try?
I have done a few tests, and generally don't notice much difference.
Perhaps an increase in idle state 2 below (was too shallow) numbers.
I am searching for some results that would offset the below:

The difficulty I am having with this patch set is the additional overhead
which becomes significant at the extremes, where idle state 0 is dominant.
Throughout the history of teo, I have used multiple one core pipe-tests
for this particular test. Some results:

CPU: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
HWP: disabled
CPU frequency scaling driver: intel_pstate, active, powersave
Pipe-tests are run forever, printing the average loop time for the
last 2.5 million loops. 1021 of those are again averaged.
Total = 2.5525e9 loops
The power and idle data is sampled for 100 minutes.

Note 1: other tests were also done and also with passive,
schedutil, but it isn't relevant for this test because the
CPU frequency stays pinned at maximum.

Note 2: I use TCC offset for thermal throttling, but I disabled it
for these tests, because the temperature needed to go higher
than my normal throttling point.

Idle configuration 1: As a COMETLAKE processor, with 4 idle states.
Kernel 5.13-RC4.

Before patch set average:
2.8014 uSec/loop
113.9 watts
Idle state 0 residency: 9.450%
Idle state 0 entries per minute: 256,812,896.6

After patch set average:
2.8264 uSec/loop, 0.89% slower
114.0 watts
Idle state 0 residency: 8.677% 
Idle state 0 entries per minute: 254,560,049.9

Menu governor:
2.8051 uSec/loop, 0.13% slower
113.9 watts
Idle state 0 residency: 8.437% 
Idle state 0 entries per minute: 256,436,417.2

O.K., perhaps not so bad, but also not many idle states.

Idle configuration 2: As a SKYLAKE processor, with 9 idle states.
i.e.:
/drivers/idle/intel_idle.c
static const struct x86_cpu_id intel_idle_ids[] __initconst
...
   X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, &idle_cpu_skx),
+ X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE, &idle_cpu_skl),

Purpose: To demonstrate increasing overhead as a function of number
of idle states.
Kernel 5.13.

Before patch set average:
2.8394 uSec/loop
114.2 watts
Idle state 0 residency: 7.212%
Idle state 0 entries per minute: 253,391,954.3

After patch set average:
2.9103 uSec/loop, 2.5% slower
114.4 watts, 0.18% more
Idle state 0 residency: 6.152%, 14.7% less.
Idle state 0 entries per minute: 244,024,752.1

Menu governor:
2.8141 uSec/loop, 0.89% faster
113.9 watts,  0.26% less
Idle state 0 residency: 7.167%, 0.6% less
Idle state 0 entries per minute: 255,650,610.7

Another potentially interesting test was the ebizzy test:
Records per second, averaged over many tests, varying
threads and intervals:

passive, schedutil:
Before: 6771.977
After: 5502.643, -18.7%
Menu: 10728.89, +58.4%

Active, powersave:
Before: 8361.82
After: 8463.31, +1.2%
Menu: 8225.58, -1.6%

I think it has more to do with CPU scaling governors
than this patch set, so:

performance:
Before: 12137.33
After: 12083.26, -0.4%
Menu: 11983.73, -1.3%

These and other test results available here:
(encoded to prevent a barrage of bots)

double u double u double u dot smythies dot com
/~doug/linux/idle/teo-2021-06/

... a day later ...

I might have an answer to my own question.
By switching to cross core pipe-tests, and only loading down one
CPU per core, I was able to get a lot more activity in other idle states.
The test runs for 100 minutes, and the results change with time, but
I'll leave that investigation for another day (there is no throttling):

1st 50 tests:
Before: 3.888 uSec/loop
After: 3.764 uSec/loop
Menu: 3.464 uSec/loop

Tests 50 to 100:
Before: 4.329 uSec/loop
After: 3.919 uSec/loop
Menu: 3.514 uSec/loop

Tests 200 to 250:
Before: 5.089 uSec/loop
After: 4.364 uSec/loop
Menu: 4.619 uSec/loop
 
Tests 280 to 330:
Before: 5.142 uSec/loop
After: 4.464 uSec/loop
Menu: 4.619 uSec/loop

Notice that the "after" this patch set is applied eventually does
better than using the menu governor. Its processor package power
always remains less, than the menu governor.

The results can be viewed graphically at the above link, but the
most dramatic results are:

Idle state 3 above % goes from 70% to 5%.
Idle state 2 below % goes from 13% to less than 1%.
 
... Doug



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
  2021-07-04 21:01 ` [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Doug Smythies
@ 2021-07-05 13:24   ` Rafael J. Wysocki
  0 siblings, 0 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-07-05 13:24 UTC (permalink / raw)
  To: Doug Smythies; +Cc: Rafael J. Wysocki, LKML, Linux PM

On Sun, Jul 4, 2021 at 11:01 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Rafael,
>
> On 2021.06.02 11:14 Rafael J. Wysocki wrote:
>
> > Hi All,
> >
> > This series of patches addresses some theoretical shortcoming in the
> > TEO (Timer Events Oriented) cpuidle governor by reworking its idle
> > state selection logic to some extent.
> >
> > Patches [1-2/5] are introductory cleanups and the substantial changes are
> > made in patches [3-4/5] (please refer to the changelogs of these two
> > patches for details).  The last patch only deals with documentation.
> >
> > Even though this work is mostly based on theoretical considerations, it
> > shows a measurable reduction of the number of cases in which the
> > shallowest idle state is selected while it would be more beneficial
> > to select a deeper one or the deepest idle state is selected while
> > it would be more beneficial to select a shallower one, which should
> > be a noticeable improvement.
>
> Do you have any test results to share? Or test methods that I can try?
> I have done a few tests, and generally don't notice much difference.
> Perhaps an increase in idle state 2 below (was too shallow) numbers.
> I am searching for some results that would offset the below:
>
> The difficulty I am having with this patch set is the additional overhead
> which becomes significant at the extremes, where idle state 0 is dominant.
> Throughout the history of teo, I have used multiple one core pipe-tests
> for this particular test. Some results:
>
> CPU: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> HWP: disabled
> CPU frequency scaling driver: intel_pstate, active, powersave
> Pipe-tests are run forever, printing the average loop time for the
> last 2.5 million loops. 1021 of those are again averaged.
> Total = 2.5525e9 loops
> The power and idle data is sampled for 100 minutes.
>
> Note 1: other tests were also done and also with passive,
> schedutil, but it isn't relevant for this test because the
> CPU frequency stays pinned at maximum.
>
> Note 2: I use TCC offset for thermal throttling, but I disabled it
> for these tests, because the temperature needed to go higher
> than my normal throttling point.
>
> Idle configuration 1: As a COMETLAKE processor, with 4 idle states.
> Kernel 5.13-RC4.
>
> Before patch set average:
> 2.8014 uSec/loop
> 113.9 watts
> Idle state 0 residency: 9.450%
> Idle state 0 entries per minute: 256,812,896.6
>
> After patch set average:
> 2.8264 uSec/loop, 0.89% slower
> 114.0 watts
> Idle state 0 residency: 8.677%
> Idle state 0 entries per minute: 254,560,049.9
>
> Menu governor:
> 2.8051 uSec/loop, 0.13% slower
> 113.9 watts
> Idle state 0 residency: 8.437%
> Idle state 0 entries per minute: 256,436,417.2
>
> O.K., perhaps not so bad, but also not many idle states.
>
> Idle configuration 2: As a SKYLAKE processor, with 9 idle states.
> i.e.:
> /drivers/idle/intel_idle.c
> static const struct x86_cpu_id intel_idle_ids[] __initconst
> ...
>    X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, &idle_cpu_skx),
> + X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE, &idle_cpu_skl),
>
> Purpose: To demonstrate increasing overhead as a function of number
> of idle states.
> Kernel 5.13.
>
> Before patch set average:
> 2.8394 uSec/loop
> 114.2 watts
> Idle state 0 residency: 7.212%
> Idle state 0 entries per minute: 253,391,954.3
>
> After patch set average:
> 2.9103 uSec/loop, 2.5% slower
> 114.4 watts, 0.18% more
> Idle state 0 residency: 6.152%, 14.7% less.
> Idle state 0 entries per minute: 244,024,752.1
>
> Menu governor:
> 2.8141 uSec/loop, 0.89% faster
> 113.9 watts,  0.26% less
> Idle state 0 residency: 7.167%, 0.6% less
> Idle state 0 entries per minute: 255,650,610.7
>
> Another potentially interesting test was the ebizzy test:
> Records per second, averaged over many tests, varying
> threads and intervals:
>
> passive, schedutil:
> Before: 6771.977
> After: 5502.643, -18.7%
> Menu: 10728.89, +58.4%
>
> Active, powersave:
> Before: 8361.82
> After: 8463.31, +1.2%
> Menu: 8225.58, -1.6%
>
> I think it has more to do with CPU scaling governors
> than this patch set, so:
>
> performance:
> Before: 12137.33
> After: 12083.26, -0.4%
> Menu: 11983.73, -1.3%
>
> These and other test results available here:
> (encoded to prevent a barrage of bots)
>
> double u double u double u dot smythies dot com
> /~doug/linux/idle/teo-2021-06/
>
> ... a day later ...
>
> I might have an answer to my own question.
> By switching to cross core pipe-tests, and only loading down one
> CPU per core, I was able to get a lot more activity in other idle states.
> The test runs for 100 minutes, and the results change with time, but
> I'll leave that investigation for another day (there is no throttling):
>
> 1st 50 tests:
> Before: 3.888 uSec/loop
> After: 3.764 uSec/loop
> Menu: 3.464 uSec/loop
>
> Tests 50 to 100:
> Before: 4.329 uSec/loop
> After: 3.919 uSec/loop
> Menu: 3.514 uSec/loop
>
> Tests 200 to 250:
> Before: 5.089 uSec/loop
> After: 4.364 uSec/loop
> Menu: 4.619 uSec/loop
>
> Tests 280 to 330:
> Before: 5.142 uSec/loop
> After: 4.464 uSec/loop
> Menu: 4.619 uSec/loop
>
> Notice that the "after" this patch set is applied eventually does
> better than using the menu governor. Its processor package power
> always remains less, than the menu governor.

That's good news, thanks!

> The results can be viewed graphically at the above link, but the
> most dramatic results are:
>
> Idle state 3 above % goes from 70% to 5%.
> Idle state 2 below % goes from 13% to less than 1%.

This also looks promising.

Thank you for all of the results!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
  2021-06-02 18:14 [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Rafael J. Wysocki
                   ` (5 preceding siblings ...)
  2021-07-04 21:01 ` [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Doug Smythies
@ 2021-07-27 20:06 ` Doug Smythies
  2021-07-28 13:52   ` Rafael J. Wysocki
  6 siblings, 1 reply; 17+ messages in thread
From: Doug Smythies @ 2021-07-27 20:06 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'LKML', 'Linux PM', Doug Smythies

Hi Rafael,

Further to my reply of 2021.07.04 on this, I have
continued to work with and test this patch set.

On 2021.06.02 11:14 Rafael J. Wysocki wrote:

>This series of patches addresses some theoretical shortcoming in the
> TEO (Timer Events Oriented) cpuidle governor by reworking its idle
> state selection logic to some extent.
>
> Patches [1-2/5] are introductory cleanups and the substantial changes are
> made in patches [3-4/5] (please refer to the changelogs of these two
> patches for details).  The last patch only deals with documentation.
>
> Even though this work is mostly based on theoretical considerations, it
> shows a measurable reduction of the number of cases in which the shallowest
> idle state is selected while it would be more beneficial to select a deeper
> one or the deepest idle state is selected while it would be more beneficial to
> select a shallower one, which should be a noticeable improvement.

I am concentrating on the idle state 0 and 1 area.
When I disable idle state 0, the expectation is that its
usage will fall to idle state 1. It doesn't.

Conditions:
CPU: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
HWP: disabled
CPU frequency scaling driver: intel_pstate, active
CPU frequency scaling governor: performance.
Idle configuration: As a COMETLAKE processor, with 4 idle states.
Sample time for below: 1 minute.
Workflow: Cross core named pipe token passing, 12 threads.

Kernel 5.14-rc3: idle: teo governor

All idle states enabled: PASS
Processor: 97 watts
Idle state 0 entries: 811151
Idle state 1 entries: 140300776
Idle state 2 entries: 889
Idle state 3 entries: 8

Idle state 0 disabled: FAIL <<<<<
Processor: 96 watts
Idle state 0 entries: 0
Idle state 1 entries: 65599283
Idle state 2 entries: 364399
Idle state 3 entries: 65112651

Kernel 5.14-rc3: idle: menu governor

All idle states enabled: PASS
Processor: 102 watts
Idle state 0 entries: 169320747
Idle state 1 entries: 1860110
Idle state 2 entries: 14
Idle state 3 entries: 54

Idle state 0 disabled: PASS
Processor: 96.7 watts
Idle state 0 entries: 0
Idle state 1 entries: 141936790
Idle state 2 entries: 0
Idle state 3 entries: 6

Prior to this patch set:
Kernel 5.13: idle: teo governor

All idle states enabled: PASS
Processor: 97 watts
Idle state 0 entries: 446735
Idle state 1 entries: 140903027
Idle state 2 entries: 0
Idle state 3 entries: 0

Idle state 0 disabled: PASS
Processor: 96 watts
Idle state 0 entries: 0
Idle state 1 entries: 139308125
Idle state 2 entries: 0
Idle state 3 entries: 0

I haven't tried to isolate the issue in the code yet,
nor have I checked whether there might
be other issues with disabled idle states.

... Doug



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
  2021-07-27 20:06 ` Doug Smythies
@ 2021-07-28 13:52   ` Rafael J. Wysocki
  2021-07-28 17:47     ` Rafael J. Wysocki
  0 siblings, 1 reply; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-07-28 13:52 UTC (permalink / raw)
  To: Doug Smythies; +Cc: Rafael J. Wysocki, LKML, Linux PM

On Tue, Jul 27, 2021 at 10:06 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Rafael,
>
> Further to my reply of 2021.07.04  on this, I have
> continued to work with and test this patch set.
>
> On 2021.06.02 11:14 Rafael J. Wysocki wrote:
>
> >This series of patches addresses some theoretical shortcoming in the
> > TEO (Timer Events Oriented) cpuidle governor by reworking its idle
> > state selection logic to some extent.
> >
> > Patches [1-2/5] are introductory cleanups and the substantial changes are
> > made in patches [3-4/5] (please refer to the changelogs of these two
> > patches for details).  The last patch only deals with documentation.
> >
> > Even though this work is mostly based on theoretical considerations, it
> > shows a measurable reduction of the number of cases in which the shallowest
> > idle state is selected while it would be more beneficial to select a deeper
> > one or the deepest idle state is selected while it would be more beneficial to
> > select a shallower one, which should be a noticeable improvement.
>
> I am concentrating in the idle state 0 and 1 area.
> When I disable idle state 0, the expectation is its
> usage will fall to idle state 1. It doesn't.
>
> Conditions:
> CPU: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> HWP: disabled
> CPU frequency scaling driver: intel_pstate, active
> CPU frequency scaling governor: performance.
> Idle configuration: As a COMETLAKE processor, with 4 idle states.
> Sample time for below: 1 minute.
> Workflow: Cross core named pipe token passing, 12 threads.
>
> Kernel 5.14-rc3: idle: teo governor
>
> All idle states enabled: PASS
> Processor: 97 watts
> Idle state 0 entries: 811151
> Idle state 1 entries: 140300776
> Idle state 2 entries: 889
> Idle state 3 entries: 8
>
> Idle state 0 disabled: FAIL <<<<<
> Processor: 96 watts
> Idle state 0 entries: 0
> Idle state 1 entries: 65599283
> Idle state 2 entries: 364399
> Idle state 3 entries: 65112651

This looks odd.

Thanks for the report, I'll take a look at this.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
  2021-07-28 13:52   ` Rafael J. Wysocki
@ 2021-07-28 17:47     ` Rafael J. Wysocki
  2021-07-29  6:34       ` Doug Smythies
  0 siblings, 1 reply; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-07-28 17:47 UTC (permalink / raw)
  To: Doug Smythies; +Cc: Rafael J. Wysocki, LKML, Linux PM

On Wednesday, July 28, 2021 3:52:51 PM CEST Rafael J. Wysocki wrote:
> On Tue, Jul 27, 2021 at 10:06 PM Doug Smythies <dsmythies@telus.net> wrote:
> >
> > Hi Rafael,
> >
> > Further to my reply of 2021.07.04  on this, I have
> > continued to work with and test this patch set.
> >
> > On 2021.06.02 11:14 Rafael J. Wysocki wrote:
> >
> > >This series of patches addresses some theoretical shortcoming in the
> > > TEO (Timer Events Oriented) cpuidle governor by reworking its idle
> > > state selection logic to some extent.
> > >
> > > Patches [1-2/5] are introductory cleanups and the substantial changes are
> > > made in patches [3-4/5] (please refer to the changelogs of these two
> > > patches for details).  The last patch only deals with documentation.
> > >
> > > Even though this work is mostly based on theoretical considerations, it
> > > shows a measurable reduction of the number of cases in which the shallowest
> > > idle state is selected while it would be more beneficial to select a deeper
> > > one or the deepest idle state is selected while it would be more beneficial to
> > > select a shallower one, which should be a noticeable improvement.
> >
> > I am concentrating in the idle state 0 and 1 area.
> > When I disable idle state 0, the expectation is its
> > usage will fall to idle state 1. It doesn't.
> >
> > Conditions:
> > CPU: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> > HWP: disabled
> > CPU frequency scaling driver: intel_pstate, active
> > CPU frequency scaling governor: performance.
> > Idle configuration: As a COMETLAKE processor, with 4 idle states.
> > Sample time for below: 1 minute.
> > Workflow: Cross core named pipe token passing, 12 threads.
> >
> > Kernel 5.14-rc3: idle: teo governor
> >
> > All idle states enabled: PASS
> > Processor: 97 watts
> > Idle state 0 entries: 811151
> > Idle state 1 entries: 140300776
> > Idle state 2 entries: 889
> > Idle state 3 entries: 8
> >
> > Idle state 0 disabled: FAIL <<<<<
> > Processor: 96 watts
> > Idle state 0 entries: 0
> > Idle state 1 entries: 65599283
> > Idle state 2 entries: 364399
> > Idle state 3 entries: 65112651
> 
> This looks odd.
> 
> Thanks for the report, I'll take a look at this.

I have found an issue in the code that may be responsible for the
observed behavior and should be addressed by the appended patch (not
tested yet).

Basically, the "disabled" check in the second loop over states in
teo_select() needs to exclude the first enabled state, because
there are no more states to check after that.

Plus the time span check needs to be done when the given state
is about to be selected, because otherwise the function may end up
returning a state for which the sums are too low.

Thanks!

---
 drivers/cpuidle/governors/teo.c |   26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/teo.c
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -404,25 +404,27 @@ static int teo_select(struct cpuidle_dri
 			intercept_sum += bin->intercepts;
 			recent_sum += bin->recent;
 
-			if (dev->states_usage[i].disable)
+			if (dev->states_usage[i].disable && i > idx0)
 				continue;
 
 			span_ns = teo_middle_of_bin(i, drv);
-			if (!teo_time_ok(span_ns)) {
-				/*
-				 * The current state is too shallow, so select
-				 * the first enabled deeper state.
-				 */
-				duration_ns = last_enabled_span_ns;
-				idx = last_enabled_idx;
-				break;
-			}
 
 			if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
 			    (!alt_intercepts ||
 			     2 * intercept_sum > idx_intercept_sum)) {
-				idx = i;
-				duration_ns = span_ns;
+				if (!teo_time_ok(span_ns) ||
+				    dev->states_usage[i].disable) {
+					/*
+					 * The current state is too shallow or
+					 * disabled, so select the first enabled
+					 * deeper state.
+					 */
+					duration_ns = last_enabled_span_ns;
+					idx = last_enabled_idx;
+				} else {
+					idx = i;
+					duration_ns = span_ns;
+				}
 				break;
 			}
 

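The control flow of the fix above can be sketched in isolation as
follows (assumptions: the intercepts/recent sums test is reduced to a
plain "candidate" flag and the last_enabled_idx bookkeeping is
simplified — this illustrates the fallback logic only, it is not the
kernel code):

```c
#include <stdbool.h>
#include <assert.h>

struct state {
	bool disabled;
	bool time_ok;	/* stand-in for teo_time_ok(span_ns) */
	bool candidate;	/* stand-in for the intercepts/recent test */
};

/*
 * Scan down from idx - 1 toward idx0.  When a candidate state is
 * found, select it unless it is disabled or too shallow, in which
 * case fall back to the last enabled, time-ok deeper state seen so
 * far (mirroring the patch: the checks are applied only when a state
 * is about to be selected, and a disabled first state is no longer
 * skipped over).
 */
static int select_state(const struct state *s, int idx, int idx0)
{
	int last_enabled_idx = idx;

	for (int i = idx - 1; i >= idx0; i--) {
		if (s[i].disabled && i > idx0)
			continue;

		if (s[i].candidate) {
			if (!s[i].time_ok || s[i].disabled)
				return last_enabled_idx;
			return i;
		}

		if (!s[i].disabled && s[i].time_ok)
			last_enabled_idx = i;
	}
	return idx;
}
```

With state 0 disabled but flagged as the candidate, this falls back to
state 1 instead of returning the deepest state, which is the behavior
the patch aims for.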



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
  2021-07-28 17:47     ` Rafael J. Wysocki
@ 2021-07-29  6:34       ` Doug Smythies
  2021-07-29 15:23         ` Doug Smythies
  2021-07-29 16:14         ` Rafael J. Wysocki
  0 siblings, 2 replies; 17+ messages in thread
From: Doug Smythies @ 2021-07-29  6:34 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Rafael J. Wysocki, LKML, Linux PM, dsmythies

On Wed, Jul 28, 2021 at 10:47 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Wednesday, July 28, 2021 3:52:51 PM CEST Rafael J. Wysocki wrote:
> > On Tue, Jul 27, 2021 at 10:06 PM Doug Smythies <dsmythies@telus.net> wrote:
> > >
> > > Hi Rafael,
> > >
> > > Further to my reply of 2021.07.04  on this, I have
> > > continued to work with and test this patch set.
> > >
> > > On 2021.06.02 11:14 Rafael J. Wysocki wrote:
> > >
> > > >This series of patches addresses some theoretical shortcoming in the
> > > > TEO (Timer Events Oriented) cpuidle governor by reworking its idle
> > > > state selection logic to some extent.
> > > >
> > > > Patches [1-2/5] are introductory cleanups and the substantial changes are
> > > > made in patches [3-4/5] (please refer to the changelogs of these two
> > > > patches for details).  The last patch only deals with documentation.
> > > >
> > > > Even though this work is mostly based on theoretical considerations, it
> > > > shows a measurable reduction of the number of cases in which the shallowest
> > > > idle state is selected while it would be more beneficial to select a deeper
> > > > one or the deepest idle state is selected while it would be more beneficial to
> > > > select a shallower one, which should be a noticeable improvement.
> > >
> > > I am concentrating in the idle state 0 and 1 area.
> > > When I disable idle state 0, the expectation is its
> > > usage will fall to idle state 1. It doesn't.
> > >
> > > Conditions:
> > > CPU: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> > > HWP: disabled
> > > CPU frequency scaling driver: intel_pstate, active
> > > CPU frequency scaling governor: performance.
> > > Idle configuration: As a COMETLAKE processor, with 4 idle states.
> > > Sample time for below: 1 minute.
> > > Workflow: Cross core named pipe token passing, 12 threads.
> > >
> > > Kernel 5.14-rc3: idle: teo governor
> > >
> > > All idle states enabled: PASS
> > > Processor: 97 watts
> > > Idle state 0 entries: 811151
> > > Idle state 1 entries: 140300776
> > > Idle state 2 entries: 889
> > > Idle state 3 entries: 8
> > >
> > > Idle state 0 disabled: FAIL <<<<<
> > > Processor: 96 watts
> > > Idle state 0 entries: 0
> > > Idle state 1 entries: 65599283
> > > Idle state 2 entries: 364399
> > > Idle state 3 entries: 65112651
> >
> > This looks odd.
> >
> > Thanks for the report, I'll take a look at this.
>
> I have found an issue in the code that may be responsible for the
> observed behavior and should be addressed by the appended patch (not
> tested yet).
>
> Basically, the "disabled" check in the second loop over states in
> teo_select() needs to exclude the first enabled state, because
> there are no more states to check after that.
>
> Plus the time span check needs to be done when the given state
> is about to be selected, because otherwise the function may end up
> returning a state for which the sums are too low.
>
> Thanks!
>
> ---
>  drivers/cpuidle/governors/teo.c |   26 ++++++++++++++------------
>  1 file changed, 14 insertions(+), 12 deletions(-)
>
> Index: linux-pm/drivers/cpuidle/governors/teo.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/teo.c
> +++ linux-pm/drivers/cpuidle/governors/teo.c
> @@ -404,25 +404,27 @@ static int teo_select(struct cpuidle_dri
>                         intercept_sum += bin->intercepts;
>                         recent_sum += bin->recent;
>
> -                       if (dev->states_usage[i].disable)
> +                       if (dev->states_usage[i].disable && i > idx0)
>                                 continue;
>
>                         span_ns = teo_middle_of_bin(i, drv);
> -                       if (!teo_time_ok(span_ns)) {
> -                               /*
> -                                * The current state is too shallow, so select
> -                                * the first enabled deeper state.
> -                                */
> -                               duration_ns = last_enabled_span_ns;
> -                               idx = last_enabled_idx;
> -                               break;
> -                       }
>
>                         if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
>                             (!alt_intercepts ||
>                              2 * intercept_sum > idx_intercept_sum)) {
> -                               idx = i;
> -                               duration_ns = span_ns;
> +                               if (!teo_time_ok(span_ns) ||
> +                                   dev->states_usage[i].disable) {
> +                                       /*
> +                                        * The current state is too shallow or
> +                                        * disabled, so select the first enabled
> +                                        * deeper state.
> +                                        */
> +                                       duration_ns = last_enabled_span_ns;
> +                                       idx = last_enabled_idx;
> +                               } else {
> +                                       idx = i;
> +                                       duration_ns = span_ns;
> +                               }
>                                 break;
>                         }

Hi Rafael,

I tried the patch, and when I disabled idle state 0 I
got results very similar to before:

Idle state 0 disabled: FAIL
Processor: 95 watts
Idle state 0 entries: 0
Idle state 1 entries: 65,475,534
Idle state 2 entries: 333,144
Idle state 3 entries: 65,247,048

However, I accidentally left it running for about 30 minutes
and noticed:

Idle state 0 disabled:
Processor: 83 watts
Idle state 0 entries: 0
Idle state 1 entries: 88,706,831
Idle state 2 entries: 100
Idle state 3 entries: 662

I went back to unmodified kernel 5.13-rc3 and
let it run longer with idle state 0 disabled, and
after 30 minutes it had changed but nowhere
near as much:

Idle state 0 disabled:
Processor: 87 watts
Idle state 0 entries: 0
Idle state 1 entries: 70,361,020
Idle state 2 entries: 71,219
Idle state 3 entries: 27,249,975

... Doug

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
  2021-07-29  6:34       ` Doug Smythies
@ 2021-07-29 15:23         ` Doug Smythies
  2021-07-29 16:18           ` Rafael J. Wysocki
  2021-07-29 16:14         ` Rafael J. Wysocki
  1 sibling, 1 reply; 17+ messages in thread
From: Doug Smythies @ 2021-07-29 15:23 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Rafael J. Wysocki, LKML, Linux PM, dsmythies

On Wed, Jul 28, 2021 at 11:34 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> On Wed, Jul 28, 2021 at 10:47 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > On Wednesday, July 28, 2021 3:52:51 PM CEST Rafael J. Wysocki wrote:
> > > On Tue, Jul 27, 2021 at 10:06 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > >
> > > > Hi Rafael,
> > > >
> > > > Further to my reply of 2021.07.04  on this, I have
> > > > continued to work with and test this patch set.
> > > >
> > > > On 2021.06.02 11:14 Rafael J. Wysocki wrote:
> > > >
> > > > >This series of patches addresses some theoretical shortcoming in the
> > > > > TEO (Timer Events Oriented) cpuidle governor by reworking its idle
> > > > > state selection logic to some extent.
> > > > >
> > > > > Patches [1-2/5] are introductory cleanups and the substantial changes are
> > > > > made in patches [3-4/5] (please refer to the changelogs of these two
> > > > > patches for details).  The last patch only deals with documentation.
> > > > >
> > > > > Even though this work is mostly based on theoretical considerations, it
> > > > > shows a measurable reduction of the number of cases in which the shallowest
> > > > > idle state is selected while it would be more beneficial to select a deeper
> > > > > one or the deepest idle state is selected while it would be more beneficial to
> > > > > select a shallower one, which should be a noticeable improvement.
> > > >
> > > > I am concentrating in the idle state 0 and 1 area.
> > > > When I disable idle state 0, the expectation is its
> > > > usage will fall to idle state 1. It doesn't.
> > > >
> > > > Conditions:
> > > > CPU: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> > > > HWP: disabled
> > > > CPU frequency scaling driver: intel_pstate, active
> > > > CPU frequency scaling governor: performance.
> > > > Idle configuration: As a COMETLAKE processor, with 4 idle states.
> > > > Sample time for below: 1 minute.
> > > > Workflow: Cross core named pipe token passing, 12 threads.
> > > >
> > > > Kernel 5.14-rc3: idle: teo governor
> > > >
> > > > All idle states enabled: PASS
> > > > Processor: 97 watts
> > > > Idle state 0 entries: 811151
> > > > Idle state 1 entries: 140300776
> > > > Idle state 2 entries: 889
> > > > Idle state 3 entries: 8
> > > >
> > > > Idle state 0 disabled: FAIL <<<<<
> > > > Processor: 96 watts
> > > > Idle state 0 entries: 0
> > > > Idle state 1 entries: 65599283
> > > > Idle state 2 entries: 364399
> > > > Idle state 3 entries: 65112651
> > >
> > > This looks odd.
> > >
> > > Thanks for the report, I'll take a look at this.
> >
> > I have found an issue in the code that may be responsible for the
> > observed behavior and should be addressed by the appended patch (not
> > tested yet).
> >
> > Basically, the "disabled" check in the second loop over states in
> > teo_select() needs to exclude the first enabled state, because
> > there are no more states to check after that.
> >
> > Plus the time span check needs to be done when the given state
> > is about to be selected, because otherwise the function may end up
> > returning a state for which the sums are too low.
> >
> > Thanks!
> >
> > ---
> >  drivers/cpuidle/governors/teo.c |   26 ++++++++++++++------------
> >  1 file changed, 14 insertions(+), 12 deletions(-)
> >
> > Index: linux-pm/drivers/cpuidle/governors/teo.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpuidle/governors/teo.c
> > +++ linux-pm/drivers/cpuidle/governors/teo.c
> > @@ -404,25 +404,27 @@ static int teo_select(struct cpuidle_dri
> >                         intercept_sum += bin->intercepts;
> >                         recent_sum += bin->recent;
> >
> > -                       if (dev->states_usage[i].disable)
> > +                       if (dev->states_usage[i].disable && i > idx0)
> >                                 continue;
> >
> >                         span_ns = teo_middle_of_bin(i, drv);
> > -                       if (!teo_time_ok(span_ns)) {
> > -                               /*
> > -                                * The current state is too shallow, so select
> > -                                * the first enabled deeper state.
> > -                                */
> > -                               duration_ns = last_enabled_span_ns;
> > -                               idx = last_enabled_idx;
> > -                               break;
> > -                       }
> >
> >                         if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
> >                             (!alt_intercepts ||
> >                              2 * intercept_sum > idx_intercept_sum)) {
> > -                               idx = i;
> > -                               duration_ns = span_ns;
> > +                               if (!teo_time_ok(span_ns) ||
> > +                                   dev->states_usage[i].disable) {
> > +                                       /*
> > +                                        * The current state is too shallow or
> > +                                        * disabled, so select the first enabled
> > +                                        * deeper state.
> > +                                        */
> > +                                       duration_ns = last_enabled_span_ns;
> > +                                       idx = last_enabled_idx;
> > +                               } else {
> > +                                       idx = i;
> > +                                       duration_ns = span_ns;
> > +                               }
> >                                 break;
> >                         }
>
> Hi Rafael,
>
> I tried the patch and when I disabled idle state 0
> got, very similar to before:
>
> Idle state 0 disabled: FAIL
> Processor: 95 watts
> Idle state 0 entries: 0
> Idle state 1 entries: 65,475,534
> Idle state 2 entries: 333144
> Idle state 3 entries: 65,247,048
>
> However, I accidently left it for about 30 minutes
> and noticed:
>
> Idle state 0 disabled:
> Processor: 83 watts
> Idle state 0 entries: 0
> Idle state 1 entries: 88,706,831
> Idle state 2 entries: 100
> Idle state 3 entries: 662
>
> I went back to unmodified kernel 5.13-rc3 and

Sorry, 5.14-rc3.

> let it run longer with idle state 0 disabled, and
> after 30 minutes it had changed but nowhere
> near as much:
>
> Idle state 0 disabled:
> Processor: 87 watts
> Idle state 0 entries: 0
> Idle state 1 entries: 70,361,020
> Idle state 2 entries: 71219
> Idle state 3 entries: 27,249,975

Addendum: So far the workflow used for this
thread has been event-based. If I switch to
a timer-based workflow, everything works as
expected for both kernels: 5.14-rc3 unmodified
and 5.14-rc3 modified with the patch from this thread.

... Doug

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
  2021-07-29  6:34       ` Doug Smythies
  2021-07-29 15:23         ` Doug Smythies
@ 2021-07-29 16:14         ` Rafael J. Wysocki
  2021-07-30  3:36           ` Doug Smythies
  1 sibling, 1 reply; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-07-29 16:14 UTC (permalink / raw)
  To: Doug Smythies; +Cc: Rafael J. Wysocki, LKML, Linux PM, dsmythies

On Thursday, July 29, 2021 8:34:37 AM CEST Doug Smythies wrote:
> On Wed, Jul 28, 2021 at 10:47 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > On Wednesday, July 28, 2021 3:52:51 PM CEST Rafael J. Wysocki wrote:
> > > On Tue, Jul 27, 2021 at 10:06 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > >
> > > > Hi Rafael,
> > > >
> > > > Further to my reply of 2021.07.04  on this, I have
> > > > continued to work with and test this patch set.
> > > >
> > > > On 2021.06.02 11:14 Rafael J. Wysocki wrote:
> > > >
> > > > >This series of patches addresses some theoretical shortcoming in the
> > > > > TEO (Timer Events Oriented) cpuidle governor by reworking its idle
> > > > > state selection logic to some extent.
> > > > >
> > > > > Patches [1-2/5] are introductory cleanups and the substantial changes are
> > > > > made in patches [3-4/5] (please refer to the changelogs of these two
> > > > > patches for details).  The last patch only deals with documentation.
> > > > >
> > > > > Even though this work is mostly based on theoretical considerations, it
> > > > > shows a measurable reduction of the number of cases in which the shallowest
> > > > > idle state is selected while it would be more beneficial to select a deeper
> > > > > one or the deepest idle state is selected while it would be more beneficial to
> > > > > select a shallower one, which should be a noticeable improvement.
> > > >
> > > > I am concentrating in the idle state 0 and 1 area.
> > > > When I disable idle state 0, the expectation is its
> > > > usage will fall to idle state 1. It doesn't.
> > > >
> > > > Conditions:
> > > > CPU: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> > > > HWP: disabled
> > > > CPU frequency scaling driver: intel_pstate, active
> > > > CPU frequency scaling governor: performance.
> > > > Idle configuration: As a COMETLAKE processor, with 4 idle states.
> > > > Sample time for below: 1 minute.
> > > > Workflow: Cross core named pipe token passing, 12 threads.
> > > >
> > > > Kernel 5.14-rc3: idle: teo governor
> > > >
> > > > All idle states enabled: PASS
> > > > Processor: 97 watts
> > > > Idle state 0 entries: 811151
> > > > Idle state 1 entries: 140300776
> > > > Idle state 2 entries: 889
> > > > Idle state 3 entries: 8
> > > >
> > > > Idle state 0 disabled: FAIL <<<<<
> > > > Processor: 96 watts
> > > > Idle state 0 entries: 0
> > > > Idle state 1 entries: 65599283
> > > > Idle state 2 entries: 364399
> > > > Idle state 3 entries: 65112651
> > >
> > > This looks odd.
> > >
> > > Thanks for the report, I'll take a look at this.
> >
> > I have found an issue in the code that may be responsible for the
> > observed behavior and should be addressed by the appended patch (not
> > tested yet).
> >
> > Basically, the "disabled" check in the second loop over states in
> > teo_select() needs to exclude the first enabled state, because
> > there are no more states to check after that.
> >
> > Plus the time span check needs to be done when the given state
> > is about to be selected, because otherwise the function may end up
> > returning a state for which the sums are too low.
> >
> > Thanks!
> >
> > ---
> >  drivers/cpuidle/governors/teo.c |   26 ++++++++++++++------------
> >  1 file changed, 14 insertions(+), 12 deletions(-)
> >
> > Index: linux-pm/drivers/cpuidle/governors/teo.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpuidle/governors/teo.c
> > +++ linux-pm/drivers/cpuidle/governors/teo.c
> > @@ -404,25 +404,27 @@ static int teo_select(struct cpuidle_dri
> >                         intercept_sum += bin->intercepts;
> >                         recent_sum += bin->recent;
> >
> > -                       if (dev->states_usage[i].disable)
> > +                       if (dev->states_usage[i].disable && i > idx0)
> >                                 continue;
> >
> >                         span_ns = teo_middle_of_bin(i, drv);
> > -                       if (!teo_time_ok(span_ns)) {
> > -                               /*
> > -                                * The current state is too shallow, so select
> > -                                * the first enabled deeper state.
> > -                                */
> > -                               duration_ns = last_enabled_span_ns;
> > -                               idx = last_enabled_idx;
> > -                               break;
> > -                       }
> >
> >                         if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
> >                             (!alt_intercepts ||
> >                              2 * intercept_sum > idx_intercept_sum)) {
> > -                               idx = i;
> > -                               duration_ns = span_ns;
> > +                               if (!teo_time_ok(span_ns) ||
> > +                                   dev->states_usage[i].disable) {
> > +                                       /*
> > +                                        * The current state is too shallow or
> > +                                        * disabled, so select the first enabled
> > +                                        * deeper state.
> > +                                        */
> > +                                       duration_ns = last_enabled_span_ns;
> > +                                       idx = last_enabled_idx;
> > +                               } else {
> > +                                       idx = i;
> > +                                       duration_ns = span_ns;
> > +                               }
> >                                 break;
> >                         }
> 
> Hi Rafael,

Hi Doug,

Thanks for the feedback, much appreciated!

> I tried the patch and when I disabled idle state 0
> got, very similar to before:
> 
> Idle state 0 disabled: FAIL
> Processor: 95 watts
> Idle state 0 entries: 0
> Idle state 1 entries: 65,475,534
> Idle state 2 entries: 333144
> Idle state 3 entries: 65,247,048
> 
> However, I accidently left it for about 30 minutes
> and noticed:
> 
> Idle state 0 disabled:
> Processor: 83 watts
> Idle state 0 entries: 0
> Idle state 1 entries: 88,706,831
> Idle state 2 entries: 100
> Idle state 3 entries: 662

This means that idle state 0 data are disregarded after disabling it,
most likely because the second loop in teo_select() should run
over all states down to idle state 0 (not only down to the first
enabled one).

So below is an updated patch (not tested yet).

---
 drivers/cpuidle/governors/teo.c |   28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/teo.c
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -397,32 +397,34 @@ static int teo_select(struct cpuidle_dri
 		intercept_sum = 0;
 		recent_sum = 0;
 
-		for (i = idx - 1; i >= idx0; i--) {
+		for (i = idx - 1; i >= 0; i--) {
 			struct teo_bin *bin = &cpu_data->state_bins[i];
 			s64 span_ns;
 
 			intercept_sum += bin->intercepts;
 			recent_sum += bin->recent;
 
-			if (dev->states_usage[i].disable)
+			if (dev->states_usage[i].disable && i > 0)
 				continue;
 
 			span_ns = teo_middle_of_bin(i, drv);
-			if (!teo_time_ok(span_ns)) {
-				/*
-				 * The current state is too shallow, so select
-				 * the first enabled deeper state.
-				 */
-				duration_ns = last_enabled_span_ns;
-				idx = last_enabled_idx;
-				break;
-			}
 
 			if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
 			    (!alt_intercepts ||
 			     2 * intercept_sum > idx_intercept_sum)) {
-				idx = i;
-				duration_ns = span_ns;
+				if (!teo_time_ok(span_ns) ||
+				    dev->states_usage[i].disable) {
+					/*
+					 * The current state is too shallow or
+					 * disabled, so select the first enabled
+					 * deeper state.
+					 */
+					duration_ns = last_enabled_span_ns;
+					idx = last_enabled_idx;
+				} else {
+					idx = i;
+					duration_ns = span_ns;
+				}
 				break;
 			}
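For illustration only, the fallback logic in the hunk above can be modeled as a
standalone sketch outside the kernel tree. All names, bin values and thresholds
below are invented and only approximate the real teo_select() data structures;
this is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_STATES 4

/* Toy stand-in for struct teo_bin: per-state wakeup statistics. */
struct bin { int intercepts; int recent; };

/*
 * Walk the bins shallower than the candidate state "idx" down to
 * state 0, mirroring the patched loop: statistics from disabled
 * states still count toward the sums, and when the winning state
 * is disabled or too shallow, fall back to the shallowest enabled
 * deeper state seen so far.
 */
static int select_state(const struct bin bins[NR_STATES],
                        const bool disabled[NR_STATES],
                        const long long span_ns[NR_STATES],
                        long long min_span_ns, int idx,
                        int idx_intercept_sum, int idx_recent_sum)
{
	int intercept_sum = 0, recent_sum = 0;
	int last_enabled_idx = idx;

	for (int i = idx - 1; i >= 0; i--) {
		intercept_sum += bins[i].intercepts;
		recent_sum += bins[i].recent;

		/* Accumulate over disabled bins, but never skip
		 * state 0: it terminates the walk. */
		if (disabled[i] && i > 0)
			continue;

		if (2 * recent_sum > idx_recent_sum &&
		    2 * intercept_sum > idx_intercept_sum) {
			if (span_ns[i] < min_span_ns || disabled[i]) {
				/* Too shallow or disabled: fall back. */
				idx = last_enabled_idx;
			} else {
				idx = i;
			}
			break;
		}

		/* Remember this state as a fallback target only if it
		 * is enabled and its span passes the time check. */
		if (!disabled[i] && span_ns[i] >= min_span_ns)
			last_enabled_idx = i;
	}
	return idx;
}
```

With state 0 disabled and the wakeup statistics pointing at it, this model
falls back to state 1 instead of sticking with the deep candidate, which is
the behavior the updated patch restores.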
 




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
  2021-07-29 15:23         ` Doug Smythies
@ 2021-07-29 16:18           ` Rafael J. Wysocki
  0 siblings, 0 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-07-29 16:18 UTC (permalink / raw)
  To: Doug Smythies; +Cc: Rafael J. Wysocki, Rafael J. Wysocki, LKML, Linux PM

On Thu, Jul 29, 2021 at 5:24 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> On Wed, Jul 28, 2021 at 11:34 PM Doug Smythies <dsmythies@telus.net> wrote:
> >
> > On Wed, Jul 28, 2021 at 10:47 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > On Wednesday, July 28, 2021 3:52:51 PM CEST Rafael J. Wysocki wrote:
> > > > On Tue, Jul 27, 2021 at 10:06 PM Doug Smythies <dsmythies@telus.net> wrote:
> > > > >
> > > > > Hi Rafael,
> > > > >
> > > > > Further to my reply of 2021.07.04  on this, I have
> > > > > continued to work with and test this patch set.
> > > > >
> > > > > On 2021.06.02 11:14 Rafael J. Wysocki wrote:
> > > > >
> > > > > >This series of patches addresses some theoretical shortcoming in the
> > > > > > TEO (Timer Events Oriented) cpuidle governor by reworking its idle
> > > > > > state selection logic to some extent.
> > > > > >
> > > > > > Patches [1-2/5] are introductory cleanups and the substantial changes are
> > > > > > made in patches [3-4/5] (please refer to the changelogs of these two
> > > > > > patches for details).  The last patch only deals with documentation.
> > > > > >
> > > > > > Even though this work is mostly based on theoretical considerations, it
> > > > > > shows a measurable reduction of the number of cases in which the shallowest
> > > > > > idle state is selected while it would be more beneficial to select a deeper
> > > > > > one or the deepest idle state is selected while it would be more beneficial to
> > > > > > select a shallower one, which should be a noticeable improvement.
> > > > >
> > > > > I am concentrating on the idle state 0 and 1 area.
> > > > > When I disable idle state 0, the expectation is its
> > > > > usage will fall to idle state 1. It doesn't.
> > > > >
> > > > > Conditions:
> > > > > CPU: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
> > > > > HWP: disabled
> > > > > CPU frequency scaling driver: intel_pstate, active
> > > > > CPU frequency scaling governor: performance.
> > > > > Idle configuration: As a COMETLAKE processor, with 4 idle states.
> > > > > Sample time for below: 1 minute.
> > > > > Workflow: Cross core named pipe token passing, 12 threads.
> > > > >
> > > > > Kernel 5.14-rc3: idle: teo governor
> > > > >
> > > > > All idle states enabled: PASS
> > > > > Processor: 97 watts
> > > > > Idle state 0 entries: 811151
> > > > > Idle state 1 entries: 140300776
> > > > > Idle state 2 entries: 889
> > > > > Idle state 3 entries: 8
> > > > >
> > > > > Idle state 0 disabled: FAIL <<<<<
> > > > > Processor: 96 watts
> > > > > Idle state 0 entries: 0
> > > > > Idle state 1 entries: 65599283
> > > > > Idle state 2 entries: 364399
> > > > > Idle state 3 entries: 65112651
> > > >
> > > > This looks odd.
> > > >
> > > > Thanks for the report, I'll take a look at this.
> > >
> > > I have found an issue in the code that may be responsible for the
> > > observed behavior and should be addressed by the appended patch (not
> > > tested yet).
> > >
> > > Basically, the "disabled" check in the second loop over states in
> > > teo_select() needs to exclude the first enabled state, because
> > > there are no more states to check after that.
> > >
> > > Plus the time span check needs to be done when the given state
> > > is about to be selected, because otherwise the function may end up
> > > returning a state for which the sums are too low.
> > >
> > > Thanks!
> > >
> > > ---
> > >  drivers/cpuidle/governors/teo.c |   26 ++++++++++++++------------
> > >  1 file changed, 14 insertions(+), 12 deletions(-)
> > >
> > > Index: linux-pm/drivers/cpuidle/governors/teo.c
> > > ===================================================================
> > > --- linux-pm.orig/drivers/cpuidle/governors/teo.c
> > > +++ linux-pm/drivers/cpuidle/governors/teo.c
> > > @@ -404,25 +404,27 @@ static int teo_select(struct cpuidle_dri
> > >                         intercept_sum += bin->intercepts;
> > >                         recent_sum += bin->recent;
> > >
> > > -                       if (dev->states_usage[i].disable)
> > > +                       if (dev->states_usage[i].disable && i > idx0)
> > >                                 continue;
> > >
> > >                         span_ns = teo_middle_of_bin(i, drv);
> > > -                       if (!teo_time_ok(span_ns)) {
> > > -                               /*
> > > -                                * The current state is too shallow, so select
> > > -                                * the first enabled deeper state.
> > > -                                */
> > > -                               duration_ns = last_enabled_span_ns;
> > > -                               idx = last_enabled_idx;
> > > -                               break;
> > > -                       }
> > >
> > >                         if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
> > >                             (!alt_intercepts ||
> > >                              2 * intercept_sum > idx_intercept_sum)) {
> > > -                               idx = i;
> > > -                               duration_ns = span_ns;
> > > +                               if (!teo_time_ok(span_ns) ||
> > > +                                   dev->states_usage[i].disable) {
> > > +                                       /*
> > > +                                        * The current state is too shallow or
> > > +                                        * disabled, so select the first enabled
> > > +                                        * deeper state.
> > > +                                        */
> > > +                                       duration_ns = last_enabled_span_ns;
> > > +                                       idx = last_enabled_idx;
> > > +                               } else {
> > > +                                       idx = i;
> > > +                                       duration_ns = span_ns;
> > > +                               }
> > >                                 break;
> > >                         }
> >
> > Hi Rafael,
> >
> > I tried the patch and when I disabled idle state 0
> > got results very similar to before:
> >
> > Idle state 0 disabled: FAIL
> > Processor: 95 watts
> > Idle state 0 entries: 0
> > Idle state 1 entries: 65,475,534
> > Idle state 2 entries: 333,144
> > Idle state 3 entries: 65,247,048
> >
> > However, I accidentally left it for about 30 minutes
> > and noticed:
> >
> > Idle state 0 disabled:
> > Processor: 83 watts
> > Idle state 0 entries: 0
> > Idle state 1 entries: 88,706,831
> > Idle state 2 entries: 100
> > Idle state 3 entries: 662
> >
> > I went back to unmodified kernel 5.13-rc3 and
>
> Sorry, 5.14-rc3.
>
> > let it run longer with idle state 0 disabled, and
> > after 30 minutes it had changed but nowhere
> > near as much:
> >
> > Idle state 0 disabled:
> > Processor: 87 watts
> > Idle state 0 entries: 0
> > Idle state 1 entries: 70,361,020
> > Idle state 2 entries: 71,219
> > Idle state 3 entries: 27,249,975
>
> Addendum: So far the workflow used for this
> thread has been event-based. If I switch to
> a timer-based workflow, everything works as
> expected for both kernels, 5.14-rc3 unmodified
> and modified with the patch from this thread.

Yes, the affected case is when the governor selects states that are
shallower than indicated by the time till the next timer.


* Re: [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
  2021-07-29 16:14         ` Rafael J. Wysocki
@ 2021-07-30  3:36           ` Doug Smythies
  2021-07-30 13:25             ` Rafael J. Wysocki
  0 siblings, 1 reply; 17+ messages in thread
From: Doug Smythies @ 2021-07-30  3:36 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Rafael J. Wysocki, LKML, Linux PM, dsmythies

On Thu, Jul 29, 2021 at 9:14 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
... [snip]...
>
> This means that idle state 0 data are disregarded after disabling it,
> most likely because the second loop in teo_select() should
> be over all states down to idle state 0 (not only down to the first
> enabled one).
>
> So below is an updated patch (not tested yet).

Hi Rafael,

This updated patch works great / solves the problem.
Tested-by: Doug Smythies <dsmythies@telus.net>

Thank you very much.

... Doug

>
> ---
>  drivers/cpuidle/governors/teo.c |   28 +++++++++++++++-------------
>  1 file changed, 15 insertions(+), 13 deletions(-)
>
> Index: linux-pm/drivers/cpuidle/governors/teo.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/teo.c
> +++ linux-pm/drivers/cpuidle/governors/teo.c
> @@ -397,32 +397,34 @@ static int teo_select(struct cpuidle_dri
>                 intercept_sum = 0;
>                 recent_sum = 0;
>
> -               for (i = idx - 1; i >= idx0; i--) {
> +               for (i = idx - 1; i >= 0; i--) {
>                         struct teo_bin *bin = &cpu_data->state_bins[i];
>                         s64 span_ns;
>
>                         intercept_sum += bin->intercepts;
>                         recent_sum += bin->recent;
>
> -                       if (dev->states_usage[i].disable)
> +                       if (dev->states_usage[i].disable && i > 0)
>                                 continue;
>
>                         span_ns = teo_middle_of_bin(i, drv);
> -                       if (!teo_time_ok(span_ns)) {
> -                               /*
> -                                * The current state is too shallow, so select
> -                                * the first enabled deeper state.
> -                                */
> -                               duration_ns = last_enabled_span_ns;
> -                               idx = last_enabled_idx;
> -                               break;
> -                       }
>
>                         if ((!alt_recent || 2 * recent_sum > idx_recent_sum) &&
>                             (!alt_intercepts ||
>                              2 * intercept_sum > idx_intercept_sum)) {
> -                               idx = i;
> -                               duration_ns = span_ns;
> +                               if (!teo_time_ok(span_ns) ||
> +                                   dev->states_usage[i].disable) {
> +                                       /*
> +                                        * The current state is too shallow or
> +                                        * disabled, so select the first enabled
> +                                        * deeper state.
> +                                        */
> +                                       duration_ns = last_enabled_span_ns;
> +                                       idx = last_enabled_idx;
> +                               } else {
> +                                       idx = i;
> +                                       duration_ns = span_ns;
> +                               }
>                                 break;
>                         }


* Re: [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic
  2021-07-30  3:36           ` Doug Smythies
@ 2021-07-30 13:25             ` Rafael J. Wysocki
  0 siblings, 0 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2021-07-30 13:25 UTC (permalink / raw)
  To: Doug Smythies; +Cc: Rafael J. Wysocki, Rafael J. Wysocki, LKML, Linux PM

On Fri, Jul 30, 2021 at 5:36 AM Doug Smythies <dsmythies@telus.net> wrote:
>
> On Thu, Jul 29, 2021 at 9:14 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> ... [snip]...
> >
> > This means that idle state 0 data are disregarded after disabling it,
> > most likely because the second loop in teo_select() should
> > be over all states down to idle state 0 (not only down to the first
> > enabled one).
> >
> > So below is an updated patch (not tested yet).
>
> Hi Rafael,
>
> This updated patch works great / solves the problem.
> Tested-by: Doug Smythies <dsmythies@telus.net>
>
> Thank you very much.

Thank you and you're welcome!

I've found a small issue in the patch though (it needs to check the
time span before setting the "last enabled" index), so let me submit a
proper patch with a changelog based on this one.


end of thread, other threads:[~2021-07-30 13:25 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-02 18:14 [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Rafael J. Wysocki
2021-06-02 18:15 ` [PATCH v1 1/5] cpuidle: teo: Cosmetic modifications of teo_update() Rafael J. Wysocki
2021-06-02 18:15 ` [PATCH v1 2/5] cpuidle: teo: Cosmetic modification of teo_select() Rafael J. Wysocki
2021-06-02 18:16 ` [PATCH v1 3/5] cpuidle: teo: Change the main idle state selection logic Rafael J. Wysocki
2021-06-02 18:17 ` [PATCH v1 4/5] cpuidle: teo: Rework most recent idle duration values treatment Rafael J. Wysocki
2021-06-02 18:18 ` [PATCH v1 5/5] cpuidle: teo: Use kerneldoc documentation in admin-guide Rafael J. Wysocki
2021-07-04 21:01 ` [PATCH v1 0/5] cpuidle: teo: Rework the idle state selection logic Doug Smythies
2021-07-05 13:24   ` Rafael J. Wysocki
2021-07-27 20:06 ` Doug Smythies
2021-07-28 13:52   ` Rafael J. Wysocki
2021-07-28 17:47     ` Rafael J. Wysocki
2021-07-29  6:34       ` Doug Smythies
2021-07-29 15:23         ` Doug Smythies
2021-07-29 16:18           ` Rafael J. Wysocki
2021-07-29 16:14         ` Rafael J. Wysocki
2021-07-30  3:36           ` Doug Smythies
2021-07-30 13:25             ` Rafael J. Wysocki
