* [RFC 0/1] Weighted approach to gather and use history in TEO governor
@ 2020-02-22  7:00 Pratik Rajesh Sampat
  2020-02-22  7:00 ` [RFC 1/1] " Pratik Rajesh Sampat
  2020-02-25  5:13 ` [RFC 0/1] " Gautham R Shenoy
  0 siblings, 2 replies; 8+ messages in thread
From: Pratik Rajesh Sampat @ 2020-02-22  7:00 UTC (permalink / raw)
  To: linux-kernel, rafael.j.wysocki, peterz, dsmythies,
	daniel.lezcano, ego, svaidy, psampat, pratik.sampat,
	pratik.r.sampat

Currently, apart from the timer-based metric and the hit/miss/early-hit
buckets, the TEO governor also gathers a history of the last 8 idle
intervals; if a significant number of them are shorter than the current
expected idle duration, it decides whether a shallower state should be
chosen.

The current sliding history window does a fair job at prediction;
however, the hard-coded window size can limit prediction accuracy, and
growing the window increases both the space and time complexity of the
prediction linearly.

To complement the current moving window history, an approach is devised
where each idle state separately maintains a weight for itself and its
counterpart idle states to form a probability distribution.

When a decision needs to be made, the TEO governor selects an idle state
based on its timer and the hits/early-hits metrics. The probability
distribution of that selected idle state is then consulted, which gives
insight into how probable that state is to occur if picked.

The probability distribution is simply an n*n matrix, where
n = drv->state_count.
Each entry in the matrix signifies a weight for that row.
The weights range from 0 to 10000.

For example:
state_mat[1][2] = 3000 means that previously, when state 1 was selected,
the probability that state 2 will occur is 30%.
The two trailing digits provide extra resolution when increasing or
reducing the weights for correction (3000 corresponds to 30.00%).

Currently, for selection of an idle state based on probabilities, a
weighted random number generator is used to choose one of the idle
states. Naturally, the states with higher weights are more likely to be
chosen.
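
As a rough, self-contained sketch of that weighted pick (the function
name and the use of rand() are illustrative only; the actual patch works
on the per-CPU state_mat row and uses get_random_bytes()):

#include <stdlib.h>

/*
 * Pick an index in [0, n) with probability proportional to weights[].
 * Plain cumulative-sum weighted pick over one row of the matrix.
 */
static int weighted_pick(const unsigned int *weights, int n)
{
	unsigned int sum = 0, cumulative = 0, r;
	int i;

	for (i = 0; i < n; i++)
		sum += weights[i];
	if (!sum)
		return 0;

	r = rand() % sum;		/* random value in [0, sum) */

	for (i = 0; i < n; i++) {
		cumulative += weights[i];
		if (r < cumulative)	/* higher weight => wider interval */
			return i;
	}
	return n - 1;
}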

On wakeup, the weights are updated. The state the CPU should have woken
up in (which could be the hit / miss / early-hit state) has its weight
increased by LEARNING_RATE %, and the weights of the rest of the states
in that row are reduced by the same factor.
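
A minimal sketch of that update rule (the standalone form and names are
illustrative; the patch itself additionally clamps each weight so it
never rises above (100 - LEARNING_RATE)% of the scale or falls below
LEARNING_RATE% of it):

#define LEARNING_RATE	5	/* percent */

/*
 * Shift weight towards the state the CPU should have woken up in
 * (target): grow it by LEARNING_RATE % of its own weight and shrink
 * each other state in the row by LEARNING_RATE/(nr_states - 1) % of
 * its weight.
 */
static void update_row(unsigned int *row, int nr_states, int target)
{
	int i;

	if (nr_states < 2)	/* nothing to rebalance with one state */
		return;

	for (i = 0; i < nr_states; i++) {
		if (i == target)
			row[i] += (LEARNING_RATE * row[i]) / 100;
		else
			row[i] -= (LEARNING_RATE * row[i]) /
				  ((nr_states - 1) * 100);
	}
}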

The advantage of this approach is that unlimited history of idle states
can be maintained with constant overhead, which can help make more
accurate predictions when choosing idle states.

The advantage of unlimited history can also become a disadvantage: the
lifetime history for that CPU may make the weights stale and influence
the choice of idle states in ways that are no longer relevant.
Aging the weights could be a solution for that, although this RFC does
not cover its implementation.

Having this finer-grained view of the history, in addition to the
weighted randomized selection, seems to show some promise in terms of
saving power without compromising performance.

Benchmarks:
Note: Wt. TEO governor represents the governor after the proposed change

Schbench
========
Measures wakeup latencies.
Scale of measurement:
1. 99th percentile latency - usec
2. Power - Watts

Command: $ schbench -c 30000 -s 30000 -m 6 -r 30 -t <Threads>
Varying parameter: -t

Machine: IBM POWER 9

+--------+-------------+-----------------+-----------+-----------------+
| Threads| TEO latency | Wt. TEO latency | TEO power | Wt. TEO power   |
+--------+-------------+-----------------+-----------+-----------------+
| 2      | 979         | 949  ( +3.06%)  | 38        | 36  ( +5.26%)   |
| 4      | 997         | 1042 ( -4.51%)  | 51        | 39  ( +23.52%)  |
| 8      | 1158        | 1050 ( +9.32%)  | 89        | 63  ( +29.21%)  |
| 16     | 1138        | 1135 ( +0.26%)  | 105       | 117 ( -11.42%)  |
+--------+-------------+-----------------+-----------+-----------------+

Sleeping Ebizzy
===============
Program to generate workloads resembling web server workloads.
The benchmark is customized to allow for a sleep interval -i
Scale of measurement:
1. Number of records/s
2. Power - Watts

Parameters:
1. -m => Always use mmap instead of malloc
2. -M => Never use mmap
3. -S <seconds> => Number of seconds to run
4. -i <interval> => Sleep interval

Machine: IBM POWER 9

+-------------------+-------------+-------------------+-----------+---------------+
| Parameters        | TEO records | Wt. TEO records   | TEO power | Wt. TEO power |
+-------------------+-------------+-------------------+-----------+---------------+
| -S 60 -i 10000    | 1115000     | 1198081 ( +7.45%) | 149       | 150 ( -0.66%) |
| -m -S 60 -i 10000 | 15879       | 15513   ( -2.30%) | 23        | 22  ( +4.34%) |
| -M -S 60 -i 10000 | 72887       | 77546   ( +6.39%) | 104       | 103 ( +0.96%) |
+-------------------+-------------+-------------------+-----------+---------------+

Hackbench
=========
Creates a specified number of pairs of schedulable entities which
communicate via either sockets or pipes, and times how long it takes
for each pair to send data back and forth.
Scale of measurement:
1. Time (s)
2. Power (watts)

Command: Sockets: $ hackbench -l <Messages>
         Pipes  : $ hackbench --pipe -l <Messages>
Varying parameter: -l

Machine: IBM POWER 9

+----------+------------+-------------------+----------+-------------------+
| Messages | TEO socket | Wt. TEO socket    | TEO pipe | Wt. TEO pipe      |
+----------+------------+-------------------+----------+-------------------+
| 100      | 0.042      | 0.043   ( -2.32%) | 0.031    | 0.032   ( +3.12%) |
| 1000     | 0.258      | 0.272   ( +5.14%) | 0.301    | 0.312   ( -3.65%) |
| 10000    | 2.397      | 2.441   ( +1.80%) | 5.642    | 5.092   ( +9.74%) |
| 100000   | 23.691     | 23.730  ( -0.16%) | 57.762   | 57.857  ( -0.16%) |
| 1000000  | 234.103    | 233.841 ( +0.11%) | 559.807  | 592.304 ( -5.80%) |
+----------+------------+-------------------+----------+-------------------+

Power: Socket: Consistent between 135-140 watts for both TEO and Wt. TEO
       Pipe  : Consistent between 125-130 watts for both TEO and Wt. TEO



Pratik Rajesh Sampat (1):
  Weighted approach to gather and use history in TEO governor

 drivers/cpuidle/governors/teo.c | 95 +++++++++++++++++++++++++++++++--
 1 file changed, 90 insertions(+), 5 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC 1/1] Weighted approach to gather and use history in TEO governor
  2020-02-22  7:00 [RFC 0/1] Weighted approach to gather and use history in TEO governor Pratik Rajesh Sampat
@ 2020-02-22  7:00 ` Pratik Rajesh Sampat
  2020-02-25  6:18   ` Gautham R Shenoy
  2020-02-25  5:13 ` [RFC 0/1] " Gautham R Shenoy
  1 sibling, 1 reply; 8+ messages in thread
From: Pratik Rajesh Sampat @ 2020-02-22  7:00 UTC (permalink / raw)
  To: linux-kernel, rafael.j.wysocki, peterz, dsmythies,
	daniel.lezcano, ego, svaidy, psampat, pratik.sampat,
	pratik.r.sampat

Complementing the current self-correcting window algorithm, an alternate
approach is devised to leverage lifetime history with constant overhead.

Each CPU maintains a matrix in which each idle state keeps a probability
distribution over the idle states.

The probability distribution is simply an n*n matrix, where
n = drv->state_count.
Each entry in the matrix signifies a weight for that row.
The weights range from 0 to 10000.

For example:
state_mat[1][2] = 3000 means that previously, when state 1 was selected,
the probability that state 2 will occur is 30%.
The two trailing digits provide extra resolution when increasing or
reducing the weights for correction (3000 corresponds to 30.00%).

Initially, the weights are distributed such that each state has a higher
probability of choosing itself, as there is no reason to believe
otherwise yet. The initial bias towards itself is 70%, and the remaining
30% is distributed equally among the other states (for example, with
four idle states each row starts out as {7000, 1000, 1000, 1000}).

Selection of an idle state:
When the TEO governor chooses an idle state, the probability
distribution for that state is looked at. A weighted random number
generator, using the weights as bias, chooses the next idle state. The
algorithm leans towards choosing that state or a shallower one for its
next prediction.

Correction of the probability distribution:
On wakeup, the weights are updated. The state the CPU should have woken
up in (which could be the hit / miss / early-hit state) has its weight
increased by LEARNING_RATE %, and the weights of the rest of the states
in that row are reduced by the same factor.
LEARNING_RATE is experimentally chosen to be 5%.
---
 drivers/cpuidle/governors/teo.c | 95 +++++++++++++++++++++++++++++++--
 1 file changed, 90 insertions(+), 5 deletions(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index de7e706efd46..8060c287f5e4 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -50,6 +50,7 @@
 #include <linux/kernel.h>
 #include <linux/sched/clock.h>
 #include <linux/tick.h>
+#include <linux/random.h>
 
 /*
  * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
@@ -64,6 +65,12 @@
  */
 #define INTERVALS	8
 
+/*
+ * Percentage of the amount of weight to be shifted in the idle state weight
+ * distribution for correction
+ */
+#define LEARNING_RATE	5
+
 /**
  * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
  * @early_hits: "Early" CPU wakeups "matching" this state.
@@ -98,6 +105,8 @@ struct teo_idle_state {
  * @states: Idle states data corresponding to this CPU.
  * @interval_idx: Index of the most recent saved idle interval.
  * @intervals: Saved idle duration values.
+ * @state_mat: Each idle state maintains a weights corresponding to that
+ * state, storing the probablity distribution of occurance for that state
  */
 struct teo_cpu {
 	u64 time_span_ns;
@@ -105,6 +114,7 @@ struct teo_cpu {
 	struct teo_idle_state states[CPUIDLE_STATE_MAX];
 	int interval_idx;
 	u64 intervals[INTERVALS];
+	int state_mat[CPUIDLE_STATE_MAX][CPUIDLE_STATE_MAX];
 };
 
 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
@@ -117,7 +127,7 @@ static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
 static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
-	int i, idx_hit = -1, idx_timer = -1;
+	int i, idx_hit = -1, idx_timer = -1, idx = -1, last_idx = dev->last_state_idx;
 	u64 measured_ns;
 
 	if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
@@ -183,16 +193,50 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 
 		if (idx_timer > idx_hit) {
 			misses += PULSE;
-			if (idx_hit >= 0)
+			idx = idx_timer;
+			if (idx_hit >= 0) {
 				cpu_data->states[idx_hit].early_hits += PULSE;
+				idx = idx_hit;
+			}
 		} else {
 			hits += PULSE;
+			idx = last_idx;
 		}
 
 		cpu_data->states[idx_timer].misses = misses;
 		cpu_data->states[idx_timer].hits = hits;
 	}
 
+	/*
+	 * Rearrange the weight distribution of the state, increase the weight
+	 * by the LEARNING RATE % for the idle state that was supposed to be
+	 * chosen and reduce by the same amount for rest of the states
+	 *
+	 * If the weights are greater than (100 - LEARNING_RATE) % or lesser
+	 * than LEARNING_RATE %, do not increase or decrease the confidence
+	 * respectively
+	 */
+	for (i = 0; i < drv->state_count; i++) {
+		unsigned int delta;
+
+		if (idx == -1)
+			break;
+		if (i ==  idx) {
+			delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100;
+			if (cpu_data->state_mat[last_idx][i] + delta >=
+			    (100 - LEARNING_RATE) * 100)
+				continue;
+			cpu_data->state_mat[last_idx][i] += delta;
+			continue;
+		}
+		delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) /
+			((drv->state_count - 1) * 100);
+		if (cpu_data->state_mat[last_idx][i] - delta <=
+		    LEARNING_RATE * 100)
+			continue;
+		cpu_data->state_mat[last_idx][i] -= delta;
+	}
+
 	/*
 	 * Save idle duration values corresponding to non-timer wakeups for
 	 * pattern detection.
@@ -244,7 +288,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
 	u64 duration_ns;
 	unsigned int hits, misses, early_hits;
-	int max_early_idx, prev_max_early_idx, constraint_idx, idx, i;
+	int max_early_idx, prev_max_early_idx, constraint_idx, idx, i, og_idx;
 	ktime_t delta_tick;
 
 	if (dev->last_state_idx >= 0) {
@@ -374,10 +418,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	if (constraint_idx < idx)
 		idx = constraint_idx;
 
+	og_idx = idx;
+
 	if (idx < 0) {
 		idx = 0; /* No states enabled. Must use 0. */
 	} else if (idx > 0) {
-		unsigned int count = 0;
+		unsigned int count = 0, sum_weights = 0, weights_list[CPUIDLE_STATE_MAX];
+		int i, j = 0, rnd_wt, rnd_num = 0;
 		u64 sum = 0;
 
 		/*
@@ -412,6 +459,29 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 								       idx, avg_ns);
 			}
 		}
+		/*
+		 * In case, the recent history yields a shallower state, then
+		 * the probability distribution is looked at.
+		 * The weighted random number generator uses the weights as a
+		 * bias to choose the next idle state
+		 */
+		if (og_idx != idx) {
+			for (i = 0; i <= idx; i++) {
+				if (dev->states_usage[i].disable)
+					continue;
+				sum_weights += cpu_data->state_mat[idx][i];
+				weights_list[j++] = sum_weights;
+			}
+			get_random_bytes(&rnd_num, sizeof(rnd_num));
+			rnd_num = rnd_num % 100;
+			rnd_wt = (rnd_num * sum_weights) / 100;
+			for (i = 0; i < j; i++) {
+				if (rnd_wt < weights_list[i])
+					break;
+				rnd_wt -= weights_list[i];
+			}
+			idx = i;
+		}
 	}
 
 	/*
@@ -468,13 +538,28 @@ static int teo_enable_device(struct cpuidle_driver *drv,
 			     struct cpuidle_device *dev)
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
-	int i;
+	int i, j;
 
 	memset(cpu_data, 0, sizeof(*cpu_data));
 
 	for (i = 0; i < INTERVALS; i++)
 		cpu_data->intervals[i] = U64_MAX;
 
+	/*
+	 * Populate initial weights for each state
+	 * The stop state is initially more biased for itself.
+	 *
+	 * Currently the initial distribution of probabilities are 70%-30%.
+	 * The trailing 0s are for increased resolution.
+	 */
+	for (i = 0; i < drv->state_count; i++) {
+		for (j = 0; j < drv->state_count; j++) {
+			if (i == j)
+				cpu_data->state_mat[i][j] = 7000;
+			else
+				cpu_data->state_mat[i][j] = 3000 / (drv->state_count - 1);
+		}
+	}
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC 0/1] Weighted approach to gather and use history in TEO governor
  2020-02-22  7:00 [RFC 0/1] Weighted approach to gather and use history in TEO governor Pratik Rajesh Sampat
  2020-02-22  7:00 ` [RFC 1/1] " Pratik Rajesh Sampat
@ 2020-02-25  5:13 ` Gautham R Shenoy
  2020-02-27 16:14   ` Doug Smythies
  2020-02-29  8:58   ` Pratik Sampat
  1 sibling, 2 replies; 8+ messages in thread
From: Gautham R Shenoy @ 2020-02-25  5:13 UTC (permalink / raw)
  To: Pratik Rajesh Sampat
  Cc: linux-kernel, rafael.j.wysocki, peterz, dsmythies,
	daniel.lezcano, ego, svaidy, pratik.sampat, pratik.r.sampat

Hello Pratik,

On Sat, Feb 22, 2020 at 12:30:01PM +0530, Pratik Rajesh Sampat wrote:
> Currently the TEO governor apart from the TEO timer and hit/miss/early
> hit buckets; also gathers history of 8 intervals and if there are
> significant idle durations less than the current, then it decides if a
> shallower state must be chosen.
> 
> The current sliding history window does do a fair job at prediction,
> however, the hard-coded window can be a limiting factor for an accurate
> prediction and having the window size increase can also linearly affect
> both space and time complexity of the prediction.
> 
> To complement the current moving window history, an approach is devised
> where each idle state separately maintains a weight for itself and its
> counterpart idle states to form a probability distribution.
> 
> When a decision needs to be made, the TEO governor selects an idle state
> based on its timer and other hits/early hits metric. After which, the
> probability distribution of that selected idle state is looked at which
> gives insight into how probable that state is to occur if picked.
> 
> The probability distribution is nothing but a n*n matrix, where
> n = drv->state_count.
> Each entry in the array signifies a weight for that row.
> The weights can vary from the range [0-10000].
> 
> For example:
> state_mat[1][2] = 3000 means that previously when state 1 was selected,
> the probability that state 2 will occur is 30%.

Could you clarify what this means ? Do you mean that when state 1 is
selected, the probability that the CPU will be in state 1 for the
duration corresponding to state 2's residency is 30% ?

Furthermore, this means that during idle state selection we have O(n)
complexity if n is the number of idle states, since we want to select
a state where we are more likely to reside ?

> The trailing zeros correspond to having more resolution while increasing
> or reducing the weights for correction.
> 
> Currently, for selection of an idle state based on probabilities, a
> weighted random number generator is used to choose one of the idle
> states. Naturally, the states with higher weights are more likely to be
> chosen.
> 
> On wakeup, the weights are updated. The state with which it should have
> woken up with (could be the hit / miss / early hit state) is increased
> in weight by the "LEARNING_RATE" % and the rest of the states for that
> index are reduced by the same factor.

So we only update the weight in just one cell ?

To use the example above, if we selected state 1, and we resided in it
for a duration corresponding to state 2's residency, we will only
update state_mat[1][2] ?

> 
> The advantage of this approach is that unlimited history of idle states
> can be maintained in constant overhead, which can help in more accurate
> prediction for choosing idle states.
> 
> The advantage of unlimited history can become a possible disadvantage as
> the lifetime history for that thread may make the weights stale and
> influence the choosing of idle states which may not be relevant
> anymore.

Can the effect of this staleness be observed ? For instance, if we
have a particular idle entry/exit pattern for a very long duration,
say a few 10s of minutes and then the idle entry/exit pattern changes,
how bad will the weighted approach be compared to the current TEO
governor ?




> Aging the weights could be a solution for that, although this RFC does
> not cover the implementation for that.
> 
> Having a finer view of the history in addition to weighted randomized
> salt seems to show some promise in terms of saving power without
> compromising performance.
> 
> Benchmarks:
> Note: Wt. TEO governor represents the governor after the proposed change
> 
> Schbench
> ========
> Benchmarks wakeup latencies
> Scale of measurement:
> 1. 99th percentile latency - usec
> 2. Power - Watts
> 
> Command: $ schbench -c 30000 -s 30000 -m 6 -r 30 -t <Threads>
> Varying parameter: -t
> 
> Machine: IBM POWER 9
> 
> +--------+-------------+-----------------+-----------+-----------------+
> | Threads| TEO latency | Wt. TEO latency | TEO power | Wt. TEO power   |
> +--------+-------------+-----------------+-----------+-----------------+
> | 2      | 979         | 949  ( +3.06%)  | 38        | 36  ( +5.26%)   |
> | 4      | 997         | 1042 ( -4.51%)  | 51        | 39  ( +23.52%)  |
> | 8      | 1158        | 1050 ( +9.32%)  | 89        | 63  ( +29.21%)  |
> | 16     | 1138        | 1135 ( +0.26%)  | 105       | 117 ( -11.42%)  |
> +--------+-------------+-----------------+-----------+-----------------+
> 
> Sleeping Ebizzy
> ===============
> Program to generate workloads resembling web server workloads.
> The benchmark is customized to allow for a sleep interval -i
> Scale of measurement:
> 1. Number of records/s
> 2. systime (s)
> 
> Parameters:
> 1. -m => Always use mmap instead of malloc
> 2. -M => Never use mmap
> 3. -S <seconds> => Number of seconds to run
> 4. -i <interval> => Sleep interval
> 
> Machine: IBM POWER 9
> 
> +-------------------+-------------+-------------------+-----------+---------------+
> | Parameters        | TEO records | Wt. TEO records   | TEO power | Wt. TEO power |
> +-------------------+-------------+-------------------+-----------+---------------+
> | -S 60 -i 10000    | 1115000     | 1198081 ( +7.45%) | 149       | 150 ( -0.66%) |
> | -m -S 60 -i 10000 | 15879       | 15513   ( -2.30%) | 23        | 22  ( +4.34%) |
> | -M -S 60 -i 10000 | 72887       | 77546   ( +6.39%) | 104       | 103 ( +0.96%) |
> +-------------------+-------------+-------------------+-----------+---------------+
> 
> Hackbench
> =========
> Creates a specified number of pairs of schedulable entities
> which communicate via either sockets or pipes and time how long  it
> takes for each pair to send data back and forth.
> Scale of measurement:
> 1. Time (s)
> 2. Power (watts)
> 
> Command: Sockets: $ hackbench -l <Messages>
>          Pipes  : $ hackbench --pipe -l <Messages>
> Varying parameter: -l
> 
> Machine: IBM POWER 9
> 
> +----------+------------+-------------------+----------+-------------------+
> | Messages | TEO socket | Wt. TEO socket    | TEO pipe | Wt. TEO pipe      |
> +----------+------------+-------------------+----------+-------------------+
> | 100      | 0.042      | 0.043   ( -2.32%) | 0.031    | 0.032   ( +3.12%) |
> | 1000     | 0.258      | 0.272   ( +5.14%) | 0.301    | 0.312   ( -3.65%) |
> | 10000    | 2.397      | 2.441   ( +1.80%) | 5.642    | 5.092   ( +9.74%) |
> | 100000   | 23.691     | 23.730  ( -0.16%) | 57.762   | 57.857  ( -0.16%) |
> | 1000000  | 234.103    | 233.841 ( +0.11%) | 559.807  | 592.304 ( -5.80%) |
> +----------+------------+-------------------+----------+-------------------+
> 
> Power :Socket: Consistent between 135-140 watts for both TEO and Wt. TEO
>        Pipe: Consistent between 125-130 watts for both TEO and Wt. TEO
> 
>


Could you also provide power measurements for the duration when the
system is completely idle for each of the variants of TEO governor ?
Is it the case that the benefits that we are seeing above are only due
to Wt. TEO being more conservative than TEO governor by always
choosing a shallower state ?





> 
> Pratik Rajesh Sampat (1):
>   Weighted approach to gather and use history in TEO governor
> 
>  drivers/cpuidle/governors/teo.c | 95 +++++++++++++++++++++++++++++++--
>  1 file changed, 90 insertions(+), 5 deletions(-)
> 
> -- 
> 2.17.1
> 

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC 1/1] Weighted approach to gather and use history in TEO governor
  2020-02-22  7:00 ` [RFC 1/1] " Pratik Rajesh Sampat
@ 2020-02-25  6:18   ` Gautham R Shenoy
  2020-02-29  8:58     ` Pratik Sampat
  0 siblings, 1 reply; 8+ messages in thread
From: Gautham R Shenoy @ 2020-02-25  6:18 UTC (permalink / raw)
  To: Pratik Rajesh Sampat
  Cc: linux-kernel, rafael.j.wysocki, peterz, dsmythies,
	daniel.lezcano, ego, svaidy, pratik.sampat, pratik.r.sampat

On Sat, Feb 22, 2020 at 12:30:02PM +0530, Pratik Rajesh Sampat wrote:
> Complementing the current self correcting window algorithm, an alternate
> approach is devised to leverage lifetime history with constant overhead
> 
> Each CPU maintains a matrix wherein each idle state maintains a
> probability distribution.
> 
> The probability distribution is nothing but a n*n matrix, where
> n = drv->state_count.
> Each entry in the array signifies a weight for that row.
> The weights can vary from the range [0-10000].
> 
> For example:
> state_mat[1][2] = 3000 means that previously when state 1 was selected,
> the probability that state 2 will occur is 30%.
> The trailing zeros correspond to having more resolution while increasing
> or reducing the weights for correction.
> 
> Initially the weights are distributed in a way such that the index of
> that state in question has a higher probability of choosing itself, as
> we have no reason to believe otherwise yet. Initial bias to itself is
> 70% and the rest 30% is equally distributed to the rest of the states.
> 
> Selection of an idle state:
> When the TEO governor chooses an idle state, the probability
> distribution for that state is looked at. A weighted random number
> generator is used using the weights as bias to choose the next idle
> state. The algorithm leans to choose that or a shallower state than that
> for its next prediction
> 
> Correction of the probability distribution:
> On wakeup, the weights are updated. The state which it should have woken
> up with (could be the hit / miss / early hit state) is increased in
> weight by the "LEARNING_RATE" % and the rest of the states for that
> index are reduced by the same factor.
> The LEARNING RATE is experimentally chosen to be 5 %


I know this is an RFC patch, not meant for inclusion, but it is good
practice to have your Signed-off-by.


> ---
>  drivers/cpuidle/governors/teo.c | 95 +++++++++++++++++++++++++++++++--
>  1 file changed, 90 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index de7e706efd46..8060c287f5e4 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -50,6 +50,7 @@
>  #include <linux/kernel.h>
>  #include <linux/sched/clock.h>
>  #include <linux/tick.h>
> +#include <linux/random.h>
> 
>  /*
>   * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
> @@ -64,6 +65,12 @@
>   */
>  #define INTERVALS	8
> 
> +/*
> + * Percentage of the amount of weight to be shifted in the idle state weight
> + * distribution for correction
> + */
> +#define LEARNING_RATE	5
> +
>  /**
>   * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
>   * @early_hits: "Early" CPU wakeups "matching" this state.
> @@ -98,6 +105,8 @@ struct teo_idle_state {
>   * @states: Idle states data corresponding to this CPU.
>   * @interval_idx: Index of the most recent saved idle interval.
>   * @intervals: Saved idle duration values.
> + * @state_mat: Each idle state maintains a weights corresponding to that
> + * state, storing the probablity distribution of occurance for that state
>   */
>  struct teo_cpu {
>  	u64 time_span_ns;
> @@ -105,6 +114,7 @@ struct teo_cpu {
>  	struct teo_idle_state states[CPUIDLE_STATE_MAX];
>  	int interval_idx;
>  	u64 intervals[INTERVALS];
> +	int state_mat[CPUIDLE_STATE_MAX][CPUIDLE_STATE_MAX];
>  };
> 
>  static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
> @@ -117,7 +127,7 @@ static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
>  static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  {
>  	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> -	int i, idx_hit = -1, idx_timer = -1;
> +	int i, idx_hit = -1, idx_timer = -1, idx = -1, last_idx = dev->last_state_idx;
>  	u64 measured_ns;
> 
>  	if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
> @@ -183,16 +193,50 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> 
>  		if (idx_timer > idx_hit) {
>  			misses += PULSE;
> -			if (idx_hit >= 0)
> +			idx = idx_timer;
> +			if (idx_hit >= 0) {
>  				cpu_data->states[idx_hit].early_hits += PULSE;
> +				idx = idx_hit;
> +			}
>  		} else {
>  			hits += PULSE;
> +			idx = last_idx;
>  		}
> 
>  		cpu_data->states[idx_timer].misses = misses;
>  		cpu_data->states[idx_timer].hits = hits;
>  	}
> 
> +	/*
> +	 * Rearrange the weight distribution of the state, increase the weight
> +	 * by the LEARNING RATE % for the idle state that was supposed to be
> +	 * chosen and reduce by the same amount for rest of the states
> +	 *
> +	 * If the weights are greater than (100 - LEARNING_RATE) % or lesser
> +	 * than LEARNING_RATE %, do not increase or decrease the confidence
> +	 * respectively
> +	 */
> +	for (i = 0; i < drv->state_count; i++) {
> +		unsigned int delta;
> +
> +		if (idx == -1)
> +			break;
> +		if (i ==  idx) {
> +			delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100;
> +			if (cpu_data->state_mat[last_idx][i] + delta >=
> +			    (100 - LEARNING_RATE) * 100)
> +				continue;
> +			cpu_data->state_mat[last_idx][i] += delta;
> +			continue;
> +		}
> +		delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) /
> +			((drv->state_count - 1) * 100);


What happens when drv->state_count == 1?

> +		if (cpu_data->state_mat[last_idx][i] - delta <=
> +		    LEARNING_RATE * 100)
> +			continue;
> +		cpu_data->state_mat[last_idx][i] -= delta;

So, even the update takes O(n) time, since we have to increase the weight
for one state and correspondingly decrease the weights for all the
other states.


> +	}
> +
>  	/*
>  	 * Save idle duration values corresponding to non-timer wakeups for
>  	 * pattern detection.
> @@ -244,7 +288,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
>  	u64 duration_ns;
>  	unsigned int hits, misses, early_hits;
> -	int max_early_idx, prev_max_early_idx, constraint_idx, idx, i;
> +	int max_early_idx, prev_max_early_idx, constraint_idx, idx, i, og_idx;
>  	ktime_t delta_tick;
> 
>  	if (dev->last_state_idx >= 0) {
> @@ -374,10 +418,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  	if (constraint_idx < idx)
>  		idx = constraint_idx;
> 
> +	og_idx = idx;
> +
>  	if (idx < 0) {
>  		idx = 0; /* No states enabled. Must use 0. */
>  	} else if (idx > 0) {
> -		unsigned int count = 0;
> +		unsigned int count = 0, sum_weights = 0, weights_list[CPUIDLE_STATE_MAX];
> +		int i, j = 0, rnd_wt, rnd_num = 0;
>  		u64 sum = 0;
> 
>  		/*
> @@ -412,6 +459,29 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  								       idx, avg_ns);
>  			}
>  		}
> +		/*
> +		 * In case, the recent history yields a shallower state, then
> +		 * the probability distribution is looked at.
> +		 * The weighted random number generator uses the weights as a
> +		 * bias to choose the next idle state
> +		 */
> +		if (og_idx != idx) {
> +			for (i = 0; i <= idx; i++) {


So it seems like we are restricting our choice to states no deeper
than the selected state.

Is it not possible that cpu_data->state_mat[idx][j] has the
maximum weight when j > idx ? If yes, why are we leaving those states
out ?

> +				if (dev->states_usage[i].disable)
> +					continue;
> +				sum_weights += cpu_data->state_mat[idx][i];
> +				weights_list[j++] = sum_weights;
> +			}

Assume that cpu_data->state_mat[idx] = {4, 5, 6, 9}
Then weights_list[] = {4, 9, 15, 24}

> +			get_random_bytes(&rnd_num, sizeof(rnd_num));
> +			rnd_num = rnd_num % 100;

Assume rnd_num = 50.
> +			rnd_wt = (rnd_num * sum_weights) / 100;


Then, rnd_wt = 12.

From the logic, below, it appears that you want to pick the shallowest
state i at which rnd_wt < weights_list[i]. In which case it would be
the state corresponding to the weight 6 (as the cumulative weight at
that point is 15).


> +			for (i = 0; i < j; i++) {
> +				if (rnd_wt < weights_list[i])
> +					break;
> +				rnd_wt -= weights_list[i];


And yet, because of this additional subtraction, after the first
iteration of this loop, rnd_wt = 12 - 4 = 8, which means that you will
end up picking the state corresponding to weight 5 (cumulative weight
being 9 at this point).

This doesn't seem right.

> +			}
> +			idx = i;
> +		}
>  	}
> 
>  	/*
> @@ -468,13 +538,28 @@ static int teo_enable_device(struct cpuidle_driver *drv,
>  			     struct cpuidle_device *dev)
>  {
>  	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> -	int i;
> +	int i, j;
> 
>  	memset(cpu_data, 0, sizeof(*cpu_data));
> 
>  	for (i = 0; i < INTERVALS; i++)
>  		cpu_data->intervals[i] = U64_MAX;
> 
> +	/*
> +	 * Populate initial weights for each state
> +	 * The stop state is initially more biased for itself.
> +	 *
> +	 * Currently the initial distribution of probabilities are 70%-30%.
> +	 * The trailing 0s are for increased resolution.
> +	 */
> +	for (i = 0; i < drv->state_count; i++) {
> +		for (j = 0; j < drv->state_count; j++) {
> +			if (i == j)
> +				cpu_data->state_mat[i][j] = 7000;
> +			else
> +				cpu_data->state_mat[i][j] = 3000 / (drv->state_count - 1);


> +		}
> +	}
>  	return 0;
>  }
> 
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [RFC 0/1] Weighted approach to gather and use history in TEO governor
  2020-02-25  5:13 ` [RFC 0/1] " Gautham R Shenoy
@ 2020-02-27 16:14   ` Doug Smythies
  2020-02-29  8:58     ` Pratik Sampat
  2020-02-29  8:58   ` Pratik Sampat
  1 sibling, 1 reply; 8+ messages in thread
From: Doug Smythies @ 2020-02-27 16:14 UTC (permalink / raw)
  To: ego, 'Pratik Rajesh Sampat'
  Cc: linux-kernel, rafael.j.wysocki, peterz, daniel.lezcano, svaidy,
	pratik.sampat, pratik.r.sampat

On 2020.02.24 21:13 Gautham R Shenoy wrote:

...

> Could you also provide power measurements for the duration when the
> system is completely idle for each of the variants of TEO governor ?
> Is it the case that the benefits that we are seeing above are only due
> to Wt. TEO being more conservative than TEO governor by always
> choosing a shallower state ?

For what it's worth:

CPU: Intel: i7-2600K  
Kernel: 5.6-rc2 (teo) and + this patch set (wtteo)
Note: in general, "idle" on this system is considerably more "idle" than most systems.
Sample period: 5 minutes.
CPU scaling driver: intel_cpufreq
Governor: performance
Deepest idle state: 4 (C6)

teo:
Test duration 740 minutes (12.33 hours).
Average processor package power: 3.84 watts
Idle state 0:    4.19 / minute
Idle state 1:   29.26 / minute
Idle state 2:   46.71 / minute
Idle state 3:    7.42 / minute
Idle state 4: 1124.55 / minute
Total: 2.525 idle entries per cpu per second

wtteo:
Test duration 1095 minutes (18.25 hours).
Average processor package power: 3.84 watts
Idle state 0:    7.98 / minute
Idle state 1:   30.49 / minute
Idle state 2:   52.51 / minute
Idle state 3:    8.65 / minute
Idle state 4: 1125.33 / minute
Total: 2.552 idle entries per cpu per second

The above/below data for this test is incomplete because my program
doesn't process it if there are not enough state entries per sample period.
(I need to fix that for this type of test.)

I have done a couple of other tests with this patch set,
but nothing to report yet, as the differences have been minor so far.

... Doug



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC 0/1] Weighted approach to gather and use history in TEO governor
  2020-02-25  5:13 ` [RFC 0/1] " Gautham R Shenoy
  2020-02-27 16:14   ` Doug Smythies
@ 2020-02-29  8:58   ` Pratik Sampat
  1 sibling, 0 replies; 8+ messages in thread
From: Pratik Sampat @ 2020-02-29  8:58 UTC (permalink / raw)
  To: ego
  Cc: linux-kernel, rafael.j.wysocki, peterz, dsmythies,
	daniel.lezcano, svaidy, pratik.sampat, pratik.r.sampat

Hello Gautham,

Thanks for your comments.


On 25/02/20 10:43 am, Gautham R Shenoy wrote:
> Hello Pratik,
>
> On Sat, Feb 22, 2020 at 12:30:01PM +0530, Pratik Rajesh Sampat wrote:
>> Currently the TEO governor apart from the TEO timer and hit/miss/early
>> hit buckets; also gathers history of 8 intervals and if there are
>> significant idle durations less than the current, then it decides if a
>> shallower state must be chosen.
>>
>> The current sliding history window does do a fair job at prediction,
>> however, the hard-coded window can be a limiting factor for an accurate
>> prediction and having the window size increase can also linearly affect
>> both space and time complexity of the prediction.
>>
>> To complement the current moving window history, an approach is devised
>> where each idle state separately maintains a weight for itself and its
>> counterpart idle states to form a probability distribution.
>>
>> When a decision needs to be made, the TEO governor selects an idle state
>> based on its timer and other hits/early hits metric. After which, the
>> probability distribution of that selected idle state is looked at which
>> gives insight into how probable that state is to occur if picked.
>>
>> The probability distribution is nothing but a n*n matrix, where
>> n = drv->state_count.
>> Each entry in the array signifies a weight for that row.
>> The weights can vary from the range [0-10000].
>>
>> For example:
>> state_mat[1][2] = 3000 means that previously when state 1 was selected,
>> the probability that state 2 will occur is 30%.
> Could you clarify what this means ? Do you mean that when state 1 is
> selected, the probability that the CPU will be in state 1 for the
> duration corresponding to state 2's residency is 30% ?

Yes, this precisely means that: in the case when the original logic
chooses state X, the probability that it should have gone to state Y,
based on its residency, is Z%.

> Further more, this means that during idle state selection we have O(n)
> complexity if n is the number of idle states, since we want to select
> a state where we are more likely to reside ?

Absolutely. Although it has constant space complexity, the time
complexity is linear.

>> The trailing zeros correspond to having more resolution while increasing
>> or reducing the weights for correction.
>>
>> Currently, for selection of an idle state based on probabilities, a
>> weighted random number generator is used to choose one of the idle
>> states. Naturally, the states with higher weights are more likely to be
>> chosen.
>>
>> On wakeup, the weights are updated. The state with which it should have
>> woken up with (could be the hit / miss / early hit state) is increased
>> in weight by the "LEARNING_RATE" % and the rest of the states for that
>> index are reduced by the same factor.
> So we only update the weight in just one cell ?
>
> To use the example above, if we selected state 1, and we resided in it
> for a duration corresponding to state 2's residency, we will only
> update state_mat[1][2] ?

No, that is not the case: the weight for state_mat[1][2] will increase,
while the weights for the rest of the states in state_mat[1][X] will
decrease by an equal amount.

>> The advantage of this approach is that unlimited history of idle states
>> can be maintained in constant overhead, which can help in more accurate
>> prediction for choosing idle states.
>>
>> The advantage of unlimited history can become a possible disadvantage as
>> the lifetime history for that thread may make the weights stale and
>> influence the choosing of idle states which may not be relevant
>> anymore.
> Can the effect of this staleless be observed ? For instance, if we
> have a particular idle entry/exit pattern for a very long duration,
> say a few 10s of minutes and then the idle entry/exit pattern changes,
> how bad will the weighted approach be compared to the current TEO
> governor ?
>
I haven't been able to observe any adverse effects of the staleness
when running for long durations.
The reason, I believe, is that we also leverage the recent history for
this: when we run something, then idle the system, only to run it again,
the recent moving-window history initially does not get triggered,
though it may later. This gives our weights ample time to be adjusted
properly.

>
>
>> Aging the weights could be a solution for that, although this RFC does
>> not cover the implementation for that.
>>
>> Having a finer view of the history in addition to weighted randomized
>> salt seems to show some promise in terms of saving power without
>> compromising performance.
>>
>> Benchmarks:
>> Note: Wt. TEO governor represents the governor after the proposed change
>>
>> Schbench
>> ========
>> Benchmarks wakeup latencies
>> Scale of measurement:
>> 1. 99th percentile latency - usec
>> 2. Power - Watts
>>
>> Command: $ schbench -c 30000 -s 30000 -m 6 -r 30 -t <Threads>
>> Varying parameter: -t
>>
>> Machine: IBM POWER 9
>>
>> +--------+-------------+-----------------+-----------+-----------------+
>> | Threads| TEO latency | Wt. TEO latency | TEO power | Wt. TEO power   |
>> +--------+-------------+-----------------+-----------+-----------------+
>> | 2      | 979         | 949  ( +3.06%)  | 38        | 36  ( +5.26%)   |
>> | 4      | 997         | 1042 ( -4.51%)  | 51        | 39  ( +23.52%)  |
>> | 8      | 1158        | 1050 ( +9.32%)  | 89        | 63  ( +29.21%)  |
>> | 16     | 1138        | 1135 ( +0.26%)  | 105       | 117 ( -11.42%)  |
>> +--------+-------------+-----------------+-----------+-----------------+
>>
>> Sleeping Ebizzy
>> ===============
>> Program to generate workloads resembling web server workloads.
>> The benchmark is customized to allow for a sleep interval -i
>> Scale of measurement:
>> 1. Number of records/s
>> 2. systime (s)
>>
>> Parameters:
>> 1. -m => Always use mmap instead of malloc
>> 2. -M => Never use mmap
>> 3. -S <seconds> => Number of seconds to run
>> 4. -i <interval> => Sleep interval
>>
>> Machine: IBM POWER 9
>>
>> +-------------------+-------------+-------------------+-----------+---------------+
>> | Parameters        | TEO records | Wt. TEO records   | TEO power | Wt. TEO power |
>> +-------------------+-------------+-------------------+-----------+---------------+
>> | -S 60 -i 10000    | 1115000     | 1198081 ( +7.45%) | 149       | 150 ( -0.66%) |
>> | -m -S 60 -i 10000 | 15879       | 15513   ( -2.30%) | 23        | 22  ( +4.34%) |
>> | -M -S 60 -i 10000 | 72887       | 77546   ( +6.39%) | 104       | 103 ( +0.96%) |
>> +-------------------+-------------+-------------------+-----------+---------------+
>>
>> Hackbench
>> =========
>> Creates a specified number of pairs of schedulable entities
>> which communicate via either sockets or pipes and time how long  it
>> takes for each pair to send data back and forth.
>> Scale of measurement:
>> 1. Time (s)
>> 2. Power (watts)
>>
>> Command: Sockets: $ hackbench -l <Messages>
>>           Pipes  : $ hackbench --pipe -l <Messages>
>> Varying parameter: -l
>>
>> Machine: IBM POWER 9
>>
>> +----------+------------+-------------------+----------+-------------------+
>> | Messages | TEO socket | Wt. TEO socket    | TEO pipe | Wt. TEO pipe      |
>> +----------+------------+-------------------+----------+-------------------+
>> | 100      | 0.042      | 0.043   ( -2.32%) | 0.031    | 0.032   ( +3.12%) |
>> | 1000     | 0.258      | 0.272   ( +5.14%) | 0.301    | 0.312   ( -3.65%) |
>> | 10000    | 2.397      | 2.441   ( +1.80%) | 5.642    | 5.092   ( +9.74%) |
>> | 100000   | 23.691     | 23.730  ( -0.16%) | 57.762   | 57.857  ( -0.16%) |
>> | 1000000  | 234.103    | 233.841 ( +0.11%) | 559.807  | 592.304 ( -5.80%) |
>> +----------+------------+-------------------+----------+-------------------+
>>
>> Power :Socket: Consistent between 135-140 watts for both TEO and Wt. TEO
>>         Pipe: Consistent between 125-130 watts for both TEO and Wt. TEO
>>
>>
>
> Could you also provide power measurements for the duration when the
> system is completely idle for each of the variants of TEO governor ?
> Is it the case that the benefits that we are seeing above are only due
> to Wt. TEO being more conservative than TEO governor by always
> choosing a shallower state ?
>
>
>
>
>
>> Pratik Rajesh Sampat (1):
>>    Weighted approach to gather and use history in TEO governor
>>
>>   drivers/cpuidle/governors/teo.c | 95 +++++++++++++++++++++++++++++++--
>>   1 file changed, 90 insertions(+), 5 deletions(-)
>>
>> -- 
>> 2.17.1
>>
> --
> Thanks and Regards
> gautham.

---

Pratik


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC 1/1] Weighted approach to gather and use history in TEO governor
  2020-02-25  6:18   ` Gautham R Shenoy
@ 2020-02-29  8:58     ` Pratik Sampat
  0 siblings, 0 replies; 8+ messages in thread
From: Pratik Sampat @ 2020-02-29  8:58 UTC (permalink / raw)
  To: ego
  Cc: linux-kernel, rafael.j.wysocki, peterz, dsmythies,
	daniel.lezcano, svaidy, pratik.sampat, pratik.r.sampat

Hello Gautham,

Snip [...]

>
> I know this is an RFC patch, not meant for inclusion, but it is good
> practice to have your Signed-off-by.
>
Sorry about that, my bad.



Snip [...]

>> +	/*
>> +	 * Rearrange the weight distribution of the state, increase the weight
>> +	 * by the LEARNING RATE % for the idle state that was supposed to be
>> +	 * chosen and reduce by the same amount for rest of the states
>> +	 *
>> +	 * If the weights are greater than (100 - LEARNING_RATE) % or lesser
>> +	 * than LEARNING_RATE %, do not increase or decrease the confidence
>> +	 * respectively
>> +	 */
>> +	for (i = 0; i < drv->state_count; i++) {
>> +		unsigned int delta;
>> +
>> +		if (idx == -1)
>> +			break;
>> +		if (i ==  idx) {
>> +			delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100;
>> +			if (cpu_data->state_mat[last_idx][i] + delta >=
>> +			    (100 - LEARNING_RATE) * 100)
>> +				continue;
>> +			cpu_data->state_mat[last_idx][i] += delta;
>> +			continue;
>> +		}
>> +		delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) /
>> +			((drv->state_count - 1) * 100);
>
> What happens when drv->state_count == 1?

In that case, idx has to be 0 and the weights keep increasing up to
(100 - LEARNING_RATE)%. However, that would not affect how states are
chosen. I could break out early when there is only one state, though, and
spare us some cycles.

>> +		if (cpu_data->state_mat[last_idx][i] - delta <=
>> +		    LEARNING_RATE * 100)
>> +			continue;
>> +		cpu_data->state_mat[last_idx][i] -= delta;
> So, even update takes O(n) time, since we have to increase the weight
> for one state, and correspondingly decrease the state for all the
> other states.

Yes it does take O(n).

>
>> +	}
>> +
>>   	/*
>>   	 * Save idle duration values corresponding to non-timer wakeups for
>>   	 * pattern detection.
>> @@ -244,7 +288,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>   	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
>>   	u64 duration_ns;
>>   	unsigned int hits, misses, early_hits;
>> -	int max_early_idx, prev_max_early_idx, constraint_idx, idx, i;
>> +	int max_early_idx, prev_max_early_idx, constraint_idx, idx, i, og_idx;
>>   	ktime_t delta_tick;
>>
>>   	if (dev->last_state_idx >= 0) {
>> @@ -374,10 +418,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>   	if (constraint_idx < idx)
>>   		idx = constraint_idx;
>>
>> +	og_idx = idx;
>> +
>>   	if (idx < 0) {
>>   		idx = 0; /* No states enabled. Must use 0. */
>>   	} else if (idx > 0) {
>> -		unsigned int count = 0;
>> +		unsigned int count = 0, sum_weights = 0, weights_list[CPUIDLE_STATE_MAX];
>> +		int i, j = 0, rnd_wt, rnd_num = 0;
>>   		u64 sum = 0;
>>
>>   		/*
>> @@ -412,6 +459,29 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>   								       idx, avg_ns);
>>   			}
>>   		}
>> +		/*
>> +		 * In case, the recent history yields a shallower state, then
>> +		 * the probability distribution is looked at.
>> +		 * The weighted random number generator uses the weights as a
>> +		 * bias to choose the next idle state
>> +		 */
>> +		if (og_idx != idx) {
>> +			for (i = 0; i <= idx; i++) {
>
> So it seems like we are restricting our choice to states no deeper
> than the selected state.
>
> Is it not possible that cpu_data->state_mat[idx][j] has the
> maximum weight when j > idx ? If yes, why are we leaving those states
> out ?

It is certainly possible; however, the idea is that the state selected
because of the timer is the deepest the CPU could have gone, unless it is
interrupted earlier, in which case it may have to choose a shallower state.

Having said that, timers can get cancelled, prompting the choice of a
deeper state; however, that may not happen often enough to be worth
factoring into the mix. Certainly, more testing with various workloads is
required to determine whether that is indeed the case.

>> +				if (dev->states_usage[i].disable)
>> +					continue;
>> +				sum_weights += cpu_data->state_mat[idx][i];
>> +				weights_list[j++] = sum_weights;
>> +			}
> Assume that cpu_data->stat_mat[idx] = {4, 5, 6, 9}
> weight_list[] = {4, 9, 15, 24}
>
>> +			get_random_bytes(&rnd_num, sizeof(rnd_num));
>> +			rnd_num = rnd_num % 100;
> Assume rnd_num = 50.
>> +			rnd_wt = (rnd_num * sum_weights) / 100;
>
> Then, rnd_wt = 12.
>
>  From the logic, below, it appears that you want to pick the shallowest
> state i at which rnd_wt < weights_list[i]. In which case it would be
> the state corresponding to the weight 6 (as the cumulative weight at
> that point is 15).
>
>
>> +			for (i = 0; i < j; i++) {
>> +				if (rnd_wt < weights_list[i])
>> +					break;
>> +				rnd_wt -= weights_list[i];
>
> And yet, because of this additional subtraction, after the first
> iteration of this loop, rnd_wt = 12 - 4 = 8, which means that you will
> end up picking the state corresponding to weight 5 (cumulative weight
> being 9 at this point).
>
> This doesn't seem right.

You're right, I've made a mistake here.
The line "rnd_wt -= weights_list[i];" is not needed and throws the algorithm off.
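
With that line dropped, and since weights_list[] already holds cumulative
sums, the pick reduces to the usual cumulative-weight selection (a sketch
only, using the same variable names as in the patch):

		for (i = 0; i < j; i++) {
			/* first entry whose cumulative weight exceeds rnd_wt */
			if (rnd_wt < weights_list[i])
				break;
		}
		idx = i;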

I've re-run the benchmarks to check the effects.
Although I see some variation in the results, I do see the algorithm
improve over the conventional TEO.

Initial weight distribution 60-40.
Learning rate: 10%

Schbench
+---------+-------------+----------------+-----------+--------------+
| Threads | TEO latency | Wt.TEO latency | TEO power | Wt.TEO power |
+---------+-------------+----------------+-----------+--------------|
| 2       | 979         | 947(+3.26%)    | 38        | 34(+10.52%)  |
|---------|-------------|----------------|-----------|--------------|
| 4       | 997         | 1110(-11.34%)  | 51        | 41(+19.60%)  |
|---------|-------------|----------------|-----------|--------------|
| 8       | 1158        | 1070(+7.59%)   | 89        | 71(+20.22%)  |
|---------|-------------|----------------|-----------|--------------|
| 16      | 1138        | 1334(-17.22%)  | 105       | 101(+3.80%)  |
+---------+-------------+----------------+-----------+--------------+

Sleeping Ebizzy
+-------------------+-------------+-----------------+-----------+--------------+
| Parameters        | TEO records | Wt.TEO records  | TEO power | Wt.TEO power |
+-------------------+-------------+-----------------+-----------+--------------+
| -S 60 -i 10000    | 1115000     | 1180442(+5.86%) | 149       | 147(+1.34%)  |
|-------------------|-------------|-----------------|-----------|--------------|
| -m -S 60 -i 10000 | 15879       | 15953(+0.46%)   | 23        | 22(+4.34%)   |
|-------------------|-------------|-----------------|-----------|--------------|
| -M -S 60 -i 10000 | 72887       | 75454(+3.52%)   | 104       | 100(+3.84%)  |
+-------------------+-------------+-----------------+-----------+--------------+

>> +			}
>> +			idx = i;
>> +		}
>>   	}
>>
>>   	/*
>> @@ -468,13 +538,28 @@ static int teo_enable_device(struct cpuidle_driver *drv,
>>   			     struct cpuidle_device *dev)
>>   {
>>   	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
>> -	int i;
>> +	int i, j;
>>
>>   	memset(cpu_data, 0, sizeof(*cpu_data));
>>
>>   	for (i = 0; i < INTERVALS; i++)
>>   		cpu_data->intervals[i] = U64_MAX;
>>
>> +	/*
>> +	 * Populate initial weights for each state
>> +	 * The stop state is initially more biased for itself.
>> +	 *
>> +	 * Currently the initial distribution of probabilities are 70%-30%.
>> +	 * The trailing 0s are for increased resolution.
>> +	 */
>> +	for (i = 0; i < drv->state_count; i++) {
>> +		for (j = 0; j < drv->state_count; j++) {
>> +			if (i == j)
>> +				cpu_data->state_mat[i][j] = 7000;
>> +			else
>> +				cpu_data->state_mat[i][j] = 3000 / (drv->state_count - 1);
>
>> +		}
>> +	}
>>   	return 0;
>>   }
>>
>> -- 
>> 2.17.1
>>
---

Pratik


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC 0/1] Weighted approach to gather and use history in TEO governor
  2020-02-27 16:14   ` Doug Smythies
@ 2020-02-29  8:58     ` Pratik Sampat
  0 siblings, 0 replies; 8+ messages in thread
From: Pratik Sampat @ 2020-02-29  8:58 UTC (permalink / raw)
  To: Doug Smythies, ego
  Cc: linux-kernel, rafael.j.wysocki, peterz, daniel.lezcano, svaidy,
	pratik.sampat, pratik.r.sampat

Hello Doug,

Thanks for running these numbers for me.

On 27/02/20 9:44 pm, Doug Smythies wrote:
> On 2020.02.24 21:13 Gautham R Shenoy wrote:
>
> ...
>
>> Could you also provide power measurements for the duration when the
>> system is completely idle for each of the variants of TEO governor ?
>> Is it the case that the benefits that we are seeing above are only due
>> to Wt. TEO being more conservative than TEO governor by always
>> choosing a shallower state ?

For system idle, I see similar power statistics for both the teo and wtteo variants.

> For what it's worth:
>
> CPU: Intel: i7-2600K
> Kernel: 5.6-rc2 (teo) and + this patch set (wtteo)
> Note: in general, "idle" on this system is considerably more "idle" than most systems.
> Sample period: 5 minutes.
> CPU scaling driver: intel_cpufreq
> Governor: performance
> Deepest idle state: 4 (C6)
>
> teo:
> Test duration 740 minutes (12.33 hours).
> Average processor package power: 3.84 watts
> Idle state 0:    4.19 / minute
> Idle state 1:   29.26 / minute
> Idle state 2:   46.71 / minute
> Idle state 3:    7.42 / minute
> Idle state 4: 1124.55 / minute
> Total: 2.525 idle entries per cpu per second
>
> wtteo:
> Test duration 1095 minutes (18.25 hours).
> Average processor package power: 3.84 watts
> Idle state 0:    7.98 / minute
> Idle state 1:   30.49 / minute
> Idle state 2:   52.51 / minute
> Idle state 3:    8.65 / minute
> Idle state 4: 1125.33 / minute
> Total: 2.552 idle entries per cpu per second
>
> The above/below data for this test is incomplete because my program
> doesn't process it if there are not enough state entries per sample period.
> (I need to fix that for this type of test.)
>
> I have done a couple of other tests with this patch set,
> but nothing to report yet, as the differences have been minor so far.
>
> ... Doug
>
>
---

Pratik


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread

Thread overview: 8+ messages
2020-02-22  7:00 [RFC 0/1] Weighted approach to gather and use history in TEO governor Pratik Rajesh Sampat
2020-02-22  7:00 ` [RFC 1/1] " Pratik Rajesh Sampat
2020-02-25  6:18   ` Gautham R Shenoy
2020-02-29  8:58     ` Pratik Sampat
2020-02-25  5:13 ` [RFC 0/1] " Gautham R Shenoy
2020-02-27 16:14   ` Doug Smythies
2020-02-29  8:58     ` Pratik Sampat
2020-02-29  8:58   ` Pratik Sampat
