linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC 0/1] Alternate history mechanism for the TEO governor
@ 2020-05-11 14:10 Pratik Rajesh Sampat
  2020-05-11 14:10 ` [RFC 1/1] Weighted approach to gather and use history in " Pratik Rajesh Sampat
  2020-05-17 18:11 ` [RFC 0/1] Alternate history mechanism for the " Doug Smythies
  0 siblings, 2 replies; 9+ messages in thread
From: Pratik Rajesh Sampat @ 2020-05-11 14:10 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael.j.wysocki, peterz, dsmythies,
	daniel.lezcano, ego, svaidy, psampat, pratik.sampat,
	pratik.r.sampat

First RFC posting: https://lkml.org/lkml/2020/2/22/27
Changelog
1. Upated benchmarks

Currently the TEO governor apart from the TEO timer and hit/miss/early
hit buckets; especially for non-timer events, also gathers history of 8
intervals and if there are significant idle durations less than the
current, then it decides if a shallower state must be chosen.
 
The current sliding history window does do fair job for the prediction
of the non-timer events, so to improve upon the history mechanism,
one may just hypothesize increasing the interval window so that more
history can influence better decisions.

However, this may not be true. The current algorithm can be limited due
to its linearity, while also average of interval durations may not be a
suitable metric when the data presents skewness. This can especially be
true for the when number of intervals taken into account increases.

Benchmark: Schbench
Provides latency statistics for scheduler wakeups
Machine - IBM Power 9

Metric of measurement
1. Performance - latency 99th percentile - usec
2. Power - watts
Performance and Power statistics are normalized

Latency - Normalized
8 intervals is baseline
+---------+-------------+--------------+--------------+--------------+
| Threads | 8 Intervals | 16 Intervals | 32 Intervals | 64 Intervals |
+---------+-------------+--------------+--------------+--------------+
| 2       | 100         | 102.81       | 106.93       | 100.86       |
+---------+-------------+--------------+--------------+--------------+
| 4       | 100         | 104.67       | 96.58        | 96.58        |
+---------+-------------+--------------+--------------+--------------+
| 8       | 100         | 116.26       | 100          | 121.44       |
+---------+-------------+--------------+--------------+--------------+
| 16      | 100         | 85.94        | 81.46        | 80.86        |
+---------+-------------+--------------+--------------+--------------+
+---------+---------------+
| Threads | 128 Intervals |
+---------+---------------+
| 2       | 83.09         |
+---------+---------------+
| 4       | 74.75         |
+---------+---------------+
| 8       | 87.15         |
+---------+---------------+
| 16      | 76.98         |
+---------+---------------+

The latency number are kind of a mixed bag. Some intervals perform
similar, while some slightly better, while some considerably worse

Standout: 128 latency intervals always performs well in terms of latency

Power - Normalized
8 intervals in baseline
+---------+-------------+--------------+--------------+--------------+
| Threads | 8 Intervals | 16 Intervals | 32 Intervals | 64 Intervals |
+---------+-------------+--------------+--------------+--------------+
| 2       | 100         | 100          | 85.29        | 97.05        |
+---------+-------------+--------------+--------------+--------------+
| 4       | 100         | 115.9        | 111.36       | 145.45       |
+---------+-------------+--------------+--------------+--------------+
| 8       | 100         | 109.63       | 114.45       | 102.4        |
+---------+-------------+--------------+--------------+--------------+
| 16      | 100         | 104.2        | 103.36       | 102.52       |
+---------+-------------+--------------+--------------+--------------+
+---------+---------------+
| Threads | 128 Intervals |
+---------+---------------+
| 2       | 123.52        |
+---------+---------------+
| 4       | 159.09        |
+---------+---------------+
| 8       | 114.45        |
+---------+---------------+
| 16      | 106.72        |
+---------+---------------+

Power considerably regresses in almost all cases, showing that with
increasing history there seems to be skewness presented to shallow
states.

Weighted TEO Governor
---------------------

To complement the current moving window history, an approach is devised
where each idle state separately maintains a weight for itself and its
counterpart idle states to form a probability distribution.

When a decision needs to be made, the TEO governor selects an idle state
based on its timer and other hits/early hits metric. After which, the
probability distribution of that selected idle state is looked at which
gives insight into how probable that state is to occur if picked.

The probability distribution is nothing but a n*n matrix, where
n - drv->state_count.
Each entry in the array signifies a weight for that row.
The weights can vary from the range [0-10000].

For example:
state_mat[2][1] - 3000 means that when state 2 is entered idle with, the
probability that the interval will last long enough to satisfy state 1's
residency is 30%.
The trailing zeros correspond to having more resolution while increasing
or reducing the weights for correction.

Currently, for selection of an idle state based on probabilities, a
weighted random number generator is used to choose one of the idle
states. Naturally, the states with higher weights are more likely to be
chosen.

On wakeup, the weights are updated. The state with which it should have
woken up with (could be the hit / miss / early hit state) is increased
in weight by the "LEARNING_RATE" % and the rest of the states for that
index are reduced by the same factor.

A possible disadvantage of this approach is that history for that
thread may make the weights stale and influence the choosing of idle
states which may not be relevant anymore. Aging the weights could be a
solution for that, where over time, the highest weight for that index
can be decayed and the weight equally spread across the rest of the
states.
Although this RFC does not cover the implementation for that as there
seemed to be too much run to run variance with this approach.

Having a finer view of the history in addition to weighted randomized
salt seems to show some promise in terms of saving power without
compromising performance.

Why not just pick the state with the highest probability?
- If there are multiple states close in probabilities competing for
dominance they should be given a fair chance
Of course, if the prediction was wrong the algorithm will self correct,
however by that time the pattern may have changed
- If this contention pattern is exhibited often then the prediction
algorithm will always stay playing catch-up. 

Benchmarks:

Metric of measurement:
1. Performance (latency / throughput)
2. Power (watts)
3. Accuracy %
  a. Correct prediction - The idle state predicted aligns with the
     actual sleep duration
  b. Undershoot prediction (US) - The idle state predicted is deeper for
     the actual sleep duration
  c. Overshoot prediction (OS) - The idle state predicted is shallower
     for the actual sleep duration

Performance and Power numbers are normalized.
However accuracy numbers are deliberately kept as-is to show how well
the vanilla governor performed in the first place

Schbench
--------
Benchmarks scheduler wakeup latencies

1. latency 99th percentile - usec
2. Power - watts
Machine - IBM Power 9

Latency and Power - Normalized
+---------+--------------+-----------------+---------------+
| Threads | TEO Baseline | Wt. TEO Latency | Wt. TEO Power |
+---------+--------------+-----------------+---------------+
| 2       | 100          | 101.3           | 85.29         |
+---------+--------------+-----------------+---------------+
| 4       | 100          | 105.06          | 113.63        |
+---------+--------------+-----------------+---------------+
| 8       | 100          | 92.32           | 90.36         |
+---------+--------------+-----------------+---------------+
| 16      | 100          | 99.1            | 92.43         |
+---------+--------------+-----------------+---------------+

Accuracy

Vanilla TEO Governor - Prediction distribution %
+---------+------+------+------+-------+-------+-------+---------+
| Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
+---------+------+------+------+-------+-------+-------+---------+
| 2       | 6.12 | 1.08 | 1.76 | 20.41 | 9.2   | 28.74 | 22.51   |
+---------+------+------+------+-------+-------+-------+---------+
| 4       | 8.54 | 1.56 | 1.25 | 20.24 | 10.75 | 25.17 | 22.67   |
+---------+------+------+------+-------+-------+-------+---------+
| 8       | 5.88 | 2.67 | 1.09 | 13.72 | 17.08 | 32.04 | 22.95   |
+---------+------+------+------+-------+-------+-------+---------+
| 16      | 6.29 | 2.43 | 0.86 | 13.21 | 15.33 | 26.52 | 29.34   |
+---------+------+------+------+-------+-------+-------+---------+
+---------+------+------+------+
| Threads | OS 1 | OS 2 | OS 3 |
+---------+------+------+------+
| 2       | 1.77 | 1.27 | 7.14 |
+---------+------+------+------+
| 4       | 1.8  | 1.31 | 6.71 |
+---------+------+------+------+
| 8       | 0.65 | 0.72 | 3.2  |
+---------+------+------+------+
| 16      | 0.63 | 1.71 | 3.68 |
+---------+------+------+------+

Weighted TEO Governor - Prediction distribution %
+---------+------+------+------+-------+-------+-------+---------+
| Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
+---------+------+------+------+-------+-------+-------+---------+
| 2       | 7.26 | 2.07 | 0.02 | 15.85 | 13.29 | 36.26 | 22.13   |
+---------+------+------+------+-------+-------+-------+---------+
| 4       | 4.33 | 1.45 | 0.15 | 14.17 | 14.68 | 40.36 | 21.01   |
+---------+------+------+------+-------+-------+-------+---------+
| 8       | 4.73 | 2.46 | 0.12 | 12.48 | 14.68 | 32.38 | 28.9    |
+---------+------+------+------+-------+-------+-------+---------+
| 16      | 7.68 | 1.25 | 0.98 | 12.15 | 11.19 | 24.91 | 35.92   |
+---------+------+------+------+-------+-------+-------+---------+
+---------+------+------+------+
| Threads | OS 1 | OS 2 | OS 3 |
+---------+------+------+------+
| 2       | 0.39 | 0.42 | 2.31 |
+---------+------+------+------+
| 4       | 0.45 | 0.51 | 2.89 |
+---------+------+------+------+
| 8       | 0.53 | 0.66 | 3.06 |
+---------+------+------+------+
| 16      | 0.97 | 1.9  | 3.05 |
+---------+------+------+------+

Sleeping Ebizzy
---------------
Program to generate workloads resembling web server workloads.
The benchmark is customized to allow for a sleep interval -i

1. Number of records
2. Power - watts
Machine - IBM Power 9

Parameters:
1. -m -> Always use mmap instead of malloc
2. -M -> Never use mmap
3. -S <seconds> -> Number of seconds to run
4. -i <interval> -> Sleep interval

Number of records and power normalized
+-------------------+---------------+------------------+-----------------+
| Parameters        | TEO baseline  | Wt TEO records   | Wt. TEO Power   |
+-------------------+---------------+------------------+-----------------+
| -S 60 -i 10000    | 100           | 106.56           | 93.95           |
+-------------------+---------------+------------------+-----------------+
| -m -S 60 -i 10000 | 100           | 100.62           | 82.14           |
+-------------------+---------------+------------------+-----------------+
| -M -S 60 -i 10000 | 100           | 104.97           | 95.19           |
+-------------------+---------------+------------------+-----------------+

Accuracy

Vanilla TEO Governor - Prediction distribution %
+-------------------+-------+------+------+-------+------+-------+
| Parameters        | US 1  | US 2 | US 3 | US 4  | US 5 | US 6  |
+-------------------+-------+------+------+-------+------+-------+
| -S 60 -i 10000    | 45.46 | 0.52 | 1.5  | 15.34 | 2.44 | 8.61  |
+-------------------+-------+------+------+-------+------+-------+
| -m -S 60 -i 10000 | 4.22  | 2.08 | 0.71 | 90.01 | 0    | 0.01  |
+-------------------+-------+------+------+-------+------+-------+
| -M -S 60 -i 10000 | 15.78 | 1.42 | 2.4  | 22.39 | 1.68 | 11.25 |
+-------------------+-------+------+------+-------+------+-------+
+-------------------+---------+------+------+------+------+
| Parameters        | Correct | OS 1 | OS 2 | OS 3 | OS 4 |
+-------------------+---------+------+------+------+------+
| -S 60 -i 10000    | 17.03   | 1.73 | 1.1  | 6.27 | 0    |
+-------------------+---------+------+------+------+------+
| -m -S 60 -i 10000 | 2.44    | 0.18 | 0.13 | 0.22 | 0    |
+-------------------+---------+------+------+------+------+
| -M -S 60 -i 10000 | 31.65   | 3.45 | 1.8  | 8.18 | 0    |
+-------------------+---------+------+------+------+------+

Weigted TEO Governor - Prediction distribution %
+-------------------+-------+------+------+-------+------+-------+
| Parameters        | US 1  | US 2 | US 3 | US 4  | US 5 | US 6  |
+-------------------+-------+------+------+-------+------+-------+
| -S 60 -i 10000    | 8.25  | 0.87 | 0.98 | 19.23 | 4.05 | 26.35 |
+-------------------+-------+------+------+-------+------+-------+
| -m -S 60 -i 10000 | 7.69  | 4.35 | 0.93 | 82.74 | 0.01 | 0.01  |
+-------------------+-------+------+------+-------+------+-------+
| -M -S 60 -i 10000 | 3.73  | 3.29 | 0.73 | 13.33 | 7.38 | 18.61 |
+-------------------+-------+------+------+-------+------+-------+
+-------------------+---------+------+------+------+------+
| Parameters        | Correct | OS 1 | OS 2 | OS 3 | OS 4 |
+-------------------+---------+------+------+------+------+
| -S 60 -i 10000    | 32.86   | 3.27 | 2.05 | 2.09 | 0    |
+-------------------+---------+------+------+------+------+
| -m -S 60 -i 10000 | 3.4     | 0.29 | 0.28 | 0.3  | 0    |
+-------------------+---------+------+------+------+------+
| -M -S 60 -i 10000 | 48.19   | 1.8  | 0.93 | 1.97 | 0.04 |
+-------------------+---------+------+------+------+------+

Pgbench
-------
pgbench is a simple program for running benchmark tests on PostgreSQL.
It runs the same sequence of SQL commands over and over, possibly in
multiple concurrent database sessions, and then calculates the average
transaction rate (transactions per second).
Scale of measurement:
1. Number of transactions
2. Power (watts)
Machine - IBM Power 9

Number of transactions and power is normalized

+---------+---------------+---------------------+-----------------+
| Clients | TEO Baseline  | Wt. TEO transations | Wt. TEO power   |
+---------+---------------+---------------------+-----------------+
| 4       | 100           | 105.93              | 85.18           |
+---------+---------------+---------------------+-----------------+
| 8       | 100           | 98.11               | 100             |
+---------+---------------+---------------------+-----------------+
| 16      | 100           | 98.73               | 104.16          |
+---------+---------------+---------------------+-----------------+
| 32      | 100           | 101.05              | 95              |
+---------+---------------+---------------------+-----------------+

Accuracy

Vanilla TEO Governor - Prediction distribution %
+---------+-------+-------+-------+-------+------+------+---------+
| Clients | US 1  | US 2  | US 3  | US 4  | US 5 | US 6 | Correct |
+---------+-------+-------+-------+-------+------+------+---------+
| 4       | 59.97 | 2.56  | 0.46  | 1.45  | 0.13 | 0.87 | 32.09   |
+---------+-------+-------+-------+-------+------+------+---------+
| 8       | 82.76 | 7.31  | 0.12  | 1.23  | 0.08 | 1.22 | 7.2     |
+---------+-------+-------+-------+-------+------+------+---------+
| 16      | 2     | 13.46 | 73.27 | 10.7  | 0.04 | 0.12 | 0.38    |
+---------+-------+-------+-------+-------+------+------+---------+
| 32      | 1.33  | 0.76  | 44.9  | 40.8  | 0.07 | 12   | 0.12    |
+---------+-------+-------+-------+-------+------+------+---------+
+---------+------+------+------+
| Clients | OS 1 | OS 2 | OS 3 |
+---------+------+------+------+
| 4       | 2.44 | 0.01 | 0.02 |
+---------+------+------+------+
| 8       | 0.04 | 0.02 | 0.02 |
+---------+------+------+------+
| 16      | 0.02 | 0    | 0.01 |
+---------+------+------+------+
| 32      | 0.02 | 0    | 0    |
+---------+------+------+------+

Weigted TEO Governor - Prediction distribution %
+---------+-------+-------+-------+-------+------+------+---------+
| Clients | US 1  | US 2  | US 3  | US 4  | US 5 | US 6 | Correct |
+---------+-------+-------+-------+-------+------+------+---------+
| 4       | 42.79 | 1.79  | 0.01  | 0.72  | 0.1  | 0.78 | 46.99   |
+---------+-------+-------+-------+-------+------+------+---------+
| 8       | 76.52 | 8.02  | 0.12  | 0.42  | 0.06 | 0.5  | 14.35   |
+---------+-------+-------+-------+-------+------+------+---------+
| 16      | 0.72  | 51.95 | 42.02 | 4.65  | 0.05 | 0.31 | 0.3     |
+---------+-------+-------+-------+-------+------+------+---------+
| 32      | 1.06  | 1.59  | 42.89 | 53.11 | 0.05 | 0.42 | 0.87    |
+---------+-------+-------+-------+-------+------+------+---------+
+---------+------+------+------+
| Clients | OS 1 | OS 2 | OS 3 |
+---------+------+------+------+
| 4       | 6.81 | 0    | 0.01 |
+---------+------+------+------+
| 8       | 0    | 0    | 0.01 |
+---------+------+------+------+
| 16      | 0    | 0    | 0    |
+---------+------+------+------+
| 32      | 0    | 0    | 0.01 |
+---------+------+------+------+

Hackbench
---------
Creates a specified number of pairs of schedulable entities
which communicate via either sockets or pipes and time how long  it
takes for each pair to send data back and forth.

Machine - IBM Power 9

Scale of measurement:
1. Time (s)
2. Power (watts)
Time is normalized

+---------+----------+----------------------+-------------------+
| Loops   | TEO Time | Wt. TEO Time Sockets | Wt. TEO Time Pipe |
+---------+----------+----------------------+-------------------+
| 100     | 100      | 95.23                | 87.09             |
+---------+----------+----------------------+-------------------+
| 1000    | 100      | 105.81               | 98.67             |
+---------+----------+----------------------+-------------------+
| 10000   | 100      | 99.33                | 92.73             |
+---------+----------+----------------------+-------------------+
| 100000  | 100      | 98.88                | 101.99            |
+---------+----------+----------------------+-------------------+
| 1000000 | 100      | 100.04               | 100.2             |
+---------+----------+----------------------+-------------------+

Power :Socket: Consistent between 135-140 watts for both TEO and Wt. TEO
       Pipe: Consistent between 125-130 watts for both TEO and Wt. TEO

Pratik Rajesh Sampat (1):
  Weighted approach to gather and use history in TEO governor

 drivers/cpuidle/governors/teo.c | 96 +++++++++++++++++++++++++++++++--
 1 file changed, 91 insertions(+), 5 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC 1/1] Weighted approach to gather and use history in TEO governor
  2020-05-11 14:10 [RFC 0/1] Alternate history mechanism for the TEO governor Pratik Rajesh Sampat
@ 2020-05-11 14:10 ` Pratik Rajesh Sampat
  2020-05-12 17:37   ` Peter Zijlstra
  2020-05-17 18:11 ` [RFC 0/1] Alternate history mechanism for the " Doug Smythies
  1 sibling, 1 reply; 9+ messages in thread
From: Pratik Rajesh Sampat @ 2020-05-11 14:10 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael.j.wysocki, peterz, dsmythies,
	daniel.lezcano, ego, svaidy, psampat, pratik.sampat,
	pratik.r.sampat

Complementing the current self correcting window algorithm, an alternate
approach is devised to leverage history based on probability of
occurrence of states

Each CPU maintains a matrix wherein each idle state maintains a
probability distribution for itself and the other corresponding states.

The probability distribution is nothing but a n*n matrix, where
n = drv->state_count.
Each entry in the array signifies a weight for that row.
The weights can vary from the range [0-10000].

For example:
state_mat[2][1] = 3000 means that when state 2 is entered idle with, the
probability that the interval will last long enough to satisfy state 1's
residency is 30%.
The trailing zeros correspond to having more resolution while increasing
or reducing the weights for correction.

Initially the weights are distributed in a way such that the index of
that state in question has a higher probability of choosing itself, as
we have no reason to believe otherwise yet. Initial bias to itself is
60% and the rest 40% is equally distributed to the rest of the states.

Selection of an idle state:
When the TEO governor chooses an idle state, the probability
distribution for that state is looked at. A weighted random number
generator is used using the weights as bias to choose the next idle
state. The algorithm leans to choose that or a shallower state than that
for its next prediction

Correction of the probability distribution:
On wakeup, the weights are updated. The state which it should have woken
up with (could be the hit / miss / early hit state) is increased in
weight by the "LEARNING_RATE" % and the rest of the states for that
index are reduced by the same factor.
The LEARNING RATE is experimentally chosen to be 10 %

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
---
 drivers/cpuidle/governors/teo.c | 96 +++++++++++++++++++++++++++++++--
 1 file changed, 91 insertions(+), 5 deletions(-)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index de7e706efd46..84058d797b43 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -50,6 +50,7 @@
 #include <linux/kernel.h>
 #include <linux/sched/clock.h>
 #include <linux/tick.h>
+#include <linux/random.h>
 
 /*
  * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
@@ -64,6 +65,12 @@
  */
 #define INTERVALS	8
 
+/*
+ * Percentage of the amount of weight to be shifted in the idle state weight
+ * distribution for correction
+ */
+#define LEARNING_RATE	10
+
 /**
  * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
  * @early_hits: "Early" CPU wakeups "matching" this state.
@@ -98,6 +105,8 @@ struct teo_idle_state {
  * @states: Idle states data corresponding to this CPU.
  * @interval_idx: Index of the most recent saved idle interval.
  * @intervals: Saved idle duration values.
+ * @state_mat: Each idle state maintains a weights corresponding to that
+ * state, storing the probability distribution of occurrence for that state
  */
 struct teo_cpu {
 	u64 time_span_ns;
@@ -105,6 +114,7 @@ struct teo_cpu {
 	struct teo_idle_state states[CPUIDLE_STATE_MAX];
 	int interval_idx;
 	u64 intervals[INTERVALS];
+	int state_mat[CPUIDLE_STATE_MAX][CPUIDLE_STATE_MAX];
 };
 
 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
@@ -117,7 +127,8 @@ static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
 static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
-	int i, idx_hit = -1, idx_timer = -1;
+	int i, idx_hit = -1, idx_timer = -1, idx = -1;
+	int last_idx = dev->last_state_idx;
 	u64 measured_ns;
 
 	if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
@@ -183,16 +194,50 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 
 		if (idx_timer > idx_hit) {
 			misses += PULSE;
-			if (idx_hit >= 0)
+			idx = idx_timer;
+			if (idx_hit >= 0) {
 				cpu_data->states[idx_hit].early_hits += PULSE;
+				idx = idx_hit;
+			}
 		} else {
 			hits += PULSE;
+			idx = last_idx;
 		}
 
 		cpu_data->states[idx_timer].misses = misses;
 		cpu_data->states[idx_timer].hits = hits;
 	}
 
+	/*
+	 * Rearrange the weight distribution of the state, increase the weight
+	 * by the LEARNING RATE % for the idle state that was supposed to be
+	 * chosen and reduce by the same amount for rest of the states
+	 *
+	 * If the weights are greater than (100 - LEARNING_RATE) % or lesser
+	 * than LEARNING_RATE %, do not increase or decrease the confidence
+	 * respectively
+	 */
+	for (i = 0; i < drv->state_count; i++) {
+		unsigned int delta;
+
+		if (idx == -1)
+			break;
+		if (i ==  idx) {
+			delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100;
+			if (cpu_data->state_mat[last_idx][i] + delta >=
+			    (100 - LEARNING_RATE) * 100)
+				continue;
+			cpu_data->state_mat[last_idx][i] += delta;
+			continue;
+		}
+		delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) /
+			((drv->state_count - 1) * 100);
+		if (cpu_data->state_mat[last_idx][i] - delta <=
+		    LEARNING_RATE * 100)
+			continue;
+		cpu_data->state_mat[last_idx][i] -= delta;
+	}
+
 	/*
 	 * Save idle duration values corresponding to non-timer wakeups for
 	 * pattern detection.
@@ -244,7 +289,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
 	u64 duration_ns;
 	unsigned int hits, misses, early_hits;
-	int max_early_idx, prev_max_early_idx, constraint_idx, idx, i;
+	int max_early_idx, prev_max_early_idx, constraint_idx, idx, i, og_idx;
 	ktime_t delta_tick;
 
 	if (dev->last_state_idx >= 0) {
@@ -374,10 +419,14 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	if (constraint_idx < idx)
 		idx = constraint_idx;
 
+	og_idx = idx;
+
 	if (idx < 0) {
 		idx = 0; /* No states enabled. Must use 0. */
 	} else if (idx > 0) {
-		unsigned int count = 0;
+		unsigned int weights_list[CPUIDLE_STATE_MAX];
+		unsigned int i, j = 0, rnd_wt, rnd_num = 0;
+		unsigned int count = 0, sum_weights = 0;
 		u64 sum = 0;
 
 		/*
@@ -412,6 +461,28 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 								       idx, avg_ns);
 			}
 		}
+		/*
+		 * In case, the recent history yields a shallower state, then
+		 * the probability distribution is looked at.
+		 * The weighted random number generator uses the weights as a
+		 * bias to choose the next idle state
+		 */
+		if (og_idx != idx) {
+			for (i = 0; i <= idx; i++) {
+				if (dev->states_usage[i].disable)
+					continue;
+				sum_weights += cpu_data->state_mat[idx][i];
+				weights_list[j++] = sum_weights;
+			}
+			get_random_bytes(&rnd_num, sizeof(rnd_num));
+			rnd_num = rnd_num % 100;
+			rnd_wt = (rnd_num * sum_weights) / 100;
+			for (i = 0; i < j; i++) {
+				if (rnd_wt < weights_list[i])
+					break;
+			}
+			idx = i;
+		}
 	}
 
 	/*
@@ -468,13 +539,28 @@ static int teo_enable_device(struct cpuidle_driver *drv,
 			     struct cpuidle_device *dev)
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
-	int i;
+	int i, j;
 
 	memset(cpu_data, 0, sizeof(*cpu_data));
 
 	for (i = 0; i < INTERVALS; i++)
 		cpu_data->intervals[i] = U64_MAX;
 
+	/*
+	 * Populate initial weights for each state
+	 * The stop state is initially more biased for itself.
+	 *
+	 * Currently the initial distribution of probabilities are 70%-30%.
+	 * The trailing 0s are for increased resolution.
+	 */
+	for (i = 0; i < drv->state_count; i++) {
+		for (j = 0; j < drv->state_count; j++) {
+			if (i == j)
+				cpu_data->state_mat[i][j] = 6000;
+			else
+				cpu_data->state_mat[i][j] = 4000 / (drv->state_count - 1);
+		}
+	}
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC 1/1] Weighted approach to gather and use history in TEO governor
  2020-05-11 14:10 ` [RFC 1/1] Weighted approach to gather and use history in " Pratik Rajesh Sampat
@ 2020-05-12 17:37   ` Peter Zijlstra
  2020-05-13  5:31     ` Pratik Sampat
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2020-05-12 17:37 UTC (permalink / raw)
  To: Pratik Rajesh Sampat
  Cc: linux-kernel, linux-pm, rafael.j.wysocki, dsmythies,
	daniel.lezcano, ego, svaidy, pratik.sampat, pratik.r.sampat


Just a quick note..

On Mon, May 11, 2020 at 07:40:55PM +0530, Pratik Rajesh Sampat wrote:

> +	/*
> +	 * Rearrange the weight distribution of the state, increase the weight
> +	 * by the LEARNING RATE % for the idle state that was supposed to be
> +	 * chosen and reduce by the same amount for rest of the states
> +	 *
> +	 * If the weights are greater than (100 - LEARNING_RATE) % or lesser
> +	 * than LEARNING_RATE %, do not increase or decrease the confidence
> +	 * respectively
> +	 */
> +	for (i = 0; i < drv->state_count; i++) {
> +		unsigned int delta;
> +
> +		if (idx == -1)
> +			break;
> +		if (i ==  idx) {
> +			delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100;

100 is a crap number to divide by as a computer. We bio-puddings happend
to have 10 digits, so 100 makes sense to us, but it does not to our
binary friends.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC 1/1] Weighted approach to gather and use history in TEO governor
  2020-05-12 17:37   ` Peter Zijlstra
@ 2020-05-13  5:31     ` Pratik Sampat
  2020-05-13 14:49       ` Rafael J. Wysocki
  0 siblings, 1 reply; 9+ messages in thread
From: Pratik Sampat @ 2020-05-13  5:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, rafael.j.wysocki, dsmythies,
	daniel.lezcano, ego, svaidy, pratik.sampat, pratik.r.sampat

Thanks for your comment.


On 12/05/20 11:07 pm, Peter Zijlstra wrote:
> Just a quick note..
>
> On Mon, May 11, 2020 at 07:40:55PM +0530, Pratik Rajesh Sampat wrote:
>
>> +	/*
>> +	 * Rearrange the weight distribution of the state, increase the weight
>> +	 * by the LEARNING RATE % for the idle state that was supposed to be
>> +	 * chosen and reduce by the same amount for rest of the states
>> +	 *
>> +	 * If the weights are greater than (100 - LEARNING_RATE) % or lesser
>> +	 * than LEARNING_RATE %, do not increase or decrease the confidence
>> +	 * respectively
>> +	 */
>> +	for (i = 0; i < drv->state_count; i++) {
>> +		unsigned int delta;
>> +
>> +		if (idx == -1)
>> +			break;
>> +		if (i ==  idx) {
>> +			delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100;
> 100 is a crap number to divide by as a computer. We bio-puddings happend
> to have 10 digits, so 100 makes sense to us, but it does not to our
> binary friends.
>
>
Absolutely! I just wrote the code exactly the way I did the Math on paper,
definitely need to figure out an optimal way of doing things.

~Pratik


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC 1/1] Weighted approach to gather and use history in TEO governor
  2020-05-13  5:31     ` Pratik Sampat
@ 2020-05-13 14:49       ` Rafael J. Wysocki
  2020-05-14 15:35         ` Pratik Sampat
  0 siblings, 1 reply; 9+ messages in thread
From: Rafael J. Wysocki @ 2020-05-13 14:49 UTC (permalink / raw)
  To: Pratik Sampat
  Cc: Peter Zijlstra, Linux Kernel Mailing List, Linux PM,
	Rafael Wysocki, Doug Smythies, Daniel Lezcano, Gautham R. Shenoy,
	Vaidyanathan Srinivasan, pratik.sampat, pratik.r.sampat

On Wed, May 13, 2020 at 7:31 AM Pratik Sampat <psampat@linux.ibm.com> wrote:
>
> Thanks for your comment.
>
>
> On 12/05/20 11:07 pm, Peter Zijlstra wrote:
> > Just a quick note..
> >
> > On Mon, May 11, 2020 at 07:40:55PM +0530, Pratik Rajesh Sampat wrote:
> >
> >> +    /*
> >> +     * Rearrange the weight distribution of the state, increase the weight
> >> +     * by the LEARNING RATE % for the idle state that was supposed to be
> >> +     * chosen and reduce by the same amount for rest of the states
> >> +     *
> >> +     * If the weights are greater than (100 - LEARNING_RATE) % or lesser
> >> +     * than LEARNING_RATE %, do not increase or decrease the confidence
> >> +     * respectively
> >> +     */
> >> +    for (i = 0; i < drv->state_count; i++) {
> >> +            unsigned int delta;
> >> +
> >> +            if (idx == -1)
> >> +                    break;
> >> +            if (i ==  idx) {
> >> +                    delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100;
> > 100 is a crap number to divide by as a computer. We bio-puddings happend
> > to have 10 digits, so 100 makes sense to us, but it does not to our
> > binary friends.
> >
> >
> Absolutely! I just wrote the code exactly the way I did the Math on paper,
> definitely need to figure out an optimal way of doing things.

There is no particular reason to use percent in computations at all.
You may as well use 1/1024 parts instead (and then use shifts instead
of divisions).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC 1/1] Weighted approach to gather and use history in TEO governor
  2020-05-13 14:49       ` Rafael J. Wysocki
@ 2020-05-14 15:35         ` Pratik Sampat
  0 siblings, 0 replies; 9+ messages in thread
From: Pratik Sampat @ 2020-05-14 15:35 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Linux Kernel Mailing List, Linux PM,
	Rafael Wysocki, Doug Smythies, Daniel Lezcano, Gautham R. Shenoy,
	Vaidyanathan Srinivasan, pratik.sampat, pratik.r.sampat



On 13/05/20 8:19 pm, Rafael J. Wysocki wrote:
> On Wed, May 13, 2020 at 7:31 AM Pratik Sampat <psampat@linux.ibm.com> wrote:
>> Thanks for your comment.
>>
>>
>> On 12/05/20 11:07 pm, Peter Zijlstra wrote:
>>> Just a quick note..
>>>
>>> On Mon, May 11, 2020 at 07:40:55PM +0530, Pratik Rajesh Sampat wrote:
>>>
>>>> +    /*
>>>> +     * Rearrange the weight distribution of the state, increase the weight
>>>> +     * by the LEARNING RATE % for the idle state that was supposed to be
>>>> +     * chosen and reduce by the same amount for rest of the states
>>>> +     *
>>>> +     * If the weights are greater than (100 - LEARNING_RATE) % or lesser
>>>> +     * than LEARNING_RATE %, do not increase or decrease the confidence
>>>> +     * respectively
>>>> +     */
>>>> +    for (i = 0; i < drv->state_count; i++) {
>>>> +            unsigned int delta;
>>>> +
>>>> +            if (idx == -1)
>>>> +                    break;
>>>> +            if (i ==  idx) {
>>>> +                    delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100;
>>> 100 is a crap number to divide by as a computer. We bio-puddings happend
>>> to have 10 digits, so 100 makes sense to us, but it does not to our
>>> binary friends.
>>>
>>>
>> Absolutely! I just wrote the code exactly the way I did the Math on paper,
>> definitely need to figure out an optimal way of doing things.
> There is no particular reason to use percent in computations at all.
> You may as well use 1/1024 parts instead (and then use shifts instead
> of divisions).

Yes you're right. Looking at it now the whole percent system and divisions
does seem quite unnecessary and we can achieve it rather with bitwise
operations.

Thanks!


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC 0/1] Alternate history mechanism for the TEO governor
  2020-05-11 14:10 [RFC 0/1] Alternate history mechanism for the TEO governor Pratik Rajesh Sampat
  2020-05-11 14:10 ` [RFC 1/1] Weighted approach to gather and use history in " Pratik Rajesh Sampat
@ 2020-05-17 18:11 ` Doug Smythies
  2020-05-21 11:09   ` Pratik Sampat
  1 sibling, 1 reply; 9+ messages in thread
From: Doug Smythies @ 2020-05-17 18:11 UTC (permalink / raw)
  To: 'Pratik Rajesh Sampat'
  Cc: linux-kernel, linux-pm, rafael.j.wysocki, peterz, daniel.lezcano,
	ego, svaidy, pratik.sampat, pratik.r.sampat

On 2020.05.11 Pratik Rajesh Sampat wrote:
> 
> First RFC posting: https://lkml.org/lkml/2020/2/22/27

Summary:

On that thread I wrote:

  > I have done a couple of other tests with this patch set,
  > but nothing to report yet, as the differences have been
  > minor so far.

I tried your tests, or as close as I could find, and still
do not notice much difference.

For detail, but likely little added value, read on:

Kernel: 5.7-rc4:
"teo": unmodified kernel.
"wtteo": with this patch added.
"menu": the menu idle governor, for comparison.
CPU frequency scaling driver: intel-cpufreq
CPU frequency scaling governor: schedutil
CPU idle driver: intel_idle

...

> Benchmarks:
> Schbench
> --------
> Benchmarks scheduler wakeup latencies
> 
> 1. latency 99th percentile - usec

I found a Phoronix schbench test.
It defaults to 99.9th percentile.

schbench (usec, 99.9th Latency Percentile, less is better)(8 workers)					

threads	teo		wtteo				menu	
	2	14197		14194		99.98%	14467		101.90%
	4	46733		46490		99.48%	46554		99.62%
	6	57306		58291		101.72%	57754		100.78%
	8	81408		80768		99.21%	81715		100.38%
	16	157286	156570	99.54%	156621	99.58%
	32	314573	310784	98.80%	315802	100.39%

Powers and other idle statistics were similar. [1]

> 2. Power - watts
> Machine - IBM Power 9
> 
> Latency and Power - Normalized
> +---------+--------------+-----------------+---------------+
> | Threads | TEO Baseline | Wt. TEO Latency | Wt. TEO Power |
> +---------+--------------+-----------------+---------------+
> | 2       | 100          | 101.3           | 85.29         |
> +---------+--------------+-----------------+---------------+
> | 4       | 100          | 105.06          | 113.63        |
> +---------+--------------+-----------------+---------------+
> | 8       | 100          | 92.32           | 90.36         |
> +---------+--------------+-----------------+---------------+
> | 16      | 100          | 99.1            | 92.43         |
> +---------+--------------+-----------------+---------------+
> 
> Accuracy
> 
> Vanilla TEO Governor - Prediction distribution %
> +---------+------+------+------+-------+-------+-------+---------+
> | Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
> +---------+------+------+------+-------+-------+-------+---------+
> | 2       | 6.12 | 1.08 | 1.76 | 20.41 | 9.2   | 28.74 | 22.51   |
> +---------+------+------+------+-------+-------+-------+---------+
> | 4       | 8.54 | 1.56 | 1.25 | 20.24 | 10.75 | 25.17 | 22.67   |
> +---------+------+------+------+-------+-------+-------+---------+
> | 8       | 5.88 | 2.67 | 1.09 | 13.72 | 17.08 | 32.04 | 22.95   |
> +---------+------+------+------+-------+-------+-------+---------+
> | 16      | 6.29 | 2.43 | 0.86 | 13.21 | 15.33 | 26.52 | 29.34   |
> +---------+------+------+------+-------+-------+-------+---------+
> +---------+------+------+------+
> | Threads | OS 1 | OS 2 | OS 3 |
> +---------+------+------+------+
> | 2       | 1.77 | 1.27 | 7.14 |
> +---------+------+------+------+
> | 4       | 1.8  | 1.31 | 6.71 |
> +---------+------+------+------+
> | 8       | 0.65 | 0.72 | 3.2  |
> +---------+------+------+------+
> | 16      | 0.63 | 1.71 | 3.68 |
> +---------+------+------+------+
> 
> Weighted TEO Governor - Prediction distribution %
> +---------+------+------+------+-------+-------+-------+---------+
> | Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
> +---------+------+------+------+-------+-------+-------+---------+
> | 2       | 7.26 | 2.07 | 0.02 | 15.85 | 13.29 | 36.26 | 22.13   |
> +---------+------+------+------+-------+-------+-------+---------+
> | 4       | 4.33 | 1.45 | 0.15 | 14.17 | 14.68 | 40.36 | 21.01   |
> +---------+------+------+------+-------+-------+-------+---------+
> | 8       | 4.73 | 2.46 | 0.12 | 12.48 | 14.68 | 32.38 | 28.9    |
> +---------+------+------+------+-------+-------+-------+---------+
> | 16      | 7.68 | 1.25 | 0.98 | 12.15 | 11.19 | 24.91 | 35.92   |
> +---------+------+------+------+-------+-------+-------+---------+
> +---------+------+------+------+
> | Threads | OS 1 | OS 2 | OS 3 |
> +---------+------+------+------+
> | 2       | 0.39 | 0.42 | 2.31 |
> +---------+------+------+------+
> | 4       | 0.45 | 0.51 | 2.89 |
> +---------+------+------+------+
> | 8       | 0.53 | 0.66 | 3.06 |
> +---------+------+------+------+
> | 16      | 0.97 | 1.9  | 3.05 |
> +---------+------+------+------+
> 
> Sleeping Ebizzy
> ---------------
> Program to generate workloads resembling web server workloads.
> The benchmark is customized to allow for a sleep interval -i

I found a Phoronix ebizzy, but without the customization,
which I suspect is important to demonstrate your potential
improvement.

Could you send me yours to try?

ebizzy (records per second, more is better)					

teo		wtteo				menu	
132344	132228	99.91%	130926	98.93%

Powers and other idle statistics were similar. [2]

> 1. Number of records
> 2. Power - watts
> Machine - IBM Power 9
> 
> Parameters:
> 1. -m -> Always use mmap instead of malloc
> 2. -M -> Never use mmap
> 3. -S <seconds> -> Number of seconds to run
> 4. -i <interval> -> Sleep interval

What are the units of this interval?
They must be microseconds, as that is the only thing that makes sense.

I have tried to simulate the resulting actual workflow
myself, but didn't get results like yours. (I may have done a poorly.)
My test does not produce performance data, as it just has to do its work
before the next time to do a chunk of work.
The test is:

forever
  do 100 times
    very short sleep
  enddo
  sleep for 10 milliseconds
endforever

The overheads result in enough activity.
Powers and other idle statistics were similar. [3]

> 
> Number of records and power normalized
> +-------------------+---------------+------------------+-----------------+
> | Parameters        | TEO baseline  | Wt TEO records   | Wt. TEO Power   |
> +-------------------+---------------+------------------+-----------------+
> | -S 60 -i 10000    | 100           | 106.56           | 93.95           |
> +-------------------+---------------+------------------+-----------------+
> | -m -S 60 -i 10000 | 100           | 100.62           | 82.14           |
> +-------------------+---------------+------------------+-----------------+
> | -M -S 60 -i 10000 | 100           | 104.97           | 95.19           |
> +-------------------+---------------+------------------+-----------------+
> 
> Accuracy
> 
> Vanilla TEO Governor - Prediction distribution %
> +-------------------+-------+------+------+-------+------+-------+
> | Parameters        | US 1  | US 2 | US 3 | US 4  | US 5 | US 6  |
> +-------------------+-------+------+------+-------+------+-------+
> | -S 60 -i 10000    | 45.46 | 0.52 | 1.5  | 15.34 | 2.44 | 8.61  |
> +-------------------+-------+------+------+-------+------+-------+
> | -m -S 60 -i 10000 | 4.22  | 2.08 | 0.71 | 90.01 | 0    | 0.01  |
> +-------------------+-------+------+------+-------+------+-------+
> | -M -S 60 -i 10000 | 15.78 | 1.42 | 2.4  | 22.39 | 1.68 | 11.25 |
> +-------------------+-------+------+------+-------+------+-------+
> +-------------------+---------+------+------+------+------+
> | Parameters        | Correct | OS 1 | OS 2 | OS 3 | OS 4 |
> +-------------------+---------+------+------+------+------+
> | -S 60 -i 10000    | 17.03   | 1.73 | 1.1  | 6.27 | 0    |
> +-------------------+---------+------+------+------+------+
> | -m -S 60 -i 10000 | 2.44    | 0.18 | 0.13 | 0.22 | 0    |
> +-------------------+---------+------+------+------+------+
> | -M -S 60 -i 10000 | 31.65   | 3.45 | 1.8  | 8.18 | 0    |
> +-------------------+---------+------+------+------+------+
> 
> Weigted TEO Governor - Prediction distribution %
> +-------------------+-------+------+------+-------+------+-------+
> | Parameters        | US 1  | US 2 | US 3 | US 4  | US 5 | US 6  |
> +-------------------+-------+------+------+-------+------+-------+
> | -S 60 -i 10000    | 8.25  | 0.87 | 0.98 | 19.23 | 4.05 | 26.35 |
> +-------------------+-------+------+------+-------+------+-------+
> | -m -S 60 -i 10000 | 7.69  | 4.35 | 0.93 | 82.74 | 0.01 | 0.01  |
> +-------------------+-------+------+------+-------+------+-------+
> | -M -S 60 -i 10000 | 3.73  | 3.29 | 0.73 | 13.33 | 7.38 | 18.61 |
> +-------------------+-------+------+------+-------+------+-------+
> +-------------------+---------+------+------+------+------+
> | Parameters        | Correct | OS 1 | OS 2 | OS 3 | OS 4 |
> +-------------------+---------+------+------+------+------+
> | -S 60 -i 10000    | 32.86   | 3.27 | 2.05 | 2.09 | 0    |
> +-------------------+---------+------+------+------+------+
> | -m -S 60 -i 10000 | 3.4     | 0.29 | 0.28 | 0.3  | 0    |
> +-------------------+---------+------+------+------+------+
> | -M -S 60 -i 10000 | 48.19   | 1.8  | 0.93 | 1.97 | 0.04 |
> +-------------------+---------+------+------+------+------+

For accuracy numbers, it would help to know the sample size
and the importance.

For this 60 second test, I wonder if the number of times
each idle state was entered and exited was large enough to
draw any conclusion. I often find for tests that some states are
only used a few times in 1 minute, and so don't really care about the accuracy.

Anyway, for my attempts that this test, I had to extend to a 5 minute sample
time to get adequate numbers in all idle states for the accuracy statistics.
(which showed no difference, by the way (for those not looking at the graphs).)

For my test all three governors, teo, wtteo, and menu, were
using idle state 0 about 7 to 8 thousand times per 5 minutes,
and 100% of time the assessment was the state was too shallow.
However, I don't really care because it is only 0.003% of the time,
and if idle state 0 is disabled (teo-0disable on [3] (it is enabled
again at minute 35), the power doesn't change.

All that being said, your power/accuracy results do seem correlated.

> 
> Pgbench
> -------
> pgbench is a simple program for running benchmark tests on PostgreSQL.
> It runs the same sequence of SQL commands over and over, possibly in
> multiple concurrent database sessions, and then calculates the average
> transaction rate (transactions per second).

I did not try this test or anything similar.
...

> 
> Hackbench
> ---------
> Creates a specified number of pairs of schedulable entities
> which communicate via either sockets or pipes and time how long  it
> takes for each pair to send data back and forth.
> 

I found a Phoronix version, but it doesn't like
your low loops counts, so I stayed with the default 50,000.

I suspect your low loop count results in a workflow somewhat like
your special ebizzy test. Anyway, maybe I should try your version
and low loop counts.

I did many tests, and get inconsistent results.

You use these terms like "sockets" and "pipes", but
the phoronix test uses "count" and "thread" or "process".

I only used "process" for the simple reason that there was very
very little use of idle at all with "thread", so there was no value
in any test.

hackbench test 1: all - process (seconds, less is better)					
						
test	count	teo		wtteo				menu	
1	1	8.7		8.99		103.33%	9.071		104.26%
2	2	16.509	16.96		102.73%	17.159	103.94%
3	4	33.451	34.081	101.88%	34.101	101.94%
4	8	69.037	71.647	103.78%	69.914	101.27%
5	16	161.64	165.569	102.43%	165.015	102.09%

Powers and other idle statistics were similar. [4]

hackbench test 2: count 1 - process (seconds, less is better)					
		teo	wtteo			menu	
average	8.906	8.703	97.72%	9.032	101.41%
max		9.263	8.856			9.228	
min		8.761	8.599			8.876	
Std. Dev.	0.83%	0.46%			0.80%	
runs		256	256			200	

Powers and other idle statistics were similar. [5]
However, idle state 3 is worthy of a look.

hackbench test 3: count 2 - process (seconds, less is better)					
		teo		wtteo			menu	
average	16.702	16.65	99.69%	16.796	100.56%
max		16.853	16.966		17.058	
min		16.542	16.487		16.659	
Std. Dev.	0.41%		0.59%			0.56%	
runs		100		100			100	

Powers and other idle statistics were similar. [6]
However, idle state 3 is worthy of a look.

> Machine - IBM Power 9
> 
> Scale of measurement:
> 1. Time (s)
> 2. Power (watts)
> Time is normalized
> 
> +---------+----------+----------------------+-------------------+
> | Loops   | TEO Time | Wt. TEO Time Sockets | Wt. TEO Time Pipe |
> +---------+----------+----------------------+-------------------+
> | 100     | 100      | 95.23                | 87.09             |
> +---------+----------+----------------------+-------------------+
> | 1000    | 100      | 105.81               | 98.67             |
> +---------+----------+----------------------+-------------------+
> | 10000   | 100      | 99.33                | 92.73             |
> +---------+----------+----------------------+-------------------+
> | 100000  | 100      | 98.88                | 101.99            |
> +---------+----------+----------------------+-------------------+
> | 1000000 | 100      | 100.04               | 100.2             |
> +---------+----------+----------------------+-------------------+
> 
> Power :Socket: Consistent between 135-140 watts for both TEO and Wt. TEO
>        Pipe: Consistent between 125-130 watts for both TEO and Wt. TEO
> 
> Pratik Rajesh Sampat (1):
>   Weighted approach to gather and use history in TEO governor
> 
>  drivers/cpuidle/governors/teo.c | 96 +++++++++++++++++++++++++++++++--
>  1 file changed, 91 insertions(+), 5 deletions(-)
> 
> --
> 2.17.1

I also tried Giovanni's and Mel's mmtests, (uses idle states 0 and 1 a lot)
but couldn't extract the performance report. [7]

Old sweep test, which doesn't produce performance data. [8]
Old system idle test. [9]

[1] http://www.smythies.com/~doug/linux/idle/wtteo/schbench/
[2] http://www.smythies.com/~doug/linux/idle/wtteo/ebizzy/
[3] http://www.smythies.com/~doug/linux/idle/wtteo/pn01/
[4] http://www.smythies.com/~doug/linux/idle/wtteo/hackbench/
[5] http://www.smythies.com/~doug/linux/idle/wtteo/hackbench2/
[6] http://www.smythies.com/~doug/linux/idle/wtteo/hackbench3/
[7] http://www.smythies.com/~doug/linux/idle/wtteo/mmtests-udp/
[8] http://www.smythies.com/~doug/linux/idle/wtteo/sweep/
[9] http://www.smythies.com/~doug/linux/idle/wtteo/idle/

... Doug



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC 0/1] Alternate history mechanism for the TEO governor
  2020-05-17 18:11 ` [RFC 0/1] Alternate history mechanism for the " Doug Smythies
@ 2020-05-21 11:09   ` Pratik Sampat
  2020-05-25 18:32     ` Doug Smythies
  0 siblings, 1 reply; 9+ messages in thread
From: Pratik Sampat @ 2020-05-21 11:09 UTC (permalink / raw)
  To: Doug Smythies
  Cc: linux-kernel, linux-pm, rafael.j.wysocki, peterz, daniel.lezcano,
	ego, svaidy, pratik.sampat, pratik.r.sampat

Hello Doug,

Thanks a lot for running these benchmarks on an Intel box.


On 17/05/20 11:41 pm, Doug Smythies wrote:
> On 2020.05.11 Pratik Rajesh Sampat wrote:
>> First RFC posting:https://lkml.org/lkml/2020/2/22/27
> Summary:
>
> On that thread I wrote:
>
>    > I have done a couple of other tests with this patch set,
>    > but nothing to report yet, as the differences have been
>    > minor so far.
>
> I tried your tests, or as close as I could find, and still
> do not notice much difference.

That is quite unfortunate. At least it doesn't seem to regress.

Nevertheless, as Rafael suggested aging is crucial, this patch doesn't age
weights. I do have a version with aging but I had a lot of run to run variance
so I had refrained from posting that.
I'm tweaking around the logic for aging as well as distribution of weights,
hopefully that may help.

> For detail, but likely little added value, read on:
>
> Kernel: 5.7-rc4:
> "teo": unmodified kernel.
> "wtteo": with this patch added.
> "menu": the menu idle governor, for comparison.
> CPU frequency scaling driver: intel-cpufreq
> CPU frequency scaling governor: schedutil
> CPU idle driver: intel_idle
>
> ...
>
>> Benchmarks:
>> Schbench
>> --------
>> Benchmarks scheduler wakeup latencies
>>
>> 1. latency 99th percentile - usec
> I found a Phoronix schbench test.
> It defaults to 99.9th percentile.
>
> schbench (usec, 99.9th Latency Percentile, less is better)(8 workers)					
>
> threads	teo		wtteo				menu	
> 	2	14197		14194		99.98%	14467		101.90%
> 	4	46733		46490		99.48%	46554		99.62%
> 	6	57306		58291		101.72%	57754		100.78%
> 	8	81408		80768		99.21%	81715		100.38%
> 	16	157286	156570	99.54%	156621	99.58%
> 	32	314573	310784	98.80%	315802	100.39%
>
> Powers and other idle statistics were similar. [1]
>
>> 2. Power - watts
>> Machine - IBM Power 9
>>
>> Latency and Power - Normalized
>> +---------+--------------+-----------------+---------------+
>> | Threads | TEO Baseline | Wt. TEO Latency | Wt. TEO Power |
>> +---------+--------------+-----------------+---------------+
>> | 2       | 100          | 101.3           | 85.29         |
>> +---------+--------------+-----------------+---------------+
>> | 4       | 100          | 105.06          | 113.63        |
>> +---------+--------------+-----------------+---------------+
>> | 8       | 100          | 92.32           | 90.36         |
>> +---------+--------------+-----------------+---------------+
>> | 16      | 100          | 99.1            | 92.43         |
>> +---------+--------------+-----------------+---------------+
>>
>> Accuracy
>>
>> Vanilla TEO Governor - Prediction distribution %
>> +---------+------+------+------+-------+-------+-------+---------+
>> | Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
>> +---------+------+------+------+-------+-------+-------+---------+
>> | 2       | 6.12 | 1.08 | 1.76 | 20.41 | 9.2   | 28.74 | 22.51   |
>> +---------+------+------+------+-------+-------+-------+---------+
>> | 4       | 8.54 | 1.56 | 1.25 | 20.24 | 10.75 | 25.17 | 22.67   |
>> +---------+------+------+------+-------+-------+-------+---------+
>> | 8       | 5.88 | 2.67 | 1.09 | 13.72 | 17.08 | 32.04 | 22.95   |
>> +---------+------+------+------+-------+-------+-------+---------+
>> | 16      | 6.29 | 2.43 | 0.86 | 13.21 | 15.33 | 26.52 | 29.34   |
>> +---------+------+------+------+-------+-------+-------+---------+
>> +---------+------+------+------+
>> | Threads | OS 1 | OS 2 | OS 3 |
>> +---------+------+------+------+
>> | 2       | 1.77 | 1.27 | 7.14 |
>> +---------+------+------+------+
>> | 4       | 1.8  | 1.31 | 6.71 |
>> +---------+------+------+------+
>> | 8       | 0.65 | 0.72 | 3.2  |
>> +---------+------+------+------+
>> | 16      | 0.63 | 1.71 | 3.68 |
>> +---------+------+------+------+
>>
>> Weighted TEO Governor - Prediction distribution %
>> +---------+------+------+------+-------+-------+-------+---------+
>> | Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
>> +---------+------+------+------+-------+-------+-------+---------+
>> | 2       | 7.26 | 2.07 | 0.02 | 15.85 | 13.29 | 36.26 | 22.13   |
>> +---------+------+------+------+-------+-------+-------+---------+
>> | 4       | 4.33 | 1.45 | 0.15 | 14.17 | 14.68 | 40.36 | 21.01   |
>> +---------+------+------+------+-------+-------+-------+---------+
>> | 8       | 4.73 | 2.46 | 0.12 | 12.48 | 14.68 | 32.38 | 28.9    |
>> +---------+------+------+------+-------+-------+-------+---------+
>> | 16      | 7.68 | 1.25 | 0.98 | 12.15 | 11.19 | 24.91 | 35.92   |
>> +---------+------+------+------+-------+-------+-------+---------+
>> +---------+------+------+------+
>> | Threads | OS 1 | OS 2 | OS 3 |
>> +---------+------+------+------+
>> | 2       | 0.39 | 0.42 | 2.31 |
>> +---------+------+------+------+
>> | 4       | 0.45 | 0.51 | 2.89 |
>> +---------+------+------+------+
>> | 8       | 0.53 | 0.66 | 3.06 |
>> +---------+------+------+------+
>> | 16      | 0.97 | 1.9  | 3.05 |
>> +---------+------+------+------+
>>
>> Sleeping Ebizzy
>> ---------------
>> Program to generate workloads resembling web server workloads.
>> The benchmark is customized to allow for a sleep interval -i
> I found a Phoronix ebizzy, but without the customization,
> which I suspect is important to demonstrate your potential
> improvement.
>
> Could you send me yours to try?

Sure thing, sleeping ebizzy is hosted here:
https://github.com/pratiksampat/sleeping-ebizzy

>
> ebizzy (records per second, more is better)					
>
> teo		wtteo				menu	
> 132344	132228	99.91%	130926	98.93%
>
> Powers and other idle statistics were similar. [2]
>
>> 1. Number of records
>> 2. Power - watts
>> Machine - IBM Power 9
>>
>> Parameters:
>> 1. -m -> Always use mmap instead of malloc
>> 2. -M -> Never use mmap
>> 3. -S <seconds> -> Number of seconds to run
>> 4. -i <interval> -> Sleep interval
> What are the units of this interval?
> They must be microseconds, as that is the only thing that makes sense.

Yes, it is in microseconds

> I have tried to simulate the resulting actual workflow
> myself, but didn't get results like yours. (I may have done a poorly.)
> My test does not produce performance data, as it just has to do its work
> before the next time to do a chunk of work.
> The test is:
>
> forever
>    do 100 times
>      very short sleep
>    enddo
>    sleep for 10 milliseconds
> endforever

Yes, In logic this is very similar to what benchmark emulates.

> The overheads result in enough activity.
> Powers and other idle statistics were similar. [3]
>
>> Number of records and power normalized
>> +-------------------+---------------+------------------+-----------------+
>> | Parameters        | TEO baseline  | Wt TEO records   | Wt. TEO Power   |
>> +-------------------+---------------+------------------+-----------------+
>> | -S 60 -i 10000    | 100           | 106.56           | 93.95           |
>> +-------------------+---------------+------------------+-----------------+
>> | -m -S 60 -i 10000 | 100           | 100.62           | 82.14           |
>> +-------------------+---------------+------------------+-----------------+
>> | -M -S 60 -i 10000 | 100           | 104.97           | 95.19           |
>> +-------------------+---------------+------------------+-----------------+
>>
>> Accuracy
>>
>> Vanilla TEO Governor - Prediction distribution %
>> +-------------------+-------+------+------+-------+------+-------+
>> | Parameters        | US 1  | US 2 | US 3 | US 4  | US 5 | US 6  |
>> +-------------------+-------+------+------+-------+------+-------+
>> | -S 60 -i 10000    | 45.46 | 0.52 | 1.5  | 15.34 | 2.44 | 8.61  |
>> +-------------------+-------+------+------+-------+------+-------+
>> | -m -S 60 -i 10000 | 4.22  | 2.08 | 0.71 | 90.01 | 0    | 0.01  |
>> +-------------------+-------+------+------+-------+------+-------+
>> | -M -S 60 -i 10000 | 15.78 | 1.42 | 2.4  | 22.39 | 1.68 | 11.25 |
>> +-------------------+-------+------+------+-------+------+-------+
>> +-------------------+---------+------+------+------+------+
>> | Parameters        | Correct | OS 1 | OS 2 | OS 3 | OS 4 |
>> +-------------------+---------+------+------+------+------+
>> | -S 60 -i 10000    | 17.03   | 1.73 | 1.1  | 6.27 | 0    |
>> +-------------------+---------+------+------+------+------+
>> | -m -S 60 -i 10000 | 2.44    | 0.18 | 0.13 | 0.22 | 0    |
>> +-------------------+---------+------+------+------+------+
>> | -M -S 60 -i 10000 | 31.65   | 3.45 | 1.8  | 8.18 | 0    |
>> +-------------------+---------+------+------+------+------+
>>
>> Weigted TEO Governor - Prediction distribution %
>> +-------------------+-------+------+------+-------+------+-------+
>> | Parameters        | US 1  | US 2 | US 3 | US 4  | US 5 | US 6  |
>> +-------------------+-------+------+------+-------+------+-------+
>> | -S 60 -i 10000    | 8.25  | 0.87 | 0.98 | 19.23 | 4.05 | 26.35 |
>> +-------------------+-------+------+------+-------+------+-------+
>> | -m -S 60 -i 10000 | 7.69  | 4.35 | 0.93 | 82.74 | 0.01 | 0.01  |
>> +-------------------+-------+------+------+-------+------+-------+
>> | -M -S 60 -i 10000 | 3.73  | 3.29 | 0.73 | 13.33 | 7.38 | 18.61 |
>> +-------------------+-------+------+------+-------+------+-------+
>> +-------------------+---------+------+------+------+------+
>> | Parameters        | Correct | OS 1 | OS 2 | OS 3 | OS 4 |
>> +-------------------+---------+------+------+------+------+
>> | -S 60 -i 10000    | 32.86   | 3.27 | 2.05 | 2.09 | 0    |
>> +-------------------+---------+------+------+------+------+
>> | -m -S 60 -i 10000 | 3.4     | 0.29 | 0.28 | 0.3  | 0    |
>> +-------------------+---------+------+------+------+------+
>> | -M -S 60 -i 10000 | 48.19   | 1.8  | 0.93 | 1.97 | 0.04 |
>> +-------------------+---------+------+------+------+------+
> For accuracy numbers, it would help to know the sample size
> and the importance.
>
> For this 60 second test, I wonder if the number of times
> each idle state was entered and exited was large enough to
> draw any conclusion. I often find for tests that some states are
> only used a few times in 1 minute, and so don't really care about the accuracy.

The sample size does go upto early double digit thousands but I don't really
know the physical importance of such a number.
So, I get what you're saying and maybe I need to benchmark with a longer duration
as your experience shows.

> Anyway, for my attempts that this test, I had to extend to a 5 minute sample
> time to get adequate numbers in all idle states for the accuracy statistics.
> (which showed no difference, by the way (for those not looking at the graphs).)
>
> For my test all three governors, teo, wtteo, and menu, were
> using idle state 0 about 7 to 8 thousand times per 5 minutes,
> and 100% of time the assessment was the state was too shallow.
> However, I don't really care because it is only 0.003% of the time,
> and if idle state 0 is disabled (teo-0disable on [3] (it is enabled
> again at minute 35), the power doesn't change.
>
> All that being said, your power/accuracy results do seem correlated.
>
This I believe is a good affirmation to have. I would be worried if
we predicted more correctly and somehow ended up doing worse or vise-versa.

>> Pgbench
>> -------
>> pgbench is a simple program for running benchmark tests on PostgreSQL.
>> It runs the same sequence of SQL commands over and over, possibly in
>> multiple concurrent database sessions, and then calculates the average
>> transaction rate (transactions per second).
> I did not try this test or anything similar.
> ...
>
>> Hackbench
>> ---------
>> Creates a specified number of pairs of schedulable entities
>> which communicate via either sockets or pipes and time how long  it
>> takes for each pair to send data back and forth.
>>
> I found a Phoronix version, but it doesn't like
> your low loops counts, so I stayed with the default 50,000.
>
> I suspect your low loop count results in a workflow somewhat like
> your special ebizzy test. Anyway, maybe I should try your version
> and low loop counts.
>
> I did many tests, and get inconsistent results.
>
> You use these terms like "sockets" and "pipes", but
> the phoronix test uses "count" and "thread" or "process".
>
> I only used "process" for the simple reason that there was very
> very little use of idle at all with "thread", so there was no value
> in any test.
>
> hackbench test 1: all - process (seconds, less is better)					
> 						
> test	count	teo		wtteo				menu	
> 1	1	8.7		8.99		103.33%	9.071		104.26%
> 2	2	16.509	16.96		102.73%	17.159	103.94%
> 3	4	33.451	34.081	101.88%	34.101	101.94%
> 4	8	69.037	71.647	103.78%	69.914	101.27%
> 5	16	161.64	165.569	102.43%	165.015	102.09%
>
> Powers and other idle statistics were similar. [4]
>
> hackbench test 2: count 1 - process (seconds, less is better)					
> 		teo	wtteo			menu	
> average	8.906	8.703	97.72%	9.032	101.41%
> max		9.263	8.856			9.228	
> min		8.761	8.599			8.876	
> Std. Dev.	0.83%	0.46%			0.80%	
> runs		256	256			200	
>
> Powers and other idle statistics were similar. [5]
> However, idle state 3 is worthy of a look.
>
> hackbench test 3: count 2 - process (seconds, less is better)					
> 		teo		wtteo			menu	
> average	16.702	16.65	99.69%	16.796	100.56%
> max		16.853	16.966		17.058	
> min		16.542	16.487		16.659	
> Std. Dev.	0.41%		0.59%			0.56%	
> runs		100		100			100	
>
> Powers and other idle statistics were similar. [6]
> However, idle state 3 is worthy of a look.
>
>> Machine - IBM Power 9
>>
>> Scale of measurement:
>> 1. Time (s)
>> 2. Power (watts)
>> Time is normalized
>>
>> +---------+----------+----------------------+-------------------+
>> | Loops   | TEO Time | Wt. TEO Time Sockets | Wt. TEO Time Pipe |
>> +---------+----------+----------------------+-------------------+
>> | 100     | 100      | 95.23                | 87.09             |
>> +---------+----------+----------------------+-------------------+
>> | 1000    | 100      | 105.81               | 98.67             |
>> +---------+----------+----------------------+-------------------+
>> | 10000   | 100      | 99.33                | 92.73             |
>> +---------+----------+----------------------+-------------------+
>> | 100000  | 100      | 98.88                | 101.99            |
>> +---------+----------+----------------------+-------------------+
>> | 1000000 | 100      | 100.04               | 100.2             |
>> +---------+----------+----------------------+-------------------+
>>
>> Power :Socket: Consistent between 135-140 watts for both TEO and Wt. TEO
>>         Pipe: Consistent between 125-130 watts for both TEO and Wt. TEO
>>
>> Pratik Rajesh Sampat (1):
>>    Weighted approach to gather and use history in TEO governor
>>
>>   drivers/cpuidle/governors/teo.c | 96 +++++++++++++++++++++++++++++++--
>>   1 file changed, 91 insertions(+), 5 deletions(-)
>>
>> --
>> 2.17.1
> I also tried Giovanni's and Mel's mmtests, (uses idle states 0 and 1 a lot)
> but couldn't extract the performance report. [7]
>
> Old sweep test, which doesn't produce performance data. [8]
> Old system idle test. [9]
>
> [1]http://www.smythies.com/~doug/linux/idle/wtteo/schbench/
> [2]http://www.smythies.com/~doug/linux/idle/wtteo/ebizzy/
> [3]http://www.smythies.com/~doug/linux/idle/wtteo/pn01/
> [4]http://www.smythies.com/~doug/linux/idle/wtteo/hackbench/
> [5]http://www.smythies.com/~doug/linux/idle/wtteo/hackbench2/
> [6]http://www.smythies.com/~doug/linux/idle/wtteo/hackbench3/
> [7]http://www.smythies.com/~doug/linux/idle/wtteo/mmtests-udp/
> [8]http://www.smythies.com/~doug/linux/idle/wtteo/sweep/
> [9]http://www.smythies.com/~doug/linux/idle/wtteo/idle/
>
> ... Doug
>
>
Thanks again for these comprehensive results.
~ Pratik


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC 0/1] Alternate history mechanism for the TEO governor
  2020-05-21 11:09   ` Pratik Sampat
@ 2020-05-25 18:32     ` Doug Smythies
  0 siblings, 0 replies; 9+ messages in thread
From: Doug Smythies @ 2020-05-25 18:32 UTC (permalink / raw)
  To: 'Pratik Sampat'
  Cc: linux-kernel, linux-pm, rafael.j.wysocki, peterz, daniel.lezcano,
	ego, svaidy, pratik.sampat, pratik.r.sampat

On 2020.05.21 04:09 Pratik Sampat wrote:
> On 17/05/20 11:41 pm, Doug Smythies wrote:
> > On 2020.05.11 Pratik Rajesh Sampat wrote:
> >> First RFC posting:https://lkml.org/lkml/2020/2/22/27
> > Summary:
> >
> > On that thread I wrote:
> >
> >    > I have done a couple of other tests with this patch set,
> >    > but nothing to report yet, as the differences have been
> >    > minor so far.
> >
> > I tried your tests, or as close as I could find, and still
> > do not notice much difference.
> 
> That is quite unfortunate. At least it doesn't seem to regress.

Yes, while I have not been able to demonstrate improvement,
I have not found any regression.

> 
> Nevertheless, as Rafael suggested aging is crucial, this patch doesn't age
> weights. I do have a version with aging but I had a lot of run to run variance
> so I had refrained from posting that.
> I'm tweaking around the logic for aging as well as distribution of weights,
> hopefully that may help.

O.K. I am putting this testing aside for now.
I like the set of tests, as they really show the differences between menu
and teo governors well.

> >>
> >> Sleeping Ebizzy
> >> ---------------
> >> Program to generate workloads resembling web server workloads.
> >> The benchmark is customized to allow for a sleep interval -i
> > I found a Phoronix ebizzy, but without the customization,
> > which I suspect is important to demonstrate your potential
> > improvement.
> >
> > Could you send me yours to try?
> 
> Sure thing, sleeping ebizzy is hosted here:
> https://github.com/pratiksampat/sleeping-ebizzy
> 
> >
> > ebizzy (records per second, more is better)
> >
> > teo		wtteo				menu
> > 132344	132228	99.91%	130926	98.93%

O.K. yours is way different than what I was using.
Anyway, results still are not very different
between teo and wtteo. Some tests are showing a little difference
between above/below statistics [1]

[1] http://www.smythies.com/~doug/linux/idle/wtteo/ebizzy-interval/2_below.png

By the way, and likely not relevant, your sleeping-ebizzy test
seems extremely sensitive to the interval and number of threads.
It is not clear to me what settings I should use to try to re-create
your results. [2] is an interesting graph of records per second verses
intervals verses threads.
 
[2] http://www.smythies.com/~doug/linux/idle/wtteo/doug08/sleeping-ebizzy-records-intervals-threads.png




^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-05-25 18:32 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-11 14:10 [RFC 0/1] Alternate history mechanism for the TEO governor Pratik Rajesh Sampat
2020-05-11 14:10 ` [RFC 1/1] Weighted approach to gather and use history in " Pratik Rajesh Sampat
2020-05-12 17:37   ` Peter Zijlstra
2020-05-13  5:31     ` Pratik Sampat
2020-05-13 14:49       ` Rafael J. Wysocki
2020-05-14 15:35         ` Pratik Sampat
2020-05-17 18:11 ` [RFC 0/1] Alternate history mechanism for the " Doug Smythies
2020-05-21 11:09   ` Pratik Sampat
2020-05-25 18:32     ` Doug Smythies

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).