* [PATCH 0/2] cpufreq: schedutil: Fix and optimization
@ 2017-03-19 13:21 Rafael J. Wysocki
  2017-03-19 13:30 ` [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() Rafael J. Wysocki
  2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki
  0 siblings, 2 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-19 13:21 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

Hi,

Patch [1/2] in this series fixes an initialization issue in schedutil.

Patch [2/2] is an optimization, but it might also be regarded as a fix, or at
least a workaround for the problem described below.

Namely, while playing with the passive mode in intel_pstate lately, I noticed
that schedutil evidently underestimated CPU utilization at high loads.  I guess
this has been known for a while, but it is the first time I have seen it so
clearly (by looking at the values written to the P-state request register
rather than via frequencies).

After some investigation I can quite confidently attribute that to a PELT
metric issue, as everything else has been ruled out with high confidence,
and hence the patch (which also works around the problem, adding even more
confidence to the observation that PELT underestimates CPU utilization at
least sometimes).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
  2017-03-19 13:21 [PATCH 0/2] cpufreq: schedutil: Fix and optimization Rafael J. Wysocki
@ 2017-03-19 13:30 ` Rafael J. Wysocki
  2017-03-20  3:28   ` Viresh Kumar
  2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki
  1 sibling, 1 reply; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-19 13:30 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

sugov_start() only initializes struct sugov_cpu per-CPU structures
for shared policies, but it should do that for single-CPU policies too.

That in particular makes the IO-wait boost mechanism work in the
cases when cpufreq policies correspond to individual CPUs.

Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/sched/cpufreq_schedutil.c |   20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -577,20 +577,14 @@ static int sugov_start(struct cpufreq_po
 	for_each_cpu(cpu, policy->cpus) {
 		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
 
+		memset(sg_cpu, 0, sizeof(*sg_cpu));
 		sg_cpu->sg_policy = sg_policy;
-		if (policy_is_shared(policy)) {
-			sg_cpu->util = 0;
-			sg_cpu->max = 0;
-			sg_cpu->flags = SCHED_CPUFREQ_RT;
-			sg_cpu->last_update = 0;
-			sg_cpu->iowait_boost = 0;
-			sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq;
-			cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
-						     sugov_update_shared);
-		} else {
-			cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
-						     sugov_update_single);
-		}
+		sg_cpu->flags = SCHED_CPUFREQ_RT;
+		sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq;
+		cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+					     policy_is_shared(policy) ?
+							sugov_update_shared :
+							sugov_update_single);
 	}
 	return 0;
 }

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-19 13:21 [PATCH 0/2] cpufreq: schedutil: Fix and optimization Rafael J. Wysocki
  2017-03-19 13:30 ` [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() Rafael J. Wysocki
@ 2017-03-19 13:34 ` Rafael J. Wysocki
  2017-03-19 21:24   ` Rafael J. Wysocki
                     ` (3 more replies)
  1 sibling, 4 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-19 13:34 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The PELT metric used by the schedutil governor underestimates the
CPU utilization in some cases.  The reason for that may be time spent
in interrupt handlers and similar which is not accounted for by PELT.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register.  Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again.  That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

To work around this issue use the observation that, from the
schedutil governor's perspective, CPUs that are never idle should
always run at the maximum frequency and make that happen.

To that end, add a counter of idle calls to struct sugov_cpu and
modify cpuidle_idle_call() to increment that counter every time it
is about to put the given CPU into an idle state.  Next, make the
schedutil governor look at that counter for the current CPU every
time before it is about to start heavy computations.  If the counter
has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
the CPU has not been idle for at least that long and the governor
will choose the maximum frequency for it without looking at the PELT
metric at all.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 include/linux/sched/cpufreq.h    |    6 ++++++
 kernel/sched/cpufreq_schedutil.c |   38 ++++++++++++++++++++++++++++++++++++--
 kernel/sched/idle.c              |    3 +++
 3 files changed, 45 insertions(+), 2 deletions(-)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -20,6 +20,7 @@
 #include "sched.h"
 
 #define SUGOV_KTHREAD_PRIORITY	50
+#define SUGOV_BUSY_THRESHOLD	(50 * NSEC_PER_MSEC)
 
 struct sugov_tunables {
 	struct gov_attr_set attr_set;
@@ -55,6 +56,9 @@ struct sugov_cpu {
 
 	unsigned long iowait_boost;
 	unsigned long iowait_boost_max;
+	unsigned long idle_calls;
+	unsigned long saved_idle_calls;
+	u64 busy_start;
 	u64 last_update;
 
 	/* The fields below are only needed when sharing a policy. */
@@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
 	sg_cpu->iowait_boost >>= 1;
 }
 
+void cpufreq_schedutil_idle_call(void)
+{
+	struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
+
+	sg_cpu->idle_calls++;
+}
+
+static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+{
+	if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
+		sg_cpu->busy_start = 0;
+		return false;
+	}
+
+	if (!sg_cpu->busy_start) {
+		sg_cpu->busy_start = sg_cpu->last_update;
+		return false;
+	}
+
+	return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
+}
+
+static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
+{
+	if (!sg_cpu->busy_start)
+		sg_cpu->saved_idle_calls = sg_cpu->idle_calls;
+}
+
 static void sugov_update_single(struct update_util_data *hook, u64 time,
 				unsigned int flags)
 {
@@ -207,7 +239,7 @@ static void sugov_update_single(struct u
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
-	if (flags & SCHED_CPUFREQ_RT_DL) {
+	if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
 		next_f = policy->cpuinfo.max_freq;
 	} else {
 		sugov_get_util(&util, &max);
@@ -215,6 +247,7 @@ static void sugov_update_single(struct u
 		next_f = get_next_freq(sg_policy, util, max);
 	}
 	sugov_update_commit(sg_policy, time, next_f);
+	sugov_save_idle_calls(sg_cpu);
 }
 
 static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
@@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
 	sg_cpu->last_update = time;
 
 	if (sugov_should_update_freq(sg_policy, time)) {
-		if (flags & SCHED_CPUFREQ_RT_DL)
+		if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))
 			next_f = sg_policy->policy->cpuinfo.max_freq;
 		else
 			next_f = sugov_next_freq_shared(sg_cpu);
 
 		sugov_update_commit(sg_policy, time, next_f);
+		sugov_save_idle_calls(sg_cpu);
 	}
 
 	raw_spin_unlock(&sg_policy->update_lock);
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -151,6 +151,9 @@ static void cpuidle_idle_call(void)
 	 */
 	rcu_idle_enter();
 
+	/* Notify the schedutil cpufreq governor that we are entering idle. */
+	cpufreq_schedutil_idle_call();
+
 	if (cpuidle_not_available(drv, dev)) {
 		default_idle_call();
 		goto exit_idle;
Index: linux-pm/include/linux/sched/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/sched/cpufreq.h
+++ linux-pm/include/linux/sched/cpufreq.h
@@ -24,4 +24,10 @@ void cpufreq_add_update_util_hook(int cp
 void cpufreq_remove_update_util_hook(int cpu);
 #endif /* CONFIG_CPU_FREQ */
 
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+extern void cpufreq_schedutil_idle_call(void);
+#else
+static inline void cpufreq_schedutil_idle_call(void) {}
+#endif /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
 #endif /* _LINUX_SCHED_CPUFREQ_H */

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki
@ 2017-03-19 21:24   ` Rafael J. Wysocki
  2017-03-19 21:42     ` Rafael J. Wysocki
  2017-03-20 10:38     ` Peter Zijlstra
  2017-03-20  3:57   ` Viresh Kumar
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-19 21:24 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 3957 bytes --]

On Sunday, March 19, 2017 02:34:32 PM Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The PELT metric used by the schedutil governor underestimates the
> CPU utilization in some cases.  The reason for that may be time spent
> in interrupt handlers and similar which is not accounted for by PELT.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.

In case you are wondering about the actual numbers, attached are two turbostat
log files from two runs of the same workload, without (before.txt) and with (after.txt)
the patch applied.

The workload is essentially "make -j 5" in the kernel source tree and the
machine has SSD storage and a quad-core Intel Sandy Bridge processor.
The P-states available for each core are between 8 and 31 (0x1f) corresponding
to 800 MHz and 3.1 GHz, respectively.  All cores can run sustainably at 2.9 GHz
at the same time, although that is not a guaranteed sustainable frequency
(it may be dropped occasionally for thermal reasons, for example).

The interesting columns are Bzy_MHz (and specifically the rows with "-" under
CPU that correspond to the entire processor), which is the average frequency
between iterations based on the numbers read from the feedback registers, and
the rightmost one, which shows the values written to the P-state request register
(the 3rd and 4th hex digits from the right represent the requested P-state).
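
For reference, here is a minimal user-space sketch (not part of the patch) of
how those logged register values can be decoded, assuming the layout described
above (requested P-state in bits 8-15, 100 MHz per P-state bin on this
machine); the sample values are hypothetical:

	/*
	 * Sketch only: decode values written to the P-state request register
	 * as logged by turbostat.  Assumes the requested P-state sits in the
	 * 3rd and 4th hex digits from the right (bits 8-15) and that each
	 * P-state bin corresponds to 100 MHz on this machine.
	 */
	#include <stdio.h>
	#include <stdint.h>

	static unsigned int requested_pstate(uint64_t msr_val)
	{
		return (msr_val >> 8) & 0xff;	/* bits 8-15 of the written value */
	}

	int main(void)
	{
		uint64_t samples[] = { 0x1f00, 0x0800, 0x1200 };	/* hypothetical log entries */
		unsigned int i;

		for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
			unsigned int ps = requested_pstate(samples[i]);

			printf("0x%08llx -> P-state %u (%u MHz)\n",
			       (unsigned long long)samples[i], ps, ps * 100);
		}
		return 0;
	}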

The turbostat data collection ran every 2 seconds and I looked at the last 30
iterations in each case corresponding to about 1 minute of the workload run
during which all of the cores were around 100% busy.

Now, if you look at after.txt (the run with the patch applied), you'll notice that
during those last 30 iterations P-state 31 (0x1f) had been requested on all
cores pretty much 100% of the time (meaning: as expected in that case) and
the average processor frequency (computed by taking the average from
all of the 30 "-" rows) was 2899.33 MHz (apparently, the hardware decided to
drop it from 2.9 GHz occasionally).

In the before.txt case (without the patch) the average frequency over the last
30 iterations was 2896.90 MHz which is about 0.8% slower than with the patch
applied (on the average).  That already is quite a measurable difference, but it
would have been much worse if the processor had not coordinated P-states in
hardware (such that if any core requested 31, the processor would pick that
one or close to it for the entire package regardless of the requests from the
other cores).

Namely, if you look at the P-states requested for the different cores during the
last 30 iterations of the before.txt run (which essentially are the values the
governor thinks should be used), their average is 27.25 (almost 4 bins below
the maximum) and the standard deviation is 6, so it is not as though they are
just a little off occasionally.  At least some of them are way off most of the time.
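
For completeness, a small sketch (again not part of the patch) of how a mean
and standard deviation like the above can be computed from a list of requested
P-states; the values in the array below are hypothetical placeholders, not the
actual before.txt data:

	#include <stdio.h>
	#include <math.h>

	int main(void)
	{
		/* Hypothetical requested P-states; the real numbers come from the log. */
		unsigned int pstates[] = { 31, 25, 18, 31, 22, 31, 28, 12 };
		unsigned int n = sizeof(pstates) / sizeof(pstates[0]);
		double mean = 0.0, var = 0.0;
		unsigned int i;

		for (i = 0; i < n; i++)
			mean += pstates[i];
		mean /= n;

		for (i = 0; i < n; i++)
			var += (pstates[i] - mean) * (pstates[i] - mean);
		var /= n;	/* population variance */

		printf("mean %.2f, stddev %.2f\n", mean, sqrt(var));
		return 0;
	}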

Honestly, if the processor had been capable of doing per-core P-states, that
would have been a disaster and there are customers who wouldn't look at
schedutil again after being confronted with these numbers.

So this is rather serious.

BTW, both intel_pstate in the active mode and ondemand request 0x1f on
all cores for that workload, like in the after.txt case.

Thanks,
Rafael

[-- Attachment #2: before.txt --]
[-- Type: text/plain, Size: 15917 bytes --]

CPUID(7): No-SGX
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -      93   11.26     826    2494  0x00000000
       0      61    7.56     808    2494  0x00000800
       2      90   10.87     832    2494  0x00000800
       1     130   15.79     822    2494  0x00000800
       3      90   10.83     837    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -      54    6.63     811    2494  0x00000000
       0      46    5.68     816    2494  0x00000800
       2      55    6.87     804    2494  0x00000800
       1      79    9.77     806    2494  0x00000800
       3      35    4.20     828    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -     305   36.58     835    2494  0x00000000
       0     332   39.69     838    2494  0x00000800
       2     307   36.95     833    2494  0x00000800
       1     248   29.82     832    2494  0x00000800
       3     333   39.87     836    2494  0x00000c00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -     357   38.08     940    2495  0x00000000
       0     424   36.72    1158    2495  0x00000800
       2     312   36.27     862    2495  0x00000800
       1     407   46.72     874    2494  0x00000800
       3     284   32.62     874    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2017   92.64    2183    2494  0x00000000
       0    2041   94.65    2162    2494  0x00000800
       2    2032   93.65    2175    2494  0x00001200
       1    2024   92.33    2197    2494  0x00001f00
       3    1972   89.92    2198    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2455   99.21    2480    2494  0x00000000
       0    2450   98.95    2482    2494  0x00000900
       2    2457   99.29    2480    2494  0x00000900
       1    2453   99.11    2481    2494  0x00001000
       3    2461   99.49    2479    2494  0x00000d00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2308   99.14    2333    2494  0x00000000
       0    2309   99.04    2337    2494  0x00001f00
       2    2307   99.12    2333    2494  0x00001200
       1    2305   99.03    2333    2494  0x00001600
       3    2310   99.39    2329    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2613   99.22    2639    2494  0x00000000
       0    2614   99.32    2638    2494  0x00000800
       2    2589   98.19    2643    2494  0x00001200
       1    2623   99.60    2640    2494  0x00001800
       3    2624   99.76    2637    2494  0x00000a00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2687   98.70    2729    2494  0x00000000
       0    2700   99.24    2727    2494  0x00000800
       2    2707   99.48    2728    2494  0x00001f00
       1    2643   97.09    2728    2494  0x00001c00
       3    2699   99.01    2733    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.98    2900    2494  0x00000000
       0    2892   99.96    2900    2494  0x00001f00
       2    2892   99.96    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001b00
       2    2893  100.00    2900    2494  0x00001800
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001a00
       2    2893  100.00    2900    2494  0x00000d00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001900
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2886  100.00    2893    2494  0x00000000
       0    2886  100.00    2893    2494  0x00001f00
       2    2886  100.00    2893    2494  0x00001f00
       1    2886  100.00    2893    2494  0x00001900
       3    2886  100.00    2893    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001b00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001c00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001900
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2890   99.99    2897    2494  0x00000000
       0    2890  100.00    2897    2494  0x00001f00
       2    2890  100.00    2897    2494  0x00001800
       1    2890  100.00    2897    2494  0x00000e00
       3    2889   99.97    2897    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001600
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001f00
       2    2892  100.00    2898    2494  0x00001f00
       1    2892  100.00    2898    2494  0x00000a00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2890  100.00    2897    2494  0x00000000
       0    2890  100.00    2897    2494  0x00001f00
       2    2890  100.00    2897    2494  0x00001f00
       1    2890  100.00    2897    2494  0x00001b00
       3    2890  100.00    2897    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2870   99.94    2878    2494  0x00000000
       0    2872  100.00    2878    2494  0x00001f00
       2    2867   99.84    2878    2494  0x00001f00
       1    2872  100.00    2878    2494  0x00001f00
       3    2870   99.92    2879    2494  0x00001200
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00000f00
       2    2892  100.00    2898    2494  0x00001b00
       1    2892  100.00    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2891   99.94    2899    2494  0x00000000
       0    2893  100.00    2899    2494  0x00001f00
       2    2887   99.78    2899    2494  0x00001f00
       1    2893  100.00    2899    2494  0x00001f00
       3    2893  100.00    2899    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001d00
       2    2893  100.00    2900    2494  0x00001c00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2889  100.00    2895    2494  0x00000000
       0    2889  100.00    2895    2494  0x00001f00
       2    2889  100.00    2895    2494  0x00001600
       1    2889  100.00    2895    2494  0x00001c00
       3    2889  100.00    2895    2494  0x00000b00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001800
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001600
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.99    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001200
       3    2892   99.96    2900    2494  0x00001300
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2891   99.99    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001f00
       2    2891   99.97    2898    2494  0x00001f00
       1    2892  100.00    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892   99.99    2899    2494  0x00000000
       0    2892  100.00    2899    2494  0x00001f00
       2    2892  100.00    2899    2494  0x00001100
       1    2891   99.96    2899    2494  0x00001f00
       3    2892  100.00    2899    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2886  100.00    2893    2494  0x00000000
       0    2886  100.00    2893    2494  0x00000f00
       2    2886  100.00    2893    2494  0x00000900
       1    2886  100.00    2893    2494  0x00001300
       3    2886  100.00    2893    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2495  0x00000000
       0    2892  100.00    2898    2495  0x00001f00
       2    2892  100.00    2898    2495  0x00001700
       1    2892  100.00    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001700
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00000d00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2899    2494  0x00000000
       0    2892   99.98    2899    2494  0x00001f00
       2    2892  100.00    2899    2494  0x00001f00
       1    2892  100.00    2899    2494  0x00001f00
       3    2892  100.00    2899    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2879  100.00    2885    2494  0x00000000
       0    2879  100.00    2885    2494  0x00001500
       2    2879  100.00    2885    2494  0x00000900
       1    2878   99.99    2885    2494  0x00001f00
       3    2879  100.00    2885    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2883  100.00    2889    2494  0x00000000
       0    2883  100.00    2889    2494  0x00001f00
       2    2883  100.00    2889    2494  0x00001f00
       1    2883  100.00    2889    2494  0x00001f00
       3    2883  100.00    2889    2494  0x00001600
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001400
       1    2893  100.00    2900    2494  0x00001b00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001100
       2    2893  100.00    2900    2494  0x00001000
       1    2893  100.00    2900    2494  0x00001f00
       3    2893   99.98    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001300
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001500
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2899    2494  0x00000000
       0    2893  100.00    2899    2494  0x00001f00
       2    2893  100.00    2899    2494  0x00001c00
       1    2893  100.00    2899    2494  0x00001f00
       3    2893  100.00    2899    2494  0x00001c00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001b00
       3    2893  100.00    2900    2494  0x00001d00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2879  100.00    2886    2494  0x00000000
       0    2879  100.00    2886    2494  0x00001200
       2    2879  100.00    2886    2494  0x00001f00
       1    2879  100.00    2886    2494  0x00001f00
       3    2879  100.00    2886    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.99    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2892   99.98    2900    2494  0x00001300
       3    2893  100.00    2900    2494  0x00001000
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2495  0x00000000
       0    2892  100.00    2898    2495  0x00001f00
       2    2892  100.00    2898    2495  0x00000800
       1    2892  100.00    2898    2495  0x00001b00
       3    2892  100.00    2898    2494  0x00001600
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.99    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001b00
       1    2892   99.96    2900    2494  0x00001f00
       3    2893   99.99    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2891   99.99    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001a00
       2    2892  100.00    2898    2494  0x00001f00
       1    2892  100.00    2898    2494  0x00001f00
       3    2890   99.95    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001600
       2    2893  100.00    2900    2494  0x00001100
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001f00
       2    2892  100.00    2898    2494  0x00001f00
       1    2892  100.00    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001c00

[-- Attachment #3: after.txt --]
[-- Type: text/plain, Size: 18461 bytes --]

CPUID(7): No-SGX
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -     171   20.00     854    2496  0x00000000
       0     182   21.23     858    2496  0x00000800
       2     170   19.93     855    2496  0x00000800
       1     173   20.54     845    2496  0x00000800
       3     157   18.32     858    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -     222   25.57     872    2494  0x00000000
       0     210   23.66     889    2494  0x00000800
       2     200   23.20     865    2494  0x00000800
       1     189   21.33     886    2494  0x00000800
       3     291   34.09     856    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -     132   16.37     808    2494  0x00000000
       0     131   16.19     811    2494  0x00000800
       2     215   26.87     801    2494  0x00000800
       1     107   13.33     803    2494  0x00000800
       3      76    9.08     834    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -     503   47.44    1063    2494  0x00000000
       0     510   48.60    1051    2494  0x00000a00
       2     482   45.17    1070    2494  0x00000800
       1     512   48.94    1048    2494  0x00000a00
       3     508   47.06    1082    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -     447   40.48    1108    2494  0x00000000
       0     444   39.85    1117    2494  0x00000800
       2     422   38.66    1093    2494  0x00000a00
       1     440   38.68    1140    2494  0x00000800
       3     484   44.75    1084    2494  0x00000a00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -     405   34.24    1185    2494  0x00000000
       0     422   33.63    1258    2494  0x00000800
       2     435   35.97    1211    2494  0x00000800
       1     296   26.52    1118    2494  0x00000800
       3     468   40.85    1147    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -     155   19.07     813    2494  0x00000000
       0     161   19.92     809    2494  0x00000800
       2     156   19.31     810    2494  0x00000800
       1     119   14.79     809    2494  0x00000800
       3     183   22.26     823    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -     632   36.32    1744    2494  0x00000000
       0    1237   58.35    2125    2494  0x00001f00
       2     503   31.16    1617    2494  0x00000800
       1     396   28.29    1403    2494  0x00000800
       3     392   27.50    1430    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -     799   36.92    2168    2494  0x00000000
       0    1935   77.42    2505    2494  0x00000800
       2     386   22.32    1733    2494  0x00000800
       1     474   25.39    1870    2494  0x00000800
       3     400   22.53    1778    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    1856   82.07    2267    2494  0x00000000
       0    1850   82.38    2251    2494  0x00000900
       2    1836   81.44    2260    2494  0x00001500
       1    1850   80.66    2299    2494  0x00001700
       3    1889   83.80    2259    2494  0x00000a00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2601   95.10    2741    2494  0x00000000
       0    2625   96.00    2740    2494  0x00001200
       2    2573   93.94    2746    2494  0x00001000
       1    2635   96.58    2734    2494  0x00000c00
       3    2570   93.87    2744    2494  0x00000b00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2488   92.91    2684    2494  0x00000000
       0    2475   92.17    2692    2494  0x00000b00
       2    2525   94.49    2678    2494  0x00001c00
       1    2506   93.54    2685    2494  0x00001f00
       3    2446   91.43    2682    2494  0x00000800
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2508   94.52    2659    2494  0x00000000
       0    2551   96.11    2661    2494  0x00001200
       2    2525   95.10    2661    2494  0x00001f00
       1    2531   95.48    2657    2494  0x00001f00
       3    2423   91.40    2657    2494  0x00001400
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2616   94.99    2760    2494  0x00000000
       0    2577   93.80    2754    2494  0x00001f00
       2    2653   96.61    2753    2494  0x00001f00
       1    2606   94.38    2767    2494  0x00001f00
       3    2626   95.17    2765    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2851   98.63    2897    2494  0x00000000
       0    2823   97.66    2897    2494  0x00000f00
       2    2854   98.71    2898    2494  0x00001600
       1    2881   99.69    2897    2494  0x00001f00
       3    2845   98.44    2897    2494  0x00001900
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2871   99.23    2900    2494  0x00000000
       0    2884   99.68    2900    2494  0x00001900
       2    2868   99.11    2900    2494  0x00001f00
       1    2875   99.35    2900    2494  0x00001f00
       3    2858   98.78    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892   99.96    2900    2494  0x00000000
       0    2893   99.97    2900    2494  0x00001f00
       2    2892   99.93    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2892   99.94    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.99    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893   99.97    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001f00
       2    2892  100.00    2898    2494  0x00001f00
       1    2892  100.00    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.99    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893   99.97    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001f00
       2    2892  100.00    2898    2494  0x00001f00
       1    2892  100.00    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.99    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893   99.98    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2495  0x00000000
       0    2892  100.00    2898    2495  0x00001f00
       2    2892  100.00    2898    2495  0x00001f00
       1    2892  100.00    2898    2495  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001f00
       2    2892  100.00    2898    2494  0x00001f00
       1    2892  100.00    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.99    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2892   99.94    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2891   99.99    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001f00
       2    2892  100.00    2898    2494  0x00001f00
       1    2890   99.95    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.99    2900    2494  0x00000000
       0    2892   99.96    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.98    2900    2494  0x00000000
       0    2892   99.97    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2892   99.95    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.99    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893   99.98    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2894  100.00    2900    2494  0x00000000
       0    2894  100.00    2900    2495  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2890  100.00    2897    2494  0x00000000
       0    2890  100.00    2897    2494  0x00001f00
       2    2890  100.00    2897    2494  0x00001f00
       1    2890  100.00    2897    2494  0x00001f00
       3    2890  100.00    2897    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893   99.99    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2892   99.96    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2888   99.99    2895    2494  0x00000000
       0    2888  100.00    2895    2494  0x00001f00
       2    2888  100.00    2895    2494  0x00001f00
       1    2888  100.00    2895    2494  0x00001f00
       3    2888   99.98    2895    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001f00
       2    2892  100.00    2898    2494  0x00001f00
       1    2892  100.00    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001f00
       2    2892  100.00    2898    2494  0x00001f00
       1    2892  100.00    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892  100.00    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001f00
       2    2892  100.00    2898    2494  0x00001f00
       1    2892  100.00    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893   99.99    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2892   99.96    2900    2494  0x00000000
       0    2889   99.84    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2893  100.00    2900    2494  0x00000000
       0    2893  100.00    2900    2494  0x00001f00
       2    2893  100.00    2900    2494  0x00001f00
       1    2893  100.00    2900    2494  0x00001f00
       3    2893  100.00    2900    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2886   99.81    2898    2494  0x00000000
       0    2892  100.00    2898    2494  0x00001f00
       2    2890   99.95    2898    2494  0x00001f00
       1    2872   99.30    2898    2494  0x00001f00
       3    2891   99.98    2898    2494  0x00001f00
     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
       -    2889   99.90    2898    2494  0x00000000
       0    2891   99.98    2898    2494  0x00001f00
       2    2892  100.00    2898    2494  0x00001f00
       1    2880   99.61    2898    2494  0x00001f00
       3    2892  100.00    2898    2494  0x00001f00

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-19 21:24   ` Rafael J. Wysocki
@ 2017-03-19 21:42     ` Rafael J. Wysocki
  2017-03-20 10:38     ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-19 21:42 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Sunday, March 19, 2017 10:24:24 PM Rafael J. Wysocki wrote:
> On Sunday, March 19, 2017 02:34:32 PM Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > The PELT metric used by the schedutil governor underestimates the
> > CPU utilization in some cases.  The reason for that may be time spent
> > in interrupt handlers and similar which is not accounted for by PELT.
> > 
> > That can be easily demonstrated by running kernel compilation on
> > a Sandy Bridge Intel processor, running turbostat in parallel with
> > it and looking at the values written to the MSR_IA32_PERF_CTL
> > register.  Namely, the expected result would be that when all CPUs
> > were 100% busy, all of them would be requested to run in the maximum
> > P-state, but observation shows that this clearly isn't the case.
> > The CPUs run in the maximum P-state for a while and then are
> > requested to run slower and go back to the maximum P-state after
> > a while again.  That causes the actual frequency of the processor to
> > visibly oscillate below the sustainable maximum in a jittery fashion
> > which clearly is not desirable.
> 
> In case you are wondering about the actual numbers, attached are two turbostat
> log files from two runs of the same workload, without (before.txt) and with (after.txt)
> the patch applied.
> 
> The workload is essentially "make -j 5" in the kernel source tree and the
> machine has SSD storage and a quad-core Intel Sandy Bridge processor.
> The P-states available for each core are between 8 and 31 (0x1f) corresponding
> to 800 MHz and 3.1 GHz, respectively.  All cores can run sustainably at 2.9 GHz
> at the same time, although that is not a guaranteed sustainable frequency
> (it may be dropped occasionally for thermal reasons, for example).
> 
> The interesting columns are Bzy_MHz (and specifically the rows with "-" under
> CPU that correspond to the entire processor), which is the average frequency
> between iterations based on the numbers read from the feedback registers, and
> the rightmost one, which shows the values written to the P-state request register
> (the 3rd and 4th hex digits from the right represent the requested P-state).
> 
> The turbostat data collection ran every 2 seconds and I looked at the last 30
> iterations in each case corresponding to about 1 minute of the workload run
> during which all of the cores were around 100% busy.
> 
> Now, if you look at after.txt (the run with the patch applied), you'll notice that
> during those last 30 iterations P-state 31 (0x1f) had been requested on all
> cores pretty much 100% of the time (meaning: as expected in that case) and
> the average processor frequency (computed by taking the average from
> all of the 30 "-" rows) was 2899.33 MHz (apparently, the hardware decided to
> drop it from 2.9 GHz occasionally).
> 
> In the before.txt case (without the patch) the average frequency over the last
> 30 iterations was 2896.90 MHz which is about 0.8% slower than with the patch
> applied (on the average).

0.08% of course, sorry.  Still visible, though. :-)

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
  2017-03-19 13:30 ` [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() Rafael J. Wysocki
@ 2017-03-20  3:28   ` Viresh Kumar
  2017-03-20 12:36     ` Rafael J. Wysocki
  0 siblings, 1 reply; 71+ messages in thread
From: Viresh Kumar @ 2017-03-20  3:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 19-03-17, 14:30, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> sugov_start() only initializes struct sugov_cpu per-CPU structures
> for shared policies, but it should do that for single-CPU policies too.
> 
> That in particular makes the IO-wait boost mechanism work in the
> cases when cpufreq policies correspond to individual CPUs.
> 
> Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Cc: 4.9+ <stable@vger.kernel.org> # 4.9+

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki
  2017-03-19 21:24   ` Rafael J. Wysocki
@ 2017-03-20  3:57   ` Viresh Kumar
  2017-03-20  8:26     ` Vincent Guittot
  2017-03-20 12:48     ` Rafael J. Wysocki
  2017-03-20 10:36   ` Peter Zijlstra
  2017-03-20 21:46   ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki
  3 siblings, 2 replies; 71+ messages in thread
From: Viresh Kumar @ 2017-03-20  3:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The PELT metric used by the schedutil governor underestimates the
> CPU utilization in some cases.  The reason for that may be time spent
> in interrupt handlers and similar which is not accounted for by PELT.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> To work around this issue use the observation that, from the
> schedutil governor's perspective, CPUs that are never idle should
> always run at the maximum frequency and make that happen.
> 
> To that end, add a counter of idle calls to struct sugov_cpu and
> modify cpuidle_idle_call() to increment that counter every time it
> is about to put the given CPU into an idle state.  Next, make the
> schedutil governor look at that counter for the current CPU every
> time before it is about to start heavy computations.  If the counter
> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> the CPU has not been idle for at least that long and the governor
> will choose the maximum frequency for it without looking at the PELT
> metric at all.

Looks like we are fixing a PELT problem with a schedutil Hack :)

> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  include/linux/sched/cpufreq.h    |    6 ++++++
>  kernel/sched/cpufreq_schedutil.c |   38 ++++++++++++++++++++++++++++++++++++--
>  kernel/sched/idle.c              |    3 +++
>  3 files changed, 45 insertions(+), 2 deletions(-)
> 
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -20,6 +20,7 @@
>  #include "sched.h"
>  
>  #define SUGOV_KTHREAD_PRIORITY	50
> +#define SUGOV_BUSY_THRESHOLD	(50 * NSEC_PER_MSEC)
>  
>  struct sugov_tunables {
>  	struct gov_attr_set attr_set;
> @@ -55,6 +56,9 @@ struct sugov_cpu {
>  
>  	unsigned long iowait_boost;
>  	unsigned long iowait_boost_max;
> +	unsigned long idle_calls;
> +	unsigned long saved_idle_calls;
> +	u64 busy_start;
>  	u64 last_update;
>  
>  	/* The fields below are only needed when sharing a policy. */
> @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
>  	sg_cpu->iowait_boost >>= 1;
>  }
>  
> +void cpufreq_schedutil_idle_call(void)
> +{
> +	struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
> +
> +	sg_cpu->idle_calls++;
> +}
> +
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> +	if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
> +		sg_cpu->busy_start = 0;
> +		return false;
> +	}
> +
> +	if (!sg_cpu->busy_start) {
> +		sg_cpu->busy_start = sg_cpu->last_update;
> +		return false;
> +	}
> +
> +	return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
> +}
> +
> +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
> +{
> +	if (!sg_cpu->busy_start)
> +		sg_cpu->saved_idle_calls = sg_cpu->idle_calls;

Why aren't we doing this in sugov_cpu_is_busy() itself? And isn't it possible
for idle_calls to get incremented by this time?

> +}
> +
>  static void sugov_update_single(struct update_util_data *hook, u64 time,
>  				unsigned int flags)
>  {
> @@ -207,7 +239,7 @@ static void sugov_update_single(struct u
>  	if (!sugov_should_update_freq(sg_policy, time))
>  		return;
>  
> -	if (flags & SCHED_CPUFREQ_RT_DL) {
> +	if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
>  		next_f = policy->cpuinfo.max_freq;
>  	} else {
>  		sugov_get_util(&util, &max);
> @@ -215,6 +247,7 @@ static void sugov_update_single(struct u
>  		next_f = get_next_freq(sg_policy, util, max);
>  	}
>  	sugov_update_commit(sg_policy, time, next_f);
> +	sugov_save_idle_calls(sg_cpu);
>  }
>  
>  static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
> @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
>  	sg_cpu->last_update = time;
>  
>  	if (sugov_should_update_freq(sg_policy, time)) {
> -		if (flags & SCHED_CPUFREQ_RT_DL)
> +		if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))

What about the other CPUs in this policy?

>  			next_f = sg_policy->policy->cpuinfo.max_freq;
>  		else
>  			next_f = sugov_next_freq_shared(sg_cpu);
>  
>  		sugov_update_commit(sg_policy, time, next_f);
> +		sugov_save_idle_calls(sg_cpu);
>  	}
>  
>  	raw_spin_unlock(&sg_policy->update_lock);

-- 
viresh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20  3:57   ` Viresh Kumar
@ 2017-03-20  8:26     ` Vincent Guittot
  2017-03-20 12:34       ` Patrick Bellasi
  2017-03-20 12:59       ` Rafael J. Wysocki
  2017-03-20 12:48     ` Rafael J. Wysocki
  1 sibling, 2 replies; 71+ messages in thread
From: Vincent Guittot @ 2017-03-20  8:26 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Linux PM, LKML, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> The PELT metric used by the schedutil governor underestimates the
>> CPU utilization in some cases.  The reason for that may be time spent
>> in interrupt handlers and similar which is not accounted for by PELT.

Are you sure of the root cause described above (time stolen by the irq
handler), or is it just a hypothesis? It would be good to be sure of
the root cause.
Furthermore, IIRC the time spent in irq context is also accounted as
run time for the running cfs task, but not as RT or deadline task
running time.
So I'm not really aligned with the description of your problem ("the
PELT metric underestimates the load of the CPU"). PELT is just about
tracking CFS task utilization, not whole CPU utilization, and
according to your description of the problem (time stolen by irq),
your problem doesn't come from an underestimation of the CFS tasks but
from time spent in something else that is not accounted in the value
used by schedutil.

It would be good to know what is actually running during this
unaccounted time and to find a way to account for it.


>>
>> That can be easily demonstrated by running kernel compilation on
>> a Sandy Bridge Intel processor, running turbostat in parallel with
>> it and looking at the values written to the MSR_IA32_PERF_CTL
>> register.  Namely, the expected result would be that when all CPUs
>> were 100% busy, all of them would be requested to run in the maximum
>> P-state, but observation shows that this clearly isn't the case.
>> The CPUs run in the maximum P-state for a while and then are
>> requested to run slower and go back to the maximum P-state after
>> a while again.  That causes the actual frequency of the processor to
>> visibly oscillate below the sustainable maximum in a jittery fashion
>> which clearly is not desirable.
>>
>> To work around this issue use the observation that, from the
>> schedutil governor's perspective, CPUs that are never idle should
>> always run at the maximum frequency and make that happen.
>>
>> To that end, add a counter of idle calls to struct sugov_cpu and
>> modify cpuidle_idle_call() to increment that counter every time it
>> is about to put the given CPU into an idle state.  Next, make the
>> schedutil governor look at that counter for the current CPU every
>> time before it is about to start heavy computations.  If the counter
>> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
>> the CPU has not been idle for at least that long and the governor
>> will choose the maximum frequency for it without looking at the PELT
>> metric at all.
>
> Looks like we are fixing a PELT problem with a schedutil Hack :)

I would not say that it's a PELT problem (at least based on the current
description), but more that we don't track all activities of the CPU.

>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> ---
>>  include/linux/sched/cpufreq.h    |    6 ++++++
>>  kernel/sched/cpufreq_schedutil.c |   38 ++++++++++++++++++++++++++++++++++++--
>>  kernel/sched/idle.c              |    3 +++
>>  3 files changed, 45 insertions(+), 2 deletions(-)
>>
>> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
>> ===================================================================
>> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
>> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
>> @@ -20,6 +20,7 @@
>>  #include "sched.h"
>>
>>  #define SUGOV_KTHREAD_PRIORITY       50
>> +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC)
>>
>>  struct sugov_tunables {
>>       struct gov_attr_set attr_set;
>> @@ -55,6 +56,9 @@ struct sugov_cpu {
>>
>>       unsigned long iowait_boost;
>>       unsigned long iowait_boost_max;
>> +     unsigned long idle_calls;
>> +     unsigned long saved_idle_calls;
>> +     u64 busy_start;
>>       u64 last_update;
>>
>>       /* The fields below are only needed when sharing a policy. */
>> @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
>>       sg_cpu->iowait_boost >>= 1;
>>  }
>>
>> +void cpufreq_schedutil_idle_call(void)
>> +{
>> +     struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
>> +
>> +     sg_cpu->idle_calls++;
>> +}
>> +
>> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
>> +{
>> +     if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
>> +             sg_cpu->busy_start = 0;
>> +             return false;
>> +     }
>> +
>> +     if (!sg_cpu->busy_start) {
>> +             sg_cpu->busy_start = sg_cpu->last_update;
>> +             return false;
>> +     }
>> +
>> +     return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
>> +}
>> +
>> +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
>> +{
>> +     if (!sg_cpu->busy_start)
>> +             sg_cpu->saved_idle_calls = sg_cpu->idle_calls;
>
> Why aren't we doing this in sugov_cpu_is_busy() itself ? And isn't it possible
> for idle_calls to get incremented by this time?
>
>> +}
>> +
>>  static void sugov_update_single(struct update_util_data *hook, u64 time,
>>                               unsigned int flags)
>>  {
>> @@ -207,7 +239,7 @@ static void sugov_update_single(struct u
>>       if (!sugov_should_update_freq(sg_policy, time))
>>               return;
>>
>> -     if (flags & SCHED_CPUFREQ_RT_DL) {
>> +     if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
>>               next_f = policy->cpuinfo.max_freq;
>>       } else {
>>               sugov_get_util(&util, &max);
>> @@ -215,6 +247,7 @@ static void sugov_update_single(struct u
>>               next_f = get_next_freq(sg_policy, util, max);
>>       }
>>       sugov_update_commit(sg_policy, time, next_f);
>> +     sugov_save_idle_calls(sg_cpu);
>>  }
>>
>>  static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
>> @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
>>       sg_cpu->last_update = time;
>>
>>       if (sugov_should_update_freq(sg_policy, time)) {
>> -             if (flags & SCHED_CPUFREQ_RT_DL)
>> +             if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))
>
> What about others CPUs in this policy?
>
>>                       next_f = sg_policy->policy->cpuinfo.max_freq;
>>               else
>>                       next_f = sugov_next_freq_shared(sg_cpu);
>>
>>               sugov_update_commit(sg_policy, time, next_f);
>> +             sugov_save_idle_calls(sg_cpu);
>>       }
>>
>>       raw_spin_unlock(&sg_policy->update_lock);
>
> --
> viresh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki
  2017-03-19 21:24   ` Rafael J. Wysocki
  2017-03-20  3:57   ` Viresh Kumar
@ 2017-03-20 10:36   ` Peter Zijlstra
  2017-03-20 12:35     ` Rafael J. Wysocki
  2017-03-20 21:46   ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki
  3 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2017-03-20 10:36 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The PELT metric used by the schedutil governor underestimates the
> CPU utilization in some cases.  The reason for that may be time spent
> in interrupt handlers and similar which is not accounted for by PELT.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> To work around this issue use the observation that, from the
> schedutil governor's perspective, CPUs that are never idle should
> always run at the maximum frequency and make that happen.
> 
> To that end, add a counter of idle calls to struct sugov_cpu and
> modify cpuidle_idle_call() to increment that counter every time it
> is about to put the given CPU into an idle state.  Next, make the
> schedutil governor look at that counter for the current CPU every
> time before it is about to start heavy computations.  If the counter
> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> the CPU has not been idle for at least that long and the governor
> will choose the maximum frequency for it without looking at the PELT
> metric at all.

Why the time limit?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-19 21:24   ` Rafael J. Wysocki
  2017-03-19 21:42     ` Rafael J. Wysocki
@ 2017-03-20 10:38     ` Peter Zijlstra
  2017-03-20 12:31       ` Rafael J. Wysocki
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2017-03-20 10:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Sun, Mar 19, 2017 at 10:24:24PM +0100, Rafael J. Wysocki wrote:

> Honestly, if the processor had been capable of doing per-core P-states, that
> would have been a disaster and there are customers who wouldn't look at
> schedutil again after being confronted with these numbers.

This, I feel, is a bit overstated. We have bugs, we fix them.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20 10:38     ` Peter Zijlstra
@ 2017-03-20 12:31       ` Rafael J. Wysocki
  0 siblings, 0 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-20 12:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Monday, March 20, 2017 11:38:15 AM Peter Zijlstra wrote:
> On Sun, Mar 19, 2017 at 10:24:24PM +0100, Rafael J. Wysocki wrote:
> 
> > Honestly, if the processor had been capable of doing per-core P-states, that
> > would have been a disaster and there are customers who wouldn't look at
> > schedutil again after being confronted with these numbers.
> 
> This, I feel, is a bit overstated.

Maybe a bit, but I'm not really sure ...

> We have bug, we fix them.

Of course. :-)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20  8:26     ` Vincent Guittot
@ 2017-03-20 12:34       ` Patrick Bellasi
  2017-03-22 23:56         ` Joel Fernandes
  2017-03-20 12:59       ` Rafael J. Wysocki
  1 sibling, 1 reply; 71+ messages in thread
From: Patrick Bellasi @ 2017-03-20 12:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Vincent Guittot, Linux PM, LKML, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 20-Mar 09:26, Vincent Guittot wrote:
> On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> The PELT metric used by the schedutil governor underestimates the
> >> CPU utilization in some cases.  The reason for that may be time spent
> >> in interrupt handlers and similar which is not accounted for by PELT.
> 
> Are you sure of the root cause  described above (time stolen by irq
> handler) or is it just a hypotheses ? That would be good to be sure of
> the root cause
> Furthermore, IIRC the time spent in irq context is also accounted as
> run time for the running cfs task but not RT and deadline task running
> time

As long as the IRQ processing does not generate a context switch,
which happens (eventually) if the top half schedules some deferred
work to be executed by a bottom half.

Thus, I too would say that all of the top-half time is accounted in
PELT, since the current task is still RUNNABLE/RUNNING.

> So I'm not really aligned with the description of your problem: PELT
> metric underestimates the load of the CPU.  The PELT is just about
> tracking CFS task utilization but not whole CPU utilization and
> according to your description of the problem (time stolen by irq),
> your problem doesn't come from an underestimation of CFS task but from
> time spent in something else but not accounted in the value used by
> schedutil

Quite likely. Indeed, it can really be that the CFS task is preempted
because of some RT activity generated by the IRQ handler.

More generally, I've also noticed many suboptimal frequency switches
when RT tasks interleave with CFS ones, because of:
- relatively long down _and up_ throttling times
- the way schedutil's flags are tracked and updated
- the call sites from which we trigger schedutil updates

For example, it can really happen that we are running at the highest
OPP because of some RT activity. Then we switch back to a relatively
low-utilization CFS workload and then:
1. a tick happens which produces a frequency drop
2. right after, a new IRQ starts an RT task

Well, because of schedutil's throttling mechanism we can end up
running at the reduced OPP for quite a long time before (eventually)
going back to the higher OPP... if and only if we are lucky enough to
get a new RT task starting outside of a throttling window.

I guess that if we have quite a lot of IRQ->RT activations on a CPU
with relatively low CFS utilization, this unwanted frequency hopping
is quite likely.

> That would be good to be sure of what is running during this not
> accounted time and find a way to account them

Do we have any numbers to characterize the frequency of IRQ/RT
activations?

We should also notice that, while CFS tasks are preempted by RT ones,
the PELT signal decays. This can contribute to lowering the
utilization demand from the CFS class even further.
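
For a rough sense of the magnitude, here is a small stand-alone model
of that decay.  It assumes the usual ~32 ms PELT half-life (the
kernel's fixed-point tables will give slightly different numbers), and
the preemption windows below are made up purely for illustration:

/*
 * Rough model of how a fully built-up utilization signal decays while
 * the CFS task is preempted and not running.  Geometric decay with a
 * ~32 ms half-life is assumed; compile with -lm.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double util = 1024.0;	/* a fully built-up utilization signal */
	int window_ms[] = { 4, 10, 20, 50 };

	for (int i = 0; i < 4; i++) {
		double decayed = util * pow(0.5, window_ms[i] / 32.0);
		printf("preempted for %2d ms: util decays to ~%.0f/1024\n",
		       window_ms[i], decayed);
	}
	return 0;
}

So a signal that was saturated at 1024 drops to roughly a third of
that after 50 ms of not running, which is the kind of dip schedutil
then acts upon.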

> >> That can be easily demonstrated by running kernel compilation on
> >> a Sandy Bridge Intel processor, running turbostat in parallel with
> >> it and looking at the values written to the MSR_IA32_PERF_CTL
> >> register.  Namely, the expected result would be that when all CPUs
> >> were 100% busy, all of them would be requested to run in the maximum
> >> P-state, but observation shows that this clearly isn't the case.

Can you share a trace with at least sched_switch events enabled?
This can help identify which workloads are there and how they
interact with schedutil.

> >> The CPUs run in the maximum P-state for a while and then are
> >> requested to run slower and go back to the maximum P-state after
> >> a while again.  That causes the actual frequency of the processor to

That's possibly exactly the pattern I've described above.

> >> visibly oscillate below the sustainable maximum in a jittery fashion
> >> which clearly is not desirable.
> >>
> >> To work around this issue use the observation that, from the
> >> schedutil governor's perspective, CPUs that are never idle should
> >> always run at the maximum frequency and make that happen.

Perhaps some of the patches I've posted in this series:
   https://lkml.org/lkml/2017/3/2/385
can help on that side.

The pattern you report seems to me to match case "b" described in the
cover letter quite closely. I've also shared a notebook:
   https://gist.github.com/derkling/d6a21b459a18091b2b058668a550010d
which clearly shows that behavior happening in the first plot, in the
cell named Out[15].

Did you have a chance to have a look at them?

It would be nice to know if they can provide some benefit for your
use case.

> >> To that end, add a counter of idle calls to struct sugov_cpu and
> >> modify cpuidle_idle_call() to increment that counter every time it
> >> is about to put the given CPU into an idle state.  Next, make the
> >> schedutil governor look at that counter for the current CPU every
> >> time before it is about to start heavy computations.  If the counter
> >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> >> the CPU has not been idle for at least that long and the governor
> >> will choose the maximum frequency for it without looking at the PELT
> >> metric at all.
> >
> > Looks like we are fixing a PELT problem with a schedutil Hack :)
> 
> I would not say that it's a PELT problem (at least based on current
> description) but more that we don't track all
> activities of CPU

Agree... with both comments.


> >> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> ---
> >>  include/linux/sched/cpufreq.h    |    6 ++++++
> >>  kernel/sched/cpufreq_schedutil.c |   38 ++++++++++++++++++++++++++++++++++++--
> >>  kernel/sched/idle.c              |    3 +++
> >>  3 files changed, 45 insertions(+), 2 deletions(-)
> >>
> >> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> >> ===================================================================
> >> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> >> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> >> @@ -20,6 +20,7 @@
> >>  #include "sched.h"
> >>
> >>  #define SUGOV_KTHREAD_PRIORITY       50
> >> +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC)
> >>
> >>  struct sugov_tunables {
> >>       struct gov_attr_set attr_set;
> >> @@ -55,6 +56,9 @@ struct sugov_cpu {
> >>
> >>       unsigned long iowait_boost;
> >>       unsigned long iowait_boost_max;
> >> +     unsigned long idle_calls;
> >> +     unsigned long saved_idle_calls;
> >> +     u64 busy_start;
> >>       u64 last_update;
> >>
> >>       /* The fields below are only needed when sharing a policy. */
> >> @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
> >>       sg_cpu->iowait_boost >>= 1;
> >>  }
> >>
> >> +void cpufreq_schedutil_idle_call(void)
> >> +{
> >> +     struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
> >> +
> >> +     sg_cpu->idle_calls++;
> >> +}
> >> +
> >> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> >> +{
> >> +     if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
> >> +             sg_cpu->busy_start = 0;
> >> +             return false;
> >> +     }
> >> +
> >> +     if (!sg_cpu->busy_start) {
> >> +             sg_cpu->busy_start = sg_cpu->last_update;
> >> +             return false;
> >> +     }
> >> +
> >> +     return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
> >> +}
> >> +
> >> +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
> >> +{
> >> +     if (!sg_cpu->busy_start)
> >> +             sg_cpu->saved_idle_calls = sg_cpu->idle_calls;
> >
> > Why aren't we doing this in sugov_cpu_is_busy() itself ? And isn't it possible
> > for idle_calls to get incremented by this time?
> >
> >> +}
> >> +
> >>  static void sugov_update_single(struct update_util_data *hook, u64 time,
> >>                               unsigned int flags)
> >>  {
> >> @@ -207,7 +239,7 @@ static void sugov_update_single(struct u
> >>       if (!sugov_should_update_freq(sg_policy, time))
> >>               return;
> >>
> >> -     if (flags & SCHED_CPUFREQ_RT_DL) {
> >> +     if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
> >>               next_f = policy->cpuinfo.max_freq;
> >>       } else {
> >>               sugov_get_util(&util, &max);
> >> @@ -215,6 +247,7 @@ static void sugov_update_single(struct u
> >>               next_f = get_next_freq(sg_policy, util, max);
> >>       }
> >>       sugov_update_commit(sg_policy, time, next_f);
> >> +     sugov_save_idle_calls(sg_cpu);
> >>  }
> >>
> >>  static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
> >> @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
> >>       sg_cpu->last_update = time;
> >>
> >>       if (sugov_should_update_freq(sg_policy, time)) {
> >> -             if (flags & SCHED_CPUFREQ_RT_DL)
> >> +             if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))
> >
> > What about others CPUs in this policy?
> >
> >>                       next_f = sg_policy->policy->cpuinfo.max_freq;
> >>               else
> >>                       next_f = sugov_next_freq_shared(sg_cpu);
> >>
> >>               sugov_update_commit(sg_policy, time, next_f);
> >> +             sugov_save_idle_calls(sg_cpu);
> >>       }
> >>
> >>       raw_spin_unlock(&sg_policy->update_lock);
> >
> > --
> > viresh

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20 10:36   ` Peter Zijlstra
@ 2017-03-20 12:35     ` Rafael J. Wysocki
  2017-03-20 12:50       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-20 12:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > The PELT metric used by the schedutil governor underestimates the
> > CPU utilization in some cases.  The reason for that may be time spent
> > in interrupt handlers and similar which is not accounted for by PELT.
> > 
> > That can be easily demonstrated by running kernel compilation on
> > a Sandy Bridge Intel processor, running turbostat in parallel with
> > it and looking at the values written to the MSR_IA32_PERF_CTL
> > register.  Namely, the expected result would be that when all CPUs
> > were 100% busy, all of them would be requested to run in the maximum
> > P-state, but observation shows that this clearly isn't the case.
> > The CPUs run in the maximum P-state for a while and then are
> > requested to run slower and go back to the maximum P-state after
> > a while again.  That causes the actual frequency of the processor to
> > visibly oscillate below the sustainable maximum in a jittery fashion
> > which clearly is not desirable.
> > 
> > To work around this issue use the observation that, from the
> > schedutil governor's perspective, CPUs that are never idle should
> > always run at the maximum frequency and make that happen.
> > 
> > To that end, add a counter of idle calls to struct sugov_cpu and
> > modify cpuidle_idle_call() to increment that counter every time it
> > is about to put the given CPU into an idle state.  Next, make the
> > schedutil governor look at that counter for the current CPU every
> > time before it is about to start heavy computations.  If the counter
> > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > the CPU has not been idle for at least that long and the governor
> > will choose the maximum frequency for it without looking at the PELT
> > metric at all.
> 
> Why the time limit?

One iteration appeared to be a bit too aggressive, but honestly I think
I need to check again if this thing is regarded as viable at all.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
  2017-03-20  3:28   ` Viresh Kumar
@ 2017-03-20 12:36     ` Rafael J. Wysocki
  0 siblings, 0 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-20 12:36 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Monday, March 20, 2017 08:58:40 AM Viresh Kumar wrote:
> On 19-03-17, 14:30, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > sugov_start() only initializes struct sugov_cpu per-CPU structures
> > for shared policies, but it should do that for single-CPU policies too.
> > 
> > That in particular makes the IO-wait boost mechanism work in the
> > cases when cpufreq policies correspond to individual CPUs.
> > 
> > Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Cc: 4.9+ <stable@vger.kernel.org> # 4.9+

"stable" checks Fixes: tags too, but yes.

> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20  3:57   ` Viresh Kumar
  2017-03-20  8:26     ` Vincent Guittot
@ 2017-03-20 12:48     ` Rafael J. Wysocki
  1 sibling, 0 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-20 12:48 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Monday, March 20, 2017 09:27:45 AM Viresh Kumar wrote:
> On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > The PELT metric used by the schedutil governor underestimates the
> > CPU utilization in some cases.  The reason for that may be time spent
> > in interrupt handlers and similar which is not accounted for by PELT.
> > 
> > That can be easily demonstrated by running kernel compilation on
> > a Sandy Bridge Intel processor, running turbostat in parallel with
> > it and looking at the values written to the MSR_IA32_PERF_CTL
> > register.  Namely, the expected result would be that when all CPUs
> > were 100% busy, all of them would be requested to run in the maximum
> > P-state, but observation shows that this clearly isn't the case.
> > The CPUs run in the maximum P-state for a while and then are
> > requested to run slower and go back to the maximum P-state after
> > a while again.  That causes the actual frequency of the processor to
> > visibly oscillate below the sustainable maximum in a jittery fashion
> > which clearly is not desirable.
> > 
> > To work around this issue use the observation that, from the
> > schedutil governor's perspective, CPUs that are never idle should
> > always run at the maximum frequency and make that happen.
> > 
> > To that end, add a counter of idle calls to struct sugov_cpu and
> > modify cpuidle_idle_call() to increment that counter every time it
> > is about to put the given CPU into an idle state.  Next, make the
> > schedutil governor look at that counter for the current CPU every
> > time before it is about to start heavy computations.  If the counter
> > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > the CPU has not been idle for at least that long and the governor
> > will choose the maximum frequency for it without looking at the PELT
> > metric at all.
> 
> Looks like we are fixing a PELT problem with a schedutil Hack :)

Did you notice the "To work around this issue" phrase above?  This is a
workaround, not a fix.

But it is an optimization too, because avoiding heavy computations if they
are not necessary should be a win in any case.

> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  include/linux/sched/cpufreq.h    |    6 ++++++
> >  kernel/sched/cpufreq_schedutil.c |   38 ++++++++++++++++++++++++++++++++++++--
> >  kernel/sched/idle.c              |    3 +++
> >  3 files changed, 45 insertions(+), 2 deletions(-)
> > 
> > Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> > ===================================================================
> > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> > +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> > @@ -20,6 +20,7 @@
> >  #include "sched.h"
> >  
> >  #define SUGOV_KTHREAD_PRIORITY	50
> > +#define SUGOV_BUSY_THRESHOLD	(50 * NSEC_PER_MSEC)
> >  
> >  struct sugov_tunables {
> >  	struct gov_attr_set attr_set;
> > @@ -55,6 +56,9 @@ struct sugov_cpu {
> >  
> >  	unsigned long iowait_boost;
> >  	unsigned long iowait_boost_max;
> > +	unsigned long idle_calls;
> > +	unsigned long saved_idle_calls;

The above two could be unsigned int, BTW.

> > +	u64 busy_start;
> >  	u64 last_update;
> >  
> >  	/* The fields below are only needed when sharing a policy. */
> > @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
> >  	sg_cpu->iowait_boost >>= 1;
> >  }
> >  
> > +void cpufreq_schedutil_idle_call(void)
> > +{
> > +	struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
> > +
> > +	sg_cpu->idle_calls++;
> > +}
> > +
> > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> > +{
> > +	if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
> > +		sg_cpu->busy_start = 0;
> > +		return false;
> > +	}
> > +
> > +	if (!sg_cpu->busy_start) {
> > +		sg_cpu->busy_start = sg_cpu->last_update;
> > +		return false;
> > +	}
> > +
> > +	return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
> > +}
> > +
> > +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
> > +{
> > +	if (!sg_cpu->busy_start)
> > +		sg_cpu->saved_idle_calls = sg_cpu->idle_calls;
> 
> Why aren't we doing this in sugov_cpu_is_busy() itself ?

No, we aren't.

sugov_cpu_is_busy() may not be called at all sometimes.

> And isn't it possible for idle_calls to get incremented by this time?

No, it isn't.

The idle loop does that, and it does so after it has disabled interrupts.  If
this code is running, we surely are not in there on the same CPU, are we?
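
To make the handshake between the two helpers easier to follow, here
is a self-contained user-space model of the logic in the patch.  The
field names and the 50 ms threshold mirror the patch; the 10 ms update
period and the single idle entry are made up for the example:

/*
 * User-space model of the busy-detection handshake: idle_calls is
 * bumped on every (simulated) idle entry, and is_busy() only reports
 * true after the counter has been seen unchanged for more than
 * BUSY_THRESHOLD_NS worth of governor updates.
 */
#include <stdbool.h>
#include <stdio.h>

#define BUSY_THRESHOLD_NS	(50ULL * 1000 * 1000)

struct cpu {
	unsigned long idle_calls;
	unsigned long saved_idle_calls;
	unsigned long long busy_start;
	unsigned long long last_update;
};

static bool is_busy(struct cpu *c)
{
	if (c->idle_calls != c->saved_idle_calls) {
		c->busy_start = 0;
		return false;
	}
	if (!c->busy_start) {
		c->busy_start = c->last_update;
		return false;
	}
	return c->last_update - c->busy_start > BUSY_THRESHOLD_NS;
}

static void save_idle_calls(struct cpu *c)
{
	if (!c->busy_start)
		c->saved_idle_calls = c->idle_calls;
}

int main(void)
{
	struct cpu c = { 0 };

	/* governor update every 10 ms; one idle entry early on */
	for (int tick = 1; tick <= 10; tick++) {
		c.last_update = tick * 10ULL * 1000 * 1000;
		if (tick == 2)
			c.idle_calls++;	/* idle entry at ~20 ms */

		bool busy = is_busy(&c);
		printf("t=%3d ms busy=%d\n", tick * 10, busy);
		save_idle_calls(&c);
	}
	return 0;
}

With this sequence the CPU is reported busy from t=90 ms on, i.e. only
after the idle-calls counter has been observed unchanged for more than
the 50 ms threshold.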

> > +}
> > +
> >  static void sugov_update_single(struct update_util_data *hook, u64 time,
> >  				unsigned int flags)
> >  {
> > @@ -207,7 +239,7 @@ static void sugov_update_single(struct u
> >  	if (!sugov_should_update_freq(sg_policy, time))
> >  		return;
> >  
> > -	if (flags & SCHED_CPUFREQ_RT_DL) {
> > +	if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
> >  		next_f = policy->cpuinfo.max_freq;
> >  	} else {
> >  		sugov_get_util(&util, &max);
> > @@ -215,6 +247,7 @@ static void sugov_update_single(struct u
> >  		next_f = get_next_freq(sg_policy, util, max);
> >  	}
> >  	sugov_update_commit(sg_policy, time, next_f);
> > +	sugov_save_idle_calls(sg_cpu);
> >  }
> >  
> >  static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
> > @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
> >  	sg_cpu->last_update = time;
> >  
> >  	if (sugov_should_update_freq(sg_policy, time)) {
> > -		if (flags & SCHED_CPUFREQ_RT_DL)
> > +		if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))
> 
> What about others CPUs in this policy?

We'll check this for them when they get updated.

The point is that we can just bail out earlier here and avoid the heavy stuff,
but we don't have to.  This is opportunistic.

> >  			next_f = sg_policy->policy->cpuinfo.max_freq;
> >  		else
> >  			next_f = sugov_next_freq_shared(sg_cpu);
> >  
> >  		sugov_update_commit(sg_policy, time, next_f);
> > +		sugov_save_idle_calls(sg_cpu);
> >  	}
> >  
> >  	raw_spin_unlock(&sg_policy->update_lock);
> 
> 

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20 12:35     ` Rafael J. Wysocki
@ 2017-03-20 12:50       ` Peter Zijlstra
  2017-03-20 13:04         ` Rafael J. Wysocki
  2017-03-20 13:06         ` Patrick Bellasi
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2017-03-20 12:50 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > 
> > > The PELT metric used by the schedutil governor underestimates the
> > > CPU utilization in some cases.  The reason for that may be time spent
> > > in interrupt handlers and similar which is not accounted for by PELT.
> > > 
> > > That can be easily demonstrated by running kernel compilation on
> > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > register.  Namely, the expected result would be that when all CPUs
> > > were 100% busy, all of them would be requested to run in the maximum
> > > P-state, but observation shows that this clearly isn't the case.
> > > The CPUs run in the maximum P-state for a while and then are
> > > requested to run slower and go back to the maximum P-state after
> > > a while again.  That causes the actual frequency of the processor to
> > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > which clearly is not desirable.
> > > 
> > > To work around this issue use the observation that, from the
> > > schedutil governor's perspective, CPUs that are never idle should
> > > always run at the maximum frequency and make that happen.
> > > 
> > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > modify cpuidle_idle_call() to increment that counter every time it
> > > is about to put the given CPU into an idle state.  Next, make the
> > > schedutil governor look at that counter for the current CPU every
> > > time before it is about to start heavy computations.  If the counter
> > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > the CPU has not been idle for at least that long and the governor
> > > will choose the maximum frequency for it without looking at the PELT
> > > metric at all.
> > 
> > Why the time limit?
> 
> One iteration appeared to be a bit too aggressive, but honestly I think
> I need to check again if this thing is regarded as viable at all.
> 

I don't hate the idea; if we don't hit idle, we shouldn't shift down. I
just wonder if we don't already keep an idle-seqcount somewhere; NOHZ and
RCU come to mind as things that might already use something like that.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20  8:26     ` Vincent Guittot
  2017-03-20 12:34       ` Patrick Bellasi
@ 2017-03-20 12:59       ` Rafael J. Wysocki
  2017-03-20 13:20         ` Vincent Guittot
  1 sibling, 1 reply; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-20 12:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Viresh Kumar, Linux PM, LKML, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Monday, March 20, 2017 09:26:34 AM Vincent Guittot wrote:
> On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> The PELT metric used by the schedutil governor underestimates the
> >> CPU utilization in some cases.  The reason for that may be time spent
> >> in interrupt handlers and similar which is not accounted for by PELT.
> 
> Are you sure of the root cause  described above (time stolen by irq
> handler) or is it just a hypotheses ? That would be good to be sure of
> the root cause

No, I'm not sure.  That's why I said "may be". :-)

> Furthermore, IIRC the time spent in irq context is also accounted as
> run time for the running cfs task but not RT and deadline task running
> time

OK

Anyway, the problem is that we have a 100% busy CPU which quite evidently
is not reported as 100% busy by the metric we use.

Now, if I was sure about the root cause, I would fix it rather than suggest
workarounds.

> So I'm not really aligned with the description of your problem: PELT
> metric underestimates the load of the CPU.  The PELT is just about
> tracking CFS task utilization but not whole CPU utilization and
> according to your description of the problem (time stolen by irq),
> your problem doesn't come from an underestimation of CFS task but from
> time spent in something else but not accounted in the value used by
> schedutil

That's fair enough, but the assumption was that PELT would be sufficient for
that.  To me, it clearly isn't, so the assumption was incorrect.

> That would be good to be sure of what is running during this not
> accounted time and find a way to account them

Yes, I agree.

I'm not sure if I can carry out a full investigation of that any time soon, however.

I sent this mostly to let everybody know that there was a problem and how it
could be worked around.  That's why it is an RFC.

> 
> >>
> >> That can be easily demonstrated by running kernel compilation on
> >> a Sandy Bridge Intel processor, running turbostat in parallel with
> >> it and looking at the values written to the MSR_IA32_PERF_CTL
> >> register.  Namely, the expected result would be that when all CPUs
> >> were 100% busy, all of them would be requested to run in the maximum
> >> P-state, but observation shows that this clearly isn't the case.
> >> The CPUs run in the maximum P-state for a while and then are
> >> requested to run slower and go back to the maximum P-state after
> >> a while again.  That causes the actual frequency of the processor to
> >> visibly oscillate below the sustainable maximum in a jittery fashion
> >> which clearly is not desirable.
> >>
> >> To work around this issue use the observation that, from the
> >> schedutil governor's perspective, CPUs that are never idle should
> >> always run at the maximum frequency and make that happen.
> >>
> >> To that end, add a counter of idle calls to struct sugov_cpu and
> >> modify cpuidle_idle_call() to increment that counter every time it
> >> is about to put the given CPU into an idle state.  Next, make the
> >> schedutil governor look at that counter for the current CPU every
> >> time before it is about to start heavy computations.  If the counter
> >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> >> the CPU has not been idle for at least that long and the governor
> >> will choose the maximum frequency for it without looking at the PELT
> >> metric at all.
> >
> > Looks like we are fixing a PELT problem with a schedutil Hack :)
> 
> I would not say that it's a PELT problem (at least based on current
> description) but more that we don't track all
> activities of CPU

I generally agree.  PELT does what it does.

However, using PELT the way we do that in schedutil turns out to be problematic.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20 12:50       ` Peter Zijlstra
@ 2017-03-20 13:04         ` Rafael J. Wysocki
  2017-03-20 13:06         ` Patrick Bellasi
  1 sibling, 0 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-20 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Monday, March 20, 2017 01:50:09 PM Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > 
> > > > The PELT metric used by the schedutil governor underestimates the
> > > > CPU utilization in some cases.  The reason for that may be time spent
> > > > in interrupt handlers and similar which is not accounted for by PELT.
> > > > 
> > > > That can be easily demonstrated by running kernel compilation on
> > > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > > register.  Namely, the expected result would be that when all CPUs
> > > > were 100% busy, all of them would be requested to run in the maximum
> > > > P-state, but observation shows that this clearly isn't the case.
> > > > The CPUs run in the maximum P-state for a while and then are
> > > > requested to run slower and go back to the maximum P-state after
> > > > a while again.  That causes the actual frequency of the processor to
> > > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > > which clearly is not desirable.
> > > > 
> > > > To work around this issue use the observation that, from the
> > > > schedutil governor's perspective, CPUs that are never idle should
> > > > always run at the maximum frequency and make that happen.
> > > > 
> > > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > > modify cpuidle_idle_call() to increment that counter every time it
> > > > is about to put the given CPU into an idle state.  Next, make the
> > > > schedutil governor look at that counter for the current CPU every
> > > > time before it is about to start heavy computations.  If the counter
> > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > > the CPU has not been idle for at least that long and the governor
> > > > will choose the maximum frequency for it without looking at the PELT
> > > > metric at all.
> > > 
> > > Why the time limit?
> > 
> > One iteration appeared to be a bit too aggressive, but honestly I think
> > I need to check again if this thing is regarded as viable at all.
> > 
> 
> I don't hate the idea; if we don't hit idle; we shouldn't shift down.

OK

> I just wonder if we don't already keep a idle-seqcount somewhere; NOHZ and
> RCU come to mind as things that might already use something like that.

NOHZ does that, but I didn't want this to artificially depend on NOHZ.  That
said, yes, we can use that one too.
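
For reference, a rough sketch (not a tested patch) of what the
schedutil side could look like if it read an idle-entry count that
NOHZ already maintains instead of keeping its own.  The
tick_nohz_get_idle_calls() helper below is a name made up for this
sketch; tick-sched would have to export something like it:

/* Sketch only: depends on struct sugov_cpu from the patch above and
 * on a hypothetical NOHZ helper returning this CPU's idle-entry
 * count. */
unsigned long tick_nohz_get_idle_calls(void);	/* assumed helper */

static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
{
	unsigned long idle_calls = tick_nohz_get_idle_calls();
	bool busy = idle_calls == sg_cpu->saved_idle_calls;

	/* "Busy" here means: no idle entry since the previous
	 * frequency update on this CPU. */
	sg_cpu->saved_idle_calls = idle_calls;
	return busy;
}

That variant drops both the private counter and the idle-loop hook;
whether it still needs something like the 50 ms threshold is
essentially Peter's "why the time limit" question above.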

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20 13:06         ` Patrick Bellasi
@ 2017-03-20 13:05           ` Rafael J. Wysocki
  2017-03-20 14:13             ` Patrick Bellasi
  0 siblings, 1 reply; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-20 13:05 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Peter Zijlstra, Linux PM, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Vincent Guittot, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Monday, March 20, 2017 01:06:15 PM Patrick Bellasi wrote:
> On 20-Mar 13:50, Peter Zijlstra wrote:
> > On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > 
> > > > > The PELT metric used by the schedutil governor underestimates the
> > > > > CPU utilization in some cases.  The reason for that may be time spent
> > > > > in interrupt handlers and similar which is not accounted for by PELT.
> > > > > 
> > > > > That can be easily demonstrated by running kernel compilation on
> > > > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > > > register.  Namely, the expected result would be that when all CPUs
> > > > > were 100% busy, all of them would be requested to run in the maximum
> > > > > P-state, but observation shows that this clearly isn't the case.
> > > > > The CPUs run in the maximum P-state for a while and then are
> > > > > requested to run slower and go back to the maximum P-state after
> > > > > a while again.  That causes the actual frequency of the processor to
> > > > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > > > which clearly is not desirable.
> > > > > 
> > > > > To work around this issue use the observation that, from the
> > > > > schedutil governor's perspective, CPUs that are never idle should
> > > > > always run at the maximum frequency and make that happen.
> > > > > 
> > > > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > > > modify cpuidle_idle_call() to increment that counter every time it
> > > > > is about to put the given CPU into an idle state.  Next, make the
> > > > > schedutil governor look at that counter for the current CPU every
> > > > > time before it is about to start heavy computations.  If the counter
> > > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > > > the CPU has not been idle for at least that long and the governor
> > > > > will choose the maximum frequency for it without looking at the PELT
> > > > > metric at all.
> > > > 
> > > > Why the time limit?
> > > 
> > > One iteration appeared to be a bit too aggressive, but honestly I think
> > > I need to check again if this thing is regarded as viable at all.
> > > 
> > 
> > I don't hate the idea; if we don't hit idle; we shouldn't shift down. I
> > just wonder if we don't already keep a idle-seqcount somewhere; NOHZ and
> > RCU come to mind as things that might already use something like that.
> 
> Maybe the problem is not going down (e.g. when there are only small
> CFS tasks it makes perfectly sense) but instead not being fast enough
> on rampin-up when a new RT task is activated.
> 
> And this boils down to two main point:
> 1) throttling for up transitions perhaps is only harmful
> 2) the call sites for schedutils updates are not properly positioned
>    in specific scheduler decision points.
> 
> The proposed patch is adding yet another throttling mechanism, perhaps
> on top of one which already needs to be improved.

It is not throttling anything.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20 12:50       ` Peter Zijlstra
  2017-03-20 13:04         ` Rafael J. Wysocki
@ 2017-03-20 13:06         ` Patrick Bellasi
  2017-03-20 13:05           ` Rafael J. Wysocki
  1 sibling, 1 reply; 71+ messages in thread
From: Patrick Bellasi @ 2017-03-20 13:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Vincent Guittot, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 20-Mar 13:50, Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > 
> > > > The PELT metric used by the schedutil governor underestimates the
> > > > CPU utilization in some cases.  The reason for that may be time spent
> > > > in interrupt handlers and similar which is not accounted for by PELT.
> > > > 
> > > > That can be easily demonstrated by running kernel compilation on
> > > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > > register.  Namely, the expected result would be that when all CPUs
> > > > were 100% busy, all of them would be requested to run in the maximum
> > > > P-state, but observation shows that this clearly isn't the case.
> > > > The CPUs run in the maximum P-state for a while and then are
> > > > requested to run slower and go back to the maximum P-state after
> > > > a while again.  That causes the actual frequency of the processor to
> > > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > > which clearly is not desirable.
> > > > 
> > > > To work around this issue use the observation that, from the
> > > > schedutil governor's perspective, CPUs that are never idle should
> > > > always run at the maximum frequency and make that happen.
> > > > 
> > > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > > modify cpuidle_idle_call() to increment that counter every time it
> > > > is about to put the given CPU into an idle state.  Next, make the
> > > > schedutil governor look at that counter for the current CPU every
> > > > time before it is about to start heavy computations.  If the counter
> > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > > the CPU has not been idle for at least that long and the governor
> > > > will choose the maximum frequency for it without looking at the PELT
> > > > metric at all.
> > > 
> > > Why the time limit?
> > 
> > One iteration appeared to be a bit too aggressive, but honestly I think
> > I need to check again if this thing is regarded as viable at all.
> > 
> 
> I don't hate the idea; if we don't hit idle; we shouldn't shift down. I
> just wonder if we don't already keep a idle-seqcount somewhere; NOHZ and
> RCU come to mind as things that might already use something like that.

Maybe the problem is not going down (e.g. when there are only small
CFS tasks that makes perfect sense) but instead not ramping up fast
enough when a new RT task is activated.

And this boils down to two main points:
1) throttling for up transitions is perhaps only harmful
2) the call sites for schedutil updates are not properly positioned
   at specific scheduler decision points.

The proposed patch adds yet another throttling mechanism, perhaps
on top of one which already needs to be improved.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20 12:59       ` Rafael J. Wysocki
@ 2017-03-20 13:20         ` Vincent Guittot
  0 siblings, 0 replies; 71+ messages in thread
From: Vincent Guittot @ 2017-03-20 13:20 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Linux PM, LKML, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 20 March 2017 at 13:59, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Monday, March 20, 2017 09:26:34 AM Vincent Guittot wrote:
>> On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> >>
>> >> The PELT metric used by the schedutil governor underestimates the
>> >> CPU utilization in some cases.  The reason for that may be time spent
>> >> in interrupt handlers and similar which is not accounted for by PELT.
>>
>> Are you sure of the root cause  described above (time stolen by irq
>> handler) or is it just a hypotheses ? That would be good to be sure of
>> the root cause
>
> No, I'm not sure.  That's why I said "may be". :-)
>
>> Furthermore, IIRC the time spent in irq context is also accounted as
>> run time for the running cfs task but not RT and deadline task running
>> time
>
> OK
>
> Anyway, the problem is that we have a 100% busy CPU which quite evidently
> is not reported as 100% busy by the metric we use.


>
> Now, if I was sure about the root cause, I would fix it rather than suggest
> workarounds.
>
>> So I'm not really aligned with the description of your problem: PELT
>> metric underestimates the load of the CPU.  The PELT is just about
>> tracking CFS task utilization but not whole CPU utilization and
>> according to your description of the problem (time stolen by irq),
>> your problem doesn't come from an underestimation of CFS task but from
>> time spent in something else but not accounted in the value used by
>> schedutil
>
> That's fair enough, but the assumption was that PELT would be sufficient for
> that.  To me, it clearly isn't, so the assumption was incorrect.

So PELT is just about CFS utilization and we need metrics for other
sched class as well
I know that Juri and other people are working on the dealine part

In the current kernel, the only available metric is rt_avg (and
scale_rt_capacity) which gives you the available compute capacity for
CFS and, in some way, the amount of capacity used by RT and deadline.
It would be interesting to see if the amount of capacity used by RT
tasks (I assume that you don't have deadline tasks in your use case)
is similar to the amount of underestimation of CFS.
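
For illustration, a minimal userspace sketch (not kernel code; the real
logic lives in scale_rt_capacity(), and the fixed window below is a
simplifying assumption) of how time spent outside CFS translates into
capacity that the CFS utilization signal will never report:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* capacity left for CFS, given time spent in RT/DL/IRQ over a window */
static unsigned long cfs_capacity_left(unsigned long long rt_time_ns,
				       unsigned long long window_ns)
{
	unsigned long used = (unsigned long)(rt_time_ns * SCHED_CAPACITY_SCALE
					     / window_ns);

	return used < SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE - used : 0;
}

int main(void)
{
	/* e.g. 2 ms of RT/IRQ time in a 10 ms window: ~20% of capacity gone */
	printf("capacity left for CFS: %lu/1024\n",
	       cfs_capacity_left(2000000ULL, 10000000ULL));
	return 0;
}

If that number tracks the underestimation you observe, the rt_avg side
is probably where the missing utilization went.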

>
>> That would be good to be sure of what is running during this not
>> accounted time and find a way to account them
>
Yes, I agree.

I'm not sure if I can carry out a full investigation of that any time soon, however.

If you can share traces as Patrick proposed, we could estimate how
much time/capacity is used by RT tasks and whether this can explain the
underestimation.

>
> I sent this mostly to let everybody know that there was a problem and how it
> could be worked around.  That's why it is an RFC.

Ok

>
>>
>> >>
>> >> That can be easily demonstrated by running kernel compilation on
>> >> a Sandy Bridge Intel processor, running turbostat in parallel with
>> >> it and looking at the values written to the MSR_IA32_PERF_CTL
>> >> register.  Namely, the expected result would be that when all CPUs
>> >> were 100% busy, all of them would be requested to run in the maximum
>> >> P-state, but observation shows that this clearly isn't the case.
>> >> The CPUs run in the maximum P-state for a while and then are
>> >> requested to run slower and go back to the maximum P-state after
>> >> a while again.  That causes the actual frequency of the processor to
>> >> visibly oscillate below the sustainable maximum in a jittery fashion
>> >> which clearly is not desirable.
>> >>
>> >> To work around this issue use the observation that, from the
>> >> schedutil governor's perspective, CPUs that are never idle should
>> >> always run at the maximum frequency and make that happen.
>> >>
>> >> To that end, add a counter of idle calls to struct sugov_cpu and
>> >> modify cpuidle_idle_call() to increment that counter every time it
>> >> is about to put the given CPU into an idle state.  Next, make the
>> >> schedutil governor look at that counter for the current CPU every
>> >> time before it is about to start heavy computations.  If the counter
>> >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
>> >> the CPU has not been idle for at least that long and the governor
>> >> will choose the maximum frequency for it without looking at the PELT
>> >> metric at all.
>> >
>> > Looks like we are fixing a PELT problem with a schedutil Hack :)
>>
>> I would not say that it's a PELT problem (at least based on the current
>> description) but more that we don't track all activities of the CPU.
>
> I generally agree.  PELT does what it does.
>
> However, using PELT the way we do that in schedutil turns out to be problematic.

Yes. PELT is not enough, it's just part of the equation.

>
> Thanks,
> Rafael
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20 13:05           ` Rafael J. Wysocki
@ 2017-03-20 14:13             ` Patrick Bellasi
  0 siblings, 0 replies; 71+ messages in thread
From: Patrick Bellasi @ 2017-03-20 14:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Linux PM, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Vincent Guittot, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 20-Mar 14:05, Rafael J. Wysocki wrote:
> On Monday, March 20, 2017 01:06:15 PM Patrick Bellasi wrote:
> > On 20-Mar 13:50, Peter Zijlstra wrote:
> > > On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > > > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > > 
> > > > > > The PELT metric used by the schedutil governor underestimates the
> > > > > > CPU utilization in some cases.  The reason for that may be time spent
> > > > > > in interrupt handlers and similar which is not accounted for by PELT.
> > > > > > 
> > > > > > That can be easily demonstrated by running kernel compilation on
> > > > > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > > > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > > > > register.  Namely, the expected result would be that when all CPUs
> > > > > > were 100% busy, all of them would be requested to run in the maximum
> > > > > > P-state, but observation shows that this clearly isn't the case.
> > > > > > The CPUs run in the maximum P-state for a while and then are
> > > > > > requested to run slower and go back to the maximum P-state after
> > > > > > a while again.  That causes the actual frequency of the processor to
> > > > > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > > > > which clearly is not desirable.
> > > > > > 
> > > > > > To work around this issue use the observation that, from the
> > > > > > schedutil governor's perspective, CPUs that are never idle should
> > > > > > always run at the maximum frequency and make that happen.
> > > > > > 
> > > > > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > > > > modify cpuidle_idle_call() to increment that counter every time it
> > > > > > is about to put the given CPU into an idle state.  Next, make the
> > > > > > schedutil governor look at that counter for the current CPU every
> > > > > > time before it is about to start heavy computations.  If the counter
> > > > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > > > > the CPU has not been idle for at least that long and the governor
> > > > > > will choose the maximum frequency for it without looking at the PELT
> > > > > > metric at all.
> > > > > 
> > > > > Why the time limit?
> > > > 
> > > > One iteration appeared to be a bit too aggressive, but honestly I think
> > > > I need to check again if this thing is regarded as viable at all.
> > > > 
> > > 
> > > I don't hate the idea; if we don't hit idle; we shouldn't shift down. I
> > > just wonder if we don't already keep a idle-seqcount somewhere; NOHZ and
> > > RCU come to mind as things that might already use something like that.
> > 
> > Maybe the problem is not going down (e.g. when there are only small
> > CFS tasks it makes perfectly sense) but instead not being fast enough
> > on rampin-up when a new RT task is activated.
> > 
> > And this boils down to two main point:
> > 1) throttling for up transitions perhaps is only harmful
> > 2) the call sites for schedutils updates are not properly positioned
> >    in specific scheduler decision points.
> > 
> > The proposed patch is adding yet another throttling mechanism, perhaps
> > on top of one which already needs to be improved.
> 
> It is not throttling anything.

It kind of is...

-       if (flags & SCHED_CPUFREQ_RT_DL) { 
+       if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
                next_f = policy->cpuinfo.max_freq;

This check disregards any signal the scheduler can provide.

With such a policy, a 60% CFS task with a 100ms period will end up
running at the highest OPP for just 10% of its entire activation (the
boost only kicks in once the 50ms busy threshold has elapsed).
Moreover, when it completes, we are likely to enter an idle period
while still remaining at the highest OPP.

IMHO the ultimate goal of schedutil should be to be driven by the
scheduler, which has (or can have) all the required information to
support OPP selection.

If something is not working, well, then we should properly fix the
signals and/or provide (at least) a per-task tunable interface.

Adding a hardcoded threshold is an easy fix but it will ultimately
increase the complexity of the governor.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki
                     ` (2 preceding siblings ...)
  2017-03-20 10:36   ` Peter Zijlstra
@ 2017-03-20 21:46   ` Rafael J. Wysocki
  2017-03-21  6:40     ` Viresh Kumar
                       ` (3 more replies)
  3 siblings, 4 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-20 21:46 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register.  Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again.  That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

To work around this issue use the observation that, from the
schedutil governor's perspective, it does not make sense to decrease
the frequency of a CPU that doesn't enter idle and avoid decreasing
the frequency of busy CPUs.

To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before it is about to set a new frequency
for that CPU's policy.  If the counter has not changed since the
previous iteration, the CPU has been busy for all that time and
its frequency should not be decreased, so if the new frequency would
be lower than the one set previously, the governor will skip the
frequency update.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This is a slightly different approach (avoid decreasing the frequency for busy
CPUs instead of bumping it to the max for them upfront) and it works around the
original problem too.

I tried to address a few of Peter's comments here and the result doesn't seem
to be too heavyweight.

Thanks,
Rafael

---
 include/linux/tick.h             |    1 +
 kernel/sched/cpufreq_schedutil.c |   28 ++++++++++++++++++++++++----
 kernel/time/tick-sched.c         |   12 ++++++++++++
 3 files changed, 37 insertions(+), 4 deletions(-)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -56,6 +56,9 @@ struct sugov_cpu {
 	unsigned long iowait_boost;
 	unsigned long iowait_boost_max;
 	u64 last_update;
+#ifdef CONFIG_NO_HZ_COMMON
+	unsigned long saved_idle_calls;
+#endif
 
 	/* The fields below are only needed when sharing a policy. */
 	unsigned long util;
@@ -88,11 +91,28 @@ static bool sugov_should_update_freq(str
 	return delta_ns >= sg_policy->freq_update_delay_ns;
 }
 
-static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
-				unsigned int next_freq)
+#ifdef CONFIG_NO_HZ_COMMON
+static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+{
+	unsigned long idle_calls = tick_nohz_get_idle_calls();
+	bool ret = idle_calls == sg_cpu->saved_idle_calls;
+
+	sg_cpu->saved_idle_calls = idle_calls;
+	return ret;
+}
+#else
+static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+#endif /* CONFIG_NO_HZ_COMMON */
+
+static void sugov_update_commit(struct sugov_cpu *sg_cpu,
+				struct sugov_policy *sg_policy,
+				u64 time, unsigned int next_freq)
 {
 	struct cpufreq_policy *policy = sg_policy->policy;
 
+	if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq)
+		next_freq = sg_policy->next_freq;
+
 	if (policy->fast_switch_enabled) {
 		if (sg_policy->next_freq == next_freq) {
 			trace_cpu_frequency(policy->cur, smp_processor_id());
@@ -214,7 +234,7 @@ static void sugov_update_single(struct u
 		sugov_iowait_boost(sg_cpu, &util, &max);
 		next_f = get_next_freq(sg_policy, util, max);
 	}
-	sugov_update_commit(sg_policy, time, next_f);
+	sugov_update_commit(sg_cpu, sg_policy, time, next_f);
 }
 
 static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
@@ -283,7 +303,7 @@ static void sugov_update_shared(struct u
 		else
 			next_f = sugov_next_freq_shared(sg_cpu);
 
-		sugov_update_commit(sg_policy, time, next_f);
+		sugov_update_commit(sg_cpu, sg_policy, time, next_f);
 	}
 
 	raw_spin_unlock(&sg_policy->update_lock);
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
+extern unsigned long tick_nohz_get_idle_calls(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 #else /* !CONFIG_NO_HZ_COMMON */
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void)
 	return ts->sleep_length;
 }
 
+/**
+ * tick_nohz_get_idle_calls - return the current idle calls counter value
+ *
+ * Called from the schedutil frequency scaling governor in scheduler context.
+ */
+unsigned long tick_nohz_get_idle_calls(void)
+{
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	return ts->idle_calls;
+}
+
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-20 21:46   ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki
@ 2017-03-21  6:40     ` Viresh Kumar
  2017-03-21 12:30       ` Rafael J. Wysocki
  2017-03-21  8:50     ` Vincent Guittot
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 71+ messages in thread
From: Viresh Kumar @ 2017-03-21  6:40 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 20-03-17, 22:46, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c

> +static void sugov_update_commit(struct sugov_cpu *sg_cpu,
> +				struct sugov_policy *sg_policy,
> +				u64 time, unsigned int next_freq)
>  {
>  	struct cpufreq_policy *policy = sg_policy->policy;
>  
> +	if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq)
> +		next_freq = sg_policy->next_freq;
> +

In the earlier version you said that we want to be opportunistic and
don't want to do heavy computation, and so check only the current CPU.

But in this version, all those computations are already done by now.
Why shouldn't we check all CPUs in the policy now?  I am asking as we
will still have the same problem we are trying to work around if the
current CPU isn't busy but others sharing the policy are.

Also, why not return directly from within the if block? To run
trace_cpu_frequency()?

I don't remember exactly, but why don't we run that for the !fast-switch
case?  We can simplify the code a bit if we check for no freq change at
the top of the routine.

-- 
viresh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-20 21:46   ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki
  2017-03-21  6:40     ` Viresh Kumar
@ 2017-03-21  8:50     ` Vincent Guittot
  2017-03-21 11:56       ` Patrick Bellasi
  2017-03-21 13:22       ` Peter Zijlstra
  2017-03-21 11:50     ` Patrick Bellasi
  2017-03-21 23:08     ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki
  3 siblings, 2 replies; 71+ messages in thread
From: Vincent Guittot @ 2017-03-21  8:50 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
>
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
>
> To work around this issue use the observation that, from the
> schedutil governor's perspective, it does not make sense to decrease
> the frequency of a CPU that doesn't enter idle and avoid decreasing
> the frequency of busy CPUs.

I don't fully agree with that statement.
If there are 2 runnable tasks on CPU A and the scheduler migrates the
waiting task to another CPU B so that CPU A is less loaded now, it makes
sense to reduce the OPP.  It is precisely for that purpose that we
decided to use scheduler metrics in the cpufreq governor, so that we can
adjust the OPP immediately when tasks migrate.
That being said, I probably know why you see such OPP switches in your
use case.  When we migrate a task, we also migrate/remove its
utilization from the CPU.
If the CPU is not overloaded, the runnable tasks get all the computation
that they need and have no reason to use more when a task migrates to
another CPU, so decreasing the OPP makes sense because the utilization
is decreasing.
If the CPU is overloaded, the runnable tasks have to share CPU time and
probably don't get all the computation that they would like, so when a
task migrates, the remaining tasks on the CPU will increase their
utilization and fill the space left by the task that has just migrated.
So the CPU's utilization will decrease when a task migrates (and as a
result so will the OPP), but then its utilization will increase again as
the remaining tasks run for more time, and so will the OPP.

So you need to distinguish between these 2 cases: is the CPU overloaded
or not?  You can't really rely on the utilization to detect that, but
you could take advantage of the load, which takes into account the
waiting time of tasks.
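
To make that concrete, here is a toy model (plain C, not kernel code; the
32 ms half-life and 1 ms steps are assumptions standing in for PELT) of
the overloaded case: two always-runnable tasks share one CPU, then one of
them migrates away and its utilization leaves with it:

#include <stdio.h>
#include <math.h>

int main(void)
{
	const double decay = pow(0.5, 1.0 / 32.0);	/* per-1ms decay factor */
	double util_a = 0.5, util_b = 0.5;		/* steady state: ~50% each */
	int ms;

	/* overloaded CPU: A and B round-robin, the CPU never idles */
	for (ms = 0; ms < 100; ms++) {
		int a_runs = ms & 1;

		util_a = util_a * decay + (a_runs ? 1.0 - decay : 0.0);
		util_b = util_b * decay + (a_runs ? 0.0 : 1.0 - decay);
	}
	printf("before migration: cpu util ~ %.2f\n", util_a + util_b);

	/* B migrates away; A now runs all the time and slowly fills the gap */
	for (ms = 0; ms < 100; ms++) {
		util_a = util_a * decay + (1.0 - decay);
		if (ms == 0 || ms == 20 || ms == 99)
			printf("after %3d ms: cpu util ~ %.2f\n", ms + 1, util_a);
	}
	return 0;
}

The CPU-level utilization (and hence the requested OPP) drops to about
half right after the migration and only climbs back over the following
~100 ms, even though the CPU never stopped being 100% busy.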

Vincent
>
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before it is about to set a new frequency
> for that CPU's policy.  If the counter has not changed since the
> previous iteration, the CPU has been busy for all that time and
> its frequency should not be decreased, so if the new frequency would
> be lower than the one set previously, the governor will skip the
> frequency update.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> This is a slightly different approach (avoid decreasing the frequency for busy
> CPUs instead of bumping it to the max for them upfront) and it works around the
> original problem too.
>
> I tried to address a few of Peter's comments here and the result doesn't seem
> to be too heavyweight.
>
> Thanks,
> Rafael
>
> ---
>  include/linux/tick.h             |    1 +
>  kernel/sched/cpufreq_schedutil.c |   28 ++++++++++++++++++++++++----
>  kernel/time/tick-sched.c         |   12 ++++++++++++
>  3 files changed, 37 insertions(+), 4 deletions(-)
>
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -56,6 +56,9 @@ struct sugov_cpu {
>         unsigned long iowait_boost;
>         unsigned long iowait_boost_max;
>         u64 last_update;
> +#ifdef CONFIG_NO_HZ_COMMON
> +       unsigned long saved_idle_calls;
> +#endif
>
>         /* The fields below are only needed when sharing a policy. */
>         unsigned long util;
> @@ -88,11 +91,28 @@ static bool sugov_should_update_freq(str
>         return delta_ns >= sg_policy->freq_update_delay_ns;
>  }
>
> -static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> -                               unsigned int next_freq)
> +#ifdef CONFIG_NO_HZ_COMMON
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> +       unsigned long idle_calls = tick_nohz_get_idle_calls();
> +       bool ret = idle_calls == sg_cpu->saved_idle_calls;
> +
> +       sg_cpu->saved_idle_calls = idle_calls;
> +       return ret;
> +}
> +#else
> +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> +#endif /* CONFIG_NO_HZ_COMMON */
> +
> +static void sugov_update_commit(struct sugov_cpu *sg_cpu,
> +                               struct sugov_policy *sg_policy,
> +                               u64 time, unsigned int next_freq)
>  {
>         struct cpufreq_policy *policy = sg_policy->policy;
>
> +       if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq)
> +               next_freq = sg_policy->next_freq;
> +
>         if (policy->fast_switch_enabled) {
>                 if (sg_policy->next_freq == next_freq) {
>                         trace_cpu_frequency(policy->cur, smp_processor_id());
> @@ -214,7 +234,7 @@ static void sugov_update_single(struct u
>                 sugov_iowait_boost(sg_cpu, &util, &max);
>                 next_f = get_next_freq(sg_policy, util, max);
>         }
> -       sugov_update_commit(sg_policy, time, next_f);
> +       sugov_update_commit(sg_cpu, sg_policy, time, next_f);
>  }
>
>  static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
> @@ -283,7 +303,7 @@ static void sugov_update_shared(struct u
>                 else
>                         next_f = sugov_next_freq_shared(sg_cpu);
>
> -               sugov_update_commit(sg_policy, time, next_f);
> +               sugov_update_commit(sg_cpu, sg_policy, time, next_f);
>         }
>
>         raw_spin_unlock(&sg_policy->update_lock);
> Index: linux-pm/include/linux/tick.h
> ===================================================================
> --- linux-pm.orig/include/linux/tick.h
> +++ linux-pm/include/linux/tick.h
> @@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void);
>  extern void tick_nohz_idle_exit(void);
>  extern void tick_nohz_irq_exit(void);
>  extern ktime_t tick_nohz_get_sleep_length(void);
> +extern unsigned long tick_nohz_get_idle_calls(void);
>  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
>  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
>  #else /* !CONFIG_NO_HZ_COMMON */
> Index: linux-pm/kernel/time/tick-sched.c
> ===================================================================
> --- linux-pm.orig/kernel/time/tick-sched.c
> +++ linux-pm/kernel/time/tick-sched.c
> @@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void)
>         return ts->sleep_length;
>  }
>
> +/**
> + * tick_nohz_get_idle_calls - return the current idle calls counter value
> + *
> + * Called from the schedutil frequency scaling governor in scheduler context.
> + */
> +unsigned long tick_nohz_get_idle_calls(void)
> +{
> +       struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> +
> +       return ts->idle_calls;
> +}
> +
>  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
>  {
>  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-20 21:46   ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki
  2017-03-21  6:40     ` Viresh Kumar
  2017-03-21  8:50     ` Vincent Guittot
@ 2017-03-21 11:50     ` Patrick Bellasi
  2017-03-21 23:08     ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki
  3 siblings, 0 replies; 71+ messages in thread
From: Patrick Bellasi @ 2017-03-21 11:50 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Vincent Guittot, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 20-Mar 22:46, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> To work around this issue use the observation that, from the
> schedutil governor's perspective, it does not make sense to decrease
> the frequency of a CPU that doesn't enter idle and avoid decreasing
> the frequency of busy CPUs.
> 
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before it is about to set a new frequency
> for that CPU's policy.  If the counter has not changed since the
> previous iteration, the CPU has been busy for all that time and
> its frequency should not be decreased, so if the new frequency would
> be lower than the one set previously, the governor will skip the
> frequency update.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> This is a slightly different approach (avoid decreasing the frequency for busy
> CPUs instead of bumping it to the max for them upfront) and it works around the
> original problem too.

I like this version much better: we no longer enforce the max frequency
and we have removed the hardcoded time threshold. ;-)

It also makes sense to me to avoid scaling down until we hit an idle period.

However, I also agree with Vincent's observation: this constraint should be
there only for "overutilized" CPUs... but unfortunately, in mainline we are
still missing that flag and thus we should probably use the
"overloaded" one for the time being.

> I tried to address a few of Peter's comments here and the result doesn't seem
> to be too heavyweight.

Nice!

> Thanks,
> Rafael
> 
> ---
>  include/linux/tick.h             |    1 +
>  kernel/sched/cpufreq_schedutil.c |   28 ++++++++++++++++++++++++----
>  kernel/time/tick-sched.c         |   12 ++++++++++++
>  3 files changed, 37 insertions(+), 4 deletions(-)
> 
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -56,6 +56,9 @@ struct sugov_cpu {
>  	unsigned long iowait_boost;
>  	unsigned long iowait_boost_max;
>  	u64 last_update;
> +#ifdef CONFIG_NO_HZ_COMMON
> +	unsigned long saved_idle_calls;
> +#endif
>  
>  	/* The fields below are only needed when sharing a policy. */
>  	unsigned long util;
> @@ -88,11 +91,28 @@ static bool sugov_should_update_freq(str
>  	return delta_ns >= sg_policy->freq_update_delay_ns;
>  }
>  
> -static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> -				unsigned int next_freq)
> +#ifdef CONFIG_NO_HZ_COMMON
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> +	unsigned long idle_calls = tick_nohz_get_idle_calls();
> +	bool ret = idle_calls == sg_cpu->saved_idle_calls;
> +
> +	sg_cpu->saved_idle_calls = idle_calls;
> +	return ret;
> +}
> +#else
> +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> +#endif /* CONFIG_NO_HZ_COMMON */
> +
> +static void sugov_update_commit(struct sugov_cpu *sg_cpu,
> +				struct sugov_policy *sg_policy,
> +				u64 time, unsigned int next_freq)
>  {
>  	struct cpufreq_policy *policy = sg_policy->policy;
>  
> +	if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq)
> +		next_freq = sg_policy->next_freq;
> +
>  	if (policy->fast_switch_enabled) {
>  		if (sg_policy->next_freq == next_freq) {
>  			trace_cpu_frequency(policy->cur, smp_processor_id());
> @@ -214,7 +234,7 @@ static void sugov_update_single(struct u
>  		sugov_iowait_boost(sg_cpu, &util, &max);
>  		next_f = get_next_freq(sg_policy, util, max);
>  	}
> -	sugov_update_commit(sg_policy, time, next_f);
> +	sugov_update_commit(sg_cpu, sg_policy, time, next_f);
>  }
>  
>  static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
> @@ -283,7 +303,7 @@ static void sugov_update_shared(struct u
>  		else
>  			next_f = sugov_next_freq_shared(sg_cpu);
>  
> -		sugov_update_commit(sg_policy, time, next_f);
> +		sugov_update_commit(sg_cpu, sg_policy, time, next_f);
>  	}
>  
>  	raw_spin_unlock(&sg_policy->update_lock);
> Index: linux-pm/include/linux/tick.h
> ===================================================================
> --- linux-pm.orig/include/linux/tick.h
> +++ linux-pm/include/linux/tick.h
> @@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void);
>  extern void tick_nohz_idle_exit(void);
>  extern void tick_nohz_irq_exit(void);
>  extern ktime_t tick_nohz_get_sleep_length(void);
> +extern unsigned long tick_nohz_get_idle_calls(void);
>  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
>  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
>  #else /* !CONFIG_NO_HZ_COMMON */
> Index: linux-pm/kernel/time/tick-sched.c
> ===================================================================
> --- linux-pm.orig/kernel/time/tick-sched.c
> +++ linux-pm/kernel/time/tick-sched.c
> @@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void)
>  	return ts->sleep_length;
>  }
>  
> +/**
> + * tick_nohz_get_idle_calls - return the current idle calls counter value
> + *
> + * Called from the schedutil frequency scaling governor in scheduler context.
> + */
> +unsigned long tick_nohz_get_idle_calls(void)
> +{
> +	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> +
> +	return ts->idle_calls;
> +}
> +
>  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
>  {
>  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> 

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21  8:50     ` Vincent Guittot
@ 2017-03-21 11:56       ` Patrick Bellasi
  2017-03-21 13:22       ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Patrick Bellasi @ 2017-03-21 11:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Linux PM, LKML, Peter Zijlstra,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 21-Mar 09:50, Vincent Guittot wrote:
> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > The way the schedutil governor uses the PELT metric causes it to
> > underestimate the CPU utilization in some cases.
> >
> > That can be easily demonstrated by running kernel compilation on
> > a Sandy Bridge Intel processor, running turbostat in parallel with
> > it and looking at the values written to the MSR_IA32_PERF_CTL
> > register.  Namely, the expected result would be that when all CPUs
> > were 100% busy, all of them would be requested to run in the maximum
> > P-state, but observation shows that this clearly isn't the case.
> > The CPUs run in the maximum P-state for a while and then are
> > requested to run slower and go back to the maximum P-state after
> > a while again.  That causes the actual frequency of the processor to
> > visibly oscillate below the sustainable maximum in a jittery fashion
> > which clearly is not desirable.
> >
> > To work around this issue use the observation that, from the
> > schedutil governor's perspective, it does not make sense to decrease
> > the frequency of a CPU that doesn't enter idle and avoid decreasing
> > the frequency of busy CPUs.
> 
> I don't fully agree with that statement.
> If there are 2 runnable tasks on CPU A and scheduler migrates the
> waiting task to another CPU B so CPU A is less loaded now, it makes
> sense to reduce the OPP. That's even for that purpose that we have
> decided to use scheduler metrics in cpufreq governor so we can adjust
> OPP immediately when tasks migrate.
> That being said, i probably know why you see such OPP switches in your
> use case. When we migrate a task, we also migrate/remove its
> utilization from CPU.
> If the CPU is not overloaded, it means that runnable tasks have all
> computation that they need and don't have any reason to use more when
> a task migrates to another CPU. so decreasing the OPP makes sense
> because the utilzation is decreasing
> If the CPU is overloaded, it means that runnable tasks have to share
> CPU time and probably don't have all computations that they would like
> so when a task migrate, the remaining tasks on the CPU will increase
> their utilization and fill space left by the task that has just
> migrated. So the CPU's utilization will decrease when a task migrates
> (and as a result the OPP) but then its utilization will increase with
> remaining tasks running more time as well as the OPP
> 
> So you need to make the difference between this 2 cases: Is a CPU
> overloaded or not. You can't really rely on the utilization to detect
> that but you could take advantage of the load which take into account
> the waiting time of tasks

Right, we can use "overloaded" for the time being until we push the
"overutilized" bits.

[...]

> > +#ifdef CONFIG_NO_HZ_COMMON
> > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> > +{
> > +       unsigned long idle_calls = tick_nohz_get_idle_calls();
> > +       bool ret = idle_calls == sg_cpu->saved_idle_calls;

Vincent: are you proposing something like this?

   +     if (!this_rq()->rd->overload)
   +             return false;

> > +
> > +       sg_cpu->saved_idle_calls = idle_calls;
> > +       return ret;
> > +}
> > +#else
> > +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> > +#endif /* CONFIG_NO_HZ_COMMON */
> > +
> > +static void sugov_update_commit(struct sugov_cpu *sg_cpu,
> > +                               struct sugov_policy *sg_policy,
> > +                               u64 time, unsigned int next_freq)
> >  {
> >         struct cpufreq_policy *policy = sg_policy->policy;
> >
> > +       if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq)
> > +               next_freq = sg_policy->next_freq;
> > +
> >         if (policy->fast_switch_enabled) {
> >                 if (sg_policy->next_freq == next_freq) {
> >                         trace_cpu_frequency(policy->cur, smp_processor_id());
> > @@ -214,7 +234,7 @@ static void sugov_update_single(struct u
> >                 sugov_iowait_boost(sg_cpu, &util, &max);
> >                 next_f = get_next_freq(sg_policy, util, max);
> >         }
> > -       sugov_update_commit(sg_policy, time, next_f);
> > +       sugov_update_commit(sg_cpu, sg_policy, time, next_f);
> >  }
> >

[...]

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21  6:40     ` Viresh Kumar
@ 2017-03-21 12:30       ` Rafael J. Wysocki
  0 siblings, 0 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-21 12:30 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Linux PM, LKML, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Vincent Guittot,
	Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar

On Tue, Mar 21, 2017 at 7:40 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 20-03-17, 22:46, Rafael J. Wysocki wrote:
>> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
>
>> +static void sugov_update_commit(struct sugov_cpu *sg_cpu,
>> +                             struct sugov_policy *sg_policy,
>> +                             u64 time, unsigned int next_freq)
>>  {
>>       struct cpufreq_policy *policy = sg_policy->policy;
>>
>> +     if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq)
>> +             next_freq = sg_policy->next_freq;
>> +
>
> In the earlier version you said that we want to be opportunistic and
> don't want to do heavy computation and so check only for current CPU.
>
> But in this version, all those computations are already done by now.
> Why shouldn't we check all CPUs in the policy now? I am asking as we
> will still have the same problem, we are trying to work-around if the
> current CPU isn't busy but others sharing the policy are.

This isn't the way I'm looking at it.

This is an easy (and relatively cheap) check to make for the *current*
*CPU*, and our frequency selection algorithm turns out to have
problems, so it would be kind of unreasonable not to use the
opportunity to fix up the value coming from it - if we can do that
easily enough.

For the other CPUs in the policy that would require extra
synchronization etc., so it's not that easy any more.
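
For comparison, a purely hypothetical sketch of what a policy-wide check
could look like (tick_nohz_get_idle_calls_cpu() is an assumed helper, not
part of this series, and the remote reads of saved_idle_calls would be
racy without extra care):

static bool sugov_policy_is_busy(struct sugov_policy *sg_policy)
{
	unsigned int cpu;

	for_each_cpu(cpu, sg_policy->policy->cpus) {
		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

		/* unsynchronized read of another CPU's counter snapshot */
		if (sg_cpu->saved_idle_calls != tick_nohz_get_idle_calls_cpu(cpu))
			return false;
	}
	return true;
}

which is exactly the extra synchronization and remote-data cost I'd rather
avoid by keeping the check local to the current CPU.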

> Also, why not return directly from within the if block? To run
> trace_cpu_frequency()?

Yes.

> I don't remember exactly, but why don't we run that for !fast-switch
> case?

That's an interesting question.

We do that in the fast switch case, because otherwise utilities get
confused if the frequency is not updated for a long enough time.

I'm not really sure why they don't get confused in the other case,
though.  [In that case the core calls trace_cpu_frequency() for us,
but only if we actually run the async work.]

It looks like it wouldn't hurt to always run trace_cpu_frequency()
when we want to bail out early for next_freq == sg_policy->next_freq.

Let me prepare a patch for that. :-)

> We can simplify the code a bit if we check for no freq change at
> the top of the routine.

Right.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21  8:50     ` Vincent Guittot
  2017-03-21 11:56       ` Patrick Bellasi
@ 2017-03-21 13:22       ` Peter Zijlstra
  2017-03-21 13:37         ` Vincent Guittot
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2017-03-21 13:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote:
> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:

> > To work around this issue use the observation that, from the
> > schedutil governor's perspective, it does not make sense to decrease
> > the frequency of a CPU that doesn't enter idle and avoid decreasing
> > the frequency of busy CPUs.
> 
> I don't fully agree with that statement.
> If there are 2 runnable tasks on CPU A and scheduler migrates the
> waiting task to another CPU B so CPU A is less loaded now, it makes
> sense to reduce the OPP. That's even for that purpose that we have
> decided to use scheduler metrics in cpufreq governor so we can adjust
> OPP immediately when tasks migrate.
> That being said, i probably know why you see such OPP switches in your
> use case. When we migrate a task, we also migrate/remove its
> utilization from CPU.
> If the CPU is not overloaded, it means that runnable tasks have all
> computation that they need and don't have any reason to use more when
> a task migrates to another CPU. so decreasing the OPP makes sense
> because the utilzation is decreasing
> If the CPU is overloaded, it means that runnable tasks have to share
> CPU time and probably don't have all computations that they would like
> so when a task migrate, the remaining tasks on the CPU will increase
> their utilization and fill space left by the task that has just
> migrated. So the CPU's utilization will decrease when a task migrates
> (and as a result the OPP) but then its utilization will increase with
> remaining tasks running more time as well as the OPP
> 
> So you need to make the difference between this 2 cases: Is a CPU
> overloaded or not. You can't really rely on the utilization to detect
> that but you could take advantage of the load which take into account
> the waiting time of tasks

I'm confused. What two cases? You only list the overloaded case, but he
does exactly that. Note that the lack of idle time is an exact
equivalent of being 100% utilized.

So even though we cannot currently detect the 100% utilized state through
the running-state tracking (because of averages etc.), we can detect the
lack of idle time.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 13:22       ` Peter Zijlstra
@ 2017-03-21 13:37         ` Vincent Guittot
  2017-03-21 14:03           ` Peter Zijlstra
  2017-03-21 14:26           ` Rafael J. Wysocki
  0 siblings, 2 replies; 71+ messages in thread
From: Vincent Guittot @ 2017-03-21 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote:
>> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
>> > To work around this issue use the observation that, from the
>> > schedutil governor's perspective, it does not make sense to decrease
>> > the frequency of a CPU that doesn't enter idle and avoid decreasing
>> > the frequency of busy CPUs.
>>
>> I don't fully agree with that statement.
>> If there are 2 runnable tasks on CPU A and scheduler migrates the
>> waiting task to another CPU B so CPU A is less loaded now, it makes
>> sense to reduce the OPP. That's even for that purpose that we have
>> decided to use scheduler metrics in cpufreq governor so we can adjust
>> OPP immediately when tasks migrate.
>> That being said, i probably know why you see such OPP switches in your
>> use case. When we migrate a task, we also migrate/remove its
>> utilization from CPU.
>> If the CPU is not overloaded, it means that runnable tasks have all
>> computation that they need and don't have any reason to use more when
>> a task migrates to another CPU. so decreasing the OPP makes sense
>> because the utilzation is decreasing
>> If the CPU is overloaded, it means that runnable tasks have to share
>> CPU time and probably don't have all computations that they would like
>> so when a task migrate, the remaining tasks on the CPU will increase
>> their utilization and fill space left by the task that has just
>> migrated. So the CPU's utilization will decrease when a task migrates
>> (and as a result the OPP) but then its utilization will increase with
>> remaining tasks running more time as well as the OPP
>>
>> So you need to make the difference between this 2 cases: Is a CPU
>> overloaded or not. You can't really rely on the utilization to detect
>> that but you could take advantage of the load which take into account
>> the waiting time of tasks
>
> I'm confused. What two cases? You only list the overloaded case, but he

The overloaded vs. not overloaded cases.
For the not overloaded case, it makes sense to immediately update the
OPP to be aligned with the new utilization of the CPU even if it was
not idle in the past couple of ticks.

> does exactly that. Note that the lack of idle time is an exact
> equivalent of 100% utilized.
>
> So even while we cannot currently detect the 100% utilized state through
> the running state tracking; because averages etc.. we can detect the
> lack of idle time.

But after how much lack of idle time do we consider that we are overloaded?

>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 13:37         ` Vincent Guittot
@ 2017-03-21 14:03           ` Peter Zijlstra
  2017-03-21 14:18             ` Vincent Guittot
                               ` (2 more replies)
  2017-03-21 14:26           ` Rafael J. Wysocki
  1 sibling, 3 replies; 71+ messages in thread
From: Peter Zijlstra @ 2017-03-21 14:03 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote:
> On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:

> For the not overloaded case, it makes sense to immediately update to
> OPP to be aligned with the new utilization of the CPU even if it was
> not idle in the past couple of ticks

Yeah, but we cannot know. Also, who cares?

> > does exactly that. Note that the lack of idle time is an exact
> > equivalent of 100% utilized.
> >
> > So even while we cannot currently detect the 100% utilized state through
> > the running state tracking; because averages etc.. we can detect the
> > lack of idle time.
> 
> But after how much lack of idle time do we consider that we are overloaded ?

0 :-)

Note that utilization is an absolute metric, not a windowed one. That
is, there is no actual time associated with it. Now, for practical
purposes we end up using windowed things in many places,

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 14:03           ` Peter Zijlstra
@ 2017-03-21 14:18             ` Vincent Guittot
  2017-03-21 14:25             ` Patrick Bellasi
       [not found]             ` <CAKfTPtALorn7HNpz4LOfWWSc3u+9y5iHB5byzfTHGQXDA+tVJQ@mail.gmail.com>
  2 siblings, 0 replies; 71+ messages in thread
From: Vincent Guittot @ 2017-03-21 14:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 21 March 2017 at 15:03, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote:
>> On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> For the not overloaded case, it makes sense to immediately update to
>> OPP to be aligned with the new utilization of the CPU even if it was
>> not idle in the past couple of ticks
>
> Yeah, but we cannot know. Also, who cares?

An embedded system that doesn't want to stay at the highest OPP if a
significant part of the utilization has moved away, as an example.
AFAICT, schedutil tries to select the best OPP according to the
current utilization of the CPU, so if the utilization decreases, the
OPP should also decrease.

>
>> > does exactly that. Note that the lack of idle time is an exact
>> > equivalent of 100% utilized.
>> >
>> > So even while we cannot currently detect the 100% utilized state through
>> > the running state tracking; because averages etc.. we can detect the
>> > lack of idle time.
>>
>> But after how much lack of idle time do we consider that we are overloaded ?
>
> 0 :-)
>
> Note that utilization is an absolute metric, not a windowed one. That
> is, there is no actual time associated with it. Now, for practical
> purposes we end up using windowed things in many places,
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 14:03           ` Peter Zijlstra
  2017-03-21 14:18             ` Vincent Guittot
@ 2017-03-21 14:25             ` Patrick Bellasi
       [not found]             ` <CAKfTPtALorn7HNpz4LOfWWSc3u+9y5iHB5byzfTHGQXDA+tVJQ@mail.gmail.com>
  2 siblings, 0 replies; 71+ messages in thread
From: Patrick Bellasi @ 2017-03-21 14:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Rafael J. Wysocki, Linux PM, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 21-Mar 15:03, Peter Zijlstra wrote:
> On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote:
> > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > For the not overloaded case, it makes sense to immediately update to
> > OPP to be aligned with the new utilization of the CPU even if it was
> > not idle in the past couple of ticks
> 
> Yeah, but we cannot know. Also, who cares?
> 
> > > does exactly that. Note that the lack of idle time is an exact
> > > equivalent of 100% utilized.
> > >
> > > So even while we cannot currently detect the 100% utilized state through
> > > the running state tracking; because averages etc.. we can detect the
> > > lack of idle time.
> > 
> > But after how much lack of idle time do we consider that we are overloaded ?
> 
> 0 :-)

If we were to use "utilization", this time could be non-zero: it depends,
for example, on how long PELT takes to build up a utilization value which
marks the CPU as "overutilized"... thus we already have a suitable time,
at least for CFS tasks.
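
As a back-of-the-envelope (assuming the usual ~32 ms PELT half-life and a
task running flat out from zero utilization):

    u(t) \approx 1 - 2^{-t/32\,ms}
    \quad\Rightarrow\quad
    t(u) = -32\,ms \cdot \log_2(1 - u)

so crossing, say, an 80% "overutilized" threshold takes roughly
-32 * log2(0.2) ~= 74 ms, which is the kind of implicit time constant
this would build in.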

> Note that utilization is an absolute metric, not a windowed one. That
> is, there is no actual time associated with it. Now, for practical
> purposes we end up using windowed things in many places,
> 

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 13:37         ` Vincent Guittot
  2017-03-21 14:03           ` Peter Zijlstra
@ 2017-03-21 14:26           ` Rafael J. Wysocki
  2017-03-21 14:38             ` Patrick Bellasi
  2017-03-21 15:02             ` Peter Zijlstra
  1 sibling, 2 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-21 14:26 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar

On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote:
> On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote:
> >> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> >> > To work around this issue use the observation that, from the
> >> > schedutil governor's perspective, it does not make sense to decrease
> >> > the frequency of a CPU that doesn't enter idle and avoid decreasing
> >> > the frequency of busy CPUs.
> >>
> >> I don't fully agree with that statement.
> >> If there are 2 runnable tasks on CPU A and scheduler migrates the
> >> waiting task to another CPU B so CPU A is less loaded now, it makes
> >> sense to reduce the OPP. That's even for that purpose that we have
> >> decided to use scheduler metrics in cpufreq governor so we can adjust
> >> OPP immediately when tasks migrate.
> >> That being said, i probably know why you see such OPP switches in your
> >> use case. When we migrate a task, we also migrate/remove its
> >> utilization from CPU.
> >> If the CPU is not overloaded, it means that runnable tasks have all
> >> computation that they need and don't have any reason to use more when
> >> a task migrates to another CPU. so decreasing the OPP makes sense
> >> because the utilzation is decreasing
> >> If the CPU is overloaded, it means that runnable tasks have to share
> >> CPU time and probably don't have all computations that they would like
> >> so when a task migrate, the remaining tasks on the CPU will increase
> >> their utilization and fill space left by the task that has just
> >> migrated. So the CPU's utilization will decrease when a task migrates
> >> (and as a result the OPP) but then its utilization will increase with
> >> remaining tasks running more time as well as the OPP
> >>
> >> So you need to make the difference between this 2 cases: Is a CPU
> >> overloaded or not. You can't really rely on the utilization to detect
> >> that but you could take advantage of the load which take into account
> >> the waiting time of tasks
> >
> > I'm confused. What two cases? You only list the overloaded case, but he
> 
> overloaded vs not overloaded use case.
> For the not overloaded case, it makes sense to immediately update to
> OPP to be aligned with the new utilization of the CPU even if it was
> not idle in the past couple of ticks

Yes, if the OPP (or P-state if you will) can be changed immediately.  If it can't,
conditions may change by the time we actually update it, and in that case it'd
be better to wait and see IMO.

In any case, the theory about migrating tasks made sense to me, so below is
what I tested.  It works, and besides it has the nice feature that I don't need
to fetch the timekeeping data. :-)

I only wonder whether we want to do this or only prevent the frequency from
decreasing in the overloaded case.
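
That second option would be roughly a one-line variation on the v2 "don't
decrease" check, reusing the overload flag added in the patch below
(sketch only, not tested):

	/* only block decreases, and only while the root domain is overloaded */
	if (sg_cpu->overload && next_freq < sg_policy->next_freq)
		next_freq = sg_policy->next_freq;

in sugov_update_commit(), instead of forcing the max frequency.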

---
 kernel/sched/cpufreq_schedutil.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -61,6 +61,7 @@ struct sugov_cpu {
 	unsigned long util;
 	unsigned long max;
 	unsigned int flags;
+	bool overload;
 };
 
 static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
@@ -207,7 +208,7 @@ static void sugov_update_single(struct u
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
-	if (flags & SCHED_CPUFREQ_RT_DL) {
+	if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) {
 		next_f = policy->cpuinfo.max_freq;
 	} else {
 		sugov_get_util(&util, &max);
@@ -242,7 +243,7 @@ static unsigned int sugov_next_freq_shar
 			j_sg_cpu->iowait_boost = 0;
 			continue;
 		}
-		if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
+		if ((j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL) || j_sg_cpu->overload)
 			return policy->cpuinfo.max_freq;
 
 		j_util = j_sg_cpu->util;
@@ -273,12 +274,13 @@ static void sugov_update_shared(struct u
 	sg_cpu->util = util;
 	sg_cpu->max = max;
 	sg_cpu->flags = flags;
+	sg_cpu->overload = this_rq()->rd->overload;
 
 	sugov_set_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
 	if (sugov_should_update_freq(sg_policy, time)) {
-		if (flags & SCHED_CPUFREQ_RT_DL)
+		if ((flags & SCHED_CPUFREQ_RT_DL) || sg_cpu->overload)
 			next_f = sg_policy->policy->cpuinfo.max_freq;
 		else
 			next_f = sugov_next_freq_shared(sg_cpu);

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 14:26           ` Rafael J. Wysocki
@ 2017-03-21 14:38             ` Patrick Bellasi
  2017-03-21 14:46               ` Rafael J. Wysocki
  2017-03-21 15:02             ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Patrick Bellasi @ 2017-03-21 14:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Vincent Guittot, Peter Zijlstra, Linux PM, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 21-Mar 15:26, Rafael J. Wysocki wrote:
> On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote:
> > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote:
> > >> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > >> > To work around this issue use the observation that, from the
> > >> > schedutil governor's perspective, it does not make sense to decrease
> > >> > the frequency of a CPU that doesn't enter idle and avoid decreasing
> > >> > the frequency of busy CPUs.
> > >>
> > >> I don't fully agree with that statement.
> > >> If there are 2 runnable tasks on CPU A and scheduler migrates the
> > >> waiting task to another CPU B so CPU A is less loaded now, it makes
> > >> sense to reduce the OPP. That's even for that purpose that we have
> > >> decided to use scheduler metrics in cpufreq governor so we can adjust
> > >> OPP immediately when tasks migrate.
> > >> That being said, i probably know why you see such OPP switches in your
> > >> use case. When we migrate a task, we also migrate/remove its
> > >> utilization from CPU.
> > >> If the CPU is not overloaded, it means that runnable tasks have all
> > >> computation that they need and don't have any reason to use more when
> > >> a task migrates to another CPU. so decreasing the OPP makes sense
> > >> because the utilzation is decreasing
> > >> If the CPU is overloaded, it means that runnable tasks have to share
> > >> CPU time and probably don't have all computations that they would like
> > >> so when a task migrate, the remaining tasks on the CPU will increase
> > >> their utilization and fill space left by the task that has just
> > >> migrated. So the CPU's utilization will decrease when a task migrates
> > >> (and as a result the OPP) but then its utilization will increase with
> > >> remaining tasks running more time as well as the OPP
> > >>
> > >> So you need to make the difference between this 2 cases: Is a CPU
> > >> overloaded or not. You can't really rely on the utilization to detect
> > >> that but you could take advantage of the load which take into account
> > >> the waiting time of tasks
> > >
> > > I'm confused. What two cases? You only list the overloaded case, but he
> > 
> > overloaded vs not overloaded use case.
> > For the not overloaded case, it makes sense to immediately update to
> > OPP to be aligned with the new utilization of the CPU even if it was
> > not idle in the past couple of ticks
> 
> Yes, if the OPP (or P-state if you will) can be changed immediately.  If it can't,
> conditions may change by the time we actually update it and in that case It'd
> be better to wait and see IMO.
> 
> In any case, the theory about migrating tasks made sense to me, so below is
> what I tested.  It works, and besides it has a nice feature that I don't need
> to fetch for the timekeeping data. :-)
> 
> I only wonder if we want to do this or only prevent the frequency from
> decreasing in the overloaded case?
> 
> ---
>  kernel/sched/cpufreq_schedutil.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -61,6 +61,7 @@ struct sugov_cpu {
>  	unsigned long util;
>  	unsigned long max;
>  	unsigned int flags;
> +	bool overload;
>  };
>  
>  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> @@ -207,7 +208,7 @@ static void sugov_update_single(struct u
>  	if (!sugov_should_update_freq(sg_policy, time))
>  		return;
>  
> -	if (flags & SCHED_CPUFREQ_RT_DL) {
> +	if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) {
>  		next_f = policy->cpuinfo.max_freq;

Isn't this going to max OPP every time we have more than 1 task on
that CPU?

If so, it will not handle the case where we have two 10% tasks on that CPU.

The previous solution was better IMO, apart from using overloaded instead
of overutilized (which is not there yet) :-/

>  	} else {
>  		sugov_get_util(&util, &max);
> @@ -242,7 +243,7 @@ static unsigned int sugov_next_freq_shar
>  			j_sg_cpu->iowait_boost = 0;
>  			continue;
>  		}
> -		if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
> +		if ((j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL) || j_sg_cpu->overload)
>  			return policy->cpuinfo.max_freq;
>  
>  		j_util = j_sg_cpu->util;
> @@ -273,12 +274,13 @@ static void sugov_update_shared(struct u
>  	sg_cpu->util = util;
>  	sg_cpu->max = max;
>  	sg_cpu->flags = flags;
> +	sg_cpu->overload = this_rq()->rd->overload;
>  
>  	sugov_set_iowait_boost(sg_cpu, time, flags);
>  	sg_cpu->last_update = time;
>  
>  	if (sugov_should_update_freq(sg_policy, time)) {
> -		if (flags & SCHED_CPUFREQ_RT_DL)
> +		if ((flags & SCHED_CPUFREQ_RT_DL) || sg_cpu->overload)
>  			next_f = sg_policy->policy->cpuinfo.max_freq;
>  		else
>  			next_f = sugov_next_freq_shared(sg_cpu);
> 

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 14:38             ` Patrick Bellasi
@ 2017-03-21 14:46               ` Rafael J. Wysocki
  2017-03-21 14:50                 ` Rafael J. Wysocki
                                   ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-21 14:46 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Vincent Guittot, Peter Zijlstra, Linux PM, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Tuesday, March 21, 2017 02:38:42 PM Patrick Bellasi wrote:
> On 21-Mar 15:26, Rafael J. Wysocki wrote:
> > On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote:
> > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote:
> > > >> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > >
> > > >> > To work around this issue use the observation that, from the
> > > >> > schedutil governor's perspective, it does not make sense to decrease
> > > >> > the frequency of a CPU that doesn't enter idle and avoid decreasing
> > > >> > the frequency of busy CPUs.
> > > >>
> > > >> I don't fully agree with that statement.
> > > >> If there are 2 runnable tasks on CPU A and scheduler migrates the
> > > >> waiting task to another CPU B so CPU A is less loaded now, it makes
> > > >> sense to reduce the OPP. That's even for that purpose that we have
> > > >> decided to use scheduler metrics in cpufreq governor so we can adjust
> > > >> OPP immediately when tasks migrate.
> > > >> That being said, i probably know why you see such OPP switches in your
> > > >> use case. When we migrate a task, we also migrate/remove its
> > > >> utilization from CPU.
> > > >> If the CPU is not overloaded, it means that runnable tasks have all
> > > >> computation that they need and don't have any reason to use more when
> > > >> a task migrates to another CPU. so decreasing the OPP makes sense
> > > >> because the utilzation is decreasing
> > > >> If the CPU is overloaded, it means that runnable tasks have to share
> > > >> CPU time and probably don't have all computations that they would like
> > > >> so when a task migrate, the remaining tasks on the CPU will increase
> > > >> their utilization and fill space left by the task that has just
> > > >> migrated. So the CPU's utilization will decrease when a task migrates
> > > >> (and as a result the OPP) but then its utilization will increase with
> > > >> remaining tasks running more time as well as the OPP
> > > >>
> > > >> So you need to make the difference between this 2 cases: Is a CPU
> > > >> overloaded or not. You can't really rely on the utilization to detect
> > > >> that but you could take advantage of the load which take into account
> > > >> the waiting time of tasks
> > > >
> > > > I'm confused. What two cases? You only list the overloaded case, but he
> > > 
> > > overloaded vs not overloaded use case.
> > > For the not overloaded case, it makes sense to immediately update to
> > > OPP to be aligned with the new utilization of the CPU even if it was
> > > not idle in the past couple of ticks
> > 
> > Yes, if the OPP (or P-state if you will) can be changed immediately.  If it can't,
> > conditions may change by the time we actually update it and in that case It'd
> > be better to wait and see IMO.
> > 
> > In any case, the theory about migrating tasks made sense to me, so below is
> > what I tested.  It works, and besides it has a nice feature that I don't need
> > to fetch for the timekeeping data. :-)
> > 
> > I only wonder if we want to do this or only prevent the frequency from
> > decreasing in the overloaded case?
> > 
> > ---
> >  kernel/sched/cpufreq_schedutil.c |    8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> > ===================================================================
> > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> > +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> > @@ -61,6 +61,7 @@ struct sugov_cpu {
> >  	unsigned long util;
> >  	unsigned long max;
> >  	unsigned int flags;
> > +	bool overload;
> >  };
> >  
> >  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> > @@ -207,7 +208,7 @@ static void sugov_update_single(struct u
> >  	if (!sugov_should_update_freq(sg_policy, time))
> >  		return;
> >  
> > -	if (flags & SCHED_CPUFREQ_RT_DL) {
> > +	if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) {
> >  		next_f = policy->cpuinfo.max_freq;
> 
> Isn't this going to max OPP every time we have more than 1 task in
> that CPU?
> 
> In that case it will not fit the case: we have two 10% tasks on that CPU.

Good point.

> Previous solution was better IMO, apart from using overloaded instead
> of overutilized (which is not yet there) :-/

OK, so the one below works too.

---
 kernel/sched/cpufreq_schedutil.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -37,6 +37,7 @@ struct sugov_policy {
 	s64 freq_update_delay_ns;
 	unsigned int next_freq;
 	unsigned int cached_raw_freq;
+	bool overload;
 
 	/* The next fields are only needed if fast switch cannot be used. */
 	struct irq_work irq_work;
@@ -61,6 +62,7 @@ struct sugov_cpu {
 	unsigned long util;
 	unsigned long max;
 	unsigned int flags;
+	bool overload;
 };
 
 static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
@@ -93,6 +95,9 @@ static void sugov_update_commit(struct s
 {
 	struct cpufreq_policy *policy = sg_policy->policy;
 
+	if (sg_policy->overload && next_freq < sg_policy->next_freq)
+		next_freq = sg_policy->next_freq;
+
 	if (policy->fast_switch_enabled) {
 		if (sg_policy->next_freq == next_freq) {
 			trace_cpu_frequency(policy->cur, smp_processor_id());
@@ -207,6 +212,8 @@ static void sugov_update_single(struct u
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
+	sg_policy->overload = this_rq()->rd->overload;
+
 	if (flags & SCHED_CPUFREQ_RT_DL) {
 		next_f = policy->cpuinfo.max_freq;
 	} else {
@@ -225,6 +232,8 @@ static unsigned int sugov_next_freq_shar
 	unsigned long util = 0, max = 1;
 	unsigned int j;
 
+	sg_policy->overload = false;
+
 	for_each_cpu(j, policy->cpus) {
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
 		unsigned long j_util, j_max;
@@ -253,6 +262,7 @@ static unsigned int sugov_next_freq_shar
 		}
 
 		sugov_iowait_boost(j_sg_cpu, &util, &max);
+		sg_policy->overload = sg_policy->overload || j_sg_cpu->overload;
 	}
 
 	return get_next_freq(sg_policy, util, max);
@@ -273,6 +283,7 @@ static void sugov_update_shared(struct u
 	sg_cpu->util = util;
 	sg_cpu->max = max;
 	sg_cpu->flags = flags;
+	sg_cpu->overload = this_rq()->rd->overload;
 
 	sugov_set_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 14:46               ` Rafael J. Wysocki
@ 2017-03-21 14:50                 ` Rafael J. Wysocki
  2017-03-21 15:04                 ` Peter Zijlstra
  2017-03-21 15:08                 ` Patrick Bellasi
  2 siblings, 0 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-21 14:50 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Vincent Guittot, Peter Zijlstra, Linux PM, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Tuesday, March 21, 2017 03:46:07 PM Rafael J. Wysocki wrote:
> On Tuesday, March 21, 2017 02:38:42 PM Patrick Bellasi wrote:
> > On 21-Mar 15:26, Rafael J. Wysocki wrote:
> > > On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote:
> > > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote:
> > > > >> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > > >
> > > > >> > To work around this issue use the observation that, from the
> > > > >> > schedutil governor's perspective, it does not make sense to decrease
> > > > >> > the frequency of a CPU that doesn't enter idle and avoid decreasing
> > > > >> > the frequency of busy CPUs.
> > > > >>
> > > > >> I don't fully agree with that statement.
> > > > >> If there are 2 runnable tasks on CPU A and scheduler migrates the
> > > > >> waiting task to another CPU B so CPU A is less loaded now, it makes
> > > > >> sense to reduce the OPP. That's even for that purpose that we have
> > > > >> decided to use scheduler metrics in cpufreq governor so we can adjust
> > > > >> OPP immediately when tasks migrate.
> > > > >> That being said, i probably know why you see such OPP switches in your
> > > > >> use case. When we migrate a task, we also migrate/remove its
> > > > >> utilization from CPU.
> > > > >> If the CPU is not overloaded, it means that runnable tasks have all
> > > > >> computation that they need and don't have any reason to use more when
> > > > >> a task migrates to another CPU. so decreasing the OPP makes sense
> > > > >> because the utilzation is decreasing
> > > > >> If the CPU is overloaded, it means that runnable tasks have to share
> > > > >> CPU time and probably don't have all computations that they would like
> > > > >> so when a task migrate, the remaining tasks on the CPU will increase
> > > > >> their utilization and fill space left by the task that has just
> > > > >> migrated. So the CPU's utilization will decrease when a task migrates
> > > > >> (and as a result the OPP) but then its utilization will increase with
> > > > >> remaining tasks running more time as well as the OPP
> > > > >>
> > > > >> So you need to make the difference between this 2 cases: Is a CPU
> > > > >> overloaded or not. You can't really rely on the utilization to detect
> > > > >> that but you could take advantage of the load which take into account
> > > > >> the waiting time of tasks
> > > > >
> > > > > I'm confused. What two cases? You only list the overloaded case, but he
> > > > 
> > > > overloaded vs not overloaded use case.
> > > > For the not overloaded case, it makes sense to immediately update to
> > > > OPP to be aligned with the new utilization of the CPU even if it was
> > > > not idle in the past couple of ticks
> > > 
> > > Yes, if the OPP (or P-state if you will) can be changed immediately.  If it can't,
> > > conditions may change by the time we actually update it and in that case It'd
> > > be better to wait and see IMO.
> > > 
> > > In any case, the theory about migrating tasks made sense to me, so below is
> > > what I tested.  It works, and besides it has a nice feature that I don't need
> > > to fetch for the timekeeping data. :-)
> > > 
> > > I only wonder if we want to do this or only prevent the frequency from
> > > decreasing in the overloaded case?
> > > 
> > > ---
> > >  kernel/sched/cpufreq_schedutil.c |    8 +++++---
> > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > > 
> > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> > > +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> > > @@ -61,6 +61,7 @@ struct sugov_cpu {
> > >  	unsigned long util;
> > >  	unsigned long max;
> > >  	unsigned int flags;
> > > +	bool overload;
> > >  };
> > >  
> > >  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> > > @@ -207,7 +208,7 @@ static void sugov_update_single(struct u
> > >  	if (!sugov_should_update_freq(sg_policy, time))
> > >  		return;
> > >  
> > > -	if (flags & SCHED_CPUFREQ_RT_DL) {
> > > +	if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) {
> > >  		next_f = policy->cpuinfo.max_freq;
> > 
> > Isn't this going to max OPP every time we have more than 1 task in
> > that CPU?
> > 
> > In that case it will not fit the case: we have two 10% tasks on that CPU.
> 
> Good point.
> 
> > Previous solution was better IMO, apart from using overloaded instead
> > of overutilized (which is not yet there) :-/
> 
> OK, so the one below works too.

Admittedly, we could check the idle condition and the overload flag at the same
time, though.

Let me try that too.
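
For illustration, roughly something like this (untested sketch; it assumes a
saved_idle_calls field in struct sugov_cpu and a tick_nohz_get_idle_calls()
helper, neither of which exists yet):

	static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
	{
		unsigned long idle_calls = tick_nohz_get_idle_calls();
		bool not_idle = idle_calls == sg_cpu->saved_idle_calls;

		sg_cpu->saved_idle_calls = idle_calls;

		/* Busy only if the CPU has not gone idle since the last
		 * update and the root domain is marked overloaded. */
		return not_idle && this_rq()->rd->overload;
	}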

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
       [not found]             ` <CAKfTPtALorn7HNpz4LOfWWSc3u+9y5iHB5byzfTHGQXDA+tVJQ@mail.gmail.com>
@ 2017-03-21 14:58               ` Peter Zijlstra
  2017-03-21 17:00                 ` Vincent Guittot
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2017-03-21 14:58 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Tue, Mar 21, 2017 at 03:16:19PM +0100, Vincent Guittot wrote:
> On 21 March 2017 at 15:03, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote:
> > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > For the not overloaded case, it makes sense to immediately update to
> > > OPP to be aligned with the new utilization of the CPU even if it was
> > > not idle in the past couple of ticks
> >
> > Yeah, but we cannot know. Also, who cares?
> >
> 
> embedded system that doesn't want to stay at higest OPP if significant part
> of the utilzation has moved away as an example
> AFAICT, schedutil tries to select the best OPP according to the current
> utilization of the CPU so if the utilization decreases, the OPP should also
> decrease

Sure I get that; but given the lack of crystal ball instructions we
cannot know if this is the case or not.

And if we really dropped below 100% utilization, we should hit idle
fairly soon.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 14:26           ` Rafael J. Wysocki
  2017-03-21 14:38             ` Patrick Bellasi
@ 2017-03-21 15:02             ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2017-03-21 15:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Vincent Guittot, Linux PM, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Tue, Mar 21, 2017 at 03:26:06PM +0100, Rafael J. Wysocki wrote:
> +	if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) {
>  		next_f = policy->cpuinfo.max_freq;

So this I think is wrong; rd->overload is set if _any_ of the CPUs in the
root domain is overloaded. And given the root domain is typically the
_entire_ machine, this would have a tendency to run at max_freq far too
often.
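
For reference, the flag gets set from the generic nr_running accounting,
roughly like this (paraphrasing kernel/sched/sched.h from memory, SMP
ifdefs and the nohz tick dependency update elided):

	static inline void add_nr_running(struct rq *rq, unsigned count)
	{
		unsigned prev_nr = rq->nr_running;

		rq->nr_running = prev_nr + count;

		/* Any rq going from fewer than 2 to 2 or more runnable
		 * tasks marks the whole root domain as overloaded. */
		if (prev_nr < 2 && rq->nr_running >= 2) {
			if (!rq->rd->overload)
				rq->rd->overload = true;
		}
	}

So a single CPU with two runnable tasks anywhere in the root domain is
enough to set it for everybody.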

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 14:46               ` Rafael J. Wysocki
  2017-03-21 14:50                 ` Rafael J. Wysocki
@ 2017-03-21 15:04                 ` Peter Zijlstra
  2017-03-21 15:18                   ` Rafael J. Wysocki
  2017-03-21 15:08                 ` Patrick Bellasi
  2 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2017-03-21 15:04 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Patrick Bellasi, Vincent Guittot, Linux PM, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Tue, Mar 21, 2017 at 03:46:07PM +0100, Rafael J. Wysocki wrote:
> @@ -207,6 +212,8 @@ static void sugov_update_single(struct u
>  	if (!sugov_should_update_freq(sg_policy, time))
>  		return;
>  
> +	sg_policy->overload = this_rq()->rd->overload;
> +

Same problem as before; rd->overload is set if _any_ CPU in the root
domain has more than 1 runnable task at a random point in history (when
we ran the load balance tick -- and since that is the same tick used for
timers, there's a bias to over-account there).
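
And AFAICS the only place it gets cleared again is the load-balance path,
roughly (paraphrasing kernel/sched/fair.c, details elided):

	/* in update_sg_lb_stats(): */
	if (rq->nr_running > 1)
		*overload = true;

	/* in update_sd_lb_stats(), only when called for the root domain: */
	if (!env->sd->parent) {
		/* update overload indicator if we are at root domain */
		if (env->dst_rq->rd->overload != overload)
			env->dst_rq->rd->overload = overload;
	}

So between load-balance passes the flag can easily be stale.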

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 14:46               ` Rafael J. Wysocki
  2017-03-21 14:50                 ` Rafael J. Wysocki
  2017-03-21 15:04                 ` Peter Zijlstra
@ 2017-03-21 15:08                 ` Patrick Bellasi
  2017-03-21 15:18                   ` Peter Zijlstra
  2 siblings, 1 reply; 71+ messages in thread
From: Patrick Bellasi @ 2017-03-21 15:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Vincent Guittot, Peter Zijlstra, Linux PM, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 21-Mar 15:46, Rafael J. Wysocki wrote:
> On Tuesday, March 21, 2017 02:38:42 PM Patrick Bellasi wrote:
> > On 21-Mar 15:26, Rafael J. Wysocki wrote:
> > > On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote:
> > > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote:
> > > > >> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > > >
> > > > >> > To work around this issue use the observation that, from the
> > > > >> > schedutil governor's perspective, it does not make sense to decrease
> > > > >> > the frequency of a CPU that doesn't enter idle and avoid decreasing
> > > > >> > the frequency of busy CPUs.
> > > > >>
> > > > >> I don't fully agree with that statement.
> > > > >> If there are 2 runnable tasks on CPU A and scheduler migrates the
> > > > >> waiting task to another CPU B so CPU A is less loaded now, it makes
> > > > >> sense to reduce the OPP. That's even for that purpose that we have
> > > > >> decided to use scheduler metrics in cpufreq governor so we can adjust
> > > > >> OPP immediately when tasks migrate.
> > > > >> That being said, i probably know why you see such OPP switches in your
> > > > >> use case. When we migrate a task, we also migrate/remove its
> > > > >> utilization from CPU.
> > > > >> If the CPU is not overloaded, it means that runnable tasks have all
> > > > >> computation that they need and don't have any reason to use more when
> > > > >> a task migrates to another CPU. so decreasing the OPP makes sense
> > > > >> because the utilzation is decreasing
> > > > >> If the CPU is overloaded, it means that runnable tasks have to share
> > > > >> CPU time and probably don't have all computations that they would like
> > > > >> so when a task migrate, the remaining tasks on the CPU will increase
> > > > >> their utilization and fill space left by the task that has just
> > > > >> migrated. So the CPU's utilization will decrease when a task migrates
> > > > >> (and as a result the OPP) but then its utilization will increase with
> > > > >> remaining tasks running more time as well as the OPP
> > > > >>
> > > > >> So you need to make the difference between this 2 cases: Is a CPU
> > > > >> overloaded or not. You can't really rely on the utilization to detect
> > > > >> that but you could take advantage of the load which take into account
> > > > >> the waiting time of tasks
> > > > >
> > > > > I'm confused. What two cases? You only list the overloaded case, but he
> > > > 
> > > > overloaded vs not overloaded use case.
> > > > For the not overloaded case, it makes sense to immediately update to
> > > > OPP to be aligned with the new utilization of the CPU even if it was
> > > > not idle in the past couple of ticks
> > > 
> > > Yes, if the OPP (or P-state if you will) can be changed immediately.  If it can't,
> > > conditions may change by the time we actually update it and in that case It'd
> > > be better to wait and see IMO.
> > > 
> > > In any case, the theory about migrating tasks made sense to me, so below is
> > > what I tested.  It works, and besides it has a nice feature that I don't need
> > > to fetch for the timekeeping data. :-)
> > > 
> > > I only wonder if we want to do this or only prevent the frequency from
> > > decreasing in the overloaded case?
> > > 
> > > ---
> > >  kernel/sched/cpufreq_schedutil.c |    8 +++++---
> > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > > 
> > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> > > +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> > > @@ -61,6 +61,7 @@ struct sugov_cpu {
> > >  	unsigned long util;
> > >  	unsigned long max;
> > >  	unsigned int flags;
> > > +	bool overload;
> > >  };
> > >  
> > >  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> > > @@ -207,7 +208,7 @@ static void sugov_update_single(struct u
> > >  	if (!sugov_should_update_freq(sg_policy, time))
> > >  		return;
> > >  
> > > -	if (flags & SCHED_CPUFREQ_RT_DL) {
> > > +	if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) {
> > >  		next_f = policy->cpuinfo.max_freq;
> > 
> > Isn't this going to max OPP every time we have more than 1 task in
> > that CPU?
> > 
> > In that case it will not fit the case: we have two 10% tasks on that CPU.
> 
> Good point.
> 
> > Previous solution was better IMO, apart from using overloaded instead
> > of overutilized (which is not yet there) :-/
> 
> OK, so the one below works too.

Better... just one minor comment.


> ---
>  kernel/sched/cpufreq_schedutil.c |   11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -37,6 +37,7 @@ struct sugov_policy {
>  	s64 freq_update_delay_ns;
>  	unsigned int next_freq;
>  	unsigned int cached_raw_freq;
> +	bool overload;

Can we avoid using "overloaded" in favor of a more generic,
schedutil-specific name? Mainly because in the future we would
probably like to switch from "overloaded" to "overutilized".

What about something like "busy"?

>  
>  	/* The next fields are only needed if fast switch cannot be used. */
>  	struct irq_work irq_work;
> @@ -61,6 +62,7 @@ struct sugov_cpu {
>  	unsigned long util;
>  	unsigned long max;
>  	unsigned int flags;
> +	bool overload;
>  };
>  
>  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> @@ -93,6 +95,9 @@ static void sugov_update_commit(struct s
>  {
>  	struct cpufreq_policy *policy = sg_policy->policy;
>  
> +	if (sg_policy->overload && next_freq < sg_policy->next_freq)
> +		next_freq = sg_policy->next_freq;
> +
>  	if (policy->fast_switch_enabled) {
>  		if (sg_policy->next_freq == next_freq) {
>  			trace_cpu_frequency(policy->cur, smp_processor_id());
> @@ -207,6 +212,8 @@ static void sugov_update_single(struct u
>  	if (!sugov_should_update_freq(sg_policy, time))
>  		return;
>  
> +	sg_policy->overload = this_rq()->rd->overload;

And then we can move this bit into an inline function, something like e.g.:

   static inline bool sugov_this_cpu_is_busy(void)
   {
           return this_rq()->rd->overload;
   }

Where in the future we can easily switch from using "overloaded" to
using utilization.

> +
>  	if (flags & SCHED_CPUFREQ_RT_DL) {
>  		next_f = policy->cpuinfo.max_freq;
>  	} else {
> @@ -225,6 +232,8 @@ static unsigned int sugov_next_freq_shar
>  	unsigned long util = 0, max = 1;
>  	unsigned int j;
>  
> +	sg_policy->overload = false;
> +
>  	for_each_cpu(j, policy->cpus) {
>  		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
>  		unsigned long j_util, j_max;
> @@ -253,6 +262,7 @@ static unsigned int sugov_next_freq_shar
>  		}
>  
>  		sugov_iowait_boost(j_sg_cpu, &util, &max);
> +		sg_policy->overload = sg_policy->overload || sg_cpu->overload;
>  	}
>  
>  	return get_next_freq(sg_policy, util, max);
> @@ -273,6 +283,7 @@ static void sugov_update_shared(struct u
>  	sg_cpu->util = util;
>  	sg_cpu->max = max;
>  	sg_cpu->flags = flags;
> +	sg_cpu->overload = this_rq()->rd->overload;
>  
>  	sugov_set_iowait_boost(sg_cpu, time, flags);
>  	sg_cpu->last_update = time;
> 

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 15:08                 ` Patrick Bellasi
@ 2017-03-21 15:18                   ` Peter Zijlstra
  2017-03-21 19:28                     ` Patrick Bellasi
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2017-03-21 15:18 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Rafael J. Wysocki, Vincent Guittot, Linux PM, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar


Seriously people, trim your replies.

On Tue, Mar 21, 2017 at 03:08:20PM +0000, Patrick Bellasi wrote:

> And than we can move this bit into an inline function, something like e.g.:
> 
>    static inline bool sugov_this_cpu_is_busy()
>    {
>            return this_rq()->rd->overloaded
>    }

No, that's just entirely and utterly wrong. It being in rd means it's
very much not about _this_ CPU in any way.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 15:04                 ` Peter Zijlstra
@ 2017-03-21 15:18                   ` Rafael J. Wysocki
  2017-03-21 17:00                     ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-21 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Patrick Bellasi, Vincent Guittot, Linux PM, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Tuesday, March 21, 2017 04:04:03 PM Peter Zijlstra wrote:
> On Tue, Mar 21, 2017 at 03:46:07PM +0100, Rafael J. Wysocki wrote:
> > @@ -207,6 +212,8 @@ static void sugov_update_single(struct u
> >  	if (!sugov_should_update_freq(sg_policy, time))
> >  		return;
> >  
> > +	sg_policy->overload = this_rq()->rd->overload;
> > +
> 
> Same problem as before; rd->overload is set if _any_ CPU in the root
> domain has more than 1 runnable task at a random point in history (when
> we ran the load balance tick -- and since that is the same tick used for
> timers, there's a bias to over-account there).

OK

What about the one below then?

It checks both the idle-calls count and the overload flag, and only then does
it prevent the frequency from being decreased.

It is sufficient for the case at hand.

I guess if rd->overload is not set, this means that none of the CPUs is
oversubscribed and we just saturate the capacity in a one-task-per-CPU kind
of fashion.  Right?

---
 include/linux/tick.h             |    1 +
 kernel/sched/cpufreq_schedutil.c |   27 +++++++++++++++++++++++++++
 kernel/time/tick-sched.c         |   12 ++++++++++++
 3 files changed, 40 insertions(+)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -37,6 +37,7 @@ struct sugov_policy {
 	s64 freq_update_delay_ns;
 	unsigned int next_freq;
 	unsigned int cached_raw_freq;
+	bool busy;
 
 	/* The next fields are only needed if fast switch cannot be used. */
 	struct irq_work irq_work;
@@ -56,11 +57,15 @@ struct sugov_cpu {
 	unsigned long iowait_boost;
 	unsigned long iowait_boost_max;
 	u64 last_update;
+#ifdef CONFIG_NO_HZ_COMMON
+	unsigned long saved_idle_calls;
+#endif
 
 	/* The fields below are only needed when sharing a policy. */
 	unsigned long util;
 	unsigned long max;
 	unsigned int flags;
+	bool busy;
 };
 
 static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
@@ -93,6 +98,9 @@ static void sugov_update_commit(struct s
 {
 	struct cpufreq_policy *policy = sg_policy->policy;
 
+	if (sg_policy->busy && next_freq < sg_policy->next_freq)
+		next_freq = sg_policy->next_freq;
+
 	if (policy->fast_switch_enabled) {
 		if (sg_policy->next_freq == next_freq) {
 			trace_cpu_frequency(policy->cur, smp_processor_id());
@@ -192,6 +200,19 @@ static void sugov_iowait_boost(struct su
 	sg_cpu->iowait_boost >>= 1;
 }
 
+#ifdef CONFIG_NO_HZ_COMMON
+static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+{
+	unsigned long idle_calls = tick_nohz_get_idle_calls();
+	bool not_idle = idle_calls == sg_cpu->saved_idle_calls;
+
+	sg_cpu->saved_idle_calls = idle_calls;
+	return not_idle && this_rq()->rd->overload;
+}
+#else
+static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+#endif /* CONFIG_NO_HZ_COMMON */
+
 static void sugov_update_single(struct update_util_data *hook, u64 time,
 				unsigned int flags)
 {
@@ -207,6 +228,8 @@ static void sugov_update_single(struct u
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
+	sg_policy->busy = sugov_cpu_is_busy(sg_cpu);
+
 	if (flags & SCHED_CPUFREQ_RT_DL) {
 		next_f = policy->cpuinfo.max_freq;
 	} else {
@@ -225,6 +248,8 @@ static unsigned int sugov_next_freq_shar
 	unsigned long util = 0, max = 1;
 	unsigned int j;
 
+	sg_policy->busy = false;
+
 	for_each_cpu(j, policy->cpus) {
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
 		unsigned long j_util, j_max;
@@ -253,6 +278,7 @@ static unsigned int sugov_next_freq_shar
 		}
 
 		sugov_iowait_boost(j_sg_cpu, &util, &max);
+		sg_policy->busy = sg_policy->busy || j_sg_cpu->busy;
 	}
 
 	return get_next_freq(sg_policy, util, max);
@@ -273,6 +299,7 @@ static void sugov_update_shared(struct u
 	sg_cpu->util = util;
 	sg_cpu->max = max;
 	sg_cpu->flags = flags;
+	sg_cpu->busy = sugov_cpu_is_busy(sg_cpu);
 
 	sugov_set_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
+extern unsigned long tick_nohz_get_idle_calls(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 #else /* !CONFIG_NO_HZ_COMMON */
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void)
 	return ts->sleep_length;
 }
 
+/**
+ * tick_nohz_get_idle_calls - return the current idle calls counter value
+ *
+ * Called from the schedutil frequency scaling governor in scheduler context.
+ */
+unsigned long tick_nohz_get_idle_calls(void)
+{
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	return ts->idle_calls;
+}
+
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 15:18                   ` Rafael J. Wysocki
@ 2017-03-21 17:00                     ` Peter Zijlstra
  2017-03-21 17:17                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2017-03-21 17:00 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Patrick Bellasi, Vincent Guittot, Linux PM, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Tue, Mar 21, 2017 at 04:18:52PM +0100, Rafael J. Wysocki wrote:
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> +	unsigned long idle_calls = tick_nohz_get_idle_calls();
> +	bool not_idle = idle_calls == sg_cpu->saved_idle_calls;
> +
> +	sg_cpu->saved_idle_calls = idle_calls;
> +	return not_idle && this_rq()->rd->overload;
> +}

So I really don't understand the rd->overload thing. What is it supposed
to do here?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 14:58               ` Peter Zijlstra
@ 2017-03-21 17:00                 ` Vincent Guittot
  2017-03-21 17:01                   ` Vincent Guittot
  0 siblings, 1 reply; 71+ messages in thread
From: Vincent Guittot @ 2017-03-21 17:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 21 March 2017 at 15:58, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 21, 2017 at 03:16:19PM +0100, Vincent Guittot wrote:
> > On 21 March 2017 at 15:03, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote:
> > > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > > For the not overloaded case, it makes sense to immediately update to
> > > > OPP to be aligned with the new utilization of the CPU even if it was
> > > > not idle in the past couple of ticks
> > >
> > > Yeah, but we cannot know. Also, who cares?
> > >
> >
> > embedded system that doesn't want to stay at higest OPP if significant part
> > of the utilzation has moved away as an example
> > AFAICT, schedutil tries to select the best OPP according to the current
> > utilization of the CPU so if the utilization decreases, the OPP should also
> > decrease
>
> Sure I get that; but given the lack of crystal ball instructions we
> cannot know if this is the case or not.

cfs_rq->avg.load_avg accounts for the waiting time of the CPU (in addition
to the weight of the tasks), so I was wondering if we can't use it to detect
whether we are in the overloaded case or not, even if utilization is not at
max capacity because we have just migrated a task (and its utilization)
out.
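
Something like this maybe (completely untested, just to illustrate the idea;
the comparison with capacity_orig_of() assumes default-weight tasks so that
load roughly tracks demand):

	static bool sugov_cpu_looks_overloaded(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		/*
		 * Unlike util_avg, load_avg keeps the contribution of tasks
		 * that are runnable but waiting on the rq, so it stays high
		 * when the CPU is oversubscribed even right after a task
		 * (and its utilization) has been migrated away.
		 */
		return rq->cfs.avg.load_avg > capacity_orig_of(cpu);
	}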



>
> And if we really dropped below 100% utilization, we should hit idle
> fairly soon.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 17:00                 ` Vincent Guittot
@ 2017-03-21 17:01                   ` Vincent Guittot
  0 siblings, 0 replies; 71+ messages in thread
From: Vincent Guittot @ 2017-03-21 17:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 21 March 2017 at 18:00, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 21 March 2017 at 15:58, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Tue, Mar 21, 2017 at 03:16:19PM +0100, Vincent Guittot wrote:
>> > On 21 March 2017 at 15:03, Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > > On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote:
>> > > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
>> > >
>> > > > For the not overloaded case, it makes sense to immediately update to
>> > > > OPP to be aligned with the new utilization of the CPU even if it was
>> > > > not idle in the past couple of ticks
>> > >
>> > > Yeah, but we cannot know. Also, who cares?
>> > >
>> >
>> > embedded system that doesn't want to stay at higest OPP if significant part
>> > of the utilzation has moved away as an example
>> > AFAICT, schedutil tries to select the best OPP according to the current
>> > utilization of the CPU so if the utilization decreases, the OPP should also
>> > decrease
>>
>> Sure I get that; but given the lack of crystal ball instructions we
>> cannot know if this is the case or not.
>
> cfs_rq->avg.load_avg account the waiting time of CPU (in addition to

Sorry, I wanted to say the waiting time of tasks on the CPU.

> the weight of task) so i was wondering if we can't use it to detect if
> we are in the overloaded  case or not even if utilization is not mac
> capacity because we have just migrated a task (and its utilization)
> out
>
>
>
>>
>> And if we really dropped below 100% utilization, we should hit idle
>> fairly soon.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 17:00                     ` Peter Zijlstra
@ 2017-03-21 17:17                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-21 17:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Patrick Bellasi, Vincent Guittot, Linux PM, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On Tuesday, March 21, 2017 06:00:17 PM Peter Zijlstra wrote:
> On Tue, Mar 21, 2017 at 04:18:52PM +0100, Rafael J. Wysocki wrote:
> > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> > +{
> > +	unsigned long idle_calls = tick_nohz_get_idle_calls();
> > +	bool not_idle = idle_calls == sg_cpu->saved_idle_calls;
> > +
> > +	sg_cpu->saved_idle_calls = idle_calls;
> > +	return not_idle && this_rq()->rd->overload;
> > +}
> 
> So I really don't understand the rd->overload thing. What is it supposed
> to do here?

The idea was that if the CPU was running one task that saturated its capacity
and that task was then migrated away, the frequency should still be reduced.

And since rd->overload covers all CPUs (in general) it kind of tells us whether
or not there are other tasks to replace the migrated one any time soon.

However, if there are no tasks to replace the migrated one, the CPU will go
idle quickly (as there are no tasks to run on it), in which case keeping the
current frequency on it shouldn't matter.

In all of the other cases keeping the current frequency is the right thing to
do IMO.

So, it looks like checking this_rq()->rd->overload doesn't really help after all. :-)

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs
  2017-03-21 15:18                   ` Peter Zijlstra
@ 2017-03-21 19:28                     ` Patrick Bellasi
  0 siblings, 0 replies; 71+ messages in thread
From: Patrick Bellasi @ 2017-03-21 19:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Vincent Guittot, Linux PM, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar

On 21-Mar 16:18, Peter Zijlstra wrote:
> On Tue, Mar 21, 2017 at 03:08:20PM +0000, Patrick Bellasi wrote:
> 
> > And than we can move this bit into an inline function, something like e.g.:
> > 
> >    static inline bool sugov_this_cpu_is_busy()
> >    {
> >            return this_rq()->rd->overloaded
> >    }
> 
> No, that's just entirely and utterly wrong. It being in rd means its
> very much not about _this_ CPU in any way.

You're right (of course), we cannot really use "this_" in the name of
a function with such code.

The suggestion here was at least to factor out whatever code we want
to use to check if the current CPU has to be subject to a down-scaling
constraint.

However, using rd->overload is not the best option, for the many reasons
you explained in your previous comment. Thus, we should probably stay
with the idle time tracking solution initially proposed by Rafael.

Sorry for the noise :-(

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-20 21:46   ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki
                       ` (2 preceding siblings ...)
  2017-03-21 11:50     ` Patrick Bellasi
@ 2017-03-21 23:08     ` Rafael J. Wysocki
  2017-03-22  9:26       ` Peter Zijlstra
                         ` (5 more replies)
  3 siblings, 6 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-21 23:08 UTC (permalink / raw)
  To: Linux PM, Peter Zijlstra
  Cc: LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register.  Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again.  That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task.  If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away.  That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.

This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway.  On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.

On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.

To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 include/linux/tick.h             |    1 +
 kernel/sched/cpufreq_schedutil.c |   27 +++++++++++++++++++++++++++
 kernel/time/tick-sched.c         |   12 ++++++++++++
 3 files changed, 40 insertions(+)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -61,6 +61,11 @@ struct sugov_cpu {
 	unsigned long util;
 	unsigned long max;
 	unsigned int flags;
+
+	/* The field below is for single-CPU policies only. */
+#ifdef CONFIG_NO_HZ_COMMON
+	unsigned long saved_idle_calls;
+#endif
 };
 
 static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
@@ -192,6 +197,19 @@ static void sugov_iowait_boost(struct su
 	sg_cpu->iowait_boost >>= 1;
 }
 
+#ifdef CONFIG_NO_HZ_COMMON
+static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+{
+	unsigned long idle_calls = tick_nohz_get_idle_calls();
+	bool ret = idle_calls == sg_cpu->saved_idle_calls;
+
+	sg_cpu->saved_idle_calls = idle_calls;
+	return ret;
+}
+#else
+static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+#endif /* CONFIG_NO_HZ_COMMON */
+
 static void sugov_update_single(struct update_util_data *hook, u64 time,
 				unsigned int flags)
 {
@@ -200,6 +218,7 @@ static void sugov_update_single(struct u
 	struct cpufreq_policy *policy = sg_policy->policy;
 	unsigned long util, max;
 	unsigned int next_f;
+	bool busy;
 
 	sugov_set_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
@@ -207,12 +226,20 @@ static void sugov_update_single(struct u
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
+	busy = sugov_cpu_is_busy(sg_cpu);
+
 	if (flags & SCHED_CPUFREQ_RT_DL) {
 		next_f = policy->cpuinfo.max_freq;
 	} else {
 		sugov_get_util(&util, &max);
 		sugov_iowait_boost(sg_cpu, &util, &max);
 		next_f = get_next_freq(sg_policy, util, max);
+		/*
+		 * Do not reduce the frequency if the CPU has not been idle
+		 * recently, as the reduction is likely to be premature then.
+		 */
+		if (busy && next_f < sg_policy->next_freq)
+			next_f = sg_policy->next_freq;
 	}
 	sugov_update_commit(sg_policy, time, next_f);
 }
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
+extern unsigned long tick_nohz_get_idle_calls(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 #else /* !CONFIG_NO_HZ_COMMON */
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void)
 	return ts->sleep_length;
 }
 
+/**
+ * tick_nohz_get_idle_calls - return the current idle calls counter value
+ *
+ * Called from the schedutil frequency scaling governor in scheduler context.
+ */
+unsigned long tick_nohz_get_idle_calls(void)
+{
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	return ts->idle_calls;
+}
+
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-21 23:08     ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki
@ 2017-03-22  9:26       ` Peter Zijlstra
  2017-03-22  9:54       ` Viresh Kumar
                         ` (4 subsequent siblings)
  5 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2017-03-22  9:26 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar, Thomas Gleixner

On Wed, Mar 22, 2017 at 12:08:50AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
> 
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values
> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway.  On
> systems with one CPU per policy, however, it may affect performance
> adversely and even lead to increased energy consumption in some cases.
> 
> On those systems it may be addressed by taking another utilization
> metric into consideration, like whether or not the CPU whose
> frequency is about to be reduced has been idle recently, because if
> that's not the case, the CPU is likely to be busy in the near future
> and its frequency should not be reduced.
> 
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before its frequency is about to be reduced.
> If the counter has not changed since the previous iteration of the
> governor computations for that CPU, the CPU has been busy for all
> that time and its frequency should not be decreased, so if the new
> frequency would be lower than the one set previously, the governor
> will skip the frequency update.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Right; this makes sense to me. Of course it would be good to have some
more measurements on this, but in principle:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-21 23:08     ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki
  2017-03-22  9:26       ` Peter Zijlstra
@ 2017-03-22  9:54       ` Viresh Kumar
  2017-03-23  1:04       ` Joel Fernandes
                         ` (3 subsequent siblings)
  5 siblings, 0 replies; 71+ messages in thread
From: Viresh Kumar @ 2017-03-22  9:54 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar, Thomas Gleixner

On 22-03-17, 00:08, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
> 
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values
> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway.  On
> systems with one CPU per policy, however, it may affect performance
> adversely and even lead to increased energy consumption in some cases.
> 
> On those systems it may be addressed by taking another utilization
> metric into consideration, like whether or not the CPU whose
> frequency is about to be reduced has been idle recently, because if
> that's not the case, the CPU is likely to be busy in the near future
> and its frequency should not be reduced.
> 
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before its frequency is about to be reduced.
> If the counter has not changed since the previous iteration of the
> governor computations for that CPU, the CPU has been busy for all
> that time and its frequency should not be decreased, so if the new
> frequency would be lower than the one set previously, the governor
> will skip the frequency update.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  include/linux/tick.h             |    1 +
>  kernel/sched/cpufreq_schedutil.c |   27 +++++++++++++++++++++++++++
>  kernel/time/tick-sched.c         |   12 ++++++++++++
>  3 files changed, 40 insertions(+)

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-20 12:34       ` Patrick Bellasi
@ 2017-03-22 23:56         ` Joel Fernandes
  2017-03-23 22:08           ` Vincent Guittot
  0 siblings, 1 reply; 71+ messages in thread
From: Joel Fernandes @ 2017-03-22 23:56 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Rafael J. Wysocki, Viresh Kumar, Vincent Guittot, Linux PM, LKML,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli,
	Morten Rasmussen, Ingo Molnar

On Mon, Mar 20, 2017 at 5:34 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 20-Mar 09:26, Vincent Guittot wrote:
>> On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> >>
>> >> The PELT metric used by the schedutil governor underestimates the
>> >> CPU utilization in some cases.  The reason for that may be time spent
>> >> in interrupt handlers and similar which is not accounted for by PELT.
>>
>> Are you sure of the root cause  described above (time stolen by irq
>> handler) or is it just a hypotheses ? That would be good to be sure of
>> the root cause
>> Furthermore, IIRC the time spent in irq context is also accounted as
>> run time for the running cfs task but not RT and deadline task running
>> time
>
> As long as the IRQ processing does not generate a context switch,
> which is happening (eventually) if the top half schedule some deferred
> work to be executed by a bottom half.
>
> Thus, me too I would say that all the top half time is accounted in
> PELT, since the current task is still RUNNABLE/RUNNING.

Sorry if I'm missing something but doesn't this depend on whether you
have CONFIG_IRQ_TIME_ACCOUNTING enabled?

__update_load_avg() uses rq->clock_task for its deltas, which I think
doesn't include IRQ time when that config option is enabled. So it
should be quite possible for time spent in IRQs to reduce the PELT
signal, right?
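
For reference, here is roughly where that happens; a simplified sketch
of the v4.10-era update_rq_clock_task() in kernel/sched/core.c, with
the steal time handling and other details elided:

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
	s64 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

	/*
	 * Time spent in hard/soft IRQ context is clamped and taken out of
	 * the delta, so rq->clock_task (which the PELT deltas are based on)
	 * does not advance for it.
	 */
	if (irq_delta > delta)
		irq_delta = delta;

	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;
#endif
	/*
	 * Without CONFIG_IRQ_TIME_ACCOUNTING the IRQ time stays in the delta
	 * and is charged to whatever task happens to be current.
	 */
	rq->clock_task += delta;
}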

>
>> So I'm not really aligned with the description of your problem: PELT
>> metric underestimates the load of the CPU.  The PELT is just about
>> tracking CFS task utilization but not whole CPU utilization and
>> according to your description of the problem (time stolen by irq),
>> your problem doesn't come from an underestimation of CFS task but from
>> time spent in something else but not accounted in the value used by
>> schedutil
>
> Quite likely. Indeed, it can really be that the CFS task is preempted
> because of some RT activity generated by the IRQ handler.
>
> More in general, I've also noticed many suboptimal freq switches when
> RT tasks interleave with CFS ones, because of:
> - relatively long down _and up_ throttling times
> - the way schedutil's flags are tracked and updated
> - the callsites from where we call schedutil updates
>
> For example it can really happen that we are running at the highest
> OPP because of some RT activity. Then we switch back to a relatively
> low utilization CFS workload and then:
> 1. a tick happens which produces a frequency drop

Any idea why this frequency drop would happen? Say a running CFS task
gets preempted by an RT task; the PELT signal shouldn't drop for the
duration the CFS task is preempted, because the task is still runnable,
so once the CFS task gets the CPU back, schedutil should still maintain
the capacity, right?

Regards,
Joel

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-21 23:08     ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki
  2017-03-22  9:26       ` Peter Zijlstra
  2017-03-22  9:54       ` Viresh Kumar
@ 2017-03-23  1:04       ` Joel Fernandes
  2017-03-23 19:26       ` Sai Gurrappadi
                         ` (2 subsequent siblings)
  5 siblings, 0 replies; 71+ messages in thread
From: Joel Fernandes @ 2017-03-23  1:04 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi,
	Morten Rasmussen, Ingo Molnar, Thomas Gleixner

On Tue, Mar 21, 2017 at 4:08 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
>
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
>
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
>
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values
> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway.  On
> systems with one CPU per policy, however, it may affect performance
> adversely and even lead to increased energy consumption in some cases.
>
> On those systems it may be addressed by taking another utilization
> metric into consideration, like whether or not the CPU whose
> frequency is about to be reduced has been idle recently, because if
> that's not the case, the CPU is likely to be busy in the near future
> and its frequency should not be reduced.
>
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before its frequency is about to be reduced.
> If the counter has not changed since the previous iteration of the
> governor computations for that CPU, the CPU has been busy for all
> that time and its frequency should not be decreased, so if the new
> frequency would be lower than the one set previously, the governor
> will skip the frequency update.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Makes sense,

Reviewed-by: Joel Fernandes <joelaf@google.com>

Thanks,
Joel

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-21 23:08     ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki
                         ` (2 preceding siblings ...)
  2017-03-23  1:04       ` Joel Fernandes
@ 2017-03-23 19:26       ` Sai Gurrappadi
  2017-03-23 20:48         ` Sai Gurrappadi
  2017-03-24  1:39         ` Rafael J. Wysocki
  2017-03-25  1:14       ` Sai Gurrappadi
  2017-05-08  3:49       ` Wanpeng Li
  5 siblings, 2 replies; 71+ messages in thread
From: Sai Gurrappadi @ 2017-03-23 19:26 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM, Peter Zijlstra
  Cc: LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar, Thomas Gleixner,
	Peter Boonstoppel

Hi Rafael,

On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

<snip>

> 
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
> 
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values
> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway.  On

I have observed this issue even in the shared policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates:

1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet)
2. Do wakeup on dst_cpu, add load to dst_cpu

Now if src_cpu manages to do a PELT update before 2. happens (say a small periodic task woke up on src_cpu), it'll end up subtracting the removed_load from its utilization and issuing a frequency update first.
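
For reference, this is roughly where that deferred subtraction lives; a
heavily simplified sketch of the ~v4.10 kernel/sched/fair.c with most
details elided:

/* On migration, the departing task's signal is only queued for removal. */
void remove_entity_load_avg(struct sched_entity *se)
{
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	/* ... sync the entity's signal up to the cfs_rq clock ... */
	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
	atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
}

/* The removal only takes effect at src_cpu's next PELT update: */
static inline int
update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
{
	struct sched_avg *sa = &cfs_rq->avg;

	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
		sub_positive(&sa->util_avg, r);
		/* ... */
	}
	/* ... decay and trigger the cpufreq (schedutil) update ... */
}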

This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu freq. update from step 2. above gets rate limited in this scenario.
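
And the rate limit that holds back the dst_cpu update afterwards is
essentially this check; a simplified sketch of schedutil's
sugov_should_update_freq(), with the work-in-progress and
limits-changed special cases left out:

static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
{
	/*
	 * last_freq_update_time is per-policy, so a premature update issued
	 * from src_cpu can make the later dst_cpu update fail this check and
	 * be dropped until rate_limit_us has passed.
	 */
	s64 delta_ns = time - sg_policy->last_freq_update_time;

	return delta_ns >= sg_policy->freq_update_delay_ns;
}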


> systems with one CPU per policy, however, it may affect performance
> adversely and even lead to increased energy consumption in some cases.
> 
> On those systems it may be addressed by taking another utilization
> metric into consideration, like whether or not the CPU whose
> frequency is about to be reduced has been idle recently, because if
> that's not the case, the CPU is likely to be busy in the near future
> and its frequency should not be reduced.
> 
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before its frequency is about to be reduced.
> If the counter has not changed since the previous iteration of the
> governor computations for that CPU, the CPU has been busy for all
> that time and its frequency should not be decreased, so if the new
> frequency would be lower than the one set previously, the governor
> will skip the frequency update.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  include/linux/tick.h             |    1 +
>  kernel/sched/cpufreq_schedutil.c |   27 +++++++++++++++++++++++++++
>  kernel/time/tick-sched.c         |   12 ++++++++++++
>  3 files changed, 40 insertions(+)
> 
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -61,6 +61,11 @@ struct sugov_cpu {
>  	unsigned long util;
>  	unsigned long max;
>  	unsigned int flags;
> +
> +	/* The field below is for single-CPU policies only. */
> +#ifdef CONFIG_NO_HZ_COMMON
> +	unsigned long saved_idle_calls;
> +#endif
>  };
>  
>  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> @@ -192,6 +197,19 @@ static void sugov_iowait_boost(struct su
>  	sg_cpu->iowait_boost >>= 1;
>  }
>  
> +#ifdef CONFIG_NO_HZ_COMMON
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> +	unsigned long idle_calls = tick_nohz_get_idle_calls();
> +	bool ret = idle_calls == sg_cpu->saved_idle_calls;
> +
> +	sg_cpu->saved_idle_calls = idle_calls;
> +	return ret;
> +}

Hm, sorry, I am a bit confused; perhaps you could help me understand the problem/solution better :)

Say we have this simple case of only a single periodic task running on one CPU; wouldn't the PELT update on wakeup cause a frequency update which updates the sg_cpu->saved_idle_calls value here? That would then cause the frequency update on idle entry to always skip dropping the frequency, right?

If I am reading this correctly, the PELT update on the dequeue of the periodic task (in the scenario above) happens _before_ the idle_calls++, which is in tick_nohz_idle_enter().

Thanks!
-Sai 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-23 19:26       ` Sai Gurrappadi
@ 2017-03-23 20:48         ` Sai Gurrappadi
  2017-03-24  1:39         ` Rafael J. Wysocki
  1 sibling, 0 replies; 71+ messages in thread
From: Sai Gurrappadi @ 2017-03-23 20:48 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM, Peter Zijlstra
  Cc: LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar, Thomas Gleixner,
	Peter Boonstoppel



On 03/23/2017 12:26 PM, Sai Gurrappadi wrote:

> 
> Hm, sorry, I am a bit confused; perhaps you could help me understand the problem/solution better :)
> 
> Say we have this simple case of only a single periodic task running on one CPU; wouldn't the PELT update on wakeup cause a frequency update which updates the sg_cpu->saved_idle_calls value here? That would then cause the frequency update on idle entry to always skip dropping the frequency, right?
 
Apologies, the example above is flawed: on task dequeue the task's utilization isn't removed from the rq, so there is no problem in this case...


-Sai

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-22 23:56         ` Joel Fernandes
@ 2017-03-23 22:08           ` Vincent Guittot
  2017-03-25  3:48             ` Joel Fernandes
  0 siblings, 1 reply; 71+ messages in thread
From: Vincent Guittot @ 2017-03-23 22:08 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Patrick Bellasi, Rafael J. Wysocki, Viresh Kumar, Linux PM, LKML,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli,
	Morten Rasmussen, Ingo Molnar

On 23 March 2017 at 00:56, Joel Fernandes <joelaf@google.com> wrote:
> On Mon, Mar 20, 2017 at 5:34 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
>> On 20-Mar 09:26, Vincent Guittot wrote:
>>> On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>>> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> >>
>>> >> The PELT metric used by the schedutil governor underestimates the
>>> >> CPU utilization in some cases.  The reason for that may be time spent
>>> >> in interrupt handlers and similar which is not accounted for by PELT.
>>>
>>> Are you sure of the root cause  described above (time stolen by irq
>>> handler) or is it just a hypotheses ? That would be good to be sure of
>>> the root cause
>>> Furthermore, IIRC the time spent in irq context is also accounted as
>>> run time for the running cfs task but not RT and deadline task running
>>> time
>>
>> As long as the IRQ processing does not generate a context switch,
>> which is happening (eventually) if the top half schedule some deferred
>> work to be executed by a bottom half.
>>
>> Thus, me too I would say that all the top half time is accounted in
>> PELT, since the current task is still RUNNABLE/RUNNING.
>
> Sorry if I'm missing something but doesn't this depend on whether you
> have CONFIG_IRQ_TIME_ACCOUNTING enabled?
>
> __update_load_avg uses rq->clock_task for deltas which I think
> shouldn't account IRQ time with that config option. So it should be
> quite possible for IRQ time spent to reduce the PELT signal right?
>
>>
>>> So I'm not really aligned with the description of your problem: PELT
>>> metric underestimates the load of the CPU.  The PELT is just about
>>> tracking CFS task utilization but not whole CPU utilization and
>>> according to your description of the problem (time stolen by irq),
>>> your problem doesn't come from an underestimation of CFS task but from
>>> time spent in something else but not accounted in the value used by
>>> schedutil
>>
>> Quite likely. Indeed, it can really be that the CFS task is preempted
>> because of some RT activity generated by the IRQ handler.
>>
>> More in general, I've also noticed many suboptimal freq switches when
>> RT tasks interleave with CFS ones, because of:
>> - relatively long down _and up_ throttling times
>> - the way schedutil's flags are tracked and updated
>> - the callsites from where we call schedutil updates
>>
>> For example it can really happen that we are running at the highest
>> OPP because of some RT activity. Then we switch back to a relatively
>> low utilization CFS workload and then:
>> 1. a tick happens which produces a frequency drop
>
> Any idea why this frequency drop would happen? Say a running CFS task
> gets preempted by RT task, the PELT signal shouldn't drop for the
> duration the CFS task is preempted because the task is runnable, so

utilization only tracks the running state but not runnable state.
Runnable state is tracked in load_avg

> once the CFS task gets CPU back, schedutil should still maintain the
> capacity right?
>
> Regards,
> Joel

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-23 19:26       ` Sai Gurrappadi
  2017-03-23 20:48         ` Sai Gurrappadi
@ 2017-03-24  1:39         ` Rafael J. Wysocki
  2017-03-24 19:08           ` Sai Gurrappadi
  1 sibling, 1 reply; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-24  1:39 UTC (permalink / raw)
  To: Sai Gurrappadi
  Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot,
	Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar,
	Thomas Gleixner, Peter Boonstoppel

On Thu, Mar 23, 2017 at 8:26 PM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
> Hi Rafael,

Hi,

> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> <snip>
>
>>
>> That has been attributed to CPU utilization metric updates on task
>> migration that cause the total utilization value for the CPU to be
>> reduced by the utilization of the migrated task.  If that happens,
>> the schedutil governor may see a CPU utilization reduction and will
>> attempt to reduce the CPU frequency accordingly right away.  That
>> may be premature, though, for example if the system is generally
>> busy and there are other runnable tasks waiting to be run on that
>> CPU already.
>>
>> This is unlikely to be an issue on systems where cpufreq policies are
>> shared between multiple CPUs, because in those cases the policy
>> utilization is computed as the maximum of the CPU utilization values
>> over the whole policy and if that turns out to be low, reducing the
>> frequency for the policy most likely is a good idea anyway.  On
>
> I have observed this issue even in the shared policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates:
>
> 1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet)
> 2. Do wakeup on dst_cpu, add load to dst_cpu
>
> Now if src_cpu manages to do a PELT update before 2. happens, ex: say a small periodic task woke up on src_cpu, it'll end up subtracting the removed_load from its utilization and issue a frequency update before 2. happens.
>
> This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu freq. update from step 2. above gets rate limited in this scenario.

Interesting, and this seems to be related to last_freq_update_time
being per-policy (which it has to be, because frequency updates are
per-policy too and that's what we need to rate-limit).

Does this happen often enough to be a real concern in practice on
those configurations, though?

The other CPUs in the policy need to be either idle (so schedutil
doesn't take them into account at all) or lightly utilized for that to
happen, so that would affect workloads with one CPU hog type of task
that is migrated from one CPU to another within a policy and that
doesn't happen too often AFAICS.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-24  1:39         ` Rafael J. Wysocki
@ 2017-03-24 19:08           ` Sai Gurrappadi
  0 siblings, 0 replies; 71+ messages in thread
From: Sai Gurrappadi @ 2017-03-24 19:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot,
	Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar,
	Thomas Gleixner, Peter Boonstoppel

On 03/23/2017 06:39 PM, Rafael J. Wysocki wrote:
> On Thu, Mar 23, 2017 at 8:26 PM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
>> Hi Rafael,
> 
> Hi,
> 
>> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> <snip>
>>
>>>
>>> That has been attributed to CPU utilization metric updates on task
>>> migration that cause the total utilization value for the CPU to be
>>> reduced by the utilization of the migrated task.  If that happens,
>>> the schedutil governor may see a CPU utilization reduction and will
>>> attempt to reduce the CPU frequency accordingly right away.  That
>>> may be premature, though, for example if the system is generally
>>> busy and there are other runnable tasks waiting to be run on that
>>> CPU already.
>>>
>>> This is unlikely to be an issue on systems where cpufreq policies are
>>> shared between multiple CPUs, because in those cases the policy
>>> utilization is computed as the maximum of the CPU utilization values
>>> over the whole policy and if that turns out to be low, reducing the
>>> frequency for the policy most likely is a good idea anyway.  On
>>
>> I have observed this issue even in the shared policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates:
>>
>> 1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet)
>> 2. Do wakeup on dst_cpu, add load to dst_cpu
>>
>> Now if src_cpu manages to do a PELT update before 2. happens, ex: say a small periodic task woke up on src_cpu, it'll end up subtracting the removed_load from its utilization and issue a frequency update before 2. happens.
>>
>> This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu freq. update from step 2. above gets rate limited in this scenario.
> 
> Interesting, and this seems to be related to last_freq_update_time
> being per-policy (which it has to be, because frequency updates are
> per-policy too and that's what we need to rate-limit).
> 

Correct.

> Does this happen often enough to be a real concern in practice on
> those configurations, though?
> 
> The other CPUs in the policy need to be either idle (so schedutil
> doesn't take them into account at all) or lightly utilized for that to
> happen, so that would affect workloads with one CPU hog type of task
> that is migrated from one CPU to another within a policy and that
> doesn't happen too often AFAICS.

So it is possible, even likely in some cases, for a heavy CPU task to migrate on wakeup between the policy->cpus via select_idle_sibling() if the prev_cpu it was on was !idle at wakeup time.

This style of one heavy thread plus lots of light work is a common pattern on Android (games, browsing, etc.), given how Android does its threading for IPC (Binder stuff) and its rendering/audio pipelines.

I unfortunately don't have any numbers at the moment, though.

-Sai

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-21 23:08     ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki
                         ` (3 preceding siblings ...)
  2017-03-23 19:26       ` Sai Gurrappadi
@ 2017-03-25  1:14       ` Sai Gurrappadi
  2017-03-25  1:39         ` Rafael J. Wysocki
  2017-03-27  7:04         ` Vincent Guittot
  2017-05-08  3:49       ` Wanpeng Li
  5 siblings, 2 replies; 71+ messages in thread
From: Sai Gurrappadi @ 2017-03-25  1:14 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM, Peter Zijlstra
  Cc: LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar, Thomas Gleixner,
	Peter Boonstoppel

Hi Rafael,

On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
> 

Thinking out loud a bit, I wonder if what you really want to do is basically:

schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);

Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs.

Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other; the smaller the amount of 'overlap' / overloading, the closer they will be.

Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so:

schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg);
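
A minimal sketch of what that selection might look like; purely
illustrative, and total_cpu_util_avg() is an assumed helper that does
not exist today and would have to be added to track the CPU's own busy
time in a PELT-like fashion:

static unsigned long sketch_schedutil_cpu_util(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long cfs_util = rq->cfs.avg.util_avg;
	/* Hypothetical: PELT-like average of the CPU's own busy time. */
	unsigned long busy_util = idle_cpu(cpu) ? 0 : total_cpu_util_avg(cpu);

	return max(cfs_util, busy_util);
}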

Note that the problem described here in the commit message doesn't need fully runnable threads; it just needs two threads sharing the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases.

Thoughts?

Thanks,
-Sai

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-25  1:14       ` Sai Gurrappadi
@ 2017-03-25  1:39         ` Rafael J. Wysocki
  2017-03-27  7:04         ` Vincent Guittot
  1 sibling, 0 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-25  1:39 UTC (permalink / raw)
  To: Sai Gurrappadi
  Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot,
	Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar,
	Thomas Gleixner, Peter Boonstoppel

On Sat, Mar 25, 2017 at 2:14 AM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
> Hi Rafael,
>
> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> The way the schedutil governor uses the PELT metric causes it to
>> underestimate the CPU utilization in some cases.
>>
>> That can be easily demonstrated by running kernel compilation on
>> a Sandy Bridge Intel processor, running turbostat in parallel with
>> it and looking at the values written to the MSR_IA32_PERF_CTL
>> register.  Namely, the expected result would be that when all CPUs
>> were 100% busy, all of them would be requested to run in the maximum
>> P-state, but observation shows that this clearly isn't the case.
>> The CPUs run in the maximum P-state for a while and then are
>> requested to run slower and go back to the maximum P-state after
>> a while again.  That causes the actual frequency of the processor to
>> visibly oscillate below the sustainable maximum in a jittery fashion
>> which clearly is not desirable.
>>
>> That has been attributed to CPU utilization metric updates on task
>> migration that cause the total utilization value for the CPU to be
>> reduced by the utilization of the migrated task.  If that happens,
>> the schedutil governor may see a CPU utilization reduction and will
>> attempt to reduce the CPU frequency accordingly right away.  That
>> may be premature, though, for example if the system is generally
>> busy and there are other runnable tasks waiting to be run on that
>> CPU already.
>>
>
> Thinking out loud a bit, I wonder if what you really want to do is basically:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>
> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs.
>
> Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other the lesser the amount of 'overlap' / overloading.
>
> Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg);
>
> Note that the problem described here in the commit message doesn't need fully runnable threads, it just needs two threads to execute in parallel on the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases.
>
> Thoughts?

Well, is the total_cpu_util_avg metric readily available?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-23 22:08           ` Vincent Guittot
@ 2017-03-25  3:48             ` Joel Fernandes
  2017-03-27  6:59               ` Vincent Guittot
  0 siblings, 1 reply; 71+ messages in thread
From: Joel Fernandes @ 2017-03-25  3:48 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Patrick Bellasi, Rafael J. Wysocki, Viresh Kumar, Linux PM, LKML,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli,
	Morten Rasmussen, Ingo Molnar

Hi Vincent,

On Thu, Mar 23, 2017 at 3:08 PM, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
[..]
>>>
>>>> So I'm not really aligned with the description of your problem: PELT
>>>> metric underestimates the load of the CPU.  The PELT is just about
>>>> tracking CFS task utilization but not whole CPU utilization and
>>>> according to your description of the problem (time stolen by irq),
>>>> your problem doesn't come from an underestimation of CFS task but from
>>>> time spent in something else but not accounted in the value used by
>>>> schedutil
>>>
>>> Quite likely. Indeed, it can really be that the CFS task is preempted
>>> because of some RT activity generated by the IRQ handler.
>>>
>>> More in general, I've also noticed many suboptimal freq switches when
>>> RT tasks interleave with CFS ones, because of:
>>> - relatively long down _and up_ throttling times
>>> - the way schedutil's flags are tracked and updated
>>> - the callsites from where we call schedutil updates
>>>
>>> For example it can really happen that we are running at the highest
>>> OPP because of some RT activity. Then we switch back to a relatively
>>> low utilization CFS workload and then:
>>> 1. a tick happens which produces a frequency drop
>>
>> Any idea why this frequency drop would happen? Say a running CFS task
>> gets preempted by RT task, the PELT signal shouldn't drop for the
>> duration the CFS task is preempted because the task is runnable, so
>
> utilization only tracks the running state but not runnable state.
> Runnable state is tracked in load_avg

Thanks. I got it now.

Correct me if I'm wrong, but strictly speaking the utilization of a
cfs_rq (which drives the frequency for CFS) still tracks the
blocked/runnable time of tasks, although it's decayed as time moves
forward. Only when a cfs task migrates to another rq is its util_avg
contribution removed from the old rq. But I can see now why running RT
can decay this load-tracking signal.
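
To put a rough number on that decay (illustrative only, using the usual
y^32 = 1/2 PELT decay factor and treating each ~1 ms as one PELT
period):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* per-period decay factor */
	double util = 1024.0;			/* previously fully-busy cfs_rq */
	int ms;

	/* While RT runs, no CFS time is accrued, so the signal just decays. */
	for (ms = 1; ms <= 10; ms++)
		util *= y;

	printf("util_avg after ~10 ms of RT preemption: %.0f\n", util); /* ~824 */
	return 0;
}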

Regards,
Joel

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs
  2017-03-25  3:48             ` Joel Fernandes
@ 2017-03-27  6:59               ` Vincent Guittot
  0 siblings, 0 replies; 71+ messages in thread
From: Vincent Guittot @ 2017-03-27  6:59 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Patrick Bellasi, Rafael J. Wysocki, Viresh Kumar, Linux PM, LKML,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli,
	Morten Rasmussen, Ingo Molnar

On 25 March 2017 at 04:48, Joel Fernandes <joelaf@google.com> wrote:
> Hi Vincent,
>
> On Thu, Mar 23, 2017 at 3:08 PM, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> [..]
>>>>
>>>>> So I'm not really aligned with the description of your problem: PELT
>>>>> metric underestimates the load of the CPU.  The PELT is just about
>>>>> tracking CFS task utilization but not whole CPU utilization and
>>>>> according to your description of the problem (time stolen by irq),
>>>>> your problem doesn't come from an underestimation of CFS task but from
>>>>> time spent in something else but not accounted in the value used by
>>>>> schedutil
>>>>
>>>> Quite likely. Indeed, it can really be that the CFS task is preempted
>>>> because of some RT activity generated by the IRQ handler.
>>>>
>>>> More in general, I've also noticed many suboptimal freq switches when
>>>> RT tasks interleave with CFS ones, because of:
>>>> - relatively long down _and up_ throttling times
>>>> - the way schedutil's flags are tracked and updated
>>>> - the callsites from where we call schedutil updates
>>>>
>>>> For example it can really happen that we are running at the highest
>>>> OPP because of some RT activity. Then we switch back to a relatively
>>>> low utilization CFS workload and then:
>>>> 1. a tick happens which produces a frequency drop
>>>
>>> Any idea why this frequency drop would happen? Say a running CFS task
>>> gets preempted by RT task, the PELT signal shouldn't drop for the
>>> duration the CFS task is preempted because the task is runnable, so
>>
>> utilization only tracks the running state but not runnable state.
>> Runnable state is tracked in load_avg
>
> Thanks. I got it now.
>
> Correct me if I'm wrong but strictly speaking utilization for a cfs_rq
> (which drives the frequency for CFS) still tracks the blocked/runnable
> time of tasks although its decayed as time moves forward. Only when we
> migrate the rq of a cfs task is the util_avg contribution removed from
> the rq. But I can see now why running RT can decay this load tracking
> signal.

Yes. you're right


>
> Regards,
> Joel

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-25  1:14       ` Sai Gurrappadi
  2017-03-25  1:39         ` Rafael J. Wysocki
@ 2017-03-27  7:04         ` Vincent Guittot
  2017-03-27 21:01           ` Sai Gurrappadi
  1 sibling, 1 reply; 71+ messages in thread
From: Vincent Guittot @ 2017-03-27  7:04 UTC (permalink / raw)
  To: Sai Gurrappadi
  Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi,
	Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner,
	Peter Boonstoppel

On 25 March 2017 at 02:14, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
> Hi Rafael,
>
> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> The way the schedutil governor uses the PELT metric causes it to
>> underestimate the CPU utilization in some cases.
>>
>> That can be easily demonstrated by running kernel compilation on
>> a Sandy Bridge Intel processor, running turbostat in parallel with
>> it and looking at the values written to the MSR_IA32_PERF_CTL
>> register.  Namely, the expected result would be that when all CPUs
>> were 100% busy, all of them would be requested to run in the maximum
>> P-state, but observation shows that this clearly isn't the case.
>> The CPUs run in the maximum P-state for a while and then are
>> requested to run slower and go back to the maximum P-state after
>> a while again.  That causes the actual frequency of the processor to
>> visibly oscillate below the sustainable maximum in a jittery fashion
>> which clearly is not desirable.
>>
>> That has been attributed to CPU utilization metric updates on task
>> migration that cause the total utilization value for the CPU to be
>> reduced by the utilization of the migrated task.  If that happens,
>> the schedutil governor may see a CPU utilization reduction and will
>> attempt to reduce the CPU frequency accordingly right away.  That
>> may be premature, though, for example if the system is generally
>> busy and there are other runnable tasks waiting to be run on that
>> CPU already.
>>
>
> Thinking out loud a bit, I wonder if what you really want to do is basically:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>
> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs.

But then we lose the benefit of the immediate decrease when tasks migrate.
Instead of total_cpu_util_avg we should rather track RT utilization in
the same manner, so with the ongoing work for deadline we will have:
total_utilization = cfs.util_avg + rt's util_avg + deadline's util_avg
and we still take advantage of the task migration effect


>
> Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other the lesser the amount of 'overlap' / overloading.
>
> Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg);
>
> Note that the problem described here in the commit message doesn't need fully runnable threads, it just needs two threads to execute in parallel on the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases.
>
> Thoughts?
>
> Thanks,
> -Sai

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-27  7:04         ` Vincent Guittot
@ 2017-03-27 21:01           ` Sai Gurrappadi
  2017-03-27 21:11             ` Rafael J. Wysocki
  0 siblings, 1 reply; 71+ messages in thread
From: Sai Gurrappadi @ 2017-03-27 21:01 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi,
	Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner,
	Peter Boonstoppel

Hi Vincent,

On 03/27/2017 12:04 AM, Vincent Guittot wrote:
> On 25 March 2017 at 02:14, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
>> Hi Rafael,
>>
>> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> The way the schedutil governor uses the PELT metric causes it to
>>> underestimate the CPU utilization in some cases.
>>>
>>> That can be easily demonstrated by running kernel compilation on
>>> a Sandy Bridge Intel processor, running turbostat in parallel with
>>> it and looking at the values written to the MSR_IA32_PERF_CTL
>>> register.  Namely, the expected result would be that when all CPUs
>>> were 100% busy, all of them would be requested to run in the maximum
>>> P-state, but observation shows that this clearly isn't the case.
>>> The CPUs run in the maximum P-state for a while and then are
>>> requested to run slower and go back to the maximum P-state after
>>> a while again.  That causes the actual frequency of the processor to
>>> visibly oscillate below the sustainable maximum in a jittery fashion
>>> which clearly is not desirable.
>>>
>>> That has been attributed to CPU utilization metric updates on task
>>> migration that cause the total utilization value for the CPU to be
>>> reduced by the utilization of the migrated task.  If that happens,
>>> the schedutil governor may see a CPU utilization reduction and will
>>> attempt to reduce the CPU frequency accordingly right away.  That
>>> may be premature, though, for example if the system is generally
>>> busy and there are other runnable tasks waiting to be run on that
>>> CPU already.
>>>
>>
>> Thinking out loud a bit, I wonder if what you really want to do is basically:
>>
>> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>>
>> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs.
> 
> But then we lose the benefit of the immediate decrease when tasks migrate.

Indeed, this is not ideal.

> Instead of total_cpu_util_avg we should rather track RT utilization in
> the same manner, so with the ongoing work for deadline we will have:
> total_utilization = cfs.util_avg + rt's util_avg + deadline's util_avg
> and we still take advantage of the task migration effect

I agree that we need better tracking for RT and DL tasks but that doesn't solve the overloaded case with more than one CFS thread sharing a CPU.

In the overloaded case, we care not just about the instant when the migration happens but also about the subsequent windows where the PELT metric is slowly ramping up to reflect the real utilization of a task now that it has a CPU to itself.
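
To illustrate how slow that ramp-up can be (again purely illustrative,
approximating the PELT accumulation with the same y^32 = 1/2 factor and
~1 ms periods):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);
	double util = 512.0;	/* the task previously got only ~50% of the CPU */
	int ms = 0;

	/* Now it has the CPU to itself and runs through every period. */
	while (util < 0.9 * 1024.0) {
		util = util * y + 1024.0 * (1.0 - y);
		ms++;
	}

	printf("~%d ms until util_avg reaches 90%% of max\n", ms); /* ~75 ms */
	return 0;
}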

Maybe there are better ways to solve that though :-)

Thanks,
-Sai

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-27 21:01           ` Sai Gurrappadi
@ 2017-03-27 21:11             ` Rafael J. Wysocki
  0 siblings, 0 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-03-27 21:11 UTC (permalink / raw)
  To: Sai Gurrappadi
  Cc: Vincent Guittot, Rafael J. Wysocki, Linux PM, Peter Zijlstra,
	LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar,
	Thomas Gleixner, Peter Boonstoppel

On Mon, Mar 27, 2017 at 11:01 PM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
> Hi Vincent,
>
> On 03/27/2017 12:04 AM, Vincent Guittot wrote:
>> On 25 March 2017 at 02:14, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
>>> Hi Rafael,
>>>
>>> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>
>>>> The way the schedutil governor uses the PELT metric causes it to
>>>> underestimate the CPU utilization in some cases.
>>>>
>>>> That can be easily demonstrated by running kernel compilation on
>>>> a Sandy Bridge Intel processor, running turbostat in parallel with
>>>> it and looking at the values written to the MSR_IA32_PERF_CTL
>>>> register.  Namely, the expected result would be that when all CPUs
>>>> were 100% busy, all of them would be requested to run in the maximum
>>>> P-state, but observation shows that this clearly isn't the case.
>>>> The CPUs run in the maximum P-state for a while and then are
>>>> requested to run slower and go back to the maximum P-state after
>>>> a while again.  That causes the actual frequency of the processor to
>>>> visibly oscillate below the sustainable maximum in a jittery fashion
>>>> which clearly is not desirable.
>>>>
>>>> That has been attributed to CPU utilization metric updates on task
>>>> migration that cause the total utilization value for the CPU to be
>>>> reduced by the utilization of the migrated task.  If that happens,
>>>> the schedutil governor may see a CPU utilization reduction and will
>>>> attempt to reduce the CPU frequency accordingly right away.  That
>>>> may be premature, though, for example if the system is generally
>>>> busy and there are other runnable tasks waiting to be run on that
>>>> CPU already.
>>>>
>>>
>>> Thinking out loud a bit, I wonder if what you really want to do is basically:
>>>
>>> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>>>
>>> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs.
>>
>> But then we lose the benefit of the immediate decrease when tasks migrate.
>
> Indeed, this is not ideal.
>
>> Instead of total_cpu_util_avg we should rather track RT utilization in
>> the same manner, so with the ongoing work for deadline we will have:
>> total_utilization = cfs.util_avg + rt's util_avg + deadline's util_avg
>> and we still take advantage of the task migration effect
>
> I agree that we need better tracking for RT and DL tasks but that doesn't solve the overloaded case with more than one CFS thread sharing a CPU.
>
> In the overloaded case, we care not just about the instant where the migrate happens but also subsequent windows where the PELT metric is slowly ramping up to reflect the real utilization of a task now that it has a CPU to itself.
>
> Maybe there are better ways to solve that though :-)

I wonder if it's viable to postpone the utilization update on both the
source and target runqueues until the task has been fully migrated?

That would make the artificial utilization reductions go away.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-03-21 23:08     ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki
                         ` (4 preceding siblings ...)
  2017-03-25  1:14       ` Sai Gurrappadi
@ 2017-05-08  3:49       ` Wanpeng Li
  2017-05-08  4:01         ` Viresh Kumar
  5 siblings, 1 reply; 71+ messages in thread
From: Wanpeng Li @ 2017-05-08  3:49 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi,
	Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner

Hi Rafael,
2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
>
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
>
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
>
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values

Sorry for a question that is maybe not directly related to this patch.
If the cpufreq policy is shared between multiple CPUs, the function
intel_cpufreq_target() just updates the IA32_PERF_CTL MSR of the CPU
that is managing the policy; I wonder whether the other affected CPUs
should also update their per-logical-CPU IA32_PERF_CTL MSRs?

Regards,
Wanpeng Li

> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway.  On
> systems with one CPU per policy, however, it may affect performance
> adversely and even lead to increased energy consumption in some cases.
>
> On those systems it may be addressed by taking another utilization
> metric into consideration, like whether or not the CPU whose
> frequency is about to be reduced has been idle recently, because if
> that's not the case, the CPU is likely to be busy in the near future
> and its frequency should not be reduced.
>
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before its frequency is about to be reduced.
> If the counter has not changed since the previous iteration of the
> governor computations for that CPU, the CPU has been busy for all
> that time and its frequency should not be decreased, so if the new
> frequency would be lower than the one set previously, the governor
> will skip the frequency update.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  include/linux/tick.h             |    1 +
>  kernel/sched/cpufreq_schedutil.c |   27 +++++++++++++++++++++++++++
>  kernel/time/tick-sched.c         |   12 ++++++++++++
>  3 files changed, 40 insertions(+)
>
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -61,6 +61,11 @@ struct sugov_cpu {
>         unsigned long util;
>         unsigned long max;
>         unsigned int flags;
> +
> +       /* The field below is for single-CPU policies only. */
> +#ifdef CONFIG_NO_HZ_COMMON
> +       unsigned long saved_idle_calls;
> +#endif
>  };
>
>  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> @@ -192,6 +197,19 @@ static void sugov_iowait_boost(struct su
>         sg_cpu->iowait_boost >>= 1;
>  }
>
> +#ifdef CONFIG_NO_HZ_COMMON
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> +       unsigned long idle_calls = tick_nohz_get_idle_calls();
> +       bool ret = idle_calls == sg_cpu->saved_idle_calls;
> +
> +       sg_cpu->saved_idle_calls = idle_calls;
> +       return ret;
> +}
> +#else
> +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> +#endif /* CONFIG_NO_HZ_COMMON */
> +
>  static void sugov_update_single(struct update_util_data *hook, u64 time,
>                                 unsigned int flags)
>  {
> @@ -200,6 +218,7 @@ static void sugov_update_single(struct u
>         struct cpufreq_policy *policy = sg_policy->policy;
>         unsigned long util, max;
>         unsigned int next_f;
> +       bool busy;
>
>         sugov_set_iowait_boost(sg_cpu, time, flags);
>         sg_cpu->last_update = time;
> @@ -207,12 +226,20 @@ static void sugov_update_single(struct u
>         if (!sugov_should_update_freq(sg_policy, time))
>                 return;
>
> +       busy = sugov_cpu_is_busy(sg_cpu);
> +
>         if (flags & SCHED_CPUFREQ_RT_DL) {
>                 next_f = policy->cpuinfo.max_freq;
>         } else {
>                 sugov_get_util(&util, &max);
>                 sugov_iowait_boost(sg_cpu, &util, &max);
>                 next_f = get_next_freq(sg_policy, util, max);
> +               /*
> +                * Do not reduce the frequency if the CPU has not been idle
> +                * recently, as the reduction is likely to be premature then.
> +                */
> +               if (busy && next_f < sg_policy->next_freq)
> +                       next_f = sg_policy->next_freq;
>         }
>         sugov_update_commit(sg_policy, time, next_f);
>  }
> Index: linux-pm/include/linux/tick.h
> ===================================================================
> --- linux-pm.orig/include/linux/tick.h
> +++ linux-pm/include/linux/tick.h
> @@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void);
>  extern void tick_nohz_idle_exit(void);
>  extern void tick_nohz_irq_exit(void);
>  extern ktime_t tick_nohz_get_sleep_length(void);
> +extern unsigned long tick_nohz_get_idle_calls(void);
>  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
>  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
>  #else /* !CONFIG_NO_HZ_COMMON */
> Index: linux-pm/kernel/time/tick-sched.c
> ===================================================================
> --- linux-pm.orig/kernel/time/tick-sched.c
> +++ linux-pm/kernel/time/tick-sched.c
> @@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void)
>         return ts->sleep_length;
>  }
>
> +/**
> + * tick_nohz_get_idle_calls - return the current idle calls counter value
> + *
> + * Called from the schedutil frequency scaling governor in scheduler context.
> + */
> +unsigned long tick_nohz_get_idle_calls(void)
> +{
> +       struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> +
> +       return ts->idle_calls;
> +}
> +
>  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
>  {
>  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-05-08  3:49       ` Wanpeng Li
@ 2017-05-08  4:01         ` Viresh Kumar
  2017-05-08  5:15           ` Wanpeng Li
  2017-05-08 22:16           ` Rafael J. Wysocki
  0 siblings, 2 replies; 71+ messages in thread
From: Viresh Kumar @ 2017-05-08  4:01 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML,
	Srinivas Pandruvada, Juri Lelli, Vincent Guittot,
	Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar,
	Thomas Gleixner

On 08-05-17, 11:49, Wanpeng Li wrote:
> Hi Rafael,
> 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > The way the schedutil governor uses the PELT metric causes it to
> > underestimate the CPU utilization in some cases.
> >
> > That can be easily demonstrated by running kernel compilation on
> > a Sandy Bridge Intel processor, running turbostat in parallel with
> > it and looking at the values written to the MSR_IA32_PERF_CTL
> > register.  Namely, the expected result would be that when all CPUs
> > were 100% busy, all of them would be requested to run in the maximum
> > P-state, but observation shows that this clearly isn't the case.
> > The CPUs run in the maximum P-state for a while and then are
> > requested to run slower and go back to the maximum P-state after
> > a while again.  That causes the actual frequency of the processor to
> > visibly oscillate below the sustainable maximum in a jittery fashion
> > which clearly is not desirable.
> >
> > That has been attributed to CPU utilization metric updates on task
> > migration that cause the total utilization value for the CPU to be
> > reduced by the utilization of the migrated task.  If that happens,
> > the schedutil governor may see a CPU utilization reduction and will
> > attempt to reduce the CPU frequency accordingly right away.  That
> > may be premature, though, for example if the system is generally
> > busy and there are other runnable tasks waiting to be run on that
> > CPU already.
> >
> > This is unlikely to be an issue on systems where cpufreq policies are
> > shared between multiple CPUs, because in those cases the policy
> > utilization is computed as the maximum of the CPU utilization values
> 
> Sorry for a question that may not be related to this patch. If the
> cpufreq policy is shared between multiple CPUs, the function
> intel_cpufreq_target() only updates the IA32_PERF_CTL MSR of the CPU
> that manages this policy; I wonder whether the other affected CPUs
> should also have their per-logical-CPU IA32_PERF_CTL MSRs updated?

CPUs share a policy when they share their freq/voltage rails, so changing
the perf state of one CPU should result in the change taking effect for all
the CPUs in that policy. Otherwise, they couldn't be considered part of the
same policy.

That's why this code changes it for policy->cpu alone.
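For illustration only, a shared-policy ->target_index() callback might look
roughly like the sketch below. This is not the actual intel_cpufreq/intel_pstate
code, and example_freq_to_perf_ctl() together with its frequency-to-P-state
mapping is made up for the example; the point is just that a single request
made via policy->cpu is expected to take effect for every CPU on the shared rail.

#include <linux/cpufreq.h>
#include <asm/msr.h>

/* Made-up helper: a real driver derives this mapping from the HW docs. */
static u64 example_freq_to_perf_ctl(unsigned int freq_khz)
{
        return (u64)(freq_khz / 100000) << 8;   /* illustrative only */
}

static int example_target_index(struct cpufreq_policy *policy,
                                unsigned int index)
{
        u64 perf_ctl =
                example_freq_to_perf_ctl(policy->freq_table[index].frequency);

        /* One request via policy->cpu; the shared rail covers the rest. */
        wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL, perf_ctl);
        return 0;
}

If the hardware did not coordinate the CPUs in the policy, the driver would
have to iterate over policy->cpus instead, which is exactly what sharing a
policy is meant to avoid.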

-- 
viresh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-05-08  4:01         ` Viresh Kumar
@ 2017-05-08  5:15           ` Wanpeng Li
  2017-05-08 22:16           ` Rafael J. Wysocki
  1 sibling, 0 replies; 71+ messages in thread
From: Wanpeng Li @ 2017-05-08  5:15 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML,
	Srinivas Pandruvada, Juri Lelli, Vincent Guittot,
	Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar,
	Thomas Gleixner

2017-05-08 12:01 GMT+08:00 Viresh Kumar <viresh.kumar@linaro.org>:
> On 08-05-17, 11:49, Wanpeng Li wrote:
>> Hi Rafael,
>> 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> >
>> > The way the schedutil governor uses the PELT metric causes it to
>> > underestimate the CPU utilization in some cases.
>> >
>> > That can be easily demonstrated by running kernel compilation on
>> > a Sandy Bridge Intel processor, running turbostat in parallel with
>> > it and looking at the values written to the MSR_IA32_PERF_CTL
>> > register.  Namely, the expected result would be that when all CPUs
>> > were 100% busy, all of them would be requested to run in the maximum
>> > P-state, but observation shows that this clearly isn't the case.
>> > The CPUs run in the maximum P-state for a while and then are
>> > requested to run slower and go back to the maximum P-state after
>> > a while again.  That causes the actual frequency of the processor to
>> > visibly oscillate below the sustainable maximum in a jittery fashion
>> > which clearly is not desirable.
>> >
>> > That has been attributed to CPU utilization metric updates on task
>> > migration that cause the total utilization value for the CPU to be
>> > reduced by the utilization of the migrated task.  If that happens,
>> > the schedutil governor may see a CPU utilization reduction and will
>> > attempt to reduce the CPU frequency accordingly right away.  That
>> > may be premature, though, for example if the system is generally
>> > busy and there are other runnable tasks waiting to be run on that
>> > CPU already.
>> >
>> > This is unlikely to be an issue on systems where cpufreq policies are
>> > shared between multiple CPUs, because in those cases the policy
>> > utilization is computed as the maximum of the CPU utilization values
>>
>> Sorry for one question maybe not associated with this patch. If the
>> cpufreq policy is shared between multiple CPUs, the function
>> intel_cpufreq_target()  just updates IA32_PERF_CTL MSR of the cpu
>> which is managing this policy, I wonder whether other cpus which are
>> affected should also update their per-logical cpu's IA32_PERF_CTL MSR?
>
> The CPUs share the policy when they share their freq/voltage rails and so
> changing perf state of one CPU should result in that changing for all the CPUs
> in that policy. Otherwise, they can't be considered to be part of the same
> policy.
>
> That's why this code is changing it only for policy->cpu alone.

I see, thanks for the explanation.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-05-08  4:01         ` Viresh Kumar
  2017-05-08  5:15           ` Wanpeng Li
@ 2017-05-08 22:16           ` Rafael J. Wysocki
  2017-05-08 22:36             ` Wanpeng Li
  1 sibling, 1 reply; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-05-08 22:16 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Wanpeng Li, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada,
	Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes,
	Morten Rasmussen, Ingo Molnar, Thomas Gleixner

On Monday, May 08, 2017 09:31:19 AM Viresh Kumar wrote:
> On 08-05-17, 11:49, Wanpeng Li wrote:
> > Hi Rafael,
> > 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > The way the schedutil governor uses the PELT metric causes it to
> > > underestimate the CPU utilization in some cases.
> > >
> > > That can be easily demonstrated by running kernel compilation on
> > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > register.  Namely, the expected result would be that when all CPUs
> > > were 100% busy, all of them would be requested to run in the maximum
> > > P-state, but observation shows that this clearly isn't the case.
> > > The CPUs run in the maximum P-state for a while and then are
> > > requested to run slower and go back to the maximum P-state after
> > > a while again.  That causes the actual frequency of the processor to
> > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > which clearly is not desirable.
> > >
> > > That has been attributed to CPU utilization metric updates on task
> > > migration that cause the total utilization value for the CPU to be
> > > reduced by the utilization of the migrated task.  If that happens,
> > > the schedutil governor may see a CPU utilization reduction and will
> > > attempt to reduce the CPU frequency accordingly right away.  That
> > > may be premature, though, for example if the system is generally
> > > busy and there are other runnable tasks waiting to be run on that
> > > CPU already.
> > >
> > > This is unlikely to be an issue on systems where cpufreq policies are
> > > shared between multiple CPUs, because in those cases the policy
> > > utilization is computed as the maximum of the CPU utilization values
> > 
> > Sorry for one question maybe not associated with this patch. If the
> > cpufreq policy is shared between multiple CPUs, the function
> > intel_cpufreq_target()  just updates IA32_PERF_CTL MSR of the cpu
> > which is managing this policy, I wonder whether other cpus which are
> > affected should also update their per-logical cpu's IA32_PERF_CTL MSR?
> 
> The CPUs share the policy when they share their freq/voltage rails and so
> changing perf state of one CPU should result in that changing for all the CPUs
> in that policy. Otherwise, they can't be considered to be part of the same
> policy.

To be entirely precise, this depends on the granularity of the HW interface.

If the interface is per-logical-CPU, we will use it that way for efficiency
reasons; even if there is some coordination on the HW side, the information
on how exactly it works is usually limited.
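
For contrast, a rough sketch of the per-logical-CPU case (again hypothetical,
with example_freq_to_perf_ctl() as a made-up translation helper, not the real
intel_pstate source): when every logical CPU has its own policy, the fast-switch
callback can program the local MSR directly, because for single-CPU policies
schedutil invokes it on the CPU being updated.

static unsigned int example_fast_switch(struct cpufreq_policy *policy,
                                        unsigned int target_freq)
{
        u64 perf_ctl = example_freq_to_perf_ctl(target_freq);

        /* Single-CPU policy: runs on the CPU whose request is changing. */
        wrmsrl(MSR_IA32_PERF_CTL, perf_ctl);
        return target_freq;
}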

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-05-08 22:16           ` Rafael J. Wysocki
@ 2017-05-08 22:36             ` Wanpeng Li
  2017-05-08 23:01               ` Rafael J. Wysocki
  0 siblings, 1 reply; 71+ messages in thread
From: Wanpeng Li @ 2017-05-08 22:36 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Linux PM, Peter Zijlstra, LKML,
	Srinivas Pandruvada, Juri Lelli, Vincent Guittot,
	Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar,
	Thomas Gleixner

2017-05-09 6:16 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> On Monday, May 08, 2017 09:31:19 AM Viresh Kumar wrote:
>> On 08-05-17, 11:49, Wanpeng Li wrote:
>> > Hi Rafael,
>> > 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> > >
>> > > The way the schedutil governor uses the PELT metric causes it to
>> > > underestimate the CPU utilization in some cases.
>> > >
>> > > That can be easily demonstrated by running kernel compilation on
>> > > a Sandy Bridge Intel processor, running turbostat in parallel with
>> > > it and looking at the values written to the MSR_IA32_PERF_CTL
>> > > register.  Namely, the expected result would be that when all CPUs
>> > > were 100% busy, all of them would be requested to run in the maximum
>> > > P-state, but observation shows that this clearly isn't the case.
>> > > The CPUs run in the maximum P-state for a while and then are
>> > > requested to run slower and go back to the maximum P-state after
>> > > a while again.  That causes the actual frequency of the processor to
>> > > visibly oscillate below the sustainable maximum in a jittery fashion
>> > > which clearly is not desirable.
>> > >
>> > > That has been attributed to CPU utilization metric updates on task
>> > > migration that cause the total utilization value for the CPU to be
>> > > reduced by the utilization of the migrated task.  If that happens,
>> > > the schedutil governor may see a CPU utilization reduction and will
>> > > attempt to reduce the CPU frequency accordingly right away.  That
>> > > may be premature, though, for example if the system is generally
>> > > busy and there are other runnable tasks waiting to be run on that
>> > > CPU already.
>> > >
>> > > This is unlikely to be an issue on systems where cpufreq policies are
>> > > shared between multiple CPUs, because in those cases the policy
>> > > utilization is computed as the maximum of the CPU utilization values
>> >
>> > Sorry for one question maybe not associated with this patch. If the
>> > cpufreq policy is shared between multiple CPUs, the function
>> > intel_cpufreq_target()  just updates IA32_PERF_CTL MSR of the cpu
>> > which is managing this policy, I wonder whether other cpus which are
>> > affected should also update their per-logical cpu's IA32_PERF_CTL MSR?
>>
>> The CPUs share the policy when they share their freq/voltage rails and so
>> changing perf state of one CPU should result in that changing for all the CPUs
>> in that policy. Otherwise, they can't be considered to be part of the same
>> policy.
>
> To be entirely precise, this depends on the granularity of the HW interface.
>
> If the interface is per-logical-CPU, we will use it this way for efficiency
> reasons and even if there is some coordination on the HW side, the information
> on how exactly it works usually is limited.

I checked this on several Xeon servers on hand; however, I didn't find
any /sys/devices/system/cpu/cpufreq/policyx/affected_cpus that listed
more than one logical CPU, so I guess most Xeon servers do not support
shared cpufreq policies. Which kind of boxes do support that?
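
A standalone userspace sketch of that kind of check (not part of any patch
in this thread) could simply print affected_cpus for every policy directory;
a shared policy would show more than one CPU on a line:

#include <stdio.h>
#include <glob.h>

int main(void)
{
        glob_t g;
        size_t i;

        /* Each policy directory lists the CPUs it controls. */
        if (glob("/sys/devices/system/cpu/cpufreq/policy*/affected_cpus",
                 0, NULL, &g))
                return 1;

        for (i = 0; i < g.gl_pathc; i++) {
                char line[128] = "";
                FILE *f = fopen(g.gl_pathv[i], "r");

                if (!f)
                        continue;
                if (fgets(line, sizeof(line), f))
                        printf("%s: %s", g.gl_pathv[i], line);
                fclose(f);
        }

        globfree(&g);
        return 0;
}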

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely
  2017-05-08 22:36             ` Wanpeng Li
@ 2017-05-08 23:01               ` Rafael J. Wysocki
  0 siblings, 0 replies; 71+ messages in thread
From: Rafael J. Wysocki @ 2017-05-08 23:01 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Viresh Kumar, Linux PM, Peter Zijlstra, LKML,
	Srinivas Pandruvada, Juri Lelli, Vincent Guittot,
	Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar,
	Thomas Gleixner

On Tuesday, May 09, 2017 06:36:14 AM Wanpeng Li wrote:
> 2017-05-09 6:16 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > On Monday, May 08, 2017 09:31:19 AM Viresh Kumar wrote:
> >> On 08-05-17, 11:49, Wanpeng Li wrote:
> >> > Hi Rafael,
> >> > 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> >> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> > >
> >> > > The way the schedutil governor uses the PELT metric causes it to
> >> > > underestimate the CPU utilization in some cases.
> >> > >
> >> > > That can be easily demonstrated by running kernel compilation on
> >> > > a Sandy Bridge Intel processor, running turbostat in parallel with
> >> > > it and looking at the values written to the MSR_IA32_PERF_CTL
> >> > > register.  Namely, the expected result would be that when all CPUs
> >> > > were 100% busy, all of them would be requested to run in the maximum
> >> > > P-state, but observation shows that this clearly isn't the case.
> >> > > The CPUs run in the maximum P-state for a while and then are
> >> > > requested to run slower and go back to the maximum P-state after
> >> > > a while again.  That causes the actual frequency of the processor to
> >> > > visibly oscillate below the sustainable maximum in a jittery fashion
> >> > > which clearly is not desirable.
> >> > >
> >> > > That has been attributed to CPU utilization metric updates on task
> >> > > migration that cause the total utilization value for the CPU to be
> >> > > reduced by the utilization of the migrated task.  If that happens,
> >> > > the schedutil governor may see a CPU utilization reduction and will
> >> > > attempt to reduce the CPU frequency accordingly right away.  That
> >> > > may be premature, though, for example if the system is generally
> >> > > busy and there are other runnable tasks waiting to be run on that
> >> > > CPU already.
> >> > >
> >> > > This is unlikely to be an issue on systems where cpufreq policies are
> >> > > shared between multiple CPUs, because in those cases the policy
> >> > > utilization is computed as the maximum of the CPU utilization values
> >> >
> >> > Sorry for one question maybe not associated with this patch. If the
> >> > cpufreq policy is shared between multiple CPUs, the function
> >> > intel_cpufreq_target()  just updates IA32_PERF_CTL MSR of the cpu
> >> > which is managing this policy, I wonder whether other cpus which are
> >> > affected should also update their per-logical cpu's IA32_PERF_CTL MSR?
> >>
> >> The CPUs share the policy when they share their freq/voltage rails and so
> >> changing perf state of one CPU should result in that changing for all the CPUs
> >> in that policy. Otherwise, they can't be considered to be part of the same
> >> policy.
> >
> > To be entirely precise, this depends on the granularity of the HW interface.
> >
> > If the interface is per-logical-CPU, we will use it this way for efficiency
> > reasons and even if there is some coordination on the HW side, the information
> > on how exactly it works usually is limited.
> 
> I check it on several Xeon servers on hand, however, I didn't find
> /sys/devices/system/cpu/cpufreq/policyx/affected_cpus can affect more
> than one logical cpu, so I guess most of Xeon servers are not support
> shared cpufreq policy, then which kind of boxes support that?

On Intel the interface for performance scaling is per-logical-CPU in general.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2017-05-08 23:08 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-19 13:21 [PATCH 0/2] cpufreq: schedutil: Fix and optimization Rafael J. Wysocki
2017-03-19 13:30 ` [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() Rafael J. Wysocki
2017-03-20  3:28   ` Viresh Kumar
2017-03-20 12:36     ` Rafael J. Wysocki
2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki
2017-03-19 21:24   ` Rafael J. Wysocki
2017-03-19 21:42     ` Rafael J. Wysocki
2017-03-20 10:38     ` Peter Zijlstra
2017-03-20 12:31       ` Rafael J. Wysocki
2017-03-20  3:57   ` Viresh Kumar
2017-03-20  8:26     ` Vincent Guittot
2017-03-20 12:34       ` Patrick Bellasi
2017-03-22 23:56         ` Joel Fernandes
2017-03-23 22:08           ` Vincent Guittot
2017-03-25  3:48             ` Joel Fernandes
2017-03-27  6:59               ` Vincent Guittot
2017-03-20 12:59       ` Rafael J. Wysocki
2017-03-20 13:20         ` Vincent Guittot
2017-03-20 12:48     ` Rafael J. Wysocki
2017-03-20 10:36   ` Peter Zijlstra
2017-03-20 12:35     ` Rafael J. Wysocki
2017-03-20 12:50       ` Peter Zijlstra
2017-03-20 13:04         ` Rafael J. Wysocki
2017-03-20 13:06         ` Patrick Bellasi
2017-03-20 13:05           ` Rafael J. Wysocki
2017-03-20 14:13             ` Patrick Bellasi
2017-03-20 21:46   ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki
2017-03-21  6:40     ` Viresh Kumar
2017-03-21 12:30       ` Rafael J. Wysocki
2017-03-21  8:50     ` Vincent Guittot
2017-03-21 11:56       ` Patrick Bellasi
2017-03-21 13:22       ` Peter Zijlstra
2017-03-21 13:37         ` Vincent Guittot
2017-03-21 14:03           ` Peter Zijlstra
2017-03-21 14:18             ` Vincent Guittot
2017-03-21 14:25             ` Patrick Bellasi
     [not found]             ` <CAKfTPtALorn7HNpz4LOfWWSc3u+9y5iHB5byzfTHGQXDA+tVJQ@mail.gmail.com>
2017-03-21 14:58               ` Peter Zijlstra
2017-03-21 17:00                 ` Vincent Guittot
2017-03-21 17:01                   ` Vincent Guittot
2017-03-21 14:26           ` Rafael J. Wysocki
2017-03-21 14:38             ` Patrick Bellasi
2017-03-21 14:46               ` Rafael J. Wysocki
2017-03-21 14:50                 ` Rafael J. Wysocki
2017-03-21 15:04                 ` Peter Zijlstra
2017-03-21 15:18                   ` Rafael J. Wysocki
2017-03-21 17:00                     ` Peter Zijlstra
2017-03-21 17:17                       ` Rafael J. Wysocki
2017-03-21 15:08                 ` Patrick Bellasi
2017-03-21 15:18                   ` Peter Zijlstra
2017-03-21 19:28                     ` Patrick Bellasi
2017-03-21 15:02             ` Peter Zijlstra
2017-03-21 11:50     ` Patrick Bellasi
2017-03-21 23:08     ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki
2017-03-22  9:26       ` Peter Zijlstra
2017-03-22  9:54       ` Viresh Kumar
2017-03-23  1:04       ` Joel Fernandes
2017-03-23 19:26       ` Sai Gurrappadi
2017-03-23 20:48         ` Sai Gurrappadi
2017-03-24  1:39         ` Rafael J. Wysocki
2017-03-24 19:08           ` Sai Gurrappadi
2017-03-25  1:14       ` Sai Gurrappadi
2017-03-25  1:39         ` Rafael J. Wysocki
2017-03-27  7:04         ` Vincent Guittot
2017-03-27 21:01           ` Sai Gurrappadi
2017-03-27 21:11             ` Rafael J. Wysocki
2017-05-08  3:49       ` Wanpeng Li
2017-05-08  4:01         ` Viresh Kumar
2017-05-08  5:15           ` Wanpeng Li
2017-05-08 22:16           ` Rafael J. Wysocki
2017-05-08 22:36             ` Wanpeng Li
2017-05-08 23:01               ` Rafael J. Wysocki
