* [PATCH 0/2] cpufreq: schedutil: Fix and optimization @ 2017-03-19 13:21 Rafael J. Wysocki 2017-03-19 13:30 ` [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() Rafael J. Wysocki 2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki 0 siblings, 2 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-19 13:21 UTC (permalink / raw) To: Linux PM Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar Hi, Patch [1/2] in this series fixes an initialization issue in schedutil. Patch [2/2] is an optimization, but it also might be regarded as a fix or at least a problem workaround. Namely, while playing with the passive mode in intel_pstate lately I noticed that schedutil evidently underestimated CPU utilization at high loads. I guess it has been known for a while, but this is the first time I've seen that so clearly (actually by looking at the values written to the P-state request register and not via frequencies). After some investigation I can quite confidently attribute that to a PELT metric issue, as everything else has been ruled out with high confidence, and hence the patch (which also works around the problem adding even more condifence to the observation that PELT underestimates CPU utilization at least sometimes). Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() 2017-03-19 13:21 [PATCH 0/2] cpufreq: schedutil: Fix and optimization Rafael J. Wysocki @ 2017-03-19 13:30 ` Rafael J. Wysocki 2017-03-20 3:28 ` Viresh Kumar 2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki 1 sibling, 1 reply; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-19 13:30 UTC (permalink / raw) To: Linux PM Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> sugov_start() only initializes struct sugov_cpu per-CPU structures for shared policies, but it should do that for single-CPU policies too. That in particular makes the IO-wait boost mechanism work in the cases when cpufreq policies correspond to individual CPUs. Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- kernel/sched/cpufreq_schedutil.c | 20 +++++++------------- 1 file changed, 7 insertions(+), 13 deletions(-) Index: linux-pm/kernel/sched/cpufreq_schedutil.c =================================================================== --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c +++ linux-pm/kernel/sched/cpufreq_schedutil.c @@ -577,20 +577,14 @@ static int sugov_start(struct cpufreq_po for_each_cpu(cpu, policy->cpus) { struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu); + memset(sg_cpu, 0, sizeof(*sg_cpu)); sg_cpu->sg_policy = sg_policy; - if (policy_is_shared(policy)) { - sg_cpu->util = 0; - sg_cpu->max = 0; - sg_cpu->flags = SCHED_CPUFREQ_RT; - sg_cpu->last_update = 0; - sg_cpu->iowait_boost = 0; - sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq; - cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, - sugov_update_shared); - } else { - cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, - sugov_update_single); - } + sg_cpu->flags = SCHED_CPUFREQ_RT; + sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq; + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, + policy_is_shared(policy) ? + sugov_update_shared : + sugov_update_single); } return 0; } ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() 2017-03-19 13:30 ` [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() Rafael J. Wysocki @ 2017-03-20 3:28 ` Viresh Kumar 2017-03-20 12:36 ` Rafael J. Wysocki 0 siblings, 1 reply; 71+ messages in thread From: Viresh Kumar @ 2017-03-20 3:28 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 19-03-17, 14:30, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > sugov_start() only initializes struct sugov_cpu per-CPU structures > for shared policies, but it should do that for single-CPU policies too. > > That in particular makes the IO-wait boost mechanism work in the > cases when cpufreq policies correspond to individual CPUs. > > Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting) > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: 4.9+ <stable@vger.kernel.org> # 4.9+ Acked-by: Viresh Kumar <viresh.kumar@linaro.org> -- viresh ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() 2017-03-20 3:28 ` Viresh Kumar @ 2017-03-20 12:36 ` Rafael J. Wysocki 0 siblings, 0 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-20 12:36 UTC (permalink / raw) To: Viresh Kumar Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Monday, March 20, 2017 08:58:40 AM Viresh Kumar wrote: > On 19-03-17, 14:30, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > sugov_start() only initializes struct sugov_cpu per-CPU structures > > for shared policies, but it should do that for single-CPU policies too. > > > > That in particular makes the IO-wait boost mechanism work in the > > cases when cpufreq policies correspond to individual CPUs. > > > > Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting) > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Cc: 4.9+ <stable@vger.kernel.org> # 4.9+ "stable" checks Fixes: tags too, but yes. > Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-19 13:21 [PATCH 0/2] cpufreq: schedutil: Fix and optimization Rafael J. Wysocki 2017-03-19 13:30 ` [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() Rafael J. Wysocki @ 2017-03-19 13:34 ` Rafael J. Wysocki 2017-03-19 21:24 ` Rafael J. Wysocki ` (3 more replies) 1 sibling, 4 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-19 13:34 UTC (permalink / raw) To: Linux PM Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> The PELT metric used by the schedutil governor underestimates the CPU utilization in some cases. The reason for that may be time spent in interrupt handlers and similar which is not accounted for by PELT. That can be easily demonstrated by running kernel compilation on a Sandy Bridge Intel processor, running turbostat in parallel with it and looking at the values written to the MSR_IA32_PERF_CTL register. Namely, the expected result would be that when all CPUs were 100% busy, all of them would be requested to run in the maximum P-state, but observation shows that this clearly isn't the case. The CPUs run in the maximum P-state for a while and then are requested to run slower and go back to the maximum P-state after a while again. That causes the actual frequency of the processor to visibly oscillate below the sustainable maximum in a jittery fashion which clearly is not desirable. To work around this issue use the observation that, from the schedutil governor's perspective, CPUs that are never idle should always run at the maximum frequency and make that happen. To that end, add a counter of idle calls to struct sugov_cpu and modify cpuidle_idle_call() to increment that counter every time it is about to put the given CPU into an idle state. Next, make the schedutil governor look at that counter for the current CPU every time before it is about to start heavy computations. If the counter has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), the CPU has not been idle for at least that long and the governor will choose the maximum frequency for it without looking at the PELT metric at all. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- include/linux/sched/cpufreq.h | 6 ++++++ kernel/sched/cpufreq_schedutil.c | 38 ++++++++++++++++++++++++++++++++++++-- kernel/sched/idle.c | 3 +++ 3 files changed, 45 insertions(+), 2 deletions(-) Index: linux-pm/kernel/sched/cpufreq_schedutil.c =================================================================== --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c +++ linux-pm/kernel/sched/cpufreq_schedutil.c @@ -20,6 +20,7 @@ #include "sched.h" #define SUGOV_KTHREAD_PRIORITY 50 +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC) struct sugov_tunables { struct gov_attr_set attr_set; @@ -55,6 +56,9 @@ struct sugov_cpu { unsigned long iowait_boost; unsigned long iowait_boost_max; + unsigned long idle_calls; + unsigned long saved_idle_calls; + u64 busy_start; u64 last_update; /* The fields below are only needed when sharing a policy. */ @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su sg_cpu->iowait_boost >>= 1; } +void cpufreq_schedutil_idle_call(void) +{ + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu); + + sg_cpu->idle_calls++; +} + +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) +{ + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) { + sg_cpu->busy_start = 0; + return false; + } + + if (!sg_cpu->busy_start) { + sg_cpu->busy_start = sg_cpu->last_update; + return false; + } + + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD; +} + +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu) +{ + if (!sg_cpu->busy_start) + sg_cpu->saved_idle_calls = sg_cpu->idle_calls; +} + static void sugov_update_single(struct update_util_data *hook, u64 time, unsigned int flags) { @@ -207,7 +239,7 @@ static void sugov_update_single(struct u if (!sugov_should_update_freq(sg_policy, time)) return; - if (flags & SCHED_CPUFREQ_RT_DL) { + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) { next_f = policy->cpuinfo.max_freq; } else { sugov_get_util(&util, &max); @@ -215,6 +247,7 @@ static void sugov_update_single(struct u next_f = get_next_freq(sg_policy, util, max); } sugov_update_commit(sg_policy, time, next_f); + sugov_save_idle_calls(sg_cpu); } static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu) @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u sg_cpu->last_update = time; if (sugov_should_update_freq(sg_policy, time)) { - if (flags & SCHED_CPUFREQ_RT_DL) + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) next_f = sg_policy->policy->cpuinfo.max_freq; else next_f = sugov_next_freq_shared(sg_cpu); sugov_update_commit(sg_policy, time, next_f); + sugov_save_idle_calls(sg_cpu); } raw_spin_unlock(&sg_policy->update_lock); Index: linux-pm/kernel/sched/idle.c =================================================================== --- linux-pm.orig/kernel/sched/idle.c +++ linux-pm/kernel/sched/idle.c @@ -151,6 +151,9 @@ static void cpuidle_idle_call(void) */ rcu_idle_enter(); + /* Notify the schedutil cpufreq governor that we are entering idle. */ + cpufreq_schedutil_idle_call(); + if (cpuidle_not_available(drv, dev)) { default_idle_call(); goto exit_idle; Index: linux-pm/include/linux/sched/cpufreq.h =================================================================== --- linux-pm.orig/include/linux/sched/cpufreq.h +++ linux-pm/include/linux/sched/cpufreq.h @@ -24,4 +24,10 @@ void cpufreq_add_update_util_hook(int cp void cpufreq_remove_update_util_hook(int cpu); #endif /* CONFIG_CPU_FREQ */ +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL +extern void cpufreq_schedutil_idle_call(void); +#else +static inline void cpufreq_schedutil_idle_call(void) {} +#endif /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */ + #endif /* _LINUX_SCHED_CPUFREQ_H */ ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki @ 2017-03-19 21:24 ` Rafael J. Wysocki 2017-03-19 21:42 ` Rafael J. Wysocki 2017-03-20 10:38 ` Peter Zijlstra 2017-03-20 3:57 ` Viresh Kumar ` (2 subsequent siblings) 3 siblings, 2 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-19 21:24 UTC (permalink / raw) To: Linux PM Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 3957 bytes --] On Sunday, March 19, 2017 02:34:32 PM Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > The PELT metric used by the schedutil governor underestimates the > CPU utilization in some cases. The reason for that may be time spent > in interrupt handlers and similar which is not accounted for by PELT. > > That can be easily demonstrated by running kernel compilation on > a Sandy Bridge Intel processor, running turbostat in parallel with > it and looking at the values written to the MSR_IA32_PERF_CTL > register. Namely, the expected result would be that when all CPUs > were 100% busy, all of them would be requested to run in the maximum > P-state, but observation shows that this clearly isn't the case. > The CPUs run in the maximum P-state for a while and then are > requested to run slower and go back to the maximum P-state after > a while again. That causes the actual frequency of the processor to > visibly oscillate below the sustainable maximum in a jittery fashion > which clearly is not desirable. In case you are wondering about the actual numbers, attached are two turbostat log files from two runs of the same workload, without (before.txt) and with (after.txt) the patch applied. The workload is essentially "make -j 5" in the kernel source tree and the machine has an SSD storage and a quad-core Intel Sandy Bridge processor. The P-states available for each core are between 8 and 31 (0x1f) corresponding to 800 MHz and 3.1 GHz, respectively. All cores can run sustainably at 2.9 GHz at the same time, although that is not a guaranteed sustainable frequency (it may be dropped occasionally for thermal reasons, for example). The interesting columns are Bzy_MHz (and specifically the rows with "-" under CPU that correspond to the entire processor), which is the avreage frequency between iterations based on the numbers read from feedback registers, and the rightmost one, which is the values written to the P-state request register (the 3rd and 4th hex digits from the right represent the requested P-state). The turbostat data collection ran every 2 seconds and I looked at the last 30 iterations in each case corresponding to about 1 minute of the workload run during which all of the cores were around 100% busy. Now, if you look at after.txt (the run with the patch applied), you'll notice that during those last 30 iterations P-state 31 (0x1f) had been requested on all cores pretty much 100% of the time (meaning: as expected in that case) and the average processor frequency (computed by taking the average from all of the 30 "-" rows) was 2899.33 MHz (apparently, the hardware decided to drop it from 2.9 GHz occasionally). In the before.txt case (without the patch) the average frequency over the last 30 iterations was 2896.90 MHz which is about 0.8% slower than with the patch applied (on the average). That already is quite a measurable difference, but it would have been much worse if the processor had not coordinated P-states in hardware (such that if any core requested 31, the processor would pick that one or close to it for the entire package regardless of the requests from the other cores). Namely, if you look at the P-states requested for different cores (during the last 30 iterations of the before.txt run), which essentially is what should be used according to the governor, the average of them is 27.25 (almost 4 bins lower than the maximum) and the standard deviation is 6, so it is not like they are a little off occasionally. At least some of them are way off most of the time. Honestly, if the processor had been capable of doing per-core P-states, that would have been a disaster and there are customers who wouldn't look at schedutil again after being confronted with these numbers. So this is rather serious. BTW, both intel_pstate in the active mode and ondemand request 0x1f on all cores for that workload, like in the after.txt case. Thanks, Rafael [-- Attachment #2: before.txt --] [-- Type: text/plain, Size: 15917 bytes --] CPUID(7): No-SGX CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 93 11.26 826 2494 0x00000000 0 61 7.56 808 2494 0x00000800 2 90 10.87 832 2494 0x00000800 1 130 15.79 822 2494 0x00000800 3 90 10.83 837 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 54 6.63 811 2494 0x00000000 0 46 5.68 816 2494 0x00000800 2 55 6.87 804 2494 0x00000800 1 79 9.77 806 2494 0x00000800 3 35 4.20 828 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 305 36.58 835 2494 0x00000000 0 332 39.69 838 2494 0x00000800 2 307 36.95 833 2494 0x00000800 1 248 29.82 832 2494 0x00000800 3 333 39.87 836 2494 0x00000c00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 357 38.08 940 2495 0x00000000 0 424 36.72 1158 2495 0x00000800 2 312 36.27 862 2495 0x00000800 1 407 46.72 874 2494 0x00000800 3 284 32.62 874 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2017 92.64 2183 2494 0x00000000 0 2041 94.65 2162 2494 0x00000800 2 2032 93.65 2175 2494 0x00001200 1 2024 92.33 2197 2494 0x00001f00 3 1972 89.92 2198 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2455 99.21 2480 2494 0x00000000 0 2450 98.95 2482 2494 0x00000900 2 2457 99.29 2480 2494 0x00000900 1 2453 99.11 2481 2494 0x00001000 3 2461 99.49 2479 2494 0x00000d00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2308 99.14 2333 2494 0x00000000 0 2309 99.04 2337 2494 0x00001f00 2 2307 99.12 2333 2494 0x00001200 1 2305 99.03 2333 2494 0x00001600 3 2310 99.39 2329 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2613 99.22 2639 2494 0x00000000 0 2614 99.32 2638 2494 0x00000800 2 2589 98.19 2643 2494 0x00001200 1 2623 99.60 2640 2494 0x00001800 3 2624 99.76 2637 2494 0x00000a00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2687 98.70 2729 2494 0x00000000 0 2700 99.24 2727 2494 0x00000800 2 2707 99.48 2728 2494 0x00001f00 1 2643 97.09 2728 2494 0x00001c00 3 2699 99.01 2733 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.98 2900 2494 0x00000000 0 2892 99.96 2900 2494 0x00001f00 2 2892 99.96 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001b00 2 2893 100.00 2900 2494 0x00001800 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001a00 2 2893 100.00 2900 2494 0x00000d00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001900 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2886 100.00 2893 2494 0x00000000 0 2886 100.00 2893 2494 0x00001f00 2 2886 100.00 2893 2494 0x00001f00 1 2886 100.00 2893 2494 0x00001900 3 2886 100.00 2893 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001b00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001c00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001900 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2890 99.99 2897 2494 0x00000000 0 2890 100.00 2897 2494 0x00001f00 2 2890 100.00 2897 2494 0x00001800 1 2890 100.00 2897 2494 0x00000e00 3 2889 99.97 2897 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001600 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001f00 2 2892 100.00 2898 2494 0x00001f00 1 2892 100.00 2898 2494 0x00000a00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2890 100.00 2897 2494 0x00000000 0 2890 100.00 2897 2494 0x00001f00 2 2890 100.00 2897 2494 0x00001f00 1 2890 100.00 2897 2494 0x00001b00 3 2890 100.00 2897 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2870 99.94 2878 2494 0x00000000 0 2872 100.00 2878 2494 0x00001f00 2 2867 99.84 2878 2494 0x00001f00 1 2872 100.00 2878 2494 0x00001f00 3 2870 99.92 2879 2494 0x00001200 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00000f00 2 2892 100.00 2898 2494 0x00001b00 1 2892 100.00 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2891 99.94 2899 2494 0x00000000 0 2893 100.00 2899 2494 0x00001f00 2 2887 99.78 2899 2494 0x00001f00 1 2893 100.00 2899 2494 0x00001f00 3 2893 100.00 2899 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001d00 2 2893 100.00 2900 2494 0x00001c00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2889 100.00 2895 2494 0x00000000 0 2889 100.00 2895 2494 0x00001f00 2 2889 100.00 2895 2494 0x00001600 1 2889 100.00 2895 2494 0x00001c00 3 2889 100.00 2895 2494 0x00000b00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001800 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001600 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.99 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001200 3 2892 99.96 2900 2494 0x00001300 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2891 99.99 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001f00 2 2891 99.97 2898 2494 0x00001f00 1 2892 100.00 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 99.99 2899 2494 0x00000000 0 2892 100.00 2899 2494 0x00001f00 2 2892 100.00 2899 2494 0x00001100 1 2891 99.96 2899 2494 0x00001f00 3 2892 100.00 2899 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2886 100.00 2893 2494 0x00000000 0 2886 100.00 2893 2494 0x00000f00 2 2886 100.00 2893 2494 0x00000900 1 2886 100.00 2893 2494 0x00001300 3 2886 100.00 2893 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2495 0x00000000 0 2892 100.00 2898 2495 0x00001f00 2 2892 100.00 2898 2495 0x00001700 1 2892 100.00 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001700 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00000d00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2899 2494 0x00000000 0 2892 99.98 2899 2494 0x00001f00 2 2892 100.00 2899 2494 0x00001f00 1 2892 100.00 2899 2494 0x00001f00 3 2892 100.00 2899 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2879 100.00 2885 2494 0x00000000 0 2879 100.00 2885 2494 0x00001500 2 2879 100.00 2885 2494 0x00000900 1 2878 99.99 2885 2494 0x00001f00 3 2879 100.00 2885 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2883 100.00 2889 2494 0x00000000 0 2883 100.00 2889 2494 0x00001f00 2 2883 100.00 2889 2494 0x00001f00 1 2883 100.00 2889 2494 0x00001f00 3 2883 100.00 2889 2494 0x00001600 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001400 1 2893 100.00 2900 2494 0x00001b00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001100 2 2893 100.00 2900 2494 0x00001000 1 2893 100.00 2900 2494 0x00001f00 3 2893 99.98 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001300 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001500 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2899 2494 0x00000000 0 2893 100.00 2899 2494 0x00001f00 2 2893 100.00 2899 2494 0x00001c00 1 2893 100.00 2899 2494 0x00001f00 3 2893 100.00 2899 2494 0x00001c00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001b00 3 2893 100.00 2900 2494 0x00001d00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2879 100.00 2886 2494 0x00000000 0 2879 100.00 2886 2494 0x00001200 2 2879 100.00 2886 2494 0x00001f00 1 2879 100.00 2886 2494 0x00001f00 3 2879 100.00 2886 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.99 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2892 99.98 2900 2494 0x00001300 3 2893 100.00 2900 2494 0x00001000 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2495 0x00000000 0 2892 100.00 2898 2495 0x00001f00 2 2892 100.00 2898 2495 0x00000800 1 2892 100.00 2898 2495 0x00001b00 3 2892 100.00 2898 2494 0x00001600 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.99 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001b00 1 2892 99.96 2900 2494 0x00001f00 3 2893 99.99 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2891 99.99 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001a00 2 2892 100.00 2898 2494 0x00001f00 1 2892 100.00 2898 2494 0x00001f00 3 2890 99.95 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001600 2 2893 100.00 2900 2494 0x00001100 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001f00 2 2892 100.00 2898 2494 0x00001f00 1 2892 100.00 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001c00 [-- Attachment #3: after.txt --] [-- Type: text/plain, Size: 18461 bytes --] CPUID(7): No-SGX CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 171 20.00 854 2496 0x00000000 0 182 21.23 858 2496 0x00000800 2 170 19.93 855 2496 0x00000800 1 173 20.54 845 2496 0x00000800 3 157 18.32 858 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 222 25.57 872 2494 0x00000000 0 210 23.66 889 2494 0x00000800 2 200 23.20 865 2494 0x00000800 1 189 21.33 886 2494 0x00000800 3 291 34.09 856 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 132 16.37 808 2494 0x00000000 0 131 16.19 811 2494 0x00000800 2 215 26.87 801 2494 0x00000800 1 107 13.33 803 2494 0x00000800 3 76 9.08 834 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 503 47.44 1063 2494 0x00000000 0 510 48.60 1051 2494 0x00000a00 2 482 45.17 1070 2494 0x00000800 1 512 48.94 1048 2494 0x00000a00 3 508 47.06 1082 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 447 40.48 1108 2494 0x00000000 0 444 39.85 1117 2494 0x00000800 2 422 38.66 1093 2494 0x00000a00 1 440 38.68 1140 2494 0x00000800 3 484 44.75 1084 2494 0x00000a00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 405 34.24 1185 2494 0x00000000 0 422 33.63 1258 2494 0x00000800 2 435 35.97 1211 2494 0x00000800 1 296 26.52 1118 2494 0x00000800 3 468 40.85 1147 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 155 19.07 813 2494 0x00000000 0 161 19.92 809 2494 0x00000800 2 156 19.31 810 2494 0x00000800 1 119 14.79 809 2494 0x00000800 3 183 22.26 823 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 632 36.32 1744 2494 0x00000000 0 1237 58.35 2125 2494 0x00001f00 2 503 31.16 1617 2494 0x00000800 1 396 28.29 1403 2494 0x00000800 3 392 27.50 1430 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 799 36.92 2168 2494 0x00000000 0 1935 77.42 2505 2494 0x00000800 2 386 22.32 1733 2494 0x00000800 1 474 25.39 1870 2494 0x00000800 3 400 22.53 1778 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 1856 82.07 2267 2494 0x00000000 0 1850 82.38 2251 2494 0x00000900 2 1836 81.44 2260 2494 0x00001500 1 1850 80.66 2299 2494 0x00001700 3 1889 83.80 2259 2494 0x00000a00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2601 95.10 2741 2494 0x00000000 0 2625 96.00 2740 2494 0x00001200 2 2573 93.94 2746 2494 0x00001000 1 2635 96.58 2734 2494 0x00000c00 3 2570 93.87 2744 2494 0x00000b00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2488 92.91 2684 2494 0x00000000 0 2475 92.17 2692 2494 0x00000b00 2 2525 94.49 2678 2494 0x00001c00 1 2506 93.54 2685 2494 0x00001f00 3 2446 91.43 2682 2494 0x00000800 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2508 94.52 2659 2494 0x00000000 0 2551 96.11 2661 2494 0x00001200 2 2525 95.10 2661 2494 0x00001f00 1 2531 95.48 2657 2494 0x00001f00 3 2423 91.40 2657 2494 0x00001400 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2616 94.99 2760 2494 0x00000000 0 2577 93.80 2754 2494 0x00001f00 2 2653 96.61 2753 2494 0x00001f00 1 2606 94.38 2767 2494 0x00001f00 3 2626 95.17 2765 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2851 98.63 2897 2494 0x00000000 0 2823 97.66 2897 2494 0x00000f00 2 2854 98.71 2898 2494 0x00001600 1 2881 99.69 2897 2494 0x00001f00 3 2845 98.44 2897 2494 0x00001900 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2871 99.23 2900 2494 0x00000000 0 2884 99.68 2900 2494 0x00001900 2 2868 99.11 2900 2494 0x00001f00 1 2875 99.35 2900 2494 0x00001f00 3 2858 98.78 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 99.96 2900 2494 0x00000000 0 2893 99.97 2900 2494 0x00001f00 2 2892 99.93 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2892 99.94 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.99 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 99.97 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001f00 2 2892 100.00 2898 2494 0x00001f00 1 2892 100.00 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.99 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 99.97 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001f00 2 2892 100.00 2898 2494 0x00001f00 1 2892 100.00 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.99 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 99.98 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2495 0x00000000 0 2892 100.00 2898 2495 0x00001f00 2 2892 100.00 2898 2495 0x00001f00 1 2892 100.00 2898 2495 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001f00 2 2892 100.00 2898 2494 0x00001f00 1 2892 100.00 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.99 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2892 99.94 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2891 99.99 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001f00 2 2892 100.00 2898 2494 0x00001f00 1 2890 99.95 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.99 2900 2494 0x00000000 0 2892 99.96 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.98 2900 2494 0x00000000 0 2892 99.97 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2892 99.95 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.99 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 99.98 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2894 100.00 2900 2494 0x00000000 0 2894 100.00 2900 2495 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2890 100.00 2897 2494 0x00000000 0 2890 100.00 2897 2494 0x00001f00 2 2890 100.00 2897 2494 0x00001f00 1 2890 100.00 2897 2494 0x00001f00 3 2890 100.00 2897 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 99.99 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2892 99.96 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2888 99.99 2895 2494 0x00000000 0 2888 100.00 2895 2494 0x00001f00 2 2888 100.00 2895 2494 0x00001f00 1 2888 100.00 2895 2494 0x00001f00 3 2888 99.98 2895 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001f00 2 2892 100.00 2898 2494 0x00001f00 1 2892 100.00 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001f00 2 2892 100.00 2898 2494 0x00001f00 1 2892 100.00 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 100.00 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001f00 2 2892 100.00 2898 2494 0x00001f00 1 2892 100.00 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 99.99 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2892 99.96 2900 2494 0x00000000 0 2889 99.84 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2893 100.00 2900 2494 0x00000000 0 2893 100.00 2900 2494 0x00001f00 2 2893 100.00 2900 2494 0x00001f00 1 2893 100.00 2900 2494 0x00001f00 3 2893 100.00 2900 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2886 99.81 2898 2494 0x00000000 0 2892 100.00 2898 2494 0x00001f00 2 2890 99.95 2898 2494 0x00001f00 1 2872 99.30 2898 2494 0x00001f00 3 2891 99.98 2898 2494 0x00001f00 CPU Avg_MHz Busy% Bzy_MHz TSC_MHz MSR 0x199 - 2889 99.90 2898 2494 0x00000000 0 2891 99.98 2898 2494 0x00001f00 2 2892 100.00 2898 2494 0x00001f00 1 2880 99.61 2898 2494 0x00001f00 3 2892 100.00 2898 2494 0x00001f00 ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-19 21:24 ` Rafael J. Wysocki @ 2017-03-19 21:42 ` Rafael J. Wysocki 2017-03-20 10:38 ` Peter Zijlstra 1 sibling, 0 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-19 21:42 UTC (permalink / raw) To: Linux PM Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Sunday, March 19, 2017 10:24:24 PM Rafael J. Wysocki wrote: > On Sunday, March 19, 2017 02:34:32 PM Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > The PELT metric used by the schedutil governor underestimates the > > CPU utilization in some cases. The reason for that may be time spent > > in interrupt handlers and similar which is not accounted for by PELT. > > > > That can be easily demonstrated by running kernel compilation on > > a Sandy Bridge Intel processor, running turbostat in parallel with > > it and looking at the values written to the MSR_IA32_PERF_CTL > > register. Namely, the expected result would be that when all CPUs > > were 100% busy, all of them would be requested to run in the maximum > > P-state, but observation shows that this clearly isn't the case. > > The CPUs run in the maximum P-state for a while and then are > > requested to run slower and go back to the maximum P-state after > > a while again. That causes the actual frequency of the processor to > > visibly oscillate below the sustainable maximum in a jittery fashion > > which clearly is not desirable. > > In case you are wondering about the actual numbers, attached are two turbostat > log files from two runs of the same workload, without (before.txt) and with (after.txt) > the patch applied. > > The workload is essentially "make -j 5" in the kernel source tree and the > machine has an SSD storage and a quad-core Intel Sandy Bridge processor. > The P-states available for each core are between 8 and 31 (0x1f) corresponding > to 800 MHz and 3.1 GHz, respectively. All cores can run sustainably at 2.9 GHz > at the same time, although that is not a guaranteed sustainable frequency > (it may be dropped occasionally for thermal reasons, for example). > > The interesting columns are Bzy_MHz (and specifically the rows with "-" under > CPU that correspond to the entire processor), which is the avreage frequency > between iterations based on the numbers read from feedback registers, and > the rightmost one, which is the values written to the P-state request register > (the 3rd and 4th hex digits from the right represent the requested P-state). > > The turbostat data collection ran every 2 seconds and I looked at the last 30 > iterations in each case corresponding to about 1 minute of the workload run > during which all of the cores were around 100% busy. > > Now, if you look at after.txt (the run with the patch applied), you'll notice that > during those last 30 iterations P-state 31 (0x1f) had been requested on all > cores pretty much 100% of the time (meaning: as expected in that case) and > the average processor frequency (computed by taking the average from > all of the 30 "-" rows) was 2899.33 MHz (apparently, the hardware decided to > drop it from 2.9 GHz occasionally). > > In the before.txt case (without the patch) the average frequency over the last > 30 iterations was 2896.90 MHz which is about 0.8% slower than with the patch > applied (on the average). 0.08% of course, sorry. Still visible, though. :-) Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-19 21:24 ` Rafael J. Wysocki 2017-03-19 21:42 ` Rafael J. Wysocki @ 2017-03-20 10:38 ` Peter Zijlstra 2017-03-20 12:31 ` Rafael J. Wysocki 1 sibling, 1 reply; 71+ messages in thread From: Peter Zijlstra @ 2017-03-20 10:38 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Sun, Mar 19, 2017 at 10:24:24PM +0100, Rafael J. Wysocki wrote: > Honestly, if the processor had been capable of doing per-core P-states, that > would have been a disaster and there are customers who wouldn't look at > schedutil again after being confronted with these numbers. This, I feel, is a bit overstated. We have bug, we fix them. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 10:38 ` Peter Zijlstra @ 2017-03-20 12:31 ` Rafael J. Wysocki 0 siblings, 0 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-20 12:31 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Monday, March 20, 2017 11:38:15 AM Peter Zijlstra wrote: > On Sun, Mar 19, 2017 at 10:24:24PM +0100, Rafael J. Wysocki wrote: > > > Honestly, if the processor had been capable of doing per-core P-states, that > > would have been a disaster and there are customers who wouldn't look at > > schedutil again after being confronted with these numbers. > > This, I feel, is a bit overstated. Maybe a bit, but I'm not really sure ... > We have bug, we fix them. Of course. :-) ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki 2017-03-19 21:24 ` Rafael J. Wysocki @ 2017-03-20 3:57 ` Viresh Kumar 2017-03-20 8:26 ` Vincent Guittot 2017-03-20 12:48 ` Rafael J. Wysocki 2017-03-20 10:36 ` Peter Zijlstra 2017-03-20 21:46 ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki 3 siblings, 2 replies; 71+ messages in thread From: Viresh Kumar @ 2017-03-20 3:57 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 19-03-17, 14:34, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > The PELT metric used by the schedutil governor underestimates the > CPU utilization in some cases. The reason for that may be time spent > in interrupt handlers and similar which is not accounted for by PELT. > > That can be easily demonstrated by running kernel compilation on > a Sandy Bridge Intel processor, running turbostat in parallel with > it and looking at the values written to the MSR_IA32_PERF_CTL > register. Namely, the expected result would be that when all CPUs > were 100% busy, all of them would be requested to run in the maximum > P-state, but observation shows that this clearly isn't the case. > The CPUs run in the maximum P-state for a while and then are > requested to run slower and go back to the maximum P-state after > a while again. That causes the actual frequency of the processor to > visibly oscillate below the sustainable maximum in a jittery fashion > which clearly is not desirable. > > To work around this issue use the observation that, from the > schedutil governor's perspective, CPUs that are never idle should > always run at the maximum frequency and make that happen. > > To that end, add a counter of idle calls to struct sugov_cpu and > modify cpuidle_idle_call() to increment that counter every time it > is about to put the given CPU into an idle state. Next, make the > schedutil governor look at that counter for the current CPU every > time before it is about to start heavy computations. If the counter > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), > the CPU has not been idle for at least that long and the governor > will choose the maximum frequency for it without looking at the PELT > metric at all. Looks like we are fixing a PELT problem with a schedutil Hack :) > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > include/linux/sched/cpufreq.h | 6 ++++++ > kernel/sched/cpufreq_schedutil.c | 38 ++++++++++++++++++++++++++++++++++++-- > kernel/sched/idle.c | 3 +++ > 3 files changed, 45 insertions(+), 2 deletions(-) > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > =================================================================== > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > +++ linux-pm/kernel/sched/cpufreq_schedutil.c > @@ -20,6 +20,7 @@ > #include "sched.h" > > #define SUGOV_KTHREAD_PRIORITY 50 > +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC) > > struct sugov_tunables { > struct gov_attr_set attr_set; > @@ -55,6 +56,9 @@ struct sugov_cpu { > > unsigned long iowait_boost; > unsigned long iowait_boost_max; > + unsigned long idle_calls; > + unsigned long saved_idle_calls; > + u64 busy_start; > u64 last_update; > > /* The fields below are only needed when sharing a policy. */ > @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su > sg_cpu->iowait_boost >>= 1; > } > > +void cpufreq_schedutil_idle_call(void) > +{ > + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu); > + > + sg_cpu->idle_calls++; > +} > + > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > +{ > + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) { > + sg_cpu->busy_start = 0; > + return false; > + } > + > + if (!sg_cpu->busy_start) { > + sg_cpu->busy_start = sg_cpu->last_update; > + return false; > + } > + > + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD; > +} > + > +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu) > +{ > + if (!sg_cpu->busy_start) > + sg_cpu->saved_idle_calls = sg_cpu->idle_calls; Why aren't we doing this in sugov_cpu_is_busy() itself ? And isn't it possible for idle_calls to get incremented by this time? > +} > + > static void sugov_update_single(struct update_util_data *hook, u64 time, > unsigned int flags) > { > @@ -207,7 +239,7 @@ static void sugov_update_single(struct u > if (!sugov_should_update_freq(sg_policy, time)) > return; > > - if (flags & SCHED_CPUFREQ_RT_DL) { > + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) { > next_f = policy->cpuinfo.max_freq; > } else { > sugov_get_util(&util, &max); > @@ -215,6 +247,7 @@ static void sugov_update_single(struct u > next_f = get_next_freq(sg_policy, util, max); > } > sugov_update_commit(sg_policy, time, next_f); > + sugov_save_idle_calls(sg_cpu); > } > > static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu) > @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u > sg_cpu->last_update = time; > > if (sugov_should_update_freq(sg_policy, time)) { > - if (flags & SCHED_CPUFREQ_RT_DL) > + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) What about others CPUs in this policy? > next_f = sg_policy->policy->cpuinfo.max_freq; > else > next_f = sugov_next_freq_shared(sg_cpu); > > sugov_update_commit(sg_policy, time, next_f); > + sugov_save_idle_calls(sg_cpu); > } > > raw_spin_unlock(&sg_policy->update_lock); -- viresh ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 3:57 ` Viresh Kumar @ 2017-03-20 8:26 ` Vincent Guittot 2017-03-20 12:34 ` Patrick Bellasi 2017-03-20 12:59 ` Rafael J. Wysocki 2017-03-20 12:48 ` Rafael J. Wysocki 1 sibling, 2 replies; 71+ messages in thread From: Vincent Guittot @ 2017-03-20 8:26 UTC (permalink / raw) To: Viresh Kumar Cc: Rafael J. Wysocki, Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote: > On 19-03-17, 14:34, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> The PELT metric used by the schedutil governor underestimates the >> CPU utilization in some cases. The reason for that may be time spent >> in interrupt handlers and similar which is not accounted for by PELT. Are you sure of the root cause described above (time stolen by irq handler) or is it just a hypotheses ? That would be good to be sure of the root cause Furthermore, IIRC the time spent in irq context is also accounted as run time for the running cfs task but not RT and deadline task running time So I'm not really aligned with the description of your problem: PELT metric underestimates the load of the CPU. The PELT is just about tracking CFS task utilization but not whole CPU utilization and according to your description of the problem (time stolen by irq), your problem doesn't come from an underestimation of CFS task but from time spent in something else but not accounted in the value used by schedutil That would be good to be sure of what is running during this not accounted time and find a way to account them >> >> That can be easily demonstrated by running kernel compilation on >> a Sandy Bridge Intel processor, running turbostat in parallel with >> it and looking at the values written to the MSR_IA32_PERF_CTL >> register. Namely, the expected result would be that when all CPUs >> were 100% busy, all of them would be requested to run in the maximum >> P-state, but observation shows that this clearly isn't the case. >> The CPUs run in the maximum P-state for a while and then are >> requested to run slower and go back to the maximum P-state after >> a while again. That causes the actual frequency of the processor to >> visibly oscillate below the sustainable maximum in a jittery fashion >> which clearly is not desirable. >> >> To work around this issue use the observation that, from the >> schedutil governor's perspective, CPUs that are never idle should >> always run at the maximum frequency and make that happen. >> >> To that end, add a counter of idle calls to struct sugov_cpu and >> modify cpuidle_idle_call() to increment that counter every time it >> is about to put the given CPU into an idle state. Next, make the >> schedutil governor look at that counter for the current CPU every >> time before it is about to start heavy computations. If the counter >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), >> the CPU has not been idle for at least that long and the governor >> will choose the maximum frequency for it without looking at the PELT >> metric at all. > > Looks like we are fixing a PELT problem with a schedutil Hack :) I would not say that it's a PELT problem (at least based on current description) but more that we don't track all activities of CPU > >> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> --- >> include/linux/sched/cpufreq.h | 6 ++++++ >> kernel/sched/cpufreq_schedutil.c | 38 ++++++++++++++++++++++++++++++++++++-- >> kernel/sched/idle.c | 3 +++ >> 3 files changed, 45 insertions(+), 2 deletions(-) >> >> Index: linux-pm/kernel/sched/cpufreq_schedutil.c >> =================================================================== >> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c >> +++ linux-pm/kernel/sched/cpufreq_schedutil.c >> @@ -20,6 +20,7 @@ >> #include "sched.h" >> >> #define SUGOV_KTHREAD_PRIORITY 50 >> +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC) >> >> struct sugov_tunables { >> struct gov_attr_set attr_set; >> @@ -55,6 +56,9 @@ struct sugov_cpu { >> >> unsigned long iowait_boost; >> unsigned long iowait_boost_max; >> + unsigned long idle_calls; >> + unsigned long saved_idle_calls; >> + u64 busy_start; >> u64 last_update; >> >> /* The fields below are only needed when sharing a policy. */ >> @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su >> sg_cpu->iowait_boost >>= 1; >> } >> >> +void cpufreq_schedutil_idle_call(void) >> +{ >> + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu); >> + >> + sg_cpu->idle_calls++; >> +} >> + >> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) >> +{ >> + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) { >> + sg_cpu->busy_start = 0; >> + return false; >> + } >> + >> + if (!sg_cpu->busy_start) { >> + sg_cpu->busy_start = sg_cpu->last_update; >> + return false; >> + } >> + >> + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD; >> +} >> + >> +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu) >> +{ >> + if (!sg_cpu->busy_start) >> + sg_cpu->saved_idle_calls = sg_cpu->idle_calls; > > Why aren't we doing this in sugov_cpu_is_busy() itself ? And isn't it possible > for idle_calls to get incremented by this time? > >> +} >> + >> static void sugov_update_single(struct update_util_data *hook, u64 time, >> unsigned int flags) >> { >> @@ -207,7 +239,7 @@ static void sugov_update_single(struct u >> if (!sugov_should_update_freq(sg_policy, time)) >> return; >> >> - if (flags & SCHED_CPUFREQ_RT_DL) { >> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) { >> next_f = policy->cpuinfo.max_freq; >> } else { >> sugov_get_util(&util, &max); >> @@ -215,6 +247,7 @@ static void sugov_update_single(struct u >> next_f = get_next_freq(sg_policy, util, max); >> } >> sugov_update_commit(sg_policy, time, next_f); >> + sugov_save_idle_calls(sg_cpu); >> } >> >> static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu) >> @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u >> sg_cpu->last_update = time; >> >> if (sugov_should_update_freq(sg_policy, time)) { >> - if (flags & SCHED_CPUFREQ_RT_DL) >> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) > > What about others CPUs in this policy? > >> next_f = sg_policy->policy->cpuinfo.max_freq; >> else >> next_f = sugov_next_freq_shared(sg_cpu); >> >> sugov_update_commit(sg_policy, time, next_f); >> + sugov_save_idle_calls(sg_cpu); >> } >> >> raw_spin_unlock(&sg_policy->update_lock); > > -- > viresh ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 8:26 ` Vincent Guittot @ 2017-03-20 12:34 ` Patrick Bellasi 2017-03-22 23:56 ` Joel Fernandes 2017-03-20 12:59 ` Rafael J. Wysocki 1 sibling, 1 reply; 71+ messages in thread From: Patrick Bellasi @ 2017-03-20 12:34 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Viresh Kumar, Vincent Guittot, Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 20-Mar 09:26, Vincent Guittot wrote: > On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote: > > On 19-03-17, 14:34, Rafael J. Wysocki wrote: > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > >> > >> The PELT metric used by the schedutil governor underestimates the > >> CPU utilization in some cases. The reason for that may be time spent > >> in interrupt handlers and similar which is not accounted for by PELT. > > Are you sure of the root cause described above (time stolen by irq > handler) or is it just a hypotheses ? That would be good to be sure of > the root cause > Furthermore, IIRC the time spent in irq context is also accounted as > run time for the running cfs task but not RT and deadline task running > time As long as the IRQ processing does not generate a context switch, which is happening (eventually) if the top half schedule some deferred work to be executed by a bottom half. Thus, me too I would say that all the top half time is accounted in PELT, since the current task is still RUNNABLE/RUNNING. > So I'm not really aligned with the description of your problem: PELT > metric underestimates the load of the CPU. The PELT is just about > tracking CFS task utilization but not whole CPU utilization and > according to your description of the problem (time stolen by irq), > your problem doesn't come from an underestimation of CFS task but from > time spent in something else but not accounted in the value used by > schedutil Quite likely. Indeed, it can really be that the CFS task is preempted because of some RT activity generated by the IRQ handler. More in general, I've also noticed many suboptimal freq switches when RT tasks interleave with CFS ones, because of: - relatively long down _and up_ throttling times - the way schedutil's flags are tracked and updated - the callsites from where we call schedutil updates For example it can really happen that we are running at the highest OPP because of some RT activity. Then we switch back to a relatively low utilization CFS workload and then: 1. a tick happens which produces a frequency drop 2. right after a new IRQ starts a RT task Well, because of the schedutil's throttling mechanism we can end up running at the reduce OPP for quite long before (eventually) going back to the higher OPP... if and only we are lucky enough to get a new RT task starting outside of a throttling window. I guess that if we have quite a lot of IRQ->RT activations in a CPU with a relatively low CFS utilization, this unwanted frequency hopping is quite likely. > That would be good to be sure of what is running during this not > accounted time and find a way to account them Do we have any number to characterize the frequency of IRQ/RT activations? We should notice also that, while CFS tasks are preempted by RT ones, the PELT signal decays. This can contribute on lowering even further the utilization demand from the CFS class. > >> That can be easily demonstrated by running kernel compilation on > >> a Sandy Bridge Intel processor, running turbostat in parallel with > >> it and looking at the values written to the MSR_IA32_PERF_CTL > >> register. Namely, the expected result would be that when all CPUs > >> were 100% busy, all of them would be requested to run in the maximum > >> P-state, but observation shows that this clearly isn't the case. Can you share a trace with at least sched_switch events enabled? This can help on identify which workloads are there and how they interact with schedutil. > >> The CPUs run in the maximum P-state for a while and then are > >> requested to run slower and go back to the maximum P-state after > >> a while again. That causes the actual frequency of the processor to That's possibly exactly the pattern I've described above. > >> visibly oscillate below the sustainable maximum in a jittery fashion > >> which clearly is not desirable. > >> > >> To work around this issue use the observation that, from the > >> schedutil governor's perspective, CPUs that are never idle should > >> always run at the maximum frequency and make that happen. Perhaps some of the patches I've posted in this series: https://lkml.org/lkml/2017/3/2/385 can help on that side. The pattern you report seems to me quite matching with the case "b" described in the cover letter. I've also shared a notebook: https://gist.github.com/derkling/d6a21b459a18091b2b058668a550010d which clearly shows that behavior happening in the first plot, cell named Out[15]. Did you had a chance to have a look at them? It would be nice to know if they can provide some benefits to your use-case. > >> To that end, add a counter of idle calls to struct sugov_cpu and > >> modify cpuidle_idle_call() to increment that counter every time it > >> is about to put the given CPU into an idle state. Next, make the > >> schedutil governor look at that counter for the current CPU every > >> time before it is about to start heavy computations. If the counter > >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), > >> the CPU has not been idle for at least that long and the governor > >> will choose the maximum frequency for it without looking at the PELT > >> metric at all. > > > > Looks like we are fixing a PELT problem with a schedutil Hack :) > > I would not say that it's a PELT problem (at least based on current > description) but more that we don't track all > activities of CPU Agree... with both comments. > >> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > >> --- > >> include/linux/sched/cpufreq.h | 6 ++++++ > >> kernel/sched/cpufreq_schedutil.c | 38 ++++++++++++++++++++++++++++++++++++-- > >> kernel/sched/idle.c | 3 +++ > >> 3 files changed, 45 insertions(+), 2 deletions(-) > >> > >> Index: linux-pm/kernel/sched/cpufreq_schedutil.c > >> =================================================================== > >> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > >> +++ linux-pm/kernel/sched/cpufreq_schedutil.c > >> @@ -20,6 +20,7 @@ > >> #include "sched.h" > >> > >> #define SUGOV_KTHREAD_PRIORITY 50 > >> +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC) > >> > >> struct sugov_tunables { > >> struct gov_attr_set attr_set; > >> @@ -55,6 +56,9 @@ struct sugov_cpu { > >> > >> unsigned long iowait_boost; > >> unsigned long iowait_boost_max; > >> + unsigned long idle_calls; > >> + unsigned long saved_idle_calls; > >> + u64 busy_start; > >> u64 last_update; > >> > >> /* The fields below are only needed when sharing a policy. */ > >> @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su > >> sg_cpu->iowait_boost >>= 1; > >> } > >> > >> +void cpufreq_schedutil_idle_call(void) > >> +{ > >> + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu); > >> + > >> + sg_cpu->idle_calls++; > >> +} > >> + > >> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > >> +{ > >> + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) { > >> + sg_cpu->busy_start = 0; > >> + return false; > >> + } > >> + > >> + if (!sg_cpu->busy_start) { > >> + sg_cpu->busy_start = sg_cpu->last_update; > >> + return false; > >> + } > >> + > >> + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD; > >> +} > >> + > >> +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu) > >> +{ > >> + if (!sg_cpu->busy_start) > >> + sg_cpu->saved_idle_calls = sg_cpu->idle_calls; > > > > Why aren't we doing this in sugov_cpu_is_busy() itself ? And isn't it possible > > for idle_calls to get incremented by this time? > > > >> +} > >> + > >> static void sugov_update_single(struct update_util_data *hook, u64 time, > >> unsigned int flags) > >> { > >> @@ -207,7 +239,7 @@ static void sugov_update_single(struct u > >> if (!sugov_should_update_freq(sg_policy, time)) > >> return; > >> > >> - if (flags & SCHED_CPUFREQ_RT_DL) { > >> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) { > >> next_f = policy->cpuinfo.max_freq; > >> } else { > >> sugov_get_util(&util, &max); > >> @@ -215,6 +247,7 @@ static void sugov_update_single(struct u > >> next_f = get_next_freq(sg_policy, util, max); > >> } > >> sugov_update_commit(sg_policy, time, next_f); > >> + sugov_save_idle_calls(sg_cpu); > >> } > >> > >> static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu) > >> @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u > >> sg_cpu->last_update = time; > >> > >> if (sugov_should_update_freq(sg_policy, time)) { > >> - if (flags & SCHED_CPUFREQ_RT_DL) > >> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) > > > > What about others CPUs in this policy? > > > >> next_f = sg_policy->policy->cpuinfo.max_freq; > >> else > >> next_f = sugov_next_freq_shared(sg_cpu); > >> > >> sugov_update_commit(sg_policy, time, next_f); > >> + sugov_save_idle_calls(sg_cpu); > >> } > >> > >> raw_spin_unlock(&sg_policy->update_lock); > > > > -- > > viresh -- #include <best/regards.h> Patrick Bellasi ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 12:34 ` Patrick Bellasi @ 2017-03-22 23:56 ` Joel Fernandes 2017-03-23 22:08 ` Vincent Guittot 0 siblings, 1 reply; 71+ messages in thread From: Joel Fernandes @ 2017-03-22 23:56 UTC (permalink / raw) To: Patrick Bellasi Cc: Rafael J. Wysocki, Viresh Kumar, Vincent Guittot, Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Morten Rasmussen, Ingo Molnar On Mon, Mar 20, 2017 at 5:34 AM, Patrick Bellasi <patrick.bellasi@arm.com> wrote: > On 20-Mar 09:26, Vincent Guittot wrote: >> On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote: >> > On 19-03-17, 14:34, Rafael J. Wysocki wrote: >> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> >> >> The PELT metric used by the schedutil governor underestimates the >> >> CPU utilization in some cases. The reason for that may be time spent >> >> in interrupt handlers and similar which is not accounted for by PELT. >> >> Are you sure of the root cause described above (time stolen by irq >> handler) or is it just a hypotheses ? That would be good to be sure of >> the root cause >> Furthermore, IIRC the time spent in irq context is also accounted as >> run time for the running cfs task but not RT and deadline task running >> time > > As long as the IRQ processing does not generate a context switch, > which is happening (eventually) if the top half schedule some deferred > work to be executed by a bottom half. > > Thus, me too I would say that all the top half time is accounted in > PELT, since the current task is still RUNNABLE/RUNNING. Sorry if I'm missing something but doesn't this depend on whether you have CONFIG_IRQ_TIME_ACCOUNTING enabled? __update_load_avg uses rq->clock_task for deltas which I think shouldn't account IRQ time with that config option. So it should be quite possible for IRQ time spent to reduce the PELT signal right? > >> So I'm not really aligned with the description of your problem: PELT >> metric underestimates the load of the CPU. The PELT is just about >> tracking CFS task utilization but not whole CPU utilization and >> according to your description of the problem (time stolen by irq), >> your problem doesn't come from an underestimation of CFS task but from >> time spent in something else but not accounted in the value used by >> schedutil > > Quite likely. Indeed, it can really be that the CFS task is preempted > because of some RT activity generated by the IRQ handler. > > More in general, I've also noticed many suboptimal freq switches when > RT tasks interleave with CFS ones, because of: > - relatively long down _and up_ throttling times > - the way schedutil's flags are tracked and updated > - the callsites from where we call schedutil updates > > For example it can really happen that we are running at the highest > OPP because of some RT activity. Then we switch back to a relatively > low utilization CFS workload and then: > 1. a tick happens which produces a frequency drop Any idea why this frequency drop would happen? Say a running CFS task gets preempted by RT task, the PELT signal shouldn't drop for the duration the CFS task is preempted because the task is runnable, so once the CFS task gets CPU back, schedutil should still maintain the capacity right? Regards, Joel ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-22 23:56 ` Joel Fernandes @ 2017-03-23 22:08 ` Vincent Guittot 2017-03-25 3:48 ` Joel Fernandes 0 siblings, 1 reply; 71+ messages in thread From: Vincent Guittot @ 2017-03-23 22:08 UTC (permalink / raw) To: Joel Fernandes Cc: Patrick Bellasi, Rafael J. Wysocki, Viresh Kumar, Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Morten Rasmussen, Ingo Molnar On 23 March 2017 at 00:56, Joel Fernandes <joelaf@google.com> wrote: > On Mon, Mar 20, 2017 at 5:34 AM, Patrick Bellasi > <patrick.bellasi@arm.com> wrote: >> On 20-Mar 09:26, Vincent Guittot wrote: >>> On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote: >>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote: >>> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>> >> >>> >> The PELT metric used by the schedutil governor underestimates the >>> >> CPU utilization in some cases. The reason for that may be time spent >>> >> in interrupt handlers and similar which is not accounted for by PELT. >>> >>> Are you sure of the root cause described above (time stolen by irq >>> handler) or is it just a hypotheses ? That would be good to be sure of >>> the root cause >>> Furthermore, IIRC the time spent in irq context is also accounted as >>> run time for the running cfs task but not RT and deadline task running >>> time >> >> As long as the IRQ processing does not generate a context switch, >> which is happening (eventually) if the top half schedule some deferred >> work to be executed by a bottom half. >> >> Thus, me too I would say that all the top half time is accounted in >> PELT, since the current task is still RUNNABLE/RUNNING. > > Sorry if I'm missing something but doesn't this depend on whether you > have CONFIG_IRQ_TIME_ACCOUNTING enabled? > > __update_load_avg uses rq->clock_task for deltas which I think > shouldn't account IRQ time with that config option. So it should be > quite possible for IRQ time spent to reduce the PELT signal right? > >> >>> So I'm not really aligned with the description of your problem: PELT >>> metric underestimates the load of the CPU. The PELT is just about >>> tracking CFS task utilization but not whole CPU utilization and >>> according to your description of the problem (time stolen by irq), >>> your problem doesn't come from an underestimation of CFS task but from >>> time spent in something else but not accounted in the value used by >>> schedutil >> >> Quite likely. Indeed, it can really be that the CFS task is preempted >> because of some RT activity generated by the IRQ handler. >> >> More in general, I've also noticed many suboptimal freq switches when >> RT tasks interleave with CFS ones, because of: >> - relatively long down _and up_ throttling times >> - the way schedutil's flags are tracked and updated >> - the callsites from where we call schedutil updates >> >> For example it can really happen that we are running at the highest >> OPP because of some RT activity. Then we switch back to a relatively >> low utilization CFS workload and then: >> 1. a tick happens which produces a frequency drop > > Any idea why this frequency drop would happen? Say a running CFS task > gets preempted by RT task, the PELT signal shouldn't drop for the > duration the CFS task is preempted because the task is runnable, so utilization only tracks the running state but not runnable state. Runnable state is tracked in load_avg > once the CFS task gets CPU back, schedutil should still maintain the > capacity right? > > Regards, > Joel ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-23 22:08 ` Vincent Guittot @ 2017-03-25 3:48 ` Joel Fernandes 2017-03-27 6:59 ` Vincent Guittot 0 siblings, 1 reply; 71+ messages in thread From: Joel Fernandes @ 2017-03-25 3:48 UTC (permalink / raw) To: Vincent Guittot Cc: Patrick Bellasi, Rafael J. Wysocki, Viresh Kumar, Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Morten Rasmussen, Ingo Molnar Hi Vincent, On Thu, Mar 23, 2017 at 3:08 PM, Vincent Guittot <vincent.guittot@linaro.org> wrote: [..] >>> >>>> So I'm not really aligned with the description of your problem: PELT >>>> metric underestimates the load of the CPU. The PELT is just about >>>> tracking CFS task utilization but not whole CPU utilization and >>>> according to your description of the problem (time stolen by irq), >>>> your problem doesn't come from an underestimation of CFS task but from >>>> time spent in something else but not accounted in the value used by >>>> schedutil >>> >>> Quite likely. Indeed, it can really be that the CFS task is preempted >>> because of some RT activity generated by the IRQ handler. >>> >>> More in general, I've also noticed many suboptimal freq switches when >>> RT tasks interleave with CFS ones, because of: >>> - relatively long down _and up_ throttling times >>> - the way schedutil's flags are tracked and updated >>> - the callsites from where we call schedutil updates >>> >>> For example it can really happen that we are running at the highest >>> OPP because of some RT activity. Then we switch back to a relatively >>> low utilization CFS workload and then: >>> 1. a tick happens which produces a frequency drop >> >> Any idea why this frequency drop would happen? Say a running CFS task >> gets preempted by RT task, the PELT signal shouldn't drop for the >> duration the CFS task is preempted because the task is runnable, so > > utilization only tracks the running state but not runnable state. > Runnable state is tracked in load_avg Thanks. I got it now. Correct me if I'm wrong but strictly speaking utilization for a cfs_rq (which drives the frequency for CFS) still tracks the blocked/runnable time of tasks although its decayed as time moves forward. Only when we migrate the rq of a cfs task is the util_avg contribution removed from the rq. But I can see now why running RT can decay this load tracking signal. Regards, Joel ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-25 3:48 ` Joel Fernandes @ 2017-03-27 6:59 ` Vincent Guittot 0 siblings, 0 replies; 71+ messages in thread From: Vincent Guittot @ 2017-03-27 6:59 UTC (permalink / raw) To: Joel Fernandes Cc: Patrick Bellasi, Rafael J. Wysocki, Viresh Kumar, Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Morten Rasmussen, Ingo Molnar On 25 March 2017 at 04:48, Joel Fernandes <joelaf@google.com> wrote: > Hi Vincent, > > On Thu, Mar 23, 2017 at 3:08 PM, Vincent Guittot > <vincent.guittot@linaro.org> wrote: > [..] >>>> >>>>> So I'm not really aligned with the description of your problem: PELT >>>>> metric underestimates the load of the CPU. The PELT is just about >>>>> tracking CFS task utilization but not whole CPU utilization and >>>>> according to your description of the problem (time stolen by irq), >>>>> your problem doesn't come from an underestimation of CFS task but from >>>>> time spent in something else but not accounted in the value used by >>>>> schedutil >>>> >>>> Quite likely. Indeed, it can really be that the CFS task is preempted >>>> because of some RT activity generated by the IRQ handler. >>>> >>>> More in general, I've also noticed many suboptimal freq switches when >>>> RT tasks interleave with CFS ones, because of: >>>> - relatively long down _and up_ throttling times >>>> - the way schedutil's flags are tracked and updated >>>> - the callsites from where we call schedutil updates >>>> >>>> For example it can really happen that we are running at the highest >>>> OPP because of some RT activity. Then we switch back to a relatively >>>> low utilization CFS workload and then: >>>> 1. a tick happens which produces a frequency drop >>> >>> Any idea why this frequency drop would happen? Say a running CFS task >>> gets preempted by RT task, the PELT signal shouldn't drop for the >>> duration the CFS task is preempted because the task is runnable, so >> >> utilization only tracks the running state but not runnable state. >> Runnable state is tracked in load_avg > > Thanks. I got it now. > > Correct me if I'm wrong but strictly speaking utilization for a cfs_rq > (which drives the frequency for CFS) still tracks the blocked/runnable > time of tasks although its decayed as time moves forward. Only when we > migrate the rq of a cfs task is the util_avg contribution removed from > the rq. But I can see now why running RT can decay this load tracking > signal. Yes. you're right > > Regards, > Joel ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 8:26 ` Vincent Guittot 2017-03-20 12:34 ` Patrick Bellasi @ 2017-03-20 12:59 ` Rafael J. Wysocki 2017-03-20 13:20 ` Vincent Guittot 1 sibling, 1 reply; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-20 12:59 UTC (permalink / raw) To: Vincent Guittot Cc: Viresh Kumar, Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Monday, March 20, 2017 09:26:34 AM Vincent Guittot wrote: > On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote: > > On 19-03-17, 14:34, Rafael J. Wysocki wrote: > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > >> > >> The PELT metric used by the schedutil governor underestimates the > >> CPU utilization in some cases. The reason for that may be time spent > >> in interrupt handlers and similar which is not accounted for by PELT. > > Are you sure of the root cause described above (time stolen by irq > handler) or is it just a hypotheses ? That would be good to be sure of > the root cause No, I'm not sure. That's why I said "may be". :-) > Furthermore, IIRC the time spent in irq context is also accounted as > run time for the running cfs task but not RT and deadline task running > time OK Anyway, the problem is that we have a 100% busy CPU which quite evidently is not reported as 100% busy by the metric we use. Now, if I was sure about the root cause, I would fix it rather than suggest workarounds. > So I'm not really aligned with the description of your problem: PELT > metric underestimates the load of the CPU. The PELT is just about > tracking CFS task utilization but not whole CPU utilization and > according to your description of the problem (time stolen by irq), > your problem doesn't come from an underestimation of CFS task but from > time spent in something else but not accounted in the value used by > schedutil That's fair enough, but the assumption was that PELT would be sufficient for that. To me, it clearly isn't, so the assumption was incorrect. > That would be good to be sure of what is running during this not > accounted time and find a way to account them Yes, I agree. I'm not sure if I can carry out full investigation of that any time soon, however. I sent this mostly to let everybody know that there was a problem and how it could be worked around. That's why it is an RFC. > > >> > >> That can be easily demonstrated by running kernel compilation on > >> a Sandy Bridge Intel processor, running turbostat in parallel with > >> it and looking at the values written to the MSR_IA32_PERF_CTL > >> register. Namely, the expected result would be that when all CPUs > >> were 100% busy, all of them would be requested to run in the maximum > >> P-state, but observation shows that this clearly isn't the case. > >> The CPUs run in the maximum P-state for a while and then are > >> requested to run slower and go back to the maximum P-state after > >> a while again. That causes the actual frequency of the processor to > >> visibly oscillate below the sustainable maximum in a jittery fashion > >> which clearly is not desirable. > >> > >> To work around this issue use the observation that, from the > >> schedutil governor's perspective, CPUs that are never idle should > >> always run at the maximum frequency and make that happen. > >> > >> To that end, add a counter of idle calls to struct sugov_cpu and > >> modify cpuidle_idle_call() to increment that counter every time it > >> is about to put the given CPU into an idle state. Next, make the > >> schedutil governor look at that counter for the current CPU every > >> time before it is about to start heavy computations. If the counter > >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), > >> the CPU has not been idle for at least that long and the governor > >> will choose the maximum frequency for it without looking at the PELT > >> metric at all. > > > > Looks like we are fixing a PELT problem with a schedutil Hack :) > > I would not say that it's a PELT problem (at least based on current > description) but more that we don't track all > activities of CPU I generally agree. PELT does what it does. However, using PELT the way we do that in schedutil turns out to be problematic. Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 12:59 ` Rafael J. Wysocki @ 2017-03-20 13:20 ` Vincent Guittot 0 siblings, 0 replies; 71+ messages in thread From: Vincent Guittot @ 2017-03-20 13:20 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Viresh Kumar, Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 20 March 2017 at 13:59, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > On Monday, March 20, 2017 09:26:34 AM Vincent Guittot wrote: >> On 20 March 2017 at 04:57, Viresh Kumar <viresh.kumar@linaro.org> wrote: >> > On 19-03-17, 14:34, Rafael J. Wysocki wrote: >> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> >> >> The PELT metric used by the schedutil governor underestimates the >> >> CPU utilization in some cases. The reason for that may be time spent >> >> in interrupt handlers and similar which is not accounted for by PELT. >> >> Are you sure of the root cause described above (time stolen by irq >> handler) or is it just a hypotheses ? That would be good to be sure of >> the root cause > > No, I'm not sure. That's why I said "may be". :-) > >> Furthermore, IIRC the time spent in irq context is also accounted as >> run time for the running cfs task but not RT and deadline task running >> time > > OK > > Anyway, the problem is that we have a 100% busy CPU which quite evidently > is not reported as 100% busy by the metric we use. > > Now, if I was sure about the root cause, I would fix it rather than suggest > workarounds. > >> So I'm not really aligned with the description of your problem: PELT >> metric underestimates the load of the CPU. The PELT is just about >> tracking CFS task utilization but not whole CPU utilization and >> according to your description of the problem (time stolen by irq), >> your problem doesn't come from an underestimation of CFS task but from >> time spent in something else but not accounted in the value used by >> schedutil > > That's fair enough, but the assumption was that PELT would be sufficient for > that. To me, it clearly isn't, so the assumption was incorrect. So PELT is just about CFS utilization and we need metrics for other sched class as well I know that Juri and other people are working on the dealine part In current kernel, the only available metric is rt_avg (and scale_rt_capacity) which gives you the available compute capacity for CFS and in some way the amount of capacity used by RT and deadline. That would be interesting to see if the amount of capacity used by RT tasks (i assume that you don't have deadline task in your use case) is similar to the amount of underestimation of CFS > >> That would be good to be sure of what is running during this not >> accounted time and find a way to account them > > Yes, I agree. > > I'm not sure if I can carry out full investigation of that any time soon, however. If you can share traces as Patrick proposed, we could estimate how much time/capacity is used by rt tasks and if this can explain the underestimation. > > I sent this mostly to let everybody know that there was a problem and how it > could be worked around. That's why it is an RFC. Ok > >> >> >> >> >> That can be easily demonstrated by running kernel compilation on >> >> a Sandy Bridge Intel processor, running turbostat in parallel with >> >> it and looking at the values written to the MSR_IA32_PERF_CTL >> >> register. Namely, the expected result would be that when all CPUs >> >> were 100% busy, all of them would be requested to run in the maximum >> >> P-state, but observation shows that this clearly isn't the case. >> >> The CPUs run in the maximum P-state for a while and then are >> >> requested to run slower and go back to the maximum P-state after >> >> a while again. That causes the actual frequency of the processor to >> >> visibly oscillate below the sustainable maximum in a jittery fashion >> >> which clearly is not desirable. >> >> >> >> To work around this issue use the observation that, from the >> >> schedutil governor's perspective, CPUs that are never idle should >> >> always run at the maximum frequency and make that happen. >> >> >> >> To that end, add a counter of idle calls to struct sugov_cpu and >> >> modify cpuidle_idle_call() to increment that counter every time it >> >> is about to put the given CPU into an idle state. Next, make the >> >> schedutil governor look at that counter for the current CPU every >> >> time before it is about to start heavy computations. If the counter >> >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), >> >> the CPU has not been idle for at least that long and the governor >> >> will choose the maximum frequency for it without looking at the PELT >> >> metric at all. >> > >> > Looks like we are fixing a PELT problem with a schedutil Hack :) >> >> I would not say that it's a PELT problem (at least based on current >> description) but more that we don't track all >> activities of CPU > > I generally agree. PELT does what it does. > > However, using PELT the way we do that in schedutil turns out to be problematic. yes. PELT is not enough but just part of the equation > > Thanks, > Rafael > ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 3:57 ` Viresh Kumar 2017-03-20 8:26 ` Vincent Guittot @ 2017-03-20 12:48 ` Rafael J. Wysocki 1 sibling, 0 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-20 12:48 UTC (permalink / raw) To: Viresh Kumar Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Monday, March 20, 2017 09:27:45 AM Viresh Kumar wrote: > On 19-03-17, 14:34, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > The PELT metric used by the schedutil governor underestimates the > > CPU utilization in some cases. The reason for that may be time spent > > in interrupt handlers and similar which is not accounted for by PELT. > > > > That can be easily demonstrated by running kernel compilation on > > a Sandy Bridge Intel processor, running turbostat in parallel with > > it and looking at the values written to the MSR_IA32_PERF_CTL > > register. Namely, the expected result would be that when all CPUs > > were 100% busy, all of them would be requested to run in the maximum > > P-state, but observation shows that this clearly isn't the case. > > The CPUs run in the maximum P-state for a while and then are > > requested to run slower and go back to the maximum P-state after > > a while again. That causes the actual frequency of the processor to > > visibly oscillate below the sustainable maximum in a jittery fashion > > which clearly is not desirable. > > > > To work around this issue use the observation that, from the > > schedutil governor's perspective, CPUs that are never idle should > > always run at the maximum frequency and make that happen. > > > > To that end, add a counter of idle calls to struct sugov_cpu and > > modify cpuidle_idle_call() to increment that counter every time it > > is about to put the given CPU into an idle state. Next, make the > > schedutil governor look at that counter for the current CPU every > > time before it is about to start heavy computations. If the counter > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), > > the CPU has not been idle for at least that long and the governor > > will choose the maximum frequency for it without looking at the PELT > > metric at all. > > Looks like we are fixing a PELT problem with a schedutil Hack :) Did you notice the "To work around this issue" phrase above? This is a workaround, not a fix. But it is an optimization too, because avoiding heavy computations if they are not necessary should be a win in any case. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > --- > > include/linux/sched/cpufreq.h | 6 ++++++ > > kernel/sched/cpufreq_schedutil.c | 38 ++++++++++++++++++++++++++++++++++++-- > > kernel/sched/idle.c | 3 +++ > > 3 files changed, 45 insertions(+), 2 deletions(-) > > > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > > =================================================================== > > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > > +++ linux-pm/kernel/sched/cpufreq_schedutil.c > > @@ -20,6 +20,7 @@ > > #include "sched.h" > > > > #define SUGOV_KTHREAD_PRIORITY 50 > > +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC) > > > > struct sugov_tunables { > > struct gov_attr_set attr_set; > > @@ -55,6 +56,9 @@ struct sugov_cpu { > > > > unsigned long iowait_boost; > > unsigned long iowait_boost_max; > > + unsigned long idle_calls; > > + unsigned long saved_idle_calls; The above two could be unsigned int, BTW. > > + u64 busy_start; > > u64 last_update; > > > > /* The fields below are only needed when sharing a policy. */ > > @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su > > sg_cpu->iowait_boost >>= 1; > > } > > > > +void cpufreq_schedutil_idle_call(void) > > +{ > > + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu); > > + > > + sg_cpu->idle_calls++; > > +} > > + > > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > > +{ > > + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) { > > + sg_cpu->busy_start = 0; > > + return false; > > + } > > + > > + if (!sg_cpu->busy_start) { > > + sg_cpu->busy_start = sg_cpu->last_update; > > + return false; > > + } > > + > > + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD; > > +} > > + > > +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu) > > +{ > > + if (!sg_cpu->busy_start) > > + sg_cpu->saved_idle_calls = sg_cpu->idle_calls; > > Why aren't we doing this in sugov_cpu_is_busy() itself ? No, we aren't. sugov_cpu_is_busy() may not be called at all sometimes. > And isn't it possible for idle_calls to get incremented by this time? No, it isn't. The idle loop does that and after it has disabled interrupts. If this code is running, we surely are not in there on the same CPU, are we? > > +} > > + > > static void sugov_update_single(struct update_util_data *hook, u64 time, > > unsigned int flags) > > { > > @@ -207,7 +239,7 @@ static void sugov_update_single(struct u > > if (!sugov_should_update_freq(sg_policy, time)) > > return; > > > > - if (flags & SCHED_CPUFREQ_RT_DL) { > > + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) { > > next_f = policy->cpuinfo.max_freq; > > } else { > > sugov_get_util(&util, &max); > > @@ -215,6 +247,7 @@ static void sugov_update_single(struct u > > next_f = get_next_freq(sg_policy, util, max); > > } > > sugov_update_commit(sg_policy, time, next_f); > > + sugov_save_idle_calls(sg_cpu); > > } > > > > static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu) > > @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u > > sg_cpu->last_update = time; > > > > if (sugov_should_update_freq(sg_policy, time)) { > > - if (flags & SCHED_CPUFREQ_RT_DL) > > + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) > > What about others CPUs in this policy? We'll check this for them when they get updated. The point is that we can just bail out earlier here and avoid the heavy stuff, but we don't have to. This is opportunistic. > > next_f = sg_policy->policy->cpuinfo.max_freq; > > else > > next_f = sugov_next_freq_shared(sg_cpu); > > > > sugov_update_commit(sg_policy, time, next_f); > > + sugov_save_idle_calls(sg_cpu); > > } > > > > raw_spin_unlock(&sg_policy->update_lock); > > Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki 2017-03-19 21:24 ` Rafael J. Wysocki 2017-03-20 3:57 ` Viresh Kumar @ 2017-03-20 10:36 ` Peter Zijlstra 2017-03-20 12:35 ` Rafael J. Wysocki 2017-03-20 21:46 ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki 3 siblings, 1 reply; 71+ messages in thread From: Peter Zijlstra @ 2017-03-20 10:36 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > The PELT metric used by the schedutil governor underestimates the > CPU utilization in some cases. The reason for that may be time spent > in interrupt handlers and similar which is not accounted for by PELT. > > That can be easily demonstrated by running kernel compilation on > a Sandy Bridge Intel processor, running turbostat in parallel with > it and looking at the values written to the MSR_IA32_PERF_CTL > register. Namely, the expected result would be that when all CPUs > were 100% busy, all of them would be requested to run in the maximum > P-state, but observation shows that this clearly isn't the case. > The CPUs run in the maximum P-state for a while and then are > requested to run slower and go back to the maximum P-state after > a while again. That causes the actual frequency of the processor to > visibly oscillate below the sustainable maximum in a jittery fashion > which clearly is not desirable. > > To work around this issue use the observation that, from the > schedutil governor's perspective, CPUs that are never idle should > always run at the maximum frequency and make that happen. > > To that end, add a counter of idle calls to struct sugov_cpu and > modify cpuidle_idle_call() to increment that counter every time it > is about to put the given CPU into an idle state. Next, make the > schedutil governor look at that counter for the current CPU every > time before it is about to start heavy computations. If the counter > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), > the CPU has not been idle for at least that long and the governor > will choose the maximum frequency for it without looking at the PELT > metric at all. Why the time limit? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 10:36 ` Peter Zijlstra @ 2017-03-20 12:35 ` Rafael J. Wysocki 2017-03-20 12:50 ` Peter Zijlstra 0 siblings, 1 reply; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-20 12:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote: > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > The PELT metric used by the schedutil governor underestimates the > > CPU utilization in some cases. The reason for that may be time spent > > in interrupt handlers and similar which is not accounted for by PELT. > > > > That can be easily demonstrated by running kernel compilation on > > a Sandy Bridge Intel processor, running turbostat in parallel with > > it and looking at the values written to the MSR_IA32_PERF_CTL > > register. Namely, the expected result would be that when all CPUs > > were 100% busy, all of them would be requested to run in the maximum > > P-state, but observation shows that this clearly isn't the case. > > The CPUs run in the maximum P-state for a while and then are > > requested to run slower and go back to the maximum P-state after > > a while again. That causes the actual frequency of the processor to > > visibly oscillate below the sustainable maximum in a jittery fashion > > which clearly is not desirable. > > > > To work around this issue use the observation that, from the > > schedutil governor's perspective, CPUs that are never idle should > > always run at the maximum frequency and make that happen. > > > > To that end, add a counter of idle calls to struct sugov_cpu and > > modify cpuidle_idle_call() to increment that counter every time it > > is about to put the given CPU into an idle state. Next, make the > > schedutil governor look at that counter for the current CPU every > > time before it is about to start heavy computations. If the counter > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), > > the CPU has not been idle for at least that long and the governor > > will choose the maximum frequency for it without looking at the PELT > > metric at all. > > Why the time limit? One iteration appeared to be a bit too aggressive, but honestly I think I need to check again if this thing is regarded as viable at all. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 12:35 ` Rafael J. Wysocki @ 2017-03-20 12:50 ` Peter Zijlstra 2017-03-20 13:04 ` Rafael J. Wysocki 2017-03-20 13:06 ` Patrick Bellasi 0 siblings, 2 replies; 71+ messages in thread From: Peter Zijlstra @ 2017-03-20 12:50 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote: > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote: > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote: > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > > > The PELT metric used by the schedutil governor underestimates the > > > CPU utilization in some cases. The reason for that may be time spent > > > in interrupt handlers and similar which is not accounted for by PELT. > > > > > > That can be easily demonstrated by running kernel compilation on > > > a Sandy Bridge Intel processor, running turbostat in parallel with > > > it and looking at the values written to the MSR_IA32_PERF_CTL > > > register. Namely, the expected result would be that when all CPUs > > > were 100% busy, all of them would be requested to run in the maximum > > > P-state, but observation shows that this clearly isn't the case. > > > The CPUs run in the maximum P-state for a while and then are > > > requested to run slower and go back to the maximum P-state after > > > a while again. That causes the actual frequency of the processor to > > > visibly oscillate below the sustainable maximum in a jittery fashion > > > which clearly is not desirable. > > > > > > To work around this issue use the observation that, from the > > > schedutil governor's perspective, CPUs that are never idle should > > > always run at the maximum frequency and make that happen. > > > > > > To that end, add a counter of idle calls to struct sugov_cpu and > > > modify cpuidle_idle_call() to increment that counter every time it > > > is about to put the given CPU into an idle state. Next, make the > > > schedutil governor look at that counter for the current CPU every > > > time before it is about to start heavy computations. If the counter > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), > > > the CPU has not been idle for at least that long and the governor > > > will choose the maximum frequency for it without looking at the PELT > > > metric at all. > > > > Why the time limit? > > One iteration appeared to be a bit too aggressive, but honestly I think > I need to check again if this thing is regarded as viable at all. > I don't hate the idea; if we don't hit idle; we shouldn't shift down. I just wonder if we don't already keep a idle-seqcount somewhere; NOHZ and RCU come to mind as things that might already use something like that. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 12:50 ` Peter Zijlstra @ 2017-03-20 13:04 ` Rafael J. Wysocki 2017-03-20 13:06 ` Patrick Bellasi 1 sibling, 0 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-20 13:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Monday, March 20, 2017 01:50:09 PM Peter Zijlstra wrote: > On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote: > > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote: > > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote: > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > > > > > The PELT metric used by the schedutil governor underestimates the > > > > CPU utilization in some cases. The reason for that may be time spent > > > > in interrupt handlers and similar which is not accounted for by PELT. > > > > > > > > That can be easily demonstrated by running kernel compilation on > > > > a Sandy Bridge Intel processor, running turbostat in parallel with > > > > it and looking at the values written to the MSR_IA32_PERF_CTL > > > > register. Namely, the expected result would be that when all CPUs > > > > were 100% busy, all of them would be requested to run in the maximum > > > > P-state, but observation shows that this clearly isn't the case. > > > > The CPUs run in the maximum P-state for a while and then are > > > > requested to run slower and go back to the maximum P-state after > > > > a while again. That causes the actual frequency of the processor to > > > > visibly oscillate below the sustainable maximum in a jittery fashion > > > > which clearly is not desirable. > > > > > > > > To work around this issue use the observation that, from the > > > > schedutil governor's perspective, CPUs that are never idle should > > > > always run at the maximum frequency and make that happen. > > > > > > > > To that end, add a counter of idle calls to struct sugov_cpu and > > > > modify cpuidle_idle_call() to increment that counter every time it > > > > is about to put the given CPU into an idle state. Next, make the > > > > schedutil governor look at that counter for the current CPU every > > > > time before it is about to start heavy computations. If the counter > > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), > > > > the CPU has not been idle for at least that long and the governor > > > > will choose the maximum frequency for it without looking at the PELT > > > > metric at all. > > > > > > Why the time limit? > > > > One iteration appeared to be a bit too aggressive, but honestly I think > > I need to check again if this thing is regarded as viable at all. > > > > I don't hate the idea; if we don't hit idle; we shouldn't shift down. OK > I just wonder if we don't already keep a idle-seqcount somewhere; NOHZ and > RCU come to mind as things that might already use something like that. NOHZ does that, but I did't want this to artificially depend on NOHZ. That said, yes, we can use that one too. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 12:50 ` Peter Zijlstra 2017-03-20 13:04 ` Rafael J. Wysocki @ 2017-03-20 13:06 ` Patrick Bellasi 2017-03-20 13:05 ` Rafael J. Wysocki 1 sibling, 1 reply; 71+ messages in thread From: Patrick Bellasi @ 2017-03-20 13:06 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 20-Mar 13:50, Peter Zijlstra wrote: > On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote: > > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote: > > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote: > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > > > > > The PELT metric used by the schedutil governor underestimates the > > > > CPU utilization in some cases. The reason for that may be time spent > > > > in interrupt handlers and similar which is not accounted for by PELT. > > > > > > > > That can be easily demonstrated by running kernel compilation on > > > > a Sandy Bridge Intel processor, running turbostat in parallel with > > > > it and looking at the values written to the MSR_IA32_PERF_CTL > > > > register. Namely, the expected result would be that when all CPUs > > > > were 100% busy, all of them would be requested to run in the maximum > > > > P-state, but observation shows that this clearly isn't the case. > > > > The CPUs run in the maximum P-state for a while and then are > > > > requested to run slower and go back to the maximum P-state after > > > > a while again. That causes the actual frequency of the processor to > > > > visibly oscillate below the sustainable maximum in a jittery fashion > > > > which clearly is not desirable. > > > > > > > > To work around this issue use the observation that, from the > > > > schedutil governor's perspective, CPUs that are never idle should > > > > always run at the maximum frequency and make that happen. > > > > > > > > To that end, add a counter of idle calls to struct sugov_cpu and > > > > modify cpuidle_idle_call() to increment that counter every time it > > > > is about to put the given CPU into an idle state. Next, make the > > > > schedutil governor look at that counter for the current CPU every > > > > time before it is about to start heavy computations. If the counter > > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), > > > > the CPU has not been idle for at least that long and the governor > > > > will choose the maximum frequency for it without looking at the PELT > > > > metric at all. > > > > > > Why the time limit? > > > > One iteration appeared to be a bit too aggressive, but honestly I think > > I need to check again if this thing is regarded as viable at all. > > > > I don't hate the idea; if we don't hit idle; we shouldn't shift down. I > just wonder if we don't already keep a idle-seqcount somewhere; NOHZ and > RCU come to mind as things that might already use something like that. Maybe the problem is not going down (e.g. when there are only small CFS tasks it makes perfectly sense) but instead not being fast enough on rampin-up when a new RT task is activated. And this boils down to two main point: 1) throttling for up transitions perhaps is only harmful 2) the call sites for schedutils updates are not properly positioned in specific scheduler decision points. The proposed patch is adding yet another throttling mechanism, perhaps on top of one which already needs to be improved. -- #include <best/regards.h> Patrick Bellasi ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 13:06 ` Patrick Bellasi @ 2017-03-20 13:05 ` Rafael J. Wysocki 2017-03-20 14:13 ` Patrick Bellasi 0 siblings, 1 reply; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-20 13:05 UTC (permalink / raw) To: Patrick Bellasi Cc: Peter Zijlstra, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Monday, March 20, 2017 01:06:15 PM Patrick Bellasi wrote: > On 20-Mar 13:50, Peter Zijlstra wrote: > > On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote: > > > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote: > > > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote: > > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > > > > > > > The PELT metric used by the schedutil governor underestimates the > > > > > CPU utilization in some cases. The reason for that may be time spent > > > > > in interrupt handlers and similar which is not accounted for by PELT. > > > > > > > > > > That can be easily demonstrated by running kernel compilation on > > > > > a Sandy Bridge Intel processor, running turbostat in parallel with > > > > > it and looking at the values written to the MSR_IA32_PERF_CTL > > > > > register. Namely, the expected result would be that when all CPUs > > > > > were 100% busy, all of them would be requested to run in the maximum > > > > > P-state, but observation shows that this clearly isn't the case. > > > > > The CPUs run in the maximum P-state for a while and then are > > > > > requested to run slower and go back to the maximum P-state after > > > > > a while again. That causes the actual frequency of the processor to > > > > > visibly oscillate below the sustainable maximum in a jittery fashion > > > > > which clearly is not desirable. > > > > > > > > > > To work around this issue use the observation that, from the > > > > > schedutil governor's perspective, CPUs that are never idle should > > > > > always run at the maximum frequency and make that happen. > > > > > > > > > > To that end, add a counter of idle calls to struct sugov_cpu and > > > > > modify cpuidle_idle_call() to increment that counter every time it > > > > > is about to put the given CPU into an idle state. Next, make the > > > > > schedutil governor look at that counter for the current CPU every > > > > > time before it is about to start heavy computations. If the counter > > > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), > > > > > the CPU has not been idle for at least that long and the governor > > > > > will choose the maximum frequency for it without looking at the PELT > > > > > metric at all. > > > > > > > > Why the time limit? > > > > > > One iteration appeared to be a bit too aggressive, but honestly I think > > > I need to check again if this thing is regarded as viable at all. > > > > > > > I don't hate the idea; if we don't hit idle; we shouldn't shift down. I > > just wonder if we don't already keep a idle-seqcount somewhere; NOHZ and > > RCU come to mind as things that might already use something like that. > > Maybe the problem is not going down (e.g. when there are only small > CFS tasks it makes perfectly sense) but instead not being fast enough > on rampin-up when a new RT task is activated. > > And this boils down to two main point: > 1) throttling for up transitions perhaps is only harmful > 2) the call sites for schedutils updates are not properly positioned > in specific scheduler decision points. > > The proposed patch is adding yet another throttling mechanism, perhaps > on top of one which already needs to be improved. It is not throttling anything. Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs 2017-03-20 13:05 ` Rafael J. Wysocki @ 2017-03-20 14:13 ` Patrick Bellasi 0 siblings, 0 replies; 71+ messages in thread From: Patrick Bellasi @ 2017-03-20 14:13 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Peter Zijlstra, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 20-Mar 14:05, Rafael J. Wysocki wrote: > On Monday, March 20, 2017 01:06:15 PM Patrick Bellasi wrote: > > On 20-Mar 13:50, Peter Zijlstra wrote: > > > On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote: > > > > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote: > > > > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote: > > > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > > > > > > > > > The PELT metric used by the schedutil governor underestimates the > > > > > > CPU utilization in some cases. The reason for that may be time spent > > > > > > in interrupt handlers and similar which is not accounted for by PELT. > > > > > > > > > > > > That can be easily demonstrated by running kernel compilation on > > > > > > a Sandy Bridge Intel processor, running turbostat in parallel with > > > > > > it and looking at the values written to the MSR_IA32_PERF_CTL > > > > > > register. Namely, the expected result would be that when all CPUs > > > > > > were 100% busy, all of them would be requested to run in the maximum > > > > > > P-state, but observation shows that this clearly isn't the case. > > > > > > The CPUs run in the maximum P-state for a while and then are > > > > > > requested to run slower and go back to the maximum P-state after > > > > > > a while again. That causes the actual frequency of the processor to > > > > > > visibly oscillate below the sustainable maximum in a jittery fashion > > > > > > which clearly is not desirable. > > > > > > > > > > > > To work around this issue use the observation that, from the > > > > > > schedutil governor's perspective, CPUs that are never idle should > > > > > > always run at the maximum frequency and make that happen. > > > > > > > > > > > > To that end, add a counter of idle calls to struct sugov_cpu and > > > > > > modify cpuidle_idle_call() to increment that counter every time it > > > > > > is about to put the given CPU into an idle state. Next, make the > > > > > > schedutil governor look at that counter for the current CPU every > > > > > > time before it is about to start heavy computations. If the counter > > > > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms), > > > > > > the CPU has not been idle for at least that long and the governor > > > > > > will choose the maximum frequency for it without looking at the PELT > > > > > > metric at all. > > > > > > > > > > Why the time limit? > > > > > > > > One iteration appeared to be a bit too aggressive, but honestly I think > > > > I need to check again if this thing is regarded as viable at all. > > > > > > > > > > I don't hate the idea; if we don't hit idle; we shouldn't shift down. I > > > just wonder if we don't already keep a idle-seqcount somewhere; NOHZ and > > > RCU come to mind as things that might already use something like that. > > > > Maybe the problem is not going down (e.g. when there are only small > > CFS tasks it makes perfectly sense) but instead not being fast enough > > on rampin-up when a new RT task is activated. > > > > And this boils down to two main point: > > 1) throttling for up transitions perhaps is only harmful > > 2) the call sites for schedutils updates are not properly positioned > > in specific scheduler decision points. > > > > The proposed patch is adding yet another throttling mechanism, perhaps > > on top of one which already needs to be improved. > > It is not throttling anything. It's a kind-of... - if (flags & SCHED_CPUFREQ_RT_DL) { + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) { next_f = policy->cpuinfo.max_freq; This check disregard any signal the scheduler can provide. A 60% CFS task with a 100ms period, with such a policy will end up running at the highest OPP for just 10% of its entire activation. Moreover, when it completes, we are likely to enter an idle OPP while still remaining at the highest OPP. IMHO the ultimate goal of scheduitl should be that to be driven by the scheduler, which has (or can have) all the required information to support OPP selection. If something is not working, well, then we should properly fix the signals and/or provide (at least) a per-task tunable interface. Adding an hardcoded threshold is an easy fix but it will ultimately increase the complexity of the governor. -- #include <best/regards.h> Patrick Bellasi ^ permalink raw reply [flat|nested] 71+ messages in thread
* [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki ` (2 preceding siblings ...) 2017-03-20 10:36 ` Peter Zijlstra @ 2017-03-20 21:46 ` Rafael J. Wysocki 2017-03-21 6:40 ` Viresh Kumar ` (3 more replies) 3 siblings, 4 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-20 21:46 UTC (permalink / raw) To: Linux PM Cc: LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> The way the schedutil governor uses the PELT metric causes it to underestimate the CPU utilization in some cases. That can be easily demonstrated by running kernel compilation on a Sandy Bridge Intel processor, running turbostat in parallel with it and looking at the values written to the MSR_IA32_PERF_CTL register. Namely, the expected result would be that when all CPUs were 100% busy, all of them would be requested to run in the maximum P-state, but observation shows that this clearly isn't the case. The CPUs run in the maximum P-state for a while and then are requested to run slower and go back to the maximum P-state after a while again. That causes the actual frequency of the processor to visibly oscillate below the sustainable maximum in a jittery fashion which clearly is not desirable. To work around this issue use the observation that, from the schedutil governor's perspective, it does not make sense to decrease the frequency of a CPU that doesn't enter idle and avoid decreasing the frequency of busy CPUs. To that end, use the counter of idle calls in the timekeeping code. Namely, make the schedutil governor look at that counter for the current CPU every time before it is about to set a new frequency for that CPU's policy. If the counter has not changed since the previous iteration, the CPU has been busy for all that time and its frequency should not be decreased, so if the new frequency would be lower than the one set previously, the governor will skip the frequency update. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- This is a slightly different approach (avoid decreasing frequency for busy CPUs instead of bumping if for them to the max upfront) and it works around the original problem too. I tried to address a few Peter's comments here and the result doesn't seem to be too heavy-wieght. Thanks, Rafael --- include/linux/tick.h | 1 + kernel/sched/cpufreq_schedutil.c | 28 ++++++++++++++++++++++++---- kernel/time/tick-sched.c | 12 ++++++++++++ 3 files changed, 37 insertions(+), 4 deletions(-) Index: linux-pm/kernel/sched/cpufreq_schedutil.c =================================================================== --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c +++ linux-pm/kernel/sched/cpufreq_schedutil.c @@ -56,6 +56,9 @@ struct sugov_cpu { unsigned long iowait_boost; unsigned long iowait_boost_max; u64 last_update; +#ifdef CONFIG_NO_HZ_COMMON + unsigned long saved_idle_calls; +#endif /* The fields below are only needed when sharing a policy. */ unsigned long util; @@ -88,11 +91,28 @@ static bool sugov_should_update_freq(str return delta_ns >= sg_policy->freq_update_delay_ns; } -static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, - unsigned int next_freq) +#ifdef CONFIG_NO_HZ_COMMON +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) +{ + unsigned long idle_calls = tick_nohz_get_idle_calls(); + bool ret = idle_calls == sg_cpu->saved_idle_calls; + + sg_cpu->saved_idle_calls = idle_calls; + return ret; +} +#else +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } +#endif /* CONFIG_NO_HZ_COMMON */ + +static void sugov_update_commit(struct sugov_cpu *sg_cpu, + struct sugov_policy *sg_policy, + u64 time, unsigned int next_freq) { struct cpufreq_policy *policy = sg_policy->policy; + if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq) + next_freq = sg_policy->next_freq; + if (policy->fast_switch_enabled) { if (sg_policy->next_freq == next_freq) { trace_cpu_frequency(policy->cur, smp_processor_id()); @@ -214,7 +234,7 @@ static void sugov_update_single(struct u sugov_iowait_boost(sg_cpu, &util, &max); next_f = get_next_freq(sg_policy, util, max); } - sugov_update_commit(sg_policy, time, next_f); + sugov_update_commit(sg_cpu, sg_policy, time, next_f); } static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu) @@ -283,7 +303,7 @@ static void sugov_update_shared(struct u else next_f = sugov_next_freq_shared(sg_cpu); - sugov_update_commit(sg_policy, time, next_f); + sugov_update_commit(sg_cpu, sg_policy, time, next_f); } raw_spin_unlock(&sg_policy->update_lock); Index: linux-pm/include/linux/tick.h =================================================================== --- linux-pm.orig/include/linux/tick.h +++ linux-pm/include/linux/tick.h @@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void); extern void tick_nohz_idle_exit(void); extern void tick_nohz_irq_exit(void); extern ktime_t tick_nohz_get_sleep_length(void); +extern unsigned long tick_nohz_get_idle_calls(void); extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time); extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time); #else /* !CONFIG_NO_HZ_COMMON */ Index: linux-pm/kernel/time/tick-sched.c =================================================================== --- linux-pm.orig/kernel/time/tick-sched.c +++ linux-pm/kernel/time/tick-sched.c @@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void) return ts->sleep_length; } +/** + * tick_nohz_get_idle_calls - return the current idle calls counter value + * + * Called from the schedutil frequency scaling governor in scheduler context. + */ +unsigned long tick_nohz_get_idle_calls(void) +{ + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched); + + return ts->idle_calls; +} + static void tick_nohz_account_idle_ticks(struct tick_sched *ts) { #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-20 21:46 ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki @ 2017-03-21 6:40 ` Viresh Kumar 2017-03-21 12:30 ` Rafael J. Wysocki 2017-03-21 8:50 ` Vincent Guittot ` (2 subsequent siblings) 3 siblings, 1 reply; 71+ messages in thread From: Viresh Kumar @ 2017-03-21 6:40 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 20-03-17, 22:46, Rafael J. Wysocki wrote: > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > +static void sugov_update_commit(struct sugov_cpu *sg_cpu, > + struct sugov_policy *sg_policy, > + u64 time, unsigned int next_freq) > { > struct cpufreq_policy *policy = sg_policy->policy; > > + if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq) > + next_freq = sg_policy->next_freq; > + In the earlier version you said that we want to be opportunistic and don't want to do heavy computation and so check only for current CPU. But in this version, all those computations are already done by now. Why shouldn't we check all CPUs in the policy now? I am asking as we will still have the same problem, we are trying to work-around if the current CPU isn't busy but others sharing the policy are. Also, why not return directly from within the if block? To run trace_cpu_frequency()? I don't remember exactly, but why don't we run that for !fast-switch case? We can simplify the code a bit if we check for no freq change at the top of the routine. -- viresh ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 6:40 ` Viresh Kumar @ 2017-03-21 12:30 ` Rafael J. Wysocki 0 siblings, 0 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-21 12:30 UTC (permalink / raw) To: Viresh Kumar Cc: Rafael J. Wysocki, Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tue, Mar 21, 2017 at 7:40 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote: > On 20-03-17, 22:46, Rafael J. Wysocki wrote: >> Index: linux-pm/kernel/sched/cpufreq_schedutil.c > >> +static void sugov_update_commit(struct sugov_cpu *sg_cpu, >> + struct sugov_policy *sg_policy, >> + u64 time, unsigned int next_freq) >> { >> struct cpufreq_policy *policy = sg_policy->policy; >> >> + if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq) >> + next_freq = sg_policy->next_freq; >> + > > In the earlier version you said that we want to be opportunistic and > don't want to do heavy computation and so check only for current CPU. > > But in this version, all those computations are already done by now. > Why shouldn't we check all CPUs in the policy now? I am asking as we > will still have the same problem, we are trying to work-around if the > current CPU isn't busy but others sharing the policy are. This isn't the way I'm looking at that. This is an easy (and relatively cheap) check to make for the *current* *CPU* and our frequency selection algorithm turns out to have problems, so it would be kind of unreasonable to not use the opportunity to fix up the value coming from it - if we can do that easily enough. For the other CPUs in the policy that would require extra synchronization etc., so not that easy any more. > Also, why not return directly from within the if block? To run > trace_cpu_frequency()? Yes. > I don't remember exactly, but why don't we run that for !fast-switch > case? That's an interesting question. We do that in the fast switch case, because otherwise utilities get confused if the frequency is not updated for a long enough time. I'm not really sure why they don't get confused in the other case, though. [In that case the core calls trace_cpu_frequency() for us, but only if we actually run the async work.] It looks like it wouldn't hurt to always run trace_cpu_frequency() when we want to bail out early for next_freq == sg_policy->next_freq. Let me prepare a patch for that. :-) > We can simplify the code a bit if we check for no freq change at > the top of the routine. Right. Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-20 21:46 ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki 2017-03-21 6:40 ` Viresh Kumar @ 2017-03-21 8:50 ` Vincent Guittot 2017-03-21 11:56 ` Patrick Bellasi 2017-03-21 13:22 ` Peter Zijlstra 2017-03-21 11:50 ` Patrick Bellasi 2017-03-21 23:08 ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki 3 siblings, 2 replies; 71+ messages in thread From: Vincent Guittot @ 2017-03-21 8:50 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > The way the schedutil governor uses the PELT metric causes it to > underestimate the CPU utilization in some cases. > > That can be easily demonstrated by running kernel compilation on > a Sandy Bridge Intel processor, running turbostat in parallel with > it and looking at the values written to the MSR_IA32_PERF_CTL > register. Namely, the expected result would be that when all CPUs > were 100% busy, all of them would be requested to run in the maximum > P-state, but observation shows that this clearly isn't the case. > The CPUs run in the maximum P-state for a while and then are > requested to run slower and go back to the maximum P-state after > a while again. That causes the actual frequency of the processor to > visibly oscillate below the sustainable maximum in a jittery fashion > which clearly is not desirable. > > To work around this issue use the observation that, from the > schedutil governor's perspective, it does not make sense to decrease > the frequency of a CPU that doesn't enter idle and avoid decreasing > the frequency of busy CPUs. I don't fully agree with that statement. If there are 2 runnable tasks on CPU A and scheduler migrates the waiting task to another CPU B so CPU A is less loaded now, it makes sense to reduce the OPP. That's even for that purpose that we have decided to use scheduler metrics in cpufreq governor so we can adjust OPP immediately when tasks migrate. That being said, i probably know why you see such OPP switches in your use case. When we migrate a task, we also migrate/remove its utilization from CPU. If the CPU is not overloaded, it means that runnable tasks have all computation that they need and don't have any reason to use more when a task migrates to another CPU. so decreasing the OPP makes sense because the utilzation is decreasing If the CPU is overloaded, it means that runnable tasks have to share CPU time and probably don't have all computations that they would like so when a task migrate, the remaining tasks on the CPU will increase their utilization and fill space left by the task that has just migrated. So the CPU's utilization will decrease when a task migrates (and as a result the OPP) but then its utilization will increase with remaining tasks running more time as well as the OPP So you need to make the difference between this 2 cases: Is a CPU overloaded or not. You can't really rely on the utilization to detect that but you could take advantage of the load which take into account the waiting time of tasks Vincent > > To that end, use the counter of idle calls in the timekeeping code. > Namely, make the schedutil governor look at that counter for the > current CPU every time before it is about to set a new frequency > for that CPU's policy. If the counter has not changed since the > previous iteration, the CPU has been busy for all that time and > its frequency should not be decreased, so if the new frequency would > be lower than the one set previously, the governor will skip the > frequency update. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > This is a slightly different approach (avoid decreasing frequency for busy CPUs > instead of bumping if for them to the max upfront) and it works around the > original problem too. > > I tried to address a few Peter's comments here and the result doesn't seem to > be too heavy-wieght. > > Thanks, > Rafael > > --- > include/linux/tick.h | 1 + > kernel/sched/cpufreq_schedutil.c | 28 ++++++++++++++++++++++++---- > kernel/time/tick-sched.c | 12 ++++++++++++ > 3 files changed, 37 insertions(+), 4 deletions(-) > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > =================================================================== > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > +++ linux-pm/kernel/sched/cpufreq_schedutil.c > @@ -56,6 +56,9 @@ struct sugov_cpu { > unsigned long iowait_boost; > unsigned long iowait_boost_max; > u64 last_update; > +#ifdef CONFIG_NO_HZ_COMMON > + unsigned long saved_idle_calls; > +#endif > > /* The fields below are only needed when sharing a policy. */ > unsigned long util; > @@ -88,11 +91,28 @@ static bool sugov_should_update_freq(str > return delta_ns >= sg_policy->freq_update_delay_ns; > } > > -static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, > - unsigned int next_freq) > +#ifdef CONFIG_NO_HZ_COMMON > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > +{ > + unsigned long idle_calls = tick_nohz_get_idle_calls(); > + bool ret = idle_calls == sg_cpu->saved_idle_calls; > + > + sg_cpu->saved_idle_calls = idle_calls; > + return ret; > +} > +#else > +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } > +#endif /* CONFIG_NO_HZ_COMMON */ > + > +static void sugov_update_commit(struct sugov_cpu *sg_cpu, > + struct sugov_policy *sg_policy, > + u64 time, unsigned int next_freq) > { > struct cpufreq_policy *policy = sg_policy->policy; > > + if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq) > + next_freq = sg_policy->next_freq; > + > if (policy->fast_switch_enabled) { > if (sg_policy->next_freq == next_freq) { > trace_cpu_frequency(policy->cur, smp_processor_id()); > @@ -214,7 +234,7 @@ static void sugov_update_single(struct u > sugov_iowait_boost(sg_cpu, &util, &max); > next_f = get_next_freq(sg_policy, util, max); > } > - sugov_update_commit(sg_policy, time, next_f); > + sugov_update_commit(sg_cpu, sg_policy, time, next_f); > } > > static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu) > @@ -283,7 +303,7 @@ static void sugov_update_shared(struct u > else > next_f = sugov_next_freq_shared(sg_cpu); > > - sugov_update_commit(sg_policy, time, next_f); > + sugov_update_commit(sg_cpu, sg_policy, time, next_f); > } > > raw_spin_unlock(&sg_policy->update_lock); > Index: linux-pm/include/linux/tick.h > =================================================================== > --- linux-pm.orig/include/linux/tick.h > +++ linux-pm/include/linux/tick.h > @@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void); > extern void tick_nohz_idle_exit(void); > extern void tick_nohz_irq_exit(void); > extern ktime_t tick_nohz_get_sleep_length(void); > +extern unsigned long tick_nohz_get_idle_calls(void); > extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time); > extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time); > #else /* !CONFIG_NO_HZ_COMMON */ > Index: linux-pm/kernel/time/tick-sched.c > =================================================================== > --- linux-pm.orig/kernel/time/tick-sched.c > +++ linux-pm/kernel/time/tick-sched.c > @@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void) > return ts->sleep_length; > } > > +/** > + * tick_nohz_get_idle_calls - return the current idle calls counter value > + * > + * Called from the schedutil frequency scaling governor in scheduler context. > + */ > +unsigned long tick_nohz_get_idle_calls(void) > +{ > + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched); > + > + return ts->idle_calls; > +} > + > static void tick_nohz_account_idle_ticks(struct tick_sched *ts) > { > #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE > ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 8:50 ` Vincent Guittot @ 2017-03-21 11:56 ` Patrick Bellasi 2017-03-21 13:22 ` Peter Zijlstra 1 sibling, 0 replies; 71+ messages in thread From: Patrick Bellasi @ 2017-03-21 11:56 UTC (permalink / raw) To: Vincent Guittot Cc: Rafael J. Wysocki, Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 21-Mar 09:50, Vincent Guittot wrote: > On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > The way the schedutil governor uses the PELT metric causes it to > > underestimate the CPU utilization in some cases. > > > > That can be easily demonstrated by running kernel compilation on > > a Sandy Bridge Intel processor, running turbostat in parallel with > > it and looking at the values written to the MSR_IA32_PERF_CTL > > register. Namely, the expected result would be that when all CPUs > > were 100% busy, all of them would be requested to run in the maximum > > P-state, but observation shows that this clearly isn't the case. > > The CPUs run in the maximum P-state for a while and then are > > requested to run slower and go back to the maximum P-state after > > a while again. That causes the actual frequency of the processor to > > visibly oscillate below the sustainable maximum in a jittery fashion > > which clearly is not desirable. > > > > To work around this issue use the observation that, from the > > schedutil governor's perspective, it does not make sense to decrease > > the frequency of a CPU that doesn't enter idle and avoid decreasing > > the frequency of busy CPUs. > > I don't fully agree with that statement. > If there are 2 runnable tasks on CPU A and scheduler migrates the > waiting task to another CPU B so CPU A is less loaded now, it makes > sense to reduce the OPP. That's even for that purpose that we have > decided to use scheduler metrics in cpufreq governor so we can adjust > OPP immediately when tasks migrate. > That being said, i probably know why you see such OPP switches in your > use case. When we migrate a task, we also migrate/remove its > utilization from CPU. > If the CPU is not overloaded, it means that runnable tasks have all > computation that they need and don't have any reason to use more when > a task migrates to another CPU. so decreasing the OPP makes sense > because the utilzation is decreasing > If the CPU is overloaded, it means that runnable tasks have to share > CPU time and probably don't have all computations that they would like > so when a task migrate, the remaining tasks on the CPU will increase > their utilization and fill space left by the task that has just > migrated. So the CPU's utilization will decrease when a task migrates > (and as a result the OPP) but then its utilization will increase with > remaining tasks running more time as well as the OPP > > So you need to make the difference between this 2 cases: Is a CPU > overloaded or not. You can't really rely on the utilization to detect > that but you could take advantage of the load which take into account > the waiting time of tasks Right, we can use "overloaded" for the time being until we push the "overutilized" bits. [...] > > +#ifdef CONFIG_NO_HZ_COMMON > > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > > +{ > > + unsigned long idle_calls = tick_nohz_get_idle_calls(); > > + bool ret = idle_calls == sg_cpu->saved_idle_calls; Vincent: are you proposing something like this? + if (this_rq()->rd->overload) + return false; > > + > > + sg_cpu->saved_idle_calls = idle_calls; > > + return ret; > > +} > > +#else > > +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } > > +#endif /* CONFIG_NO_HZ_COMMON */ > > + > > +static void sugov_update_commit(struct sugov_cpu *sg_cpu, > > + struct sugov_policy *sg_policy, > > + u64 time, unsigned int next_freq) > > { > > struct cpufreq_policy *policy = sg_policy->policy; > > > > + if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq) > > + next_freq = sg_policy->next_freq; > > + > > if (policy->fast_switch_enabled) { > > if (sg_policy->next_freq == next_freq) { > > trace_cpu_frequency(policy->cur, smp_processor_id()); > > @@ -214,7 +234,7 @@ static void sugov_update_single(struct u > > sugov_iowait_boost(sg_cpu, &util, &max); > > next_f = get_next_freq(sg_policy, util, max); > > } > > - sugov_update_commit(sg_policy, time, next_f); > > + sugov_update_commit(sg_cpu, sg_policy, time, next_f); > > } > > [...] -- #include <best/regards.h> Patrick Bellasi ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 8:50 ` Vincent Guittot 2017-03-21 11:56 ` Patrick Bellasi @ 2017-03-21 13:22 ` Peter Zijlstra 2017-03-21 13:37 ` Vincent Guittot 1 sibling, 1 reply; 71+ messages in thread From: Peter Zijlstra @ 2017-03-21 13:22 UTC (permalink / raw) To: Vincent Guittot Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote: > On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > To work around this issue use the observation that, from the > > schedutil governor's perspective, it does not make sense to decrease > > the frequency of a CPU that doesn't enter idle and avoid decreasing > > the frequency of busy CPUs. > > I don't fully agree with that statement. > If there are 2 runnable tasks on CPU A and scheduler migrates the > waiting task to another CPU B so CPU A is less loaded now, it makes > sense to reduce the OPP. That's even for that purpose that we have > decided to use scheduler metrics in cpufreq governor so we can adjust > OPP immediately when tasks migrate. > That being said, i probably know why you see such OPP switches in your > use case. When we migrate a task, we also migrate/remove its > utilization from CPU. > If the CPU is not overloaded, it means that runnable tasks have all > computation that they need and don't have any reason to use more when > a task migrates to another CPU. so decreasing the OPP makes sense > because the utilzation is decreasing > If the CPU is overloaded, it means that runnable tasks have to share > CPU time and probably don't have all computations that they would like > so when a task migrate, the remaining tasks on the CPU will increase > their utilization and fill space left by the task that has just > migrated. So the CPU's utilization will decrease when a task migrates > (and as a result the OPP) but then its utilization will increase with > remaining tasks running more time as well as the OPP > > So you need to make the difference between this 2 cases: Is a CPU > overloaded or not. You can't really rely on the utilization to detect > that but you could take advantage of the load which take into account > the waiting time of tasks I'm confused. What two cases? You only list the overloaded case, but he does exactly that. Note that the lack of idle time is an exact equivalent of 100% utilized. So even while we cannot currently detect the 100% utilized state through the running state tracking; because averages etc.. we can detect the lack of idle time. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 13:22 ` Peter Zijlstra @ 2017-03-21 13:37 ` Vincent Guittot 2017-03-21 14:03 ` Peter Zijlstra 2017-03-21 14:26 ` Rafael J. Wysocki 0 siblings, 2 replies; 71+ messages in thread From: Vincent Guittot @ 2017-03-21 13:37 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote: >> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > >> > To work around this issue use the observation that, from the >> > schedutil governor's perspective, it does not make sense to decrease >> > the frequency of a CPU that doesn't enter idle and avoid decreasing >> > the frequency of busy CPUs. >> >> I don't fully agree with that statement. >> If there are 2 runnable tasks on CPU A and scheduler migrates the >> waiting task to another CPU B so CPU A is less loaded now, it makes >> sense to reduce the OPP. That's even for that purpose that we have >> decided to use scheduler metrics in cpufreq governor so we can adjust >> OPP immediately when tasks migrate. >> That being said, i probably know why you see such OPP switches in your >> use case. When we migrate a task, we also migrate/remove its >> utilization from CPU. >> If the CPU is not overloaded, it means that runnable tasks have all >> computation that they need and don't have any reason to use more when >> a task migrates to another CPU. so decreasing the OPP makes sense >> because the utilzation is decreasing >> If the CPU is overloaded, it means that runnable tasks have to share >> CPU time and probably don't have all computations that they would like >> so when a task migrate, the remaining tasks on the CPU will increase >> their utilization and fill space left by the task that has just >> migrated. So the CPU's utilization will decrease when a task migrates >> (and as a result the OPP) but then its utilization will increase with >> remaining tasks running more time as well as the OPP >> >> So you need to make the difference between this 2 cases: Is a CPU >> overloaded or not. You can't really rely on the utilization to detect >> that but you could take advantage of the load which take into account >> the waiting time of tasks > > I'm confused. What two cases? You only list the overloaded case, but he overloaded vs not overloaded use case. For the not overloaded case, it makes sense to immediately update to OPP to be aligned with the new utilization of the CPU even if it was not idle in the past couple of ticks > does exactly that. Note that the lack of idle time is an exact > equivalent of 100% utilized. > > So even while we cannot currently detect the 100% utilized state through > the running state tracking; because averages etc.. we can detect the > lack of idle time. But after how much lack of idle time do we consider that we are overloaded ? > ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 13:37 ` Vincent Guittot @ 2017-03-21 14:03 ` Peter Zijlstra 2017-03-21 14:18 ` Vincent Guittot ` (2 more replies) 2017-03-21 14:26 ` Rafael J. Wysocki 1 sibling, 3 replies; 71+ messages in thread From: Peter Zijlstra @ 2017-03-21 14:03 UTC (permalink / raw) To: Vincent Guittot Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote: > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: > For the not overloaded case, it makes sense to immediately update to > OPP to be aligned with the new utilization of the CPU even if it was > not idle in the past couple of ticks Yeah, but we cannot know. Also, who cares? > > does exactly that. Note that the lack of idle time is an exact > > equivalent of 100% utilized. > > > > So even while we cannot currently detect the 100% utilized state through > > the running state tracking; because averages etc.. we can detect the > > lack of idle time. > > But after how much lack of idle time do we consider that we are overloaded ? 0 :-) Note that utilization is an absolute metric, not a windowed one. That is, there is no actual time associated with it. Now, for practical purposes we end up using windowed things in many places, ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 14:03 ` Peter Zijlstra @ 2017-03-21 14:18 ` Vincent Guittot 2017-03-21 14:25 ` Patrick Bellasi [not found] ` <CAKfTPtALorn7HNpz4LOfWWSc3u+9y5iHB5byzfTHGQXDA+tVJQ@mail.gmail.com> 2 siblings, 0 replies; 71+ messages in thread From: Vincent Guittot @ 2017-03-21 14:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 21 March 2017 at 15:03, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote: >> On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: > >> For the not overloaded case, it makes sense to immediately update to >> OPP to be aligned with the new utilization of the CPU even if it was >> not idle in the past couple of ticks > > Yeah, but we cannot know. Also, who cares? embedded system that doesn't want to stay at higest OPP if significant part of the utilzation has moved away as an example AFAICT, schedutil tries to select the best OPP according to the current utilization of the CPU so if the utilization decreases, the OPP should also decrease > >> > does exactly that. Note that the lack of idle time is an exact >> > equivalent of 100% utilized. >> > >> > So even while we cannot currently detect the 100% utilized state through >> > the running state tracking; because averages etc.. we can detect the >> > lack of idle time. >> >> But after how much lack of idle time do we consider that we are overloaded ? > > 0 :-) > > Note that utilization is an absolute metric, not a windowed one. That > is, there is no actual time associated with it. Now, for practical > purposes we end up using windowed things in many places, > ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 14:03 ` Peter Zijlstra 2017-03-21 14:18 ` Vincent Guittot @ 2017-03-21 14:25 ` Patrick Bellasi [not found] ` <CAKfTPtALorn7HNpz4LOfWWSc3u+9y5iHB5byzfTHGQXDA+tVJQ@mail.gmail.com> 2 siblings, 0 replies; 71+ messages in thread From: Patrick Bellasi @ 2017-03-21 14:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Vincent Guittot, Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 21-Mar 15:03, Peter Zijlstra wrote: > On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote: > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: > > > For the not overloaded case, it makes sense to immediately update to > > OPP to be aligned with the new utilization of the CPU even if it was > > not idle in the past couple of ticks > > Yeah, but we cannot know. Also, who cares? > > > > does exactly that. Note that the lack of idle time is an exact > > > equivalent of 100% utilized. > > > > > > So even while we cannot currently detect the 100% utilized state through > > > the running state tracking; because averages etc.. we can detect the > > > lack of idle time. > > > > But after how much lack of idle time do we consider that we are overloaded ? > > 0 :-) If we should use "utilization" this time can be non 0 and it depends for example on how long PELT takes to build up a utilization value which marks the CPU as "overutilized"... thus we already have a suitable time at least for CFS tasks. > Note that utilization is an absolute metric, not a windowed one. That > is, there is no actual time associated with it. Now, for practical > purposes we end up using windowed things in many places, > -- #include <best/regards.h> Patrick Bellasi ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <CAKfTPtALorn7HNpz4LOfWWSc3u+9y5iHB5byzfTHGQXDA+tVJQ@mail.gmail.com>]
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs [not found] ` <CAKfTPtALorn7HNpz4LOfWWSc3u+9y5iHB5byzfTHGQXDA+tVJQ@mail.gmail.com> @ 2017-03-21 14:58 ` Peter Zijlstra 2017-03-21 17:00 ` Vincent Guittot 0 siblings, 1 reply; 71+ messages in thread From: Peter Zijlstra @ 2017-03-21 14:58 UTC (permalink / raw) To: Vincent Guittot Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tue, Mar 21, 2017 at 03:16:19PM +0100, Vincent Guittot wrote: > On 21 March 2017 at 15:03, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote: > > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > For the not overloaded case, it makes sense to immediately update to > > > OPP to be aligned with the new utilization of the CPU even if it was > > > not idle in the past couple of ticks > > > > Yeah, but we cannot know. Also, who cares? > > > > embedded system that doesn't want to stay at higest OPP if significant part > of the utilzation has moved away as an example > AFAICT, schedutil tries to select the best OPP according to the current > utilization of the CPU so if the utilization decreases, the OPP should also > decrease Sure I get that; but given the lack of crystal ball instructions we cannot know if this is the case or not. And if we really dropped below 100% utilization, we should hit idle fairly soon. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 14:58 ` Peter Zijlstra @ 2017-03-21 17:00 ` Vincent Guittot 2017-03-21 17:01 ` Vincent Guittot 0 siblings, 1 reply; 71+ messages in thread From: Vincent Guittot @ 2017-03-21 17:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 21 March 2017 at 15:58, Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Mar 21, 2017 at 03:16:19PM +0100, Vincent Guittot wrote: > > On 21 March 2017 at 15:03, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote: > > > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > For the not overloaded case, it makes sense to immediately update to > > > > OPP to be aligned with the new utilization of the CPU even if it was > > > > not idle in the past couple of ticks > > > > > > Yeah, but we cannot know. Also, who cares? > > > > > > > embedded system that doesn't want to stay at higest OPP if significant part > > of the utilzation has moved away as an example > > AFAICT, schedutil tries to select the best OPP according to the current > > utilization of the CPU so if the utilization decreases, the OPP should also > > decrease > > Sure I get that; but given the lack of crystal ball instructions we > cannot know if this is the case or not. cfs_rq->avg.load_avg account the waiting time of CPU (in addition to the weight of task) so i was wondering if we can't use it to detect if we are in the overloaded case or not even if utilization is not mac capacity because we have just migrated a task (and its utilization) out > > And if we really dropped below 100% utilization, we should hit idle > fairly soon. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 17:00 ` Vincent Guittot @ 2017-03-21 17:01 ` Vincent Guittot 0 siblings, 0 replies; 71+ messages in thread From: Vincent Guittot @ 2017-03-21 17:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 21 March 2017 at 18:00, Vincent Guittot <vincent.guittot@linaro.org> wrote: > On 21 March 2017 at 15:58, Peter Zijlstra <peterz@infradead.org> wrote: >> >> On Tue, Mar 21, 2017 at 03:16:19PM +0100, Vincent Guittot wrote: >> > On 21 March 2017 at 15:03, Peter Zijlstra <peterz@infradead.org> wrote: >> > >> > > On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote: >> > > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: >> > > >> > > > For the not overloaded case, it makes sense to immediately update to >> > > > OPP to be aligned with the new utilization of the CPU even if it was >> > > > not idle in the past couple of ticks >> > > >> > > Yeah, but we cannot know. Also, who cares? >> > > >> > >> > embedded system that doesn't want to stay at higest OPP if significant part >> > of the utilzation has moved away as an example >> > AFAICT, schedutil tries to select the best OPP according to the current >> > utilization of the CPU so if the utilization decreases, the OPP should also >> > decrease >> >> Sure I get that; but given the lack of crystal ball instructions we >> cannot know if this is the case or not. > > cfs_rq->avg.load_avg account the waiting time of CPU (in addition to sorry i wanted to say the waiting time of tasks on the CPU > the weight of task) so i was wondering if we can't use it to detect if > we are in the overloaded case or not even if utilization is not mac > capacity because we have just migrated a task (and its utilization) > out > > > >> >> And if we really dropped below 100% utilization, we should hit idle >> fairly soon. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 13:37 ` Vincent Guittot 2017-03-21 14:03 ` Peter Zijlstra @ 2017-03-21 14:26 ` Rafael J. Wysocki 2017-03-21 14:38 ` Patrick Bellasi 2017-03-21 15:02 ` Peter Zijlstra 1 sibling, 2 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-21 14:26 UTC (permalink / raw) To: Vincent Guittot, Peter Zijlstra Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote: > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote: > >> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > >> > To work around this issue use the observation that, from the > >> > schedutil governor's perspective, it does not make sense to decrease > >> > the frequency of a CPU that doesn't enter idle and avoid decreasing > >> > the frequency of busy CPUs. > >> > >> I don't fully agree with that statement. > >> If there are 2 runnable tasks on CPU A and scheduler migrates the > >> waiting task to another CPU B so CPU A is less loaded now, it makes > >> sense to reduce the OPP. That's even for that purpose that we have > >> decided to use scheduler metrics in cpufreq governor so we can adjust > >> OPP immediately when tasks migrate. > >> That being said, i probably know why you see such OPP switches in your > >> use case. When we migrate a task, we also migrate/remove its > >> utilization from CPU. > >> If the CPU is not overloaded, it means that runnable tasks have all > >> computation that they need and don't have any reason to use more when > >> a task migrates to another CPU. so decreasing the OPP makes sense > >> because the utilzation is decreasing > >> If the CPU is overloaded, it means that runnable tasks have to share > >> CPU time and probably don't have all computations that they would like > >> so when a task migrate, the remaining tasks on the CPU will increase > >> their utilization and fill space left by the task that has just > >> migrated. So the CPU's utilization will decrease when a task migrates > >> (and as a result the OPP) but then its utilization will increase with > >> remaining tasks running more time as well as the OPP > >> > >> So you need to make the difference between this 2 cases: Is a CPU > >> overloaded or not. You can't really rely on the utilization to detect > >> that but you could take advantage of the load which take into account > >> the waiting time of tasks > > > > I'm confused. What two cases? You only list the overloaded case, but he > > overloaded vs not overloaded use case. > For the not overloaded case, it makes sense to immediately update to > OPP to be aligned with the new utilization of the CPU even if it was > not idle in the past couple of ticks Yes, if the OPP (or P-state if you will) can be changed immediately. If it can't, conditions may change by the time we actually update it and in that case It'd be better to wait and see IMO. In any case, the theory about migrating tasks made sense to me, so below is what I tested. It works, and besides it has a nice feature that I don't need to fetch for the timekeeping data. :-) I only wonder if we want to do this or only prevent the frequency from decreasing in the overloaded case? --- kernel/sched/cpufreq_schedutil.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) Index: linux-pm/kernel/sched/cpufreq_schedutil.c =================================================================== --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c +++ linux-pm/kernel/sched/cpufreq_schedutil.c @@ -61,6 +61,7 @@ struct sugov_cpu { unsigned long util; unsigned long max; unsigned int flags; + bool overload; }; static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); @@ -207,7 +208,7 @@ static void sugov_update_single(struct u if (!sugov_should_update_freq(sg_policy, time)) return; - if (flags & SCHED_CPUFREQ_RT_DL) { + if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) { next_f = policy->cpuinfo.max_freq; } else { sugov_get_util(&util, &max); @@ -242,7 +243,7 @@ static unsigned int sugov_next_freq_shar j_sg_cpu->iowait_boost = 0; continue; } - if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL) + if ((j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL) || j_sg_cpu->overload) return policy->cpuinfo.max_freq; j_util = j_sg_cpu->util; @@ -273,12 +274,13 @@ static void sugov_update_shared(struct u sg_cpu->util = util; sg_cpu->max = max; sg_cpu->flags = flags; + sg_cpu->overload = this_rq()->rd->overload; sugov_set_iowait_boost(sg_cpu, time, flags); sg_cpu->last_update = time; if (sugov_should_update_freq(sg_policy, time)) { - if (flags & SCHED_CPUFREQ_RT_DL) + if ((flags & SCHED_CPUFREQ_RT_DL) || sg_cpu->overload) next_f = sg_policy->policy->cpuinfo.max_freq; else next_f = sugov_next_freq_shared(sg_cpu); ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 14:26 ` Rafael J. Wysocki @ 2017-03-21 14:38 ` Patrick Bellasi 2017-03-21 14:46 ` Rafael J. Wysocki 2017-03-21 15:02 ` Peter Zijlstra 1 sibling, 1 reply; 71+ messages in thread From: Patrick Bellasi @ 2017-03-21 14:38 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Vincent Guittot, Peter Zijlstra, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 21-Mar 15:26, Rafael J. Wysocki wrote: > On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote: > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote: > > >> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > >> > To work around this issue use the observation that, from the > > >> > schedutil governor's perspective, it does not make sense to decrease > > >> > the frequency of a CPU that doesn't enter idle and avoid decreasing > > >> > the frequency of busy CPUs. > > >> > > >> I don't fully agree with that statement. > > >> If there are 2 runnable tasks on CPU A and scheduler migrates the > > >> waiting task to another CPU B so CPU A is less loaded now, it makes > > >> sense to reduce the OPP. That's even for that purpose that we have > > >> decided to use scheduler metrics in cpufreq governor so we can adjust > > >> OPP immediately when tasks migrate. > > >> That being said, i probably know why you see such OPP switches in your > > >> use case. When we migrate a task, we also migrate/remove its > > >> utilization from CPU. > > >> If the CPU is not overloaded, it means that runnable tasks have all > > >> computation that they need and don't have any reason to use more when > > >> a task migrates to another CPU. so decreasing the OPP makes sense > > >> because the utilzation is decreasing > > >> If the CPU is overloaded, it means that runnable tasks have to share > > >> CPU time and probably don't have all computations that they would like > > >> so when a task migrate, the remaining tasks on the CPU will increase > > >> their utilization and fill space left by the task that has just > > >> migrated. So the CPU's utilization will decrease when a task migrates > > >> (and as a result the OPP) but then its utilization will increase with > > >> remaining tasks running more time as well as the OPP > > >> > > >> So you need to make the difference between this 2 cases: Is a CPU > > >> overloaded or not. You can't really rely on the utilization to detect > > >> that but you could take advantage of the load which take into account > > >> the waiting time of tasks > > > > > > I'm confused. What two cases? You only list the overloaded case, but he > > > > overloaded vs not overloaded use case. > > For the not overloaded case, it makes sense to immediately update to > > OPP to be aligned with the new utilization of the CPU even if it was > > not idle in the past couple of ticks > > Yes, if the OPP (or P-state if you will) can be changed immediately. If it can't, > conditions may change by the time we actually update it and in that case It'd > be better to wait and see IMO. > > In any case, the theory about migrating tasks made sense to me, so below is > what I tested. It works, and besides it has a nice feature that I don't need > to fetch for the timekeeping data. :-) > > I only wonder if we want to do this or only prevent the frequency from > decreasing in the overloaded case? > > --- > kernel/sched/cpufreq_schedutil.c | 8 +++++--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > =================================================================== > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > +++ linux-pm/kernel/sched/cpufreq_schedutil.c > @@ -61,6 +61,7 @@ struct sugov_cpu { > unsigned long util; > unsigned long max; > unsigned int flags; > + bool overload; > }; > > static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); > @@ -207,7 +208,7 @@ static void sugov_update_single(struct u > if (!sugov_should_update_freq(sg_policy, time)) > return; > > - if (flags & SCHED_CPUFREQ_RT_DL) { > + if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) { > next_f = policy->cpuinfo.max_freq; Isn't this going to max OPP every time we have more than 1 task in that CPU? In that case it will not fit the case: we have two 10% tasks on that CPU. Previous solution was better IMO, apart from using overloaded instead of overutilized (which is not yet there) :-/ > } else { > sugov_get_util(&util, &max); > @@ -242,7 +243,7 @@ static unsigned int sugov_next_freq_shar > j_sg_cpu->iowait_boost = 0; > continue; > } > - if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL) > + if ((j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL) || j_sg_cpu->overload) > return policy->cpuinfo.max_freq; > > j_util = j_sg_cpu->util; > @@ -273,12 +274,13 @@ static void sugov_update_shared(struct u > sg_cpu->util = util; > sg_cpu->max = max; > sg_cpu->flags = flags; > + sg_cpu->overload = this_rq()->rd->overload; > > sugov_set_iowait_boost(sg_cpu, time, flags); > sg_cpu->last_update = time; > > if (sugov_should_update_freq(sg_policy, time)) { > - if (flags & SCHED_CPUFREQ_RT_DL) > + if ((flags & SCHED_CPUFREQ_RT_DL) || sg_cpu->overload) > next_f = sg_policy->policy->cpuinfo.max_freq; > else > next_f = sugov_next_freq_shared(sg_cpu); > -- #include <best/regards.h> Patrick Bellasi ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 14:38 ` Patrick Bellasi @ 2017-03-21 14:46 ` Rafael J. Wysocki 2017-03-21 14:50 ` Rafael J. Wysocki ` (2 more replies) 0 siblings, 3 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-21 14:46 UTC (permalink / raw) To: Patrick Bellasi Cc: Vincent Guittot, Peter Zijlstra, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tuesday, March 21, 2017 02:38:42 PM Patrick Bellasi wrote: > On 21-Mar 15:26, Rafael J. Wysocki wrote: > > On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote: > > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote: > > > >> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > > >> > To work around this issue use the observation that, from the > > > >> > schedutil governor's perspective, it does not make sense to decrease > > > >> > the frequency of a CPU that doesn't enter idle and avoid decreasing > > > >> > the frequency of busy CPUs. > > > >> > > > >> I don't fully agree with that statement. > > > >> If there are 2 runnable tasks on CPU A and scheduler migrates the > > > >> waiting task to another CPU B so CPU A is less loaded now, it makes > > > >> sense to reduce the OPP. That's even for that purpose that we have > > > >> decided to use scheduler metrics in cpufreq governor so we can adjust > > > >> OPP immediately when tasks migrate. > > > >> That being said, i probably know why you see such OPP switches in your > > > >> use case. When we migrate a task, we also migrate/remove its > > > >> utilization from CPU. > > > >> If the CPU is not overloaded, it means that runnable tasks have all > > > >> computation that they need and don't have any reason to use more when > > > >> a task migrates to another CPU. so decreasing the OPP makes sense > > > >> because the utilzation is decreasing > > > >> If the CPU is overloaded, it means that runnable tasks have to share > > > >> CPU time and probably don't have all computations that they would like > > > >> so when a task migrate, the remaining tasks on the CPU will increase > > > >> their utilization and fill space left by the task that has just > > > >> migrated. So the CPU's utilization will decrease when a task migrates > > > >> (and as a result the OPP) but then its utilization will increase with > > > >> remaining tasks running more time as well as the OPP > > > >> > > > >> So you need to make the difference between this 2 cases: Is a CPU > > > >> overloaded or not. You can't really rely on the utilization to detect > > > >> that but you could take advantage of the load which take into account > > > >> the waiting time of tasks > > > > > > > > I'm confused. What two cases? You only list the overloaded case, but he > > > > > > overloaded vs not overloaded use case. > > > For the not overloaded case, it makes sense to immediately update to > > > OPP to be aligned with the new utilization of the CPU even if it was > > > not idle in the past couple of ticks > > > > Yes, if the OPP (or P-state if you will) can be changed immediately. If it can't, > > conditions may change by the time we actually update it and in that case It'd > > be better to wait and see IMO. > > > > In any case, the theory about migrating tasks made sense to me, so below is > > what I tested. It works, and besides it has a nice feature that I don't need > > to fetch for the timekeeping data. :-) > > > > I only wonder if we want to do this or only prevent the frequency from > > decreasing in the overloaded case? > > > > --- > > kernel/sched/cpufreq_schedutil.c | 8 +++++--- > > 1 file changed, 5 insertions(+), 3 deletions(-) > > > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > > =================================================================== > > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > > +++ linux-pm/kernel/sched/cpufreq_schedutil.c > > @@ -61,6 +61,7 @@ struct sugov_cpu { > > unsigned long util; > > unsigned long max; > > unsigned int flags; > > + bool overload; > > }; > > > > static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); > > @@ -207,7 +208,7 @@ static void sugov_update_single(struct u > > if (!sugov_should_update_freq(sg_policy, time)) > > return; > > > > - if (flags & SCHED_CPUFREQ_RT_DL) { > > + if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) { > > next_f = policy->cpuinfo.max_freq; > > Isn't this going to max OPP every time we have more than 1 task in > that CPU? > > In that case it will not fit the case: we have two 10% tasks on that CPU. Good point. > Previous solution was better IMO, apart from using overloaded instead > of overutilized (which is not yet there) :-/ OK, so the one below works too. --- kernel/sched/cpufreq_schedutil.c | 11 +++++++++++ 1 file changed, 11 insertions(+) Index: linux-pm/kernel/sched/cpufreq_schedutil.c =================================================================== --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c +++ linux-pm/kernel/sched/cpufreq_schedutil.c @@ -37,6 +37,7 @@ struct sugov_policy { s64 freq_update_delay_ns; unsigned int next_freq; unsigned int cached_raw_freq; + bool overload; /* The next fields are only needed if fast switch cannot be used. */ struct irq_work irq_work; @@ -61,6 +62,7 @@ struct sugov_cpu { unsigned long util; unsigned long max; unsigned int flags; + bool overload; }; static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); @@ -93,6 +95,9 @@ static void sugov_update_commit(struct s { struct cpufreq_policy *policy = sg_policy->policy; + if (sg_policy->overload && next_freq < sg_policy->next_freq) + next_freq = sg_policy->next_freq; + if (policy->fast_switch_enabled) { if (sg_policy->next_freq == next_freq) { trace_cpu_frequency(policy->cur, smp_processor_id()); @@ -207,6 +212,8 @@ static void sugov_update_single(struct u if (!sugov_should_update_freq(sg_policy, time)) return; + sg_policy->overload = this_rq()->rd->overload; + if (flags & SCHED_CPUFREQ_RT_DL) { next_f = policy->cpuinfo.max_freq; } else { @@ -225,6 +232,8 @@ static unsigned int sugov_next_freq_shar unsigned long util = 0, max = 1; unsigned int j; + sg_policy->overload = false; + for_each_cpu(j, policy->cpus) { struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j); unsigned long j_util, j_max; @@ -253,6 +262,7 @@ static unsigned int sugov_next_freq_shar } sugov_iowait_boost(j_sg_cpu, &util, &max); + sg_policy->overload = sg_policy->overload || sg_cpu->overload; } return get_next_freq(sg_policy, util, max); @@ -273,6 +283,7 @@ static void sugov_update_shared(struct u sg_cpu->util = util; sg_cpu->max = max; sg_cpu->flags = flags; + sg_cpu->overload = this_rq()->rd->overload; sugov_set_iowait_boost(sg_cpu, time, flags); sg_cpu->last_update = time; ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 14:46 ` Rafael J. Wysocki @ 2017-03-21 14:50 ` Rafael J. Wysocki 2017-03-21 15:04 ` Peter Zijlstra 2017-03-21 15:08 ` Patrick Bellasi 2 siblings, 0 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-21 14:50 UTC (permalink / raw) To: Patrick Bellasi Cc: Vincent Guittot, Peter Zijlstra, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tuesday, March 21, 2017 03:46:07 PM Rafael J. Wysocki wrote: > On Tuesday, March 21, 2017 02:38:42 PM Patrick Bellasi wrote: > > On 21-Mar 15:26, Rafael J. Wysocki wrote: > > > On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote: > > > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote: > > > > >> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > > > > >> > To work around this issue use the observation that, from the > > > > >> > schedutil governor's perspective, it does not make sense to decrease > > > > >> > the frequency of a CPU that doesn't enter idle and avoid decreasing > > > > >> > the frequency of busy CPUs. > > > > >> > > > > >> I don't fully agree with that statement. > > > > >> If there are 2 runnable tasks on CPU A and scheduler migrates the > > > > >> waiting task to another CPU B so CPU A is less loaded now, it makes > > > > >> sense to reduce the OPP. That's even for that purpose that we have > > > > >> decided to use scheduler metrics in cpufreq governor so we can adjust > > > > >> OPP immediately when tasks migrate. > > > > >> That being said, i probably know why you see such OPP switches in your > > > > >> use case. When we migrate a task, we also migrate/remove its > > > > >> utilization from CPU. > > > > >> If the CPU is not overloaded, it means that runnable tasks have all > > > > >> computation that they need and don't have any reason to use more when > > > > >> a task migrates to another CPU. so decreasing the OPP makes sense > > > > >> because the utilzation is decreasing > > > > >> If the CPU is overloaded, it means that runnable tasks have to share > > > > >> CPU time and probably don't have all computations that they would like > > > > >> so when a task migrate, the remaining tasks on the CPU will increase > > > > >> their utilization and fill space left by the task that has just > > > > >> migrated. So the CPU's utilization will decrease when a task migrates > > > > >> (and as a result the OPP) but then its utilization will increase with > > > > >> remaining tasks running more time as well as the OPP > > > > >> > > > > >> So you need to make the difference between this 2 cases: Is a CPU > > > > >> overloaded or not. You can't really rely on the utilization to detect > > > > >> that but you could take advantage of the load which take into account > > > > >> the waiting time of tasks > > > > > > > > > > I'm confused. What two cases? You only list the overloaded case, but he > > > > > > > > overloaded vs not overloaded use case. > > > > For the not overloaded case, it makes sense to immediately update to > > > > OPP to be aligned with the new utilization of the CPU even if it was > > > > not idle in the past couple of ticks > > > > > > Yes, if the OPP (or P-state if you will) can be changed immediately. If it can't, > > > conditions may change by the time we actually update it and in that case It'd > > > be better to wait and see IMO. > > > > > > In any case, the theory about migrating tasks made sense to me, so below is > > > what I tested. It works, and besides it has a nice feature that I don't need > > > to fetch for the timekeeping data. :-) > > > > > > I only wonder if we want to do this or only prevent the frequency from > > > decreasing in the overloaded case? > > > > > > --- > > > kernel/sched/cpufreq_schedutil.c | 8 +++++--- > > > 1 file changed, 5 insertions(+), 3 deletions(-) > > > > > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > > > =================================================================== > > > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > > > +++ linux-pm/kernel/sched/cpufreq_schedutil.c > > > @@ -61,6 +61,7 @@ struct sugov_cpu { > > > unsigned long util; > > > unsigned long max; > > > unsigned int flags; > > > + bool overload; > > > }; > > > > > > static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); > > > @@ -207,7 +208,7 @@ static void sugov_update_single(struct u > > > if (!sugov_should_update_freq(sg_policy, time)) > > > return; > > > > > > - if (flags & SCHED_CPUFREQ_RT_DL) { > > > + if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) { > > > next_f = policy->cpuinfo.max_freq; > > > > Isn't this going to max OPP every time we have more than 1 task in > > that CPU? > > > > In that case it will not fit the case: we have two 10% tasks on that CPU. > > Good point. > > > Previous solution was better IMO, apart from using overloaded instead > > of overutilized (which is not yet there) :-/ > > OK, so the one below works too. Admittedly, we could check the idle condition and the overload flag at the same time, though. Let me try that too. Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 14:46 ` Rafael J. Wysocki 2017-03-21 14:50 ` Rafael J. Wysocki @ 2017-03-21 15:04 ` Peter Zijlstra 2017-03-21 15:18 ` Rafael J. Wysocki 2017-03-21 15:08 ` Patrick Bellasi 2 siblings, 1 reply; 71+ messages in thread From: Peter Zijlstra @ 2017-03-21 15:04 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Patrick Bellasi, Vincent Guittot, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tue, Mar 21, 2017 at 03:46:07PM +0100, Rafael J. Wysocki wrote: > @@ -207,6 +212,8 @@ static void sugov_update_single(struct u > if (!sugov_should_update_freq(sg_policy, time)) > return; > > + sg_policy->overload = this_rq()->rd->overload; > + Same problem as before; rd->overload is set if _any_ CPU in the root domain has more than 1 runnable task at a random point in history (when we ran the load balance tick -- and since that is the same tick used for timers, there's a bias to over-account there). ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 15:04 ` Peter Zijlstra @ 2017-03-21 15:18 ` Rafael J. Wysocki 2017-03-21 17:00 ` Peter Zijlstra 0 siblings, 1 reply; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-21 15:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Patrick Bellasi, Vincent Guittot, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tuesday, March 21, 2017 04:04:03 PM Peter Zijlstra wrote: > On Tue, Mar 21, 2017 at 03:46:07PM +0100, Rafael J. Wysocki wrote: > > @@ -207,6 +212,8 @@ static void sugov_update_single(struct u > > if (!sugov_should_update_freq(sg_policy, time)) > > return; > > > > + sg_policy->overload = this_rq()->rd->overload; > > + > > Same problem as before; rd->overload is set if _any_ CPU in the root > domain has more than 1 runnable task at a random point in history (when > we ran the load balance tick -- and since that is the same tick used for > timers, there's a bias to over-account there). OK What about the one below then? It checks both the idle calls count and overload and only then it will prevent the frequency from being decreased. It is sufficient for the case at hand. I guess if rd->overload is not set, this means that none of the CPUs is oversubscribed and we just saturate the capacity in a one-task-per-CPU kind of fashion. Right? --- include/linux/tick.h | 1 + kernel/sched/cpufreq_schedutil.c | 27 +++++++++++++++++++++++++++ kernel/time/tick-sched.c | 12 ++++++++++++ 3 files changed, 40 insertions(+) Index: linux-pm/kernel/sched/cpufreq_schedutil.c =================================================================== --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c +++ linux-pm/kernel/sched/cpufreq_schedutil.c @@ -37,6 +37,7 @@ struct sugov_policy { s64 freq_update_delay_ns; unsigned int next_freq; unsigned int cached_raw_freq; + bool busy; /* The next fields are only needed if fast switch cannot be used. */ struct irq_work irq_work; @@ -56,11 +57,15 @@ struct sugov_cpu { unsigned long iowait_boost; unsigned long iowait_boost_max; u64 last_update; +#ifdef CONFIG_NO_HZ_COMMON + unsigned long saved_idle_calls; +#endif /* The fields below are only needed when sharing a policy. */ unsigned long util; unsigned long max; unsigned int flags; + bool busy; }; static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); @@ -93,6 +98,9 @@ static void sugov_update_commit(struct s { struct cpufreq_policy *policy = sg_policy->policy; + if (sg_policy->busy && next_freq < sg_policy->next_freq) + next_freq = sg_policy->next_freq; + if (policy->fast_switch_enabled) { if (sg_policy->next_freq == next_freq) { trace_cpu_frequency(policy->cur, smp_processor_id()); @@ -192,6 +200,19 @@ static void sugov_iowait_boost(struct su sg_cpu->iowait_boost >>= 1; } +#ifdef CONFIG_NO_HZ_COMMON +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) +{ + unsigned long idle_calls = tick_nohz_get_idle_calls(); + bool not_idle = idle_calls == sg_cpu->saved_idle_calls; + + sg_cpu->saved_idle_calls = idle_calls; + return not_idle && this_rq()->rd->overload; +} +#else +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } +#endif /* CONFIG_NO_HZ_COMMON */ + static void sugov_update_single(struct update_util_data *hook, u64 time, unsigned int flags) { @@ -207,6 +228,8 @@ static void sugov_update_single(struct u if (!sugov_should_update_freq(sg_policy, time)) return; + sg_policy->busy = sugov_cpu_is_busy(sg_cpu); + if (flags & SCHED_CPUFREQ_RT_DL) { next_f = policy->cpuinfo.max_freq; } else { @@ -225,6 +248,8 @@ static unsigned int sugov_next_freq_shar unsigned long util = 0, max = 1; unsigned int j; + sg_policy->busy = false; + for_each_cpu(j, policy->cpus) { struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j); unsigned long j_util, j_max; @@ -253,6 +278,7 @@ static unsigned int sugov_next_freq_shar } sugov_iowait_boost(j_sg_cpu, &util, &max); + sg_policy->busy = sg_policy->busy || sg_cpu->busy; } return get_next_freq(sg_policy, util, max); @@ -273,6 +299,7 @@ static void sugov_update_shared(struct u sg_cpu->util = util; sg_cpu->max = max; sg_cpu->flags = flags; + sg_cpu->busy = sugov_cpu_is_busy(sg_cpu); sugov_set_iowait_boost(sg_cpu, time, flags); sg_cpu->last_update = time; Index: linux-pm/include/linux/tick.h =================================================================== --- linux-pm.orig/include/linux/tick.h +++ linux-pm/include/linux/tick.h @@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void); extern void tick_nohz_idle_exit(void); extern void tick_nohz_irq_exit(void); extern ktime_t tick_nohz_get_sleep_length(void); +extern unsigned long tick_nohz_get_idle_calls(void); extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time); extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time); #else /* !CONFIG_NO_HZ_COMMON */ Index: linux-pm/kernel/time/tick-sched.c =================================================================== --- linux-pm.orig/kernel/time/tick-sched.c +++ linux-pm/kernel/time/tick-sched.c @@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void) return ts->sleep_length; } +/** + * tick_nohz_get_idle_calls - return the current idle calls counter value + * + * Called from the schedutil frequency scaling governor in scheduler context. + */ +unsigned long tick_nohz_get_idle_calls(void) +{ + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched); + + return ts->idle_calls; +} + static void tick_nohz_account_idle_ticks(struct tick_sched *ts) { #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 15:18 ` Rafael J. Wysocki @ 2017-03-21 17:00 ` Peter Zijlstra 2017-03-21 17:17 ` Rafael J. Wysocki 0 siblings, 1 reply; 71+ messages in thread From: Peter Zijlstra @ 2017-03-21 17:00 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Patrick Bellasi, Vincent Guittot, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tue, Mar 21, 2017 at 04:18:52PM +0100, Rafael J. Wysocki wrote: > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > +{ > + unsigned long idle_calls = tick_nohz_get_idle_calls(); > + bool not_idle = idle_calls == sg_cpu->saved_idle_calls; > + > + sg_cpu->saved_idle_calls = idle_calls; > + return not_idle && this_rq()->rd->overload; > +} So I really don't understand the rd->overload thing. What is it supposed to do here? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 17:00 ` Peter Zijlstra @ 2017-03-21 17:17 ` Rafael J. Wysocki 0 siblings, 0 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-21 17:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Patrick Bellasi, Vincent Guittot, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tuesday, March 21, 2017 06:00:17 PM Peter Zijlstra wrote: > On Tue, Mar 21, 2017 at 04:18:52PM +0100, Rafael J. Wysocki wrote: > > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > > +{ > > + unsigned long idle_calls = tick_nohz_get_idle_calls(); > > + bool not_idle = idle_calls == sg_cpu->saved_idle_calls; > > + > > + sg_cpu->saved_idle_calls = idle_calls; > > + return not_idle && this_rq()->rd->overload; > > +} > > So I really don't understand the rd->overload thing. What is it supposed > to do here? The idea was that if the CPU was running one task saturating the capacity which then was migrated out of it, the frequency should still be reduced. And since rd->overload covers all CPUs (in general) it kind of tells us whether or not there are other tasks to replace the migrated one any time soon. However, if there are no tasks to replace the migrated one, the CPU will go idle quickly (as there are no taks to run on it), in which case keeping the current frequency on it shouldn't matter. In all of the other cases keeping the current frequency is the right thing to do IMO. So, it looks like checking this_rq()->rd->overload doesn't really help after all. :-) Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 14:46 ` Rafael J. Wysocki 2017-03-21 14:50 ` Rafael J. Wysocki 2017-03-21 15:04 ` Peter Zijlstra @ 2017-03-21 15:08 ` Patrick Bellasi 2017-03-21 15:18 ` Peter Zijlstra 2 siblings, 1 reply; 71+ messages in thread From: Patrick Bellasi @ 2017-03-21 15:08 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Vincent Guittot, Peter Zijlstra, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 21-Mar 15:46, Rafael J. Wysocki wrote: > On Tuesday, March 21, 2017 02:38:42 PM Patrick Bellasi wrote: > > On 21-Mar 15:26, Rafael J. Wysocki wrote: > > > On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote: > > > > On 21 March 2017 at 14:22, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote: > > > > >> On 20 March 2017 at 22:46, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > > > > >> > To work around this issue use the observation that, from the > > > > >> > schedutil governor's perspective, it does not make sense to decrease > > > > >> > the frequency of a CPU that doesn't enter idle and avoid decreasing > > > > >> > the frequency of busy CPUs. > > > > >> > > > > >> I don't fully agree with that statement. > > > > >> If there are 2 runnable tasks on CPU A and scheduler migrates the > > > > >> waiting task to another CPU B so CPU A is less loaded now, it makes > > > > >> sense to reduce the OPP. That's even for that purpose that we have > > > > >> decided to use scheduler metrics in cpufreq governor so we can adjust > > > > >> OPP immediately when tasks migrate. > > > > >> That being said, i probably know why you see such OPP switches in your > > > > >> use case. When we migrate a task, we also migrate/remove its > > > > >> utilization from CPU. > > > > >> If the CPU is not overloaded, it means that runnable tasks have all > > > > >> computation that they need and don't have any reason to use more when > > > > >> a task migrates to another CPU. so decreasing the OPP makes sense > > > > >> because the utilzation is decreasing > > > > >> If the CPU is overloaded, it means that runnable tasks have to share > > > > >> CPU time and probably don't have all computations that they would like > > > > >> so when a task migrate, the remaining tasks on the CPU will increase > > > > >> their utilization and fill space left by the task that has just > > > > >> migrated. So the CPU's utilization will decrease when a task migrates > > > > >> (and as a result the OPP) but then its utilization will increase with > > > > >> remaining tasks running more time as well as the OPP > > > > >> > > > > >> So you need to make the difference between this 2 cases: Is a CPU > > > > >> overloaded or not. You can't really rely on the utilization to detect > > > > >> that but you could take advantage of the load which take into account > > > > >> the waiting time of tasks > > > > > > > > > > I'm confused. What two cases? You only list the overloaded case, but he > > > > > > > > overloaded vs not overloaded use case. > > > > For the not overloaded case, it makes sense to immediately update to > > > > OPP to be aligned with the new utilization of the CPU even if it was > > > > not idle in the past couple of ticks > > > > > > Yes, if the OPP (or P-state if you will) can be changed immediately. If it can't, > > > conditions may change by the time we actually update it and in that case It'd > > > be better to wait and see IMO. > > > > > > In any case, the theory about migrating tasks made sense to me, so below is > > > what I tested. It works, and besides it has a nice feature that I don't need > > > to fetch for the timekeeping data. :-) > > > > > > I only wonder if we want to do this or only prevent the frequency from > > > decreasing in the overloaded case? > > > > > > --- > > > kernel/sched/cpufreq_schedutil.c | 8 +++++--- > > > 1 file changed, 5 insertions(+), 3 deletions(-) > > > > > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > > > =================================================================== > > > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > > > +++ linux-pm/kernel/sched/cpufreq_schedutil.c > > > @@ -61,6 +61,7 @@ struct sugov_cpu { > > > unsigned long util; > > > unsigned long max; > > > unsigned int flags; > > > + bool overload; > > > }; > > > > > > static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); > > > @@ -207,7 +208,7 @@ static void sugov_update_single(struct u > > > if (!sugov_should_update_freq(sg_policy, time)) > > > return; > > > > > > - if (flags & SCHED_CPUFREQ_RT_DL) { > > > + if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) { > > > next_f = policy->cpuinfo.max_freq; > > > > Isn't this going to max OPP every time we have more than 1 task in > > that CPU? > > > > In that case it will not fit the case: we have two 10% tasks on that CPU. > > Good point. > > > Previous solution was better IMO, apart from using overloaded instead > > of overutilized (which is not yet there) :-/ > > OK, so the one below works too. Better... just one minor comment. > --- > kernel/sched/cpufreq_schedutil.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > =================================================================== > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > +++ linux-pm/kernel/sched/cpufreq_schedutil.c > @@ -37,6 +37,7 @@ struct sugov_policy { > s64 freq_update_delay_ns; > unsigned int next_freq; > unsigned int cached_raw_freq; > + bool overload; Can we avoid using "overloaded" in favor of a more generic and schedutil specific name. Mainly because in the future we would probably like to switch from "overloaded" to "overutilized". What about something like: "busy" ? > > /* The next fields are only needed if fast switch cannot be used. */ > struct irq_work irq_work; > @@ -61,6 +62,7 @@ struct sugov_cpu { > unsigned long util; > unsigned long max; > unsigned int flags; > + bool overload; > }; > > static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); > @@ -93,6 +95,9 @@ static void sugov_update_commit(struct s > { > struct cpufreq_policy *policy = sg_policy->policy; > > + if (sg_policy->overload && next_freq < sg_policy->next_freq) > + next_freq = sg_policy->next_freq; > + > if (policy->fast_switch_enabled) { > if (sg_policy->next_freq == next_freq) { > trace_cpu_frequency(policy->cur, smp_processor_id()); > @@ -207,6 +212,8 @@ static void sugov_update_single(struct u > if (!sugov_should_update_freq(sg_policy, time)) > return; > > + sg_policy->overload = this_rq()->rd->overload; And than we can move this bit into an inline function, something like e.g.: static inline bool sugov_this_cpu_is_busy() { return this_rq()->rd->overloaded } Where in future we can easily switch from usage of "overloaded" to usage of "utilization". > + > if (flags & SCHED_CPUFREQ_RT_DL) { > next_f = policy->cpuinfo.max_freq; > } else { > @@ -225,6 +232,8 @@ static unsigned int sugov_next_freq_shar > unsigned long util = 0, max = 1; > unsigned int j; > > + sg_policy->overload = false; > + > for_each_cpu(j, policy->cpus) { > struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j); > unsigned long j_util, j_max; > @@ -253,6 +262,7 @@ static unsigned int sugov_next_freq_shar > } > > sugov_iowait_boost(j_sg_cpu, &util, &max); > + sg_policy->overload = sg_policy->overload || sg_cpu->overload; > } > > return get_next_freq(sg_policy, util, max); > @@ -273,6 +283,7 @@ static void sugov_update_shared(struct u > sg_cpu->util = util; > sg_cpu->max = max; > sg_cpu->flags = flags; > + sg_cpu->overload = this_rq()->rd->overload; > > sugov_set_iowait_boost(sg_cpu, time, flags); > sg_cpu->last_update = time; > -- #include <best/regards.h> Patrick Bellasi ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 15:08 ` Patrick Bellasi @ 2017-03-21 15:18 ` Peter Zijlstra 2017-03-21 19:28 ` Patrick Bellasi 0 siblings, 1 reply; 71+ messages in thread From: Peter Zijlstra @ 2017-03-21 15:18 UTC (permalink / raw) To: Patrick Bellasi Cc: Rafael J. Wysocki, Vincent Guittot, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar Seriously people, trim your replies. On Tue, Mar 21, 2017 at 03:08:20PM +0000, Patrick Bellasi wrote: > And than we can move this bit into an inline function, something like e.g.: > > static inline bool sugov_this_cpu_is_busy() > { > return this_rq()->rd->overloaded > } No, that's just entirely and utterly wrong. It being in rd means its very much not about _this_ CPU in any way. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 15:18 ` Peter Zijlstra @ 2017-03-21 19:28 ` Patrick Bellasi 0 siblings, 0 replies; 71+ messages in thread From: Patrick Bellasi @ 2017-03-21 19:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Rafael J. Wysocki, Vincent Guittot, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 21-Mar 16:18, Peter Zijlstra wrote: > On Tue, Mar 21, 2017 at 03:08:20PM +0000, Patrick Bellasi wrote: > > > And than we can move this bit into an inline function, something like e.g.: > > > > static inline bool sugov_this_cpu_is_busy() > > { > > return this_rq()->rd->overloaded > > } > > No, that's just entirely and utterly wrong. It being in rd means its > very much not about _this_ CPU in any way. You right (of course), we cannot really use "this_" in the name of a function with such a code. The suggestion here was at least to factor out whatever code we want to use to check if the current CPU has to be subject to a down-scaling constraint. However, using rd->overload is not the best option, for the many reasons you explained in your previous comment. Thus, we should probably stay with the idle time tracking solution initially proposed by Rafael. Sorry for the noise :-( -- #include <best/regards.h> Patrick Bellasi ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-21 14:26 ` Rafael J. Wysocki 2017-03-21 14:38 ` Patrick Bellasi @ 2017-03-21 15:02 ` Peter Zijlstra 1 sibling, 0 replies; 71+ messages in thread From: Peter Zijlstra @ 2017-03-21 15:02 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Vincent Guittot, Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar On Tue, Mar 21, 2017 at 03:26:06PM +0100, Rafael J. Wysocki wrote: > + if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) { > next_f = policy->cpuinfo.max_freq; So this I think is wrong; rd->overload is set if _any_ of the CPUs in the root domain is overloaded. And given the root domain is typically the _entire_ machine, this would have a tendency to run at max_freq far too often. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs 2017-03-20 21:46 ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki 2017-03-21 6:40 ` Viresh Kumar 2017-03-21 8:50 ` Vincent Guittot @ 2017-03-21 11:50 ` Patrick Bellasi 2017-03-21 23:08 ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki 3 siblings, 0 replies; 71+ messages in thread From: Patrick Bellasi @ 2017-03-21 11:50 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, LKML, Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Joel Fernandes, Morten Rasmussen, Ingo Molnar On 20-Mar 22:46, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > The way the schedutil governor uses the PELT metric causes it to > underestimate the CPU utilization in some cases. > > That can be easily demonstrated by running kernel compilation on > a Sandy Bridge Intel processor, running turbostat in parallel with > it and looking at the values written to the MSR_IA32_PERF_CTL > register. Namely, the expected result would be that when all CPUs > were 100% busy, all of them would be requested to run in the maximum > P-state, but observation shows that this clearly isn't the case. > The CPUs run in the maximum P-state for a while and then are > requested to run slower and go back to the maximum P-state after > a while again. That causes the actual frequency of the processor to > visibly oscillate below the sustainable maximum in a jittery fashion > which clearly is not desirable. > > To work around this issue use the observation that, from the > schedutil governor's perspective, it does not make sense to decrease > the frequency of a CPU that doesn't enter idle and avoid decreasing > the frequency of busy CPUs. > > To that end, use the counter of idle calls in the timekeeping code. > Namely, make the schedutil governor look at that counter for the > current CPU every time before it is about to set a new frequency > for that CPU's policy. If the counter has not changed since the > previous iteration, the CPU has been busy for all that time and > its frequency should not be decreased, so if the new frequency would > be lower than the one set previously, the governor will skip the > frequency update. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > This is a slightly different approach (avoid decreasing frequency for busy CPUs > instead of bumping if for them to the max upfront) and it works around the > original problem too. I like much better this version where we do not enforce max frequency as well as we removed the hardcoded time threshold. ;-) Makes sense to me also to avoid down scaling until we don't hit an IDLE. However, I also agree with Vincent's observation: this constraint should be there only for "overutilized" CPUs... but unfortunately, in mainline we are still missing that flag and thus we should probably use the "overloaded" one for the time being. > I tried to address a few Peter's comments here and the result doesn't seem to > be too heavy-wieght. Nice! > Thanks, > Rafael > > --- > include/linux/tick.h | 1 + > kernel/sched/cpufreq_schedutil.c | 28 ++++++++++++++++++++++++---- > kernel/time/tick-sched.c | 12 ++++++++++++ > 3 files changed, 37 insertions(+), 4 deletions(-) > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > =================================================================== > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > +++ linux-pm/kernel/sched/cpufreq_schedutil.c > @@ -56,6 +56,9 @@ struct sugov_cpu { > unsigned long iowait_boost; > unsigned long iowait_boost_max; > u64 last_update; > +#ifdef CONFIG_NO_HZ_COMMON > + unsigned long saved_idle_calls; > +#endif > > /* The fields below are only needed when sharing a policy. */ > unsigned long util; > @@ -88,11 +91,28 @@ static bool sugov_should_update_freq(str > return delta_ns >= sg_policy->freq_update_delay_ns; > } > > -static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, > - unsigned int next_freq) > +#ifdef CONFIG_NO_HZ_COMMON > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > +{ > + unsigned long idle_calls = tick_nohz_get_idle_calls(); > + bool ret = idle_calls == sg_cpu->saved_idle_calls; > + > + sg_cpu->saved_idle_calls = idle_calls; > + return ret; > +} > +#else > +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } > +#endif /* CONFIG_NO_HZ_COMMON */ > + > +static void sugov_update_commit(struct sugov_cpu *sg_cpu, > + struct sugov_policy *sg_policy, > + u64 time, unsigned int next_freq) > { > struct cpufreq_policy *policy = sg_policy->policy; > > + if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq) > + next_freq = sg_policy->next_freq; > + > if (policy->fast_switch_enabled) { > if (sg_policy->next_freq == next_freq) { > trace_cpu_frequency(policy->cur, smp_processor_id()); > @@ -214,7 +234,7 @@ static void sugov_update_single(struct u > sugov_iowait_boost(sg_cpu, &util, &max); > next_f = get_next_freq(sg_policy, util, max); > } > - sugov_update_commit(sg_policy, time, next_f); > + sugov_update_commit(sg_cpu, sg_policy, time, next_f); > } > > static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu) > @@ -283,7 +303,7 @@ static void sugov_update_shared(struct u > else > next_f = sugov_next_freq_shared(sg_cpu); > > - sugov_update_commit(sg_policy, time, next_f); > + sugov_update_commit(sg_cpu, sg_policy, time, next_f); > } > > raw_spin_unlock(&sg_policy->update_lock); > Index: linux-pm/include/linux/tick.h > =================================================================== > --- linux-pm.orig/include/linux/tick.h > +++ linux-pm/include/linux/tick.h > @@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void); > extern void tick_nohz_idle_exit(void); > extern void tick_nohz_irq_exit(void); > extern ktime_t tick_nohz_get_sleep_length(void); > +extern unsigned long tick_nohz_get_idle_calls(void); > extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time); > extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time); > #else /* !CONFIG_NO_HZ_COMMON */ > Index: linux-pm/kernel/time/tick-sched.c > =================================================================== > --- linux-pm.orig/kernel/time/tick-sched.c > +++ linux-pm/kernel/time/tick-sched.c > @@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void) > return ts->sleep_length; > } > > +/** > + * tick_nohz_get_idle_calls - return the current idle calls counter value > + * > + * Called from the schedutil frequency scaling governor in scheduler context. > + */ > +unsigned long tick_nohz_get_idle_calls(void) > +{ > + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched); > + > + return ts->idle_calls; > +} > + > static void tick_nohz_account_idle_ticks(struct tick_sched *ts) > { > #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE > -- #include <best/regards.h> Patrick Bellasi ^ permalink raw reply [flat|nested] 71+ messages in thread
* [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-20 21:46 ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki ` (2 preceding siblings ...) 2017-03-21 11:50 ` Patrick Bellasi @ 2017-03-21 23:08 ` Rafael J. Wysocki 2017-03-22 9:26 ` Peter Zijlstra ` (5 more replies) 3 siblings, 6 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-21 23:08 UTC (permalink / raw) To: Linux PM, Peter Zijlstra Cc: LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> The way the schedutil governor uses the PELT metric causes it to underestimate the CPU utilization in some cases. That can be easily demonstrated by running kernel compilation on a Sandy Bridge Intel processor, running turbostat in parallel with it and looking at the values written to the MSR_IA32_PERF_CTL register. Namely, the expected result would be that when all CPUs were 100% busy, all of them would be requested to run in the maximum P-state, but observation shows that this clearly isn't the case. The CPUs run in the maximum P-state for a while and then are requested to run slower and go back to the maximum P-state after a while again. That causes the actual frequency of the processor to visibly oscillate below the sustainable maximum in a jittery fashion which clearly is not desirable. That has been attributed to CPU utilization metric updates on task migration that cause the total utilization value for the CPU to be reduced by the utilization of the migrated task. If that happens, the schedutil governor may see a CPU utilization reduction and will attempt to reduce the CPU frequency accordingly right away. That may be premature, though, for example if the system is generally busy and there are other runnable tasks waiting to be run on that CPU already. This is unlikely to be an issue on systems where cpufreq policies are shared between multiple CPUs, because in those cases the policy utilization is computed as the maximum of the CPU utilization values over the whole policy and if that turns out to be low, reducing the frequency for the policy most likely is a good idea anyway. On systems with one CPU per policy, however, it may affect performance adversely and even lead to increased energy consumption in some cases. On those systems it may be addressed by taking another utilization metric into consideration, like whether or not the CPU whose frequency is about to be reduced has been idle recently, because if that's not the case, the CPU is likely to be busy in the near future and its frequency should not be reduced. To that end, use the counter of idle calls in the timekeeping code. Namely, make the schedutil governor look at that counter for the current CPU every time before its frequency is about to be reduced. If the counter has not changed since the previous iteration of the governor computations for that CPU, the CPU has been busy for all that time and its frequency should not be decreased, so if the new frequency would be lower than the one set previously, the governor will skip the frequency update. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- include/linux/tick.h | 1 + kernel/sched/cpufreq_schedutil.c | 27 +++++++++++++++++++++++++++ kernel/time/tick-sched.c | 12 ++++++++++++ 3 files changed, 40 insertions(+) Index: linux-pm/kernel/sched/cpufreq_schedutil.c =================================================================== --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c +++ linux-pm/kernel/sched/cpufreq_schedutil.c @@ -61,6 +61,11 @@ struct sugov_cpu { unsigned long util; unsigned long max; unsigned int flags; + + /* The field below is for single-CPU policies only. */ +#ifdef CONFIG_NO_HZ_COMMON + unsigned long saved_idle_calls; +#endif }; static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); @@ -192,6 +197,19 @@ static void sugov_iowait_boost(struct su sg_cpu->iowait_boost >>= 1; } +#ifdef CONFIG_NO_HZ_COMMON +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) +{ + unsigned long idle_calls = tick_nohz_get_idle_calls(); + bool ret = idle_calls == sg_cpu->saved_idle_calls; + + sg_cpu->saved_idle_calls = idle_calls; + return ret; +} +#else +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } +#endif /* CONFIG_NO_HZ_COMMON */ + static void sugov_update_single(struct update_util_data *hook, u64 time, unsigned int flags) { @@ -200,6 +218,7 @@ static void sugov_update_single(struct u struct cpufreq_policy *policy = sg_policy->policy; unsigned long util, max; unsigned int next_f; + bool busy; sugov_set_iowait_boost(sg_cpu, time, flags); sg_cpu->last_update = time; @@ -207,12 +226,20 @@ static void sugov_update_single(struct u if (!sugov_should_update_freq(sg_policy, time)) return; + busy = sugov_cpu_is_busy(sg_cpu); + if (flags & SCHED_CPUFREQ_RT_DL) { next_f = policy->cpuinfo.max_freq; } else { sugov_get_util(&util, &max); sugov_iowait_boost(sg_cpu, &util, &max); next_f = get_next_freq(sg_policy, util, max); + /* + * Do not reduce the frequency if the CPU has not been idle + * recently, as the reduction is likely to be premature then. + */ + if (busy && next_f < sg_policy->next_freq) + next_f = sg_policy->next_freq; } sugov_update_commit(sg_policy, time, next_f); } Index: linux-pm/include/linux/tick.h =================================================================== --- linux-pm.orig/include/linux/tick.h +++ linux-pm/include/linux/tick.h @@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void); extern void tick_nohz_idle_exit(void); extern void tick_nohz_irq_exit(void); extern ktime_t tick_nohz_get_sleep_length(void); +extern unsigned long tick_nohz_get_idle_calls(void); extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time); extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time); #else /* !CONFIG_NO_HZ_COMMON */ Index: linux-pm/kernel/time/tick-sched.c =================================================================== --- linux-pm.orig/kernel/time/tick-sched.c +++ linux-pm/kernel/time/tick-sched.c @@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void) return ts->sleep_length; } +/** + * tick_nohz_get_idle_calls - return the current idle calls counter value + * + * Called from the schedutil frequency scaling governor in scheduler context. + */ +unsigned long tick_nohz_get_idle_calls(void) +{ + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched); + + return ts->idle_calls; +} + static void tick_nohz_account_idle_ticks(struct tick_sched *ts) { #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-21 23:08 ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki @ 2017-03-22 9:26 ` Peter Zijlstra 2017-03-22 9:54 ` Viresh Kumar ` (4 subsequent siblings) 5 siblings, 0 replies; 71+ messages in thread From: Peter Zijlstra @ 2017-03-22 9:26 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner On Wed, Mar 22, 2017 at 12:08:50AM +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > The way the schedutil governor uses the PELT metric causes it to > underestimate the CPU utilization in some cases. > > That can be easily demonstrated by running kernel compilation on > a Sandy Bridge Intel processor, running turbostat in parallel with > it and looking at the values written to the MSR_IA32_PERF_CTL > register. Namely, the expected result would be that when all CPUs > were 100% busy, all of them would be requested to run in the maximum > P-state, but observation shows that this clearly isn't the case. > The CPUs run in the maximum P-state for a while and then are > requested to run slower and go back to the maximum P-state after > a while again. That causes the actual frequency of the processor to > visibly oscillate below the sustainable maximum in a jittery fashion > which clearly is not desirable. > > That has been attributed to CPU utilization metric updates on task > migration that cause the total utilization value for the CPU to be > reduced by the utilization of the migrated task. If that happens, > the schedutil governor may see a CPU utilization reduction and will > attempt to reduce the CPU frequency accordingly right away. That > may be premature, though, for example if the system is generally > busy and there are other runnable tasks waiting to be run on that > CPU already. > > This is unlikely to be an issue on systems where cpufreq policies are > shared between multiple CPUs, because in those cases the policy > utilization is computed as the maximum of the CPU utilization values > over the whole policy and if that turns out to be low, reducing the > frequency for the policy most likely is a good idea anyway. On > systems with one CPU per policy, however, it may affect performance > adversely and even lead to increased energy consumption in some cases. > > On those systems it may be addressed by taking another utilization > metric into consideration, like whether or not the CPU whose > frequency is about to be reduced has been idle recently, because if > that's not the case, the CPU is likely to be busy in the near future > and its frequency should not be reduced. > > To that end, use the counter of idle calls in the timekeeping code. > Namely, make the schedutil governor look at that counter for the > current CPU every time before its frequency is about to be reduced. > If the counter has not changed since the previous iteration of the > governor computations for that CPU, the CPU has been busy for all > that time and its frequency should not be decreased, so if the new > frequency would be lower than the one set previously, the governor > will skip the frequency update. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Right; this makes sense to me. Of course it would be good to have some more measurements on this, but in principle: Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-21 23:08 ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki 2017-03-22 9:26 ` Peter Zijlstra @ 2017-03-22 9:54 ` Viresh Kumar 2017-03-23 1:04 ` Joel Fernandes ` (3 subsequent siblings) 5 siblings, 0 replies; 71+ messages in thread From: Viresh Kumar @ 2017-03-22 9:54 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner On 22-03-17, 00:08, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > The way the schedutil governor uses the PELT metric causes it to > underestimate the CPU utilization in some cases. > > That can be easily demonstrated by running kernel compilation on > a Sandy Bridge Intel processor, running turbostat in parallel with > it and looking at the values written to the MSR_IA32_PERF_CTL > register. Namely, the expected result would be that when all CPUs > were 100% busy, all of them would be requested to run in the maximum > P-state, but observation shows that this clearly isn't the case. > The CPUs run in the maximum P-state for a while and then are > requested to run slower and go back to the maximum P-state after > a while again. That causes the actual frequency of the processor to > visibly oscillate below the sustainable maximum in a jittery fashion > which clearly is not desirable. > > That has been attributed to CPU utilization metric updates on task > migration that cause the total utilization value for the CPU to be > reduced by the utilization of the migrated task. If that happens, > the schedutil governor may see a CPU utilization reduction and will > attempt to reduce the CPU frequency accordingly right away. That > may be premature, though, for example if the system is generally > busy and there are other runnable tasks waiting to be run on that > CPU already. > > This is unlikely to be an issue on systems where cpufreq policies are > shared between multiple CPUs, because in those cases the policy > utilization is computed as the maximum of the CPU utilization values > over the whole policy and if that turns out to be low, reducing the > frequency for the policy most likely is a good idea anyway. On > systems with one CPU per policy, however, it may affect performance > adversely and even lead to increased energy consumption in some cases. > > On those systems it may be addressed by taking another utilization > metric into consideration, like whether or not the CPU whose > frequency is about to be reduced has been idle recently, because if > that's not the case, the CPU is likely to be busy in the near future > and its frequency should not be reduced. > > To that end, use the counter of idle calls in the timekeeping code. > Namely, make the schedutil governor look at that counter for the > current CPU every time before its frequency is about to be reduced. > If the counter has not changed since the previous iteration of the > governor computations for that CPU, the CPU has been busy for all > that time and its frequency should not be decreased, so if the new > frequency would be lower than the one set previously, the governor > will skip the frequency update. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > include/linux/tick.h | 1 + > kernel/sched/cpufreq_schedutil.c | 27 +++++++++++++++++++++++++++ > kernel/time/tick-sched.c | 12 ++++++++++++ > 3 files changed, 40 insertions(+) Acked-by: Viresh Kumar <viresh.kumar@linaro.org> -- viresh ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-21 23:08 ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki 2017-03-22 9:26 ` Peter Zijlstra 2017-03-22 9:54 ` Viresh Kumar @ 2017-03-23 1:04 ` Joel Fernandes 2017-03-23 19:26 ` Sai Gurrappadi ` (2 subsequent siblings) 5 siblings, 0 replies; 71+ messages in thread From: Joel Fernandes @ 2017-03-23 1:04 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Morten Rasmussen, Ingo Molnar, Thomas Gleixner On Tue, Mar 21, 2017 at 4:08 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > The way the schedutil governor uses the PELT metric causes it to > underestimate the CPU utilization in some cases. > > That can be easily demonstrated by running kernel compilation on > a Sandy Bridge Intel processor, running turbostat in parallel with > it and looking at the values written to the MSR_IA32_PERF_CTL > register. Namely, the expected result would be that when all CPUs > were 100% busy, all of them would be requested to run in the maximum > P-state, but observation shows that this clearly isn't the case. > The CPUs run in the maximum P-state for a while and then are > requested to run slower and go back to the maximum P-state after > a while again. That causes the actual frequency of the processor to > visibly oscillate below the sustainable maximum in a jittery fashion > which clearly is not desirable. > > That has been attributed to CPU utilization metric updates on task > migration that cause the total utilization value for the CPU to be > reduced by the utilization of the migrated task. If that happens, > the schedutil governor may see a CPU utilization reduction and will > attempt to reduce the CPU frequency accordingly right away. That > may be premature, though, for example if the system is generally > busy and there are other runnable tasks waiting to be run on that > CPU already. > > This is unlikely to be an issue on systems where cpufreq policies are > shared between multiple CPUs, because in those cases the policy > utilization is computed as the maximum of the CPU utilization values > over the whole policy and if that turns out to be low, reducing the > frequency for the policy most likely is a good idea anyway. On > systems with one CPU per policy, however, it may affect performance > adversely and even lead to increased energy consumption in some cases. > > On those systems it may be addressed by taking another utilization > metric into consideration, like whether or not the CPU whose > frequency is about to be reduced has been idle recently, because if > that's not the case, the CPU is likely to be busy in the near future > and its frequency should not be reduced. > > To that end, use the counter of idle calls in the timekeeping code. > Namely, make the schedutil governor look at that counter for the > current CPU every time before its frequency is about to be reduced. > If the counter has not changed since the previous iteration of the > governor computations for that CPU, the CPU has been busy for all > that time and its frequency should not be decreased, so if the new > frequency would be lower than the one set previously, the governor > will skip the frequency update. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Makes sense, Reviewed-by: Joel Fernandes <joelaf@google.com> Thanks, Joel ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-21 23:08 ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki ` (2 preceding siblings ...) 2017-03-23 1:04 ` Joel Fernandes @ 2017-03-23 19:26 ` Sai Gurrappadi 2017-03-23 20:48 ` Sai Gurrappadi 2017-03-24 1:39 ` Rafael J. Wysocki 2017-03-25 1:14 ` Sai Gurrappadi 2017-05-08 3:49 ` Wanpeng Li 5 siblings, 2 replies; 71+ messages in thread From: Sai Gurrappadi @ 2017-03-23 19:26 UTC (permalink / raw) To: Rafael J. Wysocki, Linux PM, Peter Zijlstra Cc: LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner, Peter Boonstoppel Hi Rafael, On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> <snip> > > That has been attributed to CPU utilization metric updates on task > migration that cause the total utilization value for the CPU to be > reduced by the utilization of the migrated task. If that happens, > the schedutil governor may see a CPU utilization reduction and will > attempt to reduce the CPU frequency accordingly right away. That > may be premature, though, for example if the system is generally > busy and there are other runnable tasks waiting to be run on that > CPU already. > > This is unlikely to be an issue on systems where cpufreq policies are > shared between multiple CPUs, because in those cases the policy > utilization is computed as the maximum of the CPU utilization values > over the whole policy and if that turns out to be low, reducing the > frequency for the policy most likely is a good idea anyway. On I have observed this issue even in the shared policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates: 1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet) 2. Do wakeup on dst_cpu, add load to dst_cpu Now if src_cpu manages to do a PELT update before 2. happens, ex: say a small periodic task woke up on src_cpu, it'll end up subtracting the removed_load from its utilization and issue a frequency update before 2. happens. This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu freq. update from step 2. above gets rate limited in this scenario. > systems with one CPU per policy, however, it may affect performance > adversely and even lead to increased energy consumption in some cases. > > On those systems it may be addressed by taking another utilization > metric into consideration, like whether or not the CPU whose > frequency is about to be reduced has been idle recently, because if > that's not the case, the CPU is likely to be busy in the near future > and its frequency should not be reduced. > > To that end, use the counter of idle calls in the timekeeping code. > Namely, make the schedutil governor look at that counter for the > current CPU every time before its frequency is about to be reduced. > If the counter has not changed since the previous iteration of the > governor computations for that CPU, the CPU has been busy for all > that time and its frequency should not be decreased, so if the new > frequency would be lower than the one set previously, the governor > will skip the frequency update. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > include/linux/tick.h | 1 + > kernel/sched/cpufreq_schedutil.c | 27 +++++++++++++++++++++++++++ > kernel/time/tick-sched.c | 12 ++++++++++++ > 3 files changed, 40 insertions(+) > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > =================================================================== > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > +++ linux-pm/kernel/sched/cpufreq_schedutil.c > @@ -61,6 +61,11 @@ struct sugov_cpu { > unsigned long util; > unsigned long max; > unsigned int flags; > + > + /* The field below is for single-CPU policies only. */ > +#ifdef CONFIG_NO_HZ_COMMON > + unsigned long saved_idle_calls; > +#endif > }; > > static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); > @@ -192,6 +197,19 @@ static void sugov_iowait_boost(struct su > sg_cpu->iowait_boost >>= 1; > } > > +#ifdef CONFIG_NO_HZ_COMMON > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > +{ > + unsigned long idle_calls = tick_nohz_get_idle_calls(); > + bool ret = idle_calls == sg_cpu->saved_idle_calls; > + > + sg_cpu->saved_idle_calls = idle_calls; > + return ret; > +} Hm, sorry I am a bit confused perhaps you could help me understand the problem/solution better :) Say we have the this simple case of only a single periodic task running on one CPU, wouldn't the PELT update on wakeup cause a frequency update which updates the sg_cpu->saved_idle_calls value here? That would then cause the frequency update on idle entry to always skip dropping frequency right? If I am reading this correctly, the PELT update on the dequeue for the periodic task (in the scenario above) happens _before_ the idle_calls++ which is in tick_nohz_idle_enter. Thanks! -Sai ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-23 19:26 ` Sai Gurrappadi @ 2017-03-23 20:48 ` Sai Gurrappadi 2017-03-24 1:39 ` Rafael J. Wysocki 1 sibling, 0 replies; 71+ messages in thread From: Sai Gurrappadi @ 2017-03-23 20:48 UTC (permalink / raw) To: Rafael J. Wysocki, Linux PM, Peter Zijlstra Cc: LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner, Peter Boonstoppel On 03/23/2017 12:26 PM, Sai Gurrappadi wrote: > > Hm, sorry I am a bit confused perhaps you could help me understand the problem/solution better :) > > Say we have the this simple case of only a single periodic task running on one CPU, wouldn't the PELT update on wakeup cause a frequency update which updates the sg_cpu->saved_idle_calls value here? That would then cause the frequency update on idle entry to always skip dropping frequency right? Apologies, this example here is flawed because on task dequeue, its utilization isn't removed. There is no problem in this case... -Sai ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-23 19:26 ` Sai Gurrappadi 2017-03-23 20:48 ` Sai Gurrappadi @ 2017-03-24 1:39 ` Rafael J. Wysocki 2017-03-24 19:08 ` Sai Gurrappadi 1 sibling, 1 reply; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-24 1:39 UTC (permalink / raw) To: Sai Gurrappadi Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner, Peter Boonstoppel On Thu, Mar 23, 2017 at 8:26 PM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote: > Hi Rafael, Hi, > On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > <snip> > >> >> That has been attributed to CPU utilization metric updates on task >> migration that cause the total utilization value for the CPU to be >> reduced by the utilization of the migrated task. If that happens, >> the schedutil governor may see a CPU utilization reduction and will >> attempt to reduce the CPU frequency accordingly right away. That >> may be premature, though, for example if the system is generally >> busy and there are other runnable tasks waiting to be run on that >> CPU already. >> >> This is unlikely to be an issue on systems where cpufreq policies are >> shared between multiple CPUs, because in those cases the policy >> utilization is computed as the maximum of the CPU utilization values >> over the whole policy and if that turns out to be low, reducing the >> frequency for the policy most likely is a good idea anyway. On > > I have observed this issue even in the shared policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates: > > 1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet) > 2. Do wakeup on dst_cpu, add load to dst_cpu > > Now if src_cpu manages to do a PELT update before 2. happens, ex: say a small periodic task woke up on src_cpu, it'll end up subtracting the removed_load from its utilization and issue a frequency update before 2. happens. > > This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu freq. update from step 2. above gets rate limited in this scenario. Interesting, and this seems to be related to last_freq_update_time being per-policy (which it has to be, because frequency updates are per-policy too and that's what we need to rate-limit). Does this happen often enough to be a real concern in practice on those configurations, though? The other CPUs in the policy need to be either idle (so schedutil doesn't take them into account at all) or lightly utilized for that to happen, so that would affect workloads with one CPU hog type of task that is migrated from one CPU to another within a policy and that doesn't happen too often AFAICS. Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-24 1:39 ` Rafael J. Wysocki @ 2017-03-24 19:08 ` Sai Gurrappadi 0 siblings, 0 replies; 71+ messages in thread From: Sai Gurrappadi @ 2017-03-24 19:08 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner, Peter Boonstoppel On 03/23/2017 06:39 PM, Rafael J. Wysocki wrote: > On Thu, Mar 23, 2017 at 8:26 PM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote: >> Hi Rafael, > > Hi, > >> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote: >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> <snip> >> >>> >>> That has been attributed to CPU utilization metric updates on task >>> migration that cause the total utilization value for the CPU to be >>> reduced by the utilization of the migrated task. If that happens, >>> the schedutil governor may see a CPU utilization reduction and will >>> attempt to reduce the CPU frequency accordingly right away. That >>> may be premature, though, for example if the system is generally >>> busy and there are other runnable tasks waiting to be run on that >>> CPU already. >>> >>> This is unlikely to be an issue on systems where cpufreq policies are >>> shared between multiple CPUs, because in those cases the policy >>> utilization is computed as the maximum of the CPU utilization values >>> over the whole policy and if that turns out to be low, reducing the >>> frequency for the policy most likely is a good idea anyway. On >> >> I have observed this issue even in the shared policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates: >> >> 1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet) >> 2. Do wakeup on dst_cpu, add load to dst_cpu >> >> Now if src_cpu manages to do a PELT update before 2. happens, ex: say a small periodic task woke up on src_cpu, it'll end up subtracting the removed_load from its utilization and issue a frequency update before 2. happens. >> >> This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu freq. update from step 2. above gets rate limited in this scenario. > > Interesting, and this seems to be related to last_freq_update_time > being per-policy (which it has to be, because frequency updates are > per-policy too and that's what we need to rate-limit). > Correct. > Does this happen often enough to be a real concern in practice on > those configurations, though? > > The other CPUs in the policy need to be either idle (so schedutil > doesn't take them into account at all) or lightly utilized for that to > happen, so that would affect workloads with one CPU hog type of task > that is migrated from one CPU to another within a policy and that > doesn't happen too often AFAICS. So it is possible, even likely in some cases for a heavy CPU task to migrate on wakeup between the policy->cpus via select_idle_sibling() if the prev_cpu it was on was !idle on wakeup. This style of heavy thread + lots of light work is a common pattern on Android (games, browsing, etc.) given how Android does its threading for ipc (Binder stuff) + its rendering/audio pipelines. I unfortunately don't have any numbers atm though. -Sai ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-21 23:08 ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki ` (3 preceding siblings ...) 2017-03-23 19:26 ` Sai Gurrappadi @ 2017-03-25 1:14 ` Sai Gurrappadi 2017-03-25 1:39 ` Rafael J. Wysocki 2017-03-27 7:04 ` Vincent Guittot 2017-05-08 3:49 ` Wanpeng Li 5 siblings, 2 replies; 71+ messages in thread From: Sai Gurrappadi @ 2017-03-25 1:14 UTC (permalink / raw) To: Rafael J. Wysocki, Linux PM, Peter Zijlstra Cc: LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner, Peter Boonstoppel Hi Rafael, On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > The way the schedutil governor uses the PELT metric causes it to > underestimate the CPU utilization in some cases. > > That can be easily demonstrated by running kernel compilation on > a Sandy Bridge Intel processor, running turbostat in parallel with > it and looking at the values written to the MSR_IA32_PERF_CTL > register. Namely, the expected result would be that when all CPUs > were 100% busy, all of them would be requested to run in the maximum > P-state, but observation shows that this clearly isn't the case. > The CPUs run in the maximum P-state for a while and then are > requested to run slower and go back to the maximum P-state after > a while again. That causes the actual frequency of the processor to > visibly oscillate below the sustainable maximum in a jittery fashion > which clearly is not desirable. > > That has been attributed to CPU utilization metric updates on task > migration that cause the total utilization value for the CPU to be > reduced by the utilization of the migrated task. If that happens, > the schedutil governor may see a CPU utilization reduction and will > attempt to reduce the CPU frequency accordingly right away. That > may be premature, though, for example if the system is generally > busy and there are other runnable tasks waiting to be run on that > CPU already. > Thinking out loud a bit, I wonder if what you really want to do is basically: schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg); Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs. Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other the lesser the amount of 'overlap' / overloading. Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so: schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg); Note that the problem described here in the commit message doesn't need fully runnable threads, it just needs two threads to execute in parallel on the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases. Thoughts? Thanks, -Sai ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-25 1:14 ` Sai Gurrappadi @ 2017-03-25 1:39 ` Rafael J. Wysocki 2017-03-27 7:04 ` Vincent Guittot 1 sibling, 0 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-25 1:39 UTC (permalink / raw) To: Sai Gurrappadi Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner, Peter Boonstoppel On Sat, Mar 25, 2017 at 2:14 AM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote: > Hi Rafael, > > On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> The way the schedutil governor uses the PELT metric causes it to >> underestimate the CPU utilization in some cases. >> >> That can be easily demonstrated by running kernel compilation on >> a Sandy Bridge Intel processor, running turbostat in parallel with >> it and looking at the values written to the MSR_IA32_PERF_CTL >> register. Namely, the expected result would be that when all CPUs >> were 100% busy, all of them would be requested to run in the maximum >> P-state, but observation shows that this clearly isn't the case. >> The CPUs run in the maximum P-state for a while and then are >> requested to run slower and go back to the maximum P-state after >> a while again. That causes the actual frequency of the processor to >> visibly oscillate below the sustainable maximum in a jittery fashion >> which clearly is not desirable. >> >> That has been attributed to CPU utilization metric updates on task >> migration that cause the total utilization value for the CPU to be >> reduced by the utilization of the migrated task. If that happens, >> the schedutil governor may see a CPU utilization reduction and will >> attempt to reduce the CPU frequency accordingly right away. That >> may be premature, though, for example if the system is generally >> busy and there are other runnable tasks waiting to be run on that >> CPU already. >> > > Thinking out loud a bit, I wonder if what you really want to do is basically: > > schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg); > > Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs. > > Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other the lesser the amount of 'overlap' / overloading. > > Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so: > > schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg); > > Note that the problem described here in the commit message doesn't need fully runnable threads, it just needs two threads to execute in parallel on the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases. > > Thoughts? Well, is the total_cpu_util_avg metric readily available? Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-25 1:14 ` Sai Gurrappadi 2017-03-25 1:39 ` Rafael J. Wysocki @ 2017-03-27 7:04 ` Vincent Guittot 2017-03-27 21:01 ` Sai Gurrappadi 1 sibling, 1 reply; 71+ messages in thread From: Vincent Guittot @ 2017-03-27 7:04 UTC (permalink / raw) To: Sai Gurrappadi Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner, Peter Boonstoppel On 25 March 2017 at 02:14, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote: > Hi Rafael, > > On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> The way the schedutil governor uses the PELT metric causes it to >> underestimate the CPU utilization in some cases. >> >> That can be easily demonstrated by running kernel compilation on >> a Sandy Bridge Intel processor, running turbostat in parallel with >> it and looking at the values written to the MSR_IA32_PERF_CTL >> register. Namely, the expected result would be that when all CPUs >> were 100% busy, all of them would be requested to run in the maximum >> P-state, but observation shows that this clearly isn't the case. >> The CPUs run in the maximum P-state for a while and then are >> requested to run slower and go back to the maximum P-state after >> a while again. That causes the actual frequency of the processor to >> visibly oscillate below the sustainable maximum in a jittery fashion >> which clearly is not desirable. >> >> That has been attributed to CPU utilization metric updates on task >> migration that cause the total utilization value for the CPU to be >> reduced by the utilization of the migrated task. If that happens, >> the schedutil governor may see a CPU utilization reduction and will >> attempt to reduce the CPU frequency accordingly right away. That >> may be premature, though, for example if the system is generally >> busy and there are other runnable tasks waiting to be run on that >> CPU already. >> > > Thinking out loud a bit, I wonder if what you really want to do is basically: > > schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg); > > Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs. But we loose the interest of immediate decrease when tasks migrate. Instead of total_cpu_util_avg we should better track RT utilization in the same manner so with ongoing work for deadline we will have : total_utilization = cfs.util_avg + rt's util_avg + deadline's util avg and we still take advantage of task migration effect > > Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other the lesser the amount of 'overlap' / overloading. > > Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so: > > schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg); > > Note that the problem described here in the commit message doesn't need fully runnable threads, it just needs two threads to execute in parallel on the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases. > > Thoughts? > > Thanks, > -Sai ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-27 7:04 ` Vincent Guittot @ 2017-03-27 21:01 ` Sai Gurrappadi 2017-03-27 21:11 ` Rafael J. Wysocki 0 siblings, 1 reply; 71+ messages in thread From: Sai Gurrappadi @ 2017-03-27 21:01 UTC (permalink / raw) To: Vincent Guittot Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner, Peter Boonstoppel Hi Vincent, On 03/27/2017 12:04 AM, Vincent Guittot wrote: > On 25 March 2017 at 02:14, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote: >> Hi Rafael, >> >> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote: >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>> >>> The way the schedutil governor uses the PELT metric causes it to >>> underestimate the CPU utilization in some cases. >>> >>> That can be easily demonstrated by running kernel compilation on >>> a Sandy Bridge Intel processor, running turbostat in parallel with >>> it and looking at the values written to the MSR_IA32_PERF_CTL >>> register. Namely, the expected result would be that when all CPUs >>> were 100% busy, all of them would be requested to run in the maximum >>> P-state, but observation shows that this clearly isn't the case. >>> The CPUs run in the maximum P-state for a while and then are >>> requested to run slower and go back to the maximum P-state after >>> a while again. That causes the actual frequency of the processor to >>> visibly oscillate below the sustainable maximum in a jittery fashion >>> which clearly is not desirable. >>> >>> That has been attributed to CPU utilization metric updates on task >>> migration that cause the total utilization value for the CPU to be >>> reduced by the utilization of the migrated task. If that happens, >>> the schedutil governor may see a CPU utilization reduction and will >>> attempt to reduce the CPU frequency accordingly right away. That >>> may be premature, though, for example if the system is generally >>> busy and there are other runnable tasks waiting to be run on that >>> CPU already. >>> >> >> Thinking out loud a bit, I wonder if what you really want to do is basically: >> >> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg); >> >> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs. > > But we loose the interest of immediate decrease when tasks migrate. Indeed, this is not ideal. > Instead of total_cpu_util_avg we should better track RT utilization in > the same manner so with ongoing work for deadline we will have : > total_utilization = cfs.util_avg + rt's util_avg + deadline's util avg > and we still take advantage of task migration effect I agree that we need better tracking for RT and DL tasks but that doesn't solve the overloaded case with more than one CFS thread sharing a CPU. In the overloaded case, we care not just about the instant where the migrate happens but also subsequent windows where the PELT metric is slowly ramping up to reflect the real utilization of a task now that it has a CPU to itself. Maybe there are better ways to solve that though :-) Thanks, -Sai ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-27 21:01 ` Sai Gurrappadi @ 2017-03-27 21:11 ` Rafael J. Wysocki 0 siblings, 0 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-03-27 21:11 UTC (permalink / raw) To: Sai Gurrappadi Cc: Vincent Guittot, Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner, Peter Boonstoppel On Mon, Mar 27, 2017 at 11:01 PM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote: > Hi Vincent, > > On 03/27/2017 12:04 AM, Vincent Guittot wrote: >> On 25 March 2017 at 02:14, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote: >>> Hi Rafael, >>> >>> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote: >>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>>> >>>> The way the schedutil governor uses the PELT metric causes it to >>>> underestimate the CPU utilization in some cases. >>>> >>>> That can be easily demonstrated by running kernel compilation on >>>> a Sandy Bridge Intel processor, running turbostat in parallel with >>>> it and looking at the values written to the MSR_IA32_PERF_CTL >>>> register. Namely, the expected result would be that when all CPUs >>>> were 100% busy, all of them would be requested to run in the maximum >>>> P-state, but observation shows that this clearly isn't the case. >>>> The CPUs run in the maximum P-state for a while and then are >>>> requested to run slower and go back to the maximum P-state after >>>> a while again. That causes the actual frequency of the processor to >>>> visibly oscillate below the sustainable maximum in a jittery fashion >>>> which clearly is not desirable. >>>> >>>> That has been attributed to CPU utilization metric updates on task >>>> migration that cause the total utilization value for the CPU to be >>>> reduced by the utilization of the migrated task. If that happens, >>>> the schedutil governor may see a CPU utilization reduction and will >>>> attempt to reduce the CPU frequency accordingly right away. That >>>> may be premature, though, for example if the system is generally >>>> busy and there are other runnable tasks waiting to be run on that >>>> CPU already. >>>> >>> >>> Thinking out loud a bit, I wonder if what you really want to do is basically: >>> >>> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg); >>> >>> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs. >> >> But we loose the interest of immediate decrease when tasks migrate. > > Indeed, this is not ideal. > >> Instead of total_cpu_util_avg we should better track RT utilization in >> the same manner so with ongoing work for deadline we will have : >> total_utilization = cfs.util_avg + rt's util_avg + deadline's util avg >> and we still take advantage of task migration effect > > I agree that we need better tracking for RT and DL tasks but that doesn't solve the overloaded case with more than one CFS thread sharing a CPU. > > In the overloaded case, we care not just about the instant where the migrate happens but also subsequent windows where the PELT metric is slowly ramping up to reflect the real utilization of a task now that it has a CPU to itself. > > Maybe there are better ways to solve that though :-) I wonder if it's viable to postpone the utilization update on both the source and target runqueues until the task has been fully migrated? That would make the artificial utilization reductions go away. Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-03-21 23:08 ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki ` (4 preceding siblings ...) 2017-03-25 1:14 ` Sai Gurrappadi @ 2017-05-08 3:49 ` Wanpeng Li 2017-05-08 4:01 ` Viresh Kumar 5 siblings, 1 reply; 71+ messages in thread From: Wanpeng Li @ 2017-05-08 3:49 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner Hi Rafael, 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > The way the schedutil governor uses the PELT metric causes it to > underestimate the CPU utilization in some cases. > > That can be easily demonstrated by running kernel compilation on > a Sandy Bridge Intel processor, running turbostat in parallel with > it and looking at the values written to the MSR_IA32_PERF_CTL > register. Namely, the expected result would be that when all CPUs > were 100% busy, all of them would be requested to run in the maximum > P-state, but observation shows that this clearly isn't the case. > The CPUs run in the maximum P-state for a while and then are > requested to run slower and go back to the maximum P-state after > a while again. That causes the actual frequency of the processor to > visibly oscillate below the sustainable maximum in a jittery fashion > which clearly is not desirable. > > That has been attributed to CPU utilization metric updates on task > migration that cause the total utilization value for the CPU to be > reduced by the utilization of the migrated task. If that happens, > the schedutil governor may see a CPU utilization reduction and will > attempt to reduce the CPU frequency accordingly right away. That > may be premature, though, for example if the system is generally > busy and there are other runnable tasks waiting to be run on that > CPU already. > > This is unlikely to be an issue on systems where cpufreq policies are > shared between multiple CPUs, because in those cases the policy > utilization is computed as the maximum of the CPU utilization values Sorry for one question maybe not associated with this patch. If the cpufreq policy is shared between multiple CPUs, the function intel_cpufreq_target() just updates IA32_PERF_CTL MSR of the cpu which is managing this policy, I wonder whether other cpus which are affected should also update their per-logical cpu's IA32_PERF_CTL MSR? Regards, Wanpeng Li > over the whole policy and if that turns out to be low, reducing the > frequency for the policy most likely is a good idea anyway. On > systems with one CPU per policy, however, it may affect performance > adversely and even lead to increased energy consumption in some cases. > > On those systems it may be addressed by taking another utilization > metric into consideration, like whether or not the CPU whose > frequency is about to be reduced has been idle recently, because if > that's not the case, the CPU is likely to be busy in the near future > and its frequency should not be reduced. > > To that end, use the counter of idle calls in the timekeeping code. > Namely, make the schedutil governor look at that counter for the > current CPU every time before its frequency is about to be reduced. > If the counter has not changed since the previous iteration of the > governor computations for that CPU, the CPU has been busy for all > that time and its frequency should not be decreased, so if the new > frequency would be lower than the one set previously, the governor > will skip the frequency update. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > include/linux/tick.h | 1 + > kernel/sched/cpufreq_schedutil.c | 27 +++++++++++++++++++++++++++ > kernel/time/tick-sched.c | 12 ++++++++++++ > 3 files changed, 40 insertions(+) > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c > =================================================================== > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c > +++ linux-pm/kernel/sched/cpufreq_schedutil.c > @@ -61,6 +61,11 @@ struct sugov_cpu { > unsigned long util; > unsigned long max; > unsigned int flags; > + > + /* The field below is for single-CPU policies only. */ > +#ifdef CONFIG_NO_HZ_COMMON > + unsigned long saved_idle_calls; > +#endif > }; > > static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); > @@ -192,6 +197,19 @@ static void sugov_iowait_boost(struct su > sg_cpu->iowait_boost >>= 1; > } > > +#ifdef CONFIG_NO_HZ_COMMON > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > +{ > + unsigned long idle_calls = tick_nohz_get_idle_calls(); > + bool ret = idle_calls == sg_cpu->saved_idle_calls; > + > + sg_cpu->saved_idle_calls = idle_calls; > + return ret; > +} > +#else > +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } > +#endif /* CONFIG_NO_HZ_COMMON */ > + > static void sugov_update_single(struct update_util_data *hook, u64 time, > unsigned int flags) > { > @@ -200,6 +218,7 @@ static void sugov_update_single(struct u > struct cpufreq_policy *policy = sg_policy->policy; > unsigned long util, max; > unsigned int next_f; > + bool busy; > > sugov_set_iowait_boost(sg_cpu, time, flags); > sg_cpu->last_update = time; > @@ -207,12 +226,20 @@ static void sugov_update_single(struct u > if (!sugov_should_update_freq(sg_policy, time)) > return; > > + busy = sugov_cpu_is_busy(sg_cpu); > + > if (flags & SCHED_CPUFREQ_RT_DL) { > next_f = policy->cpuinfo.max_freq; > } else { > sugov_get_util(&util, &max); > sugov_iowait_boost(sg_cpu, &util, &max); > next_f = get_next_freq(sg_policy, util, max); > + /* > + * Do not reduce the frequency if the CPU has not been idle > + * recently, as the reduction is likely to be premature then. > + */ > + if (busy && next_f < sg_policy->next_freq) > + next_f = sg_policy->next_freq; > } > sugov_update_commit(sg_policy, time, next_f); > } > Index: linux-pm/include/linux/tick.h > =================================================================== > --- linux-pm.orig/include/linux/tick.h > +++ linux-pm/include/linux/tick.h > @@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void); > extern void tick_nohz_idle_exit(void); > extern void tick_nohz_irq_exit(void); > extern ktime_t tick_nohz_get_sleep_length(void); > +extern unsigned long tick_nohz_get_idle_calls(void); > extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time); > extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time); > #else /* !CONFIG_NO_HZ_COMMON */ > Index: linux-pm/kernel/time/tick-sched.c > =================================================================== > --- linux-pm.orig/kernel/time/tick-sched.c > +++ linux-pm/kernel/time/tick-sched.c > @@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void) > return ts->sleep_length; > } > > +/** > + * tick_nohz_get_idle_calls - return the current idle calls counter value > + * > + * Called from the schedutil frequency scaling governor in scheduler context. > + */ > +unsigned long tick_nohz_get_idle_calls(void) > +{ > + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched); > + > + return ts->idle_calls; > +} > + > static void tick_nohz_account_idle_ticks(struct tick_sched *ts) > { > #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE > ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-05-08 3:49 ` Wanpeng Li @ 2017-05-08 4:01 ` Viresh Kumar 2017-05-08 5:15 ` Wanpeng Li 2017-05-08 22:16 ` Rafael J. Wysocki 0 siblings, 2 replies; 71+ messages in thread From: Viresh Kumar @ 2017-05-08 4:01 UTC (permalink / raw) To: Wanpeng Li Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner On 08-05-17, 11:49, Wanpeng Li wrote: > Hi Rafael, > 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > The way the schedutil governor uses the PELT metric causes it to > > underestimate the CPU utilization in some cases. > > > > That can be easily demonstrated by running kernel compilation on > > a Sandy Bridge Intel processor, running turbostat in parallel with > > it and looking at the values written to the MSR_IA32_PERF_CTL > > register. Namely, the expected result would be that when all CPUs > > were 100% busy, all of them would be requested to run in the maximum > > P-state, but observation shows that this clearly isn't the case. > > The CPUs run in the maximum P-state for a while and then are > > requested to run slower and go back to the maximum P-state after > > a while again. That causes the actual frequency of the processor to > > visibly oscillate below the sustainable maximum in a jittery fashion > > which clearly is not desirable. > > > > That has been attributed to CPU utilization metric updates on task > > migration that cause the total utilization value for the CPU to be > > reduced by the utilization of the migrated task. If that happens, > > the schedutil governor may see a CPU utilization reduction and will > > attempt to reduce the CPU frequency accordingly right away. That > > may be premature, though, for example if the system is generally > > busy and there are other runnable tasks waiting to be run on that > > CPU already. > > > > This is unlikely to be an issue on systems where cpufreq policies are > > shared between multiple CPUs, because in those cases the policy > > utilization is computed as the maximum of the CPU utilization values > > Sorry for one question maybe not associated with this patch. If the > cpufreq policy is shared between multiple CPUs, the function > intel_cpufreq_target() just updates IA32_PERF_CTL MSR of the cpu > which is managing this policy, I wonder whether other cpus which are > affected should also update their per-logical cpu's IA32_PERF_CTL MSR? The CPUs share the policy when they share their freq/voltage rails and so changing perf state of one CPU should result in that changing for all the CPUs in that policy. Otherwise, they can't be considered to be part of the same policy. That's why this code is changing it only for policy->cpu alone. -- viresh ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-05-08 4:01 ` Viresh Kumar @ 2017-05-08 5:15 ` Wanpeng Li 2017-05-08 22:16 ` Rafael J. Wysocki 1 sibling, 0 replies; 71+ messages in thread From: Wanpeng Li @ 2017-05-08 5:15 UTC (permalink / raw) To: Viresh Kumar Cc: Rafael J. Wysocki, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner 2017-05-08 12:01 GMT+08:00 Viresh Kumar <viresh.kumar@linaro.org>: > On 08-05-17, 11:49, Wanpeng Li wrote: >> Hi Rafael, >> 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>: >> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> > >> > The way the schedutil governor uses the PELT metric causes it to >> > underestimate the CPU utilization in some cases. >> > >> > That can be easily demonstrated by running kernel compilation on >> > a Sandy Bridge Intel processor, running turbostat in parallel with >> > it and looking at the values written to the MSR_IA32_PERF_CTL >> > register. Namely, the expected result would be that when all CPUs >> > were 100% busy, all of them would be requested to run in the maximum >> > P-state, but observation shows that this clearly isn't the case. >> > The CPUs run in the maximum P-state for a while and then are >> > requested to run slower and go back to the maximum P-state after >> > a while again. That causes the actual frequency of the processor to >> > visibly oscillate below the sustainable maximum in a jittery fashion >> > which clearly is not desirable. >> > >> > That has been attributed to CPU utilization metric updates on task >> > migration that cause the total utilization value for the CPU to be >> > reduced by the utilization of the migrated task. If that happens, >> > the schedutil governor may see a CPU utilization reduction and will >> > attempt to reduce the CPU frequency accordingly right away. That >> > may be premature, though, for example if the system is generally >> > busy and there are other runnable tasks waiting to be run on that >> > CPU already. >> > >> > This is unlikely to be an issue on systems where cpufreq policies are >> > shared between multiple CPUs, because in those cases the policy >> > utilization is computed as the maximum of the CPU utilization values >> >> Sorry for one question maybe not associated with this patch. If the >> cpufreq policy is shared between multiple CPUs, the function >> intel_cpufreq_target() just updates IA32_PERF_CTL MSR of the cpu >> which is managing this policy, I wonder whether other cpus which are >> affected should also update their per-logical cpu's IA32_PERF_CTL MSR? > > The CPUs share the policy when they share their freq/voltage rails and so > changing perf state of one CPU should result in that changing for all the CPUs > in that policy. Otherwise, they can't be considered to be part of the same > policy. > > That's why this code is changing it only for policy->cpu alone. I see, thanks for the explanation. Regards, Wanpeng Li ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-05-08 4:01 ` Viresh Kumar 2017-05-08 5:15 ` Wanpeng Li @ 2017-05-08 22:16 ` Rafael J. Wysocki 2017-05-08 22:36 ` Wanpeng Li 1 sibling, 1 reply; 71+ messages in thread From: Rafael J. Wysocki @ 2017-05-08 22:16 UTC (permalink / raw) To: Viresh Kumar Cc: Wanpeng Li, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner On Monday, May 08, 2017 09:31:19 AM Viresh Kumar wrote: > On 08-05-17, 11:49, Wanpeng Li wrote: > > Hi Rafael, > > 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>: > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > > > The way the schedutil governor uses the PELT metric causes it to > > > underestimate the CPU utilization in some cases. > > > > > > That can be easily demonstrated by running kernel compilation on > > > a Sandy Bridge Intel processor, running turbostat in parallel with > > > it and looking at the values written to the MSR_IA32_PERF_CTL > > > register. Namely, the expected result would be that when all CPUs > > > were 100% busy, all of them would be requested to run in the maximum > > > P-state, but observation shows that this clearly isn't the case. > > > The CPUs run in the maximum P-state for a while and then are > > > requested to run slower and go back to the maximum P-state after > > > a while again. That causes the actual frequency of the processor to > > > visibly oscillate below the sustainable maximum in a jittery fashion > > > which clearly is not desirable. > > > > > > That has been attributed to CPU utilization metric updates on task > > > migration that cause the total utilization value for the CPU to be > > > reduced by the utilization of the migrated task. If that happens, > > > the schedutil governor may see a CPU utilization reduction and will > > > attempt to reduce the CPU frequency accordingly right away. That > > > may be premature, though, for example if the system is generally > > > busy and there are other runnable tasks waiting to be run on that > > > CPU already. > > > > > > This is unlikely to be an issue on systems where cpufreq policies are > > > shared between multiple CPUs, because in those cases the policy > > > utilization is computed as the maximum of the CPU utilization values > > > > Sorry for one question maybe not associated with this patch. If the > > cpufreq policy is shared between multiple CPUs, the function > > intel_cpufreq_target() just updates IA32_PERF_CTL MSR of the cpu > > which is managing this policy, I wonder whether other cpus which are > > affected should also update their per-logical cpu's IA32_PERF_CTL MSR? > > The CPUs share the policy when they share their freq/voltage rails and so > changing perf state of one CPU should result in that changing for all the CPUs > in that policy. Otherwise, they can't be considered to be part of the same > policy. To be entirely precise, this depends on the granularity of the HW interface. If the interface is per-logical-CPU, we will use it this way for efficiency reasons and even if there is some coordination on the HW side, the information on how exactly it works usually is limited. Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-05-08 22:16 ` Rafael J. Wysocki @ 2017-05-08 22:36 ` Wanpeng Li 2017-05-08 23:01 ` Rafael J. Wysocki 0 siblings, 1 reply; 71+ messages in thread From: Wanpeng Li @ 2017-05-08 22:36 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Viresh Kumar, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner 2017-05-09 6:16 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>: > On Monday, May 08, 2017 09:31:19 AM Viresh Kumar wrote: >> On 08-05-17, 11:49, Wanpeng Li wrote: >> > Hi Rafael, >> > 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>: >> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> > > >> > > The way the schedutil governor uses the PELT metric causes it to >> > > underestimate the CPU utilization in some cases. >> > > >> > > That can be easily demonstrated by running kernel compilation on >> > > a Sandy Bridge Intel processor, running turbostat in parallel with >> > > it and looking at the values written to the MSR_IA32_PERF_CTL >> > > register. Namely, the expected result would be that when all CPUs >> > > were 100% busy, all of them would be requested to run in the maximum >> > > P-state, but observation shows that this clearly isn't the case. >> > > The CPUs run in the maximum P-state for a while and then are >> > > requested to run slower and go back to the maximum P-state after >> > > a while again. That causes the actual frequency of the processor to >> > > visibly oscillate below the sustainable maximum in a jittery fashion >> > > which clearly is not desirable. >> > > >> > > That has been attributed to CPU utilization metric updates on task >> > > migration that cause the total utilization value for the CPU to be >> > > reduced by the utilization of the migrated task. If that happens, >> > > the schedutil governor may see a CPU utilization reduction and will >> > > attempt to reduce the CPU frequency accordingly right away. That >> > > may be premature, though, for example if the system is generally >> > > busy and there are other runnable tasks waiting to be run on that >> > > CPU already. >> > > >> > > This is unlikely to be an issue on systems where cpufreq policies are >> > > shared between multiple CPUs, because in those cases the policy >> > > utilization is computed as the maximum of the CPU utilization values >> > >> > Sorry for one question maybe not associated with this patch. If the >> > cpufreq policy is shared between multiple CPUs, the function >> > intel_cpufreq_target() just updates IA32_PERF_CTL MSR of the cpu >> > which is managing this policy, I wonder whether other cpus which are >> > affected should also update their per-logical cpu's IA32_PERF_CTL MSR? >> >> The CPUs share the policy when they share their freq/voltage rails and so >> changing perf state of one CPU should result in that changing for all the CPUs >> in that policy. Otherwise, they can't be considered to be part of the same >> policy. > > To be entirely precise, this depends on the granularity of the HW interface. > > If the interface is per-logical-CPU, we will use it this way for efficiency > reasons and even if there is some coordination on the HW side, the information > on how exactly it works usually is limited. I check it on several Xeon servers on hand, however, I didn't find /sys/devices/system/cpu/cpufreq/policyx/affected_cpus can affect more than one logical cpu, so I guess most of Xeon servers are not support shared cpufreq policy, then which kind of boxes support that? Regards, Wanpeng Li ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely 2017-05-08 22:36 ` Wanpeng Li @ 2017-05-08 23:01 ` Rafael J. Wysocki 0 siblings, 0 replies; 71+ messages in thread From: Rafael J. Wysocki @ 2017-05-08 23:01 UTC (permalink / raw) To: Wanpeng Li Cc: Viresh Kumar, Linux PM, Peter Zijlstra, LKML, Srinivas Pandruvada, Juri Lelli, Vincent Guittot, Patrick Bellasi, Joel Fernandes, Morten Rasmussen, Ingo Molnar, Thomas Gleixner On Tuesday, May 09, 2017 06:36:14 AM Wanpeng Li wrote: > 2017-05-09 6:16 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>: > > On Monday, May 08, 2017 09:31:19 AM Viresh Kumar wrote: > >> On 08-05-17, 11:49, Wanpeng Li wrote: > >> > Hi Rafael, > >> > 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>: > >> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > >> > > > >> > > The way the schedutil governor uses the PELT metric causes it to > >> > > underestimate the CPU utilization in some cases. > >> > > > >> > > That can be easily demonstrated by running kernel compilation on > >> > > a Sandy Bridge Intel processor, running turbostat in parallel with > >> > > it and looking at the values written to the MSR_IA32_PERF_CTL > >> > > register. Namely, the expected result would be that when all CPUs > >> > > were 100% busy, all of them would be requested to run in the maximum > >> > > P-state, but observation shows that this clearly isn't the case. > >> > > The CPUs run in the maximum P-state for a while and then are > >> > > requested to run slower and go back to the maximum P-state after > >> > > a while again. That causes the actual frequency of the processor to > >> > > visibly oscillate below the sustainable maximum in a jittery fashion > >> > > which clearly is not desirable. > >> > > > >> > > That has been attributed to CPU utilization metric updates on task > >> > > migration that cause the total utilization value for the CPU to be > >> > > reduced by the utilization of the migrated task. If that happens, > >> > > the schedutil governor may see a CPU utilization reduction and will > >> > > attempt to reduce the CPU frequency accordingly right away. That > >> > > may be premature, though, for example if the system is generally > >> > > busy and there are other runnable tasks waiting to be run on that > >> > > CPU already. > >> > > > >> > > This is unlikely to be an issue on systems where cpufreq policies are > >> > > shared between multiple CPUs, because in those cases the policy > >> > > utilization is computed as the maximum of the CPU utilization values > >> > > >> > Sorry for one question maybe not associated with this patch. If the > >> > cpufreq policy is shared between multiple CPUs, the function > >> > intel_cpufreq_target() just updates IA32_PERF_CTL MSR of the cpu > >> > which is managing this policy, I wonder whether other cpus which are > >> > affected should also update their per-logical cpu's IA32_PERF_CTL MSR? > >> > >> The CPUs share the policy when they share their freq/voltage rails and so > >> changing perf state of one CPU should result in that changing for all the CPUs > >> in that policy. Otherwise, they can't be considered to be part of the same > >> policy. > > > > To be entirely precise, this depends on the granularity of the HW interface. > > > > If the interface is per-logical-CPU, we will use it this way for efficiency > > reasons and even if there is some coordination on the HW side, the information > > on how exactly it works usually is limited. > > I check it on several Xeon servers on hand, however, I didn't find > /sys/devices/system/cpu/cpufreq/policyx/affected_cpus can affect more > than one logical cpu, so I guess most of Xeon servers are not support > shared cpufreq policy, then which kind of boxes support that? On Intel the interface for performance scaling is per-logical-CPU in general. Thanks, Rafael ^ permalink raw reply [flat|nested] 71+ messages in thread
end of thread, other threads:[~2017-05-08 23:08 UTC | newest] Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-03-19 13:21 [PATCH 0/2] cpufreq: schedutil: Fix and optimization Rafael J. Wysocki 2017-03-19 13:30 ` [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() Rafael J. Wysocki 2017-03-20 3:28 ` Viresh Kumar 2017-03-20 12:36 ` Rafael J. Wysocki 2017-03-19 13:34 ` [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs Rafael J. Wysocki 2017-03-19 21:24 ` Rafael J. Wysocki 2017-03-19 21:42 ` Rafael J. Wysocki 2017-03-20 10:38 ` Peter Zijlstra 2017-03-20 12:31 ` Rafael J. Wysocki 2017-03-20 3:57 ` Viresh Kumar 2017-03-20 8:26 ` Vincent Guittot 2017-03-20 12:34 ` Patrick Bellasi 2017-03-22 23:56 ` Joel Fernandes 2017-03-23 22:08 ` Vincent Guittot 2017-03-25 3:48 ` Joel Fernandes 2017-03-27 6:59 ` Vincent Guittot 2017-03-20 12:59 ` Rafael J. Wysocki 2017-03-20 13:20 ` Vincent Guittot 2017-03-20 12:48 ` Rafael J. Wysocki 2017-03-20 10:36 ` Peter Zijlstra 2017-03-20 12:35 ` Rafael J. Wysocki 2017-03-20 12:50 ` Peter Zijlstra 2017-03-20 13:04 ` Rafael J. Wysocki 2017-03-20 13:06 ` Patrick Bellasi 2017-03-20 13:05 ` Rafael J. Wysocki 2017-03-20 14:13 ` Patrick Bellasi 2017-03-20 21:46 ` [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of " Rafael J. Wysocki 2017-03-21 6:40 ` Viresh Kumar 2017-03-21 12:30 ` Rafael J. Wysocki 2017-03-21 8:50 ` Vincent Guittot 2017-03-21 11:56 ` Patrick Bellasi 2017-03-21 13:22 ` Peter Zijlstra 2017-03-21 13:37 ` Vincent Guittot 2017-03-21 14:03 ` Peter Zijlstra 2017-03-21 14:18 ` Vincent Guittot 2017-03-21 14:25 ` Patrick Bellasi [not found] ` <CAKfTPtALorn7HNpz4LOfWWSc3u+9y5iHB5byzfTHGQXDA+tVJQ@mail.gmail.com> 2017-03-21 14:58 ` Peter Zijlstra 2017-03-21 17:00 ` Vincent Guittot 2017-03-21 17:01 ` Vincent Guittot 2017-03-21 14:26 ` Rafael J. Wysocki 2017-03-21 14:38 ` Patrick Bellasi 2017-03-21 14:46 ` Rafael J. Wysocki 2017-03-21 14:50 ` Rafael J. Wysocki 2017-03-21 15:04 ` Peter Zijlstra 2017-03-21 15:18 ` Rafael J. Wysocki 2017-03-21 17:00 ` Peter Zijlstra 2017-03-21 17:17 ` Rafael J. Wysocki 2017-03-21 15:08 ` Patrick Bellasi 2017-03-21 15:18 ` Peter Zijlstra 2017-03-21 19:28 ` Patrick Bellasi 2017-03-21 15:02 ` Peter Zijlstra 2017-03-21 11:50 ` Patrick Bellasi 2017-03-21 23:08 ` [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely Rafael J. Wysocki 2017-03-22 9:26 ` Peter Zijlstra 2017-03-22 9:54 ` Viresh Kumar 2017-03-23 1:04 ` Joel Fernandes 2017-03-23 19:26 ` Sai Gurrappadi 2017-03-23 20:48 ` Sai Gurrappadi 2017-03-24 1:39 ` Rafael J. Wysocki 2017-03-24 19:08 ` Sai Gurrappadi 2017-03-25 1:14 ` Sai Gurrappadi 2017-03-25 1:39 ` Rafael J. Wysocki 2017-03-27 7:04 ` Vincent Guittot 2017-03-27 21:01 ` Sai Gurrappadi 2017-03-27 21:11 ` Rafael J. Wysocki 2017-05-08 3:49 ` Wanpeng Li 2017-05-08 4:01 ` Viresh Kumar 2017-05-08 5:15 ` Wanpeng Li 2017-05-08 22:16 ` Rafael J. Wysocki 2017-05-08 22:36 ` Wanpeng Li 2017-05-08 23:01 ` Rafael J. Wysocki
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).