linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX"
@ 2019-07-18  6:26 Doug Smythies
  2019-07-18 10:28 ` Viresh Kumar
  2019-07-22  6:51 ` [PATCH] cpufreq: schedutil: Don't skip freq update when limits change Viresh Kumar
  0 siblings, 2 replies; 27+ messages in thread
From: Doug Smythies @ 2019-07-18  6:26 UTC (permalink / raw)
  To: viresh.kumar, rjw, joel; +Cc: linux-kernel, linux-pm, dsmythies

This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514.

The commit caused a regression whereby reducing the maximum
CPU clock frequency is ineffective while busy, and the CPU
clock remains unchanged. Once the system has experienced
some idle time, the new limit is implemented.
A consequence is that any thermal throttling monitoring
and control based on max freq limits fail to respond
in a timely manor, if at all, to a thermal temperature
trip on a busy system.

Please review the original e-mail threads, as this
attempt at a revertion would then re-open some of
those questions. One possible alterative to this
revertion would be to revert B7EAF1AAB9F8:

  Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  Date:   Wed Mar 22 00:08:50 2017 +0100

      cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

References:

https://marc.info/?t=152576189300002&r=1&w=2
https://marc.info/?t=152586219000002&r=1&w=2
https://marc.info/?t=152607169800004&r=1&w=2

https://marc.info/?t=149013813900002&r=1&w=2
---
Test Data:

Typical commands:
watch -n 5 cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
watch -n 5 cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
watch -n 5 cat /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq

Driver: intel_cpufreq ; Governor: Schedutil
Kernel 5.2:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test).
- adjust downwards the maximum clock frequency
  echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
- observe the Bzy_MHz does not change on the turbostat output.
- reduce the system load, such that there is some idle time.
- observe the Bzy_MHz now does change.
- increase the load again.
- observe the CPU frequency is now pinned to the new limit.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use pstate control first.
  On my system a lower thermal trip point it used,
  because it doesn't get hot enough.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
  adjusting the max_perf_pct but nothing happens in response.
- Once the lowest limit of max_perf_pct is reached, observe
  thermald fall through to intel-powerclamp, and kidle inject
  starts to be used. Therefore the "busy" condition (see the code)
  is not longer met, and the CPU frequency suddenly goes to minimum.

Test 3: Limit thermald to using max_perf_pct only:
- Adjust the cpu cooling device order to just include the one
  option, thereby excluding help from the fall through partner
  experienced in test 2.
- Observations here tracked test 2 until the lower limit
  of max_perf_pct is reached. The CPUs remain at maximum
  frequency until the load is removed. Processor
  temperature continues to climb.

Driver: intel_cpufreq ; Governor: Schedutil
Kernel 5.2 + this revert patch:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency
  echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
- observe the Bzy_MHz changes on the turbostat output.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use pstate control first.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
  adjusting the max_perf_pct and observe the actual frequency
  adjust in response.

Test 3: Limit thermald to using max_perf_pct only:
- Adjust the cpu cooling device order to just include the one
  option, adjusting max_perf_pct.
- Observations here tracked test 2. (test 2 never fell
  through to intel_powerclamp and kidle inject.

Driver: acpi-cpufreq ; Governor: Schedutil
Kernel 5.2:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency
  for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq; do echo "2200000" > $file; done
- observe the Bzy_MHz does not change on the turbostat output.
- reduce the system load, such that there is some idle time.
- observe the Bzy_MHz now does change.
- increase the load again.
- observe the CPU frequency is now pinned to the new limit.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use intel_powerclamp control first.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
  adjusting the kidle_inj amounts.
- If the maximum kidle_inj is not enough, then oberve thermald
  start to limit scaling_max_freq, however the actual frequency
  response is immediate. Why? Because of the kidle inject, the
  system is not longer fully "busy", and so that condition is not met.

Test 3: Limit thermald to using scaling_max_freq only:
- Adjust the cpu cooling device order to just include the one
  option, thereby excluding help from previous partner
  experienced in test 2.
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
  reducing scaling_max_freq.
- observe the Bzy_MHz does not change on the turbostat output.
- Eventually, for all CPUs, scaling_max_freq is set to the minimum.
- Processor temperature continues to climb.
- The CPUs remain at maximum frequency until the load is removed.

Driver: acpi-cpufreq ; Governor: Schedutil
Kernel 5.2 + this revert patch:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency
  for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq; do echo "2200000" > $file; done
- observe the Bzy_MHz changes on the turbostat output.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use intel_powerclamp control first.
- enable and start thermald
- This test already passes using kernel 5.2, and
  proper operation was observed with this revert.

Test 3: Limit thermald to using scaling_max_freq only:
- Adjust the cpu cooling device order to just include the one
  option, thereby excluding help from previous partner
  experienced in test 2.
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
  reducing scaling_max_freq.
- observe the Bzy_MHz now does change on the turbostat output
  and the processor package temperature is now controlled.
---
 kernel/sched/cpufreq_schedutil.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 962cf34..ea4e04b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -89,8 +89,15 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 	    !cpufreq_this_cpu_can_update(sg_policy->policy))
 		return false;
 
-	if (unlikely(sg_policy->need_freq_update))
+	if (unlikely(sg_policy->need_freq_update)) {
+		sg_policy->need_freq_update = false;
+		/*
+		 * This happens when limits change, so forget the previous
+		 * next_freq value and force an update.
+		 */
+		sg_policy->next_freq = UINT_MAX;
 		return true;
+	}
 
 	delta_ns = time - sg_policy->last_freq_update_time;
 
@@ -168,10 +175,8 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 
 	freq = map_util_freq(util, freq, max);
 
-	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
+	if (freq == sg_policy->cached_raw_freq && sg_policy->next_freq != UINT_MAX)
 		return sg_policy->next_freq;
-
-	sg_policy->need_freq_update = false;
 	sg_policy->cached_raw_freq = freq;
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
@@ -457,7 +462,8 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	 * Do not reduce the frequency if the CPU has not been idle
 	 * recently, as the reduction is likely to be premature then.
 	 */
-	if (busy && next_f < sg_policy->next_freq) {
+	if (busy && next_f < sg_policy->next_freq &&
+	    sg_policy->next_freq != UINT_MAX) {
 		next_f = sg_policy->next_freq;
 
 		/* Reset cached freq as next_freq has changed */
@@ -819,7 +825,7 @@ static int sugov_start(struct cpufreq_policy *policy)
 
 	sg_policy->freq_update_delay_ns	= sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
 	sg_policy->last_freq_update_time	= 0;
-	sg_policy->next_freq			= 0;
+	sg_policy->next_freq			= UINT_MAX;
 	sg_policy->work_in_progress		= false;
 	sg_policy->need_freq_update		= false;
 	sg_policy->cached_raw_freq		= 0;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX"
  2019-07-18  6:26 [PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX" Doug Smythies
@ 2019-07-18 10:28 ` Viresh Kumar
  2019-07-18 15:46   ` Doug Smythies
  2019-07-22  6:51 ` [PATCH] cpufreq: schedutil: Don't skip freq update when limits change Viresh Kumar
  1 sibling, 1 reply; 27+ messages in thread
From: Viresh Kumar @ 2019-07-18 10:28 UTC (permalink / raw)
  To: Doug Smythies; +Cc: rjw, joel, linux-kernel, linux-pm, dsmythies

On 17-07-19, 23:26, Doug Smythies wrote:
> This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514.
> 
> The commit caused a regression whereby reducing the maximum
> CPU clock frequency is ineffective while busy, and the CPU
> clock remains unchanged. Once the system has experienced
> some idle time, the new limit is implemented.

Can you explain why this patch caused that issue ? I am sorry but I couldn't
understand it from your email. How are we trying to reduce the frequency? Is
clk_set_rate() getting called with that finally and not working ?

> A consequence is that any thermal throttling monitoring
> and control based on max freq limits fail to respond
> in a timely manor, if at all, to a thermal temperature
> trip on a busy system.

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX"
  2019-07-18 10:28 ` Viresh Kumar
@ 2019-07-18 15:46   ` Doug Smythies
  2019-07-22  6:49     ` Viresh Kumar
  0 siblings, 1 reply; 27+ messages in thread
From: Doug Smythies @ 2019-07-18 15:46 UTC (permalink / raw)
  To: 'Viresh Kumar'; +Cc: rjw, joel, linux-kernel, linux-pm, Doug Smythies

On 2019.07.18 03:28 Viresh Kumar wrote:
> On 17-07-19, 23:26, Doug Smythies wrote:
>> This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514.
>> 
>> The commit caused a regression whereby reducing the maximum
>> CPU clock frequency is ineffective while busy, and the CPU
>> clock remains unchanged. Once the system has experienced
>> some idle time, the new limit is implemented.
>
> Can you explain why this patch caused that issue ? I am sorry but I couldn't
> understand it from your email. How are we trying to reduce the frequency? Is
> clk_set_rate() getting called with that finally and not working ?

The patch eliminates the "flag", UNIT_MAX, and it's related comment,
that was used to indicate if it was a limit change that causes the
subsequent execution of sugov_update_single.

As for "clk_set_rate()", I don't know. I bisected the kernel
and got here. I also looked at reverting B7EAF1AAB9F8, but
it seemed to have some purpose which I don't know of more
important than this one or not.

I'll reinsert the first test below with more detail:

On 2019.07.17 23:27 Doug wrote:

> Driver: intel_cpufreq ; Governor: Schedutil
> Kernel 5.2:

$ uname -a
Linux s15 5.2.0-stock #608 SMP PREEMPT Mon Jul 8 08:21:56 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux

doug@s15:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
intel_cpufreq
intel_cpufreq
intel_cpufreq
intel_cpufreq
intel_cpufreq
intel_cpufreq
intel_cpufreq
intel_cpufreq
doug@s15:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
schedutil
schedutil
schedutil
schedutil
schedutil
schedutil
schedutil
schedutil

> Test 1: No thermal throttling involved (I only ever use thermal, if anything):

doug@s15:~$ sudo systemctl status thermald.service
. thermald.service - Thermal Daemon Service
   Loaded: loaded (/lib/systemd/system/thermald.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

> - load the system (some sort of CPU stress test).

Use whatever you want. I use my own, only because I always have and it prints something
every so often which gives an indication of actual clock speed.

doug@s15:~$ ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 &
[1] 3000
[2] 3001
[3] 3002
[4] 3003
[5] 3004
[6] 3005
[7] 3006
[8] 3007
[9] 3008

Watch everything with turobstat:

doug@s15:~$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
0.03    1600    269     25      3.69
0.07    1600    390     25      3.72
0.06    1600    368     25      3.72
0.06    1600    343     25      3.71
0.04    1600    269     26      3.70
74.72   3474    30230   46      43.95  <<< Add load
100.00  3500    40228   50      58.27
100.00  3500    40196   52      58.59
100.00  3500    40215   55      58.91
100.00  3500    40211   56      59.12
100.00  3500    40291   58      59.33  <<< Try to set 60% max
100.00  3500    40278   59      59.55
100.00  3500    40591   61      59.73
100.00  3500    40279   61      60.10
100.00  3500    40292   63      60.35
100.00  3500    40401   64      60.85
99.99   3500    40352   65      61.12
100.00  3500    40230   67      61.32
100.00  3500    40228   67      61.52
100.00  3500    40400   68      61.60  <<< Try to set 42% max
100.00  3500    40279   69      61.74
100.00  3500    40258   68      61.85
100.00  3500    40226   70      62.01
100.00  3500    40220   70      62.10
100.00  3500    40211   71      62.25

> - adjust downwards the maximum clock frequency
>   echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct

doug@s15:~/temp$ echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
60
doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
60
... wait awhile ...
doug@s15:~/temp$ echo 42 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
42
doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
42

> - observe the Bzy_MHz does not change on the turbostat output.

See annotated turbostat output above.

> - reduce the system load, such that there is some idle time.
> - observe the Bzy_MHz now does change.
> - increase the load again.
> - observe the CPU frequency is now pinned to the new limit.

Go back to 60% first:

doug@s15:~/temp$ echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
60
killall test1
... wait awhile ...
doug@s15:~$ ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 &
[1] 3040
[2] 3041
[3] 3042
[4] 3043
[5] 3044
[6] 3045
[7] 3046
[8] 3047
[9] 3048

doug@s15:~$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
100.00  3500    40289   80      64.32
100.00  3500    40223   79      64.16
100.00  3500    40212   79      64.14
100.00  3500    40269   79      64.17
100.00  3500    41330   79      64.19  <<< Freq is above the 60% limit
14.10   3489    6260    55      12.63  <<< Load removed, now not "busy"
0.03    1600    263     53      4.13   <<< see sugov_update_single
23.22   2293    9582    65      10.72  <<< Load applied
100.00  2300    40229   66      35.09  <<< 3.8 GHz * 0.6 = 2.3 GHz
100.00  2300    40209   66      35.21
100.00  2300    40211   65      35.20
100.00  2300    40219   65      35.16
100.00  2300    40224   65      35.14

The only procedure changes when using the acpi-cpufreq CPU scaling driver and
the schedutil governor are for setting and monitoring the max freq settings,
as described in the test write-ups.

Hope this helps.

... Doug



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX"
  2019-07-18 15:46   ` Doug Smythies
@ 2019-07-22  6:49     ` Viresh Kumar
  0 siblings, 0 replies; 27+ messages in thread
From: Viresh Kumar @ 2019-07-22  6:49 UTC (permalink / raw)
  To: Doug Smythies; +Cc: rjw, joel, linux-kernel, linux-pm

On 18-07-19, 08:46, Doug Smythies wrote:
> On 2019.07.18 03:28 Viresh Kumar wrote:
> > On 17-07-19, 23:26, Doug Smythies wrote:
> >> This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514.
> >> 
> >> The commit caused a regression whereby reducing the maximum
> >> CPU clock frequency is ineffective while busy, and the CPU
> >> clock remains unchanged. Once the system has experienced
> >> some idle time, the new limit is implemented.
> >
> > Can you explain why this patch caused that issue ? I am sorry but I couldn't
> > understand it from your email. How are we trying to reduce the frequency? Is
> > clk_set_rate() getting called with that finally and not working ?
> 
> The patch eliminates the "flag", UNIT_MAX, and it's related comment,
> that was used to indicate if it was a limit change that causes the
> subsequent execution of sugov_update_single.

I think I may have understood the root cause. Please try the patch I just sent
as reply to this thread. Thanks.

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-18  6:26 [PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX" Doug Smythies
  2019-07-18 10:28 ` Viresh Kumar
@ 2019-07-22  6:51 ` Viresh Kumar
  2019-07-23  7:10   ` Doug Smythies
  1 sibling, 1 reply; 27+ messages in thread
From: Viresh Kumar @ 2019-07-22  6:51 UTC (permalink / raw)
  To: Rafael Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra
  Cc: linux-pm, Vincent Guittot, joel, dsmythies, v4 . 18+,
	Doug Smythies, linux-kernel

To avoid reducing the frequency of a CPU prematurely, we skip reducing
the frequency if the CPU had been busy recently.

This should not be done when the limits of the policy are changed, for
example due to thermal throttling. We should always get the frequency
within limits as soon as possible.

Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
Cc: v4.18+ <stable@vger.kernel.org> # v4.18+
Reported-by: Doug Smythies <doug.smythies@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
@Doug: Please try this patch, it must fix the issue you reported.

 kernel/sched/cpufreq_schedutil.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 636ca6f88c8e..b53c4f02b0f1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -447,7 +447,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	unsigned long util, max;
 	unsigned int next_f;
-	bool busy;
+	bool busy = false;
 
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
@@ -457,7 +457,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
-	busy = sugov_cpu_is_busy(sg_cpu);
+	/* Limits may have changed, don't skip frequency update */
+	if (!sg_policy->need_freq_update)
+		busy = sugov_cpu_is_busy(sg_cpu);
 
 	util = sugov_get_util(sg_cpu);
 	max = sg_cpu->max;
-- 
2.21.0.rc0.269.g1a574e7a288b


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-22  6:51 ` [PATCH] cpufreq: schedutil: Don't skip freq update when limits change Viresh Kumar
@ 2019-07-23  7:10   ` Doug Smythies
  2019-07-23  9:13     ` Rafael J. Wysocki
  2019-07-23  9:15     ` Viresh Kumar
  0 siblings, 2 replies; 27+ messages in thread
From: Doug Smythies @ 2019-07-23  7:10 UTC (permalink / raw)
  To: 'Viresh Kumar', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra'
  Cc: linux-pm, 'Vincent Guittot', joel, 'v4 . 18+',
	linux-kernel

On 2019.07.21 23:52 Viresh Kumar wrote:

> To avoid reducing the frequency of a CPU prematurely, we skip reducing
> the frequency if the CPU had been busy recently.
> 
> This should not be done when the limits of the policy are changed, for
> example due to thermal throttling. We should always get the frequency
> within limits as soon as possible.
>
> Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
> Cc: v4.18+ <stable@vger.kernel.org> # v4.18+
> Reported-by: Doug Smythies <doug.smythies@gmail.com>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
> @Doug: Please try this patch, it must fix the issue you reported.

It fixes the driver = acpi-cpufreq ; governor = schedutil test case
It does not fix the driver = intel_cpufreq ; governor = schedutil test case

I have checked my results twice, but will check again in the day or two.

... Doug

>
> kernel/sched/cpufreq_schedutil.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 636ca6f88c8e..b53c4f02b0f1 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -447,7 +447,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>  	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> 	unsigned long util, max;
>  	unsigned int next_f;
> -	bool busy;
> +	bool busy = false;
> 
> 	sugov_iowait_boost(sg_cpu, time, flags);
> 	sg_cpu->last_update = time;
> @@ -457,7 +457,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> 	if (!sugov_should_update_freq(sg_policy, time))
>		return;
> 
> -	busy = sugov_cpu_is_busy(sg_cpu);
> +	/* Limits may have changed, don't skip frequency update */
> +	if (!sg_policy->need_freq_update)
> +		busy = sugov_cpu_is_busy(sg_cpu);
> 
> 	util = sugov_get_util(sg_cpu);
>  	max = sg_cpu->max;
> -- 
> 2.21.0.rc0.269.g1a574e7a288b



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-23  7:10   ` Doug Smythies
@ 2019-07-23  9:13     ` Rafael J. Wysocki
  2019-07-23  9:15     ` Viresh Kumar
  1 sibling, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2019-07-23  9:13 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Viresh Kumar, Rafael Wysocki, Ingo Molnar, Peter Zijlstra,
	Linux PM, Vincent Guittot, Joel Fernandes, v4 . 18+,
	Linux Kernel Mailing List

On Tue, Jul 23, 2019 at 9:10 AM Doug Smythies <dsmythies@telus.net> wrote:
>
> On 2019.07.21 23:52 Viresh Kumar wrote:
>
> > To avoid reducing the frequency of a CPU prematurely, we skip reducing
> > the frequency if the CPU had been busy recently.
> >
> > This should not be done when the limits of the policy are changed, for
> > example due to thermal throttling. We should always get the frequency
> > within limits as soon as possible.
> >
> > Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
> > Cc: v4.18+ <stable@vger.kernel.org> # v4.18+
> > Reported-by: Doug Smythies <doug.smythies@gmail.com>
> > Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> > ---
> > @Doug: Please try this patch, it must fix the issue you reported.
>
> It fixes the driver = acpi-cpufreq ; governor = schedutil test case
> It does not fix the driver = intel_cpufreq ; governor = schedutil test case

So what's the difference between them, with the patch applied?

> I have checked my results twice, but will check again in the day or two.

OK, thanks!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-23  7:10   ` Doug Smythies
  2019-07-23  9:13     ` Rafael J. Wysocki
@ 2019-07-23  9:15     ` Viresh Kumar
  2019-07-23 10:27       ` Rafael J. Wysocki
  1 sibling, 1 reply; 27+ messages in thread
From: Viresh Kumar @ 2019-07-23  9:15 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael Wysocki', 'Ingo Molnar',
	'Peter Zijlstra', linux-pm, 'Vincent Guittot',
	joel, 'v4 . 18+',
	linux-kernel

On 23-07-19, 00:10, Doug Smythies wrote:
> On 2019.07.21 23:52 Viresh Kumar wrote:
> 
> > To avoid reducing the frequency of a CPU prematurely, we skip reducing
> > the frequency if the CPU had been busy recently.
> > 
> > This should not be done when the limits of the policy are changed, for
> > example due to thermal throttling. We should always get the frequency
> > within limits as soon as possible.
> >
> > Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
> > Cc: v4.18+ <stable@vger.kernel.org> # v4.18+
> > Reported-by: Doug Smythies <doug.smythies@gmail.com>
> > Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> > ---
> > @Doug: Please try this patch, it must fix the issue you reported.
> 
> It fixes the driver = acpi-cpufreq ; governor = schedutil test case
> It does not fix the driver = intel_cpufreq ; governor = schedutil test case
> 
> I have checked my results twice, but will check again in the day or two.

The patch you tried to revert wasn't doing any driver specific stuff
but only schedutil. If that revert fixes your issue with both the
drivers, then this patch should do it as well.

I am clueless now on what can go wrong with intel_cpufreq driver with
schedutil now.

Though there is one difference between intel_cpufreq and acpi_cpufreq,
intel_cpufreq has fast_switch_possible=true and so it uses slightly
different path in schedutil. I tried to look from that perspective as
well but couldn't find anything wrong.

If you still find intel_cpufreq to be broken, even with this patch,
please set fast_switch_possible=false instead of true in
__intel_pstate_cpu_init() and try tests again. That shall make it very
much similar to acpi-cpufreq driver.

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-23  9:15     ` Viresh Kumar
@ 2019-07-23 10:27       ` Rafael J. Wysocki
  2019-07-24 11:43         ` Viresh Kumar
  0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2019-07-23 10:27 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Doug Smythies, Rafael Wysocki, Ingo Molnar, Peter Zijlstra,
	Linux PM, Vincent Guittot, Joel Fernandes, v4 . 18+,
	Linux Kernel Mailing List

On Tue, Jul 23, 2019 at 11:15 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 23-07-19, 00:10, Doug Smythies wrote:
> > On 2019.07.21 23:52 Viresh Kumar wrote:
> >
> > > To avoid reducing the frequency of a CPU prematurely, we skip reducing
> > > the frequency if the CPU had been busy recently.
> > >
> > > This should not be done when the limits of the policy are changed, for
> > > example due to thermal throttling. We should always get the frequency
> > > within limits as soon as possible.
> > >
> > > Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
> > > Cc: v4.18+ <stable@vger.kernel.org> # v4.18+
> > > Reported-by: Doug Smythies <doug.smythies@gmail.com>
> > > Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> > > ---
> > > @Doug: Please try this patch, it must fix the issue you reported.
> >
> > It fixes the driver = acpi-cpufreq ; governor = schedutil test case
> > It does not fix the driver = intel_cpufreq ; governor = schedutil test case
> >
> > I have checked my results twice, but will check again in the day or two.
>
> The patch you tried to revert wasn't doing any driver specific stuff
> but only schedutil. If that revert fixes your issue with both the
> drivers, then this patch should do it as well.
>
> I am clueless now on what can go wrong with intel_cpufreq driver with
> schedutil now.
>
> Though there is one difference between intel_cpufreq and acpi_cpufreq,
> intel_cpufreq has fast_switch_possible=true and so it uses slightly
> different path in schedutil. I tried to look from that perspective as
> well but couldn't find anything wrong.

acpi-cpufreq should use fast switching on the Doug's system too.

> If you still find intel_cpufreq to be broken, even with this patch,
> please set fast_switch_possible=false instead of true in
> __intel_pstate_cpu_init() and try tests again. That shall make it very
> much similar to acpi-cpufreq driver.

I wonder if this helps.  Even so, we want fast switching to be used by
intel_cpufreq.

Anyway, it looks like the change reverted by the Doug's patch
introduced a race condition that had not been present before.  Namely,
need_freq_update is cleared in get_next_freq() when it is set _or_
when the new freq is different from the cached one, so in the latter
case if it happens to be set by sugov_limits() after evaluating
sugov_should_update_freq() (which returned 'true' for timing reasons),
that update will be lost now. [Previously the update would not be
lost, because the clearing of need_freq_update depended only on its
current value.] Where it matters is that in the "need_freq_update set"
case, the "premature frequency reduction avoidance" should not be
applied (as you noticed and hence the $subject patch).

However, even with the $subject patch, need_freq_update may still be
set by sugov_limits() after the check added by it and then cleared by
get_next_freq(), so it doesn't really eliminate the problem.

IMO eliminating would require invalidating next_freq this way or
another when need_freq_update is set in sugov_should_update_freq(),
which was done before commit ecd2884291261e3fddbc7651ee11a20d596bb514.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-23 10:27       ` Rafael J. Wysocki
@ 2019-07-24 11:43         ` Viresh Kumar
  2019-07-25 15:20           ` Doug Smythies
  0 siblings, 1 reply; 27+ messages in thread
From: Viresh Kumar @ 2019-07-24 11:43 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Rafael Wysocki, Ingo Molnar, Peter Zijlstra,
	Linux PM, Vincent Guittot, Joel Fernandes, v4 . 18+,
	Linux Kernel Mailing List

On 23-07-19, 12:27, Rafael J. Wysocki wrote:
> On Tue, Jul 23, 2019 at 11:15 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > Though there is one difference between intel_cpufreq and acpi_cpufreq,
> > intel_cpufreq has fast_switch_possible=true and so it uses slightly
> > different path in schedutil. I tried to look from that perspective as
> > well but couldn't find anything wrong.
> 
> acpi-cpufreq should use fast switching on the Doug's system too.

Ah okay.

> > If you still find intel_cpufreq to be broken, even with this patch,
> > please set fast_switch_possible=false instead of true in
> > __intel_pstate_cpu_init() and try tests again. That shall make it very
> > much similar to acpi-cpufreq driver.
> 
> I wonder if this helps.  Even so, we want fast switching to be used by
> intel_cpufreq.

With both using fast switching it shouldn't make any difference.

> Anyway, it looks like the change reverted by the Doug's patch
> introduced a race condition that had not been present before.  Namely,
> need_freq_update is cleared in get_next_freq() when it is set _or_
> when the new freq is different from the cached one, so in the latter
> case if it happens to be set by sugov_limits() after evaluating
> sugov_should_update_freq() (which returned 'true' for timing reasons),
> that update will be lost now. [Previously the update would not be
> lost, because the clearing of need_freq_update depended only on its
> current value.] Where it matters is that in the "need_freq_update set"
> case, the "premature frequency reduction avoidance" should not be
> applied (as you noticed and hence the $subject patch).
> 
> However, even with the $subject patch, need_freq_update may still be
> set by sugov_limits() after the check added by it and then cleared by
> get_next_freq(), so it doesn't really eliminate the problem.
> 
> IMO eliminating would require invalidating next_freq this way or
> another when need_freq_update is set in sugov_should_update_freq(),
> which was done before commit ecd2884291261e3fddbc7651ee11a20d596bb514.

Hmm, so to avoid locking in fast path we need two variable group to
protect against this kind of issues. I still don't want to override
next_freq with a special meaning as it can cause hidden bugs, we have
seen that earlier.

What about something like this then ?

-- 
viresh

-------------------------8<-------------------------
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 636ca6f88c8e..2f382b0959e5 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -40,6 +40,7 @@ struct sugov_policy {
 	struct task_struct	*thread;
 	bool			work_in_progress;
 
+	bool			limits_changed;
 	bool			need_freq_update;
 };
 
@@ -89,8 +90,11 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 	    !cpufreq_this_cpu_can_update(sg_policy->policy))
 		return false;
 
-	if (unlikely(sg_policy->need_freq_update))
+	if (unlikely(sg_policy->limits_changed)) {
+		sg_policy->limits_changed = false;
+		sg_policy->need_freq_update = true;
 		return true;
+	}
 
 	delta_ns = time - sg_policy->last_freq_update_time;
 
@@ -437,7 +441,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
 static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu, struct sugov_policy *sg_policy)
 {
 	if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
-		sg_policy->need_freq_update = true;
+		sg_policy->limits_changed = true;
 }
 
 static void sugov_update_single(struct update_util_data *hook, u64 time,
@@ -447,7 +451,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	unsigned long util, max;
 	unsigned int next_f;
-	bool busy;
+	bool busy = false;
 
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
@@ -457,7 +461,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
-	busy = sugov_cpu_is_busy(sg_cpu);
+	/* Limits may have changed, don't skip frequency update */
+	if (!sg_policy->need_freq_update)
+		busy = sugov_cpu_is_busy(sg_cpu);
 
 	util = sugov_get_util(sg_cpu);
 	max = sg_cpu->max;
@@ -831,6 +837,7 @@ static int sugov_start(struct cpufreq_policy *policy)
 	sg_policy->last_freq_update_time	= 0;
 	sg_policy->next_freq			= 0;
 	sg_policy->work_in_progress		= false;
+	sg_policy->limits_changed		= false;
 	sg_policy->need_freq_update		= false;
 	sg_policy->cached_raw_freq		= 0;
 
@@ -879,7 +886,7 @@ static void sugov_limits(struct cpufreq_policy *policy)
 		mutex_unlock(&sg_policy->work_lock);
 	}
 
-	sg_policy->need_freq_update = true;
+	sg_policy->limits_changed = true;
 }
 
 struct cpufreq_governor schedutil_gov = {

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-24 11:43         ` Viresh Kumar
@ 2019-07-25 15:20           ` Doug Smythies
  2019-07-26  3:26             ` Viresh Kumar
  2019-07-26  6:57             ` Viresh Kumar
  0 siblings, 2 replies; 27+ messages in thread
From: Doug Smythies @ 2019-07-25 15:20 UTC (permalink / raw)
  To: 'Viresh Kumar', 'Rafael J. Wysocki'
  Cc: 'Rafael Wysocki', 'Ingo Molnar',
	'Peter Zijlstra', 'Linux PM',
	'Vincent Guittot', 'Joel Fernandes',
	'v4 . 18+', 'Linux Kernel Mailing List'

Hi,

I am having trouble keeping up.
Here is what I have so far:

On 2019.07.24 04:43 Viresh Kumar wrote:
> On 23-07-19, 12:27, Rafael J. Wysocki wrote:
>> On Tue, Jul 23, 2019 at 11:15 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:

>>> Though there is one difference between intel_cpufreq and acpi_cpufreq,
>>> intel_cpufreq has fast_switch_possible=true and so it uses slightly
>>> different path in schedutil. I tried to look from that perspective as
>>> well but couldn't find anything wrong.
>> 
>> acpi-cpufreq should use fast switching on the Doug's system too.
>
> Ah okay.
>
>>> If you still find intel_cpufreq to be broken, even with this patch,

It is.

>>> please set fast_switch_possible=false instead of true in
>>> __intel_pstate_cpu_init() and try tests again. That shall make it very
>>> much similar to acpi-cpufreq driver.
>> 
>> I wonder if this helps.

It does not help.

>> Even so, we want fast switching to be used by intel_cpufreq.

>
> With both using fast switching it shouldn't make any difference.

>> Anyway, it looks like the change reverted by the Doug's patch
>> introduced a race condition that had not been present before.  Namely,
>> need_freq_update is cleared in get_next_freq() when it is set _or_
>> when the new freq is different from the cached one, so in the latter
>> case if it happens to be set by sugov_limits() after evaluating
>> sugov_should_update_freq() (which returned 'true' for timing reasons),
>> that update will be lost now. [Previously the update would not be
>> lost, because the clearing of need_freq_update depended only on its
>> current value.] Where it matters is that in the "need_freq_update set"
>> case, the "premature frequency reduction avoidance" should not be
>> applied (as you noticed and hence the $subject patch).
>> 
>> However, even with the $subject patch, need_freq_update may still be
>> set by sugov_limits() after the check added by it and then cleared by
>> get_next_freq(), so it doesn't really eliminate the problem.
>> 
>> IMO eliminating would require invalidating next_freq this way or
>> another when need_freq_update is set in sugov_should_update_freq(),
>> which was done before commit ecd2884291261e3fddbc7651ee11a20d596bb514.
>
> Hmm, so to avoid locking in fast path we need two variable group to
> protect against this kind of issues. I still don't want to override
> next_freq with a special meaning as it can cause hidden bugs, we have
> seen that earlier.
>
> What about something like this then ?

I tried the patch ("patch2"). It did not fix the issue.

To summarize, all kernel 5.2 based, all intel_cpufreq driver and schedutil governor:

Test: Does a busy system respond to maximum CPU clock frequency reduction?

stock, unaltered: No.
revert ecd2884291261e3fddbc7651ee11a20d596bb514: Yes
viresh patch: No.
fast_switch edit: No.
viresh patch2: No.

References (and procedures used):
https://marc.info/?l=linux-pm&m=156346478429147&w=2
https://marc.info/?l=linux-kernel&m=156343125319461&w=2

... Doug



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-25 15:20           ` Doug Smythies
@ 2019-07-26  3:26             ` Viresh Kumar
  2019-07-26  6:57             ` Viresh Kumar
  1 sibling, 0 replies; 27+ messages in thread
From: Viresh Kumar @ 2019-07-26  3:26 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra',
	'Linux PM', 'Vincent Guittot',
	'Joel Fernandes', 'v4 . 18+',
	'Linux Kernel Mailing List'

On 25-07-19, 08:20, Doug Smythies wrote:
> I tried the patch ("patch2"). It did not fix the issue.
> 
> To summarize, all kernel 5.2 based, all intel_cpufreq driver and schedutil governor:
> 
> Test: Does a busy system respond to maximum CPU clock frequency reduction?
> 
> stock, unaltered: No.
> revert ecd2884291261e3fddbc7651ee11a20d596bb514: Yes
> viresh patch: No.
> fast_switch edit: No.

You tried this fast-switch thing with my patch applied, right ?

> viresh patch2: No.
> 
> References (and procedures used):
> https://marc.info/?l=linux-pm&m=156346478429147&w=2
> https://marc.info/?l=linux-kernel&m=156343125319461&w=2
> 
> ... Doug
> 

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-25 15:20           ` Doug Smythies
  2019-07-26  3:26             ` Viresh Kumar
@ 2019-07-26  6:57             ` Viresh Kumar
  2019-07-29  7:55               ` Doug Smythies
  1 sibling, 1 reply; 27+ messages in thread
From: Viresh Kumar @ 2019-07-26  6:57 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra',
	'Linux PM', 'Vincent Guittot',
	'Joel Fernandes', 'v4 . 18+',
	'Linux Kernel Mailing List'

On 25-07-19, 08:20, Doug Smythies wrote:
> I tried the patch ("patch2"). It did not fix the issue.
> 
> To summarize, all kernel 5.2 based, all intel_cpufreq driver and schedutil governor:
> 
> Test: Does a busy system respond to maximum CPU clock frequency reduction?
> 
> stock, unaltered: No.
> revert ecd2884291261e3fddbc7651ee11a20d596bb514: Yes
> viresh patch: No.
> fast_switch edit: No.
> viresh patch2: No.

Hmm, so I tried to reproduce your setup on my ARM board.
- booted only with CPU0 so I hit the sugov_update_single() routine
- And applied below diff to make CPU look permanently busy:

-------------------------8<-------------------------
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 2f382b0959e5..afb47490e5dc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -121,6 +121,7 @@ static void sugov_fast_switch(struct sugov_policy *sg_policy, u64 time,
        if (!sugov_update_next_freq(sg_policy, time, next_freq))
                return;
 
+       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);
        next_freq = cpufreq_driver_fast_switch(policy, next_freq);
        if (!next_freq)
                return;
@@ -424,14 +425,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 #ifdef CONFIG_NO_HZ_COMMON
 static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
 {
-       unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
-       bool ret = idle_calls == sg_cpu->saved_idle_calls;
-
-       sg_cpu->saved_idle_calls = idle_calls;
-       return ret;
+       return true;
 }
 #else
-static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return true; }
 #endif /* CONFIG_NO_HZ_COMMON */
 
 /*
@@ -565,6 +562,7 @@ static void sugov_work(struct kthread_work *work)
        sg_policy->work_in_progress = false;
        raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
 
+       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);
        mutex_lock(&sg_policy->work_lock);
        __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
        mutex_unlock(&sg_policy->work_lock);

-------------------------8<-------------------------

Now, the frequency never gets down and so gets set to the maximum
possible after a bit.

- Then I did:

echo <any-low-freq-value> > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq

Without my patch applied:
        The print never gets printed and so frequency doesn't go down.

With my patch applied:
        The print gets printed immediately from sugov_work() and so
        the frequency reduces.

Can you try with this diff along with my Patch2 ? I suspect there may
be something wrong with the intel_cpufreq driver as the patch fixes
the only path we have in the schedutil governor which takes busyness
of a CPU into account.

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-26  6:57             ` Viresh Kumar
@ 2019-07-29  7:55               ` Doug Smythies
  2019-07-29  8:32                 ` Viresh Kumar
  0 siblings, 1 reply; 27+ messages in thread
From: Doug Smythies @ 2019-07-29  7:55 UTC (permalink / raw)
  To: 'Viresh Kumar'
  Cc: 'Rafael J. Wysocki', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra',
	'Linux PM', 'Vincent Guittot',
	'Joel Fernandes', 'v4 . 18+',
	'Linux Kernel Mailing List'

On 2019.07.25 23:58 Viresh Kumar wrote:
> On 25-07-19, 08:20, Doug Smythies wrote:
>> I tried the patch ("patch2"). It did not fix the issue.
>> 
>> To summarize, all kernel 5.2 based, all intel_cpufreq driver and schedutil governor:
>> 
>> Test: Does a busy system respond to maximum CPU clock frequency reduction?
>> 
>> stock, unaltered: No.
>> revert ecd2884291261e3fddbc7651ee11a20d596bb514: Yes
>> viresh patch: No.
>> fast_switch edit: No.
>> viresh patch2: No.
>
> Hmm, so I tried to reproduce your setup on my ARM board.
> - booted only with CPU0 so I hit the sugov_update_single() routine
> - And applied below diff to make CPU look permanently busy:
>
> -------------------------8<-------------------------
>diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 2f382b0959e5..afb47490e5dc 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -121,6 +121,7 @@ static void sugov_fast_switch(struct sugov_policy *sg_policy, u64 time,
>         if (!sugov_update_next_freq(sg_policy, time, next_freq))
>                return;
> 
> +       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);

?? there is no "freq" variable here, and so this doesn't compile. However this works:

+       pr_info("%s: %d: %u\n", __func__, __LINE__, next_freq);

>         next_freq = cpufreq_driver_fast_switch(policy, next_freq);
>        if (!next_freq)
>                return;
> @@ -424,14 +425,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> #ifdef CONFIG_NO_HZ_COMMON
> static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> {
> -       unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> -       bool ret = idle_calls == sg_cpu->saved_idle_calls;
> -
> -       sg_cpu->saved_idle_calls = idle_calls;
> -       return ret;
> +       return true;
>  }
>  #else
> -static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return true; }
> #endif /* CONFIG_NO_HZ_COMMON */
> 
>  /*
> @@ -565,6 +562,7 @@ static void sugov_work(struct kthread_work *work)
>         sg_policy->work_in_progress = false;
>         raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
> 
> +       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);
>         mutex_lock(&sg_policy->work_lock);
>         __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
>         mutex_unlock(&sg_policy->work_lock);
> 
> -------------------------8<-------------------------
>
> Now, the frequency never gets down and so gets set to the maximum
> possible after a bit.
>
> - Then I did:
>
> echo <any-low-freq-value> > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
>
> Without my patch applied:
>        The print never gets printed and so frequency doesn't go down.
>
> With my patch applied:
>        The print gets printed immediately from sugov_work() and so
>        the frequency reduces.
> 
> Can you try with this diff along with my Patch2 ? I suspect there may
> be something wrong with the intel_cpufreq driver as the patch fixes
> the only path we have in the schedutil governor which takes busyness
> of a CPU into account.

With this diff along with your patch2 There is never a print message
from sugov_work. There are from sugov_fast_switch.

Note that for the intel_cpufreq CPU scaling driver and the schedutil
governor I adjust the maximum clock frequency this way:

echo <any-low-percent> > /sys/devices/system/cpu/intel_pstate/max_perf_pct

I also applied the pr_info messages to the reverted kernel, and re-did
my tests (where everything works as expected). There is never a print
message from sugov_work. There are from sugov_fast_switch.

Notes:

I do not know if:
/sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
/sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
Need to be accurate when using the intel_pstate driver in passive mode.
They are not.
The commit comment for 9083e4986124389e2a7c0ffca95630a4983887f0
suggests that they might need to be representative.
I wonder if something similar to that commit is needed
for other global changes, such as max_perf_pct and min_perf_pct?

intel_cpufreq/ondemand doesn't work properly on the reverted kernel.
(just discovered, not investigated)
I don't know about other governors.

... Doug



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-29  7:55               ` Doug Smythies
@ 2019-07-29  8:32                 ` Viresh Kumar
  2019-07-29  8:37                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 27+ messages in thread
From: Viresh Kumar @ 2019-07-29  8:32 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra',
	'Linux PM', 'Vincent Guittot',
	'Joel Fernandes', 'v4 . 18+',
	'Linux Kernel Mailing List'

On 29-07-19, 00:55, Doug Smythies wrote:
> On 2019.07.25 23:58 Viresh Kumar wrote:
> > Hmm, so I tried to reproduce your setup on my ARM board.
> > - booted only with CPU0 so I hit the sugov_update_single() routine
> > - And applied below diff to make CPU look permanently busy:
> >
> > -------------------------8<-------------------------
> >diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 2f382b0959e5..afb47490e5dc 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -121,6 +121,7 @@ static void sugov_fast_switch(struct sugov_policy *sg_policy, u64 time,
> >         if (!sugov_update_next_freq(sg_policy, time, next_freq))
> >                return;
> > 
> > +       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);
> 
> ?? there is no "freq" variable here, and so this doesn't compile. However this works:
> 
> +       pr_info("%s: %d: %u\n", __func__, __LINE__, next_freq);

There are two paths we can take to change the frequency, normal
sleep-able path (sugov_work) or fast path. Only one of them is taken
by any driver ever. In your case it is the fast path always and in
mine it was the slow path.

I only tested the diff with slow-path and copy pasted to fast path
while giving out to you and so the build issue. Sorry about that.

Also make sure that the print is added after sugov_update_next_freq()
is called, not before it.

> >         next_freq = cpufreq_driver_fast_switch(policy, next_freq);
> >        if (!next_freq)
> >                return;
> > @@ -424,14 +425,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> > #ifdef CONFIG_NO_HZ_COMMON
> > static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> > {
> > -       unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> > -       bool ret = idle_calls == sg_cpu->saved_idle_calls;
> > -
> > -       sg_cpu->saved_idle_calls = idle_calls;
> > -       return ret;
> > +       return true;
> >  }
> >  #else
> > -static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> > +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return true; }
> > #endif /* CONFIG_NO_HZ_COMMON */
> > 
> >  /*
> > @@ -565,6 +562,7 @@ static void sugov_work(struct kthread_work *work)
> >         sg_policy->work_in_progress = false;
> >         raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
> > 
> > +       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);
> >         mutex_lock(&sg_policy->work_lock);
> >         __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
> >         mutex_unlock(&sg_policy->work_lock);
> > 
> > -------------------------8<-------------------------
> >
> > Now, the frequency never gets down and so gets set to the maximum
> > possible after a bit.
> >
> > - Then I did:
> >
> > echo <any-low-freq-value> > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
> >
> > Without my patch applied:
> >        The print never gets printed and so frequency doesn't go down.
> >
> > With my patch applied:
> >        The print gets printed immediately from sugov_work() and so
> >        the frequency reduces.
> > 
> > Can you try with this diff along with my Patch2 ? I suspect there may
> > be something wrong with the intel_cpufreq driver as the patch fixes
> > the only path we have in the schedutil governor which takes busyness
> > of a CPU into account.
> 
> With this diff along with your patch2 There is never a print message
> from sugov_work. There are from sugov_fast_switch.

Which is okay. sugov_work won't get hit in your case as I explained
above.

> Note that for the intel_cpufreq CPU scaling driver and the schedutil
> governor I adjust the maximum clock frequency this way:
> 
> echo <any-low-percent> > /sys/devices/system/cpu/intel_pstate/max_perf_pct

This should eventually call sugov_limits() in schedutil governor, this
can be easily checked with another print message.

> I also applied the pr_info messages to the reverted kernel, and re-did
> my tests (where everything works as expected). There is never a print
> message from sugov_work. There are from sugov_fast_switch.

that's fine.

> Notes:
> 
> I do not know if:
> /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
> /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
> Need to be accurate when using the intel_pstate driver in passive mode.
> They are not.
> The commit comment for 9083e4986124389e2a7c0ffca95630a4983887f0
> suggests that they might need to be representative.
> I wonder if something similar to that commit is needed
> for other global changes, such as max_perf_pct and min_perf_pct?

We are already calling intel_pstate_update_policies() in that case, so
it should be fine I believe.

> intel_cpufreq/ondemand doesn't work properly on the reverted kernel.

reverted kernel ? The patch you reverted was only for schedutil and it
shouldn't have anything to do with ondemand.

> (just discovered, not investigated)
> I don't know about other governors.

When you do:

echo <any-low-percent> > /sys/devices/system/cpu/intel_pstate/max_perf_pct

How soon does the print from sugov_fast_switch() gets printed ?
Immediately ? Check with both the kernels, with my patch and with the
reverted patch.

Also see if there is any difference in the next_freq value in both the
kernels when you change max_perf_pct.

FWIW, we now know the difference between intel-pstate and
acpi-cpufreq/my testcase and why we see differences here. In the cases
where my patch fixed the issue (acpi/ARM), we were really changing the
limits, i.e. policy->min/max. This happened because we touched
scaling_max_freq directly.

For the case of intel-pstate, you are changing max_perf_pct which
doesn't change policy->max directly. I am not very sure how all of it
work really, but at least schedutil will not see policy->max changing.

@Rafael: Do you understand why things don't work properly with
intel_cpufreq driver ?

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-29  8:32                 ` Viresh Kumar
@ 2019-07-29  8:37                   ` Rafael J. Wysocki
  2019-08-01  0:20                     ` Doug Smythies
  0 siblings, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2019-07-29  8:37 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Doug Smythies, Rafael J. Wysocki, Rafael Wysocki, Ingo Molnar,
	Peter Zijlstra, Linux PM, Vincent Guittot, Joel Fernandes,
	v4 . 18+,
	Linux Kernel Mailing List

On Mon, Jul 29, 2019 at 10:32 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 29-07-19, 00:55, Doug Smythies wrote:
> > On 2019.07.25 23:58 Viresh Kumar wrote:
> > > Hmm, so I tried to reproduce your setup on my ARM board.
> > > - booted only with CPU0 so I hit the sugov_update_single() routine
> > > - And applied below diff to make CPU look permanently busy:
> > >
> > > -------------------------8<-------------------------
> > >diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index 2f382b0959e5..afb47490e5dc 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -121,6 +121,7 @@ static void sugov_fast_switch(struct sugov_policy *sg_policy, u64 time,
> > >         if (!sugov_update_next_freq(sg_policy, time, next_freq))
> > >                return;
> > >
> > > +       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);
> >
> > ?? there is no "freq" variable here, and so this doesn't compile. However this works:
> >
> > +       pr_info("%s: %d: %u\n", __func__, __LINE__, next_freq);
>
> There are two paths we can take to change the frequency, normal
> sleep-able path (sugov_work) or fast path. Only one of them is taken
> by any driver ever. In your case it is the fast path always and in
> mine it was the slow path.
>
> I only tested the diff with slow-path and copy pasted to fast path
> while giving out to you and so the build issue. Sorry about that.
>
> Also make sure that the print is added after sugov_update_next_freq()
> is called, not before it.
>
> > >         next_freq = cpufreq_driver_fast_switch(policy, next_freq);
> > >        if (!next_freq)
> > >                return;
> > > @@ -424,14 +425,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> > > #ifdef CONFIG_NO_HZ_COMMON
> > > static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> > > {
> > > -       unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> > > -       bool ret = idle_calls == sg_cpu->saved_idle_calls;
> > > -
> > > -       sg_cpu->saved_idle_calls = idle_calls;
> > > -       return ret;
> > > +       return true;
> > >  }
> > >  #else
> > > -static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> > > +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return true; }
> > > #endif /* CONFIG_NO_HZ_COMMON */
> > >
> > >  /*
> > > @@ -565,6 +562,7 @@ static void sugov_work(struct kthread_work *work)
> > >         sg_policy->work_in_progress = false;
> > >         raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
> > >
> > > +       pr_info("%s: %d: %u\n", __func__, __LINE__, freq);
> > >         mutex_lock(&sg_policy->work_lock);
> > >         __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
> > >         mutex_unlock(&sg_policy->work_lock);
> > >
> > > -------------------------8<-------------------------
> > >
> > > Now, the frequency never gets down and so gets set to the maximum
> > > possible after a bit.
> > >
> > > - Then I did:
> > >
> > > echo <any-low-freq-value> > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
> > >
> > > Without my patch applied:
> > >        The print never gets printed and so frequency doesn't go down.
> > >
> > > With my patch applied:
> > >        The print gets printed immediately from sugov_work() and so
> > >        the frequency reduces.
> > >
> > > Can you try with this diff along with my Patch2 ? I suspect there may
> > > be something wrong with the intel_cpufreq driver as the patch fixes
> > > the only path we have in the schedutil governor which takes busyness
> > > of a CPU into account.
> >
> > With this diff along with your patch2 There is never a print message
> > from sugov_work. There are from sugov_fast_switch.
>
> Which is okay. sugov_work won't get hit in your case as I explained
> above.
>
> > Note that for the intel_cpufreq CPU scaling driver and the schedutil
> > governor I adjust the maximum clock frequency this way:
> >
> > echo <any-low-percent> > /sys/devices/system/cpu/intel_pstate/max_perf_pct
>
> This should eventually call sugov_limits() in schedutil governor, this
> can be easily checked with another print message.
>
> > I also applied the pr_info messages to the reverted kernel, and re-did
> > my tests (where everything works as expected). There is never a print
> > message from sugov_work. There are from sugov_fast_switch.
>
> that's fine.
>
> > Notes:
> >
> > I do not know if:
> > /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
> > /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
> > Need to be accurate when using the intel_pstate driver in passive mode.
> > They are not.
> > The commit comment for 9083e4986124389e2a7c0ffca95630a4983887f0
> > suggests that they might need to be representative.
> > I wonder if something similar to that commit is needed
> > for other global changes, such as max_perf_pct and min_perf_pct?
>
> We are already calling intel_pstate_update_policies() in that case, so
> it should be fine I believe.
>
> > intel_cpufreq/ondemand doesn't work properly on the reverted kernel.
>
> reverted kernel ? The patch you reverted was only for schedutil and it
> shouldn't have anything to do with ondemand.
>
> > (just discovered, not investigated)
> > I don't know about other governors.
>
> When you do:
>
> echo <any-low-percent> > /sys/devices/system/cpu/intel_pstate/max_perf_pct
>
> How soon does the print from sugov_fast_switch() gets printed ?
> Immediately ? Check with both the kernels, with my patch and with the
> reverted patch.
>
> Also see if there is any difference in the next_freq value in both the
> kernels when you change max_perf_pct.
>
> FWIW, we now know the difference between intel-pstate and
> acpi-cpufreq/my testcase and why we see differences here. In the cases
> where my patch fixed the issue (acpi/ARM), we were really changing the
> limits, i.e. policy->min/max. This happened because we touched
> scaling_max_freq directly.
>
> For the case of intel-pstate, you are changing max_perf_pct which
> doesn't change policy->max directly. I am not very sure how all of it
> work really, but at least schedutil will not see policy->max changing.
>
> @Rafael: Do you understand why things don't work properly with
> intel_cpufreq driver ?

I haven't tried to understand this yet, so no.

My somewhat educated guess is that using max_perf_pct has to do with
it, so I would try to retest to see if there's any difference when
scaling_max_freq is used instead of that.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-29  8:37                   ` Rafael J. Wysocki
@ 2019-08-01  0:20                     ` Doug Smythies
  2019-08-01  6:17                       ` Viresh Kumar
  0 siblings, 1 reply; 27+ messages in thread
From: Doug Smythies @ 2019-08-01  0:20 UTC (permalink / raw)
  To: 'Rafael J. Wysocki', 'Viresh Kumar'
  Cc: 'Rafael Wysocki', 'Ingo Molnar',
	'Peter Zijlstra', 'Linux PM',
	'Vincent Guittot', 'Joel Fernandes',
	'v4 . 18+', 'Linux Kernel Mailing List'

Hi Viresh,

Summary:

The old way, using UINT_MAX had two purposes: first,
as a "need to do a frequency update" flag; but also second, to
force any subsequent old/new frequency comparison to NOT be "the same,
so why bother actually updating" (see: sugov_update_next_freq). All
patches so far have been dealing with the flag, but only partially
the comparisons. In a busy system, and when schedutil.c doesn't actually
know the currently set system limits, the new frequency is dominated by
values the same as the old frequency. So, when sugov_fast_switch calls 
sugov_update_next_freq, false is usually returned.

However, if we move the resetting of the flag and add another condition
to the "no need to actually update" decision, then perhaps this patch
version 1 will be O.K. It seems to be. (see way later in this e-mail).

On 2019.07.29 01:38 Rafael J. Wysocki wrote:
> On Mon, Jul 29, 2019 at 10:32 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> On 29-07-19, 00:55, Doug Smythies wrote:
>>> On 2019.07.25 23:58 Viresh Kumar wrote:

...[snip]...

>>>> Now, the frequency never gets down and so gets set to the maximum
>>>> possible after a bit.
>>>>
>>>> - Then I did:
>>>>
>>>> echo <any-low-freq-value> > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
>>>>
>>>> Without my patch applied:
>>>>        The print never gets printed and so frequency doesn't go down.
>>>>
>>>> With my patch applied:
>>>>        The print gets printed immediately from sugov_work() and so
>>>>        the frequency reduces.
>>>>
>>>> Can you try with this diff along with my Patch2 ? I suspect there may
>>>> be something wrong with the intel_cpufreq driver as the patch fixes
>>>> the only path we have in the schedutil governor which takes busyness
>>>> of a CPU into account.
>>>
>>> With this diff along with your patch2 There is never a print message
>>> from sugov_work. There are from sugov_fast_switch.
>>
>> Which is okay. sugov_work won't get hit in your case as I explained
>> above.

O.K., I finally understand.

>>> Note that for the intel_cpufreq CPU scaling driver and the schedutil
>>> governor I adjust the maximum clock frequency this way:
>>>
>>> echo <any-low-percent> > /sys/devices/system/cpu/intel_pstate/max_perf_pct
>>
>> This should eventually call sugov_limits() in schedutil governor, this
>> can be easily checked with another print message.
>>
>>> I also applied the pr_info messages to the reverted kernel, and re-did
>>> my tests (where everything works as expected). There is never a print
>>> message from sugov_work. There are from sugov_fast_switch.
>>
>> that's fine.
>>
>>> Notes:
>>>
>>> I do not know if:
>>> /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
>>> /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
>>> Need to be accurate when using the intel_pstate driver in passive mode.
>>> They are not.
>>> The commit comment for 9083e4986124389e2a7c0ffca95630a4983887f0
>>> suggests that they might need to be representative.
>>> I wonder if something similar to that commit is needed
>>> for other global changes, such as max_perf_pct and min_perf_pct?
>>
>> We are already calling intel_pstate_update_policies() in that case, so
>> it should be fine I believe.

I now believe that lack of synchronization between
/sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
and
/sys/devices/system/cpu/intel_pstate/max_perf_pct
is the root issue here.
The UINT_MAX next freq flag was also used to force a change
because the limits are then forced to be checked and enforced.

This e-mail continues with the assumption that this lack of
synchronization is O.K., because that is the way it is now.
However, if you want to have them synchronized then
Viresh's patch1 will work fine afterwards.

>>
>>> intel_cpufreq/ondemand doesn't work properly on the reverted kernel.
>>
>> reverted kernel ? The patch you reverted was only for schedutil and it
>> shouldn't have anything to do with ondemand.

Agreed. This is on hold until I have time to look into it.

>>> (just discovered, not investigated)
>>> I don't know about other governors.
>>
>> When you do:
>>
>> echo <any-low-percent> > /sys/devices/system/cpu/intel_pstate/max_perf_pct
>>
>> How soon does the print from sugov_fast_switch() gets printed ?
>> Immediately ? Check with both the kernels, with my patch and with the
>> reverted patch.

I don't really know how long.
So I added a message so I could get a time stamp:
I now do this:

echo "doug: Change max percent..." | sudo tee /dev/kmsg && echo 42 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct

and then I can calculate the time from the dmseg output, see below.

>>
>> Also see if there is any difference in the next_freq value in both the
>> kernels when you change max_perf_pct.

Not really, it doesn't seem to mean anything anyhow, because
schedutil doesn't know about the limits.

>>
>> FWIW, we now know the difference between intel-pstate and
>> acpi-cpufreq/my testcase and why we see differences here. In the cases
>> where my patch fixed the issue (acpi/ARM), we were really changing the
>> limits, i.e. policy->min/max. This happened because we touched
>> scaling_max_freq directly.
>>
>> For the case of intel-pstate, you are changing max_perf_pct which
>> doesn't change policy->max directly. I am not very sure how all of it
>> work really, but at least schedutil will not see policy->max changing.
>>
>> @Rafael: Do you understand why things don't work properly with
>> intel_cpufreq driver ?
>
> I haven't tried to understand this yet, so no.
>
> My somewhat educated guess is that using max_perf_pct has to do with
> it, so I would try to retest to see if there's any difference when
> scaling_max_freq is used instead of that.

Yes, that works, but isn't the accepted way here and then
/sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
and
/sys/devices/system/cpu/intel_pstate/max_perf_pct are still not
synchronized.

It was thermald misbehaving that brought me here in the first place
and it was definitely modifying
/sys/devices/system/cpu/intel_pstate/max_perf_pct
during throttling attempts for the intel_cpufreq
driver and schedutil governor.

Test data (all CPUs always busy):

Kernel: revert + extra debug print statements:
Driver/governor: intel_cpufreq/schedutil

doug@s15:~/temp$ uname -a
Linux s15 5.2.0-revertdebug #630 SMP PREEMPT Sat Jul 27 15:34:29 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux

Command:

$ echo "doug: Change max percent..." | sudo tee /dev/kmsg && echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct

Result:

[137552.507296] doug: Change max percent...
[137552.570049] cpufreq_schedutil: sugov_fast_switch: 127: 3800000  <<< 62.75 mSec
[137552.570051] cpufreq_schedutil: sugov_fast_switch: 127: 3800000
[137552.570053] cpufreq_schedutil: sugov_fast_switch: 127: 3800000
[137552.570054] cpufreq_schedutil: sugov_fast_switch: 127: 3800000
[137552.570055] cpufreq_schedutil: sugov_fast_switch: 127: 3800000
[137552.570056] cpufreq_schedutil: sugov_fast_switch: 127: 3800000
[137552.570057] cpufreq_schedutil: sugov_fast_switch: 127: 3800000
[137552.571050] cpufreq_schedutil: sugov_fast_switch: 127: 3800000  <<< 8, as expected

Note 1: 3.8 GHz is the max turbo frequency. 60% would be ~2.3 GHz
Note 2: this response behaviour is consistent.

Command:

$ echo "doug: Change max percent..." | sudo tee /dev/kmsg && echo 100 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct

Result:

[137722.788266] doug: Change max percent...
[137722.837871] cpufreq_schedutil: sugov_fast_switch: 127: 2860961  << ramp up stuff ?
[137722.837873] cpufreq_schedutil: sugov_fast_switch: 127: 2852539
[137722.837875] cpufreq_schedutil: sugov_fast_switch: 127: 2875000
[137722.837877] cpufreq_schedutil: sugov_fast_switch: 127: 2591430
[137722.837878] cpufreq_schedutil: sugov_fast_switch: 127: 2875000
[137722.837880] cpufreq_schedutil: sugov_fast_switch: 127: 2875000
[137722.837882] cpufreq_schedutil: sugov_fast_switch: 127: 2863769
[137722.837883] cpufreq_schedutil: sugov_fast_switch: 127: 2875000
[137722.838876] cpufreq_schedutil: sugov_fast_switch: 127: 3625000
[137722.838891] cpufreq_schedutil: sugov_fast_switch: 127: 3600219
[137722.838893] cpufreq_schedutil: sugov_fast_switch: 127: 2935791
[137722.838894] cpufreq_schedutil: sugov_fast_switch: 127: 3625000
[137722.838895] cpufreq_schedutil: sugov_fast_switch: 127: 3596679
[137722.838896] cpufreq_schedutil: sugov_fast_switch: 127: 3625000
[137722.838897] cpufreq_schedutil: sugov_fast_switch: 127: 3625000
[137722.839872] cpufreq_schedutil: sugov_fast_switch: 127: 3617919
[137722.839873] cpufreq_schedutil: sugov_fast_switch: 127: 3800000  << 1
[137722.839884] cpufreq_schedutil: sugov_fast_switch: 127: 3800000  << 2
[137722.839884] cpufreq_schedutil: sugov_fast_switch: 127: 3800000  << 3
[137722.839885] cpufreq_schedutil: sugov_fast_switch: 127: 3800000  << 4
[137722.839886] cpufreq_schedutil: sugov_fast_switch: 127: 3800000  << 5
[137722.839887] cpufreq_schedutil: sugov_fast_switch: 127: 3800000  << 6
[137722.839888] cpufreq_schedutil: sugov_fast_switch: 127: 3394775
[137722.840886] cpufreq_schedutil: sugov_fast_switch: 127: 3800000  << 7
[137722.840993] cpufreq_schedutil: sugov_fast_switch: 127: 3800000  << 8

Note 1: 3.8 GHz is the max turbo frequency.
Note 2: this response behaviour is consistent.

Kernel: Viresh "patch2" + extra debug print statements:
Driver/governor: intel_cpufreq/schedutil

doug@s15:~$ uname -a
Linux s15 5.2.0-patch2debug #629 SMP PREEMPT Sat Jul 27 09:35:24 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux
Command:

$ echo "doug: Change max percent..." | sudo tee /dev/kmsg && echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct

Result:
[  295.223071] doug: Change max percent...
[  295.279621] cpufreq_schedutil: sugov_fast_switch: 124: 3427978  <<< 56.55 mSec

Note 1: 3.8 GHz is the max turbo frequency. 60% would be ~2.3 GHz

Note 2: this response behaviour is NOT consistent:
In this example 1 CPU, of 8, was actually switched to the new limited upper frequency.
We do not observe this in the frequency data because the CPU frequency is held high
by the other CPUs vote into the PLL.
This can be verified by looking directly at the what pstate was asked for MSRs.
Example:

doug@s15:~$ sudo rdmsr --bitfield 15:8 -d -a 0x199
38
38
16  <<< This one changed, in this case I asked for 42%
38
38
38
38
38

If I run this command many times, making sure the starting conditions are always the same:
Most often no sugov_fast_switch message is printed at all, and no CPU's are limited; Sometimes
1 "sugov_fast_switch" message is printed and 1 CPU is limited; Rarely, 2 "sugov_fast_switch"
messages are printed and 2 CPUs are limited; I never saw 3 or more.

If I run this command many many times in a row (like 60 times, I didn't actually count),
WITHOUT resetting starting conditions, eventually all CPU's will end up in a limited state
and the CPU frequency goes down. This does not appear to be a timing race condition, but
rather a frequency calculation condition, which can be forced by submitting UNIT_MAX as
the next_freq. Since the schedutil policy limits do not reflect the new limit just set
via the global command, it might well calculate the same frequency as last time,
and mostly does. When we force UINT_MAX, we also force a new frequency to actually propagate
to the actual governor (I think).

Command:

$ echo "doug: Change max percent..." | sudo tee /dev/kmsg && echo 100 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct

Result (in this example, I think 1 CPU had been at a reduced state from before):

[  407.925707] doug: Change max percent...
[  407.982503] cpufreq_schedutil: sugov_fast_switch: 124: 2757080 <<< 56.796 mSec
[  407.983509] cpufreq_schedutil: sugov_fast_switch: 124: 3356445
[  407.984512] cpufreq_schedutil: sugov_fast_switch: 124: 3800000

Note 1: this response behaviour is NOT consistent:
If I start from the condition above, where after about 60 tries, the CPU
frequency is now actually low, this command doesn't clear all CPUs
restricted frequencies. However, keep in mind that only 1 CPU needs to 
have the restriction cleared for the overall frequency to rise to maximum
for all CPUs. The read MSR method is used to verify:

doug@s15:~$ sudo rdmsr --bitfield 15:8 -d -a 0x199
16
16
16
16
16
16
16
16
doug@s15:~$ echo "doug: Change max percent..." | sudo tee /dev/kmsg && echo 100 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
doug: Change max percent...
100
doug@s15:~$ sudo rdmsr --bitfield 15:8 -d -a 0x199
38  << 1
16
38  << 2
16
16
38  << 3
16
38  << 4

So, in this example, 4 CPUs changed and 4 didn't:

[  307.309284] doug: Change max percent...
[  307.363849] cpufreq_schedutil: sugov_fast_switch: 124: 1769531  << 54.565 mSec
[  307.363906] cpufreq_schedutil: sugov_fast_switch: 124: 1705078
[  307.364223] cpufreq_schedutil: sugov_fast_switch: 124: 1984375
[  307.364372] cpufreq_schedutil: sugov_fast_switch: 124: 1997314
[  307.365226] cpufreq_schedutil: sugov_fast_switch: 124: 1814941
[  307.365228] cpufreq_schedutil: sugov_fast_switch: 124: 1765625
[  307.365231] cpufreq_schedutil: sugov_fast_switch: 124: 2250976
[  307.365245] cpufreq_schedutil: sugov_fast_switch: 124: 2480468
[  307.366231] cpufreq_schedutil: sugov_fast_switch: 124: 3100585
[  307.366250] cpufreq_schedutil: sugov_fast_switch: 124: 2594238
[  307.366251] cpufreq_schedutil: sugov_fast_switch: 124: 1925048
[  307.367227] cpufreq_schedutil: sugov_fast_switch: 124: 1995117
[  307.367228] cpufreq_schedutil: sugov_fast_switch: 124: 2932617
[  307.367248] cpufreq_schedutil: sugov_fast_switch: 124: 3800000  << 1
[  307.367249] cpufreq_schedutil: sugov_fast_switch: 124: 2036132
[  307.368229] cpufreq_schedutil: sugov_fast_switch: 124: 3449707
[  307.368245] cpufreq_schedutil: sugov_fast_switch: 124: 2148193
[  307.369228] cpufreq_schedutil: sugov_fast_switch: 124: 2229003
[  307.369230] cpufreq_schedutil: sugov_fast_switch: 124: 3800000  << 2
[  307.369245] cpufreq_schedutil: sugov_fast_switch: 124: 2258544
[  307.370246] cpufreq_schedutil: sugov_fast_switch: 124: 2372436
[  307.371228] cpufreq_schedutil: sugov_fast_switch: 124: 2577392
[  307.371230] cpufreq_schedutil: sugov_fast_switch: 124: 2475585
[  307.372248] cpufreq_schedutil: sugov_fast_switch: 124: 2673339
[  307.374228] cpufreq_schedutil: sugov_fast_switch: 124: 2903686
[  307.374229] cpufreq_schedutil: sugov_fast_switch: 124: 2932617
[  307.375235] cpufreq_schedutil: sugov_fast_switch: 124: 3237304
[  307.375449] cpufreq_schedutil: sugov_fast_switch: 124: 3391113
[  307.376235] cpufreq_schedutil: sugov_fast_switch: 124: 3573120
[  307.377228] cpufreq_schedutil: sugov_fast_switch: 124: 3800000  << 3
[  307.377231] cpufreq_schedutil: sugov_fast_switch: 124: 3800000  << 4

It took 4 tries to clear all CPU's. I do not know why it takes
~60 tries to limit them all, but only a few to go the other way,
but it does seem much more probable that the calculated frequency
would be different because conditions are not pinned in this case.

O.K. 1 final test: O.K. so just add back this one line to
Viresh's patch2:

+               sg_policy->next_freq = UINT_MAX;

And everything works.
Why? Because now the frequency is different for certain.
O.K. so where is that decision made? How about here:
sugov_update_next_freq
Maybe that needs a flag check before returning false, since
the comparison is based on incorrect information because the
policies are do not reflect reality.

Going back to patch version 1:

With all this new knowledge, how about going back to
version 1 of this patch, and then adding this:

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 808d32b..f9156db 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
                                   unsigned int next_freq)
 {
-       if (sg_policy->next_freq == next_freq)
+       /*
+        * Always force an update if the flag is set, regardless.
+        * In some implementations (intel_cpufreq) the frequency is clamped
+        * further downstream, and might not actually be different here.
+        */
+       if (sg_policy->next_freq == next_freq && !sg_policy->need_freq_update)
                return false;

        sg_policy->next_freq = next_freq;
@@ -171,7 +176,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
        if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
                return sg_policy->next_freq;

-       sg_policy->need_freq_update = false;
        sg_policy->cached_raw_freq = freq;
        return cpufreq_driver_resolve_freq(policy, freq);
 }
@@ -478,6 +482,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
                sugov_deferred_update(sg_policy, time, next_f);
                raw_spin_unlock(&sg_policy->update_lock);
        }
+       sg_policy->need_freq_update = false;
 }

 static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)

I do not know if there are other spots that need a similar change.
This seems to work, so far, but I still need to test more.
Works for both intel_cpufreq and acpi-cpufreq, so far.
Please note that I do not know how to test the original issue that led
to the change away from UINT_MAX in the first place,
ecd2884291261e3fddbc7651ee11a20d596bb514, which should be tested in case
of some introduced regression.

Thanks to anybody that actually read this far.

... Doug



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-08-01  0:20                     ` Doug Smythies
@ 2019-08-01  6:17                       ` Viresh Kumar
  2019-08-01  7:47                         ` Rafael J. Wysocki
  2019-08-01 17:57                         ` Doug Smythies
  0 siblings, 2 replies; 27+ messages in thread
From: Viresh Kumar @ 2019-08-01  6:17 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra',
	'Linux PM', 'Vincent Guittot',
	'Joel Fernandes', 'v4 . 18+',
	'Linux Kernel Mailing List'

On 31-07-19, 17:20, Doug Smythies wrote:
> Hi Viresh,
> 
> Summary:
> 
> The old way, using UINT_MAX had two purposes: first,
> as a "need to do a frequency update" flag; but also second, to
> force any subsequent old/new frequency comparison to NOT be "the same,
> so why bother actually updating" (see: sugov_update_next_freq). All
> patches so far have been dealing with the flag, but only partially
> the comparisons. In a busy system, and when schedutil.c doesn't actually
> know the currently set system limits, the new frequency is dominated by
> values the same as the old frequency. So, when sugov_fast_switch calls 
> sugov_update_next_freq, false is usually returned.

And finally we know "Why" :)

Good work Doug. Thanks for taking it to the end.

> However, if we move the resetting of the flag and add another condition
> to the "no need to actually update" decision, then perhaps this patch
> version 1 will be O.K. It seems to be. (see way later in this e-mail).

> With all this new knowledge, how about going back to
> version 1 of this patch, and then adding this:
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 808d32b..f9156db 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
>                                    unsigned int next_freq)
>  {
> -       if (sg_policy->next_freq == next_freq)
> +       /*
> +        * Always force an update if the flag is set, regardless.
> +        * In some implementations (intel_cpufreq) the frequency is clamped
> +        * further downstream, and might not actually be different here.
> +        */
> +       if (sg_policy->next_freq == next_freq && !sg_policy->need_freq_update)
>                 return false;

This is not correct because this is an optimization we have in place
to make things more efficient. And it was working by luck earlier and
my patch broke it for good :)

Things need to get a bit more synchronized and something like this may
help (completely untested):

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index cc27d4c59dca..2d84361fbebc 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2314,6 +2314,18 @@ static int intel_cpufreq_target(struct cpufreq_policy *policy,
        return 0;
 }
 
+static unsigned int intel_cpufreq_resolve_freq(struct cpufreq_policy *policy,
+                                              unsigned int target_freq)
+{
+       struct cpudata *cpu = all_cpu_data[policy->cpu];
+       int target_pstate;
+
+       target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
+       target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+
+       return target_pstate * cpu->pstate.scaling;
+}
+
 static unsigned int intel_cpufreq_fast_switch(struct cpufreq_policy *policy,
                                              unsigned int target_freq)
 {
@@ -2350,6 +2362,7 @@ static struct cpufreq_driver intel_cpufreq = {
        .verify         = intel_cpufreq_verify_policy,
        .target         = intel_cpufreq_target,
        .fast_switch    = intel_cpufreq_fast_switch,
+       .resolve_freq   = intel_cpufreq_resolve_freq,
        .init           = intel_cpufreq_cpu_init,
        .exit           = intel_pstate_cpu_exit,
        .stop_cpu       = intel_cpufreq_stop_cpu,

-------------------------8<-------------------------

Please try this with my patch 2. We need patch 2 instead of 1 because
of another race condition Rafael noticed.

cpufreq_schedutil calls driver specific resolve_freq() to find the new
target frequency and this is where the limits should get applied IMO.

Rafael can help with reviewing this diff but it would be great if you
can give this a try Doug.

Thanks.

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-08-01  6:17                       ` Viresh Kumar
@ 2019-08-01  7:47                         ` Rafael J. Wysocki
  2019-08-01  7:55                           ` Viresh Kumar
  2019-08-01 17:57                         ` Doug Smythies
  1 sibling, 1 reply; 27+ messages in thread
From: Rafael J. Wysocki @ 2019-08-01  7:47 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Doug Smythies, Rafael J. Wysocki, Rafael Wysocki, Ingo Molnar,
	Peter Zijlstra, Linux PM, Vincent Guittot, Joel Fernandes,
	v4 . 18+,
	Linux Kernel Mailing List

On Thu, Aug 1, 2019 at 8:17 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 31-07-19, 17:20, Doug Smythies wrote:
> > Hi Viresh,
> >
> > Summary:
> >
> > The old way, using UINT_MAX had two purposes: first,
> > as a "need to do a frequency update" flag; but also second, to
> > force any subsequent old/new frequency comparison to NOT be "the same,
> > so why bother actually updating" (see: sugov_update_next_freq). All
> > patches so far have been dealing with the flag, but only partially
> > the comparisons. In a busy system, and when schedutil.c doesn't actually
> > know the currently set system limits, the new frequency is dominated by
> > values the same as the old frequency. So, when sugov_fast_switch calls
> > sugov_update_next_freq, false is usually returned.
>
> And finally we know "Why" :)
>
> Good work Doug. Thanks for taking it to the end.
>
> > However, if we move the resetting of the flag and add another condition
> > to the "no need to actually update" decision, then perhaps this patch
> > version 1 will be O.K. It seems to be. (see way later in this e-mail).
>
> > With all this new knowledge, how about going back to
> > version 1 of this patch, and then adding this:
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 808d32b..f9156db 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> >  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> >                                    unsigned int next_freq)
> >  {
> > -       if (sg_policy->next_freq == next_freq)
> > +       /*
> > +        * Always force an update if the flag is set, regardless.
> > +        * In some implementations (intel_cpufreq) the frequency is clamped
> > +        * further downstream, and might not actually be different here.
> > +        */
> > +       if (sg_policy->next_freq == next_freq && !sg_policy->need_freq_update)
> >                 return false;
>
> This is not correct because this is an optimization we have in place
> to make things more efficient. And it was working by luck earlier and
> my patch broke it for good :)

OK, so since we know why it was wrong now, why don't we just revert
it?  Plus maybe add some comment explaining the rationale in there?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-08-01  7:47                         ` Rafael J. Wysocki
@ 2019-08-01  7:55                           ` Viresh Kumar
  0 siblings, 0 replies; 27+ messages in thread
From: Viresh Kumar @ 2019-08-01  7:55 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Rafael Wysocki, Ingo Molnar, Peter Zijlstra,
	Linux PM, Vincent Guittot, Joel Fernandes, v4 . 18+,
	Linux Kernel Mailing List

On 01-08-19, 09:47, Rafael J. Wysocki wrote:
> On Thu, Aug 1, 2019 at 8:17 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 31-07-19, 17:20, Doug Smythies wrote:
> > > Hi Viresh,
> > >
> > > Summary:
> > >
> > > The old way, using UINT_MAX had two purposes: first,
> > > as a "need to do a frequency update" flag; but also second, to
> > > force any subsequent old/new frequency comparison to NOT be "the same,
> > > so why bother actually updating" (see: sugov_update_next_freq). All
> > > patches so far have been dealing with the flag, but only partially
> > > the comparisons. In a busy system, and when schedutil.c doesn't actually
> > > know the currently set system limits, the new frequency is dominated by
> > > values the same as the old frequency. So, when sugov_fast_switch calls
> > > sugov_update_next_freq, false is usually returned.
> >
> > And finally we know "Why" :)
> >
> > Good work Doug. Thanks for taking it to the end.
> >
> > > However, if we move the resetting of the flag and add another condition
> > > to the "no need to actually update" decision, then perhaps this patch
> > > version 1 will be O.K. It seems to be. (see way later in this e-mail).
> >
> > > With all this new knowledge, how about going back to
> > > version 1 of this patch, and then adding this:
> > >
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index 808d32b..f9156db 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > >  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> > >                                    unsigned int next_freq)
> > >  {
> > > -       if (sg_policy->next_freq == next_freq)
> > > +       /*
> > > +        * Always force an update if the flag is set, regardless.
> > > +        * In some implementations (intel_cpufreq) the frequency is clamped
> > > +        * further downstream, and might not actually be different here.
> > > +        */
> > > +       if (sg_policy->next_freq == next_freq && !sg_policy->need_freq_update)
> > >                 return false;
> >
> > This is not correct because this is an optimization we have in place
> > to make things more efficient. And it was working by luck earlier and
> > my patch broke it for good :)
> 
> OK, so since we know why it was wrong now, why don't we just revert
> it?  Plus maybe add some comment explaining the rationale in there?

Because the patch [1] which caused these issues was almost correct,
just that it missed the busy accounting for single CPU case.

The main idea behind the original patch [1] was to avoid any
unwanted/hidden side-affects by overriding the value of next_freq.
What we see above is exactly the case for that. Because we override
the value of next_freq, we made intel-pstate work by chance,
unintentionally. Which is wrong. And who knows what other side affects
it had, we already found two (this one and the one fixed by [1]).

I would strongly suggest that we don't override the value of next_freq
with special meaning, as it is used at so many places we don't know
what it may result in.

-- 
viresh

[1] ecd288429126 cpufreq: schedutil: Don't set next_freq to UINT_MAX

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-08-01  6:17                       ` Viresh Kumar
  2019-08-01  7:47                         ` Rafael J. Wysocki
@ 2019-08-01 17:57                         ` Doug Smythies
  2019-08-02  3:48                           ` Viresh Kumar
  1 sibling, 1 reply; 27+ messages in thread
From: Doug Smythies @ 2019-08-01 17:57 UTC (permalink / raw)
  To: 'Viresh Kumar'
  Cc: 'Rafael J. Wysocki', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra',
	'Linux PM', 'Vincent Guittot',
	'Joel Fernandes', 'v4 . 18+',
	'Linux Kernel Mailing List'

On 2019.07.31 23:17 Viresh Kumar wrote:
> On 31-07-19, 17:20, Doug Smythies wrote:
>> Summary:
>> 
>> The old way, using UINT_MAX had two purposes: first,
>> as a "need to do a frequency update" flag; but also second, to
>> force any subsequent old/new frequency comparison to NOT be "the same,
>> so why bother actually updating" (see: sugov_update_next_freq). All
>> patches so far have been dealing with the flag, but only partially
>> the comparisons. In a busy system, and when schedutil.c doesn't actually
>> know the currently set system limits, the new frequency is dominated by
>> values the same as the old frequency. So, when sugov_fast_switch calls 
>> sugov_update_next_freq, false is usually returned.
>
> And finally we know "Why" :)
>
> Good work Doug. Thanks for taking it to the end.
>
>> However, if we move the resetting of the flag and add another condition
>> to the "no need to actually update" decision, then perhaps this patch
>> version 1 will be O.K. It seems to be. (see way later in this e-mail).
>
>> With all this new knowledge, how about going back to
>> version 1 of this patch, and then adding this:
>> 
>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> index 808d32b..f9156db 100644
>> --- a/kernel/sched/cpufreq_schedutil.c
>> +++ b/kernel/sched/cpufreq_schedutil.c
>> @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>>  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
>>                                    unsigned int next_freq)
>>  {
>> -       if (sg_policy->next_freq == next_freq)
>> +       /*
>> +        * Always force an update if the flag is set, regardless.
>> +        * In some implementations (intel_cpufreq) the frequency is clamped
>> +        * further downstream, and might not actually be different here.
>> +        */
>> +       if (sg_policy->next_freq == next_freq && !sg_policy->need_freq_update)
>>                 return false;
>
> This is not correct because this is an optimization we have in place
> to make things more efficient. And it was working by luck earlier and
> my patch broke it for good :)

Disagree.
All I did was use a flag where it used to be set to UNIT_MAX, to basically
implement the same thing.

> Things need to get a bit more synchronized and something like this may
> help (completely untested):
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index cc27d4c59dca..2d84361fbebc 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -2314,6 +2314,18 @@ static int intel_cpufreq_target(struct cpufreq_policy *policy,
>        return 0;
> }
> 
> +static unsigned int intel_cpufreq_resolve_freq(struct cpufreq_policy *policy,
> +                                              unsigned int target_freq)
> +{
> +       struct cpudata *cpu = all_cpu_data[policy->cpu];
> +       int target_pstate;
> +
> +       target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
> +       target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> +
> +       return target_pstate * cpu->pstate.scaling;
> +}
> +
>  static unsigned int intel_cpufreq_fast_switch(struct cpufreq_policy *policy,
>                                               unsigned int target_freq)
>  {
> @@ -2350,6 +2362,7 @@ static struct cpufreq_driver intel_cpufreq = {
>         .verify         = intel_cpufreq_verify_policy,
>         .target         = intel_cpufreq_target,
>         .fast_switch    = intel_cpufreq_fast_switch,
> +       .resolve_freq   = intel_cpufreq_resolve_freq,
>         .init           = intel_cpufreq_cpu_init,
>         .exit           = intel_pstate_cpu_exit,
>         .stop_cpu       = intel_cpufreq_stop_cpu,
> 
> -------------------------8<-------------------------
>
> Please try this with my patch 2.

O.K.

> We need patch 2 instead of 1 because
> of another race condition Rafael noticed.

Disagree.
Notice that my modifications to your patch1 addresses
that condition by moving the clearing of "need_freq_update"
to sometime later.

> 
> cpufreq_schedutil calls driver specific resolve_freq() to find the new
> target frequency and this is where the limits should get applied IMO.

Oh! I didn't know. But yes, that makes sense.

>
> Rafael can help with reviewing this diff but it would be great if you
> can give this a try Doug.

Anyway, I added the above code (I am calling it patch3) to patch2, as
you asked, and it does work. I also added it to my modified patch1,
additionally removing the extra condition check that I added
(i.e. all that remains of my patch1 modifications is the moved
clearing of "need_freq_update") That kernel also worked for both
intel_cpufreq/schedutil and acpi-cpufreq/schedutil.

Again, I do not know how to test the original issue that led
to the change away from UINT_MAX in the first place,
ecd2884291261e3fddbc7651ee11a20d596bb514, which should be
tested in case of some introduced regression.

... Doug



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-08-01 17:57                         ` Doug Smythies
@ 2019-08-02  3:48                           ` Viresh Kumar
  2019-08-02  9:11                             ` Rafael J. Wysocki
  0 siblings, 1 reply; 27+ messages in thread
From: Viresh Kumar @ 2019-08-02  3:48 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Rafael J. Wysocki', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra',
	'Linux PM', 'Vincent Guittot',
	'Joel Fernandes', 'v4 . 18+',
	'Linux Kernel Mailing List'

On 01-08-19, 10:57, Doug Smythies wrote:
> On 2019.07.31 23:17 Viresh Kumar wrote:
> > On 31-07-19, 17:20, Doug Smythies wrote:
> >> Summary:
> >> 
> >> The old way, using UINT_MAX had two purposes: first,
> >> as a "need to do a frequency update" flag; but also second, to
> >> force any subsequent old/new frequency comparison to NOT be "the same,
> >> so why bother actually updating" (see: sugov_update_next_freq). All
> >> patches so far have been dealing with the flag, but only partially
> >> the comparisons. In a busy system, and when schedutil.c doesn't actually
> >> know the currently set system limits, the new frequency is dominated by
> >> values the same as the old frequency. So, when sugov_fast_switch calls 
> >> sugov_update_next_freq, false is usually returned.
> >
> > And finally we know "Why" :)
> >
> > Good work Doug. Thanks for taking it to the end.
> >
> >> However, if we move the resetting of the flag and add another condition
> >> to the "no need to actually update" decision, then perhaps this patch
> >> version 1 will be O.K. It seems to be. (see way later in this e-mail).
> >
> >> With all this new knowledge, how about going back to
> >> version 1 of this patch, and then adding this:
> >> 
> >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> >> index 808d32b..f9156db 100644
> >> --- a/kernel/sched/cpufreq_schedutil.c
> >> +++ b/kernel/sched/cpufreq_schedutil.c
> >> @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> >>  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> >>                                    unsigned int next_freq)
> >>  {
> >> -       if (sg_policy->next_freq == next_freq)
> >> +       /*
> >> +        * Always force an update if the flag is set, regardless.
> >> +        * In some implementations (intel_cpufreq) the frequency is clamped
> >> +        * further downstream, and might not actually be different here.
> >> +        */
> >> +       if (sg_policy->next_freq == next_freq && !sg_policy->need_freq_update)
> >>                 return false;
> >
> > This is not correct because this is an optimization we have in place
> > to make things more efficient. And it was working by luck earlier and
> > my patch broke it for good :)
> 
> Disagree.
> All I did was use a flag where it used to be set to UNIT_MAX, to basically
> implement the same thing.

And the earlier code wasn't fully correct as well, that's why we tried
to fix it earlier. So introducing the UINT_MAX thing again would be
wrong, even if it fixes the problem for you.

Also this won't fix the issue for rest of the governors but just
schedutil. Because this is a driver only problem and there is no point
trying to fix that in a governor.

> > Things need to get a bit more synchronized and something like this may
> > help (completely untested):
> >
> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > index cc27d4c59dca..2d84361fbebc 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -2314,6 +2314,18 @@ static int intel_cpufreq_target(struct cpufreq_policy *policy,
> >        return 0;
> > }
> > 
> > +static unsigned int intel_cpufreq_resolve_freq(struct cpufreq_policy *policy,
> > +                                              unsigned int target_freq)
> > +{
> > +       struct cpudata *cpu = all_cpu_data[policy->cpu];
> > +       int target_pstate;
> > +
> > +       target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
> > +       target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> > +
> > +       return target_pstate * cpu->pstate.scaling;
> > +}
> > +
> >  static unsigned int intel_cpufreq_fast_switch(struct cpufreq_policy *policy,
> >                                               unsigned int target_freq)
> >  {
> > @@ -2350,6 +2362,7 @@ static struct cpufreq_driver intel_cpufreq = {
> >         .verify         = intel_cpufreq_verify_policy,
> >         .target         = intel_cpufreq_target,
> >         .fast_switch    = intel_cpufreq_fast_switch,
> > +       .resolve_freq   = intel_cpufreq_resolve_freq,
> >         .init           = intel_cpufreq_cpu_init,
> >         .exit           = intel_pstate_cpu_exit,
> >         .stop_cpu       = intel_cpufreq_stop_cpu,
> > 
> > -------------------------8<-------------------------
> >
> > Please try this with my patch 2.
> 
> O.K.
> 
> > We need patch 2 instead of 1 because
> > of another race condition Rafael noticed.
> 
> Disagree.
> Notice that my modifications to your patch1 addresses
> that condition by moving the clearing of "need_freq_update"
> to sometime later.

As I said above, that isn't the right way of fixing a driver issue in
governor.
 
> > cpufreq_schedutil calls driver specific resolve_freq() to find the new
> > target frequency and this is where the limits should get applied IMO.
> 
> Oh! I didn't know. But yes, that makes sense.

The thing is that the governors, schedutil or others, need to get the
frequency limits from cpufreq core and the core tries to get it using
resolve_freq() in this particular case. If you don't have that
implemented, all the governors may end up having this issue.

> > Rafael can help with reviewing this diff but it would be great if you
> > can give this a try Doug.
> 
> Anyway, I added the above code (I am calling it patch3) to patch2, as
> you asked, and it does work. I also added it to my modified patch1,
> additionally removing the extra condition check that I added
> (i.e. all that remains of my patch1 modifications is the moved
> clearing of "need_freq_update") That kernel also worked for both
> intel_cpufreq/schedutil and acpi-cpufreq/schedutil.

Great, thanks.

> Again, I do not know how to test the original issue that led
> to the change away from UINT_MAX in the first place,
> ecd2884291261e3fddbc7651ee11a20d596bb514, which should be
> tested in case of some introduced regression.

The problem then was with overriding next_freq with UINT_MAX and since
we aren't going that path again, there is no need to test it again.

I will send the two patches to be applied now. Your tested-by would be
helpful to get them merged.

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-08-02  3:48                           ` Viresh Kumar
@ 2019-08-02  9:11                             ` Rafael J. Wysocki
  2019-08-02  9:19                               ` Rafael J. Wysocki
  2019-08-06  4:00                               ` Viresh Kumar
  0 siblings, 2 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2019-08-02  9:11 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Doug Smythies, 'Rafael J. Wysocki', 'Ingo Molnar',
	'Peter Zijlstra', 'Linux PM',
	'Vincent Guittot', 'Joel Fernandes',
	'v4 . 18+', 'Linux Kernel Mailing List'

On Friday, August 2, 2019 5:48:19 AM CEST Viresh Kumar wrote:
> On 01-08-19, 10:57, Doug Smythies wrote:
> > On 2019.07.31 23:17 Viresh Kumar wrote:
> > > On 31-07-19, 17:20, Doug Smythies wrote:
> > >> Summary:
> > >> 
> > >> The old way, using UINT_MAX had two purposes: first,
> > >> as a "need to do a frequency update" flag; but also second, to
> > >> force any subsequent old/new frequency comparison to NOT be "the same,
> > >> so why bother actually updating" (see: sugov_update_next_freq). All
> > >> patches so far have been dealing with the flag, but only partially
> > >> the comparisons. In a busy system, and when schedutil.c doesn't actually
> > >> know the currently set system limits, the new frequency is dominated by
> > >> values the same as the old frequency. So, when sugov_fast_switch calls 
> > >> sugov_update_next_freq, false is usually returned.
> > >
> > > And finally we know "Why" :)
> > >
> > > Good work Doug. Thanks for taking it to the end.
> > >
> > >> However, if we move the resetting of the flag and add another condition
> > >> to the "no need to actually update" decision, then perhaps this patch
> > >> version 1 will be O.K. It seems to be. (see way later in this e-mail).
> > >
> > >> With all this new knowledge, how about going back to
> > >> version 1 of this patch, and then adding this:
> > >> 
> > >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > >> index 808d32b..f9156db 100644
> > >> --- a/kernel/sched/cpufreq_schedutil.c
> > >> +++ b/kernel/sched/cpufreq_schedutil.c
> > >> @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > >>  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> > >>                                    unsigned int next_freq)
> > >>  {
> > >> -       if (sg_policy->next_freq == next_freq)
> > >> +       /*
> > >> +        * Always force an update if the flag is set, regardless.
> > >> +        * In some implementations (intel_cpufreq) the frequency is clamped
> > >> +        * further downstream, and might not actually be different here.
> > >> +        */
> > >> +       if (sg_policy->next_freq == next_freq && !sg_policy->need_freq_update)
> > >>                 return false;
> > >
> > > This is not correct because this is an optimization we have in place
> > > to make things more efficient. And it was working by luck earlier and
> > > my patch broke it for good :)
> > 
> > Disagree.
> > All I did was use a flag where it used to be set to UNIT_MAX, to basically
> > implement the same thing.
> 
> And the earlier code wasn't fully correct as well, that's why we tried
> to fix it earlier.

Your argument seems to be "There was an earlier problem related to this, which
was fixed, so it is fragile and I'd rather avoid it".  Still, you are claiming that the
code was in fact incorrect and you are not giving convincing arguments to
support that.

> So introducing the UINT_MAX thing again would be
> wrong, even if it fixes the problem for you.

Would it be wrong, because it would reintroduce the fragile code, or would it
be wrong, because it would re-introduce a bug?  What bug if so?

> Also this won't fix the issue for rest of the governors but just
> schedutil. Because this is a driver only problem and there is no point
> trying to fix that in a governor.

Well, I'm not convinced that this is a driver problem yet.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-08-02  9:11                             ` Rafael J. Wysocki
@ 2019-08-02  9:19                               ` Rafael J. Wysocki
  2019-08-06  4:00                               ` Viresh Kumar
  1 sibling, 0 replies; 27+ messages in thread
From: Rafael J. Wysocki @ 2019-08-02  9:19 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Doug Smythies, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
	Linux PM, Vincent Guittot, Joel Fernandes, v4 . 18+,
	Linux Kernel Mailing List

On Fri, Aug 2, 2019 at 11:11 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Friday, August 2, 2019 5:48:19 AM CEST Viresh Kumar wrote:
> > On 01-08-19, 10:57, Doug Smythies wrote:
> > > On 2019.07.31 23:17 Viresh Kumar wrote:
> > > > On 31-07-19, 17:20, Doug Smythies wrote:
> > > >> Summary:
> > > >>
> > > >> The old way, using UINT_MAX had two purposes: first,
> > > >> as a "need to do a frequency update" flag; but also second, to
> > > >> force any subsequent old/new frequency comparison to NOT be "the same,
> > > >> so why bother actually updating" (see: sugov_update_next_freq). All
> > > >> patches so far have been dealing with the flag, but only partially
> > > >> the comparisons. In a busy system, and when schedutil.c doesn't actually
> > > >> know the currently set system limits, the new frequency is dominated by
> > > >> values the same as the old frequency. So, when sugov_fast_switch calls
> > > >> sugov_update_next_freq, false is usually returned.
> > > >
> > > > And finally we know "Why" :)
> > > >
> > > > Good work Doug. Thanks for taking it to the end.
> > > >
> > > >> However, if we move the resetting of the flag and add another condition
> > > >> to the "no need to actually update" decision, then perhaps this patch
> > > >> version 1 will be O.K. It seems to be. (see way later in this e-mail).
> > > >
> > > >> With all this new knowledge, how about going back to
> > > >> version 1 of this patch, and then adding this:
> > > >>
> > > >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > >> index 808d32b..f9156db 100644
> > > >> --- a/kernel/sched/cpufreq_schedutil.c
> > > >> +++ b/kernel/sched/cpufreq_schedutil.c
> > > >> @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > > >>  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> > > >>                                    unsigned int next_freq)
> > > >>  {
> > > >> -       if (sg_policy->next_freq == next_freq)
> > > >> +       /*
> > > >> +        * Always force an update if the flag is set, regardless.
> > > >> +        * In some implementations (intel_cpufreq) the frequency is clamped
> > > >> +        * further downstream, and might not actually be different here.
> > > >> +        */
> > > >> +       if (sg_policy->next_freq == next_freq && !sg_policy->need_freq_update)
> > > >>                 return false;
> > > >
> > > > This is not correct because this is an optimization we have in place
> > > > to make things more efficient. And it was working by luck earlier and
> > > > my patch broke it for good :)
> > >
> > > Disagree.
> > > All I did was use a flag where it used to be set to UNIT_MAX, to basically
> > > implement the same thing.
> >
> > And the earlier code wasn't fully correct as well, that's why we tried
> > to fix it earlier.
>
> Your argument seems to be "There was an earlier problem related to this, which
> was fixed, so it is fragile and I'd rather avoid it".  Still, you are claiming that the
> code was in fact incorrect and you are not giving convincing arguments to
> support that.
>
> > So introducing the UINT_MAX thing again would be
> > wrong, even if it fixes the problem for you.
>
> Would it be wrong, because it would reintroduce the fragile code, or would it
> be wrong, because it would re-introduce a bug?  What bug if so?
>
> > Also this won't fix the issue for rest of the governors but just
> > schedutil. Because this is a driver only problem and there is no point
> > trying to fix that in a governor.
>
> Well, I'm not convinced that this is a driver problem yet.

I stand corrected, this is a driver problem, but IMO it needs to be
addressed differently.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-08-02  9:11                             ` Rafael J. Wysocki
  2019-08-02  9:19                               ` Rafael J. Wysocki
@ 2019-08-06  4:00                               ` Viresh Kumar
  1 sibling, 0 replies; 27+ messages in thread
From: Viresh Kumar @ 2019-08-06  4:00 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, 'Rafael J. Wysocki', 'Ingo Molnar',
	'Peter Zijlstra', 'Linux PM',
	'Vincent Guittot', 'Joel Fernandes',
	'v4 . 18+', 'Linux Kernel Mailing List'

On 02-08-19, 11:11, Rafael J. Wysocki wrote:
> On Friday, August 2, 2019 5:48:19 AM CEST Viresh Kumar wrote:
> > On 01-08-19, 10:57, Doug Smythies wrote:
> > > Disagree.
> > > All I did was use a flag where it used to be set to UNIT_MAX, to basically
> > > implement the same thing.
> > 
> > And the earlier code wasn't fully correct as well, that's why we tried
> > to fix it earlier.
> 
> Your argument seems to be "There was an earlier problem related to this, which
> was fixed, so it is fragile and I'd rather avoid it".  Still, you are claiming that the
> code was in fact incorrect and you are not giving convincing arguments to
> support that.
> 
> > So introducing the UINT_MAX thing again would be
> > wrong, even if it fixes the problem for you.
> 
> Would it be wrong, because it would reintroduce the fragile code, or would it
> be wrong, because it would re-introduce a bug?  What bug if so?

There will be two issues here if that patch is reintroduced:

- It will cause the BUG to reappear, which was fixed by the earlier
  commit. The commit log of ecd28842912 explains the bug in detail.

- And overriding next_freq as a flag will make the code fragile and we
  may have similar bugs coming up.

But yeah, lets continue discussion on the intel-pstate patch now.

-- 
viresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
  2019-07-31  2:58 Viresh Kumar
@ 2019-07-31 23:19 ` Doug Smythies
  0 siblings, 0 replies; 27+ messages in thread
From: Doug Smythies @ 2019-07-31 23:19 UTC (permalink / raw)
  To: 'Viresh Kumar', 'Rafael Wysocki',
	'Ingo Molnar', 'Peter Zijlstra'
  Cc: linux-pm, 'Vincent Guittot', 'v4 . 18+', linux-kernel

On 2019.07.31 Viresh Kumar wrote:
> To avoid reducing the frequency of a CPU prematurely, we skip reducing
> the frequency if the CPU had been busy recently.
>
> This should not be done when the limits of the policy are changed, for
> example due to thermal throttling. We should always get the frequency
> within the new limits as soon as possible.
> 
> Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
> Cc: v4.18+ <stable@vger.kernel.org> # v4.18+
> Reported-by: Doug Smythies <doug.smythies@gmail.com>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
> @Doug: Can you please provide your Tested-by for this commit, as it
> already fixed the issue around acpi-cpufreq driver.
>
> We will continue to see what's wrong with intel-pstate though.

Please give me a few more hours.
I'll reply to another thread with new information at that time.

My recommendation will be to scrap this "patch2" and go back
to "patch1" [1], with a couple of modifications. The logic
of patch1 is sound.
Teaser: it is working for intel_cpufreq/schedutil, but I
have yet to test acpi-cpufreq/schedutil.

[1] https://marc.info/?l=linux-pm&m=156377832225470&w=2

... Doug



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
@ 2019-07-31  2:58 Viresh Kumar
  2019-07-31 23:19 ` Doug Smythies
  0 siblings, 1 reply; 27+ messages in thread
From: Viresh Kumar @ 2019-07-31  2:58 UTC (permalink / raw)
  To: Rafael Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra
  Cc: linux-pm, Vincent Guittot, v4 . 18+, Doug Smythies, linux-kernel

To avoid reducing the frequency of a CPU prematurely, we skip reducing
the frequency if the CPU had been busy recently.

This should not be done when the limits of the policy are changed, for
example due to thermal throttling. We should always get the frequency
within the new limits as soon as possible.

Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
Cc: v4.18+ <stable@vger.kernel.org> # v4.18+
Reported-by: Doug Smythies <doug.smythies@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
@Doug: Can you please provide your Tested-by for this commit, as it
already fixed the issue around acpi-cpufreq driver.

We will continue to see what's wrong with intel-pstate though.

 kernel/sched/cpufreq_schedutil.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 636ca6f88c8e..2f382b0959e5 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -40,6 +40,7 @@ struct sugov_policy {
 	struct task_struct	*thread;
 	bool			work_in_progress;
 
+	bool			limits_changed;
 	bool			need_freq_update;
 };
 
@@ -89,8 +90,11 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 	    !cpufreq_this_cpu_can_update(sg_policy->policy))
 		return false;
 
-	if (unlikely(sg_policy->need_freq_update))
+	if (unlikely(sg_policy->limits_changed)) {
+		sg_policy->limits_changed = false;
+		sg_policy->need_freq_update = true;
 		return true;
+	}
 
 	delta_ns = time - sg_policy->last_freq_update_time;
 
@@ -437,7 +441,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
 static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu, struct sugov_policy *sg_policy)
 {
 	if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
-		sg_policy->need_freq_update = true;
+		sg_policy->limits_changed = true;
 }
 
 static void sugov_update_single(struct update_util_data *hook, u64 time,
@@ -447,7 +451,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	unsigned long util, max;
 	unsigned int next_f;
-	bool busy;
+	bool busy = false;
 
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
@@ -457,7 +461,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
-	busy = sugov_cpu_is_busy(sg_cpu);
+	/* Limits may have changed, don't skip frequency update */
+	if (!sg_policy->need_freq_update)
+		busy = sugov_cpu_is_busy(sg_cpu);
 
 	util = sugov_get_util(sg_cpu);
 	max = sg_cpu->max;
@@ -831,6 +837,7 @@ static int sugov_start(struct cpufreq_policy *policy)
 	sg_policy->last_freq_update_time	= 0;
 	sg_policy->next_freq			= 0;
 	sg_policy->work_in_progress		= false;
+	sg_policy->limits_changed		= false;
 	sg_policy->need_freq_update		= false;
 	sg_policy->cached_raw_freq		= 0;
 
@@ -879,7 +886,7 @@ static void sugov_limits(struct cpufreq_policy *policy)
 		mutex_unlock(&sg_policy->work_lock);
 	}
 
-	sg_policy->need_freq_update = true;
+	sg_policy->limits_changed = true;
 }
 
 struct cpufreq_governor schedutil_gov = {
-- 
2.21.0.rc0.269.g1a574e7a288b


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2019-08-06  4:01 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-18  6:26 [PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX" Doug Smythies
2019-07-18 10:28 ` Viresh Kumar
2019-07-18 15:46   ` Doug Smythies
2019-07-22  6:49     ` Viresh Kumar
2019-07-22  6:51 ` [PATCH] cpufreq: schedutil: Don't skip freq update when limits change Viresh Kumar
2019-07-23  7:10   ` Doug Smythies
2019-07-23  9:13     ` Rafael J. Wysocki
2019-07-23  9:15     ` Viresh Kumar
2019-07-23 10:27       ` Rafael J. Wysocki
2019-07-24 11:43         ` Viresh Kumar
2019-07-25 15:20           ` Doug Smythies
2019-07-26  3:26             ` Viresh Kumar
2019-07-26  6:57             ` Viresh Kumar
2019-07-29  7:55               ` Doug Smythies
2019-07-29  8:32                 ` Viresh Kumar
2019-07-29  8:37                   ` Rafael J. Wysocki
2019-08-01  0:20                     ` Doug Smythies
2019-08-01  6:17                       ` Viresh Kumar
2019-08-01  7:47                         ` Rafael J. Wysocki
2019-08-01  7:55                           ` Viresh Kumar
2019-08-01 17:57                         ` Doug Smythies
2019-08-02  3:48                           ` Viresh Kumar
2019-08-02  9:11                             ` Rafael J. Wysocki
2019-08-02  9:19                               ` Rafael J. Wysocki
2019-08-06  4:00                               ` Viresh Kumar
2019-07-31  2:58 Viresh Kumar
2019-07-31 23:19 ` Doug Smythies

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).