linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/1] AMD EPYC: fix schedutil perf regression (freq-invariance)
@ 2021-01-22 20:40 Giovanni Gherdovich
  2021-01-22 20:40 ` [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula Giovanni Gherdovich
  2021-01-24 22:30 ` [PATCH v2 0/1] AMD EPYC: fix schedutil perf regression (freq-invariance) Michael Larabel
  0 siblings, 2 replies; 26+ messages in thread
From: Giovanni Gherdovich @ 2021-01-22 20:40 UTC (permalink / raw)
  To: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki,
	Viresh Kumar
  Cc: Jon Grimm, Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Mel Gorman, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel, x86,
	linux-pm, linux-kernel, linux-acpi, Giovanni Gherdovich

Michael Larabel from Phoronix.com graciously tested v1, see results at:

AMD EPYC 7702 -
https://openbenchmarking.org/result/2101210-PTS-LINUX51178

AMD Ryzen 9 5950X - 
https://openbenchmarking.org/result/2101212-HA-RYZEN959566

The reported regression is recovered, and some workloads even report an
improvement over the v5.10 results.

Thanks Michael for the feedback!


v1 at https://lore.kernel.org/lkml/20210121003223.20257-1-ggherdovich@suse.cz/

Changes wrt v1:

- move code around so that it builds for non-x86 architectures too

Giovanni Gherdovich (1):
  x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant
    formula

 drivers/cpufreq/acpi-cpufreq.c   | 64 +++++++++++++++++++++++++++++++-
 drivers/cpufreq/cpufreq.c        |  3 ++
 include/linux/cpufreq.h          |  5 +++
 kernel/sched/cpufreq_schedutil.c |  8 +++-
 4 files changed, 76 insertions(+), 4 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-22 20:40 [PATCH v2 0/1] AMD EPYC: fix schedutil perf regression (freq-invariance) Giovanni Gherdovich
@ 2021-01-22 20:40 ` Giovanni Gherdovich
  2021-01-25 10:04   ` Peter Zijlstra
                     ` (3 more replies)
  2021-01-24 22:30 ` [PATCH v2 0/1] AMD EPYC: fix schedutil perf regression (freq-invariance) Michael Larabel
  1 sibling, 4 replies; 26+ messages in thread
From: Giovanni Gherdovich @ 2021-01-22 20:40 UTC (permalink / raw)
  To: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki,
	Viresh Kumar
  Cc: Jon Grimm, Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Mel Gorman, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel, x86,
	linux-pm, linux-kernel, linux-acpi, Giovanni Gherdovich

Phoronix.com discovered a severe performance regression on AMD EPYC
introduced on schedutil [see link 1] by the following commits from v5.11-rc1

    commit 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
    commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC")

Furthermore commit db865272d9c4 ("cpufreq: Avoid configuring old governors as
default with intel_pstate") from v5.10 made it extremely easy to default to
schedutil even if the preferred driver is acpi_cpufreq. Distros are likely to
build both intel_pstate and acpi_cpufreq on x86, and the presence of the
former removes ondemand from the defaults. This situation amplifies the
visibility of the bug we're addressing.

[link 1] https://www.phoronix.com/scan.php?page=article&item=linux511-amd-schedutil&num=1

1. PROBLEM DESCRIPTION   : over-utilization and schedutil
2. PROPOSED SOLUTION     : raise freq_max in schedutil formula
3. DATA TABLE            : image processing benchmark
4. ANALYSIS AND COMMENTS : with over-utilization, freq-invariance is lost

1. PROBLEM DESCRIPTION (over-utilization and schedutil)

The problem happens on CPU-bound workloads spanning a large number of cores.
In this case schedutil won't select the maximum P-State. Actually, it's
likely that it will select the minimum one.

A CPU-bound workload puts the machine in a state generally called
"over-utilization": an increase in CPU speed doesn't result in an increase of
capacity. The fraction of time tasks spend on CPU becomes constant regardless
of clock frequency (the tasks eat whatever we throw at them), and the PELT
invariant util goes up and down with the frequency (i.e. it's not invariant
anymore).

2. PROPOSED SOLUTION (raise freq_max in schedutil formula)

The solution we implement here is a stop-gap one: when the driver is
acpi_cpufreq and the machine an AMD EPYC, schedutil will use max_boost instead
of max_P as the value for freq_max in its formula

    freq_next = 1.25 * freq_max * util

essentially giving freq_next some more headroom to grow in the over-utilized
case. This is the approach also followed by intel_pstate in passive mode.

The correct way to attack this problem would be to have schedutil detect
over-utilization and select freq_max irrespective of the util value, which has
no meaning at that point. This approach is too risky for an -rc5 submission so
we defer it to the next cycle.

3. DATA TABLE (image processing benchmark)

What follows is a more detailed account of the effects on a specific test.

TEST        : Intel Open Image Denoise, www.openimagedenoise.org
INVOCATION  : ./denoise -hdr memorial.pfm -out out.pfm -bench 200 -threads $NTHREADS
CPU         : MODEL            : 2x AMD EPYC 7742
              FREQUENCY TABLE  : P2: 1.50 GHz
                                 P1: 2.00 GHz
				 P0: 2.25 GHz
              MAX BOOST        :     3.40 GHz

Results: threads, msecs (ratio). Lower is better.

               v5.10          v5.11-rc4    v5.11-rc4-patch
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      1   1069.85 (1.00)   1071.84 (1.00)   1070.42 (1.00)
      2    542.24 (1.00)    544.40 (1.00)    544.48 (1.00)
      4    278.00 (1.00)    278.44 (1.00)    277.72 (1.00)
      8    149.81 (1.00)    149.61 (1.00)    149.87 (1.00)
     16     79.01 (1.00)     79.31 (1.00)     78.94 (1.00)
     24     58.01 (1.00)     58.51 (1.01)     58.15 (1.00)
     32     46.58 (1.00)     48.30 (1.04)     46.66 (1.00)
     48     37.29 (1.00)     51.29 (1.38)     37.27 (1.00)
     64     34.01 (1.00)     49.59 (1.46)     33.71 (0.99)
     80     31.09 (1.00)     44.27 (1.42)     31.33 (1.01)
     96     28.56 (1.00)     40.82 (1.43)     28.47 (1.00)
    112     28.09 (1.00)     40.06 (1.43)     28.63 (1.02)
    120     28.73 (1.00)     39.78 (1.38)     28.14 (0.98)
    128     28.93 (1.00)     39.60 (1.37)     29.38 (1.02)

See how the 128 threads case is almost 40% worse than baseline in v5.11-rc4.

4. ANALYSIS AND COMMENTS (with over-utilization freq-invariance is lost)

Statistics for NTHREADS=128 (number of physical cores of the machine)

                                      v5.10          v5.11-rc4
                                      ~~~~~~~~~~~~~~~~~~~~~~~~
CPU activity (mpstat)                 80-90%         80-90%
schedutil requests (tracepoint)       always P0      mostly P2
CPU frequency (HW feedback)           ~2.2 GHz       ~1.5 GHz
PELT root rq util (tracepoint)        ~825           ~450

mpstat shows that the workload is CPU-bound and usage doesn't change with
clock speed. What is striking is that the PELT util of any root runqueue in
v5.11-rc4 is half of what used to be before the frequency invariant support
(v5.10), leading to wrong frequency choices. How did we get there?

This workload is constant in time, so instead of using the PELT sum we can
pretend that scale invariance is obtained with

    util_inv = util_raw * freq_curr / freq_max1        [formula-1]

where util_raw is the PELT util from v5.10 (which is to say, not invariant),
and util_inv is the PELT util from v5.11-rc4. freq_max1 comes from
commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for
frequency invariance on AMD EPYC") and is (P0+max_boost)/2 = (2.25+3.4)/2 =
2.825 GHz.  Then we have the schedutil formula

    freq_next = 1.25 * freq_max2 * util_inv            [formula-2]

Here v5.11-rc4 uses freq_max2 = P0 = 2.25 GHz (and this patch changes it to
3.4 GHz).

Since all cores are busy, there is no boost available. Let's be generous and say
the tasks initially get P0, i.e. freq_curr = 2.25 GHz. Combining the formulas
above and taking util_raw = 825/1024 = 0.8, freq_next is:

    freq_next = 1.25 * 2.25 * 0.8 * 2.25 / 2.825 = 1.79 GHz

After quantization (pick the next frequency up in the table), freq_next is
P1 = 2.0 GHz. See how we lost 250 MHz in the process. Iterate once more,
freq_next become 1.59 GHz. Since it's > P2, it's saved by quantization and P1
is selected, but if util_raw fluctuates a little and goes below 0.75, P0 is
selected and that kills util_inv by formula-1, which gives util_inv = 0.4.

The culprit of the problem is that with over-utilization, util_raw and
freq_curr in formula-1 are independent. In the nominal case, if freq_curr goes
up then util_raw goes down and viceversa. Here util_raw doesn't care and stays
constant. If freq_curr descrease, util_inv decreases too and so forth (it's a
feedback loop).

Fixes: 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
Fixes: 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC")
Reported-by: Michael Larabel <Michael@phoronix.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 drivers/cpufreq/acpi-cpufreq.c   | 64 +++++++++++++++++++++++++++++++-
 drivers/cpufreq/cpufreq.c        |  3 ++
 include/linux/cpufreq.h          |  5 +++
 kernel/sched/cpufreq_schedutil.c |  8 +++-
 4 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index 1e4fbb002a31..2378bc1bf2c4 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -27,6 +27,10 @@
 
 #include <acpi/processor.h>
 
+#ifdef CONFIG_ACPI_CPPC_LIB
+#include <acpi/cppc_acpi.h>
+#endif
+
 #include <asm/msr.h>
 #include <asm/processor.h>
 #include <asm/cpufeature.h>
@@ -628,11 +632,57 @@ static int acpi_cpufreq_blacklist(struct cpuinfo_x86 *c)
 }
 #endif
 
+#ifdef CONFIG_ACPI_CPPC_LIB
+static bool amd_max_boost(unsigned int max_freq,
+			  unsigned int *max_boost)
+{
+	struct cppc_perf_caps perf_caps;
+	u64 highest_perf, nominal_perf, perf_ratio;
+	int ret;
+
+	ret = cppc_get_perf_caps(0, &perf_caps);
+	if (ret) {
+		pr_debug("Could not retrieve perf counters (%d)\n", ret);
+		return false;
+	}
+
+	highest_perf = perf_caps.highest_perf;
+	nominal_perf = perf_caps.nominal_perf;
+
+	if (!highest_perf || !nominal_perf) {
+		pr_debug("Could not retrieve highest or nominal performance\n");
+		return false;
+	}
+
+	perf_ratio = div_u64(highest_perf * SCHED_CAPACITY_SCALE, nominal_perf);
+	if (perf_ratio <= SCHED_CAPACITY_SCALE) {
+		pr_debug("Either perf_ratio is 0, or nominal >= highest performance\n");
+		return false;
+	}
+
+	*max_boost = max_freq * perf_ratio >> SCHED_CAPACITY_SHIFT;
+	if (!*max_boost) {
+		pr_debug("max_boost seems to be zero\n");
+		return false;
+	}
+
+	return true;
+}
+#else
+static bool amd_max_boost(unsigned int max_freq,
+			  unsigned int *max_boost)
+{
+	return false;
+}
+#endif
+
 static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
 {
 	unsigned int i;
 	unsigned int valid_states = 0;
 	unsigned int cpu = policy->cpu;
+	unsigned int freq, max_freq = 0;
+	unsigned int max_boost;
 	struct acpi_cpufreq_data *data;
 	unsigned int result = 0;
 	struct cpuinfo_x86 *c = &cpu_data(policy->cpu);
@@ -779,15 +829,25 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
 		    freq_table[valid_states-1].frequency / 1000)
 			continue;
 
+		freq = perf->states[i].core_frequency * 1000;
 		freq_table[valid_states].driver_data = i;
-		freq_table[valid_states].frequency =
-		    perf->states[i].core_frequency * 1000;
+		freq_table[valid_states].frequency = freq;
+
+		if (freq > max_freq)
+			max_freq = freq;
+
 		valid_states++;
 	}
 	freq_table[valid_states].frequency = CPUFREQ_TABLE_END;
 	policy->freq_table = freq_table;
 	perf->state = 0;
 
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
+	    amd_max_boost(max_freq, &max_boost)) {
+		policy->cpuinfo.max_boost = max_boost;
+		static_branch_enable(&cpufreq_amd_max_boost);
+	}
+
 	switch (perf->control_register.space_id) {
 	case ACPI_ADR_SPACE_SYSTEM_IO:
 		/*
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index d0a3525ce27f..b96677f6b57e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2721,6 +2721,9 @@ int cpufreq_boost_enabled(void)
 }
 EXPORT_SYMBOL_GPL(cpufreq_boost_enabled);
 
+DEFINE_STATIC_KEY_FALSE(cpufreq_amd_max_boost);
+EXPORT_SYMBOL_GPL(cpufreq_amd_max_boost);
+
 /*********************************************************************
  *               REGISTER / UNREGISTER CPUFREQ DRIVER                *
  *********************************************************************/
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 9c8b7437b6cd..341cac76d254 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -40,9 +40,14 @@ enum cpufreq_table_sorting {
 	CPUFREQ_TABLE_SORTED_DESCENDING
 };
 
+DECLARE_STATIC_KEY_FALSE(cpufreq_amd_max_boost);
+
+#define cpufreq_driver_has_max_boost() static_branch_unlikely(&cpufreq_amd_max_boost)
+
 struct cpufreq_cpuinfo {
 	unsigned int		max_freq;
 	unsigned int		min_freq;
+	unsigned int		max_boost;
 
 	/* in 10^(-9) s = nanoseconds */
 	unsigned int		transition_latency;
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 6931f0cdeb80..541f3db3f576 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -159,8 +159,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 				  unsigned long util, unsigned long max)
 {
 	struct cpufreq_policy *policy = sg_policy->policy;
-	unsigned int freq = arch_scale_freq_invariant() ?
-				policy->cpuinfo.max_freq : policy->cur;
+	unsigned int freq, max_freq;
+
+	max_freq = cpufreq_driver_has_max_boost() ?
+			policy->cpuinfo.max_boost : policy->cpuinfo.max_freq;
+
+	freq = arch_scale_freq_invariant() ? max_freq : policy->cur;
 
 	freq = map_util_freq(util, freq, max);
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/1] AMD EPYC: fix schedutil perf regression (freq-invariance)
  2021-01-22 20:40 [PATCH v2 0/1] AMD EPYC: fix schedutil perf regression (freq-invariance) Giovanni Gherdovich
  2021-01-22 20:40 ` [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula Giovanni Gherdovich
@ 2021-01-24 22:30 ` Michael Larabel
  2021-01-25  8:34   ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Michael Larabel @ 2021-01-24 22:30 UTC (permalink / raw)
  To: Giovanni Gherdovich, Borislav Petkov, Ingo Molnar,
	Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar
  Cc: Jon Grimm, Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Mel Gorman, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, x86, linux-pm, linux-kernel,
	linux-acpi

 From ongoing tests of this patch, it still certainly shows to address 
most of the Linux 5.11 performance regression previously encountered 
when using Schedutil. Additionally, for a number of workloads where not 
seeing a regression from 5.10 to 5.11 Git is still showing even better 
performance with this patch. The power monitoring on the AMD EPYC server 
is showing higher power spikes but the average power consumption rate is 
roughly comparable to that of Linux 5.11 Git, which is higher than 5.10 
by just about 3%.

So this patch still seems to be working out well and indeed taking care 
of some wide losses seen otherwise on Linux 5.11 when using Schedutil on 
AMD Zen2/Zen3. Still have some other tests running but so far no 
unexpected results.

Michael


AMD EPYC 7F72 2P

On an EPYC 7F72 2P server[1] across 147 test cases I am finding the 
patched Linux 5.11 kernel to be just over 1% faster than 5.10 stable 
compared to the unpatched 5.11 Git being just behind 5.10. For the 
workloads on that server where Linux 5.11 is slower with Schedutil, the 
patch indeed is largely addressing that regression and also providing 
other improvements.

During that testing, the amd_energy interface was monitored. Linux 5.11 
with Schedutil AMD freq invariance did show on average 10 Watts (~3.7%) 
higher power consumption on average than Linux 5.10 with Schedutil. But 
with this patch, that average is still basically the same. The peak 
power consumption during any of the tests was higher at 530~549 Watts 
compared to 501 Watts with Linux 5.10. Overall the performance is 
looking good but given amd_energy still not working for consumer models, 
I don't have much power data to share at the moment.

Ryzen 9 5950X

Expanding on the prior testing with the 5950X, I ran some follow-up 
tests[2]. Of 221 test cases there, the current Linux 5.11 Git 
performance came around 2% slower on a geo mean basis than Linux 5.10 
while the patched performance pulls it to ~2.5% faster than 5.10. There 
still are some cases where Linux 5.10 is faster, but overall at least 
the patched Linux 5.11 performance doesn't show nearly as many 
regressions. In a number of test cases, the Linux 5.11 patched 
performance is outright better than Linux 5.10 even where 5.11 
(un-patched) hadn't regressed or by that much.

Ryzen 5 4500U

For something at the lower end of the spectrum I also ran a number of 
tests on a Ryzen 5 4500U notebook[3]. Linux 5.11 (unpatched) doesn't see 
as many regressions as on the larger systems but still the patched 
performance did help in a number of tests, particularly around 
graphics/gaming. In the heavier multi-core core tests are still a number 
of cases where Linux 5.10 is faster than patched/unpatched 5.11. The 
patched kernel in those 90 tests came out to being about 4% faster than 
5.10.

(Result highlights below, results with little change set to hidden by 
default.)
[1] 
https://openbenchmarking.org/result/2101248-HA-AMDEPYC7F52&grs&hlc=1&hnr=1&hlc=1
[2] https://openbenchmarking.org/result/2101242-HA-RYZEN959530&grs&hlc=1
[3] 
https://openbenchmarking.org/result/2101232-PTS-RENOIRLI89&grs&hnr=1&hlc=1


On 1/22/21 2:40 PM, Giovanni Gherdovich wrote:
> Michael Larabel from Phoronix.com graciously tested v1, see results at:
>
> AMD EPYC 7702 -
> https://openbenchmarking.org/result/2101210-PTS-LINUX51178
>
> AMD Ryzen 9 5950X -
> https://openbenchmarking.org/result/2101212-HA-RYZEN959566
>
> The reported regression is recovered, and some workloads even report an
> improvement over the v5.10 results.
>
> Thanks Michael for the feedback!
>
>
> v1 at https://lore.kernel.org/lkml/20210121003223.20257-1-ggherdovich@suse.cz/
>
> Changes wrt v1:
>
> - move code around so that it builds for non-x86 architectures too
>
> Giovanni Gherdovich (1):
>    x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant
>      formula
>
>   drivers/cpufreq/acpi-cpufreq.c   | 64 +++++++++++++++++++++++++++++++-
>   drivers/cpufreq/cpufreq.c        |  3 ++
>   include/linux/cpufreq.h          |  5 +++
>   kernel/sched/cpufreq_schedutil.c |  8 +++-
>   4 files changed, 76 insertions(+), 4 deletions(-)
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/1] AMD EPYC: fix schedutil perf regression (freq-invariance)
  2021-01-24 22:30 ` [PATCH v2 0/1] AMD EPYC: fix schedutil perf regression (freq-invariance) Michael Larabel
@ 2021-01-25  8:34   ` Peter Zijlstra
  2021-01-26  9:01     ` Giovanni Gherdovich
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-25  8:34 UTC (permalink / raw)
  To: Michael Larabel
  Cc: Giovanni Gherdovich, Borislav Petkov, Ingo Molnar,
	Rafael J . Wysocki, Viresh Kumar, Jon Grimm, Nathan Fontenot,
	Yazen Ghannam, Thomas Lendacky, Suthikulpanit Suravee,
	Mel Gorman, Pu Wen, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, x86, linux-pm, linux-kernel, linux-acpi

On Sun, Jan 24, 2021 at 04:30:57PM -0600, Michael Larabel wrote:
> From ongoing tests of this patch, it still certainly shows to address most
> of the Linux 5.11 performance regression previously encountered when using
> Schedutil. Additionally, for a number of workloads where not seeing a
> regression from 5.10 to 5.11 Git is still showing even better performance
> with this patch. The power monitoring on the AMD EPYC server is showing
> higher power spikes but the average power consumption rate is roughly
> comparable to that of Linux 5.11 Git, which is higher than 5.10 by just
> about 3%.
> 
> So this patch still seems to be working out well and indeed taking care of
> some wide losses seen otherwise on Linux 5.11 when using Schedutil on AMD
> Zen2/Zen3. Still have some other tests running but so far no unexpected
> results.
> 

Did you do all this writing and forget to add:

Tested-by: Michael Larabel <Michael@phoronix.com>

?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-22 20:40 ` [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula Giovanni Gherdovich
@ 2021-01-25 10:04   ` Peter Zijlstra
  2021-01-26  9:28     ` Giovanni Gherdovich
  2021-02-02 18:40     ` Rafael J. Wysocki
  2021-01-25 10:06   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-25 10:04 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Borislav Petkov, Ingo Molnar, Rafael J . Wysocki, Viresh Kumar,
	Jon Grimm, Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Mel Gorman, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel, x86,
	linux-pm, linux-kernel, linux-acpi

On Fri, Jan 22, 2021 at 09:40:38PM +0100, Giovanni Gherdovich wrote:
> This workload is constant in time, so instead of using the PELT sum we can
> pretend that scale invariance is obtained with
> 
>     util_inv = util_raw * freq_curr / freq_max1        [formula-1]
> 
> where util_raw is the PELT util from v5.10 (which is to say, not invariant),
> and util_inv is the PELT util from v5.11-rc4. freq_max1 comes from
> commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for
> frequency invariance on AMD EPYC") and is (P0+max_boost)/2 = (2.25+3.4)/2 =
> 2.825 GHz.  Then we have the schedutil formula
> 
>     freq_next = 1.25 * freq_max2 * util_inv            [formula-2]
> 
> Here v5.11-rc4 uses freq_max2 = P0 = 2.25 GHz (and this patch changes it to
> 3.4 GHz).
> 
> Since all cores are busy, there is no boost available. Let's be generous and say
> the tasks initially get P0, i.e. freq_curr = 2.25 GHz. Combining the formulas
> above and taking util_raw = 825/1024 = 0.8, freq_next is:
> 
>     freq_next = 1.25 * 2.25 * 0.8 * 2.25 / 2.825 = 1.79 GHz

Right, so here's a 'problem' between schedutil and cpufreq, they don't
use the same f_max at all times.

And this is also an inconsistency between acpi_cpufreq and intel_pstate
(passive). IIRC the intel_pstate cpufreq drivers uses 4C/1C/P0 resp,
while ACPI seems to stick to P0 f_max.

Rafael; should ACPI change that behaviour rather than adding yet another
magic variable?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-22 20:40 ` [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula Giovanni Gherdovich
  2021-01-25 10:04   ` Peter Zijlstra
@ 2021-01-25 10:06   ` Peter Zijlstra
  2021-01-26  9:09     ` Giovanni Gherdovich
  2021-02-02 18:59   ` Rafael J. Wysocki
  2021-02-03  6:04   ` Viresh Kumar
  3 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-25 10:06 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Borislav Petkov, Ingo Molnar, Rafael J . Wysocki, Viresh Kumar,
	Jon Grimm, Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Mel Gorman, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel, x86,
	linux-pm, linux-kernel, linux-acpi

On Fri, Jan 22, 2021 at 09:40:38PM +0100, Giovanni Gherdovich wrote:
> 1. PROBLEM DESCRIPTION (over-utilization and schedutil)
> 
> The problem happens on CPU-bound workloads spanning a large number of cores.
> In this case schedutil won't select the maximum P-State. Actually, it's
> likely that it will select the minimum one.
> 
> A CPU-bound workload puts the machine in a state generally called
> "over-utilization": an increase in CPU speed doesn't result in an increase of
> capacity. The fraction of time tasks spend on CPU becomes constant regardless
> of clock frequency (the tasks eat whatever we throw at them), and the PELT
> invariant util goes up and down with the frequency (i.e. it's not invariant
> anymore).

>                                       v5.10          v5.11-rc4
>                                       ~~~~~~~~~~~~~~~~~~~~~~~~
> CPU activity (mpstat)                 80-90%         80-90%
> schedutil requests (tracepoint)       always P0      mostly P2
> CPU frequency (HW feedback)           ~2.2 GHz       ~1.5 GHz
> PELT root rq util (tracepoint)        ~825           ~450
> 
> mpstat shows that the workload is CPU-bound and usage doesn't change with

So I'm having trouble with calling a 80%-90% workload CPU bound, because
clearly there's a ton of idle time.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/1] AMD EPYC: fix schedutil perf regression (freq-invariance)
  2021-01-25  8:34   ` Peter Zijlstra
@ 2021-01-26  9:01     ` Giovanni Gherdovich
  0 siblings, 0 replies; 26+ messages in thread
From: Giovanni Gherdovich @ 2021-01-26  9:01 UTC (permalink / raw)
  To: Peter Zijlstra, Michael Larabel
  Cc: Borislav Petkov, Ingo Molnar, Rafael J . Wysocki, Viresh Kumar,
	Jon Grimm, Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Mel Gorman, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, x86, linux-pm, linux-kernel,
	linux-acpi

On Mon, 2021-01-25 at 09:34 +0100, Peter Zijlstra wrote:
> On Sun, Jan 24, 2021 at 04:30:57PM -0600, Michael Larabel wrote:
> > From ongoing tests of this patch, it still certainly shows to address most
> > of the Linux 5.11 performance regression previously encountered when using
> > Schedutil. Additionally, for a number of workloads where not seeing a
> > regression from 5.10 to 5.11 Git is still showing even better performance
> > with this patch. The power monitoring on the AMD EPYC server is showing
> > higher power spikes but the average power consumption rate is roughly
> > comparable to that of Linux 5.11 Git, which is higher than 5.10 by just
> > about 3%.
> > 
> > So this patch still seems to be working out well and indeed taking care of
> > some wide losses seen otherwise on Linux 5.11 when using Schedutil on AMD
> > Zen2/Zen3. Still have some other tests running but so far no unexpected
> > results.
> > 
> 
> Did you do all this writing and forget to add:
> 
> Tested-by: Michael Larabel <Michael@phoronix.com>
> 
> ?

Michael confirmed me off-list that yes, the patch should carry the
"Tested-by" tag with his name.


Giovanni

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-25 10:06   ` Peter Zijlstra
@ 2021-01-26  9:09     ` Giovanni Gherdovich
  2021-01-26  9:31       ` Mel Gorman
  0 siblings, 1 reply; 26+ messages in thread
From: Giovanni Gherdovich @ 2021-01-26  9:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Ingo Molnar, Rafael J . Wysocki, Viresh Kumar,
	Jon Grimm, Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Mel Gorman, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel, x86,
	linux-pm, linux-kernel, linux-acpi

On Mon, 2021-01-25 at 11:06 +0100, Peter Zijlstra wrote:
> On Fri, Jan 22, 2021 at 09:40:38PM +0100, Giovanni Gherdovich wrote:
> > 1. PROBLEM DESCRIPTION (over-utilization and schedutil)
> > 
> > The problem happens on CPU-bound workloads spanning a large number of cores.
> > In this case schedutil won't select the maximum P-State. Actually, it's
> > likely that it will select the minimum one.
> > 
> > A CPU-bound workload puts the machine in a state generally called
> > "over-utilization": an increase in CPU speed doesn't result in an increase of
> > capacity. The fraction of time tasks spend on CPU becomes constant regardless
> > of clock frequency (the tasks eat whatever we throw at them), and the PELT
> > invariant util goes up and down with the frequency (i.e. it's not invariant
> > anymore).
> >                                       v5.10          v5.11-rc4
> >                                       ~~~~~~~~~~~~~~~~~~~~~~~~
> > CPU activity (mpstat)                 80-90%         80-90%
> > schedutil requests (tracepoint)       always P0      mostly P2
> > CPU frequency (HW feedback)           ~2.2 GHz       ~1.5 GHz
> > PELT root rq util (tracepoint)        ~825           ~450
> > 
> > mpstat shows that the workload is CPU-bound and usage doesn't change with
> 
> So I'm having trouble with calling a 80%-90% workload CPU bound, because
> clearly there's a ton of idle time.

Yes you're right. There is considerable idle time and calling it CPU-bound is
a bit of a stretch.

Yet I don't think I'm completely off the mark. The busy time is the same with
the machine running at 1.5 GHz and at 2.2 GHz (it just takes longer to
finish). To me it seems like the CPU is the bottleneck, with some overhead on
top.

I will confirm what causes the idle time.


Giovanni

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-25 10:04   ` Peter Zijlstra
@ 2021-01-26  9:28     ` Giovanni Gherdovich
  2021-01-26 10:02       ` Peter Zijlstra
  2021-02-02 18:45       ` Rafael J. Wysocki
  2021-02-02 18:40     ` Rafael J. Wysocki
  1 sibling, 2 replies; 26+ messages in thread
From: Giovanni Gherdovich @ 2021-01-26  9:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Ingo Molnar, Rafael J . Wysocki, Viresh Kumar,
	Jon Grimm, Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Mel Gorman, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel, x86,
	linux-pm, linux-kernel, linux-acpi

On Mon, 2021-01-25 at 11:04 +0100, Peter Zijlstra wrote:
> On Fri, Jan 22, 2021 at 09:40:38PM +0100, Giovanni Gherdovich wrote:
> > This workload is constant in time, so instead of using the PELT sum we can
> > pretend that scale invariance is obtained with
> > 
> >     util_inv = util_raw * freq_curr / freq_max1        [formula-1]
> > 
> > where util_raw is the PELT util from v5.10 (which is to say, not invariant),
> > and util_inv is the PELT util from v5.11-rc4. freq_max1 comes from
> > commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for
> > frequency invariance on AMD EPYC") and is (P0+max_boost)/2 = (2.25+3.4)/2 =
> > 2.825 GHz.  Then we have the schedutil formula
> > 
> >     freq_next = 1.25 * freq_max2 * util_inv            [formula-2]
> > 
> > Here v5.11-rc4 uses freq_max2 = P0 = 2.25 GHz (and this patch changes it to
> > 3.4 GHz).
> > 
> > Since all cores are busy, there is no boost available. Let's be generous and say
> > the tasks initially get P0, i.e. freq_curr = 2.25 GHz. Combining the formulas
> > above and taking util_raw = 825/1024 = 0.8, freq_next is:
> > 
> >     freq_next = 1.25 * 2.25 * 0.8 * 2.25 / 2.825 = 1.79 GHz
> 
> Right, so here's a 'problem' between schedutil and cpufreq, they don't
> use the same f_max at all times.
> 
> And this is also an inconsistency between acpi_cpufreq and intel_pstate
> (passive). IIRC the intel_pstate cpufreq drivers uses 4C/1C/P0 resp,
> while ACPI seems to stick to P0 f_max.

That's correct. A different f_max is used depending on the occasion. Let me
rephrase with:

cpufreq core asks the driver what's the f_max. What's the answer?

intel_pstate says: 1C
acpi_cpufreq says: P0

scheduler asks the freq-invariance machinery what's f_max, because it needs to
compute f_curr/f_max. What's the answer?

Intel CPUs: 4C in most cases, 1C on Atom, something else on Xeon Phi.
AMD CPUs: (P0 + 1C) / 2.


Legend:
1C = 1-core boost
4C = 4-cores boost
P0 = max non-boost P-States

> 
> Rafael; should ACPI change that behaviour rather than adding yet another
> magic variable?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-26  9:09     ` Giovanni Gherdovich
@ 2021-01-26  9:31       ` Mel Gorman
  2021-01-26 10:05         ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-01-26  9:31 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Peter Zijlstra, Borislav Petkov, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Jon Grimm, Nathan Fontenot, Yazen Ghannam,
	Thomas Lendacky, Suthikulpanit Suravee, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel, x86,
	linux-pm, linux-kernel, linux-acpi

On Tue, Jan 26, 2021 at 10:09:27AM +0100, Giovanni Gherdovich wrote:
> On Mon, 2021-01-25 at 11:06 +0100, Peter Zijlstra wrote:
> > On Fri, Jan 22, 2021 at 09:40:38PM +0100, Giovanni Gherdovich wrote:
> > > 1. PROBLEM DESCRIPTION (over-utilization and schedutil)
> > > 
> > > The problem happens on CPU-bound workloads spanning a large number of cores.
> > > In this case schedutil won't select the maximum P-State. Actually, it's
> > > likely that it will select the minimum one.
> > > 
> > > A CPU-bound workload puts the machine in a state generally called
> > > "over-utilization": an increase in CPU speed doesn't result in an increase of
> > > capacity. The fraction of time tasks spend on CPU becomes constant regardless
> > > of clock frequency (the tasks eat whatever we throw at them), and the PELT
> > > invariant util goes up and down with the frequency (i.e. it's not invariant
> > > anymore).
> > >                                       v5.10          v5.11-rc4
> > >                                       ~~~~~~~~~~~~~~~~~~~~~~~~
> > > CPU activity (mpstat)                 80-90%         80-90%
> > > schedutil requests (tracepoint)       always P0      mostly P2
> > > CPU frequency (HW feedback)           ~2.2 GHz       ~1.5 GHz
> > > PELT root rq util (tracepoint)        ~825           ~450
> > > 
> > > mpstat shows that the workload is CPU-bound and usage doesn't change with
> > 
> > So I'm having trouble with calling a 80%-90% workload CPU bound, because
> > clearly there's a ton of idle time.
> 
> Yes you're right. There is considerable idle time and calling it CPU-bound is
> a bit of a stretch.
> 
> Yet I don't think I'm completely off the mark. The busy time is the same with
> the machine running at 1.5 GHz and at 2.2 GHz (it just takes longer to
> finish). To me it seems like the CPU is the bottleneck, with some overhead on
> top.
> 

I think this is an important observation because while the load may not
be fully CPU-bound, it's still at the point where race-to-idle by running
at a higher frequency is relevant. During the busy time, the results
(and Michael's testing) indicate that the higher frequency may still be
justified. I agree that there is a "a 'problem' between schedutil and
cpufreq, they don't use the same f_max at all times", fixing that mid
-rc may not be appropriate because it's a big change in an rc window.

So, should this patch be merged for 5.11 as a stopgap, fix up
schedutil/cpufreq and then test both AMD and Intel chips reporting the
correct max non-turbo and max-turbo frequencies? That would give time to
give some testing in linux-next before merging to reduce the chance
something else falls out.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-26  9:28     ` Giovanni Gherdovich
@ 2021-01-26 10:02       ` Peter Zijlstra
  2021-02-02 18:45       ` Rafael J. Wysocki
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-26 10:02 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Borislav Petkov, Ingo Molnar, Rafael J . Wysocki, Viresh Kumar,
	Jon Grimm, Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Mel Gorman, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel, x86,
	linux-pm, linux-kernel, linux-acpi

On Tue, Jan 26, 2021 at 10:28:30AM +0100, Giovanni Gherdovich wrote:
> On Mon, 2021-01-25 at 11:04 +0100, Peter Zijlstra wrote:
> > On Fri, Jan 22, 2021 at 09:40:38PM +0100, Giovanni Gherdovich wrote:
> > > This workload is constant in time, so instead of using the PELT sum we can
> > > pretend that scale invariance is obtained with
> > > 
> > >     util_inv = util_raw * freq_curr / freq_max1        [formula-1]
> > > 
> > > where util_raw is the PELT util from v5.10 (which is to say, not invariant),
> > > and util_inv is the PELT util from v5.11-rc4. freq_max1 comes from
> > > commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for
> > > frequency invariance on AMD EPYC") and is (P0+max_boost)/2 = (2.25+3.4)/2 =
> > > 2.825 GHz.  Then we have the schedutil formula
> > > 
> > >     freq_next = 1.25 * freq_max2 * util_inv            [formula-2]
> > > 
> > > Here v5.11-rc4 uses freq_max2 = P0 = 2.25 GHz (and this patch changes it to
> > > 3.4 GHz).
> > > 
> > > Since all cores are busy, there is no boost available. Let's be generous and say
> > > the tasks initially get P0, i.e. freq_curr = 2.25 GHz. Combining the formulas
> > > above and taking util_raw = 825/1024 = 0.8, freq_next is:
> > > 
> > >     freq_next = 1.25 * 2.25 * 0.8 * 2.25 / 2.825 = 1.79 GHz
> > 
> > Right, so here's a 'problem' between schedutil and cpufreq, they don't
> > use the same f_max at all times.
> > 
> > And this is also an inconsistency between acpi_cpufreq and intel_pstate
> > (passive). IIRC the intel_pstate cpufreq drivers uses 4C/1C/P0 resp,
> > while ACPI seems to stick to P0 f_max.
> 
> That's correct. A different f_max is used depending on the occasion. Let me
> rephrase with:
> 
> cpufreq core asks the driver what's the f_max. What's the answer?
> 
> intel_pstate says: 1C

Oh indeed it does...

> acpi_cpufreq says: P0
> 
> scheduler asks the freq-invariance machinery what's f_max, because it needs to
> compute f_curr/f_max. What's the answer?
> 
> Intel CPUs: 4C in most cases, 1C on Atom, something else on Xeon Phi.
> AMD CPUs: (P0 + 1C) / 2.

Right, and thwn freq-invariance uses larger f_max than cpufreq uses for
frequency selection, we under select exactly like you found.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-26  9:31       ` Mel Gorman
@ 2021-01-26 10:05         ` Peter Zijlstra
       [not found]           ` <1611933781.15858.48.camel@suse.cz>
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-26 10:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Giovanni Gherdovich, Borislav Petkov, Ingo Molnar,
	Rafael J . Wysocki, Viresh Kumar, Jon Grimm, Nathan Fontenot,
	Yazen Ghannam, Thomas Lendacky, Suthikulpanit Suravee, Pu Wen,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Michael Larabel,
	x86, linux-pm, linux-kernel, linux-acpi

On Tue, Jan 26, 2021 at 09:31:40AM +0000, Mel Gorman wrote:

> So, should this patch be merged for 5.11 as a stopgap, fix up
> schedutil/cpufreq and then test both AMD and Intel chips reporting the
> correct max non-turbo and max-turbo frequencies? That would give time to
> give some testing in linux-next before merging to reduce the chance
> something else falls out.

Yeah, we should probably do this now. Rafael, you want this or should I
take it?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
       [not found]           ` <1611933781.15858.48.camel@suse.cz>
@ 2021-02-02 14:17             ` Giovanni Gherdovich
  2021-02-02 18:21               ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Giovanni Gherdovich @ 2021-02-02 14:17 UTC (permalink / raw)
  To: Rafael J . Wysocki
  Cc: Peter Zijlstra, Mel Gorman, Borislav Petkov, Ingo Molnar,
	Viresh Kumar, Jon Grimm, Nathan Fontenot, Yazen Ghannam,
	Thomas Lendacky, Suthikulpanit Suravee, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel, x86,
	linux-pm, linux-kernel, linux-acpi

On Fri, 2021-01-29 at 16:23 +0100, Giovanni Gherdovich wrote:
> On Tue, 2021-01-26 at 11:05 +0100, Peter Zijlstra wrote:
> > On Tue, Jan 26, 2021 at 09:31:40AM +0000, Mel Gorman wrote:
> > 
> > > So, should this patch be merged for 5.11 as a stopgap, fix up
> > > schedutil/cpufreq and then test both AMD and Intel chips reporting the
> > > correct max non-turbo and max-turbo frequencies? That would give time to
> > > give some testing in linux-next before merging to reduce the chance
> > > something else falls out.
> > 
> > Yeah, we should probably do this now. Rafael, you want this or should I
> > take it?
> 
> Hello Rafael,
> 
> did you have a chance to check this patch?
> 
> It fixes a performance regression from 5.11-rc1, I hoped it could be included
> before v5.11 is released.

Hello Rafael,

you haven't replied to this patch, which was written aiming at v5.11.

Do you see any problem with it?
Frequency-invariant schedutil needs the driver to advertise a high max_freq to
work properly; the patch addresses this for AMD EPYC.


Thanks,
Giovanni

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-02-02 14:17             ` Giovanni Gherdovich
@ 2021-02-02 18:21               ` Peter Zijlstra
  2021-02-02 18:29                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-02-02 18:21 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Rafael J . Wysocki, Mel Gorman, Borislav Petkov, Ingo Molnar,
	Viresh Kumar, Jon Grimm, Nathan Fontenot, Yazen Ghannam,
	Thomas Lendacky, Suthikulpanit Suravee, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel, x86,
	linux-pm, linux-kernel, linux-acpi

On Tue, Feb 02, 2021 at 03:17:05PM +0100, Giovanni Gherdovich wrote:
> Hello Rafael,
> 
> you haven't replied to this patch, which was written aiming at v5.11.

I've tentatively queued this for x86/urgent, but ideally this goes
through a cpufreq tree. Rafael, Viresh?



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-02-02 18:21               ` Peter Zijlstra
@ 2021-02-02 18:29                 ` Rafael J. Wysocki
  2021-02-02 19:00                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 26+ messages in thread
From: Rafael J. Wysocki @ 2021-02-02 18:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Giovanni Gherdovich, Rafael J . Wysocki, Mel Gorman,
	Borislav Petkov, Ingo Molnar, Viresh Kumar, Jon Grimm,
	Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Pu Wen, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Michael Larabel, the arch/x86 maintainers,
	Linux PM, Linux Kernel Mailing List, ACPI Devel Maling List

On Tue, Feb 2, 2021 at 7:24 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Feb 02, 2021 at 03:17:05PM +0100, Giovanni Gherdovich wrote:
> > Hello Rafael,
> >
> > you haven't replied to this patch, which was written aiming at v5.11.
>
> I've tentatively queued this for x86/urgent, but ideally this goes
> through a cpufreq tree. Rafael, Viresh?

I've missed it, sorry.

I can queue it up tomorrow if all goes well.

Cheers!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-25 10:04   ` Peter Zijlstra
  2021-01-26  9:28     ` Giovanni Gherdovich
@ 2021-02-02 18:40     ` Rafael J. Wysocki
  1 sibling, 0 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2021-02-02 18:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Giovanni Gherdovich, Borislav Petkov, Ingo Molnar,
	Rafael J . Wysocki, Viresh Kumar, Jon Grimm, Nathan Fontenot,
	Yazen Ghannam, Thomas Lendacky, Suthikulpanit Suravee,
	Mel Gorman, Pu Wen, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Michael Larabel, the arch/x86 maintainers,
	Linux PM, Linux Kernel Mailing List, ACPI Devel Maling List

On Mon, Jan 25, 2021 at 11:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jan 22, 2021 at 09:40:38PM +0100, Giovanni Gherdovich wrote:
> > This workload is constant in time, so instead of using the PELT sum we can
> > pretend that scale invariance is obtained with
> >
> >     util_inv = util_raw * freq_curr / freq_max1        [formula-1]
> >
> > where util_raw is the PELT util from v5.10 (which is to say, not invariant),
> > and util_inv is the PELT util from v5.11-rc4. freq_max1 comes from
> > commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for
> > frequency invariance on AMD EPYC") and is (P0+max_boost)/2 = (2.25+3.4)/2 =
> > 2.825 GHz.  Then we have the schedutil formula
> >
> >     freq_next = 1.25 * freq_max2 * util_inv            [formula-2]
> >
> > Here v5.11-rc4 uses freq_max2 = P0 = 2.25 GHz (and this patch changes it to
> > 3.4 GHz).
> >
> > Since all cores are busy, there is no boost available. Let's be generous and say
> > the tasks initially get P0, i.e. freq_curr = 2.25 GHz. Combining the formulas
> > above and taking util_raw = 825/1024 = 0.8, freq_next is:
> >
> >     freq_next = 1.25 * 2.25 * 0.8 * 2.25 / 2.825 = 1.79 GHz
>
> Right, so here's a 'problem' between schedutil and cpufreq, they don't
> use the same f_max at all times.
>
> And this is also an inconsistency between acpi_cpufreq and intel_pstate
> (passive). IIRC the intel_pstate cpufreq drivers uses 4C/1C/P0 resp,
> while ACPI seems to stick to P0 f_max.

The only place where 4C is used is the scale invariance code AFAICS.

intel_pstate uses P0 as the f_max unless turbo is disabled.

The difference between intel_pstate and acpi_cpufreq is that (a) the
latter uses a frequency table and the former doesn't and (b) the
latter uses the P0 entry of the frequency table to represent the
entire turbo range,

> Rafael; should ACPI change that behaviour rather than adding yet another
> magic variable?

I'm not sure.  That may change the behavior from what is expected by some users.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-26  9:28     ` Giovanni Gherdovich
  2021-01-26 10:02       ` Peter Zijlstra
@ 2021-02-02 18:45       ` Rafael J. Wysocki
  2021-02-02 19:11         ` Rafael J. Wysocki
  1 sibling, 1 reply; 26+ messages in thread
From: Rafael J. Wysocki @ 2021-02-02 18:45 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Peter Zijlstra, Borislav Petkov, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Jon Grimm, Nathan Fontenot, Yazen Ghannam,
	Thomas Lendacky, Suthikulpanit Suravee, Mel Gorman, Pu Wen,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Michael Larabel,
	the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List,
	ACPI Devel Maling List

On Tue, Jan 26, 2021 at 5:19 PM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
>
> On Mon, 2021-01-25 at 11:04 +0100, Peter Zijlstra wrote:
> > On Fri, Jan 22, 2021 at 09:40:38PM +0100, Giovanni Gherdovich wrote:
> > > This workload is constant in time, so instead of using the PELT sum we can
> > > pretend that scale invariance is obtained with
> > >
> > >     util_inv = util_raw * freq_curr / freq_max1        [formula-1]
> > >
> > > where util_raw is the PELT util from v5.10 (which is to say, not invariant),
> > > and util_inv is the PELT util from v5.11-rc4. freq_max1 comes from
> > > commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for
> > > frequency invariance on AMD EPYC") and is (P0+max_boost)/2 = (2.25+3.4)/2 =
> > > 2.825 GHz.  Then we have the schedutil formula
> > >
> > >     freq_next = 1.25 * freq_max2 * util_inv            [formula-2]
> > >
> > > Here v5.11-rc4 uses freq_max2 = P0 = 2.25 GHz (and this patch changes it to
> > > 3.4 GHz).
> > >
> > > Since all cores are busy, there is no boost available. Let's be generous and say
> > > the tasks initially get P0, i.e. freq_curr = 2.25 GHz. Combining the formulas
> > > above and taking util_raw = 825/1024 = 0.8, freq_next is:
> > >
> > >     freq_next = 1.25 * 2.25 * 0.8 * 2.25 / 2.825 = 1.79 GHz
> >
> > Right, so here's a 'problem' between schedutil and cpufreq, they don't
> > use the same f_max at all times.
> >
> > And this is also an inconsistency between acpi_cpufreq and intel_pstate
> > (passive). IIRC the intel_pstate cpufreq drivers uses 4C/1C/P0 resp,
> > while ACPI seems to stick to P0 f_max.
>
> That's correct. A different f_max is used depending on the occasion. Let me
> rephrase with:

OK, I confused the terminology, sorry about that.

> cpufreq core asks the driver what's the f_max. What's the answer?
>
> intel_pstate says: 1C

Yes, unless turbo is disabled, in which case it is P0.

> acpi_cpufreq says: P0

This is P0+1, isn't it?

> scheduler asks the freq-invariance machinery what's f_max, because it needs to
> compute f_curr/f_max. What's the answer?
>
> Intel CPUs: 4C in most cases, 1C on Atom, something else on Xeon Phi.
> AMD CPUs: (P0 + 1C) / 2.
>
>
> Legend:
> 1C = 1-core boost
> 4C = 4-cores boost
> P0 = max non-boost P-States

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-22 20:40 ` [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula Giovanni Gherdovich
  2021-01-25 10:04   ` Peter Zijlstra
  2021-01-25 10:06   ` Peter Zijlstra
@ 2021-02-02 18:59   ` Rafael J. Wysocki
  2021-02-02 19:26     ` Rafael J. Wysocki
  2021-02-03  9:12     ` Giovanni Gherdovich
  2021-02-03  6:04   ` Viresh Kumar
  3 siblings, 2 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2021-02-02 18:59 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki,
	Viresh Kumar, Jon Grimm, Nathan Fontenot, Yazen Ghannam,
	Thomas Lendacky, Suthikulpanit Suravee, Mel Gorman, Pu Wen,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Michael Larabel,
	the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List,
	ACPI Devel Maling List

On Fri, Jan 22, 2021 at 9:47 PM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
>

[cut]

>
> Fixes: 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
> Fixes: 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC")
> Reported-by: Michael Larabel <Michael@phoronix.com>
> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> ---
>  drivers/cpufreq/acpi-cpufreq.c   | 64 +++++++++++++++++++++++++++++++-
>  drivers/cpufreq/cpufreq.c        |  3 ++
>  include/linux/cpufreq.h          |  5 +++
>  kernel/sched/cpufreq_schedutil.c |  8 +++-
>  4 files changed, 76 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
> index 1e4fbb002a31..2378bc1bf2c4 100644
> --- a/drivers/cpufreq/acpi-cpufreq.c
> +++ b/drivers/cpufreq/acpi-cpufreq.c
> @@ -27,6 +27,10 @@
>
>  #include <acpi/processor.h>
>
> +#ifdef CONFIG_ACPI_CPPC_LIB

Why is the #ifdef needed here?

> +#include <acpi/cppc_acpi.h>
> +#endif
> +
>  #include <asm/msr.h>
>  #include <asm/processor.h>
>  #include <asm/cpufeature.h>
> @@ -628,11 +632,57 @@ static int acpi_cpufreq_blacklist(struct cpuinfo_x86 *c)
>  }
>  #endif
>
> +#ifdef CONFIG_ACPI_CPPC_LIB
> +static bool amd_max_boost(unsigned int max_freq,
> +                         unsigned int *max_boost)
> +{
> +       struct cppc_perf_caps perf_caps;
> +       u64 highest_perf, nominal_perf, perf_ratio;
> +       int ret;
> +
> +       ret = cppc_get_perf_caps(0, &perf_caps);
> +       if (ret) {
> +               pr_debug("Could not retrieve perf counters (%d)\n", ret);
> +               return false;
> +       }
> +
> +       highest_perf = perf_caps.highest_perf;
> +       nominal_perf = perf_caps.nominal_perf;
> +
> +       if (!highest_perf || !nominal_perf) {
> +               pr_debug("Could not retrieve highest or nominal performance\n");
> +               return false;
> +       }
> +
> +       perf_ratio = div_u64(highest_perf * SCHED_CAPACITY_SCALE, nominal_perf);
> +       if (perf_ratio <= SCHED_CAPACITY_SCALE) {
> +               pr_debug("Either perf_ratio is 0, or nominal >= highest performance\n");
> +               return false;
> +       }
> +
> +       *max_boost = max_freq * perf_ratio >> SCHED_CAPACITY_SHIFT;
> +       if (!*max_boost) {
> +               pr_debug("max_boost seems to be zero\n");
> +               return false;
> +       }
> +
> +       return true;
> +}
> +#else
> +static bool amd_max_boost(unsigned int max_freq,
> +                         unsigned int *max_boost)
> +{
> +       return false;
> +}
> +#endif
> +
>  static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  {
>         unsigned int i;
>         unsigned int valid_states = 0;
>         unsigned int cpu = policy->cpu;
> +       unsigned int freq, max_freq = 0;
> +       unsigned int max_boost;
>         struct acpi_cpufreq_data *data;
>         unsigned int result = 0;
>         struct cpuinfo_x86 *c = &cpu_data(policy->cpu);
> @@ -779,15 +829,25 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
>                     freq_table[valid_states-1].frequency / 1000)
>                         continue;
>
> +               freq = perf->states[i].core_frequency * 1000;
>                 freq_table[valid_states].driver_data = i;
> -               freq_table[valid_states].frequency =
> -                   perf->states[i].core_frequency * 1000;
> +               freq_table[valid_states].frequency = freq;
> +
> +               if (freq > max_freq)
> +                       max_freq = freq;
> +
>                 valid_states++;
>         }
>         freq_table[valid_states].frequency = CPUFREQ_TABLE_END;
>         policy->freq_table = freq_table;
>         perf->state = 0;
>
> +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> +           amd_max_boost(max_freq, &max_boost)) {
> +               policy->cpuinfo.max_boost = max_boost;

Why not to set max_freq to max_boost instead?

This value is set once at the init time anyway and schedutil would use
max_boost instead of max_freq anyway.

Also notice that the static branch is global and the max_boost value
for different CPUs may be different, at least in theory.

> +               static_branch_enable(&cpufreq_amd_max_boost);
> +       }
> +
>         switch (perf->control_register.space_id) {
>         case ACPI_ADR_SPACE_SYSTEM_IO:
>                 /*
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index d0a3525ce27f..b96677f6b57e 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2721,6 +2721,9 @@ int cpufreq_boost_enabled(void)
>  }
>  EXPORT_SYMBOL_GPL(cpufreq_boost_enabled);
>
> +DEFINE_STATIC_KEY_FALSE(cpufreq_amd_max_boost);
> +EXPORT_SYMBOL_GPL(cpufreq_amd_max_boost);
> +
>  /*********************************************************************
>   *               REGISTER / UNREGISTER CPUFREQ DRIVER                *
>   *********************************************************************/
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 9c8b7437b6cd..341cac76d254 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -40,9 +40,14 @@ enum cpufreq_table_sorting {
>         CPUFREQ_TABLE_SORTED_DESCENDING
>  };
>
> +DECLARE_STATIC_KEY_FALSE(cpufreq_amd_max_boost);
> +
> +#define cpufreq_driver_has_max_boost() static_branch_unlikely(&cpufreq_amd_max_boost)
> +
>  struct cpufreq_cpuinfo {
>         unsigned int            max_freq;
>         unsigned int            min_freq;
> +       unsigned int            max_boost;
>
>         /* in 10^(-9) s = nanoseconds */
>         unsigned int            transition_latency;
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 6931f0cdeb80..541f3db3f576 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -159,8 +159,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>                                   unsigned long util, unsigned long max)
>  {
>         struct cpufreq_policy *policy = sg_policy->policy;
> -       unsigned int freq = arch_scale_freq_invariant() ?
> -                               policy->cpuinfo.max_freq : policy->cur;
> +       unsigned int freq, max_freq;
> +
> +       max_freq = cpufreq_driver_has_max_boost() ?
> +                       policy->cpuinfo.max_boost : policy->cpuinfo.max_freq;
> +
> +       freq = arch_scale_freq_invariant() ? max_freq : policy->cur;
>
>         freq = map_util_freq(util, freq, max);
>
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-02-02 18:29                 ` Rafael J. Wysocki
@ 2021-02-02 19:00                   ` Rafael J. Wysocki
  0 siblings, 0 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2021-02-02 19:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Giovanni Gherdovich, Rafael J . Wysocki, Mel Gorman,
	Borislav Petkov, Ingo Molnar, Viresh Kumar, Jon Grimm,
	Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Pu Wen, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Michael Larabel, the arch/x86 maintainers,
	Linux PM, Linux Kernel Mailing List, ACPI Devel Maling List

On Tue, Feb 2, 2021 at 7:29 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Feb 2, 2021 at 7:24 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Feb 02, 2021 at 03:17:05PM +0100, Giovanni Gherdovich wrote:
> > > Hello Rafael,
> > >
> > > you haven't replied to this patch, which was written aiming at v5.11.
> >
> > I've tentatively queued this for x86/urgent, but ideally this goes
> > through a cpufreq tree. Rafael, Viresh?
>
> I've missed it, sorry.
>
> I can queue it up tomorrow if all goes well.

So actually I'm not sure about it.

Looks overly complicated to me.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-02-02 18:45       ` Rafael J. Wysocki
@ 2021-02-02 19:11         ` Rafael J. Wysocki
  2021-02-03  9:56           ` Giovanni Gherdovich
  0 siblings, 1 reply; 26+ messages in thread
From: Rafael J. Wysocki @ 2021-02-02 19:11 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Peter Zijlstra, Borislav Petkov, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Jon Grimm, Nathan Fontenot, Yazen Ghannam,
	Thomas Lendacky, Suthikulpanit Suravee, Mel Gorman, Pu Wen,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Michael Larabel,
	the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List,
	ACPI Devel Maling List

On Tue, Feb 2, 2021 at 7:45 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Jan 26, 2021 at 5:19 PM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
> >
> > On Mon, 2021-01-25 at 11:04 +0100, Peter Zijlstra wrote:
> > > On Fri, Jan 22, 2021 at 09:40:38PM +0100, Giovanni Gherdovich wrote:
> > > > This workload is constant in time, so instead of using the PELT sum we can
> > > > pretend that scale invariance is obtained with
> > > >
> > > >     util_inv = util_raw * freq_curr / freq_max1        [formula-1]
> > > >
> > > > where util_raw is the PELT util from v5.10 (which is to say, not invariant),
> > > > and util_inv is the PELT util from v5.11-rc4. freq_max1 comes from
> > > > commit 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for
> > > > frequency invariance on AMD EPYC") and is (P0+max_boost)/2 = (2.25+3.4)/2 =
> > > > 2.825 GHz.  Then we have the schedutil formula
> > > >
> > > >     freq_next = 1.25 * freq_max2 * util_inv            [formula-2]
> > > >
> > > > Here v5.11-rc4 uses freq_max2 = P0 = 2.25 GHz (and this patch changes it to
> > > > 3.4 GHz).
> > > >
> > > > Since all cores are busy, there is no boost available. Let's be generous and say
> > > > the tasks initially get P0, i.e. freq_curr = 2.25 GHz. Combining the formulas
> > > > above and taking util_raw = 825/1024 = 0.8, freq_next is:
> > > >
> > > >     freq_next = 1.25 * 2.25 * 0.8 * 2.25 / 2.825 = 1.79 GHz
> > >
> > > Right, so here's a 'problem' between schedutil and cpufreq, they don't
> > > use the same f_max at all times.
> > >
> > > And this is also an inconsistency between acpi_cpufreq and intel_pstate
> > > (passive). IIRC the intel_pstate cpufreq drivers uses 4C/1C/P0 resp,
> > > while ACPI seems to stick to P0 f_max.
> >
> > That's correct. A different f_max is used depending on the occasion. Let me
> > rephrase with:
>
> OK, I confused the terminology, sorry about that.
>
> > cpufreq core asks the driver what's the f_max. What's the answer?
> >
> > intel_pstate says: 1C
>
> Yes, unless turbo is disabled, in which case it is P0.

BTW, and that actually is quite important, the max_freq reported by
intel_pstate doesn't matter for schedutil after the new ->adjust_perf
callback has been added, because that doesn't even use the frequency.

So, as a long-term remedy, it may just be better to implement
->adjust_perf in acpi_cpufreq().

Again, I'm terribly sorry for missing this thread and the patch.

> > acpi_cpufreq says: P0
>
> This is P0+1, isn't it?
>
> > scheduler asks the freq-invariance machinery what's f_max, because it needs to
> > compute f_curr/f_max. What's the answer?
> >
> > Intel CPUs: 4C in most cases, 1C on Atom, something else on Xeon Phi.
> > AMD CPUs: (P0 + 1C) / 2.
> >
> >
> > Legend:
> > 1C = 1-core boost
> > 4C = 4-cores boost
> > P0 = max non-boost P-States

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-02-02 18:59   ` Rafael J. Wysocki
@ 2021-02-02 19:26     ` Rafael J. Wysocki
  2021-02-03  8:39       ` Giovanni Gherdovich
  2021-02-03  9:12     ` Giovanni Gherdovich
  1 sibling, 1 reply; 26+ messages in thread
From: Rafael J. Wysocki @ 2021-02-02 19:26 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki,
	Viresh Kumar, Jon Grimm, Nathan Fontenot, Yazen Ghannam,
	Thomas Lendacky, Suthikulpanit Suravee, Mel Gorman, Pu Wen,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Michael Larabel,
	the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List,
	ACPI Devel Maling List

On Tue, Feb 2, 2021 at 7:59 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, Jan 22, 2021 at 9:47 PM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
> >
>
> [cut]
>
> >
> > Fixes: 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
> > Fixes: 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC")
> > Reported-by: Michael Larabel <Michael@phoronix.com>
> > Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> > ---
> >  drivers/cpufreq/acpi-cpufreq.c   | 64 +++++++++++++++++++++++++++++++-
> >  drivers/cpufreq/cpufreq.c        |  3 ++
> >  include/linux/cpufreq.h          |  5 +++
> >  kernel/sched/cpufreq_schedutil.c |  8 +++-
> >  4 files changed, 76 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
> > index 1e4fbb002a31..2378bc1bf2c4 100644
> > --- a/drivers/cpufreq/acpi-cpufreq.c
> > +++ b/drivers/cpufreq/acpi-cpufreq.c
> > @@ -27,6 +27,10 @@
> >
> >  #include <acpi/processor.h>
> >
> > +#ifdef CONFIG_ACPI_CPPC_LIB
>
> Why is the #ifdef needed here?
>
> > +#include <acpi/cppc_acpi.h>
> > +#endif
> > +
> >  #include <asm/msr.h>
> >  #include <asm/processor.h>
> >  #include <asm/cpufeature.h>
> > @@ -628,11 +632,57 @@ static int acpi_cpufreq_blacklist(struct cpuinfo_x86 *c)
> >  }
> >  #endif
> >
> > +#ifdef CONFIG_ACPI_CPPC_LIB
> > +static bool amd_max_boost(unsigned int max_freq,
> > +                         unsigned int *max_boost)
> > +{
> > +       struct cppc_perf_caps perf_caps;
> > +       u64 highest_perf, nominal_perf, perf_ratio;
> > +       int ret;
> > +
> > +       ret = cppc_get_perf_caps(0, &perf_caps);
> > +       if (ret) {
> > +               pr_debug("Could not retrieve perf counters (%d)\n", ret);
> > +               return false;
> > +       }
> > +
> > +       highest_perf = perf_caps.highest_perf;
> > +       nominal_perf = perf_caps.nominal_perf;
> > +
> > +       if (!highest_perf || !nominal_perf) {
> > +               pr_debug("Could not retrieve highest or nominal performance\n");
> > +               return false;
> > +       }
> > +
> > +       perf_ratio = div_u64(highest_perf * SCHED_CAPACITY_SCALE, nominal_perf);
> > +       if (perf_ratio <= SCHED_CAPACITY_SCALE) {
> > +               pr_debug("Either perf_ratio is 0, or nominal >= highest performance\n");
> > +               return false;
> > +       }
> > +
> > +       *max_boost = max_freq * perf_ratio >> SCHED_CAPACITY_SHIFT;
> > +       if (!*max_boost) {
> > +               pr_debug("max_boost seems to be zero\n");
> > +               return false;
> > +       }
> > +
> > +       return true;
> > +}
> > +#else
> > +static bool amd_max_boost(unsigned int max_freq,
> > +                         unsigned int *max_boost)
> > +{
> > +       return false;
> > +}
> > +#endif
> > +
> >  static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
> >  {
> >         unsigned int i;
> >         unsigned int valid_states = 0;
> >         unsigned int cpu = policy->cpu;
> > +       unsigned int freq, max_freq = 0;
> > +       unsigned int max_boost;
> >         struct acpi_cpufreq_data *data;
> >         unsigned int result = 0;
> >         struct cpuinfo_x86 *c = &cpu_data(policy->cpu);
> > @@ -779,15 +829,25 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
> >                     freq_table[valid_states-1].frequency / 1000)
> >                         continue;
> >
> > +               freq = perf->states[i].core_frequency * 1000;
> >                 freq_table[valid_states].driver_data = i;
> > -               freq_table[valid_states].frequency =
> > -                   perf->states[i].core_frequency * 1000;
> > +               freq_table[valid_states].frequency = freq;
> > +
> > +               if (freq > max_freq)
> > +                       max_freq = freq;
> > +
> >                 valid_states++;
> >         }
> >         freq_table[valid_states].frequency = CPUFREQ_TABLE_END;
> >         policy->freq_table = freq_table;
> >         perf->state = 0;
> >
> > +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> > +           amd_max_boost(max_freq, &max_boost)) {
> > +               policy->cpuinfo.max_boost = max_boost;
>
> Why not to set max_freq to max_boost instead?

I mean, would setting the frequency in the last table entry to max_boost work?

Alternatively, one more (artificial) entry with the frequency equal to
max_boost could be added.

> This value is set once at the init time anyway and schedutil would use
> max_boost instead of max_freq anyway.
>
> Also notice that the static branch is global and the max_boost value
> for different CPUs may be different, at least in theory.
>
> > +               static_branch_enable(&cpufreq_amd_max_boost);
> > +       }
> > +
> >         switch (perf->control_register.space_id) {
> >         case ACPI_ADR_SPACE_SYSTEM_IO:
> >                 /*
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index d0a3525ce27f..b96677f6b57e 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -2721,6 +2721,9 @@ int cpufreq_boost_enabled(void)
> >  }
> >  EXPORT_SYMBOL_GPL(cpufreq_boost_enabled);
> >
> > +DEFINE_STATIC_KEY_FALSE(cpufreq_amd_max_boost);
> > +EXPORT_SYMBOL_GPL(cpufreq_amd_max_boost);
> > +
> >  /*********************************************************************
> >   *               REGISTER / UNREGISTER CPUFREQ DRIVER                *
> >   *********************************************************************/
> > diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> > index 9c8b7437b6cd..341cac76d254 100644
> > --- a/include/linux/cpufreq.h
> > +++ b/include/linux/cpufreq.h
> > @@ -40,9 +40,14 @@ enum cpufreq_table_sorting {
> >         CPUFREQ_TABLE_SORTED_DESCENDING
> >  };
> >
> > +DECLARE_STATIC_KEY_FALSE(cpufreq_amd_max_boost);
> > +
> > +#define cpufreq_driver_has_max_boost() static_branch_unlikely(&cpufreq_amd_max_boost)
> > +
> >  struct cpufreq_cpuinfo {
> >         unsigned int            max_freq;
> >         unsigned int            min_freq;
> > +       unsigned int            max_boost;
> >
> >         /* in 10^(-9) s = nanoseconds */
> >         unsigned int            transition_latency;
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 6931f0cdeb80..541f3db3f576 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -159,8 +159,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> >                                   unsigned long util, unsigned long max)
> >  {
> >         struct cpufreq_policy *policy = sg_policy->policy;
> > -       unsigned int freq = arch_scale_freq_invariant() ?
> > -                               policy->cpuinfo.max_freq : policy->cur;
> > +       unsigned int freq, max_freq;
> > +
> > +       max_freq = cpufreq_driver_has_max_boost() ?
> > +                       policy->cpuinfo.max_boost : policy->cpuinfo.max_freq;
> > +
> > +       freq = arch_scale_freq_invariant() ? max_freq : policy->cur;
> >
> >         freq = map_util_freq(util, freq, max);
> >
> > --
> > 2.26.2
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-01-22 20:40 ` [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula Giovanni Gherdovich
                     ` (2 preceding siblings ...)
  2021-02-02 18:59   ` Rafael J. Wysocki
@ 2021-02-03  6:04   ` Viresh Kumar
  3 siblings, 0 replies; 26+ messages in thread
From: Viresh Kumar @ 2021-02-03  6:04 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki,
	Jon Grimm, Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Mel Gorman, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel, x86,
	linux-pm, linux-kernel, linux-acpi

I am sorry but I wasn't able to get the full picture (not your fault,
it is me), but ...

On 22-01-21, 21:40, Giovanni Gherdovich wrote:
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index d0a3525ce27f..b96677f6b57e 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2721,6 +2721,9 @@ int cpufreq_boost_enabled(void)
>  }
>  EXPORT_SYMBOL_GPL(cpufreq_boost_enabled);
>  
> +DEFINE_STATIC_KEY_FALSE(cpufreq_amd_max_boost);
> +EXPORT_SYMBOL_GPL(cpufreq_amd_max_boost);
> +
>  /*********************************************************************
>   *               REGISTER / UNREGISTER CPUFREQ DRIVER                *
>   *********************************************************************/
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 9c8b7437b6cd..341cac76d254 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -40,9 +40,14 @@ enum cpufreq_table_sorting {
>  	CPUFREQ_TABLE_SORTED_DESCENDING
>  };
>  
> +DECLARE_STATIC_KEY_FALSE(cpufreq_amd_max_boost);
> +
> +#define cpufreq_driver_has_max_boost() static_branch_unlikely(&cpufreq_amd_max_boost)
> +

I am not happy with AMD specific code/changes in common parts..

>  struct cpufreq_cpuinfo {
>  	unsigned int		max_freq;
>  	unsigned int		min_freq;
> +	unsigned int		max_boost;
>  
>  	/* in 10^(-9) s = nanoseconds */
>  	unsigned int		transition_latency;
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 6931f0cdeb80..541f3db3f576 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -159,8 +159,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>  				  unsigned long util, unsigned long max)
>  {
>  	struct cpufreq_policy *policy = sg_policy->policy;
> -	unsigned int freq = arch_scale_freq_invariant() ?
> -				policy->cpuinfo.max_freq : policy->cur;
> +	unsigned int freq, max_freq;
> +
> +	max_freq = cpufreq_driver_has_max_boost() ?
> +			policy->cpuinfo.max_boost : policy->cpuinfo.max_freq;

Also, can't we update max_freq itself from the cpufreq driver? What
troubles will it cost ?

> +
> +	freq = arch_scale_freq_invariant() ? max_freq : policy->cur;
>  
>  	freq = map_util_freq(util, freq, max);
>  
> -- 
> 2.26.2

-- 
viresh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-02-02 19:26     ` Rafael J. Wysocki
@ 2021-02-03  8:39       ` Giovanni Gherdovich
  2021-02-03 13:40         ` Rafael J. Wysocki
  0 siblings, 1 reply; 26+ messages in thread
From: Giovanni Gherdovich @ 2021-02-03  8:39 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki,
	Jon Grimm, Nathan Fontenot, Yazen Ghannam, Thomas Lendacky,
	Suthikulpanit Suravee, Mel Gorman, Pu Wen, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Michael Larabel,
	the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List,
	ACPI Devel Maling List

Hello,

both Rafael and Viresh make a similar remark: why adding a new "max_boost"
variable, since "max_freq" is already available and could be used instead.

Replying here to both.

On Tue, 2021-02-02 at 20:26 +0100, Rafael J. Wysocki wrote:
> On Tue, Feb 2, 2021 at 7:59 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > 
> > On Fri, Jan 22, 2021 at 9:47 PM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
> > 
> > [cut]
> > > @@ -779,15 +829,25 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > >                     freq_table[valid_states-1].frequency / 1000)
> > >                         continue;
> > > 
> > > +               freq = perf->states[i].core_frequency * 1000;
> > >                 freq_table[valid_states].driver_data = i;
> > > -               freq_table[valid_states].frequency =
> > > -                   perf->states[i].core_frequency * 1000;
> > > +               freq_table[valid_states].frequency = freq;
> > > +
> > > +               if (freq > max_freq)
> > > +                       max_freq = freq;
> > > +
> > >                 valid_states++;
> > >         }
> > >         freq_table[valid_states].frequency = CPUFREQ_TABLE_END;
> > >         policy->freq_table = freq_table;
> > >         perf->state = 0;
> > > 
> > > +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> > > +           amd_max_boost(max_freq, &max_boost)) {
> > > +               policy->cpuinfo.max_boost = max_boost;
> > 
> > Why not to set max_freq to max_boost instead?
> 
> I mean, would setting the frequency in the last table entry to max_boost work?
> 
> Alternatively, one more (artificial) entry with the frequency equal to
> max_boost could be added.

On Wed, 2021-02-03 at 11:34 +0530, Viresh Kumar wrote:
> [cut]
> 
> On 22-01-21, 21:40, Giovanni Gherdovich wrote:
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 6931f0cdeb80..541f3db3f576 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -159,8 +159,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> >  				  unsigned long util, unsigned long max)
> >  {
> >  	struct cpufreq_policy *policy = sg_policy->policy;
> > -	unsigned int freq = arch_scale_freq_invariant() ?
> > -				policy->cpuinfo.max_freq : policy->cur;
> > +	unsigned int freq, max_freq;
> > +
> > +	max_freq = cpufreq_driver_has_max_boost() ?
> > +			policy->cpuinfo.max_boost : policy->cpuinfo.max_freq;
> 
> Also, can't we update max_freq itself from the cpufreq driver? What
> troubles will it cost ?

I could add the max boost frequency to the frequency table (and
policy->cpuinfo.max_freq would follow), yes, but that would trigger the
following warning from acpi-cpufreq.c:

static void acpi_cpufreq_cpu_ready(struct cpufreq_policy *policy)
{
        struct acpi_processor_performance *perf = per_cpu_ptr(acpi_perf_data,
                                                              policy->cpu);

        if (perf->states[0].core_frequency * 1000 != policy->cpuinfo.max_freq)
                pr_warn(FW_WARN "P-state 0 is not max freq\n");
}

so I thought that to stay out of troubles I'd supply a different variable,
max_boost, to be used only in the schedutil formula. After schedutil
figures out a desired next_freq then the usual comparison with the
firmware-supplied frequency table could take place.

Altering the frequency table seemed more invasive because once a freq value is
in there, it's going to be actually requested by the driver to the platform. I
only want my max_boost to stretch the range of schedutil's next_freq.

On Tue, 2021-02-02 at 19:59 +0100, Rafael J. Wysocki wrote:
> 
> [cut]
> Also notice that the static branch is global and the max_boost value
> for different CPUs may be different, at least in theory.

In theory yes, but I'm guarding the code with two conditions:

* vendor is X86_VENDOR_AMD
* cppc_get_perf_caps() returns success

this identifies AMD EPYC cpus with model 7xx2 and later, where max_boost is
the same on all cores. I may have added synchronization so that only one cpu
sets the value, but I didn't in the interest of simplicity for an -rc patch
(I'd have to consider hotplug, the maxcpus= command line param, ecc).


Giovanni

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-02-02 18:59   ` Rafael J. Wysocki
  2021-02-02 19:26     ` Rafael J. Wysocki
@ 2021-02-03  9:12     ` Giovanni Gherdovich
  1 sibling, 0 replies; 26+ messages in thread
From: Giovanni Gherdovich @ 2021-02-03  9:12 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki,
	Viresh Kumar, Jon Grimm, Nathan Fontenot, Yazen Ghannam,
	Thomas Lendacky, Suthikulpanit Suravee, Mel Gorman, Pu Wen,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Michael Larabel,
	the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List,
	ACPI Devel Maling List

On Tue, 2021-02-02 at 19:59 +0100, Rafael J. Wysocki wrote:
> On Fri, Jan 22, 2021 at 9:47 PM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
> > 
> 
> [cut]
> 
> > diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
> > index 1e4fbb002a31..2378bc1bf2c4 100644
> > --- a/drivers/cpufreq/acpi-cpufreq.c
> > +++ b/drivers/cpufreq/acpi-cpufreq.c
> > @@ -27,6 +27,10 @@
> > 
> >  #include <acpi/processor.h>
> > 
> > +#ifdef CONFIG_ACPI_CPPC_LIB
> 
> Why is the #ifdef needed here?
> 

Right, it isn't needed. The guard is necessary only later when the function
cppc_get_perf_caps() is used. I'll send a v3 with this correction.


Giovanni


> > +#include <acpi/cppc_acpi.h>
> > +#endif
> > +
> >  #include <asm/msr.h>
> >  #include <asm/processor.h>
> >  #include <asm/cpufeature.h>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-02-02 19:11         ` Rafael J. Wysocki
@ 2021-02-03  9:56           ` Giovanni Gherdovich
  0 siblings, 0 replies; 26+ messages in thread
From: Giovanni Gherdovich @ 2021-02-03  9:56 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Borislav Petkov, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Jon Grimm, Nathan Fontenot, Yazen Ghannam,
	Thomas Lendacky, Suthikulpanit Suravee, Mel Gorman, Pu Wen,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Michael Larabel,
	the arch/x86 maintainers, Linux PM, Linux Kernel Mailing List,
	ACPI Devel Maling List

On Tue, 2021-02-02 at 20:11 +0100, Rafael J. Wysocki wrote:
> On Tue, Feb 2, 2021 at 7:45 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > 
> > On Tue, Jan 26, 2021 at 5:19 PM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
> > > 
> > >
> > > cpufreq core asks the driver what's the f_max. What's the answer?
> > > 
> > > intel_pstate says: 1C
> > 
> > Yes, unless turbo is disabled, in which case it is P0.
> 
> BTW, and that actually is quite important, the max_freq reported by
> intel_pstate doesn't matter for schedutil after the new ->adjust_perf
> callback has been added, because that doesn't even use the frequency.
> 
> So, as a long-term remedy, it may just be better to implement
> ->adjust_perf in acpi_cpufreq().

Thanks for pointing this out.

I agree that in the long term adding ->adjust_perf to acpi_cpufreq is
the best solution.

Yet for this submission, considering it's late in the 5.11 cycle,
the patch I propose is a reasonable candidate: the footprint is small and
it's gone through a fair amount of testing.


Thanks,
Giovanni

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula
  2021-02-03  8:39       ` Giovanni Gherdovich
@ 2021-02-03 13:40         ` Rafael J. Wysocki
  0 siblings, 0 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2021-02-03 13:40 UTC (permalink / raw)
  To: Giovanni Gherdovich
  Cc: Rafael J. Wysocki, Viresh Kumar, Borislav Petkov, Ingo Molnar,
	Peter Zijlstra, Rafael J . Wysocki, Jon Grimm, Nathan Fontenot,
	Yazen Ghannam, Thomas Lendacky, Suthikulpanit Suravee,
	Mel Gorman, Pu Wen, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Michael Larabel, the arch/x86 maintainers,
	Linux PM, Linux Kernel Mailing List, ACPI Devel Maling List

On Wed, Feb 3, 2021 at 9:39 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
>
> Hello,
>
> both Rafael and Viresh make a similar remark: why adding a new "max_boost"
> variable, since "max_freq" is already available and could be used instead.
>
> Replying here to both.
>
> On Tue, 2021-02-02 at 20:26 +0100, Rafael J. Wysocki wrote:
> > On Tue, Feb 2, 2021 at 7:59 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Fri, Jan 22, 2021 at 9:47 PM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
> > >
> > > [cut]
> > > > @@ -779,15 +829,25 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > > >                     freq_table[valid_states-1].frequency / 1000)
> > > >                         continue;
> > > >
> > > > +               freq = perf->states[i].core_frequency * 1000;
> > > >                 freq_table[valid_states].driver_data = i;
> > > > -               freq_table[valid_states].frequency =
> > > > -                   perf->states[i].core_frequency * 1000;
> > > > +               freq_table[valid_states].frequency = freq;
> > > > +
> > > > +               if (freq > max_freq)
> > > > +                       max_freq = freq;
> > > > +
> > > >                 valid_states++;
> > > >         }
> > > >         freq_table[valid_states].frequency = CPUFREQ_TABLE_END;
> > > >         policy->freq_table = freq_table;
> > > >         perf->state = 0;
> > > >
> > > > +       if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> > > > +           amd_max_boost(max_freq, &max_boost)) {
> > > > +               policy->cpuinfo.max_boost = max_boost;
> > >
> > > Why not to set max_freq to max_boost instead?
> >
> > I mean, would setting the frequency in the last table entry to max_boost work?
> >
> > Alternatively, one more (artificial) entry with the frequency equal to
> > max_boost could be added.
>
> On Wed, 2021-02-03 at 11:34 +0530, Viresh Kumar wrote:
> > [cut]
> >
> > On 22-01-21, 21:40, Giovanni Gherdovich wrote:
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index 6931f0cdeb80..541f3db3f576 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -159,8 +159,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> > >                               unsigned long util, unsigned long max)
> > >  {
> > >     struct cpufreq_policy *policy = sg_policy->policy;
> > > -   unsigned int freq = arch_scale_freq_invariant() ?
> > > -                           policy->cpuinfo.max_freq : policy->cur;
> > > +   unsigned int freq, max_freq;
> > > +
> > > +   max_freq = cpufreq_driver_has_max_boost() ?
> > > +                   policy->cpuinfo.max_boost : policy->cpuinfo.max_freq;
> >
> > Also, can't we update max_freq itself from the cpufreq driver? What
> > troubles will it cost ?
>
> I could add the max boost frequency to the frequency table (and
> policy->cpuinfo.max_freq would follow), yes, but that would trigger the
> following warning from acpi-cpufreq.c:
>
> static void acpi_cpufreq_cpu_ready(struct cpufreq_policy *policy)
> {
>         struct acpi_processor_performance *perf = per_cpu_ptr(acpi_perf_data,
>                                                               policy->cpu);
>
>         if (perf->states[0].core_frequency * 1000 != policy->cpuinfo.max_freq)
>                 pr_warn(FW_WARN "P-state 0 is not max freq\n");
> }

This check can be changed, though.

> so I thought that to stay out of troubles I'd supply a different variable,
> max_boost, to be used only in the schedutil formula.

Which is not necessary and the extra static branch is not necessary.

Moreover, there is no reason whatsoever to believe that EPYC is the
only affected processor model.  If I'm not mistaken, the regression
will be visible on every CPU where the scale invariance algorithm uses
the max frequency greater than the max frequency used acpi_cpufreq.

Also, AFAICS, it should be sufficient to modify acpi_cpufreq to remedy
this for all of the affected CPUs, not just EPYC.

> After schedutil figures out a desired next_freq then the usual comparison with the
> firmware-supplied frequency table could take place.
>
> Altering the frequency table seemed more invasive because once a freq value is
> in there, it's going to be actually requested by the driver to the platform.

This need not be the case if the control structure for the new entry
is copied from an existing one.

> I only want my max_boost to stretch the range of schedutil's next_freq.

Right, but that can be done in a different way which would be cleaner too IMO.

I'm going to send an alternative patch to fix this problem.

> On Tue, 2021-02-02 at 19:59 +0100, Rafael J. Wysocki wrote:
> >
> > [cut]
> > Also notice that the static branch is global and the max_boost value
> > for different CPUs may be different, at least in theory.
>
> In theory yes, but I'm guarding the code with two conditions:
>
> * vendor is X86_VENDOR_AMD
> * cppc_get_perf_caps() returns success
>
> this identifies AMD EPYC cpus with model 7xx2 and later, where max_boost is
> the same on all cores. I may have added synchronization so that only one cpu
> sets the value, but I didn't in the interest of simplicity for an -rc patch
> (I'd have to consider hotplug, the maxcpus= command line param, ecc).

And what about the other potentially affected processors?

I wouldn't worry about the -rc time frame too much.  If we can do a
better fix now, let's do it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-02-03 13:42 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-22 20:40 [PATCH v2 0/1] AMD EPYC: fix schedutil perf regression (freq-invariance) Giovanni Gherdovich
2021-01-22 20:40 ` [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula Giovanni Gherdovich
2021-01-25 10:04   ` Peter Zijlstra
2021-01-26  9:28     ` Giovanni Gherdovich
2021-01-26 10:02       ` Peter Zijlstra
2021-02-02 18:45       ` Rafael J. Wysocki
2021-02-02 19:11         ` Rafael J. Wysocki
2021-02-03  9:56           ` Giovanni Gherdovich
2021-02-02 18:40     ` Rafael J. Wysocki
2021-01-25 10:06   ` Peter Zijlstra
2021-01-26  9:09     ` Giovanni Gherdovich
2021-01-26  9:31       ` Mel Gorman
2021-01-26 10:05         ` Peter Zijlstra
     [not found]           ` <1611933781.15858.48.camel@suse.cz>
2021-02-02 14:17             ` Giovanni Gherdovich
2021-02-02 18:21               ` Peter Zijlstra
2021-02-02 18:29                 ` Rafael J. Wysocki
2021-02-02 19:00                   ` Rafael J. Wysocki
2021-02-02 18:59   ` Rafael J. Wysocki
2021-02-02 19:26     ` Rafael J. Wysocki
2021-02-03  8:39       ` Giovanni Gherdovich
2021-02-03 13:40         ` Rafael J. Wysocki
2021-02-03  9:12     ` Giovanni Gherdovich
2021-02-03  6:04   ` Viresh Kumar
2021-01-24 22:30 ` [PATCH v2 0/1] AMD EPYC: fix schedutil perf regression (freq-invariance) Michael Larabel
2021-01-25  8:34   ` Peter Zijlstra
2021-01-26  9:01     ` Giovanni Gherdovich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).