[RFT,v1] cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known
diff mbox series

Message ID 1974978.nRy8TqEeLZ@kreacher
State Accepted
Commit 538b0188da4653b9f4511a114f014354fb6fb7a5
Headers show
Series
  • [RFT,v1] cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known
Related show

Commit Message

Rafael J. Wysocki Feb. 15, 2021, 7:24 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover
boost frequencies") attempted to address a performance issue involving
acpi-cpufreq, the schedutil governor and scale-invariance on x86 by
extending the frequency tables created by acpi-cpufreq to cover the
entire range of "turbo" (or "boost") frequencies, but that caused
frequencies reported via /proc/cpuinfo and the scaling_cur_freq
attribute in sysfs to change which may confuse users and monitoring
tools.

For this reason, revert the part of commit 3c55e94c0ade adding the
extra entry to the frequency table and use the observation that
in principle cpuinfo.max_freq need not be equal to the maximum
frequency listed in the frequency table for the given policy.

Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq
drivers to set their own cpuinfo.max_freq above that frequency and
change  acpi-cpufreq to set cpuinfo.max_freq to the maximum boost
frequency found via CPPC.

This should be sufficient to let all of the cpufreq subsystem know
the real maximum frequency of the CPU without changing frequency
reporting.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305
Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies")
Reported-by: Matt McDonald <gardotd426@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

Michael, Giovanni,

The fix for the EPYC performance regression that was merged into 5.11 introduced
an undesirable side-effect by distorting the CPU frequency reporting via
/proc/cpuinfo and scaling_cur_freq (see the BZ link above for details).

The patch below is reported to address this problem and it should still allow
schedutil to achieve desirable performance, because it simply sets
cpuinfo.max_freq without extending the frequency table of the CPU.

Please test this one and let me know if it adversely affects performance.

Thanks!

---
 drivers/cpufreq/acpi-cpufreq.c |   62 ++++++++++-------------------------------
 drivers/cpufreq/freq_table.c   |    8 ++++-
 2 files changed, 23 insertions(+), 47 deletions(-)

Comments

Michael Larabel Feb. 16, 2021, 1:49 a.m. UTC | #1
On 2/15/21 1:24 PM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover
> boost frequencies") attempted to address a performance issue involving
> acpi-cpufreq, the schedutil governor and scale-invariance on x86 by
> extending the frequency tables created by acpi-cpufreq to cover the
> entire range of "turbo" (or "boost") frequencies, but that caused
> frequencies reported via /proc/cpuinfo and the scaling_cur_freq
> attribute in sysfs to change which may confuse users and monitoring
> tools.
>
> For this reason, revert the part of commit 3c55e94c0ade adding the
> extra entry to the frequency table and use the observation that
> in principle cpuinfo.max_freq need not be equal to the maximum
> frequency listed in the frequency table for the given policy.
>
> Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq
> drivers to set their own cpuinfo.max_freq above that frequency and
> change  acpi-cpufreq to set cpuinfo.max_freq to the maximum boost
> frequency found via CPPC.
>
> This should be sufficient to let all of the cpufreq subsystem know
> the real maximum frequency of the CPU without changing frequency
> reporting.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305
> Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies")
> Reported-by: Matt McDonald <gardotd426@gmail.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> Michael, Giovanni,
>
> The fix for the EPYC performance regression that was merged into 5.11 introduced
> an undesirable side-effect by distorting the CPU frequency reporting via
> /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details).
>
> The patch below is reported to address this problem and it should still allow
> schedutil to achieve desirable performance, because it simply sets
> cpuinfo.max_freq without extending the frequency table of the CPU.
>
> Please test this one and let me know if it adversely affects performance.
>
> Thanks!


When carrying out tests so far today on an EPYC 7F72 2P and Ryzen 9 
5900X with workloads seeing impact from the prior patches, everything is 
looking good when comparing v5.11 to v5.11 + this patch. Not seeing any 
measurable difference on either of those systems as a result of this patch.

Running some additional tests and on a few more boxes that should wrap 
up tomorrow but at least so far the patch isn't showing any measurable 
changes to performance.

Michael


>
> ---
>   drivers/cpufreq/acpi-cpufreq.c |   62 ++++++++++-------------------------------
>   drivers/cpufreq/freq_table.c   |    8 ++++-
>   2 files changed, 23 insertions(+), 47 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
> +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
> @@ -54,7 +54,6 @@ struct acpi_cpufreq_data {
>   	unsigned int resume;
>   	unsigned int cpu_feature;
>   	unsigned int acpi_perf_cpu;
> -	unsigned int first_perf_state;
>   	cpumask_var_t freqdomain_cpus;
>   	void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val);
>   	u32 (*cpu_freq_read)(struct acpi_pct_register *reg);
> @@ -223,10 +222,10 @@ static unsigned extract_msr(struct cpufr
>   
>   	perf = to_perf_data(data);
>   
> -	cpufreq_for_each_entry(pos, policy->freq_table + data->first_perf_state)
> +	cpufreq_for_each_entry(pos, policy->freq_table)
>   		if (msr == perf->states[pos->driver_data].status)
>   			return pos->frequency;
> -	return policy->freq_table[data->first_perf_state].frequency;
> +	return policy->freq_table[0].frequency;
>   }
>   
>   static unsigned extract_freq(struct cpufreq_policy *policy, u32 val)
> @@ -365,7 +364,6 @@ static unsigned int get_cur_freq_on_cpu(
>   	struct cpufreq_policy *policy;
>   	unsigned int freq;
>   	unsigned int cached_freq;
> -	unsigned int state;
>   
>   	pr_debug("%s (%d)\n", __func__, cpu);
>   
> @@ -377,11 +375,7 @@ static unsigned int get_cur_freq_on_cpu(
>   	if (unlikely(!data || !policy->freq_table))
>   		return 0;
>   
> -	state = to_perf_data(data)->state;
> -	if (state < data->first_perf_state)
> -		state = data->first_perf_state;
> -
> -	cached_freq = policy->freq_table[state].frequency;
> +	cached_freq = policy->freq_table[to_perf_data(data)->state].frequency;
>   	freq = extract_freq(policy, get_cur_val(cpumask_of(cpu), data));
>   	if (freq != cached_freq) {
>   		/*
> @@ -680,7 +674,6 @@ static int acpi_cpufreq_cpu_init(struct
>   	struct cpuinfo_x86 *c = &cpu_data(cpu);
>   	unsigned int valid_states = 0;
>   	unsigned int result = 0;
> -	unsigned int state_count;
>   	u64 max_boost_ratio;
>   	unsigned int i;
>   #ifdef CONFIG_SMP
> @@ -795,28 +788,8 @@ static int acpi_cpufreq_cpu_init(struct
>   		goto err_unreg;
>   	}
>   
> -	state_count = perf->state_count + 1;
> -
> -	max_boost_ratio = get_max_boost_ratio(cpu);
> -	if (max_boost_ratio) {
> -		/*
> -		 * Make a room for one more entry to represent the highest
> -		 * available "boost" frequency.
> -		 */
> -		state_count++;
> -		valid_states++;
> -		data->first_perf_state = valid_states;
> -	} else {
> -		/*
> -		 * If the maximum "boost" frequency is unknown, ask the arch
> -		 * scale-invariance code to use the "nominal" performance for
> -		 * CPU utilization scaling so as to prevent the schedutil
> -		 * governor from selecting inadequate CPU frequencies.
> -		 */
> -		arch_set_max_freq_ratio(true);
> -	}
> -
> -	freq_table = kcalloc(state_count, sizeof(*freq_table), GFP_KERNEL);
> +	freq_table = kcalloc(perf->state_count + 1, sizeof(*freq_table),
> +			     GFP_KERNEL);
>   	if (!freq_table) {
>   		result = -ENOMEM;
>   		goto err_unreg;
> @@ -851,27 +824,25 @@ static int acpi_cpufreq_cpu_init(struct
>   	}
>   	freq_table[valid_states].frequency = CPUFREQ_TABLE_END;
>   
> +	max_boost_ratio = get_max_boost_ratio(cpu);
>   	if (max_boost_ratio) {
> -		unsigned int state = data->first_perf_state;
> -		unsigned int freq = freq_table[state].frequency;
> +		unsigned int freq = freq_table[0].frequency;
>   
>   		/*
>   		 * Because the loop above sorts the freq_table entries in the
>   		 * descending order, freq is the maximum frequency in the table.
>   		 * Assume that it corresponds to the CPPC nominal frequency and
> -		 * use it to populate the frequency field of the extra "boost"
> -		 * frequency entry.
> +		 * use it to set cpuinfo.max_freq.
>   		 */
> -		freq_table[0].frequency = freq * max_boost_ratio >> SCHED_CAPACITY_SHIFT;
> +		policy->cpuinfo.max_freq = freq * max_boost_ratio >> SCHED_CAPACITY_SHIFT;
> +	} else {
>   		/*
> -		 * The purpose of the extra "boost" frequency entry is to make
> -		 * the rest of cpufreq aware of the real maximum frequency, but
> -		 * the way to request it is the same as for the first_perf_state
> -		 * entry that is expected to cover the entire range of "boost"
> -		 * frequencies of the CPU, so copy the driver_data value from
> -		 * that entry.
> +		 * If the maximum "boost" frequency is unknown, ask the arch
> +		 * scale-invariance code to use the "nominal" performance for
> +		 * CPU utilization scaling so as to prevent the schedutil
> +		 * governor from selecting inadequate CPU frequencies.
>   		 */
> -		freq_table[0].driver_data = freq_table[state].driver_data;
> +		arch_set_max_freq_ratio(true);
>   	}
>   
>   	policy->freq_table = freq_table;
> @@ -947,8 +918,7 @@ static void acpi_cpufreq_cpu_ready(struc
>   {
>   	struct acpi_processor_performance *perf = per_cpu_ptr(acpi_perf_data,
>   							      policy->cpu);
> -	struct acpi_cpufreq_data *data = policy->driver_data;
> -	unsigned int freq = policy->freq_table[data->first_perf_state].frequency;
> +	unsigned int freq = policy->freq_table[0].frequency;
>   
>   	if (perf->states[0].core_frequency * 1000 != freq)
>   		pr_warn(FW_WARN "P-state 0 is not max freq\n");
> Index: linux-pm/drivers/cpufreq/freq_table.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/freq_table.c
> +++ linux-pm/drivers/cpufreq/freq_table.c
> @@ -52,7 +52,13 @@ int cpufreq_frequency_table_cpuinfo(stru
>   	}
>   
>   	policy->min = policy->cpuinfo.min_freq = min_freq;
> -	policy->max = policy->cpuinfo.max_freq = max_freq;
> +	policy->max = max_freq;
> +	/*
> +	 * If the driver has set its own cpuinfo.max_freq above max_freq, leave
> +	 * it as is.
> +	 */
> +	if (policy->cpuinfo.max_freq < max_freq)
> +		policy->max = policy->cpuinfo.max_freq = max_freq;
>   
>   	if (policy->min == ~0)
>   		return -EINVAL;
>
>
>
Giovanni Gherdovich Feb. 16, 2021, 3:02 p.m. UTC | #2
On Mon, 2021-02-15 at 20:24 +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover
> boost frequencies") attempted to address a performance issue involving
> acpi-cpufreq, the schedutil governor and scale-invariance on x86 by
> extending the frequency tables created by acpi-cpufreq to cover the
> entire range of "turbo" (or "boost") frequencies, but that caused
> frequencies reported via /proc/cpuinfo and the scaling_cur_freq
> attribute in sysfs to change which may confuse users and monitoring
> tools.
> 
> For this reason, revert the part of commit 3c55e94c0ade adding the
> extra entry to the frequency table and use the observation that
> in principle cpuinfo.max_freq need not be equal to the maximum
> frequency listed in the frequency table for the given policy.
> 
> Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq
> drivers to set their own cpuinfo.max_freq above that frequency and
> change  acpi-cpufreq to set cpuinfo.max_freq to the maximum boost
> frequency found via CPPC.
> 
> This should be sufficient to let all of the cpufreq subsystem know
> the real maximum frequency of the CPU without changing frequency
> reporting.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305
> Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies")
> Reported-by: Matt McDonald <gardotd426@gmail.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> Michael, Giovanni,
> 
> The fix for the EPYC performance regression that was merged into 5.11 introduced
> an undesirable side-effect by distorting the CPU frequency reporting via
> /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details).
> 
> The patch below is reported to address this problem and it should still allow
> schedutil to achieve desirable performance, because it simply sets
> cpuinfo.max_freq without extending the frequency table of the CPU.
> 
> Please test this one and let me know if it adversely affects performance.
> 
> Thanks!
> 
> ---
>  drivers/cpufreq/acpi-cpufreq.c |   62 ++++++++++-------------------------------
>  drivers/cpufreq/freq_table.c   |    8 ++++-
>  2 files changed, 23 insertions(+), 47 deletions(-)

Hello Rafael,

I've run the quick image processing test below and the performance is in line
with v5.11. I'll send some more results as longer tests complete.

TEST        : Intel Open Image Denoise, www.openimagedenoise.org
INVOCATION  : ./denoise -hdr memorial.pfm -out out.pfm -bench 200 -threads $NTHREADS
CPU         : MODEL            : 2x AMD EPYC 7742
              FREQUENCY TABLE  : P2: 1.50 GHz
                                 P1: 2.00 GHz
				 P0: 2.25 GHz
              MAX BOOST        :     3.40 GHz

Results: threads, msecs (ratio). Lower is better.

              v5.11          v5.11-patch
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      1   1071.43 (1.00)   1068.57 (1.00)
      2    541.50 (1.00)    542.26 (1.00)
      4    276.38 (1.00)    276.96 (1.00)
      8    149.51 (1.00)    149.24 (1.00)
     16     78.57 (1.00)     78.57 (1.00)
     24     57.59 (1.00)     57.67 (1.00)
     32     46.40 (1.00)     46.30 (1.00)
     48     37.48 (1.00)     38.28 (1.02)
     64     33.18 (1.00)     33.69 (1.02)
     80     30.73 (1.00)     31.24 (1.02)
     96     28.06 (1.00)     28.79 (1.03)
    112     27.82 (1.00)     28.14 (1.01)
    120     28.33 (1.00)     29.16 (1.03)
    128     28.44 (1.00)     28.35 (1.00)


Giovanni
Giovanni Gherdovich Feb. 17, 2021, 10:45 a.m. UTC | #3
On Mon, 2021-02-15 at 20:24 +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover
> boost frequencies") attempted to address a performance issue involving
> acpi-cpufreq, the schedutil governor and scale-invariance on x86 by
> extending the frequency tables created by acpi-cpufreq to cover the
> entire range of "turbo" (or "boost") frequencies, but that caused
> frequencies reported via /proc/cpuinfo and the scaling_cur_freq
> attribute in sysfs to change which may confuse users and monitoring
> tools.
> 
> For this reason, revert the part of commit 3c55e94c0ade adding the
> extra entry to the frequency table and use the observation that
> in principle cpuinfo.max_freq need not be equal to the maximum
> frequency listed in the frequency table for the given policy.
> 
> Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq
> drivers to set their own cpuinfo.max_freq above that frequency and
> change  acpi-cpufreq to set cpuinfo.max_freq to the maximum boost
> frequency found via CPPC.
> 
> This should be sufficient to let all of the cpufreq subsystem know
> the real maximum frequency of the CPU without changing frequency
> reporting.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305
> Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies")
> Reported-by: Matt McDonald <gardotd426@gmail.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> Michael, Giovanni,
> 
> The fix for the EPYC performance regression that was merged into 5.11 introduced
> an undesirable side-effect by distorting the CPU frequency reporting via
> /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details).
> 
> The patch below is reported to address this problem and it should still allow
> schedutil to achieve desirable performance, because it simply sets
> cpuinfo.max_freq without extending the frequency table of the CPU.
> 
> Please test this one and let me know if it adversely affects performance.
> 
> Thanks!

Hello Rafael,

more extended testing confirms the initial feeling; performance with this
patch is mostly identical to vanilla v5.11. Tbench shows an improvement.

Thanks for the fix!

Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>

Results follow. The machine has two sockets with an AMD EPYC 7742 each.
The governor is always schedutil.


Ratios of time, lower is better:
					    v5.11     v5.11
					   vanilla    patch
    - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    NASA Parallel Benchmarks w/ MPI         1.00      0.96
    NASA Parallel Benchmarks w/ OpenMP      1.00      ~
    dbench on XFS                           1.00      ~
    Linux kernel compilation                1.00      ~
    git unit test suite                     1.00      ~


Ratio of throughput, higher is better:
					    v5.11     v5.11
					   vanilla    patch
    - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    tbench on localhost                     1.00      1.09


Tilde (~): no change wrt baseline.


Giovanni
Michael Larabel Feb. 17, 2021, 1:23 p.m. UTC | #4
On 2/15/21 7:49 PM, Michael Larabel wrote:
>
> On 2/15/21 1:24 PM, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover
>> boost frequencies") attempted to address a performance issue involving
>> acpi-cpufreq, the schedutil governor and scale-invariance on x86 by
>> extending the frequency tables created by acpi-cpufreq to cover the
>> entire range of "turbo" (or "boost") frequencies, but that caused
>> frequencies reported via /proc/cpuinfo and the scaling_cur_freq
>> attribute in sysfs to change which may confuse users and monitoring
>> tools.
>>
>> For this reason, revert the part of commit 3c55e94c0ade adding the
>> extra entry to the frequency table and use the observation that
>> in principle cpuinfo.max_freq need not be equal to the maximum
>> frequency listed in the frequency table for the given policy.
>>
>> Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq
>> drivers to set their own cpuinfo.max_freq above that frequency and
>> change  acpi-cpufreq to set cpuinfo.max_freq to the maximum boost
>> frequency found via CPPC.
>>
>> This should be sufficient to let all of the cpufreq subsystem know
>> the real maximum frequency of the CPU without changing frequency
>> reporting.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305
>> Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover 
>> boost frequencies")
>> Reported-by: Matt McDonald <gardotd426@gmail.com>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> ---
>>
>> Michael, Giovanni,
>>
>> The fix for the EPYC performance regression that was merged into 5.11 
>> introduced
>> an undesirable side-effect by distorting the CPU frequency reporting via
>> /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details).
>>
>> The patch below is reported to address this problem and it should 
>> still allow
>> schedutil to achieve desirable performance, because it simply sets
>> cpuinfo.max_freq without extending the frequency table of the CPU.
>>
>> Please test this one and let me know if it adversely affects 
>> performance.
>>
>> Thanks!
>
>
> When carrying out tests so far today on an EPYC 7F72 2P and Ryzen 9 
> 5900X with workloads seeing impact from the prior patches, everything 
> is looking good when comparing v5.11 to v5.11 + this patch. Not seeing 
> any measurable difference on either of those systems as a result of 
> this patch.
>
> Running some additional tests and on a few more boxes that should wrap 
> up tomorrow but at least so far the patch isn't showing any measurable 
> changes to performance.
>
> Michael
>

Linux 5.11 + this patch is still looking fine on the mix of EPYC (Zen 2) 
and Ryzen (Zen 2/3) systems tried with the previous workloads. Not 
seeing any measurable change in performance from this new patch, so it's 
looking fine on my end.

Michael

Tested-by: Michael Larabel <Michael@phoronix.com>


>
>>
>> ---
>>   drivers/cpufreq/acpi-cpufreq.c |   62 
>> ++++++++++-------------------------------
>>   drivers/cpufreq/freq_table.c   |    8 ++++-
>>   2 files changed, 23 insertions(+), 47 deletions(-)
>>
>> Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
>> ===================================================================
>> --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
>> +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
>> @@ -54,7 +54,6 @@ struct acpi_cpufreq_data {
>>       unsigned int resume;
>>       unsigned int cpu_feature;
>>       unsigned int acpi_perf_cpu;
>> -    unsigned int first_perf_state;
>>       cpumask_var_t freqdomain_cpus;
>>       void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val);
>>       u32 (*cpu_freq_read)(struct acpi_pct_register *reg);
>> @@ -223,10 +222,10 @@ static unsigned extract_msr(struct cpufr
>>         perf = to_perf_data(data);
>>   -    cpufreq_for_each_entry(pos, policy->freq_table + 
>> data->first_perf_state)
>> +    cpufreq_for_each_entry(pos, policy->freq_table)
>>           if (msr == perf->states[pos->driver_data].status)
>>               return pos->frequency;
>> -    return policy->freq_table[data->first_perf_state].frequency;
>> +    return policy->freq_table[0].frequency;
>>   }
>>     static unsigned extract_freq(struct cpufreq_policy *policy, u32 val)
>> @@ -365,7 +364,6 @@ static unsigned int get_cur_freq_on_cpu(
>>       struct cpufreq_policy *policy;
>>       unsigned int freq;
>>       unsigned int cached_freq;
>> -    unsigned int state;
>>         pr_debug("%s (%d)\n", __func__, cpu);
>>   @@ -377,11 +375,7 @@ static unsigned int get_cur_freq_on_cpu(
>>       if (unlikely(!data || !policy->freq_table))
>>           return 0;
>>   -    state = to_perf_data(data)->state;
>> -    if (state < data->first_perf_state)
>> -        state = data->first_perf_state;
>> -
>> -    cached_freq = policy->freq_table[state].frequency;
>> +    cached_freq = 
>> policy->freq_table[to_perf_data(data)->state].frequency;
>>       freq = extract_freq(policy, get_cur_val(cpumask_of(cpu), data));
>>       if (freq != cached_freq) {
>>           /*
>> @@ -680,7 +674,6 @@ static int acpi_cpufreq_cpu_init(struct
>>       struct cpuinfo_x86 *c = &cpu_data(cpu);
>>       unsigned int valid_states = 0;
>>       unsigned int result = 0;
>> -    unsigned int state_count;
>>       u64 max_boost_ratio;
>>       unsigned int i;
>>   #ifdef CONFIG_SMP
>> @@ -795,28 +788,8 @@ static int acpi_cpufreq_cpu_init(struct
>>           goto err_unreg;
>>       }
>>   -    state_count = perf->state_count + 1;
>> -
>> -    max_boost_ratio = get_max_boost_ratio(cpu);
>> -    if (max_boost_ratio) {
>> -        /*
>> -         * Make a room for one more entry to represent the highest
>> -         * available "boost" frequency.
>> -         */
>> -        state_count++;
>> -        valid_states++;
>> -        data->first_perf_state = valid_states;
>> -    } else {
>> -        /*
>> -         * If the maximum "boost" frequency is unknown, ask the arch
>> -         * scale-invariance code to use the "nominal" performance for
>> -         * CPU utilization scaling so as to prevent the schedutil
>> -         * governor from selecting inadequate CPU frequencies.
>> -         */
>> -        arch_set_max_freq_ratio(true);
>> -    }
>> -
>> -    freq_table = kcalloc(state_count, sizeof(*freq_table), GFP_KERNEL);
>> +    freq_table = kcalloc(perf->state_count + 1, sizeof(*freq_table),
>> +                 GFP_KERNEL);
>>       if (!freq_table) {
>>           result = -ENOMEM;
>>           goto err_unreg;
>> @@ -851,27 +824,25 @@ static int acpi_cpufreq_cpu_init(struct
>>       }
>>       freq_table[valid_states].frequency = CPUFREQ_TABLE_END;
>>   +    max_boost_ratio = get_max_boost_ratio(cpu);
>>       if (max_boost_ratio) {
>> -        unsigned int state = data->first_perf_state;
>> -        unsigned int freq = freq_table[state].frequency;
>> +        unsigned int freq = freq_table[0].frequency;
>>             /*
>>            * Because the loop above sorts the freq_table entries in the
>>            * descending order, freq is the maximum frequency in the 
>> table.
>>            * Assume that it corresponds to the CPPC nominal frequency 
>> and
>> -         * use it to populate the frequency field of the extra "boost"
>> -         * frequency entry.
>> +         * use it to set cpuinfo.max_freq.
>>            */
>> -        freq_table[0].frequency = freq * max_boost_ratio >> 
>> SCHED_CAPACITY_SHIFT;
>> +        policy->cpuinfo.max_freq = freq * max_boost_ratio >> 
>> SCHED_CAPACITY_SHIFT;
>> +    } else {
>>           /*
>> -         * The purpose of the extra "boost" frequency entry is to make
>> -         * the rest of cpufreq aware of the real maximum frequency, but
>> -         * the way to request it is the same as for the 
>> first_perf_state
>> -         * entry that is expected to cover the entire range of "boost"
>> -         * frequencies of the CPU, so copy the driver_data value from
>> -         * that entry.
>> +         * If the maximum "boost" frequency is unknown, ask the arch
>> +         * scale-invariance code to use the "nominal" performance for
>> +         * CPU utilization scaling so as to prevent the schedutil
>> +         * governor from selecting inadequate CPU frequencies.
>>            */
>> -        freq_table[0].driver_data = freq_table[state].driver_data;
>> +        arch_set_max_freq_ratio(true);
>>       }
>>         policy->freq_table = freq_table;
>> @@ -947,8 +918,7 @@ static void acpi_cpufreq_cpu_ready(struc
>>   {
>>       struct acpi_processor_performance *perf = 
>> per_cpu_ptr(acpi_perf_data,
>>                                     policy->cpu);
>> -    struct acpi_cpufreq_data *data = policy->driver_data;
>> -    unsigned int freq = 
>> policy->freq_table[data->first_perf_state].frequency;
>> +    unsigned int freq = policy->freq_table[0].frequency;
>>         if (perf->states[0].core_frequency * 1000 != freq)
>>           pr_warn(FW_WARN "P-state 0 is not max freq\n");
>> Index: linux-pm/drivers/cpufreq/freq_table.c
>> ===================================================================
>> --- linux-pm.orig/drivers/cpufreq/freq_table.c
>> +++ linux-pm/drivers/cpufreq/freq_table.c
>> @@ -52,7 +52,13 @@ int cpufreq_frequency_table_cpuinfo(stru
>>       }
>>         policy->min = policy->cpuinfo.min_freq = min_freq;
>> -    policy->max = policy->cpuinfo.max_freq = max_freq;
>> +    policy->max = max_freq;
>> +    /*
>> +     * If the driver has set its own cpuinfo.max_freq above 
>> max_freq, leave
>> +     * it as is.
>> +     */
>> +    if (policy->cpuinfo.max_freq < max_freq)
>> +        policy->max = policy->cpuinfo.max_freq = max_freq;
>>         if (policy->min == ~0)
>>           return -EINVAL;
>>
>>
>>
Rafael J. Wysocki Feb. 17, 2021, 2:18 p.m. UTC | #5
On Wed, Feb 17, 2021 at 11:46 AM Giovanni Gherdovich
<ggherdovich@suse.cz> wrote:
>
> On Mon, 2021-02-15 at 20:24 +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover
> > boost frequencies") attempted to address a performance issue involving
> > acpi-cpufreq, the schedutil governor and scale-invariance on x86 by
> > extending the frequency tables created by acpi-cpufreq to cover the
> > entire range of "turbo" (or "boost") frequencies, but that caused
> > frequencies reported via /proc/cpuinfo and the scaling_cur_freq
> > attribute in sysfs to change which may confuse users and monitoring
> > tools.
> >
> > For this reason, revert the part of commit 3c55e94c0ade adding the
> > extra entry to the frequency table and use the observation that
> > in principle cpuinfo.max_freq need not be equal to the maximum
> > frequency listed in the frequency table for the given policy.
> >
> > Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq
> > drivers to set their own cpuinfo.max_freq above that frequency and
> > change  acpi-cpufreq to set cpuinfo.max_freq to the maximum boost
> > frequency found via CPPC.
> >
> > This should be sufficient to let all of the cpufreq subsystem know
> > the real maximum frequency of the CPU without changing frequency
> > reporting.
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305
> > Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies")
> > Reported-by: Matt McDonald <gardotd426@gmail.com>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > Michael, Giovanni,
> >
> > The fix for the EPYC performance regression that was merged into 5.11 introduced
> > an undesirable side-effect by distorting the CPU frequency reporting via
> > /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details).
> >
> > The patch below is reported to address this problem and it should still allow
> > schedutil to achieve desirable performance, because it simply sets
> > cpuinfo.max_freq without extending the frequency table of the CPU.
> >
> > Please test this one and let me know if it adversely affects performance.
> >
> > Thanks!
>
> Hello Rafael,
>
> more extended testing confirms the initial feeling; performance with this
> patch is mostly identical to vanilla v5.11.

Thank you!

> Tbench shows an improvement.

Interesting.

> Thanks for the fix!

YW

> Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
>
> Results follow. The machine has two sockets with an AMD EPYC 7742 each.
> The governor is always schedutil.
>
>
> Ratios of time, lower is better:
>                                             v5.11     v5.11
>                                            vanilla    patch
>     - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>     NASA Parallel Benchmarks w/ MPI         1.00      0.96
>     NASA Parallel Benchmarks w/ OpenMP      1.00      ~
>     dbench on XFS                           1.00      ~
>     Linux kernel compilation                1.00      ~
>     git unit test suite                     1.00      ~
>
>
> Ratio of throughput, higher is better:
>                                             v5.11     v5.11
>                                            vanilla    patch
>     - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>     tbench on localhost                     1.00      1.09
>
>
> Tilde (~): no change wrt baseline.

Thanks again!

It would be good to hear from Michael too, but this is already
sufficient for me to queue up the patch for 5.12-rc.

Cheers!

Patch
diff mbox series

Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -54,7 +54,6 @@  struct acpi_cpufreq_data {
 	unsigned int resume;
 	unsigned int cpu_feature;
 	unsigned int acpi_perf_cpu;
-	unsigned int first_perf_state;
 	cpumask_var_t freqdomain_cpus;
 	void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val);
 	u32 (*cpu_freq_read)(struct acpi_pct_register *reg);
@@ -223,10 +222,10 @@  static unsigned extract_msr(struct cpufr
 
 	perf = to_perf_data(data);
 
-	cpufreq_for_each_entry(pos, policy->freq_table + data->first_perf_state)
+	cpufreq_for_each_entry(pos, policy->freq_table)
 		if (msr == perf->states[pos->driver_data].status)
 			return pos->frequency;
-	return policy->freq_table[data->first_perf_state].frequency;
+	return policy->freq_table[0].frequency;
 }
 
 static unsigned extract_freq(struct cpufreq_policy *policy, u32 val)
@@ -365,7 +364,6 @@  static unsigned int get_cur_freq_on_cpu(
 	struct cpufreq_policy *policy;
 	unsigned int freq;
 	unsigned int cached_freq;
-	unsigned int state;
 
 	pr_debug("%s (%d)\n", __func__, cpu);
 
@@ -377,11 +375,7 @@  static unsigned int get_cur_freq_on_cpu(
 	if (unlikely(!data || !policy->freq_table))
 		return 0;
 
-	state = to_perf_data(data)->state;
-	if (state < data->first_perf_state)
-		state = data->first_perf_state;
-
-	cached_freq = policy->freq_table[state].frequency;
+	cached_freq = policy->freq_table[to_perf_data(data)->state].frequency;
 	freq = extract_freq(policy, get_cur_val(cpumask_of(cpu), data));
 	if (freq != cached_freq) {
 		/*
@@ -680,7 +674,6 @@  static int acpi_cpufreq_cpu_init(struct
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 	unsigned int valid_states = 0;
 	unsigned int result = 0;
-	unsigned int state_count;
 	u64 max_boost_ratio;
 	unsigned int i;
 #ifdef CONFIG_SMP
@@ -795,28 +788,8 @@  static int acpi_cpufreq_cpu_init(struct
 		goto err_unreg;
 	}
 
-	state_count = perf->state_count + 1;
-
-	max_boost_ratio = get_max_boost_ratio(cpu);
-	if (max_boost_ratio) {
-		/*
-		 * Make a room for one more entry to represent the highest
-		 * available "boost" frequency.
-		 */
-		state_count++;
-		valid_states++;
-		data->first_perf_state = valid_states;
-	} else {
-		/*
-		 * If the maximum "boost" frequency is unknown, ask the arch
-		 * scale-invariance code to use the "nominal" performance for
-		 * CPU utilization scaling so as to prevent the schedutil
-		 * governor from selecting inadequate CPU frequencies.
-		 */
-		arch_set_max_freq_ratio(true);
-	}
-
-	freq_table = kcalloc(state_count, sizeof(*freq_table), GFP_KERNEL);
+	freq_table = kcalloc(perf->state_count + 1, sizeof(*freq_table),
+			     GFP_KERNEL);
 	if (!freq_table) {
 		result = -ENOMEM;
 		goto err_unreg;
@@ -851,27 +824,25 @@  static int acpi_cpufreq_cpu_init(struct
 	}
 	freq_table[valid_states].frequency = CPUFREQ_TABLE_END;
 
+	max_boost_ratio = get_max_boost_ratio(cpu);
 	if (max_boost_ratio) {
-		unsigned int state = data->first_perf_state;
-		unsigned int freq = freq_table[state].frequency;
+		unsigned int freq = freq_table[0].frequency;
 
 		/*
 		 * Because the loop above sorts the freq_table entries in the
 		 * descending order, freq is the maximum frequency in the table.
 		 * Assume that it corresponds to the CPPC nominal frequency and
-		 * use it to populate the frequency field of the extra "boost"
-		 * frequency entry.
+		 * use it to set cpuinfo.max_freq.
 		 */
-		freq_table[0].frequency = freq * max_boost_ratio >> SCHED_CAPACITY_SHIFT;
+		policy->cpuinfo.max_freq = freq * max_boost_ratio >> SCHED_CAPACITY_SHIFT;
+	} else {
 		/*
-		 * The purpose of the extra "boost" frequency entry is to make
-		 * the rest of cpufreq aware of the real maximum frequency, but
-		 * the way to request it is the same as for the first_perf_state
-		 * entry that is expected to cover the entire range of "boost"
-		 * frequencies of the CPU, so copy the driver_data value from
-		 * that entry.
+		 * If the maximum "boost" frequency is unknown, ask the arch
+		 * scale-invariance code to use the "nominal" performance for
+		 * CPU utilization scaling so as to prevent the schedutil
+		 * governor from selecting inadequate CPU frequencies.
 		 */
-		freq_table[0].driver_data = freq_table[state].driver_data;
+		arch_set_max_freq_ratio(true);
 	}
 
 	policy->freq_table = freq_table;
@@ -947,8 +918,7 @@  static void acpi_cpufreq_cpu_ready(struc
 {
 	struct acpi_processor_performance *perf = per_cpu_ptr(acpi_perf_data,
 							      policy->cpu);
-	struct acpi_cpufreq_data *data = policy->driver_data;
-	unsigned int freq = policy->freq_table[data->first_perf_state].frequency;
+	unsigned int freq = policy->freq_table[0].frequency;
 
 	if (perf->states[0].core_frequency * 1000 != freq)
 		pr_warn(FW_WARN "P-state 0 is not max freq\n");
Index: linux-pm/drivers/cpufreq/freq_table.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/freq_table.c
+++ linux-pm/drivers/cpufreq/freq_table.c
@@ -52,7 +52,13 @@  int cpufreq_frequency_table_cpuinfo(stru
 	}
 
 	policy->min = policy->cpuinfo.min_freq = min_freq;
-	policy->max = policy->cpuinfo.max_freq = max_freq;
+	policy->max = max_freq;
+	/*
+	 * If the driver has set its own cpuinfo.max_freq above max_freq, leave
+	 * it as is.
+	 */
+	if (policy->cpuinfo.max_freq < max_freq)
+		policy->max = policy->cpuinfo.max_freq = max_freq;
 
 	if (policy->min == ~0)
 		return -EINVAL;