[v2,2/5] cpufreq: intel_pstate: Always return last EPP value from sysfs
diff mbox series

Message ID 2064342.aRc67yb0pC@kreacher
State Superseded
Headers show
Series
  • cpufreq: intel_pstate: Address some HWP-related oddities
Related show

Commit Message

Rafael J. Wysocki Aug. 24, 2020, 5:42 p.m. UTC
From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>

Make the energy_performance_preference policy attribute in sysfs
always return the last EPP value written to it instead of the one
currently in the HWP Request MSR to avoid possible confusion when
the performance scaling algorithm is used in the active mode with
HWP enabled (in which case the EPP is forced to 0 regardless of
what value it has been set to via sysfs).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

-> v2: No changes.

---
 drivers/cpufreq/intel_pstate.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

Comments

Artem Bityutskiy Aug. 25, 2020, 6:20 a.m. UTC | #1
On Mon, 2020-08-24 at 19:42 +0200, Rafael J. Wysocki wrote:
> From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
> 
> Make the energy_performance_preference policy attribute in sysfs
> always return the last EPP value written to it instead of the one
> currently in the HWP Request MSR to avoid possible confusion when
> the performance scaling algorithm is used in the active mode with
> HWP enabled (in which case the EPP is forced to 0 regardless of
> what value it has been set to via sysfs).

Why is this a good idea, I wonder. If there was a prior discussion,
please, point to it.

The general approach to changing settings via sysfs is often like this:

1. Write new value.
2. Read it back and verify that it is the same. Because there is no
better way to verify that the kernel "accepted" the value.

Let's say I write 'balanced' to energy_performance_preference. I read
it back, and it contains 'balanced', so I am happy, I trust the kernel
changed EPP to "balanced".

If the kernel, in fact, uses something else, I want to know about it
and have my script fail. Why caching the value and making my script
_think_ it succeeded is a good idea.

In other words, in my usage scenarios at list, I prefer kernel telling
the true EPP value, not some "cached, but not used" value.

Artem.
Rafael J. Wysocki Aug. 25, 2020, 2:51 p.m. UTC | #2
On Tue, Aug 25, 2020 at 8:20 AM Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> On Mon, 2020-08-24 at 19:42 +0200, Rafael J. Wysocki wrote:
> > From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
> >
> > Make the energy_performance_preference policy attribute in sysfs
> > always return the last EPP value written to it instead of the one
> > currently in the HWP Request MSR to avoid possible confusion when
> > the performance scaling algorithm is used in the active mode with
> > HWP enabled (in which case the EPP is forced to 0 regardless of
> > what value it has been set to via sysfs).
>
> Why is this a good idea, I wonder. If there was a prior discussion,
> please, point to it.
>
> The general approach to changing settings via sysfs is often like this:
>
> 1. Write new value.
> 2. Read it back and verify that it is the same. Because there is no
> better way to verify that the kernel "accepted" the value.

If the write is successful (ie. no errors returned and the value
returned is equal to the number of written characters), the kernel
*has* accepted the written value, but it may not have taken effect.
These are two different things.

The written value may take an effect immediately or it may take an
effect later, depending on the current configuration etc.  If you
don't see the effect of it immediately, it doesn't matter that there
was a failure of some sort.

> Let's say I write 'balanced' to energy_performance_preference. I read
> it back, and it contains 'balanced', so I am happy, I trust the kernel
> changed EPP to "balanced".
>
> If the kernel, in fact, uses something else, I want to know about it
> and have my script fail.

Why do you want it to fail then?

> Why caching the value and making my script _think_ it succeeded is a good idea.

Because when you change the scaling algorithm or the driver's
operation mode, the value you have written will take effect.

In this particular case it is explained in the driver documentation
that the performance scaling algorithm in the active mode overrides
the sysfs value and that's the only case when it can be overridden.
So whatever you write to this attribute will not take effect
immediately anyway, but it may take an effect later.

> In other words, in my usage scenarios at list, I prefer kernel telling
> the true EPP value, not some "cached, but not used" value.

An alternative is to fail writes to energy_performance_preference if
the driver works in the active mode and the scaling algorithm for the
scaling CPU is performance and *then* to make reads from it return the
value in the register.

Accepting a write and returning a different value in a subsequent read
is confusing.

Thanks!
Srinivas Pandruvada Aug. 25, 2020, 3:06 p.m. UTC | #3
On Tue, 2020-08-25 at 16:51 +0200, Rafael J. Wysocki wrote:
> On Tue, Aug 25, 2020 at 8:20 AM Artem Bityutskiy <dedekind1@gmail.com
> > wrote:
> > On Mon, 2020-08-24 at 19:42 +0200, Rafael J. Wysocki wrote:
> > > From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
> > > 
> > > Make the energy_performance_preference policy attribute in sysfs
> > > always return the last EPP value written to it instead of the one
> > > currently in the HWP Request MSR to avoid possible confusion when
> > > the performance scaling algorithm is used in the active mode with
> > > HWP enabled (in which case the EPP is forced to 0 regardless of
> > > what value it has been set to via sysfs).
> > 
> > Why is this a good idea, I wonder. If there was a prior discussion,
> > please, point to it.
> > 
> > The general approach to changing settings via sysfs is often like
> > this:
> > 
> > 1. Write new value.
> > 2. Read it back and verify that it is the same. Because there is no
> > better way to verify that the kernel "accepted" the value.
> 
> If the write is successful (ie. no errors returned and the value
> returned is equal to the number of written characters), the kernel
> *has* accepted the written value, but it may not have taken effect.
> These are two different things.
> 
> The written value may take an effect immediately or it may take an
> effect later, depending on the current configuration etc.  If you
> don't see the effect of it immediately, it doesn't matter that there
> was a failure of some sort.
> 
> > Let's say I write 'balanced' to energy_performance_preference. I
> > read
> > it back, and it contains 'balanced', so I am happy, I trust the
> > kernel
> > changed EPP to "balanced".
> > 
> > If the kernel, in fact, uses something else, I want to know about
> > it
> > and have my script fail.
> 
> Why do you want it to fail then?
> 
> > Why caching the value and making my script _think_ it succeeded is
> > a good idea.
> 
> Because when you change the scaling algorithm or the driver's
> operation mode, the value you have written will take effect.
> 
> In this particular case it is explained in the driver documentation
> that the performance scaling algorithm in the active mode overrides
> the sysfs value and that's the only case when it can be overridden.
> So whatever you write to this attribute will not take effect
> immediately anyway, but it may take an effect later.

In some cases without even changing active/passive this is happening
when there was some error previously. For example:

#cat energy_performance_preference 
127
[root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 1 0x774
8000ff00

I think we should show reality. In mode change can be a special case
and use the stored value to restore in new mode.

Thanks,
Srinivas

> > In other words, in my usage scenarios at list, I prefer kernel
> > telling
> > the true EPP value, not some "cached, but not used" value.
> 
> An alternative is to fail writes to energy_performance_preference if
> the driver works in the active mode and the scaling algorithm for the
> scaling CPU is performance and *then* to make reads from it return
> the
> value in the register.
> 
> Accepting a write and returning a different value in a subsequent
> read
> is confusing.
> 
> Thanks!
Rafael J. Wysocki Aug. 25, 2020, 3:14 p.m. UTC | #4
On Tue, Aug 25, 2020 at 5:06 PM Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> On Tue, 2020-08-25 at 16:51 +0200, Rafael J. Wysocki wrote:
> > On Tue, Aug 25, 2020 at 8:20 AM Artem Bityutskiy <dedekind1@gmail.com
> > > wrote:
> > > On Mon, 2020-08-24 at 19:42 +0200, Rafael J. Wysocki wrote:
> > > > From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
> > > >
> > > > Make the energy_performance_preference policy attribute in sysfs
> > > > always return the last EPP value written to it instead of the one
> > > > currently in the HWP Request MSR to avoid possible confusion when
> > > > the performance scaling algorithm is used in the active mode with
> > > > HWP enabled (in which case the EPP is forced to 0 regardless of
> > > > what value it has been set to via sysfs).
> > >
> > > Why is this a good idea, I wonder. If there was a prior discussion,
> > > please, point to it.
> > >
> > > The general approach to changing settings via sysfs is often like
> > > this:
> > >
> > > 1. Write new value.
> > > 2. Read it back and verify that it is the same. Because there is no
> > > better way to verify that the kernel "accepted" the value.
> >
> > If the write is successful (ie. no errors returned and the value
> > returned is equal to the number of written characters), the kernel
> > *has* accepted the written value, but it may not have taken effect.
> > These are two different things.
> >
> > The written value may take an effect immediately or it may take an
> > effect later, depending on the current configuration etc.  If you
> > don't see the effect of it immediately, it doesn't matter that there
> > was a failure of some sort.
> >
> > > Let's say I write 'balanced' to energy_performance_preference. I
> > > read
> > > it back, and it contains 'balanced', so I am happy, I trust the
> > > kernel
> > > changed EPP to "balanced".
> > >
> > > If the kernel, in fact, uses something else, I want to know about
> > > it
> > > and have my script fail.
> >
> > Why do you want it to fail then?
> >
> > > Why caching the value and making my script _think_ it succeeded is
> > > a good idea.
> >
> > Because when you change the scaling algorithm or the driver's
> > operation mode, the value you have written will take effect.
> >
> > In this particular case it is explained in the driver documentation
> > that the performance scaling algorithm in the active mode overrides
> > the sysfs value and that's the only case when it can be overridden.
> > So whatever you write to this attribute will not take effect
> > immediately anyway, but it may take an effect later.
>
> In some cases without even changing active/passive this is happening
> when there was some error previously. For example:
>
> #cat energy_performance_preference
> 127
> [root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 1 0x774
> 8000ff00
>
> I think we should show reality. In mode change can be a special case
> and use the stored value to restore in new mode.

OK, so I'll make it fail on attempts to change the EPP from 0
(performance) in the active mode with the performance "governor".

Cheers!
Srinivas Pandruvada Aug. 25, 2020, 3:26 p.m. UTC | #5
On Tue, 2020-08-25 at 17:14 +0200, Rafael J. Wysocki wrote:
> On Tue, Aug 25, 2020 at 5:06 PM Srinivas Pandruvada
> <srinivas.pandruvada@linux.intel.com> wrote:
> > On Tue, 2020-08-25 at 16:51 +0200, Rafael J. Wysocki wrote:
> > > On Tue, Aug 25, 2020 at 8:20 AM Artem Bityutskiy <
> > > dedekind1@gmail.com
> > > > wrote:
> > > > On Mon, 2020-08-24 at 19:42 +0200, Rafael J. Wysocki wrote:
> > > > > From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
> > > > > 
> > > > > Make the energy_performance_preference policy attribute in
> > > > > sysfs
> > > > > always return the last EPP value written to it instead of the
> > > > > one
> > > > > currently in the HWP Request MSR to avoid possible confusion
> > > > > when
> > > > > the performance scaling algorithm is used in the active mode
> > > > > with
> > > > > HWP enabled (in which case the EPP is forced to 0 regardless
> > > > > of
> > > > > what value it has been set to via sysfs).
> > > > 
> > > > Why is this a good idea, I wonder. If there was a prior
> > > > discussion,
> > > > please, point to it.
> > > > 
> > > > The general approach to changing settings via sysfs is often
> > > > like
> > > > this:
> > > > 
> > > > 1. Write new value.
> > > > 2. Read it back and verify that it is the same. Because there
> > > > is no
> > > > better way to verify that the kernel "accepted" the value.
> > > 
> > > If the write is successful (ie. no errors returned and the value
> > > returned is equal to the number of written characters), the
> > > kernel
> > > *has* accepted the written value, but it may not have taken
> > > effect.
> > > These are two different things.
> > > 
> > > The written value may take an effect immediately or it may take
> > > an
> > > effect later, depending on the current configuration etc.  If you
> > > don't see the effect of it immediately, it doesn't matter that
> > > there
> > > was a failure of some sort.
> > > 
> > > > Let's say I write 'balanced' to energy_performance_preference.
> > > > I
> > > > read
> > > > it back, and it contains 'balanced', so I am happy, I trust the
> > > > kernel
> > > > changed EPP to "balanced".
> > > > 
> > > > If the kernel, in fact, uses something else, I want to know
> > > > about
> > > > it
> > > > and have my script fail.
> > > 
> > > Why do you want it to fail then?
> > > 
> > > > Why caching the value and making my script _think_ it succeeded
> > > > is
> > > > a good idea.
> > > 
> > > Because when you change the scaling algorithm or the driver's
> > > operation mode, the value you have written will take effect.
> > > 
> > > In this particular case it is explained in the driver
> > > documentation
> > > that the performance scaling algorithm in the active mode
> > > overrides
> > > the sysfs value and that's the only case when it can be
> > > overridden.
> > > So whatever you write to this attribute will not take effect
> > > immediately anyway, but it may take an effect later.
> > 
> > In some cases without even changing active/passive this is
> > happening
> > when there was some error previously. For example:
> > 
> > #cat energy_performance_preference
> > 127
> > [root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 1 0x774
> > 8000ff00
> > 
> > I think we should show reality. In mode change can be a special
> > case
> > and use the stored value to restore in new mode.
> 
> OK, so I'll make it fail on attempts to change the EPP from 0
> (performance) in the active mode with the performance "governor".
> 
Here the scaling governor is powersave.

# cat scaling_governor 
powersave


Thanks,
Srinivas

> Cheers!
Rafael J. Wysocki Aug. 25, 2020, 3:53 p.m. UTC | #6
On Tue, Aug 25, 2020 at 5:27 PM Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> On Tue, 2020-08-25 at 17:14 +0200, Rafael J. Wysocki wrote:
> > On Tue, Aug 25, 2020 at 5:06 PM Srinivas Pandruvada
> > <srinivas.pandruvada@linux.intel.com> wrote:
> > > On Tue, 2020-08-25 at 16:51 +0200, Rafael J. Wysocki wrote:
> > > > On Tue, Aug 25, 2020 at 8:20 AM Artem Bityutskiy <
> > > > dedekind1@gmail.com
> > > > > wrote:
> > > > > On Mon, 2020-08-24 at 19:42 +0200, Rafael J. Wysocki wrote:
> > > > > > From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
> > > > > >
> > > > > > Make the energy_performance_preference policy attribute in
> > > > > > sysfs
> > > > > > always return the last EPP value written to it instead of the
> > > > > > one
> > > > > > currently in the HWP Request MSR to avoid possible confusion
> > > > > > when
> > > > > > the performance scaling algorithm is used in the active mode
> > > > > > with
> > > > > > HWP enabled (in which case the EPP is forced to 0 regardless
> > > > > > of
> > > > > > what value it has been set to via sysfs).
> > > > >
> > > > > Why is this a good idea, I wonder. If there was a prior
> > > > > discussion,
> > > > > please, point to it.
> > > > >
> > > > > The general approach to changing settings via sysfs is often
> > > > > like
> > > > > this:
> > > > >
> > > > > 1. Write new value.
> > > > > 2. Read it back and verify that it is the same. Because there
> > > > > is no
> > > > > better way to verify that the kernel "accepted" the value.
> > > >
> > > > If the write is successful (ie. no errors returned and the value
> > > > returned is equal to the number of written characters), the
> > > > kernel
> > > > *has* accepted the written value, but it may not have taken
> > > > effect.
> > > > These are two different things.
> > > >
> > > > The written value may take an effect immediately or it may take
> > > > an
> > > > effect later, depending on the current configuration etc.  If you
> > > > don't see the effect of it immediately, it doesn't matter that
> > > > there
> > > > was a failure of some sort.
> > > >
> > > > > Let's say I write 'balanced' to energy_performance_preference.
> > > > > I
> > > > > read
> > > > > it back, and it contains 'balanced', so I am happy, I trust the
> > > > > kernel
> > > > > changed EPP to "balanced".
> > > > >
> > > > > If the kernel, in fact, uses something else, I want to know
> > > > > about
> > > > > it
> > > > > and have my script fail.
> > > >
> > > > Why do you want it to fail then?
> > > >
> > > > > Why caching the value and making my script _think_ it succeeded
> > > > > is
> > > > > a good idea.
> > > >
> > > > Because when you change the scaling algorithm or the driver's
> > > > operation mode, the value you have written will take effect.
> > > >
> > > > In this particular case it is explained in the driver
> > > > documentation
> > > > that the performance scaling algorithm in the active mode
> > > > overrides
> > > > the sysfs value and that's the only case when it can be
> > > > overridden.
> > > > So whatever you write to this attribute will not take effect
> > > > immediately anyway, but it may take an effect later.
> > >
> > > In some cases without even changing active/passive this is
> > > happening
> > > when there was some error previously. For example:
> > >
> > > #cat energy_performance_preference
> > > 127
> > > [root@otcpl-perf-test-skx-i9 cpufreq]# rdmsr -p 1 0x774
> > > 8000ff00
> > >
> > > I think we should show reality. In mode change can be a special
> > > case
> > > and use the stored value to restore in new mode.
> >
> > OK, so I'll make it fail on attempts to change the EPP from 0
> > (performance) in the active mode with the performance "governor".
> >
> Here the scaling governor is powersave.
>
> # cat scaling_governor
> powersave

What I'm saying is that reads from energy_performance_preference will
still return the register value, but writes to it will fail on
attempts to change to anything different from "performance" when in
the active mode and the current governor is "performance".

Cheers!
Artem Bityutskiy Aug. 26, 2020, 9:54 a.m. UTC | #7
Thanks for answer Rafael, it looks like there are 2 different things
now.

1. What kernel returns when I _read_ e_p_p file - truth or "cached" ?

2. How kernel behaves when I _write_ to e_p_p file something it cannot
provide - error or success.

For #1, I think we need to keep it simple and always return true policy
value. Does not matter what someone wrote there. If some process wrote
"powersave", but kernel uses EPP 0 anyway, the other process probably
wants to know the truth and get "performance" when reading e_p_p.

On Tue, 2020-08-25 at 16:51 +0200, Rafael J. Wysocki wrote:
> An alternative is to fail writes to energy_performance_preference if
> the driver works in the active mode and the scaling algorithm for the
> scaling CPU is performance and *then* to make reads from it return the
> value in the register.

Yes, this is #2. This sounds like the _right_ way to do it.

Suppose my script wants to exercise the system with 4 different EPP
policies. It changes the policy and runs measurements, each run takes
few _days_.

Now, my script asks for "powersave". Kernel _knows_ it cannot provide
it (performance+active enabled). Why would it not return error ("can't
do") instead of success ("yes, Sir!")?

Note, I deliberately use simple words like "my script" instead of "a
user-space process" to make it easier to convey the idea.

Anyway, if kernel returns error, I can go and improve my script WRT
controlling the performance+active mode knobs.

Patch
diff mbox series

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index bcda1e700a73..3d18934fa975 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -606,13 +606,10 @@  static const unsigned int epp_values[] = {
 
 static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data, int *raw_epp)
 {
-	s16 epp;
+	s16 epp = cpu_data->epp_cached;
 	int index = -EINVAL;
 
 	*raw_epp = 0;
-	epp = intel_pstate_get_epp(cpu_data, 0);
-	if (epp < 0)
-		return epp;
 
 	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
 		if (epp == HWP_EPP_PERFORMANCE)
@@ -644,6 +641,8 @@  static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data, int *raw
 
 static int intel_pstate_set_epp(struct cpudata *cpu, u32 epp)
 {
+	int ret;
+
 	/*
 	 * Use the cached HWP Request MSR value, because in the active mode the
 	 * register itself may be updated by intel_pstate_hwp_boost_up() or
@@ -659,7 +658,11 @@  static int intel_pstate_set_epp(struct cpudata *cpu, u32 epp)
 	 * function, so it cannot run in parallel with the update below.
 	 */
 	WRITE_ONCE(cpu->hwp_req_cached, value);
-	return wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+	ret = wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+	if (!ret)
+		cpu->epp_cached = epp;
+
+	return ret;
 }
 
 static int intel_pstate_set_energy_pref_index(struct cpudata *cpu_data,
@@ -762,10 +765,8 @@  static ssize_t store_energy_performance_preference(
 			cpufreq_stop_governor(policy);
 			ret = intel_pstate_set_epp(cpu, epp);
 			err = cpufreq_start_governor(policy);
-			if (!ret) {
-				cpu->epp_cached = epp;
+			if (!ret)
 				ret = err;
-			}
 		}
 	}
 
@@ -2378,6 +2379,13 @@  static int intel_pstate_cpu_init(struct cpufreq_policy *policy)
 	 */
 	policy->policy = CPUFREQ_POLICY_POWERSAVE;
 
+
+	if (hwp_active) {
+		struct cpudata *cpu = all_cpu_data[policy->cpu];
+
+		cpu->epp_cached = intel_pstate_get_epp(cpu, 0);
+	}
+
 	return 0;
 }
 
@@ -2585,7 +2593,7 @@  static int intel_cpufreq_cpu_init(struct cpufreq_policy *policy)
 		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
 		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
 		WRITE_ONCE(cpu->hwp_req_cached, value);
-		cpu->epp_cached = (value & GENMASK_ULL(31, 24)) >> 24;
+		cpu->epp_cached = intel_pstate_get_epp(cpu, value);
 	} else {
 		turbo_max = cpu->pstate.turbo_pstate;
 		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;