[PATCH v2 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled

linux-doc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
@ 2020-07-16 17:37 Rafael J. Wysocki
  2020-07-16 17:38 ` [PATCH v2 1/2] cpufreq: intel_pstate: Rearrange the storing of newv EPP values Rafael J. Wysocki
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-07-16 17:37 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez

Hi All,

This really is a v2 of this patch:

https://patchwork.kernel.org/patch/11663271/

with an extra preceding cleanup patch to avoid making unrelated changes in the
[2/2].

Thanks!




^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 1/2] cpufreq: intel_pstate: Rearrange the storing of newv EPP values
  2020-07-16 17:37 [PATCH v2 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
@ 2020-07-16 17:38 ` Rafael J. Wysocki
  2020-07-16 17:42 ` [PATCH v2 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
  2020-07-27 15:13 ` [PATCH v3 0/2] " Rafael J. Wysocki
  2 siblings, 0 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-07-16 17:38 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Move the locking away from intel_pstate_set_energy_pref_index()
into its only caller and drop the (now redundant) return_pref label
from it.

Also move the "raw" EPP value check into the caller of that function,
so as to do it before acquiring the mutex, and reduce code duplication
related to the "raw" EPP values processing somewhat.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/intel_pstate.c |   29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -648,28 +648,18 @@ static int intel_pstate_set_energy_pref_
 	if (!pref_index)
 		epp = cpu_data->epp_default;
 
-	mutex_lock(&intel_pstate_limits_lock);
-
 	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
 		u64 value;
 
 		ret = rdmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, &value);
 		if (ret)
-			goto return_pref;
+			return ret;
 
 		value &= ~GENMASK_ULL(31, 24);
 
-		if (use_raw) {
-			if (raw_epp > 255) {
-				ret = -EINVAL;
-				goto return_pref;
-			}
-			value |= (u64)raw_epp << 24;
-			ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
-			goto return_pref;
-		}
-
-		if (epp == -EINVAL)
+		if (use_raw)
+			epp = raw_epp;
+		else if (epp == -EINVAL)
 			epp = epp_values[pref_index - 1];
 
 		value |= (u64)epp << 24;
@@ -679,8 +669,6 @@ static int intel_pstate_set_energy_pref_
 			epp = (pref_index - 1) << 2;
 		ret = intel_pstate_set_epb(cpu_data->cpu, epp);
 	}
-return_pref:
-	mutex_unlock(&intel_pstate_limits_lock);
 
 	return ret;
 }
@@ -723,12 +711,19 @@ static ssize_t store_energy_performance_
 		if (ret)
 			return ret;
 
+		if (epp > 255)
+			return -EINVAL;
+
 		raw = true;
 	}
 
+	mutex_lock(&intel_pstate_limits_lock);
+
 	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
 	if (ret)
-		return ret;
+		count = ret;
+
+	mutex_unlock(&intel_pstate_limits_lock);
 
 	return count;
 }




^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-07-16 17:37 [PATCH v2 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
  2020-07-16 17:38 ` [PATCH v2 1/2] cpufreq: intel_pstate: Rearrange the storing of newv EPP values Rafael J. Wysocki
@ 2020-07-16 17:42 ` Rafael J. Wysocki
  2020-07-27 15:13 ` [PATCH v3 0/2] " Rafael J. Wysocki
  2 siblings, 0 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-07-16 17:42 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that in the frequency range below the base frequency
(ie. the frequency retured by the base_frequency cpufreq attribute
in sysfs on HWP systems) the HWP algorithm is allowed to go above
the floor P-state set by intel_pstate with or without hardware
coordination of P-states among CPUs in the same package.

Also note that the setting of the HWP floor may not be taken into
account by the processor in the following cases:

 * For the HWP floor in the range of P-states above the base
   frequency, referred to as the turbo range, the processor has a
   license to choose any P-state from that range, either below or
   above the HWP floor, just like a non-HWP processor in the case
   when the target P-state falls into the turbo range.

 * If P-states of the CPUs in the same package are coordinated
   at the hardware level, the processor may choose a P-state
   above the HWP floor, just like a non-HWP processor in the
   analogous case.

With this change applied, intel_pstate in the passive mode
assumes complete control over the HWP request MSR and concurrent
changes of that MSR (eg. via the direct MSR access interface) are
overridden by it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v1 -> v2:
   * Avoid a race condition when updating the HWP request register while
     setting a new EPP value via sysfs.

---
 Documentation/admin-guide/pm/intel_pstate.rst |   89 ++++++-------
 drivers/cpufreq/intel_pstate.c                |  169 ++++++++++++++++++++------
 2 files changed, 175 insertions(+), 83 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -36,6 +36,7 @@
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
 #define INTEL_CPUFREQ_TRANSITION_DELAY		500
 
 #ifdef CONFIG_ACPI
@@ -220,6 +221,7 @@ struct global_params {
  *			preference/bias
  * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
  *			operation
+ * @epp_cached		Cached HWP energy-performance preference value
  * @hwp_req_cached:	Cached value of the last HWP Request MSR
  * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
  * @last_io_update:	Last time when IO wake flag was set
@@ -257,6 +259,7 @@ struct cpudata {
 	s16 epp_policy;
 	s16 epp_default;
 	s16 epp_saved;
+	s16 epp_cached;
 	u64 hwp_req_cached;
 	u64 hwp_cap_cached;
 	u64 last_io_update;
@@ -689,6 +692,8 @@ static ssize_t show_energy_performance_a
 
 cpufreq_freq_attr_ro(energy_performance_available_preferences);
 
+static struct cpufreq_driver intel_pstate;
+
 static ssize_t store_energy_performance_preference(
 		struct cpufreq_policy *policy, const char *buf, size_t count)
 {
@@ -717,14 +722,34 @@ static ssize_t store_energy_performance_
 		raw = true;
 	}
 
+	mutex_lock(&intel_pstate_driver_lock);
+
+	if (!intel_pstate_driver) {
+		mutex_unlock(&intel_pstate_driver_lock);
+		return -EAGAIN;
+	}
+
 	mutex_lock(&intel_pstate_limits_lock);
 
-	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
-	if (ret)
-		count = ret;
+	if (intel_pstate_driver == &intel_pstate) {
+		ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
+		if (ret)
+			count = ret;
+	} else {
+		/*
+		 * In the passive mode simply update the cached EPP value and
+		 * rely on intel_cpufreq_adjust_hwp() to pick it up later.
+		 */
+		if (!raw)
+			epp = ret ? epp_values[ret - 1] : cpu_data->epp_default;
+
+		WRITE_ONCE(cpu_data->epp_cached, epp);
+	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
+	mutex_unlock(&intel_pstate_driver_lock);
+
 	return count;
 }
 
@@ -1137,8 +1162,6 @@ static ssize_t store_no_turbo(struct kob
 	return count;
 }
 
-static struct cpufreq_driver intel_pstate;
-
 static void update_qos_request(enum freq_qos_req_type type)
 {
 	int max_state, turbo_max, freq, i, perf_pct;
@@ -2040,6 +2063,7 @@ static int intel_pstate_init_cpu(unsigne
 		cpu->epp_default = -EINVAL;
 		cpu->epp_powersave = -EINVAL;
 		cpu->epp_saved = -EINVAL;
+		WRITE_ONCE(cpu->epp_cached, -EINVAL);
 	}
 
 	cpu = all_cpu_data[cpunum];
@@ -2238,7 +2262,10 @@ static int intel_pstate_verify_policy(st
 
 static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
 {
-	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+	if (hwp_active)
+		intel_pstate_hwp_force_min_perf(policy->cpu);
+	else
+		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
 }
 
 static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
@@ -2246,12 +2273,10 @@ static void intel_pstate_stop_cpu(struct
 	pr_debug("CPU %d exiting\n", policy->cpu);
 
 	intel_pstate_clear_update_util_hook(policy->cpu);
-	if (hwp_active) {
+	if (hwp_active)
 		intel_pstate_hwp_save_state(policy);
-		intel_pstate_hwp_force_min_perf(policy->cpu);
-	} else {
-		intel_cpufreq_stop_cpu(policy);
-	}
+
+	intel_cpufreq_stop_cpu(policy);
 }
 
 static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
@@ -2381,13 +2406,82 @@ static void intel_cpufreq_trace(struct c
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
+				     bool fast_switch)
+{
+	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
+	s16 epp;
+
+	value &= ~HWP_MIN_PERF(~0L);
+	value |= HWP_MIN_PERF(target_pstate);
+
+	/*
+	 * The entire MSR needs to be updated in order to update the HWP min
+	 * field in it, so opportunistically update the max too if needed.
+	 */
+	value &= ~HWP_MAX_PERF(~0L);
+	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
+
+	/*
+	 * In case the EPP has been adjusted via sysfs, write the last cached
+	 * value of it to the MSR as well.
+	 */
+	epp = READ_ONCE(cpu->epp_cached);
+	if (epp >= 0) {
+		value &= ~GENMASK_ULL(31, 24);
+		value |= (u64)epp << 24;
+	}
+
+	if (value == prev)
+		return;
+
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	if (fast_switch)
+		wrmsrl(MSR_HWP_REQUEST, value);
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
+static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
+					  u32 target_pstate, bool fast_switch)
+{
+	if (fast_switch)
+		wrmsrl(MSR_IA32_PERF_CTL,
+		       pstate_funcs.get_val(cpu, target_pstate));
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
+			      pstate_funcs.get_val(cpu, target_pstate));
+}
+
+static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
+				       bool fast_switch)
+{
+	int old_pstate = cpu->pstate.current_pstate;
+
+	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	if (target_pstate != old_pstate) {
+		cpu->pstate.current_pstate = target_pstate;
+		if (hwp_active)
+			intel_cpufreq_adjust_hwp(cpu, target_pstate,
+						 fast_switch);
+		else
+			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
+						      fast_switch);
+	}
+
+	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
+			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
+	return target_pstate;
+}
+
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
 				unsigned int target_freq,
 				unsigned int relation)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
 	struct cpufreq_freqs freqs;
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
@@ -2395,6 +2489,7 @@ static int intel_cpufreq_target(struct c
 	freqs.new = target_freq;
 
 	cpufreq_freq_transition_begin(policy, &freqs);
+
 	switch (relation) {
 	case CPUFREQ_RELATION_L:
 		target_pstate = DIV_ROUND_UP(freqs.new, cpu->pstate.scaling);
@@ -2406,15 +2501,11 @@ static int intel_cpufreq_target(struct c
 		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu->pstate.scaling);
 		break;
 	}
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	if (target_pstate != cpu->pstate.current_pstate) {
-		cpu->pstate.current_pstate = target_pstate;
-		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
-	}
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, false);
+
 	freqs.new = target_pstate * cpu->pstate.scaling;
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
 	cpufreq_freq_transition_end(policy, &freqs, false);
 
 	return 0;
@@ -2424,15 +2515,14 @@ static unsigned int intel_cpufreq_fast_s
 					      unsigned int target_freq)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
 	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	intel_pstate_update_pstate(cpu, target_pstate);
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, true);
+
 	return target_pstate * cpu->pstate.scaling;
 }
 
@@ -2452,7 +2542,6 @@ static int intel_cpufreq_cpu_init(struct
 		return ret;
 
 	policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
-	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
 	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
 	policy->cur = policy->cpuinfo.min_freq;
 
@@ -2464,10 +2553,17 @@ static int intel_cpufreq_cpu_init(struct
 
 	cpu = all_cpu_data[policy->cpu];
 
-	if (hwp_active)
+	if (hwp_active) {
+		u64 value;
+
 		intel_pstate_get_hwp_max(policy->cpu, &turbo_max, &max_state);
-	else
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
+		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
+		WRITE_ONCE(cpu->hwp_req_cached, value);
+	} else {
 		turbo_max = cpu->pstate.turbo_pstate;
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
+	}
 
 	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
 	min_freq *= cpu->pstate.scaling;
@@ -2568,9 +2664,6 @@ static int intel_pstate_register_driver(
 
 static int intel_pstate_unregister_driver(void)
 {
-	if (hwp_active)
-		return -EBUSY;
-
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
@@ -2826,7 +2919,10 @@ static int __init intel_pstate_init(void
 			hwp_active++;
 			hwp_mode_bdw = id->driver_data;
 			intel_pstate.attr = hwp_cpufreq_attrs;
-			default_driver = &intel_pstate;
+			intel_cpufreq.attr = hwp_cpufreq_attrs;
+			if (!default_driver)
+				default_driver = &intel_pstate;
+
 			goto hwp_cpu_matched;
 		}
 	} else {
@@ -2897,14 +2993,13 @@ static int __init intel_pstate_setup(cha
 	if (!str)
 		return -EINVAL;
 
-	if (!strcmp(str, "disable")) {
+	if (!strcmp(str, "disable"))
 		no_load = 1;
-	} else if (!strcmp(str, "active")) {
+	else if (!strcmp(str, "active"))
 		default_driver = &intel_pstate;
-	} else if (!strcmp(str, "passive")) {
+	else if (!strcmp(str, "passive"))
 		default_driver = &intel_cpufreq;
-		no_hwp = 1;
-	}
+
 	if (!strcmp(str, "no_hwp")) {
 		pr_info("HWP disabled\n");
 		no_hwp = 1;
Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
+++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
@@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
 Operation Modes
 ===============
 
-``intel_pstate`` can operate in three different modes: in the active mode with
-or without hardware-managed P-states support and in the passive mode.  Which of
-them will be in effect depends on what kernel command line options are used and
-on the capabilities of the processor.
+``intel_pstate`` can operate in two different modes, active or passive.  In the
+active mode, it uses its own internal preformance scaling governor algorithm or
+allows the hardware to do preformance scaling by itself, while in the passive
+mode it responds to requests made by a generic ``CPUFreq`` governor implementing
+a certain performance scaling algorithm.  Which of them will be in effect
+depends on what kernel command line options are used and on the capabilities of
+the processor.
 
 Active Mode
 -----------
@@ -194,10 +197,11 @@ This is the default operation mode of ``
 hardware-managed P-states (HWP) support.  It is always used if the
 ``intel_pstate=passive`` argument is passed to the kernel in the command line
 regardless of whether or not the given processor supports HWP.  [Note that the
-``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
-without ``intel_pstate=active``.]  Like in the active mode without HWP support,
-in this mode ``intel_pstate`` may refuse to work with processors that are not
-recognized by it.
+``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
+if it is not combined with ``intel_pstate=active``.]  Like in the active mode
+without HWP support, in this mode ``intel_pstate`` may refuse to work with
+processors that are not recognized by it if HWP is prevented from being enabled
+through the kernel command line.
 
 If the driver works in this mode, the ``scaling_driver`` policy attribute in
 ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
@@ -318,10 +322,9 @@ manuals need to be consulted to get to i
 
 For this reason, there is a list of supported processors in ``intel_pstate`` and
 the driver initialization will fail if the detected processor is not in that
-list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
-obtain all of the information listed above is the same for all of the processors
-supporting the HWP feature, which is why they all are supported by
-``intel_pstate``.]
+list, unless it supports the HWP feature.  [The interface to obtain all of the
+information listed above is the same for all of the processors supporting the
+HWP feature, which is why ``intel_pstate`` works with all of them.]
 
 
 User Space Interface in ``sysfs``
@@ -425,22 +428,16 @@ argument is passed to the kernel in the
 	as well as the per-policy ones) are then reset to their default
 	values, possibly depending on the target operation mode.]
 
-	That only is supported in some configurations, though (for example, if
-	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
-	the operation mode of the driver cannot be changed), and if it is not
-	supported in the current configuration, writes to this attribute will
-	fail with an appropriate error.
-
 ``energy_efficiency``
-	This attribute is only present on platforms, which have CPUs matching
-	Kaby Lake or Coffee Lake desktop CPU model. By default
-	energy efficiency optimizations are disabled on these CPU models in HWP
-	mode by this driver. Enabling energy efficiency may limit maximum
-	operating frequency in both HWP and non HWP mode. In non HWP mode,
-	optimizations are done only in the turbo frequency range. In HWP mode,
-	optimizations are done in the entire frequency range. Setting this
-	attribute to "1" enables energy efficiency optimizations and setting
-	to "0" disables energy efficiency optimizations.
+	This attribute is only present on platforms with CPUs matching the Kaby
+	Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
+	optimizations are disabled on these CPU models if HWP is enabled.
+	Enabling energy-efficiency optimizations may limit maximum operating
+	frequency with or without the HWP feature.  With HWP enabled, the
+	optimizations are done only in the turbo frequency range.  Without it,
+	they are done in the entire available frequency range.  Setting this
+	attribute to "1" enables the energy-efficiency optimizations and setting
+	to "0" disables them.
 
 Interpretation of Policy Attributes
 -----------------------------------
@@ -484,8 +481,8 @@ Next, the following policy attributes ha
 	policy for the time interval between the last two invocations of the
 	driver's utilization update callback by the CPU scheduler for that CPU.
 
-One more policy attribute is present if the `HWP feature is enabled in the
-processor <Active Mode With HWP_>`_:
+One more policy attribute is present if the HWP feature is enabled in the
+processor:
 
 ``base_frequency``
 	Shows the base frequency of the CPU. Any frequency above this will be
@@ -526,11 +523,11 @@ on the following rules, regardless of th
 
  3. The global and per-policy limits can be set independently.
 
-If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
-resulting effective values are written into its registers whenever the limits
-change in order to request its internal P-state selection logic to always set
-P-states within these limits.  Otherwise, the limits are taken into account by
-scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
+resulting effective values are written into hardware registers whenever the
+limits change in order to request its internal P-state selection logic to always
+set P-states within these limits.  Otherwise, the limits are taken into account
+by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
 every time before setting a new P-state for a CPU.
 
 Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
@@ -541,12 +538,11 @@ at all and the only way to set the limit
 Energy vs Performance Hints
 ---------------------------
 
-If ``intel_pstate`` works in the `active mode with the HWP feature enabled
-<Active Mode With HWP_>`_ in the processor, additional attributes are present
-in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
-user space to help ``intel_pstate`` to adjust the processor's internal P-state
-selection logic by focusing it on performance or on energy-efficiency, or
-somewhere between the two extremes:
+If the hardware-managed P-states (HWP) is enabled in the processor, additional
+attributes, intended to allow user space to help ``intel_pstate`` to adjust the
+processor's internal P-state selection logic by focusing it on performance or on
+energy-efficiency, or somewhere between the two extremes, are present in every
+``CPUFreq`` policy directory in ``sysfs``.  They are :
 
 ``energy_performance_preference``
 	Current value of the energy vs performance hint for the given policy
@@ -650,12 +646,14 @@ of them have to be prepended with the ``
 	Do not register ``intel_pstate`` as the scaling driver even if the
 	processor is supported by it.
 
+``active``
+	Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
+	with.
+
 ``passive``
 	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
 	start with.
 
-	This option implies the ``no_hwp`` one described below.
-
 ``force``
 	Register ``intel_pstate`` as the scaling driver instead of
 	``acpi-cpufreq`` even if the latter is preferred on the given system.
@@ -670,13 +668,12 @@ of them have to be prepended with the ``
 	driver is used instead of ``acpi-cpufreq``.
 
 ``no_hwp``
-	Do not enable the `hardware-managed P-states (HWP) feature
-	<Active Mode With HWP_>`_ even if it is supported by the processor.
+	Do not enable the hardware-managed P-states (HWP) feature even if it is
+	supported by the processor.
 
 ``hwp_only``
 	Register ``intel_pstate`` as the scaling driver only if the
-	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
-	supported by the processor.
+	hardware-managed P-states (HWP) feature is supported by the processor.
 
 ``support_acpi_ppc``
 	Take ACPI ``_PPC`` performance limits into account.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-07-16 17:37 [PATCH v2 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
  2020-07-16 17:38 ` [PATCH v2 1/2] cpufreq: intel_pstate: Rearrange the storing of newv EPP values Rafael J. Wysocki
  2020-07-16 17:42 ` [PATCH v2 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
@ 2020-07-27 15:13 ` Rafael J. Wysocki
  2020-07-27 15:15   ` [PATCH v3 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values Rafael J. Wysocki
                     ` (4 more replies)
  2 siblings, 5 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-07-27 15:13 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez

Hi All,

On Thursday, July 16, 2020 7:37:04 PM CEST Rafael J. Wysocki wrote:
>
> This really is a v2 of this patch:
> 
> https://patchwork.kernel.org/patch/11663271/
> 
> with an extra preceding cleanup patch to avoid making unrelated changes in the
> [2/2].

Almost the same as before, but the first patch has been reworked to handle
errors in store_energy_performance_preference() correctly and rebased on top
of the current linux-pm.git branch.

No functional changes otherwise.

Thanks!




^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values
  2020-07-27 15:13 ` [PATCH v3 0/2] " Rafael J. Wysocki
@ 2020-07-27 15:15   ` Rafael J. Wysocki
  2020-07-27 15:17   ` [PATCH v3 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-07-27 15:15 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Move the locking away from intel_pstate_set_energy_pref_index()
into its only caller and drop the (now redundant) return_pref label
from it.

Also move the "raw" EPP value check into the caller of that function,
so as to do it before acquiring the mutex, and reduce code duplication
related to the "raw" EPP values processing somewhat.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v2 -> v3:

   * Fix error handling in intel_pstate_set_energy_pref_index() and
     rebase.

---
 drivers/cpufreq/intel_pstate.c |   35 +++++++++++++++--------------------
 1 file changed, 15 insertions(+), 20 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -649,28 +649,18 @@ static int intel_pstate_set_energy_pref_
 	if (!pref_index)
 		epp = cpu_data->epp_default;
 
-	mutex_lock(&intel_pstate_limits_lock);
-
 	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
 		u64 value;
 
 		ret = rdmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, &value);
 		if (ret)
-			goto return_pref;
+			return ret;
 
 		value &= ~GENMASK_ULL(31, 24);
 
-		if (use_raw) {
-			if (raw_epp > 255) {
-				ret = -EINVAL;
-				goto return_pref;
-			}
-			value |= (u64)raw_epp << 24;
-			ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
-			goto return_pref;
-		}
-
-		if (epp == -EINVAL)
+		if (use_raw)
+			epp = raw_epp;
+		else if (epp == -EINVAL)
 			epp = epp_values[pref_index - 1];
 
 		value |= (u64)epp << 24;
@@ -680,8 +670,6 @@ static int intel_pstate_set_energy_pref_
 			epp = (pref_index - 1) << 2;
 		ret = intel_pstate_set_epb(cpu_data->cpu, epp);
 	}
-return_pref:
-	mutex_unlock(&intel_pstate_limits_lock);
 
 	return ret;
 }
@@ -708,8 +696,8 @@ static ssize_t store_energy_performance_
 	struct cpudata *cpu_data = all_cpu_data[policy->cpu];
 	char str_preference[21];
 	bool raw = false;
+	ssize_t ret;
 	u32 epp = 0;
-	int ret;
 
 	ret = sscanf(buf, "%20s", str_preference);
 	if (ret != 1)
@@ -724,14 +712,21 @@ static ssize_t store_energy_performance_
 		if (ret)
 			return ret;
 
+		if (epp > 255)
+			return -EINVAL;
+
 		raw = true;
 	}
 
+	mutex_lock(&intel_pstate_limits_lock);
+
 	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
-	if (ret)
-		return ret;
+	if (!ret)
+		ret = count;
 
-	return count;
+	mutex_unlock(&intel_pstate_limits_lock);
+
+	return ret;
 }
 
 static ssize_t show_energy_performance_preference(




^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-07-27 15:13 ` [PATCH v3 0/2] " Rafael J. Wysocki
  2020-07-27 15:15   ` [PATCH v3 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values Rafael J. Wysocki
@ 2020-07-27 15:17   ` Rafael J. Wysocki
  2020-07-28 15:09   ` [PATCH v4 0/2] " Rafael J. Wysocki
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-07-27 15:17 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that in the frequency range below the base frequency
(ie. the frequency retured by the base_frequency cpufreq attribute
in sysfs on HWP systems) the HWP algorithm is allowed to go above
the floor P-state set by intel_pstate with or without hardware
coordination of P-states among CPUs in the same package.

Also note that the setting of the HWP floor may not be taken into
account by the processor in the following cases:

 * For the HWP floor in the range of P-states above the base
   frequency, referred to as the turbo range, the processor has a
   license to choose any P-state from that range, either below or
   above the HWP floor, just like a non-HWP processor in the case
   when the target P-state falls into the turbo range.

 * If P-states of the CPUs in the same package are coordinated
   at the hardware level, the processor may choose a P-state
   above the HWP floor, just like a non-HWP processor in the
   analogous case.

With this change applied, intel_pstate in the passive mode
assumes complete control over the HWP request MSR and concurrent
changes of that MSR (eg. via the direct MSR access interface) are
overridden by it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v1 -> v2:
   * Avoid a race condition when updating the HWP request register while
     setting a new EPP value via sysfs.

v2 -> v3:
   * Rebase.

---
 Documentation/admin-guide/pm/intel_pstate.rst |   89 ++++++-------
 drivers/cpufreq/intel_pstate.c                |  168 ++++++++++++++++++++------
 2 files changed, 175 insertions(+), 82 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -36,6 +36,7 @@
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
 #define INTEL_CPUFREQ_TRANSITION_DELAY		500
 
 #ifdef CONFIG_ACPI
@@ -220,6 +221,7 @@ struct global_params {
  *			preference/bias
  * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
  *			operation
+ * @epp_cached		Cached HWP energy-performance preference value
  * @hwp_req_cached:	Cached value of the last HWP Request MSR
  * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
  * @last_io_update:	Last time when IO wake flag was set
@@ -257,6 +259,7 @@ struct cpudata {
 	s16 epp_policy;
 	s16 epp_default;
 	s16 epp_saved;
+	s16 epp_cached;
 	u64 hwp_req_cached;
 	u64 hwp_cap_cached;
 	u64 last_io_update;
@@ -690,6 +693,8 @@ static ssize_t show_energy_performance_a
 
 cpufreq_freq_attr_ro(energy_performance_available_preferences);
 
+static struct cpufreq_driver intel_pstate;
+
 static ssize_t store_energy_performance_preference(
 		struct cpufreq_policy *policy, const char *buf, size_t count)
 {
@@ -718,14 +723,35 @@ static ssize_t store_energy_performance_
 		raw = true;
 	}
 
+	mutex_lock(&intel_pstate_driver_lock);
+
+	if (!intel_pstate_driver) {
+		mutex_unlock(&intel_pstate_driver_lock);
+		return -EAGAIN;
+	}
+
 	mutex_lock(&intel_pstate_limits_lock);
 
-	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
-	if (!ret)
+	if (intel_pstate_driver == &intel_pstate) {
+		ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
+		if (!ret)
+			ret = count;
+	} else {
+		/*
+		 * In the passive mode simply update the cached EPP value and
+		 * rely on intel_cpufreq_adjust_hwp() to pick it up later.
+		 */
+		if (!raw)
+			epp = ret ? epp_values[ret - 1] : cpu_data->epp_default;
+
+		WRITE_ONCE(cpu_data->epp_cached, epp);
 		ret = count;
+	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
+	mutex_unlock(&intel_pstate_driver_lock);
+
 	return ret;
 }
 
@@ -1138,8 +1164,6 @@ static ssize_t store_no_turbo(struct kob
 	return count;
 }
 
-static struct cpufreq_driver intel_pstate;
-
 static void update_qos_request(enum freq_qos_req_type type)
 {
 	int max_state, turbo_max, freq, i, perf_pct;
@@ -2041,6 +2065,7 @@ static int intel_pstate_init_cpu(unsigne
 		cpu->epp_default = -EINVAL;
 		cpu->epp_powersave = -EINVAL;
 		cpu->epp_saved = -EINVAL;
+		WRITE_ONCE(cpu->epp_cached, -EINVAL);
 	}
 
 	cpu = all_cpu_data[cpunum];
@@ -2239,7 +2264,10 @@ static int intel_pstate_verify_policy(st
 
 static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
 {
-	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+	if (hwp_active)
+		intel_pstate_hwp_force_min_perf(policy->cpu);
+	else
+		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
 }
 
 static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
@@ -2247,12 +2275,10 @@ static void intel_pstate_stop_cpu(struct
 	pr_debug("CPU %d exiting\n", policy->cpu);
 
 	intel_pstate_clear_update_util_hook(policy->cpu);
-	if (hwp_active) {
+	if (hwp_active)
 		intel_pstate_hwp_save_state(policy);
-		intel_pstate_hwp_force_min_perf(policy->cpu);
-	} else {
-		intel_cpufreq_stop_cpu(policy);
-	}
+
+	intel_cpufreq_stop_cpu(policy);
 }
 
 static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
@@ -2382,13 +2408,82 @@ static void intel_cpufreq_trace(struct c
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
+				     bool fast_switch)
+{
+	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
+	s16 epp;
+
+	value &= ~HWP_MIN_PERF(~0L);
+	value |= HWP_MIN_PERF(target_pstate);
+
+	/*
+	 * The entire MSR needs to be updated in order to update the HWP min
+	 * field in it, so opportunistically update the max too if needed.
+	 */
+	value &= ~HWP_MAX_PERF(~0L);
+	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
+
+	/*
+	 * In case the EPP has been adjusted via sysfs, write the last cached
+	 * value of it to the MSR as well.
+	 */
+	epp = READ_ONCE(cpu->epp_cached);
+	if (epp >= 0) {
+		value &= ~GENMASK_ULL(31, 24);
+		value |= (u64)epp << 24;
+	}
+
+	if (value == prev)
+		return;
+
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	if (fast_switch)
+		wrmsrl(MSR_HWP_REQUEST, value);
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
+static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
+					  u32 target_pstate, bool fast_switch)
+{
+	if (fast_switch)
+		wrmsrl(MSR_IA32_PERF_CTL,
+		       pstate_funcs.get_val(cpu, target_pstate));
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
+			      pstate_funcs.get_val(cpu, target_pstate));
+}
+
+static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
+				       bool fast_switch)
+{
+	int old_pstate = cpu->pstate.current_pstate;
+
+	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	if (target_pstate != old_pstate) {
+		cpu->pstate.current_pstate = target_pstate;
+		if (hwp_active)
+			intel_cpufreq_adjust_hwp(cpu, target_pstate,
+						 fast_switch);
+		else
+			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
+						      fast_switch);
+	}
+
+	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
+			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
+	return target_pstate;
+}
+
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
 				unsigned int target_freq,
 				unsigned int relation)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
 	struct cpufreq_freqs freqs;
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
@@ -2396,6 +2491,7 @@ static int intel_cpufreq_target(struct c
 	freqs.new = target_freq;
 
 	cpufreq_freq_transition_begin(policy, &freqs);
+
 	switch (relation) {
 	case CPUFREQ_RELATION_L:
 		target_pstate = DIV_ROUND_UP(freqs.new, cpu->pstate.scaling);
@@ -2407,15 +2503,11 @@ static int intel_cpufreq_target(struct c
 		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu->pstate.scaling);
 		break;
 	}
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	if (target_pstate != cpu->pstate.current_pstate) {
-		cpu->pstate.current_pstate = target_pstate;
-		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
-	}
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, false);
+
 	freqs.new = target_pstate * cpu->pstate.scaling;
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
 	cpufreq_freq_transition_end(policy, &freqs, false);
 
 	return 0;
@@ -2425,15 +2517,14 @@ static unsigned int intel_cpufreq_fast_s
 					      unsigned int target_freq)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
 	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	intel_pstate_update_pstate(cpu, target_pstate);
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, true);
+
 	return target_pstate * cpu->pstate.scaling;
 }
 
@@ -2453,7 +2544,6 @@ static int intel_cpufreq_cpu_init(struct
 		return ret;
 
 	policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
-	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
 	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
 	policy->cur = policy->cpuinfo.min_freq;
 
@@ -2465,10 +2555,17 @@ static int intel_cpufreq_cpu_init(struct
 
 	cpu = all_cpu_data[policy->cpu];
 
-	if (hwp_active)
+	if (hwp_active) {
+		u64 value;
+
 		intel_pstate_get_hwp_max(policy->cpu, &turbo_max, &max_state);
-	else
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
+		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
+		WRITE_ONCE(cpu->hwp_req_cached, value);
+	} else {
 		turbo_max = cpu->pstate.turbo_pstate;
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
+	}
 
 	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
 	min_freq *= cpu->pstate.scaling;
@@ -2569,9 +2666,6 @@ static int intel_pstate_register_driver(
 
 static int intel_pstate_unregister_driver(void)
 {
-	if (hwp_active)
-		return -EBUSY;
-
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
@@ -2827,7 +2921,10 @@ static int __init intel_pstate_init(void
 			hwp_active++;
 			hwp_mode_bdw = id->driver_data;
 			intel_pstate.attr = hwp_cpufreq_attrs;
-			default_driver = &intel_pstate;
+			intel_cpufreq.attr = hwp_cpufreq_attrs;
+			if (!default_driver)
+				default_driver = &intel_pstate;
+
 			goto hwp_cpu_matched;
 		}
 	} else {
@@ -2898,14 +2995,13 @@ static int __init intel_pstate_setup(cha
 	if (!str)
 		return -EINVAL;
 
-	if (!strcmp(str, "disable")) {
+	if (!strcmp(str, "disable"))
 		no_load = 1;
-	} else if (!strcmp(str, "active")) {
+	else if (!strcmp(str, "active"))
 		default_driver = &intel_pstate;
-	} else if (!strcmp(str, "passive")) {
+	else if (!strcmp(str, "passive"))
 		default_driver = &intel_cpufreq;
-		no_hwp = 1;
-	}
+
 	if (!strcmp(str, "no_hwp")) {
 		pr_info("HWP disabled\n");
 		no_hwp = 1;
Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
+++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
@@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
 Operation Modes
 ===============
 
-``intel_pstate`` can operate in three different modes: in the active mode with
-or without hardware-managed P-states support and in the passive mode.  Which of
-them will be in effect depends on what kernel command line options are used and
-on the capabilities of the processor.
+``intel_pstate`` can operate in two different modes, active or passive.  In the
+active mode, it uses its own internal preformance scaling governor algorithm or
+allows the hardware to do preformance scaling by itself, while in the passive
+mode it responds to requests made by a generic ``CPUFreq`` governor implementing
+a certain performance scaling algorithm.  Which of them will be in effect
+depends on what kernel command line options are used and on the capabilities of
+the processor.
 
 Active Mode
 -----------
@@ -194,10 +197,11 @@ This is the default operation mode of ``
 hardware-managed P-states (HWP) support.  It is always used if the
 ``intel_pstate=passive`` argument is passed to the kernel in the command line
 regardless of whether or not the given processor supports HWP.  [Note that the
-``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
-without ``intel_pstate=active``.]  Like in the active mode without HWP support,
-in this mode ``intel_pstate`` may refuse to work with processors that are not
-recognized by it.
+``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
+if it is not combined with ``intel_pstate=active``.]  Like in the active mode
+without HWP support, in this mode ``intel_pstate`` may refuse to work with
+processors that are not recognized by it if HWP is prevented from being enabled
+through the kernel command line.
 
 If the driver works in this mode, the ``scaling_driver`` policy attribute in
 ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
@@ -318,10 +322,9 @@ manuals need to be consulted to get to i
 
 For this reason, there is a list of supported processors in ``intel_pstate`` and
 the driver initialization will fail if the detected processor is not in that
-list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
-obtain all of the information listed above is the same for all of the processors
-supporting the HWP feature, which is why they all are supported by
-``intel_pstate``.]
+list, unless it supports the HWP feature.  [The interface to obtain all of the
+information listed above is the same for all of the processors supporting the
+HWP feature, which is why ``intel_pstate`` works with all of them.]
 
 
 User Space Interface in ``sysfs``
@@ -425,22 +428,16 @@ argument is passed to the kernel in the
 	as well as the per-policy ones) are then reset to their default
 	values, possibly depending on the target operation mode.]
 
-	That only is supported in some configurations, though (for example, if
-	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
-	the operation mode of the driver cannot be changed), and if it is not
-	supported in the current configuration, writes to this attribute will
-	fail with an appropriate error.
-
 ``energy_efficiency``
-	This attribute is only present on platforms, which have CPUs matching
-	Kaby Lake or Coffee Lake desktop CPU model. By default
-	energy efficiency optimizations are disabled on these CPU models in HWP
-	mode by this driver. Enabling energy efficiency may limit maximum
-	operating frequency in both HWP and non HWP mode. In non HWP mode,
-	optimizations are done only in the turbo frequency range. In HWP mode,
-	optimizations are done in the entire frequency range. Setting this
-	attribute to "1" enables energy efficiency optimizations and setting
-	to "0" disables energy efficiency optimizations.
+	This attribute is only present on platforms with CPUs matching the Kaby
+	Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
+	optimizations are disabled on these CPU models if HWP is enabled.
+	Enabling energy-efficiency optimizations may limit maximum operating
+	frequency with or without the HWP feature.  With HWP enabled, the
+	optimizations are done only in the turbo frequency range.  Without it,
+	they are done in the entire available frequency range.  Setting this
+	attribute to "1" enables the energy-efficiency optimizations and setting
+	to "0" disables them.
 
 Interpretation of Policy Attributes
 -----------------------------------
@@ -484,8 +481,8 @@ Next, the following policy attributes ha
 	policy for the time interval between the last two invocations of the
 	driver's utilization update callback by the CPU scheduler for that CPU.
 
-One more policy attribute is present if the `HWP feature is enabled in the
-processor <Active Mode With HWP_>`_:
+One more policy attribute is present if the HWP feature is enabled in the
+processor:
 
 ``base_frequency``
 	Shows the base frequency of the CPU. Any frequency above this will be
@@ -526,11 +523,11 @@ on the following rules, regardless of th
 
  3. The global and per-policy limits can be set independently.
 
-If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
-resulting effective values are written into its registers whenever the limits
-change in order to request its internal P-state selection logic to always set
-P-states within these limits.  Otherwise, the limits are taken into account by
-scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
+resulting effective values are written into hardware registers whenever the
+limits change in order to request its internal P-state selection logic to always
+set P-states within these limits.  Otherwise, the limits are taken into account
+by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
 every time before setting a new P-state for a CPU.
 
 Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
@@ -541,12 +538,11 @@ at all and the only way to set the limit
 Energy vs Performance Hints
 ---------------------------
 
-If ``intel_pstate`` works in the `active mode with the HWP feature enabled
-<Active Mode With HWP_>`_ in the processor, additional attributes are present
-in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
-user space to help ``intel_pstate`` to adjust the processor's internal P-state
-selection logic by focusing it on performance or on energy-efficiency, or
-somewhere between the two extremes:
+If the hardware-managed P-states (HWP) is enabled in the processor, additional
+attributes, intended to allow user space to help ``intel_pstate`` to adjust the
+processor's internal P-state selection logic by focusing it on performance or on
+energy-efficiency, or somewhere between the two extremes, are present in every
+``CPUFreq`` policy directory in ``sysfs``.  They are :
 
 ``energy_performance_preference``
 	Current value of the energy vs performance hint for the given policy
@@ -650,12 +646,14 @@ of them have to be prepended with the ``
 	Do not register ``intel_pstate`` as the scaling driver even if the
 	processor is supported by it.
 
+``active``
+	Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
+	with.
+
 ``passive``
 	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
 	start with.
 
-	This option implies the ``no_hwp`` one described below.
-
 ``force``
 	Register ``intel_pstate`` as the scaling driver instead of
 	``acpi-cpufreq`` even if the latter is preferred on the given system.
@@ -670,13 +668,12 @@ of them have to be prepended with the ``
 	driver is used instead of ``acpi-cpufreq``.
 
 ``no_hwp``
-	Do not enable the `hardware-managed P-states (HWP) feature
-	<Active Mode With HWP_>`_ even if it is supported by the processor.
+	Do not enable the hardware-managed P-states (HWP) feature even if it is
+	supported by the processor.
 
 ``hwp_only``
 	Register ``intel_pstate`` as the scaling driver only if the
-	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
-	supported by the processor.
+	hardware-managed P-states (HWP) feature is supported by the processor.
 
 ``support_acpi_ppc``
 	Take ACPI ``_PPC`` performance limits into account.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v4 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-07-27 15:13 ` [PATCH v3 0/2] " Rafael J. Wysocki
  2020-07-27 15:15   ` [PATCH v3 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values Rafael J. Wysocki
  2020-07-27 15:17   ` [PATCH v3 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
@ 2020-07-28 15:09   ` Rafael J. Wysocki
  2020-07-28 15:11     ` [PATCH v4 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values Rafael J. Wysocki
                       ` (2 more replies)
  2020-08-04 15:10   ` [PATCH v6] " Rafael J. Wysocki
  2020-08-05 16:55   ` [PATCH v7] " Rafael J. Wysocki
  4 siblings, 3 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-07-28 15:09 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez

Hi All,

On Monday, July 27, 2020 5:13:40 PM CEST Rafael J. Wysocki wrote:
> 
> On Thursday, July 16, 2020 7:37:04 PM CEST Rafael J. Wysocki wrote:
> >
> > This really is a v2 of this patch:
> > 
> > https://patchwork.kernel.org/patch/11663271/
> > 
> > with an extra preceding cleanup patch to avoid making unrelated changes in the
> > [2/2].
> 
> Almost the same as before, but the first patch has been reworked to handle
> errors in store_energy_performance_preference() correctly and rebased on top
> of the current linux-pm.git branch.
> 
> No functional changes otherwise.

One more update of the second patch.

Namely, I realized that the hwp_dynamic_boost sysfs switch was present in the
passive mode after the v3 (and the previous versions) of that patch which isn't
correct, so this modifies it to avoid exposing hwp_dynamic_boost in the passive
mode.

The first patch is the same as in the v2.

Thanks!




^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v4 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values
  2020-07-28 15:09   ` [PATCH v4 0/2] " Rafael J. Wysocki
@ 2020-07-28 15:11     ` Rafael J. Wysocki
  2020-07-30  1:31       ` Francisco Jerez
  2020-07-28 15:13     ` [PATCH v4 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
  2020-08-01 16:39     ` [PATCH v4 0/2] " Srinivas Pandruvada
  2 siblings, 1 reply; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-07-28 15:11 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Move the locking away from intel_pstate_set_energy_pref_index()
into its only caller and drop the (now redundant) return_pref label
from it.

Also move the "raw" EPP value check into the caller of that function,
so as to do it before acquiring the mutex, and reduce code duplication
related to the "raw" EPP values processing somewhat.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v2 -> v3:

   * Fix error handling in intel_pstate_set_energy_pref_index() and
     rebase.

v3 -> v4: No changes

---
 drivers/cpufreq/intel_pstate.c |   35 +++++++++++++++--------------------
 1 file changed, 15 insertions(+), 20 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -649,28 +649,18 @@ static int intel_pstate_set_energy_pref_
 	if (!pref_index)
 		epp = cpu_data->epp_default;
 
-	mutex_lock(&intel_pstate_limits_lock);
-
 	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
 		u64 value;
 
 		ret = rdmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, &value);
 		if (ret)
-			goto return_pref;
+			return ret;
 
 		value &= ~GENMASK_ULL(31, 24);
 
-		if (use_raw) {
-			if (raw_epp > 255) {
-				ret = -EINVAL;
-				goto return_pref;
-			}
-			value |= (u64)raw_epp << 24;
-			ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
-			goto return_pref;
-		}
-
-		if (epp == -EINVAL)
+		if (use_raw)
+			epp = raw_epp;
+		else if (epp == -EINVAL)
 			epp = epp_values[pref_index - 1];
 
 		value |= (u64)epp << 24;
@@ -680,8 +670,6 @@ static int intel_pstate_set_energy_pref_
 			epp = (pref_index - 1) << 2;
 		ret = intel_pstate_set_epb(cpu_data->cpu, epp);
 	}
-return_pref:
-	mutex_unlock(&intel_pstate_limits_lock);
 
 	return ret;
 }
@@ -708,8 +696,8 @@ static ssize_t store_energy_performance_
 	struct cpudata *cpu_data = all_cpu_data[policy->cpu];
 	char str_preference[21];
 	bool raw = false;
+	ssize_t ret;
 	u32 epp = 0;
-	int ret;
 
 	ret = sscanf(buf, "%20s", str_preference);
 	if (ret != 1)
@@ -724,14 +712,21 @@ static ssize_t store_energy_performance_
 		if (ret)
 			return ret;
 
+		if (epp > 255)
+			return -EINVAL;
+
 		raw = true;
 	}
 
+	mutex_lock(&intel_pstate_limits_lock);
+
 	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
-	if (ret)
-		return ret;
+	if (!ret)
+		ret = count;
 
-	return count;
+	mutex_unlock(&intel_pstate_limits_lock);
+
+	return ret;
 }
 
 static ssize_t show_energy_performance_preference(




^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v4 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-07-28 15:09   ` [PATCH v4 0/2] " Rafael J. Wysocki
  2020-07-28 15:11     ` [PATCH v4 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values Rafael J. Wysocki
@ 2020-07-28 15:13     ` Rafael J. Wysocki
  2020-08-01 23:21       ` Srinivas Pandruvada
  2020-08-03 17:17       ` [PATCH v5] " Rafael J. Wysocki
  2020-08-01 16:39     ` [PATCH v4 0/2] " Srinivas Pandruvada
  2 siblings, 2 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-07-28 15:13 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that in the frequency range below the base frequency
(ie. the frequency retured by the base_frequency cpufreq attribute
in sysfs on HWP systems) the HWP algorithm is allowed to go above
the floor P-state set by intel_pstate with or without hardware
coordination of P-states among CPUs in the same package.

Also note that the setting of the HWP floor may not be taken into
account by the processor in the following cases:

 * For the HWP floor in the range of P-states above the base
   frequency, referred to as the turbo range, the processor has a
   license to choose any P-state from that range, either below or
   above the HWP floor, just like a non-HWP processor in the case
   when the target P-state falls into the turbo range.

 * If P-states of the CPUs in the same package are coordinated
   at the hardware level, the processor may choose a P-state
   above the HWP floor, just like a non-HWP processor in the
   analogous case.

With this change applied, intel_pstate in the passive mode
assumes complete control over the HWP request MSR and concurrent
changes of that MSR (eg. via the direct MSR access interface) are
overridden by it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v1 -> v2:
   * Avoid a race condition when updating the HWP request register while
     setting a new EPP value via sysfs.

v2 -> v3:
   * Rebase.

v3 -> v4:
   * Avoid exposing the hwp_dynamic_boost sysfs switch in the passive mode.

---
 Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++------
 drivers/cpufreq/intel_pstate.c                |  204 ++++++++++++++++++++------
 2 files changed, 204 insertions(+), 89 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -36,6 +36,7 @@
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
 #define INTEL_CPUFREQ_TRANSITION_DELAY		500
 
 #ifdef CONFIG_ACPI
@@ -220,6 +221,7 @@ struct global_params {
  *			preference/bias
  * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
  *			operation
+ * @epp_cached		Cached HWP energy-performance preference value
  * @hwp_req_cached:	Cached value of the last HWP Request MSR
  * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
  * @last_io_update:	Last time when IO wake flag was set
@@ -257,6 +259,7 @@ struct cpudata {
 	s16 epp_policy;
 	s16 epp_default;
 	s16 epp_saved;
+	s16 epp_cached;
 	u64 hwp_req_cached;
 	u64 hwp_cap_cached;
 	u64 last_io_update;
@@ -690,6 +693,8 @@ static ssize_t show_energy_performance_a
 
 cpufreq_freq_attr_ro(energy_performance_available_preferences);
 
+static struct cpufreq_driver intel_pstate;
+
 static ssize_t store_energy_performance_preference(
 		struct cpufreq_policy *policy, const char *buf, size_t count)
 {
@@ -718,14 +723,35 @@ static ssize_t store_energy_performance_
 		raw = true;
 	}
 
+	mutex_lock(&intel_pstate_driver_lock);
+
+	if (!intel_pstate_driver) {
+		mutex_unlock(&intel_pstate_driver_lock);
+		return -EAGAIN;
+	}
+
 	mutex_lock(&intel_pstate_limits_lock);
 
-	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
-	if (!ret)
+	if (intel_pstate_driver == &intel_pstate) {
+		ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
+		if (!ret)
+			ret = count;
+	} else {
+		/*
+		 * In the passive mode simply update the cached EPP value and
+		 * rely on intel_cpufreq_adjust_hwp() to pick it up later.
+		 */
+		if (!raw)
+			epp = ret ? epp_values[ret - 1] : cpu_data->epp_default;
+
+		WRITE_ONCE(cpu_data->epp_cached, epp);
 		ret = count;
+	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
+	mutex_unlock(&intel_pstate_driver_lock);
+
 	return ret;
 }
 
@@ -1138,8 +1164,6 @@ static ssize_t store_no_turbo(struct kob
 	return count;
 }
 
-static struct cpufreq_driver intel_pstate;
-
 static void update_qos_request(enum freq_qos_req_type type)
 {
 	int max_state, turbo_max, freq, i, perf_pct;
@@ -1323,9 +1347,10 @@ static const struct attribute_group inte
 
 static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[];
 
+static struct kobject *intel_pstate_kobject;
+
 static void __init intel_pstate_sysfs_expose_params(void)
 {
-	struct kobject *intel_pstate_kobject;
 	int rc;
 
 	intel_pstate_kobject = kobject_create_and_add("intel_pstate",
@@ -1350,17 +1375,31 @@ static void __init intel_pstate_sysfs_ex
 	rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
 	WARN_ON(rc);
 
-	if (hwp_active) {
-		rc = sysfs_create_file(intel_pstate_kobject,
-				       &hwp_dynamic_boost.attr);
-		WARN_ON(rc);
-	}
-
 	if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids)) {
 		rc = sysfs_create_file(intel_pstate_kobject, &energy_efficiency.attr);
 		WARN_ON(rc);
 	}
 }
+
+static void intel_pstate_sysfs_expose_hwp_dynamic_boost(void)
+{
+	int rc;
+
+	if (!hwp_active)
+		return;
+
+	rc = sysfs_create_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+	WARN_ON_ONCE(rc);
+}
+
+static void intel_pstate_sysfs_hide_hwp_dynamic_boost(void)
+{
+	if (!hwp_active)
+		return;
+
+	sysfs_remove_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+}
+
 /************************** sysfs end ************************/
 
 static void intel_pstate_hwp_enable(struct cpudata *cpudata)
@@ -2041,6 +2080,7 @@ static int intel_pstate_init_cpu(unsigne
 		cpu->epp_default = -EINVAL;
 		cpu->epp_powersave = -EINVAL;
 		cpu->epp_saved = -EINVAL;
+		WRITE_ONCE(cpu->epp_cached, -EINVAL);
 	}
 
 	cpu = all_cpu_data[cpunum];
@@ -2239,7 +2279,10 @@ static int intel_pstate_verify_policy(st
 
 static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
 {
-	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+	if (hwp_active)
+		intel_pstate_hwp_force_min_perf(policy->cpu);
+	else
+		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
 }
 
 static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
@@ -2247,12 +2290,10 @@ static void intel_pstate_stop_cpu(struct
 	pr_debug("CPU %d exiting\n", policy->cpu);
 
 	intel_pstate_clear_update_util_hook(policy->cpu);
-	if (hwp_active) {
+	if (hwp_active)
 		intel_pstate_hwp_save_state(policy);
-		intel_pstate_hwp_force_min_perf(policy->cpu);
-	} else {
-		intel_cpufreq_stop_cpu(policy);
-	}
+
+	intel_cpufreq_stop_cpu(policy);
 }
 
 static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
@@ -2382,13 +2423,82 @@ static void intel_cpufreq_trace(struct c
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
+				     bool fast_switch)
+{
+	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
+	s16 epp;
+
+	value &= ~HWP_MIN_PERF(~0L);
+	value |= HWP_MIN_PERF(target_pstate);
+
+	/*
+	 * The entire MSR needs to be updated in order to update the HWP min
+	 * field in it, so opportunistically update the max too if needed.
+	 */
+	value &= ~HWP_MAX_PERF(~0L);
+	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
+
+	/*
+	 * In case the EPP has been adjusted via sysfs, write the last cached
+	 * value of it to the MSR as well.
+	 */
+	epp = READ_ONCE(cpu->epp_cached);
+	if (epp >= 0) {
+		value &= ~GENMASK_ULL(31, 24);
+		value |= (u64)epp << 24;
+	}
+
+	if (value == prev)
+		return;
+
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	if (fast_switch)
+		wrmsrl(MSR_HWP_REQUEST, value);
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
+static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
+					  u32 target_pstate, bool fast_switch)
+{
+	if (fast_switch)
+		wrmsrl(MSR_IA32_PERF_CTL,
+		       pstate_funcs.get_val(cpu, target_pstate));
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
+			      pstate_funcs.get_val(cpu, target_pstate));
+}
+
+static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
+				       bool fast_switch)
+{
+	int old_pstate = cpu->pstate.current_pstate;
+
+	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	if (target_pstate != old_pstate) {
+		cpu->pstate.current_pstate = target_pstate;
+		if (hwp_active)
+			intel_cpufreq_adjust_hwp(cpu, target_pstate,
+						 fast_switch);
+		else
+			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
+						      fast_switch);
+	}
+
+	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
+			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
+	return target_pstate;
+}
+
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
 				unsigned int target_freq,
 				unsigned int relation)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
 	struct cpufreq_freqs freqs;
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
@@ -2396,6 +2506,7 @@ static int intel_cpufreq_target(struct c
 	freqs.new = target_freq;
 
 	cpufreq_freq_transition_begin(policy, &freqs);
+
 	switch (relation) {
 	case CPUFREQ_RELATION_L:
 		target_pstate = DIV_ROUND_UP(freqs.new, cpu->pstate.scaling);
@@ -2407,15 +2518,11 @@ static int intel_cpufreq_target(struct c
 		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu->pstate.scaling);
 		break;
 	}
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	if (target_pstate != cpu->pstate.current_pstate) {
-		cpu->pstate.current_pstate = target_pstate;
-		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
-	}
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, false);
+
 	freqs.new = target_pstate * cpu->pstate.scaling;
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
 	cpufreq_freq_transition_end(policy, &freqs, false);
 
 	return 0;
@@ -2425,15 +2532,14 @@ static unsigned int intel_cpufreq_fast_s
 					      unsigned int target_freq)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
 	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	intel_pstate_update_pstate(cpu, target_pstate);
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, true);
+
 	return target_pstate * cpu->pstate.scaling;
 }
 
@@ -2453,7 +2559,6 @@ static int intel_cpufreq_cpu_init(struct
 		return ret;
 
 	policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
-	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
 	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
 	policy->cur = policy->cpuinfo.min_freq;
 
@@ -2465,10 +2570,17 @@ static int intel_cpufreq_cpu_init(struct
 
 	cpu = all_cpu_data[policy->cpu];
 
-	if (hwp_active)
+	if (hwp_active) {
+		u64 value;
+
 		intel_pstate_get_hwp_max(policy->cpu, &turbo_max, &max_state);
-	else
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
+		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
+		WRITE_ONCE(cpu->hwp_req_cached, value);
+	} else {
 		turbo_max = cpu->pstate.turbo_pstate;
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
+	}
 
 	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
 	min_freq *= cpu->pstate.scaling;
@@ -2545,6 +2657,10 @@ static void intel_pstate_driver_cleanup(
 		}
 	}
 	put_online_cpus();
+
+	if (intel_pstate_driver == &intel_pstate)
+		intel_pstate_sysfs_hide_hwp_dynamic_boost();
+
 	intel_pstate_driver = NULL;
 }
 
@@ -2552,6 +2668,9 @@ static int intel_pstate_register_driver(
 {
 	int ret;
 
+	if (driver == &intel_pstate)
+		intel_pstate_sysfs_expose_hwp_dynamic_boost();
+
 	memset(&global, 0, sizeof(global));
 	global.max_perf_pct = 100;
 
@@ -2569,9 +2688,6 @@ static int intel_pstate_register_driver(
 
 static int intel_pstate_unregister_driver(void)
 {
-	if (hwp_active)
-		return -EBUSY;
-
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
@@ -2827,7 +2943,10 @@ static int __init intel_pstate_init(void
 			hwp_active++;
 			hwp_mode_bdw = id->driver_data;
 			intel_pstate.attr = hwp_cpufreq_attrs;
-			default_driver = &intel_pstate;
+			intel_cpufreq.attr = hwp_cpufreq_attrs;
+			if (!default_driver)
+				default_driver = &intel_pstate;
+
 			goto hwp_cpu_matched;
 		}
 	} else {
@@ -2898,14 +3017,13 @@ static int __init intel_pstate_setup(cha
 	if (!str)
 		return -EINVAL;
 
-	if (!strcmp(str, "disable")) {
+	if (!strcmp(str, "disable"))
 		no_load = 1;
-	} else if (!strcmp(str, "active")) {
+	else if (!strcmp(str, "active"))
 		default_driver = &intel_pstate;
-	} else if (!strcmp(str, "passive")) {
+	else if (!strcmp(str, "passive"))
 		default_driver = &intel_cpufreq;
-		no_hwp = 1;
-	}
+
 	if (!strcmp(str, "no_hwp")) {
 		pr_info("HWP disabled\n");
 		no_hwp = 1;
Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
+++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
@@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
 Operation Modes
 ===============
 
-``intel_pstate`` can operate in three different modes: in the active mode with
-or without hardware-managed P-states support and in the passive mode.  Which of
-them will be in effect depends on what kernel command line options are used and
-on the capabilities of the processor.
+``intel_pstate`` can operate in two different modes, active or passive.  In the
+active mode, it uses its own internal preformance scaling governor algorithm or
+allows the hardware to do preformance scaling by itself, while in the passive
+mode it responds to requests made by a generic ``CPUFreq`` governor implementing
+a certain performance scaling algorithm.  Which of them will be in effect
+depends on what kernel command line options are used and on the capabilities of
+the processor.
 
 Active Mode
 -----------
@@ -194,10 +197,11 @@ This is the default operation mode of ``
 hardware-managed P-states (HWP) support.  It is always used if the
 ``intel_pstate=passive`` argument is passed to the kernel in the command line
 regardless of whether or not the given processor supports HWP.  [Note that the
-``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
-without ``intel_pstate=active``.]  Like in the active mode without HWP support,
-in this mode ``intel_pstate`` may refuse to work with processors that are not
-recognized by it.
+``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
+if it is not combined with ``intel_pstate=active``.]  Like in the active mode
+without HWP support, in this mode ``intel_pstate`` may refuse to work with
+processors that are not recognized by it if HWP is prevented from being enabled
+through the kernel command line.
 
 If the driver works in this mode, the ``scaling_driver`` policy attribute in
 ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
@@ -318,10 +322,9 @@ manuals need to be consulted to get to i
 
 For this reason, there is a list of supported processors in ``intel_pstate`` and
 the driver initialization will fail if the detected processor is not in that
-list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
-obtain all of the information listed above is the same for all of the processors
-supporting the HWP feature, which is why they all are supported by
-``intel_pstate``.]
+list, unless it supports the HWP feature.  [The interface to obtain all of the
+information listed above is the same for all of the processors supporting the
+HWP feature, which is why ``intel_pstate`` works with all of them.]
 
 
 User Space Interface in ``sysfs``
@@ -425,22 +428,16 @@ argument is passed to the kernel in the
 	as well as the per-policy ones) are then reset to their default
 	values, possibly depending on the target operation mode.]
 
-	That only is supported in some configurations, though (for example, if
-	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
-	the operation mode of the driver cannot be changed), and if it is not
-	supported in the current configuration, writes to this attribute will
-	fail with an appropriate error.
-
 ``energy_efficiency``
-	This attribute is only present on platforms, which have CPUs matching
-	Kaby Lake or Coffee Lake desktop CPU model. By default
-	energy efficiency optimizations are disabled on these CPU models in HWP
-	mode by this driver. Enabling energy efficiency may limit maximum
-	operating frequency in both HWP and non HWP mode. In non HWP mode,
-	optimizations are done only in the turbo frequency range. In HWP mode,
-	optimizations are done in the entire frequency range. Setting this
-	attribute to "1" enables energy efficiency optimizations and setting
-	to "0" disables energy efficiency optimizations.
+	This attribute is only present on platforms with CPUs matching the Kaby
+	Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
+	optimizations are disabled on these CPU models if HWP is enabled.
+	Enabling energy-efficiency optimizations may limit maximum operating
+	frequency with or without the HWP feature.  With HWP enabled, the
+	optimizations are done only in the turbo frequency range.  Without it,
+	they are done in the entire available frequency range.  Setting this
+	attribute to "1" enables the energy-efficiency optimizations and setting
+	to "0" disables them.
 
 Interpretation of Policy Attributes
 -----------------------------------
@@ -484,8 +481,8 @@ Next, the following policy attributes ha
 	policy for the time interval between the last two invocations of the
 	driver's utilization update callback by the CPU scheduler for that CPU.
 
-One more policy attribute is present if the `HWP feature is enabled in the
-processor <Active Mode With HWP_>`_:
+One more policy attribute is present if the HWP feature is enabled in the
+processor:
 
 ``base_frequency``
 	Shows the base frequency of the CPU. Any frequency above this will be
@@ -526,11 +523,11 @@ on the following rules, regardless of th
 
  3. The global and per-policy limits can be set independently.
 
-If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
-resulting effective values are written into its registers whenever the limits
-change in order to request its internal P-state selection logic to always set
-P-states within these limits.  Otherwise, the limits are taken into account by
-scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
+resulting effective values are written into hardware registers whenever the
+limits change in order to request its internal P-state selection logic to always
+set P-states within these limits.  Otherwise, the limits are taken into account
+by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
 every time before setting a new P-state for a CPU.
 
 Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
@@ -541,12 +538,11 @@ at all and the only way to set the limit
 Energy vs Performance Hints
 ---------------------------
 
-If ``intel_pstate`` works in the `active mode with the HWP feature enabled
-<Active Mode With HWP_>`_ in the processor, additional attributes are present
-in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
-user space to help ``intel_pstate`` to adjust the processor's internal P-state
-selection logic by focusing it on performance or on energy-efficiency, or
-somewhere between the two extremes:
+If the hardware-managed P-states (HWP) is enabled in the processor, additional
+attributes, intended to allow user space to help ``intel_pstate`` to adjust the
+processor's internal P-state selection logic by focusing it on performance or on
+energy-efficiency, or somewhere between the two extremes, are present in every
+``CPUFreq`` policy directory in ``sysfs``.  They are :
 
 ``energy_performance_preference``
 	Current value of the energy vs performance hint for the given policy
@@ -650,12 +646,14 @@ of them have to be prepended with the ``
 	Do not register ``intel_pstate`` as the scaling driver even if the
 	processor is supported by it.
 
+``active``
+	Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
+	with.
+
 ``passive``
 	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
 	start with.
 
-	This option implies the ``no_hwp`` one described below.
-
 ``force``
 	Register ``intel_pstate`` as the scaling driver instead of
 	``acpi-cpufreq`` even if the latter is preferred on the given system.
@@ -670,13 +668,12 @@ of them have to be prepended with the ``
 	driver is used instead of ``acpi-cpufreq``.
 
 ``no_hwp``
-	Do not enable the `hardware-managed P-states (HWP) feature
-	<Active Mode With HWP_>`_ even if it is supported by the processor.
+	Do not enable the hardware-managed P-states (HWP) feature even if it is
+	supported by the processor.
 
 ``hwp_only``
 	Register ``intel_pstate`` as the scaling driver only if the
-	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
-	supported by the processor.
+	hardware-managed P-states (HWP) feature is supported by the processor.
 
 ``support_acpi_ppc``
 	Take ACPI ``_PPC`` performance limits into account.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values
  2020-07-28 15:11     ` [PATCH v4 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values Rafael J. Wysocki
@ 2020-07-30  1:31       ` Francisco Jerez
  0 siblings, 0 replies; 33+ messages in thread
From: Francisco Jerez @ 2020-07-30  1:31 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies


[-- Attachment #1.1: Type: text/plain, Size: 3057 bytes --]

"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Move the locking away from intel_pstate_set_energy_pref_index()
> into its only caller and drop the (now redundant) return_pref label
> from it.
>
> Also move the "raw" EPP value check into the caller of that function,
> so as to do it before acquiring the mutex, and reduce code duplication
> related to the "raw" EPP values processing somewhat.
>
> No intentional functional impact.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Reviewed-by: Francisco Jerez <currojerez@riseup.net>

> ---
>
> v2 -> v3:
>
>    * Fix error handling in intel_pstate_set_energy_pref_index() and
>      rebase.
>
> v3 -> v4: No changes
>
> ---
>  drivers/cpufreq/intel_pstate.c |   35 +++++++++++++++--------------------
>  1 file changed, 15 insertions(+), 20 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -649,28 +649,18 @@ static int intel_pstate_set_energy_pref_
>  	if (!pref_index)
>  		epp = cpu_data->epp_default;
>  
> -	mutex_lock(&intel_pstate_limits_lock);
> -
>  	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
>  		u64 value;
>  
>  		ret = rdmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, &value);
>  		if (ret)
> -			goto return_pref;
> +			return ret;
>  
>  		value &= ~GENMASK_ULL(31, 24);
>  
> -		if (use_raw) {
> -			if (raw_epp > 255) {
> -				ret = -EINVAL;
> -				goto return_pref;
> -			}
> -			value |= (u64)raw_epp << 24;
> -			ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
> -			goto return_pref;
> -		}
> -
> -		if (epp == -EINVAL)
> +		if (use_raw)
> +			epp = raw_epp;
> +		else if (epp == -EINVAL)
>  			epp = epp_values[pref_index - 1];
>  
>  		value |= (u64)epp << 24;
> @@ -680,8 +670,6 @@ static int intel_pstate_set_energy_pref_
>  			epp = (pref_index - 1) << 2;
>  		ret = intel_pstate_set_epb(cpu_data->cpu, epp);
>  	}
> -return_pref:
> -	mutex_unlock(&intel_pstate_limits_lock);
>  
>  	return ret;
>  }
> @@ -708,8 +696,8 @@ static ssize_t store_energy_performance_
>  	struct cpudata *cpu_data = all_cpu_data[policy->cpu];
>  	char str_preference[21];
>  	bool raw = false;
> +	ssize_t ret;
>  	u32 epp = 0;
> -	int ret;
>  
>  	ret = sscanf(buf, "%20s", str_preference);
>  	if (ret != 1)
> @@ -724,14 +712,21 @@ static ssize_t store_energy_performance_
>  		if (ret)
>  			return ret;
>  
> +		if (epp > 255)
> +			return -EINVAL;
> +
>  		raw = true;
>  	}
>  
> +	mutex_lock(&intel_pstate_limits_lock);
> +
>  	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
> -	if (ret)
> -		return ret;
> +	if (!ret)
> +		ret = count;
>  
> -	return count;
> +	mutex_unlock(&intel_pstate_limits_lock);
> +
> +	return ret;
>  }
>  
>  static ssize_t show_energy_performance_preference(

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-07-28 15:09   ` [PATCH v4 0/2] " Rafael J. Wysocki
  2020-07-28 15:11     ` [PATCH v4 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values Rafael J. Wysocki
  2020-07-28 15:13     ` [PATCH v4 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
@ 2020-08-01 16:39     ` Srinivas Pandruvada
  2020-08-02 14:00       ` Doug Smythies
  2020-08-03 17:23       ` Rafael J. Wysocki
  2 siblings, 2 replies; 33+ messages in thread
From: Srinivas Pandruvada @ 2020-08-01 16:39 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Giovanni Gherdovich,
	Doug Smythies, Francisco Jerez

[-- Attachment #1: Type: text/plain, Size: 1635 bytes --]

On Tue, 2020-07-28 at 17:09 +0200, Rafael J. Wysocki wrote:
> Hi All,
> 
> On Monday, July 27, 2020 5:13:40 PM CEST Rafael J. Wysocki wrote:
> > On Thursday, July 16, 2020 7:37:04 PM CEST Rafael J. Wysocki wrote:
> > > This really is a v2 of this patch:
> > > 
> > > https://patchwork.kernel.org/patch/11663271/
> > > 
> > > with an extra preceding cleanup patch to avoid making unrelated
> > > changes in the
> > > [2/2].
> > 
I applied this series along with
[PATCH] cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
on 5.8 latest master (On top of raw epp patchset).

When intel_pstate=passive from kernel command line then it is fine, no
crash. But switch dynamically, crashed:
Attached crash.txt. I may need to try your linux-pm tree.

Then after some playing I reached a state when I monitor MSR 0x774:
while true; do rdmsr 0x774; sleep 1; done
80002704
...
...
ff000101
ff000101
ff000101
ff000101
ff000101
ff000101
ff000101
ff000101

Don't have a recipe to reproduce this.

Thanks,
Srinivas

> > Almost the same as before, but the first patch has been reworked to
> > handle
> > errors in store_energy_performance_preference() correctly and
> > rebased on top
> > of the current linux-pm.git branch.
> > 
> > No functional changes otherwise.
> 
> One more update of the second patch.
> 
> Namely, I realized that the hwp_dynamic_boost sysfs switch was
> present in the
> passive mode after the v3 (and the previous versions) of that patch
> which isn't
> correct, so this modifies it to avoid exposing hwp_dynamic_boost in
> the passive
> mode.
> 
> The first patch is the same as in the v2.
> 
> Thanks!
> 
> 
> 

[-- Attachment #2: crash.txt --]
[-- Type: text/plain, Size: 5520 bytes --]

[  232.483420] BUG: kernel NULL pointer dereference, address: 0000000000000030
[  232.483435] #PF: supervisor read access in kernel mode
[  232.483441] #PF: error_code(0x0000) - not-present page
[  232.483446] PGD 0 P4D 0 
[  232.483457] Oops: 0000 [#1] SMP NOPTI
[  232.483469] CPU: 7 PID: 2064 Comm: bash Tainted: G        W         5.8.0-rc6+ #6
[  232.483474] Hardware name: Dell Inc. XPS 13 7390 2-in-1/06CDVY, BIOS 1.3.1 03/02/2020
[  232.483491] RIP: 0010:sysfs_remove_file_ns+0x6/0x20
[  232.483500] Code: ff 4c 89 e7 e8 bb ce ff ff 4c 89 ef e8 43 f9 1d 00 41 5c 41 5d 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 <48> 8b 7f 30 48 8b 36 48 89 e5 e8 cb e2 ff ff 5d c3 66 0f 1f 84 00
[  232.483507] RSP: 0018:ffffaa37c1c93df8 EFLAGS: 00010246
[  232.483514] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000008
[  232.483519] RDX: 0000000000000000 RSI: ffffffffa87f1f60 RDI: 0000000000000000
[  232.483524] RBP: ffffaa37c1c93e18 R08: 0000000000000000 R09: ffffffffa791ff00
[  232.483529] R10: ffff8a15bc0f3600 R11: 0000000000000001 R12: 0000000000000008
[  232.483533] R13: ffff8a15b3059197 R14: fffffffffffffff2 R15: ffff8a159f945020
[  232.483541] FS:  00007f068a3d4740(0000) GS:ffff8a15bf7c0000(0000) knlGS:0000000000000000
[  232.483547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  232.483552] CR2: 0000000000000030 CR3: 000000046ec78004 CR4: 0000000000760ee0
[  232.483557] PKRU: 55555554
[  232.483561] Call Trace:
[  232.483581]  ? intel_pstate_driver_cleanup+0xbd/0xd0
[  232.483590]  store_status+0x9b/0x180
[  232.483603]  kobj_attr_store+0x12/0x20
[  232.483610]  sysfs_kf_write+0x3e/0x50
[  232.483623]  kernfs_fop_write+0xda/0x1b0
[  232.483636]  vfs_write+0xc9/0x200
[  232.483647]  ksys_write+0x67/0xe0
[  232.483657]  __x64_sys_write+0x1a/0x20
[  232.483668]  do_syscall_64+0x52/0xc0
[  232.483680]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  232.483689] RIP: 0033:0x7f068a4e8057
[  232.483693] Code: Bad RIP value.
[  232.483699] RSP: 002b:00007ffe0d4faec8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  232.483705] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f068a4e8057
[  232.483710] RDX: 0000000000000008 RSI: 000055b52e3a6700 RDI: 0000000000000001
[  232.483715] RBP: 000055b52e3a6700 R08: 000000000000000a R09: 0000000000000007
[  232.483719] R10: 000055b52cca5017 R11: 0000000000000246 R12: 0000000000000008
[  232.483724] R13: 00007f068a5c36a0 R14: 00007f068a5c44a0 R15: 00007f068a5c38a0
[  232.483735] Modules linked in: msr rfcomm ccm cmac algif_hash algif_skcipher af_alg wacom usbhid hid_multitouch bnep hid_sensor_als hid_sensor_incl_3d hid_sensor_accel_3d hid_sensor_magn_3d hid_sensor_gyro_3d hid_sensor_rotation hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio hid_sensor_custom x86_pkg_temp_thermal hid_sensor_hub intel_powerclamp hid_generic intel_ishtp_loader snd_sof_pci snd_sof_intel_byt intel_ishtp_hid snd_sof_intel_ipc dell_laptop dell_wmi cros_ec_ishtp snd_sof_intel_hda_common mei_hdcp rtsx_pci_sdmmc intel_rapl_msr intel_wmi_thunderbolt coretemp wmi_bmof dell_smbios snd_soc_hdac_hda cros_ec snd_sof_xtensa_dsp snd_sof_intel_hda snd_sof kvm_intel dell_wmi_descriptor snd_hda_codec_hdmi dcdbas snd_hda_ext_core dell_smm_hwmon nls_iso8859_1 snd_soc_acpi_intel_match kvm snd_hda_codec_realtek snd_soc_acpi snd_hda_codec_generic iwlmvm ledtrig_audio snd_soc_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel
[  232.483821]  crypto_simd cryptd snd_compress glue_helper ac97_bus snd_pcm_dmaengine rapl mac80211 intel_cstate snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core joydev snd_hwdep snd_pcm libarc4 efi_pstore iwlwifi btusb snd_seq_midi snd_seq_midi_event btrtl snd_rawmidi btbcm btintel snd_seq bluetooth cfg80211 snd_seq_device snd_timer snd i2c_i801 i2c_smbus ucsi_acpi processor_thermal_device typec_ucsi intel_rapl_common intel_lpss_pci intel_lpss idma64 rtsx_pci soundcore mei_me ecdh_generic mei intel_ish_ipc ecc i2c_hid intel_ishtp intel_soc_dts_iosf typec virt_dma wmi hid int3403_thermal soc_button_array int340x_thermal_zone int3400_thermal intel_hid acpi_thermal_rel sparse_keymap acpi_pad acpi_tad sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm psmouse nvme nvme_core input_leds serio_raw mac_hid video pinctrl_icelake pinctrl_intel
[  232.483940] CR2: 0000000000000030
[  232.483950] ---[ end trace 31db41bab6fdff6f ]---
[  233.812260] RIP: 0010:sysfs_remove_file_ns+0x6/0x20
[  233.812281] Code: ff 4c 89 e7 e8 bb ce ff ff 4c 89 ef e8 43 f9 1d 00 41 5c 41 5d 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 <48> 8b 7f 30 48 8b 36 48 89 e5 e8 cb e2 ff ff 5d c3 66 0f 1f 84 00
[  233.812292] RSP: 0018:ffffaa37c1c93df8 EFLAGS: 00010246
[  233.812302] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000008
[  233.812308] RDX: 0000000000000000 RSI: ffffffffa87f1f60 RDI: 0000000000000000
[  233.812313] RBP: ffffaa37c1c93e18 R08: 0000000000000000 R09: ffffffffa791ff00
[  233.812318] R10: ffff8a15bc0f3600 R11: 0000000000000001 R12: 0000000000000008
[  233.812323] R13: ffff8a15b3059197 R14: fffffffffffffff2 R15: ffff8a159f945020
[  233.812331] FS:  00007f068a3d4740(0000) GS:ffff8a15bf7c0000(0000) knlGS:0000000000000000
[  233.812337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  233.812343] CR2: 0000000000000030 CR3: 000000046ec78004 CR4: 0000000000760ee0
[  233.812349] PKRU: 55555554


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-07-28 15:13     ` [PATCH v4 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
@ 2020-08-01 23:21       ` Srinivas Pandruvada
  2020-08-02 14:14         ` Doug Smythies
  2020-08-03 17:17       ` [PATCH v5] " Rafael J. Wysocki
  1 sibling, 1 reply; 33+ messages in thread
From: Srinivas Pandruvada @ 2020-08-01 23:21 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Giovanni Gherdovich,
	Doug Smythies, Francisco Jerez

On Tue, 2020-07-28 at 17:13 +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Allow intel_pstate to work in the passive mode with HWP enabled and
> make it set the HWP minimum performance limit (HWP floor) to the
> P-state value given by the target frequency supplied by the cpufreq
> governor, so as to prevent the HWP algorithm and the CPU scheduler
> from working against each other, at least when the schedutil governor
> is in use, and update the intel_pstate documentation accordingly.
> 
> Among other things, this allows utilization clamps to be taken
> into account, at least to a certain extent, when intel_pstate is
> in use and makes it more likely that sufficient capacity for
> deadline tasks will be provided.
> 
> After this change, the resulting behavior of an HWP system with
> intel_pstate in the passive mode should be close to the behavior
> of the analogous non-HWP system with intel_pstate in the passive
> mode, except that in the frequency range below the base frequency
> (ie. the frequency retured by the base_frequency cpufreq attribute
> in sysfs on HWP systems) the HWP algorithm is allowed to go above
> the floor P-state set by intel_pstate with or without hardware
> coordination of P-states among CPUs in the same package.
> 
Do you mean HWP.req.min will be below base_freq (unless user overrides
it)?
With busy workload I see HWP req.min = HWP req.max.
The base freq: 1.3GHz (ratio 0x0d), MAX 1C turbo: 3.9GHz (ratio: 0x27)
When I monitor MSR 0x774 (HWP_REQ), I see
0x80002727

Normally msr 0x774
0x80002704 

Thanks,
Srinivas

> Also note that the setting of the HWP floor may not be taken into
> account by the processor in the following cases:
> 
>  * For the HWP floor in the range of P-states above the base
>    frequency, referred to as the turbo range, the processor has a
>    license to choose any P-state from that range, either below or
>    above the HWP floor, just like a non-HWP processor in the case
>    when the target P-state falls into the turbo range.
> 
>  * If P-states of the CPUs in the same package are coordinated
>    at the hardware level, the processor may choose a P-state
>    above the HWP floor, just like a non-HWP processor in the
>    analogous case.
> 
> With this change applied, intel_pstate in the passive mode
> assumes complete control over the HWP request MSR and concurrent
> changes of that MSR (eg. via the direct MSR access interface) are
> overridden by it.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> v1 -> v2:
>    * Avoid a race condition when updating the HWP request register
> while
>      setting a new EPP value via sysfs.
> 
> v2 -> v3:
>    * Rebase.
> 
> v3 -> v4:
>    * Avoid exposing the hwp_dynamic_boost sysfs switch in the passive
> mode.
> 
> ---
>  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++------
>  drivers/cpufreq/intel_pstate.c                |  204
> ++++++++++++++++++++------
>  2 files changed, 204 insertions(+), 89 deletions(-)
> 
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -36,6 +36,7 @@
>  #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
>  
>  #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
> +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
>  #define INTEL_CPUFREQ_TRANSITION_DELAY		500
>  
>  #ifdef CONFIG_ACPI
> @@ -220,6 +221,7 @@ struct global_params {
>   *			preference/bias
>   * @epp_saved:		Saved EPP/EPB during system suspend or
> CPU offline
>   *			operation
> + * @epp_cached		Cached HWP energy-performance
> preference value
>   * @hwp_req_cached:	Cached value of the last HWP Request MSR
>   * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
>   * @last_io_update:	Last time when IO wake flag was set
> @@ -257,6 +259,7 @@ struct cpudata {
>  	s16 epp_policy;
>  	s16 epp_default;
>  	s16 epp_saved;
> +	s16 epp_cached;
>  	u64 hwp_req_cached;
>  	u64 hwp_cap_cached;
>  	u64 last_io_update;
> @@ -690,6 +693,8 @@ static ssize_t show_energy_performance_a
>  
>  cpufreq_freq_attr_ro(energy_performance_available_preferences);
>  
> +static struct cpufreq_driver intel_pstate;
> +
>  static ssize_t store_energy_performance_preference(
>  		struct cpufreq_policy *policy, const char *buf, size_t
> count)
>  {
> @@ -718,14 +723,35 @@ static ssize_t store_energy_performance_
>  		raw = true;
>  	}
>  
> +	mutex_lock(&intel_pstate_driver_lock);
> +
> +	if (!intel_pstate_driver) {
> +		mutex_unlock(&intel_pstate_driver_lock);
> +		return -EAGAIN;
> +	}
> +
>  	mutex_lock(&intel_pstate_limits_lock);
>  
> -	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw,
> epp);
> -	if (!ret)
> +	if (intel_pstate_driver == &intel_pstate) {
> +		ret = intel_pstate_set_energy_pref_index(cpu_data, ret,
> raw, epp);
> +		if (!ret)
> +			ret = count;
> +	} else {
> +		/*
> +		 * In the passive mode simply update the cached EPP
> value and
> +		 * rely on intel_cpufreq_adjust_hwp() to pick it up
> later.
> +		 */
> +		if (!raw)
> +			epp = ret ? epp_values[ret - 1] : cpu_data-
> >epp_default;
> +
> +		WRITE_ONCE(cpu_data->epp_cached, epp);
>  		ret = count;
> +	}
>  
>  	mutex_unlock(&intel_pstate_limits_lock);
>  
> +	mutex_unlock(&intel_pstate_driver_lock);
> +
>  	return ret;
>  }
>  
> @@ -1138,8 +1164,6 @@ static ssize_t store_no_turbo(struct kob
>  	return count;
>  }
>  
> -static struct cpufreq_driver intel_pstate;
> -
>  static void update_qos_request(enum freq_qos_req_type type)
>  {
>  	int max_state, turbo_max, freq, i, perf_pct;
> @@ -1323,9 +1347,10 @@ static const struct attribute_group inte
>  
>  static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[];
>  
> +static struct kobject *intel_pstate_kobject;
> +
>  static void __init intel_pstate_sysfs_expose_params(void)
>  {
> -	struct kobject *intel_pstate_kobject;
>  	int rc;
>  
>  	intel_pstate_kobject = kobject_create_and_add("intel_pstate",
> @@ -1350,17 +1375,31 @@ static void __init intel_pstate_sysfs_ex
>  	rc = sysfs_create_file(intel_pstate_kobject,
> &min_perf_pct.attr);
>  	WARN_ON(rc);
>  
> -	if (hwp_active) {
> -		rc = sysfs_create_file(intel_pstate_kobject,
> -				       &hwp_dynamic_boost.attr);
> -		WARN_ON(rc);
> -	}
> -
>  	if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids)) {
>  		rc = sysfs_create_file(intel_pstate_kobject,
> &energy_efficiency.attr);
>  		WARN_ON(rc);
>  	}
>  }
> +
> +static void intel_pstate_sysfs_expose_hwp_dynamic_boost(void)
> +{
> +	int rc;
> +
> +	if (!hwp_active)
> +		return;
> +
> +	rc = sysfs_create_file(intel_pstate_kobject,
> &hwp_dynamic_boost.attr);
> +	WARN_ON_ONCE(rc);
> +}
> +
> +static void intel_pstate_sysfs_hide_hwp_dynamic_boost(void)
> +{
> +	if (!hwp_active)
> +		return;
> +
> +	sysfs_remove_file(intel_pstate_kobject,
> &hwp_dynamic_boost.attr);
> +}
> +
>  /************************** sysfs end ************************/
>  
>  static void intel_pstate_hwp_enable(struct cpudata *cpudata)
> @@ -2041,6 +2080,7 @@ static int intel_pstate_init_cpu(unsigne
>  		cpu->epp_default = -EINVAL;
>  		cpu->epp_powersave = -EINVAL;
>  		cpu->epp_saved = -EINVAL;
> +		WRITE_ONCE(cpu->epp_cached, -EINVAL);
>  	}
>  
>  	cpu = all_cpu_data[cpunum];
> @@ -2239,7 +2279,10 @@ static int intel_pstate_verify_policy(st
>  
>  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>  {
> -	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> +	if (hwp_active)
> +		intel_pstate_hwp_force_min_perf(policy->cpu);
> +	else
> +		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>  }
>  
>  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
> @@ -2247,12 +2290,10 @@ static void intel_pstate_stop_cpu(struct
>  	pr_debug("CPU %d exiting\n", policy->cpu);
>  
>  	intel_pstate_clear_update_util_hook(policy->cpu);
> -	if (hwp_active) {
> +	if (hwp_active)
>  		intel_pstate_hwp_save_state(policy);
> -		intel_pstate_hwp_force_min_perf(policy->cpu);
> -	} else {
> -		intel_cpufreq_stop_cpu(policy);
> -	}
> +
> +	intel_cpufreq_stop_cpu(policy);
>  }
>  
>  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
> @@ -2382,13 +2423,82 @@ static void intel_cpufreq_trace(struct c
>  		fp_toint(cpu->iowait_boost * 100));
>  }
>  
> +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32
> target_pstate,
> +				     bool fast_switch)
> +{
> +	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
> +	s16 epp;
> +
> +	value &= ~HWP_MIN_PERF(~0L);
> +	value |= HWP_MIN_PERF(target_pstate);
> +
> +	/*
> +	 * The entire MSR needs to be updated in order to update the
> HWP min
> +	 * field in it, so opportunistically update the max too if
> needed.
> +	 */
> +	value &= ~HWP_MAX_PERF(~0L);
> +	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
> +
> +	/*
> +	 * In case the EPP has been adjusted via sysfs, write the last
> cached
> +	 * value of it to the MSR as well.
> +	 */
> +	epp = READ_ONCE(cpu->epp_cached);
> +	if (epp >= 0) {
> +		value &= ~GENMASK_ULL(31, 24);
> +		value |= (u64)epp << 24;
> +	}
> +
> +	if (value == prev)
> +		return;
> +
> +	WRITE_ONCE(cpu->hwp_req_cached, value);
> +	if (fast_switch)
> +		wrmsrl(MSR_HWP_REQUEST, value);
> +	else
> +		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
> +}
> +
> +static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
> +					  u32 target_pstate, bool
> fast_switch)
> +{
> +	if (fast_switch)
> +		wrmsrl(MSR_IA32_PERF_CTL,
> +		       pstate_funcs.get_val(cpu, target_pstate));
> +	else
> +		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
> +			      pstate_funcs.get_val(cpu,
> target_pstate));
> +}
> +
> +static int intel_cpufreq_update_pstate(struct cpudata *cpu, int
> target_pstate,
> +				       bool fast_switch)
> +{
> +	int old_pstate = cpu->pstate.current_pstate;
> +
> +	target_pstate = intel_pstate_prepare_request(cpu,
> target_pstate);
> +	if (target_pstate != old_pstate) {
> +		cpu->pstate.current_pstate = target_pstate;
> +		if (hwp_active)
> +			intel_cpufreq_adjust_hwp(cpu, target_pstate,
> +						 fast_switch);
> +		else
> +			intel_cpufreq_adjust_perf_ctl(cpu,
> target_pstate,
> +						      fast_switch);
> +	}
> +
> +	intel_cpufreq_trace(cpu, fast_switch ?
> INTEL_PSTATE_TRACE_FAST_SWITCH :
> +			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
> +
> +	return target_pstate;
> +}
> +
>  static int intel_cpufreq_target(struct cpufreq_policy *policy,
>  				unsigned int target_freq,
>  				unsigned int relation)
>  {
>  	struct cpudata *cpu = all_cpu_data[policy->cpu];
>  	struct cpufreq_freqs freqs;
> -	int target_pstate, old_pstate;
> +	int target_pstate;
>  
>  	update_turbo_state();
>  
> @@ -2396,6 +2506,7 @@ static int intel_cpufreq_target(struct c
>  	freqs.new = target_freq;
>  
>  	cpufreq_freq_transition_begin(policy, &freqs);
> +
>  	switch (relation) {
>  	case CPUFREQ_RELATION_L:
>  		target_pstate = DIV_ROUND_UP(freqs.new, cpu-
> >pstate.scaling);
> @@ -2407,15 +2518,11 @@ static int intel_cpufreq_target(struct c
>  		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu-
> >pstate.scaling);
>  		break;
>  	}
> -	target_pstate = intel_pstate_prepare_request(cpu,
> target_pstate);
> -	old_pstate = cpu->pstate.current_pstate;
> -	if (target_pstate != cpu->pstate.current_pstate) {
> -		cpu->pstate.current_pstate = target_pstate;
> -		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
> -			      pstate_funcs.get_val(cpu,
> target_pstate));
> -	}
> +
> +	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate,
> false);
> +
>  	freqs.new = target_pstate * cpu->pstate.scaling;
> -	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET,
> old_pstate);
> +
>  	cpufreq_freq_transition_end(policy, &freqs, false);
>  
>  	return 0;
> @@ -2425,15 +2532,14 @@ static unsigned int intel_cpufreq_fast_s
>  					      unsigned int target_freq)
>  {
>  	struct cpudata *cpu = all_cpu_data[policy->cpu];
> -	int target_pstate, old_pstate;
> +	int target_pstate;
>  
>  	update_turbo_state();
>  
>  	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
> -	target_pstate = intel_pstate_prepare_request(cpu,
> target_pstate);
> -	old_pstate = cpu->pstate.current_pstate;
> -	intel_pstate_update_pstate(cpu, target_pstate);
> -	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH,
> old_pstate);
> +
> +	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate,
> true);
> +
>  	return target_pstate * cpu->pstate.scaling;
>  }
>  
> @@ -2453,7 +2559,6 @@ static int intel_cpufreq_cpu_init(struct
>  		return ret;
>  
>  	policy->cpuinfo.transition_latency =
> INTEL_CPUFREQ_TRANSITION_LATENCY;
> -	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
>  	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
>  	policy->cur = policy->cpuinfo.min_freq;
>  
> @@ -2465,10 +2570,17 @@ static int intel_cpufreq_cpu_init(struct
>  
>  	cpu = all_cpu_data[policy->cpu];
>  
> -	if (hwp_active)
> +	if (hwp_active) {
> +		u64 value;
> +
>  		intel_pstate_get_hwp_max(policy->cpu, &turbo_max,
> &max_state);
> -	else
> +		policy->transition_delay_us =
> INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
> +		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
> +		WRITE_ONCE(cpu->hwp_req_cached, value);
> +	} else {
>  		turbo_max = cpu->pstate.turbo_pstate;
> +		policy->transition_delay_us =
> INTEL_CPUFREQ_TRANSITION_DELAY;
> +	}
>  
>  	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
>  	min_freq *= cpu->pstate.scaling;
> @@ -2545,6 +2657,10 @@ static void intel_pstate_driver_cleanup(
>  		}
>  	}
>  	put_online_cpus();
> +
> +	if (intel_pstate_driver == &intel_pstate)
> +		intel_pstate_sysfs_hide_hwp_dynamic_boost();
> +
>  	intel_pstate_driver = NULL;
>  }
>  
> @@ -2552,6 +2668,9 @@ static int intel_pstate_register_driver(
>  {
>  	int ret;
>  
> +	if (driver == &intel_pstate)
> +		intel_pstate_sysfs_expose_hwp_dynamic_boost();
> +
>  	memset(&global, 0, sizeof(global));
>  	global.max_perf_pct = 100;
>  
> @@ -2569,9 +2688,6 @@ static int intel_pstate_register_driver(
>  
>  static int intel_pstate_unregister_driver(void)
>  {
> -	if (hwp_active)
> -		return -EBUSY;
> -
>  	cpufreq_unregister_driver(intel_pstate_driver);
>  	intel_pstate_driver_cleanup();
>  
> @@ -2827,7 +2943,10 @@ static int __init intel_pstate_init(void
>  			hwp_active++;
>  			hwp_mode_bdw = id->driver_data;
>  			intel_pstate.attr = hwp_cpufreq_attrs;
> -			default_driver = &intel_pstate;
> +			intel_cpufreq.attr = hwp_cpufreq_attrs;
> +			if (!default_driver)
> +				default_driver = &intel_pstate;
> +
>  			goto hwp_cpu_matched;
>  		}
>  	} else {
> @@ -2898,14 +3017,13 @@ static int __init intel_pstate_setup(cha
>  	if (!str)
>  		return -EINVAL;
>  
> -	if (!strcmp(str, "disable")) {
> +	if (!strcmp(str, "disable"))
>  		no_load = 1;
> -	} else if (!strcmp(str, "active")) {
> +	else if (!strcmp(str, "active"))
>  		default_driver = &intel_pstate;
> -	} else if (!strcmp(str, "passive")) {
> +	else if (!strcmp(str, "passive"))
>  		default_driver = &intel_cpufreq;
> -		no_hwp = 1;
> -	}
> +
>  	if (!strcmp(str, "no_hwp")) {
>  		pr_info("HWP disabled\n");
>  		no_hwp = 1;
> Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
> ===================================================================
> --- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
> +++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
> @@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
>  Operation Modes
>  ===============
>  
> -``intel_pstate`` can operate in three different modes: in the active
> mode with
> -or without hardware-managed P-states support and in the passive
> mode.  Which of
> -them will be in effect depends on what kernel command line options
> are used and
> -on the capabilities of the processor.
> +``intel_pstate`` can operate in two different modes, active or
> passive.  In the
> +active mode, it uses its own internal preformance scaling governor
> algorithm or
> +allows the hardware to do preformance scaling by itself, while in
> the passive
> +mode it responds to requests made by a generic ``CPUFreq`` governor
> implementing
> +a certain performance scaling algorithm.  Which of them will be in
> effect
> +depends on what kernel command line options are used and on the
> capabilities of
> +the processor.
>  
>  Active Mode
>  -----------
> @@ -194,10 +197,11 @@ This is the default operation mode of ``
>  hardware-managed P-states (HWP) support.  It is always used if the
>  ``intel_pstate=passive`` argument is passed to the kernel in the
> command line
>  regardless of whether or not the given processor supports
> HWP.  [Note that the
> -``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if
> it is used
> -without ``intel_pstate=active``.]  Like in the active mode without
> HWP support,
> -in this mode ``intel_pstate`` may refuse to work with processors
> that are not
> -recognized by it.
> +``intel_pstate=no_hwp`` setting causes the driver to start in the
> passive mode
> +if it is not combined with ``intel_pstate=active``.]  Like in the
> active mode
> +without HWP support, in this mode ``intel_pstate`` may refuse to
> work with
> +processors that are not recognized by it if HWP is prevented from
> being enabled
> +through the kernel command line.
>  
>  If the driver works in this mode, the ``scaling_driver`` policy
> attribute in
>  ``sysfs`` for all ``CPUFreq`` policies contains the string
> "intel_cpufreq".
> @@ -318,10 +322,9 @@ manuals need to be consulted to get to i
>  
>  For this reason, there is a list of supported processors in
> ``intel_pstate`` and
>  the driver initialization will fail if the detected processor is not
> in that
> -list, unless it supports the `HWP feature <Active Mode_>`_.  [The
> interface to
> -obtain all of the information listed above is the same for all of
> the processors
> -supporting the HWP feature, which is why they all are supported by
> -``intel_pstate``.]
> +list, unless it supports the HWP feature.  [The interface to obtain
> all of the
> +information listed above is the same for all of the processors
> supporting the
> +HWP feature, which is why ``intel_pstate`` works with all of them.]
>  
>  
>  User Space Interface in ``sysfs``
> @@ -425,22 +428,16 @@ argument is passed to the kernel in the
>  	as well as the per-policy ones) are then reset to their default
>  	values, possibly depending on the target operation mode.]
>  
> -	That only is supported in some configurations, though (for
> example, if
> -	the `HWP feature is enabled in the processor <Active Mode With
> HWP_>`_,
> -	the operation mode of the driver cannot be changed), and if it
> is not
> -	supported in the current configuration, writes to this
> attribute will
> -	fail with an appropriate error.
> -
>  ``energy_efficiency``
> -	This attribute is only present on platforms, which have CPUs
> matching
> -	Kaby Lake or Coffee Lake desktop CPU model. By default
> -	energy efficiency optimizations are disabled on these CPU
> models in HWP
> -	mode by this driver. Enabling energy efficiency may limit
> maximum
> -	operating frequency in both HWP and non HWP mode. In non HWP
> mode,
> -	optimizations are done only in the turbo frequency range. In
> HWP mode,
> -	optimizations are done in the entire frequency range. Setting
> this
> -	attribute to "1" enables energy efficiency optimizations and
> setting
> -	to "0" disables energy efficiency optimizations.
> +	This attribute is only present on platforms with CPUs matching
> the Kaby
> +	Lake or Coffee Lake desktop CPU model. By default, energy-
> efficiency
> +	optimizations are disabled on these CPU models if HWP is
> enabled.
> +	Enabling energy-efficiency optimizations may limit maximum
> operating
> +	frequency with or without the HWP feature.  With HWP enabled,
> the
> +	optimizations are done only in the turbo frequency
> range.  Without it,
> +	they are done in the entire available frequency range.  Setting
> this
> +	attribute to "1" enables the energy-efficiency optimizations
> and setting
> +	to "0" disables them.
>  
>  Interpretation of Policy Attributes
>  -----------------------------------
> @@ -484,8 +481,8 @@ Next, the following policy attributes ha
>  	policy for the time interval between the last two invocations
> of the
>  	driver's utilization update callback by the CPU scheduler for
> that CPU.
>  
> -One more policy attribute is present if the `HWP feature is enabled
> in the
> -processor <Active Mode With HWP_>`_:
> +One more policy attribute is present if the HWP feature is enabled
> in the
> +processor:
>  
>  ``base_frequency``
>  	Shows the base frequency of the CPU. Any frequency above this
> will be
> @@ -526,11 +523,11 @@ on the following rules, regardless of th
>  
>   3. The global and per-policy limits can be set independently.
>  
> -If the `HWP feature is enabled in the processor <Active Mode With
> HWP_>`_, the
> -resulting effective values are written into its registers whenever
> the limits
> -change in order to request its internal P-state selection logic to
> always set
> -P-states within these limits.  Otherwise, the limits are taken into
> account by
> -scaling governors (in the `passive mode <Passive Mode_>`_) and by
> the driver
> +In the `active mode with the HWP feature enabled <Active Mode With
> HWP_>`_, the
> +resulting effective values are written into hardware registers
> whenever the
> +limits change in order to request its internal P-state selection
> logic to always
> +set P-states within these limits.  Otherwise, the limits are taken
> into account
> +by scaling governors (in the `passive mode <Passive Mode_>`_) and by
> the driver
>  every time before setting a new P-state for a CPU.
>  
>  Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command
> line argument
> @@ -541,12 +538,11 @@ at all and the only way to set the limit
>  Energy vs Performance Hints
>  ---------------------------
>  
> -If ``intel_pstate`` works in the `active mode with the HWP feature
> enabled
> -<Active Mode With HWP_>`_ in the processor, additional attributes
> are present
> -in every ``CPUFreq`` policy directory in ``sysfs``.  They are
> intended to allow
> -user space to help ``intel_pstate`` to adjust the processor's
> internal P-state
> -selection logic by focusing it on performance or on energy-
> efficiency, or
> -somewhere between the two extremes:
> +If the hardware-managed P-states (HWP) is enabled in the processor,
> additional
> +attributes, intended to allow user space to help ``intel_pstate`` to
> adjust the
> +processor's internal P-state selection logic by focusing it on
> performance or on
> +energy-efficiency, or somewhere between the two extremes, are
> present in every
> +``CPUFreq`` policy directory in ``sysfs``.  They are :
>  
>  ``energy_performance_preference``
>  	Current value of the energy vs performance hint for the given
> policy
> @@ -650,12 +646,14 @@ of them have to be prepended with the ``
>  	Do not register ``intel_pstate`` as the scaling driver even if
> the
>  	processor is supported by it.
>  
> +``active``
> +	Register ``intel_pstate`` in the `active mode <Active Mode_>`_
> to start
> +	with.
> +
>  ``passive``
>  	Register ``intel_pstate`` in the `passive mode <Passive
> Mode_>`_ to
>  	start with.
>  
> -	This option implies the ``no_hwp`` one described below.
> -
>  ``force``
>  	Register ``intel_pstate`` as the scaling driver instead of
>  	``acpi-cpufreq`` even if the latter is preferred on the given
> system.
> @@ -670,13 +668,12 @@ of them have to be prepended with the ``
>  	driver is used instead of ``acpi-cpufreq``.
>  
>  ``no_hwp``
> -	Do not enable the `hardware-managed P-states (HWP) feature
> -	<Active Mode With HWP_>`_ even if it is supported by the
> processor.
> +	Do not enable the hardware-managed P-states (HWP) feature even
> if it is
> +	supported by the processor.
>  
>  ``hwp_only``
>  	Register ``intel_pstate`` as the scaling driver only if the
> -	`hardware-managed P-states (HWP) feature <Active Mode With
> HWP_>`_ is
> -	supported by the processor.
> +	hardware-managed P-states (HWP) feature is supported by the
> processor.
>  
>  ``support_acpi_ppc``
>  	Take ACPI ``_PPC`` performance limits into account.
> 
> 
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v4 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-01 16:39     ` [PATCH v4 0/2] " Srinivas Pandruvada
@ 2020-08-02 14:00       ` Doug Smythies
  2020-08-02 18:39         ` Srinivas Pandruvada
  2020-08-03 17:23       ` Rafael J. Wysocki
  1 sibling, 1 reply; 33+ messages in thread
From: Doug Smythies @ 2020-08-02 14:00 UTC (permalink / raw)
  To: 'Srinivas Pandruvada'
  Cc: 'Linux Documentation', 'LKML',
	'Peter Zijlstra', 'Giovanni Gherdovich',
	'Francisco Jerez', 'Linux PM',
	'Rafael J. Wysocki'

On 2020.08.01 09:40 Srinivas Pandruvada wrote:
>> On Monday, July 27, 2020 5:13:40 PM CEST Rafael J. Wysocki wrote:
>>> On Thursday, July 16, 2020 7:37:04 PM CEST Rafael J. Wysocki wrote:
>>>> This really is a v2 of this patch:
>>>>
>>>> https://patchwork.kernel.org/patch/11663271/
>>>>
>>>> with an extra preceding cleanup patch to avoid making unrelated
>>>> changes in the
>>>> [2/2].
>>>
> I applied this series along with
> [PATCH] cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
> on 5.8 latest master (On top of raw epp patchset).

Hi Srinivas,

Would you be kind enough to provide a "git log --oneline" output
of what you did.

I have been trying unsuccessfully to apply the patches,
so somewhere I obviously missed something.

> When intel_pstate=passive from kernel command line then it is fine, no
> crash. But switch dynamically, crashed:

I'll try to repeat, if I can get an actual kernel.

> Attached crash.txt. I may need to try your linux-pm tree.

I also tried the linux-pm tree, same.

... Doug



^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v4 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-01 23:21       ` Srinivas Pandruvada
@ 2020-08-02 14:14         ` Doug Smythies
  2020-08-02 19:20           ` Srinivas Pandruvada
  0 siblings, 1 reply; 33+ messages in thread
From: Doug Smythies @ 2020-08-02 14:14 UTC (permalink / raw)
  To: 'Srinivas Pandruvada'
  Cc: 'Linux Documentation', 'LKML',
	'Peter Zijlstra', 'Giovanni Gherdovich',
	'Francisco Jerez', 'Linux PM',
	'Rafael J. Wysocki'

On 2020.08.01 16:41 Srinivas Pandruvada wrote:
> On Tue, 2020-07-28 at 17:13 +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Allow intel_pstate to work in the passive mode with HWP enabled and
> > make it set the HWP minimum performance limit (HWP floor) to the
> > P-state value given by the target frequency supplied by the cpufreq
> > governor, so as to prevent the HWP algorithm and the CPU scheduler
> > from working against each other, at least when the schedutil governor
> > is in use, and update the intel_pstate documentation accordingly.
> >
> > Among other things, this allows utilization clamps to be taken
> > into account, at least to a certain extent, when intel_pstate is
> > in use and makes it more likely that sufficient capacity for
> > deadline tasks will be provided.
> >
> > After this change, the resulting behavior of an HWP system with
> > intel_pstate in the passive mode should be close to the behavior
> > of the analogous non-HWP system with intel_pstate in the passive
> > mode, except that in the frequency range below the base frequency
> > (ie. the frequency retured by the base_frequency cpufreq attribute
> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
> > the floor P-state set by intel_pstate with or without hardware
> > coordination of P-states among CPUs in the same package.
> >
> Do you mean HWP.req.min will be below base_freq (unless user overrides
> it)?

No.

> With busy workload I see HWP req.min = HWP req.max.
> The base freq: 1.3GHz (ratio 0x0d), MAX 1C turbo: 3.9GHz (ratio: 0x27)
> When I monitor MSR 0x774 (HWP_REQ), I see
> 0x80002727

Yes, that is what I expect to see.

> 
> Normally msr 0x774
> 0x80002704

That would be "active" mode and the powersave governor, correct?.
And yes that is what I expect for your processor.
For mine, load or no load, decoded:
0x774: IA32_HWP_REQUEST:    CPU 0-5 :
    raw: 80002E08 : 80002E08 : 80002E08 : 80002E08 : 80002E08 : 80002E08 :
    min:        8 :        8 :        8 :        8 :        8 :        8 :
    max:       46 :       46 :       46 :       46 :       46 :       46 :
    des:        0 :        0 :        0 :        0 :        0 :        0 :
    epp:      128 :      128 :      128 :      128 :      128 :      128 :
    act:        0 :        0 :        0 :        0 :        0 :        0 :

This thread is about passive mode, and myself, I do not expect the last byte to be
4 (8 for mine) under load.

... Doug



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-02 14:00       ` Doug Smythies
@ 2020-08-02 18:39         ` Srinivas Pandruvada
  2020-08-03  0:26           ` Doug Smythies
  0 siblings, 1 reply; 33+ messages in thread
From: Srinivas Pandruvada @ 2020-08-02 18:39 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Linux Documentation', 'LKML',
	'Peter Zijlstra', 'Giovanni Gherdovich',
	'Francisco Jerez', 'Linux PM',
	'Rafael J. Wysocki'

On Sun, 2020-08-02 at 07:00 -0700, Doug Smythies wrote:
> On 2020.08.01 09:40 Srinivas Pandruvada wrote:
> > > On Monday, July 27, 2020 5:13:40 PM CEST Rafael J. Wysocki wrote:
> > > > On Thursday, July 16, 2020 7:37:04 PM CEST Rafael J. Wysocki
> > > > wrote:
> > > > > This really is a v2 of this patch:
> > > > > 
> > > > > https://patchwork.kernel.org/patch/11663271/
> > > > > 
> > > > > with an extra preceding cleanup patch to avoid making
> > > > > unrelated
> > > > > changes in the
> > > > > [2/2].
> > I applied this series along with
> > [PATCH] cpufreq: intel_pstate: Fix EPP setting via sysfs in active
> > mode
> > on 5.8 latest master (On top of raw epp patchset).
> 
> Hi Srinivas,
Hi Doug,

> 
> Would you be kind enough to provide a "git log --oneline" output
> of what you did.

69dd9b2b11cd (HEAD -> 5-9-devel) cpufreq: intel_pstate: Implement
passive mode with HWP enabled
63efaa01b06a cpufreq: intel_pstate: Fix EPP setting via sysfs in active
mode
e11e0a2edf83 cpufreq: intel_pstate: Rearrange the storing of new EPP
values
93c3fd6a315c cpufreq: intel_pstate: Avoid enabling HWP if EPP is not
supported
7cef1dd371c3 cpufreq: intel_pstate: Clean up aperf_mperf_shift
description
a3248d8d3a11 cpufreq: intel_pstate: Supply struct attribute description
for get_aperf_mperf_shift()
f52b6b075b07 cpufreq: intel_pstate: Fix static checker warning for epp
variable
4a59d6be0774 cpufreq: intel_pstate: Allow raw energy performance
preference value
7b34b5acdcc6 cpufreq: intel_pstate: Allow enable/disable energy
efficiency
ac3a0c847296 (origin/master, origin/HEAD, master) Merge
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Thanks,
Srinivas

> 
> I have been trying unsuccessfully to apply the patches,
> so somewhere I obviously missed something.
> 
> > When intel_pstate=passive from kernel command line then it is fine,
> > no
> > crash. But switch dynamically, crashed:
> 
> I'll try to repeat, if I can get an actual kernel.
> 
> > Attached crash.txt. I may need to try your linux-pm tree.
> 
> I also tried the linux-pm tree, same.
> ... Doug
> 
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-02 14:14         ` Doug Smythies
@ 2020-08-02 19:20           ` Srinivas Pandruvada
  0 siblings, 0 replies; 33+ messages in thread
From: Srinivas Pandruvada @ 2020-08-02 19:20 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Linux Documentation', 'LKML',
	'Peter Zijlstra', 'Giovanni Gherdovich',
	'Francisco Jerez', 'Linux PM',
	'Rafael J. Wysocki'

On Sun, 2020-08-02 at 07:14 -0700, Doug Smythies wrote:
> On 2020.08.01 16:41 Srinivas Pandruvada wrote:
> > On Tue, 2020-07-28 at 17:13 +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > 
> > > Allow intel_pstate to work in the passive mode with HWP enabled
> > > and
> > > make it set the HWP minimum performance limit (HWP floor) to the
> > > P-state value given by the target frequency supplied by the
> > > cpufreq
> > > governor, so as to prevent the HWP algorithm and the CPU
> > > scheduler
> > > from working against each other, at least when the schedutil
> > > governor
> > > is in use, and update the intel_pstate documentation accordingly.
> > > 
> > > Among other things, this allows utilization clamps to be taken
> > > into account, at least to a certain extent, when intel_pstate is
> > > in use and makes it more likely that sufficient capacity for
> > > deadline tasks will be provided.
> > > 
> > > After this change, the resulting behavior of an HWP system with
> > > intel_pstate in the passive mode should be close to the behavior
> > > of the analogous non-HWP system with intel_pstate in the passive
> > > mode, except that in the frequency range below the base frequency
> > > (ie. the frequency retured by the base_frequency cpufreq
> > > attribute
> > > in sysfs on HWP systems) the HWP algorithm is allowed to go above
> > > the floor P-state set by intel_pstate with or without hardware
> > > coordination of P-states among CPUs in the same package.
> > > 
> > Do you mean HWP.req.min will be below base_freq (unless user
> > overrides
> > it)?
> 
> No.
Correct. I was just thinking about base_freq relation.
I can set floor above or below base_freq, HWP will reach upto ceiling/
max.

For example:

Floor above base of 0x0d

Busy%	Bzy_MHz	TSC_MHz	            M0X774
51.33	3500	1498	0x0000000000000000
99.70	3500	1498	0x00000
0008000270e
2.74	3500	1498	0x000000008000270e
2.92	3500	1498	
0x000000008000270e
99.77	3500	1498	0x000000008000270e
99.78	3500	
1498	0x000000008000270e
2.98	3500	1498	0x000000008000270e
99.75
	3500	1498	0x000000008000270e
3.01	3500	1498	0x00000
0008000270e

Floor Below base of 0x0d

Busy%	Bzy_MHz	TSC_MHz	            M0X774
51.33	3500	1498	0x0000000000000000
3.08	3500	1498	0x000000008000270c
99.77	3500	1498	0x000000008000270c
2.87	3500	1498	0x000000008000270c
99.75	3500	1498	0x000000008000270c
2.81	3500	1498	0x000000008000270c
99.76	3500	1498	0x000000008000270c
99.78	3500	1498	0x000000008000270c
2.82	3500	1498	0x000000008000270c


Thanks,
Srinivas

> > With busy workload I see HWP req.min = HWP req.max.
> > The base freq: 1.3GHz (ratio 0x0d), MAX 1C turbo: 3.9GHz (ratio:
> > 0x27)
> > When I monitor MSR 0x774 (HWP_REQ), I see
> > 0x80002727
> 
> Yes, that is what I expect to see.
> 
> > Normally msr 0x774
> > 0x80002704
> 
> That would be "active" mode and the powersave governor, correct?.
> And yes that is what I expect for your processor.
> For mine, load or no load, decoded:
> 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
>     raw: 80002E08 : 80002E08 : 80002E08 : 80002E08 : 80002E08 :
> 80002E08 :
>     min:        8 :        8 :        8 :        8 :        8
> :        8 :
>     max:       46 :       46 :       46 :       46 :       46
> :       46 :
>     des:        0 :        0 :        0 :        0 :        0
> :        0 :
>     epp:      128 :      128 :      128 :      128 :      128
> :      128 :
>     act:        0 :        0 :        0 :        0 :        0
> :        0 :
> 
> This thread is about passive mode, and myself, I do not expect the
> last byte to be
> 4 (8 for mine) under load.
> 
> ... Doug
> 
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v4 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-02 18:39         ` Srinivas Pandruvada
@ 2020-08-03  0:26           ` Doug Smythies
  0 siblings, 0 replies; 33+ messages in thread
From: Doug Smythies @ 2020-08-03  0:26 UTC (permalink / raw)
  To: 'Srinivas Pandruvada'
  Cc: 'Linux Documentation', 'LKML',
	'Peter Zijlstra', 'Giovanni Gherdovich',
	'Francisco Jerez', 'Linux PM',
	'Rafael J. Wysocki'

Hi Srinivas,

Thanks for your help. I was missing several needed patches.

On 2020.08.02 11:39 Srinivas Pandruvada wrote:
> On Sun, 2020-08-02 at 07:00 -0700, Doug Smythies wrote:
> > On 2020.08.01 09:40 Srinivas Pandruvada wrote:
> > > > On Monday, July 27, 2020 5:13:40 PM CEST Rafael J. Wysocki wrote:
> > > > > On Thursday, July 16, 2020 7:37:04 PM CEST Rafael J. Wysocki
> > > > > wrote:
> > > > > > This really is a v2 of this patch:
> > > > > >
> > > > > > https://patchwork.kernel.org/patch/11663271/
> > > > > >
> > > > > > with an extra preceding cleanup patch to avoid making
> > > > > > unrelated
> > > > > > changes in the
> > > > > > [2/2].
> > > I applied this series along with
> > > [PATCH] cpufreq: intel_pstate: Fix EPP setting via sysfs in active
> > > mode
> > > on 5.8 latest master (On top of raw epp patchset).
> >
> > Would you be kind enough to provide a "git log --oneline" output
> > of what you did.
> 
> 69dd9b2b11cd (HEAD -> 5-9-devel) cpufreq: intel_pstate: Implement
> passive mode with HWP enabled
> 63efaa01b06a cpufreq: intel_pstate: Fix EPP setting via sysfs in active
> mode
> e11e0a2edf83 cpufreq: intel_pstate: Rearrange the storing of new EPP
> values
> 93c3fd6a315c cpufreq: intel_pstate: Avoid enabling HWP if EPP is not
> supported
> 7cef1dd371c3 cpufreq: intel_pstate: Clean up aperf_mperf_shift
> description
> a3248d8d3a11 cpufreq: intel_pstate: Supply struct attribute description
> for get_aperf_mperf_shift()
> f52b6b075b07 cpufreq: intel_pstate: Fix static checker warning for epp
> variable
> 4a59d6be0774 cpufreq: intel_pstate: Allow raw energy performance
> preference value
> 7b34b5acdcc6 cpufreq: intel_pstate: Allow enable/disable energy
> efficiency
> ac3a0c847296 (origin/master, origin/HEAD, master) Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> >
> > I have been trying unsuccessfully to apply the patches,
> > so somewhere I obviously missed something.
> >
> > > When intel_pstate=passive from kernel command line then it is fine,
> > > no
> > > crash. But switch dynamically, crashed:
> >
> > I'll try to repeat, if I can get an actual kernel.

I could not repeat your crash.

I tried booting with and without intel_pstate=passive
on the kernel command line and then switching back and
forth thereafter.

However, I do confirm EPP is messed up. But not min
and max from MSR 0x774, they behave as expected,
based on quick testing only.

Since you mentioned:

>>> Don't have a recipe to reproduce this.

Maybe I simply didn't hit whatever.

... Doug

Useless additional stuff:

# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.8.0-rc7-dssp root=UUID=0ac356c1-caa9-4c2e-8229-4408bd998dbd ro ipv6.disable=1 consoleblank=450 intel_pstate=passive cpuidle_sysfs_switch cpuidle.governor=teo

Went "active" then "passive" and set ondemand governor. 2 X 100% CPU loads:

# /home/doug/c/msr-decoder
How many CPUs?: 6
8.) 0x198: IA32_PERF_STATUS     : CPU 0-5 :  46 :  46 :  46 :  46 :  46 :  46 :
B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable
1.) 0x19C: IA32_THERM_STATUS: 883C0000
2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset
3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 882D0000
4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0
A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable
5.) 0x771: IA32_HWP_CAPABILITIES (performance): 10B252E : high 46 : guaranteed 37 : efficient 11 : lowest 1
6.) 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
    raw: FF002E0A : FF002E2E : FF002E2E : FF002E08 : FF002E18 : FF002E08 :
    min:       10 :       46 :       46 :        8 :       24 :        8 :
    max:       46 :       46 :       46 :       46 :       46 :       46 :
    des:        0 :        0 :        0 :        0 :        0 :        0 :
    epp:      255 :      255 :      255 :      255 :      255 :      255 :
    act:        0 :        0 :        0 :        0 :        0 :        0 :
7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0

Kernel:

d72c8472dbd5 (HEAD -> k58rc7-d3) cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
c2f4869fbc27 cpufreq: intel_pstate: Implement passive mode with HWP enabled
85219968fab9 cpufreq: intel_pstate: Rearrange the storing of new EPP values
5c09a1a38106 cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported
9f29c81fe0b3 cpufreq: intel_pstate: Clean up aperf_mperf_shift description
2a863c241495 cpufreq: intel_pstate: Supply struct attribute description for get_aperf_mperf_shift()
4180d8413037 cpufreq: intel_pstate: Fix static checker warning for epp variable
7cd50e86a9e6 cpufreq: intel_pstate: Allow raw energy performance preference value
56dce9a1081e cpufreq: intel_pstate: Allow enable/disable energy efficiency
92ed30191993 (tag: v5.8-rc7) Linux 5.8-rc7



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v5] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-07-28 15:13     ` [PATCH v4 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
  2020-08-01 23:21       ` Srinivas Pandruvada
@ 2020-08-03 17:17       ` Rafael J. Wysocki
  1 sibling, 0 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-08-03 17:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM, Linux Documentation, LKML, Peter Zijlstra,
	Srinivas Pandruvada, Giovanni Gherdovich, Doug Smythies,
	Francisco Jerez

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that in the frequency range below the base frequency
(ie. the frequency retured by the base_frequency cpufreq attribute
in sysfs on HWP systems) the HWP algorithm is allowed to make the
CPU run at a frequency above the floor P-state set by intel_pstate,
with or without hardware coordination of P-states among CPUs in the
same package.

[If P-states of the CPUs in the same package are coordinated at the
 hardware level, a non-HWP processor may choose a P-state above the
 target one like a processor with HWP enabled may choose a P-state
 above the HWP floor, so the HWP behavior is analogous to the non-HWP
 one in that case.

 Also note that the HWP floor may not be taken into account by
 the processor in the range of P-states above the base frequency,
 referred to as the turbo range, where the processor has a license to
 choose any P-state, either below or above the HWP floor, just like a
 non-HWP processor in the case when the target P-state falls into the
 turbo range.]

With this change applied, intel_pstate in the passive mode
assumes complete control over the HWP request MSR and concurrent
changes of that MSR (eg. via the direct MSR access interface) are
overridden by it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This is on top of the linux-next branch in linux-pm.git from today (should
be there in linux-next tomorrow).

This is just one patch now, because the previous [1/2] has been merged
into the branch above (it is needed to fix a bug).

v1 -> v2:
   * Avoid a race condition when updating the HWP request register while
     setting a new EPP value via sysfs.

v2 -> v3:
   * Rebase.

v3 -> v4:
   * Avoid exposing the hwp_dynamic_boost sysfs switch in the passive mode.

v4 -> v5:
   * Do not acquire intel_pstate_driver_lock in
     store_energy_performance_preference(), because it runs under
     policy->rwsem, so intel_pstate_driver cannot change while it is running.
   * Rearrange the changelog a bit to avoid confusion.

---
 Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++------
 drivers/cpufreq/intel_pstate.c                |  203 ++++++++++++++++++++------
 2 files changed, 203 insertions(+), 89 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -36,6 +36,7 @@
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
 #define INTEL_CPUFREQ_TRANSITION_DELAY		500
 
 #ifdef CONFIG_ACPI
@@ -220,6 +221,7 @@ struct global_params {
  *			preference/bias
  * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
  *			operation
+ * @epp_cached		Cached HWP energy-performance preference value
  * @hwp_req_cached:	Cached value of the last HWP Request MSR
  * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
  * @last_io_update:	Last time when IO wake flag was set
@@ -257,6 +259,7 @@ struct cpudata {
 	s16 epp_policy;
 	s16 epp_default;
 	s16 epp_saved;
+	s16 epp_cached;
 	u64 hwp_req_cached;
 	u64 hwp_cap_cached;
 	u64 last_io_update;
@@ -697,6 +700,8 @@ static ssize_t show_energy_performance_a
 
 cpufreq_freq_attr_ro(energy_performance_available_preferences);
 
+static struct cpufreq_driver intel_pstate;
+
 static ssize_t store_energy_performance_preference(
 		struct cpufreq_policy *policy, const char *buf, size_t count)
 {
@@ -725,11 +730,31 @@ static ssize_t store_energy_performance_
 		raw = true;
 	}
 
+	/*
+	 * This function runs with the policy R/W semaphore held, which
+	 * guarantees that the driver pointer will not change while it is
+	 * running.
+	 */
+	if (!intel_pstate_driver)
+		return -EAGAIN;
+
 	mutex_lock(&intel_pstate_limits_lock);
 
-	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
-	if (!ret)
+	if (intel_pstate_driver == &intel_pstate) {
+		ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
+		if (!ret)
+			ret = count;
+	} else {
+		/*
+		 * In the passive mode simply update the cached EPP value and
+		 * rely on intel_cpufreq_adjust_hwp() to pick it up later.
+		 */
+		if (!raw)
+			epp = ret ? epp_values[ret - 1] : cpu_data->epp_default;
+
+		WRITE_ONCE(cpu_data->epp_cached, epp);
 		ret = count;
+	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
@@ -1145,8 +1170,6 @@ static ssize_t store_no_turbo(struct kob
 	return count;
 }
 
-static struct cpufreq_driver intel_pstate;
-
 static void update_qos_request(enum freq_qos_req_type type)
 {
 	int max_state, turbo_max, freq, i, perf_pct;
@@ -1330,9 +1353,10 @@ static const struct attribute_group inte
 
 static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[];
 
+static struct kobject *intel_pstate_kobject;
+
 static void __init intel_pstate_sysfs_expose_params(void)
 {
-	struct kobject *intel_pstate_kobject;
 	int rc;
 
 	intel_pstate_kobject = kobject_create_and_add("intel_pstate",
@@ -1357,17 +1381,31 @@ static void __init intel_pstate_sysfs_ex
 	rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
 	WARN_ON(rc);
 
-	if (hwp_active) {
-		rc = sysfs_create_file(intel_pstate_kobject,
-				       &hwp_dynamic_boost.attr);
-		WARN_ON(rc);
-	}
-
 	if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids)) {
 		rc = sysfs_create_file(intel_pstate_kobject, &energy_efficiency.attr);
 		WARN_ON(rc);
 	}
 }
+
+static void intel_pstate_sysfs_expose_hwp_dynamic_boost(void)
+{
+	int rc;
+
+	if (!hwp_active)
+		return;
+
+	rc = sysfs_create_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+	WARN_ON_ONCE(rc);
+}
+
+static void intel_pstate_sysfs_hide_hwp_dynamic_boost(void)
+{
+	if (!hwp_active)
+		return;
+
+	sysfs_remove_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+}
+
 /************************** sysfs end ************************/
 
 static void intel_pstate_hwp_enable(struct cpudata *cpudata)
@@ -2048,6 +2086,7 @@ static int intel_pstate_init_cpu(unsigne
 		cpu->epp_default = -EINVAL;
 		cpu->epp_powersave = -EINVAL;
 		cpu->epp_saved = -EINVAL;
+		WRITE_ONCE(cpu->epp_cached, -EINVAL);
 	}
 
 	cpu = all_cpu_data[cpunum];
@@ -2246,7 +2285,10 @@ static int intel_pstate_verify_policy(st
 
 static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
 {
-	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+	if (hwp_active)
+		intel_pstate_hwp_force_min_perf(policy->cpu);
+	else
+		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
 }
 
 static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
@@ -2254,12 +2296,10 @@ static void intel_pstate_stop_cpu(struct
 	pr_debug("CPU %d exiting\n", policy->cpu);
 
 	intel_pstate_clear_update_util_hook(policy->cpu);
-	if (hwp_active) {
+	if (hwp_active)
 		intel_pstate_hwp_save_state(policy);
-		intel_pstate_hwp_force_min_perf(policy->cpu);
-	} else {
-		intel_cpufreq_stop_cpu(policy);
-	}
+
+	intel_cpufreq_stop_cpu(policy);
 }
 
 static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
@@ -2389,13 +2429,82 @@ static void intel_cpufreq_trace(struct c
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
+				     bool fast_switch)
+{
+	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
+	s16 epp;
+
+	value &= ~HWP_MIN_PERF(~0L);
+	value |= HWP_MIN_PERF(target_pstate);
+
+	/*
+	 * The entire MSR needs to be updated in order to update the HWP min
+	 * field in it, so opportunistically update the max too if needed.
+	 */
+	value &= ~HWP_MAX_PERF(~0L);
+	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
+
+	/*
+	 * In case the EPP has been adjusted via sysfs, write the last cached
+	 * value of it to the MSR as well.
+	 */
+	epp = READ_ONCE(cpu->epp_cached);
+	if (epp >= 0) {
+		value &= ~GENMASK_ULL(31, 24);
+		value |= (u64)epp << 24;
+	}
+
+	if (value == prev)
+		return;
+
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	if (fast_switch)
+		wrmsrl(MSR_HWP_REQUEST, value);
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
+static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
+					  u32 target_pstate, bool fast_switch)
+{
+	if (fast_switch)
+		wrmsrl(MSR_IA32_PERF_CTL,
+		       pstate_funcs.get_val(cpu, target_pstate));
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
+			      pstate_funcs.get_val(cpu, target_pstate));
+}
+
+static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
+				       bool fast_switch)
+{
+	int old_pstate = cpu->pstate.current_pstate;
+
+	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	if (target_pstate != old_pstate) {
+		cpu->pstate.current_pstate = target_pstate;
+		if (hwp_active)
+			intel_cpufreq_adjust_hwp(cpu, target_pstate,
+						 fast_switch);
+		else
+			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
+						      fast_switch);
+	}
+
+	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
+			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
+	return target_pstate;
+}
+
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
 				unsigned int target_freq,
 				unsigned int relation)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
 	struct cpufreq_freqs freqs;
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
@@ -2403,6 +2512,7 @@ static int intel_cpufreq_target(struct c
 	freqs.new = target_freq;
 
 	cpufreq_freq_transition_begin(policy, &freqs);
+
 	switch (relation) {
 	case CPUFREQ_RELATION_L:
 		target_pstate = DIV_ROUND_UP(freqs.new, cpu->pstate.scaling);
@@ -2414,15 +2524,11 @@ static int intel_cpufreq_target(struct c
 		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu->pstate.scaling);
 		break;
 	}
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	if (target_pstate != cpu->pstate.current_pstate) {
-		cpu->pstate.current_pstate = target_pstate;
-		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
-	}
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, false);
+
 	freqs.new = target_pstate * cpu->pstate.scaling;
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
 	cpufreq_freq_transition_end(policy, &freqs, false);
 
 	return 0;
@@ -2432,15 +2538,14 @@ static unsigned int intel_cpufreq_fast_s
 					      unsigned int target_freq)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
 	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	intel_pstate_update_pstate(cpu, target_pstate);
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, true);
+
 	return target_pstate * cpu->pstate.scaling;
 }
 
@@ -2460,7 +2565,6 @@ static int intel_cpufreq_cpu_init(struct
 		return ret;
 
 	policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
-	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
 	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
 	policy->cur = policy->cpuinfo.min_freq;
 
@@ -2472,10 +2576,17 @@ static int intel_cpufreq_cpu_init(struct
 
 	cpu = all_cpu_data[policy->cpu];
 
-	if (hwp_active)
+	if (hwp_active) {
+		u64 value;
+
 		intel_pstate_get_hwp_max(policy->cpu, &turbo_max, &max_state);
-	else
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
+		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
+		WRITE_ONCE(cpu->hwp_req_cached, value);
+	} else {
 		turbo_max = cpu->pstate.turbo_pstate;
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
+	}
 
 	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
 	min_freq *= cpu->pstate.scaling;
@@ -2552,6 +2663,10 @@ static void intel_pstate_driver_cleanup(
 		}
 	}
 	put_online_cpus();
+
+	if (intel_pstate_driver == &intel_pstate)
+		intel_pstate_sysfs_hide_hwp_dynamic_boost();
+
 	intel_pstate_driver = NULL;
 }
 
@@ -2559,6 +2674,9 @@ static int intel_pstate_register_driver(
 {
 	int ret;
 
+	if (driver == &intel_pstate)
+		intel_pstate_sysfs_expose_hwp_dynamic_boost();
+
 	memset(&global, 0, sizeof(global));
 	global.max_perf_pct = 100;
 
@@ -2576,9 +2694,6 @@ static int intel_pstate_register_driver(
 
 static int intel_pstate_unregister_driver(void)
 {
-	if (hwp_active)
-		return -EBUSY;
-
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
@@ -2834,7 +2949,10 @@ static int __init intel_pstate_init(void
 			hwp_active++;
 			hwp_mode_bdw = id->driver_data;
 			intel_pstate.attr = hwp_cpufreq_attrs;
-			default_driver = &intel_pstate;
+			intel_cpufreq.attr = hwp_cpufreq_attrs;
+			if (!default_driver)
+				default_driver = &intel_pstate;
+
 			goto hwp_cpu_matched;
 		}
 	} else {
@@ -2905,14 +3023,13 @@ static int __init intel_pstate_setup(cha
 	if (!str)
 		return -EINVAL;
 
-	if (!strcmp(str, "disable")) {
+	if (!strcmp(str, "disable"))
 		no_load = 1;
-	} else if (!strcmp(str, "active")) {
+	else if (!strcmp(str, "active"))
 		default_driver = &intel_pstate;
-	} else if (!strcmp(str, "passive")) {
+	else if (!strcmp(str, "passive"))
 		default_driver = &intel_cpufreq;
-		no_hwp = 1;
-	}
+
 	if (!strcmp(str, "no_hwp")) {
 		pr_info("HWP disabled\n");
 		no_hwp = 1;
Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
+++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
@@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
 Operation Modes
 ===============
 
-``intel_pstate`` can operate in three different modes: in the active mode with
-or without hardware-managed P-states support and in the passive mode.  Which of
-them will be in effect depends on what kernel command line options are used and
-on the capabilities of the processor.
+``intel_pstate`` can operate in two different modes, active or passive.  In the
+active mode, it uses its own internal preformance scaling governor algorithm or
+allows the hardware to do preformance scaling by itself, while in the passive
+mode it responds to requests made by a generic ``CPUFreq`` governor implementing
+a certain performance scaling algorithm.  Which of them will be in effect
+depends on what kernel command line options are used and on the capabilities of
+the processor.
 
 Active Mode
 -----------
@@ -194,10 +197,11 @@ This is the default operation mode of ``
 hardware-managed P-states (HWP) support.  It is always used if the
 ``intel_pstate=passive`` argument is passed to the kernel in the command line
 regardless of whether or not the given processor supports HWP.  [Note that the
-``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
-without ``intel_pstate=active``.]  Like in the active mode without HWP support,
-in this mode ``intel_pstate`` may refuse to work with processors that are not
-recognized by it.
+``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
+if it is not combined with ``intel_pstate=active``.]  Like in the active mode
+without HWP support, in this mode ``intel_pstate`` may refuse to work with
+processors that are not recognized by it if HWP is prevented from being enabled
+through the kernel command line.
 
 If the driver works in this mode, the ``scaling_driver`` policy attribute in
 ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
@@ -318,10 +322,9 @@ manuals need to be consulted to get to i
 
 For this reason, there is a list of supported processors in ``intel_pstate`` and
 the driver initialization will fail if the detected processor is not in that
-list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
-obtain all of the information listed above is the same for all of the processors
-supporting the HWP feature, which is why they all are supported by
-``intel_pstate``.]
+list, unless it supports the HWP feature.  [The interface to obtain all of the
+information listed above is the same for all of the processors supporting the
+HWP feature, which is why ``intel_pstate`` works with all of them.]
 
 
 User Space Interface in ``sysfs``
@@ -425,22 +428,16 @@ argument is passed to the kernel in the
 	as well as the per-policy ones) are then reset to their default
 	values, possibly depending on the target operation mode.]
 
-	That only is supported in some configurations, though (for example, if
-	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
-	the operation mode of the driver cannot be changed), and if it is not
-	supported in the current configuration, writes to this attribute will
-	fail with an appropriate error.
-
 ``energy_efficiency``
-	This attribute is only present on platforms, which have CPUs matching
-	Kaby Lake or Coffee Lake desktop CPU model. By default
-	energy efficiency optimizations are disabled on these CPU models in HWP
-	mode by this driver. Enabling energy efficiency may limit maximum
-	operating frequency in both HWP and non HWP mode. In non HWP mode,
-	optimizations are done only in the turbo frequency range. In HWP mode,
-	optimizations are done in the entire frequency range. Setting this
-	attribute to "1" enables energy efficiency optimizations and setting
-	to "0" disables energy efficiency optimizations.
+	This attribute is only present on platforms with CPUs matching the Kaby
+	Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
+	optimizations are disabled on these CPU models if HWP is enabled.
+	Enabling energy-efficiency optimizations may limit maximum operating
+	frequency with or without the HWP feature.  With HWP enabled, the
+	optimizations are done only in the turbo frequency range.  Without it,
+	they are done in the entire available frequency range.  Setting this
+	attribute to "1" enables the energy-efficiency optimizations and setting
+	to "0" disables them.
 
 Interpretation of Policy Attributes
 -----------------------------------
@@ -484,8 +481,8 @@ Next, the following policy attributes ha
 	policy for the time interval between the last two invocations of the
 	driver's utilization update callback by the CPU scheduler for that CPU.
 
-One more policy attribute is present if the `HWP feature is enabled in the
-processor <Active Mode With HWP_>`_:
+One more policy attribute is present if the HWP feature is enabled in the
+processor:
 
 ``base_frequency``
 	Shows the base frequency of the CPU. Any frequency above this will be
@@ -526,11 +523,11 @@ on the following rules, regardless of th
 
  3. The global and per-policy limits can be set independently.
 
-If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
-resulting effective values are written into its registers whenever the limits
-change in order to request its internal P-state selection logic to always set
-P-states within these limits.  Otherwise, the limits are taken into account by
-scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
+resulting effective values are written into hardware registers whenever the
+limits change in order to request its internal P-state selection logic to always
+set P-states within these limits.  Otherwise, the limits are taken into account
+by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
 every time before setting a new P-state for a CPU.
 
 Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
@@ -541,12 +538,11 @@ at all and the only way to set the limit
 Energy vs Performance Hints
 ---------------------------
 
-If ``intel_pstate`` works in the `active mode with the HWP feature enabled
-<Active Mode With HWP_>`_ in the processor, additional attributes are present
-in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
-user space to help ``intel_pstate`` to adjust the processor's internal P-state
-selection logic by focusing it on performance or on energy-efficiency, or
-somewhere between the two extremes:
+If the hardware-managed P-states (HWP) is enabled in the processor, additional
+attributes, intended to allow user space to help ``intel_pstate`` to adjust the
+processor's internal P-state selection logic by focusing it on performance or on
+energy-efficiency, or somewhere between the two extremes, are present in every
+``CPUFreq`` policy directory in ``sysfs``.  They are :
 
 ``energy_performance_preference``
 	Current value of the energy vs performance hint for the given policy
@@ -650,12 +646,14 @@ of them have to be prepended with the ``
 	Do not register ``intel_pstate`` as the scaling driver even if the
 	processor is supported by it.
 
+``active``
+	Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
+	with.
+
 ``passive``
 	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
 	start with.
 
-	This option implies the ``no_hwp`` one described below.
-
 ``force``
 	Register ``intel_pstate`` as the scaling driver instead of
 	``acpi-cpufreq`` even if the latter is preferred on the given system.
@@ -670,13 +668,12 @@ of them have to be prepended with the ``
 	driver is used instead of ``acpi-cpufreq``.
 
 ``no_hwp``
-	Do not enable the `hardware-managed P-states (HWP) feature
-	<Active Mode With HWP_>`_ even if it is supported by the processor.
+	Do not enable the hardware-managed P-states (HWP) feature even if it is
+	supported by the processor.
 
 ``hwp_only``
 	Register ``intel_pstate`` as the scaling driver only if the
-	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
-	supported by the processor.
+	hardware-managed P-states (HWP) feature is supported by the processor.
 
 ``support_acpi_ppc``
 	Take ACPI ``_PPC`` performance limits into account.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v4 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-01 16:39     ` [PATCH v4 0/2] " Srinivas Pandruvada
  2020-08-02 14:00       ` Doug Smythies
@ 2020-08-03 17:23       ` Rafael J. Wysocki
  1 sibling, 0 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-08-03 17:23 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Linux PM, Linux Documentation, LKML, Peter Zijlstra,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez

On Saturday, August 1, 2020 6:39:30 PM CEST Srinivas Pandruvada wrote:
> 
> --=-bU21ZBsdw4g45G9I/wXt
> Content-Type: text/plain; charset="UTF-8"
> Content-Transfer-Encoding: 7bit
> 
> On Tue, 2020-07-28 at 17:09 +0200, Rafael J. Wysocki wrote:
> > Hi All,
> > 
> > On Monday, July 27, 2020 5:13:40 PM CEST Rafael J. Wysocki wrote:
> > > On Thursday, July 16, 2020 7:37:04 PM CEST Rafael J. Wysocki wrote:
> > > > This really is a v2 of this patch:
> > > > 
> > > > https://patchwork.kernel.org/patch/11663271/
> > > > 
> > > > with an extra preceding cleanup patch to avoid making unrelated
> > > > changes in the
> > > > [2/2].
> > > 
> I applied this series along with
> [PATCH] cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
> on 5.8 latest master (On top of raw epp patchset).
> 
> When intel_pstate=passive from kernel command line then it is fine, no
> crash. But switch dynamically, crashed:
> Attached crash.txt. I may need to try your linux-pm tree.

Please try the v5 on top of my linux-next branch:

https://patchwork.kernel.org/patch/11698495/

FWIW, I cannot reproduce the crash with it.

> Then after some playing I reached a state when I monitor MSR 0x774:
> while true; do rdmsr 0x774; sleep 1; done
> 80002704
> ...
> ...
> ff000101
> ff000101
> ff000101
> ff000101
> ff000101
> ff000101
> ff000101
> ff000101
> 
> Don't have a recipe to reproduce this.

Well, maybe it locked up due to the deadlock in the v4 of the patch.

Please see if you get this with the v5 above applied.

Cheers!




^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v6] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-07-27 15:13 ` [PATCH v3 0/2] " Rafael J. Wysocki
                     ` (2 preceding siblings ...)
  2020-07-28 15:09   ` [PATCH v4 0/2] " Rafael J. Wysocki
@ 2020-08-04 15:10   ` Rafael J. Wysocki
  2020-08-04 17:04     ` Doug Smythies
  2020-08-05 16:55   ` [PATCH v7] " Rafael J. Wysocki
  4 siblings, 1 reply; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-08-04 15:10 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez,
	Viresh Kumar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that in the frequency range below the base frequency
(ie. the frequency retured by the base_frequency cpufreq attribute
in sysfs on HWP systems) the HWP algorithm is allowed to make the
CPU run at a frequency above the floor P-state set by intel_pstate,
with or without hardware coordination of P-states among CPUs in the
same package.

[If P-states of the CPUs in the same package are coordinated at the
 hardware level, a non-HWP processor may choose a P-state above the
 target one like a processor with HWP enabled may choose a P-state
 above the HWP floor, so the HWP behavior is analogous to the non-HWP
 one in that case.

 Also note that the HWP floor may not be taken into account by
 the processor in the range of P-states above the base frequency,
 referred to as the turbo range, where the processor has a license to
 choose any P-state, either below or above the HWP floor, just like a
 non-HWP processor in the case when the target P-state falls into the
 turbo range.]

With this change applied, intel_pstate in the passive mode
assumes complete control over the HWP request MSR and concurrent
changes of that MSR (eg. via the direct MSR access interface) are
overridden by it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This is on top of the material already in the mainline.

v1 -> v2:
   * Avoid a race condition when updating the HWP request register while
     setting a new EPP value via sysfs.

v2 -> v3:
   * Rebase.

v3 -> v4:
   * Avoid exposing the hwp_dynamic_boost sysfs switch in the passive mode.

v4 -> v5:
   * Do not acquire intel_pstate_driver_lock in
     store_energy_performance_preference(), because it runs under
     policy->rwsem, so intel_pstate_driver cannot change while it is running.
   * Rearrange the changelog a bit to avoid confusion.

v5 -> v6:
   * Fix the problem with the EPP setting via sysfs not working with the
     performance and powersave governors by stopping and restarting the
     governor around the sysfs-based EPP updates in the passive mode.
   * Because of that, use the epp_cached field just for avoiding the above
     if the new EPP value for the given CPU is the same as the old one.
   * Export cpufreq_start/stop_governor() from the core (for the above).

---
 Documentation/admin-guide/pm/intel_pstate.rst |   89 ++++-----
 drivers/cpufreq/cpufreq.c                     |    6 
 drivers/cpufreq/intel_pstate.c                |  248 +++++++++++++++++++-------
 include/linux/cpufreq.h                       |    2 
 4 files changed, 234 insertions(+), 111 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -36,6 +36,7 @@
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
 #define INTEL_CPUFREQ_TRANSITION_DELAY		500
 
 #ifdef CONFIG_ACPI
@@ -220,6 +221,7 @@ struct global_params {
  *			preference/bias
  * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
  *			operation
+ * @epp_cached		Cached HWP energy-performance preference value
  * @hwp_req_cached:	Cached value of the last HWP Request MSR
  * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
  * @last_io_update:	Last time when IO wake flag was set
@@ -257,6 +259,7 @@ struct cpudata {
 	s16 epp_policy;
 	s16 epp_default;
 	s16 epp_saved;
+	s16 epp_cached;
 	u64 hwp_req_cached;
 	u64 hwp_cap_cached;
 	u64 last_io_update;
@@ -639,6 +642,26 @@ static int intel_pstate_get_energy_pref_
 	return index;
 }
 
+static int intel_pstate_set_epp(struct cpudata *cpu, u32 epp)
+{
+	/*
+	 * Use the cached HWP Request MSR value, because in the active mode the
+	 * register itself may be updated by intel_pstate_hwp_boost_up() or
+	 * intel_pstate_hwp_boost_down() at any time.
+	 */
+	u64 value = READ_ONCE(cpu->hwp_req_cached);
+
+	value &= ~GENMASK_ULL(31, 24);
+	value |= (u64)epp << 24;
+	/*
+	 * The only other updater of hwp_req_cached in the active mode,
+	 * intel_pstate_hwp_set(), is called under the same lock as this
+	 * function, so it cannot run in parallel with the update below.
+	 */
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	return wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
 static int intel_pstate_set_energy_pref_index(struct cpudata *cpu_data,
 					      int pref_index, bool use_raw,
 					      u32 raw_epp)
@@ -650,28 +673,12 @@ static int intel_pstate_set_energy_pref_
 		epp = cpu_data->epp_default;
 
 	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
-		/*
-		 * Use the cached HWP Request MSR value, because the register
-		 * itself may be updated by intel_pstate_hwp_boost_up() or
-		 * intel_pstate_hwp_boost_down() at any time.
-		 */
-		u64 value = READ_ONCE(cpu_data->hwp_req_cached);
-
-		value &= ~GENMASK_ULL(31, 24);
-
 		if (use_raw)
 			epp = raw_epp;
 		else if (epp == -EINVAL)
 			epp = epp_values[pref_index - 1];
 
-		value |= (u64)epp << 24;
-		/*
-		 * The only other updater of hwp_req_cached in the active mode,
-		 * intel_pstate_hwp_set(), is called under the same lock as this
-		 * function, so it cannot run in parallel with the update below.
-		 */
-		WRITE_ONCE(cpu_data->hwp_req_cached, value);
-		ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
+		ret = intel_pstate_set_epp(cpu_data, epp);
 	} else {
 		if (epp == -EINVAL)
 			epp = (pref_index - 1) << 2;
@@ -697,6 +704,8 @@ static ssize_t show_energy_performance_a
 
 cpufreq_freq_attr_ro(energy_performance_available_preferences);
 
+static struct cpufreq_driver intel_pstate;
+
 static ssize_t store_energy_performance_preference(
 		struct cpufreq_policy *policy, const char *buf, size_t count)
 {
@@ -725,11 +734,47 @@ static ssize_t store_energy_performance_
 		raw = true;
 	}
 
+	/*
+	 * This function runs with the policy R/W semaphore held, which
+	 * guarantees that the driver pointer will not change while it is
+	 * running.
+	 */
+	if (!intel_pstate_driver)
+		return -EAGAIN;
+
 	mutex_lock(&intel_pstate_limits_lock);
 
-	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
-	if (!ret)
-		ret = count;
+	if (intel_pstate_driver == &intel_pstate) {
+		ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
+		if (!ret)
+			ret = count;
+	} else {
+		/*
+		 * In the passive mode the governor needs to be stopped on the
+		 * target CPU before the EPP update and restarted after it,
+		 * which is super-heavy-weight, so make sure it is worth doing
+		 * upfront.
+		 */
+		if (!raw)
+			epp = ret ? epp_values[ret - 1] : cpu_data->epp_default;
+
+		if (cpu_data->epp_cached == epp) {
+			ret = count;
+		} else {
+			int err;
+
+			cpufreq_stop_governor(policy);
+			ret = intel_pstate_set_epp(cpu_data, epp);
+			if (!ret)
+				cpu_data->epp_cached = epp;
+
+			err = cpufreq_start_governor(policy);
+			if (err)
+				ret = ret ?: err;
+			else if (!ret)
+				ret = count;
+		}
+	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
@@ -1145,8 +1190,6 @@ static ssize_t store_no_turbo(struct kob
 	return count;
 }
 
-static struct cpufreq_driver intel_pstate;
-
 static void update_qos_request(enum freq_qos_req_type type)
 {
 	int max_state, turbo_max, freq, i, perf_pct;
@@ -1330,9 +1373,10 @@ static const struct attribute_group inte
 
 static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[];
 
+static struct kobject *intel_pstate_kobject;
+
 static void __init intel_pstate_sysfs_expose_params(void)
 {
-	struct kobject *intel_pstate_kobject;
 	int rc;
 
 	intel_pstate_kobject = kobject_create_and_add("intel_pstate",
@@ -1357,17 +1401,31 @@ static void __init intel_pstate_sysfs_ex
 	rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
 	WARN_ON(rc);
 
-	if (hwp_active) {
-		rc = sysfs_create_file(intel_pstate_kobject,
-				       &hwp_dynamic_boost.attr);
-		WARN_ON(rc);
-	}
-
 	if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids)) {
 		rc = sysfs_create_file(intel_pstate_kobject, &energy_efficiency.attr);
 		WARN_ON(rc);
 	}
 }
+
+static void intel_pstate_sysfs_expose_hwp_dynamic_boost(void)
+{
+	int rc;
+
+	if (!hwp_active)
+		return;
+
+	rc = sysfs_create_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+	WARN_ON_ONCE(rc);
+}
+
+static void intel_pstate_sysfs_hide_hwp_dynamic_boost(void)
+{
+	if (!hwp_active)
+		return;
+
+	sysfs_remove_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+}
+
 /************************** sysfs end ************************/
 
 static void intel_pstate_hwp_enable(struct cpudata *cpudata)
@@ -2246,7 +2304,10 @@ static int intel_pstate_verify_policy(st
 
 static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
 {
-	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+	if (hwp_active)
+		intel_pstate_hwp_force_min_perf(policy->cpu);
+	else
+		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
 }
 
 static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
@@ -2254,12 +2315,10 @@ static void intel_pstate_stop_cpu(struct
 	pr_debug("CPU %d exiting\n", policy->cpu);
 
 	intel_pstate_clear_update_util_hook(policy->cpu);
-	if (hwp_active) {
+	if (hwp_active)
 		intel_pstate_hwp_save_state(policy);
-		intel_pstate_hwp_force_min_perf(policy->cpu);
-	} else {
-		intel_cpufreq_stop_cpu(policy);
-	}
+
+	intel_cpufreq_stop_cpu(policy);
 }
 
 static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
@@ -2389,13 +2448,71 @@ static void intel_cpufreq_trace(struct c
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
+				     bool fast_switch)
+{
+	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
+
+	value &= ~HWP_MIN_PERF(~0L);
+	value |= HWP_MIN_PERF(target_pstate);
+
+	/*
+	 * The entire MSR needs to be updated in order to update the HWP min
+	 * field in it, so opportunistically update the max too if needed.
+	 */
+	value &= ~HWP_MAX_PERF(~0L);
+	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
+
+	if (value == prev)
+		return;
+
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	if (fast_switch)
+		wrmsrl(MSR_HWP_REQUEST, value);
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
+static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
+					  u32 target_pstate, bool fast_switch)
+{
+	if (fast_switch)
+		wrmsrl(MSR_IA32_PERF_CTL,
+		       pstate_funcs.get_val(cpu, target_pstate));
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
+			      pstate_funcs.get_val(cpu, target_pstate));
+}
+
+static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
+				       bool fast_switch)
+{
+	int old_pstate = cpu->pstate.current_pstate;
+
+	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	if (target_pstate != old_pstate) {
+		cpu->pstate.current_pstate = target_pstate;
+		if (hwp_active)
+			intel_cpufreq_adjust_hwp(cpu, target_pstate,
+						 fast_switch);
+		else
+			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
+						      fast_switch);
+	}
+
+	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
+			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
+	return target_pstate;
+}
+
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
 				unsigned int target_freq,
 				unsigned int relation)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
 	struct cpufreq_freqs freqs;
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
@@ -2403,6 +2520,7 @@ static int intel_cpufreq_target(struct c
 	freqs.new = target_freq;
 
 	cpufreq_freq_transition_begin(policy, &freqs);
+
 	switch (relation) {
 	case CPUFREQ_RELATION_L:
 		target_pstate = DIV_ROUND_UP(freqs.new, cpu->pstate.scaling);
@@ -2414,15 +2532,11 @@ static int intel_cpufreq_target(struct c
 		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu->pstate.scaling);
 		break;
 	}
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	if (target_pstate != cpu->pstate.current_pstate) {
-		cpu->pstate.current_pstate = target_pstate;
-		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
-	}
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, false);
+
 	freqs.new = target_pstate * cpu->pstate.scaling;
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
 	cpufreq_freq_transition_end(policy, &freqs, false);
 
 	return 0;
@@ -2432,15 +2546,14 @@ static unsigned int intel_cpufreq_fast_s
 					      unsigned int target_freq)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
 	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	intel_pstate_update_pstate(cpu, target_pstate);
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, true);
+
 	return target_pstate * cpu->pstate.scaling;
 }
 
@@ -2460,7 +2573,6 @@ static int intel_cpufreq_cpu_init(struct
 		return ret;
 
 	policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
-	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
 	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
 	policy->cur = policy->cpuinfo.min_freq;
 
@@ -2472,10 +2584,18 @@ static int intel_cpufreq_cpu_init(struct
 
 	cpu = all_cpu_data[policy->cpu];
 
-	if (hwp_active)
+	if (hwp_active) {
+		u64 value;
+
 		intel_pstate_get_hwp_max(policy->cpu, &turbo_max, &max_state);
-	else
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
+		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
+		WRITE_ONCE(cpu->hwp_req_cached, value);
+		cpu->epp_cached = (value & GENMASK_ULL(31, 24)) >> 24;
+	} else {
 		turbo_max = cpu->pstate.turbo_pstate;
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
+	}
 
 	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
 	min_freq *= cpu->pstate.scaling;
@@ -2552,6 +2672,10 @@ static void intel_pstate_driver_cleanup(
 		}
 	}
 	put_online_cpus();
+
+	if (intel_pstate_driver == &intel_pstate)
+		intel_pstate_sysfs_hide_hwp_dynamic_boost();
+
 	intel_pstate_driver = NULL;
 }
 
@@ -2559,6 +2683,9 @@ static int intel_pstate_register_driver(
 {
 	int ret;
 
+	if (driver == &intel_pstate)
+		intel_pstate_sysfs_expose_hwp_dynamic_boost();
+
 	memset(&global, 0, sizeof(global));
 	global.max_perf_pct = 100;
 
@@ -2576,9 +2703,6 @@ static int intel_pstate_register_driver(
 
 static int intel_pstate_unregister_driver(void)
 {
-	if (hwp_active)
-		return -EBUSY;
-
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
@@ -2834,7 +2958,10 @@ static int __init intel_pstate_init(void
 			hwp_active++;
 			hwp_mode_bdw = id->driver_data;
 			intel_pstate.attr = hwp_cpufreq_attrs;
-			default_driver = &intel_pstate;
+			intel_cpufreq.attr = hwp_cpufreq_attrs;
+			if (!default_driver)
+				default_driver = &intel_pstate;
+
 			goto hwp_cpu_matched;
 		}
 	} else {
@@ -2905,14 +3032,13 @@ static int __init intel_pstate_setup(cha
 	if (!str)
 		return -EINVAL;
 
-	if (!strcmp(str, "disable")) {
+	if (!strcmp(str, "disable"))
 		no_load = 1;
-	} else if (!strcmp(str, "active")) {
+	else if (!strcmp(str, "active"))
 		default_driver = &intel_pstate;
-	} else if (!strcmp(str, "passive")) {
+	else if (!strcmp(str, "passive"))
 		default_driver = &intel_cpufreq;
-		no_hwp = 1;
-	}
+
 	if (!strcmp(str, "no_hwp")) {
 		pr_info("HWP disabled\n");
 		no_hwp = 1;
Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
+++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
@@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
 Operation Modes
 ===============
 
-``intel_pstate`` can operate in three different modes: in the active mode with
-or without hardware-managed P-states support and in the passive mode.  Which of
-them will be in effect depends on what kernel command line options are used and
-on the capabilities of the processor.
+``intel_pstate`` can operate in two different modes, active or passive.  In the
+active mode, it uses its own internal preformance scaling governor algorithm or
+allows the hardware to do preformance scaling by itself, while in the passive
+mode it responds to requests made by a generic ``CPUFreq`` governor implementing
+a certain performance scaling algorithm.  Which of them will be in effect
+depends on what kernel command line options are used and on the capabilities of
+the processor.
 
 Active Mode
 -----------
@@ -194,10 +197,11 @@ This is the default operation mode of ``
 hardware-managed P-states (HWP) support.  It is always used if the
 ``intel_pstate=passive`` argument is passed to the kernel in the command line
 regardless of whether or not the given processor supports HWP.  [Note that the
-``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
-without ``intel_pstate=active``.]  Like in the active mode without HWP support,
-in this mode ``intel_pstate`` may refuse to work with processors that are not
-recognized by it.
+``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
+if it is not combined with ``intel_pstate=active``.]  Like in the active mode
+without HWP support, in this mode ``intel_pstate`` may refuse to work with
+processors that are not recognized by it if HWP is prevented from being enabled
+through the kernel command line.
 
 If the driver works in this mode, the ``scaling_driver`` policy attribute in
 ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
@@ -318,10 +322,9 @@ manuals need to be consulted to get to i
 
 For this reason, there is a list of supported processors in ``intel_pstate`` and
 the driver initialization will fail if the detected processor is not in that
-list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
-obtain all of the information listed above is the same for all of the processors
-supporting the HWP feature, which is why they all are supported by
-``intel_pstate``.]
+list, unless it supports the HWP feature.  [The interface to obtain all of the
+information listed above is the same for all of the processors supporting the
+HWP feature, which is why ``intel_pstate`` works with all of them.]
 
 
 User Space Interface in ``sysfs``
@@ -425,22 +428,16 @@ argument is passed to the kernel in the
 	as well as the per-policy ones) are then reset to their default
 	values, possibly depending on the target operation mode.]
 
-	That only is supported in some configurations, though (for example, if
-	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
-	the operation mode of the driver cannot be changed), and if it is not
-	supported in the current configuration, writes to this attribute will
-	fail with an appropriate error.
-
 ``energy_efficiency``
-	This attribute is only present on platforms, which have CPUs matching
-	Kaby Lake or Coffee Lake desktop CPU model. By default
-	energy efficiency optimizations are disabled on these CPU models in HWP
-	mode by this driver. Enabling energy efficiency may limit maximum
-	operating frequency in both HWP and non HWP mode. In non HWP mode,
-	optimizations are done only in the turbo frequency range. In HWP mode,
-	optimizations are done in the entire frequency range. Setting this
-	attribute to "1" enables energy efficiency optimizations and setting
-	to "0" disables energy efficiency optimizations.
+	This attribute is only present on platforms with CPUs matching the Kaby
+	Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
+	optimizations are disabled on these CPU models if HWP is enabled.
+	Enabling energy-efficiency optimizations may limit maximum operating
+	frequency with or without the HWP feature.  With HWP enabled, the
+	optimizations are done only in the turbo frequency range.  Without it,
+	they are done in the entire available frequency range.  Setting this
+	attribute to "1" enables the energy-efficiency optimizations and setting
+	to "0" disables them.
 
 Interpretation of Policy Attributes
 -----------------------------------
@@ -484,8 +481,8 @@ Next, the following policy attributes ha
 	policy for the time interval between the last two invocations of the
 	driver's utilization update callback by the CPU scheduler for that CPU.
 
-One more policy attribute is present if the `HWP feature is enabled in the
-processor <Active Mode With HWP_>`_:
+One more policy attribute is present if the HWP feature is enabled in the
+processor:
 
 ``base_frequency``
 	Shows the base frequency of the CPU. Any frequency above this will be
@@ -526,11 +523,11 @@ on the following rules, regardless of th
 
  3. The global and per-policy limits can be set independently.
 
-If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
-resulting effective values are written into its registers whenever the limits
-change in order to request its internal P-state selection logic to always set
-P-states within these limits.  Otherwise, the limits are taken into account by
-scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
+resulting effective values are written into hardware registers whenever the
+limits change in order to request its internal P-state selection logic to always
+set P-states within these limits.  Otherwise, the limits are taken into account
+by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
 every time before setting a new P-state for a CPU.
 
 Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
@@ -541,12 +538,11 @@ at all and the only way to set the limit
 Energy vs Performance Hints
 ---------------------------
 
-If ``intel_pstate`` works in the `active mode with the HWP feature enabled
-<Active Mode With HWP_>`_ in the processor, additional attributes are present
-in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
-user space to help ``intel_pstate`` to adjust the processor's internal P-state
-selection logic by focusing it on performance or on energy-efficiency, or
-somewhere between the two extremes:
+If the hardware-managed P-states (HWP) is enabled in the processor, additional
+attributes, intended to allow user space to help ``intel_pstate`` to adjust the
+processor's internal P-state selection logic by focusing it on performance or on
+energy-efficiency, or somewhere between the two extremes, are present in every
+``CPUFreq`` policy directory in ``sysfs``.  They are :
 
 ``energy_performance_preference``
 	Current value of the energy vs performance hint for the given policy
@@ -650,12 +646,14 @@ of them have to be prepended with the ``
 	Do not register ``intel_pstate`` as the scaling driver even if the
 	processor is supported by it.
 
+``active``
+	Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
+	with.
+
 ``passive``
 	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
 	start with.
 
-	This option implies the ``no_hwp`` one described below.
-
 ``force``
 	Register ``intel_pstate`` as the scaling driver instead of
 	``acpi-cpufreq`` even if the latter is preferred on the given system.
@@ -670,13 +668,12 @@ of them have to be prepended with the ``
 	driver is used instead of ``acpi-cpufreq``.
 
 ``no_hwp``
-	Do not enable the `hardware-managed P-states (HWP) feature
-	<Active Mode With HWP_>`_ even if it is supported by the processor.
+	Do not enable the hardware-managed P-states (HWP) feature even if it is
+	supported by the processor.
 
 ``hwp_only``
 	Register ``intel_pstate`` as the scaling driver only if the
-	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
-	supported by the processor.
+	hardware-managed P-states (HWP) feature is supported by the processor.
 
 ``support_acpi_ppc``
 	Take ACPI ``_PPC`` performance limits into account.
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -73,8 +73,6 @@ static inline bool has_target(void)
 static unsigned int __cpufreq_get(struct cpufreq_policy *policy);
 static int cpufreq_init_governor(struct cpufreq_policy *policy);
 static void cpufreq_exit_governor(struct cpufreq_policy *policy);
-static int cpufreq_start_governor(struct cpufreq_policy *policy);
-static void cpufreq_stop_governor(struct cpufreq_policy *policy);
 static void cpufreq_governor_limits(struct cpufreq_policy *policy);
 static int cpufreq_set_policy(struct cpufreq_policy *policy,
 			      struct cpufreq_governor *new_gov,
@@ -2266,7 +2264,7 @@ static void cpufreq_exit_governor(struct
 	module_put(policy->governor->owner);
 }
 
-static int cpufreq_start_governor(struct cpufreq_policy *policy)
+int cpufreq_start_governor(struct cpufreq_policy *policy)
 {
 	int ret;
 
@@ -2293,7 +2291,7 @@ static int cpufreq_start_governor(struct
 	return 0;
 }
 
-static void cpufreq_stop_governor(struct cpufreq_policy *policy)
+void cpufreq_stop_governor(struct cpufreq_policy *policy)
 {
 	if (cpufreq_suspended || !policy->governor)
 		return;
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -576,6 +576,8 @@ unsigned int cpufreq_driver_resolve_freq
 unsigned int cpufreq_policy_transition_delay_us(struct cpufreq_policy *policy);
 int cpufreq_register_governor(struct cpufreq_governor *governor);
 void cpufreq_unregister_governor(struct cpufreq_governor *governor);
+int cpufreq_start_governor(struct cpufreq_policy *policy);
+void cpufreq_stop_governor(struct cpufreq_policy *policy);
 
 #define cpufreq_governor_init(__governor)			\
 static int __init __governor##_init(void)			\




^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v6] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-04 15:10   ` [PATCH v6] " Rafael J. Wysocki
@ 2020-08-04 17:04     ` Doug Smythies
  2020-08-05  9:34       ` Rafael J. Wysocki
  0 siblings, 1 reply; 33+ messages in thread
From: Doug Smythies @ 2020-08-04 17:04 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Linux Documentation', 'LKML',
	'Peter Zijlstra', 'Srinivas Pandruvada',
	'Giovanni Gherdovich', 'Francisco Jerez',
	'Viresh Kumar', 'Linux PM'

Hi Rafael,

I was just writing you about V5 when this V6 came.

On 2020.08.04 08:11 Rafael J. Wysocki wrote:
...
> This is on top of the material already in the mainline.

Oh, should have read that part better,
but did get there in the end.
...
> v5 -> v6:
>    * Fix the problem with the EPP setting via sysfs not working with the
>      performance and powersave governors by stopping and restarting the
>      governor around the sysfs-based EPP updates in the passive mode.
>    * Because of that, use the epp_cached field just for avoiding the above
>      if the new EPP value for the given CPU is the same as the old one.
>    * Export cpufreq_start/stop_governor() from the core (for the above).

EPP is still not right.
I am not messing with it at all, just observing via my msr-decoder.

I booted without any intel_pstate related directives for the
kernel command line. The below is as expected (performance gov.):

# /home/doug/c/msr-decoder
How many CPUs?: 6
8.) 0x198: IA32_PERF_STATUS     : CPU 0-5 :  46 :  46 :  46 :  46 :  46 :  46 :
B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable
1.) 0x19C: IA32_THERM_STATUS: 88450000
2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset
3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 88430000
4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0
A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable
5.) 0x771: IA32_HWP_CAPABILITIES (performance): 109252E : high 46 : guaranteed 37 : efficient 9 : lowest 1
6.) 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
    raw: 00002E2E : 00002E2E : 00002E2E : 00002E2E : 00002E2E : 00002E2E :
    min:       46 :       46 :       46 :       46 :       46 :       46 :
    max:       46 :       46 :       46 :       46 :       46 :       46 :
    des:        0 :        0 :        0 :        0 :        0 :        0 :
    epp:        0 :        0 :        0 :        0 :        0 :        0 :
    act:        0 :        0 :        0 :        0 :        0 :        0 :
7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0

and then switched to passive mode later. EPP is not as expected. Expect 0
(performance mode):

# /home/doug/c/msr-decoder
How many CPUs?: 6
8.) 0x198: IA32_PERF_STATUS     : CPU 0-5 :  46 :  46 :  46 :  46 :  46 :  46 :
B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable
1.) 0x19C: IA32_THERM_STATUS: 88440000
2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset
3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 88420000
4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0
A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable
5.) 0x771: IA32_HWP_CAPABILITIES (performance): 108252E : high 46 : guaranteed 37 : efficient 8 : lowest 1
6.) 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
    raw: FF002E2E : FF002E2E : FF002E2E : FF002E2E : FF002E2E : FF002E2E :
    min:       46 :       46 :       46 :       46 :       46 :       46 :
    max:       46 :       46 :       46 :       46 :       46 :       46 :
    des:        0 :        0 :        0 :        0 :        0 :        0 :
    epp:      255 :      255 :      255 :      255 :      255 :      255 :
    act:        0 :        0 :        0 :        0 :        0 :        0 :
7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0

Then switched to ondemand governor, and put 100% load on 2 CPUs.
EPP is not as expected, which I don't actually know what to expect,
but assume 128:

# /home/doug/c/msr-decoder
How many CPUs?: 6
8.) 0x198: IA32_PERF_STATUS     : CPU 0-5 :  46 :  46 :  46 :  46 :  46 :  46 :
B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable
1.) 0x19C: IA32_THERM_STATUS: 883B0000
2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset
3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 882B0000
4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0
A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable
5.) 0x771: IA32_HWP_CAPABILITIES (performance): 10B252E : high 46 : guaranteed 37 : efficient 11 : lowest 1
6.) 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
    raw: FF002E09 : FF002E0C : FF002E2E : FF002E08 : FF002E2E : FF002E18 :
    min:        9 :       12 :       46 :        8 :       46 :       24 :
    max:       46 :       46 :       46 :       46 :       46 :       46 :
    des:        0 :        0 :        0 :        0 :        0 :        0 :
    epp:      255 :      255 :      255 :      255 :      255 :      255 :
    act:        0 :        0 :        0 :        0 :        0 :        0 :
7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0

For what it's worth, Kernel:

78b39581ed85 (HEAD -> dtemp) cpufreq: intel_pstate: Implement passive mode with HWP enabled
c0842fbc1b18 (origin/master, origin/HEAD, master) random32: move the pseudo-random 32-bit definitions to prandom.h
2baa85d6927d Merge tag 'acpi-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
04084978003c Merge tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

... Doug



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-04 17:04     ` Doug Smythies
@ 2020-08-05  9:34       ` Rafael J. Wysocki
  2020-08-05 15:38         ` Srinivas Pandruvada
  0 siblings, 1 reply; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-08-05  9:34 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Linux Documentation, LKML, Peter Zijlstra,
	Srinivas Pandruvada, Giovanni Gherdovich, Francisco Jerez,
	Viresh Kumar, Linux PM

Hi Doug,

On Tue, Aug 4, 2020 at 7:07 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Rafael,
>
> I was just writing you about V5 when this V6 came.
>
> On 2020.08.04 08:11 Rafael J. Wysocki wrote:
> ...
> > This is on top of the material already in the mainline.
>
> Oh, should have read that part better,
> but did get there in the end.
> ...
> > v5 -> v6:
> >    * Fix the problem with the EPP setting via sysfs not working with the
> >      performance and powersave governors by stopping and restarting the
> >      governor around the sysfs-based EPP updates in the passive mode.
> >    * Because of that, use the epp_cached field just for avoiding the above
> >      if the new EPP value for the given CPU is the same as the old one.
> >    * Export cpufreq_start/stop_governor() from the core (for the above).
>
> EPP is still not right.
> I am not messing with it at all, just observing via my msr-decoder.

If you are not touching it, then it should not change in the passive mode.

> I booted without any intel_pstate related directives for the
> kernel command line. The below is as expected (performance gov.):

Note that the active mode performance scaling algorithm (which is not
the same as the performance cpufreq governor) sets the EPP to 0 for
all of the CPUs that it is used with and the driver sets the EPP to
255 in ->stop_cpu.

That last bit is questionable, but that's the active mode behavior
which is not changed by the $subject patch.

It would be more reasonable to restore the previous EPP when stopping
CPUs.  Let me cut a v7 with that changed.

> # /home/doug/c/msr-decoder
> How many CPUs?: 6
> 8.) 0x198: IA32_PERF_STATUS     : CPU 0-5 :  46 :  46 :  46 :  46 :  46 :  46 :
> B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable
> 1.) 0x19C: IA32_THERM_STATUS: 88450000
> 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset
> 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 88430000
> 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0
> A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable
> 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 109252E : high 46 : guaranteed 37 : efficient 9 : lowest 1
> 6.) 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
>     raw: 00002E2E : 00002E2E : 00002E2E : 00002E2E : 00002E2E : 00002E2E :
>     min:       46 :       46 :       46 :       46 :       46 :       46 :
>     max:       46 :       46 :       46 :       46 :       46 :       46 :
>     des:        0 :        0 :        0 :        0 :        0 :        0 :
>     epp:        0 :        0 :        0 :        0 :        0 :        0 :
>     act:        0 :        0 :        0 :        0 :        0 :        0 :
> 7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0
>
> and then switched to passive mode later. EPP is not as expected. Expect 0
> (performance mode):
>
> # /home/doug/c/msr-decoder
> How many CPUs?: 6
> 8.) 0x198: IA32_PERF_STATUS     : CPU 0-5 :  46 :  46 :  46 :  46 :  46 :  46 :
> B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable
> 1.) 0x19C: IA32_THERM_STATUS: 88440000
> 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset
> 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 88420000
> 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0
> A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable
> 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 108252E : high 46 : guaranteed 37 : efficient 8 : lowest 1
> 6.) 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
>     raw: FF002E2E : FF002E2E : FF002E2E : FF002E2E : FF002E2E : FF002E2E :
>     min:       46 :       46 :       46 :       46 :       46 :       46 :
>     max:       46 :       46 :       46 :       46 :       46 :       46 :
>     des:        0 :        0 :        0 :        0 :        0 :        0 :
>     epp:      255 :      255 :      255 :      255 :      255 :      255 :
>     act:        0 :        0 :        0 :        0 :        0 :        0 :
> 7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0

The 0xFF EPP value is what the active mode left behind and the passive
mode doesn't touch the EPP at all.

> Then switched to ondemand governor, and put 100% load on 2 CPUs.
> EPP is not as expected, which I don't actually know what to expect,
> but assume 128:
>
> # /home/doug/c/msr-decoder
> How many CPUs?: 6
> 8.) 0x198: IA32_PERF_STATUS     : CPU 0-5 :  46 :  46 :  46 :  46 :  46 :  46 :
> B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable
> 1.) 0x19C: IA32_THERM_STATUS: 883B0000
> 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset
> 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 882B0000
> 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0
> A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable
> 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 10B252E : high 46 : guaranteed 37 : efficient 11 : lowest 1
> 6.) 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
>     raw: FF002E09 : FF002E0C : FF002E2E : FF002E08 : FF002E2E : FF002E18 :
>     min:        9 :       12 :       46 :        8 :       46 :       24 :
>     max:       46 :       46 :       46 :       46 :       46 :       46 :
>     des:        0 :        0 :        0 :        0 :        0 :        0 :
>     epp:      255 :      255 :      255 :      255 :      255 :      255 :
>     act:        0 :        0 :        0 :        0 :        0 :        0 :
> 7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0

It is still 0xFF as previously (because the passive mode doesn't
change the EPP).

> For what it's worth, Kernel:
>
> 78b39581ed85 (HEAD -> dtemp) cpufreq: intel_pstate: Implement passive mode with HWP enabled
> c0842fbc1b18 (origin/master, origin/HEAD, master) random32: move the pseudo-random 32-bit definitions to prandom.h
> 2baa85d6927d Merge tag 'acpi-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
> 04084978003c Merge tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Thanks!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-05  9:34       ` Rafael J. Wysocki
@ 2020-08-05 15:38         ` Srinivas Pandruvada
  2020-08-05 16:28           ` Rafael J. Wysocki
  0 siblings, 1 reply; 33+ messages in thread
From: Srinivas Pandruvada @ 2020-08-05 15:38 UTC (permalink / raw)
  To: Rafael J. Wysocki, Doug Smythies
  Cc: Rafael J. Wysocki, Linux Documentation, LKML, Peter Zijlstra,
	Giovanni Gherdovich, Francisco Jerez, Viresh Kumar, Linux PM

On Wed, 2020-08-05 at 11:34 +0200, Rafael J. Wysocki wrote:
> Hi Doug,
> 
> On Tue, Aug 4, 2020 at 7:07 PM Doug Smythies <dsmythies@telus.net>
> wrote:
> > Hi Rafael,
> > 
> > 
[...]

> Note that the active mode performance scaling algorithm (which is not
> the same as the performance cpufreq governor) sets the EPP to 0 for
> all of the CPUs that it is used with and the driver sets the EPP to
> 255 in ->stop_cpu.
> 
> That last bit is questionable, but that's the active mode behavior
> which is not changed by the $subject patch.
You need to set the CPU which is going offline to the lowest perf
settings. If not its sibling's performance can never be lowered than
offlined CPUs max/min/epp.

Thanks,
Srinivas


> 
> It would be more reasonable to restore the previous EPP when stopping
> CPUs.  Let me cut a v7 with that changed.
> 
> > # /home/doug/c/msr-decoder
> > How many CPUs?: 6
> > 8.) 0x198: IA32_PERF_STATUS     : CPU 0-5 :  46 :  46 :  46 :  46
> > :  46 :  46 :
> > B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable
> > 1.) 0x19C: IA32_THERM_STATUS: 88450000
> > 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination
> > enabled OOB Bit 8 reset OOB Bit 18 reset
> > 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 88430000
> > 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0
> > A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO
> > disable
> > 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 109252E : high 46 :
> > guaranteed 37 : efficient 9 : lowest 1
> > 6.) 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
> >     raw: 00002E2E : 00002E2E : 00002E2E : 00002E2E : 00002E2E :
> > 00002E2E :
> >     min:       46 :       46 :       46 :       46 :       46
> > :       46 :
> >     max:       46 :       46 :       46 :       46 :       46
> > :       46 :
> >     des:        0 :        0 :        0 :        0 :        0
> > :        0 :
> >     epp:        0 :        0 :        0 :        0 :        0
> > :        0 :
> >     act:        0 :        0 :        0 :        0 :        0
> > :        0 :
> > 7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0
> > : lowest 0
> > 
> > and then switched to passive mode later. EPP is not as expected.
> > Expect 0
> > (performance mode):
> > 
> > # /home/doug/c/msr-decoder
> > How many CPUs?: 6
> > 8.) 0x198: IA32_PERF_STATUS     : CPU 0-5 :  46 :  46 :  46 :  46
> > :  46 :  46 :
> > B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable
> > 1.) 0x19C: IA32_THERM_STATUS: 88440000
> > 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination
> > enabled OOB Bit 8 reset OOB Bit 18 reset
> > 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 88420000
> > 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0
> > A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO
> > disable
> > 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 108252E : high 46 :
> > guaranteed 37 : efficient 8 : lowest 1
> > 6.) 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
> >     raw: FF002E2E : FF002E2E : FF002E2E : FF002E2E : FF002E2E :
> > FF002E2E :
> >     min:       46 :       46 :       46 :       46 :       46
> > :       46 :
> >     max:       46 :       46 :       46 :       46 :       46
> > :       46 :
> >     des:        0 :        0 :        0 :        0 :        0
> > :        0 :
> >     epp:      255 :      255 :      255 :      255 :      255
> > :      255 :
> >     act:        0 :        0 :        0 :        0 :        0
> > :        0 :
> > 7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0
> > : lowest 0
> 
> The 0xFF EPP value is what the active mode left behind and the
> passive
> mode doesn't touch the EPP at all.
> 
> > Then switched to ondemand governor, and put 100% load on 2 CPUs.
> > EPP is not as expected, which I don't actually know what to expect,
> > but assume 128:
> > 
> > # /home/doug/c/msr-decoder
> > How many CPUs?: 6
> > 8.) 0x198: IA32_PERF_STATUS     : CPU 0-5 :  46 :  46 :  46 :  46
> > :  46 :  46 :
> > B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable
> > 1.) 0x19C: IA32_THERM_STATUS: 883B0000
> > 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination
> > enabled OOB Bit 8 reset OOB Bit 18 reset
> > 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 882B0000
> > 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0
> > A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO
> > disable
> > 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 10B252E : high 46 :
> > guaranteed 37 : efficient 11 : lowest 1
> > 6.) 0x774: IA32_HWP_REQUEST:    CPU 0-5 :
> >     raw: FF002E09 : FF002E0C : FF002E2E : FF002E08 : FF002E2E :
> > FF002E18 :
> >     min:        9 :       12 :       46 :        8 :       46
> > :       24 :
> >     max:       46 :       46 :       46 :       46 :       46
> > :       46 :
> >     des:        0 :        0 :        0 :        0 :        0
> > :        0 :
> >     epp:      255 :      255 :      255 :      255 :      255
> > :      255 :
> >     act:        0 :        0 :        0 :        0 :        0
> > :        0 :
> > 7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0
> > : lowest 0
> 
> It is still 0xFF as previously (because the passive mode doesn't
> change the EPP).
> 
> > For what it's worth, Kernel:
> > 
> > 78b39581ed85 (HEAD -> dtemp) cpufreq: intel_pstate: Implement
> > passive mode with HWP enabled
> > c0842fbc1b18 (origin/master, origin/HEAD, master) random32: move
> > the pseudo-random 32-bit definitions to prandom.h
> > 2baa85d6927d Merge tag 'acpi-5.9-rc1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
> > 04084978003c Merge tag 'pm-5.9-rc1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
> 
> Thanks!


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-05 15:38         ` Srinivas Pandruvada
@ 2020-08-05 16:28           ` Rafael J. Wysocki
  0 siblings, 0 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-08-05 16:28 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Rafael J. Wysocki, Doug Smythies, Rafael J. Wysocki,
	Linux Documentation, LKML, Peter Zijlstra, Giovanni Gherdovich,
	Francisco Jerez, Viresh Kumar, Linux PM

On Wed, Aug 5, 2020 at 5:38 PM Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> On Wed, 2020-08-05 at 11:34 +0200, Rafael J. Wysocki wrote:
> > Hi Doug,
> >
> > On Tue, Aug 4, 2020 at 7:07 PM Doug Smythies <dsmythies@telus.net>
> > wrote:
> > > Hi Rafael,
> > >
> > >
> [...]
>
> > Note that the active mode performance scaling algorithm (which is not
> > the same as the performance cpufreq governor) sets the EPP to 0 for
> > all of the CPUs that it is used with and the driver sets the EPP to
> > 255 in ->stop_cpu.
> >
> > That last bit is questionable, but that's the active mode behavior
> > which is not changed by the $subject patch.
> You need to set the CPU which is going offline to the lowest perf
> settings. If not its sibling's performance can never be lowered than
> offlined CPUs max/min/epp.

OK, fair enough.  I'm not going to make this change after all then. :-)

This behavior is really confusing, though, because if you change the
status from "active" to "passive" or the other way around, the EPP
goes to 0xFF on all CPUs and cannot be restored.

To fix this, we need an extra patch on top of the $subject one to
introduce ->online and ->offline callbacks so that ->offline sets the
EPP to 0xFF and either ->online or ->exit can restore it to the
previous setting.

Cheers!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-07-27 15:13 ` [PATCH v3 0/2] " Rafael J. Wysocki
                     ` (3 preceding siblings ...)
  2020-08-04 15:10   ` [PATCH v6] " Rafael J. Wysocki
@ 2020-08-05 16:55   ` Rafael J. Wysocki
  2020-08-06  5:55     ` Doug Smythies
  2020-08-06 12:03     ` Rafael J. Wysocki
  4 siblings, 2 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-08-05 16:55 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez,
	Viresh Kumar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that in the frequency range below the base frequency
(ie. the frequency retured by the base_frequency cpufreq attribute
in sysfs on HWP systems) the HWP algorithm is allowed to make the
CPU run at a frequency above the floor P-state set by intel_pstate,
with or without hardware coordination of P-states among CPUs in the
same package.

[If P-states of the CPUs in the same package are coordinated at the
 hardware level, a non-HWP processor may choose a P-state above the
 target one like a processor with HWP enabled may choose a P-state
 above the HWP floor, so the HWP behavior is analogous to the non-HWP
 one in that case.

 Also note that the HWP floor may not be taken into account by
 the processor in the range of P-states above the base frequency,
 referred to as the turbo range, where the processor has a license to
 choose any P-state, either below or above the HWP floor, just like a
 non-HWP processor in the case when the target P-state falls into the
 turbo range.]

With this change applied, intel_pstate in the passive mode
assumes complete control over the HWP request MSR and concurrent
changes of that MSR (eg. via the direct MSR access interface) are
overridden by it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This is on top of the material already in the mainline.

v1 -> v2:
   * Avoid a race condition when updating the HWP request register while
     setting a new EPP value via sysfs.

v2 -> v3:
   * Rebase.

v3 -> v4:
   * Avoid exposing the hwp_dynamic_boost sysfs switch in the passive mode.

v4 -> v5:
   * Do not acquire intel_pstate_driver_lock in
     store_energy_performance_preference(), because it runs under
     policy->rwsem, so intel_pstate_driver cannot change while it is running.
   * Rearrange the changelog a bit to avoid confusion.

v5 -> v6:
   * Fix the problem with the EPP setting via sysfs not working with the
     performance and powersave governors by stopping and restarting the
     governor around the sysfs-based EPP updates in the passive mode.
   * Because of that, use the epp_cached field just for avoiding the above
     if the new EPP value for the given CPU is the same as the old one.
   * Export cpufreq_start/stop_governor() from the core (for the above).

v6 -> v7:
   * Cosmetic changes in store_energy_performance_prefernce() to reduce the
     LoC number and make it a bit easier to read.  No intentional functional
     impact.

---
 Documentation/admin-guide/pm/intel_pstate.rst |   89 ++++-----
 drivers/cpufreq/cpufreq.c                     |    6 
 drivers/cpufreq/intel_pstate.c                |  248 +++++++++++++++++++-------
 include/linux/cpufreq.h                       |    2 
 4 files changed, 234 insertions(+), 111 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -36,6 +36,7 @@
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
 #define INTEL_CPUFREQ_TRANSITION_DELAY		500
 
 #ifdef CONFIG_ACPI
@@ -220,6 +221,7 @@ struct global_params {
  *			preference/bias
  * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
  *			operation
+ * @epp_cached		Cached HWP energy-performance preference value
  * @hwp_req_cached:	Cached value of the last HWP Request MSR
  * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
  * @last_io_update:	Last time when IO wake flag was set
@@ -257,6 +259,7 @@ struct cpudata {
 	s16 epp_policy;
 	s16 epp_default;
 	s16 epp_saved;
+	s16 epp_cached;
 	u64 hwp_req_cached;
 	u64 hwp_cap_cached;
 	u64 last_io_update;
@@ -639,6 +642,26 @@ static int intel_pstate_get_energy_pref_
 	return index;
 }
 
+static int intel_pstate_set_epp(struct cpudata *cpu, u32 epp)
+{
+	/*
+	 * Use the cached HWP Request MSR value, because in the active mode the
+	 * register itself may be updated by intel_pstate_hwp_boost_up() or
+	 * intel_pstate_hwp_boost_down() at any time.
+	 */
+	u64 value = READ_ONCE(cpu->hwp_req_cached);
+
+	value &= ~GENMASK_ULL(31, 24);
+	value |= (u64)epp << 24;
+	/*
+	 * The only other updater of hwp_req_cached in the active mode,
+	 * intel_pstate_hwp_set(), is called under the same lock as this
+	 * function, so it cannot run in parallel with the update below.
+	 */
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	return wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
 static int intel_pstate_set_energy_pref_index(struct cpudata *cpu_data,
 					      int pref_index, bool use_raw,
 					      u32 raw_epp)
@@ -650,28 +673,12 @@ static int intel_pstate_set_energy_pref_
 		epp = cpu_data->epp_default;
 
 	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
-		/*
-		 * Use the cached HWP Request MSR value, because the register
-		 * itself may be updated by intel_pstate_hwp_boost_up() or
-		 * intel_pstate_hwp_boost_down() at any time.
-		 */
-		u64 value = READ_ONCE(cpu_data->hwp_req_cached);
-
-		value &= ~GENMASK_ULL(31, 24);
-
 		if (use_raw)
 			epp = raw_epp;
 		else if (epp == -EINVAL)
 			epp = epp_values[pref_index - 1];
 
-		value |= (u64)epp << 24;
-		/*
-		 * The only other updater of hwp_req_cached in the active mode,
-		 * intel_pstate_hwp_set(), is called under the same lock as this
-		 * function, so it cannot run in parallel with the update below.
-		 */
-		WRITE_ONCE(cpu_data->hwp_req_cached, value);
-		ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
+		ret = intel_pstate_set_epp(cpu_data, epp);
 	} else {
 		if (epp == -EINVAL)
 			epp = (pref_index - 1) << 2;
@@ -697,6 +704,8 @@ static ssize_t show_energy_performance_a
 
 cpufreq_freq_attr_ro(energy_performance_available_preferences);
 
+static struct cpufreq_driver intel_pstate;
+
 static ssize_t store_energy_performance_preference(
 		struct cpufreq_policy *policy, const char *buf, size_t count)
 {
@@ -725,11 +734,47 @@ static ssize_t store_energy_performance_
 		raw = true;
 	}
 
+	/*
+	 * This function runs with the policy R/W semaphore held, which
+	 * guarantees that the driver pointer will not change while it is
+	 * running.
+	 */
+	if (!intel_pstate_driver)
+		return -EAGAIN;
+
 	mutex_lock(&intel_pstate_limits_lock);
 
-	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
-	if (!ret)
-		ret = count;
+	if (intel_pstate_driver == &intel_pstate) {
+		ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
+		if (!ret)
+			ret = count;
+	} else {
+		/*
+		 * In the passive mode the governor needs to be stopped on the
+		 * target CPU before the EPP update and restarted after it,
+		 * which is super-heavy-weight, so make sure it is worth doing
+		 * upfront.
+		 */
+		if (!raw)
+			epp = ret ? epp_values[ret - 1] : cpu_data->epp_default;
+
+		if (cpu_data->epp_cached == epp) {
+			ret = count;
+		} else {
+			int err;
+
+			cpufreq_stop_governor(policy);
+			ret = intel_pstate_set_epp(cpu_data, epp);
+			if (!ret)
+				cpu_data->epp_cached = epp;
+
+			err = cpufreq_start_governor(policy);
+			if (err)
+				ret = ret ?: err;
+			else if (!ret)
+				ret = count;
+		}
+	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
@@ -1145,8 +1190,6 @@ static ssize_t store_no_turbo(struct kob
 	return count;
 }
 
-static struct cpufreq_driver intel_pstate;
-
 static void update_qos_request(enum freq_qos_req_type type)
 {
 	int max_state, turbo_max, freq, i, perf_pct;
@@ -1330,9 +1373,10 @@ static const struct attribute_group inte
 
 static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[];
 
+static struct kobject *intel_pstate_kobject;
+
 static void __init intel_pstate_sysfs_expose_params(void)
 {
-	struct kobject *intel_pstate_kobject;
 	int rc;
 
 	intel_pstate_kobject = kobject_create_and_add("intel_pstate",
@@ -1357,17 +1401,31 @@ static void __init intel_pstate_sysfs_ex
 	rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
 	WARN_ON(rc);
 
-	if (hwp_active) {
-		rc = sysfs_create_file(intel_pstate_kobject,
-				       &hwp_dynamic_boost.attr);
-		WARN_ON(rc);
-	}
-
 	if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids)) {
 		rc = sysfs_create_file(intel_pstate_kobject, &energy_efficiency.attr);
 		WARN_ON(rc);
 	}
 }
+
+static void intel_pstate_sysfs_expose_hwp_dynamic_boost(void)
+{
+	int rc;
+
+	if (!hwp_active)
+		return;
+
+	rc = sysfs_create_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+	WARN_ON_ONCE(rc);
+}
+
+static void intel_pstate_sysfs_hide_hwp_dynamic_boost(void)
+{
+	if (!hwp_active)
+		return;
+
+	sysfs_remove_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+}
+
 /************************** sysfs end ************************/
 
 static void intel_pstate_hwp_enable(struct cpudata *cpudata)
@@ -2246,7 +2304,10 @@ static int intel_pstate_verify_policy(st
 
 static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
 {
-	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+	if (hwp_active)
+		intel_pstate_hwp_force_min_perf(policy->cpu);
+	else
+		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
 }
 
 static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
@@ -2254,12 +2315,10 @@ static void intel_pstate_stop_cpu(struct
 	pr_debug("CPU %d exiting\n", policy->cpu);
 
 	intel_pstate_clear_update_util_hook(policy->cpu);
-	if (hwp_active) {
+	if (hwp_active)
 		intel_pstate_hwp_save_state(policy);
-		intel_pstate_hwp_force_min_perf(policy->cpu);
-	} else {
-		intel_cpufreq_stop_cpu(policy);
-	}
+
+	intel_cpufreq_stop_cpu(policy);
 }
 
 static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
@@ -2389,13 +2448,71 @@ static void intel_cpufreq_trace(struct c
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
+				     bool fast_switch)
+{
+	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
+
+	value &= ~HWP_MIN_PERF(~0L);
+	value |= HWP_MIN_PERF(target_pstate);
+
+	/*
+	 * The entire MSR needs to be updated in order to update the HWP min
+	 * field in it, so opportunistically update the max too if needed.
+	 */
+	value &= ~HWP_MAX_PERF(~0L);
+	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
+
+	if (value == prev)
+		return;
+
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	if (fast_switch)
+		wrmsrl(MSR_HWP_REQUEST, value);
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
+static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
+					  u32 target_pstate, bool fast_switch)
+{
+	if (fast_switch)
+		wrmsrl(MSR_IA32_PERF_CTL,
+		       pstate_funcs.get_val(cpu, target_pstate));
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
+			      pstate_funcs.get_val(cpu, target_pstate));
+}
+
+static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
+				       bool fast_switch)
+{
+	int old_pstate = cpu->pstate.current_pstate;
+
+	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	if (target_pstate != old_pstate) {
+		cpu->pstate.current_pstate = target_pstate;
+		if (hwp_active)
+			intel_cpufreq_adjust_hwp(cpu, target_pstate,
+						 fast_switch);
+		else
+			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
+						      fast_switch);
+	}
+
+	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
+			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
+	return target_pstate;
+}
+
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
 				unsigned int target_freq,
 				unsigned int relation)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
 	struct cpufreq_freqs freqs;
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
@@ -2403,6 +2520,7 @@ static int intel_cpufreq_target(struct c
 	freqs.new = target_freq;
 
 	cpufreq_freq_transition_begin(policy, &freqs);
+
 	switch (relation) {
 	case CPUFREQ_RELATION_L:
 		target_pstate = DIV_ROUND_UP(freqs.new, cpu->pstate.scaling);
@@ -2414,15 +2532,11 @@ static int intel_cpufreq_target(struct c
 		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu->pstate.scaling);
 		break;
 	}
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	if (target_pstate != cpu->pstate.current_pstate) {
-		cpu->pstate.current_pstate = target_pstate;
-		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
-	}
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, false);
+
 	freqs.new = target_pstate * cpu->pstate.scaling;
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
 	cpufreq_freq_transition_end(policy, &freqs, false);
 
 	return 0;
@@ -2432,15 +2546,14 @@ static unsigned int intel_cpufreq_fast_s
 					      unsigned int target_freq)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
 	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	intel_pstate_update_pstate(cpu, target_pstate);
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, true);
+
 	return target_pstate * cpu->pstate.scaling;
 }
 
@@ -2460,7 +2573,6 @@ static int intel_cpufreq_cpu_init(struct
 		return ret;
 
 	policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
-	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
 	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
 	policy->cur = policy->cpuinfo.min_freq;
 
@@ -2472,10 +2584,18 @@ static int intel_cpufreq_cpu_init(struct
 
 	cpu = all_cpu_data[policy->cpu];
 
-	if (hwp_active)
+	if (hwp_active) {
+		u64 value;
+
 		intel_pstate_get_hwp_max(policy->cpu, &turbo_max, &max_state);
-	else
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
+		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
+		WRITE_ONCE(cpu->hwp_req_cached, value);
+		cpu->epp_cached = (value & GENMASK_ULL(31, 24)) >> 24;
+	} else {
 		turbo_max = cpu->pstate.turbo_pstate;
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
+	}
 
 	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
 	min_freq *= cpu->pstate.scaling;
@@ -2552,6 +2672,10 @@ static void intel_pstate_driver_cleanup(
 		}
 	}
 	put_online_cpus();
+
+	if (intel_pstate_driver == &intel_pstate)
+		intel_pstate_sysfs_hide_hwp_dynamic_boost();
+
 	intel_pstate_driver = NULL;
 }
 
@@ -2559,6 +2683,9 @@ static int intel_pstate_register_driver(
 {
 	int ret;
 
+	if (driver == &intel_pstate)
+		intel_pstate_sysfs_expose_hwp_dynamic_boost();
+
 	memset(&global, 0, sizeof(global));
 	global.max_perf_pct = 100;
 
@@ -2576,9 +2703,6 @@ static int intel_pstate_register_driver(
 
 static int intel_pstate_unregister_driver(void)
 {
-	if (hwp_active)
-		return -EBUSY;
-
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
@@ -2834,7 +2958,10 @@ static int __init intel_pstate_init(void
 			hwp_active++;
 			hwp_mode_bdw = id->driver_data;
 			intel_pstate.attr = hwp_cpufreq_attrs;
-			default_driver = &intel_pstate;
+			intel_cpufreq.attr = hwp_cpufreq_attrs;
+			if (!default_driver)
+				default_driver = &intel_pstate;
+
 			goto hwp_cpu_matched;
 		}
 	} else {
@@ -2905,14 +3032,13 @@ static int __init intel_pstate_setup(cha
 	if (!str)
 		return -EINVAL;
 
-	if (!strcmp(str, "disable")) {
+	if (!strcmp(str, "disable"))
 		no_load = 1;
-	} else if (!strcmp(str, "active")) {
+	else if (!strcmp(str, "active"))
 		default_driver = &intel_pstate;
-	} else if (!strcmp(str, "passive")) {
+	else if (!strcmp(str, "passive"))
 		default_driver = &intel_cpufreq;
-		no_hwp = 1;
-	}
+
 	if (!strcmp(str, "no_hwp")) {
 		pr_info("HWP disabled\n");
 		no_hwp = 1;
Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
+++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
@@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
 Operation Modes
 ===============
 
-``intel_pstate`` can operate in three different modes: in the active mode with
-or without hardware-managed P-states support and in the passive mode.  Which of
-them will be in effect depends on what kernel command line options are used and
-on the capabilities of the processor.
+``intel_pstate`` can operate in two different modes, active or passive.  In the
+active mode, it uses its own internal preformance scaling governor algorithm or
+allows the hardware to do preformance scaling by itself, while in the passive
+mode it responds to requests made by a generic ``CPUFreq`` governor implementing
+a certain performance scaling algorithm.  Which of them will be in effect
+depends on what kernel command line options are used and on the capabilities of
+the processor.
 
 Active Mode
 -----------
@@ -194,10 +197,11 @@ This is the default operation mode of ``
 hardware-managed P-states (HWP) support.  It is always used if the
 ``intel_pstate=passive`` argument is passed to the kernel in the command line
 regardless of whether or not the given processor supports HWP.  [Note that the
-``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
-without ``intel_pstate=active``.]  Like in the active mode without HWP support,
-in this mode ``intel_pstate`` may refuse to work with processors that are not
-recognized by it.
+``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
+if it is not combined with ``intel_pstate=active``.]  Like in the active mode
+without HWP support, in this mode ``intel_pstate`` may refuse to work with
+processors that are not recognized by it if HWP is prevented from being enabled
+through the kernel command line.
 
 If the driver works in this mode, the ``scaling_driver`` policy attribute in
 ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
@@ -318,10 +322,9 @@ manuals need to be consulted to get to i
 
 For this reason, there is a list of supported processors in ``intel_pstate`` and
 the driver initialization will fail if the detected processor is not in that
-list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
-obtain all of the information listed above is the same for all of the processors
-supporting the HWP feature, which is why they all are supported by
-``intel_pstate``.]
+list, unless it supports the HWP feature.  [The interface to obtain all of the
+information listed above is the same for all of the processors supporting the
+HWP feature, which is why ``intel_pstate`` works with all of them.]
 
 
 User Space Interface in ``sysfs``
@@ -425,22 +428,16 @@ argument is passed to the kernel in the
 	as well as the per-policy ones) are then reset to their default
 	values, possibly depending on the target operation mode.]
 
-	That only is supported in some configurations, though (for example, if
-	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
-	the operation mode of the driver cannot be changed), and if it is not
-	supported in the current configuration, writes to this attribute will
-	fail with an appropriate error.
-
 ``energy_efficiency``
-	This attribute is only present on platforms, which have CPUs matching
-	Kaby Lake or Coffee Lake desktop CPU model. By default
-	energy efficiency optimizations are disabled on these CPU models in HWP
-	mode by this driver. Enabling energy efficiency may limit maximum
-	operating frequency in both HWP and non HWP mode. In non HWP mode,
-	optimizations are done only in the turbo frequency range. In HWP mode,
-	optimizations are done in the entire frequency range. Setting this
-	attribute to "1" enables energy efficiency optimizations and setting
-	to "0" disables energy efficiency optimizations.
+	This attribute is only present on platforms with CPUs matching the Kaby
+	Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
+	optimizations are disabled on these CPU models if HWP is enabled.
+	Enabling energy-efficiency optimizations may limit maximum operating
+	frequency with or without the HWP feature.  With HWP enabled, the
+	optimizations are done only in the turbo frequency range.  Without it,
+	they are done in the entire available frequency range.  Setting this
+	attribute to "1" enables the energy-efficiency optimizations and setting
+	to "0" disables them.
 
 Interpretation of Policy Attributes
 -----------------------------------
@@ -484,8 +481,8 @@ Next, the following policy attributes ha
 	policy for the time interval between the last two invocations of the
 	driver's utilization update callback by the CPU scheduler for that CPU.
 
-One more policy attribute is present if the `HWP feature is enabled in the
-processor <Active Mode With HWP_>`_:
+One more policy attribute is present if the HWP feature is enabled in the
+processor:
 
 ``base_frequency``
 	Shows the base frequency of the CPU. Any frequency above this will be
@@ -526,11 +523,11 @@ on the following rules, regardless of th
 
  3. The global and per-policy limits can be set independently.
 
-If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
-resulting effective values are written into its registers whenever the limits
-change in order to request its internal P-state selection logic to always set
-P-states within these limits.  Otherwise, the limits are taken into account by
-scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
+resulting effective values are written into hardware registers whenever the
+limits change in order to request its internal P-state selection logic to always
+set P-states within these limits.  Otherwise, the limits are taken into account
+by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
 every time before setting a new P-state for a CPU.
 
 Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
@@ -541,12 +538,11 @@ at all and the only way to set the limit
 Energy vs Performance Hints
 ---------------------------
 
-If ``intel_pstate`` works in the `active mode with the HWP feature enabled
-<Active Mode With HWP_>`_ in the processor, additional attributes are present
-in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
-user space to help ``intel_pstate`` to adjust the processor's internal P-state
-selection logic by focusing it on performance or on energy-efficiency, or
-somewhere between the two extremes:
+If the hardware-managed P-states (HWP) is enabled in the processor, additional
+attributes, intended to allow user space to help ``intel_pstate`` to adjust the
+processor's internal P-state selection logic by focusing it on performance or on
+energy-efficiency, or somewhere between the two extremes, are present in every
+``CPUFreq`` policy directory in ``sysfs``.  They are :
 
 ``energy_performance_preference``
 	Current value of the energy vs performance hint for the given policy
@@ -650,12 +646,14 @@ of them have to be prepended with the ``
 	Do not register ``intel_pstate`` as the scaling driver even if the
 	processor is supported by it.
 
+``active``
+	Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
+	with.
+
 ``passive``
 	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
 	start with.
 
-	This option implies the ``no_hwp`` one described below.
-
 ``force``
 	Register ``intel_pstate`` as the scaling driver instead of
 	``acpi-cpufreq`` even if the latter is preferred on the given system.
@@ -670,13 +668,12 @@ of them have to be prepended with the ``
 	driver is used instead of ``acpi-cpufreq``.
 
 ``no_hwp``
-	Do not enable the `hardware-managed P-states (HWP) feature
-	<Active Mode With HWP_>`_ even if it is supported by the processor.
+	Do not enable the hardware-managed P-states (HWP) feature even if it is
+	supported by the processor.
 
 ``hwp_only``
 	Register ``intel_pstate`` as the scaling driver only if the
-	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
-	supported by the processor.
+	hardware-managed P-states (HWP) feature is supported by the processor.
 
 ``support_acpi_ppc``
 	Take ACPI ``_PPC`` performance limits into account.
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -73,8 +73,6 @@ static inline bool has_target(void)
 static unsigned int __cpufreq_get(struct cpufreq_policy *policy);
 static int cpufreq_init_governor(struct cpufreq_policy *policy);
 static void cpufreq_exit_governor(struct cpufreq_policy *policy);
-static int cpufreq_start_governor(struct cpufreq_policy *policy);
-static void cpufreq_stop_governor(struct cpufreq_policy *policy);
 static void cpufreq_governor_limits(struct cpufreq_policy *policy);
 static int cpufreq_set_policy(struct cpufreq_policy *policy,
 			      struct cpufreq_governor *new_gov,
@@ -2266,7 +2264,7 @@ static void cpufreq_exit_governor(struct
 	module_put(policy->governor->owner);
 }
 
-static int cpufreq_start_governor(struct cpufreq_policy *policy)
+int cpufreq_start_governor(struct cpufreq_policy *policy)
 {
 	int ret;
 
@@ -2293,7 +2291,7 @@ static int cpufreq_start_governor(struct
 	return 0;
 }
 
-static void cpufreq_stop_governor(struct cpufreq_policy *policy)
+void cpufreq_stop_governor(struct cpufreq_policy *policy)
 {
 	if (cpufreq_suspended || !policy->governor)
 		return;
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -576,6 +576,8 @@ unsigned int cpufreq_driver_resolve_freq
 unsigned int cpufreq_policy_transition_delay_us(struct cpufreq_policy *policy);
 int cpufreq_register_governor(struct cpufreq_governor *governor);
 void cpufreq_unregister_governor(struct cpufreq_governor *governor);
+int cpufreq_start_governor(struct cpufreq_policy *policy);
+void cpufreq_stop_governor(struct cpufreq_policy *policy);
 
 #define cpufreq_governor_init(__governor)			\
 static int __init __governor##_init(void)			\




^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-05 16:55   ` [PATCH v7] " Rafael J. Wysocki
@ 2020-08-06  5:55     ` Doug Smythies
  2020-08-06 11:25       ` Rafael J. Wysocki
  2020-08-06 12:03     ` Rafael J. Wysocki
  1 sibling, 1 reply; 33+ messages in thread
From: Doug Smythies @ 2020-08-06  5:55 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Linux Documentation', 'LKML',
	'Peter Zijlstra', 'Srinivas Pandruvada',
	'Giovanni Gherdovich', 'Francisco Jerez',
	'Viresh Kumar', 'Linux PM'

On 2020.08.05 09:56 Rafael J. Wysocki wrote:

> v6 -> v7:
>    * Cosmetic changes in store_energy_performance_prefernce() to reduce the
>      LoC number and make it a bit easier to read.  No intentional functional
>      impact.

??
V7 is identical to V6.

Diff:

$ diff hwppassive-v6-2-2.patch hwppassive-v7-2-2.patch
2c2
< Sent: August 4, 2020 8:11 AM
---
> Sent: August 5, 2020 9:56 AM
5c5
< Subject: [PATCH v6] cpufreq: intel_pstate: Implement passive mode with HWP enabled
---
> Subject: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
76a77,81
>
> v6 -> v7:
>    * Cosmetic changes in store_energy_performance_prefernce() to reduce the
>      LoC number and make it a bit easier to read.  No intentional functional
>      impact.

... Doug



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-06  5:55     ` Doug Smythies
@ 2020-08-06 11:25       ` Rafael J. Wysocki
  0 siblings, 0 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-08-06 11:25 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Linux Documentation, LKML, Peter Zijlstra,
	Srinivas Pandruvada, Giovanni Gherdovich, Francisco Jerez,
	Viresh Kumar, Linux PM

On Thu, Aug 6, 2020 at 1:04 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> On 2020.08.05 09:56 Rafael J. Wysocki wrote:
>
> > v6 -> v7:
> >    * Cosmetic changes in store_energy_performance_prefernce() to reduce the
> >      LoC number and make it a bit easier to read.  No intentional functional
> >      impact.
>
> ??
> V7 is identical to V6.
>
> Diff:

Oh.

It looks like I sent the v6 instead of it, then.

I'll send the right one shortly, sorry for the confusion.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-05 16:55   ` [PATCH v7] " Rafael J. Wysocki
  2020-08-06  5:55     ` Doug Smythies
@ 2020-08-06 12:03     ` Rafael J. Wysocki
  2020-08-10  0:44       ` Srinivas Pandruvada
                         ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-08-06 12:03 UTC (permalink / raw)
  To: Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Francisco Jerez,
	Viresh Kumar

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that in the frequency range below the base frequency
(ie. the frequency retured by the base_frequency cpufreq attribute
in sysfs on HWP systems) the HWP algorithm is allowed to make the
CPU run at a frequency above the floor P-state set by intel_pstate,
with or without hardware coordination of P-states among CPUs in the
same package.

[If P-states of the CPUs in the same package are coordinated at the
 hardware level, a non-HWP processor may choose a P-state above the
 target one like a processor with HWP enabled may choose a P-state
 above the HWP floor, so the HWP behavior is analogous to the non-HWP
 one in that case.

 Also note that the HWP floor may not be taken into account by
 the processor in the range of P-states above the base frequency,
 referred to as the turbo range, where the processor has a license to
 choose any P-state, either below or above the HWP floor, just like a
 non-HWP processor in the case when the target P-state falls into the
 turbo range.]

With this change applied, intel_pstate in the passive mode
assumes complete control over the HWP request MSR and concurrent
changes of that MSR (eg. via the direct MSR access interface) are
overridden by it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

Sending the right patch this time, sorry for the confusion.

This is based on the current mainline.

v1 -> v2:
   * Avoid a race condition when updating the HWP request register while
     setting a new EPP value via sysfs.

v2 -> v3:
   * Rebase.

v3 -> v4:
   * Avoid exposing the hwp_dynamic_boost sysfs switch in the passive mode.

v4 -> v5:
   * Do not acquire intel_pstate_driver_lock in
     store_energy_performance_preference(), because it runs under
     policy->rwsem, so intel_pstate_driver cannot change while it is running.
   * Rearrange the changelog a bit to avoid confusion.

v5 -> v6:
   * Fix the problem with the EPP setting via sysfs not working with the
     performance and powersave governors by stopping and restarting the
     governor around the sysfs-based EPP updates in the passive mode.
   * Because of that, use the epp_cached field just for avoiding the above
     if the new EPP value for the given CPU is the same as the old one.
   * Export cpufreq_start/stop_governor() from the core (for the above).

v6 -> v7:
   * Cosmetic changes in store_energy_performance_prefernce() to reduce the
     LoC number and make it a bit easier to read.  No intentional functional
     impact.

---
 Documentation/admin-guide/pm/intel_pstate.rst |   89 ++++-----
 drivers/cpufreq/cpufreq.c                     |    6 
 drivers/cpufreq/intel_pstate.c                |  245 +++++++++++++++++++-------
 include/linux/cpufreq.h                       |    2 
 4 files changed, 229 insertions(+), 113 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -36,6 +36,7 @@
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
 #define INTEL_CPUFREQ_TRANSITION_DELAY		500
 
 #ifdef CONFIG_ACPI
@@ -220,6 +221,7 @@ struct global_params {
  *			preference/bias
  * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
  *			operation
+ * @epp_cached		Cached HWP energy-performance preference value
  * @hwp_req_cached:	Cached value of the last HWP Request MSR
  * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
  * @last_io_update:	Last time when IO wake flag was set
@@ -257,6 +259,7 @@ struct cpudata {
 	s16 epp_policy;
 	s16 epp_default;
 	s16 epp_saved;
+	s16 epp_cached;
 	u64 hwp_req_cached;
 	u64 hwp_cap_cached;
 	u64 last_io_update;
@@ -639,6 +642,26 @@ static int intel_pstate_get_energy_pref_
 	return index;
 }
 
+static int intel_pstate_set_epp(struct cpudata *cpu, u32 epp)
+{
+	/*
+	 * Use the cached HWP Request MSR value, because in the active mode the
+	 * register itself may be updated by intel_pstate_hwp_boost_up() or
+	 * intel_pstate_hwp_boost_down() at any time.
+	 */
+	u64 value = READ_ONCE(cpu->hwp_req_cached);
+
+	value &= ~GENMASK_ULL(31, 24);
+	value |= (u64)epp << 24;
+	/*
+	 * The only other updater of hwp_req_cached in the active mode,
+	 * intel_pstate_hwp_set(), is called under the same lock as this
+	 * function, so it cannot run in parallel with the update below.
+	 */
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	return wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
 static int intel_pstate_set_energy_pref_index(struct cpudata *cpu_data,
 					      int pref_index, bool use_raw,
 					      u32 raw_epp)
@@ -650,28 +673,12 @@ static int intel_pstate_set_energy_pref_
 		epp = cpu_data->epp_default;
 
 	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
-		/*
-		 * Use the cached HWP Request MSR value, because the register
-		 * itself may be updated by intel_pstate_hwp_boost_up() or
-		 * intel_pstate_hwp_boost_down() at any time.
-		 */
-		u64 value = READ_ONCE(cpu_data->hwp_req_cached);
-
-		value &= ~GENMASK_ULL(31, 24);
-
 		if (use_raw)
 			epp = raw_epp;
 		else if (epp == -EINVAL)
 			epp = epp_values[pref_index - 1];
 
-		value |= (u64)epp << 24;
-		/*
-		 * The only other updater of hwp_req_cached in the active mode,
-		 * intel_pstate_hwp_set(), is called under the same lock as this
-		 * function, so it cannot run in parallel with the update below.
-		 */
-		WRITE_ONCE(cpu_data->hwp_req_cached, value);
-		ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
+		ret = intel_pstate_set_epp(cpu_data, epp);
 	} else {
 		if (epp == -EINVAL)
 			epp = (pref_index - 1) << 2;
@@ -697,10 +704,12 @@ static ssize_t show_energy_performance_a
 
 cpufreq_freq_attr_ro(energy_performance_available_preferences);
 
+static struct cpufreq_driver intel_pstate;
+
 static ssize_t store_energy_performance_preference(
 		struct cpufreq_policy *policy, const char *buf, size_t count)
 {
-	struct cpudata *cpu_data = all_cpu_data[policy->cpu];
+	struct cpudata *cpu = all_cpu_data[policy->cpu];
 	char str_preference[21];
 	bool raw = false;
 	ssize_t ret;
@@ -725,15 +734,44 @@ static ssize_t store_energy_performance_
 		raw = true;
 	}
 
+	/*
+	 * This function runs with the policy R/W semaphore held, which
+	 * guarantees that the driver pointer will not change while it is
+	 * running.
+	 */
+	if (!intel_pstate_driver)
+		return -EAGAIN;
+
 	mutex_lock(&intel_pstate_limits_lock);
 
-	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
-	if (!ret)
-		ret = count;
+	if (intel_pstate_driver == &intel_pstate) {
+		ret = intel_pstate_set_energy_pref_index(cpu, ret, raw, epp);
+	} else {
+		/*
+		 * In the passive mode the governor needs to be stopped on the
+		 * target CPU before the EPP update and restarted after it,
+		 * which is super-heavy-weight, so make sure it is worth doing
+		 * upfront.
+		 */
+		if (!raw)
+			epp = ret ? epp_values[ret - 1] : cpu->epp_default;
+
+		if (cpu->epp_cached != epp) {
+			int err;
+
+			cpufreq_stop_governor(policy);
+			ret = intel_pstate_set_epp(cpu, epp);
+			err = cpufreq_start_governor(policy);
+			if (!ret) {
+				cpu->epp_cached = epp;
+				ret = err;
+			}
+		}
+	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
-	return ret;
+	return ret ?: count;
 }
 
 static ssize_t show_energy_performance_preference(
@@ -1145,8 +1183,6 @@ static ssize_t store_no_turbo(struct kob
 	return count;
 }
 
-static struct cpufreq_driver intel_pstate;
-
 static void update_qos_request(enum freq_qos_req_type type)
 {
 	int max_state, turbo_max, freq, i, perf_pct;
@@ -1330,9 +1366,10 @@ static const struct attribute_group inte
 
 static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[];
 
+static struct kobject *intel_pstate_kobject;
+
 static void __init intel_pstate_sysfs_expose_params(void)
 {
-	struct kobject *intel_pstate_kobject;
 	int rc;
 
 	intel_pstate_kobject = kobject_create_and_add("intel_pstate",
@@ -1357,17 +1394,31 @@ static void __init intel_pstate_sysfs_ex
 	rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
 	WARN_ON(rc);
 
-	if (hwp_active) {
-		rc = sysfs_create_file(intel_pstate_kobject,
-				       &hwp_dynamic_boost.attr);
-		WARN_ON(rc);
-	}
-
 	if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids)) {
 		rc = sysfs_create_file(intel_pstate_kobject, &energy_efficiency.attr);
 		WARN_ON(rc);
 	}
 }
+
+static void intel_pstate_sysfs_expose_hwp_dynamic_boost(void)
+{
+	int rc;
+
+	if (!hwp_active)
+		return;
+
+	rc = sysfs_create_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+	WARN_ON_ONCE(rc);
+}
+
+static void intel_pstate_sysfs_hide_hwp_dynamic_boost(void)
+{
+	if (!hwp_active)
+		return;
+
+	sysfs_remove_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
+}
+
 /************************** sysfs end ************************/
 
 static void intel_pstate_hwp_enable(struct cpudata *cpudata)
@@ -2246,7 +2297,10 @@ static int intel_pstate_verify_policy(st
 
 static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
 {
-	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+	if (hwp_active)
+		intel_pstate_hwp_force_min_perf(policy->cpu);
+	else
+		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
 }
 
 static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
@@ -2254,12 +2308,10 @@ static void intel_pstate_stop_cpu(struct
 	pr_debug("CPU %d exiting\n", policy->cpu);
 
 	intel_pstate_clear_update_util_hook(policy->cpu);
-	if (hwp_active) {
+	if (hwp_active)
 		intel_pstate_hwp_save_state(policy);
-		intel_pstate_hwp_force_min_perf(policy->cpu);
-	} else {
-		intel_cpufreq_stop_cpu(policy);
-	}
+
+	intel_cpufreq_stop_cpu(policy);
 }
 
 static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
@@ -2389,13 +2441,71 @@ static void intel_cpufreq_trace(struct c
 		fp_toint(cpu->iowait_boost * 100));
 }
 
+static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
+				     bool fast_switch)
+{
+	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
+
+	value &= ~HWP_MIN_PERF(~0L);
+	value |= HWP_MIN_PERF(target_pstate);
+
+	/*
+	 * The entire MSR needs to be updated in order to update the HWP min
+	 * field in it, so opportunistically update the max too if needed.
+	 */
+	value &= ~HWP_MAX_PERF(~0L);
+	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
+
+	if (value == prev)
+		return;
+
+	WRITE_ONCE(cpu->hwp_req_cached, value);
+	if (fast_switch)
+		wrmsrl(MSR_HWP_REQUEST, value);
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+}
+
+static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
+					  u32 target_pstate, bool fast_switch)
+{
+	if (fast_switch)
+		wrmsrl(MSR_IA32_PERF_CTL,
+		       pstate_funcs.get_val(cpu, target_pstate));
+	else
+		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
+			      pstate_funcs.get_val(cpu, target_pstate));
+}
+
+static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
+				       bool fast_switch)
+{
+	int old_pstate = cpu->pstate.current_pstate;
+
+	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
+	if (target_pstate != old_pstate) {
+		cpu->pstate.current_pstate = target_pstate;
+		if (hwp_active)
+			intel_cpufreq_adjust_hwp(cpu, target_pstate,
+						 fast_switch);
+		else
+			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
+						      fast_switch);
+	}
+
+	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
+			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
+	return target_pstate;
+}
+
 static int intel_cpufreq_target(struct cpufreq_policy *policy,
 				unsigned int target_freq,
 				unsigned int relation)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
 	struct cpufreq_freqs freqs;
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
@@ -2403,6 +2513,7 @@ static int intel_cpufreq_target(struct c
 	freqs.new = target_freq;
 
 	cpufreq_freq_transition_begin(policy, &freqs);
+
 	switch (relation) {
 	case CPUFREQ_RELATION_L:
 		target_pstate = DIV_ROUND_UP(freqs.new, cpu->pstate.scaling);
@@ -2414,15 +2525,11 @@ static int intel_cpufreq_target(struct c
 		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu->pstate.scaling);
 		break;
 	}
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	if (target_pstate != cpu->pstate.current_pstate) {
-		cpu->pstate.current_pstate = target_pstate;
-		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
-			      pstate_funcs.get_val(cpu, target_pstate));
-	}
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, false);
+
 	freqs.new = target_pstate * cpu->pstate.scaling;
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, old_pstate);
+
 	cpufreq_freq_transition_end(policy, &freqs, false);
 
 	return 0;
@@ -2432,15 +2539,14 @@ static unsigned int intel_cpufreq_fast_s
 					      unsigned int target_freq)
 {
 	struct cpudata *cpu = all_cpu_data[policy->cpu];
-	int target_pstate, old_pstate;
+	int target_pstate;
 
 	update_turbo_state();
 
 	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
-	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	old_pstate = cpu->pstate.current_pstate;
-	intel_pstate_update_pstate(cpu, target_pstate);
-	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
+
+	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, true);
+
 	return target_pstate * cpu->pstate.scaling;
 }
 
@@ -2460,7 +2566,6 @@ static int intel_cpufreq_cpu_init(struct
 		return ret;
 
 	policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
-	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
 	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
 	policy->cur = policy->cpuinfo.min_freq;
 
@@ -2472,10 +2577,18 @@ static int intel_cpufreq_cpu_init(struct
 
 	cpu = all_cpu_data[policy->cpu];
 
-	if (hwp_active)
+	if (hwp_active) {
+		u64 value;
+
 		intel_pstate_get_hwp_max(policy->cpu, &turbo_max, &max_state);
-	else
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
+		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
+		WRITE_ONCE(cpu->hwp_req_cached, value);
+		cpu->epp_cached = (value & GENMASK_ULL(31, 24)) >> 24;
+	} else {
 		turbo_max = cpu->pstate.turbo_pstate;
+		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
+	}
 
 	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
 	min_freq *= cpu->pstate.scaling;
@@ -2552,6 +2665,10 @@ static void intel_pstate_driver_cleanup(
 		}
 	}
 	put_online_cpus();
+
+	if (intel_pstate_driver == &intel_pstate)
+		intel_pstate_sysfs_hide_hwp_dynamic_boost();
+
 	intel_pstate_driver = NULL;
 }
 
@@ -2559,6 +2676,9 @@ static int intel_pstate_register_driver(
 {
 	int ret;
 
+	if (driver == &intel_pstate)
+		intel_pstate_sysfs_expose_hwp_dynamic_boost();
+
 	memset(&global, 0, sizeof(global));
 	global.max_perf_pct = 100;
 
@@ -2576,9 +2696,6 @@ static int intel_pstate_register_driver(
 
 static int intel_pstate_unregister_driver(void)
 {
-	if (hwp_active)
-		return -EBUSY;
-
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
@@ -2834,7 +2951,10 @@ static int __init intel_pstate_init(void
 			hwp_active++;
 			hwp_mode_bdw = id->driver_data;
 			intel_pstate.attr = hwp_cpufreq_attrs;
-			default_driver = &intel_pstate;
+			intel_cpufreq.attr = hwp_cpufreq_attrs;
+			if (!default_driver)
+				default_driver = &intel_pstate;
+
 			goto hwp_cpu_matched;
 		}
 	} else {
@@ -2905,14 +3025,13 @@ static int __init intel_pstate_setup(cha
 	if (!str)
 		return -EINVAL;
 
-	if (!strcmp(str, "disable")) {
+	if (!strcmp(str, "disable"))
 		no_load = 1;
-	} else if (!strcmp(str, "active")) {
+	else if (!strcmp(str, "active"))
 		default_driver = &intel_pstate;
-	} else if (!strcmp(str, "passive")) {
+	else if (!strcmp(str, "passive"))
 		default_driver = &intel_cpufreq;
-		no_hwp = 1;
-	}
+
 	if (!strcmp(str, "no_hwp")) {
 		pr_info("HWP disabled\n");
 		no_hwp = 1;
Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
+++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
@@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
 Operation Modes
 ===============
 
-``intel_pstate`` can operate in three different modes: in the active mode with
-or without hardware-managed P-states support and in the passive mode.  Which of
-them will be in effect depends on what kernel command line options are used and
-on the capabilities of the processor.
+``intel_pstate`` can operate in two different modes, active or passive.  In the
+active mode, it uses its own internal preformance scaling governor algorithm or
+allows the hardware to do preformance scaling by itself, while in the passive
+mode it responds to requests made by a generic ``CPUFreq`` governor implementing
+a certain performance scaling algorithm.  Which of them will be in effect
+depends on what kernel command line options are used and on the capabilities of
+the processor.
 
 Active Mode
 -----------
@@ -194,10 +197,11 @@ This is the default operation mode of ``
 hardware-managed P-states (HWP) support.  It is always used if the
 ``intel_pstate=passive`` argument is passed to the kernel in the command line
 regardless of whether or not the given processor supports HWP.  [Note that the
-``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
-without ``intel_pstate=active``.]  Like in the active mode without HWP support,
-in this mode ``intel_pstate`` may refuse to work with processors that are not
-recognized by it.
+``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
+if it is not combined with ``intel_pstate=active``.]  Like in the active mode
+without HWP support, in this mode ``intel_pstate`` may refuse to work with
+processors that are not recognized by it if HWP is prevented from being enabled
+through the kernel command line.
 
 If the driver works in this mode, the ``scaling_driver`` policy attribute in
 ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
@@ -318,10 +322,9 @@ manuals need to be consulted to get to i
 
 For this reason, there is a list of supported processors in ``intel_pstate`` and
 the driver initialization will fail if the detected processor is not in that
-list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
-obtain all of the information listed above is the same for all of the processors
-supporting the HWP feature, which is why they all are supported by
-``intel_pstate``.]
+list, unless it supports the HWP feature.  [The interface to obtain all of the
+information listed above is the same for all of the processors supporting the
+HWP feature, which is why ``intel_pstate`` works with all of them.]
 
 
 User Space Interface in ``sysfs``
@@ -425,22 +428,16 @@ argument is passed to the kernel in the
 	as well as the per-policy ones) are then reset to their default
 	values, possibly depending on the target operation mode.]
 
-	That only is supported in some configurations, though (for example, if
-	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
-	the operation mode of the driver cannot be changed), and if it is not
-	supported in the current configuration, writes to this attribute will
-	fail with an appropriate error.
-
 ``energy_efficiency``
-	This attribute is only present on platforms, which have CPUs matching
-	Kaby Lake or Coffee Lake desktop CPU model. By default
-	energy efficiency optimizations are disabled on these CPU models in HWP
-	mode by this driver. Enabling energy efficiency may limit maximum
-	operating frequency in both HWP and non HWP mode. In non HWP mode,
-	optimizations are done only in the turbo frequency range. In HWP mode,
-	optimizations are done in the entire frequency range. Setting this
-	attribute to "1" enables energy efficiency optimizations and setting
-	to "0" disables energy efficiency optimizations.
+	This attribute is only present on platforms with CPUs matching the Kaby
+	Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
+	optimizations are disabled on these CPU models if HWP is enabled.
+	Enabling energy-efficiency optimizations may limit maximum operating
+	frequency with or without the HWP feature.  With HWP enabled, the
+	optimizations are done only in the turbo frequency range.  Without it,
+	they are done in the entire available frequency range.  Setting this
+	attribute to "1" enables the energy-efficiency optimizations and setting
+	to "0" disables them.
 
 Interpretation of Policy Attributes
 -----------------------------------
@@ -484,8 +481,8 @@ Next, the following policy attributes ha
 	policy for the time interval between the last two invocations of the
 	driver's utilization update callback by the CPU scheduler for that CPU.
 
-One more policy attribute is present if the `HWP feature is enabled in the
-processor <Active Mode With HWP_>`_:
+One more policy attribute is present if the HWP feature is enabled in the
+processor:
 
 ``base_frequency``
 	Shows the base frequency of the CPU. Any frequency above this will be
@@ -526,11 +523,11 @@ on the following rules, regardless of th
 
  3. The global and per-policy limits can be set independently.
 
-If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
-resulting effective values are written into its registers whenever the limits
-change in order to request its internal P-state selection logic to always set
-P-states within these limits.  Otherwise, the limits are taken into account by
-scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
+In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
+resulting effective values are written into hardware registers whenever the
+limits change in order to request its internal P-state selection logic to always
+set P-states within these limits.  Otherwise, the limits are taken into account
+by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
 every time before setting a new P-state for a CPU.
 
 Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
@@ -541,12 +538,11 @@ at all and the only way to set the limit
 Energy vs Performance Hints
 ---------------------------
 
-If ``intel_pstate`` works in the `active mode with the HWP feature enabled
-<Active Mode With HWP_>`_ in the processor, additional attributes are present
-in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
-user space to help ``intel_pstate`` to adjust the processor's internal P-state
-selection logic by focusing it on performance or on energy-efficiency, or
-somewhere between the two extremes:
+If the hardware-managed P-states (HWP) is enabled in the processor, additional
+attributes, intended to allow user space to help ``intel_pstate`` to adjust the
+processor's internal P-state selection logic by focusing it on performance or on
+energy-efficiency, or somewhere between the two extremes, are present in every
+``CPUFreq`` policy directory in ``sysfs``.  They are :
 
 ``energy_performance_preference``
 	Current value of the energy vs performance hint for the given policy
@@ -650,12 +646,14 @@ of them have to be prepended with the ``
 	Do not register ``intel_pstate`` as the scaling driver even if the
 	processor is supported by it.
 
+``active``
+	Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
+	with.
+
 ``passive``
 	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
 	start with.
 
-	This option implies the ``no_hwp`` one described below.
-
 ``force``
 	Register ``intel_pstate`` as the scaling driver instead of
 	``acpi-cpufreq`` even if the latter is preferred on the given system.
@@ -670,13 +668,12 @@ of them have to be prepended with the ``
 	driver is used instead of ``acpi-cpufreq``.
 
 ``no_hwp``
-	Do not enable the `hardware-managed P-states (HWP) feature
-	<Active Mode With HWP_>`_ even if it is supported by the processor.
+	Do not enable the hardware-managed P-states (HWP) feature even if it is
+	supported by the processor.
 
 ``hwp_only``
 	Register ``intel_pstate`` as the scaling driver only if the
-	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
-	supported by the processor.
+	hardware-managed P-states (HWP) feature is supported by the processor.
 
 ``support_acpi_ppc``
 	Take ACPI ``_PPC`` performance limits into account.
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -73,8 +73,6 @@ static inline bool has_target(void)
 static unsigned int __cpufreq_get(struct cpufreq_policy *policy);
 static int cpufreq_init_governor(struct cpufreq_policy *policy);
 static void cpufreq_exit_governor(struct cpufreq_policy *policy);
-static int cpufreq_start_governor(struct cpufreq_policy *policy);
-static void cpufreq_stop_governor(struct cpufreq_policy *policy);
 static void cpufreq_governor_limits(struct cpufreq_policy *policy);
 static int cpufreq_set_policy(struct cpufreq_policy *policy,
 			      struct cpufreq_governor *new_gov,
@@ -2266,7 +2264,7 @@ static void cpufreq_exit_governor(struct
 	module_put(policy->governor->owner);
 }
 
-static int cpufreq_start_governor(struct cpufreq_policy *policy)
+int cpufreq_start_governor(struct cpufreq_policy *policy)
 {
 	int ret;
 
@@ -2293,7 +2291,7 @@ static int cpufreq_start_governor(struct
 	return 0;
 }
 
-static void cpufreq_stop_governor(struct cpufreq_policy *policy)
+void cpufreq_stop_governor(struct cpufreq_policy *policy)
 {
 	if (cpufreq_suspended || !policy->governor)
 		return;
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -576,6 +576,8 @@ unsigned int cpufreq_driver_resolve_freq
 unsigned int cpufreq_policy_transition_delay_us(struct cpufreq_policy *policy);
 int cpufreq_register_governor(struct cpufreq_governor *governor);
 void cpufreq_unregister_governor(struct cpufreq_governor *governor);
+int cpufreq_start_governor(struct cpufreq_policy *policy);
+void cpufreq_stop_governor(struct cpufreq_policy *policy);
 
 #define cpufreq_governor_init(__governor)			\
 static int __init __governor##_init(void)			\




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-06 12:03     ` Rafael J. Wysocki
@ 2020-08-10  0:44       ` Srinivas Pandruvada
  2020-08-11  0:51       ` Francisco Jerez
  2020-08-17 21:06       ` Doug Smythies
  2 siblings, 0 replies; 33+ messages in thread
From: Srinivas Pandruvada @ 2020-08-10  0:44 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Giovanni Gherdovich,
	Doug Smythies, Francisco Jerez, Viresh Kumar

On Thu, 2020-08-06 at 14:03 +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Allow intel_pstate to work in the passive mode with HWP enabled and
> make it set the HWP minimum performance limit (HWP floor) to the
> P-state value given by the target frequency supplied by the cpufreq
> governor, so as to prevent the HWP algorithm and the CPU scheduler
> from working against each other, at least when the schedutil governor
> is in use, and update the intel_pstate documentation accordingly.
> 
> Among other things, this allows utilization clamps to be taken
> into account, at least to a certain extent, when intel_pstate is
> in use and makes it more likely that sufficient capacity for
> deadline tasks will be provided.
> 
> After this change, the resulting behavior of an HWP system with
> intel_pstate in the passive mode should be close to the behavior
> of the analogous non-HWP system with intel_pstate in the passive
> mode, except that in the frequency range below the base frequency
> (ie. the frequency retured by the base_frequency cpufreq attribute
> in sysfs on HWP systems) the HWP algorithm is allowed to make the
> CPU run at a frequency above the floor P-state set by intel_pstate,
> with or without hardware coordination of P-states among CPUs in the
> same package.
> 
> [If P-states of the CPUs in the same package are coordinated at the
>  hardware level, a non-HWP processor may choose a P-state above the
>  target one like a processor with HWP enabled may choose a P-state
>  above the HWP floor, so the HWP behavior is analogous to the non-HWP
>  one in that case.
> 
>  Also note that the HWP floor may not be taken into account by
>  the processor in the range of P-states above the base frequency,
>  referred to as the turbo range, where the processor has a license to
>  choose any P-state, either below or above the HWP floor, just like a
>  non-HWP processor in the case when the target P-state falls into the
>  turbo range.]
> 
> With this change applied, intel_pstate in the passive mode
> assumes complete control over the HWP request MSR and concurrent
> changes of that MSR (eg. via the direct MSR access interface) are
> overridden by it.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

> ---
> 
> Sending the right patch this time, sorry for the confusion.
> 
> This is based on the current mainline.
> 
> v1 -> v2:
>    * Avoid a race condition when updating the HWP request register
> while
>      setting a new EPP value via sysfs.
> 
> v2 -> v3:
>    * Rebase.
> 
> v3 -> v4:
>    * Avoid exposing the hwp_dynamic_boost sysfs switch in the passive
> mode.
> 
> v4 -> v5:
>    * Do not acquire intel_pstate_driver_lock in
>      store_energy_performance_preference(), because it runs under
>      policy->rwsem, so intel_pstate_driver cannot change while it is
> running.
>    * Rearrange the changelog a bit to avoid confusion.
> 
> v5 -> v6:
>    * Fix the problem with the EPP setting via sysfs not working with
> the
>      performance and powersave governors by stopping and restarting
> the
>      governor around the sysfs-based EPP updates in the passive mode.
>    * Because of that, use the epp_cached field just for avoiding the
> above
>      if the new EPP value for the given CPU is the same as the old
> one.
>    * Export cpufreq_start/stop_governor() from the core (for the
> above).
> 
> v6 -> v7:
>    * Cosmetic changes in store_energy_performance_prefernce() to
> reduce the
>      LoC number and make it a bit easier to read.  No intentional
> functional
>      impact.
> 
> ---
>  Documentation/admin-guide/pm/intel_pstate.rst |   89 ++++-----
>  drivers/cpufreq/cpufreq.c                     |    6 
>  drivers/cpufreq/intel_pstate.c                |  245
> +++++++++++++++++++-------
>  include/linux/cpufreq.h                       |    2 
>  4 files changed, 229 insertions(+), 113 deletions(-)
> 
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -36,6 +36,7 @@
>  #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
>  
>  #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
> +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
>  #define INTEL_CPUFREQ_TRANSITION_DELAY		500
>  
>  #ifdef CONFIG_ACPI
> @@ -220,6 +221,7 @@ struct global_params {
>   *			preference/bias
>   * @epp_saved:		Saved EPP/EPB during system suspend or
> CPU offline
>   *			operation
> + * @epp_cached		Cached HWP energy-performance
> preference value
>   * @hwp_req_cached:	Cached value of the last HWP Request MSR
>   * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
>   * @last_io_update:	Last time when IO wake flag was set
> @@ -257,6 +259,7 @@ struct cpudata {
>  	s16 epp_policy;
>  	s16 epp_default;
>  	s16 epp_saved;
> +	s16 epp_cached;
>  	u64 hwp_req_cached;
>  	u64 hwp_cap_cached;
>  	u64 last_io_update;
> @@ -639,6 +642,26 @@ static int intel_pstate_get_energy_pref_
>  	return index;
>  }
>  
> +static int intel_pstate_set_epp(struct cpudata *cpu, u32 epp)
> +{
> +	/*
> +	 * Use the cached HWP Request MSR value, because in the active
> mode the
> +	 * register itself may be updated by
> intel_pstate_hwp_boost_up() or
> +	 * intel_pstate_hwp_boost_down() at any time.
> +	 */
> +	u64 value = READ_ONCE(cpu->hwp_req_cached);
> +
> +	value &= ~GENMASK_ULL(31, 24);
> +	value |= (u64)epp << 24;
> +	/*
> +	 * The only other updater of hwp_req_cached in the active mode,
> +	 * intel_pstate_hwp_set(), is called under the same lock as
> this
> +	 * function, so it cannot run in parallel with the update
> below.
> +	 */
> +	WRITE_ONCE(cpu->hwp_req_cached, value);
> +	return wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
> +}
> +
>  static int intel_pstate_set_energy_pref_index(struct cpudata
> *cpu_data,
>  					      int pref_index, bool
> use_raw,
>  					      u32 raw_epp)
> @@ -650,28 +673,12 @@ static int intel_pstate_set_energy_pref_
>  		epp = cpu_data->epp_default;
>  
>  	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
> -		/*
> -		 * Use the cached HWP Request MSR value, because the
> register
> -		 * itself may be updated by intel_pstate_hwp_boost_up()
> or
> -		 * intel_pstate_hwp_boost_down() at any time.
> -		 */
> -		u64 value = READ_ONCE(cpu_data->hwp_req_cached);
> -
> -		value &= ~GENMASK_ULL(31, 24);
> -
>  		if (use_raw)
>  			epp = raw_epp;
>  		else if (epp == -EINVAL)
>  			epp = epp_values[pref_index - 1];
>  
> -		value |= (u64)epp << 24;
> -		/*
> -		 * The only other updater of hwp_req_cached in the
> active mode,
> -		 * intel_pstate_hwp_set(), is called under the same
> lock as this
> -		 * function, so it cannot run in parallel with the
> update below.
> -		 */
> -		WRITE_ONCE(cpu_data->hwp_req_cached, value);
> -		ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST,
> value);
> +		ret = intel_pstate_set_epp(cpu_data, epp);
>  	} else {
>  		if (epp == -EINVAL)
>  			epp = (pref_index - 1) << 2;
> @@ -697,10 +704,12 @@ static ssize_t show_energy_performance_a
>  
>  cpufreq_freq_attr_ro(energy_performance_available_preferences);
>  
> +static struct cpufreq_driver intel_pstate;
> +
>  static ssize_t store_energy_performance_preference(
>  		struct cpufreq_policy *policy, const char *buf, size_t
> count)
>  {
> -	struct cpudata *cpu_data = all_cpu_data[policy->cpu];
> +	struct cpudata *cpu = all_cpu_data[policy->cpu];
>  	char str_preference[21];
>  	bool raw = false;
>  	ssize_t ret;
> @@ -725,15 +734,44 @@ static ssize_t store_energy_performance_
>  		raw = true;
>  	}
>  
> +	/*
> +	 * This function runs with the policy R/W semaphore held, which
> +	 * guarantees that the driver pointer will not change while it
> is
> +	 * running.
> +	 */
> +	if (!intel_pstate_driver)
> +		return -EAGAIN;
> +
>  	mutex_lock(&intel_pstate_limits_lock);
>  
> -	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw,
> epp);
> -	if (!ret)
> -		ret = count;
> +	if (intel_pstate_driver == &intel_pstate) {
> +		ret = intel_pstate_set_energy_pref_index(cpu, ret, raw,
> epp);
> +	} else {
> +		/*
> +		 * In the passive mode the governor needs to be stopped
> on the
> +		 * target CPU before the EPP update and restarted after
> it,
> +		 * which is super-heavy-weight, so make sure it is
> worth doing
> +		 * upfront.
> +		 */
> +		if (!raw)
> +			epp = ret ? epp_values[ret - 1] : cpu-
> >epp_default;
> +
> +		if (cpu->epp_cached != epp) {
> +			int err;
> +
> +			cpufreq_stop_governor(policy);
> +			ret = intel_pstate_set_epp(cpu, epp);
> +			err = cpufreq_start_governor(policy);
> +			if (!ret) {
> +				cpu->epp_cached = epp;
> +				ret = err;
> +			}
> +		}
> +	}
>  
>  	mutex_unlock(&intel_pstate_limits_lock);
>  
> -	return ret;
> +	return ret ?: count;
>  }
>  
>  static ssize_t show_energy_performance_preference(
> @@ -1145,8 +1183,6 @@ static ssize_t store_no_turbo(struct kob
>  	return count;
>  }
>  
> -static struct cpufreq_driver intel_pstate;
> -
>  static void update_qos_request(enum freq_qos_req_type type)
>  {
>  	int max_state, turbo_max, freq, i, perf_pct;
> @@ -1330,9 +1366,10 @@ static const struct attribute_group inte
>  
>  static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[];
>  
> +static struct kobject *intel_pstate_kobject;
> +
>  static void __init intel_pstate_sysfs_expose_params(void)
>  {
> -	struct kobject *intel_pstate_kobject;
>  	int rc;
>  
>  	intel_pstate_kobject = kobject_create_and_add("intel_pstate",
> @@ -1357,17 +1394,31 @@ static void __init intel_pstate_sysfs_ex
>  	rc = sysfs_create_file(intel_pstate_kobject,
> &min_perf_pct.attr);
>  	WARN_ON(rc);
>  
> -	if (hwp_active) {
> -		rc = sysfs_create_file(intel_pstate_kobject,
> -				       &hwp_dynamic_boost.attr);
> -		WARN_ON(rc);
> -	}
> -
>  	if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids)) {
>  		rc = sysfs_create_file(intel_pstate_kobject,
> &energy_efficiency.attr);
>  		WARN_ON(rc);
>  	}
>  }
> +
> +static void intel_pstate_sysfs_expose_hwp_dynamic_boost(void)
> +{
> +	int rc;
> +
> +	if (!hwp_active)
> +		return;
> +
> +	rc = sysfs_create_file(intel_pstate_kobject,
> &hwp_dynamic_boost.attr);
> +	WARN_ON_ONCE(rc);
> +}
> +
> +static void intel_pstate_sysfs_hide_hwp_dynamic_boost(void)
> +{
> +	if (!hwp_active)
> +		return;
> +
> +	sysfs_remove_file(intel_pstate_kobject,
> &hwp_dynamic_boost.attr);
> +}
> +
>  /************************** sysfs end ************************/
>  
>  static void intel_pstate_hwp_enable(struct cpudata *cpudata)
> @@ -2246,7 +2297,10 @@ static int intel_pstate_verify_policy(st
>  
>  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>  {
> -	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> +	if (hwp_active)
> +		intel_pstate_hwp_force_min_perf(policy->cpu);
> +	else
> +		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>  }
>  
>  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
> @@ -2254,12 +2308,10 @@ static void intel_pstate_stop_cpu(struct
>  	pr_debug("CPU %d exiting\n", policy->cpu);
>  
>  	intel_pstate_clear_update_util_hook(policy->cpu);
> -	if (hwp_active) {
> +	if (hwp_active)
>  		intel_pstate_hwp_save_state(policy);
> -		intel_pstate_hwp_force_min_perf(policy->cpu);
> -	} else {
> -		intel_cpufreq_stop_cpu(policy);
> -	}
> +
> +	intel_cpufreq_stop_cpu(policy);
>  }
>  
>  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
> @@ -2389,13 +2441,71 @@ static void intel_cpufreq_trace(struct c
>  		fp_toint(cpu->iowait_boost * 100));
>  }
>  
> +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32
> target_pstate,
> +				     bool fast_switch)
> +{
> +	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
> +
> +	value &= ~HWP_MIN_PERF(~0L);
> +	value |= HWP_MIN_PERF(target_pstate);
> +
> +	/*
> +	 * The entire MSR needs to be updated in order to update the
> HWP min
> +	 * field in it, so opportunistically update the max too if
> needed.
> +	 */
> +	value &= ~HWP_MAX_PERF(~0L);
> +	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
> +
> +	if (value == prev)
> +		return;
> +
> +	WRITE_ONCE(cpu->hwp_req_cached, value);
> +	if (fast_switch)
> +		wrmsrl(MSR_HWP_REQUEST, value);
> +	else
> +		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
> +}
> +
> +static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
> +					  u32 target_pstate, bool
> fast_switch)
> +{
> +	if (fast_switch)
> +		wrmsrl(MSR_IA32_PERF_CTL,
> +		       pstate_funcs.get_val(cpu, target_pstate));
> +	else
> +		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
> +			      pstate_funcs.get_val(cpu,
> target_pstate));
> +}
> +
> +static int intel_cpufreq_update_pstate(struct cpudata *cpu, int
> target_pstate,
> +				       bool fast_switch)
> +{
> +	int old_pstate = cpu->pstate.current_pstate;
> +
> +	target_pstate = intel_pstate_prepare_request(cpu,
> target_pstate);
> +	if (target_pstate != old_pstate) {
> +		cpu->pstate.current_pstate = target_pstate;
> +		if (hwp_active)
> +			intel_cpufreq_adjust_hwp(cpu, target_pstate,
> +						 fast_switch);
> +		else
> +			intel_cpufreq_adjust_perf_ctl(cpu,
> target_pstate,
> +						      fast_switch);
> +	}
> +
> +	intel_cpufreq_trace(cpu, fast_switch ?
> INTEL_PSTATE_TRACE_FAST_SWITCH :
> +			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
> +
> +	return target_pstate;
> +}
> +
>  static int intel_cpufreq_target(struct cpufreq_policy *policy,
>  				unsigned int target_freq,
>  				unsigned int relation)
>  {
>  	struct cpudata *cpu = all_cpu_data[policy->cpu];
>  	struct cpufreq_freqs freqs;
> -	int target_pstate, old_pstate;
> +	int target_pstate;
>  
>  	update_turbo_state();
>  
> @@ -2403,6 +2513,7 @@ static int intel_cpufreq_target(struct c
>  	freqs.new = target_freq;
>  
>  	cpufreq_freq_transition_begin(policy, &freqs);
> +
>  	switch (relation) {
>  	case CPUFREQ_RELATION_L:
>  		target_pstate = DIV_ROUND_UP(freqs.new, cpu-
> >pstate.scaling);
> @@ -2414,15 +2525,11 @@ static int intel_cpufreq_target(struct c
>  		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu-
> >pstate.scaling);
>  		break;
>  	}
> -	target_pstate = intel_pstate_prepare_request(cpu,
> target_pstate);
> -	old_pstate = cpu->pstate.current_pstate;
> -	if (target_pstate != cpu->pstate.current_pstate) {
> -		cpu->pstate.current_pstate = target_pstate;
> -		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
> -			      pstate_funcs.get_val(cpu,
> target_pstate));
> -	}
> +
> +	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate,
> false);
> +
>  	freqs.new = target_pstate * cpu->pstate.scaling;
> -	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET,
> old_pstate);
> +
>  	cpufreq_freq_transition_end(policy, &freqs, false);
>  
>  	return 0;
> @@ -2432,15 +2539,14 @@ static unsigned int intel_cpufreq_fast_s
>  					      unsigned int target_freq)
>  {
>  	struct cpudata *cpu = all_cpu_data[policy->cpu];
> -	int target_pstate, old_pstate;
> +	int target_pstate;
>  
>  	update_turbo_state();
>  
>  	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
> -	target_pstate = intel_pstate_prepare_request(cpu,
> target_pstate);
> -	old_pstate = cpu->pstate.current_pstate;
> -	intel_pstate_update_pstate(cpu, target_pstate);
> -	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH,
> old_pstate);
> +
> +	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate,
> true);
> +
>  	return target_pstate * cpu->pstate.scaling;
>  }
>  
> @@ -2460,7 +2566,6 @@ static int intel_cpufreq_cpu_init(struct
>  		return ret;
>  
>  	policy->cpuinfo.transition_latency =
> INTEL_CPUFREQ_TRANSITION_LATENCY;
> -	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
>  	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
>  	policy->cur = policy->cpuinfo.min_freq;
>  
> @@ -2472,10 +2577,18 @@ static int intel_cpufreq_cpu_init(struct
>  
>  	cpu = all_cpu_data[policy->cpu];
>  
> -	if (hwp_active)
> +	if (hwp_active) {
> +		u64 value;
> +
>  		intel_pstate_get_hwp_max(policy->cpu, &turbo_max,
> &max_state);
> -	else
> +		policy->transition_delay_us =
> INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
> +		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
> +		WRITE_ONCE(cpu->hwp_req_cached, value);
> +		cpu->epp_cached = (value & GENMASK_ULL(31, 24)) >> 24;
> +	} else {
>  		turbo_max = cpu->pstate.turbo_pstate;
> +		policy->transition_delay_us =
> INTEL_CPUFREQ_TRANSITION_DELAY;
> +	}
>  
>  	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
>  	min_freq *= cpu->pstate.scaling;
> @@ -2552,6 +2665,10 @@ static void intel_pstate_driver_cleanup(
>  		}
>  	}
>  	put_online_cpus();
> +
> +	if (intel_pstate_driver == &intel_pstate)
> +		intel_pstate_sysfs_hide_hwp_dynamic_boost();
> +
>  	intel_pstate_driver = NULL;
>  }
>  
> @@ -2559,6 +2676,9 @@ static int intel_pstate_register_driver(
>  {
>  	int ret;
>  
> +	if (driver == &intel_pstate)
> +		intel_pstate_sysfs_expose_hwp_dynamic_boost();
> +
>  	memset(&global, 0, sizeof(global));
>  	global.max_perf_pct = 100;
>  
> @@ -2576,9 +2696,6 @@ static int intel_pstate_register_driver(
>  
>  static int intel_pstate_unregister_driver(void)
>  {
> -	if (hwp_active)
> -		return -EBUSY;
> -
>  	cpufreq_unregister_driver(intel_pstate_driver);
>  	intel_pstate_driver_cleanup();
>  
> @@ -2834,7 +2951,10 @@ static int __init intel_pstate_init(void
>  			hwp_active++;
>  			hwp_mode_bdw = id->driver_data;
>  			intel_pstate.attr = hwp_cpufreq_attrs;
> -			default_driver = &intel_pstate;
> +			intel_cpufreq.attr = hwp_cpufreq_attrs;
> +			if (!default_driver)
> +				default_driver = &intel_pstate;
> +
>  			goto hwp_cpu_matched;
>  		}
>  	} else {
> @@ -2905,14 +3025,13 @@ static int __init intel_pstate_setup(cha
>  	if (!str)
>  		return -EINVAL;
>  
> -	if (!strcmp(str, "disable")) {
> +	if (!strcmp(str, "disable"))
>  		no_load = 1;
> -	} else if (!strcmp(str, "active")) {
> +	else if (!strcmp(str, "active"))
>  		default_driver = &intel_pstate;
> -	} else if (!strcmp(str, "passive")) {
> +	else if (!strcmp(str, "passive"))
>  		default_driver = &intel_cpufreq;
> -		no_hwp = 1;
> -	}
> +
>  	if (!strcmp(str, "no_hwp")) {
>  		pr_info("HWP disabled\n");
>  		no_hwp = 1;
> Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
> ===================================================================
> --- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
> +++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
> @@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
>  Operation Modes
>  ===============
>  
> -``intel_pstate`` can operate in three different modes: in the active
> mode with
> -or without hardware-managed P-states support and in the passive
> mode.  Which of
> -them will be in effect depends on what kernel command line options
> are used and
> -on the capabilities of the processor.
> +``intel_pstate`` can operate in two different modes, active or
> passive.  In the
> +active mode, it uses its own internal preformance scaling governor
> algorithm or
> +allows the hardware to do preformance scaling by itself, while in
> the passive
> +mode it responds to requests made by a generic ``CPUFreq`` governor
> implementing
> +a certain performance scaling algorithm.  Which of them will be in
> effect
> +depends on what kernel command line options are used and on the
> capabilities of
> +the processor.
>  
>  Active Mode
>  -----------
> @@ -194,10 +197,11 @@ This is the default operation mode of ``
>  hardware-managed P-states (HWP) support.  It is always used if the
>  ``intel_pstate=passive`` argument is passed to the kernel in the
> command line
>  regardless of whether or not the given processor supports
> HWP.  [Note that the
> -``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if
> it is used
> -without ``intel_pstate=active``.]  Like in the active mode without
> HWP support,
> -in this mode ``intel_pstate`` may refuse to work with processors
> that are not
> -recognized by it.
> +``intel_pstate=no_hwp`` setting causes the driver to start in the
> passive mode
> +if it is not combined with ``intel_pstate=active``.]  Like in the
> active mode
> +without HWP support, in this mode ``intel_pstate`` may refuse to
> work with
> +processors that are not recognized by it if HWP is prevented from
> being enabled
> +through the kernel command line.
>  
>  If the driver works in this mode, the ``scaling_driver`` policy
> attribute in
>  ``sysfs`` for all ``CPUFreq`` policies contains the string
> "intel_cpufreq".
> @@ -318,10 +322,9 @@ manuals need to be consulted to get to i
>  
>  For this reason, there is a list of supported processors in
> ``intel_pstate`` and
>  the driver initialization will fail if the detected processor is not
> in that
> -list, unless it supports the `HWP feature <Active Mode_>`_.  [The
> interface to
> -obtain all of the information listed above is the same for all of
> the processors
> -supporting the HWP feature, which is why they all are supported by
> -``intel_pstate``.]
> +list, unless it supports the HWP feature.  [The interface to obtain
> all of the
> +information listed above is the same for all of the processors
> supporting the
> +HWP feature, which is why ``intel_pstate`` works with all of them.]
>  
>  
>  User Space Interface in ``sysfs``
> @@ -425,22 +428,16 @@ argument is passed to the kernel in the
>  	as well as the per-policy ones) are then reset to their default
>  	values, possibly depending on the target operation mode.]
>  
> -	That only is supported in some configurations, though (for
> example, if
> -	the `HWP feature is enabled in the processor <Active Mode With
> HWP_>`_,
> -	the operation mode of the driver cannot be changed), and if it
> is not
> -	supported in the current configuration, writes to this
> attribute will
> -	fail with an appropriate error.
> -
>  ``energy_efficiency``
> -	This attribute is only present on platforms, which have CPUs
> matching
> -	Kaby Lake or Coffee Lake desktop CPU model. By default
> -	energy efficiency optimizations are disabled on these CPU
> models in HWP
> -	mode by this driver. Enabling energy efficiency may limit
> maximum
> -	operating frequency in both HWP and non HWP mode. In non HWP
> mode,
> -	optimizations are done only in the turbo frequency range. In
> HWP mode,
> -	optimizations are done in the entire frequency range. Setting
> this
> -	attribute to "1" enables energy efficiency optimizations and
> setting
> -	to "0" disables energy efficiency optimizations.
> +	This attribute is only present on platforms with CPUs matching
> the Kaby
> +	Lake or Coffee Lake desktop CPU model. By default, energy-
> efficiency
> +	optimizations are disabled on these CPU models if HWP is
> enabled.
> +	Enabling energy-efficiency optimizations may limit maximum
> operating
> +	frequency with or without the HWP feature.  With HWP enabled,
> the
> +	optimizations are done only in the turbo frequency
> range.  Without it,
> +	they are done in the entire available frequency range.  Setting
> this
> +	attribute to "1" enables the energy-efficiency optimizations
> and setting
> +	to "0" disables them.
>  
>  Interpretation of Policy Attributes
>  -----------------------------------
> @@ -484,8 +481,8 @@ Next, the following policy attributes ha
>  	policy for the time interval between the last two invocations
> of the
>  	driver's utilization update callback by the CPU scheduler for
> that CPU.
>  
> -One more policy attribute is present if the `HWP feature is enabled
> in the
> -processor <Active Mode With HWP_>`_:
> +One more policy attribute is present if the HWP feature is enabled
> in the
> +processor:
>  
>  ``base_frequency``
>  	Shows the base frequency of the CPU. Any frequency above this
> will be
> @@ -526,11 +523,11 @@ on the following rules, regardless of th
>  
>   3. The global and per-policy limits can be set independently.
>  
> -If the `HWP feature is enabled in the processor <Active Mode With
> HWP_>`_, the
> -resulting effective values are written into its registers whenever
> the limits
> -change in order to request its internal P-state selection logic to
> always set
> -P-states within these limits.  Otherwise, the limits are taken into
> account by
> -scaling governors (in the `passive mode <Passive Mode_>`_) and by
> the driver
> +In the `active mode with the HWP feature enabled <Active Mode With
> HWP_>`_, the
> +resulting effective values are written into hardware registers
> whenever the
> +limits change in order to request its internal P-state selection
> logic to always
> +set P-states within these limits.  Otherwise, the limits are taken
> into account
> +by scaling governors (in the `passive mode <Passive Mode_>`_) and by
> the driver
>  every time before setting a new P-state for a CPU.
>  
>  Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command
> line argument
> @@ -541,12 +538,11 @@ at all and the only way to set the limit
>  Energy vs Performance Hints
>  ---------------------------
>  
> -If ``intel_pstate`` works in the `active mode with the HWP feature
> enabled
> -<Active Mode With HWP_>`_ in the processor, additional attributes
> are present
> -in every ``CPUFreq`` policy directory in ``sysfs``.  They are
> intended to allow
> -user space to help ``intel_pstate`` to adjust the processor's
> internal P-state
> -selection logic by focusing it on performance or on energy-
> efficiency, or
> -somewhere between the two extremes:
> +If the hardware-managed P-states (HWP) is enabled in the processor,
> additional
> +attributes, intended to allow user space to help ``intel_pstate`` to
> adjust the
> +processor's internal P-state selection logic by focusing it on
> performance or on
> +energy-efficiency, or somewhere between the two extremes, are
> present in every
> +``CPUFreq`` policy directory in ``sysfs``.  They are :
>  
>  ``energy_performance_preference``
>  	Current value of the energy vs performance hint for the given
> policy
> @@ -650,12 +646,14 @@ of them have to be prepended with the ``
>  	Do not register ``intel_pstate`` as the scaling driver even if
> the
>  	processor is supported by it.
>  
> +``active``
> +	Register ``intel_pstate`` in the `active mode <Active Mode_>`_
> to start
> +	with.
> +
>  ``passive``
>  	Register ``intel_pstate`` in the `passive mode <Passive
> Mode_>`_ to
>  	start with.
>  
> -	This option implies the ``no_hwp`` one described below.
> -
>  ``force``
>  	Register ``intel_pstate`` as the scaling driver instead of
>  	``acpi-cpufreq`` even if the latter is preferred on the given
> system.
> @@ -670,13 +668,12 @@ of them have to be prepended with the ``
>  	driver is used instead of ``acpi-cpufreq``.
>  
>  ``no_hwp``
> -	Do not enable the `hardware-managed P-states (HWP) feature
> -	<Active Mode With HWP_>`_ even if it is supported by the
> processor.
> +	Do not enable the hardware-managed P-states (HWP) feature even
> if it is
> +	supported by the processor.
>  
>  ``hwp_only``
>  	Register ``intel_pstate`` as the scaling driver only if the
> -	`hardware-managed P-states (HWP) feature <Active Mode With
> HWP_>`_ is
> -	supported by the processor.
> +	hardware-managed P-states (HWP) feature is supported by the
> processor.
>  
>  ``support_acpi_ppc``
>  	Take ACPI ``_PPC`` performance limits into account.
> Index: linux-pm/drivers/cpufreq/cpufreq.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq.c
> +++ linux-pm/drivers/cpufreq/cpufreq.c
> @@ -73,8 +73,6 @@ static inline bool has_target(void)
>  static unsigned int __cpufreq_get(struct cpufreq_policy *policy);
>  static int cpufreq_init_governor(struct cpufreq_policy *policy);
>  static void cpufreq_exit_governor(struct cpufreq_policy *policy);
> -static int cpufreq_start_governor(struct cpufreq_policy *policy);
> -static void cpufreq_stop_governor(struct cpufreq_policy *policy);
>  static void cpufreq_governor_limits(struct cpufreq_policy *policy);
>  static int cpufreq_set_policy(struct cpufreq_policy *policy,
>  			      struct cpufreq_governor *new_gov,
> @@ -2266,7 +2264,7 @@ static void cpufreq_exit_governor(struct
>  	module_put(policy->governor->owner);
>  }
>  
> -static int cpufreq_start_governor(struct cpufreq_policy *policy)
> +int cpufreq_start_governor(struct cpufreq_policy *policy)
>  {
>  	int ret;
>  
> @@ -2293,7 +2291,7 @@ static int cpufreq_start_governor(struct
>  	return 0;
>  }
>  
> -static void cpufreq_stop_governor(struct cpufreq_policy *policy)
> +void cpufreq_stop_governor(struct cpufreq_policy *policy)
>  {
>  	if (cpufreq_suspended || !policy->governor)
>  		return;
> Index: linux-pm/include/linux/cpufreq.h
> ===================================================================
> --- linux-pm.orig/include/linux/cpufreq.h
> +++ linux-pm/include/linux/cpufreq.h
> @@ -576,6 +576,8 @@ unsigned int cpufreq_driver_resolve_freq
>  unsigned int cpufreq_policy_transition_delay_us(struct
> cpufreq_policy *policy);
>  int cpufreq_register_governor(struct cpufreq_governor *governor);
>  void cpufreq_unregister_governor(struct cpufreq_governor *governor);
> +int cpufreq_start_governor(struct cpufreq_policy *policy);
> +void cpufreq_stop_governor(struct cpufreq_policy *policy);
>  
>  #define cpufreq_governor_init(__governor)			\
>  static int __init __governor##_init(void)			\
> 
> 
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-06 12:03     ` Rafael J. Wysocki
  2020-08-10  0:44       ` Srinivas Pandruvada
@ 2020-08-11  0:51       ` Francisco Jerez
  2020-08-11 15:33         ` Rafael J. Wysocki
  2020-08-17 21:06       ` Doug Smythies
  2 siblings, 1 reply; 33+ messages in thread
From: Francisco Jerez @ 2020-08-11  0:51 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM
  Cc: Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada,
	Giovanni Gherdovich, Doug Smythies, Viresh Kumar


[-- Attachment #1.1: Type: text/plain, Size: 30098 bytes --]

"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Allow intel_pstate to work in the passive mode with HWP enabled and
> make it set the HWP minimum performance limit (HWP floor) to the
> P-state value given by the target frequency supplied by the cpufreq
> governor, so as to prevent the HWP algorithm and the CPU scheduler
> from working against each other, at least when the schedutil governor
> is in use, and update the intel_pstate documentation accordingly.
>
> Among other things, this allows utilization clamps to be taken
> into account, at least to a certain extent, when intel_pstate is
> in use and makes it more likely that sufficient capacity for
> deadline tasks will be provided.
>
> After this change, the resulting behavior of an HWP system with
> intel_pstate in the passive mode should be close to the behavior
> of the analogous non-HWP system with intel_pstate in the passive
> mode, except that in the frequency range below the base frequency
> (ie. the frequency retured by the base_frequency cpufreq attribute
> in sysfs on HWP systems) the HWP algorithm is allowed to make the
> CPU run at a frequency above the floor P-state set by intel_pstate,
> with or without hardware coordination of P-states among CPUs in the
> same package.
>

The "frequency range below the base frequency" part of the paragraph
above seems somewhat misleading, since AFAICT the same thing will happen
in the P-state range above the base frequency.   Another minor comment
below, other than that LGTM:

Reviewed-by: Francisco Jerez <currojerez@riseup.net>

> [If P-states of the CPUs in the same package are coordinated at the
>  hardware level, a non-HWP processor may choose a P-state above the
>  target one like a processor with HWP enabled may choose a P-state
>  above the HWP floor, so the HWP behavior is analogous to the non-HWP
>  one in that case.
>
>  Also note that the HWP floor may not be taken into account by
>  the processor in the range of P-states above the base frequency,
>  referred to as the turbo range, where the processor has a license to
>  choose any P-state, either below or above the HWP floor, just like a
>  non-HWP processor in the case when the target P-state falls into the
>  turbo range.]
>
> With this change applied, intel_pstate in the passive mode
> assumes complete control over the HWP request MSR and concurrent
> changes of that MSR (eg. via the direct MSR access interface) are
> overridden by it.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> Sending the right patch this time, sorry for the confusion.
>
> This is based on the current mainline.
>
> v1 -> v2:
>    * Avoid a race condition when updating the HWP request register while
>      setting a new EPP value via sysfs.
>
> v2 -> v3:
>    * Rebase.
>
> v3 -> v4:
>    * Avoid exposing the hwp_dynamic_boost sysfs switch in the passive mode.
>
> v4 -> v5:
>    * Do not acquire intel_pstate_driver_lock in
>      store_energy_performance_preference(), because it runs under
>      policy->rwsem, so intel_pstate_driver cannot change while it is running.
>    * Rearrange the changelog a bit to avoid confusion.
>
> v5 -> v6:
>    * Fix the problem with the EPP setting via sysfs not working with the
>      performance and powersave governors by stopping and restarting the
>      governor around the sysfs-based EPP updates in the passive mode.
>    * Because of that, use the epp_cached field just for avoiding the above
>      if the new EPP value for the given CPU is the same as the old one.
>    * Export cpufreq_start/stop_governor() from the core (for the above).
>
> v6 -> v7:
>    * Cosmetic changes in store_energy_performance_prefernce() to reduce the
>      LoC number and make it a bit easier to read.  No intentional functional
>      impact.
>
> ---
>  Documentation/admin-guide/pm/intel_pstate.rst |   89 ++++-----
>  drivers/cpufreq/cpufreq.c                     |    6 
>  drivers/cpufreq/intel_pstate.c                |  245 +++++++++++++++++++-------
>  include/linux/cpufreq.h                       |    2 
>  4 files changed, 229 insertions(+), 113 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -36,6 +36,7 @@
>  #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
>  
>  #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
> +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
>  #define INTEL_CPUFREQ_TRANSITION_DELAY		500
>  
>  #ifdef CONFIG_ACPI
> @@ -220,6 +221,7 @@ struct global_params {
>   *			preference/bias
>   * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
>   *			operation
> + * @epp_cached		Cached HWP energy-performance preference value
>   * @hwp_req_cached:	Cached value of the last HWP Request MSR
>   * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
>   * @last_io_update:	Last time when IO wake flag was set
> @@ -257,6 +259,7 @@ struct cpudata {
>  	s16 epp_policy;
>  	s16 epp_default;
>  	s16 epp_saved;
> +	s16 epp_cached;
>  	u64 hwp_req_cached;
>  	u64 hwp_cap_cached;
>  	u64 last_io_update;
> @@ -639,6 +642,26 @@ static int intel_pstate_get_energy_pref_
>  	return index;
>  }
>  
> +static int intel_pstate_set_epp(struct cpudata *cpu, u32 epp)
> +{
> +	/*
> +	 * Use the cached HWP Request MSR value, because in the active mode the
> +	 * register itself may be updated by intel_pstate_hwp_boost_up() or
> +	 * intel_pstate_hwp_boost_down() at any time.
> +	 */
> +	u64 value = READ_ONCE(cpu->hwp_req_cached);
> +
> +	value &= ~GENMASK_ULL(31, 24);
> +	value |= (u64)epp << 24;
> +	/*
> +	 * The only other updater of hwp_req_cached in the active mode,
> +	 * intel_pstate_hwp_set(), is called under the same lock as this
> +	 * function, so it cannot run in parallel with the update below.
> +	 */
> +	WRITE_ONCE(cpu->hwp_req_cached, value);
> +	return wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
> +}
> +
>  static int intel_pstate_set_energy_pref_index(struct cpudata *cpu_data,
>  					      int pref_index, bool use_raw,
>  					      u32 raw_epp)
> @@ -650,28 +673,12 @@ static int intel_pstate_set_energy_pref_
>  		epp = cpu_data->epp_default;
>  
>  	if (boot_cpu_has(X86_FEATURE_HWP_EPP)) {
> -		/*
> -		 * Use the cached HWP Request MSR value, because the register
> -		 * itself may be updated by intel_pstate_hwp_boost_up() or
> -		 * intel_pstate_hwp_boost_down() at any time.
> -		 */
> -		u64 value = READ_ONCE(cpu_data->hwp_req_cached);
> -
> -		value &= ~GENMASK_ULL(31, 24);
> -
>  		if (use_raw)
>  			epp = raw_epp;
>  		else if (epp == -EINVAL)
>  			epp = epp_values[pref_index - 1];
>  
> -		value |= (u64)epp << 24;
> -		/*
> -		 * The only other updater of hwp_req_cached in the active mode,
> -		 * intel_pstate_hwp_set(), is called under the same lock as this
> -		 * function, so it cannot run in parallel with the update below.
> -		 */
> -		WRITE_ONCE(cpu_data->hwp_req_cached, value);
> -		ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
> +		ret = intel_pstate_set_epp(cpu_data, epp);
>  	} else {
>  		if (epp == -EINVAL)
>  			epp = (pref_index - 1) << 2;
> @@ -697,10 +704,12 @@ static ssize_t show_energy_performance_a
>  
>  cpufreq_freq_attr_ro(energy_performance_available_preferences);
>  
> +static struct cpufreq_driver intel_pstate;
> +
>  static ssize_t store_energy_performance_preference(
>  		struct cpufreq_policy *policy, const char *buf, size_t count)
>  {
> -	struct cpudata *cpu_data = all_cpu_data[policy->cpu];
> +	struct cpudata *cpu = all_cpu_data[policy->cpu];
>  	char str_preference[21];
>  	bool raw = false;
>  	ssize_t ret;
> @@ -725,15 +734,44 @@ static ssize_t store_energy_performance_
>  		raw = true;
>  	}
>  
> +	/*
> +	 * This function runs with the policy R/W semaphore held, which
> +	 * guarantees that the driver pointer will not change while it is
> +	 * running.
> +	 */
> +	if (!intel_pstate_driver)
> +		return -EAGAIN;
> +
>  	mutex_lock(&intel_pstate_limits_lock);
>  
> -	ret = intel_pstate_set_energy_pref_index(cpu_data, ret, raw, epp);
> -	if (!ret)
> -		ret = count;
> +	if (intel_pstate_driver == &intel_pstate) {
> +		ret = intel_pstate_set_energy_pref_index(cpu, ret, raw, epp);
> +	} else {
> +		/*
> +		 * In the passive mode the governor needs to be stopped on the
> +		 * target CPU before the EPP update and restarted after it,
> +		 * which is super-heavy-weight, so make sure it is worth doing
> +		 * upfront.
> +		 */
> +		if (!raw)
> +			epp = ret ? epp_values[ret - 1] : cpu->epp_default;
> +
> +		if (cpu->epp_cached != epp) {
> +			int err;
> +
> +			cpufreq_stop_governor(policy);
> +			ret = intel_pstate_set_epp(cpu, epp);
> +			err = cpufreq_start_governor(policy);
> +			if (!ret) {
> +				cpu->epp_cached = epp;
> +				ret = err;
> +			}
> +		}
> +	}
>  
>  	mutex_unlock(&intel_pstate_limits_lock);
>  
> -	return ret;
> +	return ret ?: count;
>  }
>  
>  static ssize_t show_energy_performance_preference(
> @@ -1145,8 +1183,6 @@ static ssize_t store_no_turbo(struct kob
>  	return count;
>  }
>  
> -static struct cpufreq_driver intel_pstate;
> -
>  static void update_qos_request(enum freq_qos_req_type type)
>  {
>  	int max_state, turbo_max, freq, i, perf_pct;
> @@ -1330,9 +1366,10 @@ static const struct attribute_group inte
>  
>  static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[];
>  
> +static struct kobject *intel_pstate_kobject;
> +
>  static void __init intel_pstate_sysfs_expose_params(void)
>  {
> -	struct kobject *intel_pstate_kobject;
>  	int rc;
>  
>  	intel_pstate_kobject = kobject_create_and_add("intel_pstate",
> @@ -1357,17 +1394,31 @@ static void __init intel_pstate_sysfs_ex
>  	rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
>  	WARN_ON(rc);
>  
> -	if (hwp_active) {
> -		rc = sysfs_create_file(intel_pstate_kobject,
> -				       &hwp_dynamic_boost.attr);
> -		WARN_ON(rc);
> -	}
> -
>  	if (x86_match_cpu(intel_pstate_cpu_ee_disable_ids)) {
>  		rc = sysfs_create_file(intel_pstate_kobject, &energy_efficiency.attr);
>  		WARN_ON(rc);
>  	}
>  }
> +
> +static void intel_pstate_sysfs_expose_hwp_dynamic_boost(void)
> +{
> +	int rc;
> +
> +	if (!hwp_active)
> +		return;
> +
> +	rc = sysfs_create_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
> +	WARN_ON_ONCE(rc);
> +}
> +
> +static void intel_pstate_sysfs_hide_hwp_dynamic_boost(void)
> +{
> +	if (!hwp_active)
> +		return;
> +
> +	sysfs_remove_file(intel_pstate_kobject, &hwp_dynamic_boost.attr);
> +}
> +
>  /************************** sysfs end ************************/
>  
>  static void intel_pstate_hwp_enable(struct cpudata *cpudata)
> @@ -2246,7 +2297,10 @@ static int intel_pstate_verify_policy(st
>  
>  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>  {
> -	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> +	if (hwp_active)
> +		intel_pstate_hwp_force_min_perf(policy->cpu);
> +	else
> +		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>  }
>  
>  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
> @@ -2254,12 +2308,10 @@ static void intel_pstate_stop_cpu(struct
>  	pr_debug("CPU %d exiting\n", policy->cpu);
>  
>  	intel_pstate_clear_update_util_hook(policy->cpu);
> -	if (hwp_active) {
> +	if (hwp_active)
>  		intel_pstate_hwp_save_state(policy);
> -		intel_pstate_hwp_force_min_perf(policy->cpu);
> -	} else {
> -		intel_cpufreq_stop_cpu(policy);
> -	}
> +
> +	intel_cpufreq_stop_cpu(policy);
>  }
>  
>  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
> @@ -2389,13 +2441,71 @@ static void intel_cpufreq_trace(struct c
>  		fp_toint(cpu->iowait_boost * 100));
>  }
>  
> +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
> +				     bool fast_switch)
> +{
> +	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
> +
> +	value &= ~HWP_MIN_PERF(~0L);
> +	value |= HWP_MIN_PERF(target_pstate);
> +
> +	/*
> +	 * The entire MSR needs to be updated in order to update the HWP min
> +	 * field in it, so opportunistically update the max too if needed.
> +	 */
> +	value &= ~HWP_MAX_PERF(~0L);
> +	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
> +
> +	if (value == prev)
> +		return;
> +
> +	WRITE_ONCE(cpu->hwp_req_cached, value);
> +	if (fast_switch)
> +		wrmsrl(MSR_HWP_REQUEST, value);
> +	else
> +		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
> +}
> +
> +static void intel_cpufreq_adjust_perf_ctl(struct cpudata *cpu,
> +					  u32 target_pstate, bool fast_switch)
> +{
> +	if (fast_switch)
> +		wrmsrl(MSR_IA32_PERF_CTL,
> +		       pstate_funcs.get_val(cpu, target_pstate));
> +	else
> +		wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL,
> +			      pstate_funcs.get_val(cpu, target_pstate));
> +}
> +
> +static int intel_cpufreq_update_pstate(struct cpudata *cpu, int target_pstate,
> +				       bool fast_switch)
> +{
> +	int old_pstate = cpu->pstate.current_pstate;
> +
> +	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> +	if (target_pstate != old_pstate) {
> +		cpu->pstate.current_pstate = target_pstate;
> +		if (hwp_active)
> +			intel_cpufreq_adjust_hwp(cpu, target_pstate,
> +						 fast_switch);
> +		else
> +			intel_cpufreq_adjust_perf_ctl(cpu, target_pstate,
> +						      fast_switch);
> +	}
> +
> +	intel_cpufreq_trace(cpu, fast_switch ? INTEL_PSTATE_TRACE_FAST_SWITCH :
> +			    INTEL_PSTATE_TRACE_TARGET, old_pstate);
> +
> +	return target_pstate;
> +}
> +
>  static int intel_cpufreq_target(struct cpufreq_policy *policy,
>  				unsigned int target_freq,
>  				unsigned int relation)
>  {
>  	struct cpudata *cpu = all_cpu_data[policy->cpu];
>  	struct cpufreq_freqs freqs;
> -	int target_pstate, old_pstate;
> +	int target_pstate;
>  
>  	update_turbo_state();
>  
> @@ -2403,6 +2513,7 @@ static int intel_cpufreq_target(struct c
>  	freqs.new = target_freq;
>  
>  	cpufreq_freq_transition_begin(policy, &freqs);
> +
>  	switch (relation) {
>  	case CPUFREQ_RELATION_L:
>  		target_pstate = DIV_ROUND_UP(freqs.new, cpu->pstate.scaling);
> @@ -2414,15 +2525,11 @@ static int intel_cpufreq_target(struct c
>  		target_pstate = DIV_ROUND_CLOSEST(freqs.new, cpu->pstate.scaling);
>  		break;
>  	}
> -	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> -	old_pstate = cpu->pstate.current_pstate;
> -	if (target_pstate != cpu->pstate.current_pstate) {
> -		cpu->pstate.current_pstate = target_pstate;
> -		wrmsrl_on_cpu(policy->cpu, MSR_IA32_PERF_CTL,
> -			      pstate_funcs.get_val(cpu, target_pstate));
> -	}
> +
> +	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, false);
> +
>  	freqs.new = target_pstate * cpu->pstate.scaling;
> -	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, old_pstate);
> +
>  	cpufreq_freq_transition_end(policy, &freqs, false);
>  
>  	return 0;
> @@ -2432,15 +2539,14 @@ static unsigned int intel_cpufreq_fast_s
>  					      unsigned int target_freq)
>  {
>  	struct cpudata *cpu = all_cpu_data[policy->cpu];
> -	int target_pstate, old_pstate;
> +	int target_pstate;
>  
>  	update_turbo_state();
>  
>  	target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling);
> -	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> -	old_pstate = cpu->pstate.current_pstate;
> -	intel_pstate_update_pstate(cpu, target_pstate);
> -	intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_FAST_SWITCH, old_pstate);
> +
> +	target_pstate = intel_cpufreq_update_pstate(cpu, target_pstate, true);
> +
>  	return target_pstate * cpu->pstate.scaling;
>  }
>  
> @@ -2460,7 +2566,6 @@ static int intel_cpufreq_cpu_init(struct
>  		return ret;
>  
>  	policy->cpuinfo.transition_latency = INTEL_CPUFREQ_TRANSITION_LATENCY;
> -	policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
>  	/* This reflects the intel_pstate_get_cpu_pstates() setting. */
>  	policy->cur = policy->cpuinfo.min_freq;
>  
> @@ -2472,10 +2577,18 @@ static int intel_cpufreq_cpu_init(struct
>  
>  	cpu = all_cpu_data[policy->cpu];
>  
> -	if (hwp_active)
> +	if (hwp_active) {
> +		u64 value;
> +
>  		intel_pstate_get_hwp_max(policy->cpu, &turbo_max, &max_state);
> -	else
> +		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY_HWP;
> +		rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &value);
> +		WRITE_ONCE(cpu->hwp_req_cached, value);
> +		cpu->epp_cached = (value & GENMASK_ULL(31, 24)) >> 24;
> +	} else {
>  		turbo_max = cpu->pstate.turbo_pstate;
> +		policy->transition_delay_us = INTEL_CPUFREQ_TRANSITION_DELAY;
> +	}
>  
>  	min_freq = DIV_ROUND_UP(turbo_max * global.min_perf_pct, 100);
>  	min_freq *= cpu->pstate.scaling;
> @@ -2552,6 +2665,10 @@ static void intel_pstate_driver_cleanup(
>  		}
>  	}
>  	put_online_cpus();
> +
> +	if (intel_pstate_driver == &intel_pstate)
> +		intel_pstate_sysfs_hide_hwp_dynamic_boost();
> +
>  	intel_pstate_driver = NULL;
>  }
>  
> @@ -2559,6 +2676,9 @@ static int intel_pstate_register_driver(
>  {
>  	int ret;
>  
> +	if (driver == &intel_pstate)
> +		intel_pstate_sysfs_expose_hwp_dynamic_boost();
> +
>  	memset(&global, 0, sizeof(global));
>  	global.max_perf_pct = 100;
>  
> @@ -2576,9 +2696,6 @@ static int intel_pstate_register_driver(
>  
>  static int intel_pstate_unregister_driver(void)
>  {
> -	if (hwp_active)
> -		return -EBUSY;
> -
>  	cpufreq_unregister_driver(intel_pstate_driver);
>  	intel_pstate_driver_cleanup();
>  
> @@ -2834,7 +2951,10 @@ static int __init intel_pstate_init(void
>  			hwp_active++;
>  			hwp_mode_bdw = id->driver_data;
>  			intel_pstate.attr = hwp_cpufreq_attrs;
> -			default_driver = &intel_pstate;
> +			intel_cpufreq.attr = hwp_cpufreq_attrs;
> +			if (!default_driver)
> +				default_driver = &intel_pstate;
> +
>  			goto hwp_cpu_matched;
>  		}
>  	} else {
> @@ -2905,14 +3025,13 @@ static int __init intel_pstate_setup(cha
>  	if (!str)
>  		return -EINVAL;
>  
> -	if (!strcmp(str, "disable")) {
> +	if (!strcmp(str, "disable"))
>  		no_load = 1;
> -	} else if (!strcmp(str, "active")) {
> +	else if (!strcmp(str, "active"))
>  		default_driver = &intel_pstate;
> -	} else if (!strcmp(str, "passive")) {
> +	else if (!strcmp(str, "passive"))
>  		default_driver = &intel_cpufreq;
> -		no_hwp = 1;
> -	}
> +
>  	if (!strcmp(str, "no_hwp")) {
>  		pr_info("HWP disabled\n");
>  		no_hwp = 1;
> Index: linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
> ===================================================================
> --- linux-pm.orig/Documentation/admin-guide/pm/intel_pstate.rst
> +++ linux-pm/Documentation/admin-guide/pm/intel_pstate.rst
> @@ -54,10 +54,13 @@ registered (see `below <status_attr_>`_)
>  Operation Modes
>  ===============
>  
> -``intel_pstate`` can operate in three different modes: in the active mode with
> -or without hardware-managed P-states support and in the passive mode.  Which of
> -them will be in effect depends on what kernel command line options are used and
> -on the capabilities of the processor.
> +``intel_pstate`` can operate in two different modes, active or passive.  In the
> +active mode, it uses its own internal preformance scaling governor algorithm or
> +allows the hardware to do preformance scaling by itself, while in the passive

Typo: performance

> +mode it responds to requests made by a generic ``CPUFreq`` governor implementing
> +a certain performance scaling algorithm.  Which of them will be in effect
> +depends on what kernel command line options are used and on the capabilities of
> +the processor.
>  
>  Active Mode
>  -----------
> @@ -194,10 +197,11 @@ This is the default operation mode of ``
>  hardware-managed P-states (HWP) support.  It is always used if the
>  ``intel_pstate=passive`` argument is passed to the kernel in the command line
>  regardless of whether or not the given processor supports HWP.  [Note that the
> -``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used
> -without ``intel_pstate=active``.]  Like in the active mode without HWP support,
> -in this mode ``intel_pstate`` may refuse to work with processors that are not
> -recognized by it.
> +``intel_pstate=no_hwp`` setting causes the driver to start in the passive mode
> +if it is not combined with ``intel_pstate=active``.]  Like in the active mode
> +without HWP support, in this mode ``intel_pstate`` may refuse to work with
> +processors that are not recognized by it if HWP is prevented from being enabled
> +through the kernel command line.
>  
>  If the driver works in this mode, the ``scaling_driver`` policy attribute in
>  ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq".
> @@ -318,10 +322,9 @@ manuals need to be consulted to get to i
>  
>  For this reason, there is a list of supported processors in ``intel_pstate`` and
>  the driver initialization will fail if the detected processor is not in that
> -list, unless it supports the `HWP feature <Active Mode_>`_.  [The interface to
> -obtain all of the information listed above is the same for all of the processors
> -supporting the HWP feature, which is why they all are supported by
> -``intel_pstate``.]
> +list, unless it supports the HWP feature.  [The interface to obtain all of the
> +information listed above is the same for all of the processors supporting the
> +HWP feature, which is why ``intel_pstate`` works with all of them.]
>  
>  
>  User Space Interface in ``sysfs``
> @@ -425,22 +428,16 @@ argument is passed to the kernel in the
>  	as well as the per-policy ones) are then reset to their default
>  	values, possibly depending on the target operation mode.]
>  
> -	That only is supported in some configurations, though (for example, if
> -	the `HWP feature is enabled in the processor <Active Mode With HWP_>`_,
> -	the operation mode of the driver cannot be changed), and if it is not
> -	supported in the current configuration, writes to this attribute will
> -	fail with an appropriate error.
> -
>  ``energy_efficiency``
> -	This attribute is only present on platforms, which have CPUs matching
> -	Kaby Lake or Coffee Lake desktop CPU model. By default
> -	energy efficiency optimizations are disabled on these CPU models in HWP
> -	mode by this driver. Enabling energy efficiency may limit maximum
> -	operating frequency in both HWP and non HWP mode. In non HWP mode,
> -	optimizations are done only in the turbo frequency range. In HWP mode,
> -	optimizations are done in the entire frequency range. Setting this
> -	attribute to "1" enables energy efficiency optimizations and setting
> -	to "0" disables energy efficiency optimizations.
> +	This attribute is only present on platforms with CPUs matching the Kaby
> +	Lake or Coffee Lake desktop CPU model. By default, energy-efficiency
> +	optimizations are disabled on these CPU models if HWP is enabled.
> +	Enabling energy-efficiency optimizations may limit maximum operating
> +	frequency with or without the HWP feature.  With HWP enabled, the
> +	optimizations are done only in the turbo frequency range.  Without it,
> +	they are done in the entire available frequency range.  Setting this
> +	attribute to "1" enables the energy-efficiency optimizations and setting
> +	to "0" disables them.
>  
>  Interpretation of Policy Attributes
>  -----------------------------------
> @@ -484,8 +481,8 @@ Next, the following policy attributes ha
>  	policy for the time interval between the last two invocations of the
>  	driver's utilization update callback by the CPU scheduler for that CPU.
>  
> -One more policy attribute is present if the `HWP feature is enabled in the
> -processor <Active Mode With HWP_>`_:
> +One more policy attribute is present if the HWP feature is enabled in the
> +processor:
>  
>  ``base_frequency``
>  	Shows the base frequency of the CPU. Any frequency above this will be
> @@ -526,11 +523,11 @@ on the following rules, regardless of th
>  
>   3. The global and per-policy limits can be set independently.
>  
> -If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the
> -resulting effective values are written into its registers whenever the limits
> -change in order to request its internal P-state selection logic to always set
> -P-states within these limits.  Otherwise, the limits are taken into account by
> -scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
> +In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the
> +resulting effective values are written into hardware registers whenever the
> +limits change in order to request its internal P-state selection logic to always
> +set P-states within these limits.  Otherwise, the limits are taken into account
> +by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver
>  every time before setting a new P-state for a CPU.
>  
>  Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument
> @@ -541,12 +538,11 @@ at all and the only way to set the limit
>  Energy vs Performance Hints
>  ---------------------------
>  
> -If ``intel_pstate`` works in the `active mode with the HWP feature enabled
> -<Active Mode With HWP_>`_ in the processor, additional attributes are present
> -in every ``CPUFreq`` policy directory in ``sysfs``.  They are intended to allow
> -user space to help ``intel_pstate`` to adjust the processor's internal P-state
> -selection logic by focusing it on performance or on energy-efficiency, or
> -somewhere between the two extremes:
> +If the hardware-managed P-states (HWP) is enabled in the processor, additional
> +attributes, intended to allow user space to help ``intel_pstate`` to adjust the
> +processor's internal P-state selection logic by focusing it on performance or on
> +energy-efficiency, or somewhere between the two extremes, are present in every
> +``CPUFreq`` policy directory in ``sysfs``.  They are :
>  
>  ``energy_performance_preference``
>  	Current value of the energy vs performance hint for the given policy
> @@ -650,12 +646,14 @@ of them have to be prepended with the ``
>  	Do not register ``intel_pstate`` as the scaling driver even if the
>  	processor is supported by it.
>  
> +``active``
> +	Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start
> +	with.
> +
>  ``passive``
>  	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
>  	start with.
>  
> -	This option implies the ``no_hwp`` one described below.
> -
>  ``force``
>  	Register ``intel_pstate`` as the scaling driver instead of
>  	``acpi-cpufreq`` even if the latter is preferred on the given system.
> @@ -670,13 +668,12 @@ of them have to be prepended with the ``
>  	driver is used instead of ``acpi-cpufreq``.
>  
>  ``no_hwp``
> -	Do not enable the `hardware-managed P-states (HWP) feature
> -	<Active Mode With HWP_>`_ even if it is supported by the processor.
> +	Do not enable the hardware-managed P-states (HWP) feature even if it is
> +	supported by the processor.
>  
>  ``hwp_only``
>  	Register ``intel_pstate`` as the scaling driver only if the
> -	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
> -	supported by the processor.
> +	hardware-managed P-states (HWP) feature is supported by the processor.
>  
>  ``support_acpi_ppc``
>  	Take ACPI ``_PPC`` performance limits into account.
> Index: linux-pm/drivers/cpufreq/cpufreq.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq.c
> +++ linux-pm/drivers/cpufreq/cpufreq.c
> @@ -73,8 +73,6 @@ static inline bool has_target(void)
>  static unsigned int __cpufreq_get(struct cpufreq_policy *policy);
>  static int cpufreq_init_governor(struct cpufreq_policy *policy);
>  static void cpufreq_exit_governor(struct cpufreq_policy *policy);
> -static int cpufreq_start_governor(struct cpufreq_policy *policy);
> -static void cpufreq_stop_governor(struct cpufreq_policy *policy);
>  static void cpufreq_governor_limits(struct cpufreq_policy *policy);
>  static int cpufreq_set_policy(struct cpufreq_policy *policy,
>  			      struct cpufreq_governor *new_gov,
> @@ -2266,7 +2264,7 @@ static void cpufreq_exit_governor(struct
>  	module_put(policy->governor->owner);
>  }
>  
> -static int cpufreq_start_governor(struct cpufreq_policy *policy)
> +int cpufreq_start_governor(struct cpufreq_policy *policy)
>  {
>  	int ret;
>  
> @@ -2293,7 +2291,7 @@ static int cpufreq_start_governor(struct
>  	return 0;
>  }
>  
> -static void cpufreq_stop_governor(struct cpufreq_policy *policy)
> +void cpufreq_stop_governor(struct cpufreq_policy *policy)
>  {
>  	if (cpufreq_suspended || !policy->governor)
>  		return;
> Index: linux-pm/include/linux/cpufreq.h
> ===================================================================
> --- linux-pm.orig/include/linux/cpufreq.h
> +++ linux-pm/include/linux/cpufreq.h
> @@ -576,6 +576,8 @@ unsigned int cpufreq_driver_resolve_freq
>  unsigned int cpufreq_policy_transition_delay_us(struct cpufreq_policy *policy);
>  int cpufreq_register_governor(struct cpufreq_governor *governor);
>  void cpufreq_unregister_governor(struct cpufreq_governor *governor);
> +int cpufreq_start_governor(struct cpufreq_policy *policy);
> +void cpufreq_stop_governor(struct cpufreq_policy *policy);
>  
>  #define cpufreq_governor_init(__governor)			\
>  static int __init __governor##_init(void)			\

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-11  0:51       ` Francisco Jerez
@ 2020-08-11 15:33         ` Rafael J. Wysocki
  0 siblings, 0 replies; 33+ messages in thread
From: Rafael J. Wysocki @ 2020-08-11 15:33 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: Linux PM, Linux Documentation, LKML, Peter Zijlstra,
	Srinivas Pandruvada, Giovanni Gherdovich, Doug Smythies,
	Viresh Kumar

On Tuesday, August 11, 2020 2:51:41 AM CEST Francisco Jerez wrote:
> 
> --==-=-=
> Content-Type: multipart/mixed; boundary="=-=-="
> 
> --=-=-=
> Content-Type: text/plain; charset=utf-8
> Content-Disposition: inline
> Content-Transfer-Encoding: quoted-printable
> 
> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
> 
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Allow intel_pstate to work in the passive mode with HWP enabled and
> > make it set the HWP minimum performance limit (HWP floor) to the
> > P-state value given by the target frequency supplied by the cpufreq
> > governor, so as to prevent the HWP algorithm and the CPU scheduler
> > from working against each other, at least when the schedutil governor
> > is in use, and update the intel_pstate documentation accordingly.
> >
> > Among other things, this allows utilization clamps to be taken
> > into account, at least to a certain extent, when intel_pstate is
> > in use and makes it more likely that sufficient capacity for
> > deadline tasks will be provided.
> >
> > After this change, the resulting behavior of an HWP system with
> > intel_pstate in the passive mode should be close to the behavior
> > of the analogous non-HWP system with intel_pstate in the passive
> > mode, except that in the frequency range below the base frequency
> > (ie. the frequency retured by the base_frequency cpufreq attribute
> > in sysfs on HWP systems) the HWP algorithm is allowed to make the
> > CPU run at a frequency above the floor P-state set by intel_pstate,
> > with or without hardware coordination of P-states among CPUs in the
> > same package.
> >
> 
> The "frequency range below the base frequency" part of the paragraph
> above seems somewhat misleading, since AFAICT the same thing will happen
> in the P-state range above the base frequency. 

Fair enough.  I rephrased the changelog when applying the patch.

> Another minor comment below, other than that LGTM:

And this one has been fixed too.

> Reviewed-by: Francisco Jerez <currojerez@riseup.net>

Thanks!




^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-06 12:03     ` Rafael J. Wysocki
  2020-08-10  0:44       ` Srinivas Pandruvada
  2020-08-11  0:51       ` Francisco Jerez
@ 2020-08-17 21:06       ` Doug Smythies
  2020-09-07  0:16         ` Doug Smythies
  2 siblings, 1 reply; 33+ messages in thread
From: Doug Smythies @ 2020-08-17 21:06 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Linux Documentation', 'LKML',
	'Peter Zijlstra', 'Srinivas Pandruvada',
	'Giovanni Gherdovich', 'Francisco Jerez',
	'Viresh Kumar', 'Linux PM'

On 2020.08.06 05:04 Rafael J. Wysocki wrote:

> Allow intel_pstate to work in the passive mode with HWP enabled and
> make it set the HWP minimum performance limit (HWP floor) to the
> P-state value given by the target frequency supplied by the cpufreq
> governor, so as to prevent the HWP algorithm and the CPU scheduler
> from working against each other, at least when the schedutil governor
> is in use, and update the intel_pstate documentation accordingly.

...

Hi Rafael,

You may or may not recall, I mentioned my further feedback would be
delayed, as I wanted to work on reducing the labour content of my
most basic CPU frequency scaler test.

I have tested kernel 5.9-rc1 for pretty much every intel_pstate
variant and governor, and also the acpi-cpufreq driver.

Other than changing governors, changes were only made via
grub command line options and re-boot. EPP or EPB were never
modified, they were always whatever default.

performance governor: (left mostly blank, on purpose.)
acpi-cpufreq:
intel_cpufreq hwp: good
intel_cpufreq no hwp:
intel_pstate hwp:
intel_pstate no hwp:

ondemand governor:
acpi-cpufreq: good
intel_cpufreq hwp: bad
intel_cpufreq no hwp: good

conservative governor:
acpi-cpufreq: good
intel_cpufreq hwp: good
intel_cpufreq no hwp: good

schedutil governor:
acpi-cpufreq: good
intel_cpufreq hwp: bad
intel_cpufreq no hwp: good

powersave governor:
acpi-cpufreq: good
intel_cpufreq hwp: bad
intel_cpufreq no hwp: good

active-powersave governor:
intel_pstate hwp: ? not smooth, suffers from the broken HWP issue.
intel_pstate no hwp: good.
Intel_pstate hwp, idle state 2 disabled: Better but still worse for power.

Now, we don't actually care about CPU frequency, we care about power:

ondemand governor:

periodic workflow at 347 hertz.
~58% load at 4.60 GHz (where hwp operates)
~76% load at 3.5 GHz (where no hwp operates)

intel_cpufreq hwp: 14.3 processor package watts. 51.5 watts on the mains to the computer.
intel_cpufreq no hwp: 9.1 processor package watts. 45.5 watts on the mains to the computer. 

schedutil governor:

periodic workflow at 347 hertz.
~36% load at 4.60 GHz (where hwp operates)
~55% load at 3.2 GHz (where no hwp operates)

intel_cpufreq hwp: 9.6 processor package watts. 45.8 watts on the mains to the computer.
intel_cpufreq no hwp: ~6 processor package watts. ~41 watts on the mains to the computer. (noisy)

powersave governor:

periodic workflow at 347 hertz.
~39.8% load at 2.00 GHz (where hwp operates)
~92.5% load at 0.8 GHz (where no hwp operates)

intel_cpufreq hwp: 2.6 processor package watts. 38 watts on the mains to the computer.
intel_cpufreq no hwp: 1.9 processor package watts. 36 watts on the mains to the computer.

active-powersave governor:

periodic workflow at 347 hertz.
~58% load at 4.60 GHz (where hwp operates)
~72% load at 3.88 GHz (where no hwp operates) 

intel_pstate hwp: 14.2 processor package watts. 52 watts on the mains to the computer.
intel_pstate no hwp: 10.1 processor package watts. 48 watts on the mains to the computer.

Link to web page with much of this same content which, in turn, links to various graphs.
Coded, to avoid the barrage of bots:

double u double u double u dot smythies dot com /~doug/linux/s18/hwp/v7/

... Doug

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
  2020-08-17 21:06       ` Doug Smythies
@ 2020-09-07  0:16         ` Doug Smythies
  0 siblings, 0 replies; 33+ messages in thread
From: Doug Smythies @ 2020-09-07  0:16 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Linux Documentation', 'LKML',
	'Peter Zijlstra', 'Srinivas Pandruvada',
	'Giovanni Gherdovich', 'Francisco Jerez',
	'Viresh Kumar', 'Linux PM'

Hi Rafael,

On 2020.08.17 14:06 Doug Smythies wrote:
> On 2020.08.06 05:04 Rafael J. Wysocki wrote:
> 
> > Allow intel_pstate to work in the passive mode with HWP enabled and
> > make it set the HWP minimum performance limit (HWP floor) to the
> > P-state value given by the target frequency supplied by the cpufreq
> > governor, so as to prevent the HWP algorithm and the CPU scheduler
> > from working against each other, at least when the schedutil governor
> > is in use, and update the intel_pstate documentation accordingly.
> 
...
> 
> powersave governor:
> acpi-cpufreq: good
> intel_cpufreq hwp: bad
> intel_cpufreq no hwp: good

It occurs to me that my expectations as to what 
is meant by "powersave" might not agree with yours. 

For the powersave governor, this is what we have now:

intel_cpufreq hwp == intel_pstate hwp
intel_cpufreq no hwp == acpi-cpufreq == always minimum freq
intel_pstate no hwp ~= acpi-cpufreq/ondemand

Is that your understanding/intention?

My expectation was/is:

intel_cpufreq hwp == intel_cpufreq no hwp == acpi-cpufreq == always minimum freq
intel_pstate no hwp ~= acpi-cpufreq/ondemand
intel_pstate hwp == Unique. Say, extremely course version of ondemand.

... Doug



^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2020-09-07  0:16 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-16 17:37 [PATCH v2 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
2020-07-16 17:38 ` [PATCH v2 1/2] cpufreq: intel_pstate: Rearrange the storing of newv EPP values Rafael J. Wysocki
2020-07-16 17:42 ` [PATCH v2 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
2020-07-27 15:13 ` [PATCH v3 0/2] " Rafael J. Wysocki
2020-07-27 15:15   ` [PATCH v3 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values Rafael J. Wysocki
2020-07-27 15:17   ` [PATCH v3 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
2020-07-28 15:09   ` [PATCH v4 0/2] " Rafael J. Wysocki
2020-07-28 15:11     ` [PATCH v4 1/2] cpufreq: intel_pstate: Rearrange the storing of new EPP values Rafael J. Wysocki
2020-07-30  1:31       ` Francisco Jerez
2020-07-28 15:13     ` [PATCH v4 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled Rafael J. Wysocki
2020-08-01 23:21       ` Srinivas Pandruvada
2020-08-02 14:14         ` Doug Smythies
2020-08-02 19:20           ` Srinivas Pandruvada
2020-08-03 17:17       ` [PATCH v5] " Rafael J. Wysocki
2020-08-01 16:39     ` [PATCH v4 0/2] " Srinivas Pandruvada
2020-08-02 14:00       ` Doug Smythies
2020-08-02 18:39         ` Srinivas Pandruvada
2020-08-03  0:26           ` Doug Smythies
2020-08-03 17:23       ` Rafael J. Wysocki
2020-08-04 15:10   ` [PATCH v6] " Rafael J. Wysocki
2020-08-04 17:04     ` Doug Smythies
2020-08-05  9:34       ` Rafael J. Wysocki
2020-08-05 15:38         ` Srinivas Pandruvada
2020-08-05 16:28           ` Rafael J. Wysocki
2020-08-05 16:55   ` [PATCH v7] " Rafael J. Wysocki
2020-08-06  5:55     ` Doug Smythies
2020-08-06 11:25       ` Rafael J. Wysocki
2020-08-06 12:03     ` Rafael J. Wysocki
2020-08-10  0:44       ` Srinivas Pandruvada
2020-08-11  0:51       ` Francisco Jerez
2020-08-11 15:33         ` Rafael J. Wysocki
2020-08-17 21:06       ` Doug Smythies
2020-09-07  0:16         ` Doug Smythies

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).