All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 00/15] Intel Hardware P-States (HWP) support
@ 2023-07-06 18:54 Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 01/15] cpufreq: Allow restricting to internal governors only Jason Andryuk
                   ` (14 more replies)
  0 siblings, 15 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, Anthony PERARD, Juergen Gross, George Dunlap,
	Julien Grall, Stefano Stabellini, Henry Wang, Community Manager

Hi,

This patch series adds Hardware-Controlled Performance States (HWP) for
Intel processors to Xen.

v2 was only partially reviewed, so v3 is mostly a reposting of v2.  In v2 &
v3, I think I addressed all comments for v1.  I kept patch 11 "xenpm:
Factor out a non-fatal cpuid_parse variant", with a v2 comment
explaining why I keep it.

v3 adds "xen/x86: Tweak PDC bits when using HWP".  Qubes testing revealed
an issue where enabling HWP can crash firwmare code (maybe SMM).  This
requires a Linux change to get the PDC bits from Xen and pass them to
ACPI.  Roger has a patch [0] to set the PDC bits.  Roger's 3 patch
series was tested with "xen/x86: Tweak PDC bits when using HWP" on
affected hardware and allowed proper operation.

v4:
There is a large amount or renaming from HWP/hwp to CPPC/cppc in the series.
The driver remains hwp_ prefixed since it is dealing with the hardware
interface.  The sysctl, xc and xenpm interfaces were renamed to cppc to
be the generic ACPI CPPC (Collaborative Processor Performance Control)
interface.

struct xen_get_cpufreq_para was re-organized in a binary compatible
fashion to nest scaling governor options.  This allows the cppc support
to use uint32_t's for its parameters.

HWP is now enabled with a top-level cpufreq=hwp option.  It will
fallback to cpufreq=xen if hwp is unavailable.  This seems like the most
user-friendly option.  Since the user was trying to specify *some*
cpufreq, we should give them the best that we can instead of disabling
the functionality.

"xenpm: Factor out a non-fatal cpuid_parse variant" was dropped.
set-cpufreq-cppc expects either a cpu number or none specified, which
implies all.

Some patches were re-arrange - "xen/x86: Tweak PDC bits when using HWP"
now comes immediately after "cpufreq: Add Hardware P-State (HWP) driver"

The implementation of "cpufreq: Allow restricting to internal governors
only " changed, so I removed Jan's Ack.

v5:
HWP is enabled with a toplevel cpufreq=hwp option.  There is no fallback
by default, but a cpufreq=hwp;xen syntax is now supported.  That tries
hwp first.  If HWP registration is unsuccessful, then xen registration
is performed as a fallback.

More changes from Jan's feedback.  They are typically minor and
documented in individual patches.

Previous cover letter:

With HWP, the processor makes its own determinations for frequency
selection, though users can set some parameters and preferences.  There
is also Turbo Boost which dynamically pushes the max frequency if
possible.

The existing governors don't work with HWP since they select frequencies
and HWP doesn't expose those.  Therefore a dummy hwp-interal governor is
used that doesn't do anything.

xenpm get-cpufreq-para is extended to show HWP parameters, and
set-cpufreq-cppc is added to set them.

A lightly loaded OpenXT laptop showed ~1W power savings according to
powertop.  A mostly idle Fedora system (dom0 only) showed a more modest
power savings.

This is for a 10th gen 6-core 1600 MHz base 4900 MHZ max cpu.  In the
default balance mode, Turbo Boost doesn't exceed 4GHz.  Tweaking the
energy_perf preference with `xenpm set-cpufreq-para balance ene:64`,
I've seen the CPU hit 4.7GHz before throttling down and bouncing around
between 4.3 and 4.5 GHz.  Curiously the other cores read ~4GHz when
turbo boost takes affect.  This was done after pinning all dom0 cores,
and using taskset to pin to vCPU/pCPU 11 and running a bash tightloop.

HWP defaults to disabled and running with the existing HWP configuration
- it doesn't reconfigure by default.  It can be enabled with
cpufreq=hwp.

Hardware Duty Cycling (HDC) is another feature to autonomously powerdown
things.  It defaults to enabled when HWP is enabled, but HDC can be
disabled on the command line.  cpufreq=xen:hwp,no-hdc

I've only tested on 8th gen and 10th gen systems with activity window
and energy_perf support.  So the pathes for CPUs lacking those features
are untested.

Fast MSR support was removed in v2.  The model specific checking was not
done properly, and I don't have hardware to test with.  Since writes are
expected to be infrequent, I just removed the code.

This changes the systcl_pm_op hypercall, so that wants review.

Regards,
Jason

[0] https://lore.kernel.org/xen-devel/20221121102113.41893-3-roger.pau@citrix.com/

Jason Andryuk (15):
  cpufreq: Allow restricting to internal governors only
  cpufreq: Add perf_freq to cpuinfo
  cpufreq: Export intel_feature_detect
  xen/sysctl: Nest cpufreq scaling options
  pmstat&xenpm: Re-arrage for cpufreq union
  cpufreq: Add Hardware P-State (HWP) driver
  xen/x86: Tweak PDC bits when using HWP
  xenpm: Change get-cpufreq-para output for hwp
  cpufreq: Export HWP parameters to userspace as CPPC
  libxc: Include cppc_para in definitions
  xenpm: Print HWP/CPPC parameters
  xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  libxc: Add xc_set_cpufreq_cppc
  xenpm: Add set-cpufreq-cppc subcommand
  CHANGELOG: Add Intel HWP entry

 CHANGELOG.md                                 |   2 +-
 docs/misc/xen-command-line.pandoc            |  18 +-
 tools/include/xenctrl.h                      |  28 +-
 tools/libs/ctrl/xc_pm.c                      |  42 +-
 tools/misc/xenpm.c                           | 392 +++++++++--
 xen/arch/x86/acpi/cpufreq/Makefile           |   1 +
 xen/arch/x86/acpi/cpufreq/cpufreq.c          |  29 +-
 xen/arch/x86/acpi/cpufreq/hwp.c              | 646 +++++++++++++++++++
 xen/arch/x86/acpi/lib.c                      |   5 +
 xen/arch/x86/cpu/mcheck/mce_intel.c          |   6 +
 xen/arch/x86/include/asm/cpufeature.h        |  12 +-
 xen/arch/x86/include/asm/msr-index.h         |  16 +-
 xen/drivers/acpi/pmstat.c                    | 100 +--
 xen/drivers/cpufreq/cpufreq.c                |  56 +-
 xen/drivers/cpufreq/cpufreq_misc_governors.c |   9 +
 xen/drivers/cpufreq/cpufreq_ondemand.c       |   3 +
 xen/drivers/cpufreq/utility.c                |   1 +
 xen/include/acpi/cpufreq/cpufreq.h           |  21 +
 xen/include/acpi/pdc_intel.h                 |   1 +
 xen/include/public/sysctl.h                  | 144 ++++-
 20 files changed, 1411 insertions(+), 121 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c

-- 
2.41.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v5 01/15] cpufreq: Allow restricting to internal governors only
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-10 11:41   ` Jan Beulich
  2023-07-06 18:54 ` [PATCH v5 02/15] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Jan Beulich

For hwp, the standard governors are not usable, and only the internal
one is applicable.  Add the cpufreq_governor_internal boolean to
indicate when an internal governor, like hwp, will be used.  This is set
during presmp_initcall, and governor registration can be skipped when
called during initcall.

This way unusable governors are not registered, and only compatible
governors are advertised to userspace.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v5:
Check cpufreq_governor_internal and skip registration as applicable
Remove internal flag

v4:
Rework to use an internal flag
Removed Jan's Ack since the approach is different.

v3:
Switch to initdata
Add Jan Acked-by
Commit message s/they/the/ typo
Don't register hwp-internal when running non-hwp - Marek

v2:
Switch to "-internal"
Add blank line in header
---
 xen/drivers/cpufreq/cpufreq.c                | 1 +
 xen/drivers/cpufreq/cpufreq_misc_governors.c | 9 +++++++++
 xen/drivers/cpufreq/cpufreq_ondemand.c       | 3 +++
 xen/include/acpi/cpufreq/cpufreq.h           | 2 ++
 4 files changed, 15 insertions(+)

diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index 2321c7dd07..67a58d409b 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -56,6 +56,7 @@ struct cpufreq_dom {
 };
 static LIST_HEAD_READ_MOSTLY(cpufreq_dom_list_head);
 
+bool __initdata cpufreq_governor_internal;
 struct cpufreq_governor *__read_mostly cpufreq_opt_governor;
 LIST_HEAD_READ_MOSTLY(cpufreq_governor_list);
 
diff --git a/xen/drivers/cpufreq/cpufreq_misc_governors.c b/xen/drivers/cpufreq/cpufreq_misc_governors.c
index f5571f5486..0327fad23b 100644
--- a/xen/drivers/cpufreq/cpufreq_misc_governors.c
+++ b/xen/drivers/cpufreq/cpufreq_misc_governors.c
@@ -120,6 +120,9 @@ static int __init cf_check cpufreq_gov_userspace_init(void)
 {
     unsigned int cpu;
 
+    if ( cpufreq_governor_internal )
+        return 0;
+
     for_each_online_cpu(cpu)
         per_cpu(cpu_set_freq, cpu) = userspace_cmdline_freq;
     register_cpu_notifier(&cpufreq_userspace_cpu_nfb);
@@ -162,6 +165,9 @@ struct cpufreq_governor cpufreq_gov_performance = {
 
 static int __init cf_check cpufreq_gov_performance_init(void)
 {
+    if ( cpufreq_governor_internal )
+        return 0;
+
     return cpufreq_register_governor(&cpufreq_gov_performance);
 }
 __initcall(cpufreq_gov_performance_init);
@@ -201,6 +207,9 @@ struct cpufreq_governor cpufreq_gov_powersave = {
 
 static int __init cf_check cpufreq_gov_powersave_init(void)
 {
+    if ( cpufreq_governor_internal )
+        return 0;
+
     return cpufreq_register_governor(&cpufreq_gov_powersave);
 }
 __initcall(cpufreq_gov_powersave_init);
diff --git a/xen/drivers/cpufreq/cpufreq_ondemand.c b/xen/drivers/cpufreq/cpufreq_ondemand.c
index fbcd14d6c3..06cfc88d30 100644
--- a/xen/drivers/cpufreq/cpufreq_ondemand.c
+++ b/xen/drivers/cpufreq/cpufreq_ondemand.c
@@ -360,6 +360,9 @@ struct cpufreq_governor cpufreq_gov_dbs = {
 
 static int __init cf_check cpufreq_gov_dbs_init(void)
 {
+    if ( cpufreq_governor_internal )
+        return 0;
+
     return cpufreq_register_governor(&cpufreq_gov_dbs);
 }
 __initcall(cpufreq_gov_dbs_init);
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 35dcf21e8f..44fc4c58fc 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -116,6 +116,8 @@ extern struct cpufreq_governor cpufreq_gov_powersave;
 
 extern struct list_head cpufreq_governor_list;
 
+extern bool cpufreq_governor_internal;
+
 extern int cpufreq_register_governor(struct cpufreq_governor *governor);
 extern struct cpufreq_governor *__find_governor(const char *governor);
 #define CPUFREQ_DEFAULT_GOVERNOR &cpufreq_gov_dbs
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 02/15] cpufreq: Add perf_freq to cpuinfo
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 01/15] cpufreq: Allow restricting to internal governors only Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 03/15] cpufreq: Export intel_feature_detect Jason Andryuk
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

acpi-cpufreq scales the aperf/mperf measurements by max_freq, but HWP
needs to scale by base frequency.  Settings max_freq to base_freq
"works" but the code is not obvious, and returning values to userspace
is tricky.  Add an additonal perf_freq member which is used for scaling
aperf/mperf measurements.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v3:
Add Jan's Ack

I don't like this, but it seems the best way to re-use the common
aperf/mperf code.  The other option would be to add wrappers that then
do the acpi vs. hwp scaling.
---
 xen/arch/x86/acpi/cpufreq/cpufreq.c | 2 +-
 xen/drivers/cpufreq/utility.c       | 1 +
 xen/include/acpi/cpufreq/cpufreq.h  | 3 +++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index 2e0067fbe5..6c70d04395 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -316,7 +316,7 @@ unsigned int get_measured_perf(unsigned int cpu, unsigned int flag)
     else
         perf_percent = 0;
 
-    return policy->cpuinfo.max_freq * perf_percent / 100;
+    return policy->cpuinfo.perf_freq * perf_percent / 100;
 }
 
 static unsigned int cf_check get_cur_freq_on_cpu(unsigned int cpu)
diff --git a/xen/drivers/cpufreq/utility.c b/xen/drivers/cpufreq/utility.c
index 9eb7ecedcd..6831f62851 100644
--- a/xen/drivers/cpufreq/utility.c
+++ b/xen/drivers/cpufreq/utility.c
@@ -236,6 +236,7 @@ int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
 
     policy->min = policy->cpuinfo.min_freq = min_freq;
     policy->max = policy->cpuinfo.max_freq = max_freq;
+    policy->cpuinfo.perf_freq = max_freq;
     policy->cpuinfo.second_max_freq = second_max_freq;
 
     if (policy->min == ~0)
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 44fc4c58fc..1f1898d811 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -37,6 +37,9 @@ extern struct acpi_cpufreq_data *cpufreq_drv_data[NR_CPUS];
 struct cpufreq_cpuinfo {
     unsigned int        max_freq;
     unsigned int        second_max_freq;    /* P1 if Turbo Mode is on */
+    unsigned int        perf_freq; /* Scaling freq for aperf/mpref.
+                                      acpi-cpufreq uses max_freq, but HWP uses
+                                      base_freq.*/
     unsigned int        min_freq;
     unsigned int        transition_latency; /* in 10^(-9) s = nanoseconds */
 };
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 03/15] cpufreq: Export intel_feature_detect
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 01/15] cpufreq: Allow restricting to internal governors only Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 02/15] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 04/15] xen/sysctl: Nest cpufreq scaling options Jason Andryuk
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

Export feature_detect as intel_feature_detect so it can be re-used by
HWP.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v4:
Add Jan's Ack

v3:
Remove void * cast when calling intel_feature_detect

v2:
export intel_feature_detect with typed pointer
Move intel_feature_detect to acpi/cpufreq/cpufreq.h since the
declaration now contains struct cpufreq_policy *.
---
 xen/arch/x86/acpi/cpufreq/cpufreq.c | 8 ++++++--
 xen/include/acpi/cpufreq/cpufreq.h  | 2 ++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index 6c70d04395..f1cc473b4f 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -339,9 +339,8 @@ static unsigned int cf_check get_cur_freq_on_cpu(unsigned int cpu)
     return extract_freq(get_cur_val(cpumask_of(cpu)), data);
 }
 
-static void cf_check feature_detect(void *info)
+void intel_feature_detect(struct cpufreq_policy *policy)
 {
-    struct cpufreq_policy *policy = info;
     unsigned int eax;
 
     eax = cpuid_eax(6);
@@ -353,6 +352,11 @@ static void cf_check feature_detect(void *info)
     }
 }
 
+static void cf_check feature_detect(void *info)
+{
+    intel_feature_detect(info);
+}
+
 static unsigned int check_freqs(const cpumask_t *mask, unsigned int freq,
                                 struct acpi_cpufreq_data *data)
 {
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 1f1898d811..482ea5b0de 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -243,4 +243,6 @@ int write_userspace_scaling_setspeed(unsigned int cpu, unsigned int freq);
 void cpufreq_dbs_timer_suspend(void);
 void cpufreq_dbs_timer_resume(void);
 
+void intel_feature_detect(struct cpufreq_policy *policy);
+
 #endif /* __XEN_CPUFREQ_PM_H__ */
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 04/15] xen/sysctl: Nest cpufreq scaling options
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (2 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 03/15] cpufreq: Export intel_feature_detect Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-10 11:51   ` Jan Beulich
  2023-07-06 18:54 ` [PATCH v5 05/15] pmstat&xenpm: Re-arrage for cpufreq union Jason Andryuk
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Juergen Gross,
	Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall,
	Stefano Stabellini

Add a union and struct so that most of the scaling variables of struct
xen_get_cpufreq_para are within in a binary-compatible layout.  This
allows cppc_para to live in the larger union and use uint32_ts - struct
xen_cppc_para will be 10 uint32_t's.

The new scaling struct is 3 * uint32_t + 16 bytes CPUFREQ_NAME_LEN + 4 *
uint32_t for xen_ondemand = 11 uint32_t.  That means the old size is
retained, int32_t turbo_enabled doesn't move and it's binary compatible.

The out-of-context memcpy() in xc_get_cpufreq_para() now handles the
copying of the fields removed there.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v5:
Expand commit message
Change comment to driver/governor
---
 tools/include/xenctrl.h     | 22 +++++++++++++---------
 tools/libs/ctrl/xc_pm.c     |  7 +------
 tools/misc/xenpm.c          | 24 ++++++++++++------------
 xen/drivers/acpi/pmstat.c   | 27 ++++++++++++++-------------
 xen/include/public/sysctl.h | 22 +++++++++++++---------
 5 files changed, 53 insertions(+), 49 deletions(-)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index dba33d5d0f..8aedb952a0 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1909,16 +1909,20 @@ struct xc_get_cpufreq_para {
     uint32_t cpuinfo_cur_freq;
     uint32_t cpuinfo_max_freq;
     uint32_t cpuinfo_min_freq;
-    uint32_t scaling_cur_freq;
-
-    char scaling_governor[CPUFREQ_NAME_LEN];
-    uint32_t scaling_max_freq;
-    uint32_t scaling_min_freq;
-
-    /* for specific governor */
     union {
-        xc_userspace_t userspace;
-        xc_ondemand_t ondemand;
+        struct {
+            uint32_t scaling_cur_freq;
+
+            char scaling_governor[CPUFREQ_NAME_LEN];
+            uint32_t scaling_max_freq;
+            uint32_t scaling_min_freq;
+
+            /* for specific governor */
+            union {
+                xc_userspace_t userspace;
+                xc_ondemand_t ondemand;
+            } u;
+        } s;
     } u;
 
     int32_t turbo_enabled;
diff --git a/tools/libs/ctrl/xc_pm.c b/tools/libs/ctrl/xc_pm.c
index c3a9864bf7..6e751e242f 100644
--- a/tools/libs/ctrl/xc_pm.c
+++ b/tools/libs/ctrl/xc_pm.c
@@ -265,17 +265,12 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
         user_para->cpuinfo_cur_freq = sys_para->cpuinfo_cur_freq;
         user_para->cpuinfo_max_freq = sys_para->cpuinfo_max_freq;
         user_para->cpuinfo_min_freq = sys_para->cpuinfo_min_freq;
-        user_para->scaling_cur_freq = sys_para->scaling_cur_freq;
-        user_para->scaling_max_freq = sys_para->scaling_max_freq;
-        user_para->scaling_min_freq = sys_para->scaling_min_freq;
         user_para->turbo_enabled    = sys_para->turbo_enabled;
 
         memcpy(user_para->scaling_driver,
                 sys_para->scaling_driver, CPUFREQ_NAME_LEN);
-        memcpy(user_para->scaling_governor,
-                sys_para->scaling_governor, CPUFREQ_NAME_LEN);
 
-        /* copy to user_para no matter what cpufreq governor */
+        /* copy to user_para no matter what cpufreq driver/governor */
         BUILD_BUG_ON(sizeof(((struct xc_get_cpufreq_para *)0)->u) !=
 		     sizeof(((struct xen_get_cpufreq_para *)0)->u));
 
diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 1bb6187e56..ee8ce5d5f2 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -730,39 +730,39 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
     printf("scaling_avail_gov    : %s\n",
            p_cpufreq->scaling_available_governors);
 
-    printf("current_governor     : %s\n", p_cpufreq->scaling_governor);
-    if ( !strncmp(p_cpufreq->scaling_governor,
+    printf("current_governor     : %s\n", p_cpufreq->u.s.scaling_governor);
+    if ( !strncmp(p_cpufreq->u.s.scaling_governor,
                   "userspace", CPUFREQ_NAME_LEN) )
     {
         printf("  userspace specific :\n");
         printf("    scaling_setspeed : %u\n",
-               p_cpufreq->u.userspace.scaling_setspeed);
+               p_cpufreq->u.s.u.userspace.scaling_setspeed);
     }
-    else if ( !strncmp(p_cpufreq->scaling_governor,
+    else if ( !strncmp(p_cpufreq->u.s.scaling_governor,
                        "ondemand", CPUFREQ_NAME_LEN) )
     {
         printf("  ondemand specific  :\n");
         printf("    sampling_rate    : max [%u] min [%u] cur [%u]\n",
-               p_cpufreq->u.ondemand.sampling_rate_max,
-               p_cpufreq->u.ondemand.sampling_rate_min,
-               p_cpufreq->u.ondemand.sampling_rate);
+               p_cpufreq->u.s.u.ondemand.sampling_rate_max,
+               p_cpufreq->u.s.u.ondemand.sampling_rate_min,
+               p_cpufreq->u.s.u.ondemand.sampling_rate);
         printf("    up_threshold     : %u\n",
-               p_cpufreq->u.ondemand.up_threshold);
+               p_cpufreq->u.s.u.ondemand.up_threshold);
     }
 
     printf("scaling_avail_freq   :");
     for ( i = 0; i < p_cpufreq->freq_num; i++ )
         if ( p_cpufreq->scaling_available_frequencies[i] ==
-             p_cpufreq->scaling_cur_freq )
+             p_cpufreq->u.s.scaling_cur_freq )
             printf(" *%d", p_cpufreq->scaling_available_frequencies[i]);
         else
             printf(" %d", p_cpufreq->scaling_available_frequencies[i]);
     printf("\n");
 
     printf("scaling frequency    : max [%u] min [%u] cur [%u]\n",
-           p_cpufreq->scaling_max_freq,
-           p_cpufreq->scaling_min_freq,
-           p_cpufreq->scaling_cur_freq);
+           p_cpufreq->u.s.scaling_max_freq,
+           p_cpufreq->u.s.scaling_min_freq,
+           p_cpufreq->u.s.scaling_cur_freq);
 
     printf("turbo mode           : %s\n",
            p_cpufreq->turbo_enabled ? "enabled" : "disabled or n/a");
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index 1bae635101..f5a9ac3f1a 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -258,37 +258,38 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
         cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
     op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
     op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
-    op->u.get_para.scaling_cur_freq = policy->cur;
-    op->u.get_para.scaling_max_freq = policy->max;
-    op->u.get_para.scaling_min_freq = policy->min;
+
+    op->u.get_para.u.s.scaling_cur_freq = policy->cur;
+    op->u.get_para.u.s.scaling_max_freq = policy->max;
+    op->u.get_para.u.s.scaling_min_freq = policy->min;
 
     if ( cpufreq_driver.name[0] )
-        strlcpy(op->u.get_para.scaling_driver, 
+        strlcpy(op->u.get_para.scaling_driver,
             cpufreq_driver.name, CPUFREQ_NAME_LEN);
     else
         strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
 
     if ( policy->governor->name[0] )
-        strlcpy(op->u.get_para.scaling_governor, 
+        strlcpy(op->u.get_para.u.s.scaling_governor,
             policy->governor->name, CPUFREQ_NAME_LEN);
     else
-        strlcpy(op->u.get_para.scaling_governor, "Unknown", CPUFREQ_NAME_LEN);
+        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown", CPUFREQ_NAME_LEN);
 
     /* governor specific para */
-    if ( !strncasecmp(op->u.get_para.scaling_governor,
+    if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
                       "userspace", CPUFREQ_NAME_LEN) )
     {
-        op->u.get_para.u.userspace.scaling_setspeed = policy->cur;
+        op->u.get_para.u.s.u.userspace.scaling_setspeed = policy->cur;
     }
 
-    if ( !strncasecmp(op->u.get_para.scaling_governor,
+    if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
                       "ondemand", CPUFREQ_NAME_LEN) )
     {
         ret = get_cpufreq_ondemand_para(
-            &op->u.get_para.u.ondemand.sampling_rate_max,
-            &op->u.get_para.u.ondemand.sampling_rate_min,
-            &op->u.get_para.u.ondemand.sampling_rate,
-            &op->u.get_para.u.ondemand.up_threshold);
+            &op->u.get_para.u.s.u.ondemand.sampling_rate_max,
+            &op->u.get_para.u.s.u.ondemand.sampling_rate_min,
+            &op->u.get_para.u.s.u.ondemand.sampling_rate,
+            &op->u.get_para.u.s.u.ondemand.up_threshold);
     }
     op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
 
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 33e86ace51..b0028d3058 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -317,16 +317,20 @@ struct xen_get_cpufreq_para {
     uint32_t cpuinfo_cur_freq;
     uint32_t cpuinfo_max_freq;
     uint32_t cpuinfo_min_freq;
-    uint32_t scaling_cur_freq;
-
-    char scaling_governor[CPUFREQ_NAME_LEN];
-    uint32_t scaling_max_freq;
-    uint32_t scaling_min_freq;
-
-    /* for specific governor */
     union {
-        struct  xen_userspace userspace;
-        struct  xen_ondemand ondemand;
+        struct {
+            uint32_t scaling_cur_freq;
+
+            char scaling_governor[CPUFREQ_NAME_LEN];
+            uint32_t scaling_max_freq;
+            uint32_t scaling_min_freq;
+
+            /* for specific governor */
+            union {
+                struct  xen_userspace userspace;
+                struct  xen_ondemand ondemand;
+            } u;
+        } s;
     } u;
 
     int32_t turbo_enabled;
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 05/15] pmstat&xenpm: Re-arrage for cpufreq union
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (3 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 04/15] xen/sysctl: Nest cpufreq scaling options Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-10 12:09   ` Jan Beulich
  2023-07-06 18:54 ` [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Juergen Gross, Jan Beulich

Rearrange code now that xen_sysctl_pm_op's get_para fields has the
nested union and struct.  In particular, the scaling governor
information like scaling_available_governors is inside the union, so it
is not always available.  Move those fields (op->u.get_para.u.s.u.*)
together as well as the common fields (ones outside the union like
op->u.get_para.turbo_enabled).

With that, gov_num may be 0, so bounce buffer handling needs
to be modified.

scaling_governor and other fields inside op->u.get_para.u.s.u.* won't be
used for hwp, so this will simplify the change when hwp support is
introduced and re-indents these lines all together.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v5:
Remove some formatting changes
Expand commit message
---
 tools/libs/ctrl/xc_pm.c   | 12 ++++++++----
 tools/misc/xenpm.c        |  3 ++-
 xen/drivers/acpi/pmstat.c | 24 ++++++++++++------------
 3 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/tools/libs/ctrl/xc_pm.c b/tools/libs/ctrl/xc_pm.c
index 6e751e242f..cea3eab22e 100644
--- a/tools/libs/ctrl/xc_pm.c
+++ b/tools/libs/ctrl/xc_pm.c
@@ -221,7 +221,7 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
     {
         if ( (!user_para->affected_cpus)                    ||
              (!user_para->scaling_available_frequencies)    ||
-             (!user_para->scaling_available_governors) )
+             (user_para->gov_num && !user_para->scaling_available_governors) )
         {
             errno = EINVAL;
             return -1;
@@ -230,12 +230,15 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
             goto unlock_1;
         if ( xc_hypercall_bounce_pre(xch, scaling_available_frequencies) )
             goto unlock_2;
-        if ( xc_hypercall_bounce_pre(xch, scaling_available_governors) )
+        if ( user_para->gov_num &&
+             xc_hypercall_bounce_pre(xch, scaling_available_governors) )
             goto unlock_3;
 
         set_xen_guest_handle(sys_para->affected_cpus, affected_cpus);
         set_xen_guest_handle(sys_para->scaling_available_frequencies, scaling_available_frequencies);
-        set_xen_guest_handle(sys_para->scaling_available_governors, scaling_available_governors);
+        if ( user_para->gov_num )
+            set_xen_guest_handle(sys_para->scaling_available_governors,
+                                 scaling_available_governors);
     }
 
     sysctl.cmd = XEN_SYSCTL_pm_op;
@@ -278,7 +281,8 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
     }
 
 unlock_4:
-    xc_hypercall_bounce_post(xch, scaling_available_governors);
+    if ( user_para->gov_num )
+        xc_hypercall_bounce_post(xch, scaling_available_governors);
 unlock_3:
     xc_hypercall_bounce_post(xch, scaling_available_frequencies);
 unlock_2:
diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index ee8ce5d5f2..1c474c3b59 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -811,7 +811,8 @@ static int show_cpufreq_para_by_cpuid(xc_interface *xc_handle, int cpuid)
             ret = -ENOMEM;
             goto out;
         }
-        if (!(p_cpufreq->scaling_available_governors =
+        if (p_cpufreq->gov_num &&
+            !(p_cpufreq->scaling_available_governors =
               malloc(p_cpufreq->gov_num * CPUFREQ_NAME_LEN * sizeof(char))))
         {
             fprintf(stderr,
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index f5a9ac3f1a..d67d99e62f 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -239,6 +239,18 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
     if ( ret )
         return ret;
 
+    op->u.get_para.cpuinfo_cur_freq =
+        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
+    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
+    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
+    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
+
+    if ( cpufreq_driver.name[0] )
+        strlcpy(op->u.get_para.scaling_driver,
+            cpufreq_driver.name, CPUFREQ_NAME_LEN);
+    else
+        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
+
     if ( !(scaling_available_governors =
            xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
         return -ENOMEM;
@@ -254,21 +266,10 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
     if ( ret )
         return ret;
 
-    op->u.get_para.cpuinfo_cur_freq =
-        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
-    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
-    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
-
     op->u.get_para.u.s.scaling_cur_freq = policy->cur;
     op->u.get_para.u.s.scaling_max_freq = policy->max;
     op->u.get_para.u.s.scaling_min_freq = policy->min;
 
-    if ( cpufreq_driver.name[0] )
-        strlcpy(op->u.get_para.scaling_driver,
-            cpufreq_driver.name, CPUFREQ_NAME_LEN);
-    else
-        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
-
     if ( policy->governor->name[0] )
         strlcpy(op->u.get_para.u.s.scaling_governor,
             policy->governor->name, CPUFREQ_NAME_LEN);
@@ -291,7 +292,6 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
             &op->u.get_para.u.s.u.ondemand.sampling_rate,
             &op->u.get_para.u.s.u.ondemand.up_threshold);
     }
-    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
 
     return ret;
 }
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (4 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 05/15] pmstat&xenpm: Re-arrage for cpufreq union Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-10 13:13   ` Jan Beulich
  2023-07-06 18:54 ` [PATCH v5 07/15] xen/x86: Tweak PDC bits when using HWP Jason Andryuk
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné

From the Intel SDM: "Hardware-Controlled Performance States (HWP), which
autonomously selects performance states while utilizing OS supplied
performance guidance hints."

Enable HWP to run in autonomous mode by poking the correct MSRs.  HWP is
disabled by default, and cpufreq=hwp enables it.

cpufreq= parsing is expanded to allow cpufreq=hwp;xen.  This allows
trying HWP and falling back to xen if not available.  Only hwp and xen
are supported for this fallback feature.  hdc is a sub-option under hwp
(i.e.  cpufreq=hwp,hdc=0) as is verbose.

There is no interface to configure - xen_sysctl_pm_op/xenpm will
be extended to configure in subsequent patches.  It will run with the
default values, which should be the default 0x80 (out of 0x0-0xff)
energy/performance preference.

Unscientific powertop measurement of an mostly idle, customized OpenXT
install:
A 10th gen 6-core laptop showed battery discharge drop from ~9.x to
~7.x watts.
A 8th gen 4-core laptop dropped from ~10 to ~9

Power usage depends on many factors, especially display brightness, but
this does show a power saving in balanced mode when CPU utilization is
low.

HWP isn't compatible with an external governor - it doesn't take
explicit frequency requests.  Therefore a minimal internal governor,
hwp, is also added as a placeholder.

While adding to the xen-command-line.pandoc entry, un-nest verbose from
minfreq.  They are independent.

With cpufreq=hwp,verbose, HWP prints processor capabilities that are not
used by the code, like HW_FEEDBACK.  This is done because otherwise
there isn't a convenient way to query the information.

Xen doesn't use the HWP interrupt, so it is disabled like in the Linux
pstate driver.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
We disable on cpuid_level < 0x16.  cpuid(0x16) is used to get the cpu
frequencies for calculating the APERF/MPERF.  Without it, things would
still work, but the average cpu frequency output would be wrong.

My 8th & 10th gen test systems both report:
(XEN) HWP: 1 notify: 1 act_window: 1 energy_perf: 1 pkg_level: 0 peci: 0
(XEN) HWP: Hardware Duty Cycling (HDC) supported
(XEN) HWP: HW_FEEDBACK not supported

We can't use parse_boolean() since it requires a single name=val string
and cpufreq_handle_common_option is provided two strings.  Use
parse_bool() and manual handle no-hwp.

FAST_IA32_HWP_REQUEST was removed in v2.  The check in v1 was wrong,
it's a model specific feature and the CPUID bit is only available
after enabling via the MSR.  Support was untested since I don't have
hardware with the feature.  Writes are expected to be infrequent, so
just leave it out.

---
v2:
Alphabetize headers
Re-work driver registration
name hwp_drv_data anonymous union "hw"
Drop hwp_verbose_cont
style cleanups
Condense hwp_governor switch
hwp_cpufreq_target remove .raw from hwp_req assignment
Use typed-pointer in a few functions
Pass type to xzalloc
Add HWP_ENERGY_PERF_BALANCE/IA32_ENERGY_BIAS_BALANCE defines
Add XEN_HWP_GOVERNOR define for "hwp-internal"
Capitalize CPUID and MSR defines
Change '_' to '-' for energy-perf & act-window
Read-modify-write MSRs updates
Use FAST_IA32_HWP_REQUEST_MSR_ENABLE define
constify pointer in hwp_set_misc_turbo
Add space after non-fallthrough break in governor switch
Add IA32_ENERGY_BIAS_MASK define
Check CPUID_PM_LEAK for energy bias when needed
Fail initialization with curr_req = -1
Fold hwp_read_capabilities into hwp_init_msrs
Add command line cpufreq=xen:hwp
Add command line cpufreq=xen:hdc
Use per_cpu for hwp_drv_data pointers
Move hwp_energy_perf_bias call into hwp_write_request
energy_perf 0 is valid, so hwp_energy_perf_bias cannot be skipped
Ensure we don't generate interrupts
Remove Fast Write of Uncore MSR
Initialize hwp_drv_data from curr_req
Use SPDX line instead of license text in hwp.c

v3:
Add cf_check to cpufreq_gov_hwp_init() - Marek
Print cpuid_level with %#x - Marek

v4:
Use BIT() for CPUID and MSR bits
Move __initdata after type
Add __ro_after_init to feature_*
Remove aperf/mperf comment
Move feature_hwp_energy_perf { to newline
Remove _IA32_ infix
Use unsigned int & bool for bitfields
Require energy perf pref (Remove ENERGY_PERF_BIAS support)
Initialize activity_window
Return errors on wrmsr failure
Change command line to: cpufreq=xen:hwp
Move hdc into the hwp-specific handle_options
Drop feature_hwp_energy_perf, feature_hwp_pkg_level_ctl & feature_hwp_peci
Print features before exiting when energy/performance preference isn't available
Disable HWP MSR on initialization error
Change hwp_ print macros to add prefixes
Disable HDC when hdc=0 - (opt_hdc no longer initdata)
Mark hwp governor internal and use "hwp" name
Add XEN_HWP_DRIVER
Use top-level cpufreq=hwp command line option
Document that cpufreq=hwp falls back to cpufreq=xen without hardware
Add SPDX suffix GPL-2.0-only

v5:
Use _AC() macro in MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE definition
hwp_err arg re-ordering
Use XEN_HWP_DRIVER_NAME
Use cpufreq.h for all declarations
Clear feature_hdc on failure and print a message
Use unnamed bitfields instead of reservered
Remove asm/io.h include
static DEFINE_PER_CPU_READ_MOSTLY(struct hwp_drv_data
Remove some empty newlines
Align feature_hdc assignment
Remove feature_hwp
Remove unnecesary return at end of void hwp_init_msrs()
BUILD_BUG_ON member variable
Reformat a compound if
Clear pre_cpu hwp_drv_data before xfree()
Only print HWP capabilities for CPU 0
Specify processor models in turbo comment
Use arg[1] in setup_cpufreq_option()
Remove some log messages
Drop double newline
Parse verbose as a boolean instead of the custom parsing.
Support cpufreq=hwp;xen fallback
Call hwp_available() from hwp_register_driver()
Move cpufreq_govenor_internal setting to hwp_register_driver
---
 docs/misc/xen-command-line.pandoc     |  18 +-
 xen/arch/x86/acpi/cpufreq/Makefile    |   1 +
 xen/arch/x86/acpi/cpufreq/cpufreq.c   |  19 +-
 xen/arch/x86/acpi/cpufreq/hwp.c       | 513 ++++++++++++++++++++++++++
 xen/arch/x86/include/asm/cpufeature.h |  12 +-
 xen/arch/x86/include/asm/msr-index.h  |  15 +-
 xen/drivers/cpufreq/cpufreq.c         |  55 ++-
 xen/include/acpi/cpufreq/cpufreq.h    |   9 +
 xen/include/public/sysctl.h           |   2 +
 9 files changed, 629 insertions(+), 15 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 4060ebdc5d..f6138da173 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
 available support.
 
 ### cpufreq
-> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
+> `= none | {{ <boolean> | xen } { [:[powersave|performance|ondemand|userspace][,[<maxfreq>]][,[<minfreq>]]] } [,verbose]} | dom0-kernel | hwp[:[<hdc>][,verbose]]`
 
 > Default: `xen`
 
@@ -510,6 +510,22 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
 * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
   respectively.
 * `verbose` option can be included as a string or also as `verbose=<integer>`
+  for `xen`.  It is a boolean for `hwp`.
+* `hwp` selects Hardware-Controlled Performance States (HWP) on supported Intel
+  hardware.  HWP is a Skylake+ feature which provides better CPU power
+  management.  The default is disabled.  If `hwp` is selected, but hardware
+  support is not available, Xen will fallback to cpufreq=xen.
+* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
+  processor to autonomously force physical package components into idle state.
+  The default is enabled, but the option only applies when `hwp` is enabled.
+
+There is also support for `;`-separated fallback options:
+`cpufreq=hwp,verbose;xen`.  This first tries `hwp` and falls back to `xen`
+if unavailable.
+
+Note: grub2 requires to escape or quote ';', so `"cpufreq=hwp;xen"` should be
+specified within double quotes inside grub.cfg.  Refer to the grub2
+documentation for more information.
 
 ### cpuid (x86)
 > `= List of comma separated booleans`
diff --git a/xen/arch/x86/acpi/cpufreq/Makefile b/xen/arch/x86/acpi/cpufreq/Makefile
index f75da9b9ca..db83aa6b14 100644
--- a/xen/arch/x86/acpi/cpufreq/Makefile
+++ b/xen/arch/x86/acpi/cpufreq/Makefile
@@ -1,2 +1,3 @@
 obj-y += cpufreq.o
+obj-y += hwp.o
 obj-y += powernow.o
diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index f1cc473b4f..3adc99f377 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -642,7 +642,24 @@ static int __init cf_check cpufreq_driver_init(void)
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = cpufreq_register_driver(&acpi_cpufreq_driver);
+            unsigned int i;
+            ret = -ENOENT;
+
+            for ( i = 0; i < cpufreq_xen_cnt; i++ )
+            {
+                switch ( cpufreq_xen_opts[i] )
+                {
+                case CPUFREQ_xen:
+                    ret = cpufreq_register_driver(&acpi_cpufreq_driver);
+                    break;
+                case CPUFREQ_hwp:
+                    ret = hwp_register_driver();
+                    break;
+                }
+
+                if ( ret == 0 )
+                    break;
+            }
             break;
 
         case X86_VENDOR_AMD:
diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
new file mode 100644
index 0000000000..1ac07bbeb1
--- /dev/null
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -0,0 +1,513 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
+ *
+ * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
+ */
+
+#include <xen/cpumask.h>
+#include <xen/init.h>
+#include <xen/param.h>
+#include <xen/xmalloc.h>
+#include <asm/msr.h>
+#include <acpi/cpufreq/cpufreq.h>
+
+static bool __ro_after_init feature_hwp_notification;
+static bool __ro_after_init feature_hwp_activity_window;
+
+static bool __ro_after_init feature_hdc;
+
+static bool __ro_after_init opt_cpufreq_hdc = true;
+
+union hwp_request
+{
+    struct
+    {
+        unsigned int min_perf:8;
+        unsigned int max_perf:8;
+        unsigned int desired:8;
+        unsigned int energy_perf:8;
+        unsigned int activity_window:10;
+        bool package_control:1;
+        unsigned int :16;
+        bool activity_window_valid:1;
+        bool energy_perf_valid:1;
+        bool desired_valid:1;
+        bool max_perf_valid:1;
+        bool min_perf_valid:1;
+    };
+    uint64_t raw;
+};
+
+struct hwp_drv_data
+{
+    union
+    {
+        uint64_t hwp_caps;
+        struct
+        {
+            unsigned int highest:8;
+            unsigned int guaranteed:8;
+            unsigned int most_efficient:8;
+            unsigned int lowest:8;
+            unsigned int :32;
+        } hw;
+    };
+    union hwp_request curr_req;
+    int ret;
+    uint16_t activity_window;
+    uint8_t minimum;
+    uint8_t maximum;
+    uint8_t desired;
+    uint8_t energy_perf;
+};
+static DEFINE_PER_CPU_READ_MOSTLY(struct hwp_drv_data *, hwp_drv_data);
+
+#define hwp_err(cpu, fmt, ...) \
+    printk(XENLOG_ERR "HWP: CPU%u error: " fmt, cpu, ##__VA_ARGS__)
+#define hwp_info(fmt, ...)    printk(XENLOG_INFO "HWP: " fmt, ##__VA_ARGS__)
+#define hwp_verbose(fmt, ...)                             \
+({                                                        \
+    if ( cpufreq_verbose )                                \
+        printk(XENLOG_DEBUG "HWP: " fmt, ##__VA_ARGS__);  \
+})
+
+static int cf_check hwp_governor(struct cpufreq_policy *policy,
+                                 unsigned int event)
+{
+    int ret;
+
+    if ( policy == NULL )
+        return -EINVAL;
+
+    switch ( event )
+    {
+    case CPUFREQ_GOV_START:
+    case CPUFREQ_GOV_LIMITS:
+        ret = 0;
+        break;
+
+    case CPUFREQ_GOV_STOP:
+    default:
+        ret = -EINVAL;
+        break;
+    }
+
+    return ret;
+}
+
+static bool hwp_handle_option(const char *s, const char *end)
+{
+    int ret;
+
+    ret = parse_boolean("verbose", s, end);
+    if ( ret >= 0 ) {
+        cpufreq_verbose = ret;
+        return true;
+    }
+
+    ret = parse_boolean("hdc", s, end);
+    if ( ret >= 0 ) {
+        opt_cpufreq_hdc = ret;
+        return true;
+    }
+
+    return false;
+}
+
+int __init hwp_cmdline_parse(const char *s, const char *e)
+{
+    do
+    {
+        const char *end = strpbrk(s, ",;");
+
+        if ( s && !hwp_handle_option(s, end) )
+        {
+            printk(XENLOG_WARNING "cpufreq/hwp: option '%s' not recognized\n",
+                   s);
+
+            return -1;
+        }
+
+        s = end ? ++end : end;
+    } while ( s && s < e );
+
+    return 0;
+}
+
+static struct cpufreq_governor cpufreq_gov_hwp =
+{
+    .name          = "hwp",
+    .governor      = hwp_governor,
+};
+
+static int __init cf_check cpufreq_gov_hwp_init(void)
+{
+    if ( !cpufreq_governor_internal )
+        return 0;
+
+    return cpufreq_register_governor(&cpufreq_gov_hwp);
+}
+__initcall(cpufreq_gov_hwp_init);
+
+static bool __init hwp_available(void)
+{
+    unsigned int eax;
+
+    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
+    {
+        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
+                    boot_cpu_data.cpuid_level);
+
+        return false;
+    }
+
+    if ( boot_cpu_data.cpuid_level < 0x16 )
+    {
+        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
+                 boot_cpu_data.cpuid_level);
+
+        return false;
+    }
+
+    eax = cpuid_eax(CPUID_PM_LEAF);
+
+    hwp_verbose("%d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
+                !!(eax & CPUID6_EAX_HWP),
+                !!(eax & CPUID6_EAX_HWP_NOTIFICATION),
+                !!(eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW),
+                !!(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE),
+                !!(eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST),
+                !!(eax & CPUID6_EAX_HWP_PECI));
+
+    if ( !(eax & CPUID6_EAX_HWP) )
+        return false;
+
+    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) )
+    {
+        hwp_verbose("disabled: No energy/performance preference available");
+
+        return false;
+    }
+
+    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
+    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
+    feature_hdc                 = eax & CPUID6_EAX_HDC;
+
+    hwp_verbose("Hardware Duty Cycling (HDC) %ssupported%s\n",
+                feature_hdc ? "" : "not ",
+                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
+                            : "");
+
+    hwp_verbose("HW_FEEDBACK %ssupported\n",
+                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
+
+    hwp_info("Using HWP for cpufreq\n");
+
+    return true;
+}
+
+static int hdc_set_pkg_hdc_ctl(unsigned int cpu, bool val)
+{
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_PKG_HDC_CTL, msr) )
+    {
+        hwp_err(cpu, "rdmsr_safe(MSR_PKG_HDC_CTL)\n");
+        return -1;
+    }
+
+    if ( val )
+        msr |= PKG_HDC_CTL_HDC_PKG_ENABLE;
+    else
+        msr &= ~PKG_HDC_CTL_HDC_PKG_ENABLE;
+
+    if ( wrmsr_safe(MSR_PKG_HDC_CTL, msr) )
+    {
+        hwp_err(cpu, "wrmsr_safe(MSR_PKG_HDC_CTL): %016lx\n", msr);
+        return -1;
+    }
+
+    return 0;
+}
+
+static int hdc_set_pm_ctl1(unsigned int cpu, bool val)
+{
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_PM_CTL1, msr) )
+    {
+        hwp_err(cpu, "rdmsr_safe(MSR_PM_CTL1)\n");
+        return -1;
+    }
+
+    if ( val )
+        msr |= PM_CTL1_HDC_ALLOW_BLOCK;
+    else
+        msr &= ~PM_CTL1_HDC_ALLOW_BLOCK;
+
+    if ( wrmsr_safe(MSR_PM_CTL1, msr) )
+    {
+        hwp_err(cpu, "wrmsr_safe(MSR_PM_CTL1): %016lx\n", msr);
+        return -1;
+    }
+
+    return 0;
+}
+
+static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
+{
+    uint32_t base_khz, max_khz, bus_khz, edx;
+
+    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
+
+    policy->cpuinfo.perf_freq = base_khz * 1000;
+    policy->cpuinfo.min_freq = base_khz * 1000;
+    policy->cpuinfo.max_freq = max_khz * 1000;
+    policy->min = base_khz * 1000;
+    policy->max = max_khz * 1000;
+    policy->cur = 0;
+}
+
+static void cf_check hwp_init_msrs(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
+    uint64_t val;
+
+    /*
+     * Package level MSR, but we don't have a good idea of packages here, so
+     * just do it everytime.
+     */
+    if ( rdmsr_safe(MSR_PM_ENABLE, val) )
+    {
+        hwp_err(policy->cpu, "rdmsr_safe(MSR_PM_ENABLE)\n");
+        data->curr_req.raw = -1;
+        return;
+    }
+
+    /* Ensure we don't generate interrupts */
+    if ( feature_hwp_notification )
+        wrmsr_safe(MSR_HWP_INTERRUPT, 0);
+
+    hwp_verbose("CPU%u: MSR_PM_ENABLE: %016lx\n", policy->cpu, val);
+    if ( !(val & PM_ENABLE_HWP_ENABLE) )
+    {
+        val |= PM_ENABLE_HWP_ENABLE;
+        if ( wrmsr_safe(MSR_PM_ENABLE, val) )
+        {
+            hwp_err(policy->cpu, "wrmsr_safe(MSR_PM_ENABLE, %lx)\n", val);
+            data->curr_req.raw = -1;
+            return;
+        }
+    }
+
+    if ( rdmsr_safe(MSR_HWP_CAPABILITIES, data->hwp_caps) )
+    {
+        hwp_err(policy->cpu, "rdmsr_safe(MSR_HWP_CAPABILITIES)\n");
+        goto error;
+    }
+
+    if ( rdmsr_safe(MSR_HWP_REQUEST, data->curr_req.raw) )
+    {
+        hwp_err(policy->cpu, "rdmsr_safe(MSR_HWP_REQUEST)\n");
+        goto error;
+    }
+
+    /*
+     * Check for APERF/MPERF support in hardware
+     * also check for boost/turbo support
+     */
+    intel_feature_detect(policy);
+
+    if ( feature_hdc )
+    {
+        if ( hdc_set_pkg_hdc_ctl(policy->cpu, opt_cpufreq_hdc) ||
+             hdc_set_pm_ctl1(policy->cpu, opt_cpufreq_hdc) ) {
+            hwp_err(policy->cpu, "Disabling HDC support\n");
+            feature_hdc = false;
+            goto error;
+        }
+    }
+
+    hwp_get_cpu_speeds(policy);
+
+    return;
+
+ error:
+    data->curr_req.raw = -1;
+    val &= ~PM_ENABLE_HWP_ENABLE;
+    if ( wrmsr_safe(MSR_PM_ENABLE, val) )
+        hwp_err(policy->cpu, "wrmsr_safe(MSR_PM_ENABLE, %lx)\n", val);
+}
+
+static int cf_check hwp_cpufreq_verify(struct cpufreq_policy *policy)
+{
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
+
+    if ( !feature_hwp_activity_window && data->activity_window )
+    {
+        hwp_verbose("HWP activity window not supported\n");
+
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static void cf_check hwp_write_request(void *info)
+{
+    const struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
+    union hwp_request hwp_req = data->curr_req;
+
+    data->ret = 0;
+
+    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(hwp_req.raw));
+    if ( wrmsr_safe(MSR_HWP_REQUEST, hwp_req.raw) )
+    {
+        hwp_verbose("CPU%u: error wrmsr_safe(MSR_HWP_REQUEST, %lx)\n",
+                    policy->cpu, hwp_req.raw);
+        rdmsr_safe(MSR_HWP_REQUEST, data->curr_req.raw);
+        data->ret = -EINVAL;
+    }
+}
+
+static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
+                                       unsigned int target_freq,
+                                       unsigned int relation)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+    /* Zero everything to ensure reserved bits are zero... */
+    union hwp_request hwp_req = { .raw = 0 };
+
+    /* .. and update from there */
+    hwp_req.min_perf = data->minimum;
+    hwp_req.max_perf = data->maximum;
+    hwp_req.desired = data->desired;
+    hwp_req.energy_perf = data->energy_perf;
+    if ( feature_hwp_activity_window )
+        hwp_req.activity_window = data->activity_window;
+
+    if ( hwp_req.raw == data->curr_req.raw )
+        return 0;
+
+    data->curr_req = hwp_req;
+
+    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
+
+    return data->ret;
+}
+
+static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data;
+
+    data = xzalloc(struct hwp_drv_data);
+    if ( !data )
+        return -ENOMEM;
+
+    policy->governor = &cpufreq_gov_hwp;
+
+    per_cpu(hwp_drv_data, cpu) = data;
+
+    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
+
+    if ( data->curr_req.raw == -1 )
+    {
+        hwp_err(cpu, "Could not initialize HWP properly\n");
+        per_cpu(hwp_drv_data, cpu) = NULL;
+        xfree(data);
+        return -ENODEV;
+    }
+
+    data->minimum = data->curr_req.min_perf;
+    data->maximum = data->curr_req.max_perf;
+    data->desired = data->curr_req.desired;
+    data->energy_perf = data->curr_req.energy_perf;
+    data->activity_window = data->curr_req.activity_window;
+
+    if ( cpu == 0 )
+        hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);
+
+    hwp_verbose("CPU%u: rdmsr HWP_REQUEST %016lx\n", cpu, data->curr_req.raw);
+
+    return 0;
+}
+
+static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
+    per_cpu(hwp_drv_data, policy->cpu) = NULL;
+    xfree(data);
+
+    return 0;
+}
+
+/*
+ * The SDM reads like turbo should be disabled with MSR_IA32_PERF_CTL and
+ * PERF_CTL_TURBO_DISENGAGE, but that does not seem to actually work, at least
+ * with testing on i7-10810U and i7-8550U.  MSR_MISC_ENABLE and
+ * MISC_ENABLE_TURBO_DISENGAGE is what Linux uses and seems to work.
+ */
+static void cf_check hwp_set_misc_turbo(void *info)
+{
+    const struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
+    uint64_t msr;
+
+    data->ret = 0;
+
+    if ( rdmsr_safe(MSR_IA32_MISC_ENABLE, msr) )
+    {
+        hwp_verbose("CPU%u: error rdmsr_safe(MSR_IA32_MISC_ENABLE)\n",
+                    policy->cpu);
+        data->ret = -EINVAL;
+
+        return;
+    }
+
+    if ( policy->turbo == CPUFREQ_TURBO_ENABLED )
+        msr &= ~MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE;
+    else
+        msr |= MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE;
+
+    if ( wrmsr_safe(MSR_IA32_MISC_ENABLE, msr) )
+    {
+        hwp_verbose("CPU%u: error wrmsr_safe(MSR_IA32_MISC_ENABLE): %016lx\n",
+                    policy->cpu, msr);
+        data->ret = -EINVAL;
+    }
+}
+
+static int cf_check hwp_cpufreq_update(int cpuid, struct cpufreq_policy *policy)
+{
+    on_selected_cpus(cpumask_of(cpuid), hwp_set_misc_turbo, policy, 1);
+
+    return per_cpu(hwp_drv_data, cpuid)->ret;
+}
+
+static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
+{
+    .name   = XEN_HWP_DRIVER_NAME,
+    .verify = hwp_cpufreq_verify,
+    .target = hwp_cpufreq_target,
+    .init   = hwp_cpufreq_cpu_init,
+    .exit   = hwp_cpufreq_cpu_exit,
+    .update = hwp_cpufreq_update,
+};
+
+int __init hwp_register_driver(void)
+{
+    int ret;
+
+    if ( !hwp_available() )
+        return -EINVAL;
+
+    ret = cpufreq_register_driver(&hwp_cpufreq_driver);
+    cpufreq_governor_internal = (ret == 0);
+
+    return ret;
+}
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 64e1dad225..93466441f5 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -52,8 +52,16 @@ static inline bool boot_cpu_has(unsigned int feat)
     return cpu_has(&boot_cpu_data, feat);
 }
 
-#define CPUID_PM_LEAF                    6
-#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
+#define CPUID_PM_LEAF                                6
+#define CPUID6_EAX_HWP                               BIT(7, U)
+#define CPUID6_EAX_HWP_NOTIFICATION                  BIT(8, U)
+#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               BIT(9, U)
+#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE BIT(10, U)
+#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         BIT(11, U)
+#define CPUID6_EAX_HDC                               BIT(13, U)
+#define CPUID6_EAX_HWP_PECI                          BIT(16, U)
+#define CPUID6_EAX_HW_FEEDBACK                       BIT(19, U)
+#define CPUID6_ECX_APERFMPERF_CAPABILITY             BIT(0, U)
 
 /* CPUID level 0x00000001.edx */
 #define cpu_has_fpu             1
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 4f861c0bb4..68407bd707 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -151,6 +151,13 @@
 
 #define MSR_PKRS                            0x000006e1
 
+#define MSR_PM_ENABLE                       0x00000770
+#define  PM_ENABLE_HWP_ENABLE               BIT(0, ULL)
+
+#define MSR_HWP_CAPABILITIES                0x00000771
+#define MSR_HWP_INTERRUPT                   0x00000773
+#define MSR_HWP_REQUEST                     0x00000774
+
 #define MSR_X2APIC_FIRST                    0x00000800
 #define MSR_X2APIC_LAST                     0x000008ff
 
@@ -165,6 +172,11 @@
 #define  PASID_PASID_MASK                   0x000fffff
 #define  PASID_VALID                        (_AC(1, ULL) << 31)
 
+#define MSR_PKG_HDC_CTL                     0x00000db0
+#define  PKG_HDC_CTL_HDC_PKG_ENABLE         BIT(0, ULL)
+#define MSR_PM_CTL1                         0x00000db1
+#define  PM_CTL1_HDC_ALLOW_BLOCK            BIT(0, ULL)
+
 #define MSR_UARCH_MISC_CTRL                 0x00001b01
 #define  UARCH_CTRL_DOITM                   (_AC(1, ULL) <<  0)
 
@@ -502,7 +514,8 @@
 #define MSR_IA32_MISC_ENABLE_MONITOR_ENABLE (1<<18)
 #define MSR_IA32_MISC_ENABLE_LIMIT_CPUID  (1<<22)
 #define MSR_IA32_MISC_ENABLE_XTPR_DISABLE (1<<23)
-#define MSR_IA32_MISC_ENABLE_XD_DISABLE   (_AC(1, ULL) << 34)
+#define MSR_IA32_MISC_ENABLE_XD_DISABLE      (_AC(1, ULL) << 34)
+#define MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE (_AC(1, ULL) << 38)
 
 #define MSR_IA32_TSC_DEADLINE		0x000006E0
 #define MSR_IA32_ENERGY_PERF_BIAS	0x000001b0
diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index 67a58d409b..67f01a8293 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -63,12 +63,18 @@ LIST_HEAD_READ_MOSTLY(cpufreq_governor_list);
 /* set xen as default cpufreq */
 enum cpufreq_controller cpufreq_controller = FREQCTL_xen;
 
-static int __init cpufreq_cmdline_parse(const char *s);
+enum cpufreq_xen_opt cpufreq_xen_opts[2] = { CPUFREQ_xen, };
+unsigned int cpufreq_xen_cnt = 1;
+
+static int __init cpufreq_cmdline_parse(const char *s, const char *e);
 
 static int __init cf_check setup_cpufreq_option(const char *str)
 {
-    const char *arg = strpbrk(str, ",:");
+    const char *arg = strpbrk(str, ",:;");
     int choice;
+    int ret = -EINVAL;
+
+    cpufreq_xen_cnt = 0;
 
     if ( !arg )
         arg = strchr(str, '\0');
@@ -89,15 +95,42 @@ static int __init cf_check setup_cpufreq_option(const char *str)
         return 0;
     }
 
-    if ( choice > 0 || !cmdline_strcmp(str, "xen") )
+    do
     {
-        xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
-        cpufreq_controller = FREQCTL_xen;
-        if ( *arg && *(arg + 1) )
-            return cpufreq_cmdline_parse(arg + 1);
-    }
+        const char *end = strchr(str, ';');
+        if ( end == NULL )
+            end = strchr(str, '\0');
+
+        arg = strchr(str, ',');
+        if ( !arg || arg > end )
+            arg = strchr(str, '\0');
+
+        if ( cpufreq_xen_cnt == ARRAY_SIZE(cpufreq_xen_opts) )
+            return -E2BIG;
+
+        if ( choice > 0 || !cmdline_strcmp(str, "xen") )
+        {
+            xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
+            cpufreq_controller = FREQCTL_xen;
+            cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_xen;
+            if ( arg[0] && arg[1] )
+                ret = cpufreq_cmdline_parse(arg + 1, end);
+        }
+        else if ( choice < 0 && !cmdline_strcmp(str, "hwp") )
+        {
+            xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
+            cpufreq_controller = FREQCTL_xen;
+            cpufreq_xen_opts[cpufreq_xen_cnt++] = CPUFREQ_hwp;
+            if ( arg[0] && arg[1] )
+                ret = hwp_cmdline_parse(arg + 1, end);
+        }
+        else
+            ret = -EINVAL;
+
+        str = end ? ++end : end;
+    } while ( choice < 0 && ret == 0 && *str );
 
-    return (choice < 0) ? -EINVAL : 0;
+    return (choice < 0) ? ret : 0;
 }
 custom_param("cpufreq", setup_cpufreq_option);
 
@@ -576,7 +609,7 @@ static int __init cpufreq_handle_common_option(const char *name, const char *val
     return 0;
 }
 
-static int __init cpufreq_cmdline_parse(const char *s)
+static int __init cpufreq_cmdline_parse(const char *s, const char *e)
 {
     static struct cpufreq_governor *__initdata cpufreq_governors[] =
     {
@@ -592,6 +625,8 @@ static int __init cpufreq_cmdline_parse(const char *s)
     int rc = 0;
 
     strlcpy(buf, s, sizeof(buf));
+    if (e - s < sizeof(buf))
+        buf[e - s] = '\0';
     do {
         char *val, *end = strchr(str, ',');
         unsigned int i;
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 482ea5b0de..c965a8c6b6 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -24,6 +24,12 @@ DECLARE_PER_CPU(spinlock_t, cpufreq_statistic_lock);
 
 extern bool_t cpufreq_verbose;
 
+enum cpufreq_xen_opt {
+    CPUFREQ_xen,
+    CPUFREQ_hwp,
+};
+extern enum cpufreq_xen_opt cpufreq_xen_opts[2];
+extern unsigned int cpufreq_xen_cnt;
 struct cpufreq_governor;
 
 struct acpi_cpufreq_data {
@@ -245,4 +251,7 @@ void cpufreq_dbs_timer_resume(void);
 
 void intel_feature_detect(struct cpufreq_policy *policy);
 
+int hwp_cmdline_parse(const char *s, const char *e);
+int hwp_register_driver(void);
+
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index b0028d3058..59700b02f2 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -296,6 +296,8 @@ struct xen_ondemand {
     uint32_t up_threshold;
 };
 
+#define XEN_HWP_DRIVER_NAME "hwp"
+
 /*
  * cpufreq para name of this structure named
  * same as sysfs file name of native linux
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 07/15] xen/x86: Tweak PDC bits when using HWP
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (5 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 08/15] xenpm: Change get-cpufreq-para output for hwp Jason Andryuk
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

Qubes testing of HWP support had a report of a laptop, Thinkpad X1
Carbon Gen 4 with a Skylake processor, locking up during boot when HWP
is enabled.  A user found a kernel bug that seems to be the same issue:
https://bugzilla.kernel.org/show_bug.cgi?id=110941.

That bug was fixed by Linux commit a21211672c9a ("ACPI / processor:
Request native thermal interrupt handling via _OSC").  The tl;dr is SMM
crashes when it receives thermal interrupts, so Linux calls the ACPI
_OSC method to take over interrupt handling.

The Linux fix looks at the CPU features to decide whether or not to call
_OSC with bit 12 set to take over native interrupt handling.  Xen needs
some way to communicate HWP to Dom0 for making an equivalent call.

Xen exposes modified PDC bits via the platform_op set_pminfo hypercall.
Expand that to set bit 12 when HWP is present and in use.

Any generated interrupt would be handled by Xen's thermal drive, which
clears the status.

Bit 12 isn't named in the linux header and is open coded in Linux's
usage.  Name it ACPI_PDC_CPPC_NATIVE_INTR.

This will need a corresponding linux patch to pick up and apply the PDC
bits.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v5:
Minor fixup for feature_hwp removal
Use cpurfreq.h for declaration

v4:
Added __ro_after_init
s/ACPI_PDC_CPPC_NTV_INT/ACPI_PDC_CPPC_NATIVE_INTR/
Remove _IA32_
Fixup for opt_cpufreq_hwp removal
Add Jan Reviewed-by

v3:
New
---
 xen/arch/x86/acpi/cpufreq/hwp.c      | 9 +++++++++
 xen/arch/x86/acpi/lib.c              | 5 +++++
 xen/arch/x86/cpu/mcheck/mce_intel.c  | 6 ++++++
 xen/arch/x86/include/asm/msr-index.h | 1 +
 xen/include/acpi/cpufreq/cpufreq.h   | 1 +
 xen/include/acpi/pdc_intel.h         | 1 +
 6 files changed, 23 insertions(+)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index 1ac07bbeb1..ce897d566f 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -12,6 +12,8 @@
 #include <asm/msr.h>
 #include <acpi/cpufreq/cpufreq.h>
 
+static bool __ro_after_init hwp_in_use;
+
 static bool __ro_after_init feature_hwp_notification;
 static bool __ro_after_init feature_hwp_activity_window;
 
@@ -150,6 +152,11 @@ static int __init cf_check cpufreq_gov_hwp_init(void)
 }
 __initcall(cpufreq_gov_hwp_init);
 
+bool hwp_active(void)
+{
+    return hwp_in_use;
+}
+
 static bool __init hwp_available(void)
 {
     unsigned int eax;
@@ -202,6 +209,8 @@ static bool __init hwp_available(void)
     hwp_verbose("HW_FEEDBACK %ssupported\n",
                 (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
 
+    hwp_in_use = true;
+
     hwp_info("Using HWP for cpufreq\n");
 
     return true;
diff --git a/xen/arch/x86/acpi/lib.c b/xen/arch/x86/acpi/lib.c
index 43831b92d1..51cb082ca0 100644
--- a/xen/arch/x86/acpi/lib.c
+++ b/xen/arch/x86/acpi/lib.c
@@ -26,6 +26,8 @@
 #include <asm/fixmap.h>
 #include <asm/mwait.h>
 
+#include <acpi/cpufreq/cpufreq.h>
+
 u32 __read_mostly acpi_smi_cmd;
 u8 __read_mostly acpi_enable_value;
 u8 __read_mostly acpi_disable_value;
@@ -140,5 +142,8 @@ int arch_acpi_set_pdc_bits(u32 acpi_id, u32 *pdc, u32 mask)
 	    !(ecx & CPUID5_ECX_INTERRUPT_BREAK))
 		pdc[2] &= ~(ACPI_PDC_C_C1_FFH | ACPI_PDC_C_C2C3_FFH);
 
+	if (hwp_active())
+		pdc[2] |= ACPI_PDC_CPPC_NATIVE_INTR;
+
 	return 0;
 }
diff --git a/xen/arch/x86/cpu/mcheck/mce_intel.c b/xen/arch/x86/cpu/mcheck/mce_intel.c
index 2f23f02923..4045c6591d 100644
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -15,6 +15,9 @@
 #include <asm/p2m.h>
 #include <asm/mce.h>
 #include <asm/apic.h>
+
+#include <acpi/cpufreq/cpufreq.h>
+
 #include "mce.h"
 #include "x86_mca.h"
 #include "barrier.h"
@@ -64,6 +67,9 @@ static void cf_check intel_thermal_interrupt(struct cpu_user_regs *regs)
 
     ack_APIC_irq();
 
+    if ( hwp_active() )
+        wrmsr_safe(MSR_HWP_STATUS, 0);
+
     if ( NOW() < per_cpu(next, cpu) )
         return;
 
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 68407bd707..c8f507c92b 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -157,6 +157,7 @@
 #define MSR_HWP_CAPABILITIES                0x00000771
 #define MSR_HWP_INTERRUPT                   0x00000773
 #define MSR_HWP_REQUEST                     0x00000774
+#define MSR_HWP_STATUS                      0x00000777
 
 #define MSR_X2APIC_FIRST                    0x00000800
 #define MSR_X2APIC_LAST                     0x000008ff
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index c965a8c6b6..ccbd6ea4c5 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -253,5 +253,6 @@ void intel_feature_detect(struct cpufreq_policy *policy);
 
 int hwp_cmdline_parse(const char *s, const char *e);
 int hwp_register_driver(void);
+bool hwp_active(void);
 
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/acpi/pdc_intel.h b/xen/include/acpi/pdc_intel.h
index 4fb719d6f5..abaa098b51 100644
--- a/xen/include/acpi/pdc_intel.h
+++ b/xen/include/acpi/pdc_intel.h
@@ -17,6 +17,7 @@
 #define ACPI_PDC_C_C1_FFH		(0x0100)
 #define ACPI_PDC_C_C2C3_FFH		(0x0200)
 #define ACPI_PDC_SMP_P_HWCOORD		(0x0800)
+#define ACPI_PDC_CPPC_NATIVE_INTR	(0x1000)
 
 #define ACPI_PDC_EST_CAPABILITY_SMP	(ACPI_PDC_SMP_C1PT | \
 					 ACPI_PDC_C_C1_HALT | \
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 08/15] xenpm: Change get-cpufreq-para output for hwp
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (6 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 07/15] xen/x86: Tweak PDC bits when using HWP Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 09/15] cpufreq: Export HWP parameters to userspace as CPPC Jason Andryuk
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Jan Beulich

When using HWP, some of the returned data is not applicable.  In that
case, we should just omit it to avoid confusing the user.  So switch to
printing the base and max frequencies since those are relevant to HWP.
Similarly, stop printing the CPU frequencies since those do not apply.
The scaling fields are also no longer printed.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v5:
Use XEN_HWP_DRIVER_NAME
Add Jan's Ack

v4:
s/turbo/max/
Check for XEN_HWP_DRIVER driver instead of "-internal"

v2:
Use full governor name XEN_HWP_GOVERNOR to change output
Style fixes
---
 tools/misc/xenpm.c | 83 +++++++++++++++++++++++++---------------------
 1 file changed, 46 insertions(+), 37 deletions(-)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 1c474c3b59..21c93386de 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -711,6 +711,7 @@ void start_gather_func(int argc, char *argv[])
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
+    bool hwp = strcmp(p_cpufreq->scaling_driver, XEN_HWP_DRIVER_NAME) == 0;
     int i;
 
     printf("cpu id               : %d\n", cpuid);
@@ -720,49 +721,57 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
         printf(" %d", p_cpufreq->affected_cpus[i]);
     printf("\n");
 
-    printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
-           p_cpufreq->cpuinfo_max_freq,
-           p_cpufreq->cpuinfo_min_freq,
-           p_cpufreq->cpuinfo_cur_freq);
+    if ( hwp )
+        printf("cpuinfo frequency    : base [%u] max [%u]\n",
+               p_cpufreq->cpuinfo_min_freq,
+               p_cpufreq->cpuinfo_max_freq);
+    else
+        printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
+               p_cpufreq->cpuinfo_max_freq,
+               p_cpufreq->cpuinfo_min_freq,
+               p_cpufreq->cpuinfo_cur_freq);
 
     printf("scaling_driver       : %s\n", p_cpufreq->scaling_driver);
 
-    printf("scaling_avail_gov    : %s\n",
-           p_cpufreq->scaling_available_governors);
-
-    printf("current_governor     : %s\n", p_cpufreq->u.s.scaling_governor);
-    if ( !strncmp(p_cpufreq->u.s.scaling_governor,
-                  "userspace", CPUFREQ_NAME_LEN) )
-    {
-        printf("  userspace specific :\n");
-        printf("    scaling_setspeed : %u\n",
-               p_cpufreq->u.s.u.userspace.scaling_setspeed);
-    }
-    else if ( !strncmp(p_cpufreq->u.s.scaling_governor,
-                       "ondemand", CPUFREQ_NAME_LEN) )
+    if ( !hwp )
     {
-        printf("  ondemand specific  :\n");
-        printf("    sampling_rate    : max [%u] min [%u] cur [%u]\n",
-               p_cpufreq->u.s.u.ondemand.sampling_rate_max,
-               p_cpufreq->u.s.u.ondemand.sampling_rate_min,
-               p_cpufreq->u.s.u.ondemand.sampling_rate);
-        printf("    up_threshold     : %u\n",
-               p_cpufreq->u.s.u.ondemand.up_threshold);
-    }
+        printf("scaling_avail_gov    : %s\n",
+               p_cpufreq->scaling_available_governors);
 
-    printf("scaling_avail_freq   :");
-    for ( i = 0; i < p_cpufreq->freq_num; i++ )
-        if ( p_cpufreq->scaling_available_frequencies[i] ==
-             p_cpufreq->u.s.scaling_cur_freq )
-            printf(" *%d", p_cpufreq->scaling_available_frequencies[i]);
-        else
-            printf(" %d", p_cpufreq->scaling_available_frequencies[i]);
-    printf("\n");
+        printf("current_governor     : %s\n", p_cpufreq->u.s.scaling_governor);
+        if ( !strncmp(p_cpufreq->u.s.scaling_governor,
+                      "userspace", CPUFREQ_NAME_LEN) )
+        {
+            printf("  userspace specific :\n");
+            printf("    scaling_setspeed : %u\n",
+                   p_cpufreq->u.s.u.userspace.scaling_setspeed);
+        }
+        else if ( !strncmp(p_cpufreq->u.s.scaling_governor,
+                           "ondemand", CPUFREQ_NAME_LEN) )
+        {
+            printf("  ondemand specific  :\n");
+            printf("    sampling_rate    : max [%u] min [%u] cur [%u]\n",
+                   p_cpufreq->u.s.u.ondemand.sampling_rate_max,
+                   p_cpufreq->u.s.u.ondemand.sampling_rate_min,
+                   p_cpufreq->u.s.u.ondemand.sampling_rate);
+            printf("    up_threshold     : %u\n",
+                   p_cpufreq->u.s.u.ondemand.up_threshold);
+        }
+
+        printf("scaling_avail_freq   :");
+        for ( i = 0; i < p_cpufreq->freq_num; i++ )
+            if ( p_cpufreq->scaling_available_frequencies[i] ==
+                 p_cpufreq->u.s.scaling_cur_freq )
+                printf(" *%d", p_cpufreq->scaling_available_frequencies[i]);
+            else
+                printf(" %d", p_cpufreq->scaling_available_frequencies[i]);
+        printf("\n");
 
-    printf("scaling frequency    : max [%u] min [%u] cur [%u]\n",
-           p_cpufreq->u.s.scaling_max_freq,
-           p_cpufreq->u.s.scaling_min_freq,
-           p_cpufreq->u.s.scaling_cur_freq);
+        printf("scaling frequency    : max [%u] min [%u] cur [%u]\n",
+               p_cpufreq->u.s.scaling_max_freq,
+               p_cpufreq->u.s.scaling_min_freq,
+               p_cpufreq->u.s.scaling_cur_freq);
+    }
 
     printf("turbo mode           : %s\n",
            p_cpufreq->turbo_enabled ? "enabled" : "disabled or n/a");
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 09/15] cpufreq: Export HWP parameters to userspace as CPPC
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (7 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 08/15] xenpm: Change get-cpufreq-para output for hwp Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-13 12:37   ` Jan Beulich
  2023-07-06 18:54 ` [PATCH v5 10/15] libxc: Include cppc_para in definitions Jason Andryuk
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

Extend xen_get_cpufreq_para to return hwp parameters.  HWP is an
implementation of ACPI CPPC (Collaborative Processor Performance
Control).  Use the CPPC name since that might be useful in the future
for AMD P-state.

We need the features bitmask to indicate fields supported by the actual
hardware - this only applies to activity window for the time being.

The HWP most_efficient is mapped to CPPC lowest_nonlinear, and guaranteed is
mapped to nominal.  CPPC has a guaranteed that is optional while nominal
is required.  ACPI spec says "If this register is not implemented, OSPM
assumes guaranteed performance is always equal to nominal performance."

The use of uint8_t parameters matches the hardware size.  uint32_t
entries grows the sysctl_t past the build assertion in setup.c.  The
uint8_t ranges are supported across multiple generations, so hopefully
they won't change.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v2:
Style fixes
Don't bump XEN_SYSCTL_INTERFACE_VERSION
Drop cpufreq.h comment divider
Expand xen_hwp_para comment
Add HWP activity window mantissa/exponent defines
Handle union rename
Add const to get_hwp_para
Remove hw_ prefix from xen_hwp_para members
Use XEN_HWP_GOVERNOR
Use per_cpu for hwp_drv_data

v4:
Fixup for opt_cpufreq_hwp/hdc removal
get_hwp_para() takes cpu as arg
XEN_ prefix HWP_ACT_WINDOW_*
Drop HWP_ACT_WINDOW_EXPONENT_SHIFT - shift MASK
Remove Energy Bias (0-15) EPP fallback
Rename xen_hwp_para to xen_cppc_para
s/hwp/cppc/
Use scaling driver to switch output

v5:
Use XEN_HWP_DRIVER_NAME
Use cpufreq.h for declarations
Fixup some comments
Drop const from unsigned int cpu
Drop some unnecessary { }
Use strncmp
Switch sizeof(char) to sizeof(*scaling_available_governors)
Reindent copy_to_guest call
Add "HWP: " prefix to sysctl comments for cppc->hwp mapping
---
 xen/arch/x86/acpi/cpufreq/hwp.c    | 23 ++++++++++
 xen/drivers/acpi/pmstat.c          | 74 ++++++++++++++++--------------
 xen/include/acpi/cpufreq/cpufreq.h |  2 +
 xen/include/public/sysctl.h        | 56 ++++++++++++++++++++++
 4 files changed, 121 insertions(+), 34 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index ce897d566f..50b66a0449 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -508,6 +508,29 @@ static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
     .update = hwp_cpufreq_update,
 };
 
+int get_hwp_para(unsigned int cpu,
+                 struct xen_cppc_para *cppc_para)
+{
+    const struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+
+    if ( data == NULL )
+        return -ENODATA;
+
+    cppc_para->features         =
+        (feature_hwp_activity_window ? XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW : 0);
+    cppc_para->lowest           = data->hw.lowest;
+    cppc_para->lowest_nonlinear = data->hw.most_efficient;
+    cppc_para->nominal          = data->hw.guaranteed;
+    cppc_para->highest          = data->hw.highest;
+    cppc_para->minimum          = data->minimum;
+    cppc_para->maximum          = data->maximum;
+    cppc_para->desired          = data->desired;
+    cppc_para->energy_perf      = data->energy_perf;
+    cppc_para->activity_window  = data->activity_window;
+
+    return 0;
+}
+
 int __init hwp_register_driver(void)
 {
     int ret;
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index d67d99e62f..f674ef51aa 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -251,46 +251,52 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
     else
         strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
 
-    if ( !(scaling_available_governors =
-           xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
-        return -ENOMEM;
-    if ( (ret = read_scaling_available_governors(scaling_available_governors,
-                gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
+    if ( !strncmp(op->u.get_para.scaling_driver, XEN_HWP_DRIVER_NAME,
+                      CPUFREQ_NAME_LEN) )
+        ret = get_hwp_para(policy->cpu, &op->u.get_para.u.cppc_para);
+    else
     {
+        if ( !(scaling_available_governors =
+               xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
+            return -ENOMEM;
+        if ( (ret = read_scaling_available_governors(
+                        scaling_available_governors,
+                        gov_num * CPUFREQ_NAME_LEN *
+                            sizeof(*scaling_available_governors) )) )
+        {
+            xfree(scaling_available_governors);
+            return ret;
+        }
+        ret = copy_to_guest(op->u.get_para.scaling_available_governors,
+                            scaling_available_governors,
+                            gov_num * CPUFREQ_NAME_LEN);
         xfree(scaling_available_governors);
-        return ret;
-    }
-    ret = copy_to_guest(op->u.get_para.scaling_available_governors,
-                scaling_available_governors, gov_num * CPUFREQ_NAME_LEN);
-    xfree(scaling_available_governors);
-    if ( ret )
-        return ret;
+        if ( ret )
+            return ret;
 
-    op->u.get_para.u.s.scaling_cur_freq = policy->cur;
-    op->u.get_para.u.s.scaling_max_freq = policy->max;
-    op->u.get_para.u.s.scaling_min_freq = policy->min;
+        op->u.get_para.u.s.scaling_cur_freq = policy->cur;
+        op->u.get_para.u.s.scaling_max_freq = policy->max;
+        op->u.get_para.u.s.scaling_min_freq = policy->min;
 
-    if ( policy->governor->name[0] )
-        strlcpy(op->u.get_para.u.s.scaling_governor,
-            policy->governor->name, CPUFREQ_NAME_LEN);
-    else
-        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown", CPUFREQ_NAME_LEN);
+        if ( policy->governor->name[0] )
+            strlcpy(op->u.get_para.u.s.scaling_governor,
+                policy->governor->name, CPUFREQ_NAME_LEN);
+        else
+            strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
+                    CPUFREQ_NAME_LEN);
 
-    /* governor specific para */
-    if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
-                      "userspace", CPUFREQ_NAME_LEN) )
-    {
-        op->u.get_para.u.s.u.userspace.scaling_setspeed = policy->cur;
-    }
+        /* governor specific para */
+        if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
+                          "userspace", CPUFREQ_NAME_LEN) )
+            op->u.get_para.u.s.u.userspace.scaling_setspeed = policy->cur;
 
-    if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
-                      "ondemand", CPUFREQ_NAME_LEN) )
-    {
-        ret = get_cpufreq_ondemand_para(
-            &op->u.get_para.u.s.u.ondemand.sampling_rate_max,
-            &op->u.get_para.u.s.u.ondemand.sampling_rate_min,
-            &op->u.get_para.u.s.u.ondemand.sampling_rate,
-            &op->u.get_para.u.s.u.ondemand.up_threshold);
+        if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
+                          "ondemand", CPUFREQ_NAME_LEN) )
+            ret = get_cpufreq_ondemand_para(
+                &op->u.get_para.u.s.u.ondemand.sampling_rate_max,
+                &op->u.get_para.u.s.u.ondemand.sampling_rate_min,
+                &op->u.get_para.u.s.u.ondemand.sampling_rate,
+                &op->u.get_para.u.s.u.ondemand.up_threshold);
     }
 
     return ret;
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index ccbd6ea4c5..c5636edf0e 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -254,5 +254,7 @@ void intel_feature_detect(struct cpufreq_policy *policy);
 int hwp_cmdline_parse(const char *s, const char *e);
 int hwp_register_driver(void);
 bool hwp_active(void);
+int get_hwp_para(unsigned int cpu,
+                 struct xen_cppc_para *cppc_para);
 
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 59700b02f2..a5f8369116 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -296,6 +296,61 @@ struct xen_ondemand {
     uint32_t up_threshold;
 };
 
+struct xen_cppc_para {
+    /* OUT */
+    /* activity_window supported if set */
+#define XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW  (1 << 0)
+    uint32_t features; /* bit flags for features */
+    /*
+     * See Intel SDM: HWP Performance Range and Dynamic Capabilities
+     *
+     * These four are 0-255 hardware-provided values.  They "continuous,
+     * abstract unit-less, performance" values.  Smaller numbers are slower
+     * and larger ones are faster.
+     */
+    uint32_t lowest;
+    uint32_t lowest_nonlinear; /* HWP: most_efficient */
+    uint32_t nominal; /* HWP: guaranteed */
+    uint32_t highest;
+    /*
+     * See Intel SDM: IA32_HWP_REQUEST MSR (Address: 774H Logical Processor
+     * Scope)
+     *
+     * These are all hints, and the processor may deviate outside of them.
+     * Values below are 0-255.
+     *
+     * minimum and maximum can be set to the above hardware values to constrain
+     * operation.  The full range 0-255 is accepted and will be clipped by
+     * hardware.
+     */
+    uint32_t minimum;
+    uint32_t maximum;
+    /*
+     * An explicit performance hint, disabling hardware selection.
+     * 0 lets the hardware decide.
+     */
+    uint32_t desired;
+    /*
+     * Hint to hardware for energy/performance preference.
+     * 0:   Performance
+     * 128: Balance (Default)
+     * 255: Powersaving
+     */
+    uint32_t energy_perf;
+    /*
+     * Activity Window is a moving history window for the processor's operation
+     * calculations, controlling responsiveness.  Measured in microseconds
+     * encoded as:
+     *
+     * bits 6:0   - 7bit mantissa
+     * bits 9:7   - 3bit base-10 exponent
+     * btis 15:10 - Unused - must be 0
+     */
+#define XEN_CPPC_ACT_WINDOW_MANTISSA_MASK  0x07f
+#define XEN_CPPC_ACT_WINDOW_EXPONENT_MASK  0x380
+    uint32_t activity_window;
+};
+
 #define XEN_HWP_DRIVER_NAME "hwp"
 
 /*
@@ -333,6 +388,7 @@ struct xen_get_cpufreq_para {
                 struct  xen_ondemand ondemand;
             } u;
         } s;
+        struct xen_cppc_para cppc_para;
     } u;
 
     int32_t turbo_enabled;
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 10/15] libxc: Include cppc_para in definitions
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (8 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 09/15] cpufreq: Export HWP parameters to userspace as CPPC Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 11/15] xenpm: Print HWP/CPPC parameters Jason Andryuk
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Juergen Gross

Expose the cppc_para fields through libxc.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
---
v4:
Rename hwp to cppc
Add Anthony's Ack
---
 tools/include/xenctrl.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 8aedb952a0..2092632296 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1892,6 +1892,7 @@ int xc_smt_disable(xc_interface *xch);
  */
 typedef struct xen_userspace xc_userspace_t;
 typedef struct xen_ondemand xc_ondemand_t;
+typedef struct xen_cppc_para xc_cppc_para_t;
 
 struct xc_get_cpufreq_para {
     /* IN/OUT variable */
@@ -1923,6 +1924,7 @@ struct xc_get_cpufreq_para {
                 xc_ondemand_t ondemand;
             } u;
         } s;
+        xc_cppc_para_t cppc_para;
     } u;
 
     int32_t turbo_enabled;
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 11/15] xenpm: Print HWP/CPPC parameters
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (9 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 10/15] libxc: Include cppc_para in definitions Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Jan Beulich

Print HWP-specific parameters.  Some are always present, but others
depend on hardware support.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v2:
Style fixes
Declare i outside loop
Replace repearted hardware/configured limits with spaces
Fixup for hw_ removal
Use XEN_HWP_GOVERNOR
Use HWP_ACT_WINDOW_EXPONENT_*
Remove energy_perf hw autonomous - 0 doesn't mean autonomous

v4:
Return activity_window from calculate_hwp_activity_window
Use blanks instead of _ in output
Use MASK_EXTR
Check XEN_HWP_DRIVER name since governor is no longer returned
s/hwp/cppc

v5:
Add Jan's Reviewed-by
---
 tools/misc/xenpm.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 21c93386de..3abd99fd20 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -708,6 +708,46 @@ void start_gather_func(int argc, char *argv[])
     pause();
 }
 
+static unsigned int calculate_activity_window(const xc_cppc_para_t *cppc,
+                                              const char **units)
+{
+    unsigned int mantissa = MASK_EXTR(cppc->activity_window,
+                                      XEN_CPPC_ACT_WINDOW_MANTISSA_MASK);
+    unsigned int exponent = MASK_EXTR(cppc->activity_window,
+                                      XEN_CPPC_ACT_WINDOW_EXPONENT_MASK);
+    unsigned int multiplier = 1;
+    unsigned int i;
+
+    /*
+     * SDM only states a 0 register is hardware selected, and doesn't mention
+     * a 0 mantissa with a non-0 exponent.  Only special case a 0 register.
+     */
+    if ( cppc->activity_window == 0 )
+    {
+        *units = "hardware selected";
+
+        return 0;
+    }
+
+    if ( exponent >= 6 )
+    {
+        *units = "s";
+        exponent -= 6;
+    }
+    else if ( exponent >= 3 )
+    {
+        *units = "ms";
+        exponent -= 3;
+    }
+    else
+        *units = "us";
+
+    for ( i = 0; i < exponent; i++ )
+        multiplier *= 10;
+
+    return mantissa * multiplier;
+}
+
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
@@ -772,6 +812,32 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
                p_cpufreq->u.s.scaling_min_freq,
                p_cpufreq->u.s.scaling_cur_freq);
     }
+    else
+    {
+        const xc_cppc_para_t *cppc = &p_cpufreq->u.cppc_para;
+
+        printf("cppc variables       :\n");
+        printf("  hardware limits    : lowest [%u] lowest nonlinear [%u]\n",
+               cppc->lowest, cppc->lowest_nonlinear);
+        printf("                     : nominal [%u] highest [%u]\n",
+               cppc->nominal, cppc->highest);
+        printf("  configured limits  : min [%u] max [%u] energy perf [%u]\n",
+               cppc->minimum, cppc->maximum, cppc->energy_perf);
+
+        if ( cppc->features & XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW )
+        {
+            unsigned int activity_window;
+            const char *units;
+
+            activity_window = calculate_activity_window(cppc, &units);
+            printf("                     : activity_window [%u %s]\n",
+                   activity_window, units);
+        }
+
+        printf("                     : desired [%u%s]\n",
+               cppc->desired,
+               cppc->desired ? "" : " hw autonomous");
+    }
 
     printf("turbo mode           : %s\n",
            p_cpufreq->turbo_enabled ? "enabled" : "disabled or n/a");
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (10 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 11/15] xenpm: Print HWP/CPPC parameters Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-13 13:02   ` Jan Beulich
  2023-07-06 18:54 ` [PATCH v5 13/15] libxc: Add xc_set_cpufreq_cppc Jason Andryuk
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

Add SET_CPUFREQ_HWP xen_sysctl_pm_op to set HWP parameters.  The sysctl
supports setting multiple values simultaneously as indicated by the
set_params bits.  This allows atomically applying new HWP configuration
via a single wrmsr.

XEN_SYSCTL_HWP_SET_PRESET_BALANCE/PERFORMANCE/POWERSAVE provide three
common presets.  Setting them depends on hardware limits which the
hypervisor is already caching.  So using them allows skipping a
hypercall to query the limits (lowest/highest) to then set those same
values.  The code is organized to allow a preset to be refined with
additional parameters if desired.

"most_efficient" and "guaranteed" could be additional presets in the
future, but the are not added now.  Those levels can change at runtime,
but we don't have code in place to monitor and update for those events.

Since activity window may not be supported by all hardware, omit writing
it when not supported, and return that fact to userspace by updating
set_params.

CPPC parameter checking disallows setting reserved bytes and ensure
values are only non-zero when the corresponding set_params bit is set.
There is no range checking (0-255 is allowed) since hardware is
documented to clip internally.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

---
v5:
Use cpufreq.h for include
Add () around bit test
Fix Configure typo
Drop duplicated energy_perf comment
Define XEN_SYSCTL_CPPC_ACT_WINDOW_MASK from *_MANTISSA_MASK and *_EXPONENT_MASK
Re-order XEN_SYSCTL_CPPC_SET_* flags to match field and hardware order.
Remove const from set_cppc param to update set_params
Skip Activity Window if not supported by hardware and clear set_params
Make parameter parsing consistent
Add an exit path when there are no parameters to write.
Expand the header file to cover the IN/OUT set_params.
Remove the "desired" lowest/highest checking as hardware clips internally

v4:
Remove IA32_ENERGY_BIAS support
Validate parameters don't exceed 255
Use CPPC/cppc name
set_cppc_para() add const
set_cppc_para() return hwp_cpufreq_target()
Expand sysctl comments

v3:
Remove cpufreq_governor_internal from set_cpufreq_hwp

v2:
Update for naming anonymous union
Drop hwp_err for invalid input in set_hwp_para()
Drop uint16_t cast in XEN_SYSCTL_HWP_SET_PARAM_MASK
Drop parens for HWP_SET_PRESET defines
Reference activity_window format comment
Place SET_CPUFREQ_HWP after SET_CPUFREQ_PARA
Add {HWP,IA32}_ENERGY_PERF_MAX_{PERFORMANCE,POWERSAVE} defines
Order defines before fields in sysctl.h
Use XEN_HWP_GOVERNOR
Use per_cpu for hwp_drv_data
---
 xen/arch/x86/acpi/cpufreq/hwp.c    | 101 +++++++++++++++++++++++++++++
 xen/drivers/acpi/pmstat.c          |  17 +++++
 xen/include/acpi/cpufreq/cpufreq.h |   2 +
 xen/include/public/sysctl.h        |  64 ++++++++++++++++++
 4 files changed, 184 insertions(+)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index 50b66a0449..32df9af4b3 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -21,6 +21,10 @@ static bool __ro_after_init feature_hdc;
 
 static bool __ro_after_init opt_cpufreq_hdc = true;
 
+#define HWP_ENERGY_PERF_MAX_PERFORMANCE 0
+#define HWP_ENERGY_PERF_BALANCE         0x80
+#define HWP_ENERGY_PERF_MAX_POWERSAVE   0xff
+
 union hwp_request
 {
     struct
@@ -531,6 +535,103 @@ int get_hwp_para(unsigned int cpu,
     return 0;
 }
 
+int set_hwp_para(struct cpufreq_policy *policy,
+                 struct xen_set_cppc_para *set_cppc)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+    bool cleared_act_window = false;
+
+    if ( data == NULL )
+        return -EINVAL;
+
+    /* Validate all parameters - Disallow reserved bits. */
+    if ( set_cppc->minimum > 255 ||
+         set_cppc->maximum > 255 ||
+         set_cppc->desired > 255 ||
+         set_cppc->energy_perf > 255 ||
+         set_cppc->set_params & ~XEN_SYSCTL_CPPC_SET_PARAM_MASK ||
+         set_cppc->activity_window & ~XEN_SYSCTL_CPPC_ACT_WINDOW_MASK )
+        return -EINVAL;
+
+    /* Only allow values if params bit is set. */
+    if ( (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED) &&
+          set_cppc->desired) ||
+         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM) &&
+          set_cppc->minimum) ||
+         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM) &&
+          set_cppc->maximum) ||
+         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF) &&
+          set_cppc->energy_perf) ||
+         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW) &&
+          set_cppc->activity_window) )
+        return -EINVAL;
+
+    /* Clear out activity window if lacking HW supported. */
+    if ( (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW) &&
+         !feature_hwp_activity_window ) {
+        set_cppc->set_params &= ~XEN_SYSCTL_CPPC_SET_ACT_WINDOW;
+        cleared_act_window = true;
+    }
+
+    /* Return if there is nothing to do. */
+    if ( set_cppc->set_params == 0 )
+        return cleared_act_window ? 0 : -EINVAL;
+
+    /* Apply presets */
+    switch ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_PRESET_MASK )
+    {
+    case XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE:
+        data->minimum = data->hw.lowest;
+        data->maximum = data->hw.lowest;
+        data->activity_window = 0;
+        data->energy_perf = HWP_ENERGY_PERF_MAX_POWERSAVE;
+        data->desired = 0;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE:
+        data->minimum = data->hw.highest;
+        data->maximum = data->hw.highest;
+        data->activity_window = 0;
+        data->energy_perf = HWP_ENERGY_PERF_MAX_PERFORMANCE;
+        data->desired = 0;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_BALANCE:
+        data->minimum = data->hw.lowest;
+        data->maximum = data->hw.highest;
+        data->activity_window = 0;
+        data->energy_perf = HWP_ENERGY_PERF_BALANCE;
+        data->desired = 0;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_NONE:
+        break;
+
+    default:
+        return -EINVAL;
+    }
+
+    /* Further customize presets if needed */
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM )
+        data->minimum = set_cppc->minimum;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM )
+        data->maximum = set_cppc->maximum;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF )
+        data->energy_perf = set_cppc->energy_perf;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
+        data->desired = set_cppc->desired;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW )
+        data->activity_window = set_cppc->activity_window &
+                                XEN_SYSCTL_CPPC_ACT_WINDOW_MASK;
+
+    return hwp_cpufreq_target(policy, 0, 0);
+}
+
 int __init hwp_register_driver(void)
 {
     int ret;
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index f674ef51aa..cfd7fdfb1c 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -400,6 +400,19 @@ static int set_cpufreq_para(struct xen_sysctl_pm_op *op)
     return ret;
 }
 
+static int set_cpufreq_cppc(struct xen_sysctl_pm_op *op)
+{
+    struct cpufreq_policy *policy = per_cpu(cpufreq_cpu_policy, op->cpuid);
+
+    if ( !policy || !policy->governor )
+        return -EINVAL;
+
+    if ( !hwp_active() )
+        return -EINVAL;
+
+    return set_hwp_para(policy, &op->u.set_cppc);
+}
+
 int do_pm_op(struct xen_sysctl_pm_op *op)
 {
     int ret = 0;
@@ -472,6 +485,10 @@ int do_pm_op(struct xen_sysctl_pm_op *op)
         break;
     }
 
+    case SET_CPUFREQ_CPPC:
+        ret = set_cpufreq_cppc(op);
+        break;
+
     case GET_CPUFREQ_AVGFREQ:
     {
         op->u.get_avgfreq = cpufreq_driver_getavg(op->cpuid, USR_GETAVG);
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index c5636edf0e..c6b8c991b4 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -256,5 +256,7 @@ int hwp_register_driver(void);
 bool hwp_active(void);
 int get_hwp_para(unsigned int cpu,
                  struct xen_cppc_para *cppc_para);
+int set_hwp_para(struct cpufreq_policy *policy,
+                 struct xen_set_cppc_para *set_cppc);
 
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index a5f8369116..2457bf5e8f 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -351,6 +351,68 @@ struct xen_cppc_para {
     uint32_t activity_window;
 };
 
+/*
+ * Set CPPC values.
+ *
+ * Configure the parameters for CPPC.  Set bits in set_params control which
+ * values are applied.  If a bit is not set in set_params, the field must be
+ * zero.
+ *
+ * For HWP specifically, values must be limited to 0-255 or within
+ * XEN_SYSCTL_CPPC_ACT_WINDOW_MASK for activity window.  Set bits outside the
+ * range will be returned as -EINVAL.
+ *
+ * Activity Window may not be supported by the hardware.  In that case, the
+ * returned set_params will clear XEN_SYSCTL_CPPC_SET_ACT_WINDOW to indicate
+ * that it was not applied - though the rest of the values will be applied.
+ *
+ * There are a set of presets along with individual fields.  Presets are
+ * applied first, and then individual fields.  This allows customizing
+ * a preset without having to specify every value.
+ *
+ * The preset options values are as follows:
+ *
+ * preset      | minimum | maxium  | energy_perf
+ * ------------+---------+---------+----------------
+ * powersave   | lowest  | lowest  | powersave (255)
+ * ------------+---------+---------+----------------
+ * balance     | lowest  | highest | balance (128)
+ * ------------+---------+---------+----------------
+ * performance | highest | highest | performance (0)
+ *
+ * desired and activity_window are set to 0, hardware selected.
+ */
+struct xen_set_cppc_para {
+#define XEN_SYSCTL_CPPC_SET_MINIMUM              (1U << 0)
+#define XEN_SYSCTL_CPPC_SET_MAXIMUM              (1U << 1)
+#define XEN_SYSCTL_CPPC_SET_DESIRED              (1U << 2)
+#define XEN_SYSCTL_CPPC_SET_ENERGY_PERF          (1U << 3)
+#define XEN_SYSCTL_CPPC_SET_ACT_WINDOW           (1U << 4)
+#define XEN_SYSCTL_CPPC_SET_PRESET_MASK          0xf0000000
+#define XEN_SYSCTL_CPPC_SET_PRESET_NONE          0x00000000
+#define XEN_SYSCTL_CPPC_SET_PRESET_BALANCE       0x10000000
+#define XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE     0x20000000
+#define XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE   0x30000000
+#define XEN_SYSCTL_CPPC_SET_PARAM_MASK \
+        (XEN_SYSCTL_CPPC_SET_PRESET_MASK | \
+         XEN_SYSCTL_CPPC_SET_MINIMUM     | \
+         XEN_SYSCTL_CPPC_SET_MAXIMUM     | \
+         XEN_SYSCTL_CPPC_SET_DESIRED     | \
+         XEN_SYSCTL_CPPC_SET_ENERGY_PERF | \
+         XEN_SYSCTL_CPPC_SET_ACT_WINDOW  )
+    /* IN/OUT */
+    uint32_t set_params; /* bitflags for valid values */
+    /* See comments in struct xen_cppc_para. */
+    /* IN */
+    uint32_t minimum;
+    uint32_t maximum;
+    uint32_t desired;
+    uint32_t energy_perf;
+#define XEN_SYSCTL_CPPC_ACT_WINDOW_MASK (XEN_CPPC_ACT_WINDOW_MANTISSA_MASK | \
+                                         XEN_CPPC_ACT_WINDOW_EXPONENT_MASK)
+    uint32_t activity_window;
+};
+
 #define XEN_HWP_DRIVER_NAME "hwp"
 
 /*
@@ -418,6 +480,7 @@ struct xen_sysctl_pm_op {
     #define SET_CPUFREQ_GOV            (CPUFREQ_PARA | 0x02)
     #define SET_CPUFREQ_PARA           (CPUFREQ_PARA | 0x03)
     #define GET_CPUFREQ_AVGFREQ        (CPUFREQ_PARA | 0x04)
+    #define SET_CPUFREQ_CPPC           (CPUFREQ_PARA | 0x05)
 
     /* set/reset scheduler power saving option */
     #define XEN_SYSCTL_pm_op_set_sched_opt_smt    0x21
@@ -444,6 +507,7 @@ struct xen_sysctl_pm_op {
         struct xen_get_cpufreq_para get_para;
         struct xen_set_cpufreq_gov  set_gov;
         struct xen_set_cpufreq_para set_para;
+        struct xen_set_cppc_para    set_cppc;
         uint64_aligned_t get_avgfreq;
         uint32_t                    set_sched_opt_smt;
 #define XEN_SYSCTL_CX_UNLIMITED 0xffffffff
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 13/15] libxc: Add xc_set_cpufreq_cppc
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (11 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 14/15] xenpm: Add set-cpufreq-cppc subcommand Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 15/15] CHANGELOG: Add Intel HWP entry Jason Andryuk
  14 siblings, 0 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Juergen Gross

Add xc_set_cpufreq_cppc to allow calling xen_systctl_pm_op
SET_CPUFREQ_CPPC.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
---
v2:
Mark xc_set_hwp_para_t const

v4:
s/hwp/cppc/
Add Anthony's Ack

v5:
Remove const and copy back result
---
 tools/include/xenctrl.h |  4 ++++
 tools/libs/ctrl/xc_pm.c | 23 +++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 2092632296..52f42fb5b6 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1930,11 +1930,15 @@ struct xc_get_cpufreq_para {
     int32_t turbo_enabled;
 };
 
+typedef struct xen_set_cppc_para xc_set_cppc_para_t;
+
 int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
                         struct xc_get_cpufreq_para *user_para);
 int xc_set_cpufreq_gov(xc_interface *xch, int cpuid, char *govname);
 int xc_set_cpufreq_para(xc_interface *xch, int cpuid,
                         int ctrl_type, int ctrl_value);
+int xc_set_cpufreq_cppc(xc_interface *xch, int cpuid,
+                        xc_set_cppc_para_t *set_cppc);
 int xc_get_cpufreq_avgfreq(xc_interface *xch, int cpuid, int *avg_freq);
 
 int xc_set_sched_opt_smt(xc_interface *xch, uint32_t value);
diff --git a/tools/libs/ctrl/xc_pm.c b/tools/libs/ctrl/xc_pm.c
index cea3eab22e..1f267147f6 100644
--- a/tools/libs/ctrl/xc_pm.c
+++ b/tools/libs/ctrl/xc_pm.c
@@ -329,6 +329,29 @@ int xc_set_cpufreq_para(xc_interface *xch, int cpuid,
     return xc_sysctl(xch, &sysctl);
 }
 
+int xc_set_cpufreq_cppc(xc_interface *xch, int cpuid,
+                        xc_set_cppc_para_t *set_cppc)
+{
+    DECLARE_SYSCTL;
+    int ret;
+
+    if ( !xch )
+    {
+        errno = EINVAL;
+        return -1;
+    }
+    sysctl.cmd = XEN_SYSCTL_pm_op;
+    sysctl.u.pm_op.cmd = SET_CPUFREQ_CPPC;
+    sysctl.u.pm_op.cpuid = cpuid;
+    sysctl.u.pm_op.u.set_cppc = *set_cppc;
+
+    ret = xc_sysctl(xch, &sysctl);
+
+    *set_cppc = sysctl.u.pm_op.u.set_cppc;
+
+    return ret;
+}
+
 int xc_get_cpufreq_avgfreq(xc_interface *xch, int cpuid, int *avg_freq)
 {
     int ret = 0;
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 14/15] xenpm: Add set-cpufreq-cppc subcommand
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (12 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 13/15] libxc: Add xc_set_cpufreq_cppc Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  2023-07-06 18:54 ` [PATCH v5 15/15] CHANGELOG: Add Intel HWP entry Jason Andryuk
  14 siblings, 0 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Jan Beulich

set-cpufreq-cppc allows setting the Hardware P-State (HWP) parameters.

It can be run on all or just a single cpu.  There are presets of
balance, powersave & performance.  Those can be further tweaked by
param:val arguments as explained in the usage description.

Parameter names are just checked to the first 3 characters to shorten
typing.

Some options are hardware dependent, and ranges can be found in
get-cpufreq-para.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v5:
Make i unsigned int
Use local max_cpuid instead of max_cpu_nr
Add Jan's Ack
Check set_params and print a message if activity window wasn't set

v4:
Remove energy bias 0-15 & 7 references
Use MASK_INSR
Fixup { placement
Drop extra case in parse_activity_window
strcmp suffix
Expand help text
s/hwp/cppc/
Use isdigit() to check cpuid - otherwise run on all CPUs.

v2:
Compare provided parameter name and not just 3 characters.
Use "-" in parameter names
Remove hw_
Replace sscanf with strchr & strtoul.
Remove toplevel error message with lower level ones.
Help text s/127/128/
Help text mention truncation.
Avoid some truncation rounding down by adding 5 before division.
Help test mention default microseconds
Also comment the limit check written to avoid overflow.
---
 tools/misc/xenpm.c | 244 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 244 insertions(+)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 3abd99fd20..0877f43946 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -16,6 +16,8 @@
  */
 #define MAX_NR_CPU 512
 
+#include <ctype.h>
+#include <limits.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
@@ -67,6 +69,30 @@ void show_help(void)
             " set-max-cstate        <num>|'unlimited' [<num2>|'unlimited']\n"
             "                                     set the C-State limitation (<num> >= 0) and\n"
             "                                     optionally the C-sub-state limitation (<num2> >= 0)\n"
+            " set-cpufreq-cppc      [cpuid] [balance|performance|powersave] <param:val>*\n"
+            "                                     set Hardware P-State (HWP) parameters\n"
+            "                                     on CPU <cpuid> or all if omitted.\n"
+            "                                     optionally a preset of one of:\n"
+            "                                       balance|performance|powersave\n"
+            "                                     an optional list of param:val arguments\n"
+            "                                       minimum:N (0-255)\n"
+            "                                       maximum:N (0-255)\n"
+            "                                           get-cpufreq-para lowest/highest\n"
+            "                                           values are limits for\n"
+            "                                           minumum/maximum.\n"
+            "                                       desired:N (0-255)\n"
+            "                                           set explicit performance target.\n"
+            "                                           non-zero disables auto-HWP mode.\n"
+            "                                       energy-perf:N (0-255)\n"
+            "                                                   energy/performance hint\n"
+            "                                                   lower - favor performance\n"
+            "                                                   higher - favor powersave\n"
+            "                                                   128 - balance\n"
+            "                                       act-window:N{,m,u}s range 1us-1270s\n"
+            "                                           window for internal calculations.\n"
+            "                                           units default to \"us\" if unspecified.\n"
+            "                                           truncates un-representable values.\n"
+            "                                           0 lets the hardware decide.\n"
             " start [seconds]                     start collect Cx/Px statistics,\n"
             "                                     output after CTRL-C or SIGINT or several seconds.\n"
             " enable-turbo-mode     [cpuid]       enable Turbo Mode for processors that support it.\n"
@@ -1292,6 +1318,223 @@ void disable_turbo_mode(int argc, char *argv[])
                 errno, strerror(errno));
 }
 
+/*
+ * Parse activity_window:NNN{us,ms,s} and validate range.
+ *
+ * Activity window is a 7bit mantissa (0-127) with a 3bit exponent (0-7) base
+ * 10 in microseconds.  So the range is 1 microsecond to 1270 seconds.  A value
+ * of 0 lets the hardware autonomously select the window.
+ *
+ * Return 0 on success
+ *       -1 on error
+ */
+static int parse_activity_window(xc_set_cppc_para_t *set_cppc, unsigned long u,
+                                 const char *suffix)
+{
+    unsigned int exponent = 0;
+    unsigned int multiplier = 1;
+
+    if ( suffix && suffix[0] )
+    {
+        if ( strcmp(suffix, "s") == 0 )
+        {
+            multiplier = 1000 * 1000;
+            exponent = 6;
+        }
+        else if ( strcmp(suffix, "ms") == 0 )
+        {
+            multiplier = 1000;
+            exponent = 3;
+        }
+        else if ( strcmp(suffix, "us") != 0 )
+        {
+            fprintf(stderr, "invalid activity window units: \"%s\"\n", suffix);
+
+            return -1;
+        }
+    }
+
+    /* u * multipler > 1270 * 1000 * 1000 transformed to avoid overflow. */
+    if ( u > 1270 * 1000 * 1000 / multiplier )
+    {
+        fprintf(stderr, "activity window is too large\n");
+
+        return -1;
+    }
+
+    /* looking for 7 bits of mantissa and 3 bits of exponent */
+    while ( u > 127 )
+    {
+        u += 5; /* Round up to mitigate truncation rounding down
+                   e.g. 128 -> 120 vs 128 -> 130. */
+        u /= 10;
+        exponent += 1;
+    }
+
+    set_cppc->activity_window =
+        MASK_INSR(exponent, XEN_CPPC_ACT_WINDOW_EXPONENT_MASK) |
+        MASK_INSR(u, XEN_CPPC_ACT_WINDOW_MANTISSA_MASK);
+    set_cppc->set_params |= XEN_SYSCTL_CPPC_SET_ACT_WINDOW;
+
+    return 0;
+}
+
+static int parse_cppc_opts(xc_set_cppc_para_t *set_cppc, int *cpuid,
+                           int argc, char *argv[])
+{
+    int i = 0;
+
+    if ( argc < 1 )
+    {
+        fprintf(stderr, "Missing arguments\n");
+        return -1;
+    }
+
+    if ( isdigit(argv[i][0]) )
+    {
+        if ( sscanf(argv[i], "%d", cpuid) != 1 || *cpuid < 0 )
+        {
+            fprintf(stderr, "Could not parse cpuid \"%s\"\n", argv[i]);
+            return -1;
+        }
+
+        i++;
+    }
+
+    if ( i == argc )
+    {
+        fprintf(stderr, "Missing arguments\n");
+        return -1;
+    }
+
+    if ( strcasecmp(argv[i], "powersave") == 0 )
+    {
+        set_cppc->set_params = XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE;
+        i++;
+    }
+    else if ( strcasecmp(argv[i], "performance") == 0 )
+    {
+        set_cppc->set_params = XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE;
+        i++;
+    }
+    else if ( strcasecmp(argv[i], "balance") == 0 )
+    {
+        set_cppc->set_params = XEN_SYSCTL_CPPC_SET_PRESET_BALANCE;
+        i++;
+    }
+
+    for ( ; i < argc; i++)
+    {
+        unsigned long val;
+        char *param = argv[i];
+        char *value;
+        char *suffix;
+        int ret;
+
+        value = strchr(param, ':');
+        if ( value == NULL )
+        {
+            fprintf(stderr, "\"%s\" is an invalid cppc parameter\n", argv[i]);
+            return -1;
+        }
+
+        value[0] = '\0';
+        value++;
+
+        errno = 0;
+        val = strtoul(value, &suffix, 10);
+        if ( (errno && val == ULONG_MAX) || value == suffix )
+        {
+            fprintf(stderr, "Could not parse number \"%s\"\n", value);
+            return -1;
+        }
+
+        if ( strncasecmp(param, "act-window", strlen(param)) == 0 )
+        {
+            ret = parse_activity_window(set_cppc, val, suffix);
+            if (ret)
+                return -1;
+
+            continue;
+        }
+
+        if ( val > 255 )
+        {
+            fprintf(stderr, "\"%s\" value \"%lu\" is out of range\n", param,
+                    val);
+            return -1;
+        }
+
+        if ( suffix && suffix[0] )
+        {
+            fprintf(stderr, "Suffix \"%s\" is invalid\n", suffix);
+            return -1;
+        }
+
+        if ( strncasecmp(param, "minimum", MAX(2, strlen(param))) == 0 )
+        {
+            set_cppc->minimum = val;
+            set_cppc->set_params |= XEN_SYSCTL_CPPC_SET_MINIMUM;
+        }
+        else if ( strncasecmp(param, "maximum", MAX(2, strlen(param))) == 0 )
+        {
+            set_cppc->maximum = val;
+            set_cppc->set_params |= XEN_SYSCTL_CPPC_SET_MAXIMUM;
+        }
+        else if ( strncasecmp(param, "desired", strlen(param)) == 0 )
+        {
+            set_cppc->desired = val;
+            set_cppc->set_params |= XEN_SYSCTL_CPPC_SET_DESIRED;
+        }
+        else if ( strncasecmp(param, "energy-perf", strlen(param)) == 0 )
+        {
+            set_cppc->energy_perf = val;
+            set_cppc->set_params |= XEN_SYSCTL_CPPC_SET_ENERGY_PERF;
+        }
+        else
+        {
+            fprintf(stderr, "\"%s\" is an invalid parameter\n", param);
+            return -1;
+        }
+    }
+
+    if ( set_cppc->set_params == 0 )
+    {
+        fprintf(stderr, "No parameters set in request\n");
+        return -1;
+    }
+
+    return 0;
+}
+
+static void cppc_set_func(int argc, char *argv[])
+{
+    xc_set_cppc_para_t set_cppc = {};
+    unsigned int max_cpuid = max_cpu_nr;
+    int cpuid = -1;
+    unsigned int i = 0;
+    uint32_t set_params;
+
+    if ( parse_cppc_opts(&set_cppc, &cpuid, argc, argv) )
+        exit(EINVAL);
+
+    if ( cpuid != -1 )
+    {
+        i = cpuid;
+        max_cpuid = i + 1;
+    }
+
+    set_params = set_cppc.set_params;
+    for ( ; i < max_cpuid; i++ ) {
+        if ( xc_set_cpufreq_cppc(xc_handle, i, &set_cppc) )
+            fprintf(stderr, "[CPU%d] failed to set cppc params (%d - %s)\n",
+                    i, errno, strerror(errno));
+    }
+
+    if ( (set_params ^ set_cppc.set_params) & XEN_SYSCTL_CPPC_SET_ACT_WINDOW )
+        printf("Activity window not supported and omitted\n");
+}
+
 struct {
     const char *name;
     void (*function)(int argc, char *argv[]);
@@ -1302,6 +1545,7 @@ struct {
     { "get-cpufreq-average", cpufreq_func },
     { "start", start_gather_func },
     { "get-cpufreq-para", cpufreq_para_func },
+    { "set-cpufreq-cppc", cppc_set_func },
     { "set-scaling-maxfreq", scaling_max_freq_func },
     { "set-scaling-minfreq", scaling_min_freq_func },
     { "set-scaling-governor", scaling_governor_func },
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 15/15] CHANGELOG: Add Intel HWP entry
  2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (13 preceding siblings ...)
  2023-07-06 18:54 ` [PATCH v5 14/15] xenpm: Add set-cpufreq-cppc subcommand Jason Andryuk
@ 2023-07-06 18:54 ` Jason Andryuk
  14 siblings, 0 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-06 18:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Henry Wang, Community Manager

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Henry Wang <Henry.Wang@arm.com>
---
v3:
Position under existing Added section
Add Henry's Ack

v2:
Add blank line
---
 CHANGELOG.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7d7e0590f8..8d6e6c3088 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -24,7 +24,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
  - xl/libxl can customize SMBIOS strings for HVM guests.
  - Add support for AVX512-FP16 on x86.
  - On Arm, Xen supports guests running SVE/SVE2 instructions. (Tech Preview)
-
+ - Add Intel Hardware P-States (HWP) cpufreq driver.
 
 ## [4.17.0](https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=RELEASE-4.17.0) - 2022-12-12
 
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 01/15] cpufreq: Allow restricting to internal governors only
  2023-07-06 18:54 ` [PATCH v5 01/15] cpufreq: Allow restricting to internal governors only Jason Andryuk
@ 2023-07-10 11:41   ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2023-07-10 11:41 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: xen-devel

On 06.07.2023 20:54, Jason Andryuk wrote:
> For hwp, the standard governors are not usable, and only the internal
> one is applicable.  Add the cpufreq_governor_internal boolean to
> indicate when an internal governor, like hwp, will be used.  This is set
> during presmp_initcall, and governor registration can be skipped when
> called during initcall.
> 
> This way unusable governors are not registered, and only compatible
> governors are advertised to userspace.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

Acked-by: Jan Beulich <jbeulich@suse.com>




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 04/15] xen/sysctl: Nest cpufreq scaling options
  2023-07-06 18:54 ` [PATCH v5 04/15] xen/sysctl: Nest cpufreq scaling options Jason Andryuk
@ 2023-07-10 11:51   ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2023-07-10 11:51 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Wei Liu, Anthony PERARD, Juergen Gross, Andrew Cooper,
	George Dunlap, Julien Grall, Stefano Stabellini, xen-devel

On 06.07.2023 20:54, Jason Andryuk wrote:
> Add a union and struct so that most of the scaling variables of struct
> xen_get_cpufreq_para are within in a binary-compatible layout.  This
> allows cppc_para to live in the larger union and use uint32_ts - struct
> xen_cppc_para will be 10 uint32_t's.
> 
> The new scaling struct is 3 * uint32_t + 16 bytes CPUFREQ_NAME_LEN + 4 *
> uint32_t for xen_ondemand = 11 uint32_t.  That means the old size is
> retained, int32_t turbo_enabled doesn't move and it's binary compatible.
> 
> The out-of-context memcpy() in xc_get_cpufreq_para() now handles the
> copying of the fields removed there.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Nevertheless I continue to be uncertain about ...

> --- a/tools/include/xenctrl.h
> +++ b/tools/include/xenctrl.h
> @@ -1909,16 +1909,20 @@ struct xc_get_cpufreq_para {
>      uint32_t cpuinfo_cur_freq;
>      uint32_t cpuinfo_max_freq;
>      uint32_t cpuinfo_min_freq;
> -    uint32_t scaling_cur_freq;
> -
> -    char scaling_governor[CPUFREQ_NAME_LEN];
> -    uint32_t scaling_max_freq;
> -    uint32_t scaling_min_freq;
> -
> -    /* for specific governor */
>      union {
> -        xc_userspace_t userspace;
> -        xc_ondemand_t ondemand;
> +        struct {
> +            uint32_t scaling_cur_freq;
> +
> +            char scaling_governor[CPUFREQ_NAME_LEN];
> +            uint32_t scaling_max_freq;
> +            uint32_t scaling_min_freq;
> +
> +            /* for specific governor */
> +            union {
> +                xc_userspace_t userspace;
> +                xc_ondemand_t ondemand;
> +            } u;
> +        } s;
>      } u;
>  
>      int32_t turbo_enabled;

... all of this: Parts of the struct can apparently go out of sync with
the sysctl struct, but other parts have to remain in sync without there
being an appropriate build-time check (checking merely sizes clearly
isn't enough). Therefore I'd really like to have a toolstack side
review / ack here as well.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 05/15] pmstat&xenpm: Re-arrage for cpufreq union
  2023-07-06 18:54 ` [PATCH v5 05/15] pmstat&xenpm: Re-arrage for cpufreq union Jason Andryuk
@ 2023-07-10 12:09   ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2023-07-10 12:09 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, Juergen Gross, xen-devel

On 06.07.2023 20:54, Jason Andryuk wrote:
> Rearrange code now that xen_sysctl_pm_op's get_para fields has the
> nested union and struct.  In particular, the scaling governor
> information like scaling_available_governors is inside the union, so it
> is not always available.  Move those fields (op->u.get_para.u.s.u.*)
> together as well as the common fields (ones outside the union like
> op->u.get_para.turbo_enabled).
> 
> With that, gov_num may be 0, so bounce buffer handling needs
> to be modified.
> 
> scaling_governor and other fields inside op->u.get_para.u.s.u.* won't be
> used for hwp, so this will simplify the change when hwp support is
> introduced and re-indents these lines all together.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

Acked-by: Jan Beulich <jbeulich@suse.com>




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-07-06 18:54 ` [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
@ 2023-07-10 13:13   ` Jan Beulich
  2023-07-10 15:22     ` Jason Andryuk
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2023-07-10 13:13 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 06.07.2023 20:54, Jason Andryuk wrote:
> @@ -510,6 +510,22 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
>  * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
>    respectively.
>  * `verbose` option can be included as a string or also as `verbose=<integer>`
> +  for `xen`.  It is a boolean for `hwp`.
> +* `hwp` selects Hardware-Controlled Performance States (HWP) on supported Intel
> +  hardware.  HWP is a Skylake+ feature which provides better CPU power
> +  management.  The default is disabled.  If `hwp` is selected, but hardware
> +  support is not available, Xen will fallback to cpufreq=xen.
> +* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
> +  processor to autonomously force physical package components into idle state.
> +  The default is enabled, but the option only applies when `hwp` is enabled.
> +
> +There is also support for `;`-separated fallback options:
> +`cpufreq=hwp,verbose;xen`.  This first tries `hwp` and falls back to `xen`
> +if unavailable.

In the given example, does "verbose" also apply to the fallback case? If so,
perhaps better "cpufreq=hwp;xen,verbose", to eliminate that ambiguity?

> --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
> +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
> @@ -642,7 +642,24 @@ static int __init cf_check cpufreq_driver_init(void)
>          switch ( boot_cpu_data.x86_vendor )
>          {
>          case X86_VENDOR_INTEL:
> -            ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> +            unsigned int i;

At the moment we still don't mix declarations and statements, i.e. all
declarations have to be at the top of a block/scope. What iirc we do use
in a couple of places (and what hence you may want to do here as well) is
...

> +            ret = -ENOENT;
> +
> +            for ( i = 0; i < cpufreq_xen_cnt; i++ )

... declare the induction variable inside the loop header.

> +            {
> +                switch ( cpufreq_xen_opts[i] )
> +                {
> +                case CPUFREQ_xen:
> +                    ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> +                    break;
> +                case CPUFREQ_hwp:
> +                    ret = hwp_register_driver();
> +                    break;
> +                }
> +
> +                if ( ret == 0 )
> +                    break;
> +            }
>              break;

In this model any kind of failure results in the fallback to be tried
(and the fallback's error to be returned to the caller rather than
the primary one). This may or may not be what we actually want;
personally I would have expected

                if ( ret != -ENODEV )
                    break;

or some such instead.

> +static bool hwp_handle_option(const char *s, const char *end)
> +{
> +    int ret;
> +
> +    ret = parse_boolean("verbose", s, end);
> +    if ( ret >= 0 ) {

Nit: Style (brace placement).

> +        cpufreq_verbose = ret;
> +        return true;
> +    }
> +
> +    ret = parse_boolean("hdc", s, end);
> +    if ( ret >= 0 ) {

Same here.

> +        opt_cpufreq_hdc = ret;
> +        return true;
> +    }
> +
> +    return false;
> +}
> +
> +int __init hwp_cmdline_parse(const char *s, const char *e)
> +{
> +    do
> +    {
> +        const char *end = strpbrk(s, ",;");
> +
> +        if ( s && !hwp_handle_option(s, end) )

This check of s not being NULL comes too late, as strpbrk() would have
de-referenced it already. Considering ...

> +        {
> +            printk(XENLOG_WARNING "cpufreq/hwp: option '%s' not recognized\n",
> +                   s);
> +
> +            return -1;
> +        }
> +
> +        s = end ? ++end : end;
> +    } while ( s && s < e );

... this it probably wants to move even ahead of the loop.

> +static int hdc_set_pkg_hdc_ctl(unsigned int cpu, bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_PKG_HDC_CTL, msr) )
> +    {
> +        hwp_err(cpu, "rdmsr_safe(MSR_PKG_HDC_CTL)\n");
> +        return -1;
> +    }
> +
> +    if ( val )
> +        msr |= PKG_HDC_CTL_HDC_PKG_ENABLE;
> +    else
> +        msr &= ~PKG_HDC_CTL_HDC_PKG_ENABLE;
> +
> +    if ( wrmsr_safe(MSR_PKG_HDC_CTL, msr) )
> +    {
> +        hwp_err(cpu, "wrmsr_safe(MSR_PKG_HDC_CTL): %016lx\n", msr);
> +        return -1;
> +    }
> +
> +    return 0;
> +}

Please can you use either boolean return values or proper 0 / -errno
ones? (Same again then in the subsequent function.)

> +static void cf_check hwp_init_msrs(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    uint64_t val;
> +
> +    /*
> +     * Package level MSR, but we don't have a good idea of packages here, so
> +     * just do it everytime.
> +     */
> +    if ( rdmsr_safe(MSR_PM_ENABLE, val) )
> +    {
> +        hwp_err(policy->cpu, "rdmsr_safe(MSR_PM_ENABLE)\n");
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    /* Ensure we don't generate interrupts */
> +    if ( feature_hwp_notification )
> +        wrmsr_safe(MSR_HWP_INTERRUPT, 0);
> +
> +    hwp_verbose("CPU%u: MSR_PM_ENABLE: %016lx\n", policy->cpu, val);
> +    if ( !(val & PM_ENABLE_HWP_ENABLE) )
> +    {
> +        val |= PM_ENABLE_HWP_ENABLE;
> +        if ( wrmsr_safe(MSR_PM_ENABLE, val) )
> +        {
> +            hwp_err(policy->cpu, "wrmsr_safe(MSR_PM_ENABLE, %lx)\n", val);
> +            data->curr_req.raw = -1;
> +            return;
> +        }
> +    }
> +
> +    if ( rdmsr_safe(MSR_HWP_CAPABILITIES, data->hwp_caps) )
> +    {
> +        hwp_err(policy->cpu, "rdmsr_safe(MSR_HWP_CAPABILITIES)\n");
> +        goto error;
> +    }
> +
> +    if ( rdmsr_safe(MSR_HWP_REQUEST, data->curr_req.raw) )
> +    {
> +        hwp_err(policy->cpu, "rdmsr_safe(MSR_HWP_REQUEST)\n");
> +        goto error;
> +    }
> +
> +    /*
> +     * Check for APERF/MPERF support in hardware
> +     * also check for boost/turbo support
> +     */
> +    intel_feature_detect(policy);
> +
> +    if ( feature_hdc )
> +    {
> +        if ( hdc_set_pkg_hdc_ctl(policy->cpu, opt_cpufreq_hdc) ||
> +             hdc_set_pm_ctl1(policy->cpu, opt_cpufreq_hdc) ) {

Please can these two if()s be joined and the well-placed brace be
retained?

> +            hwp_err(policy->cpu, "Disabling HDC support\n");
> +            feature_hdc = false;
> +            goto error;

Why? Can't you continue just with HDC turned off?

> +static void cf_check hwp_write_request(void *info)
> +{
> +    const struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    union hwp_request hwp_req = data->curr_req;
> +
> +    data->ret = 0;
> +
> +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(hwp_req.raw));

You changed only the right side to not be sizeof(<type>).

> +static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data;
> +
> +    data = xzalloc(struct hwp_drv_data);
> +    if ( !data )
> +        return -ENOMEM;
> +
> +    policy->governor = &cpufreq_gov_hwp;
> +
> +    per_cpu(hwp_drv_data, cpu) = data;
> +
> +    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);

Could I talk you into moving the helper function immediately ahead of
this (sole) one using it, much like you have it for hwp_cpufreq_target()
and hwp_write_request()?

> +    if ( data->curr_req.raw == -1 )
> +    {
> +        hwp_err(cpu, "Could not initialize HWP properly\n");
> +        per_cpu(hwp_drv_data, cpu) = NULL;
> +        xfree(data);
> +        return -ENODEV;
> +    }
> +
> +    data->minimum = data->curr_req.min_perf;
> +    data->maximum = data->curr_req.max_perf;
> +    data->desired = data->curr_req.desired;
> +    data->energy_perf = data->curr_req.energy_perf;
> +    data->activity_window = data->curr_req.activity_window;
> +
> +    if ( cpu == 0 )
> +        hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);

While I'm fine with this (perhaps apart from you using "cpu == 0",
which is an idiom we're trying to get rid of), ...

> +    hwp_verbose("CPU%u: rdmsr HWP_REQUEST %016lx\n", cpu, data->curr_req.raw);

... this once-per-CPU message still looks to verbose to me. Perhaps
for both:
- print for the BSP,
- print when AP value differs from BSP (albeit I don't know how
[un]likely that is)?

> +static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> +{
> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
> +    per_cpu(hwp_drv_data, policy->cpu) = NULL;
> +    xfree(data);

Nit: Style (blank line between declaration(s) and statement(s) please.
(Also at least once again below.)

> --- a/xen/drivers/cpufreq/cpufreq.c
> +++ b/xen/drivers/cpufreq/cpufreq.c
> @@ -63,12 +63,18 @@ LIST_HEAD_READ_MOSTLY(cpufreq_governor_list);
>  /* set xen as default cpufreq */
>  enum cpufreq_controller cpufreq_controller = FREQCTL_xen;
>  
> -static int __init cpufreq_cmdline_parse(const char *s);
> +enum cpufreq_xen_opt cpufreq_xen_opts[2] = { CPUFREQ_xen, };
> +unsigned int cpufreq_xen_cnt = 1;

Looks like both can be __initdata?

As to the array initializer: For one Misra won't like the 2nd slot not
initialized. Plus the implicit 0 there is nothing else than CPUFREQ_xen,
which also ends up a little fragile. Perhaps 0 wants to stand for
CPUFREQ_none (or whatever name you deem appropriate)?

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-07-10 13:13   ` Jan Beulich
@ 2023-07-10 15:22     ` Jason Andryuk
  2023-07-11  8:18       ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Andryuk @ 2023-07-10 15:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On Mon, Jul 10, 2023 at 9:13 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 06.07.2023 20:54, Jason Andryuk wrote:
> > @@ -510,6 +510,22 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
> >  * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
> >    respectively.
> >  * `verbose` option can be included as a string or also as `verbose=<integer>`
> > +  for `xen`.  It is a boolean for `hwp`.
> > +* `hwp` selects Hardware-Controlled Performance States (HWP) on supported Intel
> > +  hardware.  HWP is a Skylake+ feature which provides better CPU power
> > +  management.  The default is disabled.  If `hwp` is selected, but hardware
> > +  support is not available, Xen will fallback to cpufreq=xen.
> > +* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
> > +  processor to autonomously force physical package components into idle state.
> > +  The default is enabled, but the option only applies when `hwp` is enabled.
> > +
> > +There is also support for `;`-separated fallback options:
> > +`cpufreq=hwp,verbose;xen`.  This first tries `hwp` and falls back to `xen`
> > +if unavailable.
>
> In the given example, does "verbose" also apply to the fallback case? If so,
> perhaps better "cpufreq=hwp;xen,verbose", to eliminate that ambiguity?

Yes, "verbose" is applied to both.  I can make the change.  I
mentioned it in the commit message, but I'll mention it here as well.

> > --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
> > +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
> > @@ -642,7 +642,24 @@ static int __init cf_check cpufreq_driver_init(void)
> >          switch ( boot_cpu_data.x86_vendor )
> >          {
> >          case X86_VENDOR_INTEL:
> > -            ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> > +            unsigned int i;
>
> At the moment we still don't mix declarations and statements, i.e. all
> declarations have to be at the top of a block/scope. What iirc we do use
> in a couple of places (and what hence you may want to do here as well) is
> ...
>
> > +            ret = -ENOENT;
> > +
> > +            for ( i = 0; i < cpufreq_xen_cnt; i++ )
>
> ... declare the induction variable inside the loop header.

Sounds good, thanks.

> > +            {
> > +                switch ( cpufreq_xen_opts[i] )
> > +                {
> > +                case CPUFREQ_xen:
> > +                    ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> > +                    break;
> > +                case CPUFREQ_hwp:
> > +                    ret = hwp_register_driver();
> > +                    break;
> > +                }
> > +
> > +                if ( ret == 0 )
> > +                    break;
> > +            }
> >              break;
>
> In this model any kind of failure results in the fallback to be tried
> (and the fallback's error to be returned to the caller rather than
> the primary one). This may or may not be what we actually want;
> personally I would have expected
>
>                 if ( ret != -ENODEV )
>                     break;
>
> or some such instead.

I guess this comes back to our fruit preferences. :)

I can switch it around like that, and make hwp_register_driver()
return -ENODEV for hwp_available() returning false.

> > +static bool hwp_handle_option(const char *s, const char *end)
> > +{
> > +    int ret;
> > +
> > +    ret = parse_boolean("verbose", s, end);
> > +    if ( ret >= 0 ) {
>
> Nit: Style (brace placement).
>
> > +        cpufreq_verbose = ret;
> > +        return true;
> > +    }
> > +
> > +    ret = parse_boolean("hdc", s, end);
> > +    if ( ret >= 0 ) {
>
> Same here.

Thanks.  Sorry about those.

> > +        opt_cpufreq_hdc = ret;
> > +        return true;
> > +    }
> > +
> > +    return false;
> > +}
> > +
> > +int __init hwp_cmdline_parse(const char *s, const char *e)
> > +{
> > +    do
> > +    {
> > +        const char *end = strpbrk(s, ",;");
> > +
> > +        if ( s && !hwp_handle_option(s, end) )
>
> This check of s not being NULL comes too late, as strpbrk() would have
> de-referenced it already. Considering ...
>
> > +        {
> > +            printk(XENLOG_WARNING "cpufreq/hwp: option '%s' not recognized\n",
> > +                   s);
> > +
> > +            return -1;
> > +        }
> > +
> > +        s = end ? ++end : end;
> > +    } while ( s && s < e );
>
> ... this it probably wants to move even ahead of the loop.

I'll switch from do/while to just while and then the NULL check will
be covered.  In practice, this function is never called with s ==
NULL.

> > +static int hdc_set_pkg_hdc_ctl(unsigned int cpu, bool val)
> > +{
> > +    uint64_t msr;
> > +
> > +    if ( rdmsr_safe(MSR_PKG_HDC_CTL, msr) )
> > +    {
> > +        hwp_err(cpu, "rdmsr_safe(MSR_PKG_HDC_CTL)\n");
> > +        return -1;
> > +    }
> > +
> > +    if ( val )
> > +        msr |= PKG_HDC_CTL_HDC_PKG_ENABLE;
> > +    else
> > +        msr &= ~PKG_HDC_CTL_HDC_PKG_ENABLE;
> > +
> > +    if ( wrmsr_safe(MSR_PKG_HDC_CTL, msr) )
> > +    {
> > +        hwp_err(cpu, "wrmsr_safe(MSR_PKG_HDC_CTL): %016lx\n", msr);
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
>
> Please can you use either boolean return values or proper 0 / -errno
> ones? (Same again then in the subsequent function.)

Sure, I'll use booleans.

> > +static void cf_check hwp_init_msrs(void *info)
> > +{
> > +    struct cpufreq_policy *policy = info;
> > +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> > +    uint64_t val;
> > +
> > +    /*
> > +     * Package level MSR, but we don't have a good idea of packages here, so
> > +     * just do it everytime.
> > +     */
> > +    if ( rdmsr_safe(MSR_PM_ENABLE, val) )
> > +    {
> > +        hwp_err(policy->cpu, "rdmsr_safe(MSR_PM_ENABLE)\n");
> > +        data->curr_req.raw = -1;
> > +        return;
> > +    }
> > +
> > +    /* Ensure we don't generate interrupts */
> > +    if ( feature_hwp_notification )
> > +        wrmsr_safe(MSR_HWP_INTERRUPT, 0);
> > +
> > +    hwp_verbose("CPU%u: MSR_PM_ENABLE: %016lx\n", policy->cpu, val);
> > +    if ( !(val & PM_ENABLE_HWP_ENABLE) )
> > +    {
> > +        val |= PM_ENABLE_HWP_ENABLE;
> > +        if ( wrmsr_safe(MSR_PM_ENABLE, val) )
> > +        {
> > +            hwp_err(policy->cpu, "wrmsr_safe(MSR_PM_ENABLE, %lx)\n", val);
> > +            data->curr_req.raw = -1;
> > +            return;
> > +        }
> > +    }
> > +
> > +    if ( rdmsr_safe(MSR_HWP_CAPABILITIES, data->hwp_caps) )
> > +    {
> > +        hwp_err(policy->cpu, "rdmsr_safe(MSR_HWP_CAPABILITIES)\n");
> > +        goto error;
> > +    }
> > +
> > +    if ( rdmsr_safe(MSR_HWP_REQUEST, data->curr_req.raw) )
> > +    {
> > +        hwp_err(policy->cpu, "rdmsr_safe(MSR_HWP_REQUEST)\n");
> > +        goto error;
> > +    }
> > +
> > +    /*
> > +     * Check for APERF/MPERF support in hardware
> > +     * also check for boost/turbo support
> > +     */
> > +    intel_feature_detect(policy);
> > +
> > +    if ( feature_hdc )
> > +    {
> > +        if ( hdc_set_pkg_hdc_ctl(policy->cpu, opt_cpufreq_hdc) ||
> > +             hdc_set_pm_ctl1(policy->cpu, opt_cpufreq_hdc) ) {
>
> Please can these two if()s be joined and the well-placed brace be
> retained?

Sure.

> > +            hwp_err(policy->cpu, "Disabling HDC support\n");
> > +            feature_hdc = false;
> > +            goto error;
>
> Why? Can't you continue just with HDC turned off?

Yes, that is what I intended to implement after your earlier review,
but I failed to actually delete the goto.

> > +static void cf_check hwp_write_request(void *info)
> > +{
> > +    const struct cpufreq_policy *policy = info;
> > +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> > +    union hwp_request hwp_req = data->curr_req;
> > +
> > +    data->ret = 0;
> > +
> > +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(hwp_req.raw));
>
> You changed only the right side to not be sizeof(<type>).

Updated.  I was just focused on removing the uint64_t from your earlier comment.

> > +static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > +{
> > +    unsigned int cpu = policy->cpu;
> > +    struct hwp_drv_data *data;
> > +
> > +    data = xzalloc(struct hwp_drv_data);
> > +    if ( !data )
> > +        return -ENOMEM;
> > +
> > +    policy->governor = &cpufreq_gov_hwp;
> > +
> > +    per_cpu(hwp_drv_data, cpu) = data;
> > +
> > +    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
>
> Could I talk you into moving the helper function immediately ahead of
> this (sole) one using it, much like you have it for hwp_cpufreq_target()
> and hwp_write_request()?

Yes. sounds good.  I'll move hdc_set_pkg_hdc_ctl(), hdc_set_pm_ctl1(),
hwp_get_cpu_speeds() as well since they are all called by
hwp_init_msrs().

> > +    if ( data->curr_req.raw == -1 )
> > +    {
> > +        hwp_err(cpu, "Could not initialize HWP properly\n");
> > +        per_cpu(hwp_drv_data, cpu) = NULL;
> > +        xfree(data);
> > +        return -ENODEV;
> > +    }
> > +
> > +    data->minimum = data->curr_req.min_perf;
> > +    data->maximum = data->curr_req.max_perf;
> > +    data->desired = data->curr_req.desired;
> > +    data->energy_perf = data->curr_req.energy_perf;
> > +    data->activity_window = data->curr_req.activity_window;
> > +
> > +    if ( cpu == 0 )
> > +        hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);
>
> While I'm fine with this (perhaps apart from you using "cpu == 0",
> which is an idiom we're trying to get rid of), ...

Oh, I didn't know that.  What is the preferred way to identify the
BSP?  This doesn't necessarily run on the BSP, so "cpu"/"policy->cpu"
is all we have to make a determination.

> > +    hwp_verbose("CPU%u: rdmsr HWP_REQUEST %016lx\n", cpu, data->curr_req.raw);
>
> ... this once-per-CPU message still looks to verbose to me. Perhaps
> for both:
> - print for the BSP,
> - print when AP value differs from BSP (albeit I don't know how
> [un]likely that is)?

On my test systems, the values have all been identical.  But your
differing values idea seems good.

> > +static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> > +{
> > +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
> > +    per_cpu(hwp_drv_data, policy->cpu) = NULL;
> > +    xfree(data);
>
> Nit: Style (blank line between declaration(s) and statement(s) please.
> (Also at least once again below.)
>
> > --- a/xen/drivers/cpufreq/cpufreq.c
> > +++ b/xen/drivers/cpufreq/cpufreq.c
> > @@ -63,12 +63,18 @@ LIST_HEAD_READ_MOSTLY(cpufreq_governor_list);
> >  /* set xen as default cpufreq */
> >  enum cpufreq_controller cpufreq_controller = FREQCTL_xen;
> >
> > -static int __init cpufreq_cmdline_parse(const char *s);
> > +enum cpufreq_xen_opt cpufreq_xen_opts[2] = { CPUFREQ_xen, };
> > +unsigned int cpufreq_xen_cnt = 1;
>
> Looks like both can be __initdata?

Yes, thanks.

> As to the array initializer: For one Misra won't like the 2nd slot not
> initialized. Plus the implicit 0 there is nothing else than CPUFREQ_xen,
> which also ends up a little fragile. Perhaps 0 wants to stand for
> CPUFREQ_none (or whatever name you deem appropriate)?

:) I had a CPUFREQ_none originally, but dropped it as there was no
need for one with cpufreq_xen_cnt controlling the iteration.  I'll add
it back.  (gcc 12 at least complains that the switch in
cpufreq_driver_init() needs to handle CPUFREQ_none, so I'll just have
it return 0 in that case.)

Thanks for the review.

Regards,
Jason


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-07-10 15:22     ` Jason Andryuk
@ 2023-07-11  8:18       ` Jan Beulich
  2023-07-11 14:16         ` Jason Andryuk
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2023-07-11  8:18 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 10.07.2023 17:22, Jason Andryuk wrote:
> On Mon, Jul 10, 2023 at 9:13 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 06.07.2023 20:54, Jason Andryuk wrote:
>>> @@ -510,6 +510,22 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
>>>  * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
>>>    respectively.
>>>  * `verbose` option can be included as a string or also as `verbose=<integer>`
>>> +  for `xen`.  It is a boolean for `hwp`.
>>> +* `hwp` selects Hardware-Controlled Performance States (HWP) on supported Intel
>>> +  hardware.  HWP is a Skylake+ feature which provides better CPU power
>>> +  management.  The default is disabled.  If `hwp` is selected, but hardware
>>> +  support is not available, Xen will fallback to cpufreq=xen.
>>> +* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
>>> +  processor to autonomously force physical package components into idle state.
>>> +  The default is enabled, but the option only applies when `hwp` is enabled.
>>> +
>>> +There is also support for `;`-separated fallback options:
>>> +`cpufreq=hwp,verbose;xen`.  This first tries `hwp` and falls back to `xen`
>>> +if unavailable.
>>
>> In the given example, does "verbose" also apply to the fallback case? If so,
>> perhaps better "cpufreq=hwp;xen,verbose", to eliminate that ambiguity?
> 
> Yes, "verbose" is applied to both.  I can make the change.  I
> mentioned it in the commit message, but I'll mention it here as well.

FTAOD my earlier comment implied that the spelling form you use above
should not even be accepted when parsing. I.e. it was not just about
the doc aspect.

>>> +            {
>>> +                switch ( cpufreq_xen_opts[i] )
>>> +                {
>>> +                case CPUFREQ_xen:
>>> +                    ret = cpufreq_register_driver(&acpi_cpufreq_driver);
>>> +                    break;
>>> +                case CPUFREQ_hwp:
>>> +                    ret = hwp_register_driver();
>>> +                    break;
>>> +                }
>>> +
>>> +                if ( ret == 0 )
>>> +                    break;
>>> +            }
>>>              break;
>>
>> In this model any kind of failure results in the fallback to be tried
>> (and the fallback's error to be returned to the caller rather than
>> the primary one). This may or may not be what we actually want;
>> personally I would have expected
>>
>>                 if ( ret != -ENODEV )
>>                     break;
>>
>> or some such instead.
> 
> I guess this comes back to our fruit preferences. :)

Does it? It's not just a style question here, but one of when / whether
to use the fallback.

> I can switch it around like that, and make hwp_register_driver()
> return -ENODEV for hwp_available() returning false.

Thanks.

>>> +int __init hwp_cmdline_parse(const char *s, const char *e)
>>> +{
>>> +    do
>>> +    {
>>> +        const char *end = strpbrk(s, ",;");
>>> +
>>> +        if ( s && !hwp_handle_option(s, end) )
>>
>> This check of s not being NULL comes too late, as strpbrk() would have
>> de-referenced it already. Considering ...
>>
>>> +        {
>>> +            printk(XENLOG_WARNING "cpufreq/hwp: option '%s' not recognized\n",
>>> +                   s);
>>> +
>>> +            return -1;
>>> +        }
>>> +
>>> +        s = end ? ++end : end;
>>> +    } while ( s && s < e );
>>
>> ... this it probably wants to move even ahead of the loop.
> 
> I'll switch from do/while to just while and then the NULL check will
> be covered.  In practice, this function is never called with s ==
> NULL.

In which case - why not leave things largely as they are, simply dropping
the odd check of s?

>>> +    if ( data->curr_req.raw == -1 )
>>> +    {
>>> +        hwp_err(cpu, "Could not initialize HWP properly\n");
>>> +        per_cpu(hwp_drv_data, cpu) = NULL;
>>> +        xfree(data);
>>> +        return -ENODEV;
>>> +    }
>>> +
>>> +    data->minimum = data->curr_req.min_perf;
>>> +    data->maximum = data->curr_req.max_perf;
>>> +    data->desired = data->curr_req.desired;
>>> +    data->energy_perf = data->curr_req.energy_perf;
>>> +    data->activity_window = data->curr_req.activity_window;
>>> +
>>> +    if ( cpu == 0 )
>>> +        hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);
>>
>> While I'm fine with this (perhaps apart from you using "cpu == 0",
>> which is an idiom we're trying to get rid of), ...
> 
> Oh, I didn't know that.  What is the preferred way to identify the
> BSP?

Sometimes we pass a separate boolean to functions, in other cases we
check whether a struct cpuinfo_x86 * equals &boot_cpu_info. The
latter clearly can't be used here, and the former doesn't look to be
a good fit either. However, ...

>  This doesn't necessarily run on the BSP, so "cpu"/"policy->cpu"
> is all we have to make a determination.

... isn't it, conversely, the case that the function only ever runs
on "cpu" when it is the BSP? In which case "cpu == smp_processor_id()"
ought to do the trick.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-07-11  8:18       ` Jan Beulich
@ 2023-07-11 14:16         ` Jason Andryuk
  2023-07-11 14:40           ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Andryuk @ 2023-07-11 14:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On Tue, Jul 11, 2023 at 4:18 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 10.07.2023 17:22, Jason Andryuk wrote:
> > On Mon, Jul 10, 2023 at 9:13 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> On 06.07.2023 20:54, Jason Andryuk wrote:
> >>> @@ -510,6 +510,22 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
> >>>  * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
> >>>    respectively.
> >>>  * `verbose` option can be included as a string or also as `verbose=<integer>`
> >>> +  for `xen`.  It is a boolean for `hwp`.
> >>> +* `hwp` selects Hardware-Controlled Performance States (HWP) on supported Intel
> >>> +  hardware.  HWP is a Skylake+ feature which provides better CPU power
> >>> +  management.  The default is disabled.  If `hwp` is selected, but hardware
> >>> +  support is not available, Xen will fallback to cpufreq=xen.
> >>> +* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
> >>> +  processor to autonomously force physical package components into idle state.
> >>> +  The default is enabled, but the option only applies when `hwp` is enabled.
> >>> +
> >>> +There is also support for `;`-separated fallback options:
> >>> +`cpufreq=hwp,verbose;xen`.  This first tries `hwp` and falls back to `xen`
> >>> +if unavailable.
> >>
> >> In the given example, does "verbose" also apply to the fallback case? If so,
> >> perhaps better "cpufreq=hwp;xen,verbose", to eliminate that ambiguity?
> >
> > Yes, "verbose" is applied to both.  I can make the change.  I
> > mentioned it in the commit message, but I'll mention it here as well.
>
> FTAOD my earlier comment implied that the spelling form you use above
> should not even be accepted when parsing. I.e. it was not just about
> the doc aspect.

Oh.  So what exactly do you want then?

There is a single cpufreq_verbose variable today that is set by either
cpufreq=hwp,verbose or cpufreq=xen,verbose.  Is that okay, or should
the "xen" and "hwp" each get a separate variable?

Do you only want to allow a single trailing "verbose" to apply to all
of cpufreq (cpufreq=$foo,verbose)?  Or do you want "verbose" to be
only valid for "xen"?  Both cpufreq_cmdline_parse() and
hwp_cmdline_parse() just loop over their options and don't care about
order, even though the documentation lists verbose last.  Would you
want "cpufreq=hwp,verbose,hdc" to fail to parse?

All parsing is done upfront before knowing whether "xen" or "hwp" will
be used as the cpufreq driver, so there is a trickiness for
implementing "verbose" only for one option.  Similarly,
"cpufreq=hwp,invalid;xen" will try "hwp" (but not "xen")  since the
live variables are updated.  Even without this patch, cpufreq will be
configured up to an invalid parameter.

FYI, cpufreq=xen;hwp will be accepted.  "xen" shouldn't fail, so it
doesn't make sense to specify that.  But it didn't seem necessary to
prevent it.

> >>> +            {
> >>> +                switch ( cpufreq_xen_opts[i] )
> >>> +                {
> >>> +                case CPUFREQ_xen:
> >>> +                    ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> >>> +                    break;
> >>> +                case CPUFREQ_hwp:
> >>> +                    ret = hwp_register_driver();
> >>> +                    break;
> >>> +                }
> >>> +
> >>> +                if ( ret == 0 )
> >>> +                    break;
> >>> +            }
> >>>              break;
> >>
> >> In this model any kind of failure results in the fallback to be tried
> >> (and the fallback's error to be returned to the caller rather than
> >> the primary one). This may or may not be what we actually want;
> >> personally I would have expected
> >>
> >>                 if ( ret != -ENODEV )
> >>                     break;
> >>
> >> or some such instead.
> >
> > I guess this comes back to our fruit preferences. :)
>
> Does it? It's not just a style question here, but one of when / whether
> to use the fallback.

Indeed.  I was trying to allude back to the earlier conversation.  Do
we try the pears only when there are no apples or also when the apples
are "bad"?  Only falling back for no apples is fine.

> > I can switch it around like that, and make hwp_register_driver()
> > return -ENODEV for hwp_available() returning false.
>
> Thanks.
>
> >>> +int __init hwp_cmdline_parse(const char *s, const char *e)
> >>> +{
> >>> +    do
> >>> +    {
> >>> +        const char *end = strpbrk(s, ",;");
> >>> +
> >>> +        if ( s && !hwp_handle_option(s, end) )
> >>
> >> This check of s not being NULL comes too late, as strpbrk() would have
> >> de-referenced it already. Considering ...
> >>
> >>> +        {
> >>> +            printk(XENLOG_WARNING "cpufreq/hwp: option '%s' not recognized\n",
> >>> +                   s);
> >>> +
> >>> +            return -1;

I should change this to -EINVAL.

> >>> +        }
> >>> +
> >>> +        s = end ? ++end : end;
> >>> +    } while ( s && s < e );
> >>
> >> ... this it probably wants to move even ahead of the loop.
> >
> > I'll switch from do/while to just while and then the NULL check will
> > be covered.  In practice, this function is never called with s ==
> > NULL.
>
> In which case - why not leave things largely as they are, simply dropping
> the odd check of s?

Sure.

> >>> +    if ( data->curr_req.raw == -1 )
> >>> +    {
> >>> +        hwp_err(cpu, "Could not initialize HWP properly\n");
> >>> +        per_cpu(hwp_drv_data, cpu) = NULL;
> >>> +        xfree(data);
> >>> +        return -ENODEV;
> >>> +    }
> >>> +
> >>> +    data->minimum = data->curr_req.min_perf;
> >>> +    data->maximum = data->curr_req.max_perf;
> >>> +    data->desired = data->curr_req.desired;
> >>> +    data->energy_perf = data->curr_req.energy_perf;
> >>> +    data->activity_window = data->curr_req.activity_window;
> >>> +
> >>> +    if ( cpu == 0 )
> >>> +        hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);
> >>
> >> While I'm fine with this (perhaps apart from you using "cpu == 0",
> >> which is an idiom we're trying to get rid of), ...
> >
> > Oh, I didn't know that.  What is the preferred way to identify the
> > BSP?
>
> Sometimes we pass a separate boolean to functions, in other cases we
> check whether a struct cpuinfo_x86 * equals &boot_cpu_info. The
> latter clearly can't be used here, and the former doesn't look to be
> a good fit either. However, ...
>
> >  This doesn't necessarily run on the BSP, so "cpu"/"policy->cpu"
> > is all we have to make a determination.
>
> ... isn't it, conversely, the case that the function only ever runs
> on "cpu" when it is the BSP? In which case "cpu == smp_processor_id()"
> ought to do the trick.

The calls do not necessarily run from the BSP.  The cpufreq init
callbacks run later when dom0 uploads the ACPI processor data.  If you
don't want "cpu == 0", maybe just print for the first CPU regardless
of number, and then print differences from that?

Regards,
Jason


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-07-11 14:16         ` Jason Andryuk
@ 2023-07-11 14:40           ` Jan Beulich
  2023-07-11 17:49             ` Jason Andryuk
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2023-07-11 14:40 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 11.07.2023 16:16, Jason Andryuk wrote:
> On Tue, Jul 11, 2023 at 4:18 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 10.07.2023 17:22, Jason Andryuk wrote:
>>> On Mon, Jul 10, 2023 at 9:13 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 06.07.2023 20:54, Jason Andryuk wrote:
>>>>> @@ -510,6 +510,22 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
>>>>>  * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
>>>>>    respectively.
>>>>>  * `verbose` option can be included as a string or also as `verbose=<integer>`
>>>>> +  for `xen`.  It is a boolean for `hwp`.
>>>>> +* `hwp` selects Hardware-Controlled Performance States (HWP) on supported Intel
>>>>> +  hardware.  HWP is a Skylake+ feature which provides better CPU power
>>>>> +  management.  The default is disabled.  If `hwp` is selected, but hardware
>>>>> +  support is not available, Xen will fallback to cpufreq=xen.
>>>>> +* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
>>>>> +  processor to autonomously force physical package components into idle state.
>>>>> +  The default is enabled, but the option only applies when `hwp` is enabled.
>>>>> +
>>>>> +There is also support for `;`-separated fallback options:
>>>>> +`cpufreq=hwp,verbose;xen`.  This first tries `hwp` and falls back to `xen`
>>>>> +if unavailable.
>>>>
>>>> In the given example, does "verbose" also apply to the fallback case? If so,
>>>> perhaps better "cpufreq=hwp;xen,verbose", to eliminate that ambiguity?
>>>
>>> Yes, "verbose" is applied to both.  I can make the change.  I
>>> mentioned it in the commit message, but I'll mention it here as well.
>>
>> FTAOD my earlier comment implied that the spelling form you use above
>> should not even be accepted when parsing. I.e. it was not just about
>> the doc aspect.
> 
> Oh.  So what exactly do you want then?
> 
> There is a single cpufreq_verbose variable today that is set by either
> cpufreq=hwp,verbose or cpufreq=xen,verbose.  Is that okay, or should
> the "xen" and "hwp" each get a separate variable?
> 
> Do you only want to allow a single trailing "verbose" to apply to all
> of cpufreq (cpufreq=$foo,verbose)?  Or do you want "verbose" to be
> only valid for "xen"?  Both cpufreq_cmdline_parse() and
> hwp_cmdline_parse() just loop over their options and don't care about
> order, even though the documentation lists verbose last.  Would you
> want "cpufreq=hwp,verbose,hdc" to fail to parse?
> 
> All parsing is done upfront before knowing whether "xen" or "hwp" will
> be used as the cpufreq driver, so there is a trickiness for
> implementing "verbose" only for one option.  Similarly,
> "cpufreq=hwp,invalid;xen" will try "hwp" (but not "xen")  since the
> live variables are updated.  Even without this patch, cpufreq will be
> configured up to an invalid parameter.

Right, and I'd like to see "hwp;xen" to be treated as a "unit", with
",verbose" applying to whichever succeeds initializing. I don't think
there is much point to have separate verbosity variables.

> FYI, cpufreq=xen;hwp will be accepted.  "xen" shouldn't fail, so it
> doesn't make sense to specify that.  But it didn't seem necessary to
> prevent it.

Sure, that's fine.

>>>>> +    if ( data->curr_req.raw == -1 )
>>>>> +    {
>>>>> +        hwp_err(cpu, "Could not initialize HWP properly\n");
>>>>> +        per_cpu(hwp_drv_data, cpu) = NULL;
>>>>> +        xfree(data);
>>>>> +        return -ENODEV;
>>>>> +    }
>>>>> +
>>>>> +    data->minimum = data->curr_req.min_perf;
>>>>> +    data->maximum = data->curr_req.max_perf;
>>>>> +    data->desired = data->curr_req.desired;
>>>>> +    data->energy_perf = data->curr_req.energy_perf;
>>>>> +    data->activity_window = data->curr_req.activity_window;
>>>>> +
>>>>> +    if ( cpu == 0 )
>>>>> +        hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);
>>>>
>>>> While I'm fine with this (perhaps apart from you using "cpu == 0",
>>>> which is an idiom we're trying to get rid of), ...
>>>
>>> Oh, I didn't know that.  What is the preferred way to identify the
>>> BSP?
>>
>> Sometimes we pass a separate boolean to functions, in other cases we
>> check whether a struct cpuinfo_x86 * equals &boot_cpu_info. The
>> latter clearly can't be used here, and the former doesn't look to be
>> a good fit either. However, ...
>>
>>>  This doesn't necessarily run on the BSP, so "cpu"/"policy->cpu"
>>> is all we have to make a determination.
>>
>> ... isn't it, conversely, the case that the function only ever runs
>> on "cpu" when it is the BSP? In which case "cpu == smp_processor_id()"
>> ought to do the trick.
> 
> The calls do not necessarily run from the BSP.  The cpufreq init
> callbacks run later when dom0 uploads the ACPI processor data.

Oh, of course. How did I manage to forget?

>  If you
> don't want "cpu == 0", maybe just print for the first CPU regardless
> of number, and then print differences from that?

Perhaps best then.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-07-11 14:40           ` Jan Beulich
@ 2023-07-11 17:49             ` Jason Andryuk
  2023-07-12  8:43               ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Andryuk @ 2023-07-11 17:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On Tue, Jul 11, 2023 at 10:41 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 11.07.2023 16:16, Jason Andryuk wrote:
> > On Tue, Jul 11, 2023 at 4:18 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> On 10.07.2023 17:22, Jason Andryuk wrote:
> >>> On Mon, Jul 10, 2023 at 9:13 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>> On 06.07.2023 20:54, Jason Andryuk wrote:
> >>>>> @@ -510,6 +510,22 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
> >>>>>  * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
> >>>>>    respectively.
> >>>>>  * `verbose` option can be included as a string or also as `verbose=<integer>`
> >>>>> +  for `xen`.  It is a boolean for `hwp`.
> >>>>> +* `hwp` selects Hardware-Controlled Performance States (HWP) on supported Intel
> >>>>> +  hardware.  HWP is a Skylake+ feature which provides better CPU power
> >>>>> +  management.  The default is disabled.  If `hwp` is selected, but hardware
> >>>>> +  support is not available, Xen will fallback to cpufreq=xen.
> >>>>> +* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
> >>>>> +  processor to autonomously force physical package components into idle state.
> >>>>> +  The default is enabled, but the option only applies when `hwp` is enabled.
> >>>>> +
> >>>>> +There is also support for `;`-separated fallback options:
> >>>>> +`cpufreq=hwp,verbose;xen`.  This first tries `hwp` and falls back to `xen`
> >>>>> +if unavailable.
> >>>>
> >>>> In the given example, does "verbose" also apply to the fallback case? If so,
> >>>> perhaps better "cpufreq=hwp;xen,verbose", to eliminate that ambiguity?
> >>>
> >>> Yes, "verbose" is applied to both.  I can make the change.  I
> >>> mentioned it in the commit message, but I'll mention it here as well.
> >>
> >> FTAOD my earlier comment implied that the spelling form you use above
> >> should not even be accepted when parsing. I.e. it was not just about
> >> the doc aspect.
> >
> > Oh.  So what exactly do you want then?
> >
> > There is a single cpufreq_verbose variable today that is set by either
> > cpufreq=hwp,verbose or cpufreq=xen,verbose.  Is that okay, or should
> > the "xen" and "hwp" each get a separate variable?
> >
> > Do you only want to allow a single trailing "verbose" to apply to all
> > of cpufreq (cpufreq=$foo,verbose)?  Or do you want "verbose" to be
> > only valid for "xen"?  Both cpufreq_cmdline_parse() and
> > hwp_cmdline_parse() just loop over their options and don't care about
> > order, even though the documentation lists verbose last.  Would you
> > want "cpufreq=hwp,verbose,hdc" to fail to parse?
> >
> > All parsing is done upfront before knowing whether "xen" or "hwp" will
> > be used as the cpufreq driver, so there is a trickiness for
> > implementing "verbose" only for one option.  Similarly,
> > "cpufreq=hwp,invalid;xen" will try "hwp" (but not "xen")  since the
> > live variables are updated.  Even without this patch, cpufreq will be
> > configured up to an invalid parameter.
>
> Right, and I'd like to see "hwp;xen" to be treated as a "unit", with
> ",verbose" applying to whichever succeeds initializing. I don't think
> there is much point to have separate verbosity variables.

When you say "hwp;xen" as a unit, you don't mean to intermix all the
options like:
cpufreq=hwp;xen:ondemand,hdc,maxfreq=42
do you?

Because of the suboptions, I don't treat "hwp;xen" as a unit, but as
strings separated by ';'.
That allows the full selection of parameters like:
cpufreq=hwc,no-hdc;xen:ondemand,maxfreq=42,minfreq=0

This lets each respective parser handle the options it knows about.
This does duplicate "verbose" handling.  cpufreq_cmdline_parse() and
hwp_cmdline_parse() are also usable when only one of "hwp" or "xen" is
specified.

These all work:
cpufreq=xen:ondemand,verbose
cpufreq=hwp,hdc,verbose
cpurfre=hwp,hdc;xen:ondemand,verbose

To disallow "verbose" in "cpufreq=hwp,verbose;xen" would require extra
code, and setup_cpufreq_option() is already rather complicated IMO.
It's a corner case, but doesn't seem harmful to me.   Hmmm, making the
above fail parsing may be worse since it would only try "hwp" without
a fallback to "xen".

I just want to be clear on exactly what I need to implement.

Regards,
Jason


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-07-11 17:49             ` Jason Andryuk
@ 2023-07-12  8:43               ` Jan Beulich
  2023-07-12 19:38                 ` Jason Andryuk
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2023-07-12  8:43 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 11.07.2023 19:49, Jason Andryuk wrote:
> On Tue, Jul 11, 2023 at 10:41 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 11.07.2023 16:16, Jason Andryuk wrote:
>>> On Tue, Jul 11, 2023 at 4:18 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 10.07.2023 17:22, Jason Andryuk wrote:
>>>>> On Mon, Jul 10, 2023 at 9:13 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>>> On 06.07.2023 20:54, Jason Andryuk wrote:
>>>>>>> @@ -510,6 +510,22 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
>>>>>>>  * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
>>>>>>>    respectively.
>>>>>>>  * `verbose` option can be included as a string or also as `verbose=<integer>`
>>>>>>> +  for `xen`.  It is a boolean for `hwp`.
>>>>>>> +* `hwp` selects Hardware-Controlled Performance States (HWP) on supported Intel
>>>>>>> +  hardware.  HWP is a Skylake+ feature which provides better CPU power
>>>>>>> +  management.  The default is disabled.  If `hwp` is selected, but hardware
>>>>>>> +  support is not available, Xen will fallback to cpufreq=xen.
>>>>>>> +* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
>>>>>>> +  processor to autonomously force physical package components into idle state.
>>>>>>> +  The default is enabled, but the option only applies when `hwp` is enabled.
>>>>>>> +
>>>>>>> +There is also support for `;`-separated fallback options:
>>>>>>> +`cpufreq=hwp,verbose;xen`.  This first tries `hwp` and falls back to `xen`
>>>>>>> +if unavailable.
>>>>>>
>>>>>> In the given example, does "verbose" also apply to the fallback case? If so,
>>>>>> perhaps better "cpufreq=hwp;xen,verbose", to eliminate that ambiguity?
>>>>>
>>>>> Yes, "verbose" is applied to both.  I can make the change.  I
>>>>> mentioned it in the commit message, but I'll mention it here as well.
>>>>
>>>> FTAOD my earlier comment implied that the spelling form you use above
>>>> should not even be accepted when parsing. I.e. it was not just about
>>>> the doc aspect.
>>>
>>> Oh.  So what exactly do you want then?
>>>
>>> There is a single cpufreq_verbose variable today that is set by either
>>> cpufreq=hwp,verbose or cpufreq=xen,verbose.  Is that okay, or should
>>> the "xen" and "hwp" each get a separate variable?
>>>
>>> Do you only want to allow a single trailing "verbose" to apply to all
>>> of cpufreq (cpufreq=$foo,verbose)?  Or do you want "verbose" to be
>>> only valid for "xen"?  Both cpufreq_cmdline_parse() and
>>> hwp_cmdline_parse() just loop over their options and don't care about
>>> order, even though the documentation lists verbose last.  Would you
>>> want "cpufreq=hwp,verbose,hdc" to fail to parse?
>>>
>>> All parsing is done upfront before knowing whether "xen" or "hwp" will
>>> be used as the cpufreq driver, so there is a trickiness for
>>> implementing "verbose" only for one option.  Similarly,
>>> "cpufreq=hwp,invalid;xen" will try "hwp" (but not "xen")  since the
>>> live variables are updated.  Even without this patch, cpufreq will be
>>> configured up to an invalid parameter.
>>
>> Right, and I'd like to see "hwp;xen" to be treated as a "unit", with
>> ",verbose" applying to whichever succeeds initializing. I don't think
>> there is much point to have separate verbosity variables.
> 
> When you say "hwp;xen" as a unit, you don't mean to intermix all the
> options like:
> cpufreq=hwp;xen:ondemand,hdc,maxfreq=42
> do you?
> 
> Because of the suboptions, I don't treat "hwp;xen" as a unit, but as
> strings separated by ';'.
> That allows the full selection of parameters like:
> cpufreq=hwc,no-hdc;xen:ondemand,maxfreq=42,minfreq=0
> 
> This lets each respective parser handle the options it knows about.
> This does duplicate "verbose" handling.  cpufreq_cmdline_parse() and
> hwp_cmdline_parse() are also usable when only one of "hwp" or "xen" is
> specified.
> 
> These all work:
> cpufreq=xen:ondemand,verbose
> cpufreq=hwp,hdc,verbose
> cpurfre=hwp,hdc;xen:ondemand,verbose
> 
> To disallow "verbose" in "cpufreq=hwp,verbose;xen" would require extra
> code, and setup_cpufreq_option() is already rather complicated IMO.
> It's a corner case, but doesn't seem harmful to me.   Hmmm, making the
> above fail parsing may be worse since it would only try "hwp" without
> a fallback to "xen".
> 
> I just want to be clear on exactly what I need to implement.

Maybe we need to take a step back a revisit what option forms actually
make sense to express. Part of the problem may be that we permit (but
not require afaics) the use of colon as a separator after the "main"
option ("xen", "none", "dom0-kernel", and perhaps now "hwp"). Such a
colon suggests that what follows are sub-options applicable to that
specific "main" option, especially since what follows "xen:" can be
more than just the governor name (and in fact no governor name is
required - I've been using cpufreq=xen:up_threshold=40 on some of my
systems, for example). I have to admit that I don't see a clean way
of (largely) retaining existing behavior while at the same time
avoiding ambiguity with your additions (and it may well be that there
is pre-existing ambiguity as well, but the introduction of yet
another separator [semicolon] clearly makes things worse in this
regard, as it suggests strong grouping).

Maybe we want to consider an alternative form of expressing the
fallback. What about e.g. "cpufreq=hwp:hdc(xen:ondemand),verbose"
(and its possible variations)?

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-07-12  8:43               ` Jan Beulich
@ 2023-07-12 19:38                 ` Jason Andryuk
  2023-07-13  6:14                   ` Jan Beulich
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Andryuk @ 2023-07-12 19:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On Wed, Jul 12, 2023 at 4:43 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 11.07.2023 19:49, Jason Andryuk wrote:
> > On Tue, Jul 11, 2023 at 10:41 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 11.07.2023 16:16, Jason Andryuk wrote:
> >>> On Tue, Jul 11, 2023 at 4:18 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>> On 10.07.2023 17:22, Jason Andryuk wrote:
> >>>>> On Mon, Jul 10, 2023 at 9:13 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>>>> On 06.07.2023 20:54, Jason Andryuk wrote:
> >>>>>>> @@ -510,6 +510,22 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
> >>>>>>>  * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
> >>>>>>>    respectively.
> >>>>>>>  * `verbose` option can be included as a string or also as `verbose=<integer>`
> >>>>>>> +  for `xen`.  It is a boolean for `hwp`.
> >>>>>>> +* `hwp` selects Hardware-Controlled Performance States (HWP) on supported Intel
> >>>>>>> +  hardware.  HWP is a Skylake+ feature which provides better CPU power
> >>>>>>> +  management.  The default is disabled.  If `hwp` is selected, but hardware
> >>>>>>> +  support is not available, Xen will fallback to cpufreq=xen.
> >>>>>>> +* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
> >>>>>>> +  processor to autonomously force physical package components into idle state.
> >>>>>>> +  The default is enabled, but the option only applies when `hwp` is enabled.
> >>>>>>> +
> >>>>>>> +There is also support for `;`-separated fallback options:
> >>>>>>> +`cpufreq=hwp,verbose;xen`.  This first tries `hwp` and falls back to `xen`
> >>>>>>> +if unavailable.
> >>>>>>
> >>>>>> In the given example, does "verbose" also apply to the fallback case? If so,
> >>>>>> perhaps better "cpufreq=hwp;xen,verbose", to eliminate that ambiguity?
> >>>>>
> >>>>> Yes, "verbose" is applied to both.  I can make the change.  I
> >>>>> mentioned it in the commit message, but I'll mention it here as well.
> >>>>
> >>>> FTAOD my earlier comment implied that the spelling form you use above
> >>>> should not even be accepted when parsing. I.e. it was not just about
> >>>> the doc aspect.
> >>>
> >>> Oh.  So what exactly do you want then?
> >>>
> >>> There is a single cpufreq_verbose variable today that is set by either
> >>> cpufreq=hwp,verbose or cpufreq=xen,verbose.  Is that okay, or should
> >>> the "xen" and "hwp" each get a separate variable?
> >>>
> >>> Do you only want to allow a single trailing "verbose" to apply to all
> >>> of cpufreq (cpufreq=$foo,verbose)?  Or do you want "verbose" to be
> >>> only valid for "xen"?  Both cpufreq_cmdline_parse() and
> >>> hwp_cmdline_parse() just loop over their options and don't care about
> >>> order, even though the documentation lists verbose last.  Would you
> >>> want "cpufreq=hwp,verbose,hdc" to fail to parse?
> >>>
> >>> All parsing is done upfront before knowing whether "xen" or "hwp" will
> >>> be used as the cpufreq driver, so there is a trickiness for
> >>> implementing "verbose" only for one option.  Similarly,
> >>> "cpufreq=hwp,invalid;xen" will try "hwp" (but not "xen")  since the
> >>> live variables are updated.  Even without this patch, cpufreq will be
> >>> configured up to an invalid parameter.
> >>
> >> Right, and I'd like to see "hwp;xen" to be treated as a "unit", with
> >> ",verbose" applying to whichever succeeds initializing. I don't think
> >> there is much point to have separate verbosity variables.
> >
> > When you say "hwp;xen" as a unit, you don't mean to intermix all the
> > options like:
> > cpufreq=hwp;xen:ondemand,hdc,maxfreq=42
> > do you?
> >
> > Because of the suboptions, I don't treat "hwp;xen" as a unit, but as
> > strings separated by ';'.
> > That allows the full selection of parameters like:
> > cpufreq=hwc,no-hdc;xen:ondemand,maxfreq=42,minfreq=0
> >
> > This lets each respective parser handle the options it knows about.
> > This does duplicate "verbose" handling.  cpufreq_cmdline_parse() and
> > hwp_cmdline_parse() are also usable when only one of "hwp" or "xen" is
> > specified.
> >
> > These all work:
> > cpufreq=xen:ondemand,verbose
> > cpufreq=hwp,hdc,verbose
> > cpurfre=hwp,hdc;xen:ondemand,verbose
> >
> > To disallow "verbose" in "cpufreq=hwp,verbose;xen" would require extra
> > code, and setup_cpufreq_option() is already rather complicated IMO.
> > It's a corner case, but doesn't seem harmful to me.   Hmmm, making the
> > above fail parsing may be worse since it would only try "hwp" without
> > a fallback to "xen".
> >
> > I just want to be clear on exactly what I need to implement.
>
> Maybe we need to take a step back a revisit what option forms actually
> make sense to express. Part of the problem may be that we permit (but
> not require afaics) the use of colon as a separator after the "main"
> option ("xen", "none", "dom0-kernel", and perhaps now "hwp"). Such a
> colon suggests that what follows are sub-options applicable to that
> specific "main" option, especially since what follows "xen:" can be
> more than just the governor name (and in fact no governor name is
> required - I've been using cpufreq=xen:up_threshold=40 on some of my
> systems, for example). I have to admit that I don't see a clean way
> of (largely) retaining existing behavior while at the same time
> avoiding ambiguity with your additions (and it may well be that there
> is pre-existing ambiguity as well, but the introduction of yet
> another separator [semicolon] clearly makes things worse in this
> regard, as it suggests strong grouping).
>
> Maybe we want to consider an alternative form of expressing the
> fallback. What about e.g. "cpufreq=hwp:hdc(xen:ondemand),verbose"
> (and its possible variations)?

I think I like this less.

I think the strong grouping of ';' is nice.  It clearly delineates the
first and fallback option, and it could be extended to N in the
future.  I like that if you have 2 working cpufreq= lines, you just
concatenate them together with the ';'.  If the suboptions were all a
disjoint set, I think it would be easy to pick this.

The downside is the "verbose" handling.  The strong grouping makes
verbose seemingly only apply to the specified group and not globally.
Not the best, but we can document that away.

I implemented something to only allow specifying "verbose" in only the
last ';' grouping.  It's not that bad, but seems a little weird in
light of the strong grouping since it uses the same comma separator.
Also, failing to parse "verbose" in the first part breaks the
concatenation.

A different or additional separator could be used to separate verbose,
but that seems kinda overkill and would not be backwards compatible.

Another idea could be to allow multiple cpufreq= lines.  Each one
would call setup_cpufreq_option(), which could track the entries in
cpufreq_xen_opts[].  It still doesn't show a single "verbose" applies
globally.  And it breaks "later options override earlier ones."
"cpufreqfallback=" is another option, but I don't particularly care
for that.

I think I just prefer the ';' separator and letting "verbose" be
weird.  It maps well into the existing syntax and I liked it when you
first suggested it.

Regards,
Jason


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-07-12 19:38                 ` Jason Andryuk
@ 2023-07-13  6:14                   ` Jan Beulich
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2023-07-13  6:14 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 12.07.2023 21:38, Jason Andryuk wrote:
> I think I just prefer the ';' separator and letting "verbose" be
> weird.  It maps well into the existing syntax and I liked it when you
> first suggested it.

Well, okay, then let's stick to that, emphasizing the weirdness in doc.

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 09/15] cpufreq: Export HWP parameters to userspace as CPPC
  2023-07-06 18:54 ` [PATCH v5 09/15] cpufreq: Export HWP parameters to userspace as CPPC Jason Andryuk
@ 2023-07-13 12:37   ` Jan Beulich
  2023-07-13 16:11     ` Jason Andryuk
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2023-07-13 12:37 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 06.07.2023 20:54, Jason Andryuk wrote:
> Extend xen_get_cpufreq_para to return hwp parameters.  HWP is an
> implementation of ACPI CPPC (Collaborative Processor Performance
> Control).  Use the CPPC name since that might be useful in the future
> for AMD P-state.
> 
> We need the features bitmask to indicate fields supported by the actual
> hardware - this only applies to activity window for the time being.
> 
> The HWP most_efficient is mapped to CPPC lowest_nonlinear, and guaranteed is
> mapped to nominal.  CPPC has a guaranteed that is optional while nominal
> is required.  ACPI spec says "If this register is not implemented, OSPM
> assumes guaranteed performance is always equal to nominal performance."
> 
> The use of uint8_t parameters matches the hardware size.  uint32_t
> entries grows the sysctl_t past the build assertion in setup.c.  The
> uint8_t ranges are supported across multiple generations, so hopefully
> they won't change.

Isn't this paragraph stale now?

> --- a/xen/drivers/acpi/pmstat.c
> +++ b/xen/drivers/acpi/pmstat.c
> @@ -251,46 +251,52 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>      else
>          strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
>  
> -    if ( !(scaling_available_governors =
> -           xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
> -        return -ENOMEM;
> -    if ( (ret = read_scaling_available_governors(scaling_available_governors,
> -                gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
> +    if ( !strncmp(op->u.get_para.scaling_driver, XEN_HWP_DRIVER_NAME,
> +                      CPUFREQ_NAME_LEN) )
> +        ret = get_hwp_para(policy->cpu, &op->u.get_para.u.cppc_para);
> +    else
>      {
> +        if ( !(scaling_available_governors =
> +               xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
> +            return -ENOMEM;
> +        if ( (ret = read_scaling_available_governors(
> +                        scaling_available_governors,
> +                        gov_num * CPUFREQ_NAME_LEN *
> +                            sizeof(*scaling_available_governors) )) )

Nit: Too deep indentation of this last line. If you want to visually
express the continuation of the last argument, add a pair of parens:

        if ( (ret = read_scaling_available_governors(
                        scaling_available_governors,
                        (gov_num * CPUFREQ_NAME_LEN *
                         sizeof(*scaling_available_governors)))) )

Otherwise all line beginnings want to align with one another, no matter
whether new or continued argument.

Also there's a stray blank after the 1st closing paren.

> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -296,6 +296,61 @@ struct xen_ondemand {
>      uint32_t up_threshold;
>  };
>  
> +struct xen_cppc_para {
> +    /* OUT */
> +    /* activity_window supported if set */
> +#define XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW  (1 << 0)
> +    uint32_t features; /* bit flags for features */
> +    /*
> +     * See Intel SDM: HWP Performance Range and Dynamic Capabilities
> +     *
> +     * These four are 0-255 hardware-provided values.  They "continuous,

Nit: "They're"

> +     * abstract unit-less, performance" values.  Smaller numbers are slower
> +     * and larger ones are faster.
> +     */

With the adjustments (provided you agree)
Acked-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  2023-07-06 18:54 ` [PATCH v5 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
@ 2023-07-13 13:02   ` Jan Beulich
  2023-07-13 16:12     ` Jason Andryuk
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2023-07-13 13:02 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 06.07.2023 20:54, Jason Andryuk wrote:
> @@ -531,6 +535,103 @@ int get_hwp_para(unsigned int cpu,
>      return 0;
>  }
>  
> +int set_hwp_para(struct cpufreq_policy *policy,
> +                 struct xen_set_cppc_para *set_cppc)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> +    bool cleared_act_window = false;
> +
> +    if ( data == NULL )
> +        return -EINVAL;

I don't think EINVAL is appropriate here. EOPNOTSUPP might be, or ENOENT,
or EIO, or perhaps a few others.

> +    /* Validate all parameters - Disallow reserved bits. */
> +    if ( set_cppc->minimum > 255 ||
> +         set_cppc->maximum > 255 ||
> +         set_cppc->desired > 255 ||
> +         set_cppc->energy_perf > 255 ||
> +         set_cppc->set_params & ~XEN_SYSCTL_CPPC_SET_PARAM_MASK ||
> +         set_cppc->activity_window & ~XEN_SYSCTL_CPPC_ACT_WINDOW_MASK )

Nit: Parentheses again please around the operands of &.

> +        return -EINVAL;
> +
> +    /* Only allow values if params bit is set. */
> +    if ( (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED) &&
> +          set_cppc->desired) ||
> +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM) &&
> +          set_cppc->minimum) ||
> +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM) &&
> +          set_cppc->maximum) ||
> +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF) &&
> +          set_cppc->energy_perf) ||
> +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW) &&
> +          set_cppc->activity_window) )
> +        return -EINVAL;
> +
> +    /* Clear out activity window if lacking HW supported. */
> +    if ( (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW) &&
> +         !feature_hwp_activity_window ) {
> +        set_cppc->set_params &= ~XEN_SYSCTL_CPPC_SET_ACT_WINDOW;
> +        cleared_act_window = true;
> +    }
> +
> +    /* Return if there is nothing to do. */
> +    if ( set_cppc->set_params == 0 )
> +        return cleared_act_window ? 0 : -EINVAL;

Is it really necessary to return an error when there's nothing to do?
We have various hypercalls which can degenerate to no-ops under
certain conditions, and which simply return success then.

> --- a/xen/drivers/acpi/pmstat.c
> +++ b/xen/drivers/acpi/pmstat.c
> @@ -400,6 +400,19 @@ static int set_cpufreq_para(struct xen_sysctl_pm_op *op)
>      return ret;
>  }
>  
> +static int set_cpufreq_cppc(struct xen_sysctl_pm_op *op)
> +{
> +    struct cpufreq_policy *policy = per_cpu(cpufreq_cpu_policy, op->cpuid);
> +
> +    if ( !policy || !policy->governor )
> +        return -EINVAL;
> +
> +    if ( !hwp_active() )
> +        return -EINVAL;

In both cases I again wonder in how far EINVAL is really appropriate.

> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -351,6 +351,68 @@ struct xen_cppc_para {
>      uint32_t activity_window;
>  };
>  
> +/*
> + * Set CPPC values.
> + *
> + * Configure the parameters for CPPC.  Set bits in set_params control which
> + * values are applied.  If a bit is not set in set_params, the field must be
> + * zero.
> + *
> + * For HWP specifically, values must be limited to 0-255 or within
> + * XEN_SYSCTL_CPPC_ACT_WINDOW_MASK for activity window.  Set bits outside the
> + * range will be returned as -EINVAL.
> + *
> + * Activity Window may not be supported by the hardware.  In that case, the
> + * returned set_params will clear XEN_SYSCTL_CPPC_SET_ACT_WINDOW to indicate
> + * that it was not applied - though the rest of the values will be applied.
> + *
> + * There are a set of presets along with individual fields.  Presets are
> + * applied first, and then individual fields.  This allows customizing
> + * a preset without having to specify every value.
> + *
> + * The preset options values are as follows:
> + *
> + * preset      | minimum | maxium  | energy_perf
> + * ------------+---------+---------+----------------
> + * powersave   | lowest  | lowest  | powersave (255)
> + * ------------+---------+---------+----------------
> + * balance     | lowest  | highest | balance (128)
> + * ------------+---------+---------+----------------
> + * performance | highest | highest | performance (0)
> + *
> + * desired and activity_window are set to 0, hardware selected.
> + */
> +struct xen_set_cppc_para {
> +#define XEN_SYSCTL_CPPC_SET_MINIMUM              (1U << 0)
> +#define XEN_SYSCTL_CPPC_SET_MAXIMUM              (1U << 1)
> +#define XEN_SYSCTL_CPPC_SET_DESIRED              (1U << 2)
> +#define XEN_SYSCTL_CPPC_SET_ENERGY_PERF          (1U << 3)
> +#define XEN_SYSCTL_CPPC_SET_ACT_WINDOW           (1U << 4)
> +#define XEN_SYSCTL_CPPC_SET_PRESET_MASK          0xf0000000
> +#define XEN_SYSCTL_CPPC_SET_PRESET_NONE          0x00000000
> +#define XEN_SYSCTL_CPPC_SET_PRESET_BALANCE       0x10000000
> +#define XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE     0x20000000
> +#define XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE   0x30000000

As corrections for the respective Misra rule are in the process of
being merged, please add U suffixes here (at the very least on the
_MASK).

Jan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 09/15] cpufreq: Export HWP parameters to userspace as CPPC
  2023-07-13 12:37   ` Jan Beulich
@ 2023-07-13 16:11     ` Jason Andryuk
  0 siblings, 0 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-13 16:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Thu, Jul 13, 2023 at 8:37 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> With the adjustments (provided you agree)
> Acked-by: Jan Beulich <jbeulich@suse.com>

Yes, they all sound good.  Thank you.

-Jason


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  2023-07-13 13:02   ` Jan Beulich
@ 2023-07-13 16:12     ` Jason Andryuk
  0 siblings, 0 replies; 32+ messages in thread
From: Jason Andryuk @ 2023-07-13 16:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Thu, Jul 13, 2023 at 9:02 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 06.07.2023 20:54, Jason Andryuk wrote:
> > @@ -531,6 +535,103 @@ int get_hwp_para(unsigned int cpu,
> >      return 0;
> >  }
> >
> > +int set_hwp_para(struct cpufreq_policy *policy,
> > +                 struct xen_set_cppc_para *set_cppc)
> > +{
> > +    unsigned int cpu = policy->cpu;
> > +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> > +    bool cleared_act_window = false;
> > +
> > +    if ( data == NULL )
> > +        return -EINVAL;
>
> I don't think EINVAL is appropriate here. EOPNOTSUPP might be, or ENOENT,
> or EIO, or perhaps a few others.

Yes.  ENOENT seems good here since a NULL data is comparable to not existing.

> > +    /* Validate all parameters - Disallow reserved bits. */
> > +    if ( set_cppc->minimum > 255 ||
> > +         set_cppc->maximum > 255 ||
> > +         set_cppc->desired > 255 ||
> > +         set_cppc->energy_perf > 255 ||
> > +         set_cppc->set_params & ~XEN_SYSCTL_CPPC_SET_PARAM_MASK ||
> > +         set_cppc->activity_window & ~XEN_SYSCTL_CPPC_ACT_WINDOW_MASK )
>
> Nit: Parentheses again please around the operands of &.

Sure

> > +        return -EINVAL;
> > +
> > +    /* Only allow values if params bit is set. */
> > +    if ( (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED) &&
> > +          set_cppc->desired) ||
> > +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM) &&
> > +          set_cppc->minimum) ||
> > +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM) &&
> > +          set_cppc->maximum) ||
> > +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF) &&
> > +          set_cppc->energy_perf) ||
> > +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW) &&
> > +          set_cppc->activity_window) )
> > +        return -EINVAL;
> > +
> > +    /* Clear out activity window if lacking HW supported. */
> > +    if ( (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW) &&
> > +         !feature_hwp_activity_window ) {
> > +        set_cppc->set_params &= ~XEN_SYSCTL_CPPC_SET_ACT_WINDOW;
> > +        cleared_act_window = true;
> > +    }
> > +
> > +    /* Return if there is nothing to do. */
> > +    if ( set_cppc->set_params == 0 )
> > +        return cleared_act_window ? 0 : -EINVAL;
>
> Is it really necessary to return an error when there's nothing to do?
> We have various hypercalls which can degenerate to no-ops under
> certain conditions, and which simply return success then.

With all the earlier parameter checking, I think it would be fine to
return success here for a no-op.

> > --- a/xen/drivers/acpi/pmstat.c
> > +++ b/xen/drivers/acpi/pmstat.c
> > @@ -400,6 +400,19 @@ static int set_cpufreq_para(struct xen_sysctl_pm_op *op)
> >      return ret;
> >  }
> >
> > +static int set_cpufreq_cppc(struct xen_sysctl_pm_op *op)
> > +{
> > +    struct cpufreq_policy *policy = per_cpu(cpufreq_cpu_policy, op->cpuid);
> > +
> > +    if ( !policy || !policy->governor )
> > +        return -EINVAL;
> > +
> > +    if ( !hwp_active() )
> > +        return -EINVAL;
>
> In both cases I again wonder in how far EINVAL is really appropriate.

-EOPNOTSUPP seems good for the !hwp_active() case.  Maybe ENOENT for
the policy one.

> > --- a/xen/include/public/sysctl.h
> > +++ b/xen/include/public/sysctl.h
> > @@ -351,6 +351,68 @@ struct xen_cppc_para {
> >      uint32_t activity_window;
> >  };
> >
> > +/*
> > + * Set CPPC values.
> > + *
> > + * Configure the parameters for CPPC.  Set bits in set_params control which
> > + * values are applied.  If a bit is not set in set_params, the field must be
> > + * zero.
> > + *
> > + * For HWP specifically, values must be limited to 0-255 or within
> > + * XEN_SYSCTL_CPPC_ACT_WINDOW_MASK for activity window.  Set bits outside the
> > + * range will be returned as -EINVAL.
> > + *
> > + * Activity Window may not be supported by the hardware.  In that case, the
> > + * returned set_params will clear XEN_SYSCTL_CPPC_SET_ACT_WINDOW to indicate
> > + * that it was not applied - though the rest of the values will be applied.
> > + *
> > + * There are a set of presets along with individual fields.  Presets are
> > + * applied first, and then individual fields.  This allows customizing
> > + * a preset without having to specify every value.
> > + *
> > + * The preset options values are as follows:
> > + *
> > + * preset      | minimum | maxium  | energy_perf
> > + * ------------+---------+---------+----------------
> > + * powersave   | lowest  | lowest  | powersave (255)
> > + * ------------+---------+---------+----------------
> > + * balance     | lowest  | highest | balance (128)
> > + * ------------+---------+---------+----------------
> > + * performance | highest | highest | performance (0)
> > + *
> > + * desired and activity_window are set to 0, hardware selected.
> > + */
> > +struct xen_set_cppc_para {
> > +#define XEN_SYSCTL_CPPC_SET_MINIMUM              (1U << 0)
> > +#define XEN_SYSCTL_CPPC_SET_MAXIMUM              (1U << 1)
> > +#define XEN_SYSCTL_CPPC_SET_DESIRED              (1U << 2)
> > +#define XEN_SYSCTL_CPPC_SET_ENERGY_PERF          (1U << 3)
> > +#define XEN_SYSCTL_CPPC_SET_ACT_WINDOW           (1U << 4)
> > +#define XEN_SYSCTL_CPPC_SET_PRESET_MASK          0xf0000000
> > +#define XEN_SYSCTL_CPPC_SET_PRESET_NONE          0x00000000
> > +#define XEN_SYSCTL_CPPC_SET_PRESET_BALANCE       0x10000000
> > +#define XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE     0x20000000
> > +#define XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE   0x30000000
>
> As corrections for the respective Misra rule are in the process of
> being merged, please add U suffixes here (at the very least on the
> _MASK).

Sure.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2023-07-13 16:13 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-06 18:54 [PATCH v5 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
2023-07-06 18:54 ` [PATCH v5 01/15] cpufreq: Allow restricting to internal governors only Jason Andryuk
2023-07-10 11:41   ` Jan Beulich
2023-07-06 18:54 ` [PATCH v5 02/15] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
2023-07-06 18:54 ` [PATCH v5 03/15] cpufreq: Export intel_feature_detect Jason Andryuk
2023-07-06 18:54 ` [PATCH v5 04/15] xen/sysctl: Nest cpufreq scaling options Jason Andryuk
2023-07-10 11:51   ` Jan Beulich
2023-07-06 18:54 ` [PATCH v5 05/15] pmstat&xenpm: Re-arrage for cpufreq union Jason Andryuk
2023-07-10 12:09   ` Jan Beulich
2023-07-06 18:54 ` [PATCH v5 06/15] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
2023-07-10 13:13   ` Jan Beulich
2023-07-10 15:22     ` Jason Andryuk
2023-07-11  8:18       ` Jan Beulich
2023-07-11 14:16         ` Jason Andryuk
2023-07-11 14:40           ` Jan Beulich
2023-07-11 17:49             ` Jason Andryuk
2023-07-12  8:43               ` Jan Beulich
2023-07-12 19:38                 ` Jason Andryuk
2023-07-13  6:14                   ` Jan Beulich
2023-07-06 18:54 ` [PATCH v5 07/15] xen/x86: Tweak PDC bits when using HWP Jason Andryuk
2023-07-06 18:54 ` [PATCH v5 08/15] xenpm: Change get-cpufreq-para output for hwp Jason Andryuk
2023-07-06 18:54 ` [PATCH v5 09/15] cpufreq: Export HWP parameters to userspace as CPPC Jason Andryuk
2023-07-13 12:37   ` Jan Beulich
2023-07-13 16:11     ` Jason Andryuk
2023-07-06 18:54 ` [PATCH v5 10/15] libxc: Include cppc_para in definitions Jason Andryuk
2023-07-06 18:54 ` [PATCH v5 11/15] xenpm: Print HWP/CPPC parameters Jason Andryuk
2023-07-06 18:54 ` [PATCH v5 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
2023-07-13 13:02   ` Jan Beulich
2023-07-13 16:12     ` Jason Andryuk
2023-07-06 18:54 ` [PATCH v5 13/15] libxc: Add xc_set_cpufreq_cppc Jason Andryuk
2023-07-06 18:54 ` [PATCH v5 14/15] xenpm: Add set-cpufreq-cppc subcommand Jason Andryuk
2023-07-06 18:54 ` [PATCH v5 15/15] CHANGELOG: Add Intel HWP entry Jason Andryuk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.