All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 00/15] Intel Hardware P-States (HWP) support
@ 2023-06-14 18:02 Jason Andryuk
  2023-06-14 18:02 ` [PATCH v4 01/15] cpufreq: Allow restricting to internal governors only Jason Andryuk
                   ` (14 more replies)
  0 siblings, 15 replies; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, Anthony PERARD, Juergen Gross, George Dunlap,
	Julien Grall, Stefano Stabellini, Henry Wang, Community Manager

Hi,

This patch series adds Hardware-Controlled Performance States (HWP) for
Intel processors to Xen.

v2 was only partially reviewed, so v3 is mostly a reposting of v2.  In v2 &
v3, I think I addressed all comments for v1.  I kept patch 11 "xenpm:
Factor out a non-fatal cpuid_parse variant", with a v2 comment
explaining why I keep it.

v3 adds "xen/x86: Tweak PDC bits when using HWP".  Qubes testing revealed
an issue where enabling HWP can crash firwmare code (maybe SMM).  This
requires a Linux change to get the PDC bits from Xen and pass them to
ACPI.  Roger has a patch [0] to set the PDC bits.  Roger's 3 patch
series was tested with "xen/x86: Tweak PDC bits when using HWP" on
affected hardware and allowed proper operation.

v4:
There is a large amount or renaming from HWP/hwp to CPPC/cppc in the series.
The driver remains hwp_ prefixed since it is dealing with the hardware
interface.  The sysctl, xc and xenpm interfaces were renamed to cppc to
be the generic ACPI CPPC (Collaborative Processor Performance Control)
interface.

struct xen_get_cpufreq_para was re-organized in a binary compatible
fashion to nest scaling governor options.  This allows the cppc support
to use uint32_t's for its parameters.

HWP is now enabled with a top-level cpufreq=hwp option.  It will
fallback to cpufreq=xen if hwp is unavailable.  This seems like the most
user-friendly option.  Since the user was trying to specify *some*
cpufreq, we should give them the best that we can instead of disabling
the functionality.

"xenpm: Factor out a non-fatal cpuid_parse variant" was dropped.
set-cpufreq-cppc expects either a cpu number or none specified, which
implies all.

Some patches were re-arrange - "xen/x86: Tweak PDC bits when using HWP"
now comes immediately after "cpufreq: Add Hardware P-State (HWP) driver"

The implementation of "cpufreq: Allow restricting to internal governors
only " changed, so I removed Jan's Ack.

Previous cover letter:

With HWP, the processor makes its own determinations for frequency
selection, though users can set some parameters and preferences.  There
is also Turbo Boost which dynamically pushes the max frequency if
possible.

The existing governors don't work with HWP since they select frequencies
and HWP doesn't expose those.  Therefore a dummy hwp-interal governor is
used that doesn't do anything.

xenpm get-cpufreq-para is extended to show HWP parameters, and
set-cpufreq-cppc is added to set them.

A lightly loaded OpenXT laptop showed ~1W power savings according to
powertop.  A mostly idle Fedora system (dom0 only) showed a more modest
power savings.

This is for a 10th gen 6-core 1600 MHz base 4900 MHZ max cpu.  In the
default balance mode, Turbo Boost doesn't exceed 4GHz.  Tweaking the
energy_perf preference with `xenpm set-cpufreq-para balance ene:64`,
I've seen the CPU hit 4.7GHz before throttling down and bouncing around
between 4.3 and 4.5 GHz.  Curiously the other cores read ~4GHz when
turbo boost takes affect.  This was done after pinning all dom0 cores,
and using taskset to pin to vCPU/pCPU 11 and running a bash tightloop.

HWP defaults to disabled and running with the existing HWP configuration
- it doesn't reconfigure by default.  It can be enabled with
cpufreq=hwp.

Hardware Duty Cycling (HDC) is another feature to autonomously powerdown
things.  It defaults to enabled when HWP is enabled, but HDC can be
disabled on the command line.  cpufreq=xen:hwp,no-hdc

I've only tested on 8th gen and 10th gen systems with activity window
and energy_perf support.  So the pathes for CPUs lacking those features
are untested.

Fast MSR support was removed in v2.  The model specific checking was not
done properly, and I don't have hardware to test with.  Since writes are
expected to be infrequent, I just removed the code.

This changes the systcl_pm_op hypercall, so that wants review.

Regards,
Jason

[0] https://lore.kernel.org/xen-devel/20221121102113.41893-3-roger.pau@citrix.com/

Jason Andryuk (15):
  cpufreq: Allow restricting to internal governors only
  cpufreq: Add perf_freq to cpuinfo
  cpufreq: Export intel_feature_detect
  xen/sysctl: Nest cpufreq scaling options
  pmstat&xenpm: Re-arrage for cpufreq union
  cpufreq: Add Hardware P-State (HWP) driver
  xen/x86: Tweak PDC bits when using HWP
  xenpm: Change get-cpufreq-para output for hwp
  cpufreq: Export HWP parameters to userspace as CPPC
  libxc: Include cppc_para in definitions
  xenpm: Print HWP/CPPC parameters
  xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  libxc: Add xc_set_cpufreq_cppc
  xenpm: Add set-cpufreq-cppc subcommand
  CHANGELOG: Add Intel HWP entry

 CHANGELOG.md                              |   2 +-
 docs/misc/xen-command-line.pandoc         |   9 +-
 tools/include/xenctrl.h                   |  28 +-
 tools/libs/ctrl/xc_pm.c                   |  35 +-
 tools/misc/xenpm.c                        | 385 +++++++++++--
 xen/arch/x86/acpi/cpufreq/Makefile        |   1 +
 xen/arch/x86/acpi/cpufreq/cpufreq.c       |  15 +-
 xen/arch/x86/acpi/cpufreq/hwp.c           | 664 ++++++++++++++++++++++
 xen/arch/x86/acpi/lib.c                   |   5 +
 xen/arch/x86/cpu/mcheck/mce_intel.c       |   6 +
 xen/arch/x86/include/asm/cpufeature.h     |  12 +-
 xen/arch/x86/include/asm/msr-index.h      |  14 +
 xen/drivers/acpi/pmstat.c                 | 102 ++--
 xen/drivers/cpufreq/cpufreq.c             |  16 +
 xen/drivers/cpufreq/utility.c             |   1 +
 xen/include/acpi/cpufreq/cpufreq.h        |  15 +
 xen/include/acpi/cpufreq/processor_perf.h |   4 +
 xen/include/acpi/pdc_intel.h              |   1 +
 xen/include/public/sysctl.h               | 137 ++++-
 19 files changed, 1343 insertions(+), 109 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c

-- 
2.40.1



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v4 01/15] cpufreq: Allow restricting to internal governors only
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-15 13:21   ` Jan Beulich
  2023-06-14 18:02 ` [PATCH v4 02/15] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Jan Beulich

For hwp, the standard governors are not usable, and only the internal
one is applicable.  Add the cpufreq_governor_internal boolean to
indicate when an internal governor, like hwp, will be used.
This is set during presmp_initcall, so that it can suppress governor
registration during initcall.  Add an internal flag to struct
cpufreq_governor to indicate such governors.

This way, the unusable governors are not registered, so the internal
one is the only one returned to userspace.  This means incompatible
governors won't be advertised to userspace.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v4:
Rework to use an internal flag
Removed Jan's Ack since the approach is different.

v3:
Switch to initdata
Add Jan Acked-by
Commit message s/they/the/ typo
Don't register hwp-internal when running non-hwp - Marek

v2:
Switch to "-internal"
Add blank line in header
---
 xen/drivers/cpufreq/cpufreq.c      | 7 +++++++
 xen/include/acpi/cpufreq/cpufreq.h | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index 2321c7dd07..cccf9a64c8 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -56,6 +56,7 @@ struct cpufreq_dom {
 };
 static LIST_HEAD_READ_MOSTLY(cpufreq_dom_list_head);
 
+bool __initdata cpufreq_governor_internal;
 struct cpufreq_governor *__read_mostly cpufreq_opt_governor;
 LIST_HEAD_READ_MOSTLY(cpufreq_governor_list);
 
@@ -121,6 +122,12 @@ int __init cpufreq_register_governor(struct cpufreq_governor *governor)
     if (!governor)
         return -EINVAL;
 
+    if (cpufreq_governor_internal && !governor->internal)
+        return -EINVAL;
+
+    if (!cpufreq_governor_internal && governor->internal)
+        return -EINVAL;
+
     if (__find_governor(governor->name) != NULL)
         return -EEXIST;
 
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 35dcf21e8f..1c0872506a 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -106,6 +106,7 @@ struct cpufreq_governor {
                         unsigned int event);
     bool_t  (*handle_option)(const char *name, const char *value);
     struct list_head governor_list;
+    bool    internal;
 };
 
 extern struct cpufreq_governor *cpufreq_opt_governor;
@@ -114,6 +115,8 @@ extern struct cpufreq_governor cpufreq_gov_userspace;
 extern struct cpufreq_governor cpufreq_gov_performance;
 extern struct cpufreq_governor cpufreq_gov_powersave;
 
+extern bool cpufreq_governor_internal;
+
 extern struct list_head cpufreq_governor_list;
 
 extern int cpufreq_register_governor(struct cpufreq_governor *governor);
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 02/15] cpufreq: Add perf_freq to cpuinfo
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
  2023-06-14 18:02 ` [PATCH v4 01/15] cpufreq: Allow restricting to internal governors only Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-14 18:02 ` [PATCH v4 03/15] cpufreq: Export intel_feature_detect Jason Andryuk
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

acpi-cpufreq scales the aperf/mperf measurements by max_freq, but HWP
needs to scale by base frequency.  Settings max_freq to base_freq
"works" but the code is not obvious, and returning values to userspace
is tricky.  Add an additonal perf_freq member which is used for scaling
aperf/mperf measurements.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v3:
Add Jan's Ack

I don't like this, but it seems the best way to re-use the common
aperf/mperf code.  The other option would be to add wrappers that then
do the acpi vs. hwp scaling.
---
 xen/arch/x86/acpi/cpufreq/cpufreq.c | 2 +-
 xen/drivers/cpufreq/utility.c       | 1 +
 xen/include/acpi/cpufreq/cpufreq.h  | 3 +++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index 2e0067fbe5..6c70d04395 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -316,7 +316,7 @@ unsigned int get_measured_perf(unsigned int cpu, unsigned int flag)
     else
         perf_percent = 0;
 
-    return policy->cpuinfo.max_freq * perf_percent / 100;
+    return policy->cpuinfo.perf_freq * perf_percent / 100;
 }
 
 static unsigned int cf_check get_cur_freq_on_cpu(unsigned int cpu)
diff --git a/xen/drivers/cpufreq/utility.c b/xen/drivers/cpufreq/utility.c
index 9eb7ecedcd..6831f62851 100644
--- a/xen/drivers/cpufreq/utility.c
+++ b/xen/drivers/cpufreq/utility.c
@@ -236,6 +236,7 @@ int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
 
     policy->min = policy->cpuinfo.min_freq = min_freq;
     policy->max = policy->cpuinfo.max_freq = max_freq;
+    policy->cpuinfo.perf_freq = max_freq;
     policy->cpuinfo.second_max_freq = second_max_freq;
 
     if (policy->min == ~0)
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 1c0872506a..e2e03b8bd7 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -37,6 +37,9 @@ extern struct acpi_cpufreq_data *cpufreq_drv_data[NR_CPUS];
 struct cpufreq_cpuinfo {
     unsigned int        max_freq;
     unsigned int        second_max_freq;    /* P1 if Turbo Mode is on */
+    unsigned int        perf_freq; /* Scaling freq for aperf/mpref.
+                                      acpi-cpufreq uses max_freq, but HWP uses
+                                      base_freq.*/
     unsigned int        min_freq;
     unsigned int        transition_latency; /* in 10^(-9) s = nanoseconds */
 };
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 03/15] cpufreq: Export intel_feature_detect
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
  2023-06-14 18:02 ` [PATCH v4 01/15] cpufreq: Allow restricting to internal governors only Jason Andryuk
  2023-06-14 18:02 ` [PATCH v4 02/15] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-14 18:02 ` [PATCH v4 04/15] xen/sysctl: Nest cpufreq scaling options Jason Andryuk
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

Export feature_detect as intel_feature_detect so it can be re-used by
HWP.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v4:
Add Jan's Ack

v3:
Remove void * cast when calling intel_feature_detect

v2:
export intel_feature_detect with typed pointer
Move intel_feature_detect to acpi/cpufreq/cpufreq.h since the
declaration now contains struct cpufreq_policy *.
---
 xen/arch/x86/acpi/cpufreq/cpufreq.c | 8 ++++++--
 xen/include/acpi/cpufreq/cpufreq.h  | 2 ++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index 6c70d04395..f1cc473b4f 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -339,9 +339,8 @@ static unsigned int cf_check get_cur_freq_on_cpu(unsigned int cpu)
     return extract_freq(get_cur_val(cpumask_of(cpu)), data);
 }
 
-static void cf_check feature_detect(void *info)
+void intel_feature_detect(struct cpufreq_policy *policy)
 {
-    struct cpufreq_policy *policy = info;
     unsigned int eax;
 
     eax = cpuid_eax(6);
@@ -353,6 +352,11 @@ static void cf_check feature_detect(void *info)
     }
 }
 
+static void cf_check feature_detect(void *info)
+{
+    intel_feature_detect(info);
+}
+
 static unsigned int check_freqs(const cpumask_t *mask, unsigned int freq,
                                 struct acpi_cpufreq_data *data)
 {
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index e2e03b8bd7..a49efd1cb2 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -244,4 +244,6 @@ int write_userspace_scaling_setspeed(unsigned int cpu, unsigned int freq);
 void cpufreq_dbs_timer_suspend(void);
 void cpufreq_dbs_timer_resume(void);
 
+void intel_feature_detect(struct cpufreq_policy *policy);
+
 #endif /* __XEN_CPUFREQ_PM_H__ */
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 04/15] xen/sysctl: Nest cpufreq scaling options
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (2 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 03/15] cpufreq: Export intel_feature_detect Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-15 13:29   ` Jan Beulich
  2023-06-14 18:02 ` [PATCH v4 05/15] pmstat&xenpm: Re-arrage for cpufreq union Jason Andryuk
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Juergen Gross,
	Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall,
	Stefano Stabellini

Add a union and struct so that most of the scaling variables of struct
xen_get_cpufreq_para are within in a binary-compatible layout.  This
allows cppc_para to live in the larger union and use uint32_ts - struct
xen_cppc_para will be 10 uint32_t's.

The new scaling struct is 3 * uint32_t + 16 bytes CPUFREQ_NAME_LEN + 4 *
uint32_t for xen_ondemand = 11 uint32_t.  That means the old size is
retained, int32_t turbo_enabled doesn't move and it's binary compatible.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 tools/include/xenctrl.h     | 22 +++++++++++++---------
 tools/libs/ctrl/xc_pm.c     |  5 -----
 tools/misc/xenpm.c          | 24 ++++++++++++------------
 xen/drivers/acpi/pmstat.c   | 27 ++++++++++++++-------------
 xen/include/public/sysctl.h | 22 +++++++++++++---------
 5 files changed, 52 insertions(+), 48 deletions(-)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index dba33d5d0f..8aedb952a0 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1909,16 +1909,20 @@ struct xc_get_cpufreq_para {
     uint32_t cpuinfo_cur_freq;
     uint32_t cpuinfo_max_freq;
     uint32_t cpuinfo_min_freq;
-    uint32_t scaling_cur_freq;
-
-    char scaling_governor[CPUFREQ_NAME_LEN];
-    uint32_t scaling_max_freq;
-    uint32_t scaling_min_freq;
-
-    /* for specific governor */
     union {
-        xc_userspace_t userspace;
-        xc_ondemand_t ondemand;
+        struct {
+            uint32_t scaling_cur_freq;
+
+            char scaling_governor[CPUFREQ_NAME_LEN];
+            uint32_t scaling_max_freq;
+            uint32_t scaling_min_freq;
+
+            /* for specific governor */
+            union {
+                xc_userspace_t userspace;
+                xc_ondemand_t ondemand;
+            } u;
+        } s;
     } u;
 
     int32_t turbo_enabled;
diff --git a/tools/libs/ctrl/xc_pm.c b/tools/libs/ctrl/xc_pm.c
index c3a9864bf7..f92542eaf7 100644
--- a/tools/libs/ctrl/xc_pm.c
+++ b/tools/libs/ctrl/xc_pm.c
@@ -265,15 +265,10 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
         user_para->cpuinfo_cur_freq = sys_para->cpuinfo_cur_freq;
         user_para->cpuinfo_max_freq = sys_para->cpuinfo_max_freq;
         user_para->cpuinfo_min_freq = sys_para->cpuinfo_min_freq;
-        user_para->scaling_cur_freq = sys_para->scaling_cur_freq;
-        user_para->scaling_max_freq = sys_para->scaling_max_freq;
-        user_para->scaling_min_freq = sys_para->scaling_min_freq;
         user_para->turbo_enabled    = sys_para->turbo_enabled;
 
         memcpy(user_para->scaling_driver,
                 sys_para->scaling_driver, CPUFREQ_NAME_LEN);
-        memcpy(user_para->scaling_governor,
-                sys_para->scaling_governor, CPUFREQ_NAME_LEN);
 
         /* copy to user_para no matter what cpufreq governor */
         BUILD_BUG_ON(sizeof(((struct xc_get_cpufreq_para *)0)->u) !=
diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 1bb6187e56..ee8ce5d5f2 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -730,39 +730,39 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
     printf("scaling_avail_gov    : %s\n",
            p_cpufreq->scaling_available_governors);
 
-    printf("current_governor     : %s\n", p_cpufreq->scaling_governor);
-    if ( !strncmp(p_cpufreq->scaling_governor,
+    printf("current_governor     : %s\n", p_cpufreq->u.s.scaling_governor);
+    if ( !strncmp(p_cpufreq->u.s.scaling_governor,
                   "userspace", CPUFREQ_NAME_LEN) )
     {
         printf("  userspace specific :\n");
         printf("    scaling_setspeed : %u\n",
-               p_cpufreq->u.userspace.scaling_setspeed);
+               p_cpufreq->u.s.u.userspace.scaling_setspeed);
     }
-    else if ( !strncmp(p_cpufreq->scaling_governor,
+    else if ( !strncmp(p_cpufreq->u.s.scaling_governor,
                        "ondemand", CPUFREQ_NAME_LEN) )
     {
         printf("  ondemand specific  :\n");
         printf("    sampling_rate    : max [%u] min [%u] cur [%u]\n",
-               p_cpufreq->u.ondemand.sampling_rate_max,
-               p_cpufreq->u.ondemand.sampling_rate_min,
-               p_cpufreq->u.ondemand.sampling_rate);
+               p_cpufreq->u.s.u.ondemand.sampling_rate_max,
+               p_cpufreq->u.s.u.ondemand.sampling_rate_min,
+               p_cpufreq->u.s.u.ondemand.sampling_rate);
         printf("    up_threshold     : %u\n",
-               p_cpufreq->u.ondemand.up_threshold);
+               p_cpufreq->u.s.u.ondemand.up_threshold);
     }
 
     printf("scaling_avail_freq   :");
     for ( i = 0; i < p_cpufreq->freq_num; i++ )
         if ( p_cpufreq->scaling_available_frequencies[i] ==
-             p_cpufreq->scaling_cur_freq )
+             p_cpufreq->u.s.scaling_cur_freq )
             printf(" *%d", p_cpufreq->scaling_available_frequencies[i]);
         else
             printf(" %d", p_cpufreq->scaling_available_frequencies[i]);
     printf("\n");
 
     printf("scaling frequency    : max [%u] min [%u] cur [%u]\n",
-           p_cpufreq->scaling_max_freq,
-           p_cpufreq->scaling_min_freq,
-           p_cpufreq->scaling_cur_freq);
+           p_cpufreq->u.s.scaling_max_freq,
+           p_cpufreq->u.s.scaling_min_freq,
+           p_cpufreq->u.s.scaling_cur_freq);
 
     printf("turbo mode           : %s\n",
            p_cpufreq->turbo_enabled ? "enabled" : "disabled or n/a");
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index 1bae635101..f5a9ac3f1a 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -258,37 +258,38 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
         cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
     op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
     op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
-    op->u.get_para.scaling_cur_freq = policy->cur;
-    op->u.get_para.scaling_max_freq = policy->max;
-    op->u.get_para.scaling_min_freq = policy->min;
+
+    op->u.get_para.u.s.scaling_cur_freq = policy->cur;
+    op->u.get_para.u.s.scaling_max_freq = policy->max;
+    op->u.get_para.u.s.scaling_min_freq = policy->min;
 
     if ( cpufreq_driver.name[0] )
-        strlcpy(op->u.get_para.scaling_driver, 
+        strlcpy(op->u.get_para.scaling_driver,
             cpufreq_driver.name, CPUFREQ_NAME_LEN);
     else
         strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
 
     if ( policy->governor->name[0] )
-        strlcpy(op->u.get_para.scaling_governor, 
+        strlcpy(op->u.get_para.u.s.scaling_governor,
             policy->governor->name, CPUFREQ_NAME_LEN);
     else
-        strlcpy(op->u.get_para.scaling_governor, "Unknown", CPUFREQ_NAME_LEN);
+        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown", CPUFREQ_NAME_LEN);
 
     /* governor specific para */
-    if ( !strncasecmp(op->u.get_para.scaling_governor,
+    if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
                       "userspace", CPUFREQ_NAME_LEN) )
     {
-        op->u.get_para.u.userspace.scaling_setspeed = policy->cur;
+        op->u.get_para.u.s.u.userspace.scaling_setspeed = policy->cur;
     }
 
-    if ( !strncasecmp(op->u.get_para.scaling_governor,
+    if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
                       "ondemand", CPUFREQ_NAME_LEN) )
     {
         ret = get_cpufreq_ondemand_para(
-            &op->u.get_para.u.ondemand.sampling_rate_max,
-            &op->u.get_para.u.ondemand.sampling_rate_min,
-            &op->u.get_para.u.ondemand.sampling_rate,
-            &op->u.get_para.u.ondemand.up_threshold);
+            &op->u.get_para.u.s.u.ondemand.sampling_rate_max,
+            &op->u.get_para.u.s.u.ondemand.sampling_rate_min,
+            &op->u.get_para.u.s.u.ondemand.sampling_rate,
+            &op->u.get_para.u.s.u.ondemand.up_threshold);
     }
     op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
 
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 9d06e92d0f..bdcea99d71 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -317,16 +317,20 @@ struct xen_get_cpufreq_para {
     uint32_t cpuinfo_cur_freq;
     uint32_t cpuinfo_max_freq;
     uint32_t cpuinfo_min_freq;
-    uint32_t scaling_cur_freq;
-
-    char scaling_governor[CPUFREQ_NAME_LEN];
-    uint32_t scaling_max_freq;
-    uint32_t scaling_min_freq;
-
-    /* for specific governor */
     union {
-        struct  xen_userspace userspace;
-        struct  xen_ondemand ondemand;
+        struct {
+            uint32_t scaling_cur_freq;
+
+            char scaling_governor[CPUFREQ_NAME_LEN];
+            uint32_t scaling_max_freq;
+            uint32_t scaling_min_freq;
+
+            /* for specific governor */
+            union {
+                struct  xen_userspace userspace;
+                struct  xen_ondemand ondemand;
+            } u;
+        } s;
     } u;
 
     int32_t turbo_enabled;
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 05/15] pmstat&xenpm: Re-arrage for cpufreq union
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (3 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 04/15] xen/sysctl: Nest cpufreq scaling options Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-15 14:38   ` Jan Beulich
  2023-06-14 18:02 ` [PATCH v4 06/15] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Juergen Gross, Jan Beulich

Move some code around now that common xen_sysctl_pm_op get_para fields
are together.  In particular, the scaling governor information like
scaling_available_governors is inside the union, so it is not always
available.

With that, gov_num may be 0, so bounce buffer handling needs
to be modified.

scaling_governor won't be filled for hwp, so this will simplify the
change when it is introduced.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 tools/libs/ctrl/xc_pm.c   | 12 ++++++++----
 tools/misc/xenpm.c        |  3 ++-
 xen/drivers/acpi/pmstat.c | 32 +++++++++++++++++---------------
 3 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/tools/libs/ctrl/xc_pm.c b/tools/libs/ctrl/xc_pm.c
index f92542eaf7..19fe1a79dd 100644
--- a/tools/libs/ctrl/xc_pm.c
+++ b/tools/libs/ctrl/xc_pm.c
@@ -221,7 +221,7 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
     {
         if ( (!user_para->affected_cpus)                    ||
              (!user_para->scaling_available_frequencies)    ||
-             (!user_para->scaling_available_governors) )
+             (user_para->gov_num && !user_para->scaling_available_governors) )
         {
             errno = EINVAL;
             return -1;
@@ -230,12 +230,15 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
             goto unlock_1;
         if ( xc_hypercall_bounce_pre(xch, scaling_available_frequencies) )
             goto unlock_2;
-        if ( xc_hypercall_bounce_pre(xch, scaling_available_governors) )
+        if ( user_para->gov_num &&
+             xc_hypercall_bounce_pre(xch, scaling_available_governors) )
             goto unlock_3;
 
         set_xen_guest_handle(sys_para->affected_cpus, affected_cpus);
         set_xen_guest_handle(sys_para->scaling_available_frequencies, scaling_available_frequencies);
-        set_xen_guest_handle(sys_para->scaling_available_governors, scaling_available_governors);
+        if ( user_para->gov_num )
+            set_xen_guest_handle(sys_para->scaling_available_governors,
+                                 scaling_available_governors);
     }
 
     sysctl.cmd = XEN_SYSCTL_pm_op;
@@ -278,7 +281,8 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
     }
 
 unlock_4:
-    xc_hypercall_bounce_post(xch, scaling_available_governors);
+    if ( user_para->gov_num )
+        xc_hypercall_bounce_post(xch, scaling_available_governors);
 unlock_3:
     xc_hypercall_bounce_post(xch, scaling_available_frequencies);
 unlock_2:
diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index ee8ce5d5f2..1c474c3b59 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -811,7 +811,8 @@ static int show_cpufreq_para_by_cpuid(xc_interface *xc_handle, int cpuid)
             ret = -ENOMEM;
             goto out;
         }
-        if (!(p_cpufreq->scaling_available_governors =
+        if (p_cpufreq->gov_num &&
+            !(p_cpufreq->scaling_available_governors =
               malloc(p_cpufreq->gov_num * CPUFREQ_NAME_LEN * sizeof(char))))
         {
             fprintf(stderr,
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index f5a9ac3f1a..57359c21d8 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -239,11 +239,24 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
     if ( ret )
         return ret;
 
+    op->u.get_para.cpuinfo_cur_freq =
+        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
+    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
+    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
+    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
+
+    if ( cpufreq_driver.name[0] )
+        strlcpy(op->u.get_para.scaling_driver,
+            cpufreq_driver.name, CPUFREQ_NAME_LEN);
+    else
+        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
+
     if ( !(scaling_available_governors =
            xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
         return -ENOMEM;
-    if ( (ret = read_scaling_available_governors(scaling_available_governors,
-                gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
+    if ( (ret = read_scaling_available_governors(
+                    scaling_available_governors,
+                    gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
     {
         xfree(scaling_available_governors);
         return ret;
@@ -254,26 +267,16 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
     if ( ret )
         return ret;
 
-    op->u.get_para.cpuinfo_cur_freq =
-        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
-    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
-    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
-
     op->u.get_para.u.s.scaling_cur_freq = policy->cur;
     op->u.get_para.u.s.scaling_max_freq = policy->max;
     op->u.get_para.u.s.scaling_min_freq = policy->min;
 
-    if ( cpufreq_driver.name[0] )
-        strlcpy(op->u.get_para.scaling_driver,
-            cpufreq_driver.name, CPUFREQ_NAME_LEN);
-    else
-        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
-
     if ( policy->governor->name[0] )
         strlcpy(op->u.get_para.u.s.scaling_governor,
             policy->governor->name, CPUFREQ_NAME_LEN);
     else
-        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown", CPUFREQ_NAME_LEN);
+        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
+                CPUFREQ_NAME_LEN);
 
     /* governor specific para */
     if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
@@ -291,7 +294,6 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
             &op->u.get_para.u.s.u.ondemand.sampling_rate,
             &op->u.get_para.u.s.u.ondemand.up_threshold);
     }
-    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
 
     return ret;
 }
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (4 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 05/15] pmstat&xenpm: Re-arrage for cpufreq union Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-19 11:38   ` Jan Beulich
  2023-06-14 18:02 ` [PATCH v4 07/15] xen/x86: Tweak PDC bits when using HWP Jason Andryuk
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné

From the Intel SDM: "Hardware-Controlled Performance States (HWP), which
autonomously selects performance states while utilizing OS supplied
performance guidance hints."

Enable HWP to run in autonomous mode by poking the correct MSRs.
cpufreq=hwp enables and specifying cpufreq=xen would disable it.  hdc is
a sub-option under hwp (i.e.  cpufreq=xen:hwp,hdc=0) as is verbose.  If
cpufreq=hwp is specified, but hardware support is unavailable, Xen
fallbacks back to cpufreq=xen.

There is no interface to configure - xen_sysctl_pm_op/xenpm will
be extended to configure in subsequent patches.  It will run with the
default values, which should be the default 0x80 (out of 0x0-0xff)
energy/performance preference.

Unscientific powertop measurement of an mostly idle, customized OpenXT
install:
A 10th gen 6-core laptop showed battery discharge drop from ~9.x to
~7.x watts.
A 8th gen 4-core laptop dropped from ~10 to ~9

Power usage depends on many factors, especially display brightness, but
this does show a power saving in balanced mode when CPU utilization is
low.

HWP isn't compatible with an external governor - it doesn't take
explicit frequency requests.  Therefore a minimal internal governor,
hwp, is also added as a placeholder.

While adding to the xen-command-line.pandoc entry, un-nest verbose from
minfreq.  They are independent.

With cpufreq=hwp,verbose, HWP prints processor capabilities that are not
used by the code, like HW_FEEDBACK.  This is done because otherwise
there isn't a convenient way to query the information.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

---

We disable on cpuid_level < 0x16.  cpuid(0x16) is used to get the cpu
frequencies for calculating the APERF/MPERF.  Without it, things would
still work, but the average cpu frequency output would be wrong.

My 8th & 10th gen test systems both report:
(XEN) HWP: 1 notify: 1 act_window: 1 energy_perf: 1 pkg_level: 0 peci: 0
(XEN) HWP: Hardware Duty Cycling (HDC) supported
(XEN) HWP: HW_FEEDBACK not supported

Specifying HWP as a fake governor - cpufreq=xen:hwp - would require
resetting the governor if HWP hardware wasn't available.  Making
cpufreq=hwp a top level option avoids that issue.

Falling back from cpufreq=hwp to cpufreq=xen is a more user-friendly
choice than disabling cpufreq when HWP is not available.  Specifying
cpufreq=hwp indicates the user wants cpufreq, so, if HWP isn't
available, it makes sense to give them the cpufreq that can be
supported.  i.e. I can't see a user only wanting cpufreq=hwp or
cpufreq=none, but not cpufreq=xen.

We can't use parse_boolean() since it requires a single name=val string
and cpufreq_handle_common_option is provided two strings.  Use
parse_bool() and manual handle no-hwp.

Write to disable the interrupt - the linux pstate driver does this.  We
don't use the interrupts, so we can just turn them off.  We aren't ready
to handle them, so we don't want any.  Unclear if this is necessary.
SDM says it's default disabled.

FAST_IA32_HWP_REQUEST was removed in v2.  The check in v1 was wrong,
it's a model specific feature and the CPUID bit is only available
after enabling via the MSR.  Support was untested since I don't have
hardware with the feature.  Writes are expected to be infrequent, so
just leave it out.

---
v2:
Alphabetize headers
Re-work driver registration
name hwp_drv_data anonymous union "hw"
Drop hwp_verbose_cont
style cleanups
Condense hwp_governor switch
hwp_cpufreq_target remove .raw from hwp_req assignment
Use typed-pointer in a few functions
Pass type to xzalloc
Add HWP_ENERGY_PERF_BALANCE/IA32_ENERGY_BIAS_BALANCE defines
Add XEN_HWP_GOVERNOR define for "hwp-internal"
Capitalize CPUID and MSR defines
Change '_' to '-' for energy-perf & act-window
Read-modify-write MSRs updates
Use FAST_IA32_HWP_REQUEST_MSR_ENABLE define
constify pointer in hwp_set_misc_turbo
Add space after non-fallthrough break in governor switch
Add IA32_ENERGY_BIAS_MASK define
Check CPUID_PM_LEAK for energy bias when needed
Fail initialization with curr_req = -1
Fold hwp_read_capabilities into hwp_init_msrs
Add command line cpufreq=xen:hwp
Add command line cpufreq=xen:hdc
Use per_cpu for hwp_drv_data pointers
Move hwp_energy_perf_bias call into hwp_write_request
energy_perf 0 is valid, so hwp_energy_perf_bias cannot be skipped
Ensure we don't generate interrupts
Remove Fast Write of Uncore MSR
Initialize hwp_drv_data from curr_req
Use SPDX line instead of license text in hwp.c

v3:
Add cf_check to cpufreq_gov_hwp_init() - Marek
Print cpuid_level with %#x - Marek

v4:
Use BIT() for CPUID and MSR bits
Move __initdata after type
Add __ro_after_init to feature_*
Remove aperf/mperf comment
Move feature_hwp_energy_perf { to newline
Remove _IA32_ infix
Use unsigned int & bool for bitfields
Require energy perf pref (Remove ENERGY_PERF_BIAS support)
Initialize activity_window
Return errors on wrmsr failure
Change command line to: cpufreq=xen:hwp
Move hdc into the hwp-specific handle_options
Drop feature_hwp_energy_perf, feature_hwp_pkg_level_ctl & feature_hwp_peci
Print features before exiting when energy/performance preference isn't available
Disable HWP MSR on initialization error
Change hwp_ print macros to add prefixes
Disable HDC when hdc=0 - (opt_hdc no longer initdata)
Mark hwp governor internal and use "hwp" name
Add XEN_HWP_DRIVER
Use top-level cpufreq=hwp command line option
Document that cpufreq=hwp falls back to cpufreq=xen without hardware
Add SPDX suffix GPL-2.0-only
---
 docs/misc/xen-command-line.pandoc         |   9 +-
 xen/arch/x86/acpi/cpufreq/Makefile        |   1 +
 xen/arch/x86/acpi/cpufreq/cpufreq.c       |   5 +-
 xen/arch/x86/acpi/cpufreq/hwp.c           | 537 ++++++++++++++++++++++
 xen/arch/x86/include/asm/cpufeature.h     |  12 +-
 xen/arch/x86/include/asm/msr-index.h      |  13 +
 xen/drivers/cpufreq/cpufreq.c             |   9 +
 xen/include/acpi/cpufreq/cpufreq.h        |   3 +
 xen/include/acpi/cpufreq/processor_perf.h |   3 +
 xen/include/public/sysctl.h               |   1 +
 10 files changed, 589 insertions(+), 4 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 4060ebdc5d..9d6bdd6d66 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
 available support.
 
 ### cpufreq
-> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
+> `= none | {{ <boolean> | xen } { [:[powersave|performance|ondemand|userspace][,[<maxfreq>]][,[<minfreq>]]] } [,verbose]} | dom0-kernel | hwp[:[<hdc>][,verbose]]`
 
 > Default: `xen`
 
@@ -510,6 +510,13 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
 * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
   respectively.
 * `verbose` option can be included as a string or also as `verbose=<integer>`
+* `hwp` selects Hardware-Controlled Performance States (HWP) on supported Intel
+  hardware.  HWP is a Skylake+ feature which provides better CPU power
+  management.  The default is disabled.  If `hwp` is selected, but hardware
+  support is not available, Xen will fallback to cpufreq=xen.
+* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
+  processor to autonomously force physical package components into idle state.
+  The default is enabled, but the option only applies when `hwp` is enabled.
 
 ### cpuid (x86)
 > `= List of comma separated booleans`
diff --git a/xen/arch/x86/acpi/cpufreq/Makefile b/xen/arch/x86/acpi/cpufreq/Makefile
index f75da9b9ca..db83aa6b14 100644
--- a/xen/arch/x86/acpi/cpufreq/Makefile
+++ b/xen/arch/x86/acpi/cpufreq/Makefile
@@ -1,2 +1,3 @@
 obj-y += cpufreq.o
+obj-y += hwp.o
 obj-y += powernow.o
diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index f1cc473b4f..56816b1aee 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -642,7 +642,10 @@ static int __init cf_check cpufreq_driver_init(void)
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = cpufreq_register_driver(&acpi_cpufreq_driver);
+            if ( hwp_available() )
+                ret = hwp_register_driver();
+            else
+                ret = cpufreq_register_driver(&acpi_cpufreq_driver);
             break;
 
         case X86_VENDOR_AMD:
diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
new file mode 100644
index 0000000000..c62345dde7
--- /dev/null
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -0,0 +1,537 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
+ *
+ * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
+ */
+
+#include <xen/cpumask.h>
+#include <xen/init.h>
+#include <xen/param.h>
+#include <xen/xmalloc.h>
+#include <asm/io.h>
+#include <asm/msr.h>
+#include <acpi/cpufreq/cpufreq.h>
+
+static bool __ro_after_init feature_hwp;
+static bool __ro_after_init feature_hwp_notification;
+static bool __ro_after_init feature_hwp_activity_window;
+
+static bool __ro_after_init feature_hdc;
+
+bool __initdata opt_cpufreq_hwp;
+static bool __ro_after_init opt_cpufreq_hdc = true;
+
+union hwp_request
+{
+    struct
+    {
+        unsigned int min_perf:8;
+        unsigned int max_perf:8;
+        unsigned int desired:8;
+        unsigned int energy_perf:8;
+        unsigned int activity_window:10;
+        bool package_control:1;
+        unsigned int reserved:16;
+        bool activity_window_valid:1;
+        bool energy_perf_valid:1;
+        bool desired_valid:1;
+        bool max_perf_valid:1;
+        bool min_perf_valid:1;
+    };
+    uint64_t raw;
+};
+
+struct hwp_drv_data
+{
+    union
+    {
+        uint64_t hwp_caps;
+        struct
+        {
+            unsigned int highest:8;
+            unsigned int guaranteed:8;
+            unsigned int most_efficient:8;
+            unsigned int lowest:8;
+            unsigned int reserved:32;
+        } hw;
+    };
+    union hwp_request curr_req;
+    int ret;
+    uint16_t activity_window;
+    uint8_t minimum;
+    uint8_t maximum;
+    uint8_t desired;
+    uint8_t energy_perf;
+};
+DEFINE_PER_CPU_READ_MOSTLY(struct hwp_drv_data *, hwp_drv_data);
+
+#define hwp_err(fmt, cpu, ...) \
+    printk(XENLOG_ERR "HWP: CPU%u error: " fmt, cpu, ##__VA_ARGS__)
+#define hwp_info(fmt, ...)    printk(XENLOG_INFO "HWP: " fmt, ##__VA_ARGS__)
+#define hwp_verbose(fmt, ...)                             \
+({                                                        \
+    if ( cpufreq_verbose )                                \
+        printk(XENLOG_DEBUG "HWP: " fmt, ##__VA_ARGS__);  \
+})
+
+static int cf_check hwp_governor(struct cpufreq_policy *policy,
+                                 unsigned int event)
+{
+    int ret;
+
+    if ( policy == NULL )
+        return -EINVAL;
+
+    switch ( event )
+    {
+    case CPUFREQ_GOV_START:
+    case CPUFREQ_GOV_LIMITS:
+        ret = 0;
+        break;
+
+    case CPUFREQ_GOV_STOP:
+    default:
+        ret = -EINVAL;
+        break;
+    }
+
+    return ret;
+}
+
+static bool hwp_handle_option(const char *s, const char *end)
+{
+    int ret;
+
+    if ( strncmp(s, "verbose", 7) == 0 )
+    {
+        s += 7;
+        if ( *s == '=' )
+        {
+            s++;
+            cpufreq_verbose = !!simple_strtoul(s, NULL, 0);
+
+            return true;
+        }
+
+        if ( end == NULL ||
+             (end == s && (*end == '\0' || *end == ',')) )
+        {
+            cpufreq_verbose = true;
+
+            return true;
+        }
+
+        return false;
+    }
+
+    ret = parse_boolean("hdc", s, end);
+    if ( ret < 0 )
+        return false;
+
+    opt_cpufreq_hdc = ret;
+
+    return true;
+}
+
+int __init hwp_cmdline_parse(const char *s)
+{
+    do
+    {
+        const char *end = strchr(s, ',');
+
+        if ( s && !hwp_handle_option(s, end) )
+        {
+            printk(XENLOG_WARNING "cpufreq/hwp: option '%s' not recognized\n",
+                   s);
+
+
+            return -1;
+        }
+
+        s = end ? ++end : end;
+    } while ( s );
+
+    return 0;
+}
+
+static struct cpufreq_governor cpufreq_gov_hwp =
+{
+    .name          = "hwp",
+    .governor      = hwp_governor,
+    .internal      = true,
+};
+
+static int __init cf_check cpufreq_gov_hwp_init(void)
+{
+    return cpufreq_register_governor(&cpufreq_gov_hwp);
+}
+__initcall(cpufreq_gov_hwp_init);
+
+bool __init hwp_available(void)
+{
+    unsigned int eax;
+
+    if ( !opt_cpufreq_hwp )
+        return false;
+
+    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
+    {
+        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
+                    boot_cpu_data.cpuid_level);
+
+        return false;
+    }
+
+    if ( boot_cpu_data.cpuid_level < 0x16 )
+    {
+        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
+                 boot_cpu_data.cpuid_level);
+
+        return false;
+    }
+
+    eax = cpuid_eax(CPUID_PM_LEAF);
+
+    hwp_verbose("%d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
+                !!(eax & CPUID6_EAX_HWP),
+                !!(eax & CPUID6_EAX_HWP_NOTIFICATION),
+                !!(eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW),
+                !!(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE),
+                !!(eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST),
+                !!(eax & CPUID6_EAX_HWP_PECI));
+
+    if ( !(eax & CPUID6_EAX_HWP) )
+        return false;
+
+    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) )
+    {
+        hwp_verbose("disabled: No energy/performance preference available");
+
+        return false;
+    }
+
+    feature_hwp                 = eax & CPUID6_EAX_HWP;
+    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
+    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
+    feature_hdc = eax & CPUID6_EAX_HDC;
+
+    hwp_verbose("Hardware Duty Cycling (HDC) %ssupported%s\n",
+                feature_hdc ? "" : "not ",
+                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
+                            : "");
+
+    hwp_verbose("HW_FEEDBACK %ssupported\n",
+                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
+
+    cpufreq_governor_internal = feature_hwp;
+
+    if ( feature_hwp )
+        hwp_info("Using HWP for cpufreq\n");
+
+    return feature_hwp;
+}
+
+static int hdc_set_pkg_hdc_ctl(unsigned int cpu, bool val)
+{
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_PKG_HDC_CTL, msr) )
+    {
+        hwp_err("rdmsr_safe(MSR_PKG_HDC_CTL)\n", cpu);
+
+        return -1;
+    }
+
+    if ( val )
+        msr |= PKG_HDC_CTL_HDC_PKG_ENABLE;
+    else
+        msr &= ~PKG_HDC_CTL_HDC_PKG_ENABLE;
+
+    if ( wrmsr_safe(MSR_PKG_HDC_CTL, msr) )
+    {
+        hwp_err("wrmsr_safe(MSR_PKG_HDC_CTL): %016lx\n", cpu, msr);
+
+        return -1;
+    }
+
+    return 0;
+}
+
+static int hdc_set_pm_ctl1(unsigned int cpu, bool val)
+{
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_PM_CTL1, msr) )
+    {
+        hwp_err("rdmsr_safe(MSR_PM_CTL1)\n", cpu);
+
+        return -1;
+    }
+
+    if ( val )
+        msr |= PM_CTL1_HDC_ALLOW_BLOCK;
+    else
+        msr &= ~PM_CTL1_HDC_ALLOW_BLOCK;
+
+    if ( wrmsr_safe(MSR_PM_CTL1, msr) )
+    {
+        hwp_err("wrmsr_safe(MSR_PM_CTL1): %016lx\n", cpu, msr);
+
+        return -1;
+    }
+
+    return 0;
+}
+
+static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
+{
+    uint32_t base_khz, max_khz, bus_khz, edx;
+
+    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
+
+    policy->cpuinfo.perf_freq = base_khz * 1000;
+    policy->cpuinfo.min_freq = base_khz * 1000;
+    policy->cpuinfo.max_freq = max_khz * 1000;
+    policy->min = base_khz * 1000;
+    policy->max = max_khz * 1000;
+    policy->cur = 0;
+}
+
+static void cf_check hwp_init_msrs(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
+    uint64_t val;
+
+    /*
+     * Package level MSR, but we don't have a good idea of packages here, so
+     * just do it everytime.
+     */
+    if ( rdmsr_safe(MSR_PM_ENABLE, val) )
+    {
+        hwp_err("rdmsr_safe(MSR_PM_ENABLE)\n", policy->cpu);
+        data->curr_req.raw = -1;
+        return;
+    }
+
+    /* Ensure we don't generate interrupts */
+    if ( feature_hwp_notification )
+        wrmsr_safe(MSR_HWP_INTERRUPT, 0);
+
+    hwp_verbose("CPU%u: MSR_PM_ENABLE: %016lx\n", policy->cpu, val);
+    if ( !(val & PM_ENABLE_HWP_ENABLE) )
+    {
+        val |= PM_ENABLE_HWP_ENABLE;
+        if ( wrmsr_safe(MSR_PM_ENABLE, val) )
+        {
+            hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val);
+            data->curr_req.raw = -1;
+            return;
+        }
+    }
+
+    if ( rdmsr_safe(MSR_HWP_CAPABILITIES, data->hwp_caps) )
+    {
+        hwp_err("rdmsr_safe(MSR_HWP_CAPABILITIES)\n", policy->cpu);
+        goto error;
+    }
+
+    if ( rdmsr_safe(MSR_HWP_REQUEST, data->curr_req.raw) )
+    {
+        hwp_err("rdmsr_safe(MSR_HWP_REQUEST)\n", policy->cpu);
+        goto error;
+    }
+
+    /*
+     * Check for APERF/MPERF support in hardware
+     * also check for boost/turbo support
+     */
+    intel_feature_detect(policy);
+
+    if ( feature_hdc )
+    {
+        if ( hdc_set_pkg_hdc_ctl(policy->cpu, opt_cpufreq_hdc) )
+            goto error;
+        if ( hdc_set_pm_ctl1(policy->cpu, opt_cpufreq_hdc) )
+            goto error;
+    }
+
+    hwp_get_cpu_speeds(policy);
+
+    return;
+
+ error:
+    data->curr_req.raw = -1;
+    val &= ~PM_ENABLE_HWP_ENABLE;
+    if ( wrmsr_safe(MSR_PM_ENABLE, val) )
+        hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val);
+
+    return;
+}
+
+static int cf_check hwp_cpufreq_verify(struct cpufreq_policy *policy)
+{
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
+
+    if ( !feature_hwp_activity_window && data->activity_window )
+    {
+        hwp_verbose("HWP activity window not supported\n");
+
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static void cf_check hwp_write_request(void *info)
+{
+    const struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
+    union hwp_request hwp_req = data->curr_req;
+
+    data->ret = 0;
+
+    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
+    if ( wrmsr_safe(MSR_HWP_REQUEST, hwp_req.raw) )
+    {
+        hwp_verbose("CPU%u: error wrmsr_safe(MSR_HWP_REQUEST, %lx)\n",
+                    policy->cpu, hwp_req.raw);
+        rdmsr_safe(MSR_HWP_REQUEST, data->curr_req.raw);
+        data->ret = -EINVAL;
+    }
+}
+
+static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
+                                       unsigned int target_freq,
+                                       unsigned int relation)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+    /* Zero everything to ensure reserved bits are zero... */
+    union hwp_request hwp_req = { .raw = 0 };
+
+    /* .. and update from there */
+    hwp_req.min_perf = data->minimum;
+    hwp_req.max_perf = data->maximum;
+    hwp_req.desired = data->desired;
+    hwp_req.energy_perf = data->energy_perf;
+    if ( feature_hwp_activity_window )
+        hwp_req.activity_window = data->activity_window;
+
+    if ( hwp_req.raw == data->curr_req.raw )
+        return 0;
+
+    data->curr_req = hwp_req;
+
+    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
+    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
+
+    return data->ret;
+}
+
+static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data;
+
+    data = xzalloc(struct hwp_drv_data);
+    if ( !data )
+        return -ENOMEM;
+
+    if ( cpufreq_opt_governor && strcmp(cpufreq_opt_governor->name,
+                                        cpufreq_gov_hwp.name) )
+        printk(XENLOG_WARNING
+               "HWP: governor \"%s\" is incompatible with hwp. Using default \"%s\"\n",
+               cpufreq_opt_governor->name, cpufreq_gov_hwp.name);
+    policy->governor = &cpufreq_gov_hwp;
+
+    per_cpu(hwp_drv_data, cpu) = data;
+
+    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
+
+    if ( data->curr_req.raw == -1 )
+    {
+        hwp_err("Could not initialize HWP properly\n", cpu);
+        XFREE(per_cpu(hwp_drv_data, cpu));
+        return -ENODEV;
+    }
+
+    data->minimum = data->curr_req.min_perf;
+    data->maximum = data->curr_req.max_perf;
+    data->desired = data->curr_req.desired;
+    data->energy_perf = data->curr_req.energy_perf;
+    data->activity_window = data->curr_req.activity_window;
+
+    hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);
+
+    hwp_verbose("CPU%u: rdmsr HWP_REQUEST %016lx\n", cpu, data->curr_req.raw);
+
+    return 0;
+}
+
+static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+    XFREE(per_cpu(hwp_drv_data, policy->cpu));
+
+    return 0;
+}
+
+/*
+ * The SDM reads like turbo should be disabled with MSR_IA32_PERF_CTL and
+ * PERF_CTL_TURBO_DISENGAGE, but that does not seem to actually work, at least
+ * with my HWP testing.  MSR_MISC_ENABLE and MISC_ENABLE_TURBO_DISENGAGE
+ * is what Linux uses and seems to work.
+ */
+static void cf_check hwp_set_misc_turbo(void *info)
+{
+    const struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
+    uint64_t msr;
+
+    data->ret = 0;
+
+    if ( rdmsr_safe(MSR_IA32_MISC_ENABLE, msr) )
+    {
+        hwp_verbose("CPU%u: error rdmsr_safe(MSR_IA32_MISC_ENABLE)\n",
+                    policy->cpu);
+        data->ret = -EINVAL;
+
+        return;
+    }
+
+    if ( policy->turbo == CPUFREQ_TURBO_ENABLED )
+        msr &= ~MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE;
+    else
+        msr |= MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE;
+
+    if ( wrmsr_safe(MSR_IA32_MISC_ENABLE, msr) )
+    {
+        hwp_verbose("CPU%u: error wrmsr_safe(MSR_IA32_MISC_ENABLE): %016lx\n",
+                    policy->cpu, msr);
+        data->ret = -EINVAL;
+    }
+}
+
+static int cf_check hwp_cpufreq_update(int cpuid, struct cpufreq_policy *policy)
+{
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpuid);
+    on_selected_cpus(cpumask_of(cpuid), hwp_set_misc_turbo, policy, 1);
+
+    return data->ret;
+}
+
+static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
+{
+    .name   = XEN_HWP_DRIVER,
+    .verify = hwp_cpufreq_verify,
+    .target = hwp_cpufreq_target,
+    .init   = hwp_cpufreq_cpu_init,
+    .exit   = hwp_cpufreq_cpu_exit,
+    .update = hwp_cpufreq_update,
+};
+
+int __init hwp_register_driver(void)
+{
+    return cpufreq_register_driver(&hwp_cpufreq_driver);
+}
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index ace31e3b1f..00d618483c 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -52,8 +52,16 @@ static inline bool boot_cpu_has(unsigned int feat)
     return cpu_has(&boot_cpu_data, feat);
 }
 
-#define CPUID_PM_LEAF                    6
-#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
+#define CPUID_PM_LEAF                                6
+#define CPUID6_EAX_HWP                               BIT(7, U)
+#define CPUID6_EAX_HWP_NOTIFICATION                  BIT(8, U)
+#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               BIT(9, U)
+#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE BIT(10, U)
+#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         BIT(11, U)
+#define CPUID6_EAX_HDC                               BIT(13, U)
+#define CPUID6_EAX_HWP_PECI                          BIT(16, U)
+#define CPUID6_EAX_HW_FEEDBACK                       BIT(19, U)
+#define CPUID6_ECX_APERFMPERF_CAPABILITY             BIT(0, U)
 
 /* CPUID level 0x00000001.edx */
 #define cpu_has_fpu             1
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 2749e433d2..47b09a24b5 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -151,6 +151,13 @@
 
 #define MSR_PKRS                            0x000006e1
 
+#define MSR_PM_ENABLE                       0x00000770
+#define  PM_ENABLE_HWP_ENABLE               BIT(0, ULL)
+
+#define MSR_HWP_CAPABILITIES                0x00000771
+#define MSR_HWP_INTERRUPT                   0x00000773
+#define MSR_HWP_REQUEST                     0x00000774
+
 #define MSR_X2APIC_FIRST                    0x00000800
 #define MSR_X2APIC_LAST                     0x000008ff
 
@@ -165,6 +172,11 @@
 #define  PASID_PASID_MASK                   0x000fffff
 #define  PASID_VALID                        (_AC(1, ULL) << 31)
 
+#define MSR_PKG_HDC_CTL                     0x00000db0
+#define  PKG_HDC_CTL_HDC_PKG_ENABLE         BIT(0, ULL)
+#define MSR_PM_CTL1                         0x00000db1
+#define  PM_CTL1_HDC_ALLOW_BLOCK            BIT(0, ULL)
+
 #define MSR_UARCH_MISC_CTRL                 0x00001b01
 #define  UARCH_CTRL_DOITM                   (_AC(1, ULL) <<  0)
 
@@ -503,6 +515,7 @@
 #define MSR_IA32_MISC_ENABLE_LIMIT_CPUID  (1<<22)
 #define MSR_IA32_MISC_ENABLE_XTPR_DISABLE (1<<23)
 #define MSR_IA32_MISC_ENABLE_XD_DISABLE	(1ULL << 34)
+#define MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE (1ULL << 38)
 
 #define MSR_IA32_TSC_DEADLINE		0x000006E0
 #define MSR_IA32_ENERGY_PERF_BIAS	0x000001b0
diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index cccf9a64c8..b6e6d70ddf 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -97,6 +97,15 @@ static int __init cf_check setup_cpufreq_option(const char *str)
             return cpufreq_cmdline_parse(arg + 1);
     }
 
+    if ( choice < 0 && !cmdline_strcmp(str, "hwp") )
+    {
+        xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
+        cpufreq_controller = FREQCTL_xen;
+        opt_cpufreq_hwp = true;
+        if ( *arg && *(arg + 1) )
+            return hwp_cmdline_parse(arg + 1);
+    }
+
     return (choice < 0) ? -EINVAL : 0;
 }
 custom_param("cpufreq", setup_cpufreq_option);
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index a49efd1cb2..6ec7dbcbbb 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -246,4 +246,7 @@ void cpufreq_dbs_timer_resume(void);
 
 void intel_feature_detect(struct cpufreq_policy *policy);
 
+extern bool __initdata opt_cpufreq_hwp;
+int hwp_cmdline_parse(const char *s);
+
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h
index d8a1ba68a6..b751ca4937 100644
--- a/xen/include/acpi/cpufreq/processor_perf.h
+++ b/xen/include/acpi/cpufreq/processor_perf.h
@@ -7,6 +7,9 @@
 
 #define XEN_PX_INIT 0x80000000
 
+bool hwp_available(void);
+int hwp_register_driver(void);
+
 int powernow_cpufreq_init(void);
 unsigned int powernow_register_driver(void);
 unsigned int get_measured_perf(unsigned int cpu, unsigned int flag);
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index bdcea99d71..08fe815329 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -296,6 +296,7 @@ struct xen_ondemand {
     uint32_t up_threshold;
 };
 
+#define XEN_HWP_DRIVER "hwp"
 /*
  * cpufreq para name of this structure named
  * same as sysfs file name of native linux
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 07/15] xen/x86: Tweak PDC bits when using HWP
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (5 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 06/15] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-14 18:02 ` [PATCH v4 08/15] xenpm: Change get-cpufreq-para output for hwp Jason Andryuk
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

Qubes testing of HWP support had a report of a laptop, Thinkpad X1
Carbon Gen 4 with a Skylake processor, locking up during boot when HWP
is enabled.  A user found a kernel bug that seems to be the same issue:
https://bugzilla.kernel.org/show_bug.cgi?id=110941.

That bug was fixed by Linux commit a21211672c9a ("ACPI / processor:
Request native thermal interrupt handling via _OSC").  The tl;dr is SMM
crashes when it receives thermal interrupts, so Linux calls the ACPI
_OSC method to take over interrupt handling.

The Linux fix looks at the CPU features to decide whether or not to call
_OSC with bit 12 set to take over native interrupt handling.  Xen needs
some way to communicate HWP to Dom0 for making an equivalent call.

Xen exposes modified PDC bits via the platform_op set_pminfo hypercall.
Expand that to set bit 12 when HWP is present and in use.

Any generated interrupt would be handled by Xen's thermal drive, which
clears the status.

Bit 12 isn't named in the linux header and is open coded in Linux's
usage.

This will need a corresponding linux patch to pick up and apply the PDC
bits.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v4:
Added __ro_after_init
s/ACPI_PDC_CPPC_NTV_INT/ACPI_PDC_CPPC_NATIVE_INTR/
Remove _IA32_
Fixup for opt_cpufreq_hwp removal
Add Jan Reviewed-by

v3:
New
---
 xen/arch/x86/acpi/cpufreq/hwp.c           | 16 +++++++++++-----
 xen/arch/x86/acpi/lib.c                   |  5 +++++
 xen/arch/x86/cpu/mcheck/mce_intel.c       |  6 ++++++
 xen/arch/x86/include/asm/msr-index.h      |  1 +
 xen/include/acpi/cpufreq/processor_perf.h |  1 +
 xen/include/acpi/pdc_intel.h              |  1 +
 6 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index c62345dde7..5f210b54ff 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -13,7 +13,8 @@
 #include <asm/msr.h>
 #include <acpi/cpufreq/cpufreq.h>
 
-static bool __ro_after_init feature_hwp;
+static bool __ro_after_init hwp_in_use;
+
 static bool __ro_after_init feature_hwp_notification;
 static bool __ro_after_init feature_hwp_activity_window;
 
@@ -168,6 +169,11 @@ static int __init cf_check cpufreq_gov_hwp_init(void)
 }
 __initcall(cpufreq_gov_hwp_init);
 
+bool hwp_active(void)
+{
+    return hwp_in_use;
+}
+
 bool __init hwp_available(void)
 {
     unsigned int eax;
@@ -211,7 +217,6 @@ bool __init hwp_available(void)
         return false;
     }
 
-    feature_hwp                 = eax & CPUID6_EAX_HWP;
     feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
     feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
     feature_hdc = eax & CPUID6_EAX_HDC;
@@ -224,12 +229,13 @@ bool __init hwp_available(void)
     hwp_verbose("HW_FEEDBACK %ssupported\n",
                 (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
 
-    cpufreq_governor_internal = feature_hwp;
+    hwp_in_use = eax & CPUID6_EAX_HWP;
+    cpufreq_governor_internal = hwp_in_use;
 
-    if ( feature_hwp )
+    if ( hwp_in_use )
         hwp_info("Using HWP for cpufreq\n");
 
-    return feature_hwp;
+    return hwp_in_use;
 }
 
 static int hdc_set_pkg_hdc_ctl(unsigned int cpu, bool val)
diff --git a/xen/arch/x86/acpi/lib.c b/xen/arch/x86/acpi/lib.c
index 43831b92d1..1b4710a790 100644
--- a/xen/arch/x86/acpi/lib.c
+++ b/xen/arch/x86/acpi/lib.c
@@ -26,6 +26,8 @@
 #include <asm/fixmap.h>
 #include <asm/mwait.h>
 
+#include <acpi/cpufreq/processor_perf.h>
+
 u32 __read_mostly acpi_smi_cmd;
 u8 __read_mostly acpi_enable_value;
 u8 __read_mostly acpi_disable_value;
@@ -140,5 +142,8 @@ int arch_acpi_set_pdc_bits(u32 acpi_id, u32 *pdc, u32 mask)
 	    !(ecx & CPUID5_ECX_INTERRUPT_BREAK))
 		pdc[2] &= ~(ACPI_PDC_C_C1_FFH | ACPI_PDC_C_C2C3_FFH);
 
+	if (hwp_active())
+		pdc[2] |= ACPI_PDC_CPPC_NATIVE_INTR;
+
 	return 0;
 }
diff --git a/xen/arch/x86/cpu/mcheck/mce_intel.c b/xen/arch/x86/cpu/mcheck/mce_intel.c
index 2f23f02923..c95152ad85 100644
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -15,6 +15,9 @@
 #include <asm/p2m.h>
 #include <asm/mce.h>
 #include <asm/apic.h>
+
+#include <acpi/cpufreq/processor_perf.h>
+
 #include "mce.h"
 #include "x86_mca.h"
 #include "barrier.h"
@@ -64,6 +67,9 @@ static void cf_check intel_thermal_interrupt(struct cpu_user_regs *regs)
 
     ack_APIC_irq();
 
+    if ( hwp_active() )
+        wrmsr_safe(MSR_HWP_STATUS, 0);
+
     if ( NOW() < per_cpu(next, cpu) )
         return;
 
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 47b09a24b5..351745f6bc 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -157,6 +157,7 @@
 #define MSR_HWP_CAPABILITIES                0x00000771
 #define MSR_HWP_INTERRUPT                   0x00000773
 #define MSR_HWP_REQUEST                     0x00000774
+#define MSR_HWP_STATUS                      0x00000777
 
 #define MSR_X2APIC_FIRST                    0x00000800
 #define MSR_X2APIC_LAST                     0x000008ff
diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h
index b751ca4937..dd8ec36ba7 100644
--- a/xen/include/acpi/cpufreq/processor_perf.h
+++ b/xen/include/acpi/cpufreq/processor_perf.h
@@ -8,6 +8,7 @@
 #define XEN_PX_INIT 0x80000000
 
 bool hwp_available(void);
+bool hwp_active(void);
 int hwp_register_driver(void);
 
 int powernow_cpufreq_init(void);
diff --git a/xen/include/acpi/pdc_intel.h b/xen/include/acpi/pdc_intel.h
index 4fb719d6f5..abaa098b51 100644
--- a/xen/include/acpi/pdc_intel.h
+++ b/xen/include/acpi/pdc_intel.h
@@ -17,6 +17,7 @@
 #define ACPI_PDC_C_C1_FFH		(0x0100)
 #define ACPI_PDC_C_C2C3_FFH		(0x0200)
 #define ACPI_PDC_SMP_P_HWCOORD		(0x0800)
+#define ACPI_PDC_CPPC_NATIVE_INTR	(0x1000)
 
 #define ACPI_PDC_EST_CAPABILITY_SMP	(ACPI_PDC_SMP_C1PT | \
 					 ACPI_PDC_C_C1_HALT | \
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 08/15] xenpm: Change get-cpufreq-para output for hwp
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (6 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 07/15] xen/x86: Tweak PDC bits when using HWP Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-19 11:50   ` Jan Beulich
  2023-06-14 18:02 ` [PATCH v4 09/15] cpufreq: Export HWP parameters to userspace as CPPC Jason Andryuk
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD

When using HWP, some of the returned data is not applicable.  In that
case, we should just omit it to avoid confusing the user.  So switch to
printing the base and max frequencies since those are relevant to HWP.
Similarly, stop printing the CPU frequencies since those do not apply.
The scaling fields are also no longer printed.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v2:
Use full governor name XEN_HWP_GOVERNOR to change output
Style fixes

v4:
s/turbo/max/
Check for XEN_HWP_DRIVER driver instead of "-internal"
---
 tools/misc/xenpm.c | 83 +++++++++++++++++++++++++---------------------
 1 file changed, 46 insertions(+), 37 deletions(-)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 1c474c3b59..4e53e68dc5 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -711,6 +711,7 @@ void start_gather_func(int argc, char *argv[])
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
+    bool hwp = strcmp(p_cpufreq->scaling_driver, XEN_HWP_DRIVER) == 0;
     int i;
 
     printf("cpu id               : %d\n", cpuid);
@@ -720,49 +721,57 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
         printf(" %d", p_cpufreq->affected_cpus[i]);
     printf("\n");
 
-    printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
-           p_cpufreq->cpuinfo_max_freq,
-           p_cpufreq->cpuinfo_min_freq,
-           p_cpufreq->cpuinfo_cur_freq);
+    if ( hwp )
+        printf("cpuinfo frequency    : base [%u] max [%u]\n",
+               p_cpufreq->cpuinfo_min_freq,
+               p_cpufreq->cpuinfo_max_freq);
+    else
+        printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
+               p_cpufreq->cpuinfo_max_freq,
+               p_cpufreq->cpuinfo_min_freq,
+               p_cpufreq->cpuinfo_cur_freq);
 
     printf("scaling_driver       : %s\n", p_cpufreq->scaling_driver);
 
-    printf("scaling_avail_gov    : %s\n",
-           p_cpufreq->scaling_available_governors);
-
-    printf("current_governor     : %s\n", p_cpufreq->u.s.scaling_governor);
-    if ( !strncmp(p_cpufreq->u.s.scaling_governor,
-                  "userspace", CPUFREQ_NAME_LEN) )
-    {
-        printf("  userspace specific :\n");
-        printf("    scaling_setspeed : %u\n",
-               p_cpufreq->u.s.u.userspace.scaling_setspeed);
-    }
-    else if ( !strncmp(p_cpufreq->u.s.scaling_governor,
-                       "ondemand", CPUFREQ_NAME_LEN) )
+    if ( !hwp )
     {
-        printf("  ondemand specific  :\n");
-        printf("    sampling_rate    : max [%u] min [%u] cur [%u]\n",
-               p_cpufreq->u.s.u.ondemand.sampling_rate_max,
-               p_cpufreq->u.s.u.ondemand.sampling_rate_min,
-               p_cpufreq->u.s.u.ondemand.sampling_rate);
-        printf("    up_threshold     : %u\n",
-               p_cpufreq->u.s.u.ondemand.up_threshold);
-    }
+        printf("scaling_avail_gov    : %s\n",
+               p_cpufreq->scaling_available_governors);
 
-    printf("scaling_avail_freq   :");
-    for ( i = 0; i < p_cpufreq->freq_num; i++ )
-        if ( p_cpufreq->scaling_available_frequencies[i] ==
-             p_cpufreq->u.s.scaling_cur_freq )
-            printf(" *%d", p_cpufreq->scaling_available_frequencies[i]);
-        else
-            printf(" %d", p_cpufreq->scaling_available_frequencies[i]);
-    printf("\n");
+        printf("current_governor     : %s\n", p_cpufreq->u.s.scaling_governor);
+        if ( !strncmp(p_cpufreq->u.s.scaling_governor,
+                      "userspace", CPUFREQ_NAME_LEN) )
+        {
+            printf("  userspace specific :\n");
+            printf("    scaling_setspeed : %u\n",
+                   p_cpufreq->u.s.u.userspace.scaling_setspeed);
+        }
+        else if ( !strncmp(p_cpufreq->u.s.scaling_governor,
+                           "ondemand", CPUFREQ_NAME_LEN) )
+        {
+            printf("  ondemand specific  :\n");
+            printf("    sampling_rate    : max [%u] min [%u] cur [%u]\n",
+                   p_cpufreq->u.s.u.ondemand.sampling_rate_max,
+                   p_cpufreq->u.s.u.ondemand.sampling_rate_min,
+                   p_cpufreq->u.s.u.ondemand.sampling_rate);
+            printf("    up_threshold     : %u\n",
+                   p_cpufreq->u.s.u.ondemand.up_threshold);
+        }
+
+        printf("scaling_avail_freq   :");
+        for ( i = 0; i < p_cpufreq->freq_num; i++ )
+            if ( p_cpufreq->scaling_available_frequencies[i] ==
+                 p_cpufreq->u.s.scaling_cur_freq )
+                printf(" *%d", p_cpufreq->scaling_available_frequencies[i]);
+            else
+                printf(" %d", p_cpufreq->scaling_available_frequencies[i]);
+        printf("\n");
 
-    printf("scaling frequency    : max [%u] min [%u] cur [%u]\n",
-           p_cpufreq->u.s.scaling_max_freq,
-           p_cpufreq->u.s.scaling_min_freq,
-           p_cpufreq->u.s.scaling_cur_freq);
+        printf("scaling frequency    : max [%u] min [%u] cur [%u]\n",
+               p_cpufreq->u.s.scaling_max_freq,
+               p_cpufreq->u.s.scaling_min_freq,
+               p_cpufreq->u.s.scaling_cur_freq);
+    }
 
     printf("turbo mode           : %s\n",
            p_cpufreq->turbo_enabled ? "enabled" : "disabled or n/a");
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 09/15] cpufreq: Export HWP parameters to userspace as CPPC
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (7 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 08/15] xenpm: Change get-cpufreq-para output for hwp Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-19 14:24   ` Jan Beulich
  2023-06-14 18:02 ` [PATCH v4 10/15] libxc: Include cppc_para in definitions Jason Andryuk
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

Extend xen_get_cpufreq_para to return hwp parameters.  HWP is an
implementation of ACPI CPPC (Collaborative Processor Performance
Control).  Use the CPPC name since that might be useful in the future
for AMD P-state.

We need the features bitmask to indicate fields supported by the actual
hardware - this only applies to activity window for the time being.

The HWP most_efficient is mapped to CPPC lowest_nonlinear, and guaranteed is
mapped to nominal.  CPPC has a guaranteed that is optional while nominal
is required.  ACPI spec says "If this register is not implemented, OSPM
assumes guaranteed performance is always equal to nominal performance."

The use of uint8_t parameters matches the hardware size.  uint32_t
entries grows the sysctl_t past the build assertion in setup.c.  The
uint8_t ranges are supported across multiple generations, so hopefully
they won't change.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v2:
Style fixes
Don't bump XEN_SYSCTL_INTERFACE_VERSION
Drop cpufreq.h comment divider
Expand xen_hwp_para comment
Add HWP activity window mantissa/exponent defines
Handle union rename
Add const to get_hwp_para
Remove hw_ prefix from xen_hwp_para members
Use XEN_HWP_GOVERNOR
Use per_cpu for hwp_drv_data

v4:
Fixup for opt_cpufreq_hwp/hdc removal
get_hwp_para() takes cpu as arg
XEN_ prefix HWP_ACT_WINDOW_*
Drop HWP_ACT_WINDOW_EXPONENT_SHIFT - shift MASK
Remove Energy Bias (0-15) EPP fallback
Rename xen_hwp_para to xen_cppc_para
s/hwp/cppc/
Use scaling driver to switch output
---
 xen/arch/x86/acpi/cpufreq/hwp.c    | 23 +++++++++
 xen/drivers/acpi/pmstat.c          | 78 ++++++++++++++++--------------
 xen/include/acpi/cpufreq/cpufreq.h |  2 +
 xen/include/public/sysctl.h        | 56 +++++++++++++++++++++
 4 files changed, 123 insertions(+), 36 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index 5f210b54ff..86c5793266 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -537,6 +537,29 @@ static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
     .update = hwp_cpufreq_update,
 };
 
+int get_hwp_para(const unsigned int cpu,
+                 struct xen_cppc_para *cppc_para)
+{
+    const struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+
+    if ( data == NULL )
+        return -EINVAL;
+
+    cppc_para->features         =
+        (feature_hwp_activity_window ? XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW : 0);
+    cppc_para->lowest           = data->hw.lowest;
+    cppc_para->lowest_nonlinear = data->hw.most_efficient;
+    cppc_para->nominal          = data->hw.guaranteed;
+    cppc_para->highest          = data->hw.highest;
+    cppc_para->minimum          = data->minimum;
+    cppc_para->maximum          = data->maximum;
+    cppc_para->desired          = data->desired;
+    cppc_para->energy_perf      = data->energy_perf;
+    cppc_para->activity_window  = data->activity_window;
+
+    return 0;
+}
+
 int __init hwp_register_driver(void)
 {
     return cpufreq_register_driver(&hwp_cpufreq_driver);
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index 57359c21d8..10143c084c 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -251,48 +251,54 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
     else
         strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
 
-    if ( !(scaling_available_governors =
-           xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
-        return -ENOMEM;
-    if ( (ret = read_scaling_available_governors(
-                    scaling_available_governors,
-                    gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
+    if ( !strncasecmp(op->u.get_para.scaling_driver, XEN_HWP_DRIVER,
+                      CPUFREQ_NAME_LEN) )
+        ret = get_hwp_para(policy->cpu, &op->u.get_para.u.cppc_para);
+    else
     {
+        if ( !(scaling_available_governors =
+               xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
+            return -ENOMEM;
+        if ( (ret = read_scaling_available_governors(
+                        scaling_available_governors,
+                        gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
+        {
+            xfree(scaling_available_governors);
+            return ret;
+        }
+        ret = copy_to_guest(op->u.get_para.scaling_available_governors,
+                    scaling_available_governors, gov_num * CPUFREQ_NAME_LEN);
         xfree(scaling_available_governors);
-        return ret;
-    }
-    ret = copy_to_guest(op->u.get_para.scaling_available_governors,
-                scaling_available_governors, gov_num * CPUFREQ_NAME_LEN);
-    xfree(scaling_available_governors);
-    if ( ret )
-        return ret;
+        if ( ret )
+            return ret;
 
-    op->u.get_para.u.s.scaling_cur_freq = policy->cur;
-    op->u.get_para.u.s.scaling_max_freq = policy->max;
-    op->u.get_para.u.s.scaling_min_freq = policy->min;
+        op->u.get_para.u.s.scaling_cur_freq = policy->cur;
+        op->u.get_para.u.s.scaling_max_freq = policy->max;
+        op->u.get_para.u.s.scaling_min_freq = policy->min;
 
-    if ( policy->governor->name[0] )
-        strlcpy(op->u.get_para.u.s.scaling_governor,
-            policy->governor->name, CPUFREQ_NAME_LEN);
-    else
-        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
-                CPUFREQ_NAME_LEN);
+        if ( policy->governor->name[0] )
+            strlcpy(op->u.get_para.u.s.scaling_governor,
+                policy->governor->name, CPUFREQ_NAME_LEN);
+        else
+            strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
+                    CPUFREQ_NAME_LEN);
 
-    /* governor specific para */
-    if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
-                      "userspace", CPUFREQ_NAME_LEN) )
-    {
-        op->u.get_para.u.s.u.userspace.scaling_setspeed = policy->cur;
-    }
+        /* governor specific para */
+        if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
+                          "userspace", CPUFREQ_NAME_LEN) )
+        {
+            op->u.get_para.u.s.u.userspace.scaling_setspeed = policy->cur;
+        }
 
-    if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
-                      "ondemand", CPUFREQ_NAME_LEN) )
-    {
-        ret = get_cpufreq_ondemand_para(
-            &op->u.get_para.u.s.u.ondemand.sampling_rate_max,
-            &op->u.get_para.u.s.u.ondemand.sampling_rate_min,
-            &op->u.get_para.u.s.u.ondemand.sampling_rate,
-            &op->u.get_para.u.s.u.ondemand.up_threshold);
+        if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
+                          "ondemand", CPUFREQ_NAME_LEN) )
+        {
+            ret = get_cpufreq_ondemand_para(
+                &op->u.get_para.u.s.u.ondemand.sampling_rate_max,
+                &op->u.get_para.u.s.u.ondemand.sampling_rate_min,
+                &op->u.get_para.u.s.u.ondemand.sampling_rate,
+                &op->u.get_para.u.s.u.ondemand.up_threshold);
+        }
     }
 
     return ret;
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 6ec7dbcbbb..fb655fac14 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -248,5 +248,7 @@ void intel_feature_detect(struct cpufreq_policy *policy);
 
 extern bool __initdata opt_cpufreq_hwp;
 int hwp_cmdline_parse(const char *s);
+int get_hwp_para(const unsigned int cpu,
+                 struct xen_cppc_para *cppc_para);
 
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 08fe815329..5c9b60b5d0 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -296,6 +296,61 @@ struct xen_ondemand {
     uint32_t up_threshold;
 };
 
+struct xen_cppc_para {
+    /* OUT */
+    /* activity_window supported if 1 */
+#define XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW  (1 << 0)
+    uint32_t features; /* bit flags for features */
+    /*
+     * See Intel SDM: HWP Performance Range and Dynamic Capabilities
+     *
+     * These four are 0-255 hardware-provided values.  They "continuous,
+     * abstract unit-less, performance" values.  smaller numbers are slower
+     * and larger ones are faster.
+     */
+    uint32_t lowest;
+    uint32_t lowest_nonlinear; /* most_efficient */
+    uint32_t nominal; /* guaranteed */
+    uint32_t highest;
+    /*
+     * See Intel SDM: IA32_HWP_REQUEST MSR (Address: 774H Logical Processor
+     * Scope)
+     *
+     * These are all hints, and the processor may deviate outside of them.
+     * Values below are 0-255.
+     *
+     * minimum and maximum can be set to the above hardware values to constrain
+     * operation.  The full range 0-255 is accepted and will be clipped by
+     * hardware.
+     */
+    uint32_t minimum;
+    uint32_t maximum;
+    /*
+     * Set an explicit performance hint, disabling hardware selection.
+     * 0 lets the hardware decide.
+     */
+    uint32_t desired;
+    /*
+     * Hint to hardware for energy/performance preference.
+     * 0:   Performance
+     * 128: Balance (Default)
+     * 255: Powersaving
+     */
+    uint32_t energy_perf;
+    /*
+     * Activity Window is a moving history window for the processor's operation
+     * calculations, controlling responsiveness.  Measured in microseconds
+     * encoded as:
+     *
+     * bits 6:0   - 7bit mantissa
+     * bits 9:7   - 3bit base-10 exponent
+     * btis 15:10 - Unused - must be 0
+     */
+#define XEN_CPPC_ACT_WINDOW_MANTISSA_MASK  0x07f
+#define XEN_CPPC_ACT_WINDOW_EXPONENT_MASK  0x380
+    uint32_t activity_window;
+};
+
 #define XEN_HWP_DRIVER "hwp"
 /*
  * cpufreq para name of this structure named
@@ -332,6 +387,7 @@ struct xen_get_cpufreq_para {
                 struct  xen_ondemand ondemand;
             } u;
         } s;
+        struct xen_cppc_para cppc_para;
     } u;
 
     int32_t turbo_enabled;
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 10/15] libxc: Include cppc_para in definitions
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (8 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 09/15] cpufreq: Export HWP parameters to userspace as CPPC Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-14 18:02 ` [PATCH v4 11/15] xenpm: Print HWP/CPPC parameters Jason Andryuk
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Juergen Gross

Expose the cppc_para fields through libxc.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
---
v4:
Rename hwp to cppc
Add Anthony's Ack
---
 tools/include/xenctrl.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 8aedb952a0..2092632296 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1892,6 +1892,7 @@ int xc_smt_disable(xc_interface *xch);
  */
 typedef struct xen_userspace xc_userspace_t;
 typedef struct xen_ondemand xc_ondemand_t;
+typedef struct xen_cppc_para xc_cppc_para_t;
 
 struct xc_get_cpufreq_para {
     /* IN/OUT variable */
@@ -1923,6 +1924,7 @@ struct xc_get_cpufreq_para {
                 xc_ondemand_t ondemand;
             } u;
         } s;
+        xc_cppc_para_t cppc_para;
     } u;
 
     int32_t turbo_enabled;
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 11/15] xenpm: Print HWP/CPPC parameters
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (9 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 10/15] libxc: Include cppc_para in definitions Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-19 14:29   ` Jan Beulich
  2023-06-14 18:02 ` [PATCH v4 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD

Print HWP-specific parameters.  Some are always present, but others
depend on hardware support.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v2:
Style fixes
Declare i outside loop
Replace repearted hardware/configured limits with spaces
Fixup for hw_ removal
Use XEN_HWP_GOVERNOR
Use HWP_ACT_WINDOW_EXPONENT_*
Remove energy_perf hw autonomous - 0 doesn't mean autonomous

v4:
Return activity_window from calculate_hwp_activity_window
Use blanks instead of _ in output
Use MASK_EXTR
Check XEN_HWP_DRIVER name since governor is no longer returned
s/hwp/cppc
---
 tools/misc/xenpm.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 4e53e68dc5..488797fd20 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -708,6 +708,46 @@ void start_gather_func(int argc, char *argv[])
     pause();
 }
 
+static unsigned int calculate_activity_window(const xc_cppc_para_t *cppc,
+                                              const char **units)
+{
+    unsigned int mantissa = MASK_EXTR(cppc->activity_window,
+                                      XEN_CPPC_ACT_WINDOW_MANTISSA_MASK);
+    unsigned int exponent = MASK_EXTR(cppc->activity_window,
+                                      XEN_CPPC_ACT_WINDOW_EXPONENT_MASK);
+    unsigned int multiplier = 1;
+    unsigned int i;
+
+    /*
+     * SDM only states a 0 register is hardware selected, and doesn't mention
+     * a 0 mantissa with a non-0 exponent.  Only special case a 0 register.
+     */
+    if ( cppc->activity_window == 0 )
+    {
+        *units = "hardware selected";
+
+        return 0;
+    }
+
+    if ( exponent >= 6 )
+    {
+        *units = "s";
+        exponent -= 6;
+    }
+    else if ( exponent >= 3 )
+    {
+        *units = "ms";
+        exponent -= 3;
+    }
+    else
+        *units = "us";
+
+    for ( i = 0; i < exponent; i++ )
+        multiplier *= 10;
+
+    return mantissa * multiplier;
+}
+
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
@@ -772,6 +812,32 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
                p_cpufreq->u.s.scaling_min_freq,
                p_cpufreq->u.s.scaling_cur_freq);
     }
+    else
+    {
+        const xc_cppc_para_t *cppc = &p_cpufreq->u.cppc_para;
+
+        printf("cppc variables       :\n");
+        printf("  hardware limits    : lowest [%u] lowest nonlinear [%u]\n",
+               cppc->lowest, cppc->lowest_nonlinear);
+        printf("                     : nominal [%u] highest [%u]\n",
+               cppc->nominal, cppc->highest);
+        printf("  configured limits  : min [%u] max [%u] energy perf [%u]\n",
+               cppc->minimum, cppc->maximum, cppc->energy_perf);
+
+        if ( cppc->features & XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW )
+        {
+            unsigned int activity_window;
+            const char *units;
+
+            activity_window = calculate_activity_window(cppc, &units);
+            printf("                     : activity_window [%u %s]\n",
+                   activity_window, units);
+        }
+
+        printf("                     : desired [%u%s]\n",
+               cppc->desired,
+               cppc->desired ? "" : " hw autonomous");
+    }
 
     printf("turbo mode           : %s\n",
            p_cpufreq->turbo_enabled ? "enabled" : "disabled or n/a");
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (10 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 11/15] xenpm: Print HWP/CPPC parameters Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
       [not found]   ` <0b53687e-e781-7c01-34e9-e41cd14967c7@suse.com>
  2023-06-14 18:02 ` [PATCH v4 13/15] libxc: Add xc_set_cpufreq_cppc Jason Andryuk
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

Add SET_CPUFREQ_HWP xen_sysctl_pm_op to set HWP parameters.  The sysctl
supports setting multiple values simultaneously as indicated by the
set_params bits.  This allows atomically applying new HWP configuration
via a single wrmsr.

XEN_SYSCTL_HWP_SET_PRESET_BALANCE/PERFORMANCE/POWERSAVE provide three
common presets.  Setting them depends on hardware limits which the
hypervisor is already caching.  So using them allows skipping a
hypercall to query the limits (lowest/highest) to then set those same
values.  The code is organized to allow a preset to be refined with
additional stuff if desired.

"most_efficient" and "guaranteed" could be additional presets in the
future, but the are not added now.  Those levels can change at runtime,
but we don't have code in place to monitor and update for those events.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

---
v4:
Remove IA32_ENERGY_BIAS support
Validate parameters don't exceed 255
Use CPPC/cppc name
set_cppc_para() add const
set_cppc_para() return hwp_cpufreq_target()
Expand sysctl comments

v3:
Remove cpufreq_governor_internal from set_cpufreq_hwp

v2:
Update for naming anonymous union
Drop hwp_err for invalid input in set_hwp_para()
Drop uint16_t cast in XEN_SYSCTL_HWP_SET_PARAM_MASK
Drop parens for HWP_SET_PRESET defines
Reference activity_window format comment
Place SET_CPUFREQ_HWP after SET_CPUFREQ_PARA
Add {HWP,IA32}_ENERGY_PERF_MAX_{PERFORMANCE,POWERSAVE} defines
Order defines before fields in sysctl.h
Use XEN_HWP_GOVERNOR
Use per_cpu for hwp_drv_data
---
 xen/arch/x86/acpi/cpufreq/hwp.c    | 98 ++++++++++++++++++++++++++++++
 xen/drivers/acpi/pmstat.c          | 17 ++++++
 xen/include/acpi/cpufreq/cpufreq.h |  2 +
 xen/include/public/sysctl.h        | 58 ++++++++++++++++++
 4 files changed, 175 insertions(+)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index 86c5793266..3ee046940c 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -23,6 +23,10 @@ static bool __ro_after_init feature_hdc;
 bool __initdata opt_cpufreq_hwp;
 static bool __ro_after_init opt_cpufreq_hdc = true;
 
+#define HWP_ENERGY_PERF_MAX_PERFORMANCE 0
+#define HWP_ENERGY_PERF_BALANCE         0x80
+#define HWP_ENERGY_PERF_MAX_POWERSAVE   0xff
+
 union hwp_request
 {
     struct
@@ -560,6 +564,100 @@ int get_hwp_para(const unsigned int cpu,
     return 0;
 }
 
+int set_hwp_para(struct cpufreq_policy *policy,
+                 const struct xen_set_cppc_para *set_cppc)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+
+    if ( data == NULL )
+        return -EINVAL;
+
+    /* Validate all parameters first */
+    if ( set_cppc->set_params & ~XEN_SYSCTL_CPPC_SET_PARAM_MASK )
+        return -EINVAL;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW &&
+         !feature_hwp_activity_window )
+        return -EINVAL;
+
+    if ( !(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW) &&
+         set_cppc->activity_window )
+        return -EINVAL;
+
+    if ( set_cppc->activity_window & ~XEN_SYSCTL_CPPC_ACT_WINDOW_MASK )
+        return -EINVAL;
+
+    if ( (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED) &&
+         set_cppc->desired != 0 &&
+         (set_cppc->desired < data->hw.lowest ||
+          set_cppc->desired > data->hw.highest) )
+        return -EINVAL;
+
+    /*
+     * minimum & maximum are not validated against lowest or highest as
+     * hardware doesn't seem to care and the SDM says CPUs will clip
+     * internally.
+     */
+    if ( set_cppc->minimum > 255 ||
+         set_cppc->maximum > 255 ||
+         set_cppc->energy_perf > 255 )
+        return -EINVAL;
+
+    /* Apply presets */
+    switch ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_PRESET_MASK )
+    {
+    case XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE:
+        data->minimum = data->hw.lowest;
+        data->maximum = data->hw.lowest;
+        data->activity_window = 0;
+        data->energy_perf = HWP_ENERGY_PERF_MAX_POWERSAVE;
+        data->desired = 0;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE:
+        data->minimum = data->hw.highest;
+        data->maximum = data->hw.highest;
+        data->activity_window = 0;
+        data->energy_perf = HWP_ENERGY_PERF_MAX_PERFORMANCE;
+        data->desired = 0;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_BALANCE:
+        data->minimum = data->hw.lowest;
+        data->maximum = data->hw.highest;
+        data->activity_window = 0;
+        data->energy_perf = HWP_ENERGY_PERF_BALANCE;
+        data->desired = 0;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_NONE:
+        break;
+
+    default:
+        return -EINVAL;
+    }
+
+    /* Further customize presets if needed */
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM )
+        data->minimum = set_cppc->minimum;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM )
+        data->maximum = set_cppc->maximum;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF )
+        data->energy_perf = set_cppc->energy_perf;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
+        data->desired = set_cppc->desired;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW )
+        data->activity_window = set_cppc->activity_window &
+                                XEN_SYSCTL_CPPC_ACT_WINDOW_MASK;
+
+    return hwp_cpufreq_target(policy, 0, 0);
+}
+
 int __init hwp_register_driver(void)
 {
     return cpufreq_register_driver(&hwp_cpufreq_driver);
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index 10143c084c..e4482bcdbc 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -402,6 +402,19 @@ static int set_cpufreq_para(struct xen_sysctl_pm_op *op)
     return ret;
 }
 
+static int set_cpufreq_cppc(const struct xen_sysctl_pm_op *op)
+{
+    struct cpufreq_policy *policy = per_cpu(cpufreq_cpu_policy, op->cpuid);
+
+    if ( !policy || !policy->governor )
+        return -EINVAL;
+
+    if ( !hwp_active() )
+        return -EINVAL;
+
+    return set_hwp_para(policy, &op->u.set_cppc);
+}
+
 int do_pm_op(struct xen_sysctl_pm_op *op)
 {
     int ret = 0;
@@ -474,6 +487,10 @@ int do_pm_op(struct xen_sysctl_pm_op *op)
         break;
     }
 
+    case SET_CPUFREQ_CPPC:
+        ret = set_cpufreq_cppc(op);
+        break;
+
     case GET_CPUFREQ_AVGFREQ:
     {
         op->u.get_avgfreq = cpufreq_driver_getavg(op->cpuid, USR_GETAVG);
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index fb655fac14..d757da2ef4 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -250,5 +250,7 @@ extern bool __initdata opt_cpufreq_hwp;
 int hwp_cmdline_parse(const char *s);
 int get_hwp_para(const unsigned int cpu,
                  struct xen_cppc_para *cppc_para);
+int set_hwp_para(struct cpufreq_policy *policy,
+                 const struct xen_set_cppc_para *set_cppc);
 
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 5c9b60b5d0..96c58eb998 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -351,6 +351,62 @@ struct xen_cppc_para {
     uint32_t activity_window;
 };
 
+/*
+ * Set CPPC values.
+ *
+ * Configured the parameters for CPPC.  Set bits in set_params control which
+ * values are applied.
+ *
+ * There are a set of presets along with individual fields.  Presets are
+ * applied first, and then individual fields.  This allows customizing
+ * a preset without having to specify every value.
+ *
+ * The preset options values are as follows:
+ *
+ * preset      | minimum | maxium  | energy_perf
+ * ------------+---------+---------+----------------
+ * powersave   | lowest  | lowest  | powersave (255)
+ * ------------+---------+---------+----------------
+ * balance     | lowest  | highest | balance (128)
+ * ------------+---------+---------+----------------
+ * performance | highest | highest | performance (0)
+ *
+ * desired and activity_window are set to 0, hardware selected.
+ */
+struct xen_set_cppc_para {
+#define XEN_SYSCTL_CPPC_SET_DESIRED              (1U << 0)
+#define XEN_SYSCTL_CPPC_SET_ENERGY_PERF          (1U << 1)
+#define XEN_SYSCTL_CPPC_SET_ACT_WINDOW           (1U << 2)
+#define XEN_SYSCTL_CPPC_SET_MINIMUM              (1U << 3)
+#define XEN_SYSCTL_CPPC_SET_MAXIMUM              (1U << 4)
+#define XEN_SYSCTL_CPPC_SET_PRESET_MASK          0xf000
+#define XEN_SYSCTL_CPPC_SET_PRESET_NONE          0x0000
+#define XEN_SYSCTL_CPPC_SET_PRESET_BALANCE       0x1000
+#define XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE     0x2000
+#define XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE   0x3000
+#define XEN_SYSCTL_CPPC_SET_PARAM_MASK ( \
+                                  XEN_SYSCTL_CPPC_SET_PRESET_MASK | \
+                                  XEN_SYSCTL_CPPC_SET_DESIRED     | \
+                                  XEN_SYSCTL_CPPC_SET_ENERGY_PERF | \
+                                  XEN_SYSCTL_CPPC_SET_ACT_WINDOW  | \
+                                  XEN_SYSCTL_CPPC_SET_MINIMUM     | \
+                                  XEN_SYSCTL_CPPC_SET_MAXIMUM     )
+    uint32_t set_params; /* bitflags for valid values */
+    /* See comments in struct xen_cppc_para. */
+    uint32_t minimum;
+    uint32_t maximum;
+    uint32_t desired;
+    /*
+     * Hint to hardware for energy/performance preference.
+     * 0:   Performance
+     * 128: Balance (Default)
+     * 255: Powersaving
+     */
+    uint32_t energy_perf;
+#define XEN_SYSCTL_CPPC_ACT_WINDOW_MASK          0x03ff
+    uint32_t activity_window;
+};
+
 #define XEN_HWP_DRIVER "hwp"
 /*
  * cpufreq para name of this structure named
@@ -417,6 +473,7 @@ struct xen_sysctl_pm_op {
     #define SET_CPUFREQ_GOV            (CPUFREQ_PARA | 0x02)
     #define SET_CPUFREQ_PARA           (CPUFREQ_PARA | 0x03)
     #define GET_CPUFREQ_AVGFREQ        (CPUFREQ_PARA | 0x04)
+    #define SET_CPUFREQ_CPPC           (CPUFREQ_PARA | 0x05)
 
     /* set/reset scheduler power saving option */
     #define XEN_SYSCTL_pm_op_set_sched_opt_smt    0x21
@@ -443,6 +500,7 @@ struct xen_sysctl_pm_op {
         struct xen_get_cpufreq_para get_para;
         struct xen_set_cpufreq_gov  set_gov;
         struct xen_set_cpufreq_para set_para;
+        struct xen_set_cppc_para    set_cppc;
         uint64_aligned_t get_avgfreq;
         uint32_t                    set_sched_opt_smt;
 #define XEN_SYSCTL_CX_UNLIMITED 0xffffffff
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 13/15] libxc: Add xc_set_cpufreq_cppc
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (11 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-14 18:02 ` [PATCH v4 14/15] xenpm: Add set-cpufreq-cppc subcommand Jason Andryuk
  2023-06-14 18:02 ` [PATCH v4 15/15] CHANGELOG: Add Intel HWP entry Jason Andryuk
  14 siblings, 0 replies; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Juergen Gross

Add xc_set_cpufreq_cppc to allow calling xen_systctl_pm_op
SET_CPUFREQ_CPPC.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
---
v2:
Mark xc_set_hwp_para_t const

v4:
s/hwp/cppc/
Add Anthony's Ack
---
 tools/include/xenctrl.h |  4 ++++
 tools/libs/ctrl/xc_pm.c | 18 ++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 2092632296..c7eb97959a 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1930,11 +1930,15 @@ struct xc_get_cpufreq_para {
     int32_t turbo_enabled;
 };
 
+typedef struct xen_set_cppc_para xc_set_cppc_para_t;
+
 int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
                         struct xc_get_cpufreq_para *user_para);
 int xc_set_cpufreq_gov(xc_interface *xch, int cpuid, char *govname);
 int xc_set_cpufreq_para(xc_interface *xch, int cpuid,
                         int ctrl_type, int ctrl_value);
+int xc_set_cpufreq_cppc(xc_interface *xch, int cpuid,
+                        const xc_set_cppc_para_t *set_cppc);
 int xc_get_cpufreq_avgfreq(xc_interface *xch, int cpuid, int *avg_freq);
 
 int xc_set_sched_opt_smt(xc_interface *xch, uint32_t value);
diff --git a/tools/libs/ctrl/xc_pm.c b/tools/libs/ctrl/xc_pm.c
index 19fe1a79dd..e86045697d 100644
--- a/tools/libs/ctrl/xc_pm.c
+++ b/tools/libs/ctrl/xc_pm.c
@@ -329,6 +329,24 @@ int xc_set_cpufreq_para(xc_interface *xch, int cpuid,
     return xc_sysctl(xch, &sysctl);
 }
 
+int xc_set_cpufreq_cppc(xc_interface *xch, int cpuid,
+                        const xc_set_cppc_para_t *set_cppc)
+{
+    DECLARE_SYSCTL;
+
+    if ( !xch )
+    {
+        errno = EINVAL;
+        return -1;
+    }
+    sysctl.cmd = XEN_SYSCTL_pm_op;
+    sysctl.u.pm_op.cmd = SET_CPUFREQ_CPPC;
+    sysctl.u.pm_op.cpuid = cpuid;
+    sysctl.u.pm_op.u.set_cppc = *set_cppc;
+
+    return xc_sysctl(xch, &sysctl);
+}
+
 int xc_get_cpufreq_avgfreq(xc_interface *xch, int cpuid, int *avg_freq)
 {
     int ret = 0;
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 14/15] xenpm: Add set-cpufreq-cppc subcommand
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (12 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 13/15] libxc: Add xc_set_cpufreq_cppc Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  2023-06-19 14:54   ` Jan Beulich
  2023-06-14 18:02 ` [PATCH v4 15/15] CHANGELOG: Add Intel HWP entry Jason Andryuk
  14 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD

set-cpufreq-cppc allows setting the Hardware P-State (HWP) parameters.

It can be run on all or just a single cpu.  There are presets of
balance, powersave & performance.  Those can be further tweaked by
param:val arguments as explained in the usage description.

Parameter names are just checked to the first 3 characters to shorten
typing.

Some options are hardware dependent, and ranges can be found in
get-cpufreq-para.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v4:
Remove energy bias 0-15 & 7 references
Use MASK_INSR
Fixup { placement
Drop extra case in parse_activity_window
strcmp suffix
Expand help text
s/hwp/cppc/
Use isdigit() to check cpuid - otherwise run on all CPUs.

v2:
Compare provided parameter name and not just 3 characters.
Use "-" in parameter names
Remove hw_
Replace sscanf with strchr & strtoul.
Remove toplevel error message with lower level ones.
Help text s/127/128/
Help text mention truncation.
Avoid some truncation rounding down by adding 5 before division.
Help test mention default microseconds
Also comment the limit check written to avoid overflow.
---
 tools/misc/xenpm.c | 237 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 488797fd20..2f2b699794 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -16,6 +16,8 @@
  */
 #define MAX_NR_CPU 512
 
+#include <ctype.h>
+#include <limits.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
@@ -67,6 +69,30 @@ void show_help(void)
             " set-max-cstate        <num>|'unlimited' [<num2>|'unlimited']\n"
             "                                     set the C-State limitation (<num> >= 0) and\n"
             "                                     optionally the C-sub-state limitation (<num2> >= 0)\n"
+            " set-cpufreq-cppc      [cpuid] [balance|performance|powersave] <param:val>*\n"
+            "                                     set Hardware P-State (HWP) parameters\n"
+            "                                     on CPU <cpuid> or all if omitted.\n"
+            "                                     optionally a preset of one of:\n"
+            "                                       balance|performance|powersave\n"
+            "                                     an optional list of param:val arguments\n"
+            "                                       minimum:N (0-255)\n"
+            "                                       maximum:N (0-255)\n"
+            "                                           get-cpufreq-para lowest/highest\n"
+            "                                           values are limits for\n"
+            "                                           minumum/maximum.\n"
+            "                                       desired:N (0-255)\n"
+            "                                           set explicit performance target.\n"
+            "                                           non-zero disables auto-HWP mode.\n"
+            "                                       energy-perf:N (0-255)\n"
+            "                                                   energy/performance hint\n"
+            "                                                   lower - favor performance\n"
+            "                                                   higher - favor powersave\n"
+            "                                                   128 - balance\n"
+            "                                       act-window:N{,m,u}s range 1us-1270s\n"
+            "                                           window for internal calculations.\n"
+            "                                           units default to \"us\" if unspecified.\n"
+            "                                           truncates un-representable values.\n"
+            "                                           0 lets the hardware decide.\n"
             " start [seconds]                     start collect Cx/Px statistics,\n"
             "                                     output after CTRL-C or SIGINT or several seconds.\n"
             " enable-turbo-mode     [cpuid]       enable Turbo Mode for processors that support it.\n"
@@ -1292,6 +1318,216 @@ void disable_turbo_mode(int argc, char *argv[])
                 errno, strerror(errno));
 }
 
+/*
+ * Parse activity_window:NNN{us,ms,s} and validate range.
+ *
+ * Activity window is a 7bit mantissa (0-127) with a 3bit exponent (0-7) base
+ * 10 in microseconds.  So the range is 1 microsecond to 1270 seconds.  A value
+ * of 0 lets the hardware autonomously select the window.
+ *
+ * Return 0 on success
+ *       -1 on error
+ */
+static int parse_activity_window(xc_set_cppc_para_t *set_cppc, unsigned long u,
+                                 const char *suffix)
+{
+    unsigned int exponent = 0;
+    unsigned int multiplier = 1;
+
+    if ( suffix && suffix[0] )
+    {
+        if ( strcmp(suffix, "s") == 0 )
+        {
+            multiplier = 1000 * 1000;
+            exponent = 6;
+        }
+        else if ( strcmp(suffix, "ms") == 0 )
+        {
+            multiplier = 1000;
+            exponent = 3;
+        }
+        else if ( strcmp(suffix, "us") != 0 )
+        {
+            fprintf(stderr, "invalid activity window units: \"%s\"\n", suffix);
+
+            return -1;
+        }
+    }
+
+    /* u * multipler > 1270 * 1000 * 1000 transformed to avoid overflow. */
+    if ( u > 1270 * 1000 * 1000 / multiplier )
+    {
+        fprintf(stderr, "activity window is too large\n");
+
+        return -1;
+    }
+
+    /* looking for 7 bits of mantissa and 3 bits of exponent */
+    while ( u > 127 )
+    {
+        u += 5; /* Round up to mitigate truncation rounding down
+                   e.g. 128 -> 120 vs 128 -> 130. */
+        u /= 10;
+        exponent += 1;
+    }
+
+    set_cppc->activity_window =
+        MASK_INSR(exponent, XEN_CPPC_ACT_WINDOW_EXPONENT_MASK) |
+        MASK_INSR(u, XEN_CPPC_ACT_WINDOW_MANTISSA_MASK);
+    set_cppc->set_params |= XEN_SYSCTL_CPPC_SET_ACT_WINDOW;
+
+    return 0;
+}
+
+static int parse_cppc_opts(xc_set_cppc_para_t *set_cppc, int *cpuid,
+                           int argc, char *argv[])
+{
+    int i = 0;
+
+    if ( argc < 1 )
+    {
+        fprintf(stderr, "Missing arguments\n");
+        return -1;
+    }
+
+    if ( isdigit(argv[i][0]) )
+    {
+        if ( sscanf(argv[i], "%d", cpuid) != 1 || *cpuid < 0 )
+        {
+            fprintf(stderr, "Could not parse cpuid \"%s\"\n", argv[i]);
+            return -1;
+        }
+
+        i++;
+    }
+
+    if ( i == argc )
+    {
+        fprintf(stderr, "Missing arguments\n");
+        return -1;
+    }
+
+    if ( strcasecmp(argv[i], "powersave") == 0 )
+    {
+        set_cppc->set_params = XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE;
+        i++;
+    }
+    else if ( strcasecmp(argv[i], "performance") == 0 )
+    {
+        set_cppc->set_params = XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE;
+        i++;
+    }
+    else if ( strcasecmp(argv[i], "balance") == 0 )
+    {
+        set_cppc->set_params = XEN_SYSCTL_CPPC_SET_PRESET_BALANCE;
+        i++;
+    }
+
+    for ( ; i < argc; i++)
+    {
+        unsigned long val;
+        char *param = argv[i];
+        char *value;
+        char *suffix;
+        int ret;
+
+        value = strchr(param, ':');
+        if ( value == NULL )
+        {
+            fprintf(stderr, "\"%s\" is an invalid cppc parameter\n", argv[i]);
+            return -1;
+        }
+
+        value[0] = '\0';
+        value++;
+
+        errno = 0;
+        val = strtoul(value, &suffix, 10);
+        if ( (errno && val == ULONG_MAX) || value == suffix )
+        {
+            fprintf(stderr, "Could not parse number \"%s\"\n", value);
+            return -1;
+        }
+
+        if ( strncasecmp(param, "act-window", strlen(param)) == 0 )
+        {
+            ret = parse_activity_window(set_cppc, val, suffix);
+            if (ret)
+                return -1;
+
+            continue;
+        }
+
+        if ( val > 255 )
+        {
+            fprintf(stderr, "\"%s\" value \"%lu\" is out of range\n", param,
+                    val);
+            return -1;
+        }
+
+        if ( suffix && suffix[0] )
+        {
+            fprintf(stderr, "Suffix \"%s\" is invalid\n", suffix);
+            return -1;
+        }
+
+        if ( strncasecmp(param, "minimum", MAX(2, strlen(param))) == 0 )
+        {
+            set_cppc->minimum = val;
+            set_cppc->set_params |= XEN_SYSCTL_CPPC_SET_MINIMUM;
+        }
+        else if ( strncasecmp(param, "maximum", MAX(2, strlen(param))) == 0 )
+        {
+            set_cppc->maximum = val;
+            set_cppc->set_params |= XEN_SYSCTL_CPPC_SET_MAXIMUM;
+        }
+        else if ( strncasecmp(param, "desired", strlen(param)) == 0 )
+        {
+            set_cppc->desired = val;
+            set_cppc->set_params |= XEN_SYSCTL_CPPC_SET_DESIRED;
+        }
+        else if ( strncasecmp(param, "energy-perf", strlen(param)) == 0 )
+        {
+            set_cppc->energy_perf = val;
+            set_cppc->set_params |= XEN_SYSCTL_CPPC_SET_ENERGY_PERF;
+        }
+        else
+        {
+            fprintf(stderr, "\"%s\" is an invalid parameter\n", param);
+            return -1;
+        }
+    }
+
+    if ( set_cppc->set_params == 0 )
+    {
+        fprintf(stderr, "No parameters set in request\n");
+        return -1;
+    }
+
+    return 0;
+}
+
+static void cppc_set_func(int argc, char *argv[])
+{
+    xc_set_cppc_para_t set_cppc = {};
+    int cpuid = -1;
+    int i = 0;
+
+    if ( parse_cppc_opts(&set_cppc, &cpuid, argc, argv) )
+        exit(EINVAL);
+
+    if ( cpuid != -1 )
+    {
+        i = cpuid;
+        max_cpu_nr = i + 1;
+    }
+
+    for ( ; i < max_cpu_nr; i++ )
+        if ( xc_set_cpufreq_cppc(xc_handle, i, &set_cppc) )
+            fprintf(stderr, "[CPU%d] failed to set cppc params (%d - %s)\n",
+                    i, errno, strerror(errno));
+}
+
 struct {
     const char *name;
     void (*function)(int argc, char *argv[]);
@@ -1302,6 +1538,7 @@ struct {
     { "get-cpufreq-average", cpufreq_func },
     { "start", start_gather_func },
     { "get-cpufreq-para", cpufreq_para_func },
+    { "set-cpufreq-cppc", cppc_set_func },
     { "set-scaling-maxfreq", scaling_max_freq_func },
     { "set-scaling-minfreq", scaling_min_freq_func },
     { "set-scaling-governor", scaling_governor_func },
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 15/15] CHANGELOG: Add Intel HWP entry
  2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (13 preceding siblings ...)
  2023-06-14 18:02 ` [PATCH v4 14/15] xenpm: Add set-cpufreq-cppc subcommand Jason Andryuk
@ 2023-06-14 18:02 ` Jason Andryuk
  14 siblings, 0 replies; 37+ messages in thread
From: Jason Andryuk @ 2023-06-14 18:02 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Henry Wang, Community Manager

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Henry Wang <Henry.Wang@arm.com>
---
v3:
Position under existing Added section
Add Henry's Ack

v2:
Add blank line
---
 CHANGELOG.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7d7e0590f8..8d6e6c3088 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -24,7 +24,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
  - xl/libxl can customize SMBIOS strings for HVM guests.
  - Add support for AVX512-FP16 on x86.
  - On Arm, Xen supports guests running SVE/SVE2 instructions. (Tech Preview)
-
+ - Add Intel Hardware P-States (HWP) cpufreq driver.
 
 ## [4.17.0](https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=RELEASE-4.17.0) - 2022-12-12
 
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 01/15] cpufreq: Allow restricting to internal governors only
  2023-06-14 18:02 ` [PATCH v4 01/15] cpufreq: Allow restricting to internal governors only Jason Andryuk
@ 2023-06-15 13:21   ` Jan Beulich
  2023-06-15 14:04     ` Jason Andryuk
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-06-15 13:21 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: xen-devel

On 14.06.2023 20:02, Jason Andryuk wrote:
> @@ -121,6 +122,12 @@ int __init cpufreq_register_governor(struct cpufreq_governor *governor)
>      if (!governor)
>          return -EINVAL;
>  
> +    if (cpufreq_governor_internal && !governor->internal)
> +        return -EINVAL;
> +
> +    if (!cpufreq_governor_internal && governor->internal)
> +        return -EINVAL;

First just a nit: Why not simply 

    if (cpufreq_governor_internal != governor->internal)
        return -EINVAL;

?

Yet then I find this approach a little odd anyway. I think the
registration attempts would better be suppressed, thus also not
resulting in (apparently) failed init-calls. Especially for the
userspace governor this would then also mean / allow to avoid
registering of the CPU notifier.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 04/15] xen/sysctl: Nest cpufreq scaling options
  2023-06-14 18:02 ` [PATCH v4 04/15] xen/sysctl: Nest cpufreq scaling options Jason Andryuk
@ 2023-06-15 13:29   ` Jan Beulich
  2023-06-15 14:07     ` Jason Andryuk
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-06-15 13:29 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Wei Liu, Anthony PERARD, Juergen Gross, Andrew Cooper,
	George Dunlap, Julien Grall, Stefano Stabellini, xen-devel

On 14.06.2023 20:02, Jason Andryuk wrote:
> --- a/tools/include/xenctrl.h
> +++ b/tools/include/xenctrl.h
> @@ -1909,16 +1909,20 @@ struct xc_get_cpufreq_para {
>      uint32_t cpuinfo_cur_freq;
>      uint32_t cpuinfo_max_freq;
>      uint32_t cpuinfo_min_freq;
> -    uint32_t scaling_cur_freq;
> -
> -    char scaling_governor[CPUFREQ_NAME_LEN];
> -    uint32_t scaling_max_freq;
> -    uint32_t scaling_min_freq;
> -
> -    /* for specific governor */
>      union {
> -        xc_userspace_t userspace;
> -        xc_ondemand_t ondemand;
> +        struct {
> +            uint32_t scaling_cur_freq;
> +
> +            char scaling_governor[CPUFREQ_NAME_LEN];
> +            uint32_t scaling_max_freq;
> +            uint32_t scaling_min_freq;
> +
> +            /* for specific governor */
> +            union {
> +                xc_userspace_t userspace;
> +                xc_ondemand_t ondemand;
> +            } u;
> +        } s;
>      } u;

There's no comment in the header that this needs to mirror the sysctl
struct. Does it really need changing?

> --- a/tools/libs/ctrl/xc_pm.c
> +++ b/tools/libs/ctrl/xc_pm.c
> @@ -265,15 +265,10 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
>          user_para->cpuinfo_cur_freq = sys_para->cpuinfo_cur_freq;
>          user_para->cpuinfo_max_freq = sys_para->cpuinfo_max_freq;
>          user_para->cpuinfo_min_freq = sys_para->cpuinfo_min_freq;
> -        user_para->scaling_cur_freq = sys_para->scaling_cur_freq;
> -        user_para->scaling_max_freq = sys_para->scaling_max_freq;
> -        user_para->scaling_min_freq = sys_para->scaling_min_freq;
>          user_para->turbo_enabled    = sys_para->turbo_enabled;
>  
>          memcpy(user_para->scaling_driver,
>                  sys_para->scaling_driver, CPUFREQ_NAME_LEN);
> -        memcpy(user_para->scaling_governor,
> -                sys_para->scaling_governor, CPUFREQ_NAME_LEN);

Did you really mean to remove the copying of these 4 entities, rather
than simply change the way the fields are accessed?

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 01/15] cpufreq: Allow restricting to internal governors only
  2023-06-15 13:21   ` Jan Beulich
@ 2023-06-15 14:04     ` Jason Andryuk
  2023-06-15 14:23       ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-15 14:04 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Thu, Jun 15, 2023 at 9:21 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 14.06.2023 20:02, Jason Andryuk wrote:
> > @@ -121,6 +122,12 @@ int __init cpufreq_register_governor(struct cpufreq_governor *governor)
> >      if (!governor)
> >          return -EINVAL;
> >
> > +    if (cpufreq_governor_internal && !governor->internal)
> > +        return -EINVAL;
> > +
> > +    if (!cpufreq_governor_internal && governor->internal)
> > +        return -EINVAL;
>
> First just a nit: Why not simply
>
>     if (cpufreq_governor_internal != governor->internal)
>         return -EINVAL;
>
> ?

Yes, that would work.

> Yet then I find this approach a little odd anyway. I think the
> registration attempts would better be suppressed, thus also not
> resulting in (apparently) failed init-calls. Especially for the
> userspace governor this would then also mean / allow to avoid
> registering of the CPU notifier.

So are you suggesting that each caller check cpufreq_governor_internal
and potentially skip calling cpufreq_register_governor()?  e.g. the
start of cpufreq_gov_userspace_init() would gain:

    if (cpufreq_governor_internal)
        return 0

?

I put the check in cpufreq_register_governor() since then it is
handled in a single place instead of being spread around.

To leave the check in cpufreq_register_governor(),
cpufreq_gov_userspace_init() could be rearranged to call
cpufreq_register_governor() first and only call
register_cpu_notifier() on success?

Or do you want the whole governor registration to be re-worked to be
more explicit?  Remove initcalls and have governors registered
according to the cpufreq driver?

Regards,
Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 04/15] xen/sysctl: Nest cpufreq scaling options
  2023-06-15 13:29   ` Jan Beulich
@ 2023-06-15 14:07     ` Jason Andryuk
  2023-06-15 14:28       ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-15 14:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Anthony PERARD, Juergen Gross, Andrew Cooper,
	George Dunlap, Julien Grall, Stefano Stabellini, xen-devel

On Thu, Jun 15, 2023 at 9:29 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 14.06.2023 20:02, Jason Andryuk wrote:
> > --- a/tools/include/xenctrl.h
> > +++ b/tools/include/xenctrl.h
> > @@ -1909,16 +1909,20 @@ struct xc_get_cpufreq_para {
> >      uint32_t cpuinfo_cur_freq;
> >      uint32_t cpuinfo_max_freq;
> >      uint32_t cpuinfo_min_freq;
> > -    uint32_t scaling_cur_freq;
> > -
> > -    char scaling_governor[CPUFREQ_NAME_LEN];
> > -    uint32_t scaling_max_freq;
> > -    uint32_t scaling_min_freq;
> > -
> > -    /* for specific governor */
> >      union {
> > -        xc_userspace_t userspace;
> > -        xc_ondemand_t ondemand;
> > +        struct {
> > +            uint32_t scaling_cur_freq;
> > +
> > +            char scaling_governor[CPUFREQ_NAME_LEN];
> > +            uint32_t scaling_max_freq;
> > +            uint32_t scaling_min_freq;
> > +
> > +            /* for specific governor */
> > +            union {
> > +                xc_userspace_t userspace;
> > +                xc_ondemand_t ondemand;
> > +            } u;
> > +        } s;
> >      } u;
>
> There's no comment in the header that this needs to mirror the sysctl
> struct. Does it really need changing?

Since this matched the other structure, I kept them in sync.  The
cppc/hwp data needs to be represented somehow, and it gets introduced
in the same way for both later.  If this doesn't get the new nested
struct, then maybe fields could be placed into the single union.  That
would grow the overall struct and have unused fields for hwp.

> > --- a/tools/libs/ctrl/xc_pm.c
> > +++ b/tools/libs/ctrl/xc_pm.c
> > @@ -265,15 +265,10 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
> >          user_para->cpuinfo_cur_freq = sys_para->cpuinfo_cur_freq;
> >          user_para->cpuinfo_max_freq = sys_para->cpuinfo_max_freq;
> >          user_para->cpuinfo_min_freq = sys_para->cpuinfo_min_freq;
> > -        user_para->scaling_cur_freq = sys_para->scaling_cur_freq;
> > -        user_para->scaling_max_freq = sys_para->scaling_max_freq;
> > -        user_para->scaling_min_freq = sys_para->scaling_min_freq;
> >          user_para->turbo_enabled    = sys_para->turbo_enabled;
> >
> >          memcpy(user_para->scaling_driver,
> >                  sys_para->scaling_driver, CPUFREQ_NAME_LEN);
> > -        memcpy(user_para->scaling_governor,
> > -                sys_para->scaling_governor, CPUFREQ_NAME_LEN);
>
> Did you really mean to remove the copying of these 4 entities, rather
> than simply change the way the fields are accessed?

Yes, it was intentional.

The immediate following lines are:
        /* copy to user_para no matter what cpufreq governor */
        BUILD_BUG_ON(sizeof(((struct xc_get_cpufreq_para *)0)->u) !=
             sizeof(((struct xen_get_cpufreq_para *)0)->u));

        memcpy(&user_para->u, &sys_para->u, sizeof(sys_para->u));

And this memcpy copies all the moved entities.

I suppose the comment could change to "...no matter which cpufreq driver".

Regards,
Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 01/15] cpufreq: Allow restricting to internal governors only
  2023-06-15 14:04     ` Jason Andryuk
@ 2023-06-15 14:23       ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2023-06-15 14:23 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: xen-devel

On 15.06.2023 16:04, Jason Andryuk wrote:
> On Thu, Jun 15, 2023 at 9:21 AM Jan Beulich <jbeulich@suse.com> wrote:
>> Yet then I find this approach a little odd anyway. I think the
>> registration attempts would better be suppressed, thus also not
>> resulting in (apparently) failed init-calls. Especially for the
>> userspace governor this would then also mean / allow to avoid
>> registering of the CPU notifier.
> 
> So are you suggesting that each caller check cpufreq_governor_internal
> and potentially skip calling cpufreq_register_governor()?  e.g. the
> start of cpufreq_gov_userspace_init() would gain:
> 
>     if (cpufreq_governor_internal)
>         return 0
> 
> ?

Right.

> I put the check in cpufreq_register_governor() since then it is
> handled in a single place instead of being spread around.

I understand this was the goal, but I'm still wondering why we would
try registration we know will fail. Plus, as said and even if benign
right now, I'm not happy to see initcall failures being introduced.

> To leave the check in cpufreq_register_governor(),
> cpufreq_gov_userspace_init() could be rearranged to call
> cpufreq_register_governor() first and only call
> register_cpu_notifier() on success?

Might be an option, but would address only part of my concerns.

> Or do you want the whole governor registration to be re-worked to be
> more explicit?  Remove initcalls and have governors registered
> according to the cpufreq driver?

No, I'm not after any extensive re-work.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 04/15] xen/sysctl: Nest cpufreq scaling options
  2023-06-15 14:07     ` Jason Andryuk
@ 2023-06-15 14:28       ` Jan Beulich
  2023-06-15 17:56         ` Jason Andryuk
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-06-15 14:28 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Wei Liu, Anthony PERARD, Juergen Gross, Andrew Cooper,
	George Dunlap, Julien Grall, Stefano Stabellini, xen-devel

On 15.06.2023 16:07, Jason Andryuk wrote:
> On Thu, Jun 15, 2023 at 9:29 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 14.06.2023 20:02, Jason Andryuk wrote:
>>> --- a/tools/include/xenctrl.h
>>> +++ b/tools/include/xenctrl.h
>>> @@ -1909,16 +1909,20 @@ struct xc_get_cpufreq_para {
>>>      uint32_t cpuinfo_cur_freq;
>>>      uint32_t cpuinfo_max_freq;
>>>      uint32_t cpuinfo_min_freq;
>>> -    uint32_t scaling_cur_freq;
>>> -
>>> -    char scaling_governor[CPUFREQ_NAME_LEN];
>>> -    uint32_t scaling_max_freq;
>>> -    uint32_t scaling_min_freq;
>>> -
>>> -    /* for specific governor */
>>>      union {
>>> -        xc_userspace_t userspace;
>>> -        xc_ondemand_t ondemand;
>>> +        struct {
>>> +            uint32_t scaling_cur_freq;
>>> +
>>> +            char scaling_governor[CPUFREQ_NAME_LEN];
>>> +            uint32_t scaling_max_freq;
>>> +            uint32_t scaling_min_freq;
>>> +
>>> +            /* for specific governor */
>>> +            union {
>>> +                xc_userspace_t userspace;
>>> +                xc_ondemand_t ondemand;
>>> +            } u;
>>> +        } s;
>>>      } u;
>>
>> There's no comment in the header that this needs to mirror the sysctl
>> struct. Does it really need changing?
> 
> Since this matched the other structure, I kept them in sync.  The
> cppc/hwp data needs to be represented somehow, and it gets introduced
> in the same way for both later.  If this doesn't get the new nested
> struct, then maybe fields could be placed into the single union.  That
> would grow the overall struct and have unused fields for hwp.

I guess I need to leave this to the maintainers then. Still ...

>>> --- a/tools/libs/ctrl/xc_pm.c
>>> +++ b/tools/libs/ctrl/xc_pm.c
>>> @@ -265,15 +265,10 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
>>>          user_para->cpuinfo_cur_freq = sys_para->cpuinfo_cur_freq;
>>>          user_para->cpuinfo_max_freq = sys_para->cpuinfo_max_freq;
>>>          user_para->cpuinfo_min_freq = sys_para->cpuinfo_min_freq;
>>> -        user_para->scaling_cur_freq = sys_para->scaling_cur_freq;
>>> -        user_para->scaling_max_freq = sys_para->scaling_max_freq;
>>> -        user_para->scaling_min_freq = sys_para->scaling_min_freq;
>>>          user_para->turbo_enabled    = sys_para->turbo_enabled;
>>>
>>>          memcpy(user_para->scaling_driver,
>>>                  sys_para->scaling_driver, CPUFREQ_NAME_LEN);
>>> -        memcpy(user_para->scaling_governor,
>>> -                sys_para->scaling_governor, CPUFREQ_NAME_LEN);
>>
>> Did you really mean to remove the copying of these 4 entities, rather
>> than simply change the way the fields are accessed?
> 
> Yes, it was intentional.
> 
> The immediate following lines are:
>         /* copy to user_para no matter what cpufreq governor */
>         BUILD_BUG_ON(sizeof(((struct xc_get_cpufreq_para *)0)->u) !=
>              sizeof(((struct xen_get_cpufreq_para *)0)->u));
> 
>         memcpy(&user_para->u, &sys_para->u, sizeof(sys_para->u));

... this suggests that some matching is intended, yet it's not clear
to me why then the hole struct-s aren't assumed to be matching / made
match.

> And this memcpy copies all the moved entities.

Right, I should have gone to the source instead of going just from
patch context, sorry.

> I suppose the comment could change to "...no matter which cpufreq driver".

Yeah, well, it really would be driver/governor then, I guess.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 05/15] pmstat&xenpm: Re-arrage for cpufreq union
  2023-06-14 18:02 ` [PATCH v4 05/15] pmstat&xenpm: Re-arrage for cpufreq union Jason Andryuk
@ 2023-06-15 14:38   ` Jan Beulich
  2023-06-15 15:05     ` Jason Andryuk
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-06-15 14:38 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, Juergen Gross, xen-devel

On 14.06.2023 20:02, Jason Andryuk wrote:
> Move some code around now that common xen_sysctl_pm_op get_para fields
> are together.  In particular, the scaling governor information like
> scaling_available_governors is inside the union, so it is not always
> available.
> 
> With that, gov_num may be 0, so bounce buffer handling needs
> to be modified.
> 
> scaling_governor won't be filled for hwp, so this will simplify the
> change when it is introduced.

While I think this suitably describes the tool stack side changes, ...

> --- a/xen/drivers/acpi/pmstat.c
> +++ b/xen/drivers/acpi/pmstat.c
> @@ -239,11 +239,24 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>      if ( ret )
>          return ret;
>  
> +    op->u.get_para.cpuinfo_cur_freq =
> +        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
> +    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
> +    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
> +    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
> +
> +    if ( cpufreq_driver.name[0] )
> +        strlcpy(op->u.get_para.scaling_driver,
> +            cpufreq_driver.name, CPUFREQ_NAME_LEN);
> +    else
> +        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
> +
>      if ( !(scaling_available_governors =
>             xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
>          return -ENOMEM;
> -    if ( (ret = read_scaling_available_governors(scaling_available_governors,
> -                gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
> +    if ( (ret = read_scaling_available_governors(
> +                    scaling_available_governors,
> +                    gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
>      {
>          xfree(scaling_available_governors);
>          return ret;
> @@ -254,26 +267,16 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>      if ( ret )
>          return ret;
>  
> -    op->u.get_para.cpuinfo_cur_freq =
> -        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
> -    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
> -    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
> -
>      op->u.get_para.u.s.scaling_cur_freq = policy->cur;
>      op->u.get_para.u.s.scaling_max_freq = policy->max;
>      op->u.get_para.u.s.scaling_min_freq = policy->min;
>  
> -    if ( cpufreq_driver.name[0] )
> -        strlcpy(op->u.get_para.scaling_driver,
> -            cpufreq_driver.name, CPUFREQ_NAME_LEN);
> -    else
> -        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
> -
>      if ( policy->governor->name[0] )
>          strlcpy(op->u.get_para.u.s.scaling_governor,
>              policy->governor->name, CPUFREQ_NAME_LEN);
>      else
> -        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown", CPUFREQ_NAME_LEN);
> +        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
> +                CPUFREQ_NAME_LEN);
>  
>      /* governor specific para */
>      if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
> @@ -291,7 +294,6 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>              &op->u.get_para.u.s.u.ondemand.sampling_rate,
>              &op->u.get_para.u.s.u.ondemand.up_threshold);
>      }
> -    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
>  
>      return ret;
>  }

... all I see on the hypervisor side is re-ordering of steps and re-formatting
of over-long lines. It's not clear to me why what you do is necessary for your
purpose.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 05/15] pmstat&xenpm: Re-arrage for cpufreq union
  2023-06-15 14:38   ` Jan Beulich
@ 2023-06-15 15:05     ` Jason Andryuk
  2023-06-15 15:22       ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-15 15:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, Anthony PERARD, Juergen Gross, xen-devel

On Thu, Jun 15, 2023 at 10:38 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 14.06.2023 20:02, Jason Andryuk wrote:
> > Move some code around now that common xen_sysctl_pm_op get_para fields
> > are together.  In particular, the scaling governor information like
> > scaling_available_governors is inside the union, so it is not always
> > available.
> >
> > With that, gov_num may be 0, so bounce buffer handling needs
> > to be modified.
> >
> > scaling_governor won't be filled for hwp, so this will simplify the
> > change when it is introduced.
>
> While I think this suitably describes the tool stack side changes, ...
>
> > --- a/xen/drivers/acpi/pmstat.c
> > +++ b/xen/drivers/acpi/pmstat.c
> > @@ -239,11 +239,24 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
> >      if ( ret )
> >          return ret;
> >
> > +    op->u.get_para.cpuinfo_cur_freq =
> > +        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
> > +    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
> > +    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
> > +    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
> > +
> > +    if ( cpufreq_driver.name[0] )
> > +        strlcpy(op->u.get_para.scaling_driver,
> > +            cpufreq_driver.name, CPUFREQ_NAME_LEN);
> > +    else
> > +        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
> > +
> >      if ( !(scaling_available_governors =
> >             xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
> >          return -ENOMEM;
> > -    if ( (ret = read_scaling_available_governors(scaling_available_governors,
> > -                gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
> > +    if ( (ret = read_scaling_available_governors(
> > +                    scaling_available_governors,
> > +                    gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
> >      {
> >          xfree(scaling_available_governors);
> >          return ret;
> > @@ -254,26 +267,16 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
> >      if ( ret )
> >          return ret;
> >
> > -    op->u.get_para.cpuinfo_cur_freq =
> > -        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
> > -    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
> > -    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
> > -
> >      op->u.get_para.u.s.scaling_cur_freq = policy->cur;
> >      op->u.get_para.u.s.scaling_max_freq = policy->max;
> >      op->u.get_para.u.s.scaling_min_freq = policy->min;
> >
> > -    if ( cpufreq_driver.name[0] )
> > -        strlcpy(op->u.get_para.scaling_driver,
> > -            cpufreq_driver.name, CPUFREQ_NAME_LEN);
> > -    else
> > -        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
> > -
> >      if ( policy->governor->name[0] )
> >          strlcpy(op->u.get_para.u.s.scaling_governor,
> >              policy->governor->name, CPUFREQ_NAME_LEN);
> >      else
> > -        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown", CPUFREQ_NAME_LEN);
> > +        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
> > +                CPUFREQ_NAME_LEN);
> >
> >      /* governor specific para */
> >      if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
> > @@ -291,7 +294,6 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
> >              &op->u.get_para.u.s.u.ondemand.sampling_rate,
> >              &op->u.get_para.u.s.u.ondemand.up_threshold);
> >      }
> > -    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
> >
> >      return ret;
> >  }
>
> ... all I see on the hypervisor side is re-ordering of steps and re-formatting
> of over-long lines. It's not clear to me why what you do is necessary for your
> purpose.

The purpose was to move accesses to the nested struct and union
"op->u.get_para.u.s.u" to the end of the function, and the accesses to
common fields (e.g. op->u.get_para.turbo_enabled) earlier.  This
simplifies the changes in "cpufreq: Export HWP parameters to userspace
as CPPC".  These governor fields get indented, and that needed some
re-formatting.  Some re-formatting slipped in while I rebased changes
- sorry about that.

Regards,
Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 05/15] pmstat&xenpm: Re-arrage for cpufreq union
  2023-06-15 15:05     ` Jason Andryuk
@ 2023-06-15 15:22       ` Jan Beulich
  2023-06-15 15:45         ` Jason Andryuk
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-06-15 15:22 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, Juergen Gross, xen-devel

On 15.06.2023 17:05, Jason Andryuk wrote:
> On Thu, Jun 15, 2023 at 10:38 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 14.06.2023 20:02, Jason Andryuk wrote:
>>> Move some code around now that common xen_sysctl_pm_op get_para fields
>>> are together.  In particular, the scaling governor information like
>>> scaling_available_governors is inside the union, so it is not always
>>> available.
>>>
>>> With that, gov_num may be 0, so bounce buffer handling needs
>>> to be modified.
>>>
>>> scaling_governor won't be filled for hwp, so this will simplify the
>>> change when it is introduced.
>>
>> While I think this suitably describes the tool stack side changes, ...
>>
>>> --- a/xen/drivers/acpi/pmstat.c
>>> +++ b/xen/drivers/acpi/pmstat.c
>>> @@ -239,11 +239,24 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>>>      if ( ret )
>>>          return ret;
>>>
>>> +    op->u.get_para.cpuinfo_cur_freq =
>>> +        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
>>> +    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
>>> +    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
>>> +    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
>>> +
>>> +    if ( cpufreq_driver.name[0] )
>>> +        strlcpy(op->u.get_para.scaling_driver,
>>> +            cpufreq_driver.name, CPUFREQ_NAME_LEN);
>>> +    else
>>> +        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
>>> +
>>>      if ( !(scaling_available_governors =
>>>             xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
>>>          return -ENOMEM;
>>> -    if ( (ret = read_scaling_available_governors(scaling_available_governors,
>>> -                gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
>>> +    if ( (ret = read_scaling_available_governors(
>>> +                    scaling_available_governors,
>>> +                    gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
>>>      {
>>>          xfree(scaling_available_governors);
>>>          return ret;
>>> @@ -254,26 +267,16 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>>>      if ( ret )
>>>          return ret;
>>>
>>> -    op->u.get_para.cpuinfo_cur_freq =
>>> -        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
>>> -    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
>>> -    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
>>> -
>>>      op->u.get_para.u.s.scaling_cur_freq = policy->cur;
>>>      op->u.get_para.u.s.scaling_max_freq = policy->max;
>>>      op->u.get_para.u.s.scaling_min_freq = policy->min;
>>>
>>> -    if ( cpufreq_driver.name[0] )
>>> -        strlcpy(op->u.get_para.scaling_driver,
>>> -            cpufreq_driver.name, CPUFREQ_NAME_LEN);
>>> -    else
>>> -        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
>>> -
>>>      if ( policy->governor->name[0] )
>>>          strlcpy(op->u.get_para.u.s.scaling_governor,
>>>              policy->governor->name, CPUFREQ_NAME_LEN);
>>>      else
>>> -        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown", CPUFREQ_NAME_LEN);
>>> +        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
>>> +                CPUFREQ_NAME_LEN);
>>>
>>>      /* governor specific para */
>>>      if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
>>> @@ -291,7 +294,6 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>>>              &op->u.get_para.u.s.u.ondemand.sampling_rate,
>>>              &op->u.get_para.u.s.u.ondemand.up_threshold);
>>>      }
>>> -    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
>>>
>>>      return ret;
>>>  }
>>
>> ... all I see on the hypervisor side is re-ordering of steps and re-formatting
>> of over-long lines. It's not clear to me why what you do is necessary for your
>> purpose.
> 
> The purpose was to move accesses to the nested struct and union
> "op->u.get_para.u.s.u" to the end of the function, and the accesses to
> common fields (e.g. op->u.get_para.turbo_enabled) earlier.  This
> simplifies the changes in "cpufreq: Export HWP parameters to userspace
> as CPPC".

Could you mention this as (part of) the purpose in the description?

> These governor fields get indented, and that needed some re-formatting.

Would it maybe make sense to split the function?

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 05/15] pmstat&xenpm: Re-arrage for cpufreq union
  2023-06-15 15:22       ` Jan Beulich
@ 2023-06-15 15:45         ` Jason Andryuk
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Andryuk @ 2023-06-15 15:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, Anthony PERARD, Juergen Gross, xen-devel

On Thu, Jun 15, 2023 at 11:22 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 15.06.2023 17:05, Jason Andryuk wrote:
> > On Thu, Jun 15, 2023 at 10:38 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 14.06.2023 20:02, Jason Andryuk wrote:
> >>> Move some code around now that common xen_sysctl_pm_op get_para fields
> >>> are together.  In particular, the scaling governor information like
> >>> scaling_available_governors is inside the union, so it is not always
> >>> available.
> >>>
> >>> With that, gov_num may be 0, so bounce buffer handling needs
> >>> to be modified.
> >>>
> >>> scaling_governor won't be filled for hwp, so this will simplify the
> >>> change when it is introduced.
> >>
> >> While I think this suitably describes the tool stack side changes, ...
> >>
> >>> --- a/xen/drivers/acpi/pmstat.c
> >>> +++ b/xen/drivers/acpi/pmstat.c
> >>> @@ -239,11 +239,24 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
> >>>      if ( ret )
> >>>          return ret;
> >>>
> >>> +    op->u.get_para.cpuinfo_cur_freq =
> >>> +        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
> >>> +    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
> >>> +    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
> >>> +    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
> >>> +
> >>> +    if ( cpufreq_driver.name[0] )
> >>> +        strlcpy(op->u.get_para.scaling_driver,
> >>> +            cpufreq_driver.name, CPUFREQ_NAME_LEN);
> >>> +    else
> >>> +        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
> >>> +
> >>>      if ( !(scaling_available_governors =
> >>>             xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
> >>>          return -ENOMEM;
> >>> -    if ( (ret = read_scaling_available_governors(scaling_available_governors,
> >>> -                gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
> >>> +    if ( (ret = read_scaling_available_governors(
> >>> +                    scaling_available_governors,
> >>> +                    gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
> >>>      {
> >>>          xfree(scaling_available_governors);
> >>>          return ret;
> >>> @@ -254,26 +267,16 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
> >>>      if ( ret )
> >>>          return ret;
> >>>
> >>> -    op->u.get_para.cpuinfo_cur_freq =
> >>> -        cpufreq_driver.get ? cpufreq_driver.get(op->cpuid) : policy->cur;
> >>> -    op->u.get_para.cpuinfo_max_freq = policy->cpuinfo.max_freq;
> >>> -    op->u.get_para.cpuinfo_min_freq = policy->cpuinfo.min_freq;
> >>> -
> >>>      op->u.get_para.u.s.scaling_cur_freq = policy->cur;
> >>>      op->u.get_para.u.s.scaling_max_freq = policy->max;
> >>>      op->u.get_para.u.s.scaling_min_freq = policy->min;
> >>>
> >>> -    if ( cpufreq_driver.name[0] )
> >>> -        strlcpy(op->u.get_para.scaling_driver,
> >>> -            cpufreq_driver.name, CPUFREQ_NAME_LEN);
> >>> -    else
> >>> -        strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
> >>> -
> >>>      if ( policy->governor->name[0] )
> >>>          strlcpy(op->u.get_para.u.s.scaling_governor,
> >>>              policy->governor->name, CPUFREQ_NAME_LEN);
> >>>      else
> >>> -        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown", CPUFREQ_NAME_LEN);
> >>> +        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
> >>> +                CPUFREQ_NAME_LEN);
> >>>
> >>>      /* governor specific para */
> >>>      if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
> >>> @@ -291,7 +294,6 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
> >>>              &op->u.get_para.u.s.u.ondemand.sampling_rate,
> >>>              &op->u.get_para.u.s.u.ondemand.up_threshold);
> >>>      }
> >>> -    op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
> >>>
> >>>      return ret;
> >>>  }
> >>
> >> ... all I see on the hypervisor side is re-ordering of steps and re-formatting
> >> of over-long lines. It's not clear to me why what you do is necessary for your
> >> purpose.
> >
> > The purpose was to move accesses to the nested struct and union
> > "op->u.get_para.u.s.u" to the end of the function, and the accesses to
> > common fields (e.g. op->u.get_para.turbo_enabled) earlier.  This
> > simplifies the changes in "cpufreq: Export HWP parameters to userspace
> > as CPPC".
>
> Could you mention this as (part of) the purpose in the description?

Certainly.

> > These governor fields get indented, and that needed some re-formatting.
>
> Would it maybe make sense to split the function?

While that could be done, I don't think it's needed.  Maybe you'll
feel otherwise after you look at "cpufreq: Export HWP parameters to
userspace as CPPC".

Regards,
Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 04/15] xen/sysctl: Nest cpufreq scaling options
  2023-06-15 14:28       ` Jan Beulich
@ 2023-06-15 17:56         ` Jason Andryuk
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Andryuk @ 2023-06-15 17:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Anthony PERARD, Juergen Gross, Andrew Cooper,
	George Dunlap, Julien Grall, Stefano Stabellini, xen-devel

On Thu, Jun 15, 2023 at 10:29 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 15.06.2023 16:07, Jason Andryuk wrote:
> > On Thu, Jun 15, 2023 at 9:29 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 14.06.2023 20:02, Jason Andryuk wrote:
> >>> --- a/tools/include/xenctrl.h
> >>> +++ b/tools/include/xenctrl.h
> >>> @@ -1909,16 +1909,20 @@ struct xc_get_cpufreq_para {
> >>>      uint32_t cpuinfo_cur_freq;
> >>>      uint32_t cpuinfo_max_freq;
> >>>      uint32_t cpuinfo_min_freq;
> >>> -    uint32_t scaling_cur_freq;
> >>> -
> >>> -    char scaling_governor[CPUFREQ_NAME_LEN];
> >>> -    uint32_t scaling_max_freq;
> >>> -    uint32_t scaling_min_freq;
> >>> -
> >>> -    /* for specific governor */
> >>>      union {
> >>> -        xc_userspace_t userspace;
> >>> -        xc_ondemand_t ondemand;
> >>> +        struct {
> >>> +            uint32_t scaling_cur_freq;
> >>> +
> >>> +            char scaling_governor[CPUFREQ_NAME_LEN];
> >>> +            uint32_t scaling_max_freq;
> >>> +            uint32_t scaling_min_freq;
> >>> +
> >>> +            /* for specific governor */
> >>> +            union {
> >>> +                xc_userspace_t userspace;
> >>> +                xc_ondemand_t ondemand;
> >>> +            } u;
> >>> +        } s;
> >>>      } u;
> >>
> >> There's no comment in the header that this needs to mirror the sysctl
> >> struct. Does it really need changing?
> >
> > Since this matched the other structure, I kept them in sync.  The
> > cppc/hwp data needs to be represented somehow, and it gets introduced
> > in the same way for both later.  If this doesn't get the new nested
> > struct, then maybe fields could be placed into the single union.  That
> > would grow the overall struct and have unused fields for hwp.
>
> I guess I need to leave this to the maintainers then. Still ...
>
> >>> --- a/tools/libs/ctrl/xc_pm.c
> >>> +++ b/tools/libs/ctrl/xc_pm.c
> >>> @@ -265,15 +265,10 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
> >>>          user_para->cpuinfo_cur_freq = sys_para->cpuinfo_cur_freq;
> >>>          user_para->cpuinfo_max_freq = sys_para->cpuinfo_max_freq;
> >>>          user_para->cpuinfo_min_freq = sys_para->cpuinfo_min_freq;
> >>> -        user_para->scaling_cur_freq = sys_para->scaling_cur_freq;
> >>> -        user_para->scaling_max_freq = sys_para->scaling_max_freq;
> >>> -        user_para->scaling_min_freq = sys_para->scaling_min_freq;
> >>>          user_para->turbo_enabled    = sys_para->turbo_enabled;
> >>>
> >>>          memcpy(user_para->scaling_driver,
> >>>                  sys_para->scaling_driver, CPUFREQ_NAME_LEN);
> >>> -        memcpy(user_para->scaling_governor,
> >>> -                sys_para->scaling_governor, CPUFREQ_NAME_LEN);
> >>
> >> Did you really mean to remove the copying of these 4 entities, rather
> >> than simply change the way the fields are accessed?
> >
> > Yes, it was intentional.
> >
> > The immediate following lines are:
> >         /* copy to user_para no matter what cpufreq governor */
> >         BUILD_BUG_ON(sizeof(((struct xc_get_cpufreq_para *)0)->u) !=
> >              sizeof(((struct xen_get_cpufreq_para *)0)->u));
> >
> >         memcpy(&user_para->u, &sys_para->u, sizeof(sys_para->u));
>
> ... this suggests that some matching is intended, yet it's not clear
> to me why then the hole struct-s aren't assumed to be matching / made
> match.

The tools version replaces struct xen_$foo with xc_$foo typedefs.
Maybe it's just to enforce the typedefs?

Regards,
Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-06-14 18:02 ` [PATCH v4 06/15] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
@ 2023-06-19 11:38   ` Jan Beulich
  2023-06-20 18:14     ` Jason Andryuk
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-06-19 11:38 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 14.06.2023 20:02, Jason Andryuk wrote:
> Falling back from cpufreq=hwp to cpufreq=xen is a more user-friendly
> choice than disabling cpufreq when HWP is not available.  Specifying
> cpufreq=hwp indicates the user wants cpufreq, so, if HWP isn't
> available, it makes sense to give them the cpufreq that can be
> supported.  i.e. I can't see a user only wanting cpufreq=hwp or
> cpufreq=none, but not cpufreq=xen.

I continue to disagree with this, and I'd like to ask for another
maintainer's opinion. To me the described behavior is like getting
pears when having asked for apples. I nevertheless agree that
having said fallback as an option, but I'd see that done by giving
meaning to something like "cpufreq=hwp,xen" (without having checked
whether that doesn't have meaning already, i.e. just to give you an
idea).

> We can't use parse_boolean() since it requires a single name=val string
> and cpufreq_handle_common_option is provided two strings.  Use
> parse_bool() and manual handle no-hwp.
> 
> Write to disable the interrupt - the linux pstate driver does this.  We
> don't use the interrupts, so we can just turn them off.  We aren't ready
> to handle them, so we don't want any.  Unclear if this is necessary.
> SDM says it's default disabled.

Part of this may want to move to the description?

> --- /dev/null
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -0,0 +1,537 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> + *
> + * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
> + */
> +
> +#include <xen/cpumask.h>
> +#include <xen/init.h>
> +#include <xen/param.h>
> +#include <xen/xmalloc.h>
> +#include <asm/io.h>

What do you need this one for?

> +#include <asm/msr.h>
> +#include <acpi/cpufreq/cpufreq.h>
> +
> +static bool __ro_after_init feature_hwp;
> +static bool __ro_after_init feature_hwp_notification;
> +static bool __ro_after_init feature_hwp_activity_window;
> +
> +static bool __ro_after_init feature_hdc;
> +
> +bool __initdata opt_cpufreq_hwp;
> +static bool __ro_after_init opt_cpufreq_hdc = true;
> +
> +union hwp_request
> +{
> +    struct
> +    {
> +        unsigned int min_perf:8;
> +        unsigned int max_perf:8;
> +        unsigned int desired:8;
> +        unsigned int energy_perf:8;
> +        unsigned int activity_window:10;
> +        bool package_control:1;
> +        unsigned int reserved:16;

Better leave this and ...

> +        bool activity_window_valid:1;
> +        bool energy_perf_valid:1;
> +        bool desired_valid:1;
> +        bool max_perf_valid:1;
> +        bool min_perf_valid:1;
> +    };
> +    uint64_t raw;
> +};
> +
> +struct hwp_drv_data
> +{
> +    union
> +    {
> +        uint64_t hwp_caps;
> +        struct
> +        {
> +            unsigned int highest:8;
> +            unsigned int guaranteed:8;
> +            unsigned int most_efficient:8;
> +            unsigned int lowest:8;
> +            unsigned int reserved:32;

... this without a name? Hardware interfaces like this one are, in my
understanding, one of the main applications of unnamed bitfields.

> +        } hw;
> +    };
> +    union hwp_request curr_req;
> +    int ret;
> +    uint16_t activity_window;
> +    uint8_t minimum;
> +    uint8_t maximum;
> +    uint8_t desired;
> +    uint8_t energy_perf;
> +};
> +DEFINE_PER_CPU_READ_MOSTLY(struct hwp_drv_data *, hwp_drv_data);

static?

> +static bool hwp_handle_option(const char *s, const char *end)
> +{
> +    int ret;
> +
> +    if ( strncmp(s, "verbose", 7) == 0 )
> +    {
> +        s += 7;
> +        if ( *s == '=' )
> +        {
> +            s++;
> +            cpufreq_verbose = !!simple_strtoul(s, NULL, 0);
> +
> +            return true;
> +        }
> +
> +        if ( end == NULL ||
> +             (end == s && (*end == '\0' || *end == ',')) )
> +        {
> +            cpufreq_verbose = true;
> +
> +            return true;
> +        }
> +
> +        return false;
> +    }

Would be nice if the handling of "verbose" didn't need duplicating here.
However, if that's unavoidable, did you consider handling this as a
proper boolean instead of the custom way cpufreq_handle_common_option()
also uses?

> +bool __init hwp_available(void)
> +{
> +    unsigned int eax;
> +
> +    if ( !opt_cpufreq_hwp )
> +        return false;
> +
> +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> +    {
> +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> +                    boot_cpu_data.cpuid_level);
> +
> +        return false;
> +    }
> +
> +    if ( boot_cpu_data.cpuid_level < 0x16 )
> +    {
> +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
> +                 boot_cpu_data.cpuid_level);
> +
> +        return false;
> +    }
> +
> +    eax = cpuid_eax(CPUID_PM_LEAF);
> +
> +    hwp_verbose("%d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
> +                !!(eax & CPUID6_EAX_HWP),
> +                !!(eax & CPUID6_EAX_HWP_NOTIFICATION),
> +                !!(eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW),
> +                !!(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE),
> +                !!(eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST),
> +                !!(eax & CPUID6_EAX_HWP_PECI));
> +
> +    if ( !(eax & CPUID6_EAX_HWP) )
> +        return false;
> +
> +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) )
> +    {
> +        hwp_verbose("disabled: No energy/performance preference available");
> +
> +        return false;
> +    }
> +
> +    feature_hwp                 = eax & CPUID6_EAX_HWP;
> +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
> +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> +    feature_hdc = eax & CPUID6_EAX_HDC;

Would you mind matching this line with the earlier three, padding-wise?

> +    hwp_verbose("Hardware Duty Cycling (HDC) %ssupported%s\n",
> +                feature_hdc ? "" : "not ",
> +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
> +                            : "");
> +
> +    hwp_verbose("HW_FEEDBACK %ssupported\n",
> +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
> +
> +    cpufreq_governor_internal = feature_hwp;

What use is feature_hwp? Already its setting is a little odd (you could
use true there as much as you could here and for the return value below,
because of the earlier CPUID6_EAX_HWP check), and then I haven't been
able to find any further use of the variable.

> +    if ( feature_hwp )
> +        hwp_info("Using HWP for cpufreq\n");
> +
> +    return feature_hwp;
> +}
> +
> +static int hdc_set_pkg_hdc_ctl(unsigned int cpu, bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_PKG_HDC_CTL, msr) )
> +    {
> +        hwp_err("rdmsr_safe(MSR_PKG_HDC_CTL)\n", cpu);
> +
> +        return -1;

Nit: While blank lines often help, and we even demand them ahead of a
function's main return statement, here and ...

> +    }
> +
> +    if ( val )
> +        msr |= PKG_HDC_CTL_HDC_PKG_ENABLE;
> +    else
> +        msr &= ~PKG_HDC_CTL_HDC_PKG_ENABLE;
> +
> +    if ( wrmsr_safe(MSR_PKG_HDC_CTL, msr) )
> +    {
> +        hwp_err("wrmsr_safe(MSR_PKG_HDC_CTL): %016lx\n", cpu, msr);
> +
> +        return -1;

... here (and then again below) I think they do more harm than good.

> +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
> +{
> +    uint32_t base_khz, max_khz, bus_khz, edx;
> +
> +    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
> +
> +    policy->cpuinfo.perf_freq = base_khz * 1000;
> +    policy->cpuinfo.min_freq = base_khz * 1000;
> +    policy->cpuinfo.max_freq = max_khz * 1000;
> +    policy->min = base_khz * 1000;
> +    policy->max = max_khz * 1000;

What are the consequences (for the driver) when any of the values read
is zero? While I haven't checked in combination with HWP, I know that
CPUs may populate only some of the fields.

> +static void cf_check hwp_init_msrs(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    uint64_t val;
> +
> +    /*
> +     * Package level MSR, but we don't have a good idea of packages here, so
> +     * just do it everytime.
> +     */
> +    if ( rdmsr_safe(MSR_PM_ENABLE, val) )
> +    {
> +        hwp_err("rdmsr_safe(MSR_PM_ENABLE)\n", policy->cpu);
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    /* Ensure we don't generate interrupts */
> +    if ( feature_hwp_notification )
> +        wrmsr_safe(MSR_HWP_INTERRUPT, 0);
> +
> +    hwp_verbose("CPU%u: MSR_PM_ENABLE: %016lx\n", policy->cpu, val);
> +    if ( !(val & PM_ENABLE_HWP_ENABLE) )
> +    {
> +        val |= PM_ENABLE_HWP_ENABLE;
> +        if ( wrmsr_safe(MSR_PM_ENABLE, val) )
> +        {
> +            hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val);
> +            data->curr_req.raw = -1;
> +            return;
> +        }
> +    }
> +
> +    if ( rdmsr_safe(MSR_HWP_CAPABILITIES, data->hwp_caps) )
> +    {
> +        hwp_err("rdmsr_safe(MSR_HWP_CAPABILITIES)\n", policy->cpu);
> +        goto error;
> +    }
> +
> +    if ( rdmsr_safe(MSR_HWP_REQUEST, data->curr_req.raw) )
> +    {
> +        hwp_err("rdmsr_safe(MSR_HWP_REQUEST)\n", policy->cpu);

Having seen a number of hwp_err() by now, I think these are confusing:
The format string as seen at call sites doesn't match the number of
arguments. I see two possible solutions to this: Either you demand
that calling functions maintain a "cpu" local variable, and you simply
use that from the macro without passing it as argument. Or you flip
parameters / arguments:

        hwp_err(policy->cpu, "rdmsr_safe(MSR_HWP_REQUEST)\n");

> +        goto error;
> +    }
> +
> +    /*
> +     * Check for APERF/MPERF support in hardware
> +     * also check for boost/turbo support
> +     */
> +    intel_feature_detect(policy);
> +
> +    if ( feature_hdc )
> +    {
> +        if ( hdc_set_pkg_hdc_ctl(policy->cpu, opt_cpufreq_hdc) )
> +            goto error;
> +        if ( hdc_set_pm_ctl1(policy->cpu, opt_cpufreq_hdc) )
> +            goto error;

If either of these fails the very first time through (presumably for the
BSP), wouldn't a better course of action be to clear feature_hdc?

> +    }
> +
> +    hwp_get_cpu_speeds(policy);
> +
> +    return;
> +
> + error:
> +    data->curr_req.raw = -1;
> +    val &= ~PM_ENABLE_HWP_ENABLE;
> +    if ( wrmsr_safe(MSR_PM_ENABLE, val) )
> +        hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val);
> +
> +    return;
> +}

I think in general we avoid "return" with no value at the end of functions.

> +static void cf_check hwp_write_request(void *info)
> +{
> +    const struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    union hwp_request hwp_req = data->curr_req;
> +
> +    data->ret = 0;
> +
> +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));

Why uint64_t? Generally we try to avoid using types in sizeof() and
alike, and instead refer to applicable variables. I.e. here:

    BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw));

> +static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
> +                                       unsigned int target_freq,
> +                                       unsigned int relation)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> +    /* Zero everything to ensure reserved bits are zero... */
> +    union hwp_request hwp_req = { .raw = 0 };
> +
> +    /* .. and update from there */
> +    hwp_req.min_perf = data->minimum;
> +    hwp_req.max_perf = data->maximum;
> +    hwp_req.desired = data->desired;
> +    hwp_req.energy_perf = data->energy_perf;
> +    if ( feature_hwp_activity_window )
> +        hwp_req.activity_window = data->activity_window;
> +
> +    if ( hwp_req.raw == data->curr_req.raw )
> +        return 0;
> +
> +    data->curr_req = hwp_req;
> +
> +    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
> +    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);

Is this at risk of being too verbose when the verbose option as given?

> +    return data->ret;
> +}
> +
> +static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data;
> +
> +    data = xzalloc(struct hwp_drv_data);
> +    if ( !data )
> +        return -ENOMEM;
> +
> +    if ( cpufreq_opt_governor && strcmp(cpufreq_opt_governor->name,
> +                                        cpufreq_gov_hwp.name) )

Nit: I think this would better be

    if ( cpufreq_opt_governor &&
         strcmp(cpufreq_opt_governor->name, cpufreq_gov_hwp.name) )

> +        printk(XENLOG_WARNING
> +               "HWP: governor \"%s\" is incompatible with hwp. Using default \"%s\"\n",
> +               cpufreq_opt_governor->name, cpufreq_gov_hwp.name);
> +    policy->governor = &cpufreq_gov_hwp;
> +
> +    per_cpu(hwp_drv_data, cpu) = data;
> +
> +    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
> +
> +    if ( data->curr_req.raw == -1 )
> +    {
> +        hwp_err("Could not initialize HWP properly\n", cpu);
> +        XFREE(per_cpu(hwp_drv_data, cpu));

Is this completely safe even in the CPU online/hotplug case? It would
seem better to me to go this way:

        per_cpu(hwp_drv_data, cpu) = NULL;
        xfree(data);

Or even defer publishing "data" ...

> +        return -ENODEV;
> +    }
> +
> +    data->minimum = data->curr_req.min_perf;
> +    data->maximum = data->curr_req.max_perf;
> +    data->desired = data->curr_req.desired;
> +    data->energy_perf = data->curr_req.energy_perf;
> +    data->activity_window = data->curr_req.activity_window;

... until after it was fully populated. (But I seem to vaguely recall
that you need to use the field somewhere in the init process.)

> +    hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);

Isn't this expected (or even required) to be the same on all CPUs? If
so, no need to log every time.

> +static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> +{
> +    XFREE(per_cpu(hwp_drv_data, policy->cpu));

Same remark as above, primarily because this is also used on an error
path.

> +/*
> + * The SDM reads like turbo should be disabled with MSR_IA32_PERF_CTL and
> + * PERF_CTL_TURBO_DISENGAGE, but that does not seem to actually work, at least
> + * with my HWP testing.  MSR_MISC_ENABLE and MISC_ENABLE_TURBO_DISENGAGE
> + * is what Linux uses and seems to work.
> + */

"my testing" imo wants replacing here by saying what hardware was tested
(not who did the testing).

> +static int cf_check hwp_cpufreq_update(int cpuid, struct cpufreq_policy *policy)
> +{
> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpuid);
> +    on_selected_cpus(cpumask_of(cpuid), hwp_set_misc_turbo, policy, 1);

Nit: Blank line please between declaration(s) and statement(s). Or
alternatively drop the local variable ltogether, as it's used ...

> +    return data->ret;

... just once here.

> +}
> +
> +static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =

Seeing __initconstrel here, just as a remark (not a request for you
to do anything): I think we want to finish conversion to altcall, such
that these can all become __initconst_cf_clobber.

> --- a/xen/drivers/cpufreq/cpufreq.c
> +++ b/xen/drivers/cpufreq/cpufreq.c
> @@ -97,6 +97,15 @@ static int __init cf_check setup_cpufreq_option(const char *str)
>              return cpufreq_cmdline_parse(arg + 1);
>      }
>  
> +    if ( choice < 0 && !cmdline_strcmp(str, "hwp") )
> +    {
> +        xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> +        cpufreq_controller = FREQCTL_xen;
> +        opt_cpufreq_hwp = true;
> +        if ( *arg && *(arg + 1) )

A pretty unusual way of writing arg[1].

> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -246,4 +246,7 @@ void cpufreq_dbs_timer_resume(void);
>  
>  void intel_feature_detect(struct cpufreq_policy *policy);
>  
> +extern bool __initdata opt_cpufreq_hwp;

No __initdata or alike on declarations please. Section placement
attributes (among others) only belong on the definition.

> --- a/xen/include/acpi/cpufreq/processor_perf.h
> +++ b/xen/include/acpi/cpufreq/processor_perf.h
> @@ -7,6 +7,9 @@
>  
>  #define XEN_PX_INIT 0x80000000
>  
> +bool hwp_available(void);
> +int hwp_register_driver(void);
> +
>  int powernow_cpufreq_init(void);
>  unsigned int powernow_register_driver(void);
>  unsigned int get_measured_perf(unsigned int cpu, unsigned int flag);

The existing split between what lives in which header is pretty
arbitrary from all I can tell. Is there a particular reason you can't
keep together the four declarations you add?

> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -296,6 +296,7 @@ struct xen_ondemand {
>      uint32_t up_threshold;
>  };
>  
> +#define XEN_HWP_DRIVER "hwp"

I think this wants some addition to the identifier which makes it expected
that the expansion is a string literal. Perhaps XEN_HWP_DRIVER_NAME?

>  /*

Nit: There also wants to be a blank line between #define and comment.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 08/15] xenpm: Change get-cpufreq-para output for hwp
  2023-06-14 18:02 ` [PATCH v4 08/15] xenpm: Change get-cpufreq-para output for hwp Jason Andryuk
@ 2023-06-19 11:50   ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2023-06-19 11:50 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, xen-devel

On 14.06.2023 20:02, Jason Andryuk wrote:
> When using HWP, some of the returned data is not applicable.  In that
> case, we should just omit it to avoid confusing the user.  So switch to
> printing the base and max frequencies since those are relevant to HWP.
> Similarly, stop printing the CPU frequencies since those do not apply.
> The scaling fields are also no longer printed.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

Acked-by: Jan Beulich <jbeulich@suse.com>




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 09/15] cpufreq: Export HWP parameters to userspace as CPPC
  2023-06-14 18:02 ` [PATCH v4 09/15] cpufreq: Export HWP parameters to userspace as CPPC Jason Andryuk
@ 2023-06-19 14:24   ` Jan Beulich
  2023-06-20 18:41     ` Jason Andryuk
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2023-06-19 14:24 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 14.06.2023 20:02, Jason Andryuk wrote:
> --- a/xen/arch/x86/acpi/cpufreq/hwp.c
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -537,6 +537,29 @@ static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
>      .update = hwp_cpufreq_update,
>  };
>  
> +int get_hwp_para(const unsigned int cpu,
> +                 struct xen_cppc_para *cppc_para)
> +{
> +    const struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> +
> +    if ( data == NULL )
> +        return -EINVAL;

Maybe better -ENODATA in this case?

> --- a/xen/drivers/acpi/pmstat.c
> +++ b/xen/drivers/acpi/pmstat.c
> @@ -251,48 +251,54 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>      else
>          strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
>  
> -    if ( !(scaling_available_governors =
> -           xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
> -        return -ENOMEM;
> -    if ( (ret = read_scaling_available_governors(
> -                    scaling_available_governors,
> -                    gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
> +    if ( !strncasecmp(op->u.get_para.scaling_driver, XEN_HWP_DRIVER,
> +                      CPUFREQ_NAME_LEN) )

Mind me asking why you think case-insensitive compare is appropriate here?

> +        ret = get_hwp_para(policy->cpu, &op->u.get_para.u.cppc_para);
> +    else
>      {
> +        if ( !(scaling_available_governors =
> +               xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
> +            return -ENOMEM;
> +        if ( (ret = read_scaling_available_governors(
> +                        scaling_available_governors,
> +                        gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )

I realize you only re-indent this, but since you need to touch it anyway,
may I suggest to also switch to siezof(*scaling_available_governors)?

> +        {
> +            xfree(scaling_available_governors);
> +            return ret;
> +        }
> +        ret = copy_to_guest(op->u.get_para.scaling_available_governors,
> +                    scaling_available_governors, gov_num * CPUFREQ_NAME_LEN);

Similarly here: Please adjust indentation while you touch this code.

>          xfree(scaling_available_governors);
> -        return ret;
> -    }
> -    ret = copy_to_guest(op->u.get_para.scaling_available_governors,
> -                scaling_available_governors, gov_num * CPUFREQ_NAME_LEN);
> -    xfree(scaling_available_governors);
> -    if ( ret )
> -        return ret;
> +        if ( ret )
> +            return ret;
>  
> -    op->u.get_para.u.s.scaling_cur_freq = policy->cur;
> -    op->u.get_para.u.s.scaling_max_freq = policy->max;
> -    op->u.get_para.u.s.scaling_min_freq = policy->min;
> +        op->u.get_para.u.s.scaling_cur_freq = policy->cur;
> +        op->u.get_para.u.s.scaling_max_freq = policy->max;
> +        op->u.get_para.u.s.scaling_min_freq = policy->min;
>  
> -    if ( policy->governor->name[0] )
> -        strlcpy(op->u.get_para.u.s.scaling_governor,
> -            policy->governor->name, CPUFREQ_NAME_LEN);
> -    else
> -        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
> -                CPUFREQ_NAME_LEN);
> +        if ( policy->governor->name[0] )
> +            strlcpy(op->u.get_para.u.s.scaling_governor,
> +                policy->governor->name, CPUFREQ_NAME_LEN);
> +        else
> +            strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
> +                    CPUFREQ_NAME_LEN);
>  
> -    /* governor specific para */
> -    if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
> -                      "userspace", CPUFREQ_NAME_LEN) )
> -    {
> -        op->u.get_para.u.s.u.userspace.scaling_setspeed = policy->cur;
> -    }
> +        /* governor specific para */
> +        if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
> +                          "userspace", CPUFREQ_NAME_LEN) )
> +        {
> +            op->u.get_para.u.s.u.userspace.scaling_setspeed = policy->cur;
> +        }

Would also be nice if you could get rid of the unnecessary braces here
at this occasion.

> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -248,5 +248,7 @@ void intel_feature_detect(struct cpufreq_policy *policy);
>  
>  extern bool __initdata opt_cpufreq_hwp;
>  int hwp_cmdline_parse(const char *s);
> +int get_hwp_para(const unsigned int cpu,

I think we generally avoid const when it's not a pointed-to type. It's
not useful at all in a declaration.

> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -296,6 +296,61 @@ struct xen_ondemand {
>      uint32_t up_threshold;
>  };
>  
> +struct xen_cppc_para {
> +    /* OUT */
> +    /* activity_window supported if 1 */
> +#define XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW  (1 << 0)

I think 1 isn't very helpful, looking forward. Perhaps better "set" or
"flag set"?

> +    uint32_t features; /* bit flags for features */
> +    /*
> +     * See Intel SDM: HWP Performance Range and Dynamic Capabilities
> +     *
> +     * These four are 0-255 hardware-provided values.  They "continuous,
> +     * abstract unit-less, performance" values.  smaller numbers are slower
> +     * and larger ones are faster.
> +     */
> +    uint32_t lowest;
> +    uint32_t lowest_nonlinear; /* most_efficient */

Why non_linear in the external interface when internally you use
most_efficient (merely put in the comment here)?

> +    uint32_t nominal; /* guaranteed */

Similar question for the name choice here.

> +    uint32_t highest;
> +    /*
> +     * See Intel SDM: IA32_HWP_REQUEST MSR (Address: 774H Logical Processor
> +     * Scope)
> +     *
> +     * These are all hints, and the processor may deviate outside of them.
> +     * Values below are 0-255.
> +     *
> +     * minimum and maximum can be set to the above hardware values to constrain
> +     * operation.  The full range 0-255 is accepted and will be clipped by
> +     * hardware.
> +     */
> +    uint32_t minimum;
> +    uint32_t maximum;
> +    /*
> +     * Set an explicit performance hint, disabling hardware selection.
> +     * 0 lets the hardware decide.
> +     */
> +    uint32_t desired;

"Set" kind of conflicts with all fields being marked as OUT above. I think
the word can simply be dropped?

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 11/15] xenpm: Print HWP/CPPC parameters
  2023-06-14 18:02 ` [PATCH v4 11/15] xenpm: Print HWP/CPPC parameters Jason Andryuk
@ 2023-06-19 14:29   ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2023-06-19 14:29 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, xen-devel

On 14.06.2023 20:02, Jason Andryuk wrote:
> Print HWP-specific parameters.  Some are always present, but others
> depend on hardware support.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 14/15] xenpm: Add set-cpufreq-cppc subcommand
  2023-06-14 18:02 ` [PATCH v4 14/15] xenpm: Add set-cpufreq-cppc subcommand Jason Andryuk
@ 2023-06-19 14:54   ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2023-06-19 14:54 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, xen-devel

On 14.06.2023 20:02, Jason Andryuk wrote:
> +static void cppc_set_func(int argc, char *argv[])
> +{
> +    xc_set_cppc_para_t set_cppc = {};
> +    int cpuid = -1;
> +    int i = 0;

While cpuid needs to remain of signed type, i wants to be unsigned int.

> +    if ( parse_cppc_opts(&set_cppc, &cpuid, argc, argv) )
> +        exit(EINVAL);
> +
> +    if ( cpuid != -1 )
> +    {
> +        i = cpuid;
> +        max_cpu_nr = i + 1;

While it looks like (ab)using the global variable this way is okay for
now, I'd prefer if you added a local variable instead, such that the
global can retain its meaning. Then
Acked-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-06-19 11:38   ` Jan Beulich
@ 2023-06-20 18:14     ` Jason Andryuk
  2023-06-21  7:58       ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-20 18:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On Mon, Jun 19, 2023 at 7:38 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 14.06.2023 20:02, Jason Andryuk wrote:
> > Falling back from cpufreq=hwp to cpufreq=xen is a more user-friendly
> > choice than disabling cpufreq when HWP is not available.  Specifying
> > cpufreq=hwp indicates the user wants cpufreq, so, if HWP isn't
> > available, it makes sense to give them the cpufreq that can be
> > supported.  i.e. I can't see a user only wanting cpufreq=hwp or
> > cpufreq=none, but not cpufreq=xen.
>
> I continue to disagree with this, and I'd like to ask for another
> maintainer's opinion. To me the described behavior is like getting
> pears when having asked for apples. I nevertheless agree that
> having said fallback as an option, but I'd see that done by giving
> meaning to something like "cpufreq=hwp,xen" (without having checked
> whether that doesn't have meaning already, i.e. just to give you an
> idea).

I know you disagree with the approach.  I implemented what would be
useful to me and Marek.  It felt counterproductive to implement a hard
failure mode that is unsuitable for my use case.  Also there was the
possibility you wouldn't mind when you saw how it was implemented.

Yeah, asking for an apple and getting a pear can be the wrong thing if
your recipe calls for apples.  But getting *some* fruit can be better
than no fruit if you are hungry.

As implemented here, with HWP default disabled,
no command line -> default cpufreq=xen
cpufreq=xen -> only cpufreq=xen
cpufreq=hwp -> try hwp and fallback to cpufreq=xen

If/when HWP is default enabled:
no command line -> try hwp and fallback to cpufreq=xen
cpufreq=hwp -> try hwp and fallback to cpufreq=xen
cpufreq=xen -> cpufreq=xen

Unless some other idea comes to me, I guess I'll look at implementing
"cpufreq=hwp,xen".

> > We can't use parse_boolean() since it requires a single name=val string
> > and cpufreq_handle_common_option is provided two strings.  Use
> > parse_bool() and manual handle no-hwp.
> >
> > Write to disable the interrupt - the linux pstate driver does this.  We
> > don't use the interrupts, so we can just turn them off.  We aren't ready
> > to handle them, so we don't want any.  Unclear if this is necessary.
> > SDM says it's default disabled.
>
> Part of this may want to move to the description?

Sure.

> > --- /dev/null
> > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> > @@ -0,0 +1,537 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> > + *
> > + * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
> > + */
> > +
> > +#include <xen/cpumask.h>
> > +#include <xen/init.h>
> > +#include <xen/param.h>
> > +#include <xen/xmalloc.h>
> > +#include <asm/io.h>
>
> What do you need this one for?

It seems to be unneeded.

> > +#include <asm/msr.h>
> > +#include <acpi/cpufreq/cpufreq.h>
> > +
> > +static bool __ro_after_init feature_hwp;
> > +static bool __ro_after_init feature_hwp_notification;
> > +static bool __ro_after_init feature_hwp_activity_window;
> > +
> > +static bool __ro_after_init feature_hdc;
> > +
> > +bool __initdata opt_cpufreq_hwp;
> > +static bool __ro_after_init opt_cpufreq_hdc = true;
> > +
> > +union hwp_request
> > +{
> > +    struct
> > +    {
> > +        unsigned int min_perf:8;
> > +        unsigned int max_perf:8;
> > +        unsigned int desired:8;
> > +        unsigned int energy_perf:8;
> > +        unsigned int activity_window:10;
> > +        bool package_control:1;
> > +        unsigned int reserved:16;
>
> Better leave this and ...
>
> > +        bool activity_window_valid:1;
> > +        bool energy_perf_valid:1;
> > +        bool desired_valid:1;
> > +        bool max_perf_valid:1;
> > +        bool min_perf_valid:1;
> > +    };
> > +    uint64_t raw;
> > +};
> > +
> > +struct hwp_drv_data
> > +{
> > +    union
> > +    {
> > +        uint64_t hwp_caps;
> > +        struct
> > +        {
> > +            unsigned int highest:8;
> > +            unsigned int guaranteed:8;
> > +            unsigned int most_efficient:8;
> > +            unsigned int lowest:8;
> > +            unsigned int reserved:32;
>
> ... this without a name? Hardware interfaces like this one are, in my
> understanding, one of the main applications of unnamed bitfields.

Thanks - I didn't know you could have unnamed bitfields.

> > +        } hw;
> > +    };
> > +    union hwp_request curr_req;
> > +    int ret;
> > +    uint16_t activity_window;
> > +    uint8_t minimum;
> > +    uint8_t maximum;
> > +    uint8_t desired;
> > +    uint8_t energy_perf;
> > +};
> > +DEFINE_PER_CPU_READ_MOSTLY(struct hwp_drv_data *, hwp_drv_data);
>
> static?

Sure

> > +static bool hwp_handle_option(const char *s, const char *end)
> > +{
> > +    int ret;
> > +
> > +    if ( strncmp(s, "verbose", 7) == 0 )
> > +    {
> > +        s += 7;
> > +        if ( *s == '=' )
> > +        {
> > +            s++;
> > +            cpufreq_verbose = !!simple_strtoul(s, NULL, 0);
> > +
> > +            return true;
> > +        }
> > +
> > +        if ( end == NULL ||
> > +             (end == s && (*end == '\0' || *end == ',')) )
> > +        {
> > +            cpufreq_verbose = true;
> > +
> > +            return true;
> > +        }
> > +
> > +        return false;
> > +    }
>
> Would be nice if the handling of "verbose" didn't need duplicating here.
> However, if that's unavoidable, did you consider handling this as a
> proper boolean instead of the custom way cpufreq_handle_common_option()
> also uses?

It seemed more complicated to try to re-use the "verbose" handling
from cpufreq_handle_common_option(), especially since minfreq and
maxfreq are also in there.

I didn't use proper boolean parsing to remain consistent with
cpufreq_handle_common_option() and the command line option
documentation.  I'm fine with switching it to a proper boolean if
that's what you want.

> > +bool __init hwp_available(void)
> > +{
> > +    unsigned int eax;
> > +
> > +    if ( !opt_cpufreq_hwp )
> > +        return false;
> > +
> > +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> > +    {
> > +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> > +                    boot_cpu_data.cpuid_level);
> > +
> > +        return false;
> > +    }
> > +
> > +    if ( boot_cpu_data.cpuid_level < 0x16 )
> > +    {
> > +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
> > +                 boot_cpu_data.cpuid_level);
> > +
> > +        return false;
> > +    }
> > +
> > +    eax = cpuid_eax(CPUID_PM_LEAF);
> > +
> > +    hwp_verbose("%d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
> > +                !!(eax & CPUID6_EAX_HWP),
> > +                !!(eax & CPUID6_EAX_HWP_NOTIFICATION),
> > +                !!(eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW),
> > +                !!(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE),
> > +                !!(eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST),
> > +                !!(eax & CPUID6_EAX_HWP_PECI));
> > +
> > +    if ( !(eax & CPUID6_EAX_HWP) )
> > +        return false;
> > +
> > +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) )
> > +    {
> > +        hwp_verbose("disabled: No energy/performance preference available");
> > +
> > +        return false;
> > +    }
> > +
> > +    feature_hwp                 = eax & CPUID6_EAX_HWP;
> > +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
> > +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> > +    feature_hdc = eax & CPUID6_EAX_HDC;
>
> Would you mind matching this line with the earlier three, padding-wise?

Sure.

> > +    hwp_verbose("Hardware Duty Cycling (HDC) %ssupported%s\n",
> > +                feature_hdc ? "" : "not ",
> > +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
> > +                            : "");
> > +
> > +    hwp_verbose("HW_FEEDBACK %ssupported\n",
> > +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
> > +
> > +    cpufreq_governor_internal = feature_hwp;
>
> What use is feature_hwp? Already its setting is a little odd (you could
> use true there as much as you could here and for the return value below,
> because of the earlier CPUID6_EAX_HWP check), and then I haven't been
> able to find any further use of the variable.

Yes, feature_hwp is no longer useful and can be removed.  It actually
gets removed in the next patch, but this can be simplified.

> > +    if ( feature_hwp )
> > +        hwp_info("Using HWP for cpufreq\n");
> > +
> > +    return feature_hwp;
> > +}
> > +
> > +static int hdc_set_pkg_hdc_ctl(unsigned int cpu, bool val)
> > +{
> > +    uint64_t msr;
> > +
> > +    if ( rdmsr_safe(MSR_PKG_HDC_CTL, msr) )
> > +    {
> > +        hwp_err("rdmsr_safe(MSR_PKG_HDC_CTL)\n", cpu);
> > +
> > +        return -1;
>
> Nit: While blank lines often help, and we even demand them ahead of a
> function's main return statement, here and ...
>
> > +    }
> > +
> > +    if ( val )
> > +        msr |= PKG_HDC_CTL_HDC_PKG_ENABLE;
> > +    else
> > +        msr &= ~PKG_HDC_CTL_HDC_PKG_ENABLE;
> > +
> > +    if ( wrmsr_safe(MSR_PKG_HDC_CTL, msr) )
> > +    {
> > +        hwp_err("wrmsr_safe(MSR_PKG_HDC_CTL): %016lx\n", cpu, msr);
> > +
> > +        return -1;
>
> ... here (and then again below) I think they do more harm than good.

Ok, I'll remove them.

> > +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
> > +{
> > +    uint32_t base_khz, max_khz, bus_khz, edx;
> > +
> > +    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
> > +
> > +    policy->cpuinfo.perf_freq = base_khz * 1000;
> > +    policy->cpuinfo.min_freq = base_khz * 1000;
> > +    policy->cpuinfo.max_freq = max_khz * 1000;
> > +    policy->min = base_khz * 1000;
> > +    policy->max = max_khz * 1000;
>
> What are the consequences (for the driver) when any of the values read
> is zero? While I haven't checked in combination with HWP, I know that
> CPUs may populate only some of the fields.

Interesting - I didn't know CPUs may populate only some of the fields.

These values are not used by the hwp driver itself.  They are
populated mainly to provide back to userspace.  The one exception is
perf_freq - that is used by get_measured_perf() to calculate
aperf/mperf.  If base_khz and therefore perf_freq were 0, then
get_measured_perf() would just return 0 for aperf/mperf.

> > +static void cf_check hwp_init_msrs(void *info)
> > +{
> > +    struct cpufreq_policy *policy = info;
> > +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> > +    uint64_t val;
> > +
> > +    /*
> > +     * Package level MSR, but we don't have a good idea of packages here, so
> > +     * just do it everytime.
> > +     */
> > +    if ( rdmsr_safe(MSR_PM_ENABLE, val) )
> > +    {
> > +        hwp_err("rdmsr_safe(MSR_PM_ENABLE)\n", policy->cpu);
> > +        data->curr_req.raw = -1;
> > +        return;
> > +    }
> > +
> > +    /* Ensure we don't generate interrupts */
> > +    if ( feature_hwp_notification )
> > +        wrmsr_safe(MSR_HWP_INTERRUPT, 0);
> > +
> > +    hwp_verbose("CPU%u: MSR_PM_ENABLE: %016lx\n", policy->cpu, val);
> > +    if ( !(val & PM_ENABLE_HWP_ENABLE) )
> > +    {
> > +        val |= PM_ENABLE_HWP_ENABLE;
> > +        if ( wrmsr_safe(MSR_PM_ENABLE, val) )
> > +        {
> > +            hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val);
> > +            data->curr_req.raw = -1;
> > +            return;
> > +        }
> > +    }
> > +
> > +    if ( rdmsr_safe(MSR_HWP_CAPABILITIES, data->hwp_caps) )
> > +    {
> > +        hwp_err("rdmsr_safe(MSR_HWP_CAPABILITIES)\n", policy->cpu);
> > +        goto error;
> > +    }
> > +
> > +    if ( rdmsr_safe(MSR_HWP_REQUEST, data->curr_req.raw) )
> > +    {
> > +        hwp_err("rdmsr_safe(MSR_HWP_REQUEST)\n", policy->cpu);
>
> Having seen a number of hwp_err() by now, I think these are confusing:
> The format string as seen at call sites doesn't match the number of
> arguments. I see two possible solutions to this: Either you demand
> that calling functions maintain a "cpu" local variable, and you simply
> use that from the macro without passing it as argument. Or you flip
> parameters / arguments:
>
>         hwp_err(policy->cpu, "rdmsr_safe(MSR_HWP_REQUEST)\n");

I think your flipped parameter idea is good and will use that.

> > +        goto error;
> > +    }
> > +
> > +    /*
> > +     * Check for APERF/MPERF support in hardware
> > +     * also check for boost/turbo support
> > +     */
> > +    intel_feature_detect(policy);
> > +
> > +    if ( feature_hdc )
> > +    {
> > +        if ( hdc_set_pkg_hdc_ctl(policy->cpu, opt_cpufreq_hdc) )
> > +            goto error;
> > +        if ( hdc_set_pm_ctl1(policy->cpu, opt_cpufreq_hdc) )
> > +            goto error;
>
> If either of these fails the very first time through (presumably for the
> BSP), wouldn't a better course of action be to clear feature_hdc?

So if HWP is working, but the HDC part fails, just clear feature_hdc
but keep using HWP?  I don't know how likely that is to happen, but
that seems reasonable.

> > +    }
> > +
> > +    hwp_get_cpu_speeds(policy);
> > +
> > +    return;
> > +
> > + error:
> > +    data->curr_req.raw = -1;
> > +    val &= ~PM_ENABLE_HWP_ENABLE;
> > +    if ( wrmsr_safe(MSR_PM_ENABLE, val) )
> > +        hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val);
> > +
> > +    return;
> > +}
>
> I think in general we avoid "return" with no value at the end of functions.

Ok, I'll remove it.

> > +static void cf_check hwp_write_request(void *info)
> > +{
> > +    const struct cpufreq_policy *policy = info;
> > +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> > +    union hwp_request hwp_req = data->curr_req;
> > +
> > +    data->ret = 0;
> > +
> > +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
>
> Why uint64_t? Generally we try to avoid using types in sizeof() and
> alike, and instead refer to applicable variables. I.e. here:
>
>     BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw));

I used uint64_t because I thought that more clearly stated that the
union is supposed to be 64bits in size - the size of the MSR.  I'll
change as you show above.

> > +static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
> > +                                       unsigned int target_freq,
> > +                                       unsigned int relation)
> > +{
> > +    unsigned int cpu = policy->cpu;
> > +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> > +    /* Zero everything to ensure reserved bits are zero... */
> > +    union hwp_request hwp_req = { .raw = 0 };
> > +
> > +    /* .. and update from there */
> > +    hwp_req.min_perf = data->minimum;
> > +    hwp_req.max_perf = data->maximum;
> > +    hwp_req.desired = data->desired;
> > +    hwp_req.energy_perf = data->energy_perf;
> > +    if ( feature_hwp_activity_window )
> > +        hwp_req.activity_window = data->activity_window;
> > +
> > +    if ( hwp_req.raw == data->curr_req.raw )
> > +        return 0;
> > +
> > +    data->curr_req = hwp_req;
> > +
> > +    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
> > +    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
>
> Is this at risk of being too verbose when the verbose option as given?

For my client use case, it seems fine since there aren't too many
CPUs.  But I think you are correct that for a server use case it would
be too much.  Hmmm.  Do you think I should drop it or make it
ratelimited?

> > +    return data->ret;
> > +}
> > +
> > +static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > +{
> > +    unsigned int cpu = policy->cpu;
> > +    struct hwp_drv_data *data;
> > +
> > +    data = xzalloc(struct hwp_drv_data);
> > +    if ( !data )
> > +        return -ENOMEM;
> > +
> > +    if ( cpufreq_opt_governor && strcmp(cpufreq_opt_governor->name,
> > +                                        cpufreq_gov_hwp.name) )
>
> Nit: I think this would better be
>
>     if ( cpufreq_opt_governor &&
>          strcmp(cpufreq_opt_governor->name, cpufreq_gov_hwp.name) )

Sounds good.  Actually, with the top level cpufreq=hwp,
cpufreq_opt_governor shouldn't be set anymore.  I left it since it
would point out something being unexpected.  policy->governor is set
unilaterally, so maybe this check and message should just be dropped.
Thoughts?

> > +        printk(XENLOG_WARNING
> > +               "HWP: governor \"%s\" is incompatible with hwp. Using default \"%s\"\n",
> > +               cpufreq_opt_governor->name, cpufreq_gov_hwp.name);
> > +    policy->governor = &cpufreq_gov_hwp;
> > +
> > +    per_cpu(hwp_drv_data, cpu) = data;
> > +
> > +    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
> > +
> > +    if ( data->curr_req.raw == -1 )
> > +    {
> > +        hwp_err("Could not initialize HWP properly\n", cpu);
> > +        XFREE(per_cpu(hwp_drv_data, cpu));
>
> Is this completely safe even in the CPU online/hotplug case? It would
> seem better to me to go this way:
>
>         per_cpu(hwp_drv_data, cpu) = NULL;
>         xfree(data);

I'll do this because...

> Or even defer publishing "data" ...
>
> > +        return -ENODEV;
> > +    }
> > +
> > +    data->minimum = data->curr_req.min_perf;
> > +    data->maximum = data->curr_req.max_perf;
> > +    data->desired = data->curr_req.desired;
> > +    data->energy_perf = data->curr_req.energy_perf;
> > +    data->activity_window = data->curr_req.activity_window;
>
> ... until after it was fully populated. (But I seem to vaguely recall
> that you need to use the field somewhere in the init process.)

... hwp_init_msrs() uses the per-cpu entry to save data->curr_req, so
it can't be totally deferred.

> > +    hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);
>
> Isn't this expected (or even required) to be the same on all CPUs? If
> so, no need to log every time.

I'll print only for CPU 0.

> > +static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> > +{
> > +    XFREE(per_cpu(hwp_drv_data, policy->cpu));
>
> Same remark as above, primarily because this is also used on an error
> path.
>
> > +/*
> > + * The SDM reads like turbo should be disabled with MSR_IA32_PERF_CTL and
> > + * PERF_CTL_TURBO_DISENGAGE, but that does not seem to actually work, at least
> > + * with my HWP testing.  MSR_MISC_ENABLE and MISC_ENABLE_TURBO_DISENGAGE
> > + * is what Linux uses and seems to work.
> > + */
>
> "my testing" imo wants replacing here by saying what hardware was tested
> (not who did the testing).

Good idea.

> > +static int cf_check hwp_cpufreq_update(int cpuid, struct cpufreq_policy *policy)
> > +{
> > +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpuid);
> > +    on_selected_cpus(cpumask_of(cpuid), hwp_set_misc_turbo, policy, 1);
>
> Nit: Blank line please between declaration(s) and statement(s). Or
> alternatively drop the local variable ltogether, as it's used ...
>
> > +    return data->ret;
>
> ... just once here.

I'll drop the local variable.

> > +}
> > +
> > +static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
>
> Seeing __initconstrel here, just as a remark (not a request for you
> to do anything): I think we want to finish conversion to altcall, such
> that these can all become __initconst_cf_clobber.
>
> > --- a/xen/drivers/cpufreq/cpufreq.c
> > +++ b/xen/drivers/cpufreq/cpufreq.c
> > @@ -97,6 +97,15 @@ static int __init cf_check setup_cpufreq_option(const char *str)
> >              return cpufreq_cmdline_parse(arg + 1);
> >      }
> >
> > +    if ( choice < 0 && !cmdline_strcmp(str, "hwp") )
> > +    {
> > +        xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> > +        cpufreq_controller = FREQCTL_xen;
> > +        opt_cpufreq_hwp = true;
> > +        if ( *arg && *(arg + 1) )
>
> A pretty unusual way of writing arg[1].

I just copied code above and out of context.  I'll change it to arg[1].

> > --- a/xen/include/acpi/cpufreq/cpufreq.h
> > +++ b/xen/include/acpi/cpufreq/cpufreq.h
> > @@ -246,4 +246,7 @@ void cpufreq_dbs_timer_resume(void);
> >
> >  void intel_feature_detect(struct cpufreq_policy *policy);
> >
> > +extern bool __initdata opt_cpufreq_hwp;
>
> No __initdata or alike on declarations please. Section placement
> attributes (among others) only belong on the definition.

Ok.

> > --- a/xen/include/acpi/cpufreq/processor_perf.h
> > +++ b/xen/include/acpi/cpufreq/processor_perf.h
> > @@ -7,6 +7,9 @@
> >
> >  #define XEN_PX_INIT 0x80000000
> >
> > +bool hwp_available(void);
> > +int hwp_register_driver(void);
> > +
> >  int powernow_cpufreq_init(void);
> >  unsigned int powernow_register_driver(void);
> >  unsigned int get_measured_perf(unsigned int cpu, unsigned int flag);
>
> The existing split between what lives in which header is pretty
> arbitrary from all I can tell. Is there a particular reason you can't
> keep together the four declarations you add?

No reason.  I'll move them all to cpufreq.h.  I was going to move them
to processor_perf.h, but later get/set_hwp_para() definitions need
struct cpufreq_policy.  cpufreq.h includes processor_perf.h before it
defines struct cpufreq_policy, so it's easier to just move the
declarations to cpufreq.h.

> > --- a/xen/include/public/sysctl.h
> > +++ b/xen/include/public/sysctl.h
> > @@ -296,6 +296,7 @@ struct xen_ondemand {
> >      uint32_t up_threshold;
> >  };
> >
> > +#define XEN_HWP_DRIVER "hwp"
>
> I think this wants some addition to the identifier which makes it expected
> that the expansion is a string literal. Perhaps XEN_HWP_DRIVER_NAME?

XEN_HWP_DRIVER_NAME sounds good.

> >  /*
>
> Nit: There also wants to be a blank line between #define and comment.

Ok.

Thanks, Jan.

Regards,
Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 09/15] cpufreq: Export HWP parameters to userspace as CPPC
  2023-06-19 14:24   ` Jan Beulich
@ 2023-06-20 18:41     ` Jason Andryuk
  2023-06-21  8:12       ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Andryuk @ 2023-06-20 18:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Mon, Jun 19, 2023 at 10:24 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 14.06.2023 20:02, Jason Andryuk wrote:
> > --- a/xen/arch/x86/acpi/cpufreq/hwp.c
> > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> > @@ -537,6 +537,29 @@ static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
> >      .update = hwp_cpufreq_update,
> >  };
> >
> > +int get_hwp_para(const unsigned int cpu,
> > +                 struct xen_cppc_para *cppc_para)
> > +{
> > +    const struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> > +
> > +    if ( data == NULL )
> > +        return -EINVAL;
>
> Maybe better -ENODATA in this case?

Sounds good.

> > --- a/xen/drivers/acpi/pmstat.c
> > +++ b/xen/drivers/acpi/pmstat.c
> > @@ -251,48 +251,54 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
> >      else
> >          strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
> >
> > -    if ( !(scaling_available_governors =
> > -           xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
> > -        return -ENOMEM;
> > -    if ( (ret = read_scaling_available_governors(
> > -                    scaling_available_governors,
> > -                    gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
> > +    if ( !strncasecmp(op->u.get_para.scaling_driver, XEN_HWP_DRIVER,
> > +                      CPUFREQ_NAME_LEN) )
>
> Mind me asking why you think case-insensitive compare is appropriate here?

I'll change to strncmp().  All the other string comparisons on
pmstat.c are strncasecmp, so I followed that pattern.

> > +        ret = get_hwp_para(policy->cpu, &op->u.get_para.u.cppc_para);
> > +    else
> >      {
> > +        if ( !(scaling_available_governors =
> > +               xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
> > +            return -ENOMEM;
> > +        if ( (ret = read_scaling_available_governors(
> > +                        scaling_available_governors,
> > +                        gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
>
> I realize you only re-indent this, but since you need to touch it anyway,
> may I suggest to also switch to siezof(*scaling_available_governors)?

How about dropping sizeof(*scaling_available_governors)?  This length ...

> > +        {
> > +            xfree(scaling_available_governors);
> > +            return ret;
> > +        }
> > +        ret = copy_to_guest(op->u.get_para.scaling_available_governors,
> > +                    scaling_available_governors, gov_num * CPUFREQ_NAME_LEN);
>
> Similarly here: Please adjust indentation while you touch this code.

... should match here, but this second one lacks the "* sizeof($foo)".
They are strings, so multiplying by sizeof() is unusual.

FTAOD, you want the indenting as:
        ret = copy_to_guest(op->u.get_para.scaling_available_governors,
                            scaling_available_governors,
                            gov_num * CPUFREQ_NAME_LEN);
?

> >          xfree(scaling_available_governors);
> > -        return ret;
> > -    }
> > -    ret = copy_to_guest(op->u.get_para.scaling_available_governors,
> > -                scaling_available_governors, gov_num * CPUFREQ_NAME_LEN);
> > -    xfree(scaling_available_governors);
> > -    if ( ret )
> > -        return ret;
> > +        if ( ret )
> > +            return ret;
> >
> > -    op->u.get_para.u.s.scaling_cur_freq = policy->cur;
> > -    op->u.get_para.u.s.scaling_max_freq = policy->max;
> > -    op->u.get_para.u.s.scaling_min_freq = policy->min;
> > +        op->u.get_para.u.s.scaling_cur_freq = policy->cur;
> > +        op->u.get_para.u.s.scaling_max_freq = policy->max;
> > +        op->u.get_para.u.s.scaling_min_freq = policy->min;
> >
> > -    if ( policy->governor->name[0] )
> > -        strlcpy(op->u.get_para.u.s.scaling_governor,
> > -            policy->governor->name, CPUFREQ_NAME_LEN);
> > -    else
> > -        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
> > -                CPUFREQ_NAME_LEN);
> > +        if ( policy->governor->name[0] )
> > +            strlcpy(op->u.get_para.u.s.scaling_governor,
> > +                policy->governor->name, CPUFREQ_NAME_LEN);
> > +        else
> > +            strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
> > +                    CPUFREQ_NAME_LEN);
> >
> > -    /* governor specific para */
> > -    if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
> > -                      "userspace", CPUFREQ_NAME_LEN) )
> > -    {
> > -        op->u.get_para.u.s.u.userspace.scaling_setspeed = policy->cur;
> > -    }
> > +        /* governor specific para */
> > +        if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
> > +                          "userspace", CPUFREQ_NAME_LEN) )
> > +        {
> > +            op->u.get_para.u.s.u.userspace.scaling_setspeed = policy->cur;
> > +        }
>
> Would also be nice if you could get rid of the unnecessary braces here
> at this occasion.

Sure

> > --- a/xen/include/acpi/cpufreq/cpufreq.h
> > +++ b/xen/include/acpi/cpufreq/cpufreq.h
> > @@ -248,5 +248,7 @@ void intel_feature_detect(struct cpufreq_policy *policy);
> >
> >  extern bool __initdata opt_cpufreq_hwp;
> >  int hwp_cmdline_parse(const char *s);
> > +int get_hwp_para(const unsigned int cpu,
>
> I think we generally avoid const when it's not a pointed-to type. It's
> not useful at all in a declaration.

Ok

> > --- a/xen/include/public/sysctl.h
> > +++ b/xen/include/public/sysctl.h
> > @@ -296,6 +296,61 @@ struct xen_ondemand {
> >      uint32_t up_threshold;
> >  };
> >
> > +struct xen_cppc_para {
> > +    /* OUT */
> > +    /* activity_window supported if 1 */
> > +#define XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW  (1 << 0)
>
> I think 1 isn't very helpful, looking forward. Perhaps better "set" or
> "flag set"?

"set" works for me.

> > +    uint32_t features; /* bit flags for features */
> > +    /*
> > +     * See Intel SDM: HWP Performance Range and Dynamic Capabilities
> > +     *
> > +     * These four are 0-255 hardware-provided values.  They "continuous,
> > +     * abstract unit-less, performance" values.  smaller numbers are slower
> > +     * and larger ones are faster.
> > +     */
> > +    uint32_t lowest;
> > +    uint32_t lowest_nonlinear; /* most_efficient */
>
> Why non_linear in the external interface when internally you use
> most_efficient (merely put in the comment here)?
>
> > +    uint32_t nominal; /* guaranteed */
>
> Similar question for the name choice here.

There is a naming mismatch between the HWP fields and the CPPC fields.
The commit message includes:
The HWP most_efficient is mapped to CPPC lowest_nonlinear, and guaranteed is
mapped to nominal.  CPPC has a guaranteed that is optional while nominal
is required.  ACPI spec says "If this register is not implemented, OSPM
assumes guaranteed performance is always equal to nominal performance."

So the comments were to help with the mapping.  Should I prefix the
comments like "HWP: most_efficient"?

> > +    uint32_t highest;
> > +    /*
> > +     * See Intel SDM: IA32_HWP_REQUEST MSR (Address: 774H Logical Processor
> > +     * Scope)
> > +     *
> > +     * These are all hints, and the processor may deviate outside of them.
> > +     * Values below are 0-255.
> > +     *
> > +     * minimum and maximum can be set to the above hardware values to constrain
> > +     * operation.  The full range 0-255 is accepted and will be clipped by
> > +     * hardware.
> > +     */
> > +    uint32_t minimum;
> > +    uint32_t maximum;
> > +    /*
> > +     * Set an explicit performance hint, disabling hardware selection.
> > +     * 0 lets the hardware decide.
> > +     */
> > +    uint32_t desired;
>
> "Set" kind of conflicts with all fields being marked as OUT above. I think
> the word can simply be dropped?

Sounds good.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 06/15] cpufreq: Add Hardware P-State (HWP) driver
  2023-06-20 18:14     ` Jason Andryuk
@ 2023-06-21  7:58       ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2023-06-21  7:58 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 20.06.2023 20:14, Jason Andryuk wrote:
> On Mon, Jun 19, 2023 at 7:38 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 14.06.2023 20:02, Jason Andryuk wrote:
>>> Falling back from cpufreq=hwp to cpufreq=xen is a more user-friendly
>>> choice than disabling cpufreq when HWP is not available.  Specifying
>>> cpufreq=hwp indicates the user wants cpufreq, so, if HWP isn't
>>> available, it makes sense to give them the cpufreq that can be
>>> supported.  i.e. I can't see a user only wanting cpufreq=hwp or
>>> cpufreq=none, but not cpufreq=xen.
>>
>> I continue to disagree with this, and I'd like to ask for another
>> maintainer's opinion. To me the described behavior is like getting
>> pears when having asked for apples. I nevertheless agree that
>> having said fallback as an option, but I'd see that done by giving
>> meaning to something like "cpufreq=hwp,xen" (without having checked
>> whether that doesn't have meaning already, i.e. just to give you an
>> idea).
> 
> I know you disagree with the approach.  I implemented what would be
> useful to me and Marek.  It felt counterproductive to implement a hard
> failure mode that is unsuitable for my use case.  Also there was the
> possibility you wouldn't mind when you saw how it was implemented.
> 
> Yeah, asking for an apple and getting a pear can be the wrong thing if
> your recipe calls for apples.  But getting *some* fruit can be better
> than no fruit if you are hungry.

;-)

> As implemented here, with HWP default disabled,
> no command line -> default cpufreq=xen
> cpufreq=xen -> only cpufreq=xen
> cpufreq=hwp -> try hwp and fallback to cpufreq=xen
> 
> If/when HWP is default enabled:
> no command line -> try hwp and fallback to cpufreq=xen
> cpufreq=hwp -> try hwp and fallback to cpufreq=xen
> cpufreq=xen -> cpufreq=xen

At which point the question would be why to have such an option in
the first place. Plus how to specify that fallback to "xen" is not
wanted.

> Unless some other idea comes to me, I guess I'll look at implementing
> "cpufreq=hwp,xen".

Thanks.

>>> +static bool hwp_handle_option(const char *s, const char *end)
>>> +{
>>> +    int ret;
>>> +
>>> +    if ( strncmp(s, "verbose", 7) == 0 )
>>> +    {
>>> +        s += 7;
>>> +        if ( *s == '=' )
>>> +        {
>>> +            s++;
>>> +            cpufreq_verbose = !!simple_strtoul(s, NULL, 0);
>>> +
>>> +            return true;
>>> +        }
>>> +
>>> +        if ( end == NULL ||
>>> +             (end == s && (*end == '\0' || *end == ',')) )
>>> +        {
>>> +            cpufreq_verbose = true;
>>> +
>>> +            return true;
>>> +        }
>>> +
>>> +        return false;
>>> +    }
>>
>> Would be nice if the handling of "verbose" didn't need duplicating here.
>> However, if that's unavoidable, did you consider handling this as a
>> proper boolean instead of the custom way cpufreq_handle_common_option()
>> also uses?
> 
> It seemed more complicated to try to re-use the "verbose" handling
> from cpufreq_handle_common_option(), especially since minfreq and
> maxfreq are also in there.
> 
> I didn't use proper boolean parsing to remain consistent with
> cpufreq_handle_common_option() and the command line option
> documentation.  I'm fine with switching it to a proper boolean if
> that's what you want.

It would be nice if you could do that here, but I won't insist.

>>> +        goto error;
>>> +    }
>>> +
>>> +    /*
>>> +     * Check for APERF/MPERF support in hardware
>>> +     * also check for boost/turbo support
>>> +     */
>>> +    intel_feature_detect(policy);
>>> +
>>> +    if ( feature_hdc )
>>> +    {
>>> +        if ( hdc_set_pkg_hdc_ctl(policy->cpu, opt_cpufreq_hdc) )
>>> +            goto error;
>>> +        if ( hdc_set_pm_ctl1(policy->cpu, opt_cpufreq_hdc) )
>>> +            goto error;
>>
>> If either of these fails the very first time through (presumably for the
>> BSP), wouldn't a better course of action be to clear feature_hdc?
> 
> So if HWP is working, but the HDC part fails, just clear feature_hdc
> but keep using HWP?  I don't know how likely that is to happen, but
> that seems reasonable.

Just to answer the question - yes, that's what I was thinking of. Maybe
accompanied by a log message.

>>> +static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
>>> +                                       unsigned int target_freq,
>>> +                                       unsigned int relation)
>>> +{
>>> +    unsigned int cpu = policy->cpu;
>>> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
>>> +    /* Zero everything to ensure reserved bits are zero... */
>>> +    union hwp_request hwp_req = { .raw = 0 };
>>> +
>>> +    /* .. and update from there */
>>> +    hwp_req.min_perf = data->minimum;
>>> +    hwp_req.max_perf = data->maximum;
>>> +    hwp_req.desired = data->desired;
>>> +    hwp_req.energy_perf = data->energy_perf;
>>> +    if ( feature_hwp_activity_window )
>>> +        hwp_req.activity_window = data->activity_window;
>>> +
>>> +    if ( hwp_req.raw == data->curr_req.raw )
>>> +        return 0;
>>> +
>>> +    data->curr_req = hwp_req;
>>> +
>>> +    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
>>> +    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
>>
>> Is this at risk of being too verbose when the verbose option as given?
> 
> For my client use case, it seems fine since there aren't too many
> CPUs.  But I think you are correct that for a server use case it would
> be too much.  Hmmm.  Do you think I should drop it or make it
> ratelimited?

I think it may have been useful for you during development, but I'd
rather see it dropped. Anyone needing to really fiddle with the
driver can add back whatever logging they deem necessary for what
they do.

>>> +    return data->ret;
>>> +}
>>> +
>>> +static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
>>> +{
>>> +    unsigned int cpu = policy->cpu;
>>> +    struct hwp_drv_data *data;
>>> +
>>> +    data = xzalloc(struct hwp_drv_data);
>>> +    if ( !data )
>>> +        return -ENOMEM;
>>> +
>>> +    if ( cpufreq_opt_governor && strcmp(cpufreq_opt_governor->name,
>>> +                                        cpufreq_gov_hwp.name) )
>>
>> Nit: I think this would better be
>>
>>     if ( cpufreq_opt_governor &&
>>          strcmp(cpufreq_opt_governor->name, cpufreq_gov_hwp.name) )
> 
> Sounds good.  Actually, with the top level cpufreq=hwp,
> cpufreq_opt_governor shouldn't be set anymore.  I left it since it
> would point out something being unexpected.  policy->governor is set
> unilaterally, so maybe this check and message should just be dropped.
> Thoughts?

I didn't realize this could be dropped. If it can be, please do.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 09/15] cpufreq: Export HWP parameters to userspace as CPPC
  2023-06-20 18:41     ` Jason Andryuk
@ 2023-06-21  8:12       ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2023-06-21  8:12 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 20.06.2023 20:41, Jason Andryuk wrote:
> On Mon, Jun 19, 2023 at 10:24 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 14.06.2023 20:02, Jason Andryuk wrote:
>>> +    else
>>>      {
>>> +        if ( !(scaling_available_governors =
>>> +               xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
>>> +            return -ENOMEM;
>>> +        if ( (ret = read_scaling_available_governors(
>>> +                        scaling_available_governors,
>>> +                        gov_num * CPUFREQ_NAME_LEN * sizeof(char))) )
>>
>> I realize you only re-indent this, but since you need to touch it anyway,
>> may I suggest to also switch to siezof(*scaling_available_governors)?
> 
> How about dropping sizeof(*scaling_available_governors)?  This length ...
> 
>>> +        {
>>> +            xfree(scaling_available_governors);
>>> +            return ret;
>>> +        }
>>> +        ret = copy_to_guest(op->u.get_para.scaling_available_governors,
>>> +                    scaling_available_governors, gov_num * CPUFREQ_NAME_LEN);
>>
>> Similarly here: Please adjust indentation while you touch this code.
> 
> ... should match here, but this second one lacks the "* sizeof($foo)".

Indeed it does, because that multiplication happens inside copy_to_guest()
(really copy_to_guest_offset()).

> They are strings, so multiplying by sizeof() is unusual.

Kind of, but in code which may want switching to Unicode (and not just
UTF-8,but e.g. UCS2 or UCS4) at some point it's a good idea to have such
right away. We don't mean to do any such switch, but I think it's good
practice to not assume that strings / string literals only ever consist
of plain char elements.

> FTAOD, you want the indenting as:
>         ret = copy_to_guest(op->u.get_para.scaling_available_governors,
>                             scaling_available_governors,
>                             gov_num * CPUFREQ_NAME_LEN);
> ?

That's one conforming way, yes. Another would be

        ret = copy_to_guest(
                  op->u.get_para.scaling_available_governors,
                  scaling_available_governors,
                  gov_num * CPUFREQ_NAME_LEN);

>>> --- a/xen/include/public/sysctl.h
>>> +++ b/xen/include/public/sysctl.h
>>> @@ -296,6 +296,61 @@ struct xen_ondemand {
>>>      uint32_t up_threshold;
>>>  };
>>>
>>> +struct xen_cppc_para {
>>> +    /* OUT */
>>> +    /* activity_window supported if 1 */
>>> +#define XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW  (1 << 0)
>>
>> I think 1 isn't very helpful, looking forward. Perhaps better "set" or
>> "flag set"?
> 
> "set" works for me.
> 
>>> +    uint32_t features; /* bit flags for features */
>>> +    /*
>>> +     * See Intel SDM: HWP Performance Range and Dynamic Capabilities
>>> +     *
>>> +     * These four are 0-255 hardware-provided values.  They "continuous,
>>> +     * abstract unit-less, performance" values.  smaller numbers are slower
>>> +     * and larger ones are faster.
>>> +     */
>>> +    uint32_t lowest;
>>> +    uint32_t lowest_nonlinear; /* most_efficient */
>>
>> Why non_linear in the external interface when internally you use
>> most_efficient (merely put in the comment here)?
>>
>>> +    uint32_t nominal; /* guaranteed */
>>
>> Similar question for the name choice here.
> 
> There is a naming mismatch between the HWP fields and the CPPC fields.
> The commit message includes:
> The HWP most_efficient is mapped to CPPC lowest_nonlinear, and guaranteed is
> mapped to nominal.  CPPC has a guaranteed that is optional while nominal
> is required.  ACPI spec says "If this register is not implemented, OSPM
> assumes guaranteed performance is always equal to nominal performance."
> 
> So the comments were to help with the mapping.  Should I prefix the
> comments like "HWP: most_efficient"?

Yes please. (I was going to suggest exactly that when I hadn't read this
question, yet.)

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
       [not found]     ` <CAKf6xputOYsrr5u+8rKZtbBuzL2GhaW_5c77VCmZ5fne_hZVyw@mail.gmail.com>
@ 2023-06-21 15:27       ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2023-06-21 15:27 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: xen-devel

(re-adding xen-devel@)

On 21.06.2023 16:16, Jason Andryuk wrote:
> On Mon, Jun 19, 2023 at 10:47 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 14.06.2023 20:02, Jason Andryuk wrote:
>>> +    if ( !(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW) &&
>>> +         set_cppc->activity_window )
>>> +        return -EINVAL;
> 
> There were a few aspects intended to be checked, but I have failed to
> implement them all properly and consistently.  The 32bit fields of the
> CPPC interface are larger than the 8 bit HWP fields (10 bits for
> activity window).  So the first aspect was supposed to ensure all
> those out-of-range bits are 0.
> 
> The second aspect, which wasn't implemented properly, was that fields
> would be 0 unless the corresponding bit was set in set_params.
> 
> The third aspect was to fail if a field was specified but hardware
> support isn't available.  That is now only activity window.
> 
> Do aspects #1 and #2 sound appropriate?  We can discuss #3 below.

Personally I'd prefer if inapplicable fields weren't checked, unless we
expect re-use of those fields with a different way of indicating that
the field holds an applicable value. But my primary desire is for
checking to be as consistent as possible.

>> Feels like I have wondered before what good this check does. I'm
>> inclined to suggest to ...
> 
> This check was supposed to enforce #2.
> 
>>> +    if ( set_cppc->activity_window & ~XEN_SYSCTL_CPPC_ACT_WINDOW_MASK )
>>> +        return -EINVAL;
>>
>> ... fold the two relevant checks, omitting the middle one:
>>
>>     if ( (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW) &&
>>          (!feature_hwp_activity_window ||
>>           (set_cppc->activity_window & ~XEN_SYSCTL_CPPC_ACT_WINDOW_MASK))
>>         return -EINVAL;
>>
>> Yet I'm also a little worried about the feature check, requiring the
>> caller to first figure out whether that feature is available. Would
>> it be an alternative to make such "best effort", preferably with
>> some indication that this aspect of the request was not carried out?
> 
> Yes, it would be nice to try and apply on a "best effort" basis as
> it's only activity window which may not be supported.
> 
> The SDM says, "Processors may support a subset of IA32_HWP_REQUEST
> fields as indicated by CPUID. Reads of non-supported fields will
> return 0. Writes to non-supported fields are ignored."
> 
> I'll have to test this, but potentially we just let the writes go
> through?  If the user checks xenpm, they will see that the activity
> window isn't supported?  Hmmm, I don't have a machine without activity
> window support, so I can't test it.  Skylake introduced HWP, but my
> skylake test system supports activity window.
> 
> Or do you want to make xen_set_cppc_para have an in/out and return the
> applied features?

Yes, that was what I meant with "indication of some sort". You could
e.g. simply clear the respective control bit in the request (and then
arrange for it to be copied back).

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2023-06-21 15:28 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-14 18:02 [PATCH v4 00/15] Intel Hardware P-States (HWP) support Jason Andryuk
2023-06-14 18:02 ` [PATCH v4 01/15] cpufreq: Allow restricting to internal governors only Jason Andryuk
2023-06-15 13:21   ` Jan Beulich
2023-06-15 14:04     ` Jason Andryuk
2023-06-15 14:23       ` Jan Beulich
2023-06-14 18:02 ` [PATCH v4 02/15] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
2023-06-14 18:02 ` [PATCH v4 03/15] cpufreq: Export intel_feature_detect Jason Andryuk
2023-06-14 18:02 ` [PATCH v4 04/15] xen/sysctl: Nest cpufreq scaling options Jason Andryuk
2023-06-15 13:29   ` Jan Beulich
2023-06-15 14:07     ` Jason Andryuk
2023-06-15 14:28       ` Jan Beulich
2023-06-15 17:56         ` Jason Andryuk
2023-06-14 18:02 ` [PATCH v4 05/15] pmstat&xenpm: Re-arrage for cpufreq union Jason Andryuk
2023-06-15 14:38   ` Jan Beulich
2023-06-15 15:05     ` Jason Andryuk
2023-06-15 15:22       ` Jan Beulich
2023-06-15 15:45         ` Jason Andryuk
2023-06-14 18:02 ` [PATCH v4 06/15] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
2023-06-19 11:38   ` Jan Beulich
2023-06-20 18:14     ` Jason Andryuk
2023-06-21  7:58       ` Jan Beulich
2023-06-14 18:02 ` [PATCH v4 07/15] xen/x86: Tweak PDC bits when using HWP Jason Andryuk
2023-06-14 18:02 ` [PATCH v4 08/15] xenpm: Change get-cpufreq-para output for hwp Jason Andryuk
2023-06-19 11:50   ` Jan Beulich
2023-06-14 18:02 ` [PATCH v4 09/15] cpufreq: Export HWP parameters to userspace as CPPC Jason Andryuk
2023-06-19 14:24   ` Jan Beulich
2023-06-20 18:41     ` Jason Andryuk
2023-06-21  8:12       ` Jan Beulich
2023-06-14 18:02 ` [PATCH v4 10/15] libxc: Include cppc_para in definitions Jason Andryuk
2023-06-14 18:02 ` [PATCH v4 11/15] xenpm: Print HWP/CPPC parameters Jason Andryuk
2023-06-19 14:29   ` Jan Beulich
2023-06-14 18:02 ` [PATCH v4 12/15] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
     [not found]   ` <0b53687e-e781-7c01-34e9-e41cd14967c7@suse.com>
     [not found]     ` <CAKf6xputOYsrr5u+8rKZtbBuzL2GhaW_5c77VCmZ5fne_hZVyw@mail.gmail.com>
2023-06-21 15:27       ` Jan Beulich
2023-06-14 18:02 ` [PATCH v4 13/15] libxc: Add xc_set_cpufreq_cppc Jason Andryuk
2023-06-14 18:02 ` [PATCH v4 14/15] xenpm: Add set-cpufreq-cppc subcommand Jason Andryuk
2023-06-19 14:54   ` Jan Beulich
2023-06-14 18:02 ` [PATCH v4 15/15] CHANGELOG: Add Intel HWP entry Jason Andryuk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.