* [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support
@ 2023-05-01 19:30 Jason Andryuk
  2023-05-01 19:30 ` [PATCH v3 01/14 RESEND] cpufreq: Allow restricting to internal governors only Jason Andryuk
                   ` (13 more replies)
  0 siblings, 14 replies; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Anthony PERARD, Juergen Gross, Henry Wang, Community Manager

Hi,

Resending since I messed up the To: line.  Sorry about that.

This patch series adds Hardware-Controlled Performance States (HWP) for
Intel processors to Xen.

v2 was only partially reviewed, so v3 is mostly a reposting of v2.  In v2 &
v3, I think I addressed all comments for v1.  I kept patch 11 "xenpm:
Factor out a non-fatal cpuid_parse variant", with a v2 comment
explaining why I keep it.

v3 adds "xen/x86: Tweak PDC bits when using HWP".  Qubes testing revealed
an issue where enabling HWP can crash firmware code (maybe SMM).  This
requires a Linux change to get the PDC bits from Xen and pass them to
ACPI.  Roger has a patch [0] to set the PDC bits.  Roger's 3 patch
series was tested with "xen/x86: Tweak PDC bits when using HWP" on
affected hardware and allowed proper operation.

Previous cover letter:

With HWP, the processor makes its own determinations for frequency
selection, though users can set some parameters and preferences.  There
is also Turbo Boost which dynamically pushes the max frequency if
possible.

The existing governors don't work with HWP since they select frequencies
and HWP doesn't expose those.  Therefore a dummy hwp-internal governor is
used that doesn't do anything.

xenpm get-cpufreq-para is extended to show HWP parameters, and
set-cpufreq-hwp is added to set them.

A lightly loaded OpenXT laptop showed ~1W power savings according to
powertop.  A mostly idle Fedora system (dom0 only) showed a more modest
power savings.

This is for a 10th gen 6-core CPU with a 1600 MHz base and 4900 MHz max
frequency.  In the default balance mode, Turbo Boost doesn't exceed 4 GHz.
After tweaking the energy_perf preference with
`xenpm set-cpufreq-hwp balance ene:64`, I've seen the CPU hit 4.7 GHz
before throttling down and bouncing around between 4.3 and 4.5 GHz.
Curiously, the other cores read ~4 GHz when Turbo Boost takes effect.  This
was done after pinning all dom0 cores, using taskset to pin to vCPU/pCPU 11,
and running a bash tight loop.

HWP defaults to disabled.  When enabled, it runs with the existing HWP
configuration - Xen doesn't reconfigure it by default.  It can be enabled
with cpufreq=xen:hwp.

Hardware Duty Cycling (HDC) is another feature to autonomously power down
things.  It defaults to enabled when HWP is enabled, but HDC can be
disabled on the command line with cpufreq=xen:hwp,no-hdc.

I've only tested on 8th gen and 10th gen systems with activity window
and energy_perf support.  So the paths for CPUs lacking those features
are untested.

Fast MSR support was removed in v2.  The model specific checking was not
done properly, and I don't have hardware to test with.  Since writes are
expected to be infrequent, I just removed the code.

This changes the sysctl_pm_op hypercall, so that wants review.

Regards,
Jason

[0] https://lore.kernel.org/xen-devel/20221121102113.41893-3-roger.pau@citrix.com/

Jason Andryuk (14):
  cpufreq: Allow restricting to internal governors only
  cpufreq: Add perf_freq to cpuinfo
  cpufreq: Export intel_feature_detect
  cpufreq: Add Hardware P-State (HWP) driver
  xenpm: Change get-cpufreq-para output for internal
  xen/x86: Tweak PDC bits when using HWP
  cpufreq: Export HWP parameters to userspace
  libxc: Include hwp_para in definitions
  xenpm: Print HWP parameters
  xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  libxc: Add xc_set_cpufreq_hwp
  xenpm: Factor out a non-fatal cpuid_parse variant
  xenpm: Add set-cpufreq-hwp subcommand
  CHANGELOG: Add Intel HWP entry

 CHANGELOG.md                              |   1 +
 docs/misc/xen-command-line.pandoc         |   8 +-
 tools/include/xenctrl.h                   |   6 +
 tools/libs/ctrl/xc_pm.c                   |  18 +
 tools/misc/xenpm.c                        | 355 +++++++++++-
 xen/arch/x86/acpi/cpufreq/Makefile        |   1 +
 xen/arch/x86/acpi/cpufreq/cpufreq.c       |  15 +-
 xen/arch/x86/acpi/cpufreq/hwp.c           | 633 ++++++++++++++++++++++
 xen/arch/x86/acpi/lib.c                   |   5 +
 xen/arch/x86/cpu/mcheck/mce_intel.c       |   6 +
 xen/arch/x86/include/asm/cpufeature.h     |  13 +-
 xen/arch/x86/include/asm/msr-index.h      |  14 +
 xen/drivers/acpi/pmstat.c                 |  23 +
 xen/drivers/cpufreq/cpufreq.c             |  40 ++
 xen/drivers/cpufreq/utility.c             |   1 +
 xen/include/acpi/cpufreq/cpufreq.h        |  14 +
 xen/include/acpi/cpufreq/processor_perf.h |   4 +
 xen/include/acpi/pdc_intel.h              |   1 +
 xen/include/public/sysctl.h               |  57 ++
 19 files changed, 1187 insertions(+), 28 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c

-- 
2.40.0




* [PATCH v3 01/14 RESEND] cpufreq: Allow restricting to internal governors only
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-01 19:30 ` [PATCH v3 02/14 RESEND] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC
  To: xen-devel; +Cc: Jason Andryuk, Jan Beulich

For hwp, the standard governors are not usable, and only the internal
one is applicable.  Add the cpufreq_governor_internal boolean to
indicate when an internal governor, like hwp-internal, will be used.
This is set during presmp_initcall, so that it can suppress governor
registration during initcall.  Only a governor with a name containing
"-internal" will be allowed in that case.

This way, the unusable governors are not registered, so the internal
one is the only one returned to userspace and incompatible governors
are never advertised.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v3:
Switch to initdata
Add Jan Acked-by
Commit message s/they/the/ typo
Don't register hwp-internal when running non-hwp - Marek

v2:
Switch to "-internal"
Add blank line in header
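
The name-based filter from the commit message can be read as a single
predicate.  What follows is only a standalone sketch of that rule, not the
code in the patch: governor_name_allowed() and its internal_only parameter
are hypothetical names standing in for the check in
cpufreq_register_governor() and the cpufreq_governor_internal flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the check added to cpufreq_register_governor().
 * internal_only plays the role of cpufreq_governor_internal.
 * Returns true when a governor with this name may be registered.
 */
static bool governor_name_allowed(const char *name, bool internal_only)
{
    bool is_internal;

    assert(name != NULL);

    /* A governor counts as internal if its name contains "-internal". */
    is_internal = strstr(name, "-internal") != NULL;

    /*
     * Internal mode accepts only internal governors; normal mode accepts
     * only the regular ones.  Both cases reduce to an equality test.
     */
    return is_internal == internal_only;
}
```

With internal_only set, "hwp-internal" passes and "ondemand" is rejected;
with it clear, the result is reversed.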
---
 xen/drivers/cpufreq/cpufreq.c      | 8 ++++++++
 xen/include/acpi/cpufreq/cpufreq.h | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index 2321c7dd07..7bd81680da 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -56,6 +56,7 @@ struct cpufreq_dom {
 };
 static LIST_HEAD_READ_MOSTLY(cpufreq_dom_list_head);
 
+bool __initdata cpufreq_governor_internal;
 struct cpufreq_governor *__read_mostly cpufreq_opt_governor;
 LIST_HEAD_READ_MOSTLY(cpufreq_governor_list);
 
@@ -121,6 +122,13 @@ int __init cpufreq_register_governor(struct cpufreq_governor *governor)
     if (!governor)
         return -EINVAL;
 
+    if (cpufreq_governor_internal &&
+        strstr(governor->name, "-internal") == NULL)
+        return -EINVAL;
+
+    if (!cpufreq_governor_internal && strstr(governor->name, "-internal"))
+        return -EINVAL;
+
     if (__find_governor(governor->name) != NULL)
         return -EEXIST;
 
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 35dcf21e8f..0da32ef519 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -114,6 +114,8 @@ extern struct cpufreq_governor cpufreq_gov_userspace;
 extern struct cpufreq_governor cpufreq_gov_performance;
 extern struct cpufreq_governor cpufreq_gov_powersave;
 
+extern bool cpufreq_governor_internal;
+
 extern struct list_head cpufreq_governor_list;
 
 extern int cpufreq_register_governor(struct cpufreq_governor *governor);
-- 
2.40.0




* [PATCH v3 02/14 RESEND] cpufreq: Add perf_freq to cpuinfo
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
  2023-05-01 19:30 ` [PATCH v3 01/14 RESEND] cpufreq: Allow restricting to internal governors only Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-01 19:30 ` [PATCH v3 03/14 RESEND] cpufreq: Export intel_feature_detect Jason Andryuk
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

acpi-cpufreq scales the aperf/mperf measurements by max_freq, but HWP
needs to scale by base frequency.  Setting max_freq to base_freq
"works" but the code is not obvious, and returning values to userspace
is tricky.  Add an additional perf_freq member which is used for scaling
aperf/mperf measurements.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v3:
Add Jan's Ack

I don't like this, but it seems the best way to re-use the common
aperf/mperf code.  The other option would be to add wrappers that then
do the acpi vs. hwp scaling.
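
To illustrate why the reference frequency matters, here is a simplified,
standalone sketch of the scaling step.  It is not the body of
get_measured_perf(): measured_freq_khz() and its parameters are
hypothetical names, the deltas are assumed to be counter differences over
one sampling interval, and frequencies are assumed to be in kHz.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: scale a reference frequency by the APERF/MPERF
 * ratio.  perf_freq_khz corresponds to cpuinfo.perf_freq, which is
 * max_freq for acpi-cpufreq and the base frequency for HWP.
 */
static unsigned int measured_freq_khz(uint64_t aperf_delta,
                                      uint64_t mperf_delta,
                                      unsigned int perf_freq_khz)
{
    uint64_t perf_percent;

    /* Keep the multiplication by 100 below from overflowing. */
    assert(aperf_delta <= UINT64_MAX / 100);

    /* No reference cycles elapsed: report 0 rather than divide by zero. */
    if ( mperf_delta == 0 )
        return 0;

    perf_percent = aperf_delta * 100 / mperf_delta;

    return (unsigned int)(perf_freq_khz * perf_percent / 100);
}
```

With a 1600 MHz base (1600000 kHz) and an APERF/MPERF ratio of 2.5, this
yields 4000000 kHz.  Scaling the same ratio by the turbo maximum instead
would overstate the result, which is why HWP needs a separate perf_freq.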
---
 xen/arch/x86/acpi/cpufreq/cpufreq.c | 2 +-
 xen/drivers/cpufreq/utility.c       | 1 +
 xen/include/acpi/cpufreq/cpufreq.h  | 3 +++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index 2e0067fbe5..6c70d04395 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -316,7 +316,7 @@ unsigned int get_measured_perf(unsigned int cpu, unsigned int flag)
     else
         perf_percent = 0;
 
-    return policy->cpuinfo.max_freq * perf_percent / 100;
+    return policy->cpuinfo.perf_freq * perf_percent / 100;
 }
 
 static unsigned int cf_check get_cur_freq_on_cpu(unsigned int cpu)
diff --git a/xen/drivers/cpufreq/utility.c b/xen/drivers/cpufreq/utility.c
index 9eb7ecedcd..6831f62851 100644
--- a/xen/drivers/cpufreq/utility.c
+++ b/xen/drivers/cpufreq/utility.c
@@ -236,6 +236,7 @@ int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
 
     policy->min = policy->cpuinfo.min_freq = min_freq;
     policy->max = policy->cpuinfo.max_freq = max_freq;
+    policy->cpuinfo.perf_freq = max_freq;
     policy->cpuinfo.second_max_freq = second_max_freq;
 
     if (policy->min == ~0)
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 0da32ef519..a06aa92f62 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -37,6 +37,9 @@ extern struct acpi_cpufreq_data *cpufreq_drv_data[NR_CPUS];
 struct cpufreq_cpuinfo {
     unsigned int        max_freq;
     unsigned int        second_max_freq;    /* P1 if Turbo Mode is on */
+    unsigned int        perf_freq; /* Scaling freq for aperf/mperf.
+                                      acpi-cpufreq uses max_freq, but HWP uses
+                                      base_freq.*/
     unsigned int        min_freq;
     unsigned int        transition_latency; /* in 10^(-9) s = nanoseconds */
 };
-- 
2.40.0




* [PATCH v3 03/14 RESEND] cpufreq: Export intel_feature_detect
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
  2023-05-01 19:30 ` [PATCH v3 01/14 RESEND] cpufreq: Allow restricting to internal governors only Jason Andryuk
  2023-05-01 19:30 ` [PATCH v3 02/14 RESEND] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-04 11:16   ` Jan Beulich
  2023-05-01 19:30 ` [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

Export feature_detect as intel_feature_detect so it can be re-used by
HWP.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v3:
Remove void * cast when calling intel_feature_detect

v2:
export intel_feature_detect with typed pointer
Move intel_feature_detect to acpi/cpufreq/cpufreq.h since the
declaration now contains struct cpufreq_policy *.
---
 xen/arch/x86/acpi/cpufreq/cpufreq.c | 8 ++++++--
 xen/include/acpi/cpufreq/cpufreq.h  | 2 ++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index 6c70d04395..f1cc473b4f 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -339,9 +339,8 @@ static unsigned int cf_check get_cur_freq_on_cpu(unsigned int cpu)
     return extract_freq(get_cur_val(cpumask_of(cpu)), data);
 }
 
-static void cf_check feature_detect(void *info)
+void intel_feature_detect(struct cpufreq_policy *policy)
 {
-    struct cpufreq_policy *policy = info;
     unsigned int eax;
 
     eax = cpuid_eax(6);
@@ -353,6 +352,11 @@ static void cf_check feature_detect(void *info)
     }
 }
 
+static void cf_check feature_detect(void *info)
+{
+    intel_feature_detect(info);
+}
+
 static unsigned int check_freqs(const cpumask_t *mask, unsigned int freq,
                                 struct acpi_cpufreq_data *data)
 {
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index a06aa92f62..0f334d2a43 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -243,4 +243,6 @@ int write_userspace_scaling_setspeed(unsigned int cpu, unsigned int freq);
 void cpufreq_dbs_timer_suspend(void);
 void cpufreq_dbs_timer_resume(void);
 
+void intel_feature_detect(struct cpufreq_policy *policy);
+
 #endif /* __XEN_CPUFREQ_PM_H__ */
-- 
2.40.0




* [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (2 preceding siblings ...)
  2023-05-01 19:30 ` [PATCH v3 03/14 RESEND] cpufreq: Export intel_feature_detect Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-04 13:11   ` Jan Beulich
  2023-05-01 19:30 ` [PATCH v3 05/14 RESEND] xenpm: Change get-cpufreq-para output for internal Jason Andryuk
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC
  To: xen-devel
  Cc: Jason Andryuk, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné

From the Intel SDM: "Hardware-Controlled Performance States (HWP), which
autonomously selects performance states while utilizing OS supplied
performance guidance hints."

Enable HWP to run in autonomous mode by poking the correct MSRs.
cpufreq=xen:hwp enables and cpufreq=xen:hwp=0 disables.  The same for
hdc.

There is no configuration interface yet - xen_sysctl_pm_op/xenpm will
be extended in subsequent patches.  Until then HWP runs with the default
values, which should be an energy/performance preference of 0x80 (out
of 0x0-0xff).

Unscientific powertop measurement of a mostly idle, customized OpenXT
install:
A 10th gen 6-core laptop showed battery discharge drop from ~9.x to
~7.x watts.
An 8th gen 4-core laptop dropped from ~10 to ~9 watts.

Power usage depends on many factors, especially display brightness, but
this does show a power saving in balanced mode when CPU utilization is
low.

HWP isn't compatible with an external governor - it doesn't take
explicit frequency requests.  Therefore a minimal internal governor,
hwp-internal, is also added as a placeholder.

While adding to the xen-command-line.pandoc entry, un-nest verbose from
minfreq.  They are independent.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

---

We disable on cpuid_level < 0x16.  cpuid(0x16) is used to get the cpu
frequencies for calculating the APERF/MPERF.  Without it, things would
still work, but the average CPU frequency output would be wrong.

My 8th & 10th gen test systems both report:
(XEN) HWP: 1 notify: 1 act_window: 1 energy_perf: 1 pkg_level: 0 peci: 0
(XEN) HWP: Hardware Duty Cycling (HDC) supported
(XEN) HWP: HW_FEEDBACK not supported

IA32_ENERGY_PERF_BIAS has not been tested.

For cpufreq=xen:hwp, placing the option inside the governor wouldn't
work.  Users would have to select the hwp-internal governor to turn off
hwp support.  hwp-internal isn't usable without hwp, and users wouldn't
be able to select a different governor.  That doesn't matter while hwp
defaults off, but it would if or when hwp defaults to enabled.

We can't use parse_boolean() since it requires a single name=val string
and cpufreq_handle_common_option is provided two strings.  Use
parse_bool() and manually handle no-hwp.
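
As a rough illustration of handling an option that arrives as separate
name and value strings, the sketch below accepts "<flag>", "<flag>=<bool>"
and "no-<flag>".  It is a standalone example, not the parsing code in this
series: parse_bool_value() and match_bool_option() are hypothetical
helpers, and the accepted value spellings are an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: 1 for true, 0 for false, -1 for an unrecognised value. */
static int parse_bool_value(const char *val)
{
    /* A bare flag with no value means "enable". */
    if ( val == NULL || *val == '\0' )
        return 1;

    if ( !strcmp(val, "1") || !strcmp(val, "yes") ||
         !strcmp(val, "on") || !strcmp(val, "true") )
        return 1;

    if ( !strcmp(val, "0") || !strcmp(val, "no") ||
         !strcmp(val, "off") || !strcmp(val, "false") )
        return 0;

    return -1;
}

/*
 * Hypothetical: returns 1 if name refers to flag (and sets *out),
 * 0 if it is some other option, -1 if the value is malformed.
 */
static int match_bool_option(const char *name, const char *val,
                             const char *flag, bool *out)
{
    int b;

    assert(name != NULL && flag != NULL && out != NULL);

    /* The "no-<flag>" form is handled by hand; any value is ignored. */
    if ( strncmp(name, "no-", 3) == 0 && strcmp(name + 3, flag) == 0 )
    {
        *out = false;
        return 1;
    }

    if ( strcmp(name, flag) != 0 )
        return 0;

    b = parse_bool_value(val);
    if ( b < 0 )
        return -1;

    *out = b;
    return 1;
}
```

Under these assumptions, ("hwp", NULL) and ("hwp", "1") enable the flag,
while ("hwp", "0") and ("no-hwp", NULL) disable it.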

Write to disable the interrupt - the linux pstate driver does this.  We
don't use the interrupts, so we can just turn them off.  We aren't ready
to handle them, so we don't want any.  Unclear if this is necessary.
SDM says it's default disabled.

FAST_IA32_HWP_REQUEST was removed in v2.  The check in v1 was wrong,
it's a model specific feature and the CPUID bit is only available
after enabling via the MSR.  Support was untested since I don't have
hardware with the feature.  Writes are expected to be infrequent, so
just leave it out.
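
For readers following the MSR layout, the fields of union hwp_request in
hwp.c can also be written as explicit shifts.  This is only a sketch:
hwp_request_pack() and the HWP_REQ_* names are hypothetical, and the
offsets simply mirror the bit-field order in the patch (8-bit min, max,
desired and energy_perf fields, then a 10-bit activity window).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; offsets follow the field order of union hwp_request. */
#define HWP_REQ_MIN_SHIFT           0
#define HWP_REQ_MAX_SHIFT           8
#define HWP_REQ_DESIRED_SHIFT      16
#define HWP_REQ_ENERGY_PERF_SHIFT  24
#define HWP_REQ_ACT_WINDOW_SHIFT   32
#define HWP_REQ_ACT_WINDOW_MAX     0x3ffu  /* the field is 10 bits wide */

/* Build a raw request value, leaving all reserved and valid bits zero. */
static uint64_t hwp_request_pack(uint8_t min_perf, uint8_t max_perf,
                                 uint8_t desired, uint8_t energy_perf,
                                 uint16_t activity_window)
{
    assert(activity_window <= HWP_REQ_ACT_WINDOW_MAX);

    return ((uint64_t)min_perf        << HWP_REQ_MIN_SHIFT) |
           ((uint64_t)max_perf        << HWP_REQ_MAX_SHIFT) |
           ((uint64_t)desired         << HWP_REQ_DESIRED_SHIFT) |
           ((uint64_t)energy_perf     << HWP_REQ_ENERGY_PERF_SHIFT) |
           ((uint64_t)activity_window << HWP_REQ_ACT_WINDOW_SHIFT);
}
```

For example, a minimum of 1, a maximum of 0xff, no desired value and the
balanced preference 0x80 pack to 0x8000ff01.  Starting from zero matches
the intent in hwp_cpufreq_target() of keeping the reserved bits clear.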

---
v2:
Alphabetize headers
Re-work driver registration
name hwp_drv_data anonymous union "hw"
Drop hwp_verbose_cont
style cleanups
Condense hwp_governor switch
hwp_cpufreq_target remove .raw from hwp_req assignment
Use typed-pointer in a few functions
Pass type to xzalloc
Add HWP_ENERGY_PERF_BALANCE/IA32_ENERGY_BIAS_BALANCE defines
Add XEN_HWP_GOVERNOR define for "hwp-internal"
Capitalize CPUID and MSR defines
Change '_' to '-' for energy-perf & act-window
Read-modify-write MSRs updates
Use FAST_IA32_HWP_REQUEST_MSR_ENABLE define
constify pointer in hwp_set_misc_turbo
Add space after non-fallthrough break in governor switch
Add IA32_ENERGY_BIAS_MASK define
Check CPUID_PM_LEAF for energy bias when needed
Fail initialization with curr_req = -1
Fold hwp_read_capabilities into hwp_init_msrs
Add command line cpufreq=xen:hwp
Add command line cpufreq=xen:hdc
Use per_cpu for hwp_drv_data pointers
Move hwp_energy_perf_bias call into hwp_write_request
energy_perf 0 is valid, so hwp_energy_perf_bias cannot be skipped
Ensure we don't generate interrupts
Remove Fast Write of Uncore MSR
Initialize hwp_drv_data from curr_req
Use SPDX line instead of license text in hwp.c

v3:
Add cf_check to cpufreq_gov_hwp_init() - Marek
Print cpuid_level with %#x - Marek
---
 docs/misc/xen-command-line.pandoc         |   8 +-
 xen/arch/x86/acpi/cpufreq/Makefile        |   1 +
 xen/arch/x86/acpi/cpufreq/cpufreq.c       |   5 +-
 xen/arch/x86/acpi/cpufreq/hwp.c           | 506 ++++++++++++++++++++++
 xen/arch/x86/include/asm/cpufeature.h     |  13 +-
 xen/arch/x86/include/asm/msr-index.h      |  13 +
 xen/drivers/cpufreq/cpufreq.c             |  32 ++
 xen/include/acpi/cpufreq/cpufreq.h        |   3 +
 xen/include/acpi/cpufreq/processor_perf.h |   3 +
 xen/include/public/sysctl.h               |   1 +
 10 files changed, 581 insertions(+), 4 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index e0b89b7d33..aaa31f444b 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
 available support.
 
 ### cpufreq
-> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
+> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
 
 > Default: `xen`
 
@@ -510,6 +510,12 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
 * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
   respectively.
 * `verbose` option can be included as a string or also as `verbose=<integer>`
+* `<hwp>` is a boolean to enable Hardware-Controlled Performance States (HWP)
+  on supported Intel hardware.  HWP is a Skylake+ feature which provides better
+  CPU power management.  The default is disabled.
+* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
+  processor to autonomously force physical package components into idle state.
+  The default is enabled, but the option only applies when `<hwp>` is enabled.
 
 ### cpuid (x86)
 > `= List of comma separated booleans`
diff --git a/xen/arch/x86/acpi/cpufreq/Makefile b/xen/arch/x86/acpi/cpufreq/Makefile
index f75da9b9ca..db83aa6b14 100644
--- a/xen/arch/x86/acpi/cpufreq/Makefile
+++ b/xen/arch/x86/acpi/cpufreq/Makefile
@@ -1,2 +1,3 @@
 obj-y += cpufreq.o
+obj-y += hwp.o
 obj-y += powernow.o
diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index f1cc473b4f..56816b1aee 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -642,7 +642,10 @@ static int __init cf_check cpufreq_driver_init(void)
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = cpufreq_register_driver(&acpi_cpufreq_driver);
+            if ( hwp_available() )
+                ret = hwp_register_driver();
+            else
+                ret = cpufreq_register_driver(&acpi_cpufreq_driver);
             break;
 
         case X86_VENDOR_AMD:
diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
new file mode 100644
index 0000000000..57f13867d3
--- /dev/null
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -0,0 +1,506 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
+ *
+ * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
+ */
+
+#include <xen/cpumask.h>
+#include <xen/init.h>
+#include <xen/param.h>
+#include <xen/xmalloc.h>
+#include <asm/io.h>
+#include <asm/msr.h>
+#include <acpi/cpufreq/cpufreq.h>
+
+static bool feature_hwp;
+static bool feature_hwp_notification;
+static bool feature_hwp_activity_window;
+static bool feature_hwp_energy_perf;
+static bool feature_hwp_pkg_level_ctl;
+static bool feature_hwp_peci;
+
+static bool feature_hdc;
+
+__initdata bool opt_cpufreq_hwp = false;
+__initdata bool opt_cpufreq_hdc = true;
+
+#define HWP_ENERGY_PERF_BALANCE         0x80
+#define IA32_ENERGY_BIAS_BALANCE        0x7
+#define IA32_ENERGY_BIAS_MAX_POWERSAVE  0xf
+#define IA32_ENERGY_BIAS_MASK           0xf
+
+union hwp_request
+{
+    struct
+    {
+        uint64_t min_perf:8;
+        uint64_t max_perf:8;
+        uint64_t desired:8;
+        uint64_t energy_perf:8;
+        uint64_t activity_window:10;
+        uint64_t package_control:1;
+        uint64_t reserved:16;
+        uint64_t activity_window_valid:1;
+        uint64_t energy_perf_valid:1;
+        uint64_t desired_valid:1;
+        uint64_t max_perf_valid:1;
+        uint64_t min_perf_valid:1;
+    };
+    uint64_t raw;
+};
+
+struct hwp_drv_data
+{
+    union
+    {
+        uint64_t hwp_caps;
+        struct
+        {
+            uint64_t highest:8;
+            uint64_t guaranteed:8;
+            uint64_t most_efficient:8;
+            uint64_t lowest:8;
+            uint64_t reserved:32;
+        } hw;
+    };
+    union hwp_request curr_req;
+    uint16_t activity_window;
+    uint8_t minimum;
+    uint8_t maximum;
+    uint8_t desired;
+    uint8_t energy_perf;
+};
+DEFINE_PER_CPU_READ_MOSTLY(struct hwp_drv_data *, hwp_drv_data);
+
+#define hwp_err(...)     printk(XENLOG_ERR __VA_ARGS__)
+#define hwp_info(...)    printk(XENLOG_INFO __VA_ARGS__)
+#define hwp_verbose(...)                   \
+({                                         \
+    if ( cpufreq_verbose )                 \
+        printk(XENLOG_DEBUG __VA_ARGS__);  \
+})
+
+static int cf_check hwp_governor(struct cpufreq_policy *policy,
+                                 unsigned int event)
+{
+    int ret;
+
+    if ( policy == NULL )
+        return -EINVAL;
+
+    switch ( event )
+    {
+    case CPUFREQ_GOV_START:
+    case CPUFREQ_GOV_LIMITS:
+        ret = 0;
+        break;
+
+    case CPUFREQ_GOV_STOP:
+    default:
+        ret = -EINVAL;
+        break;
+    }
+
+    return ret;
+}
+
+static struct cpufreq_governor hwp_cpufreq_governor =
+{
+    .name          = XEN_HWP_GOVERNOR,
+    .governor      = hwp_governor,
+};
+
+static int __init cf_check cpufreq_gov_hwp_init(void)
+{
+    return cpufreq_register_governor(&hwp_cpufreq_governor);
+}
+__initcall(cpufreq_gov_hwp_init);
+
+bool __init hwp_available(void)
+{
+    unsigned int eax, ecx, unused;
+    bool use_hwp;
+
+    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
+    {
+        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
+                    boot_cpu_data.cpuid_level);
+        return false;
+    }
+
+    if ( boot_cpu_data.cpuid_level < 0x16 )
+    {
+        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
+                 boot_cpu_data.cpuid_level);
+        return false;
+    }
+
+    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
+
+    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
+         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
+    {
+        hwp_verbose("HWP disabled: No energy/performance preference available\n");
+        return false;
+    }
+
+    feature_hwp                 = eax & CPUID6_EAX_HWP;
+    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
+    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
+    feature_hwp_energy_perf     =
+        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
+    feature_hwp_pkg_level_ctl   = eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
+    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
+
+    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
+                feature_hwp, feature_hwp_notification,
+                feature_hwp_activity_window, feature_hwp_energy_perf,
+                feature_hwp_pkg_level_ctl, feature_hwp_peci);
+
+    if ( !feature_hwp )
+        return false;
+
+    feature_hdc = eax & CPUID6_EAX_HDC;
+
+    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
+                feature_hdc ? "" : "not ",
+                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
+                            : "");
+
+    feature_hdc = feature_hdc && opt_cpufreq_hdc;
+
+    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
+                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
+
+    use_hwp = feature_hwp && opt_cpufreq_hwp;
+    cpufreq_governor_internal = use_hwp;
+
+    if ( use_hwp )
+        hwp_info("Using HWP for cpufreq\n");
+
+    return use_hwp;
+}
+
+static void hdc_set_pkg_hdc_ctl(bool val)
+{
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
+    {
+        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
+
+        return;
+    }
+
+    if ( val )
+        msr |= IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
+    else
+        msr &= ~IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
+
+    if ( wrmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
+        hwp_err("error wrmsr_safe(MSR_IA32_PKG_HDC_CTL): %016lx\n", msr);
+}
+
+static void hdc_set_pm_ctl1(bool val)
+{
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
+    {
+        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
+
+        return;
+    }
+
+    if ( val )
+        msr |= IA32_PM_CTL1_HDC_ALLOW_BLOCK;
+    else
+        msr &= ~IA32_PM_CTL1_HDC_ALLOW_BLOCK;
+
+    if ( wrmsr_safe(MSR_IA32_PM_CTL1, msr) )
+        hwp_err("error wrmsr_safe(MSR_IA32_PM_CTL1): %016lx\n", msr);
+}
+
+static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
+{
+    uint32_t base_khz, max_khz, bus_khz, edx;
+
+    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
+
+    /* aperf/mperf scales base. */
+    policy->cpuinfo.perf_freq = base_khz * 1000;
+    policy->cpuinfo.min_freq = base_khz * 1000;
+    policy->cpuinfo.max_freq = max_khz * 1000;
+    policy->min = base_khz * 1000;
+    policy->max = max_khz * 1000;
+    policy->cur = 0;
+}
+
+static void cf_check hwp_init_msrs(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
+    uint64_t val;
+
+    /*
+     * Package level MSR, but we don't have a good idea of packages here, so
+     * just do it every time.
+     */
+    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu);
+        data->curr_req.raw = -1;
+        return;
+    }
+
+    /* Ensure we don't generate interrupts */
+    if ( feature_hwp_notification )
+        wrmsr_safe(MSR_IA32_HWP_INTERRUPT, 0);
+
+    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
+    if ( !(val & IA32_PM_ENABLE_HWP_ENABLE) )
+    {
+        val |= IA32_PM_ENABLE_HWP_ENABLE;
+        if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) )
+        {
+            hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n",
+                    policy->cpu, val);
+            data->curr_req.raw = -1;
+            return;
+        }
+    }
+
+    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
+                policy->cpu);
+        data->curr_req.raw = -1;
+        return;
+    }
+
+    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
+        data->curr_req.raw = -1;
+        return;
+    }
+
+    if ( !feature_hwp_energy_perf ) {
+        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
+        {
+            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
+            data->curr_req.raw = -1;
+
+            return;
+        }
+
+        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
+    }
+
+    /*
+     * Check for APERF/MPERF support in hardware
+     * also check for boost/turbo support
+     */
+    intel_feature_detect(policy);
+
+    if ( feature_hdc )
+    {
+        hdc_set_pkg_hdc_ctl(true);
+        hdc_set_pm_ctl1(true);
+    }
+
+    hwp_get_cpu_speeds(policy);
+}
+
+static int cf_check hwp_cpufreq_verify(struct cpufreq_policy *policy)
+{
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
+
+    if ( !feature_hwp_energy_perf && data->energy_perf )
+    {
+        if ( data->energy_perf > IA32_ENERGY_BIAS_MAX_POWERSAVE )
+        {
+            hwp_err("energy_perf %d exceeds IA32_ENERGY_PERF_BIAS range 0-15\n",
+                    data->energy_perf);
+
+            return -EINVAL;
+        }
+    }
+
+    if ( !feature_hwp_activity_window && data->activity_window )
+    {
+        hwp_err("HWP activity window not supported\n");
+
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+/* val 0 - highest performance, 15 - maximum energy savings */
+static void hwp_energy_perf_bias(const struct hwp_drv_data *data)
+{
+    uint64_t msr;
+    uint8_t val = data->energy_perf;
+
+    ASSERT(val <= IA32_ENERGY_BIAS_MAX_POWERSAVE);
+
+    if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
+    {
+        hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
+
+        return;
+    }
+
+    msr &= ~IA32_ENERGY_BIAS_MASK;
+    msr |= val;
+
+    if ( wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
+        hwp_err("error wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS): %016lx\n", msr);
+}
+
+static void cf_check hwp_write_request(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
+    union hwp_request hwp_req = data->curr_req;
+
+    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
+    if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) )
+    {
+        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n",
+                policy->cpu, hwp_req.raw);
+        rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw);
+    }
+
+    if ( !feature_hwp_energy_perf )
+        hwp_energy_perf_bias(data);
+
+}
+
+static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
+                                       unsigned int target_freq,
+                                       unsigned int relation)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+    /* Zero everything to ensure reserved bits are zero... */
+    union hwp_request hwp_req = { .raw = 0 };
+
+    /* .. and update from there */
+    hwp_req.min_perf = data->minimum;
+    hwp_req.max_perf = data->maximum;
+    hwp_req.desired = data->desired;
+    if ( feature_hwp_energy_perf )
+        hwp_req.energy_perf = data->energy_perf;
+    if ( feature_hwp_activity_window )
+        hwp_req.activity_window = data->activity_window;
+
+    if ( hwp_req.raw == data->curr_req.raw )
+        return 0;
+
+    data->curr_req = hwp_req;
+
+    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
+    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
+
+    return 0;
+}
+
+static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data;
+
+    data = xzalloc(struct hwp_drv_data);
+    if ( !data )
+        return -ENOMEM;
+
+    if ( cpufreq_opt_governor )
+        printk(XENLOG_WARNING
+               "HWP: governor \"%s\" is incompatible with hwp. Using default \"%s\"\n",
+               cpufreq_opt_governor->name, hwp_cpufreq_governor.name);
+    policy->governor = &hwp_cpufreq_governor;
+
+    per_cpu(hwp_drv_data, cpu) = data;
+
+    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
+
+    if ( data->curr_req.raw == -1 )
+    {
+        hwp_err("CPU%u: Could not initialize HWP properly\n", cpu);
+        XFREE(per_cpu(hwp_drv_data, cpu));
+        return -ENODEV;
+    }
+
+    data->minimum = data->curr_req.min_perf;
+    data->maximum = data->curr_req.max_perf;
+    data->desired = data->curr_req.desired;
+    /* the !feature_hwp_energy_perf case was handled in hwp_init_msrs(). */
+    if ( feature_hwp_energy_perf )
+        data->energy_perf = data->curr_req.energy_perf;
+
+    hwp_verbose("CPU%u: IA32_HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);
+
+    hwp_verbose("CPU%u: rdmsr HWP_REQUEST %016lx\n", cpu, data->curr_req.raw);
+
+    return 0;
+}
+
+static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+    XFREE(per_cpu(hwp_drv_data, policy->cpu));
+
+    return 0;
+}
+
+/*
+ * The SDM reads like turbo should be disabled with MSR_IA32_PERF_CTL and
+ * PERF_CTL_TURBO_DISENGAGE, but that does not seem to actually work, at least
+ * with my HWP testing.  MSR_IA32_MISC_ENABLE and MISC_ENABLE_TURBO_DISENGAGE
+ * is what Linux uses and seems to work.
+ */
+static void cf_check hwp_set_misc_turbo(void *info)
+{
+    const struct cpufreq_policy *policy = info;
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_IA32_MISC_ENABLE, msr) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_MISC_ENABLE)\n", policy->cpu);
+
+        return;
+    }
+
+    if ( policy->turbo == CPUFREQ_TURBO_ENABLED )
+        msr &= ~MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE;
+    else
+        msr |= MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE;
+
+    if ( wrmsr_safe(MSR_IA32_MISC_ENABLE, msr) )
+        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_MISC_ENABLE): %016lx\n",
+                policy->cpu, msr);
+}
+
+static int cf_check hwp_cpufreq_update(int cpuid, struct cpufreq_policy *policy)
+{
+    on_selected_cpus(cpumask_of(cpuid), hwp_set_misc_turbo, policy, 1);
+
+    return 0;
+}
+
+static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
+{
+    .name   = "hwp-cpufreq",
+    .verify = hwp_cpufreq_verify,
+    .target = hwp_cpufreq_target,
+    .init   = hwp_cpufreq_cpu_init,
+    .exit   = hwp_cpufreq_cpu_exit,
+    .update = hwp_cpufreq_update,
+};
+
+int __init hwp_register_driver(void)
+{
+    return cpufreq_register_driver(&hwp_cpufreq_driver);
+}
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 4140ec0938..f2ff1d5fde 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -46,8 +46,17 @@ extern struct cpuinfo_x86 boot_cpu_data;
 #define cpu_has(c, bit)		test_bit(bit, (c)->x86_capability)
 #define boot_cpu_has(bit)	test_bit(bit, boot_cpu_data.x86_capability)
 
-#define CPUID_PM_LEAF                    6
-#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
+#define CPUID_PM_LEAF                                6
+#define CPUID6_EAX_HWP                               (_AC(1, U) <<  7)
+#define CPUID6_EAX_HWP_NOTIFICATION                  (_AC(1, U) <<  8)
+#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               (_AC(1, U) <<  9)
+#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE (_AC(1, U) << 10)
+#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         (_AC(1, U) << 11)
+#define CPUID6_EAX_HDC                               (_AC(1, U) << 13)
+#define CPUID6_EAX_HWP_PECI                          (_AC(1, U) << 16)
+#define CPUID6_EAX_HW_FEEDBACK                       (_AC(1, U) << 19)
+#define CPUID6_ECX_APERFMPERF_CAPABILITY             0x1
+#define CPUID6_ECX_IA32_ENERGY_PERF_BIAS             0x8
 
 /* CPUID level 0x00000001.edx */
 #define cpu_has_fpu             1
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index fa771ed0b5..a2a22339e4 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -151,6 +151,13 @@
 
 #define MSR_PKRS                            0x000006e1
 
+#define MSR_IA32_PM_ENABLE                  0x00000770
+#define  IA32_PM_ENABLE_HWP_ENABLE          (_AC(1, ULL) <<  0)
+
+#define MSR_IA32_HWP_CAPABILITIES           0x00000771
+#define MSR_IA32_HWP_INTERRUPT              0x00000773
+#define MSR_IA32_HWP_REQUEST                0x00000774
+
 #define MSR_X2APIC_FIRST                    0x00000800
 #define MSR_X2APIC_LAST                     0x000008ff
 
@@ -165,6 +172,11 @@
 #define  PASID_PASID_MASK                   0x000fffff
 #define  PASID_VALID                        (_AC(1, ULL) << 31)
 
+#define MSR_IA32_PKG_HDC_CTL                0x00000db0
+#define  IA32_PKG_HDC_CTL_HDC_PKG_ENABLE    (_AC(1, ULL) <<  0)
+#define MSR_IA32_PM_CTL1                    0x00000db1
+#define  IA32_PM_CTL1_HDC_ALLOW_BLOCK       (_AC(1, ULL) <<  0)
+
 #define MSR_UARCH_MISC_CTRL                 0x00001b01
 #define  UARCH_CTRL_DOITM                   (_AC(1, ULL) <<  0)
 
@@ -500,6 +512,7 @@
 #define MSR_IA32_MISC_ENABLE_LIMIT_CPUID  (1<<22)
 #define MSR_IA32_MISC_ENABLE_XTPR_DISABLE (1<<23)
 #define MSR_IA32_MISC_ENABLE_XD_DISABLE	(1ULL << 34)
+#define MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE (1ULL << 38)
 
 #define MSR_IA32_TSC_DEADLINE		0x000006E0
 #define MSR_IA32_ENERGY_PERF_BIAS	0x000001b0
diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index 7bd81680da..9470eb7230 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -565,6 +565,38 @@ static void cpufreq_cmdline_common_para(struct cpufreq_policy *new_policy)
 
 static int __init cpufreq_handle_common_option(const char *name, const char *val)
 {
+    if (!strcmp(name, "hdc")) {
+        if (val) {
+            int ret = parse_bool(val, NULL);
+            if (ret != -1) {
+                opt_cpufreq_hdc = ret;
+                return 1;
+            }
+        } else {
+            opt_cpufreq_hdc = true;
+            return 1;
+        }
+    } else if (!strcmp(name, "no-hdc")) {
+        opt_cpufreq_hdc = false;
+        return 1;
+    }
+
+    if (!strcmp(name, "hwp")) {
+        if (val) {
+            int ret = parse_bool(val, NULL);
+            if (ret != -1) {
+                opt_cpufreq_hwp = ret;
+                return 1;
+            }
+        } else {
+            opt_cpufreq_hwp = true;
+            return 1;
+        }
+    } else if (!strcmp(name, "no-hwp")) {
+        opt_cpufreq_hwp = false;
+        return 1;
+    }
+
     if (!strcmp(name, "maxfreq") && val) {
         usr_max_freq = simple_strtoul(val, NULL, 0);
         return 1;
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 0f334d2a43..29a712a4f1 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -245,4 +245,7 @@ void cpufreq_dbs_timer_resume(void);
 
 void intel_feature_detect(struct cpufreq_policy *policy);
 
+extern bool opt_cpufreq_hwp;
+extern bool opt_cpufreq_hdc;
+
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h
index d8a1ba68a6..b751ca4937 100644
--- a/xen/include/acpi/cpufreq/processor_perf.h
+++ b/xen/include/acpi/cpufreq/processor_perf.h
@@ -7,6 +7,9 @@
 
 #define XEN_PX_INIT 0x80000000
 
+bool hwp_available(void);
+int hwp_register_driver(void);
+
 int powernow_cpufreq_init(void);
 unsigned int powernow_register_driver(void);
 unsigned int get_measured_perf(unsigned int cpu, unsigned int flag);
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 2b24d6bfd0..b448f13b75 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -292,6 +292,7 @@ struct xen_ondemand {
     uint32_t up_threshold;
 };
 
+#define XEN_HWP_GOVERNOR "hwp-internal"
 /*
  * cpufreq para name of this structure named
  * same as sysfs file name of native linux
-- 
2.40.0




* [PATCH v3 05/14 RESEND] xenpm: Change get-cpufreq-para output for internal
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (3 preceding siblings ...)
  2023-05-01 19:30 ` [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-04 14:35   ` Jan Beulich
  2023-05-01 19:30 ` [PATCH v3 06/14 RESEND] xen/x86: Tweak PDC bits when using HWP Jason Andryuk
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD

When using HWP, some of the returned data is not applicable.  In that
case, we should just omit it to avoid confusing the user.  So switch to
printing the base and turbo frequencies since those are relevant to HWP.
Similarly, stop printing the CPU frequencies since those do not apply.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v2:
Use full governor name XEN_HWP_GOVERNOR to change output
Style fixes
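
For illustration only, the output switch described above can be sketched
as a standalone fragment.  struct freq_para and format_cpuinfo_line()
are hypothetical stand-ins and not names from xenpm; only
XEN_HWP_GOVERNOR and the two format strings come from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define XEN_HWP_GOVERNOR "hwp-internal" /* value from this series */

/* Hypothetical, trimmed-down stand-in for struct xc_get_cpufreq_para. */
struct freq_para {
    const char *scaling_governor;
    unsigned int cpuinfo_max_freq;
    unsigned int cpuinfo_min_freq;
    unsigned int cpuinfo_cur_freq;
};

/* Hypothetical helper: format the "cpuinfo frequency" line into buf. */
static int format_cpuinfo_line(const struct freq_para *p, char *buf,
                               size_t len)
{
    bool internal;

    assert(p != NULL && buf != NULL);

    /* Same test as the patch: substring match on the governor name. */
    internal = strstr(p->scaling_governor, XEN_HWP_GOVERNOR) != NULL;

    if ( internal )
        /* With HWP, min/max carry the base and turbo frequencies. */
        return snprintf(buf, len,
                        "cpuinfo frequency    : base [%u] turbo [%u]",
                        p->cpuinfo_min_freq, p->cpuinfo_max_freq);

    return snprintf(buf, len,
                    "cpuinfo frequency    : max [%u] min [%u] cur [%u]",
                    p->cpuinfo_max_freq, p->cpuinfo_min_freq,
                    p->cpuinfo_cur_freq);
}
```

The current frequency is left out in the HWP case because the hardware
selects it on its own and Xen does not track it.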
---
 tools/misc/xenpm.c | 41 +++++++++++++++++++++++++----------------
 1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 1bb6187e56..ce8d7644d0 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -711,6 +711,7 @@ void start_gather_func(int argc, char *argv[])
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
+    bool internal = strstr(p_cpufreq->scaling_governor, XEN_HWP_GOVERNOR);
     int i;
 
     printf("cpu id               : %d\n", cpuid);
@@ -720,10 +721,15 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
         printf(" %d", p_cpufreq->affected_cpus[i]);
     printf("\n");
 
-    printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
-           p_cpufreq->cpuinfo_max_freq,
-           p_cpufreq->cpuinfo_min_freq,
-           p_cpufreq->cpuinfo_cur_freq);
+    if ( internal )
+        printf("cpuinfo frequency    : base [%u] turbo [%u]\n",
+               p_cpufreq->cpuinfo_min_freq,
+               p_cpufreq->cpuinfo_max_freq);
+    else
+        printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
+               p_cpufreq->cpuinfo_max_freq,
+               p_cpufreq->cpuinfo_min_freq,
+               p_cpufreq->cpuinfo_cur_freq);
 
     printf("scaling_driver       : %s\n", p_cpufreq->scaling_driver);
 
@@ -750,19 +756,22 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
                p_cpufreq->u.ondemand.up_threshold);
     }
 
-    printf("scaling_avail_freq   :");
-    for ( i = 0; i < p_cpufreq->freq_num; i++ )
-        if ( p_cpufreq->scaling_available_frequencies[i] ==
-             p_cpufreq->scaling_cur_freq )
-            printf(" *%d", p_cpufreq->scaling_available_frequencies[i]);
-        else
-            printf(" %d", p_cpufreq->scaling_available_frequencies[i]);
-    printf("\n");
+    if ( !internal )
+    {
+        printf("scaling_avail_freq   :");
+        for ( i = 0; i < p_cpufreq->freq_num; i++ )
+            if ( p_cpufreq->scaling_available_frequencies[i] ==
+                 p_cpufreq->scaling_cur_freq )
+                printf(" *%d", p_cpufreq->scaling_available_frequencies[i]);
+            else
+                printf(" %d", p_cpufreq->scaling_available_frequencies[i]);
+        printf("\n");
 
-    printf("scaling frequency    : max [%u] min [%u] cur [%u]\n",
-           p_cpufreq->scaling_max_freq,
-           p_cpufreq->scaling_min_freq,
-           p_cpufreq->scaling_cur_freq);
+        printf("scaling frequency    : max [%u] min [%u] cur [%u]\n",
+               p_cpufreq->scaling_max_freq,
+               p_cpufreq->scaling_min_freq,
+               p_cpufreq->scaling_cur_freq);
+    }
 
     printf("turbo mode           : %s\n",
            p_cpufreq->turbo_enabled ? "enabled" : "disabled or n/a");
-- 
2.40.0




* [PATCH v3 06/14 RESEND] xen/x86: Tweak PDC bits when using HWP
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (4 preceding siblings ...)
  2023-05-01 19:30 ` [PATCH v3 05/14 RESEND] xenpm: Change get-cpufreq-para output for internal Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-08  9:53   ` Jan Beulich
  2023-05-01 19:30 ` [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace Jason Andryuk
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

Qubes testing of HWP support had a report of a laptop, Thinkpad X1
Carbon Gen 4 with a Skylake processor, locking up during boot when HWP
is enabled.  A user found a kernel bug that seems to be the same issue:
https://bugzilla.kernel.org/show_bug.cgi?id=110941.

That bug was fixed by Linux commit a21211672c9a ("ACPI / processor:
Request native thermal interrupt handling via _OSC").  The tl;dr is SMM
crashes when it receives thermal interrupts, so Linux calls the ACPI
_OSC method to take over interrupt handling.

The Linux fix looks at the CPU features to decide whether or not to call
_OSC with bit 12 set to take over native interrupt handling.  Xen needs
some way to communicate HWP to Dom0 for making an equivalent call.

Xen exposes modified PDC bits via the platform_op set_pminfo hypercall.
Expand that to set bit 12 when HWP is present and in use.

Any generated interrupt would be handled by Xen's thermal driver, which
clears the status.

Bit 12 isn't named in the Linux header and is open coded in Linux's
usage.

This will need a corresponding Linux patch to pick up and apply the PDC
bits.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
New in v3
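
As a rough standalone model of the lib.c change (a sketch, not the
hypervisor code): pdc_request_native_int() is a hypothetical name, and
the hwp_in_use argument stands in for the hwp_active() call.  The bit
value is the one this patch adds to pdc_intel.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ACPI_PDC_CPPC_NTV_INT 0x1000u /* bit 12, value from this patch */

/*
 * Hypothetical standalone model of the arch_acpi_set_pdc_bits() change:
 * pdc[2] holds the capability bits that are handed back to Dom0.
 */
static void pdc_request_native_int(uint32_t *pdc, bool hwp_in_use)
{
    assert(pdc != NULL);

    /* Only advertise native interrupt handling when HWP is in use. */
    if ( hwp_in_use )
        pdc[2] |= ACPI_PDC_CPPC_NTV_INT;
}
```

Other capability bits in pdc[2] are left untouched, so the change only
ever adds bit 12.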

 xen/arch/x86/acpi/cpufreq/hwp.c           | 16 +++++++++++-----
 xen/arch/x86/acpi/lib.c                   |  5 +++++
 xen/arch/x86/cpu/mcheck/mce_intel.c       |  6 ++++++
 xen/arch/x86/include/asm/msr-index.h      |  1 +
 xen/include/acpi/cpufreq/processor_perf.h |  1 +
 xen/include/acpi/pdc_intel.h              |  1 +
 6 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index 57f13867d3..f84abe1386 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -13,6 +13,8 @@
 #include <asm/msr.h>
 #include <acpi/cpufreq/cpufreq.h>
 
+static bool hwp_in_use;
+
 static bool feature_hwp;
 static bool feature_hwp_notification;
 static bool feature_hwp_activity_window;
@@ -117,10 +119,14 @@ static int __init cf_check cpufreq_gov_hwp_init(void)
 }
 __initcall(cpufreq_gov_hwp_init);
 
+bool hwp_active(void)
+{
+    return hwp_in_use;
+}
+
 bool __init hwp_available(void)
 {
     unsigned int eax, ecx, unused;
-    bool use_hwp;
 
     if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
     {
@@ -173,13 +179,13 @@ bool __init hwp_available(void)
     hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
                 (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
 
-    use_hwp = feature_hwp && opt_cpufreq_hwp;
-    cpufreq_governor_internal = use_hwp;
+    hwp_in_use = feature_hwp && opt_cpufreq_hwp;
+    cpufreq_governor_internal = hwp_in_use;
 
-    if ( use_hwp )
+    if ( hwp_in_use )
         hwp_info("Using HWP for cpufreq\n");
 
-    return use_hwp;
+    return hwp_in_use;
 }
 
 static void hdc_set_pkg_hdc_ctl(bool val)
diff --git a/xen/arch/x86/acpi/lib.c b/xen/arch/x86/acpi/lib.c
index 43831b92d1..20d6115ba9 100644
--- a/xen/arch/x86/acpi/lib.c
+++ b/xen/arch/x86/acpi/lib.c
@@ -26,6 +26,8 @@
 #include <asm/fixmap.h>
 #include <asm/mwait.h>
 
+#include <acpi/cpufreq/processor_perf.h>
+
 u32 __read_mostly acpi_smi_cmd;
 u8 __read_mostly acpi_enable_value;
 u8 __read_mostly acpi_disable_value;
@@ -140,5 +142,8 @@ int arch_acpi_set_pdc_bits(u32 acpi_id, u32 *pdc, u32 mask)
 	    !(ecx & CPUID5_ECX_INTERRUPT_BREAK))
 		pdc[2] &= ~(ACPI_PDC_C_C1_FFH | ACPI_PDC_C_C2C3_FFH);
 
+	if (hwp_active())
+		pdc[2] |= ACPI_PDC_CPPC_NTV_INT;
+
 	return 0;
 }
diff --git a/xen/arch/x86/cpu/mcheck/mce_intel.c b/xen/arch/x86/cpu/mcheck/mce_intel.c
index 2f23f02923..d430342924 100644
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -15,6 +15,9 @@
 #include <asm/p2m.h>
 #include <asm/mce.h>
 #include <asm/apic.h>
+
+#include <acpi/cpufreq/processor_perf.h>
+
 #include "mce.h"
 #include "x86_mca.h"
 #include "barrier.h"
@@ -64,6 +67,9 @@ static void cf_check intel_thermal_interrupt(struct cpu_user_regs *regs)
 
     ack_APIC_irq();
 
+    if ( hwp_active() )
+        wrmsr_safe(MSR_IA32_HWP_STATUS, 0);
+
     if ( NOW() < per_cpu(next, cpu) )
         return;
 
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index a2a22339e4..f5269022da 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -157,6 +157,7 @@
 #define MSR_IA32_HWP_CAPABILITIES           0x00000771
 #define MSR_IA32_HWP_INTERRUPT              0x00000773
 #define MSR_IA32_HWP_REQUEST                0x00000774
+#define MSR_IA32_HWP_STATUS                 0x00000777
 
 #define MSR_X2APIC_FIRST                    0x00000800
 #define MSR_X2APIC_LAST                     0x000008ff
diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h
index b751ca4937..dd8ec36ba7 100644
--- a/xen/include/acpi/cpufreq/processor_perf.h
+++ b/xen/include/acpi/cpufreq/processor_perf.h
@@ -8,6 +8,7 @@
 #define XEN_PX_INIT 0x80000000
 
 bool hwp_available(void);
+bool hwp_active(void);
 int hwp_register_driver(void);
 
 int powernow_cpufreq_init(void);
diff --git a/xen/include/acpi/pdc_intel.h b/xen/include/acpi/pdc_intel.h
index 4fb719d6f5..e8332898fc 100644
--- a/xen/include/acpi/pdc_intel.h
+++ b/xen/include/acpi/pdc_intel.h
@@ -17,6 +17,7 @@
 #define ACPI_PDC_C_C1_FFH		(0x0100)
 #define ACPI_PDC_C_C2C3_FFH		(0x0200)
 #define ACPI_PDC_SMP_P_HWCOORD		(0x0800)
+#define ACPI_PDC_CPPC_NTV_INT		(0x1000)
 
 #define ACPI_PDC_EST_CAPABILITY_SMP	(ACPI_PDC_SMP_C1PT | \
 					 ACPI_PDC_C_C1_HALT | \
-- 
2.40.0




* [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (5 preceding siblings ...)
  2023-05-01 19:30 ` [PATCH v3 06/14 RESEND] xen/x86: Tweak PDC bits when using HWP Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-08 10:25   ` Jan Beulich
  2023-05-01 19:30 ` [PATCH v3 08/14 RESEND] libxc: Include hwp_para in definitions Jason Andryuk
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

Extend xen_get_cpufreq_para to return hwp parameters.  These match the
hardware rather closely.

We need the features bitmask to indicate which fields are supported by
the actual hardware.

The use of uint8_t parameters matches the hardware size.  uint32_t
entries would grow the sysctl_t past the build assertion in setup.c.
The uint8_t ranges are supported across multiple generations, so
hopefully they won't change.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v2:
Style fixes
Don't bump XEN_SYSCTL_INTERFACE_VERSION
Drop cpufreq.h comment divider
Expand xen_hwp_para comment
Add HWP activity window mantissa/exponent defines
Handle union rename
Add const to get_hwp_para
Remove hw_ prefix from xen_hwp_para members
Use XEN_HWP_GOVERNOR
Use per_cpu for hwp_drv_data
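
To make the activity_window encoding concrete, here is a minimal sketch
of packing and unpacking the field.  The three masks are the ones this
patch adds to sysctl.h; act_window_encode() and act_window_to_us() are
hypothetical helpers, and the base unit of microseconds is an assumption
taken from the "us" label xenpm prints for small exponents.

```c
#include <assert.h>
#include <stdint.h>

#define HWP_ACT_WINDOW_MANTISSA_MASK  0x7f
#define HWP_ACT_WINDOW_EXPONENT_MASK  0x7
#define HWP_ACT_WINDOW_EXPONENT_SHIFT 7

/* Hypothetical helper: pack a 7-bit mantissa and 3-bit exponent. */
static uint16_t act_window_encode(unsigned int mantissa,
                                  unsigned int exponent)
{
    assert(mantissa <= HWP_ACT_WINDOW_MANTISSA_MASK);
    assert(exponent <= HWP_ACT_WINDOW_EXPONENT_MASK);

    return (uint16_t)(mantissa |
                      (exponent << HWP_ACT_WINDOW_EXPONENT_SHIFT));
}

/*
 * Hypothetical helper: decode to microseconds (mantissa * 10^exponent).
 * A raw value of 0 decodes to 0, which means "hardware selected".
 * The largest result, 127 * 10^7, still fits in 32 bits.
 */
static uint32_t act_window_to_us(uint16_t raw)
{
    uint32_t us = raw & HWP_ACT_WINDOW_MANTISSA_MASK;
    unsigned int exponent = (raw >> HWP_ACT_WINDOW_EXPONENT_SHIFT) &
                            HWP_ACT_WINDOW_EXPONENT_MASK;

    while ( exponent-- )
        us *= 10;

    return us;
}
```

For example, a mantissa of 50 with an exponent of 3 encodes as 0x1b2 and
decodes to 50000 microseconds, i.e. 50 ms.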
---
 xen/arch/x86/acpi/cpufreq/hwp.c    | 25 +++++++++++++++++++++++++
 xen/drivers/acpi/pmstat.c          |  5 +++++
 xen/include/acpi/cpufreq/cpufreq.h |  2 ++
 xen/include/public/sysctl.h        | 26 ++++++++++++++++++++++++++
 4 files changed, 58 insertions(+)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index f84abe1386..cb52918799 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -506,6 +506,31 @@ static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
     .update = hwp_cpufreq_update,
 };
 
+int get_hwp_para(const struct cpufreq_policy *policy,
+                 struct xen_hwp_para *hwp_para)
+{
+    unsigned int cpu = policy->cpu;
+    const struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+
+    if ( data == NULL )
+        return -EINVAL;
+
+    hwp_para->features        =
+        (feature_hwp_activity_window ? XEN_SYSCTL_HWP_FEAT_ACT_WINDOW  : 0) |
+        (feature_hwp_energy_perf     ? XEN_SYSCTL_HWP_FEAT_ENERGY_PERF : 0);
+    hwp_para->lowest          = data->hw.lowest;
+    hwp_para->most_efficient  = data->hw.most_efficient;
+    hwp_para->guaranteed      = data->hw.guaranteed;
+    hwp_para->highest         = data->hw.highest;
+    hwp_para->minimum         = data->minimum;
+    hwp_para->maximum         = data->maximum;
+    hwp_para->energy_perf     = data->energy_perf;
+    hwp_para->activity_window = data->activity_window;
+    hwp_para->desired         = data->desired;
+
+    return 0;
+}
+
 int __init hwp_register_driver(void)
 {
     return cpufreq_register_driver(&hwp_cpufreq_driver);
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index 1bae635101..67fd9dabd4 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -290,6 +290,11 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
             &op->u.get_para.u.ondemand.sampling_rate,
             &op->u.get_para.u.ondemand.up_threshold);
     }
+
+    if ( !strncasecmp(op->u.get_para.scaling_governor, XEN_HWP_GOVERNOR,
+                      CPUFREQ_NAME_LEN) )
+        ret = get_hwp_para(policy, &op->u.get_para.u.hwp_para);
+
     op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
 
     return ret;
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 29a712a4f1..92b4c7e79c 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -247,5 +247,7 @@ void intel_feature_detect(struct cpufreq_policy *policy);
 
 extern bool opt_cpufreq_hwp;
 extern bool opt_cpufreq_hdc;
+int get_hwp_para(const struct cpufreq_policy *policy,
+                 struct xen_hwp_para *hwp_para);
 
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index b448f13b75..bf7e6594a7 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -292,6 +292,31 @@ struct xen_ondemand {
     uint32_t up_threshold;
 };
 
+struct xen_hwp_para {
+    /*
+     * bits 6:0   - 7bit mantissa
+     * bits 9:7   - 3bit base-10 exponent
+     * bits 15:10 - Unused - must be 0
+     */
+#define HWP_ACT_WINDOW_MANTISSA_MASK  0x7f
+#define HWP_ACT_WINDOW_EXPONENT_MASK  0x7
+#define HWP_ACT_WINDOW_EXPONENT_SHIFT 7
+    uint16_t activity_window;
+    /* energy_perf range 0-255 if 1. Otherwise 0-15 */
+#define XEN_SYSCTL_HWP_FEAT_ENERGY_PERF (1 << 0)
+    /* activity_window supported if 1 */
+#define XEN_SYSCTL_HWP_FEAT_ACT_WINDOW  (1 << 1)
+    uint8_t features; /* bit flags for features */
+    uint8_t lowest;
+    uint8_t most_efficient;
+    uint8_t guaranteed;
+    uint8_t highest;
+    uint8_t minimum;
+    uint8_t maximum;
+    uint8_t desired;
+    uint8_t energy_perf;
+};
+
 #define XEN_HWP_GOVERNOR "hwp-internal"
 /*
  * cpufreq para name of this structure named
@@ -324,6 +349,7 @@ struct xen_get_cpufreq_para {
     union {
         struct  xen_userspace userspace;
         struct  xen_ondemand ondemand;
+        struct  xen_hwp_para hwp_para;
     } u;
 
     int32_t turbo_enabled;
-- 
2.40.0




* [PATCH v3 08/14 RESEND] libxc: Include hwp_para in definitions
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (6 preceding siblings ...)
  2023-05-01 19:30 ` [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-19 13:53   ` Anthony PERARD
  2023-05-01 19:30 ` [PATCH v3 09/14 RESEND] xenpm: Print HWP parameters Jason Andryuk
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Juergen Gross

Expose the hwp_para fields through libxc.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 tools/include/xenctrl.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 05967ecc92..437001d713 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1903,6 +1903,7 @@ int xc_smt_disable(xc_interface *xch);
  */
 typedef struct xen_userspace xc_userspace_t;
 typedef struct xen_ondemand xc_ondemand_t;
+typedef struct xen_hwp_para xc_hwp_para_t;
 
 struct xc_get_cpufreq_para {
     /* IN/OUT variable */
@@ -1930,6 +1931,7 @@ struct xc_get_cpufreq_para {
     union {
         xc_userspace_t userspace;
         xc_ondemand_t ondemand;
+        xc_hwp_para_t hwp_para;
     } u;
 
     int32_t turbo_enabled;
-- 
2.40.0




* [PATCH v3 09/14 RESEND] xenpm: Print HWP parameters
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (7 preceding siblings ...)
  2023-05-01 19:30 ` [PATCH v3 08/14 RESEND] libxc: Include hwp_para in definitions Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-08 10:43   ` Jan Beulich
  2023-05-01 19:30 ` [PATCH v3 10/14 RESEND] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD

Print HWP-specific parameters.  Some are always present, but others
depend on hardware support.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v2:
Style fixes
Declare i outside loop
Replace repeated hardware/configured limits with spaces
Fixup for hw_ removal
Use XEN_HWP_GOVERNOR
Use HWP_ACT_WINDOW_EXPONENT_*
Remove energy_perf hw autonomous - 0 doesn't mean autonomous
---
 tools/misc/xenpm.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index ce8d7644d0..b2defde0d4 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -708,6 +708,44 @@ void start_gather_func(int argc, char *argv[])
     pause();
 }
 
+static void calculate_hwp_activity_window(const xc_hwp_para_t *hwp,
+                                          unsigned int *activity_window,
+                                          const char **units)
+{
+    unsigned int mantissa = hwp->activity_window & HWP_ACT_WINDOW_MANTISSA_MASK;
+    unsigned int exponent =
+        (hwp->activity_window >> HWP_ACT_WINDOW_EXPONENT_SHIFT) &
+            HWP_ACT_WINDOW_EXPONENT_MASK;
+    unsigned int multiplier = 1;
+    unsigned int i;
+
+    if ( hwp->activity_window == 0 )
+    {
+        *units = "hardware selected";
+        *activity_window = 0;
+
+        return;
+    }
+
+    if ( exponent >= 6 )
+    {
+        *units = "s";
+        exponent -= 6;
+    }
+    else if ( exponent >= 3 )
+    {
+        *units = "ms";
+        exponent -= 3;
+    }
+    else
+        *units = "us";
+
+    for ( i = 0; i < exponent; i++ )
+        multiplier *= 10;
+
+    *activity_window = mantissa * multiplier;
+}
+
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
@@ -773,6 +811,33 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
                p_cpufreq->scaling_cur_freq);
     }
 
+    if ( strcmp(p_cpufreq->scaling_governor, XEN_HWP_GOVERNOR) == 0 )
+    {
+        const xc_hwp_para_t *hwp = &p_cpufreq->u.hwp_para;
+
+        printf("hwp variables        :\n");
+        printf("  hardware limits    : lowest [%u] most_efficient [%u]\n",
+               hwp->lowest, hwp->most_efficient);
+        printf("                     : guaranteed [%u] highest [%u]\n",
+               hwp->guaranteed, hwp->highest);
+        printf("  configured limits  : min [%u] max [%u] energy_perf [%u]\n",
+               hwp->minimum, hwp->maximum, hwp->energy_perf);
+
+        if ( hwp->features & XEN_SYSCTL_HWP_FEAT_ACT_WINDOW )
+        {
+            unsigned int activity_window;
+            const char *units;
+
+            calculate_hwp_activity_window(hwp, &activity_window, &units);
+            printf("                     : activity_window [%u %s]\n",
+                   activity_window, units);
+        }
+
+        printf("                     : desired [%u%s]\n",
+               hwp->desired,
+               hwp->desired ? "" : " hw autonomous");
+    }
+
     printf("turbo mode           : %s\n",
            p_cpufreq->turbo_enabled ? "enabled" : "disabled or n/a");
     printf("\n");
-- 
2.40.0




* [PATCH v3 10/14 RESEND] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (8 preceding siblings ...)
  2023-05-01 19:30 ` [PATCH v3 09/14 RESEND] xenpm: Print HWP parameters Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-08 11:27   ` Jan Beulich
  2023-05-01 19:30 ` [PATCH v3 11/14 RESEND] libxc: Add xc_set_cpufreq_hwp Jason Andryuk
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

Add SET_CPUFREQ_HWP xen_sysctl_pm_op to set HWP parameters.  The sysctl
supports setting multiple values simultaneously as indicated by the
set_params bits.  This allows atomically applying new HWP configuration
via a single wrmsr.

XEN_SYSCTL_HWP_SET_PRESET_BALANCE/PERFORMANCE/POWERSAVE provide three
common presets.  Setting them depends on hardware limits which the
hypervisor is already caching.  So using them allows skipping a
hypercall to query the limits (lowest/highest) to then set those same
values.  The code is organized to allow a preset to be refined with
additional parameters if desired.
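
The preset-then-refine ordering described above can be sketched in
isolation as follows.  This is only an illustration, not the hypervisor
code: the flag values are copied from the sysctl.h hunk in this patch,
while struct hwp_state, struct hwp_request and apply_request() are
hypothetical stand-ins, input validation is mostly omitted, and the
hardware is assumed to support the 0-255 energy_perf range.

```c
#include <stdint.h>

/* Flag values copied from the sysctl.h hunk in this patch. */
#define HWP_SET_DESIRED            (1U << 0)
#define HWP_SET_ENERGY_PERF        (1U << 1)
#define HWP_SET_ACT_WINDOW         (1U << 2)
#define HWP_SET_MINIMUM            (1U << 3)
#define HWP_SET_MAXIMUM            (1U << 4)
#define HWP_SET_PRESET_MASK        0xf000
#define HWP_SET_PRESET_NONE        0x0000
#define HWP_SET_PRESET_BALANCE     0x1000
#define HWP_SET_PRESET_POWERSAVE   0x2000
#define HWP_SET_PRESET_PERFORMANCE 0x3000

/* Hypothetical stand-in for the cached per-CPU driver data. */
struct hwp_state {
    uint8_t minimum, maximum, desired, energy_perf;
    uint16_t activity_window;
};

/* Hypothetical stand-in for struct xen_set_hwp_para. */
struct hwp_request {
    uint16_t set_params;
    uint16_t activity_window;
    uint8_t minimum, maximum, desired, energy_perf;
};

/* Apply the preset first, then let individual fields override it. */
static int apply_request(struct hwp_state *s, const struct hwp_request *r,
                         uint8_t hw_lowest, uint8_t hw_highest)
{
    switch ( r->set_params & HWP_SET_PRESET_MASK )
    {
    case HWP_SET_PRESET_POWERSAVE:
        s->minimum = s->maximum = hw_lowest;
        s->energy_perf = 0xff;
        s->activity_window = 0;
        s->desired = 0;
        break;

    case HWP_SET_PRESET_PERFORMANCE:
        s->minimum = s->maximum = hw_highest;
        s->energy_perf = 0;
        s->activity_window = 0;
        s->desired = 0;
        break;

    case HWP_SET_PRESET_BALANCE:
        s->minimum = hw_lowest;
        s->maximum = hw_highest;
        s->energy_perf = 0x80;
        s->activity_window = 0;
        s->desired = 0;
        break;

    case HWP_SET_PRESET_NONE:
        break;

    default:
        return -1; /* unknown preset value */
    }

    /* Refinements are applied after the preset, so they take precedence. */
    if ( r->set_params & HWP_SET_MINIMUM )
        s->minimum = r->minimum;
    if ( r->set_params & HWP_SET_MAXIMUM )
        s->maximum = r->maximum;
    if ( r->set_params & HWP_SET_ENERGY_PERF )
        s->energy_perf = r->energy_perf;
    if ( r->set_params & HWP_SET_DESIRED )
        s->desired = r->desired;
    if ( r->set_params & HWP_SET_ACT_WINDOW )
        s->activity_window = r->activity_window;

    return 0;
}
```

A request carrying HWP_SET_PRESET_BALANCE | HWP_SET_MAXIMUM would thus
take minimum and energy_perf from the preset while capping maximum at
the caller's value, all within one call.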

"most_efficient" and "guaranteed" could be additional presets in the
future, but they are not added now.  Those levels can change at runtime,
but we don't have code in place to monitor and update for those events.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

---
v3:
Remove cpufreq_governor_internal from set_cpufreq_hwp

v2:
Update for naming anonymous union
Drop hwp_err for invalid input in set_hwp_para()
Drop uint16_t cast in XEN_SYSCTL_HWP_SET_PARAM_MASK
Drop parens for HWP_SET_PRESET defines
Reference activity_window format comment
Place SET_CPUFREQ_HWP after SET_CPUFREQ_PARA
Add {HWP,IA32}_ENERGY_PERF_MAX_{PERFORMANCE,POWERSAVE} defines
Order defines before fields in sysctl.h
Use XEN_HWP_GOVERNOR
Use per_cpu for hwp_drv_data
---
 xen/arch/x86/acpi/cpufreq/hwp.c    | 96 ++++++++++++++++++++++++++++++
 xen/drivers/acpi/pmstat.c          | 18 ++++++
 xen/include/acpi/cpufreq/cpufreq.h |  2 +
 xen/include/public/sysctl.h        | 30 ++++++++++
 4 files changed, 146 insertions(+)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index cb52918799..3d15875dc1 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -27,7 +27,9 @@ static bool feature_hdc;
 __initdata bool opt_cpufreq_hwp = false;
 __initdata bool opt_cpufreq_hdc = true;
 
+#define HWP_ENERGY_PERF_MAX_PERFORMANCE 0
 #define HWP_ENERGY_PERF_BALANCE         0x80
+#define HWP_ENERGY_PERF_MAX_POWERSAVE   0xff
 #define IA32_ENERGY_BIAS_BALANCE        0x7
 #define IA32_ENERGY_BIAS_MAX_POWERSAVE  0xf
 #define IA32_ENERGY_BIAS_MASK           0xf
@@ -531,6 +533,100 @@ int get_hwp_para(const struct cpufreq_policy *policy,
     return 0;
 }
 
+int set_hwp_para(struct cpufreq_policy *policy,
+                 struct xen_set_hwp_para *set_hwp)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+
+    if ( data == NULL )
+        return -EINVAL;
+
+    /* Validate all parameters first */
+    if ( set_hwp->set_params & ~XEN_SYSCTL_HWP_SET_PARAM_MASK )
+        return -EINVAL;
+
+    if ( set_hwp->activity_window & ~XEN_SYSCTL_HWP_ACT_WINDOW_MASK )
+        return -EINVAL;
+
+    if ( !feature_hwp_energy_perf &&
+         (set_hwp->set_params & XEN_SYSCTL_HWP_SET_ENERGY_PERF) &&
+         set_hwp->energy_perf > IA32_ENERGY_BIAS_MAX_POWERSAVE )
+        return -EINVAL;
+
+    if ( (set_hwp->set_params & XEN_SYSCTL_HWP_SET_DESIRED) &&
+         set_hwp->desired != 0 &&
+         (set_hwp->desired < data->hw.lowest ||
+          set_hwp->desired > data->hw.highest) )
+        return -EINVAL;
+
+    /*
+     * minimum & maximum are not validated as hardware doesn't seem to care
+     * and the SDM says CPUs will clip internally.
+     */
+
+    /* Apply presets */
+    switch ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_PRESET_MASK )
+    {
+    case XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE:
+        data->minimum = data->hw.lowest;
+        data->maximum = data->hw.lowest;
+        data->activity_window = 0;
+        if ( feature_hwp_energy_perf )
+            data->energy_perf = HWP_ENERGY_PERF_MAX_POWERSAVE;
+        else
+            data->energy_perf = IA32_ENERGY_BIAS_MAX_POWERSAVE;
+        data->desired = 0;
+        break;
+
+    case XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE:
+        data->minimum = data->hw.highest;
+        data->maximum = data->hw.highest;
+        data->activity_window = 0;
+        data->energy_perf = HWP_ENERGY_PERF_MAX_PERFORMANCE;
+        data->desired = 0;
+        break;
+
+    case XEN_SYSCTL_HWP_SET_PRESET_BALANCE:
+        data->minimum = data->hw.lowest;
+        data->maximum = data->hw.highest;
+        data->activity_window = 0;
+        if ( feature_hwp_energy_perf )
+            data->energy_perf = HWP_ENERGY_PERF_BALANCE;
+        else
+            data->energy_perf = IA32_ENERGY_BIAS_BALANCE;
+        data->desired = 0;
+        break;
+
+    case XEN_SYSCTL_HWP_SET_PRESET_NONE:
+        break;
+
+    default:
+        return -EINVAL;
+    }
+
+    /* Further customize presets if needed */
+    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_MINIMUM )
+        data->minimum = set_hwp->minimum;
+
+    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_MAXIMUM )
+        data->maximum = set_hwp->maximum;
+
+    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_ENERGY_PERF )
+        data->energy_perf = set_hwp->energy_perf;
+
+    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_DESIRED )
+        data->desired = set_hwp->desired;
+
+    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_ACT_WINDOW )
+        data->activity_window = set_hwp->activity_window &
+                                XEN_SYSCTL_HWP_ACT_WINDOW_MASK;
+
+    hwp_cpufreq_target(policy, 0, 0);
+
+    return 0;
+}
+
 int __init hwp_register_driver(void)
 {
     return cpufreq_register_driver(&hwp_cpufreq_driver);
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index 67fd9dabd4..12c76f5e57 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -398,6 +398,20 @@ static int set_cpufreq_para(struct xen_sysctl_pm_op *op)
     return ret;
 }
 
+static int set_cpufreq_hwp(struct xen_sysctl_pm_op *op)
+{
+    struct cpufreq_policy *policy = per_cpu(cpufreq_cpu_policy, op->cpuid);
+
+    if ( !policy || !policy->governor )
+        return -EINVAL;
+
+    if ( strncasecmp(policy->governor->name, XEN_HWP_GOVERNOR,
+                     CPUFREQ_NAME_LEN) )
+        return -EINVAL;
+
+    return set_hwp_para(policy, &op->u.set_hwp);
+}
+
 int do_pm_op(struct xen_sysctl_pm_op *op)
 {
     int ret = 0;
@@ -470,6 +484,10 @@ int do_pm_op(struct xen_sysctl_pm_op *op)
         break;
     }
 
+    case SET_CPUFREQ_HWP:
+        ret = set_cpufreq_hwp(op);
+        break;
+
     case GET_CPUFREQ_AVGFREQ:
     {
         op->u.get_avgfreq = cpufreq_driver_getavg(op->cpuid, USR_GETAVG);
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 92b4c7e79c..b8831b2cd3 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -249,5 +249,7 @@ extern bool opt_cpufreq_hwp;
 extern bool opt_cpufreq_hdc;
 int get_hwp_para(const struct cpufreq_policy *policy,
                  struct xen_hwp_para *hwp_para);
+int set_hwp_para(struct cpufreq_policy *policy,
+                 struct xen_set_hwp_para *set_hwp);
 
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index bf7e6594a7..3242472cbe 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -317,6 +317,34 @@ struct xen_hwp_para {
     uint8_t energy_perf;
 };
 
+/* set multiple values simultaneously when set_params bit is set */
+struct xen_set_hwp_para {
+#define XEN_SYSCTL_HWP_SET_DESIRED              (1U << 0)
+#define XEN_SYSCTL_HWP_SET_ENERGY_PERF          (1U << 1)
+#define XEN_SYSCTL_HWP_SET_ACT_WINDOW           (1U << 2)
+#define XEN_SYSCTL_HWP_SET_MINIMUM              (1U << 3)
+#define XEN_SYSCTL_HWP_SET_MAXIMUM              (1U << 4)
+#define XEN_SYSCTL_HWP_SET_PRESET_MASK          0xf000
+#define XEN_SYSCTL_HWP_SET_PRESET_NONE          0x0000
+#define XEN_SYSCTL_HWP_SET_PRESET_BALANCE       0x1000
+#define XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE     0x2000
+#define XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE   0x3000
+#define XEN_SYSCTL_HWP_SET_PARAM_MASK ( \
+                                  XEN_SYSCTL_HWP_SET_PRESET_MASK | \
+                                  XEN_SYSCTL_HWP_SET_DESIRED     | \
+                                  XEN_SYSCTL_HWP_SET_ENERGY_PERF | \
+                                  XEN_SYSCTL_HWP_SET_ACT_WINDOW  | \
+                                  XEN_SYSCTL_HWP_SET_MINIMUM     | \
+                                  XEN_SYSCTL_HWP_SET_MAXIMUM     )
+    uint16_t set_params; /* bitflags for valid values */
+#define XEN_SYSCTL_HWP_ACT_WINDOW_MASK          0x03ff
+    uint16_t activity_window; /* See comment in struct xen_hwp_para */
+    uint8_t minimum;
+    uint8_t maximum;
+    uint8_t desired;
+    uint8_t energy_perf; /* 0-255 or 0-15 depending on HW support */
+};
+
 #define XEN_HWP_GOVERNOR "hwp-internal"
 /*
  * cpufreq para name of this structure named
@@ -379,6 +407,7 @@ struct xen_sysctl_pm_op {
     #define SET_CPUFREQ_GOV            (CPUFREQ_PARA | 0x02)
     #define SET_CPUFREQ_PARA           (CPUFREQ_PARA | 0x03)
     #define GET_CPUFREQ_AVGFREQ        (CPUFREQ_PARA | 0x04)
+    #define SET_CPUFREQ_HWP            (CPUFREQ_PARA | 0x05)
 
     /* set/reset scheduler power saving option */
     #define XEN_SYSCTL_pm_op_set_sched_opt_smt    0x21
@@ -405,6 +434,7 @@ struct xen_sysctl_pm_op {
         struct xen_get_cpufreq_para get_para;
         struct xen_set_cpufreq_gov  set_gov;
         struct xen_set_cpufreq_para set_para;
+        struct xen_set_hwp_para     set_hwp;
         uint64_aligned_t get_avgfreq;
         uint32_t                    set_sched_opt_smt;
 #define XEN_SYSCTL_CX_UNLIMITED 0xffffffff
-- 
2.40.0




* [PATCH v3 11/14 RESEND] libxc: Add xc_set_cpufreq_hwp
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (9 preceding siblings ...)
  2023-05-01 19:30 ` [PATCH v3 10/14 RESEND] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-19 13:55   ` Anthony PERARD
  2023-05-01 19:30 ` [PATCH v3 12/14 RESEND] xenpm: Factor out a non-fatal cpuid_parse variant Jason Andryuk
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD, Juergen Gross

Add xc_set_cpufreq_hwp to allow calling xen_sysctl_pm_op
SET_CPUFREQ_HWP.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

---
v2:
Mark xc_set_hwp_para_t const
---
 tools/include/xenctrl.h |  4 ++++
 tools/libs/ctrl/xc_pm.c | 18 ++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 437001d713..cd367d9d8f 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1937,11 +1937,15 @@ struct xc_get_cpufreq_para {
     int32_t turbo_enabled;
 };
 
+typedef struct xen_set_hwp_para xc_set_hwp_para_t;
+
 int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
                         struct xc_get_cpufreq_para *user_para);
 int xc_set_cpufreq_gov(xc_interface *xch, int cpuid, char *govname);
 int xc_set_cpufreq_para(xc_interface *xch, int cpuid,
                         int ctrl_type, int ctrl_value);
+int xc_set_cpufreq_hwp(xc_interface *xch, int cpuid,
+                       const xc_set_hwp_para_t *set_hwp);
 int xc_get_cpufreq_avgfreq(xc_interface *xch, int cpuid, int *avg_freq);
 
 int xc_set_sched_opt_smt(xc_interface *xch, uint32_t value);
diff --git a/tools/libs/ctrl/xc_pm.c b/tools/libs/ctrl/xc_pm.c
index c3a9864bf7..a747ab053c 100644
--- a/tools/libs/ctrl/xc_pm.c
+++ b/tools/libs/ctrl/xc_pm.c
@@ -330,6 +330,24 @@ int xc_set_cpufreq_para(xc_interface *xch, int cpuid,
     return xc_sysctl(xch, &sysctl);
 }
 
+int xc_set_cpufreq_hwp(xc_interface *xch, int cpuid,
+                       const xc_set_hwp_para_t *set_hwp)
+{
+    DECLARE_SYSCTL;
+
+    if ( !xch )
+    {
+        errno = EINVAL;
+        return -1;
+    }
+    sysctl.cmd = XEN_SYSCTL_pm_op;
+    sysctl.u.pm_op.cmd = SET_CPUFREQ_HWP;
+    sysctl.u.pm_op.cpuid = cpuid;
+    sysctl.u.pm_op.u.set_hwp = *set_hwp;
+
+    return xc_sysctl(xch, &sysctl);
+}
+
 int xc_get_cpufreq_avgfreq(xc_interface *xch, int cpuid, int *avg_freq)
 {
     int ret = 0;
-- 
2.40.0




* [PATCH v3 12/14 RESEND] xenpm: Factor out a non-fatal cpuid_parse variant
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (10 preceding siblings ...)
  2023-05-01 19:30 ` [PATCH v3 11/14 RESEND] libxc: Add xc_set_cpufreq_hwp Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-08 12:01   ` Jan Beulich
  2023-05-01 19:30 ` [PATCH v3 13/14 RESEND] xenpm: Add set-cpufreq-hwp subcommand Jason Andryuk
  2023-05-01 19:30 ` [PATCH v3 14/14 RESEND] CHANGELOG: Add Intel HWP entry Jason Andryuk
  13 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD

Allow parse_cpuid to be re-used without terminating xenpm.  HWP will
re-use it to optionally parse a cpuid.  Unlike other users of
parse_cpuid, parse_hwp_opts takes a variable number of arguments and
cannot just check argc.
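
As a rough sketch of why a non-fatal variant helps, a caller can try to
consume the first argument as a CPU id and fall back to "all CPUs" when
it is something else.  parse_cpuid_non_fatal() below mirrors the
function added by this patch; skip_optional_cpuid() is a hypothetical
helper written only for this illustration.

```c
#include <stdio.h>
#include <strings.h>

/* Mirrors the non-fatal parser in this patch: 0 on success, -1 otherwise. */
static int parse_cpuid_non_fatal(const char *arg, int *cpuid)
{
    if ( sscanf(arg, "%d", cpuid) != 1 || *cpuid < 0 )
    {
        if ( strcasecmp(arg, "all") )
            return -1;

        *cpuid = -1; /* -1 means "all CPUs" */
    }

    return 0;
}

/*
 * Hypothetical helper: returns how many leading arguments were consumed
 * as a CPU id (0 or 1).  *cpuid is -1 ("all") when none was given.
 */
static int skip_optional_cpuid(int argc, char *argv[], int *cpuid)
{
    if ( argc > 0 && parse_cpuid_non_fatal(argv[0], cpuid) == 0 )
        return 1;

    /* A failed parse may have left a partial value behind; reset it. */
    *cpuid = -1;

    return 0;
}
```

With this shape, "3 balance", "all balance" and plain "balance" can all
be accepted without the parser exiting on the non-numeric first word.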

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v2:
Retained because parse_cpuid handles numeric cpu numbers and "all".
---
 tools/misc/xenpm.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index b2defde0d4..6e74606970 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -79,17 +79,26 @@ void help_func(int argc, char *argv[])
     show_help();
 }
 
-static void parse_cpuid(const char *arg, int *cpuid)
+static int parse_cpuid_non_fatal(const char *arg, int *cpuid)
 {
     if ( sscanf(arg, "%d", cpuid) != 1 || *cpuid < 0 )
     {
         if ( strcasecmp(arg, "all") )
-        {
-            fprintf(stderr, "Invalid CPU identifier: '%s'\n", arg);
-            exit(EINVAL);
-        }
+            return -1;
+
         *cpuid = -1;
     }
+
+    return 0;
+}
+
+static void parse_cpuid(const char *arg, int *cpuid)
+{
+    if ( parse_cpuid_non_fatal(arg, cpuid) )
+    {
+        fprintf(stderr, "Invalid CPU identifier: '%s'\n", arg);
+        exit(EINVAL);
+    }
 }
 
 static void parse_cpuid_and_int(int argc, char *argv[],
-- 
2.40.0




* [PATCH v3 13/14 RESEND] xenpm: Add set-cpufreq-hwp subcommand
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (11 preceding siblings ...)
  2023-05-01 19:30 ` [PATCH v3 12/14 RESEND] xenpm: Factor out a non-fatal cpuid_parse variant Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  2023-05-08 11:56   ` Jan Beulich
  2023-05-01 19:30 ` [PATCH v3 14/14 RESEND] CHANGELOG: Add Intel HWP entry Jason Andryuk
  13 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Wei Liu, Anthony PERARD

set-cpufreq-hwp allows setting the Hardware P-State (HWP) parameters.

It can be run on all or just a single cpu.  There are presets of
balance, powersave & performance.  Those can be further tweaked by
param:val arguments as explained in the usage description.

Parameter names may be abbreviated to shorten typing: a prefix of the
full name is accepted (at least 2 characters for minimum and maximum).
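
The abbreviated matching in the patch below compares only as many
characters as the user typed, with a floor of 2 for minimum/maximum so
that "m" alone is rejected as ambiguous.  A minimal sketch of that idea
follows; param_matches() is a hypothetical helper, not a function in
xenpm, and it takes the floor as an explicit argument.

```c
#include <string.h>
#include <strings.h>

/*
 * Hypothetical helper: non-zero when "param" is a case-insensitive prefix
 * of "full" and at least min_len characters long.  Comparing min_len
 * characters of a shorter input runs into its terminating NUL, which
 * never matches a letter in "full", so short inputs are rejected.
 */
static int param_matches(const char *param, const char *full, size_t min_len)
{
    size_t n = strlen(param);

    if ( n < min_len )
        n = min_len;

    return strncasecmp(param, full, n) == 0;
}
```

Passing a floor of 1 also rejects an empty name, since the NUL of the
empty string is compared against the first letter of the full name.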

Some options are hardware dependent, and ranges can be found in
get-cpufreq-para.
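
For act-window, the value is packed into a 7-bit mantissa and a 3-bit
base-10 exponent in microseconds, so larger values lose precision.  The
sketch below shows the encoding with the add-5-before-divide rounding
mentioned in the v2 notes.  The mask and shift values are assumptions
inferred from the "7bit mantissa / 3bit exponent" comment in this patch
(the real HWP_ACT_WINDOW_* defines live in an earlier patch), and both
function names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: bits 0-6 mantissa, bits 7-9 exponent. */
#define ACT_WINDOW_MANTISSA_MASK  0x7f
#define ACT_WINDOW_EXPONENT_MASK  0x7
#define ACT_WINDOW_EXPONENT_SHIFT 7

/* Encode a window given in microseconds; returns -1 if it is too large. */
static int encode_activity_window(unsigned long us, uint16_t *out)
{
    unsigned int exponent = 0;

    if ( us > 1270UL * 1000 * 1000 )
        return -1;

    /* Shrink to 7 bits, rounding to nearest instead of truncating. */
    while ( us > 127 )
    {
        us = (us + 5) / 10;
        exponent++;
    }

    *out = (uint16_t)(((exponent & ACT_WINDOW_EXPONENT_MASK) <<
                       ACT_WINDOW_EXPONENT_SHIFT) |
                      (us & ACT_WINDOW_MANTISSA_MASK));

    return 0;
}

/* Decode back to microseconds: mantissa * 10^exponent. */
static unsigned long decode_activity_window(uint16_t enc)
{
    unsigned long us = enc & ACT_WINDOW_MANTISSA_MASK;
    unsigned int exponent = (enc >> ACT_WINDOW_EXPONENT_SHIFT) &
                            ACT_WINDOW_EXPONENT_MASK;

    while ( exponent-- )
        us *= 10;

    return us;
}
```

Under these assumptions a request for 128us is stored as mantissa 13,
exponent 1 and reads back as 130us, rather than the 120us that plain
truncation would give.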

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v2:
Compare provided parameter name and not just 3 characters.
Use "-" in parameter names
Remove hw_
Replace sscanf with strchr & strtoul.
Replace toplevel error message with lower level ones.
Help text s/127/128/
Help text mention truncation.
Avoid some truncation rounding down by adding 5 before division.
Help text mention default microseconds
Also comment the limit check written to avoid overflow.
---
 tools/misc/xenpm.c | 230 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 230 insertions(+)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 6e74606970..8d99c78670 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -16,6 +16,7 @@
  */
 #define MAX_NR_CPU 512
 
+#include <limits.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
@@ -67,6 +68,27 @@ void show_help(void)
             " set-max-cstate        <num>|'unlimited' [<num2>|'unlimited']\n"
             "                                     set the C-State limitation (<num> >= 0) and\n"
             "                                     optionally the C-sub-state limitation (<num2> >= 0)\n"
+            " set-cpufreq-hwp       [cpuid] [balance|performance|powersave] <param:val>*\n"
+            "                                     set Hardware P-State (HWP) parameters\n"
+            "                                     optionally a preset of one of\n"
+            "                                       balance|performance|powersave\n"
+            "                                     an optional list of param:val arguments\n"
+            "                                       minimum:N  lowest ... highest\n"
+            "                                       maximum:N  lowest ... highest\n"
+            "                                       desired:N  lowest ... highest\n"
+            "                                           Set explicit performance target.\n"
+            "                                           non-zero disables auto-HWP mode.\n"
+            "                                       energy-perf:0-255 (or 0-15)\n"
+            "                                                   energy/performance hint\n"
+            "                                                   lower - favor performance\n"
+            "                                                   higher - favor powersave\n"
+            "                                                   128 (or 7) - balance\n"
+            "                                       act-window:N{,m,u}s range 1us-1270s\n"
+            "                                           window for internal calculations.\n"
+            "                                           Defaults to us without units.\n"
+            "                                           Truncates un-representable values.\n"
+            "                                           0 lets the hardware decide.\n"
+            "                                     get-cpufreq-para returns lowest/highest.\n"
             " start [seconds]                     start collect Cx/Px statistics,\n"
             "                                     output after CTRL-C or SIGINT or several seconds.\n"
             " enable-turbo-mode     [cpuid]       enable Turbo Mode for processors that support it.\n"
@@ -1299,6 +1321,213 @@ void disable_turbo_mode(int argc, char *argv[])
                 errno, strerror(errno));
 }
 
+/*
+ * Parse activity_window:NNN{us,ms,s} and validate range.
+ *
+ * Activity window is a 7bit mantissa (0-127) with a 3bit exponent (0-7) base
+ * 10 in microseconds.  So the range is 1 microsecond to 1270 seconds.  A value
+ * of 0 lets the hardware autonomously select the window.
+ *
+ * Return 0 on success
+ *       -1 on error
+ */
+static int parse_activity_window(xc_set_hwp_para_t *set_hwp, unsigned long u,
+                                 const char *suffix)
+{
+    unsigned int exponent = 0;
+    unsigned int multiplier = 1;
+
+    if ( suffix && suffix[0] )
+    {
+        if ( strcasecmp(suffix, "s") == 0 )
+        {
+            multiplier = 1000 * 1000;
+            exponent = 6;
+        }
+        else if ( strcasecmp(suffix, "ms") == 0 )
+        {
+            multiplier = 1000;
+            exponent = 3;
+        }
+        else if ( strcasecmp(suffix, "us") == 0 )
+        {
+            multiplier = 1;
+            exponent = 0;
+        }
+        else
+        {
+            fprintf(stderr, "invalid activity window units: \"%s\"\n", suffix);
+
+            return -1;
+        }
+    }
+
+    /* u * multiplier > 1270 * 1000 * 1000 transformed to avoid overflow. */
+    if ( u > 1270 * 1000 * 1000 / multiplier )
+    {
+        fprintf(stderr, "activity window is too large\n");
+
+        return -1;
+    }
+
+    /* looking for 7 bits of mantissa and 3 bits of exponent */
+    while ( u > 127 )
+    {
+        u += 5; /* Round up to mitigate truncation rounding down
+                   e.g. 128 -> 120 vs 128 -> 130. */
+        u /= 10;
+        exponent += 1;
+    }
+
+    set_hwp->activity_window = (exponent & HWP_ACT_WINDOW_EXPONENT_MASK) <<
+                                   HWP_ACT_WINDOW_EXPONENT_SHIFT |
+                               (u & HWP_ACT_WINDOW_MANTISSA_MASK);
+    set_hwp->set_params |= XEN_SYSCTL_HWP_SET_ACT_WINDOW;
+
+    return 0;
+}
+
+static int parse_hwp_opts(xc_set_hwp_para_t *set_hwp, int *cpuid,
+                          int argc, char *argv[])
+{
+    int i = 0;
+
+    if ( argc < 1 ) {
+        fprintf(stderr, "Missing arguments\n");
+        return -1;
+    }
+
+    if ( parse_cpuid_non_fatal(argv[i], cpuid) == 0 )
+    {
+        i++;
+    }
+
+    if ( i == argc ) {
+        fprintf(stderr, "Missing arguments\n");
+        return -1;
+    }
+
+    if ( strcasecmp(argv[i], "powersave") == 0 )
+    {
+        set_hwp->set_params = XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE;
+        i++;
+    }
+    else if ( strcasecmp(argv[i], "performance") == 0 )
+    {
+        set_hwp->set_params = XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE;
+        i++;
+    }
+    else if ( strcasecmp(argv[i], "balance") == 0 )
+    {
+        set_hwp->set_params = XEN_SYSCTL_HWP_SET_PRESET_BALANCE;
+        i++;
+    }
+
+    for ( ; i < argc; i++ )
+    {
+        unsigned long val;
+        char *param = argv[i];
+        char *value;
+        char *suffix;
+        int ret;
+
+        value = strchr(param, ':');
+        if ( value == NULL )
+        {
+            fprintf(stderr, "\"%s\" is an invalid hwp parameter\n", argv[i]);
+            return -1;
+        }
+
+        value[0] = '\0';
+        value++;
+
+        errno = 0;
+        val = strtoul(value, &suffix, 10);
+        if ( (errno && val == ULONG_MAX) || value == suffix )
+        {
+            fprintf(stderr, "Could not parse number \"%s\"\n", value);
+            return -1;
+        }
+
+        if ( strncasecmp(param, "act-window", strlen(param)) == 0 )
+        {
+            ret = parse_activity_window(set_hwp, val, suffix);
+            if (ret)
+                return -1;
+
+            continue;
+        }
+
+        if ( val > 255 )
+        {
+            fprintf(stderr, "\"%s\" value \"%lu\" is out of range\n", param,
+                    val);
+            return -1;
+        }
+
+        if ( suffix && suffix[0] )
+        {
+            fprintf(stderr, "Suffix \"%s\" is invalid\n", suffix);
+            return -1;
+        }
+
+        if ( strncasecmp(param, "minimum", MAX(2, strlen(param))) == 0 )
+        {
+            set_hwp->minimum = val;
+            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_MINIMUM;
+        }
+        else if ( strncasecmp(param, "maximum", MAX(2, strlen(param))) == 0 )
+        {
+            set_hwp->maximum = val;
+            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_MAXIMUM;
+        }
+        else if ( strncasecmp(param, "desired", strlen(param)) == 0 )
+        {
+            set_hwp->desired = val;
+            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_DESIRED;
+        }
+        else if ( strncasecmp(param, "energy-perf", strlen(param)) == 0 )
+        {
+            set_hwp->energy_perf = val;
+            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_ENERGY_PERF;
+        }
+        else
+        {
+            fprintf(stderr, "\"%s\" is an invalid parameter\n", param);
+            return -1;
+        }
+    }
+
+    if ( set_hwp->set_params == 0 )
+    {
+        fprintf(stderr, "No parameters set in request\n");
+        return -1;
+    }
+
+    return 0;
+}
+
+static void hwp_set_func(int argc, char *argv[])
+{
+    xc_set_hwp_para_t set_hwp = {};
+    int cpuid = -1;
+    int i = 0;
+
+    if ( parse_hwp_opts(&set_hwp, &cpuid, argc, argv) )
+        exit(EINVAL);
+
+    if ( cpuid != -1 )
+    {
+        i = cpuid;
+        max_cpu_nr = i + 1;
+    }
+
+    for ( ; i < max_cpu_nr; i++ )
+        if ( xc_set_cpufreq_hwp(xc_handle, i, &set_hwp) )
+            fprintf(stderr, "[CPU%d] failed to set hwp params (%d - %s)\n",
+                    i, errno, strerror(errno));
+}
+
 struct {
     const char *name;
     void (*function)(int argc, char *argv[]);
@@ -1309,6 +1538,7 @@ struct {
     { "get-cpufreq-average", cpufreq_func },
     { "start", start_gather_func },
     { "get-cpufreq-para", cpufreq_para_func },
+    { "set-cpufreq-hwp", hwp_set_func },
     { "set-scaling-maxfreq", scaling_max_freq_func },
     { "set-scaling-minfreq", scaling_min_freq_func },
     { "set-scaling-governor", scaling_governor_func },
-- 
2.40.0




* [PATCH v3 14/14 RESEND] CHANGELOG: Add Intel HWP entry
  2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (12 preceding siblings ...)
  2023-05-01 19:30 ` [PATCH v3 13/14 RESEND] xenpm: Add set-cpufreq-hwp subcommand Jason Andryuk
@ 2023-05-01 19:30 ` Jason Andryuk
  13 siblings, 0 replies; 53+ messages in thread
From: Jason Andryuk @ 2023-05-01 19:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Henry Wang, Community Manager

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Henry Wang <Henry.Wang@arm.com>
---
v3:
Position under existing Added section
Add Henry's Ack

v2:
Add blank line
---
 CHANGELOG.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5dbf8b06d7..2eb9e2cfd0 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -18,6 +18,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
    - Bus-lock detection, used by Xen to mitigate (by rate-limiting) the system
      wide impact of a guest misusing atomic instructions.
  - xl/libxl can customize SMBIOS strings for HVM guests.
+ - Add Intel Hardware P-States (HWP) cpufreq driver.
 
 ## [4.17.0](https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=RELEASE-4.17.0) - 2022-12-12
 
-- 
2.40.0




* Re: [PATCH v3 03/14 RESEND] cpufreq: Export intel_feature_detect
  2023-05-01 19:30 ` [PATCH v3 03/14 RESEND] cpufreq: Export intel_feature_detect Jason Andryuk
@ 2023-05-04 11:16   ` Jan Beulich
  0 siblings, 0 replies; 53+ messages in thread
From: Jan Beulich @ 2023-05-04 11:16 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 01.05.2023 21:30, Jason Andryuk wrote:
> Export feature_detect as intel_feature_detect so it can be re-used by
> HWP.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

Acked-by: Jan Beulich <jbeulich@suse.com>





* Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
  2023-05-01 19:30 ` [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
@ 2023-05-04 13:11   ` Jan Beulich
  2023-05-04 16:56     ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-04 13:11 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 01.05.2023 21:30, Jason Andryuk wrote:
> For cpufreq=xen:hwp, placing the option inside the governor wouldn't
> work.  Users would have to select the hwp-internal governor to turn off
> hwp support.

I'm afraid I don't understand this, and you'll find a comment towards
this further down. Even when ...

>  hwp-internal isn't usable without hwp, and users wouldn't
> be able to select a different governor.  That doesn't matter while hwp
> defaults off, but it would if or when hwp defaults to enabled.

... it starts defaulting to enabled, selecting another governor can
simply have the side effect of turning off hwp.

> Write to disable the interrupt - the linux pstate driver does this.  We
> don't use the interrupts, so we can just turn them off.  We aren't ready
> to handle them, so we don't want any.  Unclear if this is necessary.
> SDM says it's default disabled.

Definitely better to be on the safe side.

> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
>  available support.
>  
>  ### cpufreq
> -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
> +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`

Considering you use a special internal governor, the 4 governor alternatives are
meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
was another governor name would seem better to me. This would then also get rid
of one of the two special "no-" prefix parsing cases (which I'm not overly
happy about).

Even if not done that way I'm puzzled by the way you spell out the interaction
of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
"hwp" was specified, so even if not merged with the governors group "hwp" should
come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
way you've spelled it out it actually looks to be kind of the other way around.)

Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
was specified.

Overall I'm getting the impression that beyond your "verbose" related adjustment
more is needed, if you're meaning to get things closer to how we parse the
option (splitting across multiple lines to help see what I mean):

`= none
 | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
                          [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]}]
                          [,verbose]]}
 | dom0-kernel`

(We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
maxfreq, but better be more tight in the doc than too relaxed.)

Furthermore while max/min freq don't apply directly, there are still two MSRs
controlling bounds at the package and logical processor levels.

> --- /dev/null
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -0,0 +1,506 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> + *
> + * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
> + */
> +
> +#include <xen/cpumask.h>
> +#include <xen/init.h>
> +#include <xen/param.h>
> +#include <xen/xmalloc.h>
> +#include <asm/io.h>
> +#include <asm/msr.h>
> +#include <acpi/cpufreq/cpufreq.h>
> +
> +static bool feature_hwp;
> +static bool feature_hwp_notification;
> +static bool feature_hwp_activity_window;
> +static bool feature_hwp_energy_perf;
> +static bool feature_hwp_pkg_level_ctl;
> +static bool feature_hwp_peci;
> +
> +static bool feature_hdc;

Most (all?) of these want to be __ro_after_init, I expect.

> +__initdata bool opt_cpufreq_hwp = false;
> +__initdata bool opt_cpufreq_hdc = true;

Nit (style): Please put annotations after the type.

> +#define HWP_ENERGY_PERF_BALANCE         0x80
> +#define IA32_ENERGY_BIAS_BALANCE        0x7
> +#define IA32_ENERGY_BIAS_MAX_POWERSAVE  0xf
> +#define IA32_ENERGY_BIAS_MASK           0xf
> +
> +union hwp_request
> +{
> +    struct
> +    {
> +        uint64_t min_perf:8;
> +        uint64_t max_perf:8;
> +        uint64_t desired:8;
> +        uint64_t energy_perf:8;
> +        uint64_t activity_window:10;
> +        uint64_t package_control:1;
> +        uint64_t reserved:16;
> +        uint64_t activity_window_valid:1;
> +        uint64_t energy_perf_valid:1;
> +        uint64_t desired_valid:1;
> +        uint64_t max_perf_valid:1;
> +        uint64_t min_perf_valid:1;

The boolean fields here would probably better be of type "bool". I also
don't see the need for using uint64_t for any of the other fields -
unsigned int will be quite fine, I think. Only ...

> +    };
> +    uint64_t raw;

... this wants to keep this type. (Same again below then.)

> +bool __init hwp_available(void)
> +{
> +    unsigned int eax, ecx, unused;
> +    bool use_hwp;
> +
> +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> +    {
> +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> +                    boot_cpu_data.cpuid_level);
> +        return false;
> +    }
> +
> +    if ( boot_cpu_data.cpuid_level < 0x16 )
> +    {
> +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
> +                 boot_cpu_data.cpuid_level);
> +        return false;
> +    }
> +
> +    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
> +
> +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
> +         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
> +    {
> +        hwp_verbose("HWP disabled: No energy/performance preference available");
> +        return false;
> +    }
> +
> +    feature_hwp                 = eax & CPUID6_EAX_HWP;
> +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
> +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> +    feature_hwp_energy_perf     =
> +        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
> +    feature_hwp_pkg_level_ctl   = eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
> +    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
> +
> +    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
> +                feature_hwp, feature_hwp_notification,
> +                feature_hwp_activity_window, feature_hwp_energy_perf,
> +                feature_hwp_pkg_level_ctl, feature_hwp_peci);
> +
> +    if ( !feature_hwp )
> +        return false;
> +
> +    feature_hdc = eax & CPUID6_EAX_HDC;
> +
> +    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
> +                feature_hdc ? "" : "not ",
> +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
> +                            : "");
> +
> +    feature_hdc = feature_hdc && opt_cpufreq_hdc;
> +
> +    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
> +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");

You report this, but you don't really use it?

> +    use_hwp = feature_hwp && opt_cpufreq_hwp;

There's a lot of output you may produce until you make it here, which is
largely meaningless when opt_cpufreq_hwp == false. Is there a reason you
don't check that flag first thing in the function?

> +static void hdc_set_pkg_hdc_ctl(bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
> +
> +        return;
> +    }
> +
> +    if ( val )
> +        msr |= IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
> +    else
> +        msr &= ~IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
> +
> +    if ( wrmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> +        hwp_err("error wrmsr_safe(MSR_IA32_PKG_HDC_CTL): %016lx\n", msr);
> +}
> +
> +static void hdc_set_pm_ctl1(bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
> +
> +        return;
> +    }
> +
> +    if ( val )
> +        msr |= IA32_PM_CTL1_HDC_ALLOW_BLOCK;
> +    else
> +        msr &= ~IA32_PM_CTL1_HDC_ALLOW_BLOCK;
> +
> +    if ( wrmsr_safe(MSR_IA32_PM_CTL1, msr) )
> +        hwp_err("error wrmsr_safe(MSR_IA32_PM_CTL1): %016lx\n", msr);
> +}

For both functions: Elsewhere you also log the affected CPU in hwp_err().
Without this I'm not convinced the logging here is very useful. In fact I
wonder whether hwp_err() shouldn't take care of this and/or the "error"
part of the string literal. A HWP: prefix might also not be bad ...

> +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
> +{
> +    uint32_t base_khz, max_khz, bus_khz, edx;
> +
> +    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
> +
> +    /* aperf/mperf scales base. */
> +    policy->cpuinfo.perf_freq = base_khz * 1000;
> +    policy->cpuinfo.min_freq = base_khz * 1000;
> +    policy->cpuinfo.max_freq = max_khz * 1000;
> +    policy->min = base_khz * 1000;
> +    policy->max = max_khz * 1000;
> +    policy->cur = 0;

What is the comment intended to be telling me here?

> +static void cf_check hwp_init_msrs(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    uint64_t val;
> +
> +    /*
> +     * Package level MSR, but we don't have a good idea of packages here, so
> +     * just do it everytime.
> +     */
> +    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu);
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    /* Ensure we don't generate interrupts */
> +    if ( feature_hwp_notification )
> +        wrmsr_safe(MSR_IA32_HWP_INTERRUPT, 0);
> +
> +    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
> +    if ( !(val & IA32_PM_ENABLE_HWP_ENABLE) )
> +    {
> +        val |= IA32_PM_ENABLE_HWP_ENABLE;
> +        if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) )
> +        {
> +            hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n",
> +                    policy->cpu, val);
> +            data->curr_req.raw = -1;
> +            return;
> +        }
> +    }
> +
> +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
> +                policy->cpu);
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    if ( !feature_hwp_energy_perf ) {

Nit: Brace placement.

> +        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
> +        {
> +            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> +            data->curr_req.raw = -1;
> +
> +            return;
> +        }
> +
> +        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
> +    }

In order to not need to undo the "enable" you've already done, maybe that
should move down here? With all the sanity checking you do here, maybe
you should also check that the write of the enable bit actually took
effect?
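
To make the read-back idea concrete, here is a minimal user-space sketch.
It is not code from the patch: the fake_* helpers and the in-memory
register are hypothetical stand-ins for rdmsr_safe()/wrmsr_safe(), and only
the "write, then read again and test the bit" pattern is the point.

```c
#include <stdbool.h>
#include <stdint.h>

#define PM_ENABLE_HWP_ENABLE (UINT64_C(1) << 0)

/* Hypothetical stand-in for the MSR so the sketch runs outside Xen. */
static uint64_t fake_pm_enable;
/* When set, writes are silently dropped, mimicking a bit that won't stick. */
static bool fake_ignore_writes;

static int fake_rdmsr(uint64_t *val)
{
    *val = fake_pm_enable;
    return 0;
}

static int fake_wrmsr(uint64_t val)
{
    if ( !fake_ignore_writes )
        fake_pm_enable = val;
    return 0;
}

/* Set the enable bit, then read back to confirm it actually took effect. */
static bool enable_and_verify(void)
{
    uint64_t val;

    if ( fake_rdmsr(&val) )
        return false;

    if ( val & PM_ENABLE_HWP_ENABLE )
        return true;

    if ( fake_wrmsr(val | PM_ENABLE_HWP_ENABLE) || fake_rdmsr(&val) )
        return false;

    return val & PM_ENABLE_HWP_ENABLE;
}
```

With fake_ignore_writes set, enable_and_verify() returns false even though
the write itself reported success, which is the case a plain wrmsr_safe()
check would miss.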

> +/* val 0 - highest performance, 15 - maximum energy savings */
> +static void hwp_energy_perf_bias(const struct hwp_drv_data *data)
> +{
> +    uint64_t msr;
> +    uint8_t val = data->energy_perf;
> +
> +    ASSERT(val <= IA32_ENERGY_BIAS_MAX_POWERSAVE);
> +
> +    if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> +
> +        return;
> +    }
> +
> +    msr &= ~IA32_ENERGY_BIAS_MASK;
> +    msr |= val;
> +
> +    if ( wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
> +        hwp_err("error wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS): %016lx\n", msr);
> +}
> +
> +static void cf_check hwp_write_request(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    union hwp_request hwp_req = data->curr_req;
> +
> +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
> +    if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) )
> +    {
> +        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n",
> +                policy->cpu, hwp_req.raw);
> +        rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw);
> +    }
> +
> +    if ( !feature_hwp_energy_perf )
> +        hwp_energy_perf_bias(data);
> +
> +}
> +
> +static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
> +                                       unsigned int target_freq,
> +                                       unsigned int relation)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> +    /* Zero everything to ensure reserved bits are zero... */
> +    union hwp_request hwp_req = { .raw = 0 };
> +
> +    /* .. and update from there */
> +    hwp_req.min_perf = data->minimum;
> +    hwp_req.max_perf = data->maximum;
> +    hwp_req.desired = data->desired;
> +    if ( feature_hwp_energy_perf )
> +        hwp_req.energy_perf = data->energy_perf;
> +    if ( feature_hwp_activity_window )
> +        hwp_req.activity_window = data->activity_window;
> +
> +    if ( hwp_req.raw == data->curr_req.raw )
> +        return 0;
> +
> +    data->curr_req = hwp_req;
> +
> +    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
> +    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
> +
> +    return 0;
> +}

If I'm not mistaken these 3 functions can only be reached from the user
space tool (via set_cpufreq_para()). On that path I don't think there
should be any hwp_err(); definitely not in non-verbose mode. Instead it
would be good if a sensible error code could be reported back. (Same
then for hwp_cpufreq_update() and its helper.)

> --- a/xen/arch/x86/include/asm/cpufeature.h
> +++ b/xen/arch/x86/include/asm/cpufeature.h
> @@ -46,8 +46,17 @@ extern struct cpuinfo_x86 boot_cpu_data;
>  #define cpu_has(c, bit)		test_bit(bit, (c)->x86_capability)
>  #define boot_cpu_has(bit)	test_bit(bit, boot_cpu_data.x86_capability)
>  
> -#define CPUID_PM_LEAF                    6
> -#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
> +#define CPUID_PM_LEAF                                6
> +#define CPUID6_EAX_HWP                               (_AC(1, U) <<  7)
> +#define CPUID6_EAX_HWP_NOTIFICATION                  (_AC(1, U) <<  8)
> +#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               (_AC(1, U) <<  9)
> +#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE (_AC(1, U) << 10)
> +#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         (_AC(1, U) << 11)
> +#define CPUID6_EAX_HDC                               (_AC(1, U) << 13)
> +#define CPUID6_EAX_HWP_PECI                          (_AC(1, U) << 16)
> +#define CPUID6_EAX_HW_FEEDBACK                       (_AC(1, U) << 19)

Perhaps better without open-coding BIT()?

I also find it a little odd that e.g. bit 17 is left out here despite you
declaring the 5 "valid" bits in union hwp_request (which are qualified by
this CPUID bit afaict).

> +#define CPUID6_ECX_APERFMPERF_CAPABILITY             0x1
> +#define CPUID6_ECX_IA32_ENERGY_PERF_BIAS             0x8

Why not the same form here?
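
A rough sketch of what a uniform BIT() form could look like for both
registers.  The BIT() below is a local stand-in defined only so the snippet
compiles on its own (it assumes a two-argument position/suffix macro), and
the names CPUID6_EAX_HWP_FLEXIBLE and CPUID6_ECX_ENERGY_PERF_BIAS are
hypothetical, not taken from the patch:

```c
/* Local stand-in so the sketch builds outside the Xen tree. */
#define BIT(pos, sfx) (1 ## sfx << (pos))

#define CPUID_PM_LEAF                                6
#define CPUID6_EAX_HWP                               BIT( 7, U)
#define CPUID6_EAX_HWP_NOTIFICATION                  BIT( 8, U)
#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               BIT( 9, U)
#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE BIT(10, U)
#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         BIT(11, U)
#define CPUID6_EAX_HDC                               BIT(13, U)
#define CPUID6_EAX_HWP_PECI                          BIT(16, U)
/* Hypothetical name for bit 17, which the patch leaves out. */
#define CPUID6_EAX_HWP_FLEXIBLE                      BIT(17, U)
#define CPUID6_EAX_HW_FEEDBACK                       BIT(19, U)

/* Same form for the ECX bits; values match the 0x1 / 0x8 literals. */
#define CPUID6_ECX_APERFMPERF_CAPABILITY             BIT( 0, U)
#define CPUID6_ECX_ENERGY_PERF_BIAS                  BIT( 3, U)
```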

> --- a/xen/arch/x86/include/asm/msr-index.h
> +++ b/xen/arch/x86/include/asm/msr-index.h
> @@ -151,6 +151,13 @@
>  
>  #define MSR_PKRS                            0x000006e1
>  
> +#define MSR_IA32_PM_ENABLE                  0x00000770
> +#define  IA32_PM_ENABLE_HWP_ENABLE          (_AC(1, ULL) <<  0)
> +
> +#define MSR_IA32_HWP_CAPABILITIES           0x00000771
> +#define MSR_IA32_HWP_INTERRUPT              0x00000773
> +#define MSR_IA32_HWP_REQUEST                0x00000774

I think for new MSRs being added here in particular Andrew would like to
see the IA32 infixes omitted. (I'd extend this then to
CPUID6_ECX_IA32_ENERGY_PERF_BIAS as well.)

> @@ -165,6 +172,11 @@
>  #define  PASID_PASID_MASK                   0x000fffff
>  #define  PASID_VALID                        (_AC(1, ULL) << 31)
>  
> +#define MSR_IA32_PKG_HDC_CTL                0x00000db0
> +#define  IA32_PKG_HDC_CTL_HDC_PKG_ENABLE    (_AC(1, ULL) <<  0)

The name has two redundant infixes, which looks odd, but then I can't
suggest any better without going too much out of sync with the SDM.

> --- a/xen/drivers/cpufreq/cpufreq.c
> +++ b/xen/drivers/cpufreq/cpufreq.c
> @@ -565,6 +565,38 @@ static void cpufreq_cmdline_common_para(struct cpufreq_policy *new_policy)
>  
>  static int __init cpufreq_handle_common_option(const char *name, const char *val)
>  {
> +    if (!strcmp(name, "hdc")) {
> +        if (val) {
> +            int ret = parse_bool(val, NULL);
> +            if (ret != -1) {
> +                opt_cpufreq_hdc = ret;
> +                return 1;
> +            }
> +        } else {
> +            opt_cpufreq_hdc = true;
> +            return 1;
> +        }
> +    } else if (!strcmp(name, "no-hdc")) {
> +        opt_cpufreq_hdc = false;
> +        return 1;
> +    }

I think recognizing a "no-" prefix would want to be separated out, and be
restricted to val being NULL. It would result in val being pointed at the
string "no" (or "off" or anything else parse_bool() recognizes as a negative
indicator).
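
Purely as an illustration of that separation, a standalone sketch follows.
split_no_prefix() is a hypothetical helper (nothing of that name is in the
patch or, as far as this sketch assumes, in the tree), and it uses plain
libc calls so it can be tried outside Xen:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: strip a leading "no-" from an option name.
 * The prefix is honoured only when no explicit value was given
 * (*val == NULL); in that case *val is pointed at "no" so that a
 * later boolean parse sees a negative indicator.
 * Returns 1 if the prefix was consumed, 0 otherwise.
 */
static int split_no_prefix(const char **name, const char **val)
{
    if ( *val != NULL || strncmp(*name, "no-", 3) != 0 )
        return 0;

    *name += 3;
    *val = "no";

    return 1;
}
```

After this step "no-hdc" and "hdc=no" look identical to the per-option
code, so each option needs only its positive name.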

Yet if, as suggested above, "hwp" became a "fake" governor also when
parsing the command line, "hdc" could actually be handled in its
handle_option() hook.

Jan



* Re: [PATCH v3 05/14 RESEND] xenpm: Change get-cpufreq-para output for internal
  2023-05-01 19:30 ` [PATCH v3 05/14 RESEND] xenpm: Change get-cpufreq-para output for internal Jason Andryuk
@ 2023-05-04 14:35   ` Jan Beulich
  2023-05-04 17:00     ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-04 14:35 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, xen-devel

On 01.05.2023 21:30, Jason Andryuk wrote:
> When using HWP, some of the returned data is not applicable.  In that
> case, we should just omit it to avoid confusing the user.  So switch to
> printing the base and turbo frequencies since those are relevant to HWP.
> Similarly, stop printing the CPU frequencies since those do not apply.

It vaguely feels like I have asked this before: Can you point me at a
place in the SDM where it is said that CPUID 0x16's "Maximum Frequency"
is the turbo frequency? Without such a reference I feel a little uneasy
with ...

> @@ -720,10 +721,15 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
>          printf(" %d", p_cpufreq->affected_cpus[i]);
>      printf("\n");
>  
> -    printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
> -           p_cpufreq->cpuinfo_max_freq,
> -           p_cpufreq->cpuinfo_min_freq,
> -           p_cpufreq->cpuinfo_cur_freq);
> +    if ( internal )
> +        printf("cpuinfo frequency    : base [%u] turbo [%u]\n",
> +               p_cpufreq->cpuinfo_min_freq,
> +               p_cpufreq->cpuinfo_max_freq);

... calling it "turbo" (and not "max") here.

Jan

> +    else
> +        printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
> +               p_cpufreq->cpuinfo_max_freq,
> +               p_cpufreq->cpuinfo_min_freq,
> +               p_cpufreq->cpuinfo_cur_freq);
>  
>      printf("scaling_driver       : %s\n", p_cpufreq->scaling_driver);
>  




* Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
  2023-05-04 13:11   ` Jan Beulich
@ 2023-05-04 16:56     ` Jason Andryuk
  2023-05-05  7:01       ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-04 16:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On Thu, May 4, 2023 at 9:11 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > For cpufreq=xen:hwp, placing the option inside the governor wouldn't
> > work.  Users would have to select the hwp-internal governor to turn off
> > hwp support.
>
> I'm afraid I don't understand this, and you'll find a comment towards
> this further down. Even when ...
>
> >  hwp-internal isn't usable without hwp, and users wouldn't
> > be able to select a different governor.  That doesn't matter while hwp
> > defaults off, but it would if or when hwp defaults to enabled.
>
> ... it starts defaulting to enabled, selecting another governor can
> simply have the side effect of turning off hwp.

I didn't think of that - makes sense.

> > Write to disable the interrupt - the linux pstate driver does this.  We
> > don't use the interrupts, so we can just turn them off.  We aren't ready
> > to handle them, so we don't want any.  Unclear if this is necessary.
> > SDM says it's default disabled.
>
> Definitely better to be on the safe side.
>
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
> >  available support.
> >
> >  ### cpufreq
> > -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
> > +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
>
> Considering you use a special internal governor, the 4 governor alternatives are
> meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
> was another governor name would seem better to me. This would then also get rid
> of one of the two special "no-" prefix parsing cases (which I'm not overly
> happy about).
>
> Even if not done that way I'm puzzled by the way you spell out the interaction
> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
> "hwp" was specified, so even if not merged with the governors group "hwp" should
> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
> way you've spelled it out it actually looks to be kind of the other way around.)

I placed them in alphabetical order, but, yes, it doesn't make sense.

> Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
> was specified.
>
> Overall I'm getting the impression that beyond your "verbose" related adjustment
> more is needed, if you're meaning to get things closer to how we parse the
> option (splitting across multiple lines to help see what I mean):
>
> `= none
>  | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
>                           [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
>                           [,verbose]]}
>  | dom0-kernel`
>
> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
> maxfreq, but better be more tight in the doc than too relaxed.)
>
> Furthermore while max/min freq don't apply directly, there are still two MSRs
> controlling bounds at the package and logical processor levels.

Well, we only program the logical processor level MSRs because we
don't have a good idea of the packages to know when we can skip
writing an MSR.

How about this:
`= none
 | {{ <boolean> | xen } {
     [:{powersave|performance|ondemand|userspace}[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]
     | [:hwp[,hdc]] }
     [,verbose]]}
 | dom0-kernel`

i.e:
xen:hwp,hdc

> > --- /dev/null
> > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> > @@ -0,0 +1,506 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> > + *
> > + * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
> > + */
> > +
> > +#include <xen/cpumask.h>
> > +#include <xen/init.h>
> > +#include <xen/param.h>
> > +#include <xen/xmalloc.h>
> > +#include <asm/io.h>
> > +#include <asm/msr.h>
> > +#include <acpi/cpufreq/cpufreq.h>
> > +
> > +static bool feature_hwp;
> > +static bool feature_hwp_notification;
> > +static bool feature_hwp_activity_window;
> > +static bool feature_hwp_energy_perf;
> > +static bool feature_hwp_pkg_level_ctl;
> > +static bool feature_hwp_peci;
> > +
> > +static bool feature_hdc;
>
> Most (all?) of these want to be __ro_after_init, I expect.

I think you are correct.  (This pre-dates __ro_after_init and I didn't
update it.)

> > +__initdata bool opt_cpufreq_hwp = false;
> > +__initdata bool opt_cpufreq_hdc = true;
>
> Nit (style): Please put annotations after the type.
>
> > +#define HWP_ENERGY_PERF_BALANCE         0x80
> > +#define IA32_ENERGY_BIAS_BALANCE        0x7
> > +#define IA32_ENERGY_BIAS_MAX_POWERSAVE  0xf
> > +#define IA32_ENERGY_BIAS_MASK           0xf
> > +
> > +union hwp_request
> > +{
> > +    struct
> > +    {
> > +        uint64_t min_perf:8;
> > +        uint64_t max_perf:8;
> > +        uint64_t desired:8;
> > +        uint64_t energy_perf:8;
> > +        uint64_t activity_window:10;
> > +        uint64_t package_control:1;
> > +        uint64_t reserved:16;
> > +        uint64_t activity_window_valid:1;
> > +        uint64_t energy_perf_valid:1;
> > +        uint64_t desired_valid:1;
> > +        uint64_t max_perf_valid:1;
> > +        uint64_t min_perf_valid:1;
>
> The boolean fields here would probably better be of type "bool". I also
> don't see the need for using uint64_t for any of the other fields -
> unsigned int will be quite fine, I think. Only ...

This is the hardware MSR format, so it seemed natural to use uint64_t
and the bit fields.  To me, uint64_t foo:$bits; better shows that we
are dividing up a single hardware register using bit fields.
Honestly, I'm unfamiliar with the finer points of laying out bitfields
with bool.  And the 10 bits of activity window throw off aligning to
standard types.

This seems to have the correct layout:
struct
{
        unsigned char min_perf;
        unsigned char max_perf;
        unsigned char desired;
        unsigned char energy_perf;
        unsigned int activity_window:10;
        bool package_control:1;
        unsigned int reserved:16;
        bool activity_window_valid:1;
        bool energy_perf_valid:1;
        bool desired_valid:1;
        bool max_perf_valid:1;
        bool min_perf_valid:1;
} ;

Or would you prefer the first 8 bit ones to be unsigned int
min_perf:8?  The bools seem to need :1, which doesn't seem to be
gaining us much, IMO.  I'd strongly prefer just keeping it as I have
it, but I will change it however you like.
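
One way to check a mixed bool / unsigned int layout is a small standalone
test like the sketch below.  It is not part of the patch: the union name
and helper are hypothetical, and the result depends on the compiler's
bit-field rules, so it assumes GCC or Clang on little-endian x86-64 with
C11 (anonymous struct, static_assert).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for union hwp_request using narrower field types. */
union hwp_request_sketch {
    struct {
        unsigned int min_perf:8;
        unsigned int max_perf:8;
        unsigned int desired:8;
        unsigned int energy_perf:8;
        unsigned int activity_window:10;
        bool package_control:1;
        unsigned int reserved:16;
        bool activity_window_valid:1;
        bool energy_perf_valid:1;
        bool desired_valid:1;
        bool max_perf_valid:1;
        bool min_perf_valid:1;
    };
    uint64_t raw;
};

/* The fields must still overlay exactly one 64-bit register value. */
static_assert(sizeof(union hwp_request_sketch) == sizeof(uint64_t),
              "bit-fields no longer fit a single 64-bit MSR value");

/* Build the union from a raw register value to inspect individual fields. */
static union hwp_request_sketch hwp_request_from_raw(uint64_t raw)
{
    union hwp_request_sketch req = { .raw = raw };

    return req;
}
```

Setting single bits in raw (e.g. bit 42 or bit 63) and reading back
package_control or min_perf_valid shows whether the fields landed where the
register format expects them.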

> > +    };
> > +    uint64_t raw;
>
> ... this wants to keep this type. (Same again below then.)

For "below", do you want:

        struct
        {
            unsigned char highest;
            unsigned char guaranteed;
            unsigned char most_efficient;
            unsigned char lowest;
            unsigned int reserved;
        } hw;
?

> > +bool __init hwp_available(void)
> > +{
> > +    unsigned int eax, ecx, unused;
> > +    bool use_hwp;
> > +
> > +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> > +    {
> > +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> > +                    boot_cpu_data.cpuid_level);
> > +        return false;
> > +    }
> > +
> > +    if ( boot_cpu_data.cpuid_level < 0x16 )
> > +    {
> > +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
> > +                 boot_cpu_data.cpuid_level);
> > +        return false;
> > +    }
> > +
> > +    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
> > +
> > +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
> > +         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
> > +    {
> > +        hwp_verbose("HWP disabled: No energy/performance preference available");
> > +        return false;
> > +    }
> > +
> > +    feature_hwp                 = eax & CPUID6_EAX_HWP;
> > +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
> > +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> > +    feature_hwp_energy_perf     =
> > +        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
> > +    feature_hwp_pkg_level_ctl   = eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
> > +    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
> > +
> > +    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
> > +                feature_hwp, feature_hwp_notification,
> > +                feature_hwp_activity_window, feature_hwp_energy_perf,
> > +                feature_hwp_pkg_level_ctl, feature_hwp_peci);
> > +
> > +    if ( !feature_hwp )
> > +        return false;
> > +
> > +    feature_hdc = eax & CPUID6_EAX_HDC;
> > +
> > +    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
> > +                feature_hdc ? "" : "not ",
> > +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
> > +                            : "");
> > +
> > +    feature_hdc = feature_hdc && opt_cpufreq_hdc;
> > +
> > +    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
> > +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
>
> You report this, but you don't really use it?

Correct.  I needed to know what capabilities my processors have.

feature_hwp_pkg_level_ctl and feature_hwp_peci can also be dropped
since they aren't used beyond printing their values.  I'd still lean
toward keeping their printing under verbose since otherwise there
isn't a convenient way to know if they are available without
recompiling.

> > +    use_hwp = feature_hwp && opt_cpufreq_hwp;
>
> There's a lot of output you may produce until you make it here, which is
> largely meaningless when opt_cpufreq_hwp == false. Is there a reason you
> don't check that flag first thing in the function?

opt_cpufreq_hwp can be checked earlier for an early exit, yes.  The
code came about during development to print all the HWP capabilities
even if it wasn't enabled.  But eliminating it now makes sense.

> > +static void hdc_set_pkg_hdc_ctl(bool val)
> > +{
> > +    uint64_t msr;
> > +
> > +    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> > +    {
> > +        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
> > +
> > +        return;
> > +    }
> > +
> > +    if ( val )
> > +        msr |= IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
> > +    else
> > +        msr &= ~IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
> > +
> > +    if ( wrmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> > +        hwp_err("error wrmsr_safe(MSR_IA32_PKG_HDC_CTL): %016lx\n", msr);
> > +}
> > +
> > +static void hdc_set_pm_ctl1(bool val)
> > +{
> > +    uint64_t msr;
> > +
> > +    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
> > +    {
> > +        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
> > +
> > +        return;
> > +    }
> > +
> > +    if ( val )
> > +        msr |= IA32_PM_CTL1_HDC_ALLOW_BLOCK;
> > +    else
> > +        msr &= ~IA32_PM_CTL1_HDC_ALLOW_BLOCK;
> > +
> > +    if ( wrmsr_safe(MSR_IA32_PM_CTL1, msr) )
> > +        hwp_err("error wrmsr_safe(MSR_IA32_PM_CTL1): %016lx\n", msr);
> > +}
>
> For both functions: Elsewhere you also log the affected CPU in hwp_err().
> Without this I'm not convinced the logging here is very useful. In fact I
> wonder whether hwp_err() shouldn't take care of this and/or the "error"
> part of the string literal. A HWP: prefix might also not be bad ...

Sounds good.  I'll investigate.

> > +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
> > +{
> > +    uint32_t base_khz, max_khz, bus_khz, edx;
> > +
> > +    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
> > +
> > +    /* aperf/mperf scales base. */
> > +    policy->cpuinfo.perf_freq = base_khz * 1000;
> > +    policy->cpuinfo.min_freq = base_khz * 1000;
> > +    policy->cpuinfo.max_freq = max_khz * 1000;
> > +    policy->min = base_khz * 1000;
> > +    policy->max = max_khz * 1000;
> > +    policy->cur = 0;
>
> What is the comment intended to be telling me here?

I was surprised to discover that I needed to pass in the base
frequency for proper aperf/mperf scaling, which is the opposite of ACPI
cpufreq, so the comment seemed relevant at the time.  It can be dropped now.
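
For reference, the scaling in question can be sketched as below.  This is
a simplified illustration rather than the Xen code: avg_freq_khz() is a
hypothetical name, and it assumes the usual relationship where MPERF
advances at a fixed reference rate and APERF at the actual rate.

```c
#include <stdint.h>

/*
 * Average frequency over a sampling interval.  The ratio of the APERF
 * and MPERF deltas scales the reference frequency, which is why the
 * base frequency (not the maximum) has to be supplied.
 * Note: the multiplication can overflow for very large deltas; a real
 * implementation would need to guard against that.
 */
static uint64_t avg_freq_khz(uint64_t base_khz,
                             uint64_t aperf_delta, uint64_t mperf_delta)
{
    if ( mperf_delta == 0 )
        return 0;

    return base_khz * aperf_delta / mperf_delta;
}
```

With a 2 GHz base, deltas of 150 and 100 give 3 GHz (running above base),
while 50 and 100 give 1 GHz.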

> > +static void cf_check hwp_init_msrs(void *info)
> > +{
> > +    struct cpufreq_policy *policy = info;
> > +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> > +    uint64_t val;
> > +
> > +    /*
> > +     * Package level MSR, but we don't have a good idea of packages here, so
> > +     * just do it everytime.
> > +     */
> > +    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
> > +    {
> > +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu);
> > +        data->curr_req.raw = -1;
> > +        return;
> > +    }
> > +
> > +    /* Ensure we don't generate interrupts */
> > +    if ( feature_hwp_notification )
> > +        wrmsr_safe(MSR_IA32_HWP_INTERRUPT, 0);
> > +
> > +    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
> > +    if ( !(val & IA32_PM_ENABLE_HWP_ENABLE) )
> > +    {
> > +        val |= IA32_PM_ENABLE_HWP_ENABLE;
> > +        if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) )
> > +        {
> > +            hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n",
> > +                    policy->cpu, val);
> > +            data->curr_req.raw = -1;
> > +            return;
> > +        }
> > +    }
> > +
> > +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
> > +    {
> > +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
> > +                policy->cpu);
> > +        data->curr_req.raw = -1;
> > +        return;
> > +    }
> > +
> > +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
> > +    {
> > +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
> > +        data->curr_req.raw = -1;
> > +        return;
> > +    }
> > +
> > +    if ( !feature_hwp_energy_perf ) {
>
> Nit: Brace placement.
>
> > +        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
> > +        {
> > +            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> > +            data->curr_req.raw = -1;
> > +
> > +            return;
> > +        }
> > +
> > +        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
> > +    }
>
> In order to not need to undo the "enable" you've already done, maybe that
> should move down here?

HWP needs to be enabled before the Capabilities and Request MSRs can
be read.  Reading them shouldn't fail, but it seems safer to use
rdmsr_safe in case something goes wrong.

I think I will rip out ENERGY_PERF_BIAS.  The Linux driver doesn't
support it.  I thought it might be necessary, but my test machines
don't need it.  The Qubes report with Skylake wasn't using
ENERGY_PERF_BIAS, and Skylake introduced HWP.  The set of machines
needing it is therefore probably small and older, and probably isn't
worth supporting.

> With all the sanity checking you do here, maybe
> you should also check that the write of the enable bit actually took
> effect?

I can add that.

> > +/* val 0 - highest performance, 15 - maximum energy savings */
> > +static void hwp_energy_perf_bias(const struct hwp_drv_data *data)
> > +{
> > +    uint64_t msr;
> > +    uint8_t val = data->energy_perf;
> > +
> > +    ASSERT(val <= IA32_ENERGY_BIAS_MAX_POWERSAVE);
> > +
> > +    if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
> > +    {
> > +        hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> > +
> > +        return;
> > +    }
> > +
> > +    msr &= ~IA32_ENERGY_BIAS_MASK;
> > +    msr |= val;
> > +
> > +    if ( wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
> > +        hwp_err("error wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS): %016lx\n", msr);
> > +}
> > +
> > +static void cf_check hwp_write_request(void *info)
> > +{
> > +    struct cpufreq_policy *policy = info;
> > +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> > +    union hwp_request hwp_req = data->curr_req;
> > +
> > +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
> > +    if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) )
> > +    {
> > +        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n",
> > +                policy->cpu, hwp_req.raw);
> > +        rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw);
> > +    }
> > +
> > +    if ( !feature_hwp_energy_perf )
> > +        hwp_energy_perf_bias(data);
> > +
> > +}
> > +
> > +static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
> > +                                       unsigned int target_freq,
> > +                                       unsigned int relation)
> > +{
> > +    unsigned int cpu = policy->cpu;
> > +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> > +    /* Zero everything to ensure reserved bits are zero... */
> > +    union hwp_request hwp_req = { .raw = 0 };
> > +
> > +    /* .. and update from there */
> > +    hwp_req.min_perf = data->minimum;
> > +    hwp_req.max_perf = data->maximum;
> > +    hwp_req.desired = data->desired;
> > +    if ( feature_hwp_energy_perf )
> > +        hwp_req.energy_perf = data->energy_perf;
> > +    if ( feature_hwp_activity_window )
> > +        hwp_req.activity_window = data->activity_window;
> > +
> > +    if ( hwp_req.raw == data->curr_req.raw )
> > +        return 0;
> > +
> > +    data->curr_req = hwp_req;
> > +
> > +    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
> > +    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
> > +
> > +    return 0;
> > +}
>
> If I'm not mistaken these 3 functions can only be reached from the user
> space tool (via set_cpufreq_para()). On that path I don't think there
> should be any hwp_err(); definitely not in non-verbose mode. Instead it
> would be good if a sensible error code could be reported back. (Same
> then for hwp_cpufreq_update() and its helper.)

I'll investigate this.  I guess I'll have to stash a result in struct
hwp_drv_data.

> > --- a/xen/arch/x86/include/asm/cpufeature.h
> > +++ b/xen/arch/x86/include/asm/cpufeature.h
> > @@ -46,8 +46,17 @@ extern struct cpuinfo_x86 boot_cpu_data;
> >  #define cpu_has(c, bit)              test_bit(bit, (c)->x86_capability)
> >  #define boot_cpu_has(bit)    test_bit(bit, boot_cpu_data.x86_capability)
> >
> > -#define CPUID_PM_LEAF                    6
> > -#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
> > +#define CPUID_PM_LEAF                                6
> > +#define CPUID6_EAX_HWP                               (_AC(1, U) <<  7)
> > +#define CPUID6_EAX_HWP_NOTIFICATION                  (_AC(1, U) <<  8)
> > +#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               (_AC(1, U) <<  9)
> > +#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE (_AC(1, U) << 10)
> > +#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         (_AC(1, U) << 11)
> > +#define CPUID6_EAX_HDC                               (_AC(1, U) << 13)
> > +#define CPUID6_EAX_HWP_PECI                          (_AC(1, U) << 16)
> > +#define CPUID6_EAX_HW_FEEDBACK                       (_AC(1, U) << 19)
>
> Perhaps better without open-coding BIT()?

Ok.
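
For reference, a sketch of how the CPUID leaf 6 definitions might look once
expressed through a BIT() helper.  The BIT(pos, sfx) macro below is a local
stand-in with the two-argument shape I assume Xen's helper has, and
cpuid6_eax_has() is a hypothetical function added only to make the example
checkable; the bit positions are the ones from the quoted hunk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for a BIT(pos, sfx) helper; illustration only. */
#define BIT(pos, sfx) (1##sfx << (pos))

#define CPUID6_EAX_HWP                               BIT(7, U)
#define CPUID6_EAX_HWP_NOTIFICATION                  BIT(8, U)
#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               BIT(9, U)
#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE BIT(10, U)
#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         BIT(11, U)
#define CPUID6_EAX_HDC                               BIT(13, U)
#define CPUID6_EAX_HWP_PECI                          BIT(16, U)
#define CPUID6_EAX_HW_FEEDBACK                       BIT(19, U)

/* Same form for the ECX bits, with the IA32 infix dropped. */
#define CPUID6_ECX_APERFMPERF_CAPABILITY             BIT(0, U)
#define CPUID6_ECX_ENERGY_PERF_BIAS                  BIT(3, U)

/* Hypothetical helper: test one feature bit in a sampled EAX value. */
static bool cpuid6_eax_has(uint32_t eax, uint32_t feature)
{
    return eax & feature;
}
```

The numeric values are unchanged from the open-coded shifts (for example,
CPUID6_ECX_ENERGY_PERF_BIAS is still 0x8); only the spelling differs.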

> I also find it a little odd that e.g. bit 17 is left out here despite you
> declaring the 5 "valid" bits in union hwp_request (which are qualified by
> this CPUID bit afaict).

Well, I thought I wasn't supposed to introduce unused defines, so I
didn't add one for 17.  For union hwp_request, the "valid" bits are
part of the register structure, so it makes sense to include them
instead of an incomplete definition.  IIRC, at some point I set the
"valid" bits when I wasn't supposed to, and they caused the wrmsr
calls to fail.  That might have been because my test machines don't
have package-level HWP.

(I was confused when the CPUID section stated "Bit 17: Flexible HWP is
supported if set.", but there are no further references to "Flexible
HWP" in the SDM.)

> > +#define CPUID6_ECX_APERFMPERF_CAPABILITY             0x1
> > +#define CPUID6_ECX_IA32_ENERGY_PERF_BIAS             0x8
>
> Why not the same form here?

I was re-indenting APERFMPERF, and added ENERGY_PERF_BIAS in a
consistent style.  I will update with BIT().

> > --- a/xen/arch/x86/include/asm/msr-index.h
> > +++ b/xen/arch/x86/include/asm/msr-index.h
> > @@ -151,6 +151,13 @@
> >
> >  #define MSR_PKRS                            0x000006e1
> >
> > +#define MSR_IA32_PM_ENABLE                  0x00000770
> > +#define  IA32_PM_ENABLE_HWP_ENABLE          (_AC(1, ULL) <<  0)
> > +
> > +#define MSR_IA32_HWP_CAPABILITIES           0x00000771
> > +#define MSR_IA32_HWP_INTERRUPT              0x00000773
> > +#define MSR_IA32_HWP_REQUEST                0x00000774
>
> I think for new MSRs being added here in particular Andrew would like to
> see the IA32 infixes omitted. (I'd extend this then to
> CPUID6_ECX_IA32_ENERGY_PERF_BIAS as well.)

Ok.

> > @@ -165,6 +172,11 @@
> >  #define  PASID_PASID_MASK                   0x000fffff
> >  #define  PASID_VALID                        (_AC(1, ULL) << 31)
> >
> > +#define MSR_IA32_PKG_HDC_CTL                0x00000db0
> > +#define  IA32_PKG_HDC_CTL_HDC_PKG_ENABLE    (_AC(1, ULL) <<  0)
>
> The name has two redundant infixes, which looks odd, but then I can't
> suggest any better without going too much out of sync with the SDM.

Yes, it's not a good name, but I was trying to keep close to the SDM.
FAOD, these should drop IA32_ to become:
MSR_PKG_HDC_CTL
PKG_HDC_CTL_HDC_PKG_ENABLE
?

> > --- a/xen/drivers/cpufreq/cpufreq.c
> > +++ b/xen/drivers/cpufreq/cpufreq.c
> > @@ -565,6 +565,38 @@ static void cpufreq_cmdline_common_para(struct cpufreq_policy *new_policy)
> >
> >  static int __init cpufreq_handle_common_option(const char *name, const char *val)
> >  {
> > +    if (!strcmp(name, "hdc")) {
> > +        if (val) {
> > +            int ret = parse_bool(val, NULL);
> > +            if (ret != -1) {
> > +                opt_cpufreq_hdc = ret;
> > +                return 1;
> > +            }
> > +        } else {
> > +            opt_cpufreq_hdc = true;
> > +            return 1;
> > +        }
> > +    } else if (!strcmp(name, "no-hdc")) {
> > +        opt_cpufreq_hdc = false;
> > +        return 1;
> > +    }
>
> I think recognizing a "no-" prefix would want to be separated out, and be
> restricted to val being NULL. It would result in val being pointed at the
> string "no" (or "off" or anything else parse_bool() recognizes as negative
> indicator).
>
> Yet if, as suggested above, "hwp" became a "fake" governor also when
> parsing the command line, "hdc" could actually be handled in its
> handle_option() hook.

Makes sense.
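
One way to read the suggestion about separating out the "no-" prefix is
sketched below.  normalize_no_prefix() is a hypothetical helper, not an
existing Xen function: it only accepts the prefix when no value was given,
and rewrites the pair so the normal boolean parsing path sees "hdc" with the
value "no".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: if *name starts with "no-" and no value was given,
 * strip the prefix and substitute the value "no".  Returns false for the
 * rejected combination "no-<name>=<val>".
 */
static bool normalize_no_prefix(const char **name, const char **val)
{
    if ( strncmp(*name, "no-", 3) )
        return true;            /* No prefix: leave name and val alone. */

    if ( *val )
        return false;           /* "no-hdc=<val>" is not accepted. */

    *name += 3;                 /* "no-hdc" -> "hdc" */
    *val = "no";                /* Parsed later as a negative boolean. */

    return true;
}
```

After this step an option handler would only need a single "hdc" case that
hands val to its usual boolean parsing, instead of separate "hdc" and
"no-hdc" branches.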

Thank you for taking the time to review this.

Regards,
Jason



* Re: [PATCH v3 05/14 RESEND] xenpm: Change get-cpufreq-para output for internal
  2023-05-04 14:35   ` Jan Beulich
@ 2023-05-04 17:00     ` Jason Andryuk
  2023-05-05  7:04       ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-04 17:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, Anthony PERARD, xen-devel

On Thu, May 4, 2023 at 10:35 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > When using HWP, some of the returned data is not applicable.  In that
> > case, we should just omit it to avoid confusing the user.  So switch to
> > printing the base and turbo frequencies since those are relevant to HWP.
> > Similarly, stop printing the CPU frequencies since those do not apply.
>
> It vaguely feels like I have asked this before: Can you point me at a
> place in the SDM where it is said that CPUID 0x16's "Maximum Frequency"
> is the turbo frequency? Without such a reference I feel a little uneasy
> with ...

I don't have a reference, but I found it empirically to match the
"turbo" frequency.

For an Intel® Core™ i7-10810U,
https://ark.intel.com/content/www/us/en/ark/products/201888/intel-core-i710810u-processor-12m-cache-up-to-4-90-ghz.html

Max Turbo Frequency 4.90 GHz

# xenpm get-cpufreq-para
cpu id               : 0
affected_cpus        : 0
cpuinfo frequency    : base [1600000] turbo [4900000]

Turbo has to be enabled to reach (close to) that frequency.

From my cover letter:
This is for a 10th gen 6-core 1600 MHz base 4900 MHz max CPU.  In the
default balance mode, Turbo Boost doesn't exceed 4GHz.  Tweaking the
energy_perf preference with `xenpm set-cpufreq-hwp balance ene:64`,
I've seen the CPU hit 4.7GHz before throttling down and bouncing around
between 4.3 and 4.5 GHz.  Curiously the other cores read ~4GHz when
turbo boost takes effect.  This was done after pinning all dom0 cores,
and using taskset to pin to vCPU/pCPU 11 and running a bash tight loop.

> > @@ -720,10 +721,15 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
> >          printf(" %d", p_cpufreq->affected_cpus[i]);
> >      printf("\n");
> >
> > -    printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
> > -           p_cpufreq->cpuinfo_max_freq,
> > -           p_cpufreq->cpuinfo_min_freq,
> > -           p_cpufreq->cpuinfo_cur_freq);
> > +    if ( internal )
> > +        printf("cpuinfo frequency    : base [%u] turbo [%u]\n",
> > +               p_cpufreq->cpuinfo_min_freq,
> > +               p_cpufreq->cpuinfo_max_freq);
>
> ... calling it "turbo" (and not "max") here.

I'm fine with "max".  I think I went with turbo since it's a value you
cannot sustain but can only hit in short bursts.

Regards,
Jason



* Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
  2023-05-04 16:56     ` Jason Andryuk
@ 2023-05-05  7:01       ` Jan Beulich
  2023-05-05 15:35         ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-05  7:01 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 04.05.2023 18:56, Jason Andryuk wrote:
> On Thu, May 4, 2023 at 9:11 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>> --- a/docs/misc/xen-command-line.pandoc
>>> +++ b/docs/misc/xen-command-line.pandoc
>>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
>>>  available support.
>>>
>>>  ### cpufreq
>>> -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
>>> +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
>>
>> Considering you use a special internal governor, the 4 governor alternatives are
>> meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
>> was another governor name would seem better to me. This would then also get rid
>> of one of the two special "no-" prefix parsing cases (which I'm not overly
>> happy about).
>>
>> Even if not done that way I'm puzzled by the way you spell out the interaction
>> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
>> "hwp" was specified, so even if not merged with the governors group "hwp" should
>> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
>> way you've spelled it out it actually looks to be kind of the other way around.)
> 
> I placed them in alphabetical order, but, yes, it doesn't make sense.
> 
>> Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
>> was specified.
>>
>> Overall I'm getting the impression that beyond your "verbose" related adjustment
>> more is needed, if you're meaning to get things closer to how we parse the
>> option (splitting across multiple lines to help see what I mean):
>>
>> `= none
>>  | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
>>                           [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
>>                           [,verbose]]}
>>  | dom0-kernel`
>>
>> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
>> maxfreq, but better be more tight in the doc than too relaxed.)
>>
>> Furthermore while max/min freq don't apply directly, there are still two MSRs
>> controlling bounds at the package and logical processor levels.
> 
> Well, we only program the logical processor level MSRs because we
> don't have a good idea of the packages to know when we can skip
> writing an MSR.
> 
> How about this:
> `= none
>  | {{ <boolean> | xen } {
> [:{powersave|performance|ondemand|userspace}[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]
>                         | [:hwp[,hdc]] }
>                           [,verbose]]}
>  | dom0-kernel`

Looks right, yes.

>>> --- /dev/null
>>> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
>>> @@ -0,0 +1,506 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/*
>>> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
>>> + *
>>> + * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
>>> + */
>>> +
>>> +#include <xen/cpumask.h>
>>> +#include <xen/init.h>
>>> +#include <xen/param.h>
>>> +#include <xen/xmalloc.h>
>>> +#include <asm/io.h>
>>> +#include <asm/msr.h>
>>> +#include <acpi/cpufreq/cpufreq.h>
>>> +
>>> +static bool feature_hwp;
>>> +static bool feature_hwp_notification;
>>> +static bool feature_hwp_activity_window;
>>> +static bool feature_hwp_energy_perf;
>>> +static bool feature_hwp_pkg_level_ctl;
>>> +static bool feature_hwp_peci;
>>> +
>>> +static bool feature_hdc;
>>
>> Most (all?) of these want to be __ro_after_init, I expect.
> 
> I think you are correct.  (This pre-dates __ro_after_init and I didn't
> update it.)

Yet even then they should have used __read_mostly.

>>> +union hwp_request
>>> +{
>>> +    struct
>>> +    {
>>> +        uint64_t min_perf:8;
>>> +        uint64_t max_perf:8;
>>> +        uint64_t desired:8;
>>> +        uint64_t energy_perf:8;
>>> +        uint64_t activity_window:10;
>>> +        uint64_t package_control:1;
>>> +        uint64_t reserved:16;
>>> +        uint64_t activity_window_valid:1;
>>> +        uint64_t energy_perf_valid:1;
>>> +        uint64_t desired_valid:1;
>>> +        uint64_t max_perf_valid:1;
>>> +        uint64_t min_perf_valid:1;
>>
>> The boolean fields here would probably better be of type "bool". I also
>> don't see the need for using uint64_t for any of the other fields -
>> unsigned int will be quite fine, I think. Only ...
> 
> This is the hardware MSR format, so it seemed natural to use uint64_t
> and the bit fields.  To me, uint64_t foo:$bits; better shows that we
> are dividing up a single hardware register using bit fields.
> Honestly, I'm unfamiliar with the finer points of laying out bitfields
> with bool.  And the 10 bits of activity window throws off aligning to
> standard types.
> 
> This seems to have the correct layout:
> struct
> {
>         unsigned char min_perf;
>         unsigned char max_perf;
>         unsigned char desired;
>         unsigned char energy_perf;
>         unsigned int activity_window:10;
>         bool package_control:1;
>         unsigned int reserved:16;
>         bool activity_window_valid:1;
>         bool energy_perf_valid:1;
>         bool desired_valid:1;
>         bool max_perf_valid:1;
>         bool min_perf_valid:1;
> } ;
> 
> Or would you prefer the first 8 bit ones to be unsigned int
> min_perf:8?

Personally I think using bitfields uniformly would be better. What you
definitely cannot use if not using a bitfield is "unsigned char", it
ought to be uint8_t then. If using a bitfield, as said, I think it's
best to stick to unsigned int and bool, unless field width goes
beyond 32 bits or fields cross a 32-bit boundary.
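
To make the "uniform bitfields, unsigned int and bool" variant concrete,
here is a sketch of what the union could look like.  It is an illustration
rather than the final patch: hwp_request_raw() is a hypothetical helper, and
bit-field layout is implementation-defined, so the static assertion and the
expected bit positions assume a little-endian x86-64 target built with GCC
or Clang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

union hwp_request
{
    struct
    {
        unsigned int min_perf:8;
        unsigned int max_perf:8;
        unsigned int desired:8;
        unsigned int energy_perf:8;
        unsigned int activity_window:10;
        bool package_control:1;
        unsigned int reserved:16;
        bool activity_window_valid:1;
        bool energy_perf_valid:1;
        bool desired_valid:1;
        bool max_perf_valid:1;
        bool min_perf_valid:1;
    };
    uint64_t raw;
};

/* The bit-fields must exactly cover the 64-bit register image. */
static_assert(sizeof(union hwp_request) == sizeof(uint64_t),
              "union hwp_request must be 64 bits wide");

/* Hypothetical helper: build a raw request from the three perf values. */
static uint64_t hwp_request_raw(unsigned int min, unsigned int max,
                                unsigned int desired)
{
    /* Zero everything first so reserved and "valid" bits stay clear. */
    union hwp_request req = { .raw = 0 };

    req.min_perf = min;
    req.max_perf = max;
    req.desired = desired;

    return req.raw;
}
```

No field here is wider than 32 bits or crosses a 32-bit boundary (the first
four fields fill bits 0-31, the rest fill bits 32-63), which is the
condition given above for being able to stay with unsigned int and bool.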

>  The bools seem to need :1, which doesn't seem to be
> gaining us much, IMO.  I'd strongly prefer just keeping it as I have
> it, but I will change it however you like.

It's not so much how I like it, but to follow (a) existing practice
(for the boolean fields) and (b) ./CODING_STYLE (for the selection of
types).

>>> +    };
>>> +    uint64_t raw;
>>
>> ... this wants to keep this type. (Same again below then.)
> 
> For "below", do you want:
> 
>         struct
>         {
>             unsigned char highest;
>             unsigned char guaranteed;
>             unsigned char most_efficient;
>             unsigned char lowest;
>             unsigned int reserved;
>         } hw;
> ?

No - it can only be bitfields or fixed-width types here.

>>> +bool __init hwp_available(void)
>>> +{
>>> +    unsigned int eax, ecx, unused;
>>> +    bool use_hwp;
>>> +
>>> +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
>>> +    {
>>> +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
>>> +                    boot_cpu_data.cpuid_level);
>>> +        return false;
>>> +    }
>>> +
>>> +    if ( boot_cpu_data.cpuid_level < 0x16 )
>>> +    {
>>> +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
>>> +                 boot_cpu_data.cpuid_level);
>>> +        return false;
>>> +    }
>>> +
>>> +    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
>>> +
>>> +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
>>> +         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
>>> +    {
>>> +        hwp_verbose("HWP disabled: No energy/performance preference available");
>>> +        return false;
>>> +    }
>>> +
>>> +    feature_hwp                 = eax & CPUID6_EAX_HWP;
>>> +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
>>> +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
>>> +    feature_hwp_energy_perf     =
>>> +        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
>>> +    feature_hwp_pkg_level_ctl   = eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
>>> +    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
>>> +
>>> +    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
>>> +                feature_hwp, feature_hwp_notification,
>>> +                feature_hwp_activity_window, feature_hwp_energy_perf,
>>> +                feature_hwp_pkg_level_ctl, feature_hwp_peci);
>>> +
>>> +    if ( !feature_hwp )
>>> +        return false;
>>> +
>>> +    feature_hdc = eax & CPUID6_EAX_HDC;
>>> +
>>> +    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
>>> +                feature_hdc ? "" : "not ",
>>> +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
>>> +                            : "");
>>> +
>>> +    feature_hdc = feature_hdc && opt_cpufreq_hdc;
>>> +
>>> +    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
>>> +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
>>
>> You report this, but you don't really use it?
> 
> Correct.  I needed to know what capabilities my processors have.
> 
> feature_hwp_pkg_level_ctl and feature_hwp_peci can also be dropped
> since they aren't used beyond printing their values.  I'd still lean
> toward keeping their printing under verbose since otherwise there
> isn't a convenient way to know if they are available without
> recompiling.

That's fine, but wants mentioning in the description. Also respective
variables would want to be __initdata then, be local to the function,
or be dropped altogether. Plus you'd want to be consistent - either
you use a helper variable for all print-only features, or you don't.

>>> +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
>>> +{
>>> +    uint32_t base_khz, max_khz, bus_khz, edx;
>>> +
>>> +    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
>>> +
>>> +    /* aperf/mperf scales base. */
>>> +    policy->cpuinfo.perf_freq = base_khz * 1000;
>>> +    policy->cpuinfo.min_freq = base_khz * 1000;
>>> +    policy->cpuinfo.max_freq = max_khz * 1000;
>>> +    policy->min = base_khz * 1000;
>>> +    policy->max = max_khz * 1000;
>>> +    policy->cur = 0;
>>
>> What is the comment intended to be telling me here?
> 
> When I was surprised to discover that I needed to pass in the base
> frequency for proper aperf/mperf scaling, it seemed relevant at the
> time as it's the opposite of ACPI cpufreq.  It can be dropped now.

Well, I'm not insisting on dropping the comment. It could also be left,
but then extended so it can be understood what is meant.
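
If the comment is kept and extended, the point it is trying to make can be
shown with a small calculation.  My understanding (an assumption, not
something stated in the patch) is that MPERF ticks at a fixed reference rate
while APERF ticks at the delivered rate, so the average frequency over an
interval is the reference frequency scaled by the ratio of the two deltas.
avg_freq_khz() below is a hypothetical helper written only for this
illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: average frequency in kHz over a sampling interval,
 * given the reference (base) frequency and the APERF/MPERF deltas.
 * Note: base_khz * aperf_delta can overflow for very large deltas; a real
 * implementation would need to guard against that.
 */
static uint64_t avg_freq_khz(uint64_t base_khz, uint64_t aperf_delta,
                             uint64_t mperf_delta)
{
    if ( !mperf_delta )
        return 0;               /* No reference ticks: nothing to report. */

    return base_khz * aperf_delta / mperf_delta;
}
```

With a 1600000 kHz base and APERF advancing three times as fast as MPERF,
this gives 4800000 kHz.  Scaling by the maximum frequency instead of the
base would overstate the result, which is presumably why perf_freq is set
from base_khz.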

>>> +static void cf_check hwp_init_msrs(void *info)
>>> +{
>>> +    struct cpufreq_policy *policy = info;
>>> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
>>> +    uint64_t val;
>>> +
>>> +    /*
>>> +     * Package level MSR, but we don't have a good idea of packages here, so
>>> +     * just do it everytime.
>>> +     */
>>> +    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
>>> +    {
>>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu);
>>> +        data->curr_req.raw = -1;
>>> +        return;
>>> +    }
>>> +
>>> +    /* Ensure we don't generate interrupts */
>>> +    if ( feature_hwp_notification )
>>> +        wrmsr_safe(MSR_IA32_HWP_INTERRUPT, 0);
>>> +
>>> +    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
>>> +    if ( !(val & IA32_PM_ENABLE_HWP_ENABLE) )
>>> +    {
>>> +        val |= IA32_PM_ENABLE_HWP_ENABLE;
>>> +        if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) )
>>> +        {
>>> +            hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n",
>>> +                    policy->cpu, val);
>>> +            data->curr_req.raw = -1;
>>> +            return;
>>> +        }
>>> +    }
>>> +
>>> +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
>>> +    {
>>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
>>> +                policy->cpu);
>>> +        data->curr_req.raw = -1;
>>> +        return;
>>> +    }
>>> +
>>> +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
>>> +    {
>>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
>>> +        data->curr_req.raw = -1;
>>> +        return;
>>> +    }
>>> +
>>> +    if ( !feature_hwp_energy_perf ) {
>>
>> Nit: Brace placement.
>>
>>> +        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
>>> +        {
>>> +            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
>>> +            data->curr_req.raw = -1;
>>> +
>>> +            return;
>>> +        }
>>> +
>>> +        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
>>> +    }
>>
>> In order to not need to undo the "enable" you've already done, maybe that
>> should move down here?
> 
> HWP needs to be enabled before the Capabilities and Request MSRs can
> be read.

I must have missed this aspect in the SDM. Do you have a pointer?

>  Reading them shouldn't fail, but it seems safer to use
> rdmsr_safe in case something goes wrong.

Sure. But then the "enable" will need undoing in the unlikely event of
failure.

>>> --- a/xen/arch/x86/include/asm/cpufeature.h
>>> +++ b/xen/arch/x86/include/asm/cpufeature.h
>>> @@ -46,8 +46,17 @@ extern struct cpuinfo_x86 boot_cpu_data;
>>>  #define cpu_has(c, bit)              test_bit(bit, (c)->x86_capability)
>>>  #define boot_cpu_has(bit)    test_bit(bit, boot_cpu_data.x86_capability)
>>>
>>> -#define CPUID_PM_LEAF                    6
>>> -#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
>>> +#define CPUID_PM_LEAF                                6
>>> +#define CPUID6_EAX_HWP                               (_AC(1, U) <<  7)
>>> +#define CPUID6_EAX_HWP_NOTIFICATION                  (_AC(1, U) <<  8)
>>> +#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               (_AC(1, U) <<  9)
>>> +#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE (_AC(1, U) << 10)
>>> +#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         (_AC(1, U) << 11)
>>> +#define CPUID6_EAX_HDC                               (_AC(1, U) << 13)
>>> +#define CPUID6_EAX_HWP_PECI                          (_AC(1, U) << 16)
>>> +#define CPUID6_EAX_HW_FEEDBACK                       (_AC(1, U) << 19)
>>
>> Perhaps better without open-coding BIT()?
> 
> Ok.
> 
>> I also find it a little odd that e.g. bit 17 is left out here despite you
>> declaring the 5 "valid" bits in union hwp_request (which are qualified by
>> this CPUID bit afaict).
> 
> Well, I thought I wasn't supposed to introduce unused defines, so I
> didn't add one for 17.  For union hwp_request, the "valid" bits are
> part of the register structure, so it makes sense to include them
> instead of an incomplete definition.  IIRC, at some point I set the
> "valid" bits when I wasn't supposed to, and they caused the wrmsr
> calls to fail.  That might have been because my test machines don't
> have package-level HWP.
> 
> (I was confused when the CPUID section stated "Bit 17: Flexible HWP is
> supported if set.", but there are no further references to "Flexible
> HWP" in the SDM.)

A not uncommon issue with the SDM. At least there is a place where bit
17's purpose is described in the HWP section.

>>> @@ -165,6 +172,11 @@
>>>  #define  PASID_PASID_MASK                   0x000fffff
>>>  #define  PASID_VALID                        (_AC(1, ULL) << 31)
>>>
>>> +#define MSR_IA32_PKG_HDC_CTL                0x00000db0
>>> +#define  IA32_PKG_HDC_CTL_HDC_PKG_ENABLE    (_AC(1, ULL) <<  0)
>>
>> The name has two redundant infixes, which looks odd, but then I can't
>> suggest any better without going too much out of sync with the SDM.
> 
> Yes, it's not a good name, but I was trying to keep close to the SDM.
> FAOD, these should drop IA32_ to become:
> MSR_PKG_HDC_CTL
> PKG_HDC_CTL_HDC_PKG_ENABLE
> ?

Right.

> Thank you for taking the time to review this.

Well, it has taken me awfully long to get back to this.

Jan



* Re: [PATCH v3 05/14 RESEND] xenpm: Change get-cpufreq-para output for internal
  2023-05-04 17:00     ` Jason Andryuk
@ 2023-05-05  7:04       ` Jan Beulich
  2023-05-05 15:40         ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-05  7:04 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, xen-devel

On 04.05.2023 19:00, Jason Andryuk wrote:
> On Thu, May 4, 2023 at 10:35 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>> When using HWP, some of the returned data is not applicable.  In that
>>> case, we should just omit it to avoid confusing the user.  So switch to
>>> printing the base and turbo frequencies since those are relevant to HWP.
>>> Similarly, stop printing the CPU frequencies since those do not apply.
>>
>> It vaguely feels like I have asked this before: Can you point me at a
>> place in the SDM where it is said that CPUID 0x16's "Maximum Frequency"
>> is the turbo frequency? Without such a reference I feel a little uneasy
>> with ...
> 
> I don't have a reference, but I found it empirically to match the
> "turbo" frequency.
> 
> For an Intel® Core™ i7-10810U,
> https://ark.intel.com/content/www/us/en/ark/products/201888/intel-core-i710810u-processor-12m-cache-up-to-4-90-ghz.html
> 
> Max Turbo Frequency 4.90 GHz
> 
> # xenpm get-cpufreq-para
> cpu id               : 0
> affected_cpus        : 0
> cpuinfo frequency    : base [1600000] turbo [4900000]
> 
> Turbo has to be enabled to reach (close to) that frequency.
> 
> From my cover letter:
> This is for a 10th gen 6-core 1600 MHz base 4900 MHz max CPU.  In the
> default balance mode, Turbo Boost doesn't exceed 4GHz.  Tweaking the
> energy_perf preference with `xenpm set-cpufreq-hwp balance ene:64`,
> I've seen the CPU hit 4.7GHz before throttling down and bouncing around
> between 4.3 and 4.5 GHz.  Curiously the other cores read ~4GHz when
> turbo boost takes effect.  This was done after pinning all dom0 cores,
> and using taskset to pin to vCPU/pCPU 11 and running a bash tight loop.

Right, but what matters for the longer term future is what gets committed
(and the cover letter won't be). IOW ...

>>> @@ -720,10 +721,15 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
>>>          printf(" %d", p_cpufreq->affected_cpus[i]);
>>>      printf("\n");
>>>
>>> -    printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
>>> -           p_cpufreq->cpuinfo_max_freq,
>>> -           p_cpufreq->cpuinfo_min_freq,
>>> -           p_cpufreq->cpuinfo_cur_freq);
>>> +    if ( internal )
>>> +        printf("cpuinfo frequency    : base [%u] turbo [%u]\n",
>>> +               p_cpufreq->cpuinfo_min_freq,
>>> +               p_cpufreq->cpuinfo_max_freq);
>>
>> ... calling it "turbo" (and not "max") here.
> 
> I'm fine with "max".  I think I went with turbo since it's a value you
> cannot sustain but can only hit in short bursts.

... I don't mind you sticking to "turbo" as long as the description makes
clear why that was chosen despite the SDM not naming it this way.

Jan



* Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
  2023-05-05  7:01       ` Jan Beulich
@ 2023-05-05 15:35         ` Jason Andryuk
  2023-05-08  6:33           ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-05 15:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 04.05.2023 18:56, Jason Andryuk wrote:
> > On Thu, May 4, 2023 at 9:11 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> On 01.05.2023 21:30, Jason Andryuk wrote:
> >>> --- a/docs/misc/xen-command-line.pandoc
> >>> +++ b/docs/misc/xen-command-line.pandoc
> >>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
> >>>  available support.
> >>>
> >>>  ### cpufreq
> >>> -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
> >>> +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
> >>
> >> Considering you use a special internal governor, the 4 governor alternatives are
> >> meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
> >> was another governor name would seem better to me. This would then also get rid
> >> of one of the two special "no-" prefix parsing cases (which I'm not overly
> >> happy about).
> >>
> >> Even if not done that way I'm puzzled by the way you spell out the interaction
> >> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
> >> "hwp" was specified, so even if not merged with the governors group "hwp" should
> >> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
> >> way you've spelled it out it actually looks to be kind of the other way around.)
> >
> > I placed them in alphabetical order, but, yes, it doesn't make sense.
> >
> >> Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
> >> was specified.
> >>
> >> Overall I'm getting the impression that beyond your "verbose" related adjustment
> >> more is needed, if you're meaning to get things closer to how we parse the
> >> option (splitting across multiple lines to help see what I mean):
> >>
> >> `= none
> >>  | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
> >>                           [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
> >>                           [,verbose]]}
> >>  | dom0-kernel`
> >>
> >> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
> >> maxfreq, but better be more tight in the doc than too relaxed.)
> >>
> >> Furthermore while max/min freq don't apply directly, there are still two MSRs
> >> controlling bounds at the package and logical processor levels.
> >
> > Well, we only program the logical processor level MSRs because we
> > don't have a good idea of the packages to know when we can skip
> > writing an MSR.
> >
> > How about this:
> > `= none
> >  | {{ <boolean> | xen } {
> > [:{powersave|performance|ondemand|userspace}[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]
> >                         | [:hwp[,hdc]] }
> >                           [,verbose]]}
> >  | dom0-kernel`
>
> Looks right, yes.

There is a wrinkle to using the hwp governor.  The hwp governor was
named "hwp-internal", so it needs to be renamed to "hwp" for use with
command line parsing.  That means the checking for "-internal" needs
to change to just "hwp" which removes the generality of the original
implementation.

The other issue is that if you select "hwp" as the governor, but HWP
hardware support is not available, then hwp_available() needs to reset
the governor back to the default.  This feels like a layering
violation.

I'm still investigating, but promoting hwp to a top level option -
cpufreq=hwp - might be a better arrangement.

> >>> +union hwp_request
> >>> +{
> >>> +    struct
> >>> +    {
> >>> +        uint64_t min_perf:8;
> >>> +        uint64_t max_perf:8;
> >>> +        uint64_t desired:8;
> >>> +        uint64_t energy_perf:8;
> >>> +        uint64_t activity_window:10;
> >>> +        uint64_t package_control:1;
> >>> +        uint64_t reserved:16;
> >>> +        uint64_t activity_window_valid:1;
> >>> +        uint64_t energy_perf_valid:1;
> >>> +        uint64_t desired_valid:1;
> >>> +        uint64_t max_perf_valid:1;
> >>> +        uint64_t min_perf_valid:1;
> >>
> >> The boolean fields here would probably better be of type "bool". I also
> >> don't see the need for using uint64_t for any of the other fields -
> >> unsigned int will be quite fine, I think. Only ...
> >
> > This is the hardware MSR format, so it seemed natural to use uint64_t
> > and the bit fields.  To me, uint64_t foo:$bits; better shows that we
> > are dividing up a single hardware register using bit fields.
> > Honestly, I'm unfamiliar with the finer points of laying out bitfields
> > with bool.  And the 10 bits of activity window throws off aligning to
> > standard types.
> >
> > This seems to have the correct layout:
> > struct
> > {
> >         unsigned char min_perf;
> >         unsigned char max_perf;
> >         unsigned char desired;
> >         unsigned char energy_perf;
> >         unsigned int activity_window:10;
> >         bool package_control:1;
> >         unsigned int reserved:16;
> >         bool activity_window_valid:1;
> >         bool energy_perf_valid:1;
> >         bool desired_valid:1;
> >         bool max_perf_valid:1;
> >         bool min_perf_valid:1;
> > } ;
> >
> > Or would you prefer the first 8 bit ones to be unsigned int
> > min_perf:8?
>
> Personally I think using bitfields uniformly would be better. What you
> definitely cannot use if not using a bitfield is "unsigned char", it
> ought to be uint8_t then. If using a bitfield, as said, I think it's
> best to stick to unsigned int and bool, unless field width goes
> beyond 32 bits or fields cross a 32-bit boundary.

Ok, thanks.
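
For reference, a minimal sketch of what the uniform-bitfield variant
could look like, using only unsigned int and bool. The name
hwp_request_sketch and the helper hwp_request_pack() are made up for
illustration and are not from the patch. Bitfield layout is
implementation-defined, so this assumes GCC or Clang on little-endian
x86-64, where no field here crosses a 32-bit boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: hwp_request_sketch is a hypothetical name, not the
 * type from the patch.  Every field is a bitfield of type unsigned int
 * or bool, and none crosses a 32-bit boundary.
 */
union hwp_request_sketch
{
    struct
    {
        unsigned int min_perf:8;
        unsigned int max_perf:8;
        unsigned int desired:8;
        unsigned int energy_perf:8;
        unsigned int activity_window:10;
        bool package_control:1;
        unsigned int reserved:16;
        bool activity_window_valid:1;
        bool energy_perf_valid:1;
        bool desired_valid:1;
        bool max_perf_valid:1;
        bool min_perf_valid:1;
    };
    uint64_t raw;
};

/* The overlay is only meaningful if the struct fills the 64-bit register. */
static_assert(sizeof(union hwp_request_sketch) == sizeof(uint64_t),
              "bitfields must cover exactly 64 bits");

/* Hypothetical helper: build a raw value from the four 8-bit fields. */
static uint64_t hwp_request_pack(unsigned int min, unsigned int max,
                                 unsigned int desired, unsigned int epp)
{
    union hwp_request_sketch req = { .raw = 0 };

    req.min_perf = min;
    req.max_perf = max;
    req.desired = desired;
    req.energy_perf = epp;

    return req.raw;
}
```

The static_assert is there so that a compiler which lays the fields out
differently fails the build instead of silently producing a wrong
register image.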

> >>> +bool __init hwp_available(void)
> >>> +{
> >>> +    unsigned int eax, ecx, unused;
> >>> +    bool use_hwp;
> >>> +
> >>> +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> >>> +    {
> >>> +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> >>> +                    boot_cpu_data.cpuid_level);
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    if ( boot_cpu_data.cpuid_level < 0x16 )
> >>> +    {
> >>> +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
> >>> +                 boot_cpu_data.cpuid_level);
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
> >>> +
> >>> +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
> >>> +         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
> >>> +    {
> >>> +        hwp_verbose("HWP disabled: No energy/performance preference available");
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    feature_hwp                 = eax & CPUID6_EAX_HWP;
> >>> +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
> >>> +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> >>> +    feature_hwp_energy_perf     =
> >>> +        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
> >>> +    feature_hwp_pkg_level_ctl   = eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
> >>> +    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
> >>> +
> >>> +    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
> >>> +                feature_hwp, feature_hwp_notification,
> >>> +                feature_hwp_activity_window, feature_hwp_energy_perf,
> >>> +                feature_hwp_pkg_level_ctl, feature_hwp_peci);
> >>> +
> >>> +    if ( !feature_hwp )
> >>> +        return false;
> >>> +
> >>> +    feature_hdc = eax & CPUID6_EAX_HDC;
> >>> +
> >>> +    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
> >>> +                feature_hdc ? "" : "not ",
> >>> +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
> >>> +                            : "");
> >>> +
> >>> +    feature_hdc = feature_hdc && opt_cpufreq_hdc;
> >>> +
> >>> +    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
> >>> +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
> >>
> >> You report this, but you don't really use it?
> >
> > Correct.  I needed to know what capabilities my processors have.
> >
> > feature_hwp_pkg_level_ctl and feature_hwp_peci can also be dropped
> > since they aren't used beyond printing their values.  I'd still lean
> > toward keeping their printing under verbose since otherwise there
> > isn't a convenient way to know if they are available without
> > recompiling.
>
> That's fine, but wants mentioning in the description. Also respective
> variables would want to be __initdata then, be local to the function,
> or be dropped altogether. Plus you'd want to be consistent - either
> you use a helper variable for all print-only features, or you don't.

Got it, thanks.

> >>> +        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
> >>> +        {
> >>> +            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> >>> +            data->curr_req.raw = -1;
> >>> +
> >>> +            return;
> >>> +        }
> >>> +
> >>> +        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
> >>> +    }
> >>
> >> In order to not need to undo the "enable" you've already done, maybe that
> >> should move down here?
> >
> > HWP needs to be enabled before the Capabilities and Request MSRs can
> > be read.
>
> I must have missed this aspect in the SDM. Do you have a pointer?

In 15.4.2 Enabling HWP
Additional MSRs associated with HWP may only be accessed after HWP is
enabled, with the exception of IA32_HWP_INTERRUPT and MSR_PPERF.
Accessing the IA32_HWP_INTERRUPT MSR requires only HWP is present as
enumerated by CPUID but does not require enabling HWP.

> >  Reading them shouldn't fail, but it seems safer to use
> > rdmsr_safe in case something goes wrong.
>
> Sure. But then the "enable" will need undoing in the unlikely event of
> failure.

Yes.

Regards,
Jason



* Re: [PATCH v3 05/14 RESEND] xenpm: Change get-cpufreq-para output for internal
  2023-05-05  7:04       ` Jan Beulich
@ 2023-05-05 15:40         ` Jason Andryuk
  0 siblings, 0 replies; 53+ messages in thread
From: Jason Andryuk @ 2023-05-05 15:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, Anthony PERARD, xen-devel

On Fri, May 5, 2023 at 3:04 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>> @@ -720,10 +721,15 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
> >>>          printf(" %d", p_cpufreq->affected_cpus[i]);
> >>>      printf("\n");
> >>>
> >>> -    printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
> >>> -           p_cpufreq->cpuinfo_max_freq,
> >>> -           p_cpufreq->cpuinfo_min_freq,
> >>> -           p_cpufreq->cpuinfo_cur_freq);
> >>> +    if ( internal )
> >>> +        printf("cpuinfo frequency    : base [%u] turbo [%u]\n",
> >>> +               p_cpufreq->cpuinfo_min_freq,
> >>> +               p_cpufreq->cpuinfo_max_freq);
> >>
> >> ... calling it "turbo" (and not "max") here.
> >
> > I'm fine with "max".  I think I went with turbo since it's a value you
> > cannot sustain but can only hit in short bursts.
>
> ... I don't mind you sticking to "turbo" as long as the description makes
> clear why that was chosen despite the SDM not naming it this way.

I switched to "max" since as you point out that matches the SDM naming.

Regards,
Jason



* Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
  2023-05-05 15:35         ` Jason Andryuk
@ 2023-05-08  6:33           ` Jan Beulich
  2023-05-10 13:54             ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-08  6:33 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 05.05.2023 17:35, Jason Andryuk wrote:
> On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 04.05.2023 18:56, Jason Andryuk wrote:
>>> On Thu, May 4, 2023 at 9:11 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>>>> --- a/docs/misc/xen-command-line.pandoc
>>>>> +++ b/docs/misc/xen-command-line.pandoc
>>>>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
>>>>>  available support.
>>>>>
>>>>>  ### cpufreq
>>>>> -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
>>>>> +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
>>>>
>>>> Considering you use a special internal governor, the 4 governor alternatives are
>>>> meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
>>>> was another governor name would seem better to me. This would then also get rid
>>>> of one of the two special "no-" prefix parsing cases (which I'm not overly
>>>> happy about).
>>>>
>>>> Even if not done that way I'm puzzled by the way you spell out the interaction
>>>> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
>>>> "hwp" was specified, so even if not merged with the governors group "hwp" should
>>>> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
>>>> way you've spelled it out it actually looks to be kind of the other way around.)
>>>
>>> I placed them in alphabetical order, but, yes, it doesn't make sense.
>>>
>>>> Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
>>>> was specified.
>>>>
>>>> Overall I'm getting the impression that beyond your "verbose" related adjustment
>>>> more is needed, if you're meaning to get things closer to how we parse the
>>>> option (splitting across multiple lines to help see what I mean):
>>>>
>>>> `= none
>>>>  | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
>>>>                           [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
>>>>                           [,verbose]]}
>>>>  | dom0-kernel`
>>>>
>>>> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
>>>> maxfreq, but better be more tight in the doc than too relaxed.)
>>>>
>>>> Furthermore while max/min freq don't apply directly, there are still two MSRs
>>>> controlling bounds at the package and logical processor levels.
>>>
>>> Well, we only program the logical processor level MSRs because we
>>> don't have a good idea of the packages to know when we can skip
>>> writing an MSR.
>>>
>>> How about this:
>>> `= none
>>>  | {{ <boolean> | xen } {
>>> [:{powersave|performance|ondemand|userspace}[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]
>>>                         | [:hwp[,hdc]] }
>>>                           [,verbose]]}
>>>  | dom0-kernel`
>>
>> Looks right, yes.
> 
> There is a wrinkle to using the hwp governor.  The hwp governor was
> named "hwp-internal", so it needs to be renamed to "hwp" for use with
> command line parsing.  That means the checking for "-internal" needs
> to change to just "hwp" which removes the generality of the original
> implementation.

I'm afraid I don't see why this would strictly be necessary or a
consequence.

> The other issue is that if you select "hwp" as the governor, but HWP
> hardware support is not available, then hwp_available() needs to reset
> the governor back to the default.  This feels like a layering
> violation.

Layering violation - yes. But why would the governor need resetting in
this case? If HWP was asked for but isn't available, I don't think any
other cpufreq handling (and hence governor) should be put in place.
And turning off cpufreq altogether (if necessary in the first place)
wouldn't, to me, feel as much like a layering violation.

> I'm still investigating, but promoting hwp to a top level option -
> cpufreq=hwp - might be a better arrangement.

Might be an alternative, yes.

Jan



* Re: [PATCH v3 06/14 RESEND] xen/x86: Tweak PDC bits when using HWP
  2023-05-01 19:30 ` [PATCH v3 06/14 RESEND] xen/x86: Tweak PDC bits when using HWP Jason Andryuk
@ 2023-05-08  9:53   ` Jan Beulich
  2023-05-10 14:08     ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-08  9:53 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 01.05.2023 21:30, Jason Andryuk wrote:
> --- a/xen/arch/x86/acpi/cpufreq/hwp.c
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -13,6 +13,8 @@
>  #include <asm/msr.h>
>  #include <acpi/cpufreq/cpufreq.h>
>  
> +static bool hwp_in_use;

__ro_after_init again, please.

> --- a/xen/include/acpi/pdc_intel.h
> +++ b/xen/include/acpi/pdc_intel.h
> @@ -17,6 +17,7 @@
>  #define ACPI_PDC_C_C1_FFH		(0x0100)
>  #define ACPI_PDC_C_C2C3_FFH		(0x0200)
>  #define ACPI_PDC_SMP_P_HWCOORD		(0x0800)
> +#define ACPI_PDC_CPPC_NTV_INT		(0x1000)

I can probably live with NTV (albeit I'd prefer NATIVE), but INT is too
ambiguous for my taste: Can at least that become INTR, please?

With at least the minimal adjustments
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan



* Re: [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace
  2023-05-01 19:30 ` [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace Jason Andryuk
@ 2023-05-08 10:25   ` Jan Beulich
  2023-05-08 10:46     ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-08 10:25 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 01.05.2023 21:30, Jason Andryuk wrote:
> Extend xen_get_cpufreq_para to return hwp parameters.  These match the
> hardware rather closely.
> 
> We need the features bitmask to indicate fields supported by the actual
> hardware.
> 
> The use of uint8_t parameters matches the hardware size.  uint32_t
> entries grow the sysctl_t past the build assertion in setup.c.  The
> uint8_t ranges are supported across multiple generations, so hopefully
> they won't change.

Still it feels a little odd for values to be this narrow. Aiui the
scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
used by HWP. So you could widen the union in struct
xen_get_cpufreq_para (in a binary but not necessarily source compatible
manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
placed scaling_cur_freq could be included as well ...

> --- a/xen/arch/x86/acpi/cpufreq/hwp.c
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -506,6 +506,31 @@ static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
>      .update = hwp_cpufreq_update,
>  };
>  
> +int get_hwp_para(const struct cpufreq_policy *policy,

While I don't really mind a policy being passed into here, ...

> +                 struct xen_hwp_para *hwp_para)
> +{
> +    unsigned int cpu = policy->cpu;

... this is its only use afaics, and hence the caller could as well pass
in just a CPU number?

> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -292,6 +292,31 @@ struct xen_ondemand {
>      uint32_t up_threshold;
>  };
>  
> +struct xen_hwp_para {
> +    /*
> +     * bits 6:0   - 7bit mantissa
> +     * bits 9:7   - 3bit base-10 exponent
> +     * bits 15:10 - Unused - must be 0
> +     */
> +#define HWP_ACT_WINDOW_MANTISSA_MASK  0x7f
> +#define HWP_ACT_WINDOW_EXPONENT_MASK  0x7
> +#define HWP_ACT_WINDOW_EXPONENT_SHIFT 7
> +    uint16_t activity_window;
> +    /* energy_perf range 0-255 if 1. Otherwise 0-15 */
> +#define XEN_SYSCTL_HWP_FEAT_ENERGY_PERF (1 << 0)
> +    /* activity_window supported if 1 */
> +#define XEN_SYSCTL_HWP_FEAT_ACT_WINDOW  (1 << 1)
> +    uint8_t features; /* bit flags for features */
> +    uint8_t lowest;
> +    uint8_t most_efficient;
> +    uint8_t guaranteed;
> +    uint8_t highest;
> +    uint8_t minimum;
> +    uint8_t maximum;
> +    uint8_t desired;
> +    uint8_t energy_perf;

These fields could do with some more commentary. To be honest I had
trouble figuring (from the SDM) what exact meaning specific numeric
values have. Readers of this header should at the very least be told
where they can turn to in order to understand what these fields
communicate. (FTAOD this could be section names, but please not
section numbers. The latter are fine to use in a discussion, but
they're changing too frequently to make them useful in code
comments.)

> +};

Also, if you decide to stick to uint8_t, then the trailing padding
field (another uint8_t) wants making explicit. I'm on the edge
whether to ask to also check the field: Right here the struct is
"get only", and peeking ahead you look to be introducing a separate
sub-op for "set". Perhaps if you added /* OUT */ at the top of the
new struct? (But if you don't check the field for being zero, then
you'll want to set it to zero for forward compatibility.)

Jan



* Re: [PATCH v3 09/14 RESEND] xenpm: Print HWP parameters
  2023-05-01 19:30 ` [PATCH v3 09/14 RESEND] xenpm: Print HWP parameters Jason Andryuk
@ 2023-05-08 10:43   ` Jan Beulich
  2023-05-10 18:11     ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-08 10:43 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, xen-devel

On 01.05.2023 21:30, Jason Andryuk wrote:
> Print HWP-specific parameters.  Some are always present, but others
> depend on hardware support.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
> ---
> v2:
> Style fixes
> Declare i outside loop
> Replace repeated hardware/configured limits with spaces
> Fixup for hw_ removal
> Use XEN_HWP_GOVERNOR
> Use HWP_ACT_WINDOW_EXPONENT_*
> Remove energy_perf hw autonomous - 0 doesn't mean autonomous
> ---
>  tools/misc/xenpm.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
> 
> diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
> index ce8d7644d0..b2defde0d4 100644
> --- a/tools/misc/xenpm.c
> +++ b/tools/misc/xenpm.c
> @@ -708,6 +708,44 @@ void start_gather_func(int argc, char *argv[])
>      pause();
>  }
>  
> +static void calculate_hwp_activity_window(const xc_hwp_para_t *hwp,
> +                                          unsigned int *activity_window,
> +                                          const char **units)

The function's return value would be nice to use for one of the two
values that are being returned.

> +{
> +    unsigned int mantissa = hwp->activity_window & HWP_ACT_WINDOW_MANTISSA_MASK;
> +    unsigned int exponent =
> +        (hwp->activity_window >> HWP_ACT_WINDOW_EXPONENT_SHIFT) &
> +            HWP_ACT_WINDOW_EXPONENT_MASK;

I wish we had MASK_EXTR() in common-macros.h. While really a comment on
patch 7 - HWP_ACT_WINDOW_EXPONENT_SHIFT is redundant information and
should imo be omitted from the public interface, in favor of just a
(suitably shifted) mask value. Also note how those constants all lack
proper XEN_ prefixes.

> +    unsigned int multiplier = 1;
> +    unsigned int i;
> +
> +    if ( hwp->activity_window == 0 )
> +    {
> +        *units = "hardware selected";
> +        *activity_window = 0;
> +
> +        return;
> +    }

While in line with documentation, any mantissa of 0 results in a 0us
window, which I assume would then also mean "hardware selected".
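
To illustrate both remarks above, a minimal sketch of decoding the
activity window with already-shifted masks only, treating any zero
mantissa as a 0us window. The ACT_WINDOW_* names and act_window_to_us()
are hypothetical, and MASK_EXTR() is defined locally here as an
assumption about how such a helper would look.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: already-shifted masks, so no separate SHIFT constant. */
#define ACT_WINDOW_MANTISSA_MASK 0x007fU /* bits 6:0 */
#define ACT_WINDOW_EXPONENT_MASK 0x0380U /* bits 9:7 */

/*
 * Extract a field given only its mask: (m & -m) isolates the lowest set
 * bit of the mask, and dividing by it moves the field down to bit 0.
 */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))

/* The shifted mask alone is enough to recover the 3-bit exponent. */
static_assert(MASK_EXTR(0x0380U, ACT_WINDOW_EXPONENT_MASK) == 7,
              "exponent field must decode to 0..7");

/*
 * Decode an activity window into microseconds.  Returns 0 whenever the
 * mantissa is 0, which this sketch treats as "hardware selected".
 */
static uint32_t act_window_to_us(uint16_t activity_window)
{
    uint32_t us = MASK_EXTR(activity_window, ACT_WINDOW_MANTISSA_MASK);
    unsigned int exponent = MASK_EXTR(activity_window,
                                      ACT_WINDOW_EXPONENT_MASK);

    /* Largest result is 127 * 10^7, which still fits in 32 bits. */
    while ( exponent-- )
        us *= 10;

    return us;
}
```

A caller could then test the decoded value against 0 rather than the
raw field, so that a zero mantissa with a non-zero exponent is reported
the same way as an all-zero field.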

> @@ -773,6 +811,33 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
>                 p_cpufreq->scaling_cur_freq);
>      }
>  
> +    if ( strcmp(p_cpufreq->scaling_governor, XEN_HWP_GOVERNOR) == 0 )
> +    {
> +        const xc_hwp_para_t *hwp = &p_cpufreq->u.hwp_para;
> +
> +        printf("hwp variables        :\n");
> +        printf("  hardware limits    : lowest [%u] most_efficient [%u]\n",

Here and ...

> +               hwp->lowest, hwp->most_efficient);
> +        printf("                     : guaranteed [%u] highest [%u]\n",
> +               hwp->guaranteed, hwp->highest);
> +        printf("  configured limits  : min [%u] max [%u] energy_perf [%u]\n",

... here I wonder what use the underscores are in produced output. I'd
use blanks. If you really want a separator there, then please use
dashes.

Jan



* Re: [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace
  2023-05-08 10:25   ` Jan Beulich
@ 2023-05-08 10:46     ` Jan Beulich
  2023-05-10 17:49       ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-08 10:46 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 08.05.2023 12:25, Jan Beulich wrote:
> On 01.05.2023 21:30, Jason Andryuk wrote:
>> Extend xen_get_cpufreq_para to return hwp parameters.  These match the
>> hardware rather closely.
>>
>> We need the features bitmask to indicate fields supported by the actual
>> hardware.
>>
>> The use of uint8_t parameters matches the hardware size.  uint32_t
>> entries grow the sysctl_t past the build assertion in setup.c.  The
>> uint8_t ranges are supported across multiple generations, so hopefully
>> they won't change.
> 
> Still it feels a little odd for values to be this narrow. Aiui the
> scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
> used by HWP. So you could widen the union in struct
> xen_get_cpufreq_para (in a binary but not necessarily source compatible
> manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
> placed scaling_cur_freq could be included as well ...

Having seen patch 9 now as well, I wonder whether here (or in a separate
patch) you don't want to limit providing inapplicable data (for example
not filling *scaling_available_governors would even avoid an allocation,
thus removing a possible reason for failure), while there (or again in a
separate patch) you'd also limit what the tool reports (inapplicable
output causes confusion / questions at best).

Jan



* Re: [PATCH v3 10/14 RESEND] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  2023-05-01 19:30 ` [PATCH v3 10/14 RESEND] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
@ 2023-05-08 11:27   ` Jan Beulich
  2023-05-22 12:45     ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-08 11:27 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 01.05.2023 21:30, Jason Andryuk wrote:
> @@ -531,6 +533,100 @@ int get_hwp_para(const struct cpufreq_policy *policy,
>      return 0;
>  }
>  
> +int set_hwp_para(struct cpufreq_policy *policy,
> +                 struct xen_set_hwp_para *set_hwp)

const?

> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> +
> +    if ( data == NULL )
> +        return -EINVAL;
> +
> +    /* Validate all parameters first */
> +    if ( set_hwp->set_params & ~XEN_SYSCTL_HWP_SET_PARAM_MASK )
> +        return -EINVAL;
> +
> +    if ( set_hwp->activity_window & ~XEN_SYSCTL_HWP_ACT_WINDOW_MASK )
> +        return -EINVAL;

Below you limit checks to when the respective control bit is set. I
think you want the same here.

> +    if ( !feature_hwp_energy_perf &&
> +         (set_hwp->set_params & XEN_SYSCTL_HWP_SET_ENERGY_PERF) &&
> +         set_hwp->energy_perf > IA32_ENERGY_BIAS_MAX_POWERSAVE )
> +        return -EINVAL;
> +
> +    if ( (set_hwp->set_params & XEN_SYSCTL_HWP_SET_DESIRED) &&
> +         set_hwp->desired != 0 &&
> +         (set_hwp->desired < data->hw.lowest ||
> +          set_hwp->desired > data->hw.highest) )
> +        return -EINVAL;
> +
> +    /*
> +     * minimum & maximum are not validated as hardware doesn't seem to care
> +     * and the SDM says CPUs will clip internally.
> +     */
> +
> +    /* Apply presets */
> +    switch ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_PRESET_MASK )
> +    {
> +    case XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE:
> +        data->minimum = data->hw.lowest;
> +        data->maximum = data->hw.lowest;
> +        data->activity_window = 0;
> +        if ( feature_hwp_energy_perf )
> +            data->energy_perf = HWP_ENERGY_PERF_MAX_POWERSAVE;
> +        else
> +            data->energy_perf = IA32_ENERGY_BIAS_MAX_POWERSAVE;
> +        data->desired = 0;
> +        break;
> +
> +    case XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE:
> +        data->minimum = data->hw.highest;
> +        data->maximum = data->hw.highest;
> +        data->activity_window = 0;
> +        data->energy_perf = HWP_ENERGY_PERF_MAX_PERFORMANCE;
> +        data->desired = 0;
> +        break;
> +
> +    case XEN_SYSCTL_HWP_SET_PRESET_BALANCE:
> +        data->minimum = data->hw.lowest;
> +        data->maximum = data->hw.highest;
> +        data->activity_window = 0;
> +        if ( feature_hwp_energy_perf )
> +            data->energy_perf = HWP_ENERGY_PERF_BALANCE;
> +        else
> +            data->energy_perf = IA32_ENERGY_BIAS_BALANCE;
> +        data->desired = 0;
> +        break;
> +
> +    case XEN_SYSCTL_HWP_SET_PRESET_NONE:
> +        break;
> +
> +    default:
> +        return -EINVAL;
> +    }

So presets set all the values for which the individual item control bits
are clear. That's not exactly what I would have expected, and it took me
reading the code several times until I realized that you write live
per-CPU data fields here, not fields of some intermediate variable. I think
this could do with saying explicitly in the public header (if indeed the
intended model).
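
As a point of comparison, a minimal sketch of the "preset first, then
per-item overrides" model written against an intermediate copy, so the
live data is only touched once everything has been accepted. It also
checks activity_window only when its control bit is set. All names
(hwp_state, hwp_set, hwp_apply(), the SET_* constants) are hypothetical
simplifications and not the patch's interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* All names below are hypothetical simplifications. */
#define SET_ACT_WINDOW         (1U << 2)
#define SET_MINIMUM            (1U << 3)
#define SET_MAXIMUM            (1U << 4)
#define SET_PRESET_MASK        0xf000U
#define SET_PRESET_NONE        0x0000U
#define SET_PRESET_POWERSAVE   0x2000U
#define SET_PRESET_PERFORMANCE 0x3000U
#define SET_ITEM_MASK          (SET_ACT_WINDOW | SET_MINIMUM | SET_MAXIMUM)
#define SET_PARAM_MASK         (SET_PRESET_MASK | SET_ITEM_MASK)
#define ACT_WINDOW_MASK        0x03ffU

static_assert((SET_PRESET_MASK & SET_ITEM_MASK) == 0,
              "preset and per-item bits must not overlap");

struct hwp_state
{
    uint8_t lowest, highest;   /* hardware limits, never written here */
    uint8_t minimum, maximum;  /* configured limits */
    uint16_t activity_window;
};

struct hwp_set
{
    uint16_t set_params;       /* preset selector plus per-item bits */
    uint16_t activity_window;
    uint8_t minimum, maximum;
};

static int hwp_apply(struct hwp_state *live, const struct hwp_set *set)
{
    struct hwp_state next = *live; /* intermediate copy, not live data */

    if ( set->set_params & ~SET_PARAM_MASK )
        return -EINVAL;

    /* Only validate a value if its control bit asks for it to be used. */
    if ( (set->set_params & SET_ACT_WINDOW) &&
         (set->activity_window & ~ACT_WINDOW_MASK) )
        return -EINVAL;

    /* A preset fills in every item first ... */
    switch ( set->set_params & SET_PRESET_MASK )
    {
    case SET_PRESET_POWERSAVE:
        next.minimum = next.maximum = live->lowest;
        next.activity_window = 0;
        break;

    case SET_PRESET_PERFORMANCE:
        next.minimum = next.maximum = live->highest;
        next.activity_window = 0;
        break;

    case SET_PRESET_NONE:
        break;

    default:
        return -EINVAL;
    }

    /* ... and items whose control bit is set then override the preset. */
    if ( set->set_params & SET_MINIMUM )
        next.minimum = set->minimum;
    if ( set->set_params & SET_MAXIMUM )
        next.maximum = set->maximum;
    if ( set->set_params & SET_ACT_WINDOW )
        next.activity_window = set->activity_window;

    *live = next; /* commit only after every check has passed */

    return 0;
}
```

With this shape a request such as "powersave preset plus an explicit
maximum" leaves minimum at the hardware lowest value while maximum takes
the caller's value, and a rejected request leaves the state unchanged.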

> +    /* Further customize presets if needed */
> +    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_MINIMUM )
> +        data->minimum = set_hwp->minimum;
> +
> +    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_MAXIMUM )
> +        data->maximum = set_hwp->maximum;
> +
> +    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_ENERGY_PERF )
> +        data->energy_perf = set_hwp->energy_perf;
> +
> +    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_DESIRED )
> +        data->desired = set_hwp->desired;
> +
> +    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_ACT_WINDOW )
> +        data->activity_window = set_hwp->activity_window &
> +                                XEN_SYSCTL_HWP_ACT_WINDOW_MASK;
> +
> +    hwp_cpufreq_target(policy, 0, 0);
> +
> +    return 0;

I don't think you should assume here that hwp_cpufreq_target() will
only ever return 0. Plus by returning its return value here you
allow the compiler to tail-call optimize this code.

> --- a/xen/drivers/acpi/pmstat.c
> +++ b/xen/drivers/acpi/pmstat.c
> @@ -398,6 +398,20 @@ static int set_cpufreq_para(struct xen_sysctl_pm_op *op)
>      return ret;
>  }
>  
> +static int set_cpufreq_hwp(struct xen_sysctl_pm_op *op)

const?

> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -317,6 +317,34 @@ struct xen_hwp_para {
>      uint8_t energy_perf;
>  };
>  
> +/* set multiple values simultaneously when set_args bit is set */

What "set_args bit" does this comment refer to?

> +struct xen_set_hwp_para {
> +#define XEN_SYSCTL_HWP_SET_DESIRED              (1U << 0)
> +#define XEN_SYSCTL_HWP_SET_ENERGY_PERF          (1U << 1)
> +#define XEN_SYSCTL_HWP_SET_ACT_WINDOW           (1U << 2)
> +#define XEN_SYSCTL_HWP_SET_MINIMUM              (1U << 3)
> +#define XEN_SYSCTL_HWP_SET_MAXIMUM              (1U << 4)
> +#define XEN_SYSCTL_HWP_SET_PRESET_MASK          0xf000
> +#define XEN_SYSCTL_HWP_SET_PRESET_NONE          0x0000
> +#define XEN_SYSCTL_HWP_SET_PRESET_BALANCE       0x1000
> +#define XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE     0x2000
> +#define XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE   0x3000
> +#define XEN_SYSCTL_HWP_SET_PARAM_MASK ( \
> +                                  XEN_SYSCTL_HWP_SET_PRESET_MASK | \
> +                                  XEN_SYSCTL_HWP_SET_DESIRED     | \
> +                                  XEN_SYSCTL_HWP_SET_ENERGY_PERF | \
> +                                  XEN_SYSCTL_HWP_SET_ACT_WINDOW  | \
> +                                  XEN_SYSCTL_HWP_SET_MINIMUM     | \
> +                                  XEN_SYSCTL_HWP_SET_MAXIMUM     )
> +    uint16_t set_params; /* bitflags for valid values */
> +#define XEN_SYSCTL_HWP_ACT_WINDOW_MASK          0x03ff
> +    uint16_t activity_window; /* See comment in struct xen_hwp_para */
> +    uint8_t minimum;
> +    uint8_t maximum;
> +    uint8_t desired;
> +    uint8_t energy_perf; /* 0-255 or 0-15 depending on HW support */

Instead of (or in addition to) the "HW support" reference, could this
gain a reference to the "get para" bit determining which range to use?

Jan



* Re: [PATCH v3 13/14 RESEND] xenpm: Add set-cpufreq-hwp subcommand
  2023-05-01 19:30 ` [PATCH v3 13/14 RESEND] xenpm: Add set-cpufreq-hwp subcommand Jason Andryuk
@ 2023-05-08 11:56   ` Jan Beulich
  2023-05-08 12:00     ` Jan Beulich
  2023-05-22 12:59     ` Jason Andryuk
  0 siblings, 2 replies; 53+ messages in thread
From: Jan Beulich @ 2023-05-08 11:56 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, xen-devel

On 01.05.2023 21:30, Jason Andryuk wrote:
> @@ -67,6 +68,27 @@ void show_help(void)
>              " set-max-cstate        <num>|'unlimited' [<num2>|'unlimited']\n"
>              "                                     set the C-State limitation (<num> >= 0) and\n"
>              "                                     optionally the C-sub-state limitation (<num2> >= 0)\n"
> +            " set-cpufreq-hwp       [cpuid] [balance|performance|powersave] <param:val>*\n"
> +            "                                     set Hardware P-State (HWP) parameters\n"
> +            "                                     optionally a preset of one of\n"
> +            "                                       balance|performance|powersave\n"
> +            "                                     an optional list of param:val arguments\n"
> +            "                                       minimum:N  lowest ... highest\n"
> +            "                                       maximum:N  lowest ... highest\n"
> +            "                                       desired:N  lowest ... highest\n"

Personally I consider these three uses of "lowest ... highest" confusing:
It's not clear at all whether they're part of the option or merely mean
to express the allowable range for N (which I think they do). Perhaps ...

> +            "                                           Set explicit performance target.\n"
> +            "                                           non-zero disables auto-HWP mode.\n"
> +            "                                       energy-perf:0-255 (or 0-15)\n"

..., also taking this into account:

            "                                       energy-perf:N (0-255 or 0-15)\n"

and then use parentheses as well for the earlier value range explanations
(and again below)?

Also up from here you suddenly start having full stops on the lines. I
guess you also want to be consistent in your use of capital letters at
the start of lines (I didn't go check how consistent pre-existing code
is in this regard).

> @@ -1299,6 +1321,213 @@ void disable_turbo_mode(int argc, char *argv[])
>                  errno, strerror(errno));
>  }
>  
> +/*
> + * Parse activity_window:NNN{us,ms,s} and validate range.
> + *
> + * Activity window is a 7bit mantissa (0-127) with a 3bit exponent (0-7) base
> + * 10 in microseconds.  So the range is 1 microsecond to 1270 seconds.  A value
> + * of 0 lets the hardware autonomously select the window.
> + *
> + * Return 0 on success
> + *       -1 on error
> + */
> +static int parse_activity_window(xc_set_hwp_para_t *set_hwp, unsigned long u,
> +                                 const char *suffix)
> +{
> +    unsigned int exponent = 0;
> +    unsigned int multiplier = 1;
> +
> +    if ( suffix && suffix[0] )
> +    {
> +        if ( strcasecmp(suffix, "s") == 0 )
> +        {
> +            multiplier = 1000 * 1000;
> +            exponent = 6;
> +        }
> +        else if ( strcasecmp(suffix, "ms") == 0 )
> +        {
> +            multiplier = 1000;
> +            exponent = 3;
> +        }
> +        else if ( strcasecmp(suffix, "us") == 0 )
> +        {
> +            multiplier = 1;
> +            exponent = 0;
> +        }

Considering the initializers, this "else if" body isn't really needed,
and ...

> +        else

... instead this could become "else if ( strcmp() != 0 )".

Note also that I use strcmp() there - none of s, ms, or us are commonly
expressed by capital letters. (I wonder though whether μs shouldn't also
be recognized.)
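
A minimal sketch of the suggested shape, with the suffix handling pulled into
a standalone helper so it can be read in isolation. The helper name and the
out-parameters are assumptions made for this sketch; the patch keeps this
logic inline in parse_activity_window().

```c
#include <string.h>

/*
 * Hypothetical helper: map a unit suffix to a microsecond multiplier and a
 * base-10 exponent.  The defaults already describe "us", so only an
 * unrecognised suffix needs the final branch.
 *
 * Return 0 on success
 *       -1 on an unknown suffix
 */
static int parse_window_suffix(const char *suffix, unsigned int *multiplier,
                               unsigned int *exponent)
{
    *multiplier = 1;
    *exponent = 0;

    if ( !suffix || !suffix[0] )
        return 0;

    if ( strcmp(suffix, "s") == 0 )
    {
        *multiplier = 1000 * 1000;
        *exponent = 6;
    }
    else if ( strcmp(suffix, "ms") == 0 )
    {
        *multiplier = 1000;
        *exponent = 3;
    }
    else if ( strcmp(suffix, "us") != 0 )
        return -1;

    return 0;
}
```

With strcmp() the match is case sensitive, so "MS" is rejected rather than
being treated as milliseconds.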

> +        {
> +            fprintf(stderr, "invalid activity window units: \"%s\"\n", suffix);
> +
> +            return -1;
> +        }
> +    }
> +
> +    /* u * multipler > 1270 * 1000 * 1000 transformed to avoid overflow. */
> +    if ( u > 1270 * 1000 * 1000 / multiplier )
> +    {
> +        fprintf(stderr, "activity window is too large\n");
> +
> +        return -1;
> +    }
> +
> +    /* looking for 7 bits of mantissa and 3 bits of exponent */
> +    while ( u > 127 )
> +    {
> +        u += 5; /* Round up to mitigate truncation rounding down
> +                   e.g. 128 -> 120 vs 128 -> 130. */
> +        u /= 10;
> +        exponent += 1;
> +    }
> +
> +    set_hwp->activity_window = (exponent & HWP_ACT_WINDOW_EXPONENT_MASK) <<
> +                                   HWP_ACT_WINDOW_EXPONENT_SHIFT |

The shift wants parenthesizing against the | and the shift amount wants
indenting slightly less. (Really this would want to be MASK_INSR().)
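
For illustration, a sketch of how the encoding could read with mask-only
helpers. The two macros are modelled on Xen's MASK_EXTR()/MASK_INSR() and are
repeated here only so the fragment stands alone; the pre-shifted mask names
and the encode function are assumptions, not names from the patch.

```c
#include <stdint.h>

/* Modelled on Xen's helpers: the shift is derived from the mask's low bit. */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/* Hypothetical pre-shifted masks: bits 6:0 mantissa, bits 9:7 exponent. */
#define ACT_WINDOW_MANTISSA_MASK 0x007fu
#define ACT_WINDOW_EXPONENT_MASK 0x0380u

/* Pack mantissa and exponent without a separate shift constant. */
static uint16_t encode_activity_window(unsigned int mantissa,
                                       unsigned int exponent)
{
    return MASK_INSR(exponent, ACT_WINDOW_EXPONENT_MASK) |
           MASK_INSR(mantissa, ACT_WINDOW_MANTISSA_MASK);
}
```

Because each field is described by a single mask, no explicit shift and hence
no extra parenthesization is needed at the use site.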

> +                               (u & HWP_ACT_WINDOW_MANTISSA_MASK);
> +    set_hwp->set_params |= XEN_SYSCTL_HWP_SET_ACT_WINDOW;
> +
> +    return 0;
> +}
> +
> +static int parse_hwp_opts(xc_set_hwp_para_t *set_hwp, int *cpuid,
> +                          int argc, char *argv[])
> +{
> +    int i = 0;
> +
> +    if ( argc < 1 ) {
> +        fprintf(stderr, "Missing arguments\n");
> +        return -1;
> +    }
> +
> +    if ( parse_cpuid_non_fatal(argv[i], cpuid) == 0 )
> +    {
> +        i++;
> +    }

I don't think you need the earlier patch and the separate helper:
Whether a CPU number is present can be told by checking
isdigit(argv[i][0]).
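
A minimal sketch of that check, assuming the convention used elsewhere in
xenpm that a cpuid of -1 means all CPUs. The helper name and its return
convention are made up for this illustration.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical helper: consume a leading CPU number if one is present.
 * Returns the number of argv[] entries consumed (0 or 1).  When no number
 * is given, *cpuid is left at -1, i.e. all CPUs.
 */
static int take_optional_cpuid(int argc, char *argv[], int *cpuid)
{
    *cpuid = -1;

    if ( argc < 1 || !isdigit((unsigned char)argv[0][0]) )
        return 0;

    *cpuid = (int)strtol(argv[0], NULL, 10);

    return 1;
}
```

This variant does not recognise "all"; omitting the argument has the same
effect.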

Also (nit) note how you're mixing brace placement throughout this
function.

Jan



* Re: [PATCH v3 13/14 RESEND] xenpm: Add set-cpufreq-hwp subcommand
  2023-05-08 11:56   ` Jan Beulich
@ 2023-05-08 12:00     ` Jan Beulich
  2023-05-22 12:59     ` Jason Andryuk
  1 sibling, 0 replies; 53+ messages in thread
From: Jan Beulich @ 2023-05-08 12:00 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, xen-devel

On 08.05.2023 13:56, Jan Beulich wrote:
> On 01.05.2023 21:30, Jason Andryuk wrote:
>> +static int parse_hwp_opts(xc_set_hwp_para_t *set_hwp, int *cpuid,
>> +                          int argc, char *argv[])
>> +{
>> +    int i = 0;
>> +
>> +    if ( argc < 1 ) {
>> +        fprintf(stderr, "Missing arguments\n");
>> +        return -1;
>> +    }
>> +
>> +    if ( parse_cpuid_non_fatal(argv[i], cpuid) == 0 )
>> +    {
>> +        i++;
>> +    }
> 
> I don't think you need the earlier patch and the separate helper:
> Whether a CPU number is present can be told by checking
> isdigit(argv[i][0]).

Hmm, yes, there is "all", but your help text doesn't mention it and
since you're handling a variable number of arguments anyway, there's
no need for anyone to say "all" - they can simply omit the optional
argument.

Jan



* Re: [PATCH v3 12/14 RESEND] xenpm: Factor out a non-fatal cpuid_parse variant
  2023-05-01 19:30 ` [PATCH v3 12/14 RESEND] xenpm: Factor out a non-fatal cpuid_parse variant Jason Andryuk
@ 2023-05-08 12:01   ` Jan Beulich
  0 siblings, 0 replies; 53+ messages in thread
From: Jan Beulich @ 2023-05-08 12:01 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, xen-devel

On 01.05.2023 21:30, Jason Andryuk wrote:
> Allow cpuid_parse to be re-used without terminating xenpm.  HWP will
> re-use it to optionally parse a cpuid.  Unlike other uses of
> cpuid_parse, parse_hwp_opts will take a variable number of arguments and
> cannot just check argc.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
> ---
> v2:
> Retained because cpuid_parse handles numeric cpu numbers and "all".

Assuming you can convince me of retaining this patch:

> --- a/tools/misc/xenpm.c
> +++ b/tools/misc/xenpm.c
> @@ -79,17 +79,26 @@ void help_func(int argc, char *argv[])
>      show_help();
>  }
>  
> -static void parse_cpuid(const char *arg, int *cpuid)
> +static int parse_cpuid_non_fatal(const char *arg, int *cpuid)
>  {
>      if ( sscanf(arg, "%d", cpuid) != 1 || *cpuid < 0 )
>      {
>          if ( strcasecmp(arg, "all") )
> -        {
> -            fprintf(stderr, "Invalid CPU identifier: '%s'\n", arg);
> -            exit(EINVAL);
> -        }
> +            return -1;
> +
>          *cpuid = -1;
>      }
> +
> +    return 0;
> +}

Looks like this function wants to return bool?
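
A sketch of what a bool-returning variant might look like, keeping the
acceptance rules of the quoted hunk (a non-negative number, or "all" in any
case mapping to -1). It is an illustration only, not the version from the
series.

```c
#include <stdbool.h>
#include <stdio.h>
#include <strings.h>

/* Returns true when arg is a valid CPU number or "all", false otherwise. */
static bool parse_cpuid_non_fatal(const char *arg, int *cpuid)
{
    if ( sscanf(arg, "%d", cpuid) == 1 && *cpuid >= 0 )
        return true;

    if ( strcasecmp(arg, "all") == 0 )
    {
        *cpuid = -1; /* -1 selects all CPUs */
        return true;
    }

    return false;
}
```

Callers would then test the result directly instead of comparing against 0.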

Jan



* Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
  2023-05-08  6:33           ` Jan Beulich
@ 2023-05-10 13:54             ` Jason Andryuk
  2023-05-10 14:19               ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-10 13:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On Mon, May 8, 2023 at 2:33 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 05.05.2023 17:35, Jason Andryuk wrote:
> > On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 04.05.2023 18:56, Jason Andryuk wrote:
> >>> On Thu, May 4, 2023 at 9:11 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>> On 01.05.2023 21:30, Jason Andryuk wrote:
> >>>>> --- a/docs/misc/xen-command-line.pandoc
> >>>>> +++ b/docs/misc/xen-command-line.pandoc
> >>>>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
> >>>>>  available support.
> >>>>>
> >>>>>  ### cpufreq
> >>>>> -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
> >>>>> +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
> >>>>
> >>>> Considering you use a special internal governor, the 4 governor alternatives are
> >>>> meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
> >>>> was another governor name would seem better to me. This would then also get rid
> >>>> of one of the two special "no-" prefix parsing cases (which I'm not overly
> >>>> happy about).
> >>>>
> >>>> Even if not done that way I'm puzzled by the way you spell out the interaction
> >>>> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
> >>>> "hwp" was specified, so even if not merged with the governors group "hwp" should
> >>>> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
> >>>> way you've spelled it out it actually looks to be kind of the other way around.)
> >>>
> >>> I placed them in alphabetical order, but, yes, it doesn't make sense.
> >>>
> >>>> Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
> >>>> was specified.
> >>>>
> >>>> Overall I'm getting the impression that beyond your "verbose" related adjustment
> >>>> more is needed, if you're meaning to get things closer to how we parse the
> >>>> option (splitting across multiple lines to help see what I mean):
> >>>>
> >>>> `= none
> >>>>  | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
> >>>>                           [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
> >>>>                           [,verbose]]}
> >>>>  | dom0-kernel`
> >>>>
> >>>> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
> >>>> maxfreq, but better be more tight in the doc than too relaxed.)
> >>>>
> >>>> Furthermore while max/min freq don't apply directly, there are still two MSRs
> >>>> controlling bounds at the package and logical processor levels.
> >>>
> >>> Well, we only program the logical processor level MSRs because we
> >>> don't have a good idea of the packages to know when we can skip
> >>> writing an MSR.
> >>>
> >>> How about this:
> >>> `= none
> >>>  | {{ <boolean> | xen } {
> >>> [:{powersave|performance|ondemand|userspace}[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]
> >>>                         | [:hwp[,hdc]] }
> >>>                           [,verbose]]}
> >>>  | dom0-kernel`
> >>
> >> Looks right, yes.
> >
> > There is a wrinkle to using the hwp governor.  The hwp governor was
> > named "hwp-internal", so it needs to be renamed to "hwp" for use with
> > command line parsing.  That means the checking for "-internal" needs
> > to change to just "hwp" which removes the generality of the original
> > implementation.
>
> I'm afraid I don't see why this would strictly be necessary or a
> consequence.

Maybe I took your comment too far when you mentioned using hwp as a
fake governor.  I used the actual HWP struct cpufreq_governor to hook
into cpufreq_cmdline_parse().  cpufreq_cmdline_parse() uses that
name for comparison.  But the naming stops being an issue if struct
cpufreq_governor gains a bool .internal flag.  That flag also makes
the registration checks clearer.

> > The other issue is that if you select "hwp" as the governor, but HWP
> > hardware support is not available, then hwp_available() needs to reset
> > the governor back to the default.  This feels like a layering
> > violation.
>
> Layering violation - yes. But why would the governor need resetting in
> this case? If HWP was asked for but isn't available, I don't think any
> other cpufreq handling (and hence governor) should be put in place.
> And turning off cpufreq altogether (if necessary in the first place)
> wouldn't, to me, feel as much like a layering violation.

My goal was for Xen to use HWP if available and fall back to the acpi
cpufreq driver if not.  That to me seems more user-friendly than
disabling cpufreq.

            if ( hwp_available() )
                ret = hwp_register_driver();
            else
                ret = cpufreq_register_driver(&acpi_cpufreq_driver);

If we are setting cpufreq_opt_governor to enter hwp_available(), but
then HWP isn't available, it seems to me that we need to reset
cpufreq_opt_governor when exiting hwp_available() false.

Regards,
Jason



* Re: [PATCH v3 06/14 RESEND] xen/x86: Tweak PDC bits when using HWP
  2023-05-08  9:53   ` Jan Beulich
@ 2023-05-10 14:08     ` Jason Andryuk
  0 siblings, 0 replies; 53+ messages in thread
From: Jason Andryuk @ 2023-05-10 14:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On Mon, May 8, 2023 at 5:53 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > --- a/xen/arch/x86/acpi/cpufreq/hwp.c
> > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> > @@ -13,6 +13,8 @@
> >  #include <asm/msr.h>
> >  #include <acpi/cpufreq/cpufreq.h>
> >
> > +static bool hwp_in_use;
>
> __ro_after_init again, please.

Of course.  (I'd already made the change locally after the earlier ones.)

> > --- a/xen/include/acpi/pdc_intel.h
> > +++ b/xen/include/acpi/pdc_intel.h
> > @@ -17,6 +17,7 @@
> >  #define ACPI_PDC_C_C1_FFH            (0x0100)
> >  #define ACPI_PDC_C_C2C3_FFH          (0x0200)
> >  #define ACPI_PDC_SMP_P_HWCOORD               (0x0800)
> > +#define ACPI_PDC_CPPC_NTV_INT                (0x1000)
>
> I can probably live with NTV (albeit I'd prefer NATIVE), but INT is too
> ambiguous for my taste: Can at least that become INTR, please?

Sounds good.  I'm switching to ACPI_PDC_CPPC_NATIVE_INTR.

> With at least the minimal adjustments
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thank you.

Regards,
Jason



* Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
  2023-05-10 13:54             ` Jason Andryuk
@ 2023-05-10 14:19               ` Jan Beulich
  2023-05-12  1:02                 ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-10 14:19 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 10.05.2023 15:54, Jason Andryuk wrote:
> On Mon, May 8, 2023 at 2:33 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 05.05.2023 17:35, Jason Andryuk wrote:
>>> On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
>>> The other issue is that if you select "hwp" as the governor, but HWP
>>> hardware support is not available, then hwp_available() needs to reset
>>> the governor back to the default.  This feels like a layering
>>> violation.
>>
>> Layering violation - yes. But why would the governor need resetting in
>> this case? If HWP was asked for but isn't available, I don't think any
>> other cpufreq handling (and hence governor) should be put in place.
>> And turning off cpufreq altogether (if necessary in the first place)
>> wouldn't, to me, feel as much like a layering violation.
> 
> My goal was for Xen to use HWP if available and fallback to the acpi
> cpufreq driver if not.  That to me seems more user-friendly than
> disabling cpufreq.
> 
>             if ( hwp_available() )
>                 ret = hwp_register_driver();
>             else
>                 ret = cpufreq_register_driver(&acpi_cpufreq_driver);

That's fine as a (future) default, but for now using hwp requires a
command line option, and if that option says "hwp" then it ought to
be hwp imo.

> If we are setting cpufreq_opt_governor to enter hwp_available(), but
> then HWP isn't available, it seems to me that we need to reset
> cpufreq_opt_governor when exiting hwp_available() false.

This may be necessary in the future, but shouldn't be necessary right
now. It's not entirely clear to me what that future is going to look
like, command line option wise.

Jan



* Re: [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace
  2023-05-08 10:46     ` Jan Beulich
@ 2023-05-10 17:49       ` Jason Andryuk
  2023-05-11  6:21         ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-10 17:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Mon, May 8, 2023 at 6:26 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > Extend xen_get_cpufreq_para to return hwp parameters.  These match the
> > hardware rather closely.
> >
> > We need the features bitmask to indicated fields supported by the actual
> > hardware.
> >
> > The use of uint8_t parameters matches the hardware size.  uint32_t
> > entries grows the sysctl_t past the build assertion in setup.c.  The
> > uint8_t ranges are supported across multiple generations, so hopefully
> > they won't change.
>
> Still it feels a little odd for values to be this narrow. Aiui the
> scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
> used by HWP. So you could widen the union in struct
> xen_get_cpufreq_para (in a binary but not necessarily source compatible
> manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
> placed scaling_cur_freq could be included as well ...

The values are narrow, but they match the hardware.  It works for HWP,
so there is no need to change at this time AFAICT.

Do you want me to make this change?

> > --- a/xen/arch/x86/acpi/cpufreq/hwp.c
> > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> > @@ -506,6 +506,31 @@ static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
> >      .update = hwp_cpufreq_update,
> >  };
> >
> > +int get_hwp_para(const struct cpufreq_policy *policy,
>
> While I don't really mind a policy being passed into here, ...
>
> > +                 struct xen_hwp_para *hwp_para)
> > +{
> > +    unsigned int cpu = policy->cpu;
>
> ... this is its only use afaics, and hence the caller could as well pass
> in just a CPU number?

Sounds good.

> > --- a/xen/include/public/sysctl.h
> > +++ b/xen/include/public/sysctl.h
> > @@ -292,6 +292,31 @@ struct xen_ondemand {
> >      uint32_t up_threshold;
> >  };
> >
> > +struct xen_hwp_para {
> > +    /*
> > +     * bits 6:0   - 7bit mantissa
> > +     * bits 9:7   - 3bit base-10 exponent
> > +     * btis 15:10 - Unused - must be 0
> > +     */
> > +#define HWP_ACT_WINDOW_MANTISSA_MASK  0x7f
> > +#define HWP_ACT_WINDOW_EXPONENT_MASK  0x7
> > +#define HWP_ACT_WINDOW_EXPONENT_SHIFT 7
> > +    uint16_t activity_window;
> > +    /* energy_perf range 0-255 if 1. Otherwise 0-15 */
> > +#define XEN_SYSCTL_HWP_FEAT_ENERGY_PERF (1 << 0)
> > +    /* activity_window supported if 1 */
> > +#define XEN_SYSCTL_HWP_FEAT_ACT_WINDOW  (1 << 1)
> > +    uint8_t features; /* bit flags for features */
> > +    uint8_t lowest;
> > +    uint8_t most_efficient;
> > +    uint8_t guaranteed;
> > +    uint8_t highest;
> > +    uint8_t minimum;
> > +    uint8_t maximum;
> > +    uint8_t desired;
> > +    uint8_t energy_perf;
>
> These fields could do with some more commentary. To be honest I had
> trouble figuring (from the SDM) what exact meaning specific numeric
> values have. Readers of this header should at the very least be told
> where they can turn to in order to understand what these fields
> communicate. (FTAOD this could be section names, but please not
> section numbers. The latter are fine to use in a discussion, but
> they're changing too frequently to make them useful in code
> comments.)

Sounds good.  I'll add some description.

> > +};
>
> Also, if you decide to stick to uint8_t, then the trailing padding
> field (another uint8_t) wants making explicit. I'm on the edge
> whether to ask to also check the field: Right here the struct is
> "get only", and peeking ahead you look to be introducing a separate
> sub-op for "set". Perhaps if you added /* OUT */ at the top of the
> new struct? (But if you don't check the field for being zero, then
> you'll want to set it to zero for forward compatibility.)

Thanks for catching.  I'll add the padding field and zero it.

On Mon, May 8, 2023 at 6:47 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 08.05.2023 12:25, Jan Beulich wrote:
> > On 01.05.2023 21:30, Jason Andryuk wrote:
> >> Extend xen_get_cpufreq_para to return hwp parameters.  These match the
> >> hardware rather closely.
> >>
> >> We need the features bitmask to indicated fields supported by the actual
> >> hardware.
> >>
> >> The use of uint8_t parameters matches the hardware size.  uint32_t
> >> entries grows the sysctl_t past the build assertion in setup.c.  The
> >> uint8_t ranges are supported across multiple generations, so hopefully
> >> they won't change.
> >
> > Still it feels a little odd for values to be this narrow. Aiui the
> > scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
> > used by HWP. So you could widen the union in struct
> > xen_get_cpufreq_para (in a binary but not necessarily source compatible
> > manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
> > placed scaling_cur_freq could be included as well ...
>
> Having seen patch 9 now as well, I wonder whether here (or in a separate
> patch) you don't want to limit providing inapplicable data (for example
> not filling *scaling_available_governors would even avoid an allocation,
> thus removing a possible reason for failure), while there (or again in a
> separate patch) you'd also limit what the tool reports (inapplicable
> output causes confusion / questions at best).

The xenpm output only shows relevant information:

# xenpm get-cpufreq-para 11
cpu id               : 11
affected_cpus        : 11
cpuinfo frequency    : base [1600000] max [4900000]
scaling_driver       : hwp-cpufreq
scaling_avail_gov    : hwp
current_governor     : hwp
hwp variables        :
  hardware limits    : lowest [1] most_efficient [11]
                     : guaranteed [11] highest [49]
  configured limits  : min [1] max [255] energy_perf [128]
                     : activity_window [0 hardware selected]
                     : desired [0 hw autonomous]
turbo mode           : enabled

The scaling_*_freq values, policy->{min,max,cur} are filled in with
base, max and 0 in hwp_get_cpu_speeds(), so it's not totally invalid
values being returned.  The governor registration restricting to only
internal governors when HWP is active means it only has the single
governor.  I think it's okay as-is, but let me know if you want
something changed.

Regards,
Jason



* Re: [PATCH v3 09/14 RESEND] xenpm: Print HWP parameters
  2023-05-08 10:43   ` Jan Beulich
@ 2023-05-10 18:11     ` Jason Andryuk
  2023-05-11  6:25       ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-10 18:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, Anthony PERARD, xen-devel

On Mon, May 8, 2023 at 6:43 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > Print HWP-specific parameters.  Some are always present, but others
> > depend on hardware support.
> >
> > Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
> > ---
> > v2:
> > Style fixes
> > Declare i outside loop
> > Replace repearted hardware/configured limits with spaces
> > Fixup for hw_ removal
> > Use XEN_HWP_GOVERNOR
> > Use HWP_ACT_WINDOW_EXPONENT_*
> > Remove energy_perf hw autonomous - 0 doesn't mean autonomous
> > ---
> >  tools/misc/xenpm.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 65 insertions(+)
> >
> > diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
> > index ce8d7644d0..b2defde0d4 100644
> > --- a/tools/misc/xenpm.c
> > +++ b/tools/misc/xenpm.c
> > @@ -708,6 +708,44 @@ void start_gather_func(int argc, char *argv[])
> >      pause();
> >  }
> >
> > +static void calculate_hwp_activity_window(const xc_hwp_para_t *hwp,
> > +                                          unsigned int *activity_window,
> > +                                          const char **units)
>
> The function's return value would be nice to use for one of the two
> values that are being returned.

Ok, I'll return activity_window.

> > +{
> > +    unsigned int mantissa = hwp->activity_window & HWP_ACT_WINDOW_MANTISSA_MASK;
> > +    unsigned int exponent =
> > +        (hwp->activity_window >> HWP_ACT_WINDOW_EXPONENT_SHIFT) &
> > +            HWP_ACT_WINDOW_EXPONENT_MASK;
>
> I wish we had MASK_EXTR() in common-macros.h. While really a comment on
> patch 7 - HWP_ACT_WINDOW_EXPONENT_SHIFT is redundant information and
> should imo be omitted from the public interface, in favor of just a
> (suitably shifted) mask value. Also note how those constants all lack
> proper XEN_ prefixes.

I'll add a patch adding MASK_EXTR() & MASK_INSR() to common-macros.h
and use those - is there any reason not to do that?

I'll also add XEN_ prefixes.

> > +    unsigned int multiplier = 1;
> > +    unsigned int i;
> > +
> > +    if ( hwp->activity_window == 0 )
> > +    {
> > +        *units = "hardware selected";
> > +        *activity_window = 0;
> > +
> > +        return;
> > +    }
>
> While in line with documentation, any mantissa of 0 results in a 0us
> window, which I assume would then also mean "hardware selected".

I hadn't considered that.  The hardware seems to allow you to write a
0 mantissa, non-0 exponent.  From the SDM, it's unclear what that
would mean.  The code as written would display "0 us", "0 ms", or "0
s" - not "0 hardware selected".  Do you want more explicit printing
for those cases?  I think it's fine to have a distinction between the
output.  "0 hardware selected" is the known valid value that is
working as expected.  The other ones being something different seems
good to me since we don't really know what they mean.
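
To make the case concrete, a small sketch of the decode, assuming the layout
from the public header (bits 6:0 mantissa, bits 9:7 base-10 exponent). The
function name is hypothetical and the masks are written out as literals only
to keep the fragment standalone.

```c
#include <stdint.h>

/* Decode a raw activity_window value into microseconds. */
static uint64_t activity_window_us(uint16_t raw)
{
    unsigned int mantissa = raw & 0x7f;
    unsigned int exponent = (raw >> 7) & 0x7;
    uint64_t us = mantissa;

    /* Scale by 10^exponent; a zero mantissa stays zero for any exponent. */
    while ( exponent-- )
        us *= 10;

    return us;
}
```

A raw value of 0x180 (mantissa 0, exponent 3) therefore decodes to 0
microseconds just like a raw 0 does, but only the all-zero encoding is the
documented "hardware selected" value.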

> > @@ -773,6 +811,33 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
> >                 p_cpufreq->scaling_cur_freq);
> >      }
> >
> > +    if ( strcmp(p_cpufreq->scaling_governor, XEN_HWP_GOVERNOR) == 0 )
> > +    {
> > +        const xc_hwp_para_t *hwp = &p_cpufreq->u.hwp_para;
> > +
> > +        printf("hwp variables        :\n");
> > +        printf("  hardware limits    : lowest [%u] most_efficient [%u]\n",
>
> Here and ...
>
> > +               hwp->lowest, hwp->most_efficient);
> > +        printf("                     : guaranteed [%u] highest [%u]\n",
> > +               hwp->guaranteed, hwp->highest);
> > +        printf("  configured limits  : min [%u] max [%u] energy_perf [%u]\n",
>
> ... here I wonder what use the underscores are in produced output. I'd
> use blanks. If you really want a separator there, then please use
> dashes.

I'll use blanks.

Regards,
Jason



* Re: [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace
  2023-05-10 17:49       ` Jason Andryuk
@ 2023-05-11  6:21         ` Jan Beulich
  2023-05-11 13:49           ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-11  6:21 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 10.05.2023 19:49, Jason Andryuk wrote:
> On Mon, May 8, 2023 at 6:26 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>> Extend xen_get_cpufreq_para to return hwp parameters.  These match the
>>> hardware rather closely.
>>>
>>> We need the features bitmask to indicated fields supported by the actual
>>> hardware.
>>>
>>> The use of uint8_t parameters matches the hardware size.  uint32_t
>>> entries grows the sysctl_t past the build assertion in setup.c.  The
>>> uint8_t ranges are supported across multiple generations, so hopefully
>>> they won't change.
>>
>> Still it feels a little odd for values to be this narrow. Aiui the
>> scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
>> used by HWP. So you could widen the union in struct
>> xen_get_cpufreq_para (in a binary but not necessarily source compatible
>> manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
>> placed scaling_cur_freq could be included as well ...
> 
> The values are narrow, but they match the hardware.  It works for HWP,
> so there is no need to change at this time AFAICT.
> 
> Do you want me to make this change?

Well, much depends on what these 8-bit values actually express (I did
raise this question in one of the replies to your patches, as I wasn't
able to find anything in the SDM). That'll then hopefully allow to
make some educated prediction on how likely it is that a future
variant of hwp would want to widen them. (Was it energy_perf that went
from 4 to 8 bits at some point, which you even comment upon in the
public header?)

> On Mon, May 8, 2023 at 6:47 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 08.05.2023 12:25, Jan Beulich wrote:
>>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>>> Extend xen_get_cpufreq_para to return hwp parameters.  These match the
>>>> hardware rather closely.
>>>>
>>>> We need the features bitmask to indicated fields supported by the actual
>>>> hardware.
>>>>
>>>> The use of uint8_t parameters matches the hardware size.  uint32_t
>>>> entries grows the sysctl_t past the build assertion in setup.c.  The
>>>> uint8_t ranges are supported across multiple generations, so hopefully
>>>> they won't change.
>>>
>>> Still it feels a little odd for values to be this narrow. Aiui the
>>> scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
>>> used by HWP. So you could widen the union in struct
>>> xen_get_cpufreq_para (in a binary but not necessarily source compatible
>>> manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
>>> placed scaling_cur_freq could be included as well ...
>>
>> Having seen patch 9 now as well, I wonder whether here (or in a separate
>> patch) you don't want to limit providing inapplicable data (for example
>> not filling *scaling_available_governors would even avoid an allocation,
>> thus removing a possible reason for failure), while there (or again in a
>> separate patch) you'd also limit what the tool reports (inapplicable
>> output causes confusion / questions at best).
> 
> The xenpm output only shows relevant information:
> 
> # xenpm get-cpufreq-para 11
> cpu id               : 11
> affected_cpus        : 11
> cpuinfo frequency    : base [1600000] max [4900000]
> scaling_driver       : hwp-cpufreq
> scaling_avail_gov    : hwp
> current_governor     : hwp
> hwp variables        :
>   hardware limits    : lowest [1] most_efficient [11]
>                      : guaranteed [11] highest [49]
>   configured limits  : min [1] max [255] energy_perf [128]
>                      : activity_window [0 hardware selected]
>                      : desired [0 hw autonomous]
> turbo mode           : enabled
> 
> The scaling_*_freq values, policy->{min,max,cur} are filled in with
> base, max and 0 in hwp_get_cpu_speeds(), so it's not totally invalid
> values being returned.  The governor registration restricting to only
> internal governors when HWP is active means it only has the single
> governor.  I think it's okay as-is, but let me know if you want
> something changed.

Well, the main connection here was to the possible overloading of
space in the sysctl struct, by widening what the union covers. That
can of course only be done for fields which don't convey useful data.
If we go without the further overloading, I guess we can for now leave
the tool as you have it, and deal with possible tidying later on.

Jan



* Re: [PATCH v3 09/14 RESEND] xenpm: Print HWP parameters
  2023-05-10 18:11     ` Jason Andryuk
@ 2023-05-11  6:25       ` Jan Beulich
  0 siblings, 0 replies; 53+ messages in thread
From: Jan Beulich @ 2023-05-11  6:25 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, xen-devel

On 10.05.2023 20:11, Jason Andryuk wrote:
> On Mon, May 8, 2023 at 6:43 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>> --- a/tools/misc/xenpm.c
>>> +++ b/tools/misc/xenpm.c
>>> @@ -708,6 +708,44 @@ void start_gather_func(int argc, char *argv[])
>>>      pause();
>>>  }
>>>
>>> +static void calculate_hwp_activity_window(const xc_hwp_para_t *hwp,
>>> +                                          unsigned int *activity_window,
>>> +                                          const char **units)
>>> +{
>>> +    unsigned int mantissa = hwp->activity_window & HWP_ACT_WINDOW_MANTISSA_MASK;
>>> +    unsigned int exponent =
>>> +        (hwp->activity_window >> HWP_ACT_WINDOW_EXPONENT_SHIFT) &
>>> +            HWP_ACT_WINDOW_EXPONENT_MASK;
>>
>> I wish we had MASK_EXTR() in common-macros.h. While really a comment on
>> patch 7 - HWP_ACT_WINDOW_EXPONENT_SHIFT is redundant information and
>> should imo be omitted from the public interface, in favor of just a
>> (suitably shifted) mask value. Also note how those constants all lack
>> proper XEN_ prefixes.
> 
> I'll add a patch adding MASK_EXTR() & MASK_INSR() to common-macros.h
> and use those - is there any reason not to do that?

I don't think there is, but I'm also not a maintainer of that code.

>>> +    unsigned int multiplier = 1;
>>> +    unsigned int i;
>>> +
>>> +    if ( hwp->activity_window == 0 )
>>> +    {
>>> +        *units = "hardware selected";
>>> +        *activity_window = 0;
>>> +
>>> +        return;
>>> +    }
>>
>> While in line with documentation, any mantissa of 0 results in a 0us
>> window, which I assume would then also mean "hardware selected".
> 
> I hadn't considered that.  The hardware seems to allow you to write a
> 0 mantissa, non-0 exponent.  From the SDM, it's unclear what that
> would mean.  The code as written would display "0 us", "0 ms", or "0
> s" - not "0 hardware selected".  Do you want more explicit printing
> for those cases?  I think it's fine to have a distinction between the
> output.  "0 hardware selected" is the known valid value that is
> working as expected.  The other ones being something different seems
> good to me since we don't really know what they mean.

Keeping things - apart from perhaps adding a respective comment - is
okay, as long as we don't know any better.
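
To make the encoding under discussion concrete, here is a minimal sketch of
the decode.  It assumes the SDM layout of a 7-bit mantissa in bits 6:0 and a
3-bit base-10 exponent in bits 9:7, with the result in microseconds.  The
ACT_WINDOW_* names and act_window_to_us() are hypothetical, and the masks are
pre-shifted as suggested for the public interface.  MASK_EXTR() is written out
in the shape Xen uses so the block stands alone.

```c
#include <stdint.h>

/* Extract the field selected by mask m from v (same shape as Xen's MASK_EXTR()). */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))

/* Hypothetical, pre-shifted masks: no separate SHIFT constant is needed. */
#define ACT_WINDOW_MANTISSA_MASK 0x07fu /* bits 6:0 */
#define ACT_WINDOW_EXPONENT_MASK 0x380u /* bits 9:7 */

/* Decode a raw 10-bit activity window into microseconds. */
static uint64_t act_window_to_us(unsigned int raw)
{
    unsigned int mantissa = MASK_EXTR(raw, ACT_WINDOW_MANTISSA_MASK);
    unsigned int exponent = MASK_EXTR(raw, ACT_WINDOW_EXPONENT_MASK);
    uint64_t us = mantissa;

    /* Window is mantissa * 10^exponent microseconds. */
    while ( exponent-- )
        us *= 10;

    return us;
}
```

A raw value of 0 decodes to 0 (the "hardware selected" case), and so does any
value with a zero mantissa and a non-zero exponent.  The two can therefore only
be told apart by looking at the raw field, not at the decoded window.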

Jan



* Re: [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace
  2023-05-11  6:21         ` Jan Beulich
@ 2023-05-11 13:49           ` Jason Andryuk
  2023-05-11 14:10             ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-11 13:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Thu, May 11, 2023 at 2:21 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 10.05.2023 19:49, Jason Andryuk wrote:
> > On Mon, May 8, 2023 at 6:26 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 01.05.2023 21:30, Jason Andryuk wrote:
> >>> Extend xen_get_cpufreq_para to return hwp parameters.  These match the
> >>> hardware rather closely.
> >>>
> > >>> We need the features bitmask to indicate fields supported by the actual
> > >>> hardware.
> > >>>
> > >>> The use of uint8_t parameters matches the hardware size.  uint32_t
> > >>> entries grow the sysctl_t past the build assertion in setup.c.  The
> >>> uint8_t ranges are supported across multiple generations, so hopefully
> >>> they won't change.
> >>
> >> Still it feels a little odd for values to be this narrow. Aiui the
> >> scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
> >> used by HWP. So you could widen the union in struct
> >> xen_get_cpufreq_para (in a binary but not necessarily source compatible
> >> manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
> >> placed scaling_cur_freq could be included as well ...
> >
> > The values are narrow, but they match the hardware.  It works for HWP,
> > so there is no need to change at this time AFAICT.
> >
> > Do you want me to make this change?
>
> Well, much depends on what these 8-bit values actually express (I did
> raise this question in one of the replies to your patches, as I wasn't
> able to find anything in the SDM). That'll then hopefully allow to
> make some educated prediction on how likely it is that a future
> variant of hwp would want to widen them.

Sorry for not providing a reference earlier.  In the SDM,
HARDWARE-CONTROLLED PERFORMANCE STATES (HWP) section, there is this
second paragraph:
"""
In contrast, HWP is an implementation of the ACPI-defined
Collaborative Processor Performance Control (CPPC), which specifies
that the platform enumerates a continuous, abstract unit-less,
performance value scale that is not tied to a specific performance
state / frequency by definition. While the enumerated scale is roughly
linear in terms of a delivered integer workload performance result,
the OS is required to characterize the performance value range to
comprehend the delivered performance for an applied workload.
"""

The numbers are "continuous, abstract unit-less, performance value."
So there isn't much to go on there, but generally, smaller numbers
mean slower and bigger numbers mean faster.

Cross referencing the ACPI spec here:
https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html#collaborative-processor-performance-control

Scrolling down you can find the register entries such as

Highest Performance
Register or DWORD Attribute:  Read
Size:                         8-32 bits

AMD has its own pstate implementation that is similar to HWP.  Looking
at the Linux support, the AMD hardware also uses 8 bit values for the
comparable fields:
https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/msr-index.h#L612

So Intel and AMD are 8bit for now at least.  Something could do 32bits
according to the ACPI spec.

8 bits of granularity for slow to fast seems like plenty to me.  I'm
not sure what one would gain from 16 or 32 bits, but I'm not designing
the hardware.  From the earlier xenpm output, "highest" was 49, so
still a decent amount of room in an 8 bit range.

> (Was it energy_perf that went
> from 4 to 8 bits at some point, which you even comment upon in the
> public header?)

energy_perf (Energy_Performance_Preference) had a fallback: "If
CPUID.06H:EAX[bit 10] indicates that this field is not supported, HWP
uses the value of the IA32_ENERGY_PERF_BIAS MSR to determine the
energy efficiency / performance preference."  So it had a different
range, but that was because it was being put into an older register.

However, I've removed that fallback code in v4.

Regards,
Jason



* Re: [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace
  2023-05-11 13:49           ` Jason Andryuk
@ 2023-05-11 14:10             ` Jan Beulich
  2023-05-11 20:22               ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-11 14:10 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 11.05.2023 15:49, Jason Andryuk wrote:
> On Thu, May 11, 2023 at 2:21 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 10.05.2023 19:49, Jason Andryuk wrote:
>>> On Mon, May 8, 2023 at 6:26 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>
>>>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>>>> Extend xen_get_cpufreq_para to return hwp parameters.  These match the
>>>>> hardware rather closely.
>>>>>
>>>>> We need the features bitmask to indicate fields supported by the actual
>>>>> hardware.
>>>>>
>>>>> The use of uint8_t parameters matches the hardware size.  uint32_t
>>>>> entries grow the sysctl_t past the build assertion in setup.c.  The
>>>>> uint8_t ranges are supported across multiple generations, so hopefully
>>>>> they won't change.
>>>>
>>>> Still it feels a little odd for values to be this narrow. Aiui the
>>>> scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
>>>> used by HWP. So you could widen the union in struct
>>>> xen_get_cpufreq_para (in a binary but not necessarily source compatible
>>>> manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
>>>> placed scaling_cur_freq could be included as well ...
>>>
>>> The values are narrow, but they match the hardware.  It works for HWP,
>>> so there is no need to change at this time AFAICT.
>>>
>>> Do you want me to make this change?
>>
>> Well, much depends on what these 8-bit values actually express (I did
>> raise this question in one of the replies to your patches, as I wasn't
>> able to find anything in the SDM). That'll then hopefully allow to
>> make some educated prediction on how likely it is that a future
>> variant of hwp would want to widen them.
> 
> Sorry for not providing a reference earlier.  In the SDM,
> HARDWARE-CONTROLLED PERFORMANCE STATES (HWP) section, there is this
> second paragraph:
> """
> In contrast, HWP is an implementation of the ACPI-defined
> Collaborative Processor Performance Control (CPPC), which specifies
> that the platform enumerates a continuous, abstract unit-less,
> performance value scale that is not tied to a specific performance
> state / frequency by definition. While the enumerated scale is roughly
> linear in terms of a delivered integer workload performance result,
> the OS is required to characterize the performance value range to
> comprehend the delivered performance for an applied workload.
> """
> 
> The numbers are "continuous, abstract unit-less, performance value."
> So there isn't much to go on there, but generally, smaller numbers
> mean slower and bigger numbers mean faster.
> 
> Cross referencing the ACPI spec here:
> https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html#collaborative-processor-performance-control
> 
> Scrolling down you can find the register entries such as
> 
> Highest Performance
> Register or DWORD Attribute:  Read
> Size:                         8-32 bits
> 
> AMD has its own pstate implementation that is similar to HWP.  Looking
> at the Linux support, the AMD hardware also uses 8 bit values for the
> comparable fields:
> https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/msr-index.h#L612
> 
> So Intel and AMD are 8bit for now at least.  Something could do 32bits
> according to the ACPI spec.
> 
> 8 bits of granularity for slow to fast seems like plenty to me.  I'm
> not sure what one would gain from 16 or 32 bits, but I'm not designing
> the hardware.  From the earlier xenpm output, "highest" was 49, so
> still a decent amount of room in an 8 bit range.

Hmm, thanks for the pointers. I'm still somewhat undecided. I guess I'm
okay with you keeping things as you have them. If and when needed we can
still rework the structure - it is possible to change it as it's (for
the time being at least) still an unstable interface.

Jan



* Re: [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace
  2023-05-11 14:10             ` Jan Beulich
@ 2023-05-11 20:22               ` Jason Andryuk
  2023-05-12  6:32                 ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-11 20:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Thu, May 11, 2023 at 10:10 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 11.05.2023 15:49, Jason Andryuk wrote:
> > On Thu, May 11, 2023 at 2:21 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 10.05.2023 19:49, Jason Andryuk wrote:
> >>> On Mon, May 8, 2023 at 6:26 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>>
> >>>> On 01.05.2023 21:30, Jason Andryuk wrote:
> >>>>> Extend xen_get_cpufreq_para to return hwp parameters.  These match the
> >>>>> hardware rather closely.
> >>>>>
> > >>>>> We need the features bitmask to indicate fields supported by the actual
> > >>>>> hardware.
> > >>>>>
> > >>>>> The use of uint8_t parameters matches the hardware size.  uint32_t
> > >>>>> entries grow the sysctl_t past the build assertion in setup.c.  The
> >>>>> uint8_t ranges are supported across multiple generations, so hopefully
> >>>>> they won't change.
> >>>>
> >>>> Still it feels a little odd for values to be this narrow. Aiui the
> >>>> scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
> >>>> used by HWP. So you could widen the union in struct
> >>>> xen_get_cpufreq_para (in a binary but not necessarily source compatible
> >>>> manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
> >>>> placed scaling_cur_freq could be included as well ...
> >>>
> >>> The values are narrow, but they match the hardware.  It works for HWP,
> >>> so there is no need to change at this time AFAICT.
> >>>
> >>> Do you want me to make this change?
> >>
> >> Well, much depends on what these 8-bit values actually express (I did
> >> raise this question in one of the replies to your patches, as I wasn't
> >> able to find anything in the SDM). That'll then hopefully allow to
> >> make some educated prediction on how likely it is that a future
> >> variant of hwp would want to widen them.
> >
> > Sorry for not providing a reference earlier.  In the SDM,
> > HARDWARE-CONTROLLED PERFORMANCE STATES (HWP) section, there is this
> > second paragraph:
> > """
> > In contrast, HWP is an implementation of the ACPI-defined
> > Collaborative Processor Performance Control (CPPC), which specifies
> > that the platform enumerates a continuous, abstract unit-less,
> > performance value scale that is not tied to a specific performance
> > state / frequency by definition. While the enumerated scale is roughly
> > linear in terms of a delivered integer workload performance result,
> > the OS is required to characterize the performance value range to
> > comprehend the delivered performance for an applied workload.
> > """
> >
> > The numbers are "continuous, abstract unit-less, performance value."
> > So there isn't much to go on there, but generally, smaller numbers
> > mean slower and bigger numbers mean faster.
> >
> > Cross referencing the ACPI spec here:
> > https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html#collaborative-processor-performance-control
> >
> > Scrolling down you can find the register entries such as
> >
> > Highest Performance
> > Register or DWORD Attribute:  Read
> > Size:                         8-32 bits
> >
> > AMD has its own pstate implementation that is similar to HWP.  Looking
> > at the Linux support, the AMD hardware also uses 8 bit values for the
> > comparable fields:
> > https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/msr-index.h#L612
> >
> > So Intel and AMD are 8bit for now at least.  Something could do 32bits
> > according to the ACPI spec.
> >
> > 8 bits of granularity for slow to fast seems like plenty to me.  I'm
> > not sure what one would gain from 16 or 32 bits, but I'm not designing
> > the hardware.  From the earlier xenpm output, "highest" was 49, so
> > still a decent amount of room in an 8 bit range.
>
> Hmm, thanks for the pointers. I'm still somewhat undecided. I guess I'm
> okay with you keeping things as you have them. If and when needed we can
> still rework the structure - it is possible to change it as it's (for
> the time being at least) still an unstable interface.

With an anonymous union and anonymous struct, struct
xen_get_cpufreq_para can be re-arranged and compile without any
changes to other cpufreq code.  struct xen_hwp_para becomes 10
uint32_t's.  The old scaling is 3 * uint32_t + 16 bytes
CPUFREQ_NAME_LEN + 4 * uint32_t for xen_ondemand = 11 uint32_t.  So
int32_t turbo_enabled doesn't move and it's binary compatible.
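
The size arithmetic can be checked with a compile-time sketch.  The struct and
member names below are simplified stand-ins, not the real public header; only
the field sizes follow the description above, and named unions/structs are
used rather than anonymous ones.

```c
#include <stddef.h>
#include <stdint.h>

#define CPUFREQ_NAME_LEN 16

/* Stand-in for xen_ondemand: 4 * uint32_t. */
struct ondemand_sketch {
    uint32_t sampling_rate_max, sampling_rate_min, sampling_rate, up_threshold;
};

/* Stand-in for a widened xen_hwp_para: 10 * uint32_t. */
struct hwp_para_sketch {
    uint32_t slot[10];
};

/* Tail of the structure as it is today (earlier members omitted). */
struct old_layout {
    uint32_t scaling_cur_freq;
    char scaling_governor[CPUFREQ_NAME_LEN];
    uint32_t scaling_max_freq;
    uint32_t scaling_min_freq;
    union {
        struct ondemand_sketch ondemand;
    } u;
    int32_t turbo_enabled;
};

/* Same tail with the scaling fields overlaid by the HWP parameters. */
struct new_layout {
    union {
        struct {
            uint32_t scaling_cur_freq;
            char scaling_governor[CPUFREQ_NAME_LEN];
            uint32_t scaling_max_freq;
            uint32_t scaling_min_freq;
            union {
                struct ondemand_sketch ondemand;
            } u;
        } s;
        struct hwp_para_sketch hwp;
    } u;
    int32_t turbo_enabled;
};

/* 3 * 4 + 16 + 4 * 4 = 44 bytes = 11 uint32_t before turbo_enabled. */
_Static_assert(offsetof(struct old_layout, turbo_enabled) == 44,
               "old scaling area is 11 uint32_t");
_Static_assert(sizeof(struct hwp_para_sketch) <= 44,
               "HWP parameters fit in the overlaid area");
_Static_assert(offsetof(struct new_layout, turbo_enabled) ==
               offsetof(struct old_layout, turbo_enabled),
               "turbo_enabled does not move");
```

If the HWP structure ever grew past 11 uint32_t, the last assertion would fail
at build time instead of silently shifting turbo_enabled.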

Anonymous unions and structs aren't allowed in the public header
though, right?  So that would need to change, though it doesn't seem
too bad.  There isn't too much churn.

I have no plans to tackle AMD pstate.  But having glanced at it this
morning, maybe these hwp sysctls should be renamed cppc?  AMD pstate
and HWP are both implementations of CPPC, so that could be more future
proof?  But, again, I only glanced at the AMD stuff, so there may be
other changes needed.

Regards,
Jason



* Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
  2023-05-10 14:19               ` Jan Beulich
@ 2023-05-12  1:02                 ` Marek Marczykowski-Górecki
  2023-05-12  6:28                   ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-05-12  1:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Jason Andryuk, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel


On Wed, May 10, 2023 at 04:19:57PM +0200, Jan Beulich wrote:
> On 10.05.2023 15:54, Jason Andryuk wrote:
> > On Mon, May 8, 2023 at 2:33 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> On 05.05.2023 17:35, Jason Andryuk wrote:
> >>> On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>> The other issue is that if you select "hwp" as the governor, but HWP
> >>> hardware support is not available, then hwp_available() needs to reset
> >>> the governor back to the default.  This feels like a layering
> >>> violation.
> >>
> >> Layering violation - yes. But why would the governor need resetting in
> >> this case? If HWP was asked for but isn't available, I don't think any
> >> other cpufreq handling (and hence governor) should be put in place.
> >> And turning off cpufreq altogether (if necessary in the first place)
> >> wouldn't, to me, feel as much like a layering violation.
> > 
> > My goal was for Xen to use HWP if available and fallback to the acpi
> > cpufreq driver if not.  That to me seems more user-friendly than
> > disabling cpufreq.
> > 
> >             if ( hwp_available() )
> >                 ret = hwp_register_driver();
> >             else
> >                 ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> 
> That's fine as a (future) default, but for now using hwp requires a
> command line option, and if that option says "hwp" then it ought to
> be hwp imo.

As a downstream distribution, I'd strongly prefer to have an option that
would enable HWP when present and fall back to another driver otherwise,
even if that isn't the default upstream. I can't possibly require a large
group of users (either HWP-having or HWP-not-having) to edit the Xen
cmdline to get power management working well.

If the meaning for cpufreq=hwp absolutely must include "nothing if HWP
is not available", then maybe it should be named cpufreq=try-hwp
instead, or cpufreq=prefer-hwp or something else like this?

> > If we are setting cpufreq_opt_governor to enter hwp_available(), but
> > then HWP isn't available, it seems to me that we need to reset
> > cpufreq_opt_governor when exiting hwp_available() false.
> 
> This may be necessary in the future, but shouldn't be necessary right
> now. It's not entirely clear to me how that future is going to look
> like, command line option wise.
> 
> Jan
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab



* Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
  2023-05-12  1:02                 ` Marek Marczykowski-Górecki
@ 2023-05-12  6:28                   ` Jan Beulich
  0 siblings, 0 replies; 53+ messages in thread
From: Jan Beulich @ 2023-05-12  6:28 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Jason Andryuk, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel

On 12.05.2023 03:02, Marek Marczykowski-Górecki wrote:
> On Wed, May 10, 2023 at 04:19:57PM +0200, Jan Beulich wrote:
>> On 10.05.2023 15:54, Jason Andryuk wrote:
>>> On Mon, May 8, 2023 at 2:33 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 05.05.2023 17:35, Jason Andryuk wrote:
>>>>> On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>> The other issue is that if you select "hwp" as the governor, but HWP
>>>>> hardware support is not available, then hwp_available() needs to reset
>>>>> the governor back to the default.  This feels like a layering
>>>>> violation.
>>>>
>>>> Layering violation - yes. But why would the governor need resetting in
>>>> this case? If HWP was asked for but isn't available, I don't think any
>>>> other cpufreq handling (and hence governor) should be put in place.
>>>> And turning off cpufreq altogether (if necessary in the first place)
>>>> wouldn't, to me, feel as much like a layering violation.
>>>
>>> My goal was for Xen to use HWP if available and fallback to the acpi
>>> cpufreq driver if not.  That to me seems more user-friendly than
>>> disabling cpufreq.
>>>
>>>             if ( hwp_available() )
>>>                 ret = hwp_register_driver();
>>>             else
>>>                 ret = cpufreq_register_driver(&acpi_cpufreq_driver);
>>
>> That's fine as a (future) default, but for now using hwp requires a
>> command line option, and if that option says "hwp" then it ought to
>> be hwp imo.
> 
> As a downstream distribution, I'd strongly prefer to have an option that
> would enable HWP when present and fall back to another driver otherwise,
> even if that isn't the default upstream. I can't possibly require a large
> group of users (either HWP-having or HWP-not-having) to edit the Xen
> cmdline to get power management working well.
> 
> If the meaning for cpufreq=hwp absolutely must include "nothing if HWP
> is not available", then maybe it should be named cpufreq=try-hwp
> instead, or cpufreq=prefer-hwp or something else like this?

Any new sub-option needs to fit the existing ones in its meaning. I
could see e.g. "cpufreq=xen" alone to effect what you want (once hwp
becomes available for use by default). But (for now at least) I
continue to think that a request for "hwp" ought to mean HWP.
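
The two semantics being weighed can be put side by side in a small sketch.
Everything here is hypothetical (the enum names, pick_driver(), and the idea of
a separate "preferred" request are illustrations, not existing Xen code or
command line options); it only shows how a strict "hwp" request differs from a
fallback-style one when the hardware lacks HWP.

```c
#include <stdbool.h>

/* Hypothetical request kinds: strict "hwp" vs. a fallback-capable variant. */
enum cpufreq_request { REQ_HWP_STRICT, REQ_HWP_PREFERRED };

/* Hypothetical outcome of driver selection. */
enum cpufreq_driver { DRV_NONE, DRV_HWP, DRV_ACPI };

static enum cpufreq_driver pick_driver(enum cpufreq_request req, bool hwp_avail)
{
    if ( hwp_avail )
        return DRV_HWP;

    /* Strict: a request for HWP means HWP or nothing. */
    /* Preferred: fall back to the ACPI cpufreq driver. */
    return req == REQ_HWP_PREFERRED ? DRV_ACPI : DRV_NONE;
}
```

Both variants behave identically on HWP-capable hardware; they only diverge on
machines without HWP, which is the case a distribution default has to cover.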

Jan



* Re: [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace
  2023-05-11 20:22               ` Jason Andryuk
@ 2023-05-12  6:32                 ` Jan Beulich
  0 siblings, 0 replies; 53+ messages in thread
From: Jan Beulich @ 2023-05-12  6:32 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 11.05.2023 22:22, Jason Andryuk wrote:
> On Thu, May 11, 2023 at 10:10 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 11.05.2023 15:49, Jason Andryuk wrote:
>>> On Thu, May 11, 2023 at 2:21 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>
>>>> On 10.05.2023 19:49, Jason Andryuk wrote:
>>>>> On Mon, May 8, 2023 at 6:26 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>>>
>>>>>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>>>>>> Extend xen_get_cpufreq_para to return hwp parameters.  These match the
>>>>>>> hardware rather closely.
>>>>>>>
>>>>>>> We need the features bitmask to indicate fields supported by the actual
>>>>>>> hardware.
>>>>>>>
>>>>>>> The use of uint8_t parameters matches the hardware size.  uint32_t
>>>>>>> entries grow the sysctl_t past the build assertion in setup.c.  The
>>>>>>> uint8_t ranges are supported across multiple generations, so hopefully
>>>>>>> they won't change.
>>>>>>
>>>>>> Still it feels a little odd for values to be this narrow. Aiui the
>>>>>> scaling_governor[] and scaling_{max,min}_freq fields aren't (really)
>>>>>> used by HWP. So you could widen the union in struct
>>>>>> xen_get_cpufreq_para (in a binary but not necessarily source compatible
>>>>>> manner), gaining you 6 more uint32_t slots. Possibly the somewhat oddly
>>>>>> placed scaling_cur_freq could be included as well ...
>>>>>
>>>>> The values are narrow, but they match the hardware.  It works for HWP,
>>>>> so there is no need to change at this time AFAICT.
>>>>>
>>>>> Do you want me to make this change?
>>>>
>>>> Well, much depends on what these 8-bit values actually express (I did
>>>> raise this question in one of the replies to your patches, as I wasn't
>>>> able to find anything in the SDM). That'll then hopefully allow to
>>>> make some educated prediction on how likely it is that a future
>>>> variant of hwp would want to widen them.
>>>
>>> Sorry for not providing a reference earlier.  In the SDM,
>>> HARDWARE-CONTROLLED PERFORMANCE STATES (HWP) section, there is this
>>> second paragraph:
>>> """
>>> In contrast, HWP is an implementation of the ACPI-defined
>>> Collaborative Processor Performance Control (CPPC), which specifies
>>> that the platform enumerates a continuous, abstract unit-less,
>>> performance value scale that is not tied to a specific performance
>>> state / frequency by definition. While the enumerated scale is roughly
>>> linear in terms of a delivered integer workload performance result,
>>> the OS is required to characterize the performance value range to
>>> comprehend the delivered performance for an applied workload.
>>> """
>>>
>>> The numbers are "continuous, abstract unit-less, performance value."
>>> So there isn't much to go on there, but generally, smaller numbers
>>> mean slower and bigger numbers mean faster.
>>>
>>> Cross referencing the ACPI spec here:
>>> https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html#collaborative-processor-performance-control
>>>
>>> Scrolling down you can find the register entries such as
>>>
>>> Highest Performance
>>> Register or DWORD Attribute:  Read
>>> Size:                         8-32 bits
>>>
>>> AMD has its own pstate implementation that is similar to HWP.  Looking
>>> at the Linux support, the AMD hardware also uses 8 bit values for the
>>> comparable fields:
>>> https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/msr-index.h#L612
>>>
>>> So Intel and AMD are 8bit for now at least.  Something could do 32bits
>>> according to the ACPI spec.
>>>
>>> 8 bits of granularity for slow to fast seems like plenty to me.  I'm
>>> not sure what one would gain from 16 or 32 bits, but I'm not designing
>>> the hardware.  From the earlier xenpm output, "highest" was 49, so
>>> still a decent amount of room in an 8 bit range.
>>
>> Hmm, thanks for the pointers. I'm still somewhat undecided. I guess I'm
>> okay with you keeping things as you have them. If and when needed we can
>> still rework the structure - it is possible to change it as it's (for
>> the time being at least) still an unstable interface.
> 
> With an anonymous union and anonymous struct, struct
> xen_get_cpufreq_para can be re-arranged and compile without any
> changes to other cpufreq code.  struct xen_hwp_para becomes 10
> uint32_t's.  The old scaling is 3 * uint32_t + 16 bytes
> CPUFREQ_NAME_LEN + 4 * uint32_t for xen_ondemand = 11 uint32_t.  So
> int32_t turbo_enabled doesn't move and it's binary compatible.
> 
> Anonymous unions and structs aren't allowed in the public header
> though, right?

Correct.

>  So that would need to change, though it doesn't seem
> too bad.  There isn't too much churn.
> 
> I have no plans to tackle AMD pstate.  But having glanced at it this
> morning, maybe these hwp sysctls should be renamed cppc?  AMD pstate
> and HWP are both implementations of CPPC, so that could be more future
> proof?  But, again, I only glanced at the AMD stuff, so there may be
> other changes needed.

I consider this naming change plan plausible. If further adjustments
end up necessary for AMD, that'll be no worse (but maybe better) than
if we have to go from HWP to a more general name altogether.

Jan



* Re: [PATCH v3 08/14 RESEND] libxc: Include hwp_para in definitions
  2023-05-01 19:30 ` [PATCH v3 08/14 RESEND] libxc: Include hwp_para in definitions Jason Andryuk
@ 2023-05-19 13:53   ` Anthony PERARD
  0 siblings, 0 replies; 53+ messages in thread
From: Anthony PERARD @ 2023-05-19 13:53 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: xen-devel, Wei Liu, Juergen Gross

On Mon, May 01, 2023 at 03:30:28PM -0400, Jason Andryuk wrote:
> Expose the hwp_para fields through libxc.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

Acked-by: Anthony PERARD <anthony.perard@citrix.com>

Thanks,

-- 
Anthony PERARD



* Re: [PATCH v3 11/14 RESEND] libxc: Add xc_set_cpufreq_hwp
  2023-05-01 19:30 ` [PATCH v3 11/14 RESEND] libxc: Add xc_set_cpufreq_hwp Jason Andryuk
@ 2023-05-19 13:55   ` Anthony PERARD
  0 siblings, 0 replies; 53+ messages in thread
From: Anthony PERARD @ 2023-05-19 13:55 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: xen-devel, Wei Liu, Juergen Gross

On Mon, May 01, 2023 at 03:30:31PM -0400, Jason Andryuk wrote:
> Add xc_set_cpufreq_hwp to allow calling xen_sysctl_pm_op
> SET_CPUFREQ_HWP.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

Acked-by: Anthony PERARD <anthony.perard@citrix.com>

Thanks,

-- 
Anthony PERARD



* Re: [PATCH v3 10/14 RESEND] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  2023-05-08 11:27   ` Jan Beulich
@ 2023-05-22 12:45     ` Jason Andryuk
  2023-05-22 13:10       ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-22 12:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Mon, May 8, 2023 at 7:27 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > @@ -531,6 +533,100 @@ int get_hwp_para(const struct cpufreq_policy *policy,
> >      return 0;
> >  }
> >
> > +int set_hwp_para(struct cpufreq_policy *policy,
> > +                 struct xen_set_hwp_para *set_hwp)
>
> const?

set_hwp can be const.  policy is passed to hwp_cpufreq_target() &
on_selected_cpus(), so it cannot readily be made const.

> > +{
> > +    unsigned int cpu = policy->cpu;
> > +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> > +
> > +    if ( data == NULL )
> > +        return -EINVAL;
> > +
> > +    /* Validate all parameters first */
> > +    if ( set_hwp->set_params & ~XEN_SYSCTL_HWP_SET_PARAM_MASK )
> > +        return -EINVAL;
> > +
> > +    if ( set_hwp->activity_window & ~XEN_SYSCTL_HWP_ACT_WINDOW_MASK )
> > +        return -EINVAL;
>
> Below you limit checks to when the respective control bit is set. I
> think you want the same here.

Not sure if you mean feature_hwp_activity_window or the bit in
set_params as control bit.  But, yes, they can both use some
additional checking.  IIRC, I wanted to always check
~XEN_SYSCTL_HWP_ACT_WINDOW_MASK, because bits should never be set
whether or not the activity window is supported by hardware.
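
The two checking policies can be sketched as follows.  The SKETCH_* constants
and function names are hypothetical stand-ins (the bit position chosen for the
"set activity window" flag is arbitrary), and the 10-bit mask assumes the
activity window field is 10 bits wide as in the SDM.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SET_ACT_WINDOW  (1u << 3) /* hypothetical control bit */
#define SKETCH_ACT_WINDOW_MASK 0x03ffu   /* 10 valid bits, assumed */

/* Policy A: stray bits are always an error, even if the field is unused. */
static bool act_window_ok_always(uint32_t set_params, uint16_t act_window)
{
    (void)set_params;

    return !(act_window & ~SKETCH_ACT_WINDOW_MASK);
}

/* Policy B: only validate the field when the caller asked to set it. */
static bool act_window_ok_if_requested(uint32_t set_params,
                                       uint16_t act_window)
{
    if ( !(set_params & SKETCH_SET_ACT_WINDOW) )
        return true;

    return !(act_window & ~SKETCH_ACT_WINDOW_MASK);
}
```

Policy A rejects a request carrying garbage in an unused field, which catches
uninitialized callers early; policy B matches the other checks in the function,
which are all gated on their control bit.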

> > +    if ( !feature_hwp_energy_perf &&
> > +         (set_hwp->set_params & XEN_SYSCTL_HWP_SET_ENERGY_PERF) &&
> > +         set_hwp->energy_perf > IA32_ENERGY_BIAS_MAX_POWERSAVE )
> > +        return -EINVAL;
> > +
> > +    if ( (set_hwp->set_params & XEN_SYSCTL_HWP_SET_DESIRED) &&
> > +         set_hwp->desired != 0 &&
> > +         (set_hwp->desired < data->hw.lowest ||
> > +          set_hwp->desired > data->hw.highest) )
> > +        return -EINVAL;
> > +
> > +    /*
> > +     * minimum & maximum are not validated as hardware doesn't seem to care
> > +     * and the SDM says CPUs will clip internally.
> > +     */
> > +
> > +    /* Apply presets */
> > +    switch ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_PRESET_MASK )
> > +    {
> > +    case XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE:
> > +        data->minimum = data->hw.lowest;
> > +        data->maximum = data->hw.lowest;
> > +        data->activity_window = 0;
> > +        if ( feature_hwp_energy_perf )
> > +            data->energy_perf = HWP_ENERGY_PERF_MAX_POWERSAVE;
> > +        else
> > +            data->energy_perf = IA32_ENERGY_BIAS_MAX_POWERSAVE;
> > +        data->desired = 0;
> > +        break;
> > +
> > +    case XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE:
> > +        data->minimum = data->hw.highest;
> > +        data->maximum = data->hw.highest;
> > +        data->activity_window = 0;
> > +        data->energy_perf = HWP_ENERGY_PERF_MAX_PERFORMANCE;
> > +        data->desired = 0;
> > +        break;
> > +
> > +    case XEN_SYSCTL_HWP_SET_PRESET_BALANCE:
> > +        data->minimum = data->hw.lowest;
> > +        data->maximum = data->hw.highest;
> > +        data->activity_window = 0;
> > +        if ( feature_hwp_energy_perf )
> > +            data->energy_perf = HWP_ENERGY_PERF_BALANCE;
> > +        else
> > +            data->energy_perf = IA32_ENERGY_BIAS_BALANCE;
> > +        data->desired = 0;
> > +        break;
> > +
> > +    case XEN_SYSCTL_HWP_SET_PRESET_NONE:
> > +        break;
> > +
> > +    default:
> > +        return -EINVAL;
> > +    }
>
> So presets set all the values for which the individual item control bits
> are clear. That's not exactly what I would have expected, and it took me
> reading the code several times until I realized that you write live
> per-CPU data fields here, not fields of some intermediate variable. I think
> this could do with saying explicitly in the public header (if indeed the
> intended model).

The commit message mentioned the idea of using a preset and further
refinement.  The comments above "/* Apply presets */" and below "/*
Further customize presets if needed */" were an attempt to highlight
that.  But you are right that the public header should state this
clearly.
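
For illustration only, here is a minimal sketch of the "preset, then refine" model as a reader might picture it. The struct layouts, the shortened flag names, and the value 255 for maximum powersave are assumptions made for this example; they are not the definitions from the patch.

```c
#include <stdint.h>

/* Hypothetical flag values, shortened from the XEN_SYSCTL_HWP_SET_* names. */
#define SET_DESIRED       (1u << 0)
#define SET_ENERGY_PERF   (1u << 1)
#define SET_MINIMUM       (1u << 3)
#define SET_MAXIMUM       (1u << 4)
#define PRESET_MASK       0xf000u
#define PRESET_NONE       0x0000u
#define PRESET_POWERSAVE  0x2000u

/* Hypothetical stand-in for the live per-CPU data. */
struct hwp_state {
    uint8_t minimum, maximum, desired, energy_perf;
};

/* Hypothetical stand-in for the request coming from the caller. */
struct hwp_request {
    uint16_t set_params;
    uint8_t minimum, maximum, desired, energy_perf;
};

/* Step 1 loads the preset; step 2 overrides only the flagged fields. */
static int apply_request(struct hwp_state *st, const struct hwp_request *req,
                         uint8_t hw_lowest)
{
    switch ( req->set_params & PRESET_MASK )
    {
    case PRESET_POWERSAVE:
        st->minimum = hw_lowest;
        st->maximum = hw_lowest;
        st->energy_perf = 255; /* assumed "maximum powersave" value */
        st->desired = 0;
        break;

    case PRESET_NONE:
        break;

    default:
        return -1; /* unknown preset, nothing has been written yet */
    }

    if ( req->set_params & SET_MINIMUM )
        st->minimum = req->minimum;
    if ( req->set_params & SET_MAXIMUM )
        st->maximum = req->maximum;
    if ( req->set_params & SET_ENERGY_PERF )
        st->energy_perf = req->energy_perf;
    if ( req->set_params & SET_DESIRED )
        st->desired = req->desired;

    return 0;
}
```

With PRESET_POWERSAVE | SET_MAXIMUM, every field takes the preset value except maximum, which takes the caller's value.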

> > +    /* Further customize presets if needed */
> > +    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_MINIMUM )
> > +        data->minimum = set_hwp->minimum;
> > +
> > +    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_MAXIMUM )
> > +        data->maximum = set_hwp->maximum;
> > +
> > +    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_ENERGY_PERF )
> > +        data->energy_perf = set_hwp->energy_perf;
> > +
> > +    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_DESIRED )
> > +        data->desired = set_hwp->desired;
> > +
> > +    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_ACT_WINDOW )
> > +        data->activity_window = set_hwp->activity_window &
> > +                                XEN_SYSCTL_HWP_ACT_WINDOW_MASK;
> > +
> > +    hwp_cpufreq_target(policy, 0, 0);
> > +
> > +    return 0;
>
> I don't think you should assume here that hwp_cpufreq_target() will
> only ever return 0. Plus by returning its return value here you
> allow the compiler to tail-call optimize this code.

Thanks for catching that.  Yeah, I made hwp_cpufreq_target() return a
value per your earlier comment, so its value should be returned now.

> > --- a/xen/drivers/acpi/pmstat.c
> > +++ b/xen/drivers/acpi/pmstat.c
> > @@ -398,6 +398,20 @@ static int set_cpufreq_para(struct xen_sysctl_pm_op *op)
> >      return ret;
> >  }
> >
> > +static int set_cpufreq_hwp(struct xen_sysctl_pm_op *op)
>
> const?

Yes

> > --- a/xen/include/public/sysctl.h
> > +++ b/xen/include/public/sysctl.h
> > @@ -317,6 +317,34 @@ struct xen_hwp_para {
> >      uint8_t energy_perf;
> >  };
> >
> > +/* set multiple values simultaneously when set_args bit is set */
>
> What "set_args bit" does this comment refer to?

That should be set_params. IIRC, set_args was the previous name.

> > +struct xen_set_hwp_para {
> > +#define XEN_SYSCTL_HWP_SET_DESIRED              (1U << 0)
> > +#define XEN_SYSCTL_HWP_SET_ENERGY_PERF          (1U << 1)
> > +#define XEN_SYSCTL_HWP_SET_ACT_WINDOW           (1U << 2)
> > +#define XEN_SYSCTL_HWP_SET_MINIMUM              (1U << 3)
> > +#define XEN_SYSCTL_HWP_SET_MAXIMUM              (1U << 4)
> > +#define XEN_SYSCTL_HWP_SET_PRESET_MASK          0xf000
> > +#define XEN_SYSCTL_HWP_SET_PRESET_NONE          0x0000
> > +#define XEN_SYSCTL_HWP_SET_PRESET_BALANCE       0x1000
> > +#define XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE     0x2000
> > +#define XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE   0x3000
> > +#define XEN_SYSCTL_HWP_SET_PARAM_MASK ( \
> > +                                  XEN_SYSCTL_HWP_SET_PRESET_MASK | \
> > +                                  XEN_SYSCTL_HWP_SET_DESIRED     | \
> > +                                  XEN_SYSCTL_HWP_SET_ENERGY_PERF | \
> > +                                  XEN_SYSCTL_HWP_SET_ACT_WINDOW  | \
> > +                                  XEN_SYSCTL_HWP_SET_MINIMUM     | \
> > +                                  XEN_SYSCTL_HWP_SET_MAXIMUM     )
> > +    uint16_t set_params; /* bitflags for valid values */
> > +#define XEN_SYSCTL_HWP_ACT_WINDOW_MASK          0x03ff
> > +    uint16_t activity_window; /* See comment in struct xen_hwp_para */
> > +    uint8_t minimum;
> > +    uint8_t maximum;
> > +    uint8_t desired;
> > +    uint8_t energy_perf; /* 0-255 or 0-15 depending on HW support */
>
> Instead of (or in addition to) the "HW support" reference, could this
> gain a reference to the "get para" bit determining which range to use?

I've removed the fallback 0-15 support locally, so this will be removed.

Regards,
Jason


* Re: [PATCH v3 13/14 RESEND] xenpm: Add set-cpufreq-hwp subcommand
  2023-05-08 11:56   ` Jan Beulich
  2023-05-08 12:00     ` Jan Beulich
@ 2023-05-22 12:59     ` Jason Andryuk
  2023-05-22 13:20       ` Jan Beulich
  1 sibling, 1 reply; 53+ messages in thread
From: Jason Andryuk @ 2023-05-22 12:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, Anthony PERARD, xen-devel

On Mon, May 8, 2023 at 7:56 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > @@ -67,6 +68,27 @@ void show_help(void)
> >              " set-max-cstate        <num>|'unlimited' [<num2>|'unlimited']\n"
> >              "                                     set the C-State limitation (<num> >= 0) and\n"
> >              "                                     optionally the C-sub-state limitation (<num2> >= 0)\n"
> > +            " set-cpufreq-hwp       [cpuid] [balance|performance|powersave] <param:val>*\n"
> > +            "                                     set Hardware P-State (HWP) parameters\n"
> > +            "                                     optionally a preset of one of\n"
> > +            "                                       balance|performance|powersave\n"
> > +            "                                     an optional list of param:val arguments\n"
> > +            "                                       minimum:N  lowest ... highest\n"
> > +            "                                       maximum:N  lowest ... highest\n"
> > +            "                                       desired:N  lowest ... highest\n"
>
> Personally I consider these three uses of "lowest ... highest" confusing:
> It's not clear at all whether they're part of the option or merely mean
> to express the allowable range for N (which I think they do). Perhaps ...
>
> > +            "                                           Set explicit performance target.\n"
> > +            "                                           non-zero disables auto-HWP mode.\n"
> > +            "                                       energy-perf:0-255 (or 0-15)\n"
>
> ..., also taking this into account:
>
>             "                                       energy-perf:N (0-255 or 0-15)\n"
>
> and then use parentheses as well for the earlier value range explanations
> (and again below)?

lowest and highest were supposed to reference the values from `xenpm
get-cpufreq-para`.  You removed some later lines that state
"get-cpufreq-para returns lowest/highest".  However, they aren't
enforced limits.  You can program from the range 0-255 and the
hardware is supposed to clip internally, so your idea of
"energy-perf:N (0-255)" seems good to me.

> Also up from here you suddenly start having full stops on the lines. I
> guess you also want to be consistent in your use of capital letters at
> the start of lines (I didn't go check how consistent pre-existing code
> is in this regard).

Looks like the existing code is consistently non-capital letters, but
the full stops are inconsistent.  I'll go with non-capital and full
stop for this addition.

> > @@ -1299,6 +1321,213 @@ void disable_turbo_mode(int argc, char *argv[])
> >                  errno, strerror(errno));
> >  }
> >
> > +/*
> > + * Parse activity_window:NNN{us,ms,s} and validate range.
> > + *
> > + * Activity window is a 7bit mantissa (0-127) with a 3bit exponent (0-7) base
> > + * 10 in microseconds.  So the range is 1 microsecond to 1270 seconds.  A value
> > + * of 0 lets the hardware autonomously select the window.
> > + *
> > + * Return 0 on success
> > + *       -1 on error
> > + */
> > +static int parse_activity_window(xc_set_hwp_para_t *set_hwp, unsigned long u,
> > +                                 const char *suffix)
> > +{
> > +    unsigned int exponent = 0;
> > +    unsigned int multiplier = 1;
> > +
> > +    if ( suffix && suffix[0] )
> > +    {
> > +        if ( strcasecmp(suffix, "s") == 0 )
> > +        {
> > +            multiplier = 1000 * 1000;
> > +            exponent = 6;
> > +        }
> > +        else if ( strcasecmp(suffix, "ms") == 0 )
> > +        {
> > +            multiplier = 1000;
> > +            exponent = 3;
> > +        }
> > +        else if ( strcasecmp(suffix, "us") == 0 )
> > +        {
> > +            multiplier = 1;
> > +            exponent = 0;
> > +        }
>
> Considering the initializers, this "else if" body isn't really needed,
> and ...
>
> > +        else
>
> ... instead this could become "else if ( strcmp() != 0 )".
>
> Note also that I use strcmp() there - none of s, ms, or us are commonly
> expressed by capital letters.

That sounds fine.

> (I wonder though whether μs shouldn't also
> be recognized.)

While that makes sense, I do not plan to change it.  I don't know the
proper way to deal with unicode from C.  (I suppose a memcmp with the
UTF-8 encoding would be possible, but I don't know if there are corner
cases I'm overlooking.)
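
As a rough sketch of what such a comparison could look like (not something planned for the patch): the helper below is hypothetical, and it assumes the shell passes arguments as UTF-8. It accepts both U+00B5 (micro sign) and U+03BC (Greek small letter mu), and does nothing for other encodings or normalization forms.

```c
#include <string.h>

/*
 * Hypothetical helper: map a unit suffix to a microsecond multiplier.
 * Returns 0 for an unrecognized suffix.  A missing or empty suffix is
 * treated as microseconds.
 */
static unsigned int suffix_to_multiplier(const char *suffix)
{
    /* UTF-8 byte sequences, kept apart from the trailing "s" for clarity. */
    static const char micro_sign[] = "\xc2\xb5" "s"; /* U+00B5 */
    static const char greek_mu[]   = "\xce\xbc" "s"; /* U+03BC */

    if ( !suffix || !suffix[0] )
        return 1;

    if ( strcmp(suffix, "us") == 0 ||
         strcmp(suffix, micro_sign) == 0 ||
         strcmp(suffix, greek_mu) == 0 )
        return 1;

    if ( strcmp(suffix, "ms") == 0 )
        return 1000;

    if ( strcmp(suffix, "s") == 0 )
        return 1000 * 1000;

    return 0;
}
```

Because the comparison is byte-wise, strcmp() on the encoded string is equivalent to a memcmp() including the terminating NUL.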

> > +        {
> > +            fprintf(stderr, "invalid activity window units: \"%s\"\n", suffix);
> > +
> > +            return -1;
> > +        }
> > +    }
> > +
> > +    /* u * multiplier > 1270 * 1000 * 1000 transformed to avoid overflow. */
> > +    if ( u > 1270 * 1000 * 1000 / multiplier )
> > +    {
> > +        fprintf(stderr, "activity window is too large\n");
> > +
> > +        return -1;
> > +    }
> > +
> > +    /* looking for 7 bits of mantissa and 3 bits of exponent */
> > +    while ( u > 127 )
> > +    {
> > +        u += 5; /* Round up to mitigate truncation rounding down
> > +                   e.g. 128 -> 120 vs 128 -> 130. */
> > +        u /= 10;
> > +        exponent += 1;
> > +    }
> > +
> > +    set_hwp->activity_window = (exponent & HWP_ACT_WINDOW_EXPONENT_MASK) <<
> > +                                   HWP_ACT_WINDOW_EXPONENT_SHIFT |
>
> The shift wants parenthesizing against the | and the shift amount wants
> indenting slightly less. (Really this would want to be MASK_INSR().)

I'll use MASK_INSR.
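
A sketch of how the encoding might look with that change, for illustration. The two mask values are assumptions (7 mantissa bits in the low bits, 3 exponent bits above them, matching the 0x03ff activity window mask), and MASK_INSR here is a local stand-in modelled on the Xen macro rather than the real header.

```c
#include <stdint.h>

/* Assumed in-place masks: bits 0-6 mantissa, bits 7-9 exponent. */
#define ACT_WINDOW_MANTISSA_MASK 0x07fu
#define ACT_WINDOW_EXPONENT_MASK 0x380u

/* Local stand-in: place value v into the field described by mask m. */
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/*
 * Hypothetical helper: encode a window given in microseconds as
 * mantissa * 10^exponent.  Returns 0 on success, -1 if out of range.
 */
static int encode_activity_window(unsigned long long us, uint16_t *out)
{
    unsigned int exponent = 0;

    if ( us > 1270ULL * 1000 * 1000 )
        return -1;

    /* Reduce to a 7-bit mantissa, rounding to nearest at each step. */
    while ( us > 127 )
    {
        us = (us + 5) / 10;
        exponent++;
    }

    *out = MASK_INSR(exponent, ACT_WINDOW_EXPONENT_MASK) |
           ((unsigned int)us & ACT_WINDOW_MANTISSA_MASK);

    return 0;
}
```

For example, 128 microseconds becomes mantissa 13 with exponent 1 (130 microseconds), and the maximum of 1270 seconds becomes mantissa 127 with exponent 7.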

> > +                               (u & HWP_ACT_WINDOW_MANTISSA_MASK);
> > +    set_hwp->set_params |= XEN_SYSCTL_HWP_SET_ACT_WINDOW;
> > +
> > +    return 0;
> > +}
> > +
> > +static int parse_hwp_opts(xc_set_hwp_para_t *set_hwp, int *cpuid,
> > +                          int argc, char *argv[])
> > +{
> > +    int i = 0;
> > +
> > +    if ( argc < 1 ) {
> > +        fprintf(stderr, "Missing arguments\n");
> > +        return -1;
> > +    }
> > +
> > +    if ( parse_cpuid_non_fatal(argv[i], cpuid) == 0 )
> > +    {
> > +        i++;
> > +    }
>
> I don't think you need the earlier patch and the separate helper:
> Whether a CPU number is present can be told by checking
> isdigit(argv[i][0]).

> Hmm, yes, there is "all", but your help text doesn't mention it and
> since you're handling a variable number of arguments anyway, there's
> no need for anyone to say "all" - they can simply omit the optional
> argument.

Most xenpm commands take "all" or a numeric cpuid, so I intended to be
consistent with them.  That was the whole point of
parse_cpuid_non_fatal() - to reuse the existing parsing code for
consistency.

I didn't read the other help text carefully enough to see that the
numeric cpuid and "all" handling was repeated.

For consistency, I would retain parse_cpuid_non_fatal() and expand the
help text.  If you don't want that, I'll switch to isdigit(argv[i][0])
and have the omission of a digit indicate all CPUs as you suggest.
Just let me know what you want.

> Also (nit) note how you're mixing brace placement throughout this
> function.

Will fix.

Regards,
Jason


* Re: [PATCH v3 10/14 RESEND] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  2023-05-22 12:45     ` Jason Andryuk
@ 2023-05-22 13:10       ` Jan Beulich
  2023-05-22 14:43         ` Jason Andryuk
  0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2023-05-22 13:10 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 22.05.2023 14:45, Jason Andryuk wrote:
> On Mon, May 8, 2023 at 7:27 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>> @@ -531,6 +533,100 @@ int get_hwp_para(const struct cpufreq_policy *policy,
>>>      return 0;
>>>  }
>>>
>>> +int set_hwp_para(struct cpufreq_policy *policy,
>>> +                 struct xen_set_hwp_para *set_hwp)
>>
>> const?
> 
> set_hwp can be const.  policy is passed to hwp_cpufreq_target() &
> on_selected_cpus(), so it cannot readily be made const.

I was only meaning the 2nd parameter, yes.

>>> +{
>>> +    unsigned int cpu = policy->cpu;
>>> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
>>> +
>>> +    if ( data == NULL )
>>> +        return -EINVAL;
>>> +
>>> +    /* Validate all parameters first */
>>> +    if ( set_hwp->set_params & ~XEN_SYSCTL_HWP_SET_PARAM_MASK )
>>> +        return -EINVAL;
>>> +
>>> +    if ( set_hwp->activity_window & ~XEN_SYSCTL_HWP_ACT_WINDOW_MASK )
>>> +        return -EINVAL;
>>
>> Below you limit checks to when the respective control bit is set. I
>> think you want the same here.
> 
> Not sure if you mean feature_hwp_activity_window or the bit in
> set_params as control bit.  But, yes, they can both use some
> additional checking.  IIRC, I wanted to always check
> ~XEN_SYSCTL_HWP_ACT_WINDOW_MASK, because bits should never be set
> whether or not the activity window is supported by hardware.

I took ...

>>> +    if ( !feature_hwp_energy_perf &&
>>> +         (set_hwp->set_params & XEN_SYSCTL_HWP_SET_ENERGY_PERF) &&
>>> +         set_hwp->energy_perf > IA32_ENERGY_BIAS_MAX_POWERSAVE )
>>> +        return -EINVAL;
>>> +
>>> +    if ( (set_hwp->set_params & XEN_SYSCTL_HWP_SET_DESIRED) &&
>>> +         set_hwp->desired != 0 &&
>>> +         (set_hwp->desired < data->hw.lowest ||
>>> +          set_hwp->desired > data->hw.highest) )
>>> +        return -EINVAL;

... e.g. this for comparison, where you apply the range check only when
the XEN_SYSCTL_HWP_* bit is set. I think you want to be consistent in
such checking: Either you always allow the caller to not care about
fields that aren't going to be consumed when their controlling bit is
off, or you always check validity. Both approaches have their pros and
cons, I think.
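
To illustrate the first of the two options (a sketch only, with made-up shortened names and a simplified struct, not the patch's definitions), each check would be gated on its control bit, so a caller need not initialize fields it does not ask to set:

```c
#include <stdint.h>

/* Hypothetical shortened names for the XEN_SYSCTL_HWP_* constants. */
#define SET_DESIRED      (1u << 0)
#define SET_ACT_WINDOW   (1u << 2)
#define ACT_WINDOW_MASK  0x03ffu

/* Simplified stand-in for the request structure. */
struct hwp_req {
    uint16_t set_params;
    uint16_t activity_window;
    uint8_t desired;
};

/* Returns 0 if acceptable, -1 otherwise.  Unflagged fields are ignored. */
static int validate_req(const struct hwp_req *r, uint8_t lowest,
                        uint8_t highest)
{
    if ( (r->set_params & SET_ACT_WINDOW) &&
         (r->activity_window & ~ACT_WINDOW_MASK) )
        return -1;

    if ( (r->set_params & SET_DESIRED) &&
         r->desired != 0 &&
         (r->desired < lowest || r->desired > highest) )
        return -1;

    return 0;
}
```

The other option would drop the set_params tests from both conditions and reject stray bits unconditionally.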

Jan


* Re: [PATCH v3 13/14 RESEND] xenpm: Add set-cpufreq-hwp subcommand
  2023-05-22 12:59     ` Jason Andryuk
@ 2023-05-22 13:20       ` Jan Beulich
  0 siblings, 0 replies; 53+ messages in thread
From: Jan Beulich @ 2023-05-22 13:20 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Wei Liu, Anthony PERARD, xen-devel

On 22.05.2023 14:59, Jason Andryuk wrote:
> On Mon, May 8, 2023 at 7:56 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>> +static int parse_hwp_opts(xc_set_hwp_para_t *set_hwp, int *cpuid,
>>> +                          int argc, char *argv[])
>>> +{
>>> +    int i = 0;
>>> +
>>> +    if ( argc < 1 ) {
>>> +        fprintf(stderr, "Missing arguments\n");
>>> +        return -1;
>>> +    }
>>> +
>>> +    if ( parse_cpuid_non_fatal(argv[i], cpuid) == 0 )
>>> +    {
>>> +        i++;
>>> +    }
>>
>> I don't think you need the earlier patch and the separate helper:
>> Whether a CPU number is present can be told by checking
>> isdigit(argv[i][0]).
> 
>> Hmm, yes, there is "all", but your help text doesn't mention it and
>> since you're handling a variable number of arguments anyway, there's
>> no need for anyone to say "all" - they can simply omit the optional
>> argument.
> 
> Most xenpm commands take "all" or a numeric cpuid, so I intended to be
> consistent with them.  That was the whole point of
> parse_cpuid_non_fatal() - to reuse the existing parsing code for
> consistency.
> 
> I didn't read the other help text carefully enough to see that the
> numeric cpuid and "all" handling was repeated.
> 
> For consistency, I would retain parse_cpuid_non_fatal() and expand the
> help text.  If you don't want that, I'll switch to isdigit(argv[i][0])
> and have the omission of a digit indicate all CPUs as you suggest.
> Just let me know what you want.

While I don't want to push you towards something you don't like yourself,
my view on the "all" has been "Why did they introduce that?" It makes
some sense when it's a placeholder to avoid needing to deal with a
variable number of arguments, but already that doesn't apply to all the
pre-existing operations. Note how many functions already have

    if ( argc > 0 )
        parse_cpuid(argv[0], &cpuid);

and {en,dis}able-turbo-mode don't properly mention "all" in their help
text either.
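
For reference, the isdigit() approach could be sketched roughly as below. The helper name is made up, and using -1 to mean "all CPUs" is an assumption for this example.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical helper: if the first argument starts with a digit, parse it
 * as a CPU number and return 1 (one argument consumed).  Otherwise store -1
 * (assumed here to mean "all CPUs") and return 0.
 */
static int parse_optional_cpu(int argc, char *argv[], int *cpuid)
{
    *cpuid = -1;

    if ( argc > 0 && isdigit((unsigned char)argv[0][0]) )
    {
        *cpuid = (int)strtol(argv[0], NULL, 10);
        return 1;
    }

    return 0;
}
```

The return value can be used directly as the index of the first remaining argument.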

Jan


* Re: [PATCH v3 10/14 RESEND] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  2023-05-22 13:10       ` Jan Beulich
@ 2023-05-22 14:43         ` Jason Andryuk
  0 siblings, 0 replies; 53+ messages in thread
From: Jason Andryuk @ 2023-05-22 14:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Mon, May 22, 2023 at 9:11 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 22.05.2023 14:45, Jason Andryuk wrote:
> > On Mon, May 8, 2023 at 7:27 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 01.05.2023 21:30, Jason Andryuk wrote:
> >>> +    if ( set_hwp->activity_window & ~XEN_SYSCTL_HWP_ACT_WINDOW_MASK )
> >>> +        return -EINVAL;
> >>
> >> Below you limit checks to when the respective control bit is set. I
> >> think you want the same here.
> >
> > Not sure if you mean feature_hwp_activity_window or the bit in
> > set_params as control bit.  But, yes, they can both use some
> > additional checking.  IIRC, I wanted to always check
> > ~XEN_SYSCTL_HWP_ACT_WINDOW_MASK, because bits should never be set
> > whether or not the activity window is supported by hardware.
>
> I took ...
>
> >>> +    if ( !feature_hwp_energy_perf &&
> >>> +         (set_hwp->set_params & XEN_SYSCTL_HWP_SET_ENERGY_PERF) &&
> >>> +         set_hwp->energy_perf > IA32_ENERGY_BIAS_MAX_POWERSAVE )
> >>> +        return -EINVAL;
> >>> +
> >>> +    if ( (set_hwp->set_params & XEN_SYSCTL_HWP_SET_DESIRED) &&
> >>> +         set_hwp->desired != 0 &&
> >>> +         (set_hwp->desired < data->hw.lowest ||
> >>> +          set_hwp->desired > data->hw.highest) )
> >>> +        return -EINVAL;
>
> ... e.g. this for comparison, where you apply the range check only when
> the XEN_SYSCTL_HWP_* bit is set. I think you want to be consistent in
> such checking: Either you always allow the caller to not care about
> fields that aren't going to be consumed when their controlling bit is
> off, or you always check validity. Both approaches have their pros and
> cons, I think.

Ok, good point.  I wrote it inconsistently because the SDM states the
desired limit: "When set to a non-zero value (between the range of
Lowest_Performance and Highest_Performance of IA32_HWP_CAPABILITIES)
conveys an explicit performance request hint to the hardware;
effectively disabling HW Autonomous selection."  And I was trying to
follow that.  But later "The HWP hardware clips and resolves the field
values as necessary to the valid range." seems to override that.  I'll
test to verify that it is correct, and drop the lowest/highest
checking if so.

Thanks,
Jason


Thread overview: 53+ messages
2023-05-01 19:30 [PATCH v3 00/14 RESEND] Intel Hardware P-States (HWP) support Jason Andryuk
2023-05-01 19:30 ` [PATCH v3 01/14 RESEND] cpufreq: Allow restricting to internal governors only Jason Andryuk
2023-05-01 19:30 ` [PATCH v3 02/14 RESEND] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
2023-05-01 19:30 ` [PATCH v3 03/14 RESEND] cpufreq: Export intel_feature_detect Jason Andryuk
2023-05-04 11:16   ` Jan Beulich
2023-05-01 19:30 ` [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
2023-05-04 13:11   ` Jan Beulich
2023-05-04 16:56     ` Jason Andryuk
2023-05-05  7:01       ` Jan Beulich
2023-05-05 15:35         ` Jason Andryuk
2023-05-08  6:33           ` Jan Beulich
2023-05-10 13:54             ` Jason Andryuk
2023-05-10 14:19               ` Jan Beulich
2023-05-12  1:02                 ` Marek Marczykowski-Górecki
2023-05-12  6:28                   ` Jan Beulich
2023-05-01 19:30 ` [PATCH v3 05/14 RESEND] xenpm: Change get-cpufreq-para output for internal Jason Andryuk
2023-05-04 14:35   ` Jan Beulich
2023-05-04 17:00     ` Jason Andryuk
2023-05-05  7:04       ` Jan Beulich
2023-05-05 15:40         ` Jason Andryuk
2023-05-01 19:30 ` [PATCH v3 06/14 RESEND] xen/x86: Tweak PDC bits when using HWP Jason Andryuk
2023-05-08  9:53   ` Jan Beulich
2023-05-10 14:08     ` Jason Andryuk
2023-05-01 19:30 ` [PATCH v3 07/14 RESEND] cpufreq: Export HWP parameters to userspace Jason Andryuk
2023-05-08 10:25   ` Jan Beulich
2023-05-08 10:46     ` Jan Beulich
2023-05-10 17:49       ` Jason Andryuk
2023-05-11  6:21         ` Jan Beulich
2023-05-11 13:49           ` Jason Andryuk
2023-05-11 14:10             ` Jan Beulich
2023-05-11 20:22               ` Jason Andryuk
2023-05-12  6:32                 ` Jan Beulich
2023-05-01 19:30 ` [PATCH v3 08/14 RESEND] libxc: Include hwp_para in definitions Jason Andryuk
2023-05-19 13:53   ` Anthony PERARD
2023-05-01 19:30 ` [PATCH v3 09/14 RESEND] xenpm: Print HWP parameters Jason Andryuk
2023-05-08 10:43   ` Jan Beulich
2023-05-10 18:11     ` Jason Andryuk
2023-05-11  6:25       ` Jan Beulich
2023-05-01 19:30 ` [PATCH v3 10/14 RESEND] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
2023-05-08 11:27   ` Jan Beulich
2023-05-22 12:45     ` Jason Andryuk
2023-05-22 13:10       ` Jan Beulich
2023-05-22 14:43         ` Jason Andryuk
2023-05-01 19:30 ` [PATCH v3 11/14 RESEND] libxc: Add xc_set_cpufreq_hwp Jason Andryuk
2023-05-19 13:55   ` Anthony PERARD
2023-05-01 19:30 ` [PATCH v3 12/14 RESEND] xenpm: Factor out a non-fatal cpuid_parse variant Jason Andryuk
2023-05-08 12:01   ` Jan Beulich
2023-05-01 19:30 ` [PATCH v3 13/14 RESEND] xenpm: Add set-cpufreq-hwp subcommand Jason Andryuk
2023-05-08 11:56   ` Jan Beulich
2023-05-08 12:00     ` Jan Beulich
2023-05-22 12:59     ` Jason Andryuk
2023-05-22 13:20       ` Jan Beulich
2023-05-01 19:30 ` [PATCH v3 14/14 RESEND] CHANGELOG: Add Intel HWP entry Jason Andryuk
