xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/13] Intel Hardware P-States (HWP) support
@ 2021-05-03 19:27 Jason Andryuk
  2021-05-03 19:27 ` [PATCH 01/13] cpufreq: Allow restricting to internal governors only Jason Andryuk
                   ` (13 more replies)
  0 siblings, 14 replies; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:27 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Community Manager

Hi,

This patch series adds Hardware-Controlled Performance States (HWP) for
Intel processors to Xen.

With HWP, the processor makes its own determinations for frequency
selection depending, though users can set some parameters and
preferences.  There is also Turbo Boost which dynamically pushes the max
frequency if possible.

The existing governors don't work with HWP since they select frequencies
and HWP doesn't expose those.  Therefore a dummy hwp-interal governor is
used that doesn't do anything.

xenpm get-cpufreq-para is extended to show HWP parameters, and
set-cpufreq-hwp is added to set them.

A lightly loaded OpenXT laptop showed ~1W power savings according to
powertop.  A mostly idle Fedora system (dom0 only) showed a more modest
power savings.

This for for a 10th gen 6-core 1600 MHz base 4900 MHZ max cpu.  In the
default balance mode, Turbo Boost doesn't exceed 4GHz.  Tweaking the
energy_perf preference with `xenpm set-cpufreq-hwp balance ene:64`,
I've seen the CPU hit 4.7GHz before throttling down and bouncing around
between 4.3 and 4.5 GHz.  Curiously the other cores read ~4GHz when
turbo boost takes affect.  This was done after pinning all dom0 cores,
and using taskset to pin to vCPU/pCPU 11 and running a bash tightloop.

There are only minor changes since the RFC posting.  Typo fixes and a
few hunks have been folded into earlier patches.  I reordered "xenpm:
Factor out a non-fatal cpuid_parse variant" immediately before "xenpm:
Add set-cpufreq-hwp subcommand" where it is used.

Open questions:

HWP defaults to enabled and running in balanced mode.  This is useful
for testing, but should the old ACPI cpufreq driver remain the default?

This series unilaterally enables Hardware Duty Cycling (HDC) which is
another feature to autonomously powerdown things.  That is enabled if
HWP is enabled.  Maybe that want to be configurable?

I've only tested on an 8th gen and a 10th gen systems.  The don't have
fast MSR support, and they do have activity window and energy_perf
support.  So the respective other modes are untested.

This changes the systcl_pm_op hypercall, so that wants review.

I wanted to get this out since I know Qubes is also interested.

Regards,
Jason

Jason Andryuk (13):
  cpufreq: Allow restricting to internal governors only
  cpufreq: Add perf_freq to cpuinfo
  cpufreq: Export intel_feature_detect
  cpufreq: Add Hardware P-State (HWP) driver
  xenpm: Change get-cpufreq-para output for internal
  cpufreq: Export HWP parameters to userspace
  libxc: Include hwp_para in definitions
  xenpm: Print HWP parameters
  xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  libxc: Add xc_set_cpufreq_hwp
  xenpm: Factor out a non-fatal cpuid_parse variant
  xenpm: Add set-cpufreq-hwp subcommand
  CHANGELOG: Add Intel HWP entry

 CHANGELOG.md                              |   2 +
 docs/misc/xen-command-line.pandoc         |   9 +
 tools/include/xenctrl.h                   |   6 +
 tools/libs/ctrl/xc_pm.c                   |  18 +
 tools/misc/xenpm.c                        | 375 +++++++++++-
 xen/arch/x86/acpi/cpufreq/Makefile        |   1 +
 xen/arch/x86/acpi/cpufreq/cpufreq.c       |  15 +-
 xen/arch/x86/acpi/cpufreq/hwp.c           | 671 ++++++++++++++++++++++
 xen/drivers/acpi/pmstat.c                 |  30 +
 xen/drivers/cpufreq/cpufreq.c             |   4 +
 xen/drivers/cpufreq/utility.c             |   1 +
 xen/include/acpi/cpufreq/cpufreq.h        |   9 +
 xen/include/acpi/cpufreq/processor_perf.h |   5 +
 xen/include/asm-x86/cpufeature.h          |  11 +-
 xen/include/asm-x86/msr-index.h           |  17 +
 xen/include/public/sysctl.h               |  52 +-
 16 files changed, 1197 insertions(+), 29 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c

-- 
2.30.2



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 01/13] cpufreq: Allow restricting to internal governors only
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
@ 2021-05-03 19:27 ` Jason Andryuk
  2021-05-26 13:18   ` Jan Beulich
  2021-05-03 19:27 ` [PATCH 02/13] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:27 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Jan Beulich

For hwp, the standard governors are not usable, and only the internal
one is applicable.  Add the cpufreq_governor_internal boolean to
indicate when an internal governor, like hwp-internal, will be used.
This is set during presmp_initcall, so that it can suppress governor
registration during initcall.  Only a governor with a name containing
"internal" will be allowed in that case.

This way, the unuseable governors are not registered, so they internal
one is the only one returned to userspace.  This means incompatible
governors won't be advertised to userspace.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 xen/drivers/cpufreq/cpufreq.c      | 4 ++++
 xen/include/acpi/cpufreq/cpufreq.h | 1 +
 2 files changed, 5 insertions(+)

diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index 419aae83ee..6cbf150538 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -57,6 +57,7 @@ struct cpufreq_dom {
 };
 static LIST_HEAD_READ_MOSTLY(cpufreq_dom_list_head);
 
+bool __read_mostly cpufreq_governor_internal;
 struct cpufreq_governor *__read_mostly cpufreq_opt_governor;
 LIST_HEAD_READ_MOSTLY(cpufreq_governor_list);
 
@@ -122,6 +123,9 @@ int __init cpufreq_register_governor(struct cpufreq_governor *governor)
     if (!governor)
         return -EINVAL;
 
+    if (cpufreq_governor_internal && strstr(governor->name, "internal") == NULL)
+        return -EINVAL;
+
     if (__find_governor(governor->name) != NULL)
         return -EEXIST;
 
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index e88b20bfed..56df5eebed 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -115,6 +115,7 @@ extern struct cpufreq_governor cpufreq_gov_dbs;
 extern struct cpufreq_governor cpufreq_gov_userspace;
 extern struct cpufreq_governor cpufreq_gov_performance;
 extern struct cpufreq_governor cpufreq_gov_powersave;
+extern bool cpufreq_governor_internal;
 
 extern struct list_head cpufreq_governor_list;
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 02/13] cpufreq: Add perf_freq to cpuinfo
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
  2021-05-03 19:27 ` [PATCH 01/13] cpufreq: Allow restricting to internal governors only Jason Andryuk
@ 2021-05-03 19:27 ` Jason Andryuk
  2021-05-26 13:24   ` Jan Beulich
  2021-05-03 19:28 ` [PATCH 03/13] cpufreq: Export intel_feature_detect Jason Andryuk
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:27 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

acpi-cpufreq scales the aperf/mperf measurements by max_freq, but HWP
needs to scale by base frequency.  Settings max_freq to base_freq
"works" but the code is not obvious, and returning values to userspace
is tricky.  Add an additonal perf_freq member which is used for scaling
aperf/mperf measurements.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
I don't like this, but it seems the best way to re-use the common
aperf/mperf code.  The other option would be to add wrappers that then
do the acpi vs. hwp scaling.
---
 xen/arch/x86/acpi/cpufreq/cpufreq.c | 2 +-
 xen/drivers/cpufreq/utility.c       | 1 +
 xen/include/acpi/cpufreq/cpufreq.h  | 3 +++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index f1f3c6923f..5eac2f7321 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -317,7 +317,7 @@ unsigned int get_measured_perf(unsigned int cpu, unsigned int flag)
     else
         perf_percent = 0;
 
-    return policy->cpuinfo.max_freq * perf_percent / 100;
+    return policy->cpuinfo.perf_freq * perf_percent / 100;
 }
 
 static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
diff --git a/xen/drivers/cpufreq/utility.c b/xen/drivers/cpufreq/utility.c
index b93895d4dd..788929e079 100644
--- a/xen/drivers/cpufreq/utility.c
+++ b/xen/drivers/cpufreq/utility.c
@@ -236,6 +236,7 @@ int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
 
     policy->min = policy->cpuinfo.min_freq = min_freq;
     policy->max = policy->cpuinfo.max_freq = max_freq;
+    policy->cpuinfo.perf_freq = max_freq;
     policy->cpuinfo.second_max_freq = second_max_freq;
 
     if (policy->min == ~0)
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 56df5eebed..b91859ce5d 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -37,6 +37,9 @@ extern struct acpi_cpufreq_data *cpufreq_drv_data[NR_CPUS];
 struct cpufreq_cpuinfo {
     unsigned int        max_freq;
     unsigned int        second_max_freq;    /* P1 if Turbo Mode is on */
+    unsigned int        perf_freq; /* Scaling freq for aperf/mpref.
+                                      acpi-cpufreq uses max_freq, but HWP uses
+                                      base_freq.*/
     unsigned int        min_freq;
     unsigned int        transition_latency; /* in 10^(-9) s = nanoseconds */
 };
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 03/13] cpufreq: Export intel_feature_detect
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
  2021-05-03 19:27 ` [PATCH 01/13] cpufreq: Allow restricting to internal governors only Jason Andryuk
  2021-05-03 19:27 ` [PATCH 02/13] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
@ 2021-05-03 19:28 ` Jason Andryuk
  2021-05-26 13:27   ` Jan Beulich
  2021-05-03 19:28 ` [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

Export feature_detect as intel_feature_detect so it can be re-used by
HWP.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 xen/arch/x86/acpi/cpufreq/cpufreq.c       | 4 ++--
 xen/include/acpi/cpufreq/processor_perf.h | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index 5eac2f7321..8aae9b534d 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -340,7 +340,7 @@ static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
     return extract_freq(get_cur_val(cpumask_of(cpu)), data);
 }
 
-static void feature_detect(void *info)
+void intel_feature_detect(void *info)
 {
     struct cpufreq_policy *policy = info;
     unsigned int eax;
@@ -596,7 +596,7 @@ acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
     /* Check for APERF/MPERF support in hardware
      * also check for boost support */
     if (c->x86_vendor == X86_VENDOR_INTEL && c->cpuid_level >= 6)
-        on_selected_cpus(cpumask_of(cpu), feature_detect, policy, 1);
+        on_selected_cpus(cpumask_of(cpu), intel_feature_detect, policy, 1);
 
     /*
      * the first call to ->target() should result in us actually
diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h
index d8a1ba68a6..e2c08f0e6d 100644
--- a/xen/include/acpi/cpufreq/processor_perf.h
+++ b/xen/include/acpi/cpufreq/processor_perf.h
@@ -7,6 +7,8 @@
 
 #define XEN_PX_INIT 0x80000000
 
+void intel_feature_detect(void *info);
+
 int powernow_cpufreq_init(void);
 unsigned int powernow_register_driver(void);
 unsigned int get_measured_perf(unsigned int cpu, unsigned int flag);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (2 preceding siblings ...)
  2021-05-03 19:28 ` [PATCH 03/13] cpufreq: Export intel_feature_detect Jason Andryuk
@ 2021-05-03 19:28 ` Jason Andryuk
  2021-05-26 14:59   ` Jan Beulich
  2021-05-27  7:45   ` Jan Beulich
  2021-05-03 19:28 ` [PATCH 05/13] xenpm: Change get-cpufreq-para output for internal Jason Andryuk
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Andrew Cooper, George Dunlap, Ian Jackson,
	Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu,
	Roger Pau Monné

From the Intel SDM: "Hardware-Controlled Performance States (HWP), which
autonomously selects performance states while utilizing OS supplied
performance guidance hints."

Enable HWP to run in autonomous mode by poking the correct MSRs.

There is no interface to configure - it hardcodes the default 0x80 (out
of 0x0-0xff) energy/performance preference.  xen_sysctl_pm_op/xenpm will
be to be extended to configure in subsequent patches.

Unscientific powertop measurement of an mostly idle, customized OpenXT
install:
A 10th gen 6-core laptop showed battery discharge drop from ~9.x to
~7.x watts.
A 8th gen 4-core laptop dropped from ~10 to ~9

Power usage depends on many factors, especially display brightness, but
this does show an power saving in balanced mode when CPU utilization is
low.

HWP isn't compatible with an external governor - it doesn't take
explicit frequency requests.  Therefore a minimal internal governor,
hwp-internal, is also added as a placeholder.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

---

We disable on cpuid_level < 0x16.  cpuid(0x16) is used to get the cpu
frequencies for calculating the APERF/MPERF.  Without it, things would
still work, but the averge cpufrequency output would be wrong.

If HWP Energy_Performance_Preference isn't supported, the code falls
back to IA32_ENERGY_PERF_BIAS.  Right now, we don't check
CPUID.06H:ECX.SETBH[bit 3] before using that MSR.  The SDM reads like
it'll be available, and I assume it was available by the time Skylake
introduced HWP.

My 8th & 10th gen test systems both report:
(XEN) HWP: 1 notify: 1 act_window: 1 energy_perf: 1 pkg_level: 0 peci: 0
(XEN) HWP: FAST_IA32_HWP_REQUEST not supported
(XEN) HWP: Hardware Duty Cycling (HDC) supported
(XEN) HWP: HW_FEEDBACK not supported

So FAST_IA32_HWP_REQUEST and IA32_ENERGY_PERF_BIAS have not been tested.
---
 docs/misc/xen-command-line.pandoc         |   9 +
 xen/arch/x86/acpi/cpufreq/Makefile        |   1 +
 xen/arch/x86/acpi/cpufreq/cpufreq.c       |   9 +-
 xen/arch/x86/acpi/cpufreq/hwp.c           | 533 ++++++++++++++++++++++
 xen/include/acpi/cpufreq/processor_perf.h |   3 +
 xen/include/asm-x86/cpufeature.h          |  11 +-
 xen/include/asm-x86/msr-index.h           |  17 +
 7 files changed, 579 insertions(+), 4 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index c32a397a12..66363a3d71 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1355,6 +1355,15 @@ Specify whether guests are to be given access to physical port 80
 (often used for debugging purposes), to override the DMI based
 detection of systems known to misbehave upon accesses to that port.
 
+### hwp (x86)
+> `= <boolean>`
+
+> Default: `true`
+
+Specifies whether Xen uses Hardware-Controlled Performance States (HWP)
+on supported Intel hardware.  HWP is a Skylake+ feature which provides
+better CPU power management.
+
 ### idle_latency_factor (x86)
 > `= <integer>`
 
diff --git a/xen/arch/x86/acpi/cpufreq/Makefile b/xen/arch/x86/acpi/cpufreq/Makefile
index f75da9b9ca..db83aa6b14 100644
--- a/xen/arch/x86/acpi/cpufreq/Makefile
+++ b/xen/arch/x86/acpi/cpufreq/Makefile
@@ -1,2 +1,3 @@
 obj-y += cpufreq.o
+obj-y += hwp.o
 obj-y += powernow.o
diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index 8aae9b534d..966490bda1 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -641,9 +641,12 @@ static int __init cpufreq_driver_init(void)
     int ret = 0;
 
     if ((cpufreq_controller == FREQCTL_xen) &&
-        (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL))
-        ret = cpufreq_register_driver(&acpi_cpufreq_driver);
-    else if ((cpufreq_controller == FREQCTL_xen) &&
+        (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)) {
+        if (hwp_available())
+            ret = hwp_register_driver();
+        else
+            ret = cpufreq_register_driver(&acpi_cpufreq_driver);
+    } else if ((cpufreq_controller == FREQCTL_xen) &&
         (boot_cpu_data.x86_vendor &
          (X86_VENDOR_AMD | X86_VENDOR_HYGON)))
         ret = powernow_register_driver();
diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
new file mode 100644
index 0000000000..f8e6fdbd41
--- /dev/null
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -0,0 +1,533 @@
+/*
+ * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
+ *
+ * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or (at
+ * your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/cpumask.h>
+#include <xen/init.h>
+#include <xen/param.h>
+#include <xen/xmalloc.h>
+#include <asm/msr.h>
+#include <asm/io.h>
+#include <acpi/cpufreq/cpufreq.h>
+
+static bool feature_hwp;
+static bool feature_hwp_notification;
+static bool feature_hwp_activity_window;
+static bool feature_hwp_energy_perf;
+static bool feature_hwp_pkg_level_ctl;
+static bool feature_hwp_peci;
+
+static bool feature_hdc;
+static bool feature_fast_msr;
+
+bool opt_hwp = true;
+boolean_param("hwp", opt_hwp);
+
+union hwp_request
+{
+    struct
+    {
+        uint64_t min_perf:8;
+        uint64_t max_perf:8;
+        uint64_t desired:8;
+        uint64_t energy_perf:8;
+        uint64_t activity_window:10;
+        uint64_t package_control:1;
+        uint64_t reserved:16;
+        uint64_t activity_window_valid:1;
+        uint64_t energy_perf_valid:1;
+        uint64_t desired_valid:1;
+        uint64_t max_perf_valid:1;
+        uint64_t min_perf_valid:1;
+    };
+    uint64_t raw;
+};
+
+struct hwp_drv_data
+{
+    union
+    {
+        uint64_t hwp_caps;
+        struct
+        {
+            uint64_t hw_highest:8;
+            uint64_t hw_guaranteed:8;
+            uint64_t hw_most_efficient:8;
+            uint64_t hw_lowest:8;
+            uint64_t hw_reserved:32;
+        };
+    };
+    union hwp_request curr_req;
+    uint16_t activity_window;
+    uint8_t minimum;
+    uint8_t maximum;
+    uint8_t desired;
+    uint8_t energy_perf;
+};
+struct hwp_drv_data *hwp_drv_data[NR_CPUS];
+
+#define hwp_err(...)     printk(XENLOG_ERR __VA_ARGS__)
+#define hwp_info(...)    printk(XENLOG_INFO __VA_ARGS__)
+#define hwp_verbose(...)                   \
+({                                         \
+    if ( cpufreq_verbose )                 \
+    {                                      \
+        printk(XENLOG_DEBUG __VA_ARGS__);  \
+    }                                      \
+})
+#define hwp_verbose_cont(...)              \
+({                                         \
+    if ( cpufreq_verbose )                 \
+    {                                      \
+        printk(             __VA_ARGS__);  \
+    }                                      \
+})
+
+static int hwp_governor(struct cpufreq_policy *policy,
+                        unsigned int event)
+{
+    int ret;
+
+    if ( policy == NULL )
+        return -EINVAL;
+
+    switch (event)
+    {
+    case CPUFREQ_GOV_START:
+        ret = 0;
+        break;
+    case CPUFREQ_GOV_STOP:
+        ret = -EINVAL;
+        break;
+    case CPUFREQ_GOV_LIMITS:
+        ret = 0;
+        break;
+    default:
+        ret = -EINVAL;
+    }
+
+    return ret;
+}
+
+static struct cpufreq_governor hwp_cpufreq_governor =
+{
+    .name          = "hwp-internal",
+    .governor      = hwp_governor,
+};
+
+static int __init cpufreq_gov_hwp_init(void)
+{
+    return cpufreq_register_governor(&hwp_cpufreq_governor);
+}
+__initcall(cpufreq_gov_hwp_init);
+
+bool hwp_available(void)
+{
+    uint32_t eax;
+    uint64_t val;
+    bool use_hwp;
+
+    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
+    {
+        hwp_verbose("cpuid_level (%u) lacks HWP support\n", boot_cpu_data.cpuid_level);
+
+        return false;
+    }
+
+    eax = cpuid_eax(CPUID_PM_LEAF);
+    feature_hwp                 = !!(eax & CPUID6_EAX_HWP);
+    feature_hwp_notification    = !!(eax & CPUID6_EAX_HWP_Notification);
+    feature_hwp_activity_window = !!(eax & CPUID6_EAX_HWP_Activity_Window);
+    feature_hwp_energy_perf     =
+        !!(eax & CPUID6_EAX_HWP_Energy_Performance_Preference);
+    feature_hwp_pkg_level_ctl   =
+        !!(eax & CPUID6_EAX_HWP_Package_Level_Request);
+    feature_hwp_peci            = !!(eax & CPUID6_EAX_HWP_PECI);
+
+    hwp_verbose("HWP: %d notify: %d act_window: %d energy_perf: %d pkg_level: %d peci: %d\n",
+                feature_hwp, feature_hwp_notification,
+                feature_hwp_activity_window, feature_hwp_energy_perf,
+                feature_hwp_pkg_level_ctl, feature_hwp_peci);
+
+    if ( !feature_hwp )
+    {
+        hwp_verbose("Hardware does not support HWP\n");
+
+        return false;
+    }
+
+    if ( boot_cpu_data.cpuid_level < 0x16 )
+    {
+        hwp_info("HWP disabled: cpuid_level %x < 0x16 lacks CPU freq info\n",
+                 boot_cpu_data.cpuid_level);
+
+        return false;
+    }
+
+    hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n",
+                eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not ");
+    if ( eax & CPUID6_EAX_FAST_HWP_MSR )
+    {
+        if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) )
+            hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n");
+
+        hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val);
+        if (val & FAST_IA32_HWP_REQUEST )
+        {
+            hwp_verbose("HWP: FAST_IA32_HWP_REQUEST MSR available\n");
+            feature_fast_msr = true;
+        }
+    }
+
+    feature_hdc = !!(eax & CPUID6_EAX_HDC);
+
+    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported\n",
+                feature_hdc ? "" : "not ");
+
+    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
+                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
+
+    use_hwp = feature_hwp && opt_hwp;
+    cpufreq_governor_internal = use_hwp;
+
+    if ( use_hwp )
+        hwp_info("Using HWP for cpufreq\n");
+
+    return use_hwp;
+}
+
+static void hdc_set_pkg_hdc_ctl(bool val)
+{
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
+    {
+        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
+
+        return;
+    }
+
+    msr = val ? IA32_PKG_HDC_CTL_HDC_PKG_Enable : 0;
+
+    if ( wrmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
+        hwp_err("error wrmsr_safe(MSR_IA32_PKG_HDC_CTL): %016lx\n", msr);
+}
+
+static void hdc_set_pm_ctl1(bool val)
+{
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
+    {
+        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
+
+        return;
+    }
+
+    msr = val ? IA32_PM_CTL1_HDC_Allow_Block : 0;
+
+    if ( wrmsr_safe(MSR_IA32_PM_CTL1, msr) )
+        hwp_err("error wrmsr_safe(MSR_IA32_PM_CTL1): %016lx\n", msr);
+}
+
+static void hwp_fast_uncore_msrs_ctl(bool val)
+{
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) )
+        hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL)\n");
+
+    msr = val;
+
+    if ( wrmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) )
+        hwp_err("error wrmsr_safe(MSR_FAST_UNCORE_MSRS_CTL): %016lx\n", msr);
+}
+
+static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
+{
+    uint32_t base_khz, max_khz, bus_khz, edx;
+
+    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
+
+    /* aperf/mperf scales base. */
+    policy->cpuinfo.perf_freq = base_khz * 1000;
+    policy->cpuinfo.min_freq = base_khz * 1000;
+    policy->cpuinfo.max_freq = max_khz * 1000;
+    policy->min = base_khz * 1000;
+    policy->max = max_khz * 1000;
+    policy->cur = 0;
+}
+
+static void hwp_read_capabilities(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
+
+    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
+                policy->cpu);
+
+        return;
+    }
+
+    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
+
+        return;
+    }
+}
+
+static void hwp_init_msrs(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    uint64_t val;
+
+    /* Package level MSR, but we don't have a good idea of packages here, so
+     * just do it everytime. */
+    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu);
+
+        return;
+    }
+
+    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
+    if ( val != IA32_PM_ENABLE_HWP_ENABLE )
+    {
+        val = IA32_PM_ENABLE_HWP_ENABLE;
+        if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) )
+            hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n",
+                    policy->cpu, val);
+    }
+
+    hwp_read_capabilities(info);
+
+    /* Check for APERF/MPERF support in hardware
+     * also check for boost/turbo support */
+    intel_feature_detect(policy);
+
+    if ( feature_hdc )
+    {
+        hdc_set_pkg_hdc_ctl(true);
+        hdc_set_pm_ctl1(true);
+    }
+
+    if ( feature_fast_msr )
+        hwp_fast_uncore_msrs_ctl(true);
+
+    hwp_get_cpu_speeds(policy);
+}
+
+static int hwp_cpufreq_verify(struct cpufreq_policy *policy)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data = hwp_drv_data[cpu];
+
+    if ( !feature_hwp_energy_perf && data->energy_perf )
+    {
+        if ( data->energy_perf > 15 )
+        {
+            hwp_err("energy_perf %d exceeds IA32_ENERGY_PERF_BIAS range 0-15\n",
+                    data->energy_perf);
+
+            return -EINVAL;
+        }
+    }
+
+    if ( !feature_hwp_activity_window && data->activity_window )
+    {
+        hwp_err("HWP activity window not supported.\n");
+
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+/* val 0 - highest performance, 15 - maximum energy savings */
+static void hwp_energy_perf_bias(void *info)
+{
+    uint64_t msr;
+    struct hwp_drv_data *data = info;
+    uint8_t val = data->energy_perf;
+
+    ASSERT(val <= 15);
+
+    if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
+    {
+        hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
+
+        return;
+    }
+
+    msr &= ~(0xf);
+    msr |= val;
+
+    if ( wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
+        hwp_err("error wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS): %016lx\n", msr);
+}
+
+static void hwp_write_request(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
+    union hwp_request hwp_req = data->curr_req;
+
+    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
+    if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) )
+    {
+        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n",
+                policy->cpu, hwp_req.raw);
+        rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw);
+    }
+}
+
+static int hwp_cpufreq_target(struct cpufreq_policy *policy,
+                              unsigned int target_freq, unsigned int relation)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data = hwp_drv_data[cpu];
+    union hwp_request hwp_req;
+
+    /* Zero everything to ensure reserved bits are zero... */
+    hwp_req.raw = 0;
+    /* .. and update from there */
+    hwp_req.min_perf = data->minimum;
+    hwp_req.max_perf = data->maximum;
+    hwp_req.desired = data->desired;
+    if ( feature_hwp_energy_perf )
+        hwp_req.energy_perf = data->energy_perf;
+    if ( feature_hwp_activity_window )
+        hwp_req.activity_window = data->activity_window;
+
+    if ( hwp_req.raw == data->curr_req.raw )
+        return 0;
+
+    data->curr_req.raw = hwp_req.raw;
+
+    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
+    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
+
+    if ( !feature_hwp_energy_perf && data->energy_perf )
+    {
+        on_selected_cpus(cpumask_of(cpu), hwp_energy_perf_bias,
+                         data, 1);
+    }
+
+    return 0;
+}
+
+static int hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data;
+
+    if ( cpufreq_opt_governor )
+    {
+        printk(XENLOG_WARNING
+               "HWP: governor \"%s\" is incompatible with hwp. Using default \"%s\"\n",
+               cpufreq_opt_governor->name, hwp_cpufreq_governor.name);
+    }
+    policy->governor = &hwp_cpufreq_governor;
+
+    data = xzalloc(typeof(*data));
+    if ( !data )
+        return -ENOMEM;
+
+    hwp_drv_data[cpu] = data;
+
+    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
+
+    data->minimum = data->hw_lowest;
+    data->maximum = data->hw_highest;
+    data->desired = 0; /* default to HW autonomous */
+    if ( feature_hwp_energy_perf )
+        data->energy_perf = 0x80;
+    else
+        data->energy_perf = 7;
+
+    hwp_verbose("CPU%u: IA32_HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);
+
+    hwp_verbose("CPU%u: rdmsr HWP_REQUEST %016lx\n", cpu, data->curr_req.raw);
+
+    return 0;
+}
+
+static int hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+    unsigned int cpu = policy->cpu;
+
+    xfree(hwp_drv_data[cpu]);
+    hwp_drv_data[cpu] = NULL;
+
+    return 0;
+}
+
+/* The SDM reads like turbo should be disabled with MSR_IA32_PERF_CTL and
+ * PERF_CTL_TURBO_DISENGAGE, but that does not seem to actually work, at least
+ * with my HWP testing.  MSR_IA32_MISC_ENABLE and MISC_ENABLE_TURBO_DISENGAGE
+ * is what Linux uses and seems to work. */
+static void hwp_set_misc_turbo(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_IA32_MISC_ENABLE, msr) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_MISC_ENABLE)\n", policy->cpu);
+
+        return;
+    }
+
+    if ( policy->turbo == CPUFREQ_TURBO_ENABLED )
+        msr &= ~MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE;
+    else
+        msr |= MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE;
+
+    if ( wrmsr_safe(MSR_IA32_MISC_ENABLE, msr) )
+        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_MISC_ENABLE): %016lx\n",
+                policy->cpu, msr);
+}
+
+static int hwp_cpufreq_update(int cpuid, struct cpufreq_policy *policy)
+{
+    on_selected_cpus(cpumask_of(cpuid), hwp_set_misc_turbo, policy, 1);
+
+    return 0;
+}
+
+static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
+{
+    .name   = "hwp-cpufreq",
+    .verify = hwp_cpufreq_verify,
+    .target = hwp_cpufreq_target,
+    .init   = hwp_cpufreq_cpu_init,
+    .exit   = hwp_cpufreq_cpu_exit,
+    .update = hwp_cpufreq_update,
+};
+
+int hwp_register_driver(void)
+{
+    int ret;
+
+    ret = cpufreq_register_driver(&hwp_cpufreq_driver);
+
+    return ret;
+}
diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h
index e2c08f0e6d..2e67e667e0 100644
--- a/xen/include/acpi/cpufreq/processor_perf.h
+++ b/xen/include/acpi/cpufreq/processor_perf.h
@@ -9,6 +9,9 @@
 
 void intel_feature_detect(void *info);
 
+bool hwp_available(void);
+int hwp_register_driver(void);
+
 int powernow_cpufreq_init(void);
 unsigned int powernow_register_driver(void);
 unsigned int get_measured_perf(unsigned int cpu, unsigned int flag);
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 33b2257888..1900c90f90 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -26,7 +26,16 @@
 #define CPUID5_ECX_EXTENSIONS_SUPPORTED 0x1
 #define CPUID5_ECX_INTERRUPT_BREAK      0x2
 
-#define CPUID_PM_LEAF                    6
+#define CPUID_PM_LEAF                                6
+#define CPUID6_EAX_HWP                               (_AC(1, U) <<  7)
+#define CPUID6_EAX_HWP_Notification                  (_AC(1, U) <<  8)
+#define CPUID6_EAX_HWP_Activity_Window               (_AC(1, U) <<  9)
+#define CPUID6_EAX_HWP_Energy_Performance_Preference (_AC(1, U) << 10)
+#define CPUID6_EAX_HWP_Package_Level_Request         (_AC(1, U) << 11)
+#define CPUID6_EAX_HDC                               (_AC(1, U) << 13)
+#define CPUID6_EAX_HWP_PECI                          (_AC(1, U) << 16)
+#define CPUID6_EAX_FAST_HWP_MSR                      (_AC(1, U) << 18)
+#define CPUID6_EAX_HW_FEEDBACK                       (_AC(1, U) << 19)
 #define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
 
 /* CPUID level 0x00000001.edx */
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index bd3a3a1e7f..b8f712e1a3 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -101,6 +101,12 @@
 #define MSR_RTIT_ADDR_A(n)                 (0x00000580 + (n) * 2)
 #define MSR_RTIT_ADDR_B(n)                 (0x00000581 + (n) * 2)
 
+#define MSR_FAST_UNCORE_MSRS_CTL            0x00000657
+#define  FAST_IA32_HWP_REQUEST_MSR_ENABLE   (_AC(1, ULL) <<  0)
+
+#define MSR_FAST_UNCORE_MSRS_CAPABILITY     0x0000065f
+#define  FAST_IA32_HWP_REQUEST              (_AC(1, ULL) <<  0)
+
 #define MSR_U_CET                           0x000006a0
 #define MSR_S_CET                           0x000006a2
 #define  CET_SHSTK_EN                       (_AC(1, ULL) <<  0)
@@ -112,10 +118,20 @@
 #define MSR_PL3_SSP                         0x000006a7
 #define MSR_INTERRUPT_SSP_TABLE             0x000006a8
 
+#define MSR_IA32_PM_ENABLE                  0x00000770
+#define  IA32_PM_ENABLE_HWP_ENABLE          (_AC(1, ULL) <<  0)
+#define MSR_IA32_HWP_CAPABILITIES           0x00000771
+#define MSR_IA32_HWP_REQUEST                0x00000774
+
 #define MSR_PASID                           0x00000d93
 #define  PASID_PASID_MASK                   0x000fffff
 #define  PASID_VALID                        (_AC(1, ULL) << 31)
 
+#define MSR_IA32_PKG_HDC_CTL                0x00000db0
+#define  IA32_PKG_HDC_CTL_HDC_PKG_Enable    (_AC(1, ULL) <<  0)
+#define MSR_IA32_PM_CTL1                    0x00000db1
+#define  IA32_PM_CTL1_HDC_Allow_Block       (_AC(1, ULL) <<  0)
+
 #define MSR_K8_VM_CR                        0xc0010114
 #define  VM_CR_INIT_REDIRECTION             (_AC(1, ULL) <<  1)
 #define  VM_CR_SVM_DISABLE                  (_AC(1, ULL) <<  4)
@@ -460,6 +476,7 @@
 #define MSR_IA32_MISC_ENABLE_LIMIT_CPUID  (1<<22)
 #define MSR_IA32_MISC_ENABLE_XTPR_DISABLE (1<<23)
 #define MSR_IA32_MISC_ENABLE_XD_DISABLE	(1ULL << 34)
+#define MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE (1ULL << 38)
 
 #define MSR_IA32_TSC_DEADLINE		0x000006E0
 #define MSR_IA32_ENERGY_PERF_BIAS	0x000001b0
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 05/13] xenpm: Change get-cpufreq-para output for internal
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (3 preceding siblings ...)
  2021-05-03 19:28 ` [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
@ 2021-05-03 19:28 ` Jason Andryuk
  2021-05-26 15:21   ` Jan Beulich
  2021-05-03 19:28 ` [PATCH 06/13] cpufreq: Export HWP parameters to userspace Jason Andryuk
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Ian Jackson, Wei Liu

When using HWP, some of the returned data is not applicable.  In that
case, we should just omit it to avoid confusing the user.  So switch to
printing the base and turbo frequencies since those are relevant to HWP.
Similarly, stop printing the CPU frequencies since those do not apply.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 tools/misc/xenpm.c | 45 +++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index d0191d4984..562bf939f9 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -711,6 +711,7 @@ void start_gather_func(int argc, char *argv[])
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
+    bool internal = strstr(p_cpufreq->scaling_governor, "internal");
     int i;
 
     printf("cpu id               : %d\n", cpuid);
@@ -720,10 +721,19 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
         printf(" %d", p_cpufreq->affected_cpus[i]);
     printf("\n");
 
-    printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
-           p_cpufreq->cpuinfo_max_freq,
-           p_cpufreq->cpuinfo_min_freq,
-           p_cpufreq->cpuinfo_cur_freq);
+    if ( internal )
+    {
+        printf("cpuinfo frequency    : base [%u] turbo [%u]\n",
+               p_cpufreq->cpuinfo_min_freq,
+               p_cpufreq->cpuinfo_max_freq);
+    }
+    else
+    {
+        printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
+               p_cpufreq->cpuinfo_max_freq,
+               p_cpufreq->cpuinfo_min_freq,
+               p_cpufreq->cpuinfo_cur_freq);
+    }
 
     printf("scaling_driver       : %s\n", p_cpufreq->scaling_driver);
 
@@ -750,19 +760,22 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
                p_cpufreq->u.ondemand.up_threshold);
     }
 
-    printf("scaling_avail_freq   :");
-    for ( i = 0; i < p_cpufreq->freq_num; i++ )
-        if ( p_cpufreq->scaling_available_frequencies[i] ==
-             p_cpufreq->scaling_cur_freq )
-            printf(" *%d", p_cpufreq->scaling_available_frequencies[i]);
-        else
-            printf(" %d", p_cpufreq->scaling_available_frequencies[i]);
-    printf("\n");
+    if ( !internal )
+    {
+        printf("scaling_avail_freq   :");
+        for ( i = 0; i < p_cpufreq->freq_num; i++ )
+            if ( p_cpufreq->scaling_available_frequencies[i] ==
+                 p_cpufreq->scaling_cur_freq )
+                printf(" *%d", p_cpufreq->scaling_available_frequencies[i]);
+            else
+                printf(" %d", p_cpufreq->scaling_available_frequencies[i]);
+        printf("\n");
 
-    printf("scaling frequency    : max [%u] min [%u] cur [%u]\n",
-           p_cpufreq->scaling_max_freq,
-           p_cpufreq->scaling_min_freq,
-           p_cpufreq->scaling_cur_freq);
+        printf("scaling frequency    : max [%u] min [%u] cur [%u]\n",
+               p_cpufreq->scaling_max_freq,
+               p_cpufreq->scaling_min_freq,
+               p_cpufreq->scaling_cur_freq);
+    }
 
     printf("turbo mode           : %s\n",
            p_cpufreq->turbo_enabled ? "enabled" : "disabled or n/a");
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 06/13] cpufreq: Export HWP parameters to userspace
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (4 preceding siblings ...)
  2021-05-03 19:28 ` [PATCH 05/13] xenpm: Change get-cpufreq-para output for internal Jason Andryuk
@ 2021-05-03 19:28 ` Jason Andryuk
  2021-05-27  7:55   ` Jan Beulich
  2021-05-03 19:28 ` [PATCH 07/13] libxc: Include hwp_para in definitions Jason Andryuk
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini

Extend xen_get_cpufreq_para to return hwp parameters.  These match the
hardware rather closely.

We need the hw_features bitmask to indicated fields supported by the
actual hardware.

The use of uint8_t parameters matches the hardware size.  uint32_t
entries grows the sysctl_t past the build assertion in setup.c.  The
uint8_t ranges are supported across multiple generations, so hopefully
they won't change.

Increment XEN_SYSCTL_INTERFACE_VERSION for the new fields.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 xen/arch/x86/acpi/cpufreq/hwp.c    | 24 ++++++++++++++++++++++++
 xen/drivers/acpi/pmstat.c          |  6 ++++++
 xen/include/acpi/cpufreq/cpufreq.h |  3 +++
 xen/include/public/sysctl.h        | 20 +++++++++++++++++++-
 4 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index f8e6fdbd41..92222d6d85 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -523,6 +523,30 @@ static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
     .update = hwp_cpufreq_update,
 };
 
+int get_hwp_para(struct cpufreq_policy *policy, struct xen_hwp_para *hwp_para)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data = hwp_drv_data[cpu];
+
+    if ( data == NULL )
+        return -EINVAL;
+
+    hwp_para->hw_feature        =
+        feature_hwp_activity_window ? XEN_SYSCTL_HWP_FEAT_ACT_WINDOW  : 0 |
+        feature_hwp_energy_perf     ? XEN_SYSCTL_HWP_FEAT_ENERGY_PERF : 0;
+    hwp_para->hw_lowest         = data->hw_lowest;
+    hwp_para->hw_most_efficient = data->hw_most_efficient;
+    hwp_para->hw_guaranteed     = data->hw_guaranteed;
+    hwp_para->hw_highest        = data->hw_highest;
+    hwp_para->minimum           = data->minimum;
+    hwp_para->maximum           = data->maximum;
+    hwp_para->energy_perf       = data->energy_perf;
+    hwp_para->activity_window   = data->activity_window;
+    hwp_para->desired           = data->desired;
+
+    return 0;
+}
+
 int hwp_register_driver(void)
 {
     int ret;
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index 1bae635101..3e35c42949 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -290,6 +290,12 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
             &op->u.get_para.u.ondemand.sampling_rate,
             &op->u.get_para.u.ondemand.up_threshold);
     }
+
+    if ( !strncasecmp(op->u.get_para.scaling_governor,
+                      "hwp-internal", CPUFREQ_NAME_LEN) )
+    {
+        ret = get_hwp_para(policy, &op->u.get_para.u.hwp_para);
+    }
     op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
 
     return ret;
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index b91859ce5d..42146ca2cf 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -246,4 +246,7 @@ int write_userspace_scaling_setspeed(unsigned int cpu, unsigned int freq);
 void cpufreq_dbs_timer_suspend(void);
 void cpufreq_dbs_timer_resume(void);
 
+/********************** hwp hypercall helper *************************/
+int get_hwp_para(struct cpufreq_policy *policy, struct xen_hwp_para *hwp_para);
+
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 039ccf885c..1a6c6397ea 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -35,7 +35,7 @@
 #include "domctl.h"
 #include "physdev.h"
 
-#define XEN_SYSCTL_INTERFACE_VERSION 0x00000013
+#define XEN_SYSCTL_INTERFACE_VERSION 0x00000014
 
 /*
  * Read console content from Xen buffer ring.
@@ -301,6 +301,23 @@ struct xen_ondemand {
     uint32_t up_threshold;
 };
 
+struct xen_hwp_para {
+    uint16_t activity_window; /* 7bit mantissa and 3bit exponent */
+#define XEN_SYSCTL_HWP_FEAT_ENERGY_PERF (1 << 0) /* energy_perf range 0-255 if
+                                                    1. Otherwise 0-15 */
+#define XEN_SYSCTL_HWP_FEAT_ACT_WINDOW  (1 << 1) /* activity_window supported
+                                                    if 1 */
+    uint8_t hw_feature; /* bit flags for features */
+    uint8_t hw_lowest;
+    uint8_t hw_most_efficient;
+    uint8_t hw_guaranteed;
+    uint8_t hw_highest;
+    uint8_t minimum;
+    uint8_t maximum;
+    uint8_t desired;
+    uint8_t energy_perf;
+};
+
 /*
  * cpufreq para name of this structure named
  * same as sysfs file name of native linux
@@ -332,6 +349,7 @@ struct xen_get_cpufreq_para {
     union {
         struct  xen_userspace userspace;
         struct  xen_ondemand ondemand;
+        struct  xen_hwp_para hwp_para;
     } u;
 
     int32_t turbo_enabled;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 07/13] libxc: Include hwp_para in definitions
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (5 preceding siblings ...)
  2021-05-03 19:28 ` [PATCH 06/13] cpufreq: Export HWP parameters to userspace Jason Andryuk
@ 2021-05-03 19:28 ` Jason Andryuk
  2021-05-03 19:28 ` [PATCH 08/13] xenpm: Print HWP parameters Jason Andryuk
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Ian Jackson, Wei Liu

Expose the hwp_para fields through libxc.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 tools/include/xenctrl.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 27cec1b93f..82dfa1613a 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1960,6 +1960,7 @@ int xc_smt_disable(xc_interface *xch);
  */
 typedef struct xen_userspace xc_userspace_t;
 typedef struct xen_ondemand xc_ondemand_t;
+typedef struct xen_hwp_para xc_hwp_para_t;
 
 struct xc_get_cpufreq_para {
     /* IN/OUT variable */
@@ -1987,6 +1988,7 @@ struct xc_get_cpufreq_para {
     union {
         xc_userspace_t userspace;
         xc_ondemand_t ondemand;
+        xc_hwp_para_t hwp_para;
     } u;
 
     int32_t turbo_enabled;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 08/13] xenpm: Print HWP parameters
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (6 preceding siblings ...)
  2021-05-03 19:28 ` [PATCH 07/13] libxc: Include hwp_para in definitions Jason Andryuk
@ 2021-05-03 19:28 ` Jason Andryuk
  2021-05-27  8:02   ` Jan Beulich
  2021-05-03 19:28 ` [PATCH 09/13] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Ian Jackson, Wei Liu

Print HWP-specific parameters.  Some are always present, but others
depend on hardware support.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 tools/misc/xenpm.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 562bf939f9..9588dac991 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -708,6 +708,43 @@ void start_gather_func(int argc, char *argv[])
     pause();
 }
 
+static void calculate_hwp_activity_window(const xc_hwp_para_t *hwp,
+                                          unsigned int *activity_window,
+                                          const char **units)
+{
+    unsigned int mantissa = hwp->activity_window & 0x7f;
+    unsigned int exponent = ( hwp->activity_window >> 7 ) & 0x7;
+    unsigned int multiplier = 1;
+
+    if ( hwp->activity_window == 0 )
+    {
+        *units = "hardware selected";
+        *activity_window = 0;
+
+        return;
+    }
+
+    if ( exponent >= 6 )
+    {
+        *units = "s";
+        exponent -= 6;
+    }
+    else if ( exponent >= 3 )
+    {
+        *units = "ms";
+        exponent -= 3;
+    }
+    else
+    {
+        *units = "us";
+    }
+
+    for ( unsigned int i = 0; i < exponent; i++ )
+        multiplier *= 10;
+
+    *activity_window = mantissa * multiplier;
+}
+
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
@@ -777,6 +814,40 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
                p_cpufreq->scaling_cur_freq);
     }
 
+    if ( strcmp(p_cpufreq->scaling_governor, "hwp-internal") == 0 )
+    {
+        const xc_hwp_para_t *hwp = &p_cpufreq->u.hwp_para;
+
+        printf("hwp variables        :\n");
+        printf("  hardware limits    : lowest [%u] most_efficient [%u]\n",
+               hwp->hw_lowest, hwp->hw_most_efficient);
+        printf("  hardware limits    : guaranteed [%u] highest [%u]\n",
+               hwp->hw_guaranteed, hwp->hw_highest);
+        printf("  configured limits  : min [%u] max [%u] energy_perf [%u]\n",
+               hwp->minimum, hwp->maximum, hwp->energy_perf);
+
+        if ( hwp->hw_feature & XEN_SYSCTL_HWP_FEAT_ENERGY_PERF )
+        {
+            printf("  configured limits  : energy_perf [%u%s]\n",
+                   hwp->energy_perf,
+                   hwp->energy_perf ? "" : " hw autonomous");
+        }
+
+        if ( hwp->hw_feature & XEN_SYSCTL_HWP_FEAT_ACT_WINDOW )
+        {
+            unsigned int activity_window;
+            const char *units;
+
+            calculate_hwp_activity_window(hwp, &activity_window, &units);
+            printf("  configured limits  : activity_window [%u %s]\n",
+                   activity_window, units);
+        }
+
+        printf("  configured limits  : desired [%u%s]\n",
+               hwp->desired,
+               hwp->desired ? "" : " hw autonomous");
+    }
+
     printf("turbo mode           : %s\n",
            p_cpufreq->turbo_enabled ? "enabled" : "disabled or n/a");
     printf("\n");
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 09/13] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (7 preceding siblings ...)
  2021-05-03 19:28 ` [PATCH 08/13] xenpm: Print HWP parameters Jason Andryuk
@ 2021-05-03 19:28 ` Jason Andryuk
  2021-05-27  8:33   ` Jan Beulich
  2021-05-03 19:28 ` [PATCH 10/13] libxc: Add xc_set_cpufreq_hwp Jason Andryuk
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini

Add SET_CPUFREQ_HWP xen_sysctl_pm_op to set HWP parameters.  The sysctl
supports setting multiple values simultaneously as indicated by the
set_params bits.  This allows atomically applying new HWP configuration
via a single wrmsr.

XEN_SYSCTL_HWP_SET_PRESET_BALANCE/PERFORMANCE/POWERSAVE provide three
common presets.  Setting them depends on hardware limits which the
hypervisor is already caching.  So using them allows skipping a
hypercall to query the limits (hw_lowest/highest) to then set those same
values.  The code is organized to allow a preset to be refined with
additional stuff if desired.

"most_efficient" and "guaranteed" could be additional presets in the
future, but the are not added now.  Those levels can change at runtime,
but we don't have code in place to monitor and update for those events.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 xen/arch/x86/acpi/cpufreq/hwp.c    | 114 +++++++++++++++++++++++++++++
 xen/drivers/acpi/pmstat.c          |  24 ++++++
 xen/include/acpi/cpufreq/cpufreq.h |   2 +
 xen/include/public/sysctl.h        |  32 ++++++++
 4 files changed, 172 insertions(+)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index 92222d6d85..0fd70d76a8 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -547,6 +547,120 @@ int get_hwp_para(struct cpufreq_policy *policy, struct xen_hwp_para *hwp_para)
     return 0;
 }
 
+int set_hwp_para(struct cpufreq_policy *policy,
+                 struct xen_set_hwp_para *set_hwp)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data = hwp_drv_data[cpu];
+
+    if ( data == NULL )
+        return -EINVAL;
+
+    /* Validate all parameters first */
+    if ( set_hwp->set_params & ~XEN_SYSCTL_HWP_SET_PARAM_MASK )
+    {
+        hwp_err("Invalid bits in hwp set_params %u\n",
+                set_hwp->set_params);
+
+        return -EINVAL;
+    }
+
+    if ( set_hwp->activity_window & ~XEN_SYSCTL_HWP_ACT_WINDOW_MASK )
+    {
+        hwp_err("Invalid bits in activity window %u\n",
+                set_hwp->activity_window);
+
+        return -EINVAL;
+    }
+
+    if ( !feature_hwp_energy_perf &&
+         set_hwp->set_params & XEN_SYSCTL_HWP_SET_ENERGY_PERF &&
+         set_hwp->energy_perf > 0xf )
+    {
+        hwp_err("energy_perf %u out of range for IA32_ENERGY_PERF_BIAS\n",
+                set_hwp->energy_perf);
+
+        return -EINVAL;
+    }
+
+    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_DESIRED &&
+         set_hwp->desired != 0 &&
+         ( set_hwp->desired < data->hw_lowest ||
+           set_hwp->desired > data->hw_highest ) )
+    {
+        hwp_err("hwp desired %u is out of range (%u ... %u)\n",
+                set_hwp->desired, data->hw_lowest, data->hw_highest);
+
+        return -EINVAL;
+    }
+
+    /*
+     * minimum & maximum are not validated as hardware doesn't seem to care
+     * and the SDM says CPUs will clip internally.
+     */
+
+    /* Apply presets */
+    switch ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_PRESET_MASK )
+    {
+    case XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE:
+        data->minimum = data->hw_lowest;
+        data->maximum = data->hw_lowest;
+        data->activity_window = 0;
+        if ( feature_hwp_energy_perf )
+            data->energy_perf = 0xff;
+        else
+            data->energy_perf = 0xf;
+        data->desired = 0;
+        break;
+    case XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE:
+        data->minimum = data->hw_highest;
+        data->maximum = data->hw_highest;
+        data->activity_window = 0;
+        data->energy_perf = 0;
+        data->desired = 0;
+        break;
+    case XEN_SYSCTL_HWP_SET_PRESET_BALANCE:
+        data->minimum = data->hw_lowest;
+        data->maximum = data->hw_highest;
+        data->activity_window = 0;
+        data->energy_perf = 0x80;
+        if ( feature_hwp_energy_perf )
+            data->energy_perf = 0x80;
+        else
+            data->energy_perf = 0x7;
+        data->desired = 0;
+        break;
+    case XEN_SYSCTL_HWP_SET_PRESET_NONE:
+        break;
+    default:
+        printk("HWP: Invalid preset value: %u\n",
+               set_hwp->set_params & XEN_SYSCTL_HWP_SET_PRESET_MASK);
+
+        return -EINVAL;
+    }
+
+    /* Further customize presets if needed */
+    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_MINIMUM )
+        data->minimum = set_hwp->minimum;
+
+    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_MAXIMUM )
+        data->maximum = set_hwp->maximum;
+
+    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_ENERGY_PERF )
+        data->energy_perf = set_hwp->energy_perf;
+
+    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_DESIRED )
+        data->desired = set_hwp->desired;
+
+    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_ACT_WINDOW )
+        data->activity_window = set_hwp->activity_window &
+                                XEN_SYSCTL_HWP_ACT_WINDOW_MASK;
+
+    hwp_cpufreq_target(policy, 0, 0);
+
+    return 0;
+}
+
 int hwp_register_driver(void)
 {
     int ret;
diff --git a/xen/drivers/acpi/pmstat.c b/xen/drivers/acpi/pmstat.c
index 3e35c42949..016b0445ec 100644
--- a/xen/drivers/acpi/pmstat.c
+++ b/xen/drivers/acpi/pmstat.c
@@ -318,6 +318,24 @@ static int set_cpufreq_gov(struct xen_sysctl_pm_op *op)
     return __cpufreq_set_policy(old_policy, &new_policy);
 }
 
+static int set_cpufreq_hwp(struct xen_sysctl_pm_op *op)
+{
+    struct cpufreq_policy *policy;
+
+    if ( !cpufreq_governor_internal )
+        return -EINVAL;
+
+    policy = per_cpu(cpufreq_cpu_policy, op->cpuid);
+
+    if ( !policy || !policy->governor )
+        return -EINVAL;
+
+    if ( strncasecmp(policy->governor->name, "hwp-internal", CPUFREQ_NAME_LEN) )
+        return -EINVAL;
+
+    return set_hwp_para(policy, &op->u.set_hwp);
+}
+
 static int set_cpufreq_para(struct xen_sysctl_pm_op *op)
 {
     int ret = 0;
@@ -465,6 +483,12 @@ int do_pm_op(struct xen_sysctl_pm_op *op)
         break;
     }
 
+    case SET_CPUFREQ_HWP:
+    {
+        ret = set_cpufreq_hwp(op);
+        break;
+    }
+
     case SET_CPUFREQ_PARA:
     {
         ret = set_cpufreq_para(op);
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 42146ca2cf..7ff7d0d4bb 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -248,5 +248,7 @@ void cpufreq_dbs_timer_resume(void);
 
 /********************** hwp hypercall helper *************************/
 int get_hwp_para(struct cpufreq_policy *policy, struct xen_hwp_para *hwp_para);
+int set_hwp_para(struct cpufreq_policy *policy,
+                 struct xen_set_hwp_para *set_hwp);
 
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 1a6c6397ea..3f18a3d522 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -318,6 +318,36 @@ struct xen_hwp_para {
     uint8_t energy_perf;
 };
 
+/* set multiple values simultaneously when set_args bit is set */
+struct xen_set_hwp_para {
+    uint16_t set_params; /* bitflags for valid values */
+#define XEN_SYSCTL_HWP_SET_DESIRED              (1U << 0)
+#define XEN_SYSCTL_HWP_SET_ENERGY_PERF          (1U << 1)
+#define XEN_SYSCTL_HWP_SET_ACT_WINDOW           (1U << 2)
+#define XEN_SYSCTL_HWP_SET_MINIMUM              (1U << 3)
+#define XEN_SYSCTL_HWP_SET_MAXIMUM              (1U << 4)
+#define XEN_SYSCTL_HWP_SET_PRESET_MASK          (0xf000)
+#define XEN_SYSCTL_HWP_SET_PRESET_NONE          (0x0000)
+#define XEN_SYSCTL_HWP_SET_PRESET_BALANCE       (0x1000)
+#define XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE     (0x2000)
+#define XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE   (0x3000)
+#define XEN_SYSCTL_HWP_SET_PARAM_MASK ((uint16_t)( \
+                                  XEN_SYSCTL_HWP_SET_PRESET_MASK | \
+                                  XEN_SYSCTL_HWP_SET_DESIRED     | \
+                                  XEN_SYSCTL_HWP_SET_ENERGY_PERF | \
+                                  XEN_SYSCTL_HWP_SET_ACT_WINDOW  | \
+                                  XEN_SYSCTL_HWP_SET_MINIMUM     | \
+                                  XEN_SYSCTL_HWP_SET_MAXIMUM     ))
+
+    uint16_t activity_window; /* 7bit mantissa and 3bit exponent */
+#define XEN_SYSCTL_HWP_ACT_WINDOW_MASK          (0x03ff)
+    uint8_t minimum;
+    uint8_t maximum;
+    uint8_t desired;
+    uint8_t energy_perf; /* 0-255 or 0-15 depending on HW support */
+};
+
+
 /*
  * cpufreq para name of this structure named
  * same as sysfs file name of native linux
@@ -379,6 +409,7 @@ struct xen_sysctl_pm_op {
     #define SET_CPUFREQ_GOV            (CPUFREQ_PARA | 0x02)
     #define SET_CPUFREQ_PARA           (CPUFREQ_PARA | 0x03)
     #define GET_CPUFREQ_AVGFREQ        (CPUFREQ_PARA | 0x04)
+    #define SET_CPUFREQ_HWP            (CPUFREQ_PARA | 0x05)
 
     /* set/reset scheduler power saving option */
     #define XEN_SYSCTL_pm_op_set_sched_opt_smt    0x21
@@ -405,6 +436,7 @@ struct xen_sysctl_pm_op {
         struct xen_get_cpufreq_para get_para;
         struct xen_set_cpufreq_gov  set_gov;
         struct xen_set_cpufreq_para set_para;
+        struct xen_set_hwp_para     set_hwp;
         uint64_aligned_t get_avgfreq;
         uint32_t                    set_sched_opt_smt;
 #define XEN_SYSCTL_CX_UNLIMITED 0xffffffff
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 10/13] libxc: Add xc_set_cpufreq_hwp
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (8 preceding siblings ...)
  2021-05-03 19:28 ` [PATCH 09/13] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
@ 2021-05-03 19:28 ` Jason Andryuk
  2021-05-04  8:03   ` Jan Beulich
  2021-05-27  9:45   ` Jan Beulich
  2021-05-03 19:28 ` [PATCH 11/13] xenpm: Factor out a non-fatal cpuid_parse variant Jason Andryuk
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Ian Jackson, Wei Liu

Add xc_set_cpufreq_hwp to allow calling xen_systctl_pm_op
SET_CPUFREQ_HWP.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

---
Am I allowed to do set_hwp = *set_hwp struct assignment?
---
 tools/include/xenctrl.h |  4 ++++
 tools/libs/ctrl/xc_pm.c | 18 ++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 82dfa1613a..0fd1e756cb 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1994,11 +1994,15 @@ struct xc_get_cpufreq_para {
     int32_t turbo_enabled;
 };
 
+typedef struct xen_set_hwp_para xc_set_hwp_para_t;
+
 int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
                         struct xc_get_cpufreq_para *user_para);
 int xc_set_cpufreq_gov(xc_interface *xch, int cpuid, char *govname);
 int xc_set_cpufreq_para(xc_interface *xch, int cpuid,
                         int ctrl_type, int ctrl_value);
+int xc_set_cpufreq_hwp(xc_interface *xch, int cpuid,
+                       xc_set_hwp_para_t *set_hwp);
 int xc_get_cpufreq_avgfreq(xc_interface *xch, int cpuid, int *avg_freq);
 
 int xc_set_sched_opt_smt(xc_interface *xch, uint32_t value);
diff --git a/tools/libs/ctrl/xc_pm.c b/tools/libs/ctrl/xc_pm.c
index 76d7eb7f26..407a24d2aa 100644
--- a/tools/libs/ctrl/xc_pm.c
+++ b/tools/libs/ctrl/xc_pm.c
@@ -330,6 +330,24 @@ int xc_set_cpufreq_para(xc_interface *xch, int cpuid,
     return xc_sysctl(xch, &sysctl);
 }
 
+int xc_set_cpufreq_hwp(xc_interface *xch, int cpuid,
+                       xc_set_hwp_para_t *set_hwp)
+{
+    DECLARE_SYSCTL;
+
+    if ( !xch )
+    {
+        errno = EINVAL;
+        return -1;
+    }
+    sysctl.cmd = XEN_SYSCTL_pm_op;
+    sysctl.u.pm_op.cmd = SET_CPUFREQ_HWP;
+    sysctl.u.pm_op.cpuid = cpuid;
+    sysctl.u.pm_op.u.set_hwp = *set_hwp;
+
+    return xc_sysctl(xch, &sysctl);
+}
+
 int xc_get_cpufreq_avgfreq(xc_interface *xch, int cpuid, int *avg_freq)
 {
     int ret = 0;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 11/13] xenpm: Factor out a non-fatal cpuid_parse variant
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (9 preceding siblings ...)
  2021-05-03 19:28 ` [PATCH 10/13] libxc: Add xc_set_cpufreq_hwp Jason Andryuk
@ 2021-05-03 19:28 ` Jason Andryuk
  2021-05-27  8:41   ` Jan Beulich
  2021-05-03 19:28 ` [PATCH 12/13] xenpm: Add set-cpufreq-hwp subcommand Jason Andryuk
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Ian Jackson, Wei Liu

Allow cpuid_prase to be re-used without terminating xenpm.  HWP
will re-use it to optionally parse a cpuid.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 tools/misc/xenpm.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 9588dac991..a686f8f46e 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -79,17 +79,26 @@ void help_func(int argc, char *argv[])
     show_help();
 }
 
-static void parse_cpuid(const char *arg, int *cpuid)
+static int parse_cpuid_non_fatal(const char *arg, int *cpuid)
 {
     if ( sscanf(arg, "%d", cpuid) != 1 || *cpuid < 0 )
     {
         if ( strcasecmp(arg, "all") )
-        {
-            fprintf(stderr, "Invalid CPU identifier: '%s'\n", arg);
-            exit(EINVAL);
-        }
+            return -1;
+
         *cpuid = -1;
     }
+
+    return 0;
+}
+
+static void parse_cpuid(const char *arg, int *cpuid)
+{
+    if ( parse_cpuid_non_fatal(arg, cpuid) )
+    {
+        fprintf(stderr, "Invalid CPU identifier: '%s'\n", arg);
+        exit(EINVAL);
+    }
 }
 
 static void parse_cpuid_and_int(int argc, char *argv[],
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 12/13] xenpm: Add set-cpufreq-hwp subcommand
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (10 preceding siblings ...)
  2021-05-03 19:28 ` [PATCH 11/13] xenpm: Factor out a non-fatal cpuid_parse variant Jason Andryuk
@ 2021-05-03 19:28 ` Jason Andryuk
  2021-05-27  9:46   ` Jan Beulich
  2021-05-03 19:28 ` [PATCH 13/13] CHANGELOG: Add Intel HWP entry Jason Andryuk
  2021-05-20 14:57 ` [PATCH 00/13] Intel Hardware P-States (HWP) support Jan Beulich
  13 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Ian Jackson, Wei Liu

set-cpufreq-hwp allows setting the Hardware P-State (HWP) parameters.

It can be run on all or just a single cpu.  There are presets of
balance, powersave & performance.  Those can be further tweaked by
param:val arguments as explained in the usage description.

Parameter names are just checked to the first 3 characters to shorten
typing.

Some options are hardware dependent, and ranges can be found in
get-cpufreq-para.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 tools/misc/xenpm.c | 240 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 240 insertions(+)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index a686f8f46e..d3bcaf3b58 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -67,6 +67,25 @@ void show_help(void)
             " set-max-cstate        <num>|'unlimited' [<num2>|'unlimited']\n"
             "                                     set the C-State limitation (<num> >= 0) and\n"
             "                                     optionally the C-sub-state limitation (<num2> >= 0)\n"
+            " set-cpufreq-hwp       [cpuid] [balance|performance|powersave] <param:val>*\n"
+            "                                     set Hardware P-State (HWP) parameters\n"
+            "                                     optionally a preset of one of\n"
+            "                                       balance|performance|powersave\n"
+            "                                     an optional list of param:val arguments\n"
+            "                                       minimum:N  hw_lowest ... hw_highest\n"
+            "                                       maximum:N  hw_lowest ... hw_highest\n"
+            "                                       desired:N  hw_lowest ... hw_highest\n"
+            "                                           Set explicit performance target.\n"
+            "                                           non-zero disables auto-HWP mode.\n"
+            "                                       energy_perf:0-255 (or 0-15)\n"
+            "                                                   energy/performance hint\n"
+            "                                                   lower favor performance\n"
+            "                                                   higher favor powersave\n"
+            "                                                   127 (or 7) balance\n"
+            "                                       act_window:N{,m,u}s range 0us-1270s\n"
+            "                                           window for internal calculations.\n"
+            "                                           0 lets the hardware decide.\n"
+            "                                     get-cpufreq-para returns hw_lowest/highest.\n"
             " start [seconds]                     start collect Cx/Px statistics,\n"
             "                                     output after CTRL-C or SIGINT or several seconds.\n"
             " enable-turbo-mode     [cpuid]       enable Turbo Mode for processors that support it.\n"
@@ -1309,6 +1328,226 @@ void disable_turbo_mode(int argc, char *argv[])
                 errno, strerror(errno));
 }
 
+/*
+ * Parse activity_window:NNN{us,ms,s} and validate range.
+ *
+ * Activity window is a 7bit mantissa (0-127) with a 3bit exponent (0-7) base
+ * 10 in microseconds.  So the range is 1 microsecond to 1270 seconds.  A value
+ * of 0 lets the hardware autonomously select the window.
+ *
+ * Return 0 on success
+ *       -1 on error
+ *        1 Not activity_window. i.e. try parsing as another argument
+ */
+static int parse_activity_window(xc_set_hwp_para_t *set_hwp, char *p)
+{
+    char *param = NULL, *val = NULL, *suffix = NULL;
+    unsigned int u;
+    unsigned int exponent = 0;
+    unsigned int multiplier = 1;
+    int ret;
+
+    ret = sscanf(p, "%m[a-z_A-Z]:%ms", &param, &val);
+    if ( ret != 2 )
+    {
+        return -1;
+    }
+
+    if ( strncasecmp(param, "act", 3) != 0 )
+    {
+        ret = 1;
+
+        goto out;
+    }
+
+    free(param);
+    param = NULL;
+
+    ret = sscanf(val, "%u%ms", &u, &suffix);
+    if ( ret != 1 && ret != 2 )
+    {
+        fprintf(stderr, "invalid activity window: %s\n", val);
+
+        ret = -1;
+
+        goto out;
+    }
+
+    if ( ret == 2 && suffix )
+    {
+        if ( strcasecmp(suffix, "s") == 0 )
+        {
+            multiplier = 1000 * 1000;
+            exponent = 6;
+        }
+        else if ( strcasecmp(suffix, "ms") == 0 )
+        {
+            multiplier = 1000;
+            exponent = 3;
+        }
+        else if ( strcasecmp(suffix, "us") == 0 )
+        {
+            multiplier = 1;
+            exponent = 0;
+        }
+        else
+        {
+            fprintf(stderr, "invalid activity window units: %s\n", suffix);
+
+            ret = -1;
+            goto out;
+        }
+    }
+
+    if ( u > 1270 * 1000 * 1000 / multiplier )
+    {
+        fprintf(stderr, "activity window %s too large\n", val);
+
+        ret = -1;
+        goto out;
+    }
+
+    /* looking for 7 bits of mantissa and 3 bits of exponent */
+    while ( u > 127 )
+    {
+        u /= 10;
+        exponent += 1;
+    }
+
+    set_hwp->activity_window = ( exponent & 0x7 ) << 7 | ( u & 0x7f );
+    set_hwp->set_params |= XEN_SYSCTL_HWP_SET_ACT_WINDOW;
+
+    ret = 0;
+
+ out:
+    free(suffix);
+    free(param);
+    free(val);
+
+    return ret;
+}
+
+static int parse_hwp_opts(xc_set_hwp_para_t *set_hwp, int *cpuid,
+                          int argc, char *argv[])
+{
+    int i = 0;
+
+    if ( argc < 1 )
+        return -1;
+
+    if ( parse_cpuid_non_fatal(argv[i], cpuid) == 0 )
+    {
+        i++;
+    }
+
+    if ( i == argc )
+        return -1;
+
+    if ( strcasecmp(argv[i], "powersave") == 0 )
+    {
+        set_hwp->set_params = XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE;
+        i++;
+    }
+    else if ( strcasecmp(argv[i], "performance") == 0 )
+    {
+        set_hwp->set_params = XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE;
+        i++;
+    }
+    else if ( strcasecmp(argv[i], "balance") == 0 )
+    {
+        set_hwp->set_params = XEN_SYSCTL_HWP_SET_PRESET_BALANCE;
+        i++;
+    }
+
+    for ( ; i < argc; i++)
+    {
+        unsigned int val;
+        char *param;
+        int ret;
+
+        ret = parse_activity_window(set_hwp, argv[i]);
+        switch ( ret )
+        {
+        case -1:
+            return -1;
+        case 0:
+            continue;
+            break;
+        case 1:
+            /* try other parsing */
+            break;
+        }
+
+        /* sscanf can't handle split on ':' for "%ms:%u'  */
+        ret = sscanf(argv[i], "%m[a-zA-Z_]:%u", &param, &val);
+        if ( ret != 2 )
+        {
+            fprintf(stderr, "%s is an invalid hwp parameter.\n", argv[i]);
+            return -1;
+        }
+
+        if ( val > 255 )
+        {
+            fprintf(stderr, "%s value %u is out of range.\n", param, val);
+            return -1;
+        }
+
+        if ( strncasecmp(param, "min", 3) == 0 )
+        {
+            set_hwp->minimum = val;
+            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_MINIMUM;
+        }
+        else if ( strncasecmp(param, "max", 3) == 0 )
+        {
+            set_hwp->maximum = val;
+            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_MAXIMUM;
+        }
+        else if ( strncasecmp(param, "des", 3) == 0 )
+        {
+            set_hwp->desired = val;
+            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_DESIRED;
+        }
+        else if ( strncasecmp(param, "ene", 3) == 0 )
+        {
+            set_hwp->energy_perf = val;
+            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_ENERGY_PERF;
+        }
+        else
+        {
+            fprintf(stderr, "%s is an invalid parameter\n.", param);
+            return -1;
+        }
+
+        free(param);
+    }
+
+    return 0;
+}
+
+static void hwp_set_func(int argc, char *argv[])
+{
+    xc_set_hwp_para_t set_hwp = {};
+    int cpuid = -1;
+    int i = 0;
+
+    if ( parse_hwp_opts(&set_hwp, &cpuid, argc, argv) )
+    {
+        fprintf(stderr, "Missing, excess, or invalid argument(s)\n");
+        exit(EINVAL);
+    }
+
+    if ( cpuid != -1 )
+    {
+        i = cpuid;
+        max_cpu_nr = i + 1;
+    }
+
+    for ( ; i < max_cpu_nr; i++ )
+        if ( xc_set_cpufreq_hwp(xc_handle, i, &set_hwp) )
+            fprintf(stderr, "[CPU%d] failed to set hwp params (%d - %s)\n",
+                    i, errno, strerror(errno));
+}
+
 struct {
     const char *name;
     void (*function)(int argc, char *argv[]);
@@ -1319,6 +1558,7 @@ struct {
     { "get-cpufreq-average", cpufreq_func },
     { "start", start_gather_func },
     { "get-cpufreq-para", cpufreq_para_func },
+    { "set-cpufreq-hwp", hwp_set_func },
     { "set-scaling-maxfreq", scaling_max_freq_func },
     { "set-scaling-minfreq", scaling_min_freq_func },
     { "set-scaling-governor", scaling_governor_func },
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 13/13] CHANGELOG: Add Intel HWP entry
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (11 preceding siblings ...)
  2021-05-03 19:28 ` [PATCH 12/13] xenpm: Add set-cpufreq-hwp subcommand Jason Andryuk
@ 2021-05-03 19:28 ` Jason Andryuk
  2021-05-27  9:48   ` Jan Beulich
  2021-05-20 14:57 ` [PATCH 00/13] Intel Hardware P-States (HWP) support Jan Beulich
  13 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-03 19:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Jason Andryuk, Ian Jackson, Community Manager

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
 CHANGELOG.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0106fccec1..bbca67bc0b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,8 @@ Notable changes to Xen will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 
 ## [unstable UNRELEASED](https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=staging) - TBD
+### Added / support upgraded
+ - Intel Hardware P-States (HWP) cpufreq driver
 
 ## [4.15.0 UNRELEASED](https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=RELEASE-4.15.0) - TBD
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 10/13] libxc: Add xc_set_cpufreq_hwp
  2021-05-03 19:28 ` [PATCH 10/13] libxc: Add xc_set_cpufreq_hwp Jason Andryuk
@ 2021-05-04  8:03   ` Jan Beulich
  2021-05-04 11:31     ` Jason Andryuk
  2021-05-27  9:45   ` Jan Beulich
  1 sibling, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-05-04  8:03 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Ian Jackson, Wei Liu, xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> Add xc_set_cpufreq_hwp to allow calling xen_systctl_pm_op
> SET_CPUFREQ_HWP.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
> 
> ---
> Am I allowed to do set_hwp = *set_hwp struct assignment?

I'm puzzled by the question - why would you not be?

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 10/13] libxc: Add xc_set_cpufreq_hwp
  2021-05-04  8:03   ` Jan Beulich
@ 2021-05-04 11:31     ` Jason Andryuk
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Andryuk @ 2021-05-04 11:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Jackson, Wei Liu, xen-devel

On Tue, May 4, 2021 at 4:03 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 03.05.2021 21:28, Jason Andryuk wrote:
> > Add xc_set_cpufreq_hwp to allow calling xen_systctl_pm_op
> > SET_CPUFREQ_HWP.
> >
> > Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
> >
> > ---
> > Am I allowed to do set_hwp = *set_hwp struct assignment?
>
> I'm puzzled by the question - why would you not be?

Yes, I thought it perfectly sensible to do.    However, I didn't see
other places in the file assigning structs, so I was not sure if there
was some reason against it.

Thanks for taking a look.

Regards,
Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/13] Intel Hardware P-States (HWP) support
  2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
                   ` (12 preceding siblings ...)
  2021-05-03 19:28 ` [PATCH 13/13] CHANGELOG: Add Intel HWP entry Jason Andryuk
@ 2021-05-20 14:57 ` Jan Beulich
  13 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-20 14:57 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Community Manager, xen-devel

On 03.05.2021 21:27, Jason Andryuk wrote:
> Open questions:
> 
> HWP defaults to enabled and running in balanced mode.  This is useful
> for testing, but should the old ACPI cpufreq driver remain the default?

As long as this new code is experimental, yes. But once it's deemed
stable and fully supported, I think on HWP-capable hardware the new
driver should be used by default.

> This series unilaterally enables Hardware Duty Cycling (HDC) which is
> another feature to autonomously powerdown things.  That is enabled if
> HWP is enabled.  Maybe that want to be configurable?

Probably; not even sure what the default would want to be.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 01/13] cpufreq: Allow restricting to internal governors only
  2021-05-03 19:27 ` [PATCH 01/13] cpufreq: Allow restricting to internal governors only Jason Andryuk
@ 2021-05-26 13:18   ` Jan Beulich
  2021-05-26 14:12     ` Jason Andryuk
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-05-26 13:18 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: xen-devel

On 03.05.2021 21:27, Jason Andryuk wrote:
> For hwp, the standard governors are not usable, and only the internal
> one is applicable.

So you say "one" here but use plural in the subject. Which one is
it (going to be)?

>  Add the cpufreq_governor_internal boolean to
> indicate when an internal governor, like hwp-internal, will be used.
> This is set during presmp_initcall, so that it can suppress governor

DYM s/is/will be/? Afaict this is going to happen later in the series.
Which is a good indication that such "hanging in the air" changes
aren't necessarily the best way of splitting a set of changes, ...

> --- a/xen/drivers/cpufreq/cpufreq.c
> +++ b/xen/drivers/cpufreq/cpufreq.c
> @@ -57,6 +57,7 @@ struct cpufreq_dom {
>  };
>  static LIST_HEAD_READ_MOSTLY(cpufreq_dom_list_head);
>  
> +bool __read_mostly cpufreq_governor_internal;

... also supported by you introducing a non-static variable without
any consumer outside of this file (and without any producer at all).

> @@ -122,6 +123,9 @@ int __init cpufreq_register_governor(struct cpufreq_governor *governor)
>      if (!governor)
>          return -EINVAL;
>  
> +    if (cpufreq_governor_internal && strstr(governor->name, "internal") == NULL)

I wonder whether designating any governors ending in "-internal"
wouldn't be less prone for possible future ambiguities.

> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -115,6 +115,7 @@ extern struct cpufreq_governor cpufreq_gov_dbs;
>  extern struct cpufreq_governor cpufreq_gov_userspace;
>  extern struct cpufreq_governor cpufreq_gov_performance;
>  extern struct cpufreq_governor cpufreq_gov_powersave;
> +extern bool cpufreq_governor_internal;

Please separate from the governor declarations by a blank line.

Sorry, all quite nit-like remarks, but still ...

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 02/13] cpufreq: Add perf_freq to cpuinfo
  2021-05-03 19:27 ` [PATCH 02/13] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
@ 2021-05-26 13:24   ` Jan Beulich
  2021-05-26 14:19     ` Jason Andryuk
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-05-26 13:24 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 03.05.2021 21:27, Jason Andryuk wrote:
> acpi-cpufreq scales the aperf/mperf measurements by max_freq, but HWP
> needs to scale by base frequency.  Settings max_freq to base_freq
> "works" but the code is not obvious, and returning values to userspace
> is tricky.  Add an additonal perf_freq member which is used for scaling
> aperf/mperf measurements.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
> ---
> I don't like this, but it seems the best way to re-use the common
> aperf/mperf code.  The other option would be to add wrappers that then
> do the acpi vs. hwp scaling.

Not sure I understand what you mean by "wrappers". I would assume that
for hwp you simply install a different getavg hook? Or else I guess
I'd need to see at least an outline of what you see as the alternative.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 03/13] cpufreq: Export intel_feature_detect
  2021-05-03 19:28 ` [PATCH 03/13] cpufreq: Export intel_feature_detect Jason Andryuk
@ 2021-05-26 13:27   ` Jan Beulich
  2021-05-26 14:44     ` Jason Andryuk
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-05-26 13:27 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> Export feature_detect as intel_feature_detect so it can be re-used by
> HWP.
> 
> Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

Acked-by: Jan Beulich <jbeulich@suse.com>
albeit ...

> --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
> +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
> @@ -340,7 +340,7 @@ static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
>      return extract_freq(get_cur_val(cpumask_of(cpu)), data);
>  }
>  
> -static void feature_detect(void *info)
> +void intel_feature_detect(void *info)
>  {
>      struct cpufreq_policy *policy = info;

... because of this (requiring the hwp code to stay in sync with
possible changes here, without the compiler being able to point
out inconsistencies) I'm not overly happy with such a change. Yet
I guess this isn't the first case we have in the code base.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 01/13] cpufreq: Allow restricting to internal governors only
  2021-05-26 13:18   ` Jan Beulich
@ 2021-05-26 14:12     ` Jason Andryuk
  2021-05-26 15:09       ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-26 14:12 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, May 26, 2021 at 9:18 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 03.05.2021 21:27, Jason Andryuk wrote:
> > For hwp, the standard governors are not usable, and only the internal
> > one is applicable.
>
> So you say "one" here but use plural in the subject. Which one is
> it (going to be)?

hwp only uses a single governor, but this is common code.  AMD or ARM
could require their own internal governor which is why the subject say
plural.

> >  Add the cpufreq_governor_internal boolean to
> > indicate when an internal governor, like hwp-internal, will be used.
> > This is set during presmp_initcall, so that it can suppress governor
>
> DYM s/is/will be/? Afaict this is going to happen later in the series.
> Which is a good indication that such "hanging in the air" changes
> aren't necessarily the best way of splitting a set of changes, ...

In terms of the patch series, yes, "will be".  The use of "is" is
directing how to use the feature.  Yes, it is "hanging in the air",
but I was trying to explain the "why" and "how" of using it.

I was trying to split this preparatory change from the actual hwp
introduction.  I suppose it could be ordered after hwp, and the extra,
unusable governors would be advertised until then.

> > --- a/xen/drivers/cpufreq/cpufreq.c
> > +++ b/xen/drivers/cpufreq/cpufreq.c
> > @@ -57,6 +57,7 @@ struct cpufreq_dom {
> >  };
> >  static LIST_HEAD_READ_MOSTLY(cpufreq_dom_list_head);
> >
> > +bool __read_mostly cpufreq_governor_internal;
>
> ... also supported by you introducing a non-static variable without
> any consumer outside of this file (and without any producer at all).
>
> > @@ -122,6 +123,9 @@ int __init cpufreq_register_governor(struct cpufreq_governor *governor)
> >      if (!governor)
> >          return -EINVAL;
> >
> > +    if (cpufreq_governor_internal && strstr(governor->name, "internal") == NULL)
>
> I wonder whether designating any governors ending in "-internal"
> wouldn't be less prone for possible future ambiguities.

Yes, that would be good.

> > --- a/xen/include/acpi/cpufreq/cpufreq.h
> > +++ b/xen/include/acpi/cpufreq/cpufreq.h
> > @@ -115,6 +115,7 @@ extern struct cpufreq_governor cpufreq_gov_dbs;
> >  extern struct cpufreq_governor cpufreq_gov_userspace;
> >  extern struct cpufreq_governor cpufreq_gov_performance;
> >  extern struct cpufreq_governor cpufreq_gov_powersave;
> > +extern bool cpufreq_governor_internal;
>
> Please separate from the governor declarations by a blank line.

Sure.

> Sorry, all quite nit-like remarks, but still ...

It's fine.  Would a design session be useful to discuss hwp?

Regards,
Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 02/13] cpufreq: Add perf_freq to cpuinfo
  2021-05-26 13:24   ` Jan Beulich
@ 2021-05-26 14:19     ` Jason Andryuk
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Andryuk @ 2021-05-26 14:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On Wed, May 26, 2021 at 9:24 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 03.05.2021 21:27, Jason Andryuk wrote:
> > acpi-cpufreq scales the aperf/mperf measurements by max_freq, but HWP
> > needs to scale by base frequency.  Settings max_freq to base_freq
> > "works" but the code is not obvious, and returning values to userspace
> > is tricky.  Add an additonal perf_freq member which is used for scaling
> > aperf/mperf measurements.
> >
> > Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
> > ---
> > I don't like this, but it seems the best way to re-use the common
> > aperf/mperf code.  The other option would be to add wrappers that then
> > do the acpi vs. hwp scaling.
>
> Not sure I understand what you mean by "wrappers". I would assume that
> for hwp you simply install a different getavg hook? Or else I guess
> I'd need to see at least an outline of what you see as the alternative.

Something like a common get_measured_perf() returning just the
aperf/mperf ratio with cpufreq_driver specific getavg() scaling as
appropriate.

Regards,
Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 03/13] cpufreq: Export intel_feature_detect
  2021-05-26 13:27   ` Jan Beulich
@ 2021-05-26 14:44     ` Jason Andryuk
  2021-05-26 15:11       ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-26 14:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On Wed, May 26, 2021 at 9:27 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 03.05.2021 21:28, Jason Andryuk wrote:
> > Export feature_detect as intel_feature_detect so it can be re-used by
> > HWP.
> >
> > Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
>
> Acked-by: Jan Beulich <jbeulich@suse.com>
> albeit ...
>
> > --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
> > +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
> > @@ -340,7 +340,7 @@ static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
> >      return extract_freq(get_cur_val(cpumask_of(cpu)), data);
> >  }
> >
> > -static void feature_detect(void *info)
> > +void intel_feature_detect(void *info)
> >  {
> >      struct cpufreq_policy *policy = info;
>
> ... because of this (requiring the hwp code to stay in sync with
> possible changes here, without the compiler being able to point
> out inconsistencies) I'm not overly happy with such a change. Yet
> I guess this isn't the first case we have in the code base.

For acpi-cpufreq, this is called by on_selected_cpus(), but hwp calls
this directly.  You could do something like:

void intel_feature_detect(struct cpufreq_policy *policy)
{
    /* current feature_detect() */
}

static void feature_detect(void *info)
    struct cpufreq_policy *policy = info;

    intel_feature_detect(policy);
}

Would you prefer that?

Regards,
Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver
  2021-05-03 19:28 ` [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
@ 2021-05-26 14:59   ` Jan Beulich
  2021-05-27  7:23     ` Jan Beulich
  2021-05-27 18:50     ` Jason Andryuk
  2021-05-27  7:45   ` Jan Beulich
  1 sibling, 2 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-26 14:59 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> If HWP Energy_Performance_Preference isn't supported, the code falls
> back to IA32_ENERGY_PERF_BIAS.  Right now, we don't check
> CPUID.06H:ECX.SETBH[bit 3] before using that MSR.

May I ask what problem there is doing so?

>  The SDM reads like
> it'll be available, and I assume it was available by the time Skylake
> introduced HWP.

The SDM documents the MSR's presence back to at least Nehalem, but ties
it to the CPUID bit even there.

> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -1355,6 +1355,15 @@ Specify whether guests are to be given access to physical port 80
>  (often used for debugging purposes), to override the DMI based
>  detection of systems known to misbehave upon accesses to that port.
>  
> +### hwp (x86)
> +> `= <boolean>`
> +
> +> Default: `true`
> +
> +Specifies whether Xen uses Hardware-Controlled Performance States (HWP)
> +on supported Intel hardware.  HWP is a Skylake+ feature which provides
> +better CPU power management.

Is there a particular reason giving this a top-level option rather
than a sub-option of cpufreq=?

> --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
> +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
> @@ -641,9 +641,12 @@ static int __init cpufreq_driver_init(void)
>      int ret = 0;
>  
>      if ((cpufreq_controller == FREQCTL_xen) &&
> -        (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL))
> -        ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> -    else if ((cpufreq_controller == FREQCTL_xen) &&
> +        (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)) {
> +        if (hwp_available())
> +            ret = hwp_register_driver();
> +        else
> +            ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> +    } else if ((cpufreq_controller == FREQCTL_xen) &&

I'd prefer if you did this with slightly less code churn, e.g.
(considering that the vendor check isn't really necessary afaict)

    if ((cpufreq_controller == FREQCTL_xen) && hwp_available())
        ret = hwp_register_driver();
    else if ((cpufreq_controller == FREQCTL_xen) &&
             (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL))
        ret = cpufreq_register_driver(&acpi_cpufreq_driver);
    ...

> --- /dev/null
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -0,0 +1,533 @@
> +/*
> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> + *
> + * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or (at
> + * your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/cpumask.h>
> +#include <xen/init.h>
> +#include <xen/param.h>
> +#include <xen/xmalloc.h>
> +#include <asm/msr.h>
> +#include <asm/io.h>

Nit: Please swap these two for being properly sorted (alphabetically).

> +#include <acpi/cpufreq/cpufreq.h>
> +
> +static bool feature_hwp;
> +static bool feature_hwp_notification;
> +static bool feature_hwp_activity_window;
> +static bool feature_hwp_energy_perf;
> +static bool feature_hwp_pkg_level_ctl;
> +static bool feature_hwp_peci;
> +
> +static bool feature_hdc;
> +static bool feature_fast_msr;
> +
> +bool opt_hwp = true;

Please add __read_mostly or even __initdata as applicable.

> +struct hwp_drv_data
> +{
> +    union
> +    {
> +        uint64_t hwp_caps;
> +        struct
> +        {
> +            uint64_t hw_highest:8;
> +            uint64_t hw_guaranteed:8;
> +            uint64_t hw_most_efficient:8;
> +            uint64_t hw_lowest:8;
> +            uint64_t hw_reserved:32;

I'd like to suggest to convert the hw_ prefixes here into ...

> +        };

... naming of the field ("hw") here.

> +    };
> +    union hwp_request curr_req;
> +    uint16_t activity_window;
> +    uint8_t minimum;
> +    uint8_t maximum;
> +    uint8_t desired;
> +    uint8_t energy_perf;
> +};
> +struct hwp_drv_data *hwp_drv_data[NR_CPUS];

New NR_CPUS-dimensioned arrays need explicit justification. From
what I get I can't see why this couldn't be per-CPU data instead.

Also - static?

> +#define hwp_err(...)     printk(XENLOG_ERR __VA_ARGS__)
> +#define hwp_info(...)    printk(XENLOG_INFO __VA_ARGS__)
> +#define hwp_verbose(...)                   \
> +({                                         \
> +    if ( cpufreq_verbose )                 \
> +    {                                      \
> +        printk(XENLOG_DEBUG __VA_ARGS__);  \
> +    }                                      \
> +})
> +#define hwp_verbose_cont(...)              \
> +({                                         \
> +    if ( cpufreq_verbose )                 \
> +    {                                      \
> +        printk(             __VA_ARGS__);  \
> +    }                                      \
> +})

Please omit this as not properly working (we don't have Linux'es
printk continuation logic) and as not actually used anywhere. For
hwp_verbose() please omit unnecessary braces.

> +static int hwp_governor(struct cpufreq_policy *policy,
> +                        unsigned int event)
> +{
> +    int ret;
> +
> +    if ( policy == NULL )
> +        return -EINVAL;
> +
> +    switch (event)

Style: Missing blanks.

> +    {
> +    case CPUFREQ_GOV_START:
> +        ret = 0;
> +        break;
> +    case CPUFREQ_GOV_STOP:
> +        ret = -EINVAL;
> +        break;

Any particular need for this, when you have ...

> +    case CPUFREQ_GOV_LIMITS:
> +        ret = 0;
> +        break;
> +    default:
> +        ret = -EINVAL;

... this (albeit with a missing "break")? Similarly, any particular
reason not to fold the other two?

> +bool hwp_available(void)

The only caller of this function is __init, so the function here should
be, too.

> +{
> +    uint32_t eax;

This could well be unsigned int afaict - see ./CODING_STYLE.

> +    uint64_t val;
> +    bool use_hwp;
> +
> +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> +    {
> +        hwp_verbose("cpuid_level (%u) lacks HWP support\n", boot_cpu_data.cpuid_level);
> +
> +        return false;
> +    }
> +
> +    eax = cpuid_eax(CPUID_PM_LEAF);
> +    feature_hwp                 = !!(eax & CPUID6_EAX_HWP);
> +    feature_hwp_notification    = !!(eax & CPUID6_EAX_HWP_Notification);
> +    feature_hwp_activity_window = !!(eax & CPUID6_EAX_HWP_Activity_Window);
> +    feature_hwp_energy_perf     =
> +        !!(eax & CPUID6_EAX_HWP_Energy_Performance_Preference);
> +    feature_hwp_pkg_level_ctl   =
> +        !!(eax & CPUID6_EAX_HWP_Package_Level_Request);
> +    feature_hwp_peci            = !!(eax & CPUID6_EAX_HWP_PECI);

Please avoid !! unless it's really needed (i.e. when the conversion to
bool isn't implicit anyway). Also elsewhere.

> +    hwp_verbose("HWP: %d notify: %d act_window: %d energy_perf: %d pkg_level: %d peci: %d\n",
> +                feature_hwp, feature_hwp_notification,
> +                feature_hwp_activity_window, feature_hwp_energy_perf,
> +                feature_hwp_pkg_level_ctl, feature_hwp_peci);
> +
> +    if ( !feature_hwp )
> +    {
> +        hwp_verbose("Hardware does not support HWP\n");

This is redundant with the hwp_verbose immediately ahead.

> +        return false;
> +    }
> +
> +    if ( boot_cpu_data.cpuid_level < 0x16 )
> +    {
> +        hwp_info("HWP disabled: cpuid_level %x < 0x16 lacks CPU freq info\n",
> +                 boot_cpu_data.cpuid_level);
> +
> +        return false;

While we commonly insist on blank lines ahead ofthe main return
statement of a function, we don't normally have such extra blank
lines in cases like this one.

> +    }

Perhaps worth folding with the earlier CPUID level check?

> +    hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n",
> +                eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not ");
> +    if ( eax & CPUID6_EAX_FAST_HWP_MSR )
> +    {
> +        if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) )
> +            hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n");
> +
> +        hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val);

Missing "else" above here?

> +        if (val & FAST_IA32_HWP_REQUEST )

Style: Missing blank.

> +static void hdc_set_pkg_hdc_ctl(bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");

I'm not convinced of the need of having such log messages after
failed rdmsr/wrmsr, but I'm definitely against them getting logged
unconditionally. In verbose mode, maybe.

> +        return;
> +    }
> +
> +    msr = val ? IA32_PKG_HDC_CTL_HDC_PKG_Enable : 0;

If you don't use the prior value, why did you read it? But I
think you really mean to set/clear just bot 0.

> +static void hdc_set_pm_ctl1(bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
> +
> +        return;
> +    }
> +
> +    msr = val ? IA32_PM_CTL1_HDC_Allow_Block : 0;

Same here then, and ...

> +static void hwp_fast_uncore_msrs_ctl(bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) )
> +        hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL)\n");
> +
> +    msr = val;

... here (where you imply bit 0 instead of using a proper
constant).

Also for all three functions I'm not convinced their names are
in good sync with their parameters being boolean.

> +static void hwp_read_capabilities(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
> +
> +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
> +                policy->cpu);
> +
> +        return;
> +    }
> +
> +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
> +
> +        return;
> +    }
> +}

This function doesn't indicate failure to its caller(s), so am I
to understand that failure to read either ofthe MSRs is actually
benign to the driver?

> +static void hwp_init_msrs(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    uint64_t val;
> +
> +    /* Package level MSR, but we don't have a good idea of packages here, so
> +     * just do it everytime. */

Style: Comment format (also elsewhere).

> +    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu);
> +
> +        return;
> +    }
> +
> +    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
> +    if ( val != IA32_PM_ENABLE_HWP_ENABLE )
> +    {
> +        val = IA32_PM_ENABLE_HWP_ENABLE;

You should neither depend on reserved bits being zero, nor discard any
non-zero value here, I think.

> +        if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) )
> +            hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n",
> +                    policy->cpu, val);
> +    }
> +
> +    hwp_read_capabilities(info);

Please pass properly typed arguments (and have properly typed parameters)
wherever possible - here: policy. The exception are callback functions
and alike, where the function type may have to match a sufficiently
generic one.

> +static int hwp_cpufreq_verify(struct cpufreq_policy *policy)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data = hwp_drv_data[cpu];
> +
> +    if ( !feature_hwp_energy_perf && data->energy_perf )
> +    {
> +        if ( data->energy_perf > 15 )
> +        {
> +            hwp_err("energy_perf %d exceeds IA32_ENERGY_PERF_BIAS range 0-15\n",
> +                    data->energy_perf);
> +
> +            return -EINVAL;
> +        }
> +    }
> +
> +    if ( !feature_hwp_activity_window && data->activity_window )
> +    {
> +        hwp_err("HWP activity window not supported.\n");

As in the majority of log messages you have, please omit full stops.

> +static void hwp_energy_perf_bias(void *info)
> +{
> +    uint64_t msr;
> +    struct hwp_drv_data *data = info;
> +    uint8_t val = data->energy_perf;
> +
> +    ASSERT(val <= 15);
> +
> +    if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> +
> +        return;
> +    }
> +
> +    msr &= ~(0xf);

Unnessary parentheses.

> +static void hwp_write_request(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
> +    union hwp_request hwp_req = data->curr_req;
> +
> +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));

ITYM

    BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw));

here?

> +    if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) )
> +    {
> +        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n",
> +                policy->cpu, hwp_req.raw);
> +        rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw);

What if this one fails, too? data->curr_req.raw then pretty certainly
ends up stale.

> +static int hwp_cpufreq_target(struct cpufreq_policy *policy,
> +                              unsigned int target_freq, unsigned int relation)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data = hwp_drv_data[cpu];
> +    union hwp_request hwp_req;
> +
> +    /* Zero everything to ensure reserved bits are zero... */
> +    hwp_req.raw = 0;> +    /* .. and update from there */
> +    hwp_req.min_perf = data->minimum;
> +    hwp_req.max_perf = data->maximum;
> +    hwp_req.desired = data->desired;

We typically prefer to use initializers to achieve the same effect.
Since the bitfields part is in an unnamed struct, old gcc would
prohibit use of an initializer for all of the assignments, but at
least "raw" can be set in the initializer.

> +    if ( feature_hwp_energy_perf )
> +        hwp_req.energy_perf = data->energy_perf;
> +    if ( feature_hwp_activity_window )
> +        hwp_req.activity_window = data->activity_window;
> +
> +    if ( hwp_req.raw == data->curr_req.raw )
> +        return 0;
> +
> +    data->curr_req.raw = hwp_req.raw;

I think you can omit .raw on both sides.

> +    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
> +    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
> +
> +    if ( !feature_hwp_energy_perf && data->energy_perf )
> +    {
> +        on_selected_cpus(cpumask_of(cpu), hwp_energy_perf_bias,
> +                         data, 1);
> +    }

Like elsewhere, please omit unnecessary braces.

> +static int hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data;
> +
> +    if ( cpufreq_opt_governor )
> +    {
> +        printk(XENLOG_WARNING
> +               "HWP: governor \"%s\" is incompatible with hwp. Using default \"%s\"\n",
> +               cpufreq_opt_governor->name, hwp_cpufreq_governor.name);
> +    }

Same here (and perhaps elsewhere) again.

> +    policy->governor = &hwp_cpufreq_governor;
> +
> +    data = xzalloc(typeof(*data));

Commonly we specify the type explicitly in such cases, rather than using
typeof(). I will admit though that I'm not entirely certain which one's
better. But consistency across the code base is perhaps preferable for
the time being.

> +    if ( !data )
> +        return -ENOMEM;

Is it correct to have set the governor before this error check?

> +    hwp_drv_data[cpu] = data;
> +
> +    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
> +
> +    data->minimum = data->hw_lowest;
> +    data->maximum = data->hw_highest;
> +    data->desired = 0; /* default to HW autonomous */
> +    if ( feature_hwp_energy_perf )
> +        data->energy_perf = 0x80;
> +    else
> +        data->energy_perf = 7;

Where's this 7 coming from? (You do mention the 0x80 at least in the
description.)

> +static int hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> +{
> +    unsigned int cpu = policy->cpu;
> +
> +    xfree(hwp_drv_data[cpu]);
> +    hwp_drv_data[cpu] = NULL;

Please don't open-code XFREE().

> +int hwp_register_driver(void)

__init

> +{
> +    int ret;
> +
> +    ret = cpufreq_register_driver(&hwp_cpufreq_driver);
> +
> +    return ret;

Preferably the body would consist of just a singe return statement.

> --- a/xen/include/asm-x86/msr-index.h
> +++ b/xen/include/asm-x86/msr-index.h
> @@ -101,6 +101,12 @@
>  #define MSR_RTIT_ADDR_A(n)                 (0x00000580 + (n) * 2)
>  #define MSR_RTIT_ADDR_B(n)                 (0x00000581 + (n) * 2)
>  
> +#define MSR_FAST_UNCORE_MSRS_CTL            0x00000657
> +#define  FAST_IA32_HWP_REQUEST_MSR_ENABLE   (_AC(1, ULL) <<  0)
> +
> +#define MSR_FAST_UNCORE_MSRS_CAPABILITY     0x0000065f
> +#define  FAST_IA32_HWP_REQUEST              (_AC(1, ULL) <<  0)
> +
>  #define MSR_U_CET                           0x000006a0
>  #define MSR_S_CET                           0x000006a2
>  #define  CET_SHSTK_EN                       (_AC(1, ULL) <<  0)
> @@ -112,10 +118,20 @@
>  #define MSR_PL3_SSP                         0x000006a7
>  #define MSR_INTERRUPT_SSP_TABLE             0x000006a8
>  
> +#define MSR_IA32_PM_ENABLE                  0x00000770
> +#define  IA32_PM_ENABLE_HWP_ENABLE          (_AC(1, ULL) <<  0)

Please have a blank line after here.

> +#define MSR_IA32_HWP_CAPABILITIES           0x00000771
> +#define MSR_IA32_HWP_REQUEST                0x00000774
> +
>  #define MSR_PASID                           0x00000d93
>  #define  PASID_PASID_MASK                   0x000fffff
>  #define  PASID_VALID                        (_AC(1, ULL) << 31)
>  
> +#define MSR_IA32_PKG_HDC_CTL                0x00000db0
> +#define  IA32_PKG_HDC_CTL_HDC_PKG_Enable    (_AC(1, ULL) <<  0)

I don't think "HDC" twice is helpful?

Also please use all upper case names (actually also for the
CPUID constants higher up).

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 01/13] cpufreq: Allow restricting to internal governors only
  2021-05-26 14:12     ` Jason Andryuk
@ 2021-05-26 15:09       ` Jan Beulich
  2021-05-26 16:44         ` Jason Andryuk
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-05-26 15:09 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: xen-devel

On 26.05.2021 16:12, Jason Andryuk wrote:
> On Wed, May 26, 2021 at 9:18 AM Jan Beulich <jbeulich@suse.com> wrote:
>> Sorry, all quite nit-like remarks, but still ...
> 
> It's fine.  Would a design session be useful to discuss hwp?

Is there anything beyond patch review that's necessary there? I'm
also not really set up to usefully join design sessions, I'm afraid.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 03/13] cpufreq: Export intel_feature_detect
  2021-05-26 14:44     ` Jason Andryuk
@ 2021-05-26 15:11       ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-26 15:11 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 26.05.2021 16:44, Jason Andryuk wrote:
> On Wed, May 26, 2021 at 9:27 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 03.05.2021 21:28, Jason Andryuk wrote:
>>> --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
>>> +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
>>> @@ -340,7 +340,7 @@ static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
>>>      return extract_freq(get_cur_val(cpumask_of(cpu)), data);
>>>  }
>>>
>>> -static void feature_detect(void *info)
>>> +void intel_feature_detect(void *info)
>>>  {
>>>      struct cpufreq_policy *policy = info;
>>
>> ... because of this (requiring the hwp code to stay in sync with
>> possible changes here, without the compiler being able to point
>> out inconsistencies) I'm not overly happy with such a change. Yet
>> I guess this isn't the first case we have in the code base.
> 
> For acpi-cpufreq, this is called by on_selected_cpus(), but hwp calls
> this directly.  You could do something like:
> 
> void intel_feature_detect(struct cpufreq_policy *policy)
> {
>     /* current feature_detect() */
> }
> 
> static void feature_detect(void *info)
>     struct cpufreq_policy *policy = info;
> 
>     intel_feature_detect(policy);
> }
> 
> Would you prefer that?

Would feel less fragile, yes. And no need for the intermediate "policy"
variable, afaics.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 05/13] xenpm: Change get-cpufreq-para output for internal
  2021-05-03 19:28 ` [PATCH 05/13] xenpm: Change get-cpufreq-para output for internal Jason Andryuk
@ 2021-05-26 15:21   ` Jan Beulich
  2021-05-27  5:54     ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-05-26 15:21 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Ian Jackson, Wei Liu, xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> --- a/tools/misc/xenpm.c
> +++ b/tools/misc/xenpm.c
> @@ -711,6 +711,7 @@ void start_gather_func(int argc, char *argv[])
>  /* print out parameters about cpu frequency */
>  static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
>  {
> +    bool internal = strstr(p_cpufreq->scaling_governor, "internal");

Like suggested for the hypervisor, perhaps better check for names
ending in "-internal"?

> @@ -720,10 +721,19 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
>          printf(" %d", p_cpufreq->affected_cpus[i]);
>      printf("\n");
>  
> -    printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
> -           p_cpufreq->cpuinfo_max_freq,
> -           p_cpufreq->cpuinfo_min_freq,
> -           p_cpufreq->cpuinfo_cur_freq);
> +    if ( internal )
> +    {
> +        printf("cpuinfo frequency    : base [%u] turbo [%u]\n",
> +               p_cpufreq->cpuinfo_min_freq,
> +               p_cpufreq->cpuinfo_max_freq);
> +    }
> +    else
> +    {
> +        printf("cpuinfo frequency    : max [%u] min [%u] cur [%u]\n",
> +               p_cpufreq->cpuinfo_max_freq,
> +               p_cpufreq->cpuinfo_min_freq,
> +               p_cpufreq->cpuinfo_cur_freq);
> +    }

Since the file adopts hypervisor coding style, the unnecessary
braces would again better be omitted.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 01/13] cpufreq: Allow restricting to internal governors only
  2021-05-26 15:09       ` Jan Beulich
@ 2021-05-26 16:44         ` Jason Andryuk
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Andryuk @ 2021-05-26 16:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, May 26, 2021 at 11:09 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 26.05.2021 16:12, Jason Andryuk wrote:
> > On Wed, May 26, 2021 at 9:18 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> Sorry, all quite nit-like remarks, but still ...
> >
> > It's fine.  Would a design session be useful to discuss hwp?
>
> Is there anything beyond patch review that's necessary there? I'm
> also not really set up to usefully join design sessions, I'm afraid.

It's not necessary, but I figured I'd offer since Xen Summit is going on.

The part where it might be useful is the hypercall interface.  I
basically exposed all the MSR configuration.  I couldn't think of a
useful abstraction, and this way the full range of configuration is
available.  Additionally, there are some convenience presets to make
it a little more user friendly.

Thank you for reviewing.

Regards,
Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 05/13] xenpm: Change get-cpufreq-para output for internal
  2021-05-26 15:21   ` Jan Beulich
@ 2021-05-27  5:54     ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-27  5:54 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Ian Jackson, Wei Liu, xen-devel

On 26.05.2021 17:21, Jan Beulich wrote:
> On 03.05.2021 21:28, Jason Andryuk wrote:
>> --- a/tools/misc/xenpm.c
>> +++ b/tools/misc/xenpm.c
>> @@ -711,6 +711,7 @@ void start_gather_func(int argc, char *argv[])
>>  /* print out parameters about cpu frequency */
>>  static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
>>  {
>> +    bool internal = strstr(p_cpufreq->scaling_governor, "internal");
> 
> Like suggested for the hypervisor, perhaps better check for names
> ending in "-internal"?

Thinking about it more, the adjustments you make aren't necessarily
applicable to other hypothetical internal governors, are they? In
which case you rather want to check for hwp specifically.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver
  2021-05-26 14:59   ` Jan Beulich
@ 2021-05-27  7:23     ` Jan Beulich
  2021-05-27 18:50     ` Jason Andryuk
  1 sibling, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-27  7:23 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel

On 26.05.2021 16:59, Jan Beulich wrote:
> On 03.05.2021 21:28, Jason Andryuk wrote:
>> +static void hdc_set_pkg_hdc_ctl(bool val)
>> +{
>> +    uint64_t msr;
>> +
>> +    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
>> +    {
>> +        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
> 
> I'm not convinced of the need of having such log messages after
> failed rdmsr/wrmsr, but I'm definitely against them getting logged
> unconditionally. In verbose mode, maybe.

Perhaps not even there, considering that recovery from faults gets
logged already anyway (see extable_fixup()), and is suitably
restricted to debug builds.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver
  2021-05-03 19:28 ` [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
  2021-05-26 14:59   ` Jan Beulich
@ 2021-05-27  7:45   ` Jan Beulich
  1 sibling, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-27  7:45 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> If HWP Energy_Performance_Preference isn't supported, the code falls
> back to IA32_ENERGY_PERF_BIAS.  Right now, we don't check
> CPUID.06H:ECX.SETBH[bit 3] before using that MSR.  The SDM reads like
> it'll be available, and I assume it was available by the time Skylake
> introduced HWP.

Upon more detailed reading of the respective SDM sections, I only
see two options: Either you fail driver initialization if the bit
is clear, or you correctly deal with the bit being clear. If Xen
runs virtualized itself, the combination of CPUID bits set may
not match that of any bare metal hardware that exists.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 06/13] cpufreq: Export HWP parameters to userspace
  2021-05-03 19:28 ` [PATCH 06/13] cpufreq: Export HWP parameters to userspace Jason Andryuk
@ 2021-05-27  7:55   ` Jan Beulich
  2021-05-28 13:19     ` Jason Andryuk
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-05-27  7:55 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> --- a/xen/arch/x86/acpi/cpufreq/hwp.c
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -523,6 +523,30 @@ static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
>      .update = hwp_cpufreq_update,
>  };
>  
> +int get_hwp_para(struct cpufreq_policy *policy, struct xen_hwp_para *hwp_para)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data = hwp_drv_data[cpu];

const, perhaps also for the first function parameter, and in
general (throughout the series) whenever an item is not meant to
be modified.

> +    if ( data == NULL )
> +        return -EINVAL;
> +
> +    hwp_para->hw_feature        =
> +        feature_hwp_activity_window ? XEN_SYSCTL_HWP_FEAT_ACT_WINDOW  : 0 |
> +        feature_hwp_energy_perf     ? XEN_SYSCTL_HWP_FEAT_ENERGY_PERF : 0;

This needs parentheses added, as | has higher precedence than ?:.

> --- a/xen/drivers/acpi/pmstat.c
> +++ b/xen/drivers/acpi/pmstat.c
> @@ -290,6 +290,12 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>              &op->u.get_para.u.ondemand.sampling_rate,
>              &op->u.get_para.u.ondemand.up_threshold);
>      }
> +
> +    if ( !strncasecmp(op->u.get_para.scaling_governor,
> +                      "hwp-internal", CPUFREQ_NAME_LEN) )
> +    {
> +        ret = get_hwp_para(policy, &op->u.get_para.u.hwp_para);
> +    }
>      op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);

Nit: Unnecessary parentheses again, and with the leading blank line
you also want a trailing one. (As an aside I'm also not overly happy
to see the call keyed to the governor name. Is there really no other
indication that hwp is in use?)

> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -246,4 +246,7 @@ int write_userspace_scaling_setspeed(unsigned int cpu, unsigned int freq);
>  void cpufreq_dbs_timer_suspend(void);
>  void cpufreq_dbs_timer_resume(void);
>  
> +/********************** hwp hypercall helper *************************/
> +int get_hwp_para(struct cpufreq_policy *policy, struct xen_hwp_para *hwp_para);

While I can see that the excessive number of stars matches what
we have elsewhere in the header, I still wonder if you need to go
this far for a single declaration. If you want to stick to this,
then to match the rest of the file you want to follow the comment
by a blank line.

> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -35,7 +35,7 @@
>  #include "domctl.h"
>  #include "physdev.h"
>  
> -#define XEN_SYSCTL_INTERFACE_VERSION 0x00000013
> +#define XEN_SYSCTL_INTERFACE_VERSION 0x00000014

As long as the size of struct xen_get_cpufreq_para doesn't change
(which from the description I imply it doesn't), you don't need to
bump the version, I don't think - your change is a pure addition
to the interface.

> @@ -301,6 +301,23 @@ struct xen_ondemand {
>      uint32_t up_threshold;
>  };
>  
> +struct xen_hwp_para {
> +    uint16_t activity_window; /* 7bit mantissa and 3bit exponent */

If you go this far with commenting, you should also make the further
aspects clear: Which bits these are, and that the exponent is taking
10 as the base (in most other cases one would expect 2).

> +#define XEN_SYSCTL_HWP_FEAT_ENERGY_PERF (1 << 0) /* energy_perf range 0-255 if
> +                                                    1. Otherwise 0-15 */
> +#define XEN_SYSCTL_HWP_FEAT_ACT_WINDOW  (1 << 1) /* activity_window supported
> +                                                    if 1 */

Style: Comment formatting. You may want to move the comment on separate
lines ahead of what they comment.

> +    uint8_t hw_feature; /* bit flags for features */
> +    uint8_t hw_lowest;
> +    uint8_t hw_most_efficient;
> +    uint8_t hw_guaranteed;
> +    uint8_t hw_highest;

Any particular reason for the recurring hw_ prefixes?

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 08/13] xenpm: Print HWP parameters
  2021-05-03 19:28 ` [PATCH 08/13] xenpm: Print HWP parameters Jason Andryuk
@ 2021-05-27  8:02   ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-27  8:02 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Ian Jackson, Wei Liu, xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> --- a/tools/misc/xenpm.c
> +++ b/tools/misc/xenpm.c
> @@ -708,6 +708,43 @@ void start_gather_func(int argc, char *argv[])
>      pause();
>  }
>  
> +static void calculate_hwp_activity_window(const xc_hwp_para_t *hwp,
> +                                          unsigned int *activity_window,
> +                                          const char **units)
> +{
> +    unsigned int mantissa = hwp->activity_window & 0x7f;
> +    unsigned int exponent = ( hwp->activity_window >> 7 ) & 0x7;

Excess blanks inside the parentheses.

> +    unsigned int multiplier = 1;
> +
> +    if ( hwp->activity_window == 0 )
> +    {
> +        *units = "hardware selected";
> +        *activity_window = 0;
> +
> +        return;
> +    }
> +
> +    if ( exponent >= 6 )
> +    {
> +        *units = "s";
> +        exponent -= 6;
> +    }
> +    else if ( exponent >= 3 )
> +    {
> +        *units = "ms";
> +        exponent -= 3;
> +    }
> +    else
> +    {
> +        *units = "us";
> +    }
> +
> +    for ( unsigned int i = 0; i < exponent; i++ )

This requires the compiler to default to C99 mode, which I don't
think we enforce just yet.

> +        multiplier *= 10;
> +
> +    *activity_window = mantissa * multiplier;
> +}
> +
>  /* print out parameters about cpu frequency */
>  static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
>  {
> @@ -777,6 +814,40 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
>                 p_cpufreq->scaling_cur_freq);
>      }
>  
> +    if ( strcmp(p_cpufreq->scaling_governor, "hwp-internal") == 0 )
> +    {
> +        const xc_hwp_para_t *hwp = &p_cpufreq->u.hwp_para;
> +
> +        printf("hwp variables        :\n");
> +        printf("  hardware limits    : lowest [%u] most_efficient [%u]\n",
> +               hwp->hw_lowest, hwp->hw_most_efficient);
> +        printf("  hardware limits    : guaranteed [%u] highest [%u]\n",
> +               hwp->hw_guaranteed, hwp->hw_highest);
> +        printf("  configured limits  : min [%u] max [%u] energy_perf [%u]\n",
> +               hwp->minimum, hwp->maximum, hwp->energy_perf);
> +
> +        if ( hwp->hw_feature & XEN_SYSCTL_HWP_FEAT_ENERGY_PERF )
> +        {
> +            printf("  configured limits  : energy_perf [%u%s]\n",
> +                   hwp->energy_perf,
> +                   hwp->energy_perf ? "" : " hw autonomous");
> +        }
> +
> +        if ( hwp->hw_feature & XEN_SYSCTL_HWP_FEAT_ACT_WINDOW )
> +        {
> +            unsigned int activity_window;
> +            const char *units;
> +
> +            calculate_hwp_activity_window(hwp, &activity_window, &units);
> +            printf("  configured limits  : activity_window [%u %s]\n",
> +                   activity_window, units);
> +        }
> +
> +        printf("  configured limits  : desired [%u%s]\n",
> +               hwp->desired,
> +               hwp->desired ? "" : " hw autonomous");
> +    }

I suppose output readability would improve if you didn't repeat
"hardware limits :" and "configured limits :" on continuation-like
lines, but rather simply indented by enough spaces.

Also again please again omit an unnecessary pair of braces.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 09/13] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op
  2021-05-03 19:28 ` [PATCH 09/13] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
@ 2021-05-27  8:33   ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-27  8:33 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> --- a/xen/arch/x86/acpi/cpufreq/hwp.c
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -547,6 +547,120 @@ int get_hwp_para(struct cpufreq_policy *policy, struct xen_hwp_para *hwp_para)
>      return 0;
>  }
>  
> +int set_hwp_para(struct cpufreq_policy *policy,
> +                 struct xen_set_hwp_para *set_hwp)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data = hwp_drv_data[cpu];
> +
> +    if ( data == NULL )
> +        return -EINVAL;
> +
> +    /* Validate all parameters first */
> +    if ( set_hwp->set_params & ~XEN_SYSCTL_HWP_SET_PARAM_MASK )
> +    {
> +        hwp_err("Invalid bits in hwp set_params %u\n",
> +                set_hwp->set_params);
> +
> +        return -EINVAL;
> +    }
> +
> +    if ( set_hwp->activity_window & ~XEN_SYSCTL_HWP_ACT_WINDOW_MASK )
> +    {
> +        hwp_err("Invalid bits in activity window %u\n",
> +                set_hwp->activity_window);
> +
> +        return -EINVAL;
> +    }
> +
> +    if ( !feature_hwp_energy_perf &&
> +         set_hwp->set_params & XEN_SYSCTL_HWP_SET_ENERGY_PERF &&

Please add parentheses around the operands of & here and ...

> +         set_hwp->energy_perf > 0xf )
> +    {
> +        hwp_err("energy_perf %u out of range for IA32_ENERGY_PERF_BIAS\n",
> +                set_hwp->energy_perf);
> +
> +        return -EINVAL;
> +    }
> +
> +    if ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_DESIRED &&

... here.

> +         set_hwp->desired != 0 &&
> +         ( set_hwp->desired < data->hw_lowest ||
> +           set_hwp->desired > data->hw_highest ) )

Excess blanks inside the inner pair of parentheses.

> +    {
> +        hwp_err("hwp desired %u is out of range (%u ... %u)\n",
> +                set_hwp->desired, data->hw_lowest, data->hw_highest);
> +
> +        return -EINVAL;
> +    }

None of these -EINVAL should be accompanied by a hwp_err, imo.

> +    /*
> +     * minimum & maximum are not validated as hardware doesn't seem to care
> +     * and the SDM says CPUs will clip internally.
> +     */
> +
> +    /* Apply presets */
> +    switch ( set_hwp->set_params & XEN_SYSCTL_HWP_SET_PRESET_MASK )
> +    {
> +    case XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE:
> +        data->minimum = data->hw_lowest;
> +        data->maximum = data->hw_lowest;
> +        data->activity_window = 0;
> +        if ( feature_hwp_energy_perf )
> +            data->energy_perf = 0xff;
> +        else
> +            data->energy_perf = 0xf;

There may want to be constants #define-d for these, and ...

> +        data->desired = 0;
> +        break;
> +    case XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE:
> +        data->minimum = data->hw_highest;
> +        data->maximum = data->hw_highest;
> +        data->activity_window = 0;
> +        data->energy_perf = 0;
> +        data->desired = 0;
> +        break;
> +    case XEN_SYSCTL_HWP_SET_PRESET_BALANCE:
> +        data->minimum = data->hw_lowest;
> +        data->maximum = data->hw_highest;
> +        data->activity_window = 0;
> +        data->energy_perf = 0x80;
> +        if ( feature_hwp_energy_perf )
> +            data->energy_perf = 0x80;
> +        else
> +            data->energy_perf = 0x7;

... since these aren't the sole instances of these kind of magic
numbers there surely want to be #define-s for these (such that
the connection between the two [or more?] instances becomes
visible). Actually, the same applies to the 0xf further up, which
has a second use yet a few more lines up.

> +        data->desired = 0;
> +        break;
> +    case XEN_SYSCTL_HWP_SET_PRESET_NONE:
> +        break;
> +    default:
> +        printk("HWP: Invalid preset value: %u\n",
> +               set_hwp->set_params & XEN_SYSCTL_HWP_SET_PRESET_MASK);
> +
> +        return -EINVAL;
> +    }

For the entire switch() - please have blank lines between (non-fall-
through, which here is all of them) case blocks.

> --- a/xen/drivers/acpi/pmstat.c
> +++ b/xen/drivers/acpi/pmstat.c
> @@ -318,6 +318,24 @@ static int set_cpufreq_gov(struct xen_sysctl_pm_op *op)
>      return __cpufreq_set_policy(old_policy, &new_policy);
>  }
>  
> +static int set_cpufreq_hwp(struct xen_sysctl_pm_op *op)
> +{
> +    struct cpufreq_policy *policy;
> +
> +    if ( !cpufreq_governor_internal )
> +        return -EINVAL;
> +
> +    policy = per_cpu(cpufreq_cpu_policy, op->cpuid);
> +
> +    if ( !policy || !policy->governor )
> +        return -EINVAL;
> +
> +    if ( strncasecmp(policy->governor->name, "hwp-internal", CPUFREQ_NAME_LEN) )

I think this recurring string literal also wants to at least gain
a #define.

> @@ -465,6 +483,12 @@ int do_pm_op(struct xen_sysctl_pm_op *op)
>          break;
>      }
>  
> +    case SET_CPUFREQ_HWP:
> +    {
> +        ret = set_cpufreq_hwp(op);
> +        break;
> +    }
> +
>      case SET_CPUFREQ_PARA:
>      {
>          ret = set_cpufreq_para(op);

I think you want to insert somewhere below this one and, despite all
the odd precedents, omit the stray braces.

> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -248,5 +248,7 @@ void cpufreq_dbs_timer_resume(void);
>  
>  /********************** hwp hypercall helper *************************/
>  int get_hwp_para(struct cpufreq_policy *policy, struct xen_hwp_para *hwp_para);
> +int set_hwp_para(struct cpufreq_policy *policy,
> +                 struct xen_set_hwp_para *set_hwp);

This renders the comment stale - the patch introducing it probably
can use plural right away.

> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -318,6 +318,36 @@ struct xen_hwp_para {
>      uint8_t energy_perf;
>  };
>  
> +/* set multiple values simultaneously when set_args bit is set */
> +struct xen_set_hwp_para {
> +    uint16_t set_params; /* bitflags for valid values */
> +#define XEN_SYSCTL_HWP_SET_DESIRED              (1U << 0)
> +#define XEN_SYSCTL_HWP_SET_ENERGY_PERF          (1U << 1)
> +#define XEN_SYSCTL_HWP_SET_ACT_WINDOW           (1U << 2)
> +#define XEN_SYSCTL_HWP_SET_MINIMUM              (1U << 3)
> +#define XEN_SYSCTL_HWP_SET_MAXIMUM              (1U << 4)
> +#define XEN_SYSCTL_HWP_SET_PRESET_MASK          (0xf000)
> +#define XEN_SYSCTL_HWP_SET_PRESET_NONE          (0x0000)
> +#define XEN_SYSCTL_HWP_SET_PRESET_BALANCE       (0x1000)
> +#define XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE     (0x2000)
> +#define XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE   (0x3000)

Personally I'd prefer unnecessary parentheses (like around single
tokens) to be omitted.

> +#define XEN_SYSCTL_HWP_SET_PARAM_MASK ((uint16_t)( \

What's the reason for this cast? Wherever possible #define-d
constants should be suitable for use in preprocessor conditionals.

> +                                  XEN_SYSCTL_HWP_SET_PRESET_MASK | \
> +                                  XEN_SYSCTL_HWP_SET_DESIRED     | \
> +                                  XEN_SYSCTL_HWP_SET_ENERGY_PERF | \
> +                                  XEN_SYSCTL_HWP_SET_ACT_WINDOW  | \
> +                                  XEN_SYSCTL_HWP_SET_MINIMUM     | \
> +                                  XEN_SYSCTL_HWP_SET_MAXIMUM     ))
> +
> +    uint16_t activity_window; /* 7bit mantissa and 3bit exponent */

Since the other respective comment is to be extended, perhaps here
you can simply refer to that one?

> +#define XEN_SYSCTL_HWP_ACT_WINDOW_MASK          (0x03ff)
> +    uint8_t minimum;
> +    uint8_t maximum;
> +    uint8_t desired;
> +    uint8_t energy_perf; /* 0-255 or 0-15 depending on HW support */
> +};
> +
> +

No double blank lines please.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 11/13] xenpm: Factor out a non-fatal cpuid_parse variant
  2021-05-03 19:28 ` [PATCH 11/13] xenpm: Factor out a non-fatal cpuid_parse variant Jason Andryuk
@ 2021-05-27  8:41   ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-27  8:41 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Ian Jackson, Wei Liu, xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> Allow cpuid_prase to be re-used without terminating xenpm.  HWP
> will re-use it to optionally parse a cpuid.

I'm not convinced of the need for this patch: The next one could
easily put an isdigit() check in place to tell cpuid from other
arguments.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 10/13] libxc: Add xc_set_cpufreq_hwp
  2021-05-03 19:28 ` [PATCH 10/13] libxc: Add xc_set_cpufreq_hwp Jason Andryuk
  2021-05-04  8:03   ` Jan Beulich
@ 2021-05-27  9:45   ` Jan Beulich
  1 sibling, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-27  9:45 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Ian Jackson, Wei Liu, xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> --- a/tools/libs/ctrl/xc_pm.c
> +++ b/tools/libs/ctrl/xc_pm.c
> @@ -330,6 +330,24 @@ int xc_set_cpufreq_para(xc_interface *xch, int cpuid,
>      return xc_sysctl(xch, &sysctl);
>  }
>  
> +int xc_set_cpufreq_hwp(xc_interface *xch, int cpuid,
> +                       xc_set_hwp_para_t *set_hwp)

Besides for general considerations, for xenpm to legitimately pass
the same struct instance into this function multiple times, the
last parameter wants to be pointer-to-const, declaring the intent
of the function to leave the struct unaltered.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 12/13] xenpm: Add set-cpufreq-hwp subcommand
  2021-05-03 19:28 ` [PATCH 12/13] xenpm: Add set-cpufreq-hwp subcommand Jason Andryuk
@ 2021-05-27  9:46   ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-27  9:46 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Ian Jackson, Wei Liu, xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> @@ -1309,6 +1328,226 @@ void disable_turbo_mode(int argc, char *argv[])
>                  errno, strerror(errno));
>  }
>  
> +/*
> + * Parse activity_window:NNN{us,ms,s} and validate range.
> + *
> + * Activity window is a 7bit mantissa (0-127) with a 3bit exponent (0-7) base
> + * 10 in microseconds.  So the range is 1 microsecond to 1270 seconds.  A value
> + * of 0 lets the hardware autonomously select the window.
> + *
> + * Return 0 on success
> + *       -1 on error
> + *        1 Not activity_window. i.e. try parsing as another argument
> + */
> +static int parse_activity_window(xc_set_hwp_para_t *set_hwp, char *p)
> +{
> +    char *param = NULL, *val = NULL, *suffix = NULL;
> +    unsigned int u;
> +    unsigned int exponent = 0;
> +    unsigned int multiplier = 1;
> +    int ret;
> +
> +    ret = sscanf(p, "%m[a-z_A-Z]:%ms", &param, &val);

I have to confess that I first needed to look up availability of the
m modifier. It looks to be in POSIX.1-2008, but not in C11 and older.
I'm therefore not sure if you can legitimately use it; I've not been
able to spot pre-existing uses throughout tools/.

Also, following the naming of other options of this tool (including
the new set-cpufreq-hwp subcommand you add here), instead of _
options should use - (and the pattern here and in the other similar
sscanf() further down then wants adjusting).

> +    if ( ret != 2 )
> +    {
> +        return -1;

No error message at all in this case?

> +    }
> +
> +    if ( strncasecmp(param, "act", 3) != 0 )
> +    {
> +        ret = 1;
> +
> +        goto out;
> +    }
> +
> +    free(param);
> +    param = NULL;
> +
> +    ret = sscanf(val, "%u%ms", &u, &suffix);

Can't you parse this right in the first sscanf()?

> +    if ( ret != 1 && ret != 2 )
> +    {
> +        fprintf(stderr, "invalid activity window: %s\n", val);
> +
> +        ret = -1;
> +
> +        goto out;
> +    }
> +
> +    if ( ret == 2 && suffix )

The help text doesn't clarify what an omitted suffix means; it's
unambiguous only when the value is zero. (While looking at that I
also started wondering whether the range there starting at 0us is
actually appropriate - the range really starts at 1us afaict, with
0 having special meaning.)

> +    {
> +        if ( strcasecmp(suffix, "s") == 0 )
> +        {
> +            multiplier = 1000 * 1000;
> +            exponent = 6;
> +        }
> +        else if ( strcasecmp(suffix, "ms") == 0 )
> +        {
> +            multiplier = 1000;
> +            exponent = 3;
> +        }
> +        else if ( strcasecmp(suffix, "us") == 0 )
> +        {
> +            multiplier = 1;
> +            exponent = 0;
> +        }
> +        else
> +        {
> +            fprintf(stderr, "invalid activity window units: %s\n", suffix);

I think you want to generally quote %s in such cases, to make clear
what is actually part of a malformed string.

> +            ret = -1;
> +            goto out;
> +        }
> +    }
> +
> +    if ( u > 1270 * 1000 * 1000 / multiplier )
> +    {
> +        fprintf(stderr, "activity window %s too large\n", val);
> +
> +        ret = -1;
> +        goto out;
> +    }
> +
> +    /* looking for 7 bits of mantissa and 3 bits of exponent */
> +    while ( u > 127 )

Prior to this loop, don't you need to multiply by "multiplier"?

> +    {
> +        u /= 10;

Fractions get silently chopped off - may want spelling out in
the help text.

> +        exponent += 1;
> +    }
> +
> +    set_hwp->activity_window = ( exponent & 0x7 ) << 7 | ( u & 0x7f );

Excess blanks inside parentheses again.

> +static int parse_hwp_opts(xc_set_hwp_para_t *set_hwp, int *cpuid,
> +                          int argc, char *argv[])
> +{
> +    int i = 0;
> +
> +    if ( argc < 1 )
> +        return -1;
> +
> +    if ( parse_cpuid_non_fatal(argv[i], cpuid) == 0 )
> +    {
> +        i++;
> +    }

Unnecessary braces again, the more that you ...

> +    if ( i == argc )
> +        return -1;

... don't have any here.

> +    if ( strcasecmp(argv[i], "powersave") == 0 )
> +    {
> +        set_hwp->set_params = XEN_SYSCTL_HWP_SET_PRESET_POWERSAVE;
> +        i++;
> +    }
> +    else if ( strcasecmp(argv[i], "performance") == 0 )
> +    {
> +        set_hwp->set_params = XEN_SYSCTL_HWP_SET_PRESET_PERFORMANCE;
> +        i++;
> +    }
> +    else if ( strcasecmp(argv[i], "balance") == 0 )
> +    {
> +        set_hwp->set_params = XEN_SYSCTL_HWP_SET_PRESET_BALANCE;
> +        i++;
> +    }
> +
> +    for ( ; i < argc; i++)
> +    {
> +        unsigned int val;
> +        char *param;
> +        int ret;
> +
> +        ret = parse_activity_window(set_hwp, argv[i]);
> +        switch ( ret )
> +        {
> +        case -1:
> +            return -1;
> +        case 0:
> +            continue;
> +            break;

Why "break" after "continue"? I can see compilers legitimately warning
in such a case.

> +        case 1:

This may better be "default:", or could be omitted altogether. Or
alternatively you may want to have a "default:" with assert().

> +            /* try other parsing */
> +            break;
> +        }
> +
> +        /* sscanf can't handle split on ':' for "%ms:%u'  */
> +        ret = sscanf(argv[i], "%m[a-zA-Z_]:%u", &param, &val);
> +        if ( ret != 2 )
> +        {
> +            fprintf(stderr, "%s is an invalid hwp parameter.\n", argv[i]);

Outside of this function you omit full stops from error messages.
Elsewhere in the tool full stops are also absent except in two or
three deprecation warnings. Hence I think you want to drop them
from messages in this function.

> +            return -1;
> +        }
> +
> +        if ( val > 255 )
> +        {
> +            fprintf(stderr, "%s value %u is out of range.\n", param, val);
> +            return -1;
> +        }
> +
> +        if ( strncasecmp(param, "min", 3) == 0 )
> +        {
> +            set_hwp->minimum = val;
> +            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_MINIMUM;
> +        }
> +        else if ( strncasecmp(param, "max", 3) == 0 )
> +        {
> +            set_hwp->maximum = val;
> +            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_MAXIMUM;
> +        }
> +        else if ( strncasecmp(param, "des", 3) == 0 )
> +        {
> +            set_hwp->desired = val;
> +            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_DESIRED;
> +        }
> +        else if ( strncasecmp(param, "ene", 3) == 0 )
> +        {
> +            set_hwp->energy_perf = val;
> +            set_hwp->set_params |= XEN_SYSCTL_HWP_SET_ENERGY_PERF;
> +        }

While I can see the point of limiting to 3 characters, you would
better not accept longer but e.g. typoed strings.

> +        else
> +        {
> +            fprintf(stderr, "%s is an invalid parameter\n.", param);
> +            return -1;
> +        }
> +
> +        free(param);
> +    }
> +
> +    return 0;

Should you perhaps return an error here if set_hwp->set_params is
still zero?

> +}
> +
> +static void hwp_set_func(int argc, char *argv[])
> +{
> +    xc_set_hwp_para_t set_hwp = {};
> +    int cpuid = -1;
> +    int i = 0;
> +
> +    if ( parse_hwp_opts(&set_hwp, &cpuid, argc, argv) )
> +    {
> +        fprintf(stderr, "Missing, excess, or invalid argument(s)\n");

Isn't this redundant with earlier logged messages, which are also
more specific (with the one exception noted)?

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 13/13] CHANGELOG: Add Intel HWP entry
  2021-05-03 19:28 ` [PATCH 13/13] CHANGELOG: Add Intel HWP entry Jason Andryuk
@ 2021-05-27  9:48   ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-27  9:48 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Ian Jackson, Community Manager, xen-devel

On 03.05.2021 21:28, Jason Andryuk wrote:
> --- a/CHANGELOG.md
> +++ b/CHANGELOG.md
> @@ -5,6 +5,8 @@ Notable changes to Xen will be documented in this file.
>  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
>  
>  ## [unstable UNRELEASED](https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=staging) - TBD
> +### Added / support upgraded
> + - Intel Hardware P-States (HWP) cpufreq driver

Please add a blank line between the ## and ### lines, to match what's
already there for earlier versions.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver
  2021-05-26 14:59   ` Jan Beulich
  2021-05-27  7:23     ` Jan Beulich
@ 2021-05-27 18:50     ` Jason Andryuk
  2021-05-28  6:35       ` Jan Beulich
  1 sibling, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-27 18:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel

On Wed, May 26, 2021 at 11:00 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 03.05.2021 21:28, Jason Andryuk wrote:
> > If HWP Energy_Performance_Preference isn't supported, the code falls
> > back to IA32_ENERGY_PERF_BIAS.  Right now, we don't check
> > CPUID.06H:ECX.SETBH[bit 3] before using that MSR.
>
> May I ask what problem there is doing so?
>
> >  The SDM reads like
> > it'll be available, and I assume it was available by the time Skylake
> > introduced HWP.
>
> The SDM documents the MSR's presence back to at least Nehalem, but ties
> it to the CPUID bit even there.

Your point about inconsistent virtualized environments is something I
hadn't considered.  I will add a check, but I will make hwp disable in
that case.  While it could run just without an energy/performance
preference, HWP doesn't seem useful in that scenario.

> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -1355,6 +1355,15 @@ Specify whether guests are to be given access to physical port 80
> >  (often used for debugging purposes), to override the DMI based
> >  detection of systems known to misbehave upon accesses to that port.
> >
> > +### hwp (x86)
> > +> `= <boolean>`
> > +
> > +> Default: `true`
> > +
> > +Specifies whether Xen uses Hardware-Controlled Performance States (HWP)
> > +on supported Intel hardware.  HWP is a Skylake+ feature which provides
> > +better CPU power management.
>
> Is there a particular reason giving this a top-level option rather
> than a sub-option of cpufreq=?

I will investigate.

Below, I'm trimming everything where I will just make the changes
according to your earlier email.

> > +    };
> > +    union hwp_request curr_req;
> > +    uint16_t activity_window;
> > +    uint8_t minimum;
> > +    uint8_t maximum;
> > +    uint8_t desired;
> > +    uint8_t energy_perf;
> > +};
> > +struct hwp_drv_data *hwp_drv_data[NR_CPUS];
>
> New NR_CPUS-dimensioned arrays need explicit justification. From
> what I get I can't see why this couldn't be per-CPU data instead.
>
> Also - static?

I followed acpi-cpufreq with the NR_CPUS array.  per-cpu makes sense
and I'll investigate.

> > +    hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n",
> > +                eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not ");
> > +    if ( eax & CPUID6_EAX_FAST_HWP_MSR )
> > +    {
> > +        if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) )
> > +            hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n");
> > +
> > +        hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val);
>
> Missing "else" above here?

Are unbalanced braces acceptable or must they be balanced?  Is this acceptable:
if ()
  one;
else {
  two;
  three;
}

> > +static void hdc_set_pkg_hdc_ctl(bool val)
> > +{
> > +    uint64_t msr;
> > +
> > +    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> > +    {
> > +        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
>
> I'm not convinced of the need of having such log messages after
> failed rdmsr/wrmsr, but I'm definitely against them getting logged
> unconditionally. In verbose mode, maybe.

We should not fail the rdmsr if our earlier cpuid check passed.  So in
that respect these messages can be removed.  The benefit here is that
you can see which MSR failed.  If you relied on extable_fixup, you
would have to cross reference manually.  How will administors know if
hwp isn't working properly there are no messages?

For HWP in general, the SDM says to check CPUID for the availability
of MSRs.  Therefore rdmsr/wrmsr shouldn't fail.  During development, I
saw wrmsr failures when with "Valid Maximum" and other "Valid" bits
set.  I think that was because I hadn't set up the Package Request
MSR.  That has been fixed by not using those bits.  Anyway,
rdmsr/wrmsr shouldn't fail, so how much code should be put into
checking for those failures which shouldn't happen?

My sample size is 2 models of processor, so verbose reporting of
errors makes some sense during wider deployment and testing.

> > +        return;
> > +    }
> > +
> > +    msr = val ? IA32_PKG_HDC_CTL_HDC_PKG_Enable : 0;
>
> If you don't use the prior value, why did you read it? But I
> think you really mean to set/clear just bot 0.

During development I printed the initial values .  I removed the
printing before submission but not the reading.

In the SDM, It says "Bits 63:1 are reserved and must be zero", so I
intended to only write 0 or 1.  Below, you remark on not discarding
reserved bits. So you want all of these to be read-modify-write
operations?

> > +static void hdc_set_pm_ctl1(bool val)
> > +{
> > +    uint64_t msr;
> > +
> > +    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
> > +    {
> > +        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
> > +
> > +        return;
> > +    }
> > +
> > +    msr = val ? IA32_PM_CTL1_HDC_Allow_Block : 0;
>
> Same here then, and ...
>
> > +static void hwp_fast_uncore_msrs_ctl(bool val)
> > +{
> > +    uint64_t msr;
> > +
> > +    if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) )
> > +        hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL)\n");
> > +
> > +    msr = val;
>
> ... here (where you imply bit 0 instead of using a proper
> constant).
>
> Also for all three functions I'm not convinced their names are
> in good sync with their parameters being boolean.

Would you prefer something named closer to the bit names like:
hdc_set_pkg_hdc_ctl -> hdc_set_pkg_enable
hdc_set_pm_ctl1 -> hdc_set_allow_block
hwp_fast_uncore_msrs_ctl -> hwp_fast_request_enable

> > +static void hwp_read_capabilities(void *info)
> > +{
> > +    struct cpufreq_policy *policy = info;
> > +    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
> > +
> > +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
> > +    {
> > +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
> > +                policy->cpu);
> > +
> > +        return;
> > +    }
> > +
> > +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
> > +    {
> > +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
> > +
> > +        return;
> > +    }
> > +}
>
> This function doesn't indicate failure to its caller(s), so am I
> to understand that failure to read either ofthe MSRs is actually
> benign to the driver?

They really shouldn't fail since the CPUID check passed during
initialization.  If you can't read/write MSRs, something is broken and
the driver cannot function.  The machine would still run, but HWP
would be uncontrollable.  How much care should be given to
"impossible" situations?

> > +    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
> > +    {
> > +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu);
> > +
> > +        return;
> > +    }
> > +
> > +    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
> > +    if ( val != IA32_PM_ENABLE_HWP_ENABLE )
> > +    {
> > +        val = IA32_PM_ENABLE_HWP_ENABLE;
>
> You should neither depend on reserved bits being zero, nor discard any
> non-zero value here, I think.

Ok.

> > +static void hwp_write_request(void *info)
> > +{
> > +    struct cpufreq_policy *policy = info;
> > +    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
> > +    union hwp_request hwp_req = data->curr_req;
> > +
> > +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
>
> ITYM
>
>     BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw));
>
> here?

Originally, I wanted this to live next to the union definition.
However, BUILD_BUG_ON needs to live in a function, so I placed it here
where it is used.

I'd prefer
    BUILD_BUG_ON(sizeof(hwp_req) != sizeof(uint64_t))

to make the comparison to 64bit, the size of the MSR, explicit.

> > +    if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) )
> > +    {
> > +        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n",
> > +                policy->cpu, hwp_req.raw);
> > +        rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw);
>
> What if this one fails, too? data->curr_req.raw then pretty certainly
> ends up stale.

It would be stale, but I think this is unlikely.  rdmsr should not
fail given our earlier CPUID checks.  wrmsr could fail if you set
something wrong.  During testing I set some of the valid
"Maximum/Minimum" bits (now unused) that cause wrmsr to fault when
some other MSR (maybe IA32_HWP_REQUEST_PKG) wasn't set.

> > +    policy->governor = &hwp_cpufreq_governor;
> > +
> > +    data = xzalloc(typeof(*data));
>
> Commonly we specify the type explicitly in such cases, rather than using
> typeof(). I will admit though that I'm not entirely certain which one's
> better. But consistency across the code base is perhaps preferable for
> the time being.

I thought typeof(*data) is always preferable since it will always be
the matching type.  I can change it, but I feel like it's a step
backwards.

> > +    if ( !data )
> > +        return -ENOMEM;
>
> Is it correct to have set the governor before this error check?

I will re-order.

> > +    hwp_drv_data[cpu] = data;
> > +
> > +    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
> > +
> > +    data->minimum = data->hw_lowest;
> > +    data->maximum = data->hw_highest;
> > +    data->desired = 0; /* default to HW autonomous */
> > +    if ( feature_hwp_energy_perf )
> > +        data->energy_perf = 0x80;
> > +    else
> > +        data->energy_perf = 7;
>
> Where's this 7 coming from? (You do mention the 0x80 at least in the
> description.)

When HWP energy performance preference is unavailable, it falls back
to IA32_ENERGY_PERF_BIAS with a 0-15 range.  Per the SDM Vol3 14.3.4,
"A value of 7 roughly translates into a hint to balance performance
with energy consumption."  I will add a comment.

> > --- a/xen/include/asm-x86/msr-index.h
> > +++ b/xen/include/asm-x86/msr-index.h
> > +#define MSR_IA32_HWP_CAPABILITIES           0x00000771
> > +#define MSR_IA32_HWP_REQUEST                0x00000774
> > +
> >  #define MSR_PASID                           0x00000d93
> >  #define  PASID_PASID_MASK                   0x000fffff
> >  #define  PASID_VALID                        (_AC(1, ULL) << 31)
> >
> > +#define MSR_IA32_PKG_HDC_CTL                0x00000db0
> > +#define  IA32_PKG_HDC_CTL_HDC_PKG_Enable    (_AC(1, ULL) <<  0)
>
> I don't think "HDC" twice is helpful?

I took the MSR name "IA32_PKG_HDC_CTL" and bit name "HDC_PKG_Enable"
from the SDM.  I intentionally left the case to match the SDM.

> Also please use all upper case names (actually also for the
> CPUID constants higher up).

Again, I took them straight from the SDM.

I will make them all upper case and switch
IA32_PKG_HDC_CTL_HDC_PKG_Enable to IA32_PKG_HDC_CTL_PKG_Enable.

Regards,
Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver
  2021-05-27 18:50     ` Jason Andryuk
@ 2021-05-28  6:35       ` Jan Beulich
  2021-06-03 11:55         ` Jason Andryuk
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-05-28  6:35 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel

On 27.05.2021 20:50, Jason Andryuk wrote:
> On Wed, May 26, 2021 at 11:00 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 03.05.2021 21:28, Jason Andryuk wrote:
>>> If HWP Energy_Performance_Preference isn't supported, the code falls
>>> back to IA32_ENERGY_PERF_BIAS.  Right now, we don't check
>>> CPUID.06H:ECX.SETBH[bit 3] before using that MSR.
>>
>> May I ask what problem there is doing so?
>>
>>>  The SDM reads like
>>> it'll be available, and I assume it was available by the time Skylake
>>> introduced HWP.
>>
>> The SDM documents the MSR's presence back to at least Nehalem, but ties
>> it to the CPUID bit even there.
> 
> Your point about inconsistent virtualized environments is something I
> hadn't considered.  I will add a check, but I will make hwp disable in
> that case.  While it could run just without an energy/performance
> preference, HWP doesn't seem useful in that scenario.

Makes sense. Of course I wouldn't expect hypervisors to populate much
of CPUID leaf 6 anyway, like is the case for Xen itself.

>>> +    hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n",
>>> +                eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not ");
>>> +    if ( eax & CPUID6_EAX_FAST_HWP_MSR )
>>> +    {
>>> +        if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) )
>>> +            hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n");
>>> +
>>> +        hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val);
>>
>> Missing "else" above here?
> 
> Are unbalanced braces acceptable or must they be balanced?  Is this acceptable:
> if ()
>   one;
> else {
>   two;
>   three;
> }

Yes, it is. But I don't see how the question relates to my comment.
All that needs to go in the else's body is the hwp_verbose().

>>> +static void hdc_set_pkg_hdc_ctl(bool val)
>>> +{
>>> +    uint64_t msr;
>>> +
>>> +    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
>>> +    {
>>> +        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
>>
>> I'm not convinced of the need of having such log messages after
>> failed rdmsr/wrmsr, but I'm definitely against them getting logged
>> unconditionally. In verbose mode, maybe.
> 
> We should not fail the rdmsr if our earlier cpuid check passed.  So in
> that respect these messages can be removed.  The benefit here is that
> you can see which MSR failed.  If you relied on extable_fixup, you
> would have to cross reference manually.  How will administors know if
> hwp isn't working properly there are no messages?

This same question would go for various other MSR accesses which
might fail, but which aren't accompanied by an explicit log message.
I wouldn't mind a single summary message reporting if e.g. HWP
setup failed. More detailed analysis of such would be more of a
developer's than an administrator's job then anyway.

> For HWP in general, the SDM says to check CPUID for the availability
> of MSRs.  Therefore rdmsr/wrmsr shouldn't fail.  During development, I
> saw wrmsr failures when with "Valid Maximum" and other "Valid" bits
> set.  I think that was because I hadn't set up the Package Request
> MSR.  That has been fixed by not using those bits.  Anyway,
> rdmsr/wrmsr shouldn't fail, so how much code should be put into
> checking for those failures which shouldn't happen?

If there's any risk of accesses causing a fault, using *msr_safe()
is of course the right choice. Beyond that you will need to consider
what the consequences are. Sadly this needs doing on every single
case individually. See "Handling unexpected conditions" section of
./CODING_STYLE for guidelines (even if the specific constructs
aren't in use here).

>>> +        return;
>>> +    }
>>> +
>>> +    msr = val ? IA32_PKG_HDC_CTL_HDC_PKG_Enable : 0;
>>
>> If you don't use the prior value, why did you read it? But I
>> think you really mean to set/clear just bot 0.
> 
> During development I printed the initial values .  I removed the
> printing before submission but not the reading.
> 
> In the SDM, It says "Bits 63:1 are reserved and must be zero", so I
> intended to only write 0 or 1.  Below, you remark on not discarding
> reserved bits. So you want all of these to be read-modify-write
> operations?

Yes, this is the only way to be forward compatible.

>>> +static void hdc_set_pm_ctl1(bool val)
>>> +{
>>> +    uint64_t msr;
>>> +
>>> +    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
>>> +    {
>>> +        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
>>> +
>>> +        return;
>>> +    }
>>> +
>>> +    msr = val ? IA32_PM_CTL1_HDC_Allow_Block : 0;
>>
>> Same here then, and ...
>>
>>> +static void hwp_fast_uncore_msrs_ctl(bool val)
>>> +{
>>> +    uint64_t msr;
>>> +
>>> +    if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) )
>>> +        hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL)\n");
>>> +
>>> +    msr = val;
>>
>> ... here (where you imply bit 0 instead of using a proper
>> constant).
>>
>> Also for all three functions I'm not convinced their names are
>> in good sync with their parameters being boolean.
> 
> Would you prefer something named closer to the bit names like:
> hdc_set_pkg_hdc_ctl -> hdc_set_pkg_enable
> hdc_set_pm_ctl1 -> hdc_set_allow_block
> hwp_fast_uncore_msrs_ctl -> hwp_fast_request_enable

My primary request is for function name, purpose, and parameter(s)
to be in line. So e.g. if you left the parameters as boolean, then
I think your suggested name changes make sense. Alternatively you
could make the functions e.g. be full register updating ones, with
suitable parameters, which could be a mask-to-set and mask-to-clear.

>>> +static void hwp_read_capabilities(void *info)
>>> +{
>>> +    struct cpufreq_policy *policy = info;
>>> +    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
>>> +
>>> +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
>>> +    {
>>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
>>> +                policy->cpu);
>>> +
>>> +        return;
>>> +    }
>>> +
>>> +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
>>> +    {
>>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
>>> +
>>> +        return;
>>> +    }
>>> +}
>>
>> This function doesn't indicate failure to its caller(s), so am I
>> to understand that failure to read either ofthe MSRs is actually
>> benign to the driver?
> 
> They really shouldn't fail since the CPUID check passed during
> initialization.  If you can't read/write MSRs, something is broken and
> the driver cannot function.  The machine would still run, but HWP
> would be uncontrollable.  How much care should be given to
> "impossible" situations?

See above. The main point is to avoid crashing. In the specific
case here I think you could simply drop both the log messages and
the early return, assuming the caller can deal fine with the zero
value(s) that rdmsr_safe() will substitute for the MSR value(s).
Bailing early, otoh, calls for handing back an error indicator
imo.

>>> +static void hwp_write_request(void *info)
>>> +{
>>> +    struct cpufreq_policy *policy = info;
>>> +    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
>>> +    union hwp_request hwp_req = data->curr_req;
>>> +
>>> +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
>>
>> ITYM
>>
>>     BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw));
>>
>> here?
> 
> Originally, I wanted this to live next to the union definition.
> However, BUILD_BUG_ON needs to live in a function, so I placed it here
> where it is used.
> 
> I'd prefer
>     BUILD_BUG_ON(sizeof(hwp_req) != sizeof(uint64_t))
> 
> to make the comparison to 64bit, the size of the MSR, explicit.

Well, then "raw" could still be wrong in principle, but it is the
whole point of having "raw" to have it match MSR size. So while I
could live with your alternative, I continue to think my suggestion
is the more appropriate form of the check.

>>> +    policy->governor = &hwp_cpufreq_governor;
>>> +
>>> +    data = xzalloc(typeof(*data));
>>
>> Commonly we specify the type explicitly in such cases, rather than using
>> typeof(). I will admit though that I'm not entirely certain which one's
>> better. But consistency across the code base is perhaps preferable for
>> the time being.
> 
> I thought typeof(*data) is always preferable since it will always be
> the matching type.  I can change it, but I feel like it's a step
> backwards.

There's exactly one similar use in the entire code base. The original
idea with xmalloc() was that one would explicitly specify the intended
type, such that without looking elsewhere one can see what exactly is
to be allocated. One could further rely on the compiler warning if
there was a disagreement between declaration and assignment.

If instead we were to use typeof() everywhere, there's be a fair
amount of redundancy between the LHS of the assignment and the operand
of typeof(), which would call for eliminating (by abstracting away).

>>> +    if ( feature_hwp_energy_perf )
>>> +        data->energy_perf = 0x80;
>>> +    else
>>> +        data->energy_perf = 7;
>>
>> Where's this 7 coming from? (You do mention the 0x80 at least in the
>> description.)
> 
> When HWP energy performance preference is unavailable, it falls back
> to IA32_ENERGY_PERF_BIAS with a 0-15 range.  Per the SDM Vol3 14.3.4,
> "A value of 7 roughly translates into a hint to balance performance
> with energy consumption."  I will add a comment.

Actually, as per a comment on a later patch, I'm instead expecting
you to add a #define, the name of which will serve as comment.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 06/13] cpufreq: Export HWP parameters to userspace
  2021-05-27  7:55   ` Jan Beulich
@ 2021-05-28 13:19     ` Jason Andryuk
  2021-05-28 13:39       ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-05-28 13:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, xen-devel

On Thu, May 27, 2021 at 4:03 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 03.05.2021 21:28, Jason Andryuk wrote:
> > --- a/xen/drivers/acpi/pmstat.c
> > +++ b/xen/drivers/acpi/pmstat.c
> > @@ -290,6 +290,12 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
> >              &op->u.get_para.u.ondemand.sampling_rate,
> >              &op->u.get_para.u.ondemand.up_threshold);
> >      }
> > +
> > +    if ( !strncasecmp(op->u.get_para.scaling_governor,
> > +                      "hwp-internal", CPUFREQ_NAME_LEN) )
> > +    {
> > +        ret = get_hwp_para(policy, &op->u.get_para.u.hwp_para);
> > +    }
> >      op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
>
> Nit: Unnecessary parentheses again, and with the leading blank line
> you also want a trailing one. (As an aside I'm also not overly happy
> to see the call keyed to the governor name. Is there really no other
> indication that hwp is in use?)

This is preceded by similar checks for "userspace" and "ondemand", so
it is following existing code.  Unlike other governors, hwp-internal
is static.  It could be exported if you want to switch to comparing
with cpufreq_driver.

> > --- a/xen/include/acpi/cpufreq/cpufreq.h
> > +++ b/xen/include/acpi/cpufreq/cpufreq.h
> > @@ -246,4 +246,7 @@ int write_userspace_scaling_setspeed(unsigned int cpu, unsigned int freq);
> >  void cpufreq_dbs_timer_suspend(void);
> >  void cpufreq_dbs_timer_resume(void);
> >
> > +/********************** hwp hypercall helper *************************/
> > +int get_hwp_para(struct cpufreq_policy *policy, struct xen_hwp_para *hwp_para);
>
> While I can see that the excessive number of stars matches what
> we have elsewhere in the header, I still wonder if you need to go
> this far for a single declaration. If you want to stick to this,
> then to match the rest of the file you want to follow the comment
> by a blank line.

Will remove.

> > --- a/xen/include/public/sysctl.h
> > +++ b/xen/include/public/sysctl.h
> > @@ -301,6 +301,23 @@ struct xen_ondemand {
> >      uint32_t up_threshold;
> >  };
> >
> > +struct xen_hwp_para {
> > +    uint16_t activity_window; /* 7bit mantissa and 3bit exponent */
>
> If you go this far with commenting, you should also make the further
> aspects clear: Which bits these are, and that the exponent is taking
> 10 as the base (in most other cases one would expect 2).

Yes, this is much more useful.

> > +#define XEN_SYSCTL_HWP_FEAT_ENERGY_PERF (1 << 0) /* energy_perf range 0-255 if
> > +                                                    1. Otherwise 0-15 */
> > +#define XEN_SYSCTL_HWP_FEAT_ACT_WINDOW  (1 << 1) /* activity_window supported
> > +                                                    if 1 */
>
> Style: Comment formatting. You may want to move the comment on separate
> lines ahead of what they comment.
>
> > +    uint8_t hw_feature; /* bit flags for features */
> > +    uint8_t hw_lowest;
> > +    uint8_t hw_most_efficient;
> > +    uint8_t hw_guaranteed;
> > +    uint8_t hw_highest;
>
> Any particular reason for the recurring hw_ prefixes?

The idea was to differentiate values provided by CPU hardware from
user-configured values.

Regards,
Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 06/13] cpufreq: Export HWP parameters to userspace
  2021-05-28 13:19     ` Jason Andryuk
@ 2021-05-28 13:39       ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-05-28 13:39 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, xen-devel

On 28.05.2021 15:19, Jason Andryuk wrote:
> On Thu, May 27, 2021 at 4:03 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 03.05.2021 21:28, Jason Andryuk wrote:
>>> --- a/xen/drivers/acpi/pmstat.c
>>> +++ b/xen/drivers/acpi/pmstat.c
>>> @@ -290,6 +290,12 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>>>              &op->u.get_para.u.ondemand.sampling_rate,
>>>              &op->u.get_para.u.ondemand.up_threshold);
>>>      }
>>> +
>>> +    if ( !strncasecmp(op->u.get_para.scaling_governor,
>>> +                      "hwp-internal", CPUFREQ_NAME_LEN) )
>>> +    {
>>> +        ret = get_hwp_para(policy, &op->u.get_para.u.hwp_para);
>>> +    }
>>>      op->u.get_para.turbo_enabled = cpufreq_get_turbo_status(op->cpuid);
>>
>> Nit: Unnecessary parentheses again, and with the leading blank line
>> you also want a trailing one. (As an aside I'm also not overly happy
>> to see the call keyed to the governor name. Is there really no other
>> indication that hwp is in use?)
> 
> This is preceded by similar checks for "userspace" and "ondemand", so
> it is following existing code.  Unlike other governors, hwp-internal
> is static.  It could be exported if you want to switch to comparing
> with cpufreq_driver.

Hmm, well, then feel free to keep the logic as you have it, except
please don't take presence of unnecessary braces as excuse to add
more.

>>> +    uint8_t hw_feature; /* bit flags for features */
>>> +    uint8_t hw_lowest;
>>> +    uint8_t hw_most_efficient;
>>> +    uint8_t hw_guaranteed;
>>> +    uint8_t hw_highest;
>>
>> Any particular reason for the recurring hw_ prefixes?
> 
> The idea was to differentiate values provided by CPU hardware from
> user-configured values.

I think that follows from their names already without the prefix.
I'd prefer if you dropped them, but I'll try to not insist (I may
comment on this again in later versions, in case I forgot by then).

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver
  2021-05-28  6:35       ` Jan Beulich
@ 2021-06-03 11:55         ` Jason Andryuk
  2021-06-04  6:39           ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Andryuk @ 2021-06-03 11:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel

On Fri, May 28, 2021 at 2:35 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 27.05.2021 20:50, Jason Andryuk wrote:
> > On Wed, May 26, 2021 at 11:00 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 03.05.2021 21:28, Jason Andryuk wrote:
> >>> +    hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n",
> >>> +                eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not ");
> >>> +    if ( eax & CPUID6_EAX_FAST_HWP_MSR )
> >>> +    {
> >>> +        if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) )
> >>> +            hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n");
> >>> +
> >>> +        hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val);
> >>
> >> Missing "else" above here?
> >
> > Are unbalanced braces acceptable or must they be balanced?  Is this acceptable:
> > if ()
> >   one;
> > else {
> >   two;
> >   three;
> > }
>
> Yes, it is. But I don't see how the question relates to my comment.
> All that needs to go in the else's body is the hwp_verbose().

'val' shouldn't be used to set features when the rdmsr fails, so the
following code needs to be within the else.  Unless you want to rely
on a failed rdmsr returning 0.

> >>> +static void hdc_set_pkg_hdc_ctl(bool val)
> >>> +{
> >>> +    uint64_t msr;
> >>> +
> >>> +    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> >>> +    {
> >>> +        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
> >>
> >> I'm not convinced of the need of having such log messages after
> >> failed rdmsr/wrmsr, but I'm definitely against them getting logged
> >> unconditionally. In verbose mode, maybe.
> >
> > We should not fail the rdmsr if our earlier cpuid check passed.  So in
> > that respect these messages can be removed.  The benefit here is that
> > you can see which MSR failed.  If you relied on extable_fixup, you
> > would have to cross reference manually.  How will administors know if
> > hwp isn't working properly there are no messages?
>
> This same question would go for various other MSR accesses which
> might fail, but which aren't accompanied by an explicit log message.
> I wouldn't mind a single summary message reporting if e.g. HWP
> setup failed. More detailed analysis of such would be more of a
> developer's than an administrator's job then anyway.
>
> > For HWP in general, the SDM says to check CPUID for the availability
> > of MSRs.  Therefore rdmsr/wrmsr shouldn't fail.  During development, I
> > saw wrmsr failures when with "Valid Maximum" and other "Valid" bits
> > set.  I think that was because I hadn't set up the Package Request
> > MSR.  That has been fixed by not using those bits.  Anyway,
> > rdmsr/wrmsr shouldn't fail, so how much code should be put into
> > checking for those failures which shouldn't happen?
>
> If there's any risk of accesses causing a fault, using *msr_safe()
> is of course the right choice. Beyond that you will need to consider
> what the consequences are. Sadly this needs doing on every single
> case individually. See "Handling unexpected conditions" section of
> ./CODING_STYLE for guidelines (even if the specific constructs
> aren't in use here).

Using *msr_safe(), I think the worst case is the users can't set HWP
in the way they want.  So power/performance may not be what the users
wanted, but Xen won't crash.  The hardware will throttle itself if
needed for self-protection, so that should be okay as well.

> >>> +static void hdc_set_pm_ctl1(bool val)
> >>> +{
> >>> +    uint64_t msr;
> >>> +
> >>> +    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
> >>> +    {
> >>> +        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
> >>> +
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    msr = val ? IA32_PM_CTL1_HDC_Allow_Block : 0;
> >>
> >> Same here then, and ...
> >>
> >>> +static void hwp_fast_uncore_msrs_ctl(bool val)
> >>> +{
> >>> +    uint64_t msr;
> >>> +
> >>> +    if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) )
> >>> +        hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL)\n");
> >>> +
> >>> +    msr = val;
> >>
> >> ... here (where you imply bit 0 instead of using a proper
> >> constant).
> >>
> >> Also for all three functions I'm not convinced their names are
> >> in good sync with their parameters being boolean.
> >
> > Would you prefer something named closer to the bit names like:
> > hdc_set_pkg_hdc_ctl -> hdc_set_pkg_enable
> > hdc_set_pm_ctl1 -> hdc_set_allow_block
> > hwp_fast_uncore_msrs_ctl -> hwp_fast_request_enable
>
> My primary request is for function name, purpose, and parameter(s)
> to be in line. So e.g. if you left the parameters as boolean, then
> I think your suggested name changes make sense. Alternatively you
> could make the functions e.g. be full register updating ones, with
> suitable parameters, which could be a mask-to-set and mask-to-clear.

I'm going to use the replacement names while keeping the boolean
argument.  Those MSRs only have single bits defined today, so
functions with boolean parameters are simpler.

> >>> +static void hwp_read_capabilities(void *info)
> >>> +{
> >>> +    struct cpufreq_policy *policy = info;
> >>> +    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
> >>> +
> >>> +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
> >>> +    {
> >>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
> >>> +                policy->cpu);
> >>> +
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
> >>> +    {
> >>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
> >>> +
> >>> +        return;
> >>> +    }
> >>> +}
> >>
> >> This function doesn't indicate failure to its caller(s), so am I
> >> to understand that failure to read either ofthe MSRs is actually
> >> benign to the driver?
> >
> > They really shouldn't fail since the CPUID check passed during
> > initialization.  If you can't read/write MSRs, something is broken and
> > the driver cannot function.  The machine would still run, but HWP
> > would be uncontrollable.  How much care should be given to
> > "impossible" situations?
>
> See above. The main point is to avoid crashing. In the specific
> case here I think you could simply drop both the log messages and
> the early return, assuming the caller can deal fine with the zero
> value(s) that rdmsr_safe() will substitute for the MSR value(s).
> Bailing early, otoh, calls for handing back an error indicator
> imo.

I changed it to have failure set curr_req.raw = -1.  Then
cpufreq_driver.init returns -ENODEV in that case.

> >>> +    policy->governor = &hwp_cpufreq_governor;
> >>> +
> >>> +    data = xzalloc(typeof(*data));
> >>
> >> Commonly we specify the type explicitly in such cases, rather than using
> >> typeof(). I will admit though that I'm not entirely certain which one's
> >> better. But consistency across the code base is perhaps preferable for
> >> the time being.
> >
> > I thought typeof(*data) is always preferable since it will always be
> > the matching type.  I can change it, but I feel like it's a step
> > backwards.
>
> There's exactly one similar use in the entire code base. The original
> idea with xmalloc() was that one would explicitly specify the intended
> type, such that without looking elsewhere one can see what exactly is
> to be allocated. One could further rely on the compiler warning if
> there was a disagreement between declaration and assignment.

Oh, okay.  Thanks for the explanation of xmalloc().

> >>> +    if ( feature_hwp_energy_perf )
> >>> +        data->energy_perf = 0x80;
> >>> +    else
> >>> +        data->energy_perf = 7;
> >>
> >> Where's this 7 coming from? (You do mention the 0x80 at least in the
> >> description.)
> >
> > When HWP energy performance preference is unavailable, it falls back
> > to IA32_ENERGY_PERF_BIAS with a 0-15 range.  Per the SDM Vol3 14.3.4,
> > "A value of 7 roughly translates into a hint to balance performance
> > with energy consumption."  I will add a comment.
>
> Actually, as per a comment on a later patch, I'm instead expecting
> you to add a #define, the name of which will serve as comment.

Yes, sounds good.

Regards,
Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver
  2021-06-03 11:55         ` Jason Andryuk
@ 2021-06-04  6:39           ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-06-04  6:39 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel

On 03.06.2021 13:55, Jason Andryuk wrote:
> On Fri, May 28, 2021 at 2:35 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 27.05.2021 20:50, Jason Andryuk wrote:
>>> On Wed, May 26, 2021 at 11:00 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 03.05.2021 21:28, Jason Andryuk wrote:
>>>>> +    hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n",
>>>>> +                eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not ");
>>>>> +    if ( eax & CPUID6_EAX_FAST_HWP_MSR )
>>>>> +    {
>>>>> +        if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) )
>>>>> +            hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n");
>>>>> +
>>>>> +        hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val);
>>>>
>>>> Missing "else" above here?
>>>
>>> Are unbalanced braces acceptable or must they be balanced?  Is this acceptable:
>>> if ()
>>>   one;
>>> else {
>>>   two;
>>>   three;
>>> }
>>
>> Yes, it is. But I don't see how the question relates to my comment.
>> All that needs to go in the else's body is the hwp_verbose().
> 
> 'val' shouldn't be used to set features when the rdmsr fails, so the
> following code needs to be within the else.  Unless you want to rely
> on a failed rdmsr returning 0.

It is intentional for rdmsr_safe() to return a zero value when the
access faulted, so I certainly think you may rely on this.

Jan



^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2021-06-04  6:39 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-03 19:27 [PATCH 00/13] Intel Hardware P-States (HWP) support Jason Andryuk
2021-05-03 19:27 ` [PATCH 01/13] cpufreq: Allow restricting to internal governors only Jason Andryuk
2021-05-26 13:18   ` Jan Beulich
2021-05-26 14:12     ` Jason Andryuk
2021-05-26 15:09       ` Jan Beulich
2021-05-26 16:44         ` Jason Andryuk
2021-05-03 19:27 ` [PATCH 02/13] cpufreq: Add perf_freq to cpuinfo Jason Andryuk
2021-05-26 13:24   ` Jan Beulich
2021-05-26 14:19     ` Jason Andryuk
2021-05-03 19:28 ` [PATCH 03/13] cpufreq: Export intel_feature_detect Jason Andryuk
2021-05-26 13:27   ` Jan Beulich
2021-05-26 14:44     ` Jason Andryuk
2021-05-26 15:11       ` Jan Beulich
2021-05-03 19:28 ` [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver Jason Andryuk
2021-05-26 14:59   ` Jan Beulich
2021-05-27  7:23     ` Jan Beulich
2021-05-27 18:50     ` Jason Andryuk
2021-05-28  6:35       ` Jan Beulich
2021-06-03 11:55         ` Jason Andryuk
2021-06-04  6:39           ` Jan Beulich
2021-05-27  7:45   ` Jan Beulich
2021-05-03 19:28 ` [PATCH 05/13] xenpm: Change get-cpufreq-para output for internal Jason Andryuk
2021-05-26 15:21   ` Jan Beulich
2021-05-27  5:54     ` Jan Beulich
2021-05-03 19:28 ` [PATCH 06/13] cpufreq: Export HWP parameters to userspace Jason Andryuk
2021-05-27  7:55   ` Jan Beulich
2021-05-28 13:19     ` Jason Andryuk
2021-05-28 13:39       ` Jan Beulich
2021-05-03 19:28 ` [PATCH 07/13] libxc: Include hwp_para in definitions Jason Andryuk
2021-05-03 19:28 ` [PATCH 08/13] xenpm: Print HWP parameters Jason Andryuk
2021-05-27  8:02   ` Jan Beulich
2021-05-03 19:28 ` [PATCH 09/13] xen: Add SET_CPUFREQ_HWP xen_sysctl_pm_op Jason Andryuk
2021-05-27  8:33   ` Jan Beulich
2021-05-03 19:28 ` [PATCH 10/13] libxc: Add xc_set_cpufreq_hwp Jason Andryuk
2021-05-04  8:03   ` Jan Beulich
2021-05-04 11:31     ` Jason Andryuk
2021-05-27  9:45   ` Jan Beulich
2021-05-03 19:28 ` [PATCH 11/13] xenpm: Factor out a non-fatal cpuid_parse variant Jason Andryuk
2021-05-27  8:41   ` Jan Beulich
2021-05-03 19:28 ` [PATCH 12/13] xenpm: Add set-cpufreq-hwp subcommand Jason Andryuk
2021-05-27  9:46   ` Jan Beulich
2021-05-03 19:28 ` [PATCH 13/13] CHANGELOG: Add Intel HWP entry Jason Andryuk
2021-05-27  9:48   ` Jan Beulich
2021-05-20 14:57 ` [PATCH 00/13] Intel Hardware P-States (HWP) support Jan Beulich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).