All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/2] Improve VM CPUfreq and task placement behavior
@ 2023-11-11  1:49 David Dai
  2023-11-11  1:49 ` [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device David Dai
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: David Dai @ 2023-11-11  1:49 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, David Dai,
	Saravana Kannan
  Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

Hi,

This patch series is a continuation of the talk Saravana gave at LPC 2022
titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
of the talk is that workloads running in a guest VM get terrible task
placement and CPUfreq behavior when compared to running the same workload
in the host. Effectively, no EAS(Energy Aware Scheduling) for threads
inside VMs. This would make power and performance terrible just by running
the workload in a VM even if we assume there is zero virtualization
overhead.

With this series, a workload running in a VM gets the same task placement
and CPUfreq behavior as it would when running in the host.

The idea is to improve VM CPUfreq/sched behavior by:
- Having guest kernel do accurate load tracking by taking host CPU
  arch/type and frequency into account.
- Sharing vCPU frequency requirements with the host so that the
  host can do proper frequency scaling and task placement on the host side.

Based on feedback from RFC v1 proposal[4], we've revised our
implementation to using MMIO reads and writes to pass information
from/to host instead of using hypercalls. In our example, the
VMM(Virtual Machine Manager) translates the frequency requests into
Uclamp_min and applies it to the vCPU thread as a hint to the host
kernel.

To achieve the results below, configure the host to:
- Affine vCPUs to specific clusters.
- Set vCPU capacity to match the host CPU they are running on.

To make it easy for folks to try this out with CrosVM, we have put up
userspace patches[5][6]. With those patches, you can configure CrosVM
correctly by adding the options "--host-cpu-topology" and "--virt-cpufreq".

Results:
========

Here are some side-by-side comparisons of RFC v1 proposal vs the current
patch series and are labelled as follows.

- (RFC v1) UtilHyp = hypercall + util_guest
- (current) UClampMMIO = MMIO + UClamp_min

Use cases running a minimal system inside a VM on a Pixel 6:
============================================================

FIO
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| Usecase(avg MB/s) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Seq Write         |     13.3 |    16.4 |   +23% |       13.4 |    +1% |
+-------------------+----------+---------+--------+------------+--------+
| Rand Write        |     11.2 |    12.9 |   +15% |       11.2 |     0% |
+-------------------+----------+---------+--------+------------+--------+
| Seq Read          |      100 |     168 |   +68% |        136 |   +36% |
+-------------------+----------+---------+--------+------------+--------+
| Rand Read         |     20.5 |    35.6 |   +74% |       29.5 |   +44% |
+-------------------+----------+---------+--------+------------+--------+

CPU-based ML Inference Benchmark
Lower is better
+----------------+----------+------------+--------+------------+--------+
| Test Case (ms) | Baseline | UtilHyp    | %delta | UClampMMIO | %delta |
+----------------+----------+------------+--------+------------+--------+
| Cached Sample  |          |            |        |            |        |
| Inference      |     3.40 |       2.37 |   -30% |       2.97 |   -13% |
+----------------+----------+------------+--------+------------+--------+
| Small Sample   |          |            |        |            |        |
| Inference      |     9.87 |       6.78 |   -31% |       7.92 |   -20% |
+----------------+----------+------------+--------+------------+--------+
| Large Sample   |          |            |        |            |        |
| Inference      |    33.35 |      26.74 |   -20% |      31.48 |    -6% |
+----------------+----------+------------+--------+------------+--------+

Use cases running Android inside a VM on a Chromebook:
======================================================

PCMark (Emulates real world usecases)
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| Test Case (score) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Weighted Total    |     5970 |    7162 |   +20% |       6782 |   +14% |
+-------------------+----------+---------+--------+------------+--------+
| Web Browsing      |     5558 |    5877 |    +6% |       5729 |    +3% |
+-------------------+----------+---------+--------+------------+--------+
| Video Editing     |     4921 |    5140 |    +4% |       5079 |    +3% |
+-------------------+----------+---------+--------+------------+--------+
| Writing           |     6864 |    9111 |   +33% |       8171 |   +10% |
+-------------------+----------+---------+--------+------------+--------+
| Photo Editing     |     7983 |   11349 |   +42% |      10313 |   +29% |
+-------------------+----------+---------+--------+------------+--------+
| Data Manipulation |     5814 |    6051 |    +4% |       6051 |    +1% |
+-------------------+----------+---------+--------+------------+--------+

PCMark Performance/mAh
Higher is better
+-------------------+----------+---------+--------+------------+--------+
|                   | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Score/mAh         |       85 |     102 |   +20% |         94 |    10% |
+-------------------+----------+---------+--------+------------+--------+

Roblox
Higher is better
+-------------------+----------+---------+--------+------------+--------+
|                   | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| FPS               |    20.88 |   25.64 |   +23% |      24.05 |   +15% |
+-------------------+----------+---------+--------+------------+--------+

Roblox Frames/mAh
Higher is better
+-------------------+----------+---------+--------+------------+--------+
|                   | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Frames/mAh        |    85.29 |  102.31 |   +20% |     94.20  |    10% |
+-------------------+----------+---------+--------+------------+--------+

We've simplified our implementation based on community feedback to make
it less intrusive and to use a more generic MMIO interface for
communication with the host. The results show that the current design
still has tangible improvements over baseline. We'll continue looking
into ways to reduce the overhead of the MMIO read/writes and submit
separate and generic patches for that if we find any good optimizations.

Thanks,
David & Saravana

Cc: Saravana Kannan <saravanak@google.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Masami Hiramatsu <mhiramat@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
Cc: Gupta Pankaj <pankaj.gupta@amd.com>
Cc: Mel Gorman <mgorman@suse.de>

v3 -> v4:
-Fixed dt-binding formatting issues
-Added additional dt-binding descriptions for “HW interfaces”
-Changed dt-binding to “qemu,virtual-cpufreq”
-Fixed Kconfig formatting issues
-Removed frequency downscaling when requesting frequency updates
-Removed ops and cpufreq driver data
-Added check to limit freq_scale to 1024
-Added KHZ in the register offset naming
-Added comments to explain FIE and not allowing dvfs_possible_from_any_cpu

v2 -> v3:
- Dropped patches adding new hypercalls
- Dropped patch adding util_guest in sched/fair
- Cpufreq driver now populates frequency using opp bindings
- Removed transition_delay_us=1 cpufreq setting as it was configured too
  agressively and resulted in poor I/O performance
- Modified guest cpufreq driver to read/write MMIO regions instead of
  using hypercalls to communicate with the host
- Modified guest cpufreq driver to pass frequency info instead of
  utilization of the current vCPU's runqueue which now takes
  iowait_boost into account from the schedutil governor
- Updated DT bindings for a virtual CPU frequency device
Userspace changes:
- Updated CrosVM patches to emulate a virtual cpufreq device
- Updated to newer userspace binaries when collecting more recent
  benchmark data

v1 -> v2:
- No functional changes.
- Added description for EAS and removed DVFS in coverletter.
- Added a v2 tag to the subject.
- Fixed up the inconsistent "units" between tables.
- Made sure everyone is To/Cc-ed for all the patches in the series.

[1] - https://lpc.events/event/16/contributions/1195/
[2] - https://lpc.events/event/16/contributions/1195/attachments/970/1893/LPC%202022%20-%20VM%20DVFS.pdf
[3] - https://www.youtube.com/watch?v=hIg_5bg6opU
[4] - https://lore.kernel.org/all/20230331014356.1033759-1-davidai@google.com/
[5] - https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4208668
[6] - https://chromium-review.googlesource.com/q/topic:%22virtcpufreq-v4%22

David Dai (2):
  dt-bindings: cpufreq: add virtual cpufreq device
  cpufreq: add virtual-cpufreq driver

 .../cpufreq/qemu,cpufreq-virtual.yaml         |  99 +++++++++
 drivers/cpufreq/Kconfig                       |  15 ++
 drivers/cpufreq/Makefile                      |   1 +
 drivers/cpufreq/virtual-cpufreq.c             | 201 ++++++++++++++++++
 include/linux/arch_topology.h                 |   1 +
 5 files changed, 317 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
 create mode 100644 drivers/cpufreq/virtual-cpufreq.c

-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2023-11-11  1:49 [PATCH v4 0/2] Improve VM CPUfreq and task placement behavior David Dai
@ 2023-11-11  1:49 ` David Dai
  2023-11-15  6:27   ` Viresh Kumar
                     ` (3 more replies)
  2023-11-11  1:49 ` [PATCH v4 2/2] cpufreq: add virtual-cpufreq driver David Dai
  2023-11-13 12:20 ` [PATCH v4 0/2] Improve VM CPUfreq and task placement behavior Hongyan Xia
  2 siblings, 4 replies; 20+ messages in thread
From: David Dai @ 2023-11-11  1:49 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, David Dai,
	Saravana Kannan
  Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

Adding bindings to represent a virtual cpufreq device.

Virtual machines may expose MMIO regions for a virtual cpufreq device
for guests to read frequency information or to request frequency
selection. The virtual cpufreq device has an individual controller for
each frequency domain.

Co-developed-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: David Dai <davidai@google.com>
---
 .../cpufreq/qemu,cpufreq-virtual.yaml         | 99 +++++++++++++++++++
 1 file changed, 99 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml

diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
new file mode 100644
index 000000000000..16606cf1fd1a
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
@@ -0,0 +1,99 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Virtual CPUFreq
+
+maintainers:
+  - David Dai <davidai@google.com>
+  - Saravana Kannan <saravanak@google.com>
+
+description:
+  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
+  selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU
+  is associated with a frequency domain which can be shared with other vCPUs.
+  Each frequency domain has its own set of registers for frequency controls.
+
+properties:
+  compatible:
+    const: qemu,virtual-cpufreq
+
+  reg:
+    maxItems: 1
+    description:
+      Address and size of region containing frequency controls for each of the
+      frequency domains. Regions for each frequency domain is placed
+      contiugously and contain registers for controlling DVFS(Dynamic Frequency
+      and Voltage) characteristics. The size of the region is proportional to
+      total number of frequency domains.
+
+required:
+  - compatible
+  - reg
+
+additionalProperties: false
+
+examples:
+  - |
+    // This example shows a two CPU configuration with a frequency domain
+    // for each CPU.
+    cpus {
+      #address-cells = <1>;
+      #size-cells = <0>;
+
+      cpu@0 {
+        compatible = "arm,armv8";
+        device_type = "cpu";
+        reg = <0x0>;
+        operating-points-v2 = <&opp_table0>;
+      };
+
+      cpu@1 {
+        compatible = "arm,armv8";
+        device_type = "cpu";
+        reg = <0x0>;
+        operating-points-v2 = <&opp_table1>;
+      };
+    };
+
+    opp_table0: opp-table-0 {
+      compatible = "operating-points-v2";
+      opp-shared;
+
+      opp1098000000 {
+        opp-hz = /bits/ 64 <1098000000>;
+        opp-level = <1>;
+      };
+
+      opp1197000000 {
+        opp-hz = /bits/ 64 <1197000000>;
+        opp-level = <2>;
+      };
+    };
+
+    opp_table1: opp-table-1 {
+      compatible = "operating-points-v2";
+      opp-shared;
+
+      opp1106000000 {
+        opp-hz = /bits/ 64 <1106000000>;
+        opp-level = <1>;
+      };
+
+      opp1277000000 {
+        opp-hz = /bits/ 64 <1277000000>;
+        opp-level = <2>;
+      };
+    };
+
+    soc {
+      #address-cells = <1>;
+      #size-cells = <1>;
+
+      cpufreq@1040000 {
+        compatible = "qemu,virtual-cpufreq";
+        reg = <0x1040000 0x10>;
+      };
+    };
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 2/2] cpufreq: add virtual-cpufreq driver
  2023-11-11  1:49 [PATCH v4 0/2] Improve VM CPUfreq and task placement behavior David Dai
  2023-11-11  1:49 ` [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device David Dai
@ 2023-11-11  1:49 ` David Dai
  2023-11-15  6:29   ` Viresh Kumar
  2024-01-15 16:58   ` Hongyan Xia
  2023-11-13 12:20 ` [PATCH v4 0/2] Improve VM CPUfreq and task placement behavior Hongyan Xia
  2 siblings, 2 replies; 20+ messages in thread
From: David Dai @ 2023-11-11  1:49 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, David Dai,
	Saravana Kannan
  Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

Introduce a virtualized cpufreq driver for guest kernels to improve
performance and power of workloads within VMs.

This driver does two main things:

1. Sends the frequency of vCPUs as a hint to the host. The host uses the
hint to schedule the vCPU threads and decide physical CPU frequency.

2. If a VM does not support a virtualized FIE(like AMUs), it queries the
host CPU frequency by reading a MMIO region of a virtual cpufreq device
to update the guest's frequency scaling factor periodically. This enables
accurate Per-Entity Load Tracking for tasks running in the guest.

Co-developed-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: David Dai <davidai@google.com>
---
 drivers/cpufreq/Kconfig           |  15 +++
 drivers/cpufreq/Makefile          |   1 +
 drivers/cpufreq/virtual-cpufreq.c | 201 ++++++++++++++++++++++++++++++
 include/linux/arch_topology.h     |   1 +
 4 files changed, 218 insertions(+)
 create mode 100644 drivers/cpufreq/virtual-cpufreq.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 35efb53d5492..f2d37075aa10 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -217,6 +217,21 @@ config CPUFREQ_DT
 
 	  If in doubt, say N.
 
+config CPUFREQ_VIRT
+	tristate "Virtual cpufreq driver"
+	depends on OF
+	select PM_OPP
+	help
+	  This adds a virtualized cpufreq driver for guest kernels that
+	  read/writes to a MMIO region for a virtualized cpufreq device to
+	  communicate with the host. It sends frequency updates to the host
+	  which gets used as a hint to schedule vCPU threads and select CPU
+	  frequency. If a VM does not support a virtualized FIE such as AMUs,
+	  it updates the frequency scaling factor by polling host CPU frequency
+	  to enable accurate Per-Entity Load Tracking for tasks running in the guest.
+
+	  If in doubt, say N.
+
 config CPUFREQ_DT_PLATDEV
 	tristate "Generic DT based cpufreq platdev driver"
 	depends on OF
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 8d141c71b016..eb72ecdc24db 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET)	+= cpufreq_governor_attr_set.o
 
 obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o
 obj-$(CONFIG_CPUFREQ_DT_PLATDEV)	+= cpufreq-dt-platdev.o
+obj-$(CONFIG_CPUFREQ_VIRT)		+= virtual-cpufreq.o
 
 # Traces
 CFLAGS_amd-pstate-trace.o               := -I$(src)
diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
new file mode 100644
index 000000000000..f828d3345a68
--- /dev/null
+++ b/drivers/cpufreq/virtual-cpufreq.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Google LLC
+ */
+
+#include <linux/arch_topology.h>
+#include <linux/cpufreq.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/of_address.h>
+#include <linux/of_platform.h>
+#include <linux/pm_opp.h>
+#include <linux/slab.h>
+
+#define REG_CUR_FREQ_KHZ_OFFSET 0x0
+#define REG_SET_FREQ_KHZ_OFFSET 0x4
+#define PER_CPU_OFFSET 0x8
+
+static void __iomem *base;
+
+static void virt_scale_freq_tick(void)
+{
+	int cpu = smp_processor_id();
+	u32 max_freq = (u32)cpufreq_get_hw_max_freq(cpu);
+	u64 cur_freq;
+	unsigned long scale;
+
+	cur_freq = (u64)readl_relaxed(base + cpu * PER_CPU_OFFSET
+			+ REG_CUR_FREQ_KHZ_OFFSET);
+
+	cur_freq <<= SCHED_CAPACITY_SHIFT;
+	scale = (unsigned long)div_u64(cur_freq, max_freq);
+	scale = min(scale, SCHED_CAPACITY_SCALE);
+
+	this_cpu_write(arch_freq_scale, scale);
+}
+
+static struct scale_freq_data virt_sfd = {
+	.source = SCALE_FREQ_SOURCE_VIRT,
+	.set_freq_scale = virt_scale_freq_tick,
+};
+
+static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
+{
+	writel_relaxed(policy->cached_target_freq,
+		       base + policy->cpu * PER_CPU_OFFSET + REG_SET_FREQ_KHZ_OFFSET);
+	return 0;
+}
+
+static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
+					     unsigned int target_freq)
+{
+	virt_cpufreq_set_perf(policy);
+	return target_freq;
+}
+
+static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
+				     unsigned int index)
+{
+	return virt_cpufreq_set_perf(policy);
+}
+
+static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+	struct cpufreq_frequency_table *table;
+	struct device *cpu_dev;
+	int ret;
+
+	cpu_dev = get_cpu_device(policy->cpu);
+	if (!cpu_dev)
+		return -ENODEV;
+
+	ret = dev_pm_opp_of_add_table(cpu_dev);
+	if (ret)
+		return ret;
+
+	ret = dev_pm_opp_get_opp_count(cpu_dev);
+	if (ret <= 0) {
+		dev_err(cpu_dev, "OPP table can't be empty\n");
+		return -ENODEV;
+	}
+
+	ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
+	if (ret) {
+		dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
+		return ret;
+	}
+
+	policy->freq_table = table;
+
+	/*
+	 * To simplify and improve latency of handling frequency requests on
+	 * the host side, this ensures that the vCPU thread triggering the MMIO
+	 * abort is the same thread whose performance constraints (Ex. uclamp
+	 * settings) need to be updated. This simplifies the VMM (Virtual
+	 * Machine Manager) having to find the correct vCPU thread and/or
+	 * facing permission issues when configuring other threads.
+	 */
+	policy->dvfs_possible_from_any_cpu = false;
+	policy->fast_switch_possible = true;
+
+	/*
+	 * Using the default SCALE_FREQ_SOURCE_CPUFREQ is insufficient since
+	 * the actual physical CPU frequency may not match requested frequency
+	 * from the vCPU thread due to frequency update latencies or other
+	 * inputs to the physical CPU frequency selection. This additional FIE
+	 * source allows for more accurate freq_scale updates and only takes
+	 * effect if another FIE source such as AMUs have not been registered.
+	 */
+	topology_set_scale_freq_source(&virt_sfd, policy->cpus);
+
+	return 0;
+}
+
+static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus);
+	kfree(policy->freq_table);
+	policy->freq_table = NULL;
+	return 0;
+}
+
+static int virt_cpufreq_online(struct cpufreq_policy *policy)
+{
+	/* Nothing to restore. */
+	return 0;
+}
+
+static int virt_cpufreq_offline(struct cpufreq_policy *policy)
+{
+	/* Dummy offline() to avoid exit() being called and freeing resources. */
+	return 0;
+}
+
+static struct cpufreq_driver cpufreq_virt_driver = {
+	.name		= "virt-cpufreq",
+	.init		= virt_cpufreq_cpu_init,
+	.exit		= virt_cpufreq_cpu_exit,
+	.online         = virt_cpufreq_online,
+	.offline        = virt_cpufreq_offline,
+	.verify		= cpufreq_generic_frequency_table_verify,
+	.target_index	= virt_cpufreq_target_index,
+	.fast_switch	= virt_cpufreq_fast_switch,
+	.attr		= cpufreq_generic_attr,
+};
+
+static int virt_cpufreq_driver_probe(struct platform_device *pdev)
+{
+	int ret;
+
+	base = devm_platform_ioremap_resource(pdev, 0);
+	if (IS_ERR(base))
+		return PTR_ERR(base);
+
+	ret = cpufreq_register_driver(&cpufreq_virt_driver);
+	if (ret) {
+		dev_err(&pdev->dev, "Virtual CPUFreq driver failed to register: %d\n", ret);
+		return ret;
+	}
+
+	dev_dbg(&pdev->dev, "Virtual CPUFreq driver initialized\n");
+	return 0;
+}
+
+static int virt_cpufreq_driver_remove(struct platform_device *pdev)
+{
+	cpufreq_unregister_driver(&cpufreq_virt_driver);
+	return 0;
+}
+
+static const struct of_device_id virt_cpufreq_match[] = {
+	{ .compatible = "qemu,virtual-cpufreq", .data = NULL},
+	{}
+};
+MODULE_DEVICE_TABLE(of, virt_cpufreq_match);
+
+static struct platform_driver virt_cpufreq_driver = {
+	.probe = virt_cpufreq_driver_probe,
+	.remove = virt_cpufreq_driver_remove,
+	.driver = {
+		.name = "virt-cpufreq",
+		.of_match_table = virt_cpufreq_match,
+	},
+};
+
+static int __init virt_cpufreq_init(void)
+{
+	return platform_driver_register(&virt_cpufreq_driver);
+}
+postcore_initcall(virt_cpufreq_init);
+
+static void __exit virt_cpufreq_exit(void)
+{
+	platform_driver_unregister(&virt_cpufreq_driver);
+}
+module_exit(virt_cpufreq_exit);
+
+MODULE_DESCRIPTION("Virtual cpufreq driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index a07b510e7dc5..888282dce2ba 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -42,6 +42,7 @@ enum scale_freq_source {
 	SCALE_FREQ_SOURCE_CPUFREQ = 0,
 	SCALE_FREQ_SOURCE_ARCH,
 	SCALE_FREQ_SOURCE_CPPC,
+	SCALE_FREQ_SOURCE_VIRT,
 };
 
 struct scale_freq_data {
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/2] Improve VM CPUfreq and task placement behavior
  2023-11-11  1:49 [PATCH v4 0/2] Improve VM CPUfreq and task placement behavior David Dai
  2023-11-11  1:49 ` [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device David Dai
  2023-11-11  1:49 ` [PATCH v4 2/2] cpufreq: add virtual-cpufreq driver David Dai
@ 2023-11-13 12:20 ` Hongyan Xia
  2023-11-13 12:26   ` Marc Zyngier
  2 siblings, 1 reply; 20+ messages in thread
From: Hongyan Xia @ 2023-11-13 12:20 UTC (permalink / raw)
  To: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan
  Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

Hi David,

On 11/11/2023 01:49, David Dai wrote:
> Hi,
> 
> This patch series is a continuation of the talk Saravana gave at LPC 2022
> titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> of the talk is that workloads running in a guest VM get terrible task
> placement and CPUfreq behavior when compared to running the same workload
> in the host. Effectively, no EAS(Energy Aware Scheduling) for threads
> inside VMs. This would make power and performance terrible just by running
> the workload in a VM even if we assume there is zero virtualization
> overhead.
> 
> With this series, a workload running in a VM gets the same task placement
> and CPUfreq behavior as it would when running in the host.
> 
> The idea is to improve VM CPUfreq/sched behavior by:
> - Having guest kernel do accurate load tracking by taking host CPU
>    arch/type and frequency into account.
> - Sharing vCPU frequency requirements with the host so that the
>    host can do proper frequency scaling and task placement on the host side.
> 
> Based on feedback from RFC v1 proposal[4], we've revised our
> implementation to using MMIO reads and writes to pass information
> from/to host instead of using hypercalls. In our example, the
> VMM(Virtual Machine Manager) translates the frequency requests into
> Uclamp_min and applies it to the vCPU thread as a hint to the host
> kernel.

Sorry for not noticing this series until now.

The problem you are having with uclamp is actually the same as what
I'm tackling right now. Basically my conclusion so far is that uclamp
max aggregation faces quite many problems, which can be easily solved by
sum aggregation (summing up the clamped utilization values instead of
applying the max uclamp value to the whole rq):

https://lore.kernel.org/all/cover.1696345700.git.Hongyan.Xia2@arm.com/

What you described as util_guest sounds to me as exactly what uclamp_min
under sum aggregation does. I'm really tempted to ask you to apply my
series and see if the new uclamp_min does what you want, instead of
introducing a new util_guest signal. If you have no time for this I can
try to replicate your setup and do the experiments myself.

Also, my knowledge with KVM is limited. May I know where the vCPU fork
happens? Can't you just set the p->sched_reset_on_fork flag on fork to
not carry forward the uclamp values?

> 
> [...]
Hongyan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/2] Improve VM CPUfreq and task placement behavior
  2023-11-13 12:20 ` [PATCH v4 0/2] Improve VM CPUfreq and task placement behavior Hongyan Xia
@ 2023-11-13 12:26   ` Marc Zyngier
  0 siblings, 0 replies; 20+ messages in thread
From: Marc Zyngier @ 2023-11-13 12:26 UTC (permalink / raw)
  To: Hongyan Xia
  Cc: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan,
	Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On Mon, 13 Nov 2023 12:20:29 +0000,
Hongyan Xia <hongyan.xia2@arm.com> wrote:
> 
> Also, my knowledge with KVM is limited. May I know where the vCPU fork
> happens? Can't you just set the p->sched_reset_on_fork flag on fork to
> not carry forward the uclamp values?

There is no "vCPU fork". The vcpu is a VMM (qemu, crosvm, kvmtool)
thread, nothing else.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2023-11-11  1:49 ` [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device David Dai
@ 2023-11-15  6:27   ` Viresh Kumar
  2023-11-16 16:22     ` Rob Herring
  2023-11-15  8:49   ` Marc Zyngier
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Viresh Kumar @ 2023-11-15  6:27 UTC (permalink / raw)
  To: David Dai
  Cc: Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Sudeep Holla, Saravana Kannan, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On 10-11-23, 17:49, David Dai wrote:
> diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> +properties:
> +  compatible:
> +    const: qemu,virtual-cpufreq

Not sure why we need to mention QEMU here.. Why limit this to just QEMU ?

-- 
viresh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/2] cpufreq: add virtual-cpufreq driver
  2023-11-11  1:49 ` [PATCH v4 2/2] cpufreq: add virtual-cpufreq driver David Dai
@ 2023-11-15  6:29   ` Viresh Kumar
  2023-12-08  1:18     ` David Dai
  2024-01-15 16:58   ` Hongyan Xia
  1 sibling, 1 reply; 20+ messages in thread
From: Viresh Kumar @ 2023-11-15  6:29 UTC (permalink / raw)
  To: David Dai
  Cc: Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Sudeep Holla, Saravana Kannan, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On 10-11-23, 17:49, David Dai wrote:
> diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
> +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> +{
> +	writel_relaxed(policy->cached_target_freq,

Drivers shouldn't be using the cached_target_freq directly. Use the target freq
or index passed from cpufreq core.

> +static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> +{
> +	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus);
> +	kfree(policy->freq_table);
> +	policy->freq_table = NULL;

No need of doing this. Also the order of above two calls is wrong anyway.

-- 
viresh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2023-11-11  1:49 ` [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device David Dai
  2023-11-15  6:27   ` Viresh Kumar
@ 2023-11-15  8:49   ` Marc Zyngier
  2023-12-07 22:44     ` Saravana Kannan
  2023-12-08 12:45   ` Sudeep Holla
  2024-01-15 16:28   ` Hongyan Xia
  3 siblings, 1 reply; 20+ messages in thread
From: Marc Zyngier @ 2023-11-15  8:49 UTC (permalink / raw)
  To: David Dai
  Cc: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan,
	Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On Sat, 11 Nov 2023 01:49:29 +0000,
David Dai <davidai@google.com> wrote:
> 
> Adding bindings to represent a virtual cpufreq device.
> 
> Virtual machines may expose MMIO regions for a virtual cpufreq device
> for guests to read frequency information or to request frequency
> selection. The virtual cpufreq device has an individual controller for
> each frequency domain.

I would really refrain form having absolute frequencies here. A
virtual machine can be migrated, and there are *zero* guarantees that
the target system has the same clock range as the source.

This really should be a relative number, much like the capacity. That,
at least, can be migrated across systems.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2023-11-15  6:27   ` Viresh Kumar
@ 2023-11-16 16:22     ` Rob Herring
  0 siblings, 0 replies; 20+ messages in thread
From: Rob Herring @ 2023-11-16 16:22 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: David Dai, Rafael J. Wysocki, Krzysztof Kozlowski, Conor Dooley,
	Sudeep Holla, Saravana Kannan, Quentin Perret, Masami Hiramatsu,
	Will Deacon, Peter Zijlstra, Vincent Guittot, Marc Zyngier,
	Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
	Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

On Wed, Nov 15, 2023 at 11:57:41AM +0530, Viresh Kumar wrote:
> On 10-11-23, 17:49, David Dai wrote:
> > diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> > +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> > +properties:
> > +  compatible:
> > +    const: qemu,virtual-cpufreq
> 
> Not sure why we need to mention QEMU here.. Why limit this to just QEMU ?

Because that is the implementation. And also to discourage anyone from 
using this on their hardware or thinking it is generic and the way to 
extend it to another implementation is adding properties.

Rob

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2023-11-15  8:49   ` Marc Zyngier
@ 2023-12-07 22:44     ` Saravana Kannan
  2023-12-08  8:52       ` Marc Zyngier
  0 siblings, 1 reply; 20+ messages in thread
From: Saravana Kannan @ 2023-12-07 22:44 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
	Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Sat, 11 Nov 2023 01:49:29 +0000,
> David Dai <davidai@google.com> wrote:
> >
> > Adding bindings to represent a virtual cpufreq device.
> >
> > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > for guests to read frequency information or to request frequency
> > selection. The virtual cpufreq device has an individual controller for
> > each frequency domain.
>
> I would really refrain form having absolute frequencies here. A
> virtual machine can be migrated, and there are *zero* guarantees that
> the target system has the same clock range as the source.
>
> This really should be a relative number, much like the capacity. That,
> at least, can be migrated across systems.

There's nothing in this patch that mandates absolute frequency.
In true KVM philosophy, we leave it to the VMM to decide.

-Saravana

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/2] cpufreq: add virtual-cpufreq driver
  2023-11-15  6:29   ` Viresh Kumar
@ 2023-12-08  1:18     ` David Dai
  2023-12-08  9:51       ` Viresh Kumar
  0 siblings, 1 reply; 20+ messages in thread
From: David Dai @ 2023-12-08  1:18 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Sudeep Holla, Saravana Kannan, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

Hi Viresh,

Apologies for the late reply,

On Wed, Nov 15, 2023 at 3:29 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 10-11-23, 17:49, David Dai wrote:
> > diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
> > +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> > +{
> > +     writel_relaxed(policy->cached_target_freq,
>
> Drivers shouldn't be using the cached_target_freq directly. Use the target freq
> or index passed from cpufreq core.

We were trying to avoid rounding to frequency table entries to provide
more accurate frequency requests. However, we didn't find any
significant power or performance regressions using the frequencies
from the table, so I'll send another patch series using your
suggestion.

>
> > +static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> > +{
> > +     topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus);
> > +     kfree(policy->freq_table);
> > +     policy->freq_table = NULL;
>
> No need of doing this. Also the order of above two calls is wrong anyway.

Can you clarify this point a bit more? Are you suggesting to just
remove setting policy->freq_table to NULL and swap the ordering
freeing the freq_table vs clearing the topology source? I can
alternatively use dev_pm_opp_free_cpufreq_table to mirror the init.

Thanks,
David

>
> --
> viresh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2023-12-07 22:44     ` Saravana Kannan
@ 2023-12-08  8:52       ` Marc Zyngier
  2024-01-12 22:02         ` Saravana Kannan
  0 siblings, 1 reply; 20+ messages in thread
From: Marc Zyngier @ 2023-12-08  8:52 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
	Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

On Thu, 07 Dec 2023 22:44:36 +0000,
Saravana Kannan <saravanak@google.com> wrote:
> 
> On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Sat, 11 Nov 2023 01:49:29 +0000,
> > David Dai <davidai@google.com> wrote:
> > >
> > > Adding bindings to represent a virtual cpufreq device.
> > >
> > > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > > for guests to read frequency information or to request frequency
> > > selection. The virtual cpufreq device has an individual controller for
> > > each frequency domain.
> >
> > I would really refrain form having absolute frequencies here. A
> > virtual machine can be migrated, and there are *zero* guarantees that
> > the target system has the same clock range as the source.
> >
> > This really should be a relative number, much like the capacity. That,
> > at least, can be migrated across systems.
> 
> There's nothing in this patch that mandates absolute frequency.
> In true KVM philosophy, we leave it to the VMM to decide.

This has nothing to do with KVM. It would apply to any execution
environment, including QEMU in TCG mode.

To quote the original patch:

+    description:
+      Address and size of region containing frequency controls for each of the
+      frequency domains. Regions for each frequency domain is placed
+      contiugously and contain registers for controlling DVFS(Dynamic Frequency
+      and Voltage) characteristics. The size of the region is proportional to
+      total number of frequency domains.

What part of that indicates that *relative* frequencies are
acceptable? The example explicitly uses the opp-v2 binding, which
clearly is about absolute frequency.

To reiterate: absolute frequencies are not the right tool for the job,
and they should explicitly be described as relative in the spec. Not
left as a "whatev'" option for the execution environment to interpret.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/2] cpufreq: add virtual-cpufreq driver
  2023-12-08  1:18     ` David Dai
@ 2023-12-08  9:51       ` Viresh Kumar
  0 siblings, 0 replies; 20+ messages in thread
From: Viresh Kumar @ 2023-12-08  9:51 UTC (permalink / raw)
  To: David Dai
  Cc: Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Sudeep Holla, Saravana Kannan, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On 08-12-23, 10:18, David Dai wrote:
> Hi Viresh,
> 
> Apologies for the late reply,
> 
> On Wed, Nov 15, 2023 at 3:29 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 10-11-23, 17:49, David Dai wrote:
> > > diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
> > > +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> > > +{
> > > +     writel_relaxed(policy->cached_target_freq,
> >
> > Drivers shouldn't be using the cached_target_freq directly. Use the target freq
> > or index passed from cpufreq core.
> 
> We were trying to avoid rounding to frequency table entries to provide
> more accurate frequency requests. However, we didn't find any
> significant power or performance regressions using the frequencies
> from the table, so I'll send another patch series using your
> suggestion.
> 
> >
> > > +static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> > > +{
> > > +     topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus);
> > > +     kfree(policy->freq_table);

This becomes a dangling pointer for a very short amount of time. There may or
may not be a actual race here and so I said the ordering must be just the
opposite anyway.

> > > +     policy->freq_table = NULL;

And I thought this isn't required too since the core is going the free the
policy structure right after returning from here. But maybe it is not a
guarantee that the core provides (the code can change in future) and so be
better to unset it anyway.

> > No need of doing this. Also the order of above two calls is wrong anyway.
> 
> Can you clarify this point a bit more? Are you suggesting to just
> remove setting policy->freq_table to NULL and swap the ordering
> freeing the freq_table vs clearing the topology source? I can
> alternatively use dev_pm_opp_free_cpufreq_table to mirror the init.

That would be better actually, to let a single piece of code manage this :)

-- 
viresh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2023-11-11  1:49 ` [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device David Dai
  2023-11-15  6:27   ` Viresh Kumar
  2023-11-15  8:49   ` Marc Zyngier
@ 2023-12-08 12:45   ` Sudeep Holla
  2024-01-12 22:15     ` Saravana Kannan
  2024-01-15 16:28   ` Hongyan Xia
  3 siblings, 1 reply; 20+ messages in thread
From: Sudeep Holla @ 2023-12-08 12:45 UTC (permalink / raw)
  To: David Dai
  Cc: Rafael J. Wysocki, Sudeep Holla, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Saravana Kannan,
	Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

On Fri, Nov 10, 2023 at 05:49:29PM -0800, David Dai wrote:
> Adding bindings to represent a virtual cpufreq device.
> 
> Virtual machines may expose MMIO regions for a virtual cpufreq device
> for guests to read frequency information or to request frequency
> selection. The virtual cpufreq device has an individual controller for
> each frequency domain.
> 
> Co-developed-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: David Dai <davidai@google.com>
> ---
>  .../cpufreq/qemu,cpufreq-virtual.yaml         | 99 +++++++++++++++++++
>  1 file changed, 99 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> 
> diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> new file mode 100644
> index 000000000000..16606cf1fd1a
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> @@ -0,0 +1,99 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Virtual CPUFreq
> +
> +maintainers:
> +  - David Dai <davidai@google.com>
> +  - Saravana Kannan <saravanak@google.com>
> +
> +description:
> +  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
> +  selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU
> +  is associated with a frequency domain which can be shared with other vCPUs.
> +  Each frequency domain has its own set of registers for frequency controls.
> +

Are these register controls described somewhere ? The reason I ask is we
should be able to have single implementation of this virtual cpufreq
irrespective of the firmware used(DT vs ACPI) IMO.

> +properties:
> +  compatible:
> +    const: qemu,virtual-cpufreq
> +
> +  reg:
> +    maxItems: 1
> +    description:
> +      Address and size of region containing frequency controls for each of the
> +      frequency domains. Regions for each frequency domain is placed
> +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> +      and Voltage) characteristics. The size of the region is proportional to
> +      total number of frequency domains.
> +
> +required:
> +  - compatible
> +  - reg
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +    // This example shows a two CPU configuration with a frequency domain
> +    // for each CPU.
> +    cpus {
> +      #address-cells = <1>;
> +      #size-cells = <0>;
> +
> +      cpu@0 {
> +        compatible = "arm,armv8";
> +        device_type = "cpu";
> +        reg = <0x0>;
> +        operating-points-v2 = <&opp_table0>;
> +      };
> +
> +      cpu@1 {
> +        compatible = "arm,armv8";
> +        device_type = "cpu";
> +        reg = <0x0>;
> +        operating-points-v2 = <&opp_table1>;
> +      };
> +    };
> +
> +    opp_table0: opp-table-0 {
> +      compatible = "operating-points-v2";
> +      opp-shared;
> +
> +      opp1098000000 {
> +        opp-hz = /bits/ 64 <1098000000>;
> +        opp-level = <1>;
> +      };
> +
> +      opp1197000000 {
> +        opp-hz = /bits/ 64 <1197000000>;
> +        opp-level = <2>;
> +      };
> +    };
> +
> +    opp_table1: opp-table-1 {
> +      compatible = "operating-points-v2";
> +      opp-shared;
> +
> +      opp1106000000 {
> +        opp-hz = /bits/ 64 <1106000000>;
> +        opp-level = <1>;
> +      };
> +
> +      opp1277000000 {
> +        opp-hz = /bits/ 64 <1277000000>;
> +        opp-level = <2>;
> +      };
> +    };
>

I think using OPP with absolute frequencies seems not appropriate here.
Why can't these fetched from the registers and have some abstract values
instead ?

> +    soc {
> +      #address-cells = <1>;
> +      #size-cells = <1>;
> +
> +      cpufreq@1040000 {
> +        compatible = "qemu,virtual-cpufreq";
> +        reg = <0x1040000 0x10>;

So just 16bytes for 2 CPU system ? How does the register layout look like ?
I assume just 4 x 32bit registers: 2 for reading and 2 for setting the
frequencies ?

--
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2023-12-08  8:52       ` Marc Zyngier
@ 2024-01-12 22:02         ` Saravana Kannan
  2024-01-13  9:37           ` Marc Zyngier
  0 siblings, 1 reply; 20+ messages in thread
From: Saravana Kannan @ 2024-01-12 22:02 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
	Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

Sorry for the delay in response. Was very busy for a while and then
holidays started.

On Fri, Dec 8, 2023 at 12:52 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Thu, 07 Dec 2023 22:44:36 +0000,
> Saravana Kannan <saravanak@google.com> wrote:
> >
> > On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Sat, 11 Nov 2023 01:49:29 +0000,
> > > David Dai <davidai@google.com> wrote:
> > > >
> > > > Adding bindings to represent a virtual cpufreq device.
> > > >
> > > > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > > > for guests to read frequency information or to request frequency
> > > > selection. The virtual cpufreq device has an individual controller for
> > > > each frequency domain.
> > >
> > > I would really refrain form having absolute frequencies here. A
> > > virtual machine can be migrated, and there are *zero* guarantees that
> > > the target system has the same clock range as the source.
> > >
> > > This really should be a relative number, much like the capacity. That,
> > > at least, can be migrated across systems.
> >
> > There's nothing in this patch that mandates absolute frequency.
> > In true KVM philosophy, we leave it to the VMM to decide.
>
> This has nothing to do with KVM. It would apply to any execution
> environment, including QEMU in TCG mode.
>
> To quote the original patch:
>
> +    description:
> +      Address and size of region containing frequency controls for each of the
> +      frequency domains. Regions for each frequency domain is placed
> +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> +      and Voltage) characteristics. The size of the region is proportional to
> +      total number of frequency domains.
>
> What part of that indicates that *relative* frequencies are
> acceptable? The example explicitly uses the opp-v2 binding, which
> clearly is about absolute frequency.

We can update the doc to make that clearer and update the example too.

> To reiterate: absolute frequencies are not the right tool for the job,
> and they should explicitly be described as relative in the spec. Not
> left as a "whatev'" option for the execution environment to interpret.

I think it depends on the use case. If there's no plan to migrate the
VM across different devices, there's no need to do the unnecessary
normalization back and forth.

And if we can translate between pCPU frequency and a normalized
frequency, we can do the same for whatever made up frequencies too. In
fact, we plan to do exactly that in our internal use cases for this.
There's nothing here that prevents the VMM from doing that.

Also, if there are hardware virtualized performance counters (AMU,
CPPC, etc) that are used for frequency normalization, then we have to
use the real frequencies in those devices otherwise the "current
frequency" can be 2 GHz while the max normalized frequency is 1024
KHz. That'll mess up load tracking.

Thanks,
Saravana

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2023-12-08 12:45   ` Sudeep Holla
@ 2024-01-12 22:15     ` Saravana Kannan
  0 siblings, 0 replies; 20+ messages in thread
From: Saravana Kannan @ 2024-01-12 22:15 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On Fri, Dec 8, 2023 at 4:47 AM Sudeep Holla <sudeep.holla@arm.com> wrote:
>
> On Fri, Nov 10, 2023 at 05:49:29PM -0800, David Dai wrote:
> > Adding bindings to represent a virtual cpufreq device.
> >
> > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > for guests to read frequency information or to request frequency
> > selection. The virtual cpufreq device has an individual controller for
> > each frequency domain.
> >
> > Co-developed-by: Saravana Kannan <saravanak@google.com>
> > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > Signed-off-by: David Dai <davidai@google.com>
> > ---
> >  .../cpufreq/qemu,cpufreq-virtual.yaml         | 99 +++++++++++++++++++
> >  1 file changed, 99 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> >
> > diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> > new file mode 100644
> > index 000000000000..16606cf1fd1a
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> > @@ -0,0 +1,99 @@
> > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > +
> > +title: Virtual CPUFreq
> > +
> > +maintainers:
> > +  - David Dai <davidai@google.com>
> > +  - Saravana Kannan <saravanak@google.com>
> > +
> > +description:
> > +  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
> > +  selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU
> > +  is associated with a frequency domain which can be shared with other vCPUs.
> > +  Each frequency domain has its own set of registers for frequency controls.
> > +
>
> Are these register controls described somewhere ? The reason I ask is we
> should be able to have single implementation of this virtual cpufreq
> irrespective of the firmware used(DT vs ACPI) IMO.

Agree that we want the same driver to work for DT and ACPI. This doc
was written to be similar to other DT binding docs that don't describe
the registers in the DT binding. The registers are pretty straight
forward (can tell from the code too). One register to "set" the
frequency and another to "get" the current frequency. We don't have
any ACPI expertise/hardware to test this on or care about it right
now. But David looked at some ACPI drivers and we think it should be
trivial to add ACPI support to this. Just a different set of probe
functions to register and populate the CPUfreq table.

>
> > +properties:
> > +  compatible:
> > +    const: qemu,virtual-cpufreq
> > +
> > +  reg:
> > +    maxItems: 1
> > +    description:
> > +      Address and size of region containing frequency controls for each of the
> > +      frequency domains. Regions for each frequency domain is placed
> > +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> > +      and Voltage) characteristics. The size of the region is proportional to
> > +      total number of frequency domains.
> > +
> > +required:
> > +  - compatible
> > +  - reg
> > +
> > +additionalProperties: false
> > +
> > +examples:
> > +  - |
> > +    // This example shows a two CPU configuration with a frequency domain
> > +    // for each CPU.
> > +    cpus {
> > +      #address-cells = <1>;
> > +      #size-cells = <0>;
> > +
> > +      cpu@0 {
> > +        compatible = "arm,armv8";
> > +        device_type = "cpu";
> > +        reg = <0x0>;
> > +        operating-points-v2 = <&opp_table0>;
> > +      };
> > +
> > +      cpu@1 {
> > +        compatible = "arm,armv8";
> > +        device_type = "cpu";
> > +        reg = <0x0>;
> > +        operating-points-v2 = <&opp_table1>;
> > +      };
> > +    };
> > +
> > +    opp_table0: opp-table-0 {
> > +      compatible = "operating-points-v2";
> > +      opp-shared;
> > +
> > +      opp1098000000 {
> > +        opp-hz = /bits/ 64 <1098000000>;
> > +        opp-level = <1>;
> > +      };
> > +
> > +      opp1197000000 {
> > +        opp-hz = /bits/ 64 <1197000000>;
> > +        opp-level = <2>;
> > +      };
> > +    };
> > +
> > +    opp_table1: opp-table-1 {
> > +      compatible = "operating-points-v2";
> > +      opp-shared;
> > +
> > +      opp1106000000 {
> > +        opp-hz = /bits/ 64 <1106000000>;
> > +        opp-level = <1>;
> > +      };
> > +
> > +      opp1277000000 {
> > +        opp-hz = /bits/ 64 <1277000000>;
> > +        opp-level = <2>;
> > +      };
> > +    };
> >
>
> I think using OPP with absolute frequencies seems not appropriate here.
> Why can't these fetched from the registers and have some abstract values
> instead ?

Whether the frequencies are real or you want to cap it to 1024 and
normalize it, you still need to populate the CPUfreq table. And we
didn't want to reinvent the wheel and want to use existing means of
representing the table in as cross-architecture as possible -- so, DT
and ACPI should cover them all. For example, if we want to say CPU0
and 1 for a single CPUfreq policy, that's all already doable in DT. We
don't want to reinvent new schemes/register interfaces for that.

>
> > +    soc {
> > +      #address-cells = <1>;
> > +      #size-cells = <1>;
> > +
> > +      cpufreq@1040000 {
> > +        compatible = "qemu,virtual-cpufreq";
> > +        reg = <0x1040000 0x10>;
>
> So just 16bytes for 2 CPU system ? How does the register layout look like ?
> I assume just 4 x 32bit registers: 2 for reading and 2 for setting the
> frequencies ?

Yup. 2 registers per CPU frequency domain or policy.

Thanks,
Saravana

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2024-01-12 22:02         ` Saravana Kannan
@ 2024-01-13  9:37           ` Marc Zyngier
  2024-01-16 23:47             ` Saravana Kannan
  0 siblings, 1 reply; 20+ messages in thread
From: Marc Zyngier @ 2024-01-13  9:37 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
	Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

On Fri, 12 Jan 2024 22:02:39 +0000,
Saravana Kannan <saravanak@google.com> wrote:
> 
> Sorry for the delay in response. Was very busy for a while and then
> holidays started.
> 
> On Fri, Dec 8, 2023 at 12:52 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Thu, 07 Dec 2023 22:44:36 +0000,
> > Saravana Kannan <saravanak@google.com> wrote:
> > >
> > > On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
> > > >
> > > > On Sat, 11 Nov 2023 01:49:29 +0000,
> > > > David Dai <davidai@google.com> wrote:
> > > > >
> > > > > Adding bindings to represent a virtual cpufreq device.
> > > > >
> > > > > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > > > > for guests to read frequency information or to request frequency
> > > > > selection. The virtual cpufreq device has an individual controller for
> > > > > each frequency domain.
> > > >
> > > > I would really refrain form having absolute frequencies here. A
> > > > virtual machine can be migrated, and there are *zero* guarantees that
> > > > the target system has the same clock range as the source.
> > > >
> > > > This really should be a relative number, much like the capacity. That,
> > > > at least, can be migrated across systems.
> > >
> > > There's nothing in this patch that mandates absolute frequency.
> > > In true KVM philosophy, we leave it to the VMM to decide.
> >
> > This has nothing to do with KVM. It would apply to any execution
> > environment, including QEMU in TCG mode.
> >
> > To quote the original patch:
> >
> > +    description:
> > +      Address and size of region containing frequency controls for each of the
> > +      frequency domains. Regions for each frequency domain is placed
> > +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> > +      and Voltage) characteristics. The size of the region is proportional to
> > +      total number of frequency domains.
> >
> > What part of that indicates that *relative* frequencies are
> > acceptable? The example explicitly uses the opp-v2 binding, which
> > clearly is about absolute frequency.
> 
> We can update the doc to make that clearer and update the example too.
> 
> > To reiterate: absolute frequencies are not the right tool for the job,
> > and they should explicitly be described as relative in the spec. Not
> > left as a "whatev'" option for the execution environment to interpret.
> 
> I think it depends on the use case. If there's no plan to migrate the
> VM across different devices, there's no need to do the unnecessary
> normalization back and forth.

VM migration is a given, specially when QEMU is involved. Designing
something that doesn't support it is a bug, plain and simple.

> And if we can translate between pCPU frequency and a normalized
> frequency, we can do the same for whatever made up frequencies too. In
> fact, we plan to do exactly that in our internal use cases for this.
> There's nothing here that prevents the VMM from doing that.
>
> Also, if there are hardware virtualized performance counters (AMU,
> CPPC, etc) that are used for frequency normalization, then we have to
> use the real frequencies in those devices otherwise the "current
> frequency" can be 2 GHz while the max normalized frequency is 1024
> KHz. That'll mess up load tracking.

And that's exactly why this shouldn't be a *frequency*, but a
performance scale or some other unit-less coefficient. Just like the
big-little capacity.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2023-11-11  1:49 ` [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device David Dai
                     ` (2 preceding siblings ...)
  2023-12-08 12:45   ` Sudeep Holla
@ 2024-01-15 16:28   ` Hongyan Xia
  3 siblings, 0 replies; 20+ messages in thread
From: Hongyan Xia @ 2024-01-15 16:28 UTC (permalink / raw)
  To: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan
  Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

On 11/11/2023 01:49, David Dai wrote:
> Adding bindings to represent a virtual cpufreq device.
> 
> Virtual machines may expose MMIO regions for a virtual cpufreq device
> for guests to read frequency information or to request frequency
> selection. The virtual cpufreq device has an individual controller for
> each frequency domain.
> 
> Co-developed-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: David Dai <davidai@google.com>
> ---
>   .../cpufreq/qemu,cpufreq-virtual.yaml         | 99 +++++++++++++++++++
>   1 file changed, 99 insertions(+)
>   create mode 100644 Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> 
> diff --git a/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> new file mode 100644
> index 000000000000..16606cf1fd1a
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/cpufreq/qemu,cpufreq-virtual.yaml
> @@ -0,0 +1,99 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/cpufreq/qemu,cpufreq-virtual.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Virtual CPUFreq
> +
> +maintainers:
> +  - David Dai <davidai@google.com>
> +  - Saravana Kannan <saravanak@google.com>
> +
> +description:
> +  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
> +  selection of its vCPUs as a hint to the host through MMIO regions. Each vCPU
> +  is associated with a frequency domain which can be shared with other vCPUs.
> +  Each frequency domain has its own set of registers for frequency controls.
> +
> +properties:
> +  compatible:
> +    const: qemu,virtual-cpufreq
> +
> +  reg:
> +    maxItems: 1
> +    description:
> +      Address and size of region containing frequency controls for each of the
> +      frequency domains. Regions for each frequency domain is placed
> +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> +      and Voltage) characteristics. The size of the region is proportional to
> +      total number of frequency domains.
> +
> +required:
> +  - compatible
> +  - reg
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +    // This example shows a two CPU configuration with a frequency domain
> +    // for each CPU.
> +    cpus {
> +      #address-cells = <1>;
> +      #size-cells = <0>;
> +
> +      cpu@0 {
> +        compatible = "arm,armv8";
> +        device_type = "cpu";
> +        reg = <0x0>;
> +        operating-points-v2 = <&opp_table0>;
> +      };
> +
> +      cpu@1 {
> +        compatible = "arm,armv8";
> +        device_type = "cpu";
> +        reg = <0x0>;
> +        operating-points-v2 = <&opp_table1>;
> +      };
> +    };
> +
> +    opp_table0: opp-table-0 {
> +      compatible = "operating-points-v2";
> +      opp-shared;
> +
> +      opp1098000000 {
> +        opp-hz = /bits/ 64 <1098000000>;
> +        opp-level = <1>;
> +      };
> +
> +      opp1197000000 {
> +        opp-hz = /bits/ 64 <1197000000>;
> +        opp-level = <2>;
> +      };
> +    };
> +
> +    opp_table1: opp-table-1 {
> +      compatible = "operating-points-v2";
> +      opp-shared;
> +
> +      opp1106000000 {
> +        opp-hz = /bits/ 64 <1106000000>;
> +        opp-level = <1>;
> +      };
> +
> +      opp1277000000 {
> +        opp-hz = /bits/ 64 <1277000000>;
> +        opp-level = <2>;
> +      };
> +    };

NIT: If my understanding is correct, it might be worth re-iterating that 
these OPPs should mirror the host frequency domain this vCPU is pinned to.

Also, since VM migration has been mentioned elsewhere in this thread, am 
I right in saying that you can't change these OPPs after registration? 
So, even if one wants to migrate, one has to migrate to an SoC with the 
same frequency domains anyway, otherwise the OPPs in the VM are entirely 
bogus?

> +
> +    soc {
> +      #address-cells = <1>;
> +      #size-cells = <1>;
> +
> +      cpufreq@1040000 {
> +        compatible = "qemu,virtual-cpufreq";
> +        reg = <0x1040000 0x10>;
> +      };
> +    };

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/2] cpufreq: add virtual-cpufreq driver
  2023-11-11  1:49 ` [PATCH v4 2/2] cpufreq: add virtual-cpufreq driver David Dai
  2023-11-15  6:29   ` Viresh Kumar
@ 2024-01-15 16:58   ` Hongyan Xia
  1 sibling, 0 replies; 20+ messages in thread
From: Hongyan Xia @ 2024-01-15 16:58 UTC (permalink / raw)
  To: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan
  Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

On 11/11/2023 01:49, David Dai wrote:
> [...]
> diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
> index 8d141c71b016..eb72ecdc24db 100644
> --- a/drivers/cpufreq/Makefile
> +++ b/drivers/cpufreq/Makefile
> @@ -16,6 +16,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET)	+= cpufreq_governor_attr_set.o
>   
>   obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o
>   obj-$(CONFIG_CPUFREQ_DT_PLATDEV)	+= cpufreq-dt-platdev.o
> +obj-$(CONFIG_CPUFREQ_VIRT)		+= virtual-cpufreq.o
>   
>   # Traces
>   CFLAGS_amd-pstate-trace.o               := -I$(src)
> diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
> new file mode 100644
> index 000000000000..f828d3345a68
> --- /dev/null
> +++ b/drivers/cpufreq/virtual-cpufreq.c
> @@ -0,0 +1,201 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 Google LLC
> + */
> +
> +#include <linux/arch_topology.h>
> +#include <linux/cpufreq.h>
> +#include <linux/init.h>
> +#include <linux/sched.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/of_address.h>
> +#include <linux/of_platform.h>
> +#include <linux/pm_opp.h>
> +#include <linux/slab.h>
> +
> +#define REG_CUR_FREQ_KHZ_OFFSET 0x0
> +#define REG_SET_FREQ_KHZ_OFFSET 0x4
> +#define PER_CPU_OFFSET 0x8
> +
> +static void __iomem *base;
> +
> +static void virt_scale_freq_tick(void)
> +{
> +	int cpu = smp_processor_id();
> +	u32 max_freq = (u32)cpufreq_get_hw_max_freq(cpu);
> +	u64 cur_freq;
> +	unsigned long scale;
> +
> +	cur_freq = (u64)readl_relaxed(base + cpu * PER_CPU_OFFSET
> +			+ REG_CUR_FREQ_KHZ_OFFSET);
> +
> +	cur_freq <<= SCHED_CAPACITY_SHIFT;
> +	scale = (unsigned long)div_u64(cur_freq, max_freq);
> +	scale = min(scale, SCHED_CAPACITY_SCALE);
> +
> +	this_cpu_write(arch_freq_scale, scale);
> +}

Here we update the scaling factor in the guest, but is there any way to 
let the guest know when the host dequeues the vCPU so that the guest 
PELT signal doesn't appear larger than it actually is? Is this a known 
limitation and is there a way to mitigate it?

> [...]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device
  2024-01-13  9:37           ` Marc Zyngier
@ 2024-01-16 23:47             ` Saravana Kannan
  0 siblings, 0 replies; 20+ messages in thread
From: Saravana Kannan @ 2024-01-16 23:47 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
	Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

On Sat, Jan 13, 2024 at 1:37 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 12 Jan 2024 22:02:39 +0000,
> Saravana Kannan <saravanak@google.com> wrote:
> >
> > Sorry for the delay in response. Was very busy for a while and then
> > holidays started.
> >
> > On Fri, Dec 8, 2023 at 12:52 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Thu, 07 Dec 2023 22:44:36 +0000,
> > > Saravana Kannan <saravanak@google.com> wrote:
> > > >
> > > > On Wed, Nov 15, 2023 at 12:49 AM Marc Zyngier <maz@kernel.org> wrote:
> > > > >
> > > > > On Sat, 11 Nov 2023 01:49:29 +0000,
> > > > > David Dai <davidai@google.com> wrote:
> > > > > >
> > > > > > Adding bindings to represent a virtual cpufreq device.
> > > > > >
> > > > > > Virtual machines may expose MMIO regions for a virtual cpufreq device
> > > > > > for guests to read frequency information or to request frequency
> > > > > > selection. The virtual cpufreq device has an individual controller for
> > > > > > each frequency domain.
> > > > >
> > > > > I would really refrain form having absolute frequencies here. A
> > > > > virtual machine can be migrated, and there are *zero* guarantees that
> > > > > the target system has the same clock range as the source.
> > > > >
> > > > > This really should be a relative number, much like the capacity. That,
> > > > > at least, can be migrated across systems.
> > > >
> > > > There's nothing in this patch that mandates absolute frequency.
> > > > In true KVM philosophy, we leave it to the VMM to decide.
> > >
> > > This has nothing to do with KVM. It would apply to any execution
> > > environment, including QEMU in TCG mode.
> > >
> > > To quote the original patch:
> > >
> > > +    description:
> > > +      Address and size of region containing frequency controls for each of the
> > > +      frequency domains. Regions for each frequency domain is placed
> > > +      contiugously and contain registers for controlling DVFS(Dynamic Frequency
> > > +      and Voltage) characteristics. The size of the region is proportional to
> > > +      total number of frequency domains.
> > >
> > > What part of that indicates that *relative* frequencies are
> > > acceptable? The example explicitly uses the opp-v2 binding, which
> > > clearly is about absolute frequency.
> >
> > We can update the doc to make that clearer and update the example too.
> >
> > > To reiterate: absolute frequencies are not the right tool for the job,
> > > and they should explicitly be described as relative in the spec. Not
> > > left as a "whatev'" option for the execution environment to interpret.
> >
> > I think it depends on the use case. If there's no plan to migrate the
> > VM across different devices, there's no need to do the unnecessary
> > normalization back and forth.
>
> VM migration is a given, specially when QEMU is involved. Designing
> something that doesn't support it is a bug, plain and simple.

I'm not saying this patch series doesn't allow migrating. I'm just
pointing out that normalization might not always be worth it for a VMM
to do.

We'll update the example and documentation to used normalize values.
CPUfreq itself used KHz for the tables and we can't really change that
when we are emulating CPU freq scaling. Same goes for OPP table DT
property names.

But the values we use can be 0 to 1024 Hz and be normalized.

> > And if we can translate between pCPU frequency and a normalized
> > frequency, we can do the same for whatever made up frequencies too. In
> > fact, we plan to do exactly that in our internal use cases for this.
> > There's nothing here that prevents the VMM from doing that.
> >
> > Also, if there are hardware virtualized performance counters (AMU,
> > CPPC, etc) that are used for frequency normalization, then we have to
> > use the real frequencies in those devices otherwise the "current
> > frequency" can be 2 GHz while the max normalized frequency is 1024
> > KHz. That'll mess up load tracking.
>
> And that's exactly why this shouldn't be a *frequency*, but a
> performance scale or some other unit-less coefficient. Just like the
> big-little capacity.

Sorry I wasn't cleared in my explanation. Let me explain it better.
When performance counters are available, the scheduler uses them to
compute the current average CPU frequency over a scheduler tick. It
looks at the performance counters to figure out how many CPU cycles
have gone by and how much time has gone by and does (delta cycles /
delta seconds) to compute current CPU freq in Hz. So, when a HW
virtualized performance counter is available, and the scheduler inside
the VM uses it to compute the average CPU frequency, the resulting
frequency is going to be the real physical CPU frequency. And the CPU
frequency value the scheduler used to compute the PELT load will be
completely different from the normalized values and the load tracking
will be completely wrong.

Using their performance counters inside the VM reduces a lot of MMIO
access exits to the host or memory accesses. So it makes sense a VM
might want to use that. I was just trying to say that in that
situation, a VMM might have a good reason to just use the real
frequencies inside the VM too.

In any case, we can update the doc to use normalized values/examples
and call out this caveat.

Thanks,
Saravana

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-01-16 23:48 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-11  1:49 [PATCH v4 0/2] Improve VM CPUfreq and task placement behavior David Dai
2023-11-11  1:49 ` [PATCH v4 1/2] dt-bindings: cpufreq: add virtual cpufreq device David Dai
2023-11-15  6:27   ` Viresh Kumar
2023-11-16 16:22     ` Rob Herring
2023-11-15  8:49   ` Marc Zyngier
2023-12-07 22:44     ` Saravana Kannan
2023-12-08  8:52       ` Marc Zyngier
2024-01-12 22:02         ` Saravana Kannan
2024-01-13  9:37           ` Marc Zyngier
2024-01-16 23:47             ` Saravana Kannan
2023-12-08 12:45   ` Sudeep Holla
2024-01-12 22:15     ` Saravana Kannan
2024-01-15 16:28   ` Hongyan Xia
2023-11-11  1:49 ` [PATCH v4 2/2] cpufreq: add virtual-cpufreq driver David Dai
2023-11-15  6:29   ` Viresh Kumar
2023-12-08  1:18     ` David Dai
2023-12-08  9:51       ` Viresh Kumar
2024-01-15 16:58   ` Hongyan Xia
2023-11-13 12:20 ` [PATCH v4 0/2] Improve VM CPUfreq and task placement behavior Hongyan Xia
2023-11-13 12:26   ` Marc Zyngier

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.