linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/2] Improve VM CPUfreq and task placement behavior
@ 2023-07-31 17:46 David Dai
  2023-07-31 17:46 ` [PATCH v3 1/2] dt-bindings: cpufreq: add bindings for virtual cpufreq David Dai
  2023-07-31 17:46 ` [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver David Dai
  0 siblings, 2 replies; 23+ messages in thread
From: David Dai @ 2023-07-31 17:46 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, David Dai,
	Saravana Kannan
  Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

Hi,

This patch series is a continuation of the talk Saravana gave at LPC 2022
titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
of the talk is that workloads running in a guest VM get terrible task
placement and CPUfreq behavior when compared to running the same workload
in the host. Effectively, no EAS(Energy Aware Scheduling) for threads
inside VMs. This would make power and performance terrible just by running
the workload in a VM even if we assume there is zero virtualization
overhead.

With this series, a workload running in a VM gets the same task placement
and CPUfreq behavior as it would when running in the host.

The idea is to improve VM CPUfreq/sched behavior by:
- Having guest kernel do accurate load tracking by taking host CPU
  arch/type and frequency into account.
- Sharing vCPU frequency requirements with the host so that the
  host can do proper frequency scaling and task placement on the host side.

Based on feedback from RFC V1 proposal[4], we've revised our
implementation to using MMIO reads and writes to pass information
from/to host instead of using hypercalls. In our example, the
VMM(Virtual Machine Manager) translates the frequency requests into
Uclamp_min and applies it to the vCPU thread as a hint to the host
kernel.

To achieve the results below, configure the host to:
- Affine vCPUs to specific clusters.
- Set vCPU capacity to match the host CPU they are running on.

To make it easy for folks to try this out with CrosVM, we have put up
userspace patches[5][6]. With those patches, you can configure CrosVM
correctly by adding the options "--host-cpu-topology" and "--virt-cpufreq".

Results:
========

Here are some side-by-side comparisons of RFC V1 proposal vs the current
RFC V3 proposal and are labelled as follows. Some of the numbers have
changed due to using newer userspace binaries compared to RFC V1:

- (RFC V1) UtilHyp = hypercall + util_guest
- (V3) UClampMMIO = MMIO + UClamp_min

Use cases running a minimal system inside a VM on a Pixel 6:
============================================================

FIO
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| Usecase(avg MB/s) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Seq Write         |     13.3 |    16.4 |   +23% |       13.4 |    +1% |
+-------------------+----------+---------+--------+------------+--------+
| Rand Write        |     11.2 |    12.9 |   +15% |       11.2 |     0% |
+-------------------+----------+---------+--------+------------+--------+
| Seq Read          |      100 |     168 |   +68% |        136 |   +36% |
+-------------------+----------+---------+--------+------------+--------+
| Rand Read         |     20.5 |    35.6 |   +74% |       29.5 |   +44% |
+-------------------+----------+---------+--------+------------+--------+

CPU-based ML Inference Benchmark
Lower is better
+----------------+----------+------------+--------+------------+--------+
| Test Case (ms) | Baseline | UtilHyp    | %delta | UClampMMIO | %delta |
+----------------+----------+------------+--------+------------+--------+
| Cached Sample  |          |            |        |            |        |
| Inference      |     3.40 |       2.37 |   -30% |       2.97 |   -13% |
+----------------+----------+------------+--------+------------+--------+
| Small Sample   |          |            |        |            |        |
| Inference      |     9.87 |       6.78 |   -31% |       7.92 |   -20% |
+----------------+----------+------------+--------+------------+--------+
| Large Sample   |          |            |        |            |        |
| Inference      |    33.35 |      26.74 |   -20% |      31.48 |    -6% |
+----------------+----------+------------+--------+------------+--------+

Use cases running Android inside a VM on a Chromebook:
======================================================

PCMark (Emulates real world usecases)
Higher is better
+-------------------+----------+---------+--------+------------+--------+
| Test Case (score) | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Weighted Total    |     5970 |    7162 |   +20% |       6782 |   +14% |
+-------------------+----------+---------+--------+------------+--------+
| Web Browsing      |     5558 |    5877 |    +6% |       5729 |    +3% |
+-------------------+----------+---------+--------+------------+--------+
| Video Editing     |     4921 |    5140 |    +4% |       5079 |    +3% |
+-------------------+----------+---------+--------+------------+--------+
| Writing           |     6864 |    9111 |   +33% |       8171 |   +10% |
+-------------------+----------+---------+--------+------------+--------+
| Photo Editing     |     7983 |   11349 |   +42% |      10313 |   +29% |
+-------------------+----------+---------+--------+------------+--------+
| Data Manipulation |     5814 |    6051 |    +4% |       6051 |    +1% |
+-------------------+----------+---------+--------+------------+--------+

PCMark Performance/mAh
Higher is better
+-------------------+----------+---------+--------+------------+--------+
|                   | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Score/mAh         |       85 |     102 |   +20% |         94 |    10% |
+-------------------+----------+---------+--------+------------+--------+

Roblox
Higher is better
+-------------------+----------+---------+--------+------------+--------+
|                   | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| FPS               |    20.88 |   25.64 |   +23% |      24.05 |   +15% |
+-------------------+----------+---------+--------+------------+--------+

Roblox Frames/mAh
Higher is better
+-------------------+----------+---------+--------+------------+--------+
|                   | Baseline | UtilHyp | %delta | UClampMMIO | %delta |
+-------------------+----------+---------+--------+------------+--------+
| Frames/mAh        |    85.29 |  102.31 |   +20% |     94.20  |    10% |
+-------------------+----------+---------+--------+------------+--------+

We've simplified our implementation based on community feedback to make
it less intrusive and to use a more generic MMIO interface for
communication with the host. The results show that the current design
still has tangible improvements over baseline. We'll continue looking
into ways to reduce the overhead of the MMIO read/writes and submit
separate and generic patches for that if we find any good optimizations.

Thanks,
David & Saravana

Cc: Saravana Kannan <saravanak@google.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Masami Hiramatsu <mhiramat@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
Cc: Gupta Pankaj <pankaj.gupta@amd.com>
Cc: Mel Gorman <mgorman@suse.de>

v2 -> v3:
- Dropped patches adding new hypercalls
- Dropped patch adding util_guest in sched/fair
- Cpufreq driver now populates frequency using opp bindings
- Removed transition_delay_us=1 cpufreq setting as it was configured too
  agressively and resulted in poor I/O performance
- Modified guest cpufreq driver to read/write MMIO regions instead of
  using hypercalls to communicate with the host
- Modified guest cpufreq driver to pass frequency info instead of
  utilization of the current vCPU's runqueue which now takes
  iowait_boost into account from the schedutil governor
- Updated DT bindings for a virtual CPU frequency device
Userspace changes:
- Updated CrosVM patches to emulate a virtual cpufreq device
- Updated to newer userspace binaries when collecting more recent
  benchmark data

v1 -> v2:
- No functional changes.
- Added description for EAS and removed DVFS in coverletter.
- Added a v2 tag to the subject.
- Fixed up the inconsistent "units" between tables.
- Made sure everyone is To/Cc-ed for all the patches in the series.

[1] - https://lpc.events/event/16/contributions/1195/
[2] - https://lpc.events/event/16/contributions/1195/attachments/970/1893/LPC%202022%20-%20VM%20DVFS.pdf
[3] - https://www.youtube.com/watch?v=hIg_5bg6opU
[4] - https://lore.kernel.org/all/20230331014356.1033759-1-davidai@google.com/
[5] - https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4208668
[6] - https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4504738

David Dai (2):
  dt-bindings: cpufreq: add bindings for virtual cpufreq
  cpufreq: add virtual-cpufreq driver

 .../bindings/cpufreq/cpufreq-virtual.yaml     |  89 +++++++
 drivers/cpufreq/Kconfig                       |  15 ++
 drivers/cpufreq/Makefile                      |   1 +
 drivers/cpufreq/virtual-cpufreq.c             | 237 ++++++++++++++++++
 include/linux/arch_topology.h                 |   1 +
 5 files changed, 343 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml
 create mode 100644 drivers/cpufreq/virtual-cpufreq.c

-- 
2.41.0.585.gd2178a4bd4-goog


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v3 1/2] dt-bindings: cpufreq: add bindings for virtual cpufreq
  2023-07-31 17:46 [PATCH v3 0/2] Improve VM CPUfreq and task placement behavior David Dai
@ 2023-07-31 17:46 ` David Dai
  2023-07-31 18:12   ` Rob Herring
  2023-08-05 19:38   ` Krzysztof Kozlowski
  2023-07-31 17:46 ` [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver David Dai
  1 sibling, 2 replies; 23+ messages in thread
From: David Dai @ 2023-07-31 17:46 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, David Dai,
	Saravana Kannan
  Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

Adding bindings to represent a virtual cpufreq device.

Virtual machines may expose MMIO regions for a virtual cpufreq device for
guests to read frequency information or to request frequency selection. The
virtual cpufreq device has an individual controller for each CPU.

Co-developed-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: David Dai <davidai@google.com>
---
 .../bindings/cpufreq/cpufreq-virtual.yaml     | 89 +++++++++++++++++++
 1 file changed, 89 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml

diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml
new file mode 100644
index 000000000000..f377cfc972ca
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml
@@ -0,0 +1,89 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cpufreq/cpufreq-virtual.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yamll#
+
+title: Virtual CPUFreq
+
+maintainers:
+  - David Dai <davidai@google.com>
+  - Saravana Kannan <saravanak@google.com>
+
+description:
+  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
+  selection of its vCPUs as a hint to the host through MMIO regions. The host
+  uses the hint to schedule vCPU threads and select physical CPU frequency. It
+  enables accurate Per-Entity Load Tracking for tasks running in the guest by
+  querying host CPU frequency unless a virtualized FIE (ex. AMU) exists.
+
+properties:
+  compatible:
+    const: virtual,cpufreq
+  reg:
+    maxItems: 1
+
+required:
+  - compatible
+  - reg
+
+additionalProperties: false
+
+examples:
+  - |
+    cpus {
+      #address-cells = <1>;
+      #size-cells = <0>;
+
+      cpu@0 {
+        compatible = "arm,arm-v8";
+        device_type = "cpu";
+        reg = <0x0>;
+        operating-points-v2 = <&opp_table0>;
+      };
+
+      cpu@1 {
+        compatible = "arm,arm-v8";
+        device_type = "cpu";
+        reg = <0x0>;
+        operating-points-v2 = <&opp_table1>;
+      };
+    };
+
+    opp_table0: opp-table-0 {
+      compatible = "operating-points-v2";
+
+      opp1098000000 {
+        opp-hz = /bits/ 64 <1098000000>;
+        opp-level = <1>;
+      };
+
+      opp1197000000 {
+        opp-hz = /bits/ 64 <1197000000>;
+        opp-level = <2>;
+      };
+    };
+
+    opp_table1: opp-table-1 {
+      compatible = "operating-points-v2";
+
+      opp1106000000 {
+        opp-hz = /bits/ 64 <1106000000>;
+        opp-level = <1>;
+      };
+
+      opp1277000000 {
+        opp-hz = /bits/ 64 <1277000000>;
+        opp-level = <2>;
+      };
+    };
+
+    soc {
+      #address-cells = <1>;
+      #size-cells = <1>;
+
+      cpufreq {
+        reg = <0x1040000 0x10>;
+        compatible = "virtual,cpufreq";
+      };
+    };
-- 
2.41.0.585.gd2178a4bd4-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-07-31 17:46 [PATCH v3 0/2] Improve VM CPUfreq and task placement behavior David Dai
  2023-07-31 17:46 ` [PATCH v3 1/2] dt-bindings: cpufreq: add bindings for virtual cpufreq David Dai
@ 2023-07-31 17:46 ` David Dai
  2023-07-31 22:02   ` Randy Dunlap
                     ` (4 more replies)
  1 sibling, 5 replies; 23+ messages in thread
From: David Dai @ 2023-07-31 17:46 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, David Dai,
	Saravana Kannan
  Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

Introduce a virtualized cpufreq driver for guest kernels to improve
performance and power of workloads within VMs.

This driver does two main things:

1. Sends the frequency of vCPUs as a hint to the host. The host uses the
hint to schedule the vCPU threads and decide physical CPU frequency.

2. If a VM does not support a virtualized FIE(like AMUs), it queries the
host CPU frequency by reading a MMIO region of a virtual cpufreq device
to update the guest's frequency scaling factor periodically. This enables
accurate Per-Entity Load Tracking for tasks running in the guest.

Co-developed-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: David Dai <davidai@google.com>
---
 drivers/cpufreq/Kconfig           |  15 ++
 drivers/cpufreq/Makefile          |   1 +
 drivers/cpufreq/virtual-cpufreq.c | 237 ++++++++++++++++++++++++++++++
 include/linux/arch_topology.h     |   1 +
 4 files changed, 254 insertions(+)
 create mode 100644 drivers/cpufreq/virtual-cpufreq.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index f429b9b37b76..3977ca796747 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -217,6 +217,21 @@ config CPUFREQ_DT
 
 	  If in doubt, say N.
 
+config CPUFREQ_VIRT
+        tristate "Virtual cpufreq driver"
+	depends on OF
+	select PM_OPP
+        help
+	  This adds a virtualized cpufreq driver for guest kernels that
+	  read/writes to a MMIO region for a virtualized cpufreq device to
+	  communicate with the host. It sends frequency updates to the host
+	  which gets used as a hint to schedule vCPU threads and select CPU
+	  frequency. If a VM does not support a virtualized FIE such as AMUs,
+	  it updates the frequency scaling factor by polling host CPU frequency
+	  to enable accurate Per-Entity Load Tracking for tasks running in the guest.
+
+	  If in doubt, say N.
+
 config CPUFREQ_DT_PLATDEV
 	tristate "Generic DT based cpufreq platdev driver"
 	depends on OF
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index ef8510774913..bbc9f9bdfa4b 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_ATTR_SET)	+= cpufreq_governor_attr_set.o
 
 obj-$(CONFIG_CPUFREQ_DT)		+= cpufreq-dt.o
 obj-$(CONFIG_CPUFREQ_DT_PLATDEV)	+= cpufreq-dt-platdev.o
+obj-$(CONFIG_CPUFREQ_VIRT)		+= virtual-cpufreq.o
 
 # Traces
 CFLAGS_amd-pstate-trace.o               := -I$(src)
diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
new file mode 100644
index 000000000000..66b0fd9b821c
--- /dev/null
+++ b/drivers/cpufreq/virtual-cpufreq.c
@@ -0,0 +1,237 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Google LLC
+ */
+
+#include <linux/arch_topology.h>
+#include <linux/cpufreq.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/of_address.h>
+#include <linux/of_platform.h>
+#include <linux/pm_opp.h>
+#include <linux/slab.h>
+
+#define REG_CUR_FREQ_OFFSET 0x0
+#define REG_SET_FREQ_OFFSET 0x4
+#define PER_CPU_OFFSET 0x8
+
+struct virt_cpufreq_ops {
+	void (*set_freq)(struct cpufreq_policy *policy, u32 freq);
+	u32 (*get_freq)(struct cpufreq_policy *policy);
+};
+
+struct virt_cpufreq_drv_data {
+	void __iomem *base;
+	const struct virt_cpufreq_ops *ops;
+};
+
+static void virt_cpufreq_set_freq(struct cpufreq_policy *policy, u32 freq)
+{
+	struct virt_cpufreq_drv_data *data = policy->driver_data;
+
+	writel_relaxed(freq, data->base + policy->cpu * PER_CPU_OFFSET
+			+ REG_SET_FREQ_OFFSET);
+}
+
+static u32 virt_cpufreq_get_freq(struct cpufreq_policy *policy)
+{
+	struct virt_cpufreq_drv_data *data = policy->driver_data;
+
+	return readl_relaxed(data->base + policy->cpu * PER_CPU_OFFSET
+			+ REG_CUR_FREQ_OFFSET);
+}
+
+static const struct virt_cpufreq_ops virt_freq_ops = {
+	.set_freq = virt_cpufreq_set_freq,
+	.get_freq = virt_cpufreq_get_freq,
+};
+
+static void virt_scale_freq_tick(void)
+{
+	struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id());
+	struct virt_cpufreq_drv_data *data = policy->driver_data;
+	u32 max_freq = (u32)policy->cpuinfo.max_freq;
+	u64 cur_freq;
+	u64 scale;
+
+	cpufreq_cpu_put(policy);
+
+	cur_freq = (u64)data->ops->get_freq(policy);
+	cur_freq <<= SCHED_CAPACITY_SHIFT;
+	scale = div_u64(cur_freq, max_freq);
+
+	this_cpu_write(arch_freq_scale, (unsigned long)scale);
+}
+
+static struct scale_freq_data virt_sfd = {
+	.source = SCALE_FREQ_SOURCE_VIRT,
+	.set_freq_scale = virt_scale_freq_tick,
+};
+
+static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
+{
+	struct virt_cpufreq_drv_data *data = policy->driver_data;
+	/*
+	 * Use cached frequency to avoid rounding to freq table entries
+	 * and undo 25% frequency boost applied by schedutil.
+	 */
+	u32 freq = mult_frac(policy->cached_target_freq, 80, 100);
+
+	data->ops->set_freq(policy, freq);
+	return 0;
+}
+
+static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
+		unsigned int target_freq)
+{
+	virt_cpufreq_set_perf(policy);
+	return target_freq;
+}
+
+static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
+		unsigned int index)
+{
+	return virt_cpufreq_set_perf(policy);
+}
+
+static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+	struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data();
+	struct cpufreq_frequency_table *table;
+	struct device *cpu_dev;
+	int ret;
+
+	cpu_dev = get_cpu_device(policy->cpu);
+	if (!cpu_dev)
+		return -ENODEV;
+
+	ret = dev_pm_opp_of_add_table(cpu_dev);
+	if (ret)
+		return ret;
+
+	ret = dev_pm_opp_get_opp_count(cpu_dev);
+	if (ret <= 0) {
+		dev_err(cpu_dev, "OPP table can't be empty\n");
+		return -ENODEV;
+	}
+
+	ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
+	if (ret) {
+		dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
+		return ret;
+	}
+
+	policy->freq_table = table;
+	policy->dvfs_possible_from_any_cpu = false;
+	policy->fast_switch_possible = true;
+	policy->driver_data = drv_data;
+
+	/*
+	 * Only takes effect if another FIE source such as AMUs
+	 * have not been registered.
+	 */
+	topology_set_scale_freq_source(&virt_sfd, policy->cpus);
+
+	return 0;
+
+}
+
+static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus);
+	kfree(policy->freq_table);
+	policy->freq_table = NULL;
+	return 0;
+}
+
+static int virt_cpufreq_online(struct cpufreq_policy *policy)
+{
+	/* Nothing to restore. */
+	return 0;
+}
+
+static int virt_cpufreq_offline(struct cpufreq_policy *policy)
+{
+	/* Dummy offline() to avoid exit() being called and freeing resources. */
+	return 0;
+}
+
+static struct cpufreq_driver cpufreq_virt_driver = {
+	.name		= "virt-cpufreq",
+	.init		= virt_cpufreq_cpu_init,
+	.exit		= virt_cpufreq_cpu_exit,
+	.online         = virt_cpufreq_online,
+	.offline        = virt_cpufreq_offline,
+	.verify		= cpufreq_generic_frequency_table_verify,
+	.target_index	= virt_cpufreq_target_index,
+	.fast_switch	= virt_cpufreq_fast_switch,
+	.attr		= cpufreq_generic_attr,
+};
+
+static int virt_cpufreq_driver_probe(struct platform_device *pdev)
+{
+	int ret;
+	struct virt_cpufreq_drv_data *drv_data;
+
+	drv_data = devm_kzalloc(&pdev->dev, sizeof(*drv_data), GFP_KERNEL);
+	if (!drv_data)
+		return -ENOMEM;
+
+	drv_data->ops = of_device_get_match_data(&pdev->dev);
+	if (!drv_data->ops)
+		return -EINVAL;
+
+	drv_data->base = devm_platform_ioremap_resource(pdev, 0);
+	if (IS_ERR(drv_data->base))
+		return PTR_ERR(drv_data->base);
+
+	cpufreq_virt_driver.driver_data = drv_data;
+
+	ret = cpufreq_register_driver(&cpufreq_virt_driver);
+	if (ret) {
+		dev_err(&pdev->dev, "Virtual CPUFreq driver failed to register: %d\n", ret);
+		return ret;
+	}
+
+	dev_dbg(&pdev->dev, "Virtual CPUFreq driver initialized\n");
+	return 0;
+}
+
+static int virt_cpufreq_driver_remove(struct platform_device *pdev)
+{
+	cpufreq_unregister_driver(&cpufreq_virt_driver);
+	return 0;
+}
+
+static const struct of_device_id virt_cpufreq_match[] = {
+	{ .compatible = "virtual,cpufreq", .data = &virt_freq_ops},
+	{}
+};
+MODULE_DEVICE_TABLE(of, virt_cpufreq_match);
+
+static struct platform_driver virt_cpufreq_driver = {
+	.probe = virt_cpufreq_driver_probe,
+	.remove = virt_cpufreq_driver_remove,
+	.driver = {
+		.name = "virt-cpufreq",
+		.of_match_table = virt_cpufreq_match,
+	},
+};
+
+static int __init virt_cpufreq_init(void)
+{
+	return platform_driver_register(&virt_cpufreq_driver);
+}
+postcore_initcall(virt_cpufreq_init);
+
+static void __exit virt_cpufreq_exit(void)
+{
+	platform_driver_unregister(&virt_cpufreq_driver);
+}
+module_exit(virt_cpufreq_exit);
+
+MODULE_DESCRIPTION("Virtual cpufreq driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index a07b510e7dc5..888282dce2ba 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -42,6 +42,7 @@ enum scale_freq_source {
 	SCALE_FREQ_SOURCE_CPUFREQ = 0,
 	SCALE_FREQ_SOURCE_ARCH,
 	SCALE_FREQ_SOURCE_CPPC,
+	SCALE_FREQ_SOURCE_VIRT,
 };
 
 struct scale_freq_data {
-- 
2.41.0.585.gd2178a4bd4-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 1/2] dt-bindings: cpufreq: add bindings for virtual cpufreq
  2023-07-31 17:46 ` [PATCH v3 1/2] dt-bindings: cpufreq: add bindings for virtual cpufreq David Dai
@ 2023-07-31 18:12   ` Rob Herring
  2023-08-05 19:38   ` Krzysztof Kozlowski
  1 sibling, 0 replies; 23+ messages in thread
From: Rob Herring @ 2023-07-31 18:12 UTC (permalink / raw)
  To: David Dai
  Cc: Conor Dooley, Quentin Perret, linux-kernel, Viresh Kumar,
	Sudeep Holla, Rob Herring, devicetree, Rafael J. Wysocki,
	Will Deacon, Masami Hiramatsu, Krzysztof Kozlowski, Oliver Upton,
	Mel Gorman, Dietmar Eggemann, Saravana Kannan, Gupta Pankaj,
	kernel-team, Vincent Guittot, Pavan Kondeti, linux-pm,
	Marc Zyngier, Peter Zijlstra


On Mon, 31 Jul 2023 10:46:08 -0700, David Dai wrote:
> Adding bindings to represent a virtual cpufreq device.
> 
> Virtual machines may expose MMIO regions for a virtual cpufreq device for
> guests to read frequency information or to request frequency selection. The
> virtual cpufreq device has an individual controller for each CPU.
> 
> Co-developed-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: David Dai <davidai@google.com>
> ---
>  .../bindings/cpufreq/cpufreq-virtual.yaml     | 89 +++++++++++++++++++
>  1 file changed, 89 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml
> 

My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check'
on your patch (DT_CHECKER_FLAGS is new in v5.13):

yamllint warnings/errors:

dtschema/dtc warnings/errors:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/jsonschema/validators.py", line 909, in resolve_from_url
    document = self.store[url]
  File "/usr/local/lib/python3.10/dist-packages/jsonschema/_utils.py", line 28, in __getitem__
    return self.store[self.normalize(uri)]
KeyError: 'http://devicetree.org/meta-schemas/core.yamll'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/jsonschema/validators.py", line 912, in resolve_from_url
    document = self.resolve_remote(url)
  File "/usr/local/lib/python3.10/dist-packages/jsonschema/validators.py", line 1011, in resolve_remote
    result = self.handlers[scheme](uri)
  File "/usr/local/lib/python3.10/dist-packages/dtschema/schema.py", line 91, in http_handler
    raise RefResolutionError('Error in referenced schema matching $id: ' + uri)
jsonschema.exceptions.RefResolutionError: Error in referenced schema matching $id: http://devicetree.org/meta-schemas/core.yamll

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/dt-doc-validate", line 64, in <module>
    ret |= check_doc(f)
  File "/usr/local/bin/dt-doc-validate", line 32, in check_doc
    for error in sorted(dtsch.iter_errors(), key=lambda e: e.linecol):
  File "/usr/local/lib/python3.10/dist-packages/dtschema/schema.py", line 130, in iter_errors
    meta_schema = self.resolver.resolve_from_url(self['$schema'])
  File "/usr/local/lib/python3.10/dist-packages/jsonschema/validators.py", line 914, in resolve_from_url
    raise exceptions.RefResolutionError(exc)
jsonschema.exceptions.RefResolutionError: Error in referenced schema matching $id: http://devicetree.org/meta-schemas/core.yamll
Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.example.dts:69.19-72.13: Warning (unit_address_vs_reg): /example-0/soc/cpufreq: node has a reg or ranges property, but no unit name
Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.example.dtb: /example-0/cpus/cpu@0: failed to match any schema with compatible: ['arm,arm-v8']
Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.example.dtb: /example-0/cpus/cpu@1: failed to match any schema with compatible: ['arm,arm-v8']

doc reference errors (make refcheckdocs):

See https://patchwork.ozlabs.org/project/devicetree-bindings/patch/20230731174613.4133167-2-davidai@google.com

The base for the series is generally the latest rc1. A different dependency
should be noted in *this* patch.

If you already ran 'make dt_binding_check' and didn't see the above
error(s), then make sure 'yamllint' is installed and dt-schema is up to
date:

pip3 install dtschema --upgrade

Please check and re-submit after running the above command yourself. Note
that DT_SCHEMA_FILES can be set to your schema file to speed up checking
your schema. However, it must be unset to test all examples with your schema.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-07-31 17:46 ` [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver David Dai
@ 2023-07-31 22:02   ` Randy Dunlap
  2023-07-31 23:46     ` David Dai
  2023-08-01  9:36   ` Viresh Kumar
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Randy Dunlap @ 2023-07-31 22:02 UTC (permalink / raw)
  To: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan
  Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel



On 7/31/23 10:46, David Dai wrote:
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index f429b9b37b76..3977ca796747 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -217,6 +217,21 @@ config CPUFREQ_DT
>  
>  	  If in doubt, say N.
>  
> +config CPUFREQ_VIRT
> +        tristate "Virtual cpufreq driver"
> +	depends on OF
> +	select PM_OPP
> +        help

The 4 lines above should be indented with one tab (not 8 spaces).

> +	  This adds a virtualized cpufreq driver for guest kernels that
> +	  read/writes to a MMIO region for a virtualized cpufreq device to

	  reads/writes to an MMIO region

> +	  communicate with the host. It sends frequency updates to the host
> +	  which gets used as a hint to schedule vCPU threads and select CPU
> +	  frequency. If a VM does not support a virtualized FIE such as AMUs,
> +	  it updates the frequency scaling factor by polling host CPU frequency
> +	  to enable accurate Per-Entity Load Tracking for tasks running in the guest.
> +
> +	  If in doubt, say N.

-- 
~Randy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-07-31 22:02   ` Randy Dunlap
@ 2023-07-31 23:46     ` David Dai
  0 siblings, 0 replies; 23+ messages in thread
From: David Dai @ 2023-07-31 23:46 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan,
	Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

Hi Randy,

Thanks for reviewing,

On Mon, Jul 31, 2023 at 3:02 PM Randy Dunlap <rdunlap@infradead.org> wrote:
>
>
>
> On 7/31/23 10:46, David Dai wrote:
> > diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> > index f429b9b37b76..3977ca796747 100644
> > --- a/drivers/cpufreq/Kconfig
> > +++ b/drivers/cpufreq/Kconfig
> > @@ -217,6 +217,21 @@ config CPUFREQ_DT
> >
> >         If in doubt, say N.
> >
> > +config CPUFREQ_VIRT
> > +        tristate "Virtual cpufreq driver"
> > +     depends on OF
> > +     select PM_OPP
> > +        help
>
> The 4 lines above should be indented with one tab (not 8 spaces).

Ok.

>
> > +       This adds a virtualized cpufreq driver for guest kernels that
> > +       read/writes to a MMIO region for a virtualized cpufreq device to
>
>           reads/writes to an MMIO region

Will fix these, thanks!
David

>
> > +       communicate with the host. It sends frequency updates to the host
> > +       which gets used as a hint to schedule vCPU threads and select CPU
> > +       frequency. If a VM does not support a virtualized FIE such as AMUs,
> > +       it updates the frequency scaling factor by polling host CPU frequency
> > +       to enable accurate Per-Entity Load Tracking for tasks running in the guest.
> > +
> > +       If in doubt, say N.
>
> --
> ~Randy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-07-31 17:46 ` [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver David Dai
  2023-07-31 22:02   ` Randy Dunlap
@ 2023-08-01  9:36   ` Viresh Kumar
  2023-08-02 22:16     ` Saravana Kannan
  2023-08-03 16:50     ` David Dai
  2023-08-01  9:45   ` Quentin Perret
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Viresh Kumar @ 2023-08-01  9:36 UTC (permalink / raw)
  To: David Dai
  Cc: Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Sudeep Holla, Saravana Kannan, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On 31-07-23, 10:46, David Dai wrote:
> diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
> new file mode 100644
> index 000000000000..66b0fd9b821c
> --- /dev/null
> +++ b/drivers/cpufreq/virtual-cpufreq.c
> @@ -0,0 +1,237 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 Google LLC
> + */
> +
> +#include <linux/arch_topology.h>
> +#include <linux/cpufreq.h>
> +#include <linux/init.h>
> +#include <linux/sched.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/of_address.h>
> +#include <linux/of_platform.h>
> +#include <linux/pm_opp.h>
> +#include <linux/slab.h>
> +
> +#define REG_CUR_FREQ_OFFSET 0x0
> +#define REG_SET_FREQ_OFFSET 0x4
> +#define PER_CPU_OFFSET 0x8
> +
> +struct virt_cpufreq_ops {
> +	void (*set_freq)(struct cpufreq_policy *policy, u32 freq);
> +	u32 (*get_freq)(struct cpufreq_policy *policy);
> +};

Since you only have one implementation currently, this isn't really
required. Keep the data as NULL in `virt_cpufreq_match` and use
writel/readl directly.

This can be updated if we need more implementations later.

> +struct virt_cpufreq_drv_data {
> +	void __iomem *base;
> +	const struct virt_cpufreq_ops *ops;
> +};
> +
> +static void virt_cpufreq_set_freq(struct cpufreq_policy *policy, u32 freq)
> +{
> +	struct virt_cpufreq_drv_data *data = policy->driver_data;
> +
> +	writel_relaxed(freq, data->base + policy->cpu * PER_CPU_OFFSET
> +			+ REG_SET_FREQ_OFFSET);
> +}
> +
> +static u32 virt_cpufreq_get_freq(struct cpufreq_policy *policy)
> +{
> +	struct virt_cpufreq_drv_data *data = policy->driver_data;
> +
> +	return readl_relaxed(data->base + policy->cpu * PER_CPU_OFFSET
> +			+ REG_CUR_FREQ_OFFSET);

This doesn't look properly aligned. Please run checkpatch (--strict
(optional)).

> +}
> +
> +static const struct virt_cpufreq_ops virt_freq_ops = {
> +	.set_freq = virt_cpufreq_set_freq,
> +	.get_freq = virt_cpufreq_get_freq,
> +};
> +
> +static void virt_scale_freq_tick(void)
> +{
> +	struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id());
> +	struct virt_cpufreq_drv_data *data = policy->driver_data;
> +	u32 max_freq = (u32)policy->cpuinfo.max_freq;
> +	u64 cur_freq;
> +	u64 scale;
> +
> +	cpufreq_cpu_put(policy);
> +
> +	cur_freq = (u64)data->ops->get_freq(policy);
> +	cur_freq <<= SCHED_CAPACITY_SHIFT;
> +	scale = div_u64(cur_freq, max_freq);
> +
> +	this_cpu_write(arch_freq_scale, (unsigned long)scale);
> +}
> +
> +static struct scale_freq_data virt_sfd = {
> +	.source = SCALE_FREQ_SOURCE_VIRT,
> +	.set_freq_scale = virt_scale_freq_tick,
> +};
> +
> +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> +{
> +	struct virt_cpufreq_drv_data *data = policy->driver_data;
> +	/*
> +	 * Use cached frequency to avoid rounding to freq table entries
> +	 * and undo 25% frequency boost applied by schedutil.
> +	 */
> +	u32 freq = mult_frac(policy->cached_target_freq, 80, 100);
> +
> +	data->ops->set_freq(policy, freq);
> +	return 0;
> +}
> +
> +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
> +		unsigned int target_freq)
> +{
> +	virt_cpufreq_set_perf(policy);
> +	return target_freq;
> +}
> +
> +static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
> +		unsigned int index)
> +{
> +	return virt_cpufreq_set_perf(policy);
> +}
> +
> +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> +	struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data();
> +	struct cpufreq_frequency_table *table;
> +	struct device *cpu_dev;
> +	int ret;
> +
> +	cpu_dev = get_cpu_device(policy->cpu);
> +	if (!cpu_dev)
> +		return -ENODEV;
> +
> +	ret = dev_pm_opp_of_add_table(cpu_dev);
> +	if (ret)
> +		return ret;
> +
> +	ret = dev_pm_opp_get_opp_count(cpu_dev);
> +	if (ret <= 0) {
> +		dev_err(cpu_dev, "OPP table can't be empty\n");
> +		return -ENODEV;
> +	}
> +
> +	ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
> +	if (ret) {
> +		dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
> +		return ret;
> +	}
> +
> +	policy->freq_table = table;
> +	policy->dvfs_possible_from_any_cpu = false;

Why can't we call virt_cpufreq_target_index() from any CPU ?

> +	policy->fast_switch_possible = true;
> +	policy->driver_data = drv_data;
> +
> +	/*
> +	 * Only takes effect if another FIE source such as AMUs
> +	 * have not been registered.
> +	 */
> +	topology_set_scale_freq_source(&virt_sfd, policy->cpus);
> +
> +	return 0;
> +
> +}
> +
> +static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> +{
> +	topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus);
> +	kfree(policy->freq_table);
> +	policy->freq_table = NULL;
> +	return 0;
> +}
> +
> +static int virt_cpufreq_online(struct cpufreq_policy *policy)
> +{
> +	/* Nothing to restore. */
> +	return 0;
> +}
> +
> +static int virt_cpufreq_offline(struct cpufreq_policy *policy)
> +{
> +	/* Dummy offline() to avoid exit() being called and freeing resources. */
> +	return 0;
> +}
> +
> +static struct cpufreq_driver cpufreq_virt_driver = {
> +	.name		= "virt-cpufreq",
> +	.init		= virt_cpufreq_cpu_init,
> +	.exit		= virt_cpufreq_cpu_exit,
> +	.online         = virt_cpufreq_online,
> +	.offline        = virt_cpufreq_offline,
> +	.verify		= cpufreq_generic_frequency_table_verify,
> +	.target_index	= virt_cpufreq_target_index,
> +	.fast_switch	= virt_cpufreq_fast_switch,
> +	.attr		= cpufreq_generic_attr,
> +};
> +
> +static int virt_cpufreq_driver_probe(struct platform_device *pdev)
> +{
> +	int ret;
> +	struct virt_cpufreq_drv_data *drv_data;
> +
> +	drv_data = devm_kzalloc(&pdev->dev, sizeof(*drv_data), GFP_KERNEL);
> +	if (!drv_data)
> +		return -ENOMEM;
> +
> +	drv_data->ops = of_device_get_match_data(&pdev->dev);
> +	if (!drv_data->ops)
> +		return -EINVAL;
> +
> +	drv_data->base = devm_platform_ioremap_resource(pdev, 0);
> +	if (IS_ERR(drv_data->base))
> +		return PTR_ERR(drv_data->base);
> +
> +	cpufreq_virt_driver.driver_data = drv_data;
> +
> +	ret = cpufreq_register_driver(&cpufreq_virt_driver);
> +	if (ret) {
> +		dev_err(&pdev->dev, "Virtual CPUFreq driver failed to register: %d\n", ret);
> +		return ret;
> +	}
> +
> +	dev_dbg(&pdev->dev, "Virtual CPUFreq driver initialized\n");
> +	return 0;
> +}
> +
> +static int virt_cpufreq_driver_remove(struct platform_device *pdev)
> +{
> +	cpufreq_unregister_driver(&cpufreq_virt_driver);
> +	return 0;
> +}
> +
> +static const struct of_device_id virt_cpufreq_match[] = {
> +	{ .compatible = "virtual,cpufreq", .data = &virt_freq_ops},
> +	{}
> +};
> +MODULE_DEVICE_TABLE(of, virt_cpufreq_match);
> +
> +static struct platform_driver virt_cpufreq_driver = {
> +	.probe = virt_cpufreq_driver_probe,
> +	.remove = virt_cpufreq_driver_remove,
> +	.driver = {
> +		.name = "virt-cpufreq",
> +		.of_match_table = virt_cpufreq_match,
> +	},
> +};
> +
> +static int __init virt_cpufreq_init(void)
> +{
> +	return platform_driver_register(&virt_cpufreq_driver);
> +}
> +postcore_initcall(virt_cpufreq_init);

Why do you want to use this and not module_init() ? Then you can
simply use `module_platform_driver()`.

> +
> +static void __exit virt_cpufreq_exit(void)
> +{
> +	platform_driver_unregister(&virt_cpufreq_driver);
> +}
> +module_exit(virt_cpufreq_exit);
> +
> +MODULE_DESCRIPTION("Virtual cpufreq driver");
> +MODULE_LICENSE("GPL");

-- 
viresh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-07-31 17:46 ` [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver David Dai
  2023-07-31 22:02   ` Randy Dunlap
  2023-08-01  9:36   ` Viresh Kumar
@ 2023-08-01  9:45   ` Quentin Perret
  2023-08-01  9:49     ` Quentin Perret
  2023-08-04 22:23     ` Saravana Kannan
  2023-08-03  4:18   ` Pavan Kondeti
  2023-08-12  2:55   ` kernel test robot
  4 siblings, 2 replies; 23+ messages in thread
From: Quentin Perret @ 2023-08-01  9:45 UTC (permalink / raw)
  To: David Dai
  Cc: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

Hi David,

On Monday 31 Jul 2023 at 10:46:09 (-0700), David Dai wrote:
> +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> +{
> +	struct virt_cpufreq_drv_data *data = policy->driver_data;
> +	/*
> +	 * Use cached frequency to avoid rounding to freq table entries
> +	 * and undo 25% frequency boost applied by schedutil.
> +	 */

The VMM would be a better place for this scaling I think, the driver
can't/shouldn't make assumptions about the governor it is running with
given that this is a guest userspace decision essentially.

IIRC the fast_switch() path is only used by schedutil, so one could
probably make a case to scale things there, but it'd be inconsistent
with the "slow" switch case, and would create a fragile dependency, so
it's probably not worth pursuing.

> +	u32 freq = mult_frac(policy->cached_target_freq, 80, 100);
> +
> +	data->ops->set_freq(policy, freq);
> +	return 0;
> +}

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-08-01  9:45   ` Quentin Perret
@ 2023-08-01  9:49     ` Quentin Perret
  2023-08-04 22:23     ` Saravana Kannan
  1 sibling, 0 replies; 23+ messages in thread
From: Quentin Perret @ 2023-08-01  9:49 UTC (permalink / raw)
  To: David Dai
  Cc: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On Tuesday 01 Aug 2023 at 09:45:22 (+0000), Quentin Perret wrote:
> Hi David,
> 
> On Monday 31 Jul 2023 at 10:46:09 (-0700), David Dai wrote:
> > +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> > +{
> > +	struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +	/*
> > +	 * Use cached frequency to avoid rounding to freq table entries
> > +	 * and undo 25% frequency boost applied by schedutil.
> > +	 */
> 
> The VMM would be a better place for this scaling I think, the driver
> can't/shouldn't make assumptions about the governor it is running with
> given that this is a guest userspace decision essentially.
> 
> IIRC the fast_switch() path is only used by schedutil, so one could
> probably make a case to scale things there, but it'd be inconsistent
> with the "slow" switch case, and would create a fragile dependency, so
> it's probably not worth pursuing.

Alternatively we could make the schedutil margin configurable via the
cmdline or something along those lines, so we can set it to 0 in the
guest and avoid the issue entirely.

Some partners have been asking for this IIRC , so I suspect there would
be interest from other parties.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-08-01  9:36   ` Viresh Kumar
@ 2023-08-02 22:16     ` Saravana Kannan
  2023-08-03  5:51       ` Viresh Kumar
  2023-08-03 16:50     ` David Dai
  1 sibling, 1 reply; 23+ messages in thread
From: Saravana Kannan @ 2023-08-02 22:16 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: David Dai, Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Sudeep Holla, Quentin Perret, Masami Hiramatsu,
	Will Deacon, Peter Zijlstra, Vincent Guittot, Marc Zyngier,
	Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
	Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

On Tue, Aug 1, 2023 at 2:36 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 31-07-23, 10:46, David Dai wrote:
> > diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
> > new file mode 100644
> > index 000000000000..66b0fd9b821c
> > --- /dev/null
> > +++ b/drivers/cpufreq/virtual-cpufreq.c
> > @@ -0,0 +1,237 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2023 Google LLC
> > + */
> > +
> > +#include <linux/arch_topology.h>
> > +#include <linux/cpufreq.h>
> > +#include <linux/init.h>
> > +#include <linux/sched.h>
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/of_address.h>
> > +#include <linux/of_platform.h>
> > +#include <linux/pm_opp.h>
> > +#include <linux/slab.h>
> > +
> > +#define REG_CUR_FREQ_OFFSET 0x0
> > +#define REG_SET_FREQ_OFFSET 0x4
> > +#define PER_CPU_OFFSET 0x8
> > +
> > +struct virt_cpufreq_ops {
> > +     void (*set_freq)(struct cpufreq_policy *policy, u32 freq);
> > +     u32 (*get_freq)(struct cpufreq_policy *policy);
> > +};
>
> Since you only have one implementation currently, this isn't really
> required. Keep the data as NULL in `virt_cpufreq_match` and use
> writel/readl directly.
>
> This can be updated if we need more implementations later.
>
> > +struct virt_cpufreq_drv_data {
> > +     void __iomem *base;
> > +     const struct virt_cpufreq_ops *ops;
> > +};
> > +
> > +static void virt_cpufreq_set_freq(struct cpufreq_policy *policy, u32 freq)
> > +{
> > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +
> > +     writel_relaxed(freq, data->base + policy->cpu * PER_CPU_OFFSET
> > +                     + REG_SET_FREQ_OFFSET);
> > +}
> > +
> > +static u32 virt_cpufreq_get_freq(struct cpufreq_policy *policy)
> > +{
> > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +
> > +     return readl_relaxed(data->base + policy->cpu * PER_CPU_OFFSET
> > +                     + REG_CUR_FREQ_OFFSET);
>
> This doesn't look properly aligned. Please run checkpatch (--strict
> (optional)).
>
> > +}
> > +
> > +static const struct virt_cpufreq_ops virt_freq_ops = {
> > +     .set_freq = virt_cpufreq_set_freq,
> > +     .get_freq = virt_cpufreq_get_freq,
> > +};
> > +
> > +static void virt_scale_freq_tick(void)
> > +{
> > +     struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id());
> > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +     u32 max_freq = (u32)policy->cpuinfo.max_freq;
> > +     u64 cur_freq;
> > +     u64 scale;
> > +
> > +     cpufreq_cpu_put(policy);
> > +
> > +     cur_freq = (u64)data->ops->get_freq(policy);
> > +     cur_freq <<= SCHED_CAPACITY_SHIFT;
> > +     scale = div_u64(cur_freq, max_freq);
> > +
> > +     this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > +}
> > +
> > +static struct scale_freq_data virt_sfd = {
> > +     .source = SCALE_FREQ_SOURCE_VIRT,
> > +     .set_freq_scale = virt_scale_freq_tick,
> > +};
> > +
> > +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> > +{
> > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +     /*
> > +      * Use cached frequency to avoid rounding to freq table entries
> > +      * and undo 25% frequency boost applied by schedutil.
> > +      */
> > +     u32 freq = mult_frac(policy->cached_target_freq, 80, 100);
> > +
> > +     data->ops->set_freq(policy, freq);
> > +     return 0;
> > +}
> > +
> > +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > +             unsigned int target_freq)
> > +{
> > +     virt_cpufreq_set_perf(policy);
> > +     return target_freq;
> > +}
> > +
> > +static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
> > +             unsigned int index)
> > +{
> > +     return virt_cpufreq_set_perf(policy);
> > +}
> > +
> > +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > +{
> > +     struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data();
> > +     struct cpufreq_frequency_table *table;
> > +     struct device *cpu_dev;
> > +     int ret;
> > +
> > +     cpu_dev = get_cpu_device(policy->cpu);
> > +     if (!cpu_dev)
> > +             return -ENODEV;
> > +
> > +     ret = dev_pm_opp_of_add_table(cpu_dev);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = dev_pm_opp_get_opp_count(cpu_dev);
> > +     if (ret <= 0) {
> > +             dev_err(cpu_dev, "OPP table can't be empty\n");
> > +             return -ENODEV;
> > +     }
> > +
> > +     ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
> > +     if (ret) {
> > +             dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
> > +             return ret;
> > +     }
> > +
> > +     policy->freq_table = table;
> > +     policy->dvfs_possible_from_any_cpu = false;
>
> Why can't we call virt_cpufreq_target_index() from any CPU ?

This is mainly an optimization to reduce the latency of the "frequency
change" which has a huge impact on the performance (as can be seen
from the numbers in the cover letter).

Setting this flag means that the vCPU thread triggering the MMIO
handling (on the host side) is the thread on which the host needs to
apply any uclamp settings. So this avoids the VMM having to look up
the right vCPU thread that corresponds to this CPU, and any
permissions issues wrt setting another threads uclamp, etc. This
becomes even more important if/when BPF support is added for handling
simple MMIO read/writes. Will Deacon has been working on the eBPF
part[1] and IIUC, not setting this flag adds a lot of extra overhead
on the BPF side.

So, yeah, this flag is very helpful wrt reducing latency/simplifying
host side implementation and that's why we want it here.

[1] - https://kvm-forum.qemu.org/2023/talk/AZKC77/

Thanks,
Saravana

>
> > +     policy->fast_switch_possible = true;
> > +     policy->driver_data = drv_data;
> > +
> > +     /*
> > +      * Only takes effect if another FIE source such as AMUs
> > +      * have not been registered.
> > +      */
> > +     topology_set_scale_freq_source(&virt_sfd, policy->cpus);
> > +
> > +     return 0;
> > +
> > +}
> > +
> > +static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> > +{
> > +     topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus);
> > +     kfree(policy->freq_table);
> > +     policy->freq_table = NULL;
> > +     return 0;
> > +}
> > +
> > +static int virt_cpufreq_online(struct cpufreq_policy *policy)
> > +{
> > +     /* Nothing to restore. */
> > +     return 0;
> > +}
> > +
> > +static int virt_cpufreq_offline(struct cpufreq_policy *policy)
> > +{
> > +     /* Dummy offline() to avoid exit() being called and freeing resources. */
> > +     return 0;
> > +}
> > +
> > +static struct cpufreq_driver cpufreq_virt_driver = {
> > +     .name           = "virt-cpufreq",
> > +     .init           = virt_cpufreq_cpu_init,
> > +     .exit           = virt_cpufreq_cpu_exit,
> > +     .online         = virt_cpufreq_online,
> > +     .offline        = virt_cpufreq_offline,
> > +     .verify         = cpufreq_generic_frequency_table_verify,
> > +     .target_index   = virt_cpufreq_target_index,
> > +     .fast_switch    = virt_cpufreq_fast_switch,
> > +     .attr           = cpufreq_generic_attr,
> > +};
> > +
> > +static int virt_cpufreq_driver_probe(struct platform_device *pdev)
> > +{
> > +     int ret;
> > +     struct virt_cpufreq_drv_data *drv_data;
> > +
> > +     drv_data = devm_kzalloc(&pdev->dev, sizeof(*drv_data), GFP_KERNEL);
> > +     if (!drv_data)
> > +             return -ENOMEM;
> > +
> > +     drv_data->ops = of_device_get_match_data(&pdev->dev);
> > +     if (!drv_data->ops)
> > +             return -EINVAL;
> > +
> > +     drv_data->base = devm_platform_ioremap_resource(pdev, 0);
> > +     if (IS_ERR(drv_data->base))
> > +             return PTR_ERR(drv_data->base);
> > +
> > +     cpufreq_virt_driver.driver_data = drv_data;
> > +
> > +     ret = cpufreq_register_driver(&cpufreq_virt_driver);
> > +     if (ret) {
> > +             dev_err(&pdev->dev, "Virtual CPUFreq driver failed to register: %d\n", ret);
> > +             return ret;
> > +     }
> > +
> > +     dev_dbg(&pdev->dev, "Virtual CPUFreq driver initialized\n");
> > +     return 0;
> > +}
> > +
> > +static int virt_cpufreq_driver_remove(struct platform_device *pdev)
> > +{
> > +     cpufreq_unregister_driver(&cpufreq_virt_driver);
> > +     return 0;
> > +}
> > +
> > +static const struct of_device_id virt_cpufreq_match[] = {
> > +     { .compatible = "virtual,cpufreq", .data = &virt_freq_ops},
> > +     {}
> > +};
> > +MODULE_DEVICE_TABLE(of, virt_cpufreq_match);
> > +
> > +static struct platform_driver virt_cpufreq_driver = {
> > +     .probe = virt_cpufreq_driver_probe,
> > +     .remove = virt_cpufreq_driver_remove,
> > +     .driver = {
> > +             .name = "virt-cpufreq",
> > +             .of_match_table = virt_cpufreq_match,
> > +     },
> > +};
> > +
> > +static int __init virt_cpufreq_init(void)
> > +{
> > +     return platform_driver_register(&virt_cpufreq_driver);
> > +}
> > +postcore_initcall(virt_cpufreq_init);
>
> Why do you want to use this and not module_init() ? Then you can
> simply use `module_platform_driver()`.
>
> > +
> > +static void __exit virt_cpufreq_exit(void)
> > +{
> > +     platform_driver_unregister(&virt_cpufreq_driver);
> > +}
> > +module_exit(virt_cpufreq_exit);
> > +
> > +MODULE_DESCRIPTION("Virtual cpufreq driver");
> > +MODULE_LICENSE("GPL");
>
> --
> viresh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-07-31 17:46 ` [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver David Dai
                     ` (2 preceding siblings ...)
  2023-08-01  9:45   ` Quentin Perret
@ 2023-08-03  4:18   ` Pavan Kondeti
  2023-08-04 23:46     ` David Dai
  2023-08-12  2:55   ` kernel test robot
  4 siblings, 1 reply; 23+ messages in thread
From: Pavan Kondeti @ 2023-08-03  4:18 UTC (permalink / raw)
  To: David Dai
  Cc: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan,
	Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

On Mon, Jul 31, 2023 at 10:46:09AM -0700, David Dai wrote:
> Introduce a virtualized cpufreq driver for guest kernels to improve
> performance and power of workloads within VMs.
> 
> This driver does two main things:
> 
> 1. Sends the frequency of vCPUs as a hint to the host. The host uses the
> hint to schedule the vCPU threads and decide physical CPU frequency.
> 
> 2. If a VM does not support a virtualized FIE(like AMUs), it queries the
> host CPU frequency by reading a MMIO region of a virtual cpufreq device
> to update the guest's frequency scaling factor periodically. This enables
> accurate Per-Entity Load Tracking for tasks running in the guest.
> 
> Co-developed-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: David Dai <davidai@google.com>

[...]

> +static void virt_scale_freq_tick(void)
> +{
> +	struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id());
> +	struct virt_cpufreq_drv_data *data = policy->driver_data;
> +	u32 max_freq = (u32)policy->cpuinfo.max_freq;
> +	u64 cur_freq;
> +	u64 scale;
> +
> +	cpufreq_cpu_put(policy);
> +
> +	cur_freq = (u64)data->ops->get_freq(policy);
> +	cur_freq <<= SCHED_CAPACITY_SHIFT;
> +	scale = div_u64(cur_freq, max_freq);
> +
> +	this_cpu_write(arch_freq_scale, (unsigned long)scale);
> +}
> +

We expect the host to provide the frequency in kHz, can you please add a
comment about it. It is not very obvious when you look at the
REG_CUR_FREQ_OFFSET register name.

> +static struct scale_freq_data virt_sfd = {
> +	.source = SCALE_FREQ_SOURCE_VIRT,
> +	.set_freq_scale = virt_scale_freq_tick,
> +};
> +
> +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> +{
> +	struct virt_cpufreq_drv_data *data = policy->driver_data;
> +	/*
> +	 * Use cached frequency to avoid rounding to freq table entries
> +	 * and undo 25% frequency boost applied by schedutil.
> +	 */
> +	u32 freq = mult_frac(policy->cached_target_freq, 80, 100);
> +
> +	data->ops->set_freq(policy, freq);
> +	return 0;
> +}

Why do we undo the frequency boost? A governor may apply other boosts
like RT (uclamp), iowait. It is not clear why we need to worry about
governor policies here.

> +
> +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
> +		unsigned int target_freq)
> +{
> +	virt_cpufreq_set_perf(policy);
> +	return target_freq;
> +}
> +
> +static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
> +		unsigned int index)
> +{
> +	return virt_cpufreq_set_perf(policy);
> +}
> +
> +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> +	struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data();
> +	struct cpufreq_frequency_table *table;
> +	struct device *cpu_dev;
> +	int ret;
> +
> +	cpu_dev = get_cpu_device(policy->cpu);
> +	if (!cpu_dev)
> +		return -ENODEV;
> +
> +	ret = dev_pm_opp_of_add_table(cpu_dev);
> +	if (ret)
> +		return ret;
> +
> +	ret = dev_pm_opp_get_opp_count(cpu_dev);
> +	if (ret <= 0) {
> +		dev_err(cpu_dev, "OPP table can't be empty\n");
> +		return -ENODEV;
> +	}
> +
> +	ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
> +	if (ret) {
> +		dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
> +		return ret;
> +	}
> +
> +	policy->freq_table = table;
> +	policy->dvfs_possible_from_any_cpu = false;
> +	policy->fast_switch_possible = true;
> +	policy->driver_data = drv_data;
> +
> +	/*
> +	 * Only takes effect if another FIE source such as AMUs
> +	 * have not been registered.
> +	 */
> +	topology_set_scale_freq_source(&virt_sfd, policy->cpus);
> +
> +	return 0;
> +
> +}
> +

Do we need to register as FIE source even with the below commit? By
registering as a source, we are not supplying any accurate metric. We
still fallback on the same source that cpufreq implements it.

874f63531064 ("cpufreq: report whether cpufreq supports Frequency
Invariance (FI)")

Thanks,
Pavan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-08-02 22:16     ` Saravana Kannan
@ 2023-08-03  5:51       ` Viresh Kumar
  2023-08-04 22:24         ` Saravana Kannan
  0 siblings, 1 reply; 23+ messages in thread
From: Viresh Kumar @ 2023-08-03  5:51 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: David Dai, Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Sudeep Holla, Quentin Perret, Masami Hiramatsu,
	Will Deacon, Peter Zijlstra, Vincent Guittot, Marc Zyngier,
	Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
	Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

On 02-08-23, 15:16, Saravana Kannan wrote:
> This is mainly an optimization to reduce the latency of the "frequency
> change" which has a huge impact on the performance (as can be seen
> from the numbers in the cover letter).
> 
> Setting this flag means that the vCPU thread triggering the MMIO
> handling (on the host side) is the thread on which the host needs to
> apply any uclamp settings. So this avoids the VMM having to look up
> the right vCPU thread that corresponds to this CPU, and any
> permissions issues wrt setting another threads uclamp, etc. This
> becomes even more important if/when BPF support is added for handling
> simple MMIO read/writes. Will Deacon has been working on the eBPF
> part[1] and IIUC, not setting this flag adds a lot of extra overhead
> on the BPF side.
> 
> So, yeah, this flag is very helpful wrt reducing latency/simplifying
> host side implementation and that's why we want it here.
> 
> [1] - https://kvm-forum.qemu.org/2023/talk/AZKC77/

Would be good to have a (big) comment in the code explaining that as
it isn't obvious. Thanks.

-- 
viresh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-08-01  9:36   ` Viresh Kumar
  2023-08-02 22:16     ` Saravana Kannan
@ 2023-08-03 16:50     ` David Dai
  2023-08-04  4:42       ` Viresh Kumar
  1 sibling, 1 reply; 23+ messages in thread
From: David Dai @ 2023-08-03 16:50 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Sudeep Holla, Saravana Kannan, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

Hi Viresh,

Thanks for reviewing!

On Tue, Aug 1, 2023 at 2:36 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 31-07-23, 10:46, David Dai wrote:
> > diff --git a/drivers/cpufreq/virtual-cpufreq.c b/drivers/cpufreq/virtual-cpufreq.c
> > new file mode 100644
> > index 000000000000..66b0fd9b821c
> > --- /dev/null
> > +++ b/drivers/cpufreq/virtual-cpufreq.c
> > @@ -0,0 +1,237 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2023 Google LLC
> > + */
> > +
> > +#include <linux/arch_topology.h>
> > +#include <linux/cpufreq.h>
> > +#include <linux/init.h>
> > +#include <linux/sched.h>
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/of_address.h>
> > +#include <linux/of_platform.h>
> > +#include <linux/pm_opp.h>
> > +#include <linux/slab.h>
> > +
> > +#define REG_CUR_FREQ_OFFSET 0x0
> > +#define REG_SET_FREQ_OFFSET 0x4
> > +#define PER_CPU_OFFSET 0x8
> > +
> > +struct virt_cpufreq_ops {
> > +     void (*set_freq)(struct cpufreq_policy *policy, u32 freq);
> > +     u32 (*get_freq)(struct cpufreq_policy *policy);
> > +};
>
> Since you only have one implementation currently, this isn't really
> required. Keep the data as NULL in `virt_cpufreq_match` and use
> writel/readl directly.

Okay, I’ll remove the ops for now and bring it back in the future if required.

>
> This can be updated if we need more implementations later.
>
> > +struct virt_cpufreq_drv_data {
> > +     void __iomem *base;
> > +     const struct virt_cpufreq_ops *ops;
> > +};
> > +
> > +static void virt_cpufreq_set_freq(struct cpufreq_policy *policy, u32 freq)
> > +{
> > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +
> > +     writel_relaxed(freq, data->base + policy->cpu * PER_CPU_OFFSET
> > +                     + REG_SET_FREQ_OFFSET);
> > +}
> > +
> > +static u32 virt_cpufreq_get_freq(struct cpufreq_policy *policy)
> > +{
> > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +
> > +     return readl_relaxed(data->base + policy->cpu * PER_CPU_OFFSET
> > +                     + REG_CUR_FREQ_OFFSET);
>
> This doesn't look properly aligned. Please run checkpatch (--strict
> (optional)).

Ok.

>
> > +}
> > +
> > +static const struct virt_cpufreq_ops virt_freq_ops = {
> > +     .set_freq = virt_cpufreq_set_freq,
> > +     .get_freq = virt_cpufreq_get_freq,
> > +};
> > +
> > +static void virt_scale_freq_tick(void)
> > +{
> > +     struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id());
> > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +     u32 max_freq = (u32)policy->cpuinfo.max_freq;
> > +     u64 cur_freq;
> > +     u64 scale;
> > +
> > +     cpufreq_cpu_put(policy);
> > +
> > +     cur_freq = (u64)data->ops->get_freq(policy);
> > +     cur_freq <<= SCHED_CAPACITY_SHIFT;
> > +     scale = div_u64(cur_freq, max_freq);
> > +
> > +     this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > +}
> > +
> > +static struct scale_freq_data virt_sfd = {
> > +     .source = SCALE_FREQ_SOURCE_VIRT,
> > +     .set_freq_scale = virt_scale_freq_tick,
> > +};
> > +
> > +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> > +{
> > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +     /*
> > +      * Use cached frequency to avoid rounding to freq table entries
> > +      * and undo 25% frequency boost applied by schedutil.
> > +      */
> > +     u32 freq = mult_frac(policy->cached_target_freq, 80, 100);
> > +
> > +     data->ops->set_freq(policy, freq);
> > +     return 0;
> > +}
> > +
> > +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > +             unsigned int target_freq)
> > +{
> > +     virt_cpufreq_set_perf(policy);
> > +     return target_freq;
> > +}
> > +
> > +static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
> > +             unsigned int index)
> > +{
> > +     return virt_cpufreq_set_perf(policy);
> > +}
> > +
> > +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > +{
> > +     struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data();
> > +     struct cpufreq_frequency_table *table;
> > +     struct device *cpu_dev;
> > +     int ret;
> > +
> > +     cpu_dev = get_cpu_device(policy->cpu);
> > +     if (!cpu_dev)
> > +             return -ENODEV;
> > +
> > +     ret = dev_pm_opp_of_add_table(cpu_dev);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = dev_pm_opp_get_opp_count(cpu_dev);
> > +     if (ret <= 0) {
> > +             dev_err(cpu_dev, "OPP table can't be empty\n");
> > +             return -ENODEV;
> > +     }
> > +
> > +     ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
> > +     if (ret) {
> > +             dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
> > +             return ret;
> > +     }
> > +
> > +     policy->freq_table = table;
> > +     policy->dvfs_possible_from_any_cpu = false;
>
> Why can't we call virt_cpufreq_target_index() from any CPU ?
>
> > +     policy->fast_switch_possible = true;
> > +     policy->driver_data = drv_data;
> > +
> > +     /*
> > +      * Only takes effect if another FIE source such as AMUs
> > +      * have not been registered.
> > +      */
> > +     topology_set_scale_freq_source(&virt_sfd, policy->cpus);
> > +
> > +     return 0;
> > +
> > +}
> > +
> > +static int virt_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> > +{
> > +     topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_VIRT, policy->related_cpus);
> > +     kfree(policy->freq_table);
> > +     policy->freq_table = NULL;
> > +     return 0;
> > +}
> > +
> > +static int virt_cpufreq_online(struct cpufreq_policy *policy)
> > +{
> > +     /* Nothing to restore. */
> > +     return 0;
> > +}
> > +
> > +static int virt_cpufreq_offline(struct cpufreq_policy *policy)
> > +{
> > +     /* Dummy offline() to avoid exit() being called and freeing resources. */
> > +     return 0;
> > +}
> > +
> > +static struct cpufreq_driver cpufreq_virt_driver = {
> > +     .name           = "virt-cpufreq",
> > +     .init           = virt_cpufreq_cpu_init,
> > +     .exit           = virt_cpufreq_cpu_exit,
> > +     .online         = virt_cpufreq_online,
> > +     .offline        = virt_cpufreq_offline,
> > +     .verify         = cpufreq_generic_frequency_table_verify,
> > +     .target_index   = virt_cpufreq_target_index,
> > +     .fast_switch    = virt_cpufreq_fast_switch,
> > +     .attr           = cpufreq_generic_attr,
> > +};
> > +
> > +static int virt_cpufreq_driver_probe(struct platform_device *pdev)
> > +{
> > +     int ret;
> > +     struct virt_cpufreq_drv_data *drv_data;
> > +
> > +     drv_data = devm_kzalloc(&pdev->dev, sizeof(*drv_data), GFP_KERNEL);
> > +     if (!drv_data)
> > +             return -ENOMEM;
> > +
> > +     drv_data->ops = of_device_get_match_data(&pdev->dev);
> > +     if (!drv_data->ops)
> > +             return -EINVAL;
> > +
> > +     drv_data->base = devm_platform_ioremap_resource(pdev, 0);
> > +     if (IS_ERR(drv_data->base))
> > +             return PTR_ERR(drv_data->base);
> > +
> > +     cpufreq_virt_driver.driver_data = drv_data;
> > +
> > +     ret = cpufreq_register_driver(&cpufreq_virt_driver);
> > +     if (ret) {
> > +             dev_err(&pdev->dev, "Virtual CPUFreq driver failed to register: %d\n", ret);
> > +             return ret;
> > +     }
> > +
> > +     dev_dbg(&pdev->dev, "Virtual CPUFreq driver initialized\n");
> > +     return 0;
> > +}
> > +
> > +static int virt_cpufreq_driver_remove(struct platform_device *pdev)
> > +{
> > +     cpufreq_unregister_driver(&cpufreq_virt_driver);
> > +     return 0;
> > +}
> > +
> > +static const struct of_device_id virt_cpufreq_match[] = {
> > +     { .compatible = "virtual,cpufreq", .data = &virt_freq_ops},
> > +     {}
> > +};
> > +MODULE_DEVICE_TABLE(of, virt_cpufreq_match);
> > +
> > +static struct platform_driver virt_cpufreq_driver = {
> > +     .probe = virt_cpufreq_driver_probe,
> > +     .remove = virt_cpufreq_driver_remove,
> > +     .driver = {
> > +             .name = "virt-cpufreq",
> > +             .of_match_table = virt_cpufreq_match,
> > +     },
> > +};
> > +
> > +static int __init virt_cpufreq_init(void)
> > +{
> > +     return platform_driver_register(&virt_cpufreq_driver);
> > +}
> > +postcore_initcall(virt_cpufreq_init);
>
> Why do you want to use this and not module_init() ? Then you can
> simply use `module_platform_driver()`.

We found that using postcore_init over module_init results in a
small(2-3%) but measurable benefit during boot time for VMs, so this
is an optimization that I’d prefer to keep.

Thanks,
David

>
> > +
> > +static void __exit virt_cpufreq_exit(void)
> > +{
> > +     platform_driver_unregister(&virt_cpufreq_driver);
> > +}
> > +module_exit(virt_cpufreq_exit);
> > +
> > +MODULE_DESCRIPTION("Virtual cpufreq driver");
> > +MODULE_LICENSE("GPL");
>
> --
> viresh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-08-03 16:50     ` David Dai
@ 2023-08-04  4:42       ` Viresh Kumar
  0 siblings, 0 replies; 23+ messages in thread
From: Viresh Kumar @ 2023-08-04  4:42 UTC (permalink / raw)
  To: David Dai
  Cc: Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Sudeep Holla, Saravana Kannan, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On 03-08-23, 09:50, David Dai wrote:
> > Why do you want to use this and not module_init() ? Then you can
> > simply use `module_platform_driver()`.
> 
> We found that using postcore_init over module_init results in a
> small(2-3%) but measurable benefit during boot time for VMs, so this
> is an optimization that I’d prefer to keep.

Okay. That's what platforms normally do (kick in cpufreq support
earlier), so we can boot at a higher frequency. Just wasn't sure if it
matters for this driver too.

-- 
viresh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-08-01  9:45   ` Quentin Perret
  2023-08-01  9:49     ` Quentin Perret
@ 2023-08-04 22:23     ` Saravana Kannan
  1 sibling, 0 replies; 23+ messages in thread
From: Saravana Kannan @ 2023-08-04 22:23 UTC (permalink / raw)
  To: Quentin Perret
  Cc: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On Tue, Aug 1, 2023 at 2:45 AM Quentin Perret <qperret@google.com> wrote:
>
> Hi David,
>
> On Monday 31 Jul 2023 at 10:46:09 (-0700), David Dai wrote:
> > +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> > +{
> > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +     /*
> > +      * Use cached frequency to avoid rounding to freq table entries
> > +      * and undo 25% frequency boost applied by schedutil.
> > +      */
>
> The VMM would be a better place for this scaling I think, the driver
> can't/shouldn't make assumptions about the governor it is running with
> given that this is a guest userspace decision essentially.
>
> IIRC the fast_switch() path is only used by schedutil, so one could
> probably make a case to scale things there, but it'd be inconsistent
> with the "slow" switch case, and would create a fragile dependency, so
> it's probably not worth pursuing.

Thanks for the input Quentin!

David and I spend several hours over several days discussing this. We
were trying to think through and decide if we were really removing the
25% margin applied by the guest side schedutil or the host side
schedutil. We ran through different thought experiments on what would
happen if the guest used ondemand/conservative/performance/powersave
governors and what if in the future we had a configurable schedutil
margin.

We changed our opinions multiple times until we finally remembered
this goal from my original presentation[1]:

"On an idle host, running the use case in the host vs VM, should
result in close to identical DVFS behavior of the physical CPUs and
CPU selection for the threads."

For that statement to be true when the guest uses
ondemand/conservative governor, we have to remove the 25% margin
applied by the host side schedutil governor. Otherwise, running the
workload on the VM will result in frequencies 25% higher than running
the same load on the host with ondemand/conservative governor.

So, we finally concluded that we are really undoing the host side
schedutil margin. And in that case, it makes sense to undo this in the
VMM side. So, we'll go with your suggestion in this email instead of
making the schedutil margin to be 0 for the guest.

[1] - https://lpc.events/event/16/contributions/1195/attachments/970/1893/LPC%202022%20-%20VM%20DVFS.pdf

Thanks,
Saravana

>
> > +     u32 freq = mult_frac(policy->cached_target_freq, 80, 100);
> > +
> > +     data->ops->set_freq(policy, freq);
> > +     return 0;
> > +}
>
> Thanks,
> Quentin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-08-03  5:51       ` Viresh Kumar
@ 2023-08-04 22:24         ` Saravana Kannan
  0 siblings, 0 replies; 23+ messages in thread
From: Saravana Kannan @ 2023-08-04 22:24 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: David Dai, Rafael J. Wysocki, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Sudeep Holla, Quentin Perret, Masami Hiramatsu,
	Will Deacon, Peter Zijlstra, Vincent Guittot, Marc Zyngier,
	Oliver Upton, Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj,
	Mel Gorman, kernel-team, linux-pm, devicetree, linux-kernel

On Wed, Aug 2, 2023 at 10:52 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 02-08-23, 15:16, Saravana Kannan wrote:
> > This is mainly an optimization to reduce the latency of the "frequency
> > change" which has a huge impact on the performance (as can be seen
> > from the numbers in the cover letter).
> >
> > Setting this flag means that the vCPU thread triggering the MMIO
> > handling (on the host side) is the thread on which the host needs to
> > apply any uclamp settings. So this avoids the VMM having to look up
> > the right vCPU thread that corresponds to this CPU, and any
> > permissions issues wrt setting another threads uclamp, etc. This
> > becomes even more important if/when BPF support is added for handling
> > simple MMIO read/writes. Will Deacon has been working on the eBPF
> > part[1] and IIUC, not setting this flag adds a lot of extra overhead
> > on the BPF side.
> >
> > So, yeah, this flag is very helpful wrt reducing latency/simplifying
> > host side implementation and that's why we want it here.
> >
> > [1] - https://kvm-forum.qemu.org/2023/talk/AZKC77/
>
> Would be good to have a (big) comment in the code explaining that as
> it isn't obvious. Thanks.

Will do.

Thanks,
Saravana

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-08-03  4:18   ` Pavan Kondeti
@ 2023-08-04 23:46     ` David Dai
  2023-08-07  3:22       ` Pavan Kondeti
  0 siblings, 1 reply; 23+ messages in thread
From: David Dai @ 2023-08-04 23:46 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan,
	Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

Hi Pavan,

Thanks for reviewing!

On Wed, Aug 2, 2023 at 9:18 PM Pavan Kondeti <quic_pkondeti@quicinc.com> wrote:
>
> On Mon, Jul 31, 2023 at 10:46:09AM -0700, David Dai wrote:
> > Introduce a virtualized cpufreq driver for guest kernels to improve
> > performance and power of workloads within VMs.
> >
> > This driver does two main things:
> >
> > 1. Sends the frequency of vCPUs as a hint to the host. The host uses the
> > hint to schedule the vCPU threads and decide physical CPU frequency.
> >
> > 2. If a VM does not support a virtualized FIE(like AMUs), it queries the
> > host CPU frequency by reading a MMIO region of a virtual cpufreq device
> > to update the guest's frequency scaling factor periodically. This enables
> > accurate Per-Entity Load Tracking for tasks running in the guest.
> >
> > Co-developed-by: Saravana Kannan <saravanak@google.com>
> > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > Signed-off-by: David Dai <davidai@google.com>
>
> [...]
>
> > +static void virt_scale_freq_tick(void)
> > +{
> > +     struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id());
> > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +     u32 max_freq = (u32)policy->cpuinfo.max_freq;
> > +     u64 cur_freq;
> > +     u64 scale;
> > +
> > +     cpufreq_cpu_put(policy);
> > +
> > +     cur_freq = (u64)data->ops->get_freq(policy);
> > +     cur_freq <<= SCHED_CAPACITY_SHIFT;
> > +     scale = div_u64(cur_freq, max_freq);
> > +
> > +     this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > +}
> > +
>
> We expect the host to provide the frequency in kHz, can you please add a
> comment about it. It is not very obvious when you look at the
> REG_CUR_FREQ_OFFSET register name.

I’ll include a KHZ in the offset names.

>
> > +static struct scale_freq_data virt_sfd = {
> > +     .source = SCALE_FREQ_SOURCE_VIRT,
> > +     .set_freq_scale = virt_scale_freq_tick,
> > +};
> > +
> > +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> > +{
> > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > +     /*
> > +      * Use cached frequency to avoid rounding to freq table entries
> > +      * and undo 25% frequency boost applied by schedutil.
> > +      */
> > +     u32 freq = mult_frac(policy->cached_target_freq, 80, 100);
> > +
> > +     data->ops->set_freq(policy, freq);
> > +     return 0;
> > +}
>
> Why do we undo the frequency boost? A governor may apply other boosts
> like RT (uclamp), iowait. It is not clear why we need to worry about
> governor policies here.

See Saravana’s response to Quentin for more details, but in short,
we’ll remove this particular snippet in the driver.

>
> > +
> > +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > +             unsigned int target_freq)
> > +{
> > +     virt_cpufreq_set_perf(policy);
> > +     return target_freq;
> > +}
> > +
> > +static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
> > +             unsigned int index)
> > +{
> > +     return virt_cpufreq_set_perf(policy);
> > +}
> > +
> > +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > +{
> > +     struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data();
> > +     struct cpufreq_frequency_table *table;
> > +     struct device *cpu_dev;
> > +     int ret;
> > +
> > +     cpu_dev = get_cpu_device(policy->cpu);
> > +     if (!cpu_dev)
> > +             return -ENODEV;
> > +
> > +     ret = dev_pm_opp_of_add_table(cpu_dev);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = dev_pm_opp_get_opp_count(cpu_dev);
> > +     if (ret <= 0) {
> > +             dev_err(cpu_dev, "OPP table can't be empty\n");
> > +             return -ENODEV;
> > +     }
> > +
> > +     ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
> > +     if (ret) {
> > +             dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
> > +             return ret;
> > +     }
> > +
> > +     policy->freq_table = table;
> > +     policy->dvfs_possible_from_any_cpu = false;
> > +     policy->fast_switch_possible = true;
> > +     policy->driver_data = drv_data;
> > +
> > +     /*
> > +      * Only takes effect if another FIE source such as AMUs
> > +      * have not been registered.
> > +      */
> > +     topology_set_scale_freq_source(&virt_sfd, policy->cpus);
> > +
> > +     return 0;
> > +
> > +}
> > +
>
> Do we need to register as FIE source even with the below commit? By
> registering as a source, we are not supplying any accurate metric. We
> still fallback on the same source that cpufreq implements it.

The arch_set_freq_scale() done at cpufreq driver’s frequency updates
at cpufreq_freq_transition_end() and cpufreq_driver_fast_switch() only
represent the guest’s frequency request. However, this does not
accurately represent the physical CPU’s frequency that the vCPU is
running on. E.g. There may be other processes sharing the same
physical CPU that results in a much higher CPU frequency than what’s
requested by the vCPU.

Thanks,
David

>
> 874f63531064 ("cpufreq: report whether cpufreq supports Frequency
> Invariance (FI)")
>
> Thanks,
> Pavan
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 1/2] dt-bindings: cpufreq: add bindings for virtual cpufreq
  2023-07-31 17:46 ` [PATCH v3 1/2] dt-bindings: cpufreq: add bindings for virtual cpufreq David Dai
  2023-07-31 18:12   ` Rob Herring
@ 2023-08-05 19:38   ` Krzysztof Kozlowski
  2023-08-08 23:31     ` Saravana Kannan
  1 sibling, 1 reply; 23+ messages in thread
From: Krzysztof Kozlowski @ 2023-08-05 19:38 UTC (permalink / raw)
  To: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan
  Cc: Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Pavan Kondeti, Gupta Pankaj, Mel Gorman, kernel-team, linux-pm,
	devicetree, linux-kernel

On 31/07/2023 19:46, David Dai wrote:
> Adding bindings to represent a virtual cpufreq device.
> 
> Virtual machines may expose MMIO regions for a virtual cpufreq device for
> guests to read frequency information or to request frequency selection. The
> virtual cpufreq device has an individual controller for each CPU.

A nit, subject: drop second/last, redundant "bindings for". The
"dt-bindings" prefix is already stating that these are bindings.

> 
> Co-developed-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> Signed-off-by: David Dai <davidai@google.com>
> ---
>  .../bindings/cpufreq/cpufreq-virtual.yaml     | 89 +++++++++++++++++++
>  1 file changed, 89 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml
> 
> diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml
> new file mode 100644
> index 000000000000..f377cfc972ca
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml
> @@ -0,0 +1,89 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/cpufreq/cpufreq-virtual.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yamll#
> +
> +title: Virtual CPUFreq
> +
> +maintainers:
> +  - David Dai <davidai@google.com>
> +  - Saravana Kannan <saravanak@google.com>
> +
> +description:
> +  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
> +  selection of its vCPUs as a hint to the host through MMIO regions. The host
> +  uses the hint to schedule vCPU threads and select physical CPU frequency. It
> +  enables accurate Per-Entity Load Tracking for tasks running in the guest by
> +  querying host CPU frequency unless a virtualized FIE (ex. AMU) exists.

Why do you need DT for this? You control hypervisor, thus control the
interface to the guest. I think Rob made it pretty clear that
discoverable usecases (which is yours) are not for DT.

Incomplete style-review follows:

> +
> +properties:
> +  compatible:
> +    const: virtual,cpufreq

Missing blank line.

> +  reg:
> +    maxItems: 1
> +
> +required:
> +  - compatible
> +  - reg
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +    cpus {
> +      #address-cells = <1>;
> +      #size-cells = <0>;
> +
> +      cpu@0 {
> +        compatible = "arm,arm-v8";
> +        device_type = "cpu";
> +        reg = <0x0>;
> +        operating-points-v2 = <&opp_table0>;
> +      };
> +
> +      cpu@1 {
> +        compatible = "arm,arm-v8";
> +        device_type = "cpu";
> +        reg = <0x0>;
> +        operating-points-v2 = <&opp_table1>;
> +      };
> +    };
> +
> +    opp_table0: opp-table-0 {
> +      compatible = "operating-points-v2";
> +
> +      opp1098000000 {
> +        opp-hz = /bits/ 64 <1098000000>;
> +        opp-level = <1>;
> +      };
> +
> +      opp1197000000 {
> +        opp-hz = /bits/ 64 <1197000000>;
> +        opp-level = <2>;
> +      };
> +    };
> +
> +    opp_table1: opp-table-1 {
> +      compatible = "operating-points-v2";
> +
> +      opp1106000000 {
> +        opp-hz = /bits/ 64 <1106000000>;
> +        opp-level = <1>;
> +      };
> +
> +      opp1277000000 {
> +        opp-hz = /bits/ 64 <1277000000>;
> +        opp-level = <2>;
> +      };
> +    };
> +
> +    soc {
> +      #address-cells = <1>;
> +      #size-cells = <1>;
> +
> +      cpufreq {

Missing unit address

> +        reg = <0x1040000 0x10>;
> +        compatible = "virtual,cpufreq";

compatible is always the first property.

Also, you did not test it...


Best regards,
Krzysztof


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-08-04 23:46     ` David Dai
@ 2023-08-07  3:22       ` Pavan Kondeti
  2023-08-24 23:55         ` David Dai
  0 siblings, 1 reply; 23+ messages in thread
From: Pavan Kondeti @ 2023-08-07  3:22 UTC (permalink / raw)
  To: David Dai
  Cc: Pavan Kondeti, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan,
	Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On Fri, Aug 04, 2023 at 04:46:11PM -0700, David Dai wrote:
> Hi Pavan,
> 
> Thanks for reviewing!
> 
> On Wed, Aug 2, 2023 at 9:18 PM Pavan Kondeti <quic_pkondeti@quicinc.com> wrote:
> >
> > On Mon, Jul 31, 2023 at 10:46:09AM -0700, David Dai wrote:
> > > Introduce a virtualized cpufreq driver for guest kernels to improve
> > > performance and power of workloads within VMs.
> > >
> > > This driver does two main things:
> > >
> > > 1. Sends the frequency of vCPUs as a hint to the host. The host uses the
> > > hint to schedule the vCPU threads and decide physical CPU frequency.
> > >
> > > 2. If a VM does not support a virtualized FIE(like AMUs), it queries the
> > > host CPU frequency by reading a MMIO region of a virtual cpufreq device
> > > to update the guest's frequency scaling factor periodically. This enables
> > > accurate Per-Entity Load Tracking for tasks running in the guest.
> > >
> > > Co-developed-by: Saravana Kannan <saravanak@google.com>
> > > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > > Signed-off-by: David Dai <davidai@google.com>
> >
> > [...]
> >
> > > +static void virt_scale_freq_tick(void)
> > > +{
> > > +     struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id());
> > > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > > +     u32 max_freq = (u32)policy->cpuinfo.max_freq;
> > > +     u64 cur_freq;
> > > +     u64 scale;
> > > +
> > > +     cpufreq_cpu_put(policy);
> > > +
> > > +     cur_freq = (u64)data->ops->get_freq(policy);
> > > +     cur_freq <<= SCHED_CAPACITY_SHIFT;
> > > +     scale = div_u64(cur_freq, max_freq);
> > > +
> > > +     this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > > +}
> > > +
> >
> > We expect the host to provide the frequency in kHz, can you please add a
> > comment about it. It is not very obvious when you look at the
> > REG_CUR_FREQ_OFFSET register name.
> 
> I’ll include a KHZ in the offset names.
> 

Sure, that would help. Also, can you limit the scale to
SCHED_CAPACITY_SCALE? It may be possible that host may be running at a
higher frequency than max_freq advertised on this guest.

> >
> > > +
> > > +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > > +             unsigned int target_freq)
> > > +{
> > > +     virt_cpufreq_set_perf(policy);
> > > +     return target_freq;
> > > +}
> > > +
> > > +static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
> > > +             unsigned int index)
> > > +{
> > > +     return virt_cpufreq_set_perf(policy);
> > > +}
> > > +
> > > +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > > +{
> > > +     struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data();
> > > +     struct cpufreq_frequency_table *table;
> > > +     struct device *cpu_dev;
> > > +     int ret;
> > > +
> > > +     cpu_dev = get_cpu_device(policy->cpu);
> > > +     if (!cpu_dev)
> > > +             return -ENODEV;
> > > +
> > > +     ret = dev_pm_opp_of_add_table(cpu_dev);
> > > +     if (ret)
> > > +             return ret;
> > > +
> > > +     ret = dev_pm_opp_get_opp_count(cpu_dev);
> > > +     if (ret <= 0) {
> > > +             dev_err(cpu_dev, "OPP table can't be empty\n");
> > > +             return -ENODEV;
> > > +     }
> > > +
> > > +     ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
> > > +     if (ret) {
> > > +             dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
> > > +             return ret;
> > > +     }
> > > +
> > > +     policy->freq_table = table;
> > > +     policy->dvfs_possible_from_any_cpu = false;
> > > +     policy->fast_switch_possible = true;
> > > +     policy->driver_data = drv_data;
> > > +
> > > +     /*
> > > +      * Only takes effect if another FIE source such as AMUs
> > > +      * have not been registered.
> > > +      */
> > > +     topology_set_scale_freq_source(&virt_sfd, policy->cpus);
> > > +
> > > +     return 0;
> > > +
> > > +}
> > > +
> >
> > Do we need to register as FIE source even with the below commit? By
> > registering as a source, we are not supplying any accurate metric. We
> > still fallback on the same source that cpufreq implements it.
> 
> The arch_set_freq_scale() done at cpufreq driver’s frequency updates
> at cpufreq_freq_transition_end() and cpufreq_driver_fast_switch() only
> represent the guest’s frequency request. However, this does not
> accurately represent the physical CPU’s frequency that the vCPU is
> running on. E.g. There may be other processes sharing the same
> physical CPU that results in a much higher CPU frequency than what’s
> requested by the vCPU.
> 

understood that policy->cur may not reflect the actual frequency. Is this
something needs to be advertised to cpufreq core so that it query the
underlying cpufreq driver and use it for frequency scale updates. This
also gives userspace to read the actual frequency when read from sysfs.

In fact, cpufreq_driver_fast_switch() comment says that
cpufreq_driver::fast_switch() should return the *actual* frequency and
the same is used to update frequency scale updates. I understand that it
depends on other things like if host defer the frequency switch, the
value read from REG_CUR_FREQ_OFFSET may reflect the old value..

May be a comment in code would help.

Thanks,
Pavan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 1/2] dt-bindings: cpufreq: add bindings for virtual cpufreq
  2023-08-05 19:38   ` Krzysztof Kozlowski
@ 2023-08-08 23:31     ` Saravana Kannan
  2023-08-09  6:28       ` Krzysztof Kozlowski
  0 siblings, 1 reply; 23+ messages in thread
From: Saravana Kannan @ 2023-08-08 23:31 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On Sat, Aug 5, 2023 at 12:38 PM Krzysztof Kozlowski
<krzysztof.kozlowski@linaro.org> wrote:
>
> On 31/07/2023 19:46, David Dai wrote:
> > Adding bindings to represent a virtual cpufreq device.
> >
> > Virtual machines may expose MMIO regions for a virtual cpufreq device for
> > guests to read frequency information or to request frequency selection. The
> > virtual cpufreq device has an individual controller for each CPU.
>
> A nit, subject: drop second/last, redundant "bindings for". The
> "dt-bindings" prefix is already stating that these are bindings.
>
> >
> > Co-developed-by: Saravana Kannan <saravanak@google.com>
> > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > Signed-off-by: David Dai <davidai@google.com>
> > ---
> >  .../bindings/cpufreq/cpufreq-virtual.yaml     | 89 +++++++++++++++++++
> >  1 file changed, 89 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml
> >
> > diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml b/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml
> > new file mode 100644
> > index 000000000000..f377cfc972ca
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-virtual.yaml
> > @@ -0,0 +1,89 @@
> > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/cpufreq/cpufreq-virtual.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yamll#
> > +
> > +title: Virtual CPUFreq
> > +
> > +maintainers:
> > +  - David Dai <davidai@google.com>
> > +  - Saravana Kannan <saravanak@google.com>
> > +
> > +description:
> > +  Virtual CPUFreq is a virtualized driver in guest kernels that sends frequency
> > +  selection of its vCPUs as a hint to the host through MMIO regions. The host
> > +  uses the hint to schedule vCPU threads and select physical CPU frequency. It
> > +  enables accurate Per-Entity Load Tracking for tasks running in the guest by
> > +  querying host CPU frequency unless a virtualized FIE (ex. AMU) exists.
>
> Why do you need DT for this? You control hypervisor, thus control the
> interface to the guest. I think Rob made it pretty clear that
> discoverable usecases (which is yours) are not for DT.
>
> Incomplete style-review follows:
>
> > +
> > +properties:
> > +  compatible:
> > +    const: virtual,cpufreq
>
> Missing blank line.
>
> > +  reg:
> > +    maxItems: 1
> > +
> > +required:
> > +  - compatible
> > +  - reg
> > +
> > +additionalProperties: false
> > +
> > +examples:
> > +  - |
> > +    cpus {
> > +      #address-cells = <1>;
> > +      #size-cells = <0>;
> > +
> > +      cpu@0 {
> > +        compatible = "arm,arm-v8";
> > +        device_type = "cpu";
> > +        reg = <0x0>;
> > +        operating-points-v2 = <&opp_table0>;
> > +      };
> > +
> > +      cpu@1 {
> > +        compatible = "arm,arm-v8";
> > +        device_type = "cpu";
> > +        reg = <0x0>;
> > +        operating-points-v2 = <&opp_table1>;
> > +      };
> > +    };
> > +
> > +    opp_table0: opp-table-0 {
> > +      compatible = "operating-points-v2";
> > +
> > +      opp1098000000 {
> > +        opp-hz = /bits/ 64 <1098000000>;
> > +        opp-level = <1>;
> > +      };
> > +
> > +      opp1197000000 {
> > +        opp-hz = /bits/ 64 <1197000000>;
> > +        opp-level = <2>;
> > +      };
> > +    };
> > +
> > +    opp_table1: opp-table-1 {
> > +      compatible = "operating-points-v2";
> > +
> > +      opp1106000000 {
> > +        opp-hz = /bits/ 64 <1106000000>;
> > +        opp-level = <1>;
> > +      };
> > +
> > +      opp1277000000 {
> > +        opp-hz = /bits/ 64 <1277000000>;
> > +        opp-level = <2>;
> > +      };
> > +    };
> > +
> > +    soc {
> > +      #address-cells = <1>;
> > +      #size-cells = <1>;
> > +
> > +      cpufreq {
>
> Missing unit address
>
> > +        reg = <0x1040000 0x10>;
> > +        compatible = "virtual,cpufreq";
>
> compatible is always the first property.
>
> Also, you did not test it...

Why do you say this? This patch series was obviously tested very well
with all the data we collected.

-Saravana

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 1/2] dt-bindings: cpufreq: add bindings for virtual cpufreq
  2023-08-08 23:31     ` Saravana Kannan
@ 2023-08-09  6:28       ` Krzysztof Kozlowski
  0 siblings, 0 replies; 23+ messages in thread
From: Krzysztof Kozlowski @ 2023-08-09  6:28 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Quentin Perret,
	Masami Hiramatsu, Will Deacon, Peter Zijlstra, Vincent Guittot,
	Marc Zyngier, Oliver Upton, Dietmar Eggemann, Pavan Kondeti,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On 09/08/2023 01:31, Saravana Kannan wrote:
>>
>>> +        reg = <0x1040000 0x10>;
>>> +        compatible = "virtual,cpufreq";
>>
>> compatible is always the first property.
>>
>> Also, you did not test it...
> 
> Why do you say this? This patch series was obviously tested very well
> with all the data we collected.

Why do I say? Because of warning and huge fat Python exception? Test
it... you will see.

Best regards,
Krzysztof


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-07-31 17:46 ` [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver David Dai
                     ` (3 preceding siblings ...)
  2023-08-03  4:18   ` Pavan Kondeti
@ 2023-08-12  2:55   ` kernel test robot
  4 siblings, 0 replies; 23+ messages in thread
From: kernel test robot @ 2023-08-12  2:55 UTC (permalink / raw)
  To: David Dai, Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan
  Cc: oe-kbuild-all, Quentin Perret, Masami Hiramatsu, Will Deacon,
	Peter Zijlstra, Vincent Guittot, Marc Zyngier, Oliver Upton,
	Dietmar Eggemann, Pavan Kondeti, Gupta Pankaj, Mel Gorman,
	kernel-team, linux-pm, devicetree, linux-kernel

Hi David,

kernel test robot noticed the following build errors:

[auto build test ERROR on rafael-pm/linux-next]
[also build test ERROR on robh/for-next linus/master v6.5-rc5 next-20230809]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Dai/dt-bindings-cpufreq-add-bindings-for-virtual-cpufreq/20230801-014946
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
patch link:    https://lore.kernel.org/r/20230731174613.4133167-3-davidai%40google.com
patch subject: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
config: arm-randconfig-r061-20230812 (https://download.01.org/0day-ci/archive/20230812/202308121020.23DejpAv-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230812/202308121020.23DejpAv-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308121020.23DejpAv-lkp@intel.com/

All errors (new ones prefixed by >>):

   arm-linux-gnueabi-ld: drivers/cpufreq/virtual-cpufreq.o: in function `virt_cpufreq_cpu_exit':
>> virtual-cpufreq.c:(.text+0xf8): undefined reference to `topology_clear_scale_freq_source'
   arm-linux-gnueabi-ld: drivers/cpufreq/virtual-cpufreq.o: in function `virt_cpufreq_cpu_init':
>> virtual-cpufreq.c:(.text+0x1c8): undefined reference to `topology_set_scale_freq_source'
   arm-linux-gnueabi-ld: drivers/cpufreq/virtual-cpufreq.o: in function `virt_scale_freq_tick':
>> virtual-cpufreq.c:(.text+0x330): undefined reference to `arch_freq_scale'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
  2023-08-07  3:22       ` Pavan Kondeti
@ 2023-08-24 23:55         ` David Dai
  0 siblings, 0 replies; 23+ messages in thread
From: David Dai @ 2023-08-24 23:55 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: Rafael J. Wysocki, Viresh Kumar, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Sudeep Holla, Saravana Kannan,
	Quentin Perret, Masami Hiramatsu, Will Deacon, Peter Zijlstra,
	Vincent Guittot, Marc Zyngier, Oliver Upton, Dietmar Eggemann,
	Gupta Pankaj, Mel Gorman, kernel-team, linux-pm, devicetree,
	linux-kernel

On Sun, Aug 6, 2023 at 8:22 PM Pavan Kondeti <quic_pkondeti@quicinc.com> wrote:
>
> On Fri, Aug 04, 2023 at 04:46:11PM -0700, David Dai wrote:
> > Hi Pavan,
> >
> > Thanks for reviewing!
> >
> > On Wed, Aug 2, 2023 at 9:18 PM Pavan Kondeti <quic_pkondeti@quicinc.com> wrote:
> > >
> > > On Mon, Jul 31, 2023 at 10:46:09AM -0700, David Dai wrote:
> > > > Introduce a virtualized cpufreq driver for guest kernels to improve
> > > > performance and power of workloads within VMs.
> > > >
> > > > This driver does two main things:
> > > >
> > > > 1. Sends the frequency of vCPUs as a hint to the host. The host uses the
> > > > hint to schedule the vCPU threads and decide physical CPU frequency.
> > > >
> > > > 2. If a VM does not support a virtualized FIE(like AMUs), it queries the
> > > > host CPU frequency by reading a MMIO region of a virtual cpufreq device
> > > > to update the guest's frequency scaling factor periodically. This enables
> > > > accurate Per-Entity Load Tracking for tasks running in the guest.
> > > >
> > > > Co-developed-by: Saravana Kannan <saravanak@google.com>
> > > > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > > > Signed-off-by: David Dai <davidai@google.com>
> > >
> > > [...]
> > >
> > > > +static void virt_scale_freq_tick(void)
> > > > +{
> > > > +     struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id());
> > > > +     struct virt_cpufreq_drv_data *data = policy->driver_data;
> > > > +     u32 max_freq = (u32)policy->cpuinfo.max_freq;
> > > > +     u64 cur_freq;
> > > > +     u64 scale;
> > > > +
> > > > +     cpufreq_cpu_put(policy);
> > > > +
> > > > +     cur_freq = (u64)data->ops->get_freq(policy);
> > > > +     cur_freq <<= SCHED_CAPACITY_SHIFT;
> > > > +     scale = div_u64(cur_freq, max_freq);
> > > > +
> > > > +     this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > > > +}
> > > > +
> > >
> > > We expect the host to provide the frequency in kHz, can you please add a
> > > comment about it. It is not very obvious when you look at the
> > > REG_CUR_FREQ_OFFSET register name.
> >
> > I’ll include a KHZ in the offset names.
> >
>

Hi Pavan,

Apologies for the slow responses, I was out on vacation for the week
prior to last week.

> Sure, that would help. Also, can you limit the scale to
> SCHED_CAPACITY_SCALE? It may be possible that host may be running at a
> higher frequency than max_freq advertised on this guest.

Good catch, will include a check to limit freq_scale to SCHED_CAPACITY_SCALE.

>
> > >
> > > > +
> > > > +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > > > +             unsigned int target_freq)
> > > > +{
> > > > +     virt_cpufreq_set_perf(policy);
> > > > +     return target_freq;
> > > > +}
> > > > +
> > > > +static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
> > > > +             unsigned int index)
> > > > +{
> > > > +     return virt_cpufreq_set_perf(policy);
> > > > +}
> > > > +
> > > > +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > > > +{
> > > > +     struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data();
> > > > +     struct cpufreq_frequency_table *table;
> > > > +     struct device *cpu_dev;
> > > > +     int ret;
> > > > +
> > > > +     cpu_dev = get_cpu_device(policy->cpu);
> > > > +     if (!cpu_dev)
> > > > +             return -ENODEV;
> > > > +
> > > > +     ret = dev_pm_opp_of_add_table(cpu_dev);
> > > > +     if (ret)
> > > > +             return ret;
> > > > +
> > > > +     ret = dev_pm_opp_get_opp_count(cpu_dev);
> > > > +     if (ret <= 0) {
> > > > +             dev_err(cpu_dev, "OPP table can't be empty\n");
> > > > +             return -ENODEV;
> > > > +     }
> > > > +
> > > > +     ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
> > > > +     if (ret) {
> > > > +             dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
> > > > +             return ret;
> > > > +     }
> > > > +
> > > > +     policy->freq_table = table;
> > > > +     policy->dvfs_possible_from_any_cpu = false;
> > > > +     policy->fast_switch_possible = true;
> > > > +     policy->driver_data = drv_data;
> > > > +
> > > > +     /*
> > > > +      * Only takes effect if another FIE source such as AMUs
> > > > +      * have not been registered.
> > > > +      */
> > > > +     topology_set_scale_freq_source(&virt_sfd, policy->cpus);
> > > > +
> > > > +     return 0;
> > > > +
> > > > +}
> > > > +
> > >
> > > Do we need to register as FIE source even with the below commit? By
> > > registering as a source, we are not supplying any accurate metric. We
> > > still fallback on the same source that cpufreq implements it.
> >
> > The arch_set_freq_scale() done at cpufreq driver’s frequency updates
> > at cpufreq_freq_transition_end() and cpufreq_driver_fast_switch() only
> > represent the guest’s frequency request. However, this does not
> > accurately represent the physical CPU’s frequency that the vCPU is
> > running on. E.g. There may be other processes sharing the same
> > physical CPU that results in a much higher CPU frequency than what’s
> > requested by the vCPU.
> >
>
> understood that policy->cur may not reflect the actual frequency. Is this
> something needs to be advertised to cpufreq core so that it query the
> underlying cpufreq driver and use it for frequency scale updates. This
> also gives userspace to read the actual frequency when read from sysfs.
>

Adding a ->get() callback in the driver to advertise to the cpufreq
core does not actually update the freq_scale if fast_switch is
enabled. Since fast_switch is required for performance reasons, I
don’t see value in adding ->get().

> In fact, cpufreq_driver_fast_switch() comment says that
> cpufreq_driver::fast_switch() should return the *actual* frequency and
> the same is used to update frequency scale updates. I understand that it
> depends on other things like if host defer the frequency switch, the
> value read from REG_CUR_FREQ_OFFSET may reflect the old value..
>
> May be a comment in code would help.
>

Sounds good, I'll include a comment.

Thanks,
David

> Thanks,
> Pavan
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-08-24 23:56 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-31 17:46 [PATCH v3 0/2] Improve VM CPUfreq and task placement behavior David Dai
2023-07-31 17:46 ` [PATCH v3 1/2] dt-bindings: cpufreq: add bindings for virtual cpufreq David Dai
2023-07-31 18:12   ` Rob Herring
2023-08-05 19:38   ` Krzysztof Kozlowski
2023-08-08 23:31     ` Saravana Kannan
2023-08-09  6:28       ` Krzysztof Kozlowski
2023-07-31 17:46 ` [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver David Dai
2023-07-31 22:02   ` Randy Dunlap
2023-07-31 23:46     ` David Dai
2023-08-01  9:36   ` Viresh Kumar
2023-08-02 22:16     ` Saravana Kannan
2023-08-03  5:51       ` Viresh Kumar
2023-08-04 22:24         ` Saravana Kannan
2023-08-03 16:50     ` David Dai
2023-08-04  4:42       ` Viresh Kumar
2023-08-01  9:45   ` Quentin Perret
2023-08-01  9:49     ` Quentin Perret
2023-08-04 22:23     ` Saravana Kannan
2023-08-03  4:18   ` Pavan Kondeti
2023-08-04 23:46     ` David Dai
2023-08-07  3:22       ` Pavan Kondeti
2023-08-24 23:55         ` David Dai
2023-08-12  2:55   ` kernel test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).