* [PATCH 0/4] KVM: arm64: Improve PMU support on heterogeneous systems
@ 2021-11-15 16:50 ` Alexandru Elisei
  0 siblings, 0 replies; 26+ messages in thread
From: Alexandru Elisei @ 2021-11-15 16:50 UTC (permalink / raw)
  To: maz, james.morse, suzuki.poulose, will, mark.rutland,
	linux-arm-kernel, kvmarm

(CC'ing Peter Maydell in case this might be of interest to qemu)

The series can be found at [1], and the kvmtool support at [2].

At the moment, the experience of running a virtual machine with a PMU on a
heterogeneous system (where there are different PMUs) varies: it just works
if the VCPUs run only on the correct physical CPUs, it doesn't work at all
if the VCPUs run only on the incorrect physical CPUs, and something doesn't
look right if the VCPUs run some of the time on the correct physical CPUs
and some of the time on the incorrect ones. The
reason for this behaviour is that KVM creates perf events to emulate a
guest PMU, and the choice of PMU that is used to create these events is
left entirely up to the perf core system, based on the hardware probe
order. The first PMU to register with perf (via perf_pmu_register()) is the
PMU that will always be chosen when creating the events (in
perf_event_create_kernel_counter() -> perf_event_alloc() ->
perf_event_init()).
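The effect of that probe-order dependency can be sketched with a small
userspace simulation (not kernel code; the fake_* names and the simplified
list handling are made up for illustration): whichever PMU registers first
is the one every KVM-created event lands on.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of perf's PMU selection for kernel counters: registered
 * PMUs sit on a list in probe order, and event creation walks the list
 * from the head, so the first PMU to register always wins. */
struct fake_pmu {
	const char *name;
	struct fake_pmu *next;
};

static struct fake_pmu *pmu_list;

/* Stand-in for perf_pmu_register(): later registrations go behind
 * earlier ones. */
static void fake_pmu_register(struct fake_pmu *pmu)
{
	struct fake_pmu **p = &pmu_list;

	while (*p)
		p = &(*p)->next;
	pmu->next = NULL;
	*p = pmu;
}

/* Stand-in for the perf_init_event() outcome described above: every
 * event is created on the head of the list, i.e. the PMU that was
 * registered first, regardless of which CPU the event targets. */
static struct fake_pmu *fake_init_event(void)
{
	return pmu_list;
}
```

Flipping the registration order flips which PMU all events end up on, which
is exactly why the DTB node order or asynchronous probing changes the
observed behaviour.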

Let's take the example of a rockpro64 board: CPUs 0-3 are Cortex-A53 (the
little cores), CPUs 4-5 are Cortex-A72 (the big cores), and each group has
its own PMU. When running the pmu-cycle-counter test from kvm-unit-tests
on the little cores, everything works as expected:

$ taskset -c 0-3 ./vm run -c1 -m64 --nodefaults -k arm/pmu.flat -p "cycle-counter 0" --pmu
[..]
PASS: pmu: cycle-counter: Monotonically increasing cycle count
[..]
PASS: pmu: cycle-counter: Cycle/instruction ratio
SUMMARY: 2 tests

But when running the same test on the big cores:

$ taskset -c 4-5 ./vm run -c1 -m64 --nodefaults -k arm/pmu.flat -p "cycle-counter 0" --pmu
[..]
FAIL: pmu: cycle-counter: Monotonically increasing cycle count
[..]
FAIL: pmu: cycle-counter: Cycle/instruction ratio
SUMMARY: 2 tests, 2 unexpected failures

The same behaviour is exhibited when running under qemu.

The test passes on the little cores in that particular setup because the
little cores are the "correct" cores: the PMU that perf chooses to create
the events on is the PMU associated with the little cores. The test fails
on the big cores because the events cannot be scheduled in, as the PMU is
associated with a different set of cores (merge_sched_in() exits early
because event_filter_match() returns false).
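The filtering that makes the big-core run fail can be modelled with another
small sketch (again userspace-only; the fake_* names and the bitmask
representation of the cpumask are illustrative assumptions): an event
carries the set of CPUs its PMU supports, and scheduling it on any other
CPU is silently rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the event_filter_match() check: an event created on a
 * given PMU only counts on the CPUs in that PMU's supported mask (bit n
 * set means CPU n can run the event). On any other CPU the event is
 * never scheduled in, so the guest counter simply stops moving. */
struct fake_event {
	uint64_t supported_cpus;
};

static bool fake_event_filter_match(const struct fake_event *event, int cpu)
{
	return event->supported_cpus & (1ULL << cpu);
}
```

With the little-core PMU's mask (CPUs 0-3), the event matches on CPU 2 but
not on CPU 4, mirroring the PASS/FAIL split in the test runs above.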

It gets even more unpredictable, as the order in which the PMUs are probed
during boot dictates which PMU is chosen for creating the events, and the
probe order can change if, for example, the order of the PMU nodes in the
DTB changes, or if the kernel is booted with asynchronous driver probing
for the armv8-pmu driver. A user can end up in a situation where pinning
the VM on a set of CPUs works just fine, and after a reboot doesn't work
anymore, without any kind of explanation or hints of why it stopped
working.

All of this is not ideal from the user's perspective, and this series aims
to improve that by adding a new PMU attribute which can be used to tell KVM
exactly on which PMU the events for a VCPU should be created. The contract
is that the user is still responsible for pinning the VCPUs to the
corresponding CPUs, and KVM will refuse to run a VCPU on a CPU with a
different PMU.
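From userspace, setting the attribute boils down to one KVM_SET_DEVICE_ATTR
ioctl per VCPU. A minimal sketch of building that request (the attribute
constants are taken from the arm64 uapi header and this series; the struct
mirrors struct kvm_device_attr from <linux/kvm.h>; the ioctl itself is only
shown in the comment, since issuing it needs /dev/kvm on an arm64 host):

```c
#include <assert.h>
#include <stdint.h>

/* From arch/arm64/include/uapi/asm/kvm.h and this series. */
#define KVM_ARM_VCPU_PMU_V3_CTRL	0
#define KVM_ARM_VCPU_PMU_V3_SET_PMU	3

/* Mirrors struct kvm_device_attr from <linux/kvm.h>. */
struct kvm_device_attr {
	uint32_t flags;
	uint32_t group;
	uint64_t attr;
	uint64_t addr;	/* userspace pointer to an int holding the PMU id */
};

/* Build the attribute that asks KVM to create this VCPU's events on the
 * PMU whose id *pmu_type holds; the id comes from the "type" file of
 * the desired PMU under /sys/bus/event_source/devices/. The caller
 * would then do: ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr); */
static struct kvm_device_attr make_set_pmu_attr(const int *pmu_type)
{
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_PMU_V3_CTRL,
		.attr  = KVM_ARM_VCPU_PMU_V3_SET_PMU,
		.addr  = (uint64_t)(uintptr_t)pmu_type,
	};
	return attr;
}
```

This has to happen before KVM_ARM_VCPU_PMU_V3_INIT, and pinning the VCPU
thread to the matching cores remains the caller's job.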

With this series on top of kvmtool support for KVM_ARM_VCPU_PMU_V3_SET_PMU
attribute [2], running the same test as above on the little cores, then on
the big cores:

$ taskset -c 0-3 ./vm run -c1 -m64 --nodefaults -k arm/pmu.flat -p "cycle-counter 0" --pmu
[..]
PASS: pmu: cycle-counter: Monotonically increasing cycle count
[..]
PASS: pmu: cycle-counter: Cycle/instruction ratio
SUMMARY: 2 tests

$ taskset -c 4-5 ./vm run -c1 -m64 --nodefaults -k arm/pmu.flat -p "cycle-counter 0" --pmu
[..]
PASS: pmu: cycle-counter: Monotonically increasing cycle count
[..]
PASS: pmu: cycle-counter: Cycle/instruction ratio
SUMMARY: 2 tests

We get a saner behaviour, which is reproducible across reboots, regardless
of the probe order.

And this is what happens if the VCPU is run on a physical CPU with a
different PMU than the one that was set by userspace:

$ taskset -c 3-4 ./vm run -c1 -m64 --nodefaults -k arm/pmu.flat -p "cycle-counter 0" --pmu
KVM_RUN failed: Exec format error

kvmtool sets the PMU for all VCPUs from the main thread; the main thread
runs on a little core (CPU3), but the VCPU thread is scheduled on a big
core (CPU4); the VCPU's PMU doesn't match the physical CPU's PMU, and KVM
returns -ENOEXEC from KVM_RUN.

[1] https://gitlab.arm.com/linux-arm/linux-ae/-/tree/pmu-big-little-fix-v1
[2] https://gitlab.arm.com/linux-arm/kvmtool-ae/-/tree/pmu-big-little-fix-v1

Alexandru Elisei (4):
  perf: Fix wrong name in comment for struct perf_cpu_context
  KVM: arm64: Keep a list of probed PMUs
  KVM: arm64: Add KVM_ARM_VCPU_PMU_V3_SET_PMU attribute
  KVM: arm64: Refuse to run VCPU if the PMU doesn't match the physical
    CPU

 Documentation/virt/kvm/api.rst          |  5 ++-
 Documentation/virt/kvm/devices/vcpu.rst | 26 +++++++++++
 arch/arm64/include/asm/kvm_host.h       |  3 ++
 arch/arm64/include/uapi/asm/kvm.h       |  1 +
 arch/arm64/kvm/arm.c                    | 15 +++++++
 arch/arm64/kvm/pmu-emul.c               | 58 +++++++++++++++++++++++--
 include/kvm/arm_pmu.h                   |  6 +++
 include/linux/perf_event.h              |  2 +-
 tools/arch/arm64/include/uapi/asm/kvm.h |  1 +
 9 files changed, 110 insertions(+), 7 deletions(-)

-- 
2.33.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

* [PATCH 1/4] perf: Fix wrong name in comment for struct perf_cpu_context
  2021-11-15 16:50 ` Alexandru Elisei
@ 2021-11-15 16:50   ` Alexandru Elisei
  -1 siblings, 0 replies; 26+ messages in thread
From: Alexandru Elisei @ 2021-11-15 16:50 UTC (permalink / raw)
  To: maz, james.morse, suzuki.poulose, will, mark.rutland,
	linux-arm-kernel, kvmarm
  Cc: Thomas Gleixner, Ingo Molnar

Commit 0793a61d4df8 ("performance counters: core code") added the perf
subsystem (then called Performance Counters) to Linux, creating the struct
perf_cpu_context. The comment for the struct referred to it as a "struct
perf_counter_cpu_context".

Commit cdd6c482c9ff ("perf: Do the big rename: Performance Counters ->
Performance Events") changed the comment to refer to a "struct
perf_event_cpu_context", which was still the wrong name for the struct.

Change the comment to say "struct perf_cpu_context".

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 include/linux/perf_event.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0dcfd265beed..14132570ea5d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -862,7 +862,7 @@ struct perf_event_context {
 #define PERF_NR_CONTEXTS	4
 
 /**
- * struct perf_event_cpu_context - per cpu event context structure
+ * struct perf_cpu_context - per cpu event context structure
  */
 struct perf_cpu_context {
 	struct perf_event_context	ctx;
-- 
2.33.1

* [PATCH 2/4] KVM: arm64: Keep a list of probed PMUs
  2021-11-15 16:50 ` Alexandru Elisei
@ 2021-11-15 16:50   ` Alexandru Elisei
  -1 siblings, 0 replies; 26+ messages in thread
From: Alexandru Elisei @ 2021-11-15 16:50 UTC (permalink / raw)
  To: maz, james.morse, suzuki.poulose, will, mark.rutland,
	linux-arm-kernel, kvmarm

The ARM PMU driver calls kvm_host_pmu_init() after probing to tell KVM that
a hardware PMU is available for guest emulation. Heterogeneous systems can
have more than one PMU present, and the callback gets called multiple
times, once for each of them. Keep track of all the PMUs available to KVM,
as we're going to need them later.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/pmu-emul.c | 26 ++++++++++++++++++++++++--
 include/kvm/arm_pmu.h     |  5 +++++
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index a5e4bbf5e68f..dab335d17409 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -7,6 +7,7 @@
 #include <linux/cpu.h>
 #include <linux/kvm.h>
 #include <linux/kvm_host.h>
+#include <linux/list.h>
 #include <linux/perf_event.h>
 #include <linux/perf/arm_pmu.h>
 #include <linux/uaccess.h>
@@ -14,6 +15,9 @@
 #include <kvm/arm_pmu.h>
 #include <kvm/arm_vgic.h>
 
+static LIST_HEAD(arm_pmus);
+static DEFINE_MUTEX(arm_pmus_lock);
+
 static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx);
 static void kvm_pmu_update_pmc_chained(struct kvm_vcpu *vcpu, u64 select_idx);
 static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc);
@@ -742,9 +746,27 @@ void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data,
 
 void kvm_host_pmu_init(struct arm_pmu *pmu)
 {
-	if (pmu->pmuver != 0 && pmu->pmuver != ID_AA64DFR0_PMUVER_IMP_DEF &&
-	    !kvm_arm_support_pmu_v3() && !is_protected_kvm_enabled())
+	struct arm_pmu_entry *entry;
+
+	if (pmu->pmuver == 0 || pmu->pmuver == ID_AA64DFR0_PMUVER_IMP_DEF ||
+	    is_protected_kvm_enabled())
+		return;
+
+	/* Protect list changes against asynchronous driver probing. */
+	mutex_lock(&arm_pmus_lock);
+
+	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		goto out_unlock;
+
+	if (list_empty(&arm_pmus))
 		static_branch_enable(&kvm_arm_pmu_available);
+
+	entry->arm_pmu = pmu;
+	list_add_tail(&entry->entry, &arm_pmus);
+
+out_unlock:
+	mutex_unlock(&arm_pmus_lock);
 }
 
 static int kvm_pmu_probe_pmuver(void)
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 90f21898aad8..e249c5f172aa 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -36,6 +36,11 @@ struct kvm_pmu {
 	struct irq_work overflow_work;
 };
 
+struct arm_pmu_entry {
+	struct list_head entry;
+	struct arm_pmu *arm_pmu;
+};
+
 #define kvm_arm_pmu_irq_initialized(v)	((v)->arch.pmu.irq_num >= VGIC_NR_SGIS)
 u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx);
 void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val);
-- 
2.33.1

* [PATCH 3/4] KVM: arm64: Add KVM_ARM_VCPU_PMU_V3_SET_PMU attribute
  2021-11-15 16:50 ` Alexandru Elisei
@ 2021-11-15 16:50   ` Alexandru Elisei
  -1 siblings, 0 replies; 26+ messages in thread
From: Alexandru Elisei @ 2021-11-15 16:50 UTC (permalink / raw)
  To: maz, james.morse, suzuki.poulose, will, mark.rutland,
	linux-arm-kernel, kvmarm

When KVM creates an event and there are more than one PMUs present on the
system, perf_init_event() will go through the list of available PMUs and
will choose the first one that can create the event. The order of the PMUs
in the PMU list depends on the probe order, which can change under various
circumstances, for example if the order of the PMU nodes change in the DTB
or if asynchronous driver probing is enabled on the kernel command line
(with the driver_async_probe=armv8-pmu option).

Another consequence of this approach is that, on heterogeneous systems,
all virtual machines that KVM creates will use the same PMU. This might
cause unexpected behaviour for userspace: when a VCPU is executing on
the physical CPU that uses this PMU, PMU events in the guest work
correctly; but when the same VCPU executes on another CPU, PMU events in
the guest will suddenly stop counting.

Fortunately, the perf core allows the user to specify on which PMU to create an
event by using the perf_event_attr->type field, which is used by
perf_init_event() as an index in the radix tree of available PMUs.

Add the KVM_ARM_VCPU_PMU_V3_SET_PMU VCPU attribute (in the
KVM_ARM_VCPU_PMU_V3_CTRL group) to allow userspace to specify the arm_pmu
that KVM will use when
creating events for that VCPU. KVM will make no attempt to run the VCPU on
the physical CPUs that share this PMU, leaving it up to userspace to
manage the VCPU threads' affinity accordingly.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vcpu.rst | 25 ++++++++++++++++++++
 arch/arm64/include/uapi/asm/kvm.h       |  1 +
 arch/arm64/kvm/pmu-emul.c               | 31 +++++++++++++++++++++++--
 include/kvm/arm_pmu.h                   |  1 +
 tools/arch/arm64/include/uapi/asm/kvm.h |  1 +
 5 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 60a29972d3f1..59ac382af59a 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -104,6 +104,31 @@ hardware event. Filtering event 0x1E (CHAIN) has no effect either, as it
 isn't strictly speaking an event. Filtering the cycle counter is possible
 using event 0x11 (CPU_CYCLES).
 
+1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_SET_PMU
+------------------------------------------
+
+:Parameters: in kvm_device_attr.addr the address to an int representing the PMU
+             identifier.
+
+:Returns:
+
+	 =======  ===============================================
+	 -EBUSY   PMUv3 already initialized
+	 -EFAULT  Error accessing the PMU identifier
+	 -EINVAL  PMU not found or PMU name longer than PAGE_SIZE
+	 -ENODEV  PMUv3 not supported or GIC not initialized
+	 -ENOMEM  Could not allocate memory
+	 =======  ===============================================
+
+Request that the VCPU uses the specified hardware PMU when creating guest events
+for the purpose of PMU emulation. The PMU identifier can be read from the "type"
+file for the desired PMU instance under /sys/devices (or, equivalently,
+/sys/bus/event_source). This attribute is particularly useful on heterogeneous
+systems where there are at least two PMUs on the system.
+
+Note that KVM will not make any attempts to run the VCPU on the physical CPUs
+associated with the PMU specified by this attribute. This is entirely left to
+userspace.
 
 2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
 =================================
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index b3edde68bc3e..1d0a0a2a9711 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -362,6 +362,7 @@ struct kvm_arm_copy_mte_tags {
 #define   KVM_ARM_VCPU_PMU_V3_IRQ	0
 #define   KVM_ARM_VCPU_PMU_V3_INIT	1
 #define   KVM_ARM_VCPU_PMU_V3_FILTER	2
+#define   KVM_ARM_VCPU_PMU_V3_SET_PMU	3
 #define KVM_ARM_VCPU_TIMER_CTRL		1
 #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index dab335d17409..53cedeb5dbf6 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -602,6 +602,7 @@ static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u64 select_idx)
 static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx)
 {
 	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	struct arm_pmu *arm_pmu = pmu->arm_pmu;
 	struct kvm_pmc *pmc;
 	struct perf_event *event;
 	struct perf_event_attr attr;
@@ -637,8 +638,7 @@ static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx)
 		return;
 
 	memset(&attr, 0, sizeof(struct perf_event_attr));
-	attr.type = PERF_TYPE_RAW;
-	attr.size = sizeof(attr);
+	attr.type = arm_pmu ? arm_pmu->pmu.type : PERF_TYPE_RAW;
 	attr.pinned = 1;
 	attr.disabled = !kvm_pmu_counter_is_enabled(vcpu, pmc->idx);
 	attr.exclude_user = data & ARMV8_PMU_EXCLUDE_EL0 ? 1 : 0;
@@ -941,6 +941,23 @@ static bool pmu_irq_is_valid(struct kvm *kvm, int irq)
 	return true;
 }
 
+static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
+{
+	struct kvm_pmu *kvm_pmu = &vcpu->arch.pmu;
+	struct arm_pmu_entry *entry;
+	struct arm_pmu *arm_pmu;
+
+	list_for_each_entry(entry, &arm_pmus, entry) {
+		arm_pmu = entry->arm_pmu;
+		if (arm_pmu->pmu.type == pmu_id) {
+			kvm_pmu->arm_pmu = arm_pmu;
+			return 0;
+		}
+	}
+
+	return -ENXIO;
+}
+
 int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
 	if (!kvm_vcpu_has_pmu(vcpu))
@@ -1027,6 +1044,15 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 
 		return 0;
 	}
+	case KVM_ARM_VCPU_PMU_V3_SET_PMU: {
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		int pmu_id;
+
+		if (get_user(pmu_id, uaddr))
+			return -EFAULT;
+
+		return kvm_arm_pmu_v3_set_pmu(vcpu, pmu_id);
+	}
 	case KVM_ARM_VCPU_PMU_V3_INIT:
 		return kvm_arm_pmu_v3_init(vcpu);
 	}
@@ -1064,6 +1090,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 	case KVM_ARM_VCPU_PMU_V3_IRQ:
 	case KVM_ARM_VCPU_PMU_V3_INIT:
 	case KVM_ARM_VCPU_PMU_V3_FILTER:
+	case KVM_ARM_VCPU_PMU_V3_SET_PMU:
 		if (kvm_vcpu_has_pmu(vcpu))
 			return 0;
 	}
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index e249c5f172aa..ab3046a8f9bb 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -34,6 +34,7 @@ struct kvm_pmu {
 	bool created;
 	bool irq_level;
 	struct irq_work overflow_work;
+	struct arm_pmu *arm_pmu;
 };
 
 struct arm_pmu_entry {
diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h
index b3edde68bc3e..1d0a0a2a9711 100644
--- a/tools/arch/arm64/include/uapi/asm/kvm.h
+++ b/tools/arch/arm64/include/uapi/asm/kvm.h
@@ -362,6 +362,7 @@ struct kvm_arm_copy_mte_tags {
 #define   KVM_ARM_VCPU_PMU_V3_IRQ	0
 #define   KVM_ARM_VCPU_PMU_V3_INIT	1
 #define   KVM_ARM_VCPU_PMU_V3_FILTER	2
+#define   KVM_ARM_VCPU_PMU_V3_SET_PMU	3
 #define KVM_ARM_VCPU_TIMER_CTRL		1
 #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
-- 
2.33.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/4] KVM: arm64: Add KVM_ARM_VCPU_PMU_V3_SET_PMU attribute
@ 2021-11-15 16:50   ` Alexandru Elisei
  0 siblings, 0 replies; 26+ messages in thread
From: Alexandru Elisei @ 2021-11-15 16:50 UTC (permalink / raw)
  To: maz, james.morse, suzuki.poulose, will, mark.rutland,
	linux-arm-kernel, kvmarm
  Cc: peter.maydell

When KVM creates an event and there are more than one PMUs present on the
system, perf_init_event() will go through the list of available PMUs and
will choose the first one that can create the event. The order of the PMUs
in the PMU list depends on the probe order, which can change under various
circumstances, for example if the order of the PMU nodes change in the DTB
or if asynchronous driver probing is enabled on the kernel command line
(with the driver_async_probe=armv8-pmu option).

Another consequence of this approach is that, on heteregeneous systems,
all virtual machines that KVM creates will use the same PMU. This might
cause unexpected behaviour for userspace: when a VCPU is executing on
the physical CPU that uses this PMU, PMU events in the guest work
correctly; but when the same VCPU executes on another CPU, PMU events in
the guest will suddenly stop counting.

Fortunately, perf core allows user to specify on which PMU to create an
event by using the perf_event_attr->type field, which is used by
perf_init_event() as an index in the radix tree of available PMUs.

Add the KVM_ARM_VCPU_PMU_V3_CTRL(KVM_ARM_VCPU_PMU_V3_SET_PMU) VCPU
attribute to allow userspace to specify the arm_pmu that KVM will use when
creating events for that VCPU. KVM will make no attempt to run the VCPU on
the physical CPUs that share this PMU, leaving it up to userspace to
manage the VCPU threads' affinity accordingly.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vcpu.rst | 25 ++++++++++++++++++++
 arch/arm64/include/uapi/asm/kvm.h       |  1 +
 arch/arm64/kvm/pmu-emul.c               | 31 +++++++++++++++++++++++--
 include/kvm/arm_pmu.h                   |  1 +
 tools/arch/arm64/include/uapi/asm/kvm.h |  1 +
 5 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 60a29972d3f1..59ac382af59a 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -104,6 +104,31 @@ hardware event. Filtering event 0x1E (CHAIN) has no effect either, as it
 isn't strictly speaking an event. Filtering the cycle counter is possible
 using event 0x11 (CPU_CYCLES).
 
+1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_SET_PMU
+------------------------------------------
+
+:Parameters: in kvm_device_attr.addr the address to an int representing the PMU
+             identifier.
+
+:Returns:
+
+	 =======  ===============================================
+	 -EBUSY   PMUv3 already initialized
+	 -EFAULT  Error accessing the PMU identifier
+	 -EINVAL  PMU not found or PMU name longer than PAGE_SIZE
+	 -ENODEV  PMUv3 not supported or GIC not initialized
+	 -ENOMEM  Could not allocate memory
+	 =======  ===============================================
+
+Request that the VCPU use the specified hardware PMU when creating guest events
+for the purpose of PMU emulation. The PMU identifier can be read from the "type"
+file for the desired PMU instance under /sys/devices (or, equivalently,
+/sys/bus/event_source). This attribute is particularly useful on heterogeneous
+systems where there are at least two PMUs on the system.
+
+Note that KVM will not make any attempts to run the VCPU on the physical CPUs
+associated with the PMU specified by this attribute. This is entirely left to
+userspace.
 
 2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
 =================================
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index b3edde68bc3e..1d0a0a2a9711 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -362,6 +362,7 @@ struct kvm_arm_copy_mte_tags {
 #define   KVM_ARM_VCPU_PMU_V3_IRQ	0
 #define   KVM_ARM_VCPU_PMU_V3_INIT	1
 #define   KVM_ARM_VCPU_PMU_V3_FILTER	2
+#define   KVM_ARM_VCPU_PMU_V3_SET_PMU	3
 #define KVM_ARM_VCPU_TIMER_CTRL		1
 #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index dab335d17409..53cedeb5dbf6 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -602,6 +602,7 @@ static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u64 select_idx)
 static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx)
 {
 	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	struct arm_pmu *arm_pmu = pmu->arm_pmu;
 	struct kvm_pmc *pmc;
 	struct perf_event *event;
 	struct perf_event_attr attr;
@@ -637,8 +638,7 @@ static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx)
 		return;
 
 	memset(&attr, 0, sizeof(struct perf_event_attr));
-	attr.type = PERF_TYPE_RAW;
-	attr.size = sizeof(attr);
+	attr.type = arm_pmu ? arm_pmu->pmu.type : PERF_TYPE_RAW;
 	attr.pinned = 1;
 	attr.disabled = !kvm_pmu_counter_is_enabled(vcpu, pmc->idx);
 	attr.exclude_user = data & ARMV8_PMU_EXCLUDE_EL0 ? 1 : 0;
@@ -941,6 +941,23 @@ static bool pmu_irq_is_valid(struct kvm *kvm, int irq)
 	return true;
 }
 
+static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
+{
+	struct kvm_pmu *kvm_pmu = &vcpu->arch.pmu;
+	struct arm_pmu_entry *entry;
+	struct arm_pmu *arm_pmu;
+
+	list_for_each_entry(entry, &arm_pmus, entry) {
+		arm_pmu = entry->arm_pmu;
+		if (arm_pmu->pmu.type == pmu_id) {
+			kvm_pmu->arm_pmu = arm_pmu;
+			return 0;
+		}
+	}
+
+	return -ENXIO;
+}
+
 int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
 	if (!kvm_vcpu_has_pmu(vcpu))
@@ -1027,6 +1044,15 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 
 		return 0;
 	}
+	case KVM_ARM_VCPU_PMU_V3_SET_PMU: {
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		int pmu_id;
+
+		if (get_user(pmu_id, uaddr))
+			return -EFAULT;
+
+		return kvm_arm_pmu_v3_set_pmu(vcpu, pmu_id);
+	}
 	case KVM_ARM_VCPU_PMU_V3_INIT:
 		return kvm_arm_pmu_v3_init(vcpu);
 	}
@@ -1064,6 +1090,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 	case KVM_ARM_VCPU_PMU_V3_IRQ:
 	case KVM_ARM_VCPU_PMU_V3_INIT:
 	case KVM_ARM_VCPU_PMU_V3_FILTER:
+	case KVM_ARM_VCPU_PMU_V3_SET_PMU:
 		if (kvm_vcpu_has_pmu(vcpu))
 			return 0;
 	}
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index e249c5f172aa..ab3046a8f9bb 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -34,6 +34,7 @@ struct kvm_pmu {
 	bool created;
 	bool irq_level;
 	struct irq_work overflow_work;
+	struct arm_pmu *arm_pmu;
 };
 
 struct arm_pmu_entry {
diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h
index b3edde68bc3e..1d0a0a2a9711 100644
--- a/tools/arch/arm64/include/uapi/asm/kvm.h
+++ b/tools/arch/arm64/include/uapi/asm/kvm.h
@@ -362,6 +362,7 @@ struct kvm_arm_copy_mte_tags {
 #define   KVM_ARM_VCPU_PMU_V3_IRQ	0
 #define   KVM_ARM_VCPU_PMU_V3_INIT	1
 #define   KVM_ARM_VCPU_PMU_V3_FILTER	2
+#define   KVM_ARM_VCPU_PMU_V3_SET_PMU	3
 #define KVM_ARM_VCPU_TIMER_CTRL		1
 #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
-- 
2.33.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/4] KVM: arm64: Refuse to run VCPU if the PMU doesn't match the physical CPU
  2021-11-15 16:50 ` Alexandru Elisei
@ 2021-11-15 16:50   ` Alexandru Elisei
  -1 siblings, 0 replies; 26+ messages in thread
From: Alexandru Elisei @ 2021-11-15 16:50 UTC (permalink / raw)
  To: maz, james.morse, suzuki.poulose, will, mark.rutland,
	linux-arm-kernel, kvmarm

Userspace can assign a PMU to a VCPU with the KVM_ARM_VCPU_PMU_V3_SET_PMU
device ioctl. If the VCPU is scheduled on a physical CPU which has a
different PMU, the perf events needed to emulate a guest PMU won't be
scheduled in and the guest performance counters will stop counting. Treat
this as a userspace error and refuse to run the VCPU in this situation.

The VCPU is flagged as being scheduled on the wrong CPU in vcpu_load(), but
the flag is cleared when KVM_RUN enters the non-preemptible section rather
than in vcpu_put(). This is done on purpose so that the error condition is
communicated to userspace as soon as possible; otherwise, a vcpu_load() on
the wrong CPU followed by a vcpu_put() could clear the flag.

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/api.rst          |  5 +++--
 Documentation/virt/kvm/devices/vcpu.rst |  3 ++-
 arch/arm64/include/asm/kvm_host.h       |  3 +++
 arch/arm64/kvm/arm.c                    | 15 +++++++++++++++
 arch/arm64/kvm/pmu-emul.c               |  1 +
 5 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index aeeb071c7688..5bbad8318ea5 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -396,8 +396,9 @@ Errors:
 
   =======    ==============================================================
   EINTR      an unmasked signal is pending
-  ENOEXEC    the vcpu hasn't been initialized or the guest tried to execute
-             instructions from device memory (arm64)
+  ENOEXEC    the vcpu hasn't been initialized, the guest tried to execute
+             instructions from device memory (arm64) or the vcpu PMU is
+             different from the physical cpu PMU (arm64).
   ENOSYS     data abort outside memslots with no syndrome info and
              KVM_CAP_ARM_NISV_TO_USER not enabled (arm64)
   EPERM      SVE feature set but not finalized (arm64)
diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 59ac382af59a..ca0da34da889 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -128,7 +128,8 @@ systems where there are at least two PMUs on the system.
 
 Note that KVM will not make any attempts to run the VCPU on the physical CPUs
 associated with the PMU specified by this attribute. This is entirely left to
-userspace.
+userspace. However, if the VCPU is scheduled on a CPU which has a different PMU,
+then KVM_RUN will return with the error code ENOEXEC.
 
 2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
 =================================
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2a5f7f38006f..ae2083b41d8a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -385,6 +385,9 @@ struct kvm_vcpu_arch {
 		u64 last_steal;
 		gpa_t base;
 	} steal;
+
+	cpumask_var_t supported_cpus;
+	bool cpu_not_supported;
 };
 
 /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2f03cbfefe67..5dbfd18c4e37 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -320,6 +320,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
 	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
 
+	if (!zalloc_cpumask_var(&vcpu->arch.supported_cpus, GFP_KERNEL))
+		return -ENOMEM;
+
 	/* Set up the timer */
 	kvm_timer_vcpu_init(vcpu);
 
@@ -347,6 +350,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.has_run_once && unlikely(!irqchip_in_kernel(vcpu->kvm)))
 		static_branch_dec(&userspace_irqchip_in_use);
 
+	free_cpumask_var(vcpu->arch.supported_cpus);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
 	kvm_timer_vcpu_terminate(vcpu);
 	kvm_pmu_vcpu_destroy(vcpu);
@@ -425,6 +429,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (vcpu_has_ptrauth(vcpu))
 		vcpu_ptrauth_disable(vcpu);
 	kvm_arch_vcpu_load_debug_state_flags(vcpu);
+
+	if (!cpumask_empty(vcpu->arch.supported_cpus) &&
+	    !cpumask_test_cpu(smp_processor_id(), vcpu->arch.supported_cpus))
+		vcpu->arch.cpu_not_supported = true;
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -815,6 +823,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		 */
 		preempt_disable();
 
+		if (unlikely(vcpu->arch.cpu_not_supported)) {
+			vcpu->arch.cpu_not_supported = false;
+			ret = -ENOEXEC;
+			preempt_enable();
+			continue;
+		}
+
 		kvm_pmu_flush_hwstate(vcpu);
 
 		local_irq_disable();
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index 53cedeb5dbf6..957a6d0cfa56 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -951,6 +951,7 @@ static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
 		arm_pmu = entry->arm_pmu;
 		if (arm_pmu->pmu.type == pmu_id) {
 			kvm_pmu->arm_pmu = arm_pmu;
+			cpumask_copy(vcpu->arch.supported_cpus, &arm_pmu->supported_cpus);
 			return 0;
 		}
 	}
-- 
2.33.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] KVM: arm64: Add KVM_ARM_VCPU_PMU_V3_SET_PMU attribute
  2021-11-15 16:50   ` Alexandru Elisei
@ 2021-11-21 19:11     ` Marc Zyngier
  -1 siblings, 0 replies; 26+ messages in thread
From: Marc Zyngier @ 2021-11-21 19:11 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: will, kvmarm, linux-arm-kernel

On Mon, 15 Nov 2021 16:50:40 +0000,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> When KVM creates an event and there are more than one PMUs present on the
> system, perf_init_event() will go through the list of available PMUs and
> will choose the first one that can create the event. The order of the PMUs
> in the PMU list depends on the probe order, which can change under various
> circumstances, for example if the order of the PMU nodes change in the DTB
> or if asynchronous driver probing is enabled on the kernel command line
> (with the driver_async_probe=armv8-pmu option).
> 
> Another consequence of this approach is that, on heteregeneous systems,
> all virtual machines that KVM creates will use the same PMU. This might
> cause unexpected behaviour for userspace: when a VCPU is executing on
> the physical CPU that uses this PMU, PMU events in the guest work
> correctly; but when the same VCPU executes on another CPU, PMU events in
> the guest will suddenly stop counting.
> 
> Fortunately, perf core allows user to specify on which PMU to create an
> event by using the perf_event_attr->type field, which is used by
> perf_init_event() as an index in the radix tree of available PMUs.
> 
> Add the KVM_ARM_VCPU_PMU_V3_CTRL(KVM_ARM_VCPU_PMU_V3_SET_PMU) VCPU
> attribute to allow userspace to specify the arm_pmu that KVM will use when
> creating events for that VCPU. KVM will make no attempt to run the VCPU on
> the physical CPUs that share this PMU, leaving it up to userspace to
> manage the VCPU threads' affinity accordingly.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  Documentation/virt/kvm/devices/vcpu.rst | 25 ++++++++++++++++++++
>  arch/arm64/include/uapi/asm/kvm.h       |  1 +
>  arch/arm64/kvm/pmu-emul.c               | 31 +++++++++++++++++++++++--
>  include/kvm/arm_pmu.h                   |  1 +
>  tools/arch/arm64/include/uapi/asm/kvm.h |  1 +
>  5 files changed, 57 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 60a29972d3f1..59ac382af59a 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -104,6 +104,31 @@ hardware event. Filtering event 0x1E (CHAIN) has no effect either, as it
>  isn't strictly speaking an event. Filtering the cycle counter is possible
>  using event 0x11 (CPU_CYCLES).
>  
> +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_SET_PMU
> +------------------------------------------
> +
> +:Parameters: in kvm_device_attr.addr the address to an int representing the PMU
> +             identifier.
> +
> +:Returns:
> +
> +	 =======  ===============================================
> +	 -EBUSY   PMUv3 already initialized
> +	 -EFAULT  Error accessing the PMU identifier
> +	 -EINVAL  PMU not found or PMU name longer than PAGE_SIZE
> +	 -ENODEV  PMUv3 not supported or GIC not initialized
> +	 -ENOMEM  Could not allocate memory
> +	 =======  ===============================================
> +
> +Request that the VCPU uses the specified hardware PMU when creating guest events
> +for the purpose of PMU emulation. The PMU identifier can be read from the "type"
> +file for the desired PMU instance under /sys/devices (or, equivalent,
> +/sys/bus/even_source). This attribute is particularly useful on heterogeneous
> +systems where there are at least two PMUs on the system.

nit: CPU PMUs. A number of systems have 'uncore' PMUs which KVM
totally ignores.

> +
> +Note that KVM will not make any attempts to run the VCPU on the physical CPUs
> +associated with the PMU specified by this attribute. This is entirely left to
> +userspace.
>  
>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>  =================================
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index b3edde68bc3e..1d0a0a2a9711 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -362,6 +362,7 @@ struct kvm_arm_copy_mte_tags {
>  #define   KVM_ARM_VCPU_PMU_V3_IRQ	0
>  #define   KVM_ARM_VCPU_PMU_V3_INIT	1
>  #define   KVM_ARM_VCPU_PMU_V3_FILTER	2
> +#define   KVM_ARM_VCPU_PMU_V3_SET_PMU	3
>  #define KVM_ARM_VCPU_TIMER_CTRL		1
>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> index dab335d17409..53cedeb5dbf6 100644
> --- a/arch/arm64/kvm/pmu-emul.c
> +++ b/arch/arm64/kvm/pmu-emul.c
> @@ -602,6 +602,7 @@ static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u64 select_idx)
>  static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx)
>  {
>  	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> +	struct arm_pmu *arm_pmu = pmu->arm_pmu;
>  	struct kvm_pmc *pmc;
>  	struct perf_event *event;
>  	struct perf_event_attr attr;
> @@ -637,8 +638,7 @@ static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx)
>  		return;
>  
>  	memset(&attr, 0, sizeof(struct perf_event_attr));
> -	attr.type = PERF_TYPE_RAW;
> -	attr.size = sizeof(attr);
> +	attr.type = arm_pmu ? arm_pmu->pmu.type : PERF_TYPE_RAW;
>  	attr.pinned = 1;
>  	attr.disabled = !kvm_pmu_counter_is_enabled(vcpu, pmc->idx);
>  	attr.exclude_user = data & ARMV8_PMU_EXCLUDE_EL0 ? 1 : 0;
> @@ -941,6 +941,23 @@ static bool pmu_irq_is_valid(struct kvm *kvm, int irq)
>  	return true;
>  }
>  
> +static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
> +{
> +	struct kvm_pmu *kvm_pmu = &vcpu->arch.pmu;
> +	struct arm_pmu_entry *entry;
> +	struct arm_pmu *arm_pmu;
> +
> +	list_for_each_entry(entry, &arm_pmus, entry) {
> +		arm_pmu = entry->arm_pmu;
> +		if (arm_pmu->pmu.type == pmu_id) {
> +			kvm_pmu->arm_pmu = arm_pmu;
> +			return 0;
> +		}
> +	}

How does this work when a new CPU gets hotplugged on, bringing a new
PMU type along? It doesn't seem safe to parse this list without any
locking.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] KVM: arm64: Refuse to run VCPU if the PMU doesn't match the physical CPU
  2021-11-15 16:50   ` Alexandru Elisei
@ 2021-11-21 19:35     ` Marc Zyngier
  -1 siblings, 0 replies; 26+ messages in thread
From: Marc Zyngier @ 2021-11-21 19:35 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: will, kvmarm, linux-arm-kernel

On Mon, 15 Nov 2021 16:50:41 +0000,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Userspace can assign a PMU to a VCPU with the KVM_ARM_VCPU_PMU_V3_SET_PMU
> device ioctl. If the VCPU is scheduled on a physical CPU which has a
> different PMU, the perf events needed to emulate a guest PMU won't be
> scheduled in and the guest performance counters will stop counting. Treat
> it as an userspace error and refuse to run the VCPU in this situation.
> 
> The VCPU is flagged as being scheduled on the wrong CPU in vcpu_load(), but
> the flag is cleared when the KVM_RUN enters the non-preemptible section
> instead of in vcpu_put(); this has been done on purpose so the error
> condition is communicated as soon as possible to userspace, otherwise
> vcpu_load() on the wrong CPU followed by a vcpu_put() could clear the flag.

Can we make this something orthogonal to the PMU, and get userspace to
pick an affinity mask independently of instantiating a PMU? I can
imagine this would also be useful for SPE on asymmetric systems.

> Suggested-by: Marc Zyngier <maz@kernel.org>
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  Documentation/virt/kvm/api.rst          |  5 +++--
>  Documentation/virt/kvm/devices/vcpu.rst |  3 ++-
>  arch/arm64/include/asm/kvm_host.h       |  3 +++
>  arch/arm64/kvm/arm.c                    | 15 +++++++++++++++
>  arch/arm64/kvm/pmu-emul.c               |  1 +
>  5 files changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index aeeb071c7688..5bbad8318ea5 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -396,8 +396,9 @@ Errors:
>  
>    =======    ==============================================================
>    EINTR      an unmasked signal is pending
> -  ENOEXEC    the vcpu hasn't been initialized or the guest tried to execute
> -             instructions from device memory (arm64)
> +  ENOEXEC    the vcpu hasn't been initialized, the guest tried to execute
> +             instructions from device memory (arm64) or the vcpu PMU is
> +             different from the physical cpu PMU (arm64).
>    ENOSYS     data abort outside memslots with no syndrome info and
>               KVM_CAP_ARM_NISV_TO_USER not enabled (arm64)
>    EPERM      SVE feature set but not finalized (arm64)
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 59ac382af59a..ca0da34da889 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -128,7 +128,8 @@ systems where there are at least two PMUs on the system.
>  
>  Note that KVM will not make any attempts to run the VCPU on the physical CPUs
>  associated with the PMU specified by this attribute. This is entirely left to
> -userspace.
> +userspace. However, if the VCPU is scheduled on a CPU which has a different PMU,
> +then KVM_RUN will return with the error code ENOEXEC.
>  
>  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>  =================================
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 2a5f7f38006f..ae2083b41d8a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -385,6 +385,9 @@ struct kvm_vcpu_arch {
>  		u64 last_steal;
>  		gpa_t base;
>  	} steal;
> +
> +	cpumask_var_t supported_cpus;
> +	bool cpu_not_supported;

Can this just be made a vcpu flag instead?

>  };
>  
>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 2f03cbfefe67..5dbfd18c4e37 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -320,6 +320,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  
>  	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>  
> +	if (!zalloc_cpumask_var(&vcpu->arch.supported_cpus, GFP_KERNEL))
> +		return -ENOMEM;
> +
>  	/* Set up the timer */
>  	kvm_timer_vcpu_init(vcpu);
>  
> @@ -347,6 +350,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>  	if (vcpu->arch.has_run_once && unlikely(!irqchip_in_kernel(vcpu->kvm)))
>  		static_branch_dec(&userspace_irqchip_in_use);
>  
> +	free_cpumask_var(vcpu->arch.supported_cpus);
>  	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
>  	kvm_timer_vcpu_terminate(vcpu);
>  	kvm_pmu_vcpu_destroy(vcpu);
> @@ -425,6 +429,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	if (vcpu_has_ptrauth(vcpu))
>  		vcpu_ptrauth_disable(vcpu);
>  	kvm_arch_vcpu_load_debug_state_flags(vcpu);
> +
> +	if (!cpumask_empty(vcpu->arch.supported_cpus) &&

How about initialising the cpumask to cpu_possible_mask, avoiding the
cpumask_empty check?

> +	    !cpumask_test_cpu(smp_processor_id(), vcpu->arch.supported_cpus))
> +		vcpu->arch.cpu_not_supported = true;

I have the feeling this would actually better be implemented as a
request, but there may be some surgery required for this.

>  }
>  
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> @@ -815,6 +823,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  		 */
>  		preempt_disable();
>  
> +		if (unlikely(vcpu->arch.cpu_not_supported)) {
> +			vcpu->arch.cpu_not_supported = false;
> +			ret = -ENOEXEC;
> +			preempt_enable();

How about populating run->fail_entry with some information? I bet this
would be useful, if only as a debugging tool.

> +			continue;
> +		}
> +
>  		kvm_pmu_flush_hwstate(vcpu);
>  
>  		local_irq_disable();
> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> index 53cedeb5dbf6..957a6d0cfa56 100644
> --- a/arch/arm64/kvm/pmu-emul.c
> +++ b/arch/arm64/kvm/pmu-emul.c
> @@ -951,6 +951,7 @@ static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
>  		arm_pmu = entry->arm_pmu;
>  		if (arm_pmu->pmu.type == pmu_id) {
>  			kvm_pmu->arm_pmu = arm_pmu;
> +			cpumask_copy(vcpu->arch.supported_cpus, &arm_pmu->supported_cpus);
>  			return 0;
>  		}
>  	}

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] KVM: arm64: Add KVM_ARM_VCPU_PMU_V3_SET_PMU attribute
  2021-11-21 19:11     ` Marc Zyngier
@ 2021-11-22 11:29       ` Alexandru Elisei
  -1 siblings, 0 replies; 26+ messages in thread
From: Alexandru Elisei @ 2021-11-22 11:29 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: will, kvmarm, linux-arm-kernel

Hi Marc,

Thanks for having a look!

On Sun, Nov 21, 2021 at 07:11:30PM +0000, Marc Zyngier wrote:
> On Mon, 15 Nov 2021 16:50:40 +0000,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > When KVM creates an event and there are more than one PMUs present on the
> > system, perf_init_event() will go through the list of available PMUs and
> > will choose the first one that can create the event. The order of the PMUs
> > in the PMU list depends on the probe order, which can change under various
> > circumstances, for example if the order of the PMU nodes changes in the DTB
> > or if asynchronous driver probing is enabled on the kernel command line
> > (with the driver_async_probe=armv8-pmu option).
> > 
> > Another consequence of this approach is that, on heterogeneous systems,
> > all virtual machines that KVM creates will use the same PMU. This might
> > cause unexpected behaviour for userspace: when a VCPU is executing on
> > the physical CPU that uses this PMU, PMU events in the guest work
> > correctly; but when the same VCPU executes on another CPU, PMU events in
> > the guest will suddenly stop counting.
> > 
> > Fortunately, perf core allows the user to specify on which PMU to create an
> > event by using the perf_event_attr->type field, which is used by
> > perf_init_event() as an index in the radix tree of available PMUs.
> > 
> > Add the KVM_ARM_VCPU_PMU_V3_CTRL(KVM_ARM_VCPU_PMU_V3_SET_PMU) VCPU
> > attribute to allow userspace to specify the arm_pmu that KVM will use when
> > creating events for that VCPU. KVM will make no attempt to run the VCPU on
> > the physical CPUs that share this PMU, leaving it up to userspace to
> > manage the VCPU threads' affinity accordingly.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  Documentation/virt/kvm/devices/vcpu.rst | 25 ++++++++++++++++++++
> >  arch/arm64/include/uapi/asm/kvm.h       |  1 +
> >  arch/arm64/kvm/pmu-emul.c               | 31 +++++++++++++++++++++++--
> >  include/kvm/arm_pmu.h                   |  1 +
> >  tools/arch/arm64/include/uapi/asm/kvm.h |  1 +
> >  5 files changed, 57 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> > index 60a29972d3f1..59ac382af59a 100644
> > --- a/Documentation/virt/kvm/devices/vcpu.rst
> > +++ b/Documentation/virt/kvm/devices/vcpu.rst
> > @@ -104,6 +104,31 @@ hardware event. Filtering event 0x1E (CHAIN) has no effect either, as it
> >  isn't strictly speaking an event. Filtering the cycle counter is possible
> >  using event 0x11 (CPU_CYCLES).
> >  
> > +1.4 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_SET_PMU
> > +------------------------------------------
> > +
> > +:Parameters: in kvm_device_attr.addr the address to an int representing the PMU
> > +             identifier.
> > +
> > +:Returns:
> > +
> > +	 =======  ===============================================
> > +	 -EBUSY   PMUv3 already initialized
> > +	 -EFAULT  Error accessing the PMU identifier
> > +	 -EINVAL  PMU not found or PMU name longer than PAGE_SIZE
> > +	 -ENODEV  PMUv3 not supported or GIC not initialized
> > +	 -ENOMEM  Could not allocate memory
> > +	 =======  ===============================================
> > +
> > +Request that the VCPU uses the specified hardware PMU when creating guest events
> > +for the purpose of PMU emulation. The PMU identifier can be read from the "type"
> > +file for the desired PMU instance under /sys/devices (or, equivalently,
> > +/sys/bus/event_source). This attribute is particularly useful on heterogeneous
> > +systems where there are at least two PMUs on the system.
> 
> nit: CPU PMUs. A number of systems have 'uncore' PMUs which KVM
> totally ignores.

Sure, will change.

> 
> > +
> > +Note that KVM will not make any attempts to run the VCPU on the physical CPUs
> > +associated with the PMU specified by this attribute. This is entirely left to
> > +userspace.
> >  
> >  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
> >  =================================
> > diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> > index b3edde68bc3e..1d0a0a2a9711 100644
> > --- a/arch/arm64/include/uapi/asm/kvm.h
> > +++ b/arch/arm64/include/uapi/asm/kvm.h
> > @@ -362,6 +362,7 @@ struct kvm_arm_copy_mte_tags {
> >  #define   KVM_ARM_VCPU_PMU_V3_IRQ	0
> >  #define   KVM_ARM_VCPU_PMU_V3_INIT	1
> >  #define   KVM_ARM_VCPU_PMU_V3_FILTER	2
> > +#define   KVM_ARM_VCPU_PMU_V3_SET_PMU	3
> >  #define KVM_ARM_VCPU_TIMER_CTRL		1
> >  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
> >  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
> > diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> > index dab335d17409..53cedeb5dbf6 100644
> > --- a/arch/arm64/kvm/pmu-emul.c
> > +++ b/arch/arm64/kvm/pmu-emul.c
> > @@ -602,6 +602,7 @@ static bool kvm_pmu_counter_is_enabled(struct kvm_vcpu *vcpu, u64 select_idx)
> >  static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx)
> >  {
> >  	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> > +	struct arm_pmu *arm_pmu = pmu->arm_pmu;
> >  	struct kvm_pmc *pmc;
> >  	struct perf_event *event;
> >  	struct perf_event_attr attr;
> > @@ -637,8 +638,7 @@ static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx)
> >  		return;
> >  
> >  	memset(&attr, 0, sizeof(struct perf_event_attr));
> > -	attr.type = PERF_TYPE_RAW;
> > -	attr.size = sizeof(attr);
> > +	attr.type = arm_pmu ? arm_pmu->pmu.type : PERF_TYPE_RAW;
> >  	attr.pinned = 1;
> >  	attr.disabled = !kvm_pmu_counter_is_enabled(vcpu, pmc->idx);
> >  	attr.exclude_user = data & ARMV8_PMU_EXCLUDE_EL0 ? 1 : 0;
> > @@ -941,6 +941,23 @@ static bool pmu_irq_is_valid(struct kvm *kvm, int irq)
> >  	return true;
> >  }
> >  
> > +static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
> > +{
> > +	struct kvm_pmu *kvm_pmu = &vcpu->arch.pmu;
> > +	struct arm_pmu_entry *entry;
> > +	struct arm_pmu *arm_pmu;
> > +
> > +	list_for_each_entry(entry, &arm_pmus, entry) {
> > +		arm_pmu = entry->arm_pmu;
> > +		if (arm_pmu->pmu.type == pmu_id) {
> > +			kvm_pmu->arm_pmu = arm_pmu;
> > +			return 0;
> > +		}
> > +	}
> 
> How does this work when a new CPU gets hotplugged on, bringing a new
> PMU type along? It doesn't seem safe to parse this list without any
> locking.

It wouldn't work at all. I missed the fact that hot-plugging a CPU means
writing to the list of PMUs after the initial driver probing, which happens
at boot. This needs to be protected against concurrent writes; I will fix
it.
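
One way the fix could look (a sketch only, not from the series — the lock
name is an assumption, and the merged code may well structure this
differently, for example by reusing an existing mutex):

```c
/* Sketch: serialise walkers of the global arm_pmus list against
 * hotplug-time additions. arm_pmus_lock is a hypothetical name. */
static DEFINE_MUTEX(arm_pmus_lock);

static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
{
	struct kvm_pmu *kvm_pmu = &vcpu->arch.pmu;
	struct arm_pmu_entry *entry;
	int ret = -EINVAL;	/* PMU not found, per the documented errors */

	mutex_lock(&arm_pmus_lock);
	list_for_each_entry(entry, &arm_pmus, entry) {
		struct arm_pmu *arm_pmu = entry->arm_pmu;

		if (arm_pmu->pmu.type == pmu_id) {
			kvm_pmu->arm_pmu = arm_pmu;
			ret = 0;
			break;
		}
	}
	mutex_unlock(&arm_pmus_lock);

	return ret;
}
```

The hotplug path that appends to arm_pmus would take the same mutex around
its list_add_tail().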

> 
> Thanks,
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] KVM: arm64: Refuse to run VCPU if the PMU doesn't match the physical CPU
  2021-11-21 19:35     ` Marc Zyngier
@ 2021-11-22 12:12       ` Alexandru Elisei
  -1 siblings, 0 replies; 26+ messages in thread
From: Alexandru Elisei @ 2021-11-22 12:12 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: will, kvmarm, linux-arm-kernel

Hi Marc,

On Sun, Nov 21, 2021 at 07:35:13PM +0000, Marc Zyngier wrote:
> On Mon, 15 Nov 2021 16:50:41 +0000,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > Userspace can assign a PMU to a VCPU with the KVM_ARM_VCPU_PMU_V3_SET_PMU
> > device ioctl. If the VCPU is scheduled on a physical CPU which has a
> > different PMU, the perf events needed to emulate a guest PMU won't be
> > scheduled in and the guest performance counters will stop counting. Treat
> > it as a userspace error and refuse to run the VCPU in this situation.
> > 
> > The VCPU is flagged as being scheduled on the wrong CPU in vcpu_load(), but
> > the flag is cleared when the KVM_RUN ioctl enters the non-preemptible section
> > instead of in vcpu_put(); this has been done on purpose so the error
> > condition is communicated as soon as possible to userspace, otherwise
> > vcpu_load() on the wrong CPU followed by a vcpu_put() could clear the flag.
> 
> Can we make this something orthogonal to the PMU, and get userspace to
> pick an affinity mask independently of instantiating a PMU? I can
> imagine this would also be useful for SPE on asymmetric systems.

I actually went this way for the latest version of the SPE series [1] and
dropped the explicit userspace ioctl in favor of this mechanism.

The expectation is that userspace already knows which CPUs are associated
with the chosen PMU (or SPE) when setting the PMU for the VCPU, and having
userspace set it explicitly via an ioctl looks like an unnecessary step to
me. I don't see other use cases for an explicit ioctl outside of the above
two situations (if userspace wants a VCPU to run only on specific CPUs, it
can use thread affinity for that), so I decided to drop it.

For reference, this is what the ioctl looked like in the previous version of
the SPE series [2], in case you want to see how it would look.

[1] https://www.spinics.net/lists/arm-kernel/msg934211.html
[2] https://www.spinics.net/lists/arm-kernel/msg917229.html
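
For readers reconstructing the flow from this thread, the userspace side
being described — look up a CPU PMU's perf "type" id in sysfs, then hand it
to the VCPU via the new attribute — might be sketched as below. This is not
code from the series: the helper names are invented, the attribute value is
taken from the quoted patch, and error handling is minimal.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Values from the quoted arm64 uapi patch; guarded in case the
 * installed headers predate this series. */
#ifndef KVM_ARM_VCPU_PMU_V3_CTRL
#define KVM_ARM_VCPU_PMU_V3_CTRL	0
#endif
#ifndef KVM_ARM_VCPU_PMU_V3_SET_PMU
#define KVM_ARM_VCPU_PMU_V3_SET_PMU	3
#endif

/* Parse a perf "type" id from a sysfs-style file; returns -1 on error. */
int read_type_file(const char *path)
{
	FILE *f = fopen(path, "r");
	int type = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &type) != 1)
		type = -1;
	fclose(f);
	return type;
}

/* Look up the type id of a named CPU PMU instance; the instance names
 * under /sys/bus/event_source/devices vary by platform. */
int read_pmu_type(const char *pmu_name)
{
	char path[256];

	snprintf(path, sizeof(path),
		 "/sys/bus/event_source/devices/%s/type", pmu_name);
	return read_type_file(path);
}

/* Assign that PMU to a VCPU; vcpu_fd comes from KVM_CREATE_VCPU. */
int vcpu_set_pmu(int vcpu_fd, int pmu_type)
{
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_PMU_V3_CTRL,
		.attr  = KVM_ARM_VCPU_PMU_V3_SET_PMU,
		.addr  = (unsigned long)&pmu_type,
	};

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}
```

The VMM would call vcpu_set_pmu() before KVM_ARM_VCPU_PMU_V3_INIT, and
separately pin the VCPU thread to the CPUs backing the chosen PMU, since KVM
leaves affinity entirely to userspace.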

> 
> > Suggested-by: Marc Zyngier <maz@kernel.org>
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  Documentation/virt/kvm/api.rst          |  5 +++--
> >  Documentation/virt/kvm/devices/vcpu.rst |  3 ++-
> >  arch/arm64/include/asm/kvm_host.h       |  3 +++
> >  arch/arm64/kvm/arm.c                    | 15 +++++++++++++++
> >  arch/arm64/kvm/pmu-emul.c               |  1 +
> >  5 files changed, 24 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index aeeb071c7688..5bbad8318ea5 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -396,8 +396,9 @@ Errors:
> >  
> >    =======    ==============================================================
> >    EINTR      an unmasked signal is pending
> > -  ENOEXEC    the vcpu hasn't been initialized or the guest tried to execute
> > -             instructions from device memory (arm64)
> > +  ENOEXEC    the vcpu hasn't been initialized, the guest tried to execute
> > +             instructions from device memory (arm64) or the vcpu PMU is
> > +             different from the physical cpu PMU (arm64).
> >    ENOSYS     data abort outside memslots with no syndrome info and
> >               KVM_CAP_ARM_NISV_TO_USER not enabled (arm64)
> >    EPERM      SVE feature set but not finalized (arm64)
> > diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> > index 59ac382af59a..ca0da34da889 100644
> > --- a/Documentation/virt/kvm/devices/vcpu.rst
> > +++ b/Documentation/virt/kvm/devices/vcpu.rst
> > @@ -128,7 +128,8 @@ systems where there are at least two PMUs on the system.
> >  
> >  Note that KVM will not make any attempts to run the VCPU on the physical CPUs
> >  associated with the PMU specified by this attribute. This is entirely left to
> > -userspace.
> > +userspace. However, if the VCPU is scheduled on a CPU which has a different PMU,
> > +then KVM_RUN will return with the error code ENOEXEC.
> >  
> >  2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
> >  =================================
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 2a5f7f38006f..ae2083b41d8a 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -385,6 +385,9 @@ struct kvm_vcpu_arch {
> >  		u64 last_steal;
> >  		gpa_t base;
> >  	} steal;
> > +
> > +	cpumask_var_t supported_cpus;
> > +	bool cpu_not_supported;
> 
> Can this just be made a vcpu flag instead?

Sure, that can also work.

> 
> >  };
> >  
> >  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 2f03cbfefe67..5dbfd18c4e37 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -320,6 +320,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >  
> >  	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> >  
> > +	if (!zalloc_cpumask_var(&vcpu->arch.supported_cpus, GFP_KERNEL))
> > +		return -ENOMEM;
> > +
> >  	/* Set up the timer */
> >  	kvm_timer_vcpu_init(vcpu);
> >  
> > @@ -347,6 +350,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> >  	if (vcpu->arch.has_run_once && unlikely(!irqchip_in_kernel(vcpu->kvm)))
> >  		static_branch_dec(&userspace_irqchip_in_use);
> >  
> > +	free_cpumask_var(vcpu->arch.supported_cpus);
> >  	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
> >  	kvm_timer_vcpu_terminate(vcpu);
> >  	kvm_pmu_vcpu_destroy(vcpu);
> > @@ -425,6 +429,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >  	if (vcpu_has_ptrauth(vcpu))
> >  		vcpu_ptrauth_disable(vcpu);
> >  	kvm_arch_vcpu_load_debug_state_flags(vcpu);
> > +
> > +	if (!cpumask_empty(vcpu->arch.supported_cpus) &&
> 
> How about initialising the cpumask to cpu_possible_mask, avoiding the
> cpumask_empty check?

That's a great idea, I will change it.

> 
> > +	    !cpumask_test_cpu(smp_processor_id(), vcpu->arch.supported_cpus))
> > +		vcpu->arch.cpu_not_supported = true;
> 
> I have the feeling this would actually better be implemented as a
> request, but there may be some surgery required for this.

I can give it a go, see how it looks.

> 
> >  }
> >  
> >  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> > @@ -815,6 +823,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> >  		 */
> >  		preempt_disable();
> >  
> > +		if (unlikely(vcpu->arch.cpu_not_supported)) {
> > +			vcpu->arch.cpu_not_supported = false;
> > +			ret = -ENOEXEC;
> > +			preempt_enable();
> 
> How about populating run->fail_entry with some information? I bet this
> would be useful, if only as a debugging tool.

That's another great idea, will definitely add it.

Thanks,
Alex

> 
> > +			continue;
> > +		}
> > +
> >  		kvm_pmu_flush_hwstate(vcpu);
> >  
> >  		local_irq_disable();
> > diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> > index 53cedeb5dbf6..957a6d0cfa56 100644
> > --- a/arch/arm64/kvm/pmu-emul.c
> > +++ b/arch/arm64/kvm/pmu-emul.c
> > @@ -951,6 +951,7 @@ static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
> >  		arm_pmu = entry->arm_pmu;
> >  		if (arm_pmu->pmu.type == pmu_id) {
> >  			kvm_pmu->arm_pmu = arm_pmu;
> > +			cpumask_copy(vcpu->arch.supported_cpus, &arm_pmu->supported_cpus);
> >  			return 0;
> >  		}
> >  	}
> 
> Thanks,
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

* Re: [PATCH 4/4] KVM: arm64: Refuse to run VCPU if the PMU doesn't match the physical CPU
  2021-11-22 12:12       ` Alexandru Elisei
@ 2021-11-22 14:21         ` Marc Zyngier
  -1 siblings, 0 replies; 26+ messages in thread
From: Marc Zyngier @ 2021-11-22 14:21 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: will, kvmarm, linux-arm-kernel

On Mon, 22 Nov 2021 12:12:17 +0000,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> On Sun, Nov 21, 2021 at 07:35:13PM +0000, Marc Zyngier wrote:
> > On Mon, 15 Nov 2021 16:50:41 +0000,
> > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > > 
> > > Userspace can assign a PMU to a VCPU with the KVM_ARM_VCPU_PMU_V3_SET_PMU
> > > device ioctl. If the VCPU is scheduled on a physical CPU which has a
> > > different PMU, the perf events needed to emulate a guest PMU won't be
> > > scheduled in and the guest performance counters will stop counting. Treat
> > > it as an userspace error and refuse to run the VCPU in this situation.
> > > 
> > > The VCPU is flagged as being scheduled on the wrong CPU in vcpu_load(), but
> > > the flag is cleared when the KVM_RUN enters the non-preemptible section
> > > instead of in vcpu_put(); this has been done on purpose so the error
> > > condition is communicated as soon as possible to userspace, otherwise
> > > vcpu_load() on the wrong CPU followed by a vcpu_put() could clear the flag.
> > 
> > Can we make this something orthogonal to the PMU, and get userspace to
> > pick an affinity mask independently of instantiating a PMU? I can
> > imagine this would also be useful for SPE on asymmetric systems.
> 
> I actually went this way for the latest version of the SPE series [1] and
> dropped the explicit userspace ioctl in favor of this mechanism.
> 
> The expectation is that userspace already knows which CPUs are associated
> with the chosen PMU (or SPE) when setting the PMU for the VCPU, and having
> userspace set it explicitely via an ioctl looks like an unnecessary step to
> me. I don't see other usecases of an explicit ioctl outside of the above
> two situation (if userspace wants a VCPU to run only on specific CPUs, it
> can use thread affinity for that), so I decided to drop it.

My problem with that is that if you have (for whatever reason) a set
of affinities that are not strictly identical for both PMU and SPE,
and expose both of these to a guest, what do you choose?

As long as you have a single affinity set to take care of, you're
good. It is when you have several ones that it becomes ugly (as with
anything involving asymmetric CPUs).

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH 4/4] KVM: arm64: Refuse to run VCPU if the PMU doesn't match the physical CPU
  2021-11-22 14:21         ` Marc Zyngier
@ 2021-11-22 14:43           ` Alexandru Elisei
  -1 siblings, 0 replies; 26+ messages in thread
From: Alexandru Elisei @ 2021-11-22 14:43 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: will, kvmarm, linux-arm-kernel

Hi Marc,

On Mon, Nov 22, 2021 at 02:21:00PM +0000, Marc Zyngier wrote:
> On Mon, 22 Nov 2021 12:12:17 +0000,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > Hi Marc,
> > 
> > On Sun, Nov 21, 2021 at 07:35:13PM +0000, Marc Zyngier wrote:
> > > On Mon, 15 Nov 2021 16:50:41 +0000,
> > > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > > > 
> > > > Userspace can assign a PMU to a VCPU with the KVM_ARM_VCPU_PMU_V3_SET_PMU
> > > > device ioctl. If the VCPU is scheduled on a physical CPU which has a
> > > > different PMU, the perf events needed to emulate a guest PMU won't be
> > > > scheduled in and the guest performance counters will stop counting. Treat
> > > > > it as a userspace error and refuse to run the VCPU in this situation.
> > > > 
> > > > The VCPU is flagged as being scheduled on the wrong CPU in vcpu_load(), but
> > > > the flag is cleared when the KVM_RUN enters the non-preemptible section
> > > > instead of in vcpu_put(); this has been done on purpose so the error
> > > > condition is communicated as soon as possible to userspace, otherwise
> > > > vcpu_load() on the wrong CPU followed by a vcpu_put() could clear the flag.
> > > 
> > > Can we make this something orthogonal to the PMU, and get userspace to
> > > pick an affinity mask independently of instantiating a PMU? I can
> > > imagine this would also be useful for SPE on asymmetric systems.
> > 
> > I actually went this way for the latest version of the SPE series [1] and
> > dropped the explicit userspace ioctl in favor of this mechanism.
> > 
> > The expectation is that userspace already knows which CPUs are associated
> > with the chosen PMU (or SPE) when setting the PMU for the VCPU, and having
> > userspace set it explicitly via an ioctl looks like an unnecessary step to
> > me. I don't see other usecases of an explicit ioctl outside of the above
> > two situations (if userspace wants a VCPU to run only on specific CPUs, it
> > can use thread affinity for that), so I decided to drop it.
> 
> My problem with that is that if you have (for whatever reason) a set
> of affinities that are not strictly identical for both PMU and SPE,
> and expose both of these to a guest, what do you choose?
> 
> As long as you have a single affinity set to take care of, you're
> good. It is when you have several ones that it becomes ugly (as with
> anything involving asymmetric CPUs).

I thought about this when I decided to do it this way. My solution was to do
a cpumask_and() with the existing VCPU cpumask when setting a VCPU feature
that requires it, and to return an error if the result is an empty cpumask,
because that means userspace is requesting a combination of VCPU features
that is not supported by the hardware.

Going with the other solution (the user sets the cpumask via an ioctl), KVM
would still have to check against certain combinations of VCPU features (for
SPE it's mandatory, so KVM doesn't trigger an undefined exception; we could
skip the check for the PMU, but then what do we gain from the ioctl if KVM
doesn't check that it matches the PMU?), so I don't think we lose anything
by going with the implicit cpumask.

What do you think?

Thanks,
Alex

> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.

* Re: [PATCH 4/4] KVM: arm64: Refuse to run VCPU if the PMU doesn't match the physical CPU
  2021-11-22 14:43           ` Alexandru Elisei
@ 2021-12-06 10:15             ` Marc Zyngier
  -1 siblings, 0 replies; 26+ messages in thread
From: Marc Zyngier @ 2021-12-06 10:15 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: will, kvmarm, linux-arm-kernel

On Mon, 22 Nov 2021 14:43:17 +0000,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> On Mon, Nov 22, 2021 at 02:21:00PM +0000, Marc Zyngier wrote:
> > On Mon, 22 Nov 2021 12:12:17 +0000,
> > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > > 
> > > Hi Marc,
> > > 
> > > On Sun, Nov 21, 2021 at 07:35:13PM +0000, Marc Zyngier wrote:
> > > > On Mon, 15 Nov 2021 16:50:41 +0000,
> > > > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > > > > 
> > > > > Userspace can assign a PMU to a VCPU with the KVM_ARM_VCPU_PMU_V3_SET_PMU
> > > > > device ioctl. If the VCPU is scheduled on a physical CPU which has a
> > > > > different PMU, the perf events needed to emulate a guest PMU won't be
> > > > > scheduled in and the guest performance counters will stop counting. Treat
> > > > > > it as a userspace error and refuse to run the VCPU in this situation.
> > > > > 
> > > > > The VCPU is flagged as being scheduled on the wrong CPU in vcpu_load(), but
> > > > > the flag is cleared when the KVM_RUN enters the non-preemptible section
> > > > > instead of in vcpu_put(); this has been done on purpose so the error
> > > > > condition is communicated as soon as possible to userspace, otherwise
> > > > > vcpu_load() on the wrong CPU followed by a vcpu_put() could clear the flag.
> > > > 
> > > > Can we make this something orthogonal to the PMU, and get userspace to
> > > > pick an affinity mask independently of instantiating a PMU? I can
> > > > imagine this would also be useful for SPE on asymmetric systems.
> > > 
> > > I actually went this way for the latest version of the SPE series [1] and
> > > dropped the explicit userspace ioctl in favor of this mechanism.
> > > 
> > > The expectation is that userspace already knows which CPUs are associated
> > > with the chosen PMU (or SPE) when setting the PMU for the VCPU, and having
> > > > userspace set it explicitly via an ioctl looks like an unnecessary step to
> > > me. I don't see other usecases of an explicit ioctl outside of the above
> > > > two situations (if userspace wants a VCPU to run only on specific CPUs, it
> > > can use thread affinity for that), so I decided to drop it.
> > 
> > My problem with that is that if you have (for whatever reason) a set
> > of affinities that are not strictly identical for both PMU and SPE,
> > and expose both of these to a guest, what do you choose?
> > 
> > As long as you have a single affinity set to take care of, you're
> > good. It is when you have several ones that it becomes ugly (as with
> > anything involving asymmetric CPUs).
> 
> I thought about it when I decided to do it this way, my solution was to do
> a cpumask_and() with the existing VCPU cpumask when setting a VCPU feature
> that requires it, and returning an error if we get an empty cpumask,
> because userspace is requesting a combination of VCPU features that is not
> supported by the hardware.

So every new asymmetric feature would come with its own potential
affinity mask, and KVM would track the restriction of that affinity. I
guess that because it can only converge to zero, this is safe by
design...

One thing I want to make sure is that we can evaluate the mask very
early on, and reduce the overhead of that evaluation.

> Going with the other solution (user sets the cpumask via an ioctl), KVM
> would still have to check against certain combinations of VCPU features
> (for SPE it's mandatory, so KVM doesn't trigger an undefined exception, we
> could skip the check for PMU, but then what do we gain from the ioctl if
> KVM doesn't check that it matches the PMU?), so I don't think we lose
> anything by going with the implicit cpumask.
> 
> What do you think?

OK, fair enough. Please respin the series (I had a bunch of minor
comments), and I'll have another look.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH 4/4] KVM: arm64: Refuse to run VCPU if the PMU doesn't match the physical CPU
  2021-12-06 10:15             ` Marc Zyngier
@ 2021-12-06 10:26               ` Alexandru Elisei
  -1 siblings, 0 replies; 26+ messages in thread
From: Alexandru Elisei @ 2021-12-06 10:26 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: will, kvmarm, linux-arm-kernel

Hi Marc,

On Mon, Dec 06, 2021 at 10:15:31AM +0000, Marc Zyngier wrote:
> On Mon, 22 Nov 2021 14:43:17 +0000,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > Hi Marc,
> > 
> > On Mon, Nov 22, 2021 at 02:21:00PM +0000, Marc Zyngier wrote:
> > > On Mon, 22 Nov 2021 12:12:17 +0000,
> > > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > > > 
> > > > Hi Marc,
> > > > 
> > > > On Sun, Nov 21, 2021 at 07:35:13PM +0000, Marc Zyngier wrote:
> > > > > On Mon, 15 Nov 2021 16:50:41 +0000,
> > > > > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > > > > > 
> > > > > > Userspace can assign a PMU to a VCPU with the KVM_ARM_VCPU_PMU_V3_SET_PMU
> > > > > > device ioctl. If the VCPU is scheduled on a physical CPU which has a
> > > > > > different PMU, the perf events needed to emulate a guest PMU won't be
> > > > > > scheduled in and the guest performance counters will stop counting. Treat
> > > > > > it as a userspace error and refuse to run the VCPU in this situation.
> > > > > > 
> > > > > > The VCPU is flagged as being scheduled on the wrong CPU in vcpu_load(), but
> > > > > > the flag is cleared when the KVM_RUN enters the non-preemptible section
> > > > > > instead of in vcpu_put(); this has been done on purpose so the error
> > > > > > condition is communicated as soon as possible to userspace, otherwise
> > > > > > vcpu_load() on the wrong CPU followed by a vcpu_put() could clear the flag.
> > > > > 
> > > > > Can we make this something orthogonal to the PMU, and get userspace to
> > > > > pick an affinity mask independently of instantiating a PMU? I can
> > > > > imagine this would also be useful for SPE on asymmetric systems.
> > > > 
> > > > I actually went this way for the latest version of the SPE series [1] and
> > > > dropped the explicit userspace ioctl in favor of this mechanism.
> > > > 
> > > > The expectation is that userspace already knows which CPUs are associated
> > > > with the chosen PMU (or SPE) when setting the PMU for the VCPU, and having
> > > > userspace set it explicitly via an ioctl looks like an unnecessary step to
> > > > me. I don't see other use cases for an explicit ioctl outside of the above
> > > > two situations (if userspace wants a VCPU to run only on specific CPUs, it
> > > > can use thread affinity for that), so I decided to drop it.
> > > 
> > > My problem with that is that if you have (for whatever reason) a set
> > > of affinities that are not strictly identical for both PMU and SPE,
> > > and expose both of these to a guest, what do you choose?
> > > 
> > > As long as you have a single affinity set to take care of, you're
> > > good. It is when you have several ones that it becomes ugly (as with
> > > anything involving asymmetric CPUs).
> > 
> > I thought about it when I decided to do it this way; my solution was to do
> > a cpumask_and() with the existing VCPU cpumask when setting a VCPU feature
> > that requires it, and returning an error if we get an empty cpumask,
> > because userspace is requesting a combination of VCPU features that is not
> > supported by the hardware.
> 
> So every new asymmetric feature would come with its own potential
> affinity mask, and KVM would track the restriction of that affinity. I
> guess that because it can only converge to zero, this is safe by
> design...
> 
> One thing I want to make sure is that we can evaluate the mask very
> early on, and reduce the overhead of that evaluation.

I don't think the check can be made any sooner than when the feature bit is set,
which is what I am proposing :)

> 
> > Going with the other solution (user sets the cpumask via an ioctl), KVM
> > would still have to check against certain combinations of VCPU features
> > (for SPE it's mandatory, so KVM doesn't trigger an undefined exception, we
> > could skip the check for PMU, but then what do we gain from the ioctl if
> > KVM doesn't check that it matches the PMU?), so I don't think we lose
> > anything by going with the implicit cpumask.
> > 
> > What do you think?
> 
> OK, fair enough. Please respin the series (I had a bunch of minor
> comments), and I'll have another look.

Great, thanks!

Alex

> 
> Thanks,
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-12-06 10:28 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-15 16:50 [PATCH 0/4] KVM: arm64: Improve PMU support on heterogeneous systems Alexandru Elisei
2021-11-15 16:50 ` Alexandru Elisei
2021-11-15 16:50 ` [PATCH 1/4] perf: Fix wrong name in comment for struct perf_cpu_context Alexandru Elisei
2021-11-15 16:50   ` Alexandru Elisei
2021-11-15 16:50 ` [PATCH 2/4] KVM: arm64: Keep a list of probed PMUs Alexandru Elisei
2021-11-15 16:50   ` Alexandru Elisei
2021-11-15 16:50 ` [PATCH 3/4] KVM: arm64: Add KVM_ARM_VCPU_PMU_V3_SET_PMU attribute Alexandru Elisei
2021-11-15 16:50   ` Alexandru Elisei
2021-11-21 19:11   ` Marc Zyngier
2021-11-21 19:11     ` Marc Zyngier
2021-11-22 11:29     ` Alexandru Elisei
2021-11-22 11:29       ` Alexandru Elisei
2021-11-15 16:50 ` [PATCH 4/4] KVM: arm64: Refuse to run VCPU if the PMU doesn't match the physical CPU Alexandru Elisei
2021-11-15 16:50   ` Alexandru Elisei
2021-11-21 19:35   ` Marc Zyngier
2021-11-21 19:35     ` Marc Zyngier
2021-11-22 12:12     ` Alexandru Elisei
2021-11-22 12:12       ` Alexandru Elisei
2021-11-22 14:21       ` Marc Zyngier
2021-11-22 14:21         ` Marc Zyngier
2021-11-22 14:43         ` Alexandru Elisei
2021-11-22 14:43           ` Alexandru Elisei
2021-12-06 10:15           ` Marc Zyngier
2021-12-06 10:15             ` Marc Zyngier
2021-12-06 10:26             ` Alexandru Elisei
2021-12-06 10:26               ` Alexandru Elisei
