From: "Russell King (Oracle)" <linux@armlinux.org.uk>
To: linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	kvmarm@lists.cs.columbia.edu
Cc: Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>,
	Alexandru Elisei <alexandru.elisei@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Will Deacon <will@kernel.org>
Subject: REGRESSION: Upgrading host kernel from 5.11 to 5.13 breaks QEMU guests - perf/fw_devlink/kvm
Date: Sat, 18 Sep 2021 17:17:45 +0100	[thread overview]
Message-ID: <YUYRKVflRtUytzy5@shell.armlinux.org.uk> (raw)

Hi,

This morning, I upgraded my VM host from Debian Buster to Debian
Bullseye. This host (Armada 8040) runs libvirt. I placed all the VMs
into managedsave, which basically means their state is saved out by
QEMU ready to be resumed once the upgrade is complete.

Initially, I was running 5.11 on the host, which didn't have POSIX
ACLs enabled on tmpfs. Sadly, upgrading to Bullseye now requires
this to be enabled, so I upgraded the kernel to 5.13 to avoid this
problem - without POSIX ACLs on tmpfs, qemu refuses to even start.

Under Bullseye with 5.13, I could start new VMs, but I could not
resume the saved VMs - it would spit out:

2021-09-18T11:14:14.456227Z qemu-system-aarch64: Invalid value 236 expecting positive value <= 162
2021-09-18T11:14:14.456269Z qemu-system-aarch64: Failed to load cpu:cpreg_vmstate_array_len
2021-09-18T11:14:14.456279Z qemu-system-aarch64: error while loading state for instance 0x0 of device 'cpu'
2021-09-18T11:14:14.456887Z qemu-system-aarch64: load of migration failed: Invalid argument

Essentially, what this is complaining about is that the register
list returned by the KVM_GET_REG_LIST ioctl has reduced in size,
meaning that the saved VM has more registers that need to be set
(236 of them, after QEMU's filtering) than those which are actually
available under 5.13 (162 after QEMU's filtering).

After much debugging, and writing a program to create a VM and CPU,
and retrieve the register list etc., I have finally tracked down
exactly what is going on...

Under 5.13, KVM believes there is no PMU, so it doesn't advertise the
PMU registers, despite the hardware having a PMU and the kernel
having support.

kvm_arm_support_pmu_v3() determines whether the PMU_V3 feature is
available or not.

Under 5.11, this was determined via:

	perf_num_counters() > 0

which in turn was derived from (essentially):

	__oprofile_cpu_pmu && __oprofile_cpu_pmu->num_events > 0

__oprofile_cpu_pmu is set once the PMU driver has initialised, and
due to initialisation ordering, on this platform that happens much
later in the kernel boot.

However, under 5.13, oprofile has been removed, and this test is no
longer present. Instead, KVM attempts to determine the availability
of PMUv3 when it initialises, and fails to do so because the PMU has
not yet been initialised. If there is no PMU at the point KVM
initialises, then KVM will never advertise a PMU.

This change of behaviour is what breaks KVM on Armada 8040, causing
a userspace regression.

What makes this more confusing is that the PMU errors have gone:

5.13:
[    0.170514] PCI: CLS 0 bytes, default 64
[    0.171085] kvm [1]: IPA Size Limit: 44 bits
[    0.172532] kvm [1]: vgic interrupt IRQ9
[    0.172650] kvm: pmu event creation failed -2
[    0.172688] kvm [1]: Hyp mode initialized successfully
...
[    1.479833] hw perfevents: enabled with armv8_cortex_a72 PMU driver, 7 counters available

5.11:
[    0.138831] PCI: CLS 0 bytes, default 64
[    0.139245] hw perfevents: unable to count PMU IRQs
[    0.139251] hw perfevents: /ap806/config-space@f0000000/pmu: failed to register PMU devices!
[    0.139503] kvm [1]: IPA Size Limit: 44 bits
[    0.140735] kvm [1]: vgic interrupt IRQ9
[    0.140836] kvm [1]: Hyp mode initialized successfully
...
[    1.447789] hw perfevents: enabled with armv8_cortex_a72 PMU driver, 7 counters available

Overall, what this means is that the only way to safely upgrade from
5.11 to 5.13 (I don't know where 5.12 fits in this) is to completely
shut down all your VMs. You can't even migrate them to another
identical machine across these kernels. You have to completely shut
them down.

I did manage to eventually rescue my VMs after many hours, by booting
back into 5.11, and then screwing around with bind mounts to replace
/dev with a filesystem that did have POSIX ACLs enabled, so libvirt
and qemu could then resume the VMs and cleanly shut them all down.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


