kvmarm.lists.cs.columbia.edu archive mirror
* [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support
@ 2020-10-27 17:26 Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it Alexandru Elisei
                   ` (15 more replies)
  0 siblings, 16 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

Statistical Profiling Extension (SPE) is an optional feature added in
ARMv8.2. It allows sampling at regular intervals of the operations executed
by the PE and storing a record of each operation in a memory buffer. A high
level overview of the extension is presented in an article on arm.com [1].

This series implements SPE support for KVM guests. The series is based on
v5.10-rc1 and has been almost completely rewritten, but I've tried to keep
some patches from v2 [2] and from the initial version of the series [3].
The series can also be found in a repo [4] to make testing easier.

This series is firmly in RFC territory for several reasons:

* It introduces a userspace API to pre-map guest memory at stage 2, which
  I think deserves some discussion before we commit to it.

* The way I'm handling the SPE interrupt is completely different from what
  was implemented in v2.

* The SPE state save/restore code unconditionally saves the host SPE state
  on VM entry and restores it on VM exit, regardless of whether the host is
  actually profiling. I plan to improve this in future iterations.

I am also interested to know why the SPE header lives in
include/kvm/arm_spe.h instead of arch/arm64/include/asm/kvm_spe.h. My guess
is that the headers there are for code that was shared with KVM arm. Since
KVM arm was removed, I would like to move the header to arch/arm64, but I
wanted to make sure that is acceptable.

The profiling buffer
====================

KVM cannot handle SPE stage 2 faults, so the guest memory must be
memory-resident and mapped at stage 2 for the entire lifetime of the guest.
More details are in patch #10 ("KVM: arm64: Add a new VM device control
group for SPE").

This is achieved with the help of userspace in two stages:

1. Userspace calls mlock() on the VMAs that represent the guest memory.

2. After userspace has copied everything to the guest memory, it uses the
   KVM_ARM_VM_SPE_CTRL(KVM_ARM_VM_SPE_FINALIZE) ioctl to tell KVM to map
   all VM_LOCKED and VM_HUGETLB VMAs at stage 2 (patch #10 explains why
   VM_HUGETLB VMAs are also mapped); a sketch of the sequence is below.
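
To make the flow concrete, here is a minimal sketch of the userspace side
(error handling omitted; guest_ram and ram_size are placeholders for the
VMM's guest RAM mapping, and the KVM_ARM_VM_SPE_* constants are the ones
added by this series):

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Sketch: pin the guest RAM, populate it, then ask KVM to pre-map it. */
static int spe_finalize_vm(int vm_fd, void *guest_ram, size_t ram_size)
{
	struct kvm_device_attr attr = {
		.group	= KVM_ARM_VM_SPE_CTRL,		/* added by this series */
		.attr	= KVM_ARM_VM_SPE_FINALIZE,	/* added by this series */
	};

	/* 1. Lock the VMAs backing the guest memory (sets VM_LOCKED). */
	if (mlock(guest_ram, ram_size))
		return -1;

	/* ... load the kernel/initrd/firmware into guest memory here ... */

	/* 2. Ask KVM to map all VM_LOCKED and VM_HUGETLB VMAs at stage 2. */
	return ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &attr);
}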

I have added support for SPE to kvmtool; the patches are on the mailing
list [5], as well as in a repo [6] for easy testing.

There are some things that I'm not 100% sure about and I would like to get
some feedback before we commit to an ABI:

* At the moment, having SPE enabled for a guest forces unmapping of the
  guest memory when the VCPU is reset. This is done to make sure the
  dcaches are cleaned to POC when the VM starts. It isn't necessary when
  the system has FWB, but I decided to unmap the guest memory even in this
  case for two reasons:

  1. Userspace doesn't know when FWB is available, and thus whether the
     finalize call is necessary.

  2. I haven't seen anywhere in the documentation a statement regarding
     changing memslots while the VM is in the process of resetting, so I am
     assuming it's not forbidden (please correct me if I'm wrong).

If it's forbidden to change memslots when resetting the VM, then we could
add an extension or something similar that tells userspace whether a
finalize call is required after a VM reset.

* Instead of a SPE control group we could have a KVM_ARM_VM_FINALIZE ioctl
  on the vm fd, similar to KVM_ARM_VCPU_FINALIZE. I don't have a strong
  preference for either; the reason for the current implementation is that
  I hadn't thought about KVM_ARM_VM_FINALIZE until the series was almost
  finished.

The buffer interrupt
====================

Also referred to in the Arm ARM as the Profiling Buffer management
interrupt. The guest SPE interrupt handling has been completely reworked
and is now handled by checking the service bit in the PMBSR_EL1 register on
every switch to the host; the implementation is in patch #14 ("KVM: arm64:
Emulate SPE buffer management event interrupt").
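
To make the approach concrete, below is a rough sketch of the host-side
check (this is not the code from patch #14; it assumes the guest PMBSR_EL1
value has already been saved in the VCPU sysreg context under a PMBSR_EL1
index added by this series, and it reuses the kvm_spe_cpu fields and
kvm_vgic_inject_irq()):

/*
 * Sketch only: after a switch to the host, look at the guest's saved
 * PMBSR_EL1 and, if the service bit is set, assert the virtual SPE PPI
 * that userspace configured for this VCPU.
 */
static void kvm_spe_sync_buffer_irq(struct kvm_vcpu *vcpu)
{
	u64 pmbsr = __vcpu_sys_reg(vcpu, PMBSR_EL1);

	if (!(pmbsr & BIT(SYS_PMBSR_EL1_S_SHIFT)))
		return;

	/* The interrupt stays asserted until the guest clears PMBSR_EL1.S. */
	if (vcpu->arch.spe_cpu.irq_level)
		return;

	vcpu->arch.spe_cpu.irq_level = true;
	kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id,
			    vcpu->arch.spe_cpu.irq_num, true,
			    &vcpu->arch.spe_cpu);
}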

Another option that I considered was to change the host irq handler for the
SPE interrupt to check kvm_get_running_vcpu() and defer the handling of the
interrupt to the KVM code. There are a few reasons I decided against it:

* We need to keep the PMBSR_EL1.S bit set until KVM enables interrupts,
  which means that the host won't be able to profile KVM between
  vcpu_load()/vcpu_put().

* Software can trigger the interrupt with a write to the PMBSR_EL1 register
  that sets the service bit. This means that the KVM irq handler won't be
  able to distinguish between the guest configuring PMBSR_EL1 to report a
  stage 2 fault, which is harmless for the host, and the hardware reporting
  it, which can indicate a bug. Even more serious, KVM won't be able to
  distinguish between a PMBSR_EL1 value indicating an External Abort written
  by the guest (again, harmless) and one reported by the hardware, which is
  pretty serious.

This is what the architecture says about SPE external aborts, on page
D9-2806:

"A write to the Profiling Buffer might generate an external abort,
including an external abort on a translation table walk or translation
table update. It is an IMPLEMENTATION DEFINED choice whether such an
external abort:
* Is reported to the Statistical Profiling Extension and treated as a
  Profiling Buffer management event.
* Generates an SError interrupt exception."

I decided to treat the SPE external abort like an SError and panic.
However, I'm not 100% sure that's the right thing to do because the SPE
driver never checks the PMBSR_EL1.EA bit.

There is an argument to be made against my approach to handling the buffer
interrupt: it requires KVM to trap accesses to the buffer registers and to
read one extra register, PMBSR_EL1, when switching to the host. I believe
this overhead is minimal because writes to the buffer registers are rare;
they happen when an event is installed or stopped.

Note that in both cases the guest SPE interrupt is purely virtual and has
to be deactivated by KVM when the guest clears the PMBSR_EL1.S bit. This
means trapping accesses to the buffer registers while the interrupt is
asserted, even in the case where the host SPE driver's irq handler handles
the interrupt triggered by the guest.

Context switching SPE registers
===============================

As mentioned earlier, this is done on every world switch under the
assumption that the host is using SPE at the same time as the guest, which
obviously will not always be the case.

I plan to improve this in future iterations by doing the context switch
on vcpu_load()/vcpu_put() when the host is not profiling. The challenge
will be detecting when the host is profiling. That can be detected in
vcpu_load(), but according to my understanding of perf, a new event can be
installed on the CPU via an IPI. In that case the perf driver would have to
notify KVM that it's starting profiling on the core so KVM can save the
guest SPE registers.

In v2 of the patches it was suggested that on nVHE systems the EL2 code
must do the SPE context switch unconditionally [7]. I don't believe that is
necessary because all the registers that SPE uses in the nVHE case are EL1
registers.

Testing
=======

I have written two basic kvm-unit-tests tests for SPE that I used while
developing the series [8]; they can also be found on this branch [9].

For testing, I have used FVP and a Neoverse N1 machine. These are the tests
that I ran:

1. kvm-unit-tests tests

The tests check the basic operation of the SPE buffer and some corner cases
which were hard to trigger with a Linux guest.

2. Check that profiling behaves the same in the guest and in the host

I used this command for testing on an N1 machine:

$ perf record -e arm_spe/ts_enable=1,pa_enable=1,pct_enable=1/ dd if=/dev/zero of=/dev/null count=5000000

then I checked the output of perf report --dump-raw-trace. The command is
not executed simultaneously in the guest and in the host. Results:

* On VHE:
  - guest 538 interrupts, perf.data size 541.190 MiB, 1096 total events.
  - host 536 interrupts, perf.data size 541.190 MiB, 1096 total events.

* Without VHE:
  - guest 537 interrupts, perf.data size 539.997 MiB, 1091 total events.
  - host 535 interrupts, perf.data size 539.986 MiB, 1093 total events.

I ran the tests multiple times and there were very minor variations in the
results.

3. Test concurrent profiling in the guest and host, version A

For this test I used the command:

perf record -ae arm_spe/ts_enable=1,pa_enable=1,pct_enable=1/ -- iperf3 -c 127.0.0.1 -t 60

The command is executed concurrently in the guest and the host; at the same
time I run the kvm-unit-tests tests in a loop on the host.

The guest had the same number of CPUs as the host (4). On the host,
perf.data was around 3.5G and the SPE interrupt fired 3100 times. In the
guest, perf.data was around 2.8G and the interrupt fired 2700 times. I
dumped the data with perf report --dump-raw-trace > perf.trace and the
output looked sane to me. My explanation for the difference is that the
timer frequency is the same for the guest and the host, but the guest
spends less time executing on the physical CPU because it's shared with the
host, hence fewer operations in the same amount of time.

4. Test concurrent profiling in the guest and host, version B

For this test I used the command:

$ perf record -e arm_spe/ts_enable=1,pa_enable=1,pct_enable=1/ dd if=/dev/zero of=/dev/null count=50000000

which means 10 times more operations than in test 2. This exhibits a
behavior which I don't fully understand. In the host, I get similar results
(interrupt count, total events) to what I would get if the guest isn't
running, which is expected. But in the guest, I get 50% fewer interrupts
than in the host and the total number of events is lower. I am still
looking into this; it might be something that I don't understand about the
workload.

[1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/statistical-profiling-extension-for-armv8-a
[2] https://www.spinics.net/lists/arm-kernel/msg776228.html
[3] https://lists.cs.columbia.edu/pipermail/kvmarm/2019-February/034887.html
[4] https://gitlab.arm.com/linux-arm/linux-ae/-/tree/kvm-spe-v3
[5] https://lore.kernel.org/kvm/20201027171735.13638-1-alexandru.elisei@arm.com/
[6] https://gitlab.arm.com/linux-arm/kvmtool-ae/-/tree/kvm-spe-v3
[7] https://lore.kernel.org/linux-arm-kernel/2a9c9076588ef1dd36a6a365848cdfe7@kernel.org/
[8] https://lore.kernel.org/kvm/20201027171944.13933-1-alexandru.elisei@arm.com/
[9] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/kvm-spe-v2

Alexandru Elisei (12):
  KVM: arm64: Initialize VCPU mdcr_el2 before loading it
  KVM: arm64: Hide SPE from guests
  arm64: Introduce CPU SPE feature
  KVM: arm64: Introduce VCPU SPE feature
  KVM: arm64: Introduce SPE primitives
  KVM: arm64: Use separate function for the mapping size in
    user_mem_abort()
  KVM: arm64: Add a new VM device control group for SPE
  KVM: arm64: Add SPE system registers to VCPU context
  KVM: arm64: Switch SPE context on VM entry/exit
  KVM: arm64: Emulate SPE buffer management interrupt
  KVM: arm64: Enable SPE for guests
  Documentation: arm64: Document ARM Neoverse-N1 erratum #1688567

Sudeep Holla (4):
  dt-bindings: ARM SPE: Highlight the need for PPI partitions on
    heterogeneous systems
  KVM: arm64: Define SPE data structure for each VCPU
  KVM: arm64: Add a new VCPU device control group for SPE
  KVM: arm64: VHE: Clear MDCR_EL2.E2PB in vcpu_put()

 Documentation/arm64/silicon-errata.rst        |   2 +
 .../devicetree/bindings/arm/spe-pmu.txt       |   5 +-
 Documentation/virt/kvm/devices/vcpu.rst       |  40 +++
 Documentation/virt/kvm/devices/vm.rst         |  28 ++
 arch/arm64/include/asm/cpucaps.h              |   3 +-
 arch/arm64/include/asm/kvm_arm.h              |   1 +
 arch/arm64/include/asm/kvm_host.h             |  30 +-
 arch/arm64/include/asm/kvm_hyp.h              |  28 +-
 arch/arm64/include/asm/kvm_mmu.h              |   2 +
 arch/arm64/include/asm/sysreg.h               |   4 +
 arch/arm64/include/uapi/asm/kvm.h             |   7 +
 arch/arm64/kernel/cpufeature.c                |  24 ++
 arch/arm64/kvm/Kconfig                        |   8 +
 arch/arm64/kvm/Makefile                       |   1 +
 arch/arm64/kvm/arm.c                          |  84 ++++-
 arch/arm64/kvm/debug.c                        | 100 ++++--
 arch/arm64/kvm/guest.c                        |  57 +++
 arch/arm64/kvm/hyp/include/hyp/spe-sr.h       |  38 ++
 arch/arm64/kvm/hyp/include/hyp/switch.h       |   1 -
 arch/arm64/kvm/hyp/nvhe/Makefile              |   1 +
 arch/arm64/kvm/hyp/nvhe/debug-sr.c            |  16 +-
 arch/arm64/kvm/hyp/nvhe/spe-sr.c              | 109 ++++++
 arch/arm64/kvm/hyp/nvhe/switch.c              |  12 +
 arch/arm64/kvm/hyp/vhe/Makefile               |   1 +
 arch/arm64/kvm/hyp/vhe/spe-sr.c               | 139 ++++++++
 arch/arm64/kvm/hyp/vhe/switch.c               |  50 ++-
 arch/arm64/kvm/hyp/vhe/sysreg-sr.c            |   2 +-
 arch/arm64/kvm/mmu.c                          | 224 ++++++++++--
 arch/arm64/kvm/reset.c                        |  23 ++
 arch/arm64/kvm/spe.c                          | 324 ++++++++++++++++++
 arch/arm64/kvm/sys_regs.c                     |  52 +++
 include/kvm/arm_spe.h                         | 104 ++++++
 include/uapi/linux/kvm.h                      |   1 +
 33 files changed, 1454 insertions(+), 67 deletions(-)
 create mode 100644 arch/arm64/kvm/hyp/include/hyp/spe-sr.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/spe-sr.c
 create mode 100644 arch/arm64/kvm/hyp/vhe/spe-sr.c
 create mode 100644 arch/arm64/kvm/spe.c
 create mode 100644 include/kvm/arm_spe.h

-- 
2.29.1


* [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-19 16:58   ` James Morse
  2020-10-27 17:26 ` [RFC PATCH v3 02/16] dt-bindings: ARM SPE: Highlight the need for PPI partitions on heterogeneous systems Alexandru Elisei
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

When a VCPU is created, the kvm_vcpu struct is initialized to zero in
kvm_vm_ioctl_create_vcpu(). On VHE systems, the first time
vcpu.arch.mdcr_el2 is loaded on hardware is in vcpu_load(), before it is
set to a sensible value in kvm_arm_setup_debug() later in the run loop. The
result is that KVM executes for a short time with MDCR_EL2 set to zero.

This is mostly harmless as we don't need to trap debug and SPE register
accesses from EL1 (we're still running in the host at EL2), but we do set
MDCR_EL2.HPMN to 0, which is constrained unpredictable according to ARM DDI
0487F.b, page D13-3620; the required behavior from the hardware in this
case is to reserve an unknown number of registers for EL2 and EL3 exclusive
use.

Initialize mdcr_el2 in kvm_vcpu_first_run_init() to avoid the constrained
unpredictable behavior and to ensure that the MDCR_EL2 register has the
same value after each vcpu_load(), including the first time the VCPU is
run.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h |  1 +
 arch/arm64/kvm/arm.c              |  3 +-
 arch/arm64/kvm/debug.c            | 81 +++++++++++++++++++++----------
 3 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 0aecbab6a7fb..25d326aecded 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -597,6 +597,7 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 
 void kvm_arm_init_debug(void);
+void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_clear_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index f56122eedffc..e51d8f328c7e 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -544,6 +544,8 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 		static_branch_inc(&userspace_irqchip_in_use);
 	}
 
+	kvm_arm_vcpu_init_debug(vcpu);
+
 	ret = kvm_timer_enable(vcpu);
 	if (ret)
 		return ret;
@@ -739,7 +741,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		}
 
 		kvm_arm_setup_debug(vcpu);
-
 		/**************************************************************
 		 * Enter the guest
 		 */
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 7a7e425616b5..22ee448aee2b 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -68,6 +68,59 @@ void kvm_arm_init_debug(void)
 	__this_cpu_write(mdcr_el2, kvm_call_hyp_ret(__kvm_get_mdcr_el2));
 }
 
+/**
+ * kvm_arm_setup_mdcr_el2 - configure vcpu mdcr_el2 value
+ *
+ * @vcpu:	the vcpu pointer
+ * @host_mdcr:  host mdcr_el2 value
+ *
+ * This ensures we will trap access to:
+ *  - Performance monitors (MDCR_EL2_TPM/MDCR_EL2_TPMCR)
+ *  - Debug ROM Address (MDCR_EL2_TDRA)
+ *  - OS related registers (MDCR_EL2_TDOSA)
+ *  - Statistical profiler (MDCR_EL2_TPMS/MDCR_EL2_E2PB)
+ */
+static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu, u32 host_mdcr)
+{
+	bool trap_debug = !(vcpu->arch.flags & KVM_ARM64_DEBUG_DIRTY);
+
+	/*
+	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access
+	 * to the profiling buffer.
+	 */
+	vcpu->arch.mdcr_el2 = host_mdcr & MDCR_EL2_HPMN_MASK;
+	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
+				MDCR_EL2_TPMS |
+				MDCR_EL2_TPMCR |
+				MDCR_EL2_TDRA |
+				MDCR_EL2_TDOSA);
+
+	if (vcpu->guest_debug) {
+		/* Route all software debug exceptions to EL2 */
+		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
+		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW)
+			trap_debug = true;
+	}
+
+	/* Trap debug register access */
+	if (trap_debug)
+		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDA;
+
+	trace_kvm_arm_set_dreg32("MDCR_EL2", vcpu->arch.mdcr_el2);
+}
+
+/**
+ * kvm_arm_vcpu_init_debug - setup vcpu debug traps
+ *
+ * @vcpu:	the vcpu pointer
+ *
+ * Set vcpu initial mdcr_el2 value.
+ */
+void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu)
+{
+	kvm_arm_setup_mdcr_el2(vcpu, this_cpu_read(mdcr_el2));
+}
+
 /**
  * kvm_arm_reset_debug_ptr - reset the debug ptr to point to the vcpu state
  */
@@ -83,12 +136,7 @@ void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu)
  * @vcpu:	the vcpu pointer
  *
  * This is called before each entry into the hypervisor to setup any
- * debug related registers. Currently this just ensures we will trap
- * access to:
- *  - Performance monitors (MDCR_EL2_TPM/MDCR_EL2_TPMCR)
- *  - Debug ROM Address (MDCR_EL2_TDRA)
- *  - OS related registers (MDCR_EL2_TDOSA)
- *  - Statistical profiler (MDCR_EL2_TPMS/MDCR_EL2_E2PB)
+ * debug related registers.
  *
  * Additionally, KVM only traps guest accesses to the debug registers if
  * the guest is not actively using them (see the KVM_ARM64_DEBUG_DIRTY
@@ -100,27 +148,14 @@ void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu)
 
 void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
 {
-	bool trap_debug = !(vcpu->arch.flags & KVM_ARM64_DEBUG_DIRTY);
 	unsigned long mdscr, orig_mdcr_el2 = vcpu->arch.mdcr_el2;
 
 	trace_kvm_arm_setup_debug(vcpu, vcpu->guest_debug);
 
-	/*
-	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access
-	 * to the profiling buffer.
-	 */
-	vcpu->arch.mdcr_el2 = __this_cpu_read(mdcr_el2) & MDCR_EL2_HPMN_MASK;
-	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
-				MDCR_EL2_TPMS |
-				MDCR_EL2_TPMCR |
-				MDCR_EL2_TDRA |
-				MDCR_EL2_TDOSA);
+	kvm_arm_setup_mdcr_el2(vcpu, __this_cpu_read(mdcr_el2));
 
 	/* Is Guest debugging in effect? */
 	if (vcpu->guest_debug) {
-		/* Route all software debug exceptions to EL2 */
-		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
-
 		/* Save guest debug state */
 		save_guest_debug_regs(vcpu);
 
@@ -174,7 +209,6 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
 
 			vcpu->arch.debug_ptr = &vcpu->arch.external_debug_state;
 			vcpu->arch.flags |= KVM_ARM64_DEBUG_DIRTY;
-			trap_debug = true;
 
 			trace_kvm_arm_set_regset("BKPTS", get_num_brps(),
 						&vcpu->arch.debug_ptr->dbg_bcr[0],
@@ -189,10 +223,6 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
 	BUG_ON(!vcpu->guest_debug &&
 		vcpu->arch.debug_ptr != &vcpu->arch.vcpu_debug_state);
 
-	/* Trap debug register access */
-	if (trap_debug)
-		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDA;
-
 	/* If KDE or MDE are set, perform a full save/restore cycle. */
 	if (vcpu_read_sys_reg(vcpu, MDSCR_EL1) & (DBG_MDSCR_KDE | DBG_MDSCR_MDE))
 		vcpu->arch.flags |= KVM_ARM64_DEBUG_DIRTY;
@@ -201,7 +231,6 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
 	if (has_vhe() && orig_mdcr_el2 != vcpu->arch.mdcr_el2)
 		write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
 
-	trace_kvm_arm_set_dreg32("MDCR_EL2", vcpu->arch.mdcr_el2);
 	trace_kvm_arm_set_dreg32("MDSCR_EL1", vcpu_read_sys_reg(vcpu, MDSCR_EL1));
 }
 
-- 
2.29.1


* [RFC PATCH v3 02/16] dt-bindings: ARM SPE: Highlight the need for PPI partitions on heterogeneous systems
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 03/16] KVM: arm64: Hide SPE from guests Alexandru Elisei
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, Sudeep Holla, Andrew Murray, will

From: Sudeep Holla <sudeep.holla@arm.com>

It's not entirely clear from the binding document that the only way to
express ARM SPE affinity to a subset of CPUs on a heterogeneous system is
through the use of PPI partitions available in the interrupt controller
bindings.

Let's make it clear.

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/devicetree/bindings/arm/spe-pmu.txt | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/arm/spe-pmu.txt b/Documentation/devicetree/bindings/arm/spe-pmu.txt
index 93372f2a7df9..4f4815800f6e 100644
--- a/Documentation/devicetree/bindings/arm/spe-pmu.txt
+++ b/Documentation/devicetree/bindings/arm/spe-pmu.txt
@@ -9,8 +9,9 @@ performance sample data using an in-memory trace buffer.
 	       "arm,statistical-profiling-extension-v1"
 
 - interrupts : Exactly 1 PPI must be listed. For heterogeneous systems where
-               SPE is only supported on a subset of the CPUs, please consult
-	       the arm,gic-v3 binding for details on describing a PPI partition.
+               SPE is only supported on a subset of the CPUs, a PPI partition
+	       described in the arm,gic-v3 binding must be used to describe
+	       the set of CPUs this interrupt is affine to.
 
 ** Example:
 
-- 
2.29.1


* [RFC PATCH v3 03/16] KVM: arm64: Hide SPE from guests
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 02/16] dt-bindings: ARM SPE: Highlight the need for PPI partitions on heterogeneous systems Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature Alexandru Elisei
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

When SPE is not implemented, accesses to the SPE registers cause an
undefined exception. KVM advertises the presence of SPE in the
ID_AA64DFR0_EL1 register, but configures MDCR_EL2 to trap accesses to the
registers and injects an undefined exception when that happens.

The architecture doesn't allow trapping access to the PMBIDR_EL1 register,
which means the guest will be able to read it even if SPE is not advertised
in the ID register. However, since it's usually better for a read to
unexpectedly succeed than to cause an exception, let's stop advertising the
presence of SPE to guests to better match how KVM emulates the
architecture.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/sys_regs.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index d9117bc56237..aa776c006a2a 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -244,6 +244,12 @@ static bool access_vm_reg(struct kvm_vcpu *vcpu,
 	return true;
 }
 
+static unsigned int spe_visibility(const struct kvm_vcpu *vcpu,
+				   const struct sys_reg_desc *r)
+{
+	return REG_HIDDEN_GUEST | REG_HIDDEN_USER;
+}
+
 static bool access_actlr(struct kvm_vcpu *vcpu,
 			 struct sys_reg_params *p,
 			 const struct sys_reg_desc *r)
@@ -1143,6 +1149,8 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
 		val = cpuid_feature_cap_perfmon_field(val,
 						ID_AA64DFR0_PMUVER_SHIFT,
 						ID_AA64DFR0_PMUVER_8_1);
+		/* Don't advertise SPE to guests */
+		val &= ~(0xfUL << ID_AA64DFR0_PMSVER_SHIFT);
 	} else if (id == SYS_ID_DFR0_EL1) {
 		/* Limit guests to PMUv3 for ARMv8.1 */
 		val = cpuid_feature_cap_perfmon_field(val,
@@ -1590,6 +1598,17 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
 	{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },
 
+	{ SYS_DESC(SYS_PMSCR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSICR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSIRR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSFCR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSEVFR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSLATFR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSIDR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBLIMITR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBPTR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBSR_EL1), .visibility = spe_visibility },
+
 	{ SYS_DESC(SYS_PMINTENSET_EL1), access_pminten, reset_unknown, PMINTENSET_EL1 },
 	{ SYS_DESC(SYS_PMINTENCLR_EL1), access_pminten, reset_unknown, PMINTENSET_EL1 },
 
-- 
2.29.1


* [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (2 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 03/16] KVM: arm64: Hide SPE from guests Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-19 16:58   ` James Morse
  2020-10-27 17:26 ` [RFC PATCH v3 05/16] KVM: arm64: Introduce VCPU " Alexandru Elisei
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

Detect Statistical Profiling Extension (SPE) support using the cpufeatures
framework. The presence of SPE is reported via the ARM64_SPE capability.

The feature will be necessary for emulating SPE in KVM, because KVM needs
all CPUs to have SPE hardware to avoid scheduling a VCPU on a CPU without
support. For this reason, the feature type ARM64_CPUCAP_SYSTEM_FEATURE has
been selected to disallow hotplugging a CPU which doesn't support SPE.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/cpucaps.h |  3 ++-
 arch/arm64/kernel/cpufeature.c   | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index 42868dbd29fd..10fd094d9a5b 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -65,7 +65,8 @@
 #define ARM64_HAS_ARMv8_4_TTL			55
 #define ARM64_HAS_TLB_RANGE			56
 #define ARM64_MTE				57
+#define ARM64_SPE				58
 
-#define ARM64_NCAPS				58
+#define ARM64_NCAPS				59
 
 #endif /* __ASM_CPUCAPS_H */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index dcc165b3fc04..4a0f4dc53824 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1278,6 +1278,18 @@ has_useable_cnp(const struct arm64_cpu_capabilities *entry, int scope)
 	return has_cpuid_feature(entry, scope);
 }
 
+static bool __maybe_unused
+has_usable_spe(const struct arm64_cpu_capabilities *entry, int scope)
+{
+	u64 pmbidr;
+
+	if (!has_cpuid_feature(entry, scope))
+		return false;
+
+	pmbidr = read_sysreg_s(SYS_PMBIDR_EL1);
+	return !(pmbidr & BIT(SYS_PMBIDR_EL1_P_SHIFT));
+}
+
 /*
  * This check is triggered during the early boot before the cpufeature
  * is initialised. Checking the status on the local CPU allows the boot
@@ -2003,6 +2015,18 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.min_field_value = 1,
 		.cpu_enable = cpu_enable_cnp,
 	},
+#endif
+#ifdef CONFIG_ARM_SPE_PMU
+	{
+		.desc = "Statistical Profiling Extension (SPE)",
+		.capability = ARM64_SPE,
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.matches = has_usable_spe,
+		.sys_reg = SYS_ID_AA64DFR0_EL1,
+		.sign = FTR_UNSIGNED,
+		.field_pos = ID_AA64DFR0_PMSVER_SHIFT,
+		.min_field_value = 1,
+	},
 #endif
 	{
 		.desc = "Speculation barrier (SB)",
-- 
2.29.1


* [RFC PATCH v3 05/16] KVM: arm64: Introduce VCPU SPE feature
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (3 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives Alexandru Elisei
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

Introduce the feature bit, but don't allow userspace to set it yet.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/uapi/asm/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 1c17c3a24411..489e12304dbb 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -106,6 +106,7 @@ struct kvm_regs {
 #define KVM_ARM_VCPU_SVE		4 /* enable SVE for this CPU */
 #define KVM_ARM_VCPU_PTRAUTH_ADDRESS	5 /* VCPU uses address authentication */
 #define KVM_ARM_VCPU_PTRAUTH_GENERIC	6 /* VCPU uses generic authentication */
+#define KVM_ARM_VCPU_SPE		7 /* Enable SPE for this CPU */
 
 struct kvm_vcpu_init {
 	__u32 target;
-- 
2.29.1


* [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (4 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 05/16] KVM: arm64: Introduce VCPU " Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-19 16:58   ` James Morse
  2020-10-27 17:26 ` [RFC PATCH v3 07/16] KVM: arm64: Define SPE data structure for each VCPU Alexandru Elisei
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

KVM SPE emulation depends on the configuration option KVM_ARM_SPE and on
having hardware SPE support on all CPUs. The host driver must be compiled
in because we need the SPE interrupt to be enabled; it will be used to kick
us out of the guest when the profiling buffer management interrupt is
asserted by the GIC (for example, when the buffer is full).

Add a VCPU flag to inform KVM that the guest has SPE enabled.

It's worth noting that even though the KVM_ARM_SPE config option is gated
by the SPE host driver being compiled-in, we don't actually check that the
driver was loaded successfully when we advertise SPE support for guests.
That's because we can live with the SPE interrupt being disabled. There is
a delay between when the SPE hardware asserts the interrupt and when the
GIC samples the interrupt line and asserts it to the CPU. If the SPE
interrupt is disabled at the GIC level, this delay will be larger, at most
a host timer tick.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h |  9 +++++++++
 arch/arm64/kvm/Kconfig            |  8 ++++++++
 include/kvm/arm_spe.h             | 19 +++++++++++++++++++
 3 files changed, 36 insertions(+)
 create mode 100644 include/kvm/arm_spe.h

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 25d326aecded..43eee197764f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -406,6 +406,7 @@ struct kvm_vcpu_arch {
 #define KVM_ARM64_GUEST_HAS_SVE		(1 << 5) /* SVE exposed to guest */
 #define KVM_ARM64_VCPU_SVE_FINALIZED	(1 << 6) /* SVE config completed */
 #define KVM_ARM64_GUEST_HAS_PTRAUTH	(1 << 7) /* PTRAUTH exposed to guest */
+#define KVM_ARM64_GUEST_HAS_SPE		(1 << 8) /* SPE exposed to guest */
 
 #define vcpu_has_sve(vcpu) (system_supports_sve() && \
 			    ((vcpu)->arch.flags & KVM_ARM64_GUEST_HAS_SVE))
@@ -419,6 +420,14 @@ struct kvm_vcpu_arch {
 #define vcpu_has_ptrauth(vcpu)		false
 #endif
 
+#ifdef CONFIG_KVM_ARM_SPE
+#define vcpu_has_spe(vcpu)						\
+	(cpus_have_final_cap(ARM64_SPE) &&				\
+	 ((vcpu)->arch.flags & KVM_ARM64_GUEST_HAS_SPE))
+#else
+#define vcpu_has_spe(vcpu)		false
+#endif
+
 #define vcpu_gp_regs(v)		(&(v)->arch.ctxt.regs)
 
 /*
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 043756db8f6e..8b35c0b806a7 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -57,6 +57,14 @@ config KVM_ARM_PMU
 	  Adds support for a virtual Performance Monitoring Unit (PMU) in
 	  virtual machines.
 
+config KVM_ARM_SPE
+	bool "Virtual Statistical Profiling Extension (SPE) support"
+	depends on ARM_SPE_PMU
+	default y
+	help
+	  Adds support for a virtual Statistical Profiling Extension (SPE) in
+	  virtual machines.
+
 endif # KVM
 
 endif # VIRTUALIZATION
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
new file mode 100644
index 000000000000..db51ef15bf45
--- /dev/null
+++ b/include/kvm/arm_spe.h
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 ARM Ltd.
+ */
+
+#ifndef __ASM_ARM_KVM_SPE_H
+#define __ASM_ARM_KVM_SPE_H
+
+#ifdef CONFIG_KVM_ARM_SPE
+static inline bool kvm_arm_supports_spe(void)
+{
+	return cpus_have_final_cap(ARM64_SPE);
+}
+
+#else
+#define kvm_arm_supports_spe()	false
+
+#endif /* CONFIG_KVM_ARM_SPE */
+#endif /* __ASM_ARM_KVM_SPE_H */
-- 
2.29.1


* [RFC PATCH v3 07/16] KVM: arm64: Define SPE data structure for each VCPU
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (5 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, Sudeep Holla, Andrew Murray, will

From: Sudeep Holla <sudeep.holla@arm.com>

Define basic struct for supporting SPE for guest VCPUs.

[Andrew M: Add irq_level, rename irq to irq_num for kvm_spe ]
[Alexandru E: Reworked patch ]

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h | 2 ++
 include/kvm/arm_spe.h             | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 43eee197764f..5b68c06930c6 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -35,6 +35,7 @@
 #include <kvm/arm_vgic.h>
 #include <kvm/arm_arch_timer.h>
 #include <kvm/arm_pmu.h>
+#include <kvm/arm_spe.h>
 
 #define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS
 
@@ -329,6 +330,7 @@ struct kvm_vcpu_arch {
 	struct vgic_cpu vgic_cpu;
 	struct arch_timer_cpu timer_cpu;
 	struct kvm_pmu pmu;
+	struct kvm_spe_cpu spe_cpu;
 
 	/*
 	 * Anything that is not used directly from assembly code goes
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
index db51ef15bf45..46ec447ed013 100644
--- a/include/kvm/arm_spe.h
+++ b/include/kvm/arm_spe.h
@@ -12,8 +12,17 @@ static inline bool kvm_arm_supports_spe(void)
 	return cpus_have_final_cap(ARM64_SPE);
 }
 
+struct kvm_spe_cpu {
+	int irq_num; 		/* Guest visible INTID */
+	bool irq_level; 	/* 'true' if interrupt is asserted to the VGIC */
+	bool initialized; 	/* Feature is initialized on VCPU */
+};
+
 #else
 #define kvm_arm_supports_spe()	false
 
+struct kvm_spe_cpu {
+};
+
 #endif /* CONFIG_KVM_ARM_SPE */
 #endif /* __ASM_ARM_KVM_SPE_H */
-- 
2.29.1


* [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (6 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 07/16] KVM: arm64: Define SPE data structure for each VCPU Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-05  9:58   ` Haibo Xu
  2020-11-19 16:58   ` James Morse
  2020-10-27 17:26 ` [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort() Alexandru Elisei
                   ` (7 subsequent siblings)
  15 siblings, 2 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, Sudeep Holla, will

From: Sudeep Holla <sudeep.holla@arm.com>

To configure the virtual SPE buffer management interrupt number, we use a
VCPU kvm_device ioctl, encapsulating the KVM_ARM_VCPU_SPE_IRQ attribute
within the KVM_ARM_VCPU_SPE_CTRL group.

After configuring the SPE, userspace is required to call the VCPU ioctl
with the attribute KVM_ARM_VCPU_SPE_INIT to initialize SPE on the VCPU.
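
As an illustration, the userspace side could look roughly like the sketch
below (error handling collapsed; it uses the KVM_ARM_VCPU_SPE_* constants
added by this patch and assumes the in-kernel GIC has already been created
and spe_ppi holds the PPI number chosen by the VMM):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: set the SPE buffer management PPI, then initialize the feature. */
static int vcpu_spe_setup(int vcpu_fd, int spe_ppi)
{
	struct kvm_device_attr irq_attr = {
		.group	= KVM_ARM_VCPU_SPE_CTRL,
		.attr	= KVM_ARM_VCPU_SPE_IRQ,
		.addr	= (__u64)(unsigned long)&spe_ppi,
	};
	struct kvm_device_attr init_attr = {
		.group	= KVM_ARM_VCPU_SPE_CTRL,
		.attr	= KVM_ARM_VCPU_SPE_INIT,
	};

	/* Set the interrupt first; KVM_ARM_VCPU_SPE_INIT requires it. */
	if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &irq_attr))
		return -1;

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &init_attr);
}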

[Alexandru E: Fixed compilation errors, don't allow userspace to set the
	VCPU feature, removed unused functions, fixed mismatched
	descriptions, comments and error codes, reworked logic, rebased on
	top of v5.10-rc1]

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vcpu.rst |  40 ++++++++
 arch/arm64/include/uapi/asm/kvm.h       |   3 +
 arch/arm64/kvm/Makefile                 |   1 +
 arch/arm64/kvm/guest.c                  |   9 ++
 arch/arm64/kvm/reset.c                  |  23 +++++
 arch/arm64/kvm/spe.c                    | 129 ++++++++++++++++++++++++
 include/kvm/arm_spe.h                   |  27 +++++
 include/uapi/linux/kvm.h                |   1 +
 8 files changed, 233 insertions(+)
 create mode 100644 arch/arm64/kvm/spe.c

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 2acec3b9ef65..6135b9827fbe 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -161,3 +161,43 @@ Specifies the base address of the stolen time structure for this VCPU. The
 base address must be 64 byte aligned and exist within a valid guest memory
 region. See Documentation/virt/kvm/arm/pvtime.rst for more information
 including the layout of the stolen time structure.
+
+4. GROUP: KVM_ARM_VCPU_SPE_CTRL
+===============================
+
+:Architectures: ARM64
+
+4.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
+-----------------------------------
+
+:Parameters: in kvm_device_attr.addr the address for the SPE buffer management
+             interrupt is a pointer to an int
+
+Returns:
+
+	 =======  ========================================================
+	 -EBUSY   The SPE buffer management interrupt is already set
+	 -EINVAL  Invalid SPE overflow interrupt number
+	 -EFAULT  Could not read the buffer management interrupt number
+	 -ENXIO   SPE not supported or not properly configured
+	 =======  ========================================================
+
+A value describing the SPE (Statistical Profiling Extension) overflow interrupt
+number for this vcpu. This interrupt should be a PPI and the interrupt type and
+number must be the same for each vcpu.
+
+4.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
+------------------------------------
+
+:Parameters: no additional parameter in kvm_device_attr.addr
+
+Returns:
+
+	 =======  ======================================================
+	 -EBUSY   SPE already initialized
+	 -ENODEV  GIC not initialized
+	 -ENXIO   SPE not supported or not properly configured
+	 =======  ======================================================
+
+Request the initialization of the SPE. Must be done after initializing the
+in-kernel irqchip and after setting the interrupt number for the VCPU.
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 489e12304dbb..ca57dfb7abf0 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -360,6 +360,9 @@ struct kvm_vcpu_events {
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
 #define KVM_ARM_VCPU_PVTIME_CTRL	2
 #define   KVM_ARM_VCPU_PVTIME_IPA	0
+#define KVM_ARM_VCPU_SPE_CTRL		3
+#define   KVM_ARM_VCPU_SPE_IRQ		0
+#define   KVM_ARM_VCPU_SPE_INIT		1
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_VCPU2_SHIFT		28
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 1504c81fbf5d..f6e76f64ffbe 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -25,3 +25,4 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
 	 vgic/vgic-its.o vgic/vgic-debug.o
 
 kvm-$(CONFIG_KVM_ARM_PMU)  += pmu-emul.o
+kvm-$(CONFIG_KVM_ARM_SPE)  += spe.o
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index dfb5218137ca..2ba790eeb782 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -926,6 +926,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_set_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_SPE_CTRL:
+		ret = kvm_arm_spe_set_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
@@ -949,6 +952,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_get_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_SPE_CTRL:
+		ret = kvm_arm_spe_get_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
@@ -972,6 +978,9 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_has_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_SPE_CTRL:
+		ret = kvm_arm_spe_has_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index f32490229a4c..4dc205fa4be1 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -87,6 +87,9 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_PTRAUTH_GENERIC:
 		r = system_has_full_ptr_auth();
 		break;
+	case KVM_CAP_ARM_SPE:
+		r = kvm_arm_supports_spe();
+		break;
 	default:
 		r = 0;
 	}
@@ -223,6 +226,19 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int kvm_vcpu_enable_spe(struct kvm_vcpu *vcpu)
+{
+	if (!kvm_arm_supports_spe())
+		return -EINVAL;
+
+	/* SPE is disabled if the PE is in AArch32 state */
+	if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
+		return -EINVAL;
+
+	vcpu->arch.flags |= KVM_ARM64_GUEST_HAS_SPE;
+	return 0;
+}
+
 /**
  * kvm_reset_vcpu - sets core registers and sys_regs to reset value
  * @vcpu: The VCPU pointer
@@ -274,6 +290,13 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	if (test_bit(KVM_ARM_VCPU_SPE, vcpu->arch.features)) {
+		if (kvm_vcpu_enable_spe(vcpu)) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
 	switch (vcpu->arch.target) {
 	default:
 		if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
new file mode 100644
index 000000000000..f91a52cd7cd3
--- /dev/null
+++ b/arch/arm64/kvm/spe.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 ARM Ltd.
+ */
+
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/uaccess.h>
+
+#include <kvm/arm_spe.h>
+#include <kvm/arm_vgic.h>
+
+static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
+{
+	if (!vcpu_has_spe(vcpu))
+		return false;
+
+	if (!irqchip_in_kernel(vcpu->kvm))
+		return false;
+
+	return true;
+}
+
+static int kvm_arm_spe_init(struct kvm_vcpu *vcpu)
+{
+	if (!kvm_arm_spe_irq_initialized(vcpu))
+		return -ENXIO;
+
+	if (!vgic_initialized(vcpu->kvm))
+		return -ENODEV;
+
+	if (kvm_arm_spe_vcpu_initialized(vcpu))
+		return -EBUSY;
+
+	if (kvm_vgic_set_owner(vcpu, vcpu->arch.spe_cpu.irq_num, &vcpu->arch.spe_cpu))
+		return -ENXIO;
+
+	vcpu->arch.spe_cpu.initialized = true;
+
+	return 0;
+}
+
+static bool kvm_arm_spe_irq_is_valid(struct kvm *kvm, int irq)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	/* The SPE overflow interrupt can be a PPI only */
+	if (!irq_is_ppi(irq))
+		return false;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (!kvm_arm_spe_irq_initialized(vcpu))
+			continue;
+
+		if (vcpu->arch.spe_cpu.irq_num != irq)
+			return false;
+	}
+
+	return true;
+}
+
+int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	switch (attr->attr) {
+	case KVM_ARM_VCPU_SPE_IRQ: {
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		int irq;
+
+		if (!kvm_arm_vcpu_supports_spe(vcpu))
+			return -ENXIO;
+
+		if (get_user(irq, uaddr))
+			return -EFAULT;
+
+		if (!kvm_arm_spe_irq_is_valid(vcpu->kvm, irq))
+			return -EINVAL;
+
+		if (kvm_arm_spe_irq_initialized(vcpu))
+			return -EBUSY;
+
+		kvm_debug("Set kvm ARM SPE irq: %d\n", irq);
+		vcpu->arch.spe_cpu.irq_num = irq;
+
+		return 0;
+	}
+	case KVM_ARM_VCPU_SPE_INIT:
+		return kvm_arm_spe_init(vcpu);
+	}
+
+	return -ENXIO;
+}
+
+int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	switch (attr->attr) {
+	case KVM_ARM_VCPU_SPE_IRQ: {
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		int irq;
+
+		if (!kvm_arm_vcpu_supports_spe(vcpu))
+			return -ENXIO;
+
+		if (!kvm_arm_spe_irq_initialized(vcpu))
+			return -ENXIO;
+
+		irq = vcpu->arch.spe_cpu.irq_num;
+		if (put_user(irq, uaddr))
+			return -EFAULT;
+
+		return 0;
+	}
+	}
+
+	return -ENXIO;
+}
+
+int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	switch (attr->attr) {
+	case KVM_ARM_VCPU_SPE_IRQ:
+		fallthrough;
+	case KVM_ARM_VCPU_SPE_INIT:
+		if (kvm_arm_vcpu_supports_spe(vcpu))
+			return 0;
+	}
+
+	return -ENXIO;
+}
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
index 46ec447ed013..0275e8097529 100644
--- a/include/kvm/arm_spe.h
+++ b/include/kvm/arm_spe.h
@@ -18,11 +18,38 @@ struct kvm_spe_cpu {
 	bool initialized; 	/* Feature is initialized on VCPU */
 };
 
+#define kvm_arm_spe_irq_initialized(v)			\
+	((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&	\
+	 (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)
+#define kvm_arm_spe_vcpu_initialized(v)	((v)->arch.spe_cpu.initialized)
+
+int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+
 #else
 #define kvm_arm_supports_spe()	false
 
 struct kvm_spe_cpu {
 };
 
+#define kvm_arm_spe_irq_initialized(v)	false
+#define kvm_arm_spe_vcpu_initialized(v)	false
+
+static inline int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+static inline int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+static inline int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
 #endif /* CONFIG_KVM_ARM_SPE */
 #endif /* __ASM_ARM_KVM_SPE_H */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index ca41220b40b8..96228b823711 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1053,6 +1053,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_X86_USER_SPACE_MSR 188
 #define KVM_CAP_X86_MSR_FILTER 189
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
+#define KVM_CAP_ARM_SPE 191
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.29.1


* [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort()
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (7 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-05 10:01   ` Haibo Xu
  2020-10-27 17:26 ` [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE Alexandru Elisei
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

user_mem_abort() is already a long and complex function, let's make it
slightly easier to understand by abstracting the algorithm for choosing the
stage 2 IPA entry size into its own function.

This also makes it possible to reuse the code when guest SPE support will
be added.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/mmu.c | 55 ++++++++++++++++++++++++++------------------
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 19aacc7d64de..c3c43555490d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -738,12 +738,43 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
 	return PAGE_SIZE;
 }
 
+static short stage2_max_pageshift(struct kvm_memory_slot *memslot,
+				  struct vm_area_struct *vma, hva_t hva,
+				  bool *force_pte)
+{
+	short pageshift;
+
+	*force_pte = false;
+
+	if (is_vm_hugetlb_page(vma))
+		pageshift = huge_page_shift(hstate_vma(vma));
+	else
+		pageshift = PAGE_SHIFT;
+
+	if (memslot_is_logging(memslot) || (vma->vm_flags & VM_PFNMAP)) {
+		*force_pte = true;
+		pageshift = PAGE_SHIFT;
+	}
+
+	if (pageshift == PUD_SHIFT &&
+	    !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
+		pageshift = PMD_SHIFT;
+
+	if (pageshift == PMD_SHIFT &&
+	    !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
+		*force_pte = true;
+		pageshift = PAGE_SHIFT;
+	}
+
+	return pageshift;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
 {
 	int ret = 0;
-	bool write_fault, writable, force_pte = false;
+	bool write_fault, writable, force_pte;
 	bool exec_fault;
 	bool device = false;
 	unsigned long mmu_seq;
@@ -776,27 +807,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		return -EFAULT;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		vma_shift = huge_page_shift(hstate_vma(vma));
-	else
-		vma_shift = PAGE_SHIFT;
-
-	if (logging_active ||
-	    (vma->vm_flags & VM_PFNMAP)) {
-		force_pte = true;
-		vma_shift = PAGE_SHIFT;
-	}
-
-	if (vma_shift == PUD_SHIFT &&
-	    !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
-	       vma_shift = PMD_SHIFT;
-
-	if (vma_shift == PMD_SHIFT &&
-	    !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
-		force_pte = true;
-		vma_shift = PAGE_SHIFT;
-	}
-
+	vma_shift = stage2_max_pageshift(memslot, vma, hva, &force_pte);
 	vma_pagesize = 1UL << vma_shift;
 	if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
 		fault_ipa &= ~(vma_pagesize - 1);
-- 
2.29.1


* [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (8 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort() Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-05 10:10   ` Haibo Xu
  2020-11-19 16:59   ` James Morse
  2020-10-27 17:27 ` [RFC PATCH v3 11/16] KVM: arm64: Add SPE system registers to VCPU context Alexandru Elisei
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

Stage 2 faults triggered by the profiling buffer attempting to write to
memory are reported by the SPE hardware by asserting a buffer management
event interrupt. Interrupts are by their nature asynchronous, which means
that the guest might have changed its stage 1 translation tables since the
attempted write. SPE reports the guest virtual address that caused the data
abort, but not the IPA, which means that KVM would have to walk the guest's
stage 1 tables to find the IPA; using the AT instruction to walk the
guest's tables in hardware is not an option because it doesn't report the
IPA in the case of a stage 2 fault on a stage 1 table walk.

Fix both problems by pre-mapping the guest's memory at stage 2 with write
permissions to avoid any faults. Userspace calls mlock() on the VMAs that
back the guest's memory, pinning the pages in memory, then tells KVM to map
the memory at stage 2 by using the VM control group KVM_ARM_VM_SPE_CTRL
with the attribute KVM_ARM_VM_SPE_FINALIZE. KVM will map all writable VMAs
which have the VM_LOCKED flag set. Hugetlb VMAs are a special case: mlock()
only faults the pages in and doesn't set VM_LOCKED, but hugetlb pages are
effectively pinned in memory once they have been faulted in. KVM therefore
treats hugetlb VMAs as if VM_LOCKED were set and maps them as well, faulting
the pages in if necessary while handling the ioctl.

VM live migration relies on a bitmap of dirty pages. This bitmap is created
by write-protecting a memslot and updating it as KVM handles stage 2 write
faults. Because KVM cannot handle stage 2 faults reported by the profiling
buffer, it will not pre-map a logging memslot. This effectively means that
profiling is not available when the VM is configured for live migration.
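
To make the call sequence concrete, here is a minimal sketch of how a VMM
could drive the finalize step once it has loaded the guest payload. This is
only a sketch: vm_fd, guest_mem and guest_size are illustrative names, the
KVM_ARM_VM_SPE_* defines come from this series' headers, and error handling
is reduced to returning -1.

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

static int spe_finalize(int vm_fd, void *guest_mem, size_t guest_size)
{
	struct kvm_device_attr attr = {
		.group	= KVM_ARM_VM_SPE_CTRL,
		.attr	= KVM_ARM_VM_SPE_FINALIZE,
	};

	/* Pin the VMAs backing the guest memory; this sets VM_LOCKED. */
	if (mlock(guest_mem, guest_size))
		return -1;

	/* Optional: check that the kernel knows about the attribute. */
	if (ioctl(vm_fd, KVM_HAS_DEVICE_ATTR, &attr))
		return -1;

	/* Ask KVM to map the VM_LOCKED (and hugetlb) VMAs at stage 2. */
	return ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &attr);
}

After this returns 0 the VMM must not touch the guest memory again, as
spelled out in the documentation update below.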

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vm.rst |  28 +++++
 arch/arm64/include/asm/kvm_host.h     |   5 +
 arch/arm64/include/asm/kvm_mmu.h      |   2 +
 arch/arm64/include/uapi/asm/kvm.h     |   3 +
 arch/arm64/kvm/arm.c                  |  78 +++++++++++-
 arch/arm64/kvm/guest.c                |  48 ++++++++
 arch/arm64/kvm/mmu.c                  | 169 ++++++++++++++++++++++++++
 arch/arm64/kvm/spe.c                  |  81 ++++++++++++
 include/kvm/arm_spe.h                 |  36 ++++++
 9 files changed, 448 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vm.rst b/Documentation/virt/kvm/devices/vm.rst
index 0aa5b1cfd700..b70798a72d8a 100644
--- a/Documentation/virt/kvm/devices/vm.rst
+++ b/Documentation/virt/kvm/devices/vm.rst
@@ -314,3 +314,31 @@ Allows userspace to query the status of migration mode.
 	     if it is enabled
 :Returns:   -EFAULT if the given address is not accessible from kernel space;
 	    0 in case of success.
+
+6. GROUP: KVM_ARM_VM_SPE_CTRL
+===============================
+
+:Architectures: arm64
+
+6.1. ATTRIBUTE: KVM_ARM_VM_SPE_FINALIZE
+-----------------------------------------
+
+Finalizes the creation of the SPE feature by mapping the guest memory in the
+stage 2 table. Guest memory must be readable, writable and pinned in RAM, which
+is achieved with an mlock() system call; the memory can be backed by a hugetlbfs
+file. Memory regions belonging to read-only memslots, or to memslots with dirty
+page logging enabled, are ignored. After the call, no changes to the guest
+memory, including to its contents, are permitted.
+
+Subsequent KVM_ARM_VCPU_INIT calls will cause the memory to become unmapped and
+the feature must be finalized again before any VCPU can run.
+
+If any VCPUs are run before finalizing the feature, KVM_RUN will return -EPERM.
+
+:Parameters: none
+:Returns:   -EAGAIN if guest memory has been modified while the call was
+            executing;
+            -EBUSY if the feature is already initialized;
+            -EFAULT if an address backing the guest memory is invalid;
+            -ENXIO if SPE is not supported or not properly configured;
+            0 in case of success
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 5b68c06930c6..27f581750c6e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -92,6 +92,7 @@ struct kvm_s2_mmu {
 
 struct kvm_arch {
 	struct kvm_s2_mmu mmu;
+	struct kvm_spe spe;
 
 	/* VTCR_EL2 value for this VM */
 	u64    vtcr;
@@ -612,6 +613,10 @@ void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_clear_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu);
+int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr);
+int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr);
+int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr);
+
 int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
 			       struct kvm_device_attr *attr);
 int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 331394306cce..bad94662bbed 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -124,6 +124,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu);
 void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 			  phys_addr_t pa, unsigned long size, bool writable);
+int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			   enum kvm_pgtable_prot prot);
 
 int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
 
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index ca57dfb7abf0..8876e564ba56 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -350,6 +350,9 @@ struct kvm_vcpu_events {
 #define   KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES	3
 #define   KVM_DEV_ARM_ITS_CTRL_RESET		4
 
+#define KVM_ARM_VM_SPE_CTRL		0
+#define   KVM_ARM_VM_SPE_FINALIZE	0
+
 /* Device Control API on vcpu fd */
 #define KVM_ARM_VCPU_PMU_V3_CTRL	0
 #define   KVM_ARM_VCPU_PMU_V3_IRQ	0
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e51d8f328c7e..2d98248f2c66 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -41,6 +41,7 @@
 #include <kvm/arm_hypercalls.h>
 #include <kvm/arm_pmu.h>
 #include <kvm/arm_psci.h>
+#include <kvm/arm_spe.h>
 
 #ifdef REQUIRES_VIRT
 __asm__(".arch_extension	virt");
@@ -653,6 +654,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	if (unlikely(!kvm_vcpu_initialized(vcpu)))
 		return -ENOEXEC;
 
+	if (vcpu_has_spe(vcpu) && unlikely(!kvm_arm_spe_finalized(vcpu->kvm)))
+		return -EPERM;
+
 	ret = kvm_vcpu_first_run_init(vcpu);
 	if (ret)
 		return ret;
@@ -982,12 +986,22 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
 	 * ensuring that the data side is always coherent. We still
 	 * need to invalidate the I-cache though, as FWB does *not*
 	 * imply CTR_EL0.DIC.
+	 *
+	 * If the guest has SPE, we need to unmap the entire address space to
+	 * allow for any changes to the VM memory made by userspace to propagate
+	 * to the stage 2 tables when SPE is re-finalized; this also makes sure
+	 * we keep the userspace and the guest's view of the memory contents
+	 * synchronized.
 	 */
 	if (vcpu->arch.has_run_once) {
-		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
+		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
+		    vcpu_has_spe(vcpu)) {
 			stage2_unmap_vm(vcpu->kvm);
-		else
+			if (vcpu_has_spe(vcpu))
+				kvm_arm_spe_notify_vcpu_init(vcpu);
+		} else {
 			__flush_icache_all();
+		}
 	}
 
 	vcpu_reset_hcr(vcpu);
@@ -1045,6 +1059,45 @@ static int kvm_arm_vcpu_has_attr(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
+static int kvm_arm_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret = -ENXIO;
+
+	switch (attr->group) {
+	default:
+		ret = kvm_arm_vm_arch_set_attr(kvm, attr);
+		break;
+	}
+
+	return ret;
+}
+
+static int kvm_arm_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret = -ENXIO;
+
+	switch (attr->group) {
+	default:
+		ret = kvm_arm_vm_arch_get_attr(kvm, attr);
+		break;
+	}
+
+	return ret;
+}
+
+static int kvm_arm_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret = -ENXIO;
+
+	switch (attr->group) {
+	default:
+		ret = kvm_arm_vm_arch_has_attr(kvm, attr);
+		break;
+	}
+
+	return ret;
+}
+
 static int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
 				   struct kvm_vcpu_events *events)
 {
@@ -1259,6 +1312,27 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 		return 0;
 	}
+	case KVM_SET_DEVICE_ATTR: {
+		struct kvm_device_attr attr;
+
+		if (copy_from_user(&attr, argp, sizeof(attr)))
+			return -EFAULT;
+		return kvm_arm_vm_set_attr(kvm, &attr);
+	}
+	case KVM_GET_DEVICE_ATTR: {
+		struct kvm_device_attr attr;
+
+		if (copy_from_user(&attr, argp, sizeof(attr)))
+			return -EFAULT;
+		return kvm_arm_vm_get_attr(kvm, &attr);
+	}
+	case KVM_HAS_DEVICE_ATTR: {
+		struct kvm_device_attr attr;
+
+		if (copy_from_user(&attr, argp, sizeof(attr)))
+			return -EFAULT;
+		return kvm_arm_vm_has_attr(kvm, &attr);
+	}
 	default:
 		return -EINVAL;
 	}
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 2ba790eeb782..d0dc4bdb8b4a 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -988,3 +988,51 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 
 	return ret;
 }
+
+int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret;
+
+	switch (attr->group) {
+	case KVM_ARM_VM_SPE_CTRL:
+		ret = kvm_arm_vm_spe_set_attr(kvm, attr);
+		break;
+	default:
+		ret = -ENXIO;
+		break;
+	}
+
+	return ret;
+}
+
+int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret;
+
+	switch (attr->group) {
+	case KVM_ARM_VM_SPE_CTRL:
+		ret = kvm_arm_vm_spe_get_attr(kvm, attr);
+		break;
+	default:
+		ret = -ENXIO;
+		break;
+	}
+
+	return ret;
+}
+
+int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret;
+
+	switch (attr->group) {
+	case KVM_ARM_VM_SPE_CTRL:
+		ret = kvm_arm_vm_spe_has_attr(kvm, attr);
+		break;
+	default:
+		ret = -ENXIO;
+		break;
+	}
+
+	return ret;
+}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c3c43555490d..31b2216a5881 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1365,6 +1365,175 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	return ret;
 }
 
+static int stage2_map_vma(struct kvm *kvm,
+			  struct kvm_memory_slot *memslot,
+			  struct vm_area_struct *vma,
+			  enum kvm_pgtable_prot prot,
+			  unsigned long mmu_seq, hva_t *hvap,
+			  struct kvm_mmu_memory_cache *cache)
+{
+	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
+	unsigned long stage2_pagesize, remaining;
+	bool force_pte, writable;
+	hva_t hva, hva_end;
+	kvm_pfn_t pfn;
+	gpa_t gpa;
+	gfn_t gfn;
+	int ret;
+
+	hva = max(memslot->userspace_addr, vma->vm_start);
+	hva_end = min(vma->vm_end, memslot->userspace_addr +
+			(memslot->npages << PAGE_SHIFT));
+
+	gpa = (memslot->base_gfn << PAGE_SHIFT) + hva - memslot->userspace_addr;
+	gfn = gpa >> PAGE_SHIFT;
+
+	stage2_pagesize = 1UL << stage2_max_pageshift(memslot, vma, hva, &force_pte);
+
+	while (hva < hva_end) {
+		ret = kvm_mmu_topup_memory_cache(cache,
+						 kvm_mmu_cache_min_pages(kvm));
+		if (ret)
+			return ret;
+
+		/*
+		 * We start mapping with the highest possible page size, so the
+		 * gpa and gfn will always be properly aligned to the current
+		 * page size.
+		 */
+		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL, true, &writable);
+		if (pfn == KVM_PFN_ERR_HWPOISON)
+			return -EFAULT;
+		if (is_error_noslot_pfn(pfn))
+			return -EFAULT;
+		/* Can only happen if naughty userspace changed the VMA. */
+		if (kvm_is_device_pfn(pfn) || !writable)
+			return -EAGAIN;
+
+		spin_lock(&kvm->mmu_lock);
+		if (mmu_notifier_retry(kvm, mmu_seq)) {
+			spin_unlock(&kvm->mmu_lock);
+			return -EAGAIN;
+		}
+
+		remaining = hva_end - hva;
+		if (stage2_pagesize == PUD_SIZE && remaining < PUD_SIZE)
+			stage2_pagesize = PMD_SIZE;
+		if (stage2_pagesize == PMD_SIZE && remaining < PMD_SIZE) {
+			force_pte = true;
+			stage2_pagesize = PAGE_SIZE;
+		}
+
+		if (!force_pte && stage2_pagesize == PAGE_SIZE)
+			/*
+			 * The hva and gpa will always be PMD aligned if
+			 * hva is backed by a transparent huge page. gpa will
+			 * not be modified and it's not necessary to recompute
+			 * hva.
+			 */
+			stage2_pagesize = transparent_hugepage_adjust(memslot, hva, &pfn, &gpa);
+
+		ret = kvm_pgtable_stage2_map(pgt, gpa, stage2_pagesize,
+					     __pfn_to_phys(pfn), prot, cache);
+		spin_unlock(&kvm->mmu_lock);
+
+		kvm_set_pfn_accessed(pfn);
+		kvm_release_pfn_dirty(pfn);
+
+		if (ret)
+			return ret;
+		else if (hva < hva_end)
+			cond_resched();
+
+		hva += stage2_pagesize;
+		gpa += stage2_pagesize;
+		gfn = gpa >> PAGE_SHIFT;
+	}
+
+	*hvap = hva;
+	return 0;
+}
+
+int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			   enum kvm_pgtable_prot prot)
+{
+	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
+	struct vm_area_struct *vma;
+	unsigned long mmu_seq;
+	hva_t hva, hva_memslot_end;
+	int ret;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	if (!(prot & KVM_PGTABLE_PROT_R))
+		return -EPERM;
+	if ((prot & KVM_PGTABLE_PROT_W) && (memslot->flags & KVM_MEM_READONLY))
+		return -EPERM;
+
+	hva = memslot->userspace_addr;
+	hva_memslot_end = memslot->userspace_addr + (memslot->npages << PAGE_SHIFT);
+
+	/*
+	 * Be extra careful here in case userspace is messing with the VMAs
+	 * backing the memslot.
+	 */
+	mmu_seq = kvm->mmu_notifier_seq;
+	smp_rmb();
+
+	/*
+	 * A memslot might span multiple VMAs and any holes between them, while
+	 * a VMA might span multiple memslots (see
+	 * kvm_arch_prepare_memory_region()). Take the intersection of the VMAs
+	 * with the memslot.
+	 */
+	do {
+		mmap_read_lock(current->mm);
+		vma = find_vma(current->mm, hva);
+		/*
+		 * find_vma() returns first VMA with hva < vma->vm_end, which
+		 * means that it is possible for the VMA to start *after* the
+		 * end of the memslot.
+		 */
+		if (!vma || vma->vm_start >= hva_memslot_end) {
+			mmap_read_unlock(current->mm);
+			return 0;
+		}
+
+		/*
+		 * VM_LOCKED pages are put in the unevictable LRU list and
+		 * hugetlb pages are not put in any LRU list; both will stay
+		 * pinned in memory.
+		 */
+		if (!(vma->vm_flags & VM_LOCKED) && !is_vm_hugetlb_page(vma)) {
+			/* Go to next VMA. */
+			hva = vma->vm_end;
+			mmap_read_unlock(current->mm);
+			continue;
+		}
+		if (!(vma->vm_flags & VM_READ) ||
+		    ((prot & KVM_PGTABLE_PROT_W) && !(vma->vm_flags & VM_WRITE))) {
+			/* Go to next VMA. */
+			hva = vma->vm_end;
+			mmap_read_unlock(current->mm);
+			continue;
+		}
+		mmap_read_unlock(current->mm);
+
+		ret = stage2_map_vma(kvm, memslot, vma, prot, mmu_seq, &hva, &cache);
+		if (ret)
+			return ret;
+	} while (hva < hva_memslot_end);
+
+	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB)) {
+		spin_lock(&kvm->mmu_lock);
+		stage2_flush_memslot(kvm, memslot);
+		spin_unlock(&kvm->mmu_lock);
+	}
+
+	return 0;
+}
+
+
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 }
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index f91a52cd7cd3..316ff8dfed5b 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -10,6 +10,13 @@
 #include <kvm/arm_spe.h>
 #include <kvm/arm_vgic.h>
 
+#include <asm/kvm_mmu.h>
+
+void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
+{
+	vcpu->kvm->arch.spe.finalized = false;
+}
+
 static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu_has_spe(vcpu))
@@ -115,6 +122,50 @@ int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 	return -ENXIO;
 }
 
+static int kvm_arm_spe_finalize(struct kvm *kvm)
+{
+	struct kvm_memory_slot *memslot;
+	enum kvm_pgtable_prot prot;
+	struct kvm_vcpu *vcpu;
+	int i, ret = 0;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (!kvm_arm_spe_vcpu_initialized(vcpu))
+			return -ENXIO;
+	}
+
+	mutex_lock(&kvm->slots_lock);
+	if (kvm_arm_spe_finalized(kvm)) {
+		mutex_unlock(&kvm->slots_lock);
+		return -EBUSY;
+	}
+
+	prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
+	kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
+		/* Only map memory that SPE can write to. */
+		if (memslot->flags & KVM_MEM_READONLY)
+			continue;
+		/*
+		 * Dirty page logging will write-protect pages, which breaks
+		 * SPE.
+		 */
+		if (memslot->dirty_bitmap)
+			continue;
+		ret = kvm_map_locked_memslot(kvm, memslot, prot);
+		if (ret)
+			break;
+	}
+
+	if (!ret)
+		kvm->arch.spe.finalized = true;
+	mutex_unlock(&kvm->slots_lock);
+
+	if (ret)
+		stage2_unmap_vm(kvm);
+
+	return ret;
+}
+
 int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
 	switch (attr->attr) {
@@ -127,3 +178,33 @@ int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 
 	return -ENXIO;
 }
+
+int kvm_arm_vm_spe_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	switch (attr->attr) {
+	case KVM_ARM_VM_SPE_FINALIZE:
+		return kvm_arm_spe_finalize(kvm);
+	}
+
+	return -ENXIO;
+}
+
+int kvm_arm_vm_spe_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+int kvm_arm_vm_spe_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	switch (attr->attr) {
+	case KVM_ARM_VM_SPE_FINALIZE:
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			if (kvm_arm_vcpu_supports_spe(vcpu))
+				return 0;
+	}
+
+	return -ENXIO;
+}
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
index 0275e8097529..7f9f3a03aadb 100644
--- a/include/kvm/arm_spe.h
+++ b/include/kvm/arm_spe.h
@@ -18,23 +18,38 @@ struct kvm_spe_cpu {
 	bool initialized; 	/* Feature is initialized on VCPU */
 };
 
+struct kvm_spe {
+	bool finalized;
+};
+
 #define kvm_arm_spe_irq_initialized(v)			\
 	((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&	\
 	 (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)
 #define kvm_arm_spe_vcpu_initialized(v)	((v)->arch.spe_cpu.initialized)
+#define kvm_arm_spe_finalized(k)	((k)->arch.spe.finalized)
 
 int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 
+int kvm_arm_vm_spe_set_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
+int kvm_arm_vm_spe_get_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
+int kvm_arm_vm_spe_has_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
+
+void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu);
+
 #else
 #define kvm_arm_supports_spe()	false
 
 struct kvm_spe_cpu {
 };
 
+struct kvm_spe {
+};
+
 #define kvm_arm_spe_irq_initialized(v)	false
 #define kvm_arm_spe_vcpu_initialized(v)	false
+#define kvm_arm_spe_finalized(k)	false
 
 static inline int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu,
 				       struct kvm_device_attr *attr)
@@ -51,5 +66,26 @@ static inline int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu,
 {
 	return -ENXIO;
 }
+
+static inline int kvm_arm_vm_spe_set_attr(struct kvm *vcpu,
+					  struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static inline int kvm_arm_vm_spe_get_attr(struct kvm *vcpu,
+					  struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static inline int kvm_arm_vm_spe_has_attr(struct kvm *vcpu,
+					  struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static inline void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu) {}
+
 #endif /* CONFIG_KVM_ARM_SPE */
 #endif /* __ASM_ARM_KVM_SPE_H */
-- 
2.29.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 11/16] KVM: arm64: Add SPE system registers to VCPU context
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (9 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 12/16] KVM: arm64: VHE: Clear MDCR_EL2.E2PB in vcpu_put() Alexandru Elisei
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

Add the SPE system registers to the VCPU context. Omitted are PMBIDR_EL1,
which cannot be trapped, and PMSIDR_EL1, which is a read-only register. The
registers are simply stored in the sys_regs array on a write, and returned
on a read; complete emulation and save/restore on world switch will be
added in a future patch.
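
One consequence worth noting, assuming I'm reading the sys_regs plumbing
correctly (the patch doesn't spell it out): since spe_visibility() no longer
hides the registers for SPE-enabled VCPUs, they should also become accessible
from userspace via KVM_GET_ONE_REG/KVM_SET_ONE_REG. A hypothetical example,
using the PMSCR_EL1 encoding (op0=3, op1=0, CRn=9, CRm=9, op2=0) from
sysreg.h:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int read_guest_pmscr_el1(int vcpu_fd, uint64_t *val)
{
	struct kvm_one_reg reg = {
		.id   = ARM64_SYS_REG(3, 0, 9, 9, 0),
		.addr = (uint64_t)val,
	};

	return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
}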

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h | 11 +++++++
 arch/arm64/kvm/spe.c              | 10 +++++++
 arch/arm64/kvm/sys_regs.c         | 48 ++++++++++++++++++++++++-------
 include/kvm/arm_spe.h             |  9 ++++++
 4 files changed, 68 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 27f581750c6e..bcecc6224c59 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -194,6 +194,17 @@ enum vcpu_sysreg {
 	CNTP_CVAL_EL0,
 	CNTP_CTL_EL0,
 
+	/* Statistical Profiling Extension Registers. */
+	PMSCR_EL1,	/* Statistical Profiling Control Register */
+	PMSICR_EL1,	/* Sampling Interval Counter Register */
+	PMSIRR_EL1,	/* Sampling Interval Reload Register */
+	PMSFCR_EL1,	/* Sampling Filter Control Register */
+	PMSEVFR_EL1,	/* Sampling Event Filter Register */
+	PMSLATFR_EL1,	/* Sampling Latency Filter Register */
+	PMBLIMITR_EL1,	/* Profiling Buffer Limit Address Register */
+	PMBPTR_EL1,	/* Profiling Buffer Write Pointer Register */
+	PMBSR_EL1,	/* Profiling Buffer Status/syndrome Register */
+
 	/* 32bit specific registers. Keep them at the end of the range */
 	DACR32_EL2,	/* Domain Access Control Register */
 	IFSR32_EL2,	/* Instruction Fault Status Register */
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 316ff8dfed5b..0e365a51cac7 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -12,6 +12,16 @@
 
 #include <asm/kvm_mmu.h>
 
+void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
+{
+	__vcpu_sys_reg(vcpu, reg) = val;
+}
+
+u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
+{
+	return __vcpu_sys_reg(vcpu, reg);
+}
+
 void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
 {
 	vcpu->kvm->arch.spe.finalized = false;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index aa776c006a2a..2871484993ec 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -244,9 +244,37 @@ static bool access_vm_reg(struct kvm_vcpu *vcpu,
 	return true;
 }
 
+static bool access_spe_reg(struct kvm_vcpu *vcpu,
+			   struct sys_reg_params *p,
+			   const struct sys_reg_desc *r)
+{
+	u64 val = p->regval;
+	int reg = r->reg;
+	u32 sr = sys_reg((u32)r->Op0, (u32)r->Op1,
+			 (u32)r->CRn, (u32)r->CRm, (u32)r->Op2);
+
+	if (sr == SYS_PMSIDR_EL1) {
+		/* Ignore writes. */
+		if (!p->is_write)
+			p->regval = read_sysreg_s(SYS_PMSIDR_EL1);
+		goto out;
+	}
+
+	if (p->is_write)
+		kvm_arm_spe_write_sysreg(vcpu, reg, val);
+	else
+		p->regval = kvm_arm_spe_read_sysreg(vcpu, reg);
+
+out:
+	return true;
+}
+
 static unsigned int spe_visibility(const struct kvm_vcpu *vcpu,
 				   const struct sys_reg_desc *r)
 {
+	if (vcpu_has_spe(vcpu))
+		return 0;
+
 	return REG_HIDDEN_GUEST | REG_HIDDEN_USER;
 }
 
@@ -1598,16 +1626,16 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
 	{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },
 
-	{ SYS_DESC(SYS_PMSCR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSICR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSIRR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSFCR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSEVFR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSLATFR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSIDR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMBLIMITR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMBPTR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMBSR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSCR_EL1), access_spe_reg, reset_val, PMSCR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSICR_EL1), access_spe_reg, reset_val, PMSICR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSIRR_EL1), access_spe_reg, reset_val, PMSIRR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSFCR_EL1), access_spe_reg, reset_val, PMSFCR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSEVFR_EL1), access_spe_reg, reset_val, PMSEVFR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSLATFR_EL1), access_spe_reg, reset_val, PMSLATFR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSIDR_EL1), access_spe_reg, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBLIMITR_EL1), access_spe_reg, reset_val, PMBLIMITR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBPTR_EL1), access_spe_reg, reset_unknown, PMBPTR_EL1, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBSR_EL1), access_spe_reg, reset_val, PMBSR_EL1, 0, .visibility = spe_visibility },
 
 	{ SYS_DESC(SYS_PMINTENSET_EL1), access_pminten, reset_unknown, PMINTENSET_EL1 },
 	{ SYS_DESC(SYS_PMINTENCLR_EL1), access_pminten, reset_unknown, PMINTENSET_EL1 },
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
index 7f9f3a03aadb..a2429edc4483 100644
--- a/include/kvm/arm_spe.h
+++ b/include/kvm/arm_spe.h
@@ -38,6 +38,9 @@ int kvm_arm_vm_spe_has_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
 
 void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu);
 
+void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val);
+u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg);
+
 #else
 #define kvm_arm_supports_spe()	false
 
@@ -87,5 +90,11 @@ static inline int kvm_arm_vm_spe_has_attr(struct kvm *vcpu,
 
 static inline void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu) {}
 
+static inline void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val) {}
+static inline u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
+{
+	return 0;
+}
+
 #endif /* CONFIG_KVM_ARM_SPE */
 #endif /* __ASM_ARM_KVM_SPE_H */
-- 
2.29.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 12/16] KVM: arm64: VHE: Clear MDCR_EL2.E2PB in vcpu_put()
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (10 preceding siblings ...)
  2020-10-27 17:27 ` [RFC PATCH v3 11/16] KVM: arm64: Add SPE system registers to VCPU context Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 13/16] KVM: arm64: Switch SPE context on VM entry/exit Alexandru Elisei
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, Sudeep Holla, will

From: Sudeep Holla <sudeep.holla@arm.com>

On VHE systems, the kernel executes at EL2 and configures the profiling
buffer to use the EL2&0 translation regime and to trap accesses from the
guest by clearing MDCR_EL2.E2PB. In vcpu_put(), KVM includes the E2PB field in
the mask it ANDs MDCR_EL2 with, preserving the field's value. This has been
correct so far, since MDCR_EL2.E2PB has the same value (0b00) for all VMs.

However, this will change when KVM enables support for SPE in guests. For
such guests KVM will configure the profiling buffer to use the EL1&0
translation regime, a setting that is obviously undesirable to be preserved
for the host running at EL2. Let's avoid this situation by explicitly
clearing E2PB in vcpu_put().
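
For reference, my reading of the MDCR_EL2.E2PB encodings involved here (worth
double-checking against the ARM ARM; the enum and its names below are only
for illustration, the code keeps using raw field values):

/* MDCR_EL2.E2PB, bits [13:12] */
enum mdcr_e2pb {
	E2PB_EL2_OWNED_TRAP	= 0,	/* buffer uses the EL2(&0) regime, EL1 accesses trap */
					/* 1 is reserved */
	E2PB_EL1_OWNED_TRAP	= 2,	/* buffer uses the EL1&0 regime, EL1 accesses trap */
	E2PB_EL1_OWNED_NO_TRAP	= 3,	/* buffer uses the EL1&0 regime, no trapping */
};

0b00 is what a VHE host wants for itself, and 0b10 is what this series later
programs for a guest, which is why blindly preserving the field stops being
safe.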

[ Alexandru E: Rebased on top of 5.10-rc1, reworded commit ]

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/hyp/vhe/switch.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index fe69de16dadc..3f4db1fa388b 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -97,9 +97,7 @@ void deactivate_traps_vhe_put(void)
 {
 	u64 mdcr_el2 = read_sysreg(mdcr_el2);
 
-	mdcr_el2 &= MDCR_EL2_HPMN_MASK |
-		    MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT |
-		    MDCR_EL2_TPMS;
+	mdcr_el2 &= MDCR_EL2_HPMN_MASK | MDCR_EL2_TPMS;
 
 	write_sysreg(mdcr_el2, mdcr_el2);
 
-- 
2.29.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 13/16] KVM: arm64: Switch SPE context on VM entry/exit
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (11 preceding siblings ...)
  2020-10-27 17:27 ` [RFC PATCH v3 12/16] KVM: arm64: VHE: Clear MDCR_EL2.E2PB in vcpu_put() Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 14/16] KVM: arm64: Emulate SPE buffer management interrupt Alexandru Elisei
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

When the host and the guest are using SPE at the same time, KVM will have
to save and restore the proper SPE context on VM entry (save the host's,
restore the guest's) and on VM exit (save the guest's, restore the host's).

On systems without VHE, the world switch happens at EL2, while both the
guest and the host execute at EL1, and according to ARM DDI 0487F.b, page
D9-2807, sampling is disabled in this case:

"If the PE takes an exception to an Exception level where the Statistical
Profiling Extension is disabled, no new operations are selected for
sampling."

We still have to disable the buffer before we switch translation regimes
because we don't want the SPE buffer to speculate memory accesses using a
stale buffer pointer.

On VHE systems, the world switch happens at EL2, with the host potentially
in the middle of a profiling session, so we also need to explicitly disable
host sampling.

The buffer owning Exception level is determined by MDCR_EL2.E2PB. On
systems with VHE, this differs between the guest (which executes at EL1) and
the host (which executes at EL2). The current behavior of perf is to profile
KVM until it drops to the guest at EL1. To preserve this behavior as much
as possible, KVM will defer changing the value of MDCR_EL2 until
__{activate,deactivate}_traps().

For the purposes of emulating the SPE buffer management interrupt, MDCR_EL2
is configured to trap accesses to the buffer control registers; the guest
can access the rest of the registers directly.
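
A note on a recurring pattern in the save paths below: the buffer is always
drained and disabled with the same four-step sequence before the translation
regime changes. Pulled out here as a sketch (the helper name is made up, the
patch open-codes the sequence), with the intent of each step spelled out:

static void spe_drain_and_disable_buffer(void)
{
	/* PSB CSYNC: ensure all completed sample records are written out. */
	psb_csync();
	/* Wait for the buffer writes to become visible in memory. */
	dsb(nsh);
	/* Clear PMBLIMITR_EL1.E to disable the profiling buffer. */
	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
	/* The buffer must be disabled before the regime actually changes. */
	isb();
}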

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_arm.h        |   1 +
 arch/arm64/include/asm/kvm_hyp.h        |  28 +++++-
 arch/arm64/include/asm/sysreg.h         |   1 +
 arch/arm64/kvm/debug.c                  |  29 +++++-
 arch/arm64/kvm/hyp/include/hyp/spe-sr.h |  38 ++++++++
 arch/arm64/kvm/hyp/include/hyp/switch.h |   1 -
 arch/arm64/kvm/hyp/nvhe/Makefile        |   1 +
 arch/arm64/kvm/hyp/nvhe/debug-sr.c      |  16 ++-
 arch/arm64/kvm/hyp/nvhe/spe-sr.c        |  93 ++++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/switch.c        |  12 +++
 arch/arm64/kvm/hyp/vhe/Makefile         |   1 +
 arch/arm64/kvm/hyp/vhe/spe-sr.c         | 124 ++++++++++++++++++++++++
 arch/arm64/kvm/hyp/vhe/switch.c         |  48 ++++++++-
 arch/arm64/kvm/hyp/vhe/sysreg-sr.c      |   2 +-
 arch/arm64/kvm/spe.c                    |   3 +
 arch/arm64/kvm/sys_regs.c               |   1 +
 16 files changed, 384 insertions(+), 15 deletions(-)
 create mode 100644 arch/arm64/kvm/hyp/include/hyp/spe-sr.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/spe-sr.c
 create mode 100644 arch/arm64/kvm/hyp/vhe/spe-sr.c

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 64ce29378467..033980a9b3fc 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -280,6 +280,7 @@
 #define MDCR_EL2_TPMS		(1 << 14)
 #define MDCR_EL2_E2PB_MASK	(UL(0x3))
 #define MDCR_EL2_E2PB_SHIFT	(UL(12))
+#define MDCR_EL2_E2PB_EL1_TRAP	(2 << MDCR_EL2_E2PB_SHIFT)
 #define MDCR_EL2_TDRA		(1 << 11)
 #define MDCR_EL2_TDOSA		(1 << 10)
 #define MDCR_EL2_TDA		(1 << 9)
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index 6b664de5ec1f..4358cba6784a 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -79,6 +79,32 @@ void sysreg_save_guest_state_vhe(struct kvm_cpu_context *ctxt);
 void sysreg_restore_guest_state_vhe(struct kvm_cpu_context *ctxt);
 #endif
 
+#ifdef CONFIG_KVM_ARM_SPE
+#ifdef __KVM_NVHE_HYPERVISOR__
+void __sysreg_save_spe_host_state_nvhe(struct kvm_cpu_context *ctxt);
+void __sysreg_restore_spe_host_state_nvhe(struct kvm_cpu_context *ctxt);
+void __sysreg_save_spe_guest_state_nvhe(struct kvm_vcpu *vcpu);
+void __sysreg_restore_spe_guest_state_nvhe(struct kvm_vcpu *vcpu);
+#else
+void sysreg_save_spe_host_state_vhe(struct kvm_cpu_context *ctxt);
+void sysreg_restore_spe_host_state_vhe(struct kvm_cpu_context *ctxt);
+void sysreg_save_spe_guest_state_vhe(struct kvm_vcpu *vcpu);
+void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu);
+#endif
+#else	/* !CONFIG_KVM_ARM_SPE */
+#ifdef __KVM_NVHE_HYPERVISOR__
+static inline void __sysreg_save_spe_host_state_nvhe(struct kvm_cpu_context *ctxt) {}
+static inline void __sysreg_restore_spe_host_state_nvhe(struct kvm_cpu_context *ctxt) {}
+static inline void __sysreg_save_spe_guest_state_nvhe(struct kvm_vcpu *vcpu) {}
+static inline void __sysreg_restore_spe_guest_state_nvhe(struct kvm_vcpu *vcpu) {}
+#else
+static inline void sysreg_save_spe_host_state_vhe(struct kvm_cpu_context *ctxt) {}
+static inline void sysreg_restore_spe_host_state_vhe(struct kvm_cpu_context *ctxt) {}
+static inline void sysreg_save_spe_guest_state_vhe(struct kvm_vcpu *vcpu) {}
+static inline void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu) {}
+#endif
+#endif /* CONFIG_KVM_ARM_SPE */
+
 void __debug_switch_to_guest(struct kvm_vcpu *vcpu);
 void __debug_switch_to_host(struct kvm_vcpu *vcpu);
 
@@ -87,7 +113,7 @@ void __fpsimd_restore_state(struct user_fpsimd_state *fp_regs);
 
 #ifndef __KVM_NVHE_HYPERVISOR__
 void activate_traps_vhe_load(struct kvm_vcpu *vcpu);
-void deactivate_traps_vhe_put(void);
+void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu);
 #endif
 
 u64 __guest_enter(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index d52c1b3ce589..20159af17578 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -255,6 +255,7 @@
 
 /* Sampling controls */
 #define SYS_PMSCR_EL1			sys_reg(3, 0, 9, 9, 0)
+#define SYS_PMSCR_EL12			sys_reg(3, 5, 9, 9, 0)
 #define SYS_PMSCR_EL1_E0SPE_SHIFT	0
 #define SYS_PMSCR_EL1_E1SPE_SHIFT	1
 #define SYS_PMSCR_EL1_CX_SHIFT		3
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 22ee448aee2b..892ce9cc4079 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -84,17 +84,28 @@ static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu, u32 host_mdcr)
 {
 	bool trap_debug = !(vcpu->arch.flags & KVM_ARM64_DEBUG_DIRTY);
 
-	/*
-	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access
-	 * to the profiling buffer.
-	 */
 	vcpu->arch.mdcr_el2 = host_mdcr & MDCR_EL2_HPMN_MASK;
 	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
-				MDCR_EL2_TPMS |
 				MDCR_EL2_TPMCR |
 				MDCR_EL2_TDRA |
 				MDCR_EL2_TDOSA);
 
+	if (vcpu_has_spe(vcpu)) {
+		/*
+		 * Use EL1&0 translation regime, trap accesses to the buffer
+		 * control registers, allow guest direct access to the
+		 * statistical profiling control registers by leaving the TPMS
+		 * bit clear.
+		 */
+		vcpu->arch.mdcr_el2 |= MDCR_EL2_E2PB_EL1_TRAP;
+	} else {
+		/*
+		 * Disable buffer by leaving E2PB zero, trap accesses to all SPE
+		 * registers.
+		 */
+		vcpu->arch.mdcr_el2 |= MDCR_EL2_TPMS;
+	}
+
 	if (vcpu->guest_debug) {
 		/* Route all software debug exceptions to EL2 */
 		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
@@ -227,10 +238,18 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
 	if (vcpu_read_sys_reg(vcpu, MDSCR_EL1) & (DBG_MDSCR_KDE | DBG_MDSCR_MDE))
 		vcpu->arch.flags |= KVM_ARM64_DEBUG_DIRTY;
 
+	/*
+	 * On VHE systems, when the guest has SPE, MDCR_EL2 write is deferred
+	 * until __activate_traps().
+	 */
+	if (has_vhe() && vcpu_has_spe(vcpu))
+		goto out;
+
 	/* Write mdcr_el2 changes since vcpu_load on VHE systems */
 	if (has_vhe() && orig_mdcr_el2 != vcpu->arch.mdcr_el2)
 		write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
 
+out:
 	trace_kvm_arm_set_dreg32("MDSCR_EL1", vcpu_read_sys_reg(vcpu, MDSCR_EL1));
 }
 
diff --git a/arch/arm64/kvm/hyp/include/hyp/spe-sr.h b/arch/arm64/kvm/hyp/include/hyp/spe-sr.h
new file mode 100644
index 000000000000..00ed684c117c
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/hyp/spe-sr.h
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 - ARM Ltd
+ * Author: Alexandru Elisei <alexandru.elisei@arm.com>
+ */
+
+#ifndef __ARM64_KVM_HYP_SPE_SR_H__
+#define __ARM64_KVM_HYP_SPE_SR_H__
+
+#include <linux/kvm_host.h>
+
+#include <asm/sysreg.h>
+
+#ifdef CONFIG_KVM_ARM_SPE
+static inline void __sysreg_save_spe_state_common(struct kvm_cpu_context *ctxt)
+{
+	ctxt_sys_reg(ctxt, PMSICR_EL1) = read_sysreg_s(SYS_PMSICR_EL1);
+	ctxt_sys_reg(ctxt, PMSIRR_EL1) = read_sysreg_s(SYS_PMSIRR_EL1);
+	ctxt_sys_reg(ctxt, PMSFCR_EL1) = read_sysreg_s(SYS_PMSFCR_EL1);
+	ctxt_sys_reg(ctxt, PMSEVFR_EL1) = read_sysreg_s(SYS_PMSEVFR_EL1);
+	ctxt_sys_reg(ctxt, PMSLATFR_EL1) = read_sysreg_s(SYS_PMSLATFR_EL1);
+}
+
+
+static inline void __sysreg_restore_spe_state_common(struct kvm_cpu_context *ctxt)
+{
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSICR_EL1), SYS_PMSICR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSIRR_EL1), SYS_PMSIRR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSFCR_EL1), SYS_PMSFCR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSEVFR_EL1), SYS_PMSEVFR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSLATFR_EL1), SYS_PMSLATFR_EL1);
+}
+#else
+static inline void __sysreg_save_spe_state_common(struct kvm_cpu_context *ctxt) {}
+static inline void __sysreg_restore_spe_state_common(struct kvm_cpu_context *ctxt) {}
+
+#endif /* CONFIG_KVM_ARM_SPE */
+#endif /* __ARM64_KVM_HYP_SPE_SR_H__ */
diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
index 313a8fa3c721..c88a40eeb171 100644
--- a/arch/arm64/kvm/hyp/include/hyp/switch.h
+++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
@@ -90,7 +90,6 @@ static inline void __activate_traps_common(struct kvm_vcpu *vcpu)
 	 */
 	write_sysreg(0, pmselr_el0);
 	write_sysreg(ARMV8_PMU_USERENR_MASK, pmuserenr_el0);
-	write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
 }
 
 static inline void __deactivate_traps_common(void)
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index ddde15fe85f2..fcc33b682a45 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -7,6 +7,7 @@ asflags-y := -D__KVM_NVHE_HYPERVISOR__
 ccflags-y := -D__KVM_NVHE_HYPERVISOR__
 
 obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o hyp-main.o
+obj-$(CONFIG_KVM_ARM_SPE) += spe-sr.o
 obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
 	 ../fpsimd.o ../hyp-entry.o
 
diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
index 91a711aa8382..af65afca479a 100644
--- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
@@ -60,14 +60,24 @@ static void __debug_restore_spe(u64 pmscr_el1)
 
 void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
 {
-	/* Disable and flush SPE data generation */
-	__debug_save_spe(&vcpu->arch.host_debug_state.pmscr_el1);
+	/*
+	 * If the guest is using SPE, host SPE was disabled when the host state
+	 * was saved.
+	 */
+	if (!vcpu_has_spe(vcpu))
+		/* Disable and flush SPE data generation */
+		__debug_save_spe(&vcpu->arch.host_debug_state.pmscr_el1);
 	__debug_switch_to_guest_common(vcpu);
 }
 
 void __debug_switch_to_host(struct kvm_vcpu *vcpu)
 {
-	__debug_restore_spe(vcpu->arch.host_debug_state.pmscr_el1);
+	/*
+	 * Host SPE state was restored with the rest of the host registers when
+	 * the guest is using SPE.
+	 */
+	if (!vcpu_has_spe(vcpu))
+		__debug_restore_spe(vcpu->arch.host_debug_state.pmscr_el1);
 	__debug_switch_to_host_common(vcpu);
 }
 
diff --git a/arch/arm64/kvm/hyp/nvhe/spe-sr.c b/arch/arm64/kvm/hyp/nvhe/spe-sr.c
new file mode 100644
index 000000000000..a73ee820b27f
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/spe-sr.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 - ARM Ltd
+ * Author: Alexandru Elisei <alexandru.elisei@arm.com>
+ */
+
+#include <hyp/spe-sr.h>
+
+#include <linux/kvm_host.h>
+
+#include <asm/kprobes.h>
+#include <asm/kvm_hyp.h>
+
+/*
+ * The SPE profiling buffer acts like a separate observer in the system, and we
+ * need to make sure it's disabled before switching translation regimes (host to
+ * guest and vice versa).
+ *
+ * Sampling is disabled when we're at a higher exception level than the owning
+ * exception level, so we don't need to disable sampling on save/restore the
+ * way we do in the VHE case, where the host is profiling at EL2.
+ *
+ * Profiling is enabled when both sampling and the buffer are enabled, as a
+ * result we don't have to worry about PMBPTR_EL1 restrictions with regard to
+ * PMBLIMITR_EL1.LIMIT.
+ */
+
+void __sysreg_save_spe_host_state_nvhe(struct kvm_cpu_context *ctxt)
+{
+	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+
+	if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+		psb_csync();
+		dsb(nsh);
+		write_sysreg_s(0, SYS_PMBLIMITR_EL1);
+		isb();
+	}
+
+	ctxt_sys_reg(ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+	ctxt_sys_reg(ctxt, PMBSR_EL1) = read_sysreg_s(SYS_PMBSR_EL1);
+	ctxt_sys_reg(ctxt, PMBLIMITR_EL1) = pmblimitr;
+	ctxt_sys_reg(ctxt, PMSCR_EL1) = read_sysreg_s(SYS_PMSCR_EL1);
+
+	__sysreg_save_spe_state_common(ctxt);
+}
+
+void __sysreg_restore_spe_guest_state_nvhe(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+
+	__sysreg_restore_spe_state_common(guest_ctxt);
+
+	/* Make sure the switch to the guest's stage 1 + stage 2 is visible */
+	isb();
+
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+	/* The guest buffer management event interrupt is always virtual. */
+	write_sysreg_s(0, SYS_PMBSR_EL1);
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMSCR_EL1), SYS_PMSCR_EL1);
+}
+
+void __sysreg_save_spe_guest_state_nvhe(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+
+	if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+		psb_csync();
+		dsb(nsh);
+		write_sysreg_s(0, SYS_PMBLIMITR_EL1);
+		isb();
+	}
+
+	ctxt_sys_reg(guest_ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+	ctxt_sys_reg(guest_ctxt, PMSCR_EL1) = read_sysreg_s(SYS_PMSCR_EL1);
+	/* PMBLIMITR_EL1 is updated only on trap. */
+
+	__sysreg_save_spe_state_common(guest_ctxt);
+}
+
+void __sysreg_restore_spe_host_state_nvhe(struct kvm_cpu_context *ctxt)
+{
+	__sysreg_restore_spe_state_common(ctxt);
+
+	/* Make sure the switch to host's stage 1 is visible */
+	isb();
+
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBSR_EL1), SYS_PMBSR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSCR_EL1), SYS_PMSCR_EL1);
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index a457a0306e03..4fde45c4c805 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -39,6 +39,8 @@ static void __activate_traps(struct kvm_vcpu *vcpu)
 	___activate_traps(vcpu);
 	__activate_traps_common(vcpu);
 
+	write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
+
 	val = CPTR_EL2_DEFAULT;
 	val |= CPTR_EL2_TTA | CPTR_EL2_TZ | CPTR_EL2_TAM;
 	if (!update_fp_enabled(vcpu)) {
@@ -188,6 +190,8 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 	pmu_switch_needed = __pmu_switch_to_guest(host_ctxt);
 
 	__sysreg_save_state_nvhe(host_ctxt);
+	if (vcpu_has_spe(vcpu))
+		__sysreg_save_spe_host_state_nvhe(host_ctxt);
 
 	/*
 	 * We must restore the 32-bit state before the sysregs, thanks
@@ -203,6 +207,9 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 	__load_guest_stage2(kern_hyp_va(vcpu->arch.hw_mmu));
 	__activate_traps(vcpu);
 
+	if (vcpu_has_spe(vcpu))
+		__sysreg_restore_spe_guest_state_nvhe(vcpu);
+
 	__hyp_vgic_restore_state(vcpu);
 	__timer_enable_traps(vcpu);
 
@@ -216,6 +223,9 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 	} while (fixup_guest_exit(vcpu, &exit_code));
 
 	__sysreg_save_state_nvhe(guest_ctxt);
+	if (vcpu_has_spe(vcpu))
+		__sysreg_save_spe_guest_state_nvhe(vcpu);
+
 	__sysreg32_save_state(vcpu);
 	__timer_disable_traps(vcpu);
 	__hyp_vgic_save_state(vcpu);
@@ -224,6 +234,8 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 	__load_host_stage2();
 
 	__sysreg_restore_state_nvhe(host_ctxt);
+	if (vcpu_has_spe(vcpu))
+		__sysreg_restore_spe_host_state_nvhe(host_ctxt);
 
 	if (vcpu->arch.flags & KVM_ARM64_FP_ENABLED)
 		__fpsimd_save_fpexc32(vcpu);
diff --git a/arch/arm64/kvm/hyp/vhe/Makefile b/arch/arm64/kvm/hyp/vhe/Makefile
index 461e97c375cc..daff3119c359 100644
--- a/arch/arm64/kvm/hyp/vhe/Makefile
+++ b/arch/arm64/kvm/hyp/vhe/Makefile
@@ -7,5 +7,6 @@ asflags-y := -D__KVM_VHE_HYPERVISOR__
 ccflags-y := -D__KVM_VHE_HYPERVISOR__
 
 obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o
+obj-$(CONFIG_KVM_ARM_SPE) += spe-sr.o
 obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
 	 ../fpsimd.o ../hyp-entry.o
diff --git a/arch/arm64/kvm/hyp/vhe/spe-sr.c b/arch/arm64/kvm/hyp/vhe/spe-sr.c
new file mode 100644
index 000000000000..dd947e9f249c
--- /dev/null
+++ b/arch/arm64/kvm/hyp/vhe/spe-sr.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 - ARM Ltd
+ * Author: Alexandru Elisei <alexandru.elisei@arm.com>
+ */
+
+#include <hyp/spe-sr.h>
+
+#include <linux/kvm_host.h>
+
+#include <asm/kprobes.h>
+#include <asm/kvm_hyp.h>
+
+void sysreg_save_spe_host_state_vhe(struct kvm_cpu_context *ctxt)
+{
+	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+	u64 pmscr_el2 = read_sysreg_el2(SYS_PMSCR);
+
+	/* Allow guest to select timestamp source, disable sampling. */
+	write_sysreg_el2(BIT(SYS_PMSCR_EL1_PCT_SHIFT), SYS_PMSCR);
+	if (pmscr_el2 & BIT(SYS_PMSCR_EL1_E1SPE_SHIFT))
+		isb();
+
+	if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+		psb_csync();
+		dsb(nsh);
+		write_sysreg_s(0, SYS_PMBLIMITR_EL1);
+		isb();
+	}
+
+	ctxt_sys_reg(ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+	ctxt_sys_reg(ctxt, PMBSR_EL1) = read_sysreg_s(SYS_PMBSR_EL1);
+	ctxt_sys_reg(ctxt, PMBLIMITR_EL1) = pmblimitr;
+	/*
+	 * We abuse the context register PMSCR_EL1 to save the host's PMSCR,
+	 * which is actually PMSCR_EL2 because KVM is running at EL2.
+	 */
+	ctxt_sys_reg(ctxt, PMSCR_EL1) = pmscr_el2;
+
+	__sysreg_save_spe_state_common(ctxt);
+}
+NOKPROBE_SYMBOL(sysreg_save_spe_host_state_vhe);
+
+void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+
+	/*
+	 * Make sure the write to MDCR_EL2 which changes the buffer owning
+	 * Exception level is visible.
+	 */
+	isb();
+
+	/*
+	 * Order doesn't matter because sampling is disabled at EL2. However,
+	 * restore guest registers in the same program order as the host for
+	 * consistency.
+	 */
+	__sysreg_restore_spe_state_common(guest_ctxt);
+
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+	/* The guest buffer management event interrupt is always virtual. */
+	write_sysreg_s(0, SYS_PMBSR_EL1);
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_el1(ctxt_sys_reg(guest_ctxt, PMSCR_EL1), SYS_PMSCR);
+}
+NOKPROBE_SYMBOL(sysreg_restore_spe_guest_state_vhe);
+
+void sysreg_save_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+
+	/*
+	 * We're going to switch buffer owning Exception level when we restore
+	 * the host MDCR_EL2 value, make sure the buffer is disabled until we
+	 * restore the host registers.
+	 *
+	 * Sampling at EL2 was disabled when we saved the host's SPE state, no
+	 * need to disable it again.
+	 */
+	if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+		/*
+		 * We don't need an ISB before PSB CSYNC because exception entry is a
+		 * context synchronization event.
+		 */
+		psb_csync();
+		dsb(nsh);
+		write_sysreg_s(0, SYS_PMBLIMITR_EL1);
+		isb();
+	}
+
+	ctxt_sys_reg(guest_ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+	ctxt_sys_reg(guest_ctxt, PMSCR_EL1) = read_sysreg_el1(SYS_PMSCR);
+	/* PMBLIMITR_EL1 is updated only on trap, skip saving it. */
+
+	__sysreg_save_spe_state_common(guest_ctxt);
+}
+NOKPROBE_SYMBOL(sysreg_save_spe_guest_state_vhe);
+
+void sysreg_restore_spe_host_state_vhe(struct kvm_cpu_context *ctxt)
+{
+	/*
+	 * Order matters now because we're possibly restarting profiling.
+	 * Restore common state first so PMSICR_EL1 is updated before PMSCR_EL2.
+	 */
+	__sysreg_restore_spe_state_common(ctxt);
+
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+	/*
+	 * Make sure PMBPTR_EL1 update is seen first, so we don't end up in a
+	 * situation where the buffer is enabled and the pointer passes
+	 * the value of PMBLIMITR_EL1.LIMIT programmed by the guest.
+	 *
+	 * This also serves to make sure the write to MDCR_EL2 which changes the
+	 * buffer owning Exception level is visible; the buffer is still
+	 * disabled until the write to PMBLIMITR_EL1.
+	 */
+	isb();
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBSR_EL1), SYS_PMBSR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_el2(ctxt_sys_reg(ctxt, PMSCR_EL1), SYS_PMSCR);
+}
+NOKPROBE_SYMBOL(sysreg_restore_spe_host_state_vhe);
diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 3f4db1fa388b..c7f3c8a004b6 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -64,6 +64,9 @@ static void __activate_traps(struct kvm_vcpu *vcpu)
 
 	write_sysreg(val, cpacr_el1);
 
+	if (vcpu_has_spe(vcpu))
+		write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
+
 	write_sysreg(__this_cpu_read(kvm_hyp_vector), vbar_el1);
 }
 NOKPROBE_SYMBOL(__activate_traps);
@@ -84,6 +87,13 @@ static void __deactivate_traps(struct kvm_vcpu *vcpu)
 	asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
 
 	write_sysreg(CPACR_EL1_DEFAULT, cpacr_el1);
+	if (vcpu_has_spe(vcpu)) {
+		u64 mdcr_el2 = read_sysreg(mdcr_el2);
+
+		mdcr_el2 &= MDCR_EL2_HPMN_MASK | MDCR_EL2_TPMS;
+
+		write_sysreg(mdcr_el2, mdcr_el2);
+	}
 	write_sysreg(vectors, vbar_el1);
 }
 NOKPROBE_SYMBOL(__deactivate_traps);
@@ -91,15 +101,36 @@ NOKPROBE_SYMBOL(__deactivate_traps);
 void activate_traps_vhe_load(struct kvm_vcpu *vcpu)
 {
 	__activate_traps_common(vcpu);
+	/*
+	 * When the guest is using SPE, vcpu->arch.mdcr_el2 configures the
+	 * profiling buffer to use the EL1&0 translation regime. If that's
+	 * loaded on the hardware and host has profiling enabled, the SPE buffer
+	 * will start using the guest's EL1&0 translation regime, but without
+	 * stage 2 enabled. That's bad.
+	 *
+	 * We cannot rely on checking here that profiling is enabled because
+	 * perf might install an event on the CPU via an IPI before we
+	 * disable interrupts. Instead, we defer loading the guest mdcr_el2
+	 * until __activate_traps().
+	 */
+	if (vcpu_has_spe(vcpu))
+		return;
+	write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
 }
 
-void deactivate_traps_vhe_put(void)
+void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu)
 {
-	u64 mdcr_el2 = read_sysreg(mdcr_el2);
+	/*
+	 * When the guest is using SPE, we load the host MDCR_EL2 value early,
+	 * in __deactivate_traps(), to allow perf to profile KVM.
+	 */
+	if (!vcpu_has_spe(vcpu)) {
+		u64 mdcr_el2 = read_sysreg(mdcr_el2);
 
-	mdcr_el2 &= MDCR_EL2_HPMN_MASK | MDCR_EL2_TPMS;
+		mdcr_el2 &= MDCR_EL2_HPMN_MASK | MDCR_EL2_TPMS;
 
-	write_sysreg(mdcr_el2, mdcr_el2);
+		write_sysreg(mdcr_el2, mdcr_el2);
+	}
 
 	__deactivate_traps_common();
 }
@@ -116,6 +147,8 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 	guest_ctxt = &vcpu->arch.ctxt;
 
 	sysreg_save_host_state_vhe(host_ctxt);
+	if (vcpu_has_spe(vcpu))
+		sysreg_save_spe_host_state_vhe(host_ctxt);
 
 	/*
 	 * ARM erratum 1165522 requires us to configure both stage 1 and
@@ -132,6 +165,9 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 	__activate_traps(vcpu);
 
 	sysreg_restore_guest_state_vhe(guest_ctxt);
+	if (vcpu_has_spe(vcpu))
+		sysreg_restore_spe_guest_state_vhe(vcpu);
+
 	__debug_switch_to_guest(vcpu);
 
 	do {
@@ -142,10 +178,14 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 	} while (fixup_guest_exit(vcpu, &exit_code));
 
 	sysreg_save_guest_state_vhe(guest_ctxt);
+	if (vcpu_has_spe(vcpu))
+		sysreg_save_spe_guest_state_vhe(vcpu);
 
 	__deactivate_traps(vcpu);
 
 	sysreg_restore_host_state_vhe(host_ctxt);
+	if (vcpu_has_spe(vcpu))
+		sysreg_restore_spe_host_state_vhe(host_ctxt);
 
 	if (vcpu->arch.flags & KVM_ARM64_FP_ENABLED)
 		__fpsimd_save_fpexc32(vcpu);
diff --git a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
index 2a0b8c88d74f..007a12dd4351 100644
--- a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
+++ b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
@@ -101,7 +101,7 @@ void kvm_vcpu_put_sysregs_vhe(struct kvm_vcpu *vcpu)
 	struct kvm_cpu_context *host_ctxt;
 
 	host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
-	deactivate_traps_vhe_put();
+	deactivate_traps_vhe_put(vcpu);
 
 	__sysreg_save_el1_state(guest_ctxt);
 	__sysreg_save_user_state(guest_ctxt);
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 0e365a51cac7..ba80f2716a11 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2019 ARM Ltd.
  */
 
+#include <linux/bug.h>
 #include <linux/kvm.h>
 #include <linux/kvm_host.h>
 #include <linux/uaccess.h>
@@ -14,11 +15,13 @@
 
 void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
 {
+	WARN(reg < PMBLIMITR_EL1, "Unexpected trap to SPE register\n");
 	__vcpu_sys_reg(vcpu, reg) = val;
 }
 
 u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
 {
+	WARN(reg < PMBLIMITR_EL1, "Unexpected trap to SPE register\n");
 	return __vcpu_sys_reg(vcpu, reg);
 }
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 2871484993ec..3a0687602839 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -254,6 +254,7 @@ static bool access_spe_reg(struct kvm_vcpu *vcpu,
 			 (u32)r->CRn, (u32)r->CRm, (u32)r->Op2);
 
 	if (sr == SYS_PMSIDR_EL1) {
+		WARN(true, "Unexpected trap to SPE register\n");
 		/* Ignore writes. */
 		if (!p->is_write)
 			p->regval = read_sysreg_s(SYS_PMSIDR_EL1);
-- 
2.29.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 14/16] KVM: arm64: Emulate SPE buffer management interrupt
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (12 preceding siblings ...)
  2020-10-27 17:27 ` [RFC PATCH v3 13/16] KVM: arm64: Switch SPE context on VM entry/exit Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 15/16] KVM: arm64: Enable SPE for guests Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 16/16] Documentation: arm64: Document ARM Neoverse-N1 erratum #1688567 Alexandru Elisei
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

A profiling buffer management interrupt is asserted when the buffer fills,
on a fault or on an external abort. The hardware sets the service bit,
PMBSR_EL1.S, when it asserts the interrupt and keeps the interrupt asserted
for as long as the bit remains set; a direct write to PMBSR_EL1 that sets the
bit also causes SPE to assert the interrupt. The interrupt is deasserted only
when the service bit is cleared.

KVM emulates the interrupt by reading the value of the service bit on each
guest exit to determine if the SPE hardware asserted the interrupt (for
example, if the buffer was full). Writes to the buffer registers are
trapped to determine when the interrupt should be cleared or when the
guest wants to explicitly assert the interrupt by setting the service bit.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/sysreg.h  |   3 +
 arch/arm64/kvm/arm.c             |   3 +
 arch/arm64/kvm/hyp/nvhe/spe-sr.c |  20 +++++-
 arch/arm64/kvm/hyp/vhe/spe-sr.c  |  19 +++++-
 arch/arm64/kvm/spe.c             | 101 +++++++++++++++++++++++++++++++
 include/kvm/arm_spe.h            |   4 ++
 6 files changed, 146 insertions(+), 4 deletions(-)
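
As an aside, the virtual interrupt handling described above boils down to a
small state machine. The following is a self-contained toy model of it in
plain userspace C (not kernel code; the struct and function names are made up
for illustration, only the service bit and the line level are modelled, and
PMBSR_EL1.S is at its architectural position, bit 17):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PMBSR_S		(1ULL << 17)	/* PMBSR_EL1.S, the service bit */

struct vspe {
	uint64_t pmbsr;		/* guest view of PMBSR_EL1 */
	bool	 irq_level;	/* level of the virtual buffer management PPI */
};

/* Guest exit: the hardware reported the service bit set (e.g. buffer full). */
static void on_guest_exit(struct vspe *s, uint64_t hw_pmbsr)
{
	if (hw_pmbsr & PMBSR_S) {
		s->pmbsr = hw_pmbsr;
		s->irq_level = true;	/* assert the virtual interrupt */
	}
}

/* Trapped guest write to PMBSR_EL1: the interrupt line follows the S bit. */
static void on_guest_write_pmbsr(struct vspe *s, uint64_t val)
{
	s->pmbsr = val;
	s->irq_level = !!(val & PMBSR_S);
}

int main(void)
{
	struct vspe s = { 0 };

	on_guest_exit(&s, PMBSR_S);	/* buffer fills -> interrupt asserted */
	printf("level after buffer full: %d\n", s.irq_level);
	on_guest_write_pmbsr(&s, 0);	/* guest clears S -> interrupt deasserted */
	printf("level after guest clears S: %d\n", s.irq_level);
	return 0;
}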

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 20159af17578..0398bcba83a6 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -299,6 +299,7 @@
 #define SYS_PMBLIMITR_EL1_FM_SHIFT	1
 #define SYS_PMBLIMITR_EL1_FM_MASK	0x3UL
 #define SYS_PMBLIMITR_EL1_FM_STOP_IRQ	(0 << SYS_PMBLIMITR_EL1_FM_SHIFT)
+#define SYS_PMBLIMITR_EL1_RES0		0xfffffffffffff007UL
 
 #define SYS_PMBPTR_EL1			sys_reg(3, 0, 9, 10, 1)
 
@@ -323,6 +324,8 @@
 
 #define SYS_PMBSR_EL1_BUF_BSC_FULL	(0x1UL << SYS_PMBSR_EL1_BUF_BSC_SHIFT)
 
+#define SYS_PMBSR_EL1_RES0		0x00000000fc0fffffUL
+
 /*** End of Statistical Profiling Extension ***/
 
 #define SYS_PMINTENSET_EL1		sys_reg(3, 0, 9, 14, 1)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2d98248f2c66..c6a675aba71e 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -775,6 +775,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		 */
 		kvm_vgic_sync_hwstate(vcpu);
 
+		if (vcpu_has_spe(vcpu))
+			kvm_arm_spe_sync_hwstate(vcpu);
+
 		/*
 		 * Sync the timer hardware state before enabling interrupts as
 		 * we don't want vtimer interrupts to race with syncing the
diff --git a/arch/arm64/kvm/hyp/nvhe/spe-sr.c b/arch/arm64/kvm/hyp/nvhe/spe-sr.c
index a73ee820b27f..2794a7c7fcd9 100644
--- a/arch/arm64/kvm/hyp/nvhe/spe-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/spe-sr.c
@@ -47,6 +47,14 @@ void __sysreg_save_spe_host_state_nvhe(struct kvm_cpu_context *ctxt)
 void __sysreg_restore_spe_guest_state_nvhe(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
+	u64 pmblimitr;
+
+	/* Disable guest profiling when the interrupt is asserted. */
+	if (spe_cpu->irq_level)
+		pmblimitr = 0;
+	else
+		pmblimitr = ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1);
 
 	__sysreg_restore_spe_state_common(guest_ctxt);
 
@@ -54,16 +62,24 @@ void __sysreg_restore_spe_guest_state_nvhe(struct kvm_vcpu *vcpu)
 	isb();
 
 	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
-	/* The guest buffer management event interrupt is always virtual. */
+	/* The guest buffer management interrupt is always virtual. */
 	write_sysreg_s(0, SYS_PMBSR_EL1);
-	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_s(pmblimitr, SYS_PMBLIMITR_EL1);
 	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMSCR_EL1), SYS_PMSCR_EL1);
 }
 
 void __sysreg_save_spe_guest_state_nvhe(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
 	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+	u64 pmbsr = read_sysreg_s(SYS_PMBSR_EL1);
+
+	/* Update guest PMBSR_EL1 only when SPE asserts an interrupt. */
+	if (pmbsr & BIT(SYS_PMBSR_EL1_S_SHIFT)) {
+		ctxt_sys_reg(guest_ctxt, PMBSR_EL1) = pmbsr;
+		spe_cpu->pmbirq_asserted = true;
+	}
 
 	if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
 		psb_csync();
diff --git a/arch/arm64/kvm/hyp/vhe/spe-sr.c b/arch/arm64/kvm/hyp/vhe/spe-sr.c
index dd947e9f249c..24173f838bb1 100644
--- a/arch/arm64/kvm/hyp/vhe/spe-sr.c
+++ b/arch/arm64/kvm/hyp/vhe/spe-sr.c
@@ -44,6 +44,8 @@ NOKPROBE_SYMBOL(sysreg_save_spe_host_state_vhe);
 void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
+	u64 pmblimitr;
 
 	/*
 	 * Make sure the write to MDCR_EL2 which changes the buffer owning
@@ -51,6 +53,12 @@ void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
 	 */
 	isb();
 
+	/* Disable guest profiling when the interrupt is asserted. */
+	if (spe_cpu->irq_level)
+		pmblimitr = 0;
+	else
+		pmblimitr = ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1);
+
 	/*
 	 * Order doesn't matter because sampling is disabled at EL2. However,
 	 * restore guest registers in the same program order as the host for
@@ -59,9 +67,9 @@ void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
 	__sysreg_restore_spe_state_common(guest_ctxt);
 
 	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
-	/* The guest buffer management event interrupt is always virtual. */
+	/* The guest buffer management interrupt is always virtual. */
 	write_sysreg_s(0, SYS_PMBSR_EL1);
-	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_s(pmblimitr, SYS_PMBLIMITR_EL1);
 	write_sysreg_el1(ctxt_sys_reg(guest_ctxt, PMSCR_EL1), SYS_PMSCR);
 }
 NOKPROBE_SYMBOL(sysreg_restore_spe_guest_state_vhe);
@@ -69,8 +77,15 @@ NOKPROBE_SYMBOL(sysreg_restore_spe_guest_state_vhe);
 void sysreg_save_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
 	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+	u64 pmbsr = read_sysreg_s(SYS_PMBSR_EL1);
 
+	/* Update guest PMBSR_EL1 only when SPE asserts an interrupt. */
+	if (pmbsr & BIT(SYS_PMBSR_EL1_S_SHIFT)) {
+		ctxt_sys_reg(guest_ctxt, PMBSR_EL1) = pmbsr;
+		spe_cpu->pmbirq_asserted = true;
+	}
 	/*
 	 * We're going to switch buffer owning Exception level when we restore
 	 * the host MDCR_EL2 value, make sure the buffer is disabled until we
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index ba80f2716a11..243fd621d640 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -12,11 +12,112 @@
 #include <kvm/arm_vgic.h>
 
 #include <asm/kvm_mmu.h>
+#include <asm/kvm_host.h>
+
+static void kvm_arm_spe_update_irq(struct kvm_vcpu *vcpu, bool new_level)
+{
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
+	int ret;
+
+	spe_cpu->irq_level = new_level;
+	ret = kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id, spe_cpu->irq_num,
+				  new_level, spe_cpu);
+	WARN_ON(ret);
+}
+
+void kvm_arm_spe_sync_hwstate(struct kvm_vcpu *vcpu)
+{
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
+	u64 pmbsr, pmbsr_ec;
+
+	if (!spe_cpu->pmbirq_asserted)
+		return;
+	spe_cpu->pmbirq_asserted = false;
+
+	pmbsr = __vcpu_sys_reg(vcpu, PMBSR_EL1);
+	pmbsr_ec = pmbsr & (SYS_PMBSR_EL1_EC_MASK << SYS_PMBSR_EL1_EC_SHIFT);
+
+	if (pmbsr & BIT(SYS_PMBSR_EL1_EA_SHIFT)) {
+		/*
+		 * The guest managed to trigger an external abort, something is
+		 * very definitely broken and there is no way for us to recover.
+		 * Treat it like we would if the external abort generated an
+		 * SError and panic now.
+		 */
+		panic("KVM SPE External Abort\n");
+		cpu_park_loop();
+		unreachable();
+	}
+
+	switch (pmbsr_ec) {
+	case SYS_PMBSR_EL1_EC_FAULT_S2:
+		/*
+		 * If we see this then either the guest memory isn't pinned
+		 * correctly (KVM bug or userspace got it wrong), or the guest
+		 * programmed the buffer pointer with a bogus address.
+		 * PMBPTR_EL1 will point to the guest VA that triggered the
+		 * DABT, print it as it might be useful for debugging.
+		 */
+		pr_warn_ratelimited("KVM SPE Stage 2 Data Abort, pmbsr=0x%llx, pmbptr=0x%llx\n",
+				pmbsr, __vcpu_sys_reg(vcpu, PMBPTR_EL1));
+		/*
+		 * Convert the stage 2 DABT into a guest SPE buffer synchronous
+		 * external abort.
+		 */
+		__vcpu_sys_reg(vcpu, PMBSR_EL1) = BIT(SYS_PMBSR_EL1_S_SHIFT) |
+						  BIT(SYS_PMBSR_EL1_EA_SHIFT);
+	case SYS_PMBSR_EL1_EC_FAULT_S1:
+	case SYS_PMBSR_EL1_EC_BUF:
+		break;
+	default:
+		pr_warn_ratelimited("KVM SPE Unknown buffer syndrome, pmbsr=0x%llx, pmbptr=0x%llx\n",
+				pmbsr, __vcpu_sys_reg(vcpu, PMBPTR_EL1));
+		__vcpu_sys_reg(vcpu, PMBSR_EL1) = BIT(SYS_PMBSR_EL1_S_SHIFT) |
+						  BIT(SYS_PMBSR_EL1_EA_SHIFT);
+	}
+
+	if (spe_cpu->irq_level)
+		return;
+
+	kvm_arm_spe_update_irq(vcpu, true);
+}
+
+static u64 kvm_arm_spe_get_reg_mask(int reg)
+{
+	switch (reg) {
+	case PMBLIMITR_EL1:
+		return SYS_PMBLIMITR_EL1_RES0;
+	case PMBPTR_EL1:
+		return ~0UL;
+	case PMBSR_EL1:
+		return SYS_PMBSR_EL1_RES0;
+	default:
+		return 0UL;
+	}
+}
 
 void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
 {
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
+	bool irq_level;
+
 	WARN(reg < PMBLIMITR_EL1, "Unexpected trap to SPE register\n");
+
+	val &= kvm_arm_spe_get_reg_mask(reg);
 	__vcpu_sys_reg(vcpu, reg) = val;
+
+	if (reg != PMBSR_EL1)
+		return;
+
+	irq_level = val & BIT(SYS_PMBSR_EL1_S_SHIFT);
+	/*
+	 * The VGIC configures PPIs as level-sensitive, we need to update the
+	 * interrupt state if it changes.
+	 */
+	if (spe_cpu->irq_level == irq_level)
+		return;
+
+	kvm_arm_spe_update_irq(vcpu, irq_level);
 }
 
 u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
index a2429edc4483..d45c17dd157d 100644
--- a/include/kvm/arm_spe.h
+++ b/include/kvm/arm_spe.h
@@ -16,6 +16,7 @@ struct kvm_spe_cpu {
 	int irq_num; 		/* Guest visible INTID */
 	bool irq_level; 	/* 'true' if interrupt is asserted to the VGIC */
 	bool initialized; 	/* Feature is initialized on VCPU */
+	bool pmbirq_asserted;	/* Hardware asserted PMBIRQ */
 };
 
 struct kvm_spe {
@@ -41,6 +42,7 @@ void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val);
 u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg);
 
+void kvm_arm_spe_sync_hwstate(struct kvm_vcpu *vcpu);
 #else
 #define kvm_arm_supports_spe()	false
 
@@ -96,5 +98,7 @@ static inline u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
 	return 0;
 }
 
+static inline void kvm_arm_spe_sync_hwstate(struct kvm_vcpu *vcpu) {}
+
 #endif /* CONFIG_KVM_ARM_SPE */
 #endif /* __ASM_ARM_KVM_SPE_H */
-- 
2.29.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 15/16] KVM: arm64: Enable SPE for guests
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (13 preceding siblings ...)
  2020-10-27 17:27 ` [RFC PATCH v3 14/16] KVM: arm64: Emulate SPE buffer management interrupt Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 16/16] Documentation: arm64: Document ARM Neoverse-N1 erratum #1688567 Alexandru Elisei
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

We have all the bits in place to expose SPE to guests; allow userspace to
set the feature and advertise the presence of SPE in the ID_AA64DFR0_EL1
register.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h | 2 +-
 arch/arm64/kvm/sys_regs.c         | 8 ++++++--
 2 files changed, 7 insertions(+), 3 deletions(-)
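
For completeness, a rough sketch of the userspace sequence implied by the ABI
added in patches 8 and 10 of this series; kvmtool does the equivalent. It
assumes the uapi definitions introduced by the series, omits error handling
and the rest of the VM setup, and the PPI INTID and the helper name are only
illustrative:

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Assumes the VCPU was initialized with the KVM_ARM_VCPU_SPE feature bit set
 * in kvm_vcpu_init.features[] and that the in-kernel irqchip (vgic) has
 * already been created and initialized.
 */
static int enable_and_finalize_spe(int vm_fd, int vcpu_fd,
				   void *guest_mem, size_t mem_size)
{
	int spe_irq = 21;	/* example PPI INTID, must be the same on all VCPUs */
	struct kvm_device_attr attr;

	if (!ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_SPE))
		return -1;	/* SPE not supported by KVM on this host */

	/* Tell KVM which PPI to use for the buffer management interrupt. */
	attr = (struct kvm_device_attr) {
		.group	= KVM_ARM_VCPU_SPE_CTRL,
		.attr	= KVM_ARM_VCPU_SPE_IRQ,
		.addr	= (__u64)(unsigned long)&spe_irq,
	};
	if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr))
		return -1;

	/* Initialize SPE on the VCPU. */
	attr.attr = KVM_ARM_VCPU_SPE_INIT;
	attr.addr = 0;
	if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr))
		return -1;

	/* Pin the guest memory, then ask KVM to pre-map it at stage 2. */
	if (mlock(guest_mem, mem_size))
		return -1;

	attr = (struct kvm_device_attr) {
		.group	= KVM_ARM_VM_SPE_CTRL,
		.attr	= KVM_ARM_VM_SPE_FINALIZE,
	};
	return ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &attr);
}

Until the finalize attribute succeeds, KVM_RUN on a VCPU with SPE enabled
returns -EPERM (see patch 10).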

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index bcecc6224c59..e5504c9847fc 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -39,7 +39,7 @@
 
 #define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS
 
-#define KVM_VCPU_MAX_FEATURES 7
+#define KVM_VCPU_MAX_FEATURES 8
 
 #define KVM_REQ_SLEEP \
 	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 3a0687602839..076be04d2e28 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1178,8 +1178,12 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
 		val = cpuid_feature_cap_perfmon_field(val,
 						ID_AA64DFR0_PMUVER_SHIFT,
 						ID_AA64DFR0_PMUVER_8_1);
-		/* Don't advertise SPE to guests */
-		val &= ~(0xfUL << ID_AA64DFR0_PMSVER_SHIFT);
+		/*
+		 * Don't advertise SPE to guests without SPE. Otherwise, allow
+		 * the guest to detect the hardware SPE version.
+		 */
+		if (!vcpu_has_spe(vcpu))
+			val &= ~(0xfUL << ID_AA64DFR0_PMSVER_SHIFT);
 	} else if (id == SYS_ID_DFR0_EL1) {
 		/* Limit guests to PMUv3 for ARMv8.1 */
 		val = cpuid_feature_cap_perfmon_field(val,
-- 
2.29.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 16/16] Documentation: arm64: Document ARM Neoverse-N1 erratum #1688567
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (14 preceding siblings ...)
  2020-10-27 17:27 ` [RFC PATCH v3 15/16] KVM: arm64: Enable SPE for guests Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

According to erratum #1688567, a SPE buffer write that results in an Access
flag fault or Permission fault at stage 2 is reported with an unsupported
PMBSR_EL1.FSC code.

KVM avoids SPE stage 2 faults altogether by requiring userspace to lock the
guest memory in RAM and pre-mapping it in stage 2 before the VM is started.
As a result, KVM is not impacted by this erratum.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/arm64/silicon-errata.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst
index d3587805de64..1f6c403fd555 100644
--- a/Documentation/arm64/silicon-errata.rst
+++ b/Documentation/arm64/silicon-errata.rst
@@ -96,6 +96,8 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Neoverse-N1     | #1542419        | ARM64_ERRATUM_1542419       |
 +----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Neoverse-N1     | #1688567        | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
 | ARM            | MMU-500         | #841119,826419  | N/A                         |
 +----------------+-----------------+-----------------+-----------------------------+
 +----------------+-----------------+-----------------+-----------------------------+
-- 
2.29.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE
  2020-10-27 17:26 ` [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
@ 2020-11-05  9:58   ` Haibo Xu
  2020-12-02 15:20     ` Alexandru Elisei
  2020-11-19 16:58   ` James Morse
  1 sibling, 1 reply; 35+ messages in thread
From: Haibo Xu @ 2020-11-05  9:58 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, will, kvmarm, linux-arm-kernel, Sudeep Holla

On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>
> From: Sudeep Holla <sudeep.holla@arm.com>
>
> To configure the virtual SPE buffer management interrupt number, we use a
> VCPU kvm_device ioctl, encapsulating the KVM_ARM_VCPU_SPE_IRQ attribute
> within the KVM_ARM_VCPU_SPE_CTRL group.
>
> After configuring the SPE, userspace is required to call the VCPU ioctl
> with the attribute KVM_ARM_VCPU_SPE_INIT to initialize SPE on the VCPU.
>
> [Alexandru E: Fixed compilation errors, don't allow userspace to set the
>         VCPU feature, removed unused functions, fixed mismatched
>         descriptions, comments and error codes, reworked logic, rebased on
>         top of v5.10-rc1]
>
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  Documentation/virt/kvm/devices/vcpu.rst |  40 ++++++++
>  arch/arm64/include/uapi/asm/kvm.h       |   3 +
>  arch/arm64/kvm/Makefile                 |   1 +
>  arch/arm64/kvm/guest.c                  |   9 ++
>  arch/arm64/kvm/reset.c                  |  23 +++++
>  arch/arm64/kvm/spe.c                    | 129 ++++++++++++++++++++++++
>  include/kvm/arm_spe.h                   |  27 +++++
>  include/uapi/linux/kvm.h                |   1 +
>  8 files changed, 233 insertions(+)
>  create mode 100644 arch/arm64/kvm/spe.c
>
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 2acec3b9ef65..6135b9827fbe 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -161,3 +161,43 @@ Specifies the base address of the stolen time structure for this VCPU. The
>  base address must be 64 byte aligned and exist within a valid guest memory
>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>  including the layout of the stolen time structure.
> +
> +4. GROUP: KVM_ARM_VCPU_SPE_CTRL
> +===============================
> +
> +:Architectures: ARM64
> +
> +4.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
> +-----------------------------------
> +
> +:Parameters: in kvm_device_attr.addr the address for the SPE buffer management
> +             interrupt is a pointer to an int
> +
> +Returns:
> +
> +        =======  ========================================================
> +        -EBUSY   The SPE buffer management interrupt is already set
> +        -EINVAL  Invalid SPE overflow interrupt number
> +        -EFAULT  Could not read the buffer management interrupt number
> +        -ENXIO   SPE not supported or not properly configured
> +        =======  ========================================================
> +
> +A value describing the SPE (Statistical Profiling Extension) overflow interrupt
> +number for this vcpu. This interrupt should be a PPI and the interrupt type and
> +number must be the same for each vcpu.
> +
> +4.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
> +------------------------------------
> +
> +:Parameters: no additional parameter in kvm_device_attr.addr
> +
> +Returns:
> +
> +        =======  ======================================================
> +        -EBUSY   SPE already initialized
> +        -ENODEV  GIC not initialized
> +        -ENXIO   SPE not supported or not properly configured
> +        =======  ======================================================
> +
> +Request the initialization of the SPE. Must be done after initializing the
> +in-kernel irqchip and after setting the interrupt number for the VCPU.
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 489e12304dbb..ca57dfb7abf0 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -360,6 +360,9 @@ struct kvm_vcpu_events {
>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER                1
>  #define KVM_ARM_VCPU_PVTIME_CTRL       2
>  #define   KVM_ARM_VCPU_PVTIME_IPA      0
> +#define KVM_ARM_VCPU_SPE_CTRL          3
> +#define   KVM_ARM_VCPU_SPE_IRQ         0
> +#define   KVM_ARM_VCPU_SPE_INIT                1
>
>  /* KVM_IRQ_LINE irq field index values */
>  #define KVM_ARM_IRQ_VCPU2_SHIFT                28
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 1504c81fbf5d..f6e76f64ffbe 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -25,3 +25,4 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
>          vgic/vgic-its.o vgic/vgic-debug.o
>
>  kvm-$(CONFIG_KVM_ARM_PMU)  += pmu-emul.o
> +kvm-$(CONFIG_KVM_ARM_SPE)  += spe.o
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index dfb5218137ca..2ba790eeb782 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -926,6 +926,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
>         case KVM_ARM_VCPU_PVTIME_CTRL:
>                 ret = kvm_arm_pvtime_set_attr(vcpu, attr);
>                 break;
> +       case KVM_ARM_VCPU_SPE_CTRL:
> +               ret = kvm_arm_spe_set_attr(vcpu, attr);
> +               break;
>         default:
>                 ret = -ENXIO;
>                 break;
> @@ -949,6 +952,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
>         case KVM_ARM_VCPU_PVTIME_CTRL:
>                 ret = kvm_arm_pvtime_get_attr(vcpu, attr);
>                 break;
> +       case KVM_ARM_VCPU_SPE_CTRL:
> +               ret = kvm_arm_spe_get_attr(vcpu, attr);
> +               break;
>         default:
>                 ret = -ENXIO;
>                 break;
> @@ -972,6 +978,9 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>         case KVM_ARM_VCPU_PVTIME_CTRL:
>                 ret = kvm_arm_pvtime_has_attr(vcpu, attr);
>                 break;
> +       case KVM_ARM_VCPU_SPE_CTRL:
> +               ret = kvm_arm_spe_has_attr(vcpu, attr);
> +               break;
>         default:
>                 ret = -ENXIO;
>                 break;
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index f32490229a4c..4dc205fa4be1 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -87,6 +87,9 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>         case KVM_CAP_ARM_PTRAUTH_GENERIC:
>                 r = system_has_full_ptr_auth();
>                 break;
> +       case KVM_CAP_ARM_SPE:
> +               r = kvm_arm_supports_spe();
> +               break;
>         default:
>                 r = 0;
>         }
> @@ -223,6 +226,19 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
>         return 0;
>  }
>
> +static int kvm_vcpu_enable_spe(struct kvm_vcpu *vcpu)
> +{
> +       if (!kvm_arm_supports_spe())
> +               return -EINVAL;
> +
> +       /* SPE is disabled if the PE is in AArch32 state */
> +       if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
> +               return -EINVAL;
> +
> +       vcpu->arch.flags |= KVM_ARM64_GUEST_HAS_SPE;
> +       return 0;
> +}
> +
>  /**
>   * kvm_reset_vcpu - sets core registers and sys_regs to reset value
>   * @vcpu: The VCPU pointer
> @@ -274,6 +290,13 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>                 }
>         }
>
> +       if (test_bit(KVM_ARM_VCPU_SPE, vcpu->arch.features)) {
> +               if (kvm_vcpu_enable_spe(vcpu)) {
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
> +       }
> +
>         switch (vcpu->arch.target) {
>         default:
>                 if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
> new file mode 100644
> index 000000000000..f91a52cd7cd3
> --- /dev/null
> +++ b/arch/arm64/kvm/spe.c
> @@ -0,0 +1,129 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 ARM Ltd.
> + */
> +
> +#include <linux/kvm.h>
> +#include <linux/kvm_host.h>
> +#include <linux/uaccess.h>
> +
> +#include <kvm/arm_spe.h>
> +#include <kvm/arm_vgic.h>
> +
> +static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
> +{
> +       if (!vcpu_has_spe(vcpu))
> +               return false;
> +
> +       if (!irqchip_in_kernel(vcpu->kvm))
> +               return false;
> +

nit: should we move the irqchip_in_kernel() check to the caller?

> +       return true;
> +}
> +
> +static int kvm_arm_spe_init(struct kvm_vcpu *vcpu)
> +{
> +       if (!kvm_arm_spe_irq_initialized(vcpu))
> +               return -ENXIO;
> +
> +       if (!vgic_initialized(vcpu->kvm))
> +               return -ENODEV;
> +
> +       if (kvm_arm_spe_vcpu_initialized(vcpu))
> +               return -EBUSY;
> +
> +       if (kvm_vgic_set_owner(vcpu, vcpu->arch.spe_cpu.irq_num, &vcpu->arch.spe_cpu))
> +               return -ENXIO;
> +
> +       vcpu->arch.spe_cpu.initialized = true;
> +
> +       return 0;
> +}
> +
> +static bool kvm_arm_spe_irq_is_valid(struct kvm *kvm, int irq)
> +{
> +       int i;
> +       struct kvm_vcpu *vcpu;
> +
> +       /* The SPE overflow interrupt can be a PPI only */
> +       if (!irq_is_ppi(irq))
> +               return false;
> +
> +       kvm_for_each_vcpu(i, vcpu, kvm) {
> +               if (!kvm_arm_spe_irq_initialized(vcpu))
> +                       continue;
> +
> +               if (vcpu->arch.spe_cpu.irq_num != irq)
> +                       return false;
> +       }
> +
> +       return true;
> +}
> +
> +int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
> +{
> +       switch (attr->attr) {
> +       case KVM_ARM_VCPU_SPE_IRQ: {
> +               int __user *uaddr = (int __user *)(long)attr->addr;
> +               int irq;
> +
> +               if (!kvm_arm_vcpu_supports_spe(vcpu))
> +                       return -ENXIO;
> +
> +               if (get_user(irq, uaddr))
> +                       return -EFAULT;
> +
> +               if (!kvm_arm_spe_irq_is_valid(vcpu->kvm, irq))
> +                       return -EINVAL;
> +
> +               if (kvm_arm_spe_irq_initialized(vcpu))
> +                       return -EBUSY;
> +
> +               kvm_debug("Set kvm ARM SPE irq: %d\n", irq);
> +               vcpu->arch.spe_cpu.irq_num = irq;
> +
> +               return 0;
> +       }
> +       case KVM_ARM_VCPU_SPE_INIT:
> +               return kvm_arm_spe_init(vcpu);
> +       }
> +
> +       return -ENXIO;
> +}
> +
> +int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
> +{
> +       switch (attr->attr) {
> +       case KVM_ARM_VCPU_SPE_IRQ: {
> +               int __user *uaddr = (int __user *)(long)attr->addr;
> +               int irq;
> +
> +               if (!kvm_arm_vcpu_supports_spe(vcpu))
> +                       return -ENXIO;
> +
> +               if (!kvm_arm_spe_irq_initialized(vcpu))
> +                       return -ENXIO;
> +
> +               irq = vcpu->arch.spe_cpu.irq_num;
> +               if (put_user(irq, uaddr))
> +                       return -EFAULT;
> +
> +               return 0;
> +       }
> +       }
> +
> +       return -ENXIO;
> +}
> +
> +int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
> +{
> +       switch (attr->attr) {
> +       case KVM_ARM_VCPU_SPE_IRQ:
> +               fallthrough;
> +       case KVM_ARM_VCPU_SPE_INIT:
> +               if (kvm_arm_vcpu_supports_spe(vcpu))
> +                       return 0;
> +       }
> +
> +       return -ENXIO;
> +}
> diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
> index 46ec447ed013..0275e8097529 100644
> --- a/include/kvm/arm_spe.h
> +++ b/include/kvm/arm_spe.h
> @@ -18,11 +18,38 @@ struct kvm_spe_cpu {
>         bool initialized;       /* Feature is initialized on VCPU */
>  };
>
> +#define kvm_arm_spe_irq_initialized(v)                 \
> +       ((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&   \
> +        (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)
> +#define kvm_arm_spe_vcpu_initialized(v)        ((v)->arch.spe_cpu.initialized)
> +
> +int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> +int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> +int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> +
>  #else
>  #define kvm_arm_supports_spe() false
>
>  struct kvm_spe_cpu {
>  };
>
> +#define kvm_arm_spe_irq_initialized(v) false
> +#define kvm_arm_spe_vcpu_initialized(v)        false
> +
> +static inline int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu,
> +                                      struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +static inline int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu,
> +                                      struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +static inline int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu,
> +                                      struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
>  #endif /* CONFIG_KVM_ARM_SPE */
>  #endif /* __ASM_ARM_KVM_SPE_H */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index ca41220b40b8..96228b823711 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1053,6 +1053,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_X86_USER_SPACE_MSR 188
>  #define KVM_CAP_X86_MSR_FILTER 189
>  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
> +#define KVM_CAP_ARM_SPE 191
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> --
> 2.29.1
>
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort()
  2020-10-27 17:26 ` [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort() Alexandru Elisei
@ 2020-11-05 10:01   ` Haibo Xu
  2020-12-02 16:29     ` Alexandru Elisei
  0 siblings, 1 reply; 35+ messages in thread
From: Haibo Xu @ 2020-11-05 10:01 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, will, kvmarm, linux-arm-kernel

On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>
> user_mem_abort() is already a long and complex function, let's make it
> slightly easier to understand by abstracting the algorithm for choosing the
> stage 2 IPA entry size into its own function.
>
> This also makes it possible to reuse the code when guest SPE support will
> be added.
>

Better to mention that there is "No functional change"!

> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/kvm/mmu.c | 55 ++++++++++++++++++++++++++------------------
>  1 file changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 19aacc7d64de..c3c43555490d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -738,12 +738,43 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>         return PAGE_SIZE;
>  }
>
> +static short stage2_max_pageshift(struct kvm_memory_slot *memslot,
> +                                 struct vm_area_struct *vma, hva_t hva,
> +                                 bool *force_pte)
> +{
> +       short pageshift;
> +
> +       *force_pte = false;
> +
> +       if (is_vm_hugetlb_page(vma))
> +               pageshift = huge_page_shift(hstate_vma(vma));
> +       else
> +               pageshift = PAGE_SHIFT;
> +
> +       if (memslot_is_logging(memslot) || (vma->vm_flags & VM_PFNMAP)) {
> +               *force_pte = true;
> +               pageshift = PAGE_SHIFT;
> +       }
> +
> +       if (pageshift == PUD_SHIFT &&
> +           !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
> +               pageshift = PMD_SHIFT;
> +
> +       if (pageshift == PMD_SHIFT &&
> +           !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> +               *force_pte = true;
> +               pageshift = PAGE_SHIFT;
> +       }
> +
> +       return pageshift;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                           struct kvm_memory_slot *memslot, unsigned long hva,
>                           unsigned long fault_status)
>  {
>         int ret = 0;
> -       bool write_fault, writable, force_pte = false;
> +       bool write_fault, writable, force_pte;
>         bool exec_fault;
>         bool device = false;
>         unsigned long mmu_seq;
> @@ -776,27 +807,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                 return -EFAULT;
>         }
>
> -       if (is_vm_hugetlb_page(vma))
> -               vma_shift = huge_page_shift(hstate_vma(vma));
> -       else
> -               vma_shift = PAGE_SHIFT;
> -
> -       if (logging_active ||
> -           (vma->vm_flags & VM_PFNMAP)) {
> -               force_pte = true;
> -               vma_shift = PAGE_SHIFT;
> -       }
> -
> -       if (vma_shift == PUD_SHIFT &&
> -           !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
> -              vma_shift = PMD_SHIFT;
> -
> -       if (vma_shift == PMD_SHIFT &&
> -           !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> -               force_pte = true;
> -               vma_shift = PAGE_SHIFT;
> -       }
> -
> +       vma_shift = stage2_max_pageshift(memslot, vma, hva, &force_pte);
>         vma_pagesize = 1UL << vma_shift;
>         if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
>                 fault_ipa &= ~(vma_pagesize - 1);
> --
> 2.29.1
>
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE
  2020-10-27 17:26 ` [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE Alexandru Elisei
@ 2020-11-05 10:10   ` Haibo Xu
  2020-12-02 16:35     ` Alexandru Elisei
  2020-11-19 16:59   ` James Morse
  1 sibling, 1 reply; 35+ messages in thread
From: Haibo Xu @ 2020-11-05 10:10 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, will, kvmarm, linux-arm-kernel

On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>
> Stage 2 faults triggered by the profiling buffer attempting to write to
> memory are reported by the SPE hardware by asserting a buffer management
> event interrupt. Interrupts are by their nature asynchronous, which means
> that the guest might have changed its stage 1 translation tables since the
> attempted write. SPE reports the guest virtual address that caused the data
> abort, but not the IPA, which means that KVM would have to walk the guest's
> stage 1 tables to find the IPA; using the AT instruction to walk the
> guest's tables in hardware is not an option because it doesn't report the
> IPA in the case of a stage 2 fault on a stage 1 table walk.
>
> Fix both problems by pre-mapping the guest's memory at stage 2 with write
> permissions to avoid any faults. Userspace calls mlock() on the VMAs that
> back the guest's memory, pinning the pages in memory, then tells KVM to map
> the memory at stage 2 by using the VM control group KVM_ARM_VM_SPE_CTRL
> with the attribute KVM_ARM_VM_SPE_FINALIZE. KVM will map all writable VMAs
> which have the VM_LOCKED flag set. Hugetlb VMAs are effectively pinned in
> memory once they are faulted in, but mlock() doesn't set the VM_LOCKED
> flag for them and only faults the pages in; KVM will treat hugetlb VMAs as
> if they had the VM_LOCKED flag and will also map them, faulting them in if
> necessary, when handling the ioctl.
>
> VM live migration relies on a bitmap of dirty pages. This bitmap is created
> by write-protecting a memslot and updating it as KVM handles stage 2 write
> faults. Because KVM cannot handle stage 2 faults reported by the profiling
> buffer, it will not pre-map a logging memslot. This effectively means that
> profiling is not available when the VM is configured for live migration.
>
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  Documentation/virt/kvm/devices/vm.rst |  28 +++++
>  arch/arm64/include/asm/kvm_host.h     |   5 +
>  arch/arm64/include/asm/kvm_mmu.h      |   2 +
>  arch/arm64/include/uapi/asm/kvm.h     |   3 +
>  arch/arm64/kvm/arm.c                  |  78 +++++++++++-
>  arch/arm64/kvm/guest.c                |  48 ++++++++
>  arch/arm64/kvm/mmu.c                  | 169 ++++++++++++++++++++++++++
>  arch/arm64/kvm/spe.c                  |  81 ++++++++++++
>  include/kvm/arm_spe.h                 |  36 ++++++
>  9 files changed, 448 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virt/kvm/devices/vm.rst b/Documentation/virt/kvm/devices/vm.rst
> index 0aa5b1cfd700..b70798a72d8a 100644
> --- a/Documentation/virt/kvm/devices/vm.rst
> +++ b/Documentation/virt/kvm/devices/vm.rst
> @@ -314,3 +314,31 @@ Allows userspace to query the status of migration mode.
>              if it is enabled
>  :Returns:   -EFAULT if the given address is not accessible from kernel space;
>             0 in case of success.
> +
> +6. GROUP: KVM_ARM_VM_SPE_CTRL
> +===============================
> +
> +:Architectures: arm64
> +
> +6.1. ATTRIBUTE: KVM_ARM_VM_SPE_FINALIZE
> +-----------------------------------------
> +
> +Finalizes the creation of the SPE feature by mapping the guest memory in the
> +stage 2 table. Guest memory must be readable, writable and pinned in RAM, which
> +is achieved with an mlock() system call; the memory can be backed by a hugetlbfs
> +file. Memory regions from read-only or dirty page logging enabled memslots will
> +be ignored. After the call, no changes to the guest memory, including to its
> +contents, are permitted.
> +
> +Subsequent KVM_ARM_VCPU_INIT calls will cause the memory to become unmapped and
> +the feature must be finalized again before any VCPU can run.
> +
> +If any VCPUs are run before finalizing the feature, KVM_RUN will return -EPERM.
> +
> +:Parameters: none
> +:Returns:   -EAGAIN if guest memory has been modified while the call was
> +            executing
> +            -EBUSY if the feature is already initialized
> +            -EFAULT if an address backing the guest memory is invalid
> +            -ENXIO if SPE is not supported or not properly configured
> +            0 in case of success
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 5b68c06930c6..27f581750c6e 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -92,6 +92,7 @@ struct kvm_s2_mmu {
>
>  struct kvm_arch {
>         struct kvm_s2_mmu mmu;
> +       struct kvm_spe spe;
>
>         /* VTCR_EL2 value for this VM */
>         u64    vtcr;
> @@ -612,6 +613,10 @@ void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
>  void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
>  void kvm_arm_clear_debug(struct kvm_vcpu *vcpu);
>  void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu);
> +int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr);
> +int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr);
> +int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr);
> +
>  int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
>                                struct kvm_device_attr *attr);
>  int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 331394306cce..bad94662bbed 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -124,6 +124,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu);
>  void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
>  int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>                           phys_addr_t pa, unsigned long size, bool writable);
> +int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> +                          enum kvm_pgtable_prot prot);
>
>  int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index ca57dfb7abf0..8876e564ba56 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -350,6 +350,9 @@ struct kvm_vcpu_events {
>  #define   KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3
>  #define   KVM_DEV_ARM_ITS_CTRL_RESET           4
>
> +#define KVM_ARM_VM_SPE_CTRL            0
> +#define   KVM_ARM_VM_SPE_FINALIZE      0
> +
>  /* Device Control API on vcpu fd */
>  #define KVM_ARM_VCPU_PMU_V3_CTRL       0
>  #define   KVM_ARM_VCPU_PMU_V3_IRQ      0
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e51d8f328c7e..2d98248f2c66 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -41,6 +41,7 @@
>  #include <kvm/arm_hypercalls.h>
>  #include <kvm/arm_pmu.h>
>  #include <kvm/arm_psci.h>
> +#include <kvm/arm_spe.h>
>
>  #ifdef REQUIRES_VIRT
>  __asm__(".arch_extension       virt");
> @@ -653,6 +654,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>         if (unlikely(!kvm_vcpu_initialized(vcpu)))
>                 return -ENOEXEC;
>
> +       if (vcpu_has_spe(vcpu) && unlikely(!kvm_arm_spe_finalized(vcpu->kvm)))
> +               return -EPERM;
> +
>         ret = kvm_vcpu_first_run_init(vcpu);
>         if (ret)
>                 return ret;
> @@ -982,12 +986,22 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
>          * ensuring that the data side is always coherent. We still
>          * need to invalidate the I-cache though, as FWB does *not*
>          * imply CTR_EL0.DIC.
> +        *
> +        * If the guest has SPE, we need to unmap the entire address space to
> +        * allow for any changes to the VM memory made by userspace to propagate
> +        * to the stage 2 tables when SPE is re-finalized; this also makes sure
> +        * we keep the userspace and the guest's view of the memory contents
> +        * synchronized.
>          */
>         if (vcpu->arch.has_run_once) {
> -               if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
> +               if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
> +                   vcpu_has_spe(vcpu)) {
>                         stage2_unmap_vm(vcpu->kvm);
> -               else
> +                       if (vcpu_has_spe(vcpu))
> +                               kvm_arm_spe_notify_vcpu_init(vcpu);
> +               } else {
>                         __flush_icache_all();
> +               }
>         }
>
>         vcpu_reset_hcr(vcpu);
> @@ -1045,6 +1059,45 @@ static int kvm_arm_vcpu_has_attr(struct kvm_vcpu *vcpu,
>         return ret;
>  }
>
> +static int kvm_arm_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret = -ENXIO;
> +
> +       switch (attr->group) {
> +       default:
> +               ret = kvm_arm_vm_arch_set_attr(kvm, attr);
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
> +static int kvm_arm_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret = -ENXIO;
> +
> +       switch (attr->group) {
> +       default:
> +               ret = kvm_arm_vm_arch_get_attr(kvm, attr);
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
> +static int kvm_arm_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret = -ENXIO;
> +
> +       switch (attr->group) {
> +       default:
> +               ret = kvm_arm_vm_arch_has_attr(kvm, attr);
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
>  static int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>                                    struct kvm_vcpu_events *events)
>  {
> @@ -1259,6 +1312,27 @@ long kvm_arch_vm_ioctl(struct file *filp,
>
>                 return 0;
>         }
> +       case KVM_SET_DEVICE_ATTR: {
> +               struct kvm_device_attr attr;
> +
> +               if (copy_from_user(&attr, argp, sizeof(attr)))
> +                       return -EFAULT;
> +               return kvm_arm_vm_set_attr(kvm, &attr);
> +       }
> +       case KVM_GET_DEVICE_ATTR: {
> +               struct kvm_device_attr attr;
> +
> +               if (copy_from_user(&attr, argp, sizeof(attr)))
> +                       return -EFAULT;
> +               return kvm_arm_vm_get_attr(kvm, &attr);
> +       }
> +       case KVM_HAS_DEVICE_ATTR: {
> +               struct kvm_device_attr attr;
> +
> +               if (copy_from_user(&attr, argp, sizeof(attr)))
> +                       return -EFAULT;
> +               return kvm_arm_vm_has_attr(kvm, &attr);
> +       }
>         default:
>                 return -EINVAL;
>         }
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 2ba790eeb782..d0dc4bdb8b4a 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -988,3 +988,51 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>
>         return ret;
>  }
> +
> +int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret;
> +
> +       switch (attr->group) {
> +       case KVM_ARM_VM_SPE_CTRL:
> +               ret = kvm_arm_vm_spe_set_attr(kvm, attr);
> +               break;
> +       default:
> +               ret = -ENXIO;
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
> +int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret;
> +
> +       switch (attr->group) {
> +       case KVM_ARM_VM_SPE_CTRL:
> +               ret = kvm_arm_vm_spe_get_attr(kvm, attr);
> +               break;
> +       default:
> +               ret = -ENXIO;
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
> +int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret;
> +
> +       switch (attr->group) {
> +       case KVM_ARM_VM_SPE_CTRL:
> +               ret = kvm_arm_vm_spe_has_attr(kvm, attr);
> +               break;
> +       default:
> +               ret = -ENXIO;
> +               break;
> +       }
> +
> +       return ret;
> +}
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c3c43555490d..31b2216a5881 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1365,6 +1365,175 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>         return ret;
>  }
>
> +static int stage2_map_vma(struct kvm *kvm,
> +                         struct kvm_memory_slot *memslot,
> +                         struct vm_area_struct *vma,
> +                         enum kvm_pgtable_prot prot,
> +                         unsigned long mmu_seq, hva_t *hvap,
> +                         struct kvm_mmu_memory_cache *cache)
> +{
> +       struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> +       unsigned long stage2_pagesize, remaining;
> +       bool force_pte, writable;
> +       hva_t hva, hva_end;
> +       kvm_pfn_t pfn;
> +       gpa_t gpa;
> +       gfn_t gfn;
> +       int ret;
> +
> +       hva = max(memslot->userspace_addr, vma->vm_start);
> +       hva_end = min(vma->vm_end, memslot->userspace_addr +
> +                       (memslot->npages << PAGE_SHIFT));
> +
> +       gpa = (memslot->base_gfn << PAGE_SHIFT) + hva - memslot->userspace_addr;
> +       gfn = gpa >> PAGE_SHIFT;
> +
> +       stage2_pagesize = 1UL << stage2_max_pageshift(memslot, vma, hva, &force_pte);
> +
> +       while (hva < hva_end) {
> +               ret = kvm_mmu_topup_memory_cache(cache,
> +                                                kvm_mmu_cache_min_pages(kvm));
> +               if (ret)
> +                       return ret;
> +
> +               /*
> +                * We start mapping with the highest possible page size, so the
> +                * gpa and gfn will always be properly aligned to the current
> +                * page size.
> +                */
> +               pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL, true, &writable);
> +               if (pfn == KVM_PFN_ERR_HWPOISON)
> +                       return -EFAULT;
> +               if (is_error_noslot_pfn(pfn))
> +                       return -EFAULT;
> +               /* Can only happen if naughty userspace changed the VMA. */
> +               if (kvm_is_device_pfn(pfn) || !writable)
> +                       return -EAGAIN;
> +
> +               spin_lock(&kvm->mmu_lock);
> +               if (mmu_notifier_retry(kvm, mmu_seq)) {
> +                       spin_unlock(&kvm->mmu_lock);
> +                       return -EAGAIN;
> +               }
> +
> +               remaining = hva_end - hva;
> +               if (stage2_pagesize == PUD_SIZE && remaining < PUD_SIZE)
> +                       stage2_pagesize = PMD_SIZE;
> +               if (stage2_pagesize == PMD_SIZE && remaining < PMD_SIZE) {
> +                       force_pte = true;
> +                       stage2_pagesize = PAGE_SIZE;
> +               }
> +
> +               if (!force_pte && stage2_pagesize == PAGE_SIZE)
> +                       /*
> +                        * The hva and gpa will always be PMD aligned if
> +                        * hva is backed by a transparent huge page. gpa will
> +                        * not be modified and it's not necessary to recompute
> +                        * hva.
> +                        */
> +                       stage2_pagesize = transparent_hugepage_adjust(memslot, hva, &pfn, &gpa);
> +
> +               ret = kvm_pgtable_stage2_map(pgt, gpa, stage2_pagesize,
> +                                            __pfn_to_phys(pfn), prot, cache);
> +               spin_unlock(&kvm->mmu_lock);
> +
> +               kvm_set_pfn_accessed(pfn);
> +               kvm_release_pfn_dirty(pfn);
> +
> +               if (ret)
> +                       return ret;
> +               else if (hva < hva_end)
> +                       cond_resched();
> +
> +               hva += stage2_pagesize;
> +               gpa += stage2_pagesize;
> +               gfn = gpa >> PAGE_SHIFT;
> +       }
> +
> +       *hvap = hva;
> +       return 0;
> +}
> +
> +int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> +                          enum kvm_pgtable_prot prot)
> +{
> +       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +       struct vm_area_struct *vma;
> +       unsigned long mmu_seq;
> +       hva_t hva, hva_memslot_end;
> +       int ret;
> +
> +       lockdep_assert_held(&kvm->slots_lock);
> +
> +       if (!(prot & KVM_PGTABLE_PROT_R))
> +               return -EPERM;
> +       if ((prot & KVM_PGTABLE_PROT_W) && (memslot->flags & KVM_MEM_READONLY))
> +               return -EPERM;
> +
> +       hva = memslot->userspace_addr;
> +       hva_memslot_end = memslot->userspace_addr + (memslot->npages << PAGE_SHIFT);
> +
> +       /*
> +        * Be extra careful here in case userspace is messing with the VMAs
> +        * backing the memslot.
> +        */
> +       mmu_seq = kvm->mmu_notifier_seq;
> +       smp_rmb();
> +
> +       /*
> +        * A memslot might span multiple VMAs and any holes between them, while
> +        * a VMA might span multiple memslots (see
> +        * kvm_arch_prepare_memory_region()). Take the intersection of the VMAs
> +        * with the memslot.
> +        */
> +       do {
> +               mmap_read_lock(current->mm);
> +               vma = find_vma(current->mm, hva);
> +               /*
> +                * find_vma() returns first VMA with hva < vma->vm_end, which
> +                * means that it is possible for the VMA to start *after* the
> +                * end of the memslot.
> +                */
> +               if (!vma || vma->vm_start >= hva_memslot_end) {
> +                       mmap_read_unlock(current->mm);
> +                       return 0;
> +               }
> +
> +               /*
> +                * VM_LOCKED pages are put in the unevictable LRU list and
> +                * hugetlb pages are not put in any LRU list; both will stay
> +                * pinned in memory.
> +                */
> +               if (!(vma->vm_flags & VM_LOCKED) && !is_vm_hugetlb_page(vma)) {
> +                       /* Go to next VMA. */
> +                       hva = vma->vm_end;
> +                       mmap_read_unlock(current->mm);
> +                       continue;
> +               }
> +               if (!(vma->vm_flags & VM_READ) ||
> +                   ((prot & KVM_PGTABLE_PROT_W) && !(vma->vm_flags & VM_WRITE))) {
> +                       /* Go to next VMA. */
> +                       hva = vma->vm_end;
> +                       mmap_read_unlock(current->mm);
> +                       continue;
> +               }
> +               mmap_read_unlock(current->mm);
> +
> +               ret = stage2_map_vma(kvm, memslot, vma, prot, mmu_seq, &hva, &cache);
> +               if (ret)
> +                       return ret;
> +       } while (hva < hva_memslot_end);
> +
> +       if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB)) {
> +               spin_lock(&kvm->mmu_lock);
> +               stage2_flush_memslot(kvm, memslot);
> +               spin_unlock(&kvm->mmu_lock);
> +       }
> +
> +       return 0;
> +}
> +
> +
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
>  }
> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
> index f91a52cd7cd3..316ff8dfed5b 100644
> --- a/arch/arm64/kvm/spe.c
> +++ b/arch/arm64/kvm/spe.c
> @@ -10,6 +10,13 @@
>  #include <kvm/arm_spe.h>
>  #include <kvm/arm_vgic.h>
>
> +#include <asm/kvm_mmu.h>
> +

It seems that the function below is used to de-finalize the SPE status,
if I understand it correctly.
How about renaming it to something like "kvm_arm_vcpu_init_spe_definalize()"?

> +void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
> +{
> +       vcpu->kvm->arch.spe.finalized = false;
> +}
> +
>  static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
>  {
>         if (!vcpu_has_spe(vcpu))
> @@ -115,6 +122,50 @@ int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>         return -ENXIO;
>  }
>
> +static int kvm_arm_spe_finalize(struct kvm *kvm)
> +{
> +       struct kvm_memory_slot *memslot;
> +       enum kvm_pgtable_prot prot;
> +       struct kvm_vcpu *vcpu;
> +       int i, ret;
> +
> +       kvm_for_each_vcpu(i, vcpu, kvm) {
> +               if (!kvm_arm_spe_vcpu_initialized(vcpu))
> +                       return -ENXIO;
> +       }
> +
> +       mutex_unlock(&kvm->slots_lock);

Should be mutex_lock(&kvm->slots_lock);?

> +       if (kvm_arm_spe_finalized(kvm)) {
> +               mutex_unlock(&kvm->slots_lock);
> +               return -EBUSY;
> +       }
> +
> +       prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
> +       kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
> +               /* Only map memory that SPE can write to. */
> +               if (memslot->flags & KVM_MEM_READONLY)
> +                       continue;
> +                /*
> +                 * Dirty page logging will write-protect pages, which breaks
> +                 * SPE.
> +                 */
> +               if (memslot->dirty_bitmap)
> +                       continue;
> +               ret = kvm_map_locked_memslot(kvm, memslot, prot);
> +               if (ret)
> +                       break;
> +       }
> +
> +       if (!ret)
> +               kvm->arch.spe.finalized = true;
> +       mutex_unlock(&kvm->slots_lock);
> +
> +       if (ret)
> +               stage2_unmap_vm(kvm);
> +
> +       return ret;
> +}
> +
>  int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>  {
>         switch (attr->attr) {
> @@ -127,3 +178,33 @@ int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>
>         return -ENXIO;
>  }
> +
> +int kvm_arm_vm_spe_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       switch (attr->attr) {
> +       case KVM_ARM_VM_SPE_FINALIZE:
> +               return kvm_arm_spe_finalize(kvm);
> +       }
> +
> +       return -ENXIO;
> +}
> +
> +int kvm_arm_vm_spe_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +
> +int kvm_arm_vm_spe_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       struct kvm_vcpu *vcpu;
> +       int i;
> +
> +       switch (attr->attr) {
> +       case KVM_ARM_VM_SPE_FINALIZE:
> +               kvm_for_each_vcpu(i, vcpu, kvm)
> +                       if (kvm_arm_vcpu_supports_spe(vcpu))
> +                               return 0;
> +       }
> +
> +       return -ENXIO;
> +}
> diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
> index 0275e8097529..7f9f3a03aadb 100644
> --- a/include/kvm/arm_spe.h
> +++ b/include/kvm/arm_spe.h
> @@ -18,23 +18,38 @@ struct kvm_spe_cpu {
>         bool initialized;       /* Feature is initialized on VCPU */
>  };
>
> +struct kvm_spe {
> +       bool finalized;
> +};
> +
>  #define kvm_arm_spe_irq_initialized(v)                 \
>         ((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&   \
>          (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)
>  #define kvm_arm_spe_vcpu_initialized(v)        ((v)->arch.spe_cpu.initialized)
> +#define kvm_arm_spe_finalized(k)       ((k)->arch.spe.finalized)
>
>  int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>  int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>  int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>
> +int kvm_arm_vm_spe_set_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
> +int kvm_arm_vm_spe_get_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
> +int kvm_arm_vm_spe_has_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
> +
> +void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu);
> +
>  #else
>  #define kvm_arm_supports_spe() false
>
>  struct kvm_spe_cpu {
>  };
>
> +struct kvm_spe {
> +};
> +
>  #define kvm_arm_spe_irq_initialized(v) false
>  #define kvm_arm_spe_vcpu_initialized(v)        false
> +#define kvm_arm_spe_finalized(k)       false
>
>  static inline int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu,
>                                        struct kvm_device_attr *attr)
> @@ -51,5 +66,26 @@ static inline int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu,
>  {
>         return -ENXIO;
>  }
> +
> +static inline int kvm_arm_vm_spe_set_attr(struct kvm *vcpu,
> +                                         struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +
> +static inline int kvm_arm_vm_spe_get_attr(struct kvm *vcpu,
> +                                         struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +
> +static inline int kvm_arm_vm_spe_has_attr(struct kvm *vcpu,
> +                                         struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +
> +static inline void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu) {}
> +
>  #endif /* CONFIG_KVM_ARM_SPE */
>  #endif /* __ASM_ARM_KVM_SPE_H */
> --
> 2.29.1
>
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it
  2020-10-27 17:26 ` [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it Alexandru Elisei
@ 2020-11-19 16:58   ` James Morse
  2020-12-02 14:25     ` Alexandru Elisei
  0 siblings, 1 reply; 35+ messages in thread
From: James Morse @ 2020-11-19 16:58 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi Alex,

On 27/10/2020 17:26, Alexandru Elisei wrote:
> When a VCPU is created, the kvm_vcpu struct is initialized to zero in
> kvm_vm_ioctl_create_vcpu(). On VHE systems, the first time
> vcpu.arch.mdcr_el2 is loaded on hardware is in vcpu_load(), before it is
> set to a sensible value in kvm_arm_setup_debug() later in the run loop. The
> result is that KVM executes for a short time with MDCR_EL2 set to zero.
> 
> This is mostly harmless as we don't need to trap debug and SPE register
> accesses from EL1 (we're still running in the host at EL2), but we do set
> MDCR_EL2.HPMN to 0 which is constrained unpredictable according to ARM DDI
> 0487F.b, page D13-3620; the required behavior from the hardware in this
> case is to reserve an unknown number of registers for EL2 and EL3 exclusive
> use.
> 
> Initialize mdcr_el2 in kvm_vcpu_vcpu_first_run_init(), so we can avoid the
> constrained unpredictable behavior and to ensure that the MDCR_EL2 register
> has the same value after each vcpu_load(), including the first time the
> VCPU is run.


> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
> index 7a7e425616b5..22ee448aee2b 100644
> --- a/arch/arm64/kvm/debug.c
> +++ b/arch/arm64/kvm/debug.c
> @@ -68,6 +68,59 @@ void kvm_arm_init_debug(void)

> +static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu, u32 host_mdcr)
> +{
> +	bool trap_debug = !(vcpu->arch.flags & KVM_ARM64_DEBUG_DIRTY);
> +
> +	/*
> +	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access
> +	 * to the profiling buffer.
> +	 */
> +	vcpu->arch.mdcr_el2 = host_mdcr & MDCR_EL2_HPMN_MASK;
> +	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
> +				MDCR_EL2_TPMS |
> +				MDCR_EL2_TPMCR |
> +				MDCR_EL2_TDRA |
> +				MDCR_EL2_TDOSA);

> +	if (vcpu->guest_debug) {
> +		/* Route all software debug exceptions to EL2 */
> +		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
> +		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW)
> +			trap_debug = true;
> +	}

This had me confused for a while... could you hint that this is when the guest is being
'external'-debugged by the VMM? (it's clearer before this change)


Thanks,

James


> +	/* Trap debug register access */
> +	if (trap_debug)
> +		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDA;
> +
> +	trace_kvm_arm_set_dreg32("MDCR_EL2", vcpu->arch.mdcr_el2);
> +}
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature
  2020-10-27 17:26 ` [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature Alexandru Elisei
@ 2020-11-19 16:58   ` James Morse
  2020-12-02 14:29     ` Alexandru Elisei
  0 siblings, 1 reply; 35+ messages in thread
From: James Morse @ 2020-11-19 16:58 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi Alex,

On 27/10/2020 17:26, Alexandru Elisei wrote:
> Detect Statistical Profiling Extension (SPE) support using the cpufeatures
> framework. The presence of SPE is reported via the ARM64_SPE capability.
> 
> The feature will be necessary for emulating SPE in KVM, because KVM needs
> that all CPUs have SPE hardware to avoid scheduling a VCPU on a CPU without
> support. For this reason, the feature type ARM64_CPUCAP_SYSTEM_FEATURE has
> been selected to disallow hotplugging a CPU which doesn't support SPE.

Can you mention the existing driver in the commit message? Surprisingly it doesn't use
cpufeature at all. It looks like arm_spe_pmu_dev_init() goes out of its way to support
mismatched systems. (otherwise the significance of the new behaviour isn't clear!)

I read it as: the host is fine with mismatched systems, and the existing driver supports
this. But KVM is not. After this patch you can't make the system mismatched 'late'.


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives
  2020-10-27 17:26 ` [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives Alexandru Elisei
@ 2020-11-19 16:58   ` James Morse
  2020-12-02 15:13     ` Alexandru Elisei
  0 siblings, 1 reply; 35+ messages in thread
From: James Morse @ 2020-11-19 16:58 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi Alex,

On 27/10/2020 17:26, Alexandru Elisei wrote:
> KVM SPE emulation depends on the configuration option KVM_ARM_SPE and on
> having hardware SPE support on all CPUs.

> The host driver must be
> compiled-in because we need the SPE interrupt to be enabled; it will be
> used to kick us out of the guest when the profiling buffer management
> interrupt is asserted by the GIC (for example, when the buffer is full).

Great: SPE IRQ very important...


> Add a VCPU flag to inform KVM that the guest has SPE enabled.
> 
> It's worth noting that even though the KVM_ARM_SPE config option is gated
> by the SPE host driver being compiled-in, we don't actually check that the
> driver was loaded successfully when we advertise SPE support for guests.

Eh?

> That's because we can live with the SPE interrupt being disabled. There is
> a delay between when the SPE hardware asserts the interrupt and when the
> GIC samples the interrupt line and asserts it to the CPU. If the SPE
> interrupt is disabled at the GIC level, this delay will be larger,

How does this work? Surely the IRQ needs to be enabled before it can become pending at the
CPU to kick us out of the guest...


> at most a host timer tick.

(Because the timer brings us out of the guest anyway?)


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE
  2020-10-27 17:26 ` [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
  2020-11-05  9:58   ` Haibo Xu
@ 2020-11-19 16:58   ` James Morse
  2020-12-02 16:28     ` Alexandru Elisei
  1 sibling, 1 reply; 35+ messages in thread
From: James Morse @ 2020-11-19 16:58 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, Sudeep Holla, will, kvmarm

Hi Alex,

On 27/10/2020 17:26, Alexandru Elisei wrote:
> From: Sudeep Holla <sudeep.holla@arm.com>
> 
> To configure the virtual SPE buffer management interrupt number, we use a
> VCPU kvm_device ioctl, encapsulating the KVM_ARM_VCPU_SPE_IRQ attribute
> within the KVM_ARM_VCPU_SPE_CTRL group.
> 
> After configuring the SPE, userspace is required to call the VCPU ioctl
> with the attribute KVM_ARM_VCPU_SPE_INIT to initialize SPE on the VCPU.

> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 2acec3b9ef65..6135b9827fbe 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -161,3 +161,43 @@ Specifies the base address of the stolen time structure for this VCPU. The
>  base address must be 64 byte aligned and exist within a valid guest memory
>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>  including the layout of the stolen time structure.
> +
> +4. GROUP: KVM_ARM_VCPU_SPE_CTRL
> +===============================
> +
> +:Architectures: ARM64
> +
> +4.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
> +-----------------------------------
> +
> +:Parameters: in kvm_device_attr.addr the address for the SPE buffer management
> +             interrupt is a pointer to an int
> +
> +Returns:
> +
> +	 =======  ========================================================
> +	 -EBUSY   The SPE buffer management interrupt is already set
> +	 -EINVAL  Invalid SPE overflow interrupt number
> +	 -EFAULT  Could not read the buffer management interrupt number
> +	 -ENXIO   SPE not supported or not properly configured
> +	 =======  ========================================================
> +
> +A value describing the SPE (Statistical Profiling Extension) overflow interrupt
> +number for this vcpu. This interrupt should be a PPI and the interrupt type and
> +number must be same for each vcpu.
> +
> +4.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
> +------------------------------------
> +
> +:Parameters: no additional parameter in kvm_device_attr.addr
> +
> +Returns:
> +
> +	 =======  ======================================================
> +	 -EBUSY   SPE already initialized
> +	 -ENODEV  GIC not initialized
> +	 -ENXIO   SPE not supported or not properly configured
> +	 =======  ======================================================

> +Request the initialization of the SPE. Must be done after initializing the
> +in-kernel irqchip and after setting the interrupt number for the VCPU.

Fantastic!


> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index f32490229a4c..4dc205fa4be1 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -87,6 +87,9 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_ARM_PTRAUTH_GENERIC:
>  		r = system_has_full_ptr_auth();
>  		break;
> +	case KVM_CAP_ARM_SPE:
> +		r = kvm_arm_supports_spe();
> +		break;
>  	default:
>  		r = 0;
>  	}
> @@ -223,6 +226,19 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
>  	return 0;
>  }
>  
> +static int kvm_vcpu_enable_spe(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvm_arm_supports_spe())
> +		return -EINVAL;
> +
> +	/* SPE is disabled if the PE is in AArch32 state */
> +	if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
> +		return -EINVAL;
> +
> +	vcpu->arch.flags |= KVM_ARM64_GUEST_HAS_SPE;
> +	return 0;
> +}

VCPU-reset promotes the VMM feature into flags. How does this interact with
kvm_arm_spe_init()?

It doesn't look like this resets any state, couldn't it be done once by kvm_arm_spe_init()?


>  /**
>   * kvm_reset_vcpu - sets core registers and sys_regs to reset value
>   * @vcpu: The VCPU pointer
> @@ -274,6 +290,13 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>  		}
>  	}
>  
> +	if (test_bit(KVM_ARM_VCPU_SPE, vcpu->arch.features)) {
> +		if (kvm_vcpu_enable_spe(vcpu)) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	}
> +
>  	switch (vcpu->arch.target) {
>  	default:
>  		if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {

> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
> new file mode 100644
> index 000000000000..f91a52cd7cd3
> --- /dev/null
> +++ b/arch/arm64/kvm/spe.c
> @@ -0,0 +1,129 @@

> +static bool kvm_arm_spe_irq_is_valid(struct kvm *kvm, int irq)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +
> +	/* The SPE overflow interrupt can be a PPI only */
> +	if (!irq_is_ppi(irq))
> +		return false;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (!kvm_arm_spe_irq_initialized(vcpu))
> +			continue;
> +
> +		if (vcpu->arch.spe_cpu.irq_num != irq)
> +			return false;
> +	}

Looks like you didn't really want a vcpu property! (huh, patch 10 adds a vm property too)
We're making this a vcpu property because of the PPI and system registers? (both good reasons)

If the PPI number lived in struct kvm_arch, you'd only need to check it was
uninitialised, or the same, to get the same behaviour, which would save some of this error
handling.
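
Roughly what I mean (kvm->arch.spe_irq_num is a made-up field, just to illustrate):

	/* 0 means "not set yet"; otherwise it must be the same PPI everywhere */
	if (kvm->arch.spe_irq_num && kvm->arch.spe_irq_num != irq)
		return false;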


> +	return true;
> +}

> diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
> index 46ec447ed013..0275e8097529 100644
> --- a/include/kvm/arm_spe.h
> +++ b/include/kvm/arm_spe.h
> @@ -18,11 +18,38 @@ struct kvm_spe_cpu {
>  	bool initialized; 	/* Feature is initialized on VCPU */
>  };
>  
> +#define kvm_arm_spe_irq_initialized(v)			\
> +	((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&	\
> +	 (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)

Didn't GICv(mumbles) add an additional PPI range? Could this be made irq_is_ppi(), that
way if the vgic gains support for that, we don't get weird behaviour here?


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE
  2020-10-27 17:26 ` [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE Alexandru Elisei
  2020-11-05 10:10   ` Haibo Xu
@ 2020-11-19 16:59   ` James Morse
  2021-03-23 14:27     ` Alexandru Elisei
  1 sibling, 1 reply; 35+ messages in thread
From: James Morse @ 2020-11-19 16:59 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi Alex,

On 27/10/2020 17:26, Alexandru Elisei wrote:
> Stage 2 faults triggered by the profiling buffer attempting to write to
> memory are reported by the SPE hardware by asserting a buffer management
> event interrupt. Interrupts are by their nature asynchronous, which means
> that the guest might have changed its stage 1 translation tables since the
> attempted write. SPE reports the guest virtual address that caused the data
> abort, but not the IPA, which means that KVM would have to walk the guest's
> stage 1 tables to find the IPA; using the AT instruction to walk the
> guest's tables in hardware is not an option because it doesn't report the
> IPA in the case of a stage 2 fault on a stage 1 table walk.

Great detailed description; I think a summary helps identify 'both' problems:
| To work reliably, both the profiling buffer and the page tables to reach it must not
| fault at stage2.

> Fix both problems by pre-mapping the guest's memory at stage 2 with write
> permissions to avoid any faults. Userspace calls mlock() on the VMAs that
> back the guest's memory, pinning the pages in memory, then tells KVM to map
> the memory at stage 2 by using the VM control group KVM_ARM_VM_SPE_CTRL
> with the attribute KVM_ARM_VM_SPE_FINALIZE.

The reason to have this feature is SPE, but is there anything SPE specific in the feature?

I can imagine this being useful on its own if I wanted to reduce guest-exits for
quasi-real-time reasons, and had memory to burn!

(as an independent feature, it might be useful on other architectures too...)


Would it make sense to add this as a flag to KVM_SET_USER_MEMORY_REGION? That is the point
that the userspace_addr is provided to KVM, this would allow us to fail the call if a
KVM_MEM_LOCKED memslot can't be created because the underlying VMA aren't VM_LOCKED.

(it also makes it easy to catch incompatible changes of flags in the future)

/me wanders off musing if this can then be combined with VM_PFNMAP in
kvm_arch_prepare_memory_region()....
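
To make that concrete, roughly this in kvm_arch_prepare_memory_region() (KVM_MEM_LOCKED
is a made-up flag name, and a real version would walk every VMA in the range):

	if (mem->flags & KVM_MEM_LOCKED) {
		struct vm_area_struct *vma;
		int err = 0;

		mmap_read_lock(current->mm);
		vma = find_vma(current->mm, mem->userspace_addr);
		/* Refuse the memslot if the backing VMA isn't mlock()'ed */
		if (!vma || !(vma->vm_flags & VM_LOCKED))
			err = -EINVAL;
		mmap_read_unlock(current->mm);
		if (err)
			return err;
	}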


> KVM will map all writable VMAs which have the VM_LOCKED flag set.

> Hugetlb VMAs are practically pinned in
> memory after they are faulted in and mlock() doesn't set the VM_LOCKED
> flag, and just faults the pages in;

Ugh. It would be nice to avoid special casing this. KVM shouldn't have to care about the
difference between a hugetlbfs PMD and a THP PMD.

From mlock_fixup(), it looks like this is because these VMA can't be split.
Is it possible to change this if mlock() is called for the whole range? (user-space must
know it's hugetlbfs!)

Alternatively, it would be good if mm could tell us when a page is locked (and/or special
cased). That way dax does the right thing too, without having extra special casing in KVM.
This would also catch VM_PFNMAP if mm knows it's effectively the same as VM_LOCKED...


> KVM will treat hugetlb VMAs like they
> have the VM_LOCKED flag and will also map them, faulting them in if
> necessary, when handling the ioctl.

Surely user-space should call mlock() to do the faulting in? (and do that before handing
the memory over to KVM)

Getting KVM to do it will create a loop via the mmu_notifier if this touches a COW page,
which in turn bumps the sequence counter causing us to bomb out with -EAGAIN.
(it looks like wp_page_copy() is the only case that calls set_pte_at_notify())


> VM live migration relies on a bitmap of dirty pages. This bitmap is created
> by write-protecting a memslot and updating it as KVM handles stage 2 write
> faults. Because KVM cannot handle stage 2 faults reported by the profiling
> buffer, it will not pre-map a logging memslot. This effectively means that
> profiling is not available when the VM is configured for live migration.

Yeah ... that sucks. Have any of the Qemu folk said what they'd like to see here?

I can imagine making the logging-enable call fail if any CPU has SPE profiling enabled, as
the logging will change the results of SPE... We'd then need an exit to user-space to say
that the vcpu tried to enable SPE while logging was active. Qemu can then decide whether
to block that vcpu until migration completes, or abort migration.
But: I've no idea how Qemu manages migration, so it may not be able to do irregular things
like this.

As a short cut, can we let the arch code fail calls that make problematic changes? (e.g.
setting KVM_MEM_LOG_DIRTY_PAGES or KVM_MEM_READONLY). It looks like you currently document
these as silently breaking something else... (an invitation to debug a subtle interaction
in the future!)

~

How does this interact with KSM?
I can see its try_to_merge_one_page() calling write_protect_page() before testing the
vm_flags for VM_LOCKED ... so it doesn't look like mlock() stops KSM from doing its work -
which in turn will cause stage2 faults.

It looks like this is all hinged on VM_MERGEABLE, which can be cleared with an madvise()
call using MADV_UNMERGEABLE ... but from the man page at least this is to undo a previous
hint.

I can't find what sets this for a regular vma, so presumably it's not... see what you think;
I reckon we need to add "no madvise() MADV_MERGEABLE" to the documentation, and get KVM to
check the corresponding vma flag when it looks for VM_LOCKED regions.

I think the 'testing flags' is justified, even though we can't enforce they don't change,
as we can catch a stage2 fault that shouldn't have happened.
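
i.e. something along these lines in the VMA walk (untested; skip or refuse, whichever we
settle on):

	/* KSM can write-protect the pages behind our back, breaking SPE */
	if (vma->vm_flags & VM_MERGEABLE) {
		mmap_read_unlock(current->mm);
		return -EINVAL;
	}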


> diff --git a/Documentation/virt/kvm/devices/vm.rst b/Documentation/virt/kvm/devices/vm.rst
> index 0aa5b1cfd700..b70798a72d8a 100644
> --- a/Documentation/virt/kvm/devices/vm.rst
> +++ b/Documentation/virt/kvm/devices/vm.rst
> @@ -314,3 +314,31 @@ Allows userspace to query the status of migration mode.
>  	     if it is enabled
>  :Returns:   -EFAULT if the given address is not accessible from kernel space;
>  	    0 in case of success.
> +
> +6. GROUP: KVM_ARM_VM_SPE_CTRL
> +===============================
> +
> +:Architectures: arm64
> +
> +6.1. ATTRIBUTE: KVM_ARM_VM_SPE_FINALIZE
> +-----------------------------------------
> +
> +Finalizes the creation of the SPE feature by mapping the guest memory in the
> +stage 2 table. Guest memory must be readable, writable and pinned in RAM, which
> +is achieved with an mlock() system call;

(I first read this as mlock() makes memory writeable...)


> the memory can be backed by a hugetlbfs
> +file. Memory regions from read-only or dirty page logging enabled memslots will
> +be ignored. After the call, no changes to the guest memory,

> including to its contents, are permitted.

If guest memory is pinned as writeable, why can't the VMM write to it? Doesn't this
requirement preclude virtio?

Is 'no messing with the memslots' enforced in any way?


> +Subsequent KVM_ARM_VCPU_INIT calls will cause the memory to become unmapped and
> +the feature must be finalized again before any VCPU can run.
> +
> +If any VCPUs are run before finalizing the feature, KVM_RUN will return -EPERM.
> +
> +:Parameters: none
> +:Returns:   -EAGAIN if guest memory has been modified while the call was
> +            executing
> +            -EBUSY if the feature is already initialized
> +            -EFAULT if an address backing the guest memory is invalid
> +            -ENXIO if SPE is not supported or not properly configured
> +            0 in case of success

If we need a one-shot finalise call that sets up stage2, is there any mileage in KVM
reporting how much memory it pinned to stage2? This is so that the VMM can know it got the
mmap()/mlock() setup correct? Otherwise we depend on noticing silent failures some time
later... (I prefer the 'all or nothing' for a memslot though.)


> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e51d8f328c7e..2d98248f2c66 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -41,6 +41,7 @@
>  #include <kvm/arm_hypercalls.h>
>  #include <kvm/arm_pmu.h>
>  #include <kvm/arm_psci.h>
> +#include <kvm/arm_spe.h>
>  
>  #ifdef REQUIRES_VIRT
>  __asm__(".arch_extension	virt");
> @@ -653,6 +654,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  	if (unlikely(!kvm_vcpu_initialized(vcpu)))
>  		return -ENOEXEC;
>  
> +	if (vcpu_has_spe(vcpu) && unlikely(!kvm_arm_spe_finalized(vcpu->kvm)))
> +		return -EPERM;

(does the unlikely() make a difference here?)


>  	ret = kvm_vcpu_first_run_init(vcpu);
>  	if (ret)
>  		return ret;
> @@ -982,12 +986,22 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
>  	 * ensuring that the data side is always coherent. We still
>  	 * need to invalidate the I-cache though, as FWB does *not*
>  	 * imply CTR_EL0.DIC.
> +	 *
> +	 * If the guest has SPE, we need to unmap the entire address space to
> +	 * allow for any changes to the VM memory made by userspace to propagate
> +	 * to the stage 2 tables when SPE is re-finalized;

This is about the layout of memory (instead of the contents)? Doesn't this get
synchronised by the mmu_notifier?

This is registered during kvm_create_vm(), and unregistered during kvm_destroy_vm()... so
it will see any changes either side of this call...


(the existing call is about cleaning the initial state that the VMM re-wrote to the PoC. I
can't see how SPE or memory pinning fit in here)


>          this also makes sure
> +	 * we keep the userspace and the guest's view of the memory contents
> +	 * synchronized.
>  	 */
>  	if (vcpu->arch.has_run_once) {
> -		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
> +		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
> +		    vcpu_has_spe(vcpu)) {
>  			stage2_unmap_vm(vcpu->kvm);
> -		else
> +			if (vcpu_has_spe(vcpu))
> +				kvm_arm_spe_notify_vcpu_init(vcpu);
> +		} else {
>  			__flush_icache_all();
> +		}
>  	}
>  
>  	vcpu_reset_hcr(vcpu);


From here...

> @@ -1045,6 +1059,45 @@ static int kvm_arm_vcpu_has_attr(struct kvm_vcpu *vcpu,
>  	return ret;
>  }
>  
> +static int kvm_arm_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret = -ENXIO;
> +
> +	switch (attr->group) {
> +	default:
> +		ret = kvm_arm_vm_arch_set_attr(kvm, attr);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int kvm_arm_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret = -ENXIO;
> +
> +	switch (attr->group) {
> +	default:
> +		ret = kvm_arm_vm_arch_get_attr(kvm, attr);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int kvm_arm_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret = -ENXIO;
> +
> +	switch (attr->group) {
> +	default:
> +		ret = kvm_arm_vm_arch_has_attr(kvm, attr);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
>  static int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>  				   struct kvm_vcpu_events *events)
>  {
> @@ -1259,6 +1312,27 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  
>  		return 0;
>  	}
> +	case KVM_SET_DEVICE_ATTR: {
> +		struct kvm_device_attr attr;
> +
> +		if (copy_from_user(&attr, argp, sizeof(attr)))
> +			return -EFAULT;
> +		return kvm_arm_vm_set_attr(kvm, &attr);
> +	}
> +	case KVM_GET_DEVICE_ATTR: {
> +		struct kvm_device_attr attr;
> +
> +		if (copy_from_user(&attr, argp, sizeof(attr)))
> +			return -EFAULT;
> +		return kvm_arm_vm_get_attr(kvm, &attr);
> +	}
> +	case KVM_HAS_DEVICE_ATTR: {
> +		struct kvm_device_attr attr;
> +
> +		if (copy_from_user(&attr, argp, sizeof(attr)))
> +			return -EFAULT;
> +		return kvm_arm_vm_has_attr(kvm, &attr);
> +	}
>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 2ba790eeb782..d0dc4bdb8b4a 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -988,3 +988,51 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>  
>  	return ret;
>  }
> +
> +int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret;
> +
> +	switch (attr->group) {
> +	case KVM_ARM_VM_SPE_CTRL:
> +		ret = kvm_arm_vm_spe_set_attr(kvm, attr);
> +		break;
> +	default:
> +		ret = -ENXIO;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret;
> +
> +	switch (attr->group) {
> +	case KVM_ARM_VM_SPE_CTRL:
> +		ret = kvm_arm_vm_spe_get_attr(kvm, attr);
> +		break;
> +	default:
> +		ret = -ENXIO;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret;
> +
> +	switch (attr->group) {
> +	case KVM_ARM_VM_SPE_CTRL:
> +		ret = kvm_arm_vm_spe_has_attr(kvm, attr);
> +		break;
> +	default:
> +		ret = -ENXIO;
> +		break;
> +	}
> +
> +	return ret;
> +}

... to here, is almost entirely boilerplate for supporting 0-or-more VM ioctls. Could this
be a separate preparatory patch, just so it isn't wrapped up in the SPE/memory-pinning
specifics?


> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c3c43555490d..31b2216a5881 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1365,6 +1365,175 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static int stage2_map_vma(struct kvm *kvm,
> +			  struct kvm_memory_slot *memslot,
> +			  struct vm_area_struct *vma,
> +			  enum kvm_pgtable_prot prot,
> +			  unsigned long mmu_seq, hva_t *hvap,
> +			  struct kvm_mmu_memory_cache *cache)
> +{
> +	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> +	unsigned long stage2_pagesize, remaining;
> +	bool force_pte, writable;
> +	hva_t hva, hva_end;
> +	kvm_pfn_t pfn;
> +	gpa_t gpa;
> +	gfn_t gfn;
> +	int ret;
> +
> +	hva = max(memslot->userspace_addr, vma->vm_start);
> +	hva_end = min(vma->vm_end, memslot->userspace_addr +
> +			(memslot->npages << PAGE_SHIFT));
> +
> +	gpa = (memslot->base_gfn << PAGE_SHIFT) + hva - memslot->userspace_addr;
> +	gfn = gpa >> PAGE_SHIFT;
> +
> +	stage2_pagesize = 1UL << stage2_max_pageshift(memslot, vma, hva, &force_pte);
> +
> +	while (hva < hva_end) {
> +		ret = kvm_mmu_topup_memory_cache(cache,
> +						 kvm_mmu_cache_min_pages(kvm));
> +		if (ret)
> +			return ret;
> +
> +		/*
> +		 * We start mapping with the highest possible page size, so the
> +		 * gpa and gfn will always be properly aligned to the current
> +		 * page size.
> +		 */
> +		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL, true, &writable);

Heh, if this causes the stage1 page tables to be changed, it will invoke the mmu notifier,
which will cause us to fail with -EAGAIN afterwards. User-space could keep retrying, and
it would fix a page at a time...

Passing atomic here would stop this, as we don't want to update the stage1 tables. If they
haven't been set up as needed, then this should fail early, with the finger pointing at
stage1. This way we don't mask a bug in user-space, and get caught out by 'this used to work'.

(or is this what prevents access-flag faults at stage1?)
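
i.e., keeping the rest of the arguments as they are in the patch:

	/* atomic=true: don't fault anything into the stage 1 tables here; if
	 * userspace didn't mlock()/populate them, fail instead of fixing it up */
	pfn = __gfn_to_pfn_memslot(memslot, gfn, true, NULL, true, &writable);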


> +		if (pfn == KVM_PFN_ERR_HWPOISON)
> +			return -EFAULT;
> +		if (is_error_noslot_pfn(pfn))

Doesn't is_error_noslot_pfn() cover KVM_PFN_ERR_HWPOISON?


> +			return -EFAULT;
> +		/* Can only happen if naughty userspace changed the VMA. */
> +		if (kvm_is_device_pfn(pfn) || !writable)
> +			return -EAGAIN;

kvm_release_pfn_(*cough*)() ?

My reading is __gfn_to_pfn_memslot() calls gup, which takes a reference you release (or
adjust) at the end of the loop.


> +		spin_lock(&kvm->mmu_lock);
> +		if (mmu_notifier_retry(kvm, mmu_seq)) {
> +			spin_unlock(&kvm->mmu_lock);

> +			return -EAGAIN;

(same again)


> +		}
> +
> +		remaining = hva_end - hva;
> +		if (stage2_pagesize == PUD_SIZE && remaining < PUD_SIZE)
> +			stage2_pagesize = PMD_SIZE;
> +		if (stage2_pagesize == PMD_SIZE && remaining < PMD_SIZE) {

> +			force_pte = true;

I had to sleep on this one: You're forced to put down a PTE because of the remaining size
in the memslot? This is to prevent rolling back up to a THP size if that is what stage1 is
using?


> +			stage2_pagesize = PAGE_SIZE;
> +		}
> +
> +		if (!force_pte && stage2_pagesize == PAGE_SIZE)

> +			/*
> +			 * The hva and gpa will always be PMD aligned if
> +			 * hva is backed by a transparent huge page.

because you walk through the vma in order... but what about the first page?

What stops me starting my memslot on a 1MB boundary, which is half way through a 2MB THP?
Doesn't the 'hva=max()' align hva up to the memslot boundary?



>                          gpa will
> +			 * not be modified and it's not necessary to recompute
> +			 * hva.
> +			 */
> +			stage2_pagesize = transparent_hugepage_adjust(memslot, hva, &pfn, &gpa);
> +
> +		ret = kvm_pgtable_stage2_map(pgt, gpa, stage2_pagesize,
> +					     __pfn_to_phys(pfn), prot, cache);
> +		spin_unlock(&kvm->mmu_lock);


> +		kvm_set_pfn_accessed(pfn);

This leads into mark_page_accessed(), which has:
|		 * Unevictable pages are on the "LRU_UNEVICTABLE" list. But,
|		 * this list is never rotated or maintained, so marking an
|		 * evictable page accessed has no effect.

This is to tell swap 'not yet'? ... Isn't that impossible by this point?



> +		kvm_release_pfn_dirty(pfn);

> +		if (ret)
> +			return ret;
> +		else if (hva < hva_end)
> +			cond_resched();

(we do this even for the last time round the loop as hva hasn't been updated yet)


> +		hva += stage2_pagesize;
> +		gpa += stage2_pagesize;
> +		gfn = gpa >> PAGE_SHIFT;
> +	}
> +
> +	*hvap = hva;
> +	return 0;
> +}



> +int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> +			   enum kvm_pgtable_prot prot)
> +{
> +	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +	struct vm_area_struct *vma;
> +	unsigned long mmu_seq;
> +	hva_t hva, hva_memslot_end;
> +	int ret;
> +
> +	lockdep_assert_held(&kvm->slots_lock);

> +	if (!(prot & KVM_PGTABLE_PROT_R))
> +		return -EPERM;
> +	if ((prot & KVM_PGTABLE_PROT_W) && (memslot->flags & KVM_MEM_READONLY))
> +		return -EPERM;

This is checking the static value from kvm_arm_spe_finalize()?


> +	hva = memslot->userspace_addr;
> +	hva_memslot_end = memslot->userspace_addr + (memslot->npages << PAGE_SHIFT);
> +
> +	/*
> +	 * Be extra careful here in case userspace is messing with the VMAs
> +	 * backing the memslot.
> +	 */

If we held mmap_read_lock() for the duration, wouldn't that be impossible?
(and after that point we can scream from the mmu_notifier if a memslot is changed...)


> +	mmu_seq = kvm->mmu_notifier_seq;
> +	smp_rmb();
> +
> +	/*
> +	 * A memslot might span multiple VMAs and any holes between them, while
> +	 * a VMA might span multiple memslots (see
> +	 * kvm_arch_prepare_memory_region()). Take the intersection of the VMAs
> +	 * with the memslot.
> +	 */
> +	do {
> +		mmap_read_lock(current->mm);
> +		vma = find_vma(current->mm, hva);
> +		/*
> +		 * find_vma() returns first VMA with hva < vma->vm_end, which
> +		 * means that it is possible for the VMA to start *after* the
> +		 * end of the memslot.
> +		 */
> +		if (!vma || vma->vm_start >= hva_memslot_end) {
> +			mmap_read_unlock(current->mm);
> +			return 0;
> +		}
> +
> +		/*
> +		 * VM_LOCKED pages are put in the unevictable LRU list and
> +		 * hugetlb pages are not put in any LRU list; both will stay
> +		 * pinned in memory.
> +		 */
> +		if (!(vma->vm_flags & VM_LOCKED) && !is_vm_hugetlb_page(vma)) {
> +			/* Go to next VMA. */
> +			hva = vma->vm_end;
> +			mmap_read_unlock(current->mm);
> +			continue;
> +		}
> +		if (!(vma->vm_flags & VM_READ) ||
> +		    ((prot & KVM_PGTABLE_PROT_W) && !(vma->vm_flags & VM_WRITE))) {
> +			/* Go to next VMA. */
> +			hva = vma->vm_end;
> +			mmap_read_unlock(current->mm);
> +			continue;
> +		}
> +		mmap_read_unlock(current->mm);

Can't a writer now come in and remove the vma, which you then pass to:

> +		ret = stage2_map_vma(kvm, memslot, vma, prot, mmu_seq, &hva, &cache);

As this only reads from the stage1 entries, I think you may be able to hold a read lock
for the duration of the loop. (if we tell gup not to write new entries)
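
Roughly (untested, and assuming gup is told not to fault in new entries, as above):

	mmap_read_lock(current->mm);
	do {
		vma = find_vma(current->mm, hva);
		if (!vma || vma->vm_start >= hva_memslot_end)
			break;

		/* the VM_LOCKED/hugetlb and VM_READ/VM_WRITE checks from the
		 * patch go here, skipping to vma->vm_end as before */

		ret = stage2_map_vma(kvm, memslot, vma, prot, mmu_seq, &hva, &cache);
		if (ret)
			break;
	} while (hva < hva_memslot_end);
	mmap_read_unlock(current->mm);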


> +		if (ret)
> +			return ret;
> +	} while (hva < hva_memslot_end);
> +
> +	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB)) {
> +		spin_lock(&kvm->mmu_lock);
> +		stage2_flush_memslot(kvm, memslot);
> +		spin_unlock(&kvm->mmu_lock);
> +	}
> +
> +	return 0;
> +}
> +
> +
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
>  }
> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
> index f91a52cd7cd3..316ff8dfed5b 100644
> --- a/arch/arm64/kvm/spe.c
> +++ b/arch/arm64/kvm/spe.c
> @@ -10,6 +10,13 @@
>  #include <kvm/arm_spe.h>
>  #include <kvm/arm_vgic.h>
>  
> +#include <asm/kvm_mmu.h>
> +
> +void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
> +{
> +	vcpu->kvm->arch.spe.finalized = false;
> +}
> +
>  static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
>  {
>  	if (!vcpu_has_spe(vcpu))
> @@ -115,6 +122,50 @@ int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>  	return -ENXIO;
>  }
>  
> +static int kvm_arm_spe_finalize(struct kvm *kvm)
> +{
> +	struct kvm_memory_slot *memslot;
> +	enum kvm_pgtable_prot prot;
> +	struct kvm_vcpu *vcpu;
> +	int i, ret;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (!kvm_arm_spe_vcpu_initialized(vcpu))
> +			return -ENXIO;
> +	}
> +
> +	mutex_unlock(&kvm->slots_lock);

Typo?


> +	if (kvm_arm_spe_finalized(kvm)) {

> +		mutex_unlock(&kvm->slots_lock);

> +		return -EBUSY;
> +	}
> +
> +	prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
> +	kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
> +		/* Only map memory that SPE can write to. */
> +		if (memslot->flags & KVM_MEM_READONLY)
> +			continue;
> +		 /*
> +		  * Dirty page logging will write-protect pages, which breaks
> +		  * SPE.
> +		  */
> +		if (memslot->dirty_bitmap)
> +			continue;

This silently skips regions that set KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE, which should be
harmless until KVM_CLEAR_DIRTY_LOG clears the bitmap bits and write-protects the pages
(the runtime update ends in kvm_mmu_write_protect_pt_masked())

It's the silent bit that bothers me. If this were done as a memslot flag, we could tell
the VMM whether it's the mm flags on the vma we can't cope with, or the KVM flag on the
memslot.


> +		ret = kvm_map_locked_memslot(kvm, memslot, prot);
> +		if (ret)
> +			break;
> +	}
> +
> +	if (!ret)
> +		kvm->arch.spe.finalized = true;
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	if (ret)
> +		stage2_unmap_vm(kvm);

We haven't put in any invalid mappings; is this needed?



> +
> +	return ret;
> +}


I think separating the boiler plate, and SPE bits from the stage2/mm code would make this
patch simpler.


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it
  2020-11-19 16:58   ` James Morse
@ 2020-12-02 14:25     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 14:25 UTC (permalink / raw)
  To: James Morse; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi James,

Many thanks for having a look at the series!

On 11/19/20 4:58 PM, James Morse wrote:
> Hi Alex,
>
> On 27/10/2020 17:26, Alexandru Elisei wrote:
>> When a VCPU is created, the kvm_vcpu struct is initialized to zero in
>> kvm_vm_ioctl_create_vcpu(). On VHE systems, the first time
>> vcpu.arch.mdcr_el2 is loaded on hardware is in vcpu_load(), before it is
>> set to a sensible value in kvm_arm_setup_debug() later in the run loop. The
>> result is that KVM executes for a short time with MDCR_EL2 set to zero.
>>
>> This is mostly harmless as we don't need to trap debug and SPE register
>> accesses from EL1 (we're still running in the host at EL2), but we do set
>> MDCR_EL2.HPMN to 0 which is constrained unpredictable according to ARM DDI
>> 0487F.b, page D13-3620; the required behavior from the hardware in this
>> case is to reserve an unknown number of registers for EL2 and EL3 exclusive
>> use.
>>
>> Initialize mdcr_el2 in kvm_vcpu_vcpu_first_run_init(), so we can avoid the
>> constrained unpredictable behavior and to ensure that the MDCR_EL2 register
>> has the same value after each vcpu_load(), including the first time the
>> VCPU is run.
>
>> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
>> index 7a7e425616b5..22ee448aee2b 100644
>> --- a/arch/arm64/kvm/debug.c
>> +++ b/arch/arm64/kvm/debug.c
>> @@ -68,6 +68,59 @@ void kvm_arm_init_debug(void)
>> +static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu, u32 host_mdcr)
>> +{
>> +	bool trap_debug = !(vcpu->arch.flags & KVM_ARM64_DEBUG_DIRTY);
>> +
>> +	/*
>> +	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access
>> +	 * to the profiling buffer.
>> +	 */
>> +	vcpu->arch.mdcr_el2 = host_mdcr & MDCR_EL2_HPMN_MASK;
>> +	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
>> +				MDCR_EL2_TPMS |
>> +				MDCR_EL2_TPMCR |
>> +				MDCR_EL2_TDRA |
>> +				MDCR_EL2_TDOSA);
>> +	if (vcpu->guest_debug) {
>> +		/* Route all software debug exceptions to EL2 */
>> +		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
>> +		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW)
>> +			trap_debug = true;
>> +	}
> This had me confused for a while... could you hint that this is when the guest is being
> 'external'-debugged by the VMM? (it's clearer before this change)

I can put a comment above the if statement, similar to the one in
kvm_arm_setup_debug(), where this code is lifted from:

        /* Is the VCPU being debugged by userspace? */
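
i.e. something like this in kvm_arm_setup_mdcr_el2() (sketch):

	/* Is the VCPU being debugged ('external' debug) by userspace? */
	if (vcpu->guest_debug) {
		/* Route all software debug exceptions to EL2 */
		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW)
			trap_debug = true;
	}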

What do you think?

Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature
  2020-11-19 16:58   ` James Morse
@ 2020-12-02 14:29     ` Alexandru Elisei
  2020-12-02 17:23       ` Will Deacon
  0 siblings, 1 reply; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 14:29 UTC (permalink / raw)
  To: James Morse; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi James,

On 11/19/20 4:58 PM, James Morse wrote:
> Hi Alex,
>
> On 27/10/2020 17:26, Alexandru Elisei wrote:
>> Detect Statistical Profiling Extension (SPE) support using the cpufeatures
>> framework. The presence of SPE is reported via the ARM64_SPE capability.
>>
>> The feature will be necessary for emulating SPE in KVM, because KVM needs
>> that all CPUs have SPE hardware to avoid scheduling a VCPU on a CPU without
>> support. For this reason, the feature type ARM64_CPUCAP_SYSTEM_FEATURE has
>> been selected to disallow hotplugging a CPU which doesn't support SPE.
> Can you mention the existing driver in the commit message? Surprisingly it doesn't use
> cpufeature at all. It looks like arm_spe_pmu_dev_init() goes out of its way to support
> mismatched systems. (otherwise the significance of the new behaviour isn't clear!)
>
> I read it as: the host is fine with mismatched systems, and the existing driver supports
> this. But KVM is not. After this patch you can't make the system mismatched 'late'.

That was exactly my intention. Certainly, I'll try to make the commit message
clearer by mentioning the SPE driver.

Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives
  2020-11-19 16:58   ` James Morse
@ 2020-12-02 15:13     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 15:13 UTC (permalink / raw)
  To: James Morse; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi James,

On 11/19/20 4:58 PM, James Morse wrote:
> Hi Alex,
>
> On 27/10/2020 17:26, Alexandru Elisei wrote:
>> KVM SPE emulation depends on the configuration option KVM_ARM_SPE and on
>> having hardware SPE support on all CPUs.
>> The host driver must be
>> compiled-in because we need the SPE interrupt to be enabled; it will be
>> used to kick us out of the guest when the profiling buffer management
>> interrupt is asserted by the GIC (for example, when the buffer is full).
> Great: SPE IRQ very important...

Within reason.

>
>
>> Add a VCPU flag to inform KVM that the guest has SPE enabled.
>>
>> It's worth noting that even though the KVM_ARM_SPE config option is gated
>> by the SPE host driver being compiled-in, we don't actually check that the
>> driver was loaded successfully when we advertise SPE support for guests.
> Eh?

Yes, this looks half-baked, and probably is, because:

1. I'm not sure I haven't missed anything with my approach to handling the SPE
interrupt triggered by the guest (details in the cover letter). The other option
would be to use the SPE driver IRQ handler, which makes this moot.

2. The SPE driver probing fails when the host has kpti enabled (the kernel can't
profile userspace). In my opinion, this shouldn't affect SPE support for guests,
but I didn't want to modify the SPE driver at this stage because of 1.

If we agree on my approach to guest SPE interrupt handling, this patch can be
improved by at least checking that the SPE driver probed successfully and taking
the case where kpti is enabled into consideration.

>
>> That's because we can live with the SPE interrupt being disabled. There is
>> a delay between when the SPE hardware asserts the interrupt and when the
>> GIC samples the interrupt line and asserts it to the CPU. If the SPE
>> interrupt is disabled at the GIC level, this delay will be larger,
> How does this work? Surely the IRQ needs to be enabled before it can become pending at the
> CPU to kick us out of the guest...

As long as the SPE hardware asserts the buffer management interrupt to the GIC
(PMBSR_EL1.S = 1), no profiling is done. If the interrupt is not enabled at the
GIC level, then the CPU will not take the interrupt (obviously). But as far as the
SPE hardware is concerned, the interrupt is asserted and profiling is disabled.
The host checks the PMBSR_EL1.S bit on every VM exit to see if SPE has asserted the
interrupt, so there's no dependency on the GIC asserting the interrupt. The SPE
interrupt being disabled at the GIC level is not as bad as it sounds (but it's
definitely not ideal) because there will always be a delay between the SPE
hardware asserting the interrupt to the GIC and the GIC asserting it to the CPU.
Not requiring the interrupt to be enabled at the GIC level makes that delay longer
in the case where the host driver failed probing.
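
For reference, the check I have in mind on the VM exit path is roughly the following
(function name and placement are illustrative, not the final code):

	static void kvm_arm_spe_sync_hwstate(struct kvm_vcpu *vcpu)
	{
		u64 pmbsr = read_sysreg_s(SYS_PMBSR_EL1);

		/*
		 * As long as PMBSR_EL1.S is set, the SPE hardware keeps the
		 * buffer management interrupt asserted and profiling stopped,
		 * whether or not the GIC ever delivers the interrupt to the CPU.
		 */
		if (pmbsr & BIT(SYS_PMBSR_EL1_S_SHIFT))
			kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id,
					    vcpu->arch.spe_cpu.irq_num, true, NULL);
	}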

>
>
>> at most a host timer tick.
> (Because the timer brings us out of the guest anyway?)

Yes, once every 4 ms according to the default value of CONFIG_HZ.

Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE
  2020-11-05  9:58   ` Haibo Xu
@ 2020-12-02 15:20     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 15:20 UTC (permalink / raw)
  To: Haibo Xu; +Cc: maz, will, kvmarm, linux-arm-kernel, Sudeep Holla

Hi Haibu,

Thanks for having a look at the patches!

On 11/5/20 9:58 AM, Haibo Xu wrote:
> On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>> From: Sudeep Holla <sudeep.holla@arm.com>
>>
>> To configure the virtual SPE buffer management interrupt number, we use a
>> VCPU kvm_device ioctl, encapsulating the KVM_ARM_VCPU_SPE_IRQ attribute
>> within the KVM_ARM_VCPU_SPE_CTRL group.
>>
>> After configuring the SPE, userspace is required to call the VCPU ioctl
>> with the attribute KVM_ARM_VCPU_SPE_INIT to initialize SPE on the VCPU.
>>
>> [Alexandru E: Fixed compilation errors, don't allow userspace to set the
>>         VCPU feature, removed unused functions, fixed mismatched
>>         descriptions, comments and error codes, reworked logic, rebased on
>>         top of v5.10-rc1]
>>
>> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
>> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
>> ---
>>  Documentation/virt/kvm/devices/vcpu.rst |  40 ++++++++
>>  arch/arm64/include/uapi/asm/kvm.h       |   3 +
>>  arch/arm64/kvm/Makefile                 |   1 +
>>  arch/arm64/kvm/guest.c                  |   9 ++
>>  arch/arm64/kvm/reset.c                  |  23 +++++
>>  arch/arm64/kvm/spe.c                    | 129 ++++++++++++++++++++++++
>>  include/kvm/arm_spe.h                   |  27 +++++
>>  include/uapi/linux/kvm.h                |   1 +
>>  8 files changed, 233 insertions(+)
>>  create mode 100644 arch/arm64/kvm/spe.c
>>
>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
>> index 2acec3b9ef65..6135b9827fbe 100644
>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>> @@ -161,3 +161,43 @@ Specifies the base address of the stolen time structure for this VCPU. The
>>  base address must be 64 byte aligned and exist within a valid guest memory
>>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>>  including the layout of the stolen time structure.
>> +
>> +4. GROUP: KVM_ARM_VCPU_SPE_CTRL
>> +===============================
>> +
>> +:Architectures: ARM64
>> +
>> +4.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
>> +-----------------------------------
>> +
>> +:Parameters: in kvm_device_attr.addr the address for the SPE buffer management
>> +             interrupt is a pointer to an int
>> +
>> +Returns:
>> +
>> +        =======  ========================================================
>> +        -EBUSY   The SPE buffer management interrupt is already set
>> +        -EINVAL  Invalid SPE overflow interrupt number
>> +        -EFAULT  Could not read the buffer management interrupt number
>> +        -ENXIO   SPE not supported or not properly configured
>> +        =======  ========================================================
>> +
>> +A value describing the SPE (Statistical Profiling Extension) overflow interrupt
>> +number for this vcpu. This interrupt should be a PPI and the interrupt type and
>> +number must be same for each vcpu.
>> +
>> +4.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
>> +------------------------------------
>> +
>> +:Parameters: no additional parameter in kvm_device_attr.addr
>> +
>> +Returns:
>> +
>> +        =======  ======================================================
>> +        -EBUSY   SPE already initialized
>> +        -ENODEV  GIC not initialized
>> +        -ENXIO   SPE not supported or not properly configured
>> +        =======  ======================================================
>> +
>> +Request the initialization of the SPE. Must be done after initializing the
>> +in-kernel irqchip and after setting the interrupt number for the VCPU.
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 489e12304dbb..ca57dfb7abf0 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -360,6 +360,9 @@ struct kvm_vcpu_events {
>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER                1
>>  #define KVM_ARM_VCPU_PVTIME_CTRL       2
>>  #define   KVM_ARM_VCPU_PVTIME_IPA      0
>> +#define KVM_ARM_VCPU_SPE_CTRL          3
>> +#define   KVM_ARM_VCPU_SPE_IRQ         0
>> +#define   KVM_ARM_VCPU_SPE_INIT                1
>>
>>  /* KVM_IRQ_LINE irq field index values */
>>  #define KVM_ARM_IRQ_VCPU2_SHIFT                28
>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>> index 1504c81fbf5d..f6e76f64ffbe 100644
>> --- a/arch/arm64/kvm/Makefile
>> +++ b/arch/arm64/kvm/Makefile
>> @@ -25,3 +25,4 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
>>          vgic/vgic-its.o vgic/vgic-debug.o
>>
>>  kvm-$(CONFIG_KVM_ARM_PMU)  += pmu-emul.o
>> +kvm-$(CONFIG_KVM_ARM_SPE)  += spe.o
>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>> index dfb5218137ca..2ba790eeb782 100644
>> --- a/arch/arm64/kvm/guest.c
>> +++ b/arch/arm64/kvm/guest.c
>> @@ -926,6 +926,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
>>         case KVM_ARM_VCPU_PVTIME_CTRL:
>>                 ret = kvm_arm_pvtime_set_attr(vcpu, attr);
>>                 break;
>> +       case KVM_ARM_VCPU_SPE_CTRL:
>> +               ret = kvm_arm_spe_set_attr(vcpu, attr);
>> +               break;
>>         default:
>>                 ret = -ENXIO;
>>                 break;
>> @@ -949,6 +952,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
>>         case KVM_ARM_VCPU_PVTIME_CTRL:
>>                 ret = kvm_arm_pvtime_get_attr(vcpu, attr);
>>                 break;
>> +       case KVM_ARM_VCPU_SPE_CTRL:
>> +               ret = kvm_arm_spe_get_attr(vcpu, attr);
>> +               break;
>>         default:
>>                 ret = -ENXIO;
>>                 break;
>> @@ -972,6 +978,9 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>         case KVM_ARM_VCPU_PVTIME_CTRL:
>>                 ret = kvm_arm_pvtime_has_attr(vcpu, attr);
>>                 break;
>> +       case KVM_ARM_VCPU_SPE_CTRL:
>> +               ret = kvm_arm_spe_has_attr(vcpu, attr);
>> +               break;
>>         default:
>>                 ret = -ENXIO;
>>                 break;
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index f32490229a4c..4dc205fa4be1 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -87,6 +87,9 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>         case KVM_CAP_ARM_PTRAUTH_GENERIC:
>>                 r = system_has_full_ptr_auth();
>>                 break;
>> +       case KVM_CAP_ARM_SPE:
>> +               r = kvm_arm_supports_spe();
>> +               break;
>>         default:
>>                 r = 0;
>>         }
>> @@ -223,6 +226,19 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
>>         return 0;
>>  }
>>
>> +static int kvm_vcpu_enable_spe(struct kvm_vcpu *vcpu)
>> +{
>> +       if (!kvm_arm_supports_spe())
>> +               return -EINVAL;
>> +
>> +       /* SPE is disabled if the PE is in AArch32 state */
>> +       if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
>> +               return -EINVAL;
>> +
>> +       vcpu->arch.flags |= KVM_ARM64_GUEST_HAS_SPE;
>> +       return 0;
>> +}
>> +
>>  /**
>>   * kvm_reset_vcpu - sets core registers and sys_regs to reset value
>>   * @vcpu: The VCPU pointer
>> @@ -274,6 +290,13 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>>                 }
>>         }
>>
>> +       if (test_bit(KVM_ARM_VCPU_SPE, vcpu->arch.features)) {
>> +               if (kvm_vcpu_enable_spe(vcpu)) {
>> +                       ret = -EINVAL;
>> +                       goto out;
>> +               }
>> +       }
>> +
>>         switch (vcpu->arch.target) {
>>         default:
>>                 if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
>> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
>> new file mode 100644
>> index 000000000000..f91a52cd7cd3
>> --- /dev/null
>> +++ b/arch/arm64/kvm/spe.c
>> @@ -0,0 +1,129 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2019 ARM Ltd.
>> + */
>> +
>> +#include <linux/kvm.h>
>> +#include <linux/kvm_host.h>
>> +#include <linux/uaccess.h>
>> +
>> +#include <kvm/arm_spe.h>
>> +#include <kvm/arm_vgic.h>
>> +
>> +static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
>> +{
>> +       if (!vcpu_has_spe(vcpu))
>> +               return false;
>> +
>> +       if (!irqchip_in_kernel(vcpu->kvm))
>> +               return false;
>> +
> nit: should we move the irqchip_in_kernel() check to the caller?

Yes, definitely, I can move the irqchip_in_kernel() check to the callers because
it's a VM property, not a VCPU property, and use vcpu_has_spe() directly.

Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE
  2020-11-19 16:58   ` James Morse
@ 2020-12-02 16:28     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 16:28 UTC (permalink / raw)
  To: James Morse; +Cc: maz, linux-arm-kernel, Sudeep Holla, will, kvmarm

Hi James,

On 11/19/20 4:58 PM, James Morse wrote:
> Hi Alex,
>
> On 27/10/2020 17:26, Alexandru Elisei wrote:
>> From: Sudeep Holla <sudeep.holla@arm.com>
>>
>> To configure the virtual SPE buffer management interrupt number, we use a
>> VCPU kvm_device ioctl, encapsulating the KVM_ARM_VCPU_SPE_IRQ attribute
>> within the KVM_ARM_VCPU_SPE_CTRL group.
>>
>> After configuring the SPE, userspace is required to call the VCPU ioctl
>> with the attribute KVM_ARM_VCPU_SPE_INIT to initialize SPE on the VCPU.
>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
>> index 2acec3b9ef65..6135b9827fbe 100644
>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>> @@ -161,3 +161,43 @@ Specifies the base address of the stolen time structure for this VCPU. The
>>  base address must be 64 byte aligned and exist within a valid guest memory
>>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>>  including the layout of the stolen time structure.
>> +
>> +4. GROUP: KVM_ARM_VCPU_SPE_CTRL
>> +===============================
>> +
>> +:Architectures: ARM64
>> +
>> +4.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
>> +-----------------------------------
>> +
>> +:Parameters: in kvm_device_attr.addr the address for the SPE buffer management
>> +             interrupt is a pointer to an int
>> +
>> +Returns:
>> +
>> +	 =======  ========================================================
>> +	 -EBUSY   The SPE buffer management interrupt is already set
>> +	 -EINVAL  Invalid SPE overflow interrupt number
>> +	 -EFAULT  Could not read the buffer management interrupt number
>> +	 -ENXIO   SPE not supported or not properly configured
>> +	 =======  ========================================================
>> +
>> +A value describing the SPE (Statistical Profiling Extension) overflow interrupt
>> +number for this vcpu. This interrupt should be a PPI and the interrupt type and
>> +number must be same for each vcpu.
>> +
>> +4.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
>> +------------------------------------
>> +
>> +:Parameters: no additional parameter in kvm_device_attr.addr
>> +
>> +Returns:
>> +
>> +	 =======  ======================================================
>> +	 -EBUSY   SPE already initialized
>> +	 -ENODEV  GIC not initialized
>> +	 -ENXIO   SPE not supported or not properly configured
>> +	 =======  ======================================================
>> +Request the initialization of the SPE. Must be done after initializing the
>> +in-kernel irqchip and after setting the interrupt number for the VCPU.
> Fantastic!
>
>
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index f32490229a4c..4dc205fa4be1 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -87,6 +87,9 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  	case KVM_CAP_ARM_PTRAUTH_GENERIC:
>>  		r = system_has_full_ptr_auth();
>>  		break;
>> +	case KVM_CAP_ARM_SPE:
>> +		r = kvm_arm_supports_spe();
>> +		break;
>>  	default:
>>  		r = 0;
>>  	}
>> @@ -223,6 +226,19 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
>>  	return 0;
>>  }
>>  
>> +static int kvm_vcpu_enable_spe(struct kvm_vcpu *vcpu)
>> +{
>> +	if (!kvm_arm_supports_spe())
>> +		return -EINVAL;
>> +
>> +	/* SPE is disabled if the PE is in AArch32 state */
>> +	if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
>> +		return -EINVAL;
>> +
>> +	vcpu->arch.flags |= KVM_ARM64_GUEST_HAS_SPE;
>> +	return 0;
>> +}
> VCPU-reset promotes the VMM feature into flags. How does this interact with
> kvm_arm_spe_init()?
>
> It doesn't look like this resets any state, couldn't it be done once by kvm_arm_spe_init()?

I need to check here for incompatible features (KVM_ARM_VCPU_EL1_32BIT) or for
SPE not being supported, so that an error code is returned to KVM_ARM_VCPU_INIT
(the ioctl is handled in kvm_arch_vcpu_ioctl_vcpu_init -> kvm_vcpu_set_target).

We need to track the feature per-VCPU in order to refuse finalization if the
feature was not set on all VCPUs. I could use vcpu->arch.features for that, but I
think using vcpu->arch.flags will end up slightly faster.

When KVM SPE is optimized, I will probably add at least one vcpu->arch.flags flag
(something like KVM_ARM64_HOST_USES_SPE) to allow for lazy save/restore of the
host SPE context, so it can be skipped when the flag is not set. I was thinking
that in that case checking vcpu->arch.flags will be cheaper. This is also the
approach that SVE uses, from what I can tell.

For now, I would prefer to keep it as a vcpu flag and make the decision once I
start implementing performance optimizations. But using features is definitely
doable if there are objections to using flags.

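To make it concrete, the kind of check I have in mind for the world-switch path
is something like this (KVM_ARM64_HOST_USES_SPE and the helper below don't exist,
this is only meant to show the shape of the code):

        /* Sketch: skip the host SPE context if the host isn't profiling. */
        if (!(vcpu->arch.flags & KVM_ARM64_HOST_USES_SPE))
                return;
        __spe_save_host_state(vcpu);    /* hypothetical helper */
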
>
>
>>  /**
>>   * kvm_reset_vcpu - sets core registers and sys_regs to reset value
>>   * @vcpu: The VCPU pointer
>> @@ -274,6 +290,13 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>>  		}
>>  	}
>>  
>> +	if (test_bit(KVM_ARM_VCPU_SPE, vcpu->arch.features)) {
>> +		if (kvm_vcpu_enable_spe(vcpu)) {
>> +			ret = -EINVAL;
>> +			goto out;
>> +		}
>> +	}
>> +
>>  	switch (vcpu->arch.target) {
>>  	default:
>>  		if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
>> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
>> new file mode 100644
>> index 000000000000..f91a52cd7cd3
>> --- /dev/null
>> +++ b/arch/arm64/kvm/spe.c
>> @@ -0,0 +1,129 @@
>> +static bool kvm_arm_spe_irq_is_valid(struct kvm *kvm, int irq)
>> +{
>> +	int i;
>> +	struct kvm_vcpu *vcpu;
>> +
>> +	/* The SPE overflow interrupt can be a PPI only */
>> +	if (!irq_is_ppi(irq))
>> +		return false;
>> +
>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>> +		if (!kvm_arm_spe_irq_initialized(vcpu))
>> +			continue;
>> +
>> +		if (vcpu->arch.spe_cpu.irq_num != irq)
>> +			return false;
>> +	}
> Looks like you didn't really want a vcpu property! (huh, patch 10 adds a vm property too)
> We're making this a vcpu property because of the PPI and system registers? (both good reasons)
>
> If the PPI number lived in struct kvm_arch, you'd only only need to check it was
> uninitialised, or the same to get the same behaviour, which would save some of this error
> handling.

The Arm ARM mandates that the SPE interrupt must be a PPI. I think it makes more
sense to have a Private Peripheral Interrupt ID saved as a per-VCPU variable than
as a per-VM one, which is how an SPI would be handled.

>
>
>> +	return true;
>> +}
>> diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
>> index 46ec447ed013..0275e8097529 100644
>> --- a/include/kvm/arm_spe.h
>> +++ b/include/kvm/arm_spe.h
>> @@ -18,11 +18,38 @@ struct kvm_spe_cpu {
>>  	bool initialized; 	/* Feature is initialized on VCPU */
>>  };
>>  
>> +#define kvm_arm_spe_irq_initialized(v)			\
>> +	((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&	\
>> +	 (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)
> Didn't GICv(mumbles) add an additional PPI range? Could this be made irq_is_ppi(), that
> way if the vgic gains support for that, we don't get weird behaviour here?

You're right, the macro reimplements irq_is_ppi(); I'll rewrite it to use
irq_is_ppi() instead.

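Something like this (untested):

        #define kvm_arm_spe_irq_initialized(v)  \
                irq_is_ppi((v)->arch.spe_cpu.irq_num)
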
Thanks,
Alex

* Re: [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort()
  2020-11-05 10:01   ` Haibo Xu
@ 2020-12-02 16:29     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 16:29 UTC (permalink / raw)
  To: Haibo Xu; +Cc: maz, will, kvmarm, linux-arm-kernel

Hi Haibo,

On 11/5/20 10:01 AM, Haibo Xu wrote:
> On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>> user_mem_abort() is already a long and complex function, let's make it
>> slightly easier to understand by abstracting the algorithm for choosing the
>> stage 2 IPA entry size into its own function.
>>
>> This also makes it possible to reuse the code when guest SPE support will
>> be added.
>>
> Better to mention that there is "No functional change"!

That's a good point, I'll add it.

Thanks,

Alex

>
>> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
>> ---
>>  arch/arm64/kvm/mmu.c | 55 ++++++++++++++++++++++++++------------------
>>  1 file changed, 33 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 19aacc7d64de..c3c43555490d 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -738,12 +738,43 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>>         return PAGE_SIZE;
>>  }
>>
>> +static short stage2_max_pageshift(struct kvm_memory_slot *memslot,
>> +                                 struct vm_area_struct *vma, hva_t hva,
>> +                                 bool *force_pte)
>> +{
>> +       short pageshift;
>> +
>> +       *force_pte = false;
>> +
>> +       if (is_vm_hugetlb_page(vma))
>> +               pageshift = huge_page_shift(hstate_vma(vma));
>> +       else
>> +               pageshift = PAGE_SHIFT;
>> +
>> +       if (memslot_is_logging(memslot) || (vma->vm_flags & VM_PFNMAP)) {
>> +               *force_pte = true;
>> +               pageshift = PAGE_SHIFT;
>> +       }
>> +
>> +       if (pageshift == PUD_SHIFT &&
>> +           !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
>> +               pageshift = PMD_SHIFT;
>> +
>> +       if (pageshift == PMD_SHIFT &&
>> +           !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
>> +               *force_pte = true;
>> +               pageshift = PAGE_SHIFT;
>> +       }
>> +
>> +       return pageshift;
>> +}
>> +
>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>                           struct kvm_memory_slot *memslot, unsigned long hva,
>>                           unsigned long fault_status)
>>  {
>>         int ret = 0;
>> -       bool write_fault, writable, force_pte = false;
>> +       bool write_fault, writable, force_pte;
>>         bool exec_fault;
>>         bool device = false;
>>         unsigned long mmu_seq;
>> @@ -776,27 +807,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>                 return -EFAULT;
>>         }
>>
>> -       if (is_vm_hugetlb_page(vma))
>> -               vma_shift = huge_page_shift(hstate_vma(vma));
>> -       else
>> -               vma_shift = PAGE_SHIFT;
>> -
>> -       if (logging_active ||
>> -           (vma->vm_flags & VM_PFNMAP)) {
>> -               force_pte = true;
>> -               vma_shift = PAGE_SHIFT;
>> -       }
>> -
>> -       if (vma_shift == PUD_SHIFT &&
>> -           !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
>> -              vma_shift = PMD_SHIFT;
>> -
>> -       if (vma_shift == PMD_SHIFT &&
>> -           !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
>> -               force_pte = true;
>> -               vma_shift = PAGE_SHIFT;
>> -       }
>> -
>> +       vma_shift = stage2_max_pageshift(memslot, vma, hva, &force_pte);
>>         vma_pagesize = 1UL << vma_shift;
>>         if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
>>                 fault_ipa &= ~(vma_pagesize - 1);
>> --
>> 2.29.1
>>

* Re: [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE
  2020-11-05 10:10   ` Haibo Xu
@ 2020-12-02 16:35     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 16:35 UTC (permalink / raw)
  To: Haibo Xu; +Cc: maz, will, kvmarm, linux-arm-kernel

Hi Haibo,

On 11/5/20 10:10 AM, Haibo Xu wrote:
> On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>> Stage 2 faults triggered by the profiling buffer attempting to write to
>> memory are reported by the SPE hardware by asserting a buffer management
>> event interrupt. Interrupts are by their nature asynchronous, which means
>> that the guest might have changed its stage 1 translation tables since the
>> attempted write. SPE reports the guest virtual address that caused the data
>> abort, but not the IPA, which means that KVM would have to walk the guest's
>> stage 1 tables to find the IPA; using the AT instruction to walk the
>> guest's tables in hardware is not an option because it doesn't report the
>> IPA in the case of a stage 2 fault on a stage 1 table walk.
>>
>> Fix both problems by pre-mapping the guest's memory at stage 2 with write
>> permissions to avoid any faults. Userspace calls mlock() on the VMAs that
>> back the guest's memory, pinning the pages in memory, then tells KVM to map
>> the memory at stage 2 by using the VM control group KVM_ARM_VM_SPE_CTRL
>> with the attribute KVM_ARM_VM_SPE_FINALIZE. KVM will map all writable VMAs
>> which have the VM_LOCKED flag set. Hugetlb VMAs are practically pinned in
>> memory after they are faulted in and mlock() doesn't set the VM_LOCKED
>> flag, and just faults the pages in; KVM will treat hugetlb VMAs like they
>> have the VM_LOCKED flag and will also map them, faulting them in if
>> necessary, when handling the ioctl.
>>
>> VM live migration relies on a bitmap of dirty pages. This bitmap is created
>> by write-protecting a memslot and updating it as KVM handles stage 2 write
>> faults. Because KVM cannot handle stage 2 faults reported by the profiling
>> buffer, it will not pre-map a logging memslot. This effectively means that
>> profiling is not available when the VM is configured for live migration.
>>
>> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
>> ---
>> [..]
> It seems that the below function is used to de-finalize the spe status
> if I get it correctly.
> How about rename the function to some like "kvm_arm_vcpu_init_spe_definalize()"

I don't have a strong opinion about the name and I'll keep your suggestion in mind
for the next iteration. The series is an RFC and the function might not even be
there in the final version.

>
>> +void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
>> +{
>> +       vcpu->kvm->arch.spe.finalized = false;
>> +}
>> +
>>  static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
>>  {
>>         if (!vcpu_has_spe(vcpu))
>> @@ -115,6 +122,50 @@ int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>         return -ENXIO;
>>  }
>>
>> +static int kvm_arm_spe_finalize(struct kvm *kvm)
>> +{
>> +       struct kvm_memory_slot *memslot;
>> +       enum kvm_pgtable_prot prot;
>> +       struct kvm_vcpu *vcpu;
>> +       int i, ret;
>> +
>> +       kvm_for_each_vcpu(i, vcpu, kvm) {
>> +               if (!kvm_arm_spe_vcpu_initialized(vcpu))
>> +                       return -ENXIO;
>> +       }
>> +
>> +       mutex_unlock(&kvm->slots_lock);
> Should be mutex_lock(&kvm->slots_lock);?

Definitely, nicely spotted! That's a typo on my part.

It doesn't affect the test results because kvmtool will call finalize exactly once
after the entire VM has been initialized, so there will be no concurrent accesses
to this function.

Thanks,

Alex


* Re: [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature
  2020-12-02 14:29     ` Alexandru Elisei
@ 2020-12-02 17:23       ` Will Deacon
  2020-12-03 10:07         ` Alexandru Elisei
  0 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2020-12-02 17:23 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, kvmarm

On Wed, Dec 02, 2020 at 02:29:31PM +0000, Alexandru Elisei wrote:
> On 11/19/20 4:58 PM, James Morse wrote:
> > On 27/10/2020 17:26, Alexandru Elisei wrote:
> >> Detect Statistical Profiling Extension (SPE) support using the cpufeatures
> >> framework. The presence of SPE is reported via the ARM64_SPE capability.
> >>
> >> The feature will be necessary for emulating SPE in KVM, because KVM needs
> >> that all CPUs have SPE hardware to avoid scheduling a VCPU on a CPU without
> >> support. For this reason, the feature type ARM64_CPUCAP_SYSTEM_FEATURE has
> >> been selected to disallow hotplugging a CPU which doesn't support SPE.
> > Can you mention the existing driver in the commit message? Surprisingly it doesn't use
> > cpufeature at all. It looks like arm_spe_pmu_dev_init() goes out of its way to support
> > mismatched systems. (otherwise the significance of the new behaviour isn't clear!)
> >
> > I read it as: the host is fine with mismatched systems, and the existing drivers supports
> > this. But KVM is not. After this patch you can't make the system mismatched 'late'.
> 
> That was exactly my intention. Certainly, I'll try to make the commit message
> clearer by mentioning the SPE driver.

Hmm, so are you saying that with this patch applied, a machine where KVM
isn't even being used can no longer late-online CPUs without SPE if the boot
CPUs had it? If so, then I don't think that's acceptable, unfortunately.

As James points out, the current driver is very careful to support
big.LITTLE misconfigurations and I don't see why KVM support should change
that.

Will

* Re: [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature
  2020-12-02 17:23       ` Will Deacon
@ 2020-12-03 10:07         ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-03 10:07 UTC (permalink / raw)
  To: Will Deacon; +Cc: maz, linux-arm-kernel, kvmarm

Hi Will,

On 12/2/20 5:23 PM, Will Deacon wrote:
> On Wed, Dec 02, 2020 at 02:29:31PM +0000, Alexandru Elisei wrote:
>> On 11/19/20 4:58 PM, James Morse wrote:
>>> On 27/10/2020 17:26, Alexandru Elisei wrote:
>>>> Detect Statistical Profiling Extension (SPE) support using the cpufeatures
>>>> framework. The presence of SPE is reported via the ARM64_SPE capability.
>>>>
>>>> The feature will be necessary for emulating SPE in KVM, because KVM needs
>>>> that all CPUs have SPE hardware to avoid scheduling a VCPU on a CPU without
>>>> support. For this reason, the feature type ARM64_CPUCAP_SYSTEM_FEATURE has
>>>> been selected to disallow hotplugging a CPU which doesn't support SPE.
>>> Can you mention the existing driver in the commit message? Surprisingly it doesn't use
>>> cpufeature at all. It looks like arm_spe_pmu_dev_init() goes out of its way to support
>>> mismatched systems. (otherwise the significance of the new behaviour isn't clear!)
>>>
>>> I read it as: the host is fine with mismatched systems, and the existing drivers supports
>>> this. But KVM is not. After this patch you can't make the system mismatched 'late'.
>> That was exactly my intention. Certainly, I'll try to make the commit message
>> clearer by mentioning the SPE driver.
> Hmm, so are you saying that with this patch applied, a machine where KVM
> isn't even being used can no longer late-online CPUs without SPE if the boot
> CPUs had it? If so, then I don't think that's acceptable, unfortunately.

Yes, the idea was to prevent hotplugging CPUs that don't have the capability, so
that the kernel won't schedule an SPE-enabled guest on a CPU which doesn't have SPE.

>
> As James points out, the current driver is very careful to support
> big.LITTLE misconfigurations and I don't see why KVM support should change
> that.

That makes sense, thank you for making it clear from the start that this approach
is not the right one.

There was a previous discussion about supporting KVM SPE on heterogeneous systems
[1]. I chose to use a capability because the focus for this iteration was to
ensure the correctness of the SPE emulation, and the capability looked like the
easiest way to get KVM SPE up and running for testing.

The idea discussed previously [1] was to have userspace configure the VM with a
cpumask representing the CPUs the VM is allowed to run on. KVM then detects when a
VCPU is scheduled on a physical CPU that is not in the cpumask and returns from
KVM_RUN with an error code. That looks like a good solution to me, and it is
generic enough to be used for all sorts of mismatched features. I will try to
implement it in the next iteration, after I get more feedback on the current
series.

[1] https://www.spinics.net/lists/arm-kernel/msg778477.html
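
To illustrate the direction, the check on the KVM_RUN path could be something
roughly like this (supported_cpus is a made-up field name, and how the error is
reported to userspace still needs to be worked out):

        /*
         * Sketch only: refuse to run on a physical CPU outside the mask set
         * by userspace. This would be checked with preemption disabled, in
         * the vcpu run loop.
         */
        if (!cpumask_test_cpu(smp_processor_id(),
                              &vcpu->kvm->arch.supported_cpus))
                return -EPERM;  /* or a dedicated exit reason */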

Thanks,
Alex

* Re: [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE
  2020-11-19 16:59   ` James Morse
@ 2021-03-23 14:27     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2021-03-23 14:27 UTC (permalink / raw)
  To: James Morse; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi James,

Sorry for taking so long to reply to this, I've been busy with other things, but
your comments have been very helpful and have given me a lot to think about. For
the next iteration of the series I've decided to use pin_user_pages() with the
FOLL_LONGTERM flag, similar to how vfio_iommu_type1 does its memory pinning. I
believe this approach matches the semantics of the flag, since SPE practically
functions like a device that can do DMA to guest memory at any time, with the
difference that it uses the CPU translation tables instead of the IOMMU tables.
This will also remove the burden on userspace to call mlock() beforehand, and the
distinction between hugetlbfs and regular pages will disappear from KVM, as both
are handled internally by pin_user_pages().

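As a very rough sketch of what I mean for a single memslot (I'm using the _fast
variant to keep the example short; error handling, the unpin path and the
interaction with memslot changes are all left out, and the helper doesn't exist
yet):

#include <linux/kvm_host.h>
#include <linux/mm.h>

/* Sketch: pin the pages backing a memslot so SPE cannot fault at stage 2. */
static int kvm_spe_pin_memslot(struct kvm_memory_slot *memslot,
                               struct page **pages)
{
        unsigned long hva = memslot->userspace_addr;
        unsigned long nr = memslot->npages;
        int pinned;

        pinned = pin_user_pages_fast(hva, nr, FOLL_WRITE | FOLL_LONGTERM,
                                     pages);
        if (pinned < 0)
                return pinned;
        if (pinned != nr) {
                /* Partial pin, release what we got and report failure. */
                unpin_user_pages(pages, pinned);
                return -EFAULT;
        }

        return 0;
}
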
I will try to answer those comments where I think a more elaborate explanation
will be helpful; regardless of my reply, all comments have been taken on board for
the next iteration.

On 11/19/20 4:59 PM, James Morse wrote:
> Hi Alex,
>
> On 27/10/2020 17:26, Alexandru Elisei wrote:
>> Stage 2 faults triggered by the profiling buffer attempting to write to
>> memory are reported by the SPE hardware by asserting a buffer management
>> event interrupt. Interrupts are by their nature asynchronous, which means
>> that the guest might have changed its stage 1 translation tables since the
>> attempted write. SPE reports the guest virtual address that caused the data
>> abort, but not the IPA, which means that KVM would have to walk the guest's
>> stage 1 tables to find the IPA; using the AT instruction to walk the
>> guest's tables in hardware is not an option because it doesn't report the
>> IPA in the case of a stage 2 fault on a stage 1 table walk.
> Great detailed description, I think a summary helps identify 'both' problems:
> | To work reliably, both the profiling buffer and the page tables to reach it must not
> | fault at stage2.
>
>> Fix both problems by pre-mapping the guest's memory at stage 2 with write
>> permissions to avoid any faults. Userspace calls mlock() on the VMAs that
>> back the guest's memory, pinning the pages in memory, then tells KVM to map
>> the memory at stage 2 by using the VM control group KVM_ARM_VM_SPE_CTRL
>> with the attribute KVM_ARM_VM_SPE_FINALIZE.
> The reason to have this feature is SPE, but is there anything SPE specific in the feature?
>
> I can imagine this being useful on its own if I wanted to reduce guest-exits for
> quasi-real-time reasons, and had memory to burn!
>
> (as an independent feature, it might be useful on other architectures too...)
>
>
> Would it make sense to add this as a flag to KVM_SET_USER_MEMORY_REGION? That is the point
> that the userspace_addr is provided to KVM, this would allow us to fail the call if a
> KVM_MEM_LOCKED memslot can't be created because the underlying VMA aren't VM_LOCKED.
>
> (it also makes it easy to catch incompatible changes of flags in the future)
>
> /me wanders off musing if this can then be combined with VM_PFNMAP in
> kvm_arch_prepare_memory_region()....
>
>
>> KVM will map all writable VMAs which have the VM_LOCKED flag set.
>> Hugetlb VMAs are practically pinned in
>> memory after they are faulted in and mlock() doesn't set the VM_LOCKED
>> flag, and just faults the pages in;
> Ugh. It would be nice to avoid special casing this. KVM shouldn't have to care about the
> difference between a hugetlbfs PMD and a THP PMD.
>
> From mlock_fixup(), it looks like this is because these VMA can't be split.
> Is it possible to change this if mlock() is called for the whole range? (user-space must
> know its hugetlbfs!)
>
> Alternatively, it would good if mm can tell us when a page is locked (and/or special
> cased). That way dax does the right thing too, without having extra special casing in KVM.
> This would also catch VM_PFNMAP if mm knows its effectively the same as VM_LOCKED...
>
>
>> KVM will treat hugetlb VMAs like they
>> have the VM_LOCKED flag and will also map them, faulting them in if
>> necessary, when handling the ioctl.
> Surely user-space should call mlock() to do the faulting in? (and do that before handing
> the memory over to KVM)
>
> Getting KVM to do it will create a loop via the mmu_notifier if this touches a COW page,
> which in turn bumps the sequence counter causing us to bomb out with -EAGAIN.
> (it looks like wp_page_copy() is the only case that calls set_pte_at_notify())
>
>
>> VM live migration relies on a bitmap of dirty pages. This bitmap is created
>> by write-protecting a memslot and updating it as KVM handles stage 2 write
>> faults. Because KVM cannot handle stage 2 faults reported by the profiling
>> buffer, it will not pre-map a logging memslot. This effectively means that
>> profiling is not available when the VM is configured for live migration.
> Yeah ... that sucks. Have any of the Qemu folk said what they'd like to see here?
>
> I can imagine making the logging-enable call fail if any CPU has SPE profiling enabled, as
> the logging will change the results of SPE... We'd then need an exit to user-space to say
> that the vcpu tried to enable SPE while logging was active. Qemu can then decide whether
> to block that vcpu until migration completes, or abort migration.
> But: I've no idea how Qemu manages migration, so it may not be able to do irregular things
> like this.
>
> As a short cut, can we let the arch code fail calls that make problematic changes. (e.g.
> setting KVM_MEM_LOG_DIRTY_PAGES or KVM_MEM_READONLY). It looks like you currently document
> these as silently breaking something else... (an invitation to debug a subtle interaction
> in the future!)

The solution I am considering is to have userspace issue an ioctl to stop guest
SPE before turning on dirty logging. This ioctl will have to specify how KVM
should behave while the guest is profiling. I can see two useful behaviours:

- If guest profiling is enabled or becomes enabled, KVM_RUN will return to
userspace with a description of why it returned.

- If guest profiling is enabled or becomes enabled, KVM will trap the SPE
registers and it will disable profiling when the guest is running.

The two options can be changed while profiling is stopped, and hopefully this
will be enough to let userspace implement whatever policy it wants when migrating
a VM. Obviously, there will also be an ioctl to let KVM know that guest profiling
can be re-enabled.

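For reference, I imagine the userspace interface looking vaguely like this (all
the names and values below are made up, nothing is set in stone):

/* Hypothetical attributes in the KVM_ARM_VM_SPE_CTRL group. */
#define KVM_ARM_VM_SPE_STOP             1
/* How KVM behaves if the guest is profiling, or starts profiling: */
#define   KVM_ARM_VM_SPE_STOP_EXIT      (1 << 0) /* return to userspace */
#define   KVM_ARM_VM_SPE_STOP_TRAP      (1 << 1) /* trap regs, disable profiling */
#define KVM_ARM_VM_SPE_RESUME           2
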
Thanks,

Alex

>
> ~
>
> How does this interact with KSM?
> I can see its try_to_merge_one_page() calling write_protect_page() before testing the
> vm_flags for VM_LOCKED ... so it doesn't look like mlock() stop KSM from doing its work -
> which in turn will cause stage2 faults.
>
> It looks like this is all hinged on VM_MERGEABLE, which can be cleared with an madvise()
> call using MADV_UNMERGEABLE ... but from the man page at least this is to undo a previous
> hint.
>
> I can't find what sets this for regular vma, so presumably its not... see what you think,
> I reckon we need to add "no madvise() MADV_MERGEABLE" to the documentation, and get KVM to
> check the corresponding vma flag when it looks for VM_LOCKED regions.
>
> I think the 'testing flags' is justified, even though we can't enforce they don't change,
> as we can catch a stage2 fault that shouldn't have happened.
>
>
>> diff --git a/Documentation/virt/kvm/devices/vm.rst b/Documentation/virt/kvm/devices/vm.rst
>> index 0aa5b1cfd700..b70798a72d8a 100644
>> --- a/Documentation/virt/kvm/devices/vm.rst
>> +++ b/Documentation/virt/kvm/devices/vm.rst
>> @@ -314,3 +314,31 @@ Allows userspace to query the status of migration mode.
>>  	     if it is enabled
>>  :Returns:   -EFAULT if the given address is not accessible from kernel space;
>>  	    0 in case of success.
>> +
>> +6. GROUP: KVM_ARM_VM_SPE_CTRL
>> +===============================
>> +
>> +:Architectures: arm64
>> +
>> +6.1. ATTRIBUTE: KVM_ARM_VM_SPE_FINALIZE
>> +-----------------------------------------
>> +
>> +Finalizes the creation of the SPE feature by mapping the guest memory in the
>> +stage 2 table. Guest memory must be readable, writable and pinned in RAM, which
>> +is achieved with an mlock() system call;
> (I first read this as mlock() makes memory writeable...)
>
>
>> the memory can be backed by a hugetlbfs
>> +file. Memory regions from read-only or dirty page logging enabled memslots will
>> +be ignored. After the call, no changes to the guest memory,
>> including to its contents, are permitted.
> If guest memory is pinned as writeable, why can't the VMM write to it? Doesn't this
> requirement preclude virtio?
>
> Is 'no messing with the memslots' enforced in any way?
>
>
>> +Subsequent KVM_ARM_VCPU_INIT calls will cause the memory to become unmapped and
>> +the feature must be finalized again before any VCPU can run.
>> +
>> +If any VCPUs are run before finalizing the feature, KVM_RUN will return -EPERM.
>> +
>> +:Parameters: none
>> +:Returns:   -EAGAIN if guest memory has been modified while the call was
>> +            executing
>> +            -EBUSY if the feature is already initialized
>> +            -EFAULT if an address backing the guest memory is invalid
>> +            -ENXIO if SPE is not supported or not properly configured
>> +            0 in case of success
> If we need a one-shot finalise call that sets up stage2, is there any mileage in KVM
> reporting how much memory it pinned to stage2? This is so that the VMM can know it got the
> mmap()/mlock() setup correct? Otherwise we depend on noticing silent failures some time
> later... (I prefer the 'all or nothing' for a memslot though.)
>
>
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index e51d8f328c7e..2d98248f2c66 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -41,6 +41,7 @@
>>  #include <kvm/arm_hypercalls.h>
>>  #include <kvm/arm_pmu.h>
>>  #include <kvm/arm_psci.h>
>> +#include <kvm/arm_spe.h>
>>  
>>  #ifdef REQUIRES_VIRT
>>  __asm__(".arch_extension	virt");
>> @@ -653,6 +654,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>  	if (unlikely(!kvm_vcpu_initialized(vcpu)))
>>  		return -ENOEXEC;
>>  
>> +	if (vcpu_has_spe(vcpu) && unlikely(!kvm_arm_spe_finalized(vcpu->kvm)))
>> +		return -EPERM;
> (does the unlikely() make a difference here?)
>
>
>>  	ret = kvm_vcpu_first_run_init(vcpu);
>>  	if (ret)
>>  		return ret;
>> @@ -982,12 +986,22 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
>>  	 * ensuring that the data side is always coherent. We still
>>  	 * need to invalidate the I-cache though, as FWB does *not*
>>  	 * imply CTR_EL0.DIC.
>> +	 *
>> +	 * If the guest has SPE, we need to unmap the entire address space to
>> +	 * allow for any changes to the VM memory made by userspace to propagate
>> +	 * to the stage 2 tables when SPE is re-finalized;
> This is about the layout of memory (instead of the contents)? Doesn't this get
> synchronised by the mmu_notifier?
>
> This is registered during kvm_create_vm(), and unregistered during kvm_destroy_vm()... so
> it will see any changeseither side of this call...
>
>
> (the existing call is about cleaning the initial state that the VMM re-wrote to the PoC. I
> can't see how SPE or memory pinning fit in here)
>
>
>>          this also makes sure
>> +	 * we keep the userspace and the guest's view of the memory contents
>> +	 * synchronized.
>>  	 */
>>  	if (vcpu->arch.has_run_once) {
>> -		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
>> +		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
>> +		    vcpu_has_spe(vcpu)) {
>>  			stage2_unmap_vm(vcpu->kvm);
>> -		else
>> +			if (vcpu_has_spe(vcpu))
>> +				kvm_arm_spe_notify_vcpu_init(vcpu);
>> +		} else {
>>  			__flush_icache_all();
>> +		}
>>  	}
>>  
>>  	vcpu_reset_hcr(vcpu);
>
> From here...
>
>> @@ -1045,6 +1059,45 @@ static int kvm_arm_vcpu_has_attr(struct kvm_vcpu *vcpu,
>>  	return ret;
>>  }
>>  
>> +static int kvm_arm_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret = -ENXIO;
>> +
>> +	switch (attr->group) {
>> +	default:
>> +		ret = kvm_arm_vm_arch_set_attr(kvm, attr);
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int kvm_arm_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret = -ENXIO;
>> +
>> +	switch (attr->group) {
>> +	default:
>> +		ret = kvm_arm_vm_arch_get_attr(kvm, attr);
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int kvm_arm_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret = -ENXIO;
>> +
>> +	switch (attr->group) {
>> +	default:
>> +		ret = kvm_arm_vm_arch_has_attr(kvm, attr);
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  static int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>>  				   struct kvm_vcpu_events *events)
>>  {
>> @@ -1259,6 +1312,27 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>  
>>  		return 0;
>>  	}
>> +	case KVM_SET_DEVICE_ATTR: {
>> +		struct kvm_device_attr attr;
>> +
>> +		if (copy_from_user(&attr, argp, sizeof(attr)))
>> +			return -EFAULT;
>> +		return kvm_arm_vm_set_attr(kvm, &attr);
>> +	}
>> +	case KVM_GET_DEVICE_ATTR: {
>> +		struct kvm_device_attr attr;
>> +
>> +		if (copy_from_user(&attr, argp, sizeof(attr)))
>> +			return -EFAULT;
>> +		return kvm_arm_vm_get_attr(kvm, &attr);
>> +	}
>> +	case KVM_HAS_DEVICE_ATTR: {
>> +		struct kvm_device_attr attr;
>> +
>> +		if (copy_from_user(&attr, argp, sizeof(attr)))
>> +			return -EFAULT;
>> +		return kvm_arm_vm_has_attr(kvm, &attr);
>> +	}
>>  	default:
>>  		return -EINVAL;
>>  	}
>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>> index 2ba790eeb782..d0dc4bdb8b4a 100644
>> --- a/arch/arm64/kvm/guest.c
>> +++ b/arch/arm64/kvm/guest.c
>> @@ -988,3 +988,51 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>  
>>  	return ret;
>>  }
>> +
>> +int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret;
>> +
>> +	switch (attr->group) {
>> +	case KVM_ARM_VM_SPE_CTRL:
>> +		ret = kvm_arm_vm_spe_set_attr(kvm, attr);
>> +		break;
>> +	default:
>> +		ret = -ENXIO;
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret;
>> +
>> +	switch (attr->group) {
>> +	case KVM_ARM_VM_SPE_CTRL:
>> +		ret = kvm_arm_vm_spe_get_attr(kvm, attr);
>> +		break;
>> +	default:
>> +		ret = -ENXIO;
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret;
>> +
>> +	switch (attr->group) {
>> +	case KVM_ARM_VM_SPE_CTRL:
>> +		ret = kvm_arm_vm_spe_has_attr(kvm, attr);
>> +		break;
>> +	default:
>> +		ret = -ENXIO;
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
> ... to here, is almost entirely boiler plate for supporting 0-or-more vm ioctl. Could this
> be a separate preparatory patch, just so it isn't wrapped up in the SPE/memory-pinning
> specifics?
>
>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c3c43555490d..31b2216a5881 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1365,6 +1365,175 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>  	return ret;
>>  }
>>  
>> +static int stage2_map_vma(struct kvm *kvm,
>> +			  struct kvm_memory_slot *memslot,
>> +			  struct vm_area_struct *vma,
>> +			  enum kvm_pgtable_prot prot,
>> +			  unsigned long mmu_seq, hva_t *hvap,
>> +			  struct kvm_mmu_memory_cache *cache)
>> +{
>> +	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>> +	unsigned long stage2_pagesize, remaining;
>> +	bool force_pte, writable;
>> +	hva_t hva, hva_end;
>> +	kvm_pfn_t pfn;
>> +	gpa_t gpa;
>> +	gfn_t gfn;
>> +	int ret;
>> +
>> +	hva = max(memslot->userspace_addr, vma->vm_start);
>> +	hva_end = min(vma->vm_end, memslot->userspace_addr +
>> +			(memslot->npages << PAGE_SHIFT));
>> +
>> +	gpa = (memslot->base_gfn << PAGE_SHIFT) + hva - memslot->userspace_addr;
>> +	gfn = gpa >> PAGE_SHIFT;
>> +
>> +	stage2_pagesize = 1UL << stage2_max_pageshift(memslot, vma, hva, &force_pte);
>> +
>> +	while (hva < hva_end) {
>> +		ret = kvm_mmu_topup_memory_cache(cache,
>> +						 kvm_mmu_cache_min_pages(kvm));
>> +		if (ret)
>> +			return ret;
>> +
>> +		/*
>> +		 * We start mapping with the highest possible page size, so the
>> +		 * gpa and gfn will always be properly aligned to the current
>> +		 * page size.
>> +		 */
>> +		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL, true, &writable);
> Heh, if this causes the stage1 page tables to be changed, it will invoke the mmu notifier,
> which will cause us to fail with -EAGAIN afterwards. User-space could keep retrying, and
> it would fix a page at a time...
>
> Passing atomic here would stop this, as we don't want to update the stage1 tables. If they
> haven't been setup as needed, then this should fail early, with the finger pointing at
> stage1. This way we don't mask a bug in user-space, and get caught out by 'this used to work'.
>
> (or is this what prevents access-flag faults at stage1?)
>
>
>> +		if (pfn == KVM_PFN_ERR_HWPOISON)
>> +			return -EFAULT;
>> +		if (is_error_noslot_pfn(pfn))
> Doesn't is_error_noslot_pfn() cover KVM_PFN_ERR_HWPOISON?
>
>
>> +			return -EFAULT;
>> +		/* Can only happen if naughty userspace changed the VMA. */
>> +		if (kvm_is_device_pfn(pfn) || !writable)
>> +			return -EAGAIN;
> kvm_release_pfn_(*cough*)() ?
>
> My reading is __gfn_to_pfn_memslot() calls gup, which takes a reference you release (or
> adjust) at the end of the loop.
>
>
>> +		spin_lock(&kvm->mmu_lock);
>> +		if (mmu_notifier_retry(kvm, mmu_seq)) {
>> +			spin_unlock(&kvm->mmu_lock);
>> +			return -EAGAIN;
> (same again)
>
>
>> +		}
>> +
>> +		remaining = hva_end - hva;
>> +		if (stage2_pagesize == PUD_SIZE && remaining < PUD_SIZE)
>> +			stage2_pagesize = PMD_SIZE;
>> +		if (stage2_pagesize == PMD_SIZE && remaining < PMD_SIZE) {
>> +			force_pte = true;
> I had to sleep on this one: You're forced to put down a PTE because of the remaining size
> in the memslot? This is to prevent rolling back up to a THP size if that is what stage1 is
> using?
>
>
>> +			stage2_pagesize = PAGE_SIZE;
>> +		}
>> +
>> +		if (!force_pte && stage2_pagesize == PAGE_SIZE)
>> +			/*
>> +			 * The hva and gpa will always be PMD aligned if
>> +			 * hva is backed by a transparent huge page.
> because you walk through the vma in order... but what about the first page?
>
> What stops me starting my memslot on a 1MB boundary, which is half way through a 2MB THP?
> Doesn't the 'hva=max()' align hva up to the memslot boundary?
>
>
>
>>                          gpa will
>> +			 * not be modified and it's not necessary to recompute
>> +			 * hva.
>> +			 */
>> +			stage2_pagesize = transparent_hugepage_adjust(memslot, hva, &pfn, &gpa);
>> +
>> +		ret = kvm_pgtable_stage2_map(pgt, gpa, stage2_pagesize,
>> +					     __pfn_to_phys(pfn), prot, cache);
>> +		spin_unlock(&kvm->mmu_lock);
>
>> +		kvm_set_pfn_accessed(pfn);
> This leads into mark_page_accessed(), which has:
> |		 * Unevictable pages are on the "LRU_UNEVICTABLE" list. But,
> |		 * this list is never rotated or maintained, so marking an
> |		 * evictable page accessed has no effect.
>
> This is to tell swap 'not yet'? ... Isn't that impossible by this point?
>
>
>
>> +		kvm_release_pfn_dirty(pfn);
>> +		if (ret)
>> +			return ret;
>> +		else if (hva < hva_end)
>> +			cond_resched();
> (we do this even for the last time round the loop as hva hasn't been updated yet)
>
>
>> +		hva += stage2_pagesize;
>> +		gpa += stage2_pagesize;
>> +		gfn = gpa >> PAGE_SHIFT;
>> +	}
>> +
>> +	*hvap = hva;
>> +	return 0;
>> +}
>
>
>> +int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
>> +			   enum kvm_pgtable_prot prot)
>> +{
>> +	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
>> +	struct vm_area_struct *vma;
>> +	unsigned long mmu_seq;
>> +	hva_t hva, hva_memslot_end;
>> +	int ret;
>> +
>> +	lockdep_assert_held(&kvm->slots_lock);
>> +	if (!(prot & KVM_PGTABLE_PROT_R))
>> +		return -EPERM;
>> +	if ((prot & KVM_PGTABLE_PROT_W) && (memslot->flags & KVM_MEM_READONLY))
>> +		return -EPERM;
> This is checking the static value from kvm_arm_spe_finalize()?
>
>
>> +	hva = memslot->userspace_addr;
>> +	hva_memslot_end = memslot->userspace_addr + (memslot->npages << PAGE_SHIFT);
>> +
>> +	/*
>> +	 * Be extra careful here in case userspace is messing with the VMAs
>> +	 * backing the memslot.
>> +	 */
> If we held mmap_read_lock() for the duration, wouldn't that be impossible?
> (and after that point we can scream from the mmu_notifier if a memslot is changed...)
>
>
>> +	mmu_seq = kvm->mmu_notifier_seq;
>> +	smp_rmb();
>> +
>> +	/*
>> +	 * A memslot might span multiple VMAs and any holes between them, while
>> +	 * a VMA might span multiple memslots (see
>> +	 * kvm_arch_prepare_memory_region()). Take the intersection of the VMAs
>> +	 * with the memslot.
>> +	 */
>> +	do {
>> +		mmap_read_lock(current->mm);
>> +		vma = find_vma(current->mm, hva);
>> +		/*
>> +		 * find_vma() returns first VMA with hva < vma->vm_end, which
>> +		 * means that it is possible for the VMA to start *after* the
>> +		 * end of the memslot.
>> +		 */
>> +		if (!vma || vma->vm_start >= hva_memslot_end) {
>> +			mmap_read_unlock(current->mm);
>> +			return 0;
>> +		}
>> +
>> +		/*
>> +		 * VM_LOCKED pages are put in the unevictable LRU list and
>> +		 * hugetlb pages are not put in any LRU list; both will stay
>> +		 * pinned in memory.
>> +		 */
>> +		if (!(vma->vm_flags & VM_LOCKED) && !is_vm_hugetlb_page(vma)) {
>> +			/* Go to next VMA. */
>> +			hva = vma->vm_end;
>> +			mmap_read_unlock(current->mm);
>> +			continue;
>> +		}
>> +		if (!(vma->vm_flags & VM_READ) ||
>> +		    ((prot & KVM_PGTABLE_PROT_W) && !(vma->vm_flags & VM_WRITE))) {
>> +			/* Go to next VMA. */
>> +			hva = vma->vm_end;
>> +			mmap_read_unlock(current->mm);
>> +			continue;
>> +		}
>> +		mmap_read_unlock(current->mm);
> Can't a writer now come in and remove vma?, which you pass to:
>
>> +		ret = stage2_map_vma(kvm, memslot, vma, prot, mmu_seq, &hva, &cache);
> As this only reads from the stage1 entries, I think you may be able to hold a read lock
> for the duration of the loop. (if we tell gup not to write new entries)
>
>
>> +		if (ret)
>> +			return ret;
>> +	} while (hva < hva_memslot_end);
>> +
>> +	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB)) {
>> +		spin_lock(&kvm->mmu_lock);
>> +		stage2_flush_memslot(kvm, memslot);
>> +		spin_unlock(&kvm->mmu_lock);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +
>>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>>  {
>>  }
>> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
>> index f91a52cd7cd3..316ff8dfed5b 100644
>> --- a/arch/arm64/kvm/spe.c
>> +++ b/arch/arm64/kvm/spe.c
>> @@ -10,6 +10,13 @@
>>  #include <kvm/arm_spe.h>
>>  #include <kvm/arm_vgic.h>
>>  
>> +#include <asm/kvm_mmu.h>
>> +
>> +void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
>> +{
>> +	vcpu->kvm->arch.spe.finalized = false;
>> +}
>> +
>>  static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
>>  {
>>  	if (!vcpu_has_spe(vcpu))
>> @@ -115,6 +122,50 @@ int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>  	return -ENXIO;
>>  }
>>  
>> +static int kvm_arm_spe_finalize(struct kvm *kvm)
>> +{
>> +	struct kvm_memory_slot *memslot;
>> +	enum kvm_pgtable_prot prot;
>> +	struct kvm_vcpu *vcpu;
>> +	int i, ret;
>> +
>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>> +		if (!kvm_arm_spe_vcpu_initialized(vcpu))
>> +			return -ENXIO;
>> +	}
>> +
>> +	mutex_unlock(&kvm->slots_lock);
> Typo?
>
>
>> +	if (kvm_arm_spe_finalized(kvm)) {
>> +		mutex_unlock(&kvm->slots_lock);
>> +		return -EBUSY;
>> +	}
>> +
>> +	prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
>> +	kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
>> +		/* Only map memory that SPE can write to. */
>> +		if (memslot->flags & KVM_MEM_READONLY)
>> +			continue;
>> +		 /*
>> +		  * Dirty page logging will write-protect pages, which breaks
>> +		  * SPE.
>> +		  */
>> +		if (memslot->dirty_bitmap)
>> +			continue;
> This silently skips regions that set KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE, which should be
> harmless until KVM_CLEAR_DIRTY_LOG clears the bitmap bits, and makes them write-protect
> (the runtime update ends in kvm_mmu_write_protect_pt_masked())
>
> It's the silent bit that bothers me. If this were done as a memslot flag, we could tell
> the VMM whether its the mm flags on the vma we can't cope with, or the KVM flag on the
> memslot.
>
>
>> +		ret = kvm_map_locked_memslot(kvm, memslot, prot);
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	if (!ret)
>> +		kvm->arch.spe.finalized = true;
>> +	mutex_unlock(&kvm->slots_lock);
>> +
>> +	if (ret)
>> +		stage2_unmap_vm(kvm);
> We haven't put in any invalid mappings, is this needed?
>
>
>
>> +
>> +	return ret;
>> +}
>
> I think separating the boiler plate, and SPE bits from the stage2/mm code would make this
> patch simpler.
>
>
> Thanks,
>
> James

end of thread, other threads:[~2021-03-23 14:28 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it Alexandru Elisei
2020-11-19 16:58   ` James Morse
2020-12-02 14:25     ` Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 02/16] dt-bindings: ARM SPE: Highlight the need for PPI partitions on heterogeneous systems Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 03/16] KVM: arm64: Hide SPE from guests Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature Alexandru Elisei
2020-11-19 16:58   ` James Morse
2020-12-02 14:29     ` Alexandru Elisei
2020-12-02 17:23       ` Will Deacon
2020-12-03 10:07         ` Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 05/16] KVM: arm64: Introduce VCPU " Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives Alexandru Elisei
2020-11-19 16:58   ` James Morse
2020-12-02 15:13     ` Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 07/16] KVM: arm64: Define SPE data structure for each VCPU Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
2020-11-05  9:58   ` Haibo Xu
2020-12-02 15:20     ` Alexandru Elisei
2020-11-19 16:58   ` James Morse
2020-12-02 16:28     ` Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort() Alexandru Elisei
2020-11-05 10:01   ` Haibo Xu
2020-12-02 16:29     ` Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE Alexandru Elisei
2020-11-05 10:10   ` Haibo Xu
2020-12-02 16:35     ` Alexandru Elisei
2020-11-19 16:59   ` James Morse
2021-03-23 14:27     ` Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 11/16] KVM: arm64: Add SPE system registers to VCPU context Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 12/16] KVM: arm64: VHE: Clear MDCR_EL2.E2PB in vcpu_put() Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 13/16] KVM: arm64: Switch SPE context on VM entry/exit Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 14/16] KVM: arm64: Emulate SPE buffer management interrupt Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 15/16] KVM: arm64: Enable SPE for guests Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 16/16] Documentation: arm64: Document ARM Neoverse-N1 erratum #1688567 Alexandru Elisei
