kvmarm.lists.cs.columbia.edu archive mirror
* [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support
@ 2020-10-27 17:26 Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it Alexandru Elisei
                   ` (15 more replies)
  0 siblings, 16 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

Statistical Profiling Extension (SPE) is an optional feature added in
ARMv8.2. It allows sampling at regular intervals of the operations executed
by the PE and storing a record of each operation in a memory buffer. A high
level overview of the extension is presented in an article on arm.com [1].

This series implements SPE support for KVM guests. The series is based on
v5.10-rc1 and has been almost completely rewritten, but I've tried to keep
some patches from v2 [2] and from the initial version of the series [3].
The series can also be found in a repo [4] to make testing easier.

This series is firmly in RFC territory for several reasons:

* It introduces a userspace API to pre-map guest memory at stage 2, which
  I think deserves some discussion before we commit to it.

* The way I'm handling the SPE interrupt is completely different from what
  was implemented in v2.

* The SPE state save/restore code unconditionally saves the host SPE state
  on VM entry and restores it on VM exit, regardless of whether the host is
  actually profiling. I plan to improve this in future iterations.

I am also interested to know why the SPE header lives in
include/kvm/arm_spe.h instead of arch/arm64/include/asm/kvm_spe.h. My guess
is that the headers there are for code that was shared with KVM arm. Since
KVM arm was removed, I would like to move the header to arch/arm64, but I
wanted to make sure that is acceptable.

The profiling buffer
====================

KVM cannot handle SPE stage 2 faults, so the guest memory must be
memory-resident and mapped at stage 2 for the entire lifetime of the guest.
More details are in patch #10 ("KVM: arm64: Add a new VM device control
group for SPE").

This is achieved with the help of userspace in two stages:

1. Userspace calls mlock() on the VMAs that represent the guest memory.

2. After userspace has copied everything to the guest memory, it uses the
   KVM_ARM_VM_SPE_CTRL(KVM_ARM_VM_SPE_FINALIZE) ioctl to tell KVM to map
   all VM_LOCKED and VM_HUGETLB VMAs at stage 2 (patch #10 explains why
   VM_HUGETLB VMAs are also mapped); a sketch of the sequence is below.
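
To make the flow concrete, here is a minimal sketch of the userspace side
(error handling omitted; guest_ram and ram_size are placeholders for the
VMM's guest RAM mapping, and the KVM_ARM_VM_SPE_* constants are the ones
added by this series):

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Sketch: pin the guest RAM, populate it, then ask KVM to pre-map it. */
static int spe_finalize_vm(int vm_fd, void *guest_ram, size_t ram_size)
{
	struct kvm_device_attr attr = {
		.group	= KVM_ARM_VM_SPE_CTRL,		/* added by this series */
		.attr	= KVM_ARM_VM_SPE_FINALIZE,	/* added by this series */
	};

	/* 1. Lock the VMAs backing the guest memory (sets VM_LOCKED). */
	if (mlock(guest_ram, ram_size))
		return -1;

	/* ... load the kernel/initrd/firmware into guest memory here ... */

	/* 2. Ask KVM to map all VM_LOCKED and VM_HUGETLB VMAs at stage 2. */
	return ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &attr);
}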

I have added support for SPE to kvmtool; the patches are on the mailing
list [5], as well as in a repo [6] for easy testing.

There are some things that I'm not 100% sure about and I would like to get
some feedback before we commit to an ABI:

* At the moment, having SPE enabled for a guest forces unmapping of the
  guest memory when the VCPU is reset. This is done to make sure the
  dcaches are cleaned to POC when the VM starts. It isn't necessary when
  the system has FWB, but I decided to unmap the guest memory even in this
  case for two reasons:

  1. Userspace doesn't know when FWB is available, and thus whether the
     finalize call is necessary.

  2. I haven't seen anywhere in the documentation a statement regarding
     changing memslots while the VM is in the process of resetting, so I am
     assuming it's not forbidden (please correct me if I'm wrong).

If it's forbidden to change memslots when resetting the VM, then we could
add an extension or something similar that tells userspace whether a
finalize call is required after a VM reset.

* Instead of a SPE control group we could have a KVM_ARM_VM_FINALIZE ioctl
  on the vm fd, similar to KVM_ARM_VCPU_FINALIZE. I don't have a strong
  preference for either; the reason for the current implementation is that
  I hadn't thought about KVM_ARM_VM_FINALIZE until the series was almost
  finished.

The buffer interrupt
====================

Also referred to in the Arm ARM as the Profiling Buffer management
interrupt. The guest SPE interrupt handling has been completely reworked
and is now handled by checking the service bit in the PMBSR_EL1 register on
every switch to the host; the implementation is in patch #14 ("KVM: arm64:
Emulate SPE buffer management event interrupt").
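
To make the approach concrete, below is a rough sketch of the host-side
check (this is not the code from patch #14; it assumes the guest PMBSR_EL1
value has already been saved in the VCPU sysreg context under a PMBSR_EL1
index added by this series, and it reuses the kvm_spe_cpu fields and
kvm_vgic_inject_irq()):

/*
 * Sketch only: after a switch to the host, look at the guest's saved
 * PMBSR_EL1 and, if the service bit is set, assert the virtual SPE PPI
 * that userspace configured for this VCPU.
 */
static void kvm_spe_sync_buffer_irq(struct kvm_vcpu *vcpu)
{
	u64 pmbsr = __vcpu_sys_reg(vcpu, PMBSR_EL1);

	if (!(pmbsr & BIT(SYS_PMBSR_EL1_S_SHIFT)))
		return;

	/* The interrupt stays asserted until the guest clears PMBSR_EL1.S. */
	if (vcpu->arch.spe_cpu.irq_level)
		return;

	vcpu->arch.spe_cpu.irq_level = true;
	kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id,
			    vcpu->arch.spe_cpu.irq_num, true,
			    &vcpu->arch.spe_cpu);
}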

Another option that I considered was to change the host irq handler for the
SPE interrupt to check kvm_get_running_vcpu() and defer the handling of the
interrupt to the KVM code. There are a few reasons I decided against it:

* We need to keep the PMBSR_EL1.S bit set until KVM enables interrupts,
  which means that the host won't be able to profile KVM between
  vcpu_load()/vcpu_put().

* Software can trigger the interrupt with a write to the PMBSR_EL1 register
  that sets the service bit. This means that the KVM irq handler won't be
  able to distinguish between the guest configuring PMBSR_EL1 to report a
  stage 2 fault, which is harmless for the host, and the hardware reporting
  it, which can indicate a bug. Even more serious, KVM won't be able to
  distinguish between a PMBSR_EL1 value indicating an External Abort written
  by the guest (again, harmless) and one reported by the hardware, which is
  pretty serious.

This is what the architecture says about SPE external aborts, on page
D9-2806:

"A write to the Profiling Buffer might generate an external abort,
including an external abort on a translation table walk or translation
table update. It is an IMPLEMENTATION DEFINED choice whether such an
external abort:
* Is reported to the Statistical Profiling Extension and treated as a
  Profiling Buffer management event.
* Generates an SError interrupt exception."

I decided to treat the SPE external abort like an SError and panic.
However, I'm not 100% sure that's the right thing to do because the SPE
driver never checks the PMBSR_EL1.EA bit.

There is an argument to be made against my approach to handling the buffer
interrupt: it requires KVM to trap accesses to the buffer registers and to
read one extra register, PMBSR_EL1, when switching to the host. I believe
this overhead is minimal because writes to the buffer registers are rare;
they happen when an event is installed or stopped.

Note that in both cases the guest SPE interrupt is purely virtual and has
to be deactivated by KVM when the guest clears the PMBSR_EL1.S bit. This
means trapping accesses to the buffer registers while the interrupt is
asserted, even in the case where the host SPE driver's irq handler handles
the interrupt triggered by the guest.

Context switching SPE registers
===============================

As mentioned earlier, this is done on every world switch under the
assumption that the host is using SPE at the same time as the guest, which
obviously will not always be the case.

I plan to improve this in future iterations by doing the context switch
on vcpu_load()/vcpu_put() when the host is not profiling. The challenge
will be detecting when the host is profiling. That can be detected in
vcpu_load(), but according to my understanding of perf, a new event can be
installed on the CPU via an IPI. In that case the perf driver would have to
notify KVM that it's starting profiling on the core so KVM can save the
guest SPE registers.

In v2 of the patches it was suggested that on nVHE systems the EL2 code
must do the SPE context switch unconditionally [7]. I don't believe that is
necessary because all the registers that SPE uses in the nVHE case are EL1
registers.

Testing
=======

I have written two basic kvm-unit-tests tests for SPE that I used while
developing the series [8]; they can also be found on this branch [9].

For testing, I have used FVP and a Neoverse N1 machine. These are the tests
that I ran:

1. kvm-unit-tests tests

The tests check the basic operation of the SPE buffer and some corner cases
which were hard to trigger with a Linux guest.

2. Check that profiling behaves the same in the guest and in the host

I used this command for testing on an N1 machine:

$ perf record -e arm_spe/ts_enable=1,pa_enable=1,pct_enable=1/ dd if=/dev/zero of=/dev/null count=5000000

then I checked the output of perf report --dump-raw-trace. The command is
not executed simultaneously in the guest and in the host. Results:

* On VHE:
  - guest 538 interrupts, perf.data size 541.190 MiB, 1096 total events.
  - host 536 interrupts, perf.data size 541.190 MiB, 1096 total events.

* Without VHE:
  - guest 537 interrupts, perf.data size 539.997 MiB, 1091 total events.
  - host 535 interrupts, perf.data size 539.986 MiB, 1093 total events.

I ran the tests multiple times and there were very minor variations in the
results.

3. Test concurrent profiling in the guest and host, version A

For this test I used the command:

perf record -ae arm_spe/ts_enable=1,pa_enable=1,pct_enable=1/ -- iperf3 -c 127.0.0.1 -t 60

The command is executed concurrently in the guest and the host; at the same
time I run the kvm-unit-tests tests in a loop on the host.

The guest had the same number of CPUs as the host (4). On the host,
perf.data was around 3.5G and the SPE interrupt fired 3100 times. In the
guest, perf.data was around 2.8G and the interrupt fired 2700 times. I
dumped the data with perf report --dump-raw-trace > perf.trace and the
output looked sane to me. My explanation for the difference is that the
timer frequency is the same for the guest and the host, but the guest
spends less time executing on the physical CPU because it's shared with the
host, hence fewer operations in the same amount of time.

4. Test concurrent profiling in the guest and host, version B

For this test I used the command:

$ perf record -e arm_spe/ts_enable=1,pa_enable=1,pct_enable=1/ dd if=/dev/zero of=/dev/null count=50000000

which means 10 times more operations than in test 2. This exhibits a
behavior which I don't fully understand. In the host, I get similar results
(interrupt count, total events) to what I would get if the guest isn't
running, which is expected. But in the guest, I get 50% fewer interrupts
than in the host and the total number of events is lower. I am still
looking into this; it might be something that I don't understand about the
workload.

[1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/statistical-profiling-extension-for-armv8-a
[2] https://www.spinics.net/lists/arm-kernel/msg776228.html
[3] https://lists.cs.columbia.edu/pipermail/kvmarm/2019-February/034887.html
[4] https://gitlab.arm.com/linux-arm/linux-ae/-/tree/kvm-spe-v3
[5] https://lore.kernel.org/kvm/20201027171735.13638-1-alexandru.elisei@arm.com/
[6] https://gitlab.arm.com/linux-arm/kvmtool-ae/-/tree/kvm-spe-v3
[7] https://lore.kernel.org/linux-arm-kernel/2a9c9076588ef1dd36a6a365848cdfe7@kernel.org/
[8] https://lore.kernel.org/kvm/20201027171944.13933-1-alexandru.elisei@arm.com/
[9] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/kvm-spe-v2

Alexandru Elisei (12):
  KVM: arm64: Initialize VCPU mdcr_el2 before loading it
  KVM: arm64: Hide SPE from guests
  arm64: Introduce CPU SPE feature
  KVM: arm64: Introduce VCPU SPE feature
  KVM: arm64: Introduce SPE primitives
  KVM: arm64: Use separate function for the mapping size in
    user_mem_abort()
  KVM: arm64: Add a new VM device control group for SPE
  KVM: arm64: Add SPE system registers to VCPU context
  KVM: arm64: Switch SPE context on VM entry/exit
  KVM: arm64: Emulate SPE buffer management interrupt
  KVM: arm64: Enable SPE for guests
  Documentation: arm64: Document ARM Neoverse-N1 erratum #1688567

Sudeep Holla (4):
  dt-bindings: ARM SPE: Highlight the need for PPI partitions on
    heterogeneous systems
  KVM: arm64: Define SPE data structure for each VCPU
  KVM: arm64: Add a new VCPU device control group for SPE
  KVM: arm64: VHE: Clear MDCR_EL2.E2PB in vcpu_put()

 Documentation/arm64/silicon-errata.rst        |   2 +
 .../devicetree/bindings/arm/spe-pmu.txt       |   5 +-
 Documentation/virt/kvm/devices/vcpu.rst       |  40 +++
 Documentation/virt/kvm/devices/vm.rst         |  28 ++
 arch/arm64/include/asm/cpucaps.h              |   3 +-
 arch/arm64/include/asm/kvm_arm.h              |   1 +
 arch/arm64/include/asm/kvm_host.h             |  30 +-
 arch/arm64/include/asm/kvm_hyp.h              |  28 +-
 arch/arm64/include/asm/kvm_mmu.h              |   2 +
 arch/arm64/include/asm/sysreg.h               |   4 +
 arch/arm64/include/uapi/asm/kvm.h             |   7 +
 arch/arm64/kernel/cpufeature.c                |  24 ++
 arch/arm64/kvm/Kconfig                        |   8 +
 arch/arm64/kvm/Makefile                       |   1 +
 arch/arm64/kvm/arm.c                          |  84 ++++-
 arch/arm64/kvm/debug.c                        | 100 ++++--
 arch/arm64/kvm/guest.c                        |  57 +++
 arch/arm64/kvm/hyp/include/hyp/spe-sr.h       |  38 ++
 arch/arm64/kvm/hyp/include/hyp/switch.h       |   1 -
 arch/arm64/kvm/hyp/nvhe/Makefile              |   1 +
 arch/arm64/kvm/hyp/nvhe/debug-sr.c            |  16 +-
 arch/arm64/kvm/hyp/nvhe/spe-sr.c              | 109 ++++++
 arch/arm64/kvm/hyp/nvhe/switch.c              |  12 +
 arch/arm64/kvm/hyp/vhe/Makefile               |   1 +
 arch/arm64/kvm/hyp/vhe/spe-sr.c               | 139 ++++++++
 arch/arm64/kvm/hyp/vhe/switch.c               |  50 ++-
 arch/arm64/kvm/hyp/vhe/sysreg-sr.c            |   2 +-
 arch/arm64/kvm/mmu.c                          | 224 ++++++++++--
 arch/arm64/kvm/reset.c                        |  23 ++
 arch/arm64/kvm/spe.c                          | 324 ++++++++++++++++++
 arch/arm64/kvm/sys_regs.c                     |  52 +++
 include/kvm/arm_spe.h                         | 104 ++++++
 include/uapi/linux/kvm.h                      |   1 +
 33 files changed, 1454 insertions(+), 67 deletions(-)
 create mode 100644 arch/arm64/kvm/hyp/include/hyp/spe-sr.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/spe-sr.c
 create mode 100644 arch/arm64/kvm/hyp/vhe/spe-sr.c
 create mode 100644 arch/arm64/kvm/spe.c
 create mode 100644 include/kvm/arm_spe.h

-- 
2.29.1


* [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-19 16:58   ` James Morse
  2020-10-27 17:26 ` [RFC PATCH v3 02/16] dt-bindings: ARM SPE: Highlight the need for PPI partitions on heterogeneous systems Alexandru Elisei
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

When a VCPU is created, the kvm_vcpu struct is initialized to zero in
kvm_vm_ioctl_create_vcpu(). On VHE systems, the first time
vcpu.arch.mdcr_el2 is loaded on hardware is in vcpu_load(), before it is
set to a sensible value in kvm_arm_setup_debug() later in the run loop. The
result is that KVM executes for a short time with MDCR_EL2 set to zero.

This is mostly harmless as we don't need to trap debug and SPE register
accesses from EL1 (we're still running in the host at EL2), but we do set
MDCR_EL2.HPMN to 0, which is constrained unpredictable according to ARM DDI
0487F.b, page D13-3620; the required behavior from the hardware in this
case is to reserve an unknown number of registers for EL2 and EL3 exclusive
use.

Initialize mdcr_el2 in kvm_vcpu_first_run_init() to avoid the constrained
unpredictable behavior and to ensure that the MDCR_EL2 register has the
same value after each vcpu_load(), including the first time the VCPU is
run.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h |  1 +
 arch/arm64/kvm/arm.c              |  3 +-
 arch/arm64/kvm/debug.c            | 81 +++++++++++++++++++++----------
 3 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 0aecbab6a7fb..25d326aecded 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -597,6 +597,7 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 
 void kvm_arm_init_debug(void);
+void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_clear_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index f56122eedffc..e51d8f328c7e 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -544,6 +544,8 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 		static_branch_inc(&userspace_irqchip_in_use);
 	}
 
+	kvm_arm_vcpu_init_debug(vcpu);
+
 	ret = kvm_timer_enable(vcpu);
 	if (ret)
 		return ret;
@@ -739,7 +741,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		}
 
 		kvm_arm_setup_debug(vcpu);
-
 		/**************************************************************
 		 * Enter the guest
 		 */
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 7a7e425616b5..22ee448aee2b 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -68,6 +68,59 @@ void kvm_arm_init_debug(void)
 	__this_cpu_write(mdcr_el2, kvm_call_hyp_ret(__kvm_get_mdcr_el2));
 }
 
+/**
+ * kvm_arm_setup_mdcr_el2 - configure vcpu mdcr_el2 value
+ *
+ * @vcpu:	the vcpu pointer
+ * @host_mdcr:  host mdcr_el2 value
+ *
+ * This ensures we will trap access to:
+ *  - Performance monitors (MDCR_EL2_TPM/MDCR_EL2_TPMCR)
+ *  - Debug ROM Address (MDCR_EL2_TDRA)
+ *  - OS related registers (MDCR_EL2_TDOSA)
+ *  - Statistical profiler (MDCR_EL2_TPMS/MDCR_EL2_E2PB)
+ */
+static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu, u32 host_mdcr)
+{
+	bool trap_debug = !(vcpu->arch.flags & KVM_ARM64_DEBUG_DIRTY);
+
+	/*
+	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access
+	 * to the profiling buffer.
+	 */
+	vcpu->arch.mdcr_el2 = host_mdcr & MDCR_EL2_HPMN_MASK;
+	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
+				MDCR_EL2_TPMS |
+				MDCR_EL2_TPMCR |
+				MDCR_EL2_TDRA |
+				MDCR_EL2_TDOSA);
+
+	if (vcpu->guest_debug) {
+		/* Route all software debug exceptions to EL2 */
+		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
+		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW)
+			trap_debug = true;
+	}
+
+	/* Trap debug register access */
+	if (trap_debug)
+		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDA;
+
+	trace_kvm_arm_set_dreg32("MDCR_EL2", vcpu->arch.mdcr_el2);
+}
+
+/**
+ * kvm_arm_vcpu_init_debug - setup vcpu debug traps
+ *
+ * @vcpu:	the vcpu pointer
+ *
+ * Set vcpu initial mdcr_el2 value.
+ */
+void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu)
+{
+	kvm_arm_setup_mdcr_el2(vcpu, this_cpu_read(mdcr_el2));
+}
+
 /**
  * kvm_arm_reset_debug_ptr - reset the debug ptr to point to the vcpu state
  */
@@ -83,12 +136,7 @@ void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu)
  * @vcpu:	the vcpu pointer
  *
  * This is called before each entry into the hypervisor to setup any
- * debug related registers. Currently this just ensures we will trap
- * access to:
- *  - Performance monitors (MDCR_EL2_TPM/MDCR_EL2_TPMCR)
- *  - Debug ROM Address (MDCR_EL2_TDRA)
- *  - OS related registers (MDCR_EL2_TDOSA)
- *  - Statistical profiler (MDCR_EL2_TPMS/MDCR_EL2_E2PB)
+ * debug related registers.
  *
  * Additionally, KVM only traps guest accesses to the debug registers if
  * the guest is not actively using them (see the KVM_ARM64_DEBUG_DIRTY
@@ -100,27 +148,14 @@ void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu)
 
 void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
 {
-	bool trap_debug = !(vcpu->arch.flags & KVM_ARM64_DEBUG_DIRTY);
 	unsigned long mdscr, orig_mdcr_el2 = vcpu->arch.mdcr_el2;
 
 	trace_kvm_arm_setup_debug(vcpu, vcpu->guest_debug);
 
-	/*
-	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access
-	 * to the profiling buffer.
-	 */
-	vcpu->arch.mdcr_el2 = __this_cpu_read(mdcr_el2) & MDCR_EL2_HPMN_MASK;
-	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
-				MDCR_EL2_TPMS |
-				MDCR_EL2_TPMCR |
-				MDCR_EL2_TDRA |
-				MDCR_EL2_TDOSA);
+	kvm_arm_setup_mdcr_el2(vcpu, __this_cpu_read(mdcr_el2));
 
 	/* Is Guest debugging in effect? */
 	if (vcpu->guest_debug) {
-		/* Route all software debug exceptions to EL2 */
-		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
-
 		/* Save guest debug state */
 		save_guest_debug_regs(vcpu);
 
@@ -174,7 +209,6 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
 
 			vcpu->arch.debug_ptr = &vcpu->arch.external_debug_state;
 			vcpu->arch.flags |= KVM_ARM64_DEBUG_DIRTY;
-			trap_debug = true;
 
 			trace_kvm_arm_set_regset("BKPTS", get_num_brps(),
 						&vcpu->arch.debug_ptr->dbg_bcr[0],
@@ -189,10 +223,6 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
 	BUG_ON(!vcpu->guest_debug &&
 		vcpu->arch.debug_ptr != &vcpu->arch.vcpu_debug_state);
 
-	/* Trap debug register access */
-	if (trap_debug)
-		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDA;
-
 	/* If KDE or MDE are set, perform a full save/restore cycle. */
 	if (vcpu_read_sys_reg(vcpu, MDSCR_EL1) & (DBG_MDSCR_KDE | DBG_MDSCR_MDE))
 		vcpu->arch.flags |= KVM_ARM64_DEBUG_DIRTY;
@@ -201,7 +231,6 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
 	if (has_vhe() && orig_mdcr_el2 != vcpu->arch.mdcr_el2)
 		write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
 
-	trace_kvm_arm_set_dreg32("MDCR_EL2", vcpu->arch.mdcr_el2);
 	trace_kvm_arm_set_dreg32("MDSCR_EL1", vcpu_read_sys_reg(vcpu, MDSCR_EL1));
 }
 
-- 
2.29.1


* [RFC PATCH v3 02/16] dt-bindings: ARM SPE: Highlight the need for PPI partitions on heterogeneous systems
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 03/16] KVM: arm64: Hide SPE from guests Alexandru Elisei
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, Sudeep Holla, Andrew Murray, will

From: Sudeep Holla <sudeep.holla@arm.com>

It's not entirely clear from the binding document that the only way to
express ARM SPE affinity to a subset of CPUs on a heterogeneous system is
through the use of PPI partitions available in the interrupt controller
bindings.

Let's make it clear.

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/devicetree/bindings/arm/spe-pmu.txt | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/arm/spe-pmu.txt b/Documentation/devicetree/bindings/arm/spe-pmu.txt
index 93372f2a7df9..4f4815800f6e 100644
--- a/Documentation/devicetree/bindings/arm/spe-pmu.txt
+++ b/Documentation/devicetree/bindings/arm/spe-pmu.txt
@@ -9,8 +9,9 @@ performance sample data using an in-memory trace buffer.
 	       "arm,statistical-profiling-extension-v1"
 
 - interrupts : Exactly 1 PPI must be listed. For heterogeneous systems where
-               SPE is only supported on a subset of the CPUs, please consult
-	       the arm,gic-v3 binding for details on describing a PPI partition.
+               SPE is only supported on a subset of the CPUs, a PPI partition
+	       described in the arm,gic-v3 binding must be used to describe
+	       the set of CPUs this interrupt is affine to.
 
 ** Example:
 
-- 
2.29.1


* [RFC PATCH v3 03/16] KVM: arm64: Hide SPE from guests
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 02/16] dt-bindings: ARM SPE: Highlight the need for PPI partitions on heterogeneous systems Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature Alexandru Elisei
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

When SPE is not implemented, accesses to the SPE registers cause an
undefined exception. KVM advertises the presence of SPE in the
ID_AA64DFR0_EL1 register, but configures MDCR_EL2 to trap accesses to the
registers and injects an undefined exception when that happens.

The architecture doesn't allow trapping access to the PMBIDR_EL1 register,
which means the guest will be able to read it even if SPE is not advertised
in the ID register. However, since it's usually better for a read to
unexpectedly succeed than to cause an exception, let's stop advertising the
presence of SPE to guests to better match how KVM emulates the
architecture.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/sys_regs.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index d9117bc56237..aa776c006a2a 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -244,6 +244,12 @@ static bool access_vm_reg(struct kvm_vcpu *vcpu,
 	return true;
 }
 
+static unsigned int spe_visibility(const struct kvm_vcpu *vcpu,
+				   const struct sys_reg_desc *r)
+{
+	return REG_HIDDEN_GUEST | REG_HIDDEN_USER;
+}
+
 static bool access_actlr(struct kvm_vcpu *vcpu,
 			 struct sys_reg_params *p,
 			 const struct sys_reg_desc *r)
@@ -1143,6 +1149,8 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
 		val = cpuid_feature_cap_perfmon_field(val,
 						ID_AA64DFR0_PMUVER_SHIFT,
 						ID_AA64DFR0_PMUVER_8_1);
+		/* Don't advertise SPE to guests */
+		val &= ~(0xfUL << ID_AA64DFR0_PMSVER_SHIFT);
 	} else if (id == SYS_ID_DFR0_EL1) {
 		/* Limit guests to PMUv3 for ARMv8.1 */
 		val = cpuid_feature_cap_perfmon_field(val,
@@ -1590,6 +1598,17 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
 	{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },
 
+	{ SYS_DESC(SYS_PMSCR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSICR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSIRR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSFCR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSEVFR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSLATFR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSIDR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBLIMITR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBPTR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBSR_EL1), .visibility = spe_visibility },
+
 	{ SYS_DESC(SYS_PMINTENSET_EL1), access_pminten, reset_unknown, PMINTENSET_EL1 },
 	{ SYS_DESC(SYS_PMINTENCLR_EL1), access_pminten, reset_unknown, PMINTENSET_EL1 },
 
-- 
2.29.1


* [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (2 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 03/16] KVM: arm64: Hide SPE from guests Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-19 16:58   ` James Morse
  2020-10-27 17:26 ` [RFC PATCH v3 05/16] KVM: arm64: Introduce VCPU " Alexandru Elisei
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

Detect Statistical Profiling Extension (SPE) support using the cpufeatures
framework. The presence of SPE is reported via the ARM64_SPE capability.

The feature will be necessary for emulating SPE in KVM, because KVM needs
all CPUs to have SPE hardware to avoid scheduling a VCPU on a CPU without
support. For this reason, the feature type ARM64_CPUCAP_SYSTEM_FEATURE has
been selected to disallow hotplugging a CPU which doesn't support SPE.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/cpucaps.h |  3 ++-
 arch/arm64/kernel/cpufeature.c   | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index 42868dbd29fd..10fd094d9a5b 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -65,7 +65,8 @@
 #define ARM64_HAS_ARMv8_4_TTL			55
 #define ARM64_HAS_TLB_RANGE			56
 #define ARM64_MTE				57
+#define ARM64_SPE				58
 
-#define ARM64_NCAPS				58
+#define ARM64_NCAPS				59
 
 #endif /* __ASM_CPUCAPS_H */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index dcc165b3fc04..4a0f4dc53824 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1278,6 +1278,18 @@ has_useable_cnp(const struct arm64_cpu_capabilities *entry, int scope)
 	return has_cpuid_feature(entry, scope);
 }
 
+static bool __maybe_unused
+has_usable_spe(const struct arm64_cpu_capabilities *entry, int scope)
+{
+	u64 pmbidr;
+
+	if (!has_cpuid_feature(entry, scope))
+		return false;
+
+	pmbidr = read_sysreg_s(SYS_PMBIDR_EL1);
+	return !(pmbidr & BIT(SYS_PMBIDR_EL1_P_SHIFT));
+}
+
 /*
  * This check is triggered during the early boot before the cpufeature
  * is initialised. Checking the status on the local CPU allows the boot
@@ -2003,6 +2015,18 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.min_field_value = 1,
 		.cpu_enable = cpu_enable_cnp,
 	},
+#endif
+#ifdef CONFIG_ARM_SPE_PMU
+	{
+		.desc = "Statistical Profiling Extension (SPE)",
+		.capability = ARM64_SPE,
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.matches = has_usable_spe,
+		.sys_reg = SYS_ID_AA64DFR0_EL1,
+		.sign = FTR_UNSIGNED,
+		.field_pos = ID_AA64DFR0_PMSVER_SHIFT,
+		.min_field_value = 1,
+	},
 #endif
 	{
 		.desc = "Speculation barrier (SB)",
-- 
2.29.1


* [RFC PATCH v3 05/16] KVM: arm64: Introduce VCPU SPE feature
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (3 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives Alexandru Elisei
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

Introduce the feature bit, but don't allow userspace to set it yet.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/uapi/asm/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 1c17c3a24411..489e12304dbb 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -106,6 +106,7 @@ struct kvm_regs {
 #define KVM_ARM_VCPU_SVE		4 /* enable SVE for this CPU */
 #define KVM_ARM_VCPU_PTRAUTH_ADDRESS	5 /* VCPU uses address authentication */
 #define KVM_ARM_VCPU_PTRAUTH_GENERIC	6 /* VCPU uses generic authentication */
+#define KVM_ARM_VCPU_SPE		7 /* Enable SPE for this CPU */
 
 struct kvm_vcpu_init {
 	__u32 target;
-- 
2.29.1


* [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (4 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 05/16] KVM: arm64: Introduce VCPU " Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-19 16:58   ` James Morse
  2020-10-27 17:26 ` [RFC PATCH v3 07/16] KVM: arm64: Define SPE data structure for each VCPU Alexandru Elisei
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

KVM SPE emulation depends on the configuration option KVM_ARM_SPE and on
having hardware SPE support on all CPUs. The host driver must be compiled
in because we need the SPE interrupt to be enabled; it will be used to kick
us out of the guest when the profiling buffer management interrupt is
asserted by the GIC (for example, when the buffer is full).

Add a VCPU flag to inform KVM that the guest has SPE enabled.

It's worth noting that even though the KVM_ARM_SPE config option is gated
by the SPE host driver being compiled-in, we don't actually check that the
driver was loaded successfully when we advertise SPE support for guests.
That's because we can live with the SPE interrupt being disabled. There is
a delay between when the SPE hardware asserts the interrupt and when the
GIC samples the interrupt line and asserts it to the CPU. If the SPE
interrupt is disabled at the GIC level, this delay will be larger, at most
a host timer tick.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h |  9 +++++++++
 arch/arm64/kvm/Kconfig            |  8 ++++++++
 include/kvm/arm_spe.h             | 19 +++++++++++++++++++
 3 files changed, 36 insertions(+)
 create mode 100644 include/kvm/arm_spe.h

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 25d326aecded..43eee197764f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -406,6 +406,7 @@ struct kvm_vcpu_arch {
 #define KVM_ARM64_GUEST_HAS_SVE		(1 << 5) /* SVE exposed to guest */
 #define KVM_ARM64_VCPU_SVE_FINALIZED	(1 << 6) /* SVE config completed */
 #define KVM_ARM64_GUEST_HAS_PTRAUTH	(1 << 7) /* PTRAUTH exposed to guest */
+#define KVM_ARM64_GUEST_HAS_SPE		(1 << 8) /* SPE exposed to guest */
 
 #define vcpu_has_sve(vcpu) (system_supports_sve() && \
 			    ((vcpu)->arch.flags & KVM_ARM64_GUEST_HAS_SVE))
@@ -419,6 +420,14 @@ struct kvm_vcpu_arch {
 #define vcpu_has_ptrauth(vcpu)		false
 #endif
 
+#ifdef CONFIG_KVM_ARM_SPE
+#define vcpu_has_spe(vcpu)						\
+	(cpus_have_final_cap(ARM64_SPE) &&				\
+	 ((vcpu)->arch.flags & KVM_ARM64_GUEST_HAS_SPE))
+#else
+#define vcpu_has_spe(vcpu)		false
+#endif
+
 #define vcpu_gp_regs(v)		(&(v)->arch.ctxt.regs)
 
 /*
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 043756db8f6e..8b35c0b806a7 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -57,6 +57,14 @@ config KVM_ARM_PMU
 	  Adds support for a virtual Performance Monitoring Unit (PMU) in
 	  virtual machines.
 
+config KVM_ARM_SPE
+	bool "Virtual Statistical Profiling Extension (SPE) support"
+	depends on ARM_SPE_PMU
+	default y
+	help
+	  Adds support for a virtual Statistical Profiling Extension (SPE) in
+	  virtual machines.
+
 endif # KVM
 
 endif # VIRTUALIZATION
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
new file mode 100644
index 000000000000..db51ef15bf45
--- /dev/null
+++ b/include/kvm/arm_spe.h
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 ARM Ltd.
+ */
+
+#ifndef __ASM_ARM_KVM_SPE_H
+#define __ASM_ARM_KVM_SPE_H
+
+#ifdef CONFIG_KVM_ARM_SPE
+static inline bool kvm_arm_supports_spe(void)
+{
+	return cpus_have_final_cap(ARM64_SPE);
+}
+
+#else
+#define kvm_arm_supports_spe()	false
+
+#endif /* CONFIG_KVM_ARM_SPE */
+#endif /* __ASM_ARM_KVM_SPE_H */
-- 
2.29.1


* [RFC PATCH v3 07/16] KVM: arm64: Define SPE data structure for each VCPU
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (5 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-10-27 17:26 ` [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, Sudeep Holla, Andrew Murray, will

From: Sudeep Holla <sudeep.holla@arm.com>

Define basic struct for supporting SPE for guest VCPUs.

[Andrew M: Add irq_level, rename irq to irq_num for kvm_spe ]
[Alexandru E: Reworked patch ]

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h | 2 ++
 include/kvm/arm_spe.h             | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 43eee197764f..5b68c06930c6 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -35,6 +35,7 @@
 #include <kvm/arm_vgic.h>
 #include <kvm/arm_arch_timer.h>
 #include <kvm/arm_pmu.h>
+#include <kvm/arm_spe.h>
 
 #define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS
 
@@ -329,6 +330,7 @@ struct kvm_vcpu_arch {
 	struct vgic_cpu vgic_cpu;
 	struct arch_timer_cpu timer_cpu;
 	struct kvm_pmu pmu;
+	struct kvm_spe_cpu spe_cpu;
 
 	/*
 	 * Anything that is not used directly from assembly code goes
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
index db51ef15bf45..46ec447ed013 100644
--- a/include/kvm/arm_spe.h
+++ b/include/kvm/arm_spe.h
@@ -12,8 +12,17 @@ static inline bool kvm_arm_supports_spe(void)
 	return cpus_have_final_cap(ARM64_SPE);
 }
 
+struct kvm_spe_cpu {
+	int irq_num; 		/* Guest visible INTID */
+	bool irq_level; 	/* 'true' if interrupt is asserted to the VGIC */
+	bool initialized; 	/* Feature is initialized on VCPU */
+};
+
 #else
 #define kvm_arm_supports_spe()	false
 
+struct kvm_spe_cpu {
+};
+
 #endif /* CONFIG_KVM_ARM_SPE */
 #endif /* __ASM_ARM_KVM_SPE_H */
-- 
2.29.1


* [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (6 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 07/16] KVM: arm64: Define SPE data structure for each VCPU Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-05  9:58   ` Haibo Xu
  2020-11-19 16:58   ` James Morse
  2020-10-27 17:26 ` [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort() Alexandru Elisei
                   ` (7 subsequent siblings)
  15 siblings, 2 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, Sudeep Holla, will

From: Sudeep Holla <sudeep.holla@arm.com>

To configure the virtual SPE buffer management interrupt number, we use a
VCPU kvm_device ioctl, encapsulating the KVM_ARM_VCPU_SPE_IRQ attribute
within the KVM_ARM_VCPU_SPE_CTRL group.

After configuring the SPE, userspace is required to call the VCPU ioctl
with the attribute KVM_ARM_VCPU_SPE_INIT to initialize SPE on the VCPU.
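
As an illustration, the userspace side could look roughly like the sketch
below (error handling collapsed; it uses the KVM_ARM_VCPU_SPE_* constants
added by this patch and assumes the in-kernel GIC has already been created
and spe_ppi holds the PPI number chosen by the VMM):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: set the SPE buffer management PPI, then initialize the feature. */
static int vcpu_spe_setup(int vcpu_fd, int spe_ppi)
{
	struct kvm_device_attr irq_attr = {
		.group	= KVM_ARM_VCPU_SPE_CTRL,
		.attr	= KVM_ARM_VCPU_SPE_IRQ,
		.addr	= (__u64)(unsigned long)&spe_ppi,
	};
	struct kvm_device_attr init_attr = {
		.group	= KVM_ARM_VCPU_SPE_CTRL,
		.attr	= KVM_ARM_VCPU_SPE_INIT,
	};

	/* Set the interrupt first; KVM_ARM_VCPU_SPE_INIT requires it. */
	if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &irq_attr))
		return -1;

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &init_attr);
}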

[Alexandru E: Fixed compilation errors, don't allow userspace to set the
	VCPU feature, removed unused functions, fixed mismatched
	descriptions, comments and error codes, reworked logic, rebased on
	top of v5.10-rc1]

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vcpu.rst |  40 ++++++++
 arch/arm64/include/uapi/asm/kvm.h       |   3 +
 arch/arm64/kvm/Makefile                 |   1 +
 arch/arm64/kvm/guest.c                  |   9 ++
 arch/arm64/kvm/reset.c                  |  23 +++++
 arch/arm64/kvm/spe.c                    | 129 ++++++++++++++++++++++++
 include/kvm/arm_spe.h                   |  27 +++++
 include/uapi/linux/kvm.h                |   1 +
 8 files changed, 233 insertions(+)
 create mode 100644 arch/arm64/kvm/spe.c

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 2acec3b9ef65..6135b9827fbe 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -161,3 +161,43 @@ Specifies the base address of the stolen time structure for this VCPU. The
 base address must be 64 byte aligned and exist within a valid guest memory
 region. See Documentation/virt/kvm/arm/pvtime.rst for more information
 including the layout of the stolen time structure.
+
+4. GROUP: KVM_ARM_VCPU_SPE_CTRL
+===============================
+
+:Architectures: ARM64
+
+4.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
+-----------------------------------
+
+:Parameters: in kvm_device_attr.addr the address for the SPE buffer management
+             interrupt is a pointer to an int
+
+Returns:
+
+	 =======  ========================================================
+	 -EBUSY   The SPE buffer management interrupt is already set
+	 -EINVAL  Invalid SPE overflow interrupt number
+	 -EFAULT  Could not read the buffer management interrupt number
+	 -ENXIO   SPE not supported or not properly configured
+	 =======  ========================================================
+
+A value describing the SPE (Statistical Profiling Extension) overflow interrupt
+number for this vcpu. This interrupt should be a PPI and the interrupt type and
+number must be the same for each vcpu.
+
+4.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
+------------------------------------
+
+:Parameters: no additional parameter in kvm_device_attr.addr
+
+Returns:
+
+	 =======  ======================================================
+	 -EBUSY   SPE already initialized
+	 -ENODEV  GIC not initialized
+	 -ENXIO   SPE not supported or not properly configured
+	 =======  ======================================================
+
+Request the initialization of the SPE. Must be done after initializing the
+in-kernel irqchip and after setting the interrupt number for the VCPU.
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 489e12304dbb..ca57dfb7abf0 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -360,6 +360,9 @@ struct kvm_vcpu_events {
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
 #define KVM_ARM_VCPU_PVTIME_CTRL	2
 #define   KVM_ARM_VCPU_PVTIME_IPA	0
+#define KVM_ARM_VCPU_SPE_CTRL		3
+#define   KVM_ARM_VCPU_SPE_IRQ		0
+#define   KVM_ARM_VCPU_SPE_INIT		1
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_VCPU2_SHIFT		28
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 1504c81fbf5d..f6e76f64ffbe 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -25,3 +25,4 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
 	 vgic/vgic-its.o vgic/vgic-debug.o
 
 kvm-$(CONFIG_KVM_ARM_PMU)  += pmu-emul.o
+kvm-$(CONFIG_KVM_ARM_SPE)  += spe.o
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index dfb5218137ca..2ba790eeb782 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -926,6 +926,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_set_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_SPE_CTRL:
+		ret = kvm_arm_spe_set_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
@@ -949,6 +952,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_get_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_SPE_CTRL:
+		ret = kvm_arm_spe_get_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
@@ -972,6 +978,9 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_has_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_SPE_CTRL:
+		ret = kvm_arm_spe_has_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index f32490229a4c..4dc205fa4be1 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -87,6 +87,9 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_PTRAUTH_GENERIC:
 		r = system_has_full_ptr_auth();
 		break;
+	case KVM_CAP_ARM_SPE:
+		r = kvm_arm_supports_spe();
+		break;
 	default:
 		r = 0;
 	}
@@ -223,6 +226,19 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int kvm_vcpu_enable_spe(struct kvm_vcpu *vcpu)
+{
+	if (!kvm_arm_supports_spe())
+		return -EINVAL;
+
+	/* SPE is disabled if the PE is in AArch32 state */
+	if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
+		return -EINVAL;
+
+	vcpu->arch.flags |= KVM_ARM64_GUEST_HAS_SPE;
+	return 0;
+}
+
 /**
  * kvm_reset_vcpu - sets core registers and sys_regs to reset value
  * @vcpu: The VCPU pointer
@@ -274,6 +290,13 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	if (test_bit(KVM_ARM_VCPU_SPE, vcpu->arch.features)) {
+		if (kvm_vcpu_enable_spe(vcpu)) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
 	switch (vcpu->arch.target) {
 	default:
 		if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
new file mode 100644
index 000000000000..f91a52cd7cd3
--- /dev/null
+++ b/arch/arm64/kvm/spe.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 ARM Ltd.
+ */
+
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/uaccess.h>
+
+#include <kvm/arm_spe.h>
+#include <kvm/arm_vgic.h>
+
+static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
+{
+	if (!vcpu_has_spe(vcpu))
+		return false;
+
+	if (!irqchip_in_kernel(vcpu->kvm))
+		return false;
+
+	return true;
+}
+
+static int kvm_arm_spe_init(struct kvm_vcpu *vcpu)
+{
+	if (!kvm_arm_spe_irq_initialized(vcpu))
+		return -ENXIO;
+
+	if (!vgic_initialized(vcpu->kvm))
+		return -ENODEV;
+
+	if (kvm_arm_spe_vcpu_initialized(vcpu))
+		return -EBUSY;
+
+	if (kvm_vgic_set_owner(vcpu, vcpu->arch.spe_cpu.irq_num, &vcpu->arch.spe_cpu))
+		return -ENXIO;
+
+	vcpu->arch.spe_cpu.initialized = true;
+
+	return 0;
+}
+
+static bool kvm_arm_spe_irq_is_valid(struct kvm *kvm, int irq)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	/* The SPE overflow interrupt can be a PPI only */
+	if (!irq_is_ppi(irq))
+		return false;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (!kvm_arm_spe_irq_initialized(vcpu))
+			continue;
+
+		if (vcpu->arch.spe_cpu.irq_num != irq)
+			return false;
+	}
+
+	return true;
+}
+
+int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	switch (attr->attr) {
+	case KVM_ARM_VCPU_SPE_IRQ: {
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		int irq;
+
+		if (!kvm_arm_vcpu_supports_spe(vcpu))
+			return -ENXIO;
+
+		if (get_user(irq, uaddr))
+			return -EFAULT;
+
+		if (!kvm_arm_spe_irq_is_valid(vcpu->kvm, irq))
+			return -EINVAL;
+
+		if (kvm_arm_spe_irq_initialized(vcpu))
+			return -EBUSY;
+
+		kvm_debug("Set kvm ARM SPE irq: %d\n", irq);
+		vcpu->arch.spe_cpu.irq_num = irq;
+
+		return 0;
+	}
+	case KVM_ARM_VCPU_SPE_INIT:
+		return kvm_arm_spe_init(vcpu);
+	}
+
+	return -ENXIO;
+}
+
+int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	switch (attr->attr) {
+	case KVM_ARM_VCPU_SPE_IRQ: {
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		int irq;
+
+		if (!kvm_arm_vcpu_supports_spe(vcpu))
+			return -ENXIO;
+
+		if (!kvm_arm_spe_irq_initialized(vcpu))
+			return -ENXIO;
+
+		irq = vcpu->arch.spe_cpu.irq_num;
+		if (put_user(irq, uaddr))
+			return -EFAULT;
+
+		return 0;
+	}
+	}
+
+	return -ENXIO;
+}
+
+int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	switch (attr->attr) {
+	case KVM_ARM_VCPU_SPE_IRQ:
+		fallthrough;
+	case KVM_ARM_VCPU_SPE_INIT:
+		if (kvm_arm_vcpu_supports_spe(vcpu))
+			return 0;
+	}
+
+	return -ENXIO;
+}
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
index 46ec447ed013..0275e8097529 100644
--- a/include/kvm/arm_spe.h
+++ b/include/kvm/arm_spe.h
@@ -18,11 +18,38 @@ struct kvm_spe_cpu {
 	bool initialized; 	/* Feature is initialized on VCPU */
 };
 
+#define kvm_arm_spe_irq_initialized(v)			\
+	((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&	\
+	 (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)
+#define kvm_arm_spe_vcpu_initialized(v)	((v)->arch.spe_cpu.initialized)
+
+int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+
 #else
 #define kvm_arm_supports_spe()	false
 
 struct kvm_spe_cpu {
 };
 
+#define kvm_arm_spe_irq_initialized(v)	false
+#define kvm_arm_spe_vcpu_initialized(v)	false
+
+static inline int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+static inline int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+static inline int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
 #endif /* CONFIG_KVM_ARM_SPE */
 #endif /* __ASM_ARM_KVM_SPE_H */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index ca41220b40b8..96228b823711 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1053,6 +1053,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_X86_USER_SPACE_MSR 188
 #define KVM_CAP_X86_MSR_FILTER 189
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
+#define KVM_CAP_ARM_SPE 191
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.29.1


* [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort()
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (7 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-05 10:01   ` Haibo Xu
  2020-10-27 17:26 ` [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE Alexandru Elisei
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

user_mem_abort() is already a long and complex function, let's make it
slightly easier to understand by abstracting the algorithm for choosing the
stage 2 IPA entry size into its own function.

This also makes it possible to reuse the code when guest SPE support will
be added.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/mmu.c | 55 ++++++++++++++++++++++++++------------------
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 19aacc7d64de..c3c43555490d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -738,12 +738,43 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
 	return PAGE_SIZE;
 }
 
+static short stage2_max_pageshift(struct kvm_memory_slot *memslot,
+				  struct vm_area_struct *vma, hva_t hva,
+				  bool *force_pte)
+{
+	short pageshift;
+
+	*force_pte = false;
+
+	if (is_vm_hugetlb_page(vma))
+		pageshift = huge_page_shift(hstate_vma(vma));
+	else
+		pageshift = PAGE_SHIFT;
+
+	if (memslot_is_logging(memslot) || (vma->vm_flags & VM_PFNMAP)) {
+		*force_pte = true;
+		pageshift = PAGE_SHIFT;
+	}
+
+	if (pageshift == PUD_SHIFT &&
+	    !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
+		pageshift = PMD_SHIFT;
+
+	if (pageshift == PMD_SHIFT &&
+	    !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
+		*force_pte = true;
+		pageshift = PAGE_SHIFT;
+	}
+
+	return pageshift;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
 {
 	int ret = 0;
-	bool write_fault, writable, force_pte = false;
+	bool write_fault, writable, force_pte;
 	bool exec_fault;
 	bool device = false;
 	unsigned long mmu_seq;
@@ -776,27 +807,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		return -EFAULT;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		vma_shift = huge_page_shift(hstate_vma(vma));
-	else
-		vma_shift = PAGE_SHIFT;
-
-	if (logging_active ||
-	    (vma->vm_flags & VM_PFNMAP)) {
-		force_pte = true;
-		vma_shift = PAGE_SHIFT;
-	}
-
-	if (vma_shift == PUD_SHIFT &&
-	    !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
-	       vma_shift = PMD_SHIFT;
-
-	if (vma_shift == PMD_SHIFT &&
-	    !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
-		force_pte = true;
-		vma_shift = PAGE_SHIFT;
-	}
-
+	vma_shift = stage2_max_pageshift(memslot, vma, hva, &force_pte);
 	vma_pagesize = 1UL << vma_shift;
 	if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
 		fault_ipa &= ~(vma_pagesize - 1);
-- 
2.29.1


* [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (8 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort() Alexandru Elisei
@ 2020-10-27 17:26 ` Alexandru Elisei
  2020-11-05 10:10   ` Haibo Xu
  2020-11-19 16:59   ` James Morse
  2020-10-27 17:27 ` [RFC PATCH v3 11/16] KVM: arm64: Add SPE system registers to VCPU context Alexandru Elisei
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:26 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

Stage 2 faults triggered by the profiling buffer attempting to write to
memory are reported by the SPE hardware by asserting a buffer management
event interrupt. Interrupts are by their nature asynchronous, which means
that the guest might have changed its stage 1 translation tables since the
attempted write. SPE reports the guest virtual address that caused the data
abort, but not the IPA, which means that KVM would have to walk the guest's
stage 1 tables to find the IPA; using the AT instruction to walk the
guest's tables in hardware is not an option because it doesn't report the
IPA in the case of a stage 2 fault on a stage 1 table walk.

Fix both problems by pre-mapping the guest's memory at stage 2 with write
permissions to avoid any faults. Userspace calls mlock() on the VMAs that
back the guest's memory, pinning the pages in memory, then tells KVM to map
the memory at stage 2 by using the VM control group KVM_ARM_VM_SPE_CTRL
with the attribute KVM_ARM_VM_SPE_FINALIZE. KVM will map all writable VMAs
which have the VM_LOCKED flag set. Hugetlb VMAs are a special case: mlock()
only faults the pages in and doesn't set VM_LOCKED, but hugetlb pages are
effectively pinned in memory once they have been faulted in. KVM therefore
treats hugetlb VMAs as if VM_LOCKED were set and maps them as well, faulting
the pages in if necessary while handling the ioctl.

VM live migration relies on a bitmap of dirty pages. This bitmap is created
by write-protecting a memslot and updating it as KVM handles stage 2 write
faults. Because KVM cannot handle stage 2 faults reported by the profiling
buffer, it will not pre-map a logging memslot. This effectively means that
profiling is not available when the VM is configured for live migration.
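
To make the call sequence concrete, here is a minimal sketch of how a VMM
could drive the finalize step once it has loaded the guest payload. This is
only a sketch: vm_fd, guest_mem and guest_size are illustrative names, the
KVM_ARM_VM_SPE_* defines come from this series' headers, and error handling
is reduced to returning -1.

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

static int spe_finalize(int vm_fd, void *guest_mem, size_t guest_size)
{
	struct kvm_device_attr attr = {
		.group	= KVM_ARM_VM_SPE_CTRL,
		.attr	= KVM_ARM_VM_SPE_FINALIZE,
	};

	/* Pin the VMAs backing the guest memory; this sets VM_LOCKED. */
	if (mlock(guest_mem, guest_size))
		return -1;

	/* Optional: check that the kernel knows about the attribute. */
	if (ioctl(vm_fd, KVM_HAS_DEVICE_ATTR, &attr))
		return -1;

	/* Ask KVM to map the VM_LOCKED (and hugetlb) VMAs at stage 2. */
	return ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &attr);
}

After this returns 0 the VMM must not touch the guest memory again, as
spelled out in the documentation update below.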

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vm.rst |  28 +++++
 arch/arm64/include/asm/kvm_host.h     |   5 +
 arch/arm64/include/asm/kvm_mmu.h      |   2 +
 arch/arm64/include/uapi/asm/kvm.h     |   3 +
 arch/arm64/kvm/arm.c                  |  78 +++++++++++-
 arch/arm64/kvm/guest.c                |  48 ++++++++
 arch/arm64/kvm/mmu.c                  | 169 ++++++++++++++++++++++++++
 arch/arm64/kvm/spe.c                  |  81 ++++++++++++
 include/kvm/arm_spe.h                 |  36 ++++++
 9 files changed, 448 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vm.rst b/Documentation/virt/kvm/devices/vm.rst
index 0aa5b1cfd700..b70798a72d8a 100644
--- a/Documentation/virt/kvm/devices/vm.rst
+++ b/Documentation/virt/kvm/devices/vm.rst
@@ -314,3 +314,31 @@ Allows userspace to query the status of migration mode.
 	     if it is enabled
 :Returns:   -EFAULT if the given address is not accessible from kernel space;
 	    0 in case of success.
+
+6. GROUP: KVM_ARM_VM_SPE_CTRL
+===============================
+
+:Architectures: arm64
+
+6.1. ATTRIBUTE: KVM_ARM_VM_SPE_FINALIZE
+-----------------------------------------
+
+Finalizes the creation of the SPE feature by mapping the guest memory in the
+stage 2 table. Guest memory must be readable, writable and pinned in RAM, which
+is achieved with an mlock() system call; the memory can be backed by a hugetlbfs
+file. Memory regions belonging to read-only memslots, or to memslots with dirty
+page logging enabled, are ignored. After the call, no changes to the guest
+memory, including to its contents, are permitted.
+
+Subsequent KVM_ARM_VCPU_INIT calls will cause the memory to become unmapped and
+the feature must be finalized again before any VCPU can run.
+
+If any VCPUs are run before finalizing the feature, KVM_RUN will return -EPERM.
+
+:Parameters: none
+:Returns:   -EAGAIN if guest memory has been modified while the call was
+            executing;
+            -EBUSY if the feature is already initialized;
+            -EFAULT if an address backing the guest memory is invalid;
+            -ENXIO if SPE is not supported or not properly configured;
+            0 in case of success
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 5b68c06930c6..27f581750c6e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -92,6 +92,7 @@ struct kvm_s2_mmu {
 
 struct kvm_arch {
 	struct kvm_s2_mmu mmu;
+	struct kvm_spe spe;
 
 	/* VTCR_EL2 value for this VM */
 	u64    vtcr;
@@ -612,6 +613,10 @@ void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_clear_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu);
+int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr);
+int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr);
+int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr);
+
 int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
 			       struct kvm_device_attr *attr);
 int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 331394306cce..bad94662bbed 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -124,6 +124,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu);
 void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 			  phys_addr_t pa, unsigned long size, bool writable);
+int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			   enum kvm_pgtable_prot prot);
 
 int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
 
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index ca57dfb7abf0..8876e564ba56 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -350,6 +350,9 @@ struct kvm_vcpu_events {
 #define   KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES	3
 #define   KVM_DEV_ARM_ITS_CTRL_RESET		4
 
+#define KVM_ARM_VM_SPE_CTRL		0
+#define   KVM_ARM_VM_SPE_FINALIZE	0
+
 /* Device Control API on vcpu fd */
 #define KVM_ARM_VCPU_PMU_V3_CTRL	0
 #define   KVM_ARM_VCPU_PMU_V3_IRQ	0
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e51d8f328c7e..2d98248f2c66 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -41,6 +41,7 @@
 #include <kvm/arm_hypercalls.h>
 #include <kvm/arm_pmu.h>
 #include <kvm/arm_psci.h>
+#include <kvm/arm_spe.h>
 
 #ifdef REQUIRES_VIRT
 __asm__(".arch_extension	virt");
@@ -653,6 +654,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	if (unlikely(!kvm_vcpu_initialized(vcpu)))
 		return -ENOEXEC;
 
+	if (vcpu_has_spe(vcpu) && unlikely(!kvm_arm_spe_finalized(vcpu->kvm)))
+		return -EPERM;
+
 	ret = kvm_vcpu_first_run_init(vcpu);
 	if (ret)
 		return ret;
@@ -982,12 +986,22 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
 	 * ensuring that the data side is always coherent. We still
 	 * need to invalidate the I-cache though, as FWB does *not*
 	 * imply CTR_EL0.DIC.
+	 *
+	 * If the guest has SPE, we need to unmap the entire address space to
+	 * allow for any changes to the VM memory made by userspace to propagate
+	 * to the stage 2 tables when SPE is re-finalized; this also makes sure
+	 * we keep the userspace and the guest's view of the memory contents
+	 * synchronized.
 	 */
 	if (vcpu->arch.has_run_once) {
-		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
+		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
+		    vcpu_has_spe(vcpu)) {
 			stage2_unmap_vm(vcpu->kvm);
-		else
+			if (vcpu_has_spe(vcpu))
+				kvm_arm_spe_notify_vcpu_init(vcpu);
+		} else {
 			__flush_icache_all();
+		}
 	}
 
 	vcpu_reset_hcr(vcpu);
@@ -1045,6 +1059,45 @@ static int kvm_arm_vcpu_has_attr(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
+static int kvm_arm_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret = -ENXIO;
+
+	switch (attr->group) {
+	default:
+		ret = kvm_arm_vm_arch_set_attr(kvm, attr);
+		break;
+	}
+
+	return ret;
+}
+
+static int kvm_arm_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret = -ENXIO;
+
+	switch (attr->group) {
+	default:
+		ret = kvm_arm_vm_arch_get_attr(kvm, attr);
+		break;
+	}
+
+	return ret;
+}
+
+static int kvm_arm_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret = -ENXIO;
+
+	switch (attr->group) {
+	default:
+		ret = kvm_arm_vm_arch_has_attr(kvm, attr);
+		break;
+	}
+
+	return ret;
+}
+
 static int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
 				   struct kvm_vcpu_events *events)
 {
@@ -1259,6 +1312,27 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 		return 0;
 	}
+	case KVM_SET_DEVICE_ATTR: {
+		struct kvm_device_attr attr;
+
+		if (copy_from_user(&attr, argp, sizeof(attr)))
+			return -EFAULT;
+		return kvm_arm_vm_set_attr(kvm, &attr);
+	}
+	case KVM_GET_DEVICE_ATTR: {
+		struct kvm_device_attr attr;
+
+		if (copy_from_user(&attr, argp, sizeof(attr)))
+			return -EFAULT;
+		return kvm_arm_vm_get_attr(kvm, &attr);
+	}
+	case KVM_HAS_DEVICE_ATTR: {
+		struct kvm_device_attr attr;
+
+		if (copy_from_user(&attr, argp, sizeof(attr)))
+			return -EFAULT;
+		return kvm_arm_vm_has_attr(kvm, &attr);
+	}
 	default:
 		return -EINVAL;
 	}
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 2ba790eeb782..d0dc4bdb8b4a 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -988,3 +988,51 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 
 	return ret;
 }
+
+int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret;
+
+	switch (attr->group) {
+	case KVM_ARM_VM_SPE_CTRL:
+		ret = kvm_arm_vm_spe_set_attr(kvm, attr);
+		break;
+	default:
+		ret = -ENXIO;
+		break;
+	}
+
+	return ret;
+}
+
+int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret;
+
+	switch (attr->group) {
+	case KVM_ARM_VM_SPE_CTRL:
+		ret = kvm_arm_vm_spe_get_attr(kvm, attr);
+		break;
+	default:
+		ret = -ENXIO;
+		break;
+	}
+
+	return ret;
+}
+
+int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	int ret;
+
+	switch (attr->group) {
+	case KVM_ARM_VM_SPE_CTRL:
+		ret = kvm_arm_vm_spe_has_attr(kvm, attr);
+		break;
+	default:
+		ret = -ENXIO;
+		break;
+	}
+
+	return ret;
+}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c3c43555490d..31b2216a5881 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1365,6 +1365,175 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	return ret;
 }
 
+static int stage2_map_vma(struct kvm *kvm,
+			  struct kvm_memory_slot *memslot,
+			  struct vm_area_struct *vma,
+			  enum kvm_pgtable_prot prot,
+			  unsigned long mmu_seq, hva_t *hvap,
+			  struct kvm_mmu_memory_cache *cache)
+{
+	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
+	unsigned long stage2_pagesize, remaining;
+	bool force_pte, writable;
+	hva_t hva, hva_end;
+	kvm_pfn_t pfn;
+	gpa_t gpa;
+	gfn_t gfn;
+	int ret;
+
+	hva = max(memslot->userspace_addr, vma->vm_start);
+	hva_end = min(vma->vm_end, memslot->userspace_addr +
+			(memslot->npages << PAGE_SHIFT));
+
+	gpa = (memslot->base_gfn << PAGE_SHIFT) + hva - memslot->userspace_addr;
+	gfn = gpa >> PAGE_SHIFT;
+
+	stage2_pagesize = 1UL << stage2_max_pageshift(memslot, vma, hva, &force_pte);
+
+	while (hva < hva_end) {
+		ret = kvm_mmu_topup_memory_cache(cache,
+						 kvm_mmu_cache_min_pages(kvm));
+		if (ret)
+			return ret;
+
+		/*
+		 * We start mapping with the highest possible page size, so the
+		 * gpa and gfn will always be properly aligned to the current
+		 * page size.
+		 */
+		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL, true, &writable);
+		if (pfn == KVM_PFN_ERR_HWPOISON)
+			return -EFAULT;
+		if (is_error_noslot_pfn(pfn))
+			return -EFAULT;
+		/* Can only happen if naughty userspace changed the VMA. */
+		if (kvm_is_device_pfn(pfn) || !writable)
+			return -EAGAIN;
+
+		spin_lock(&kvm->mmu_lock);
+		if (mmu_notifier_retry(kvm, mmu_seq)) {
+			spin_unlock(&kvm->mmu_lock);
+			return -EAGAIN;
+		}
+
+		remaining = hva_end - hva;
+		if (stage2_pagesize == PUD_SIZE && remaining < PUD_SIZE)
+			stage2_pagesize = PMD_SIZE;
+		if (stage2_pagesize == PMD_SIZE && remaining < PMD_SIZE) {
+			force_pte = true;
+			stage2_pagesize = PAGE_SIZE;
+		}
+
+		if (!force_pte && stage2_pagesize == PAGE_SIZE)
+			/*
+			 * The hva and gpa will always be PMD aligned if
+			 * hva is backed by a transparent huge page. gpa will
+			 * not be modified and it's not necessary to recompute
+			 * hva.
+			 */
+			stage2_pagesize = transparent_hugepage_adjust(memslot, hva, &pfn, &gpa);
+
+		ret = kvm_pgtable_stage2_map(pgt, gpa, stage2_pagesize,
+					     __pfn_to_phys(pfn), prot, cache);
+		spin_unlock(&kvm->mmu_lock);
+
+		kvm_set_pfn_accessed(pfn);
+		kvm_release_pfn_dirty(pfn);
+
+		if (ret)
+			return ret;
+		else if (hva < hva_end)
+			cond_resched();
+
+		hva += stage2_pagesize;
+		gpa += stage2_pagesize;
+		gfn = gpa >> PAGE_SHIFT;
+	}
+
+	*hvap = hva;
+	return 0;
+}
+
+int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			   enum kvm_pgtable_prot prot)
+{
+	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
+	struct vm_area_struct *vma;
+	unsigned long mmu_seq;
+	hva_t hva, hva_memslot_end;
+	int ret;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	if (!(prot & KVM_PGTABLE_PROT_R))
+		return -EPERM;
+	if ((prot & KVM_PGTABLE_PROT_W) && (memslot->flags & KVM_MEM_READONLY))
+		return -EPERM;
+
+	hva = memslot->userspace_addr;
+	hva_memslot_end = memslot->userspace_addr + (memslot->npages << PAGE_SHIFT);
+
+	/*
+	 * Be extra careful here in case userspace is messing with the VMAs
+	 * backing the memslot.
+	 */
+	mmu_seq = kvm->mmu_notifier_seq;
+	smp_rmb();
+
+	/*
+	 * A memslot might span multiple VMAs and any holes between them, while
+	 * a VMA might span multiple memslots (see
+	 * kvm_arch_prepare_memory_region()). Take the intersection of the VMAs
+	 * with the memslot.
+	 */
+	do {
+		mmap_read_lock(current->mm);
+		vma = find_vma(current->mm, hva);
+		/*
+		 * find_vma() returns first VMA with hva < vma->vm_end, which
+		 * means that it is possible for the VMA to start *after* the
+		 * end of the memslot.
+		 */
+		if (!vma || vma->vm_start >= hva_memslot_end) {
+			mmap_read_unlock(current->mm);
+			return 0;
+		}
+
+		/*
+		 * VM_LOCKED pages are put in the unevictable LRU list and
+		 * hugetlb pages are not put in any LRU list; both will stay
+		 * pinned in memory.
+		 */
+		if (!(vma->vm_flags & VM_LOCKED) && !is_vm_hugetlb_page(vma)) {
+			/* Go to next VMA. */
+			hva = vma->vm_end;
+			mmap_read_unlock(current->mm);
+			continue;
+		}
+		if (!(vma->vm_flags & VM_READ) ||
+		    ((prot & KVM_PGTABLE_PROT_W) && !(vma->vm_flags & VM_WRITE))) {
+			/* Go to next VMA. */
+			hva = vma->vm_end;
+			mmap_read_unlock(current->mm);
+			continue;
+		}
+		mmap_read_unlock(current->mm);
+
+		ret = stage2_map_vma(kvm, memslot, vma, prot, mmu_seq, &hva, &cache);
+		if (ret)
+			return ret;
+	} while (hva < hva_memslot_end);
+
+	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB)) {
+		spin_lock(&kvm->mmu_lock);
+		stage2_flush_memslot(kvm, memslot);
+		spin_unlock(&kvm->mmu_lock);
+	}
+
+	return 0;
+}
+
+
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 }
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index f91a52cd7cd3..316ff8dfed5b 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -10,6 +10,13 @@
 #include <kvm/arm_spe.h>
 #include <kvm/arm_vgic.h>
 
+#include <asm/kvm_mmu.h>
+
+void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
+{
+	vcpu->kvm->arch.spe.finalized = false;
+}
+
 static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu_has_spe(vcpu))
@@ -115,6 +122,50 @@ int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 	return -ENXIO;
 }
 
+static int kvm_arm_spe_finalize(struct kvm *kvm)
+{
+	struct kvm_memory_slot *memslot;
+	enum kvm_pgtable_prot prot;
+	struct kvm_vcpu *vcpu;
+	int i, ret = 0;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (!kvm_arm_spe_vcpu_initialized(vcpu))
+			return -ENXIO;
+	}
+
+	mutex_lock(&kvm->slots_lock);
+	if (kvm_arm_spe_finalized(kvm)) {
+		mutex_unlock(&kvm->slots_lock);
+		return -EBUSY;
+	}
+
+	prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
+	kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
+		/* Only map memory that SPE can write to. */
+		if (memslot->flags & KVM_MEM_READONLY)
+			continue;
+		/*
+		 * Dirty page logging will write-protect pages, which breaks
+		 * SPE.
+		 */
+		if (memslot->dirty_bitmap)
+			continue;
+		ret = kvm_map_locked_memslot(kvm, memslot, prot);
+		if (ret)
+			break;
+	}
+
+	if (!ret)
+		kvm->arch.spe.finalized = true;
+	mutex_unlock(&kvm->slots_lock);
+
+	if (ret)
+		stage2_unmap_vm(kvm);
+
+	return ret;
+}
+
 int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
 	switch (attr->attr) {
@@ -127,3 +178,33 @@ int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 
 	return -ENXIO;
 }
+
+int kvm_arm_vm_spe_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	switch (attr->attr) {
+	case KVM_ARM_VM_SPE_FINALIZE:
+		return kvm_arm_spe_finalize(kvm);
+	}
+
+	return -ENXIO;
+}
+
+int kvm_arm_vm_spe_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+int kvm_arm_vm_spe_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	switch (attr->attr) {
+	case KVM_ARM_VM_SPE_FINALIZE:
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			if (kvm_arm_vcpu_supports_spe(vcpu))
+				return 0;
+	}
+
+	return -ENXIO;
+}
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
index 0275e8097529..7f9f3a03aadb 100644
--- a/include/kvm/arm_spe.h
+++ b/include/kvm/arm_spe.h
@@ -18,23 +18,38 @@ struct kvm_spe_cpu {
 	bool initialized; 	/* Feature is initialized on VCPU */
 };
 
+struct kvm_spe {
+	bool finalized;
+};
+
 #define kvm_arm_spe_irq_initialized(v)			\
 	((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&	\
 	 (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)
 #define kvm_arm_spe_vcpu_initialized(v)	((v)->arch.spe_cpu.initialized)
+#define kvm_arm_spe_finalized(k)	((k)->arch.spe.finalized)
 
 int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 
+int kvm_arm_vm_spe_set_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
+int kvm_arm_vm_spe_get_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
+int kvm_arm_vm_spe_has_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
+
+void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu);
+
 #else
 #define kvm_arm_supports_spe()	false
 
 struct kvm_spe_cpu {
 };
 
+struct kvm_spe {
+};
+
 #define kvm_arm_spe_irq_initialized(v)	false
 #define kvm_arm_spe_vcpu_initialized(v)	false
+#define kvm_arm_spe_finalized(k)	false
 
 static inline int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu,
 				       struct kvm_device_attr *attr)
@@ -51,5 +66,26 @@ static inline int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu,
 {
 	return -ENXIO;
 }
+
+static inline int kvm_arm_vm_spe_set_attr(struct kvm *vcpu,
+					  struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static inline int kvm_arm_vm_spe_get_attr(struct kvm *vcpu,
+					  struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static inline int kvm_arm_vm_spe_has_attr(struct kvm *vcpu,
+					  struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static inline void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu) {}
+
 #endif /* CONFIG_KVM_ARM_SPE */
 #endif /* __ASM_ARM_KVM_SPE_H */
-- 
2.29.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 11/16] KVM: arm64: Add SPE system registers to VCPU context
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (9 preceding siblings ...)
  2020-10-27 17:26 ` [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 12/16] KVM: arm64: VHE: Clear MDCR_EL2.E2PB in vcpu_put() Alexandru Elisei
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

Add the SPE system registers to the VCPU context. Omitted are PMBIDR_EL1,
which cannot be trapped, and PMSIDR_EL1, which is a read-only register. The
registers are simply stored in the sys_regs array on a write, and returned
on a read; complete emulation and save/restore on world switch will be
added in a future patch.
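
One consequence worth noting, assuming I'm reading the sys_regs plumbing
correctly (the patch doesn't spell it out): since spe_visibility() no longer
hides the registers for SPE-enabled VCPUs, they should also become accessible
from userspace via KVM_GET_ONE_REG/KVM_SET_ONE_REG. A hypothetical example,
using the PMSCR_EL1 encoding (op0=3, op1=0, CRn=9, CRm=9, op2=0) from
sysreg.h:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int read_guest_pmscr_el1(int vcpu_fd, uint64_t *val)
{
	struct kvm_one_reg reg = {
		.id   = ARM64_SYS_REG(3, 0, 9, 9, 0),
		.addr = (uint64_t)val,
	};

	return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
}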

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h | 11 +++++++
 arch/arm64/kvm/spe.c              | 10 +++++++
 arch/arm64/kvm/sys_regs.c         | 48 ++++++++++++++++++++++++-------
 include/kvm/arm_spe.h             |  9 ++++++
 4 files changed, 68 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 27f581750c6e..bcecc6224c59 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -194,6 +194,17 @@ enum vcpu_sysreg {
 	CNTP_CVAL_EL0,
 	CNTP_CTL_EL0,
 
+	/* Statistical Profiling Extension Registers. */
+	PMSCR_EL1,	/* Statistical Profiling Control Register */
+	PMSICR_EL1,	/* Sampling Interval Counter Register */
+	PMSIRR_EL1,	/* Sampling Interval Reload Register */
+	PMSFCR_EL1,	/* Sampling Filter Control Register */
+	PMSEVFR_EL1,	/* Sampling Event Filter Register */
+	PMSLATFR_EL1,	/* Sampling Latency Filter Register */
+	PMBLIMITR_EL1,	/* Profiling Buffer Limit Address Register */
+	PMBPTR_EL1,	/* Profiling Buffer Write Pointer Register */
+	PMBSR_EL1,	/* Profiling Buffer Status/syndrome Register */
+
 	/* 32bit specific registers. Keep them at the end of the range */
 	DACR32_EL2,	/* Domain Access Control Register */
 	IFSR32_EL2,	/* Instruction Fault Status Register */
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 316ff8dfed5b..0e365a51cac7 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -12,6 +12,16 @@
 
 #include <asm/kvm_mmu.h>
 
+void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
+{
+	__vcpu_sys_reg(vcpu, reg) = val;
+}
+
+u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
+{
+	return __vcpu_sys_reg(vcpu, reg);
+}
+
 void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
 {
 	vcpu->kvm->arch.spe.finalized = false;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index aa776c006a2a..2871484993ec 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -244,9 +244,37 @@ static bool access_vm_reg(struct kvm_vcpu *vcpu,
 	return true;
 }
 
+static bool access_spe_reg(struct kvm_vcpu *vcpu,
+			   struct sys_reg_params *p,
+			   const struct sys_reg_desc *r)
+{
+	u64 val = p->regval;
+	int reg = r->reg;
+	u32 sr = sys_reg((u32)r->Op0, (u32)r->Op1,
+			 (u32)r->CRn, (u32)r->CRm, (u32)r->Op2);
+
+	if (sr == SYS_PMSIDR_EL1) {
+		/* Ignore writes. */
+		if (!p->is_write)
+			p->regval = read_sysreg_s(SYS_PMSIDR_EL1);
+		goto out;
+	}
+
+	if (p->is_write)
+		kvm_arm_spe_write_sysreg(vcpu, reg, val);
+	else
+		p->regval = kvm_arm_spe_read_sysreg(vcpu, reg);
+
+out:
+	return true;
+}
+
 static unsigned int spe_visibility(const struct kvm_vcpu *vcpu,
 				   const struct sys_reg_desc *r)
 {
+	if (vcpu_has_spe(vcpu))
+		return 0;
+
 	return REG_HIDDEN_GUEST | REG_HIDDEN_USER;
 }
 
@@ -1598,16 +1626,16 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
 	{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },
 
-	{ SYS_DESC(SYS_PMSCR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSICR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSIRR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSFCR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSEVFR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSLATFR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMSIDR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMBLIMITR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMBPTR_EL1), .visibility = spe_visibility },
-	{ SYS_DESC(SYS_PMBSR_EL1), .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSCR_EL1), access_spe_reg, reset_val, PMSCR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSICR_EL1), access_spe_reg, reset_val, PMSICR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSIRR_EL1), access_spe_reg, reset_val, PMSIRR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSFCR_EL1), access_spe_reg, reset_val, PMSFCR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSEVFR_EL1), access_spe_reg, reset_val, PMSEVFR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSLATFR_EL1), access_spe_reg, reset_val, PMSLATFR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMSIDR_EL1), access_spe_reg, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBLIMITR_EL1), access_spe_reg, reset_val, PMBLIMITR_EL1, 0, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBPTR_EL1), access_spe_reg, reset_unknown, PMBPTR_EL1, .visibility = spe_visibility },
+	{ SYS_DESC(SYS_PMBSR_EL1), access_spe_reg, reset_val, PMBSR_EL1, 0, .visibility = spe_visibility },
 
 	{ SYS_DESC(SYS_PMINTENSET_EL1), access_pminten, reset_unknown, PMINTENSET_EL1 },
 	{ SYS_DESC(SYS_PMINTENCLR_EL1), access_pminten, reset_unknown, PMINTENSET_EL1 },
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
index 7f9f3a03aadb..a2429edc4483 100644
--- a/include/kvm/arm_spe.h
+++ b/include/kvm/arm_spe.h
@@ -38,6 +38,9 @@ int kvm_arm_vm_spe_has_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
 
 void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu);
 
+void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val);
+u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg);
+
 #else
 #define kvm_arm_supports_spe()	false
 
@@ -87,5 +90,11 @@ static inline int kvm_arm_vm_spe_has_attr(struct kvm *vcpu,
 
 static inline void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu) {}
 
+static inline void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val) {}
+static inline u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
+{
+	return 0;
+}
+
 #endif /* CONFIG_KVM_ARM_SPE */
 #endif /* __ASM_ARM_KVM_SPE_H */
-- 
2.29.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 12/16] KVM: arm64: VHE: Clear MDCR_EL2.E2PB in vcpu_put()
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (10 preceding siblings ...)
  2020-10-27 17:27 ` [RFC PATCH v3 11/16] KVM: arm64: Add SPE system registers to VCPU context Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 13/16] KVM: arm64: Switch SPE context on VM entry/exit Alexandru Elisei
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, Sudeep Holla, will

From: Sudeep Holla <sudeep.holla@arm.com>

On VHE systems, the kernel executes at EL2 and configures the profiling
buffer to use the EL2&0 translation regime and to trap accesses from the
guest by clearing MDCR_EL2.E2PB. In vcpu_put(), KVM includes the E2PB field in
the mask it ANDs MDCR_EL2 with, preserving the field's value. This has been
correct so far, since MDCR_EL2.E2PB has the same value (0b00) for all VMs.

However, this will change when KVM enables support for SPE in guests. For
such guests KVM will configure the profiling buffer to use the EL1&0
translation regime, a setting that is obviously undesirable to be preserved
for the host running at EL2. Let's avoid this situation by explicitly
clearing E2PB in vcpu_put().
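
For reference, my reading of the MDCR_EL2.E2PB encodings involved here (worth
double-checking against the ARM ARM; the enum and its names below are only
for illustration, the code keeps using raw field values):

/* MDCR_EL2.E2PB, bits [13:12] */
enum mdcr_e2pb {
	E2PB_EL2_OWNED_TRAP	= 0,	/* buffer uses the EL2(&0) regime, EL1 accesses trap */
					/* 1 is reserved */
	E2PB_EL1_OWNED_TRAP	= 2,	/* buffer uses the EL1&0 regime, EL1 accesses trap */
	E2PB_EL1_OWNED_NO_TRAP	= 3,	/* buffer uses the EL1&0 regime, no trapping */
};

0b00 is what a VHE host wants for itself, and 0b10 is what this series later
programs for a guest, which is why blindly preserving the field stops being
safe.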

[ Alexandru E: Rebased on top of 5.10-rc1, reworded commit ]

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/hyp/vhe/switch.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index fe69de16dadc..3f4db1fa388b 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -97,9 +97,7 @@ void deactivate_traps_vhe_put(void)
 {
 	u64 mdcr_el2 = read_sysreg(mdcr_el2);
 
-	mdcr_el2 &= MDCR_EL2_HPMN_MASK |
-		    MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT |
-		    MDCR_EL2_TPMS;
+	mdcr_el2 &= MDCR_EL2_HPMN_MASK | MDCR_EL2_TPMS;
 
 	write_sysreg(mdcr_el2, mdcr_el2);
 
-- 
2.29.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 13/16] KVM: arm64: Switch SPE context on VM entry/exit
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (11 preceding siblings ...)
  2020-10-27 17:27 ` [RFC PATCH v3 12/16] KVM: arm64: VHE: Clear MDCR_EL2.E2PB in vcpu_put() Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 14/16] KVM: arm64: Emulate SPE buffer management interrupt Alexandru Elisei
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

When the host and the guest are using SPE at the same time, KVM will have
to save and restore the proper SPE context on VM entry (save the host's,
restore the guest's) and on VM exit (save the guest's, restore the host's).

On systems without VHE, the world switch happens at EL2, while both the
guest and the host execute at EL1, and according to ARM DDI 0487F.b, page
D9-2807, sampling is disabled in this case:

"If the PE takes an exception to an Exception level where the Statistical
Profiling Extension is disabled, no new operations are selected for
sampling."

We still have to disable the buffer before we switch translation regimes
because we don't want the SPE buffer to speculate memory accesses using a
stale buffer pointer.

On VHE systems, the world switch happens at EL2, with the host potentially
in the middle of a profiling session, so we also need to explicitly disable
host sampling.

The buffer owning Exception level is determined by MDCR_EL2.E2PB. On
systems with VHE, this differs between the guest (which executes at EL1) and
the host (which executes at EL2). The current behavior of perf is to profile
KVM until it drops to the guest at EL1. To preserve this behavior as much
as possible, KVM will defer changing the value of MDCR_EL2 until
__{activate,deactivate}_traps().

For the purposes of emulating the SPE buffer management interrupt, MDCR_EL2
is configured to trap accesses to the buffer control registers; the guest
can access the rest of the registers directly.
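
A note on a recurring pattern in the save paths below: the buffer is always
drained and disabled with the same four-step sequence before the translation
regime changes. Pulled out here as a sketch (the helper name is made up, the
patch open-codes the sequence), with the intent of each step spelled out:

static void spe_drain_and_disable_buffer(void)
{
	/* PSB CSYNC: ensure all completed sample records are written out. */
	psb_csync();
	/* Wait for the buffer writes to become visible in memory. */
	dsb(nsh);
	/* Clear PMBLIMITR_EL1.E to disable the profiling buffer. */
	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
	/* The buffer must be disabled before the regime actually changes. */
	isb();
}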

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_arm.h        |   1 +
 arch/arm64/include/asm/kvm_hyp.h        |  28 +++++-
 arch/arm64/include/asm/sysreg.h         |   1 +
 arch/arm64/kvm/debug.c                  |  29 +++++-
 arch/arm64/kvm/hyp/include/hyp/spe-sr.h |  38 ++++++++
 arch/arm64/kvm/hyp/include/hyp/switch.h |   1 -
 arch/arm64/kvm/hyp/nvhe/Makefile        |   1 +
 arch/arm64/kvm/hyp/nvhe/debug-sr.c      |  16 ++-
 arch/arm64/kvm/hyp/nvhe/spe-sr.c        |  93 ++++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/switch.c        |  12 +++
 arch/arm64/kvm/hyp/vhe/Makefile         |   1 +
 arch/arm64/kvm/hyp/vhe/spe-sr.c         | 124 ++++++++++++++++++++++++
 arch/arm64/kvm/hyp/vhe/switch.c         |  48 ++++++++-
 arch/arm64/kvm/hyp/vhe/sysreg-sr.c      |   2 +-
 arch/arm64/kvm/spe.c                    |   3 +
 arch/arm64/kvm/sys_regs.c               |   1 +
 16 files changed, 384 insertions(+), 15 deletions(-)
 create mode 100644 arch/arm64/kvm/hyp/include/hyp/spe-sr.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/spe-sr.c
 create mode 100644 arch/arm64/kvm/hyp/vhe/spe-sr.c

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 64ce29378467..033980a9b3fc 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -280,6 +280,7 @@
 #define MDCR_EL2_TPMS		(1 << 14)
 #define MDCR_EL2_E2PB_MASK	(UL(0x3))
 #define MDCR_EL2_E2PB_SHIFT	(UL(12))
+#define MDCR_EL2_E2PB_EL1_TRAP	(2 << MDCR_EL2_E2PB_SHIFT)
 #define MDCR_EL2_TDRA		(1 << 11)
 #define MDCR_EL2_TDOSA		(1 << 10)
 #define MDCR_EL2_TDA		(1 << 9)
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index 6b664de5ec1f..4358cba6784a 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -79,6 +79,32 @@ void sysreg_save_guest_state_vhe(struct kvm_cpu_context *ctxt);
 void sysreg_restore_guest_state_vhe(struct kvm_cpu_context *ctxt);
 #endif
 
+#ifdef CONFIG_KVM_ARM_SPE
+#ifdef __KVM_NVHE_HYPERVISOR__
+void __sysreg_save_spe_host_state_nvhe(struct kvm_cpu_context *ctxt);
+void __sysreg_restore_spe_host_state_nvhe(struct kvm_cpu_context *ctxt);
+void __sysreg_save_spe_guest_state_nvhe(struct kvm_vcpu *vcpu);
+void __sysreg_restore_spe_guest_state_nvhe(struct kvm_vcpu *vcpu);
+#else
+void sysreg_save_spe_host_state_vhe(struct kvm_cpu_context *ctxt);
+void sysreg_restore_spe_host_state_vhe(struct kvm_cpu_context *ctxt);
+void sysreg_save_spe_guest_state_vhe(struct kvm_vcpu *vcpu);
+void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu);
+#endif
+#else	/* !CONFIG_KVM_ARM_SPE */
+#ifdef __KVM_NVHE_HYPERVISOR__
+static inline void __sysreg_save_spe_host_state_nvhe(struct kvm_cpu_context *ctxt) {}
+static inline void __sysreg_restore_spe_host_state_nvhe(struct kvm_cpu_context *ctxt) {}
+static inline void __sysreg_save_spe_guest_state_nvhe(struct kvm_vcpu *vcpu) {}
+static inline void __sysreg_restore_spe_guest_state_nvhe(struct kvm_vcpu *vcpu) {}
+#else
+static inline void sysreg_save_spe_host_state_vhe(struct kvm_cpu_context *ctxt) {}
+static inline void sysreg_restore_spe_host_state_vhe(struct kvm_cpu_context *ctxt) {}
+static inline void sysreg_save_spe_guest_state_vhe(struct kvm_vcpu *vcpu) {}
+static inline void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu) {}
+#endif
+#endif /* CONFIG_KVM_ARM_SPE */
+
 void __debug_switch_to_guest(struct kvm_vcpu *vcpu);
 void __debug_switch_to_host(struct kvm_vcpu *vcpu);
 
@@ -87,7 +113,7 @@ void __fpsimd_restore_state(struct user_fpsimd_state *fp_regs);
 
 #ifndef __KVM_NVHE_HYPERVISOR__
 void activate_traps_vhe_load(struct kvm_vcpu *vcpu);
-void deactivate_traps_vhe_put(void);
+void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu);
 #endif
 
 u64 __guest_enter(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index d52c1b3ce589..20159af17578 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -255,6 +255,7 @@
 
 /* Sampling controls */
 #define SYS_PMSCR_EL1			sys_reg(3, 0, 9, 9, 0)
+#define SYS_PMSCR_EL12			sys_reg(3, 5, 9, 9, 0)
 #define SYS_PMSCR_EL1_E0SPE_SHIFT	0
 #define SYS_PMSCR_EL1_E1SPE_SHIFT	1
 #define SYS_PMSCR_EL1_CX_SHIFT		3
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 22ee448aee2b..892ce9cc4079 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -84,17 +84,28 @@ static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu, u32 host_mdcr)
 {
 	bool trap_debug = !(vcpu->arch.flags & KVM_ARM64_DEBUG_DIRTY);
 
-	/*
-	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access
-	 * to the profiling buffer.
-	 */
 	vcpu->arch.mdcr_el2 = host_mdcr & MDCR_EL2_HPMN_MASK;
 	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
-				MDCR_EL2_TPMS |
 				MDCR_EL2_TPMCR |
 				MDCR_EL2_TDRA |
 				MDCR_EL2_TDOSA);
 
+	if (vcpu_has_spe(vcpu)) {
+		/*
+		 * Use EL1&0 translation regime, trap accesses to the buffer
+		 * control registers, allow guest direct access to the
+		 * statistical profiling control registers by leaving the TPMS
+		 * bit clear.
+		 */
+		vcpu->arch.mdcr_el2 |= MDCR_EL2_E2PB_EL1_TRAP;
+	} else {
+		/*
+		 * Disable buffer by leaving E2PB zero, trap accesses to all SPE
+		 * registers.
+		 */
+		vcpu->arch.mdcr_el2 |= MDCR_EL2_TPMS;
+	}
+
 	if (vcpu->guest_debug) {
 		/* Route all software debug exceptions to EL2 */
 		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
@@ -227,10 +238,18 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
 	if (vcpu_read_sys_reg(vcpu, MDSCR_EL1) & (DBG_MDSCR_KDE | DBG_MDSCR_MDE))
 		vcpu->arch.flags |= KVM_ARM64_DEBUG_DIRTY;
 
+	/*
+	 * On VHE systems, when the guest has SPE, MDCR_EL2 write is deferred
+	 * until __activate_traps().
+	 */
+	if (has_vhe() && vcpu_has_spe(vcpu))
+		goto out;
+
 	/* Write mdcr_el2 changes since vcpu_load on VHE systems */
 	if (has_vhe() && orig_mdcr_el2 != vcpu->arch.mdcr_el2)
 		write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
 
+out:
 	trace_kvm_arm_set_dreg32("MDSCR_EL1", vcpu_read_sys_reg(vcpu, MDSCR_EL1));
 }
 
diff --git a/arch/arm64/kvm/hyp/include/hyp/spe-sr.h b/arch/arm64/kvm/hyp/include/hyp/spe-sr.h
new file mode 100644
index 000000000000..00ed684c117c
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/hyp/spe-sr.h
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 - ARM Ltd
+ * Author: Alexandru Elisei <alexandru.elisei@arm.com>
+ */
+
+#ifndef __ARM64_KVM_HYP_SPE_SR_H__
+#define __ARM64_KVM_HYP_SPE_SR_H__
+
+#include <linux/kvm_host.h>
+
+#include <asm/sysreg.h>
+
+#ifdef CONFIG_KVM_ARM_SPE
+static inline void __sysreg_save_spe_state_common(struct kvm_cpu_context *ctxt)
+{
+	ctxt_sys_reg(ctxt, PMSICR_EL1) = read_sysreg_s(SYS_PMSICR_EL1);
+	ctxt_sys_reg(ctxt, PMSIRR_EL1) = read_sysreg_s(SYS_PMSIRR_EL1);
+	ctxt_sys_reg(ctxt, PMSFCR_EL1) = read_sysreg_s(SYS_PMSFCR_EL1);
+	ctxt_sys_reg(ctxt, PMSEVFR_EL1) = read_sysreg_s(SYS_PMSEVFR_EL1);
+	ctxt_sys_reg(ctxt, PMSLATFR_EL1) = read_sysreg_s(SYS_PMSLATFR_EL1);
+}
+
+
+static inline void __sysreg_restore_spe_state_common(struct kvm_cpu_context *ctxt)
+{
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSICR_EL1), SYS_PMSICR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSIRR_EL1), SYS_PMSIRR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSFCR_EL1), SYS_PMSFCR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSEVFR_EL1), SYS_PMSEVFR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSLATFR_EL1), SYS_PMSLATFR_EL1);
+}
+#else
+static inline void __sysreg_save_spe_state_common(struct kvm_cpu_context *ctxt) {}
+static inline void __sysreg_restore_spe_state_common(struct kvm_cpu_context *ctxt) {}
+
+#endif /* CONFIG_KVM_ARM_SPE */
+#endif /* __ARM64_KVM_HYP_SPE_SR_H__ */
diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
index 313a8fa3c721..c88a40eeb171 100644
--- a/arch/arm64/kvm/hyp/include/hyp/switch.h
+++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
@@ -90,7 +90,6 @@ static inline void __activate_traps_common(struct kvm_vcpu *vcpu)
 	 */
 	write_sysreg(0, pmselr_el0);
 	write_sysreg(ARMV8_PMU_USERENR_MASK, pmuserenr_el0);
-	write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
 }
 
 static inline void __deactivate_traps_common(void)
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index ddde15fe85f2..fcc33b682a45 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -7,6 +7,7 @@ asflags-y := -D__KVM_NVHE_HYPERVISOR__
 ccflags-y := -D__KVM_NVHE_HYPERVISOR__
 
 obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o hyp-main.o
+obj-$(CONFIG_KVM_ARM_SPE) += spe-sr.o
 obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
 	 ../fpsimd.o ../hyp-entry.o
 
diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
index 91a711aa8382..af65afca479a 100644
--- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
@@ -60,14 +60,24 @@ static void __debug_restore_spe(u64 pmscr_el1)
 
 void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
 {
-	/* Disable and flush SPE data generation */
-	__debug_save_spe(&vcpu->arch.host_debug_state.pmscr_el1);
+	/*
+	 * If the guest is using SPE, host SPE was disabled when the host state
+	 * was saved.
+	 */
+	if (!vcpu_has_spe(vcpu))
+		/* Disable and flush SPE data generation */
+		__debug_save_spe(&vcpu->arch.host_debug_state.pmscr_el1);
 	__debug_switch_to_guest_common(vcpu);
 }
 
 void __debug_switch_to_host(struct kvm_vcpu *vcpu)
 {
-	__debug_restore_spe(vcpu->arch.host_debug_state.pmscr_el1);
+	/*
+	 * Host SPE state was restored with the rest of the host registers when
+	 * the guest is using SPE.
+	 */
+	if (!vcpu_has_spe(vcpu))
+		__debug_restore_spe(vcpu->arch.host_debug_state.pmscr_el1);
 	__debug_switch_to_host_common(vcpu);
 }
 
diff --git a/arch/arm64/kvm/hyp/nvhe/spe-sr.c b/arch/arm64/kvm/hyp/nvhe/spe-sr.c
new file mode 100644
index 000000000000..a73ee820b27f
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/spe-sr.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 - ARM Ltd
+ * Author: Alexandru Elisei <alexandru.elisei@arm.com>
+ */
+
+#include <hyp/spe-sr.h>
+
+#include <linux/kvm_host.h>
+
+#include <asm/kprobes.h>
+#include <asm/kvm_hyp.h>
+
+/*
+ * The SPE profiling buffer acts like a separate observer in the system, and we
+ * need to make sure it's disabled before switching translation regimes (host to
+ * guest and vice versa).
+ *
+ * Sampling is disabled when we're at a higher exception level than the owning
+ * exception level, so we don't need to disable sampling on save/restore the
+ * way we do in the VHE case, where the host is profiling at EL2.
+ *
+ * Profiling is enabled when both sampling and the buffer are enabled, as a
+ * result we don't have to worry about PMBPTR_EL1 restrictions with regard to
+ * PMBLIMITR_EL1.LIMIT.
+ */
+
+void __sysreg_save_spe_host_state_nvhe(struct kvm_cpu_context *ctxt)
+{
+	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+
+	if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+		psb_csync();
+		dsb(nsh);
+		write_sysreg_s(0, SYS_PMBLIMITR_EL1);
+		isb();
+	}
+
+	ctxt_sys_reg(ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+	ctxt_sys_reg(ctxt, PMBSR_EL1) = read_sysreg_s(SYS_PMBSR_EL1);
+	ctxt_sys_reg(ctxt, PMBLIMITR_EL1) = pmblimitr;
+	ctxt_sys_reg(ctxt, PMSCR_EL1) = read_sysreg_s(SYS_PMSCR_EL1);
+
+	__sysreg_save_spe_state_common(ctxt);
+}
+
+void __sysreg_restore_spe_guest_state_nvhe(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+
+	__sysreg_restore_spe_state_common(guest_ctxt);
+
+	/* Make sure the switch to the guest's stage 1 + stage 2 is visible */
+	isb();
+
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+	/* The guest buffer management event interrupt is always virtual. */
+	write_sysreg_s(0, SYS_PMBSR_EL1);
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMSCR_EL1), SYS_PMSCR_EL1);
+}
+
+void __sysreg_save_spe_guest_state_nvhe(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+
+	if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+		psb_csync();
+		dsb(nsh);
+		write_sysreg_s(0, SYS_PMBLIMITR_EL1);
+		isb();
+	}
+
+	ctxt_sys_reg(guest_ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+	ctxt_sys_reg(guest_ctxt, PMSCR_EL1) = read_sysreg_s(SYS_PMSCR_EL1);
+	/* PMBLIMITR_EL1 is updated only on trap. */
+
+	__sysreg_save_spe_state_common(guest_ctxt);
+}
+
+void __sysreg_restore_spe_host_state_nvhe(struct kvm_cpu_context *ctxt)
+{
+	__sysreg_restore_spe_state_common(ctxt);
+
+	/* Make sure the switch to host's stage 1 is visible */
+	isb();
+
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBSR_EL1), SYS_PMBSR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSCR_EL1), SYS_PMSCR_EL1);
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index a457a0306e03..4fde45c4c805 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -39,6 +39,8 @@ static void __activate_traps(struct kvm_vcpu *vcpu)
 	___activate_traps(vcpu);
 	__activate_traps_common(vcpu);
 
+	write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
+
 	val = CPTR_EL2_DEFAULT;
 	val |= CPTR_EL2_TTA | CPTR_EL2_TZ | CPTR_EL2_TAM;
 	if (!update_fp_enabled(vcpu)) {
@@ -188,6 +190,8 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 	pmu_switch_needed = __pmu_switch_to_guest(host_ctxt);
 
 	__sysreg_save_state_nvhe(host_ctxt);
+	if (vcpu_has_spe(vcpu))
+		__sysreg_save_spe_host_state_nvhe(host_ctxt);
 
 	/*
 	 * We must restore the 32-bit state before the sysregs, thanks
@@ -203,6 +207,9 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 	__load_guest_stage2(kern_hyp_va(vcpu->arch.hw_mmu));
 	__activate_traps(vcpu);
 
+	if (vcpu_has_spe(vcpu))
+		__sysreg_restore_spe_guest_state_nvhe(vcpu);
+
 	__hyp_vgic_restore_state(vcpu);
 	__timer_enable_traps(vcpu);
 
@@ -216,6 +223,9 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 	} while (fixup_guest_exit(vcpu, &exit_code));
 
 	__sysreg_save_state_nvhe(guest_ctxt);
+	if (vcpu_has_spe(vcpu))
+		__sysreg_save_spe_guest_state_nvhe(vcpu);
+
 	__sysreg32_save_state(vcpu);
 	__timer_disable_traps(vcpu);
 	__hyp_vgic_save_state(vcpu);
@@ -224,6 +234,8 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 	__load_host_stage2();
 
 	__sysreg_restore_state_nvhe(host_ctxt);
+	if (vcpu_has_spe(vcpu))
+		__sysreg_restore_spe_host_state_nvhe(host_ctxt);
 
 	if (vcpu->arch.flags & KVM_ARM64_FP_ENABLED)
 		__fpsimd_save_fpexc32(vcpu);
diff --git a/arch/arm64/kvm/hyp/vhe/Makefile b/arch/arm64/kvm/hyp/vhe/Makefile
index 461e97c375cc..daff3119c359 100644
--- a/arch/arm64/kvm/hyp/vhe/Makefile
+++ b/arch/arm64/kvm/hyp/vhe/Makefile
@@ -7,5 +7,6 @@ asflags-y := -D__KVM_VHE_HYPERVISOR__
 ccflags-y := -D__KVM_VHE_HYPERVISOR__
 
 obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o
+obj-$(CONFIG_KVM_ARM_SPE) += spe-sr.o
 obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
 	 ../fpsimd.o ../hyp-entry.o
diff --git a/arch/arm64/kvm/hyp/vhe/spe-sr.c b/arch/arm64/kvm/hyp/vhe/spe-sr.c
new file mode 100644
index 000000000000..dd947e9f249c
--- /dev/null
+++ b/arch/arm64/kvm/hyp/vhe/spe-sr.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 - ARM Ltd
+ * Author: Alexandru Elisei <alexandru.elisei@arm.com>
+ */
+
+#include <hyp/spe-sr.h>
+
+#include <linux/kvm_host.h>
+
+#include <asm/kprobes.h>
+#include <asm/kvm_hyp.h>
+
+void sysreg_save_spe_host_state_vhe(struct kvm_cpu_context *ctxt)
+{
+	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+	u64 pmscr_el2 = read_sysreg_el2(SYS_PMSCR);
+
+	/* Allow guest to select timestamp source, disable sampling. */
+	write_sysreg_el2(BIT(SYS_PMSCR_EL1_PCT_SHIFT), SYS_PMSCR);
+	if (pmscr_el2 & BIT(SYS_PMSCR_EL1_E1SPE_SHIFT))
+		isb();
+
+	if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+		psb_csync();
+		dsb(nsh);
+		write_sysreg_s(0, SYS_PMBLIMITR_EL1);
+		isb();
+	}
+
+	ctxt_sys_reg(ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+	ctxt_sys_reg(ctxt, PMBSR_EL1) = read_sysreg_s(SYS_PMBSR_EL1);
+	ctxt_sys_reg(ctxt, PMBLIMITR_EL1) = pmblimitr;
+	/*
+	 * We abuse the context register PMSCR_EL1 to save the host's PMSCR,
+	 * which is actually PMSCR_EL2 because KVM is running at EL2.
+	 */
+	ctxt_sys_reg(ctxt, PMSCR_EL1) = pmscr_el2;
+
+	__sysreg_save_spe_state_common(ctxt);
+}
+NOKPROBE_SYMBOL(sysreg_save_spe_host_state_vhe);
+
+void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+
+	/*
+	 * Make sure the write to MDCR_EL2 which changes the buffer owning
+	 * Exception level is visible.
+	 */
+	isb();
+
+	/*
+	 * Order doesn't matter because sampling is disabled at EL2. However,
+	 * restore guest registers in the same program order as the host for
+	 * consistency.
+	 */
+	__sysreg_restore_spe_state_common(guest_ctxt);
+
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+	/* The guest buffer management event interrupt is always virtual. */
+	write_sysreg_s(0, SYS_PMBSR_EL1);
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_el1(ctxt_sys_reg(guest_ctxt, PMSCR_EL1), SYS_PMSCR);
+}
+NOKPROBE_SYMBOL(sysreg_restore_spe_guest_state_vhe);
+
+void sysreg_save_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+
+	/*
+	 * We're going to switch buffer owning Exception level when we restore
+	 * the host MDCR_EL2 value, make sure the buffer is disabled until we
+	 * restore the host registers.
+	 *
+	 * Sampling at EL2 was disabled when we saved the host's SPE state, no
+	 * need to disable it again.
+	 */
+	if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+		/*
+		 * We don't need an ISB before PSB CSYNC because exception entry is a
+		 * context synchronization event.
+		 */
+		psb_csync();
+		dsb(nsh);
+		write_sysreg_s(0, SYS_PMBLIMITR_EL1);
+		isb();
+	}
+
+	ctxt_sys_reg(guest_ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+	ctxt_sys_reg(guest_ctxt, PMSCR_EL1) = read_sysreg_el1(SYS_PMSCR);
+	/* PMBLIMITR_EL1 is updated only on trap, skip saving it. */
+
+	__sysreg_save_spe_state_common(guest_ctxt);
+}
+NOKPROBE_SYMBOL(sysreg_save_spe_guest_state_vhe);
+
+void sysreg_restore_spe_host_state_vhe(struct kvm_cpu_context *ctxt)
+{
+	/*
+	 * Order matters now because we're possibly restarting profiling.
+	 * Restore common state first so PMSICR_EL1 is updated before PMSCR_EL2.
+	 */
+	__sysreg_restore_spe_state_common(ctxt);
+
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+	/*
+	 * Make sure PMBPTR_EL1 update is seen first, so we don't end up in a
+	 * situation where the buffer is enabled and the pointer passes
+	 * the value of PMBLIMITR_EL1.LIMIT programmed by the guest.
+	 *
+	 * This also serves to make sure the write to MDCR_EL2 which changes the
+	 * buffer owning Exception level is visible; the buffer is still
+	 * disabled until the write to PMBLIMITR_EL1.
+	 */
+	isb();
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBSR_EL1), SYS_PMBSR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_el2(ctxt_sys_reg(ctxt, PMSCR_EL1), SYS_PMSCR);
+}
+NOKPROBE_SYMBOL(sysreg_restore_spe_host_state_vhe);
diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 3f4db1fa388b..c7f3c8a004b6 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -64,6 +64,9 @@ static void __activate_traps(struct kvm_vcpu *vcpu)
 
 	write_sysreg(val, cpacr_el1);
 
+	if (vcpu_has_spe(vcpu))
+		write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
+
 	write_sysreg(__this_cpu_read(kvm_hyp_vector), vbar_el1);
 }
 NOKPROBE_SYMBOL(__activate_traps);
@@ -84,6 +87,13 @@ static void __deactivate_traps(struct kvm_vcpu *vcpu)
 	asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
 
 	write_sysreg(CPACR_EL1_DEFAULT, cpacr_el1);
+	if (vcpu_has_spe(vcpu)) {
+		u64 mdcr_el2 = read_sysreg(mdcr_el2);
+
+		mdcr_el2 &= MDCR_EL2_HPMN_MASK | MDCR_EL2_TPMS;
+
+		write_sysreg(mdcr_el2, mdcr_el2);
+	}
 	write_sysreg(vectors, vbar_el1);
 }
 NOKPROBE_SYMBOL(__deactivate_traps);
@@ -91,15 +101,36 @@ NOKPROBE_SYMBOL(__deactivate_traps);
 void activate_traps_vhe_load(struct kvm_vcpu *vcpu)
 {
 	__activate_traps_common(vcpu);
+	/*
+	 * When the guest is using SPE, vcpu->arch.mdcr_el2 configures the
+	 * profiling buffer to use the EL1&0 translation regime. If that's
+	 * loaded on the hardware and host has profiling enabled, the SPE buffer
+	 * will start using the guest's EL1&0 translation regime, but without
+	 * stage 2 enabled. That's bad.
+	 *
+	 * We cannot rely on checking here that profiling is enabled because
+	 * perf might install an event on the CPU via an IPI before we
+	 * disable interrupts. Instead, we defer loading the guest mdcr_el2
+	 * until __activate_traps().
+	 */
+	if (vcpu_has_spe(vcpu))
+		return;
+	write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
 }
 
-void deactivate_traps_vhe_put(void)
+void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu)
 {
-	u64 mdcr_el2 = read_sysreg(mdcr_el2);
+	/*
+	 * When the guest is using SPE, we load the host MDCR_EL2 value early,
+	 * in __deactivate_traps(), to allow perf to profile KVM.
+	 */
+	if (!vcpu_has_spe(vcpu)) {
+		u64 mdcr_el2 = read_sysreg(mdcr_el2);
 
-	mdcr_el2 &= MDCR_EL2_HPMN_MASK | MDCR_EL2_TPMS;
+		mdcr_el2 &= MDCR_EL2_HPMN_MASK | MDCR_EL2_TPMS;
 
-	write_sysreg(mdcr_el2, mdcr_el2);
+		write_sysreg(mdcr_el2, mdcr_el2);
+	}
 
 	__deactivate_traps_common();
 }
@@ -116,6 +147,8 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 	guest_ctxt = &vcpu->arch.ctxt;
 
 	sysreg_save_host_state_vhe(host_ctxt);
+	if (vcpu_has_spe(vcpu))
+		sysreg_save_spe_host_state_vhe(host_ctxt);
 
 	/*
 	 * ARM erratum 1165522 requires us to configure both stage 1 and
@@ -132,6 +165,9 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 	__activate_traps(vcpu);
 
 	sysreg_restore_guest_state_vhe(guest_ctxt);
+	if (vcpu_has_spe(vcpu))
+		sysreg_restore_spe_guest_state_vhe(vcpu);
+
 	__debug_switch_to_guest(vcpu);
 
 	do {
@@ -142,10 +178,14 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 	} while (fixup_guest_exit(vcpu, &exit_code));
 
 	sysreg_save_guest_state_vhe(guest_ctxt);
+	if (vcpu_has_spe(vcpu))
+		sysreg_save_spe_guest_state_vhe(vcpu);
 
 	__deactivate_traps(vcpu);
 
 	sysreg_restore_host_state_vhe(host_ctxt);
+	if (vcpu_has_spe(vcpu))
+		sysreg_restore_spe_host_state_vhe(host_ctxt);
 
 	if (vcpu->arch.flags & KVM_ARM64_FP_ENABLED)
 		__fpsimd_save_fpexc32(vcpu);
diff --git a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
index 2a0b8c88d74f..007a12dd4351 100644
--- a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
+++ b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
@@ -101,7 +101,7 @@ void kvm_vcpu_put_sysregs_vhe(struct kvm_vcpu *vcpu)
 	struct kvm_cpu_context *host_ctxt;
 
 	host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
-	deactivate_traps_vhe_put();
+	deactivate_traps_vhe_put(vcpu);
 
 	__sysreg_save_el1_state(guest_ctxt);
 	__sysreg_save_user_state(guest_ctxt);
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 0e365a51cac7..ba80f2716a11 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2019 ARM Ltd.
  */
 
+#include <linux/bug.h>
 #include <linux/kvm.h>
 #include <linux/kvm_host.h>
 #include <linux/uaccess.h>
@@ -14,11 +15,13 @@
 
 void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
 {
+	WARN(reg < PMBLIMITR_EL1, "Unexpected trap to SPE register\n");
 	__vcpu_sys_reg(vcpu, reg) = val;
 }
 
 u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
 {
+	WARN(reg < PMBLIMITR_EL1, "Unexpected trap to SPE register\n");
 	return __vcpu_sys_reg(vcpu, reg);
 }
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 2871484993ec..3a0687602839 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -254,6 +254,7 @@ static bool access_spe_reg(struct kvm_vcpu *vcpu,
 			 (u32)r->CRn, (u32)r->CRm, (u32)r->Op2);
 
 	if (sr == SYS_PMSIDR_EL1) {
+		WARN(true, "Unexpected trap to SPE register\n");
 		/* Ignore writes. */
 		if (!p->is_write)
 			p->regval = read_sysreg_s(SYS_PMSIDR_EL1);
-- 
2.29.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 14/16] KVM: arm64: Emulate SPE buffer management interrupt
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (12 preceding siblings ...)
  2020-10-27 17:27 ` [RFC PATCH v3 13/16] KVM: arm64: Switch SPE context on VM entry/exit Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 15/16] KVM: arm64: Enable SPE for guests Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 16/16] Documentation: arm64: Document ARM Neoverse-N1 erratum #1688567 Alexandru Elisei
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

A profiling buffer management interrupt is asserted when the buffer fills,
on a fault or on an external abort. The hardware sets the service bit,
PMBSR_EL1.S, when it asserts the interrupt and keeps the interrupt asserted
for as long as the bit remains set; a direct write to PMBSR_EL1 that sets the
bit also causes SPE to assert the interrupt. The interrupt is deasserted only
when the service bit is cleared.

KVM emulates the interrupt by reading the value of the service bit on each
guest exit to determine if the SPE hardware asserted the interrupt (for
example, if the buffer was full). Writes to the buffer registers are
trapped to determine when the interrupt should be cleared or when the
guest wants to explicitly assert the interrupt by setting the service bit.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/sysreg.h  |   3 +
 arch/arm64/kvm/arm.c             |   3 +
 arch/arm64/kvm/hyp/nvhe/spe-sr.c |  20 +++++-
 arch/arm64/kvm/hyp/vhe/spe-sr.c  |  19 +++++-
 arch/arm64/kvm/spe.c             | 101 +++++++++++++++++++++++++++++++
 include/kvm/arm_spe.h            |   4 ++
 6 files changed, 146 insertions(+), 4 deletions(-)
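
As an aside, the virtual interrupt handling described above boils down to a
small state machine. The following is a self-contained toy model of it in
plain userspace C (not kernel code; the struct and function names are made up
for illustration, only the service bit and the line level are modelled, and
PMBSR_EL1.S is at its architectural position, bit 17):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PMBSR_S		(1ULL << 17)	/* PMBSR_EL1.S, the service bit */

struct vspe {
	uint64_t pmbsr;		/* guest view of PMBSR_EL1 */
	bool	 irq_level;	/* level of the virtual buffer management PPI */
};

/* Guest exit: the hardware reported the service bit set (e.g. buffer full). */
static void on_guest_exit(struct vspe *s, uint64_t hw_pmbsr)
{
	if (hw_pmbsr & PMBSR_S) {
		s->pmbsr = hw_pmbsr;
		s->irq_level = true;	/* assert the virtual interrupt */
	}
}

/* Trapped guest write to PMBSR_EL1: the interrupt line follows the S bit. */
static void on_guest_write_pmbsr(struct vspe *s, uint64_t val)
{
	s->pmbsr = val;
	s->irq_level = !!(val & PMBSR_S);
}

int main(void)
{
	struct vspe s = { 0 };

	on_guest_exit(&s, PMBSR_S);	/* buffer fills -> interrupt asserted */
	printf("level after buffer full: %d\n", s.irq_level);
	on_guest_write_pmbsr(&s, 0);	/* guest clears S -> interrupt deasserted */
	printf("level after guest clears S: %d\n", s.irq_level);
	return 0;
}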

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 20159af17578..0398bcba83a6 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -299,6 +299,7 @@
 #define SYS_PMBLIMITR_EL1_FM_SHIFT	1
 #define SYS_PMBLIMITR_EL1_FM_MASK	0x3UL
 #define SYS_PMBLIMITR_EL1_FM_STOP_IRQ	(0 << SYS_PMBLIMITR_EL1_FM_SHIFT)
+#define SYS_PMBLIMITR_EL1_RES0		0xfffffffffffff007UL
 
 #define SYS_PMBPTR_EL1			sys_reg(3, 0, 9, 10, 1)
 
@@ -323,6 +324,8 @@
 
 #define SYS_PMBSR_EL1_BUF_BSC_FULL	(0x1UL << SYS_PMBSR_EL1_BUF_BSC_SHIFT)
 
+#define SYS_PMBSR_EL1_RES0		0x00000000fc0fffffUL
+
 /*** End of Statistical Profiling Extension ***/
 
 #define SYS_PMINTENSET_EL1		sys_reg(3, 0, 9, 14, 1)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2d98248f2c66..c6a675aba71e 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -775,6 +775,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		 */
 		kvm_vgic_sync_hwstate(vcpu);
 
+		if (vcpu_has_spe(vcpu))
+			kvm_arm_spe_sync_hwstate(vcpu);
+
 		/*
 		 * Sync the timer hardware state before enabling interrupts as
 		 * we don't want vtimer interrupts to race with syncing the
diff --git a/arch/arm64/kvm/hyp/nvhe/spe-sr.c b/arch/arm64/kvm/hyp/nvhe/spe-sr.c
index a73ee820b27f..2794a7c7fcd9 100644
--- a/arch/arm64/kvm/hyp/nvhe/spe-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/spe-sr.c
@@ -47,6 +47,14 @@ void __sysreg_save_spe_host_state_nvhe(struct kvm_cpu_context *ctxt)
 void __sysreg_restore_spe_guest_state_nvhe(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
+	u64 pmblimitr;
+
+	/* Disable guest profiling when the interrupt is asserted. */
+	if (spe_cpu->irq_level)
+		pmblimitr = 0;
+	else
+		pmblimitr = ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1);
 
 	__sysreg_restore_spe_state_common(guest_ctxt);
 
@@ -54,16 +62,24 @@ void __sysreg_restore_spe_guest_state_nvhe(struct kvm_vcpu *vcpu)
 	isb();
 
 	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
-	/* The guest buffer management event interrupt is always virtual. */
+	/* The guest buffer management interrupt is always virtual. */
 	write_sysreg_s(0, SYS_PMBSR_EL1);
-	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_s(pmblimitr, SYS_PMBLIMITR_EL1);
 	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMSCR_EL1), SYS_PMSCR_EL1);
 }
 
 void __sysreg_save_spe_guest_state_nvhe(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
 	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+	u64 pmbsr = read_sysreg_s(SYS_PMBSR_EL1);
+
+	/* Update guest PMBSR_EL1 only when SPE asserts an interrupt. */
+	if (pmbsr & BIT(SYS_PMBSR_EL1_S_SHIFT)) {
+		ctxt_sys_reg(guest_ctxt, PMBSR_EL1) = pmbsr;
+		spe_cpu->pmbirq_asserted = true;
+	}
 
 	if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
 		psb_csync();
diff --git a/arch/arm64/kvm/hyp/vhe/spe-sr.c b/arch/arm64/kvm/hyp/vhe/spe-sr.c
index dd947e9f249c..24173f838bb1 100644
--- a/arch/arm64/kvm/hyp/vhe/spe-sr.c
+++ b/arch/arm64/kvm/hyp/vhe/spe-sr.c
@@ -44,6 +44,8 @@ NOKPROBE_SYMBOL(sysreg_save_spe_host_state_vhe);
 void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
+	u64 pmblimitr;
 
 	/*
 	 * Make sure the write to MDCR_EL2 which changes the buffer owning
@@ -51,6 +53,12 @@ void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
 	 */
 	isb();
 
+	/* Disable guest profiling when the interrupt is asserted. */
+	if (spe_cpu->irq_level)
+		pmblimitr = 0;
+	else
+		pmblimitr = ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1);
+
 	/*
 	 * Order doesn't matter because sampling is disabled at EL2. However,
 	 * restore guest registers in the same program order as the host for
@@ -59,9 +67,9 @@ void sysreg_restore_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
 	__sysreg_restore_spe_state_common(guest_ctxt);
 
 	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
-	/* The guest buffer management event interrupt is always virtual. */
+	/* The guest buffer management interrupt is always virtual. */
 	write_sysreg_s(0, SYS_PMBSR_EL1);
-	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+	write_sysreg_s(pmblimitr, SYS_PMBLIMITR_EL1);
 	write_sysreg_el1(ctxt_sys_reg(guest_ctxt, PMSCR_EL1), SYS_PMSCR);
 }
 NOKPROBE_SYMBOL(sysreg_restore_spe_guest_state_vhe);
@@ -69,8 +77,15 @@ NOKPROBE_SYMBOL(sysreg_restore_spe_guest_state_vhe);
 void sysreg_save_spe_guest_state_vhe(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt;
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
 	u64 pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+	u64 pmbsr = read_sysreg_s(SYS_PMBSR_EL1);
 
+	/* Update guest PMBSR_EL1 only when SPE asserts an interrupt. */
+	if (pmbsr & BIT(SYS_PMBSR_EL1_S_SHIFT)) {
+		ctxt_sys_reg(guest_ctxt, PMBSR_EL1) = pmbsr;
+		spe_cpu->pmbirq_asserted = true;
+	}
 	/*
 	 * We're going to switch buffer owning Exception level when we restore
 	 * the host MDCR_EL2 value, make sure the buffer is disabled until we
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index ba80f2716a11..243fd621d640 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -12,11 +12,112 @@
 #include <kvm/arm_vgic.h>
 
 #include <asm/kvm_mmu.h>
+#include <asm/kvm_host.h>
+
+static void kvm_arm_spe_update_irq(struct kvm_vcpu *vcpu, bool new_level)
+{
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
+	int ret;
+
+	spe_cpu->irq_level = new_level;
+	ret = kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id, spe_cpu->irq_num,
+				  new_level, spe_cpu);
+	WARN_ON(ret);
+}
+
+void kvm_arm_spe_sync_hwstate(struct kvm_vcpu *vcpu)
+{
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
+	u64 pmbsr, pmbsr_ec;
+
+	if (!spe_cpu->pmbirq_asserted)
+		return;
+	spe_cpu->pmbirq_asserted = false;
+
+	pmbsr = __vcpu_sys_reg(vcpu, PMBSR_EL1);
+	pmbsr_ec = pmbsr & (SYS_PMBSR_EL1_EC_MASK << SYS_PMBSR_EL1_EC_SHIFT);
+
+	if (pmbsr & BIT(SYS_PMBSR_EL1_EA_SHIFT)) {
+		/*
+		 * The guest managed to trigger an external abort, something is
+		 * very definitely broken and there is no way for us to recover.
+		 * Treat it like we would if the external abort generated an
+		 * SError and panic now.
+		 */
+		panic("KVM SPE External Abort\n");
+		cpu_park_loop();
+		unreachable();
+	}
+
+	switch (pmbsr_ec) {
+	case SYS_PMBSR_EL1_EC_FAULT_S2:
+		/*
+		 * If we see this then either the guest memory isn't pinned
+		 * correctly (KVM bug or userspace got it wrong), or the guest
+		 * programmed the buffer pointer with a bogus address.
+		 * PMBPTR_EL1 will point to the guest VA that triggered the
+		 * DABT, print it as it might be useful for debugging.
+		 */
+		pr_warn_ratelimited("KVM SPE Stage 2 Data Abort, pmbsr=0x%llx, pmbptr=0x%llx\n",
+				pmbsr, __vcpu_sys_reg(vcpu, PMBPTR_EL1));
+		/*
+		 * Convert the stage 2 DABT into a guest SPE buffer synchronous
+		 * external abort.
+		 */
+		__vcpu_sys_reg(vcpu, PMBSR_EL1) = BIT(SYS_PMBSR_EL1_S_SHIFT) |
+						  BIT(SYS_PMBSR_EL1_EA_SHIFT);
+	case SYS_PMBSR_EL1_EC_FAULT_S1:
+	case SYS_PMBSR_EL1_EC_BUF:
+		break;
+	default:
+		pr_warn_ratelimited("KVM SPE Unknown buffer syndrome, pmbsr=0x%llx, pmbptr=0x%llx\n",
+				pmbsr, __vcpu_sys_reg(vcpu, PMBPTR_EL1));
+		__vcpu_sys_reg(vcpu, PMBSR_EL1) = BIT(SYS_PMBSR_EL1_S_SHIFT) |
+						  BIT(SYS_PMBSR_EL1_EA_SHIFT);
+	}
+
+	if (spe_cpu->irq_level)
+		return;
+
+	kvm_arm_spe_update_irq(vcpu, true);
+}
+
+static u64 kvm_arm_spe_get_reg_mask(int reg)
+{
+	switch (reg) {
+	case PMBLIMITR_EL1:
+		return SYS_PMBLIMITR_EL1_RES0;
+	case PMBPTR_EL1:
+		return ~0UL;
+	case PMBSR_EL1:
+		return SYS_PMBSR_EL1_RES0;
+	default:
+		return 0UL;
+	}
+}
 
 void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
 {
+	struct kvm_spe_cpu *spe_cpu = &vcpu->arch.spe_cpu;
+	bool irq_level;
+
 	WARN(reg < PMBLIMITR_EL1, "Unexpected trap to SPE register\n");
+
+	val &= kvm_arm_spe_get_reg_mask(reg);
 	__vcpu_sys_reg(vcpu, reg) = val;
+
+	if (reg != PMBSR_EL1)
+		return;
+
+	irq_level = val & BIT(SYS_PMBSR_EL1_S_SHIFT);
+	/*
+	 * The VGIC configures PPIs as level-sensitive, we need to update the
+	 * interrupt state if it changes.
+	 */
+	if (spe_cpu->irq_level == irq_level)
+		return;
+
+	kvm_arm_spe_update_irq(vcpu, irq_level);
 }
 
 u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
index a2429edc4483..d45c17dd157d 100644
--- a/include/kvm/arm_spe.h
+++ b/include/kvm/arm_spe.h
@@ -16,6 +16,7 @@ struct kvm_spe_cpu {
 	int irq_num; 		/* Guest visible INTID */
 	bool irq_level; 	/* 'true' if interrupt is asserted to the VGIC */
 	bool initialized; 	/* Feature is initialized on VCPU */
+	bool pmbirq_asserted;	/* Hardware asserted PMBIRQ */
 };
 
 struct kvm_spe {
@@ -41,6 +42,7 @@ void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_arm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val);
 u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg);
 
+void kvm_arm_spe_sync_hwstate(struct kvm_vcpu *vcpu);
 #else
 #define kvm_arm_supports_spe()	false
 
@@ -96,5 +98,7 @@ static inline u64 kvm_arm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
 	return 0;
 }
 
+static inline void kvm_arm_spe_sync_hwstate(struct kvm_vcpu *vcpu) {}
+
 #endif /* CONFIG_KVM_ARM_SPE */
 #endif /* __ASM_ARM_KVM_SPE_H */
-- 
2.29.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 15/16] KVM: arm64: Enable SPE for guests
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (13 preceding siblings ...)
  2020-10-27 17:27 ` [RFC PATCH v3 14/16] KVM: arm64: Emulate SPE buffer management interrupt Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  2020-10-27 17:27 ` [RFC PATCH v3 16/16] Documentation: arm64: Document ARM Neoverse-N1 erratum #1688567 Alexandru Elisei
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

We have all the bits in place to expose SPE to guests; allow userspace to
set the feature and advertise the presence of SPE in the ID_AA64DFR0_EL1
register.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h | 2 +-
 arch/arm64/kvm/sys_regs.c         | 8 ++++++--
 2 files changed, 7 insertions(+), 3 deletions(-)
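
For completeness, a rough sketch of the userspace sequence implied by the ABI
added in patches 8 and 10 of this series; kvmtool does the equivalent. It
assumes the uapi definitions introduced by the series, omits error handling
and the rest of the VM setup, and the PPI INTID and the helper name are only
illustrative:

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Assumes the VCPU was initialized with the KVM_ARM_VCPU_SPE feature bit set
 * in kvm_vcpu_init.features[] and that the in-kernel irqchip (vgic) has
 * already been created and initialized.
 */
static int enable_and_finalize_spe(int vm_fd, int vcpu_fd,
				   void *guest_mem, size_t mem_size)
{
	int spe_irq = 21;	/* example PPI INTID, must be the same on all VCPUs */
	struct kvm_device_attr attr;

	if (!ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_SPE))
		return -1;	/* SPE not supported by KVM on this host */

	/* Tell KVM which PPI to use for the buffer management interrupt. */
	attr = (struct kvm_device_attr) {
		.group	= KVM_ARM_VCPU_SPE_CTRL,
		.attr	= KVM_ARM_VCPU_SPE_IRQ,
		.addr	= (__u64)(unsigned long)&spe_irq,
	};
	if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr))
		return -1;

	/* Initialize SPE on the VCPU. */
	attr.attr = KVM_ARM_VCPU_SPE_INIT;
	attr.addr = 0;
	if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr))
		return -1;

	/* Pin the guest memory, then ask KVM to pre-map it at stage 2. */
	if (mlock(guest_mem, mem_size))
		return -1;

	attr = (struct kvm_device_attr) {
		.group	= KVM_ARM_VM_SPE_CTRL,
		.attr	= KVM_ARM_VM_SPE_FINALIZE,
	};
	return ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &attr);
}

Until the finalize attribute succeeds, KVM_RUN on a VCPU with SPE enabled
returns -EPERM (see patch 10).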

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index bcecc6224c59..e5504c9847fc 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -39,7 +39,7 @@
 
 #define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS
 
-#define KVM_VCPU_MAX_FEATURES 7
+#define KVM_VCPU_MAX_FEATURES 8
 
 #define KVM_REQ_SLEEP \
 	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 3a0687602839..076be04d2e28 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1178,8 +1178,12 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
 		val = cpuid_feature_cap_perfmon_field(val,
 						ID_AA64DFR0_PMUVER_SHIFT,
 						ID_AA64DFR0_PMUVER_8_1);
-		/* Don't advertise SPE to guests */
-		val &= ~(0xfUL << ID_AA64DFR0_PMSVER_SHIFT);
+		/*
+		 * Don't advertise SPE to guests without SPE. Otherwise, allow
+		 * the guest to detect the hardware SPE version.
+		 */
+		if (!vcpu_has_spe(vcpu))
+			val &= ~(0xfUL << ID_AA64DFR0_PMSVER_SHIFT);
 	} else if (id == SYS_ID_DFR0_EL1) {
 		/* Limit guests to PMUv3 for ARMv8.1 */
 		val = cpuid_feature_cap_perfmon_field(val,
-- 
2.29.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v3 16/16] Documentation: arm64: Document ARM Neoverse-N1 erratum #1688567
  2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (14 preceding siblings ...)
  2020-10-27 17:27 ` [RFC PATCH v3 15/16] KVM: arm64: Enable SPE for guests Alexandru Elisei
@ 2020-10-27 17:27 ` Alexandru Elisei
  15 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-10-27 17:27 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm; +Cc: maz, will

According to erratum #1688567, a SPE buffer write that results in an Access
flag fault or Permission fault at stage 2 is reported with an unsupported
PMBSR_EL1.FSC code.

KVM avoids SPE stage 2 faults altogether by requiring userspace to lock the
guest memory in RAM and pre-mapping it in stage 2 before the VM is started.
As a result, KVM is not impacted by this erratum.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/arm64/silicon-errata.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst
index d3587805de64..1f6c403fd555 100644
--- a/Documentation/arm64/silicon-errata.rst
+++ b/Documentation/arm64/silicon-errata.rst
@@ -96,6 +96,8 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Neoverse-N1     | #1542419        | ARM64_ERRATUM_1542419       |
 +----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Neoverse-N1     | #1688567        | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
 | ARM            | MMU-500         | #841119,826419  | N/A                         |
 +----------------+-----------------+-----------------+-----------------------------+
 +----------------+-----------------+-----------------+-----------------------------+
-- 
2.29.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE
  2020-10-27 17:26 ` [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
@ 2020-11-05  9:58   ` Haibo Xu
  2020-12-02 15:20     ` Alexandru Elisei
  2020-11-19 16:58   ` James Morse
  1 sibling, 1 reply; 35+ messages in thread
From: Haibo Xu @ 2020-11-05  9:58 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, will, kvmarm, linux-arm-kernel, Sudeep Holla

On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>
> From: Sudeep Holla <sudeep.holla@arm.com>
>
> To configure the virtual SPE buffer management interrupt number, we use a
> VCPU kvm_device ioctl, encapsulating the KVM_ARM_VCPU_SPE_IRQ attribute
> within the KVM_ARM_VCPU_SPE_CTRL group.
>
> After configuring the SPE, userspace is required to call the VCPU ioctl
> with the attribute KVM_ARM_VCPU_SPE_INIT to initialize SPE on the VCPU.
>
> [Alexandru E: Fixed compilation errors, don't allow userspace to set the
>         VCPU feature, removed unused functions, fixed mismatched
>         descriptions, comments and error codes, reworked logic, rebased on
>         top of v5.10-rc1]
>
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  Documentation/virt/kvm/devices/vcpu.rst |  40 ++++++++
>  arch/arm64/include/uapi/asm/kvm.h       |   3 +
>  arch/arm64/kvm/Makefile                 |   1 +
>  arch/arm64/kvm/guest.c                  |   9 ++
>  arch/arm64/kvm/reset.c                  |  23 +++++
>  arch/arm64/kvm/spe.c                    | 129 ++++++++++++++++++++++++
>  include/kvm/arm_spe.h                   |  27 +++++
>  include/uapi/linux/kvm.h                |   1 +
>  8 files changed, 233 insertions(+)
>  create mode 100644 arch/arm64/kvm/spe.c
>
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 2acec3b9ef65..6135b9827fbe 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -161,3 +161,43 @@ Specifies the base address of the stolen time structure for this VCPU. The
>  base address must be 64 byte aligned and exist within a valid guest memory
>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>  including the layout of the stolen time structure.
> +
> +4. GROUP: KVM_ARM_VCPU_SPE_CTRL
> +===============================
> +
> +:Architectures: ARM64
> +
> +4.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
> +-----------------------------------
> +
> +:Parameters: in kvm_device_attr.addr the address for the SPE buffer management
> +             interrupt is a pointer to an int
> +
> +Returns:
> +
> +        =======  ========================================================
> +        -EBUSY   The SPE buffer management interrupt is already set
> +        -EINVAL  Invalid SPE overflow interrupt number
> +        -EFAULT  Could not read the buffer management interrupt number
> +        -ENXIO   SPE not supported or not properly configured
> +        =======  ========================================================
> +
> +A value describing the SPE (Statistical Profiling Extension) overflow interrupt
> +number for this vcpu. This interrupt should be a PPI and the interrupt type and
> +number must be the same for each vcpu.
> +
> +4.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
> +------------------------------------
> +
> +:Parameters: no additional parameter in kvm_device_attr.addr
> +
> +Returns:
> +
> +        =======  ======================================================
> +        -EBUSY   SPE already initialized
> +        -ENODEV  GIC not initialized
> +        -ENXIO   SPE not supported or not properly configured
> +        =======  ======================================================
> +
> +Request the initialization of the SPE. Must be done after initializing the
> +in-kernel irqchip and after setting the interrupt number for the VCPU.
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 489e12304dbb..ca57dfb7abf0 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -360,6 +360,9 @@ struct kvm_vcpu_events {
>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER                1
>  #define KVM_ARM_VCPU_PVTIME_CTRL       2
>  #define   KVM_ARM_VCPU_PVTIME_IPA      0
> +#define KVM_ARM_VCPU_SPE_CTRL          3
> +#define   KVM_ARM_VCPU_SPE_IRQ         0
> +#define   KVM_ARM_VCPU_SPE_INIT                1
>
>  /* KVM_IRQ_LINE irq field index values */
>  #define KVM_ARM_IRQ_VCPU2_SHIFT                28
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 1504c81fbf5d..f6e76f64ffbe 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -25,3 +25,4 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
>          vgic/vgic-its.o vgic/vgic-debug.o
>
>  kvm-$(CONFIG_KVM_ARM_PMU)  += pmu-emul.o
> +kvm-$(CONFIG_KVM_ARM_SPE)  += spe.o
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index dfb5218137ca..2ba790eeb782 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -926,6 +926,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
>         case KVM_ARM_VCPU_PVTIME_CTRL:
>                 ret = kvm_arm_pvtime_set_attr(vcpu, attr);
>                 break;
> +       case KVM_ARM_VCPU_SPE_CTRL:
> +               ret = kvm_arm_spe_set_attr(vcpu, attr);
> +               break;
>         default:
>                 ret = -ENXIO;
>                 break;
> @@ -949,6 +952,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
>         case KVM_ARM_VCPU_PVTIME_CTRL:
>                 ret = kvm_arm_pvtime_get_attr(vcpu, attr);
>                 break;
> +       case KVM_ARM_VCPU_SPE_CTRL:
> +               ret = kvm_arm_spe_get_attr(vcpu, attr);
> +               break;
>         default:
>                 ret = -ENXIO;
>                 break;
> @@ -972,6 +978,9 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>         case KVM_ARM_VCPU_PVTIME_CTRL:
>                 ret = kvm_arm_pvtime_has_attr(vcpu, attr);
>                 break;
> +       case KVM_ARM_VCPU_SPE_CTRL:
> +               ret = kvm_arm_spe_has_attr(vcpu, attr);
> +               break;
>         default:
>                 ret = -ENXIO;
>                 break;
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index f32490229a4c..4dc205fa4be1 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -87,6 +87,9 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>         case KVM_CAP_ARM_PTRAUTH_GENERIC:
>                 r = system_has_full_ptr_auth();
>                 break;
> +       case KVM_CAP_ARM_SPE:
> +               r = kvm_arm_supports_spe();
> +               break;
>         default:
>                 r = 0;
>         }
> @@ -223,6 +226,19 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
>         return 0;
>  }
>
> +static int kvm_vcpu_enable_spe(struct kvm_vcpu *vcpu)
> +{
> +       if (!kvm_arm_supports_spe())
> +               return -EINVAL;
> +
> +       /* SPE is disabled if the PE is in AArch32 state */
> +       if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
> +               return -EINVAL;
> +
> +       vcpu->arch.flags |= KVM_ARM64_GUEST_HAS_SPE;
> +       return 0;
> +}
> +
>  /**
>   * kvm_reset_vcpu - sets core registers and sys_regs to reset value
>   * @vcpu: The VCPU pointer
> @@ -274,6 +290,13 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>                 }
>         }
>
> +       if (test_bit(KVM_ARM_VCPU_SPE, vcpu->arch.features)) {
> +               if (kvm_vcpu_enable_spe(vcpu)) {
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
> +       }
> +
>         switch (vcpu->arch.target) {
>         default:
>                 if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
> new file mode 100644
> index 000000000000..f91a52cd7cd3
> --- /dev/null
> +++ b/arch/arm64/kvm/spe.c
> @@ -0,0 +1,129 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 ARM Ltd.
> + */
> +
> +#include <linux/kvm.h>
> +#include <linux/kvm_host.h>
> +#include <linux/uaccess.h>
> +
> +#include <kvm/arm_spe.h>
> +#include <kvm/arm_vgic.h>
> +
> +static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
> +{
> +       if (!vcpu_has_spe(vcpu))
> +               return false;
> +
> +       if (!irqchip_in_kernel(vcpu->kvm))
> +               return false;
> +

nit: should we move the irqchip_in_kernel() check to the caller?

> +       return true;
> +}
> +
> +static int kvm_arm_spe_init(struct kvm_vcpu *vcpu)
> +{
> +       if (!kvm_arm_spe_irq_initialized(vcpu))
> +               return -ENXIO;
> +
> +       if (!vgic_initialized(vcpu->kvm))
> +               return -ENODEV;
> +
> +       if (kvm_arm_spe_vcpu_initialized(vcpu))
> +               return -EBUSY;
> +
> +       if (kvm_vgic_set_owner(vcpu, vcpu->arch.spe_cpu.irq_num, &vcpu->arch.spe_cpu))
> +               return -ENXIO;
> +
> +       vcpu->arch.spe_cpu.initialized = true;
> +
> +       return 0;
> +}
> +
> +static bool kvm_arm_spe_irq_is_valid(struct kvm *kvm, int irq)
> +{
> +       int i;
> +       struct kvm_vcpu *vcpu;
> +
> +       /* The SPE overflow interrupt can be a PPI only */
> +       if (!irq_is_ppi(irq))
> +               return false;
> +
> +       kvm_for_each_vcpu(i, vcpu, kvm) {
> +               if (!kvm_arm_spe_irq_initialized(vcpu))
> +                       continue;
> +
> +               if (vcpu->arch.spe_cpu.irq_num != irq)
> +                       return false;
> +       }
> +
> +       return true;
> +}
> +
> +int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
> +{
> +       switch (attr->attr) {
> +       case KVM_ARM_VCPU_SPE_IRQ: {
> +               int __user *uaddr = (int __user *)(long)attr->addr;
> +               int irq;
> +
> +               if (!kvm_arm_vcpu_supports_spe(vcpu))
> +                       return -ENXIO;
> +
> +               if (get_user(irq, uaddr))
> +                       return -EFAULT;
> +
> +               if (!kvm_arm_spe_irq_is_valid(vcpu->kvm, irq))
> +                       return -EINVAL;
> +
> +               if (kvm_arm_spe_irq_initialized(vcpu))
> +                       return -EBUSY;
> +
> +               kvm_debug("Set kvm ARM SPE irq: %d\n", irq);
> +               vcpu->arch.spe_cpu.irq_num = irq;
> +
> +               return 0;
> +       }
> +       case KVM_ARM_VCPU_SPE_INIT:
> +               return kvm_arm_spe_init(vcpu);
> +       }
> +
> +       return -ENXIO;
> +}
> +
> +int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
> +{
> +       switch (attr->attr) {
> +       case KVM_ARM_VCPU_SPE_IRQ: {
> +               int __user *uaddr = (int __user *)(long)attr->addr;
> +               int irq;
> +
> +               if (!kvm_arm_vcpu_supports_spe(vcpu))
> +                       return -ENXIO;
> +
> +               if (!kvm_arm_spe_irq_initialized(vcpu))
> +                       return -ENXIO;
> +
> +               irq = vcpu->arch.spe_cpu.irq_num;
> +               if (put_user(irq, uaddr))
> +                       return -EFAULT;
> +
> +               return 0;
> +       }
> +       }
> +
> +       return -ENXIO;
> +}
> +
> +int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
> +{
> +       switch (attr->attr) {
> +       case KVM_ARM_VCPU_SPE_IRQ:
> +               fallthrough;
> +       case KVM_ARM_VCPU_SPE_INIT:
> +               if (kvm_arm_vcpu_supports_spe(vcpu))
> +                       return 0;
> +       }
> +
> +       return -ENXIO;
> +}
> diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
> index 46ec447ed013..0275e8097529 100644
> --- a/include/kvm/arm_spe.h
> +++ b/include/kvm/arm_spe.h
> @@ -18,11 +18,38 @@ struct kvm_spe_cpu {
>         bool initialized;       /* Feature is initialized on VCPU */
>  };
>
> +#define kvm_arm_spe_irq_initialized(v)                 \
> +       ((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&   \
> +        (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)
> +#define kvm_arm_spe_vcpu_initialized(v)        ((v)->arch.spe_cpu.initialized)
> +
> +int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> +int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> +int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> +
>  #else
>  #define kvm_arm_supports_spe() false
>
>  struct kvm_spe_cpu {
>  };
>
> +#define kvm_arm_spe_irq_initialized(v) false
> +#define kvm_arm_spe_vcpu_initialized(v)        false
> +
> +static inline int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu,
> +                                      struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +static inline int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu,
> +                                      struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +static inline int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu,
> +                                      struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
>  #endif /* CONFIG_KVM_ARM_SPE */
>  #endif /* __ASM_ARM_KVM_SPE_H */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index ca41220b40b8..96228b823711 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1053,6 +1053,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_X86_USER_SPACE_MSR 188
>  #define KVM_CAP_X86_MSR_FILTER 189
>  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
> +#define KVM_CAP_ARM_SPE 191
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> --
> 2.29.1
>
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort()
  2020-10-27 17:26 ` [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort() Alexandru Elisei
@ 2020-11-05 10:01   ` Haibo Xu
  2020-12-02 16:29     ` Alexandru Elisei
  0 siblings, 1 reply; 35+ messages in thread
From: Haibo Xu @ 2020-11-05 10:01 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, will, kvmarm, linux-arm-kernel

On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>
> user_mem_abort() is already a long and complex function, let's make it
> slightly easier to understand by abstracting the algorithm for choosing the
> stage 2 IPA entry size into its own function.
>
> This also makes it possible to reuse the code when guest SPE support will
> be added.
>

Better to mention that there is "No functional change"!

> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/kvm/mmu.c | 55 ++++++++++++++++++++++++++------------------
>  1 file changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 19aacc7d64de..c3c43555490d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -738,12 +738,43 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>         return PAGE_SIZE;
>  }
>
> +static short stage2_max_pageshift(struct kvm_memory_slot *memslot,
> +                                 struct vm_area_struct *vma, hva_t hva,
> +                                 bool *force_pte)
> +{
> +       short pageshift;
> +
> +       *force_pte = false;
> +
> +       if (is_vm_hugetlb_page(vma))
> +               pageshift = huge_page_shift(hstate_vma(vma));
> +       else
> +               pageshift = PAGE_SHIFT;
> +
> +       if (memslot_is_logging(memslot) || (vma->vm_flags & VM_PFNMAP)) {
> +               *force_pte = true;
> +               pageshift = PAGE_SHIFT;
> +       }
> +
> +       if (pageshift == PUD_SHIFT &&
> +           !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
> +               pageshift = PMD_SHIFT;
> +
> +       if (pageshift == PMD_SHIFT &&
> +           !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> +               *force_pte = true;
> +               pageshift = PAGE_SHIFT;
> +       }
> +
> +       return pageshift;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                           struct kvm_memory_slot *memslot, unsigned long hva,
>                           unsigned long fault_status)
>  {
>         int ret = 0;
> -       bool write_fault, writable, force_pte = false;
> +       bool write_fault, writable, force_pte;
>         bool exec_fault;
>         bool device = false;
>         unsigned long mmu_seq;
> @@ -776,27 +807,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                 return -EFAULT;
>         }
>
> -       if (is_vm_hugetlb_page(vma))
> -               vma_shift = huge_page_shift(hstate_vma(vma));
> -       else
> -               vma_shift = PAGE_SHIFT;
> -
> -       if (logging_active ||
> -           (vma->vm_flags & VM_PFNMAP)) {
> -               force_pte = true;
> -               vma_shift = PAGE_SHIFT;
> -       }
> -
> -       if (vma_shift == PUD_SHIFT &&
> -           !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
> -              vma_shift = PMD_SHIFT;
> -
> -       if (vma_shift == PMD_SHIFT &&
> -           !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
> -               force_pte = true;
> -               vma_shift = PAGE_SHIFT;
> -       }
> -
> +       vma_shift = stage2_max_pageshift(memslot, vma, hva, &force_pte);
>         vma_pagesize = 1UL << vma_shift;
>         if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
>                 fault_ipa &= ~(vma_pagesize - 1);
> --
> 2.29.1
>
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE
  2020-10-27 17:26 ` [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE Alexandru Elisei
@ 2020-11-05 10:10   ` Haibo Xu
  2020-12-02 16:35     ` Alexandru Elisei
  2020-11-19 16:59   ` James Morse
  1 sibling, 1 reply; 35+ messages in thread
From: Haibo Xu @ 2020-11-05 10:10 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, will, kvmarm, linux-arm-kernel

On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>
> Stage 2 faults triggered by the profiling buffer attempting to write to
> memory are reported by the SPE hardware by asserting a buffer management
> event interrupt. Interrupts are by their nature asynchronous, which means
> that the guest might have changed its stage 1 translation tables since the
> attempted write. SPE reports the guest virtual address that caused the data
> abort, but not the IPA, which means that KVM would have to walk the guest's
> stage 1 tables to find the IPA; using the AT instruction to walk the
> guest's tables in hardware is not an option because it doesn't report the
> IPA in the case of a stage 2 fault on a stage 1 table walk.
>
> Fix both problems by pre-mapping the guest's memory at stage 2 with write
> permissions to avoid any faults. Userspace calls mlock() on the VMAs that
> back the guest's memory, pinning the pages in memory, then tells KVM to map
> the memory at stage 2 by using the VM control group KVM_ARM_VM_SPE_CTRL
> with the attribute KVM_ARM_VM_SPE_FINALIZE. KVM will map all writable VMAs
> which have the VM_LOCKED flag set. Hugetlb VMAs are effectively pinned in
> memory once they are faulted in, but mlock() doesn't set the VM_LOCKED
> flag for them and only faults the pages in; KVM will treat hugetlb VMAs as
> if they had the VM_LOCKED flag and will also map them, faulting them in if
> necessary, when handling the ioctl.
>
> VM live migration relies on a bitmap of dirty pages. This bitmap is created
> by write-protecting a memslot and updating it as KVM handles stage 2 write
> faults. Because KVM cannot handle stage 2 faults reported by the profiling
> buffer, it will not pre-map a logging memslot. This effectively means that
> profiling is not available when the VM is configured for live migration.
>
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  Documentation/virt/kvm/devices/vm.rst |  28 +++++
>  arch/arm64/include/asm/kvm_host.h     |   5 +
>  arch/arm64/include/asm/kvm_mmu.h      |   2 +
>  arch/arm64/include/uapi/asm/kvm.h     |   3 +
>  arch/arm64/kvm/arm.c                  |  78 +++++++++++-
>  arch/arm64/kvm/guest.c                |  48 ++++++++
>  arch/arm64/kvm/mmu.c                  | 169 ++++++++++++++++++++++++++
>  arch/arm64/kvm/spe.c                  |  81 ++++++++++++
>  include/kvm/arm_spe.h                 |  36 ++++++
>  9 files changed, 448 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virt/kvm/devices/vm.rst b/Documentation/virt/kvm/devices/vm.rst
> index 0aa5b1cfd700..b70798a72d8a 100644
> --- a/Documentation/virt/kvm/devices/vm.rst
> +++ b/Documentation/virt/kvm/devices/vm.rst
> @@ -314,3 +314,31 @@ Allows userspace to query the status of migration mode.
>              if it is enabled
>  :Returns:   -EFAULT if the given address is not accessible from kernel space;
>             0 in case of success.
> +
> +6. GROUP: KVM_ARM_VM_SPE_CTRL
> +===============================
> +
> +:Architectures: arm64
> +
> +6.1. ATTRIBUTE: KVM_ARM_VM_SPE_FINALIZE
> +-----------------------------------------
> +
> +Finalizes the creation of the SPE feature by mapping the guest memory in the
> +stage 2 table. Guest memory must be readable, writable and pinned in RAM, which
> +is achieved with an mlock() system call; the memory can be backed by a hugetlbfs
> +file. Memory regions from read-only or dirty page logging enabled memslots will
> +be ignored. After the call, no changes to the guest memory, including to its
> +contents, are permitted.
> +
> +Subsequent KVM_ARM_VCPU_INIT calls will cause the memory to become unmapped and
> +the feature must be finalized again before any VCPU can run.
> +
> +If any VCPUs are run before finalizing the feature, KVM_RUN will return -EPERM.
> +
> +:Parameters: none
> +:Returns:   -EAGAIN if guest memory has been modified while the call was
> +            executing
> +            -EBUSY if the feature is already initialized
> +            -EFAULT if an address backing the guest memory is invalid
> +            -ENXIO if SPE is not supported or not properly configured
> +            0 in case of success
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 5b68c06930c6..27f581750c6e 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -92,6 +92,7 @@ struct kvm_s2_mmu {
>
>  struct kvm_arch {
>         struct kvm_s2_mmu mmu;
> +       struct kvm_spe spe;
>
>         /* VTCR_EL2 value for this VM */
>         u64    vtcr;
> @@ -612,6 +613,10 @@ void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
>  void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
>  void kvm_arm_clear_debug(struct kvm_vcpu *vcpu);
>  void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu);
> +int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr);
> +int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr);
> +int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr);
> +
>  int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
>                                struct kvm_device_attr *attr);
>  int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 331394306cce..bad94662bbed 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -124,6 +124,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu);
>  void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
>  int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>                           phys_addr_t pa, unsigned long size, bool writable);
> +int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> +                          enum kvm_pgtable_prot prot);
>
>  int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index ca57dfb7abf0..8876e564ba56 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -350,6 +350,9 @@ struct kvm_vcpu_events {
>  #define   KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3
>  #define   KVM_DEV_ARM_ITS_CTRL_RESET           4
>
> +#define KVM_ARM_VM_SPE_CTRL            0
> +#define   KVM_ARM_VM_SPE_FINALIZE      0
> +
>  /* Device Control API on vcpu fd */
>  #define KVM_ARM_VCPU_PMU_V3_CTRL       0
>  #define   KVM_ARM_VCPU_PMU_V3_IRQ      0
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e51d8f328c7e..2d98248f2c66 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -41,6 +41,7 @@
>  #include <kvm/arm_hypercalls.h>
>  #include <kvm/arm_pmu.h>
>  #include <kvm/arm_psci.h>
> +#include <kvm/arm_spe.h>
>
>  #ifdef REQUIRES_VIRT
>  __asm__(".arch_extension       virt");
> @@ -653,6 +654,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>         if (unlikely(!kvm_vcpu_initialized(vcpu)))
>                 return -ENOEXEC;
>
> +       if (vcpu_has_spe(vcpu) && unlikely(!kvm_arm_spe_finalized(vcpu->kvm)))
> +               return -EPERM;
> +
>         ret = kvm_vcpu_first_run_init(vcpu);
>         if (ret)
>                 return ret;
> @@ -982,12 +986,22 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
>          * ensuring that the data side is always coherent. We still
>          * need to invalidate the I-cache though, as FWB does *not*
>          * imply CTR_EL0.DIC.
> +        *
> +        * If the guest has SPE, we need to unmap the entire address space to
> +        * allow for any changes to the VM memory made by userspace to propagate
> +        * to the stage 2 tables when SPE is re-finalized; this also makes sure
> +        * we keep the userspace and the guest's view of the memory contents
> +        * synchronized.
>          */
>         if (vcpu->arch.has_run_once) {
> -               if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
> +               if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
> +                   vcpu_has_spe(vcpu)) {
>                         stage2_unmap_vm(vcpu->kvm);
> -               else
> +                       if (vcpu_has_spe(vcpu))
> +                               kvm_arm_spe_notify_vcpu_init(vcpu);
> +               } else {
>                         __flush_icache_all();
> +               }
>         }
>
>         vcpu_reset_hcr(vcpu);
> @@ -1045,6 +1059,45 @@ static int kvm_arm_vcpu_has_attr(struct kvm_vcpu *vcpu,
>         return ret;
>  }
>
> +static int kvm_arm_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret = -ENXIO;
> +
> +       switch (attr->group) {
> +       default:
> +               ret = kvm_arm_vm_arch_set_attr(kvm, attr);
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
> +static int kvm_arm_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret = -ENXIO;
> +
> +       switch (attr->group) {
> +       default:
> +               ret = kvm_arm_vm_arch_get_attr(kvm, attr);
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
> +static int kvm_arm_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret = -ENXIO;
> +
> +       switch (attr->group) {
> +       default:
> +               ret = kvm_arm_vm_arch_has_attr(kvm, attr);
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
>  static int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>                                    struct kvm_vcpu_events *events)
>  {
> @@ -1259,6 +1312,27 @@ long kvm_arch_vm_ioctl(struct file *filp,
>
>                 return 0;
>         }
> +       case KVM_SET_DEVICE_ATTR: {
> +               struct kvm_device_attr attr;
> +
> +               if (copy_from_user(&attr, argp, sizeof(attr)))
> +                       return -EFAULT;
> +               return kvm_arm_vm_set_attr(kvm, &attr);
> +       }
> +       case KVM_GET_DEVICE_ATTR: {
> +               struct kvm_device_attr attr;
> +
> +               if (copy_from_user(&attr, argp, sizeof(attr)))
> +                       return -EFAULT;
> +               return kvm_arm_vm_get_attr(kvm, &attr);
> +       }
> +       case KVM_HAS_DEVICE_ATTR: {
> +               struct kvm_device_attr attr;
> +
> +               if (copy_from_user(&attr, argp, sizeof(attr)))
> +                       return -EFAULT;
> +               return kvm_arm_vm_has_attr(kvm, &attr);
> +       }
>         default:
>                 return -EINVAL;
>         }
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 2ba790eeb782..d0dc4bdb8b4a 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -988,3 +988,51 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>
>         return ret;
>  }
> +
> +int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret;
> +
> +       switch (attr->group) {
> +       case KVM_ARM_VM_SPE_CTRL:
> +               ret = kvm_arm_vm_spe_set_attr(kvm, attr);
> +               break;
> +       default:
> +               ret = -ENXIO;
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
> +int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret;
> +
> +       switch (attr->group) {
> +       case KVM_ARM_VM_SPE_CTRL:
> +               ret = kvm_arm_vm_spe_get_attr(kvm, attr);
> +               break;
> +       default:
> +               ret = -ENXIO;
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
> +int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       int ret;
> +
> +       switch (attr->group) {
> +       case KVM_ARM_VM_SPE_CTRL:
> +               ret = kvm_arm_vm_spe_has_attr(kvm, attr);
> +               break;
> +       default:
> +               ret = -ENXIO;
> +               break;
> +       }
> +
> +       return ret;
> +}
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c3c43555490d..31b2216a5881 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1365,6 +1365,175 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>         return ret;
>  }
>
> +static int stage2_map_vma(struct kvm *kvm,
> +                         struct kvm_memory_slot *memslot,
> +                         struct vm_area_struct *vma,
> +                         enum kvm_pgtable_prot prot,
> +                         unsigned long mmu_seq, hva_t *hvap,
> +                         struct kvm_mmu_memory_cache *cache)
> +{
> +       struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> +       unsigned long stage2_pagesize, remaining;
> +       bool force_pte, writable;
> +       hva_t hva, hva_end;
> +       kvm_pfn_t pfn;
> +       gpa_t gpa;
> +       gfn_t gfn;
> +       int ret;
> +
> +       hva = max(memslot->userspace_addr, vma->vm_start);
> +       hva_end = min(vma->vm_end, memslot->userspace_addr +
> +                       (memslot->npages << PAGE_SHIFT));
> +
> +       gpa = (memslot->base_gfn << PAGE_SHIFT) + hva - memslot->userspace_addr;
> +       gfn = gpa >> PAGE_SHIFT;
> +
> +       stage2_pagesize = 1UL << stage2_max_pageshift(memslot, vma, hva, &force_pte);
> +
> +       while (hva < hva_end) {
> +               ret = kvm_mmu_topup_memory_cache(cache,
> +                                                kvm_mmu_cache_min_pages(kvm));
> +               if (ret)
> +                       return ret;
> +
> +               /*
> +                * We start mapping with the highest possible page size, so the
> +                * gpa and gfn will always be properly aligned to the current
> +                * page size.
> +                */
> +               pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL, true, &writable);
> +               if (pfn == KVM_PFN_ERR_HWPOISON)
> +                       return -EFAULT;
> +               if (is_error_noslot_pfn(pfn))
> +                       return -EFAULT;
> +               /* Can only happen if naughty userspace changed the VMA. */
> +               if (kvm_is_device_pfn(pfn) || !writable)
> +                       return -EAGAIN;
> +
> +               spin_lock(&kvm->mmu_lock);
> +               if (mmu_notifier_retry(kvm, mmu_seq)) {
> +                       spin_unlock(&kvm->mmu_lock);
> +                       return -EAGAIN;
> +               }
> +
> +               remaining = hva_end - hva;
> +               if (stage2_pagesize == PUD_SIZE && remaining < PUD_SIZE)
> +                       stage2_pagesize = PMD_SIZE;
> +               if (stage2_pagesize == PMD_SIZE && remaining < PMD_SIZE) {
> +                       force_pte = true;
> +                       stage2_pagesize = PAGE_SIZE;
> +               }
> +
> +               if (!force_pte && stage2_pagesize == PAGE_SIZE)
> +                       /*
> +                        * The hva and gpa will always be PMD aligned if
> +                        * hva is backed by a transparent huge page. gpa will
> +                        * not be modified and it's not necessary to recompute
> +                        * hva.
> +                        */
> +                       stage2_pagesize = transparent_hugepage_adjust(memslot, hva, &pfn, &gpa);
> +
> +               ret = kvm_pgtable_stage2_map(pgt, gpa, stage2_pagesize,
> +                                            __pfn_to_phys(pfn), prot, cache);
> +               spin_unlock(&kvm->mmu_lock);
> +
> +               kvm_set_pfn_accessed(pfn);
> +               kvm_release_pfn_dirty(pfn);
> +
> +               if (ret)
> +                       return ret;
> +               else if (hva < hva_end)
> +                       cond_resched();
> +
> +               hva += stage2_pagesize;
> +               gpa += stage2_pagesize;
> +               gfn = gpa >> PAGE_SHIFT;
> +       }
> +
> +       *hvap = hva;
> +       return 0;
> +}
> +
> +int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> +                          enum kvm_pgtable_prot prot)
> +{
> +       struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +       struct vm_area_struct *vma;
> +       unsigned long mmu_seq;
> +       hva_t hva, hva_memslot_end;
> +       int ret;
> +
> +       lockdep_assert_held(&kvm->slots_lock);
> +
> +       if (!(prot & KVM_PGTABLE_PROT_R))
> +               return -EPERM;
> +       if ((prot & KVM_PGTABLE_PROT_W) && (memslot->flags & KVM_MEM_READONLY))
> +               return -EPERM;
> +
> +       hva = memslot->userspace_addr;
> +       hva_memslot_end = memslot->userspace_addr + (memslot->npages << PAGE_SHIFT);
> +
> +       /*
> +        * Be extra careful here in case userspace is messing with the VMAs
> +        * backing the memslot.
> +        */
> +       mmu_seq = kvm->mmu_notifier_seq;
> +       smp_rmb();
> +
> +       /*
> +        * A memslot might span multiple VMAs and any holes between them, while
> +        * a VMA might span multiple memslots (see
> +        * kvm_arch_prepare_memory_region()). Take the intersection of the VMAs
> +        * with the memslot.
> +        */
> +       do {
> +               mmap_read_lock(current->mm);
> +               vma = find_vma(current->mm, hva);
> +               /*
> +                * find_vma() returns first VMA with hva < vma->vm_end, which
> +                * means that it is possible for the VMA to start *after* the
> +                * end of the memslot.
> +                */
> +               if (!vma || vma->vm_start >= hva_memslot_end) {
> +                       mmap_read_unlock(current->mm);
> +                       return 0;
> +               }
> +
> +               /*
> +                * VM_LOCKED pages are put in the unevictable LRU list and
> +                * hugetlb pages are not put in any LRU list; both will stay
> +                * pinned in memory.
> +                */
> +               if (!(vma->vm_flags & VM_LOCKED) && !is_vm_hugetlb_page(vma)) {
> +                       /* Go to next VMA. */
> +                       hva = vma->vm_end;
> +                       mmap_read_unlock(current->mm);
> +                       continue;
> +               }
> +               if (!(vma->vm_flags & VM_READ) ||
> +                   ((prot & KVM_PGTABLE_PROT_W) && !(vma->vm_flags & VM_WRITE))) {
> +                       /* Go to next VMA. */
> +                       hva = vma->vm_end;
> +                       mmap_read_unlock(current->mm);
> +                       continue;
> +               }
> +               mmap_read_unlock(current->mm);
> +
> +               ret = stage2_map_vma(kvm, memslot, vma, prot, mmu_seq, &hva, &cache);
> +               if (ret)
> +                       return ret;
> +       } while (hva < hva_memslot_end);
> +
> +       if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB)) {
> +               spin_lock(&kvm->mmu_lock);
> +               stage2_flush_memslot(kvm, memslot);
> +               spin_unlock(&kvm->mmu_lock);
> +       }
> +
> +       return 0;
> +}
> +
> +
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
>  }
> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
> index f91a52cd7cd3..316ff8dfed5b 100644
> --- a/arch/arm64/kvm/spe.c
> +++ b/arch/arm64/kvm/spe.c
> @@ -10,6 +10,13 @@
>  #include <kvm/arm_spe.h>
>  #include <kvm/arm_vgic.h>
>
> +#include <asm/kvm_mmu.h>
> +

It seems that the function below is used to de-finalize the SPE status,
if I understand it correctly.
How about renaming it to something like "kvm_arm_vcpu_init_spe_definalize()"?

> +void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
> +{
> +       vcpu->kvm->arch.spe.finalized = false;
> +}
> +
>  static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
>  {
>         if (!vcpu_has_spe(vcpu))
> @@ -115,6 +122,50 @@ int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>         return -ENXIO;
>  }
>
> +static int kvm_arm_spe_finalize(struct kvm *kvm)
> +{
> +       struct kvm_memory_slot *memslot;
> +       enum kvm_pgtable_prot prot;
> +       struct kvm_vcpu *vcpu;
> +       int i, ret;
> +
> +       kvm_for_each_vcpu(i, vcpu, kvm) {
> +               if (!kvm_arm_spe_vcpu_initialized(vcpu))
> +                       return -ENXIO;
> +       }
> +
> +       mutex_unlock(&kvm->slots_lock);

Should be mutex_lock(&kvm->slots_lock);?

> +       if (kvm_arm_spe_finalized(kvm)) {
> +               mutex_unlock(&kvm->slots_lock);
> +               return -EBUSY;
> +       }
> +
> +       prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
> +       kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
> +               /* Only map memory that SPE can write to. */
> +               if (memslot->flags & KVM_MEM_READONLY)
> +                       continue;
> +                /*
> +                 * Dirty page logging will write-protect pages, which breaks
> +                 * SPE.
> +                 */
> +               if (memslot->dirty_bitmap)
> +                       continue;
> +               ret = kvm_map_locked_memslot(kvm, memslot, prot);
> +               if (ret)
> +                       break;
> +       }
> +
> +       if (!ret)
> +               kvm->arch.spe.finalized = true;
> +       mutex_unlock(&kvm->slots_lock);
> +
> +       if (ret)
> +               stage2_unmap_vm(kvm);
> +
> +       return ret;
> +}
> +
>  int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>  {
>         switch (attr->attr) {
> @@ -127,3 +178,33 @@ int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>
>         return -ENXIO;
>  }
> +
> +int kvm_arm_vm_spe_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       switch (attr->attr) {
> +       case KVM_ARM_VM_SPE_FINALIZE:
> +               return kvm_arm_spe_finalize(kvm);
> +       }
> +
> +       return -ENXIO;
> +}
> +
> +int kvm_arm_vm_spe_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +
> +int kvm_arm_vm_spe_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +       struct kvm_vcpu *vcpu;
> +       int i;
> +
> +       switch (attr->attr) {
> +       case KVM_ARM_VM_SPE_FINALIZE:
> +               kvm_for_each_vcpu(i, vcpu, kvm)
> +                       if (kvm_arm_vcpu_supports_spe(vcpu))
> +                               return 0;
> +       }
> +
> +       return -ENXIO;
> +}
> diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
> index 0275e8097529..7f9f3a03aadb 100644
> --- a/include/kvm/arm_spe.h
> +++ b/include/kvm/arm_spe.h
> @@ -18,23 +18,38 @@ struct kvm_spe_cpu {
>         bool initialized;       /* Feature is initialized on VCPU */
>  };
>
> +struct kvm_spe {
> +       bool finalized;
> +};
> +
>  #define kvm_arm_spe_irq_initialized(v)                 \
>         ((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&   \
>          (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)
>  #define kvm_arm_spe_vcpu_initialized(v)        ((v)->arch.spe_cpu.initialized)
> +#define kvm_arm_spe_finalized(k)       ((k)->arch.spe.finalized)
>
>  int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>  int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>  int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>
> +int kvm_arm_vm_spe_set_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
> +int kvm_arm_vm_spe_get_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
> +int kvm_arm_vm_spe_has_attr(struct kvm *vcpu, struct kvm_device_attr *attr);
> +
> +void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu);
> +
>  #else
>  #define kvm_arm_supports_spe() false
>
>  struct kvm_spe_cpu {
>  };
>
> +struct kvm_spe {
> +};
> +
>  #define kvm_arm_spe_irq_initialized(v) false
>  #define kvm_arm_spe_vcpu_initialized(v)        false
> +#define kvm_arm_spe_finalized(k)       false
>
>  static inline int kvm_arm_spe_set_attr(struct kvm_vcpu *vcpu,
>                                        struct kvm_device_attr *attr)
> @@ -51,5 +66,26 @@ static inline int kvm_arm_spe_has_attr(struct kvm_vcpu *vcpu,
>  {
>         return -ENXIO;
>  }
> +
> +static inline int kvm_arm_vm_spe_set_attr(struct kvm *vcpu,
> +                                         struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +
> +static inline int kvm_arm_vm_spe_get_attr(struct kvm *vcpu,
> +                                         struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +
> +static inline int kvm_arm_vm_spe_has_attr(struct kvm *vcpu,
> +                                         struct kvm_device_attr *attr)
> +{
> +       return -ENXIO;
> +}
> +
> +static inline void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu) {}
> +
>  #endif /* CONFIG_KVM_ARM_SPE */
>  #endif /* __ASM_ARM_KVM_SPE_H */
> --
> 2.29.1
>
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it
  2020-10-27 17:26 ` [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it Alexandru Elisei
@ 2020-11-19 16:58   ` James Morse
  2020-12-02 14:25     ` Alexandru Elisei
  0 siblings, 1 reply; 35+ messages in thread
From: James Morse @ 2020-11-19 16:58 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi Alex,

On 27/10/2020 17:26, Alexandru Elisei wrote:
> When a VCPU is created, the kvm_vcpu struct is initialized to zero in
> kvm_vm_ioctl_create_vcpu(). On VHE systems, the first time
> vcpu.arch.mdcr_el2 is loaded on hardware is in vcpu_load(), before it is
> set to a sensible value in kvm_arm_setup_debug() later in the run loop. The
> result is that KVM executes for a short time with MDCR_EL2 set to zero.
> 
> This is mostly harmless as we don't need to trap debug and SPE register
> accesses from EL1 (we're still running in the host at EL2), but we do set
> MDCR_EL2.HPMN to 0 which is constrained unpredictable according to ARM DDI
> 0487F.b, page D13-3620; the required behavior from the hardware in this
> case is to reserve an unknown number of registers for EL2 and EL3 exclusive
> use.
> 
> Initialize mdcr_el2 in kvm_vcpu_vcpu_first_run_init(), so we can avoid the
> constrained unpredictable behavior and to ensure that the MDCR_EL2 register
> has the same value after each vcpu_load(), including the first time the
> VCPU is run.


> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
> index 7a7e425616b5..22ee448aee2b 100644
> --- a/arch/arm64/kvm/debug.c
> +++ b/arch/arm64/kvm/debug.c
> @@ -68,6 +68,59 @@ void kvm_arm_init_debug(void)

> +static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu, u32 host_mdcr)
> +{
> +	bool trap_debug = !(vcpu->arch.flags & KVM_ARM64_DEBUG_DIRTY);
> +
> +	/*
> +	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access
> +	 * to the profiling buffer.
> +	 */
> +	vcpu->arch.mdcr_el2 = host_mdcr & MDCR_EL2_HPMN_MASK;
> +	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
> +				MDCR_EL2_TPMS |
> +				MDCR_EL2_TPMCR |
> +				MDCR_EL2_TDRA |
> +				MDCR_EL2_TDOSA);

> +	if (vcpu->guest_debug) {
> +		/* Route all software debug exceptions to EL2 */
> +		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
> +		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW)
> +			trap_debug = true;
> +	}

This had me confused for a while... could you hint that this is when the guest is being
'external'-debugged by the VMM? (it's clearer before this change)


Thanks,

James


> +	/* Trap debug register access */
> +	if (trap_debug)
> +		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDA;
> +
> +	trace_kvm_arm_set_dreg32("MDCR_EL2", vcpu->arch.mdcr_el2);
> +}
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature
  2020-10-27 17:26 ` [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature Alexandru Elisei
@ 2020-11-19 16:58   ` James Morse
  2020-12-02 14:29     ` Alexandru Elisei
  0 siblings, 1 reply; 35+ messages in thread
From: James Morse @ 2020-11-19 16:58 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi Alex,

On 27/10/2020 17:26, Alexandru Elisei wrote:
> Detect Statistical Profiling Extension (SPE) support using the cpufeatures
> framework. The presence of SPE is reported via the ARM64_SPE capability.
> 
> The feature will be necessary for emulating SPE in KVM, because KVM needs
> that all CPUs have SPE hardware to avoid scheduling a VCPU on a CPU without
> support. For this reason, the feature type ARM64_CPUCAP_SYSTEM_FEATURE has
> been selected to disallow hotplugging a CPU which doesn't support SPE.

Can you mention the existing driver in the commit message? Surprisingly it doesn't use
cpufeature at all. It looks like arm_spe_pmu_dev_init() goes out of its way to support
mismatched systems. (otherwise the significance of the new behaviour isn't clear!)

I read it as: the host is fine with mismatched systems, and the existing driver supports
this. But KVM is not. After this patch you can't make the system mismatched 'late'.


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives
  2020-10-27 17:26 ` [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives Alexandru Elisei
@ 2020-11-19 16:58   ` James Morse
  2020-12-02 15:13     ` Alexandru Elisei
  0 siblings, 1 reply; 35+ messages in thread
From: James Morse @ 2020-11-19 16:58 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi Alex,

On 27/10/2020 17:26, Alexandru Elisei wrote:
> KVM SPE emulation depends on the configuration option KVM_ARM_SPE and on
> having hardware SPE support on all CPUs.

> The host driver must be
> compiled-in because we need the SPE interrupt to be enabled; it will be
> used to kick us out of the guest when the profiling buffer management
> interrupt is asserted by the GIC (for example, when the buffer is full).

Great: SPE IRQ very important...


> Add a VCPU flag to inform KVM that the guest has SPE enabled.
> 
> It's worth noting that even though the KVM_ARM_SPE config option is gated
> by the SPE host driver being compiled-in, we don't actually check that the
> driver was loaded successfully when we advertise SPE support for guests.

Eh?

> That's because we can live with the SPE interrupt being disabled. There is
> a delay between when the SPE hardware asserts the interrupt and when the
> GIC samples the interrupt line and asserts it to the CPU. If the SPE
> interrupt is disabled at the GIC level, this delay will be larger,

How does this work? Surely the IRQ needs to be enabled before it can become pending at the
CPU to kick us out of the guest...


> at most a host timer tick.

(Because the timer brings us out of the guest anyway?)


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE
  2020-10-27 17:26 ` [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
  2020-11-05  9:58   ` Haibo Xu
@ 2020-11-19 16:58   ` James Morse
  2020-12-02 16:28     ` Alexandru Elisei
  1 sibling, 1 reply; 35+ messages in thread
From: James Morse @ 2020-11-19 16:58 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, Sudeep Holla, will, kvmarm

Hi Alex,

On 27/10/2020 17:26, Alexandru Elisei wrote:
> From: Sudeep Holla <sudeep.holla@arm.com>
> 
> To configure the virtual SPE buffer management interrupt number, we use a
> VCPU kvm_device ioctl, encapsulating the KVM_ARM_VCPU_SPE_IRQ attribute
> within the KVM_ARM_VCPU_SPE_CTRL group.
> 
> After configuring the SPE, userspace is required to call the VCPU ioctl
> with the attribute KVM_ARM_VCPU_SPE_INIT to initialize SPE on the VCPU.

> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 2acec3b9ef65..6135b9827fbe 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -161,3 +161,43 @@ Specifies the base address of the stolen time structure for this VCPU. The
>  base address must be 64 byte aligned and exist within a valid guest memory
>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>  including the layout of the stolen time structure.
> +
> +4. GROUP: KVM_ARM_VCPU_SPE_CTRL
> +===============================
> +
> +:Architectures: ARM64
> +
> +4.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
> +-----------------------------------
> +
> +:Parameters: in kvm_device_attr.addr the address for the SPE buffer management
> +             interrupt is a pointer to an int
> +
> +Returns:
> +
> +	 =======  ========================================================
> +	 -EBUSY   The SPE buffer management interrupt is already set
> +	 -EINVAL  Invalid SPE overflow interrupt number
> +	 -EFAULT  Could not read the buffer management interrupt number
> +	 -ENXIO   SPE not supported or not properly configured
> +	 =======  ========================================================
> +
> +A value describing the SPE (Statistical Profiling Extension) overflow interrupt
> +number for this vcpu. This interrupt should be a PPI and the interrupt type and
> +number must be same for each vcpu.
> +
> +4.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
> +------------------------------------
> +
> +:Parameters: no additional parameter in kvm_device_attr.addr
> +
> +Returns:
> +
> +	 =======  ======================================================
> +	 -EBUSY   SPE already initialized
> +	 -ENODEV  GIC not initialized
> +	 -ENXIO   SPE not supported or not properly configured
> +	 =======  ======================================================

> +Request the initialization of the SPE. Must be done after initializing the
> +in-kernel irqchip and after setting the interrupt number for the VCPU.

Fantastic!


> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index f32490229a4c..4dc205fa4be1 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -87,6 +87,9 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_ARM_PTRAUTH_GENERIC:
>  		r = system_has_full_ptr_auth();
>  		break;
> +	case KVM_CAP_ARM_SPE:
> +		r = kvm_arm_supports_spe();
> +		break;
>  	default:
>  		r = 0;
>  	}
> @@ -223,6 +226,19 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
>  	return 0;
>  }
>  
> +static int kvm_vcpu_enable_spe(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvm_arm_supports_spe())
> +		return -EINVAL;
> +
> +	/* SPE is disabled if the PE is in AArch32 state */
> +	if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
> +		return -EINVAL;
> +
> +	vcpu->arch.flags |= KVM_ARM64_GUEST_HAS_SPE;
> +	return 0;
> +}

VCPU-reset promotes the VMM feature into flags. How does this interact with
kvm_arm_spe_init()?

It doesn't look like this resets any state, couldn't it be done once by kvm_arm_spe_init()?


>  /**
>   * kvm_reset_vcpu - sets core registers and sys_regs to reset value
>   * @vcpu: The VCPU pointer
> @@ -274,6 +290,13 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>  		}
>  	}
>  
> +	if (test_bit(KVM_ARM_VCPU_SPE, vcpu->arch.features)) {
> +		if (kvm_vcpu_enable_spe(vcpu)) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	}
> +
>  	switch (vcpu->arch.target) {
>  	default:
>  		if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {

> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
> new file mode 100644
> index 000000000000..f91a52cd7cd3
> --- /dev/null
> +++ b/arch/arm64/kvm/spe.c
> @@ -0,0 +1,129 @@

> +static bool kvm_arm_spe_irq_is_valid(struct kvm *kvm, int irq)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +
> +	/* The SPE overflow interrupt can be a PPI only */
> +	if (!irq_is_ppi(irq))
> +		return false;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (!kvm_arm_spe_irq_initialized(vcpu))
> +			continue;
> +
> +		if (vcpu->arch.spe_cpu.irq_num != irq)
> +			return false;
> +	}

Looks like you didn't really want a vcpu property! (huh, patch 10 adds a vm property too)
We're making this a vcpu property because of the PPI and system registers? (both good reasons)

If the PPI number lived in struct kvm_arch, you'd only need to check it was
uninitialised, or the same, to get the same behaviour, which would save some of this error
handling.
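
Roughly what I mean (kvm->arch.spe_irq_num is a made-up field, just to illustrate):

	/* 0 means "not set yet"; otherwise it must be the same PPI everywhere */
	if (kvm->arch.spe_irq_num && kvm->arch.spe_irq_num != irq)
		return false;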


> +	return true;
> +}

> diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
> index 46ec447ed013..0275e8097529 100644
> --- a/include/kvm/arm_spe.h
> +++ b/include/kvm/arm_spe.h
> @@ -18,11 +18,38 @@ struct kvm_spe_cpu {
>  	bool initialized; 	/* Feature is initialized on VCPU */
>  };
>  
> +#define kvm_arm_spe_irq_initialized(v)			\
> +	((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&	\
> +	 (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)

Didn't GICv(mumbles) add an additional PPI range? Could this be made irq_is_ppi(), that
way if the vgic gains support for that, we don't get weird behaviour here?


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE
  2020-10-27 17:26 ` [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE Alexandru Elisei
  2020-11-05 10:10   ` Haibo Xu
@ 2020-11-19 16:59   ` James Morse
  2021-03-23 14:27     ` Alexandru Elisei
  1 sibling, 1 reply; 35+ messages in thread
From: James Morse @ 2020-11-19 16:59 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi Alex,

On 27/10/2020 17:26, Alexandru Elisei wrote:
> Stage 2 faults triggered by the profiling buffer attempting to write to
> memory are reported by the SPE hardware by asserting a buffer management
> event interrupt. Interrupts are by their nature asynchronous, which means
> that the guest might have changed its stage 1 translation tables since the
> attempted write. SPE reports the guest virtual address that caused the data
> abort, but not the IPA, which means that KVM would have to walk the guest's
> stage 1 tables to find the IPA; using the AT instruction to walk the
> guest's tables in hardware is not an option because it doesn't report the
> IPA in the case of a stage 2 fault on a stage 1 table walk.

Great detailed description; I think a summary helps identify 'both' problems:
| To work reliably, both the profiling buffer and the page tables to reach it must not
| fault at stage2.

> Fix both problems by pre-mapping the guest's memory at stage 2 with write
> permissions to avoid any faults. Userspace calls mlock() on the VMAs that
> back the guest's memory, pinning the pages in memory, then tells KVM to map
> the memory at stage 2 by using the VM control group KVM_ARM_VM_SPE_CTRL
> with the attribute KVM_ARM_VM_SPE_FINALIZE.

The reason to have this feature is SPE, but is there anything SPE specific in the feature?

I can imagine this being useful on its own if I wanted to reduce guest-exits for
quasi-real-time reasons, and had memory to burn!

(as an independent feature, it might be useful on other architectures too...)


Would it make sense to add this as a flag to KVM_SET_USER_MEMORY_REGION? That is the point
that the userspace_addr is provided to KVM, this would allow us to fail the call if a
KVM_MEM_LOCKED memslot can't be created because the underlying VMA aren't VM_LOCKED.

(it also makes it easy to catch incompatible changes of flags in the future)

/me wanders off musing if this can then be combined with VM_PFNMAP in
kvm_arch_prepare_memory_region()....
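
To make that concrete, roughly this in kvm_arch_prepare_memory_region() (KVM_MEM_LOCKED
is a made-up flag name, and a real version would walk every VMA in the range):

	if (mem->flags & KVM_MEM_LOCKED) {
		struct vm_area_struct *vma;
		int err = 0;

		mmap_read_lock(current->mm);
		vma = find_vma(current->mm, mem->userspace_addr);
		/* Refuse the memslot if the backing VMA isn't mlock()'ed */
		if (!vma || !(vma->vm_flags & VM_LOCKED))
			err = -EINVAL;
		mmap_read_unlock(current->mm);
		if (err)
			return err;
	}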


> KVM will map all writable VMAs which have the VM_LOCKED flag set.

> Hugetlb VMAs are practically pinned in
> memory after they are faulted in and mlock() doesn't set the VM_LOCKED
> flag, and just faults the pages in;

Ugh. It would be nice to avoid special casing this. KVM shouldn't have to care about the
difference between a hugetlbfs PMD and a THP PMD.

From mlock_fixup(), it looks like this is because these VMA can't be split.
Is it possible to change this if mlock() is called for the whole range? (user-space must
know it's hugetlbfs!)

Alternatively, it would be good if mm could tell us when a page is locked (and/or special
cased). That way dax does the right thing too, without having extra special casing in KVM.
This would also catch VM_PFNMAP if mm knows it's effectively the same as VM_LOCKED...


> KVM will treat hugetlb VMAs like they
> have the VM_LOCKED flag and will also map them, faulting them in if
> necessary, when handling the ioctl.

Surely user-space should call mlock() to do the faulting in? (and do that before handing
the memory over to KVM)

Getting KVM to do it will create a loop via the mmu_notifier if this touches a COW page,
which in turn bumps the sequence counter causing us to bomb out with -EAGAIN.
(it looks like wp_page_copy() is the only case that calls set_pte_at_notify())


> VM live migration relies on a bitmap of dirty pages. This bitmap is created
> by write-protecting a memslot and updating it as KVM handles stage 2 write
> faults. Because KVM cannot handle stage 2 faults reported by the profiling
> buffer, it will not pre-map a logging memslot. This effectively means that
> profiling is not available when the VM is configured for live migration.

Yeah ... that sucks. Have any of the Qemu folk said what they'd like to see here?

I can imagine making the logging-enable call fail if any CPU has SPE profiling enabled, as
the logging will change the results of SPE... We'd then need an exit to user-space to say
that the vcpu tried to enable SPE while logging was active. Qemu can then decide whether
to block that vcpu until migration completes, or abort migration.
But: I've no idea how Qemu manages migration, so it may not be able to do irregular things
like this.

As a short cut, can we let the arch code fail calls that make problematic changes? (e.g.
setting KVM_MEM_LOG_DIRTY_PAGES or KVM_MEM_READONLY). It looks like you currently document
these as silently breaking something else... (an invitation to debug a subtle interaction
in the future!)

~

How does this interact with KSM?
I can see its try_to_merge_one_page() calling write_protect_page() before testing the
vm_flags for VM_LOCKED ... so it doesn't look like mlock() stops KSM from doing its work -
which in turn will cause stage2 faults.

It looks like this is all hinged on VM_MERGEABLE, which can be cleared with an madvise()
call using MADV_UNMERGEABLE ... but from the man page at least this is to undo a previous
hint.

I can't find what sets this for a regular vma, so presumably it's not... see what you think;
I reckon we need to add "no madvise() MADV_MERGEABLE" to the documentation, and get KVM to
check the corresponding vma flag when it looks for VM_LOCKED regions.

I think the 'testing flags' is justified, even though we can't enforce they don't change,
as we can catch a stage2 fault that shouldn't have happened.
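
i.e. something along these lines in the VMA walk (untested; skip or refuse, whichever we
settle on):

	/* KSM can write-protect the pages behind our back, breaking SPE */
	if (vma->vm_flags & VM_MERGEABLE) {
		mmap_read_unlock(current->mm);
		return -EINVAL;
	}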


> diff --git a/Documentation/virt/kvm/devices/vm.rst b/Documentation/virt/kvm/devices/vm.rst
> index 0aa5b1cfd700..b70798a72d8a 100644
> --- a/Documentation/virt/kvm/devices/vm.rst
> +++ b/Documentation/virt/kvm/devices/vm.rst
> @@ -314,3 +314,31 @@ Allows userspace to query the status of migration mode.
>  	     if it is enabled
>  :Returns:   -EFAULT if the given address is not accessible from kernel space;
>  	    0 in case of success.
> +
> +6. GROUP: KVM_ARM_VM_SPE_CTRL
> +===============================
> +
> +:Architectures: arm64
> +
> +6.1. ATTRIBUTE: KVM_ARM_VM_SPE_FINALIZE
> +-----------------------------------------
> +
> +Finalizes the creation of the SPE feature by mapping the guest memory in the
> +stage 2 table. Guest memory must be readable, writable and pinned in RAM, which
> +is achieved with an mlock() system call;

(I first read this as mlock() makes memory writeable...)


> the memory can be backed by a hugetlbfs
> +file. Memory regions from read-only or dirty page logging enabled memslots will
> +be ignored. After the call, no changes to the guest memory,

> including to its contents, are permitted.

If guest memory is pinned as writeable, why can't the VMM write to it? Doesn't this
requirement preclude virtio?

Is 'no messing with the memslots' enforced in any way?


> +Subsequent KVM_ARM_VCPU_INIT calls will cause the memory to become unmapped and
> +the feature must be finalized again before any VCPU can run.
> +
> +If any VCPUs are run before finalizing the feature, KVM_RUN will return -EPERM.
> +
> +:Parameters: none
> +:Returns:   -EAGAIN if guest memory has been modified while the call was
> +            executing
> +            -EBUSY if the feature is already initialized
> +            -EFAULT if an address backing the guest memory is invalid
> +            -ENXIO if SPE is not supported or not properly configured
> +            0 in case of success

If we need a one-shot finalise call that sets up stage2, is there any mileage in KVM
reporting how much memory it pinned to stage2? This is so that the VMM can know it got the
mmap()/mlock() setup correct? Otherwise we depend on noticing silent failures some time
later... (I prefer the 'all or nothing' for a memslot though.)


> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e51d8f328c7e..2d98248f2c66 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -41,6 +41,7 @@
>  #include <kvm/arm_hypercalls.h>
>  #include <kvm/arm_pmu.h>
>  #include <kvm/arm_psci.h>
> +#include <kvm/arm_spe.h>
>  
>  #ifdef REQUIRES_VIRT
>  __asm__(".arch_extension	virt");
> @@ -653,6 +654,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  	if (unlikely(!kvm_vcpu_initialized(vcpu)))
>  		return -ENOEXEC;
>  
> +	if (vcpu_has_spe(vcpu) && unlikely(!kvm_arm_spe_finalized(vcpu->kvm)))
> +		return -EPERM;

(does the unlikely() make a difference here?)


>  	ret = kvm_vcpu_first_run_init(vcpu);
>  	if (ret)
>  		return ret;
> @@ -982,12 +986,22 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
>  	 * ensuring that the data side is always coherent. We still
>  	 * need to invalidate the I-cache though, as FWB does *not*
>  	 * imply CTR_EL0.DIC.
> +	 *
> +	 * If the guest has SPE, we need to unmap the entire address space to
> +	 * allow for any changes to the VM memory made by userspace to propagate
> +	 * to the stage 2 tables when SPE is re-finalized;

This is about the layout of memory (instead of the contents)? Doesn't this get
synchronised by the mmu_notifier?

This is registered during kvm_create_vm(), and unregistered during kvm_destroy_vm()... so
it will see any changes either side of this call...


(the existing call is about cleaning the initial state that the VMM re-wrote to the PoC. I
can't see how SPE or memory pinning fit in here)


>          this also makes sure
> +	 * we keep the userspace and the guest's view of the memory contents
> +	 * synchronized.
>  	 */
>  	if (vcpu->arch.has_run_once) {
> -		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
> +		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
> +		    vcpu_has_spe(vcpu)) {
>  			stage2_unmap_vm(vcpu->kvm);
> -		else
> +			if (vcpu_has_spe(vcpu))
> +				kvm_arm_spe_notify_vcpu_init(vcpu);
> +		} else {
>  			__flush_icache_all();
> +		}
>  	}
>  
>  	vcpu_reset_hcr(vcpu);


From here...

> @@ -1045,6 +1059,45 @@ static int kvm_arm_vcpu_has_attr(struct kvm_vcpu *vcpu,
>  	return ret;
>  }
>  
> +static int kvm_arm_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret = -ENXIO;
> +
> +	switch (attr->group) {
> +	default:
> +		ret = kvm_arm_vm_arch_set_attr(kvm, attr);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int kvm_arm_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret = -ENXIO;
> +
> +	switch (attr->group) {
> +	default:
> +		ret = kvm_arm_vm_arch_get_attr(kvm, attr);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int kvm_arm_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret = -ENXIO;
> +
> +	switch (attr->group) {
> +	default:
> +		ret = kvm_arm_vm_arch_has_attr(kvm, attr);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
>  static int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>  				   struct kvm_vcpu_events *events)
>  {
> @@ -1259,6 +1312,27 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  
>  		return 0;
>  	}
> +	case KVM_SET_DEVICE_ATTR: {
> +		struct kvm_device_attr attr;
> +
> +		if (copy_from_user(&attr, argp, sizeof(attr)))
> +			return -EFAULT;
> +		return kvm_arm_vm_set_attr(kvm, &attr);
> +	}
> +	case KVM_GET_DEVICE_ATTR: {
> +		struct kvm_device_attr attr;
> +
> +		if (copy_from_user(&attr, argp, sizeof(attr)))
> +			return -EFAULT;
> +		return kvm_arm_vm_get_attr(kvm, &attr);
> +	}
> +	case KVM_HAS_DEVICE_ATTR: {
> +		struct kvm_device_attr attr;
> +
> +		if (copy_from_user(&attr, argp, sizeof(attr)))
> +			return -EFAULT;
> +		return kvm_arm_vm_has_attr(kvm, &attr);
> +	}
>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 2ba790eeb782..d0dc4bdb8b4a 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -988,3 +988,51 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>  
>  	return ret;
>  }
> +
> +int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret;
> +
> +	switch (attr->group) {
> +	case KVM_ARM_VM_SPE_CTRL:
> +		ret = kvm_arm_vm_spe_set_attr(kvm, attr);
> +		break;
> +	default:
> +		ret = -ENXIO;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret;
> +
> +	switch (attr->group) {
> +	case KVM_ARM_VM_SPE_CTRL:
> +		ret = kvm_arm_vm_spe_get_attr(kvm, attr);
> +		break;
> +	default:
> +		ret = -ENXIO;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
> +{
> +	int ret;
> +
> +	switch (attr->group) {
> +	case KVM_ARM_VM_SPE_CTRL:
> +		ret = kvm_arm_vm_spe_has_attr(kvm, attr);
> +		break;
> +	default:
> +		ret = -ENXIO;
> +		break;
> +	}
> +
> +	return ret;
> +}

... to here, is almost entirely boilerplate for supporting 0-or-more VM ioctls. Could this
be a separate preparatory patch, just so it isn't wrapped up in the SPE/memory-pinning
specifics?


> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c3c43555490d..31b2216a5881 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1365,6 +1365,175 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static int stage2_map_vma(struct kvm *kvm,
> +			  struct kvm_memory_slot *memslot,
> +			  struct vm_area_struct *vma,
> +			  enum kvm_pgtable_prot prot,
> +			  unsigned long mmu_seq, hva_t *hvap,
> +			  struct kvm_mmu_memory_cache *cache)
> +{
> +	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> +	unsigned long stage2_pagesize, remaining;
> +	bool force_pte, writable;
> +	hva_t hva, hva_end;
> +	kvm_pfn_t pfn;
> +	gpa_t gpa;
> +	gfn_t gfn;
> +	int ret;
> +
> +	hva = max(memslot->userspace_addr, vma->vm_start);
> +	hva_end = min(vma->vm_end, memslot->userspace_addr +
> +			(memslot->npages << PAGE_SHIFT));
> +
> +	gpa = (memslot->base_gfn << PAGE_SHIFT) + hva - memslot->userspace_addr;
> +	gfn = gpa >> PAGE_SHIFT;
> +
> +	stage2_pagesize = 1UL << stage2_max_pageshift(memslot, vma, hva, &force_pte);
> +
> +	while (hva < hva_end) {
> +		ret = kvm_mmu_topup_memory_cache(cache,
> +						 kvm_mmu_cache_min_pages(kvm));
> +		if (ret)
> +			return ret;
> +
> +		/*
> +		 * We start mapping with the highest possible page size, so the
> +		 * gpa and gfn will always be properly aligned to the current
> +		 * page size.
> +		 */
> +		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL, true, &writable);

Heh, if this causes the stage1 page tables to be changed, it will invoke the mmu notifier,
which will cause us to fail with -EAGAIN afterwards. User-space could keep retrying, and
it would fix a page at a time...

Passing atomic here would stop this, as we don't want to update the stage1 tables. If they
haven't been set up as needed, then this should fail early, with the finger pointing at
stage1. This way we don't mask a bug in user-space, and get caught out by 'this used to work'.

(or is this what prevents access-flag faults at stage1?)
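
i.e., keeping the rest of the arguments as they are in the patch:

	/* atomic=true: don't fault anything into the stage 1 tables here; if
	 * userspace didn't mlock()/populate them, fail instead of fixing it up */
	pfn = __gfn_to_pfn_memslot(memslot, gfn, true, NULL, true, &writable);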


> +		if (pfn == KVM_PFN_ERR_HWPOISON)
> +			return -EFAULT;
> +		if (is_error_noslot_pfn(pfn))

Doesn't is_error_noslot_pfn() cover KVM_PFN_ERR_HWPOISON?


> +			return -EFAULT;
> +		/* Can only happen if naughty userspace changed the VMA. */
> +		if (kvm_is_device_pfn(pfn) || !writable)
> +			return -EAGAIN;

kvm_release_pfn_(*cough*)() ?

My reading is __gfn_to_pfn_memslot() calls gup, which takes a reference you release (or
adjust) at the end of the loop.


> +		spin_lock(&kvm->mmu_lock);
> +		if (mmu_notifier_retry(kvm, mmu_seq)) {
> +			spin_unlock(&kvm->mmu_lock);

> +			return -EAGAIN;

(same again)


> +		}
> +
> +		remaining = hva_end - hva;
> +		if (stage2_pagesize == PUD_SIZE && remaining < PUD_SIZE)
> +			stage2_pagesize = PMD_SIZE;
> +		if (stage2_pagesize == PMD_SIZE && remaining < PMD_SIZE) {

> +			force_pte = true;

I had to sleep on this one: You're forced to put down a PTE because of the remaining size
in the memslot? This is to prevent rolling back up to a THP size if that is what stage1 is
using?


> +			stage2_pagesize = PAGE_SIZE;
> +		}
> +
> +		if (!force_pte && stage2_pagesize == PAGE_SIZE)

> +			/*
> +			 * The hva and gpa will always be PMD aligned if
> +			 * hva is backed by a transparent huge page.

because you walk through the vma in order... but what about the first page?

What stops me starting my memslot on a 1MB boundary, which is half way through a 2MB THP?
Doesn't the 'hva=max()' align hva up to the memslot boundary?



>                          gpa will
> +			 * not be modified and it's not necessary to recompute
> +			 * hva.
> +			 */
> +			stage2_pagesize = transparent_hugepage_adjust(memslot, hva, &pfn, &gpa);
> +
> +		ret = kvm_pgtable_stage2_map(pgt, gpa, stage2_pagesize,
> +					     __pfn_to_phys(pfn), prot, cache);
> +		spin_unlock(&kvm->mmu_lock);


> +		kvm_set_pfn_accessed(pfn);

This leads into mark_page_accessed(), which has:
|		 * Unevictable pages are on the "LRU_UNEVICTABLE" list. But,
|		 * this list is never rotated or maintained, so marking an
|		 * evictable page accessed has no effect.

This is to tell swap 'not yet'? ... Isn't that impossible by this point?



> +		kvm_release_pfn_dirty(pfn);

> +		if (ret)
> +			return ret;
> +		else if (hva < hva_end)
> +			cond_resched();

(we do this even for the last time round the loop as hva hasn't been updated yet)


> +		hva += stage2_pagesize;
> +		gpa += stage2_pagesize;
> +		gfn = gpa >> PAGE_SHIFT;
> +	}
> +
> +	*hvap = hva;
> +	return 0;
> +}



> +int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> +			   enum kvm_pgtable_prot prot)
> +{
> +	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +	struct vm_area_struct *vma;
> +	unsigned long mmu_seq;
> +	hva_t hva, hva_memslot_end;
> +	int ret;
> +
> +	lockdep_assert_held(&kvm->slots_lock);

> +	if (!(prot & KVM_PGTABLE_PROT_R))
> +		return -EPERM;
> +	if ((prot & KVM_PGTABLE_PROT_W) && (memslot->flags & KVM_MEM_READONLY))
> +		return -EPERM;

This is checking the static value from kvm_arm_spe_finalize()?


> +	hva = memslot->userspace_addr;
> +	hva_memslot_end = memslot->userspace_addr + (memslot->npages << PAGE_SHIFT);
> +
> +	/*
> +	 * Be extra careful here in case userspace is messing with the VMAs
> +	 * backing the memslot.
> +	 */

If we held mmap_read_lock() for the duration, wouldn't that be impossible?
(and after that point we can scream from the mmu_notifier if a memslot is changed...)


> +	mmu_seq = kvm->mmu_notifier_seq;
> +	smp_rmb();
> +
> +	/*
> +	 * A memslot might span multiple VMAs and any holes between them, while
> +	 * a VMA might span multiple memslots (see
> +	 * kvm_arch_prepare_memory_region()). Take the intersection of the VMAs
> +	 * with the memslot.
> +	 */
> +	do {
> +		mmap_read_lock(current->mm);
> +		vma = find_vma(current->mm, hva);
> +		/*
> +		 * find_vma() returns first VMA with hva < vma->vm_end, which
> +		 * means that it is possible for the VMA to start *after* the
> +		 * end of the memslot.
> +		 */
> +		if (!vma || vma->vm_start >= hva_memslot_end) {
> +			mmap_read_unlock(current->mm);
> +			return 0;
> +		}
> +
> +		/*
> +		 * VM_LOCKED pages are put in the unevictable LRU list and
> +		 * hugetlb pages are not put in any LRU list; both will stay
> +		 * pinned in memory.
> +		 */
> +		if (!(vma->vm_flags & VM_LOCKED) && !is_vm_hugetlb_page(vma)) {
> +			/* Go to next VMA. */
> +			hva = vma->vm_end;
> +			mmap_read_unlock(current->mm);
> +			continue;
> +		}
> +		if (!(vma->vm_flags & VM_READ) ||
> +		    ((prot & KVM_PGTABLE_PROT_W) && !(vma->vm_flags & VM_WRITE))) {
> +			/* Go to next VMA. */
> +			hva = vma->vm_end;
> +			mmap_read_unlock(current->mm);
> +			continue;
> +		}
> +		mmap_read_unlock(current->mm);

Can't a writer now come in and remove the vma, which you then pass to:

> +		ret = stage2_map_vma(kvm, memslot, vma, prot, mmu_seq, &hva, &cache);

As this only reads from the stage1 entries, I think you may be able to hold a read lock
for the duration of the loop. (if we tell gup not to write new entries)
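
Roughly (untested, and assuming gup is told not to fault in new entries, as above):

	mmap_read_lock(current->mm);
	do {
		vma = find_vma(current->mm, hva);
		if (!vma || vma->vm_start >= hva_memslot_end)
			break;

		/* the VM_LOCKED/hugetlb and VM_READ/VM_WRITE checks from the
		 * patch go here, skipping to vma->vm_end as before */

		ret = stage2_map_vma(kvm, memslot, vma, prot, mmu_seq, &hva, &cache);
		if (ret)
			break;
	} while (hva < hva_memslot_end);
	mmap_read_unlock(current->mm);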


> +		if (ret)
> +			return ret;
> +	} while (hva < hva_memslot_end);
> +
> +	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB)) {
> +		spin_lock(&kvm->mmu_lock);
> +		stage2_flush_memslot(kvm, memslot);
> +		spin_unlock(&kvm->mmu_lock);
> +	}
> +
> +	return 0;
> +}
> +
> +
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
>  }
> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
> index f91a52cd7cd3..316ff8dfed5b 100644
> --- a/arch/arm64/kvm/spe.c
> +++ b/arch/arm64/kvm/spe.c
> @@ -10,6 +10,13 @@
>  #include <kvm/arm_spe.h>
>  #include <kvm/arm_vgic.h>
>  
> +#include <asm/kvm_mmu.h>
> +
> +void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
> +{
> +	vcpu->kvm->arch.spe.finalized = false;
> +}
> +
>  static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
>  {
>  	if (!vcpu_has_spe(vcpu))
> @@ -115,6 +122,50 @@ int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>  	return -ENXIO;
>  }
>  
> +static int kvm_arm_spe_finalize(struct kvm *kvm)
> +{
> +	struct kvm_memory_slot *memslot;
> +	enum kvm_pgtable_prot prot;
> +	struct kvm_vcpu *vcpu;
> +	int i, ret;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (!kvm_arm_spe_vcpu_initialized(vcpu))
> +			return -ENXIO;
> +	}
> +
> +	mutex_unlock(&kvm->slots_lock);

Typo?


> +	if (kvm_arm_spe_finalized(kvm)) {

> +		mutex_unlock(&kvm->slots_lock);

> +		return -EBUSY;
> +	}
> +
> +	prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
> +	kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
> +		/* Only map memory that SPE can write to. */
> +		if (memslot->flags & KVM_MEM_READONLY)
> +			continue;
> +		 /*
> +		  * Dirty page logging will write-protect pages, which breaks
> +		  * SPE.
> +		  */
> +		if (memslot->dirty_bitmap)
> +			continue;

This silently skips regions that set KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE, which should be
harmless until KVM_CLEAR_DIRTY_LOG clears the bitmap bits and write-protects the pages
(the runtime update ends in kvm_mmu_write_protect_pt_masked())

It's the silent bit that bothers me. If this were done as a memslot flag, we could tell
the VMM whether it's the mm flags on the vma we can't cope with, or the KVM flag on the
memslot.


> +		ret = kvm_map_locked_memslot(kvm, memslot, prot);
> +		if (ret)
> +			break;
> +	}
> +
> +	if (!ret)
> +		kvm->arch.spe.finalized = true;
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	if (ret)
> +		stage2_unmap_vm(kvm);

We haven't put in any invalid mappings; is this needed?



> +
> +	return ret;
> +}


I think separating the boiler plate, and SPE bits from the stage2/mm code would make this
patch simpler.


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it
  2020-11-19 16:58   ` James Morse
@ 2020-12-02 14:25     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 14:25 UTC (permalink / raw)
  To: James Morse; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi James,

Many thanks for having a look at the series!

On 11/19/20 4:58 PM, James Morse wrote:
> Hi Alex,
>
> On 27/10/2020 17:26, Alexandru Elisei wrote:
>> When a VCPU is created, the kvm_vcpu struct is initialized to zero in
>> kvm_vm_ioctl_create_vcpu(). On VHE systems, the first time
>> vcpu.arch.mdcr_el2 is loaded on hardware is in vcpu_load(), before it is
>> set to a sensible value in kvm_arm_setup_debug() later in the run loop. The
>> result is that KVM executes for a short time with MDCR_EL2 set to zero.
>>
>> This is mostly harmless as we don't need to trap debug and SPE register
>> accesses from EL1 (we're still running in the host at EL2), but we do set
>> MDCR_EL2.HPMN to 0 which is constrained unpredictable according to ARM DDI
>> 0487F.b, page D13-3620; the required behavior from the hardware in this
>> case is to reserve an unknown number of registers for EL2 and EL3 exclusive
>> use.
>>
>> Initialize mdcr_el2 in kvm_vcpu_vcpu_first_run_init(), so we can avoid the
>> constrained unpredictable behavior and to ensure that the MDCR_EL2 register
>> has the same value after each vcpu_load(), including the first time the
>> VCPU is run.
>
>> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
>> index 7a7e425616b5..22ee448aee2b 100644
>> --- a/arch/arm64/kvm/debug.c
>> +++ b/arch/arm64/kvm/debug.c
>> @@ -68,6 +68,59 @@ void kvm_arm_init_debug(void)
>> +static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu, u32 host_mdcr)
>> +{
>> +	bool trap_debug = !(vcpu->arch.flags & KVM_ARM64_DEBUG_DIRTY);
>> +
>> +	/*
>> +	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access
>> +	 * to the profiling buffer.
>> +	 */
>> +	vcpu->arch.mdcr_el2 = host_mdcr & MDCR_EL2_HPMN_MASK;
>> +	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
>> +				MDCR_EL2_TPMS |
>> +				MDCR_EL2_TPMCR |
>> +				MDCR_EL2_TDRA |
>> +				MDCR_EL2_TDOSA);
>> +	if (vcpu->guest_debug) {
>> +		/* Route all software debug exceptions to EL2 */
>> +		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
>> +		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW)
>> +			trap_debug = true;
>> +	}
> This had me confused for a while... could you hint that this is when the guest is being
> 'external'-debugged by the VMM? (it's clearer before this change)

I can put a comment above the if statement, similar to the one in
kvm_arm_setup_debug(), where this code is lifted from:

        /* Is the VCPU being debugged by userspace? */
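
i.e. something like this in kvm_arm_setup_mdcr_el2() (sketch):

	/* Is the VCPU being debugged ('external' debug) by userspace? */
	if (vcpu->guest_debug) {
		/* Route all software debug exceptions to EL2 */
		vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW)
			trap_debug = true;
	}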

What do you think?

Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature
  2020-11-19 16:58   ` James Morse
@ 2020-12-02 14:29     ` Alexandru Elisei
  2020-12-02 17:23       ` Will Deacon
  0 siblings, 1 reply; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 14:29 UTC (permalink / raw)
  To: James Morse; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi James,

On 11/19/20 4:58 PM, James Morse wrote:
> Hi Alex,
>
> On 27/10/2020 17:26, Alexandru Elisei wrote:
>> Detect Statistical Profiling Extension (SPE) support using the cpufeatures
>> framework. The presence of SPE is reported via the ARM64_SPE capability.
>>
>> The feature will be necessary for emulating SPE in KVM, because KVM needs
>> that all CPUs have SPE hardware to avoid scheduling a VCPU on a CPU without
>> support. For this reason, the feature type ARM64_CPUCAP_SYSTEM_FEATURE has
>> been selected to disallow hotplugging a CPU which doesn't support SPE.
> Can you mention the existing driver in the commit message? Surprisingly it doesn't use
> cpufeature at all. It looks like arm_spe_pmu_dev_init() goes out of its way to support
> mismatched systems. (otherwise the significance of the new behaviour isn't clear!)
>
> I read it as: the host is fine with mismatched systems, and the existing driver supports
> this. But KVM is not. After this patch you can't make the system mismatched 'late'.

That was exactly my intention. Certainly, I'll try to make the commit message
clearer by mentioning the SPE driver.

Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives
  2020-11-19 16:58   ` James Morse
@ 2020-12-02 15:13     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 15:13 UTC (permalink / raw)
  To: James Morse; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi James,

On 11/19/20 4:58 PM, James Morse wrote:
> Hi Alex,
>
> On 27/10/2020 17:26, Alexandru Elisei wrote:
>> KVM SPE emulation depends on the configuration option KVM_ARM_SPE and on
>> having hardware SPE support on all CPUs.
>> The host driver must be
>> compiled-in because we need the SPE interrupt to be enabled; it will be
>> used to kick us out of the guest when the profiling buffer management
>> interrupt is asserted by the GIC (for example, when the buffer is full).
> Great: SPE IRQ very important...

Within reason.

>
>
>> Add a VCPU flag to inform KVM that the guest has SPE enabled.
>>
>> It's worth noting that even though the KVM_ARM_SPE config option is gated
>> by the SPE host driver being compiled-in, we don't actually check that the
>> driver was loaded successfully when we advertise SPE support for guests.
> Eh?

Yes, this looks half-baked, and probably is, because:

1. I'm not sure I haven't missed anything with my approach to handling the SPE
interrupt triggered by the guest (details in the cover letter). The other option
would be to use the SPE driver IRQ handler, which makes this moot.

2. The SPE driver probing fails when the host has kpti enabled (the kernel can't
profile userspace). In my opinion, this shouldn't affect SPE support for guests,
but I didn't want to modify the SPE driver at this stage because of 1.

If we agree on my approach to guest SPE interrupt handling, this patch can be
improved by at least checking that the SPE driver probed successfully and taking
the case where kpti is enabled into consideration.

>
>> That's because we can live with the SPE interrupt being disabled. There is
>> a delay between when the SPE hardware asserts the interrupt and when the
>> GIC samples the interrupt line and asserts it to the CPU. If the SPE
>> interrupt is disabled at the GIC level, this delay will be larger,
> How does this work? Surely the IRQ needs to be enabled before it can become pending at the
> CPU to kick us out of the guest...

As long as the SPE hardware asserts the buffer management interrupt to the GIC
(PMBSR_EL1.S = 1), no profiling is done. If the interrupt is not enabled at the
GIC level, then the CPU will not take the interrupt (obviously). But as far as the
SPE hardware is concerned, the interrupt is asserted and profiling is disabled.
The host checks the PMBSR_EL1.S bit on every VM exit to see if SPE has asserted the
interrupt, so there's no dependency on the GIC asserting the interrupt. The SPE
interrupt being disabled at the GIC level is not as bad as it sounds (but it's
definitely not ideal) because there will always be a delay between the SPE
hardware asserting the interrupt to the GIC and the GIC asserting it to the CPU.
Not requiring the interrupt to be enabled at the GIC level makes that delay longer
in the case where the host driver failed probing.
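
For reference, the check I have in mind on the VM exit path is roughly the following
(function name and placement are illustrative, not the final code):

	static void kvm_arm_spe_sync_hwstate(struct kvm_vcpu *vcpu)
	{
		u64 pmbsr = read_sysreg_s(SYS_PMBSR_EL1);

		/*
		 * As long as PMBSR_EL1.S is set, the SPE hardware keeps the
		 * buffer management interrupt asserted and profiling stopped,
		 * whether or not the GIC ever delivers the interrupt to the CPU.
		 */
		if (pmbsr & BIT(SYS_PMBSR_EL1_S_SHIFT))
			kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id,
					    vcpu->arch.spe_cpu.irq_num, true, NULL);
	}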

>
>
>> at most a host timer tick.
> (Because the timer brings us out of the guest anyway?)

Yes, once every 4 ms according to the default value of CONFIG_HZ.

Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE
  2020-11-05  9:58   ` Haibo Xu
@ 2020-12-02 15:20     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 15:20 UTC (permalink / raw)
  To: Haibo Xu; +Cc: maz, will, kvmarm, linux-arm-kernel, Sudeep Holla

Hi Haibu,

Thanks for having a look at the patches!

On 11/5/20 9:58 AM, Haibo Xu wrote:
> On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>> From: Sudeep Holla <sudeep.holla@arm.com>
>>
>> To configure the virtual SPE buffer management interrupt number, we use a
>> VCPU kvm_device ioctl, encapsulating the KVM_ARM_VCPU_SPE_IRQ attribute
>> within the KVM_ARM_VCPU_SPE_CTRL group.
>>
>> After configuring the SPE, userspace is required to call the VCPU ioctl
>> with the attribute KVM_ARM_VCPU_SPE_INIT to initialize SPE on the VCPU.
>>
>> [Alexandru E: Fixed compilation errors, don't allow userspace to set the
>>         VCPU feature, removed unused functions, fixed mismatched
>>         descriptions, comments and error codes, reworked logic, rebased on
>>         top of v5.10-rc1]
>>
>> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
>> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
>> ---
>>  Documentation/virt/kvm/devices/vcpu.rst |  40 ++++++++
>>  arch/arm64/include/uapi/asm/kvm.h       |   3 +
>>  arch/arm64/kvm/Makefile                 |   1 +
>>  arch/arm64/kvm/guest.c                  |   9 ++
>>  arch/arm64/kvm/reset.c                  |  23 +++++
>>  arch/arm64/kvm/spe.c                    | 129 ++++++++++++++++++++++++
>>  include/kvm/arm_spe.h                   |  27 +++++
>>  include/uapi/linux/kvm.h                |   1 +
>>  8 files changed, 233 insertions(+)
>>  create mode 100644 arch/arm64/kvm/spe.c
>>
>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
>> index 2acec3b9ef65..6135b9827fbe 100644
>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>> @@ -161,3 +161,43 @@ Specifies the base address of the stolen time structure for this VCPU. The
>>  base address must be 64 byte aligned and exist within a valid guest memory
>>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>>  including the layout of the stolen time structure.
>> +
>> +4. GROUP: KVM_ARM_VCPU_SPE_CTRL
>> +===============================
>> +
>> +:Architectures: ARM64
>> +
>> +4.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
>> +-----------------------------------
>> +
>> +:Parameters: in kvm_device_attr.addr the address for the SPE buffer management
>> +             interrupt is a pointer to an int
>> +
>> +Returns:
>> +
>> +        =======  ========================================================
>> +        -EBUSY   The SPE buffer management interrupt is already set
>> +        -EINVAL  Invalid SPE overflow interrupt number
>> +        -EFAULT  Could not read the buffer management interrupt number
>> +        -ENXIO   SPE not supported or not properly configured
>> +        =======  ========================================================
>> +
>> +A value describing the SPE (Statistical Profiling Extension) overflow interrupt
>> +number for this vcpu. This interrupt should be a PPI and the interrupt type and
>> +number must be same for each vcpu.
>> +
>> +4.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
>> +------------------------------------
>> +
>> +:Parameters: no additional parameter in kvm_device_attr.addr
>> +
>> +Returns:
>> +
>> +        =======  ======================================================
>> +        -EBUSY   SPE already initialized
>> +        -ENODEV  GIC not initialized
>> +        -ENXIO   SPE not supported or not properly configured
>> +        =======  ======================================================
>> +
>> +Request the initialization of the SPE. Must be done after initializing the
>> +in-kernel irqchip and after setting the interrupt number for the VCPU.
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 489e12304dbb..ca57dfb7abf0 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -360,6 +360,9 @@ struct kvm_vcpu_events {
>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER                1
>>  #define KVM_ARM_VCPU_PVTIME_CTRL       2
>>  #define   KVM_ARM_VCPU_PVTIME_IPA      0
>> +#define KVM_ARM_VCPU_SPE_CTRL          3
>> +#define   KVM_ARM_VCPU_SPE_IRQ         0
>> +#define   KVM_ARM_VCPU_SPE_INIT                1
>>
>>  /* KVM_IRQ_LINE irq field index values */
>>  #define KVM_ARM_IRQ_VCPU2_SHIFT                28
>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>> index 1504c81fbf5d..f6e76f64ffbe 100644
>> --- a/arch/arm64/kvm/Makefile
>> +++ b/arch/arm64/kvm/Makefile
>> @@ -25,3 +25,4 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
>>          vgic/vgic-its.o vgic/vgic-debug.o
>>
>>  kvm-$(CONFIG_KVM_ARM_PMU)  += pmu-emul.o
>> +kvm-$(CONFIG_KVM_ARM_SPE)  += spe.o
>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>> index dfb5218137ca..2ba790eeb782 100644
>> --- a/arch/arm64/kvm/guest.c
>> +++ b/arch/arm64/kvm/guest.c
>> @@ -926,6 +926,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
>>         case KVM_ARM_VCPU_PVTIME_CTRL:
>>                 ret = kvm_arm_pvtime_set_attr(vcpu, attr);
>>                 break;
>> +       case KVM_ARM_VCPU_SPE_CTRL:
>> +               ret = kvm_arm_spe_set_attr(vcpu, attr);
>> +               break;
>>         default:
>>                 ret = -ENXIO;
>>                 break;
>> @@ -949,6 +952,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
>>         case KVM_ARM_VCPU_PVTIME_CTRL:
>>                 ret = kvm_arm_pvtime_get_attr(vcpu, attr);
>>                 break;
>> +       case KVM_ARM_VCPU_SPE_CTRL:
>> +               ret = kvm_arm_spe_get_attr(vcpu, attr);
>> +               break;
>>         default:
>>                 ret = -ENXIO;
>>                 break;
>> @@ -972,6 +978,9 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>         case KVM_ARM_VCPU_PVTIME_CTRL:
>>                 ret = kvm_arm_pvtime_has_attr(vcpu, attr);
>>                 break;
>> +       case KVM_ARM_VCPU_SPE_CTRL:
>> +               ret = kvm_arm_spe_has_attr(vcpu, attr);
>> +               break;
>>         default:
>>                 ret = -ENXIO;
>>                 break;
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index f32490229a4c..4dc205fa4be1 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -87,6 +87,9 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>         case KVM_CAP_ARM_PTRAUTH_GENERIC:
>>                 r = system_has_full_ptr_auth();
>>                 break;
>> +       case KVM_CAP_ARM_SPE:
>> +               r = kvm_arm_supports_spe();
>> +               break;
>>         default:
>>                 r = 0;
>>         }
>> @@ -223,6 +226,19 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
>>         return 0;
>>  }
>>
>> +static int kvm_vcpu_enable_spe(struct kvm_vcpu *vcpu)
>> +{
>> +       if (!kvm_arm_supports_spe())
>> +               return -EINVAL;
>> +
>> +       /* SPE is disabled if the PE is in AArch32 state */
>> +       if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
>> +               return -EINVAL;
>> +
>> +       vcpu->arch.flags |= KVM_ARM64_GUEST_HAS_SPE;
>> +       return 0;
>> +}
>> +
>>  /**
>>   * kvm_reset_vcpu - sets core registers and sys_regs to reset value
>>   * @vcpu: The VCPU pointer
>> @@ -274,6 +290,13 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>>                 }
>>         }
>>
>> +       if (test_bit(KVM_ARM_VCPU_SPE, vcpu->arch.features)) {
>> +               if (kvm_vcpu_enable_spe(vcpu)) {
>> +                       ret = -EINVAL;
>> +                       goto out;
>> +               }
>> +       }
>> +
>>         switch (vcpu->arch.target) {
>>         default:
>>                 if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
>> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
>> new file mode 100644
>> index 000000000000..f91a52cd7cd3
>> --- /dev/null
>> +++ b/arch/arm64/kvm/spe.c
>> @@ -0,0 +1,129 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2019 ARM Ltd.
>> + */
>> +
>> +#include <linux/kvm.h>
>> +#include <linux/kvm_host.h>
>> +#include <linux/uaccess.h>
>> +
>> +#include <kvm/arm_spe.h>
>> +#include <kvm/arm_vgic.h>
>> +
>> +static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
>> +{
>> +       if (!vcpu_has_spe(vcpu))
>> +               return false;
>> +
>> +       if (!irqchip_in_kernel(vcpu->kvm))
>> +               return false;
>> +
> nit: should we move the irqchip_in_kernel() check to the caller?

Yes, definitely, I can move the irqchip_in_kernel() check to the callers because
it's a VM property, not a VCPU property, and use vcpu_has_spe() directly.

Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE
  2020-11-19 16:58   ` James Morse
@ 2020-12-02 16:28     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 16:28 UTC (permalink / raw)
  To: James Morse; +Cc: maz, linux-arm-kernel, Sudeep Holla, will, kvmarm

Hi James,

On 11/19/20 4:58 PM, James Morse wrote:
> Hi Alex,
>
> On 27/10/2020 17:26, Alexandru Elisei wrote:
>> From: Sudeep Holla <sudeep.holla@arm.com>
>>
>> To configure the virtual SPE buffer management interrupt number, we use a
>> VCPU kvm_device ioctl, encapsulating the KVM_ARM_VCPU_SPE_IRQ attribute
>> within the KVM_ARM_VCPU_SPE_CTRL group.
>>
>> After configuring the SPE, userspace is required to call the VCPU ioctl
>> with the attribute KVM_ARM_VCPU_SPE_INIT to initialize SPE on the VCPU.
>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
>> index 2acec3b9ef65..6135b9827fbe 100644
>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>> @@ -161,3 +161,43 @@ Specifies the base address of the stolen time structure for this VCPU. The
>>  base address must be 64 byte aligned and exist within a valid guest memory
>>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>>  including the layout of the stolen time structure.
>> +
>> +4. GROUP: KVM_ARM_VCPU_SPE_CTRL
>> +===============================
>> +
>> +:Architectures: ARM64
>> +
>> +4.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
>> +-----------------------------------
>> +
>> +:Parameters: in kvm_device_attr.addr the address for the SPE buffer management
>> +             interrupt is a pointer to an int
>> +
>> +Returns:
>> +
>> +	 =======  ========================================================
>> +	 -EBUSY   The SPE buffer management interrupt is already set
>> +	 -EINVAL  Invalid SPE overflow interrupt number
>> +	 -EFAULT  Could not read the buffer management interrupt number
>> +	 -ENXIO   SPE not supported or not properly configured
>> +	 =======  ========================================================
>> +
>> +A value describing the SPE (Statistical Profiling Extension) overflow interrupt
>> +number for this vcpu. This interrupt should be a PPI and the interrupt type and
>> +number must be same for each vcpu.
>> +
>> +4.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
>> +------------------------------------
>> +
>> +:Parameters: no additional parameter in kvm_device_attr.addr
>> +
>> +Returns:
>> +
>> +	 =======  ======================================================
>> +	 -EBUSY   SPE already initialized
>> +	 -ENODEV  GIC not initialized
>> +	 -ENXIO   SPE not supported or not properly configured
>> +	 =======  ======================================================
>> +Request the initialization of the SPE. Must be done after initializing the
>> +in-kernel irqchip and after setting the interrupt number for the VCPU.
> Fantastic!
>
>
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index f32490229a4c..4dc205fa4be1 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -87,6 +87,9 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  	case KVM_CAP_ARM_PTRAUTH_GENERIC:
>>  		r = system_has_full_ptr_auth();
>>  		break;
>> +	case KVM_CAP_ARM_SPE:
>> +		r = kvm_arm_supports_spe();
>> +		break;
>>  	default:
>>  		r = 0;
>>  	}
>> @@ -223,6 +226,19 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
>>  	return 0;
>>  }
>>  
>> +static int kvm_vcpu_enable_spe(struct kvm_vcpu *vcpu)
>> +{
>> +	if (!kvm_arm_supports_spe())
>> +		return -EINVAL;
>> +
>> +	/* SPE is disabled if the PE is in AArch32 state */
>> +	if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
>> +		return -EINVAL;
>> +
>> +	vcpu->arch.flags |= KVM_ARM64_GUEST_HAS_SPE;
>> +	return 0;
>> +}
> VCPU-reset promotes the VMM feature into flags. How does this interact with
> kvm_arm_spe_init()?
>
> It doesn't look like this resets any state, couldn't it be done once by kvm_arm_spe_init()?

I need to check here for incompatible features (KVM_ARM_VCPU_EL1_32BIT) or for
SPE not being supported, so that an error code is returned to KVM_ARM_VCPU_INIT
(the ioctl is handled in kvm_arch_vcpu_ioctl_vcpu_init -> kvm_vcpu_set_target).

We need to track the feature per-VCPU in order to refuse finalization if the
feature was not set on all VCPUs. I could use vcpu->arch.features for that, but I
think using vcpu->arch.flags will end up slightly faster.

When KVM SPE is optimized, I will probably add at least one vcpu->arch.flags flag
(something like KVM_ARM64_HOST_USES_SPE) to allow for lazy save/restore of the
host SPE context, so it can be skipped when the flag is not set. I was thinking
that in that case checking vcpu->arch.flags will be cheaper. This is also the
approach that SVE uses, from what I can tell.

For now, I would prefer to keep it as a vcpu flag and make the decision once I
start implementing performance optimizations. But using features is definitely
doable if there are objections to using flags.

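To make it concrete, the kind of check I have in mind for the world-switch path
is something like this (KVM_ARM64_HOST_USES_SPE and the helper below don't exist,
this is only meant to show the shape of the code):

        /* Sketch: skip the host SPE context if the host isn't profiling. */
        if (!(vcpu->arch.flags & KVM_ARM64_HOST_USES_SPE))
                return;
        __spe_save_host_state(vcpu);    /* hypothetical helper */
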
>
>
>>  /**
>>   * kvm_reset_vcpu - sets core registers and sys_regs to reset value
>>   * @vcpu: The VCPU pointer
>> @@ -274,6 +290,13 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>>  		}
>>  	}
>>  
>> +	if (test_bit(KVM_ARM_VCPU_SPE, vcpu->arch.features)) {
>> +		if (kvm_vcpu_enable_spe(vcpu)) {
>> +			ret = -EINVAL;
>> +			goto out;
>> +		}
>> +	}
>> +
>>  	switch (vcpu->arch.target) {
>>  	default:
>>  		if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
>> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
>> new file mode 100644
>> index 000000000000..f91a52cd7cd3
>> --- /dev/null
>> +++ b/arch/arm64/kvm/spe.c
>> @@ -0,0 +1,129 @@
>> +static bool kvm_arm_spe_irq_is_valid(struct kvm *kvm, int irq)
>> +{
>> +	int i;
>> +	struct kvm_vcpu *vcpu;
>> +
>> +	/* The SPE overflow interrupt can be a PPI only */
>> +	if (!irq_is_ppi(irq))
>> +		return false;
>> +
>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>> +		if (!kvm_arm_spe_irq_initialized(vcpu))
>> +			continue;
>> +
>> +		if (vcpu->arch.spe_cpu.irq_num != irq)
>> +			return false;
>> +	}
> Looks like you didn't really want a vcpu property! (huh, patch 10 adds a vm property too)
> We're making this a vcpu property because of the PPI and system registers? (both good reasons)
>
> If the PPI number lived in struct kvm_arch, you'd only only need to check it was
> uninitialised, or the same to get the same behaviour, which would save some of this error
> handling.

The Arm ARM mandates that the SPE interrupt must be a PPI. I think it makes more
sense to have a Private Peripheral Interrupt ID saved as a per-VCPU variable than
as a per-VM one, which is how an SPI would be handled.

>
>
>> +	return true;
>> +}
>> diff --git a/include/kvm/arm_spe.h b/include/kvm/arm_spe.h
>> index 46ec447ed013..0275e8097529 100644
>> --- a/include/kvm/arm_spe.h
>> +++ b/include/kvm/arm_spe.h
>> @@ -18,11 +18,38 @@ struct kvm_spe_cpu {
>>  	bool initialized; 	/* Feature is initialized on VCPU */
>>  };
>>  
>> +#define kvm_arm_spe_irq_initialized(v)			\
>> +	((v)->arch.spe_cpu.irq_num >= VGIC_NR_SGIS &&	\
>> +	 (v)->arch.spe_cpu.irq_num < VGIC_MAX_PRIVATE)
> Didn't GICv(mumbles) add an additional PPI range? Could this be made irq_is_ppi(), that
> way if the vgic gains support for that, we don't get weird behaviour here?

You're right, the macro reimplements irq_is_ppi(); I'll rewrite it to use
irq_is_ppi() instead.

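Something like this (untested):

        #define kvm_arm_spe_irq_initialized(v)  \
                irq_is_ppi((v)->arch.spe_cpu.irq_num)
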
Thanks,
Alex

* Re: [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort()
  2020-11-05 10:01   ` Haibo Xu
@ 2020-12-02 16:29     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 16:29 UTC (permalink / raw)
  To: Haibo Xu; +Cc: maz, will, kvmarm, linux-arm-kernel

Hi Haibo,

On 11/5/20 10:01 AM, Haibo Xu wrote:
> On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>> user_mem_abort() is already a long and complex function, let's make it
>> slightly easier to understand by abstracting the algorithm for choosing the
>> stage 2 IPA entry size into its own function.
>>
>> This also makes it possible to reuse the code when guest SPE support will
>> be added.
>>
> Better to mention that there is "No functional change"!

That's a good point, I'll add it.

Thanks,

Alex

>
>> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
>> ---
>>  arch/arm64/kvm/mmu.c | 55 ++++++++++++++++++++++++++------------------
>>  1 file changed, 33 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 19aacc7d64de..c3c43555490d 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -738,12 +738,43 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>>         return PAGE_SIZE;
>>  }
>>
>> +static short stage2_max_pageshift(struct kvm_memory_slot *memslot,
>> +                                 struct vm_area_struct *vma, hva_t hva,
>> +                                 bool *force_pte)
>> +{
>> +       short pageshift;
>> +
>> +       *force_pte = false;
>> +
>> +       if (is_vm_hugetlb_page(vma))
>> +               pageshift = huge_page_shift(hstate_vma(vma));
>> +       else
>> +               pageshift = PAGE_SHIFT;
>> +
>> +       if (memslot_is_logging(memslot) || (vma->vm_flags & VM_PFNMAP)) {
>> +               *force_pte = true;
>> +               pageshift = PAGE_SHIFT;
>> +       }
>> +
>> +       if (pageshift == PUD_SHIFT &&
>> +           !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
>> +               pageshift = PMD_SHIFT;
>> +
>> +       if (pageshift == PMD_SHIFT &&
>> +           !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
>> +               *force_pte = true;
>> +               pageshift = PAGE_SHIFT;
>> +       }
>> +
>> +       return pageshift;
>> +}
>> +
>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>                           struct kvm_memory_slot *memslot, unsigned long hva,
>>                           unsigned long fault_status)
>>  {
>>         int ret = 0;
>> -       bool write_fault, writable, force_pte = false;
>> +       bool write_fault, writable, force_pte;
>>         bool exec_fault;
>>         bool device = false;
>>         unsigned long mmu_seq;
>> @@ -776,27 +807,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>                 return -EFAULT;
>>         }
>>
>> -       if (is_vm_hugetlb_page(vma))
>> -               vma_shift = huge_page_shift(hstate_vma(vma));
>> -       else
>> -               vma_shift = PAGE_SHIFT;
>> -
>> -       if (logging_active ||
>> -           (vma->vm_flags & VM_PFNMAP)) {
>> -               force_pte = true;
>> -               vma_shift = PAGE_SHIFT;
>> -       }
>> -
>> -       if (vma_shift == PUD_SHIFT &&
>> -           !fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
>> -              vma_shift = PMD_SHIFT;
>> -
>> -       if (vma_shift == PMD_SHIFT &&
>> -           !fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
>> -               force_pte = true;
>> -               vma_shift = PAGE_SHIFT;
>> -       }
>> -
>> +       vma_shift = stage2_max_pageshift(memslot, vma, hva, &force_pte);
>>         vma_pagesize = 1UL << vma_shift;
>>         if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
>>                 fault_ipa &= ~(vma_pagesize - 1);
>> --
>> 2.29.1
>>

* Re: [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE
  2020-11-05 10:10   ` Haibo Xu
@ 2020-12-02 16:35     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-02 16:35 UTC (permalink / raw)
  To: Haibo Xu; +Cc: maz, will, kvmarm, linux-arm-kernel

Hi Haibo,

On 11/5/20 10:10 AM, Haibo Xu wrote:
> On Wed, 28 Oct 2020 at 01:26, Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>> Stage 2 faults triggered by the profiling buffer attempting to write to
>> memory are reported by the SPE hardware by asserting a buffer management
>> event interrupt. Interrupts are by their nature asynchronous, which means
>> that the guest might have changed its stage 1 translation tables since the
>> attempted write. SPE reports the guest virtual address that caused the data
>> abort, but not the IPA, which means that KVM would have to walk the guest's
>> stage 1 tables to find the IPA; using the AT instruction to walk the
>> guest's tables in hardware is not an option because it doesn't report the
>> IPA in the case of a stage 2 fault on a stage 1 table walk.
>>
>> Fix both problems by pre-mapping the guest's memory at stage 2 with write
>> permissions to avoid any faults. Userspace calls mlock() on the VMAs that
>> back the guest's memory, pinning the pages in memory, then tells KVM to map
>> the memory at stage 2 by using the VM control group KVM_ARM_VM_SPE_CTRL
>> with the attribute KVM_ARM_VM_SPE_FINALIZE. KVM will map all writable VMAs
>> which have the VM_LOCKED flag set. Hugetlb VMAs are practically pinned in
>> memory after they are faulted in and mlock() doesn't set the VM_LOCKED
>> flag, and just faults the pages in; KVM will treat hugetlb VMAs like they
>> have the VM_LOCKED flag and will also map them, faulting them in if
>> necessary, when handling the ioctl.
>>
>> VM live migration relies on a bitmap of dirty pages. This bitmap is created
>> by write-protecting a memslot and updating it as KVM handles stage 2 write
>> faults. Because KVM cannot handle stage 2 faults reported by the profiling
>> buffer, it will not pre-map a logging memslot. This effectively means that
>> profiling is not available when the VM is configured for live migration.
>>
>> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
>> ---
>> [..]
> It seems that the below function is used to de-finalize the spe status
> if I get it correctly.
> How about rename the function to some like "kvm_arm_vcpu_init_spe_definalize()"

I don't have a strong opinion about the name and I'll keep your suggestion in mind
for the next iteration. The series is an RFC and the function might not even be
there in the final version.

>
>> +void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
>> +{
>> +       vcpu->kvm->arch.spe.finalized = false;
>> +}
>> +
>>  static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
>>  {
>>         if (!vcpu_has_spe(vcpu))
>> @@ -115,6 +122,50 @@ int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>         return -ENXIO;
>>  }
>>
>> +static int kvm_arm_spe_finalize(struct kvm *kvm)
>> +{
>> +       struct kvm_memory_slot *memslot;
>> +       enum kvm_pgtable_prot prot;
>> +       struct kvm_vcpu *vcpu;
>> +       int i, ret;
>> +
>> +       kvm_for_each_vcpu(i, vcpu, kvm) {
>> +               if (!kvm_arm_spe_vcpu_initialized(vcpu))
>> +                       return -ENXIO;
>> +       }
>> +
>> +       mutex_unlock(&kvm->slots_lock);
> Should be mutex_lock(&kvm->slots_lock);?

Definitely, nicely spotted! That's a typo on my part.

It doesn't affect the test results because kvmtool will call finalize exactly once
after the entire VM has been initialized, so there will be no concurrent accesses
to this function.

Thanks,

Alex


* Re: [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature
  2020-12-02 14:29     ` Alexandru Elisei
@ 2020-12-02 17:23       ` Will Deacon
  2020-12-03 10:07         ` Alexandru Elisei
  0 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2020-12-02 17:23 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, linux-arm-kernel, kvmarm

On Wed, Dec 02, 2020 at 02:29:31PM +0000, Alexandru Elisei wrote:
> On 11/19/20 4:58 PM, James Morse wrote:
> > On 27/10/2020 17:26, Alexandru Elisei wrote:
> >> Detect Statistical Profiling Extension (SPE) support using the cpufeatures
> >> framework. The presence of SPE is reported via the ARM64_SPE capability.
> >>
> >> The feature will be necessary for emulating SPE in KVM, because KVM needs
> >> that all CPUs have SPE hardware to avoid scheduling a VCPU on a CPU without
> >> support. For this reason, the feature type ARM64_CPUCAP_SYSTEM_FEATURE has
> >> been selected to disallow hotplugging a CPU which doesn't support SPE.
> > Can you mention the existing driver in the commit message? Surprisingly it doesn't use
> > cpufeature at all. It looks like arm_spe_pmu_dev_init() goes out of its way to support
> > mismatched systems. (otherwise the significance of the new behaviour isn't clear!)
> >
> > I read it as: the host is fine with mismatched systems, and the existing drivers supports
> > this. But KVM is not. After this patch you can't make the system mismatched 'late'.
> 
> That was exactly my intention. Certainly, I'll try to make the commit message
> clearer by mentioning the SPE driver.

Hmm, so are you saying that with this patch applied, a machine where KVM
isn't even being used can no longer late-online CPUs without SPE if the boot
CPUs had it? If so, then I don't think that's acceptable, unfortunately.

As James points out, the current driver is very careful to support
big.LITTLE misconfigurations and I don't see why KVM support should change
that.

Will

* Re: [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature
  2020-12-02 17:23       ` Will Deacon
@ 2020-12-03 10:07         ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2020-12-03 10:07 UTC (permalink / raw)
  To: Will Deacon; +Cc: maz, linux-arm-kernel, kvmarm

Hi Will,

On 12/2/20 5:23 PM, Will Deacon wrote:
> On Wed, Dec 02, 2020 at 02:29:31PM +0000, Alexandru Elisei wrote:
>> On 11/19/20 4:58 PM, James Morse wrote:
>>> On 27/10/2020 17:26, Alexandru Elisei wrote:
>>>> Detect Statistical Profiling Extension (SPE) support using the cpufeatures
>>>> framework. The presence of SPE is reported via the ARM64_SPE capability.
>>>>
>>>> The feature will be necessary for emulating SPE in KVM, because KVM needs
>>>> that all CPUs have SPE hardware to avoid scheduling a VCPU on a CPU without
>>>> support. For this reason, the feature type ARM64_CPUCAP_SYSTEM_FEATURE has
>>>> been selected to disallow hotplugging a CPU which doesn't support SPE.
>>> Can you mention the existing driver in the commit message? Surprisingly it doesn't use
>>> cpufeature at all. It looks like arm_spe_pmu_dev_init() goes out of its way to support
>>> mismatched systems. (otherwise the significance of the new behaviour isn't clear!)
>>>
>>> I read it as: the host is fine with mismatched systems, and the existing drivers supports
>>> this. But KVM is not. After this patch you can't make the system mismatched 'late'.
>> That was exactly my intention. Certainly, I'll try to make the commit message
>> clearer by mentioning the SPE driver.
> Hmm, so are you saying that with this patch applied, a machine where KVM
> isn't even being used can no longer late-online CPUs without SPE if the boot
> CPUs had it? If so, then I don't think that's acceptable, unfortunately.

Yes, the idea was to prevent hotplugging CPUs that don't have the capability, so
that the kernel won't schedule an SPE-enabled guest on a CPU which doesn't have SPE.

>
> As James points out, the current driver is very careful to support
> big.LITTLE misconfigurations and I don't see why KVM support should change
> that.

That makes sense, thank you for making it clear from the start that this approach
is not the right one.

There was a previous discussion about supporting KVM SPE on heterogeneous systems
[1]. I chose to use a capability because the focus for this iteration was to
ensure the correctness of the SPE emulation, and the capability looked like the
easiest way to get KVM SPE up and running for testing.

The idea discussed previously [1] was to have userspace configure the VM with a
cpumask representing the CPUs the VM is allowed to run on. KVM then detects when a
VCPU is scheduled on a physical CPU that is not in the cpumask and returns from
KVM_RUN with an error code. That looks like a good solution to me, and it is
generic enough to be used for all sorts of mismatched features. I will try to
implement it in the next iteration, after I get more feedback on the current
series.

[1] https://www.spinics.net/lists/arm-kernel/msg778477.html
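
To illustrate the direction, the check on the KVM_RUN path could be something
roughly like this (supported_cpus is a made-up field name, and how the error is
reported to userspace still needs to be worked out):

        /*
         * Sketch only: refuse to run on a physical CPU outside the mask set
         * by userspace. This would be checked with preemption disabled, in
         * the vcpu run loop.
         */
        if (!cpumask_test_cpu(smp_processor_id(),
                              &vcpu->kvm->arch.supported_cpus))
                return -EPERM;  /* or a dedicated exit reason */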

Thanks,
Alex

* Re: [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE
  2020-11-19 16:59   ` James Morse
@ 2021-03-23 14:27     ` Alexandru Elisei
  0 siblings, 0 replies; 35+ messages in thread
From: Alexandru Elisei @ 2021-03-23 14:27 UTC (permalink / raw)
  To: James Morse; +Cc: maz, linux-arm-kernel, will, kvmarm

Hi James,

Sorry for taking so long to reply to this, I've been busy with other things, but
your comments have been very helpful and have given me a lot to think about. For
the next iteration of the series I've decided to use pin_user_pages() with the
FOLL_LONGTERM flag, similar to how vfio_iommu_type1 does its memory pinning. I
believe this approach matches the semantics of the flag, since SPE practically
functions like a device that can do DMA to guest memory at any time, with the
difference that it uses the CPU translation tables instead of the IOMMU tables.
This will also remove the burden on userspace to call mlock() beforehand, and the
distinction between hugetlbfs and regular pages will disappear from KVM, as both
are handled internally by pin_user_pages().

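As a very rough sketch of what I mean for a single memslot (I'm using the _fast
variant to keep the example short; error handling, the unpin path and the
interaction with memslot changes are all left out, and the helper doesn't exist
yet):

#include <linux/kvm_host.h>
#include <linux/mm.h>

/* Sketch: pin the pages backing a memslot so SPE cannot fault at stage 2. */
static int kvm_spe_pin_memslot(struct kvm_memory_slot *memslot,
                               struct page **pages)
{
        unsigned long hva = memslot->userspace_addr;
        unsigned long nr = memslot->npages;
        int pinned;

        pinned = pin_user_pages_fast(hva, nr, FOLL_WRITE | FOLL_LONGTERM,
                                     pages);
        if (pinned < 0)
                return pinned;
        if (pinned != nr) {
                /* Partial pin, release what we got and report failure. */
                unpin_user_pages(pages, pinned);
                return -EFAULT;
        }

        return 0;
}
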
I will try to answer those comments where I think a more elaborate explanation
will be helpful; regardless of my reply, all comments have been taken on board for
the next iteration.

On 11/19/20 4:59 PM, James Morse wrote:
> Hi Alex,
>
> On 27/10/2020 17:26, Alexandru Elisei wrote:
>> Stage 2 faults triggered by the profiling buffer attempting to write to
>> memory are reported by the SPE hardware by asserting a buffer management
>> event interrupt. Interrupts are by their nature asynchronous, which means
>> that the guest might have changed its stage 1 translation tables since the
>> attempted write. SPE reports the guest virtual address that caused the data
>> abort, but not the IPA, which means that KVM would have to walk the guest's
>> stage 1 tables to find the IPA; using the AT instruction to walk the
>> guest's tables in hardware is not an option because it doesn't report the
>> IPA in the case of a stage 2 fault on a stage 1 table walk.
> Great detailed description, I think a summary helps identify 'both' problems:
> | To work reliably, both the profiling buffer and the page tables to reach it must not
> | fault at stage2.
>
>> Fix both problems by pre-mapping the guest's memory at stage 2 with write
>> permissions to avoid any faults. Userspace calls mlock() on the VMAs that
>> back the guest's memory, pinning the pages in memory, then tells KVM to map
>> the memory at stage 2 by using the VM control group KVM_ARM_VM_SPE_CTRL
>> with the attribute KVM_ARM_VM_SPE_FINALIZE.
> The reason to have this feature is SPE, but is there anything SPE specific in the feature?
>
> I can imagine this being useful on its own if I wanted to reduce guest-exits for
> quasi-real-time reasons, and had memory to burn!
>
> (as an independent feature, it might be useful on other architectures too...)
>
>
> Would it make sense to add this as a flag to KVM_SET_USER_MEMORY_REGION? That is the point
> that the userspace_addr is provided to KVM, this would allow us to fail the call if a
> KVM_MEM_LOCKED memslot can't be created because the underlying VMA aren't VM_LOCKED.
>
> (it also makes it easy to catch incompatible changes of flags in the future)
>
> /me wanders off musing if this can then be combined with VM_PFNMAP in
> kvm_arch_prepare_memory_region()....
>
>
>> KVM will map all writable VMAs which have the VM_LOCKED flag set.
>> Hugetlb VMAs are practically pinned in
>> memory after they are faulted in and mlock() doesn't set the VM_LOCKED
>> flag, and just faults the pages in;
> Ugh. It would be nice to avoid special casing this. KVM shouldn't have to care about the
> difference between a hugetlbfs PMD and a THP PMD.
>
> From mlock_fixup(), it looks like this is because these VMA can't be split.
> Is it possible to change this if mlock() is called for the whole range? (user-space must
> know its hugetlbfs!)
>
> Alternatively, it would good if mm can tell us when a page is locked (and/or special
> cased). That way dax does the right thing too, without having extra special casing in KVM.
> This would also catch VM_PFNMAP if mm knows its effectively the same as VM_LOCKED...
>
>
>> KVM will treat hugetlb VMAs like they
>> have the VM_LOCKED flag and will also map them, faulting them in if
>> necessary, when handling the ioctl.
> Surely user-space should call mlock() to do the faulting in? (and do that before handing
> the memory over to KVM)
>
> Getting KVM to do it will create a loop via the mmu_notifier if this touches a COW page,
> which in turn bumps the sequence counter causing us to bomb out with -EAGAIN.
> (it looks like wp_page_copy() is the only case that calls set_pte_at_notify())
>
>
>> VM live migration relies on a bitmap of dirty pages. This bitmap is created
>> by write-protecting a memslot and updating it as KVM handles stage 2 write
>> faults. Because KVM cannot handle stage 2 faults reported by the profiling
>> buffer, it will not pre-map a logging memslot. This effectively means that
>> profiling is not available when the VM is configured for live migration.
> Yeah ... that sucks. Have any of the Qemu folk said what they'd like to see here?
>
> I can imagine making the logging-enable call fail if any CPU has SPE profiling enabled, as
> the logging will change the results of SPE... We'd then need an exit to user-space to say
> that the vcpu tried to enable SPE while logging was active. Qemu can then decide whether
> to block that vcpu until migration completes, or abort migration.
> But: I've no idea how Qemu manages migration, so it may not be able to do irregular things
> like this.
>
> As a short cut, can we let the arch code fail calls that make problematic changes. (e.g.
> setting KVM_MEM_LOG_DIRTY_PAGES or KVM_MEM_READONLY). It looks like you currently document
> these as silently breaking something else... (an invitation to debug a subtle interaction
> in the future!)

The solution I am considering is to have userspace issue an ioctl to stop guest
SPE before turning on dirty logging. This ioctl will have to specify how KVM
should behave while the guest is profiling. I can see two useful behaviours:

- If guest profiling is enabled or becomes enabled, KVM_RUN will return to
userspace with a description of why it returned.

- If guest profiling is enabled or becomes enabled, KVM will trap the SPE
registers and it will disable profiling when the guest is running.

The two options can be changed while profiling is stopped, and hopefully this
will be enough to let userspace implement whatever policy it wants when migrating
a VM. Obviously, there will also be an ioctl to let KVM know that guest profiling
can be re-enabled.

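For reference, I imagine the userspace interface looking vaguely like this (all
the names and values below are made up, nothing is set in stone):

/* Hypothetical attributes in the KVM_ARM_VM_SPE_CTRL group. */
#define KVM_ARM_VM_SPE_STOP             1
/* How KVM behaves if the guest is profiling, or starts profiling: */
#define   KVM_ARM_VM_SPE_STOP_EXIT      (1 << 0) /* return to userspace */
#define   KVM_ARM_VM_SPE_STOP_TRAP      (1 << 1) /* trap regs, disable profiling */
#define KVM_ARM_VM_SPE_RESUME           2
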
Thanks,

Alex

>
> ~
>
> How does this interact with KSM?
> I can see its try_to_merge_one_page() calling write_protect_page() before testing the
> vm_flags for VM_LOCKED ... so it doesn't look like mlock() stop KSM from doing its work -
> which in turn will cause stage2 faults.
>
> It looks like this is all hinged on VM_MERGEABLE, which can be cleared with an madvise()
> call using MADV_UNMERGEABLE ... but from the man page at least this is to undo a previous
> hint.
>
> I can't find what sets this for regular vma, so presumably its not... see what you think,
> I reckon we need to add "no madvise() MADV_MERGEABLE" to the documentation, and get KVM to
> check the corresponding vma flag when it looks for VM_LOCKED regions.
>
> I think the 'testing flags' is justified, even though we can't enforce they don't change,
> as we can catch a stage2 fault that shouldn't have happened.
>
>
>> diff --git a/Documentation/virt/kvm/devices/vm.rst b/Documentation/virt/kvm/devices/vm.rst
>> index 0aa5b1cfd700..b70798a72d8a 100644
>> --- a/Documentation/virt/kvm/devices/vm.rst
>> +++ b/Documentation/virt/kvm/devices/vm.rst
>> @@ -314,3 +314,31 @@ Allows userspace to query the status of migration mode.
>>  	     if it is enabled
>>  :Returns:   -EFAULT if the given address is not accessible from kernel space;
>>  	    0 in case of success.
>> +
>> +6. GROUP: KVM_ARM_VM_SPE_CTRL
>> +===============================
>> +
>> +:Architectures: arm64
>> +
>> +6.1. ATTRIBUTE: KVM_ARM_VM_SPE_FINALIZE
>> +-----------------------------------------
>> +
>> +Finalizes the creation of the SPE feature by mapping the guest memory in the
>> +stage 2 table. Guest memory must be readable, writable and pinned in RAM, which
>> +is achieved with an mlock() system call;
> (I first read this as mlock() makes memory writeable...)
>
>
>> the memory can be backed by a hugetlbfs
>> +file. Memory regions from read-only or dirty page logging enabled memslots will
>> +be ignored. After the call, no changes to the guest memory,
>> including to its contents, are permitted.
> If guest memory is pinned as writeable, why can't the VMM write to it? Doesn't this
> requirement preclude virtio?
>
> Is 'no messing with the memslots' enforced in any way?
>
>
>> +Subsequent KVM_ARM_VCPU_INIT calls will cause the memory to become unmapped and
>> +the feature must be finalized again before any VCPU can run.
>> +
>> +If any VCPUs are run before finalizing the feature, KVM_RUN will return -EPERM.
>> +
>> +:Parameters: none
>> +:Returns:   -EAGAIN if guest memory has been modified while the call was
>> +            executing
>> +            -EBUSY if the feature is already initialized
>> +            -EFAULT if an address backing the guest memory is invalid
>> +            -ENXIO if SPE is not supported or not properly configured
>> +            0 in case of success
> If we need a one-shot finalise call that sets up stage2, is there any mileage in KVM
> reporting how much memory it pinned to stage2? This is so that the VMM can know it got the
> mmap()/mlock() setup correct? Otherwise we depend on noticing silent failures some time
> later... (I prefer the 'all or nothing' for a memslot though.)
>
>
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index e51d8f328c7e..2d98248f2c66 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -41,6 +41,7 @@
>>  #include <kvm/arm_hypercalls.h>
>>  #include <kvm/arm_pmu.h>
>>  #include <kvm/arm_psci.h>
>> +#include <kvm/arm_spe.h>
>>  
>>  #ifdef REQUIRES_VIRT
>>  __asm__(".arch_extension	virt");
>> @@ -653,6 +654,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>  	if (unlikely(!kvm_vcpu_initialized(vcpu)))
>>  		return -ENOEXEC;
>>  
>> +	if (vcpu_has_spe(vcpu) && unlikely(!kvm_arm_spe_finalized(vcpu->kvm)))
>> +		return -EPERM;
> (does the unlikely() make a difference here?)
>
>
>>  	ret = kvm_vcpu_first_run_init(vcpu);
>>  	if (ret)
>>  		return ret;
>> @@ -982,12 +986,22 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
>>  	 * ensuring that the data side is always coherent. We still
>>  	 * need to invalidate the I-cache though, as FWB does *not*
>>  	 * imply CTR_EL0.DIC.
>> +	 *
>> +	 * If the guest has SPE, we need to unmap the entire address space to
>> +	 * allow for any changes to the VM memory made by userspace to propagate
>> +	 * to the stage 2 tables when SPE is re-finalized;
> This is about the layout of memory (instead of the contents)? Doesn't this get
> synchronised by the mmu_notifier?
>
> This is registered during kvm_create_vm(), and unregistered during kvm_destroy_vm()... so
> it will see any changeseither side of this call...
>
>
> (the existing call is about cleaning the initial state that the VMM re-wrote to the PoC. I
> can't see how SPE or memory pinning fit in here)
>
>
>>          this also makes sure
>> +	 * we keep the userspace and the guest's view of the memory contents
>> +	 * synchronized.
>>  	 */
>>  	if (vcpu->arch.has_run_once) {
>> -		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
>> +		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
>> +		    vcpu_has_spe(vcpu)) {
>>  			stage2_unmap_vm(vcpu->kvm);
>> -		else
>> +			if (vcpu_has_spe(vcpu))
>> +				kvm_arm_spe_notify_vcpu_init(vcpu);
>> +		} else {
>>  			__flush_icache_all();
>> +		}
>>  	}
>>  
>>  	vcpu_reset_hcr(vcpu);
>
> From here...
>
>> @@ -1045,6 +1059,45 @@ static int kvm_arm_vcpu_has_attr(struct kvm_vcpu *vcpu,
>>  	return ret;
>>  }
>>  
>> +static int kvm_arm_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret = -ENXIO;
>> +
>> +	switch (attr->group) {
>> +	default:
>> +		ret = kvm_arm_vm_arch_set_attr(kvm, attr);
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int kvm_arm_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret = -ENXIO;
>> +
>> +	switch (attr->group) {
>> +	default:
>> +		ret = kvm_arm_vm_arch_get_attr(kvm, attr);
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int kvm_arm_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret = -ENXIO;
>> +
>> +	switch (attr->group) {
>> +	default:
>> +		ret = kvm_arm_vm_arch_has_attr(kvm, attr);
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  static int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>>  				   struct kvm_vcpu_events *events)
>>  {
>> @@ -1259,6 +1312,27 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>  
>>  		return 0;
>>  	}
>> +	case KVM_SET_DEVICE_ATTR: {
>> +		struct kvm_device_attr attr;
>> +
>> +		if (copy_from_user(&attr, argp, sizeof(attr)))
>> +			return -EFAULT;
>> +		return kvm_arm_vm_set_attr(kvm, &attr);
>> +	}
>> +	case KVM_GET_DEVICE_ATTR: {
>> +		struct kvm_device_attr attr;
>> +
>> +		if (copy_from_user(&attr, argp, sizeof(attr)))
>> +			return -EFAULT;
>> +		return kvm_arm_vm_get_attr(kvm, &attr);
>> +	}
>> +	case KVM_HAS_DEVICE_ATTR: {
>> +		struct kvm_device_attr attr;
>> +
>> +		if (copy_from_user(&attr, argp, sizeof(attr)))
>> +			return -EFAULT;
>> +		return kvm_arm_vm_has_attr(kvm, &attr);
>> +	}
>>  	default:
>>  		return -EINVAL;
>>  	}
>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>> index 2ba790eeb782..d0dc4bdb8b4a 100644
>> --- a/arch/arm64/kvm/guest.c
>> +++ b/arch/arm64/kvm/guest.c
>> @@ -988,3 +988,51 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>  
>>  	return ret;
>>  }
>> +
>> +int kvm_arm_vm_arch_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret;
>> +
>> +	switch (attr->group) {
>> +	case KVM_ARM_VM_SPE_CTRL:
>> +		ret = kvm_arm_vm_spe_set_attr(kvm, attr);
>> +		break;
>> +	default:
>> +		ret = -ENXIO;
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +int kvm_arm_vm_arch_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret;
>> +
>> +	switch (attr->group) {
>> +	case KVM_ARM_VM_SPE_CTRL:
>> +		ret = kvm_arm_vm_spe_get_attr(kvm, attr);
>> +		break;
>> +	default:
>> +		ret = -ENXIO;
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +int kvm_arm_vm_arch_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
>> +{
>> +	int ret;
>> +
>> +	switch (attr->group) {
>> +	case KVM_ARM_VM_SPE_CTRL:
>> +		ret = kvm_arm_vm_spe_has_attr(kvm, attr);
>> +		break;
>> +	default:
>> +		ret = -ENXIO;
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
> ... to here, is almost entirely boiler plate for supporting 0-or-more vm ioctl. Could this
> be a separate preparatory patch, just so it isn't wrapped up in the SPE/memory-pinning
> specifics?
>
>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c3c43555490d..31b2216a5881 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1365,6 +1365,175 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>  	return ret;
>>  }
>>  
>> +static int stage2_map_vma(struct kvm *kvm,
>> +			  struct kvm_memory_slot *memslot,
>> +			  struct vm_area_struct *vma,
>> +			  enum kvm_pgtable_prot prot,
>> +			  unsigned long mmu_seq, hva_t *hvap,
>> +			  struct kvm_mmu_memory_cache *cache)
>> +{
>> +	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>> +	unsigned long stage2_pagesize, remaining;
>> +	bool force_pte, writable;
>> +	hva_t hva, hva_end;
>> +	kvm_pfn_t pfn;
>> +	gpa_t gpa;
>> +	gfn_t gfn;
>> +	int ret;
>> +
>> +	hva = max(memslot->userspace_addr, vma->vm_start);
>> +	hva_end = min(vma->vm_end, memslot->userspace_addr +
>> +			(memslot->npages << PAGE_SHIFT));
>> +
>> +	gpa = (memslot->base_gfn << PAGE_SHIFT) + hva - memslot->userspace_addr;
>> +	gfn = gpa >> PAGE_SHIFT;
>> +
>> +	stage2_pagesize = 1UL << stage2_max_pageshift(memslot, vma, hva, &force_pte);
>> +
>> +	while (hva < hva_end) {
>> +		ret = kvm_mmu_topup_memory_cache(cache,
>> +						 kvm_mmu_cache_min_pages(kvm));
>> +		if (ret)
>> +			return ret;
>> +
>> +		/*
>> +		 * We start mapping with the highest possible page size, so the
>> +		 * gpa and gfn will always be properly aligned to the current
>> +		 * page size.
>> +		 */
>> +		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL, true, &writable);
> Heh, if this causes the stage1 page tables to be changed, it will invoke the mmu notifier,
> which will cause us to fail with -EAGAIN afterwards. User-space could keep retrying, and
> it would fix a page at a time...
>
> Passing atomic here would stop this, as we don't want to update the stage1 tables. If they
> haven't been setup as needed, then this should fail early, with the finger pointing at
> stage1. This way we don't mask a bug in user-space, and get caught out by 'this used to work'.
>
> (or is this what prevents access-flag faults at stage1?)
>
>
>> +		if (pfn == KVM_PFN_ERR_HWPOISON)
>> +			return -EFAULT;
>> +		if (is_error_noslot_pfn(pfn))
> Doesn't is_error_noslot_pfn() cover KVM_PFN_ERR_HWPOISON?
>
>
>> +			return -EFAULT;
>> +		/* Can only happen if naughty userspace changed the VMA. */
>> +		if (kvm_is_device_pfn(pfn) || !writable)
>> +			return -EAGAIN;
> kvm_release_pfn_(*cough*)() ?
>
> My reading is __gfn_to_pfn_memslot() calls gup, which takes a reference you release (or
> adjust) at the end of the loop.
>
>
>> +		spin_lock(&kvm->mmu_lock);
>> +		if (mmu_notifier_retry(kvm, mmu_seq)) {
>> +			spin_unlock(&kvm->mmu_lock);
>> +			return -EAGAIN;
> (same again)
>
>
>> +		}
>> +
>> +		remaining = hva_end - hva;
>> +		if (stage2_pagesize == PUD_SIZE && remaining < PUD_SIZE)
>> +			stage2_pagesize = PMD_SIZE;
>> +		if (stage2_pagesize == PMD_SIZE && remaining < PMD_SIZE) {
>> +			force_pte = true;
> I had to sleep on this one: You're forced to put down a PTE because of the remaining size
> in the memslot? This is to prevent rolling back up to a THP size if that is what stage1 is
> using?
>
>
>> +			stage2_pagesize = PAGE_SIZE;
>> +		}
>> +
>> +		if (!force_pte && stage2_pagesize == PAGE_SIZE)
>> +			/*
>> +			 * The hva and gpa will always be PMD aligned if
>> +			 * hva is backed by a transparent huge page.
> because you walk through the vma in order... but what about the first page?
>
> What stops me starting my memslot on a 1MB boundary, which is half way through a 2MB THP?
> Doesn't the 'hva=max()' align hva up to the memslot boundary?
>
>
>
>>                          gpa will
>> +			 * not be modified and it's not necessary to recompute
>> +			 * hva.
>> +			 */
>> +			stage2_pagesize = transparent_hugepage_adjust(memslot, hva, &pfn, &gpa);
>> +
>> +		ret = kvm_pgtable_stage2_map(pgt, gpa, stage2_pagesize,
>> +					     __pfn_to_phys(pfn), prot, cache);
>> +		spin_unlock(&kvm->mmu_lock);
>
>> +		kvm_set_pfn_accessed(pfn);
> This leads into mark_page_accessed(), which has:
> |		 * Unevictable pages are on the "LRU_UNEVICTABLE" list. But,
> |		 * this list is never rotated or maintained, so marking an
> |		 * evictable page accessed has no effect.
>
> This is to tell swap 'not yet'? ... Isn't that impossible by this point?
>
>
>
>> +		kvm_release_pfn_dirty(pfn);
>> +		if (ret)
>> +			return ret;
>> +		else if (hva < hva_end)
>> +			cond_resched();
> (we do this even for the last time round the loop as hva hasn't been updated yet)
>
>
>> +		hva += stage2_pagesize;
>> +		gpa += stage2_pagesize;
>> +		gfn = gpa >> PAGE_SHIFT;
>> +	}
>> +
>> +	*hvap = hva;
>> +	return 0;
>> +}
>
>
>> +int kvm_map_locked_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
>> +			   enum kvm_pgtable_prot prot)
>> +{
>> +	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
>> +	struct vm_area_struct *vma;
>> +	unsigned long mmu_seq;
>> +	hva_t hva, hva_memslot_end;
>> +	int ret;
>> +
>> +	lockdep_assert_held(&kvm->slots_lock);
>> +	if (!(prot & KVM_PGTABLE_PROT_R))
>> +		return -EPERM;
>> +	if ((prot & KVM_PGTABLE_PROT_W) && (memslot->flags & KVM_MEM_READONLY))
>> +		return -EPERM;
> This is checking the static value from kvm_arm_spe_finalize()?
>
>
>> +	hva = memslot->userspace_addr;
>> +	hva_memslot_end = memslot->userspace_addr + (memslot->npages << PAGE_SHIFT);
>> +
>> +	/*
>> +	 * Be extra careful here in case userspace is messing with the VMAs
>> +	 * backing the memslot.
>> +	 */
> If we held mmap_read_lock() for the duration, wouldn't that be impossible?
> (and after that point we can scream from the mmu_notifier if a memslot is changed...)
>
>
>> +	mmu_seq = kvm->mmu_notifier_seq;
>> +	smp_rmb();
>> +
>> +	/*
>> +	 * A memslot might span multiple VMAs and any holes between them, while
>> +	 * a VMA might span multiple memslots (see
>> +	 * kvm_arch_prepare_memory_region()). Take the intersection of the VMAs
>> +	 * with the memslot.
>> +	 */
>> +	do {
>> +		mmap_read_lock(current->mm);
>> +		vma = find_vma(current->mm, hva);
>> +		/*
>> +		 * find_vma() returns first VMA with hva < vma->vm_end, which
>> +		 * means that it is possible for the VMA to start *after* the
>> +		 * end of the memslot.
>> +		 */
>> +		if (!vma || vma->vm_start >= hva_memslot_end) {
>> +			mmap_read_unlock(current->mm);
>> +			return 0;
>> +		}
>> +
>> +		/*
>> +		 * VM_LOCKED pages are put in the unevictable LRU list and
>> +		 * hugetlb pages are not put in any LRU list; both will stay
>> +		 * pinned in memory.
>> +		 */
>> +		if (!(vma->vm_flags & VM_LOCKED) && !is_vm_hugetlb_page(vma)) {
>> +			/* Go to next VMA. */
>> +			hva = vma->vm_end;
>> +			mmap_read_unlock(current->mm);
>> +			continue;
>> +		}
>> +		if (!(vma->vm_flags & VM_READ) ||
>> +		    ((prot & KVM_PGTABLE_PROT_W) && !(vma->vm_flags & VM_WRITE))) {
>> +			/* Go to next VMA. */
>> +			hva = vma->vm_end;
>> +			mmap_read_unlock(current->mm);
>> +			continue;
>> +		}
>> +		mmap_read_unlock(current->mm);
> Can't a writer now come in and remove vma?, which you pass to:
>
>> +		ret = stage2_map_vma(kvm, memslot, vma, prot, mmu_seq, &hva, &cache);
> As this only reads from the stage1 entries, I think you may be able to hold a read lock
> for the duration of the loop. (if we tell gup not to write new entries)
>
>
>> +		if (ret)
>> +			return ret;
>> +	} while (hva < hva_memslot_end);
>> +
>> +	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB)) {
>> +		spin_lock(&kvm->mmu_lock);
>> +		stage2_flush_memslot(kvm, memslot);
>> +		spin_unlock(&kvm->mmu_lock);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +
>>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>>  {
>>  }
>> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
>> index f91a52cd7cd3..316ff8dfed5b 100644
>> --- a/arch/arm64/kvm/spe.c
>> +++ b/arch/arm64/kvm/spe.c
>> @@ -10,6 +10,13 @@
>>  #include <kvm/arm_spe.h>
>>  #include <kvm/arm_vgic.h>
>>  
>> +#include <asm/kvm_mmu.h>
>> +
>> +void kvm_arm_spe_notify_vcpu_init(struct kvm_vcpu *vcpu)
>> +{
>> +	vcpu->kvm->arch.spe.finalized = false;
>> +}
>> +
>>  static bool kvm_arm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
>>  {
>>  	if (!vcpu_has_spe(vcpu))
>> @@ -115,6 +122,50 @@ int kvm_arm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>  	return -ENXIO;
>>  }
>>  
>> +static int kvm_arm_spe_finalize(struct kvm *kvm)
>> +{
>> +	struct kvm_memory_slot *memslot;
>> +	enum kvm_pgtable_prot prot;
>> +	struct kvm_vcpu *vcpu;
>> +	int i, ret;
>> +
>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>> +		if (!kvm_arm_spe_vcpu_initialized(vcpu))
>> +			return -ENXIO;
>> +	}
>> +
>> +	mutex_unlock(&kvm->slots_lock);
> Typo?
>
>
>> +	if (kvm_arm_spe_finalized(kvm)) {
>> +		mutex_unlock(&kvm->slots_lock);
>> +		return -EBUSY;
>> +	}
>> +
>> +	prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
>> +	kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
>> +		/* Only map memory that SPE can write to. */
>> +		if (memslot->flags & KVM_MEM_READONLY)
>> +			continue;
>> +		 /*
>> +		  * Dirty page logging will write-protect pages, which breaks
>> +		  * SPE.
>> +		  */
>> +		if (memslot->dirty_bitmap)
>> +			continue;
> This silently skips regions that set KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE, which should be
> harmless until KVM_CLEAR_DIRTY_LOG clears the bitmap bits, and makes them write-protect
> (the runtime update ends in kvm_mmu_write_protect_pt_masked())
>
> It's the silent bit that bothers me. If this were done as a memslot flag, we could tell
> the VMM whether its the mm flags on the vma we can't cope with, or the KVM flag on the
> memslot.
>
>
>> +		ret = kvm_map_locked_memslot(kvm, memslot, prot);
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	if (!ret)
>> +		kvm->arch.spe.finalized = true;
>> +	mutex_unlock(&kvm->slots_lock);
>> +
>> +	if (ret)
>> +		stage2_unmap_vm(kvm);
> We haven't put in any invalid mappings, is this needed?
>
>
>
>> +
>> +	return ret;
>> +}
>
> I think separating the boiler plate, and SPE bits from the stage2/mm code would make this
> patch simpler.
>
>
> Thanks,
>
> James

end of thread, other threads:[~2021-03-23 14:28 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-27 17:26 [RFC PATCH v3 00/16] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 01/16] KVM: arm64: Initialize VCPU mdcr_el2 before loading it Alexandru Elisei
2020-11-19 16:58   ` James Morse
2020-12-02 14:25     ` Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 02/16] dt-bindings: ARM SPE: Highlight the need for PPI partitions on heterogeneous systems Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 03/16] KVM: arm64: Hide SPE from guests Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 04/16] arm64: Introduce CPU SPE feature Alexandru Elisei
2020-11-19 16:58   ` James Morse
2020-12-02 14:29     ` Alexandru Elisei
2020-12-02 17:23       ` Will Deacon
2020-12-03 10:07         ` Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 05/16] KVM: arm64: Introduce VCPU " Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 06/16] KVM: arm64: Introduce SPE primitives Alexandru Elisei
2020-11-19 16:58   ` James Morse
2020-12-02 15:13     ` Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 07/16] KVM: arm64: Define SPE data structure for each VCPU Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 08/16] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
2020-11-05  9:58   ` Haibo Xu
2020-12-02 15:20     ` Alexandru Elisei
2020-11-19 16:58   ` James Morse
2020-12-02 16:28     ` Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 09/16] KVM: arm64: Use separate function for the mapping size in user_mem_abort() Alexandru Elisei
2020-11-05 10:01   ` Haibo Xu
2020-12-02 16:29     ` Alexandru Elisei
2020-10-27 17:26 ` [RFC PATCH v3 10/16] KVM: arm64: Add a new VM device control group for SPE Alexandru Elisei
2020-11-05 10:10   ` Haibo Xu
2020-12-02 16:35     ` Alexandru Elisei
2020-11-19 16:59   ` James Morse
2021-03-23 14:27     ` Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 11/16] KVM: arm64: Add SPE system registers to VCPU context Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 12/16] KVM: arm64: VHE: Clear MDCR_EL2.E2PB in vcpu_put() Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 13/16] KVM: arm64: Switch SPE context on VM entry/exit Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 14/16] KVM: arm64: Emulate SPE buffer management interrupt Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 15/16] KVM: arm64: Enable SPE for guests Alexandru Elisei
2020-10-27 17:27 ` [RFC PATCH v3 16/16] Documentation: arm64: Document ARM Neoverse-N1 erratum #1688567 Alexandru Elisei
