KVM Archive on lore.kernel.org
* [PATCH 0/9] arm64: Stolen time support
@ 2019-08-02 14:50 Steven Price
  2019-08-02 14:50 ` [PATCH 1/9] KVM: arm64: Document PV-time interface Steven Price
                   ` (9 more replies)
  0 siblings, 10 replies; 43+ messages in thread
From: Steven Price @ 2019-08-02 14:50 UTC (permalink / raw)
  To: unlisted-recipients:; (no To-header on input)
  Cc: Steven Price, Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

This series adds support for paravirtualized time for arm64 guests and
KVM hosts following the specification in Arm's document DEN 0057A:

https://developer.arm.com/docs/den0057/a

It implements support for stolen time, allowing the guest to identify
the periods during which it was forcibly not executing.

It doesn't implement support for Live Physical Time (LPT) as there are
some concerns about the overheads and approach in the above
specification, and I expect an updated version of the specification to
be released soon with just the stolen time parts.

I previously posted a series including LPT (as well as stolen time):
https://lore.kernel.org/kvmarm/20181212150226.38051-1-steven.price@arm.com/

Patches 2, 5, 7 and 8 are cleanup patches and could be taken separately.

Christoffer Dall (1):
  KVM: arm/arm64: Factor out hypercall handling from PSCI code

Steven Price (8):
  KVM: arm64: Document PV-time interface
  KVM: arm64: Implement PV_FEATURES call
  KVM: arm64: Support stolen time reporting via shared structure
  KVM: Allow kvm_device_ops to be const
  KVM: arm64: Provide a PV_TIME device to user space
  arm/arm64: Provide a wrapper for SMCCC 1.1 calls
  arm/arm64: Make use of the SMCCC 1.1 wrapper
  arm64: Retrieve stolen time as paravirtualized guest

 Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++
 arch/arm/kvm/Makefile                    |   2 +-
 arch/arm/kvm/handle_exit.c               |   2 +-
 arch/arm/mm/proc-v7-bugs.c               |  13 +-
 arch/arm64/include/asm/kvm_host.h        |  13 +-
 arch/arm64/include/asm/kvm_mmu.h         |   2 +
 arch/arm64/include/asm/pvclock-abi.h     |  20 +++
 arch/arm64/include/uapi/asm/kvm.h        |   6 +
 arch/arm64/kernel/Makefile               |   1 +
 arch/arm64/kernel/cpu_errata.c           |  80 ++++------
 arch/arm64/kernel/kvm.c                  | 155 ++++++++++++++++++
 arch/arm64/kvm/Kconfig                   |   1 +
 arch/arm64/kvm/Makefile                  |   2 +
 arch/arm64/kvm/handle_exit.c             |   4 +-
 include/kvm/arm_hypercalls.h             |  44 ++++++
 include/kvm/arm_psci.h                   |   2 +-
 include/linux/arm-smccc.h                |  58 +++++++
 include/linux/cpuhotplug.h               |   1 +
 include/linux/kvm_host.h                 |   4 +-
 include/linux/kvm_types.h                |   2 +
 include/uapi/linux/kvm.h                 |   2 +
 virt/kvm/arm/arm.c                       |  18 +++
 virt/kvm/arm/hypercalls.c                | 138 ++++++++++++++++
 virt/kvm/arm/mmu.c                       |  44 ++++++
 virt/kvm/arm/psci.c                      |  84 +---------
 virt/kvm/arm/pvtime.c                    | 190 +++++++++++++++++++++++
 virt/kvm/kvm_main.c                      |   6 +-
 27 files changed, 848 insertions(+), 153 deletions(-)
 create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
 create mode 100644 arch/arm64/include/asm/pvclock-abi.h
 create mode 100644 arch/arm64/kernel/kvm.c
 create mode 100644 include/kvm/arm_hypercalls.h
 create mode 100644 virt/kvm/arm/hypercalls.c
 create mode 100644 virt/kvm/arm/pvtime.c

-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/9] KVM: arm64: Document PV-time interface
  2019-08-02 14:50 [PATCH 0/9] arm64: Stolen time support Steven Price
@ 2019-08-02 14:50 ` Steven Price
  2019-08-03 11:13   ` Marc Zyngier
                     ` (2 more replies)
  2019-08-02 14:50 ` [PATCH 2/9] KVM: arm/arm64: Factor out hypercall handling from PSCI code Steven Price
                   ` (8 subsequent siblings)
  9 siblings, 3 replies; 43+ messages in thread
From: Steven Price @ 2019-08-02 14:50 UTC (permalink / raw)
  To: unlisted-recipients:; (no To-header on input)
  Cc: Steven Price, Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

Introduce a paravirtualization interface for KVM/arm64 based on the
"Arm Paravirtualized Time for Arm-Based Systems" specification DEN 0057A.

This only adds the details about "Stolen Time" as the details of "Live
Physical Time" have not been fully agreed.

User space can specify a reserved area of memory for the guest and
inform KVM to populate the memory with information on time that the host
kernel has stolen from the guest.

A hypercall interface is provided for the guest to interrogate the
hypervisor's support for this interface and the location of the shared
memory structures.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
 1 file changed, 107 insertions(+)
 create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt

diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
new file mode 100644
index 000000000000..e6ae9799e1d5
--- /dev/null
+++ b/Documentation/virtual/kvm/arm/pvtime.txt
@@ -0,0 +1,107 @@
+Paravirtualized time support for arm64
+======================================
+
+Arm specification DEN0057/A defines a standard for paravirtualised time
+support for AArch64 guests:
+
+https://developer.arm.com/docs/den0057/a
+
+KVM/Arm64 implements the stolen time part of this specification by providing
+some hypervisor service calls to support a paravirtualized guest obtaining a
+view of the amount of time stolen from its execution.
+
+Two new SMCCC compatible hypercalls are defined:
+
+PV_FEATURES 0xC5000020
+PV_TIME_ST  0xC5000022
+
+These are only available in the SMC64/HVC64 calling convention as
+paravirtualized time is not available to 32-bit Arm guests.
+
+PV_FEATURES
+    Function ID:  (uint32)  : 0xC5000020
+    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
+    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
+                              PV-time feature is supported by the hypervisor.
+
+PV_TIME_ST
+    Function ID:  (uint32)  : 0xC5000022
+    Return value: (int64)   : IPA of the stolen time data structure for this
+                              (V)CPU. On failure:
+                              NOT_SUPPORTED (-1)
+
+Stolen Time
+-----------
+
+The structure pointed to by the PV_TIME_ST hypercall is as follows:
+
+  Field       | Byte Length | Byte Offset | Description
+  ----------- | ----------- | ----------- | --------------------------
+  Revision    |      4      |      0      | Must be 0 for version 0.1
+  Attributes  |      4      |      4      | Must be 0
+  Stolen time |      8      |      8      | Stolen time in unsigned
+              |             |             | nanoseconds indicating how
+              |             |             | much time this VCPU thread
+              |             |             | was involuntarily not
+              |             |             | running on a physical CPU.
+
+The structure will be updated by the hypervisor periodically as time is stolen
+from the VCPU. It will be present within a reserved region of the normal
+memory given to the guest. The guest should not attempt to write into this
+memory. There is one structure per VCPU of the guest.
+
+User space interface
+====================
+
+User space can request that KVM provide the paravirtualized time interface to
+a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
+
+    struct kvm_create_device pvtime_device = {
+            .type = KVM_DEV_TYPE_ARM_PV_TIME,
+            .attr = 0,
+            .flags = 0,
+    };
+
+    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
+
+The guest IPA of the structures must be given to KVM. This is the base address
+of an array of stolen time structures (one for each VCPU). For example:
+
+    struct kvm_device_attr st_base = {
+            .group = KVM_DEV_ARM_PV_TIME_PADDR,
+            .attr = KVM_DEV_ARM_PV_TIME_ST,
+            .addr = (u64)(unsigned long)&st_paddr
+    };
+
+    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
+
+For migration (or save/restore) of a guest it is necessary to save the contents
+of the shared page(s) and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
+provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
+to be read/written.
+
+It is also necessary for the physical address to be set identically when
+restoring.
+
+    void *save_state(int fd, u64 attr, u32 *size) {
+        struct kvm_device_attr get_size = {
+                .group = KVM_DEV_ARM_PV_TIME_STATE_SIZE,
+                .attr = attr,
+                .addr = (u64)(unsigned long)size
+        };
+
+        ioctl(fd, KVM_GET_DEVICE_ATTR, &get_size);
+
+        void *buffer = malloc(*size);
+
+        struct kvm_device_attr get_state = {
+                .group = KVM_DEV_ARM_PV_TIME_STATE,
+                .attr = attr,
+                .addr = (u64)(unsigned long)buffer
+        };
+
+        ioctl(fd, KVM_GET_DEVICE_ATTR, &get_state);
+
+        return buffer;
+    }
+
+    void *st_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_ST, &st_size);
+
-- 
2.20.1



* [PATCH 2/9] KVM: arm/arm64: Factor out hypercall handling from PSCI code
  2019-08-02 14:50 [PATCH 0/9] arm64: Stolen time support Steven Price
  2019-08-02 14:50 ` [PATCH 1/9] KVM: arm64: Document PV-time interface Steven Price
@ 2019-08-02 14:50 ` Steven Price
  2019-08-02 14:50 ` [PATCH 3/9] KVM: arm64: Implement PV_FEATURES call Steven Price
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Steven Price @ 2019-08-02 14:50 UTC (permalink / raw)
  To: unlisted-recipients:; (no To-header on input)
  Cc: Steven Price, Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel, Christoffer Dall

From: Christoffer Dall <christoffer.dall@arm.com>

We currently intertwine the KVM PSCI implementation with the general
dispatch of hypercall handling, which makes perfect sense because PSCI
is the only category of hypercalls we support.

However, as we are about to support additional hypercalls, factor out
this functionality into a separate hypercall handler file.

Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
[steven.price@arm.com: rebased]
Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm/kvm/Makefile        |  2 +-
 arch/arm/kvm/handle_exit.c   |  2 +-
 arch/arm64/kvm/Makefile      |  1 +
 arch/arm64/kvm/handle_exit.c |  4 +-
 include/kvm/arm_hypercalls.h | 43 ++++++++++++++++++
 include/kvm/arm_psci.h       |  2 +-
 virt/kvm/arm/hypercalls.c    | 59 +++++++++++++++++++++++++
 virt/kvm/arm/psci.c          | 84 +-----------------------------------
 8 files changed, 110 insertions(+), 87 deletions(-)
 create mode 100644 include/kvm/arm_hypercalls.h
 create mode 100644 virt/kvm/arm/hypercalls.c

diff --git a/arch/arm/kvm/Makefile b/arch/arm/kvm/Makefile
index 531e59f5be9c..ef4d01088efc 100644
--- a/arch/arm/kvm/Makefile
+++ b/arch/arm/kvm/Makefile
@@ -23,7 +23,7 @@ obj-y += kvm-arm.o init.o interrupts.o
 obj-y += handle_exit.o guest.o emulate.o reset.o
 obj-y += coproc.o coproc_a15.o coproc_a7.o   vgic-v3-coproc.o
 obj-y += $(KVM)/arm/arm.o $(KVM)/arm/mmu.o $(KVM)/arm/mmio.o
-obj-y += $(KVM)/arm/psci.o $(KVM)/arm/perf.o
+obj-y += $(KVM)/arm/psci.o $(KVM)/arm/perf.o $(KVM)/arm/hypercalls.o
 obj-y += $(KVM)/arm/aarch32.o
 
 obj-y += $(KVM)/arm/vgic/vgic.o
diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
index 2a6a1394d26e..e58a89d2f13f 100644
--- a/arch/arm/kvm/handle_exit.c
+++ b/arch/arm/kvm/handle_exit.c
@@ -9,7 +9,7 @@
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_coproc.h>
 #include <asm/kvm_mmu.h>
-#include <kvm/arm_psci.h>
+#include <kvm/arm_hypercalls.h>
 #include <trace/events/kvm.h>
 
 #include "trace.h"
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 3ac1a64d2fb9..73dce4d47d47 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -13,6 +13,7 @@ obj-$(CONFIG_KVM_ARM_HOST) += hyp/
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o $(KVM)/vfio.o
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arm.o $(KVM)/arm/mmu.o $(KVM)/arm/mmio.o
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/psci.o $(KVM)/arm/perf.o
+kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hypercalls.o
 
 kvm-$(CONFIG_KVM_ARM_HOST) += inject_fault.o regmap.o va_layout.o
 kvm-$(CONFIG_KVM_ARM_HOST) += hyp.o hyp-init.o handle_exit.o
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 706cca23f0d2..aacfc55de44c 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -11,8 +11,6 @@
 #include <linux/kvm.h>
 #include <linux/kvm_host.h>
 
-#include <kvm/arm_psci.h>
-
 #include <asm/esr.h>
 #include <asm/exception.h>
 #include <asm/kvm_asm.h>
@@ -22,6 +20,8 @@
 #include <asm/debug-monitors.h>
 #include <asm/traps.h>
 
+#include <kvm/arm_hypercalls.h>
+
 #define CREATE_TRACE_POINTS
 #include "trace.h"
 
diff --git a/include/kvm/arm_hypercalls.h b/include/kvm/arm_hypercalls.h
new file mode 100644
index 000000000000..35a5abcc4ca3
--- /dev/null
+++ b/include/kvm/arm_hypercalls.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2019 Arm Ltd. */
+
+#ifndef __KVM_ARM_HYPERCALLS_H
+#define __KVM_ARM_HYPERCALLS_H
+
+#include <asm/kvm_emulate.h>
+
+int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
+
+static inline u32 smccc_get_function(struct kvm_vcpu *vcpu)
+{
+	return vcpu_get_reg(vcpu, 0);
+}
+
+static inline unsigned long smccc_get_arg1(struct kvm_vcpu *vcpu)
+{
+	return vcpu_get_reg(vcpu, 1);
+}
+
+static inline unsigned long smccc_get_arg2(struct kvm_vcpu *vcpu)
+{
+	return vcpu_get_reg(vcpu, 2);
+}
+
+static inline unsigned long smccc_get_arg3(struct kvm_vcpu *vcpu)
+{
+	return vcpu_get_reg(vcpu, 3);
+}
+
+static inline void smccc_set_retval(struct kvm_vcpu *vcpu,
+			     unsigned long a0,
+			     unsigned long a1,
+			     unsigned long a2,
+			     unsigned long a3)
+{
+	vcpu_set_reg(vcpu, 0, a0);
+	vcpu_set_reg(vcpu, 1, a1);
+	vcpu_set_reg(vcpu, 2, a2);
+	vcpu_set_reg(vcpu, 3, a3);
+}
+
+#endif
diff --git a/include/kvm/arm_psci.h b/include/kvm/arm_psci.h
index 632e78bdef4d..5b58bd2fe088 100644
--- a/include/kvm/arm_psci.h
+++ b/include/kvm/arm_psci.h
@@ -40,7 +40,7 @@ static inline int kvm_psci_version(struct kvm_vcpu *vcpu, struct kvm *kvm)
 }
 
 
-int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
+int kvm_psci_call(struct kvm_vcpu *vcpu);
 
 struct kvm_one_reg;
 
diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
new file mode 100644
index 000000000000..f875241bd030
--- /dev/null
+++ b/virt/kvm/arm/hypercalls.c
@@ -0,0 +1,59 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (C) 2019 Arm Ltd.
+
+#include <linux/arm-smccc.h>
+#include <linux/kvm_host.h>
+
+#include <asm/kvm_emulate.h>
+
+#include <kvm/arm_hypercalls.h>
+#include <kvm/arm_psci.h>
+
+int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
+{
+	u32 func_id = smccc_get_function(vcpu);
+	u32 val = SMCCC_RET_NOT_SUPPORTED;
+	u32 feature;
+
+	switch (func_id) {
+	case ARM_SMCCC_VERSION_FUNC_ID:
+		val = ARM_SMCCC_VERSION_1_1;
+		break;
+	case ARM_SMCCC_ARCH_FEATURES_FUNC_ID:
+		feature = smccc_get_arg1(vcpu);
+		switch (feature) {
+		case ARM_SMCCC_ARCH_WORKAROUND_1:
+			switch (kvm_arm_harden_branch_predictor()) {
+			case KVM_BP_HARDEN_UNKNOWN:
+				break;
+			case KVM_BP_HARDEN_WA_NEEDED:
+				val = SMCCC_RET_SUCCESS;
+				break;
+			case KVM_BP_HARDEN_NOT_REQUIRED:
+				val = SMCCC_RET_NOT_REQUIRED;
+				break;
+			}
+			break;
+		case ARM_SMCCC_ARCH_WORKAROUND_2:
+			switch (kvm_arm_have_ssbd()) {
+			case KVM_SSBD_FORCE_DISABLE:
+			case KVM_SSBD_UNKNOWN:
+				break;
+			case KVM_SSBD_KERNEL:
+				val = SMCCC_RET_SUCCESS;
+				break;
+			case KVM_SSBD_FORCE_ENABLE:
+			case KVM_SSBD_MITIGATED:
+				val = SMCCC_RET_NOT_REQUIRED;
+				break;
+			}
+			break;
+		}
+		break;
+	default:
+		return kvm_psci_call(vcpu);
+	}
+
+	smccc_set_retval(vcpu, val, 0, 0, 0);
+	return 1;
+}
diff --git a/virt/kvm/arm/psci.c b/virt/kvm/arm/psci.c
index 87927f7e1ee7..17e2bdd4b76f 100644
--- a/virt/kvm/arm/psci.c
+++ b/virt/kvm/arm/psci.c
@@ -15,6 +15,7 @@
 #include <asm/kvm_host.h>
 
 #include <kvm/arm_psci.h>
+#include <kvm/arm_hypercalls.h>
 
 /*
  * This is an implementation of the Power State Coordination Interface
@@ -23,38 +24,6 @@
 
 #define AFFINITY_MASK(level)	~((0x1UL << ((level) * MPIDR_LEVEL_BITS)) - 1)
 
-static u32 smccc_get_function(struct kvm_vcpu *vcpu)
-{
-	return vcpu_get_reg(vcpu, 0);
-}
-
-static unsigned long smccc_get_arg1(struct kvm_vcpu *vcpu)
-{
-	return vcpu_get_reg(vcpu, 1);
-}
-
-static unsigned long smccc_get_arg2(struct kvm_vcpu *vcpu)
-{
-	return vcpu_get_reg(vcpu, 2);
-}
-
-static unsigned long smccc_get_arg3(struct kvm_vcpu *vcpu)
-{
-	return vcpu_get_reg(vcpu, 3);
-}
-
-static void smccc_set_retval(struct kvm_vcpu *vcpu,
-			     unsigned long a0,
-			     unsigned long a1,
-			     unsigned long a2,
-			     unsigned long a3)
-{
-	vcpu_set_reg(vcpu, 0, a0);
-	vcpu_set_reg(vcpu, 1, a1);
-	vcpu_set_reg(vcpu, 2, a2);
-	vcpu_set_reg(vcpu, 3, a3);
-}
-
 static unsigned long psci_affinity_mask(unsigned long affinity_level)
 {
 	if (affinity_level <= 3)
@@ -373,7 +342,7 @@ static int kvm_psci_0_1_call(struct kvm_vcpu *vcpu)
  * Errors:
  * -EINVAL: Unrecognized PSCI function
  */
-static int kvm_psci_call(struct kvm_vcpu *vcpu)
+int kvm_psci_call(struct kvm_vcpu *vcpu)
 {
 	switch (kvm_psci_version(vcpu, vcpu->kvm)) {
 	case KVM_ARM_PSCI_1_0:
@@ -387,55 +356,6 @@ static int kvm_psci_call(struct kvm_vcpu *vcpu)
 	};
 }
 
-int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
-{
-	u32 func_id = smccc_get_function(vcpu);
-	u32 val = SMCCC_RET_NOT_SUPPORTED;
-	u32 feature;
-
-	switch (func_id) {
-	case ARM_SMCCC_VERSION_FUNC_ID:
-		val = ARM_SMCCC_VERSION_1_1;
-		break;
-	case ARM_SMCCC_ARCH_FEATURES_FUNC_ID:
-		feature = smccc_get_arg1(vcpu);
-		switch(feature) {
-		case ARM_SMCCC_ARCH_WORKAROUND_1:
-			switch (kvm_arm_harden_branch_predictor()) {
-			case KVM_BP_HARDEN_UNKNOWN:
-				break;
-			case KVM_BP_HARDEN_WA_NEEDED:
-				val = SMCCC_RET_SUCCESS;
-				break;
-			case KVM_BP_HARDEN_NOT_REQUIRED:
-				val = SMCCC_RET_NOT_REQUIRED;
-				break;
-			}
-			break;
-		case ARM_SMCCC_ARCH_WORKAROUND_2:
-			switch (kvm_arm_have_ssbd()) {
-			case KVM_SSBD_FORCE_DISABLE:
-			case KVM_SSBD_UNKNOWN:
-				break;
-			case KVM_SSBD_KERNEL:
-				val = SMCCC_RET_SUCCESS;
-				break;
-			case KVM_SSBD_FORCE_ENABLE:
-			case KVM_SSBD_MITIGATED:
-				val = SMCCC_RET_NOT_REQUIRED;
-				break;
-			}
-			break;
-		}
-		break;
-	default:
-		return kvm_psci_call(vcpu);
-	}
-
-	smccc_set_retval(vcpu, val, 0, 0, 0);
-	return 1;
-}
-
 int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu)
 {
 	return 3;		/* PSCI version and two workaround registers */
-- 
2.20.1



* [PATCH 3/9] KVM: arm64: Implement PV_FEATURES call
  2019-08-02 14:50 [PATCH 0/9] arm64: Stolen time support Steven Price
  2019-08-02 14:50 ` [PATCH 1/9] KVM: arm64: Document PV-time interface Steven Price
  2019-08-02 14:50 ` [PATCH 2/9] KVM: arm/arm64: Factor out hypercall handling from PSCI code Steven Price
@ 2019-08-02 14:50 ` Steven Price
  2019-08-03 11:21   ` Marc Zyngier
  2019-08-02 14:50 ` [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure Steven Price
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Steven Price @ 2019-08-02 14:50 UTC (permalink / raw)
  To: unlisted-recipients:; (no To-header on input)
  Cc: Steven Price, Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

This provides a mechanism for querying which paravirtualized features
are available in this hypervisor.

Also add the header file which defines the ABI for the paravirtualized
clock features we're about to add.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/pvclock-abi.h | 20 ++++++++++++++++++++
 include/linux/arm-smccc.h            | 14 ++++++++++++++
 virt/kvm/arm/hypercalls.c            |  9 +++++++++
 3 files changed, 43 insertions(+)
 create mode 100644 arch/arm64/include/asm/pvclock-abi.h

diff --git a/arch/arm64/include/asm/pvclock-abi.h b/arch/arm64/include/asm/pvclock-abi.h
new file mode 100644
index 000000000000..1f7cdc102691
--- /dev/null
+++ b/arch/arm64/include/asm/pvclock-abi.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2019 Arm Ltd. */
+
+#ifndef __ASM_PVCLOCK_ABI_H
+#define __ASM_PVCLOCK_ABI_H
+
+/* The below structures and constants are defined in ARM DEN0057A */
+
+struct pvclock_vcpu_stolen_time_info {
+	__le32 revision;
+	__le32 attributes;
+	__le64 stolen_time;
+	/* Structure must be 64 byte aligned, pad to that size */
+	u8 padding[48];
+} __packed;
+
+#define PV_VM_TIME_NOT_SUPPORTED	-1
+#define PV_VM_TIME_INVALID_PARAMETERS	-2
+
+#endif
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 080012a6f025..e7f129f26ebd 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -45,6 +45,7 @@
 #define ARM_SMCCC_OWNER_SIP		2
 #define ARM_SMCCC_OWNER_OEM		3
 #define ARM_SMCCC_OWNER_STANDARD	4
+#define ARM_SMCCC_OWNER_STANDARD_HYP	5
 #define ARM_SMCCC_OWNER_TRUSTED_APP	48
 #define ARM_SMCCC_OWNER_TRUSTED_APP_END	49
 #define ARM_SMCCC_OWNER_TRUSTED_OS	50
@@ -302,5 +303,18 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, unsigned long a1,
 #define SMCCC_RET_NOT_SUPPORTED			-1
 #define SMCCC_RET_NOT_REQUIRED			-2
 
+/* Paravirtualised time calls (defined by ARM DEN0057A) */
+#define ARM_SMCCC_HV_PV_FEATURES				\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
+			   ARM_SMCCC_SMC_64,			\
+			   ARM_SMCCC_OWNER_STANDARD_HYP,	\
+			   0x20)
+
+#define ARM_SMCCC_HV_PV_TIME_ST					\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
+			   ARM_SMCCC_SMC_64,			\
+			   ARM_SMCCC_OWNER_STANDARD_HYP,	\
+			   0x22)
+
 #endif /*__ASSEMBLY__*/
 #endif /*__LINUX_ARM_SMCCC_H*/
diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
index f875241bd030..2906b2df99df 100644
--- a/virt/kvm/arm/hypercalls.c
+++ b/virt/kvm/arm/hypercalls.c
@@ -5,6 +5,7 @@
 #include <linux/kvm_host.h>
 
 #include <asm/kvm_emulate.h>
+#include <asm/pvclock-abi.h>
 
 #include <kvm/arm_hypercalls.h>
 #include <kvm/arm_psci.h>
@@ -48,6 +49,14 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 				break;
 			}
 			break;
+		case ARM_SMCCC_HV_PV_FEATURES:
+			val = SMCCC_RET_SUCCESS;
+			break;
+		}
+		break;
+	case ARM_SMCCC_HV_PV_FEATURES:
+		feature = smccc_get_arg1(vcpu);
+		switch (feature) {
 		}
 		break;
 	default:
-- 
2.20.1



* [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure
  2019-08-02 14:50 [PATCH 0/9] arm64: Stolen time support Steven Price
                   ` (2 preceding siblings ...)
  2019-08-02 14:50 ` [PATCH 3/9] KVM: arm64: Implement PV_FEATURES call Steven Price
@ 2019-08-02 14:50 ` Steven Price
  2019-08-03 11:55   ` Marc Zyngier
  2019-08-03 17:58   ` Marc Zyngier
  2019-08-02 14:50 ` [PATCH 5/9] KVM: Allow kvm_device_ops to be const Steven Price
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 43+ messages in thread
From: Steven Price @ 2019-08-02 14:50 UTC (permalink / raw)
  To: unlisted-recipients:; (no To-header on input)
  Cc: Steven Price, Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

Implement the service call for configuring a shared structure between a
VCPU and the hypervisor in which the hypervisor can write the time
stolen from the VCPU's execution time by other tasks on the host.

The hypervisor allocates memory which is placed at an IPA chosen by user
space. The hypervisor then uses WRITE_ONCE() to update the shared
structure, ensuring single-copy atomicity of the 64-bit unsigned value
that reports stolen time in nanoseconds.

Whenever stolen time is enabled by the guest, the stolen time counter is
reset.

The stolen time itself is retrieved from the sched_info structure
maintained by the Linux scheduler code. We enable SCHEDSTATS when KVM is
selected in Kconfig to ensure this value is meaningful.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/kvm_host.h | 13 +++++-
 arch/arm64/kvm/Kconfig            |  1 +
 include/kvm/arm_hypercalls.h      |  1 +
 include/linux/kvm_types.h         |  2 +
 virt/kvm/arm/arm.c                | 18 ++++++++
 virt/kvm/arm/hypercalls.c         | 70 +++++++++++++++++++++++++++++++
 6 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index f656169db8c3..78f270190d43 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -44,6 +44,7 @@
 	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_IRQ_PENDING	KVM_ARCH_REQ(1)
 #define KVM_REQ_VCPU_RESET	KVM_ARCH_REQ(2)
+#define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
 
 DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 
@@ -83,6 +84,11 @@ struct kvm_arch {
 
 	/* Mandated version of PSCI */
 	u32 psci_version;
+
+	struct kvm_arch_pvtime {
+		void *st;
+		gpa_t st_base;
+	} pvtime;
 };
 
 #define KVM_NR_MEM_OBJS     40
@@ -338,8 +344,13 @@ struct kvm_vcpu_arch {
 	/* True when deferrable sysregs are loaded on the physical CPU,
 	 * see kvm_vcpu_load_sysregs and kvm_vcpu_put_sysregs. */
 	bool sysregs_loaded_on_cpu;
-};
 
+	/* Guest PV state */
+	struct {
+		u64 steal;
+		u64 last_steal;
+	} steal;
+};
 /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
 #define vcpu_sve_pffr(vcpu) ((void *)((char *)((vcpu)->arch.sve_state) + \
 				      sve_ffr_offset((vcpu)->arch.sve_max_vl)))
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index a67121d419a2..d8b88e40d223 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -39,6 +39,7 @@ config KVM
 	select IRQ_BYPASS_MANAGER
 	select HAVE_KVM_IRQ_BYPASS
 	select HAVE_KVM_VCPU_RUN_PID_CHANGE
+	select SCHEDSTATS
 	---help---
 	  Support hosting virtualized guest machines.
 	  We don't support KVM with 16K page tables yet, due to the multiple
diff --git a/include/kvm/arm_hypercalls.h b/include/kvm/arm_hypercalls.h
index 35a5abcc4ca3..9f0710ab4292 100644
--- a/include/kvm/arm_hypercalls.h
+++ b/include/kvm/arm_hypercalls.h
@@ -7,6 +7,7 @@
 #include <asm/kvm_emulate.h>
 
 int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
+int kvm_update_stolen_time(struct kvm_vcpu *vcpu);
 
 static inline u32 smccc_get_function(struct kvm_vcpu *vcpu)
 {
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index bde5374ae021..1c88e69db3d9 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -35,6 +35,8 @@ typedef unsigned long  gva_t;
 typedef u64            gpa_t;
 typedef u64            gfn_t;
 
+#define GPA_INVALID	(~(gpa_t)0)
+
 typedef unsigned long  hva_t;
 typedef u64            hpa_t;
 typedef u64            hfn_t;
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index f645c0fbf7ec..ebd963d2580b 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -40,6 +40,10 @@
 #include <asm/kvm_coproc.h>
 #include <asm/sections.h>
 
+#include <kvm/arm_hypercalls.h>
+#include <kvm/arm_pmu.h>
+#include <kvm/arm_psci.h>
+
 #ifdef REQUIRES_VIRT
 __asm__(".arch_extension	virt");
 #endif
@@ -135,6 +139,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.max_vcpus = vgic_present ?
 				kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
 
+	kvm->arch.pvtime.st_base = GPA_INVALID;
 	return ret;
 out_free_stage2_pgd:
 	kvm_free_stage2_pgd(kvm);
@@ -371,6 +376,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	kvm_vcpu_load_sysregs(vcpu);
 	kvm_arch_vcpu_load_fp(vcpu);
 	kvm_vcpu_pmu_restore_guest(vcpu);
+	kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
 
 	if (single_task_running())
 		vcpu_clear_wfe_traps(vcpu);
@@ -617,6 +623,15 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
 	smp_rmb();
 }
 
+static void vcpu_req_record_steal(struct kvm_vcpu *vcpu)
+{
+	int idx;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	kvm_update_stolen_time(vcpu);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+}
+
 static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.target >= 0;
@@ -636,6 +651,9 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
 		 * that a VCPU sees new virtual interrupts.
 		 */
 		kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
+
+		if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu))
+			vcpu_req_record_steal(vcpu);
 	}
 }
 
diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
index 2906b2df99df..196c71c8dd87 100644
--- a/virt/kvm/arm/hypercalls.c
+++ b/virt/kvm/arm/hypercalls.c
@@ -10,6 +10,70 @@
 #include <kvm/arm_hypercalls.h>
 #include <kvm/arm_psci.h>
 
+
+static struct pvclock_vcpu_stolen_time_info *pvtime_get_st(
+		struct kvm_vcpu *vcpu)
+{
+	struct pvclock_vcpu_stolen_time_info *st = vcpu->kvm->arch.pvtime.st;
+
+	if (!st)
+		return NULL;
+
+	return &st[kvm_vcpu_get_idx(vcpu)];
+}
+
+int kvm_update_stolen_time(struct kvm_vcpu *vcpu)
+{
+	u64 steal;
+	struct pvclock_vcpu_stolen_time_info *kaddr;
+
+	if (vcpu->kvm->arch.pvtime.st_base == GPA_INVALID)
+		return -ENOTSUPP;
+
+	kaddr = pvtime_get_st(vcpu);
+
+	if (!kaddr)
+		return -ENOTSUPP;
+
+	kaddr->revision = 0;
+	kaddr->attributes = 0;
+
+	/* Let's do the local bookkeeping */
+	steal = vcpu->arch.steal.steal;
+	steal += current->sched_info.run_delay - vcpu->arch.steal.last_steal;
+	vcpu->arch.steal.last_steal = current->sched_info.run_delay;
+	vcpu->arch.steal.steal = steal;
+
+	/* Now write out the value to the shared page */
+	WRITE_ONCE(kaddr->stolen_time, cpu_to_le64(steal));
+
+	return 0;
+}
+
+static int kvm_hypercall_stolen_time(struct kvm_vcpu *vcpu)
+{
+	u64 ret;
+	int err;
+
+	/*
+	 * Start counting stolen time from the time the guest requests
+	 * the feature enabled.
+	 */
+	vcpu->arch.steal.steal = 0;
+	vcpu->arch.steal.last_steal = current->sched_info.run_delay;
+
+	err = kvm_update_stolen_time(vcpu);
+
+	if (err)
+		ret = SMCCC_RET_NOT_SUPPORTED;
+	else
+		ret = vcpu->kvm->arch.pvtime.st_base +
+			(sizeof(struct pvclock_vcpu_stolen_time_info) *
+			 kvm_vcpu_get_idx(vcpu));
+
+	smccc_set_retval(vcpu, ret, 0, 0, 0);
+	return 1;
+}
 int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 {
 	u32 func_id = smccc_get_function(vcpu);
@@ -57,8 +121,14 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 	case ARM_SMCCC_HV_PV_FEATURES:
 		feature = smccc_get_arg1(vcpu);
 		switch (feature) {
+		case ARM_SMCCC_HV_PV_FEATURES:
+		case ARM_SMCCC_HV_PV_TIME_ST:
+			val = SMCCC_RET_SUCCESS;
+			break;
 		}
 		break;
+	case ARM_SMCCC_HV_PV_TIME_ST:
+		return kvm_hypercall_stolen_time(vcpu);
 	default:
 		return kvm_psci_call(vcpu);
 	}
-- 
2.20.1



* [PATCH 5/9] KVM: Allow kvm_device_ops to be const
  2019-08-02 14:50 [PATCH 0/9] arm64: Stolen time support Steven Price
                   ` (3 preceding siblings ...)
  2019-08-02 14:50 ` [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure Steven Price
@ 2019-08-02 14:50 ` Steven Price
  2019-08-02 14:50 ` [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space Steven Price
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 43+ messages in thread
From: Steven Price @ 2019-08-02 14:50 UTC (permalink / raw)
  To: unlisted-recipients:; (no To-header on input)
  Cc: Steven Price, Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

Currently a kvm_device_ops structure cannot be const without triggering
compiler warnings. However the structure is never written to and, by
marking it const, it can be placed in read-only memory. Add the
necessary const keywords to allow this.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 include/linux/kvm_host.h | 4 ++--
 virt/kvm/kvm_main.c      | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5c5b5867024c..be31a6f8351a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1236,7 +1236,7 @@ extern unsigned int halt_poll_ns_grow_start;
 extern unsigned int halt_poll_ns_shrink;
 
 struct kvm_device {
-	struct kvm_device_ops *ops;
+	const struct kvm_device_ops *ops;
 	struct kvm *kvm;
 	void *private;
 	struct list_head vm_node;
@@ -1289,7 +1289,7 @@ struct kvm_device_ops {
 void kvm_device_get(struct kvm_device *dev);
 void kvm_device_put(struct kvm_device *dev);
 struct kvm_device *kvm_device_from_filp(struct file *filp);
-int kvm_register_device_ops(struct kvm_device_ops *ops, u32 type);
+int kvm_register_device_ops(const struct kvm_device_ops *ops, u32 type);
 void kvm_unregister_device_ops(u32 type);
 
 extern struct kvm_device_ops kvm_mpic_ops;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 887f3b0c2b60..8c12110ec87a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3035,14 +3035,14 @@ struct kvm_device *kvm_device_from_filp(struct file *filp)
 	return filp->private_data;
 }
 
-static struct kvm_device_ops *kvm_device_ops_table[KVM_DEV_TYPE_MAX] = {
+static const struct kvm_device_ops *kvm_device_ops_table[KVM_DEV_TYPE_MAX] = {
 #ifdef CONFIG_KVM_MPIC
 	[KVM_DEV_TYPE_FSL_MPIC_20]	= &kvm_mpic_ops,
 	[KVM_DEV_TYPE_FSL_MPIC_42]	= &kvm_mpic_ops,
 #endif
 };
 
-int kvm_register_device_ops(struct kvm_device_ops *ops, u32 type)
+int kvm_register_device_ops(const struct kvm_device_ops *ops, u32 type)
 {
 	if (type >= ARRAY_SIZE(kvm_device_ops_table))
 		return -ENOSPC;
@@ -3063,7 +3063,7 @@ void kvm_unregister_device_ops(u32 type)
 static int kvm_ioctl_create_device(struct kvm *kvm,
 				   struct kvm_create_device *cd)
 {
-	struct kvm_device_ops *ops = NULL;
+	const struct kvm_device_ops *ops = NULL;
 	struct kvm_device *dev;
 	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
 	int type;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space
  2019-08-02 14:50 [PATCH 0/9] arm64: Stolen time support Steven Price
                   ` (4 preceding siblings ...)
  2019-08-02 14:50 ` [PATCH 5/9] KVM: Allow kvm_device_ops to be const Steven Price
@ 2019-08-02 14:50 ` Steven Price
  2019-08-03 12:51   ` Marc Zyngier
  2019-08-02 14:50 ` [PATCH 7/9] arm/arm64: Provide a wrapper for SMCCC 1.1 calls Steven Price
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Steven Price @ 2019-08-02 14:50 UTC (permalink / raw)
  To: unlisted-recipients:; (no To-header on input)
  Cc: Steven Price, Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

Allow user space to inform the KVM host where in the physical memory
map the paravirtualized time structures should be located.

A device is created which provides the base address of an array of
Stolen Time (ST) structures, one for each VCPU. There must be (64 *
total number of VCPUs) bytes of memory available at this location.

The address is given in terms of the physical address visible to
the guest (its IPA) and must be 64 byte aligned. The memory should be
marked as reserved to the guest to prevent the guest from allocating
it for other purposes.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/kvm_mmu.h  |   2 +
 arch/arm64/include/uapi/asm/kvm.h |   6 +
 arch/arm64/kvm/Makefile           |   1 +
 include/uapi/linux/kvm.h          |   2 +
 virt/kvm/arm/mmu.c                |  44 +++++++
 virt/kvm/arm/pvtime.c             | 190 ++++++++++++++++++++++++++++++
 6 files changed, 245 insertions(+)
 create mode 100644 virt/kvm/arm/pvtime.c

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index befe37d4bc0e..88c8a4b2836f 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -157,6 +157,8 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm);
 void kvm_free_stage2_pgd(struct kvm *kvm);
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 			  phys_addr_t pa, unsigned long size, bool writable);
+int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
+			  phys_addr_t pa, unsigned long size, bool writable);
 
 int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
 
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 9a507716ae2f..95516a4198ea 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -367,6 +367,12 @@ struct kvm_vcpu_events {
 #define KVM_PSCI_RET_INVAL		PSCI_RET_INVALID_PARAMS
 #define KVM_PSCI_RET_DENIED		PSCI_RET_DENIED
 
+/* Device Control API: PV_TIME */
+#define KVM_DEV_ARM_PV_TIME_PADDR	0
+#define  KVM_DEV_ARM_PV_TIME_ST		0
+#define KVM_DEV_ARM_PV_TIME_STATE_SIZE	1
+#define KVM_DEV_ARM_PV_TIME_STATE	2
+
 #endif
 
 #endif /* __ARM_KVM_H__ */
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 73dce4d47d47..5ffbdc39e780 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -14,6 +14,7 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/e
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arm.o $(KVM)/arm/mmu.o $(KVM)/arm/mmio.o
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/psci.o $(KVM)/arm/perf.o
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hypercalls.o
+kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/pvtime.o
 
 kvm-$(CONFIG_KVM_ARM_HOST) += inject_fault.o regmap.o va_layout.o
 kvm-$(CONFIG_KVM_ARM_HOST) += hyp.o hyp-init.o handle_exit.o
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a7c19540ce21..04bffafa0708 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1222,6 +1222,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
 	KVM_DEV_TYPE_XIVE,
 #define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
+	KVM_DEV_TYPE_ARM_PV_TIME,
+#define KVM_DEV_TYPE_ARM_PV_TIME	KVM_DEV_TYPE_ARM_PV_TIME
 	KVM_DEV_TYPE_MAX,
 };
 
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 38b4c910b6c3..be28a4aee451 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1368,6 +1368,50 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 	return ret;
 }
 
+/**
+ * kvm_phys_addr_memremap - map a memory range to guest IPA
+ *
+ * @kvm:	The KVM pointer
+ * @guest_ipa:	The IPA at which to insert the mapping
+ * @pa:		The physical address of the memory
+ * @size:	The size of the mapping
+ */
+int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
+			  phys_addr_t pa, unsigned long size, bool writable)
+{
+	phys_addr_t addr, end;
+	int ret = 0;
+	unsigned long pfn;
+	struct kvm_mmu_memory_cache cache = { 0, };
+
+	end = (guest_ipa + size + PAGE_SIZE - 1) & PAGE_MASK;
+	pfn = __phys_to_pfn(pa);
+
+	for (addr = guest_ipa; addr < end; addr += PAGE_SIZE) {
+		pte_t pte = pfn_pte(pfn, PAGE_S2);
+
+		if (writable)
+			pte = kvm_s2pte_mkwrite(pte);
+
+		ret = mmu_topup_memory_cache(&cache,
+					     kvm_mmu_cache_min_pages(kvm),
+					     KVM_NR_MEM_OBJS);
+		if (ret)
+			goto out;
+		spin_lock(&kvm->mmu_lock);
+		ret = stage2_set_pte(kvm, &cache, addr, &pte, 0);
+		spin_unlock(&kvm->mmu_lock);
+		if (ret)
+			goto out;
+
+		pfn++;
+	}
+
+out:
+	mmu_free_memory_cache(&cache);
+	return ret;
+}
+
 static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
 {
 	kvm_pfn_t pfn = *pfnp;
diff --git a/virt/kvm/arm/pvtime.c b/virt/kvm/arm/pvtime.c
new file mode 100644
index 000000000000..9051bc07eae1
--- /dev/null
+++ b/virt/kvm/arm/pvtime.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (C) 2019 Arm Ltd.
+
+#include <linux/kvm_host.h>
+#include <asm/kvm_mmu.h>
+
+/* We currently only support PV time on ARM64 */
+#ifdef CONFIG_ARM64
+
+#include <asm/pvclock-abi.h>
+
+static int max_stolen_size(void)
+{
+	int size = KVM_MAX_VCPUS * sizeof(struct pvclock_vcpu_stolen_time_info);
+
+	return ALIGN(size, PAGE_SIZE);
+}
+
+static int kvm_arm_pvtime_create(struct kvm_device *dev, u32 type)
+{
+	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
+
+	pvtime->st = alloc_pages_exact(max_stolen_size(),
+				       GFP_KERNEL | __GFP_ZERO);
+	if (!pvtime->st)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void kvm_arm_pvtime_destroy(struct kvm_device *dev)
+{
+	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
+
+	pvtime->st_base = GPA_INVALID;
+	free_pages_exact(pvtime->st, max_stolen_size());
+	kfree(dev);
+}
+
+static int pvtime_map_pages(struct kvm *kvm, gpa_t guest_paddr,
+			    void *kaddr, int size)
+{
+	return kvm_phys_addr_memremap(kvm, guest_paddr,
+			virt_to_phys(kaddr),
+			size, false);
+}
+
+static int pvtime_save_state(struct kvm *kvm, u64 type, void __user *user)
+{
+	void *source;
+	size_t size;
+
+	switch (type) {
+	case KVM_DEV_ARM_PV_TIME_ST:
+		source = kvm->arch.pvtime.st;
+		size = sizeof(struct pvclock_vcpu_stolen_time_info) *
+			atomic_read(&kvm->online_vcpus);
+		break;
+	default:
+		return -ENXIO;
+	}
+
+	if (copy_to_user(user, source, size))
+		return -EFAULT;
+	return 0;
+}
+
+static int pvtime_restore_state(struct kvm *kvm, u64 type, void __user *user)
+{
+	void *dest;
+	size_t size;
+
+	switch (type) {
+	case KVM_DEV_ARM_PV_TIME_ST:
+		dest = kvm->arch.pvtime.st;
+		size = sizeof(struct pvclock_vcpu_stolen_time_info) *
+			atomic_read(&kvm->online_vcpus);
+		break;
+	default:
+		return -ENXIO;
+	}
+
+	if (copy_from_user(dest, user, size))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int kvm_arm_pvtime_set_attr(struct kvm_device *dev,
+				   struct kvm_device_attr *attr)
+{
+	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
+	u64 __user *user = (u64 __user *)attr->addr;
+	u64 paddr;
+	int ret;
+
+	switch (attr->group) {
+	case KVM_DEV_ARM_PV_TIME_PADDR:
+		if (get_user(paddr, user))
+			return -EFAULT;
+		if (paddr & 63)
+			return -EINVAL;
+		switch (attr->attr) {
+		case KVM_DEV_ARM_PV_TIME_ST:
+			if (pvtime->st_base != GPA_INVALID)
+				return -EEXIST;
+			ret = pvtime_map_pages(dev->kvm, paddr, pvtime->st,
+					max_stolen_size());
+			if (ret)
+				return ret;
+			pvtime->st_base = paddr;
+			return 0;
+		}
+		break;
+	case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
+		return -EPERM;
+	case KVM_DEV_ARM_PV_TIME_STATE:
+		return pvtime_restore_state(dev->kvm, attr->attr, user);
+	}
+	return -ENXIO;
+}
+
+static int kvm_arm_pvtime_get_attr(struct kvm_device *dev,
+				   struct kvm_device_attr *attr)
+{
+	u64 __user *user = (u64 __user *)attr->addr;
+	u32 size;
+
+	switch (attr->group) {
+	case KVM_DEV_ARM_PV_TIME_PADDR:
+		switch (attr->attr) {
+		case KVM_DEV_ARM_PV_TIME_ST:
+			if (put_user(dev->kvm->arch.pvtime.st_base, user))
+				return -EFAULT;
+			return 0;
+		}
+		break;
+	case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
+		switch (attr->attr) {
+		case KVM_DEV_ARM_PV_TIME_ST:
+			size = sizeof(struct pvclock_vcpu_stolen_time_info);
+			size *= atomic_read(&dev->kvm->online_vcpus);
+			break;
+		default:
+			return -ENXIO;
+		}
+		if (put_user(size, user))
+			return -EFAULT;
+		return 0;
+	case KVM_DEV_ARM_PV_TIME_STATE:
+		return pvtime_save_state(dev->kvm, attr->attr, user);
+	}
+	return -ENXIO;
+}
+
+static int kvm_arm_pvtime_has_attr(struct kvm_device *dev,
+				   struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_ARM_PV_TIME_PADDR:
+	case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
+	case KVM_DEV_ARM_PV_TIME_STATE:
+		switch (attr->attr) {
+		case KVM_DEV_ARM_PV_TIME_ST:
+			return 0;
+		}
+		break;
+	}
+	return -ENXIO;
+}
+
+static const struct kvm_device_ops pvtime_ops = {
+	.name = "Arm PV time",
+	.create = kvm_arm_pvtime_create,
+	.destroy = kvm_arm_pvtime_destroy,
+	.set_attr = kvm_arm_pvtime_set_attr,
+	.get_attr = kvm_arm_pvtime_get_attr,
+	.has_attr = kvm_arm_pvtime_has_attr
+};
+
+static int __init kvm_pvtime_init(void)
+{
+	kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
+
+	return 0;
+}
+
+late_initcall(kvm_pvtime_init);
+
+#endif
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 7/9] arm/arm64: Provide a wrapper for SMCCC 1.1 calls
  2019-08-02 14:50 [PATCH 0/9] arm64: Stolen time support Steven Price
                   ` (5 preceding siblings ...)
  2019-08-02 14:50 ` [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space Steven Price
@ 2019-08-02 14:50 ` Steven Price
  2019-08-05 10:03   ` Will Deacon
  2019-08-02 14:50 ` [PATCH 8/9] arm/arm64: Make use of the SMCCC 1.1 wrapper Steven Price
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 43+ messages in thread
From: Steven Price @ 2019-08-02 14:50 UTC (permalink / raw)
  To: unlisted-recipients:; (no To-header on input)
  Cc: Steven Price, Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

SMCCC 1.1 calls may use either HVC or SMC depending on the PSCI
conduit. Rather than coding this in every call site, provide a macro
which uses the correct instruction. The macro also handles the case
where no PSCI conduit is configured by returning a not-supported error
in res, and it returns the conduit used for the call.

This allows us to remove some duplicated code and will be useful later
when adding paravirtualized time hypervisor calls.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 include/linux/arm-smccc.h | 44 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index e7f129f26ebd..eee1e832221d 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -303,6 +303,50 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, unsigned long a1,
 #define SMCCC_RET_NOT_SUPPORTED			-1
 #define SMCCC_RET_NOT_REQUIRED			-2
 
+/* Like arm_smccc_1_1* but always returns SMCCC_RET_NOT_SUPPORTED.
+ * Used when the PSCI conduit is not defined. The empty asm statement
+ * avoids compiler warnings about unused variables.
+ */
+#define __fail_smccc_1_1(...)						\
+	do {								\
+		__declare_args(__count_args(__VA_ARGS__), __VA_ARGS__);	\
+		asm ("" __constraints(__count_args(__VA_ARGS__)));	\
+		if (___res)						\
+			___res->a0 = SMCCC_RET_NOT_SUPPORTED;		\
+	} while (0)
+
+/*
+ * arm_smccc_1_1_invoke() - make an SMCCC v1.1 compliant call
+ *
+ * This is a variadic macro taking one to eight source arguments, and
+ * an optional return structure.
+ *
+ * @a0-a7: arguments passed in registers 0 to 7
+ * @res: result values from registers 0 to 3
+ *
+ * This macro will make either an HVC call or an SMC call depending on the
+ * current PSCI conduit. If no valid conduit is available then -1
+ * (SMCCC_RET_NOT_SUPPORTED) is returned in @res.a0 (if supplied).
+ *
+ * The return value also provides the conduit that was used.
+ */
+#define arm_smccc_1_1_invoke(...) ({					\
+		int method = psci_ops.conduit;				\
+		switch (method) {					\
+		case PSCI_CONDUIT_HVC:					\
+			arm_smccc_1_1_hvc(__VA_ARGS__);			\
+			break;						\
+		case PSCI_CONDUIT_SMC:					\
+			arm_smccc_1_1_smc(__VA_ARGS__);			\
+			break;						\
+		default:						\
+			__fail_smccc_1_1(__VA_ARGS__);			\
+			method = PSCI_CONDUIT_NONE;			\
+			break;						\
+		}							\
+		method;							\
+	})
+
 /* Paravirtualised time calls (defined by ARM DEN0057A) */
 #define ARM_SMCCC_HV_PV_FEATURES				\
 	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 8/9] arm/arm64: Make use of the SMCCC 1.1 wrapper
  2019-08-02 14:50 [PATCH 0/9] arm64: Stolen time support Steven Price
                   ` (6 preceding siblings ...)
  2019-08-02 14:50 ` [PATCH 7/9] arm/arm64: Provide a wrapper for SMCCC 1.1 calls Steven Price
@ 2019-08-02 14:50 ` Steven Price
  2019-08-02 14:50 ` [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest Steven Price
  2019-08-03 18:05 ` [PATCH 0/9] arm64: Stolen time support Marc Zyngier
  9 siblings, 0 replies; 43+ messages in thread
From: Steven Price @ 2019-08-02 14:50 UTC (permalink / raw)
  To: unlisted-recipients:; (no To-header on input)
  Cc: Steven Price, Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

Rather than directly choosing which function to use based on
psci_ops.conduit, use the new arm_smccc_1_1_invoke() wrapper instead.

In some cases we still need to do some operations based on the
conduit, but the code duplication is removed.

No functional change.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm/mm/proc-v7-bugs.c     | 13 +++---
 arch/arm64/kernel/cpu_errata.c | 80 ++++++++++++----------------------
 2 files changed, 33 insertions(+), 60 deletions(-)

diff --git a/arch/arm/mm/proc-v7-bugs.c b/arch/arm/mm/proc-v7-bugs.c
index 9a07916af8dd..8eb52f3385e7 100644
--- a/arch/arm/mm/proc-v7-bugs.c
+++ b/arch/arm/mm/proc-v7-bugs.c
@@ -78,12 +78,13 @@ static void cpu_v7_spectre_init(void)
 		if (psci_ops.smccc_version == SMCCC_VERSION_1_0)
 			break;
 
+		arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
+				     ARM_SMCCC_ARCH_WORKAROUND_1, &res);
+		if ((int)res.a0 != 0)
+			return;
+
 		switch (psci_ops.conduit) {
 		case PSCI_CONDUIT_HVC:
-			arm_smccc_1_1_hvc(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
-					  ARM_SMCCC_ARCH_WORKAROUND_1, &res);
-			if ((int)res.a0 != 0)
-				break;
 			per_cpu(harden_branch_predictor_fn, cpu) =
 				call_hvc_arch_workaround_1;
 			cpu_do_switch_mm = cpu_v7_hvc_switch_mm;
@@ -91,10 +92,6 @@ static void cpu_v7_spectre_init(void)
 			break;
 
 		case PSCI_CONDUIT_SMC:
-			arm_smccc_1_1_smc(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
-					  ARM_SMCCC_ARCH_WORKAROUND_1, &res);
-			if ((int)res.a0 != 0)
-				break;
 			per_cpu(harden_branch_predictor_fn, cpu) =
 				call_smc_arch_workaround_1;
 			cpu_do_switch_mm = cpu_v7_smc_switch_mm;
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 1e43ba5c79b7..400a49aaae85 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -215,40 +215,31 @@ static int detect_harden_bp_fw(void)
 	if (psci_ops.smccc_version == SMCCC_VERSION_1_0)
 		return -1;
 
+	arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
+			     ARM_SMCCC_ARCH_WORKAROUND_1, &res);
+
+	switch ((int)res.a0) {
+	case 1:
+		/* Firmware says we're just fine */
+		return 0;
+	case 0:
+		break;
+	default:
+		return -1;
+	}
+
 	switch (psci_ops.conduit) {
 	case PSCI_CONDUIT_HVC:
-		arm_smccc_1_1_hvc(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
-				  ARM_SMCCC_ARCH_WORKAROUND_1, &res);
-		switch ((int)res.a0) {
-		case 1:
-			/* Firmware says we're just fine */
-			return 0;
-		case 0:
-			cb = call_hvc_arch_workaround_1;
-			/* This is a guest, no need to patch KVM vectors */
-			smccc_start = NULL;
-			smccc_end = NULL;
-			break;
-		default:
-			return -1;
-		}
+		cb = call_hvc_arch_workaround_1;
+		/* This is a guest, no need to patch KVM vectors */
+		smccc_start = NULL;
+		smccc_end = NULL;
 		break;
 
 	case PSCI_CONDUIT_SMC:
-		arm_smccc_1_1_smc(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
-				  ARM_SMCCC_ARCH_WORKAROUND_1, &res);
-		switch ((int)res.a0) {
-		case 1:
-			/* Firmware says we're just fine */
-			return 0;
-		case 0:
-			cb = call_smc_arch_workaround_1;
-			smccc_start = __smccc_workaround_1_smc_start;
-			smccc_end = __smccc_workaround_1_smc_end;
-			break;
-		default:
-			return -1;
-		}
+		cb = call_smc_arch_workaround_1;
+		smccc_start = __smccc_workaround_1_smc_start;
+		smccc_end = __smccc_workaround_1_smc_end;
 		break;
 
 	default:
@@ -338,6 +329,7 @@ void __init arm64_enable_wa2_handling(struct alt_instr *alt,
 
 void arm64_set_ssbd_mitigation(bool state)
 {
+	int conduit;
 	if (!IS_ENABLED(CONFIG_ARM64_SSBD)) {
 		pr_info_once("SSBD disabled by kernel configuration\n");
 		return;
@@ -351,19 +343,10 @@ void arm64_set_ssbd_mitigation(bool state)
 		return;
 	}
 
-	switch (psci_ops.conduit) {
-	case PSCI_CONDUIT_HVC:
-		arm_smccc_1_1_hvc(ARM_SMCCC_ARCH_WORKAROUND_2, state, NULL);
-		break;
+	conduit = arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_WORKAROUND_2, state,
+				       NULL);
 
-	case PSCI_CONDUIT_SMC:
-		arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2, state, NULL);
-		break;
-
-	default:
-		WARN_ON_ONCE(1);
-		break;
-	}
+	WARN_ON_ONCE(conduit == PSCI_CONDUIT_NONE);
 }
 
 static bool has_ssbd_mitigation(const struct arm64_cpu_capabilities *entry,
@@ -373,6 +356,7 @@ static bool has_ssbd_mitigation(const struct arm64_cpu_capabilities *entry,
 	bool required = true;
 	s32 val;
 	bool this_cpu_safe = false;
+	int conduit;
 
 	WARN_ON(scope != SCOPE_LOCAL_CPU || preemptible());
 
@@ -397,18 +381,10 @@ static bool has_ssbd_mitigation(const struct arm64_cpu_capabilities *entry,
 		return false;
 	}
 
-	switch (psci_ops.conduit) {
-	case PSCI_CONDUIT_HVC:
-		arm_smccc_1_1_hvc(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
-				  ARM_SMCCC_ARCH_WORKAROUND_2, &res);
-		break;
-
-	case PSCI_CONDUIT_SMC:
-		arm_smccc_1_1_smc(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
-				  ARM_SMCCC_ARCH_WORKAROUND_2, &res);
-		break;
+	conduit = arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
+				       ARM_SMCCC_ARCH_WORKAROUND_2, &res);
 
-	default:
+	if (conduit == PSCI_CONDUIT_NONE) {
 		ssbd_state = ARM64_SSBD_UNKNOWN;
 		if (!this_cpu_safe)
 			__ssb_safe = false;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest
  2019-08-02 14:50 [PATCH 0/9] arm64: Stolen time support Steven Price
                   ` (7 preceding siblings ...)
  2019-08-02 14:50 ` [PATCH 8/9] arm/arm64: Make use of the SMCCC 1.1 wrapper Steven Price
@ 2019-08-02 14:50 ` Steven Price
  2019-08-04  9:53   ` Marc Zyngier
  2019-08-09 13:51   ` Zenghui Yu
  2019-08-03 18:05 ` [PATCH 0/9] arm64: Stolen time support Marc Zyngier
  9 siblings, 2 replies; 43+ messages in thread
From: Steven Price @ 2019-08-02 14:50 UTC (permalink / raw)
  To: unlisted-recipients:; (no To-header on input)
  Cc: Steven Price, Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

Enable paravirtualization features when running under a hypervisor
supporting the PV_TIME_ST hypercall.

For each (v)CPU, we ask the hypervisor for the location of a shared
page which the hypervisor will use to report stolen time to us. We set
pv_time_ops to the stolen time function which simply reads the stolen
value from the shared page for a VCPU. Reads are made single-copy
atomic using READ_ONCE, which means we can also safely read the stolen
time of a VCPU other than the currently running one while the
hypervisor is potentially updating it.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/kernel/Makefile |   1 +
 arch/arm64/kernel/kvm.c    | 155 +++++++++++++++++++++++++++++++++++++
 include/linux/cpuhotplug.h |   1 +
 3 files changed, 157 insertions(+)
 create mode 100644 arch/arm64/kernel/kvm.c

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 478491f07b4f..eb36edf9b930 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -63,6 +63,7 @@ obj-$(CONFIG_CRASH_CORE)		+= crash_core.o
 obj-$(CONFIG_ARM_SDE_INTERFACE)		+= sdei.o
 obj-$(CONFIG_ARM64_SSBD)		+= ssbd.o
 obj-$(CONFIG_ARM64_PTR_AUTH)		+= pointer_auth.o
+obj-$(CONFIG_PARAVIRT)			+= kvm.o
 
 obj-y					+= vdso/ probes/
 obj-$(CONFIG_COMPAT_VDSO)		+= vdso32/
diff --git a/arch/arm64/kernel/kvm.c b/arch/arm64/kernel/kvm.c
new file mode 100644
index 000000000000..245398c79dae
--- /dev/null
+++ b/arch/arm64/kernel/kvm.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (C) 2019 Arm Ltd.
+
+#define pr_fmt(fmt) "kvmarm-pv: " fmt
+
+#include <linux/arm-smccc.h>
+#include <linux/cpuhotplug.h>
+#include <linux/io.h>
+#include <linux/printk.h>
+#include <linux/psci.h>
+#include <linux/reboot.h>
+#include <linux/slab.h>
+
+#include <asm/paravirt.h>
+#include <asm/pvclock-abi.h>
+#include <asm/smp_plat.h>
+
+struct kvmarm_stolen_time_region {
+	struct pvclock_vcpu_stolen_time_info *kaddr;
+};
+
+static DEFINE_PER_CPU(struct kvmarm_stolen_time_region, stolen_time_region);
+
+static bool steal_acc = true;
+static int __init parse_no_stealacc(char *arg)
+{
+	steal_acc = false;
+	return 0;
+}
+early_param("no-steal-acc", parse_no_stealacc);
+
+/* return stolen time in ns by asking the hypervisor */
+static u64 kvm_steal_clock(int cpu)
+{
+	struct kvmarm_stolen_time_region *reg;
+
+	reg = per_cpu_ptr(&stolen_time_region, cpu);
+	if (!reg->kaddr) {
+		pr_warn_once("stolen time enabled but not configured for cpu %d\n",
+			     cpu);
+		return 0;
+	}
+
+	return le64_to_cpu(READ_ONCE(reg->kaddr->stolen_time));
+}
+
+static int disable_stolen_time_current_cpu(void)
+{
+	struct kvmarm_stolen_time_region *reg;
+
+	reg = this_cpu_ptr(&stolen_time_region);
+	if (!reg->kaddr)
+		return 0;
+
+	memunmap(reg->kaddr);
+	memset(reg, 0, sizeof(*reg));
+
+	return 0;
+}
+
+static int stolen_time_dying_cpu(unsigned int cpu)
+{
+	return disable_stolen_time_current_cpu();
+}
+
+static int init_stolen_time_cpu(unsigned int cpu)
+{
+	struct kvmarm_stolen_time_region *reg;
+	struct arm_smccc_res res;
+
+	reg = this_cpu_ptr(&stolen_time_region);
+
+	if (reg->kaddr)
+		return 0;
+
+	arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_TIME_ST, &res);
+
+	if ((long)res.a0 < 0)
+		return -EINVAL;
+
+	reg->kaddr = memremap(res.a0,
+			sizeof(struct pvclock_vcpu_stolen_time_info),
+			MEMREMAP_WB);
+
+	if (reg->kaddr == NULL) {
+		pr_warn("Failed to map stolen time data structure\n");
+		return -EINVAL;
+	}
+
+	if (le32_to_cpu(reg->kaddr->revision) != 0 ||
+			le32_to_cpu(reg->kaddr->attributes) != 0) {
+		pr_warn("Unexpected revision or attributes in stolen time data\n");
+		return -ENXIO;
+	}
+
+	return 0;
+}
+
+static int kvm_arm_init_stolen_time(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state(CPUHP_AP_ARM_KVMPV_STARTING,
+				"hypervisor/kvmarm/pv:starting",
+				init_stolen_time_cpu, stolen_time_dying_cpu);
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+
+static bool has_kvm_steal_clock(void)
+{
+	struct arm_smccc_res res;
+
+	/* To detect the presence of PV time support we require SMCCC 1.1+ */
+	if (psci_ops.smccc_version < SMCCC_VERSION_1_1)
+		return false;
+
+	arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
+			     ARM_SMCCC_HV_PV_FEATURES, &res);
+
+	if (res.a0 != SMCCC_RET_SUCCESS)
+		return false;
+
+	arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_FEATURES,
+			     ARM_SMCCC_HV_PV_TIME_ST, &res);
+
+	if (res.a0 != SMCCC_RET_SUCCESS)
+		return false;
+
+	return true;
+}
+
+static int __init kvm_guest_init(void)
+{
+	int ret = 0;
+
+	if (!has_kvm_steal_clock())
+		return 0;
+
+	ret = kvm_arm_init_stolen_time();
+	if (ret)
+		return ret;
+
+	pv_ops.time.steal_clock = kvm_steal_clock;
+
+	static_key_slow_inc(&paravirt_steal_enabled);
+	if (steal_acc)
+		static_key_slow_inc(&paravirt_steal_rq_enabled);
+
+	pr_info("using stolen time PV\n");
+
+	return 0;
+}
+early_initcall(kvm_guest_init);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 068793a619ca..89d75edb5750 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -136,6 +136,7 @@ enum cpuhp_state {
 	/* Must be the last timer callback */
 	CPUHP_AP_DUMMY_TIMER_STARTING,
 	CPUHP_AP_ARM_XEN_STARTING,
+	CPUHP_AP_ARM_KVMPV_STARTING,
 	CPUHP_AP_ARM_CORESIGHT_STARTING,
 	CPUHP_AP_ARM64_ISNDEP_STARTING,
 	CPUHP_AP_SMPCFD_DYING,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
  2019-08-02 14:50 ` [PATCH 1/9] KVM: arm64: Document PV-time interface Steven Price
@ 2019-08-03 11:13   ` Marc Zyngier
  2019-08-05 13:06     ` Steven Price
  2019-08-05  3:23   ` Zenghui Yu
  2019-08-05 16:40   ` Christophe de Dinechin
  2 siblings, 1 reply; 43+ messages in thread
From: Marc Zyngier @ 2019-08-03 11:13 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Paolo Bonzini, Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel, peter.maydell

On Fri,  2 Aug 2019 15:50:09 +0100
Steven Price <steven.price@arm.com> wrote:

[+Peter for the userspace aspect of things]

Hi Steve,

> Introduce a paravirtualization interface for KVM/arm64 based on the
> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
> 
> This only adds the details about "Stolen Time" as the details of "Live
> Physical Time" have not been fully agreed.
> 
> User space can specify a reserved area of memory for the guest and
> inform KVM to populate the memory with information on time that the host
> kernel has stolen from the guest.
> 
> A hypercall interface is provided for the guest to interrogate the
> hypervisor's support for this interface and the location of the shared
> memory structures.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
>  1 file changed, 107 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
> 
> diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
> new file mode 100644
> index 000000000000..e6ae9799e1d5
> --- /dev/null
> +++ b/Documentation/virtual/kvm/arm/pvtime.txt
> @@ -0,0 +1,107 @@
> +Paravirtualized time support for arm64
> +======================================
> +
> +Arm specification DEN0057/A defined a standard for paravirtualised time
> +support for Aarch64 guests:

nit: AArch64

> +
> +https://developer.arm.com/docs/den0057/a

Between this file and the above document, which one is authoritative?

> +
> +KVM/Arm64 implements the stolen time part of this specification by providing

nit: KVM/arm64

> +some hypervisor service calls to support a paravirtualized guest obtaining a
> +view of the amount of time stolen from its execution.
> +
> +Two new SMCCC compatible hypercalls are defined:
> +
> +PV_FEATURES 0xC5000020
> +PV_TIME_ST  0xC5000022
> +
> +These are only available in the SMC64/HVC64 calling convention as
> +paravirtualized time is not available to 32 bit Arm guests.
> +
> +PV_FEATURES
> +    Function ID:  (uint32)  : 0xC5000020
> +    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
> +    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
> +                              PV-time feature is supported by the hypervisor.

How is PV_FEATURES discovered? Is the intention to make it a generic
ARM-wide PV discovery mechanism, not specific to PV time?

> +
> +PV_TIME_ST
> +    Function ID:  (uint32)  : 0xC5000022
> +    Return value: (int64)   : IPA of the stolen time data structure for this
> +                              (V)CPU. On failure:
> +                              NOT_SUPPORTED (-1)
> +

Is the size implicit? What are the memory attributes? This either needs
documenting here, or a pointer to the right bit of the spec.

> +Stolen Time
> +-----------
> +
> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
> +
> +  Field       | Byte Length | Byte Offset | Description
> +  ----------- | ----------- | ----------- | --------------------------
> +  Revision    |      4      |      0      | Must be 0 for version 0.1
> +  Attributes  |      4      |      4      | Must be 0
> +  Stolen time |      8      |      8      | Stolen time in unsigned
> +              |             |             | nanoseconds indicating how
> +              |             |             | much time this VCPU thread
> +              |             |             | was involuntarily not
> +              |             |             | running on a physical CPU.
> +
> +The structure will be updated by the hypervisor periodically as time is stolen

Is it really periodic? If so, what is the update frequency?

> +from the VCPU. It will be present within a reserved region of the normal
> +memory given to the guest. The guest should not attempt to write into this
> +memory. There is a structure per VCPU of the guest.

What if the vcpu writes to it? Does it get a fault? If there is a
structure per vcpu, what is the layout in memory? How does a vcpu find
its own data structure? Is that the address returned by PV_TIME_ST?
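[As an editorial aside: the layout in the table matches the `pvclock_vcpu_stolen_time_info` structure introduced later in the series. A user-space sketch of the record, with the table's offsets checked, might look like the following — the kernel version uses `__le32`/`__le64` and is marked `__packed`:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space sketch of the per-VCPU record from the table above. */
struct pvclock_vcpu_stolen_time_info {
	uint32_t revision;    /* byte offset 0: must be 0 for version 0.1 */
	uint32_t attributes;  /* byte offset 4: must be 0 */
	uint64_t stolen_time; /* byte offset 8: stolen time in nanoseconds */
	uint8_t padding[48];  /* pad the record to 64 bytes */
};
```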

> +
> +User space interface
> +====================
> +
> +User space can request that KVM provide the paravirtualized time interface to
> +a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
> +
> +    struct kvm_create_device pvtime_device = {
> +            .type = KVM_DEV_TYPE_ARM_PV_TIME,
> +            .attr = 0,
> +            .flags = 0,
> +    };
> +
> +    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
> +
> +The guest IPA of the structures must be given to KVM. This is the base address

nit: s/guest //

> +of an array of stolen time structures (one for each VCPU). For example:
> +
> +    struct kvm_device_attr st_base = {
> +            .group = KVM_DEV_ARM_PV_TIME_PADDR,
> +            .attr = KVM_DEV_ARM_PV_TIME_ST,
> +            .addr = (u64)(unsigned long)&st_paddr
> +    };
> +
> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);

So the allocation itself is performed by the kernel? What are the
ordering requirements between creating vcpus and the device? What are
the alignment requirements for the base address?
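[Editorial note: from the documentation above, each record is 64 bytes and the base must be 64-byte aligned, with one record per VCPU laid out as an array. A sketch of the implied address arithmetic — the helper names are illustrative, not from the patch:]

```c
#include <assert.h>
#include <stdint.h>

#define STOLEN_TIME_RECORD_SIZE 64

/* Base IPA must be 64-byte aligned, per the documented interface. */
static int st_base_is_valid(uint64_t base)
{
	return (base & (STOLEN_TIME_RECORD_SIZE - 1)) == 0;
}

/* Each VCPU's record sits at base + index * 64, as an array. */
static uint64_t st_record_ipa(uint64_t base, unsigned int vcpu_idx)
{
	return base + (uint64_t)vcpu_idx * STOLEN_TIME_RECORD_SIZE;
}
```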

> +
> +For migration (or save/restore) of a guest it is necessary to save the contents
> +of the shared page(s) and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
> +provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
> +to be read/written.

Is the size variable depending on the number of vcpus?

> +
> +It is also necessary for the physical address to be set identically when
> +restoring.
> +
> +    void *save_state(int fd, u64 attr, u32 *size) {
> +        struct kvm_device_attr get_size = {
> +                .group = KVM_DEV_ARM_PV_TIME_STATE_SIZE,
> +                .attr = attr,
> +                .addr = (u64)(unsigned long)size
> +        };
> +
> +        ioctl(fd, KVM_GET_DEVICE_ATTR, &get_size);
> +
> +        void *buffer = malloc(*size);
> +
> +        struct kvm_device_attr get_state = {
> +                .group = KVM_DEV_ARM_PV_TIME_STATE,
> +                .attr = attr,
> +                .addr = (u64)(unsigned long)buffer
> +        };
> +
> +        ioctl(fd, KVM_GET_DEVICE_ATTR, &get_state);
> +
> +        return buffer;
> +    }
> +
> +    void *st_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_ST, &st_size);
> +

Thanks,

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/9] KVM: arm64: Implement PV_FEATURES call
  2019-08-02 14:50 ` [PATCH 3/9] KVM: arm64: Implement PV_FEATURES call Steven Price
@ 2019-08-03 11:21   ` Marc Zyngier
  2019-08-05 13:14     ` Steven Price
  0 siblings, 1 reply; 43+ messages in thread
From: Marc Zyngier @ 2019-08-03 11:21 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Paolo Bonzini, Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

On Fri,  2 Aug 2019 15:50:11 +0100
Steven Price <steven.price@arm.com> wrote:

> This provides a mechanism for querying which paravirtualized features
> are available in this hypervisor.
> 
> Also add the header file which defines the ABI for the paravirtualized
> clock features we're about to add.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/include/asm/pvclock-abi.h | 20 ++++++++++++++++++++
>  include/linux/arm-smccc.h            | 14 ++++++++++++++
>  virt/kvm/arm/hypercalls.c            |  9 +++++++++
>  3 files changed, 43 insertions(+)
>  create mode 100644 arch/arm64/include/asm/pvclock-abi.h
> 
> diff --git a/arch/arm64/include/asm/pvclock-abi.h b/arch/arm64/include/asm/pvclock-abi.h
> new file mode 100644
> index 000000000000..1f7cdc102691
> --- /dev/null
> +++ b/arch/arm64/include/asm/pvclock-abi.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2019 Arm Ltd. */
> +
> +#ifndef __ASM_PVCLOCK_ABI_H
> +#define __ASM_PVCLOCK_ABI_H
> +
> +/* The below structures and constants are defined in ARM DEN0057A */
> +
> +struct pvclock_vcpu_stolen_time_info {
> +	__le32 revision;
> +	__le32 attributes;
> +	__le64 stolen_time;
> +	/* Structure must be 64 byte aligned, pad to that size */
> +	u8 padding[48];
> +} __packed;
> +
> +#define PV_VM_TIME_NOT_SUPPORTED	-1

Isn't the intent for this to be the same value as
SMCCC_RET_NOT_SUPPORTED?

> +#define PV_VM_TIME_INVALID_PARAMETERS	-2

It overlaps with SMCCC_RET_NOT_REQUIRED. Is that a problem? Should we
consider a spec change for this?
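[Editorial note: the numeric clash Marc points out, spelled out with the constants as they appear elsewhere in this series (`SMCCC_RET_*` from `include/linux/arm-smccc.h`, `PV_VM_TIME_*` from the new `pvclock-abi.h`):]

```c
#include <assert.h>

/* Generic SMCCC return codes. */
#define SMCCC_RET_NOT_SUPPORTED		-1
#define SMCCC_RET_NOT_REQUIRED		-2

/* PV-time error codes from DEN0057A. */
#define PV_VM_TIME_NOT_SUPPORTED	-1
#define PV_VM_TIME_INVALID_PARAMETERS	-2
```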

> +
> +#endif
> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
> index 080012a6f025..e7f129f26ebd 100644
> --- a/include/linux/arm-smccc.h
> +++ b/include/linux/arm-smccc.h
> @@ -45,6 +45,7 @@
>  #define ARM_SMCCC_OWNER_SIP		2
>  #define ARM_SMCCC_OWNER_OEM		3
>  #define ARM_SMCCC_OWNER_STANDARD	4
> +#define ARM_SMCCC_OWNER_STANDARD_HYP	5
>  #define ARM_SMCCC_OWNER_TRUSTED_APP	48
>  #define ARM_SMCCC_OWNER_TRUSTED_APP_END	49
>  #define ARM_SMCCC_OWNER_TRUSTED_OS	50
> @@ -302,5 +303,18 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, unsigned long a1,
>  #define SMCCC_RET_NOT_SUPPORTED			-1
>  #define SMCCC_RET_NOT_REQUIRED			-2
>  
> +/* Paravirtualised time calls (defined by ARM DEN0057A) */
> +#define ARM_SMCCC_HV_PV_FEATURES				\
> +	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
> +			   ARM_SMCCC_SMC_64,			\
> +			   ARM_SMCCC_OWNER_STANDARD_HYP,	\
> +			   0x20)
> +
> +#define ARM_SMCCC_HV_PV_TIME_ST					\
> +	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
> +			   ARM_SMCCC_SMC_64,			\
> +			   ARM_SMCCC_OWNER_STANDARD_HYP,	\
> +			   0x22)
> +
>  #endif /*__ASSEMBLY__*/
>  #endif /*__LINUX_ARM_SMCCC_H*/
> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
> index f875241bd030..2906b2df99df 100644
> --- a/virt/kvm/arm/hypercalls.c
> +++ b/virt/kvm/arm/hypercalls.c
> @@ -5,6 +5,7 @@
>  #include <linux/kvm_host.h>
>  
>  #include <asm/kvm_emulate.h>
> +#include <asm/pvclock-abi.h>
>  
>  #include <kvm/arm_hypercalls.h>
>  #include <kvm/arm_psci.h>
> @@ -48,6 +49,14 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  				break;
>  			}
>  			break;
> +		case ARM_SMCCC_HV_PV_FEATURES:
> +			val = SMCCC_RET_SUCCESS;
> +			break;
> +		}
> +		break;
> +	case ARM_SMCCC_HV_PV_FEATURES:
> +		feature = smccc_get_arg1(vcpu);
> +		switch (feature) {
>  		}
>  		break;
>  	default:


Thanks,

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure
  2019-08-02 14:50 ` [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure Steven Price
@ 2019-08-03 11:55   ` Marc Zyngier
  2019-08-05 14:09     ` Steven Price
  2019-08-03 17:58   ` Marc Zyngier
  1 sibling, 1 reply; 43+ messages in thread
From: Marc Zyngier @ 2019-08-03 11:55 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Paolo Bonzini, Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

On Fri,  2 Aug 2019 15:50:12 +0100
Steven Price <steven.price@arm.com> wrote:

> Implement the service call for configuring a shared structre between a

structure

> VCPU and the hypervisor in which the hypervisor can write the time
> stolen from the VCPU's execution time by other tasks on the host.
> 
> The hypervisor allocates memory which is placed at an IPA chosen by user
> space. The hypervisor then uses WRITE_ONCE() to update the shared
> structre ensuring single copy atomicity of the 64-bit unsigned value

structure

> that reports stolen time in nanoseconds.
> 
> Whenever stolen time is enabled by the guest, the stolen time counter is
> reset.
> 
> The stolen time itself is retrieved from the sched_info structure
> maintained by the Linux scheduler code. We enable SCHEDSTATS when
> selecting KVM Kconfig to ensure this value is meaningful.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/include/asm/kvm_host.h | 13 +++++-
>  arch/arm64/kvm/Kconfig            |  1 +
>  include/kvm/arm_hypercalls.h      |  1 +
>  include/linux/kvm_types.h         |  2 +
>  virt/kvm/arm/arm.c                | 18 ++++++++
>  virt/kvm/arm/hypercalls.c         | 70 +++++++++++++++++++++++++++++++
>  6 files changed, 104 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index f656169db8c3..78f270190d43 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -44,6 +44,7 @@
>  	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_IRQ_PENDING	KVM_ARCH_REQ(1)
>  #define KVM_REQ_VCPU_RESET	KVM_ARCH_REQ(2)
> +#define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
>  
>  DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
>  
> @@ -83,6 +84,11 @@ struct kvm_arch {
>  
>  	/* Mandated version of PSCI */
>  	u32 psci_version;
> +
> +	struct kvm_arch_pvtime {
> +		void *st;

Is it really a void *? I'm sure you can use a proper type here...

> +		gpa_t st_base;
> +	} pvtime;
>  };
>  
>  #define KVM_NR_MEM_OBJS     40
> @@ -338,8 +344,13 @@ struct kvm_vcpu_arch {
>  	/* True when deferrable sysregs are loaded on the physical CPU,
>  	 * see kvm_vcpu_load_sysregs and kvm_vcpu_put_sysregs. */
>  	bool sysregs_loaded_on_cpu;
> -};
>  
> +	/* Guest PV state */
> +	struct {
> +		u64 steal;
> +		u64 last_steal;
> +	} steal;
> +};
>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
>  #define vcpu_sve_pffr(vcpu) ((void *)((char *)((vcpu)->arch.sve_state) + \
>  				      sve_ffr_offset((vcpu)->arch.sve_max_vl)))
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index a67121d419a2..d8b88e40d223 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -39,6 +39,7 @@ config KVM
>  	select IRQ_BYPASS_MANAGER
>  	select HAVE_KVM_IRQ_BYPASS
>  	select HAVE_KVM_VCPU_RUN_PID_CHANGE
> +	select SCHEDSTATS
>  	---help---
>  	  Support hosting virtualized guest machines.
>  	  We don't support KVM with 16K page tables yet, due to the multiple
> diff --git a/include/kvm/arm_hypercalls.h b/include/kvm/arm_hypercalls.h
> index 35a5abcc4ca3..9f0710ab4292 100644
> --- a/include/kvm/arm_hypercalls.h
> +++ b/include/kvm/arm_hypercalls.h
> @@ -7,6 +7,7 @@
>  #include <asm/kvm_emulate.h>
>  
>  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
> +int kvm_update_stolen_time(struct kvm_vcpu *vcpu);
>  
>  static inline u32 smccc_get_function(struct kvm_vcpu *vcpu)
>  {
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index bde5374ae021..1c88e69db3d9 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -35,6 +35,8 @@ typedef unsigned long  gva_t;
>  typedef u64            gpa_t;
>  typedef u64            gfn_t;
>  
> +#define GPA_INVALID	(~(gpa_t)0)
> +
>  typedef unsigned long  hva_t;
>  typedef u64            hpa_t;
>  typedef u64            hfn_t;
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index f645c0fbf7ec..ebd963d2580b 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -40,6 +40,10 @@
>  #include <asm/kvm_coproc.h>
>  #include <asm/sections.h>
>  
> +#include <kvm/arm_hypercalls.h>
> +#include <kvm/arm_pmu.h>
> +#include <kvm/arm_psci.h>
> +
>  #ifdef REQUIRES_VIRT
>  __asm__(".arch_extension	virt");
>  #endif
> @@ -135,6 +139,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	kvm->arch.max_vcpus = vgic_present ?
>  				kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
>  
> +	kvm->arch.pvtime.st_base = GPA_INVALID;
>  	return ret;
>  out_free_stage2_pgd:
>  	kvm_free_stage2_pgd(kvm);
> @@ -371,6 +376,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	kvm_vcpu_load_sysregs(vcpu);
>  	kvm_arch_vcpu_load_fp(vcpu);
>  	kvm_vcpu_pmu_restore_guest(vcpu);
> +	kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
>  
>  	if (single_task_running())
>  		vcpu_clear_wfe_traps(vcpu);
> @@ -617,6 +623,15 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
>  	smp_rmb();
>  }
>  
> +static void vcpu_req_record_steal(struct kvm_vcpu *vcpu)
> +{
> +	int idx;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +	kvm_update_stolen_time(vcpu);
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +}
> +
>  static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
>  {
>  	return vcpu->arch.target >= 0;
> @@ -636,6 +651,9 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
>  		 * that a VCPU sees new virtual interrupts.
>  		 */
>  		kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
> +
> +		if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu))
> +			vcpu_req_record_steal(vcpu);
>  	}
>  }
>  
> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
> index 2906b2df99df..196c71c8dd87 100644
> --- a/virt/kvm/arm/hypercalls.c
> +++ b/virt/kvm/arm/hypercalls.c
> @@ -10,6 +10,70 @@
>  #include <kvm/arm_hypercalls.h>
>  #include <kvm/arm_psci.h>
>  
> +
> +static struct pvclock_vcpu_stolen_time_info *pvtime_get_st(
> +		struct kvm_vcpu *vcpu)

nit: on a single line.

> +{
> +	struct pvclock_vcpu_stolen_time_info *st = vcpu->kvm->arch.pvtime.st;
> +
> +	if (!st)
> +		return NULL;
> +
> +	return &st[kvm_vcpu_get_idx(vcpu)];
> +}
> +
> +int kvm_update_stolen_time(struct kvm_vcpu *vcpu)
> +{
> +	u64 steal;
> +	struct pvclock_vcpu_stolen_time_info *kaddr;
> +
> +	if (vcpu->kvm->arch.pvtime.st_base == GPA_INVALID)
> +		return -ENOTSUPP;

So for a guest that doesn't have stolen time support (which is 100% of
them for the foreseeable future), we still set a request, take the srcu
lock and end-up here for nothing. I'd rather we test this st_base
early, as it should never change once the guest has started.

> +
> +	kaddr = pvtime_get_st(vcpu);
> +
> +	if (!kaddr)
> +		return -ENOTSUPP;

How can this happen?

> +
> +	kaddr->revision = 0;
> +	kaddr->attributes = 0;

Why does this need to be written each time we update the stolen time? I
have the feeling this would be better moved to the hypercall
initializing the data structure.

> +
> +	/* Let's do the local bookkeeping */
> +	steal = vcpu->arch.steal.steal;
> +	steal += current->sched_info.run_delay - vcpu->arch.steal.last_steal;
> +	vcpu->arch.steal.last_steal = current->sched_info.run_delay;
> +	vcpu->arch.steal.steal = steal;
> +
> +	/* Now write out the value to the shared page */
> +	WRITE_ONCE(kaddr->stolen_time, cpu_to_le64(steal));

Is there any requirement for this to be visible to another CPU than the
one this is being written from?
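[Editorial note: the bookkeeping in the hunk above can be modelled in isolation — `run_delay` is a monotonic count of nanoseconds spent runnable-but-not-running, and the accumulated steal value grows only by the delta since the last update. A self-contained sketch with the types simplified from the kernel code:]

```c
#include <assert.h>
#include <stdint.h>

struct steal_state {
	uint64_t steal;      /* accumulated stolen time, ns */
	uint64_t last_steal; /* run_delay snapshot at last update */
};

/* Model of the delta accumulation in kvm_update_stolen_time(). */
static uint64_t update_steal(struct steal_state *s, uint64_t run_delay)
{
	s->steal += run_delay - s->last_steal;
	s->last_steal = run_delay;
	return s->steal; /* the value written to the shared record */
}
```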

> +
> +	return 0;
> +}
> +
> +static int kvm_hypercall_stolen_time(struct kvm_vcpu *vcpu)
> +{
> +	u64 ret;
> +	int err;
> +
> +	/*
> +	 * Start counting stolen time from the time the guest requests
> +	 * the feature enabled.
> +	 */
> +	vcpu->arch.steal.steal = 0;
> +	vcpu->arch.steal.last_steal = current->sched_info.run_delay;
> +
> +	err = kvm_update_stolen_time(vcpu);
> +
> +	if (err)
> +		ret = SMCCC_RET_NOT_SUPPORTED;
> +	else
> +		ret = vcpu->kvm->arch.pvtime.st_base +
> +			(sizeof(struct pvclock_vcpu_stolen_time_info) *
> +			 kvm_vcpu_get_idx(vcpu));
> +
> +	smccc_set_retval(vcpu, ret, 0, 0, 0);
> +	return 1;
> +}
>  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  {
>  	u32 func_id = smccc_get_function(vcpu);
> @@ -57,8 +121,14 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  	case ARM_SMCCC_HV_PV_FEATURES:
>  		feature = smccc_get_arg1(vcpu);
>  		switch (feature) {
> +		case ARM_SMCCC_HV_PV_FEATURES:
> +		case ARM_SMCCC_HV_PV_TIME_ST:
> +			val = SMCCC_RET_SUCCESS;
> +			break;
>  		}
>  		break;
> +	case ARM_SMCCC_HV_PV_TIME_ST:
> +		return kvm_hypercall_stolen_time(vcpu);
>  	default:
>  		return kvm_psci_call(vcpu);
>  	}


Thanks,

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space
  2019-08-02 14:50 ` [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space Steven Price
@ 2019-08-03 12:51   ` Marc Zyngier
  2019-08-03 17:34     ` Marc Zyngier
  2019-08-05 16:10     ` Steven Price
  0 siblings, 2 replies; 43+ messages in thread
From: Marc Zyngier @ 2019-08-03 12:51 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Paolo Bonzini, Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

On Fri,  2 Aug 2019 15:50:14 +0100
Steven Price <steven.price@arm.com> wrote:

> Allow user space to inform the KVM host where in the physical memory
> map the paravirtualized time structures should be located.
> 
> A device is created which provides the base address of an array of
> Stolen Time (ST) structures, one for each VCPU. There must be (64 *
> total number of VCPUs) bytes of memory available at this location.
> 
> The address is given in terms of the physical address visible to
> the guest and must be 64 byte aligned. The memory should be marked as
> reserved to the guest to stop it allocating it for other purposes.

Why? You seem to be allocating the memory from the kernel, so as far as
the guest is concerned, this isn't generally usable memory.

> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/include/asm/kvm_mmu.h  |   2 +
>  arch/arm64/include/uapi/asm/kvm.h |   6 +
>  arch/arm64/kvm/Makefile           |   1 +
>  include/uapi/linux/kvm.h          |   2 +
>  virt/kvm/arm/mmu.c                |  44 +++++++
>  virt/kvm/arm/pvtime.c             | 190 ++++++++++++++++++++++++++++++
>  6 files changed, 245 insertions(+)
>  create mode 100644 virt/kvm/arm/pvtime.c
> 
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index befe37d4bc0e..88c8a4b2836f 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -157,6 +157,8 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm);
>  void kvm_free_stage2_pgd(struct kvm *kvm);
>  int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  			  phys_addr_t pa, unsigned long size, bool writable);
> +int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
> +			  phys_addr_t pa, unsigned long size, bool writable);
>  
>  int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
>  
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 9a507716ae2f..95516a4198ea 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -367,6 +367,12 @@ struct kvm_vcpu_events {
>  #define KVM_PSCI_RET_INVAL		PSCI_RET_INVALID_PARAMS
>  #define KVM_PSCI_RET_DENIED		PSCI_RET_DENIED
>  
> +/* Device Control API: PV_TIME */
> +#define KVM_DEV_ARM_PV_TIME_PADDR	0
> +#define  KVM_DEV_ARM_PV_TIME_ST		0
> +#define KVM_DEV_ARM_PV_TIME_STATE_SIZE	1
> +#define KVM_DEV_ARM_PV_TIME_STATE	2
> +
>  #endif
>  
>  #endif /* __ARM_KVM_H__ */
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 73dce4d47d47..5ffbdc39e780 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -14,6 +14,7 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/e
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arm.o $(KVM)/arm/mmu.o $(KVM)/arm/mmio.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/psci.o $(KVM)/arm/perf.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hypercalls.o
> +kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/pvtime.o
>  
>  kvm-$(CONFIG_KVM_ARM_HOST) += inject_fault.o regmap.o va_layout.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += hyp.o hyp-init.o handle_exit.o
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a7c19540ce21..04bffafa0708 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1222,6 +1222,8 @@ enum kvm_device_type {
>  #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
>  	KVM_DEV_TYPE_XIVE,
>  #define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
> +	KVM_DEV_TYPE_ARM_PV_TIME,
> +#define KVM_DEV_TYPE_ARM_PV_TIME	KVM_DEV_TYPE_ARM_PV_TIME
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 38b4c910b6c3..be28a4aee451 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1368,6 +1368,50 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  	return ret;
>  }
>  
> +/**
> + * kvm_phys_addr_memremap - map a memory range to guest IPA
> + *
> + * @kvm:	The KVM pointer
> + * @guest_ipa:	The IPA at which to insert the mapping
> + * @pa:		The physical address of the memory
> + * @size:	The size of the mapping
> + */
> +int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
> +			  phys_addr_t pa, unsigned long size, bool writable)
> +{
> +	phys_addr_t addr, end;
> +	int ret = 0;
> +	unsigned long pfn;
> +	struct kvm_mmu_memory_cache cache = { 0, };
> +
> +	end = (guest_ipa + size + PAGE_SIZE - 1) & PAGE_MASK;
> +	pfn = __phys_to_pfn(pa);
> +
> +	for (addr = guest_ipa; addr < end; addr += PAGE_SIZE) {
> +		pte_t pte = pfn_pte(pfn, PAGE_S2);
> +
> +		if (writable)
> +			pte = kvm_s2pte_mkwrite(pte);
> +
> +		ret = mmu_topup_memory_cache(&cache,
> +					     kvm_mmu_cache_min_pages(kvm),
> +					     KVM_NR_MEM_OBJS);
> +		if (ret)
> +			goto out;
> +		spin_lock(&kvm->mmu_lock);
> +		ret = stage2_set_pte(kvm, &cache, addr, &pte, 0);
> +		spin_unlock(&kvm->mmu_lock);
> +		if (ret)
> +			goto out;
> +
> +		pfn++;
> +	}
> +
> +out:
> +	mmu_free_memory_cache(&cache);
> +	return ret;
> +}

This is an exact copy of kvm_phys_addr_ioremap(), with only the memory
attributes changing. Surely we can have a shared implementation that
takes the memory attribute as a parameter.

> +
>  static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
>  {
>  	kvm_pfn_t pfn = *pfnp;
> diff --git a/virt/kvm/arm/pvtime.c b/virt/kvm/arm/pvtime.c
> new file mode 100644
> index 000000000000..9051bc07eae1
> --- /dev/null
> +++ b/virt/kvm/arm/pvtime.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (C) 2019 Arm Ltd.
> +
> +#include <linux/kvm_host.h>
> +#include <asm/kvm_mmu.h>
> +
> +/* We currently only support PV time on ARM64 */
> +#ifdef CONFIG_ARM64

And we're only compiling it on arm64, so why the #ifdef?

> +
> +#include <asm/pvclock-abi.h>
> +
> +static int max_stolen_size(void)
> +{
> +	int size = KVM_MAX_VCPUS * sizeof(struct pvclock_vcpu_stolen_time_info);

So we're always allocating enough memory for 512 CPUs? That's an
additional 32kB of contiguous memory...
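[Editorial note: the arithmetic behind the 32kB figure, assuming `KVM_MAX_VCPUS` is 512 on arm64 at the time of this series and each stolen-time record is 64 bytes:]

```c
#include <assert.h>

/* Assumed values, not taken verbatim from this patch. */
#define ASSUMED_KVM_MAX_VCPUS	512
#define STOLEN_TIME_RECORD_SIZE	64

/* The eager allocation is sized for the maximum, regardless of how
 * many VCPUs the VM actually has. */
```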

> +
> +	return ALIGN(size, PAGE_SIZE);
> +}
> +
> +static int kvm_arm_pvtime_create(struct kvm_device *dev, u32 type)
> +{
> +	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
> +
> +	pvtime->st = alloc_pages_exact(max_stolen_size(),
> +				       GFP_KERNEL | __GFP_ZERO);
> +	if (!pvtime->st)
> +		return -ENOMEM;

Is there any chance we could use a vmalloc allocation instead? This
would lift the requirement on having physically contiguous memory.

> +
> +	return 0;
> +}
> +
> +static void kvm_arm_pvtime_destroy(struct kvm_device *dev)
> +{
> +	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
> +
> +	pvtime->st_base = GPA_INVALID;
> +	free_pages_exact(pvtime->st, max_stolen_size());
> +	kfree(dev);
> +}
> +
> +static int pvtime_map_pages(struct kvm *kvm, gpa_t guest_paddr,
> +			    void *kaddr, int size)
> +{
> +	return kvm_phys_addr_memremap(kvm, guest_paddr,
> +			virt_to_phys(kaddr),
> +			size, false);
> +}
> +
> +static int pvtime_save_state(struct kvm *kvm, u64 type, void __user *user)
> +{
> +	void *source;
> +	size_t size;
> +
> +	switch (type) {
> +	case KVM_DEV_ARM_PV_TIME_ST:
> +		source = kvm->arch.pvtime.st;
> +		size = sizeof(struct pvclock_vcpu_stolen_time_info) *
> +			atomic_read(&kvm->online_vcpus);
> +		break;
> +	default:
> +		return -ENXIO;
> +	}
> +
> +	if (copy_to_user(user, source, size))
> +		return -EFAULT;
> +	return 0;
> +}
> +
> +static int pvtime_restore_state(struct kvm *kvm, u64 type, void __user *user)
> +{
> +	void *dest;
> +	size_t size;
> +
> +	switch (type) {
> +	case KVM_DEV_ARM_PV_TIME_ST:
> +		dest = kvm->arch.pvtime.st;
> +		size = sizeof(struct pvclock_vcpu_stolen_time_info) *
> +			atomic_read(&kvm->online_vcpus);
> +		break;
> +	default:
> +		return -ENXIO;
> +	}
> +
> +	if (copy_from_user(dest, user, size))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static int kvm_arm_pvtime_set_attr(struct kvm_device *dev,
> +				   struct kvm_device_attr *attr)
> +{
> +	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
> +	u64 __user *user = (u64 __user *)attr->addr;
> +	u64 paddr;
> +	int ret;
> +
> +	switch (attr->group) {
> +	case KVM_DEV_ARM_PV_TIME_PADDR:
> +		if (get_user(paddr, user))
> +			return -EFAULT;
> +		if (paddr & 63)
> +			return -EINVAL;

You should check whether the device fits into the IPA space for this
guest, and whether it overlaps with anything else.
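[Editorial note: one way to phrase the two checks Marc asks for, as a user-space sketch using half-open ranges — the helper names and the IPA-limit parameter are illustrative, not from the patch:]

```c
#include <assert.h>
#include <stdint.h>

/* The region must fit entirely below the guest's IPA limit
 * (size-aware, so base + size cannot wrap past the limit). */
static int fits_in_ipa_space(uint64_t base, uint64_t size, uint64_t ipa_limit)
{
	return base < ipa_limit && size <= ipa_limit - base;
}

/* Two half-open ranges [base, base + size) overlap iff each one
 * starts before the other ends. */
static int ranges_overlap(uint64_t a_base, uint64_t a_size,
			  uint64_t b_base, uint64_t b_size)
{
	return a_base < b_base + b_size && b_base < a_base + a_size;
}
```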

> +		switch (attr->attr) {
> +		case KVM_DEV_ARM_PV_TIME_ST:
> +			if (pvtime->st_base != GPA_INVALID)
> +				return -EEXIST;
> +			ret = pvtime_map_pages(dev->kvm, paddr, pvtime->st,
> +					max_stolen_size());

Consider moving the size directly into pvtime_map_pages(), and dropping
the pvtime->st parameter. All you need is kvm and paddr.

> +			if (ret)
> +				return ret;
> +			pvtime->st_base = paddr;
> +			return 0;
> +		}
> +		break;
> +	case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
> +		return -EPERM;
> +	case KVM_DEV_ARM_PV_TIME_STATE:
> +		return pvtime_restore_state(dev->kvm, attr->attr, user);
> +	}
> +	return -ENXIO;
> +}
> +
> +static int kvm_arm_pvtime_get_attr(struct kvm_device *dev,
> +				   struct kvm_device_attr *attr)
> +{
> +	u64 __user *user = (u64 __user *)attr->addr;
> +	u32 size;
> +
> +	switch (attr->group) {
> +	case KVM_DEV_ARM_PV_TIME_PADDR:
> +		switch (attr->attr) {
> +		case KVM_DEV_ARM_PV_TIME_ST:
> +			if (put_user(dev->kvm->arch.pvtime.st_base, user))
> +				return -EFAULT;
> +			return 0;
> +		}
> +		break;
> +	case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
> +		switch (attr->attr) {
> +		case KVM_DEV_ARM_PV_TIME_ST:
> +			size = sizeof(struct pvclock_vcpu_stolen_time_info);
> +			size *= atomic_read(&dev->kvm->online_vcpus);
> +			break;
> +		default:
> +			return -ENXIO;
> +		}
> +		if (put_user(size, user))
> +			return -EFAULT;
> +		return 0;
> +	case KVM_DEV_ARM_PV_TIME_STATE:
> +		return pvtime_save_state(dev->kvm, attr->attr, user);
> +	}
> +	return -ENXIO;
> +}
> +
> +static int kvm_arm_pvtime_has_attr(struct kvm_device *dev,
> +				   struct kvm_device_attr *attr)
> +{
> +	switch (attr->group) {
> +	case KVM_DEV_ARM_PV_TIME_PADDR:
> +	case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
> +	case KVM_DEV_ARM_PV_TIME_STATE:
> +		switch (attr->attr) {
> +		case KVM_DEV_ARM_PV_TIME_ST:
> +			return 0;
> +		}
> +		break;
> +	}
> +	return -ENXIO;
> +}
> +
> +static const struct kvm_device_ops pvtime_ops = {
> +	"Arm PV time",
> +	.create = kvm_arm_pvtime_create,
> +	.destroy = kvm_arm_pvtime_destroy,
> +	.set_attr = kvm_arm_pvtime_set_attr,
> +	.get_attr = kvm_arm_pvtime_get_attr,
> +	.has_attr = kvm_arm_pvtime_has_attr
> +};
> +
> +static int __init kvm_pvtime_init(void)
> +{
> +	kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
> +
> +	return 0;
> +}
> +
> +late_initcall(kvm_pvtime_init);
> +
> +#endif

Thanks,

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space
  2019-08-03 12:51   ` Marc Zyngier
@ 2019-08-03 17:34     ` Marc Zyngier
  2019-08-07 13:39       ` Steven Price
  2019-08-05 16:10     ` Steven Price
  1 sibling, 1 reply; 43+ messages in thread
From: Marc Zyngier @ 2019-08-03 17:34 UTC (permalink / raw)
  To: Steven Price
  Cc: kvm, Catalin Marinas, linux-doc, Russell King, linux-kernel,
	linux-arm-kernel, Paolo Bonzini, Will Deacon, kvmarm

On Sat, 3 Aug 2019 13:51:13 +0100
Marc Zyngier <maz@kernel.org> wrote:

[forgot that one]

> On Fri,  2 Aug 2019 15:50:14 +0100
> Steven Price <steven.price@arm.com> wrote:

[...]

> > +static int __init kvm_pvtime_init(void)
> > +{
> > +	kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
> > +
> > +	return 0;
> > +}
> > +
> > +late_initcall(kvm_pvtime_init);

Why is it an initcall? So far, the only initcall we've used is the one
that initializes KVM itself. Can't we just register the device_ops just
like we do for the vgic?

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure
  2019-08-02 14:50 ` [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure Steven Price
  2019-08-03 11:55   ` Marc Zyngier
@ 2019-08-03 17:58   ` Marc Zyngier
  2019-08-03 18:13     ` Marc Zyngier
  1 sibling, 1 reply; 43+ messages in thread
From: Marc Zyngier @ 2019-08-03 17:58 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Paolo Bonzini, Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

On Fri,  2 Aug 2019 15:50:12 +0100
Steven Price <steven.price@arm.com> wrote:

> Implement the service call for configuring a shared structure between a
> VCPU and the hypervisor in which the hypervisor can write the time
> stolen from the VCPU's execution time by other tasks on the host.
> 
> The hypervisor allocates memory which is placed at an IPA chosen by user
> space. The hypervisor then uses WRITE_ONCE() to update the shared
> structure ensuring single copy atomicity of the 64-bit unsigned value
> that reports stolen time in nanoseconds.
> 
> Whenever stolen time is enabled by the guest, the stolen time counter is
> reset.
> 
> The stolen time itself is retrieved from the sched_info structure
> maintained by the Linux scheduler code. We enable SCHEDSTATS when
> selecting KVM Kconfig to ensure this value is meaningful.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/include/asm/kvm_host.h | 13 +++++-
>  arch/arm64/kvm/Kconfig            |  1 +
>  include/kvm/arm_hypercalls.h      |  1 +
>  include/linux/kvm_types.h         |  2 +
>  virt/kvm/arm/arm.c                | 18 ++++++++
>  virt/kvm/arm/hypercalls.c         | 70 +++++++++++++++++++++++++++++++
>  6 files changed, 104 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index f656169db8c3..78f270190d43 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -44,6 +44,7 @@
>  	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_IRQ_PENDING	KVM_ARCH_REQ(1)
>  #define KVM_REQ_VCPU_RESET	KVM_ARCH_REQ(2)
> +#define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
>  
>  DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
>  
> @@ -83,6 +84,11 @@ struct kvm_arch {
>  
>  	/* Mandated version of PSCI */
>  	u32 psci_version;
> +
> +	struct kvm_arch_pvtime {
> +		void *st;
> +		gpa_t st_base;
> +	} pvtime;
>  };
>  
>  #define KVM_NR_MEM_OBJS     40
> @@ -338,8 +344,13 @@ struct kvm_vcpu_arch {
>  	/* True when deferrable sysregs are loaded on the physical CPU,
>  	 * see kvm_vcpu_load_sysregs and kvm_vcpu_put_sysregs. */
>  	bool sysregs_loaded_on_cpu;
> -};
>  
> +	/* Guest PV state */
> +	struct {
> +		u64 steal;
> +		u64 last_steal;
> +	} steal;
> +};
>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
>  #define vcpu_sve_pffr(vcpu) ((void *)((char *)((vcpu)->arch.sve_state) + \
>  				      sve_ffr_offset((vcpu)->arch.sve_max_vl)))
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index a67121d419a2..d8b88e40d223 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -39,6 +39,7 @@ config KVM
>  	select IRQ_BYPASS_MANAGER
>  	select HAVE_KVM_IRQ_BYPASS
>  	select HAVE_KVM_VCPU_RUN_PID_CHANGE
> +	select SCHEDSTATS
>  	---help---
>  	  Support hosting virtualized guest machines.
>  	  We don't support KVM with 16K page tables yet, due to the multiple
> diff --git a/include/kvm/arm_hypercalls.h b/include/kvm/arm_hypercalls.h
> index 35a5abcc4ca3..9f0710ab4292 100644
> --- a/include/kvm/arm_hypercalls.h
> +++ b/include/kvm/arm_hypercalls.h
> @@ -7,6 +7,7 @@
>  #include <asm/kvm_emulate.h>
>  
>  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
> +int kvm_update_stolen_time(struct kvm_vcpu *vcpu);
>  
>  static inline u32 smccc_get_function(struct kvm_vcpu *vcpu)
>  {
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index bde5374ae021..1c88e69db3d9 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -35,6 +35,8 @@ typedef unsigned long  gva_t;
>  typedef u64            gpa_t;
>  typedef u64            gfn_t;
>  
> +#define GPA_INVALID	(~(gpa_t)0)
> +
>  typedef unsigned long  hva_t;
>  typedef u64            hpa_t;
>  typedef u64            hfn_t;
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index f645c0fbf7ec..ebd963d2580b 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -40,6 +40,10 @@
>  #include <asm/kvm_coproc.h>
>  #include <asm/sections.h>
>  
> +#include <kvm/arm_hypercalls.h>
> +#include <kvm/arm_pmu.h>
> +#include <kvm/arm_psci.h>
> +
>  #ifdef REQUIRES_VIRT
>  __asm__(".arch_extension	virt");
>  #endif
> @@ -135,6 +139,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	kvm->arch.max_vcpus = vgic_present ?
>  				kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
>  
> +	kvm->arch.pvtime.st_base = GPA_INVALID;
>  	return ret;
>  out_free_stage2_pgd:
>  	kvm_free_stage2_pgd(kvm);
> @@ -371,6 +376,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	kvm_vcpu_load_sysregs(vcpu);
>  	kvm_arch_vcpu_load_fp(vcpu);
>  	kvm_vcpu_pmu_restore_guest(vcpu);
> +	kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
>  
>  	if (single_task_running())
>  		vcpu_clear_wfe_traps(vcpu);
> @@ -617,6 +623,15 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
>  	smp_rmb();
>  }
>  
> +static void vcpu_req_record_steal(struct kvm_vcpu *vcpu)
> +{
> +	int idx;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +	kvm_update_stolen_time(vcpu);
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +}
> +
>  static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
>  {
>  	return vcpu->arch.target >= 0;
> @@ -636,6 +651,9 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
>  		 * that a VCPU sees new virtual interrupts.
>  		 */
>  		kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
> +
> +		if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu))
> +			vcpu_req_record_steal(vcpu);

Something troubles me. Here, you've set the request on load. But you
can be preempted at any time (preemption gets disabled just after).

I have the feeling that should you get preempted right here, you'll
end up having accumulated the wrong amount of steal time, as the
request set at load when you get scheduled back in will only get
processed after a full round of entry/exit/entry, which doesn't look
great.

Am I getting it right?

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/9] arm64: Stolen time support
  2019-08-02 14:50 [PATCH 0/9] arm64: Stolen time support Steven Price
                   ` (8 preceding siblings ...)
  2019-08-02 14:50 ` [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest Steven Price
@ 2019-08-03 18:05 ` Marc Zyngier
  2019-08-05 13:06   ` Steven Price
  9 siblings, 1 reply; 43+ messages in thread
From: Marc Zyngier @ 2019-08-03 18:05 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Paolo Bonzini, Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

On Fri,  2 Aug 2019 15:50:08 +0100
Steven Price <steven.price@arm.com> wrote:

Hi Steven,

> This series adds support for paravirtualized time for arm64 guests and
> KVM hosts following the specification in Arm's document DEN 0057A:
> 
> https://developer.arm.com/docs/den0057/a
> 
> It implements support for stolen time, allowing the guest to
> identify time when it is forcibly not executing.
> 
> It doesn't implement support for Live Physical Time (LPT) as there are
> some concerns about the overheads and approach in the above
> specification, and I expect an updated version of the specification to
> be released soon with just the stolen time parts.

Thanks for posting this.

My current concern with this series is around the fact that we allocate
memory from the kernel on behalf of the guest. It is the first example
of such a thing in the ARM port, and I can't really say I'm fond of it.

x86 seems to get away with it by having the memory allocated from
userspace, which I tend to like more. Yes, put_user is more
expensive than a straight store, but this isn't done too often either.

What is the rationale for your current approach?

Thanks,

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure
  2019-08-03 17:58   ` Marc Zyngier
@ 2019-08-03 18:13     ` Marc Zyngier
  2019-08-05 14:18       ` Steven Price
  0 siblings, 1 reply; 43+ messages in thread
From: Marc Zyngier @ 2019-08-03 18:13 UTC (permalink / raw)
  To: Steven Price
  Cc: kvm, Catalin Marinas, linux-doc, Russell King, linux-kernel,
	linux-arm-kernel, Paolo Bonzini, Will Deacon, kvmarm

On Sat, 3 Aug 2019 18:58:17 +0100
Marc Zyngier <maz@kernel.org> wrote:

> On Fri,  2 Aug 2019 15:50:12 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
> > Implement the service call for configuring a shared structure between a
> > VCPU and the hypervisor in which the hypervisor can write the time
> > stolen from the VCPU's execution time by other tasks on the host.
> > 
> > The hypervisor allocates memory which is placed at an IPA chosen by user
> > space. The hypervisor then uses WRITE_ONCE() to update the shared
> > structure, ensuring single-copy atomicity of the 64-bit unsigned value
> > that reports stolen time in nanoseconds.
> > 
> > Whenever stolen time is enabled by the guest, the stolen time counter is
> > reset.
> > 
> > The stolen time itself is retrieved from the sched_info structure
> > maintained by the Linux scheduler code. We enable SCHEDSTATS when KVM
> > is selected in Kconfig to ensure this value is meaningful.
> > 
> > Signed-off-by: Steven Price <steven.price@arm.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h | 13 +++++-
> >  arch/arm64/kvm/Kconfig            |  1 +
> >  include/kvm/arm_hypercalls.h      |  1 +
> >  include/linux/kvm_types.h         |  2 +
> >  virt/kvm/arm/arm.c                | 18 ++++++++
> >  virt/kvm/arm/hypercalls.c         | 70 +++++++++++++++++++++++++++++++
> >  6 files changed, 104 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index f656169db8c3..78f270190d43 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -44,6 +44,7 @@
> >  	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> >  #define KVM_REQ_IRQ_PENDING	KVM_ARCH_REQ(1)
> >  #define KVM_REQ_VCPU_RESET	KVM_ARCH_REQ(2)
> > +#define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
> >  
> >  DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
> >  
> > @@ -83,6 +84,11 @@ struct kvm_arch {
> >  
> >  	/* Mandated version of PSCI */
> >  	u32 psci_version;
> > +
> > +	struct kvm_arch_pvtime {
> > +		void *st;
> > +		gpa_t st_base;
> > +	} pvtime;
> >  };
> >  
> >  #define KVM_NR_MEM_OBJS     40
> > @@ -338,8 +344,13 @@ struct kvm_vcpu_arch {
> >  	/* True when deferrable sysregs are loaded on the physical CPU,
> >  	 * see kvm_vcpu_load_sysregs and kvm_vcpu_put_sysregs. */
> >  	bool sysregs_loaded_on_cpu;
> > -};
> >  
> > +	/* Guest PV state */
> > +	struct {
> > +		u64 steal;
> > +		u64 last_steal;
> > +	} steal;
> > +};
> >  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
> >  #define vcpu_sve_pffr(vcpu) ((void *)((char *)((vcpu)->arch.sve_state) + \
> >  				      sve_ffr_offset((vcpu)->arch.sve_max_vl)))
> > diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> > index a67121d419a2..d8b88e40d223 100644
> > --- a/arch/arm64/kvm/Kconfig
> > +++ b/arch/arm64/kvm/Kconfig
> > @@ -39,6 +39,7 @@ config KVM
> >  	select IRQ_BYPASS_MANAGER
> >  	select HAVE_KVM_IRQ_BYPASS
> >  	select HAVE_KVM_VCPU_RUN_PID_CHANGE
> > +	select SCHEDSTATS
> >  	---help---
> >  	  Support hosting virtualized guest machines.
> >  	  We don't support KVM with 16K page tables yet, due to the multiple
> > diff --git a/include/kvm/arm_hypercalls.h b/include/kvm/arm_hypercalls.h
> > index 35a5abcc4ca3..9f0710ab4292 100644
> > --- a/include/kvm/arm_hypercalls.h
> > +++ b/include/kvm/arm_hypercalls.h
> > @@ -7,6 +7,7 @@
> >  #include <asm/kvm_emulate.h>
> >  
> >  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
> > +int kvm_update_stolen_time(struct kvm_vcpu *vcpu);
> >  
> >  static inline u32 smccc_get_function(struct kvm_vcpu *vcpu)
> >  {
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index bde5374ae021..1c88e69db3d9 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -35,6 +35,8 @@ typedef unsigned long  gva_t;
> >  typedef u64            gpa_t;
> >  typedef u64            gfn_t;
> >  
> > +#define GPA_INVALID	(~(gpa_t)0)
> > +
> >  typedef unsigned long  hva_t;
> >  typedef u64            hpa_t;
> >  typedef u64            hfn_t;
> > diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> > index f645c0fbf7ec..ebd963d2580b 100644
> > --- a/virt/kvm/arm/arm.c
> > +++ b/virt/kvm/arm/arm.c
> > @@ -40,6 +40,10 @@
> >  #include <asm/kvm_coproc.h>
> >  #include <asm/sections.h>
> >  
> > +#include <kvm/arm_hypercalls.h>
> > +#include <kvm/arm_pmu.h>
> > +#include <kvm/arm_psci.h>
> > +
> >  #ifdef REQUIRES_VIRT
> >  __asm__(".arch_extension	virt");
> >  #endif
> > @@ -135,6 +139,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> >  	kvm->arch.max_vcpus = vgic_present ?
> >  				kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
> >  
> > +	kvm->arch.pvtime.st_base = GPA_INVALID;
> >  	return ret;
> >  out_free_stage2_pgd:
> >  	kvm_free_stage2_pgd(kvm);
> > @@ -371,6 +376,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >  	kvm_vcpu_load_sysregs(vcpu);
> >  	kvm_arch_vcpu_load_fp(vcpu);
> >  	kvm_vcpu_pmu_restore_guest(vcpu);
> > +	kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
> >  
> >  	if (single_task_running())
> >  		vcpu_clear_wfe_traps(vcpu);
> > @@ -617,6 +623,15 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
> >  	smp_rmb();
> >  }
> >  
> > +static void vcpu_req_record_steal(struct kvm_vcpu *vcpu)
> > +{
> > +	int idx;
> > +
> > +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +	kvm_update_stolen_time(vcpu);
> > +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +}
> > +
> >  static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
> >  {
> >  	return vcpu->arch.target >= 0;
> > @@ -636,6 +651,9 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
> >  		 * that a VCPU sees new virtual interrupts.
> >  		 */
> >  		kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
> > +
> > +		if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu))
> > +			vcpu_req_record_steal(vcpu);  
> 
> Something troubles me. Here, you've set the request on load. But you
> can be preempted at any time (preemption gets disabled just after).
> 
> I have the feeling that should you get preempted right here, you'll
> end up having accumulated the wrong amount of steal time, as the
> request set at load when you get scheduled back in will only get
> processed after a full round of entry/exit/entry, which doesn't look
> great.

Ah, no. We're saved by the check for pending requests right before we
jump into the guest, causing an early exit and the whole shebang to be
restarted.

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest
  2019-08-02 14:50 ` [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest Steven Price
@ 2019-08-04  9:53   ` Marc Zyngier
  2019-08-08 15:29     ` Steven Price
  2019-08-09 13:51   ` Zenghui Yu
  1 sibling, 1 reply; 43+ messages in thread
From: Marc Zyngier @ 2019-08-04  9:53 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Paolo Bonzini, Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

On Fri,  2 Aug 2019 15:50:17 +0100
Steven Price <steven.price@arm.com> wrote:

> Enable paravirtualization features when running under a hypervisor
> supporting the PV_TIME_ST hypercall.
> 
> For each (v)CPU, we ask the hypervisor for the location of a shared
> page which the hypervisor will use to report stolen time to us. We set
> pv_time_ops to the stolen time function which simply reads the stolen
> value from the shared page for a VCPU. We guarantee single-copy
> atomicity using READ_ONCE which means we can also read the stolen
> time for a VCPU other than the currently running one while it is
> potentially being updated by the hypervisor.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/kernel/Makefile |   1 +
>  arch/arm64/kernel/kvm.c    | 155 +++++++++++++++++++++++++++++++++++++

nit: Why not use paravirt.c, which clearly states what it does? The
alternative would be to name it kvm-pv.c.

>  include/linux/cpuhotplug.h |   1 +
>  3 files changed, 157 insertions(+)
>  create mode 100644 arch/arm64/kernel/kvm.c
> 
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index 478491f07b4f..eb36edf9b930 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -63,6 +63,7 @@ obj-$(CONFIG_CRASH_CORE)		+= crash_core.o
>  obj-$(CONFIG_ARM_SDE_INTERFACE)		+= sdei.o
>  obj-$(CONFIG_ARM64_SSBD)		+= ssbd.o
>  obj-$(CONFIG_ARM64_PTR_AUTH)		+= pointer_auth.o
> +obj-$(CONFIG_PARAVIRT)			+= kvm.o
>  
>  obj-y					+= vdso/ probes/
>  obj-$(CONFIG_COMPAT_VDSO)		+= vdso32/
> diff --git a/arch/arm64/kernel/kvm.c b/arch/arm64/kernel/kvm.c
> new file mode 100644
> index 000000000000..245398c79dae
> --- /dev/null
> +++ b/arch/arm64/kernel/kvm.c
> @@ -0,0 +1,155 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (C) 2019 Arm Ltd.
> +
> +#define pr_fmt(fmt) "kvmarm-pv: " fmt
> +
> +#include <linux/arm-smccc.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/io.h>
> +#include <linux/printk.h>
> +#include <linux/psci.h>
> +#include <linux/reboot.h>
> +#include <linux/slab.h>
> +
> +#include <asm/paravirt.h>
> +#include <asm/pvclock-abi.h>
> +#include <asm/smp_plat.h>
> +
> +struct kvmarm_stolen_time_region {
> +	struct pvclock_vcpu_stolen_time_info *kaddr;
> +};
> +
> +static DEFINE_PER_CPU(struct kvmarm_stolen_time_region, stolen_time_region);
> +
> +static bool steal_acc = true;
> +static int __init parse_no_stealacc(char *arg)
> +{
> +	steal_acc = false;
> +	return 0;
> +}
> +early_param("no-steal-acc", parse_no_stealacc);
> +
> +/* return stolen time in ns by asking the hypervisor */
> +static u64 kvm_steal_clock(int cpu)
> +{
> +	struct kvmarm_stolen_time_region *reg;
> +
> +	reg = per_cpu_ptr(&stolen_time_region, cpu);
> +	if (!reg->kaddr) {
> +		pr_warn_once("stolen time enabled but not configured for cpu %d\n",
> +			     cpu);
> +		return 0;
> +	}
> +
> +	return le64_to_cpu(READ_ONCE(reg->kaddr->stolen_time));
> +}
> +
> +static int disable_stolen_time_current_cpu(void)
> +{
> +	struct kvmarm_stolen_time_region *reg;
> +
> +	reg = this_cpu_ptr(&stolen_time_region);
> +	if (!reg->kaddr)
> +		return 0;
> +
> +	memunmap(reg->kaddr);
> +	memset(reg, 0, sizeof(*reg));
> +
> +	return 0;
> +}
> +
> +static int stolen_time_dying_cpu(unsigned int cpu)
> +{
> +	return disable_stolen_time_current_cpu();
> +}
> +
> +static int init_stolen_time_cpu(unsigned int cpu)
> +{
> +	struct kvmarm_stolen_time_region *reg;
> +	struct arm_smccc_res res;
> +
> +	reg = this_cpu_ptr(&stolen_time_region);
> +
> +	if (reg->kaddr)
> +		return 0;

Can this actually happen? It'd take two CPU_UP calls from the HP
notifiers to get in that situation...

> +
> +	arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_TIME_ST, &res);
> +
> +	if ((long)res.a0 < 0)
> +		return -EINVAL;
> +
> +	reg->kaddr = memremap(res.a0,
> +			sizeof(struct pvclock_vcpu_stolen_time_info),
> +			MEMREMAP_WB);
> +
> +	if (reg->kaddr == NULL) {
> +		pr_warn("Failed to map stolen time data structure\n");
> +		return -EINVAL;

-ENOMEM is the expected return code.

> +	}
> +
> +	if (le32_to_cpu(reg->kaddr->revision) != 0 ||
> +			le32_to_cpu(reg->kaddr->attributes) != 0) {
> +		pr_warn("Unexpected revision or attributes in stolen time data\n");
> +		return -ENXIO;
> +	}
> +
> +	return 0;
> +}
> +
> +static int kvm_arm_init_stolen_time(void)
> +{
> +	int ret;
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_ARM_KVMPV_STARTING,
> +				"hypervisor/kvmarm/pv:starting",
> +				init_stolen_time_cpu, stolen_time_dying_cpu);
> +	if (ret < 0)
> +		return ret;
> +	return 0;
> +}
> +
> +static bool has_kvm_steal_clock(void)
> +{
> +	struct arm_smccc_res res;
> +
> +	/* To detect the presence of PV time support we require SMCCC 1.1+ */
> +	if (psci_ops.smccc_version < SMCCC_VERSION_1_1)
> +		return false;
> +
> +	arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
> +			     ARM_SMCCC_HV_PV_FEATURES, &res);
> +
> +	if (res.a0 != SMCCC_RET_SUCCESS)
> +		return false;
> +
> +	arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_FEATURES,
> +			     ARM_SMCCC_HV_PV_TIME_ST, &res);
> +
> +	if (res.a0 != SMCCC_RET_SUCCESS)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int __init kvm_guest_init(void)
> +{
> +	int ret = 0;
> +
> +	if (!has_kvm_steal_clock())
> +		return 0;
> +
> +	ret = kvm_arm_init_stolen_time();
> +	if (ret)
> +		return ret;
> +
> +	pv_ops.time.steal_clock = kvm_steal_clock;
> +
> +	static_key_slow_inc(&paravirt_steal_enabled);
> +	if (steal_acc)
> +		static_key_slow_inc(&paravirt_steal_rq_enabled);
> +
> +	pr_info("using stolen time PV\n");
> +
> +	return 0;
> +}
> +early_initcall(kvm_guest_init);

Is there any reason why we wouldn't directly call into this rather than
using an initcall?

> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 068793a619ca..89d75edb5750 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -136,6 +136,7 @@ enum cpuhp_state {
>  	/* Must be the last timer callback */
>  	CPUHP_AP_DUMMY_TIMER_STARTING,
>  	CPUHP_AP_ARM_XEN_STARTING,
> +	CPUHP_AP_ARM_KVMPV_STARTING,
>  	CPUHP_AP_ARM_CORESIGHT_STARTING,
>  	CPUHP_AP_ARM64_ISNDEP_STARTING,
>  	CPUHP_AP_SMPCFD_DYING,


Thanks,

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
  2019-08-02 14:50 ` [PATCH 1/9] KVM: arm64: Document PV-time interface Steven Price
  2019-08-03 11:13   ` Marc Zyngier
@ 2019-08-05  3:23   ` Zenghui Yu
  2019-08-05 13:06     ` Steven Price
  2019-08-05 16:40   ` Christophe de Dinechin
  2 siblings, 1 reply; 43+ messages in thread
From: Zenghui Yu @ 2019-08-05  3:23 UTC (permalink / raw)
  To: Steven Price
  Cc: linux-kernel, kvm, Catalin Marinas, linux-doc, Russell King,
	linux-arm-kernel, Marc Zyngier, Paolo Bonzini, Will Deacon,
	kvmarm

Hi Steven,

On 2019/8/2 22:50, Steven Price wrote:
> Introduce a paravirtualization interface for KVM/arm64 based on the
> "Arm Paravirtualized Time for Arm-Based Systems" specification DEN 0057A.
> 
> This only adds the details about "Stolen Time" as the details of "Live
> Physical Time" have not been fully agreed.
> 
> User space can specify a reserved area of memory for the guest and
> inform KVM to populate the memory with information on time that the host
> kernel has stolen from the guest.
> 
> A hypercall interface is provided for the guest to interrogate the
> hypervisor's support for this interface and the location of the shared
> memory structures.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>   Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
>   1 file changed, 107 insertions(+)
>   create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
                                     ^^^^^^^
This directory has been renamed recently, see:

https://patchwork.ozlabs.org/patch/1136104/


Zenghui

> 
> diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
> new file mode 100644
> index 000000000000..e6ae9799e1d5
> --- /dev/null
> +++ b/Documentation/virtual/kvm/arm/pvtime.txt
> @@ -0,0 +1,107 @@
> +Paravirtualized time support for arm64
> +======================================
> +
> +Arm specification DEN0057/A defines a standard for paravirtualised time
> +support for AArch64 guests:
> +
> +https://developer.arm.com/docs/den0057/a
> +
> +KVM/Arm64 implements the stolen time part of this specification by providing
> +some hypervisor service calls to support a paravirtualized guest obtaining a
> +view of the amount of time stolen from its execution.
> +
> +Two new SMCCC compatible hypercalls are defined:
> +
> +PV_FEATURES 0xC5000020
> +PV_TIME_ST  0xC5000022
> +
> +These are only available in the SMC64/HVC64 calling convention as
> +paravirtualized time is not available to 32-bit Arm guests.
> +
> +PV_FEATURES
> +    Function ID:  (uint32)  : 0xC5000020
> +    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
> +    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
> +                              PV-time feature is supported by the hypervisor.
> +
> +PV_TIME_ST
> +    Function ID:  (uint32)  : 0xC5000022
> +    Return value: (int64)   : IPA of the stolen time data structure for this
> +                              (V)CPU. On failure:
> +                              NOT_SUPPORTED (-1)
> +
> +Stolen Time
> +-----------
> +
> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
> +
> +  Field       | Byte Length | Byte Offset | Description
> +  ----------- | ----------- | ----------- | --------------------------
> +  Revision    |      4      |      0      | Must be 0 for version 0.1
> +  Attributes  |      4      |      4      | Must be 0
> +  Stolen time |      8      |      8      | Stolen time in unsigned
> +              |             |             | nanoseconds indicating how
> +              |             |             | much time this VCPU thread
> +              |             |             | was involuntarily not
> +              |             |             | running on a physical CPU.
> +
> +The structure will be updated by the hypervisor periodically as time is stolen
> +from the VCPU. It will be present within a reserved region of the normal
> +memory given to the guest. The guest should not attempt to write into this
> +memory. There is one structure per VCPU of the guest.
> +
> +User space interface
> +====================
> +
> +User space can request that KVM provide the paravirtualized time interface to
> +a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
> +
> +    struct kvm_create_device pvtime_device = {
> +            .type = KVM_DEV_TYPE_ARM_PV_TIME,
> +            .attr = 0,
> +            .flags = 0,
> +    };
> +
> +    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
> +
> +The guest IPA of the structures must be given to KVM. This is the base address
> +of an array of stolen time structures (one for each VCPU). For example:
> +
> +    struct kvm_device_attr st_base = {
> +            .group = KVM_DEV_ARM_PV_TIME_PADDR,
> +            .attr = KVM_DEV_ARM_PV_TIME_ST,
> +            .addr = (u64)(unsigned long)&st_paddr
> +    };
> +
> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
> +
> +For migration (or save/restore) of a guest it is necessary to save the contents
> +of the shared page(s) and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
> +provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
> +to be read/written.
> +
> +It is also necessary for the physical address to be set identically when
> +restoring.
> +
> +    void *save_state(int fd, u64 attr, u32 *size) {
> +        struct kvm_device_attr get_size = {
> +                .group = KVM_DEV_ARM_PV_TIME_STATE_SIZE,
> +                .attr = attr,
> +                .addr = (u64)(unsigned long)size
> +        };
> +
> +        ioctl(fd, KVM_GET_DEVICE_ATTR, &get_size);
> +
> +        void *buffer = malloc(*size);
> +
> +        struct kvm_device_attr get_state = {
> +                .group = KVM_DEV_ARM_PV_TIME_STATE,
> +                .attr = attr,
> +                .addr = (u64)(unsigned long)buffer
> +        };
> +
> +        ioctl(fd, KVM_GET_DEVICE_ATTR, &get_state);
> +
> +        return buffer;
> +    }
> +
> +    void *st_state = save_state(pvtime_fd, KVM_DEV_ARM_PV_TIME_ST, &st_size);
> +
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 7/9] arm/arm64: Provide a wrapper for SMCCC 1.1 calls
  2019-08-02 14:50 ` [PATCH 7/9] arm/arm64: Provide a wrapper for SMCCC 1.1 calls Steven Price
@ 2019-08-05 10:03   ` Will Deacon
  0 siblings, 0 replies; 43+ messages in thread
From: Will Deacon @ 2019-08-05 10:03 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, James Morse, Julien Thierry, Suzuki K Pouloze, kvm,
	kvmarm, linux-arm-kernel, linux-doc, linux-kernel

On Fri, Aug 02, 2019 at 03:50:15PM +0100, Steven Price wrote:
> SMCCC 1.1 calls may use either HVC or SMC depending on the PSCI
> conduit. Rather than coding this in every call site, provide a macro
> which uses the correct instruction. The macro also handles the case
> where no PSCI conduit is configured, returning a "not supported" error
> in res, along with returning the conduit used for the call.
> 
> This allows us to remove some duplicated code and will be useful later
> when adding paravirtualized time hypervisor calls.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  include/linux/arm-smccc.h | 44 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)

Acked-by: Will Deacon <will@kernel.org>

Will

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/9] arm64: Stolen time support
  2019-08-03 18:05 ` [PATCH 0/9] arm64: Stolen time support Marc Zyngier
@ 2019-08-05 13:06   ` Steven Price
  2019-08-05 13:26     ` Marc Zyngier
  2019-08-14 13:02     ` Alexander Graf
  0 siblings, 2 replies; 43+ messages in thread
From: Steven Price @ 2019-08-05 13:06 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, Radim Krčmář,
	Catalin Marinas, Suzuki K Pouloze, linux-doc, Russell King,
	linux-kernel, James Morse, linux-arm-kernel, Paolo Bonzini,
	Will Deacon, kvmarm, Julien Thierry

On 03/08/2019 19:05, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:08 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
> Hi Steven,
> 
>> This series adds support for paravirtualized time for arm64 guests and
>> KVM hosts following the specification in Arm's document DEN 0057A:
>>
>> https://developer.arm.com/docs/den0057/a
>>
>> It implements support for stolen time, allowing the guest to
>> identify time when it is forcibly not executing.
>>
>> It doesn't implement support for Live Physical Time (LPT) as there are
>> some concerns about the overheads and approach in the above
>> specification, and I expect an updated version of the specification to
>> be released soon with just the stolen time parts.
> 
> Thanks for posting this.
> 
> My current concern with this series is around the fact that we allocate
> memory from the kernel on behalf of the guest. It is the first example
> of such a thing in the ARM port, and I can't really say I'm fond of it.
> 
> x86 seems to get away with it by having the memory allocated from
> userspace, which I tend to like more. Yes, put_user is more
> expensive than a straight store, but this isn't done too often either.
> 
> What is the rationale for your current approach?

As I see it there are 3 approaches that can be taken here:

1. Hypervisor allocates memory and adds it to the virtual machine. This
means that everything to do with the 'device' is encapsulated behind the
KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
stolen time structure to be fast, it cannot be a trapping region and has
to be backed by real memory - in this case allocated by the host kernel.

2. Host user space allocates memory. Similar to above, but this time
user space needs to manage the memory region as well as the usual
KVM_CREATE_DEVICE dance. I've no objection to this, but it means
kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
to size the memory region).

3. Guest kernel "donates" the memory to the hypervisor for the
structure. As far as I'm aware this is what x86 does. The problems I see
with this approach are:

 a) kexec becomes much more tricky - there needs to be a disabling
mechanism for the guest to stop the hypervisor scribbling on memory
before starting the new kernel.

 b) If there is more than one entity that is interested in the
information (e.g. firmware and kernel) then this requires some form of
arbitration in the guest because the hypervisor doesn't want to have to
track an arbitrary number of regions to update.

 c) Performance can suffer if the host kernel doesn't have a suitably
aligned/sized area to use. As you say - put_user() is more expensive.
The structure is updated on every return to the VM.


Of course x86 does prove the third approach can work, but I'm not sure
which is actually better. Avoiding the kexec cancellation requirement was
the main driver of the current approach, although many of the
conversations about this were also tied up with Live Physical Time, which
adds its own complications.

Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
  2019-08-05  3:23   ` Zenghui Yu
@ 2019-08-05 13:06     ` Steven Price
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Price @ 2019-08-05 13:06 UTC (permalink / raw)
  To: Zenghui Yu
  Cc: kvm, linux-doc, Catalin Marinas, Russell King, linux-kernel,
	Marc Zyngier, Paolo Bonzini, Will Deacon, kvmarm,
	linux-arm-kernel

On 05/08/2019 04:23, Zenghui Yu wrote:
> Hi Steven,
> 
> On 2019/8/2 22:50, Steven Price wrote:
>> Introduce a paravirtualization interface for KVM/arm64 based on the
>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>
>> This only adds the details about "Stolen Time" as the details of "Live
>> Physical Time" have not been fully agreed.
>>
>> User space can specify a reserved area of memory for the guest and
>> inform KVM to populate the memory with information on time that the host
>> kernel has stolen from the guest.
>>
>> A hypercall interface is provided for the guest to interrogate the
>> hypervisor's support for this interface and the location of the shared
>> memory structures.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>   Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
>>   1 file changed, 107 insertions(+)
>>   create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
>                                     ^^^^^^^
> This directory has been renamed recently, see:
> 
> https://patchwork.ozlabs.org/patch/1136104/

Thanks for pointing that out - I'll move it in the next version.

Steve


* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
  2019-08-03 11:13   ` Marc Zyngier
@ 2019-08-05 13:06     ` Steven Price
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Price @ 2019-08-05 13:06 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: peter.maydell, kvm, Radim Krčmář,
	Catalin Marinas, Suzuki K Pouloze, linux-doc, Russell King,
	linux-kernel, James Morse, linux-arm-kernel, Paolo Bonzini,
	Will Deacon, kvmarm, Julien Thierry

On 03/08/2019 12:13, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:09 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
> [+Peter for the userspace aspect of things]
> 
> Hi Steve,
> 
>> Introduce a paravirtualization interface for KVM/arm64 based on the
>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>
>> This only adds the details about "Stolen Time" as the details of "Live
>> Physical Time" have not been fully agreed.
>>
>> User space can specify a reserved area of memory for the guest and
>> inform KVM to populate the memory with information on time that the host
>> kernel has stolen from the guest.
>>
>> A hypercall interface is provided for the guest to interrogate the
>> hypervisor's support for this interface and the location of the shared
>> memory structures.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
>>  1 file changed, 107 insertions(+)
>>  create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
>>
>> diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
>> new file mode 100644
>> index 000000000000..e6ae9799e1d5
>> --- /dev/null
>> +++ b/Documentation/virtual/kvm/arm/pvtime.txt
>> @@ -0,0 +1,107 @@
>> +Paravirtualized time support for arm64
>> +======================================
>> +
>> +Arm specification DEN0057/A defined a standard for paravirtualised time
>> +support for Aarch64 guests:
> 
> nit: AArch64
> 
>> +
>> +https://developer.arm.com/docs/den0057/a
> 
> Between this file and the above document, which one is authoritative?

The above document should be authoritative - although I'm still waiting
for the final version to be published. I'm not expecting any changes to
the stolen time part though.

>> +
>> +KVM/Arm64 implements the stolen time part of this specification by providing
> 
> nit: KVM/arm64
> 
>> +some hypervisor service calls to support a paravirtualized guest obtaining a
>> +view of the amount of time stolen from its execution.
>> +
>> +Two new SMCCC compatible hypercalls are defined:
>> +
>> +PV_FEATURES 0xC5000020
>> +PV_TIME_ST  0xC5000022
>> +
>> +These are only available in the SMC64/HVC64 calling convention as
>> +paravirtualized time is not available to 32 bit Arm guests.
>> +
>> +PV_FEATURES
>> +    Function ID:  (uint32)  : 0xC5000020
>> +    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
>> +    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
>> +                              PV-time feature is supported by the hypervisor.
> 
> How is PV_FEATURES discovered? Is the intention to make it a generic
> ARM-wide PV discovery mechanism, not specific to PV time?

SMCCC is mandated for PV time. So, assuming the hypervisor supports
SMCCC, the "NOT_SUPPORTED" return is mandated by SMCCC if PV time isn't
supported.

However, we do also use the SMCCC_ARCH_FEATURES mechanism to check the
existence of PV_FEATURES before use. I'll update the document to call
this out.

>> +
>> +PV_TIME_ST
>> +    Function ID:  (uint32)  : 0xC5000022
>> +    Return value: (int64)   : IPA of the stolen time data structure for this
>> +                              (V)CPU. On failure:
>> +                              NOT_SUPPORTED (-1)
>> +
> 
> Is the size implicit? What are the memory attributes? This either needs
> documenting here, or point to the right bit to the spec.

The size is implicit - it's a pointer to the below structure, so the
guest can only rely on the first 16 bytes being valid. The memory
attributes are described in the specification as:

"The calling guest can map the IPA into normal memory with inner and
outer write back caching attributes, in the inner sharable domain"

I'll put those details in this document for completeness.

>> +Stolen Time
>> +-----------
>> +
>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>> +
>> +  Field       | Byte Length | Byte Offset | Description
>> +  ----------- | ----------- | ----------- | --------------------------
>> +  Revision    |      4      |      0      | Must be 0 for version 0.1
>> +  Attributes  |      4      |      4      | Must be 0
>> +  Stolen time |      8      |      8      | Stolen time in unsigned
>> +              |             |             | nanoseconds indicating how
>> +              |             |             | much time this VCPU thread
>> +              |             |             | was involuntarily not
>> +              |             |             | running on a physical CPU.
>> +
>> +The structure will be updated by the hypervisor periodically as time is stolen
> 
> Is it really periodic? If so, when is the update frequency?

Hmm, periodic might be the wrong term - there is no guaranteed frequency
of update. The spec says:

"The hypervisor must update this value prior to scheduling a virtual CPU"

I guess that's probably the best description.

>> +from the VCPU. It will be present within a reserved region of the normal
>> +memory given to the guest. The guest should not attempt to write into this
>> memory. There is a structure per VCPU of the guest.
> 
> What if the vcpu writes to it? Does it get a fault?

From the perspective of the specification this is undefined. A fault
would therefore be acceptable but isn't generated in the implementation
defined here.

> If there is a
> structure per vcpu, what is the layout in memory? How does a vcpu find
> its own data structure? Is that the address returned by PV_TIME_ST?

A call to PV_TIME_ST returns the structure for the calling vCPU - I'll
make that explicit. The layout is therefore defined by the hypervisor
and cannot be relied on by the guest. As described below, this
implementation uses a simple array of structures.

>> +
>> +User space interface
>> +====================
>> +
>> +User space can request that KVM provide the paravirtualized time interface to
>> +a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
>> +
>> +    struct kvm_create_device pvtime_device = {
>> +            .type = KVM_DEV_TYPE_ARM_PV_TIME,
>> +            .attr = 0,
>> +            .flags = 0,
>> +    };
>> +
>> +    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
>> +
>> +The guest IPA of the structures must be given to KVM. This is the base address
> 
> nit: s/guest //
> 
>> +of an array of stolen time structures (one for each VCPU). For example:
>> +
>> +    struct kvm_device_attr st_base = {
>> +            .group = KVM_DEV_ARM_PV_TIME_PADDR,
>> +            .attr = KVM_DEV_ARM_PV_TIME_ST,
>> +            .addr = (u64)(unsigned long)&st_paddr
>> +    };
>> +
>> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
> 
> So the allocation itself is performed by the kernel? What are the
> ordering requirements between creating vcpus and the device? What are
> the alignment requirements for the base address?

The base address should be page aligned - I'll spell that out.

There are currently no ordering requirements between creating vcpus and
the device. However...

>> +
>> +For migration (or save/restore) of a guest it is necessary to save the contents
>> +of the shared page(s) and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
>> +provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
>> +to be read/written.
> 
> Is the size variable depending on the number of vcpus?

...yes - so restoring the state after migration must be done after
creating the vcpus. I'll point out that the device should be created
after the vcpus.

Thanks for the review,

Steve


* Re: [PATCH 3/9] KVM: arm64: Implement PV_FEATURES call
  2019-08-03 11:21   ` Marc Zyngier
@ 2019-08-05 13:14     ` Steven Price
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Price @ 2019-08-05 13:14 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, Radim Krčmář,
	Catalin Marinas, Suzuki K Pouloze, linux-doc, Russell King,
	linux-kernel, James Morse, linux-arm-kernel, Paolo Bonzini,
	Will Deacon, kvmarm, Julien Thierry

On 03/08/2019 12:21, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:11 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> This provides a mechanism for querying which paravirtualized features
>> are available in this hypervisor.
>>
>> Also add the header file which defines the ABI for the paravirtualized
>> clock features we're about to add.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/include/asm/pvclock-abi.h | 20 ++++++++++++++++++++
>>  include/linux/arm-smccc.h            | 14 ++++++++++++++
>>  virt/kvm/arm/hypercalls.c            |  9 +++++++++
>>  3 files changed, 43 insertions(+)
>>  create mode 100644 arch/arm64/include/asm/pvclock-abi.h
>>
>> diff --git a/arch/arm64/include/asm/pvclock-abi.h b/arch/arm64/include/asm/pvclock-abi.h
>> new file mode 100644
>> index 000000000000..1f7cdc102691
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/pvclock-abi.h
>> @@ -0,0 +1,20 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (C) 2019 Arm Ltd. */
>> +
>> +#ifndef __ASM_PVCLOCK_ABI_H
>> +#define __ASM_PVCLOCK_ABI_H
>> +
>> +/* The below structures and constants are defined in ARM DEN0057A */
>> +
>> +struct pvclock_vcpu_stolen_time_info {
>> +	__le32 revision;
>> +	__le32 attributes;
>> +	__le64 stolen_time;
>> +	/* Structure must be 64 byte aligned, pad to that size */
>> +	u8 padding[48];
>> +} __packed;
>> +
>> +#define PV_VM_TIME_NOT_SUPPORTED	-1
> 
> Isn't the intent for this to be the same value as
> SMCCC_RET_NOT_SUPPORTED?

Yes it is, I guess there's not much point defining it again.

>> +#define PV_VM_TIME_INVALID_PARAMETERS	-2
> 
> It overlaps with SMCCC_RET_NOT_REQUIRED. Is that a problem? Should we
> consider a spec change for this?

Actually INVALID_PARAMETERS is only for Live Physical Time; since we're
not implementing it here, this can go as well.

Thanks,

Steve


* Re: [PATCH 0/9] arm64: Stolen time support
  2019-08-05 13:06   ` Steven Price
@ 2019-08-05 13:26     ` Marc Zyngier
  2019-08-14 13:02     ` Alexander Graf
  1 sibling, 0 replies; 43+ messages in thread
From: Marc Zyngier @ 2019-08-05 13:26 UTC (permalink / raw)
  To: Steven Price
  Cc: kvm, Radim Krčmář,
	Catalin Marinas, Suzuki K Pouloze, linux-doc, Russell King,
	linux-kernel, James Morse, linux-arm-kernel, Paolo Bonzini,
	Will Deacon, kvmarm, Julien Thierry

On 05/08/2019 14:06, Steven Price wrote:
> On 03/08/2019 19:05, Marc Zyngier wrote:
>> On Fri,  2 Aug 2019 15:50:08 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>> Hi Steven,
>>
>>> This series add support for paravirtualized time for arm64 guests and
>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>
>>> https://developer.arm.com/docs/den0057/a
>>>
>>> It implements support for stolen time, allowing the guest to
>>> identify time when it is forcibly not executing.
>>>
>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>> some concerns about the overheads and approach in the above
>>> specification, and I expect an updated version of the specification to
>>> be released soon with just the stolen time parts.
>>
>> Thanks for posting this.
>>
>> My current concern with this series is around the fact that we allocate
>> memory from the kernel on behalf of the guest. It is the first example
>> of such thing in the ARM port, and I can't really say I'm fond of it.
>>
>> x86 seems to get away with it by having the memory allocated from
>> userspace, which I tend to like more. Yes, put_user is more
>> expensive than a straight store, but this isn't done too often either.
>>
>> What is the rational for your current approach?
> 
> As I see it there are 3 approaches that can be taken here:
> 
> 1. Hypervisor allocates memory and adds it to the virtual machine. This
> means that everything to do with the 'device' is encapsulated behind the
> KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
> stolen time structure to be fast it cannot be a trapping region and has
> to be backed by real memory - in this case allocated by the host kernel.
> 
> 2. Host user space allocates memory. Similar to above, but this time
> user space needs to manage the memory region as well as the usual
> KVM_CREATE_DEVICE dance. I've no objection to this, but it means
> kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
> to size the memory region).
> 
> 3. Guest kernel "donates" the memory to the hypervisor for the
> structure. As far as I'm aware this is what x86 does. The problems I see
> with this approach are:
> 
>  a) kexec becomes much more tricky - there needs to be a disabling
> mechanism for the guest to stop the hypervisor scribbling on memory
> before starting the new kernel.
> 
>  b) If there is more than one entity that is interested in the
> information (e.g. firmware and kernel) then this requires some form of
> arbitration in the guest because the hypervisor doesn't want to have to
> track an arbitrary number of regions to update.
> 
>  c) Performance can suffer if the host kernel doesn't have a suitably
> aligned/sized area to use. As you say - put_user() is more expensive.
> The structure is updated on every return to the VM.
> 
> 
> Of course x86 does prove the third approach can work, but I'm not sure
> which is actually better. Avoiding the kexec cancellation requirements was
> the main driver of the current approach. Although many of the
> conversations about this were also tied up with Live Physical Time which
> adds its own complications.

My current train of thought is around (2):

- We don't need a new mechanism to track pages or deal with overlapping
IPA ranges
- We can get rid of the save/restore interface

The drawback is that the amount of memory required per vcpu becomes ABI.
I don't think that's a huge deal, as the hypervisor has the same
contract with the guest.

We also take a small hit with put_user(), but this is only done as a
consequence of vcpu_load() (and not on every entry as you suggest
above). It'd be worth quantifying this overhead before making any
decision one way or another.

Thanks,

	M.
-- 
Jazz is not dead, it just smells funny...


* Re: [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure
  2019-08-03 11:55   ` Marc Zyngier
@ 2019-08-05 14:09     ` Steven Price
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Price @ 2019-08-05 14:09 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, Radim Krčmář,
	Catalin Marinas, Suzuki K Pouloze, linux-doc, Russell King,
	linux-kernel, James Morse, linux-arm-kernel, Paolo Bonzini,
	Will Deacon, kvmarm, Julien Thierry

On 03/08/2019 12:55, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:12 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> Implement the service call for configuring a shared structre between a
> 
> structure
> 
>> VCPU and the hypervisor in which the hypervisor can write the time
>> stolen from the VCPU's execution time by other tasks on the host.
>>
>> The hypervisor allocates memory which is placed at an IPA chosen by user
>> space. The hypervisor then uses WRITE_ONCE() to update the shared
>> structre ensuring single copy atomicity of the 64-bit unsigned value
> 
> structure

Twice in one commit message... thanks for spotting! :)

>> that reports stolen time in nanoseconds.
>>
>> Whenever stolen time is enabled by the guest, the stolen time counter is
>> reset.
>>
>> The stolen time itself is retrieved from the sched_info structure
>> maintained by the Linux scheduler code. We enable SCHEDSTATS when
>> selecting KVM Kconfig to ensure this value is meaningful.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/include/asm/kvm_host.h | 13 +++++-
>>  arch/arm64/kvm/Kconfig            |  1 +
>>  include/kvm/arm_hypercalls.h      |  1 +
>>  include/linux/kvm_types.h         |  2 +
>>  virt/kvm/arm/arm.c                | 18 ++++++++
>>  virt/kvm/arm/hypercalls.c         | 70 +++++++++++++++++++++++++++++++
>>  6 files changed, 104 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index f656169db8c3..78f270190d43 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -44,6 +44,7 @@
>>  	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>>  #define KVM_REQ_IRQ_PENDING	KVM_ARCH_REQ(1)
>>  #define KVM_REQ_VCPU_RESET	KVM_ARCH_REQ(2)
>> +#define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
>>  
>>  DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
>>  
>> @@ -83,6 +84,11 @@ struct kvm_arch {
>>  
>>  	/* Mandated version of PSCI */
>>  	u32 psci_version;
>> +
>> +	struct kvm_arch_pvtime {
>> +		void *st;
> 
> Is it really a void *? I'm sure you can use a proper type here...

Indeed that sounds like a good idea!

>> +		gpa_t st_base;
>> +	} pvtime;
>>  };
>>  
>>  #define KVM_NR_MEM_OBJS     40
>> @@ -338,8 +344,13 @@ struct kvm_vcpu_arch {
>>  	/* True when deferrable sysregs are loaded on the physical CPU,
>>  	 * see kvm_vcpu_load_sysregs and kvm_vcpu_put_sysregs. */
>>  	bool sysregs_loaded_on_cpu;
>> -};
>>  
>> +	/* Guest PV state */
>> +	struct {
>> +		u64 steal;
>> +		u64 last_steal;
>> +	} steal;
>> +};
>>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
>>  #define vcpu_sve_pffr(vcpu) ((void *)((char *)((vcpu)->arch.sve_state) + \
>>  				      sve_ffr_offset((vcpu)->arch.sve_max_vl)))
>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>> index a67121d419a2..d8b88e40d223 100644
>> --- a/arch/arm64/kvm/Kconfig
>> +++ b/arch/arm64/kvm/Kconfig
>> @@ -39,6 +39,7 @@ config KVM
>>  	select IRQ_BYPASS_MANAGER
>>  	select HAVE_KVM_IRQ_BYPASS
>>  	select HAVE_KVM_VCPU_RUN_PID_CHANGE
>> +	select SCHEDSTATS
>>  	---help---
>>  	  Support hosting virtualized guest machines.
>>  	  We don't support KVM with 16K page tables yet, due to the multiple
>> diff --git a/include/kvm/arm_hypercalls.h b/include/kvm/arm_hypercalls.h
>> index 35a5abcc4ca3..9f0710ab4292 100644
>> --- a/include/kvm/arm_hypercalls.h
>> +++ b/include/kvm/arm_hypercalls.h
>> @@ -7,6 +7,7 @@
>>  #include <asm/kvm_emulate.h>
>>  
>>  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
>> +int kvm_update_stolen_time(struct kvm_vcpu *vcpu);
>>  
>>  static inline u32 smccc_get_function(struct kvm_vcpu *vcpu)
>>  {
>> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
>> index bde5374ae021..1c88e69db3d9 100644
>> --- a/include/linux/kvm_types.h
>> +++ b/include/linux/kvm_types.h
>> @@ -35,6 +35,8 @@ typedef unsigned long  gva_t;
>>  typedef u64            gpa_t;
>>  typedef u64            gfn_t;
>>  
>> +#define GPA_INVALID	(~(gpa_t)0)
>> +
>>  typedef unsigned long  hva_t;
>>  typedef u64            hpa_t;
>>  typedef u64            hfn_t;
>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>> index f645c0fbf7ec..ebd963d2580b 100644
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -40,6 +40,10 @@
>>  #include <asm/kvm_coproc.h>
>>  #include <asm/sections.h>
>>  
>> +#include <kvm/arm_hypercalls.h>
>> +#include <kvm/arm_pmu.h>
>> +#include <kvm/arm_psci.h>
>> +
>>  #ifdef REQUIRES_VIRT
>>  __asm__(".arch_extension	virt");
>>  #endif
>> @@ -135,6 +139,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>  	kvm->arch.max_vcpus = vgic_present ?
>>  				kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
>>  
>> +	kvm->arch.pvtime.st_base = GPA_INVALID;
>>  	return ret;
>>  out_free_stage2_pgd:
>>  	kvm_free_stage2_pgd(kvm);
>> @@ -371,6 +376,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>  	kvm_vcpu_load_sysregs(vcpu);
>>  	kvm_arch_vcpu_load_fp(vcpu);
>>  	kvm_vcpu_pmu_restore_guest(vcpu);
>> +	kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
>>  
>>  	if (single_task_running())
>>  		vcpu_clear_wfe_traps(vcpu);
>> @@ -617,6 +623,15 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
>>  	smp_rmb();
>>  }
>>  
>> +static void vcpu_req_record_steal(struct kvm_vcpu *vcpu)
>> +{
>> +	int idx;
>> +
>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +	kvm_update_stolen_time(vcpu);
>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +}
>> +
>>  static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
>>  {
>>  	return vcpu->arch.target >= 0;
>> @@ -636,6 +651,9 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
>>  		 * that a VCPU sees new virtual interrupts.
>>  		 */
>>  		kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
>> +
>> +		if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu))
>> +			vcpu_req_record_steal(vcpu);
>>  	}
>>  }
>>  
>> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
>> index 2906b2df99df..196c71c8dd87 100644
>> --- a/virt/kvm/arm/hypercalls.c
>> +++ b/virt/kvm/arm/hypercalls.c
>> @@ -10,6 +10,70 @@
>>  #include <kvm/arm_hypercalls.h>
>>  #include <kvm/arm_psci.h>
>>  
>> +
>> +static struct pvclock_vcpu_stolen_time_info *pvtime_get_st(
>> +		struct kvm_vcpu *vcpu)
> 
> nit: on a single line.
> 
>> +{
>> +	struct pvclock_vcpu_stolen_time_info *st = vcpu->kvm->arch.pvtime.st;
>> +
>> +	if (!st)
>> +		return NULL;
>> +
>> +	return &st[kvm_vcpu_get_idx(vcpu)];
>> +}
>> +
>> +int kvm_update_stolen_time(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 steal;
>> +	struct pvclock_vcpu_stolen_time_info *kaddr;
>> +
>> +	if (vcpu->kvm->arch.pvtime.st_base == GPA_INVALID)
>> +		return -ENOTSUPP;
> 
> So for a guest that doesn't have stolen time support (which is 100% of
> them for the foreseeable future), we still set a request, take the srcu
> lock and end-up here for nothing. I'd rather we test this st_base
> early, as it should never change once the guest has started.

Good point - I'll make the call to kvm_make_request() conditional on
st_base being set.

>> +
>> +	kaddr = pvtime_get_st(vcpu);
>> +
>> +	if (!kaddr)
>> +		return -ENOTSUPP;
> 
> How can this happen?

Good question, and it makes the pvtime_get_st() helper rather pointless.
Will remove.

>> +
>> +	kaddr->revision = 0;
>> +	kaddr->attributes = 0;
> 
> Why does this need to be written each time we update the stolen time? I
> have the feeling this would be better moved to the hypercall
> initializing the data structure.

Sure.

>> +
>> +	/* Let's do the local bookkeeping */
>> +	steal = vcpu->arch.steal.steal;
>> +	steal += current->sched_info.run_delay - vcpu->arch.steal.last_steal;
>> +	vcpu->arch.steal.last_steal = current->sched_info.run_delay;
>> +	vcpu->arch.steal.steal = steal;
>> +
>> +	/* Now write out the value to the shared page */
>> +	WRITE_ONCE(kaddr->stolen_time, cpu_to_le64(steal));
> 
> Is there any requirement for this to be visible to another CPU than the
> one this is being written from?

I don't believe there is a requirement for this to be visible
synchronously - there is an implicit race here if another vCPU is
accessing the structure (because the hypervisor can preempt and update
at any point), so we don't need to push this out with a barrier. However
we must use 64-bit single-copy atomicity writes to ensure that another
vCPU can't see a half-updated value.

Thanks,

Steve


* Re: [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure
  2019-08-03 18:13     ` Marc Zyngier
@ 2019-08-05 14:18       ` Steven Price
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Price @ 2019-08-05 14:18 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, linux-doc, Catalin Marinas, linux-kernel, Russell King,
	Paolo Bonzini, Will Deacon, kvmarm, linux-arm-kernel

On 03/08/2019 19:13, Marc Zyngier wrote:
> On Sat, 3 Aug 2019 18:58:17 +0100
> Marc Zyngier <maz@kernel.org> wrote:
> 
>> On Fri,  2 Aug 2019 15:50:12 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>>> Implement the service call for configuring a shared structre between a
>>> VCPU and the hypervisor in which the hypervisor can write the time
>>> stolen from the VCPU's execution time by other tasks on the host.
>>>
>>> The hypervisor allocates memory which is placed at an IPA chosen by user
>>> space. The hypervisor then uses WRITE_ONCE() to update the shared
>>> structre ensuring single copy atomicity of the 64-bit unsigned value
>>> that reports stolen time in nanoseconds.
>>>
>>> Whenever stolen time is enabled by the guest, the stolen time counter is
>>> reset.
>>>
>>> The stolen time itself is retrieved from the sched_info structure
>>> maintained by the Linux scheduler code. We enable SCHEDSTATS when
>>> selecting KVM Kconfig to ensure this value is meaningful.
>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>> ---
>>>  arch/arm64/include/asm/kvm_host.h | 13 +++++-
>>>  arch/arm64/kvm/Kconfig            |  1 +
>>>  include/kvm/arm_hypercalls.h      |  1 +
>>>  include/linux/kvm_types.h         |  2 +
>>>  virt/kvm/arm/arm.c                | 18 ++++++++
>>>  virt/kvm/arm/hypercalls.c         | 70 +++++++++++++++++++++++++++++++
>>>  6 files changed, 104 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index f656169db8c3..78f270190d43 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -44,6 +44,7 @@
>>>  	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>>>  #define KVM_REQ_IRQ_PENDING	KVM_ARCH_REQ(1)
>>>  #define KVM_REQ_VCPU_RESET	KVM_ARCH_REQ(2)
>>> +#define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
>>>  
>>>  DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
>>>  
>>> @@ -83,6 +84,11 @@ struct kvm_arch {
>>>  
>>>  	/* Mandated version of PSCI */
>>>  	u32 psci_version;
>>> +
>>> +	struct kvm_arch_pvtime {
>>> +		void *st;
>>> +		gpa_t st_base;
>>> +	} pvtime;
>>>  };
>>>  
>>>  #define KVM_NR_MEM_OBJS     40
>>> @@ -338,8 +344,13 @@ struct kvm_vcpu_arch {
>>>  	/* True when deferrable sysregs are loaded on the physical CPU,
>>>  	 * see kvm_vcpu_load_sysregs and kvm_vcpu_put_sysregs. */
>>>  	bool sysregs_loaded_on_cpu;
>>> -};
>>>  
>>> +	/* Guest PV state */
>>> +	struct {
>>> +		u64 steal;
>>> +		u64 last_steal;
>>> +	} steal;
>>> +};
>>>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
>>>  #define vcpu_sve_pffr(vcpu) ((void *)((char *)((vcpu)->arch.sve_state) + \
>>>  				      sve_ffr_offset((vcpu)->arch.sve_max_vl)))
>>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>>> index a67121d419a2..d8b88e40d223 100644
>>> --- a/arch/arm64/kvm/Kconfig
>>> +++ b/arch/arm64/kvm/Kconfig
>>> @@ -39,6 +39,7 @@ config KVM
>>>  	select IRQ_BYPASS_MANAGER
>>>  	select HAVE_KVM_IRQ_BYPASS
>>>  	select HAVE_KVM_VCPU_RUN_PID_CHANGE
>>> +	select SCHEDSTATS
>>>  	---help---
>>>  	  Support hosting virtualized guest machines.
>>>  	  We don't support KVM with 16K page tables yet, due to the multiple
>>> diff --git a/include/kvm/arm_hypercalls.h b/include/kvm/arm_hypercalls.h
>>> index 35a5abcc4ca3..9f0710ab4292 100644
>>> --- a/include/kvm/arm_hypercalls.h
>>> +++ b/include/kvm/arm_hypercalls.h
>>> @@ -7,6 +7,7 @@
>>>  #include <asm/kvm_emulate.h>
>>>  
>>>  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
>>> +int kvm_update_stolen_time(struct kvm_vcpu *vcpu);
>>>  
>>>  static inline u32 smccc_get_function(struct kvm_vcpu *vcpu)
>>>  {
>>> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
>>> index bde5374ae021..1c88e69db3d9 100644
>>> --- a/include/linux/kvm_types.h
>>> +++ b/include/linux/kvm_types.h
>>> @@ -35,6 +35,8 @@ typedef unsigned long  gva_t;
>>>  typedef u64            gpa_t;
>>>  typedef u64            gfn_t;
>>>  
>>> +#define GPA_INVALID	(~(gpa_t)0)
>>> +
>>>  typedef unsigned long  hva_t;
>>>  typedef u64            hpa_t;
>>>  typedef u64            hfn_t;
>>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>>> index f645c0fbf7ec..ebd963d2580b 100644
>>> --- a/virt/kvm/arm/arm.c
>>> +++ b/virt/kvm/arm/arm.c
>>> @@ -40,6 +40,10 @@
>>>  #include <asm/kvm_coproc.h>
>>>  #include <asm/sections.h>
>>>  
>>> +#include <kvm/arm_hypercalls.h>
>>> +#include <kvm/arm_pmu.h>
>>> +#include <kvm/arm_psci.h>
>>> +
>>>  #ifdef REQUIRES_VIRT
>>>  __asm__(".arch_extension	virt");
>>>  #endif
>>> @@ -135,6 +139,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>>  	kvm->arch.max_vcpus = vgic_present ?
>>>  				kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
>>>  
>>> +	kvm->arch.pvtime.st_base = GPA_INVALID;
>>>  	return ret;
>>>  out_free_stage2_pgd:
>>>  	kvm_free_stage2_pgd(kvm);
>>> @@ -371,6 +376,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>  	kvm_vcpu_load_sysregs(vcpu);
>>>  	kvm_arch_vcpu_load_fp(vcpu);
>>>  	kvm_vcpu_pmu_restore_guest(vcpu);
>>> +	kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
>>>  
>>>  	if (single_task_running())
>>>  		vcpu_clear_wfe_traps(vcpu);
>>> @@ -617,6 +623,15 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
>>>  	smp_rmb();
>>>  }
>>>  
>>> +static void vcpu_req_record_steal(struct kvm_vcpu *vcpu)
>>> +{
>>> +	int idx;
>>> +
>>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
>>> +	kvm_update_stolen_time(vcpu);
>>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>> +}
>>> +
>>>  static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
>>>  {
>>>  	return vcpu->arch.target >= 0;
>>> @@ -636,6 +651,9 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
>>>  		 * that a VCPU sees new virtual interrupts.
>>>  		 */
>>>  		kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
>>> +
>>> +		if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu))
>>> +			vcpu_req_record_steal(vcpu);  
>>
>> Something troubles me. Here, you've set the request on load. But you
>> can be preempted at any time (preemption gets disabled just after).
>>
>> I have the feeling that should you get preempted right here, you'll
>> end-up having accumulated the wrong amount of steal time, as the
>> request put via load when you'll get scheduled back in will only get
>> processed after a full round of entry/exit/entry, which doesn't look
>> great.
> 
> Ah, no. We're saved by the check for pending requests right before we
> jump in the guest, causing an early exit and the whole shebang to be
> restarted.

Yes, that's my understanding. Obviously not ideal if it happens in that
small window, but everything is redone to get the right values in the end.

Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space
  2019-08-03 12:51   ` Marc Zyngier
  2019-08-03 17:34     ` Marc Zyngier
@ 2019-08-05 16:10     ` Steven Price
  2019-08-05 16:28       ` Marc Zyngier
  1 sibling, 1 reply; 43+ messages in thread
From: Steven Price @ 2019-08-05 16:10 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, Radim Krčmář,
	Catalin Marinas, Suzuki K Pouloze, linux-doc, Russell King,
	linux-kernel, James Morse, linux-arm-kernel, Paolo Bonzini,
	Will Deacon, kvmarm, Julien Thierry

On 03/08/2019 13:51, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:14 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> Allow user space to inform the KVM host where in the physical memory
>> map the paravirtualized time structures should be located.
>>
>> A device is created which provides the base address of an array of
>> Stolen Time (ST) structures, one for each VCPU. There must be (64 *
>> total number of VCPUs) bytes of memory available at this location.
>>
>> The address is given in terms of the physical address visible to
>> the guest and must be 64 byte aligned. The memory should be marked as
>> reserved to the guest to stop it allocating it for other purposes.
> 
> Why? You seem to be allocating the memory from the kernel, so as far as
> the guest is concerned, this isn't generally usable memory.

I obviously didn't word it very well - that's what I meant. The "memory"
that represents the stolen time structure shouldn't be shown to the
guest as normal memory, but "reserved" for the purpose of stolen time.

To be honest it looks like I forgot to rewrite this commit message - the
64 byte alignment is all that the guest can rely on (because each vCPU
has its own structure); the actual array of structures needs to be
page aligned to ensure we can safely map it into the guest.
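
To make the layout concrete, here is a sketch in C (illustrative only: the field names follow the spec table quoted elsewhere in the thread, and st_record_gpa() is a hypothetical helper, not part of the patches):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: one 64-byte stolen-time record per vCPU, packed
 * into an array whose base must be page aligned. Field names follow
 * the DEN 0057A stolen-time table. */
struct stolen_time_info {
	uint32_t revision;    /* must be 0 for version 0.1 */
	uint32_t attributes;  /* must be 0 */
	uint64_t stolen_time; /* nanoseconds, updated by the host */
	uint8_t  padding[48]; /* pad the record out to 64 bytes */
};

/* Guest physical address of vCPU n's record: each record is 64-byte
 * aligned relative to the page-aligned base of the array. */
static uint64_t st_record_gpa(uint64_t base, int vcpu_id)
{
	return base + (uint64_t)vcpu_id * sizeof(struct stolen_time_info);
}
```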

>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/include/asm/kvm_mmu.h  |   2 +
>>  arch/arm64/include/uapi/asm/kvm.h |   6 +
>>  arch/arm64/kvm/Makefile           |   1 +
>>  include/uapi/linux/kvm.h          |   2 +
>>  virt/kvm/arm/mmu.c                |  44 +++++++
>>  virt/kvm/arm/pvtime.c             | 190 ++++++++++++++++++++++++++++++
>>  6 files changed, 245 insertions(+)
>>  create mode 100644 virt/kvm/arm/pvtime.c
>>
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index befe37d4bc0e..88c8a4b2836f 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -157,6 +157,8 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm);
>>  void kvm_free_stage2_pgd(struct kvm *kvm);
>>  int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>  			  phys_addr_t pa, unsigned long size, bool writable);
>> +int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
>> +			  phys_addr_t pa, unsigned long size, bool writable);
>>  
>>  int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
>>  
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 9a507716ae2f..95516a4198ea 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -367,6 +367,12 @@ struct kvm_vcpu_events {
>>  #define KVM_PSCI_RET_INVAL		PSCI_RET_INVALID_PARAMS
>>  #define KVM_PSCI_RET_DENIED		PSCI_RET_DENIED
>>  
>> +/* Device Control API: PV_TIME */
>> +#define KVM_DEV_ARM_PV_TIME_PADDR	0
>> +#define  KVM_DEV_ARM_PV_TIME_ST		0
>> +#define KVM_DEV_ARM_PV_TIME_STATE_SIZE	1
>> +#define KVM_DEV_ARM_PV_TIME_STATE	2
>> +
>>  #endif
>>  
>>  #endif /* __ARM_KVM_H__ */
>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>> index 73dce4d47d47..5ffbdc39e780 100644
>> --- a/arch/arm64/kvm/Makefile
>> +++ b/arch/arm64/kvm/Makefile
>> @@ -14,6 +14,7 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/e
>>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arm.o $(KVM)/arm/mmu.o $(KVM)/arm/mmio.o
>>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/psci.o $(KVM)/arm/perf.o
>>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hypercalls.o
>> +kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/pvtime.o
>>  
>>  kvm-$(CONFIG_KVM_ARM_HOST) += inject_fault.o regmap.o va_layout.o
>>  kvm-$(CONFIG_KVM_ARM_HOST) += hyp.o hyp-init.o handle_exit.o
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index a7c19540ce21..04bffafa0708 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1222,6 +1222,8 @@ enum kvm_device_type {
>>  #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
>>  	KVM_DEV_TYPE_XIVE,
>>  #define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
>> +	KVM_DEV_TYPE_ARM_PV_TIME,
>> +#define KVM_DEV_TYPE_ARM_PV_TIME	KVM_DEV_TYPE_ARM_PV_TIME
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 38b4c910b6c3..be28a4aee451 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1368,6 +1368,50 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>  	return ret;
>>  }
>>  
>> +/**
>> + * kvm_phys_addr_memremap - map a memory range to guest IPA
>> + *
>> + * @kvm:	The KVM pointer
>> + * @guest_ipa:	The IPA at which to insert the mapping
>> + * @pa:		The physical address of the memory
>> + * @size:	The size of the mapping
>> + */
>> +int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
>> +			  phys_addr_t pa, unsigned long size, bool writable)
>> +{
>> +	phys_addr_t addr, end;
>> +	int ret = 0;
>> +	unsigned long pfn;
>> +	struct kvm_mmu_memory_cache cache = { 0, };
>> +
>> +	end = (guest_ipa + size + PAGE_SIZE - 1) & PAGE_MASK;
>> +	pfn = __phys_to_pfn(pa);
>> +
>> +	for (addr = guest_ipa; addr < end; addr += PAGE_SIZE) {
>> +		pte_t pte = pfn_pte(pfn, PAGE_S2);
>> +
>> +		if (writable)
>> +			pte = kvm_s2pte_mkwrite(pte);
>> +
>> +		ret = mmu_topup_memory_cache(&cache,
>> +					     kvm_mmu_cache_min_pages(kvm),
>> +					     KVM_NR_MEM_OBJS);
>> +		if (ret)
>> +			goto out;
>> +		spin_lock(&kvm->mmu_lock);
>> +		ret = stage2_set_pte(kvm, &cache, addr, &pte, 0);
>> +		spin_unlock(&kvm->mmu_lock);
>> +		if (ret)
>> +			goto out;
>> +
>> +		pfn++;
>> +	}
>> +
>> +out:
>> +	mmu_free_memory_cache(&cache);
>> +	return ret;
>> +}
> 
> This is an exact copy of kvm_phys_addr_ioremap(), with only the memory
> attributes changing. Surely we can have a shared implementation that
> takes the memory attribute as a parameter.

Good point - although given the comments below I'm going to need
something which can deal with non-contiguous memory...

>> +
>>  static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
>>  {
>>  	kvm_pfn_t pfn = *pfnp;
>> diff --git a/virt/kvm/arm/pvtime.c b/virt/kvm/arm/pvtime.c
>> new file mode 100644
>> index 000000000000..9051bc07eae1
>> --- /dev/null
>> +++ b/virt/kvm/arm/pvtime.c
>> @@ -0,0 +1,190 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (C) 2019 Arm Ltd.
>> +
>> +#include <linux/kvm_host.h>
>> +#include <asm/kvm_mmu.h>
>> +
>> +/* We currently only support PV time on ARM64 */
>> +#ifdef CONFIG_ARM64
> 
> And we're only compiling it on arm64, so why the #ifdef?

Another good point - will remove.

>> +
>> +#include <asm/pvclock-abi.h>
>> +
>> +static int max_stolen_size(void)
>> +{
>> +	int size = KVM_MAX_VCPUS * sizeof(struct pvclock_vcpu_stolen_time_info);
> 
> So we're always allocating enough memory for 512 CPUs? That's an
> additional 32kB of contiguous memory...
> 
>> +
>> +	return ALIGN(size, PAGE_SIZE);
>> +}
>> +
>> +static int kvm_arm_pvtime_create(struct kvm_device *dev, u32 type)
>> +{
>> +	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>> +
>> +	pvtime->st = alloc_pages_exact(max_stolen_size(),
>> +				       GFP_KERNEL | __GFP_ZERO);
>> +	if (!pvtime->st)
>> +		return -ENOMEM;
> 
> Is there any chance we could use a vmalloc allocation instead? This
> would lift the requirement on having physically contiguous memory.

Yes I think it should be possible to use non-contiguous memory and vmalloc.

>> +
>> +	return 0;
>> +}
>> +
>> +static void kvm_arm_pvtime_destroy(struct kvm_device *dev)
>> +{
>> +	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>> +
>> +	pvtime->st_base = GPA_INVALID;
>> +	free_pages_exact(pvtime->st, max_stolen_size());
>> +	kfree(dev);
>> +}
>> +
>> +static int pvtime_map_pages(struct kvm *kvm, gpa_t guest_paddr,
>> +			    void *kaddr, int size)
>> +{
>> +	return kvm_phys_addr_memremap(kvm, guest_paddr,
>> +			virt_to_phys(kaddr),
>> +			size, false);
>> +}
>> +
>> +static int pvtime_save_state(struct kvm *kvm, u64 type, void __user *user)
>> +{
>> +	void *source;
>> +	size_t size;
>> +
>> +	switch (type) {
>> +	case KVM_DEV_ARM_PV_TIME_ST:
>> +		source = kvm->arch.pvtime.st;
>> +		size = sizeof(struct pvclock_vcpu_stolen_time_info) *
>> +			atomic_read(&kvm->online_vcpus);
>> +		break;
>> +	default:
>> +		return -ENXIO;
>> +	}
>> +
>> +	if (copy_to_user(user, source, size))
>> +		return -EFAULT;
>> +	return 0;
>> +}
>> +
>> +static int pvtime_restore_state(struct kvm *kvm, u64 type, void __user *user)
>> +{
>> +	void *dest;
>> +	size_t size;
>> +
>> +	switch (type) {
>> +	case KVM_DEV_ARM_PV_TIME_ST:
>> +		dest = kvm->arch.pvtime.st;
>> +		size = sizeof(struct pvclock_vcpu_stolen_time_info) *
>> +			atomic_read(&kvm->online_vcpus);
>> +		break;
>> +	default:
>> +		return -ENXIO;
>> +	}
>> +
>> +	if (copy_from_user(dest, user, size))
>> +		return -EFAULT;
>> +
>> +	return 0;
>> +}
>> +
>> +static int kvm_arm_pvtime_set_attr(struct kvm_device *dev,
>> +				   struct kvm_device_attr *attr)
>> +{
>> +	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>> +	u64 __user *user = (u64 __user *)attr->addr;
>> +	u64 paddr;
>> +	int ret;
>> +
>> +	switch (attr->group) {
>> +	case KVM_DEV_ARM_PV_TIME_PADDR:
>> +		if (get_user(paddr, user))
>> +			return -EFAULT;
>> +		if (paddr & 63)
>> +			return -EINVAL;
> 
> You should check whether the device fits into the IPA space for this
> guest, and whether it overlaps with anything else.

pvtime_map_pages() should fail in the case of overlap. That seems
sufficient to me - do you think we need something stronger?

>> +		switch (attr->attr) {
>> +		case KVM_DEV_ARM_PV_TIME_ST:
>> +			if (pvtime->st_base != GPA_INVALID)
>> +				return -EEXIST;
>> +			ret = pvtime_map_pages(dev->kvm, paddr, pvtime->st,
>> +					max_stolen_size());
> 
> Consider moving the size directly into pvtime_map_pages(), and dropping
> the pvtime->st parameter. All you need is kvm and paddr.

Will do.

Thanks,

Steve

* Re: [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space
  2019-08-05 16:10     ` Steven Price
@ 2019-08-05 16:28       ` Marc Zyngier
  0 siblings, 0 replies; 43+ messages in thread
From: Marc Zyngier @ 2019-08-05 16:28 UTC (permalink / raw)
  To: Steven Price
  Cc: kvm, Radim Krčmář,
	Catalin Marinas, Suzuki K Pouloze, linux-doc, Russell King,
	linux-kernel, James Morse, linux-arm-kernel, Paolo Bonzini,
	Will Deacon, kvmarm, Julien Thierry

On 05/08/2019 17:10, Steven Price wrote:
> On 03/08/2019 13:51, Marc Zyngier wrote:
>> On Fri,  2 Aug 2019 15:50:14 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>>> Allow user space to inform the KVM host where in the physical memory
>>> map the paravirtualized time structures should be located.
>>>
>>> A device is created which provides the base address of an array of
>>> Stolen Time (ST) structures, one for each VCPU. There must be (64 *
>>> total number of VCPUs) bytes of memory available at this location.
>>>
>>> The address is given in terms of the physical address visible to
>>> the guest and must be 64 byte aligned. The memory should be marked as
>>> reserved to the guest to stop it allocating it for other purposes.
>>
>> Why? You seem to be allocating the memory from the kernel, so as far as
>> the guest is concerned, this isn't generally usable memory.
> 
> I obviously didn't word it very well - that's what I meant. The "memory"
> that represents the stolen time structure shouldn't be shown to the
> guest as normal memory, but "reserved" for the purpose of stolen time.
> 
> To be honest it looks like I forgot to rewrite this commit message -
> which 64 byte alignment is all that the guest can rely on (because each
> vCPU has its own structure), the actual array of structures needs to be
> page aligned to ensure we can safely map it into the guest.
> 
>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>> ---
>>>  arch/arm64/include/asm/kvm_mmu.h  |   2 +
>>>  arch/arm64/include/uapi/asm/kvm.h |   6 +
>>>  arch/arm64/kvm/Makefile           |   1 +
>>>  include/uapi/linux/kvm.h          |   2 +
>>>  virt/kvm/arm/mmu.c                |  44 +++++++
>>>  virt/kvm/arm/pvtime.c             | 190 ++++++++++++++++++++++++++++++
>>>  6 files changed, 245 insertions(+)
>>>  create mode 100644 virt/kvm/arm/pvtime.c

[...]

>>> +static int kvm_arm_pvtime_set_attr(struct kvm_device *dev,
>>> +				   struct kvm_device_attr *attr)
>>> +{
>>> +	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>>> +	u64 __user *user = (u64 __user *)attr->addr;
>>> +	u64 paddr;
>>> +	int ret;
>>> +
>>> +	switch (attr->group) {
>>> +	case KVM_DEV_ARM_PV_TIME_PADDR:
>>> +		if (get_user(paddr, user))
>>> +			return -EFAULT;
>>> +		if (paddr & 63)
>>> +			return -EINVAL;
>>
>> You should check whether the device fits into the IPA space for this
>> guest, and whether it overlaps with anything else.
> 
> pvtime_map_pages() should fail in the case of overlap. That seems
> sufficient to me - do you think we need something stronger?

Definitely. stage2_set_pte() won't fail for a non-IO overlapping
mapping, and will just treat it as guest memory. If this overlaps with a
memslot, we'll never be able to fault that page in, ending up with
interesting memory corruption... :-/

That's one of the reasons why I think option (2) in your earlier email
is an interesting one, as it sidesteps a whole lot of ugly and hard to
test corner cases.

Thanks,

	M.
-- 
Jazz is not dead, it just smells funny...

* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
  2019-08-02 14:50 ` [PATCH 1/9] KVM: arm64: Document PV-time interface Steven Price
  2019-08-03 11:13   ` Marc Zyngier
  2019-08-05  3:23   ` Zenghui Yu
@ 2019-08-05 16:40   ` Christophe de Dinechin
  2019-08-07 13:21     ` Steven Price
  2 siblings, 1 reply; 43+ messages in thread
From: Christophe de Dinechin @ 2019-08-05 16:40 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel


Steven Price writes:

> Introduce a paravirtualization interface for KVM/arm64 based on the
> "Arm Paravirtualized Time for Arm-Based Systems" specification DEN 0057A.
>
> This only adds the details about "Stolen Time" as the details of "Live
> Physical Time" have not been fully agreed.
>
[...]

> +
> +Stolen Time
> +-----------
> +
> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
> +
> +  Field       | Byte Length | Byte Offset | Description
> +  ----------- | ----------- | ----------- | --------------------------
> +  Revision    |      4      |      0      | Must be 0 for version 0.1
> +  Attributes  |      4      |      4      | Must be 0
> +  Stolen time |      8      |      8      | Stolen time in unsigned
> +              |             |             | nanoseconds indicating how
> +              |             |             | much time this VCPU thread
> +              |             |             | was involuntarily not
> +              |             |             | running on a physical CPU.

I know very little about the topic, but I don't understand how the spec
as proposed allows an accurate reading of the relation between physical
time and stolen time simultaneously. In other words, could you draw
Figure 1 of the spec from within the guest? Or is it a non-objective?

For example, if you read the stolen time before you read CNTVCT_EL0,
isn't it possible for a lengthy event like a migration to occur between
the two reads, causing the stolen time to be obsolete and off by seconds?

--
Cheers,
Christophe de Dinechin (IRC c3d)

* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
  2019-08-05 16:40   ` Christophe de Dinechin
@ 2019-08-07 13:21     ` Steven Price
       [not found]       ` <9F77FA64-C71B-4025-A58D-3AC07E6688DE@dinechin.org>
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Price @ 2019-08-07 13:21 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: kvm, Radim Krčmář,
	Catalin Marinas, Suzuki K Pouloze, linux-doc, Russell King,
	linux-kernel, James Morse, linux-arm-kernel, Marc Zyngier,
	Paolo Bonzini, Will Deacon, kvmarm, Julien Thierry

On 05/08/2019 17:40, Christophe de Dinechin wrote:
> 
> Steven Price writes:
> 
>> Introduce a paravirtualization interface for KVM/arm64 based on the
>> "Arm Paravirtualized Time for Arm-Based Systems" specification DEN 0057A.
>>
>> This only adds the details about "Stolen Time" as the details of "Live
>> Physical Time" have not been fully agreed.
>>
> [...]
> 
>> +
>> +Stolen Time
>> +-----------
>> +
>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>> +
>> +  Field       | Byte Length | Byte Offset | Description
>> +  ----------- | ----------- | ----------- | --------------------------
>> +  Revision    |      4      |      0      | Must be 0 for version 0.1
>> +  Attributes  |      4      |      4      | Must be 0
>> +  Stolen time |      8      |      8      | Stolen time in unsigned
>> +              |             |             | nanoseconds indicating how
>> +              |             |             | much time this VCPU thread
>> +              |             |             | was involuntarily not
>> +              |             |             | running on a physical CPU.
> 
> I know very little about the topic, but I don't understand how the spec
> as proposed allows an accurate reading of the relation between physical
> time and stolen time simultaneously. In other words, could you draw
> Figure 1 of the spec from within the guest? Or is it a non-objective?

Figure 1 is mostly attempting to explain Live Physical Time (LPT), which
is not part of this patch series. But it does touch on stolen time by
the difference between "live physical time" and "virtual time".

I'm not sure what you mean by "from within the guest". From the
perspective of the guest the parts of the diagram where the guest isn't
running don't exist (therefore there are discontinuities in the
"physical time" and "live physical time" lines).

This patch series doesn't attempt to provide the guest with a view of
"physical time" (or LPT) - but it might be able to observe that by
consulting something external (e.g. an NTP server, or an emulated RTC
which reports wall-clock time).

What it does provide is a mechanism for obtaining the difference (as
reported by the host) between "live physical time" and "virtual time" -
this is reported in nanoseconds in the above structure.

> For example, if you read the stolen time before you read CNTVCT_EL0,
> isn't it possible for a lengthy event like a migration to occur between
> the two reads, causing the stolen time to be obsolete and off by seconds?

"Lengthy events" like migration are represented by the "paused" state in
the diagram - i.e. it's the difference between "physical time" and "live
physical time". So stolen time doesn't attempt to represent that.

And yes, there is a race between reading CNTVCT_EL0 and reading stolen
time - but in practice this doesn't really matter. The usual pseudo-code
way of using stolen time is:

  * scheduler captures stolen time from structure and CNTVCT_EL0:
      before_timer = CNTVCT_EL0
      before_stolen = stolen
  * schedule in process
  * process is pre-empted (or blocked in some way)
  * scheduler captures stolen time from structure and CNTVCT_EL0:
      after_timer = CNTVCT_EL0
      after_stolen = stolen
      time = to_nsecs(after_timer - before_timer) -
             (after_stolen - before_stolen)

The scheduler can then charge the process for "time" nanoseconds of
time. This ensures that a process isn't unfairly penalised if the host
doesn't schedule the VCPU while it is supposed to be running.

The race is very small in comparison to the time the process is running,
and in the worst case just means the process is charged slightly more
(or less) than it should be.
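
The charging calculation in the pseudo-code above can be sketched as a plain C function (a simplified model, not the kernel's code; it works directly in nanoseconds, so the to_nsecs() conversion of raw CNTVCT_EL0 ticks is elided):

```c
#include <stdint.h>

/* Simplified model of the charging calculation described above,
 * working directly in nanoseconds rather than raw counter ticks. */
static int64_t charge_ns(uint64_t before_timer, uint64_t after_timer,
			 uint64_t before_stolen, uint64_t after_stolen)
{
	/* Elapsed virtual counter time over the scheduling window. */
	uint64_t wall = after_timer - before_timer;
	/* Time the host spent running something else in that window. */
	uint64_t stolen = after_stolen - before_stolen;

	/* Time actually charged to the process. */
	return (int64_t)(wall - stolen);
}
```

Because of the race described above, the stolen delta can exceed the measured wall time, so the result can even come out negative and a consumer has to be prepared to clamp it.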

I guess if you're really worried about it, you could do a dance like:

	do {
		before = stolen
		timer = CNTVCT_EL0
		after = stolen
	} while (before != after);

But I don't see the need for such an accurate view of the elapsed time
for which the VCPU was scheduled. And of course at the moment (without
this series) the guest has no idea about time stolen by the host.
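
For illustration, that retry "dance" can be written out as a small helper (a sketch only: the two reads are modelled as function pointers standing in for the shared-structure load and the CNTVCT_EL0 read, and it assumes the host only ever advances the stolen value):

```c
#include <stdint.h>

struct snapshot {
	uint64_t timer;  /* counter value */
	uint64_t stolen; /* stolen time consistent with that counter value */
};

/* Re-read the stolen time around the counter read until both reads
 * agree, guaranteeing a consistent (timer, stolen) pair. */
static struct snapshot read_consistent(uint64_t (*read_stolen)(void),
				       uint64_t (*read_cntvct)(void))
{
	struct snapshot s;
	uint64_t before, after;

	do {
		before  = read_stolen();
		s.timer = read_cntvct();
		after   = read_stolen();
	} while (before != after);

	s.stolen = after;
	return s;
}

/* Trivial stand-ins for testing: stolen never changes, counter ticks. */
static uint64_t fake_stolen(void) { return 42; }
static uint64_t tick;
static uint64_t fake_cntvct(void) { return ++tick; }
```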

Steve

* Re: [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space
  2019-08-03 17:34     ` Marc Zyngier
@ 2019-08-07 13:39       ` Steven Price
  2019-08-07 13:51         ` Marc Zyngier
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Price @ 2019-08-07 13:39 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, linux-doc, Catalin Marinas, linux-kernel, Russell King,
	Paolo Bonzini, Will Deacon, kvmarm, linux-arm-kernel

On 03/08/2019 18:34, Marc Zyngier wrote:
> On Sat, 3 Aug 2019 13:51:13 +0100
> Marc Zyngier <maz@kernel.org> wrote:
> 
> [forgot that one]
> 
>> On Fri,  2 Aug 2019 15:50:14 +0100
>> Steven Price <steven.price@arm.com> wrote:
> 
> [...]
> 
>>> +static int __init kvm_pvtime_init(void)
>>> +{
>>> +	kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +late_initcall(kvm_pvtime_init);
> 
> Why is it an initcall? So far, the only initcall we've used is the one
> that initializes KVM itself. Can't we just register the device_ops just
> like we do for the vgic?
> do for the vgic?

So would you prefer a direct call from init_subsystems() in
virt/kvm/arm/arm.c?

The benefit of initcall is just that it keeps the code self-contained.
In init_subsystems() I'd either need a #ifdef CONFIG_ARM64 or a dummy
function for arm.

Steve

* Re: [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space
  2019-08-07 13:39       ` Steven Price
@ 2019-08-07 13:51         ` Marc Zyngier
  0 siblings, 0 replies; 43+ messages in thread
From: Marc Zyngier @ 2019-08-07 13:51 UTC (permalink / raw)
  To: Steven Price
  Cc: kvm, linux-doc, Catalin Marinas, linux-kernel, Russell King,
	Paolo Bonzini, Will Deacon, kvmarm, linux-arm-kernel

On 07/08/2019 14:39, Steven Price wrote:
> On 03/08/2019 18:34, Marc Zyngier wrote:
>> On Sat, 3 Aug 2019 13:51:13 +0100
>> Marc Zyngier <maz@kernel.org> wrote:
>>
>> [forgot that one]
>>
>>> On Fri,  2 Aug 2019 15:50:14 +0100
>>> Steven Price <steven.price@arm.com> wrote:
>>
>> [...]
>>
>>>> +static int __init kvm_pvtime_init(void)
>>>> +{
>>>> +	kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +late_initcall(kvm_pvtime_init);
>>
>> Why is it an initcall? So far, the only initcall we've used is the one
>> that initializes KVM itself. Can't we just register the device_ops just
>> like we do for the vgic?
>> do for the vgic?
> 
> So would you prefer a direct call from init_subsystems() in
> virt/kvm/arm/arm.c?

Yes. Consistency is important.

> The benefit of initcall is just that it keeps the code self-contained.
> In init_subsystems() I'd either need a #ifdef CONFIG_ARM64 or a dummy
> function for arm.

Having a dummy function for 32bit ARM is fine. Most of the code we add
to the 32bit port is made of empty stubs anyway.

Thanks,

	M.
-- 
Jazz is not dead, it just smells funny...

* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
       [not found]       ` <9F77FA64-C71B-4025-A58D-3AC07E6688DE@dinechin.org>
@ 2019-08-07 15:26         ` Steven Price
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Price @ 2019-08-07 15:26 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: KVM list, Catalin Marinas, linux-doc, Russell King, open list,
	Marc Zyngier, Paolo Bonzini, Will Deacon, kvmarm,
	linux-arm-kernel

On 07/08/2019 15:28, Christophe de Dinechin wrote:
> 
> 
>> On 7 Aug 2019, at 15:21, Steven Price <steven.price@arm.com
>> <mailto:steven.price@arm.com>> wrote:
>>
>> On 05/08/2019 17:40, Christophe de Dinechin wrote:
>>>
>>> Steven Price writes:
>>>
>>>> Introduce a paravirtualization interface for KVM/arm64 based on the
>>>> "Arm Paravirtualized Time for Arm-Based Systems" specification DEN 0057A.
>>>>
>>>> This only adds the details about "Stolen Time" as the details of "Live
>>>> Physical Time" have not been fully agreed.
>>>>
>>> [...]
>>>
>>>> +
>>>> +Stolen Time
>>>> +-----------
>>>> +
>>>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>>>> +
>>>> +  Field       | Byte Length | Byte Offset | Description
>>>> +  ----------- | ----------- | ----------- | --------------------------
>>>> +  Revision    |      4      |      0      | Must be 0 for version 0.1
>>>> +  Attributes  |      4      |      4      | Must be 0
>>>> +  Stolen time |      8      |      8      | Stolen time in unsigned
>>>> +              |             |             | nanoseconds indicating how
>>>> +              |             |             | much time this VCPU thread
>>>> +              |             |             | was involuntarily not
>>>> +              |             |             | running on a physical CPU.
>>>
>>> I know very little about the topic, but I don't understand how the spec
>>> as proposed allows an accurate reading of the relation between physical
>>> time and stolen time simultaneously. In other words, could you draw
>>> Figure 1 of the spec from within the guest? Or is it a non-objective?
>>
>> Figure 1 is mostly attempting to explain Live Physical Time (LPT), which
>> is not part of this patch series. But it does touch on stolen time by
>> the difference between "live physical time" and "virtual time".
>>
>> I'm not sure what you mean by "from within the guest". From the
>> perspective of the guest the parts of the diagram where the guest isn't
>> running don't exist (therefore there are discontinuities in the
>> "physical time" and "live physical time" lines).
> 
> I meant: If I run code within the guest that attempts to draw Figure 1,
> race conditions may cause the diagram actually drawn by your guest
> program to look completely wrong on occasions.
> 
>> This patch series doesn't attempt to provide the guest with a view of
>> "physical time" (or LPT) - but it might be able to observe that by
>> consulting something external (e.g. an NTP server, or an emulated RTC
>> which reports wall-clock time).
> 
> … with what appears to be a built-in race condition, as you correctly
> identified. I was wondering if the built-in race condition was deliberate
> and/or necessary, or if it was irrelevant for the planned uses of the value.
> 
>> What it does provide is a mechanism for obtaining the difference (as
>> reported by the host) between "live physical time" and "virtual time" -
>> this is reported in nanoseconds in the above structure.
>>
>>> For example, if you read the stolen time before you read CNTVCT_EL0,
>>> isn't it possible for a lengthy event like a migration to occur between
>>> the two reads, causing the stolen time to be obsolete and off by seconds?
>>
>> "Lengthy events" like migration are represented by the "paused" state in
>> the diagram - i.e. it's the difference between "physical time" and "live
>> physical time". So stolen time doesn't attempt to represent that.
>>
>> And yes, there is a race between reading CNTVCT_EL0 and reading stolen
>> time - but in practice this doesn't really matter. The usual pseudo-code
>> way of using stolen time is:
> 
> I’m assuming this is the guest scheduler you are talking about,

yes

> and I’m assuming virtualization can preempt that code anywhere.
> Maybe that’s where I’m wrong?

You are correct, the guest can be preempted at any point.

> 
> For the sake of the argument, assume there is a 1s pause.
> Not completely unreasonable in a migration scenario.

As I mentioned before, events like migration are not represented by
stolen time. They would be represented by CNTVCT_EL0 appearing to pause
during the migration (so showing a difference between "physical time"
and "live physical time"). The stolen time value would not be incremented.

>>  * scheduler captures stolen time from structure and CNTVCT_EL0:
>>      before_timer = CNTVCT_EL0
> 
> [insert optional 1s pause here, case A]
> 
>>      before_stolen = stolen
>>  * schedule in process
>>  * process is pre-empted (or blocked in some way)
>>  * scheduler captures stolen time from structure and CNTVCT_EL0:
>>      after_timer = CNTVCT_EL0
> 
> [insert optional 1s pause here, case B]
> 
>>      after_stolen = stolen
>>      time = to_nsecs(after_timer - before_timer) -
>>             (after_stolen - before_stolen)
> 
> In case A, time is too big by one second. In case B, it is too small,
> to the point where your code might need to be ready for
> “time” unexpectedly showing up as negative.

So a 1 second pause is unlikely for stolen time - this means that the
VCPU was ready to run, but the host didn't run it for some reason. But
in theory you are correct this could happen. The core code deals with it
like this (update_rq_clock_task):
> 	if (static_key_false((&paravirt_steal_rq_enabled))) {
> 		steal = paravirt_steal_clock(cpu_of(rq));
> 		steal -= rq->prev_steal_time_rq;
> 
> 		if (unlikely(steal > delta))
> 			steal = delta;
> 
> 		rq->prev_steal_time_rq += steal;
> 		delta -= steal;
> 	}

So if (steal > delta) then steal is capped to delta, preventing the
final delta from going negative.
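
That clamp boils down to the following (a minimal restatement of the logic quoted above, not the actual kernel source):

```c
#include <stdint.h>

/* Mirror of the clamp in update_rq_clock_task: never account more
 * steal time than the measured delta, so delta cannot go negative. */
static uint64_t apply_steal(uint64_t delta, uint64_t steal)
{
	if (steal > delta)
		steal = delta;
	return delta - steal;
}
```

So in case B of the example above, the large stolen delta is simply absorbed: the process is charged zero time for that window rather than a negative amount.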

>>
>> The scheduler can then charge the process for "time" nanoseconds of
>> time. This ensures that a process isn't unfairly penalised if the host
>> doesn't schedule the VCPU while it is supposed to be running.
>>
>> The race is very small in comparison to the time the process is running,
>> and in the worst case just means the process is charged slightly more
>> (or less) than it should be.
> 
> At this point, what I don’t understand is why the race would be
> “very small” or why you would only be charged “slightly” more or less?

The window between measuring the time using CNTVCT_EL0 and getting the
stolen time from the hypervisor is pretty short. The amount of time that
is (normally) stolen in one go is also small. So the race is unlikely
and the error when it occurs is (usually) small.

Long events (such as migration or pausing the guest) are not considered
"stolen time" and should be reflected to the guest in other ways.

>> I guess if you're really worried about it, you could do a dance like:
>>
>> do {
>> before = stolen
>> timer = CNTVCT_EL0
>> after = stolen
>> } while (before != after);
> 
> That will work as long as nothing in that loop requires something
> that would cause `stolen` to jump. If there is such a guarantee,
> then that’s even efficient, because in most cases the loop
> would only run once, at the cost of one extra read and one test.

Note that other architectures don't have such loops, so arm64 is just
following the lead of existing architectures.

>> But I don't see the need to have such an accurate view of elapsed time
>> that the VCPU was scheduled. And of course at the moment (without this
>> series) the guest has no idea about time stolen by the host.
> 
> I’m certainly not arguing that exposing stolen time is a bad idea,
> I’m only wondering if the proposed solution is racy, and if so, if
> it is intentional.
> 
> If it’s indeed racy, the problem could be mitigated in a number of
> ways
> 
> a) document your loop or something similar as being the recommended
> way to avoid the race, and then ensure that the loop actually
> will always work as intended. The upside is that it’s just a change in
> some comments or documentation.
> 
> b) having a single interface that exposes multiple times. For example,
> you could have a copy of CNTVCT_EL0 written alongside stolen time,
> and then the scheduler could use that copy for its decision.

That would still be racy - the structure can be updated at any time (as
the host could interrupt the VCPU at any time), so you would still be
left with the problem of reading both atomically - which would mean
going back to the loop. This is the approach that LPT takes and is
documented in the spec.

Also I can't see why you would want to know the CNTVCT_EL0 value at the
point the stolen time was updated; it's much more useful to know the
current CNTVCT_EL0 value.

Ultimately reading the stolen time is always going to be slightly racy
because you are including some of the scheduler's time in the
calculation of how much time the process was running for. The pauses you
describe above are instances where time has been stolen from the
scheduler, but that time is being accounted for/against a user space
process. While the algorithm could be changed so that it's always a
positive for the user space process I'm not sure that's a benefit (it's
probably better that statistically it can go either way).

Steve


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest
  2019-08-04  9:53   ` Marc Zyngier
@ 2019-08-08 15:29     ` Steven Price
  2019-08-08 15:49       ` Marc Zyngier
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Price @ 2019-08-08 15:29 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Paolo Bonzini, Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

On 04/08/2019 10:53, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:17 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> Enable paravirtualization features when running under a hypervisor
>> supporting the PV_TIME_ST hypercall.
>>
>> For each (v)CPU, we ask the hypervisor for the location of a shared
>> page which the hypervisor will use to report stolen time to us. We set
>> pv_time_ops to the stolen time function which simply reads the stolen
>> value from the shared page for a VCPU. We guarantee single-copy
>> atomicity using READ_ONCE which means we can also read the stolen
>> time for another VCPU than the currently running one while it is
>> potentially being updated by the hypervisor.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/kernel/Makefile |   1 +
>>  arch/arm64/kernel/kvm.c    | 155 +++++++++++++++++++++++++++++++++++++
> 
> nit: Why not using paravirt.c, which clearly states what it does? The
> alternative would be to name it kvm-pv.c.

I can move it to paravirt.c - seems reasonable.

>>  include/linux/cpuhotplug.h |   1 +
>>  3 files changed, 157 insertions(+)
>>  create mode 100644 arch/arm64/kernel/kvm.c
>>
>> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
>> index 478491f07b4f..eb36edf9b930 100644
>> --- a/arch/arm64/kernel/Makefile
>> +++ b/arch/arm64/kernel/Makefile
>> @@ -63,6 +63,7 @@ obj-$(CONFIG_CRASH_CORE)		+= crash_core.o
>>  obj-$(CONFIG_ARM_SDE_INTERFACE)		+= sdei.o
>>  obj-$(CONFIG_ARM64_SSBD)		+= ssbd.o
>>  obj-$(CONFIG_ARM64_PTR_AUTH)		+= pointer_auth.o
>> +obj-$(CONFIG_PARAVIRT)			+= kvm.o
>>  
>>  obj-y					+= vdso/ probes/
>>  obj-$(CONFIG_COMPAT_VDSO)		+= vdso32/
>> diff --git a/arch/arm64/kernel/kvm.c b/arch/arm64/kernel/kvm.c
>> new file mode 100644
>> index 000000000000..245398c79dae
>> --- /dev/null
>> +++ b/arch/arm64/kernel/kvm.c
>> @@ -0,0 +1,155 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (C) 2019 Arm Ltd.
>> +
>> +#define pr_fmt(fmt) "kvmarm-pv: " fmt
>> +
>> +#include <linux/arm-smccc.h>
>> +#include <linux/cpuhotplug.h>
>> +#include <linux/io.h>
>> +#include <linux/printk.h>
>> +#include <linux/psci.h>
>> +#include <linux/reboot.h>
>> +#include <linux/slab.h>
>> +
>> +#include <asm/paravirt.h>
>> +#include <asm/pvclock-abi.h>
>> +#include <asm/smp_plat.h>
>> +
>> +struct kvmarm_stolen_time_region {
>> +	struct pvclock_vcpu_stolen_time_info *kaddr;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct kvmarm_stolen_time_region, stolen_time_region);
>> +
>> +static bool steal_acc = true;
>> +static int __init parse_no_stealacc(char *arg)
>> +{
>> +	steal_acc = false;
>> +	return 0;
>> +}
>> +early_param("no-steal-acc", parse_no_stealacc);
>> +
>> +/* return stolen time in ns by asking the hypervisor */
>> +static u64 kvm_steal_clock(int cpu)
>> +{
>> +	struct kvmarm_stolen_time_region *reg;
>> +
>> +	reg = per_cpu_ptr(&stolen_time_region, cpu);
>> +	if (!reg->kaddr) {
>> +		pr_warn_once("stolen time enabled but not configured for cpu %d\n",
>> +			     cpu);
>> +		return 0;
>> +	}
>> +
>> +	return le64_to_cpu(READ_ONCE(reg->kaddr->stolen_time));
>> +}
>> +
>> +static int disable_stolen_time_current_cpu(void)
>> +{
>> +	struct kvmarm_stolen_time_region *reg;
>> +
>> +	reg = this_cpu_ptr(&stolen_time_region);
>> +	if (!reg->kaddr)
>> +		return 0;
>> +
>> +	memunmap(reg->kaddr);
>> +	memset(reg, 0, sizeof(*reg));
>> +
>> +	return 0;
>> +}
>> +
>> +static int stolen_time_dying_cpu(unsigned int cpu)
>> +{
>> +	return disable_stolen_time_current_cpu();
>> +}
>> +
>> +static int init_stolen_time_cpu(unsigned int cpu)
>> +{
>> +	struct kvmarm_stolen_time_region *reg;
>> +	struct arm_smccc_res res;
>> +
>> +	reg = this_cpu_ptr(&stolen_time_region);
>> +
>> +	if (reg->kaddr)
>> +		return 0;
> 
> Can this actually happen? It'd take two CPU_UP calls from the HP
> notifiers to get in that situation...

Yes, something would have to be very broken for that to happen - I'll
remove this check.

>> +
>> +	arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_TIME_ST, &res);
>> +
>> +	if ((long)res.a0 < 0)
>> +		return -EINVAL;
>> +
>> +	reg->kaddr = memremap(res.a0,
>> +			sizeof(struct pvclock_vcpu_stolen_time_info),
>> +			MEMREMAP_WB);
>> +
>> +	if (reg->kaddr == NULL) {
>> +		pr_warn("Failed to map stolen time data structure\n");
>> +		return -EINVAL;
> 
> -ENOMEM is the expected return code.

Ok

>> +	}
>> +
>> +	if (le32_to_cpu(reg->kaddr->revision) != 0 ||
>> +			le32_to_cpu(reg->kaddr->attributes) != 0) {
>> +		pr_warn("Unexpected revision or attributes in stolen time data\n");
>> +		return -ENXIO;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int kvm_arm_init_stolen_time(void)
>> +{
>> +	int ret;
>> +
>> +	ret = cpuhp_setup_state(CPUHP_AP_ARM_KVMPV_STARTING,
>> +				"hypervisor/kvmarm/pv:starting",
>> +				init_stolen_time_cpu, stolen_time_dying_cpu);
>> +	if (ret < 0)
>> +		return ret;
>> +	return 0;
>> +}
>> +
>> +static bool has_kvm_steal_clock(void)
>> +{
>> +	struct arm_smccc_res res;
>> +
>> +	/* To detect the presence of PV time support we require SMCCC 1.1+ */
>> +	if (psci_ops.smccc_version < SMCCC_VERSION_1_1)
>> +		return false;
>> +
>> +	arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
>> +			     ARM_SMCCC_HV_PV_FEATURES, &res);
>> +
>> +	if (res.a0 != SMCCC_RET_SUCCESS)
>> +		return false;
>> +
>> +	arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_FEATURES,
>> +			     ARM_SMCCC_HV_PV_TIME_ST, &res);
>> +
>> +	if (res.a0 != SMCCC_RET_SUCCESS)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static int __init kvm_guest_init(void)
>> +{
>> +	int ret = 0;
>> +
>> +	if (!has_kvm_steal_clock())
>> +		return 0;
>> +
>> +	ret = kvm_arm_init_stolen_time();
>> +	if (ret)
>> +		return ret;
>> +
>> +	pv_ops.time.steal_clock = kvm_steal_clock;
>> +
>> +	static_key_slow_inc(&paravirt_steal_enabled);
>> +	if (steal_acc)
>> +		static_key_slow_inc(&paravirt_steal_rq_enabled);
>> +
>> +	pr_info("using stolen time PV\n");
>> +
>> +	return 0;
>> +}
>> +early_initcall(kvm_guest_init);
> 
> Is there any reason why we wouldn't directly call into this rather than
> using an initcall?

I'm not sure where the direct call would go - any pointers?

Thanks,

Steve

>> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
>> index 068793a619ca..89d75edb5750 100644
>> --- a/include/linux/cpuhotplug.h
>> +++ b/include/linux/cpuhotplug.h
>> @@ -136,6 +136,7 @@ enum cpuhp_state {
>>  	/* Must be the last timer callback */
>>  	CPUHP_AP_DUMMY_TIMER_STARTING,
>>  	CPUHP_AP_ARM_XEN_STARTING,
>> +	CPUHP_AP_ARM_KVMPV_STARTING,
>>  	CPUHP_AP_ARM_CORESIGHT_STARTING,
>>  	CPUHP_AP_ARM64_ISNDEP_STARTING,
>>  	CPUHP_AP_SMPCFD_DYING,
> 
> 
> Thanks,
> 
> 	M.
> 



* Re: [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest
  2019-08-08 15:29     ` Steven Price
@ 2019-08-08 15:49       ` Marc Zyngier
  0 siblings, 0 replies; 43+ messages in thread
From: Marc Zyngier @ 2019-08-08 15:49 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Paolo Bonzini, Radim Krčmář,
	Russell King, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel, linux-doc,
	linux-kernel

On 08/08/2019 16:29, Steven Price wrote:
> On 04/08/2019 10:53, Marc Zyngier wrote:
>> On Fri,  2 Aug 2019 15:50:17 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>>> Enable paravirtualization features when running under a hypervisor
>>> supporting the PV_TIME_ST hypercall.
>>>
>>> For each (v)CPU, we ask the hypervisor for the location of a shared
>>> page which the hypervisor will use to report stolen time to us. We set
>>> pv_time_ops to the stolen time function which simply reads the stolen
>>> value from the shared page for a VCPU. We guarantee single-copy
>>> atomicity using READ_ONCE which means we can also read the stolen
>>> time for another VCPU than the currently running one while it is
>>> potentially being updated by the hypervisor.
>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>> ---
>>>  arch/arm64/kernel/Makefile |   1 +
>>>  arch/arm64/kernel/kvm.c    | 155 +++++++++++++++++++++++++++++++++++++

[...]

>>> +static int __init kvm_guest_init(void)
>>> +{
>>> +	int ret = 0;
>>> +
>>> +	if (!has_kvm_steal_clock())
>>> +		return 0;
>>> +
>>> +	ret = kvm_arm_init_stolen_time();
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	pv_ops.time.steal_clock = kvm_steal_clock;
>>> +
>>> +	static_key_slow_inc(&paravirt_steal_enabled);
>>> +	if (steal_acc)
>>> +		static_key_slow_inc(&paravirt_steal_rq_enabled);
>>> +
>>> +	pr_info("using stolen time PV\n");
>>> +
>>> +	return 0;
>>> +}
>>> +early_initcall(kvm_guest_init);
>>
>> Is there any reason why we wouldn't directly call into this rather than
>> using an initcall?
> 
> I'm not sure where the direct call would go - any pointers?

I'd be temped to say arch/arm64/kernel/time.c:time_init(), provided that
there is no issue with the CPU hotplug lock (I remember hitting that a
while ago).

	M.
-- 
Jazz is not dead, it just smells funny...


* Re: [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest
  2019-08-02 14:50 ` [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest Steven Price
  2019-08-04  9:53   ` Marc Zyngier
@ 2019-08-09 13:51   ` Zenghui Yu
  2019-08-12 10:39     ` Steven Price
  1 sibling, 1 reply; 43+ messages in thread
From: Zenghui Yu @ 2019-08-09 13:51 UTC (permalink / raw)
  To: Steven Price
  Cc: linux-kernel, kvm, Catalin Marinas, linux-doc, Russell King,
	linux-arm-kernel, Marc Zyngier, Paolo Bonzini, Will Deacon,
	kvmarm

On 2019/8/2 22:50, Steven Price wrote:
> Enable paravirtualization features when running under a hypervisor
> supporting the PV_TIME_ST hypercall.
> 
> For each (v)CPU, we ask the hypervisor for the location of a shared
> page which the hypervisor will use to report stolen time to us. We set
> pv_time_ops to the stolen time function which simply reads the stolen
> value from the shared page for a VCPU. We guarantee single-copy
> atomicity using READ_ONCE which means we can also read the stolen
> time for another VCPU than the currently running one while it is
> potentially being updated by the hypervisor.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>   arch/arm64/kernel/Makefile |   1 +
>   arch/arm64/kernel/kvm.c    | 155 +++++++++++++++++++++++++++++++++++++
>   include/linux/cpuhotplug.h |   1 +
>   3 files changed, 157 insertions(+)
>   create mode 100644 arch/arm64/kernel/kvm.c
> 
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index 478491f07b4f..eb36edf9b930 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -63,6 +63,7 @@ obj-$(CONFIG_CRASH_CORE)		+= crash_core.o
>   obj-$(CONFIG_ARM_SDE_INTERFACE)		+= sdei.o
>   obj-$(CONFIG_ARM64_SSBD)		+= ssbd.o
>   obj-$(CONFIG_ARM64_PTR_AUTH)		+= pointer_auth.o
> +obj-$(CONFIG_PARAVIRT)			+= kvm.o
>   
>   obj-y					+= vdso/ probes/
>   obj-$(CONFIG_COMPAT_VDSO)		+= vdso32/
> diff --git a/arch/arm64/kernel/kvm.c b/arch/arm64/kernel/kvm.c
> new file mode 100644
> index 000000000000..245398c79dae
> --- /dev/null
> +++ b/arch/arm64/kernel/kvm.c
> @@ -0,0 +1,155 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (C) 2019 Arm Ltd.
> +
> +#define pr_fmt(fmt) "kvmarm-pv: " fmt
> +
> +#include <linux/arm-smccc.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/io.h>
> +#include <linux/printk.h>
> +#include <linux/psci.h>
> +#include <linux/reboot.h>
> +#include <linux/slab.h>
> +
> +#include <asm/paravirt.h>
> +#include <asm/pvclock-abi.h>
> +#include <asm/smp_plat.h>
> +
> +struct kvmarm_stolen_time_region {
> +	struct pvclock_vcpu_stolen_time_info *kaddr;
> +};
> +
> +static DEFINE_PER_CPU(struct kvmarm_stolen_time_region, stolen_time_region);
> +
> +static bool steal_acc = true;
> +static int __init parse_no_stealacc(char *arg)
> +{
> +	steal_acc = false;
> +	return 0;
> +}
> +early_param("no-steal-acc", parse_no_stealacc);
> +
> +/* return stolen time in ns by asking the hypervisor */
> +static u64 kvm_steal_clock(int cpu)
> +{
> +	struct kvmarm_stolen_time_region *reg;
> +
> +	reg = per_cpu_ptr(&stolen_time_region, cpu);
> +	if (!reg->kaddr) {
> +		pr_warn_once("stolen time enabled but not configured for cpu %d\n",
> +			     cpu);
> +		return 0;
> +	}
> +
> +	return le64_to_cpu(READ_ONCE(reg->kaddr->stolen_time));
> +}
> +
> +static int disable_stolen_time_current_cpu(void)
> +{
> +	struct kvmarm_stolen_time_region *reg;
> +
> +	reg = this_cpu_ptr(&stolen_time_region);
> +	if (!reg->kaddr)
> +		return 0;
> +
> +	memunmap(reg->kaddr);
> +	memset(reg, 0, sizeof(*reg));
> +
> +	return 0;
> +}
> +
> +static int stolen_time_dying_cpu(unsigned int cpu)
> +{
> +	return disable_stolen_time_current_cpu();
> +}
> +
> +static int init_stolen_time_cpu(unsigned int cpu)
> +{
> +	struct kvmarm_stolen_time_region *reg;
> +	struct arm_smccc_res res;
> +
> +	reg = this_cpu_ptr(&stolen_time_region);
> +
> +	if (reg->kaddr)
> +		return 0;
> +
> +	arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_TIME_ST, &res);
> +
> +	if ((long)res.a0 < 0)
> +		return -EINVAL;

Hi Steven,

Since userspace is not involved yet (right?), no one will create the
PV_TIME device for the guest (and no one will specify the IPA of the
shared stolen time region), so I guess we will get a "not supported"
error here.

So what should we do if we want to test this series now?  Any userspace
tools?  If no, do you have any plans for userspace developing? ;-)


Thanks,
zenghui

> +
> +	reg->kaddr = memremap(res.a0,
> +			sizeof(struct pvclock_vcpu_stolen_time_info),
> +			MEMREMAP_WB);
> +
> +	if (reg->kaddr == NULL) {
> +		pr_warn("Failed to map stolen time data structure\n");
> +		return -EINVAL;
> +	}
> +
> +	if (le32_to_cpu(reg->kaddr->revision) != 0 ||
> +			le32_to_cpu(reg->kaddr->attributes) != 0) {
> +		pr_warn("Unexpected revision or attributes in stolen time data\n");
> +		return -ENXIO;
> +	}
> +
> +	return 0;
> +}
> +
> +static int kvm_arm_init_stolen_time(void)
> +{
> +	int ret;
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_ARM_KVMPV_STARTING,
> +				"hypervisor/kvmarm/pv:starting",
> +				init_stolen_time_cpu, stolen_time_dying_cpu);
> +	if (ret < 0)
> +		return ret;
> +	return 0;
> +}
> +
> +static bool has_kvm_steal_clock(void)
> +{
> +	struct arm_smccc_res res;
> +
> +	/* To detect the presence of PV time support we require SMCCC 1.1+ */
> +	if (psci_ops.smccc_version < SMCCC_VERSION_1_1)
> +		return false;
> +
> +	arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
> +			     ARM_SMCCC_HV_PV_FEATURES, &res);
> +
> +	if (res.a0 != SMCCC_RET_SUCCESS)
> +		return false;
> +
> +	arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_FEATURES,
> +			     ARM_SMCCC_HV_PV_TIME_ST, &res);
> +
> +	if (res.a0 != SMCCC_RET_SUCCESS)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int __init kvm_guest_init(void)
> +{
> +	int ret = 0;
> +
> +	if (!has_kvm_steal_clock())
> +		return 0;
> +
> +	ret = kvm_arm_init_stolen_time();
> +	if (ret)
> +		return ret;
> +
> +	pv_ops.time.steal_clock = kvm_steal_clock;
> +
> +	static_key_slow_inc(&paravirt_steal_enabled);
> +	if (steal_acc)
> +		static_key_slow_inc(&paravirt_steal_rq_enabled);
> +
> +	pr_info("using stolen time PV\n");
> +
> +	return 0;
> +}
> +early_initcall(kvm_guest_init);
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 068793a619ca..89d75edb5750 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -136,6 +136,7 @@ enum cpuhp_state {
>   	/* Must be the last timer callback */
>   	CPUHP_AP_DUMMY_TIMER_STARTING,
>   	CPUHP_AP_ARM_XEN_STARTING,
> +	CPUHP_AP_ARM_KVMPV_STARTING,
>   	CPUHP_AP_ARM_CORESIGHT_STARTING,
>   	CPUHP_AP_ARM64_ISNDEP_STARTING,
>   	CPUHP_AP_SMPCFD_DYING,
> 



* Re: [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest
  2019-08-09 13:51   ` Zenghui Yu
@ 2019-08-12 10:39     ` Steven Price
  2019-08-13  6:06       ` Zenghui Yu
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Price @ 2019-08-12 10:39 UTC (permalink / raw)
  To: Zenghui Yu
  Cc: linux-kernel, kvm, Catalin Marinas, linux-doc, Russell King,
	linux-arm-kernel, Marc Zyngier, Paolo Bonzini, Will Deacon,
	kvmarm

On 09/08/2019 14:51, Zenghui Yu wrote:
[...]
> Hi Steven,
> 
> Since userspace is not involved yet (right?), no one will create the
> PV_TIME device for guest (and no one will specify the IPA of the shared
> stolen time region), and I guess we will get a "not supported" error
> here.
> 
> So what should we do if we want to test this series now?  Any userspace
> tools?  If no, do you have any plans for userspace developing? ;-)

At the moment I have the following patch to kvmtool which creates the
PV_TIME device - this isn't in a state to go upstream, and Marc has
asked that I rework the memory allocation, so this will need to change.

It's a little ugly as it simply reserves the first page of RAM to use
for the PV time structures.

----8<----
diff --git a/Makefile b/Makefile
index 3862112..a79956b 100644
--- a/Makefile
+++ b/Makefile
@@ -158,7 +158,7 @@ endif
 # ARM
 OBJS_ARM_COMMON		:= arm/fdt.o arm/gic.o arm/gicv2m.o arm/ioport.o \
 			   arm/kvm.o arm/kvm-cpu.o arm/pci.o arm/timer.o \
-			   arm/pmu.o
+			   arm/pmu.o arm/pvtime.o
 HDRS_ARM_COMMON		:= arm/include
 ifeq ($(ARCH), arm)
 	DEFINES		+= -DCONFIG_ARM
diff --git a/arm/fdt.c b/arm/fdt.c
index c80e6da..19eccbc 100644
--- a/arm/fdt.c
+++ b/arm/fdt.c
@@ -119,6 +119,7 @@ static int setup_fdt(struct kvm *kvm)
 
 	/* Create new tree without a reserve map */
 	_FDT(fdt_create(fdt, FDT_MAX_SIZE));
+	_FDT(fdt_add_reservemap_entry(fdt, kvm->arch.memory_guest_start, 4096));
 	_FDT(fdt_finish_reservemap(fdt));
 
 	/* Header */
diff --git a/arm/kvm.c b/arm/kvm.c
index 1f85fc6..8bbfef1 100644
--- a/arm/kvm.c
+++ b/arm/kvm.c
@@ -11,6 +11,8 @@
 #include <linux/kvm.h>
 #include <linux/sizes.h>
 
+int pvtime_create(struct kvm *kvm);
+
 struct kvm_ext kvm_req_ext[] = {
 	{ DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
 	{ DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
@@ -86,6 +88,10 @@ void kvm__arch_init(struct kvm *kvm, const char *hugetlbfs_path, u64 ram_size)
 	/* Create the virtual GIC. */
 	if (gic__create(kvm, kvm->cfg.arch.irqchip))
 		die("Failed to create virtual GIC");
+
+	/* Setup PV time */
+	if (pvtime_create(kvm))
+		die("Failed to initialise PV time");
 }
 
 #define FDT_ALIGN	SZ_2M
diff --git a/arm/pvtime.c b/arm/pvtime.c
new file mode 100644
index 0000000..abcaab3
--- /dev/null
+++ b/arm/pvtime.c
@@ -0,0 +1,77 @@
+#include "kvm/kvm.h"
+
+#define KVM_DEV_TYPE_ARM_PV_TIME (KVM_DEV_TYPE_ARM_VGIC_ITS+2)
+
+/* Device Control API: PV_TIME */
+#define KVM_DEV_ARM_PV_TIME_PADDR	0
+#define KVM_DEV_ARM_PV_TIME_FREQUENCY	3
+
+#define KVM_DEV_ARM_PV_TIME_ST		0
+#define KVM_DEV_ARM_PV_TIME_LPT		1
+
+static int pvtime_fd;
+
+int pvtime_create(struct kvm *kvm);
+
+int pvtime_create(struct kvm *kvm)
+{
+	int err;
+	u64 lpt_paddr = 0x10000000;
+	u64 st_paddr = lpt_paddr + 4096;
+	u32 frequency = 100 * 1000 * 1000;
+
+	printf("lpt_paddr=%llx\n", lpt_paddr);
+
+	struct kvm_create_device pvtime_device = {
+		.type = KVM_DEV_TYPE_ARM_PV_TIME,
+		.flags = 0,
+	};
+
+	err = ioctl(kvm->vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
+	if (err) {
+		printf("Failed to create PV device\n");
+		return 0;
+	}
+
+	pvtime_fd = pvtime_device.fd;
+
+	struct kvm_device_attr lpt_base = {
+		.group = KVM_DEV_ARM_PV_TIME_PADDR,
+		.attr = KVM_DEV_ARM_PV_TIME_LPT,
+		.addr = (u64)(unsigned long)&lpt_paddr
+	};
+	struct kvm_device_attr st_base = {
+		.group = KVM_DEV_ARM_PV_TIME_PADDR,
+		.attr = KVM_DEV_ARM_PV_TIME_ST,
+		.addr = (u64)(unsigned long)&st_paddr
+	};
+
+	struct kvm_device_attr lpt_freq = {
+		.group = KVM_DEV_ARM_PV_TIME_FREQUENCY,
+		.attr = KVM_DEV_ARM_PV_TIME_LPT,
+		.addr = (u64)(unsigned long)&frequency
+	};
+
+	err = ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_base);
+	if (err) {
+		perror("ioctl lpt_base failed");
+		printf("Ignoring LPT...\n");
+	}
+	err = ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
+	if (err) {
+		perror("ioctl st_base failed");
+		goto out_err;
+	}
+	err = ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_freq);
+	if (err) {
+		perror("ioctl lpt_freq failed");
+		printf("Ignoring LPT...\n");
+	}
+
+	printf("PV time setup\n");
+
+	return 0;
+out_err:
+	close(pvtime_fd);
+	return err;
+}


* Re: [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest
  2019-08-12 10:39     ` Steven Price
@ 2019-08-13  6:06       ` Zenghui Yu
  0 siblings, 0 replies; 43+ messages in thread
From: Zenghui Yu @ 2019-08-13  6:06 UTC (permalink / raw)
  To: Steven Price
  Cc: linux-kernel, kvm, Catalin Marinas, linux-doc, Russell King,
	linux-arm-kernel, Marc Zyngier, Paolo Bonzini, Will Deacon,
	kvmarm

On 2019/8/12 18:39, Steven Price wrote:
> On 09/08/2019 14:51, Zenghui Yu wrote:
> [...]
>> Hi Steven,
>>
>> Since userspace is not involved yet (right?), no one will create the
>> PV_TIME device for guest (and no one will specify the IPA of the shared
>> stolen time region), and I guess we will get a "not supported" error
>> here.
>>
>> So what should we do if we want to test this series now?  Any userspace
>> tools?  If no, do you have any plans for userspace developing? ;-)
> 
> At the moment I have the following patch to kvmtool which creates the
> PV_TIME device - this isn't in a state to go upstream, and Marc has
> asked that I rework the memory allocation, so this will need to change.
> 
> It's a little ugly as it simply reserves the first page of RAM to use
> for the PV time structures.

Thanks for sharing the code. It's good enough to show what is required
in user-space.

(I'm not familiar with kvmtool. I will first take some time to move the
steal time part to QEMU and see what will happen.)


Thanks,
zenghui

> ----8<----
> diff --git a/Makefile b/Makefile
> index 3862112..a79956b 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -158,7 +158,7 @@ endif
>   # ARM
>   OBJS_ARM_COMMON		:= arm/fdt.o arm/gic.o arm/gicv2m.o arm/ioport.o \
>   			   arm/kvm.o arm/kvm-cpu.o arm/pci.o arm/timer.o \
> -			   arm/pmu.o
> +			   arm/pmu.o arm/pvtime.o
>   HDRS_ARM_COMMON		:= arm/include
>   ifeq ($(ARCH), arm)
>   	DEFINES		+= -DCONFIG_ARM
> diff --git a/arm/fdt.c b/arm/fdt.c
> index c80e6da..19eccbc 100644
> --- a/arm/fdt.c
> +++ b/arm/fdt.c
> @@ -119,6 +119,7 @@ static int setup_fdt(struct kvm *kvm)
>   
>   	/* Create new tree without a reserve map */
>   	_FDT(fdt_create(fdt, FDT_MAX_SIZE));
> +	_FDT(fdt_add_reservemap_entry(fdt, kvm->arch.memory_guest_start, 4096));
>   	_FDT(fdt_finish_reservemap(fdt));
>   
>   	/* Header */
> diff --git a/arm/kvm.c b/arm/kvm.c
> index 1f85fc6..8bbfef1 100644
> --- a/arm/kvm.c
> +++ b/arm/kvm.c
> @@ -11,6 +11,8 @@
>   #include <linux/kvm.h>
>   #include <linux/sizes.h>
>   
> +int pvtime_create(struct kvm *kvm);
> +
>   struct kvm_ext kvm_req_ext[] = {
>   	{ DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
>   	{ DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
> @@ -86,6 +88,10 @@ void kvm__arch_init(struct kvm *kvm, const char *hugetlbfs_path, u64 ram_size)
>   	/* Create the virtual GIC. */
>   	if (gic__create(kvm, kvm->cfg.arch.irqchip))
>   		die("Failed to create virtual GIC");
> +
> +	/* Setup PV time */
> +	if (pvtime_create(kvm))
> +		die("Failed to initialise PV time");
>   }
>   
>   #define FDT_ALIGN	SZ_2M
> diff --git a/arm/pvtime.c b/arm/pvtime.c
> new file mode 100644
> index 0000000..abcaab3
> --- /dev/null
> +++ b/arm/pvtime.c
> @@ -0,0 +1,77 @@
> +#include "kvm/kvm.h"
> +
> +#define KVM_DEV_TYPE_ARM_PV_TIME (KVM_DEV_TYPE_ARM_VGIC_ITS+2)
> +
> +/* Device Control API: PV_TIME */
> +#define KVM_DEV_ARM_PV_TIME_PADDR	0
> +#define KVM_DEV_ARM_PV_TIME_FREQUENCY	3
> +
> +#define KVM_DEV_ARM_PV_TIME_ST		0
> +#define KVM_DEV_ARM_PV_TIME_LPT		1
> +
> +static int pvtime_fd;
> +
> +int pvtime_create(struct kvm *kvm);
> +
> +int pvtime_create(struct kvm *kvm)
> +{
> +	int err;
> +	u64 lpt_paddr = 0x10000000;
> +	u64 st_paddr = lpt_paddr + 4096;
> +	u32 frequency = 100 * 1000 * 1000;
> +
> +	printf("lpt_paddr=%llx\n", lpt_paddr);
> +
> +	struct kvm_create_device pvtime_device = {
> +		.type = KVM_DEV_TYPE_ARM_PV_TIME,
> +		.flags = 0,
> +	};
> +
> +	err = ioctl(kvm->vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
> +	if (err) {
> +		printf("Failed to create PV device\n");
> +		return 0;
> +	}
> +
> +	pvtime_fd = pvtime_device.fd;
> +
> +	struct kvm_device_attr lpt_base = {
> +		.group = KVM_DEV_ARM_PV_TIME_PADDR,
> +		.attr = KVM_DEV_ARM_PV_TIME_LPT,
> +		.addr = (u64)(unsigned long)&lpt_paddr
> +	};
> +	struct kvm_device_attr st_base = {
> +		.group = KVM_DEV_ARM_PV_TIME_PADDR,
> +		.attr = KVM_DEV_ARM_PV_TIME_ST,
> +		.addr = (u64)(unsigned long)&st_paddr
> +	};
> +
> +	struct kvm_device_attr lpt_freq = {
> +		.group = KVM_DEV_ARM_PV_TIME_FREQUENCY,
> +		.attr = KVM_DEV_ARM_PV_TIME_LPT,
> +		.addr = (u64)(unsigned long)&frequency
> +	};
> +
> +	err = ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_base);
> +	if (err) {
> +		perror("ioctl lpt_base failed");
> +		printf("Ignoring LPT...\n");
> +	}
> +	err = ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
> +	if (err) {
> +		perror("ioctl st_base failed");
> +		goto out_err;
> +	}
> +	err = ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &lpt_freq);
> +	if (err) {
> +		perror("ioctl lpt_freq failed");
> +		printf("Ignoring LPT...\n");
> +	}
> +
> +	printf("PV time setup\n");
> +
> +	return 0;
> +out_err:
> +	close(pvtime_fd);
> +	return err;
> +}



* Re: [PATCH 0/9] arm64: Stolen time support
  2019-08-05 13:06   ` Steven Price
  2019-08-05 13:26     ` Marc Zyngier
@ 2019-08-14 13:02     ` Alexander Graf
       [not found]       ` <8636i3omnd.wl-maz@kernel.org>
  1 sibling, 1 reply; 43+ messages in thread
From: Alexander Graf @ 2019-08-14 13:02 UTC (permalink / raw)
  To: Steven Price, Marc Zyngier
  Cc: kvm, Catalin Marinas, linux-doc, Russell King, linux-kernel,
	Paolo Bonzini, Will Deacon, kvmarm, linux-arm-kernel



On 05.08.19 15:06, Steven Price wrote:
> On 03/08/2019 19:05, Marc Zyngier wrote:
>> On Fri,  2 Aug 2019 15:50:08 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>> Hi Steven,
>>
>>> This series adds support for paravirtualized time for arm64 guests and
>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>
>>> https://developer.arm.com/docs/den0057/a
>>>
>>> It implements support for stolen time, allowing the guest to
>>> identify time when it is forcibly not executing.
>>>
>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>> some concerns about the overheads and approach in the above
>>> specification, and I expect an updated version of the specification to
>>> be released soon with just the stolen time parts.
>>
>> Thanks for posting this.
>>
>> My current concern with this series is around the fact that we allocate
>> memory from the kernel on behalf of the guest. It is the first example
>> of such thing in the ARM port, and I can't really say I'm fond of it.
>>
>> x86 seems to get away with it by having the memory allocated from
>> userspace, why I tend to like more. Yes, put_user is more
>> expensive than a straight store, but this isn't done too often either.
>>
>> What is the rational for your current approach?
> 
> As I see it there are 3 approaches that can be taken here:
> 
> 1. Hypervisor allocates memory and adds it to the virtual machine. This
> means that everything to do with the 'device' is encapsulated behind the
> KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
> stolen time structure to be fast it cannot be a trapping region and has
> to be backed by real memory - in this case allocated by the host kernel.
> 
> 2. Host user space allocates memory. Similar to above, but this time
> user space needs to manage the memory region as well as the usual
> KVM_CREATE_DEVICE dance. I've no objection to this, but it means
> kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
> to size the memory region).

You ideally want to get the host overhead for a VM as low as you can.
I'm not terribly fond of the idea of reserving a full page just because
we're too afraid of having the guest donate memory.

> 
> 3. Guest kernel "donates" the memory to the hypervisor for the
> structure. As far as I'm aware this is what x86 does. The problems I see
> this approach are:
> 
>   a) kexec becomes much more tricky - there needs to be a disabling
> mechanism for the guest to stop the hypervisor scribbling on memory
> before starting the new kernel.

I wouldn't call "quiesce a device" much more tricky. We have to do that 
for other devices as well today.

>   b) If there is more than one entity that is interested in the
> information (e.g. firmware and kernel) then this requires some form of
> arbitration in the guest because the hypervisor doesn't want to have to
> track an arbitrary number of regions to update.

Why would FW care?

>   c) Performance can suffer if the host kernel doesn't have a suitably
> aligned/sized area to use. As you say - put_user() is more expensive.

Just define the interface to always require natural alignment when 
donating a memory location?

> The structure is updated on every return to the VM.

If you really do suffer from put_user(), there are alternatives. You 
could just map the page on the registration hcall and then leave it 
pinned until the vcpu gets destroyed again.

> Of course x86 does prove the third approach can work, but I'm not sure
> which is actually better. Avoiding the kexec cancellation requirements was
> the main driver of the current approach. Although many of the

I really don't understand the problem with kexec cancellation. Worst 
case, let guest FW set it up for you and propagate only the address down 
via ACPI/DT. That way you can mark the respective memory as reserved too.

But even with a Linux only mechanism, just take a look at 
arch/x86/kernel/kvmclock.c. All they do to remove the map is to hook 
into machine_crash_shutdown() and machine_shutdown().


Alex

> conversations about this were also tied up with Live Physical Time which
> adds its own complications.
> 
> Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [UNVERIFIED SENDER] Re: [PATCH 0/9] arm64: Stolen time support
       [not found]       ` <8636i3omnd.wl-maz@kernel.org>
@ 2019-08-14 14:52         ` " Alexander Graf
  2019-08-16 10:23           ` Steven Price
  0 siblings, 1 reply; 43+ messages in thread
From: Alexander Graf @ 2019-08-14 14:52 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Steven Price, kvm, Catalin Marinas, linux-doc, Russell King,
	linux-kernel, Paolo Bonzini, Will Deacon, kvmarm,
	linux-arm-kernel



On 14.08.19 16:19, Marc Zyngier wrote:
> On Wed, 14 Aug 2019 14:02:25 +0100,
> Alexander Graf <graf@amazon.com> wrote:
>>
>>
>>
>> On 05.08.19 15:06, Steven Price wrote:
>>> On 03/08/2019 19:05, Marc Zyngier wrote:
>>>> On Fri,  2 Aug 2019 15:50:08 +0100
>>>> Steven Price <steven.price@arm.com> wrote:
>>>>
>>>> Hi Steven,
>>>>
>>>>> This series adds support for paravirtualized time for arm64 guests and
>>>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>>>
>>>>> https://developer.arm.com/docs/den0057/a
>>>>>
>>>>> It implements support for stolen time, allowing the guest to
>>>>> identify time when it is forcibly not executing.
>>>>>
>>>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>>>> some concerns about the overheads and approach in the above
>>>>> specification, and I expect an updated version of the specification to
>>>>> be released soon with just the stolen time parts.
>>>>
>>>> Thanks for posting this.
>>>>
>>>> My current concern with this series is around the fact that we allocate
>>>> memory from the kernel on behalf of the guest. It is the first example
>>>> of such a thing in the ARM port, and I can't really say I'm fond of it.
>>>>
>>>> x86 seems to get away with it by having the memory allocated from
>>>> userspace, which I tend to like more. Yes, put_user is more
>>>> expensive than a straight store, but this isn't done too often either.
>>>>
>>>> What is the rationale for your current approach?
>>>
>>> As I see it there are 3 approaches that can be taken here:
>>>
>>> 1. Hypervisor allocates memory and adds it to the virtual machine. This
>>> means that everything to do with the 'device' is encapsulated behind the
>>> KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
>>> stolen time structure to be fast it cannot be a trapping region and has
>>> to be backed by real memory - in this case allocated by the host kernel.
>>>
>>> 2. Host user space allocates memory. Similar to above, but this time
>>> user space needs to manage the memory region as well as the usual
>>> KVM_CREATE_DEVICE dance. I've no objection to this, but it means
>>> kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
>>> to size the memory region).
>>
>> You ideally want to get the host overhead for a VM to as little as you
>> can. I'm not terribly fond of the idea of reserving a full page just
>> because we're too afraid of having the guest donate memory.
> 
> Well, reduce the amount of memory you give to the guest by one page,
> and allocate that page to the stolen time device. Problem solved!
> 
> Seriously, if you're worried about the allocation of a single page,
> you should first look at how many holes we have in the vcpu structure,
> for example (even better, with the 8.4 NV patches applied). Just
> fixing that would give you that page back *per vcpu*.

I'm worried about additional memory slots, about fragmenting the 
cacheable guest memory regions, about avoidable HV taxes.

I think we need to distinguish here between the KVM implementation and 
the hypervisor/guest interface. Just because in KVM we can save overhead 
today doesn't mean that the HV interface should be built around the 
assumption that "memory is free".

> 
>>> 3. Guest kernel "donates" the memory to the hypervisor for the
>>> structure. As far as I'm aware this is what x86 does. The problems I see
>>> this approach are:
>>>
>>>    a) kexec becomes much more tricky - there needs to be a disabling
>>> mechanism for the guest to stop the hypervisor scribbling on memory
>>> before starting the new kernel.
>>
>> I wouldn't call "quiesce a device" much more tricky. We have to do
>> that for other devices as well today.
> 
> And since there is no standard way of doing it, we keep inventing
> weird and wonderful ways of doing so -- cue the terrible GICv3 LPI
> situation, and all the various hacks to keep existing IOMMU mappings
> around across firmware/kernel handovers as well as kexec.

Well, the good news here is that we don't have to keep it around ;).

> 
>>
>>>    b) If there is more than one entity that is interested in the
>>> information (e.g. firmware and kernel) then this requires some form of
>>> arbitration in the guest because the hypervisor doesn't want to have to
>>> track an arbitrary number of regions to update.
>>
>> Why would FW care?
> 
> Exactly. It doesn't care. Not caring means it doesn't know about the
> page the guest has allocated for stolen time, and starts using it for
> its own purposes. Hello, memory corruption. Same thing goes if you
> reboot into a non stolen time aware kernel.

If you reboot, you go via the vcpu reset path which clears the map, no? 
Same goes for FW entry. If you enter firmware that does not set up the 
map, you never see it.

> 
>>
>>>    c) Performance can suffer if the host kernel doesn't have a suitably
>>> aligned/sized area to use. As you say - put_user() is more expensive.
>>
>> Just define the interface to always require natural alignment when
>> donating a memory location?
>>
>>> The structure is updated on every return to the VM.
>>
>> If you really do suffer from put_user(), there are alternatives. You
>> could just map the page on the registration hcall and then leave it
>> pinned until the vcpu gets destroyed again.
> 
> put_user() should be cheap enough. It is one of the things we tend to
> optimise anyway. And yes, worst case, we pin the page.
> 
>>
>>> Of course x86 does prove the third approach can work, but I'm not sure
>>> which is actually better. Avoiding the kexec cancellation requirement was
>>> the main driver of the current approach. Although many of the
>>
>> I really don't understand the problem with kexec cancellation. Worst
>> case, let guest FW set it up for you and propagate only the address
>> down via ACPI/DT. That way you can mark the respective memory as
>> reserved too.
> 
> We already went down that road with the LPI hack. I'm not going there
> again if we can avoid it. And it turns out that we can. Just allocate
> the stolen time page as a separate memblock and give it to KVM for that
> purpose.
> 
> Your suggestion of letting the guest firmware set something up only
> works if whatever you're booting after that understands it. If it
> doesn't, you're screwed.

Why? For UEFI, mark the region as reserved in the memory map. For DT, 
just mark it as reserved directly.

That said, I'm not advocating for doing it in the FW. I think this can 
be solved really easily with a simple guest driver to enable and a vcpu 
reset hook to disable the map.

> 
>> But even with a Linux only mechanism, just take a look at
>> arch/x86/kernel/kvmclock.c. All they do to remove the map is to hook
>> into machine_crash_shutdown() and machine_shutdown().
> 
> I'm not going to take something that is Linux specific. It has to work
> for all guests, at all times, whether they know about the hypervisor
> service or not.

If they don't know about the HV service, they don't register the writer, 
so they don't see corruption.

If they know about the HV service and they don't support kexec, they 
don't have to worry because a vcpu reset should also clear the map.

If they do support kexec, they already have a mechanism to quiesce devices.

So I don't understand how this is Linux specific? The question was Linux 
specific, so I answered with a precedent to show that disabling on kexec 
is not all that hard :).


Alex

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [UNVERIFIED SENDER] Re: [PATCH 0/9] arm64: Stolen time support
  2019-08-14 14:52         ` [UNVERIFIED SENDER] " Alexander Graf
@ 2019-08-16 10:23           ` Steven Price
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Price @ 2019-08-16 10:23 UTC (permalink / raw)
  To: Alexander Graf, Marc Zyngier
  Cc: kvm, Catalin Marinas, linux-doc, Russell King, linux-kernel,
	Paolo Bonzini, Will Deacon, kvmarm, linux-arm-kernel

On 14/08/2019 15:52, Alexander Graf wrote:
> 
> 
> On 14.08.19 16:19, Marc Zyngier wrote:
>> On Wed, 14 Aug 2019 14:02:25 +0100,
>> Alexander Graf <graf@amazon.com> wrote:
>>>
>>>
>>>
>>> On 05.08.19 15:06, Steven Price wrote:
>>>> On 03/08/2019 19:05, Marc Zyngier wrote:
>>>>> On Fri,  2 Aug 2019 15:50:08 +0100
>>>>> Steven Price <steven.price@arm.com> wrote:
>>>>>
>>>>> Hi Steven,
>>>>>
>>>>>> This series adds support for paravirtualized time for arm64 guests and
>>>>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>>>>
>>>>>> https://developer.arm.com/docs/den0057/a
>>>>>>
>>>>>> It implements support for stolen time, allowing the guest to
>>>>>> identify time when it is forcibly not executing.
>>>>>>
>>>>>> It doesn't implement support for Live Physical Time (LPT) as there
>>>>>> are
>>>>>> some concerns about the overheads and approach in the above
>>>>>> specification, and I expect an updated version of the
>>>>>> specification to
>>>>>> be released soon with just the stolen time parts.
>>>>>
>>>>> Thanks for posting this.
>>>>>
>>>>> My current concern with this series is around the fact that we
>>>>> allocate
>>>>> memory from the kernel on behalf of the guest. It is the first example
>>>>> of such a thing in the ARM port, and I can't really say I'm fond of it.
>>>>>
>>>>> x86 seems to get away with it by having the memory allocated from
>>>>> userspace, which I tend to like more. Yes, put_user is more
>>>>> expensive than a straight store, but this isn't done too often either.
>>>>>
>>>>> What is the rationale for your current approach?
>>>>
>>>> As I see it there are 3 approaches that can be taken here:
>>>>
>>>> 1. Hypervisor allocates memory and adds it to the virtual machine. This
>>>> means that everything to do with the 'device' is encapsulated behind
>>>> the
>>>> KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want
>>>> the
>>>> stolen time structure to be fast it cannot be a trapping region and has
>>>> to be backed by real memory - in this case allocated by the host
>>>> kernel.
>>>>
>>>> 2. Host user space allocates memory. Similar to above, but this time
>>>> user space needs to manage the memory region as well as the usual
>>>> KVM_CREATE_DEVICE dance. I've no objection to this, but it means
>>>> kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
>>>> to size the memory region).
>>>
>>> You ideally want to get the host overhead for a VM as low as you
>>> can. I'm not terribly fond of the idea of reserving a full page just
>>> because we're too afraid of having the guest donate memory.
>>
>> Well, reduce the amount of memory you give to the guest by one page,
>> and allocate that page to the stolen time device. Problem solved!
>>
>> Seriously, if you're worried about the allocation of a single page,
>> you should first look at how many holes we have in the vcpu structure,
>> for example (even better, with the 8.4 NV patches applied). Just
>> fixing that would give you that page back *per vcpu*.
> 
> I'm worried about additional memory slots, about fragmenting the
> cacheable guest memory regions, about avoidable HV taxes.
> 
> I think we need to distinguish here between the KVM implementation and
> the hypervisor/guest interface. Just because in KVM we can save overhead
> today doesn't mean that the HV interface should be built around the
> assumption that "memory is free".

The HV interface just requires that the host provides some memory for
the structures to live in. The memory can be adjacent to (or even within)
the normal memory of the guest. The only requirement is that the guest
isn't told to use this memory for normal allocations (i.e. it should
either be explicitly reserved or just not contained within the normal
memory block).

>>
>>>> 3. Guest kernel "donates" the memory to the hypervisor for the
>>>> structure. As far as I'm aware this is what x86 does. The problems I
>>>> see with this approach are:
>>>>
>>>>    a) kexec becomes much more tricky - there needs to be a disabling
>>>> mechanism for the guest to stop the hypervisor scribbling on memory
>>>> before starting the new kernel.
>>>
>>> I wouldn't call "quiesce a device" much more tricky. We have to do
>>> that for other devices as well today.
>>
>> And since there is no standard way of doing it, we keep inventing
>> weird and wonderful ways of doing so -- cue the terrible GICv3 LPI
>> situation, and all the various hacks to keep existing IOMMU mappings
>> around across firmware/kernel handovers as well as kexec.
> 
> Well, the good news here is that we don't have to keep it around ;).
> 
>>
>>>
>>>>    b) If there is more than one entity that is interested in the
>>>> information (e.g. firmware and kernel) then this requires some form of
>>>> arbitration in the guest because the hypervisor doesn't want to have to
>>>> track an arbitrary number of regions to update.
>>>
>>> Why would FW care?
>>
>> Exactly. It doesn't care. Not caring means it doesn't know about the
>> page the guest has allocated for stolen time, and starts using it for
>> its own purposes. Hello, memory corruption. Same thing goes if you
>> reboot into a non stolen time aware kernel.
> 
> If you reboot, you go via the vcpu reset path which clears the map, no?
> Same goes for FW entry. If you enter firmware that does not set up the
> map, you never see it.

Doing this per-vcpu implies you are probably going to have to allocate
an entire page per vcpu. Because it's entirely possible for a guest to
reset an individual vcpu. Or at the least there's some messy reference
counting going on here.

Having a region of memory provided by the host means the structures can
be packed and there's nothing to be done in the reset path.

>>
>>>
>>>>    c) Performance can suffer if the host kernel doesn't have a suitably
>>>> aligned/sized area to use. As you say - put_user() is more expensive.
>>>
>>> Just define the interface to always require natural alignment when
>>> donating a memory location?
>>>
>>>> The structure is updated on every return to the VM.
>>>
>>> If you really do suffer from put_user(), there are alternatives. You
>>> could just map the page on the registration hcall and then leave it
>>> pinned until the vcpu gets destroyed again.
>>
>> put_user() should be cheap enough. It is one of the things we tend to
>> optimise anyway. And yes, worst case, we pin the page.
>>
>>>
>>>> Of course x86 does prove the third approach can work, but I'm not sure
>>>> which is actually better. Avoiding the kexec cancellation requirement was
>>>> the main driver of the current approach. Although many of the
>>>
>>> I really don't understand the problem with kexec cancellation. Worst
>>> case, let guest FW set it up for you and propagate only the address
>>> down via ACPI/DT. That way you can mark the respective memory as
>>> reserved too.
>>
>> We already went down that road with the LPI hack. I'm not going there
>> again if we can avoid it. And it turns out that we can. Just allocate
>> the stolen time page as a separate memblock and give it to KVM for that
>> purpose.
>>
>> Your suggestion of letting the guest firmware set something up only
>> works if whatever you're booting after that understands it. If it
>> doesn't, you're screwed.
> 
> Why? For UEFI, mark the region as reserved in the memory map. For DT,
> just mark it as reserved directly.
> 
> That said, I'm not advocating for doing it in the FW. I think this can
> be solved really easily with a simple guest driver to enable and a vcpu
> reset hook to disable the map.
> 
>>
>>> But even with a Linux only mechanism, just take a look at
>>> arch/x86/kernel/kvmclock.c. All they do to remove the map is to hook
>>> into machine_crash_shutdown() and machine_shutdown().
>>
>> I'm not going to take something that is Linux specific. It has to work
>> for all guests, at all times, whether they know about the hypervisor
>> service or not.
> 
> If they don't know about the HV service, they don't register the writer,
> so they don't see corruption.
> 
> If they know about the HV service and they don't support kexec, they
> don't have to worry because a vcpu reset should also clear the map.
> 
> If they do support kexec, they already have a mechanism to quiesce devices.
> 
> So I don't understand how this is Linux specific? The question was Linux
> specific, so I answered with a precedent to show that disabling on kexec
> is not all that hard :).

My concern is more around something like Jailhouse as a
guest-hypervisor. There Linux gives up CPUs to run another OS. This
hand-off of a CPU is much easier if there's just a structure in memory
somewhere which doesn't move.

The kexec case like you say can be handled as a device to quiesce.

I don't think either scheme is unworkable, but I do think getting the
host to provide the memory is easier for both guest and host. Marc had a
good point that getting user space to allocate the memory is probably
preferable to getting the host kernel to do so, so I'm reworking the
code to do that.

Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread


Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-02 14:50 [PATCH 0/9] arm64: Stolen time support Steven Price
2019-08-02 14:50 ` [PATCH 1/9] KVM: arm64: Document PV-time interface Steven Price
2019-08-03 11:13   ` Marc Zyngier
2019-08-05 13:06     ` Steven Price
2019-08-05  3:23   ` Zenghui Yu
2019-08-05 13:06     ` Steven Price
2019-08-05 16:40   ` Christophe de Dinechin
2019-08-07 13:21     ` Steven Price
     [not found]       ` <9F77FA64-C71B-4025-A58D-3AC07E6688DE@dinechin.org>
2019-08-07 15:26         ` Steven Price
2019-08-02 14:50 ` [PATCH 2/9] KVM: arm/arm64: Factor out hypercall handling from PSCI code Steven Price
2019-08-02 14:50 ` [PATCH 3/9] KVM: arm64: Implement PV_FEATURES call Steven Price
2019-08-03 11:21   ` Marc Zyngier
2019-08-05 13:14     ` Steven Price
2019-08-02 14:50 ` [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure Steven Price
2019-08-03 11:55   ` Marc Zyngier
2019-08-05 14:09     ` Steven Price
2019-08-03 17:58   ` Marc Zyngier
2019-08-03 18:13     ` Marc Zyngier
2019-08-05 14:18       ` Steven Price
2019-08-02 14:50 ` [PATCH 5/9] KVM: Allow kvm_device_ops to be const Steven Price
2019-08-02 14:50 ` [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space Steven Price
2019-08-03 12:51   ` Marc Zyngier
2019-08-03 17:34     ` Marc Zyngier
2019-08-07 13:39       ` Steven Price
2019-08-07 13:51         ` Marc Zyngier
2019-08-05 16:10     ` Steven Price
2019-08-05 16:28       ` Marc Zyngier
2019-08-02 14:50 ` [PATCH 7/9] arm/arm64: Provide a wrapper for SMCCC 1.1 calls Steven Price
2019-08-05 10:03   ` Will Deacon
2019-08-02 14:50 ` [PATCH 8/9] arm/arm64: Make use of the SMCCC 1.1 wrapper Steven Price
2019-08-02 14:50 ` [PATCH 9/9] arm64: Retrieve stolen time as paravirtualized guest Steven Price
2019-08-04  9:53   ` Marc Zyngier
2019-08-08 15:29     ` Steven Price
2019-08-08 15:49       ` Marc Zyngier
2019-08-09 13:51   ` Zenghui Yu
2019-08-12 10:39     ` Steven Price
2019-08-13  6:06       ` Zenghui Yu
2019-08-03 18:05 ` [PATCH 0/9] arm64: Stolen time support Marc Zyngier
2019-08-05 13:06   ` Steven Price
2019-08-05 13:26     ` Marc Zyngier
2019-08-14 13:02     ` Alexander Graf
     [not found]       ` <8636i3omnd.wl-maz@kernel.org>
2019-08-14 14:52         ` [UNVERIFIED SENDER] " Alexander Graf
2019-08-16 10:23           ` Steven Price
