KVM ARM Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault
@ 2020-05-08  3:29 Gavin Shan
  2020-05-08  3:29 ` [PATCH RFCv2 1/9] arm64: Probe for the presence of KVM hypervisor services during boot Gavin Shan
                   ` (10 more replies)
  0 siblings, 11 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-08  3:29 UTC (permalink / raw)
  To: kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

There are two stages of page faults and the stage one page fault is
handled by guest itself. The guest is trapped to host when the page
fault is caused by stage 2 page table, for example missing. The guest
is suspended until the requested page is populated. There might be
IO activities involved for host to populate the requested page. For
instance, the requested page has been swapped out previously. In this
case, the guest (vCPU) has to suspend for a few of milliseconds, which
depends on the swapping media, regardless of the overall system load.

The series adds asychornous page fault to improve the situation. A
signal (PAGE_NOT_PRESENT) is sent from host to the guest if the requested
page isn't absent immediately. In the mean while, a worker is started
to populate the requested page in background. Guest either picks another
available process to run or puts current (faulting) process to power
saving mode when receiving the (PAGE_NOT_PRESENT) signal. After the
requested page is populated by the worker, another signal (PAGE_READY)
is sent from host to guest. Guest wakes up the (faulting) process when
receiving the (PAGE_READY) signal.

The signals are conveyed through control block. The control block physical
address is passed from guest to host through dedicated KVM vendor specific
hypercall. The control block is visible and accessible by host and guest
in the mean while. The hypercall is also used to enable, disable, configure
the functionality. Notifications, by injected abort data exception, are
fired when there are pending signals. The exception handler will be invoked
in guest kernel.

Testing
=======
The tests are carried on the following machine. A guest with single vCPU
and 4GB memory is started. Also, the QEMU process is put into memory cgroup
(v1) whose memory limit is set to 2GB. In the guest, there are two threads,
which are memory bound and CPU bound separately. The memory bound thread
allocates all available memory, accesses and them free them. The CPU bound
thread simply executes block of "nop". The test is carried out for 5 time
continuously and the average number (per minute) of executed blocks in the
CPU bound thread is taken as indicator of improvement.

   Vendor: GIGABYTE   CPU: 224 x Cavium ThunderX2(R) CPU CN9975 v2.2 @ 2.0GHz
   Memory: 32GB       Disk: Fusion-MPT SAS-3 (PCIe3.0 x8)

   Without-APF: 7029030180/minute = avg(7559625120 5962155840 7823208540
                                        7629633480 6170527920)
   With-APF:    8286827472/minute = avg(8464584540 8177073360 8262723180
                                        8095084020 8434672260)
   Outcome:     +17.8%

Another test case is to measure the time consumed by the application, but
with the CPU-bound thread disabled.

   Without-APF: 40.3s = avg(40.6 39.3 39.2 41.6 41.2)
   With-APF:    40.8s = avg(40.6 41.1 40.9 41.0 40.7)
   Outcome:     +1.2%

I also have some code in the host to capture the number of async page faults,
time used to do swapin and its maximal/minimal values when async page fault
is enabled. During the test, the CPU-bound thread is disabled. There is about
30% of the time used to do swapin.

   Number of async page fault:     7555 times
   Total time used by application: 42.2 seconds
   Total time used by swapin:      12.7 seconds   (30%)
         Minimal swapin time:      36.2 us
         Maximal swapin time:      55.7 ms

Changelog
=========
RFCv1 -> RFCv2
   * Rebase to 5.7.rc3
   * Performance data                                                   (Marc Zyngier)
   * Replace IMPDEF system register with KVM vendor specific hypercall  (Mark Rutland)
   * Based on Will's KVM vendor hypercall probe mechanism               (Will Deacon)
   * Don't use IMPDEF DFSC (0x43). Async page fault reason is conveyed
     by the control block                                               (Mark Rutland)
   * Delayed wakeup mechanism in guest kernel                           (Gavin Shan)
   * Stability improvement in the guest kernel: delayed wakeup mechanism,
     external abort disallowed region, lazily clear async page fault,
     disabled interrupt on acquiring the head's lock and so on          (Gavin Shan)
   * Stability improvement in the host kernel: serialized async page
     faults etc.                                                        (Gavin Shan)
   * Performance improvement in guest kernel: percpu sleeper head       (Gavin Shan)

Gavin Shan (7):
  kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
  kvm/arm64: Detach ESR operator from vCPU struct
  kvm/arm64: Replace hsr with esr
  kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  kvm/arm64: Support async page fault
  kernel/sched: Add cpu_rq_is_locked()
  arm64: Support async page fault

Will Deacon (2):
  arm64: Probe for the presence of KVM hypervisor services during boot
  arm/arm64: KVM: Advertise KVM UID to guests via SMCCC

 arch/arm64/Kconfig                       |  11 +
 arch/arm64/include/asm/exception.h       |   3 +
 arch/arm64/include/asm/hypervisor.h      |  11 +
 arch/arm64/include/asm/kvm_emulate.h     |  83 +++--
 arch/arm64/include/asm/kvm_host.h        |  47 +++
 arch/arm64/include/asm/kvm_para.h        |  40 +++
 arch/arm64/include/uapi/asm/Kbuild       |   2 -
 arch/arm64/include/uapi/asm/kvm_para.h   |  22 ++
 arch/arm64/kernel/entry.S                |  33 ++
 arch/arm64/kernel/process.c              |   4 +
 arch/arm64/kernel/setup.c                |  35 ++
 arch/arm64/kvm/Kconfig                   |   1 +
 arch/arm64/kvm/Makefile                  |   2 +
 arch/arm64/kvm/handle_exit.c             |  48 +--
 arch/arm64/kvm/hyp/switch.c              |  33 +-
 arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c |   7 +-
 arch/arm64/kvm/inject_fault.c            |   4 +-
 arch/arm64/kvm/sys_regs.c                |  38 +-
 arch/arm64/mm/fault.c                    | 434 +++++++++++++++++++++++
 include/linux/arm-smccc.h                |  32 ++
 include/linux/sched.h                    |   1 +
 kernel/sched/core.c                      |   8 +
 virt/kvm/arm/arm.c                       |  40 ++-
 virt/kvm/arm/async_pf.c                  | 335 +++++++++++++++++
 virt/kvm/arm/hyp/aarch32.c               |   4 +-
 virt/kvm/arm/hyp/vgic-v3-sr.c            |   7 +-
 virt/kvm/arm/hypercalls.c                |  37 +-
 virt/kvm/arm/mmio.c                      |  27 +-
 virt/kvm/arm/mmu.c                       |  69 +++-
 29 files changed, 1264 insertions(+), 154 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_para.h
 create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
 create mode 100644 virt/kvm/arm/async_pf.c

-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH RFCv2 1/9] arm64: Probe for the presence of KVM hypervisor services during boot
  2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
@ 2020-05-08  3:29 ` Gavin Shan
  2020-05-08  3:29 ` [PATCH RFCv2 2/9] arm/arm64: KVM: Advertise KVM UID to guests via SMCCC Gavin Shan
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-08  3:29 UTC (permalink / raw)
  To: kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

From: Will Deacon <will@kernel.org>

Although the SMCCC specification provides some limited functionality for
describing the presence of hypervisor and firmware services, this is
generally applicable only to functions designated as "Arm Architecture
Service Functions" and no portable discovery mechanism is provided for
standard hypervisor services, despite having a designated range of
function identifiers reserved by the specification.

In an attempt to avoid the need for additional firmware changes every
time a new function is added, introduce a UID to identify the service
provider as being compatible with KVM. Once this has been established,
additional services can be discovered via a feature bitmap.

Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/hypervisor.h | 11 +++++++++
 arch/arm64/kernel/setup.c           | 35 +++++++++++++++++++++++++++++
 include/linux/arm-smccc.h           | 26 +++++++++++++++++++++
 3 files changed, 72 insertions(+)

diff --git a/arch/arm64/include/asm/hypervisor.h b/arch/arm64/include/asm/hypervisor.h
index f9cc1d021791..91e4bd890819 100644
--- a/arch/arm64/include/asm/hypervisor.h
+++ b/arch/arm64/include/asm/hypervisor.h
@@ -2,6 +2,17 @@
 #ifndef _ASM_ARM64_HYPERVISOR_H
 #define _ASM_ARM64_HYPERVISOR_H
 
+#include <linux/arm-smccc.h>
 #include <asm/xen/hypervisor.h>
 
+static inline bool kvm_arm_hyp_service_available(u32 func_id)
+{
+	extern DECLARE_BITMAP(__kvm_arm_hyp_services, ARM_SMCCC_KVM_NUM_FUNCS);
+
+	if (func_id >= ARM_SMCCC_KVM_NUM_FUNCS)
+		return -EINVAL;
+
+	return test_bit(func_id, __kvm_arm_hyp_services);
+}
+
 #endif
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 3fd2c11c09fc..61c3774c7bc9 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -7,6 +7,7 @@
  */
 
 #include <linux/acpi.h>
+#include <linux/arm-smccc.h>
 #include <linux/export.h>
 #include <linux/kernel.h>
 #include <linux/stddef.h>
@@ -275,6 +276,39 @@ static int __init reserve_memblock_reserved_regions(void)
 arch_initcall(reserve_memblock_reserved_regions);
 
 u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };
+DECLARE_BITMAP(__kvm_arm_hyp_services, ARM_SMCCC_KVM_NUM_FUNCS) = { };
+
+static void __init kvm_init_hyp_services(void)
+{
+	int i;
+	struct arm_smccc_res res;
+
+	if (psci_ops.smccc_version == SMCCC_VERSION_1_0)
+		return;
+
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, &res);
+	if (res.a0 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_0 ||
+	    res.a1 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_1 ||
+	    res.a2 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_2 ||
+	    res.a3 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_3)
+		return;
+
+	memset(&res, 0, sizeof(res));
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID, &res);
+	for (i = 0; i < 32; ++i) {
+		if (res.a0 & (i))
+			set_bit(i + (32 * 0), __kvm_arm_hyp_services);
+		if (res.a1 & (i))
+			set_bit(i + (32 * 1), __kvm_arm_hyp_services);
+		if (res.a2 & (i))
+			set_bit(i + (32 * 2), __kvm_arm_hyp_services);
+		if (res.a3 & (i))
+			set_bit(i + (32 * 3), __kvm_arm_hyp_services);
+	}
+
+	pr_info("KVM hypervisor services detected (0x%08lx 0x%08lx 0x%08lx 0x%08lx)\n",
+		res.a3, res.a2, res.a1, res.a0);
+}
 
 void __init setup_arch(char **cmdline_p)
 {
@@ -344,6 +378,7 @@ void __init setup_arch(char **cmdline_p)
 	else
 		psci_acpi_init();
 
+	kvm_init_hyp_services();
 	init_bootcpu_ops();
 	smp_init_cpus();
 	smp_build_mpidr_hash();
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 59494df0f55b..bdc0124a064a 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -46,11 +46,14 @@
 #define ARM_SMCCC_OWNER_OEM		3
 #define ARM_SMCCC_OWNER_STANDARD	4
 #define ARM_SMCCC_OWNER_STANDARD_HYP	5
+#define ARM_SMCCC_OWNER_VENDOR_HYP	6
 #define ARM_SMCCC_OWNER_TRUSTED_APP	48
 #define ARM_SMCCC_OWNER_TRUSTED_APP_END	49
 #define ARM_SMCCC_OWNER_TRUSTED_OS	50
 #define ARM_SMCCC_OWNER_TRUSTED_OS_END	63
 
+#define ARM_SMCCC_FUNC_QUERY_CALL_UID	0xff01
+
 #define ARM_SMCCC_QUIRK_NONE		0
 #define ARM_SMCCC_QUIRK_QCOM_A6		1 /* Save/restore register a6 */
 
@@ -77,6 +80,29 @@
 			   ARM_SMCCC_SMC_32,				\
 			   0, 0x7fff)
 
+#define ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID				\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,				\
+			   ARM_SMCCC_SMC_32,				\
+			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
+			   ARM_SMCCC_FUNC_QUERY_CALL_UID)
+
+/* KVM UID value: 28b46fb6-2ec5-11e9-a9ca-4b564d003a74 */
+#define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_0	0xb66fb428U
+#define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_1	0xe911c52eU
+#define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_2	0x564bcaa9U
+#define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_3	0x743a004dU
+
+/* KVM "vendor specific" services */
+#define ARM_SMCCC_KVM_FUNC_FEATURES		0
+#define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
+#define ARM_SMCCC_KVM_NUM_FUNCS			128
+
+#define ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID			\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,				\
+			   ARM_SMCCC_SMC_32,				\
+			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
+			   ARM_SMCCC_KVM_FUNC_FEATURES)
+
 #ifndef __ASSEMBLY__
 
 #include <linux/linkage.h>
-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH RFCv2 2/9] arm/arm64: KVM: Advertise KVM UID to guests via SMCCC
  2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
  2020-05-08  3:29 ` [PATCH RFCv2 1/9] arm64: Probe for the presence of KVM hypervisor services during boot Gavin Shan
@ 2020-05-08  3:29 ` Gavin Shan
  2020-05-08  3:29 ` [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr() Gavin Shan
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-08  3:29 UTC (permalink / raw)
  To: kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

From: Will Deacon <will@kernel.org>

We can advertise ourselves to guests as KVM and provide a basic features
bitmap for discoverability of future hypervisor services.

Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 virt/kvm/arm/hypercalls.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
index 550dfa3e53cd..db6dce3d0e23 100644
--- a/virt/kvm/arm/hypercalls.c
+++ b/virt/kvm/arm/hypercalls.c
@@ -12,13 +12,13 @@
 int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 {
 	u32 func_id = smccc_get_function(vcpu);
-	long val = SMCCC_RET_NOT_SUPPORTED;
+	u32 val[4] = {SMCCC_RET_NOT_SUPPORTED};
 	u32 feature;
 	gpa_t gpa;
 
 	switch (func_id) {
 	case ARM_SMCCC_VERSION_FUNC_ID:
-		val = ARM_SMCCC_VERSION_1_1;
+		val[0] = ARM_SMCCC_VERSION_1_1;
 		break;
 	case ARM_SMCCC_ARCH_FEATURES_FUNC_ID:
 		feature = smccc_get_arg1(vcpu);
@@ -28,10 +28,10 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 			case KVM_BP_HARDEN_UNKNOWN:
 				break;
 			case KVM_BP_HARDEN_WA_NEEDED:
-				val = SMCCC_RET_SUCCESS;
+				val[0] = SMCCC_RET_SUCCESS;
 				break;
 			case KVM_BP_HARDEN_NOT_REQUIRED:
-				val = SMCCC_RET_NOT_REQUIRED;
+				val[0] = SMCCC_RET_NOT_REQUIRED;
 				break;
 			}
 			break;
@@ -41,31 +41,40 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 			case KVM_SSBD_UNKNOWN:
 				break;
 			case KVM_SSBD_KERNEL:
-				val = SMCCC_RET_SUCCESS;
+				val[0] = SMCCC_RET_SUCCESS;
 				break;
 			case KVM_SSBD_FORCE_ENABLE:
 			case KVM_SSBD_MITIGATED:
-				val = SMCCC_RET_NOT_REQUIRED;
+				val[0] = SMCCC_RET_NOT_REQUIRED;
 				break;
 			}
 			break;
 		case ARM_SMCCC_HV_PV_TIME_FEATURES:
-			val = SMCCC_RET_SUCCESS;
+			val[0] = SMCCC_RET_SUCCESS;
 			break;
 		}
 		break;
 	case ARM_SMCCC_HV_PV_TIME_FEATURES:
-		val = kvm_hypercall_pv_features(vcpu);
+		val[0] = kvm_hypercall_pv_features(vcpu);
 		break;
 	case ARM_SMCCC_HV_PV_TIME_ST:
 		gpa = kvm_init_stolen_time(vcpu);
 		if (gpa != GPA_INVALID)
-			val = gpa;
+			val[0] = gpa;
+		break;
+	case ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID:
+		val[0] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_0;
+		val[1] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_1;
+		val[2] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_2;
+		val[3] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_3;
+		break;
+	case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
+		val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
 		break;
 	default:
 		return kvm_psci_call(vcpu);
 	}
 
-	smccc_set_retval(vcpu, val, 0, 0, 0);
+	smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]);
 	return 1;
 }
-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
  2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
  2020-05-08  3:29 ` [PATCH RFCv2 1/9] arm64: Probe for the presence of KVM hypervisor services during boot Gavin Shan
  2020-05-08  3:29 ` [PATCH RFCv2 2/9] arm/arm64: KVM: Advertise KVM UID to guests via SMCCC Gavin Shan
@ 2020-05-08  3:29 ` Gavin Shan
  2020-05-26 10:42   ` Mark Rutland
  2020-05-08  3:29 ` [PATCH RFCv2 4/9] kvm/arm64: Detach ESR operator from vCPU struct Gavin Shan
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: Gavin Shan @ 2020-05-08  3:29 UTC (permalink / raw)
  To: kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

Since kvm/arm32 was removed, this renames kvm_vcpu_get_hsr() to
kvm_vcpu_get_esr() to it a bit more self-explaining because the
functions returns ESR instead of HSR on aarch64. This shouldn't
cause any functional changes.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/kvm_emulate.h | 36 +++++++++++++++-------------
 arch/arm64/kvm/handle_exit.c         | 12 +++++-----
 arch/arm64/kvm/hyp/switch.c          |  2 +-
 arch/arm64/kvm/sys_regs.c            |  6 ++---
 virt/kvm/arm/hyp/aarch32.c           |  2 +-
 virt/kvm/arm/hyp/vgic-v3-sr.c        |  4 ++--
 virt/kvm/arm/mmu.c                   |  6 ++---
 7 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index a30b4eec7cb4..bd1a69e7c104 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -265,14 +265,14 @@ static inline bool vcpu_mode_priv(const struct kvm_vcpu *vcpu)
 	return mode != PSR_MODE_EL0t;
 }
 
-static __always_inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
+static __always_inline u32 kvm_vcpu_get_esr(const struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.fault.esr_el2;
 }
 
 static __always_inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
 {
-	u32 esr = kvm_vcpu_get_hsr(vcpu);
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 
 	if (esr & ESR_ELx_CV)
 		return (esr & ESR_ELx_COND_MASK) >> ESR_ELx_COND_SHIFT;
@@ -297,64 +297,66 @@ static inline u64 kvm_vcpu_get_disr(const struct kvm_vcpu *vcpu)
 
 static inline u32 kvm_vcpu_hvc_get_imm(const struct kvm_vcpu *vcpu)
 {
-	return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_xVC_IMM_MASK;
+	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_xVC_IMM_MASK;
 }
 
 static __always_inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu)
 {
-	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_ISV);
+	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_ISV);
 }
 
 static inline unsigned long kvm_vcpu_dabt_iss_nisv_sanitized(const struct kvm_vcpu *vcpu)
 {
-	return kvm_vcpu_get_hsr(vcpu) & (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC);
+	return kvm_vcpu_get_esr(vcpu) &
+	       (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC);
 }
 
 static inline bool kvm_vcpu_dabt_issext(const struct kvm_vcpu *vcpu)
 {
-	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SSE);
+	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SSE);
 }
 
 static inline bool kvm_vcpu_dabt_issf(const struct kvm_vcpu *vcpu)
 {
-	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SF);
+	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SF);
 }
 
 static __always_inline int kvm_vcpu_dabt_get_rd(const struct kvm_vcpu *vcpu)
 {
-	return (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
+	return (kvm_vcpu_get_esr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
 }
 
 static __always_inline bool kvm_vcpu_dabt_iss1tw(const struct kvm_vcpu *vcpu)
 {
-	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_S1PTW);
+	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_S1PTW);
 }
 
 static __always_inline bool kvm_vcpu_dabt_iswrite(const struct kvm_vcpu *vcpu)
 {
-	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WNR) ||
+	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_WNR) ||
 		kvm_vcpu_dabt_iss1tw(vcpu); /* AF/DBM update */
 }
 
 static inline bool kvm_vcpu_dabt_is_cm(const struct kvm_vcpu *vcpu)
 {
-	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_CM);
+	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_CM);
 }
 
 static __always_inline unsigned int kvm_vcpu_dabt_get_as(const struct kvm_vcpu *vcpu)
 {
-	return 1 << ((kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SAS) >> ESR_ELx_SAS_SHIFT);
+	return 1 << ((kvm_vcpu_get_esr(vcpu) & ESR_ELx_SAS) >>
+		     ESR_ELx_SAS_SHIFT);
 }
 
 /* This one is not specific to Data Abort */
 static __always_inline bool kvm_vcpu_trap_il_is32bit(const struct kvm_vcpu *vcpu)
 {
-	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_IL);
+	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_IL);
 }
 
 static __always_inline u8 kvm_vcpu_trap_get_class(const struct kvm_vcpu *vcpu)
 {
-	return ESR_ELx_EC(kvm_vcpu_get_hsr(vcpu));
+	return ESR_ELx_EC(kvm_vcpu_get_esr(vcpu));
 }
 
 static inline bool kvm_vcpu_trap_is_iabt(const struct kvm_vcpu *vcpu)
@@ -364,12 +366,12 @@ static inline bool kvm_vcpu_trap_is_iabt(const struct kvm_vcpu *vcpu)
 
 static __always_inline u8 kvm_vcpu_trap_get_fault(const struct kvm_vcpu *vcpu)
 {
-	return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_FSC;
+	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC;
 }
 
 static __always_inline u8 kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vcpu)
 {
-	return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_FSC_TYPE;
+	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_TYPE;
 }
 
 static __always_inline bool kvm_vcpu_dabt_isextabt(const struct kvm_vcpu *vcpu)
@@ -393,7 +395,7 @@ static __always_inline bool kvm_vcpu_dabt_isextabt(const struct kvm_vcpu *vcpu)
 
 static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
 {
-	u32 esr = kvm_vcpu_get_hsr(vcpu);
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	return ESR_ELx_SYS64_ISS_RT(esr);
 }
 
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index aacfc55de44c..c5b75a4d5eda 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -89,7 +89,7 @@ static int handle_no_fpsimd(struct kvm_vcpu *vcpu, struct kvm_run *run)
  */
 static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
-	if (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
+	if (kvm_vcpu_get_esr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
 		trace_kvm_wfx_arm64(*vcpu_pc(vcpu), true);
 		vcpu->stat.wfe_exit_stat++;
 		kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
@@ -119,7 +119,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
  */
 static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
-	u32 hsr = kvm_vcpu_get_hsr(vcpu);
+	u32 hsr = kvm_vcpu_get_esr(vcpu);
 	int ret = 0;
 
 	run->exit_reason = KVM_EXIT_DEBUG;
@@ -146,7 +146,7 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
 static int kvm_handle_unknown_ec(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
-	u32 hsr = kvm_vcpu_get_hsr(vcpu);
+	u32 hsr = kvm_vcpu_get_esr(vcpu);
 
 	kvm_pr_unimpl("Unknown exception class: hsr: %#08x -- %s\n",
 		      hsr, esr_get_class_string(hsr));
@@ -226,7 +226,7 @@ static exit_handle_fn arm_exit_handlers[] = {
 
 static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu)
 {
-	u32 hsr = kvm_vcpu_get_hsr(vcpu);
+	u32 hsr = kvm_vcpu_get_esr(vcpu);
 	u8 hsr_ec = ESR_ELx_EC(hsr);
 
 	return arm_exit_handlers[hsr_ec];
@@ -267,7 +267,7 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		       int exception_index)
 {
 	if (ARM_SERROR_PENDING(exception_index)) {
-		u8 hsr_ec = ESR_ELx_EC(kvm_vcpu_get_hsr(vcpu));
+		u8 hsr_ec = ESR_ELx_EC(kvm_vcpu_get_esr(vcpu));
 
 		/*
 		 * HVC/SMC already have an adjusted PC, which we need
@@ -333,5 +333,5 @@ void handle_exit_early(struct kvm_vcpu *vcpu, struct kvm_run *run,
 	exception_index = ARM_EXCEPTION_CODE(exception_index);
 
 	if (exception_index == ARM_EXCEPTION_EL1_SERROR)
-		kvm_handle_guest_serror(vcpu, kvm_vcpu_get_hsr(vcpu));
+		kvm_handle_guest_serror(vcpu, kvm_vcpu_get_esr(vcpu));
 }
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index 8a1e81a400e0..2c3242bcfed2 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -437,7 +437,7 @@ static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
 
 static bool __hyp_text handle_tx2_tvm(struct kvm_vcpu *vcpu)
 {
-	u32 sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_hsr(vcpu));
+	u32 sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_esr(vcpu));
 	int rt = kvm_vcpu_sys_get_rt(vcpu);
 	u64 val = vcpu_get_reg(vcpu, rt);
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 51db934702b6..5b61465927b7 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2214,7 +2214,7 @@ static int kvm_handle_cp_64(struct kvm_vcpu *vcpu,
 			    size_t nr_specific)
 {
 	struct sys_reg_params params;
-	u32 hsr = kvm_vcpu_get_hsr(vcpu);
+	u32 hsr = kvm_vcpu_get_esr(vcpu);
 	int Rt = kvm_vcpu_sys_get_rt(vcpu);
 	int Rt2 = (hsr >> 10) & 0x1f;
 
@@ -2271,7 +2271,7 @@ static int kvm_handle_cp_32(struct kvm_vcpu *vcpu,
 			    size_t nr_specific)
 {
 	struct sys_reg_params params;
-	u32 hsr = kvm_vcpu_get_hsr(vcpu);
+	u32 hsr = kvm_vcpu_get_esr(vcpu);
 	int Rt  = kvm_vcpu_sys_get_rt(vcpu);
 
 	params.is_aarch32 = true;
@@ -2387,7 +2387,7 @@ static void reset_sys_reg_descs(struct kvm_vcpu *vcpu,
 int kvm_handle_sys_reg(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
 	struct sys_reg_params params;
-	unsigned long esr = kvm_vcpu_get_hsr(vcpu);
+	unsigned long esr = kvm_vcpu_get_esr(vcpu);
 	int Rt = kvm_vcpu_sys_get_rt(vcpu);
 	int ret;
 
diff --git a/virt/kvm/arm/hyp/aarch32.c b/virt/kvm/arm/hyp/aarch32.c
index d31f267961e7..864b477e660a 100644
--- a/virt/kvm/arm/hyp/aarch32.c
+++ b/virt/kvm/arm/hyp/aarch32.c
@@ -51,7 +51,7 @@ bool __hyp_text kvm_condition_valid32(const struct kvm_vcpu *vcpu)
 	int cond;
 
 	/* Top two bits non-zero?  Unconditional. */
-	if (kvm_vcpu_get_hsr(vcpu) >> 30)
+	if (kvm_vcpu_get_esr(vcpu) >> 30)
 		return true;
 
 	/* Is condition field valid? */
diff --git a/virt/kvm/arm/hyp/vgic-v3-sr.c b/virt/kvm/arm/hyp/vgic-v3-sr.c
index ccf1fde9836c..8a7a14ec9120 100644
--- a/virt/kvm/arm/hyp/vgic-v3-sr.c
+++ b/virt/kvm/arm/hyp/vgic-v3-sr.c
@@ -441,7 +441,7 @@ static int __hyp_text __vgic_v3_bpr_min(void)
 
 static int __hyp_text __vgic_v3_get_group(struct kvm_vcpu *vcpu)
 {
-	u32 esr = kvm_vcpu_get_hsr(vcpu);
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	u8 crm = (esr & ESR_ELx_SYS64_ISS_CRM_MASK) >> ESR_ELx_SYS64_ISS_CRM_SHIFT;
 
 	return crm != 8;
@@ -1007,7 +1007,7 @@ int __hyp_text __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu)
 	bool is_read;
 	u32 sysreg;
 
-	esr = kvm_vcpu_get_hsr(vcpu);
+	esr = kvm_vcpu_get_esr(vcpu);
 	if (vcpu_mode_is_32bit(vcpu)) {
 		if (!kvm_condition_valid(vcpu)) {
 			__kvm_skip_instr(vcpu);
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index e3b9ee268823..5da0d0e7519b 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1922,7 +1922,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		 * For RAS the host kernel may handle this abort.
 		 * There is no need to pass the error into the guest.
 		 */
-		if (!kvm_handle_guest_sea(fault_ipa, kvm_vcpu_get_hsr(vcpu)))
+		if (!kvm_handle_guest_sea(fault_ipa, kvm_vcpu_get_esr(vcpu)))
 			return 1;
 
 		if (unlikely(!is_iabt)) {
@@ -1931,7 +1931,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		}
 	}
 
-	trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_hsr(vcpu),
+	trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_esr(vcpu),
 			      kvm_vcpu_get_hfar(vcpu), fault_ipa);
 
 	/* Check the stage-2 fault is trans. fault or write fault */
@@ -1940,7 +1940,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
 			kvm_vcpu_trap_get_class(vcpu),
 			(unsigned long)kvm_vcpu_trap_get_fault(vcpu),
-			(unsigned long)kvm_vcpu_get_hsr(vcpu));
+			(unsigned long)kvm_vcpu_get_esr(vcpu));
 		return -EFAULT;
 	}
 
-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH RFCv2 4/9] kvm/arm64: Detach ESR operator from vCPU struct
  2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
                   ` (2 preceding siblings ...)
  2020-05-08  3:29 ` [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr() Gavin Shan
@ 2020-05-08  3:29 ` Gavin Shan
  2020-05-26 10:51   ` Mark Rutland
  2020-05-08  3:29 ` [PATCH RFCv2 5/9] kvm/arm64: Replace hsr with esr Gavin Shan
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: Gavin Shan @ 2020-05-08  3:29 UTC (permalink / raw)
  To: kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

There are a set of inline functions defined in kvm_emulate.h. Those
functions reads ESR from vCPU fault information struct and then operate
on it. So it's tied with vCPU fault information and vCPU struct. It
limits their usage scope.

This detaches these functions from the vCPU struct. With this, the
caller has flexibility on where the ESR is read. It shouldn't cause
any functional changes.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/kvm_emulate.h     | 83 +++++++++++-------------
 arch/arm64/kvm/handle_exit.c             | 20 ++++--
 arch/arm64/kvm/hyp/switch.c              | 24 ++++---
 arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c |  7 +-
 arch/arm64/kvm/inject_fault.c            |  4 +-
 arch/arm64/kvm/sys_regs.c                | 12 ++--
 virt/kvm/arm/arm.c                       |  4 +-
 virt/kvm/arm/hyp/aarch32.c               |  2 +-
 virt/kvm/arm/hyp/vgic-v3-sr.c            |  5 +-
 virt/kvm/arm/mmio.c                      | 27 ++++----
 virt/kvm/arm/mmu.c                       | 22 ++++---
 11 files changed, 112 insertions(+), 98 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index bd1a69e7c104..2873bf6dc85e 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -270,10 +270,8 @@ static __always_inline u32 kvm_vcpu_get_esr(const struct kvm_vcpu *vcpu)
 	return vcpu->arch.fault.esr_el2;
 }
 
-static __always_inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
+static __always_inline int kvm_vcpu_get_condition(u32 esr)
 {
-	u32 esr = kvm_vcpu_get_esr(vcpu);
-
 	if (esr & ESR_ELx_CV)
 		return (esr & ESR_ELx_COND_MASK) >> ESR_ELx_COND_SHIFT;
 
@@ -295,88 +293,86 @@ static inline u64 kvm_vcpu_get_disr(const struct kvm_vcpu *vcpu)
 	return vcpu->arch.fault.disr_el1;
 }
 
-static inline u32 kvm_vcpu_hvc_get_imm(const struct kvm_vcpu *vcpu)
+static __always_inline u32 kvm_vcpu_hvc_get_imm(u32 esr)
 {
-	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_xVC_IMM_MASK;
+	return esr & ESR_ELx_xVC_IMM_MASK;
 }
 
-static __always_inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_vcpu_dabt_isvalid(u32 esr)
 {
-	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_ISV);
+	return !!(esr & ESR_ELx_ISV);
 }
 
-static inline unsigned long kvm_vcpu_dabt_iss_nisv_sanitized(const struct kvm_vcpu *vcpu)
+static __always_inline unsigned long kvm_vcpu_dabt_iss_nisv_sanitized(u32 esr)
 {
-	return kvm_vcpu_get_esr(vcpu) &
-	       (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC);
+	return esr & (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC);
 }
 
-static inline bool kvm_vcpu_dabt_issext(const struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_vcpu_dabt_issext(u32 esr)
 {
-	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SSE);
+	return !!(esr & ESR_ELx_SSE);
 }
 
-static inline bool kvm_vcpu_dabt_issf(const struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_vcpu_dabt_issf(u32 esr)
 {
-	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SF);
+	return !!(esr & ESR_ELx_SF);
 }
 
-static __always_inline int kvm_vcpu_dabt_get_rd(const struct kvm_vcpu *vcpu)
+static __always_inline int kvm_vcpu_dabt_get_rd(u32 esr)
 {
-	return (kvm_vcpu_get_esr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
+	return (esr & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
 }
 
-static __always_inline bool kvm_vcpu_dabt_iss1tw(const struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_vcpu_dabt_iss1tw(u32 esr)
 {
-	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_S1PTW);
+	return !!(esr & ESR_ELx_S1PTW);
 }
 
-static __always_inline bool kvm_vcpu_dabt_iswrite(const struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_vcpu_dabt_iswrite(u32 esr)
 {
-	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_WNR) ||
-		kvm_vcpu_dabt_iss1tw(vcpu); /* AF/DBM update */
+	return !!(esr & ESR_ELx_WNR) ||
+		kvm_vcpu_dabt_iss1tw(esr); /* AF/DBM update */
 }
 
-static inline bool kvm_vcpu_dabt_is_cm(const struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_vcpu_dabt_is_cm(u32 esr)
 {
-	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_CM);
+	return !!(esr & ESR_ELx_CM);
 }
 
-static __always_inline unsigned int kvm_vcpu_dabt_get_as(const struct kvm_vcpu *vcpu)
+static __always_inline unsigned int kvm_vcpu_dabt_get_as(u32 esr)
 {
-	return 1 << ((kvm_vcpu_get_esr(vcpu) & ESR_ELx_SAS) >>
-		     ESR_ELx_SAS_SHIFT);
+	return 1 << ((esr & ESR_ELx_SAS) >> ESR_ELx_SAS_SHIFT);
 }
 
 /* This one is not specific to Data Abort */
-static __always_inline bool kvm_vcpu_trap_il_is32bit(const struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_vcpu_trap_il_is32bit(u32 esr)
 {
-	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_IL);
+	return !!(esr & ESR_ELx_IL);
 }
 
-static __always_inline u8 kvm_vcpu_trap_get_class(const struct kvm_vcpu *vcpu)
+static __always_inline u8 kvm_vcpu_trap_get_class(u32 esr)
 {
-	return ESR_ELx_EC(kvm_vcpu_get_esr(vcpu));
+	return ESR_ELx_EC(esr);
 }
 
-static inline bool kvm_vcpu_trap_is_iabt(const struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_vcpu_trap_is_iabt(u32 esr)
 {
-	return kvm_vcpu_trap_get_class(vcpu) == ESR_ELx_EC_IABT_LOW;
+	return kvm_vcpu_trap_get_class(esr) == ESR_ELx_EC_IABT_LOW;
 }
 
-static __always_inline u8 kvm_vcpu_trap_get_fault(const struct kvm_vcpu *vcpu)
+static __always_inline u8 kvm_vcpu_trap_get_fault(u32 esr)
 {
-	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC;
+	return esr & ESR_ELx_FSC;
 }
 
-static __always_inline u8 kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vcpu)
+static __always_inline u8 kvm_vcpu_trap_get_fault_type(u32 esr)
 {
-	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_TYPE;
+	return esr & ESR_ELx_FSC_TYPE;
 }
 
-static __always_inline bool kvm_vcpu_dabt_isextabt(const struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_vcpu_dabt_isextabt(u32 esr)
 {
-	switch (kvm_vcpu_trap_get_fault(vcpu)) {
+	switch (kvm_vcpu_trap_get_fault(esr)) {
 	case FSC_SEA:
 	case FSC_SEA_TTW0:
 	case FSC_SEA_TTW1:
@@ -393,18 +389,17 @@ static __always_inline bool kvm_vcpu_dabt_isextabt(const struct kvm_vcpu *vcpu)
 	}
 }
 
-static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
+static __always_inline int kvm_vcpu_sys_get_rt(u32 esr)
 {
-	u32 esr = kvm_vcpu_get_esr(vcpu);
 	return ESR_ELx_SYS64_ISS_RT(esr);
 }
 
-static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_is_write_fault(u32 esr)
 {
-	if (kvm_vcpu_trap_is_iabt(vcpu))
+	if (kvm_vcpu_trap_is_iabt(esr))
 		return false;
 
-	return kvm_vcpu_dabt_iswrite(vcpu);
+	return kvm_vcpu_dabt_iswrite(esr);
 }
 
 static inline unsigned long kvm_vcpu_get_mpidr_aff(struct kvm_vcpu *vcpu)
@@ -527,7 +522,7 @@ static __always_inline void __hyp_text __kvm_skip_instr(struct kvm_vcpu *vcpu)
 	*vcpu_pc(vcpu) = read_sysreg_el2(SYS_ELR);
 	vcpu->arch.ctxt.gp_regs.regs.pstate = read_sysreg_el2(SYS_SPSR);
 
-	kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
+	kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(kvm_vcpu_get_esr(vcpu)));
 
 	write_sysreg_el2(vcpu->arch.ctxt.gp_regs.regs.pstate, SYS_SPSR);
 	write_sysreg_el2(*vcpu_pc(vcpu), SYS_ELR);
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index c5b75a4d5eda..00858db82a64 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -38,7 +38,7 @@ static int handle_hvc(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	int ret;
 
 	trace_kvm_hvc_arm64(*vcpu_pc(vcpu), vcpu_get_reg(vcpu, 0),
-			    kvm_vcpu_hvc_get_imm(vcpu));
+			    kvm_vcpu_hvc_get_imm(kvm_vcpu_get_esr(vcpu)));
 	vcpu->stat.hvc_exit_stat++;
 
 	ret = kvm_hvc_call_handler(vcpu);
@@ -52,6 +52,8 @@ static int handle_hvc(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
 static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
+	u32 esr = kvm_vcpu_get_esr(vcpu);
+
 	/*
 	 * "If an SMC instruction executed at Non-secure EL1 is
 	 * trapped to EL2 because HCR_EL2.TSC is 1, the exception is a
@@ -61,7 +63,7 @@ static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	 * otherwise return to the same address...
 	 */
 	vcpu_set_reg(vcpu, 0, ~0UL);
-	kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
+	kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(esr));
 	return 1;
 }
 
@@ -89,7 +91,9 @@ static int handle_no_fpsimd(struct kvm_vcpu *vcpu, struct kvm_run *run)
  */
 static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
-	if (kvm_vcpu_get_esr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
+	u32 esr = kvm_vcpu_get_esr(vcpu);
+
+	if (esr & ESR_ELx_WFx_ISS_WFE) {
 		trace_kvm_wfx_arm64(*vcpu_pc(vcpu), true);
 		vcpu->stat.wfe_exit_stat++;
 		kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
@@ -100,7 +104,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 	}
 
-	kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
+	kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(esr));
 
 	return 1;
 }
@@ -240,6 +244,7 @@ static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu)
  */
 static int handle_trap_exceptions(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	int handled;
 
 	/*
@@ -247,7 +252,7 @@ static int handle_trap_exceptions(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	 * that fail their condition code check"
 	 */
 	if (!kvm_condition_valid(vcpu)) {
-		kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
+		kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(esr));
 		handled = 1;
 	} else {
 		exit_handle_fn exit_handler;
@@ -267,7 +272,8 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		       int exception_index)
 {
 	if (ARM_SERROR_PENDING(exception_index)) {
-		u8 hsr_ec = ESR_ELx_EC(kvm_vcpu_get_esr(vcpu));
+		u32 esr = kvm_vcpu_get_esr(vcpu);
+		u8 hsr_ec = ESR_ELx_EC(esr);
 
 		/*
 		 * HVC/SMC already have an adjusted PC, which we need
@@ -276,7 +282,7 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		 */
 		if (hsr_ec == ESR_ELx_EC_HVC32 || hsr_ec == ESR_ELx_EC_HVC64 ||
 		    hsr_ec == ESR_ELx_EC_SMC32 || hsr_ec == ESR_ELx_EC_SMC64) {
-			u32 adj =  kvm_vcpu_trap_il_is32bit(vcpu) ? 4 : 2;
+			u32 adj =  kvm_vcpu_trap_il_is32bit(esr) ? 4 : 2;
 			*vcpu_pc(vcpu) -= adj;
 		}
 
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index 2c3242bcfed2..369f22f49f3d 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -355,6 +355,7 @@ static bool __hyp_text __populate_fault_info(struct kvm_vcpu *vcpu)
 /* Check for an FPSIMD/SVE trap and handle as appropriate */
 static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
 {
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	bool vhe, sve_guest, sve_host;
 	u8 hsr_ec;
 
@@ -371,7 +372,7 @@ static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
 		vhe = has_vhe();
 	}
 
-	hsr_ec = kvm_vcpu_trap_get_class(vcpu);
+	hsr_ec = kvm_vcpu_trap_get_class(esr);
 	if (hsr_ec != ESR_ELx_EC_FP_ASIMD &&
 	    hsr_ec != ESR_ELx_EC_SVE)
 		return false;
@@ -438,7 +439,8 @@ static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
 static bool __hyp_text handle_tx2_tvm(struct kvm_vcpu *vcpu)
 {
 	u32 sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_esr(vcpu));
-	int rt = kvm_vcpu_sys_get_rt(vcpu);
+	u32 esr = kvm_vcpu_get_esr(vcpu);
+	int rt = kvm_vcpu_sys_get_rt(esr);
 	u64 val = vcpu_get_reg(vcpu, rt);
 
 	/*
@@ -497,6 +499,8 @@ static bool __hyp_text handle_tx2_tvm(struct kvm_vcpu *vcpu)
  */
 static bool __hyp_text fixup_guest_exit(struct kvm_vcpu *vcpu, u64 *exit_code)
 {
+	u32 esr = kvm_vcpu_get_esr(vcpu);
+
 	if (ARM_EXCEPTION_CODE(*exit_code) != ARM_EXCEPTION_IRQ)
 		vcpu->arch.fault.esr_el2 = read_sysreg_el2(SYS_ESR);
 
@@ -510,7 +514,7 @@ static bool __hyp_text fixup_guest_exit(struct kvm_vcpu *vcpu, u64 *exit_code)
 		goto exit;
 
 	if (cpus_have_final_cap(ARM64_WORKAROUND_CAVIUM_TX2_219_TVM) &&
-	    kvm_vcpu_trap_get_class(vcpu) == ESR_ELx_EC_SYS64 &&
+	    kvm_vcpu_trap_get_class(esr) == ESR_ELx_EC_SYS64 &&
 	    handle_tx2_tvm(vcpu))
 		return true;
 
@@ -530,11 +534,11 @@ static bool __hyp_text fixup_guest_exit(struct kvm_vcpu *vcpu, u64 *exit_code)
 	if (static_branch_unlikely(&vgic_v2_cpuif_trap)) {
 		bool valid;
 
-		valid = kvm_vcpu_trap_get_class(vcpu) == ESR_ELx_EC_DABT_LOW &&
-			kvm_vcpu_trap_get_fault_type(vcpu) == FSC_FAULT &&
-			kvm_vcpu_dabt_isvalid(vcpu) &&
-			!kvm_vcpu_dabt_isextabt(vcpu) &&
-			!kvm_vcpu_dabt_iss1tw(vcpu);
+		valid = kvm_vcpu_trap_get_class(esr) == ESR_ELx_EC_DABT_LOW &&
+			kvm_vcpu_trap_get_fault_type(esr) == FSC_FAULT &&
+			kvm_vcpu_dabt_isvalid(esr) &&
+			!kvm_vcpu_dabt_isextabt(esr) &&
+			!kvm_vcpu_dabt_iss1tw(esr);
 
 		if (valid) {
 			int ret = __vgic_v2_perform_cpuif_access(vcpu);
@@ -551,8 +555,8 @@ static bool __hyp_text fixup_guest_exit(struct kvm_vcpu *vcpu, u64 *exit_code)
 	}
 
 	if (static_branch_unlikely(&vgic_v3_cpuif_trap) &&
-	    (kvm_vcpu_trap_get_class(vcpu) == ESR_ELx_EC_SYS64 ||
-	     kvm_vcpu_trap_get_class(vcpu) == ESR_ELx_EC_CP15_32)) {
+	    (kvm_vcpu_trap_get_class(esr) == ESR_ELx_EC_SYS64 ||
+	     kvm_vcpu_trap_get_class(esr) == ESR_ELx_EC_CP15_32)) {
 		int ret = __vgic_v3_perform_cpuif_access(vcpu);
 
 		if (ret == 1)
diff --git a/arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c b/arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c
index 4f3a087e36d5..bcf13a074b69 100644
--- a/arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c
+++ b/arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c
@@ -36,6 +36,7 @@ int __hyp_text __vgic_v2_perform_cpuif_access(struct kvm_vcpu *vcpu)
 {
 	struct kvm *kvm = kern_hyp_va(vcpu->kvm);
 	struct vgic_dist *vgic = &kvm->arch.vgic;
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	phys_addr_t fault_ipa;
 	void __iomem *addr;
 	int rd;
@@ -50,7 +51,7 @@ int __hyp_text __vgic_v2_perform_cpuif_access(struct kvm_vcpu *vcpu)
 		return 0;
 
 	/* Reject anything but a 32bit access */
-	if (kvm_vcpu_dabt_get_as(vcpu) != sizeof(u32)) {
+	if (kvm_vcpu_dabt_get_as(esr) != sizeof(u32)) {
 		__kvm_skip_instr(vcpu);
 		return -1;
 	}
@@ -61,11 +62,11 @@ int __hyp_text __vgic_v2_perform_cpuif_access(struct kvm_vcpu *vcpu)
 		return -1;
 	}
 
-	rd = kvm_vcpu_dabt_get_rd(vcpu);
+	rd = kvm_vcpu_dabt_get_rd(esr);
 	addr  = hyp_symbol_addr(kvm_vgic_global_state)->vcpu_hyp_va;
 	addr += fault_ipa - vgic->vgic_cpu_base;
 
-	if (kvm_vcpu_dabt_iswrite(vcpu)) {
+	if (kvm_vcpu_dabt_iswrite(esr)) {
 		u32 data = vcpu_get_reg(vcpu, rd);
 		if (__is_be(vcpu)) {
 			/* guest pre-swabbed data, undo this for writel() */
diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index 6aafc2825c1c..0ae7c2e40e02 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -128,7 +128,7 @@ static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr
 	 * Build an {i,d}abort, depending on the level and the
 	 * instruction set. Report an external synchronous abort.
 	 */
-	if (kvm_vcpu_trap_il_is32bit(vcpu))
+	if (kvm_vcpu_trap_il_is32bit(kvm_vcpu_get_esr(vcpu)))
 		esr |= ESR_ELx_IL;
 
 	/*
@@ -161,7 +161,7 @@ static void inject_undef64(struct kvm_vcpu *vcpu)
 	 * Build an unknown exception, depending on the instruction
 	 * set.
 	 */
-	if (kvm_vcpu_trap_il_is32bit(vcpu))
+	if (kvm_vcpu_trap_il_is32bit(kvm_vcpu_get_esr(vcpu)))
 		esr |= ESR_ELx_IL;
 
 	vcpu_write_sys_reg(vcpu, esr, ESR_EL1);
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 5b61465927b7..012fff834a4b 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2126,6 +2126,7 @@ static void perform_access(struct kvm_vcpu *vcpu,
 			   struct sys_reg_params *params,
 			   const struct sys_reg_desc *r)
 {
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	trace_kvm_sys_access(*vcpu_pc(vcpu), params, r);
 
 	/* Check for regs disabled by runtime config */
@@ -2143,7 +2144,7 @@ static void perform_access(struct kvm_vcpu *vcpu,
 
 	/* Skip instruction if instructed so */
 	if (likely(r->access(vcpu, params, r)))
-		kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
+		kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(esr));
 }
 
 /*
@@ -2180,7 +2181,8 @@ static int emulate_cp(struct kvm_vcpu *vcpu,
 static void unhandled_cp_access(struct kvm_vcpu *vcpu,
 				struct sys_reg_params *params)
 {
-	u8 hsr_ec = kvm_vcpu_trap_get_class(vcpu);
+	u32 esr = kvm_vcpu_get_esr(vcpu);
+	u8 hsr_ec = kvm_vcpu_trap_get_class(esr);
 	int cp = -1;
 
 	switch(hsr_ec) {
@@ -2215,7 +2217,7 @@ static int kvm_handle_cp_64(struct kvm_vcpu *vcpu,
 {
 	struct sys_reg_params params;
 	u32 hsr = kvm_vcpu_get_esr(vcpu);
-	int Rt = kvm_vcpu_sys_get_rt(vcpu);
+	int Rt = kvm_vcpu_sys_get_rt(hsr);
 	int Rt2 = (hsr >> 10) & 0x1f;
 
 	params.is_aarch32 = true;
@@ -2272,7 +2274,7 @@ static int kvm_handle_cp_32(struct kvm_vcpu *vcpu,
 {
 	struct sys_reg_params params;
 	u32 hsr = kvm_vcpu_get_esr(vcpu);
-	int Rt  = kvm_vcpu_sys_get_rt(vcpu);
+	int Rt  = kvm_vcpu_sys_get_rt(hsr);
 
 	params.is_aarch32 = true;
 	params.is_32bit = true;
@@ -2388,7 +2390,7 @@ int kvm_handle_sys_reg(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
 	struct sys_reg_params params;
 	unsigned long esr = kvm_vcpu_get_esr(vcpu);
-	int Rt = kvm_vcpu_sys_get_rt(vcpu);
+	int Rt = kvm_vcpu_sys_get_rt(esr);
 	int ret;
 
 	trace_kvm_handle_sys_reg(esr);
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 48d0ec44ad77..2cbb57485760 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -808,7 +808,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		 * guest time.
 		 */
 		guest_exit();
-		trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
+		trace_kvm_exit(ret,
+			kvm_vcpu_trap_get_class(kvm_vcpu_get_esr(vcpu)),
+			*vcpu_pc(vcpu));
 
 		/* Exit types that need handling before we can be preempted */
 		handle_exit_early(vcpu, run, ret);
diff --git a/virt/kvm/arm/hyp/aarch32.c b/virt/kvm/arm/hyp/aarch32.c
index 864b477e660a..df3055ab3a42 100644
--- a/virt/kvm/arm/hyp/aarch32.c
+++ b/virt/kvm/arm/hyp/aarch32.c
@@ -55,7 +55,7 @@ bool __hyp_text kvm_condition_valid32(const struct kvm_vcpu *vcpu)
 		return true;
 
 	/* Is condition field valid? */
-	cond = kvm_vcpu_get_condition(vcpu);
+	cond = kvm_vcpu_get_condition(kvm_vcpu_get_esr(vcpu));
 	if (cond == 0xE)
 		return true;
 
diff --git a/virt/kvm/arm/hyp/vgic-v3-sr.c b/virt/kvm/arm/hyp/vgic-v3-sr.c
index 8a7a14ec9120..bb2174b8a086 100644
--- a/virt/kvm/arm/hyp/vgic-v3-sr.c
+++ b/virt/kvm/arm/hyp/vgic-v3-sr.c
@@ -1000,14 +1000,13 @@ static void __hyp_text __vgic_v3_write_ctlr(struct kvm_vcpu *vcpu,
 
 int __hyp_text __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu)
 {
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	int rt;
-	u32 esr;
 	u32 vmcr;
 	void (*fn)(struct kvm_vcpu *, u32, int);
 	bool is_read;
 	u32 sysreg;
 
-	esr = kvm_vcpu_get_esr(vcpu);
 	if (vcpu_mode_is_32bit(vcpu)) {
 		if (!kvm_condition_valid(vcpu)) {
 			__kvm_skip_instr(vcpu);
@@ -1119,7 +1118,7 @@ int __hyp_text __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu)
 	}
 
 	vmcr = __vgic_v3_read_vmcr();
-	rt = kvm_vcpu_sys_get_rt(vcpu);
+	rt = kvm_vcpu_sys_get_rt(esr);
 	fn(vcpu, vmcr, rt);
 
 	__kvm_skip_instr(vcpu);
diff --git a/virt/kvm/arm/mmio.c b/virt/kvm/arm/mmio.c
index aedfcff99ac5..d92bee8c75e3 100644
--- a/virt/kvm/arm/mmio.c
+++ b/virt/kvm/arm/mmio.c
@@ -81,6 +81,7 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned int len)
  */
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	unsigned long data;
 	unsigned int len;
 	int mask;
@@ -91,30 +92,30 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
 	vcpu->mmio_needed = 0;
 
-	if (!kvm_vcpu_dabt_iswrite(vcpu)) {
-		len = kvm_vcpu_dabt_get_as(vcpu);
+	if (!kvm_vcpu_dabt_iswrite(esr)) {
+		len = kvm_vcpu_dabt_get_as(esr);
 		data = kvm_mmio_read_buf(run->mmio.data, len);
 
-		if (kvm_vcpu_dabt_issext(vcpu) &&
+		if (kvm_vcpu_dabt_issext(esr) &&
 		    len < sizeof(unsigned long)) {
 			mask = 1U << ((len * 8) - 1);
 			data = (data ^ mask) - mask;
 		}
 
-		if (!kvm_vcpu_dabt_issf(vcpu))
+		if (!kvm_vcpu_dabt_issf(esr))
 			data = data & 0xffffffff;
 
 		trace_kvm_mmio(KVM_TRACE_MMIO_READ, len, run->mmio.phys_addr,
 			       &data);
 		data = vcpu_data_host_to_guest(vcpu, data, len);
-		vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
+		vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(esr), data);
 	}
 
 	/*
 	 * The MMIO instruction is emulated and should not be re-executed
 	 * in the guest.
 	 */
-	kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
+	kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(esr));
 
 	return 0;
 }
@@ -122,6 +123,7 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
 int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		 phys_addr_t fault_ipa)
 {
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	unsigned long data;
 	unsigned long rt;
 	int ret;
@@ -133,10 +135,11 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
 	 * No valid syndrome? Ask userspace for help if it has
 	 * voluntered to do so, and bail out otherwise.
 	 */
-	if (!kvm_vcpu_dabt_isvalid(vcpu)) {
+	if (!kvm_vcpu_dabt_isvalid(esr)) {
 		if (vcpu->kvm->arch.return_nisv_io_abort_to_user) {
 			run->exit_reason = KVM_EXIT_ARM_NISV;
-			run->arm_nisv.esr_iss = kvm_vcpu_dabt_iss_nisv_sanitized(vcpu);
+			run->arm_nisv.esr_iss =
+				kvm_vcpu_dabt_iss_nisv_sanitized(esr);
 			run->arm_nisv.fault_ipa = fault_ipa;
 			return 0;
 		}
@@ -146,7 +149,7 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
 	}
 
 	/* Page table accesses IO mem: tell guest to fix its TTBR */
-	if (kvm_vcpu_dabt_iss1tw(vcpu)) {
+	if (kvm_vcpu_dabt_iss1tw(esr)) {
 		kvm_inject_dabt(vcpu, kvm_vcpu_get_hfar(vcpu));
 		return 1;
 	}
@@ -156,9 +159,9 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
 	 * from the CPU. Then try if some in-kernel emulation feels
 	 * responsible, otherwise let user space do its magic.
 	 */
-	is_write = kvm_vcpu_dabt_iswrite(vcpu);
-	len = kvm_vcpu_dabt_get_as(vcpu);
-	rt = kvm_vcpu_dabt_get_rd(vcpu);
+	is_write = kvm_vcpu_dabt_iswrite(esr);
+	len = kvm_vcpu_dabt_get_as(esr);
+	rt = kvm_vcpu_dabt_get_rd(esr);
 
 	if (is_write) {
 		data = vcpu_data_guest_to_host(vcpu, vcpu_get_reg(vcpu, rt),
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 5da0d0e7519b..e462e0368fd9 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1661,6 +1661,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  unsigned long fault_status)
 {
 	int ret;
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	bool write_fault, writable, force_pte = false;
 	bool exec_fault, needs_exec;
 	unsigned long mmu_seq;
@@ -1674,8 +1675,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	bool logging_active = memslot_is_logging(memslot);
 	unsigned long vma_pagesize, flags = 0;
 
-	write_fault = kvm_is_write_fault(vcpu);
-	exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
+	write_fault = kvm_is_write_fault(esr);
+	exec_fault = kvm_vcpu_trap_is_iabt(esr);
 	VM_BUG_ON(write_fault && exec_fault);
 
 	if (fault_status == FSC_PERM && !write_fault && !exec_fault) {
@@ -1903,6 +1904,7 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
  */
 int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	unsigned long fault_status;
 	phys_addr_t fault_ipa;
 	struct kvm_memory_slot *memslot;
@@ -1911,13 +1913,13 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	gfn_t gfn;
 	int ret, idx;
 
-	fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
+	fault_status = kvm_vcpu_trap_get_fault_type(esr);
 
 	fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
-	is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
+	is_iabt = kvm_vcpu_trap_is_iabt(esr);
 
 	/* Synchronous External Abort? */
-	if (kvm_vcpu_dabt_isextabt(vcpu)) {
+	if (kvm_vcpu_dabt_isextabt(esr)) {
 		/*
 		 * For RAS the host kernel may handle this abort.
 		 * There is no need to pass the error into the guest.
@@ -1938,8 +1940,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
 	    fault_status != FSC_ACCESS) {
 		kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
-			kvm_vcpu_trap_get_class(vcpu),
-			(unsigned long)kvm_vcpu_trap_get_fault(vcpu),
+			kvm_vcpu_trap_get_class(esr),
+			(unsigned long)kvm_vcpu_trap_get_fault(esr),
 			(unsigned long)kvm_vcpu_get_esr(vcpu));
 		return -EFAULT;
 	}
@@ -1949,7 +1951,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	gfn = fault_ipa >> PAGE_SHIFT;
 	memslot = gfn_to_memslot(vcpu->kvm, gfn);
 	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
-	write_fault = kvm_is_write_fault(vcpu);
+	write_fault = kvm_is_write_fault(esr);
 	if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
 		if (is_iabt) {
 			/* Prefetch Abort on I/O address */
@@ -1967,8 +1969,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		 * So let's assume that the guest is just being
 		 * cautious, and skip the instruction.
 		 */
-		if (kvm_vcpu_dabt_is_cm(vcpu)) {
-			kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
+		if (kvm_vcpu_dabt_is_cm(esr)) {
+			kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(esr));
 			ret = 1;
 			goto out_unlock;
 		}
-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH RFCv2 5/9] kvm/arm64: Replace hsr with esr
  2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
                   ` (3 preceding siblings ...)
  2020-05-08  3:29 ` [PATCH RFCv2 4/9] kvm/arm64: Detach ESR operator from vCPU struct Gavin Shan
@ 2020-05-08  3:29 ` Gavin Shan
  2020-05-26 10:45   ` Mark Rutland
  2020-05-08  3:29 ` [PATCH RFCv2 6/9] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode Gavin Shan
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: Gavin Shan @ 2020-05-08  3:29 UTC (permalink / raw)
  To: kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

This replace the variable names to make them self-explaining. The
tracepoint isn't changed accordingly because they're part of ABI:

   * @hsr to @esr
   * @hsr_ec to @ec
   * Use kvm_vcpu_trap_get_class() helper if possible

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/kvm/handle_exit.c | 28 ++++++++++++++--------------
 arch/arm64/kvm/hyp/switch.c  |  9 ++++-----
 arch/arm64/kvm/sys_regs.c    | 30 +++++++++++++++---------------
 3 files changed, 33 insertions(+), 34 deletions(-)

diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 00858db82a64..e3b3dcd5b811 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -123,13 +123,13 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
  */
 static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
-	u32 hsr = kvm_vcpu_get_esr(vcpu);
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 	int ret = 0;
 
 	run->exit_reason = KVM_EXIT_DEBUG;
-	run->debug.arch.hsr = hsr;
+	run->debug.arch.hsr = esr;
 
-	switch (ESR_ELx_EC(hsr)) {
+	switch (kvm_vcpu_trap_get_class(esr)) {
 	case ESR_ELx_EC_WATCHPT_LOW:
 		run->debug.arch.far = vcpu->arch.fault.far_el2;
 		/* fall through */
@@ -139,8 +139,8 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	case ESR_ELx_EC_BRK64:
 		break;
 	default:
-		kvm_err("%s: un-handled case hsr: %#08x\n",
-			__func__, (unsigned int) hsr);
+		kvm_err("%s: un-handled case esr: %#08x\n",
+			__func__, (unsigned int)esr);
 		ret = -1;
 		break;
 	}
@@ -150,10 +150,10 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
 static int kvm_handle_unknown_ec(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
-	u32 hsr = kvm_vcpu_get_esr(vcpu);
+	u32 esr = kvm_vcpu_get_esr(vcpu);
 
-	kvm_pr_unimpl("Unknown exception class: hsr: %#08x -- %s\n",
-		      hsr, esr_get_class_string(hsr));
+	kvm_pr_unimpl("Unknown exception class: esr: %#08x -- %s\n",
+		      esr, esr_get_class_string(esr));
 
 	kvm_inject_undefined(vcpu);
 	return 1;
@@ -230,10 +230,10 @@ static exit_handle_fn arm_exit_handlers[] = {
 
 static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu)
 {
-	u32 hsr = kvm_vcpu_get_esr(vcpu);
-	u8 hsr_ec = ESR_ELx_EC(hsr);
+	u32 esr = kvm_vcpu_get_esr(vcpu);
+	u8 ec = kvm_vcpu_trap_get_class(esr);
 
-	return arm_exit_handlers[hsr_ec];
+	return arm_exit_handlers[ec];
 }
 
 /*
@@ -273,15 +273,15 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
 {
 	if (ARM_SERROR_PENDING(exception_index)) {
 		u32 esr = kvm_vcpu_get_esr(vcpu);
-		u8 hsr_ec = ESR_ELx_EC(esr);
+		u8 ec = kvm_vcpu_trap_get_class(esr);
 
 		/*
 		 * HVC/SMC already have an adjusted PC, which we need
 		 * to correct in order to return to after having
 		 * injected the SError.
 		 */
-		if (hsr_ec == ESR_ELx_EC_HVC32 || hsr_ec == ESR_ELx_EC_HVC64 ||
-		    hsr_ec == ESR_ELx_EC_SMC32 || hsr_ec == ESR_ELx_EC_SMC64) {
+		if (ec == ESR_ELx_EC_HVC32 || ec == ESR_ELx_EC_HVC64 ||
+		    ec == ESR_ELx_EC_SMC32 || ec == ESR_ELx_EC_SMC64) {
 			u32 adj =  kvm_vcpu_trap_il_is32bit(esr) ? 4 : 2;
 			*vcpu_pc(vcpu) -= adj;
 		}
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index 369f22f49f3d..7bf4840bf90e 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -356,8 +356,8 @@ static bool __hyp_text __populate_fault_info(struct kvm_vcpu *vcpu)
 static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
 {
 	u32 esr = kvm_vcpu_get_esr(vcpu);
+	u8 ec = kvm_vcpu_trap_get_class(esr);
 	bool vhe, sve_guest, sve_host;
-	u8 hsr_ec;
 
 	if (!system_supports_fpsimd())
 		return false;
@@ -372,14 +372,13 @@ static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
 		vhe = has_vhe();
 	}
 
-	hsr_ec = kvm_vcpu_trap_get_class(esr);
-	if (hsr_ec != ESR_ELx_EC_FP_ASIMD &&
-	    hsr_ec != ESR_ELx_EC_SVE)
+	if (ec != ESR_ELx_EC_FP_ASIMD &&
+	    ec != ESR_ELx_EC_SVE)
 		return false;
 
 	/* Don't handle SVE traps for non-SVE vcpus here: */
 	if (!sve_guest)
-		if (hsr_ec != ESR_ELx_EC_FP_ASIMD)
+		if (ec != ESR_ELx_EC_FP_ASIMD)
 			return false;
 
 	/* Valid trap.  Switch the context: */
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 012fff834a4b..58f81ab519af 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2182,10 +2182,10 @@ static void unhandled_cp_access(struct kvm_vcpu *vcpu,
 				struct sys_reg_params *params)
 {
 	u32 esr = kvm_vcpu_get_esr(vcpu);
-	u8 hsr_ec = kvm_vcpu_trap_get_class(esr);
+	u8 ec = kvm_vcpu_trap_get_class(esr);
 	int cp = -1;
 
-	switch(hsr_ec) {
+	switch (ec) {
 	case ESR_ELx_EC_CP15_32:
 	case ESR_ELx_EC_CP15_64:
 		cp = 15;
@@ -2216,17 +2216,17 @@ static int kvm_handle_cp_64(struct kvm_vcpu *vcpu,
 			    size_t nr_specific)
 {
 	struct sys_reg_params params;
-	u32 hsr = kvm_vcpu_get_esr(vcpu);
-	int Rt = kvm_vcpu_sys_get_rt(hsr);
-	int Rt2 = (hsr >> 10) & 0x1f;
+	u32 esr = kvm_vcpu_get_esr(vcpu);
+	int Rt = kvm_vcpu_sys_get_rt(esr);
+	int Rt2 = (esr >> 10) & 0x1f;
 
 	params.is_aarch32 = true;
 	params.is_32bit = false;
-	params.CRm = (hsr >> 1) & 0xf;
-	params.is_write = ((hsr & 1) == 0);
+	params.CRm = (esr >> 1) & 0xf;
+	params.is_write = ((esr & 1) == 0);
 
 	params.Op0 = 0;
-	params.Op1 = (hsr >> 16) & 0xf;
+	params.Op1 = (esr >> 16) & 0xf;
 	params.Op2 = 0;
 	params.CRn = 0;
 
@@ -2273,18 +2273,18 @@ static int kvm_handle_cp_32(struct kvm_vcpu *vcpu,
 			    size_t nr_specific)
 {
 	struct sys_reg_params params;
-	u32 hsr = kvm_vcpu_get_esr(vcpu);
-	int Rt  = kvm_vcpu_sys_get_rt(hsr);
+	u32 esr = kvm_vcpu_get_esr(vcpu);
+	int Rt = kvm_vcpu_sys_get_rt(esr);
 
 	params.is_aarch32 = true;
 	params.is_32bit = true;
-	params.CRm = (hsr >> 1) & 0xf;
+	params.CRm = (esr >> 1) & 0xf;
 	params.regval = vcpu_get_reg(vcpu, Rt);
-	params.is_write = ((hsr & 1) == 0);
-	params.CRn = (hsr >> 10) & 0xf;
+	params.is_write = ((esr & 1) == 0);
+	params.CRn = (esr >> 10) & 0xf;
 	params.Op0 = 0;
-	params.Op1 = (hsr >> 14) & 0x7;
-	params.Op2 = (hsr >> 17) & 0x7;
+	params.Op1 = (esr >> 14) & 0x7;
+	params.Op2 = (esr >> 17) & 0x7;
 
 	if (!emulate_cp(vcpu, &params, target_specific, nr_specific) ||
 	    !emulate_cp(vcpu, &params, global, nr_global)) {
-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH RFCv2 6/9] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
                   ` (4 preceding siblings ...)
  2020-05-08  3:29 ` [PATCH RFCv2 5/9] kvm/arm64: Replace hsr with esr Gavin Shan
@ 2020-05-08  3:29 ` Gavin Shan
  2020-05-26 10:58   ` Mark Rutland
  2020-05-08  3:29 ` [PATCH RFCv2 7/9] kvm/arm64: Support async page fault Gavin Shan
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: Gavin Shan @ 2020-05-08  3:29 UTC (permalink / raw)
  To: kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

This renames user_mem_abort() to kvm_handle_user_mem_abort(), and
then export it. The function will be used in asynchronous page fault
to populate a page table entry once the corresponding page is populated
from the backup device (e.g. swap partition):

   * Parameter @fault_status is replace by @esr.
   * The parameters are reorder based on their importance.

This shouldn't cause any functional changes.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/kvm_host.h |  4 ++++
 virt/kvm/arm/mmu.c                | 14 ++++++++------
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 32c8a675e5a4..f77c706777ec 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -437,6 +437,10 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 			      struct kvm_vcpu_events *events);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
+			      struct kvm_memory_slot *memslot,
+			      phys_addr_t fault_ipa, unsigned long hva,
+			      bool prefault);
 int kvm_unmap_hva_range(struct kvm *kvm,
 			unsigned long start, unsigned long end);
 int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index e462e0368fd9..95aaabb2b1fc 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1656,12 +1656,12 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
 	       (hva & ~(map_size - 1)) + map_size <= uaddr_end;
 }
 
-static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
-			  struct kvm_memory_slot *memslot, unsigned long hva,
-			  unsigned long fault_status)
+int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
+			      struct kvm_memory_slot *memslot,
+			      phys_addr_t fault_ipa, unsigned long hva,
+			      bool prefault)
 {
-	int ret;
-	u32 esr = kvm_vcpu_get_esr(vcpu);
+	unsigned int fault_status = kvm_vcpu_trap_get_fault_type(esr);
 	bool write_fault, writable, force_pte = false;
 	bool exec_fault, needs_exec;
 	unsigned long mmu_seq;
@@ -1674,6 +1674,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	pgprot_t mem_type = PAGE_S2;
 	bool logging_active = memslot_is_logging(memslot);
 	unsigned long vma_pagesize, flags = 0;
+	int ret;
 
 	write_fault = kvm_is_write_fault(esr);
 	exec_fault = kvm_vcpu_trap_is_iabt(esr);
@@ -1995,7 +1996,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		goto out_unlock;
 	}
 
-	ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
+	ret = kvm_handle_user_mem_abort(vcpu, esr, memslot,
+					fault_ipa, hva, false);
 	if (ret == 0)
 		ret = 1;
 out:
-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH RFCv2 7/9] kvm/arm64: Support async page fault
  2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
                   ` (5 preceding siblings ...)
  2020-05-08  3:29 ` [PATCH RFCv2 6/9] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode Gavin Shan
@ 2020-05-08  3:29 ` Gavin Shan
  2020-05-26 12:34   ` Mark Rutland
  2020-05-08  3:29 ` [PATCH RFCv2 8/9] kernel/sched: Add cpu_rq_is_locked() Gavin Shan
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: Gavin Shan @ 2020-05-08  3:29 UTC (permalink / raw)
  To: kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

There are two stages of fault pages and the stage one page fault is
handled by guest itself. The guest is trapped to host when the page
fault is caused by stage 2 page table, for example missing. The guest
is suspended until the requested page is populated. To populate the
requested page can be related to IO activities if the page was swapped
out previously. In this case, the guest has to suspend for a few of
milliseconds at least, regardless of the overall system load. There
is no useful work done during the suspended period from guest's view.

This adds asychornous page fault to improve the situation. A signal
(PAGE_NOT_PRESENT) is sent to guest if the requested page needs some time
to be populated. Guest might reschedule to another running process if
possible. Otherwise, the vCPU is put into power-saving mode, which is
actually to cause vCPU reschedule from host's view. A followup signal
(PAGE_READY) is sent to guest once the requested page is populated.
The suspended task is waken up or scheduled when guest receives the
signal. With this mechanism, the vCPU won't be stuck when the requested
page is being populated by host.

There are more details highlighted as below. Note the implementation is
similar to what x86 has to some extent:

   * A dedicated SMCCC ID is reserved to enable, disable or configure
     the functionality. The only 64-bits parameter is conveyed by two
     registers (w2/w1). Bits[63:56] is the bitmap used to specify the
     operated functionality like enabling/disabling/configuration. The
     bits[55:6] is the physical address of control block or external
     data abort injection disallowed region. Bit[5:0] are used to pass
     control flags.

   * Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
     isn't ready. In the mean while, a work is started to populate the
     page asynchronously in background. The stage 2 page table entry is
     updated accordingly and another signal (PAGE_READY) is fired after
     the request page is populted. The signals is notified by injected
     data abort fault.

   * The signals are fired and consumed in sequential fashion. It means
     no more signals will be fired if there is pending one, awaiting the
     guest to consume. It's because the injected data abort faults have
     to be done in sequential fashion.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/kvm_host.h      |  43 ++++
 arch/arm64/include/asm/kvm_para.h      |  27 ++
 arch/arm64/include/uapi/asm/Kbuild     |   2 -
 arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
 arch/arm64/kvm/Kconfig                 |   1 +
 arch/arm64/kvm/Makefile                |   2 +
 include/linux/arm-smccc.h              |   6 +
 virt/kvm/arm/arm.c                     |  36 ++-
 virt/kvm/arm/async_pf.c                | 335 +++++++++++++++++++++++++
 virt/kvm/arm/hypercalls.c              |   8 +
 virt/kvm/arm/mmu.c                     |  29 ++-
 11 files changed, 506 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_para.h
 create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
 create mode 100644 virt/kvm/arm/async_pf.c

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index f77c706777ec..a207728d6f3f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -250,6 +250,23 @@ struct vcpu_reset_state {
 	bool		reset;
 };
 
+#ifdef CONFIG_KVM_ASYNC_PF
+
+/* Should be a power of two number */
+#define ASYNC_PF_PER_VCPU	64
+
+/*
+ * The association of gfn and token. The token will be sent to guest as
+ * page fault address. Also, the guest could be in aarch32 mode. So its
+ * length should be 32-bits.
+ */
+struct kvm_arch_async_pf {
+	u32     token;
+	gfn_t   gfn;
+	u32	esr;
+};
+#endif /* CONFIG_KVM_ASYNC_PF */
+
 struct kvm_vcpu_arch {
 	struct kvm_cpu_context ctxt;
 	void *sve_state;
@@ -351,6 +368,17 @@ struct kvm_vcpu_arch {
 		u64 last_steal;
 		gpa_t base;
 	} steal;
+
+#ifdef CONFIG_KVM_ASYNC_PF
+	struct {
+		struct gfn_to_hva_cache	cache;
+		gfn_t			gfns[ASYNC_PF_PER_VCPU];
+		u64			control_block;
+		u16			id;
+		bool			send_user_only;
+		u64			no_fault_inst_range;
+	} apf;
+#endif
 };
 
 /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
@@ -604,6 +632,21 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 
 static inline void __cpu_init_stage2(void) {}
 
+#ifdef CONFIG_KVM_ASYNC_PF
+bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn);
+bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu);
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
+			    gpa_t gpa, gfn_t gfn);
+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work);
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work);
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+			       struct kvm_async_pf *work);
+long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu);
+#endif /* CONFIG_KVM_ASYNC_PF */
+
 /* Guest/host FPSIMD coordination helpers */
 int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
new file mode 100644
index 000000000000..0ea481dd1c7a
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_para.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM_KVM_PARA_H
+#define _ASM_ARM_KVM_PARA_H
+
+#include <uapi/asm/kvm_para.h>
+
+static inline bool kvm_check_and_clear_guest_paused(void)
+{
+	return false;
+}
+
+static inline unsigned int kvm_arch_para_features(void)
+{
+	return 0;
+}
+
+static inline unsigned int kvm_arch_para_hints(void)
+{
+	return 0;
+}
+
+static inline bool kvm_para_available(void)
+{
+	return false;
+}
+
+#endif /* _ASM_ARM_KVM_PARA_H */
diff --git a/arch/arm64/include/uapi/asm/Kbuild b/arch/arm64/include/uapi/asm/Kbuild
index 602d137932dc..f66554cd5c45 100644
--- a/arch/arm64/include/uapi/asm/Kbuild
+++ b/arch/arm64/include/uapi/asm/Kbuild
@@ -1,3 +1 @@
 # SPDX-License-Identifier: GPL-2.0
-
-generic-y += kvm_para.h
diff --git a/arch/arm64/include/uapi/asm/kvm_para.h b/arch/arm64/include/uapi/asm/kvm_para.h
new file mode 100644
index 000000000000..e0bd0e579b9a
--- /dev/null
+++ b/arch/arm64/include/uapi/asm/kvm_para.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_ARM_KVM_PARA_H
+#define _UAPI_ASM_ARM_KVM_PARA_H
+
+#include <linux/types.h>
+
+#define KVM_FEATURE_ASYNC_PF	0
+
+/* Async PF */
+#define KVM_ASYNC_PF_ENABLED		(1 << 0)
+#define KVM_ASYNC_PF_SEND_ALWAYS	(1 << 1)
+
+#define KVM_PV_REASON_PAGE_NOT_PRESENT	1
+#define KVM_PV_REASON_PAGE_READY	2
+
+struct kvm_vcpu_pv_apf_data {
+	__u32	reason;
+	__u8	pad[60];
+	__u32	enabled;
+};
+
+#endif /* _UAPI_ASM_ARM_KVM_PARA_H */
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 449386d76441..1053e16b1739 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -34,6 +34,7 @@ config KVM
 	select KVM_VFIO
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_IRQFD
+	select KVM_ASYNC_PF
 	select KVM_ARM_PMU if HW_PERF_EVENTS
 	select HAVE_KVM_MSI
 	select HAVE_KVM_IRQCHIP
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 5ffbdc39e780..3be24c1e401f 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -37,3 +37,5 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic/vgic-debug.o
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/irqchip.o
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
 kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
+kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
+kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/arm/async_pf.o
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index bdc0124a064a..22007dd3b9f0 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -94,6 +94,7 @@
 
 /* KVM "vendor specific" services */
 #define ARM_SMCCC_KVM_FUNC_FEATURES		0
+#define ARM_SMCCC_KVM_FUNC_APF			1
 #define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
 #define ARM_SMCCC_KVM_NUM_FUNCS			128
 
@@ -102,6 +103,11 @@
 			   ARM_SMCCC_SMC_32,				\
 			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
 			   ARM_SMCCC_KVM_FUNC_FEATURES)
+#define ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID				\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,				\
+			   ARM_SMCCC_SMC_32,				\
+			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
+			   ARM_SMCCC_KVM_FUNC_APF)
 
 #ifndef __ASSEMBLY__
 
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 2cbb57485760..3f62899cef13 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -222,6 +222,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		 */
 		r = 1;
 		break;
+#ifdef CONFIG_KVM_ASYNC_PF
+	case KVM_CAP_ASYNC_PF:
+		r = 1;
+		break;
+#endif
 	default:
 		r = kvm_arch_vm_ioctl_check_extension(kvm, ext);
 		break;
@@ -269,6 +274,10 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	/* Force users to call KVM_ARM_VCPU_INIT */
 	vcpu->arch.target = -1;
 	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
+#ifdef CONFIG_KVM_ASYNC_PF
+	vcpu->arch.apf.control_block = 0UL;
+	vcpu->arch.apf.no_fault_inst_range = 0x800;
+#endif
 
 	/* Set up the timer */
 	kvm_timer_vcpu_init(vcpu);
@@ -426,8 +435,27 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
 {
 	bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF);
-	return ((irq_lines || kvm_vgic_vcpu_pending_irq(v))
-		&& !v->arch.power_off && !v->arch.pause);
+
+	if ((irq_lines || kvm_vgic_vcpu_pending_irq(v)) &&
+	    !v->arch.power_off && !v->arch.pause)
+		return true;
+
+#ifdef CONFIG_KVM_ASYNC_PF
+	if (v->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
+		u32 val;
+		int ret;
+
+		if (!list_empty_careful(&v->async_pf.done))
+			return true;
+
+		ret = kvm_read_guest_cached(v->kvm, &v->arch.apf.cache,
+					    &val, sizeof(val));
+		if (ret || val)
+			return true;
+	}
+#endif
+
+	return false;
 }
 
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
@@ -683,6 +711,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
 		check_vcpu_requests(vcpu);
 
+#ifdef CONFIG_KVM_ASYNC_PF
+		kvm_check_async_pf_completion(vcpu);
+#endif
+
 		/*
 		 * Preparing the interrupts to be injected also
 		 * involves poking the GIC, which must be done in a
diff --git a/virt/kvm/arm/async_pf.c b/virt/kvm/arm/async_pf.c
new file mode 100644
index 000000000000..5be49d684de3
--- /dev/null
+++ b/virt/kvm/arm/async_pf.c
@@ -0,0 +1,335 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Asynchronous Page Fault Support
+ *
+ * Copyright (C) 2020 Red Hat, Inc., Gavin Shan
+ *
+ * Based on arch/x86/kernel/kvm.c
+ */
+
+#include <linux/arm-smccc.h>
+#include <linux/kvm_host.h>
+#include <asm/kvm_emulate.h>
+#include <kvm/arm_hypercalls.h>
+
+static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
+{
+	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
+}
+
+static inline u32 kvm_async_pf_hash_next(u32 key)
+{
+	return (key + 1) & (ASYNC_PF_PER_VCPU - 1);
+}
+
+static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
+{
+	int i;
+
+	for (i = 0; i < ASYNC_PF_PER_VCPU; i++)
+		vcpu->arch.apf.gfns[i] = ~0;
+}
+
+/*
+ * Add gfn to the hash table. It's ensured there is a free entry
+ * when this function is called.
+ */
+static void kvm_async_pf_hash_add(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_fn(gfn);
+
+	while (vcpu->arch.apf.gfns[key] != ~0)
+		key = kvm_async_pf_hash_next(key);
+
+	vcpu->arch.apf.gfns[key] = gfn;
+}
+
+static u32 kvm_async_pf_hash_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_fn(gfn);
+	int i;
+
+	for (i = 0; i < ASYNC_PF_PER_VCPU; i++) {
+		if (vcpu->arch.apf.gfns[key] == gfn ||
+		    vcpu->arch.apf.gfns[key] == ~0)
+			break;
+
+		key = kvm_async_pf_hash_next(key);
+	}
+
+	return key;
+}
+
+static void kvm_async_pf_hash_remove(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 i, j, k;
+
+	i = j = kvm_async_pf_hash_slot(vcpu, gfn);
+	while (true) {
+		vcpu->arch.apf.gfns[i] = ~0;
+
+		do {
+			j = kvm_async_pf_hash_next(j);
+			if (vcpu->arch.apf.gfns[j] == ~0)
+				return;
+
+			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
+			/*
+			 * k lies cyclically in ]i,j]
+			 * |    i.k.j |
+			 * |....j i.k.| or  |.k..j i...|
+			 */
+		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
+
+		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
+		i = j;
+	}
+}
+
+bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_slot(vcpu, gfn);
+
+	return vcpu->arch.apf.gfns[key] == gfn;
+}
+
+static inline int kvm_async_pf_read_cache(struct kvm_vcpu *vcpu, u32 *val)
+{
+	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
+				     val, sizeof(*val));
+}
+
+static inline int kvm_async_pf_write_cache(struct kvm_vcpu *vcpu, u32 val)
+{
+	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
+				      &val, sizeof(val));
+}
+
+bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
+{
+	u64 vbar, pc;
+	u32 val;
+	int ret;
+
+	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
+		return false;
+
+	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
+		return false;
+
+	/* Pending page fault, which ins't acknowledged by guest */
+	ret = kvm_async_pf_read_cache(vcpu, &val);
+	if (ret || val)
+		return false;
+
+	/*
+	 * Events can't be injected through data abort because it's
+	 * going to clobber ELR_EL1, which might not consued (or saved)
+	 * by guest yet.
+	 */
+	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
+	pc = *vcpu_pc(vcpu);
+	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
+		return false;
+
+	return true;
+}
+
+/*
+ * We need deliver the page present signal as quick as possible because
+ * it's performance critical. So the signal is delivered no matter which
+ * privilege level the guest has. It's possible the signal can't be handled
+ * by the guest immediately. However, host doesn't contribute the delay
+ * anyway.
+ */
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
+{
+	u64 vbar, pc;
+	u32 val;
+	int ret;
+
+	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
+		return true;
+
+	/* Pending page fault, which ins't acknowledged by guest */
+	ret = kvm_async_pf_read_cache(vcpu, &val);
+	if (ret || val)
+		return false;
+
+	/*
+	 * Events can't be injected through data abort because it's
+	 * going to clobber ELR_EL1, which might not consued (or saved)
+	 * by guest yet.
+	 */
+	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
+	pc = *vcpu_pc(vcpu);
+	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
+		return false;
+
+	return true;
+}
+
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
+			    gpa_t gpa, gfn_t gfn)
+{
+	struct kvm_arch_async_pf arch;
+	unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gfn);
+
+	arch.token = (vcpu->arch.apf.id++ << 16) | vcpu->vcpu_id;
+	arch.gfn = gfn;
+	arch.esr = esr;
+
+	return kvm_setup_async_pf(vcpu, gpa, hva, &arch);
+}
+
+/*
+ * It's garanteed that no pending asynchronous page fault when this is
+ * called. It means all previous issued asynchronous page faults have
+ * been acknoledged.
+ */
+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work)
+{
+	int ret;
+
+	kvm_async_pf_hash_add(vcpu, work->arch.gfn);
+	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT);
+	if (ret) {
+		kvm_err("%s: Error %d writing cache\n", __func__, ret);
+		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
+		return;
+	}
+
+	kvm_inject_dabt(vcpu, work->arch.token);
+}
+
+/*
+ * It's garanteed that no pending asynchronous page fault when this is
+ * called. It means all previous issued asynchronous page faults have
+ * been acknoledged.
+ */
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				 struct kvm_async_pf *work)
+{
+	int ret;
+
+	/* Broadcast wakeup */
+	if (work->wakeup_all)
+		work->arch.token = ~0;
+	else
+		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
+
+	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_READY);
+	if (ret) {
+		kvm_err("%s: Error %d writing cache\n", __func__, ret);
+		return;
+	}
+
+	kvm_inject_dabt(vcpu, work->arch.token);
+}
+
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+			       struct kvm_async_pf *work)
+{
+	struct kvm_memory_slot *memslot;
+	unsigned int esr = work->arch.esr;
+	phys_addr_t gpa = work->cr2_or_gpa;
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+	unsigned long hva;
+	bool write_fault, writable;
+	int idx;
+
+	/*
+	 * We shouldn't issue prefault for special work to wake up
+	 * all pending tasks because the associated token (address)
+	 * is invalid.
+	 */
+	if (work->wakeup_all)
+		return;
+
+	/*
+	 * The gpa was validated before the work is started. However, the
+	 * memory slots might be changed since then. So we need to redo the
+	 * validatation here.
+	 */
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+	write_fault = kvm_is_write_fault(esr);
+	memslot = gfn_to_memslot(vcpu->kvm, gfn);
+	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
+	if (kvm_is_error_hva(hva) || (write_fault && !writable))
+		goto out;
+
+	kvm_handle_user_mem_abort(vcpu, esr, memslot, gpa, hva, true);
+
+out:
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+}
+
+static long kvm_async_pf_update_enable_reg(struct kvm_vcpu *vcpu, u64 data)
+{
+	bool enabled, enable;
+	gpa_t gpa = (data & ~0x3F);
+	int ret;
+
+	enabled = !!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED);
+	enable = !!(data & KVM_ASYNC_PF_ENABLED);
+	if (enable == enabled) {
+		kvm_debug("%s: Async PF has been %s (0x%llx -> 0x%llx)\n",
+			  __func__, enabled ? "enabled" : "disabled",
+			  vcpu->arch.apf.control_block, data);
+		return SMCCC_RET_NOT_REQUIRED;
+	}
+
+	if (enable) {
+		ret = kvm_gfn_to_hva_cache_init(
+			vcpu->kvm, &vcpu->arch.apf.cache,
+			gpa + offsetof(struct kvm_vcpu_pv_apf_data, reason),
+			sizeof(u32));
+		if (ret) {
+			kvm_err("%s: Error %d initializing cache on 0x%llx\n",
+				__func__, ret, data);
+			return SMCCC_RET_NOT_SUPPORTED;
+		}
+
+		kvm_async_pf_hash_reset(vcpu);
+		vcpu->arch.apf.send_user_only =
+			!(data & KVM_ASYNC_PF_SEND_ALWAYS);
+		kvm_async_pf_wakeup_all(vcpu);
+		vcpu->arch.apf.control_block = data;
+	} else {
+		kvm_clear_async_pf_completion_queue(vcpu);
+		vcpu->arch.apf.control_block = data;
+	}
+
+	return SMCCC_RET_SUCCESS;
+}
+
+long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu)
+{
+	u64 data, func, val, range;
+	long ret = SMCCC_RET_SUCCESS;
+
+	data = (smccc_get_arg2(vcpu) << 32) | smccc_get_arg1(vcpu);
+	func = data & (0xfful << 56);
+	val = data & ~(0xfful << 56);
+	switch (func) {
+	case BIT(63):
+		ret = kvm_async_pf_update_enable_reg(vcpu, val);
+		break;
+	case BIT(62):
+		if (vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
+			ret = SMCCC_RET_NOT_SUPPORTED;
+			break;
+		}
+
+		range = vcpu->arch.apf.no_fault_inst_range;
+		vcpu->arch.apf.no_fault_inst_range = max(range, val);
+		break;
+	default:
+		kvm_err("%s: Unrecognized function 0x%llx\n", __func__, func);
+		ret = SMCCC_RET_NOT_SUPPORTED;
+	}
+
+	return ret;
+}
diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
index db6dce3d0e23..a7e0fe17e2f1 100644
--- a/virt/kvm/arm/hypercalls.c
+++ b/virt/kvm/arm/hypercalls.c
@@ -70,7 +70,15 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 		break;
 	case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
 		val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
+#ifdef CONFIG_KVM_ASYNC_PF
+		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_APF);
+#endif
 		break;
+#ifdef CONFIG_KVM_ASYNC_PF
+	case ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID:
+		val[0] = kvm_async_pf_hypercall(vcpu);
+		break;
+#endif
 	default:
 		return kvm_psci_call(vcpu);
 	}
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 95aaabb2b1fc..a303815845a2 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1656,6 +1656,30 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
 	       (hva & ~(map_size - 1)) + map_size <= uaddr_end;
 }
 
+static bool try_async_pf(struct kvm_vcpu *vcpu, u32 esr, gpa_t gpa,
+			 gfn_t gfn, kvm_pfn_t *pfn, bool write,
+			 bool *writable, bool prefault)
+{
+	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+#ifdef CONFIG_KVM_ASYNC_PF
+	bool async = false;
+
+	/* Bail if *pfn has correct page */
+	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable);
+	if (!async)
+		return false;
+
+	if (!prefault && kvm_arch_can_inject_async_page_not_present(vcpu)) {
+		if (kvm_async_pf_hash_find(vcpu, gfn) ||
+		    kvm_arch_setup_async_pf(vcpu, esr, gpa, gfn))
+			return true;
+	}
+#endif
+
+	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
+	return false;
+}
+
 int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
 			      struct kvm_memory_slot *memslot,
 			      phys_addr_t fault_ipa, unsigned long hva,
@@ -1737,7 +1761,10 @@ int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
 	 */
 	smp_rmb();
 
-	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
+	if (try_async_pf(vcpu, esr, fault_ipa, gfn, &pfn,
+			 write_fault, &writable, prefault))
+		return 1;
+
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH RFCv2 8/9] kernel/sched: Add cpu_rq_is_locked()
  2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
                   ` (6 preceding siblings ...)
  2020-05-08  3:29 ` [PATCH RFCv2 7/9] kvm/arm64: Support async page fault Gavin Shan
@ 2020-05-08  3:29 ` Gavin Shan
  2020-05-08  3:29 ` [PATCH RFCv2 9/9] arm64: Support async page fault Gavin Shan
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-08  3:29 UTC (permalink / raw)
  To: kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

This adds API cpu_rq_is_locked() to check if the CPU's runqueue has been
locked or not. It's used in the subsequent patch to determine the task
wakeup should be executed immediately or delayed.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..e68882443da7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1691,6 +1691,7 @@ extern struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *n
  */
 extern struct task_struct *find_get_task_by_vpid(pid_t nr);
 
+extern bool cpu_rq_is_locked(int cpu);
 extern int wake_up_state(struct task_struct *tsk, unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..30f4a8845495 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -73,6 +73,14 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+bool cpu_rq_is_locked(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	return raw_spin_is_locked(&rq->lock) ? true :  false;
+}
+EXPORT_SYMBOL_GPL(cpu_rq_is_locked);
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH RFCv2 9/9] arm64: Support async page fault
  2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
                   ` (7 preceding siblings ...)
  2020-05-08  3:29 ` [PATCH RFCv2 8/9] kernel/sched: Add cpu_rq_is_locked() Gavin Shan
@ 2020-05-08  3:29 ` Gavin Shan
  2020-05-26 12:56   ` Mark Rutland
  2020-05-27  6:48   ` Paolo Bonzini
  2020-05-25 23:39 ` [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
  2020-05-26 13:09 ` Mark Rutland
  10 siblings, 2 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-08  3:29 UTC (permalink / raw)
  To: kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

This supports asynchronous page fault for the guest. The design is
similar to what x86 has: on receiving a PAGE_NOT_PRESENT signal from
the host, the current task is either rescheduled or put into power
saving mode. The task will be waken up when PAGE_READY signal is
received. The PAGE_READY signal might be received in the context
of the suspended process, to be waken up. That means the suspended
process has to wake up itself, but it's not safe and prone to cause
dead-lock on CPU runqueue lock. So the wakeup is delayed on returning
from kernel space to user space or idle process is picked for running.

The signals are conveyed through the async page fault control block,
which was passed to host on enabling the functionality. On each page
fault, the control block is checked and switch to the async page fault
handling flow if any signals exist.

The feature is put into the CONFIG_KVM_GUEST umbrella, which is added
by this patch. So we have inline functions implemented in kvm_para.h,
like other architectures do, to check if async page fault (one of the
KVM para-virtualized features) is available. Also, the kernel boot
parameter "no-kvmapf" can be specified to disable the feature.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/Kconfig                 |  11 +
 arch/arm64/include/asm/exception.h |   3 +
 arch/arm64/include/asm/kvm_para.h  |  27 +-
 arch/arm64/kernel/entry.S          |  33 +++
 arch/arm64/kernel/process.c        |   4 +
 arch/arm64/mm/fault.c              | 434 +++++++++++++++++++++++++++++
 6 files changed, 505 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 40fb05d96c60..2d5e5ee62d6d 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1045,6 +1045,17 @@ config PARAVIRT
 	  under a hypervisor, potentially improving performance significantly
 	  over full virtualization.
 
+config KVM_GUEST
+	bool "KVM Guest Support"
+	depends on PARAVIRT
+	default y
+	help
+	  This option enables various optimizations for running under the KVM
+	  hypervisor. Overhead for the kernel when not running inside KVM should
+	  be minimal.
+
+	  In case of doubt, say Y
+
 config PARAVIRT_TIME_ACCOUNTING
 	bool "Paravirtual steal time accounting"
 	select PARAVIRT
diff --git a/arch/arm64/include/asm/exception.h b/arch/arm64/include/asm/exception.h
index 7a6e81ca23a8..d878afa42746 100644
--- a/arch/arm64/include/asm/exception.h
+++ b/arch/arm64/include/asm/exception.h
@@ -46,4 +46,7 @@ void bad_el0_sync(struct pt_regs *regs, int reason, unsigned int esr);
 void do_cp15instr(unsigned int esr, struct pt_regs *regs);
 void do_el0_svc(struct pt_regs *regs);
 void do_el0_svc_compat(struct pt_regs *regs);
+#ifdef CONFIG_KVM_GUEST
+void kvm_async_pf_delayed_wake(void);
+#endif
 #endif	/* __ASM_EXCEPTION_H */
diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
index 0ea481dd1c7a..b2f8ef243df7 100644
--- a/arch/arm64/include/asm/kvm_para.h
+++ b/arch/arm64/include/asm/kvm_para.h
@@ -3,6 +3,20 @@
 #define _ASM_ARM_KVM_PARA_H
 
 #include <uapi/asm/kvm_para.h>
+#include <linux/of.h>
+#include <asm/hypervisor.h>
+
+#ifdef CONFIG_KVM_GUEST
+static inline int kvm_para_available(void)
+{
+	return 1;
+}
+#else
+static inline int kvm_para_available(void)
+{
+	return 0;
+}
+#endif /* CONFIG_KVM_GUEST */
 
 static inline bool kvm_check_and_clear_guest_paused(void)
 {
@@ -11,17 +25,16 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 
 static inline unsigned int kvm_arch_para_features(void)
 {
-	return 0;
+	unsigned int features = 0;
+
+	if (kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_APF))
+		features |= (1 << KVM_FEATURE_ASYNC_PF);
+
+	return features;
 }
 
 static inline unsigned int kvm_arch_para_hints(void)
 {
 	return 0;
 }
-
-static inline bool kvm_para_available(void)
-{
-	return false;
-}
-
 #endif /* _ASM_ARM_KVM_PARA_H */
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index ddcde093c433..15efd57129ff 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -751,12 +751,45 @@ finish_ret_to_user:
 	enable_step_tsk x1, x2
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
 	bl	stackleak_erase
+#endif
+#ifdef CONFIG_KVM_GUEST
+	bl	kvm_async_pf_delayed_wake
 #endif
 	kernel_exit 0
 ENDPROC(ret_to_user)
 
 	.popsection				// .entry.text
 
+#ifdef CONFIG_KVM_GUEST
+	.pushsection ".rodata", "a"
+SYM_DATA_START(__exception_handlers_offset)
+	.quad	0
+	.quad	0
+	.quad	0
+	.quad	0
+	.quad	el1_sync - vectors
+	.quad	el1_irq - vectors
+	.quad	0
+	.quad	el1_error - vectors
+	.quad	el0_sync - vectors
+	.quad	el0_irq - vectors
+	.quad	0
+	.quad	el0_error - vectors
+#ifdef CONFIG_COMPAT
+	.quad	el0_sync_compat - vectors
+	.quad	el0_irq_compat - vectors
+	.quad	0
+	.quad	el0_error_compat - vectors
+#else
+	.quad	0
+	.quad	0
+	.quad	0
+	.quad	0
+#endif
+SYM_DATA_END(__exception_handlers_offset)
+	.popsection				// .rodata
+#endif /* CONFIG_KVM_GUEST */
+
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
 /*
  * Exception vectors trampoline.
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 56be4cbf771f..5e7ee553566d 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -53,6 +53,7 @@
 #include <asm/processor.h>
 #include <asm/pointer_auth.h>
 #include <asm/stacktrace.h>
+#include <asm/exception.h>
 
 #if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_STACKPROTECTOR_PER_TASK)
 #include <linux/stackprotector.h>
@@ -70,6 +71,9 @@ void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);
 
 static void __cpu_do_idle(void)
 {
+#ifdef CONFIG_KVM_GUEST
+	kvm_async_pf_delayed_wake();
+#endif
 	dsb(sy);
 	wfi();
 }
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index c9cedc0432d2..cbf8b52135c9 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -19,10 +19,12 @@
 #include <linux/page-flags.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/debug.h>
+#include <linux/swait.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
 #include <linux/preempt.h>
 #include <linux/hugetlb.h>
+#include <linux/kvm_para.h>
 
 #include <asm/acpi.h>
 #include <asm/bug.h>
@@ -48,8 +50,31 @@ struct fault_info {
 	const char *name;
 };
 
+#ifdef CONFIG_KVM_GUEST
+struct kvm_task_sleep_node {
+	struct hlist_node	link;
+	struct swait_queue_head	wq;
+	u32			token;
+	struct task_struct	*task;
+	int			cpu;
+	bool			halted;
+	bool			delayed;
+};
+
+struct kvm_task_sleep_head {
+	raw_spinlock_t		lock;
+	struct hlist_head	list;
+};
+#endif /* CONFIG_KVM_GUEST */
+
 static const struct fault_info fault_info[];
 static struct fault_info debug_fault_info[];
+#ifdef CONFIG_KVM_GUEST
+extern char __exception_handlers_offset[];
+static bool async_pf_available = true;
+static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_data) __aligned(64);
+static DEFINE_PER_CPU(struct kvm_task_sleep_head, apf_head);
+#endif
 
 static inline const struct fault_info *esr_to_fault_info(unsigned int esr)
 {
@@ -717,10 +742,281 @@ static const struct fault_info fault_info[] = {
 	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 63"			},
 };
 
+#ifdef CONFIG_KVM_GUEST
+static inline unsigned int kvm_async_pf_read_enabled(void)
+{
+	return __this_cpu_read(apf_data.enabled);
+}
+
+static inline void kvm_async_pf_write_enabled(unsigned int val)
+{
+	__this_cpu_write(apf_data.enabled, val);
+}
+
+static inline unsigned int kvm_async_pf_read_reason(void)
+{
+	return __this_cpu_read(apf_data.reason);
+}
+
+static inline void kvm_async_pf_write_reason(unsigned int val)
+{
+	__this_cpu_write(apf_data.reason, val);
+}
+
+#define kvm_async_pf_lock(b, flags)					\
+	raw_spin_lock_irqsave(&(b)->lock, (flags))
+#define kvm_async_pf_trylock(b, flags)					\
+	raw_spin_trylock_irqsave(&(b)->lock, (flags))
+#define kvm_async_pf_unlock(b, flags)					\
+	raw_spin_unlock_irqrestore(&(b)->lock, (flags))
+#define kvm_async_pf_unlock_and_clear(b, flags)				\
+	do {								\
+		raw_spin_unlock_irqrestore(&(b)->lock, (flags));	\
+		kvm_async_pf_write_reason(0);				\
+	} while (0)
+
+static struct kvm_task_sleep_node *kvm_async_pf_find(
+		struct kvm_task_sleep_head *b, u32 token)
+{
+	struct kvm_task_sleep_node *n;
+	struct hlist_node *p;
+
+	hlist_for_each(p, &b->list) {
+		n = hlist_entry(p, typeof(*n), link);
+		if (n->token == token)
+			return n;
+	}
+
+	return NULL;
+}
+
+static void kvm_async_pf_wait(u32 token, int in_kernel)
+{
+	struct kvm_task_sleep_head *b = this_cpu_ptr(&apf_head);
+	struct kvm_task_sleep_node n, *e;
+	DECLARE_SWAITQUEUE(wait);
+	unsigned long flags;
+
+	kvm_async_pf_lock(b, flags);
+	e = kvm_async_pf_find(b, token);
+	if (e) {
+		/* dummy entry exist -> wake up was delivered ahead of PF */
+		hlist_del(&e->link);
+		kfree(e);
+		kvm_async_pf_unlock_and_clear(b, flags);
+
+		return;
+	}
+
+	n.token = token;
+	n.task = current;
+	n.cpu = smp_processor_id();
+	n.halted = is_idle_task(current) ||
+		   (IS_ENABLED(CONFIG_PREEMPT_COUNT) ?
+		    preempt_count() > 1 || rcu_preempt_depth() : in_kernel);
+	n.delayed = false;
+	init_swait_queue_head(&n.wq);
+	hlist_add_head(&n.link, &b->list);
+	kvm_async_pf_unlock_and_clear(b, flags);
+
+	for (;;) {
+		if (!n.halted) {
+			prepare_to_swait_exclusive(&n.wq, &wait,
+						   TASK_UNINTERRUPTIBLE);
+		}
+
+		if (hlist_unhashed(&n.link))
+			break;
+
+		if (!n.halted) {
+			schedule();
+		} else {
+			dsb(sy);
+			wfi();
+		}
+	}
+
+	if (!n.halted)
+		finish_swait(&n.wq, &wait);
+}
+
+/*
+ * There are two cases the suspended processed can't be waken up
+ * immediately: The waker is exactly the suspended process, or
+ * the current CPU runqueue has been locked. Otherwise, we might
+ * run into dead-lock.
+ */
+static inline void kvm_async_pf_wake_one(struct kvm_task_sleep_node *n)
+{
+	if (n->task == current ||
+	    cpu_rq_is_locked(smp_processor_id())) {
+		n->delayed = true;
+		return;
+	}
+
+	hlist_del_init(&n->link);
+	if (n->halted)
+		smp_send_reschedule(n->cpu);
+	else
+		swake_up_one(&n->wq);
+}
+
+void kvm_async_pf_delayed_wake(void)
+{
+	struct kvm_task_sleep_head *b;
+	struct kvm_task_sleep_node *n;
+	struct hlist_node *p, *next;
+	unsigned int reason;
+	unsigned long flags;
+
+	if (!kvm_async_pf_read_enabled())
+		return;
+
+	/*
+	 * We're running in the edging context, we need to complete
+	 * the work as quick as possible. So we have a preliminary
+	 * check without holding the lock.
+	 */
+	b = this_cpu_ptr(&apf_head);
+	if (hlist_empty(&b->list))
+		return;
+
+	/*
+	 * Set the async page fault reason to something to avoid
+	 * receiving the signals, which might cause lock contention
+	 * and possibly dead-lock. As we're in guest context, it's
+	 * safe to set the reason here.
+	 *
+	 * There might be pending signals. For that case, we needn't
+	 * do anything. Otherwise, the pending signal will be lost.
+	 */
+	reason = kvm_async_pf_read_reason();
+	if (!reason) {
+		kvm_async_pf_write_reason(KVM_PV_REASON_PAGE_NOT_PRESENT +
+					  KVM_PV_REASON_PAGE_READY);
+	}
+
+	if (!kvm_async_pf_trylock(b, flags))
+		goto done;
+
+	hlist_for_each_safe(p, next, &b->list) {
+		n = hlist_entry(p, typeof(*n), link);
+		if (n->cpu != smp_processor_id())
+			continue;
+		if (!n->delayed)
+			continue;
+
+		kvm_async_pf_wake_one(n);
+	}
+
+	kvm_async_pf_unlock(b, flags);
+
+done:
+	if (!reason)
+		kvm_async_pf_write_reason(0);
+}
+NOKPROBE_SYMBOL(kvm_async_pf_delayed_wake);
+
+static void kvm_async_pf_wake_all(void)
+{
+	struct kvm_task_sleep_head *b;
+	struct kvm_task_sleep_node *n;
+	struct hlist_node *p, *next;
+	unsigned long flags;
+
+	b = this_cpu_ptr(&apf_head);
+	kvm_async_pf_lock(b, flags);
+
+	hlist_for_each_safe(p, next, &b->list) {
+		n = hlist_entry(p, typeof(*n), link);
+		kvm_async_pf_wake_one(n);
+	}
+
+	kvm_async_pf_unlock(b, flags);
+
+	kvm_async_pf_write_reason(0);
+}
+
+static void kvm_async_pf_wake(u32 token)
+{
+	struct kvm_task_sleep_head *b = this_cpu_ptr(&apf_head);
+	struct kvm_task_sleep_node *n;
+	unsigned long flags;
+
+	if (token == ~0) {
+		kvm_async_pf_wake_all();
+		return;
+	}
+
+again:
+	kvm_async_pf_lock(b, flags);
+
+	n = kvm_async_pf_find(b, token);
+	if (!n) {
+		/*
+		 * Async PF was not yet handled. Add dummy entry
+		 * for the token. Busy wait until other CPU handles
+		 * the async PF on allocation failure.
+		 */
+		n = kzalloc(sizeof(*n), GFP_ATOMIC);
+		if (!n) {
+			kvm_async_pf_unlock(b, flags);
+			cpu_relax();
+			goto again;
+		}
+		n->token = token;
+		n->task = current;
+		n->cpu = smp_processor_id();
+		n->halted = false;
+		n->delayed = false;
+		init_swait_queue_head(&n->wq);
+		hlist_add_head(&n->link, &b->list);
+	} else {
+		kvm_async_pf_wake_one(n);
+	}
+
+	kvm_async_pf_unlock_and_clear(b, flags);
+}
+
+static bool do_async_pf(unsigned long addr, unsigned int esr,
+		       struct pt_regs *regs)
+{
+	u32 reason;
+
+	if (!kvm_async_pf_read_enabled())
+		return false;
+
+	reason = kvm_async_pf_read_reason();
+	if (!reason)
+		return false;
+
+	switch (reason) {
+	case KVM_PV_REASON_PAGE_NOT_PRESENT:
+		kvm_async_pf_wait((u32)addr, !user_mode(regs));
+		break;
+	case KVM_PV_REASON_PAGE_READY:
+		kvm_async_pf_wake((u32)addr);
+		break;
+	default:
+		if (reason) {
+			pr_warn("%s: Illegal reason %d\n", __func__, reason);
+			kvm_async_pf_write_reason(0);
+		}
+	}
+
+	return true;
+}
+#endif /* CONFIG_KVM_GUEST */
+
 void do_mem_abort(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 {
 	const struct fault_info *inf = esr_to_fault_info(esr);
 
+#ifdef CONFIG_KVM_GUEST
+	if (do_async_pf(addr, esr, regs))
+		return;
+#endif
+
 	if (!inf->fn(addr, esr, regs))
 		return;
 
@@ -878,3 +1174,141 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned int esr,
 	debug_exception_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug_exception);
+
+#ifdef CONFIG_KVM_GUEST
+static int __init kvm_async_pf_available(char *arg)
+{
+	async_pf_available = false;
+	return 0;
+}
+early_param("no-kvmapf", kvm_async_pf_available);
+
+static void kvm_async_pf_enable(bool enable)
+{
+	struct arm_smccc_res res;
+	unsigned long *offsets = (unsigned long *)__exception_handlers_offset;
+	u32 enabled = kvm_async_pf_read_enabled();
+	u64 val;
+	int i;
+
+	if (enable == enabled)
+		return;
+
+	if (enable) {
+		/*
+		 * Asychonous page faults will be prohibited when CPU runs
+		 * instructions between the vector base and the maximal
+		 * offset, plus 4096. The 4096 is the assumped maximal
+		 * length for individual handler. The hardware registers
+		 * should be saved to stack at the beginning of the handlers,
+		 * so 4096 shuld be safe enough.
+		 */
+		val = 0;
+		for (i = 0; i < 16; i++) {
+			if (offsets[i] > val)
+				val = offsets[i];
+		}
+
+		val += 4096;
+		val |= BIT(62);
+
+		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+				     (u32)val, (u32)(val >> 32), &res);
+		if (res.a0 != SMCCC_RET_SUCCESS) {
+			pr_warn("Async PF configuration error %ld on CPU %d\n",
+				res.a0, smp_processor_id());
+			return;
+		}
+
+		/* FIXME: Enable KVM_ASYNC_PF_SEND_ALWAYS */
+		val = BIT(63);
+		val |= virt_to_phys(this_cpu_ptr(&apf_data));
+		val |= KVM_ASYNC_PF_ENABLED;
+
+		kvm_async_pf_write_enabled(1);
+		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+				     (u32)val, (u32)(val >> 32), &res);
+		if (res.a0 != SMCCC_RET_SUCCESS) {
+			pr_warn("Async PF enable error %ld on CPU %d\n",
+				res.a0, smp_processor_id());
+			kvm_async_pf_write_enabled(0);
+			return;
+		}
+	} else {
+		val = BIT(63);
+		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+				     (u32)val, (u32)(val >> 32), &res);
+		if (res.a0 != SMCCC_RET_SUCCESS) {
+			pr_warn("Async PF disable error %ld on CPU %d\n",
+				res.a0, smp_processor_id());
+			return;
+		}
+
+		kvm_async_pf_write_enabled(0);
+	}
+
+	pr_info("Async PF %s on CPU %d\n",
+		enable ? "enabled" : "disabled", smp_processor_id());
+}
+
+static void kvm_async_pf_cpu_reboot(void *unused)
+{
+	kvm_async_pf_enable(false);
+}
+
+static int kvm_async_pf_cpu_reboot_notify(struct notifier_block *nb,
+					  unsigned long code, void *unused)
+{
+	if (code == SYS_RESTART)
+		on_each_cpu(kvm_async_pf_cpu_reboot, NULL, 1);
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block kvm_async_pf_cpu_reboot_nb = {
+	.notifier_call = kvm_async_pf_cpu_reboot_notify,
+};
+
+static int kvm_async_pf_cpu_online(unsigned int cpu)
+{
+	struct kvm_task_sleep_head *b;
+
+	b = this_cpu_ptr(&apf_head);
+	raw_spin_lock_init(&b->lock);
+	kvm_async_pf_enable(true);
+	return 0;
+}
+
+static int kvm_async_pf_cpu_offline(unsigned int cpu)
+{
+	kvm_async_pf_enable(false);
+	return 0;
+}
+
+static int __init kvm_async_pf_cpu_init(void)
+{
+	struct kvm_task_sleep_head *b;
+	int ret;
+
+	if (!kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) ||
+	    !async_pf_available)
+		return -EPERM;
+
+	register_reboot_notifier(&kvm_async_pf_cpu_reboot_nb);
+	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+			"arm/kvm:online", kvm_async_pf_cpu_online,
+			kvm_async_pf_cpu_offline);
+	if (ret < 0) {
+		pr_warn("%s: Error %d to install cpu hotplug callbacks\n",
+			__func__, ret);
+		return ret;
+	}
+
+	b = this_cpu_ptr(&apf_head);
+	raw_spin_lock_init(&b->lock);
+	kvm_async_pf_enable(true);
+
+	return 0;
+}
+early_initcall(kvm_async_pf_cpu_init);
+#endif /* CONFIG_KVM_GUEST */
-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault
  2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
                   ` (8 preceding siblings ...)
  2020-05-08  3:29 ` [PATCH RFCv2 9/9] arm64: Support async page fault Gavin Shan
@ 2020-05-25 23:39 ` Gavin Shan
  2020-05-26 13:09 ` Mark Rutland
  10 siblings, 0 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-25 23:39 UTC (permalink / raw)
  To: kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

On 5/8/20 1:29 PM, Gavin Shan wrote:
> There are two stages of page faults and the stage one page fault is
> handled by guest itself. The guest is trapped to host when the page
> fault is caused by stage 2 page table, for example missing. The guest
> is suspended until the requested page is populated. There might be
> IO activities involved for host to populate the requested page. For
> instance, the requested page has been swapped out previously. In this
> case, the guest (vCPU) has to suspend for a few of milliseconds, which
> depends on the swapping media, regardless of the overall system load.
> 
> The series adds asychornous page fault to improve the situation. A
> signal (PAGE_NOT_PRESENT) is sent from host to the guest if the requested
> page isn't absent immediately. In the mean while, a worker is started
> to populate the requested page in background. Guest either picks another
> available process to run or puts current (faulting) process to power
> saving mode when receiving the (PAGE_NOT_PRESENT) signal. After the
> requested page is populated by the worker, another signal (PAGE_READY)
> is sent from host to guest. Guest wakes up the (faulting) process when
> receiving the (PAGE_READY) signal.
> 
> The signals are conveyed through control block. The control block physical
> address is passed from guest to host through dedicated KVM vendor specific
> hypercall. The control block is visible and accessible by host and guest
> in the mean while. The hypercall is also used to enable, disable, configure
> the functionality. Notifications, by injected abort data exception, are
> fired when there are pending signals. The exception handler will be invoked
> in guest kernel.
> 
> Testing
> =======
> The tests are carried on the following machine. A guest with single vCPU
> and 4GB memory is started. Also, the QEMU process is put into memory cgroup
> (v1) whose memory limit is set to 2GB. In the guest, there are two threads,
> which are memory bound and CPU bound separately. The memory bound thread
> allocates all available memory, accesses and them free them. The CPU bound
> thread simply executes block of "nop". The test is carried out for 5 time
> continuously and the average number (per minute) of executed blocks in the
> CPU bound thread is taken as indicator of improvement.
> 
>     Vendor: GIGABYTE   CPU: 224 x Cavium ThunderX2(R) CPU CN9975 v2.2 @ 2.0GHz
>     Memory: 32GB       Disk: Fusion-MPT SAS-3 (PCIe3.0 x8)
> 
>     Without-APF: 7029030180/minute = avg(7559625120 5962155840 7823208540
>                                          7629633480 6170527920)
>     With-APF:    8286827472/minute = avg(8464584540 8177073360 8262723180
>                                          8095084020 8434672260)
>     Outcome:     +17.8%
> 
> Another test case is to measure the time consumed by the application, but
> with the CPU-bound thread disabled.
> 
>     Without-APF: 40.3s = avg(40.6 39.3 39.2 41.6 41.2)
>     With-APF:    40.8s = avg(40.6 41.1 40.9 41.0 40.7)
>     Outcome:     +1.2%
> 
> I also have some code in the host to capture the number of async page faults,
> time used to do swapin and its maximal/minimal values when async page fault
> is enabled. During the test, the CPU-bound thread is disabled. There is about
> 30% of the time used to do swapin.
> 
>     Number of async page fault:     7555 times
>     Total time used by application: 42.2 seconds
>     Total time used by swapin:      12.7 seconds   (30%)
>           Minimal swapin time:      36.2 us
>           Maximal swapin time:      55.7 ms
> 

A kindly ping... Marc/Mark/Will, please let me know your comments
on this. thanks in advance!

> Changelog
> =========
> RFCv1 -> RFCv2
>     * Rebase to 5.7.rc3
>     * Performance data                                                   (Marc Zyngier)
>     * Replace IMPDEF system register with KVM vendor specific hypercall  (Mark Rutland)
>     * Based on Will's KVM vendor hypercall probe mechanism               (Will Deacon)
>     * Don't use IMPDEF DFSC (0x43). Async page fault reason is conveyed
>       by the control block                                               (Mark Rutland)
>     * Delayed wakeup mechanism in guest kernel                           (Gavin Shan)
>     * Stability improvement in the guest kernel: delayed wakeup mechanism,
>       external abort disallowed region, lazily clear async page fault,
>       disabled interrupt on acquiring the head's lock and so on          (Gavin Shan)
>     * Stability improvement in the host kernel: serialized async page
>       faults etc.                                                        (Gavin Shan)
>     * Performance improvement in guest kernel: percpu sleeper head       (Gavin Shan)
> 
> Gavin Shan (7):
>    kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
>    kvm/arm64: Detach ESR operator from vCPU struct
>    kvm/arm64: Replace hsr with esr
>    kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
>    kvm/arm64: Support async page fault
>    kernel/sched: Add cpu_rq_is_locked()
>    arm64: Support async page fault
> 
> Will Deacon (2):
>    arm64: Probe for the presence of KVM hypervisor services during boot
>    arm/arm64: KVM: Advertise KVM UID to guests via SMCCC
> 
>   arch/arm64/Kconfig                       |  11 +
>   arch/arm64/include/asm/exception.h       |   3 +
>   arch/arm64/include/asm/hypervisor.h      |  11 +
>   arch/arm64/include/asm/kvm_emulate.h     |  83 +++--
>   arch/arm64/include/asm/kvm_host.h        |  47 +++
>   arch/arm64/include/asm/kvm_para.h        |  40 +++
>   arch/arm64/include/uapi/asm/Kbuild       |   2 -
>   arch/arm64/include/uapi/asm/kvm_para.h   |  22 ++
>   arch/arm64/kernel/entry.S                |  33 ++
>   arch/arm64/kernel/process.c              |   4 +
>   arch/arm64/kernel/setup.c                |  35 ++
>   arch/arm64/kvm/Kconfig                   |   1 +
>   arch/arm64/kvm/Makefile                  |   2 +
>   arch/arm64/kvm/handle_exit.c             |  48 +--
>   arch/arm64/kvm/hyp/switch.c              |  33 +-
>   arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c |   7 +-
>   arch/arm64/kvm/inject_fault.c            |   4 +-
>   arch/arm64/kvm/sys_regs.c                |  38 +-
>   arch/arm64/mm/fault.c                    | 434 +++++++++++++++++++++++
>   include/linux/arm-smccc.h                |  32 ++
>   include/linux/sched.h                    |   1 +
>   kernel/sched/core.c                      |   8 +
>   virt/kvm/arm/arm.c                       |  40 ++-
>   virt/kvm/arm/async_pf.c                  | 335 +++++++++++++++++
>   virt/kvm/arm/hyp/aarch32.c               |   4 +-
>   virt/kvm/arm/hyp/vgic-v3-sr.c            |   7 +-
>   virt/kvm/arm/hypercalls.c                |  37 +-
>   virt/kvm/arm/mmio.c                      |  27 +-
>   virt/kvm/arm/mmu.c                       |  69 +++-
>   29 files changed, 1264 insertions(+), 154 deletions(-)
>   create mode 100644 arch/arm64/include/asm/kvm_para.h
>   create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
>   create mode 100644 virt/kvm/arm/async_pf.c
> 

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
  2020-05-08  3:29 ` [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr() Gavin Shan
@ 2020-05-26 10:42   ` Mark Rutland
  2020-05-27  2:43     ` Gavin Shan
  0 siblings, 1 reply; 38+ messages in thread
From: Mark Rutland @ 2020-05-26 10:42 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

On Fri, May 08, 2020 at 01:29:13PM +1000, Gavin Shan wrote:
> Since kvm/arm32 was removed, this renames kvm_vcpu_get_hsr() to
> kvm_vcpu_get_esr() to it a bit more self-explaining because the
> functions returns ESR instead of HSR on aarch64. This shouldn't
> cause any functional changes.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>

I think that this would be a nice cleanup on its own, and could be taken
independently of the rest of this series if it were rebased and sent as
a single patch.

Mark.

> ---
>  arch/arm64/include/asm/kvm_emulate.h | 36 +++++++++++++++-------------
>  arch/arm64/kvm/handle_exit.c         | 12 +++++-----
>  arch/arm64/kvm/hyp/switch.c          |  2 +-
>  arch/arm64/kvm/sys_regs.c            |  6 ++---
>  virt/kvm/arm/hyp/aarch32.c           |  2 +-
>  virt/kvm/arm/hyp/vgic-v3-sr.c        |  4 ++--
>  virt/kvm/arm/mmu.c                   |  6 ++---
>  7 files changed, 35 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index a30b4eec7cb4..bd1a69e7c104 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -265,14 +265,14 @@ static inline bool vcpu_mode_priv(const struct kvm_vcpu *vcpu)
>  	return mode != PSR_MODE_EL0t;
>  }
>  
> -static __always_inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
> +static __always_inline u32 kvm_vcpu_get_esr(const struct kvm_vcpu *vcpu)
>  {
>  	return vcpu->arch.fault.esr_el2;
>  }
>  
>  static __always_inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
>  {
> -	u32 esr = kvm_vcpu_get_hsr(vcpu);
> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>  
>  	if (esr & ESR_ELx_CV)
>  		return (esr & ESR_ELx_COND_MASK) >> ESR_ELx_COND_SHIFT;
> @@ -297,64 +297,66 @@ static inline u64 kvm_vcpu_get_disr(const struct kvm_vcpu *vcpu)
>  
>  static inline u32 kvm_vcpu_hvc_get_imm(const struct kvm_vcpu *vcpu)
>  {
> -	return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_xVC_IMM_MASK;
> +	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_xVC_IMM_MASK;
>  }
>  
>  static __always_inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu)
>  {
> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_ISV);
> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_ISV);
>  }
>  
>  static inline unsigned long kvm_vcpu_dabt_iss_nisv_sanitized(const struct kvm_vcpu *vcpu)
>  {
> -	return kvm_vcpu_get_hsr(vcpu) & (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC);
> +	return kvm_vcpu_get_esr(vcpu) &
> +	       (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC);
>  }
>  
>  static inline bool kvm_vcpu_dabt_issext(const struct kvm_vcpu *vcpu)
>  {
> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SSE);
> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SSE);
>  }
>  
>  static inline bool kvm_vcpu_dabt_issf(const struct kvm_vcpu *vcpu)
>  {
> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SF);
> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SF);
>  }
>  
>  static __always_inline int kvm_vcpu_dabt_get_rd(const struct kvm_vcpu *vcpu)
>  {
> -	return (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
> +	return (kvm_vcpu_get_esr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
>  }
>  
>  static __always_inline bool kvm_vcpu_dabt_iss1tw(const struct kvm_vcpu *vcpu)
>  {
> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_S1PTW);
> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_S1PTW);
>  }
>  
>  static __always_inline bool kvm_vcpu_dabt_iswrite(const struct kvm_vcpu *vcpu)
>  {
> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WNR) ||
> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_WNR) ||
>  		kvm_vcpu_dabt_iss1tw(vcpu); /* AF/DBM update */
>  }
>  
>  static inline bool kvm_vcpu_dabt_is_cm(const struct kvm_vcpu *vcpu)
>  {
> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_CM);
> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_CM);
>  }
>  
>  static __always_inline unsigned int kvm_vcpu_dabt_get_as(const struct kvm_vcpu *vcpu)
>  {
> -	return 1 << ((kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SAS) >> ESR_ELx_SAS_SHIFT);
> +	return 1 << ((kvm_vcpu_get_esr(vcpu) & ESR_ELx_SAS) >>
> +		     ESR_ELx_SAS_SHIFT);
>  }
>  
>  /* This one is not specific to Data Abort */
>  static __always_inline bool kvm_vcpu_trap_il_is32bit(const struct kvm_vcpu *vcpu)
>  {
> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_IL);
> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_IL);
>  }
>  
>  static __always_inline u8 kvm_vcpu_trap_get_class(const struct kvm_vcpu *vcpu)
>  {
> -	return ESR_ELx_EC(kvm_vcpu_get_hsr(vcpu));
> +	return ESR_ELx_EC(kvm_vcpu_get_esr(vcpu));
>  }
>  
>  static inline bool kvm_vcpu_trap_is_iabt(const struct kvm_vcpu *vcpu)
> @@ -364,12 +366,12 @@ static inline bool kvm_vcpu_trap_is_iabt(const struct kvm_vcpu *vcpu)
>  
>  static __always_inline u8 kvm_vcpu_trap_get_fault(const struct kvm_vcpu *vcpu)
>  {
> -	return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_FSC;
> +	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC;
>  }
>  
>  static __always_inline u8 kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vcpu)
>  {
> -	return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_FSC_TYPE;
> +	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_TYPE;
>  }
>  
>  static __always_inline bool kvm_vcpu_dabt_isextabt(const struct kvm_vcpu *vcpu)
> @@ -393,7 +395,7 @@ static __always_inline bool kvm_vcpu_dabt_isextabt(const struct kvm_vcpu *vcpu)
>  
>  static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
>  {
> -	u32 esr = kvm_vcpu_get_hsr(vcpu);
> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>  	return ESR_ELx_SYS64_ISS_RT(esr);
>  }
>  
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index aacfc55de44c..c5b75a4d5eda 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -89,7 +89,7 @@ static int handle_no_fpsimd(struct kvm_vcpu *vcpu, struct kvm_run *run)
>   */
>  static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  {
> -	if (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
> +	if (kvm_vcpu_get_esr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
>  		trace_kvm_wfx_arm64(*vcpu_pc(vcpu), true);
>  		vcpu->stat.wfe_exit_stat++;
>  		kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
> @@ -119,7 +119,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
>   */
>  static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  {
> -	u32 hsr = kvm_vcpu_get_hsr(vcpu);
> +	u32 hsr = kvm_vcpu_get_esr(vcpu);
>  	int ret = 0;
>  
>  	run->exit_reason = KVM_EXIT_DEBUG;
> @@ -146,7 +146,7 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  
>  static int kvm_handle_unknown_ec(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  {
> -	u32 hsr = kvm_vcpu_get_hsr(vcpu);
> +	u32 hsr = kvm_vcpu_get_esr(vcpu);
>  
>  	kvm_pr_unimpl("Unknown exception class: hsr: %#08x -- %s\n",
>  		      hsr, esr_get_class_string(hsr));
> @@ -226,7 +226,7 @@ static exit_handle_fn arm_exit_handlers[] = {
>  
>  static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu)
>  {
> -	u32 hsr = kvm_vcpu_get_hsr(vcpu);
> +	u32 hsr = kvm_vcpu_get_esr(vcpu);
>  	u8 hsr_ec = ESR_ELx_EC(hsr);
>  
>  	return arm_exit_handlers[hsr_ec];
> @@ -267,7 +267,7 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
>  		       int exception_index)
>  {
>  	if (ARM_SERROR_PENDING(exception_index)) {
> -		u8 hsr_ec = ESR_ELx_EC(kvm_vcpu_get_hsr(vcpu));
> +		u8 hsr_ec = ESR_ELx_EC(kvm_vcpu_get_esr(vcpu));
>  
>  		/*
>  		 * HVC/SMC already have an adjusted PC, which we need
> @@ -333,5 +333,5 @@ void handle_exit_early(struct kvm_vcpu *vcpu, struct kvm_run *run,
>  	exception_index = ARM_EXCEPTION_CODE(exception_index);
>  
>  	if (exception_index == ARM_EXCEPTION_EL1_SERROR)
> -		kvm_handle_guest_serror(vcpu, kvm_vcpu_get_hsr(vcpu));
> +		kvm_handle_guest_serror(vcpu, kvm_vcpu_get_esr(vcpu));
>  }
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index 8a1e81a400e0..2c3242bcfed2 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -437,7 +437,7 @@ static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
>  
>  static bool __hyp_text handle_tx2_tvm(struct kvm_vcpu *vcpu)
>  {
> -	u32 sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_hsr(vcpu));
> +	u32 sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_esr(vcpu));
>  	int rt = kvm_vcpu_sys_get_rt(vcpu);
>  	u64 val = vcpu_get_reg(vcpu, rt);
>  
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 51db934702b6..5b61465927b7 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -2214,7 +2214,7 @@ static int kvm_handle_cp_64(struct kvm_vcpu *vcpu,
>  			    size_t nr_specific)
>  {
>  	struct sys_reg_params params;
> -	u32 hsr = kvm_vcpu_get_hsr(vcpu);
> +	u32 hsr = kvm_vcpu_get_esr(vcpu);
>  	int Rt = kvm_vcpu_sys_get_rt(vcpu);
>  	int Rt2 = (hsr >> 10) & 0x1f;
>  
> @@ -2271,7 +2271,7 @@ static int kvm_handle_cp_32(struct kvm_vcpu *vcpu,
>  			    size_t nr_specific)
>  {
>  	struct sys_reg_params params;
> -	u32 hsr = kvm_vcpu_get_hsr(vcpu);
> +	u32 hsr = kvm_vcpu_get_esr(vcpu);
>  	int Rt  = kvm_vcpu_sys_get_rt(vcpu);
>  
>  	params.is_aarch32 = true;
> @@ -2387,7 +2387,7 @@ static void reset_sys_reg_descs(struct kvm_vcpu *vcpu,
>  int kvm_handle_sys_reg(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  {
>  	struct sys_reg_params params;
> -	unsigned long esr = kvm_vcpu_get_hsr(vcpu);
> +	unsigned long esr = kvm_vcpu_get_esr(vcpu);
>  	int Rt = kvm_vcpu_sys_get_rt(vcpu);
>  	int ret;
>  
> diff --git a/virt/kvm/arm/hyp/aarch32.c b/virt/kvm/arm/hyp/aarch32.c
> index d31f267961e7..864b477e660a 100644
> --- a/virt/kvm/arm/hyp/aarch32.c
> +++ b/virt/kvm/arm/hyp/aarch32.c
> @@ -51,7 +51,7 @@ bool __hyp_text kvm_condition_valid32(const struct kvm_vcpu *vcpu)
>  	int cond;
>  
>  	/* Top two bits non-zero?  Unconditional. */
> -	if (kvm_vcpu_get_hsr(vcpu) >> 30)
> +	if (kvm_vcpu_get_esr(vcpu) >> 30)
>  		return true;
>  
>  	/* Is condition field valid? */
> diff --git a/virt/kvm/arm/hyp/vgic-v3-sr.c b/virt/kvm/arm/hyp/vgic-v3-sr.c
> index ccf1fde9836c..8a7a14ec9120 100644
> --- a/virt/kvm/arm/hyp/vgic-v3-sr.c
> +++ b/virt/kvm/arm/hyp/vgic-v3-sr.c
> @@ -441,7 +441,7 @@ static int __hyp_text __vgic_v3_bpr_min(void)
>  
>  static int __hyp_text __vgic_v3_get_group(struct kvm_vcpu *vcpu)
>  {
> -	u32 esr = kvm_vcpu_get_hsr(vcpu);
> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>  	u8 crm = (esr & ESR_ELx_SYS64_ISS_CRM_MASK) >> ESR_ELx_SYS64_ISS_CRM_SHIFT;
>  
>  	return crm != 8;
> @@ -1007,7 +1007,7 @@ int __hyp_text __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu)
>  	bool is_read;
>  	u32 sysreg;
>  
> -	esr = kvm_vcpu_get_hsr(vcpu);
> +	esr = kvm_vcpu_get_esr(vcpu);
>  	if (vcpu_mode_is_32bit(vcpu)) {
>  		if (!kvm_condition_valid(vcpu)) {
>  			__kvm_skip_instr(vcpu);
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index e3b9ee268823..5da0d0e7519b 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1922,7 +1922,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  		 * For RAS the host kernel may handle this abort.
>  		 * There is no need to pass the error into the guest.
>  		 */
> -		if (!kvm_handle_guest_sea(fault_ipa, kvm_vcpu_get_hsr(vcpu)))
> +		if (!kvm_handle_guest_sea(fault_ipa, kvm_vcpu_get_esr(vcpu)))
>  			return 1;
>  
>  		if (unlikely(!is_iabt)) {
> @@ -1931,7 +1931,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  		}
>  	}
>  
> -	trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_hsr(vcpu),
> +	trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_esr(vcpu),
>  			      kvm_vcpu_get_hfar(vcpu), fault_ipa);
>  
>  	/* Check the stage-2 fault is trans. fault or write fault */
> @@ -1940,7 +1940,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  		kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
>  			kvm_vcpu_trap_get_class(vcpu),
>  			(unsigned long)kvm_vcpu_trap_get_fault(vcpu),
> -			(unsigned long)kvm_vcpu_get_hsr(vcpu));
> +			(unsigned long)kvm_vcpu_get_esr(vcpu));
>  		return -EFAULT;
>  	}
>  
> -- 
> 2.23.0
> 
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 5/9] kvm/arm64: Replace hsr with esr
  2020-05-08  3:29 ` [PATCH RFCv2 5/9] kvm/arm64: Replace hsr with esr Gavin Shan
@ 2020-05-26 10:45   ` Mark Rutland
  2020-05-27  2:56     ` Gavin Shan
  0 siblings, 1 reply; 38+ messages in thread
From: Mark Rutland @ 2020-05-26 10:45 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

On Fri, May 08, 2020 at 01:29:15PM +1000, Gavin Shan wrote:
> This replace the variable names to make them self-explaining. The
> tracepoint isn't changed accordingly because they're part of ABI:
> 
>    * @hsr to @esr
>    * @hsr_ec to @ec
>    * Use kvm_vcpu_trap_get_class() helper if possible
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>

As with patch 3, I think this cleanup makes sense independent from the
rest of the series, and I think it'd make sense to bundle all the
patches renaming hsr -> esr, and send those as a preparatory series.

Thanks,
Mark.

> ---
>  arch/arm64/kvm/handle_exit.c | 28 ++++++++++++++--------------
>  arch/arm64/kvm/hyp/switch.c  |  9 ++++-----
>  arch/arm64/kvm/sys_regs.c    | 30 +++++++++++++++---------------
>  3 files changed, 33 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 00858db82a64..e3b3dcd5b811 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -123,13 +123,13 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
>   */
>  static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  {
> -	u32 hsr = kvm_vcpu_get_esr(vcpu);
> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>  	int ret = 0;
>  
>  	run->exit_reason = KVM_EXIT_DEBUG;
> -	run->debug.arch.hsr = hsr;
> +	run->debug.arch.hsr = esr;
>  
> -	switch (ESR_ELx_EC(hsr)) {
> +	switch (kvm_vcpu_trap_get_class(esr)) {
>  	case ESR_ELx_EC_WATCHPT_LOW:
>  		run->debug.arch.far = vcpu->arch.fault.far_el2;
>  		/* fall through */
> @@ -139,8 +139,8 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  	case ESR_ELx_EC_BRK64:
>  		break;
>  	default:
> -		kvm_err("%s: un-handled case hsr: %#08x\n",
> -			__func__, (unsigned int) hsr);
> +		kvm_err("%s: un-handled case esr: %#08x\n",
> +			__func__, (unsigned int)esr);
>  		ret = -1;
>  		break;
>  	}
> @@ -150,10 +150,10 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  
>  static int kvm_handle_unknown_ec(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  {
> -	u32 hsr = kvm_vcpu_get_esr(vcpu);
> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>  
> -	kvm_pr_unimpl("Unknown exception class: hsr: %#08x -- %s\n",
> -		      hsr, esr_get_class_string(hsr));
> +	kvm_pr_unimpl("Unknown exception class: esr: %#08x -- %s\n",
> +		      esr, esr_get_class_string(esr));
>  
>  	kvm_inject_undefined(vcpu);
>  	return 1;
> @@ -230,10 +230,10 @@ static exit_handle_fn arm_exit_handlers[] = {
>  
>  static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu)
>  {
> -	u32 hsr = kvm_vcpu_get_esr(vcpu);
> -	u8 hsr_ec = ESR_ELx_EC(hsr);
> +	u32 esr = kvm_vcpu_get_esr(vcpu);
> +	u8 ec = kvm_vcpu_trap_get_class(esr);
>  
> -	return arm_exit_handlers[hsr_ec];
> +	return arm_exit_handlers[ec];
>  }
>  
>  /*
> @@ -273,15 +273,15 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
>  {
>  	if (ARM_SERROR_PENDING(exception_index)) {
>  		u32 esr = kvm_vcpu_get_esr(vcpu);
> -		u8 hsr_ec = ESR_ELx_EC(esr);
> +		u8 ec = kvm_vcpu_trap_get_class(esr);
>  
>  		/*
>  		 * HVC/SMC already have an adjusted PC, which we need
>  		 * to correct in order to return to after having
>  		 * injected the SError.
>  		 */
> -		if (hsr_ec == ESR_ELx_EC_HVC32 || hsr_ec == ESR_ELx_EC_HVC64 ||
> -		    hsr_ec == ESR_ELx_EC_SMC32 || hsr_ec == ESR_ELx_EC_SMC64) {
> +		if (ec == ESR_ELx_EC_HVC32 || ec == ESR_ELx_EC_HVC64 ||
> +		    ec == ESR_ELx_EC_SMC32 || ec == ESR_ELx_EC_SMC64) {
>  			u32 adj =  kvm_vcpu_trap_il_is32bit(esr) ? 4 : 2;
>  			*vcpu_pc(vcpu) -= adj;
>  		}
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index 369f22f49f3d..7bf4840bf90e 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -356,8 +356,8 @@ static bool __hyp_text __populate_fault_info(struct kvm_vcpu *vcpu)
>  static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
>  {
>  	u32 esr = kvm_vcpu_get_esr(vcpu);
> +	u8 ec = kvm_vcpu_trap_get_class(esr);
>  	bool vhe, sve_guest, sve_host;
> -	u8 hsr_ec;
>  
>  	if (!system_supports_fpsimd())
>  		return false;
> @@ -372,14 +372,13 @@ static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
>  		vhe = has_vhe();
>  	}
>  
> -	hsr_ec = kvm_vcpu_trap_get_class(esr);
> -	if (hsr_ec != ESR_ELx_EC_FP_ASIMD &&
> -	    hsr_ec != ESR_ELx_EC_SVE)
> +	if (ec != ESR_ELx_EC_FP_ASIMD &&
> +	    ec != ESR_ELx_EC_SVE)
>  		return false;
>  
>  	/* Don't handle SVE traps for non-SVE vcpus here: */
>  	if (!sve_guest)
> -		if (hsr_ec != ESR_ELx_EC_FP_ASIMD)
> +		if (ec != ESR_ELx_EC_FP_ASIMD)
>  			return false;
>  
>  	/* Valid trap.  Switch the context: */
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 012fff834a4b..58f81ab519af 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -2182,10 +2182,10 @@ static void unhandled_cp_access(struct kvm_vcpu *vcpu,
>  				struct sys_reg_params *params)
>  {
>  	u32 esr = kvm_vcpu_get_esr(vcpu);
> -	u8 hsr_ec = kvm_vcpu_trap_get_class(esr);
> +	u8 ec = kvm_vcpu_trap_get_class(esr);
>  	int cp = -1;
>  
> -	switch(hsr_ec) {
> +	switch (ec) {
>  	case ESR_ELx_EC_CP15_32:
>  	case ESR_ELx_EC_CP15_64:
>  		cp = 15;
> @@ -2216,17 +2216,17 @@ static int kvm_handle_cp_64(struct kvm_vcpu *vcpu,
>  			    size_t nr_specific)
>  {
>  	struct sys_reg_params params;
> -	u32 hsr = kvm_vcpu_get_esr(vcpu);
> -	int Rt = kvm_vcpu_sys_get_rt(hsr);
> -	int Rt2 = (hsr >> 10) & 0x1f;
> +	u32 esr = kvm_vcpu_get_esr(vcpu);
> +	int Rt = kvm_vcpu_sys_get_rt(esr);
> +	int Rt2 = (esr >> 10) & 0x1f;
>  
>  	params.is_aarch32 = true;
>  	params.is_32bit = false;
> -	params.CRm = (hsr >> 1) & 0xf;
> -	params.is_write = ((hsr & 1) == 0);
> +	params.CRm = (esr >> 1) & 0xf;
> +	params.is_write = ((esr & 1) == 0);
>  
>  	params.Op0 = 0;
> -	params.Op1 = (hsr >> 16) & 0xf;
> +	params.Op1 = (esr >> 16) & 0xf;
>  	params.Op2 = 0;
>  	params.CRn = 0;
>  
> @@ -2273,18 +2273,18 @@ static int kvm_handle_cp_32(struct kvm_vcpu *vcpu,
>  			    size_t nr_specific)
>  {
>  	struct sys_reg_params params;
> -	u32 hsr = kvm_vcpu_get_esr(vcpu);
> -	int Rt  = kvm_vcpu_sys_get_rt(hsr);
> +	u32 esr = kvm_vcpu_get_esr(vcpu);
> +	int Rt = kvm_vcpu_sys_get_rt(esr);
>  
>  	params.is_aarch32 = true;
>  	params.is_32bit = true;
> -	params.CRm = (hsr >> 1) & 0xf;
> +	params.CRm = (esr >> 1) & 0xf;
>  	params.regval = vcpu_get_reg(vcpu, Rt);
> -	params.is_write = ((hsr & 1) == 0);
> -	params.CRn = (hsr >> 10) & 0xf;
> +	params.is_write = ((esr & 1) == 0);
> +	params.CRn = (esr >> 10) & 0xf;
>  	params.Op0 = 0;
> -	params.Op1 = (hsr >> 14) & 0x7;
> -	params.Op2 = (hsr >> 17) & 0x7;
> +	params.Op1 = (esr >> 14) & 0x7;
> +	params.Op2 = (esr >> 17) & 0x7;
>  
>  	if (!emulate_cp(vcpu, &params, target_specific, nr_specific) ||
>  	    !emulate_cp(vcpu, &params, global, nr_global)) {
> -- 
> 2.23.0
> 
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 4/9] kvm/arm64: Detach ESR operator from vCPU struct
  2020-05-08  3:29 ` [PATCH RFCv2 4/9] kvm/arm64: Detach ESR operator from vCPU struct Gavin Shan
@ 2020-05-26 10:51   ` Mark Rutland
  2020-05-27  2:55     ` Gavin Shan
  0 siblings, 1 reply; 38+ messages in thread
From: Mark Rutland @ 2020-05-26 10:51 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

On Fri, May 08, 2020 at 01:29:14PM +1000, Gavin Shan wrote:
> There are a set of inline functions defined in kvm_emulate.h. Those
> functions reads ESR from vCPU fault information struct and then operate
> on it. So it's tied with vCPU fault information and vCPU struct. It
> limits their usage scope.
> 
> This detaches these functions from the vCPU struct. With this, the
> caller has flexibility on where the ESR is read. It shouldn't cause
> any functional changes.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  arch/arm64/include/asm/kvm_emulate.h     | 83 +++++++++++-------------
>  arch/arm64/kvm/handle_exit.c             | 20 ++++--
>  arch/arm64/kvm/hyp/switch.c              | 24 ++++---
>  arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c |  7 +-
>  arch/arm64/kvm/inject_fault.c            |  4 +-
>  arch/arm64/kvm/sys_regs.c                | 12 ++--
>  virt/kvm/arm/arm.c                       |  4 +-
>  virt/kvm/arm/hyp/aarch32.c               |  2 +-
>  virt/kvm/arm/hyp/vgic-v3-sr.c            |  5 +-
>  virt/kvm/arm/mmio.c                      | 27 ++++----
>  virt/kvm/arm/mmu.c                       | 22 ++++---
>  11 files changed, 112 insertions(+), 98 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index bd1a69e7c104..2873bf6dc85e 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -270,10 +270,8 @@ static __always_inline u32 kvm_vcpu_get_esr(const struct kvm_vcpu *vcpu)
>  	return vcpu->arch.fault.esr_el2;
>  }
>  
> -static __always_inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
> +static __always_inline int kvm_vcpu_get_condition(u32 esr)

Given the `vcpu` argument has been removed, it's odd to keep `vcpu` in the
name, rather than `esr`.

e.g. this would make more sense as something like esr_get_condition().

... and if we did something like that, we could move most of the
extraction functions into <asm/esr.h>, and share them with non-KVM code.

Otherwise, do you need to extract all of these for your use-case, or do
you only need a few of the helpers? If you only need a few, it might be
better to only factor those out for now, and keep the existing API in
place with wrappers, e.g. have:

| esr_get_condition(u32 esr) {
| 	... 
| }
| 
| kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
| {
| 	return esr_get_condition(kvm_vcpu_get_esr(vcpu));
| }

Thanks,
Mark.
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 6/9] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  2020-05-08  3:29 ` [PATCH RFCv2 6/9] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode Gavin Shan
@ 2020-05-26 10:58   ` Mark Rutland
  2020-05-27  3:01     ` Gavin Shan
  0 siblings, 1 reply; 38+ messages in thread
From: Mark Rutland @ 2020-05-26 10:58 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

On Fri, May 08, 2020 at 01:29:16PM +1000, Gavin Shan wrote:
> This renames user_mem_abort() to kvm_handle_user_mem_abort(), and
> then export it. The function will be used in asynchronous page fault
> to populate a page table entry once the corresponding page is populated
> from the backup device (e.g. swap partition):
> 
>    * Parameter @fault_status is replace by @esr.
>    * The parameters are reorder based on their importance.

It seems like multiple changes are going on here, and it would be
clearer with separate patches.

Passing the ESR rather than the extracted fault status seems fine, but
for clarirty it's be nicer to do this in its own patch.

Why is it necessary to re-order the function parameters? Does that align
with other function prototypes?

What exactly is the `prefault` parameter meant to do? It doesn't do
anything currently, so it'd be better to introduce it later when logic
using it is instroduced, or where callers will pass distinct values.

Thanks,
Mark.

> 
> This shouldn't cause any functional changes.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  4 ++++
>  virt/kvm/arm/mmu.c                | 14 ++++++++------
>  2 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 32c8a675e5a4..f77c706777ec 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -437,6 +437,10 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>  			      struct kvm_vcpu_events *events);
>  
>  #define KVM_ARCH_WANT_MMU_NOTIFIER
> +int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
> +			      struct kvm_memory_slot *memslot,
> +			      phys_addr_t fault_ipa, unsigned long hva,
> +			      bool prefault);
>  int kvm_unmap_hva_range(struct kvm *kvm,
>  			unsigned long start, unsigned long end);
>  int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index e462e0368fd9..95aaabb2b1fc 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1656,12 +1656,12 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>  	       (hva & ~(map_size - 1)) + map_size <= uaddr_end;
>  }
>  
> -static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> -			  struct kvm_memory_slot *memslot, unsigned long hva,
> -			  unsigned long fault_status)
> +int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
> +			      struct kvm_memory_slot *memslot,
> +			      phys_addr_t fault_ipa, unsigned long hva,
> +			      bool prefault)
>  {
> -	int ret;
> -	u32 esr = kvm_vcpu_get_esr(vcpu);
> +	unsigned int fault_status = kvm_vcpu_trap_get_fault_type(esr);
>  	bool write_fault, writable, force_pte = false;
>  	bool exec_fault, needs_exec;
>  	unsigned long mmu_seq;
> @@ -1674,6 +1674,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	pgprot_t mem_type = PAGE_S2;
>  	bool logging_active = memslot_is_logging(memslot);
>  	unsigned long vma_pagesize, flags = 0;
> +	int ret;
>  
>  	write_fault = kvm_is_write_fault(esr);
>  	exec_fault = kvm_vcpu_trap_is_iabt(esr);
> @@ -1995,7 +1996,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  		goto out_unlock;
>  	}
>  
> -	ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
> +	ret = kvm_handle_user_mem_abort(vcpu, esr, memslot,
> +					fault_ipa, hva, false);
>  	if (ret == 0)
>  		ret = 1;
>  out:
> -- 
> 2.23.0
> 
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault
  2020-05-08  3:29 ` [PATCH RFCv2 7/9] kvm/arm64: Support async page fault Gavin Shan
@ 2020-05-26 12:34   ` Mark Rutland
  2020-05-27  4:05     ` Gavin Shan
  0 siblings, 1 reply; 38+ messages in thread
From: Mark Rutland @ 2020-05-26 12:34 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

On Fri, May 08, 2020 at 01:29:17PM +1000, Gavin Shan wrote:
> There are two stages of fault pages and the stage one page fault is
> handled by guest itself. The guest is trapped to host when the page
> fault is caused by stage 2 page table, for example missing. The guest
> is suspended until the requested page is populated. To populate the
> requested page can be related to IO activities if the page was swapped
> out previously. In this case, the guest has to suspend for a few of
> milliseconds at least, regardless of the overall system load. There
> is no useful work done during the suspended period from guest's view.

This is a bit difficult to read. How about:

| When a vCPU triggers a Stage-2 fault (e.g. when accessing a page that
| is not mapped at Stage-2), the vCPU is suspended until the host has
| handled the fault. It can take the host milliseconds or longer to
| handle the fault as this may require IO, and when the system load is
| low neither the host nor guest perform useful work during such
| periods.

> 
> This adds asychornous page fault to improve the situation. A signal

Nit: typo for `asynchronous` here, and there are a few other typos in
the patch itself. It would be nice if you could run a spellcheck over
that.

> (PAGE_NOT_PRESENT) is sent to guest if the requested page needs some time
> to be populated. Guest might reschedule to another running process if
> possible. Otherwise, the vCPU is put into power-saving mode, which is
> actually to cause vCPU reschedule from host's view. A followup signal
> (PAGE_READY) is sent to guest once the requested page is populated.
> The suspended task is waken up or scheduled when guest receives the
> signal. With this mechanism, the vCPU won't be stuck when the requested
> page is being populated by host.

It would probably be best to say 'notification' rather than 'signal'
here, and say 'the guest is notified', etc. As above, it seems that this
is per-vCPU, so it's probably better to say 'vCPU' rather than guest, to
make it clear which context this applies to.

> 
> There are more details highlighted as below. Note the implementation is
> similar to what x86 has to some extent:
> 
>    * A dedicated SMCCC ID is reserved to enable, disable or configure
>      the functionality. The only 64-bits parameter is conveyed by two
>      registers (w2/w1). Bits[63:56] is the bitmap used to specify the
>      operated functionality like enabling/disabling/configuration. The
>      bits[55:6] is the physical address of control block or external
>      data abort injection disallowed region. Bit[5:0] are used to pass
>      control flags.
> 
>    * Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
>      isn't ready. In the mean while, a work is started to populate the
>      page asynchronously in background. The stage 2 page table entry is
>      updated accordingly and another signal (PAGE_READY) is fired after
>      the request page is populted. The signals is notified by injected
>      data abort fault.
> 
>    * The signals are fired and consumed in sequential fashion. It means
>      no more signals will be fired if there is pending one, awaiting the
>      guest to consume. It's because the injected data abort faults have
>      to be done in sequential fashion.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  arch/arm64/include/asm/kvm_host.h      |  43 ++++
>  arch/arm64/include/asm/kvm_para.h      |  27 ++
>  arch/arm64/include/uapi/asm/Kbuild     |   2 -
>  arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
>  arch/arm64/kvm/Kconfig                 |   1 +
>  arch/arm64/kvm/Makefile                |   2 +
>  include/linux/arm-smccc.h              |   6 +
>  virt/kvm/arm/arm.c                     |  36 ++-
>  virt/kvm/arm/async_pf.c                | 335 +++++++++++++++++++++++++
>  virt/kvm/arm/hypercalls.c              |   8 +
>  virt/kvm/arm/mmu.c                     |  29 ++-
>  11 files changed, 506 insertions(+), 5 deletions(-)
>  create mode 100644 arch/arm64/include/asm/kvm_para.h
>  create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
>  create mode 100644 virt/kvm/arm/async_pf.c
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index f77c706777ec..a207728d6f3f 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -250,6 +250,23 @@ struct vcpu_reset_state {
>  	bool		reset;
>  };
>  
> +#ifdef CONFIG_KVM_ASYNC_PF
> +
> +/* Should be a power of two number */
> +#define ASYNC_PF_PER_VCPU	64

What exactly is this number?

> +
> +/*
> + * The association of gfn and token. The token will be sent to guest as
> + * page fault address. Also, the guest could be in aarch32 mode. So its
> + * length should be 32-bits.
> + */

The length of what should be 32-bit? The token?

The guest sees the token as the fault address? How exactly is that
exposed to the guest, is that via a synthetic S1 fault?

> +struct kvm_arch_async_pf {
> +	u32     token;
> +	gfn_t   gfn;
> +	u32	esr;
> +};
> +#endif /* CONFIG_KVM_ASYNC_PF */
> +
>  struct kvm_vcpu_arch {
>  	struct kvm_cpu_context ctxt;
>  	void *sve_state;
> @@ -351,6 +368,17 @@ struct kvm_vcpu_arch {
>  		u64 last_steal;
>  		gpa_t base;
>  	} steal;
> +
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	struct {
> +		struct gfn_to_hva_cache	cache;
> +		gfn_t			gfns[ASYNC_PF_PER_VCPU];
> +		u64			control_block;
> +		u16			id;
> +		bool			send_user_only;
> +		u64			no_fault_inst_range;

What are all of these fields? This implies functionality not covered
in the commit message, and it's not at all clear what these are. 

For example, what exactly is `no_fault_inst_range`? If it's a range,
surely that needs a start/end or base/size pair rather than a single
value?

> +	} apf;
> +#endif
>  };
>  
>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
> @@ -604,6 +632,21 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>  
>  static inline void __cpu_init_stage2(void) {}
>  
> +#ifdef CONFIG_KVM_ASYNC_PF
> +bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn);
> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu);
> +bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
> +int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
> +			    gpa_t gpa, gfn_t gfn);
> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> +				     struct kvm_async_pf *work);
> +void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> +				     struct kvm_async_pf *work);
> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
> +			       struct kvm_async_pf *work);
> +long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu);
> +#endif /* CONFIG_KVM_ASYNC_PF */
> +
>  /* Guest/host FPSIMD coordination helpers */
>  int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
>  void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
> diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
> new file mode 100644
> index 000000000000..0ea481dd1c7a
> --- /dev/null
> +++ b/arch/arm64/include/asm/kvm_para.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_ARM_KVM_PARA_H
> +#define _ASM_ARM_KVM_PARA_H
> +
> +#include <uapi/asm/kvm_para.h>
> +
> +static inline bool kvm_check_and_clear_guest_paused(void)
> +{
> +	return false;
> +}
> +
> +static inline unsigned int kvm_arch_para_features(void)
> +{
> +	return 0;
> +}
> +
> +static inline unsigned int kvm_arch_para_hints(void)
> +{
> +	return 0;
> +}
> +
> +static inline bool kvm_para_available(void)
> +{
> +	return false;
> +}
> +
> +#endif /* _ASM_ARM_KVM_PARA_H */
> diff --git a/arch/arm64/include/uapi/asm/Kbuild b/arch/arm64/include/uapi/asm/Kbuild
> index 602d137932dc..f66554cd5c45 100644
> --- a/arch/arm64/include/uapi/asm/Kbuild
> +++ b/arch/arm64/include/uapi/asm/Kbuild
> @@ -1,3 +1 @@
>  # SPDX-License-Identifier: GPL-2.0
> -
> -generic-y += kvm_para.h
> diff --git a/arch/arm64/include/uapi/asm/kvm_para.h b/arch/arm64/include/uapi/asm/kvm_para.h
> new file mode 100644
> index 000000000000..e0bd0e579b9a
> --- /dev/null
> +++ b/arch/arm64/include/uapi/asm/kvm_para.h
> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_ASM_ARM_KVM_PARA_H
> +#define _UAPI_ASM_ARM_KVM_PARA_H
> +
> +#include <linux/types.h>
> +
> +#define KVM_FEATURE_ASYNC_PF	0
> +
> +/* Async PF */
> +#define KVM_ASYNC_PF_ENABLED		(1 << 0)
> +#define KVM_ASYNC_PF_SEND_ALWAYS	(1 << 1)
> +
> +#define KVM_PV_REASON_PAGE_NOT_PRESENT	1
> +#define KVM_PV_REASON_PAGE_READY	2
> +
> +struct kvm_vcpu_pv_apf_data {
> +	__u32	reason;
> +	__u8	pad[60];
> +	__u32	enabled;
> +};

What's all the padding for?

> +
> +#endif /* _UAPI_ASM_ARM_KVM_PARA_H */
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 449386d76441..1053e16b1739 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -34,6 +34,7 @@ config KVM
>  	select KVM_VFIO
>  	select HAVE_KVM_EVENTFD
>  	select HAVE_KVM_IRQFD
> +	select KVM_ASYNC_PF
>  	select KVM_ARM_PMU if HW_PERF_EVENTS
>  	select HAVE_KVM_MSI
>  	select HAVE_KVM_IRQCHIP
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 5ffbdc39e780..3be24c1e401f 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -37,3 +37,5 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic/vgic-debug.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/irqchip.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
>  kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
> +kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> +kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/arm/async_pf.o
> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
> index bdc0124a064a..22007dd3b9f0 100644
> --- a/include/linux/arm-smccc.h
> +++ b/include/linux/arm-smccc.h
> @@ -94,6 +94,7 @@
>  
>  /* KVM "vendor specific" services */
>  #define ARM_SMCCC_KVM_FUNC_FEATURES		0
> +#define ARM_SMCCC_KVM_FUNC_APF			1
>  #define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
>  #define ARM_SMCCC_KVM_NUM_FUNCS			128
>  
> @@ -102,6 +103,11 @@
>  			   ARM_SMCCC_SMC_32,				\
>  			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
>  			   ARM_SMCCC_KVM_FUNC_FEATURES)
> +#define ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID				\
> +	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,				\
> +			   ARM_SMCCC_SMC_32,				\
> +			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
> +			   ARM_SMCCC_KVM_FUNC_APF)
>  
>  #ifndef __ASSEMBLY__
>  
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 2cbb57485760..3f62899cef13 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -222,6 +222,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  		 */
>  		r = 1;
>  		break;
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	case KVM_CAP_ASYNC_PF:
> +		r = 1;
> +		break;
> +#endif
>  	default:
>  		r = kvm_arch_vm_ioctl_check_extension(kvm, ext);
>  		break;
> @@ -269,6 +274,10 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	/* Force users to call KVM_ARM_VCPU_INIT */
>  	vcpu->arch.target = -1;
>  	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	vcpu->arch.apf.control_block = 0UL;
> +	vcpu->arch.apf.no_fault_inst_range = 0x800;

Where has this magic number come from?

> +#endif
>  
>  	/* Set up the timer */
>  	kvm_timer_vcpu_init(vcpu);
> @@ -426,8 +435,27 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>  int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>  {
>  	bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF);
> -	return ((irq_lines || kvm_vgic_vcpu_pending_irq(v))
> -		&& !v->arch.power_off && !v->arch.pause);
> +
> +	if ((irq_lines || kvm_vgic_vcpu_pending_irq(v)) &&
> +	    !v->arch.power_off && !v->arch.pause)
> +		return true;
> +
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	if (v->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
> +		u32 val;
> +		int ret;
> +
> +		if (!list_empty_careful(&v->async_pf.done))
> +			return true;
> +
> +		ret = kvm_read_guest_cached(v->kvm, &v->arch.apf.cache,
> +					    &val, sizeof(val));
> +		if (ret || val)
> +			return true;
> +	}
> +#endif
> +
> +	return false;
>  }
>  
>  bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
> @@ -683,6 +711,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  
>  		check_vcpu_requests(vcpu);
>  
> +#ifdef CONFIG_KVM_ASYNC_PF
> +		kvm_check_async_pf_completion(vcpu);
> +#endif

Rather than adding ifdeffery like this, please add an empty stub for
when CONFIG_KVM_ASYNC_PF isn't selected, so that this can be used
unconditionally.

> +
>  		/*
>  		 * Preparing the interrupts to be injected also
>  		 * involves poking the GIC, which must be done in a
> diff --git a/virt/kvm/arm/async_pf.c b/virt/kvm/arm/async_pf.c
> new file mode 100644
> index 000000000000..5be49d684de3
> --- /dev/null
> +++ b/virt/kvm/arm/async_pf.c
> @@ -0,0 +1,335 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Asynchronous Page Fault Support
> + *
> + * Copyright (C) 2020 Red Hat, Inc., Gavin Shan
> + *
> + * Based on arch/x86/kernel/kvm.c
> + */
> +
> +#include <linux/arm-smccc.h>
> +#include <linux/kvm_host.h>
> +#include <asm/kvm_emulate.h>
> +#include <kvm/arm_hypercalls.h>
> +
> +static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
> +{
> +	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
> +}
> +
> +static inline u32 kvm_async_pf_hash_next(u32 key)
> +{
> +	return (key + 1) & (ASYNC_PF_PER_VCPU - 1);
> +}
> +
> +static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
> +{
> +	int i;
> +
> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++)
> +		vcpu->arch.apf.gfns[i] = ~0;
> +}
> +
> +/*
> + * Add gfn to the hash table. It's ensured there is a free entry
> + * when this function is called.
> + */
> +static void kvm_async_pf_hash_add(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_async_pf_hash_fn(gfn);
> +
> +	while (vcpu->arch.apf.gfns[key] != ~0)
> +		key = kvm_async_pf_hash_next(key);
> +
> +	vcpu->arch.apf.gfns[key] = gfn;
> +}
> +
> +static u32 kvm_async_pf_hash_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_async_pf_hash_fn(gfn);
> +	int i;
> +
> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++) {
> +		if (vcpu->arch.apf.gfns[key] == gfn ||
> +		    vcpu->arch.apf.gfns[key] == ~0)
> +			break;
> +
> +		key = kvm_async_pf_hash_next(key);
> +	}
> +
> +	return key;
> +}
> +
> +static void kvm_async_pf_hash_remove(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 i, j, k;
> +
> +	i = j = kvm_async_pf_hash_slot(vcpu, gfn);
> +	while (true) {
> +		vcpu->arch.apf.gfns[i] = ~0;
> +
> +		do {
> +			j = kvm_async_pf_hash_next(j);
> +			if (vcpu->arch.apf.gfns[j] == ~0)
> +				return;
> +
> +			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
> +			/*
> +			 * k lies cyclically in ]i,j]
> +			 * |    i.k.j |
> +			 * |....j i.k.| or  |.k..j i...|
> +			 */
> +		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
> +
> +		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
> +		i = j;
> +	}
> +}

This looks like a copy-paste of code under arch/x86.

This looks like something that should be factored into common code
rather than duplicated. Do we not have an existing common hash table
implementation that we can use rather than building one specific to KVM
async page faults?

> +
> +bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_async_pf_hash_slot(vcpu, gfn);
> +
> +	return vcpu->arch.apf.gfns[key] == gfn;
> +}
> +
> +static inline int kvm_async_pf_read_cache(struct kvm_vcpu *vcpu, u32 *val)
> +{
> +	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
> +				     val, sizeof(*val));
> +}
> +
> +static inline int kvm_async_pf_write_cache(struct kvm_vcpu *vcpu, u32 val)
> +{
> +	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
> +				      &val, sizeof(val));
> +}
> +
> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
> +{
> +	u64 vbar, pc;
> +	u32 val;
> +	int ret;
> +
> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
> +		return false;
> +
> +	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
> +		return false;
> +
> +	/* Pending page fault, which ins't acknowledged by guest */
> +	ret = kvm_async_pf_read_cache(vcpu, &val);
> +	if (ret || val)
> +		return false;
> +
> +	/*
> +	 * Events can't be injected through data abort because it's
> +	 * going to clobber ELR_EL1, which might not consued (or saved)
> +	 * by guest yet.
> +	 */
> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
> +	pc = *vcpu_pc(vcpu);
> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
> +		return false;

Ah, so that's when this `no_fault_inst_range` is for.

As-is this is not sufficient, and we'll need t be extremely careful
here.

The vectors themselves typically only have a small amount of stub code,
and the bulk of the non-reentrant exception entry work happens
elsewhere, in a mixture of assembly and C code that isn't even virtually
contiguous with either the vectors or itself.

It's possible in theory that code in modules (or perhaps in eBPF JIT'd
code) that isn't safe to take a fault from, so even having a contiguous
range controlled by the kernel isn't ideal.

How does this work on x86?

> +
> +	return true;
> +}
> +
> +/*
> + * We need deliver the page present signal as quick as possible because
> + * it's performance critical. So the signal is delivered no matter which
> + * privilege level the guest has. It's possible the signal can't be handled
> + * by the guest immediately. However, host doesn't contribute the delay
> + * anyway.
> + */
> +bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
> +{
> +	u64 vbar, pc;
> +	u32 val;
> +	int ret;
> +
> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
> +		return true;
> +
> +	/* Pending page fault, which ins't acknowledged by guest */
> +	ret = kvm_async_pf_read_cache(vcpu, &val);
> +	if (ret || val)
> +		return false;
> +
> +	/*
> +	 * Events can't be injected through data abort because it's
> +	 * going to clobber ELR_EL1, which might not consued (or saved)
> +	 * by guest yet.
> +	 */
> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
> +	pc = *vcpu_pc(vcpu);
> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
> +		return false;
> +
> +	return true;
> +}

Much of this is identical to the not_present case, so the same comments
apply. The common bits should probably be factored out into a helper.

> +
> +int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
> +			    gpa_t gpa, gfn_t gfn)
> +{
> +	struct kvm_arch_async_pf arch;
> +	unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gfn);
> +
> +	arch.token = (vcpu->arch.apf.id++ << 16) | vcpu->vcpu_id;
> +	arch.gfn = gfn;
> +	arch.esr = esr;
> +
> +	return kvm_setup_async_pf(vcpu, gpa, hva, &arch);
> +}
> +
> +/*
> + * It's garanteed that no pending asynchronous page fault when this is
> + * called. It means all previous issued asynchronous page faults have
> + * been acknoledged.
> + */
> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> +				     struct kvm_async_pf *work)
> +{
> +	int ret;
> +
> +	kvm_async_pf_hash_add(vcpu, work->arch.gfn);
> +	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT);
> +	if (ret) {
> +		kvm_err("%s: Error %d writing cache\n", __func__, ret);
> +		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
> +		return;
> +	}
> +
> +	kvm_inject_dabt(vcpu, work->arch.token);
> +}
> +
> +/*
> + * It's garanteed that no pending asynchronous page fault when this is
> + * called. It means all previous issued asynchronous page faults have
> + * been acknoledged.
> + */
> +void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> +				 struct kvm_async_pf *work)
> +{
> +	int ret;
> +
> +	/* Broadcast wakeup */
> +	if (work->wakeup_all)
> +		work->arch.token = ~0;
> +	else
> +		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
> +
> +	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_READY);
> +	if (ret) {
> +		kvm_err("%s: Error %d writing cache\n", __func__, ret);
> +		return;
> +	}
> +
> +	kvm_inject_dabt(vcpu, work->arch.token);

So the guest sees a fake S1 abort with a fake address?

How is the guest expected to distinguish this from a real S1 fault?

> +}
> +
> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
> +			       struct kvm_async_pf *work)
> +{
> +	struct kvm_memory_slot *memslot;
> +	unsigned int esr = work->arch.esr;
> +	phys_addr_t gpa = work->cr2_or_gpa;
> +	gfn_t gfn = gpa >> PAGE_SHIFT;

Perhaps:

	gfn_t gfn = gpa_to_gfn(gpa);

> +	unsigned long hva;
> +	bool write_fault, writable;
> +	int idx;
> +
> +	/*
> +	 * We shouldn't issue prefault for special work to wake up
> +	 * all pending tasks because the associated token (address)
> +	 * is invalid.
> +	 */

I'm not sure what this comment is trying to say.

> +	if (work->wakeup_all)
> +		return;
> +
> +	/*
> +	 * The gpa was validated before the work is started. However, the
> +	 * memory slots might be changed since then. So we need to redo the
> +	 * validatation here.
> +	 */
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +
> +	write_fault = kvm_is_write_fault(esr);
> +	memslot = gfn_to_memslot(vcpu->kvm, gfn);
> +	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
> +	if (kvm_is_error_hva(hva) || (write_fault && !writable))
> +		goto out;
> +
> +	kvm_handle_user_mem_abort(vcpu, esr, memslot, gpa, hva, true);
> +
> +out:
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +}
> +
> +static long kvm_async_pf_update_enable_reg(struct kvm_vcpu *vcpu, u64 data)
> +{
> +	bool enabled, enable;
> +	gpa_t gpa = (data & ~0x3F);

What exactly is going on here? Why are the low 7 bits of data not valid?

This will also truncate the value to 32 bits; did you mean to do that?

> +	int ret;
> +
> +	enabled = !!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED);
> +	enable = !!(data & KVM_ASYNC_PF_ENABLED);
> +	if (enable == enabled) {
> +		kvm_debug("%s: Async PF has been %s (0x%llx -> 0x%llx)\n",
> +			  __func__, enabled ? "enabled" : "disabled",
> +			  vcpu->arch.apf.control_block, data);
> +		return SMCCC_RET_NOT_REQUIRED;
> +	}
> +
> +	if (enable) {
> +		ret = kvm_gfn_to_hva_cache_init(
> +			vcpu->kvm, &vcpu->arch.apf.cache,
> +			gpa + offsetof(struct kvm_vcpu_pv_apf_data, reason),
> +			sizeof(u32));
> +		if (ret) {
> +			kvm_err("%s: Error %d initializing cache on 0x%llx\n",
> +				__func__, ret, data);
> +			return SMCCC_RET_NOT_SUPPORTED;
> +		}
> +
> +		kvm_async_pf_hash_reset(vcpu);
> +		vcpu->arch.apf.send_user_only =
> +			!(data & KVM_ASYNC_PF_SEND_ALWAYS);
> +		kvm_async_pf_wakeup_all(vcpu);
> +		vcpu->arch.apf.control_block = data;
> +	} else {
> +		kvm_clear_async_pf_completion_queue(vcpu);
> +		vcpu->arch.apf.control_block = data;
> +	}
> +
> +	return SMCCC_RET_SUCCESS;
> +}
> +
> +long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu)
> +{
> +	u64 data, func, val, range;
> +	long ret = SMCCC_RET_SUCCESS;
> +
> +	data = (smccc_get_arg2(vcpu) << 32) | smccc_get_arg1(vcpu);

What prevents the high bits being set in arg1?

> +	func = data & (0xfful << 56);
> +	val = data & ~(0xfful << 56);
> +	switch (func) {
> +	case BIT(63):
> +		ret = kvm_async_pf_update_enable_reg(vcpu, val);

Please give BIT(63) a mnemonic.

> +		break;
> +	case BIT(62):

Likewise.

> +		if (vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
> +			ret = SMCCC_RET_NOT_SUPPORTED;
> +			break;
> +		}
> +
> +		range = vcpu->arch.apf.no_fault_inst_range;
> +		vcpu->arch.apf.no_fault_inst_range = max(range, val);

Huh? How is the `no_fault_inst_range` set by he guest?

Thanks,
Mark.

> +		break;
> +	default:
> +		kvm_err("%s: Unrecognized function 0x%llx\n", __func__, func);
> +		ret = SMCCC_RET_NOT_SUPPORTED;
> +	}
> +
> +	return ret;
> +}
> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
> index db6dce3d0e23..a7e0fe17e2f1 100644
> --- a/virt/kvm/arm/hypercalls.c
> +++ b/virt/kvm/arm/hypercalls.c
> @@ -70,7 +70,15 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  		break;
>  	case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
>  		val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
> +#ifdef CONFIG_KVM_ASYNC_PF
> +		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_APF);
> +#endif
>  		break;
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	case ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID:
> +		val[0] = kvm_async_pf_hypercall(vcpu);
> +		break;
> +#endif
>  	default:
>  		return kvm_psci_call(vcpu);
>  	}
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 95aaabb2b1fc..a303815845a2 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1656,6 +1656,30 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>  	       (hva & ~(map_size - 1)) + map_size <= uaddr_end;
>  }
>  
> +static bool try_async_pf(struct kvm_vcpu *vcpu, u32 esr, gpa_t gpa,
> +			 gfn_t gfn, kvm_pfn_t *pfn, bool write,
> +			 bool *writable, bool prefault)
> +{
> +	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	bool async = false;
> +
> +	/* Bail if *pfn has correct page */
> +	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable);
> +	if (!async)
> +		return false;
> +
> +	if (!prefault && kvm_arch_can_inject_async_page_not_present(vcpu)) {
> +		if (kvm_async_pf_hash_find(vcpu, gfn) ||
> +		    kvm_arch_setup_async_pf(vcpu, esr, gpa, gfn))
> +			return true;
> +	}
> +#endif
> +
> +	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
> +	return false;
> +}
> +
>  int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
>  			      struct kvm_memory_slot *memslot,
>  			      phys_addr_t fault_ipa, unsigned long hva,
> @@ -1737,7 +1761,10 @@ int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
>  	 */
>  	smp_rmb();
>  
> -	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> +	if (try_async_pf(vcpu, esr, fault_ipa, gfn, &pfn,
> +			 write_fault, &writable, prefault))
> +		return 1;
> +
>  	if (pfn == KVM_PFN_ERR_HWPOISON) {
>  		kvm_send_hwpoison_signal(hva, vma_shift);
>  		return 0;
> -- 
> 2.23.0
> 
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 9/9] arm64: Support async page fault
  2020-05-08  3:29 ` [PATCH RFCv2 9/9] arm64: Support async page fault Gavin Shan
@ 2020-05-26 12:56   ` Mark Rutland
  2020-05-27  6:48   ` Paolo Bonzini
  1 sibling, 0 replies; 38+ messages in thread
From: Mark Rutland @ 2020-05-26 12:56 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

On Fri, May 08, 2020 at 01:29:19PM +1000, Gavin Shan wrote:
> This supports asynchronous page fault for the guest. The design is
> similar to what x86 has: on receiving a PAGE_NOT_PRESENT signal from
> the host, the current task is either rescheduled or put into power
> saving mode. The task will be waken up when PAGE_READY signal is
> received. The PAGE_READY signal might be received in the context
> of the suspended process, to be waken up. That means the suspended
> process has to wake up itself, but it's not safe and prone to cause
> dead-lock on CPU runqueue lock. So the wakeup is delayed on returning
> from kernel space to user space or idle process is picked for running.
> 
> The signals are conveyed through the async page fault control block,
> which was passed to host on enabling the functionality. On each page
> fault, the control block is checked and switch to the async page fault
> handling flow if any signals exist.
> 
> The feature is put into the CONFIG_KVM_GUEST umbrella, which is added
> by this patch. So we have inline functions implemented in kvm_para.h,
> like other architectures do, to check if async page fault (one of the
> KVM para-virtualized features) is available. Also, the kernel boot
> parameter "no-kvmapf" can be specified to disable the feature.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  arch/arm64/Kconfig                 |  11 +
>  arch/arm64/include/asm/exception.h |   3 +
>  arch/arm64/include/asm/kvm_para.h  |  27 +-
>  arch/arm64/kernel/entry.S          |  33 +++
>  arch/arm64/kernel/process.c        |   4 +
>  arch/arm64/mm/fault.c              | 434 +++++++++++++++++++++++++++++
>  6 files changed, 505 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 40fb05d96c60..2d5e5ee62d6d 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1045,6 +1045,17 @@ config PARAVIRT
>  	  under a hypervisor, potentially improving performance significantly
>  	  over full virtualization.
>  
> +config KVM_GUEST
> +	bool "KVM Guest Support"
> +	depends on PARAVIRT
> +	default y
> +	help
> +	  This option enables various optimizations for running under the KVM
> +	  hypervisor. Overhead for the kernel when not running inside KVM should
> +	  be minimal.
> +
> +	  In case of doubt, say Y
> +
>  config PARAVIRT_TIME_ACCOUNTING
>  	bool "Paravirtual steal time accounting"
>  	select PARAVIRT
> diff --git a/arch/arm64/include/asm/exception.h b/arch/arm64/include/asm/exception.h
> index 7a6e81ca23a8..d878afa42746 100644
> --- a/arch/arm64/include/asm/exception.h
> +++ b/arch/arm64/include/asm/exception.h
> @@ -46,4 +46,7 @@ void bad_el0_sync(struct pt_regs *regs, int reason, unsigned int esr);
>  void do_cp15instr(unsigned int esr, struct pt_regs *regs);
>  void do_el0_svc(struct pt_regs *regs);
>  void do_el0_svc_compat(struct pt_regs *regs);
> +#ifdef CONFIG_KVM_GUEST
> +void kvm_async_pf_delayed_wake(void);
> +#endif
>  #endif	/* __ASM_EXCEPTION_H */
> diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
> index 0ea481dd1c7a..b2f8ef243df7 100644
> --- a/arch/arm64/include/asm/kvm_para.h
> +++ b/arch/arm64/include/asm/kvm_para.h
> @@ -3,6 +3,20 @@
>  #define _ASM_ARM_KVM_PARA_H
>  
>  #include <uapi/asm/kvm_para.h>
> +#include <linux/of.h>
> +#include <asm/hypervisor.h>
> +
> +#ifdef CONFIG_KVM_GUEST
> +static inline int kvm_para_available(void)
> +{
> +	return 1;
> +}
> +#else
> +static inline int kvm_para_available(void)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_KVM_GUEST */

Please make these bool, and return true/false, as was the case with the
existing stub.

>  
>  static inline bool kvm_check_and_clear_guest_paused(void)
>  {
> @@ -11,17 +25,16 @@ static inline bool kvm_check_and_clear_guest_paused(void)
>  
>  static inline unsigned int kvm_arch_para_features(void)
>  {
> -	return 0;
> +	unsigned int features = 0;
> +
> +	if (kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_APF))
> +		features |= (1 << KVM_FEATURE_ASYNC_PF);
> +
> +	return features;
>  }
>  
>  static inline unsigned int kvm_arch_para_hints(void)
>  {
>  	return 0;
>  }
> -
> -static inline bool kvm_para_available(void)
> -{
> -	return false;
> -}
> -
>  #endif /* _ASM_ARM_KVM_PARA_H */
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index ddcde093c433..15efd57129ff 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -751,12 +751,45 @@ finish_ret_to_user:
>  	enable_step_tsk x1, x2
>  #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
>  	bl	stackleak_erase
> +#endif
> +#ifdef CONFIG_KVM_GUEST
> +	bl	kvm_async_pf_delayed_wake
>  #endif

Yuck. I am very much not keen on this living in the entry assembly.

What precisely is this needed for?

>  	kernel_exit 0
>  ENDPROC(ret_to_user)
>  
>  	.popsection				// .entry.text
>  
> +#ifdef CONFIG_KVM_GUEST
> +	.pushsection ".rodata", "a"
> +SYM_DATA_START(__exception_handlers_offset)
> +	.quad	0
> +	.quad	0
> +	.quad	0
> +	.quad	0
> +	.quad	el1_sync - vectors
> +	.quad	el1_irq - vectors
> +	.quad	0
> +	.quad	el1_error - vectors
> +	.quad	el0_sync - vectors
> +	.quad	el0_irq - vectors
> +	.quad	0
> +	.quad	el0_error - vectors
> +#ifdef CONFIG_COMPAT
> +	.quad	el0_sync_compat - vectors
> +	.quad	el0_irq_compat - vectors
> +	.quad	0
> +	.quad	el0_error_compat - vectors
> +#else
> +	.quad	0
> +	.quad	0
> +	.quad	0
> +	.quad	0
> +#endif
> +SYM_DATA_END(__exception_handlers_offset)
> +	.popsection				// .rodata
> +#endif /* CONFIG_KVM_GUEST */

This looks scary, and needs an introduction in the commit message.

> +
>  #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
>  /*
>   * Exception vectors trampoline.
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 56be4cbf771f..5e7ee553566d 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -53,6 +53,7 @@
>  #include <asm/processor.h>
>  #include <asm/pointer_auth.h>
>  #include <asm/stacktrace.h>
> +#include <asm/exception.h>
>  
>  #if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_STACKPROTECTOR_PER_TASK)
>  #include <linux/stackprotector.h>
> @@ -70,6 +71,9 @@ void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);
>  
>  static void __cpu_do_idle(void)
>  {
> +#ifdef CONFIG_KVM_GUEST
> +	kvm_async_pf_delayed_wake();
> +#endif
>  	dsb(sy);
>  	wfi();
>  }

This is meant to be a very low-level helper, so I don't think this
should live here.

If nothing else, this needs to have no overhead when async page faults
are not in use, so this probably needs an inline with a static key.


> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index c9cedc0432d2..cbf8b52135c9 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -19,10 +19,12 @@
>  #include <linux/page-flags.h>
>  #include <linux/sched/signal.h>
>  #include <linux/sched/debug.h>
> +#include <linux/swait.h>
>  #include <linux/highmem.h>
>  #include <linux/perf_event.h>
>  #include <linux/preempt.h>
>  #include <linux/hugetlb.h>
> +#include <linux/kvm_para.h>
>  
>  #include <asm/acpi.h>
>  #include <asm/bug.h>
> @@ -48,8 +50,31 @@ struct fault_info {
>  	const char *name;
>  };
>  
> +#ifdef CONFIG_KVM_GUEST
> +struct kvm_task_sleep_node {
> +	struct hlist_node	link;
> +	struct swait_queue_head	wq;
> +	u32			token;
> +	struct task_struct	*task;
> +	int			cpu;
> +	bool			halted;
> +	bool			delayed;
> +};
> +
> +struct kvm_task_sleep_head {
> +	raw_spinlock_t		lock;
> +	struct hlist_head	list;
> +};
> +#endif /* CONFIG_KVM_GUEST */
> +
>  static const struct fault_info fault_info[];
>  static struct fault_info debug_fault_info[];
> +#ifdef CONFIG_KVM_GUEST
> +extern char __exception_handlers_offset[];
> +static bool async_pf_available = true;
> +static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_data) __aligned(64);
> +static DEFINE_PER_CPU(struct kvm_task_sleep_head, apf_head);
> +#endif
>  
>  static inline const struct fault_info *esr_to_fault_info(unsigned int esr)
>  {
> @@ -717,10 +742,281 @@ static const struct fault_info fault_info[] = {
>  	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 63"			},
>  };
>  
> +#ifdef CONFIG_KVM_GUEST
> +static inline unsigned int kvm_async_pf_read_enabled(void)
> +{
> +	return __this_cpu_read(apf_data.enabled);
> +}
> +
> +static inline void kvm_async_pf_write_enabled(unsigned int val)
> +{
> +	__this_cpu_write(apf_data.enabled, val);
> +}
> +
> +static inline unsigned int kvm_async_pf_read_reason(void)
> +{
> +	return __this_cpu_read(apf_data.reason);
> +}
> +
> +static inline void kvm_async_pf_write_reason(unsigned int val)
> +{
> +	__this_cpu_write(apf_data.reason, val);
> +}
> +
> +#define kvm_async_pf_lock(b, flags)					\
> +	raw_spin_lock_irqsave(&(b)->lock, (flags))
> +#define kvm_async_pf_trylock(b, flags)					\
> +	raw_spin_trylock_irqsave(&(b)->lock, (flags))
> +#define kvm_async_pf_unlock(b, flags)					\
> +	raw_spin_unlock_irqrestore(&(b)->lock, (flags))
> +#define kvm_async_pf_unlock_and_clear(b, flags)				\
> +	do {								\
> +		raw_spin_unlock_irqrestore(&(b)->lock, (flags));	\
> +		kvm_async_pf_write_reason(0);				\
> +	} while (0)
> +
> +static struct kvm_task_sleep_node *kvm_async_pf_find(
> +		struct kvm_task_sleep_head *b, u32 token)
> +{
> +	struct kvm_task_sleep_node *n;
> +	struct hlist_node *p;
> +
> +	hlist_for_each(p, &b->list) {
> +		n = hlist_entry(p, typeof(*n), link);
> +		if (n->token == token)
> +			return n;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void kvm_async_pf_wait(u32 token, int in_kernel)
> +{
> +	struct kvm_task_sleep_head *b = this_cpu_ptr(&apf_head);
> +	struct kvm_task_sleep_node n, *e;
> +	DECLARE_SWAITQUEUE(wait);
> +	unsigned long flags;
> +
> +	kvm_async_pf_lock(b, flags);
> +	e = kvm_async_pf_find(b, token);
> +	if (e) {
> +		/* dummy entry exist -> wake up was delivered ahead of PF */
> +		hlist_del(&e->link);
> +		kfree(e);
> +		kvm_async_pf_unlock_and_clear(b, flags);
> +
> +		return;
> +	}
> +
> +	n.token = token;
> +	n.task = current;
> +	n.cpu = smp_processor_id();
> +	n.halted = is_idle_task(current) ||
> +		   (IS_ENABLED(CONFIG_PREEMPT_COUNT) ?
> +		    preempt_count() > 1 || rcu_preempt_depth() : in_kernel);
> +	n.delayed = false;
> +	init_swait_queue_head(&n.wq);
> +	hlist_add_head(&n.link, &b->list);
> +	kvm_async_pf_unlock_and_clear(b, flags);
> +
> +	for (;;) {
> +		if (!n.halted) {
> +			prepare_to_swait_exclusive(&n.wq, &wait,
> +						   TASK_UNINTERRUPTIBLE);
> +		}
> +
> +		if (hlist_unhashed(&n.link))
> +			break;
> +
> +		if (!n.halted) {
> +			schedule();
> +		} else {
> +			dsb(sy);
> +			wfi();
> +		}

Please don't open-code idle sequences. I strongly suspect this won't
work with pseudo-nmi, and regardless we don't want to duplicate this.

> +	}
> +
> +	if (!n.halted)
> +		finish_swait(&n.wq, &wait);
> +}
> +
> +/*
> + * There are two cases the suspended processed can't be waken up
> + * immediately: The waker is exactly the suspended process, or
> + * the current CPU runqueue has been locked. Otherwise, we might
> + * run into dead-lock.
> + */
> +static inline void kvm_async_pf_wake_one(struct kvm_task_sleep_node *n)
> +{
> +	if (n->task == current ||
> +	    cpu_rq_is_locked(smp_processor_id())) {
> +		n->delayed = true;
> +		return;
> +	}
> +
> +	hlist_del_init(&n->link);
> +	if (n->halted)
> +		smp_send_reschedule(n->cpu);
> +	else
> +		swake_up_one(&n->wq);
> +}
> +
> +void kvm_async_pf_delayed_wake(void)
> +{
> +	struct kvm_task_sleep_head *b;
> +	struct kvm_task_sleep_node *n;
> +	struct hlist_node *p, *next;
> +	unsigned int reason;
> +	unsigned long flags;
> +
> +	if (!kvm_async_pf_read_enabled())
> +		return;
> +
> +	/*
> +	 * We're running in the edging context, we need to complete
> +	 * the work as quick as possible. So we have a preliminary
> +	 * check without holding the lock.
> +	 */

What is 'the edging context'?

> +	b = this_cpu_ptr(&apf_head);
> +	if (hlist_empty(&b->list))
> +		return;
> +
> +	/*
> +	 * Set the async page fault reason to something to avoid
> +	 * receiving the signals, which might cause lock contention
> +	 * and possibly dead-lock. As we're in guest context, it's
> +	 * safe to set the reason here.
> +	 *
> +	 * There might be pending signals. For that case, we needn't
> +	 * do anything. Otherwise, the pending signal will be lost.
> +	 */
> +	reason = kvm_async_pf_read_reason();
> +	if (!reason) {
> +		kvm_async_pf_write_reason(KVM_PV_REASON_PAGE_NOT_PRESENT +
> +					  KVM_PV_REASON_PAGE_READY);
> +	}

Huh? Are we doing this to prevent the host from writing tho this area?

> +
> +	if (!kvm_async_pf_trylock(b, flags))
> +		goto done;
> +
> +	hlist_for_each_safe(p, next, &b->list) {
> +		n = hlist_entry(p, typeof(*n), link);
> +		if (n->cpu != smp_processor_id())
> +			continue;
> +		if (!n->delayed)
> +			continue;
> +
> +		kvm_async_pf_wake_one(n);
> +	}
> +
> +	kvm_async_pf_unlock(b, flags);
> +
> +done:
> +	if (!reason)
> +		kvm_async_pf_write_reason(0);
> +}
> +NOKPROBE_SYMBOL(kvm_async_pf_delayed_wake);
> +
> +static void kvm_async_pf_wake_all(void)
> +{
> +	struct kvm_task_sleep_head *b;
> +	struct kvm_task_sleep_node *n;
> +	struct hlist_node *p, *next;
> +	unsigned long flags;
> +
> +	b = this_cpu_ptr(&apf_head);
> +	kvm_async_pf_lock(b, flags);
> +
> +	hlist_for_each_safe(p, next, &b->list) {
> +		n = hlist_entry(p, typeof(*n), link);
> +		kvm_async_pf_wake_one(n);
> +	}
> +
> +	kvm_async_pf_unlock(b, flags);
> +
> +	kvm_async_pf_write_reason(0);
> +}
> +
> +static void kvm_async_pf_wake(u32 token)
> +{
> +	struct kvm_task_sleep_head *b = this_cpu_ptr(&apf_head);
> +	struct kvm_task_sleep_node *n;
> +	unsigned long flags;
> +
> +	if (token == ~0) {
> +		kvm_async_pf_wake_all();
> +		return;
> +	}
> +
> +again:
> +	kvm_async_pf_lock(b, flags);
> +
> +	n = kvm_async_pf_find(b, token);
> +	if (!n) {
> +		/*
> +		 * Async PF was not yet handled. Add dummy entry
> +		 * for the token. Busy wait until other CPU handles
> +		 * the async PF on allocation failure.
> +		 */
> +		n = kzalloc(sizeof(*n), GFP_ATOMIC);
> +		if (!n) {
> +			kvm_async_pf_unlock(b, flags);
> +			cpu_relax();
> +			goto again;
> +		}
> +		n->token = token;
> +		n->task = current;
> +		n->cpu = smp_processor_id();
> +		n->halted = false;
> +		n->delayed = false;
> +		init_swait_queue_head(&n->wq);
> +		hlist_add_head(&n->link, &b->list);
> +	} else {
> +		kvm_async_pf_wake_one(n);
> +	}
> +
> +	kvm_async_pf_unlock_and_clear(b, flags);
> +}
> +
> +static bool do_async_pf(unsigned long addr, unsigned int esr,
> +		       struct pt_regs *regs)
> +{
> +	u32 reason;
> +
> +	if (!kvm_async_pf_read_enabled())
> +		return false;
> +
> +	reason = kvm_async_pf_read_reason();
> +	if (!reason)
> +		return false;
> +
> +	switch (reason) {
> +	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +		kvm_async_pf_wait((u32)addr, !user_mode(regs));
> +		break;
> +	case KVM_PV_REASON_PAGE_READY:
> +		kvm_async_pf_wake((u32)addr);
> +		break;
> +	default:
> +		if (reason) {
> +			pr_warn("%s: Illegal reason %d\n", __func__, reason);
> +			kvm_async_pf_write_reason(0);
> +		}
> +	}
> +
> +	return true;
> +}
> +#endif /* CONFIG_KVM_GUEST */
> +
>  void do_mem_abort(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>  {
>  	const struct fault_info *inf = esr_to_fault_info(esr);
>  
> +#ifdef CONFIG_KVM_GUEST
> +	if (do_async_pf(addr, esr, regs))
> +		return;
> +#endif
> +
>  	if (!inf->fn(addr, esr, regs))
>  		return;
>  
> @@ -878,3 +1174,141 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned int esr,
>  	debug_exception_exit(regs);
>  }
>  NOKPROBE_SYMBOL(do_debug_exception);
> +
> +#ifdef CONFIG_KVM_GUEST
> +static int __init kvm_async_pf_available(char *arg)
> +{
> +	async_pf_available = false;
> +	return 0;
> +}
> +early_param("no-kvmapf", kvm_async_pf_available);
> +
> +static void kvm_async_pf_enable(bool enable)
> +{
> +	struct arm_smccc_res res;
> +	unsigned long *offsets = (unsigned long *)__exception_handlers_offset;
> +	u32 enabled = kvm_async_pf_read_enabled();
> +	u64 val;
> +	int i;
> +
> +	if (enable == enabled)
> +		return;
> +
> +	if (enable) {
> +		/*
> +		 * Asychonous page faults will be prohibited when CPU runs
> +		 * instructions between the vector base and the maximal
> +		 * offset, plus 4096. The 4096 is the assumped maximal
> +		 * length for individual handler. The hardware registers
> +		 * should be saved to stack at the beginning of the handlers,
> +		 * so 4096 shuld be safe enough.
> +		 */
> +		val = 0;
> +		for (i = 0; i < 16; i++) {
> +			if (offsets[i] > val)
> +				val = offsets[i];
> +		}
> +
> +		val += 4096;

NAK. This assumption is not true, and regardless we should not make any
assumptions of this sort; we should derive this from the code
explicitly. Guessing is not ok.

Given that non-reentrant exception handling code is scattered across at
least:

* kernel/debug-monitors.c
* kernel/entry.S
* kernel/entry-common.S
* kernel/traps.c
* mm/fault.c

... we *cannot* assume that fault handling code is virtually contiguous,
and certainly cannot assume where this falls w.r.t. the architectural
vectors that VBAR_ELx points to.

How does x86 handle this?

Thanks,
Mark.
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault
  2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
                   ` (9 preceding siblings ...)
  2020-05-25 23:39 ` [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
@ 2020-05-26 13:09 ` Mark Rutland
  2020-05-27  2:39   ` Gavin Shan
  10 siblings, 1 reply; 38+ messages in thread
From: Mark Rutland @ 2020-05-26 13:09 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

Hi Gavin,

At a high-level I'm rather fearful of this series. I can see many ways
that this can break, and I can also see that even if/when we get things
into a working state, constant vigilance will be requried for any
changes to the entry code.

I'm not keen on injecting non-architectural exceptions in this way, and
I'm also not keen on how deep the PV hooks are injected currently (e.g.
in the ret_to_user path).

I see a few patches have preparator cleanup that I think would be
worthwhile regardless of this series; if you could factor those out and
send them on their own it would get that out of the way and make it
easier to review the series itself. Similarly, there's some duplication
of code from arch/x86 which I think can be factored out to virt/kvm
instead as preparatory work.

Generally, I also think that you need to spend some time on commit
messages and/or documentation to better explain the concepts and
expected usage. I had to reverse-engineer the series by reviewing it in
entirety before I had an idea as to how basic parts of it strung
together, and a more thorough conceptual explanation would make it much
easier to critique the approach rather than the individual patches.

On Fri, May 08, 2020 at 01:29:10PM +1000, Gavin Shan wrote:
> Testing
> =======
> The tests are carried on the following machine. A guest with single vCPU
> and 4GB memory is started. Also, the QEMU process is put into memory cgroup
> (v1) whose memory limit is set to 2GB. In the guest, there are two threads,
> which are memory bound and CPU bound separately. The memory bound thread
> allocates all available memory, accesses and them free them. The CPU bound
> thread simply executes block of "nop".

I appreciate this is a microbenchmark, but that sounds far from
realistic.

Is there a specitic real workload that's expected to be representative
of?

Can you run tests with a real workload? For example, a kernel build
inside the VM?

> The test is carried out for 5 time
> continuously and the average number (per minute) of executed blocks in the
> CPU bound thread is taken as indicator of improvement.
> 
>    Vendor: GIGABYTE   CPU: 224 x Cavium ThunderX2(R) CPU CN9975 v2.2 @ 2.0GHz
>    Memory: 32GB       Disk: Fusion-MPT SAS-3 (PCIe3.0 x8)
> 
>    Without-APF: 7029030180/minute = avg(7559625120 5962155840 7823208540
>                                         7629633480 6170527920)
>    With-APF:    8286827472/minute = avg(8464584540 8177073360 8262723180
>                                         8095084020 8434672260)
>    Outcome:     +17.8%
> 
> Another test case is to measure the time consumed by the application, but
> with the CPU-bound thread disabled.
> 
>    Without-APF: 40.3s = avg(40.6 39.3 39.2 41.6 41.2)
>    With-APF:    40.8s = avg(40.6 41.1 40.9 41.0 40.7)
>    Outcome:     +1.2%

So this is pure overhead in that case?

I think we need to see a real workload that this benefits. As it stands
it seems that this is a lot of complexity to game a synthetic benchmark.

Thanks,
Mark.

> I also have some code in the host to capture the number of async page faults,
> time used to do swapin and its maximal/minimal values when async page fault
> is enabled. During the test, the CPU-bound thread is disabled. There is about
> 30% of the time used to do swapin.
> 
>    Number of async page fault:     7555 times
>    Total time used by application: 42.2 seconds
>    Total time used by swapin:      12.7 seconds   (30%)
>          Minimal swapin time:      36.2 us
>          Maximal swapin time:      55.7 ms
> 
> Changelog
> =========
> RFCv1 -> RFCv2
>    * Rebase to 5.7.rc3
>    * Performance data                                                   (Marc Zyngier)
>    * Replace IMPDEF system register with KVM vendor specific hypercall  (Mark Rutland)
>    * Based on Will's KVM vendor hypercall probe mechanism               (Will Deacon)
>    * Don't use IMPDEF DFSC (0x43). Async page fault reason is conveyed
>      by the control block                                               (Mark Rutland)
>    * Delayed wakeup mechanism in guest kernel                           (Gavin Shan)
>    * Stability improvement in the guest kernel: delayed wakeup mechanism,
>      external abort disallowed region, lazily clear async page fault,
>      disabled interrupt on acquiring the head's lock and so on          (Gavin Shan)
>    * Stability improvement in the host kernel: serialized async page
>      faults etc.                                                        (Gavin Shan)
>    * Performance improvement in guest kernel: percpu sleeper head       (Gavin Shan)
> 
> Gavin Shan (7):
>   kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
>   kvm/arm64: Detach ESR operator from vCPU struct
>   kvm/arm64: Replace hsr with esr
>   kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
>   kvm/arm64: Support async page fault
>   kernel/sched: Add cpu_rq_is_locked()
>   arm64: Support async page fault
> 
> Will Deacon (2):
>   arm64: Probe for the presence of KVM hypervisor services during boot
>   arm/arm64: KVM: Advertise KVM UID to guests via SMCCC
> 
>  arch/arm64/Kconfig                       |  11 +
>  arch/arm64/include/asm/exception.h       |   3 +
>  arch/arm64/include/asm/hypervisor.h      |  11 +
>  arch/arm64/include/asm/kvm_emulate.h     |  83 +++--
>  arch/arm64/include/asm/kvm_host.h        |  47 +++
>  arch/arm64/include/asm/kvm_para.h        |  40 +++
>  arch/arm64/include/uapi/asm/Kbuild       |   2 -
>  arch/arm64/include/uapi/asm/kvm_para.h   |  22 ++
>  arch/arm64/kernel/entry.S                |  33 ++
>  arch/arm64/kernel/process.c              |   4 +
>  arch/arm64/kernel/setup.c                |  35 ++
>  arch/arm64/kvm/Kconfig                   |   1 +
>  arch/arm64/kvm/Makefile                  |   2 +
>  arch/arm64/kvm/handle_exit.c             |  48 +--
>  arch/arm64/kvm/hyp/switch.c              |  33 +-
>  arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c |   7 +-
>  arch/arm64/kvm/inject_fault.c            |   4 +-
>  arch/arm64/kvm/sys_regs.c                |  38 +-
>  arch/arm64/mm/fault.c                    | 434 +++++++++++++++++++++++
>  include/linux/arm-smccc.h                |  32 ++
>  include/linux/sched.h                    |   1 +
>  kernel/sched/core.c                      |   8 +
>  virt/kvm/arm/arm.c                       |  40 ++-
>  virt/kvm/arm/async_pf.c                  | 335 +++++++++++++++++
>  virt/kvm/arm/hyp/aarch32.c               |   4 +-
>  virt/kvm/arm/hyp/vgic-v3-sr.c            |   7 +-
>  virt/kvm/arm/hypercalls.c                |  37 +-
>  virt/kvm/arm/mmio.c                      |  27 +-
>  virt/kvm/arm/mmu.c                       |  69 +++-
>  29 files changed, 1264 insertions(+), 154 deletions(-)
>  create mode 100644 arch/arm64/include/asm/kvm_para.h
>  create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
>  create mode 100644 virt/kvm/arm/async_pf.c
> 
> -- 
> 2.23.0
> 
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault
  2020-05-26 13:09 ` Mark Rutland
@ 2020-05-27  2:39   ` Gavin Shan
  2020-05-27  7:48     ` Marc Zyngier
  0 siblings, 1 reply; 38+ messages in thread
From: Gavin Shan @ 2020-05-27  2:39 UTC (permalink / raw)
  To: Mark Rutland
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, pbonzini, will,
	kvmarm, linux-arm-kernel

Hi Mark,

On 5/26/20 11:09 PM, Mark Rutland wrote:
> At a high-level I'm rather fearful of this series. I can see many ways
> that this can break, and I can also see that even if/when we get things
> into a working state, constant vigilance will be requried for any
> changes to the entry code.
> 
> I'm not keen on injecting non-architectural exceptions in this way, and
> I'm also not keen on how deep the PV hooks are injected currently (e.g.
> in the ret_to_user path).
> 

First of all, thank you for your time and providing your comments continuously.
Since the series is tagged as RFC, it's not a surprise to see something is
obviously broken. However, Could you please provide more details? With more
details, I can figure out the solutions. If I'm correct, you're talking about
the added entry code and the injected PV hooks. Anyway, please provide more
details about your concerns so that I can figure out the solutions.

Let me briefly explain why we need the injected PV hooks in ret_to_user: There
are two fashions of wakeup and I would call them as direct wakeup and delayed
wakeup. The sleeping process is waked up directly when received PAGE_READY
notification from the host, which is the process of direct wakeup. However there
are some cases the direct wakeup can't be carried out. For example, the sleeper
and the waker are same process or the (CFS) runqueue has been locked by somebody
else. In these cases, the wakeup is delayed until the idle process is running or
in ret_to_user. It's how delayed wakeup works.

> I see a few patches have preparator cleanup that I think would be
> worthwhile regardless of this series; if you could factor those out and
> send them on their own it would get that out of the way and make it
> easier to review the series itself. Similarly, there's some duplication
> of code from arch/x86 which I think can be factored out to virt/kvm
> instead as preparatory work.
> 

Yep, I agree there are several cleanup patches can be posted separately
and merged in advance. I will do that and thanks for the comments.

About the shared code between arm64/x86, I need some time to investigate.
Basically, I agree to do so. I also included Paolo here to check his opnion.

It's no doubt these are all preparatory work, to make the review a bit
easier as you said :)

> Generally, I also think that you need to spend some time on commit
> messages and/or documentation to better explain the concepts and
> expected usage. I had to reverse-engineer the series by reviewing it in
> entirety before I had an idea as to how basic parts of it strung
> together, and a more thorough conceptual explanation would make it much
> easier to critique the approach rather than the individual patches.
> 

Yes, sure. I will do this in the future. Sorry about having taken you
too much to do the reverse-engineering. In next revision, I might put
more information in the cover letter and commit log to explain how things
are designed and working :)

> On Fri, May 08, 2020 at 01:29:10PM +1000, Gavin Shan wrote:
>> Testing
>> =======
>> The tests are carried on the following machine. A guest with single vCPU
>> and 4GB memory is started. Also, the QEMU process is put into memory cgroup
>> (v1) whose memory limit is set to 2GB. In the guest, there are two threads,
>> which are memory bound and CPU bound separately. The memory bound thread
>> allocates all available memory, accesses and them free them. The CPU bound
>> thread simply executes block of "nop".
> 
> I appreciate this is a microbenchmark, but that sounds far from
> realistic.
> 
> Is there a specitic real workload that's expected to be representative
> of?
> 
> Can you run tests with a real workload? For example, a kernel build
> inside the VM?
> 

Yeah, I agree it's far from a realistic workload. However, it's the test case
which was suggested when async page fault was proposed from day one, according
to the following document. On the page#34, you can see the benchmark, which is
similar to what we're doing.

https://www.linux-kvm.org/images/a/ac/2010-forum-Async-page-faults.pdf

Ok. I will test with the workload to build kernel or another better one to
represent the case.

>> The test is carried out for 5 time
>> continuously and the average number (per minute) of executed blocks in the
>> CPU bound thread is taken as indicator of improvement.
>>
>>     Vendor: GIGABYTE   CPU: 224 x Cavium ThunderX2(R) CPU CN9975 v2.2 @ 2.0GHz
>>     Memory: 32GB       Disk: Fusion-MPT SAS-3 (PCIe3.0 x8)
>>
>>     Without-APF: 7029030180/minute = avg(7559625120 5962155840 7823208540
>>                                          7629633480 6170527920)
>>     With-APF:    8286827472/minute = avg(8464584540 8177073360 8262723180
>>                                          8095084020 8434672260)
>>     Outcome:     +17.8%
>>
>> Another test case is to measure the time consumed by the application, but
>> with the CPU-bound thread disabled.
>>
>>     Without-APF: 40.3s = avg(40.6 39.3 39.2 41.6 41.2)
>>     With-APF:    40.8s = avg(40.6 41.1 40.9 41.0 40.7)
>>     Outcome:     +1.2%
> 
> So this is pure overhead in that case?
> 

Yes, It's the pure overhead, which is mainly contributed by the injected
PV code in ret_to_user.

> I think we need to see a real workload that this benefits. As it stands
> it seems that this is a lot of complexity to game a synthetic benchmark.
> 
> Thanks,
> Mark.
> 
>> I also have some code in the host to capture the number of async page faults,
>> time used to do swapin and its maximal/minimal values when async page fault
>> is enabled. During the test, the CPU-bound thread is disabled. There is about
>> 30% of the time used to do swapin.
>>
>>     Number of async page fault:     7555 times
>>     Total time used by application: 42.2 seconds
>>     Total time used by swapin:      12.7 seconds   (30%)
>>           Minimal swapin time:      36.2 us
>>           Maximal swapin time:      55.7 ms
>>

[...]

Thanks,
Gavin

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
  2020-05-26 10:42   ` Mark Rutland
@ 2020-05-27  2:43     ` Gavin Shan
  2020-05-27  7:20       ` Marc Zyngier
  0 siblings, 1 reply; 38+ messages in thread
From: Gavin Shan @ 2020-05-27  2:43 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

Hi Mark,

On 5/26/20 8:42 PM, Mark Rutland wrote:
> On Fri, May 08, 2020 at 01:29:13PM +1000, Gavin Shan wrote:
>> Since kvm/arm32 was removed, this renames kvm_vcpu_get_hsr() to
>> kvm_vcpu_get_esr() to it a bit more self-explaining because the
>> functions returns ESR instead of HSR on aarch64. This shouldn't
>> cause any functional changes.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
> 
> I think that this would be a nice cleanup on its own, and could be taken
> independently of the rest of this series if it were rebased and sent as
> a single patch.
> 

Yeah, I'll see how PATCH[3,4,5] can be posted independently
as part of the preparatory work, which is suggested by you
in another reply.

By the way, I assume the cleanup patches are good enough to
target 5.8.rc1/rc2 if you agree.

Thanks,
Gavin
>> ---
>>   arch/arm64/include/asm/kvm_emulate.h | 36 +++++++++++++++-------------
>>   arch/arm64/kvm/handle_exit.c         | 12 +++++-----
>>   arch/arm64/kvm/hyp/switch.c          |  2 +-
>>   arch/arm64/kvm/sys_regs.c            |  6 ++---
>>   virt/kvm/arm/hyp/aarch32.c           |  2 +-
>>   virt/kvm/arm/hyp/vgic-v3-sr.c        |  4 ++--
>>   virt/kvm/arm/mmu.c                   |  6 ++---
>>   7 files changed, 35 insertions(+), 33 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index a30b4eec7cb4..bd1a69e7c104 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -265,14 +265,14 @@ static inline bool vcpu_mode_priv(const struct kvm_vcpu *vcpu)
>>   	return mode != PSR_MODE_EL0t;
>>   }
>>   
>> -static __always_inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
>> +static __always_inline u32 kvm_vcpu_get_esr(const struct kvm_vcpu *vcpu)
>>   {
>>   	return vcpu->arch.fault.esr_el2;
>>   }
>>   
>>   static __always_inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
>>   {
>> -	u32 esr = kvm_vcpu_get_hsr(vcpu);
>> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>>   
>>   	if (esr & ESR_ELx_CV)
>>   		return (esr & ESR_ELx_COND_MASK) >> ESR_ELx_COND_SHIFT;
>> @@ -297,64 +297,66 @@ static inline u64 kvm_vcpu_get_disr(const struct kvm_vcpu *vcpu)
>>   
>>   static inline u32 kvm_vcpu_hvc_get_imm(const struct kvm_vcpu *vcpu)
>>   {
>> -	return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_xVC_IMM_MASK;
>> +	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_xVC_IMM_MASK;
>>   }
>>   
>>   static __always_inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu)
>>   {
>> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_ISV);
>> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_ISV);
>>   }
>>   
>>   static inline unsigned long kvm_vcpu_dabt_iss_nisv_sanitized(const struct kvm_vcpu *vcpu)
>>   {
>> -	return kvm_vcpu_get_hsr(vcpu) & (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC);
>> +	return kvm_vcpu_get_esr(vcpu) &
>> +	       (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC);
>>   }
>>   
>>   static inline bool kvm_vcpu_dabt_issext(const struct kvm_vcpu *vcpu)
>>   {
>> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SSE);
>> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SSE);
>>   }
>>   
>>   static inline bool kvm_vcpu_dabt_issf(const struct kvm_vcpu *vcpu)
>>   {
>> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SF);
>> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SF);
>>   }
>>   
>>   static __always_inline int kvm_vcpu_dabt_get_rd(const struct kvm_vcpu *vcpu)
>>   {
>> -	return (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
>> +	return (kvm_vcpu_get_esr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
>>   }
>>   
>>   static __always_inline bool kvm_vcpu_dabt_iss1tw(const struct kvm_vcpu *vcpu)
>>   {
>> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_S1PTW);
>> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_S1PTW);
>>   }
>>   
>>   static __always_inline bool kvm_vcpu_dabt_iswrite(const struct kvm_vcpu *vcpu)
>>   {
>> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WNR) ||
>> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_WNR) ||
>>   		kvm_vcpu_dabt_iss1tw(vcpu); /* AF/DBM update */
>>   }
>>   
>>   static inline bool kvm_vcpu_dabt_is_cm(const struct kvm_vcpu *vcpu)
>>   {
>> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_CM);
>> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_CM);
>>   }
>>   
>>   static __always_inline unsigned int kvm_vcpu_dabt_get_as(const struct kvm_vcpu *vcpu)
>>   {
>> -	return 1 << ((kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SAS) >> ESR_ELx_SAS_SHIFT);
>> +	return 1 << ((kvm_vcpu_get_esr(vcpu) & ESR_ELx_SAS) >>
>> +		     ESR_ELx_SAS_SHIFT);
>>   }
>>   
>>   /* This one is not specific to Data Abort */
>>   static __always_inline bool kvm_vcpu_trap_il_is32bit(const struct kvm_vcpu *vcpu)
>>   {
>> -	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_IL);
>> +	return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_IL);
>>   }
>>   
>>   static __always_inline u8 kvm_vcpu_trap_get_class(const struct kvm_vcpu *vcpu)
>>   {
>> -	return ESR_ELx_EC(kvm_vcpu_get_hsr(vcpu));
>> +	return ESR_ELx_EC(kvm_vcpu_get_esr(vcpu));
>>   }
>>   
>>   static inline bool kvm_vcpu_trap_is_iabt(const struct kvm_vcpu *vcpu)
>> @@ -364,12 +366,12 @@ static inline bool kvm_vcpu_trap_is_iabt(const struct kvm_vcpu *vcpu)
>>   
>>   static __always_inline u8 kvm_vcpu_trap_get_fault(const struct kvm_vcpu *vcpu)
>>   {
>> -	return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_FSC;
>> +	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC;
>>   }
>>   
>>   static __always_inline u8 kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vcpu)
>>   {
>> -	return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_FSC_TYPE;
>> +	return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_TYPE;
>>   }
>>   
>>   static __always_inline bool kvm_vcpu_dabt_isextabt(const struct kvm_vcpu *vcpu)
>> @@ -393,7 +395,7 @@ static __always_inline bool kvm_vcpu_dabt_isextabt(const struct kvm_vcpu *vcpu)
>>   
>>   static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
>>   {
>> -	u32 esr = kvm_vcpu_get_hsr(vcpu);
>> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>>   	return ESR_ELx_SYS64_ISS_RT(esr);
>>   }
>>   
>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
>> index aacfc55de44c..c5b75a4d5eda 100644
>> --- a/arch/arm64/kvm/handle_exit.c
>> +++ b/arch/arm64/kvm/handle_exit.c
>> @@ -89,7 +89,7 @@ static int handle_no_fpsimd(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>    */
>>   static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   {
>> -	if (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
>> +	if (kvm_vcpu_get_esr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
>>   		trace_kvm_wfx_arm64(*vcpu_pc(vcpu), true);
>>   		vcpu->stat.wfe_exit_stat++;
>>   		kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>> @@ -119,7 +119,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>    */
>>   static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   {
>> -	u32 hsr = kvm_vcpu_get_hsr(vcpu);
>> +	u32 hsr = kvm_vcpu_get_esr(vcpu);
>>   	int ret = 0;
>>   
>>   	run->exit_reason = KVM_EXIT_DEBUG;
>> @@ -146,7 +146,7 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   
>>   static int kvm_handle_unknown_ec(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   {
>> -	u32 hsr = kvm_vcpu_get_hsr(vcpu);
>> +	u32 hsr = kvm_vcpu_get_esr(vcpu);
>>   
>>   	kvm_pr_unimpl("Unknown exception class: hsr: %#08x -- %s\n",
>>   		      hsr, esr_get_class_string(hsr));
>> @@ -226,7 +226,7 @@ static exit_handle_fn arm_exit_handlers[] = {
>>   
>>   static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu)
>>   {
>> -	u32 hsr = kvm_vcpu_get_hsr(vcpu);
>> +	u32 hsr = kvm_vcpu_get_esr(vcpu);
>>   	u8 hsr_ec = ESR_ELx_EC(hsr);
>>   
>>   	return arm_exit_handlers[hsr_ec];
>> @@ -267,7 +267,7 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
>>   		       int exception_index)
>>   {
>>   	if (ARM_SERROR_PENDING(exception_index)) {
>> -		u8 hsr_ec = ESR_ELx_EC(kvm_vcpu_get_hsr(vcpu));
>> +		u8 hsr_ec = ESR_ELx_EC(kvm_vcpu_get_esr(vcpu));
>>   
>>   		/*
>>   		 * HVC/SMC already have an adjusted PC, which we need
>> @@ -333,5 +333,5 @@ void handle_exit_early(struct kvm_vcpu *vcpu, struct kvm_run *run,
>>   	exception_index = ARM_EXCEPTION_CODE(exception_index);
>>   
>>   	if (exception_index == ARM_EXCEPTION_EL1_SERROR)
>> -		kvm_handle_guest_serror(vcpu, kvm_vcpu_get_hsr(vcpu));
>> +		kvm_handle_guest_serror(vcpu, kvm_vcpu_get_esr(vcpu));
>>   }
>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
>> index 8a1e81a400e0..2c3242bcfed2 100644
>> --- a/arch/arm64/kvm/hyp/switch.c
>> +++ b/arch/arm64/kvm/hyp/switch.c
>> @@ -437,7 +437,7 @@ static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
>>   
>>   static bool __hyp_text handle_tx2_tvm(struct kvm_vcpu *vcpu)
>>   {
>> -	u32 sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_hsr(vcpu));
>> +	u32 sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_esr(vcpu));
>>   	int rt = kvm_vcpu_sys_get_rt(vcpu);
>>   	u64 val = vcpu_get_reg(vcpu, rt);
>>   
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index 51db934702b6..5b61465927b7 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -2214,7 +2214,7 @@ static int kvm_handle_cp_64(struct kvm_vcpu *vcpu,
>>   			    size_t nr_specific)
>>   {
>>   	struct sys_reg_params params;
>> -	u32 hsr = kvm_vcpu_get_hsr(vcpu);
>> +	u32 hsr = kvm_vcpu_get_esr(vcpu);
>>   	int Rt = kvm_vcpu_sys_get_rt(vcpu);
>>   	int Rt2 = (hsr >> 10) & 0x1f;
>>   
>> @@ -2271,7 +2271,7 @@ static int kvm_handle_cp_32(struct kvm_vcpu *vcpu,
>>   			    size_t nr_specific)
>>   {
>>   	struct sys_reg_params params;
>> -	u32 hsr = kvm_vcpu_get_hsr(vcpu);
>> +	u32 hsr = kvm_vcpu_get_esr(vcpu);
>>   	int Rt  = kvm_vcpu_sys_get_rt(vcpu);
>>   
>>   	params.is_aarch32 = true;
>> @@ -2387,7 +2387,7 @@ static void reset_sys_reg_descs(struct kvm_vcpu *vcpu,
>>   int kvm_handle_sys_reg(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   {
>>   	struct sys_reg_params params;
>> -	unsigned long esr = kvm_vcpu_get_hsr(vcpu);
>> +	unsigned long esr = kvm_vcpu_get_esr(vcpu);
>>   	int Rt = kvm_vcpu_sys_get_rt(vcpu);
>>   	int ret;
>>   
>> diff --git a/virt/kvm/arm/hyp/aarch32.c b/virt/kvm/arm/hyp/aarch32.c
>> index d31f267961e7..864b477e660a 100644
>> --- a/virt/kvm/arm/hyp/aarch32.c
>> +++ b/virt/kvm/arm/hyp/aarch32.c
>> @@ -51,7 +51,7 @@ bool __hyp_text kvm_condition_valid32(const struct kvm_vcpu *vcpu)
>>   	int cond;
>>   
>>   	/* Top two bits non-zero?  Unconditional. */
>> -	if (kvm_vcpu_get_hsr(vcpu) >> 30)
>> +	if (kvm_vcpu_get_esr(vcpu) >> 30)
>>   		return true;
>>   
>>   	/* Is condition field valid? */
>> diff --git a/virt/kvm/arm/hyp/vgic-v3-sr.c b/virt/kvm/arm/hyp/vgic-v3-sr.c
>> index ccf1fde9836c..8a7a14ec9120 100644
>> --- a/virt/kvm/arm/hyp/vgic-v3-sr.c
>> +++ b/virt/kvm/arm/hyp/vgic-v3-sr.c
>> @@ -441,7 +441,7 @@ static int __hyp_text __vgic_v3_bpr_min(void)
>>   
>>   static int __hyp_text __vgic_v3_get_group(struct kvm_vcpu *vcpu)
>>   {
>> -	u32 esr = kvm_vcpu_get_hsr(vcpu);
>> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>>   	u8 crm = (esr & ESR_ELx_SYS64_ISS_CRM_MASK) >> ESR_ELx_SYS64_ISS_CRM_SHIFT;
>>   
>>   	return crm != 8;
>> @@ -1007,7 +1007,7 @@ int __hyp_text __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu)
>>   	bool is_read;
>>   	u32 sysreg;
>>   
>> -	esr = kvm_vcpu_get_hsr(vcpu);
>> +	esr = kvm_vcpu_get_esr(vcpu);
>>   	if (vcpu_mode_is_32bit(vcpu)) {
>>   		if (!kvm_condition_valid(vcpu)) {
>>   			__kvm_skip_instr(vcpu);
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index e3b9ee268823..5da0d0e7519b 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1922,7 +1922,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   		 * For RAS the host kernel may handle this abort.
>>   		 * There is no need to pass the error into the guest.
>>   		 */
>> -		if (!kvm_handle_guest_sea(fault_ipa, kvm_vcpu_get_hsr(vcpu)))
>> +		if (!kvm_handle_guest_sea(fault_ipa, kvm_vcpu_get_esr(vcpu)))
>>   			return 1;
>>   
>>   		if (unlikely(!is_iabt)) {
>> @@ -1931,7 +1931,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   		}
>>   	}
>>   
>> -	trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_hsr(vcpu),
>> +	trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_esr(vcpu),
>>   			      kvm_vcpu_get_hfar(vcpu), fault_ipa);
>>   
>>   	/* Check the stage-2 fault is trans. fault or write fault */
>> @@ -1940,7 +1940,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   		kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
>>   			kvm_vcpu_trap_get_class(vcpu),
>>   			(unsigned long)kvm_vcpu_trap_get_fault(vcpu),
>> -			(unsigned long)kvm_vcpu_get_hsr(vcpu));
>> +			(unsigned long)kvm_vcpu_get_esr(vcpu));
>>   		return -EFAULT;
>>   	}
>>   
>> -- 
>> 2.23.0
>>
> 

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 4/9] kvm/arm64: Detach ESR operator from vCPU struct
  2020-05-26 10:51   ` Mark Rutland
@ 2020-05-27  2:55     ` Gavin Shan
  0 siblings, 0 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-27  2:55 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

Hi Mark,

On 5/26/20 8:51 PM, Mark Rutland wrote:
> On Fri, May 08, 2020 at 01:29:14PM +1000, Gavin Shan wrote:
>> There are a set of inline functions defined in kvm_emulate.h. Those
>> functions reads ESR from vCPU fault information struct and then operate
>> on it. So it's tied with vCPU fault information and vCPU struct. It
>> limits their usage scope.
>>
>> This detaches these functions from the vCPU struct. With this, the
>> caller has flexibility on where the ESR is read. It shouldn't cause
>> any functional changes.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   arch/arm64/include/asm/kvm_emulate.h     | 83 +++++++++++-------------
>>   arch/arm64/kvm/handle_exit.c             | 20 ++++--
>>   arch/arm64/kvm/hyp/switch.c              | 24 ++++---
>>   arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c |  7 +-
>>   arch/arm64/kvm/inject_fault.c            |  4 +-
>>   arch/arm64/kvm/sys_regs.c                | 12 ++--
>>   virt/kvm/arm/arm.c                       |  4 +-
>>   virt/kvm/arm/hyp/aarch32.c               |  2 +-
>>   virt/kvm/arm/hyp/vgic-v3-sr.c            |  5 +-
>>   virt/kvm/arm/mmio.c                      | 27 ++++----
>>   virt/kvm/arm/mmu.c                       | 22 ++++---
>>   11 files changed, 112 insertions(+), 98 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index bd1a69e7c104..2873bf6dc85e 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -270,10 +270,8 @@ static __always_inline u32 kvm_vcpu_get_esr(const struct kvm_vcpu *vcpu)
>>   	return vcpu->arch.fault.esr_el2;
>>   }
>>   
>> -static __always_inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
>> +static __always_inline int kvm_vcpu_get_condition(u32 esr)
> 
> Given the `vcpu` argument has been removed, it's odd to keep `vcpu` in the
> name, rather than `esr`.
> 
> e.g. this would make more sense as something like esr_get_condition().
> 
> ... and if we did something like that, we could move most of the
> extraction functions into <asm/esr.h>, and share them with non-KVM code.
> 
> Otherwise, do you need to extract all of these for your use-case, or do
> you only need a few of the helpers? If you only need a few, it might be
> better to only factor those out for now, and keep the existing API in
> place with wrappers, e.g. have:
> 
> | esr_get_condition(u32 esr) {
> | 	...
> | }
> |
> | kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
> | {
> | 	return esr_get_condition(kvm_vcpu_get_esr(vcpu));
> | }
> 

Sure, I'll follow approach#1, to move these helper functions to asm/esr.h
and with "vcpu" dropped in their names. I don't think it makes sense to
maintain two sets of helper functions for the simple logic. So the helper
function will be called where they should be, as below:

    esr_get_condition(u32 esr) { ... }
    
    bool __hyp_text kvm_condition_valid32(const struct kvm_vcpu *vcpu)
    {
         int cond = esr_get_condition(kvm_vcpu_get_esr(vcpu));
         :
    }

Thanks,
Gavin

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 5/9] kvm/arm64: Replace hsr with esr
  2020-05-26 10:45   ` Mark Rutland
@ 2020-05-27  2:56     ` Gavin Shan
  0 siblings, 0 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-27  2:56 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

Hi Mark,

On 5/26/20 8:45 PM, Mark Rutland wrote:
> On Fri, May 08, 2020 at 01:29:15PM +1000, Gavin Shan wrote:
>> This replace the variable names to make them self-explaining. The
>> tracepoint isn't changed accordingly because they're part of ABI:
>>
>>     * @hsr to @esr
>>     * @hsr_ec to @ec
>>     * Use kvm_vcpu_trap_get_class() helper if possible
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
> 
> As with patch 3, I think this cleanup makes sense independent from the
> rest of the series, and I think it'd make sense to bundle all the
> patches renaming hsr -> esr, and send those as a preparatory series.
> 

Yes, PATCH[3/4/5] will be posted independently, as part of the
preparatory work, as you suggested.

Thanks,
Gavin

> Thanks,
> Mark.
> 
>> ---
>>   arch/arm64/kvm/handle_exit.c | 28 ++++++++++++++--------------
>>   arch/arm64/kvm/hyp/switch.c  |  9 ++++-----
>>   arch/arm64/kvm/sys_regs.c    | 30 +++++++++++++++---------------
>>   3 files changed, 33 insertions(+), 34 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
>> index 00858db82a64..e3b3dcd5b811 100644
>> --- a/arch/arm64/kvm/handle_exit.c
>> +++ b/arch/arm64/kvm/handle_exit.c
>> @@ -123,13 +123,13 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>    */
>>   static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   {
>> -	u32 hsr = kvm_vcpu_get_esr(vcpu);
>> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>>   	int ret = 0;
>>   
>>   	run->exit_reason = KVM_EXIT_DEBUG;
>> -	run->debug.arch.hsr = hsr;
>> +	run->debug.arch.hsr = esr;
>>   
>> -	switch (ESR_ELx_EC(hsr)) {
>> +	switch (kvm_vcpu_trap_get_class(esr)) {
>>   	case ESR_ELx_EC_WATCHPT_LOW:
>>   		run->debug.arch.far = vcpu->arch.fault.far_el2;
>>   		/* fall through */
>> @@ -139,8 +139,8 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   	case ESR_ELx_EC_BRK64:
>>   		break;
>>   	default:
>> -		kvm_err("%s: un-handled case hsr: %#08x\n",
>> -			__func__, (unsigned int) hsr);
>> +		kvm_err("%s: un-handled case esr: %#08x\n",
>> +			__func__, (unsigned int)esr);
>>   		ret = -1;
>>   		break;
>>   	}
>> @@ -150,10 +150,10 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   
>>   static int kvm_handle_unknown_ec(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   {
>> -	u32 hsr = kvm_vcpu_get_esr(vcpu);
>> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>>   
>> -	kvm_pr_unimpl("Unknown exception class: hsr: %#08x -- %s\n",
>> -		      hsr, esr_get_class_string(hsr));
>> +	kvm_pr_unimpl("Unknown exception class: esr: %#08x -- %s\n",
>> +		      esr, esr_get_class_string(esr));
>>   
>>   	kvm_inject_undefined(vcpu);
>>   	return 1;
>> @@ -230,10 +230,10 @@ static exit_handle_fn arm_exit_handlers[] = {
>>   
>>   static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu)
>>   {
>> -	u32 hsr = kvm_vcpu_get_esr(vcpu);
>> -	u8 hsr_ec = ESR_ELx_EC(hsr);
>> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>> +	u8 ec = kvm_vcpu_trap_get_class(esr);
>>   
>> -	return arm_exit_handlers[hsr_ec];
>> +	return arm_exit_handlers[ec];
>>   }
>>   
>>   /*
>> @@ -273,15 +273,15 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
>>   {
>>   	if (ARM_SERROR_PENDING(exception_index)) {
>>   		u32 esr = kvm_vcpu_get_esr(vcpu);
>> -		u8 hsr_ec = ESR_ELx_EC(esr);
>> +		u8 ec = kvm_vcpu_trap_get_class(esr);
>>   
>>   		/*
>>   		 * HVC/SMC already have an adjusted PC, which we need
>>   		 * to correct in order to return to after having
>>   		 * injected the SError.
>>   		 */
>> -		if (hsr_ec == ESR_ELx_EC_HVC32 || hsr_ec == ESR_ELx_EC_HVC64 ||
>> -		    hsr_ec == ESR_ELx_EC_SMC32 || hsr_ec == ESR_ELx_EC_SMC64) {
>> +		if (ec == ESR_ELx_EC_HVC32 || ec == ESR_ELx_EC_HVC64 ||
>> +		    ec == ESR_ELx_EC_SMC32 || ec == ESR_ELx_EC_SMC64) {
>>   			u32 adj =  kvm_vcpu_trap_il_is32bit(esr) ? 4 : 2;
>>   			*vcpu_pc(vcpu) -= adj;
>>   		}
>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
>> index 369f22f49f3d..7bf4840bf90e 100644
>> --- a/arch/arm64/kvm/hyp/switch.c
>> +++ b/arch/arm64/kvm/hyp/switch.c
>> @@ -356,8 +356,8 @@ static bool __hyp_text __populate_fault_info(struct kvm_vcpu *vcpu)
>>   static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
>>   {
>>   	u32 esr = kvm_vcpu_get_esr(vcpu);
>> +	u8 ec = kvm_vcpu_trap_get_class(esr);
>>   	bool vhe, sve_guest, sve_host;
>> -	u8 hsr_ec;
>>   
>>   	if (!system_supports_fpsimd())
>>   		return false;
>> @@ -372,14 +372,13 @@ static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
>>   		vhe = has_vhe();
>>   	}
>>   
>> -	hsr_ec = kvm_vcpu_trap_get_class(esr);
>> -	if (hsr_ec != ESR_ELx_EC_FP_ASIMD &&
>> -	    hsr_ec != ESR_ELx_EC_SVE)
>> +	if (ec != ESR_ELx_EC_FP_ASIMD &&
>> +	    ec != ESR_ELx_EC_SVE)
>>   		return false;
>>   
>>   	/* Don't handle SVE traps for non-SVE vcpus here: */
>>   	if (!sve_guest)
>> -		if (hsr_ec != ESR_ELx_EC_FP_ASIMD)
>> +		if (ec != ESR_ELx_EC_FP_ASIMD)
>>   			return false;
>>   
>>   	/* Valid trap.  Switch the context: */
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index 012fff834a4b..58f81ab519af 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -2182,10 +2182,10 @@ static void unhandled_cp_access(struct kvm_vcpu *vcpu,
>>   				struct sys_reg_params *params)
>>   {
>>   	u32 esr = kvm_vcpu_get_esr(vcpu);
>> -	u8 hsr_ec = kvm_vcpu_trap_get_class(esr);
>> +	u8 ec = kvm_vcpu_trap_get_class(esr);
>>   	int cp = -1;
>>   
>> -	switch(hsr_ec) {
>> +	switch (ec) {
>>   	case ESR_ELx_EC_CP15_32:
>>   	case ESR_ELx_EC_CP15_64:
>>   		cp = 15;
>> @@ -2216,17 +2216,17 @@ static int kvm_handle_cp_64(struct kvm_vcpu *vcpu,
>>   			    size_t nr_specific)
>>   {
>>   	struct sys_reg_params params;
>> -	u32 hsr = kvm_vcpu_get_esr(vcpu);
>> -	int Rt = kvm_vcpu_sys_get_rt(hsr);
>> -	int Rt2 = (hsr >> 10) & 0x1f;
>> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>> +	int Rt = kvm_vcpu_sys_get_rt(esr);
>> +	int Rt2 = (esr >> 10) & 0x1f;
>>   
>>   	params.is_aarch32 = true;
>>   	params.is_32bit = false;
>> -	params.CRm = (hsr >> 1) & 0xf;
>> -	params.is_write = ((hsr & 1) == 0);
>> +	params.CRm = (esr >> 1) & 0xf;
>> +	params.is_write = ((esr & 1) == 0);
>>   
>>   	params.Op0 = 0;
>> -	params.Op1 = (hsr >> 16) & 0xf;
>> +	params.Op1 = (esr >> 16) & 0xf;
>>   	params.Op2 = 0;
>>   	params.CRn = 0;
>>   
>> @@ -2273,18 +2273,18 @@ static int kvm_handle_cp_32(struct kvm_vcpu *vcpu,
>>   			    size_t nr_specific)
>>   {
>>   	struct sys_reg_params params;
>> -	u32 hsr = kvm_vcpu_get_esr(vcpu);
>> -	int Rt  = kvm_vcpu_sys_get_rt(hsr);
>> +	u32 esr = kvm_vcpu_get_esr(vcpu);
>> +	int Rt = kvm_vcpu_sys_get_rt(esr);
>>   
>>   	params.is_aarch32 = true;
>>   	params.is_32bit = true;
>> -	params.CRm = (hsr >> 1) & 0xf;
>> +	params.CRm = (esr >> 1) & 0xf;
>>   	params.regval = vcpu_get_reg(vcpu, Rt);
>> -	params.is_write = ((hsr & 1) == 0);
>> -	params.CRn = (hsr >> 10) & 0xf;
>> +	params.is_write = ((esr & 1) == 0);
>> +	params.CRn = (esr >> 10) & 0xf;
>>   	params.Op0 = 0;
>> -	params.Op1 = (hsr >> 14) & 0x7;
>> -	params.Op2 = (hsr >> 17) & 0x7;
>> +	params.Op1 = (esr >> 14) & 0x7;
>> +	params.Op2 = (esr >> 17) & 0x7;
>>   
>>   	if (!emulate_cp(vcpu, &params, target_specific, nr_specific) ||
>>   	    !emulate_cp(vcpu, &params, global, nr_global)) {
>> -- 
>> 2.23.0
>>
> 

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 6/9] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  2020-05-26 10:58   ` Mark Rutland
@ 2020-05-27  3:01     ` Gavin Shan
  0 siblings, 0 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-27  3:01 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

Hi Mark,

On 5/26/20 8:58 PM, Mark Rutland wrote:
> On Fri, May 08, 2020 at 01:29:16PM +1000, Gavin Shan wrote:
>> This renames user_mem_abort() to kvm_handle_user_mem_abort(), and
>> then export it. The function will be used in asynchronous page fault
>> to populate a page table entry once the corresponding page is populated
>> from the backup device (e.g. swap partition):
>>
>>     * Parameter @fault_status is replace by @esr.
>>     * The parameters are reorder based on their importance.
> 
> It seems like multiple changes are going on here, and it would be
> clearer with separate patches.
> 
> Passing the ESR rather than the extracted fault status seems fine, but
> for clarirty it's be nicer to do this in its own patch.
> 

Ok. I'll have a separate patch to do this.

> Why is it necessary to re-order the function parameters? Does that align
> with other function prototypes?
> 

There are no explicit reasons for it. I can restore the order to what we
had previously if you like.

> What exactly is the `prefault` parameter meant to do? It doesn't do
> anything currently, so it'd be better to introduce it later when logic
> using it is instroduced, or where callers will pass distinct values.
> 

Yes, fair enough. I will merge the logic with PATCH[7] then.

> Thanks,
> Mark.
> 

Thanks,
Gavin

>>
>> This shouldn't cause any functional changes.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   arch/arm64/include/asm/kvm_host.h |  4 ++++
>>   virt/kvm/arm/mmu.c                | 14 ++++++++------
>>   2 files changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 32c8a675e5a4..f77c706777ec 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -437,6 +437,10 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>>   			      struct kvm_vcpu_events *events);
>>   
>>   #define KVM_ARCH_WANT_MMU_NOTIFIER
>> +int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
>> +			      struct kvm_memory_slot *memslot,
>> +			      phys_addr_t fault_ipa, unsigned long hva,
>> +			      bool prefault);
>>   int kvm_unmap_hva_range(struct kvm *kvm,
>>   			unsigned long start, unsigned long end);
>>   int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index e462e0368fd9..95aaabb2b1fc 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1656,12 +1656,12 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>>   	       (hva & ~(map_size - 1)) + map_size <= uaddr_end;
>>   }
>>   
>> -static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> -			  struct kvm_memory_slot *memslot, unsigned long hva,
>> -			  unsigned long fault_status)
>> +int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
>> +			      struct kvm_memory_slot *memslot,
>> +			      phys_addr_t fault_ipa, unsigned long hva,
>> +			      bool prefault)
>>   {
>> -	int ret;
>> -	u32 esr = kvm_vcpu_get_esr(vcpu);
>> +	unsigned int fault_status = kvm_vcpu_trap_get_fault_type(esr);
>>   	bool write_fault, writable, force_pte = false;
>>   	bool exec_fault, needs_exec;
>>   	unsigned long mmu_seq;
>> @@ -1674,6 +1674,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	pgprot_t mem_type = PAGE_S2;
>>   	bool logging_active = memslot_is_logging(memslot);
>>   	unsigned long vma_pagesize, flags = 0;
>> +	int ret;
>>   
>>   	write_fault = kvm_is_write_fault(esr);
>>   	exec_fault = kvm_vcpu_trap_is_iabt(esr);
>> @@ -1995,7 +1996,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   		goto out_unlock;
>>   	}
>>   
>> -	ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
>> +	ret = kvm_handle_user_mem_abort(vcpu, esr, memslot,
>> +					fault_ipa, hva, false);
>>   	if (ret == 0)
>>   		ret = 1;
>>   out:
>> -- 
>> 2.23.0
>>
> 

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault
  2020-05-26 12:34   ` Mark Rutland
@ 2020-05-27  4:05     ` Gavin Shan
  2020-05-27  7:37       ` Marc Zyngier
  0 siblings, 1 reply; 38+ messages in thread
From: Gavin Shan @ 2020-05-27  4:05 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas, linux-kernel, shan.gavin, maz, will, kvmarm,
	linux-arm-kernel

Hi Mark,

On 5/26/20 10:34 PM, Mark Rutland wrote:
> On Fri, May 08, 2020 at 01:29:17PM +1000, Gavin Shan wrote:
>> There are two stages of fault pages and the stage one page fault is
>> handled by guest itself. The guest is trapped to host when the page
>> fault is caused by stage 2 page table, for example missing. The guest
>> is suspended until the requested page is populated. To populate the
>> requested page can be related to IO activities if the page was swapped
>> out previously. In this case, the guest has to suspend for a few of
>> milliseconds at least, regardless of the overall system load. There
>> is no useful work done during the suspended period from guest's view.
> 
> This is a bit difficult to read. How about:
> 
> | When a vCPU triggers a Stage-2 fault (e.g. when accessing a page that
> | is not mapped at Stage-2), the vCPU is suspended until the host has
> | handled the fault. It can take the host milliseconds or longer to
> | handle the fault as this may require IO, and when the system load is
> | low neither the host nor guest perform useful work during such
> | periods.
> 

Yes, It's much better.

>>
>> This adds asychornous page fault to improve the situation. A signal
> 
> Nit: typo for `asynchronous` here, and there are a few other typos in
> the patch itself. It would be nice if you could run a spellcheck over
> that.
> 

Sure.

>> (PAGE_NOT_PRESENT) is sent to guest if the requested page needs some time
>> to be populated. Guest might reschedule to another running process if
>> possible. Otherwise, the vCPU is put into power-saving mode, which is
>> actually to cause vCPU reschedule from host's view. A followup signal
>> (PAGE_READY) is sent to guest once the requested page is populated.
>> The suspended task is waken up or scheduled when guest receives the
>> signal. With this mechanism, the vCPU won't be stuck when the requested
>> page is being populated by host.
> 
> It would probably be best to say 'notification' rather than 'signal'
> here, and say 'the guest is notified', etc. As above, it seems that this
> is per-vCPU, so it's probably better to say 'vCPU' rather than guest, to
> make it clear which context this applies to.
> 

Ok.

>>
>> There are more details highlighted as below. Note the implementation is
>> similar to what x86 has to some extent:
>>
>>     * A dedicated SMCCC ID is reserved to enable, disable or configure
>>       the functionality. The only 64-bits parameter is conveyed by two
>>       registers (w2/w1). Bits[63:56] is the bitmap used to specify the
>>       operated functionality like enabling/disabling/configuration. The
>>       bits[55:6] is the physical address of control block or external
>>       data abort injection disallowed region. Bit[5:0] are used to pass
>>       control flags.
>>
>>     * Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
>>       isn't ready. In the mean while, a work is started to populate the
>>       page asynchronously in background. The stage 2 page table entry is
>>       updated accordingly and another signal (PAGE_READY) is fired after
>>       the request page is populted. The signals is notified by injected
>>       data abort fault.
>>
>>     * The signals are fired and consumed in sequential fashion. It means
>>       no more signals will be fired if there is pending one, awaiting the
>>       guest to consume. It's because the injected data abort faults have
>>       to be done in sequential fashion.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   arch/arm64/include/asm/kvm_host.h      |  43 ++++
>>   arch/arm64/include/asm/kvm_para.h      |  27 ++
>>   arch/arm64/include/uapi/asm/Kbuild     |   2 -
>>   arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
>>   arch/arm64/kvm/Kconfig                 |   1 +
>>   arch/arm64/kvm/Makefile                |   2 +
>>   include/linux/arm-smccc.h              |   6 +
>>   virt/kvm/arm/arm.c                     |  36 ++-
>>   virt/kvm/arm/async_pf.c                | 335 +++++++++++++++++++++++++
>>   virt/kvm/arm/hypercalls.c              |   8 +
>>   virt/kvm/arm/mmu.c                     |  29 ++-
>>   11 files changed, 506 insertions(+), 5 deletions(-)
>>   create mode 100644 arch/arm64/include/asm/kvm_para.h
>>   create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
>>   create mode 100644 virt/kvm/arm/async_pf.c
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index f77c706777ec..a207728d6f3f 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -250,6 +250,23 @@ struct vcpu_reset_state {
>>   	bool		reset;
>>   };
>>   
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +
>> +/* Should be a power of two number */
>> +#define ASYNC_PF_PER_VCPU	64
> 
> What exactly is this number?
> 

It's maximal number of async page faults pending on the specific vCPU.

>> +
>> +/*
>> + * The association of gfn and token. The token will be sent to guest as
>> + * page fault address. Also, the guest could be in aarch32 mode. So its
>> + * length should be 32-bits.
>> + */
> 
> The length of what should be 32-bit? The token?
> 
> The guest sees the token as the fault address? How exactly is that
> exposed to the guest, is that via a synthetic S1 fault?
> 

Yes, the token is 32-bits in length and it's exposed to guest via fault
address. I'll improve the comments in next revision.

>> +struct kvm_arch_async_pf {
>> +	u32     token;
>> +	gfn_t   gfn;
>> +	u32	esr;
>> +};
>> +#endif /* CONFIG_KVM_ASYNC_PF */
>> +
>>   struct kvm_vcpu_arch {
>>   	struct kvm_cpu_context ctxt;
>>   	void *sve_state;
>> @@ -351,6 +368,17 @@ struct kvm_vcpu_arch {
>>   		u64 last_steal;
>>   		gpa_t base;
>>   	} steal;
>> +
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	struct {
>> +		struct gfn_to_hva_cache	cache;
>> +		gfn_t			gfns[ASYNC_PF_PER_VCPU];
>> +		u64			control_block;
>> +		u16			id;
>> +		bool			send_user_only;
>> +		u64			no_fault_inst_range;
> 
> What are all of these fields? This implies functionality not covered
> in the commit message, and it's not at all clear what these are.
> 
> For example, what exactly is `no_fault_inst_range`? If it's a range,
> surely that needs a start/end or base/size pair rather than a single
> value?
> 

Ok. I will add more words about how the data struct is designed and
how it's used etc in next revision.

>> +	} apf;
>> +#endif
>>   };
>>   
>>   /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
>> @@ -604,6 +632,21 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>   
>>   static inline void __cpu_init_stage2(void) {}
>>   
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn);
>> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu);
>> +bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
>> +int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
>> +			    gpa_t gpa, gfn_t gfn);
>> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>> +				     struct kvm_async_pf *work);
>> +void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>> +				     struct kvm_async_pf *work);
>> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
>> +			       struct kvm_async_pf *work);
>> +long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu);
>> +#endif /* CONFIG_KVM_ASYNC_PF */
>> +
>>   /* Guest/host FPSIMD coordination helpers */
>>   int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
>>   void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
>> diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
>> new file mode 100644
>> index 000000000000..0ea481dd1c7a
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/kvm_para.h
>> @@ -0,0 +1,27 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _ASM_ARM_KVM_PARA_H
>> +#define _ASM_ARM_KVM_PARA_H
>> +
>> +#include <uapi/asm/kvm_para.h>
>> +
>> +static inline bool kvm_check_and_clear_guest_paused(void)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline unsigned int kvm_arch_para_features(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline unsigned int kvm_arch_para_hints(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline bool kvm_para_available(void)
>> +{
>> +	return false;
>> +}
>> +
>> +#endif /* _ASM_ARM_KVM_PARA_H */
>> diff --git a/arch/arm64/include/uapi/asm/Kbuild b/arch/arm64/include/uapi/asm/Kbuild
>> index 602d137932dc..f66554cd5c45 100644
>> --- a/arch/arm64/include/uapi/asm/Kbuild
>> +++ b/arch/arm64/include/uapi/asm/Kbuild
>> @@ -1,3 +1 @@
>>   # SPDX-License-Identifier: GPL-2.0
>> -
>> -generic-y += kvm_para.h
>> diff --git a/arch/arm64/include/uapi/asm/kvm_para.h b/arch/arm64/include/uapi/asm/kvm_para.h
>> new file mode 100644
>> index 000000000000..e0bd0e579b9a
>> --- /dev/null
>> +++ b/arch/arm64/include/uapi/asm/kvm_para.h
>> @@ -0,0 +1,22 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +#ifndef _UAPI_ASM_ARM_KVM_PARA_H
>> +#define _UAPI_ASM_ARM_KVM_PARA_H
>> +
>> +#include <linux/types.h>
>> +
>> +#define KVM_FEATURE_ASYNC_PF	0
>> +
>> +/* Async PF */
>> +#define KVM_ASYNC_PF_ENABLED		(1 << 0)
>> +#define KVM_ASYNC_PF_SEND_ALWAYS	(1 << 1)
>> +
>> +#define KVM_PV_REASON_PAGE_NOT_PRESENT	1
>> +#define KVM_PV_REASON_PAGE_READY	2
>> +
>> +struct kvm_vcpu_pv_apf_data {
>> +	__u32	reason;
>> +	__u8	pad[60];
>> +	__u32	enabled;
>> +};
> 
> What's all the padding for?
> 

The padding is ensure the @reason and @enabled in different cache
line. @reason is shared by host/guest, while @enabled is almostly
owned by guest.

>> +
>> +#endif /* _UAPI_ASM_ARM_KVM_PARA_H */
>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>> index 449386d76441..1053e16b1739 100644
>> --- a/arch/arm64/kvm/Kconfig
>> +++ b/arch/arm64/kvm/Kconfig
>> @@ -34,6 +34,7 @@ config KVM
>>   	select KVM_VFIO
>>   	select HAVE_KVM_EVENTFD
>>   	select HAVE_KVM_IRQFD
>> +	select KVM_ASYNC_PF
>>   	select KVM_ARM_PMU if HW_PERF_EVENTS
>>   	select HAVE_KVM_MSI
>>   	select HAVE_KVM_IRQCHIP
>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>> index 5ffbdc39e780..3be24c1e401f 100644
>> --- a/arch/arm64/kvm/Makefile
>> +++ b/arch/arm64/kvm/Makefile
>> @@ -37,3 +37,5 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic/vgic-debug.o
>>   kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/irqchip.o
>>   kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
>>   kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
>> +kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
>> +kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/arm/async_pf.o
>> diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
>> index bdc0124a064a..22007dd3b9f0 100644
>> --- a/include/linux/arm-smccc.h
>> +++ b/include/linux/arm-smccc.h
>> @@ -94,6 +94,7 @@
>>   
>>   /* KVM "vendor specific" services */
>>   #define ARM_SMCCC_KVM_FUNC_FEATURES		0
>> +#define ARM_SMCCC_KVM_FUNC_APF			1
>>   #define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
>>   #define ARM_SMCCC_KVM_NUM_FUNCS			128
>>   
>> @@ -102,6 +103,11 @@
>>   			   ARM_SMCCC_SMC_32,				\
>>   			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
>>   			   ARM_SMCCC_KVM_FUNC_FEATURES)
>> +#define ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID				\
>> +	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,				\
>> +			   ARM_SMCCC_SMC_32,				\
>> +			   ARM_SMCCC_OWNER_VENDOR_HYP,			\
>> +			   ARM_SMCCC_KVM_FUNC_APF)
>>   
>>   #ifndef __ASSEMBLY__
>>   
>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>> index 2cbb57485760..3f62899cef13 100644
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -222,6 +222,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>   		 */
>>   		r = 1;
>>   		break;
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	case KVM_CAP_ASYNC_PF:
>> +		r = 1;
>> +		break;
>> +#endif
>>   	default:
>>   		r = kvm_arch_vm_ioctl_check_extension(kvm, ext);
>>   		break;
>> @@ -269,6 +274,10 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>>   	/* Force users to call KVM_ARM_VCPU_INIT */
>>   	vcpu->arch.target = -1;
>>   	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	vcpu->arch.apf.control_block = 0UL;
>> +	vcpu->arch.apf.no_fault_inst_range = 0x800;
> 
> Where has this magic number come from?
> 

It's the total length (size) of exception vectors. When the vCPU's
PC in the range, the DABT can't be injected.

>> +#endif
>>   
>>   	/* Set up the timer */
>>   	kvm_timer_vcpu_init(vcpu);
>> @@ -426,8 +435,27 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>>   int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>>   {
>>   	bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF);
>> -	return ((irq_lines || kvm_vgic_vcpu_pending_irq(v))
>> -		&& !v->arch.power_off && !v->arch.pause);
>> +
>> +	if ((irq_lines || kvm_vgic_vcpu_pending_irq(v)) &&
>> +	    !v->arch.power_off && !v->arch.pause)
>> +		return true;
>> +
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	if (v->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
>> +		u32 val;
>> +		int ret;
>> +
>> +		if (!list_empty_careful(&v->async_pf.done))
>> +			return true;
>> +
>> +		ret = kvm_read_guest_cached(v->kvm, &v->arch.apf.cache,
>> +					    &val, sizeof(val));
>> +		if (ret || val)
>> +			return true;
>> +	}
>> +#endif
>> +
>> +	return false;
>>   }
>>   
>>   bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
>> @@ -683,6 +711,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>   
>>   		check_vcpu_requests(vcpu);
>>   
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +		kvm_check_async_pf_completion(vcpu);
>> +#endif
> 
> Rather than adding ifdeffery like this, please add an empty stub for
> when CONFIG_KVM_ASYNC_PF isn't selected, so that this can be used
> unconditionally.
> 

Ok.

>> +
>>   		/*
>>   		 * Preparing the interrupts to be injected also
>>   		 * involves poking the GIC, which must be done in a
>> diff --git a/virt/kvm/arm/async_pf.c b/virt/kvm/arm/async_pf.c
>> new file mode 100644
>> index 000000000000..5be49d684de3
>> --- /dev/null
>> +++ b/virt/kvm/arm/async_pf.c
>> @@ -0,0 +1,335 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Asynchronous Page Fault Support
>> + *
>> + * Copyright (C) 2020 Red Hat, Inc., Gavin Shan
>> + *
>> + * Based on arch/x86/kernel/kvm.c
>> + */
>> +
>> +#include <linux/arm-smccc.h>
>> +#include <linux/kvm_host.h>
>> +#include <asm/kvm_emulate.h>
>> +#include <kvm/arm_hypercalls.h>
>> +
>> +static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
>> +{
>> +	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
>> +}
>> +
>> +static inline u32 kvm_async_pf_hash_next(u32 key)
>> +{
>> +	return (key + 1) & (ASYNC_PF_PER_VCPU - 1);
>> +}
>> +
>> +static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++)
>> +		vcpu->arch.apf.gfns[i] = ~0;
>> +}
>> +
>> +/*
>> + * Add gfn to the hash table. It's ensured there is a free entry
>> + * when this function is called.
>> + */
>> +static void kvm_async_pf_hash_add(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 key = kvm_async_pf_hash_fn(gfn);
>> +
>> +	while (vcpu->arch.apf.gfns[key] != ~0)
>> +		key = kvm_async_pf_hash_next(key);
>> +
>> +	vcpu->arch.apf.gfns[key] = gfn;
>> +}
>> +
>> +static u32 kvm_async_pf_hash_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 key = kvm_async_pf_hash_fn(gfn);
>> +	int i;
>> +
>> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++) {
>> +		if (vcpu->arch.apf.gfns[key] == gfn ||
>> +		    vcpu->arch.apf.gfns[key] == ~0)
>> +			break;
>> +
>> +		key = kvm_async_pf_hash_next(key);
>> +	}
>> +
>> +	return key;
>> +}
>> +
>> +static void kvm_async_pf_hash_remove(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 i, j, k;
>> +
>> +	i = j = kvm_async_pf_hash_slot(vcpu, gfn);
>> +	while (true) {
>> +		vcpu->arch.apf.gfns[i] = ~0;
>> +
>> +		do {
>> +			j = kvm_async_pf_hash_next(j);
>> +			if (vcpu->arch.apf.gfns[j] == ~0)
>> +				return;
>> +
>> +			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
>> +			/*
>> +			 * k lies cyclically in ]i,j]
>> +			 * |    i.k.j |
>> +			 * |....j i.k.| or  |.k..j i...|
>> +			 */
>> +		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
>> +
>> +		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
>> +		i = j;
>> +	}
>> +}
> 
> This looks like a copy-paste of code under arch/x86.
> 
> This looks like something that should be factored into common code
> rather than duplicated. Do we not have an existing common hash table
> implementation that we can use rather than building one specific to KVM
> async page faults?
> 

Yeah, It's copied from arch/x86. I will explore the possiblity to
make the code is sharable among multiple architectures.

>> +
>> +bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 key = kvm_async_pf_hash_slot(vcpu, gfn);
>> +
>> +	return vcpu->arch.apf.gfns[key] == gfn;
>> +}
>> +
>> +static inline int kvm_async_pf_read_cache(struct kvm_vcpu *vcpu, u32 *val)
>> +{
>> +	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
>> +				     val, sizeof(*val));
>> +}
>> +
>> +static inline int kvm_async_pf_write_cache(struct kvm_vcpu *vcpu, u32 val)
>> +{
>> +	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.cache,
>> +				      &val, sizeof(val));
>> +}
>> +
>> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 vbar, pc;
>> +	u32 val;
>> +	int ret;
>> +
>> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
>> +		return false;
>> +
>> +	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
>> +		return false;
>> +
>> +	/* Pending page fault, which ins't acknowledged by guest */
>> +	ret = kvm_async_pf_read_cache(vcpu, &val);
>> +	if (ret || val)
>> +		return false;
>> +
>> +	/*
>> +	 * Events can't be injected through data abort because it's
>> +	 * going to clobber ELR_EL1, which might not consued (or saved)
>> +	 * by guest yet.
>> +	 */
>> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
>> +	pc = *vcpu_pc(vcpu);
>> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
>> +		return false;
> 
> Ah, so that's when this `no_fault_inst_range` is for.
> 
> As-is this is not sufficient, and we'll need t be extremely careful
> here.
> 
> The vectors themselves typically only have a small amount of stub code,
> and the bulk of the non-reentrant exception entry work happens
> elsewhere, in a mixture of assembly and C code that isn't even virtually
> contiguous with either the vectors or itself.
> 
> It's possible in theory that code in modules (or perhaps in eBPF JIT'd
> code) that isn't safe to take a fault from, so even having a contiguous
> range controlled by the kernel isn't ideal.
> 
> How does this work on x86?
> 

Yeah, here we just provide a mechanism to forbid injecting data abort. The
range is fed by guest through HVC call. So I think it's guest related issue.
You had more comments about this in PATCH[9]. I will explain a bit more there.

x86 basically relies on EFLAGS[IF] flag. The async page fault can be injected
if it's on. Otherwise, it's forbidden. It's workable because exception is
special interrupt to x86 if I'm correct.

            return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
                   !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
                         (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));


>> +
>> +	return true;
>> +}
>> +
>> +/*
>> + * We need deliver the page present signal as quick as possible because
>> + * it's performance critical. So the signal is delivered no matter which
>> + * privilege level the guest has. It's possible the signal can't be handled
>> + * by the guest immediately. However, host doesn't contribute the delay
>> + * anyway.
>> + */
>> +bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 vbar, pc;
>> +	u32 val;
>> +	int ret;
>> +
>> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
>> +		return true;
>> +
>> +	/* Pending page fault, which ins't acknowledged by guest */
>> +	ret = kvm_async_pf_read_cache(vcpu, &val);
>> +	if (ret || val)
>> +		return false;
>> +
>> +	/*
>> +	 * Events can't be injected through data abort because it's
>> +	 * going to clobber ELR_EL1, which might not consued (or saved)
>> +	 * by guest yet.
>> +	 */
>> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
>> +	pc = *vcpu_pc(vcpu);
>> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
>> +		return false;
>> +
>> +	return true;
>> +}
> 
> Much of this is identical to the not_present case, so the same comments
> apply. The common bits should probably be factored out into a helper.
> 

Yep, I will take look to introduce a helper function if needed.

>> +
>> +int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
>> +			    gpa_t gpa, gfn_t gfn)
>> +{
>> +	struct kvm_arch_async_pf arch;
>> +	unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gfn);
>> +
>> +	arch.token = (vcpu->arch.apf.id++ << 16) | vcpu->vcpu_id;
>> +	arch.gfn = gfn;
>> +	arch.esr = esr;
>> +
>> +	return kvm_setup_async_pf(vcpu, gpa, hva, &arch);
>> +}
>> +
>> +/*
>> + * It's garanteed that no pending asynchronous page fault when this is
>> + * called. It means all previous issued asynchronous page faults have
>> + * been acknoledged.
>> + */
>> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>> +				     struct kvm_async_pf *work)
>> +{
>> +	int ret;
>> +
>> +	kvm_async_pf_hash_add(vcpu, work->arch.gfn);
>> +	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT);
>> +	if (ret) {
>> +		kvm_err("%s: Error %d writing cache\n", __func__, ret);
>> +		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
>> +		return;
>> +	}
>> +
>> +	kvm_inject_dabt(vcpu, work->arch.token);
>> +}
>> +
>> +/*
>> + * It's garanteed that no pending asynchronous page fault when this is
>> + * called. It means all previous issued asynchronous page faults have
>> + * been acknoledged.
>> + */
>> +void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>> +				 struct kvm_async_pf *work)
>> +{
>> +	int ret;
>> +
>> +	/* Broadcast wakeup */
>> +	if (work->wakeup_all)
>> +		work->arch.token = ~0;
>> +	else
>> +		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
>> +
>> +	ret = kvm_async_pf_write_cache(vcpu, KVM_PV_REASON_PAGE_READY);
>> +	if (ret) {
>> +		kvm_err("%s: Error %d writing cache\n", __func__, ret);
>> +		return;
>> +	}
>> +
>> +	kvm_inject_dabt(vcpu, work->arch.token);
> 
> So the guest sees a fake S1 abort with a fake address?
> 
> How is the guest expected to distinguish this from a real S1 fault?
> 

KVM_PV_REASON_PAGE_READY is stored in the per-vCPU status block, which
is represented by the following data struct. When the @reason is zero,
it's normal fault. Otherwise, it's a async page fault. So we shouldn't
have async page fault and normal one cross over.

    struct kvm_vcpu_pv_apf_data {
         __u32   reason;
         __u8    pad[60];
         __u32   enabled;
    };

>> +}
>> +
>> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
>> +			       struct kvm_async_pf *work)
>> +{
>> +	struct kvm_memory_slot *memslot;
>> +	unsigned int esr = work->arch.esr;
>> +	phys_addr_t gpa = work->cr2_or_gpa;
>> +	gfn_t gfn = gpa >> PAGE_SHIFT;
> 
> Perhaps:
> 
> 	gfn_t gfn = gpa_to_gfn(gpa);
> 

Sure.

>> +	unsigned long hva;
>> +	bool write_fault, writable;
>> +	int idx;
>> +
>> +	/*
>> +	 * We shouldn't issue prefault for special work to wake up
>> +	 * all pending tasks because the associated token (address)
>> +	 * is invalid.
>> +	 */
> 
> I'm not sure what this comment is trying to say.
> 

There are two types of workers. One of them is to wakeup all sleepers
in guest, preparing to shutdown the machine. For this type of wakeup,
the token (faulting address) is set to ~0, which is invalid. So there
isn't a prefault issued for this particular wakeup. I'll improve the
comments in next reivion :)

>> +	if (work->wakeup_all)
>> +		return;
>> +
>> +	/*
>> +	 * The gpa was validated before the work is started. However, the
>> +	 * memory slots might be changed since then. So we need to redo the
>> +	 * validatation here.
>> +	 */
>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +
>> +	write_fault = kvm_is_write_fault(esr);
>> +	memslot = gfn_to_memslot(vcpu->kvm, gfn);
>> +	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
>> +	if (kvm_is_error_hva(hva) || (write_fault && !writable))
>> +		goto out;
>> +
>> +	kvm_handle_user_mem_abort(vcpu, esr, memslot, gpa, hva, true);
>> +
>> +out:
>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +}
>> +
>> +static long kvm_async_pf_update_enable_reg(struct kvm_vcpu *vcpu, u64 data)
>> +{
>> +	bool enabled, enable;
>> +	gpa_t gpa = (data & ~0x3F);
> 
> What exactly is going on here? Why are the low 7 bits of data not valid?
> 
> This will also truncate the value to 32 bits; did you mean to do that?
> 

Yes, because the control block (@data) is 64-bytes aligned.

>> +	int ret;
>> +
>> +	enabled = !!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED);
>> +	enable = !!(data & KVM_ASYNC_PF_ENABLED);
>> +	if (enable == enabled) {
>> +		kvm_debug("%s: Async PF has been %s (0x%llx -> 0x%llx)\n",
>> +			  __func__, enabled ? "enabled" : "disabled",
>> +			  vcpu->arch.apf.control_block, data);
>> +		return SMCCC_RET_NOT_REQUIRED;
>> +	}
>> +
>> +	if (enable) {
>> +		ret = kvm_gfn_to_hva_cache_init(
>> +			vcpu->kvm, &vcpu->arch.apf.cache,
>> +			gpa + offsetof(struct kvm_vcpu_pv_apf_data, reason),
>> +			sizeof(u32));
>> +		if (ret) {
>> +			kvm_err("%s: Error %d initializing cache on 0x%llx\n",
>> +				__func__, ret, data);
>> +			return SMCCC_RET_NOT_SUPPORTED;
>> +		}
>> +
>> +		kvm_async_pf_hash_reset(vcpu);
>> +		vcpu->arch.apf.send_user_only =
>> +			!(data & KVM_ASYNC_PF_SEND_ALWAYS);
>> +		kvm_async_pf_wakeup_all(vcpu);
>> +		vcpu->arch.apf.control_block = data;
>> +	} else {
>> +		kvm_clear_async_pf_completion_queue(vcpu);
>> +		vcpu->arch.apf.control_block = data;
>> +	}
>> +
>> +	return SMCCC_RET_SUCCESS;
>> +}
>> +
>> +long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 data, func, val, range;
>> +	long ret = SMCCC_RET_SUCCESS;
>> +
>> +	data = (smccc_get_arg2(vcpu) << 32) | smccc_get_arg1(vcpu);
> 
> What prevents the high bits being set in arg1?
> 

The function shuld work for both 32-bit and 64-bit guest.

>> +	func = data & (0xfful << 56);
>> +	val = data & ~(0xfful << 56);
>> +	switch (func) {
>> +	case BIT(63):
>> +		ret = kvm_async_pf_update_enable_reg(vcpu, val);
> 
> Please give BIT(63) a mnemonic.
> 
>> +		break;
>> +	case BIT(62):
> 
> Likewise.
> 

Yes.

>> +		if (vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
>> +			ret = SMCCC_RET_NOT_SUPPORTED;
>> +			break;
>> +		}
>> +
>> +		range = vcpu->arch.apf.no_fault_inst_range;
>> +		vcpu->arch.apf.no_fault_inst_range = max(range, val);
> 
> Huh? How is the `no_fault_inst_range` set by he guest?
> 

I'll explain to you when replying your comments on PATCH[9] :)

> Thanks,
> Mark.
> 

Thanks,
Gavin

>> +		break;
>> +	default:
>> +		kvm_err("%s: Unrecognized function 0x%llx\n", __func__, func);
>> +		ret = SMCCC_RET_NOT_SUPPORTED;
>> +	}
>> +
>> +	return ret;
>> +}
>> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
>> index db6dce3d0e23..a7e0fe17e2f1 100644
>> --- a/virt/kvm/arm/hypercalls.c
>> +++ b/virt/kvm/arm/hypercalls.c
>> @@ -70,7 +70,15 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>>   		break;
>>   	case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
>>   		val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_APF);
>> +#endif
>>   		break;
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	case ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID:
>> +		val[0] = kvm_async_pf_hypercall(vcpu);
>> +		break;
>> +#endif
>>   	default:
>>   		return kvm_psci_call(vcpu);
>>   	}
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 95aaabb2b1fc..a303815845a2 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1656,6 +1656,30 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>>   	       (hva & ~(map_size - 1)) + map_size <= uaddr_end;
>>   }
>>   
>> +static bool try_async_pf(struct kvm_vcpu *vcpu, u32 esr, gpa_t gpa,
>> +			 gfn_t gfn, kvm_pfn_t *pfn, bool write,
>> +			 bool *writable, bool prefault)
>> +{
>> +	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	bool async = false;
>> +
>> +	/* Bail if *pfn has correct page */
>> +	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable);
>> +	if (!async)
>> +		return false;
>> +
>> +	if (!prefault && kvm_arch_can_inject_async_page_not_present(vcpu)) {
>> +		if (kvm_async_pf_hash_find(vcpu, gfn) ||
>> +		    kvm_arch_setup_async_pf(vcpu, esr, gpa, gfn))
>> +			return true;
>> +	}
>> +#endif
>> +
>> +	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
>> +	return false;
>> +}
>> +
>>   int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
>>   			      struct kvm_memory_slot *memslot,
>>   			      phys_addr_t fault_ipa, unsigned long hva,
>> @@ -1737,7 +1761,10 @@ int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
>>   	 */
>>   	smp_rmb();
>>   
>> -	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>> +	if (try_async_pf(vcpu, esr, fault_ipa, gfn, &pfn,
>> +			 write_fault, &writable, prefault))
>> +		return 1;
>> +
>>   	if (pfn == KVM_PFN_ERR_HWPOISON) {
>>   		kvm_send_hwpoison_signal(hva, vma_shift);
>>   		return 0;
>> -- 
>> 2.23.0
>>
> 

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 9/9] arm64: Support async page fault
  2020-05-08  3:29 ` [PATCH RFCv2 9/9] arm64: Support async page fault Gavin Shan
  2020-05-26 12:56   ` Mark Rutland
@ 2020-05-27  6:48   ` Paolo Bonzini
  2020-05-28  6:14     ` Gavin Shan
  1 sibling, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2020-05-27  6:48 UTC (permalink / raw)
  To: Gavin Shan, kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

Hi Gavin,

I definitely appreciate the work, but this is repeating most of the
mistakes done in the x86 implementation.  In particular:

- the page ready signal can be done as an interrupt, rather than an
exception.  This is because "page ready" can be handled asynchronously,
in contrast to "page not present" which must be done on the same
instruction that triggers it.  You can refer to the recent series from
Vitaly Kuznetsov that switched "page ready" to an interrupt.

- the page not present is reusing the memory abort exception, and
there's really no reason to do so.  I think it would be best if ARM
could reserve one ESR exception code for the hypervisor.  Mark, any
ideas how to proceed here?

- for x86 we're also thinking of initiating the page fault from the
exception handler, rather than doing so from the hypervisor before
injecting the exception.  If ARM leads the way here, we would do our
best to share code when x86 does the same.

- do not bother with using KVM_ASYNC_PF_SEND_ALWAYS, it's a fringe case
that adds a lot of complexity.

Also, please include me on further iterations of the series.

Thanks!

Paolo

On 08/05/20 05:29, Gavin Shan wrote:
> This supports asynchronous page fault for the guest. The design is
> similar to what x86 has: on receiving a PAGE_NOT_PRESENT signal from
> the host, the current task is either rescheduled or put into power
> saving mode. The task will be waken up when PAGE_READY signal is
> received. The PAGE_READY signal might be received in the context
> of the suspended process, to be waken up. That means the suspended
> process has to wake up itself, but it's not safe and prone to cause
> dead-lock on CPU runqueue lock. So the wakeup is delayed on returning
> from kernel space to user space or idle process is picked for running.
> 
> The signals are conveyed through the async page fault control block,
> which was passed to host on enabling the functionality. On each page
> fault, the control block is checked and switch to the async page fault
> handling flow if any signals exist.
> 
> The feature is put into the CONFIG_KVM_GUEST umbrella, which is added
> by this patch. So we have inline functions implemented in kvm_para.h,
> like other architectures do, to check if async page fault (one of the
> KVM para-virtualized features) is available. Also, the kernel boot
> parameter "no-kvmapf" can be specified to disable the feature.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  arch/arm64/Kconfig                 |  11 +
>  arch/arm64/include/asm/exception.h |   3 +
>  arch/arm64/include/asm/kvm_para.h  |  27 +-
>  arch/arm64/kernel/entry.S          |  33 +++
>  arch/arm64/kernel/process.c        |   4 +
>  arch/arm64/mm/fault.c              | 434 +++++++++++++++++++++++++++++
>  6 files changed, 505 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 40fb05d96c60..2d5e5ee62d6d 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1045,6 +1045,17 @@ config PARAVIRT
>  	  under a hypervisor, potentially improving performance significantly
>  	  over full virtualization.
>  
> +config KVM_GUEST
> +	bool "KVM Guest Support"
> +	depends on PARAVIRT
> +	default y
> +	help
> +	  This option enables various optimizations for running under the KVM
> +	  hypervisor. Overhead for the kernel when not running inside KVM should
> +	  be minimal.
> +
> +	  In case of doubt, say Y
> +
>  config PARAVIRT_TIME_ACCOUNTING
>  	bool "Paravirtual steal time accounting"
>  	select PARAVIRT
> diff --git a/arch/arm64/include/asm/exception.h b/arch/arm64/include/asm/exception.h
> index 7a6e81ca23a8..d878afa42746 100644
> --- a/arch/arm64/include/asm/exception.h
> +++ b/arch/arm64/include/asm/exception.h
> @@ -46,4 +46,7 @@ void bad_el0_sync(struct pt_regs *regs, int reason, unsigned int esr);
>  void do_cp15instr(unsigned int esr, struct pt_regs *regs);
>  void do_el0_svc(struct pt_regs *regs);
>  void do_el0_svc_compat(struct pt_regs *regs);
> +#ifdef CONFIG_KVM_GUEST
> +void kvm_async_pf_delayed_wake(void);
> +#endif
>  #endif	/* __ASM_EXCEPTION_H */
> diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
> index 0ea481dd1c7a..b2f8ef243df7 100644
> --- a/arch/arm64/include/asm/kvm_para.h
> +++ b/arch/arm64/include/asm/kvm_para.h
> @@ -3,6 +3,20 @@
>  #define _ASM_ARM_KVM_PARA_H
>  
>  #include <uapi/asm/kvm_para.h>
> +#include <linux/of.h>
> +#include <asm/hypervisor.h>
> +
> +#ifdef CONFIG_KVM_GUEST
> +static inline int kvm_para_available(void)
> +{
> +	return 1;
> +}
> +#else
> +static inline int kvm_para_available(void)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_KVM_GUEST */
>  
>  static inline bool kvm_check_and_clear_guest_paused(void)
>  {
> @@ -11,17 +25,16 @@ static inline bool kvm_check_and_clear_guest_paused(void)
>  
>  static inline unsigned int kvm_arch_para_features(void)
>  {
> -	return 0;
> +	unsigned int features = 0;
> +
> +	if (kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_APF))
> +		features |= (1 << KVM_FEATURE_ASYNC_PF);
> +
> +	return features;
>  }
>  
>  static inline unsigned int kvm_arch_para_hints(void)
>  {
>  	return 0;
>  }
> -
> -static inline bool kvm_para_available(void)
> -{
> -	return false;
> -}
> -
>  #endif /* _ASM_ARM_KVM_PARA_H */
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index ddcde093c433..15efd57129ff 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -751,12 +751,45 @@ finish_ret_to_user:
>  	enable_step_tsk x1, x2
>  #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
>  	bl	stackleak_erase
> +#endif
> +#ifdef CONFIG_KVM_GUEST
> +	bl	kvm_async_pf_delayed_wake
>  #endif
>  	kernel_exit 0
>  ENDPROC(ret_to_user)
>  
>  	.popsection				// .entry.text
>  
> +#ifdef CONFIG_KVM_GUEST
> +	.pushsection ".rodata", "a"
> +SYM_DATA_START(__exception_handlers_offset)
> +	.quad	0
> +	.quad	0
> +	.quad	0
> +	.quad	0
> +	.quad	el1_sync - vectors
> +	.quad	el1_irq - vectors
> +	.quad	0
> +	.quad	el1_error - vectors
> +	.quad	el0_sync - vectors
> +	.quad	el0_irq - vectors
> +	.quad	0
> +	.quad	el0_error - vectors
> +#ifdef CONFIG_COMPAT
> +	.quad	el0_sync_compat - vectors
> +	.quad	el0_irq_compat - vectors
> +	.quad	0
> +	.quad	el0_error_compat - vectors
> +#else
> +	.quad	0
> +	.quad	0
> +	.quad	0
> +	.quad	0
> +#endif
> +SYM_DATA_END(__exception_handlers_offset)
> +	.popsection				// .rodata
> +#endif /* CONFIG_KVM_GUEST */
> +
>  #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
>  /*
>   * Exception vectors trampoline.
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 56be4cbf771f..5e7ee553566d 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -53,6 +53,7 @@
>  #include <asm/processor.h>
>  #include <asm/pointer_auth.h>
>  #include <asm/stacktrace.h>
> +#include <asm/exception.h>
>  
>  #if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_STACKPROTECTOR_PER_TASK)
>  #include <linux/stackprotector.h>
> @@ -70,6 +71,9 @@ void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);
>  
>  static void __cpu_do_idle(void)
>  {
> +#ifdef CONFIG_KVM_GUEST
> +	kvm_async_pf_delayed_wake();
> +#endif
>  	dsb(sy);
>  	wfi();
>  }
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index c9cedc0432d2..cbf8b52135c9 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -19,10 +19,12 @@
>  #include <linux/page-flags.h>
>  #include <linux/sched/signal.h>
>  #include <linux/sched/debug.h>
> +#include <linux/swait.h>
>  #include <linux/highmem.h>
>  #include <linux/perf_event.h>
>  #include <linux/preempt.h>
>  #include <linux/hugetlb.h>
> +#include <linux/kvm_para.h>
>  
>  #include <asm/acpi.h>
>  #include <asm/bug.h>
> @@ -48,8 +50,31 @@ struct fault_info {
>  	const char *name;
>  };
>  
> +#ifdef CONFIG_KVM_GUEST
> +struct kvm_task_sleep_node {
> +	struct hlist_node	link;
> +	struct swait_queue_head	wq;
> +	u32			token;
> +	struct task_struct	*task;
> +	int			cpu;
> +	bool			halted;
> +	bool			delayed;
> +};
> +
> +struct kvm_task_sleep_head {
> +	raw_spinlock_t		lock;
> +	struct hlist_head	list;
> +};
> +#endif /* CONFIG_KVM_GUEST */
> +
>  static const struct fault_info fault_info[];
>  static struct fault_info debug_fault_info[];
> +#ifdef CONFIG_KVM_GUEST
> +extern char __exception_handlers_offset[];
> +static bool async_pf_available = true;
> +static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_data) __aligned(64);
> +static DEFINE_PER_CPU(struct kvm_task_sleep_head, apf_head);
> +#endif
>  
>  static inline const struct fault_info *esr_to_fault_info(unsigned int esr)
>  {
> @@ -717,10 +742,281 @@ static const struct fault_info fault_info[] = {
>  	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 63"			},
>  };
>  
> +#ifdef CONFIG_KVM_GUEST
> +static inline unsigned int kvm_async_pf_read_enabled(void)
> +{
> +	return __this_cpu_read(apf_data.enabled);
> +}
> +
> +static inline void kvm_async_pf_write_enabled(unsigned int val)
> +{
> +	__this_cpu_write(apf_data.enabled, val);
> +}
> +
> +static inline unsigned int kvm_async_pf_read_reason(void)
> +{
> +	return __this_cpu_read(apf_data.reason);
> +}
> +
> +static inline void kvm_async_pf_write_reason(unsigned int val)
> +{
> +	__this_cpu_write(apf_data.reason, val);
> +}
> +
> +#define kvm_async_pf_lock(b, flags)					\
> +	raw_spin_lock_irqsave(&(b)->lock, (flags))
> +#define kvm_async_pf_trylock(b, flags)					\
> +	raw_spin_trylock_irqsave(&(b)->lock, (flags))
> +#define kvm_async_pf_unlock(b, flags)					\
> +	raw_spin_unlock_irqrestore(&(b)->lock, (flags))
> +#define kvm_async_pf_unlock_and_clear(b, flags)				\
> +	do {								\
> +		raw_spin_unlock_irqrestore(&(b)->lock, (flags));	\
> +		kvm_async_pf_write_reason(0);				\
> +	} while (0)
> +
> +static struct kvm_task_sleep_node *kvm_async_pf_find(
> +		struct kvm_task_sleep_head *b, u32 token)
> +{
> +	struct kvm_task_sleep_node *n;
> +	struct hlist_node *p;
> +
> +	hlist_for_each(p, &b->list) {
> +		n = hlist_entry(p, typeof(*n), link);
> +		if (n->token == token)
> +			return n;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void kvm_async_pf_wait(u32 token, int in_kernel)
> +{
> +	struct kvm_task_sleep_head *b = this_cpu_ptr(&apf_head);
> +	struct kvm_task_sleep_node n, *e;
> +	DECLARE_SWAITQUEUE(wait);
> +	unsigned long flags;
> +
> +	kvm_async_pf_lock(b, flags);
> +	e = kvm_async_pf_find(b, token);
> +	if (e) {
> +		/* dummy entry exist -> wake up was delivered ahead of PF */
> +		hlist_del(&e->link);
> +		kfree(e);
> +		kvm_async_pf_unlock_and_clear(b, flags);
> +
> +		return;
> +	}
> +
> +	n.token = token;
> +	n.task = current;
> +	n.cpu = smp_processor_id();
> +	n.halted = is_idle_task(current) ||
> +		   (IS_ENABLED(CONFIG_PREEMPT_COUNT) ?
> +		    preempt_count() > 1 || rcu_preempt_depth() : in_kernel);
> +	n.delayed = false;
> +	init_swait_queue_head(&n.wq);
> +	hlist_add_head(&n.link, &b->list);
> +	kvm_async_pf_unlock_and_clear(b, flags);
> +
> +	for (;;) {
> +		if (!n.halted) {
> +			prepare_to_swait_exclusive(&n.wq, &wait,
> +						   TASK_UNINTERRUPTIBLE);
> +		}
> +
> +		if (hlist_unhashed(&n.link))
> +			break;
> +
> +		if (!n.halted) {
> +			schedule();
> +		} else {
> +			dsb(sy);
> +			wfi();
> +		}
> +	}
> +
> +	if (!n.halted)
> +		finish_swait(&n.wq, &wait);
> +}
> +
> +/*
> + * There are two cases the suspended processed can't be waken up
> + * immediately: The waker is exactly the suspended process, or
> + * the current CPU runqueue has been locked. Otherwise, we might
> + * run into dead-lock.
> + */
> +static inline void kvm_async_pf_wake_one(struct kvm_task_sleep_node *n)
> +{
> +	if (n->task == current ||
> +	    cpu_rq_is_locked(smp_processor_id())) {
> +		n->delayed = true;
> +		return;
> +	}
> +
> +	hlist_del_init(&n->link);
> +	if (n->halted)
> +		smp_send_reschedule(n->cpu);
> +	else
> +		swake_up_one(&n->wq);
> +}
> +
> +void kvm_async_pf_delayed_wake(void)
> +{
> +	struct kvm_task_sleep_head *b;
> +	struct kvm_task_sleep_node *n;
> +	struct hlist_node *p, *next;
> +	unsigned int reason;
> +	unsigned long flags;
> +
> +	if (!kvm_async_pf_read_enabled())
> +		return;
> +
> +	/*
> +	 * We're running in the edging context, we need to complete
> +	 * the work as quick as possible. So we have a preliminary
> +	 * check without holding the lock.
> +	 */
> +	b = this_cpu_ptr(&apf_head);
> +	if (hlist_empty(&b->list))
> +		return;
> +
> +	/*
> +	 * Set the async page fault reason to something to avoid
> +	 * receiving the signals, which might cause lock contention
> +	 * and possibly dead-lock. As we're in guest context, it's
> +	 * safe to set the reason here.
> +	 *
> +	 * There might be pending signals. For that case, we needn't
> +	 * do anything. Otherwise, the pending signal will be lost.
> +	 */
> +	reason = kvm_async_pf_read_reason();
> +	if (!reason) {
> +		kvm_async_pf_write_reason(KVM_PV_REASON_PAGE_NOT_PRESENT +
> +					  KVM_PV_REASON_PAGE_READY);
> +	}
> +
> +	if (!kvm_async_pf_trylock(b, flags))
> +		goto done;
> +
> +	hlist_for_each_safe(p, next, &b->list) {
> +		n = hlist_entry(p, typeof(*n), link);
> +		if (n->cpu != smp_processor_id())
> +			continue;
> +		if (!n->delayed)
> +			continue;
> +
> +		kvm_async_pf_wake_one(n);
> +	}
> +
> +	kvm_async_pf_unlock(b, flags);
> +
> +done:
> +	if (!reason)
> +		kvm_async_pf_write_reason(0);
> +}
> +NOKPROBE_SYMBOL(kvm_async_pf_delayed_wake);
> +
> +static void kvm_async_pf_wake_all(void)
> +{
> +	struct kvm_task_sleep_head *b;
> +	struct kvm_task_sleep_node *n;
> +	struct hlist_node *p, *next;
> +	unsigned long flags;
> +
> +	b = this_cpu_ptr(&apf_head);
> +	kvm_async_pf_lock(b, flags);
> +
> +	hlist_for_each_safe(p, next, &b->list) {
> +		n = hlist_entry(p, typeof(*n), link);
> +		kvm_async_pf_wake_one(n);
> +	}
> +
> +	kvm_async_pf_unlock(b, flags);
> +
> +	kvm_async_pf_write_reason(0);
> +}
> +
> +static void kvm_async_pf_wake(u32 token)
> +{
> +	struct kvm_task_sleep_head *b = this_cpu_ptr(&apf_head);
> +	struct kvm_task_sleep_node *n;
> +	unsigned long flags;
> +
> +	if (token == ~0) {
> +		kvm_async_pf_wake_all();
> +		return;
> +	}
> +
> +again:
> +	kvm_async_pf_lock(b, flags);
> +
> +	n = kvm_async_pf_find(b, token);
> +	if (!n) {
> +		/*
> +		 * Async PF was not yet handled. Add dummy entry
> +		 * for the token. Busy wait until other CPU handles
> +		 * the async PF on allocation failure.
> +		 */
> +		n = kzalloc(sizeof(*n), GFP_ATOMIC);
> +		if (!n) {
> +			kvm_async_pf_unlock(b, flags);
> +			cpu_relax();
> +			goto again;
> +		}
> +		n->token = token;
> +		n->task = current;
> +		n->cpu = smp_processor_id();
> +		n->halted = false;
> +		n->delayed = false;
> +		init_swait_queue_head(&n->wq);
> +		hlist_add_head(&n->link, &b->list);
> +	} else {
> +		kvm_async_pf_wake_one(n);
> +	}
> +
> +	kvm_async_pf_unlock_and_clear(b, flags);
> +}
> +
> +static bool do_async_pf(unsigned long addr, unsigned int esr,
> +		       struct pt_regs *regs)
> +{
> +	u32 reason;
> +
> +	if (!kvm_async_pf_read_enabled())
> +		return false;
> +
> +	reason = kvm_async_pf_read_reason();
> +	if (!reason)
> +		return false;
> +
> +	switch (reason) {
> +	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +		kvm_async_pf_wait((u32)addr, !user_mode(regs));
> +		break;
> +	case KVM_PV_REASON_PAGE_READY:
> +		kvm_async_pf_wake((u32)addr);
> +		break;
> +	default:
> +		if (reason) {
> +			pr_warn("%s: Illegal reason %d\n", __func__, reason);
> +			kvm_async_pf_write_reason(0);
> +		}
> +	}
> +
> +	return true;
> +}
> +#endif /* CONFIG_KVM_GUEST */
> +
>  void do_mem_abort(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>  {
>  	const struct fault_info *inf = esr_to_fault_info(esr);
>  
> +#ifdef CONFIG_KVM_GUEST
> +	if (do_async_pf(addr, esr, regs))
> +		return;
> +#endif
> +
>  	if (!inf->fn(addr, esr, regs))
>  		return;
>  
> @@ -878,3 +1174,141 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned int esr,
>  	debug_exception_exit(regs);
>  }
>  NOKPROBE_SYMBOL(do_debug_exception);
> +
> +#ifdef CONFIG_KVM_GUEST
> +static int __init kvm_async_pf_available(char *arg)
> +{
> +	async_pf_available = false;
> +	return 0;
> +}
> +early_param("no-kvmapf", kvm_async_pf_available);
> +
> +static void kvm_async_pf_enable(bool enable)
> +{
> +	struct arm_smccc_res res;
> +	unsigned long *offsets = (unsigned long *)__exception_handlers_offset;
> +	u32 enabled = kvm_async_pf_read_enabled();
> +	u64 val;
> +	int i;
> +
> +	if (enable == enabled)
> +		return;
> +
> +	if (enable) {
> +		/*
> +		 * Asychonous page faults will be prohibited when CPU runs
> +		 * instructions between the vector base and the maximal
> +		 * offset, plus 4096. The 4096 is the assumped maximal
> +		 * length for individual handler. The hardware registers
> +		 * should be saved to stack at the beginning of the handlers,
> +		 * so 4096 shuld be safe enough.
> +		 */
> +		val = 0;
> +		for (i = 0; i < 16; i++) {
> +			if (offsets[i] > val)
> +				val = offsets[i];
> +		}
> +
> +		val += 4096;
> +		val |= BIT(62);
> +
> +		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
> +				     (u32)val, (u32)(val >> 32), &res);
> +		if (res.a0 != SMCCC_RET_SUCCESS) {
> +			pr_warn("Async PF configuration error %ld on CPU %d\n",
> +				res.a0, smp_processor_id());
> +			return;
> +		}
> +
> +		/* FIXME: Enable KVM_ASYNC_PF_SEND_ALWAYS */
> +		val = BIT(63);
> +		val |= virt_to_phys(this_cpu_ptr(&apf_data));
> +		val |= KVM_ASYNC_PF_ENABLED;
> +
> +		kvm_async_pf_write_enabled(1);
> +		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
> +				     (u32)val, (u32)(val >> 32), &res);
> +		if (res.a0 != SMCCC_RET_SUCCESS) {
> +			pr_warn("Async PF enable error %ld on CPU %d\n",
> +				res.a0, smp_processor_id());
> +			kvm_async_pf_write_enabled(0);
> +			return;
> +		}
> +	} else {
> +		val = BIT(63);
> +		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
> +				     (u32)val, (u32)(val >> 32), &res);
> +		if (res.a0 != SMCCC_RET_SUCCESS) {
> +			pr_warn("Async PF disable error %ld on CPU %d\n",
> +				res.a0, smp_processor_id());
> +			return;
> +		}
> +
> +		kvm_async_pf_write_enabled(0);
> +	}
> +
> +	pr_info("Async PF %s on CPU %d\n",
> +		enable ? "enabled" : "disabled", smp_processor_id());
> +}
> +
> +static void kvm_async_pf_cpu_reboot(void *unused)
> +{
> +	kvm_async_pf_enable(false);
> +}
> +
> +static int kvm_async_pf_cpu_reboot_notify(struct notifier_block *nb,
> +					  unsigned long code, void *unused)
> +{
> +	if (code == SYS_RESTART)
> +		on_each_cpu(kvm_async_pf_cpu_reboot, NULL, 1);
> +
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block kvm_async_pf_cpu_reboot_nb = {
> +	.notifier_call = kvm_async_pf_cpu_reboot_notify,
> +};
> +
> +static int kvm_async_pf_cpu_online(unsigned int cpu)
> +{
> +	struct kvm_task_sleep_head *b;
> +
> +	b = this_cpu_ptr(&apf_head);
> +	raw_spin_lock_init(&b->lock);
> +	kvm_async_pf_enable(true);
> +	return 0;
> +}
> +
> +static int kvm_async_pf_cpu_offline(unsigned int cpu)
> +{
> +	kvm_async_pf_enable(false);
> +	return 0;
> +}
> +
> +static int __init kvm_async_pf_cpu_init(void)
> +{
> +	struct kvm_task_sleep_head *b;
> +	int ret;
> +
> +	if (!kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) ||
> +	    !async_pf_available)
> +		return -EPERM;
> +
> +	register_reboot_notifier(&kvm_async_pf_cpu_reboot_nb);
> +	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> +			"arm/kvm:online", kvm_async_pf_cpu_online,
> +			kvm_async_pf_cpu_offline);
> +	if (ret < 0) {
> +		pr_warn("%s: Error %d to install cpu hotplug callbacks\n",
> +			__func__, ret);
> +		return ret;
> +	}
> +
> +	b = this_cpu_ptr(&apf_head);
> +	raw_spin_lock_init(&b->lock);
> +	kvm_async_pf_enable(true);
> +
> +	return 0;
> +}
> +early_initcall(kvm_async_pf_cpu_init);
> +#endif /* CONFIG_KVM_GUEST */
> 

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
  2020-05-27  2:43     ` Gavin Shan
@ 2020-05-27  7:20       ` Marc Zyngier
  2020-05-28  6:34         ` Gavin Shan
  0 siblings, 1 reply; 38+ messages in thread
From: Marc Zyngier @ 2020-05-27  7:20 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, will, kvmarm,
	linux-arm-kernel

On 2020-05-27 03:43, Gavin Shan wrote:
> Hi Mark,
> 
> On 5/26/20 8:42 PM, Mark Rutland wrote:
>> On Fri, May 08, 2020 at 01:29:13PM +1000, Gavin Shan wrote:
>>> Since kvm/arm32 was removed, this renames kvm_vcpu_get_hsr() to
>>> kvm_vcpu_get_esr() to it a bit more self-explaining because the
>>> functions returns ESR instead of HSR on aarch64. This shouldn't
>>> cause any functional changes.
>>> 
>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> 
>> I think that this would be a nice cleanup on its own, and could be 
>> taken
>> independently of the rest of this series if it were rebased and sent 
>> as
>> a single patch.
>> 
> 
> Yeah, I'll see how PATCH[3,4,5] can be posted independently
> as part of the preparatory work, which is suggested by you
> in another reply.
> 
> By the way, I assume the cleanup patches are good enough to
> target 5.8.rc1/rc2 if you agree.

It's fine to base them on -rc1 or -rc2. They will not be merged
before 5.9 though.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault
  2020-05-27  4:05     ` Gavin Shan
@ 2020-05-27  7:37       ` Marc Zyngier
  2020-05-28  6:32         ` Gavin Shan
  0 siblings, 1 reply; 38+ messages in thread
From: Marc Zyngier @ 2020-05-27  7:37 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, will, kvmarm,
	linux-arm-kernel

On 2020-05-27 05:05, Gavin Shan wrote:
> Hi Mark,
> 

[...]

>>> +struct kvm_vcpu_pv_apf_data {
>>> +	__u32	reason;
>>> +	__u8	pad[60];
>>> +	__u32	enabled;
>>> +};
>> 
>> What's all the padding for?
>> 
> 
> The padding is ensure the @reason and @enabled in different cache
> line. @reason is shared by host/guest, while @enabled is almostly
> owned by guest.

So you are assuming that a cache line is at most 64 bytes.
It is actualy implementation defined, and you can probe for it
by looking at the CTR_EL0 register. There are implementations
ranging from 32 to 256 bytes in the wild, and let's not mention
broken big-little implementations here.

[...]

>>> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu 
>>> *vcpu)
>>> +{
>>> +	u64 vbar, pc;
>>> +	u32 val;
>>> +	int ret;
>>> +
>>> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
>>> +		return false;
>>> +
>>> +	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
>>> +		return false;
>>> +
>>> +	/* Pending page fault, which ins't acknowledged by guest */
>>> +	ret = kvm_async_pf_read_cache(vcpu, &val);
>>> +	if (ret || val)
>>> +		return false;
>>> +
>>> +	/*
>>> +	 * Events can't be injected through data abort because it's
>>> +	 * going to clobber ELR_EL1, which might not consued (or saved)
>>> +	 * by guest yet.
>>> +	 */
>>> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
>>> +	pc = *vcpu_pc(vcpu);
>>> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
>>> +		return false;
>> 
>> Ah, so that's when this `no_fault_inst_range` is for.
>> 
>> As-is this is not sufficient, and we'll need t be extremely careful
>> here.
>> 
>> The vectors themselves typically only have a small amount of stub 
>> code,
>> and the bulk of the non-reentrant exception entry work happens
>> elsewhere, in a mixture of assembly and C code that isn't even 
>> virtually
>> contiguous with either the vectors or itself.
>> 
>> It's possible in theory that code in modules (or perhaps in eBPF JIT'd
>> code) that isn't safe to take a fault from, so even having a 
>> contiguous
>> range controlled by the kernel isn't ideal.
>> 
>> How does this work on x86?
>> 
> 
> Yeah, here we just provide a mechanism to forbid injecting data abort. 
> The
> range is fed by guest through HVC call. So I think it's guest related 
> issue.
> You had more comments about this in PATCH[9]. I will explain a bit more 
> there.
> 
> x86 basically relies on EFLAGS[IF] flag. The async page fault can be 
> injected
> if it's on. Otherwise, it's forbidden. It's workable because exception 
> is
> special interrupt to x86 if I'm correct.
> 
>            return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
>                   !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
>                         (GUEST_INTR_STATE_STI | 
> GUEST_INTR_STATE_MOV_SS));

I really wish this was relying on an architected exception delivery
mechanism that can be blocked by the guest itself (PSTATE.{I,F,A}).
Trying to guess based on the PC won't fly. But these signals are
pretty hard to multiplex with anything else. Like any form of
non-architected exception injection, I don't see a good path forward
unless we start considering something like SDEI.

         M.
-- 
Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault
  2020-05-27  2:39   ` Gavin Shan
@ 2020-05-27  7:48     ` Marc Zyngier
  2020-05-27 16:10       ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Marc Zyngier @ 2020-05-27  7:48 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, pbonzini, will,
	kvmarm, linux-arm-kernel

On 2020-05-27 03:39, Gavin Shan wrote:
> Hi Mark,

[...]

>> Can you run tests with a real workload? For example, a kernel build
>> inside the VM?
>> 
> 
> Yeah, I agree it's far from a realistic workload. However, it's the 
> test case
> which was suggested when async page fault was proposed from day one, 
> according
> to the following document. On the page#34, you can see the benchmark, 
> which is
> similar to what we're doing.
> 
> https://www.linux-kvm.org/images/a/ac/2010-forum-Async-page-faults.pdf

My own question is whether this even makes any sense 10 years later.

The HW has massively changed, and this adds a whole lot of complexity
to both the hypervisor and the guest. It also plays very ugly games
with the exception model, which doesn't give me the warm fuzzy feeling
that it's going to be great.

> Ok. I will test with the workload to build kernel or another better one 
> to
> represent the case.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault
  2020-05-27  7:48     ` Marc Zyngier
@ 2020-05-27 16:10       ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2020-05-27 16:10 UTC (permalink / raw)
  To: Marc Zyngier, Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, will, kvmarm,
	linux-arm-kernel

On 27/05/20 09:48, Marc Zyngier wrote:
> 
> My own question is whether this even makes any sense 10 years later.
> The HW has massively changed, and this adds a whole lot of complexity
> to both the hypervisor and the guest.

It still makes sense, but indeed it's for different reasons.  One
example is host page cache sharing, where (parts of) the host page cache
are visible to the guest.  In this context, async page faults are used
for any kind of host page faults, not just paging out memory due to
overcommit.

But I agree that it is very very important to design the exception model
first, as we're witnessing in x86 land the problems with a poor design.
 Nothing major, but just pain all around.

Paolo

> It also plays very ugly games
> with the exception model, which doesn't give me the warm fuzzy feeling
> that it's going to be great.

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 9/9] arm64: Support async page fault
  2020-05-27  6:48   ` Paolo Bonzini
@ 2020-05-28  6:14     ` Gavin Shan
  2020-05-28  7:03       ` Marc Zyngier
  2020-05-28 10:48       ` Paolo Bonzini
  0 siblings, 2 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-28  6:14 UTC (permalink / raw)
  To: Paolo Bonzini, kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

Hi Paolo,

On 5/27/20 4:48 PM, Paolo Bonzini wrote:
> I definitely appreciate the work, but this is repeating most of the
> mistakes done in the x86 implementation.  In particular:
> 
> - the page ready signal can be done as an interrupt, rather than an
> exception.  This is because "page ready" can be handled asynchronously,
> in contrast to "page not present" which must be done on the same
> instruction that triggers it.  You can refer to the recent series from
> Vitaly Kuznetsov that switched "page ready" to an interrupt.
> 

Yeah, page ready can be handled asynchronously. I think it would be
nice for x86/arm64 to share same design. x86 has 256 vectors and it
seems 0xec is picked for the purpose. However, arm64 doesn't have so
many (interrupt/exception) vectors and PPI might be appropriate for
the purpose if I'm correct, because it has same INTD for all CPUs.
 From this point, it's similar to x86's vector. There are 16 PPIs, which
are in range of 16 to 31, and we might reserve one for this. According
to GICv3/v4 spec, 22 - 30 have been assigned.

> - the page not present is reusing the memory abort exception, and
> there's really no reason to do so.  I think it would be best if ARM
> could reserve one ESR exception code for the hypervisor.  Mark, any
> ideas how to proceed here?
> 

Well, a subclass of ESR exception code, whose DFSC (Data Fault Status
Code) is 0x34, was taken for the purpose in RFCv1. The code is IMPDEF
one and Mark suggested not to do so. I agree the page not present needs a
separately subclass of exception. With that, there will be less conflicts
and complexity. However, the question is which subclass or DFSC code I should
used for the purpose.

> - for x86 we're also thinking of initiating the page fault from the
> exception handler, rather than doing so from the hypervisor before
> injecting the exception.  If ARM leads the way here, we would do our
> best to share code when x86 does the same.
> 

Sorry, Paolo, I don't follow your idea here. Could you please provide
more details?

> - do not bother with using KVM_ASYNC_PF_SEND_ALWAYS, it's a fringe case
> that adds a lot of complexity.
> 

Yeah, I don't consider it so far.

> Also, please include me on further iterations of the series.
> 

Sure.

Thanks,
Gavin

[...]

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault
  2020-05-27  7:37       ` Marc Zyngier
@ 2020-05-28  6:32         ` Gavin Shan
  0 siblings, 0 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-28  6:32 UTC (permalink / raw)
  To: Marc Zyngier, pbonzini
  Cc: catalin.marinas, linux-kernel, shan.gavin, will, kvmarm,
	linux-arm-kernel

Hi Marc,

On 5/27/20 5:37 PM, Marc Zyngier wrote:
> On 2020-05-27 05:05, Gavin Shan wrote:

[...]
  
>>>> +struct kvm_vcpu_pv_apf_data {
>>>> +    __u32    reason;
>>>> +    __u8    pad[60];
>>>> +    __u32    enabled;
>>>> +};
>>>
>>> What's all the padding for?
>>>
>>
>> The padding is ensure the @reason and @enabled in different cache
>> line. @reason is shared by host/guest, while @enabled is almostly
>> owned by guest.
> 
> So you are assuming that a cache line is at most 64 bytes.
> It is actualy implementation defined, and you can probe for it
> by looking at the CTR_EL0 register. There are implementations
> ranging from 32 to 256 bytes in the wild, and let's not mention
> broken big-little implementations here.
> 
> [...]
> 

Ok, Thanks for your comments and hints.

>>>> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +    u64 vbar, pc;
>>>> +    u32 val;
>>>> +    int ret;
>>>> +
>>>> +    if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
>>>> +        return false;
>>>> +
>>>> +    if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
>>>> +        return false;
>>>> +
>>>> +    /* Pending page fault, which ins't acknowledged by guest */
>>>> +    ret = kvm_async_pf_read_cache(vcpu, &val);
>>>> +    if (ret || val)
>>>> +        return false;
>>>> +
>>>> +    /*
>>>> +     * Events can't be injected through data abort because it's
>>>> +     * going to clobber ELR_EL1, which might not consued (or saved)
>>>> +     * by guest yet.
>>>> +     */
>>>> +    vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
>>>> +    pc = *vcpu_pc(vcpu);
>>>> +    if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
>>>> +        return false;
>>>
>>> Ah, so that's when this `no_fault_inst_range` is for.
>>>
>>> As-is this is not sufficient, and we'll need t be extremely careful
>>> here.
>>>
>>> The vectors themselves typically only have a small amount of stub code,
>>> and the bulk of the non-reentrant exception entry work happens
>>> elsewhere, in a mixture of assembly and C code that isn't even virtually
>>> contiguous with either the vectors or itself.
>>>
>>> It's possible in theory that code in modules (or perhaps in eBPF JIT'd
>>> code) that isn't safe to take a fault from, so even having a contiguous
>>> range controlled by the kernel isn't ideal.
>>>
>>> How does this work on x86?
>>>
>>
>> Yeah, here we just provide a mechanism to forbid injecting data abort. The
>> range is fed by guest through HVC call. So I think it's guest related issue.
>> You had more comments about this in PATCH[9]. I will explain a bit more there.
>>
>> x86 basically relies on EFLAGS[IF] flag. The async page fault can be injected
>> if it's on. Otherwise, it's forbidden. It's workable because exception is
>> special interrupt to x86 if I'm correct.
>>
>>            return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
>>                   !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
>>                         (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
> 
> I really wish this was relying on an architected exception delivery
> mechanism that can be blocked by the guest itself (PSTATE.{I,F,A}).
> Trying to guess based on the PC won't fly. But these signals are
> pretty hard to multiplex with anything else. Like any form of
> non-architected exception injection, I don't see a good path forward
> unless we start considering something like SDEI.
> 
>          M.

As Paolo mentioned in another reply. There are two types of notifications
sent from host to guest: page_not_present and page_ready. The page_not_present
notification should be delivered synchronously while page_ready can be
delivered asynchronously. He also suggested to reserve a ESR (or DFSC) subclass
for page_not_present. For page_ready, it can be delivered by interrupt, like PPI.
x86 is changing the code to deliver page_ready by interrupt, which was done by
exception previously.

when we use interrupt, instead of exception for page_ready. We won't need the
game of guessing PC.

I assume you prefer to use SDEI for page_not_present, correct? In that case,
what's the current status of SDEI? I mean it has been fully or partially
supported, or we need develop it from the scratch :)

Thanks,
Gavin

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
  2020-05-27  7:20       ` Marc Zyngier
@ 2020-05-28  6:34         ` Gavin Shan
  0 siblings, 0 replies; 38+ messages in thread
From: Gavin Shan @ 2020-05-28  6:34 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: catalin.marinas, linux-kernel, shan.gavin, will, kvmarm,
	linux-arm-kernel

On 5/27/20 5:20 PM, Marc Zyngier wrote:
> On 2020-05-27 03:43, Gavin Shan wrote:
>> Hi Mark,
>>
>> On 5/26/20 8:42 PM, Mark Rutland wrote:
>>> On Fri, May 08, 2020 at 01:29:13PM +1000, Gavin Shan wrote:
>>>> Since kvm/arm32 was removed, this renames kvm_vcpu_get_hsr() to
>>>> kvm_vcpu_get_esr() to it a bit more self-explaining because the
>>>> functions returns ESR instead of HSR on aarch64. This shouldn't
>>>> cause any functional changes.
>>>>
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>
>>> I think that this would be a nice cleanup on its own, and could be taken
>>> independently of the rest of this series if it were rebased and sent as
>>> a single patch.
>>>
>>
>> Yeah, I'll see how PATCH[3,4,5] can be posted independently
>> as part of the preparatory work, which is suggested by you
>> in another reply.
>>
>> By the way, I assume the cleanup patches are good enough to
>> target 5.8.rc1/rc2 if you agree.
> 
> It's fine to base them on -rc1 or -rc2. They will not be merged
> before 5.9 though.
> 
> Thanks,
> 
>          M.

Sure, Thanks, Marc!

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 9/9] arm64: Support async page fault
  2020-05-28  6:14     ` Gavin Shan
@ 2020-05-28  7:03       ` Marc Zyngier
  2020-05-28 10:53         ` Paolo Bonzini
  2020-05-28 10:48       ` Paolo Bonzini
  1 sibling, 1 reply; 38+ messages in thread
From: Marc Zyngier @ 2020-05-28  7:03 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, Paolo Bonzini, will,
	kvmarm, linux-arm-kernel

On 2020-05-28 07:14, Gavin Shan wrote:
> Hi Paolo,
> 
> On 5/27/20 4:48 PM, Paolo Bonzini wrote:
>> I definitely appreciate the work, but this is repeating most of the
>> mistakes done in the x86 implementation.  In particular:
>> 
>> - the page ready signal can be done as an interrupt, rather than an
>> exception.  This is because "page ready" can be handled 
>> asynchronously,
>> in contrast to "page not present" which must be done on the same
>> instruction that triggers it.  You can refer to the recent series from
>> Vitaly Kuznetsov that switched "page ready" to an interrupt.
>> 
> 
> Yeah, page ready can be handled asynchronously. I think it would be
> nice for x86/arm64 to share same design. x86 has 256 vectors and it
> seems 0xec is picked for the purpose. However, arm64 doesn't have so
> many (interrupt/exception) vectors and PPI might be appropriate for
> the purpose if I'm correct, because it has same INTD for all CPUs.
> From this point, it's similar to x86's vector. There are 16 PPIs, which
> are in range of 16 to 31, and we might reserve one for this. According
> to GICv3/v4 spec, 22 - 30 have been assigned.

The assignment of the PPIs is completely implementation defined,
and is not part of the architecture (and certainly not in the
GICv3/v4 spec). SBSA makes some statements as to the way they *could*
be assigned, but that's in no way enforced. This allocation is entirely
controlled by userspace, which would need to configure tell KVM
which PPI to use on a per-VM basis.

You would then need to describe the PPI assignment through firmware
(both DT and ACPI) so that the guest kernel can know what PPI the
hypervisor would be signalling on.

It is also not very future proof should we move to a different
interrupt architecture.

> 
>> - the page not present is reusing the memory abort exception, and
>> there's really no reason to do so.  I think it would be best if ARM
>> could reserve one ESR exception code for the hypervisor.  Mark, any
>> ideas how to proceed here?
>> 
> 
> Well, a subclass of ESR exception code, whose DFSC (Data Fault Status
> Code) is 0x34, was taken for the purpose in RFCv1. The code is IMPDEF
> one and Mark suggested not to do so. I agree the page not present needs 
> a
> separately subclass of exception. With that, there will be less 
> conflicts
> and complexity. However, the question is which subclass or DFSC code I 
> should
> used for the purpose.

The current state of the architecture doesn't seem to leave much leeway 
in
terms of SW creativity here. You just can't allocate your own ISS 
encoding
without risking a clash with future revisions of the architecture.
It isn't even clear whether the value you put would stick in ESR_EL1
if it isn't a valid value for this CPU (see the definition of 'Reserved'
in the ARM ARM).

Allocating such a syndrome would require from ARM:

- the guarantee that no existing implementation, irrespective of the
   implementer, can cope with the ISS encoding of your choice,

- the written promise in the architecture that some EC/ISS values
   are reserved for SW, and that promise to apply retrospectively.

This is somewhat unlikely to happen.

         M.
-- 
Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 9/9] arm64: Support async page fault
  2020-05-28  6:14     ` Gavin Shan
  2020-05-28  7:03       ` Marc Zyngier
@ 2020-05-28 10:48       ` Paolo Bonzini
  2020-05-28 23:02         ` Gavin Shan
  1 sibling, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2020-05-28 10:48 UTC (permalink / raw)
  To: Gavin Shan, kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

On 28/05/20 08:14, Gavin Shan wrote:
>> - for x86 we're also thinking of initiating the page fault from the
>> exception handler, rather than doing so from the hypervisor before
>> injecting the exception.  If ARM leads the way here, we would do our
>> best to share code when x86 does the same.
> 
> Sorry, Paolo, I don't follow your idea here. Could you please provide
> more details?

The idea is to inject stage2 page faults into the guest even before the
host starts populates the page.  The guest then invokes a hypercall,
telling the host to populate the page table and inject the 'page ready'
event (interrupt) when it's done.

For x86 the advantage is that the processor can take care of raising the
stage2 page fault in the guest, so it's faster.

Paolo

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 9/9] arm64: Support async page fault
  2020-05-28  7:03       ` Marc Zyngier
@ 2020-05-28 10:53         ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2020-05-28 10:53 UTC (permalink / raw)
  To: Marc Zyngier, Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, will, kvmarm,
	linux-arm-kernel

On 28/05/20 09:03, Marc Zyngier wrote:
> The current state of the architecture doesn't seem to leave much leeway in
> terms of SW creativity here. You just can't allocate your own ISS encoding
> without risking a clash with future revisions of the architecture.
> It isn't even clear whether the value you put would stick in ESR_EL1
> if it isn't a valid value for this CPU (see the definition of 'Reserved'
> in the ARM ARM).
> 
> Allocating such a syndrome would require from ARM:
> 
> - the guarantee that no existing implementation, irrespective of the
>   implementer, can cope with the ISS encoding of your choice,
> 
> - the written promise in the architecture that some EC/ISS values
>   are reserved for SW, and that promise to apply retrospectively.
> 
> This is somewhat unlikely to happen.

Well, that's a euphemism probably.  On x86 we're somewhat lucky that
there's an architectural way for injecting hypervisor vmexit directly in
the guest, and we can piggyback on that for async page faults (which are
essentially stage2 page faults that are processed by the guest).

Is it possible to reuse EL2 exception codes / syndromes somehow?  (I
haven't checked in the ARM ARM the differences between the EL1 and EL2
syndrome registers).

Paolo

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 9/9] arm64: Support async page fault
  2020-05-28 10:48       ` Paolo Bonzini
@ 2020-05-28 23:02         ` Gavin Shan
  2020-05-29  9:41           ` Marc Zyngier
  0 siblings, 1 reply; 38+ messages in thread
From: Gavin Shan @ 2020-05-28 23:02 UTC (permalink / raw)
  To: Paolo Bonzini, kvmarm
  Cc: maz, linux-kernel, shan.gavin, catalin.marinas, will, linux-arm-kernel

Hi Paolo,

On 5/28/20 8:48 PM, Paolo Bonzini wrote:
> On 28/05/20 08:14, Gavin Shan wrote:
>>> - for x86 we're also thinking of initiating the page fault from the
>>> exception handler, rather than doing so from the hypervisor before
>>> injecting the exception.  If ARM leads the way here, we would do our
>>> best to share code when x86 does the same.
>>
>> Sorry, Paolo, I don't follow your idea here. Could you please provide
>> more details?
> 
> The idea is to inject stage2 page faults into the guest even before the
> host starts populates the page.  The guest then invokes a hypercall,
> telling the host to populate the page table and inject the 'page ready'
> event (interrupt) when it's done.
> 
> For x86 the advantage is that the processor can take care of raising the
> stage2 page fault in the guest, so it's faster.
> 
I think there might be too much overhead if the page can be populated
quickly by host. For example, it's fast to populate the pages if swapin
isn't involved.

If I'm correct enough, it seems arm64 doesn't have similar mechanism,
routing stage2 page fault to guest.

Thanks,
Gavin

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 9/9] arm64: Support async page fault
  2020-05-28 23:02         ` Gavin Shan
@ 2020-05-29  9:41           ` Marc Zyngier
  2020-05-29 11:11             ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Marc Zyngier @ 2020-05-29  9:41 UTC (permalink / raw)
  To: Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, Paolo Bonzini, will,
	kvmarm, linux-arm-kernel

On 2020-05-29 00:02, Gavin Shan wrote:
> Hi Paolo,
> 
> On 5/28/20 8:48 PM, Paolo Bonzini wrote:
>> On 28/05/20 08:14, Gavin Shan wrote:
>>>> - for x86 we're also thinking of initiating the page fault from the
>>>> exception handler, rather than doing so from the hypervisor before
>>>> injecting the exception.  If ARM leads the way here, we would do our
>>>> best to share code when x86 does the same.
>>> 
>>> Sorry, Paolo, I don't follow your idea here. Could you please provide
>>> more details?
>> 
>> The idea is to inject stage2 page faults into the guest even before 
>> the
>> host starts populates the page.  The guest then invokes a hypercall,
>> telling the host to populate the page table and inject the 'page 
>> ready'
>> event (interrupt) when it's done.
>> 
>> For x86 the advantage is that the processor can take care of raising 
>> the
>> stage2 page fault in the guest, so it's faster.
>> 
> I think there might be too much overhead if the page can be populated
> quickly by host. For example, it's fast to populate the pages if swapin
> isn't involved.
> 
> If I'm correct enough, it seems arm64 doesn't have similar mechanism,
> routing stage2 page fault to guest.

Indeed, this isn't a thing on arm64. Exception caused by a S2 fault are
always routed to EL2.

         M.
-- 
Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFCv2 9/9] arm64: Support async page fault
  2020-05-29  9:41           ` Marc Zyngier
@ 2020-05-29 11:11             ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2020-05-29 11:11 UTC (permalink / raw)
  To: Marc Zyngier, Gavin Shan
  Cc: catalin.marinas, linux-kernel, shan.gavin, will, kvmarm,
	linux-arm-kernel

On 29/05/20 11:41, Marc Zyngier wrote:
>>>
>>>
>>> For x86 the advantage is that the processor can take care of raising the
>>> stage2 page fault in the guest, so it's faster.
>>>
>> I think there might be too much overhead if the page can be populated
>> quickly by host. For example, it's fast to populate the pages if swapin
>> isn't involved.

Those would still be handled by the host.  Only those that are not
present in the host (which you can see through the MMU notifier) would
be routed to the guest.  You can do things differently between "not
present fault because the page table does not exist" and "not present
fault because the page is missing in the host".

>> If I'm correct enough, it seems arm64 doesn't have similar mechanism,
>> routing stage2 page fault to guest.
> 
> Indeed, this isn't a thing on arm64. Exception caused by a S2 fault are
> always routed to EL2.

Is there an ARM-approved way to reuse the S2 fault syndromes to detect
async page faults?

(By the way, another "modern" use for async page faults is for postcopy
live migration).

Thanks,

Paolo

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, back to index

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-08  3:29 [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
2020-05-08  3:29 ` [PATCH RFCv2 1/9] arm64: Probe for the presence of KVM hypervisor services during boot Gavin Shan
2020-05-08  3:29 ` [PATCH RFCv2 2/9] arm/arm64: KVM: Advertise KVM UID to guests via SMCCC Gavin Shan
2020-05-08  3:29 ` [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr() Gavin Shan
2020-05-26 10:42   ` Mark Rutland
2020-05-27  2:43     ` Gavin Shan
2020-05-27  7:20       ` Marc Zyngier
2020-05-28  6:34         ` Gavin Shan
2020-05-08  3:29 ` [PATCH RFCv2 4/9] kvm/arm64: Detach ESR operator from vCPU struct Gavin Shan
2020-05-26 10:51   ` Mark Rutland
2020-05-27  2:55     ` Gavin Shan
2020-05-08  3:29 ` [PATCH RFCv2 5/9] kvm/arm64: Replace hsr with esr Gavin Shan
2020-05-26 10:45   ` Mark Rutland
2020-05-27  2:56     ` Gavin Shan
2020-05-08  3:29 ` [PATCH RFCv2 6/9] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode Gavin Shan
2020-05-26 10:58   ` Mark Rutland
2020-05-27  3:01     ` Gavin Shan
2020-05-08  3:29 ` [PATCH RFCv2 7/9] kvm/arm64: Support async page fault Gavin Shan
2020-05-26 12:34   ` Mark Rutland
2020-05-27  4:05     ` Gavin Shan
2020-05-27  7:37       ` Marc Zyngier
2020-05-28  6:32         ` Gavin Shan
2020-05-08  3:29 ` [PATCH RFCv2 8/9] kernel/sched: Add cpu_rq_is_locked() Gavin Shan
2020-05-08  3:29 ` [PATCH RFCv2 9/9] arm64: Support async page fault Gavin Shan
2020-05-26 12:56   ` Mark Rutland
2020-05-27  6:48   ` Paolo Bonzini
2020-05-28  6:14     ` Gavin Shan
2020-05-28  7:03       ` Marc Zyngier
2020-05-28 10:53         ` Paolo Bonzini
2020-05-28 10:48       ` Paolo Bonzini
2020-05-28 23:02         ` Gavin Shan
2020-05-29  9:41           ` Marc Zyngier
2020-05-29 11:11             ` Paolo Bonzini
2020-05-25 23:39 ` [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault Gavin Shan
2020-05-26 13:09 ` Mark Rutland
2020-05-27  2:39   ` Gavin Shan
2020-05-27  7:48     ` Marc Zyngier
2020-05-27 16:10       ` Paolo Bonzini

KVM ARM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kvmarm/0 kvmarm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kvmarm kvmarm/ https://lore.kernel.org/kvmarm \
		kvmarm@lists.cs.columbia.edu
	public-inbox-index kvmarm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/edu.columbia.cs.lists.kvmarm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git