kvmarm.lists.cs.columbia.edu archive mirror
* [PATCH 0/6] Support Asynchronous Page Fault
@ 2020-08-18  1:13 Gavin Shan
  2020-08-18  1:13 ` [PATCH 1/6] arm64: Probe for the presence of KVM hypervisor services during boot Gavin Shan
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Gavin Shan @ 2020-08-18  1:13 UTC (permalink / raw)
  To: kvmarm; +Cc: maz, shan.gavin, pbonzini, will

There are two stages of page fault: the guest kernel is responsible for
handling stage one page faults, while the host kernel takes care of
stage two page faults. When a stage two page fault is triggered, the
guest is suspended until the requested memory (page) is populated.
Populating the requested page isn't always cheap and can take hundreds
of milliseconds in extreme cases, which hurts the overall guest
performance.

This series introduces asynchronous page fault support to resolve the
issue and improve guest performance. It depends on the series adding
SDEI virtualization and refactoring the SDEI client driver, and on QEMU
changes to export the SDEI/APFT tables. All the code, including this
series, can be found on github:

   https://github.com/gwshan/linux ("sdei_client")
   https://github.com/gwshan/linux ("sdei")
   https://github.com/gwshan/linux ("apf")
   https://github.com/gwshan/qemu  ("apf")

The functionality is driven by two notifications: page-not-present and
page-ready, which are delivered from host to guest via an SDEI event
and a PPI respectively. Each notification is associated with a token
that identifies it; the token is passed through memory shared between
the host and the guest. In addition, the guest uses an SMCCC interface
to configure, enable or disable the functionality. This is the
traditional control path, sketched in the example below.
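
The guest-side configuration sequence could look roughly like the
following. This is a minimal sketch based on the SMCCC sub-functions
introduced in PATCH[4]; apf_data (a 64-byte aligned shared structure),
sdei_event_num and ppi_number are placeholder names, and error checking
of res.a0 is omitted:

   struct arm_smccc_res res;
   u64 data = virt_to_phys(apf_data) | KVM_ASYNC_PF_ENABLED;

   /* Tell the host which private SDEI event and which PPI to use */
   arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
                        ARM_SMCCC_KVM_FUNC_APF_SDEI, sdei_event_num, &res);
   arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
                        ARM_SMCCC_KVM_FUNC_APF_IRQ, ppi_number, &res);

   /* Enable the feature, passing the address of the shared data area */
   arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
                        ARM_SMCCC_KVM_FUNC_APF_ENABLE,
                        lower_32_bits(data), upper_32_bits(data), &res);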

When the guest traps to the host because of a stage two page fault and
the requested page can't be populated immediately, the host raises a
page-not-present notification and sends it to the guest through the
(KVM private) SDEI event (0x40200001). At the same time, a (background)
worker is started to populate the requested page. On receiving the SDEI
event, the guest marks the current running process with a special flag
(TIF_ASYNC_PF), associates the process with a pre-defined waitqueue,
and sends a (reschedule) IPI to the CPU where the process is running.
After the SDEI event is acknowledged by the guest, the (reschedule) IPI
is delivered and triggers a context switch on the return from kernel to
user space. During the context switch, the process with the
TIF_ASYNC_PF flag is suspended on the associated waitqueue.

Later on, a page-ready notification is sent to the guest after the
requested page has been populated by the (background) worker. On
receiving the interrupt, the guest uses the associated token to locate
the process that was previously suspended because of page-not-present,
and wakes it up.

The series is organized as follows:

PATCH[1-2]: support the KVM hypervisor SMCCC services, developed by
            Will Deacon.
PATCH[3]:   exports kvm_handle_user_mem_abort() with a @prefault
            parameter, which is preparatory work for the feature.
PATCH[4]:   supports asynchronous page fault on the host side.
PATCH[5]:   exposes the APFT (Asynchronous Page Fault Table) ACPI table,
            which will be used by the guest kernel to support the feature.
PATCH[6]:   supports asynchronous page fault on the guest side.

=======
Testing
=======

In test cases [1] and [2], "testsuite mem" is executed to allocate the
specified percentage of free memory (90%) and then release it, with and
without a calculation thread running. When the calculation thread isn't
running, there is no obvious performance degradation. When the
calculation thread is running, performance is improved by 27.7% and
28.6% respectively, depending on THP enablement status on the host side.

In test cases [3] and [4], the kernel image is built and the elapsed
time measured. Performance is improved by 9.7% and 9.9% respectively,
depending on THP enablement status on the host side.

[1] Two threads to allocate/free memory and do calculation
    vCPU:                  1
    Memory:                8GB
    memory.limit_in_bytes: 2GB
    memory.swappiness:     100
    host:                  THP disabled
    command:               "testsuite mem 90 1 [thread]"
    "-": Disabled asynchronous page fault
    "+": Enabled asynchronous page fault
    "T"  With the calculation thread

    Idx  -      +     Output  T-                   T+                   Output
    ==========================================================================
    1   93.1s  93.6s    -     223.8s 21117147961   391.9s 49845637101     -
    2   93.3s  94.2s    -     237.9s 23394567744   397.0s 50506074773     -
    3   93.5s  94.3s    -     244.2s 24305177553   405.8s 51853498870     -
    4   94.1s  95.0s    -     262.8s 27113310073   421.7s 54338181069     -
    5   94.3s  95.2s    -     272.7s 28565479414   434.3s 56171922019     -
    ==========================================================================
        93.6s  94.4s   -0.8%  248.2s 24899136549   410.1s 52543062766
                                       100318841/s          128122562/s  +27.7%

[2] Two threads to allocate/free memory and do calculation
    vCPU:                  1
    Memory:                8GB
    memory.limit_in_bytes: 2GB
    memory.swappiness:     100
    host:                  THP enabled
    command:               "testsuite mem 90 1 [thread]"
    "-": Disabled asynchronous page fault
    "+": Enabled asynchronous page fault
    "T"  With the calculation thread

    Idx  -      +     Output  T-                   T+                   Output
    ==========================================================================
    1   91.3s  91.2s    -     218.8s 20319612017   389.6s 49016175698     -
    2   91.7s  91.6s    -     233.9s 22619566161   402.0s 50901616319     -
    3   91.8s  91.9s    -     251.1s 25066180266   405.3s 51247353704     -
    4   92.7s  92.2s    -     251.1s 25262121229   406.9s 51692420054     -
    5   93.1s  92.2s    -     260.7s 26532616925   425.4s 54412348724     -
    ==========================================================================
        92.1s  91.8s   +3.0%  243.1s 23960019319   405.8s 51453982899
                                       98560342/s          126796409/s  +28.6%

[3] Clean and rebuild the kernel image.
    vCPU: 24  Memory:      8GB
    memory.limit_in_bytes: 2GB
    memory.swappiness:     100
    Host:                  THP disabled
    command:               "make -j 24 clean > /dev/null 2>&1 &&
                            make -j 24 > /dev/null 2>&1"

    Idx   Disabled   Enabled   Output
    ==================================
    1     2211s      2000s     +9.5%
    2     2333s      2060s     +11.7%
    3     2568s      2192s     +14.6%
    4     2631s      2423s     +7.9%
    5     2756s      2605s     +5.4%
    ==================================
          2499s      2256s     +9.7%

[4] Clean and rebuild the kernel image.
    vCPU: 24  Memory:      8GB
    memory.limit_in_bytes: 2GB
    memory.swappiness:     100
    Host:                  THP enabled
    command:               "make -j 24 clean > /dev/null 2>&1 &&
                            make -j 24 > /dev/null 2>&1"

    Idx   Disabled   Enabled   Output
    ==================================
    1     2049s      1850s     +9.7%
    2     2144s      1947s     +9.1%
    3     2164s      1997s     +7.7%
    4     2192s      2031s     +7.3%
    5     2515s      2141s     +14.8%
    ==================================
          2214s      1993s     +9.9%


Gavin Shan (4):
  kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  arm64/kvm: Support async page fault
  drivers/acpi: Import ACPI APF table
  arm64/kernel: Support async page fault

Will Deacon (2):
  arm64: Probe for the presence of KVM hypervisor services during boot
  arm/arm64: KVM: Advertise KVM UID to guests via SMCCC

 arch/arm64/Kconfig                     |  11 +
 arch/arm64/include/asm/esr.h           |   5 +
 arch/arm64/include/asm/hypervisor.h    |  11 +
 arch/arm64/include/asm/kvm_emulate.h   |   8 +-
 arch/arm64/include/asm/kvm_host.h      |  54 +++
 arch/arm64/include/asm/kvm_para.h      |  41 +++
 arch/arm64/include/asm/processor.h     |   1 +
 arch/arm64/include/asm/thread_info.h   |   4 +-
 arch/arm64/include/uapi/asm/Kbuild     |   2 -
 arch/arm64/include/uapi/asm/kvm_para.h |  23 ++
 arch/arm64/kernel/Makefile             |   1 +
 arch/arm64/kernel/kvm.c                | 478 +++++++++++++++++++++++++
 arch/arm64/kernel/setup.c              |  32 ++
 arch/arm64/kernel/signal.c             |  17 +
 arch/arm64/kvm/Kconfig                 |   1 +
 arch/arm64/kvm/Makefile                |   1 +
 arch/arm64/kvm/arm.c                   |  45 ++-
 arch/arm64/kvm/async_pf.c              | 462 ++++++++++++++++++++++++
 arch/arm64/kvm/hypercalls.c            |  37 +-
 arch/arm64/kvm/mmu.c                   |  47 ++-
 arch/arm64/kvm/sdei.c                  |   8 +
 include/acpi/actbl2.h                  |  18 +
 include/linux/arm-smccc.h              |  41 +++
 23 files changed, 1321 insertions(+), 27 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_para.h
 create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
 create mode 100644 arch/arm64/kernel/kvm.c
 create mode 100644 arch/arm64/kvm/async_pf.c

-- 
2.23.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/6] arm64: Probe for the presence of KVM hypervisor services during boot
  2020-08-18  1:13 [PATCH 0/6] Support Asynchronous Page Fault Gavin Shan
@ 2020-08-18  1:13 ` Gavin Shan
  2020-08-18  1:13 ` [PATCH 2/6] arm/arm64: KVM: Advertise KVM UID to guests via SMCCC Gavin Shan
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Gavin Shan @ 2020-08-18  1:13 UTC (permalink / raw)
  To: kvmarm; +Cc: maz, shan.gavin, pbonzini, will

From: Will Deacon <will@kernel.org>

Although the SMCCC specification provides some limited functionality for
describing the presence of hypervisor and firmware services, this is
generally applicable only to functions designated as "Arm Architecture
Service Functions" and no portable discovery mechanism is provided for
standard hypervisor services, despite having a designated range of
function identifiers reserved by the specification.

In an attempt to avoid the need for additional firmware changes every
time a new function is added, introduce a UID to identify the service
provider as being compatible with KVM. Once this has been established,
additional services can be discovered via a feature bitmap.
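
For instance, once the UID and feature bitmap have been probed, a guest
driver can gate its initialization on a specific service bit. A minimal
sketch, assuming the ARM_SMCCC_KVM_FUNC_APF bit introduced later in
this series:

   if (kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_APF))
           pr_info("KVM asynchronous page fault service available\n");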

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/hypervisor.h | 11 ++++++++++
 arch/arm64/kernel/setup.c           | 32 +++++++++++++++++++++++++++++
 include/linux/arm-smccc.h           | 26 +++++++++++++++++++++++
 3 files changed, 69 insertions(+)

diff --git a/arch/arm64/include/asm/hypervisor.h b/arch/arm64/include/asm/hypervisor.h
index f9cc1d021791..91e4bd890819 100644
--- a/arch/arm64/include/asm/hypervisor.h
+++ b/arch/arm64/include/asm/hypervisor.h
@@ -2,6 +2,17 @@
 #ifndef _ASM_ARM64_HYPERVISOR_H
 #define _ASM_ARM64_HYPERVISOR_H
 
+#include <linux/arm-smccc.h>
 #include <asm/xen/hypervisor.h>
 
+static inline bool kvm_arm_hyp_service_available(u32 func_id)
+{
+	extern DECLARE_BITMAP(__kvm_arm_hyp_services, ARM_SMCCC_KVM_NUM_FUNCS);
+
+	if (func_id >= ARM_SMCCC_KVM_NUM_FUNCS)
+		return false;
+
+	return test_bit(func_id, __kvm_arm_hyp_services);
+}
+
 #endif
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 77c4c9bad1b8..6a0360793317 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -7,6 +7,7 @@
  */
 
 #include <linux/acpi.h>
+#include <linux/arm-smccc.h>
 #include <linux/export.h>
 #include <linux/kernel.h>
 #include <linux/stddef.h>
@@ -275,6 +276,7 @@ static int __init reserve_memblock_reserved_regions(void)
 arch_initcall(reserve_memblock_reserved_regions);
 
 u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };
+DECLARE_BITMAP(__kvm_arm_hyp_services, ARM_SMCCC_KVM_NUM_FUNCS) = { };
 
 u64 cpu_logical_map(int cpu)
 {
@@ -282,6 +284,35 @@ u64 cpu_logical_map(int cpu)
 }
 EXPORT_SYMBOL_GPL(cpu_logical_map);
 
+static void __init kvm_init_hyp_services(void)
+{
+	struct arm_smccc_res res;
+	int i;
+
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, &res);
+	if (res.a0 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_0 ||
+	    res.a1 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_1 ||
+	    res.a2 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_2 ||
+	    res.a3 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_3)
+		return;
+
+	memset(&res, 0, sizeof(res));
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID, &res);
+	for (i = 0; i < 32; ++i) {
+		if (res.a0 & BIT(i))
+			set_bit(i + (32 * 0), __kvm_arm_hyp_services);
+		if (res.a1 & BIT(i))
+			set_bit(i + (32 * 1), __kvm_arm_hyp_services);
+		if (res.a2 & BIT(i))
+			set_bit(i + (32 * 2), __kvm_arm_hyp_services);
+		if (res.a3 & BIT(i))
+			set_bit(i + (32 * 3), __kvm_arm_hyp_services);
+	}
+
+	pr_info("KVM hypervisor services detected (0x%08lx 0x%08lx 0x%08lx 0x%08lx)\n",
+		res.a3, res.a2, res.a1, res.a0);
+}
+
 void __init __no_sanitize_address setup_arch(char **cmdline_p)
 {
 	init_mm.start_code = (unsigned long) _text;
@@ -354,6 +385,7 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 	else
 		psci_acpi_init();
 
+	kvm_init_hyp_services();
 	init_bootcpu_ops();
 	smp_init_cpus();
 	smp_build_mpidr_hash();
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 15c706fb0a37..1c699fc6dc12 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -49,11 +49,14 @@
 #define ARM_SMCCC_OWNER_OEM		3
 #define ARM_SMCCC_OWNER_STANDARD	4
 #define ARM_SMCCC_OWNER_STANDARD_HYP	5
+#define ARM_SMCCC_OWNER_VENDOR_HYP	6
 #define ARM_SMCCC_OWNER_TRUSTED_APP	48
 #define ARM_SMCCC_OWNER_TRUSTED_APP_END	49
 #define ARM_SMCCC_OWNER_TRUSTED_OS	50
 #define ARM_SMCCC_OWNER_TRUSTED_OS_END	63
 
+#define ARM_SMCCC_FUNC_QUERY_CALL_UID	0xff01
+
 #define ARM_SMCCC_QUIRK_NONE		0
 #define ARM_SMCCC_QUIRK_QCOM_A6		1 /* Save/restore register a6 */
 
@@ -99,6 +102,29 @@
 			   ARM_SMCCC_OWNER_STANDARD_HYP,	\
 			   0x21)
 
+#define ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID			\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
+			   ARM_SMCCC_SMC_32,			\
+			   ARM_SMCCC_OWNER_VENDOR_HYP,		\
+			   ARM_SMCCC_FUNC_QUERY_CALL_UID)
+
+/* KVM UID value: 28b46fb6-2ec5-11e9-a9ca-4b564d003a74 */
+#define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_0	0xb66fb428U
+#define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_1	0xe911c52eU
+#define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_2	0x564bcaa9U
+#define ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_3	0x743a004dU
+
+/* KVM "vendor specific" services */
+#define ARM_SMCCC_KVM_FUNC_FEATURES		0
+#define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
+#define ARM_SMCCC_KVM_NUM_FUNCS			128
+
+#define ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID		\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
+			   ARM_SMCCC_SMC_32,			\
+			   ARM_SMCCC_OWNER_VENDOR_HYP,		\
+			   ARM_SMCCC_KVM_FUNC_FEATURES)
+
 /*
  * Return codes defined in ARM DEN 0070A
  * ARM DEN 0070A is now merged/consolidated into ARM DEN 0028 C
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/6] arm/arm64: KVM: Advertise KVM UID to guests via SMCCC
  2020-08-18  1:13 [PATCH 0/6] Support Asynchronous Page Fault Gavin Shan
  2020-08-18  1:13 ` [PATCH 1/6] arm64: Probe for the presence of KVM hypervisor services during boot Gavin Shan
@ 2020-08-18  1:13 ` Gavin Shan
  2020-08-18  1:13 ` [PATCH 3/6] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode Gavin Shan
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Gavin Shan @ 2020-08-18  1:13 UTC (permalink / raw)
  To: kvmarm; +Cc: maz, shan.gavin, pbonzini, will

From: Will Deacon <will@kernel.org>

We can advertise ourselves to guests as KVM and provide a basic features
bitmap for discoverability of future hypervisor services.

Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/kvm/hypercalls.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index 1268465efa64..cd48fd724a52 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -13,13 +13,13 @@
 int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 {
 	u32 func_id = smccc_get_function(vcpu);
-	long val = SMCCC_RET_NOT_SUPPORTED;
+	long val[4] = { SMCCC_RET_NOT_SUPPORTED };
 	u32 feature;
 	gpa_t gpa;
 
 	switch (func_id) {
 	case ARM_SMCCC_VERSION_FUNC_ID:
-		val = ARM_SMCCC_VERSION_1_1;
+		val[0] = ARM_SMCCC_VERSION_1_1;
 		break;
 	case ARM_SMCCC_ARCH_FEATURES_FUNC_ID:
 		feature = smccc_get_arg1(vcpu);
@@ -29,10 +29,10 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 			case KVM_BP_HARDEN_UNKNOWN:
 				break;
 			case KVM_BP_HARDEN_WA_NEEDED:
-				val = SMCCC_RET_SUCCESS;
+				val[0] = SMCCC_RET_SUCCESS;
 				break;
 			case KVM_BP_HARDEN_NOT_REQUIRED:
-				val = SMCCC_RET_NOT_REQUIRED;
+				val[0] = SMCCC_RET_NOT_REQUIRED;
 				break;
 			}
 			break;
@@ -42,26 +42,35 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 			case KVM_SSBD_UNKNOWN:
 				break;
 			case KVM_SSBD_KERNEL:
-				val = SMCCC_RET_SUCCESS;
+				val[0] = SMCCC_RET_SUCCESS;
 				break;
 			case KVM_SSBD_FORCE_ENABLE:
 			case KVM_SSBD_MITIGATED:
-				val = SMCCC_RET_NOT_REQUIRED;
+				val[0] = SMCCC_RET_NOT_REQUIRED;
 				break;
 			}
 			break;
 		case ARM_SMCCC_HV_PV_TIME_FEATURES:
-			val = SMCCC_RET_SUCCESS;
+			val[0] = SMCCC_RET_SUCCESS;
 			break;
 		}
 		break;
 	case ARM_SMCCC_HV_PV_TIME_FEATURES:
-		val = kvm_hypercall_pv_features(vcpu);
+		val[0] = kvm_hypercall_pv_features(vcpu);
 		break;
 	case ARM_SMCCC_HV_PV_TIME_ST:
 		gpa = kvm_init_stolen_time(vcpu);
 		if (gpa != GPA_INVALID)
-			val = gpa;
+			val[0] = gpa;
+		break;
+	case ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID:
+		val[0] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_0;
+		val[1] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_1;
+		val[2] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_2;
+		val[3] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_3;
+		break;
+	case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
+		val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
 		break;
 	case SDEI_1_0_FN_SDEI_VERSION:
 	case SDEI_1_0_FN_SDEI_EVENT_REGISTER:
@@ -85,6 +94,6 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 		return kvm_psci_call(vcpu);
 	}
 
-	smccc_set_retval(vcpu, val, 0, 0, 0);
+	smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]);
 	return 1;
 }
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/6] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  2020-08-18  1:13 [PATCH 0/6] Support Asynchronous Page Fault Gavin Shan
  2020-08-18  1:13 ` [PATCH 1/6] arm64: Probe for the presence of KVM hypervisor services during boot Gavin Shan
  2020-08-18  1:13 ` [PATCH 2/6] arm/arm64: KVM: Advertise KVM UID to guests via SMCCC Gavin Shan
@ 2020-08-18  1:13 ` Gavin Shan
  2020-10-23 16:54   ` James Morse
  2020-08-18  1:13 ` [PATCH 4/6] arm64/kvm: Support async page fault Gavin Shan
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Gavin Shan @ 2020-08-18  1:13 UTC (permalink / raw)
  To: kvmarm; +Cc: maz, shan.gavin, pbonzini, will

This renames user_mem_abort() to kvm_handle_user_mem_abort() and
exports it. The function will be used by the asynchronous page fault
code to populate a page table entry once the corresponding page has
been populated from the backing store (e.g. a swap partition):

   * Parameter @fault_status is replaced by @esr.
   * Parameter @prefault is added.

Since @esr is now passed in as a parameter instead of being fetched
from the vCPU struct, this also introduces the necessary helpers in
esr.h to manipulate the @esr. The helpers defined in kvm_emulate.h
reuse the newly added ones. This shouldn't cause any functional changes.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/esr.h         |  5 +++++
 arch/arm64/include/asm/kvm_emulate.h |  8 ++++----
 arch/arm64/include/asm/kvm_host.h    |  4 ++++
 arch/arm64/kvm/mmu.c                 | 18 ++++++++++++------
 4 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index 035003acfa87..ec0b5d81183c 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -317,8 +317,13 @@
 					 ESR_ELx_CP15_32_ISS_DIR_READ)
 
 #ifndef __ASSEMBLY__
+#include <linux/bitfield.h>
 #include <asm/types.h>
 
+#define esr_get_trap_class(esr)		FIELD_GET(ESR_ELx_EC_MASK, esr)
+#define esr_is_dabt_wnr(esr)		!!(FIELD_GET(ESR_ELx_WNR, esr))
+#define esr_is_dabt_iss1tw(esr)		!!(FIELD_GET(ESR_ELx_S1PTW, esr))
+
 static inline bool esr_is_data_abort(u32 esr)
 {
 	const u32 ec = ESR_ELx_EC(esr);
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index bb7aee5927a5..2681d1fe4003 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -323,13 +323,13 @@ static __always_inline int kvm_vcpu_dabt_get_rd(const struct kvm_vcpu *vcpu)
 
 static __always_inline bool kvm_vcpu_dabt_iss1tw(const struct kvm_vcpu *vcpu)
 {
-	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_S1PTW);
+	return esr_is_dabt_iss1tw(kvm_vcpu_get_hsr(vcpu));
 }
 
 static __always_inline bool kvm_vcpu_dabt_iswrite(const struct kvm_vcpu *vcpu)
 {
-	return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WNR) ||
-		kvm_vcpu_dabt_iss1tw(vcpu); /* AF/DBM update */
+	return (esr_is_dabt_wnr(kvm_vcpu_get_hsr(vcpu)) ||
+		kvm_vcpu_dabt_iss1tw(vcpu)); /* AF/DBM update */
 }
 
 static inline bool kvm_vcpu_dabt_is_cm(const struct kvm_vcpu *vcpu)
@@ -350,7 +350,7 @@ static __always_inline bool kvm_vcpu_trap_il_is32bit(const struct kvm_vcpu *vcpu
 
 static __always_inline u8 kvm_vcpu_trap_get_class(const struct kvm_vcpu *vcpu)
 {
-	return ESR_ELx_EC(kvm_vcpu_get_hsr(vcpu));
+	return esr_get_trap_class(kvm_vcpu_get_hsr(vcpu));
 }
 
 static inline bool kvm_vcpu_trap_is_iabt(const struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index ba8cdc304b81..b6c9851b2a65 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -441,6 +441,10 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 			      struct kvm_vcpu_events *events);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+			      struct kvm_memory_slot *memslot,
+			      unsigned long hva, unsigned int esr,
+			      bool prefault);
 int kvm_unmap_hva_range(struct kvm *kvm,
 			unsigned long start, unsigned long end);
 int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7a7ddc4558a7..b23778392aa1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1787,11 +1787,15 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
 	return PAGE_SIZE;
 }
 
-static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
-			  struct kvm_memory_slot *memslot, unsigned long hva,
-			  unsigned long fault_status)
+int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu,
+			      phys_addr_t fault_ipa,
+			      struct kvm_memory_slot *memslot,
+			      unsigned long hva,
+			      unsigned int esr,
+			      bool prefault)
 {
 	int ret;
+	unsigned int fault_status = (esr & ESR_ELx_FSC_TYPE);
 	bool write_fault, writable, force_pte = false;
 	bool exec_fault, needs_exec;
 	unsigned long mmu_seq;
@@ -1805,8 +1809,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	bool logging_active = memslot_is_logging(memslot);
 	unsigned long vma_pagesize, flags = 0;
 
-	write_fault = kvm_is_write_fault(vcpu);
-	exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
+	exec_fault = (esr_get_trap_class(esr) == ESR_ELx_EC_IABT_LOW);
+	write_fault = (!exec_fault &&
+		       (esr_is_dabt_wnr(esr) || esr_is_dabt_iss1tw(esr)));
 	VM_BUG_ON(write_fault && exec_fault);
 
 	if (fault_status == FSC_PERM && !write_fault && !exec_fault) {
@@ -2116,7 +2121,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 		goto out_unlock;
 	}
 
-	ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
+	ret = kvm_handle_user_mem_abort(vcpu, fault_ipa, memslot, hva,
+					kvm_vcpu_get_hsr(vcpu), false);
 	if (ret == 0)
 		ret = 1;
 out:
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/6] arm64/kvm: Support async page fault
  2020-08-18  1:13 [PATCH 0/6] Support Asynchronous Page Fault Gavin Shan
                   ` (2 preceding siblings ...)
  2020-08-18  1:13 ` [PATCH 3/6] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode Gavin Shan
@ 2020-08-18  1:13 ` Gavin Shan
  2020-10-23 17:01   ` James Morse
  2020-08-18  1:13 ` [PATCH 5/6] drivers/acpi: Import ACPI APF table Gavin Shan
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Gavin Shan @ 2020-08-18  1:13 UTC (permalink / raw)
  To: kvmarm; +Cc: maz, shan.gavin, pbonzini, will

There are two stages of page fault: the guest kernel is responsible for
handling stage one page faults, while the host kernel takes care of
stage two page faults. When a stage two page fault is triggered, the
guest is suspended until the requested memory (page) is populated.
Populating the requested page isn't always cheap and can take hundreds
of milliseconds in extreme cases, which hurts the overall guest
performance.

To resolve the issue and improve guest performance, this introduces
asynchronous page fault support. A page-not-present notification is
delivered to the guest if the requested page isn't ready, so that the
guest can reschedule the current process and pick another one to run.
A page-ready notification is sent later so that the previously
rescheduled process can be picked up to run again. With this, the guest
doesn't necessarily have to suspend itself on a stage two page fault.
The two notifications are associated through so-called vCPU-scoped
tokens, each a combination of an incremental sequence number and the
vCPU index (@vcpu->vcpu_idx). The notifications are conveyed by an SDEI
event and a PPI, whose numbers are specified by the guest through the
SMCCC interface.

An asynchronous page fault always starts with a background worker,
provided the current vCPU has enough token space and no pending
notifications. Once the worker, which populates the requested page in
the background, has been started, a page-not-present notification
together with a unique token is sent to the guest through the specified
private SDEI event. The notification is acknowledged by clearing the
shared cache (@vcpu->apf.cache). The current process is marked as
waiting for completion of the asynchronous page fault, and a subsequent
(reschedule) IPI is sent to the current vCPU so that the process can be
rescheduled and suspended until the asynchronous page fault completes.
Meanwhile, the worker populates the requested page and queues the work
to the completion queue once it is done. At that point, or when a
pending PPI is acknowledged (by SMCCC), a PPI is sent to the guest as
the page-ready notification. The guest uses the associated token to
locate the suspended process and wakes it up.
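
On the guest side, acknowledging a notification boils down to reading
and then clearing the shared area. A minimal sketch, assuming @apf
points at the guest's mapping of struct kvm_vcpu_pv_apf_data introduced
by this patch:

   static u32 apf_read_and_clear(struct kvm_vcpu_pv_apf_data *apf, u32 *token)
   {
           u32 reason = READ_ONCE(apf->reason);

           *token = READ_ONCE(apf->token);

           /* Clearing both fields acknowledges the notification to the host */
           WRITE_ONCE(apf->token, 0);
           WRITE_ONCE(apf->reason, 0);

           return reason;
   }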

To support these functions, several SMCCC function IDs are introduced,
as outlined below:

   ARM_SMCCC_KVM_FUNC_APF_VERSION
      Returns the version, which can be used to identify ABI changes in
      the future.
   ARM_SMCCC_KVM_FUNC_APF_ENABLE
      Enables or disables asynchronous page fault on the current vCPU.
   ARM_SMCCC_KVM_FUNC_APF_TOKEN_NUM
      Returns the maximum number of tokens the current vCPU can have.
      It's used by the guest to allocate the required resources.
   ARM_SMCCC_KVM_FUNC_APF_SDEI
   ARM_SMCCC_KVM_FUNC_APF_IRQ
      Used by the guest to configure the SDEI event and PPI numbers.
   ARM_SMCCC_KVM_FUNC_APF_IRQ_ACK
      Acknowledges the page-ready notification in the PPI handler.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/kvm_host.h      |  50 +++
 arch/arm64/include/asm/kvm_para.h      |  27 ++
 arch/arm64/include/uapi/asm/Kbuild     |   2 -
 arch/arm64/include/uapi/asm/kvm_para.h |  23 ++
 arch/arm64/kvm/Kconfig                 |   1 +
 arch/arm64/kvm/Makefile                |   1 +
 arch/arm64/kvm/arm.c                   |  45 ++-
 arch/arm64/kvm/async_pf.c              | 462 +++++++++++++++++++++++++
 arch/arm64/kvm/hypercalls.c            |   8 +
 arch/arm64/kvm/mmu.c                   |  29 +-
 arch/arm64/kvm/sdei.c                  |   8 +
 include/linux/arm-smccc.h              |  15 +
 12 files changed, 665 insertions(+), 6 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_para.h
 create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
 create mode 100644 arch/arm64/kvm/async_pf.c

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index b6c9851b2a65..2e3bba6a99c3 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -46,6 +46,7 @@
 #define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
 #define KVM_REQ_RELOAD_GICv4	KVM_ARCH_REQ(4)
 #define KVM_REQ_SDEI		KVM_ARCH_REQ(5)
+#define KVM_REQ_APF_READY	KVM_ARCH_REQ(6)
 
 #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
 				     KVM_DIRTY_LOG_INITIALLY_SET)
@@ -239,6 +240,23 @@ struct vcpu_reset_state {
 	bool		reset;
 };
 
+#ifdef CONFIG_KVM_ASYNC_PF
+
+/* Should be a power of two number */
+#define ASYNC_PF_PER_VCPU	64
+
+/*
+ * The association of gfn and token. The token will be sent to guest as
+ * page fault address. Also, the guest could be in aarch32 mode. So its
+ * length should be 32-bits.
+ */
+struct kvm_arch_async_pf {
+	u32     token;
+	gfn_t   gfn;
+	u32     esr;
+};
+#endif /* CONFIG_KVM_ASYNC_PF */
+
 struct kvm_vcpu_arch {
 	struct kvm_cpu_context ctxt;
 	void *sve_state;
@@ -346,6 +364,22 @@ struct kvm_vcpu_arch {
 	struct kvm_sdei_vcpu_event	*sdei_normal_event;
 	struct kvm_sdei_vcpu_event	*sdei_critical_event;
 	struct list_head		sdei_events;
+
+#ifdef CONFIG_KVM_ASYNC_PF
+	struct {
+		struct gfn_to_hva_cache	cache;
+		gfn_t			gfns[ASYNC_PF_PER_VCPU];
+		u64			control_block;
+		bool			send_user_only;
+		u64			sdei_event_num;
+		u32			irq;
+
+		u16			id;
+		bool			notpresent_pending;
+		u32			notpresent_token;
+		bool			pageready_pending;
+	} apf;
+#endif
 };
 
 /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
@@ -566,6 +600,22 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
 int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 			       struct kvm_device_attr *attr);
 
+#ifdef CONFIG_KVM_ASYNC_PF
+bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn);
+bool kvm_arch_async_not_present_allowed(struct kvm_vcpu *vcpu);
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr, gpa_t gpa,
+			    gfn_t gfn);
+bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work);
+void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu);
+bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu);
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+			       struct kvm_async_pf *work);
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				 struct kvm_async_pf *work);
+long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu);
+#endif /* CONFIG_KVM_ASYNC_PF */
+
 /* Guest/host FPSIMD coordination helpers */
 int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
new file mode 100644
index 000000000000..0ea481dd1c7a
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_para.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM_KVM_PARA_H
+#define _ASM_ARM_KVM_PARA_H
+
+#include <uapi/asm/kvm_para.h>
+
+static inline bool kvm_check_and_clear_guest_paused(void)
+{
+	return false;
+}
+
+static inline unsigned int kvm_arch_para_features(void)
+{
+	return 0;
+}
+
+static inline unsigned int kvm_arch_para_hints(void)
+{
+	return 0;
+}
+
+static inline bool kvm_para_available(void)
+{
+	return false;
+}
+
+#endif /* _ASM_ARM_KVM_PARA_H */
diff --git a/arch/arm64/include/uapi/asm/Kbuild b/arch/arm64/include/uapi/asm/Kbuild
index 602d137932dc..f66554cd5c45 100644
--- a/arch/arm64/include/uapi/asm/Kbuild
+++ b/arch/arm64/include/uapi/asm/Kbuild
@@ -1,3 +1 @@
 # SPDX-License-Identifier: GPL-2.0
-
-generic-y += kvm_para.h
diff --git a/arch/arm64/include/uapi/asm/kvm_para.h b/arch/arm64/include/uapi/asm/kvm_para.h
new file mode 100644
index 000000000000..f0dbe86c2374
--- /dev/null
+++ b/arch/arm64/include/uapi/asm/kvm_para.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_ARM_KVM_PARA_H
+#define _UAPI_ASM_ARM_KVM_PARA_H
+
+#include <linux/types.h>
+
+#define KVM_FEATURE_ASYNC_PF	0
+
+/* Async PF */
+#define KVM_ASYNC_PF_ENABLED		(1 << 0)
+#define KVM_ASYNC_PF_SEND_ALWAYS	(1 << 1)
+
+#define KVM_PV_REASON_PAGE_NOT_PRESENT	1
+#define KVM_PV_REASON_PAGE_READY	2
+
+struct kvm_vcpu_pv_apf_data {
+	__u32	reason;
+	__u32	token;
+	__u8	pad[56];
+	__u32	enabled;
+};
+
+#endif /* _UAPI_ASM_ARM_KVM_PARA_H */
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 13489aff4440..6dd843be8e10 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -32,6 +32,7 @@ menuconfig KVM
 	select KVM_VFIO
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_IRQFD
+	select KVM_ASYNC_PF
 	select HAVE_KVM_MSI
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQ_ROUTING
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 5ebd8abd81c8..f99fcccb8cc6 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -25,3 +25,4 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
 	 vgic/vgic-its.o vgic/vgic-debug.o
 
 kvm-$(CONFIG_KVM_ARM_PMU)  += pmu-emul.o
+kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o async_pf.o
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 4bec6c9ece18..1a57dc5dbf14 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -225,6 +225,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		 */
 		r = 1;
 		break;
+#ifdef CONFIG_KVM_ASYNC_PF
+	case KVM_CAP_ASYNC_PF:
+	case KVM_CAP_ASYNC_PF_INT:
+		r = 1;
+		break;
+#endif
 	default:
 		r = kvm_arch_vm_ioctl_check_extension(kvm, ext);
 		break;
@@ -274,6 +280,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
 
 	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
+#ifdef CONFIG_KVM_ASYNC_PF
+	vcpu->arch.apf.control_block = 0UL;
+#endif
 
 	/* Set up the timer */
 	kvm_timer_vcpu_init(vcpu);
@@ -432,8 +441,32 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
 {
 	bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF);
-	return ((irq_lines || kvm_vgic_vcpu_pending_irq(v))
-		&& !v->arch.power_off && !v->arch.pause);
+
+	if ((irq_lines || kvm_vgic_vcpu_pending_irq(v)) &&
+	    !v->arch.power_off && !v->arch.pause)
+		return true;
+
+#ifdef CONFIG_KVM_ASYNC_PF
+	if (v->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
+		u32 val;
+		int ret;
+
+		if (!list_empty_careful(&v->async_pf.done))
+			return true;
+
+		ret = kvm_read_guest_offset_cached(v->kvm, &v->arch.apf.cache,
+						   &val, 0, sizeof(val));
+		if (ret || val)
+			return true;
+
+		ret = kvm_read_guest_offset_cached(v->kvm, &v->arch.apf.cache,
+						   &val, 4, sizeof(val));
+		if (ret || val)
+			return true;
+	}
+#endif
+
+	return false;
 }
 
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
@@ -624,6 +657,11 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
 		if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu))
 			kvm_reset_vcpu(vcpu);
 
+#ifdef CONFIG_KVM_ASYNC_PF
+		if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
+			kvm_check_async_pf_completion(vcpu);
+#endif
+
 		if (kvm_check_request(KVM_REQ_SDEI, vcpu))
 			kvm_sdei_deliver(vcpu);
 
@@ -739,7 +777,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		smp_store_mb(vcpu->mode, IN_GUEST_MODE);
 
 		if (ret <= 0 || need_new_vmid_gen(&vcpu->kvm->arch.vmid) ||
-		    kvm_request_pending(vcpu)) {
+		    (kvm_request_pending(vcpu) &&
+		     READ_ONCE(vcpu->requests) != (1UL << KVM_REQ_APF_READY))) {
 			vcpu->mode = OUTSIDE_GUEST_MODE;
 			isb(); /* Ensure work in x_flush_hwstate is committed */
 			kvm_pmu_sync_hwstate(vcpu);
diff --git a/arch/arm64/kvm/async_pf.c b/arch/arm64/kvm/async_pf.c
new file mode 100644
index 000000000000..910cb3a1bdc2
--- /dev/null
+++ b/arch/arm64/kvm/async_pf.c
@@ -0,0 +1,462 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright Gavin Shan, Redhat Inc 2020.
+ *
+ * Asynchronous page fault support
+ */
+
+#include <linux/arm-smccc.h>
+#include <linux/kvm_host.h>
+#include <asm/kvm_emulate.h>
+#include <kvm/arm_hypercalls.h>
+#include <kvm/arm_vgic.h>
+#include <asm/kvm_sdei.h>
+
+static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
+{
+	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
+}
+
+static inline u32 kvm_async_pf_hash_next(u32 key)
+{
+	return (key + 1) & (ASYNC_PF_PER_VCPU - 1);
+}
+
+static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
+{
+	int i;
+
+	for (i = 0; i < ASYNC_PF_PER_VCPU; i++)
+		vcpu->arch.apf.gfns[i] = ~0;
+}
+
+/*
+ * Add gfn to the hash table. It's ensured there is a free entry
+ * when this function is called.
+ */
+static void kvm_async_pf_hash_add(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_fn(gfn);
+
+	while (vcpu->arch.apf.gfns[key] != ~0)
+		key = kvm_async_pf_hash_next(key);
+
+	vcpu->arch.apf.gfns[key] = gfn;
+}
+
+static u32 kvm_async_pf_hash_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_fn(gfn);
+	int i;
+
+	for (i = 0; i < ASYNC_PF_PER_VCPU; i++) {
+		if (vcpu->arch.apf.gfns[key] == gfn ||
+		    vcpu->arch.apf.gfns[key] == ~0)
+			break;
+
+		key = kvm_async_pf_hash_next(key);
+	}
+
+	return key;
+}
+
+static void kvm_async_pf_hash_remove(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 i, j, k;
+
+	i = j = kvm_async_pf_hash_slot(vcpu, gfn);
+	while (true) {
+		vcpu->arch.apf.gfns[i] = ~0;
+
+		do {
+			j = kvm_async_pf_hash_next(j);
+			if (vcpu->arch.apf.gfns[j] == ~0)
+				return;
+
+			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
+			/*
+			 * k lies cyclically in ]i,j]
+			 * |    i.k.j |
+			 * |....j i.k.| or  |.k..j i...|
+			 */
+		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
+
+		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
+		i = j;
+	}
+}
+
+bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_async_pf_hash_slot(vcpu, gfn);
+
+	return vcpu->arch.apf.gfns[key] == gfn;
+}
+
+static inline int kvm_async_pf_read_cache(struct kvm_vcpu *vcpu,
+					  u32 offset, u32 *val)
+{
+	return kvm_read_guest_offset_cached(vcpu->kvm,
+					    &vcpu->arch.apf.cache,
+					    val, offset, sizeof(*val));
+}
+
+static inline int kvm_async_pf_write_cache(struct kvm_vcpu *vcpu,
+					   u32 offset, u32 val)
+{
+	return kvm_write_guest_offset_cached(vcpu->kvm,
+					     &vcpu->arch.apf.cache,
+					     &val, offset, sizeof(val));
+}
+
+bool kvm_arch_async_not_present_allowed(struct kvm_vcpu *vcpu)
+{
+	u32 val;
+	int ret;
+
+	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
+		return false;
+
+	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
+		return false;
+
+	if (!irqchip_in_kernel(vcpu->kvm))
+		return false;
+
+	if (vcpu->arch.sdei_critical_event ||
+	    vcpu->arch.sdei_normal_event)
+		return false;
+
+	/* Pending page fault, which isn't acknowledged by guest */
+	ret = kvm_async_pf_read_cache(vcpu, 0, &val);
+	if (ret || val)
+		return false;
+
+	ret = kvm_async_pf_read_cache(vcpu, 4, &val);
+	if (ret || val)
+		return false;
+
+	return true;
+}
+
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu,
+			    u32 esr, gpa_t gpa, gfn_t gfn)
+{
+	struct kvm_arch_async_pf arch;
+	unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gfn);
+
+	arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
+	arch.gfn = gfn;
+	arch.esr = esr;
+
+	return kvm_setup_async_pf(vcpu, gpa, hva, &arch);
+}
+
+/*
+ * It's guaranteed that there is no pending asynchronous page fault
+ * when this is called, meaning all previously issued asynchronous
+ * page faults have been acknowledged.
+ */
+bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
+				     struct kvm_async_pf *work)
+{
+	kvm_async_pf_hash_add(vcpu, work->arch.gfn);
+
+	if (kvm_async_pf_write_cache(vcpu, 4, work->arch.token) ||
+	    kvm_async_pf_write_cache(vcpu, 0, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
+		kvm_err("%s: Error to write token (0x%08x)\n",
+			__func__, work->arch.token);
+		kvm_async_pf_write_cache(vcpu, 4, 0);
+		kvm_async_pf_write_cache(vcpu, 0, 0);
+		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
+		return false;
+	}
+
+	vcpu->arch.apf.notpresent_pending = true;
+	vcpu->arch.apf.notpresent_token = work->arch.token;
+
+	return !kvm_sdei_inject(vcpu, vcpu->arch.apf.sdei_event_num, false);
+}
+
+static void kvm_arch_async_sdei_notifier(struct kvm_vcpu *vcpu,
+					 unsigned long num,
+					 unsigned int state)
+{
+	if (num != vcpu->arch.apf.sdei_event_num) {
+		kvm_err("%s: Illegal event number %lu (%llu)\n",
+			__func__, num, vcpu->arch.apf.sdei_event_num);
+		return;
+	}
+
+	switch (state) {
+	case KVM_SDEI_STATE_DELIVERED:
+		if (!vcpu->arch.apf.notpresent_pending)
+			break;
+
+		vcpu->arch.apf.notpresent_token = 0;
+		vcpu->arch.apf.notpresent_pending = false;
+		break;
+	case KVM_SDEI_STATE_COMPLETED:
+		break;
+	default:
+		kvm_err("%s: Illegal state %d for event %lu\n",
+			__func__, state, num);
+	}
+}
+
+void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
+{
+	kvm_make_request(KVM_REQ_APF_READY, vcpu);
+	if (!vcpu->arch.apf.pageready_pending)
+		kvm_vcpu_kick(vcpu);
+}
+
+bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
+{
+	struct kvm_async_pf *work;
+	u32 reason, token;
+
+	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
+		return true;
+
+	if (vcpu->arch.apf.pageready_pending)
+		goto fail;
+
+	if (kvm_async_pf_read_cache(vcpu, 0, &reason) ||
+	    kvm_async_pf_read_cache(vcpu, 4, &token)) {
+		kvm_err("%s: Error to read cache\n",
+			__func__);
+		goto fail;
+	}
+
+	/*
+	 * There might be a pending page-not-present notification (SDEI)
+	 * to be delivered, but the corresponding work has already been
+	 * completed. In this case, we need to cancel the notification
+	 * early to avoid the overhead of the injected SDEI event and
+	 * the subsequent interrupt.
+	 */
+	if (vcpu->arch.apf.notpresent_pending) {
+		spin_lock(&vcpu->async_pf.lock);
+		work = list_first_entry_or_null(&vcpu->async_pf.done,
+						typeof(*work), link);
+		spin_unlock(&vcpu->async_pf.lock);
+		if (!work)
+			goto fail;
+
+		if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT &&
+		    work->arch.token == vcpu->arch.apf.notpresent_token &&
+		    token == vcpu->arch.apf.notpresent_token) {
+			kvm_make_request(KVM_REQ_APF_READY, vcpu);
+			return true;
+		}
+	}
+
+	if (reason || token)
+		goto fail;
+
+	return true;
+
+fail:
+	kvm_make_request(KVM_REQ_APF_READY, vcpu);
+	return false;
+}
+
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
+			       struct kvm_async_pf *work)
+{
+	struct kvm_memory_slot *memslot;
+	unsigned int esr = work->arch.esr;
+	phys_addr_t gpa = work->cr2_or_gpa;
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+	unsigned long hva;
+	bool write_fault, writable;
+	int idx;
+
+	/*
+	 * We shouldn't issue prefault for special work to wake up
+	 * all pending tasks because the associated token (address)
+	 * is invalid.
+	 */
+	if (work->wakeup_all)
+		return;
+
+	/*
+	 * The gpa was validated before the work is started. However, the
+	 * memory slots might be changed since then. So we need to redo the
+	 * validatation here.
+	 */
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+	write_fault = (esr_get_trap_class(esr) != ESR_ELx_EC_IABT_LOW &&
+		       (esr_is_dabt_wnr(esr) || esr_is_dabt_iss1tw(esr)));
+	memslot = gfn_to_memslot(vcpu->kvm, gfn);
+	hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
+	if (kvm_is_error_hva(hva) || (write_fault && !writable))
+		goto out;
+
+	kvm_handle_user_mem_abort(vcpu, gpa, memslot, hva, esr, true);
+
+out:
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+}
+
+
+/*
+ * It's guaranteed that there is no pending asynchronous page fault
+ * when this is called, meaning all previously issued asynchronous
+ * page faults have been acknowledged.
+ */
+void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+				 struct kvm_async_pf *work)
+{
+	/*
+	 * The work might have been completed. In this case, we cancel
+	 * the page-not-present notification to avoid unnecessary overhead
+	 * which is introduced by SDEI event delivery and subsequent IRQ
+	 * context.
+	 */
+	if (work->wakeup_all) {
+		work->arch.token = ~0;
+	} else {
+		kvm_async_pf_hash_remove(vcpu, work->arch.gfn);
+
+		if (vcpu->arch.apf.notpresent_pending &&
+		    vcpu->arch.apf.notpresent_token == work->arch.token &&
+		    kvm_sdei_cancel(vcpu, vcpu->arch.apf.sdei_event_num)) {
+			kvm_async_pf_write_cache(vcpu, 4, 0);
+			kvm_async_pf_write_cache(vcpu, 0, 0);
+			vcpu->arch.apf.notpresent_pending = false;
+			vcpu->arch.apf.notpresent_token = 0;
+			return;
+		}
+	}
+
+	if (kvm_async_pf_write_cache(vcpu, 4, work->arch.token) ||
+	    kvm_async_pf_write_cache(vcpu, 0, KVM_PV_REASON_PAGE_READY)) {
+		kvm_async_pf_write_cache(vcpu, 4, 0);
+		kvm_async_pf_write_cache(vcpu, 0, 0);
+		kvm_err("%s: Error to token (0x%08x)\n",
+			__func__, work->arch.token);
+		return;
+	}
+
+	vcpu->arch.apf.pageready_pending = true;
+	kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_idx,
+			    vcpu->arch.apf.irq, true, NULL);
+}
+
+long kvm_async_pf_hypercall(struct kvm_vcpu *vcpu)
+{
+	u32 func;
+	u64 data;
+	gpa_t gpa;
+	bool enabled, enable;
+	int err;
+	long ret = SMCCC_RET_SUCCESS;
+
+	func = smccc_get_arg1(vcpu);
+	switch (func) {
+	case ARM_SMCCC_KVM_FUNC_APF_VERSION:
+		/* v1.0.0 */
+		ret = 0x010000;
+		break;
+	case ARM_SMCCC_KVM_FUNC_APF_TOKEN_NUM:
+		ret = ARRAY_SIZE(vcpu->arch.apf.gfns);
+		break;
+	case ARM_SMCCC_KVM_FUNC_APF_ENABLE:
+		data = (smccc_get_arg3(vcpu) << 32) | smccc_get_arg2(vcpu);
+		gpa = (data & ~0x3F);
+
+		/* Bail if the state transition isn't allowed */
+		enabled = !!(vcpu->arch.apf.control_block &
+			     KVM_ASYNC_PF_ENABLED);
+		enable = !!(data & KVM_ASYNC_PF_ENABLED);
+		if (enable == enabled) {
+			kvm_debug("%s: Async PF has been %s (0x%llx -> 0x%llx)\n",
+				  __func__, enabled ? "enabled" : "disabled",
+				  vcpu->arch.apf.control_block, data);
+			ret = SMCCC_RET_NOT_REQUIRED;
+			break;
+		}
+
+		/* To disable the functionality */
+		if (!enable) {
+			kvm_clear_async_pf_completion_queue(vcpu);
+			vcpu->arch.apf.control_block = data;
+			break;
+		}
+
+		/*
+		 * To enable the functionality. The SDEI event or IRQ number
+		 * should have been configured in advance.
+		 */
+		if (!vcpu->arch.apf.sdei_event_num || !vcpu->arch.apf.irq) {
+			kvm_err("%s: Invalid SDEI event or IRQ number\n",
+				__func__);
+			ret = SMCCC_RET_INVALID_PARAMETER;
+			break;
+		}
+
+		/* Register SDEI event notifier */
+		err = kvm_sdei_register_notifier(vcpu->arch.apf.sdei_event_num,
+						 kvm_arch_async_sdei_notifier);
+		if (err) {
+			kvm_err("%s: Error %d registering SDEI notifier\n",
+				__func__, err);
+			ret = SMCCC_RET_NOT_SUPPORTED;
+			break;
+		}
+
+		/* Initialize cache shared by host and guest */
+		err = kvm_gfn_to_hva_cache_init(vcpu->kvm,
+						&vcpu->arch.apf.cache,
+						gpa, sizeof(u64));
+		if (err) {
+			kvm_err("%s: Error %d initializing cache\n",
+				__func__, err);
+			ret = SMCCC_RET_NOT_SUPPORTED;
+			break;
+		}
+
+		/* Flush the token table */
+		kvm_async_pf_hash_reset(vcpu);
+		vcpu->arch.apf.send_user_only =
+			!(data & KVM_ASYNC_PF_SEND_ALWAYS);
+		kvm_async_pf_wakeup_all(vcpu);
+		vcpu->arch.apf.control_block = data;
+
+		break;
+	case ARM_SMCCC_KVM_FUNC_APF_SDEI:
+	case ARM_SMCCC_KVM_FUNC_APF_IRQ:
+		if (!irqchip_in_kernel(vcpu->kvm)) {
+			ret = SMCCC_RET_NOT_SUPPORTED;
+			break;
+		}
+
+		if (vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
+			ret = SMCCC_RET_NOT_REQUIRED;
+			break;
+		}
+
+		if (func == ARM_SMCCC_KVM_FUNC_APF_SDEI)
+			vcpu->arch.apf.sdei_event_num = smccc_get_arg2(vcpu);
+		else
+			vcpu->arch.apf.irq = smccc_get_arg2(vcpu);
+
+		break;
+	case ARM_SMCCC_KVM_FUNC_APF_IRQ_ACK:
+		if (!vcpu->arch.apf.pageready_pending)
+			break;
+
+		kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_idx,
+				    vcpu->arch.apf.irq, false, NULL);
+		vcpu->arch.apf.pageready_pending = false;
+		kvm_check_async_pf_completion(vcpu);
+		break;
+	default:
+		ret = SMCCC_RET_NOT_SUPPORTED;
+	}
+
+	return ret;
+}
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index cd48fd724a52..ae7e07aaa521 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -71,7 +71,15 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 		break;
 	case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
 		val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
+#ifdef CONFIG_KVM_ASYNC_PF
+		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_APF);
+#endif
 		break;
+#ifdef CONFIG_KVM_ASYNC_PF
+	case ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID:
+		val[0] = kvm_async_pf_hypercall(vcpu);
+		break;
+#endif
 	case SDEI_1_0_FN_SDEI_VERSION:
 	case SDEI_1_0_FN_SDEI_EVENT_REGISTER:
 	case SDEI_1_0_FN_SDEI_EVENT_ENABLE:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index b23778392aa1..81579bdab4a8 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1787,6 +1787,30 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
 	return PAGE_SIZE;
 }
 
+static bool try_async_pf(struct kvm_vcpu *vcpu, u32 esr, gpa_t gpa,
+			 gfn_t gfn, kvm_pfn_t *pfn, bool write,
+			 bool *writable, bool prefault)
+{
+	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+#ifdef CONFIG_KVM_ASYNC_PF
+	bool async = false;
+
+	/* Bail if *pfn has correct page */
+	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable);
+	if (!async)
+		return false;
+
+	if (!prefault && kvm_arch_async_not_present_allowed(vcpu)) {
+		if (kvm_async_pf_hash_find(vcpu, gfn) ||
+		    kvm_arch_setup_async_pf(vcpu, esr, gpa, gfn))
+			return true;
+	}
+#endif
+
+	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
+	return false;
+}
+
 int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu,
 			      phys_addr_t fault_ipa,
 			      struct kvm_memory_slot *memslot,
@@ -1870,7 +1894,10 @@ int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu,
 	 */
 	smp_rmb();
 
-	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
+	if (try_async_pf(vcpu, esr, fault_ipa, gfn, &pfn,
+			 write_fault, &writable, prefault))
+		return 1;
+
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
diff --git a/arch/arm64/kvm/sdei.c b/arch/arm64/kvm/sdei.c
index 2c05edcb3fbb..8a9df119fa73 100644
--- a/arch/arm64/kvm/sdei.c
+++ b/arch/arm64/kvm/sdei.c
@@ -22,6 +22,14 @@ static struct kvm_sdei_priv kvm_sdei_privs[] = {
 	  0,
 	  NULL
 	},
+	{ 0x40200001,   /* Asynchronous page fault */
+	  SDEI_EVENT_TYPE_PRIVATE,
+	  1,
+	  SDEI_EVENT_PRIORITY_CRITICAL,
+	  SDEI_EVENT_REGISTER_RM_ANY,
+	  0,
+	  NULL
+	},
 };
 
 #ifdef CONFIG_ARM_SDE_INTERFACE
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 1c699fc6dc12..31a061a3a093 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -116,6 +116,7 @@
 
 /* KVM "vendor specific" services */
 #define ARM_SMCCC_KVM_FUNC_FEATURES		0
+#define ARM_SMCCC_KVM_FUNC_APF			1
 #define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
 #define ARM_SMCCC_KVM_NUM_FUNCS			128
 
@@ -125,6 +126,20 @@
 			   ARM_SMCCC_OWNER_VENDOR_HYP,		\
 			   ARM_SMCCC_KVM_FUNC_FEATURES)
 
+/* Async page fault service */
+#define ARM_SMCCC_KVM_FUNC_APF_VERSION		0
+#define ARM_SMCCC_KVM_FUNC_APF_ENABLE		1
+#define ARM_SMCCC_KVM_FUNC_APF_TOKEN_NUM	2
+#define ARM_SMCCC_KVM_FUNC_APF_SDEI		3
+#define ARM_SMCCC_KVM_FUNC_APF_IRQ		4
+#define ARM_SMCCC_KVM_FUNC_APF_IRQ_ACK		5
+
+#define ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID			\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
+			   ARM_SMCCC_SMC_32,			\
+			   ARM_SMCCC_OWNER_VENDOR_HYP,		\
+			   ARM_SMCCC_KVM_FUNC_APF)
+
 /*
  * Return codes defined in ARM DEN 0070A
  * ARM DEN 0070A is now merged/consolidated into ARM DEN 0028 C
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/6] drivers/acpi: Import ACPI APF table
  2020-08-18  1:13 [PATCH 0/6] Support Asynchronous Page Fault Gavin Shan
                   ` (3 preceding siblings ...)
  2020-08-18  1:13 ` [PATCH 4/6] arm64/kvm: Support async page fault Gavin Shan
@ 2020-08-18  1:13 ` Gavin Shan
  2020-10-23 16:55   ` James Morse
  2020-08-18  1:13 ` [PATCH 6/6] arm64/kernel: Support async page fault Gavin Shan
  2020-10-23 16:54 ` [PATCH 0/6] Support Asynchronous Page Fault James Morse
  6 siblings, 1 reply; 15+ messages in thread
From: Gavin Shan @ 2020-08-18  1:13 UTC (permalink / raw)
  To: kvmarm; +Cc: maz, shan.gavin, pbonzini, will

This defines the struct for the ACPI APF table (APFT). The information
included in this table will be used by the guest kernel to retrieve the
SDEI event number, the PPI number and its trigger properties, as
illustrated by the sketch after the list:

   * @sdei_event: number of SDEI event used for page-not-present
   * @interrupt:  PPI used for page-ready
   * @interrupt_flags: PPI's mode and polarity
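
A minimal sketch of how the guest might consume the table;
sdei_event_num, ppi_number and is_edge are placeholder variables, and
the exact meaning of the mode/polarity bits follows the definitions
added below:

   struct acpi_table_apft *apft;
   acpi_status status;

   status = acpi_get_table(ACPI_SIG_APFT, 0,
                           (struct acpi_table_header **)&apft);
   if (ACPI_FAILURE(status))
           return -ENODEV;    /* no APFT, the feature stays disabled */

   sdei_event_num = apft->sdei_event;
   ppi_number     = apft->interrupt;
   is_edge        = !!(apft->interrupt_flags & ACPI_APFT_INTERRUPT_MODE);

   acpi_put_table((struct acpi_table_header *)apft);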

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 include/acpi/actbl2.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index ec66779cb193..2eb715f4463b 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -44,6 +44,7 @@
 #define ACPI_SIG_SDEI           "SDEI"	/* Software Delegated Exception Interface Table */
 #define ACPI_SIG_SDEV           "SDEV"	/* Secure Devices table */
 #define ACPI_SIG_NHLT           "NHLT"	/* Non-HDAudio Link Table */
+#define ACPI_SIG_APFT           "APFT"  /* Asynchronous Page Fault Table */
 
 /*
  * All tables must be byte-packed to match the ACPI specification, since
@@ -1737,6 +1738,23 @@ struct acpi_sdev_pcie_path {
 	u8 function;
 };
 
+/*******************************************************************************
+ *
+ * APFT - Asynchronous Page Fault Table
+ *
+ * Customized table used for asynchronous page fault on ARM
+ *
+ ******************************************************************************/
+struct acpi_table_apft {
+	struct acpi_table_header header;
+	u32 sdei_event;
+	u32 interrupt;
+	u32 interrupt_flags;
+};
+
+#define ACPI_APFT_INTERRUPT_MODE	(1)
+#define ACPI_APFT_INTERRUPT_POLARITY	(1<<1)
+
 /* Reset to default packing */
 
 #pragma pack()
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 6/6] arm64/kernel: Support async page fault
  2020-08-18  1:13 [PATCH 0/6] Support Asynchronous Page Fault Gavin Shan
                   ` (4 preceding siblings ...)
  2020-08-18  1:13 ` [PATCH 5/6] drivers/acpi: Import ACPI APF table Gavin Shan
@ 2020-08-18  1:13 ` Gavin Shan
  2020-10-23 16:54 ` [PATCH 0/6] Support Asynchronous Page Fault James Morse
  6 siblings, 0 replies; 15+ messages in thread
From: Gavin Shan @ 2020-08-18  1:13 UTC (permalink / raw)
  To: kvmarm; +Cc: maz, shan.gavin, pbonzini, will

This supports the asynchronous page fault, based on the services
provided by the host. The asynchronous page fault starts with a private
SDEI event, whose number is specified by the guest. The guest marks
the currently running process as suspended with the newly added flag
(TIF_ASYNC_PF) and associates it with a wait queue. In the meanwhile,
the process is put into a pre-allocated table, whose index is
identified by the associated token. In the SDEI event handler, a
follow-up (reschedule) IPI is sent to the current CPU to force a
context switch away from the currently running process. During the
context switch, the flag (TIF_ASYNC_PF) is checked and the process
waits on the associated wait queue.

The guest receives a PPI, whose number is also specified by the guest,
after the page becomes ready on the host side. The pre-allocated table
is searched with the received token to locate the associated process,
which is then woken up. The previously suspended process resumes
execution and the asynchronous page fault is complete.
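
The following is a condensed sketch of that guest-side flow. It is not
the literal code; it is boiled down from kvm_async_pf_sdei_handler()
and kvm_async_pf_irq_handler() in arch/arm64/kernel/kvm.c below, with
locking, error checking and the acknowledgement back to the host
omitted:

   /* SDEI handler: page-not-present, runs on the faulting vCPU */
   static int kvm_async_pf_sdei_handler(unsigned int event,
                                        struct pt_regs *regs, void *arg)
   {
           unsigned int token = __this_cpu_read(apf_data.token);

           /* Park 'current' in the per-CPU table, keyed by the token */
           kvm_async_pf_add_task(current, token);   /* sets TIF_ASYNC_PF  */
           smp_send_reschedule(smp_processor_id()); /* force a reschedule */
           return 0;
   }

   /* PPI handler: page-ready, wakes the task that owns the token */
   static irqreturn_t kvm_async_pf_irq_handler(int irq, void *dev_id)
   {
           unsigned int token = __this_cpu_read(apf_data.token);

           kvm_async_pf_remove_task(token); /* clears flag, swake_up_one() */
           return IRQ_HANDLED;
   }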

The implementation is guarded by CONFIG_KVM_GUEST. Also, the boot
parameter "no-kvmapf" can be used to disable the feature. Besides, the
feature has to be enabled after the SDEI service has been brought up,
because it depends on that service.
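
For example, to turn the feature off at boot time (assuming a kernel
built with CONFIG_KVM_GUEST=y; the other parameters below are only
placeholders), append "no-kvmapf" to the guest kernel command line:

   console=ttyAMA0 root=/dev/vda1 ro no-kvmapf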

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/Kconfig                   |  11 +
 arch/arm64/include/asm/kvm_para.h    |  26 +-
 arch/arm64/include/asm/processor.h   |   1 +
 arch/arm64/include/asm/thread_info.h |   4 +-
 arch/arm64/kernel/Makefile           |   1 +
 arch/arm64/kernel/kvm.c              | 478 +++++++++++++++++++++++++++
 arch/arm64/kernel/signal.c           |  17 +
 7 files changed, 531 insertions(+), 7 deletions(-)
 create mode 100644 arch/arm64/kernel/kvm.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e11b4ea06127..5543c0d8159e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1053,6 +1053,17 @@ config PARAVIRT
 	  under a hypervisor, potentially improving performance significantly
 	  over full virtualization.
 
+config KVM_GUEST
+	bool "KVM Guest Support"
+	depends on PARAVIRT
+	default y
+	help
+	  This option enables various optimizations for running under the KVM
+	  hypervisor. Overhead for the kernel when not running inside KVM should
+	  be minimal.
+
+	  In case of doubt, say Y
+
 config PARAVIRT_TIME_ACCOUNTING
 	bool "Paravirtual steal time accounting"
 	select PARAVIRT
diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
index 0ea481dd1c7a..1f038652cdcd 100644
--- a/arch/arm64/include/asm/kvm_para.h
+++ b/arch/arm64/include/asm/kvm_para.h
@@ -3,6 +3,20 @@
 #define _ASM_ARM_KVM_PARA_H
 
 #include <uapi/asm/kvm_para.h>
+#include <linux/of.h>
+#include <asm/hypervisor.h>
+
+#ifdef CONFIG_KVM_GUEST
+static inline int kvm_para_available(void)
+{
+	return 1;
+}
+#else
+static inline int kvm_para_available(void)
+{
+	return 0;
+}
+#endif /* CONFIG_KVM_GUEST */
 
 static inline bool kvm_check_and_clear_guest_paused(void)
 {
@@ -11,7 +25,12 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 
 static inline unsigned int kvm_arch_para_features(void)
 {
-	return 0;
+	unsigned int features = 0;
+
+	if (kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_APF))
+		features |= (1 << KVM_FEATURE_ASYNC_PF);
+
+	return features;
 }
 
 static inline unsigned int kvm_arch_para_hints(void)
@@ -19,9 +38,4 @@ static inline unsigned int kvm_arch_para_hints(void)
 	return 0;
 }
 
-static inline bool kvm_para_available(void)
-{
-	return false;
-}
-
 #endif /* _ASM_ARM_KVM_PARA_H */
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 240fe5e5b720..afd2e04cac3d 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -151,6 +151,7 @@ struct thread_struct {
 	struct ptrauth_keys_user	keys_user;
 	struct ptrauth_keys_kernel	keys_kernel;
 #endif
+	void				*data;
 };
 
 static inline void arch_thread_struct_whitelist(unsigned long *offset,
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 5e784e16ee89..5605dc9d2bd3 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -67,6 +67,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
 #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
+#define TIF_ASYNC_PF		6	/* Asynchronous page fault */
 #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
 #define TIF_SYSCALL_TRACEPOINT	10	/* syscall tracepoint for ftrace */
@@ -86,6 +87,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_ASYNC_PF		(1 << TIF_ASYNC_PF)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
@@ -99,7 +101,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-				 _TIF_UPROBE | _TIF_FSCHECK)
+				 _TIF_UPROBE | _TIF_FSCHECK | _TIF_ASYNC_PF)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index a561cbb91d4d..76472934689b 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -49,6 +49,7 @@ obj-$(CONFIG_ACPI)			+= acpi.o
 obj-$(CONFIG_ACPI_NUMA)			+= acpi_numa.o
 obj-$(CONFIG_ARM64_ACPI_PARKING_PROTOCOL)	+= acpi_parking_protocol.o
 obj-$(CONFIG_PARAVIRT)			+= paravirt.o
+obj-$(CONFIG_KVM_GUEST)			+= kvm.o
 obj-$(CONFIG_RANDOMIZE_BASE)		+= kaslr.o
 obj-$(CONFIG_HIBERNATION)		+= hibernate.o hibernate-asm.o
 obj-$(CONFIG_KEXEC_CORE)		+= machine_kexec.o relocate_kernel.o	\
diff --git a/arch/arm64/kernel/kvm.c b/arch/arm64/kernel/kvm.c
new file mode 100644
index 000000000000..d47eed4cff1e
--- /dev/null
+++ b/arch/arm64/kernel/kvm.c
@@ -0,0 +1,478 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright Gavin Shan, Redhat Inc 2020.
+ */
+
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/slab.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/of.h>
+#include <linux/of_fdt.h>
+#include <linux/arm-smccc.h>
+#include <linux/kvm_para.h>
+#include <linux/arm_sdei.h>
+#include <linux/acpi.h>
+#include <linux/cpuhotplug.h>
+#include <linux/reboot.h>
+
+struct kvm_apf_task {
+	unsigned int		token;
+	struct task_struct	*task;
+	struct swait_queue_head	wq;
+};
+
+struct kvm_apf_table {
+	raw_spinlock_t		lock;
+	unsigned int		count;
+	struct kvm_apf_task	tasks[0];
+};
+
+static bool async_pf_available = true;
+static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_data) __aligned(64);
+static struct kvm_apf_table __percpu *apf_tables;
+static unsigned int apf_tasks;
+static unsigned int apf_sdei_num;
+static unsigned int apf_ppi_num;
+static unsigned int apf_ppi_mode;
+static unsigned int apf_ppi_polarity;
+static struct sdei_event *apf_sdei_event;
+static int apf_irq;
+
+static bool kvm_async_pf_add_task(struct task_struct *task,
+				  unsigned int token)
+{
+	struct kvm_apf_table *table = this_cpu_ptr(apf_tables);
+	unsigned int i, index = apf_tasks;
+	bool ret = false;
+
+	raw_spin_lock(&table->lock);
+
+	if (WARN_ON(table->count >= apf_tasks))
+		goto unlock;
+
+	for (i = 0; i < apf_tasks; i++) {
+		if (!table->tasks[i].task) {
+			if (index == apf_tasks) {
+				ret = true;
+				index = i;
+			}
+		} else if (table->tasks[i].task == task) {
+			WARN_ON(table->tasks[i].token != token);
+			ret = false;
+			break;
+		}
+	}
+
+	if (!ret)
+		goto unlock;
+
+	task->thread.data = &table->tasks[index].wq;
+	set_tsk_thread_flag(task, TIF_ASYNC_PF);
+
+	table->count++;
+	table->tasks[index].task = task;
+	table->tasks[index].token = token;
+
+unlock:
+	raw_spin_unlock(&table->lock);
+	return ret;
+}
+
+static inline void kvm_async_pf_remove_one_task(struct kvm_apf_table *table,
+						unsigned int index)
+{
+	clear_tsk_thread_flag(table->tasks[index].task, TIF_ASYNC_PF);
+	WRITE_ONCE(table->tasks[index].task->thread.data, NULL);
+
+	table->count--;
+	table->tasks[index].task = NULL;
+	table->tasks[index].token = 0;
+
+	swake_up_one(&table->tasks[index].wq);
+}
+
+static bool kvm_async_pf_remove_task(unsigned int token)
+{
+	struct kvm_apf_table *table = this_cpu_ptr(apf_tables);
+	unsigned int i;
+	bool ret = (token == UINT_MAX);
+
+	raw_spin_lock(&table->lock);
+
+	for (i = 0; i < apf_tasks; i++) {
+		if (!table->tasks[i].task)
+			continue;
+
+		/* Wakeup all */
+		if (token == UINT_MAX) {
+			kvm_async_pf_remove_one_task(table, i);
+			continue;
+		}
+
+		if (table->tasks[i].token == token) {
+			kvm_async_pf_remove_one_task(table, i);
+			ret = true;
+			break;
+		}
+	}
+
+	raw_spin_unlock(&table->lock);
+
+	return ret;
+}
+
+static int kvm_async_pf_sdei_handler(unsigned int event,
+				     struct pt_regs *regs,
+				     void *arg)
+{
+	unsigned int reason = __this_cpu_read(apf_data.reason);
+	unsigned int token = __this_cpu_read(apf_data.token);
+	bool ret;
+
+	if (reason != KVM_PV_REASON_PAGE_NOT_PRESENT) {
+		pr_warn("%s: Bogus notification (%d, 0x%08x)\n",
+			__func__, reason, token);
+		return -EINVAL;
+	}
+
+	ret = kvm_async_pf_add_task(current, token);
+	__this_cpu_write(apf_data.token, 0);
+	__this_cpu_write(apf_data.reason, 0);
+
+	if (!ret)
+		return -ENOSPC;
+
+	smp_send_reschedule(smp_processor_id());
+
+	return 0;
+}
+
+static irqreturn_t kvm_async_pf_irq_handler(int irq, void *dev_id)
+{
+	unsigned int reason = __this_cpu_read(apf_data.reason);
+	unsigned int token = __this_cpu_read(apf_data.token);
+	struct arm_smccc_res res;
+
+	if (reason != KVM_PV_REASON_PAGE_READY) {
+		pr_warn("%s: Bogus interrupt %d (%d, 0x%08x)\n",
+			__func__, irq, reason, token);
+		return IRQ_HANDLED;
+	}
+
+	kvm_async_pf_remove_task(token);
+
+	__this_cpu_write(apf_data.token, 0);
+	__this_cpu_write(apf_data.reason, 0);
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+			     ARM_SMCCC_KVM_FUNC_APF_IRQ_ACK, &res);
+
+	return IRQ_HANDLED;
+}
+
+static int __init kvm_async_pf_available(char *arg)
+{
+	async_pf_available = false;
+	return 0;
+}
+early_param("no-kvmapf", kvm_async_pf_available);
+
+static void kvm_async_pf_disable(void)
+{
+	struct arm_smccc_res res;
+	u32 enabled = __this_cpu_read(apf_data.enabled);
+
+	if (!enabled)
+		return;
+
+	/* Disable the functionality */
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+			     ARM_SMCCC_KVM_FUNC_APF_ENABLE,
+			     0, 0, &res);
+	if (res.a0 != SMCCC_RET_SUCCESS) {
+		pr_warn("%s: Error %ld on CPU %d to disable APF\n",
+			__func__, res.a0, smp_processor_id());
+		return;
+	}
+
+	__this_cpu_write(apf_data.enabled, 0);
+
+	/* Clear SDEI event number */
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+			     ARM_SMCCC_KVM_FUNC_APF_SDEI, 0, &res);
+	if (res.a0 != SMCCC_RET_SUCCESS) {
+		pr_warn("%s: Error %ld on CPU %d to clear SDEI event\n",
+			__func__, res.a0, smp_processor_id());
+	}
+
+	/* Clear PPI number */
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+			     ARM_SMCCC_KVM_FUNC_APF_IRQ, 0, &res);
+	if (res.a0 != SMCCC_RET_SUCCESS) {
+		pr_warn("%s: Error %ld on CPU %d to clear PPI\n",
+			__func__, res.a0, smp_processor_id());
+	}
+
+	pr_info("Async PF disabled on CPU %d\n", smp_processor_id());
+}
+
+static void kvm_async_pf_enable(void)
+{
+	struct arm_smccc_res res;
+	u32 enabled = __this_cpu_read(apf_data.enabled);
+	u64 val = virt_to_phys(this_cpu_ptr(&apf_data));
+
+	if (enabled)
+		return;
+
+	/* Set SDEI event number */
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+			     ARM_SMCCC_KVM_FUNC_APF_SDEI, apf_sdei_num,
+			     &res);
+	if (res.a0 != SMCCC_RET_SUCCESS) {
+		pr_warn("%s: Error %ld on CPU %d to set SDEI event\n",
+			__func__, res.a0, smp_processor_id());
+		return;
+	}
+
+	/* Set PPI number */
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+			     ARM_SMCCC_KVM_FUNC_APF_IRQ,
+			     apf_ppi_num, &res);
+	if (res.a0 != SMCCC_RET_SUCCESS) {
+		pr_warn("%s: Error %ld on CPU %d to set PPI\n",
+			__func__, res.a0, smp_processor_id());
+		return;
+	}
+
+	/* Enable the functionality */
+	val |= KVM_ASYNC_PF_ENABLED;
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+			     ARM_SMCCC_KVM_FUNC_APF_ENABLE,
+			     (u32)val, (u32)(val >> 32), &res);
+	if (res.a0 != SMCCC_RET_SUCCESS) {
+		pr_warn("%s: Error %ld on CPU %d to enable APF\n",
+			__func__, res.a0, smp_processor_id());
+	}
+
+	__this_cpu_write(apf_data.enabled, 1);
+
+	pr_info("Async PF enabled on CPU %d\n", smp_processor_id());
+}
+
+static void kvm_async_pf_cpu_disable(void *info)
+{
+	disable_percpu_irq(apf_irq);
+	kvm_async_pf_disable();
+}
+
+static void kvm_async_pf_cpu_enable(void *info)
+{
+	enable_percpu_irq(apf_irq, IRQ_TYPE_LEVEL_HIGH);
+	kvm_async_pf_enable();
+}
+
+static int kvm_async_pf_cpu_reboot_notify(struct notifier_block *nb,
+					  unsigned long code,
+					  void *unused)
+{
+	if (code == SYS_RESTART) {
+		sdei_event_disable(apf_sdei_event);
+		sdei_event_unregister(apf_sdei_event);
+
+		on_each_cpu(kvm_async_pf_cpu_disable, NULL, 1);
+	}
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block kvm_async_pf_cpu_reboot_nb = {
+	.notifier_call = kvm_async_pf_cpu_reboot_notify,
+};
+
+static int kvm_async_pf_cpu_online(unsigned int cpu)
+{
+	kvm_async_pf_cpu_enable(NULL);
+
+	return 0;
+}
+
+static int kvm_async_pf_cpu_offline(unsigned int cpu)
+{
+	kvm_async_pf_cpu_disable(NULL);
+
+	return 0;
+}
+
+static int __init kvm_async_pf_info(void)
+{
+	struct acpi_table_apft *apft;
+	acpi_status status;
+
+	if (acpi_disabled)
+		return -EPERM;
+
+	status = acpi_get_table(ACPI_SIG_APFT, 0,
+				(struct acpi_table_header **)&apft);
+	if (ACPI_FAILURE(status)) {
+		pr_warn("%s: Failed to get ACPI:APFT <%s>\n",
+			__func__, acpi_format_exception(status));
+		return -ENODEV;
+	}
+
+	apf_sdei_num = apft->sdei_event;
+	apf_ppi_num = apft->interrupt;
+	apf_ppi_mode = (apft->interrupt_flags & ACPI_APFT_INTERRUPT_MODE) ?
+		       ACPI_EDGE_SENSITIVE : ACPI_LEVEL_SENSITIVE;
+	apf_ppi_polarity =
+		(apft->interrupt_flags & ACPI_APFT_INTERRUPT_POLARITY) ?
+		ACPI_ACTIVE_LOW : ACPI_ACTIVE_HIGH;
+
+	return 0;
+}
+
+static int __init kvm_async_pf_init(void)
+{
+	struct kvm_apf_table *table;
+	struct arm_smccc_res res;
+	size_t size;
+	int cpu, i, ret;
+
+	if (!kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) ||
+	    !async_pf_available)
+		return -EPERM;
+
+	/* Retrieve information from ACPI:APFT */
+	ret = kvm_async_pf_info();
+	if (ret) {
+		pr_warn("%s: Error %d parsing ACPI:APFT\n",
+			__func__, ret);
+		return ret;
+	}
+
+	/*
+	 * Check the version and v1.0.0 or higher version is required
+	 * to support the functionality.
+	 */
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+			     ARM_SMCCC_KVM_FUNC_APF_VERSION,
+			     &res);
+	if (res.a0 == SMCCC_RET_NOT_SUPPORTED) {
+		pr_warn("%s: Error to retrieve version\n",
+			__func__);
+		return -EPERM;
+	}
+
+	if ((res.a0 & 0xFFFFFFFFFF000000) ||
+	    ((res.a0 & 0xFF0000) >> 16) < 0x1) {
+		pr_warn("%s: Invalid version (0x%016lx)\n",
+			__func__, res.a0);
+		return -EINVAL;
+	}
+
+	/* Retrieve number of tokens */
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_APF_FUNC_ID,
+			     ARM_SMCCC_KVM_FUNC_APF_TOKEN_NUM,
+			     &res);
+	if (res.a0 == SMCCC_RET_NOT_SUPPORTED) {
+		pr_warn("%s: Error to retrieve token number\n",
+			__func__);
+		return -EPERM;
+	}
+
+	/* Allocate and initialize the sleeper table */
+	apf_tasks = res.a0 * 2;
+	size = sizeof(struct kvm_apf_table) +
+	       apf_tasks * sizeof(struct kvm_apf_task);
+	apf_tables = __alloc_percpu(size, 0);
+	if (!apf_tables) {
+		pr_warn("%s: Unable to alloc async PF table\n",
+			__func__);
+		return -ENOMEM;
+	}
+
+	for_each_possible_cpu(cpu) {
+		table = per_cpu_ptr(apf_tables, cpu);
+		raw_spin_lock_init(&table->lock);
+		for (i = 0; i < apf_tasks; i++)
+			init_swait_queue_head(&table->tasks[i].wq);
+	}
+
+	/*
+	 * Initialize SDEI event for page-not-present notification.
+	 * The SDEI event number should have been retrieved from
+	 * the ACPI:APFT table.
+	 */
+	apf_sdei_event = sdei_event_register(apf_sdei_num,
+					     kvm_async_pf_sdei_handler,
+					     NULL);
+	if (IS_ERR(apf_sdei_event)) {
+		pr_warn("%s: Error %ld to register SDEI event\n",
+			__func__, PTR_ERR(apf_sdei_event));
+		ret = -EIO;
+		goto release_tables;
+	}
+
+	ret = sdei_event_enable(apf_sdei_event);
+	if (ret) {
+		pr_warn("%s: Error %d to enable SDEI event\n",
+			__func__, ret);
+		goto unregister_event;
+	}
+
+	/*
+	 * Initialize interrupt for page-ready notification. The
+	 * interrupt number and its properties should have been
+	 * retrieved from the ACPI:APFT table.
+	 */
+	apf_irq = acpi_register_gsi(NULL, apf_ppi_num,
+				    apf_ppi_mode, apf_ppi_polarity);
+	if (apf_irq <= 0) {
+		ret = -EIO;
+		pr_warn("%s: Unable to register IRQ\n", __func__);
+		goto disable_event;
+	}
+
+	ret = request_percpu_irq(apf_irq, kvm_async_pf_irq_handler,
+				 "Asynchronous Page Fault", &apf_data);
+	if (ret) {
+		pr_warn("%s: Error %d to request IRQ\n", __func__, ret);
+		goto unregister_irq;
+	}
+
+	register_reboot_notifier(&kvm_async_pf_cpu_reboot_nb);
+	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+			"arm/kvm:online", kvm_async_pf_cpu_online,
+			kvm_async_pf_cpu_offline);
+	if (ret < 0) {
+		pr_warn("%s: Error %d to install cpu hotplug callbacks\n",
+			__func__, ret);
+		goto release_irq;
+	}
+
+	/* Enable async PF on the online CPUs */
+	on_each_cpu(kvm_async_pf_cpu_enable, NULL, 1);
+
+	return 0;
+
+release_irq:
+	free_percpu_irq(apf_irq, &apf_data);
+unregister_irq:
+	acpi_unregister_gsi(apf_ppi_num);
+disable_event:
+	sdei_event_disable(apf_sdei_event);
+unregister_event:
+	sdei_event_unregister(apf_sdei_event);
+release_tables:
+	free_percpu(apf_tables);
+
+	return ret;
+}
+
+static int __init kvm_guest_init(void)
+{
+	return kvm_async_pf_init();
+}
+
+fs_initcall(kvm_guest_init);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 3b4f31f35e45..42328efbc946 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -921,6 +921,23 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 		/* Check valid user FS if needed */
 		addr_limit_user_check();
 
+		if (thread_flags & _TIF_ASYNC_PF) {
+			struct swait_queue_head *wq =
+				READ_ONCE(current->thread.data);
+			DECLARE_SWAITQUEUE(wait);
+
+			local_daif_restore(DAIF_PROCCTX_NOIRQ);
+
+			do {
+				prepare_to_swait_exclusive(wq,
+					&wait, TASK_UNINTERRUPTIBLE);
+				if (!test_thread_flag(TIF_ASYNC_PF))
+					break;
+
+				schedule();
+			} while (test_thread_flag(TIF_ASYNC_PF));
+		}
+
 		if (thread_flags & _TIF_NEED_RESCHED) {
 			/* Unmask Debug and SError for the next task */
 			local_daif_restore(DAIF_PROCCTX_NOIRQ);
-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/6] Support Asynchronous Page Fault
  2020-08-18  1:13 [PATCH 0/6] Support Asynchronous Page Fault Gavin Shan
                   ` (5 preceding siblings ...)
  2020-08-18  1:13 ` [PATCH 6/6] arm64/kernel: Support async page fault Gavin Shan
@ 2020-10-23 16:54 ` James Morse
  2020-10-31 14:18   ` Paolo Bonzini
  2020-11-02  5:23   ` Gavin Shan
  6 siblings, 2 replies; 15+ messages in thread
From: James Morse @ 2020-10-23 16:54 UTC (permalink / raw)
  To: Gavin Shan, kvmarm; +Cc: maz, will, shan.gavin, pbonzini

Hi Gavin,

I think this series would benefit from being in smaller pieces. I got lost in patch 4 for
quite a while. Suggestion on where to do that in patch 4.


On 18/08/2020 02:13, Gavin Shan wrote:
> There are two stages of page fault. The guest kernel is responsible
> for handling stage one page fault, while the host kernel is to take
> care of the stage two page fault. When page fault is triggered because
> of stage two page fault, the guest is suspended until the requested
> memory (page) is populated. Sometimes, the cost to populate the requested
> page isn't cheap and can take hundreds of milliseconds in extreme
> cases. This impacts the overall guest's performance.

You really need to use postcopy live migration to justify these changes. Otherwise the
story here is "over-committed hosts suck", which I don't think anyone cares about.


> This series introduces the feature (asynchronous page fault) to resolve
> the issue and improve the guest's performance. It depends on the series
> to support SDEI virtualization and refactoring SDEI client driver.

SDEI gives you an NMI ... which you use to set a TIF flag. This can only work reliably for
user-space. So much so that you have code in the hypervisor to only deliver the NMI ...
when in user-space.
The only reason you would need an NMI is to interrupt interrupts-masked code. Linux can't
reschedule when this is the case.

I can only conclude, you really don't need an NMI here.


Why couldn't we use an IRQ here, it would be a lot simpler? ... the reason is the arm
architecture can't guarantee us that we take the irq when there is also a stage2 fault for
the first instruction.
I reckon we can work around this in the hypervisor:
https://lore.kernel.org/r/20201023165108.15061-1-james.morse@arm.com


My problem with SDEI is, you don't really need an NMI, and it creates extra in-kernel
state that has to be migrated. I think having this state in the kernel complicates the
user-space handling of SIGBUS_MCEERR_AO signals that don't get delivered to a vCPU thread.


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/6] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  2020-08-18  1:13 ` [PATCH 3/6] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode Gavin Shan
@ 2020-10-23 16:54   ` James Morse
  2020-11-02  5:30     ` Gavin Shan
  0 siblings, 1 reply; 15+ messages in thread
From: James Morse @ 2020-10-23 16:54 UTC (permalink / raw)
  To: Gavin Shan, kvmarm; +Cc: maz, will, shan.gavin, pbonzini

Hi Gavin,

On 18/08/2020 02:13, Gavin Shan wrote:
> This renames user_mem_abort() to kvm_handle_user_mem_abort(), and
> then exports it. The function will be used in asynchronous page fault
> to populate a page table entry once the corresponding page is populated
> from the backup device (e.g. swap partition):
> 
>    * Parameter @fault_status is replace by @esr.
>    * Parameter @prefault is added
> 
> As the @esr is passed as parameter, not fetched from vCPU struct. This
> also introduces the necessasry helpers in esr.h, to manupulate the @esr.

(Nit: necessary, manipulate)


> The helpers defined in kvm_emulate.h reuses the newly added helper. This
> shouldn't cause functional changes.

user_mem_abort() is deep in the guts of KVM's arch code. I don't think this should be
exported. It must be called on the vcpu thread. It must be called under the VM's srcu
lock. There are also tricky interactions with whether the vcpu is loaded on this cpu or not...

I think it would be much simpler to always let the guest take the stage2-fault a second
time. This keeps the property that pages are only populated in response to a stage2 fault.
If the guest behaves, it will only schedule a task that accesses the page once it's available.


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/6] drivers/acpi: Import ACPI APF table
  2020-08-18  1:13 ` [PATCH 5/6] drivers/acpi: Import ACPI APF table Gavin Shan
@ 2020-10-23 16:55   ` James Morse
  0 siblings, 0 replies; 15+ messages in thread
From: James Morse @ 2020-10-23 16:55 UTC (permalink / raw)
  To: Gavin Shan, kvmarm; +Cc: maz, will, shan.gavin, pbonzini

Hi Gavin,

On 18/08/2020 02:13, Gavin Shan wrote:
> This defines the struct for ACPI APF table. The information included
> in this table will be used by guest kernel to retrieve SDEI event
> number, PPI number and its triggering properties:
> 
>    * @sdei_event: number of SDEI event used for page-not-present
>    * @interrupt:  PPI used for page-ready
>    * @interrupt_flags: PPI's mode and polarity
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  include/acpi/actbl2.h | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)


Whoa! This is not how changes to ACPI work!

We have to work out what is needed, then the standard has to be updated, then the upstream
acpica project, then these files get synced back across from the 'upstream' acpica project...

If you need an ACPI table, we need something for DT too. I don't think a well designed
paravirt interface shouldn't need this. Wasn't that the whole point of the KVM "vendor
specific" services?!

The cover-letter message talks of shared memory, which this doesn't describe.

Ideally this stuff would be discovered via SMCCC, or a (readonly?) page of that shared
memory. That way it's the same regardless of ACPI or DT.


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/6] arm64/kvm: Support async page fault
  2020-08-18  1:13 ` [PATCH 4/6] arm64/kvm: Support async page fault Gavin Shan
@ 2020-10-23 17:01   ` James Morse
  2020-11-02  7:19     ` Gavin Shan
  0 siblings, 1 reply; 15+ messages in thread
From: James Morse @ 2020-10-23 17:01 UTC (permalink / raw)
  To: Gavin Shan; +Cc: maz, shan.gavin, pbonzini, will, kvmarm

Hi Gavin,

[my mail client went a bit nuts - it thinks this got sent already, sorry if you received
it twice!]

I only got so far through this, so may have focussed on the wrong things.
This patch has too many things going on. Please split it up.

I think the page-fault and page-present should be done separately, with the gubbins the
core-code needs added first. The cap and corresponding documentation should be added last
so the whole thing bisects.

It looks like the guest and hypervisor parts are already split up, this is helpful,
thanks. Please ensure that the descriptions are clear on this distinction too. It's not at
all clear where the 'reschedule ipi' comes from in this commit message.

I'd really like it if as much of the guest parts as possible lived in common code. There
is nothing arch-specific about an interrupt and reading from shared memory.


On 18/08/2020 02:13, Gavin Shan wrote:
> There are two stages of page fault. The guest kernel is responsible
> for handling stage one page fault, while the host kernel is to take
> care of the stage two page fault. When page fault is triggered because
> of stage two page fault, the guest is suspended until the requested
> memory (page) is populated. Sometimes, the cost to populate the requested
> page isn't cheap and can take hundreds of milliseconds in extreme
> cases. This impacts the overall guest's performance.

Yes, it's the cost of abstracting the machine's resources and invisibly carving them up.

> In order to resolve the issue and improve the guest's performance,

Why is it an issue? You've missed the important part of the commit message: why we should
make this change.

This needs to mention postcopy live migration, which is the reason for doing any of this.

I don't think "I overcommited and swapped the guest to disk" is at all compelling.
(as the doctor says: don't do that then)


> this introduces the feature (asychronous page fault). A page-not-present

(asynchronous)

> notification is fired to guest if the requested page isn't ready so that
> the guest can reschedule current process and pick another process to run.
> Another page-ready notification is sent to guest so that the process,
> which was rescheduled previously, can be picked up to run. With this,

> the guest needn't necessarily suspend itself on stage two page fault.

Be suspended by the hypervisor until the resources are available.
Normally the guest has no idea this is happening.


> These two notificatons are associated by so-called vCPU scoped tokens,

(notifications)

> which is the combination of incremental sequence number and vCPU index
> (@vcpu->vcpu_idx). Besides, the notifications are conveyed by SDEI and
> PPI, whose numbers are specified by guest through SMCCC interface.

... so you don't need the ACPI table?

The 'asynchronous page fault' notification obviously needs to be (er,) synchronous, so
it has to be per-cpu.

Why is the page-ready IRQ a PPI? We have a limited number of these, so shouldn't use one
unless it _needs_ to be per-cpu.

'vcpu index' should probably be the MPIDR_EL1, which is something both the host and guest
already know that identifies the vcpu. Inventing something new isn't necessary.


> The asynchronous page fault always starts with a background woker if

(worker. Please run your commit-messages through a spell checker)


> current vCPU has enough token space and no pending notifications. After
> the worker, which populates the requested page background, is started,

what does 'requested page background' mean? (missing the words 'in the'?)


> a page-not-present notification, together with an unique token, are sent
> to guest through the specified private SDEI event. The notification is
> acknowledged by clearing the shared cache (@vcpu->apf.cache). The current
> process is marked for waiting completion of asynchronous page fault. A
> subsequent (reschedule) IPI is sent to current vCPU so that the current
> process can be reschedule and suspended until the asynchronous page is
> completed. In the meanwhile, the woker populates the requested page and
> it's queued to the completion queue once the task is completed.

> At this
> point or the acknoledgement (by SMCCC) on pending PPI, a PPI is sent to
> guest for page-ready notification.

A PPI is made pending, when the PPI is pending?


> The guest uses the associated token
> to locate the suspended process and wakes it up.

I hope the 'token' is nothing more than the IPA.


> In order to fulfil the functions, there are several SMCCC function IDs
> introduced and outlined as below:

Please add a documentation file for these for the benefit of other guests (e.g.
kvm-unit-test) describing these in more detail, their parameters, the expected lifecycle etc.


>    ARM_SMCCC_KVM_FUNC_APF_VERSION
>       Returns the version, which can be used to identify ABI changes in
>       the future.
>    ARM_SMCCC_KVM_FUNC_APF_ENABLE
>       Enable or disable asynchronous page fault on current vCPU.
>    ARM_SMCCC_KVM_FUNC_APF_TOKEN_NUM
>       Returns maximal number of tokens that current vCPU can have. It's
>       used by guest to allocate the required resources.
>    ARM_SMCCC_KVM_FUNC_APF_SDEI
>    ARM_SMCCC_KVM_FUNC_APF_IRQ
>       Used by guest to confiugre the SDEI event and PPI numbers.
>    ARM_SMCCC_KVM_FUNC_APF_IRQ_ACK
>       Acknowledge the page-ready notificaton in the PPI handler.


> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index b6c9851b2a65..2e3bba6a99c3 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -46,6 +46,7 @@
>  #define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
>  #define KVM_REQ_RELOAD_GICv4	KVM_ARCH_REQ(4)
>  #define KVM_REQ_SDEI		KVM_ARCH_REQ(5)
> +#define KVM_REQ_APF_READY	KVM_ARCH_REQ(6)
>
>  #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>  				     KVM_DIRTY_LOG_INITIALLY_SET)
> @@ -239,6 +240,23 @@ struct vcpu_reset_state {
>  	bool		reset;
>  };
>
> +#ifdef CONFIG_KVM_ASYNC_PF
> +
> +/* Should be a power of two number */
> +#define ASYNC_PF_PER_VCPU	64
> +
> +/*
> + * The association of gfn and token. The token will be sent to guest as
> + * page fault address. Also, the guest could be in aarch32 mode. So its
> + * length should be 32-bits.
> + */
> +struct kvm_arch_async_pf {
> +	u32     token;
> +	gfn_t   gfn;
> +	u32     esr;
> +};
> +#endif /* CONFIG_KVM_ASYNC_PF */


#ifdef-ing a struct makes it harder to use IS_ENABLED() in the C code, which in turn makes
it harder to get complete coverage from the CI builders without aligning the stars to hit
the right randconfig combination.

Please avoid doing this unless it's absolutely necessary. (it's not like a struct definition
takes up any space in the binary)


>  struct kvm_vcpu_arch {
>  	struct kvm_cpu_context ctxt;
>  	void *sve_state;
> @@ -346,6 +364,22 @@ struct kvm_vcpu_arch {
>  	struct kvm_sdei_vcpu_event	*sdei_normal_event;
>  	struct kvm_sdei_vcpu_event	*sdei_critical_event;
>  	struct list_head		sdei_events;
> +
> +#ifdef CONFIG_KVM_ASYNC_PF

See above. Any reason these can't be dynamically allocated if the feature is enabled by
the VMM? (You're using a PPI, this has to be allocated by the VMM).

Not all guests would use this, so it's a waste of space if it's enabled but not in use.


> +	struct {
> +		struct gfn_to_hva_cache	cache;

> +		gfn_t			gfns[ASYNC_PF_PER_VCPU];
> +		u64			control_block;

> +		bool			send_user_only;

Hmmm...


> +		u64			sdei_event_num;
> +		u32			irq;

> +		u16			id;

a per-vcpu id? KVM already has one, and the architecture has one that the guest can see too.


> +		bool			notpresent_pending;
> +		u32			notpresent_token;
> +		bool			pageready_pending;
> +	} apf;
> +#endif
>  };
>
>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */


> diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
> new file mode 100644
> index 000000000000..0ea481dd1c7a
> --- /dev/null
> +++ b/arch/arm64/include/asm/kvm_para.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_ARM_KVM_PARA_H
> +#define _ASM_ARM_KVM_PARA_H

> +#include <uapi/asm/kvm_para.h>

> +static inline bool kvm_check_and_clear_guest_paused(void)
> +{
> +	return false;
> +}

This is part of kvmclock? aarch64 has an architecture specific way of doing that stuff.
Why is this here?!

(if you've turned something on that needs these stubs, it's probably worth doing that in a
preparation patch so you can explain why, instead of hiding it in here)


> +static inline unsigned int kvm_arch_para_features(void)
> +{
> +	return 0;
> +}
> +
> +static inline unsigned int kvm_arch_para_hints(void)
> +{
> +	return 0;
> +}
> +
> +static inline bool kvm_para_available(void)
> +{
> +	return false;
> +}
> +
> +#endif /* _ASM_ARM_KVM_PARA_H */



> diff --git a/arch/arm64/include/uapi/asm/Kbuild b/arch/arm64/include/uapi/asm/Kbuild
> index 602d137932dc..f66554cd5c45 100644
> --- a/arch/arm64/include/uapi/asm/Kbuild
> +++ b/arch/arm64/include/uapi/asm/Kbuild
> @@ -1,3 +1 @@
>  # SPDX-License-Identifier: GPL-2.0
> -
> -generic-y += kvm_para.h

The arch specific version is practically the same as the asm-generic one.

Please explain why you need to do this in a preparatory patch.


> diff --git a/arch/arm64/include/uapi/asm/kvm_para.h b/arch/arm64/include/uapi/asm/kvm_para.h
> new file mode 100644
> index 000000000000..f0dbe86c2374
> --- /dev/null
> +++ b/arch/arm64/include/uapi/asm/kvm_para.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_ASM_ARM_KVM_PARA_H
> +#define _UAPI_ASM_ARM_KVM_PARA_H
> +
> +#include <linux/types.h>
> +
> +#define KVM_FEATURE_ASYNC_PF	0
> +
> +/* Async PF */
> +#define KVM_ASYNC_PF_ENABLED		(1 << 0)
> +#define KVM_ASYNC_PF_SEND_ALWAYS	(1 << 1)
> +
> +#define KVM_PV_REASON_PAGE_NOT_PRESENT	1
> +#define KVM_PV_REASON_PAGE_READY	2
> +
> +struct kvm_vcpu_pv_apf_data {
> +	__u32	reason;
> +	__u32	token;
> +	__u8	pad[56];
> +	__u32	enabled;
> +};
> +
> +#endif /* _UAPI_ASM_ARM_KVM_PARA_H */

Presumably this is the reason to carry around an extra copy of kvm_para.h.

Wouldn't it be better to unpick that dependency instead?


> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 4bec6c9ece18..1a57dc5dbf14 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -225,6 +225,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  		 */
>  		r = 1;
>  		break;
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	case KVM_CAP_ASYNC_PF:
> +	case KVM_CAP_ASYNC_PF_INT:
> +		r = 1;
> +		break;
> +#endif
>  	default:
>  		r = kvm_arch_vm_ioctl_check_extension(kvm, ext);
>  		break;
> @@ -274,6 +280,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
>  	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> +#ifdef CONFIG_KVM_ASYNC_PF
> +	vcpu->arch.apf.control_block = 0UL;
> +#endif

kvm_vm_ioctl_create_vcpu() allocates vcpu with kmem_cache_zalloc(), so this field will
already be zero.


> @@ -432,8 +441,32 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>  int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>  {
>  	bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF);
> -	return ((irq_lines || kvm_vgic_vcpu_pending_irq(v))
> -		&& !v->arch.power_off && !v->arch.pause);

As you are adding a conditional __copy_from_user() in here, please add an unconditional
might_sleep() at the top of this function. This lets us catch locations that will
sometimes break the first time they happen.


> +	if ((irq_lines || kvm_vgic_vcpu_pending_irq(v)) &&
> +	    !v->arch.power_off && !v->arch.pause)
> +		return true;
> +
> +#ifdef CONFIG_KVM_ASYNC_PF

Please use IS_ENABLED() here, #ifdeffery makes this stuff harder to read and harder to
build-test.

(if you can come up with a readable IS_ENABLED() way of returning the _check_extension()
result - even better!)


> +	if (v->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
> +		u32 val;
> +		int ret;
> +
> +		if (!list_empty_careful(&v->async_pf.done))
> +			return true;

> +		ret = kvm_read_guest_offset_cached(v->kvm, &v->arch.apf.cache,
> +						   &val, 0, sizeof(val));

You added a helper for this: kvm_async_pf_read_cache()...


> +		if (ret || val)
> +			return true;
> +
> +		ret = kvm_read_guest_offset_cached(v->kvm, &v->arch.apf.cache,
> +						   &val, 4, sizeof(val));
> +		if (ret || val)
> +			return true;

These 0 and 4 really need some kind of name to give a hint as to what they are for.


> +	}

This really ought to be wrapped up in a helper that gives a hint as to what it is doing.

There are two magic locations in guest memory that mean the guest controls whether it is
runnable or not... this sounds like a source of hard-to-debug issues.

I'd much prefer it if the guest only ever got blocked in the stage2 fault handler, as it
does today. This mechanism just provides information about what is going on at stage2,
which allows the guest to avoid getting blocked if it can do something about it.

Giving the guest access to the controls doesn't look right.


> +#endif
> +
> +	return false;
>  }

>  bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
> @@ -624,6 +657,11 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
>  		if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu))
>  			kvm_reset_vcpu(vcpu);
>
> +#ifdef CONFIG_KVM_ASYNC_PF
> +		if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
> +			kvm_check_async_pf_completion(vcpu);
> +#endif

(IS_ENABLED() would save you the link error ... providing a stub in core-code for
kvm_check_async_pf_completion() would let you rely on this request never being made instead)


>  		if (kvm_check_request(KVM_REQ_SDEI, vcpu))
>  			kvm_sdei_deliver(vcpu);
>
> @@ -739,7 +777,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  		smp_store_mb(vcpu->mode, IN_GUEST_MODE);
>
>  		if (ret <= 0 || need_new_vmid_gen(&vcpu->kvm->arch.vmid) ||
> -		    kvm_request_pending(vcpu)) {

> +		    (kvm_request_pending(vcpu) &&
> +		     READ_ONCE(vcpu->requests) != (1UL << KVM_REQ_APF_READY))) {

A double READ_ONCE() looks wrong. If you have to do this, presumably you didn't want a
vcpu->request to begin with.

It looks like KVM_REQ_APF_READY only exists to move the page-ready work onto the vcpu
thread because the vcpu got blocked further out by kvm_arch_vcpu_runnable(), and the
second half of the stage2-fault now needs doing.


I think it would be simpler to only ever block in the stage2 fault handler. The guest can
only ever be blocked on one stage2 fault at a time, so this abstraction to let it sleep
for any number is pointless. If I'm waiting for ipa-4, and get restarted because ipa-6 is
now available ... I'm going to fault on ipa-4 for a second time. There is no point
restarting the guest until it can make progress again.


From the first stage2 fault you can kick off the page-fault notification (which must be
synchronous, so can be notified without taking this fault again). If the vcpu takes a
fault that was already notified, it clearly can't handle the page-fault notification and
must be blocked until that page is available.

We should be able to depend on the regular wakeup from mm when the vcpu is blocked.
The extra information from page-present lets the guest know it can now touch these pages
without getting blocked ... they shouldn't need a wakeup.


>  			vcpu->mode = OUTSIDE_GUEST_MODE;
>  			isb(); /* Ensure work in x_flush_hwstate is committed */
>  			kvm_pmu_sync_hwstate(vcpu);


> diff --git a/arch/arm64/kvm/async_pf.c b/arch/arm64/kvm/async_pf.c
> new file mode 100644
> index 000000000000..910cb3a1bdc2
> --- /dev/null
> +++ b/arch/arm64/kvm/async_pf.c
> @@ -0,0 +1,462 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright Gavin Shan, Redhat Inc 2020.
> + *
> + * Asynchronous page fault support
> + */
> +
> +#include <linux/arm-smccc.h>
> +#include <linux/kvm_host.h>
> +#include <asm/kvm_emulate.h>
> +#include <kvm/arm_hypercalls.h>
> +#include <kvm/arm_vgic.h>
> +#include <asm/kvm_sdei.h>

> +static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
> +{
> +	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
> +}
> +
> +static inline u32 kvm_async_pf_hash_next(u32 key)
> +{
> +	return (key + 1) & (ASYNC_PF_PER_VCPU - 1);
> +}
> +
> +static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
> +{
> +	int i;
> +
> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++)
> +		vcpu->arch.apf.gfns[i] = ~0;
> +}

> +/*
> + * Add gfn to the hash table. It's ensured there is a free entry
> + * when this function is called.
> + */
> +static void kvm_async_pf_hash_add(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_async_pf_hash_fn(gfn);
> +
> +	while (vcpu->arch.apf.gfns[key] != ~0)
> +		key = kvm_async_pf_hash_next(key);
> +
> +	vcpu->arch.apf.gfns[key] = gfn;
> +}
> +
> +static u32 kvm_async_pf_hash_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_async_pf_hash_fn(gfn);
> +	int i;
> +
> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++) {
> +		if (vcpu->arch.apf.gfns[key] == gfn ||
> +		    vcpu->arch.apf.gfns[key] == ~0)
> +			break;
> +
> +		key = kvm_async_pf_hash_next(key);
> +	}
> +
> +	return key;
> +}
> +
> +static void kvm_async_pf_hash_remove(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 i, j, k;
> +
> +	i = j = kvm_async_pf_hash_slot(vcpu, gfn);
> +	while (true) {
> +		vcpu->arch.apf.gfns[i] = ~0;
> +
> +		do {
> +			j = kvm_async_pf_hash_next(j);
> +			if (vcpu->arch.apf.gfns[j] == ~0)
> +				return;
> +
> +			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
> +			/*
> +			 * k lies cyclically in ]i,j]
> +			 * |    i.k.j |
> +			 * |....j i.k.| or  |.k..j i...|
> +			 */
> +		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
> +
> +		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
> +		i = j;
> +	}
> +}
> +
> +bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_async_pf_hash_slot(vcpu, gfn);
> +
> +	return vcpu->arch.apf.gfns[key] == gfn;
> +}
> +
> +static inline int kvm_async_pf_read_cache(struct kvm_vcpu *vcpu,
> +					  u32 offset, u32 *val)
> +{
> +	return kvm_read_guest_offset_cached(vcpu->kvm,
> +					    &vcpu->arch.apf.cache,
> +					    val, offset, sizeof(*val));
> +}
> +
> +static inline int kvm_async_pf_write_cache(struct kvm_vcpu *vcpu,
> +					   u32 offset, u32 val)
> +{
> +	return kvm_write_guest_offset_cached(vcpu->kvm,
> +					     &vcpu->arch.apf.cache,
> +					     &val, offset, sizeof(val));
> +}


This hash stuff doesn't look at all specific to arm64. Could it be moved to core code or a
shared header file so multiple architectures only need one copy of the code?
If they need to be different, it would be good to know why.




Thanks,

James

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/6] Support Asynchronous Page Fault
  2020-10-23 16:54 ` [PATCH 0/6] Support Asynchronous Page Fault James Morse
@ 2020-10-31 14:18   ` Paolo Bonzini
  2020-11-02  5:23   ` Gavin Shan
  1 sibling, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2020-10-31 14:18 UTC (permalink / raw)
  To: James Morse, Gavin Shan, kvmarm; +Cc: maz, will, shan.gavin

On 23/10/20 18:54, James Morse wrote:
> SDEI gives you an NMI ... which you use to set a TIF flag. This can only work reliably for
> user-space. So much so that you have code in the hypervisor to only deliver the NMI ...
> when in user-space.
> The only reason you would need an NMI is to interrupt interrupts-masked code. Linux can't
> reschedule when this is the case.
> 
> I can only conclude, you really don't need an NMI here.

I don't think the issue is that you want an NMI.  It is just that the
synchronous interruption that we want is exactly the same as a SDEI, and
so is the notification reply from the guest to the host (e.g. accept the
async pagefault or process it synchronously).

Yes, it's more code, but at least in x86 world we tried hard not to
invent new paravirtualized mechanisms if we could avoid it, especially
in the host->guest direction, and it's almost always paid off.  This is
because in case we don't get things right, it's much much harder to fix
them on both the hypervisor and the guest side; by relying on existing
code that work on bare metal hardware, the guest side exists already and
you can develop against it.

Paolo

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/6] Support Asynchronous Page Fault
  2020-10-23 16:54 ` [PATCH 0/6] Support Asynchronous Page Fault James Morse
  2020-10-31 14:18   ` Paolo Bonzini
@ 2020-11-02  5:23   ` Gavin Shan
  1 sibling, 0 replies; 15+ messages in thread
From: Gavin Shan @ 2020-11-02  5:23 UTC (permalink / raw)
  To: James Morse, kvmarm; +Cc: maz, will, shan.gavin, pbonzini

Hi James,

On 10/24/20 3:54 AM, James Morse wrote:
> I think this series would benefit from being in smaller pieces. I got lost in patch 4 for
> quite a while. Suggestion on where to do that in patch 4.
> 

Yes, I will split the patches into smaller ones for easier review
in the next revision. Thanks for your comments :)

> 
> On 18/08/2020 02:13, Gavin Shan wrote:
>> There are two stages of page fault. The guest kernel is responsible
>> for handling stage one page fault, while the host kernel is to take
>> care of the stage two page fault. When page fault is triggered because
>> of stage two page fault, the guest is suspended until the requested
>> memory (page) is populated. Sometimes, the cost to populate the requested
>> page isn't cheap and can take hundreds of milliseconds in extreme
>> cases. This impacts the overall guest's performance.
> 
> You really need to use postcopy live migration to justify these changes. Otherwise the
> story here is "over-commited hosts suck", which I don't think anyone cares about.
> 

Yes, I will use live migration as the justification in the next revision :)

> 
>> This series introduces the feature (asynchronous page fault) to resolve
>> the issue and improve the guest's performance. It depends on the series
>> to support SDEI virtualization and refactoring SDEI client driver.
> 
> SDEI gives you an NMI ... which you use to set a TIF flag. This can only work reliably for
> user-space. So much so that you have code in the hypervisor to only deliver the NMI ...
> when in user-space.
> The only reason you would need an NMI is to interrupt interrupts-masked code. Linux can't
> reschedule when this is the case.
> 
> I can only conclude, you really don't need an NMI here.
> 
> 
> Why couldn't we use an IRQ here, it would be a lot simpler? ... the reason is the arm
> architecture can't guarantee us that we take the irq when there is also a stage2 fault for
> the first instruction.
> I reckon we can work around this in the hypervisor:
> https://lore.kernel.org/r/20201023165108.15061-1-james.morse@arm.com
> 
> 
> My problem with SDEI is, you don't really need an NMI, and it creates extra in-kernel
> state that has to be migrated. I think having this state in the kernel complicates the
> user-space handling of SIGBUS_MCEERR_AO signals that don't get delivered to a vCPU thread.
> 
> 

Currently, the asynchronous page fault is only supported for memory accesses from the
guest's userspace, but we needn't stick to that use model in the future. It means the
asynchronous page fault could also be supported for memory accesses in the guest's
kernel space, where interrupts can be disabled or masked. So an NMI is needed, and
SDEI fits the use model very well, as Paolo replied in another thread.

About the feature to support SDEI virtualization, I thought there might be some
use cases where emulated devices need to inject SDEI events into the guest. However,
I'm not too familiar with the architecture yet. If it's required by the emulated
devices, there is more justification to merge the code. However, the implementation
itself isn't simple; I would say it's complicated.

Thanks,
Gavin

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/6] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  2020-10-23 16:54   ` James Morse
@ 2020-11-02  5:30     ` Gavin Shan
  0 siblings, 0 replies; 15+ messages in thread
From: Gavin Shan @ 2020-11-02  5:30 UTC (permalink / raw)
  To: James Morse, kvmarm; +Cc: maz, will, shan.gavin, pbonzini

Hi James,

On 10/24/20 3:54 AM, James Morse wrote:
> On 18/08/2020 02:13, Gavin Shan wrote:
>> This renames user_mem_abort() to kvm_handle_user_mem_abort(), and
>> then exports it. The function will be used in asynchronous page fault
>> to populate a page table entry once the corresponding page is populated
>> from the backup device (e.g. swap partition):
>>
>>     * Parameter @fault_status is replace by @esr.
>>     * Parameter @prefault is added
>>
>> As the @esr is passed as parameter, not fetched from vCPU struct. This
>> also introduces the necessasry helpers in esr.h, to manupulate the @esr.
> 
> (Nit: necessary, manipulate)
> 

Thanks for your comments. It will be fixed in the next revision :)

> 
>> The helpers defined in kvm_emulate.h reuses the newly added helper. This
>> shouldn't cause functional changes.
> 
> user_mem_abort() is deep in the the guts of KVM's arch code. I don't think this should be
> exported. It  must be called on the vcpu thread. It must be called under the VMs srcu
> lock. There are also tricky interactions with whether the vcpu is loaded on this cpu or not...
> 
> I think it would be much simpler to always let the guest take the stage2-fault a second
> time. This keeps the property that pages are only populate in response to a stage2 fault.
> If the guest behaves, it will only schedule a task that accesses the page once its available.
> 
> 

The function is called with the SRCU lock held in this patchset. However, it seems
more reasonable for it to be called on the vCPU thread. On the other hand, I'm worried
about performance degradation. I will evaluate it in the next revision and have the
function called on the vCPU thread if the performance is acceptable. Otherwise, I
would like to leave it as it is :)

Thanks,
Gavin

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/6] arm64/kvm: Support async page fault
  2020-10-23 17:01   ` James Morse
@ 2020-11-02  7:19     ` Gavin Shan
  0 siblings, 0 replies; 15+ messages in thread
From: Gavin Shan @ 2020-11-02  7:19 UTC (permalink / raw)
  To: James Morse; +Cc: maz, shan.gavin, pbonzini, will, kvmarm

Hi James,

On 10/24/20 4:01 AM, James Morse wrote:
> [my mail client went a bit nuts - it thinks this got sent already, sorry if you received
> it twice!]
> 

Thanks for your time on this. No problem at all, and sorry for the late
response :)

> I only got so far through this, so may have focussed on the wrong things.
> This patch has too many things going on. Please split it up.
> 
> I think the page-fault and page-present should be done separately, with the gubbins the
> core-code needs added first. The cap and corresponding documentation should be added last
> so the whole thing bisects.
> 
> It looks like the guest and hypervisor parts are already split up, this is helpful,
> thanks. Please ensure that the descriptions are clear on this distinction too. Its not at
> all clear where where the 'reschedule ipi' comes from in this commit message.
> 
> I'd really like it if as much of the guest parts as possible lived in common code. There
> is nothing arch-specific about an interrupt and reading from shared memory.
> 

Yes, I will split this patch into at least 3 patches in the next revision:
[1] support page-not-present notification
[2] support page-ready notification
[3] control/configuration path support

Sure, I will improve the change log to make things more understandable.

I need to evaluate the possibility of making the guest parts common. I guess
it might be hard. Currently, there are 3 architectures that support or will
support the feature: x86/s390/arm64. s390 has a unique implementation. Even
though x86/arm64 share common logic, they still differ to some extent. For
example, x86 uses NMI/IRQ for the notifications, but arm64 uses SDEI/IRQ for
them. The resources for SDEI have to be pre-allocated on arm64, but x86
doesn't have that requirement.

> 
> On 18/08/2020 02:13, Gavin Shan wrote:
>> There are two stages of page fault. The guest kernel is responsible
>> for handling stage one page fault, while the host kernel is to take
>> care of the stage two page fault. When page fault is triggered because
>> of stage two page fault, the guest is suspended until the requested
>> memory (page) is populated. Sometimes, the cost to populate the requested
>> page isn't cheap and can take hundreds of milliseconds in extreme
>> cases. This impacts the overall guest's performance.
> 
> Yes, its the cost of abstracting the machine's resources and invisibly carving them up.
> 
>> In order to resolve the issue and improve the guest's performance,
> 
> Why is it an issue? You've missed the important part of the commit message: why we should
> make this change.
> 
> This needs to mention postcopy live migration, which is the reason for doing any of this.
> 
> I don't think "I overcommited and swapped the guest to disk" is at all compelling.
> (as the doctor says: don't do that then)
> 

Yes, live migration should be the justification in the next revision. Thanks
for the reminder :)

> 
>> this introduces the feature (asychronous page fault). A page-not-present
> 
> (asynchronous)
> 
>> notification is fired to guest if the requested page isn't ready so that
>> the guest can reschedule current process and pick another process to run.
>> Another page-ready notification is sent to guest so that the process,
>> which was rescheduled previously, can be picked up to run. With this,
> 
>> the guest needn't necessarily suspend itself on stage two page fault.
> 
> Be suspended by the hypervisor until the resources are available.
> Normally the guest has no idea this is happening.
> 
> 
>> These two notificatons are associated by so-called vCPU scoped tokens,
> 
> (notifications)
> 
>> which is the combination of incremental sequence number and vCPU index
>> (@vcpu->vcpu_idx). Besides, the notifications are conveyed by SDEI and
>> PPI, whose numbers are specified by guest through SMCCC interface.
> 
> ... so you don't need the ACPI table?
> 
> The 'asynchronous page fault' notification obviously needs to be (er,) synchronous, so
> it has to be per-cpu.
> 
> Why is the page-ready IRQ a PPI? We have a limited number of these, so shouldn't use one
> unless it _needs_ to be per-cpu.
> 
> 'vcpu index' should probably be the MPIDR_EL1, which is something both the host and guest
> already know that identifies the vcpu. Inventing something new isn't necessary.
> 

The ACPI table is needed to pass the PPI number from qemu to the guest. The PPI
number is determined by qemu. The reason why we need a PPI is that the
asynchronous page fault notification (page-ready) is delivered on a per-vCPU
basis.

The 'vCPU index' was inherited from the x86 implementation. Actually, the
information isn't used because the SDEI event and PPI are always sent to the
target vCPU. It means the incremental sequence number is enough to identify the
associated asynchronous page fault events. So we don't need the vCPU index
conveyed as part of the token, and I will drop it in the next revision.
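
For example, something like below (just a sketch, not the exact code for the
next revision; @id is the per-vCPU sequence number added by this patch) should
be enough to generate the token once the vCPU index is gone:

	/*
	 * Sketch only: derive the 32-bit token from the per-vCPU sequence
	 * number alone. Zero is kept reserved so it can still mean "no
	 * pending notification" in the shared data.
	 */
	static u32 kvm_async_pf_next_token(struct kvm_vcpu *vcpu)
	{
		if (++vcpu->arch.apf.id == 0)
			vcpu->arch.apf.id = 1;

		return vcpu->arch.apf.id;
	}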

> 
>> The asynchronous page fault always starts with a background woker if
> 
> (worker. Please run your commit-messages through a spell checker)
> 

Sure, thanks.

> 
>> current vCPU has enough token space and no pending notifications. After
>> the worker, which populates the requested page background, is started,
> 
> what does 'requested page background' mean? (missing the words 'in the'?)
> 

Yeah, "in the" is missed and will be fixed if needed in next revision.

> 
>> a page-not-present notification, together with an unique token, are sent
>> to guest through the specified private SDEI event. The notification is
>> acknowledged by clearing the shared cache (@vcpu->apf.cache). The current
>> process is marked for waiting completion of asynchronous page fault. A
>> subsequent (reschedule) IPI is sent to current vCPU so that the current
>> process can be reschedule and suspended until the asynchronous page is
>> completed. In the meanwhile, the woker populates the requested page and
>> it's queued to the completion queue once the task is completed.
> 
>> At this
>> point or the acknoledgement (by SMCCC) on pending PPI, a PPI is sent to
>> guest for page-ready notification.
> 
> A PPI is made pending, when the PPI is pending?
> 

Another PPI can't be fired from the host to the guest before the previous one
has been delivered and handled.
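
In other words, the host-side injection is gated on @pageready_pending,
roughly like below (a sketch, not the exact code in this patch):

	/*
	 * Sketch: only inject the page-ready PPI when the previous one has
	 * been acknowledged by the guest via ARM_SMCCC_KVM_FUNC_APF_IRQ_ACK,
	 * which clears @pageready_pending.
	 */
	if (!vcpu->arch.apf.pageready_pending) {
		vcpu->arch.apf.pageready_pending = true;
		kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id,
				    vcpu->arch.apf.irq, true, NULL);
	}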

> 
>> The guest uses the associated token
>> to locate the suspended process and wakes it up.
> 
> I hope the 'token' is nothing more than the IPA.
> 

Correct. The IPA could be used as the token as well, except that the IPA is
64 bits long while the token is only 32 bits.

> 
>> In order to fulfil the functions, there are several SMCCC function IDs
>> introduced and outlined as below:
> 
> Please add a documentation file for these for the benefit of other guests (e.g.
> kvm-unit-test) describing these in more detail, their parameters, the expected lifecycle etc.
> 
> 
>>     ARM_SMCCC_KVM_FUNC_APF_VERSION
>>        Returns the version, which can be used to identify ABI changes in
>>        the future.
>>     ARM_SMCCC_KVM_FUNC_APF_ENABLE
>>        Enable or disable asynchronous page fault on current vCPU.
>>     ARM_SMCCC_KVM_FUNC_APF_TOKEN_NUM
>>        Returns maximal number of tokens that current vCPU can have. It's
>>        used by guest to allocate the required resources.
>>     ARM_SMCCC_KVM_FUNC_APF_SDEI
>>     ARM_SMCCC_KVM_FUNC_APF_IRQ
>>        Used by guest to confiugre the SDEI event and PPI numbers.
>>     ARM_SMCCC_KVM_FUNC_APF_IRQ_ACK
>>        Acknowledge the page-ready notificaton in the PPI handler.
> 
> 
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index b6c9851b2a65..2e3bba6a99c3 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -46,6 +46,7 @@
>>   #define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
>>   #define KVM_REQ_RELOAD_GICv4	KVM_ARCH_REQ(4)
>>   #define KVM_REQ_SDEI		KVM_ARCH_REQ(5)
>> +#define KVM_REQ_APF_READY	KVM_ARCH_REQ(6)
>>
>>   #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>>   				     KVM_DIRTY_LOG_INITIALLY_SET)
>> @@ -239,6 +240,23 @@ struct vcpu_reset_state {
>>   	bool		reset;
>>   };
>>
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +
>> +/* Should be a power of two number */
>> +#define ASYNC_PF_PER_VCPU	64
>> +
>> +/*
>> + * The association of gfn and token. The token will be sent to guest as
>> + * page fault address. Also, the guest could be in aarch32 mode. So its
>> + * length should be 32-bits.
>> + */
>> +struct kvm_arch_async_pf {
>> +	u32     token;
>> +	gfn_t   gfn;
>> +	u32     esr;
>> +};
>> +#endif /* CONFIG_KVM_ASYNC_PF */
> 
> 
> #ifdef-ing a struct makes it harder to use IS_ENABLED() in the C code, which in turn makes
> it harder to get complete coverage from the CI builders without aligning the stars to hit
> the right randconfig combination.
> 
> Please avoid doing this unless its absolutely necessary. (its not like a struct definition
> takes up any space in the binary)
> 

Ok. The #ifdef will be dropped in the next revision.

> 
>>   struct kvm_vcpu_arch {
>>   	struct kvm_cpu_context ctxt;
>>   	void *sve_state;
>> @@ -346,6 +364,22 @@ struct kvm_vcpu_arch {
>>   	struct kvm_sdei_vcpu_event	*sdei_normal_event;
>>   	struct kvm_sdei_vcpu_event	*sdei_critical_event;
>>   	struct list_head		sdei_events;
>> +
>> +#ifdef CONFIG_KVM_ASYNC_PF
> 
> See above. Any reason these can't be dynamically allocated if the feature is enabled by
> the VMM? (You're using a PPI, this has to be allocated by the VMM).
> 
> Not all guests would use this, so its a waste of space if its enabled but not in use.
> 

Good idea. It will be dynamically allocated in the next revision :)

> 
>> +	struct {
>> +		struct gfn_to_hva_cache	cache;
> 
>> +		gfn_t			gfns[ASYNC_PF_PER_VCPU];
>> +		u64			control_block;
> 
>> +		bool			send_user_only;
> 
> Hmmm...
> 
> 
>> +		u64			sdei_event_num;
>> +		u32			irq;
> 
>> +		u16			id;
> 
> a per-vcpu id? KVM already has one, and the architecture has one that the guest can see too.
> 

It's different from that one. It's the incremental sequence number associated
with asynchronous page fault events.

> 
>> +		bool			notpresent_pending;
>> +		u32			notpresent_token;
>> +		bool			pageready_pending;
>> +	} apf;
>> +#endif
>>   };
>>
>>   /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
> 
> 
>> diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
>> new file mode 100644
>> index 000000000000..0ea481dd1c7a
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/kvm_para.h
>> @@ -0,0 +1,27 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _ASM_ARM_KVM_PARA_H
>> +#define _ASM_ARM_KVM_PARA_H
> 
>> +#include <uapi/asm/kvm_para.h>
> 
>> +static inline bool kvm_check_and_clear_guest_paused(void)
>> +{
>> +	return false;
>> +}
> 
> This is part of kvmclock? aarch64 has an architecture specific way of doing that stuff.
> Why is this here?!
> 
> (if you've turned something on that needs these stubs, it's probably worth doing that in a
> preparation patch so you can explain why, instead of hiding it in here)
> 

It's used by the guest to retrieve the supported features through SMCCC
(the KVM-specific feature call). Yep, it should be included in a separate
patch, as part of the preparatory work to support asynchronous page fault
on the guest side.
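
Roughly, the guest-side kvm_arch_para_features() would then look like below.
This is just a sketch: it assumes a service-probe helper such as
kvm_arm_hyp_service_available() from patch 1 of the series, and
KVM_FEATURE_ASYNC_PF / ARM_SMCCC_KVM_FUNC_APF_VERSION come from this patch:

	/*
	 * Sketch only: report the asynchronous page fault capability to the
	 * generic kvm_para code, based on the hypervisor service bitmap.
	 */
	static inline unsigned int kvm_arch_para_features(void)
	{
		unsigned int features = 0;

		if (kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_APF_VERSION))
			features |= (1 << KVM_FEATURE_ASYNC_PF);

		return features;
	}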

> 
>> +static inline unsigned int kvm_arch_para_features(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline unsigned int kvm_arch_para_hints(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline bool kvm_para_available(void)
>> +{
>> +	return false;
>> +}
>> +
>> +#endif /* _ASM_ARM_KVM_PARA_H */
> 
> 
> 
>> diff --git a/arch/arm64/include/uapi/asm/Kbuild b/arch/arm64/include/uapi/asm/Kbuild
>> index 602d137932dc..f66554cd5c45 100644
>> --- a/arch/arm64/include/uapi/asm/Kbuild
>> +++ b/arch/arm64/include/uapi/asm/Kbuild
>> @@ -1,3 +1 @@
>>   # SPDX-License-Identifier: GPL-2.0
>> -
>> -generic-y += kvm_para.h
> 
> The arch specific version is practically the same as the asm-generic one.
> 
> Please explain why you need to do this in a preparatory patch.
> 

Sure, will do.

> 
>> diff --git a/arch/arm64/include/uapi/asm/kvm_para.h b/arch/arm64/include/uapi/asm/kvm_para.h
>> new file mode 100644
>> index 000000000000..f0dbe86c2374
>> --- /dev/null
>> +++ b/arch/arm64/include/uapi/asm/kvm_para.h
>> @@ -0,0 +1,23 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +#ifndef _UAPI_ASM_ARM_KVM_PARA_H
>> +#define _UAPI_ASM_ARM_KVM_PARA_H
>> +
>> +#include <linux/types.h>
>> +
>> +#define KVM_FEATURE_ASYNC_PF	0
>> +
>> +/* Async PF */
>> +#define KVM_ASYNC_PF_ENABLED		(1 << 0)
>> +#define KVM_ASYNC_PF_SEND_ALWAYS	(1 << 1)
>> +
>> +#define KVM_PV_REASON_PAGE_NOT_PRESENT	1
>> +#define KVM_PV_REASON_PAGE_READY	2
>> +
>> +struct kvm_vcpu_pv_apf_data {
>> +	__u32	reason;
>> +	__u32	token;
>> +	__u8	pad[56];
>> +	__u32	enabled;
>> +};
>> +
>> +#endif /* _UAPI_ASM_ARM_KVM_PARA_H */
> 
> Presumably this is the reason to carry around an extra copy of kvm_para.h.
> 
> Wouldn't it be better to unpick that dependency instead?
> 

Yes, it's the interface (or shared data struct) between the host and the guest.
Sorry, I don't fully understand your comment here. Do you mean making this
common across multiple architectures?

> 
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 4bec6c9ece18..1a57dc5dbf14 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -225,6 +225,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>   		 */
>>   		r = 1;
>>   		break;
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	case KVM_CAP_ASYNC_PF:
>> +	case KVM_CAP_ASYNC_PF_INT:
>> +		r = 1;
>> +		break;
>> +#endif
>>   	default:
>>   		r = kvm_arch_vm_ioctl_check_extension(kvm, ext);
>>   		break;
>> @@ -274,6 +280,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>>   	bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>>
>>   	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +	vcpu->arch.apf.control_block = 0UL;
>> +#endif
> 
> kvm_vm_ioctl_create_vcpu() allocates vcpu with kmem_cache_zalloc(), so this field will
> already be zero.
> 

Yes. The @apf struct will be dynamically allocated in the next revision, as you
suggested above :)

> 
>> @@ -432,8 +441,32 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>>   int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>>   {
>>   	bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF);
>> -	return ((irq_lines || kvm_vgic_vcpu_pending_irq(v))
>> -		&& !v->arch.power_off && !v->arch.pause);
> 
> As you are adding a conditional __copy_from_user() in here, please add an unconditional
> might_sleep() at the top of this function. This lets us catch locations that will
> sometimes break the first time they happen.
> 

Ok.

> 
>> +	if ((irq_lines || kvm_vgic_vcpu_pending_irq(v)) &&
>> +	    !v->arch.power_off && !v->arch.pause)
>> +		return true;
>> +
>> +#ifdef CONFIG_KVM_ASYNC_PF
> 
> Please use IS_ENABLED() here, #ifdeffery make this stuff harder to read and harder to
> build-test.
> 
> (if you can come up with a readable IS_ENABLED() way of returning the _check_extension()
> result - even better!)
> 

Ok.
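
Something like below should work (just a sketch, assuming the #ifdef around
the struct is dropped as discussed above):

	/*
	 * Sketch: use IS_ENABLED() instead of #ifdef so the block is always
	 * compile-tested and simply folded away when CONFIG_KVM_ASYNC_PF is
	 * disabled.
	 */
	if (IS_ENABLED(CONFIG_KVM_ASYNC_PF) &&
	    (v->arch.apf.control_block & KVM_ASYNC_PF_ENABLED)) {
		/* ... the existing checks ... */
	}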

> 
>> +	if (v->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
>> +		u32 val;
>> +		int ret;
>> +
>> +		if (!list_empty_careful(&v->async_pf.done))
>> +			return true;
> 
>> +		ret = kvm_read_guest_offset_cached(v->kvm, &v->arch.apf.cache,
>> +						   &val, 0, sizeof(val));
> 
> You added a helper for this: kvm_async_pf_read_cache()...
> 

Yep, the helper should be used here.

> 
>> +		if (ret || val)
>> +			return true;
>> +
>> +		ret = kvm_read_guest_offset_cached(v->kvm, &v->arch.apf.cache,
>> +						   &val, 4, sizeof(val));
>> +		if (ret || val)
>> +			return true;
> 
> These 0 and 4 really need some kind of name to give a hint as to what they are for.
> 

Yep, that would be better. I will probably use offsetof() in the next revision,
as sketched below.
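
Something along these lines (a sketch only, reusing struct kvm_vcpu_pv_apf_data
from this patch and the kvm_async_pf_read_cache() helper mentioned above):

	/*
	 * Sketch: name the offsets after the fields of
	 * struct kvm_vcpu_pv_apf_data instead of the bare 0 and 4.
	 */
	ret = kvm_async_pf_read_cache(v, offsetof(struct kvm_vcpu_pv_apf_data,
						  reason), &val);
	if (ret || val)
		return true;

	ret = kvm_async_pf_read_cache(v, offsetof(struct kvm_vcpu_pv_apf_data,
						  token), &val);
	if (ret || val)
		return true;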

> 
>> +	}
> 
> This really ought to be wrapped up in a helper that gives a hint as to what it is doing.
> 
> There are two magic locations in guest memory that means the guest controls whether it is
> runnable or not... this sounds like a source of hard-to-debug issues.
> 
> I'd much prefer it if the guest only ever got blocked in the stage2 fault handler, as it
> does today. This mechanism just provides information about what is going on at stage2,
> which allows the guest to avoid getting blocked if it can do something about it.
> 
> Giving the guest access to the controls doesn't look right.
> 

Yeah, it's a nice comment, and I think @notpresent_pending and @pageready_pending
in struct kvm_vcpu_arch::apf can serve the purpose instead.
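
i.e. the check in kvm_arch_vcpu_runnable() could rely on host-side state only,
roughly like this (a sketch of the alternative, not the code in this patch):

	/*
	 * Sketch: decide runnability from host-side bookkeeping only,
	 * avoiding the reads from the shared cache in guest memory.
	 */
	if (vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED) {
		if (!list_empty_careful(&vcpu->async_pf.done))
			return true;

		if (vcpu->arch.apf.notpresent_pending ||
		    vcpu->arch.apf.pageready_pending)
			return true;
	}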

> 
>> +#endif
>> +
>> +	return false;
>>   }
> 
>>   bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
>> @@ -624,6 +657,11 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
>>   		if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu))
>>   			kvm_reset_vcpu(vcpu);
>>
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +		if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
>> +			kvm_check_async_pf_completion(vcpu);
>> +#endif
> 
> (IS_ENABLED() would save you the link error ... providing a stub in core-code for
> kvm_check_async_pf_completion() would let you rely on this request never being made instead)
> 

Ok.

> 
>>   		if (kvm_check_request(KVM_REQ_SDEI, vcpu))
>>   			kvm_sdei_deliver(vcpu);
>>
>> @@ -739,7 +777,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>   		smp_store_mb(vcpu->mode, IN_GUEST_MODE);
>>
>>   		if (ret <= 0 || need_new_vmid_gen(&vcpu->kvm->arch.vmid) ||
>> -		    kvm_request_pending(vcpu)) {
> 
>> +		    (kvm_request_pending(vcpu) &&
>> +		     READ_ONCE(vcpu->requests) != (1UL << KVM_REQ_APF_READY))) {
> 
> A double READ_ONCE() looks wrong. If you have to do this, presumably you didn't want a
> vcpu->request to begin with.
> 
> It looks like KVM_REQ_APF_READY only exists to move the page-ready work onto the vcpu
> thread because the vcpu got blocked further out by kvm_arch_vcpu_runnable(), and the
> second half of the stage2-fault now needs doing.
> 
> 
> I think it would be simpler to only ever block in the stage2 fault handler. The guest can
> only ever be blocked on one stage2 fault at a time, so this abstraction to let it sleep
> for any number is pointless. If I'm waiting for ipa-4, and get restarted because ipa-6 is
> now available ... I'm going to fault on ipa-4 for a second time. There is no point
> restarting the guest until it can make progress again.
> 
> 
> From the first stage2 fault you can kick off the page-fault notification (which must be
> synchronous, so can be notified without taking this fault again). If the vcpu takes a
> fault that was already notified, it clearly can't handle the page-fault notification and
> must be blocked until that page is available.
> 
> We should be able to depend on the regular wakeup from mm when the vcpu is blocked.
> The extra information from page-present lets the guest know it can now touch these pages
> without getting blocked ... they shouldn't need a wakeup.
> 

I agree it would become simpler, but it's going to affect the performance
greatly, if I'm understanding your comments correctly. In the current design,
multiple workers associated with asynchronous page faults on one specific vCPU
can run in the background at the same time. It seems you're suggesting to allow
only one outstanding worker per vCPU. If so, the performance will be affected
greatly.

> 
>>   			vcpu->mode = OUTSIDE_GUEST_MODE;
>>   			isb(); /* Ensure work in x_flush_hwstate is committed */
>>   			kvm_pmu_sync_hwstate(vcpu);
> 
> 
>> diff --git a/arch/arm64/kvm/async_pf.c b/arch/arm64/kvm/async_pf.c
>> new file mode 100644
>> index 000000000000..910cb3a1bdc2
>> --- /dev/null
>> +++ b/arch/arm64/kvm/async_pf.c
>> @@ -0,0 +1,462 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright Gavin Shan, Redhat Inc 2020.
>> + *
>> + * Asynchronous page fault support
>> + */
>> +
>> +#include <linux/arm-smccc.h>
>> +#include <linux/kvm_host.h>
>> +#include <asm/kvm_emulate.h>
>> +#include <kvm/arm_hypercalls.h>
>> +#include <kvm/arm_vgic.h>
>> +#include <asm/kvm_sdei.h>
> 
>> +static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
>> +{
>> +	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
>> +}
>> +
>> +static inline u32 kvm_async_pf_hash_next(u32 key)
>> +{
>> +	return (key + 1) & (ASYNC_PF_PER_VCPU - 1);
>> +}
>> +
>> +static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++)
>> +		vcpu->arch.apf.gfns[i] = ~0;
>> +}
> 
>> +/*
>> + * Add gfn to the hash table. It's ensured there is a free entry
>> + * when this function is called.
>> + */
>> +static void kvm_async_pf_hash_add(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 key = kvm_async_pf_hash_fn(gfn);
>> +
>> +	while (vcpu->arch.apf.gfns[key] != ~0)
>> +		key = kvm_async_pf_hash_next(key);
>> +
>> +	vcpu->arch.apf.gfns[key] = gfn;
>> +}
>> +
>> +static u32 kvm_async_pf_hash_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 key = kvm_async_pf_hash_fn(gfn);
>> +	int i;
>> +
>> +	for (i = 0; i < ASYNC_PF_PER_VCPU; i++) {
>> +		if (vcpu->arch.apf.gfns[key] == gfn ||
>> +		    vcpu->arch.apf.gfns[key] == ~0)
>> +			break;
>> +
>> +		key = kvm_async_pf_hash_next(key);
>> +	}
>> +
>> +	return key;
>> +}
>> +
>> +static void kvm_async_pf_hash_remove(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 i, j, k;
>> +
>> +	i = j = kvm_async_pf_hash_slot(vcpu, gfn);
>> +	while (true) {
>> +		vcpu->arch.apf.gfns[i] = ~0;
>> +
>> +		do {
>> +			j = kvm_async_pf_hash_next(j);
>> +			if (vcpu->arch.apf.gfns[j] == ~0)
>> +				return;
>> +
>> +			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
>> +			/*
>> +			 * k lies cyclically in ]i,j]
>> +			 * |    i.k.j |
>> +			 * |....j i.k.| or  |.k..j i...|
>> +			 */
>> +		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
>> +
>> +		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
>> +		i = j;
>> +	}
>> +}
>> +
>> +bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +{
>> +	u32 key = kvm_async_pf_hash_slot(vcpu, gfn);
>> +
>> +	return vcpu->arch.apf.gfns[key] == gfn;
>> +}
>> +
>> +static inline int kvm_async_pf_read_cache(struct kvm_vcpu *vcpu,
>> +					  u32 offset, u32 *val)
>> +{
>> +	return kvm_read_guest_offset_cached(vcpu->kvm,
>> +					    &vcpu->arch.apf.cache,
>> +					    val, offset, sizeof(*val));
>> +}
>> +
>> +static inline int kvm_async_pf_write_cache(struct kvm_vcpu *vcpu,
>> +					   u32 offset, u32 val)
>> +{
>> +	return kvm_write_guest_offset_cached(vcpu->kvm,
>> +					     &vcpu->arch.apf.cache,
>> +					     &val, offset, sizeof(val));
>> +}
> 
> 
> This hash stuff doesn't look at all specific to arm64. Could it be moved to core code or a
> shared header file so multiple architectures only need one copy of the code.
> If they need to be different, it would be good to know why.
> 

Ok. It will be moved to a common place shared by x86 and arm64 in the next
revision.

Thanks,
Gavin

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-11-02  7:19 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-18  1:13 [PATCH 0/6] Support Asynchronous Page Fault Gavin Shan
2020-08-18  1:13 ` [PATCH 1/6] arm64: Probe for the presence of KVM hypervisor services during boot Gavin Shan
2020-08-18  1:13 ` [PATCH 2/6] arm/arm64: KVM: Advertise KVM UID to guests via SMCCC Gavin Shan
2020-08-18  1:13 ` [PATCH 3/6] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode Gavin Shan
2020-10-23 16:54   ` James Morse
2020-11-02  5:30     ` Gavin Shan
2020-08-18  1:13 ` [PATCH 4/6] arm64/kvm: Support async page fault Gavin Shan
2020-10-23 17:01   ` James Morse
2020-11-02  7:19     ` Gavin Shan
2020-08-18  1:13 ` [PATCH 5/6] drivers/acpi: Import ACPI APF table Gavin Shan
2020-10-23 16:55   ` James Morse
2020-08-18  1:13 ` [PATCH 6/6] arm64/kernel: Support async page fault Gavin Shan
2020-10-23 16:54 ` [PATCH 0/6] Support Asynchronous Page Fault James Morse
2020-10-31 14:18   ` Paolo Bonzini
2020-11-02  5:23   ` Gavin Shan
