* [RFC PATCH part-5 00/22] VMX emulation
@ 2023-03-12 18:02 Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 01/22] pkvm: x86: Add memcpy lib Jason Chen CJ
                   ` (22 more replies)
  0 siblings, 23 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

This patch set is part 5 of the RFC series. It introduces VMX
emulation for pKVM on the Intel platform.

The host VM wants the capability to run its own guests, so it needs VMX
support.

pKVM is designed to emulate VMX for the host VM based on shadow VMCS.
This requires support for the "VMCS shadowing" feature in the VMX
secondary processor-based VM-execution controls field [1].

An alternative way to emulate VMX is based on the enlightened VMCS
(evmcs) introduced by Hyper-V nesting support. evmcs uses normal memory
reads/writes instead of VMWRITE/VMREAD instructions, so it is a
flexible software solution to emulate VMX and does not need the "VMCS
shadowing" feature; however, making evmcs work for pKVM would require
refactoring the KVM Hyper-V code. To avoid changing that part of the
code, we choose to use shadow VMCS in this RFC.

    +--------------------+   +-----------------+
    |     host VM        |   |   guest VM      |
    |                    |   |                 |
    |        +---------+ |   |                 |
    |        | vmcs12* | |   |                 |
    |        +---------+ |   |                 |
    +--------------------+   +-----------------+
    +------------------------------------------+       +---------+
    |     +---------+         +---------+      |       | shadow  |
    |     | vmcs01* |         | vmcs02* +------+---+-->|  vcpu   |
    |     +---------+         +---------+      |   |   |  state  |
    |                      +---------------+   |   |   +---------+
    |                      | cached_vmcs12 +---+---+
    | pKVM                 +---------------+   |
    +------------------------------------------+

 [*]vmcs12: virtual vmcs of a nested guest
 [*]vmcs02: vmcs of a nested guest
 [*]vmcs01: vmcs of host VM

"VMCS shadowing" use a shadow vmcs page (vmcs02) to cache vmcs fields
accessing from host VM through VMWRITE/VMREAD, avoid causing vmexit.
The fields cached in vmcs02 is pre-defined by VMREAD/VMWRITE bitmap.
Meanwhile for other fields not in VMREAD/VMWRITE bitmap, accessing from
host VM cause VMREAD/VMWRITE vmexit, pKVM need to cache them in another
place - cached_vmcs12 is introduced for this purpose.
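
As an illustration, a minimal sketch of how a VMWRITE vmexit for a
non-shadowed field could be handled (the helper names below are
hypothetical; only the flow follows the design described above):

    /*
     * A VMWRITE from the host VM caused a vmexit because the field is
     * not set in the VMWRITE bitmap, i.e. it is not shadowed by vmcs02.
     */
    static void handle_nonshadow_vmwrite(struct shadow_vcpu_state *shadow_vcpu,
                                         unsigned long field, u64 value)
    {
            /* non-shadowed fields live in the software cache ... */
            cached_vmcs12_write(shadow_vcpu, field, value);      /* hypothetical */

            /*
             * ... and are marked dirty so they can be flushed into vmcs02
             * before the next vmlaunch/vmresume of the nested guest.
             */
            mark_cached_field_dirty(shadow_vcpu, field);         /* hypothetical */
    }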

The vmcs02 page used in root mode is kept in the structure
shadow_vcpu_state, which is allocated and then donated by the host VM
when it initializes the vcpus for its launched (nested) guest. The same
applies to the cached_vmcs12 field.

pKVM uses vmcs02 for two purposes: one, mentioned above, is as the
shadow vmcs page of the nested guest while the host VM programs its vmcs
fields; the other is as the ordinary (active) vmcs for the same guest
during vmlaunch/vmresume.

For a nested guest, while the host VM programs its vmcs, its virtual
vmcs (vmcs12) is, per the above, saved in two places: vmcs02 for the
shadowed fields and cached_vmcs12 for the non-shadowed fields. The
cached_vmcs12 fields in turn fall into two groups: emulated fields and
host-state fields. The emulated fields are mostly security-related
control fields, which shall be emulated to their physical values and
filled into vmcs02 before vmcs02 becomes active for vmlaunch/vmresume of
the nested guest. The host-state fields are the guest state of the host
vcpu; they shall be restored to the guest-state fields of the host
vcpu's vmcs (vmcs01) before returning to the host VM.

Below is a summary of the contents of the different vmcs field groups
in each of the above-mentioned vmcs:

              host state      guest state          control
 ---------------------------------------------------------------
 vmcs12:      host VM's     nested guest's     set by host VM
 vmcs02:       pKVM's       nested guest's   set by host VM + pKVM*
 vmcs01:       pKVM's        host VM's          set by pKVM

 [*]the security related control fields of vmcs02 are controlled by pKVM
  (e.g., EPT_POINTER)

Below is the brief vmcs emulation method for the different vmcs field
groups of a nested guest:

                host state      guest state   security related control
 ---------------------------------------------------------------------
 virtual vmcs:  cached_vmcs12*     vmcs02*          emulated*

 [*]cached_vmcs12: vmexit, then set/get the value to/from cached_vmcs12
 [*]vmcs02:        no vmexit; directly shadowed through vmcs02
 [*]emulated:      vmexit, then do the emulation

The vmcs02 & cached_vmcs12 is sync back to vmcs12 during VMCLEAR
emulation, and updated from vmcs12 when emulating VMPTRLD. And before
the nested guest vmentry(vmlaunch/vmresume emulation), the vmcs02 is
further sync dirty fields(caused by vmwrite) from cached_vmcs12 and
update emulated fields through emulation.
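
In other words, the rough lifecycle of the vmcs02/cached_vmcs12 pair is
(the step names are descriptive, not the actual functions of this
series):

    VMPTRLD(vmcs12):    copy shadowed fields vmcs12 -> vmcs02,
                        copy non-shadowed fields vmcs12 -> cached_vmcs12
    VMWRITE vmexit:     update cached_vmcs12, mark the field dirty
    vmlaunch/vmresume:  flush dirty cached_vmcs12 fields -> vmcs02,
                        refresh the emulated control fields -> vmcs02,
                        run the nested guest with vmcs02 as the active vmcs
    VMCLEAR(vmcs12):    sync vmcs02 + cached_vmcs12 back -> vmcs12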

INVEPT/INVVPID is for now emulated in a simplified way by doing a
global INVEPT.

VMX MSRs are emulated by pKVM as well to expose the VMX capabilities
to the host VM; the PT, SMM, VMCS shadowing and VMFUNC features are
filtered out.
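
For example (illustrative only, assuming a hypothetical helper to read
the physical MSR value), the value reported for
MSR_IA32_VMX_PROCBASED_CTLS2 would get its VMCS-shadowing allowed-1 bit
(in the high 32 bits) cleared before being returned to the host VM:

    u64 val = read_phys_vmx_msr(MSR_IA32_VMX_PROCBASED_CTLS2);  /* hypothetical */

    /* hide "VMCS shadowing" from the host VM */
    val &= ~((u64)SECONDARY_EXEC_SHADOW_VMCS << 32);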

[1]: SDM: Virtual Machine Control Structures chapter, VMCS TYPES.

Haiwei Li (2):
  pkvm: x86: Do guest address translation per page granularity
  pkvm: x86: Add check for guest address translation

Jason Chen CJ (19):
  pkvm: x86: Add memcpy lib
  pkvm: x86: Add memory operation APIs for host VM
  pkvm: x86: Add hypercalls for shadow_vm/vcpu init & teardown
  KVM: VMX: Add new kvm_x86_ops vm_free
  KVM: VMX: Add initialization/teardown for shadow vm/vcpu
  pkvm: x86: Add hash table mapping for shadow vcpu based on vmcs12_pa
  pkvm: x86: Add VMXON/VMXOFF emulation
  KVM: VMX: Add more vmcs and vmcs12 fields definition
  pkvm: x86: Init vmcs read/write bitmap for vmcs emulation
  pkvm: x86: Initialize emulated fields for vmcs emulation
  pkvm: x86: Add msr ops for pKVM hypervisor
  pkvm: x86: Move _init_host_state_area to pKVM hypervisor
  pkvm: x86: Add vmcs_load/clear_track APIs
  pkvm: x86: Add VMPTRLD/VMCLEAR emulation
  pkvm: x86: Add VMREAD/VMWRITE emulation
  pkvm: x86: Add VMLAUNCH/VMRESUME emulation
  pkvm: x86: Add INVEPT/INVVPID emulation
  pkvm: x86: Initialize msr_bitmap for vmsr
  pkvm: x86: Add vmx msr emulation

Tina Zhang (1):
  pkvm: x86: Add has_vmcs_field() API for physical vmx capability check

 arch/x86/include/asm/kvm-x86-ops.h            |    1 +
 arch/x86/include/asm/kvm_host.h               |    5 +
 arch/x86/include/asm/kvm_pkvm.h               |   14 +
 arch/x86/include/asm/pkvm_image_vars.h        |    3 +-
 arch/x86/include/asm/vmx.h                    |    4 +
 arch/x86/kvm/vmx/pkvm/hyp/Makefile            |    6 +-
 arch/x86/kvm/vmx/pkvm/hyp/cpu.h               |   23 +
 arch/x86/kvm/vmx/pkvm/hyp/init_finalise.c     |    3 +
 arch/x86/kvm/vmx/pkvm/hyp/lib/memcpy_64.S     |   26 +
 arch/x86/kvm/vmx/pkvm/hyp/memory.c            |  216 ++++
 arch/x86/kvm/vmx/pkvm/hyp/memory.h            |   11 +
 arch/x86/kvm/vmx/pkvm/hyp/nested.c            | 1030 +++++++++++++++++
 arch/x86/kvm/vmx/pkvm/hyp/nested.h            |   27 +
 arch/x86/kvm/vmx/pkvm/hyp/pkvm.c              |  342 ++++++
 arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h          |   82 ++
 .../vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h    |  195 ++++
 arch/x86/kvm/vmx/pkvm/hyp/vmexit.c            |  174 ++-
 arch/x86/kvm/vmx/pkvm/hyp/vmsr.c              |   88 ++
 arch/x86/kvm/vmx/pkvm/hyp/vmsr.h              |   11 +
 arch/x86/kvm/vmx/pkvm/hyp/vmx.c               |   77 ++
 arch/x86/kvm/vmx/pkvm/hyp/vmx.h               |   23 +
 arch/x86/kvm/vmx/pkvm/include/pkvm.h          |    5 +
 arch/x86/kvm/vmx/pkvm/pkvm_constants.c        |    4 +
 arch/x86/kvm/vmx/pkvm/pkvm_host.c             |  181 +--
 arch/x86/kvm/vmx/vmcs12.c                     |    6 +
 arch/x86/kvm/vmx/vmcs12.h                     |   16 +-
 arch/x86/kvm/vmx/vmx.c                        |   14 +-
 arch/x86/kvm/x86.c                            |    1 +
 include/linux/kvm_host.h                      |    8 +
 29 files changed, 2459 insertions(+), 137 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/pkvm/hyp/lib/memcpy_64.S
 create mode 100644 arch/x86/kvm/vmx/pkvm/hyp/nested.c
 create mode 100644 arch/x86/kvm/vmx/pkvm/hyp/nested.h
 create mode 100644 arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h
 create mode 100644 arch/x86/kvm/vmx/pkvm/hyp/vmsr.c
 create mode 100644 arch/x86/kvm/vmx/pkvm/hyp/vmsr.h
 create mode 100644 arch/x86/kvm/vmx/pkvm/hyp/vmx.c

-- 
2.25.1


* [RFC PATCH part-5 01/22] pkvm: x86: Add memcpy lib
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 02/22] pkvm: x86: Add memory operation APIs for host VM Jason Chen CJ
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

pKVM needs its own memcpy library; it cannot directly use
arch/x86/lib/memcpy_64.S, as that implementation relies on the
ALTERNATIVE section, which pKVM does not support yet.

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/Makefile        |  1 +
 arch/x86/kvm/vmx/pkvm/hyp/lib/memcpy_64.S | 26 +++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/Makefile b/arch/x86/kvm/vmx/pkvm/hyp/Makefile
index fe852bd43a7e..9c410ec96f45 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/Makefile
+++ b/arch/x86/kvm/vmx/pkvm/hyp/Makefile
@@ -17,6 +17,7 @@ pkvm-hyp-y	:= vmx_asm.o vmexit.o memory.o early_alloc.o pgtable.o mmu.o pkvm.o \
 ifndef CONFIG_PKVM_INTEL_DEBUG
 lib-dir		:= lib
 pkvm-hyp-y	+= $(lib-dir)/memset_64.o
+pkvm-hyp-y	+= $(lib-dir)/memcpy_64.o
 pkvm-hyp-$(CONFIG_RETPOLINE)	+= $(lib-dir)/retpoline.o
 pkvm-hyp-$(CONFIG_DEBUG_LIST)	+= $(lib-dir)/list_debug.o
 endif
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/lib/memcpy_64.S b/arch/x86/kvm/vmx/pkvm/hyp/lib/memcpy_64.S
new file mode 100644
index 000000000000..b976f646d352
--- /dev/null
+++ b/arch/x86/kvm/vmx/pkvm/hyp/lib/memcpy_64.S
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright 2002 Andi Kleen */
+
+#include <linux/linkage.h>
+
+/*
+ * memcpy - Copy a memory block.
+ *
+ * Input:
+ *  rdi destination
+ *  rsi source
+ *  rdx count
+ *
+ * Output:
+ * rax original destination
+ *
+ * This is enhanced fast string memcpy. It is faster and
+ * simpler than old memcpy.
+ */
+
+SYM_FUNC_START(memcpy)
+	movq %rdi, %rax
+	movq %rdx, %rcx
+	rep movsb
+	RET
+SYM_FUNC_END(memcpy)
-- 
2.25.1


* [RFC PATCH part-5 02/22] pkvm: x86: Add memory operation APIs for host VM
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 01/22] pkvm: x86: Add memcpy lib Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 03/22] pkvm: x86: Do guest address translation per page granularity Jason Chen CJ
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

Add the below memory operation APIs for the host VM:
- gva2gpa
- read_gva/write_gva
- read_gpa/write_gpa
These ops will be used later for vmx instruction emulation; for example,
the vmxon instruction takes the pointer of the vmxon region from guest
memory, which means pKVM needs to read its content from a gva.
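
For illustration, a minimal sketch of how a later VMXON emulation path
could use these APIs to fetch the vmxon-region pointer (operand_gva and
the error handling are hypothetical here; only the call pattern
matters):

    gpa_t vmxon_ptr;
    struct x86_exception e;

    /* read the 64-bit vmxon-region pointer operand from guest memory */
    if (read_gva(vcpu, operand_gva, &vmxon_ptr, sizeof(vmxon_ptr), &e) < 0)
            return -EFAULT;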

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/memory.c | 106 +++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/pkvm/hyp/memory.h |  11 +++
 arch/x86/kvm/vmx/pkvm/hyp/vmexit.c |   1 +
 3 files changed, 118 insertions(+)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/memory.c b/arch/x86/kvm/vmx/pkvm/hyp/memory.c
index d3e479860189..e99fa72cedac 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/memory.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/memory.c
@@ -6,7 +6,10 @@
 #include <linux/types.h>
 #include <asm/kvm_pkvm.h>
 
+#include <pkvm.h>
 #include "memory.h"
+#include "pgtable.h"
+#include "pkvm_hyp.h"
 
 unsigned long __page_base_offset;
 unsigned long __symbol_base_offset;
@@ -63,3 +66,106 @@ bool mem_range_included(struct mem_range *child, struct mem_range *parent)
 {
 	return parent->start <= child->start && child->end <= parent->end;
 }
+
+void *host_gpa2hva(unsigned long gpa)
+{
+	/* host gpa = hpa */
+	return pkvm_phys_to_virt(gpa);
+}
+
+extern struct pkvm_pgtable_ops mmu_ops;
+static struct pkvm_mm_ops mm_ops = {
+	.phys_to_virt = host_gpa2hva,
+};
+
+static int check_translation(struct kvm_vcpu *vcpu, gpa_t gpa,
+		u64 prot, u32 access, struct x86_exception *exception)
+{
+	/* TODO: exception for #PF */
+	return 0;
+}
+
+int gva2gpa(struct kvm_vcpu *vcpu, gva_t gva, gpa_t *gpa,
+		u32 access, struct x86_exception *exception)
+{
+	struct pkvm_pgtable guest_mmu;
+	gpa_t _gpa;
+	u64 prot;
+	int pg_level;
+
+	/* caller should ensure exception is not NULL */
+	WARN_ON(exception == NULL);
+
+	memset(exception, 0, sizeof(*exception));
+
+	/*TODO: support other paging mode beside long mode */
+	guest_mmu.root_pa = vcpu->arch.cr3 & PAGE_MASK;
+	pkvm_pgtable_init(&guest_mmu, &mm_ops, &mmu_ops, &pkvm_hyp->mmu_cap, false);
+	pkvm_pgtable_lookup(&guest_mmu, (unsigned long)gva,
+			(unsigned long *)&_gpa, &prot, &pg_level);
+	*gpa = _gpa;
+	if (_gpa == INVALID_ADDR)
+		return -EFAULT;
+
+	return check_translation(vcpu, _gpa, prot, access, exception);
+}
+
+/* only support host VM now */
+static int copy_gva(struct kvm_vcpu *vcpu, gva_t gva, void *addr,
+		unsigned int bytes, struct x86_exception *exception, bool from_guest)
+{
+	u32 access = VMX_AR_DPL(vmcs_read32(GUEST_SS_AR_BYTES)) == 3 ? PFERR_USER_MASK : 0;
+	gpa_t gpa;
+	void *hva;
+	int ret;
+
+	/*FIXME: need check the gva per page granularity */
+	ret = gva2gpa(vcpu, gva, &gpa, access, exception);
+	if (ret)
+		return ret;
+
+	hva = host_gpa2hva(gpa);
+	if (from_guest)
+		memcpy(addr, hva, bytes);
+	else
+		memcpy(hva, addr, bytes);
+
+	return bytes;
+}
+
+int read_gva(struct kvm_vcpu *vcpu, gva_t gva, void *addr,
+		unsigned int bytes, struct x86_exception *exception)
+{
+	return copy_gva(vcpu, gva, addr, bytes, exception, true);
+}
+
+int write_gva(struct kvm_vcpu *vcpu, gva_t gva, void *addr,
+		unsigned int bytes, struct x86_exception *exception)
+{
+	return copy_gva(vcpu, gva, addr, bytes, exception, false);
+}
+
+/* only support host VM now */
+static int copy_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, void *addr,
+		unsigned int bytes, bool from_guest)
+{
+	void *hva;
+
+	hva = host_gpa2hva(gpa);
+	if (from_guest)
+		memcpy(addr, hva, bytes);
+	else
+		memcpy(hva, addr, bytes);
+
+	return bytes;
+}
+
+int read_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, void *addr, unsigned int bytes)
+{
+	return copy_gpa(vcpu, gpa, addr, bytes, true);
+}
+
+int write_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, void *addr, unsigned int bytes)
+{
+	return copy_gpa(vcpu, gpa, addr, bytes, false);
+}
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/memory.h b/arch/x86/kvm/vmx/pkvm/hyp/memory.h
index c9175272096b..4a75d8dff1b3 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/memory.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/memory.h
@@ -20,4 +20,15 @@ struct mem_range {
 bool find_mem_range(unsigned long addr, struct mem_range *range);
 bool mem_range_included(struct mem_range *child, struct mem_range *parent);
 
+#include <linux/kvm_host.h>
+void *host_gpa2hva(unsigned long gpa);
+int gva2gpa(struct kvm_vcpu *vcpu, gva_t gva, gpa_t *gpa,
+		u32 access, struct x86_exception *exception);
+int read_gva(struct kvm_vcpu *vcpu, gva_t gva, void *addr,
+		unsigned int bytes, struct x86_exception *exception);
+int write_gva(struct kvm_vcpu *vcpu, gva_t gva, void *addr,
+		unsigned int bytes, struct x86_exception *exception);
+int read_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, void *addr, unsigned int bytes);
+int write_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, void *addr, unsigned int bytes);
+
 #endif
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
index e8015a6830b0..02224d93384a 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
@@ -154,6 +154,7 @@ int pkvm_main(struct kvm_vcpu *vcpu)
 		}
 
 		vcpu->arch.cr2 = native_read_cr2();
+		vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
 
 		vmx->exit_reason.full = vmcs_read32(VM_EXIT_REASON);
 		vmx->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
-- 
2.25.1


* [RFC PATCH part-5 03/22] pkvm: x86: Do guest address translation per page granularity
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 01/22] pkvm: x86: Add memcpy lib Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 02/22] pkvm: x86: Add memory operation APIs for host VM Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 04/22] pkvm: x86: Add check for guest address translation Jason Chen CJ
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Haiwei Li

From: Haiwei Li <haiwei.li@intel.com>

Guest memory operations like read_gva/write_gva/read_gpa/write_gpa only
do address translation for the current page. This is not correct if
such an operation accesses data that crosses the current page boundary.

Fix the above issue in these functions by doing address translation at
page granularity.
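
For example, with 4KiB pages a 16-byte access starting at gpa 0x1ff8 is
now split into two chunks: the first 8 bytes up to the page boundary
(offset_in_pg = 0xff8, len = 0x1000 - 0xff8 = 8), and the remaining
8 bytes handled in a second iteration on the next page (with its own
translation in the gva case).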

Signed-off-by: Haiwei Li <haiwei.li@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/memory.c | 65 ++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/memory.c b/arch/x86/kvm/vmx/pkvm/hyp/memory.c
index e99fa72cedac..a42669ccf89c 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/memory.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/memory.c
@@ -110,27 +110,47 @@ int gva2gpa(struct kvm_vcpu *vcpu, gva_t gva, gpa_t *gpa,
 	return check_translation(vcpu, _gpa, prot, access, exception);
 }
 
-/* only support host VM now */
-static int copy_gva(struct kvm_vcpu *vcpu, gva_t gva, void *addr,
-		unsigned int bytes, struct x86_exception *exception, bool from_guest)
+static inline int __copy_gpa(struct kvm_vcpu *vcpu, void *addr, gpa_t gpa,
+			     unsigned int size, unsigned int pg_size,
+			     bool from_guest)
 {
-	u32 access = VMX_AR_DPL(vmcs_read32(GUEST_SS_AR_BYTES)) == 3 ? PFERR_USER_MASK : 0;
-	gpa_t gpa;
+	unsigned int len, offset_in_pg;
 	void *hva;
-	int ret;
 
-	/*FIXME: need check the gva per page granularity */
-	ret = gva2gpa(vcpu, gva, &gpa, access, exception);
-	if (ret)
-		return ret;
+	offset_in_pg = (unsigned int)gpa & (pg_size - 1);
+	len = (size > (pg_size - offset_in_pg)) ? (pg_size - offset_in_pg) : size;
 
 	hva = host_gpa2hva(gpa);
 	if (from_guest)
-		memcpy(addr, hva, bytes);
+		memcpy(addr, hva, len);
 	else
-		memcpy(hva, addr, bytes);
+		memcpy(hva, addr, len);
 
-	return bytes;
+	return len;
+}
+
+/* only support host VM now */
+static int copy_gva(struct kvm_vcpu *vcpu, gva_t gva, void *addr,
+		unsigned int bytes, struct x86_exception *exception, bool from_guest)
+{
+	u32 access = VMX_AR_DPL(vmcs_read32(GUEST_SS_AR_BYTES)) == 3 ? PFERR_USER_MASK : 0;
+	gpa_t gpa;
+	unsigned int len;
+	int ret = 0;
+
+	while ((bytes > 0) && (ret == 0)) {
+		ret = gva2gpa(vcpu, gva, &gpa, access, exception);
+		if (ret >= 0) {
+			len = __copy_gpa(vcpu, addr, gpa, bytes, PAGE_SIZE, from_guest);
+			if (len == 0)
+				return -EINVAL;
+			gva += len;
+			addr += len;
+			bytes -= len;
+		}
+	}
+
+	return ret;
 }
 
 int read_gva(struct kvm_vcpu *vcpu, gva_t gva, void *addr,
@@ -149,15 +169,18 @@ int write_gva(struct kvm_vcpu *vcpu, gva_t gva, void *addr,
 static int copy_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, void *addr,
 		unsigned int bytes, bool from_guest)
 {
-	void *hva;
-
-	hva = host_gpa2hva(gpa);
-	if (from_guest)
-		memcpy(addr, hva, bytes);
-	else
-		memcpy(hva, addr, bytes);
+	unsigned int len;
+
+	while (bytes > 0) {
+		len = __copy_gpa(vcpu, addr, gpa, bytes, PAGE_SIZE, from_guest);
+		if (len == 0)
+			return -EINVAL;
+		gpa += len;
+		addr += len;
+		bytes -= len;
+	}
 
-	return bytes;
+	return 0;
 }
 
 int read_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, void *addr, unsigned int bytes)
-- 
2.25.1


* [RFC PATCH part-5 04/22] pkvm: x86: Add check for guest address translation
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (2 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 03/22] pkvm: x86: Do guest address translation per page granularity Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 05/22] pkvm: x86: Add hypercalls for shadow_vm/vcpu init & teardown Jason Chen CJ
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Haiwei Li

From: Haiwei Li <haiwei.li@intel.com>

During guest address translation, it needs to be checked whether an
exception occurs, triggered by an invalid address or a permission
violation.

Callers who invoke read_gva/write_gva should check the `exception` and
handle it, e.g. by injecting a #PF into the guest.
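
A minimal caller sketch (inject_pf() is a hypothetical helper, shown
only to illustrate the expected pattern):

    struct x86_exception e;

    if (read_gva(vcpu, gva, buf, len, &e) < 0) {
            /* translation or permission check failed: reflect the fault */
            if (e.vector == PF_VECTOR)
                    inject_pf(vcpu, &e);    /* hypothetical */
            return;
    }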

Signed-off-by: Haiwei Li <haiwei.li@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/memory.c | 97 ++++++++++++++++++++++++++++--
 1 file changed, 92 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/memory.c b/arch/x86/kvm/vmx/pkvm/hyp/memory.c
index a42669ccf89c..6a400aef1bd8 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/memory.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/memory.c
@@ -78,11 +78,97 @@ static struct pkvm_mm_ops mm_ops = {
 	.phys_to_virt = host_gpa2hva,
 };
 
-static int check_translation(struct kvm_vcpu *vcpu, gpa_t gpa,
+static int check_translation(struct kvm_vcpu *vcpu, gva_t gva, gpa_t gpa,
 		u64 prot, u32 access, struct x86_exception *exception)
 {
-	/* TODO: exception for #PF */
+	u16 errcode = 0;
+	bool page_rw_flags_on = true;
+	bool user_mode_addr = true;
+	const int user_mode_access = access & PFERR_USER_MASK;
+	const int write_access = access & PFERR_WRITE_MASK;
+	bool cr4_smap = vmcs_readl(GUEST_CR4) & X86_CR4_SMAP;
+	bool cr0_wp = vmcs_readl(GUEST_CR0) & X86_CR0_WP;
+
+	/*
+	 * As pkvm hypervisor will not do instruction emulation, here we do not
+	 * expect guest memory access for instruction fetch.
+	 */
+	WARN_ON(access & PFERR_FETCH_MASK);
+
+	/* pte is not present */
+	if (gpa == INVALID_ADDR) {
+		goto check_fault;
+	} else {
+		errcode |= PFERR_PRESENT_MASK;
+
+		/*TODO: check reserved bits and PK */
+
+		/* check for R/W */
+		if ((prot & _PAGE_RW) == 0) {
+			if (write_access && (user_mode_access || cr0_wp))
+				/*
+				 * case 1: Supermode and wp is 1
+				 * case 2: Usermode
+				 */
+				goto check_fault;
+			page_rw_flags_on = false;
+		}
+
+		/* check for U/S */
+		if ((prot & _PAGE_USER) == 0) {
+			user_mode_addr = false;
+			if (user_mode_access)
+				goto check_fault;
+		}
+
+		/*
+		 * When SMAP is on, we only need to apply check when address is
+		 * user-mode address.
+		 *
+		 * Also SMAP only impacts the supervisor-mode access.
+		 */
+		/* if SMAP is enabled and supervisor-mode access */
+		if (cr4_smap && (!user_mode_access) && user_mode_addr) {
+			bool acflag = vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_AC;
+
+			/* read from user mode address, eflags.ac = 0 */
+			if ((!write_access) && (!acflag)) {
+				goto check_fault;
+			} else if (write_access) {
+				/* write to user mode address */
+
+				/* cr0.wp = 0, eflags.ac = 0 */
+				if ((!cr0_wp) && (!acflag))
+					goto check_fault;
+
+				/*
+				 * cr0.wp = 1, eflags.ac = 1, r/w flag is 0
+				 * on any paging structure entry
+				 */
+				if (cr0_wp && acflag && (!page_rw_flags_on))
+					goto check_fault;
+
+				/* cr0.wp = 1, eflags.ac = 0 */
+				if (cr0_wp && (!acflag))
+					goto check_fault;
+			} else {
+				/* do nothing */
+			}
+		}
+	}
+
 	return 0;
+
+check_fault:
+	errcode |= write_access | user_mode_access;
+	exception->error_code = errcode;
+	exception->vector = PF_VECTOR;
+	exception->error_code_valid = true;
+	exception->address = gva;
+	exception->nested_page_fault = false;
+	exception->async_page_fault = false;
+	return -EFAULT;
+
 }
 
 int gva2gpa(struct kvm_vcpu *vcpu, gva_t gva, gpa_t *gpa,
@@ -104,10 +190,8 @@ int gva2gpa(struct kvm_vcpu *vcpu, gva_t gva, gpa_t *gpa,
 	pkvm_pgtable_lookup(&guest_mmu, (unsigned long)gva,
 			(unsigned long *)&_gpa, &prot, &pg_level);
 	*gpa = _gpa;
-	if (_gpa == INVALID_ADDR)
-		return -EFAULT;
 
-	return check_translation(vcpu, _gpa, prot, access, exception);
+	return check_translation(vcpu, gva, _gpa, prot, access, exception);
 }
 
 static inline int __copy_gpa(struct kvm_vcpu *vcpu, void *addr, gpa_t gpa,
@@ -138,6 +222,9 @@ static int copy_gva(struct kvm_vcpu *vcpu, gva_t gva, void *addr,
 	unsigned int len;
 	int ret = 0;
 
+	if (!from_guest)
+		access |= PFERR_WRITE_MASK;
+
 	while ((bytes > 0) && (ret == 0)) {
 		ret = gva2gpa(vcpu, gva, &gpa, access, exception);
 		if (ret >= 0) {
-- 
2.25.1


* [RFC PATCH part-5 05/22] pkvm: x86: Add hypercalls for shadow_vm/vcpu init & teardown
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (3 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 04/22] pkvm: x86: Add check for guest address translation Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 06/22] KVM: VMX: Add new kvm_x86_ops vm_free Jason Chen CJ
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ, Chuanxiao Dong

The host VM is able to create and launch its guests based on virtual
VMX and virtual EPT. pKVM is expected to provide the necessary emulation
(e.g. VMX, EPT) for the corresponding guest vcpus, so introduce the data
structures pkvm_shadow_vm & shadow_vcpu_state to manage/maintain the
state & information for such emulation of each guest vm/vcpu.

shadow_vm_handle is used as the identifier for a specific shadow vm
created by the host VM; it links to the pkvm_shadow_vm pointer for this
shadow vm, which is followed by a shadow_vcpu array corresponding to
the vcpus created for this vm.

The shadow vm/vcpu data structures for a specific vm are allocated by
the host VM during its initialization, then passed to pKVM through the
newly added hypercalls PKVM_HC_INIT_SHADOW_VM & PKVM_HC_INIT_SHADOW_VCPU.
Meanwhile, the hypercalls PKVM_HC_TEARDOWN_SHADOW_VM &
PKVM_HC_TEARDOWN_SHADOW_VCPU are used by the host VM when it wants to
tear down a created vm.

In the future, after page ownership management is supported in pKVM,
these shadow vm/vcpu data structure pages shall be donated from the host
VM to pKVM at target vm init, and returned from pKVM to the host VM at
vm teardown, through the hypercalls mentioned above.
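
For reference, the shadow_vcpu handle introduced by this patch packs the
shadow_vm_handle into the high 32 bits and the vcpu index into the low
32 bits, e.g. (illustrative values):

    /* shadow_vm_handle 3, vcpu index 2 */
    s64 handle = ((s64)3 << SHADOW_VM_HANDLE_SHIFT) | 2;    /* 0x300000002 */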

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
---
 arch/x86/include/asm/kvm_pkvm.h      |   4 +
 arch/x86/kvm/vmx/pkvm/hyp/Makefile   |   3 +
 arch/x86/kvm/vmx/pkvm/hyp/pkvm.c     | 297 +++++++++++++++++++++++++++
 arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h |  70 +++++++
 arch/x86/kvm/vmx/pkvm/hyp/vmexit.c   |  13 ++
 5 files changed, 387 insertions(+)

diff --git a/arch/x86/include/asm/kvm_pkvm.h b/arch/x86/include/asm/kvm_pkvm.h
index 0142b3dc3c01..6e8fee717e5d 100644
--- a/arch/x86/include/asm/kvm_pkvm.h
+++ b/arch/x86/include/asm/kvm_pkvm.h
@@ -16,6 +16,10 @@
 
 /* PKVM Hypercalls */
 #define PKVM_HC_INIT_FINALISE		1
+#define PKVM_HC_INIT_SHADOW_VM		2
+#define PKVM_HC_INIT_SHADOW_VCPU	3
+#define PKVM_HC_TEARDOWN_SHADOW_VM	4
+#define PKVM_HC_TEARDOWN_SHADOW_VCPU	5
 
 extern struct memblock_region pkvm_sym(hyp_memory)[];
 extern unsigned int pkvm_sym(hyp_memblock_nr);
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/Makefile b/arch/x86/kvm/vmx/pkvm/hyp/Makefile
index 9c410ec96f45..7c6f71f18676 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/Makefile
+++ b/arch/x86/kvm/vmx/pkvm/hyp/Makefile
@@ -16,8 +16,11 @@ pkvm-hyp-y	:= vmx_asm.o vmexit.o memory.o early_alloc.o pgtable.o mmu.o pkvm.o \
 
 ifndef CONFIG_PKVM_INTEL_DEBUG
 lib-dir		:= lib
+lib2-dir	:= ../../../../../../lib
 pkvm-hyp-y	+= $(lib-dir)/memset_64.o
 pkvm-hyp-y	+= $(lib-dir)/memcpy_64.o
+pkvm-hyp-y	+= $(lib2-dir)/find_bit.o
+pkvm-hyp-y	+= $(lib2-dir)/hweight.o
 pkvm-hyp-$(CONFIG_RETPOLINE)	+= $(lib-dir)/retpoline.o
 pkvm-hyp-$(CONFIG_DEBUG_LIST)	+= $(lib-dir)/list_debug.o
 endif
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/pkvm.c b/arch/x86/kvm/vmx/pkvm/hyp/pkvm.c
index a5f776195af6..b110ac43a792 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/pkvm.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/pkvm.c
@@ -5,4 +5,301 @@
 
 #include <pkvm.h>
 
+#include "pkvm_hyp.h"
+
 struct pkvm_hyp *pkvm_hyp;
+
+#define MAX_SHADOW_VMS	255
+#define HANDLE_OFFSET 1
+
+#define to_shadow_vm_handle(vcpu_handle)	((s64)(vcpu_handle) >> SHADOW_VM_HANDLE_SHIFT)
+#define to_shadow_vcpu_idx(vcpu_handle)		((s64)(vcpu_handle) & SHADOW_VCPU_INDEX_MASK)
+
+static DECLARE_BITMAP(shadow_vms_bitmap, MAX_SHADOW_VMS);
+static pkvm_spinlock_t shadow_vms_lock = __PKVM_SPINLOCK_UNLOCKED;
+struct shadow_vm_ref {
+	atomic_t refcount;
+	struct pkvm_shadow_vm *vm;
+};
+static struct shadow_vm_ref shadow_vms_ref[MAX_SHADOW_VMS];
+
+#define SHADOW_VCPU_ARRAY(vm) \
+	((struct shadow_vcpu_array *)((void *)(vm) + sizeof(struct pkvm_shadow_vm)))
+
+static int allocate_shadow_vm_handle(struct pkvm_shadow_vm *vm)
+{
+	struct shadow_vm_ref *vm_ref;
+	int handle;
+
+	/* The shadow_vm_handle is an int so cannot exceed the INT_MAX */
+	BUILD_BUG_ON(MAX_SHADOW_VMS > INT_MAX);
+
+	pkvm_spin_lock(&shadow_vms_lock);
+
+	handle = find_next_zero_bit(shadow_vms_bitmap, MAX_SHADOW_VMS,
+				    HANDLE_OFFSET);
+	if ((u32)handle < MAX_SHADOW_VMS) {
+		__set_bit(handle, shadow_vms_bitmap);
+		vm->shadow_vm_handle = handle;
+		vm_ref = &shadow_vms_ref[handle];
+		vm_ref->vm = vm;
+		atomic_set(&vm_ref->refcount, 1);
+	} else
+		handle = -ENOMEM;
+
+	pkvm_spin_unlock(&shadow_vms_lock);
+
+	return handle;
+}
+
+static struct pkvm_shadow_vm *free_shadow_vm_handle(int handle)
+{
+	struct shadow_vm_ref *vm_ref;
+	struct pkvm_shadow_vm *vm = NULL;
+
+	pkvm_spin_lock(&shadow_vms_lock);
+
+	if ((u32)handle >= MAX_SHADOW_VMS)
+		goto out;
+
+	vm_ref = &shadow_vms_ref[handle];
+	if ((atomic_cmpxchg(&vm_ref->refcount, 1, 0) != 1)) {
+		pkvm_err("%s: VM%d is busy, refcount %d\n",
+			 __func__, handle, atomic_read(&vm_ref->refcount));
+		goto out;
+	}
+
+	vm = vm_ref->vm;
+
+	vm_ref->vm = NULL;
+	__clear_bit(handle, shadow_vms_bitmap);
+out:
+	pkvm_spin_unlock(&shadow_vms_lock);
+	return vm;
+}
+
+int __pkvm_init_shadow_vm(unsigned long kvm_va,
+			  unsigned long shadow_pa,
+			  size_t shadow_size)
+{
+	struct pkvm_shadow_vm *vm;
+
+	if (!PAGE_ALIGNED(shadow_pa) ||
+		!PAGE_ALIGNED(shadow_size) ||
+		(shadow_size != PAGE_ALIGN(sizeof(struct pkvm_shadow_vm)
+					   + pkvm_shadow_vcpu_array_size())))
+		return -EINVAL;
+
+	vm = pkvm_phys_to_virt(shadow_pa);
+
+	memset(vm, 0, shadow_size);
+	pkvm_spin_lock_init(&vm->lock);
+
+	vm->host_kvm_va = kvm_va;
+	return allocate_shadow_vm_handle(vm);
+}
+
+unsigned long __pkvm_teardown_shadow_vm(int shadow_vm_handle)
+{
+	struct pkvm_shadow_vm *vm = free_shadow_vm_handle(shadow_vm_handle);
+
+	if (!vm)
+		return 0;
+
+	memset(vm, 0, sizeof(struct pkvm_shadow_vm) + pkvm_shadow_vcpu_array_size());
+
+	return pkvm_virt_to_phys(vm);
+}
+
+static struct pkvm_shadow_vm *get_shadow_vm(int shadow_vm_handle)
+{
+	struct shadow_vm_ref *vm_ref;
+
+	if ((u32)shadow_vm_handle >= MAX_SHADOW_VMS)
+		return NULL;
+
+	vm_ref = &shadow_vms_ref[shadow_vm_handle];
+	return atomic_inc_not_zero(&vm_ref->refcount) ? vm_ref->vm : NULL;
+}
+
+static void put_shadow_vm(int shadow_vm_handle)
+{
+	struct shadow_vm_ref *vm_ref;
+
+	if ((u32)shadow_vm_handle >= MAX_SHADOW_VMS)
+		return;
+
+	vm_ref = &shadow_vms_ref[shadow_vm_handle];
+	WARN_ON(atomic_dec_if_positive(&vm_ref->refcount) <= 0);
+}
+
+struct shadow_vcpu_state *get_shadow_vcpu(s64 shadow_vcpu_handle)
+{
+	int shadow_vm_handle = to_shadow_vm_handle(shadow_vcpu_handle);
+	u32 vcpu_idx = to_shadow_vcpu_idx(shadow_vcpu_handle);
+	struct shadow_vcpu_ref *vcpu_ref;
+	struct shadow_vcpu_state *vcpu;
+	struct pkvm_shadow_vm *vm;
+
+	if (vcpu_idx >= KVM_MAX_VCPUS)
+		return NULL;
+
+	vm = get_shadow_vm(shadow_vm_handle);
+	if (!vm)
+		return NULL;
+
+	vcpu_ref = &SHADOW_VCPU_ARRAY(vm)->ref[vcpu_idx];
+	vcpu = atomic_inc_not_zero(&vcpu_ref->refcount) ? vcpu_ref->vcpu : NULL;
+
+	put_shadow_vm(shadow_vm_handle);
+	return vcpu;
+}
+
+void put_shadow_vcpu(s64 shadow_vcpu_handle)
+{
+	int shadow_vm_handle = to_shadow_vm_handle(shadow_vcpu_handle);
+	u32 vcpu_idx = to_shadow_vcpu_idx(shadow_vcpu_handle);
+	struct shadow_vcpu_ref *vcpu_ref;
+	struct pkvm_shadow_vm *vm;
+
+	if (vcpu_idx >= KVM_MAX_VCPUS)
+		return;
+
+	vm = get_shadow_vm(shadow_vm_handle);
+	if (!vm)
+		return;
+
+	vcpu_ref = &SHADOW_VCPU_ARRAY(vm)->ref[vcpu_idx];
+	WARN_ON(atomic_dec_if_positive(&vcpu_ref->refcount) <= 0);
+
+	put_shadow_vm(shadow_vm_handle);
+}
+
+static s64 attach_shadow_vcpu_to_vm(struct pkvm_shadow_vm *vm,
+				    struct shadow_vcpu_state *shadow_vcpu)
+{
+	struct shadow_vcpu_ref *vcpu_ref;
+	u32 vcpu_idx;
+
+	/*
+	 * Shadow_vcpu_handle is a s64 value combined with shadow_vm_handle
+	 * and shadow_vcpu index from the array. So the array size cannot be
+	 * larger than the shadow_vcpu index mask.
+	 */
+	BUILD_BUG_ON(KVM_MAX_VCPUS > SHADOW_VCPU_INDEX_MASK);
+
+	/*
+	 * Save a shadow_vm pointer in shadow_vcpu requires additional
+	 * get so that later when use this pointer at runtime no need
+	 * to get again. This will be put when detaching this shadow_vcpu.
+	 */
+	shadow_vcpu->vm = get_shadow_vm(vm->shadow_vm_handle);
+	if (!shadow_vcpu->vm)
+		return -EINVAL;
+
+	pkvm_spin_lock(&vm->lock);
+
+	if (vm->created_vcpus == KVM_MAX_VCPUS) {
+		pkvm_spin_unlock(&vm->lock);
+		return -EINVAL;
+	}
+
+	vcpu_idx = vm->created_vcpus;
+	shadow_vcpu->shadow_vcpu_handle =
+		to_shadow_vcpu_handle(vm->shadow_vm_handle, vcpu_idx);
+	vcpu_ref = &SHADOW_VCPU_ARRAY(vm)->ref[vcpu_idx];
+	vcpu_ref->vcpu = shadow_vcpu;
+	vm->created_vcpus++;
+	atomic_set(&vcpu_ref->refcount, 1);
+
+	pkvm_spin_unlock(&vm->lock);
+
+	return shadow_vcpu->shadow_vcpu_handle;
+}
+
+static struct shadow_vcpu_state *
+detach_shadow_vcpu_from_vm(struct pkvm_shadow_vm *vm, s64 shadow_vcpu_handle)
+{
+	u32 vcpu_idx = to_shadow_vcpu_idx(shadow_vcpu_handle);
+	struct shadow_vcpu_state *shadow_vcpu = NULL;
+	struct shadow_vcpu_ref *vcpu_ref;
+
+	if (vcpu_idx >= KVM_MAX_VCPUS)
+		return NULL;
+
+	pkvm_spin_lock(&vm->lock);
+
+	vcpu_ref = &SHADOW_VCPU_ARRAY(vm)->ref[vcpu_idx];
+	if ((atomic_cmpxchg(&vcpu_ref->refcount, 1, 0) != 1)) {
+		pkvm_err("%s: VM%d shadow_vcpu%d is busy, refcount %d\n",
+			 __func__, vm->shadow_vm_handle, vcpu_idx,
+			 atomic_read(&vcpu_ref->refcount));
+	} else {
+		shadow_vcpu = vcpu_ref->vcpu;
+		vcpu_ref->vcpu = NULL;
+	}
+
+	pkvm_spin_unlock(&vm->lock);
+
+	if (shadow_vcpu)
+		/*
+		 * Paired with the get_shadow_vm when saving the shadow_vm pointer
+		 * during attaching shadow_vcpu.
+		 */
+		put_shadow_vm(shadow_vcpu->vm->shadow_vm_handle);
+
+	return shadow_vcpu;
+}
+
+s64 __pkvm_init_shadow_vcpu(struct kvm_vcpu *hvcpu, int shadow_vm_handle,
+			    unsigned long vcpu_va, unsigned long shadow_pa,
+			    size_t shadow_size)
+{
+	struct pkvm_shadow_vm *vm;
+	struct shadow_vcpu_state *shadow_vcpu;
+	struct x86_exception e;
+	s64 shadow_vcpu_handle;
+	int ret;
+
+	if (!PAGE_ALIGNED(shadow_pa) || !PAGE_ALIGNED(shadow_size) ||
+		(shadow_size != PAGE_ALIGN(sizeof(struct shadow_vcpu_state))) ||
+		(pkvm_hyp->vmcs_config.size > PAGE_SIZE))
+		return -EINVAL;
+
+	shadow_vcpu = pkvm_phys_to_virt(shadow_pa);
+	memset(shadow_vcpu, 0, shadow_size);
+
+	ret = read_gva(hvcpu, vcpu_va, &shadow_vcpu->vmx, sizeof(struct vcpu_vmx), &e);
+	if (ret < 0)
+		return -EINVAL;
+
+	vm = get_shadow_vm(shadow_vm_handle);
+	if (!vm)
+		return -EINVAL;
+
+	shadow_vcpu_handle = attach_shadow_vcpu_to_vm(vm, shadow_vcpu);
+
+	put_shadow_vm(shadow_vm_handle);
+
+	return shadow_vcpu_handle;
+}
+
+unsigned long __pkvm_teardown_shadow_vcpu(s64 shadow_vcpu_handle)
+{
+	int shadow_vm_handle = to_shadow_vm_handle(shadow_vcpu_handle);
+	struct shadow_vcpu_state *shadow_vcpu;
+	struct pkvm_shadow_vm *vm = get_shadow_vm(shadow_vm_handle);
+
+	if (!vm)
+		return 0;
+
+	shadow_vcpu = detach_shadow_vcpu_from_vm(vm, shadow_vcpu_handle);
+
+	put_shadow_vm(shadow_vm_handle);
+
+	if (!shadow_vcpu)
+		return 0;
+
+	memset(shadow_vcpu, 0, sizeof(struct shadow_vcpu_state));
+	return pkvm_virt_to_phys(shadow_vcpu);
+}
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h
index e84296a714a2..f15a49b3be5d 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h
@@ -5,6 +5,76 @@
 #ifndef __PKVM_HYP_H
 #define __PKVM_HYP_H
 
+#include <asm/pkvm_spinlock.h>
+
+/*
+ * A container for the vcpu state that hyp needs to maintain for protected VMs.
+ */
+struct shadow_vcpu_state {
+	/*
+	 * A unique id to the shadow vcpu, which is combined by
+	 * shadow_vm_handle and shadow_vcpu index in the array.
+	 * As shadow_vm_handle is in the high end and it is an
+	 * int, so define the shadow_vcpu_handle as a s64.
+	 */
+	s64 shadow_vcpu_handle;
+
+	struct pkvm_shadow_vm *vm;
+
+	struct vcpu_vmx vmx;
+} __aligned(PAGE_SIZE);
+
+#define SHADOW_VM_HANDLE_SHIFT		32
+#define SHADOW_VCPU_INDEX_MASK		((1UL << SHADOW_VM_HANDLE_SHIFT) - 1)
+#define to_shadow_vcpu_handle(vm_handle, vcpu_idx)		\
+		(((s64)(vm_handle) << SHADOW_VM_HANDLE_SHIFT) | \
+		 ((vcpu_idx) & SHADOW_VCPU_INDEX_MASK))
+
+/*
+ * Shadow_vcpu_array will be appended to the end of the pkvm_shadow_vm area
+ * implicitly, so that the shadow_vcpu_state pointer cannot be got directly
+ * from the pkvm_shadow_vm, but needs to be done through the interface
+ * get/put_shadow_vcpu. This can prevent the shadow_vcpu_state pointer from
+ * being abused without getting/putting the refcount.
+ */
+struct shadow_vcpu_array {
+	struct shadow_vcpu_ref {
+		atomic_t refcount;
+		struct shadow_vcpu_state *vcpu;
+	} ref[KVM_MAX_VCPUS];
+} __aligned(PAGE_SIZE);
+
+static inline size_t pkvm_shadow_vcpu_array_size(void)
+{
+	return sizeof(struct shadow_vcpu_array);
+}
+
+/*
+ * Holds the relevant data for running a protected vm.
+ */
+struct pkvm_shadow_vm {
+	/* A unique id to the shadow structs in the hyp shadow area. */
+	int shadow_vm_handle;
+
+	/* Number of vcpus for the vm. */
+	int created_vcpus;
+
+	/* The host's kvm va. */
+	unsigned long host_kvm_va;
+
+	pkvm_spinlock_t lock;
+} __aligned(PAGE_SIZE);
+
+int __pkvm_init_shadow_vm(unsigned long kvm_va, unsigned long shadow_pa,
+			  size_t shadow_size);
+unsigned long __pkvm_teardown_shadow_vm(int shadow_vm_handle);
+s64 __pkvm_init_shadow_vcpu(struct kvm_vcpu *hvcpu, int shadow_vm_handle,
+			    unsigned long vcpu_va, unsigned long shadow_pa,
+			    size_t shadow_size);
+unsigned long __pkvm_teardown_shadow_vcpu(s64 shadow_vcpu_handle);
+struct shadow_vcpu_state *get_shadow_vcpu(s64 shadow_vcpu_handle);
+void put_shadow_vcpu(s64 shadow_vcpu_handle);
+
 extern struct pkvm_hyp *pkvm_hyp;
 
 #endif
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
index 02224d93384a..6b82b6be612c 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
@@ -8,6 +8,7 @@
 #include <pkvm.h>
 #include "vmexit.h"
 #include "ept.h"
+#include "pkvm_hyp.h"
 #include "debug.h"
 
 #define CR4	4
@@ -88,6 +89,18 @@ static unsigned long handle_vmcall(struct kvm_vcpu *vcpu)
 	case PKVM_HC_INIT_FINALISE:
 		__pkvm_init_finalise(vcpu, a0, a1);
 		break;
+	case PKVM_HC_INIT_SHADOW_VM:
+		ret = __pkvm_init_shadow_vm(a0, a1, a2);
+		break;
+	case PKVM_HC_INIT_SHADOW_VCPU:
+		ret = __pkvm_init_shadow_vcpu(vcpu, a0, a1, a2, a3);
+		break;
+	case PKVM_HC_TEARDOWN_SHADOW_VM:
+		ret = __pkvm_teardown_shadow_vm(a0);
+		break;
+	case PKVM_HC_TEARDOWN_SHADOW_VCPU:
+		ret = __pkvm_teardown_shadow_vcpu(a0);
+		break;
 	default:
 		ret = -EINVAL;
 	}
-- 
2.25.1


* [RFC PATCH part-5 06/22] KVM: VMX: Add new kvm_x86_ops vm_free
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (4 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 05/22] pkvm: x86: Add hypercalls for shadow_vm/vcpu init & teardown Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 07/22] KVM: VMX: Add initialization/teardown for shadow vm/vcpu Jason Chen CJ
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ, Chuanxiao Dong

pKVM expects the shadow vm to be torn down after all shadow vcpus are
torn down, as the shadow vcpu data structures are attached to the shadow
vm. Meanwhile, the existing kvm_x86_ops vm_destroy is called before
vcpu_free, so add a new kvm_x86_ops vm_free which is called after all
vcpus are freed.

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h | 1 +
 arch/x86/include/asm/kvm_host.h    | 1 +
 arch/x86/kvm/x86.c                 | 1 +
 3 files changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index abccd51dcfca..444ff48ef2ac 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -22,6 +22,7 @@ KVM_X86_OP(vcpu_after_set_cpuid)
 KVM_X86_OP(vm_init)
 KVM_X86_OP_OPTIONAL(vm_destroy)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
+KVM_X86_OP_OPTIONAL(vm_free)
 KVM_X86_OP(vcpu_create)
 KVM_X86_OP(vcpu_free)
 KVM_X86_OP(vcpu_reset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c3cf849a1370..3dea471bfca4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1529,6 +1529,7 @@ struct kvm_x86_ops {
 	unsigned int vm_size;
 	int (*vm_init)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
+	void (*vm_free)(struct kvm *kvm);
 
 	/* Create, but do not attach this VCPU */
 	int (*vcpu_precreate)(struct kvm *kvm);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 84ddeabbf94b..877715426dac 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12309,6 +12309,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_page_track_cleanup(kvm);
 	kvm_xen_destroy_vm(kvm);
 	kvm_hv_destroy_vm(kvm);
+	static_call_cond(kvm_x86_vm_free)(kvm);
 }
 
 static void memslot_rmap_free(struct kvm_memory_slot *slot)
-- 
2.25.1


* [RFC PATCH part-5 07/22] KVM: VMX: Add initialization/teardown for shadow vm/vcpu
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (5 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 06/22] KVM: VMX: Add new kvm_x86_ops vm_free Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 08/22] pkvm: x86: Add hash table mapping for shadow vcpu based on vmcs12_pa Jason Chen CJ
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ, Chuanxiao Dong

Add initialization/teardown of the shadow vm & shadow vcpu from the
corresponding kvm_x86_ops.

The initialization allocates the shadow vm or shadow vcpu data structure
according to the size exposed through pkvm_constants, then issues a
hypercall to pass the data structure's address to pKVM.

The teardown issues a hypercall to pKVM asking to tear down the
corresponding shadow vm or shadow vcpu data structure in the hypervisor,
then finally frees the related memory.

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
---
 arch/x86/include/asm/kvm_host.h        |  4 ++
 arch/x86/include/asm/kvm_pkvm.h        | 10 ++++
 arch/x86/kvm/vmx/pkvm/pkvm_constants.c |  4 ++
 arch/x86/kvm/vmx/pkvm/pkvm_host.c      | 76 ++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c                 | 14 ++++-
 include/linux/kvm_host.h               |  8 +++
 6 files changed, 114 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3dea471bfca4..74f0954c6899 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1750,6 +1750,10 @@ struct kvm_arch_async_pf {
 	bool direct_map;
 };
 
+struct kvm_protected_vm {
+	int shadow_vm_handle;
+};
+
 extern u32 __read_mostly kvm_nr_uret_msrs;
 extern u64 __read_mostly host_efer;
 extern bool __read_mostly allow_smaller_maxphyaddr;
diff --git a/arch/x86/include/asm/kvm_pkvm.h b/arch/x86/include/asm/kvm_pkvm.h
index 6e8fee717e5d..4e9531d88417 100644
--- a/arch/x86/include/asm/kvm_pkvm.h
+++ b/arch/x86/include/asm/kvm_pkvm.h
@@ -6,6 +6,8 @@
 #ifndef _ASM_X86_KVM_PKVM_H
 #define _ASM_X86_KVM_PKVM_H
 
+#include <linux/kvm_host.h>
+
 #ifdef CONFIG_PKVM_INTEL
 
 #include <linux/memblock.h>
@@ -131,8 +133,16 @@ static inline int hyp_pre_reserve_check(void)
 
 u64 hyp_total_reserve_pages(void);
 
+int pkvm_init_shadow_vm(struct kvm *kvm);
+void pkvm_teardown_shadow_vm(struct kvm *kvm);
+int pkvm_init_shadow_vcpu(struct kvm_vcpu *vcpu);
+void pkvm_teardown_shadow_vcpu(struct kvm_vcpu *vcpu);
 #else
 static inline void kvm_hyp_reserve(void) {}
+static inline int pkvm_init_shadow_vm(struct kvm *kvm) { return 0; }
+static inline void pkvm_teardown_shadow_vm(struct kvm *kvm) {}
+static inline int pkvm_init_shadow_vcpu(struct kvm_vcpu *vcpu) { return 0; }
+static inline void pkvm_teardown_shadow_vcpu(struct kvm_vcpu *vcpu) {}
 #endif
 
 #endif
diff --git a/arch/x86/kvm/vmx/pkvm/pkvm_constants.c b/arch/x86/kvm/vmx/pkvm/pkvm_constants.c
index 729147e6b85f..c6dc35b52664 100644
--- a/arch/x86/kvm/vmx/pkvm/pkvm_constants.c
+++ b/arch/x86/kvm/vmx/pkvm/pkvm_constants.c
@@ -7,9 +7,13 @@
 #include <linux/bug.h>
 #include <vdso/limits.h>
 #include <buddy_memory.h>
+#include <vmx/vmx.h>
+#include "hyp/pkvm_hyp.h"
 
 int main(void)
 {
 	DEFINE(PKVM_VMEMMAP_ENTRY_SIZE, sizeof(struct hyp_page));
+	DEFINE(PKVM_SHADOW_VM_SIZE, sizeof(struct pkvm_shadow_vm) + pkvm_shadow_vcpu_array_size());
+	DEFINE(PKVM_SHADOW_VCPU_STATE_SIZE, sizeof(struct shadow_vcpu_state));
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/pkvm/pkvm_host.c b/arch/x86/kvm/vmx/pkvm/pkvm_host.c
index 8ea2d64236d0..2dff1123b61f 100644
--- a/arch/x86/kvm/vmx/pkvm/pkvm_host.c
+++ b/arch/x86/kvm/vmx/pkvm/pkvm_host.c
@@ -869,6 +869,82 @@ static __init int pkvm_init_finalise(void)
 	return ret;
 }
 
+int pkvm_init_shadow_vm(struct kvm *kvm)
+{
+	struct kvm_protected_vm *pkvm = &kvm->pkvm;
+	size_t shadow_sz;
+	void *shadow_addr;
+	int ret;
+
+	shadow_sz = PAGE_ALIGN(PKVM_SHADOW_VM_SIZE);
+	shadow_addr = alloc_pages_exact(shadow_sz, GFP_KERNEL_ACCOUNT);
+	if (!shadow_addr)
+		return -ENOMEM;
+
+	ret = kvm_hypercall3(PKVM_HC_INIT_SHADOW_VM, (unsigned long)kvm,
+					  (unsigned long)__pa(shadow_addr), shadow_sz);
+	if (ret < 0)
+		goto free_page;
+
+	pkvm->shadow_vm_handle = ret;
+
+	return 0;
+free_page:
+	free_pages_exact(shadow_addr, shadow_sz);
+	return ret;
+}
+
+void pkvm_teardown_shadow_vm(struct kvm *kvm)
+{
+	struct kvm_protected_vm *pkvm = &kvm->pkvm;
+	unsigned long pa;
+
+	pa = kvm_hypercall1(PKVM_HC_TEARDOWN_SHADOW_VM, pkvm->shadow_vm_handle);
+	if (!pa)
+		return;
+
+	free_pages_exact(__va(pa), PAGE_ALIGN(PKVM_SHADOW_VM_SIZE));
+}
+
+int pkvm_init_shadow_vcpu(struct kvm_vcpu *vcpu)
+{
+	struct kvm_protected_vm *pkvm = &vcpu->kvm->pkvm;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	s64 shadow_vcpu_handle;
+	size_t shadow_sz;
+	void *shadow_addr;
+
+	shadow_sz = PAGE_ALIGN(PKVM_SHADOW_VCPU_STATE_SIZE);
+	shadow_addr = alloc_pages_exact(shadow_sz, GFP_KERNEL_ACCOUNT);
+	if (!shadow_addr)
+		return -ENOMEM;
+
+	shadow_vcpu_handle = kvm_hypercall4(PKVM_HC_INIT_SHADOW_VCPU,
+					    pkvm->shadow_vm_handle, (unsigned long)vmx,
+					    (unsigned long)__pa(shadow_addr), shadow_sz);
+	if (shadow_vcpu_handle < 0)
+		goto free_page;
+
+	vcpu->pkvm_shadow_vcpu_handle = shadow_vcpu_handle;
+
+	return 0;
+
+free_page:
+	free_pages_exact(shadow_addr, shadow_sz);
+	return -EINVAL;
+}
+
+void pkvm_teardown_shadow_vcpu(struct kvm_vcpu *vcpu)
+{
+	unsigned long pa = kvm_hypercall1(PKVM_HC_TEARDOWN_SHADOW_VCPU,
+					  vcpu->pkvm_shadow_vcpu_handle);
+
+	if (!pa)
+		return;
+
+	free_pages_exact(__va(pa), PAGE_ALIGN(PKVM_SHADOW_VCPU_STATE_SIZE));
+}
+
 __init int pkvm_init(void)
 {
 	int ret = 0, cpu;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6e9723306992..61ae4c1c713d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -48,6 +48,7 @@
 #include <asm/spec-ctrl.h>
 #include <asm/virtext.h>
 #include <asm/vmx.h>
+#include <asm/kvm_pkvm.h>
 
 #include "capabilities.h"
 #include "cpuid.h"
@@ -7329,6 +7330,8 @@ static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
 	free_vpid(vmx->vpid);
 	nested_vmx_free_vcpu(vcpu);
 	free_loaded_vmcs(vmx->loaded_vmcs);
+
+	pkvm_teardown_shadow_vcpu(vcpu);
 }
 
 static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
@@ -7426,7 +7429,7 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 		WRITE_ONCE(to_kvm_vmx(vcpu->kvm)->pid_table[vcpu->vcpu_id],
 			   __pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
 
-	return 0;
+	return pkvm_init_shadow_vcpu(vcpu);
 
 free_vmcs:
 	free_loaded_vmcs(vmx->loaded_vmcs);
@@ -7468,7 +7471,13 @@ static int vmx_vm_init(struct kvm *kvm)
 			break;
 		}
 	}
-	return 0;
+
+	return pkvm_init_shadow_vm(kvm);
+}
+
+static void vmx_vm_free(struct kvm *kvm)
+{
+	pkvm_teardown_shadow_vm(kvm);
 }
 
 static int __init vmx_check_processor_compat(void)
@@ -8104,6 +8113,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vmx_vm_init,
 	.vm_destroy = vmx_vm_destroy,
+	.vm_free = vmx_vm_free,
 
 	.vcpu_precreate = vmx_vcpu_precreate,
 	.vcpu_create = vmx_vcpu_create,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b244f6d0..faab9a30002f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -390,6 +390,12 @@ struct kvm_vcpu {
 	 */
 	struct kvm_memory_slot *last_used_slot;
 	u64 last_used_slot_gen;
+
+	/*
+	 * Save the handle returned from the pkvm when init a shadow vcpu. This
+	 * will be used when teardown this shadow vcpu.
+	 */
+	s64 pkvm_shadow_vcpu_handle;
 };
 
 /*
@@ -805,6 +811,8 @@ struct kvm {
 	struct notifier_block pm_notifier;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
+
+	struct kvm_protected_vm pkvm;
 };
 
 #define kvm_err(fmt, ...) \
-- 
2.25.1


* [RFC PATCH part-5 08/22] pkvm: x86: Add hash table mapping for shadow vcpu based on vmcs12_pa
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (6 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 07/22] KVM: VMX: Add initialization/teardown for shadow vm/vcpu Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 09/22] pkvm: x86: Add VMXON/VMXOFF emulation Jason Chen CJ
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ, Chuanxiao Dong

The host VM executes vmptrld(vmcs12) then vmlaunch to launch its guest,
while pKVM needs to find the corresponding shadow_vcpu_state based on
vmcs12 to do the vmptrld emulation (the real vmcs page of the guest -
vmcs02 - shall be kept in shadow_vcpu_state; it will be added in the
following patches).

Use the hash table shadow_vcpu_table to build the mapping between
vmcs12_pa and shadow_vcpu_state. pKVM is then able to quickly find the
shadow_vcpu_state from vmcs12_pa when emulating vmptrld.

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
Signed-off-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/pkvm.c     | 47 +++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h |  4 +++
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/pkvm.c b/arch/x86/kvm/vmx/pkvm/hyp/pkvm.c
index b110ac43a792..9efedba2b3c9 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/pkvm.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/pkvm.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2022 Intel Corporation
  */
 
+#include <linux/hashtable.h>
 #include <pkvm.h>
 
 #include "pkvm_hyp.h"
@@ -26,6 +27,10 @@ static struct shadow_vm_ref shadow_vms_ref[MAX_SHADOW_VMS];
 #define SHADOW_VCPU_ARRAY(vm) \
 	((struct shadow_vcpu_array *)((void *)(vm) + sizeof(struct pkvm_shadow_vm)))
 
+#define SHADOW_VCPU_HASH_BITS		10
+DEFINE_HASHTABLE(shadow_vcpu_table, SHADOW_VCPU_HASH_BITS);
+static pkvm_spinlock_t shadow_vcpu_table_lock = __PKVM_SPINLOCK_UNLOCKED;
+
 static int allocate_shadow_vm_handle(struct pkvm_shadow_vm *vm)
 {
 	struct shadow_vm_ref *vm_ref;
@@ -133,6 +138,37 @@ static void put_shadow_vm(int shadow_vm_handle)
 	WARN_ON(atomic_dec_if_positive(&vm_ref->refcount) <= 0);
 }
 
+static void add_shadow_vcpu_vmcs12_map(struct shadow_vcpu_state *vcpu)
+{
+	pkvm_spin_lock(&shadow_vcpu_table_lock);
+	hash_add(shadow_vcpu_table, &vcpu->hnode, vcpu->vmcs12_pa);
+	pkvm_spin_unlock(&shadow_vcpu_table_lock);
+}
+
+static void remove_shadow_vcpu_vmcs12_map(struct shadow_vcpu_state *vcpu)
+{
+	pkvm_spin_lock(&shadow_vcpu_table_lock);
+	hash_del(&vcpu->hnode);
+	pkvm_spin_unlock(&shadow_vcpu_table_lock);
+}
+
+s64 find_shadow_vcpu_handle_by_vmcs(unsigned long vmcs12_pa)
+{
+	struct shadow_vcpu_state *shadow_vcpu;
+	s64 handle = -1;
+
+	pkvm_spin_lock(&shadow_vcpu_table_lock);
+	hash_for_each_possible(shadow_vcpu_table, shadow_vcpu, hnode, vmcs12_pa) {
+		if (shadow_vcpu->vmcs12_pa == vmcs12_pa) {
+			handle = shadow_vcpu->shadow_vcpu_handle;
+			break;
+		}
+	}
+	pkvm_spin_unlock(&shadow_vcpu_table_lock);
+
+	return handle;
+}
+
 struct shadow_vcpu_state *get_shadow_vcpu(s64 shadow_vcpu_handle)
 {
 	int shadow_vm_handle = to_shadow_vm_handle(shadow_vcpu_handle);
@@ -197,6 +233,8 @@ static s64 attach_shadow_vcpu_to_vm(struct pkvm_shadow_vm *vm,
 	if (!shadow_vcpu->vm)
 		return -EINVAL;
 
+	add_shadow_vcpu_vmcs12_map(shadow_vcpu);
+
 	pkvm_spin_lock(&vm->lock);
 
 	if (vm->created_vcpus == KVM_MAX_VCPUS) {
@@ -241,12 +279,14 @@ detach_shadow_vcpu_from_vm(struct pkvm_shadow_vm *vm, s64 shadow_vcpu_handle)
 
 	pkvm_spin_unlock(&vm->lock);
 
-	if (shadow_vcpu)
+	if (shadow_vcpu) {
+		remove_shadow_vcpu_vmcs12_map(shadow_vcpu);
 		/*
 		 * Paired with the get_shadow_vm when saving the shadow_vm pointer
 		 * during attaching shadow_vcpu.
 		 */
 		put_shadow_vm(shadow_vcpu->vm->shadow_vm_handle);
+	}
 
 	return shadow_vcpu;
 }
@@ -258,6 +298,7 @@ s64 __pkvm_init_shadow_vcpu(struct kvm_vcpu *hvcpu, int shadow_vm_handle,
 	struct pkvm_shadow_vm *vm;
 	struct shadow_vcpu_state *shadow_vcpu;
 	struct x86_exception e;
+	unsigned long vmcs12_va;
 	s64 shadow_vcpu_handle;
 	int ret;
 
@@ -273,6 +314,10 @@ s64 __pkvm_init_shadow_vcpu(struct kvm_vcpu *hvcpu, int shadow_vm_handle,
 	if (ret < 0)
 		return -EINVAL;
 
+	vmcs12_va = (unsigned long)shadow_vcpu->vmx.vmcs01.vmcs;
+	if (gva2gpa(hvcpu, vmcs12_va, (gpa_t *)&shadow_vcpu->vmcs12_pa, 0, &e))
+		return -EINVAL;
+
 	vm = get_shadow_vm(shadow_vm_handle);
 	if (!vm)
 		return -EINVAL;
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h
index f15a49b3be5d..c574831c6d18 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h
@@ -21,6 +21,9 @@ struct shadow_vcpu_state {
 
 	struct pkvm_shadow_vm *vm;
 
+	struct hlist_node hnode;
+	unsigned long vmcs12_pa;
+
 	struct vcpu_vmx vmx;
 } __aligned(PAGE_SIZE);
 
@@ -74,6 +77,7 @@ s64 __pkvm_init_shadow_vcpu(struct kvm_vcpu *hvcpu, int shadow_vm_handle,
 unsigned long __pkvm_teardown_shadow_vcpu(s64 shadow_vcpu_handle);
 struct shadow_vcpu_state *get_shadow_vcpu(s64 shadow_vcpu_handle);
 void put_shadow_vcpu(s64 shadow_vcpu_handle);
+s64 find_shadow_vcpu_handle_by_vmcs(unsigned long vmcs12_pa);
 
 extern struct pkvm_hyp *pkvm_hyp;
 
-- 
2.25.1


* [RFC PATCH part-5 09/22] pkvm: x86: Add VMXON/VMXOFF emulation
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (7 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 08/22] pkvm: x86: Add hash table mapping for shadow vcpu based on vmcs12_pa Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 10/22] pkvm: x86: Add has_vmcs_field() API for physical vmx capability check Jason Chen CJ
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

The host VM keeps the capability to launch its own guests based on VMX,
so pKVM needs to provide VMX emulation for it. This covers the different
VMX instructions - VMXON/VMXOFF, VMPTRLD/VMCLEAR, VMWRITE/VMREAD, and
VMRESUME/VMLAUNCH.

This patch introduces nested.c and provides emulation for the VMXON and
VMXOFF vmx instructions for the host VM.

The emulation simply does a state check and a revision id validation of
the vmxon region passed to the VMXON instruction; the physical VMX stays
enabled after pKVM initialization.

More thorough permission checks are left as TODOs.
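
For reference, the sequence below is a minimal, hypothetical sketch of
what the host VM side is expected to do before its VMXON reaches this
handler (loosely mirroring KVM's kvm_cpu_vmxon()); the revision id it
writes into the vmxon region is what validate_vmcs_revision_id() checks
against pkvm_hyp->vmcs_config:

	/* host VM side, illustrative only - not part of this patch */
	u64 basic, vmxon_pa;
	u32 *vmxon_region = (u32 *)get_zeroed_page(GFP_KERNEL);

	rdmsrl(MSR_IA32_VMX_BASIC, basic);
	*vmxon_region = (u32)basic & 0x7fffffff;	/* VMCS revision id */
	vmxon_pa = __pa(vmxon_region);
	cr4_set_bits(X86_CR4_VMXE);
	asm volatile("vmxon %0" : : "m"(vmxon_pa) : "cc", "memory");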

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/Makefile |   2 +-
 arch/x86/kvm/vmx/pkvm/hyp/nested.c | 195 +++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/pkvm/hyp/nested.h |  11 ++
 arch/x86/kvm/vmx/pkvm/hyp/vmexit.c |  12 ++
 4 files changed, 219 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/Makefile b/arch/x86/kvm/vmx/pkvm/hyp/Makefile
index 7c6f71f18676..660fd611395f 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/Makefile
+++ b/arch/x86/kvm/vmx/pkvm/hyp/Makefile
@@ -12,7 +12,7 @@ ccflags-y += -D__PKVM_HYP__
 virt-dir	:= ../../../../../../$(KVM_PKVM)
 
 pkvm-hyp-y	:= vmx_asm.o vmexit.o memory.o early_alloc.o pgtable.o mmu.o pkvm.o \
-		   init_finalise.o ept.o idt.o irq.o
+		   init_finalise.o ept.o idt.o irq.o nested.o
 
 ifndef CONFIG_PKVM_INTEL_DEBUG
 lib-dir		:= lib
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.c b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
new file mode 100644
index 000000000000..f5e2eb8f51c8
--- /dev/null
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
@@ -0,0 +1,195 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Intel Corporation
+ */
+
+#include <pkvm.h>
+
+#include "pkvm_hyp.h"
+#include "debug.h"
+
+enum VMXResult {
+	VMsucceed,
+	VMfailValid,
+	VMfailInvalid,
+};
+
+static void nested_vmx_result(enum VMXResult result, int error_number)
+{
+	u64 rflags = vmcs_readl(GUEST_RFLAGS);
+
+	rflags &= ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+			X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF);
+
+	if (result == VMfailValid) {
+		rflags |= X86_EFLAGS_ZF;
+		vmcs_write32(VM_INSTRUCTION_ERROR, error_number);
+	} else if (result == VMfailInvalid) {
+		rflags |= X86_EFLAGS_CF;
+	} else {
+		/* VMsucceed, do nothing */
+	}
+
+	if (result != VMsucceed)
+		pkvm_err("VMX failed: %d/%d", result, error_number);
+
+	vmcs_writel(GUEST_RFLAGS, rflags);
+}
+
+static int get_vmx_mem_address(struct kvm_vcpu *vcpu, unsigned long exit_qualification,
+			u32 vmx_instruction_info, gva_t *ret)
+{
+	gva_t off;
+	struct kvm_segment s;
+
+	/*
+	 * According to Vol. 3B, "Information for VM Exits Due to Instruction
+	 * Execution", on an exit, vmx_instruction_info holds most of the
+	 * addressing components of the operand. Only the displacement part
+	 * is put in exit_qualification (see 3B, "Basic VM-Exit Information").
+	 * For how an actual address is calculated from all these components,
+	 * refer to Vol. 1, "Operand Addressing".
+	 */
+	int  scaling = vmx_instruction_info & 3;
+	int  addr_size = (vmx_instruction_info >> 7) & 7;
+	bool is_reg = vmx_instruction_info & (1u << 10);
+	int  seg_reg = (vmx_instruction_info >> 15) & 7;
+	int  index_reg = (vmx_instruction_info >> 18) & 0xf;
+	bool index_is_valid = !(vmx_instruction_info & (1u << 22));
+	int  base_reg       = (vmx_instruction_info >> 23) & 0xf;
+	bool base_is_valid  = !(vmx_instruction_info & (1u << 27));
+
+	if (is_reg) {
+		/* TODO: inject #UD */
+		return 1;
+	}
+
+	/* Addr = segment_base + offset */
+	/* offset = base + [index * scale] + displacement */
+	off = exit_qualification; /* holds the displacement */
+	if (addr_size == 1)
+		off = (gva_t)sign_extend64(off, 31);
+	else if (addr_size == 0)
+		off = (gva_t)sign_extend64(off, 15);
+	if (base_is_valid)
+		off += vcpu->arch.regs[base_reg];
+	if (index_is_valid)
+		off += vcpu->arch.regs[index_reg] << scaling;
+
+	if (seg_reg == VCPU_SREG_FS)
+		s.base = vmcs_readl(GUEST_FS_BASE);
+	if (seg_reg == VCPU_SREG_GS)
+		s.base = vmcs_readl(GUEST_GS_BASE);
+
+	/* TODO: support more cpu mode beside long mode */
+	/*
+	 * The effective address, i.e. @off, of a memory operand is truncated
+	 * based on the address size of the instruction.  Note that this is
+	 * the *effective address*, i.e. the address prior to accounting for
+	 * the segment's base.
+	 */
+	if (addr_size == 1) /* 32 bit */
+		off &= 0xffffffff;
+	else if (addr_size == 0) /* 16 bit */
+		off &= 0xffff;
+
+	/*
+	 * The virtual/linear address is never truncated in 64-bit
+	 * mode, e.g. a 32-bit address size can yield a 64-bit virtual
+	 * address when using FS/GS with a non-zero base.
+	 */
+	if (seg_reg == VCPU_SREG_FS || seg_reg == VCPU_SREG_GS)
+		*ret = s.base + off;
+	else
+		*ret = off;
+
+	/* TODO: check addr is canonical, otherwise inject #GP/#SS */
+
+	return 0;
+}
+
+static int nested_vmx_get_vmptr(struct kvm_vcpu *vcpu, gpa_t *vmpointer,
+				int *ret)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	struct x86_exception e;
+	int r;
+
+	if (get_vmx_mem_address(vcpu, vmx->exit_qualification,
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva)) {
+		*ret = 1;
+		return -EINVAL;
+	}
+
+	r = read_gva(vcpu, gva, vmpointer, sizeof(*vmpointer), &e);
+	if (r < 0) {
+		/*TODO: handle memory failure exception */
+		*ret = 1;
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int validate_vmcs_revision_id(struct kvm_vcpu *vcpu, gpa_t vmpointer)
+{
+	struct vmcs_config *vmcs_config = &pkvm_hyp->vmcs_config;
+	u32 rev_id;
+
+	read_gpa(vcpu, vmpointer, &rev_id, sizeof(rev_id));
+
+	return (rev_id == vmcs_config->revision_id);
+}
+
+static bool check_vmx_permission(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	bool permit = true;
+
+	/*TODO: check more env (cr, cpl) and inject #UD/#GP */
+	if (!vmx->nested.vmxon)
+		permit = false;
+
+	return permit;
+}
+
+int handle_vmxon(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gpa_t vmptr;
+	int r;
+
+	/*TODO: check env error(cr, efer, rflags, cpl) */
+	if (vmx->nested.vmxon) {
+		nested_vmx_result(VMfailValid, VMXERR_VMXON_IN_VMX_ROOT_OPERATION);
+	} else {
+		if (nested_vmx_get_vmptr(vcpu, &vmptr, &r)) {
+			nested_vmx_result(VMfailInvalid, 0);
+			return r;
+		} else if (!validate_vmcs_revision_id(vcpu, vmptr)) {
+			nested_vmx_result(VMfailInvalid, 0);
+		} else {
+			vmx->nested.vmxon_ptr = vmptr;
+			vmx->nested.vmxon = true;
+
+			nested_vmx_result(VMsucceed, 0);
+		}
+	}
+
+	return 0;
+}
+
+int handle_vmxoff(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (check_vmx_permission(vcpu)) {
+		vmx->nested.vmxon = false;
+		vmx->nested.vmxon_ptr = INVALID_GPA;
+
+		nested_vmx_result(VMsucceed, 0);
+	}
+
+	return 0;
+}
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.h b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
new file mode 100644
index 000000000000..2d21edaddb25
--- /dev/null
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Intel Corporation
+ */
+#ifndef __PKVM_NESTED_H
+#define __PKVM_NESTED_H
+
+int handle_vmxon(struct kvm_vcpu *vcpu);
+int handle_vmxoff(struct kvm_vcpu *vcpu);
+
+#endif
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
index 6b82b6be612c..fa67cab803a8 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
@@ -9,6 +9,7 @@
 #include "vmexit.h"
 #include "ept.h"
 #include "pkvm_hyp.h"
+#include "nested.h"
 #include "debug.h"
 
 #define CR4	4
@@ -168,6 +169,7 @@ int pkvm_main(struct kvm_vcpu *vcpu)
 
 		vcpu->arch.cr2 = native_read_cr2();
 		vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+		vcpu->arch.regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
 
 		vmx->exit_reason.full = vmcs_read32(VM_EXIT_REASON);
 		vmx->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -194,6 +196,16 @@ int pkvm_main(struct kvm_vcpu *vcpu)
 			handle_write_msr(vcpu);
 			skip_instruction = true;
 			break;
+		case EXIT_REASON_VMON:
+			pkvm_dbg("CPU%d vmexit reason: VMXON.\n", vcpu->cpu);
+			handle_vmxon(vcpu);
+			skip_instruction = true;
+			break;
+		case EXIT_REASON_VMOFF:
+			pkvm_dbg("CPU%d vmexit reason: VMXOFF.\n", vcpu->cpu);
+			handle_vmxoff(vcpu);
+			skip_instruction = true;
+			break;
 		case EXIT_REASON_XSETBV:
 			handle_xsetbv(vcpu);
 			skip_instruction = true;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 10/22] pkvm: x86: Add has_vmcs_field() API for physical vmx capability check
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (8 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 09/22] pkvm: x86: Add VMXON/VMXOFF emulation Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 11/22] KVM: VMX: Add more vmcs and vmcs12 fields definition Jason Chen CJ
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Tina Zhang, Chuanxiao Dong, Jason Chen CJ

From: Tina Zhang <tina.zhang@intel.com>

Some fields in the VMCS exist only on processors that support the
1-setting of the corresponding control fields [1]. A VMREAD/VMWRITE
from/to an unsupported VMCS component leads to VMfailValid [2].

Introduce a function called has_vmcs_field() which can be used to check
whether a field exists in the VMCS before using VMREAD/VMWRITE to access
it.
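
A minimal usage sketch (the call site and the field chosen here are only
examples, not part of this patch):

	if (has_vmcs_field(ENCLS_EXITING_BITMAP))
		vmcs_write64(ENCLS_EXITING_BITMAP, -1ull);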

[1]: SDM: Appendix B Field Encoding in VMCS, NOTES.
[2]: SDM: VMX Instruction Reference chapter, VMWRITE/VMREAD.

Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Signed-off-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/nested.c | 115 +++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/pkvm/pkvm_host.c  |  29 ++++++++
 2 files changed, 144 insertions(+)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.c b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
index f5e2eb8f51c8..31ad33f2cdbf 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
@@ -8,6 +8,121 @@
 #include "pkvm_hyp.h"
 #include "debug.h"
 
+/*
+ * According to SDM Appendix B Field Encoding in VMCS, some fields only
+ * exist on processors that support the 1-setting of the corresponding
+ * fields in the control regs.
+ */
+static bool has_vmcs_field(u16 encoding)
+{
+	struct nested_vmx_msrs *msrs = &pkvm_hyp->vmcs_config.nested;
+
+	switch (encoding) {
+	case MSR_BITMAP:
+		return msrs->procbased_ctls_high & CPU_BASED_USE_MSR_BITMAPS;
+	case VIRTUAL_APIC_PAGE_ADDR:
+	case VIRTUAL_APIC_PAGE_ADDR_HIGH:
+	case TPR_THRESHOLD:
+		return msrs->procbased_ctls_high & CPU_BASED_TPR_SHADOW;
+	case SECONDARY_VM_EXEC_CONTROL:
+		return msrs->procbased_ctls_high &
+			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+
+	case VIRTUAL_PROCESSOR_ID:
+		return msrs->secondary_ctls_high & SECONDARY_EXEC_ENABLE_VPID;
+	case XSS_EXIT_BITMAP:
+		return msrs->secondary_ctls_high & SECONDARY_EXEC_XSAVES;
+	case PML_ADDRESS:
+		return msrs->secondary_ctls_high & SECONDARY_EXEC_ENABLE_PML;
+	case VM_FUNCTION_CONTROL:
+		return msrs->secondary_ctls_high & SECONDARY_EXEC_ENABLE_VMFUNC;
+	case EPT_POINTER:
+		return msrs->secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT;
+	case EOI_EXIT_BITMAP0:
+	case EOI_EXIT_BITMAP1:
+	case EOI_EXIT_BITMAP2:
+	case EOI_EXIT_BITMAP3:
+		return msrs->secondary_ctls_high &
+			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
+	case VMREAD_BITMAP:
+	case VMWRITE_BITMAP:
+		return msrs->secondary_ctls_high & SECONDARY_EXEC_SHADOW_VMCS;
+	case ENCLS_EXITING_BITMAP:
+		return msrs->secondary_ctls_high &
+			SECONDARY_EXEC_ENCLS_EXITING;
+	case GUEST_INTR_STATUS:
+		return msrs->secondary_ctls_high &
+			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
+	case GUEST_PML_INDEX:
+		return msrs->secondary_ctls_high & SECONDARY_EXEC_ENABLE_PML;
+	case APIC_ACCESS_ADDR:
+	case APIC_ACCESS_ADDR_HIGH:
+		return msrs->secondary_ctls_high &
+			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+	case TSC_MULTIPLIER:
+	case TSC_MULTIPLIER_HIGH:
+		return msrs->secondary_ctls_high &
+			SECONDARY_EXEC_TSC_SCALING;
+	case GUEST_PHYSICAL_ADDRESS:
+	case GUEST_PHYSICAL_ADDRESS_HIGH:
+		return msrs->secondary_ctls_high &
+			SECONDARY_EXEC_ENABLE_EPT;
+	case GUEST_PDPTR0:
+	case GUEST_PDPTR0_HIGH:
+	case GUEST_PDPTR1:
+	case GUEST_PDPTR1_HIGH:
+	case GUEST_PDPTR2:
+	case GUEST_PDPTR2_HIGH:
+	case GUEST_PDPTR3:
+	case GUEST_PDPTR3_HIGH:
+		return msrs->secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT;
+	case PLE_GAP:
+	case PLE_WINDOW:
+		return msrs->secondary_ctls_high &
+			SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+
+	case VMX_PREEMPTION_TIMER_VALUE:
+		return msrs->pinbased_ctls_high &
+			PIN_BASED_VMX_PREEMPTION_TIMER;
+	case POSTED_INTR_DESC_ADDR:
+		return msrs->pinbased_ctls_high & PIN_BASED_POSTED_INTR;
+	case POSTED_INTR_NV:
+		return msrs->pinbased_ctls_high & PIN_BASED_POSTED_INTR;
+	case GUEST_IA32_PAT:
+	case GUEST_IA32_PAT_HIGH:
+		return (msrs->entry_ctls_high & VM_ENTRY_LOAD_IA32_PAT) ||
+			(msrs->exit_ctls_high & VM_EXIT_SAVE_IA32_PAT);
+	case GUEST_IA32_EFER:
+	case GUEST_IA32_EFER_HIGH:
+		return (msrs->entry_ctls_high & VM_ENTRY_LOAD_IA32_EFER) ||
+			(msrs->exit_ctls_high & VM_EXIT_SAVE_IA32_EFER);
+	case GUEST_IA32_PERF_GLOBAL_CTRL:
+	case GUEST_IA32_PERF_GLOBAL_CTRL_HIGH:
+		return msrs->entry_ctls_high & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+	case GUEST_BNDCFGS:
+	case GUEST_BNDCFGS_HIGH:
+		return (msrs->entry_ctls_high & VM_ENTRY_LOAD_BNDCFGS) ||
+			(msrs->exit_ctls_high & VM_EXIT_CLEAR_BNDCFGS);
+	case GUEST_IA32_RTIT_CTL:
+	case GUEST_IA32_RTIT_CTL_HIGH:
+		return (msrs->entry_ctls_high & VM_ENTRY_LOAD_IA32_RTIT_CTL) ||
+			(msrs->exit_ctls_high & VM_EXIT_CLEAR_IA32_RTIT_CTL);
+	case HOST_IA32_PAT:
+	case HOST_IA32_PAT_HIGH:
+		return msrs->exit_ctls_high & VM_EXIT_LOAD_IA32_PAT;
+	case HOST_IA32_EFER:
+	case HOST_IA32_EFER_HIGH:
+		return msrs->exit_ctls_high & VM_EXIT_LOAD_IA32_EFER;
+	case HOST_IA32_PERF_GLOBAL_CTRL:
+	case HOST_IA32_PERF_GLOBAL_CTRL_HIGH:
+		return msrs->exit_ctls_high & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+	case EPTP_LIST_ADDRESS:
+		return msrs->vmfunc_controls & VMX_VMFUNC_EPTP_SWITCHING;
+	default:
+		return true;
+	}
+}
+
 enum VMXResult {
 	VMsucceed,
 	VMfailValid,
diff --git a/arch/x86/kvm/vmx/pkvm/pkvm_host.c b/arch/x86/kvm/vmx/pkvm/pkvm_host.c
index 2dff1123b61f..4ea82a147af5 100644
--- a/arch/x86/kvm/vmx/pkvm/pkvm_host.c
+++ b/arch/x86/kvm/vmx/pkvm/pkvm_host.c
@@ -426,6 +426,33 @@ static __init void pkvm_host_deinit_vmx(struct pkvm_host_vcpu *vcpu)
 		vmx->vmcs01.msr_bitmap = NULL;
 }
 
+static __init void pkvm_host_setup_nested_vmx_cap(struct pkvm_hyp *pkvm)
+{
+	struct nested_vmx_msrs *msrs = &pkvm->vmcs_config.nested;
+
+	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
+		msrs->procbased_ctls_low,
+		msrs->procbased_ctls_high);
+
+	rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS2,
+			&msrs->secondary_ctls_low,
+			&msrs->secondary_ctls_high);
+
+	rdmsr(MSR_IA32_VMX_PINBASED_CTLS,
+		msrs->pinbased_ctls_low,
+		msrs->pinbased_ctls_high);
+
+	rdmsrl_safe(MSR_IA32_VMX_VMFUNC, &msrs->vmfunc_controls);
+
+	rdmsr(MSR_IA32_VMX_EXIT_CTLS,
+		msrs->exit_ctls_low,
+		msrs->exit_ctls_high);
+
+	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
+		msrs->entry_ctls_low,
+		msrs->entry_ctls_high);
+}
+
 static __init int pkvm_host_check_and_setup_vmx_cap(struct pkvm_hyp *pkvm)
 {
 	struct vmcs_config *vmcs_config = &pkvm->vmcs_config;
@@ -476,6 +503,8 @@ static __init int pkvm_host_check_and_setup_vmx_cap(struct pkvm_hyp *pkvm)
 		pr_info("vmentry_ctrl 0x%x\n", vmcs_config->vmentry_ctrl);
 	}
 
+	pkvm_host_setup_nested_vmx_cap(pkvm);
+
 	return ret;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 11/22] KVM: VMX: Add more vmcs and vmcs12 fields definition
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (9 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 10/22] pkvm: x86: Add has_vmcs_field() API for physical vmx capability check Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 12/22] pkvm: x86: Init vmcs read/write bitmap for vmcs emulation Jason Chen CJ
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

Add more fields definition for vmcs and vmcs12, which can be used to
extend vmcs shadow fields support for VMX emulation.

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/include/asm/vmx.h |  4 ++++
 arch/x86/kvm/vmx/vmcs12.c  |  6 ++++++
 arch/x86/kvm/vmx/vmcs12.h  | 16 ++++++++++++++--
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 498dc600bd5c..d9f119bab5b2 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -322,6 +322,10 @@ enum vmcs_field {
 	CR3_TARGET_VALUE2               = 0x0000600c,
 	CR3_TARGET_VALUE3               = 0x0000600e,
 	EXIT_QUALIFICATION              = 0x00006400,
+	EXIT_IO_RCX	                = 0x00006402,
+	EXIT_IO_RSI	                = 0x00006404,
+	EXIT_IO_RDI	                = 0x00006406,
+	EXIT_IO_RIP	                = 0x00006408,
 	GUEST_LINEAR_ADDRESS            = 0x0000640a,
 	GUEST_CR0                       = 0x00006800,
 	GUEST_CR3                       = 0x00006802,
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 2251b60920f8..6ab29b869914 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -112,6 +112,8 @@ const unsigned short vmcs12_field_offsets[] = {
 	FIELD(GUEST_SYSENTER_CS, guest_sysenter_cs),
 	FIELD(HOST_IA32_SYSENTER_CS, host_ia32_sysenter_cs),
 	FIELD(VMX_PREEMPTION_TIMER_VALUE, vmx_preemption_timer_value),
+	FIELD(PLE_GAP, ple_gap),
+	FIELD(PLE_WINDOW, ple_window),
 	FIELD(CR0_GUEST_HOST_MASK, cr0_guest_host_mask),
 	FIELD(CR4_GUEST_HOST_MASK, cr4_guest_host_mask),
 	FIELD(CR0_READ_SHADOW, cr0_read_shadow),
@@ -150,5 +152,9 @@ const unsigned short vmcs12_field_offsets[] = {
 	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
 	FIELD(HOST_RSP, host_rsp),
 	FIELD(HOST_RIP, host_rip),
+	FIELD(EXIT_IO_RCX, exit_io_rcx),
+	FIELD(EXIT_IO_RSI, exit_io_rsi),
+	FIELD(EXIT_IO_RDI, exit_io_rdi),
+	FIELD(EXIT_IO_RIP, exit_io_rip),
 };
 const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 01936013428b..92483940bb40 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -117,7 +117,11 @@ struct __packed vmcs12 {
 	natural_width host_ia32_sysenter_eip;
 	natural_width host_rsp;
 	natural_width host_rip;
-	natural_width paddingl[8]; /* room for future expansion */
+	natural_width exit_io_rcx;
+	natural_width exit_io_rsi;
+	natural_width exit_io_rdi;
+	natural_width exit_io_rip;
+	natural_width paddingl[4]; /* room for future expansion */
 	u32 pin_based_vm_exec_control;
 	u32 cpu_based_vm_exec_control;
 	u32 exception_bitmap;
@@ -165,7 +169,9 @@ struct __packed vmcs12 {
 	u32 guest_sysenter_cs;
 	u32 host_ia32_sysenter_cs;
 	u32 vmx_preemption_timer_value;
-	u32 padding32[7]; /* room for future expansion */
+	u32 ple_gap;
+	u32 ple_window;
+	u32 padding32[5]; /* room for future expansion */
 	u16 virtual_processor_id;
 	u16 posted_intr_nv;
 	u16 guest_es_selector;
@@ -292,6 +298,10 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(host_ia32_sysenter_eip, 656);
 	CHECK_OFFSET(host_rsp, 664);
 	CHECK_OFFSET(host_rip, 672);
+	CHECK_OFFSET(exit_io_rcx, 680);
+	CHECK_OFFSET(exit_io_rsi, 688);
+	CHECK_OFFSET(exit_io_rdi, 696);
+	CHECK_OFFSET(exit_io_rip, 704);
 	CHECK_OFFSET(pin_based_vm_exec_control, 744);
 	CHECK_OFFSET(cpu_based_vm_exec_control, 748);
 	CHECK_OFFSET(exception_bitmap, 752);
@@ -339,6 +349,8 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(guest_sysenter_cs, 920);
 	CHECK_OFFSET(host_ia32_sysenter_cs, 924);
 	CHECK_OFFSET(vmx_preemption_timer_value, 928);
+	CHECK_OFFSET(ple_gap, 932);
+	CHECK_OFFSET(ple_window, 936);
 	CHECK_OFFSET(virtual_processor_id, 960);
 	CHECK_OFFSET(posted_intr_nv, 962);
 	CHECK_OFFSET(guest_es_selector, 964);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 12/22] pkvm: x86: Init vmcs read/write bitmap for vmcs emulation
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (10 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 11/22] KVM: VMX: Add more vmcs and vmcs12 fields definition Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 13/22] pkvm: x86: Initialize emulated fields " Jason Chen CJ
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

As pKVM is designed to use shadow vmcs to support nested guests, the
vmread/vmwrite bitmaps are prepared by filtering out the shadowed field
bits. These bitmaps are finally written into the VMREAD_BITMAP/
VMWRITE_BITMAP fields to indicate for which fields the VMREAD/VMWRITE
instructions are intercepted. Meanwhile, the shadowed fields can be
accessed by the host VM with VMREAD/VMWRITE directly without causing a
vmexit [1].

Introduce pkvm_nested_vmcs_fields.h to pre-define the shadow fields,
derived from vmx/vmcs_shadow_fields.h.

[1]: SDM: Virtual Machine Control Structures chapter, VMCS TYPES.
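
As a hedged sketch, once a guest vmcs02 is set up with the "VMCS
shadowing" execution control, the bitmaps built by
init_vmcs_shadow_fields() would be installed along these lines (the
exact call site is not part of this patch, and the __pa() translation is
only illustrative for the hypervisor context):

	secondary_exec_controls_setbit(vmx, SECONDARY_EXEC_SHADOW_VMCS);
	vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap));
	vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap));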

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/init_finalise.c     |   3 +
 arch/x86/kvm/vmx/pkvm/hyp/nested.c            |  77 +++++++++
 arch/x86/kvm/vmx/pkvm/hyp/nested.h            |   1 +
 .../vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h    | 156 ++++++++++++++++++
 4 files changed, 237 insertions(+)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/init_finalise.c b/arch/x86/kvm/vmx/pkvm/hyp/init_finalise.c
index 8c585a73237a..c16b53b7bcd0 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/init_finalise.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/init_finalise.c
@@ -17,6 +17,7 @@
 #include "mmu.h"
 #include "ept.h"
 #include "vmx.h"
+#include "nested.h"
 #include "debug.h"
 
 void *pkvm_mmu_pgt_base;
@@ -288,6 +289,8 @@ int __pkvm_init_finalise(struct kvm_vcpu *vcpu, struct pkvm_section sections[],
 	if (ret)
 		goto out;
 
+	pkvm_init_nest();
+
 	pkvm_init = true;
 
 switch_pgt:
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.c b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
index 31ad33f2cdbf..8ae37feda5ff 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
@@ -129,6 +129,78 @@ enum VMXResult {
 	VMfailInvalid,
 };
 
+struct shadow_vmcs_field {
+	u16	encoding;
+	u16	offset;
+};
+
+static u8 vmx_vmread_bitmap[PAGE_SIZE] __aligned(PAGE_SIZE);
+static u8 vmx_vmwrite_bitmap[PAGE_SIZE] __aligned(PAGE_SIZE);
+
+static struct shadow_vmcs_field shadow_read_only_fields[] = {
+#define SHADOW_FIELD_RO(x, y) { x, offsetof(struct vmcs12, y) },
+#include "pkvm_nested_vmcs_fields.h"
+};
+static int max_shadow_read_only_fields =
+	ARRAY_SIZE(shadow_read_only_fields);
+static struct shadow_vmcs_field shadow_read_write_fields[] = {
+#define SHADOW_FIELD_RW(x, y) { x, offsetof(struct vmcs12, y) },
+#include "pkvm_nested_vmcs_fields.h"
+};
+static int max_shadow_read_write_fields =
+	ARRAY_SIZE(shadow_read_write_fields);
+
+static void init_vmcs_shadow_fields(void)
+{
+	int i, j;
+
+	memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE);
+	memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE);
+
+	for (i = j = 0; i < max_shadow_read_only_fields; i++) {
+		struct shadow_vmcs_field entry = shadow_read_only_fields[i];
+		u16 field = entry.encoding;
+
+		if (!has_vmcs_field(field))
+			continue;
+
+		if (vmcs_field_width(field) == VMCS_FIELD_WIDTH_U64 &&
+		    (i + 1 == max_shadow_read_only_fields ||
+		     shadow_read_only_fields[i + 1].encoding != field + 1)) {
+			pkvm_err("Missing field from shadow_read_only_field %x\n",
+			       field + 1);
+		}
+
+		clear_bit(field, (unsigned long *)vmx_vmread_bitmap);
+		if (field & 1)
+			continue;
+		shadow_read_only_fields[j++] = entry;
+	}
+	max_shadow_read_only_fields = j;
+
+	for (i = j = 0; i < max_shadow_read_write_fields; i++) {
+		struct shadow_vmcs_field entry = shadow_read_write_fields[i];
+		u16 field = entry.encoding;
+
+		if (!has_vmcs_field(field))
+			continue;
+
+		if (vmcs_field_width(field) == VMCS_FIELD_WIDTH_U64 &&
+		    (i + 1 == max_shadow_read_write_fields ||
+		     shadow_read_write_fields[i + 1].encoding != field + 1)) {
+			pkvm_err("Missing field from shadow_read_write_field %x\n",
+			       field + 1);
+		}
+
+		clear_bit(field, (unsigned long *)vmx_vmwrite_bitmap);
+		clear_bit(field, (unsigned long *)vmx_vmread_bitmap);
+		if (field & 1)
+			continue;
+		shadow_read_write_fields[j++] = entry;
+	}
+	max_shadow_read_write_fields = j;
+}
+
 static void nested_vmx_result(enum VMXResult result, int error_number)
 {
 	u64 rflags = vmcs_readl(GUEST_RFLAGS);
@@ -308,3 +380,8 @@ int handle_vmxoff(struct kvm_vcpu *vcpu)
 
 	return 0;
 }
+
+void pkvm_init_nest(void)
+{
+	init_vmcs_shadow_fields();
+}
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.h b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
index 2d21edaddb25..16b70b13e80e 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
@@ -7,5 +7,6 @@
 
 int handle_vmxon(struct kvm_vcpu *vcpu);
 int handle_vmxoff(struct kvm_vcpu *vcpu);
+void pkvm_init_nest(void);
 
 #endif
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h
new file mode 100644
index 000000000000..4380d415428f
--- /dev/null
+++ b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Intel Corporation
+ */
+#if !defined(SHADOW_FIELD_RW) && !defined(SHADOW_FIELD_RO)
+BUILD_BUG_ON(1)
+#endif
+
+#ifndef SHADOW_FIELD_RW
+#define SHADOW_FIELD_RW(x, y)
+#endif
+#ifndef SHADOW_FIELD_RO
+#define SHADOW_FIELD_RO(x, y)
+#endif
+
+/*
+ * Shadow fields for vmcs02:
+ *
+ * These fields are HW shadowing in vmcs02, we try to shadow all non-host
+ * fields except emulated ones.
+ * Host state fields need to be recorded in cached_vmcs12 and restored to vmcs01's
+ * guest state when returning to L1 host, so please ensure __NO__ host fields below.
+ */
+
+/* 16-bits */
+SHADOW_FIELD_RW(POSTED_INTR_NV, posted_intr_nv)
+SHADOW_FIELD_RW(GUEST_ES_SELECTOR, guest_es_selector)
+SHADOW_FIELD_RW(GUEST_CS_SELECTOR, guest_cs_selector)
+SHADOW_FIELD_RW(GUEST_SS_SELECTOR, guest_ss_selector)
+SHADOW_FIELD_RW(GUEST_DS_SELECTOR, guest_ds_selector)
+SHADOW_FIELD_RW(GUEST_FS_SELECTOR, guest_fs_selector)
+SHADOW_FIELD_RW(GUEST_GS_SELECTOR, guest_gs_selector)
+SHADOW_FIELD_RW(GUEST_LDTR_SELECTOR, guest_ldtr_selector)
+SHADOW_FIELD_RW(GUEST_TR_SELECTOR, guest_tr_selector)
+SHADOW_FIELD_RW(GUEST_INTR_STATUS, guest_intr_status)
+SHADOW_FIELD_RW(GUEST_PML_INDEX, guest_pml_index)
+
+/* 32-bits */
+SHADOW_FIELD_RW(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control)
+SHADOW_FIELD_RW(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control)
+SHADOW_FIELD_RW(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control)
+SHADOW_FIELD_RW(EXCEPTION_BITMAP, exception_bitmap)
+SHADOW_FIELD_RW(PAGE_FAULT_ERROR_CODE_MASK, page_fault_error_code_mask)
+SHADOW_FIELD_RW(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match)
+SHADOW_FIELD_RW(CR3_TARGET_COUNT, cr3_target_count)
+SHADOW_FIELD_RW(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count)
+SHADOW_FIELD_RW(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count)
+SHADOW_FIELD_RW(VM_ENTRY_MSR_LOAD_COUNT, vm_entry_msr_load_count)
+SHADOW_FIELD_RW(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field)
+SHADOW_FIELD_RW(VM_ENTRY_EXCEPTION_ERROR_CODE, vm_entry_exception_error_code)
+SHADOW_FIELD_RW(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len)
+SHADOW_FIELD_RW(TPR_THRESHOLD, tpr_threshold)
+SHADOW_FIELD_RW(GUEST_ES_LIMIT, guest_es_limit)
+SHADOW_FIELD_RW(GUEST_CS_LIMIT, guest_cs_limit)
+SHADOW_FIELD_RW(GUEST_SS_LIMIT, guest_ss_limit)
+SHADOW_FIELD_RW(GUEST_DS_LIMIT, guest_ds_limit)
+SHADOW_FIELD_RW(GUEST_FS_LIMIT, guest_fs_limit)
+SHADOW_FIELD_RW(GUEST_GS_LIMIT, guest_gs_limit)
+SHADOW_FIELD_RW(GUEST_LDTR_LIMIT, guest_ldtr_limit)
+SHADOW_FIELD_RW(GUEST_TR_LIMIT, guest_tr_limit)
+SHADOW_FIELD_RW(GUEST_GDTR_LIMIT, guest_gdtr_limit)
+SHADOW_FIELD_RW(GUEST_IDTR_LIMIT, guest_idtr_limit)
+SHADOW_FIELD_RW(GUEST_ES_AR_BYTES, guest_es_ar_bytes)
+SHADOW_FIELD_RW(GUEST_CS_AR_BYTES, guest_cs_ar_bytes)
+SHADOW_FIELD_RW(GUEST_SS_AR_BYTES, guest_ss_ar_bytes)
+SHADOW_FIELD_RW(GUEST_DS_AR_BYTES, guest_ds_ar_bytes)
+SHADOW_FIELD_RW(GUEST_FS_AR_BYTES, guest_fs_ar_bytes)
+SHADOW_FIELD_RW(GUEST_GS_AR_BYTES, guest_gs_ar_bytes)
+SHADOW_FIELD_RW(GUEST_LDTR_AR_BYTES, guest_ldtr_ar_bytes)
+SHADOW_FIELD_RW(GUEST_TR_AR_BYTES, guest_tr_ar_bytes)
+SHADOW_FIELD_RW(GUEST_INTERRUPTIBILITY_INFO, guest_interruptibility_info)
+SHADOW_FIELD_RW(GUEST_ACTIVITY_STATE, guest_activity_state)
+SHADOW_FIELD_RW(GUEST_SYSENTER_CS, guest_sysenter_cs)
+SHADOW_FIELD_RW(VMX_PREEMPTION_TIMER_VALUE, vmx_preemption_timer_value)
+SHADOW_FIELD_RW(PLE_GAP, ple_gap)
+SHADOW_FIELD_RW(PLE_WINDOW, ple_window)
+
+/* Natural width */
+SHADOW_FIELD_RW(CR0_GUEST_HOST_MASK, cr0_guest_host_mask)
+SHADOW_FIELD_RW(CR4_GUEST_HOST_MASK, cr4_guest_host_mask)
+SHADOW_FIELD_RW(CR0_READ_SHADOW, cr0_read_shadow)
+SHADOW_FIELD_RW(CR4_READ_SHADOW, cr4_read_shadow)
+SHADOW_FIELD_RW(GUEST_CR0, guest_cr0)
+SHADOW_FIELD_RW(GUEST_CR3, guest_cr3)
+SHADOW_FIELD_RW(GUEST_CR4, guest_cr4)
+SHADOW_FIELD_RW(GUEST_ES_BASE, guest_es_base)
+SHADOW_FIELD_RW(GUEST_CS_BASE, guest_cs_base)
+SHADOW_FIELD_RW(GUEST_SS_BASE, guest_ss_base)
+SHADOW_FIELD_RW(GUEST_DS_BASE, guest_ds_base)
+SHADOW_FIELD_RW(GUEST_FS_BASE, guest_fs_base)
+SHADOW_FIELD_RW(GUEST_GS_BASE, guest_gs_base)
+SHADOW_FIELD_RW(GUEST_LDTR_BASE, guest_ldtr_base)
+SHADOW_FIELD_RW(GUEST_TR_BASE, guest_tr_base)
+SHADOW_FIELD_RW(GUEST_GDTR_BASE, guest_gdtr_base)
+SHADOW_FIELD_RW(GUEST_IDTR_BASE, guest_idtr_base)
+SHADOW_FIELD_RW(GUEST_DR7, guest_dr7)
+SHADOW_FIELD_RW(GUEST_RSP, guest_rsp)
+SHADOW_FIELD_RW(GUEST_RIP, guest_rip)
+SHADOW_FIELD_RW(GUEST_RFLAGS, guest_rflags)
+SHADOW_FIELD_RW(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions)
+SHADOW_FIELD_RW(GUEST_SYSENTER_ESP, guest_sysenter_esp)
+SHADOW_FIELD_RW(GUEST_SYSENTER_EIP, guest_sysenter_eip)
+
+/* 64-bit */
+SHADOW_FIELD_RW(TSC_OFFSET, tsc_offset)
+SHADOW_FIELD_RW(TSC_OFFSET_HIGH, tsc_offset)
+SHADOW_FIELD_RW(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr)
+SHADOW_FIELD_RW(VIRTUAL_APIC_PAGE_ADDR_HIGH, virtual_apic_page_addr)
+SHADOW_FIELD_RW(APIC_ACCESS_ADDR, apic_access_addr)
+SHADOW_FIELD_RW(APIC_ACCESS_ADDR_HIGH, apic_access_addr)
+SHADOW_FIELD_RW(TSC_MULTIPLIER, tsc_multiplier)
+SHADOW_FIELD_RW(TSC_MULTIPLIER_HIGH, tsc_multiplier)
+SHADOW_FIELD_RW(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl)
+SHADOW_FIELD_RW(GUEST_IA32_DEBUGCTL_HIGH, guest_ia32_debugctl)
+SHADOW_FIELD_RW(GUEST_IA32_PAT, guest_ia32_pat)
+SHADOW_FIELD_RW(GUEST_IA32_PAT_HIGH, guest_ia32_pat)
+SHADOW_FIELD_RW(GUEST_IA32_EFER, guest_ia32_efer)
+SHADOW_FIELD_RW(GUEST_IA32_EFER_HIGH, guest_ia32_efer)
+SHADOW_FIELD_RW(GUEST_IA32_PERF_GLOBAL_CTRL, guest_ia32_perf_global_ctrl)
+SHADOW_FIELD_RW(GUEST_IA32_PERF_GLOBAL_CTRL_HIGH, guest_ia32_perf_global_ctrl)
+SHADOW_FIELD_RW(GUEST_PDPTR0, guest_pdptr0)
+SHADOW_FIELD_RW(GUEST_PDPTR0_HIGH, guest_pdptr0)
+SHADOW_FIELD_RW(GUEST_PDPTR1, guest_pdptr1)
+SHADOW_FIELD_RW(GUEST_PDPTR1_HIGH, guest_pdptr1)
+SHADOW_FIELD_RW(GUEST_PDPTR2, guest_pdptr2)
+SHADOW_FIELD_RW(GUEST_PDPTR2_HIGH, guest_pdptr2)
+SHADOW_FIELD_RW(GUEST_PDPTR3, guest_pdptr3)
+SHADOW_FIELD_RW(GUEST_PDPTR3_HIGH, guest_pdptr3)
+SHADOW_FIELD_RW(GUEST_BNDCFGS, guest_bndcfgs)
+SHADOW_FIELD_RW(GUEST_BNDCFGS_HIGH, guest_bndcfgs)
+
+/* 32-bits */
+SHADOW_FIELD_RO(VM_INSTRUCTION_ERROR, vm_instruction_error)
+SHADOW_FIELD_RO(VM_EXIT_REASON, vm_exit_reason)
+SHADOW_FIELD_RO(VM_EXIT_INTR_INFO, vm_exit_intr_info)
+SHADOW_FIELD_RO(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code)
+SHADOW_FIELD_RO(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field)
+SHADOW_FIELD_RO(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code)
+SHADOW_FIELD_RO(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len)
+SHADOW_FIELD_RO(VMX_INSTRUCTION_INFO, vmx_instruction_info)
+
+/* Natural width */
+SHADOW_FIELD_RO(EXIT_QUALIFICATION, exit_qualification)
+SHADOW_FIELD_RO(EXIT_IO_RCX, exit_io_rcx)
+SHADOW_FIELD_RO(EXIT_IO_RSI, exit_io_rsi)
+SHADOW_FIELD_RO(EXIT_IO_RDI, exit_io_rdi)
+SHADOW_FIELD_RO(EXIT_IO_RIP, exit_io_rip)
+SHADOW_FIELD_RO(GUEST_LINEAR_ADDRESS, guest_linear_address)
+
+/* 64-bit */
+SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS, guest_physical_address)
+SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH, guest_physical_address)
+
+#undef SHADOW_FIELD_RW
+#undef SHADOW_FIELD_RO
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 13/22] pkvm: x86: Initialize emulated fields for vmcs emulation
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (11 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 12/22] pkvm: x86: Init vmcs read/write bitmap for vmcs emulation Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 14/22] pkvm: x86: Add msr ops for pKVM hypervisor Jason Chen CJ
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

For the vmcs shadow fields, the host VM accesses them directly through
VMREAD/VMWRITE without a vmexit.

Meanwhile, for the other vmcs fields, a VMREAD/VMWRITE from the host VM
causes a vmread/vmwrite vmexit which pKVM needs to handle. Such fields
fall into two categories: host state fields, which pKVM only needs to
record with the value the host VM set, and emulated fields, which pKVM
must emulate to ensure isolation and security.

Introduce a macro EMULATED_FIELD_RW in pkvm_nested_vmcs_fields.h to help
pre-define the emulated fields for vmcs emulation, and use it to
initialize the emulated_fields[] array for future emulation (see the
sketch below).
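
A hedged sketch of how emulated_fields[] is expected to be consumed by
the vmwrite vmexit handler added later in this series;
is_emulated_field() is a hypothetical helper standing in for a lookup
over emulated_fields[]:

	/* on a vmwrite vmexit for a non-shadowed field */
	vmcs12_write_any(vmcs12, field, offset, value);
	if (is_emulated_field(field))
		/* re-emulate into vmcs02 before the next nested vmentry */
		vmx->nested.dirty_vmcs12 = true;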

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/nested.c            | 23 +++++++++++
 .../vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h    | 41 ++++++++++++++++++-
 2 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.c b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
index 8ae37feda5ff..8e6d0f01819a 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
@@ -149,6 +149,12 @@ static struct shadow_vmcs_field shadow_read_write_fields[] = {
 };
 static int max_shadow_read_write_fields =
 	ARRAY_SIZE(shadow_read_write_fields);
+static struct shadow_vmcs_field emulated_fields[] = {
+#define EMULATED_FIELD_RW(x, y) { x, offsetof(struct vmcs12, y) },
+#include "pkvm_nested_vmcs_fields.h"
+};
+static int max_emulated_fields =
+	ARRAY_SIZE(emulated_fields);
 
 static void init_vmcs_shadow_fields(void)
 {
@@ -201,6 +207,22 @@ static void init_vmcs_shadow_fields(void)
 	max_shadow_read_write_fields = j;
 }
 
+static void init_emulated_vmcs_fields(void)
+{
+	int i, j;
+
+	for (i = j = 0; i < max_emulated_fields; i++) {
+		struct shadow_vmcs_field entry = emulated_fields[i];
+		u16 field = entry.encoding;
+
+		if (!has_vmcs_field(field))
+			continue;
+
+		emulated_fields[j++] = entry;
+	}
+	max_emulated_fields = j;
+}
+
 static void nested_vmx_result(enum VMXResult result, int error_number)
 {
 	u64 rflags = vmcs_readl(GUEST_RFLAGS);
@@ -384,4 +406,5 @@ int handle_vmxoff(struct kvm_vcpu *vcpu)
 void pkvm_init_nest(void)
 {
 	init_vmcs_shadow_fields();
+	init_emulated_vmcs_fields();
 }
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h
index 4380d415428f..8666cda4ee6d 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h
@@ -2,10 +2,13 @@
 /*
  * Copyright (C) 2022 Intel Corporation
  */
-#if !defined(SHADOW_FIELD_RW) && !defined(SHADOW_FIELD_RO)
+#if !defined(EMULATED_FIELD_RW) && !defined(SHADOW_FIELD_RW) && !defined(SHADOW_FIELD_RO)
 BUILD_BUG_ON(1)
 #endif
 
+#ifndef EMULATED_FIELD_RW
+#define EMULATED_FIELD_RW(x, y)
+#endif
 #ifndef SHADOW_FIELD_RW
 #define SHADOW_FIELD_RW(x, y)
 #endif
@@ -13,6 +16,41 @@ BUILD_BUG_ON(1)
 #define SHADOW_FIELD_RO(x, y)
 #endif
 
+/*
+ * Emulated fields for vmcs02:
+ *
+ * These fields are recorded in cached_vmcs12, and should be emulated to
+ * real value in vmcs02 before vmcs01 active.
+ */
+/* 16-bits */
+EMULATED_FIELD_RW(VIRTUAL_PROCESSOR_ID, virtual_processor_id)
+
+/* 32-bits */
+EMULATED_FIELD_RW(VM_EXIT_CONTROLS, vm_exit_controls)
+EMULATED_FIELD_RW(VM_ENTRY_CONTROLS, vm_entry_controls)
+
+/* 64-bits, what about their HIGH 32 fields?  */
+EMULATED_FIELD_RW(IO_BITMAP_A, io_bitmap_a)
+EMULATED_FIELD_RW(IO_BITMAP_B, io_bitmap_b)
+EMULATED_FIELD_RW(MSR_BITMAP, msr_bitmap)
+EMULATED_FIELD_RW(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr)
+EMULATED_FIELD_RW(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr)
+EMULATED_FIELD_RW(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr)
+EMULATED_FIELD_RW(XSS_EXIT_BITMAP, xss_exit_bitmap)
+EMULATED_FIELD_RW(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr)
+EMULATED_FIELD_RW(PML_ADDRESS, pml_address)
+EMULATED_FIELD_RW(VM_FUNCTION_CONTROL, vm_function_control)
+EMULATED_FIELD_RW(EPT_POINTER, ept_pointer)
+EMULATED_FIELD_RW(EOI_EXIT_BITMAP0, eoi_exit_bitmap0)
+EMULATED_FIELD_RW(EOI_EXIT_BITMAP1, eoi_exit_bitmap1)
+EMULATED_FIELD_RW(EOI_EXIT_BITMAP2, eoi_exit_bitmap2)
+EMULATED_FIELD_RW(EOI_EXIT_BITMAP3, eoi_exit_bitmap3)
+EMULATED_FIELD_RW(EPTP_LIST_ADDRESS, eptp_list_address)
+EMULATED_FIELD_RW(VMREAD_BITMAP, vmread_bitmap)
+EMULATED_FIELD_RW(VMWRITE_BITMAP, vmwrite_bitmap)
+EMULATED_FIELD_RW(ENCLS_EXITING_BITMAP, encls_exiting_bitmap)
+EMULATED_FIELD_RW(VMCS_LINK_POINTER, vmcs_link_pointer)
+
 /*
  * Shadow fields for vmcs02:
  *
@@ -152,5 +190,6 @@ SHADOW_FIELD_RO(GUEST_LINEAR_ADDRESS, guest_linear_address)
 SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS, guest_physical_address)
 SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH, guest_physical_address)
 
+#undef EMULATED_FIELD_RW
 #undef SHADOW_FIELD_RW
 #undef SHADOW_FIELD_RO
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 14/22] pkvm: x86: Add msr ops for pKVM hypervisor
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (12 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 13/22] pkvm: x86: Initialize emulated fields " Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 15/22] pkvm: x86: Move _init_host_state_area to " Jason Chen CJ
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

Add pkvm msr ops and avoid using the Linux msr ops directly, removing
the dependency on linking to the exception table (EXTABLE).
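
A short usage sketch, assuming the same calling convention as the Linux
rdmsr()/rdmsrl()/wrmsrl() macros they replace:

	u32 low, high;
	u64 efer;

	pkvm_rdmsr(MSR_IA32_VMX_BASIC, low, high);
	pkvm_rdmsrl(MSR_EFER, efer);
	pkvm_wrmsrl(MSR_EFER, efer);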

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/cpu.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/cpu.h b/arch/x86/kvm/vmx/pkvm/hyp/cpu.h
index c49074292f7c..896c2984ffa6 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/cpu.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/cpu.h
@@ -13,6 +13,29 @@ static inline u64 pkvm_msr_read(u32 reg)
 	return (((u64)msrh << 32U) | msrl);
 }
 
+#define pkvm_rdmsr(msr, low, high)              \
+do {                                            \
+	u64 __val = pkvm_msr_read(msr);         \
+	(void)((low) = (u32)__val);             \
+	(void)((high) = (u32)(__val >> 32));    \
+} while (0)
+
+#define pkvm_rdmsrl(msr, val)                   \
+	((val) = pkvm_msr_read((msr)))
+
+static inline void pkvm_msr_write(u32 reg, u64 msr_val)
+{
+	asm volatile (" wrmsr " : : "c" (reg), "a" ((u32)msr_val), "d" ((u32)(msr_val >> 32U)));
+}
+
+#define pkvm_wrmsr(msr, low, high)              	\
+do {                                            	\
+	u64 __val = (u64)(high) << 32 | (u64)(low); 	\
+	pkvm_msr_write(msr, __val);             	\
+} while (0)
+
+#define pkvm_wrmsrl(msr, val)   pkvm_msr_write(msr, val)
+
 #ifdef CONFIG_PKVM_INTEL_DEBUG
 #include <linux/smp.h>
 static inline u64 get_pcpu_id(void)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 15/22] pkvm: x86: Move _init_host_state_area to pKVM hypervisor
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (13 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 14/22] pkvm: x86: Add msr ops for pKVM hypervisor Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 16/22] pkvm: x86: Add vmcs_load/clear_track APIs Jason Chen CJ
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

During the VMPTRLD emulation for a nested guest, pKVM needs to
initialize the shadow vmcs's host state area based on the hypervisor's
own settings as well, so move this function from pkvm_host.c to the
hypervisor directory.

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/Makefile   |  2 +-
 arch/x86/kvm/vmx/pkvm/hyp/vmx.c      | 77 ++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/pkvm/hyp/vmx.h      |  2 +
 arch/x86/kvm/vmx/pkvm/include/pkvm.h |  1 +
 arch/x86/kvm/vmx/pkvm/pkvm_host.c    | 75 +--------------------------
 5 files changed, 82 insertions(+), 75 deletions(-)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/Makefile b/arch/x86/kvm/vmx/pkvm/hyp/Makefile
index 660fd611395f..ca6d43509ddc 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/Makefile
+++ b/arch/x86/kvm/vmx/pkvm/hyp/Makefile
@@ -12,7 +12,7 @@ ccflags-y += -D__PKVM_HYP__
 virt-dir	:= ../../../../../../$(KVM_PKVM)
 
 pkvm-hyp-y	:= vmx_asm.o vmexit.o memory.o early_alloc.o pgtable.o mmu.o pkvm.o \
-		   init_finalise.o ept.o idt.o irq.o nested.o
+		   init_finalise.o ept.o idt.o irq.o nested.o vmx.o
 
 ifndef CONFIG_PKVM_INTEL_DEBUG
 lib-dir		:= lib
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmx.c b/arch/x86/kvm/vmx/pkvm/hyp/vmx.c
new file mode 100644
index 000000000000..fec99c567d07
--- /dev/null
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmx.c
@@ -0,0 +1,77 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <pkvm.h>
+#include "cpu.h"
+
+void pkvm_init_host_state_area(struct pkvm_pcpu *pcpu, int cpu)
+{
+	unsigned long a;
+#ifdef CONFIG_PKVM_INTEL_DEBUG
+	u32 high, low;
+	struct desc_ptr dt;
+	u16 selector;
+#endif
+
+	vmcs_writel(HOST_CR0, native_read_cr0() & ~X86_CR0_TS);
+	vmcs_writel(HOST_CR3, pcpu->cr3);
+	vmcs_writel(HOST_CR4, native_read_cr4());
+
+#ifdef CONFIG_PKVM_INTEL_DEBUG
+	savesegment(cs, selector);
+	vmcs_write16(HOST_CS_SELECTOR, selector);
+	savesegment(ss, selector);
+	vmcs_write16(HOST_SS_SELECTOR, selector);
+	savesegment(ds, selector);
+	vmcs_write16(HOST_DS_SELECTOR, selector);
+	savesegment(es, selector);
+	vmcs_write16(HOST_ES_SELECTOR, selector);
+	savesegment(fs, selector);
+	vmcs_write16(HOST_FS_SELECTOR, selector);
+	pkvm_rdmsrl(MSR_FS_BASE, a);
+	vmcs_writel(HOST_FS_BASE, a);
+	savesegment(gs, selector);
+	vmcs_write16(HOST_GS_SELECTOR, selector);
+	pkvm_rdmsrl(MSR_GS_BASE, a);
+	vmcs_writel(HOST_GS_BASE, a);
+
+	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);
+	vmcs_writel(HOST_TR_BASE, (unsigned long)&get_cpu_entry_area(cpu)->tss.x86_tss);
+
+	native_store_gdt(&dt);
+	vmcs_writel(HOST_GDTR_BASE, dt.address);
+	vmcs_writel(HOST_IDTR_BASE, (unsigned long)(&pcpu->idt_page));
+
+	pkvm_rdmsr(MSR_IA32_SYSENTER_CS, low, high);
+	vmcs_write32(HOST_IA32_SYSENTER_CS, low);
+
+	pkvm_rdmsrl(MSR_IA32_SYSENTER_ESP, a);
+	vmcs_writel(HOST_IA32_SYSENTER_ESP, a);
+
+	pkvm_rdmsrl(MSR_IA32_SYSENTER_EIP, a);
+	vmcs_writel(HOST_IA32_SYSENTER_EIP, a);
+#else
+	vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS);
+	vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);
+	vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);
+	vmcs_write16(HOST_ES_SELECTOR, 0);
+	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);
+	vmcs_write16(HOST_FS_SELECTOR, 0);
+	vmcs_write16(HOST_GS_SELECTOR, 0);
+	vmcs_writel(HOST_FS_BASE, 0);
+	vmcs_writel(HOST_GS_BASE, 0);
+
+	vmcs_writel(HOST_TR_BASE, (unsigned long)&pcpu->tss);
+	vmcs_writel(HOST_GDTR_BASE, (unsigned long)(&pcpu->gdt_page));
+	vmcs_writel(HOST_IDTR_BASE, (unsigned long)(&pcpu->idt_page));
+
+	vmcs_write16(HOST_GS_SELECTOR, __KERNEL_DS);
+	vmcs_writel(HOST_GS_BASE, cpu);
+#endif
+
+	/* MSR area */
+	pkvm_rdmsrl(MSR_EFER, a);
+	vmcs_write64(HOST_IA32_EFER, a);
+
+	pkvm_rdmsrl(MSR_IA32_CR_PAT, a);
+	vmcs_write64(HOST_IA32_PAT, a);
+}
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmx.h b/arch/x86/kvm/vmx/pkvm/hyp/vmx.h
index 178139d1b61f..35369cc3b646 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/vmx.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmx.h
@@ -50,4 +50,6 @@ static inline void vmx_enable_irq_window(struct vcpu_vmx *vmx)
 	exec_controls_setbit(vmx, CPU_BASED_INTR_WINDOW_EXITING);
 }
 
+void pkvm_init_host_state_area(struct pkvm_pcpu *pcpu, int cpu);
+
 #endif
diff --git a/arch/x86/kvm/vmx/pkvm/include/pkvm.h b/arch/x86/kvm/vmx/pkvm/include/pkvm.h
index 292d48d8ee44..d5393d477df1 100644
--- a/arch/x86/kvm/vmx/pkvm/include/pkvm.h
+++ b/arch/x86/kvm/vmx/pkvm/include/pkvm.h
@@ -98,6 +98,7 @@ extern struct pkvm_hyp *pkvm_sym(pkvm_hyp);
 
 PKVM_DECLARE(void, __pkvm_vmx_vmexit(void));
 PKVM_DECLARE(int, pkvm_main(struct kvm_vcpu *vcpu));
+PKVM_DECLARE(void, pkvm_init_host_state_area(struct pkvm_pcpu *pcpu, int cpu));
 
 PKVM_DECLARE(void *, pkvm_early_alloc_contig(unsigned int nr_pages));
 PKVM_DECLARE(void *, pkvm_early_alloc_page(void));
diff --git a/arch/x86/kvm/vmx/pkvm/pkvm_host.c b/arch/x86/kvm/vmx/pkvm/pkvm_host.c
index 4ea82a147af5..cbba3033ba63 100644
--- a/arch/x86/kvm/vmx/pkvm/pkvm_host.c
+++ b/arch/x86/kvm/vmx/pkvm/pkvm_host.c
@@ -240,84 +240,11 @@ static __init void init_guest_state_area(struct pkvm_host_vcpu *vcpu, int cpu)
 	vmcs_write64(VMCS_LINK_POINTER, -1ull);
 }
 
-static __init void _init_host_state_area(struct pkvm_pcpu *pcpu, int cpu)
-{
-	unsigned long a;
-#ifdef CONFIG_PKVM_INTEL_DEBUG
-	u32 high, low;
-	struct desc_ptr dt;
-	u16 selector;
-#endif
-
-	vmcs_writel(HOST_CR0, read_cr0() & ~X86_CR0_TS);
-	vmcs_writel(HOST_CR3, pcpu->cr3);
-	vmcs_writel(HOST_CR4, native_read_cr4());
-
-#ifdef CONFIG_PKVM_INTEL_DEBUG
-	savesegment(cs, selector);
-	vmcs_write16(HOST_CS_SELECTOR, selector);
-	savesegment(ss, selector);
-	vmcs_write16(HOST_SS_SELECTOR, selector);
-	savesegment(ds, selector);
-	vmcs_write16(HOST_DS_SELECTOR, selector);
-	savesegment(es, selector);
-	vmcs_write16(HOST_ES_SELECTOR, selector);
-	savesegment(fs, selector);
-	vmcs_write16(HOST_FS_SELECTOR, selector);
-	rdmsrl(MSR_FS_BASE, a);
-	vmcs_writel(HOST_FS_BASE, a);
-	savesegment(gs, selector);
-	vmcs_write16(HOST_GS_SELECTOR, selector);
-	rdmsrl(MSR_GS_BASE, a);
-	vmcs_writel(HOST_GS_BASE, a);
-
-	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);
-	vmcs_writel(HOST_TR_BASE, (unsigned long)&get_cpu_entry_area(cpu)->tss.x86_tss);
-
-	native_store_gdt(&dt);
-	vmcs_writel(HOST_GDTR_BASE, dt.address);
-	vmcs_writel(HOST_IDTR_BASE, (unsigned long)(&pcpu->idt_page));
-
-	rdmsr(MSR_IA32_SYSENTER_CS, low, high);
-	vmcs_write32(HOST_IA32_SYSENTER_CS, low);
-
-	rdmsrl(MSR_IA32_SYSENTER_ESP, a);
-	vmcs_writel(HOST_IA32_SYSENTER_ESP, a);
-
-	rdmsrl(MSR_IA32_SYSENTER_EIP, a);
-	vmcs_writel(HOST_IA32_SYSENTER_EIP, a);
-#else
-	vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS);
-	vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);
-	vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);
-	vmcs_write16(HOST_ES_SELECTOR, 0);
-	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);
-	vmcs_write16(HOST_FS_SELECTOR, 0);
-	vmcs_write16(HOST_GS_SELECTOR, 0);
-	vmcs_writel(HOST_FS_BASE, 0);
-	vmcs_writel(HOST_GS_BASE, 0);
-
-	vmcs_writel(HOST_TR_BASE, (unsigned long)&pcpu->tss);
-	vmcs_writel(HOST_GDTR_BASE, (unsigned long)(&pcpu->gdt_page));
-	vmcs_writel(HOST_IDTR_BASE, (unsigned long)(&pcpu->idt_page));
-
-	vmcs_write16(HOST_GS_SELECTOR, __KERNEL_DS);
-	vmcs_writel(HOST_GS_BASE, cpu);
-#endif
-
-	/* MSR area */
-	rdmsrl(MSR_EFER, a);
-	vmcs_write64(HOST_IA32_EFER, a);
-
-	rdmsrl(MSR_IA32_CR_PAT, a);
-	vmcs_write64(HOST_IA32_PAT, a);
-}
-
 static __init void init_host_state_area(struct pkvm_host_vcpu *vcpu, int cpu)
 {
 	struct pkvm_pcpu *pcpu = vcpu->pcpu;
 
-	_init_host_state_area(pcpu, cpu);
+	pkvm_sym(pkvm_init_host_state_area)(pcpu, cpu);
 
 	/*host RIP*/
 	vmcs_writel(HOST_RIP, (unsigned long)pkvm_sym(__pkvm_vmx_vmexit));
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 16/22] pkvm: x86: Add vmcs_load/clear_track APIs
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (14 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 15/22] pkvm: x86: Move _init_host_state_area to " Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 17/22] pkvm: x86: Add VMPTRLD/VMCLEAR emulation Jason Chen CJ
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ, Chuanxiao Dong

Add vmcs_load_track/vmcs_clear_track to track vmcs load & clear, in
preparation for the following VMPTRLD emulation.

Using these tracking APIs for vmcs load/clear is necessary whenever pKVM
wants to know the current_vmcs pointer. For example, if pKVM handles an
NMI in root mode, it may need to temporarily switch the vmcs to the host
vcpu's one to inject the NMI and then switch back to current_vmcs, as
the usage sketch below illustrates.
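
A minimal usage sketch of that pattern; the NMI-injection details are
hypothetical here, only the load/clear tracking comes from this patch:

	struct vmcs *cur = pkvm_host_vcpu->current_vmcs;

	if (cur != vmx->vmcs01.vmcs) {
		vmcs_load_track(vmx, vmx->vmcs01.vmcs);
		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, NMI_VECTOR |
			     INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK);
		vmcs_load_track(vmx, cur);
	}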

Signed-off-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/vmx.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmx.h b/arch/x86/kvm/vmx/pkvm/hyp/vmx.h
index 35369cc3b646..54c17e256107 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/vmx.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmx.h
@@ -50,6 +50,27 @@ static inline void vmx_enable_irq_window(struct vcpu_vmx *vmx)
 	exec_controls_setbit(vmx, CPU_BASED_INTR_WINDOW_EXITING);
 }
 
+static inline void vmcs_load_track(struct vcpu_vmx *vmx, struct vmcs *vmcs)
+{
+	struct pkvm_host_vcpu *pkvm_host_vcpu = vmx_to_pkvm_hvcpu(vmx);
+
+	pkvm_host_vcpu->current_vmcs = vmcs;
+	barrier();
+	vmcs_load(vmcs);
+}
+
+static inline void vmcs_clear_track(struct vcpu_vmx *vmx, struct vmcs *vmcs)
+{
+	struct pkvm_host_vcpu *pkvm_host_vcpu = vmx_to_pkvm_hvcpu(vmx);
+
+	/* vmcs_clear might clear non-current vmcs */
+	if (pkvm_host_vcpu->current_vmcs == vmcs)
+		pkvm_host_vcpu->current_vmcs = NULL;
+
+	barrier();
+	vmcs_clear(vmcs);
+}
+
 void pkvm_init_host_state_area(struct pkvm_pcpu *pcpu, int cpu);
 
 #endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 17/22] pkvm: x86: Add VMPTRLD/VMCLEAR emulation
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (15 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 16/22] pkvm: x86: Add vmcs_load/clear_track APIs Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:02 ` [RFC PATCH part-5 18/22] pkvm: x86: Add VMREAD/VMWRITE emulation Jason Chen CJ
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ, Chuanxiao Dong

pKVM is designed to emulate VMX for the host VM based on shadow vmcs.
The shadow vmcs page (vmcs02) used in root mode is kept in the structure
shadow_vcpu_state, which is allocated and then donated from the host VM
when it initializes vcpus for its launched (nested) guest. The same
holds for the cached_vmcs12 field, which is used to cache the
non-shadowed vmcs fields.

pKVM uses vmcs02 as the shadow vmcs pointer of the nested guest while
the host VM programs its vmcs fields, then switches vmcs02 to an
ordinary vmcs for the vmlaunch/vmresume of the same guest.

For a nested guest, while the host VM programs its vmcs, its virtual
vmcs (vmcs12) is saved in two places: the shadowed fields live in
vmcs02, which the host VM writes directly with VMWRITE without a vmexit,
and the rest live in cached_vmcs12, which is filled by the vmwrite
vmexit handler upon the vmexit triggered by a VMWRITE instruction from
the host VM.

The cached_vmcs12 fields in turn fall into two parts: emulated fields
and host state fields. The emulated fields shall be emulated to their
physical values and then filled into vmcs02 before vmcs02 becomes active
for the vmlaunch/vmresume of the nested guest. The host state fields are
guest state of the host vcpu; they shall be restored to the guest state
of the host vcpu's vmcs (vmcs01) before returning to the host VM.

Below is a summary of the contents of the different vmcs fields in each
of the above mentioned vmcs:

               host state      guest state          control
 ---------------------------------------------------------------
 vmcs12*:       host VM	      nested guest         host VM
 vmcs02*:        pKVM         nested guest      host VM + pKVM*
 vmcs01*:        pKVM           host VM              pKVM

 *vmcs12: virtual vmcs of a nested guest
 *vmcs02: vmcs of a nested guest
 *vmcs01: vmcs of host VM
 *the security related control fields of vmcs02 is controlled by pKVM
  (e.g., EPT_POINTER)

Below is how the different vmcs fields are emulated for a nested
guest:

                host state      guest state         control
 ---------------------------------------------------------------
 virtual vmcs:  cached_vmcs12*     vmcs02*          emulated*

 *cached_vmcs12: vmexit then get value from cached_vmcs12
 *vmcs02:        no-vmexit and directly shadow from vmcs02
 *emulated:      vmexit then do the emulation

This patch provides emulation for the VMPTRLD and VMCLEAR vmx
instructions.

For VMPTRLD, pKVM first finds the shadow_vcpu_state (and from it the
cached_vmcs12 & vmcs02) based on the vmcs12 pointer fetched from the
instruction, then copies the whole virtual vmcs - the vmcs12 content -
into the corresponding cached_vmcs12. The vmcs02 is then filled from 3
different parts:
- host state fields: initialized by pKVM, as pKVM is the real host
- shadow fields: copied from cached_vmcs12
- emulated fields: synced & emulated from cached_vmcs12

For VMCLEAR, the vmcs02 shadow fields are copied back to cached_vmcs12,
and the whole cached_vmcs12 is then saved to the virtual vmcs pointer -
vmcs12. A rough outline of the VMPTRLD flow is sketched below.
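
The outline below is only a rough, non-authoritative sketch of that
VMPTRLD flow, stitched together from helpers in this series; error
handling, locking and the exact call order are omitted, and
cached_vmcs12, vmcs02, vmcs02_pa and pcpu stand for values reachable
from the looked-up shadow_vcpu_state and the current physical cpu:

	handle = find_shadow_vcpu_handle_by_vmcs(vmptr);
	shadow_vcpu = get_shadow_vcpu(handle);

	/* copy the whole virtual vmcs into cached_vmcs12 */
	read_gpa(vcpu, vmptr, cached_vmcs12, VMCS12_SIZE);

	/* program vmcs02 while it is the active (ordinary) vmcs */
	vmcs_load_track(vmx, vmcs02);
	pkvm_init_host_state_area(pcpu, vcpu->cpu);
	copy_shadow_fields_vmcs12_to_vmcs02(vmx, cached_vmcs12);
	sync_vmcs12_dirty_fields_to_vmcs02(vmx, cached_vmcs12);
	vmcs_clear_track(vmx, vmcs02);

	/* from now on the host VM shadows vmcs02 via vmcs01 */
	set_shadow_indicator(vmcs02);
	vmcs_load_track(vmx, vmx->vmcs01.vmcs);
	vmcs_write64(VMCS_LINK_POINTER, vmcs02_pa);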

Signed-off-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/nested.c   | 268 +++++++++++++++++++++++++++
 arch/x86/kvm/vmx/pkvm/hyp/nested.h   |   2 +
 arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h |   8 +
 arch/x86/kvm/vmx/pkvm/hyp/vmexit.c   |  10 +
 arch/x86/kvm/vmx/pkvm/include/pkvm.h |   2 +
 5 files changed, 290 insertions(+)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.c b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
index 8e6d0f01819a..dab002ff3c68 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
@@ -6,6 +6,7 @@
 #include <pkvm.h>
 
 #include "pkvm_hyp.h"
+#include "vmx.h"
 #include "debug.h"
 
 /**
@@ -223,6 +224,11 @@ static void init_emulated_vmcs_fields(void)
 	max_emulated_fields = j;
 }
 
+static bool is_host_fields(unsigned long field)
+{
+	return (((field) >> 10U) & 0x3U) == 3U;
+}
+
 static void nested_vmx_result(enum VMXResult result, int error_number)
 {
 	u64 rflags = vmcs_readl(GUEST_RFLAGS);
@@ -363,6 +369,163 @@ static bool check_vmx_permission(struct kvm_vcpu *vcpu)
 	return permit;
 }
 
+static void clear_shadow_indicator(struct vmcs *vmcs)
+{
+	vmcs->hdr.shadow_vmcs = 0;
+}
+
+static void set_shadow_indicator(struct vmcs *vmcs)
+{
+	vmcs->hdr.shadow_vmcs = 1;
+}
+
+/* current vmcs is vmcs02 */
+static void copy_shadow_fields_vmcs02_to_vmcs12(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
+{
+	const struct shadow_vmcs_field *fields[] = {
+		shadow_read_write_fields,
+		shadow_read_only_fields
+	};
+	const int max_fields[] = {
+		max_shadow_read_write_fields,
+		max_shadow_read_only_fields
+	};
+	struct shadow_vmcs_field field;
+	unsigned long val;
+	int i, q;
+
+	for (q = 0; q < ARRAY_SIZE(fields); q++) {
+		for (i = 0; i < max_fields[q]; i++) {
+			field = fields[q][i];
+			val = __vmcs_readl(field.encoding);
+			if (is_host_fields((field.encoding))) {
+				pkvm_err("%s: field 0x%x is host field, please remove from shadowing!",
+						__func__, field.encoding);
+				continue;
+			}
+			vmcs12_write_any(vmcs12, field.encoding, field.offset, val);
+		}
+	}
+}
+
+/* current vmcs is vmcs02 */
+static void copy_shadow_fields_vmcs12_to_vmcs02(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
+{
+	const struct shadow_vmcs_field *fields[] = {
+		shadow_read_write_fields,
+		shadow_read_only_fields
+	};
+	const int max_fields[] = {
+		max_shadow_read_write_fields,
+		max_shadow_read_only_fields
+	};
+	struct shadow_vmcs_field field;
+	unsigned long val;
+	int i, q;
+
+	for (q = 0; q < ARRAY_SIZE(fields); q++) {
+		for (i = 0; i < max_fields[q]; i++) {
+			field = fields[q][i];
+			val = vmcs12_read_any(vmcs12, field.encoding,
+					      field.offset);
+			if (is_host_fields((field.encoding))) {
+				pkvm_err("%s: field 0x%x is host field, please remove from shadowing!",
+						__func__, field.encoding);
+				continue;
+			}
+			__vmcs_writel(field.encoding, val);
+		}
+	}
+}
+
+/* current vmcs is vmcs02*/
+static u64 emulate_field_for_vmcs02(struct vcpu_vmx *vmx, u16 field, u64 virt_val)
+{
+	u64 val = virt_val;
+
+	switch (field) {
+	case VM_ENTRY_CONTROLS:
+		/* L1 host wishes to use its own MSRs for L2 guest?
+		 * emulate it by enabling vmentry load for such guest states
+		 * then use vmcs01 saved guest states as vmcs02's guest states
+		 */
+		if ((val & VM_ENTRY_LOAD_IA32_EFER) != VM_ENTRY_LOAD_IA32_EFER)
+			val |= VM_ENTRY_LOAD_IA32_EFER;
+		if ((val & VM_ENTRY_LOAD_IA32_PAT) != VM_ENTRY_LOAD_IA32_PAT)
+			val |= VM_ENTRY_LOAD_IA32_PAT;
+		if ((val & VM_ENTRY_LOAD_DEBUG_CONTROLS) != VM_ENTRY_LOAD_DEBUG_CONTROLS)
+			val |= VM_ENTRY_LOAD_DEBUG_CONTROLS;
+		break;
+	case VM_EXIT_CONTROLS:
+		/* L1 host wishes to keep use MSRs from L2 guest after its VMExit?
+		 * emulate it by enabling vmexit save for such guest states
+		 * then vmcs01 shall take these guest states as its before L1 VMEntry
+		 *
+		 * And vmcs01 shall still keep enabling vmexit load such guest states as
+		 * pkvm need restore from its host states
+		 */
+		if ((val & VM_EXIT_LOAD_IA32_EFER) != VM_EXIT_LOAD_IA32_EFER)
+			val |= (VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER);
+		if ((val & VM_EXIT_LOAD_IA32_PAT) != VM_EXIT_LOAD_IA32_PAT)
+			val |= (VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT);
+		/* host always in 64bit mode */
+		val |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
+		break;
+	}
+	return val;
+}
+
+/* current vmcs is vmcs02*/
+static void sync_vmcs12_dirty_fields_to_vmcs02(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
+{
+	struct shadow_vmcs_field field;
+	unsigned long val, phys_val;
+	int i;
+
+	if (vmx->nested.dirty_vmcs12) {
+		for (i = 0; i < max_emulated_fields; i++) {
+			field = emulated_fields[i];
+			val = vmcs12_read_any(vmcs12, field.encoding, field.offset);
+			phys_val = emulate_field_for_vmcs02(vmx, field.encoding, val);
+			__vmcs_writel(field.encoding, phys_val);
+		}
+		vmx->nested.dirty_vmcs12 = false;
+	}
+}
+
+static void nested_release_vmcs12(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct pkvm_host_vcpu *pkvm_hvcpu = to_pkvm_hvcpu(vcpu);
+	struct shadow_vcpu_state *cur_shadow_vcpu = pkvm_hvcpu->current_shadow_vcpu;
+	struct vmcs *vmcs02;
+	struct vmcs12 *vmcs12;
+
+	if (vmx->nested.current_vmptr == INVALID_GPA)
+		return;
+
+	/* cur_shadow_vcpu must be valid here */
+	vmcs02 = (struct vmcs *)cur_shadow_vcpu->vmcs02;
+	vmcs12 = (struct vmcs12 *)cur_shadow_vcpu->cached_vmcs12;
+	vmcs_load_track(vmx, vmcs02);
+	copy_shadow_fields_vmcs02_to_vmcs12(vmx, vmcs12);
+
+	vmcs_clear_track(vmx, vmcs02);
+	clear_shadow_indicator(vmcs02);
+
+	/*disable shadowing*/
+	vmcs_load_track(vmx, vmx->loaded_vmcs->vmcs);
+	secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_SHADOW_VMCS);
+	vmcs_write64(VMCS_LINK_POINTER, INVALID_GPA);
+
+	write_gpa(vcpu, vmx->nested.current_vmptr, vmcs12, VMCS12_SIZE);
+	vmx->nested.dirty_vmcs12 = false;
+	vmx->nested.current_vmptr = INVALID_GPA;
+	pkvm_hvcpu->current_shadow_vcpu = NULL;
+
+	put_shadow_vcpu(cur_shadow_vcpu->shadow_vcpu_handle);
+}
+
 int handle_vmxon(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -379,6 +542,8 @@ int handle_vmxon(struct kvm_vcpu *vcpu)
 		} else if (!validate_vmcs_revision_id(vcpu, vmptr)) {
 			nested_vmx_result(VMfailInvalid, 0);
 		} else {
+			vmx->nested.current_vmptr = INVALID_GPA;
+			vmx->nested.dirty_vmcs12 = false;
 			vmx->nested.vmxon_ptr = vmptr;
 			vmx->nested.vmxon = true;
 
@@ -403,6 +568,109 @@ int handle_vmxoff(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct pkvm_host_vcpu *pkvm_hvcpu = to_pkvm_hvcpu(vcpu);
+	struct shadow_vcpu_state *shadow_vcpu;
+	struct vmcs *vmcs02;
+	struct vmcs12 *vmcs12;
+	gpa_t vmptr;
+	int r;
+
+	if (check_vmx_permission(vcpu)) {
+		if (nested_vmx_get_vmptr(vcpu, &vmptr, &r)) {
+			nested_vmx_result(VMfailValid, VMXERR_VMPTRLD_INVALID_ADDRESS);
+			return r;
+		} else if (vmptr == vmx->nested.vmxon_ptr) {
+			nested_vmx_result(VMfailValid, VMXERR_VMPTRLD_VMXON_POINTER);
+		} else if (!validate_vmcs_revision_id(vcpu, vmptr)) {
+			nested_vmx_result(VMfailValid, VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
+		} else {
+			if (vmx->nested.current_vmptr != vmptr) {
+				s64 handle;
+
+				nested_release_vmcs12(vcpu);
+
+				handle = find_shadow_vcpu_handle_by_vmcs(vmptr);
+				shadow_vcpu = handle > 0 ? get_shadow_vcpu(handle) : NULL;
+				if ((handle > 0) && shadow_vcpu) {
+					vmcs02 = (struct vmcs *)shadow_vcpu->vmcs02;
+					vmcs12 = (struct vmcs12 *) shadow_vcpu->cached_vmcs12;
+
+					read_gpa(vcpu, vmptr, vmcs12, VMCS12_SIZE);
+					vmx->nested.dirty_vmcs12 = true;
+
+					if (!shadow_vcpu->vmcs02_inited) {
+						memset(vmcs02, 0, pkvm_hyp->vmcs_config.size);
+						vmcs02->hdr.revision_id = pkvm_hyp->vmcs_config.revision_id;
+						vmcs_load_track(vmx, vmcs02);
+						pkvm_init_host_state_area(pkvm_hvcpu->pcpu, vcpu->cpu);
+						vmcs_writel(HOST_RIP, (unsigned long)__pkvm_vmx_vmexit);
+						shadow_vcpu->last_cpu = vcpu->cpu;
+						shadow_vcpu->vmcs02_inited = true;
+					} else {
+						vmcs_load_track(vmx, vmcs02);
+						if (shadow_vcpu->last_cpu != vcpu->cpu) {
+							pkvm_init_host_state_area(pkvm_hvcpu->pcpu, vcpu->cpu);
+							shadow_vcpu->last_cpu = vcpu->cpu;
+						}
+					}
+					copy_shadow_fields_vmcs12_to_vmcs02(vmx, vmcs12);
+					sync_vmcs12_dirty_fields_to_vmcs02(vmx, vmcs12);
+					vmcs_clear_track(vmx, vmcs02);
+					set_shadow_indicator(vmcs02);
+
+					/* enable shadowing */
+					vmcs_load_track(vmx, vmx->loaded_vmcs->vmcs);
+					vmcs_write64(VMREAD_BITMAP, __pkvm_pa_symbol(vmx_vmread_bitmap));
+					vmcs_write64(VMWRITE_BITMAP, __pkvm_pa_symbol(vmx_vmwrite_bitmap));
+					secondary_exec_controls_setbit(vmx, SECONDARY_EXEC_SHADOW_VMCS);
+					vmcs_write64(VMCS_LINK_POINTER, __pkvm_pa(vmcs02));
+
+					vmx->nested.current_vmptr = vmptr;
+					pkvm_hvcpu->current_shadow_vcpu = shadow_vcpu;
+
+					nested_vmx_result(VMsucceed, 0);
+				} else {
+					nested_vmx_result(VMfailValid, VMXERR_VMPTRLD_INVALID_ADDRESS);
+				}
+			} else {
+				nested_vmx_result(VMsucceed, 0);
+			}
+		}
+	}
+
+	return 0;
+}
+
+int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gpa_t vmptr;
+	u32 zero = 0;
+	int r;
+
+	if (check_vmx_permission(vcpu)) {
+		if (nested_vmx_get_vmptr(vcpu, &vmptr, &r)) {
+			nested_vmx_result(VMfailValid, VMXERR_VMPTRLD_INVALID_ADDRESS);
+			return r;
+		} else if (vmptr == vmx->nested.vmxon_ptr) {
+			nested_vmx_result(VMfailValid, VMXERR_VMCLEAR_VMXON_POINTER);
+		} else {
+			if (vmx->nested.current_vmptr == vmptr)
+				nested_release_vmcs12(vcpu);
+
+			write_gpa(vcpu, vmptr + offsetof(struct vmcs12, launch_state),
+					&zero, sizeof(zero));
+
+			nested_vmx_result(VMsucceed, 0);
+		}
+	}
+
+	return 0;
+}
+
 void pkvm_init_nest(void)
 {
 	init_vmcs_shadow_fields();
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.h b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
index 16b70b13e80e..a228b0fdc15d 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
@@ -7,6 +7,8 @@
 
 int handle_vmxon(struct kvm_vcpu *vcpu);
 int handle_vmxoff(struct kvm_vcpu *vcpu);
+int handle_vmptrld(struct kvm_vcpu *vcpu);
+int handle_vmclear(struct kvm_vcpu *vcpu);
 void pkvm_init_nest(void);
 
 #endif
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h
index c574831c6d18..82a59b5d7fd5 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_hyp.h
@@ -23,8 +23,16 @@ struct shadow_vcpu_state {
 
 	struct hlist_node hnode;
 	unsigned long vmcs12_pa;
+	bool vmcs02_inited;
 
 	struct vcpu_vmx vmx;
+
+	/* assume vmcs02 is one page */
+	u8 vmcs02[PAGE_SIZE] __aligned(PAGE_SIZE);
+	u8 cached_vmcs12[VMCS12_SIZE] __aligned(PAGE_SIZE);
+
+	/* The last cpu this vmcs02 runs with */
+	int last_cpu;
 } __aligned(PAGE_SIZE);
 
 #define SHADOW_VM_HANDLE_SHIFT		32
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
index fa67cab803a8..b2cfb87983a8 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
@@ -206,6 +206,16 @@ int pkvm_main(struct kvm_vcpu *vcpu)
 			handle_vmxoff(vcpu);
 			skip_instruction = true;
 			break;
+		case EXIT_REASON_VMPTRLD:
+			pkvm_dbg("CPU%d vmexit reason: VMPTRLD.\n", vcpu->cpu);
+			handle_vmptrld(vcpu);
+			skip_instruction = true;
+			break;
+		case EXIT_REASON_VMCLEAR:
+			pkvm_dbg("CPU%d vmexit reason: VMCLEAR.\n", vcpu->cpu);
+			handle_vmclear(vcpu);
+			skip_instruction = true;
+			break;
 		case EXIT_REASON_XSETBV:
 			handle_xsetbv(vcpu);
 			skip_instruction = true;
diff --git a/arch/x86/kvm/vmx/pkvm/include/pkvm.h b/arch/x86/kvm/vmx/pkvm/include/pkvm.h
index d5393d477df1..9b45627853b3 100644
--- a/arch/x86/kvm/vmx/pkvm/include/pkvm.h
+++ b/arch/x86/kvm/vmx/pkvm/include/pkvm.h
@@ -35,6 +35,8 @@ struct pkvm_host_vcpu {
 	struct vmcs *vmxarea;
 	struct vmcs *current_vmcs;
 
+	void *current_shadow_vcpu;
+
 	bool pending_nmi;
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 18/22] pkvm: x86: Add VMREAD/VMWRITE emulation
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (16 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 17/22] pkvm: x86: Add VMPTRLD/VMCLEAR emulation Jason Chen CJ
@ 2023-03-12 18:02 ` Jason Chen CJ
  2023-03-12 18:03 ` [RFC PATCH part-5 19/22] pkvm: x86: Add VMLAUNCH/VMRESUME emulation Jason Chen CJ
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:02 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

Provide emulation for VMREAD and VMWRITE vmx instructions.

VMREAD/VMWRITE of non-shadowed vmcs fields from the host VM causes a
vmexit. Add vmexit handlers to manage these non-shadowed vmcs fields,
which fall into two different parts:
- emulated fields: recorded in cached_vmcs12, with dirty_vmcs12 set to
  indicate that emulation is needed before vmcs02 takes effect.
- host state fields: recorded in cached_vmcs12 and restored as the guest
  state of vmcs01 when returning back to the host VM.
The operand decoding shared by both handlers is sketched below.
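
For reference, both handlers decode their operands from VMX_INSTRUCTION_INFO
following the SDM layout: bit 10 selects a register (vs. memory) operand,
bits 6:3 give the register carrying the value, and bits 31:28 the register
holding the field encoding. A standalone sketch (not part of the patch; the
instr_info value below is made up for illustration):

  #include <stdint.h>
  #include <stdio.h>

  static const char *gpr[16] = {
          "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
          "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
  };

  int main(void)
  {
          /* hypothetical register-form VMWRITE: field encoding in rax,
           * value in rbx */
          uint32_t instr_info = (0u << 28) | (1u << 10) | (3u << 3);

          printf("value in %s, field encoding in %s\n",
                 gpr[(instr_info >> 3) & 0xf], gpr[(instr_info >> 28) & 0xf]);
          return 0;
  }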

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/include/asm/pkvm_image_vars.h |   3 +-
 arch/x86/kvm/vmx/pkvm/hyp/nested.c     | 138 +++++++++++++++++++++++++
 arch/x86/kvm/vmx/pkvm/hyp/nested.h     |   2 +
 arch/x86/kvm/vmx/pkvm/hyp/vmexit.c     |  10 ++
 4 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pkvm_image_vars.h b/arch/x86/include/asm/pkvm_image_vars.h
index 598c60302bac..967ee323a5c0 100644
--- a/arch/x86/include/asm/pkvm_image_vars.h
+++ b/arch/x86/include/asm/pkvm_image_vars.h
@@ -16,7 +16,8 @@ PKVM_ALIAS(sme_me_mask);
 #endif
 
 PKVM_ALIAS(__default_kernel_pte_mask);
-
+PKVM_ALIAS(vmcs12_field_offsets);
+PKVM_ALIAS(nr_vmcs12_fields);
 #endif
 
 #endif
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.c b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
index dab002ff3c68..fd8755621cc8 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
@@ -229,6 +229,18 @@ static bool is_host_fields(unsigned long field)
 	return (((field) >> 10U) & 0x3U) == 3U;
 }
 
+static bool is_emulated_fields(unsigned long field_encoding)
+{
+	int i;
+
+	for (i = 0; i < max_emulated_fields; i++) {
+		if ((unsigned long)emulated_fields[i].encoding == field_encoding)
+			return true;
+	}
+
+	return false;
+}
+
 static void nested_vmx_result(enum VMXResult result, int error_number)
 {
 	u64 rflags = vmcs_readl(GUEST_RFLAGS);
@@ -671,6 +683,132 @@ int handle_vmclear(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+int handle_vmwrite(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct pkvm_host_vcpu *pkvm_hvcpu = to_pkvm_hvcpu(vcpu);
+	struct shadow_vcpu_state *cur_shadow_vcpu = pkvm_hvcpu->current_shadow_vcpu;
+	struct vmcs12 *vmcs12 = (struct vmcs12 *)cur_shadow_vcpu->cached_vmcs12;
+	u32 instr_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	struct x86_exception e;
+	unsigned long field;
+	short offset;
+	gva_t gva;
+	int r, reg;
+	u64 value = 0;
+
+	if (check_vmx_permission(vcpu)) {
+		if (vmx->nested.current_vmptr == INVALID_GPA) {
+			nested_vmx_result(VMfailInvalid, 0);
+		} else {
+			if (instr_info & BIT(10)) {
+				reg = ((instr_info) >> 3) & 0xf;
+				value = vcpu->arch.regs[reg];
+			} else {
+				if (get_vmx_mem_address(vcpu, vmx->exit_qualification,
+							instr_info, &gva))
+					return 1;
+
+				r = read_gva(vcpu, gva, &value, 8, &e);
+				if (r < 0) {
+					/*TODO: handle memory failure exception */
+					return r;
+				}
+			}
+
+			reg = ((instr_info) >> 28) & 0xf;
+			field = vcpu->arch.regs[reg];
+
+			offset = get_vmcs12_field_offset(field);
+			if (offset < 0) {
+				nested_vmx_result(VMfailInvalid, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+				return 0;
+			}
+
+			/*TODO: check vcpu supports "VMWRITE to any supported field in the VMCS"*/
+			if (vmcs_field_readonly(field)) {
+				nested_vmx_result(VMfailInvalid, VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT);
+				return 0;
+			}
+
+			/*
+			 * Some Intel CPUs intentionally drop the reserved bits of the AR byte
+			 * fields on VMWRITE.  Emulate this behavior to ensure consistent KVM
+			 * behavior regardless of the underlying hardware, e.g. if an AR_BYTE
+			 * field is intercepted for VMWRITE but not VMREAD (in L1), then VMREAD
+			 * from L1 will return a different value than VMREAD from L2 (L1 sees
+			 * the stripped down value, L2 sees the full value as stored by KVM).
+			 */
+			if (field >= GUEST_ES_AR_BYTES && field <= GUEST_TR_AR_BYTES)
+				value &= 0x1f0ff;
+
+			vmcs12_write_any(vmcs12, field, offset, value);
+
+			if (is_emulated_fields(field)) {
+				vmx->nested.dirty_vmcs12 = true;
+				nested_vmx_result(VMsucceed, 0);
+			} else if (is_host_fields(field)) {
+				nested_vmx_result(VMsucceed, 0);
+			} else {
+				pkvm_err("%s: not include emulated fields 0x%lx, please add!\n",
+						__func__, field);
+				nested_vmx_result(VMfailInvalid, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+			}
+		}
+	}
+
+	return 0;
+}
+
+int handle_vmread(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct pkvm_host_vcpu *pkvm_hvcpu = to_pkvm_hvcpu(vcpu);
+	struct shadow_vcpu_state *cur_shadow_vcpu = pkvm_hvcpu->current_shadow_vcpu;
+	struct vmcs12 *vmcs12 = (struct vmcs12 *)cur_shadow_vcpu->cached_vmcs12;
+	u32 instr_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	struct x86_exception e;
+	unsigned long field;
+	short offset;
+	gva_t gva = 0;
+	int r, reg;
+	u64 value;
+
+	if (check_vmx_permission(vcpu)) {
+		if (vmx->nested.current_vmptr == INVALID_GPA) {
+			nested_vmx_result(VMfailInvalid, 0);
+		} else {
+			/* Decode instruction info and find the field to read */
+			reg = ((instr_info) >> 28) & 0xf;
+			field = vcpu->arch.regs[reg];
+
+			offset = get_vmcs12_field_offset(field);
+			if (offset < 0) {
+				nested_vmx_result(VMfailInvalid, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+			} else {
+				value = vmcs12_read_any(vmcs12, field, offset);
+				if (instr_info & BIT(10)) {
+					reg = ((instr_info) >> 3) & 0xf;
+					vcpu->arch.regs[reg] = value;
+				} else {
+					if (get_vmx_mem_address(vcpu, vmx->exit_qualification,
+								instr_info, &gva))
+						return 1;
+
+					r = write_gva(vcpu, gva, &value, 8, &e);
+					if (r < 0) {
+						/*TODO: handle memory failure exception */
+						return r;
+					}
+				}
+				nested_vmx_result(VMsucceed, 0);
+			}
+		}
+	}
+
+	return 0;
+}
+
 void pkvm_init_nest(void)
 {
 	init_vmcs_shadow_fields();
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.h b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
index a228b0fdc15d..5fc76bdb135a 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
@@ -9,6 +9,8 @@ int handle_vmxon(struct kvm_vcpu *vcpu);
 int handle_vmxoff(struct kvm_vcpu *vcpu);
 int handle_vmptrld(struct kvm_vcpu *vcpu);
 int handle_vmclear(struct kvm_vcpu *vcpu);
+int handle_vmwrite(struct kvm_vcpu *vcpu);
+int handle_vmread(struct kvm_vcpu *vcpu);
 void pkvm_init_nest(void);
 
 #endif
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
index b2cfb87983a8..d4f2a408e6e9 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
@@ -216,6 +216,16 @@ int pkvm_main(struct kvm_vcpu *vcpu)
 			handle_vmclear(vcpu);
 			skip_instruction = true;
 			break;
+		case EXIT_REASON_VMREAD:
+			pkvm_dbg("CPU%d vmexit reason: VMREAD.\n", vcpu->cpu);
+			handle_vmread(vcpu);
+			skip_instruction = true;
+			break;
+		case EXIT_REASON_VMWRITE:
+			pkvm_dbg("CPU%d vmexit reason: VMWRITE.\n", vcpu->cpu);
+			handle_vmwrite(vcpu);
+			skip_instruction = true;
+			break;
 		case EXIT_REASON_XSETBV:
 			handle_xsetbv(vcpu);
 			skip_instruction = true;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 19/22] pkvm: x86: Add VMLAUNCH/VMRESUME emulation
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (17 preceding siblings ...)
  2023-03-12 18:02 ` [RFC PATCH part-5 18/22] pkvm: x86: Add VMREAD/VMWRITE emulation Jason Chen CJ
@ 2023-03-12 18:03 ` Jason Chen CJ
  2023-03-12 18:03 ` [RFC PATCH part-5 20/22] pkvm: x86: Add INVEPT/INVVPID emulation Jason Chen CJ
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:03 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

Provide emulation for VMLAUNCH and VMRESUME vmx instructions.

As pKVM uses vmcs02 to shadow most of the vmcs12 guest fields, it does not
need to take care of those fields before vmcs02 becomes active. Meanwhile
there are still emulated fields cached in cached_vmcs12, so pKVM needs to
sync & emulate this part of the vmcs12 guest fields from cached_vmcs12 into
vmcs02 before it becomes active.

Another thing is that after a nested guest vmexit (vmcs02 is current) and
before the host vcpu vmentry (vmcs01 is current), pKVM needs to prepare
vmcs01's guest state fields by restoring them from vmcs12's host state - the
vmcs12 host state is where the host vcpu wants to return to.
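
As a toy model of the host-state restore described above (not pKVM code; the
MSR values are made up), the EFER the host VM resumes with is either the
value it programmed into vmcs12's host state, or - when it did not ask for
"load IA32_EFER" on VM exit - the value the nested guest was last running
with:

  #include <stdint.h>
  #include <stdio.h>

  #define VM_EXIT_LOAD_IA32_EFER (1u << 21)   /* bit position per the SDM */

  static uint64_t vmcs01_guest_efer(uint32_t vm_exit_controls,
                                    uint64_t vmcs12_host_efer,
                                    uint64_t l2_guest_efer)
  {
          if (vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
                  return vmcs12_host_efer;   /* L1 reloads its own EFER */
          return l2_guest_efer;              /* L1 keeps the value L2 ran with */
  }

  int main(void)
  {
          printf("0x%llx\n", (unsigned long long)
                 vmcs01_guest_efer(VM_EXIT_LOAD_IA32_EFER, 0xd01, 0x500));
          printf("0x%llx\n", (unsigned long long)
                 vmcs01_guest_efer(0, 0xd01, 0x500));
          return 0;
  }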

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/nested.c | 149 +++++++++++++++++++++++++
 arch/x86/kvm/vmx/pkvm/hyp/nested.h |   3 +
 arch/x86/kvm/vmx/pkvm/hyp/vmexit.c | 170 ++++++++++++++++-------------
 3 files changed, 247 insertions(+), 75 deletions(-)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.c b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
index fd8755621cc8..73fa66ba95bd 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
@@ -450,6 +450,15 @@ static void copy_shadow_fields_vmcs12_to_vmcs02(struct vcpu_vmx *vmx, struct vmc
 	}
 }
 
+/* current vmcs is vmcs01*/
+static void save_vmcs01_fields_for_emulation(struct vcpu_vmx *vmx)
+{
+	vmx->vcpu.arch.efer = vmcs_read64(GUEST_IA32_EFER);
+	vmx->vcpu.arch.pat = vmcs_read64(GUEST_IA32_PAT);
+	vmx->vcpu.arch.dr7 = vmcs_readl(GUEST_DR7);
+	vmx->nested.pre_vmenter_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+}
+
 /* current vmcs is vmcs02*/
 static u64 emulate_field_for_vmcs02(struct vcpu_vmx *vmx, u16 field, u64 virt_val)
 {
@@ -505,6 +514,66 @@ static void sync_vmcs12_dirty_fields_to_vmcs02(struct vcpu_vmx *vmx, struct vmcs
 	}
 }
 
+/* current vmcs is vmcs02*/
+static void update_vmcs02_fields_for_emulation(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
+{
+	/* L1 host wishes to use its own MSRs for L2 guest?
+	 * vmcs02 shall use such guest states in vmcs01 as its guest states
+	 */
+	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER) != VM_ENTRY_LOAD_IA32_EFER)
+		vmcs_write64(GUEST_IA32_EFER, vmx->vcpu.arch.efer);
+	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT) != VM_ENTRY_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
+	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) != VM_ENTRY_LOAD_DEBUG_CONTROLS) {
+		vmcs_writel(GUEST_DR7, vmx->vcpu.arch.dr7);
+		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
+	}
+}
+
+/* current vmcs is vmcs01, set vmcs01 guest state with vmcs02 host state */
+static void prepare_vmcs01_guest_state(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
+{
+	vmcs_writel(GUEST_CR0, vmcs12->host_cr0);
+	vmcs_writel(GUEST_CR3, vmcs12->host_cr3);
+	vmcs_writel(GUEST_CR4, vmcs12->host_cr4);
+
+	vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp);
+	vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->host_ia32_sysenter_eip);
+	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
+
+	/* Both cases want vmcs01 to take EFER/PAT from L2
+	 * 1. L1 host wishes to load its own MSRs on L2 guest VMExit
+	 *    such vmcs12's host states shall be set as vmcs01's guest states
+	 * 2. L1 host wishes to keep use MSRs from L2 guest after its VMExit
+	 *    such vmcs02's guest state shall be set as vmcs01's guest states
+	 *    the vmcs02's guest state were recorded in vmcs12 host
+	 *
+	 * For case 1, IA32_PERF_GLOBAL_CTRL is separately checked.
+	 */
+	vmcs_write64(GUEST_IA32_EFER, vmcs12->host_ia32_efer);
+	vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
+	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
+		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, vmcs12->host_ia32_perf_global_ctrl);
+
+	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->host_cs_selector);
+	vmcs_write16(GUEST_DS_SELECTOR, vmcs12->host_ds_selector);
+	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->host_es_selector);
+	vmcs_write16(GUEST_FS_SELECTOR, vmcs12->host_fs_selector);
+	vmcs_write16(GUEST_GS_SELECTOR, vmcs12->host_gs_selector);
+	vmcs_write16(GUEST_SS_SELECTOR, vmcs12->host_ss_selector);
+	vmcs_write16(GUEST_TR_SELECTOR, vmcs12->host_tr_selector);
+
+	vmcs_writel(GUEST_FS_BASE, vmcs12->host_fs_base);
+	vmcs_writel(GUEST_GS_BASE, vmcs12->host_gs_base);
+	vmcs_writel(GUEST_TR_BASE, vmcs12->host_tr_base);
+	vmcs_writel(GUEST_GDTR_BASE, vmcs12->host_gdtr_base);
+	vmcs_writel(GUEST_IDTR_BASE, vmcs12->host_idtr_base);
+
+	vmcs_writel(GUEST_RIP, vmcs12->host_rip);
+	vmcs_writel(GUEST_RSP, vmcs12->host_rsp);
+	vmcs_writel(GUEST_RFLAGS, 0x2);
+}
+
 static void nested_release_vmcs12(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -538,6 +607,38 @@ static void nested_release_vmcs12(struct kvm_vcpu *vcpu)
 	put_shadow_vcpu(cur_shadow_vcpu->shadow_vcpu_handle);
 }
 
+static void nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct pkvm_host_vcpu *pkvm_hvcpu = to_pkvm_hvcpu(vcpu);
+	struct shadow_vcpu_state *cur_shadow_vcpu = pkvm_hvcpu->current_shadow_vcpu;
+	struct vmcs *vmcs02 = (struct vmcs *)cur_shadow_vcpu->vmcs02;
+	struct vmcs12 *vmcs12 = (struct vmcs12 *)cur_shadow_vcpu->cached_vmcs12;
+
+	if (vmx->nested.current_vmptr == INVALID_GPA) {
+		nested_vmx_result(VMfailInvalid, 0);
+	} else if (vmcs12->launch_state == launch) {
+		/* VMLAUNCH_NONCLEAR_VMCS or VMRESUME_NONLAUNCHED_VMCS */
+		nested_vmx_result(VMfailValid,
+			launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS : VMXERR_VMRESUME_NONLAUNCHED_VMCS);
+	} else {
+		/* save vmcs01 guest state for possible emulation */
+		save_vmcs01_fields_for_emulation(vmx);
+
+		/* switch to vmcs02 */
+		vmcs_clear_track(vmx, vmcs02);
+		clear_shadow_indicator(vmcs02);
+		vmcs_load_track(vmx, vmcs02);
+
+		sync_vmcs12_dirty_fields_to_vmcs02(vmx, vmcs12);
+
+		update_vmcs02_fields_for_emulation(vmx, vmcs12);
+
+		/* mark guest mode */
+		vcpu->arch.hflags |= HF_GUEST_MASK;
+	}
+}
+
 int handle_vmxon(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -809,6 +910,54 @@ int handle_vmread(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+int handle_vmresume(struct kvm_vcpu *vcpu)
+{
+	if (check_vmx_permission(vcpu))
+		nested_vmx_run(vcpu, false);
+
+	return 0;
+}
+
+int handle_vmlaunch(struct kvm_vcpu *vcpu)
+{
+	if (check_vmx_permission(vcpu))
+		nested_vmx_run(vcpu, true);
+
+	return 0;
+}
+
+int nested_vmexit(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct pkvm_host_vcpu *pkvm_hvcpu = to_pkvm_hvcpu(vcpu);
+	struct shadow_vcpu_state *cur_shadow_vcpu = pkvm_hvcpu->current_shadow_vcpu;
+	struct vmcs *vmcs02 = (struct vmcs *)cur_shadow_vcpu->vmcs02;
+	struct vmcs12 *vmcs12 = (struct vmcs12 *)cur_shadow_vcpu->cached_vmcs12;
+
+	/* clear guest mode if need switch back to host */
+	vcpu->arch.hflags &= ~HF_GUEST_MASK;
+
+	/* L1 host wishes to keep use MSRs from L2 guest after its VMExit?
+	 * save vmcs02 guest state for later vmcs01 guest state preparation
+	 */
+	if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER) != VM_EXIT_LOAD_IA32_EFER)
+		vmcs12->host_ia32_efer = vmcs_read64(GUEST_IA32_EFER);
+	if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT) != VM_EXIT_LOAD_IA32_PAT)
+		vmcs12->host_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+
+	if (!vmcs12->launch_state)
+		vmcs12->launch_state = 1;
+
+	/* switch to vmcs01 */
+	vmcs_clear_track(vmx, vmcs02);
+	set_shadow_indicator(vmcs02);
+	vmcs_load_track(vmx, vmx->loaded_vmcs->vmcs);
+
+	prepare_vmcs01_guest_state(vmx, vmcs12);
+
+	return 0;
+}
+
 void pkvm_init_nest(void)
 {
 	init_vmcs_shadow_fields();
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.h b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
index 5fc76bdb135a..3f785be165c2 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
@@ -11,6 +11,9 @@ int handle_vmptrld(struct kvm_vcpu *vcpu);
 int handle_vmclear(struct kvm_vcpu *vcpu);
 int handle_vmwrite(struct kvm_vcpu *vcpu);
 int handle_vmread(struct kvm_vcpu *vcpu);
+int handle_vmresume(struct kvm_vcpu *vcpu);
+int handle_vmlaunch(struct kvm_vcpu *vcpu);
+int nested_vmexit(struct kvm_vcpu *vcpu);
 void pkvm_init_nest(void);
 
 #endif
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
index d4f2a408e6e9..27b6518032b5 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
@@ -159,7 +159,7 @@ int pkvm_main(struct kvm_vcpu *vcpu)
 	int launch = 1;
 
 	do {
-		bool skip_instruction = false;
+		bool skip_instruction = false, guest_exit = false;
 
 		if (__pkvm_vmx_vcpu_run(vcpu->arch.regs, launch)) {
 			pkvm_err("%s: CPU%d run_vcpu failed with error 0x%x\n",
@@ -174,87 +174,107 @@ int pkvm_main(struct kvm_vcpu *vcpu)
 		vmx->exit_reason.full = vmcs_read32(VM_EXIT_REASON);
 		vmx->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 
-		switch (vmx->exit_reason.full) {
-		case EXIT_REASON_CPUID:
-			handle_cpuid(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_CR_ACCESS:
-			pkvm_dbg("CPU%d vmexit_reason: CR_ACCESS.\n", vcpu->cpu);
-			handle_cr(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_MSR_READ:
-			pkvm_dbg("CPU%d vmexit_reason: MSR_READ 0x%lx\n",
-					vcpu->cpu, vcpu->arch.regs[VCPU_REGS_RCX]);
-			handle_read_msr(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_MSR_WRITE:
-			pkvm_dbg("CPU%d vmexit_reason: MSR_WRITE 0x%lx\n",
-					vcpu->cpu, vcpu->arch.regs[VCPU_REGS_RCX]);
-			handle_write_msr(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_VMON:
-			pkvm_dbg("CPU%d vmexit reason: VMXON.\n", vcpu->cpu);
-			handle_vmxon(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_VMOFF:
-			pkvm_dbg("CPU%d vmexit reason: VMXOFF.\n", vcpu->cpu);
-			handle_vmxoff(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_VMPTRLD:
-			pkvm_dbg("CPU%d vmexit reason: VMPTRLD.\n", vcpu->cpu);
-			handle_vmptrld(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_VMCLEAR:
-			pkvm_dbg("CPU%d vmexit reason: VMCLEAR.\n", vcpu->cpu);
-			handle_vmclear(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_VMREAD:
-			pkvm_dbg("CPU%d vmexit reason: VMREAD.\n", vcpu->cpu);
-			handle_vmread(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_VMWRITE:
-			pkvm_dbg("CPU%d vmexit reason: VMWRITE.\n", vcpu->cpu);
-			handle_vmwrite(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_XSETBV:
-			handle_xsetbv(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_VMCALL:
-			vcpu->arch.regs[VCPU_REGS_RAX] = handle_vmcall(vcpu);
-			skip_instruction = true;
-			break;
-		case EXIT_REASON_EPT_VIOLATION:
-			if (handle_host_ept_violation(vmcs_read64(GUEST_PHYSICAL_ADDRESS)))
+		if (is_guest_mode(vcpu)) {
+			guest_exit = true;
+			nested_vmexit(vcpu);
+		} else {
+			switch (vmx->exit_reason.full) {
+			case EXIT_REASON_CPUID:
+				handle_cpuid(vcpu);
 				skip_instruction = true;
-			break;
-		case EXIT_REASON_INTERRUPT_WINDOW:
-			handle_irq_window(vcpu);
-			break;
-		default:
-			pkvm_dbg("CPU%d: Unsupported vmexit reason 0x%x.\n", vcpu->cpu, vmx->exit_reason.full);
-			skip_instruction = true;
-			break;
+				break;
+			case EXIT_REASON_CR_ACCESS:
+				pkvm_dbg("CPU%d vmexit_reason: CR_ACCESS.\n", vcpu->cpu);
+				handle_cr(vcpu);
+				skip_instruction = true;
+				break;
+			case EXIT_REASON_MSR_READ:
+				pkvm_dbg("CPU%d vmexit_reason: MSR_READ 0x%lx\n",
+						vcpu->cpu, vcpu->arch.regs[VCPU_REGS_RCX]);
+				handle_read_msr(vcpu);
+				skip_instruction = true;
+				break;
+			case EXIT_REASON_MSR_WRITE:
+				pkvm_dbg("CPU%d vmexit_reason: MSR_WRITE 0x%lx\n",
+						vcpu->cpu, vcpu->arch.regs[VCPU_REGS_RCX]);
+				handle_write_msr(vcpu);
+				skip_instruction = true;
+				break;
+			case EXIT_REASON_VMLAUNCH:
+				handle_vmlaunch(vcpu);
+				break;
+			case EXIT_REASON_VMRESUME:
+				handle_vmresume(vcpu);
+				break;
+			case EXIT_REASON_VMON:
+				pkvm_dbg("CPU%d vmexit reason: VMXON.\n", vcpu->cpu);
+				handle_vmxon(vcpu);
+				skip_instruction = true;
+				break;
+			case EXIT_REASON_VMOFF:
+				pkvm_dbg("CPU%d vmexit reason: VMXOFF.\n", vcpu->cpu);
+				handle_vmxoff(vcpu);
+				skip_instruction = true;
+				break;
+			case EXIT_REASON_VMPTRLD:
+				pkvm_dbg("CPU%d vmexit reason: VMPTRLD.\n", vcpu->cpu);
+				handle_vmptrld(vcpu);
+				skip_instruction = true;
+				break;
+			case EXIT_REASON_VMCLEAR:
+				pkvm_dbg("CPU%d vmexit reason: VMCLEAR.\n", vcpu->cpu);
+				handle_vmclear(vcpu);
+				skip_instruction = true;
+				break;
+			case EXIT_REASON_VMREAD:
+				pkvm_dbg("CPU%d vmexit reason: VMREAD.\n", vcpu->cpu);
+				handle_vmread(vcpu);
+				skip_instruction = true;
+				break;
+			case EXIT_REASON_VMWRITE:
+				pkvm_dbg("CPU%d vmexit reason: VMWRITE.\n", vcpu->cpu);
+				handle_vmwrite(vcpu);
+				skip_instruction = true;
+				break;
+			case EXIT_REASON_XSETBV:
+				handle_xsetbv(vcpu);
+				skip_instruction = true;
+				break;
+			case EXIT_REASON_VMCALL:
+				vcpu->arch.regs[VCPU_REGS_RAX] = handle_vmcall(vcpu);
+				skip_instruction = true;
+				break;
+			case EXIT_REASON_EPT_VIOLATION:
+				if (handle_host_ept_violation(vmcs_read64(GUEST_PHYSICAL_ADDRESS)))
+					skip_instruction = true;
+				break;
+			case EXIT_REASON_INTERRUPT_WINDOW:
+				handle_irq_window(vcpu);
+				break;
+			default:
+				pkvm_dbg("CPU%d: Unsupported vmexit reason 0x%x.\n", vcpu->cpu, vmx->exit_reason.full);
+				skip_instruction = true;
+				break;
+			}
 		}
 
-		/* now only need vmresume */
-		launch = 0;
+		if (is_guest_mode(vcpu)) {
+			/*
+			 * L2 VMExit -> L2 VMEntry: vmresume
+			 * L1 VMExit -> L2 VMEntry: vmlaunch
+			 * as vmcs02 is clear every time
+			 */
+			launch = guest_exit ? 0 : 1;
+		} else {
+			handle_pending_events(vcpu);
+
+			/* pkvm_host only need vmresume */
+			launch = 0;
+		}
 
 		if (skip_instruction)
 			skip_emulated_instruction();
 
-		handle_pending_events(vcpu);
-
 		native_write_cr2(vcpu->arch.cr2);
 	} while (1);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 20/22] pkvm: x86: Add INVEPT/INVVPID emulation
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (18 preceding siblings ...)
  2023-03-12 18:03 ` [RFC PATCH part-5 19/22] pkvm: x86: Add VMLAUNCH/VMRESUME emulation Jason Chen CJ
@ 2023-03-12 18:03 ` Jason Chen CJ
  2023-03-12 18:03 ` [RFC PATCH part-5 21/22] pkvm: x86: Initialize msr_bitmap for vmsr Jason Chen CJ
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:03 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ

INVEPT & INVVPID cause a vmexit unconditionally, so pKVM must have handlers
for them.

This is a temporary solution which simply does a global invept for both
invept and invvpid. Once pKVM supports shadow EPT, such emulation shall be
done based on the shadow EPT.

Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/vmexit.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
index 27b6518032b5..8e7392010887 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
@@ -251,6 +251,11 @@ int pkvm_main(struct kvm_vcpu *vcpu)
 			case EXIT_REASON_INTERRUPT_WINDOW:
 				handle_irq_window(vcpu);
 				break;
+			case EXIT_REASON_INVEPT:
+			case EXIT_REASON_INVVPID:
+				ept_sync_global();
+				skip_instruction = true;
+				break;
 			default:
 				pkvm_dbg("CPU%d: Unsupported vmexit reason 0x%x.\n", vcpu->cpu, vmx->exit_reason.full);
 				skip_instruction = true;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 21/22] pkvm: x86: Initialize msr_bitmap for vmsr
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (19 preceding siblings ...)
  2023-03-12 18:03 ` [RFC PATCH part-5 20/22] pkvm: x86: Add INVEPT/INVVPID emulation Jason Chen CJ
@ 2023-03-12 18:03 ` Jason Chen CJ
  2023-03-12 18:03 ` [RFC PATCH part-5 22/22] pkvm: x86: Add vmx msr emulation Jason Chen CJ
  2023-03-13 16:58 ` [RFC PATCH part-5 00/22] VMX emulation Sean Christopherson
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:03 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ, Chuanxiao Dong

Introduce an enable_msr_interception API to initialize the msr_bitmap; based
on it, pKVM can set up the list of virtual MSRs which need to be trapped and
emulated in the hypervisor.
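
As a worked example of the 4KB bitmap layout the new helper assumes (read
bitmaps in bytes 0-2047, write bitmaps in bytes 2048-4095, with a 1024-byte
offset for the 0xc0000000-based MSR range), the following standalone snippet
- not part of the patch - computes the byte/bit positions for two MSRs:

  #include <stdio.h>

  static void show(unsigned int msr)
  {
          unsigned int read_offset = (msr & 0xc0000000U) ? 1024U : 0U;
          unsigned int index = (msr & 0x1FFFU) >> 3;
          unsigned int bit = msr & 0x7U;

          printf("MSR 0x%x: read byte %u, write byte %u, bit %u\n",
                 msr, read_offset + index, 2048U + read_offset + index, bit);
  }

  int main(void)
  {
          show(0x485);        /* MSR_IA32_VMX_MISC */
          show(0xc0000080);   /* MSR_EFER, in the high range */
          return 0;
  }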

Signed-off-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/Makefile   |  2 +-
 arch/x86/kvm/vmx/pkvm/hyp/vmexit.c   | 13 +----
 arch/x86/kvm/vmx/pkvm/hyp/vmsr.c     | 73 ++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/pkvm/hyp/vmsr.h     | 11 +++++
 arch/x86/kvm/vmx/pkvm/include/pkvm.h |  2 +
 arch/x86/kvm/vmx/pkvm/pkvm_host.c    |  1 +
 6 files changed, 89 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/Makefile b/arch/x86/kvm/vmx/pkvm/hyp/Makefile
index ca6d43509ddc..fc75cdd9fc79 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/Makefile
+++ b/arch/x86/kvm/vmx/pkvm/hyp/Makefile
@@ -12,7 +12,7 @@ ccflags-y += -D__PKVM_HYP__
 virt-dir	:= ../../../../../../$(KVM_PKVM)
 
 pkvm-hyp-y	:= vmx_asm.o vmexit.o memory.o early_alloc.o pgtable.o mmu.o pkvm.o \
-		   init_finalise.o ept.o idt.o irq.o nested.o vmx.o
+		   init_finalise.o ept.o idt.o irq.o nested.o vmx.o vmsr.o
 
 ifndef CONFIG_PKVM_INTEL_DEBUG
 lib-dir		:= lib
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
index 8e7392010887..307514f44ec9 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmexit.c
@@ -9,6 +9,7 @@
 #include "vmexit.h"
 #include "ept.h"
 #include "pkvm_hyp.h"
+#include "vmsr.h"
 #include "nested.h"
 #include "debug.h"
 
@@ -109,18 +110,6 @@ static unsigned long handle_vmcall(struct kvm_vcpu *vcpu)
 	return ret;
 }
 
-static void handle_read_msr(struct kvm_vcpu *vcpu)
-{
-	/* simply return 0 for non-supported MSRs */
-	vcpu->arch.regs[VCPU_REGS_RAX] = 0;
-	vcpu->arch.regs[VCPU_REGS_RDX] = 0;
-}
-
-static void handle_write_msr(struct kvm_vcpu *vcpu)
-{
-	/*No emulation for msr write now*/
-}
-
 static void handle_xsetbv(struct kvm_vcpu *vcpu)
 {
 	u32 eax = (u32)(vcpu->arch.regs[VCPU_REGS_RAX] & -1u);
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmsr.c b/arch/x86/kvm/vmx/pkvm/hyp/vmsr.c
new file mode 100644
index 000000000000..360b0333b84f
--- /dev/null
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmsr.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
+/*
+ * Copyright (C) 2018-2022 Intel Corporation
+ */
+
+#include <pkvm.h>
+#include "cpu.h"
+#include "debug.h"
+
+#define INTERCEPT_DISABLE		(0U)
+#define INTERCEPT_READ			(1U << 0U)
+#define INTERCEPT_WRITE			(1U << 1U)
+#define INTERCEPT_READ_WRITE		(INTERCEPT_READ | INTERCEPT_WRITE)
+
+static unsigned int emulated_ro_guest_msrs[] = {
+	/* DUMMY */
+};
+
+static void enable_msr_interception(u8 *bitmap, unsigned int msr_arg, unsigned int mode)
+{
+	unsigned int read_offset = 0U;
+	unsigned int write_offset = 2048U;
+	unsigned int msr = msr_arg;
+	u8 msr_bit;
+	unsigned int msr_index;
+
+	if ((msr <= 0x1FFFU) || ((msr >= 0xc0000000U) && (msr <= 0xc0001fffU))) {
+		if ((msr & 0xc0000000U) != 0U) {
+			read_offset = read_offset + 1024U;
+			write_offset = write_offset + 1024U;
+		}
+
+		msr &= 0x1FFFU;
+		msr_bit = (u8)(1U << (msr & 0x7U));
+		msr_index = msr >> 3U;
+
+		if ((mode & INTERCEPT_READ) == INTERCEPT_READ)
+			bitmap[read_offset + msr_index] |= msr_bit;
+		else
+			bitmap[read_offset + msr_index] &= ~msr_bit;
+
+		if ((mode & INTERCEPT_WRITE) == INTERCEPT_WRITE)
+			bitmap[write_offset + msr_index] |= msr_bit;
+		else
+			bitmap[write_offset + msr_index] &= ~msr_bit;
+	} else {
+		pkvm_err("%s, Invalid MSR: 0x%x", __func__, msr);
+	}
+}
+
+int handle_read_msr(struct kvm_vcpu *vcpu)
+{
+	/* simply return 0 for non-supported MSRs */
+	vcpu->arch.regs[VCPU_REGS_RAX] = 0;
+	vcpu->arch.regs[VCPU_REGS_RDX] = 0;
+
+	return 0;
+}
+
+int handle_write_msr(struct kvm_vcpu *vcpu)
+{
+	/*No emulation for msr write now*/
+	return 0;
+}
+
+void init_msr_emulation(struct vcpu_vmx *vmx)
+{
+	int i;
+	u8 *bitmap = (u8 *)vmx->loaded_vmcs->msr_bitmap;
+
+	for (i = 0; i < ARRAY_SIZE(emulated_ro_guest_msrs); i++)
+		enable_msr_interception(bitmap, emulated_ro_guest_msrs[i], INTERCEPT_READ);
+}
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmsr.h b/arch/x86/kvm/vmx/pkvm/hyp/vmsr.h
new file mode 100644
index 000000000000..1f39a37996f4
--- /dev/null
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmsr.h
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Intel Corporation
+ */
+#ifndef _PKVM_VMSR_H_
+#define _PKVM_VMSR_H_
+
+int handle_read_msr(struct kvm_vcpu *vcpu);
+int handle_write_msr(struct kvm_vcpu *vcpu);
+
+#endif
diff --git a/arch/x86/kvm/vmx/pkvm/include/pkvm.h b/arch/x86/kvm/vmx/pkvm/include/pkvm.h
index 9b45627853b3..59bbe645baaa 100644
--- a/arch/x86/kvm/vmx/pkvm/include/pkvm.h
+++ b/arch/x86/kvm/vmx/pkvm/include/pkvm.h
@@ -106,6 +106,8 @@ PKVM_DECLARE(void *, pkvm_early_alloc_contig(unsigned int nr_pages));
 PKVM_DECLARE(void *, pkvm_early_alloc_page(void));
 PKVM_DECLARE(void, pkvm_early_alloc_init(void *virt, unsigned long size));
 
+PKVM_DECLARE(void, init_msr_emulation(struct vcpu_vmx *vmx));
+
 PKVM_DECLARE(void, noop_handler(void));
 PKVM_DECLARE(void, nmi_handler(void));
 
diff --git a/arch/x86/kvm/vmx/pkvm/pkvm_host.c b/arch/x86/kvm/vmx/pkvm/pkvm_host.c
index cbba3033ba63..90d7cddde9ef 100644
--- a/arch/x86/kvm/vmx/pkvm/pkvm_host.c
+++ b/arch/x86/kvm/vmx/pkvm/pkvm_host.c
@@ -280,6 +280,7 @@ static __init void init_execution_control(struct vcpu_vmx *vmx,
 	/* guest handles exception directly */
 	vmcs_write32(EXCEPTION_BITMAP, 0);
 
+	pkvm_sym(init_msr_emulation(vmx));
 	vmcs_write64(MSR_BITMAP, __pa(vmx->vmcs01.msr_bitmap));
 
 	/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH part-5 22/22] pkvm: x86: Add vmx msr emulation
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (20 preceding siblings ...)
  2023-03-12 18:03 ` [RFC PATCH part-5 21/22] pkvm: x86: Initialize msr_bitmap for vmsr Jason Chen CJ
@ 2023-03-12 18:03 ` Jason Chen CJ
  2023-03-13 16:58 ` [RFC PATCH part-5 00/22] VMX emulation Sean Christopherson
  22 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-12 18:03 UTC (permalink / raw)
  To: kvm; +Cc: Jason Chen CJ, Chuanxiao Dong

The host VM sees the VMX capability, but with reduced features. pKVM needs to
provide emulation of these vmx msrs to report the supported VMX capabilities
to the host VM.
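
For the VMX control capability MSRs (e.g. MSR_IA32_VMX_PROCBASED_CTLS2), the
low 32 bits report the allowed 0-settings and the high 32 bits the allowed
1-settings, so hiding a feature from the host VM amounts to clearing its bit
in the high half before returning the value. A standalone sketch (the
hardware value below is made up for illustration):

  #include <stdint.h>
  #include <stdio.h>

  #define SECONDARY_EXEC_DESC             0x00000004u
  #define SECONDARY_EXEC_ENABLE_VMFUNC    0x00002000u
  #define SECONDARY_EXEC_SHADOW_VMCS      0x00004000u

  int main(void)
  {
          uint32_t high = 0x0053cfffu;    /* hypothetical allowed-1 bits */
          uint32_t hide = SECONDARY_EXEC_DESC | SECONDARY_EXEC_ENABLE_VMFUNC |
                          SECONDARY_EXEC_SHADOW_VMCS;

          printf("host VM sees allowed-1 bits: 0x%08x\n", high & ~hide);
          return 0;
  }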

Signed-off-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>
---
 arch/x86/kvm/vmx/pkvm/hyp/nested.c            | 65 +++++++++++++++++++
 arch/x86/kvm/vmx/pkvm/hyp/nested.h            |  8 +++
 .../vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h    |  2 +-
 arch/x86/kvm/vmx/pkvm/hyp/vmsr.c              | 25 +++++--
 4 files changed, 94 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.c b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
index 73fa66ba95bd..429bfe7bb309 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.c
@@ -6,9 +6,71 @@
 #include <pkvm.h>
 
 #include "pkvm_hyp.h"
+#include "nested.h"
+#include "cpu.h"
 #include "vmx.h"
 #include "debug.h"
 
+/*
+ * Not support shadow vmcs & vmfunc;
+ * Not support descriptor-table exiting
+ * as it requires guest memory access
+ * to decode and emulate instructions
+ * which is not supported for protected VM.
+ */
+#define NESTED_UNSUPPORTED_2NDEXEC 		\
+	(SECONDARY_EXEC_SHADOW_VMCS | 		\
+	 SECONDARY_EXEC_ENABLE_VMFUNC | 	\
+	 SECONDARY_EXEC_DESC)
+
+static const unsigned int vmx_msrs[] = {
+	LIST_OF_VMX_MSRS
+};
+
+bool is_vmx_msr(unsigned long msr)
+{
+	bool found = false;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(vmx_msrs); i++) {
+		if (msr == vmx_msrs[i]) {
+			found = true;
+			break;
+		}
+	}
+
+	return found;
+}
+
+int read_vmx_msr(struct kvm_vcpu *vcpu, unsigned long msr, u64 *val)
+{
+	u32 low, high;
+	int err = 0;
+
+	pkvm_rdmsr(msr, low, high);
+
+	switch (msr) {
+	case MSR_IA32_VMX_PROCBASED_CTLS2:
+		high &= ~NESTED_UNSUPPORTED_2NDEXEC;
+		break;
+	case MSR_IA32_VMX_MISC:
+		/* not support PT, SMM */
+		low &= ~(MSR_IA32_VMX_MISC_INTEL_PT | BIT(28));
+		break;
+	case MSR_IA32_VMX_VMFUNC:
+		/* not support vmfunc */
+		low = high = 0;
+		break;
+	default:
+		err = -EACCES;
+		break;
+	}
+
+	*val = (u64)high << 32 | (u64)low;
+
+	return err;
+}
+
 /**
  * According to SDM Appendix B Field Encoding in VMCS, some fields only
  * exist on processor that support the 1-setting of the corresponding
@@ -492,6 +554,9 @@ static u64 emulate_field_for_vmcs02(struct vcpu_vmx *vmx, u16 field, u64 virt_va
 		/* host always in 64bit mode */
 		val |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
 		break;
+	case SECONDARY_VM_EXEC_CONTROL:
+		val &= ~NESTED_UNSUPPORTED_2NDEXEC;
+		break;
 	}
 	return val;
 }
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/nested.h b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
index 3f785be165c2..24cf731e96dd 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/nested.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/nested.h
@@ -16,4 +16,12 @@ int handle_vmlaunch(struct kvm_vcpu *vcpu);
 int nested_vmexit(struct kvm_vcpu *vcpu);
 void pkvm_init_nest(void);
 
+#define LIST_OF_VMX_MSRS        		\
+	MSR_IA32_VMX_MISC,                      \
+	MSR_IA32_VMX_PROCBASED_CTLS2,           \
+	MSR_IA32_VMX_VMFUNC
+
+bool is_vmx_msr(unsigned long msr);
+int read_vmx_msr(struct kvm_vcpu *vcpu, unsigned long msr, u64 *val);
+
 #endif
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h
index 8666cda4ee6d..7b0f1d73d76c 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h
+++ b/arch/x86/kvm/vmx/pkvm/hyp/pkvm_nested_vmcs_fields.h
@@ -28,6 +28,7 @@ EMULATED_FIELD_RW(VIRTUAL_PROCESSOR_ID, virtual_processor_id)
 /* 32-bits */
 EMULATED_FIELD_RW(VM_EXIT_CONTROLS, vm_exit_controls)
 EMULATED_FIELD_RW(VM_ENTRY_CONTROLS, vm_entry_controls)
+EMULATED_FIELD_RW(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control)
 
 /* 64-bits, what about their HIGH 32 fields?  */
 EMULATED_FIELD_RW(IO_BITMAP_A, io_bitmap_a)
@@ -77,7 +78,6 @@ SHADOW_FIELD_RW(GUEST_PML_INDEX, guest_pml_index)
 /* 32-bits */
 SHADOW_FIELD_RW(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control)
 SHADOW_FIELD_RW(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control)
-SHADOW_FIELD_RW(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control)
 SHADOW_FIELD_RW(EXCEPTION_BITMAP, exception_bitmap)
 SHADOW_FIELD_RW(PAGE_FAULT_ERROR_CODE_MASK, page_fault_error_code_mask)
 SHADOW_FIELD_RW(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match)
diff --git a/arch/x86/kvm/vmx/pkvm/hyp/vmsr.c b/arch/x86/kvm/vmx/pkvm/hyp/vmsr.c
index 360b0333b84f..ec7476debf25 100644
--- a/arch/x86/kvm/vmx/pkvm/hyp/vmsr.c
+++ b/arch/x86/kvm/vmx/pkvm/hyp/vmsr.c
@@ -5,6 +5,7 @@
 
 #include <pkvm.h>
 #include "cpu.h"
+#include "nested.h"
 #include "debug.h"
 
 #define INTERCEPT_DISABLE		(0U)
@@ -13,7 +14,7 @@
 #define INTERCEPT_READ_WRITE		(INTERCEPT_READ | INTERCEPT_WRITE)
 
 static unsigned int emulated_ro_guest_msrs[] = {
-	/* DUMMY */
+	LIST_OF_VMX_MSRS,
 };
 
 static void enable_msr_interception(u8 *bitmap, unsigned int msr_arg, unsigned int mode)
@@ -50,11 +51,25 @@ static void enable_msr_interception(u8 *bitmap, unsigned int msr_arg, unsigned i
 
 int handle_read_msr(struct kvm_vcpu *vcpu)
 {
-	/* simply return 0 for non-supported MSRs */
-	vcpu->arch.regs[VCPU_REGS_RAX] = 0;
-	vcpu->arch.regs[VCPU_REGS_RDX] = 0;
+	unsigned long msr = vcpu->arch.regs[VCPU_REGS_RCX];
+	int ret = 0;
+	u32 low = 0, high = 0;
+	u64 val;
 
-	return 0;
+	/* For non-supported MSRs, return low=high=0 by default */
+	if (is_vmx_msr(msr)) {
+		ret = read_vmx_msr(vcpu, msr, &val);
+		if (!ret) {
+			low = (u32)val;
+			high = (u32)(val >> 32);
+		}
+	}
+	pkvm_dbg("%s: CPU%d Value of msr 0x%lx: low=0x%x, high=0x%x\n", __func__, vcpu->cpu, msr, low, high);
+
+	vcpu->arch.regs[VCPU_REGS_RAX] = low;
+	vcpu->arch.regs[VCPU_REGS_RDX] = high;
+
+	return ret;
 }
 
 int handle_write_msr(struct kvm_vcpu *vcpu)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH part-5 00/22] VMX emulation
  2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
                   ` (21 preceding siblings ...)
  2023-03-12 18:03 ` [RFC PATCH part-5 22/22] pkvm: x86: Add vmx msr emulation Jason Chen CJ
@ 2023-03-13 16:58 ` Sean Christopherson
  2023-03-14 16:29   ` Jason Chen CJ
  22 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2023-03-13 16:58 UTC (permalink / raw)
  To: Jason Chen CJ; +Cc: kvm

On Mon, Mar 13, 2023, Jason Chen CJ wrote:
> This patch set is part-5 of this RFC patches. It introduces VMX
> emulation for pKVM on Intel platform.
> 
> Host VM wants the capability to run its guest, it needs VMX support.

No, the host VM only needs a way to request pKVM to run a VM.  If we go down the
rabbit hole of pKVM on x86, I think we should take the red pill[*] and go all the
way down said rabbit hole by heavily paravirtualizing the KVM=>pKVM interface.

Except for VMCALL vs. VMMCALL, it should be possible to eliminate all traces of
VMX and SVM from the interface.  That means no VMCS emulation, no EPT shadowing,
etc.  As a bonus, any paravirt stuff we do for pKVM x86 would also be usable for
KVM-on-KVM nested virtualization.

E.g. an idea floating around my head is to add a paravirt paging interface for
KVM-on-KVM so that L1's (KVM-high in this RFC) doesn't need to maintain its own
TDP page tables.  I haven't pursued that idea in any real capacity since most
nested virtualization use cases for KVM involve running an older L1 kernel and/or
a non-KVM L1 hypervisor, i.e. there's no concrete use case to justify the development
and maintenance cost.  But if the PV code is "needed" by pKVM anyways...

[*] You take the blue pill, the story ends, you wake up in your bed and believe
    whatever you want to believe. You take the red pill, you stay in wonderland,
    and I show you how deep the rabbit hole goes.

    -Morpheus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH part-5 00/22] VMX emulation
  2023-03-13 16:58 ` [RFC PATCH part-5 00/22] VMX emulation Sean Christopherson
@ 2023-03-14 16:29   ` Jason Chen CJ
  2023-06-08 21:38     ` Dmytro Maluka
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Chen CJ @ 2023-03-14 16:29 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm

On Mon, Mar 13, 2023 at 09:58:27AM -0700, Sean Christopherson wrote:
> On Mon, Mar 13, 2023, Jason Chen CJ wrote:
> > This patch set is part-5 of this RFC patches. It introduces VMX
> > emulation for pKVM on Intel platform.
> > 
> > Host VM wants the capability to run its guest, it needs VMX support.
> 
> No, the host VM only needs a way to request pKVM to run a VM.  If we go down the
> rabbit hole of pKVM on x86, I think we should take the red pill[*] and go all the
> way down said rabbit hole by heavily paravirtualizing the KVM=>pKVM interface.

hi, Sean,

Like I mentioned in the reply for "[RFC PATCH part-1 0/5] pKVM on Intel
Platform Introduction", we hope VMX emulation can be there at least for
normal VM support.

> 
> Except for VMCALL vs. VMMCALL, it should be possible to eliminate all traces of
> VMX and SVM from the interface.  That means no VMCS emulation, no EPT shadowing,
> etc.  As a bonus, any paravirt stuff we do for pKVM x86 would also be usable for
> KVM-on-KVM nested virtualization.
> 
> E.g. an idea floating around my head is to add a paravirt paging interface for
> KVM-on-KVM so that L1's (KVM-high in this RFC) doesn't need to maintain its own
> TDP page tables.  I haven't pursued that idea in any real capacity since most
> nested virtualization use cases for KVM involve running an older L1 kernel and/or
> a non-KVM L1 hypervisor, i.e. there's no concrete use case to justify the development
> and maintenance cost.  But if the PV code is "needed" by pKVM anyways...

Yes, I agree, we could have performance & mem cost benefit by using
paravirt stuff for KVM-on-KVM nested virtualization. May I know do I
miss other benefit you saw?

> 
> [*] You take the blue pill, the story ends, you wake up in your bed and believe
>     whatever you want to believe. You take the red pill, you stay in wonderland,
>     and I show you how deep the rabbit hole goes.
> 
>     -Morpheus

-- 

Thanks
Jason CJ Chen

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH part-5 00/22] VMX emulation
  2023-03-14 16:29   ` Jason Chen CJ
@ 2023-06-08 21:38     ` Dmytro Maluka
  2023-06-09  2:07       ` Chen, Jason CJ
  2023-06-15 21:13       ` Nadav Amit
  0 siblings, 2 replies; 34+ messages in thread
From: Dmytro Maluka @ 2023-06-08 21:38 UTC (permalink / raw)
  To: Jason Chen CJ, Sean Christopherson
  Cc: kvm, android-kvm, Dmitry Torokhov, Tomasz Nowicki,
	Grzegorz Jaszczyk, Keir Fraser

On 3/14/23 17:29, Jason Chen CJ wrote:
> On Mon, Mar 13, 2023 at 09:58:27AM -0700, Sean Christopherson wrote:
>> On Mon, Mar 13, 2023, Jason Chen CJ wrote:
>>> This patch set is part-5 of this RFC patches. It introduces VMX
>>> emulation for pKVM on Intel platform.
>>>
>>> Host VM wants the capability to run its guest, it needs VMX support.
>>
>> No, the host VM only needs a way to request pKVM to run a VM.  If we go down the
>> rabbit hole of pKVM on x86, I think we should take the red pill[*] and go all the
>> way down said rabbit hole by heavily paravirtualizing the KVM=>pKVM interface.
> 
> hi, Sean,
> 
> Like I mentioned in the reply for "[RFC PATCH part-1 0/5] pKVM on Intel
> Platform Introduction", we hope VMX emulation can be there at least for
> normal VM support.
> 
>>
>> Except for VMCALL vs. VMMCALL, it should be possible to eliminate all traces of
>> VMX and SVM from the interface.  That means no VMCS emulation, no EPT shadowing,
>> etc.  As a bonus, any paravirt stuff we do for pKVM x86 would also be usable for
>> KVM-on-KVM nested virtualization.
>>
>> E.g. an idea floating around my head is to add a paravirt paging interface for
>> KVM-on-KVM so that L1's (KVM-high in this RFC) doesn't need to maintain its own
>> TDP page tables.  I haven't pursued that idea in any real capacity since most
>> nested virtualization use cases for KVM involve running an older L1 kernel and/or
>> a non-KVM L1 hypervisor, i.e. there's no concrete use case to justify the development
>> and maintenance cost.  But if the PV code is "needed" by pKVM anyways...
> 
> Yes, I agree, we could have performance & mem cost benefit by using
> paravirt stuff for KVM-on-KVM nested virtualization. May I know do I
> miss other benefit you saw?

As I see it, the advantages of a PV design for pKVM are:

- performance
- memory cost
- code simplicity (of the pKVM hypervisor, first of all)
- better alignment with the pKVM on ARM

Regarding performance, I actually suspect it may even be the least significant
of the above. I guess with a PV design we'd have roughly as many extra vmexits
as we have now (just due to hypercalls instead of traps on emulated VMX
instructions etc), so perhaps the performance improvement would be not as big
as we might expect (am I wrong?).

But the memory cost advantage seems to be very attractive. With the emulated
design pKVM needs to maintain shadow page tables (and other shadow structures
too, but page tables are the most memory demanding). Moreover, the number of
shadow page tables is obviously proportional to the number of VMs running, and
since pKVM reserves all its memory upfront preparing for the worst case, we
have pretty restrictive limits on the maximum number of VMs [*] (and if we run
fewer VMs than this limit, we waste memory).

To give some numbers, on a machine with 8GB of RAM, on ChromeOS with this
pKVM-on-x86 PoC currently we have pKVM memory cost of 229MB (and it only allows
up to 10 VMs running simultaneously), while on Android (ARM) it is afaik only
44MB. According to my analysis, if we get rid of all the shadow tables in pKVM,
we should have 44MB on x86 too (regardless of the maximum number of VMs).

[*] And some other limits too, e.g. on the maximum number of DMA-capable
devices, since pKVM also needs shadow IOMMU page tables if we have only 1-stage
IOMMU.

> 
>>
>> [*] You take the blue pill, the story ends, you wake up in your bed and believe
>>     whatever you want to believe. You take the red pill, you stay in wonderland,
>>     and I show you how deep the rabbit hole goes.
>>
>>     -Morpheus
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [RFC PATCH part-5 00/22] VMX emulation
  2023-06-08 21:38     ` Dmytro Maluka
@ 2023-06-09  2:07       ` Chen, Jason CJ
  2023-06-09  8:34         ` Dmytro Maluka
  2023-06-15 21:13       ` Nadav Amit
  1 sibling, 1 reply; 34+ messages in thread
From: Chen, Jason CJ @ 2023-06-09  2:07 UTC (permalink / raw)
  To: Dmytro Maluka, Christopherson,, Sean
  Cc: kvm, android-kvm, Dmitry Torokhov, Tomasz Nowicki,
	Grzegorz Jaszczyk, Keir Fraser, Chen, Jason CJ

> -----Original Message-----
> From: Dmytro Maluka <dmy@semihalf.com>
> Sent: Friday, June 9, 2023 5:38 AM
> To: Chen, Jason CJ <jason.cj.chen@intel.com>; Christopherson,, Sean
> <seanjc@google.com>
> Cc: kvm@vger.kernel.org; android-kvm@google.com; Dmitry Torokhov
> <dtor@chromium.org>; Tomasz Nowicki <tn@semihalf.com>; Grzegorz Jaszczyk
> <jaz@semihalf.com>; Keir Fraser <keirf@google.com>
> Subject: Re: [RFC PATCH part-5 00/22] VMX emulation
> 
> On 3/14/23 17:29, Jason Chen CJ wrote:
> > On Mon, Mar 13, 2023 at 09:58:27AM -0700, Sean Christopherson wrote:
> >> On Mon, Mar 13, 2023, Jason Chen CJ wrote:
> >>> This patch set is part-5 of this RFC patches. It introduces VMX
> >>> emulation for pKVM on Intel platform.
> >>>
> >>> Host VM wants the capability to run its guest, it needs VMX support.
> >>
> >> No, the host VM only needs a way to request pKVM to run a VM.  If we
> >> go down the rabbit hole of pKVM on x86, I think we should take the
> >> red pill[*] and go all the way down said rabbit hole by heavily paravirtualizing
> the KVM=>pKVM interface.
> >
> > hi, Sean,
> >
> > Like I mentioned in the reply for "[RFC PATCH part-1 0/5] pKVM on
> > Intel Platform Introduction", we hope VMX emulation can be there at
> > least for normal VM support.
> >
> >>
> >> Except for VMCALL vs. VMMCALL, it should be possible to eliminate all
> >> traces of VMX and SVM from the interface.  That means no VMCS
> >> emulation, no EPT shadowing, etc.  As a bonus, any paravirt stuff we
> >> do for pKVM x86 would also be usable for KVM-on-KVM nested virtualization.
> >>
> >> E.g. an idea floating around my head is to add a paravirt paging
> >> interface for KVM-on-KVM so that L1's (KVM-high in this RFC) doesn't
> >> need to maintain its own TDP page tables.  I haven't pursued that
> >> idea in any real capacity since most nested virtualization use cases
> >> for KVM involve running an older L1 kernel and/or a non-KVM L1
> >> hypervisor, i.e. there's no concrete use case to justify the development and
> maintenance cost.  But if the PV code is "needed" by pKVM anyways...
> >
> > Yes, I agree, we could have performance & mem cost benefit by using
> > paravirt stuff for KVM-on-KVM nested virtualization. May I know do I
> > miss other benefit you saw?
> 
> As I see it, the advantages of a PV design for pKVM are:
> 
> - performance
> - memory cost
> - code simplicity (of the pKVM hypervisor, first of all)
> - better alignment with the pKVM on ARM
> 
> Regarding performance, I actually suspect it may even be the least significant of
> the above. I guess with a PV design we'd have roughly as many extra vmexits as
> we have now (just due to hypercalls instead of traps on emulated VMX
> instructions etc), so perhaps the performance improvement would be not as big
> as we might expect (am I wrong?).

I think with a PV design we can benefit from skipping shadowing. For example, a TLB
flush could be done in the hypervisor directly, while with shadow EPT it has to be
emulated by destroying the shadow EPT page table entries and then re-shadowing upon
the next EPT violation.
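
For illustration only, a rough sketch of the two paths (the helper and struct
names below are hypothetical, not actual pKVM code):

/*
 * Emulated VMX: an INVEPT from the host VM traps into pKVM, which can only
 * drop the shadow EPT entries and rebuild them lazily on later EPT violations.
 */
static void emulate_invept(struct shadow_vcpu_state *shadow_vcpu)
{
	pkvm_shadow_ept_zap(shadow_vcpu);	/* hypothetical helper */
	/* mappings are re-created later in the EPT-violation handler */
}

/*
 * PV: the host VM asks pKVM to unmap directly, so pKVM just updates the single
 * hypervisor-owned EPT and flushes the TLB itself.
 */
static void handle_pv_unmap(struct pkvm_guest *guest, u64 gfn, u64 nr_pages)
{
	pkvm_ept_unmap(guest, gfn, nr_pages);	/* hypothetical helper */
	pkvm_flush_ept_tlb(guest);		/* single INVEPT, no re-shadowing */
}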

Based on PV, with well-designed interfaces, I suppose we can also come up with a
general design for nested support on KVM-on-hypervisor (e.g., do it first for
KVM-on-KVM, then extend it to support KVM-on-pKVM and others).

> 
> But the memory cost advantage seems to be very attractive. With the emulated
> design pKVM needs to maintain shadow page tables (and other shadow
> structures too, but page tables are the most memory demanding). Moreover,
> the number of shadow page tables is obviously proportional to the number of
> VMs running, and since pKVM reserves all its memory upfront preparing for the
> worst case, we have pretty restrictive limits on the maximum number of VMs [*]
> (and if we run fewer VMs than this limit, we waste memory).
> 
> To give some numbers, on a machine with 8GB of RAM, on ChromeOS with this
> pKVM-on-x86 PoC currently we have pKVM memory cost of 229MB (and it only
> allows up to 10 VMs running simultaneously), while on Android (ARM) it is afaik
> only 44MB. According to my analysis, if we get rid of all the shadow tables in
> pKVM, we should have 44MB on x86 too (regardless of the maximum number of
> VMs).
> 
> [*] And some other limits too, e.g. on the maximum number of DMA-capable
> devices, since pKVM also needs shadow IOMMU page tables if we have only 1-
> stage IOMMU.

I may not have captured your meaning. Do you mean the device wants 2-stage
translation while we only have a 1-stage IOMMU? If so, I'm not sure there is a
real use case.

Per my understanding, for a PV IOMMU the simplest implementation is to just
maintain the 1-stage DMA mapping in the hypervisor, as the guest most likely
only wants a 1-stage DMA mapping for its device. If, for an IOMMU with nested
capability, the guest also wants to use that nested capability (e.g., for vSVA),
we can further extend the PV IOMMU interfaces.
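
For illustration, a minimal sketch of what the host-side map path of such a PV
IOMMU could look like (the hypercall number, name and argument packing below
are made up for this example, not a defined interface):

/* depends on <asm/kvm_para.h> for kvm_hypercall4() */
static long pv_iommu_map(u32 dev_id, u64 iova, u64 pa, u64 size, u32 prot)
{
	/* ask the hypervisor to install the 1-stage DMA mapping on our behalf */
	return kvm_hypercall4(PV_IOMMU_HC_MAP,
			      ((u64)dev_id << 32) | prot, iova, pa, size);
}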

> 
> >
> >>
> >> [*] You take the blue pill, the story ends, you wake up in your bed and believe
> >>     whatever you want to believe. You take the red pill, you stay in wonderland,
> >>     and I show you how deep the rabbit hole goes.
> >>
> >>     -Morpheus
> >

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH part-5 00/22] VMX emulation
  2023-06-09  2:07       ` Chen, Jason CJ
@ 2023-06-09  8:34         ` Dmytro Maluka
  2023-06-13 19:50           ` Sean Christopherson
  2023-06-15  3:59           ` Chen, Jason CJ
  0 siblings, 2 replies; 34+ messages in thread
From: Dmytro Maluka @ 2023-06-09  8:34 UTC (permalink / raw)
  To: Chen, Jason CJ, Christopherson,, Sean
  Cc: kvm, android-kvm, Dmitry Torokhov, Tomasz Nowicki,
	Grzegorz Jaszczyk, Keir Fraser

On 6/9/23 04:07, Chen, Jason CJ wrote:
>> -----Original Message-----
>> From: Dmytro Maluka <dmy@semihalf.com>
>> Sent: Friday, June 9, 2023 5:38 AM
>> To: Chen, Jason CJ <jason.cj.chen@intel.com>; Christopherson,, Sean
>> <seanjc@google.com>
>> Cc: kvm@vger.kernel.org; android-kvm@google.com; Dmitry Torokhov
>> <dtor@chromium.org>; Tomasz Nowicki <tn@semihalf.com>; Grzegorz Jaszczyk
>> <jaz@semihalf.com>; Keir Fraser <keirf@google.com>
>> Subject: Re: [RFC PATCH part-5 00/22] VMX emulation
>>
>> On 3/14/23 17:29, Jason Chen CJ wrote:
>>> On Mon, Mar 13, 2023 at 09:58:27AM -0700, Sean Christopherson wrote:
>>>> On Mon, Mar 13, 2023, Jason Chen CJ wrote:
>>>>> This patch set is part-5 of this RFC patches. It introduces VMX
>>>>> emulation for pKVM on Intel platform.
>>>>>
>>>>> Host VM wants the capability to run its guest, it needs VMX support.
>>>>
>>>> No, the host VM only needs a way to request pKVM to run a VM.  If we
>>>> go down the rabbit hole of pKVM on x86, I think we should take the
>>>> red pill[*] and go all the way down said rabbit hole by heavily paravirtualizing
>> the KVM=>pKVM interface.
>>>
>>> hi, Sean,
>>>
>>> Like I mentioned in the reply for "[RFC PATCH part-1 0/5] pKVM on
>>> Intel Platform Introduction", we hope VMX emulation can be there at
>>> least for normal VM support.
>>>
>>>>
>>>> Except for VMCALL vs. VMMCALL, it should be possible to eliminate all
>>>> traces of VMX and SVM from the interface.  That means no VMCS
>>>> emulation, no EPT shadowing, etc.  As a bonus, any paravirt stuff we
>>>> do for pKVM x86 would also be usable for KVM-on-KVM nested virtualization.
>>>>
>>>> E.g. an idea floating around my head is to add a paravirt paging
>>>> interface for KVM-on-KVM so that L1's (KVM-high in this RFC) doesn't
>>>> need to maintain its own TDP page tables.  I haven't pursued that
>>>> idea in any real capacity since most nested virtualization use cases
>>>> for KVM involve running an older L1 kernel and/or a non-KVM L1
>>>> hypervisor, i.e. there's no concrete use case to justify the development and
>> maintenance cost.  But if the PV code is "needed" by pKVM anyways...
>>>
>>> Yes, I agree, we could have performance & mem cost benefit by using
>>> paravirt stuff for KVM-on-KVM nested virtualization. May I know do I
>>> miss other benefit you saw?
>>
>> As I see it, the advantages of a PV design for pKVM are:
>>
>> - performance
>> - memory cost
>> - code simplicity (of the pKVM hypervisor, first of all)
>> - better alignment with the pKVM on ARM
>>
>> Regarding performance, I actually suspect it may even be the least significant of
>> the above. I guess with a PV design we'd have roughly as many extra vmexits as
>> we have now (just due to hypercalls instead of traps on emulated VMX
>> instructions etc), so perhaps the performance improvement would be not as big
>> as we might expect (am I wrong?).
> 
> I think with PV design, we can benefit from skip shadowing. For example, a TLB flush
> could be done in hypervisor directly, while shadowing EPT need emulate it by destroy
> shadow EPT page table entries then do next shadowing upon ept violation.

Yeah indeed, good point.

Is my understanding correct: TLB flush is still gonna be requested by
the host VM via a hypercall, but the benefit is that the hypervisor
merely needs to do INVEPT?

> 
> Based on PV, with well-designed interfaces, I suppose we can also make some general
> design for nested support on KVM-on-hypervisor (e.g., we can do first for KVM-on-KVM
> then extend to support KVM-on-pKVM and others)

Yep, as Sean suggested. Forgot to mention this too.

> 
>>
>> But the memory cost advantage seems to be very attractive. With the emulated
>> design pKVM needs to maintain shadow page tables (and other shadow
>> structures too, but page tables are the most memory demanding). Moreover,
>> the number of shadow page tables is obviously proportional to the number of
>> VMs running, and since pKVM reserves all its memory upfront preparing for the
>> worst case, we have pretty restrictive limits on the maximum number of VMs [*]
>> (and if we run fewer VMs than this limit, we waste memory).
>>
>> To give some numbers, on a machine with 8GB of RAM, on ChromeOS with this
>> pKVM-on-x86 PoC currently we have pKVM memory cost of 229MB (and it only
>> allows up to 10 VMs running simultaneously), while on Android (ARM) it is afaik
>> only 44MB. According to my analysis, if we get rid of all the shadow tables in
>> pKVM, we should have 44MB on x86 too (regardless of the maximum number of
>> VMs).
>>
>> [*] And some other limits too, e.g. on the maximum number of DMA-capable
>> devices, since pKVM also needs shadow IOMMU page tables if we have only 1-
>> stage IOMMU.
> 
> I may not capture your meaning. Do you mean device want 2-stage while we only
> have 1-stage IOMMU? If so, not sure if there is real use case.
> 
> Per my understanding, if for PV IOMMU, the simplest implementation is just
> maintain 1-stage DMA mapping in the hypervisor as guest most likely just want 
> 1-stage DMA mapping for its device,  so if for IOMMU w/ nested capability meantime
> guest want use its nested capability (e.g., for vSVA), we can further extend the PV
> IOMMU interfaces.

Sorry, I wasn't clear enough. I mean, on the host or guest side we need
just 1-stage IOMMU, but pKVM needs to ensure memory protection. So if
2-stage is available, pKVM can just use it, but if not, currently in
pKVM on Intel we use shadow page tables for that (just as a consequence
of the overall "mostly emulated" design). (So as a result, in
particular, pKVM memory footprint depends on the max number of PCI
devices allowed by pKVM.) And yeah, with a PV IOMMU we can avoid the
need for shadow page tables while still having only 1-stage IOMMU,
that's exactly my point.

> 
>>
>>>
>>>>
>>>> [*] You take the blue pill, the story ends, you wake up in your bed and believe
>>>>     whatever you want to believe. You take the red pill, you stay in wonderland,
>>>>     and I show you how deep the rabbit hole goes.
>>>>
>>>>     -Morpheus
>>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH part-5 00/22] VMX emulation
  2023-06-09  8:34         ` Dmytro Maluka
@ 2023-06-13 19:50           ` Sean Christopherson
  2023-06-15 18:07             ` Dmytro Maluka
                               ` (2 more replies)
  2023-06-15  3:59           ` Chen, Jason CJ
  1 sibling, 3 replies; 34+ messages in thread
From: Sean Christopherson @ 2023-06-13 19:50 UTC (permalink / raw)
  To: Dmytro Maluka
  Cc: Jason CJ Chen, kvm, android-kvm, Dmitry Torokhov, Tomasz Nowicki,
	Grzegorz Jaszczyk, Keir Fraser

On Fri, Jun 09, 2023, Dmytro Maluka wrote:
> On 6/9/23 04:07, Chen, Jason CJ wrote:
> > I think with PV design, we can benefit from skip shadowing. For example, a TLB flush
> > could be done in hypervisor directly, while shadowing EPT need emulate it by destroy
> > shadow EPT page table entries then do next shadowing upon ept violation.

This is a bit misleading.  KVM has an effective TLB for nested TDP only for 4KiB
pages; larger shadow pages are never allowed to go out-of-sync, i.e. KVM doesn't
wait until L1 does a TLB flush to update SPTEs.  KVM does "unload" roots, e.g. to
emulate INVEPT, but that usually just ends up being an extra slow TLB flush in L0,
because nested TDP SPTEs rarely go unsync in practice.  The patterns for hypervisors
managing VM memory don't typically trigger the types of PTE modifications that
result in unsync SPTEs.

I actually have a (very tiny) patch sitting around somewhere to disable unsync support
when TDP is enabled.  There is a very, very theoretical bug where KVM might fail
to honor when a guest TDP PTE change is architecturally supposed to be visible,
and the simplest fix (by far) is to disable unsync support.  Disabling TDP+unsync
is a viable fix because unsync support is almost never used for nested TDP.  Legacy
shadow paging on the other hand *significantly* benefits from unsync support, e.g.
when the guest is managing CoW mappings. I haven't gotten around to posting the
patch to disable unsync on TDP purely because the flaw is almost comically theoretical.

Anyways, the point is that the TLB flushing side of nested TDP isn't all that
interesting.

> Yeah indeed, good point.
> 
> Is my understanding correct: TLB flush is still gonna be requested by
> the host VM via a hypercall, but the benefit is that the hypervisor
> merely needs to do INVEPT?

Maybe?  A paravirt paging scheme could do whatever it wanted.  The APIs could be
designed in such a way that L1 never needs to explicitly request a TLB flush,
e.g. if the contract is that changes must always become immediately visible to L2.

And TLB flushing is but one small aspect of page table shadowing.  With PV paging,
L1 wouldn't need to manage hardware-defined page tables, i.e. could use any arbitrary
data type.  E.g. KVM as L1 could use an XArray to track L2 mappings.  And L0 in
turn wouldn't need to have vendor specific code, i.e. pKVM on x86 (potentially
*all* architectures) could have a single nested paging scheme for both Intel and
AMD, as opposed to needing code to deal with the differences between EPT and NPT.

A few months back, I mentally worked through the flows[*] (I forget why I was
thinking about PV paging), and I'm pretty sure that adapting x86's TDP MMU to
support PV paging would be easy-ish, e.g. kvm_tdp_mmu_map() would become an
XArray insertion (to track the L2 mapping) + hypercall (to inform L0 of the new
mapping).
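
To make this concrete, a hypothetical sketch of what the L1 side could look
like (the hypercall number KVM_HC_PV_MAP, the l2_mappings XArray and the
argument layout are all invented for illustration, not an existing interface):

#include <linux/xarray.h>
#include <linux/kvm_types.h>
#include <asm/kvm_para.h>

/* invented: an arbitrary software structure replacing hardware page tables */
static DEFINE_XARRAY(l2_mappings);

static long pv_tdp_map(gfn_t gfn, kvm_pfn_t pfn, u64 attrs)
{
	void *old;

	/* track the L2 mapping locally, no EPT/NPT format involved */
	old = xa_store(&l2_mappings, gfn, xa_mk_value(pfn), GFP_KERNEL);
	if (xa_is_err(old))
		return xa_err(old);

	/* inform L0 so it can install gfn -> pfn in the real 2nd-stage tables */
	return kvm_hypercall3(KVM_HC_PV_MAP, gfn, pfn, attrs);
}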

[*] I even thought of a catchy name, KVM Paravirt Only Paging, a.k.a. KPOP ;-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [RFC PATCH part-5 00/22] VMX emulation
  2023-06-09  8:34         ` Dmytro Maluka
  2023-06-13 19:50           ` Sean Christopherson
@ 2023-06-15  3:59           ` Chen, Jason CJ
  1 sibling, 0 replies; 34+ messages in thread
From: Chen, Jason CJ @ 2023-06-15  3:59 UTC (permalink / raw)
  To: Dmytro Maluka, Christopherson,, Sean
  Cc: kvm, android-kvm, Dmitry Torokhov, Tomasz Nowicki,
	Grzegorz Jaszczyk, Keir Fraser, Chen, Jason CJ

> -----Original Message-----
> From: Dmytro Maluka <dmy@semihalf.com>
> Sent: Friday, June 9, 2023 4:35 PM
> To: Chen, Jason CJ <jason.cj.chen@intel.com>; Christopherson,, Sean
> <seanjc@google.com>
> Cc: kvm@vger.kernel.org; android-kvm@google.com; Dmitry Torokhov
> <dtor@chromium.org>; Tomasz Nowicki <tn@semihalf.com>; Grzegorz Jaszczyk
> <jaz@semihalf.com>; Keir Fraser <keirf@google.com>
> Subject: Re: [RFC PATCH part-5 00/22] VMX emulation
> 
> On 6/9/23 04:07, Chen, Jason CJ wrote:
> >> -----Original Message-----
> >> From: Dmytro Maluka <dmy@semihalf.com>
> >> Sent: Friday, June 9, 2023 5:38 AM
> >> To: Chen, Jason CJ <jason.cj.chen@intel.com>; Christopherson,, Sean
> >> <seanjc@google.com>
> >> Cc: kvm@vger.kernel.org; android-kvm@google.com; Dmitry Torokhov
> >> <dtor@chromium.org>; Tomasz Nowicki <tn@semihalf.com>; Grzegorz
> >> Jaszczyk <jaz@semihalf.com>; Keir Fraser <keirf@google.com>
> >> Subject: Re: [RFC PATCH part-5 00/22] VMX emulation
> >>
> >> On 3/14/23 17:29, Jason Chen CJ wrote:
> >>> On Mon, Mar 13, 2023 at 09:58:27AM -0700, Sean Christopherson wrote:
> >>>> On Mon, Mar 13, 2023, Jason Chen CJ wrote:
> >>>>> This patch set is part-5 of this RFC patches. It introduces VMX
> >>>>> emulation for pKVM on Intel platform.
> >>>>>
> >>>>> Host VM wants the capability to run its guest, it needs VMX support.
> >>>>
> >>>> No, the host VM only needs a way to request pKVM to run a VM.  If
> >>>> we go down the rabbit hole of pKVM on x86, I think we should take
> >>>> the red pill[*] and go all the way down said rabbit hole by heavily
> >>>> paravirtualizing
> >> the KVM=>pKVM interface.
> >>>
> >>> hi, Sean,
> >>>
> >>> Like I mentioned in the reply for "[RFC PATCH part-1 0/5] pKVM on
> >>> Intel Platform Introduction", we hope VMX emulation can be there at
> >>> least for normal VM support.
> >>>
> >>>>
> >>>> Except for VMCALL vs. VMMCALL, it should be possible to eliminate
> >>>> all traces of VMX and SVM from the interface.  That means no VMCS
> >>>> emulation, no EPT shadowing, etc.  As a bonus, any paravirt stuff
> >>>> we do for pKVM x86 would also be usable for KVM-on-KVM nested
> virtualization.
> >>>>
> >>>> E.g. an idea floating around my head is to add a paravirt paging
> >>>> interface for KVM-on-KVM so that L1's (KVM-high in this RFC)
> >>>> doesn't need to maintain its own TDP page tables.  I haven't
> >>>> pursued that idea in any real capacity since most nested
> >>>> virtualization use cases for KVM involve running an older L1 kernel
> >>>> and/or a non-KVM L1 hypervisor, i.e. there's no concrete use case
> >>>> to justify the development and
> >> maintenance cost.  But if the PV code is "needed" by pKVM anyways...
> >>>
> >>> Yes, I agree, we could have performance & mem cost benefit by using
> >>> paravirt stuff for KVM-on-KVM nested virtualization. May I know do I
> >>> miss other benefit you saw?
> >>
> >> As I see it, the advantages of a PV design for pKVM are:
> >>
> >> - performance
> >> - memory cost
> >> - code simplicity (of the pKVM hypervisor, first of all)
> >> - better alignment with the pKVM on ARM
> >>
> >> Regarding performance, I actually suspect it may even be the least
> >> significant of the above. I guess with a PV design we'd have roughly
> >> as many extra vmexits as we have now (just due to hypercalls instead
> >> of traps on emulated VMX instructions etc), so perhaps the
> >> performance improvement would be not as big as we might expect (am I
> wrong?).
> >
> > I think with PV design, we can benefit from skip shadowing. For
> > example, a TLB flush could be done in hypervisor directly, while
> > shadowing EPT need emulate it by destroy shadow EPT page table entries then
> do next shadowing upon ept violation.
> 
> Yeah indeed, good point.
> 
> Is my understanding correct: TLB flush is still gonna be requested by the host VM
> via a hypercall, but the benefit is that the hypervisor merely needs to do INVEPT?

Sorry for the late response. In my P.O.V., the EPT should be totally owned by the
hypervisor, so the host VM will not trigger TLB flushes as it does not manage the
EPT directly.

> 
> >
> > Based on PV, with well-designed interfaces, I suppose we can also make
> > some general design for nested support on KVM-on-hypervisor (e.g., we
> > can do first for KVM-on-KVM then extend to support KVM-on-pKVM and
> > others)
> 
> Yep, as Sean suggested. Forgot to mention this too.
> 
> >
> >>
> >> But the memory cost advantage seems to be very attractive. With the
> >> emulated design pKVM needs to maintain shadow page tables (and other
> >> shadow structures too, but page tables are the most memory
> >> demanding). Moreover, the number of shadow page tables is obviously
> >> proportional to the number of VMs running, and since pKVM reserves
> >> all its memory upfront preparing for the worst case, we have pretty
> >> restrictive limits on the maximum number of VMs [*] (and if we run fewer
> VMs than this limit, we waste memory).
> >>
> >> To give some numbers, on a machine with 8GB of RAM, on ChromeOS with
> >> this
> >> pKVM-on-x86 PoC currently we have pKVM memory cost of 229MB (and it
> >> only allows up to 10 VMs running simultaneously), while on Android
> >> (ARM) it is afaik only 44MB. According to my analysis, if we get rid
> >> of all the shadow tables in pKVM, we should have 44MB on x86 too
> >> (regardless of the maximum number of VMs).
> >>
> >> [*] And some other limits too, e.g. on the maximum number of
> >> DMA-capable devices, since pKVM also needs shadow IOMMU page tables
> >> if we have only 1- stage IOMMU.
> >
> > I may not capture your meaning. Do you mean device want 2-stage while
> > we only have 1-stage IOMMU? If so, not sure if there is real use case.
> >
> > Per my understanding, if for PV IOMMU, the simplest implementation is
> > just maintain 1-stage DMA mapping in the hypervisor as guest most
> > likely just want 1-stage DMA mapping for its device,  so if for IOMMU
> > w/ nested capability meantime guest want use its nested capability
> > (e.g., for vSVA), we can further extend the PV IOMMU interfaces.
> 
> Sorry, I wasn't clear enough. I mean, on the host or guest side we need just 1-
> stage IOMMU, but pKVM needs to ensure memory protection. So if 2-stage is
> available, pKVM can just use it, but if not, currently in pKVM on Intel we use
> shadow page tables for that (just as a consequence of the overall "mostly
> emulated" design). (So as a result, in particular, pKVM memory footprint
> depends on the max number of PCI devices allowed by pKVM.) And yeah, with a
> PV IOMMU we can avoid the need for shadow page tables while still having only
> 1-stage IOMMU, that's exactly my point.
> 
> >
> >>
> >>>
> >>>>
> >>>> [*] You take the blue pill, the story ends, you wake up in your bed and
> believe
> >>>>     whatever you want to believe. You take the red pill, you stay in
> wonderland,
> >>>>     and I show you how deep the rabbit hole goes.
> >>>>
> >>>>     -Morpheus
> >>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH part-5 00/22] VMX emulation
  2023-06-13 19:50           ` Sean Christopherson
@ 2023-06-15 18:07             ` Dmytro Maluka
  2023-06-20 15:46             ` Jason Chen CJ
  2023-09-05  9:47             ` Jason Chen CJ
  2 siblings, 0 replies; 34+ messages in thread
From: Dmytro Maluka @ 2023-06-15 18:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jason CJ Chen, kvm, android-kvm, Dmitry Torokhov, Tomasz Nowicki,
	Grzegorz Jaszczyk, Keir Fraser

On 6/13/23 21:50, Sean Christopherson wrote:
> On Fri, Jun 09, 2023, Dmytro Maluka wrote:
>> Yeah indeed, good point.
>>
>> Is my understanding correct: TLB flush is still gonna be requested by
>> the host VM via a hypercall, but the benefit is that the hypervisor
>> merely needs to do INVEPT?
> 
> Maybe?  A paravirt paging scheme could do whatever it wanted.  The APIs could be
> designed in such a way that L1 never needs to explicitly request a TLB flush,
> e.g. if the contract is that changes must always become immediately visible to L2.
> 
> And TLB flushing is but one small aspect of page table shadowing.  With PV paging,
> L1 wouldn't need to manage hardware-defined page tables, i.e. could use any arbitrary
> data type.  E.g. KVM as L1 could use an XArray to track L2 mappings.  And L0 in
> turn wouldn't need to have vendor specific code, i.e. pKVM on x86 (potentially
> *all* architectures) could have a single nested paging scheme for both Intel and
> AMD, as opposed to needing code to deal with the differences between EPT and NPT.
> 
> A few months back, I mentally worked through the flows[*] (I forget why I was
> thinking about PV paging), and I'm pretty sure that adapting x86's TDP MMU to
> support PV paging would be easy-ish, e.g. kvm_tdp_mmu_map() would become an
> XArray insertion (to track the L2 mapping) + hypercall (to inform L1 of the new
> mapping).
> 
> [*] I even though of a catchy name, KVM Paravirt Only Paging, a.k.a. KPOP ;-)

Yeap indeed, thanks. (I should have thought myself that it's rather
pointless to use hardware-defined page tables and TLB semantics in L1 if
we go full PV.) In pKVM on ARM [1] it already looks similar to what you
described and is pretty simple: L1 pins the guest page, issues
__pkvm_host_map_guest hypercall to map it, and remembers it in an RB-tree
to unpin it later.

One concern though: can this be done lock-efficiently? For example, in
this pKVM-ARM code in [1] this (hypercall + RB-tree insertion) is done
under write-locked kvm->mmu_lock, so I assume it is prone to contention
when there are stage-2 page faults occurring simultaneously on multiple
CPUs from the same VM. In pKVM on Intel we also have the same per-VM
lock contention issue, though in L0 (see
pkvm_handle_shadow_ept_violation() in [2]) and we are already seeing
~50% perf drop caused by it in some benchmarks.

(To be precise, though, eliminating this per-VM write-lock would not be
enough for eliminating the contention: on both ARM and x86 there is also
global locking in pKVM in L0 down the road [3], for different reasons.)

[1] https://android.googlesource.com/kernel/common/+/d73b3af21fb90f6556383865af6ee16e4735a4a6/arch/arm64/kvm/mmu.c#1341

[2] https://lore.kernel.org/all/20230312180345.1778588-9-jason.cj.chen@intel.com/

[3] https://android.googlesource.com/kernel/common/+/d73b3af21fb90f6556383865af6ee16e4735a4a6/arch/arm64/kvm/hyp/nvhe/mem_protect.c#2176


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH part-5 00/22] VMX emulation
  2023-06-08 21:38     ` Dmytro Maluka
  2023-06-09  2:07       ` Chen, Jason CJ
@ 2023-06-15 21:13       ` Nadav Amit
  1 sibling, 0 replies; 34+ messages in thread
From: Nadav Amit @ 2023-06-15 21:13 UTC (permalink / raw)
  To: Dmytro Maluka
  Cc: Jason Chen CJ, Sean Christopherson, kvm, android-kvm,
	Dmitry Torokhov, Tomasz Nowicki, Grzegorz Jaszczyk, Keir Fraser


> On Jun 8, 2023, at 2:38 PM, Dmytro Maluka <dmy@semihalf.com> wrote:
> 
> On 3/14/23 17:29, Jason Chen CJ wrote:
>> On Mon, Mar 13, 2023 at 09:58:27AM -0700, Sean Christopherson wrote:
>>> On Mon, Mar 13, 2023, Jason Chen CJ wrote:
>>>> This patch set is part-5 of this RFC patches. It introduces VMX
>>>> emulation for pKVM on Intel platform.
>>>> 
>>>> Host VM wants the capability to run its guest, it needs VMX support.
>>> 
>>> No, the host VM only needs a way to request pKVM to run a VM.  If we go down the
>>> rabbit hole of pKVM on x86, I think we should take the red pill[*] and go all the
>>> way down said rabbit hole by heavily paravirtualizing the KVM=>pKVM interface.
>> 
>> hi, Sean,
>> 
>> Like I mentioned in the reply for "[RFC PATCH part-1 0/5] pKVM on Intel
>> Platform Introduction", we hope VMX emulation can be there at least for
>> normal VM support.

Just in case the PV approach is taken, please consider consulting with other
hypervisor vendors (e.g., Microsoft, VMware) before you define a PV
interface.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH part-5 00/22] VMX emulation
  2023-06-13 19:50           ` Sean Christopherson
  2023-06-15 18:07             ` Dmytro Maluka
@ 2023-06-20 15:46             ` Jason Chen CJ
  2023-09-05  9:47             ` Jason Chen CJ
  2 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-06-20 15:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dmytro Maluka, kvm, android-kvm, Dmitry Torokhov, Tomasz Nowicki,
	Grzegorz Jaszczyk, Keir Fraser

On Tue, Jun 13, 2023 at 12:50:52PM -0700, Sean Christopherson wrote:
> On Fri, Jun 09, 2023, Dmytro Maluka wrote:
> > On 6/9/23 04:07, Chen, Jason CJ wrote:
> > > I think with PV design, we can benefit from skip shadowing. For example, a TLB flush
> > > could be done in hypervisor directly, while shadowing EPT need emulate it by destroy
> > > shadow EPT page table entries then do next shadowing upon ept violation.
> 
> This is a bit misleading.  KVM has an effective TLB for nested TDP only for 4KiB
> pages; larger shadow pages are never allowed to go out-of-sync, i.e. KVM doesn't
> wait until L1 does a TLB flush to update SPTEs.  KVM does "unload" roots, e.g. to
> emulate INVEPT, but that usually just ends up being an extra slow TLB flush in L0,
> because nested TDP SPTEs rarely go unsync in practice.  The patterns for hypervisors
> managing VM memory don't typically trigger the types of PTE modifications that
> result in unsync SPTEs.
> 
> I actually have a (very tiny) patch sitting around somwhere to disable unsync support
> when TDP is enabled.  There is a very, very thoeretical bug where KVM might fail
> to honor when a guest TDP PTE change is architecturally supposed to be visible,
> and the simplest fix (by far) is to disable unsync support.  Disabling TDP+unsync
> is a viable fix because unsync support is almost never used for nested TDP.  Legacy
> shadow paging on the other hand *significantly* benefits from unsync support, e.g.
> when the guest is managing CoW mappings. I haven't gotten around to posting the
> patch to disable unsync on TDP purely because the flaw is almost comically theoretical.
> 
> Anyways, the point is that the TLB flushing side of nested TDP isn't all that
> interesting.

Agree. Thanks for pointing it out! I was thinking in terms of a comparison
with the current RFC pKVM-on-x86 solution. :-(

To me, the KVM page table shadowing mechanism (e.g., unsync & sync page)
is too heavy & complicated. If we have the KPOP solution, IIUC, we may be
able to totally remove all the shadowing stuff, right? :-)

BTW, KPOP raises questions about supporting access tracking & dirty page
logging, which may need additional PV interfaces. MMIO faults could be
another issue if we want to keep the optimization based on EPT MISCONFIG
on the IA platform.

> 
> > Yeah indeed, good point.
> > 
> > Is my understanding correct: TLB flush is still gonna be requested by
> > the host VM via a hypercall, but the benefit is that the hypervisor
> > merely needs to do INVEPT?
> 
> Maybe?  A paravirt paging scheme could do whatever it wanted.  The APIs could be
> designed in such a way that L1 never needs to explicitly request a TLB flush,
> e.g. if the contract is that changes must always become immediately visible to L2.
> 
> And TLB flushing is but one small aspect of page table shadowing.  With PV paging,
> L1 wouldn't need to manage hardware-defined page tables, i.e. could use any arbitrary
> data type.  E.g. KVM as L1 could use an XArray to track L2 mappings.  And L0 in
> turn wouldn't need to have vendor specific code, i.e. pKVM on x86 (potentially
> *all* architectures) could have a single nested paging scheme for both Intel and
> AMD, as opposed to needing code to deal with the differences between EPT and NPT.
> 
> A few months back, I mentally worked through the flows[*] (I forget why I was
> thinking about PV paging), and I'm pretty sure that adapting x86's TDP MMU to
> support PV paging would be easy-ish, e.g. kvm_tdp_mmu_map() would become an
> XArray insertion (to track the L2 mapping) + hypercall (to inform L1 of the new
> mapping).
> 
> [*] I even though of a catchy name, KVM Paravirt Only Paging, a.k.a. KPOP ;-)

-- 

Thanks
Jason CJ Chen

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH part-5 00/22] VMX emulation
  2023-06-13 19:50           ` Sean Christopherson
  2023-06-15 18:07             ` Dmytro Maluka
  2023-06-20 15:46             ` Jason Chen CJ
@ 2023-09-05  9:47             ` Jason Chen CJ
  2 siblings, 0 replies; 34+ messages in thread
From: Jason Chen CJ @ 2023-09-05  9:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dmytro Maluka, kvm, android-kvm, Dmitry Torokhov, Tomasz Nowicki,
	Grzegorz Jaszczyk, Keir Fraser

On Tue, Jun 13, 2023 at 12:50:52PM -0700, Sean Christopherson wrote:
> Maybe?  A paravirt paging scheme could do whatever it wanted.  The APIs could be
> designed in such a way that L1 never needs to explicitly request a TLB flush,
> e.g. if the contract is that changes must always become immediately visible to L2.
> 
> And TLB flushing is but one small aspect of page table shadowing.  With PV paging,
> L1 wouldn't need to manage hardware-defined page tables, i.e. could use any arbitrary
> data type.  E.g. KVM as L1 could use an XArray to track L2 mappings.  And L0 in
> turn wouldn't need to have vendor specific code, i.e. pKVM on x86 (potentially
> *all* architectures) could have a single nested paging scheme for both Intel and
> AMD, as opposed to needing code to deal with the differences between EPT and NPT.
> 
> A few months back, I mentally worked through the flows[*] (I forget why I was
> thinking about PV paging), and I'm pretty sure that adapting x86's TDP MMU to
> support PV paging would be easy-ish, e.g. kvm_tdp_mmu_map() would become an
> XArray insertion (to track the L2 mapping) + hypercall (to inform L1 of the new
> mapping).
> 
> [*] I even though of a catchy name, KVM Paravirt Only Paging, a.k.a. KPOP ;-)

hi, Sean & all,

I did a POC[1] to support KPOP (KVM Paravirt Only Paging) for KVM-on-KVM
nested guests. I am not sure whether such a solution is welcome in the KVM
community; I would appreciate any advice/direction you can give. From what I
saw, the solution is straightforward and has a lower memory cost (no double
page tables), but a rough benchmark based on stress-ng shows less than 1%
improvement for both the cpu & vm stress tests, compared to the legacy
shadowing-mode nested guest solution.

Brief idea of this POC
----------------------

The brief idea of the POC is to intercept the x86 KVM MMU interfaces below
and turn them into three KPOP hypercalls - KVM_HC_KPOP_MMU_LOAD_UNLOAD,
KVM_HC_KPOP_MMU_MAP & KVM_HC_KPOP_MMU_UNMAP (a rough L1-side sketch follows
the list):

- int (*mmu_load)(struct kvm_vcpu *vcpu);
  This op (from L1) leads to the KVM_HC_KPOP_MMU_LOAD_UNLOAD hypercall for MMU
  load to L0 KVM; L0 KVM creates the L2 guest MMU page table and ensures the
  vcpu loads it as the root pgd when the corresponding nested vcpu is running.

- void (*mmu_unload)(struct kvm_vcpu *vcpu);
  This op (from L1) leads to the KVM_HC_KPOP_MMU_LOAD_UNLOAD hypercall for MMU
  unload to L0 KVM; L0 KVM puts & frees the corresponding L2 guest MMU page
  table.

- bool (*mmu_set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  This op (from L1) leads to the KVM_HC_KPOP_MMU_MAP hypercall for MMU remap
  to L0 KVM; L0 KVM remaps the range's MMU mapping for all previously loaded
  L2 guest MMU page tables that belong to the L2 "kvm" and whose as_id
  (address space id) is the same as range->slot->as_id.

- bool (*mmu_unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  This op (from L1) leads to the KVM_HC_KPOP_MMU_UNMAP hypercall for MMU unmap
  to L0 KVM; L0 KVM unmaps the range's MMU mapping for all previously loaded
  L2 guest MMU page tables that belong to the L2 "kvm" and whose as_id is the
  same as range->slot->as_id.

- void (*mmu_zap_gfn_range)(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
  This op (from L1) leads to the KVM_HC_KPOP_MMU_UNMAP hypercall for MMU unmap
  to L0 KVM; L0 KVM unmaps the {start, end} MMU mapping for all previously
  loaded L2 guest MMU page tables that belong to the L2 "kvm" (for all as_id).

- void (*mmu_zap_all)(struct kvm *kvm, bool fast);
  This op (from L1) leads to the KVM_HC_KPOP_MMU_UNMAP hypercall for MMU unmap
  to L0 KVM; L0 KVM zaps all MMU mappings for all previously loaded L2 guest
  MMU page tables that belong to the L2 "kvm" (for all as_id).

- The page fault handling function (direct_page_fault) in L1 KVM is also
  changed in this POC to support KPOP MMU mapping: it leads to the
  KVM_HC_KPOP_MMU_MAP hypercall, and L0 KVM leverages kvm_tdp_mmu_map to do
  the MMU page mapping for the previously loaded L2 guest MMU page table.
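
For illustration, a rough sketch of how the L1 side could issue these
hypercalls (this is not the actual POC code from [1]; the argument packing is
only a guess):

#include <linux/kvm_types.h>
#include <asm/kvm_para.h>

/* KVM_HC_KPOP_* are the hypercall numbers introduced by the POC */
static long kpop_hc_mmu_map(u64 vcpu_holder, u64 as_id, gfn_t gfn, kvm_pfn_t pfn)
{
	/* called from L1's page fault path instead of installing an SPTE */
	return kvm_hypercall4(KVM_HC_KPOP_MMU_MAP, vcpu_holder, as_id, gfn, pfn);
}

static long kpop_hc_mmu_unmap(u64 vcpu_holder, u64 as_id, gfn_t start, gfn_t end)
{
	/* called from L1's mmu_unmap_gfn_range/mmu_zap_gfn_range interception */
	return kvm_hypercall4(KVM_HC_KPOP_MMU_UNMAP, vcpu_holder, as_id, start, end);
}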

How is the guest MMU page table identified?
-------------------------------------------

The L2 guest MMU page table is identified by its L1 vcpu_holder & as_id.
L1 KVM runs an L2 vcpu after loading the L2 vcpu info into the corresponding
vcpu_holder - for x86 this is the vmcs. When L1 KVM does mmu_load for an L2
guest MMU page table, it also assigns an as_id to that table - for x86 it is
based on whether the vcpu is running in SMM mode.

In this POC, L0 KVM maintains the L2 guest MMUs for L1 KVM in a per-VM
hash table hashed by the vcpu_holders. Struct kpop_guest_mmu and several
APIs are introduced for managing the L2 guest MMUs:

struct kpop_guest_mmu {
        struct hlist_node hnode;
        u64 vcpu_holder;
        u64 kvm_id;
        u64 as_id;
        hpa_t root_hpa;
        refcount_t count;
};

- int kpop_alloc_guest_mmu(struct kvm_vcpu *vcpu, u64 vcpu_holder, u64
  kvm_id, u64 as_id)
- void kpop_put_guest_mmu(struct kvm_vcpu *vcpu, u64 vcpu_holder, u64
  kvm_id, u64 as_id)
- struct kpop_guest_mmu *kpop_find_guest_mmu(struct kvm *kvm, u64
  vcpu_holder, u64 as_id)
- int kpop_reload_guest_mmu(struct kvm_vcpu *vcpu, bool check_vcpu)
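
For illustration, a sketch of how L0 might use the struct and APIs above when
handling the "load" half of KVM_HC_KPOP_MMU_LOAD_UNLOAD (not the actual POC
code; the reuse/refcount policy shown is an assumption):

static int kpop_handle_mmu_load(struct kvm_vcpu *vcpu, u64 vcpu_holder,
				u64 kvm_id, u64 as_id)
{
	struct kpop_guest_mmu *guest_mmu;

	/* reuse an already-loaded L2 root for this (vcpu_holder, as_id) pair */
	guest_mmu = kpop_find_guest_mmu(vcpu->kvm, vcpu_holder, as_id);
	if (guest_mmu) {
		refcount_inc(&guest_mmu->count);
		return 0;
	}

	/* first load: allocate a fresh L2 guest MMU page table */
	return kpop_alloc_guest_mmu(vcpu, vcpu_holder, kvm_id, as_id);
}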


TODOs & OPENs
-------------

There are still a lot of TODOs:

- L2 translation info (XArray) in L1 KVM
  L1 KVM may need to maintain translation info (ngpa-to-gpa) for L2 guests;
  one possible use case is MMIO fault optimization. A simple way is to
  maintain a translation info XArray in L1 KVM (see the sketch after this
  list).

- support UMIP emulation
  UMIP emulation requires L0 KVM to do instruction emulation for the L2
  guest, which needs nested address translation. Usually this would be done
  by the guest_kpop_mmu's gva_to_gpa op (the unimplemented kpop_gva_to_gpa in
  my POC). We either do such translation based on an L1-maintained
  translation table (in which case an XArray may not be a good choice for the
  L1 translation table), or we maintain another new translation table (e.g.,
  another XArray) in L0 for the L2 guest.

- age/test_age
  age/test_age MMU interfaces should be supported, e.g., for swap in the L1 VM.

- page track
  page track should be supported, e.g., for GVT graphics page table shadowing usage.

- dirty log
  dirty log should be supported for VM migration.
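
For the first TODO item above, a minimal sketch of what such an L1-side
lookaside could look like (the names and the page-granularity keying are
assumptions, not part of the POC):

#include <linux/xarray.h>
#include <linux/kvm_types.h>

static DEFINE_XARRAY(l2_ngfn_to_gfn);	/* ngpa -> gpa, keyed by frame number */

static int kpop_record_translation(gfn_t ngfn, gfn_t gfn)
{
	return xa_err(xa_store(&l2_ngfn_to_gfn, ngfn, xa_mk_value(gfn),
			       GFP_KERNEL));
}

static bool kpop_translate(gfn_t ngfn, gfn_t *gfn)
{
	void *entry = xa_load(&l2_ngfn_to_gfn, ngfn);

	if (!entry)
		return false;	/* no cached translation */
	*gfn = xa_to_value(entry);
	return true;
}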

[1]: https://github.com/intel-staging/pKVM-IA/tree/KPOP_RFC

-- 

Thanks
Jason CJ Chen

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-09-05 16:04 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-12 18:02 [RFC PATCH part-5 00/22] VMX emulation Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 01/22] pkvm: x86: Add memcpy lib Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 02/22] pkvm: x86: Add memory operation APIs for for host VM Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 03/22] pkvm: x86: Do guest address translation per page granularity Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 04/22] pkvm: x86: Add check for guest address translation Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 05/22] pkvm: x86: Add hypercalls for shadow_vm/vcpu init & teardown Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 06/22] KVM: VMX: Add new kvm_x86_ops vm_free Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 07/22] KVM: VMX: Add initialization/teardown for shadow vm/vcpu Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 08/22] pkvm: x86: Add hash table mapping for shadow vcpu based on vmcs12_pa Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 09/22] pkvm: x86: Add VMXON/VMXOFF emulation Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 10/22] pkvm: x86: Add has_vmcs_field() API for physical vmx capability check Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 11/22] KVM: VMX: Add more vmcs and vmcs12 fields definition Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 12/22] pkvm: x86: Init vmcs read/write bitmap for vmcs emulation Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 13/22] pkvm: x86: Initialize emulated fields " Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 14/22] pkvm: x86: Add msr ops for pKVM hypervisor Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 15/22] pkvm: x86: Move _init_host_state_area to " Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 16/22] pkvm: x86: Add vmcs_load/clear_track APIs Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 17/22] pkvm: x86: Add VMPTRLD/VMCLEAR emulation Jason Chen CJ
2023-03-12 18:02 ` [RFC PATCH part-5 18/22] pkvm: x86: Add VMREAD/VMWRITE emulation Jason Chen CJ
2023-03-12 18:03 ` [RFC PATCH part-5 19/22] pkvm: x86: Add VMLAUNCH/VMRESUME emulation Jason Chen CJ
2023-03-12 18:03 ` [RFC PATCH part-5 20/22] pkvm: x86: Add INVEPT/INVVPID emulation Jason Chen CJ
2023-03-12 18:03 ` [RFC PATCH part-5 21/22] pkvm: x86: Initialize msr_bitmap for vmsr Jason Chen CJ
2023-03-12 18:03 ` [RFC PATCH part-5 22/22] pkvm: x86: Add vmx msr emulation Jason Chen CJ
2023-03-13 16:58 ` [RFC PATCH part-5 00/22] VMX emulation Sean Christopherson
2023-03-14 16:29   ` Jason Chen CJ
2023-06-08 21:38     ` Dmytro Maluka
2023-06-09  2:07       ` Chen, Jason CJ
2023-06-09  8:34         ` Dmytro Maluka
2023-06-13 19:50           ` Sean Christopherson
2023-06-15 18:07             ` Dmytro Maluka
2023-06-20 15:46             ` Jason Chen CJ
2023-09-05  9:47             ` Jason Chen CJ
2023-06-15  3:59           ` Chen, Jason CJ
2023-06-15 21:13       ` Nadav Amit

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.