* [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling
@ 2023-04-04 13:09 Binbin Wu
  2023-04-04 13:09 ` [PATCH v7 1/5] KVM: x86: Virtualize CR4.LAM_SUP Binbin Wu
                   ` (5 more replies)
  0 siblings, 6 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-04 13:09 UTC (permalink / raw)
  To: kvm, seanjc, pbonzini
  Cc: binbin.wu, kai.huang, chao.gao, xuelian.guo, robert.hu

===Feature Introduction===

Linear-address masking (LAM) [1] modifies the checking that is applied to
*64-bit* linear addresses, allowing software to use the untranslated address
bits for metadata.

When the feature is virtualized and exposed to the guest, it can be used for
efficient implementation of address sanitizers (ASAN) and for optimizations in
JITs and virtual machines.

Regarding which pointer bits are masked and can be used for metadata, LAM has 2
modes:
- LAM_48: metadata bits 62:48, i.e. LAM width of 15.
- LAM_57: metadata bits 62:57, i.e. LAM width of 6.

* For user pointers:
  CR3.LAM_U57 = CR3.LAM_U48 = 0, LAM is off;
  CR3.LAM_U57 = 1, LAM57 is active;
  CR3.LAM_U57 = 0 and CR3.LAM_U48 = 1, LAM48 is active.
* For supervisor pointers:
  CR4.LAM_SUP = 0, LAM is off;
  CR4.LAM_SUP = 1 with 5-level paging mode, LAM57 is active;
  CR4.LAM_SUP = 1 with 4-level paging mode, LAM48 is active.

The modified LAM canonicality check:
* LAM_S48                : [ 1 ][ metadata ][ 1 ]
                             63               47
* LAM_U48                : [ 0 ][ metadata ][ 0 ]
                             63               47
* LAM_S57                : [ 1 ][ metadata ][ 1 ]
                             63               56
* LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
                             63               56
* LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
                             63               56..47
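
As a rough, standalone illustration of the mode selection above (the helper and
macro names below are made up for the example; the in-tree logic is the
lam_sign_extend_bit() helper in patch 3):

#include <stdbool.h>
#include <stdint.h>

#define CR3_LAM_U57 (1ULL << 61)
#define CR3_LAM_U48 (1ULL << 62)
#define CR4_LAM_SUP (1ULL << 28)
#define CR4_LA57    (1ULL << 12)	/* 5-level paging enabled (in long mode) */

/* Return 56 (LAM57), 47 (LAM48), or -1 when LAM is not active. */
static int lam_sign_ext_bit(uint64_t ptr, uint64_t cr3, uint64_t cr4)
{
	bool user = !(ptr >> 63);	/* bit 63 selects user vs. supervisor */

	if (user) {
		if (cr3 & CR3_LAM_U57)
			return 56;	/* LAM_U57: metadata in bits 62:57 */
		if (cr3 & CR3_LAM_U48)
			return 47;	/* LAM_U48: metadata in bits 62:48 */
	} else if (cr4 & CR4_LAM_SUP) {
		return (cr4 & CR4_LA57) ? 56 : 47;	/* LAM_S57 vs. LAM_S48 */
	}
	return -1;
}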

Note:
1. LAM applies only to data addresses, not to instruction fetches.
2. LAM identification of an address as user or supervisor is based solely on the
   value of pointer bit 63 and does not, for the purposes of LAM, depend on the CPL.
3. For user-mode addresses, it is possible that 5-level paging and LAM_U48 are
   both enabled; in this case, the effective usable linear-address width is 48 bits. [2]
4. On VM exit, the page-fault linear address saved in the VMCS field is clean,
   i.e. the metadata bits are cleared and the address is in canonical form.

===LAM KVM Design===
LAM KVM enabling includes the following parts:
- Feature Enumeration
  LAM feature is enumerated by CPUID.7.1:EAX.LAM[bit 26].
  If hardware supports LAM and the host doesn't disable it explicitly (e.g. via
  clearcpuid), the LAM feature will be exposed to the userspace VMM. (A
  standalone sketch of the enumeration check follows.)
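
  For reference, a standalone userspace check of this enumeration bit could
  look like the sketch below (it assumes GCC/Clang's cpuid.h helper and is not
  KVM code):

#include <cpuid.h>
#include <stdbool.h>

/* LAM is enumerated by CPUID.(EAX=7,ECX=1):EAX[bit 26]. */
static bool cpu_has_lam(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
		return false;
	return eax & (1u << 26);
}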

- CR4 Virtualization
  LAM uses CR4.LAM_SUP (bit 28) to configure LAM masking on supervisor pointers.
  CR4.LAM_SUP is allowed to be set if the vCPU supports LAM, including in
  nested guests.
  CR4.LAM_SUP is allowed to be set even outside 64-bit mode, but it takes no
  effect since LAM only applies to 64-bit linear addresses.
  Changes to CR4.LAM_SUP are intercepted to avoid a vmread every time KVM
  fetches its value, with the expectation that the guest won't toggle the bit
  frequently.
  Hardware is not required to flush the TLB when CR4.LAM_SUP is toggled, so KVM
  doesn't need to emulate a TLB flush for it. (A standalone sketch of the
  reserved-bit gating follows.)
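
  A minimal standalone sketch of the reserved-bit gating (the helper and its
  parameters are illustrative; the real gating is the __reserved_bits update
  shown in patch 1):

#include <stdbool.h>
#include <stdint.h>

#define CR4_LAM_SUP (1ULL << 28)

/* CR4.LAM_SUP is treated as reserved unless the vCPU enumerates LAM. */
static bool cr4_write_valid(uint64_t new_cr4, uint64_t base_reserved_bits,
			    bool vcpu_has_lam)
{
	uint64_t reserved = base_reserved_bits;

	if (!vcpu_has_lam)
		reserved |= CR4_LAM_SUP;

	return !(new_cr4 & reserved);
}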

- CR3 Virtualization
  LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM masking
  for user mode pointers.

  When EPT is on:
  CR3 is fully under the control of the guest; guest LAM is thus transparent to KVM.

  When EPT is off (shadow paging):
    * KVM needs to handle guest CR3.LAM_U48 and CR3.LAM_U57 toggles.
      The two bits are allowed to be set in CR3 if vCPU supports LAM.
      The two bits should be kept as they are in the shadow CR3.
    * Perform GFN calculation from guest CR3/PGD generically by extracting the
      maximal base address mask.
    * Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
    To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
    the bits used to control supported features related to CR3 (e.g. LAM). (A
    standalone sketch of the CR3 handling follows.)
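
    A standalone sketch of these two points (illustrative; the in-tree
    equivalents are kvm_vcpu_is_legal_cr3() and the __PT_BASE_ADDR_MASK-based
    GFN extraction added in patch 2):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT        12
/* Maximal base address mask: bits 51:12. */
#define PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~((1ULL << PAGE_SHIFT) - 1))

/* Control bits (e.g. LAM) are carved out of the reserved-GPA check. */
static bool cr3_is_legal(uint64_t cr3, uint64_t reserved_gpa_bits,
			 uint64_t cr3_ctrl_bits)
{
	return !((cr3 & reserved_gpa_bits) & ~cr3_ctrl_bits);
}

/* Strip PCID/LAM/other non-address bits generically when computing the GFN. */
static uint64_t root_gfn_from_pgd(uint64_t root_pgd)
{
	return (root_pgd & PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
}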

- Modified Canonicality Check and Metadata Mask
  When LAM is enabled, a 64-bit linear address may be tagged with metadata. The
  linear address should be checked for modified canonicality and untagged (i.e.
  the metadata bits should be masked by sign-extending bit 47 or bit 56) in
  instruction emulation and VM-exit handling when LAM is applicable, as in the
  sketch below.
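
  A standalone sketch of "untag + legacy canonical check" (illustrative only;
  the in-tree pieces are vmx_untag_addr() in patch 3 plus the existing
  __is_canonical_address() helper):

#include <stdbool.h>
#include <stdint.h>

/* Sign-extend 'value' from bit 'index' upward (like the kernel's sign_extend64()). */
static uint64_t sign_extend_from(uint64_t value, int index)
{
	int shift = 63 - index;

	return (uint64_t)((int64_t)(value << shift) >> shift);
}

/* Mask LAM metadata by sign-extending bit 47/56 while preserving bit 63. */
static uint64_t lam_untag(uint64_t addr, int sign_ext_bit)
{
	if (sign_ext_bit < 0)
		return addr;		/* LAM not active */

	return (sign_extend_from(addr, sign_ext_bit) & ~(1ULL << 63)) |
	       (addr & (1ULL << 63));
}

/* Legacy canonical check; va_bits is 48 (4-level) or 57 (5-level paging). */
static bool is_canonical(uint64_t addr, int va_bits)
{
	return sign_extend_from(addr, va_bits - 1) == addr;
}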

LAM inside nested guest is supported by this patch series. 
LAM inside SGX enclave mode is NOT supported by this patch series.

The patch series is based on Linux kernel v6.3-rc4 and depends on two patches:
- One from Kirill for the LAM feature and flag definitions [3].
- The other is a bug fix sent out separately [4].

The patch series is organized as follows:
Patches 1-2: CR4/CR3 virtualization
Patch 3: Implementation of untag_addr
Patch 4: Untag address when LAM applicable
Patch 5: Expose LAM feature to userspace VMM

The corresponding QEMU patch:
https://lists.gnu.org/archive/html/qemu-devel/2023-02/msg08036.html

===Unit Test===
1. Add a kvm-unit-test [5] for LAM, including LAM_SUP and LAM_{U57,U48}.
   For supervisor mode, the test covers CR4.LAM_SUP toggles, memory/MMIO access
   with tagged pointers, and some special instructions (INVLPG, INVPCID,
   INVVPID); the INVVPID cases are also used to cover the VMX-instruction
   VMExit path.
   For user mode, the test covers CR3 LAM bit toggles and memory/MMIO access
   with tagged pointers.
   MMIO cases are used to trigger the instruction emulation path.
   The unit tests are run with the LAM feature both on and off (i.e. including
   negative cases).
2. Run the kernel LAM kselftests in the guest, with both EPT=Y and EPT=N.
3. Launch a nested guest.

All tests have passed in a Simics environment.

[1] Intel ISE https://cdrdv2.intel.com/v1/dl/getContent/671368
    Chapter Linear Address Masking (LAM)
[2] Thus the current LAM kernel enabling patches only enable LAM_U57.
    https://lore.kernel.org/lkml/20230123220500.21077-1-kirill.shutemov@linux.intel.com/
[3] https://lore.kernel.org/lkml/20230123220500.21077-4-kirill.shutemov@linux.intel.com/
[4] https://lore.kernel.org/kvm/20230404032502.27798-1-binbin.wu@linux.intel.com/
[5] https://lore.kernel.org/kvm/20230319083732.29458-1-binbin.wu@linux.intel.com/

---
Changelog
v6 --> v7:
- Changes to CR3 virtualization when EPT off
  * Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination. (Sean)
  * Perform GFN calculation from guest CR3/PGD generically by extracting the maximal 
    base address mask. (Sean)
- Remove dereference of ctxt->vcpu in the emulator. (Sean)
- Fix a bug in v6, which hardcoded "write" to "false" by mistake in linearize(). (Chao Gao)
- Add Chao Gao's Reviewed-by in Patch 5.
- Add Xuelian Guo's Tested-by in the patch set.
- Separate cleanup patches from the patch set.

v5 --> v6:
Add Patch 2 to fix the check of 64-bit mode.
Add untag_addr() to kvm_x86_ops to hide vendor specific code.
Simplify the LAM canonicality check per Chao's suggestion.
Add cr3_ctrl_bits to kvm_vcpu_arch to simplify cr3 invalidation/extract/mask (Chao Gao)
Extend the patchset scope to include nested virtualization and SGX ENCLS handling.
- Add X86_CR4_LAM_SUP in cr4_fixed1_update for nested vmx. (Chao Gao)
- Add SGX ENCLS VMExit handling
- Add VMX instruction VMExit handling
More descriptions in cover letter.
Add Chao's Reviewed-by on Patch 4.
Add more test cases in kvm-unit-test.

v4 --> v5:
Reorder and merge the patches surrounding the CR3.LAM bits into Patch 3 of this
version.
Revise Patch 1's subject and description
Drop Patch 3
Use kvm_read_cr4_bits() instead of kvm_read_cr4()
Fix: no need to untag the address on MSR writes; the legacy canonical check applies.
Rename kvm_is_valid_cr3() --> kvm_vcpu_is_valid_cr3(), and update some call
sites of kvm_vcpu_is_valid_cr3() to use kvm_is_valid_cr3().
Other refactors and Miscs.

v3 --> v4:
Drop unrelated Patch 1 in v3 (Binbin, Sean, Xiaoyao)
Intercept CR4.LAM_SUP instead of pass through to guest (Sean)
Just filter out CR3.LAM_{U48, U57}, instead of all reserved high bits
(Sean, Yuan)
Use the existing __canonical_address() helper instead of writing a new one (Weijiang)
Add LAM handling in KVM emulation (Yu, Yuan)
Add Jingqi's Reviewed-by on Patch 7
Rebased to Kirill's latest code, which is 6.2-rc1 base.

v2 --> v3:
As LAM Kernel patches are in tip tree now, rebase to it.
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/

v1 --> v2:
1. Fixes i386-allyesconfig build error on get_pgd(), where
   CR3_HIGH_RSVD_MASK isn't applicable.
   (Reported-by: kernel test robot <lkp@intel.com>)
2. In kvm_set_cr3(), be conservative about skipping the TLB flush when only
   the LAM bits toggle. (Kirill)

Binbin Wu (2):
  KVM: x86: Introduce untag_addr() in kvm_x86_ops
  KVM: x86: Untag address when LAM applicable

Robert Hoo (3):
  KVM: x86: Virtualize CR4.LAM_SUP
  KVM: x86: Virtualize CR3.LAM_{U48,U57}
  KVM: x86: Expose LAM feature to userspace VMM

 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    | 14 +++++-
 arch/x86/kvm/cpuid.c               |  2 +-
 arch/x86/kvm/cpuid.h               |  5 +++
 arch/x86/kvm/emulate.c             | 23 +++++++---
 arch/x86/kvm/kvm_emulate.h         |  2 +
 arch/x86/kvm/mmu.h                 |  5 +++
 arch/x86/kvm/mmu/mmu.c             |  6 ++-
 arch/x86/kvm/mmu/mmu_internal.h    |  1 +
 arch/x86/kvm/mmu/paging_tmpl.h     |  6 ++-
 arch/x86/kvm/mmu/spte.h            |  2 +-
 arch/x86/kvm/svm/svm.c             |  7 +++
 arch/x86/kvm/vmx/nested.c          |  8 +++-
 arch/x86/kvm/vmx/sgx.c             |  1 +
 arch/x86/kvm/vmx/vmx.c             | 69 +++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h             |  2 +
 arch/x86/kvm/x86.c                 | 14 +++++-
 arch/x86/kvm/x86.h                 |  2 +
 18 files changed, 155 insertions(+), 15 deletions(-)


base-commit: 197b6b60ae7bc51dd0814953c562833143b292aa
prerequisite-patch-id: 883dc8f73520b47a6c3690c1704f2e85a2713e4f
prerequisite-patch-id: cf5655ce89a2390cd29f33c57a4fc307a6045f62
-- 
2.25.1



* [PATCH v7 1/5] KVM: x86: Virtualize CR4.LAM_SUP
  2023-04-04 13:09 [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling Binbin Wu
@ 2023-04-04 13:09 ` Binbin Wu
  2023-04-04 13:09 ` [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57} Binbin Wu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-04 13:09 UTC (permalink / raw)
  To: kvm, seanjc, pbonzini
  Cc: binbin.wu, kai.huang, chao.gao, xuelian.guo, robert.hu

From: Robert Hoo <robert.hu@linux.intel.com>

Allow setting of CR4.LAM_SUP (bit 28) by guest if vCPU supports LAM,
and intercept the bit (as it already is).

LAM uses CR4.LAM_SUP to configure LAM masking on supervisor-mode addresses.
To virtualize that, move CR4.LAM_SUP out of CR4_RESERVED_BITS and make its
reservation depend on whether the vCPU has the LAM feature.
CR4.LAM_SUP is allowed to be set even outside 64-bit mode, but it takes no
effect since LAM only applies to 64-bit linear addresses.

Leave the bit intercepted to avoid a vmread every time KVM fetches its
value, with the expectation that the guest won't toggle the bit frequently.

Hardware is not required to flush the TLB when CR4.LAM_SUP is toggled, so
KVM doesn't need to emulate a TLB flush for it.
There is no connection to other features/vmx_exec_controls, therefore no
code needs to be added in kvm/vmx_set_cr4().

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 3 ++-
 arch/x86/kvm/vmx/vmx.c          | 3 +++
 arch/x86/kvm/x86.h              | 2 ++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 808c292ad3f4..ba594f9ea414 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -125,7 +125,8 @@
 			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
 			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
-			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+			  | X86_CR4_LAM_SUP))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d2d6e1b6c788..42f163862a0f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7630,6 +7630,9 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
 	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
 	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
 
+	entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);
+	cr4_fixed1_update(X86_CR4_LAM_SUP,    eax, feature_bit(LAM));
+
 #undef cr4_fixed1_update
 }
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a8167b47b8c8..3a9d97b899df 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -487,6 +487,8 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
 		__reserved_bits |= X86_CR4_VMXE;        \
 	if (!__cpu_has(__c, X86_FEATURE_PCID))          \
 		__reserved_bits |= X86_CR4_PCIDE;       \
+	if (!__cpu_has(__c, X86_FEATURE_LAM))           \
+		__reserved_bits |= X86_CR4_LAM_SUP;     \
 	__reserved_bits;                                \
 })
 
-- 
2.25.1



* [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-04 13:09 [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling Binbin Wu
  2023-04-04 13:09 ` [PATCH v7 1/5] KVM: x86: Virtualize CR4.LAM_SUP Binbin Wu
@ 2023-04-04 13:09 ` Binbin Wu
  2023-04-06 12:57   ` Huang, Kai
                     ` (2 more replies)
  2023-04-04 13:09 ` [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops Binbin Wu
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-04 13:09 UTC (permalink / raw)
  To: kvm, seanjc, pbonzini
  Cc: binbin.wu, kai.huang, chao.gao, xuelian.guo, robert.hu

From: Robert Hoo <robert.hu@linux.intel.com>

LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
masking for user mode pointers.

When EPT is on:
CR3 is fully under control of guest, guest LAM is thus transparent to KVM.

When EPT is off (shadow paging):
- KVM needs to handle guest CR3.LAM_U48 and CR3.LAM_U57 toggles.
  The two bits are allowed to be set in CR3 if vCPU supports LAM.
  The two bits should be kept as they are in the shadow CR3.
- Perform GFN calculation from guest CR3/PGD generically by extracting the
  maximal base address mask since the validity has been checked already.
- Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
  It could potentially increase root cache misses and mmu reload, however,
  it's very rare to turn off EPT when performace matters.

To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
the bits used to control supported features related to CR3 (e.g. LAM).
- Supported control bits are set to cr3_ctrl_bits after set cpuid.
- Add kvm_vcpu_is_legal_cr3() to validate CR3, tp allow setting of control
  bits for the supported features.
- Add kvm_get_active_cr3_ctrl_bits() to get the active control bits to form
  a new guest CR3 (in vmx_load_mmu_pgd()).

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 6 ++++++
 arch/x86/kvm/cpuid.h            | 5 +++++
 arch/x86/kvm/mmu.h              | 5 +++++
 arch/x86/kvm/mmu/mmu.c          | 6 +++++-
 arch/x86/kvm/mmu/mmu_internal.h | 1 +
 arch/x86/kvm/mmu/paging_tmpl.h  | 6 +++++-
 arch/x86/kvm/mmu/spte.h         | 2 +-
 arch/x86/kvm/vmx/nested.c       | 4 ++--
 arch/x86/kvm/vmx/vmx.c          | 6 +++++-
 arch/x86/kvm/x86.c              | 4 ++--
 10 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ba594f9ea414..498d2b5e8dc1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -729,6 +729,12 @@ struct kvm_vcpu_arch {
 	unsigned long cr0_guest_owned_bits;
 	unsigned long cr2;
 	unsigned long cr3;
+	/*
+	 * Bits in CR3 used to enable certain features. These bits are allowed
+	 * to be set in CR3 when vCPU supports the features. When shadow paging
+	 * is used, these bits should be kept as they are in the shadow CR3.
+	 */
+	u64 cr3_ctrl_bits;
 	unsigned long cr4;
 	unsigned long cr4_guest_owned_bits;
 	unsigned long cr4_guest_rsvd_bits;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index b1658c0de847..ef8e1b912d7d 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -42,6 +42,11 @@ static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
 	return vcpu->arch.maxphyaddr;
 }
 
+static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
+{
+	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
+}
+
 static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
 {
 	return !(gpa & vcpu->arch.reserved_gpa_bits);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 168c46fd8dd1..29985eeb8e12 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
 	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
 }
 
+static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
+{
+	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
+}
+
 static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
 {
 	u64 root_hpa = vcpu->arch.mmu->root.hpa;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8ebe542c565..de2c51a0b611 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3732,7 +3732,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	hpa_t root;
 
 	root_pgd = mmu->get_guest_pgd(vcpu);
-	root_gfn = root_pgd >> PAGE_SHIFT;
+	/*
+	* The guest PGD has already been checked for validity, unconditionally
+	* strip non-address bits when computing the GFN.
+	*/
+	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
 
 	if (mmu_check_root(vcpu, root_gfn))
 		return 1;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index cc58631e2336..c0479cbc2ca3 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -21,6 +21,7 @@ extern bool dbg;
 #endif
 
 /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
+#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
 #define __PT_LEVEL_SHIFT(level, bits_per_level)	\
 	(PAGE_SHIFT + ((level) - 1) * (bits_per_level))
 #define __PT_INDEX(address, level, bits_per_level) \
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 57f0b75c80f9..88351ba04249 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -62,7 +62,7 @@
 #endif
 
 /* Common logic, but per-type values.  These also need to be undefined. */
-#define PT_BASE_ADDR_MASK	((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
+#define PT_BASE_ADDR_MASK	((pt_element_t)__PT_BASE_ADDR_MASK)
 #define PT_LVL_ADDR_MASK(lvl)	__PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
 #define PT_LVL_OFFSET_MASK(lvl)	__PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
 #define PT_INDEX(addr, lvl)	__PT_INDEX(addr, lvl, PT_LEVEL_BITS)
@@ -324,6 +324,10 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	trace_kvm_mmu_pagetable_walk(addr, access);
 retry_walk:
 	walker->level = mmu->cpu_role.base.level;
+	/*
+	 * No need to mask cr3_ctrl_bits, gpte_to_gfn() will strip
+	 * non-address bits.
+	 */
 	pte           = mmu->get_guest_pgd(vcpu);
 	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1279db2eab44..777f7d443e3b 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -36,7 +36,7 @@ static_assert(SPTE_TDP_AD_ENABLED == 0);
 #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
 #define SPTE_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
 #else
-#define SPTE_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+#define SPTE_BASE_ADDR_MASK __PT_BASE_ADDR_MASK
 #endif
 
 #define SPTE_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 1bc2b80273c9..d35bda9610e2 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1079,7 +1079,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
 			       bool nested_ept, bool reload_pdptrs,
 			       enum vm_entry_failure_code *entry_failure_code)
 {
-	if (CC(kvm_vcpu_is_illegal_gpa(vcpu, cr3))) {
+	if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3))) {
 		*entry_failure_code = ENTRY_FAIL_DEFAULT;
 		return -EINVAL;
 	}
@@ -2907,7 +2907,7 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
 
 	if (CC(!nested_host_cr0_valid(vcpu, vmcs12->host_cr0)) ||
 	    CC(!nested_host_cr4_valid(vcpu, vmcs12->host_cr4)) ||
-	    CC(kvm_vcpu_is_illegal_gpa(vcpu, vmcs12->host_cr3)))
+	    CC(!kvm_vcpu_is_legal_cr3(vcpu, vmcs12->host_cr3)))
 		return -EINVAL;
 
 	if (CC(is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu)) ||
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 42f163862a0f..4d329ee9474c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3388,7 +3388,8 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			update_guest_cr3 = false;
 		vmx_ept_load_pdptrs(vcpu);
 	} else {
-		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu);
+		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu) |
+		            kvm_get_active_cr3_ctrl_bits(vcpu);
 	}
 
 	if (update_guest_cr3)
@@ -7763,6 +7764,9 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		vmx->msr_ia32_feature_control_valid_bits &=
 			~FEAT_CTL_SGX_LC_ENABLED;
 
+	if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
+		vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 | X86_CR3_LAM_U57;
+
 	/* Refresh #PF interception to account for MAXPHYADDR changes. */
 	vmx_update_exception_bitmap(vcpu);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7713420abab0..aca255e69d0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1260,7 +1260,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
 	 * the current vCPU mode is accurate.
 	 */
-	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
+	if (!kvm_vcpu_is_legal_cr3(vcpu, cr3))
 		return 1;
 
 	if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, cr3))
@@ -11349,7 +11349,7 @@ static bool kvm_is_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 		 */
 		if (!(sregs->cr4 & X86_CR4_PAE) || !(sregs->efer & EFER_LMA))
 			return false;
-		if (kvm_vcpu_is_illegal_gpa(vcpu, sregs->cr3))
+		if (!kvm_vcpu_is_legal_cr3(vcpu, sregs->cr3))
 			return false;
 	} else {
 		/*
-- 
2.25.1



* [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops
  2023-04-04 13:09 [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling Binbin Wu
  2023-04-04 13:09 ` [PATCH v7 1/5] KVM: x86: Virtualize CR4.LAM_SUP Binbin Wu
  2023-04-04 13:09 ` [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57} Binbin Wu
@ 2023-04-04 13:09 ` Binbin Wu
  2023-04-18  3:08   ` Zeng Guang
  2023-04-19  2:30   ` Chao Gao
  2023-04-04 13:09 ` [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable Binbin Wu
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-04 13:09 UTC (permalink / raw)
  To: kvm, seanjc, pbonzini
  Cc: binbin.wu, kai.huang, chao.gao, xuelian.guo, robert.hu

Introduce a new interface, untag_addr(), in kvm_x86_ops to untag the metadata
from a linear address. Implement the LAM version in VMX and a dummy version in
SVM.

When a feature like Intel Linear Address Masking or AMD Upper Address
Ignore is enabled, a linear address may be tagged with metadata. The
linear address should be checked for modified canonicality and untagged
in instruction emulation or VM-exit handling if LAM or UAI is applicable.

Introduce untag_addr() to kvm_x86_ops to hide the code related to vendor
specific details.
- For VMX, LAM version is implemented.
  LAM has a modified canonical check when applicable:
  * LAM_S48                : [ 1 ][ metadata ][ 1 ]
                               63               47
  * LAM_U48                : [ 0 ][ metadata ][ 0 ]
                               63               47
  * LAM_S57                : [ 1 ][ metadata ][ 1 ]
                               63               56
  * LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
                               63               56
  * LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
                               63               56..47
  If LAM is applicable to a given address, untag it by replacing the metadata
  bits with the value of bit 47 (LAM48) or bit 56 (LAM57). The untagged address
  then goes through the legacy canonical check, so the LAM canonical check and
  masking are covered by "untag + legacy canonical check".

  For cases where LAM is not applicable, 'flags' is passed to the interface
  to skip untagging.

- For SVM, add a dummy version that does nothing and returns the original
  address.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  5 +++
 arch/x86/kvm/svm/svm.c             |  7 ++++
 arch/x86/kvm/vmx/vmx.c             | 60 ++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.h             |  2 +
 5 files changed, 75 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8dc345cc6318..7d63d1b942ac 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -52,6 +52,7 @@ KVM_X86_OP(cache_reg)
 KVM_X86_OP(get_rflags)
 KVM_X86_OP(set_rflags)
 KVM_X86_OP(get_if_flag)
+KVM_X86_OP(untag_addr)
 KVM_X86_OP(flush_tlb_all)
 KVM_X86_OP(flush_tlb_current)
 KVM_X86_OP_OPTIONAL(tlb_remote_flush)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 498d2b5e8dc1..cb674ec826d4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -69,6 +69,9 @@
 #define KVM_X86_NOTIFY_VMEXIT_VALID_BITS	(KVM_X86_NOTIFY_VMEXIT_ENABLED | \
 						 KVM_X86_NOTIFY_VMEXIT_USER)
 
+/* flags for kvm_x86_ops::untag_addr() */
+#define KVM_X86_UNTAG_ADDR_SKIP_LAM	_BITULL(0)
+
 /* x86-specific vcpu->requests bit members */
 #define KVM_REQ_MIGRATE_TIMER		KVM_ARCH_REQ(0)
 #define KVM_REQ_REPORT_TPR_ACCESS	KVM_ARCH_REQ(1)
@@ -1607,6 +1610,8 @@ struct kvm_x86_ops {
 	void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
 	bool (*get_if_flag)(struct kvm_vcpu *vcpu);
 
+	u64 (*untag_addr)(struct kvm_vcpu *vcpu, u64 la, u64 flags);
+
 	void (*flush_tlb_all)(struct kvm_vcpu *vcpu);
 	void (*flush_tlb_current)(struct kvm_vcpu *vcpu);
 	int  (*tlb_remote_flush)(struct kvm *kvm);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 252e7f37e4e2..a6e6bd09642b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4696,6 +4696,11 @@ static int svm_vm_init(struct kvm *kvm)
 	return 0;
 }
 
+static u64 svm_untag_addr(struct kvm_vcpu *vcpu, u64 addr, u64 flags)
+{
+	return addr;
+}
+
 static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.name = KBUILD_MODNAME,
 
@@ -4745,6 +4750,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.set_rflags = svm_set_rflags,
 	.get_if_flag = svm_get_if_flag,
 
+	.untag_addr = svm_untag_addr,
+
 	.flush_tlb_all = svm_flush_tlb_current,
 	.flush_tlb_current = svm_flush_tlb_current,
 	.flush_tlb_gva = svm_flush_tlb_gva,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4d329ee9474c..73cc495bd0da 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8137,6 +8137,64 @@ static void vmx_vm_destroy(struct kvm *kvm)
 	free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm));
 }
 
+
+#define LAM_S57_EN_MASK (X86_CR4_LAM_SUP | X86_CR4_LA57)
+
+static inline int lam_sign_extend_bit(bool user, struct kvm_vcpu *vcpu)
+{
+	u64 cr3, cr4;
+
+	if (user) {
+		cr3 = kvm_read_cr3(vcpu);
+		if (!!(cr3 & X86_CR3_LAM_U57))
+			return 56;
+		if (!!(cr3 & X86_CR3_LAM_U48))
+			return 47;
+	} else {
+		cr4 = kvm_read_cr4_bits(vcpu, LAM_S57_EN_MASK);
+		if (cr4 == LAM_S57_EN_MASK)
+			return 56;
+		if (!!(cr4 & X86_CR4_LAM_SUP))
+			return 47;
+	}
+	return -1;
+}
+
+/*
+ * Only called in 64-bit mode.
+ *
+ * Metadata bits are [62:48] in LAM48 and [62:57] in LAM57. Mask metadata in
+ * pointers by sign-extending the value of bit 47 (LAM48) or 56 (LAM57).
+ * The resulting address after untagging isn't guaranteed to be canonical.
+ * Callers should perform the original canonical check and raise #GP/#SS if the
+ * address is non-canonical.
+ */
+u64 vmx_untag_addr(struct kvm_vcpu *vcpu, u64 addr, u64 flags)
+{
+	int sign_ext_bit;
+
+	/*
+	 * Instead of calling relatively expensive guest_cpuid_has(), just check
+	 * LAM_U48 in cr3_ctrl_bits. If not set, the vCPU doesn't support LAM.
+	 */
+	if (!(vcpu->arch.cr3_ctrl_bits & X86_CR3_LAM_U48) ||
+	    !!(flags & KVM_X86_UNTAG_ADDR_SKIP_LAM))
+		return addr;
+
+	if (!is_64_bit_mode(vcpu)) {
+		WARN_ONCE(1, "Should only be called in 64-bit mode");
+		return addr;
+	}
+
+	sign_ext_bit = lam_sign_extend_bit(!(addr >> 63), vcpu);
+
+	if (sign_ext_bit < 0)
+		return addr;
+
+	return (sign_extend64(addr, sign_ext_bit) & ~BIT_ULL(63)) |
+	       (addr & BIT_ULL(63));
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.name = KBUILD_MODNAME,
 
@@ -8185,6 +8243,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.set_rflags = vmx_set_rflags,
 	.get_if_flag = vmx_get_if_flag,
 
+	.untag_addr = vmx_untag_addr,
+
 	.flush_tlb_all = vmx_flush_tlb_all,
 	.flush_tlb_current = vmx_flush_tlb_current,
 	.flush_tlb_gva = vmx_flush_tlb_gva,
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 2acdc54bc34b..79855b9a4aca 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -433,6 +433,8 @@ void vmx_enable_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr, int type);
 u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
 u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
 
+u64 vmx_untag_addr(struct kvm_vcpu *vcpu, u64 addr, u64 flags);
+
 static inline void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
 					     int type, bool value)
 {
-- 
2.25.1



* [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable
  2023-04-04 13:09 [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling Binbin Wu
                   ` (2 preceding siblings ...)
  2023-04-04 13:09 ` [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops Binbin Wu
@ 2023-04-04 13:09 ` Binbin Wu
  2023-04-06 13:20   ` Huang, Kai
                     ` (2 more replies)
  2023-04-04 13:09 ` [PATCH v7 5/5] KVM: x86: Expose LAM feature to userspace VMM Binbin Wu
  2023-04-21  9:40 ` [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling Binbin Wu
  5 siblings, 3 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-04 13:09 UTC (permalink / raw)
  To: kvm, seanjc, pbonzini
  Cc: binbin.wu, kai.huang, chao.gao, xuelian.guo, robert.hu

Untag the address of 64-bit memory/MMIO operands in instruction emulation
and VM-exit handlers when LAM is applicable.

For instruction emulation, untag the address in __linearize() before the
canonical check. LAM doesn't apply to instruction fetches or INVLPG, so
KVM_X86_UNTAG_ADDR_SKIP_LAM is used to skip LAM untagging for them.

For VM-exit handling related to 64-bit linear addresses:
- Cases that need address untagging:
  Operand(s) of VMX instructions and INVPCID
  Operand(s) of SGX ENCLS
  Linear address in the INVVPID descriptor
- Cases LAM doesn't apply to (no change needed):
  Operand of INVLPG
  Linear address in the INVPCID descriptor

Co-developed-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---
 arch/x86/kvm/emulate.c     | 23 ++++++++++++++++++-----
 arch/x86/kvm/kvm_emulate.h |  2 ++
 arch/x86/kvm/vmx/nested.c  |  4 ++++
 arch/x86/kvm/vmx/sgx.c     |  1 +
 arch/x86/kvm/x86.c         | 10 ++++++++++
 5 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a20bec931764..b7df465eccf2 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -688,7 +688,8 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
 				       struct segmented_address addr,
 				       unsigned *max_size, unsigned size,
 				       bool write, bool fetch,
-				       enum x86emul_mode mode, ulong *linear)
+				       enum x86emul_mode mode, ulong *linear,
+				       u64 untag_flags)
 {
 	struct desc_struct desc;
 	bool usable;
@@ -701,6 +702,7 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
 	*max_size = 0;
 	switch (mode) {
 	case X86EMUL_MODE_PROT64:
+		la = ctxt->ops->untag_addr(ctxt, la, untag_flags);
 		*linear = la;
 		va_bits = ctxt_virt_addr_bits(ctxt);
 		if (!__is_canonical_address(la, va_bits))
@@ -758,7 +760,7 @@ static int linearize(struct x86_emulate_ctxt *ctxt,
 {
 	unsigned max_size;
 	return __linearize(ctxt, addr, &max_size, size, write, false,
-			   ctxt->mode, linear);
+			   ctxt->mode, linear, 0);
 }
 
 static inline int assign_eip(struct x86_emulate_ctxt *ctxt, ulong dst)
@@ -771,7 +773,12 @@ static inline int assign_eip(struct x86_emulate_ctxt *ctxt, ulong dst)
 
 	if (ctxt->op_bytes != sizeof(unsigned long))
 		addr.ea = dst & ((1UL << (ctxt->op_bytes << 3)) - 1);
-	rc = __linearize(ctxt, addr, &max_size, 1, false, true, ctxt->mode, &linear);
+	/*
+	 * LAM does not apply to addresses used for instruction fetches
+	 * or to those that specify the targets of jump and call instructions
+	 */
+	rc = __linearize(ctxt, addr, &max_size, 1, false, true, ctxt->mode,
+	                 &linear, KVM_X86_UNTAG_ADDR_SKIP_LAM);
 	if (rc == X86EMUL_CONTINUE)
 		ctxt->_eip = addr.ea;
 	return rc;
@@ -906,9 +913,12 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 	 * __linearize is called with size 0 so that it does not do any
 	 * boundary check itself.  Instead, we use max_size to check
 	 * against op_size.
+	 *
+	 * LAM does not apply to addresses used for instruction fetches
+	 * or to those that specify the targets of jump and call instructions
 	 */
 	rc = __linearize(ctxt, addr, &max_size, 0, false, true, ctxt->mode,
-			 &linear);
+			 &linear, KVM_X86_UNTAG_ADDR_SKIP_LAM);
 	if (unlikely(rc != X86EMUL_CONTINUE))
 		return rc;
 
@@ -3433,8 +3443,11 @@ static int em_invlpg(struct x86_emulate_ctxt *ctxt)
 {
 	int rc;
 	ulong linear;
+	unsigned max_size;
 
-	rc = linearize(ctxt, ctxt->src.addr.mem, 1, false, &linear);
+	/* skip untag for invlpg since LAM is not applied to invlpg */
+	rc = __linearize(ctxt, ctxt->src.addr.mem, &max_size, 1, false, false,
+			 ctxt->mode, &linear, KVM_X86_UNTAG_ADDR_SKIP_LAM);
 	if (rc == X86EMUL_CONTINUE)
 		ctxt->ops->invlpg(ctxt, linear);
 	/* Disable writeback. */
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index ab65f3a47dfd..8d9f782adccb 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -225,6 +225,8 @@ struct x86_emulate_ops {
 	int (*leave_smm)(struct x86_emulate_ctxt *ctxt);
 	void (*triple_fault)(struct x86_emulate_ctxt *ctxt);
 	int (*set_xcr)(struct x86_emulate_ctxt *ctxt, u32 index, u64 xcr);
+
+	u64 (*untag_addr)(struct x86_emulate_ctxt *ctxt, u64 addr, u64 flags);
 };
 
 /* Type, address-of, and value of an instruction's operand. */
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d35bda9610e2..48cca88bfd37 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4970,6 +4970,7 @@ int get_vmx_mem_address(struct kvm_vcpu *vcpu, unsigned long exit_qualification,
 		else
 			*ret = off;
 
+		*ret = vmx_untag_addr(vcpu, *ret, 0);
 		/* Long mode: #GP(0)/#SS(0) if the memory address is in a
 		 * non-canonical form. This is the only check on the memory
 		 * destination for long mode!
@@ -5787,6 +5788,9 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
 	vpid02 = nested_get_vpid02(vcpu);
 	switch (type) {
 	case VMX_VPID_EXTENT_INDIVIDUAL_ADDR:
+		/* invvpid is not valid in compatibility mode */
+		if (is_long_mode(vcpu))
+			operand.gla = vmx_untag_addr(vcpu, operand.gla, 0);
 		if (!operand.vpid ||
 		    is_noncanonical_address(operand.gla, vcpu))
 			return nested_vmx_fail(vcpu,
diff --git a/arch/x86/kvm/vmx/sgx.c b/arch/x86/kvm/vmx/sgx.c
index 0574030b071f..527f1a902c65 100644
--- a/arch/x86/kvm/vmx/sgx.c
+++ b/arch/x86/kvm/vmx/sgx.c
@@ -37,6 +37,7 @@ static int sgx_get_encls_gva(struct kvm_vcpu *vcpu, unsigned long offset,
 	if (!IS_ALIGNED(*gva, alignment)) {
 		fault = true;
 	} else if (likely(is_64_bit_mode(vcpu))) {
+		*gva = vmx_untag_addr(vcpu, *gva, 0);
 		fault = is_noncanonical_address(*gva, vcpu);
 	} else {
 		*gva &= 0xffffffff;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aca255e69d0d..18ad38649714 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8218,6 +8218,11 @@ static void emulator_vm_bugged(struct x86_emulate_ctxt *ctxt)
 		kvm_vm_bugged(kvm);
 }
 
+static u64 emulator_untag_addr(struct x86_emulate_ctxt *ctxt, u64 addr, u64 flags)
+{
+	return static_call(kvm_x86_untag_addr)(emul_to_vcpu(ctxt), addr, flags);
+}
+
 static const struct x86_emulate_ops emulate_ops = {
 	.vm_bugged           = emulator_vm_bugged,
 	.read_gpr            = emulator_read_gpr,
@@ -8263,6 +8268,7 @@ static const struct x86_emulate_ops emulate_ops = {
 	.leave_smm           = emulator_leave_smm,
 	.triple_fault        = emulator_triple_fault,
 	.set_xcr             = emulator_set_xcr,
+	.untag_addr          = emulator_untag_addr,
 };
 
 static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
@@ -13260,6 +13266,10 @@ int kvm_handle_invpcid(struct kvm_vcpu *vcpu, unsigned long type, gva_t gva)
 
 	switch (type) {
 	case INVPCID_TYPE_INDIV_ADDR:
+		/*
+		 * LAM doesn't apply to the linear address in the descriptor,
+		 * but it still needs to be canonical.
+		 */
 		if ((!pcid_enabled && (operand.pcid != 0)) ||
 		    is_noncanonical_address(operand.gla, vcpu)) {
 			kvm_inject_gp(vcpu, 0);
-- 
2.25.1



* [PATCH v7 5/5] KVM: x86: Expose LAM feature to userspace VMM
  2023-04-04 13:09 [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling Binbin Wu
                   ` (3 preceding siblings ...)
  2023-04-04 13:09 ` [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable Binbin Wu
@ 2023-04-04 13:09 ` Binbin Wu
  2023-04-21  9:40 ` [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling Binbin Wu
  5 siblings, 0 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-04 13:09 UTC (permalink / raw)
  To: kvm, seanjc, pbonzini
  Cc: binbin.wu, kai.huang, chao.gao, xuelian.guo, robert.hu

From: Robert Hoo <robert.hu@linux.intel.com>

The LAM feature is enumerated by CPUID.7.1:EAX.LAM[bit 26].
Expose the feature to the userspace VMM as the final step, after the
following support is in place:
- CR4.LAM_SUP virtualization
- CR3.LAM_U48 and CR3.LAM_U57 virtualization
- Checking and untagging of 64-bit linear addresses when LAM applies, in
  instruction emulation and VM-exit handlers.

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Jingqi Liu <jingqi.liu@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
---
 arch/x86/kvm/cpuid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 599aebec2d52..1dd5e9034386 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -670,7 +670,7 @@ void kvm_set_cpu_caps(void)
 	kvm_cpu_cap_mask(CPUID_7_1_EAX,
 		F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
 		F(FZRM) | F(FSRS) | F(FSRC) |
-		F(AMX_FP16) | F(AVX_IFMA)
+		F(AMX_FP16) | F(AVX_IFMA) | F(LAM)
 	);
 
 	kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX,
-- 
2.25.1



* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-04 13:09 ` [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57} Binbin Wu
@ 2023-04-06 12:57   ` Huang, Kai
  2023-04-09 11:36     ` Binbin Wu
  2023-04-12 11:58   ` Huang, Kai
  2023-04-17  7:24   ` Chao Gao
  2 siblings, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-04-06 12:57 UTC (permalink / raw)
  To: kvm, pbonzini, Christopherson,, Sean, binbin.wu
  Cc: Guo, Xuelian, robert.hu, Gao, Chao

On Tue, 2023-04-04 at 21:09 +0800, Binbin Wu wrote:
> From: Robert Hoo <robert.hu@linux.intel.com>
> 
> LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
> masking for user mode pointers.
> 
> When EPT is on:
> CR3 is fully under control of guest, guest LAM is thus transparent to KVM.
> 
> When EPT is off (shadow paging):
> - KVM needs to handle guest CR3.LAM_U48 and CR3.LAM_U57 toggles.
>   The two bits are allowed to be set in CR3 if vCPU supports LAM.
>   The two bits should be kept as they are in the shadow CR3.
> - Perform GFN calculation from guest CR3/PGD generically by extracting the
>   maximal base address mask since the validity has been checked already.
> - Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
>   It could potentially increase root cache misses and mmu reload, however,
>   it's very rare to turn off EPT when performace matters.
> 
> To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
> the bits used to control supported features related to CR3 (e.g. LAM).
> - Supported control bits are set to cr3_ctrl_bits after set cpuid.
> - Add kvm_vcpu_is_legal_cr3() to validate CR3, tp allow setting of control
						 ^
						 to ?
Could you run spell check for all patches?

>   bits for the supported features.
> - Add kvm_get_active_cr3_ctrl_bits() to get the active control bits to form
>   a new guest CR3 (in vmx_load_mmu_pgd()).

KVM handles #PF for the shadow MMU, and for TDP (EPT) there's also a case where
KVM will trap the #PF (see allow_smaller_maxphyaddr).  Do we need any handling
of the linear address in the #PF, i.e. stripping the metadata off while walking
the page table?

I guess it's already done automatically?  Anyway, IMO it would be better if you
could add one or two sentences to the changelog to clarify whether such handling
is needed, and if not, why.

> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> Tested-by: Xuelian Guo <xuelian.guo@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 6 ++++++
>  arch/x86/kvm/cpuid.h            | 5 +++++
>  arch/x86/kvm/mmu.h              | 5 +++++
>  arch/x86/kvm/mmu/mmu.c          | 6 +++++-
>  arch/x86/kvm/mmu/mmu_internal.h | 1 +
>  arch/x86/kvm/mmu/paging_tmpl.h  | 6 +++++-
>  arch/x86/kvm/mmu/spte.h         | 2 +-
>  arch/x86/kvm/vmx/nested.c       | 4 ++--
>  arch/x86/kvm/vmx/vmx.c          | 6 +++++-
>  arch/x86/kvm/x86.c              | 4 ++--
>  10 files changed, 37 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index ba594f9ea414..498d2b5e8dc1 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -729,6 +729,12 @@ struct kvm_vcpu_arch {
>  	unsigned long cr0_guest_owned_bits;
>  	unsigned long cr2;
>  	unsigned long cr3;
> +	/*
> +	 * Bits in CR3 used to enable certain features. These bits are allowed
> +	 * to be set in CR3 when vCPU supports the features. When shadow paging
> +	 * is used, these bits should be kept as they are in the shadow CR3.
> +	 */
> +	u64 cr3_ctrl_bits;
>  	unsigned long cr4;
>  	unsigned long cr4_guest_owned_bits;
>  	unsigned long cr4_guest_rsvd_bits;
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index b1658c0de847..ef8e1b912d7d 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -42,6 +42,11 @@ static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
>  	return vcpu->arch.maxphyaddr;
>  }
>  
> +static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> +{
> +	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
> +}
> +
>  static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
>  {
>  	return !(gpa & vcpu->arch.reserved_gpa_bits);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 168c46fd8dd1..29985eeb8e12 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
>  	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
>  }
>  
> +static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
> +}
> +
>  static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
>  {
>  	u64 root_hpa = vcpu->arch.mmu->root.hpa;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8ebe542c565..de2c51a0b611 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3732,7 +3732,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	hpa_t root;
>  
>  	root_pgd = mmu->get_guest_pgd(vcpu);
> -	root_gfn = root_pgd >> PAGE_SHIFT;
> +	/*
> +	* The guest PGD has already been checked for validity, unconditionally
> +	* strip non-address bits when computing the GFN.
> +	*/
> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>  
>  	if (mmu_check_root(vcpu, root_gfn))
>  		return 1;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index cc58631e2336..c0479cbc2ca3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -21,6 +21,7 @@ extern bool dbg;
>  #endif
>  
>  /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
> +#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>  #define __PT_LEVEL_SHIFT(level, bits_per_level)	\
>  	(PAGE_SHIFT + ((level) - 1) * (bits_per_level))
>  #define __PT_INDEX(address, level, bits_per_level) \
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 57f0b75c80f9..88351ba04249 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -62,7 +62,7 @@
>  #endif
>  
>  /* Common logic, but per-type values.  These also need to be undefined. */
> -#define PT_BASE_ADDR_MASK	((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
> +#define PT_BASE_ADDR_MASK	((pt_element_t)__PT_BASE_ADDR_MASK)
>  #define PT_LVL_ADDR_MASK(lvl)	__PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>  #define PT_LVL_OFFSET_MASK(lvl)	__PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>  #define PT_INDEX(addr, lvl)	__PT_INDEX(addr, lvl, PT_LEVEL_BITS)
> @@ -324,6 +324,10 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	trace_kvm_mmu_pagetable_walk(addr, access);
>  retry_walk:
>  	walker->level = mmu->cpu_role.base.level;
> +	/*
> +	 * No need to mask cr3_ctrl_bits, gpte_to_gfn() will strip
> +	 * non-address bits.
> +	 */
>  	pte           = mmu->get_guest_pgd(vcpu);
>  	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
>  
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 1279db2eab44..777f7d443e3b 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -36,7 +36,7 @@ static_assert(SPTE_TDP_AD_ENABLED == 0);
>  #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
>  #define SPTE_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
>  #else
> -#define SPTE_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
> +#define SPTE_BASE_ADDR_MASK __PT_BASE_ADDR_MASK
>  #endif
>  
>  #define SPTE_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 1bc2b80273c9..d35bda9610e2 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1079,7 +1079,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
>  			       bool nested_ept, bool reload_pdptrs,
>  			       enum vm_entry_failure_code *entry_failure_code)
>  {
> -	if (CC(kvm_vcpu_is_illegal_gpa(vcpu, cr3))) {
> +	if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3))) {
>  		*entry_failure_code = ENTRY_FAIL_DEFAULT;
>  		return -EINVAL;
>  	}
> @@ -2907,7 +2907,7 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
>  
>  	if (CC(!nested_host_cr0_valid(vcpu, vmcs12->host_cr0)) ||
>  	    CC(!nested_host_cr4_valid(vcpu, vmcs12->host_cr4)) ||
> -	    CC(kvm_vcpu_is_illegal_gpa(vcpu, vmcs12->host_cr3)))
> +	    CC(!kvm_vcpu_is_legal_cr3(vcpu, vmcs12->host_cr3)))
>  		return -EINVAL;
>  
>  	if (CC(is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu)) ||
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 42f163862a0f..4d329ee9474c 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3388,7 +3388,8 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>  			update_guest_cr3 = false;
>  		vmx_ept_load_pdptrs(vcpu);
>  	} else {
> -		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu);
> +		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu) |
> +		            kvm_get_active_cr3_ctrl_bits(vcpu);
>  	}
>  
>  	if (update_guest_cr3)
> @@ -7763,6 +7764,9 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  		vmx->msr_ia32_feature_control_valid_bits &=
>  			~FEAT_CTL_SGX_LC_ENABLED;
>  
> +	if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
> +		vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 | X86_CR3_LAM_U57;
> +
>  	/* Refresh #PF interception to account for MAXPHYADDR changes. */
>  	vmx_update_exception_bitmap(vcpu);
>  }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7713420abab0..aca255e69d0d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1260,7 +1260,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>  	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
>  	 * the current vCPU mode is accurate.
>  	 */
> -	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
> +	if (!kvm_vcpu_is_legal_cr3(vcpu, cr3))
>  		return 1;
>  
>  	if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, cr3))
> @@ -11349,7 +11349,7 @@ static bool kvm_is_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
>  		 */
>  		if (!(sregs->cr4 & X86_CR4_PAE) || !(sregs->efer & EFER_LMA))
>  			return false;
> -		if (kvm_vcpu_is_illegal_gpa(vcpu, sregs->cr3))
> +		if (!kvm_vcpu_is_legal_cr3(vcpu, sregs->cr3))
>  			return false;
>  	} else {
>  		/*



* Re: [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable
  2023-04-04 13:09 ` [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable Binbin Wu
@ 2023-04-06 13:20   ` Huang, Kai
  2023-04-10  3:35     ` Binbin Wu
  2023-04-18  3:28   ` Zeng Guang
  2023-04-19  6:43   ` Chao Gao
  2 siblings, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-04-06 13:20 UTC (permalink / raw)
  To: kvm, pbonzini, Christopherson,, Sean, binbin.wu
  Cc: Guo, Xuelian, robert.hu, Gao, Chao

On Tue, 2023-04-04 at 21:09 +0800, Binbin Wu wrote:
>  	case VMX_VPID_EXTENT_INDIVIDUAL_ADDR:
> +		/* invvpid is not valid in compatibility mode */
> +		if (is_long_mode(vcpu))
> +			operand.gla = vmx_untag_addr(vcpu, operand.gla, 0);

This comment doesn't make sense.  The code does nothing to distinguish
compatibility mode from 64-bit mode.

Now, although we are all clear that is_long_mode() here basically equals
is_64_bit_mode(), I do think we need a comment or WARN() _SOMEWHERE_ to
indicate that compatibility mode is not possible when handling VMEXIT for VMX
instructions (except VMCALL).  Not everyone will be able to notice this small
thing in the SDM.

Then you can just delete this comment here.

Alternatively, for better readability, I am actually thinking maybe we should
just use is_64_bit_mode(), because those segments are cached by KVM anyway, so
I don't think there's a measurable performance difference between
is_long_mode() and is_64_bit_mode().

Sean, any comments?


* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-06 12:57   ` Huang, Kai
@ 2023-04-09 11:36     ` Binbin Wu
  2023-04-11 23:11       ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: Binbin Wu @ 2023-04-09 11:36 UTC (permalink / raw)
  To: Huang, Kai, kvm, pbonzini, Christopherson,, Sean
  Cc: Guo, Xuelian, robert.hu, Gao, Chao


On 4/6/2023 8:57 PM, Huang, Kai wrote:
> On Tue, 2023-04-04 at 21:09 +0800, Binbin Wu wrote:
>> From: Robert Hoo <robert.hu@linux.intel.com>
>>
>> LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
>> masking for user mode pointers.
>>
>> When EPT is on:
>> CR3 is fully under control of guest, guest LAM is thus transparent to KVM.
>>
>> When EPT is off (shadow paging):
>> - KVM needs to handle guest CR3.LAM_U48 and CR3.LAM_U57 toggles.
>>    The two bits are allowed to be set in CR3 if vCPU supports LAM.
>>    The two bits should be kept as they are in the shadow CR3.
>> - Perform GFN calculation from guest CR3/PGD generically by extracting the
>>    maximal base address mask since the validity has been checked already.
>> - Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
>>    It could potentially increase root cache misses and mmu reload, however,
>>    it's very rare to turn off EPT when performace matters.
>>
>> To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
>> the bits used to control supported features related to CR3 (e.g. LAM).
>> - Supported control bits are set to cr3_ctrl_bits after set cpuid.
>> - Add kvm_vcpu_is_legal_cr3() to validate CR3, tp allow setting of control
> 						 ^
> 						 to ?
> Could you run spell check for all patches?


OK, thanks for your advice.

>
>>    bits for the supported features.
>> - Add kvm_get_active_cr3_ctrl_bits() to get the active control bits to form
>>    a new guest CR3 (in vmx_load_mmu_pgd()).
> KVM handles #PF for shadow MMU, and for TDP (EPT) there's also a case that KVM
> will trap the #PF (see allow_smaller_maxphyaddr).  Do we need any handling to
> the linear address in the #PF, i.e. stripping metadata off while walking page
> table?
>
> I guess it's already done automatically?  Anyway, IMO it would be better if you
> can add one or two sentences in the changelog to clarify whether such handling
> is needed, and if not, why.

LAM masking applies before paging, so the faulting linear address doesn't
contain the metadata.
It has been mentioned in the cover letter, but to be clear, I will add the
clarification to the changelog of patch 4.

Thanks.


>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
>> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
>> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>> Tested-by: Xuelian Guo <xuelian.guo@intel.com>
>> ---
>>   arch/x86/include/asm/kvm_host.h | 6 ++++++
>>   arch/x86/kvm/cpuid.h            | 5 +++++
>>   arch/x86/kvm/mmu.h              | 5 +++++
>>   arch/x86/kvm/mmu/mmu.c          | 6 +++++-
>>   arch/x86/kvm/mmu/mmu_internal.h | 1 +
>>   arch/x86/kvm/mmu/paging_tmpl.h  | 6 +++++-
>>   arch/x86/kvm/mmu/spte.h         | 2 +-
>>   arch/x86/kvm/vmx/nested.c       | 4 ++--
>>   arch/x86/kvm/vmx/vmx.c          | 6 +++++-
>>   arch/x86/kvm/x86.c              | 4 ++--
>>   10 files changed, 37 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index ba594f9ea414..498d2b5e8dc1 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -729,6 +729,12 @@ struct kvm_vcpu_arch {
>>   	unsigned long cr0_guest_owned_bits;
>>   	unsigned long cr2;
>>   	unsigned long cr3;
>> +	/*
>> +	 * Bits in CR3 used to enable certain features. These bits are allowed
>> +	 * to be set in CR3 when vCPU supports the features. When shadow paging
>> +	 * is used, these bits should be kept as they are in the shadow CR3.
>> +	 */
>> +	u64 cr3_ctrl_bits;
>>   	unsigned long cr4;
>>   	unsigned long cr4_guest_owned_bits;
>>   	unsigned long cr4_guest_rsvd_bits;
>> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
>> index b1658c0de847..ef8e1b912d7d 100644
>> --- a/arch/x86/kvm/cpuid.h
>> +++ b/arch/x86/kvm/cpuid.h
>> @@ -42,6 +42,11 @@ static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
>>   	return vcpu->arch.maxphyaddr;
>>   }
>>   
>> +static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>> +{
>> +	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
>> +}
>> +
>>   static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
>>   {
>>   	return !(gpa & vcpu->arch.reserved_gpa_bits);
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index 168c46fd8dd1..29985eeb8e12 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
>>   	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
>>   }
>>   
>> +static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
>> +{
>> +	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
>> +}
>> +
>>   static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
>>   {
>>   	u64 root_hpa = vcpu->arch.mmu->root.hpa;
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index c8ebe542c565..de2c51a0b611 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3732,7 +3732,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>>   	hpa_t root;
>>   
>>   	root_pgd = mmu->get_guest_pgd(vcpu);
>> -	root_gfn = root_pgd >> PAGE_SHIFT;
>> +	/*
>> +	* The guest PGD has already been checked for validity, unconditionally
>> +	* strip non-address bits when computing the GFN.
>> +	*/
>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>>   
>>   	if (mmu_check_root(vcpu, root_gfn))
>>   		return 1;
>> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
>> index cc58631e2336..c0479cbc2ca3 100644
>> --- a/arch/x86/kvm/mmu/mmu_internal.h
>> +++ b/arch/x86/kvm/mmu/mmu_internal.h
>> @@ -21,6 +21,7 @@ extern bool dbg;
>>   #endif
>>   
>>   /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
>> +#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>>   #define __PT_LEVEL_SHIFT(level, bits_per_level)	\
>>   	(PAGE_SHIFT + ((level) - 1) * (bits_per_level))
>>   #define __PT_INDEX(address, level, bits_per_level) \
>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>> index 57f0b75c80f9..88351ba04249 100644
>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>> @@ -62,7 +62,7 @@
>>   #endif
>>   
>>   /* Common logic, but per-type values.  These also need to be undefined. */
>> -#define PT_BASE_ADDR_MASK	((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
>> +#define PT_BASE_ADDR_MASK	((pt_element_t)__PT_BASE_ADDR_MASK)
>>   #define PT_LVL_ADDR_MASK(lvl)	__PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>>   #define PT_LVL_OFFSET_MASK(lvl)	__PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>>   #define PT_INDEX(addr, lvl)	__PT_INDEX(addr, lvl, PT_LEVEL_BITS)
>> @@ -324,6 +324,10 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>>   	trace_kvm_mmu_pagetable_walk(addr, access);
>>   retry_walk:
>>   	walker->level = mmu->cpu_role.base.level;
>> +	/*
>> +	 * No need to mask cr3_ctrl_bits, gpte_to_gfn() will strip
>> +	 * non-address bits.
>> +	 */
>>   	pte           = mmu->get_guest_pgd(vcpu);
>>   	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
>>   
>> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
>> index 1279db2eab44..777f7d443e3b 100644
>> --- a/arch/x86/kvm/mmu/spte.h
>> +++ b/arch/x86/kvm/mmu/spte.h
>> @@ -36,7 +36,7 @@ static_assert(SPTE_TDP_AD_ENABLED == 0);
>>   #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
>>   #define SPTE_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
>>   #else
>> -#define SPTE_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>> +#define SPTE_BASE_ADDR_MASK __PT_BASE_ADDR_MASK
>>   #endif
>>   
>>   #define SPTE_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 1bc2b80273c9..d35bda9610e2 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -1079,7 +1079,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
>>   			       bool nested_ept, bool reload_pdptrs,
>>   			       enum vm_entry_failure_code *entry_failure_code)
>>   {
>> -	if (CC(kvm_vcpu_is_illegal_gpa(vcpu, cr3))) {
>> +	if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3))) {
>>   		*entry_failure_code = ENTRY_FAIL_DEFAULT;
>>   		return -EINVAL;
>>   	}
>> @@ -2907,7 +2907,7 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
>>   
>>   	if (CC(!nested_host_cr0_valid(vcpu, vmcs12->host_cr0)) ||
>>   	    CC(!nested_host_cr4_valid(vcpu, vmcs12->host_cr4)) ||
>> -	    CC(kvm_vcpu_is_illegal_gpa(vcpu, vmcs12->host_cr3)))
>> +	    CC(!kvm_vcpu_is_legal_cr3(vcpu, vmcs12->host_cr3)))
>>   		return -EINVAL;
>>   
>>   	if (CC(is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu)) ||
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 42f163862a0f..4d329ee9474c 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -3388,7 +3388,8 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>>   			update_guest_cr3 = false;
>>   		vmx_ept_load_pdptrs(vcpu);
>>   	} else {
>> -		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu);
>> +		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu) |
>> +		            kvm_get_active_cr3_ctrl_bits(vcpu);
>>   	}
>>   
>>   	if (update_guest_cr3)
>> @@ -7763,6 +7764,9 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>   		vmx->msr_ia32_feature_control_valid_bits &=
>>   			~FEAT_CTL_SGX_LC_ENABLED;
>>   
>> +	if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
>> +		vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 | X86_CR3_LAM_U57;
>> +
>>   	/* Refresh #PF interception to account for MAXPHYADDR changes. */
>>   	vmx_update_exception_bitmap(vcpu);
>>   }
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 7713420abab0..aca255e69d0d 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1260,7 +1260,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>>   	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
>>   	 * the current vCPU mode is accurate.
>>   	 */
>> -	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
>> +	if (!kvm_vcpu_is_legal_cr3(vcpu, cr3))
>>   		return 1;
>>   
>>   	if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, cr3))
>> @@ -11349,7 +11349,7 @@ static bool kvm_is_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
>>   		 */
>>   		if (!(sregs->cr4 & X86_CR4_PAE) || !(sregs->efer & EFER_LMA))
>>   			return false;
>> -		if (kvm_vcpu_is_illegal_gpa(vcpu, sregs->cr3))
>> +		if (!kvm_vcpu_is_legal_cr3(vcpu, sregs->cr3))
>>   			return false;
>>   	} else {
>>   		/*

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable
  2023-04-06 13:20   ` Huang, Kai
@ 2023-04-10  3:35     ` Binbin Wu
  0 siblings, 0 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-10  3:35 UTC (permalink / raw)
  To: Huang, Kai, kvm, pbonzini, Christopherson,, Sean
  Cc: Guo, Xuelian, robert.hu, Gao, Chao


On 4/6/2023 9:20 PM, Huang, Kai wrote:
> On Tue, 2023-04-04 at 21:09 +0800, Binbin Wu wrote:
>>   	case VMX_VPID_EXTENT_INDIVIDUAL_ADDR:
>> +		/* invvpid is not valid in compatibility mode */
>> +		if (is_long_mode(vcpu))
>> +			operand.gla = vmx_untag_addr(vcpu, operand.gla, 0);
> This comment doesn't make sense.  The code does nothing to distinguish the
> compatibility mode and the 64-bit mode.

I was also hesitant when I added the comment.


>
> Now although we are all clear that here is_long_mode() basically equals to
> is_64_bit_mode(), but I do think we need a comment or WARN() _SOMEWHERE_ to
> indicate that compatibility mode is not possible when handling VMEXIT for VMX
> instructions (except VMCALL).  Not everyone will be able to notice this small
> thing in the SDM.

If the WARN() is preferred, IMO, it can be added to
nested_vmx_check_permission(), because that is called by all the handlers
that need the check except for handle_vmxon(); a WARN can be added to
handle_vmxon() separately.
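For instance, something along these lines (an untested sketch only,
assuming the current shape of nested_vmx_check_permission() and the
existing is_long_mode()/is_64_bit_mode() helpers):

	/*
	 * VM exits for VMX instructions (other than VMCALL) are never
	 * taken from compatibility mode, so long mode here implies
	 * 64-bit mode.
	 */
	WARN_ON_ONCE(is_long_mode(vcpu) && !is_64_bit_mode(vcpu));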


>
> Then you can just delete this comment here.
>
> Alternatively, for better readability actually I am thinking maybe we should
> just use is_64_bit_mode(), because those segments are cached by KVM anyway so I
> don't think there's measurable performance difference between is_long_mode() and
> is_64_bit_mode().

Agree.


>
> Sean, any comments?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-09 11:36     ` Binbin Wu
@ 2023-04-11 23:11       ` Huang, Kai
  0 siblings, 0 replies; 48+ messages in thread
From: Huang, Kai @ 2023-04-11 23:11 UTC (permalink / raw)
  To: kvm, pbonzini, Christopherson,, Sean, binbin.wu
  Cc: Guo, Xuelian, robert.hu, Gao, Chao

> > 
> > >    bits for the supported features.
> > > - Add kvm_get_active_cr3_ctrl_bits() to get the active control bits to form
> > >    a new guest CR3 (in vmx_load_mmu_pgd()).
> > KVM handles #PF for shadow MMU, and for TDP (EPT) there's also a case that KVM
> > will trap the #PF (see allow_smaller_maxphyaddr).  Do we need any handling to
> > the linear address in the #PF, i.e. stripping metadata off while walking page
> > table?
> > 
> > I guess it's already done automatically?  Anyway, IMO it would be better if you
> > can add one or two sentences in the changelog to clarify whether such handling
> > is needed, and if not, why.
> 
> LAM masking applies before paging, so the faulting linear address 
> doesn't contain the metadata.
> It has been mentioned in cover letter, but to be clear, I will add the 
> clarification in the changelog
> of patch 4.

Yes I think this will be helpful.  Thanks.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-04 13:09 ` [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57} Binbin Wu
  2023-04-06 12:57   ` Huang, Kai
@ 2023-04-12 11:58   ` Huang, Kai
  2023-04-13  1:36     ` Binbin Wu
  2023-04-17  7:24   ` Chao Gao
  2 siblings, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-04-12 11:58 UTC (permalink / raw)
  To: kvm, pbonzini, Christopherson,, Sean, binbin.wu
  Cc: Guo, Xuelian, robert.hu, Gao, Chao

> 
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -729,6 +729,12 @@ struct kvm_vcpu_arch {
>  	unsigned long cr0_guest_owned_bits;
>  	unsigned long cr2;
>  	unsigned long cr3;
> +	/*
> +	 * Bits in CR3 used to enable certain features. These bits are allowed
> +	 * to be set in CR3 when vCPU supports the features. When shadow paging
> +	 * is used, these bits should be kept as they are in the shadow CR3.
> +	 */

I don't quite follow the second sentence.  Not sure what "these bits
should be kept" means.

Those control bits are not the active bits in the guest's CR3 but all the
control bits that the guest is allowed to set in CR3.  And those bits depend
on the guest's CPUID, not on whether the guest is using shadow paging or not.

I think you can just remove the second sentence.

> +	u64 cr3_ctrl_bits;
>  	unsigned long cr4;
>  	unsigned long cr4_guest_owned_bits;
>  	unsigned long cr4_guest_rsvd_bits;
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index b1658c0de847..ef8e1b912d7d 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -42,6 +42,11 @@ static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
>  	return vcpu->arch.maxphyaddr;
>  }
>  
> +static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> +{
> +	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
> +}
> +
>  static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
>  {
>  	return !(gpa & vcpu->arch.reserved_gpa_bits);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 168c46fd8dd1..29985eeb8e12 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
>  	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
>  }
>  
> +static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
> +}
> +
>  static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
>  {
>  	u64 root_hpa = vcpu->arch.mmu->root.hpa;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8ebe542c565..de2c51a0b611 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3732,7 +3732,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	hpa_t root;
>  
>  	root_pgd = mmu->get_guest_pgd(vcpu);
> -	root_gfn = root_pgd >> PAGE_SHIFT;
> +	/*
> +	* The guest PGD has already been checked for validity, unconditionally
> +	* strip non-address bits when computing the GFN.
> +	*/

Don't quite follow this comment either.  Can we just say:

	/*
	 * Guest's PGD may contain additional control bits.  Mask them off
	 * to get the GFN.
	 */

This explains why there are "non-address bits" and why they need to be masked off.

> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;

Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this way, the
mmu_check_root() below may potentially catch other invalid bits, but in
practice there should be no difference, I guess.

>  
>  	if (mmu_check_root(vcpu, root_gfn))
>  		return 1;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index cc58631e2336..c0479cbc2ca3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -21,6 +21,7 @@ extern bool dbg;
>  #endif
>  
>  /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
> +#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>  #define __PT_LEVEL_SHIFT(level, bits_per_level)	\
>  	(PAGE_SHIFT + ((level) - 1) * (bits_per_level))
>  #define __PT_INDEX(address, level, bits_per_level) \
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 57f0b75c80f9..88351ba04249 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -62,7 +62,7 @@
>  #endif
>  
>  /* Common logic, but per-type values.  These also need to be undefined. */
> -#define PT_BASE_ADDR_MASK	((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
> +#define PT_BASE_ADDR_MASK	((pt_element_t)__PT_BASE_ADDR_MASK)
>  #define PT_LVL_ADDR_MASK(lvl)	__PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>  #define PT_LVL_OFFSET_MASK(lvl)	__PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>  #define PT_INDEX(addr, lvl)	__PT_INDEX(addr, lvl, PT_LEVEL_BITS)
> @@ -324,6 +324,10 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	trace_kvm_mmu_pagetable_walk(addr, access);
>  retry_walk:
>  	walker->level = mmu->cpu_role.base.level;
> +	/*
> +	 * No need to mask cr3_ctrl_bits, gpte_to_gfn() will strip
> +	 * non-address bits.
> +	 */

I guess it will be helpful if we actually call out that guest's PGD may contain
control bits here.

Also, I am not sure whether it's better to just explicitly mask control bits off
here.

>  	pte           = mmu->get_guest_pgd(vcpu);
>  	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
>  
> 

[...]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-12 11:58   ` Huang, Kai
@ 2023-04-13  1:36     ` Binbin Wu
  2023-04-13  2:27       ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: Binbin Wu @ 2023-04-13  1:36 UTC (permalink / raw)
  To: Huang, Kai, kvm, pbonzini, Christopherson,, Sean
  Cc: Guo, Xuelian, robert.hu, Gao, Chao


On 4/12/2023 7:58 PM, Huang, Kai wrote:
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -729,6 +729,12 @@ struct kvm_vcpu_arch {
>>   	unsigned long cr0_guest_owned_bits;
>>   	unsigned long cr2;
>>   	unsigned long cr3;
>> +	/*
>> +	 * Bits in CR3 used to enable certain features. These bits are allowed
>> +	 * to be set in CR3 when vCPU supports the features. When shadow paging
>> +	 * is used, these bits should be kept as they are in the shadow CR3.
>> +	 */
> I don't quite follow the second sentence.  Not sure what does "these bits should
> be kept" mean.
>
> Those control bits are not active bits in guest's CR3 but all control bits that
> guest is allowed to set to CR3. And those bits depends on guest's CPUID but not
> whether guest is using shadow paging or not.
>
> I think you can just remove the second sentence.

Yes, you are right. The second sentence is confusing.

How about this:

+	/*
+	 * Bits in CR3 used to enable certain features. These bits are allowed
+	 * to be set in CR3 when vCPU supports the features, and they are used
+	 * as the mask to get the active control bits to form a new guest CR3.
+	 */


>
>> +	u64 cr3_ctrl_bits;
>>   	unsigned long cr4;
>>   	unsigned long cr4_guest_owned_bits;
>>   	unsigned long cr4_guest_rsvd_bits;
>> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
>> index b1658c0de847..ef8e1b912d7d 100644
>> --- a/arch/x86/kvm/cpuid.h
>> +++ b/arch/x86/kvm/cpuid.h
>> @@ -42,6 +42,11 @@ static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
>>   	return vcpu->arch.maxphyaddr;
>>   }
>>   
>> +static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>> +{
>> +	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
>> +}
>> +
>>   static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
>>   {
>>   	return !(gpa & vcpu->arch.reserved_gpa_bits);
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index 168c46fd8dd1..29985eeb8e12 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
>>   	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
>>   }
>>   
>> +static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
>> +{
>> +	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
>> +}
>> +
>>   static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
>>   {
>>   	u64 root_hpa = vcpu->arch.mmu->root.hpa;
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index c8ebe542c565..de2c51a0b611 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3732,7 +3732,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>>   	hpa_t root;
>>   
>>   	root_pgd = mmu->get_guest_pgd(vcpu);
>> -	root_gfn = root_pgd >> PAGE_SHIFT;
>> +	/*
>> +	* The guest PGD has already been checked for validity, unconditionally
>> +	* strip non-address bits when computing the GFN.
>> +	*/

The comment here is to call out that the non-address bit(s) that may be set
have already been checked for legality, so it is safe to strip these bits.


> Don't quite follow this comment either.  Can we just say:
>
> 	/*
> 	 * Guest's PGD may contain additional control bits.  Mask them off
> 	 * to get the GFN.
> 	 */
>
> Which explains why it has "non-address bits" and needs mask off?

How about merging this comment?


>
>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
> Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this way, below
> mmu_check_root() may potentially catch other invalid bits, but in practice there
> should be no difference I guess.

In the previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.

However, Sean pointed out that the return value of
mmu->get_guest_pgd(vcpu) could be an EPTP for the nested case, so it is
not rational to mask off CR3 bit(s) from an EPTP.

Since the guest PGD has been checked for validity, for both CR3 and EPTP,
it is safe to mask off non-address bits to get the GFN.

Maybe I should add this CR3 vs. EPTP part to the changelog to make it
more understandable.
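To spell out the two options being compared (sketch only, mirroring the
hunk above):

	/*
	 * Masking off only the CR3 control bits ties CR3 semantics to what
	 * may actually be an EPTP in the nested case:
	 */
	root_gfn = (root_pgd & ~vcpu->arch.cr3_ctrl_bits) >> PAGE_SHIFT;

	/* Keeping only the address bits works for both guest CR3 and EPTP. */
	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;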



>
>>   
>>   	if (mmu_check_root(vcpu, root_gfn))
>>   		return 1;
>> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
>> index cc58631e2336..c0479cbc2ca3 100644
>> --- a/arch/x86/kvm/mmu/mmu_internal.h
>> +++ b/arch/x86/kvm/mmu/mmu_internal.h
>> @@ -21,6 +21,7 @@ extern bool dbg;
>>   #endif
>>   
>>   /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
>> +#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>>   #define __PT_LEVEL_SHIFT(level, bits_per_level)	\
>>   	(PAGE_SHIFT + ((level) - 1) * (bits_per_level))
>>   #define __PT_INDEX(address, level, bits_per_level) \
>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>> index 57f0b75c80f9..88351ba04249 100644
>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>> @@ -62,7 +62,7 @@
>>   #endif
>>   
>>   /* Common logic, but per-type values.  These also need to be undefined. */
>> -#define PT_BASE_ADDR_MASK	((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
>> +#define PT_BASE_ADDR_MASK	((pt_element_t)__PT_BASE_ADDR_MASK)
>>   #define PT_LVL_ADDR_MASK(lvl)	__PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>>   #define PT_LVL_OFFSET_MASK(lvl)	__PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>>   #define PT_INDEX(addr, lvl)	__PT_INDEX(addr, lvl, PT_LEVEL_BITS)
>> @@ -324,6 +324,10 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>>   	trace_kvm_mmu_pagetable_walk(addr, access);
>>   retry_walk:
>>   	walker->level = mmu->cpu_role.base.level;
>> +	/*
>> +	 * No need to mask cr3_ctrl_bits, gpte_to_gfn() will strip
>> +	 * non-address bits.
>> +	 */
> I guess it will be helpful if we actually call out that guest's PGD may contain
> control bits here.

OK.


>
> Also, I am not sure whether it's better to just explicitly mask control bits off
> here.

Same reason as mentioned above.


>
>>   	pte           = mmu->get_guest_pgd(vcpu);
>>   	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
>>   
>>
> [...]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-13  1:36     ` Binbin Wu
@ 2023-04-13  2:27       ` Huang, Kai
  2023-04-13  4:45         ` Binbin Wu
  0 siblings, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-04-13  2:27 UTC (permalink / raw)
  To: kvm, pbonzini, Christopherson,, Sean, binbin.wu
  Cc: Guo, Xuelian, robert.hu, Gao, Chao

On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
> On 4/12/2023 7:58 PM, Huang, Kai wrote:
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -729,6 +729,12 @@ struct kvm_vcpu_arch {
> > >   	unsigned long cr0_guest_owned_bits;
> > >   	unsigned long cr2;
> > >   	unsigned long cr3;
> > > +	/*
> > > +	 * Bits in CR3 used to enable certain features. These bits are allowed
> > > +	 * to be set in CR3 when vCPU supports the features. When shadow paging
> > > +	 * is used, these bits should be kept as they are in the shadow CR3.
> > > +	 */
> > I don't quite follow the second sentence.  Not sure what does "these bits should
> > be kept" mean.
> > 
> > Those control bits are not active bits in guest's CR3 but all control bits that
> > guest is allowed to set to CR3. And those bits depends on guest's CPUID but not
> > whether guest is using shadow paging or not.
> > 
> > I think you can just remove the second sentence.
> 
> Yes, you are right. The second sentenc is confusing.
> 
> How about this:
> 
> +	/*
> +	 * Bits in CR3 used to enable certain features. These bits are allowed
> +	 * to be set in CR3 when vCPU supports the features, and they are used
> +	 * as the mask to get the active control bits to form a new guest CR3.
> +	 */
> 

Fine with me, but IMHO it can be even simpler, for instance:

	/*
	 * CR3 non-address feature control bits.  Guest's CR3 may contain
	 * any of those bits at runtime.
	 */

> 
> > 
> > > +	u64 cr3_ctrl_bits;
> > >   	unsigned long cr4;
> > >   	unsigned long cr4_guest_owned_bits;
> > >   	unsigned long cr4_guest_rsvd_bits;
> > > diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> > > index b1658c0de847..ef8e1b912d7d 100644
> > > --- a/arch/x86/kvm/cpuid.h
> > > +++ b/arch/x86/kvm/cpuid.h
> > > @@ -42,6 +42,11 @@ static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
> > >   	return vcpu->arch.maxphyaddr;
> > >   }
> > >   
> > > +static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> > > +{
> > > +	return !((cr3 & vcpu->arch.reserved_gpa_bits) & ~vcpu->arch.cr3_ctrl_bits);
> > > +}
> > > +
> > >   static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
> > >   {
> > >   	return !(gpa & vcpu->arch.reserved_gpa_bits);
> > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > > index 168c46fd8dd1..29985eeb8e12 100644
> > > --- a/arch/x86/kvm/mmu.h
> > > +++ b/arch/x86/kvm/mmu.h
> > > @@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
> > >   	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
> > >   }
> > >   
> > > +static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
> > > +{
> > > +	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
> > > +}
> > > +
> > >   static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
> > >   {
> > >   	u64 root_hpa = vcpu->arch.mmu->root.hpa;
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index c8ebe542c565..de2c51a0b611 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3732,7 +3732,11 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> > >   	hpa_t root;
> > >   
> > >   	root_pgd = mmu->get_guest_pgd(vcpu);
> > > -	root_gfn = root_pgd >> PAGE_SHIFT;
> > > +	/*
> > > +	* The guest PGD has already been checked for validity, unconditionally
> > > +	* strip non-address bits when computing the GFN.
> > > +	*/
> 
> The comment here is to call out that the set non-address bit(s) have 
> been checked for legality
> before, it is safe to strip these bits.
> 
> 
> > Don't quite follow this comment either.  Can we just say:
> > 
> > 	/*
> > 	 * Guest's PGD may contain additional control bits.  Mask them off
> > 	 * to get the GFN.
> > 	 */
> > 
> > Which explains why it has "non-address bits" and needs mask off?
> 
> How about merge this comment?

No strong opinion.  

> 
> 
> > 
> > > +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
> > Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this way, below
> > mmu_check_root() may potentially catch other invalid bits, but in practice there
> > should be no difference I guess.
> 
> In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
> 
> However, Sean pointed out that the return value of 
> mmu->get_guest_pgd(vcpu) could be
> EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.

Yes, although EPTP's high bits don't contain any control bits.

But perhaps we want to make it future-proof in case some more control bits are
added to EPTP too.

> 
> Since the guest pgd has been check for valadity, for both CR3 and EPTP, 
> it is safe to mask off
> non-address bits to get GFN.
> 
> Maybe I should add this CR3 VS. EPTP part to the changelog to make it 
> more undertandable.

This isn't necessary, and can/should be done in comments if needed.

But IMHO you may want to add more material to explain how nested cases are
handled.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-13  2:27       ` Huang, Kai
@ 2023-04-13  4:45         ` Binbin Wu
  2023-04-13  9:13           ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: Binbin Wu @ 2023-04-13  4:45 UTC (permalink / raw)
  To: Huang, Kai, kvm, pbonzini, Christopherson,, Sean
  Cc: Guo, Xuelian, robert.hu, Gao, Chao


On 4/13/2023 10:27 AM, Huang, Kai wrote:
> On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
>> On 4/12/2023 7:58 PM, Huang, Kai wrote:
>>
...
>>>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>>> Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this way, below
>>> mmu_check_root() may potentially catch other invalid bits, but in practice there
>>> should be no difference I guess.
>> In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
>>
>> However, Sean pointed out that the return value of
>> mmu->get_guest_pgd(vcpu) could be
>> EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
> Yes, although EPTP's high bits don't contain any control bits.
>
> But perhaps we want to make it future-proof in case some more control bits are
> added to EPTP too.
>
>> Since the guest pgd has been check for valadity, for both CR3 and EPTP,
>> it is safe to mask off
>> non-address bits to get GFN.
>>
>> Maybe I should add this CR3 VS. EPTP part to the changelog to make it
>> more undertandable.
> This isn't necessary, and can/should be done in comments if needed.
>
> But IMHO you may want to add more material to explain how nested cases are
> handled.

Do you mean about CR3 or others?



^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-13  4:45         ` Binbin Wu
@ 2023-04-13  9:13           ` Huang, Kai
  2023-04-21  6:35             ` Binbin Wu
  0 siblings, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-04-13  9:13 UTC (permalink / raw)
  To: Binbin Wu, kvm, pbonzini, Christopherson,, Sean
  Cc: Guo, Xuelian, robert.hu, Gao, Chao

> On 4/13/2023 10:27 AM, Huang, Kai wrote:
> > On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
> >> On 4/12/2023 7:58 PM, Huang, Kai wrote:
> >>
> ...
> >>>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
> >>> Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
> >>> way, below
> >>> mmu_check_root() may potentially catch other invalid bits, but in
> >>> practice there should be no difference I guess.
> >> In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
> >>
> >> However, Sean pointed out that the return value of
> >> mmu->get_guest_pgd(vcpu) could be
> >> EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
> > Yes, although EPTP's high bits don't contain any control bits.
> >
> > But perhaps we want to make it future-proof in case some more control
> > bits are added to EPTP too.
> >
> >> Since the guest pgd has been check for valadity, for both CR3 and
> >> EPTP, it is safe to mask off non-address bits to get GFN.
> >>
> >> Maybe I should add this CR3 VS. EPTP part to the changelog to make it
> >> more undertandable.
> > This isn't necessary, and can/should be done in comments if needed.
> >
> > But IMHO you may want to add more material to explain how nested cases
> > are handled.
> 
> Do you mean about CR3 or others?
> 

This patch is about CR3, so CR3.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-04 13:09 ` [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57} Binbin Wu
  2023-04-06 12:57   ` Huang, Kai
  2023-04-12 11:58   ` Huang, Kai
@ 2023-04-17  7:24   ` Chao Gao
  2023-04-17  8:02     ` Binbin Wu
  2 siblings, 1 reply; 48+ messages in thread
From: Chao Gao @ 2023-04-17  7:24 UTC (permalink / raw)
  To: Binbin Wu; +Cc: kvm, seanjc, pbonzini, kai.huang, xuelian.guo, robert.hu

On Tue, Apr 04, 2023 at 09:09:20PM +0800, Binbin Wu wrote:
> /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
>+#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))

This is an open-coded GENMASK_ULL(). So, you'd better use GENMASK_ULL() here.

>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -1260,7 +1260,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> 	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
> 	 * the current vCPU mode is accurate.
> 	 */
>-	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
>+	if (!kvm_vcpu_is_legal_cr3(vcpu, cr3))

I prefer to modify the call sites in SVM nested code to use the new
function. Although this change does not affect functionality, it
provides a clear distinction between CR3 checks and GPA checks.
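For example, in the nested VMCB save-area checks in svm/nested.c (a sketch
only, assuming the existing check uses kvm_vcpu_is_illegal_gpa() on
save->cr3):

-		    CC(kvm_vcpu_is_illegal_gpa(vcpu, save->cr3)))
+		    CC(!kvm_vcpu_is_legal_cr3(vcpu, save->cr3)))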

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-17  7:24   ` Chao Gao
@ 2023-04-17  8:02     ` Binbin Wu
  0 siblings, 0 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-17  8:02 UTC (permalink / raw)
  To: Chao Gao; +Cc: kvm, seanjc, pbonzini, kai.huang, xuelian.guo, robert.hu


On 4/17/2023 3:24 PM, Chao Gao wrote:
> On Tue, Apr 04, 2023 at 09:09:20PM +0800, Binbin Wu wrote:
>> /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
>> +#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
> This is an open-coded GENMASK_ULL(). So, you'd better use GENMASK_ULL() here.

This is basically a code move and rename from PT_BASE_ADDR_MASK to
__PT_BASE_ADDR_MASK.
I didn't change the original code, but if it is preferred to use
GENMASK_ULL() in kernel/KVM, I can change it as follows:

#define __PT_BASE_ADDR_MASK GENMASK_ULL(51, 12)
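A compile-time check (sketch, assuming the usual 4K PAGE_SIZE) confirms
the two forms are identical:

	BUILD_BUG_ON(GENMASK_ULL(51, 12) !=
		     (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE - 1)));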


>
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1260,7 +1260,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>> 	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
>> 	 * the current vCPU mode is accurate.
>> 	 */
>> -	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
>> +	if (!kvm_vcpu_is_legal_cr3(vcpu, cr3))
> I prefer to modify the call sites in SVM nested code to use the new
> function. Although this change does not affect functionality, it
> provides a clear distinction between CR3 checks and GPA checks.

Makes sense, will do it.



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops
  2023-04-04 13:09 ` [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops Binbin Wu
@ 2023-04-18  3:08   ` Zeng Guang
  2023-04-18  3:34     ` Binbin Wu
  2023-04-19  2:30   ` Chao Gao
  1 sibling, 1 reply; 48+ messages in thread
From: Zeng Guang @ 2023-04-18  3:08 UTC (permalink / raw)
  To: Binbin Wu, kvm, Christopherson,, Sean, pbonzini
  Cc: Huang, Kai, Gao, Chao, Guo, Xuelian, robert.hu


On 4/4/2023 9:09 PM, Binbin Wu wrote:
> Introduce a new interface untag_addr() to kvm_x86_ops to untag the metadata
> from linear address. Implement LAM version in VMX and dummy version in SVM.
>
> When enabled feature like Intel Linear Address Masking or AMD Upper
> Address Ignore, linear address may be tagged with metadata. Linear
> address should be checked for modified canonicality and untagged in
> instrution emulations or vmexit handlings if LAM or UAI is applicable.
>
> Introduce untag_addr() to kvm_x86_ops to hide the code related to vendor
> specific details.
> - For VMX, LAM version is implemented.
>    LAM has a modified canonical check when applicable:
>    * LAM_S48                : [ 1 ][ metadata ][ 1 ]
>                                 63               47
>    * LAM_U48                : [ 0 ][ metadata ][ 0 ]
>                                 63               47
>    * LAM_S57                : [ 1 ][ metadata ][ 1 ]
>                                 63               56
>    * LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
>                                 63               56
>    * LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
>                                 63               56..47
>    If LAM is applicable to certain address, untag the metadata bits and
>    replace them with the value of bit 47 (LAM48) or bit 56 (LAM57). Later
>    the untagged address will do legacy canonical check. So that LAM canonical
>    check and mask can be covered by "untag + legacy canonical check".
>
>    For cases LAM is not applicable, 'flags' is passed to the interface
>    to skip untag.
>
> - For SVM, add a dummy version to do nothing, but return the original
>    address.
>
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> Tested-by: Xuelian Guo <xuelian.guo@intel.com>
> ---
>   arch/x86/include/asm/kvm-x86-ops.h |  1 +
>   arch/x86/include/asm/kvm_host.h    |  5 +++
>   arch/x86/kvm/svm/svm.c             |  7 ++++
>   arch/x86/kvm/vmx/vmx.c             | 60 ++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/vmx.h             |  2 +
>   5 files changed, 75 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 8dc345cc6318..7d63d1b942ac 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -52,6 +52,7 @@ KVM_X86_OP(cache_reg)
>   KVM_X86_OP(get_rflags)
>   KVM_X86_OP(set_rflags)
>   KVM_X86_OP(get_if_flag)
> +KVM_X86_OP(untag_addr)
>   KVM_X86_OP(flush_tlb_all)
>   KVM_X86_OP(flush_tlb_current)
>   KVM_X86_OP_OPTIONAL(tlb_remote_flush)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 498d2b5e8dc1..cb674ec826d4 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -69,6 +69,9 @@
>   #define KVM_X86_NOTIFY_VMEXIT_VALID_BITS	(KVM_X86_NOTIFY_VMEXIT_ENABLED | \
>   						 KVM_X86_NOTIFY_VMEXIT_USER)
>   
> +/* flags for kvm_x86_ops::untag_addr() */
> +#define KVM_X86_UNTAG_ADDR_SKIP_LAM	_BITULL(0)
> +

I'd prefer to make the definition and comments generic.
How about:
     /* x86-specific emulation flags */
     #define KVM_X86_EMULFLAG_SKIP_LAM_UNTAG_ADDR _BITULL(0)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable
  2023-04-04 13:09 ` [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable Binbin Wu
  2023-04-06 13:20   ` Huang, Kai
@ 2023-04-18  3:28   ` Zeng Guang
  2023-04-18  3:38     ` Binbin Wu
  2023-04-19  6:43   ` Chao Gao
  2 siblings, 1 reply; 48+ messages in thread
From: Zeng Guang @ 2023-04-18  3:28 UTC (permalink / raw)
  To: Binbin Wu, kvm, Christopherson,, Sean, pbonzini
  Cc: Huang, Kai, Gao, Chao, Guo, Xuelian, robert.hu


On 4/4/2023 9:09 PM, Binbin Wu wrote:
> Untag address for 64-bit memory/mmio operand in instruction emulations
> and vmexit handlers when LAM is applicable.
>
> For instruction emulation, untag address in __linearize() before
> canonical check. LAM doesn't apply to instruction fetch and invlpg,
> use KVM_X86_UNTAG_ADDR_SKIP_LAM to skip LAM untag.
>
> For vmexit handlings related to 64-bit linear address:
> - Cases need to untag address
>    Operand(s) of VMX instructions and INVPCID
>    Operand(s) of SGX ENCLS
>    Linear address in INVVPID descriptor.
> - Cases LAM doesn't apply to (no change needed)
>    Operand of INVLPG
>    Linear address in INVPCID descriptor
>
> Co-developed-by: Robert Hoo <robert.hu@linux.intel.com>
> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> Tested-by: Xuelian Guo <xuelian.guo@intel.com>
> ---
>   arch/x86/kvm/emulate.c     | 23 ++++++++++++++++++-----
>   arch/x86/kvm/kvm_emulate.h |  2 ++
>   arch/x86/kvm/vmx/nested.c  |  4 ++++
>   arch/x86/kvm/vmx/sgx.c     |  1 +
>   arch/x86/kvm/x86.c         | 10 ++++++++++
>   5 files changed, 35 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index a20bec931764..b7df465eccf2 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -688,7 +688,8 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
>   				       struct segmented_address addr,
>   				       unsigned *max_size, unsigned size,
>   				       bool write, bool fetch,
> -				       enum x86emul_mode mode, ulong *linear)
> +				       enum x86emul_mode mode, ulong *linear,
> +				       u64 untag_flags)

IMO, this should be "u64 flags" instead of "u64 untag_flags".  The emulator
can use it as a combination of flags for other purposes.


>   {
>   	struct desc_struct desc;
>   	bool usable;
> @@ -701,6 +702,7 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
>   	*max_size = 0;
>   	switch (mode) {
>   	case X86EMUL_MODE_PROT64:
> +		la = ctxt->ops->untag_addr(ctxt, la, untag_flags);
>   		*linear = la;
>   		va_bits = ctxt_virt_addr_bits(ctxt);
>   		if (!__is_canonical_address(la, va_bits))
> @@ -758,7 +760,7 @@ static int linearize(struct x86_emulate_ctxt *ctxt,
>   {
>   	unsigned max_size;
>   	return __linearize(ctxt, addr, &max_size, size, write, false,
> -			   ctxt->mode, linear);
> +			   ctxt->mode, linear, 0);
>   }
>   
>   static inline int assign_eip(struct x86_emulate_ctxt *ctxt, ulong dst)
> @@ -771,7 +773,12 @@ static inline int assign_eip(struct x86_emulate_ctxt *ctxt, ulong dst)
>   
>   	if (ctxt->op_bytes != sizeof(unsigned long))
>   		addr.ea = dst & ((1UL << (ctxt->op_bytes << 3)) - 1);
> -	rc = __linearize(ctxt, addr, &max_size, 1, false, true, ctxt->mode, &linear);
> +	/*
> +	 * LAM does not apply to addresses used for instruction fetches
> +	 * or to those that specify the targets of jump and call instructions
> +	 */

This API handles the target address of branch and call instructions.  I
think it is enough to explain only that exact case.

> +	rc = __linearize(ctxt, addr, &max_size, 1, false, true, ctxt->mode,
> +	                 &linear, KVM_X86_UNTAG_ADDR_SKIP_LAM);
>   	if (rc == X86EMUL_CONTINUE)
>   		ctxt->_eip = addr.ea;
>   	return rc;
> @@ -906,9 +913,12 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
>   	 * __linearize is called with size 0 so that it does not do any
>   	 * boundary check itself.  Instead, we use max_size to check
>   	 * against op_size.
> +	 *
> +	 * LAM does not apply to addresses used for instruction fetches
> +	 * or to those that specify the targets of jump and call instructions

Ditto.

>   	 */
>   	rc = __linearize(ctxt, addr, &max_size, 0, false, true, ctxt->mode,
> -			 &linear);
> +			 &linear, KVM_X86_UNTAG_ADDR_SKIP_LAM);
>   	if (unlikely(rc != X86EMUL_CONTINUE))
>   		return rc;
>   
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops
  2023-04-18  3:08   ` Zeng Guang
@ 2023-04-18  3:34     ` Binbin Wu
  0 siblings, 0 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-18  3:34 UTC (permalink / raw)
  To: Zeng Guang, kvm, Christopherson,, Sean, pbonzini
  Cc: Huang, Kai, Gao, Chao, Guo, Xuelian, robert.hu


On 4/18/2023 11:08 AM, Zeng Guang wrote:
>
> On 4/4/2023 9:09 PM, Binbin Wu wrote:
>> Introduce a new interface untag_addr() to kvm_x86_ops to untag the 
>> metadata
>> from linear address. Implement LAM version in VMX and dummy version 
>> in SVM.
>>
>> When enabled feature like Intel Linear Address Masking or AMD Upper
>> Address Ignore, linear address may be tagged with metadata. Linear
>> address should be checked for modified canonicality and untagged in
>> instrution emulations or vmexit handlings if LAM or UAI is applicable.
>>
>> Introduce untag_addr() to kvm_x86_ops to hide the code related to vendor
>> specific details.
>> - For VMX, LAM version is implemented.
>>    LAM has a modified canonical check when applicable:
>>    * LAM_S48                : [ 1 ][ metadata ][ 1 ]
>>                                 63               47
>>    * LAM_U48                : [ 0 ][ metadata ][ 0 ]
>>                                 63               47
>>    * LAM_S57                : [ 1 ][ metadata ][ 1 ]
>>                                 63               56
>>    * LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
>>                                 63               56
>>    * LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
>>                                 63               56..47
>>    If LAM is applicable to certain address, untag the metadata bits and
>>    replace them with the value of bit 47 (LAM48) or bit 56 (LAM57). 
>> Later
>>    the untagged address will do legacy canonical check. So that LAM 
>> canonical
>>    check and mask can be covered by "untag + legacy canonical check".
>>
>>    For cases LAM is not applicable, 'flags' is passed to the interface
>>    to skip untag.
>>
>> - For SVM, add a dummy version to do nothing, but return the original
>>    address.
>>
>> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>> Tested-by: Xuelian Guo <xuelian.guo@intel.com>
>> ---
>>   arch/x86/include/asm/kvm-x86-ops.h |  1 +
>>   arch/x86/include/asm/kvm_host.h    |  5 +++
>>   arch/x86/kvm/svm/svm.c             |  7 ++++
>>   arch/x86/kvm/vmx/vmx.c             | 60 ++++++++++++++++++++++++++++++
>>   arch/x86/kvm/vmx/vmx.h             |  2 +
>>   5 files changed, 75 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
>> b/arch/x86/include/asm/kvm-x86-ops.h
>> index 8dc345cc6318..7d63d1b942ac 100644
>> --- a/arch/x86/include/asm/kvm-x86-ops.h
>> +++ b/arch/x86/include/asm/kvm-x86-ops.h
>> @@ -52,6 +52,7 @@ KVM_X86_OP(cache_reg)
>>   KVM_X86_OP(get_rflags)
>>   KVM_X86_OP(set_rflags)
>>   KVM_X86_OP(get_if_flag)
>> +KVM_X86_OP(untag_addr)
>>   KVM_X86_OP(flush_tlb_all)
>>   KVM_X86_OP(flush_tlb_current)
>>   KVM_X86_OP_OPTIONAL(tlb_remote_flush)
>> diff --git a/arch/x86/include/asm/kvm_host.h 
>> b/arch/x86/include/asm/kvm_host.h
>> index 498d2b5e8dc1..cb674ec826d4 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -69,6 +69,9 @@
>>   #define KVM_X86_NOTIFY_VMEXIT_VALID_BITS 
>> (KVM_X86_NOTIFY_VMEXIT_ENABLED | \
>>                            KVM_X86_NOTIFY_VMEXIT_USER)
>>   +/* flags for kvm_x86_ops::untag_addr() */
>> +#define KVM_X86_UNTAG_ADDR_SKIP_LAM    _BITULL(0)
>> +
>
> Prefer to make definition and comments to be generic.
> How about:
>     /* x86-specific emulation flags */
>     #define KVM_X86_EMULFLAG_SKIP_LAM_UNTAG_ADDR _BITULL(0)

You mean make the flag definitions not dedicated to LAM, so that they can
also be used for other features?

AFAIK, LASS may also need to define similar flags for instruction 
emulation and VMExit handling.

For me, I'd like to define it as:

     /* x86-specific emulation flags */
     #define KVM_X86_EMULFLAG_SKIP_LAM    _BITULL(0)

since IMO SKIP_LAM already describes the purpose clearly.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable
  2023-04-18  3:28   ` Zeng Guang
@ 2023-04-18  3:38     ` Binbin Wu
  0 siblings, 0 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-18  3:38 UTC (permalink / raw)
  To: Zeng Guang, kvm, Christopherson,, Sean, pbonzini
  Cc: Huang, Kai, Gao, Chao, Guo, Xuelian, robert.hu


On 4/18/2023 11:28 AM, Zeng Guang wrote:
>
> On 4/4/2023 9:09 PM, Binbin Wu wrote:
>> Untag address for 64-bit memory/mmio operand in instruction emulations
>> and vmexit handlers when LAM is applicable.
>>
>> For instruction emulation, untag address in __linearize() before
>> canonical check. LAM doesn't apply to instruction fetch and invlpg,
>> use KVM_X86_UNTAG_ADDR_SKIP_LAM to skip LAM untag.
>>
>> For vmexit handlings related to 64-bit linear address:
>> - Cases need to untag address
>>    Operand(s) of VMX instructions and INVPCID
>>    Operand(s) of SGX ENCLS
>>    Linear address in INVVPID descriptor.
>> - Cases LAM doesn't apply to (no change needed)
>>    Operand of INVLPG
>>    Linear address in INVPCID descriptor
>>
>> Co-developed-by: Robert Hoo <robert.hu@linux.intel.com>
>> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
>> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>> Tested-by: Xuelian Guo <xuelian.guo@intel.com>
>> ---
>>   arch/x86/kvm/emulate.c     | 23 ++++++++++++++++++-----
>>   arch/x86/kvm/kvm_emulate.h |  2 ++
>>   arch/x86/kvm/vmx/nested.c  |  4 ++++
>>   arch/x86/kvm/vmx/sgx.c     |  1 +
>>   arch/x86/kvm/x86.c         | 10 ++++++++++
>>   5 files changed, 35 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>> index a20bec931764..b7df465eccf2 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -688,7 +688,8 @@ static __always_inline int __linearize(struct 
>> x86_emulate_ctxt *ctxt,
>>                          struct segmented_address addr,
>>                          unsigned *max_size, unsigned size,
>>                          bool write, bool fetch,
>> -                       enum x86emul_mode mode, ulong *linear)
>> +                       enum x86emul_mode mode, ulong *linear,
>> +                       u64 untag_flags)
>
> IMO, here should be "u64 flags" instead of "u64 untag_flags". Emulator 
> can
> use it as flag combination for other purpose.

Yes, it makes sense, together with the advice you gave in patch 3.


>
>
>>   {
>>       struct desc_struct desc;
>>       bool usable;
>> @@ -701,6 +702,7 @@ static __always_inline int __linearize(struct 
>> x86_emulate_ctxt *ctxt,
>>       *max_size = 0;
>>       switch (mode) {
>>       case X86EMUL_MODE_PROT64:
>> +        la = ctxt->ops->untag_addr(ctxt, la, untag_flags);
>>           *linear = la;
>>           va_bits = ctxt_virt_addr_bits(ctxt);
>>           if (!__is_canonical_address(la, va_bits))
>> @@ -758,7 +760,7 @@ static int linearize(struct x86_emulate_ctxt *ctxt,
>>   {
>>       unsigned max_size;
>>       return __linearize(ctxt, addr, &max_size, size, write, false,
>> -               ctxt->mode, linear);
>> +               ctxt->mode, linear, 0);
>>   }
>>     static inline int assign_eip(struct x86_emulate_ctxt *ctxt, ulong 
>> dst)
>> @@ -771,7 +773,12 @@ static inline int assign_eip(struct 
>> x86_emulate_ctxt *ctxt, ulong dst)
>>         if (ctxt->op_bytes != sizeof(unsigned long))
>>           addr.ea = dst & ((1UL << (ctxt->op_bytes << 3)) - 1);
>> -    rc = __linearize(ctxt, addr, &max_size, 1, false, true, 
>> ctxt->mode, &linear);
>> +    /*
>> +     * LAM does not apply to addresses used for instruction fetches
>> +     * or to those that specify the targets of jump and call 
>> instructions
>> +     */
>
> This api handles the target address of branch and call instructions. I 
> think it enough to only explain the exact case.


OK, will make it specific.

>
>> +    rc = __linearize(ctxt, addr, &max_size, 1, false, true, ctxt->mode,
>> +                     &linear, KVM_X86_UNTAG_ADDR_SKIP_LAM);
>>       if (rc == X86EMUL_CONTINUE)
>>           ctxt->_eip = addr.ea;
>>       return rc;
>> @@ -906,9 +913,12 @@ static int __do_insn_fetch_bytes(struct 
>> x86_emulate_ctxt *ctxt, int op_size)
>>        * __linearize is called with size 0 so that it does not do any
>>        * boundary check itself.  Instead, we use max_size to check
>>        * against op_size.
>> +     *
>> +     * LAM does not apply to addresses used for instruction fetches
>> +     * or to those that specify the targets of jump and call 
>> instructions
>
> Ditto.
>
>>        */
>>       rc = __linearize(ctxt, addr, &max_size, 0, false, true, 
>> ctxt->mode,
>> -             &linear);
>> +             &linear, KVM_X86_UNTAG_ADDR_SKIP_LAM);
>>       if (unlikely(rc != X86EMUL_CONTINUE))
>>           return rc;
>>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops
  2023-04-04 13:09 ` [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops Binbin Wu
  2023-04-18  3:08   ` Zeng Guang
@ 2023-04-19  2:30   ` Chao Gao
  2023-04-19  3:08     ` Binbin Wu
  1 sibling, 1 reply; 48+ messages in thread
From: Chao Gao @ 2023-04-19  2:30 UTC (permalink / raw)
  To: Binbin Wu; +Cc: kvm, seanjc, pbonzini, kai.huang, xuelian.guo, robert.hu

On Tue, Apr 04, 2023 at 09:09:21PM +0800, Binbin Wu wrote:
>Introduce a new interface untag_addr() to kvm_x86_ops to untag the metadata
>from linear address. Implement LAM version in VMX and dummy version in SVM.
>
>When enabled feature like Intel Linear Address Masking or AMD Upper
>Address Ignore, linear address may be tagged with metadata. Linear
>address should be checked for modified canonicality and untagged in
>instrution emulations or vmexit handlings if LAM or UAI is applicable.
>
>Introduce untag_addr() to kvm_x86_ops to hide the code related to vendor
>specific details.
>- For VMX, LAM version is implemented.
>  LAM has a modified canonical check when applicable:
>  * LAM_S48                : [ 1 ][ metadata ][ 1 ]
>                               63               47
>  * LAM_U48                : [ 0 ][ metadata ][ 0 ]
>                               63               47
>  * LAM_S57                : [ 1 ][ metadata ][ 1 ]
>                               63               56
>  * LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
>                               63               56
>  * LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
>                               63               56..47
>  If LAM is applicable to certain address, untag the metadata bits and
>  replace them with the value of bit 47 (LAM48) or bit 56 (LAM57). Later
>  the untagged address will do legacy canonical check. So that LAM canonical
>  check and mask can be covered by "untag + legacy canonical check".
>

>  For cases LAM is not applicable, 'flags' is passed to the interface
>  to skip untag.

The "flags" can be dropped. Callers can simply skip the call of .untag_addr().

>
>- For SVM, add a dummy version to do nothing, but return the original
>  address.
>
>Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>Tested-by: Xuelian Guo <xuelian.guo@intel.com>
>---
> arch/x86/include/asm/kvm-x86-ops.h |  1 +
> arch/x86/include/asm/kvm_host.h    |  5 +++
> arch/x86/kvm/svm/svm.c             |  7 ++++
> arch/x86/kvm/vmx/vmx.c             | 60 ++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/vmx.h             |  2 +
> 5 files changed, 75 insertions(+)
>
>diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
>index 8dc345cc6318..7d63d1b942ac 100644
>--- a/arch/x86/include/asm/kvm-x86-ops.h
>+++ b/arch/x86/include/asm/kvm-x86-ops.h
>@@ -52,6 +52,7 @@ KVM_X86_OP(cache_reg)
> KVM_X86_OP(get_rflags)
> KVM_X86_OP(set_rflags)
> KVM_X86_OP(get_if_flag)
>+KVM_X86_OP(untag_addr)
> KVM_X86_OP(flush_tlb_all)
> KVM_X86_OP(flush_tlb_current)
> KVM_X86_OP_OPTIONAL(tlb_remote_flush)
>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>index 498d2b5e8dc1..cb674ec826d4 100644
>--- a/arch/x86/include/asm/kvm_host.h
>+++ b/arch/x86/include/asm/kvm_host.h
>@@ -69,6 +69,9 @@
> #define KVM_X86_NOTIFY_VMEXIT_VALID_BITS	(KVM_X86_NOTIFY_VMEXIT_ENABLED | \
> 						 KVM_X86_NOTIFY_VMEXIT_USER)
> 
>+/* flags for kvm_x86_ops::untag_addr() */
>+#define KVM_X86_UNTAG_ADDR_SKIP_LAM	_BITULL(0)
>+
> /* x86-specific vcpu->requests bit members */
> #define KVM_REQ_MIGRATE_TIMER		KVM_ARCH_REQ(0)
> #define KVM_REQ_REPORT_TPR_ACCESS	KVM_ARCH_REQ(1)
>@@ -1607,6 +1610,8 @@ struct kvm_x86_ops {
> 	void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
> 	bool (*get_if_flag)(struct kvm_vcpu *vcpu);
> 
>+	u64 (*untag_addr)(struct kvm_vcpu *vcpu, u64 la, u64 flags);
>+
> 	void (*flush_tlb_all)(struct kvm_vcpu *vcpu);
> 	void (*flush_tlb_current)(struct kvm_vcpu *vcpu);
> 	int  (*tlb_remote_flush)(struct kvm *kvm);
>diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>index 252e7f37e4e2..a6e6bd09642b 100644
>--- a/arch/x86/kvm/svm/svm.c
>+++ b/arch/x86/kvm/svm/svm.c
>@@ -4696,6 +4696,11 @@ static int svm_vm_init(struct kvm *kvm)
> 	return 0;
> }
> 
>+static u64 svm_untag_addr(struct kvm_vcpu *vcpu, u64 addr, u64 flags)
>+{
>+	return addr;
>+}
>+
> static struct kvm_x86_ops svm_x86_ops __initdata = {
> 	.name = KBUILD_MODNAME,
> 
>@@ -4745,6 +4750,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> 	.set_rflags = svm_set_rflags,
> 	.get_if_flag = svm_get_if_flag,
> 
>+	.untag_addr = svm_untag_addr,
>+
> 	.flush_tlb_all = svm_flush_tlb_current,
> 	.flush_tlb_current = svm_flush_tlb_current,
> 	.flush_tlb_gva = svm_flush_tlb_gva,
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 4d329ee9474c..73cc495bd0da 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -8137,6 +8137,64 @@ static void vmx_vm_destroy(struct kvm *kvm)
> 	free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm));
> }
> 
>+
>+#define LAM_S57_EN_MASK (X86_CR4_LAM_SUP | X86_CR4_LA57)
>+
>+static inline int lam_sign_extend_bit(bool user, struct kvm_vcpu *vcpu)

Drop "inline" and let compilers decide whether to inline the function.

And it is better to swap the two parameters to align with the convention
in KVM, i.e. pass the vCPU as the first parameter.

>+{
>+	u64 cr3, cr4;
>+
>+	if (user) {
>+		cr3 = kvm_read_cr3(vcpu);
>+		if (!!(cr3 & X86_CR3_LAM_U57))

It is weird to use double negation "!!" in if statements. I prefer to drop it.

>+			return 56;
>+		if (!!(cr3 & X86_CR3_LAM_U48))
>+			return 47;
>+	} else {
>+		cr4 = kvm_read_cr4_bits(vcpu, LAM_S57_EN_MASK);
>+		if (cr4 == LAM_S57_EN_MASK)
>+			return 56;
>+		if (!!(cr4 & X86_CR4_LAM_SUP))
>+			return 47;
>+	}
>+	return -1;
>+}
>+
>+/*
>+ * Only called in 64-bit mode.
>+ *
>+ * Metadata bits are [62:48] in LAM48 and [62:57] in LAM57. Mask metadata in
>+ * pointers by sign-extending the value of bit 47 (LAM48) or 56 (LAM57).
>+ * The resulting address after untagging isn't guaranteed to be canonical.
>+ * Callers should perform the original canonical check and raise #GP/#SS if the
>+ * address is non-canonical.
>+ */
>+u64 vmx_untag_addr(struct kvm_vcpu *vcpu, u64 addr, u64 flags)
>+{
>+	int sign_ext_bit;
>+
>+	/*
>+	 * Instead of calling relatively expensive guest_cpuid_has(), just check
>+	 * LAM_U48 in cr3_ctrl_bits. If not set, vCPU doesn't supports LAM.
>+	 */
>+	if (!(vcpu->arch.cr3_ctrl_bits & X86_CR3_LAM_U48) ||
>+	    !!(flags & KVM_X86_UNTAG_ADDR_SKIP_LAM))
>+		return addr;
>+
>+	if(!is_64_bit_mode(vcpu)){
>+		WARN_ONCE(1, "Only be called in 64-bit mode");

use WARN_ON_ONCE() in case it can be triggered by guests, i.e.,

if (WARN_ON_ONCE(!is_64_bit_mode(vcpu)))
	return addr;

>+		return addr;
>+	}
>+
>+	sign_ext_bit = lam_sign_extend_bit(!(addr >> 63), vcpu);
>+
>+	if (sign_ext_bit < 0)
>+		return addr;
>+
>+	return (sign_extend64(addr, sign_ext_bit) & ~BIT_ULL(63)) |
>+	       (addr & BIT_ULL(63));
>+}
>+
> static struct kvm_x86_ops vmx_x86_ops __initdata = {
> 	.name = KBUILD_MODNAME,
> 
>@@ -8185,6 +8243,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
> 	.set_rflags = vmx_set_rflags,
> 	.get_if_flag = vmx_get_if_flag,
> 
>+	.untag_addr = vmx_untag_addr,
>+
> 	.flush_tlb_all = vmx_flush_tlb_all,
> 	.flush_tlb_current = vmx_flush_tlb_current,
> 	.flush_tlb_gva = vmx_flush_tlb_gva,
>diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
>index 2acdc54bc34b..79855b9a4aca 100644
>--- a/arch/x86/kvm/vmx/vmx.h
>+++ b/arch/x86/kvm/vmx/vmx.h
>@@ -433,6 +433,8 @@ void vmx_enable_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr, int type);
> u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
> u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
> 
>+u64 vmx_untag_addr(struct kvm_vcpu *vcpu, u64 addr, u64 flags);
>+
> static inline void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
> 					     int type, bool value)
> {
>-- 
>2.25.1
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops
  2023-04-19  2:30   ` Chao Gao
@ 2023-04-19  3:08     ` Binbin Wu
  2023-04-21  7:48       ` Binbin Wu
  0 siblings, 1 reply; 48+ messages in thread
From: Binbin Wu @ 2023-04-19  3:08 UTC (permalink / raw)
  To: Chao Gao; +Cc: kvm, seanjc, pbonzini, kai.huang, xuelian.guo, robert.hu


On 4/19/2023 10:30 AM, Chao Gao wrote:
> On Tue, Apr 04, 2023 at 09:09:21PM +0800, Binbin Wu wrote:
>> Introduce a new interface untag_addr() to kvm_x86_ops to untag the metadata
>> from linear address. Implement LAM version in VMX and dummy version in SVM.
>> When enabled feature like Intel Linear Address Masking or AMD Upper
>> Address Ignore, linear address may be tagged with metadata. Linear
>> address should be checked for modified canonicality and untagged in
>> instrution emulations or vmexit handlings if LAM or UAI is applicable.
>>
>> Introduce untag_addr() to kvm_x86_ops to hide the code related to vendor
>> specific details.
>> - For VMX, LAM version is implemented.
>>   LAM has a modified canonical check when applicable:
>>   * LAM_S48                : [ 1 ][ metadata ][ 1 ]
>>                                63               47
>>   * LAM_U48                : [ 0 ][ metadata ][ 0 ]
>>                                63               47
>>   * LAM_S57                : [ 1 ][ metadata ][ 1 ]
>>                                63               56
>>   * LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
>>                                63               56
>>   * LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
>>                                63               56..47
>>   If LAM is applicable to certain address, untag the metadata bits and
>>   replace them with the value of bit 47 (LAM48) or bit 56 (LAM57). Later
>>   the untagged address will do legacy canonical check. So that LAM canonical
>>   check and mask can be covered by "untag + legacy canonical check".
>>
>>   For cases LAM is not applicable, 'flags' is passed to the interface
>>   to skip untag.
> The "flags" can be dropped. Callers can simply skip the call of .untag_addr().

OK.

The "flags" was added for proof of future if such kind of untag is also 
adopted in svm for AMD.

The cases to skip untag are different on the two vendor platforms.

But still, it is able to get the information in __linearize(), I will 
drop the parameter.



>
>> - For SVM, add a dummy version to do nothing, but return the original
>>   address.
>>
>> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>> Tested-by: Xuelian Guo <xuelian.guo@intel.com>
>> ---
>> arch/x86/include/asm/kvm-x86-ops.h |  1 +
>> arch/x86/include/asm/kvm_host.h    |  5 +++
>> arch/x86/kvm/svm/svm.c             |  7 ++++
>> arch/x86/kvm/vmx/vmx.c             | 60 ++++++++++++++++++++++++++++++
>> arch/x86/kvm/vmx/vmx.h             |  2 +
>> 5 files changed, 75 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
>> index 8dc345cc6318..7d63d1b942ac 100644
>> --- a/arch/x86/include/asm/kvm-x86-ops.h
>> +++ b/arch/x86/include/asm/kvm-x86-ops.h
>> @@ -52,6 +52,7 @@ KVM_X86_OP(cache_reg)
>> KVM_X86_OP(get_rflags)
>> KVM_X86_OP(set_rflags)
>> KVM_X86_OP(get_if_flag)
>> +KVM_X86_OP(untag_addr)
>> KVM_X86_OP(flush_tlb_all)
>> KVM_X86_OP(flush_tlb_current)
>> KVM_X86_OP_OPTIONAL(tlb_remote_flush)
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 498d2b5e8dc1..cb674ec826d4 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -69,6 +69,9 @@
>> #define KVM_X86_NOTIFY_VMEXIT_VALID_BITS	(KVM_X86_NOTIFY_VMEXIT_ENABLED | \
>> 						 KVM_X86_NOTIFY_VMEXIT_USER)
>>
>> +/* flags for kvm_x86_ops::untag_addr() */
>> +#define KVM_X86_UNTAG_ADDR_SKIP_LAM	_BITULL(0)
>> +
>> /* x86-specific vcpu->requests bit members */
>> #define KVM_REQ_MIGRATE_TIMER		KVM_ARCH_REQ(0)
>> #define KVM_REQ_REPORT_TPR_ACCESS	KVM_ARCH_REQ(1)
>> @@ -1607,6 +1610,8 @@ struct kvm_x86_ops {
>> 	void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
>> 	bool (*get_if_flag)(struct kvm_vcpu *vcpu);
>>
>> +	u64 (*untag_addr)(struct kvm_vcpu *vcpu, u64 la, u64 flags);
>> +
>> 	void (*flush_tlb_all)(struct kvm_vcpu *vcpu);
>> 	void (*flush_tlb_current)(struct kvm_vcpu *vcpu);
>> 	int  (*tlb_remote_flush)(struct kvm *kvm);
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 252e7f37e4e2..a6e6bd09642b 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -4696,6 +4696,11 @@ static int svm_vm_init(struct kvm *kvm)
>> 	return 0;
>> }
>>
>> +static u64 svm_untag_addr(struct kvm_vcpu *vcpu, u64 addr, u64 flags)
>> +{
>> +	return addr;
>> +}
>> +
>> static struct kvm_x86_ops svm_x86_ops __initdata = {
>> 	.name = KBUILD_MODNAME,
>>
>> @@ -4745,6 +4750,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>> 	.set_rflags = svm_set_rflags,
>> 	.get_if_flag = svm_get_if_flag,
>>
>> +	.untag_addr = svm_untag_addr,
>> +
>> 	.flush_tlb_all = svm_flush_tlb_current,
>> 	.flush_tlb_current = svm_flush_tlb_current,
>> 	.flush_tlb_gva = svm_flush_tlb_gva,
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 4d329ee9474c..73cc495bd0da 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -8137,6 +8137,64 @@ static void vmx_vm_destroy(struct kvm *kvm)
>> 	free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm));
>> }
>>
>> +
>> +#define LAM_S57_EN_MASK (X86_CR4_LAM_SUP | X86_CR4_LA57)
>> +
>> +static inline int lam_sign_extend_bit(bool user, struct kvm_vcpu *vcpu)
> Drop "inline" and let compilers decide whether to inline the function.
>
> And it is better to swap the two parameters to align with the conversion
> in kvm.


OK.

>
>> +{
>> +	u64 cr3, cr4;
>> +
>> +	if (user) {
>> +		cr3 = kvm_read_cr3(vcpu);
>> +		if (!!(cr3 & X86_CR3_LAM_U57))
> It is weird to use double negation "!!" in if statements. I prefer to drop it.

OK.

I checked the current code: there are such usages in some driver code, but
no such usage in the kvm-x86 code.

Will drop it.


>
>> +			return 56;
>> +		if (!!(cr3 & X86_CR3_LAM_U48))
>> +			return 47;
>> +	} else {
>> +		cr4 = kvm_read_cr4_bits(vcpu, LAM_S57_EN_MASK);
>> +		if (cr4 == LAM_S57_EN_MASK)
>> +			return 56;
>> +		if (!!(cr4 & X86_CR4_LAM_SUP))
>> +			return 47;
>> +	}
>> +	return -1;
>> +}
>> +
>> +/*
>> + * Only called in 64-bit mode.
>> + *
>> + * Metadata bits are [62:48] in LAM48 and [62:57] in LAM57. Mask metadata in
>> + * pointers by sign-extending the value of bit 47 (LAM48) or 56 (LAM57).
>> + * The resulting address after untagging isn't guaranteed to be canonical.
>> + * Callers should perform the original canonical check and raise #GP/#SS if the
>> + * address is non-canonical.
>> + */
>> +u64 vmx_untag_addr(struct kvm_vcpu *vcpu, u64 addr, u64 flags)
>> +{
>> +	int sign_ext_bit;
>> +
>> +	/*
>> +	 * Instead of calling relatively expensive guest_cpuid_has(), just check
>> +	 * LAM_U48 in cr3_ctrl_bits. If not set, vCPU doesn't supports LAM.
>> +	 */
>> +	if (!(vcpu->arch.cr3_ctrl_bits & X86_CR3_LAM_U48) ||
>> +	    !!(flags & KVM_X86_UNTAG_ADDR_SKIP_LAM))
>> +		return addr;
>> +
>> +	if(!is_64_bit_mode(vcpu)){
>> +		WARN_ONCE(1, "Only be called in 64-bit mode");
> use WARN_ON_ONCE() in case it can be triggered by guests, i.e.,
>
> if (WARN_ON_ONCE(!is_64_bit_mode(vcpu))
> 	return addr;


OK.

>
>> +		return addr;
>> +	}
>> +
>> +	sign_ext_bit = lam_sign_extend_bit(!(addr >> 63), vcpu);
>> +
>> +	if (sign_ext_bit < 0)
>> +		return addr;
>> +
>> +	return (sign_extend64(addr, sign_ext_bit) & ~BIT_ULL(63)) |
>> +	       (addr & BIT_ULL(63));
>> +}
>> +
>> static struct kvm_x86_ops vmx_x86_ops __initdata = {
>> 	.name = KBUILD_MODNAME,
>>
>> @@ -8185,6 +8243,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
>> 	.set_rflags = vmx_set_rflags,
>> 	.get_if_flag = vmx_get_if_flag,
>>
>> +	.untag_addr = vmx_untag_addr,
>> +
>> 	.flush_tlb_all = vmx_flush_tlb_all,
>> 	.flush_tlb_current = vmx_flush_tlb_current,
>> 	.flush_tlb_gva = vmx_flush_tlb_gva,
>> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
>> index 2acdc54bc34b..79855b9a4aca 100644
>> --- a/arch/x86/kvm/vmx/vmx.h
>> +++ b/arch/x86/kvm/vmx/vmx.h
>> @@ -433,6 +433,8 @@ void vmx_enable_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr, int type);
>> u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
>> u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
>>
>> +u64 vmx_untag_addr(struct kvm_vcpu *vcpu, u64 addr, u64 flags);
>> +
>> static inline void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
>> 					     int type, bool value)
>> {
>> -- 
>> 2.25.1
>>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable
  2023-04-04 13:09 ` [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable Binbin Wu
  2023-04-06 13:20   ` Huang, Kai
  2023-04-18  3:28   ` Zeng Guang
@ 2023-04-19  6:43   ` Chao Gao
  2023-04-21  7:57     ` Binbin Wu
  2 siblings, 1 reply; 48+ messages in thread
From: Chao Gao @ 2023-04-19  6:43 UTC (permalink / raw)
  To: Binbin Wu; +Cc: kvm, seanjc, pbonzini, kai.huang, xuelian.guo, robert.hu

On Tue, Apr 04, 2023 at 09:09:22PM +0800, Binbin Wu wrote:
>Untag address for 64-bit memory/mmio operand in instruction emulations
>and vmexit handlers when LAM is applicable.
>
>For instruction emulation, untag address in __linearize() before
>canonical check. LAM doesn't apply to instruction fetch and invlpg,
>use KVM_X86_UNTAG_ADDR_SKIP_LAM to skip LAM untag.
>
>For vmexit handlings related to 64-bit linear address:
>- Cases need to untag address
>  Operand(s) of VMX instructions and INVPCID
>  Operand(s) of SGX ENCLS
>  Linear address in INVVPID descriptor.
>- Cases LAM doesn't apply to (no change needed)
>  Operand of INVLPG
>  Linear address in INVPCID descriptor
>
>Co-developed-by: Robert Hoo <robert.hu@linux.intel.com>
>Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
>Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>Tested-by: Xuelian Guo <xuelian.guo@intel.com>
>---
> arch/x86/kvm/emulate.c     | 23 ++++++++++++++++++-----
> arch/x86/kvm/kvm_emulate.h |  2 ++
> arch/x86/kvm/vmx/nested.c  |  4 ++++
> arch/x86/kvm/vmx/sgx.c     |  1 +
> arch/x86/kvm/x86.c         | 10 ++++++++++
> 5 files changed, 35 insertions(+), 5 deletions(-)
>
>diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>index a20bec931764..b7df465eccf2 100644
>--- a/arch/x86/kvm/emulate.c
>+++ b/arch/x86/kvm/emulate.c
>@@ -688,7 +688,8 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
> 				       struct segmented_address addr,
> 				       unsigned *max_size, unsigned size,
> 				       bool write, bool fetch,
>-				       enum x86emul_mode mode, ulong *linear)
>+				       enum x86emul_mode mode, ulong *linear,
>+				       u64 untag_flags)

@write and @fetch are like flags. I think we can consolidate them into
the @flags first as a cleanup patch and then add a flag for LAM.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-13  9:13           ` Huang, Kai
@ 2023-04-21  6:35             ` Binbin Wu
  2023-04-21 11:43               ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: Binbin Wu @ 2023-04-21  6:35 UTC (permalink / raw)
  To: Huang, Kai, kvm, pbonzini, Christopherson,, Sean
  Cc: Guo, Xuelian, robert.hu, Gao, Chao


On 4/13/2023 5:13 PM, Huang, Kai wrote:
>> On 4/13/2023 10:27 AM, Huang, Kai wrote:
>>> On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
>>>> On 4/12/2023 7:58 PM, Huang, Kai wrote:
>>>>
>> ...
>>>>>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>>>>> Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
>>>>> way, below
>>>>> mmu_check_root() may potentially catch other invalid bits, but in
>>>>> practice there should be no difference I guess.
>>>> In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
>>>>
>>>> However, Sean pointed out that the return value of
>>>> mmu->get_guest_pgd(vcpu) could be
>>>> EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
>>> Yes, although EPTP's high bits don't contain any control bits.
>>>
>>> But perhaps we want to make it future-proof in case some more control
>>> bits are added to EPTP too.
>>>
>>>> Since the guest pgd has been check for valadity, for both CR3 and
>>>> EPTP, it is safe to mask off non-address bits to get GFN.
>>>>
>>>> Maybe I should add this CR3 VS. EPTP part to the changelog to make it
>>>> more undertandable.
>>> This isn't necessary, and can/should be done in comments if needed.
>>>
>>> But IMHO you may want to add more material to explain how nested cases
>>> are handled.
>> Do you mean about CR3 or others?
>>
> This patch is about CR3, so CR3.

For the nested case, I plan to add the following to the changelog:

     For nested guest:
     - If CR3 is intercepted, after CR3 is handled in L1, CR3 will be checked
       in nested_vmx_load_cr3() before returning back to L2.
     - For the shadow paging case (SPT02), LAM bits are also handled to form
       a new shadow CR3 in vmx_load_mmu_pgd().



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops
  2023-04-19  3:08     ` Binbin Wu
@ 2023-04-21  7:48       ` Binbin Wu
  2023-04-21  8:21         ` Chao Gao
  0 siblings, 1 reply; 48+ messages in thread
From: Binbin Wu @ 2023-04-21  7:48 UTC (permalink / raw)
  To: Chao Gao; +Cc: kvm, seanjc, pbonzini, kai.huang, xuelian.guo, robert.hu


On 4/19/2023 11:08 AM, Binbin Wu wrote:
>
> On 4/19/2023 10:30 AM, Chao Gao wrote:
>> On Tue, Apr 04, 2023 at 09:09:21PM +0800, Binbin Wu wrote:
>>> Introduce a new interface untag_addr() to kvm_x86_ops to untag the 
>>> metadata
>>> from linear address. Implement LAM version in VMX and dummy version
>>> in SVM.
>>> When enabled feature like Intel Linear Address Masking or AMD Upper
>>> Address Ignore, linear address may be tagged with metadata. Linear
>>> address should be checked for modified canonicality and untagged in
>>> instrution emulations or vmexit handlings if LAM or UAI is applicable.
>>>
>>> Introduce untag_addr() to kvm_x86_ops to hide the code related to 
>>> vendor
>>> specific details.
>>> - For VMX, LAM version is implemented.
>>>   LAM has a modified canonical check when applicable:
>>>   * LAM_S48                : [ 1 ][ metadata ][ 1 ]
>>>                                63               47
>>>   * LAM_U48                : [ 0 ][ metadata ][ 0 ]
>>>                                63               47
>>>   * LAM_S57                : [ 1 ][ metadata ][ 1 ]
>>>                                63               56
>>>   * LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
>>>                                63               56
>>>   * LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
>>>                                63               56..47
>>>   If LAM is applicable to certain address, untag the metadata bits and
>>>   replace them with the value of bit 47 (LAM48) or bit 56 (LAM57). 
>>> Later
>>>   the untagged address will do legacy canonical check. So that LAM 
>>> canonical
>>>   check and mask can be covered by "untag + legacy canonical check".
>>>
>>>   For cases LAM is not applicable, 'flags' is passed to the interface
>>>   to skip untag.
>> The "flags" can be dropped. Callers can simply skip the call of 
>> .untag_addr().
>
> OK.
>
> The "flags" was added for proof of future if such kind of untag is 
> also adopted in svm for AMD.
>
> The cases to skip untag are different on the two vendor platforms.
>
> But still, it is able to get the information in __linearize(), I will 
> drop the parameter.

On second thought, the 'flags' parameter is still needed and should be passed
to vmx/svm.

If both implementations set the skip-untag flag (SKIP_UNTAG_VMX |
SKIP_UNTAG_SVM), or neither sets it, __linearize() can decide whether to call
.untag_addr() or not.

However, for a case where only one of the implementations needs to skip the
untag for itself, __linearize() doesn't have enough information to tell
whether to skip the untag or not.
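To make that concrete, a rough sketch of the scheme (the flag names follow the
KVM_X86_EMULFLAG_* definitions proposed elsewhere in this thread; the bodies
are illustrative only, not the actual patch): the emulator passes per-vendor
skip flags, and each vendor implementation honors only its own bit.

u64 vmx_untag_addr(struct kvm_vcpu *vcpu, u64 addr, u64 flags)
{
	/* VMX only cares about its own skip bit. */
	if (flags & KVM_X86_EMULFLAG_SKIP_UNTAG_VMX)
		return addr;

	/* ... LAM untagging as in patch 3 ... */
	return addr;
}

u64 svm_untag_addr(struct kvm_vcpu *vcpu, u64 addr, u64 flags)
{
	/* SVM would likewise check only its own skip bit. */
	if (flags & KVM_X86_EMULFLAG_SKIP_UNTAG_SVM)
		return addr;

	/* No UAI untagging is implemented yet, return the address unchanged. */
	return addr;
}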


>
>
>
>>
>>> - For SVM, add a dummy version to do nothing, but return the original
>>>   address.
>>>
>>> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>>> Tested-by: Xuelian Guo <xuelian.guo@intel.com>
>>> --- 
[...]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable
  2023-04-19  6:43   ` Chao Gao
@ 2023-04-21  7:57     ` Binbin Wu
  2023-04-21  8:36       ` Chao Gao
  0 siblings, 1 reply; 48+ messages in thread
From: Binbin Wu @ 2023-04-21  7:57 UTC (permalink / raw)
  To: Chao Gao; +Cc: kvm, seanjc, pbonzini, kai.huang, xuelian.guo, robert.hu


On 4/19/2023 2:43 PM, Chao Gao wrote:
> On Tue, Apr 04, 2023 at 09:09:22PM +0800, Binbin Wu wrote:
>> Untag address for 64-bit memory/mmio operand in instruction emulations
>> and vmexit handlers when LAM is applicable.
>>
>> For instruction emulation, untag address in __linearize() before
>> canonical check. LAM doesn't apply to instruction fetch and invlpg,
>> use KVM_X86_UNTAG_ADDR_SKIP_LAM to skip LAM untag.
>>
>> For vmexit handlings related to 64-bit linear address:
>> - Cases need to untag address
>>   Operand(s) of VMX instructions and INVPCID
>>   Operand(s) of SGX ENCLS
>>   Linear address in INVVPID descriptor.
>> - Cases LAM doesn't apply to (no change needed)
>>   Operand of INVLPG
>>   Linear address in INVPCID descriptor
>>
>> Co-developed-by: Robert Hoo <robert.hu@linux.intel.com>
>> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
>> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>> Tested-by: Xuelian Guo <xuelian.guo@intel.com>
>> ---
>> arch/x86/kvm/emulate.c     | 23 ++++++++++++++++++-----
>> arch/x86/kvm/kvm_emulate.h |  2 ++
>> arch/x86/kvm/vmx/nested.c  |  4 ++++
>> arch/x86/kvm/vmx/sgx.c     |  1 +
>> arch/x86/kvm/x86.c         | 10 ++++++++++
>> 5 files changed, 35 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>> index a20bec931764..b7df465eccf2 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -688,7 +688,8 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
>> 				       struct segmented_address addr,
>> 				       unsigned *max_size, unsigned size,
>> 				       bool write, bool fetch,
>> -				       enum x86emul_mode mode, ulong *linear)
>> +				       enum x86emul_mode mode, ulong *linear,
>> +				       u64 untag_flags)
> @write and @fetch are like flags. I think we can consolidate them into
> the @flags first as a cleanup patch and then add a flag for LAM.

OK. Here is the proposed cleanup patch:


--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -687,8 +687,8 @@ static unsigned insn_alignment(struct x86_emulate_ctxt *ctxt, unsigned size)
 static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
 				       struct segmented_address addr,
 				       unsigned *max_size, unsigned size,
-				       bool write, bool fetch,
-				       enum x86emul_mode mode, ulong *linear)
+				       u64 flags, enum x86emul_mode mode,
+				       ulong *linear)
 {
 	struct desc_struct desc;
 	bool usable;
@@ -696,6 +696,8 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
 	u32 lim;
 	u16 sel;
 	u8  va_bits;
+	bool fetch = !!(flags & KVM_X86_EMULFLAG_FETCH);
+	bool write = !!(flags & KVM_X86_EMULFLAG_WRITE);
 
 	la = seg_base(ctxt, addr.seg) + addr.ea;
 	*max_size = 0;
@@ -757,7 +759,12 @@ static int linearize(struct x86_emulate_ctxt *ctxt,
 		     ulong *linear)
 {
 	unsigned max_size;
-	return __linearize(ctxt, addr, &max_size, size, write, false,
+	u64 flags = 0;
+
+	if (write)
+		flags |= KVM_X86_EMULFLAG_WRITE;
+
+	return __linearize(ctxt, addr, &max_size, size, flags,
 			   ctxt->mode, linear);
 }
 
@@ -768,10 +775,11 @@ static inline int assign_eip(struct x86_emulate_ctxt *ctxt, ulong dst)
 	unsigned max_size;
 	struct segmented_address addr = { .seg = VCPU_SREG_CS,
 					  .ea = dst };
+	u64 flags = KVM_X86_EMULFLAG_FETCH;
 
 	if (ctxt->op_bytes != sizeof(unsigned long))
 		addr.ea = dst & ((1UL << (ctxt->op_bytes << 3)) - 1);
-	rc = __linearize(ctxt, addr, &max_size, 1, false, true, ctxt->mode, &linear);
+	rc = __linearize(ctxt, addr, &max_size, 1, flags, ctxt->mode, &linear);
 	if (rc == X86EMUL_CONTINUE)
 		ctxt->_eip = addr.ea;
 	return rc;
@@ -896,6 +904,7 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 	int cur_size = ctxt->fetch.end - ctxt->fetch.data;
 	struct segmented_address addr = { .seg = VCPU_SREG_CS,
 					  .ea = ctxt->eip + cur_size };
+	u64 flags = KVM_X86_EMULFLAG_FETCH;
 
 	/*
 	 * We do not know exactly how many bytes will be needed, and
@@ -907,8 +916,7 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 	 * boundary check itself.  Instead, we use max_size to check
 	 * against op_size.
 	 */
-	rc = __linearize(ctxt, addr, &max_size, 0, false, true, ctxt->mode,
-			 &linear);
+	rc = __linearize(ctxt, addr, &max_size, 0, flags, ctxt->mode, &linear);
 	if (unlikely(rc != X86EMUL_CONTINUE))
 		return rc;
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a8167b47b8c8..8076e013ff9f 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -48,6 +48,15 @@ void kvm_spurious_fault(void);
 #define KVM_SVM_DEFAULT_PLE_WINDOW_MAX	USHRT_MAX
 #define KVM_SVM_DEFAULT_PLE_WINDOW	3000
 
+/* x86-specific emulation flags */
+#define KVM_X86_EMULFLAG_FETCH		_BITULL(0)
+#define KVM_X86_EMULFLAG_WRITE		_BITULL(1)


And the following two will be defined for untag:

#define KVM_X86_EMULFLAG_SKIP_UNTAG_VMX		_BITULL(2)
#define KVM_X86_EMULFLAG_SKIP_UNTAG_SVM		_BITULL(3)	/* reserved for SVM */



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops
  2023-04-21  7:48       ` Binbin Wu
@ 2023-04-21  8:21         ` Chao Gao
  0 siblings, 0 replies; 48+ messages in thread
From: Chao Gao @ 2023-04-21  8:21 UTC (permalink / raw)
  To: Binbin Wu; +Cc: kvm, seanjc, pbonzini, kai.huang, xuelian.guo, robert.hu

On Fri, Apr 21, 2023 at 03:48:50PM +0800, Binbin Wu wrote:
>
>On 4/19/2023 11:08 AM, Binbin Wu wrote:
>> 
>> On 4/19/2023 10:30 AM, Chao Gao wrote:
>> > On Tue, Apr 04, 2023 at 09:09:21PM +0800, Binbin Wu wrote:
>> > > Introduce a new interface untag_addr() to kvm_x86_ops to untag
>> > > the metadata
>> > > from linear address. Implement LAM version in VMX and dummy version
>> > > in SVM.
>> > > When enabled feature like Intel Linear Address Masking or AMD Upper
>> > > Address Ignore, linear address may be tagged with metadata. Linear
>> > > address should be checked for modified canonicality and untagged in
>> > > instrution emulations or vmexit handlings if LAM or UAI is applicable.
>> > > 
>> > > Introduce untag_addr() to kvm_x86_ops to hide the code related to
>> > > vendor
>> > > specific details.
>> > > - For VMX, LAM version is implemented.
>> > >   LAM has a modified canonical check when applicable:
>> > >   * LAM_S48                : [ 1 ][ metadata ][ 1 ]
>> > >                                63               47
>> > >   * LAM_U48                : [ 0 ][ metadata ][ 0 ]
>> > >                                63               47
>> > >   * LAM_S57                : [ 1 ][ metadata ][ 1 ]
>> > >                                63               56
>> > >   * LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
>> > >                                63               56
>> > >   * LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
>> > >                                63               56..47
>> > >   If LAM is applicable to certain address, untag the metadata bits and
>> > >   replace them with the value of bit 47 (LAM48) or bit 56
>> > > (LAM57). Later
>> > >   the untagged address will do legacy canonical check. So that
>> > > LAM canonical
>> > >   check and mask can be covered by "untag + legacy canonical check".
>> > > 
>> > >   For cases LAM is not applicable, 'flags' is passed to the interface
>> > >   to skip untag.
>> > The "flags" can be dropped. Callers can simply skip the call of
>> > .untag_addr().
>> 
>> OK.
>> 
>> The "flags" was added for proof of future if such kind of untag is also
>> adopted in svm for AMD.
>> 
>> The cases to skip untag are different on the two vendor platforms.
>> 
>> But still, it is able to get the information in __linearize(), I will
>> drop the parameter.
>
>Have a second thought, the flags is still needed to pass to vmx/svm.
>
>If both implementions set the skip untag flag (SKIP_UNTAG_VMX |
>SKIP_UNTAG_SVM)
>or neither sets the skip untag flag,  __linearize() can decide to call
>.untag_addr() or not.
>
>However, in some case, if only one of the implementation need to set the skip
>untag for itself,
>in __linearize(), there is no enough information to tell whether to skip the
>untag or not.

OK. I have no strong preference. An alternative is:

We have only one flag. If AMD and Intel differ in some cases, set the
flag according to the CPU vendor of vCPUs.

if LAM applies to case A while AMD UAI doesn't, do

	if guest CPU is Intel
		set SKIP_UNTAG

for case B to which UAI applies while LAM doesn't, do
	
	if guest CPU is AMD
		set SKIP_UNTAG

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable
  2023-04-21  7:57     ` Binbin Wu
@ 2023-04-21  8:36       ` Chao Gao
  2023-04-21  9:13         ` Binbin Wu
  0 siblings, 1 reply; 48+ messages in thread
From: Chao Gao @ 2023-04-21  8:36 UTC (permalink / raw)
  To: Binbin Wu; +Cc: kvm, seanjc, pbonzini, kai.huang, xuelian.guo, robert.hu

On Fri, Apr 21, 2023 at 03:57:15PM +0800, Binbin Wu wrote:
>
>> > --- a/arch/x86/kvm/emulate.c
>> > +++ b/arch/x86/kvm/emulate.c
>> > @@ -688,7 +688,8 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
>> > 				       struct segmented_address addr,
>> > 				       unsigned *max_size, unsigned size,
>> > 				       bool write, bool fetch,
>> > -				       enum x86emul_mode mode, ulong *linear)
>> > +				       enum x86emul_mode mode, ulong *linear,
>> > +				       u64 untag_flags)
>> @write and @fetch are like flags. I think we can consolidate them into
>> the @flags first as a cleanup patch and then add a flag for LAM.
>
>OK. Here is the proposed cleanup patch:

looks good to me

>
>
>--- a/arch/x86/kvm/x86.h
>+++ b/arch/x86/kvm/x86.h
>@@ -48,6 +48,15 @@ void kvm_spurious_fault(void);
> #define KVM_SVM_DEFAULT_PLE_WINDOW_MAX USHRT_MAX
> #define KVM_SVM_DEFAULT_PLE_WINDOW     3000
>
>+/* x86-specific emulation flags */
>+#define KVM_X86_EMULFLAG_FETCH                 _BITULL(0)
>+#define KVM_X86_EMULFLAG_WRITE                 _BITULL(1)

Can we move the definitions to arch/x86/kvm/kvm_emulate.h?

>
>
>And the following two will be defined for untag:
>
>#define KVM_X86_EMULFLAG_SKIP_UNTAG_VMX     _BITULL(2)
>#define KVM_X86_EMULFLAG_SKIP_UNTAG_SVM     _BITULL(3) /* reserved for SVM */
>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable
  2023-04-21  8:36       ` Chao Gao
@ 2023-04-21  9:13         ` Binbin Wu
  0 siblings, 0 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-21  9:13 UTC (permalink / raw)
  To: Chao Gao; +Cc: kvm, seanjc, pbonzini, kai.huang, xuelian.guo, robert.hu


On 4/21/2023 4:36 PM, Chao Gao wrote:
> On Fri, Apr 21, 2023 at 03:57:15PM +0800, Binbin Wu wrote:
>>>> --- a/arch/x86/kvm/emulate.c
>>>> +++ b/arch/x86/kvm/emulate.c
>>>> @@ -688,7 +688,8 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
>>>> 				       struct segmented_address addr,
>>>> 				       unsigned *max_size, unsigned size,
>>>> 				       bool write, bool fetch,
>>>> -				       enum x86emul_mode mode, ulong *linear)
>>>> +				       enum x86emul_mode mode, ulong *linear,
>>>> +				       u64 untag_flags)
>>> @write and @fetch are like flags. I think we can consolidate them into
>>> the @flags first as a cleanup patch and then add a flag for LAM.
>> OK. Here is the proposed cleanup patch:
> looks good to me
>
>>
>> --- a/arch/x86/kvm/x86.h
>> +++ b/arch/x86/kvm/x86.h
>> @@ -48,6 +48,15 @@ void kvm_spurious_fault(void);
>>   #define KVM_SVM_DEFAULT_PLE_WINDOW_MAX USHRT_MAX
>>   #define KVM_SVM_DEFAULT_PLE_WINDOW     3000
>>
>> +/* x86-specific emulation flags */
>> +#define KVM_X86_EMULFLAG_FETCH                 _BITULL(0)
>> +#define KVM_X86_EMULFLAG_WRITE                 _BITULL(1)
> Can we move the definitions to arch/x86/kvm/kvm_emulate.h?

Then the flags would need to be removed from the .untag_addr() interface,
since KVM_X86_EMULFLAG_SKIP_UNTAG_VMX is currently used in vmx. :(



>
>>
>> And the following two will be defined for untag:
>>
>> #define KVM_X86_EMULFLAG_SKIP_UNTAG_VMX     _BITULL(2)
>> #define KVM_X86_EMULFLAG_SKIP_UNTAG_SVM     _BITULL(3) /* reserved for SVM */
>>
>>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling
  2023-04-04 13:09 [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling Binbin Wu
                   ` (4 preceding siblings ...)
  2023-04-04 13:09 ` [PATCH v7 5/5] KVM: x86: Expose LAM feature to userspace VMM Binbin Wu
@ 2023-04-21  9:40 ` Binbin Wu
  5 siblings, 0 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-21  9:40 UTC (permalink / raw)
  To: kvm, seanjc, pbonzini; +Cc: kai.huang, chao.gao, xuelian.guo, robert.hu

Gentle ping on the patch series.

Best regards,
Binbin

On 4/4/2023 9:09 PM, Binbin Wu wrote:
> ===Feature Introduction===
>
> Linear-address masking (LAM) [1], modifies the checking that is applied to
> *64-bit* linear addresses, allowing software to use of the untranslated address
> bits for metadata.
>
> When the feature is virtualized and exposed to guest, it can be used for efficient
> address sanitizers (ASAN) implementation and for optimizations in JITs and virtual
> machines.
>
> Regarding which pointer bits are masked and can be used for metadata, LAM has 2
> modes:
> - LAM_48: metadata bits 62:48, i.e. LAM width of 15.
> - LAM_57: metadata bits 62:57, i.e. LAM width of 6.
>
> * For user pointers:
>    CR3.LAM_U57 = CR3.LAM_U48 = 0, LAM is off;
>    CR3.LAM_U57 = 1, LAM57 is active;
>    CR3.LAM_U57 = 0 and CR3.LAM_U48 = 1, LAM48 is active.
> * For supervisor pointers:
>    CR4.LAM_SUP =0, LAM is off;
>    CR4.LAM_SUP =1 with 5-level paging mode, LAM57 is active;
>    CR4.LAM_SUP =1 with 4-level paging mode, LAM48 is active.
>
> The modified LAM canonicality check:
> * LAM_S48                : [ 1 ][ metadata ][ 1 ]
>                               63               47
> * LAM_U48                : [ 0 ][ metadata ][ 0 ]
>                               63               47
> * LAM_S57                : [ 1 ][ metadata ][ 1 ]
>                               63               56
> * LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
>                               63               56
> * LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
>                               63               56..47
>
> Note:
> 1. LAM applies to only data address, not to instructions.
> 2. LAM identification of an address as user or supervisor is based solely on the
>     value of pointer bit 63 and does not, for the purposes of LAM, depend on the CPL.
> 3. For user mode address, it is possible that 5-level paging and LAM_U48 are both
>     set, in this case, the effective usable linear address width is 48. [2]
> 4. When VM exit, the page faulting linear address saved in VMCS field is clean,
>     i.e. metadata cleared with canonical form.
>
> ===LAM KVM Design===
> LAM KVM enabling includes the following parts:
> - Feature Enumeration
>    LAM feature is enumerated by CPUID.7.1:EAX.LAM[bit 26].
>    If hardware supports LAM and host doesn't disable it explicitly (e.g. via
>    clearcpuid), LAM feature will be exposed to user VMM.
>
> - CR4 Virtualization
>    LAM uses CR4.LAM_SUP (bit 28) to configure LAM masking on supervisor pointers.
>    CR4.LAM_SUP is allowed to be set if vCPU supports LAM, including in nested guest.
>    CR4.LAM_SUP is allowed to be set even not in 64-bit mode, but it will not take
>    effect since LAM only applies to 64-bit linear address.
>    Change of CR4.LAM_SUP bit is intercepted to avoid vmread every time when KVM
>    fetches its value, with the expectation that guest won't toggle the bit frequently.
>    Hardware is not required to do TLB flush when CR4.LAM_SUP toggled, so KVM doesn't
>    need to emulate TLB flush based on it.
>
> - CR3 Virtualization
>    LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM masking
>    for user mode pointers.
>
>    When EPT is on:
>    CR3 is fully under control of guest, guest LAM is thus transparent to KVM.
>
>    When EPT is off (shadow paging):
>      * KVM needs to handle guest CR3.LAM_U48 and CR3.LAM_U57 toggles.
>        The two bits are allowed to be set in CR3 if vCPU supports LAM.
>        The two bits should be kept as they are in the shadow CR3.
>      * Perform GFN calculation from guest CR3/PGD generically by extracting the
>        maximal base address mask.
>      * Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
>      To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
>      the bits used to control supported features related to CR3 (e.g. LAM).
>
> - Modified Canonicality Check and Metadata Mask
>    When LAM is enabled, 64-bit linear address may be tagged with metadata. Linear
>    address should be checked for modified canonicality and untagged (i.e. metadata
>    bits should be masked by sign-extending the bit 47 or bit 56) in instruction
>    emulations and VMExit handlings when LAM is applicable.
>
> LAM inside nested guest is supported by this patch series.
> LAM inside SGX enclave mode is NOT supported by this patch series.
>
> The patch series is based on linux kernel v6.3-rc4, depends on two patches:
> - One from Kiril for LAM feature and flag definitions[3].
> - The other is a bug fix sent out speperatly[4].
>
> The patch series organized as following:
> Patch 1/2: CR4/CR3 virtualization
> Patch 3: Implementation of untag_addr
> Patch 4: Untag address when LAM applicable
> Patch 5: Expose LAM feature to userspace VMM
>
> The corresponding QEMU patch:
> https://lists.gnu.org/archive/html/qemu-devel/2023-02/msg08036.html
>
> ===Unit Test===
> 1. Add a kvm-unit-test [5] for LAM, including LAM_SUP and LAM_{U57,U48}.
>     For supervisor mode, this test covers CR4 LAM_SUP bits toggle, Memory/MMIO
>     access with tagged pointer, and some special instructions (INVLPG, INVPCID,
>     INVVPID), INVVIID cases also used to cover VMX instruction VMExit path.
>     For uer mode, this test covers CR3 LAM bits toggle, Memory/MMIO access with
>     tagged pointer.
>     MMIO cases are used to trigger instruction emulation path.
>     Run the unit test with both LAM feature on/off (i.e. including negative cases).
> 2. Run Kernel LAM kselftests in guest, with both EPT=Y/N.
> 3. Launch a nested guest.
>
> All tests have passed in Simics environment.
>
> [1] Intel ISE https://cdrdv2.intel.com/v1/dl/getContent/671368
>      Chapter Linear Address Masking (LAM)
> [2] Thus currently, LAM kernel enabling patch only enables LAM_U57.
>      https://lore.kernel.org/lkml/20230123220500.21077-1-kirill.shutemov@linux.intel.com/
> [3] https://lore.kernel.org/lkml/20230123220500.21077-4-kirill.shutemov@linux.intel.com/
> [4] https://lore.kernel.org/kvm/20230404032502.27798-1-binbin.wu@linux.intel.com/
> [5] https://lore.kernel.org/kvm/20230319083732.29458-1-binbin.wu@linux.intel.com/
>
> ---
> Changelog
> v6 --> v7:
> - Changes to CR3 virtualization when EPT off
>    * Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination. (Sean)
>    * Perform GFN calculation from guest CR3/PGD generically by extracting the maximal
>      base address mask. (Sean)
> - Remove derefence of ctxt->vcpu in the emulator. (Sean)
> - Fix a bug in v6, which hardcoded "write" to "false" by mistake in linearize(). (Chao Gao)
> - Add Chao Gao's reviwed-by in Patch 5.
> - Add Xuelian Guo's tested-by in the patch set.
> - Seperate cleanup patches from the patch set.
>
> v5 --> v6:
> Add Patch 2 to fix the check of 64-bit mode.
> Add untag_addr() to kvm_x86_ops to hide vendor specific code.
> Simplify the LAM canonicality check per Chao's suggestion.
> Add cr3_ctrl_bits to kvm_vcpu_arch to simplify cr3 invalidation/extract/mask (Chao Gao)
> Extend the patchset scope to include nested virtualization and SGX ENCLS handling.
> - Add X86_CR4_LAM_SUP in cr4_fixed1_update for nested vmx. (Chao Gao)
> - Add SGX ENCLS VMExit handling
> - Add VMX insturction VMExit handling
> More descriptions in cover letter.
> Add Chao's reviwed-by on Patch 4.
> Add more test cases in kvm-unit-test.
>
> v4 --> v5:
> Reorder and melt patches surround CR3.LAM bits into Patch 3 of this
> version.
> Revise Patch 1's subject and description
> Drop Patch 3
> Use kvm_read_cr4_bits() instead of kvm_read_cr4()
> Fix: No need to untag addr when write to msr, it should be legacy canonical check
> Rename kvm_is_valid_cr3() --> kvm_vcpu_is_valid_cr3(), and update some call
> sites of kvm_vcpu_is_valid_cr3() to use kvm_is_valid_cr3().
> Other refactors and Miscs.
>
> v3 --> v4:
> Drop unrelated Patch 1 in v3 (Binbin, Sean, Xiaoyao)
> Intercept CR4.LAM_SUP instead of pass through to guest (Sean)
> Just filter out CR3.LAM_{U48, U57}, instead of all reserved high bits
> (Sean, Yuan)
> Use existing __canonical_address() helper instead write a new one (Weijiang)
> Add LAM handling in KVM emulation (Yu, Yuan)
> Add Jingqi's reviwed-by on Patch 7
> Rebased to Kirill's latest code, which is 6.2-rc1 base.
>
> v2 --> v3:
> As LAM Kernel patches are in tip tree now, rebase to it.
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/
>
> v1 --> v2:
> 1. Fixes i386-allyesconfig build error on get_pgd(), where
>     CR3_HIGH_RSVD_MASK isn't applicable.
>     (Reported-by: kernel test robot <lkp@intel.com>)
> 2. In kvm_set_cr3(), be conservative on skip tlb flush when only LAM bits
>     toggles. (Kirill)
>
> Binbin Wu (2):
>    KVM: x86: Introduce untag_addr() in kvm_x86_ops
>    KVM: x86: Untag address when LAM applicable
>
> Robert Hoo (3):
>    KVM: x86: Virtualize CR4.LAM_SUP
>    KVM: x86: Virtualize CR3.LAM_{U48,U57}
>    KVM: x86: Expose LAM feature to userspace VMM
>
>   arch/x86/include/asm/kvm-x86-ops.h |  1 +
>   arch/x86/include/asm/kvm_host.h    | 14 +++++-
>   arch/x86/kvm/cpuid.c               |  2 +-
>   arch/x86/kvm/cpuid.h               |  5 +++
>   arch/x86/kvm/emulate.c             | 23 +++++++---
>   arch/x86/kvm/kvm_emulate.h         |  2 +
>   arch/x86/kvm/mmu.h                 |  5 +++
>   arch/x86/kvm/mmu/mmu.c             |  6 ++-
>   arch/x86/kvm/mmu/mmu_internal.h    |  1 +
>   arch/x86/kvm/mmu/paging_tmpl.h     |  6 ++-
>   arch/x86/kvm/mmu/spte.h            |  2 +-
>   arch/x86/kvm/svm/svm.c             |  7 +++
>   arch/x86/kvm/vmx/nested.c          |  8 +++-
>   arch/x86/kvm/vmx/sgx.c             |  1 +
>   arch/x86/kvm/vmx/vmx.c             | 69 +++++++++++++++++++++++++++++-
>   arch/x86/kvm/vmx/vmx.h             |  2 +
>   arch/x86/kvm/x86.c                 | 14 +++++-
>   arch/x86/kvm/x86.h                 |  2 +
>   18 files changed, 155 insertions(+), 15 deletions(-)
>
>
> base-commit: 197b6b60ae7bc51dd0814953c562833143b292aa
> prerequisite-patch-id: 883dc8f73520b47a6c3690c1704f2e85a2713e4f
> prerequisite-patch-id: cf5655ce89a2390cd29f33c57a4fc307a6045f62

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-21  6:35             ` Binbin Wu
@ 2023-04-21 11:43               ` Huang, Kai
  2023-04-21 15:32                 ` Chao Gao
  2023-04-22  3:32                 ` Binbin Wu
  0 siblings, 2 replies; 48+ messages in thread
From: Huang, Kai @ 2023-04-21 11:43 UTC (permalink / raw)
  To: kvm, pbonzini, Christopherson,, Sean, binbin.wu
  Cc: Guo, Xuelian, robert.hu, Gao, Chao

On Fri, 2023-04-21 at 14:35 +0800, Binbin Wu wrote:
> On 4/13/2023 5:13 PM, Huang, Kai wrote:
> > > On 4/13/2023 10:27 AM, Huang, Kai wrote:
> > > > On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
> > > > > On 4/12/2023 7:58 PM, Huang, Kai wrote:
> > > > > 
> > > ...
> > > > > > > +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
> > > > > > Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
> > > > > > way, below
> > > > > > mmu_check_root() may potentially catch other invalid bits, but in
> > > > > > practice there should be no difference I guess.
> > > > > In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
> > > > > 
> > > > > However, Sean pointed out that the return value of
> > > > > mmu->get_guest_pgd(vcpu) could be
> > > > > EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
> > > > Yes, although EPTP's high bits don't contain any control bits.
> > > > 
> > > > But perhaps we want to make it future-proof in case some more control
> > > > bits are added to EPTP too.
> > > > 
> > > > > Since the guest pgd has been check for valadity, for both CR3 and
> > > > > EPTP, it is safe to mask off non-address bits to get GFN.
> > > > > 
> > > > > Maybe I should add this CR3 VS. EPTP part to the changelog to make it
> > > > > more undertandable.
> > > > This isn't necessary, and can/should be done in comments if needed.
> > > > 
> > > > But IMHO you may want to add more material to explain how nested cases
> > > > are handled.
> > > Do you mean about CR3 or others?
> > > 
> > This patch is about CR3, so CR3.
> 
> For nested case, I plan to add the following in the changelog:
> 
>      For nested guest:
>      - If CR3 is intercepted, after CR3 handled in L1, CR3 will be 
> checked in
>        nested_vmx_load_cr3() before returning back to L2.
>      - For the shadow paging case (SPT02), LAM bits are also be handled 
> to form
>        a new shadow CR3 in vmx_load_mmu_pgd().
> 
> 

I don't know the KVM nested code in much detail, but in general, since your
code only changes nested_vmx_load_cr3() and nested_vmx_check_host_state(), the
changelog should focus on explaining why modifying these two functions is good
enough.

And to explain this, I think we should explain from hardware's perspective
rather than from shadow paging's perspective.

From L0's perspective, we care about:

	1) L1's CR3 register (VMCS01's GUEST_CR3)
	2) L1's VMCS to run L2 guest
		2.1) VMCS12's HOST_CR3 
		2.2) VMCS12's GUEST_CR3

For 1) the current changelog has explained (that we need to catch _active_
control bits in guest's CR3 etc).  For nested (case 2)), L1 can choose to
intercept CR3 or not.  But this isn't the point because from hardware's point of
view we actually care about below two cases instead:

	1) L0 to emulate a VMExit from L2 to L1 by VMENTER using VMCS01 
	   to reflect VMCS12
	2) L0 to VMENTER to L2 using VMCS02 directly

Case 2) can of course fail to VMENTER to L2, but that should result in L0
VMENTERing to L1 with an emulated VMEXIT from L2 to L1, which is case 1).

For case 1) we need new code to check VMCS12's HOST_CR3 against guest's _active_
CR3 feature control bits.  The key code path is:

	vmx_handle_exit()
		-> prepare_vmcs12()
		-> load_vmcs12_host_state().  

For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
GUEST_CR3 against active control bits.  The key code path is 

	nested_vmx_run() -> 
		-> nested_vmx_check_host_state()
		-> nested_vmx_enter_non_root_mode()
			-> prepare_vmcs02_early()
			-> prepare_vmcs02()

Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
(VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
active control bits seems just enough.

Also, nested_vmx_check_host_state() (i.e. it is called in nested_vmx_run() to
return early if any HOST state is wrong) currently checks
kvm_vcpu_is_illegal_gpa(vcpu, cr3) too so we should also change it.

That being said, I do find it's not easy to come up with a "concise" changelog
that covers both non-nested and (especially) nested cases, but it seems we can
distill something from what I typed above?

My suggestion is to focus on the hardware behaviour's perspective as typed
above.  But I believe Sean can easily make a "concise" changelog if he wants to
comment here :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-21 11:43               ` Huang, Kai
@ 2023-04-21 15:32                 ` Chao Gao
  2023-04-22  4:51                   ` Chao Gao
  2023-04-22  3:32                 ` Binbin Wu
  1 sibling, 1 reply; 48+ messages in thread
From: Chao Gao @ 2023-04-21 15:32 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, pbonzini, Christopherson,, Sean, binbin.wu, Guo, Xuelian, robert.hu

On Fri, Apr 21, 2023 at 07:43:52PM +0800, Huang, Kai wrote:
>On Fri, 2023-04-21 at 14:35 +0800, Binbin Wu wrote:
>> On 4/13/2023 5:13 PM, Huang, Kai wrote:
>> > > On 4/13/2023 10:27 AM, Huang, Kai wrote:
>> > > > On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
>> > > > > On 4/12/2023 7:58 PM, Huang, Kai wrote:
>> > > > > 
>> > > ...
>> > > > > > > +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>> > > > > > Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
>> > > > > > way, below
>> > > > > > mmu_check_root() may potentially catch other invalid bits, but in
>> > > > > > practice there should be no difference I guess.
>> > > > > In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
>> > > > > 
>> > > > > However, Sean pointed out that the return value of
>> > > > > mmu->get_guest_pgd(vcpu) could be
>> > > > > EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
>> > > > Yes, although EPTP's high bits don't contain any control bits.
>> > > > 
>> > > > But perhaps we want to make it future-proof in case some more control
>> > > > bits are added to EPTP too.
>> > > > 
>> > > > > Since the guest pgd has been check for valadity, for both CR3 and
>> > > > > EPTP, it is safe to mask off non-address bits to get GFN.
>> > > > > 
>> > > > > Maybe I should add this CR3 VS. EPTP part to the changelog to make it
>> > > > > more undertandable.
>> > > > This isn't necessary, and can/should be done in comments if needed.
>> > > > 
>> > > > But IMHO you may want to add more material to explain how nested cases
>> > > > are handled.
>> > > Do you mean about CR3 or others?
>> > > 
>> > This patch is about CR3, so CR3.
>> 
>> For nested case, I plan to add the following in the changelog:
>> 
>>      For nested guest:
>>      - If CR3 is intercepted, after CR3 handled in L1, CR3 will be 
>> checked in
>>        nested_vmx_load_cr3() before returning back to L2.
>>      - For the shadow paging case (SPT02), LAM bits are also be handled 
>> to form
>>        a new shadow CR3 in vmx_load_mmu_pgd().
>> 
>> 
>
>I don't know a lot of code detail of KVM nested code, but in general, since your
>code only changes nested_vmx_load_cr3() and nested_vmx_check_host_state(), the
>changelog should focus on explaining why modifying these two functions are good
>enough.

The patch relaxes all existing checks on CR3, allowing the previously reserved
LAM bits to be set. That should be enough; otherwise, some checks on CR3 would
have been missing in the first place.

>
>And to explain this, I think we should explain from hardware's perspective
>rather than from shadow paging's perspective.
>
>From L0's perspective, we care about:
>
>	1) L1's CR3 register (VMCS01's GUEST_CR3)
>	2) L1's VMCS to run L2 guest
>		2.1) VMCS12's HOST_CR3 
>		2.2) VMCS12's GUEST_CR3
>
>For 1) the current changelog has explained (that we need to catch _active_
>control bits in guest's CR3 etc).  For nested (case 2)), L1 can choose to
>intercept CR3 or not.  But this isn't the point because from hardware's point of
>view we actually care about below two cases instead:
>
>	1) L0 to emulate a VMExit from L2 to L1 by VMENTER using VMCS01 
>	   to reflect VMCS12
>	2) L0 to VMENTER to L2 using VMCS02 directly
>
>The case 2) can fail due to fail to VMENTER to L2 of course but this should
>result in L0 to VMENTER to L1 with a emulated VMEXIT from L2 to L1 which is the
>case 1).
>
>For case 1) we need new code to check VMCS12's HOST_CR3 against guest's _active_
>CR3 feature control bits.  The key code path is:
>
>	vmx_handle_exit()
>		-> prepare_vmcs12()
>		-> load_vmcs12_host_state().  
>
>For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
>GUEST_CR3 against active control bits.  The key code path is 

...

>
>	nested_vmx_run() -> 
>		-> nested_vmx_check_host_state()
>		-> nested_vmx_enter_non_root_mode()
>			-> prepare_vmcs02_early()
>			-> prepare_vmcs02()
>
>Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
>(VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
>kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
>active control bits seems just enough.

IMO, current KVM relies on hardware to do the consistency check on GUEST_CR3
during VM entry.

>
>Also, nested_vmx_check_host_state() (i.e. it is called in nested_vmx_run() to
>return early if any HOST state is wrong) currently checks
>kvm_vcpu_is_illegal_gpa(vcpu, cr3) too so we should also change it.
>
>That being said, I do find it's not easy to come up with a "concise" changelog
>to cover both non-nested and (especially) nested cases, but it seems we can
>abstract some from my above typing?  
>
>My suggestion is to focus on the hardware behaviour's perspective as typed
>above.  But I believe Sean can easily make a "concise" changelog if he wants to
>comment here :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-21 11:43               ` Huang, Kai
  2023-04-21 15:32                 ` Chao Gao
@ 2023-04-22  3:32                 ` Binbin Wu
  2023-04-22  4:43                   ` Chao Gao
  2023-04-25 22:48                   ` Huang, Kai
  1 sibling, 2 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-22  3:32 UTC (permalink / raw)
  To: Huang, Kai, kvm, pbonzini, Christopherson,, Sean
  Cc: Guo, Xuelian, robert.hu, Gao, Chao

Kai,

Thanks for your inputs.

I rephrased the changelog as follows:


LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
masking for user mode pointers.

To support LAM in KVM, CR3 validity checks and shadow paging handling 
need to be
modified accordingly.

== CR3 validity Check ==
When LAM is supported, CR3 LAM bits are allowed to be set and the check 
of CR3
needs to be modified.
Add a helper kvm_vcpu_is_legal_cr3() and use it instead of 
kvm_vcpu_is_legal_gpa()
to do the new CR3 checks in all existing CR3 checks as following:
When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
Non-nested case
- When EPT on, CR3 is fully under control of guest.
- When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3() 
during
   CR3 VMExit handling.
Nested case, from L0's perspective, we care about:
- L1's CR3 register (VMCS01's GUEST_CR3), it's the same as non-nested case.
- L1's VMCS to run L2 guest (i.e. VMCS12's HOST_CR3 and VMCS12's GUEST_CR3)
   Two paths related:
   1. L0 emulates a VMExit from L2 to L1 using VMCS01 to reflect VMCS12
          nested_vm_exit()
          -> load_vmcs12_host_state()
                -> nested_vmx_load_cr3()     //check VMCS12's HOST_CR3
   2. L0 to VMENTER to L2 using VMCS02
          nested_vmx_run()
          -> nested_vmx_check_host_state()   //check VMCS12's HOST_CR3
          -> nested_vmx_enter_non_root_mode()
             -> prepare_vmcs02()
                -> nested_vmx_load_cr3()     //check VMCS12's GUEST_CR3
   Path 2 can fail to VMENTER to L2 of course, but later this should 
result in
   path 1.

== Shadow paging handling ==
When EPT is off, the following changes are needed to handle shadow paging:
- LAM bits should be stripped when calculating the GFN from the guest PGD
  when it is CR3 (for the nested EPT case, the guest PGD is the nested EPTP).
  To be generic, extract the maximal base address mask from the guest PGD,
  since the validity of the guest PGD has already been checked.
- Leave LAM bits in root.pgd to force a new root for each CR3+LAM combination.
  It could potentially increase root cache misses and MMU reloads, however,
  it's very rare to turn off EPT when performance matters.
- Active CR3 LAM bits should be kept to form the shadow CR3.

To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
the bits used to control supported features related to CR3 (e.g. LAM):
- Supported control bits are set in the field after CPUID is set.
- The field is used in
  kvm_vcpu_is_legal_cr3() to check CR3, and in
  kvm_get_active_cr3_ctrl_bits() to extract the active control bits of CR3.
  It also serves as a quick check for LAM feature support.
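
For reference, a minimal sketch of the two helpers (based only on the names and
semantics described above, not the actual patch):

        static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu,
                                                 unsigned long cr3)
        {
                /* Strip the allowed control bits, then do the usual GPA check. */
                return kvm_vcpu_is_legal_gpa(vcpu, cr3 & ~vcpu->arch.cr3_ctrl_bits);
        }

        static inline unsigned long kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
        {
                /* Control bits the guest has actually set in its current CR3. */
                return kvm_read_cr3(vcpu) & vcpu->arch.cr3_ctrl_bits;
        }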

On 4/21/2023 7:43 PM, Huang, Kai wrote:
> On Fri, 2023-04-21 at 14:35 +0800, Binbin Wu wrote:
>> On 4/13/2023 5:13 PM, Huang, Kai wrote:
>>>> On 4/13/2023 10:27 AM, Huang, Kai wrote:
>>>>> On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
>>>>>> On 4/12/2023 7:58 PM, Huang, Kai wrote:
>>>>>>
>>>> ...
>>>>>>>> +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>>>>>>> Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
>>>>>>> way, below
>>>>>>> mmu_check_root() may potentially catch other invalid bits, but in
>>>>>>> practice there should be no difference I guess.
>>>>>> In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
>>>>>>
>>>>>> However, Sean pointed out that the return value of
>>>>>> mmu->get_guest_pgd(vcpu) could be
>>>>>> EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
>>>>> Yes, although EPTP's high bits don't contain any control bits.
>>>>>
>>>>> But perhaps we want to make it future-proof in case some more control
>>>>> bits are added to EPTP too.
>>>>>
>>>>>> Since the guest pgd has been check for valadity, for both CR3 and
>>>>>> EPTP, it is safe to mask off non-address bits to get GFN.
>>>>>>
>>>>>> Maybe I should add this CR3 VS. EPTP part to the changelog to make it
>>>>>> more undertandable.
>>>>> This isn't necessary, and can/should be done in comments if needed.
>>>>>
>>>>> But IMHO you may want to add more material to explain how nested cases
>>>>> are handled.
>>>> Do you mean about CR3 or others?
>>>>
>>> This patch is about CR3, so CR3.
>> For nested case, I plan to add the following in the changelog:
>>
>>       For nested guest:
>>       - If CR3 is intercepted, after CR3 handled in L1, CR3 will be
>> checked in
>>         nested_vmx_load_cr3() before returning back to L2.
>>       - For the shadow paging case (SPT02), LAM bits are also be handled
>> to form
>>         a new shadow CR3 in vmx_load_mmu_pgd().
>>
>>
> I don't know a lot of code detail of KVM nested code, but in general, since your
> code only changes nested_vmx_load_cr3() and nested_vmx_check_host_state(), the
> changelog should focus on explaining why modifying these two functions are good
> enough.
>
> And to explain this, I think we should explain from hardware's perspective
> rather than from shadow paging's perspective.
>
>  From L0's perspective, we care about:
>
> 	1) L1's CR3 register (VMCS01's GUEST_CR3)
> 	2) L1's VMCS to run L2 guest
> 		2.1) VMCS12's HOST_CR3
> 		2.2) VMCS12's GUEST_CR3
>
> For 1) the current changelog has explained (that we need to catch _active_
> control bits in guest's CR3 etc).  For nested (case 2)), L1 can choose to
> intercept CR3 or not.  But this isn't the point because from hardware's point of
> view we actually care about below two cases instead:
>
> 	1) L0 to emulate a VMExit from L2 to L1 by VMENTER using VMCS01
> 	   to reflect VMCS12
> 	2) L0 to VMENTER to L2 using VMCS02 directly
>
> The case 2) can fail due to fail to VMENTER to L2 of course but this should
> result in L0 to VMENTER to L1 with a emulated VMEXIT from L2 to L1 which is the
> case 1).
>
> For case 1) we need new code to check VMCS12's HOST_CR3 against guest's _active_
> CR3 feature control bits.  The key code path is:
>
> 	vmx_handle_exit()
> 		-> prepare_vmcs12()
> 		-> load_vmcs12_host_state().
>
> For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
> GUEST_CR3 against active control bits.  The key code path is
>
> 	nested_vmx_run() ->
> 		-> nested_vmx_check_host_state()
> 		-> nested_vmx_enter_non_root_mode()
> 			-> prepare_vmcs02_early()
> 			-> prepare_vmcs02()
>
> Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
> (VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
> kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
> active control bits seems just enough.
>
> Also, nested_vmx_check_host_state() (i.e. it is called in nested_vmx_run() to
> return early if any HOST state is wrong) currently checks
> kvm_vcpu_is_illegal_gpa(vcpu, cr3) too so we should also change it.
>
> That being said, I do find it's not easy to come up with a "concise" changelog
> to cover both non-nested and (especially) nested cases, but it seems we can
> abstract some from my above typing?
>
> My suggestion is to focus on the hardware behaviour's perspective as typed
> above.  But I believe Sean can easily make a "concise" changelog if he wants to
> comment here :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-22  3:32                 ` Binbin Wu
@ 2023-04-22  4:43                   ` Chao Gao
  2023-04-27 13:19                     ` Huang, Kai
  2023-04-25 22:48                   ` Huang, Kai
  1 sibling, 1 reply; 48+ messages in thread
From: Chao Gao @ 2023-04-22  4:43 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Huang, Kai, kvm, pbonzini, Christopherson,,
	Sean, Guo, Xuelian, robert.hu

On Sat, Apr 22, 2023 at 11:32:26AM +0800, Binbin Wu wrote:
>Kai,
>
>Thanks for your inputs.
>
>I rephrased the changelog as following:
>
>
>LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
>masking for user mode pointers.
>
>To support LAM in KVM, CR3 validity checks and shadow paging handling need to
>be
>modified accordingly.
>
>== CR3 validity Check ==
>When LAM is supported, CR3 LAM bits are allowed to be set and the check of
>CR3
>needs to be modified.

it is better to describe the hardware change here:

On processors that enumerate support for LAM, the CR3 register allows
LAM_U48/U57 to be set and VM entry allows LAM_U48/U57 to be set in both the
GUEST_CR3 and HOST_CR3 fields.

To emulate LAM hardware behavior, KVM needs to
1. allow LAM_U48/U57 to be set in the CR3 register by the guest or userspace
2. allow LAM_U48/U57 to be set in the GUEST_CR3/HOST_CR3 fields in vmcs12
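
(For concreteness, the two CR3 control bits in question -- bit positions as
described earlier in this series; the definitions below are a sketch, not
quoted from the kernel headers:)

        #define X86_CR3_LAM_U57        BIT_ULL(61)  /* LAM57 for user pointers */
        #define X86_CR3_LAM_U48        BIT_ULL(62)  /* LAM48 for user pointers */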

>Add a helper kvm_vcpu_is_legal_cr3() and use it instead of
>kvm_vcpu_is_legal_gpa()
>to do the new CR3 checks in all existing CR3 checks as following:
>When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
>Non-nested case
>- When EPT on, CR3 is fully under control of guest.
>- When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3() during
>  CR3 VMExit handling.
>Nested case, from L0's perspective, we care about:
>- L1's CR3 register (VMCS01's GUEST_CR3), it's the same as non-nested case.
>- L1's VMCS to run L2 guest (i.e. VMCS12's HOST_CR3 and VMCS12's GUEST_CR3)
>  Two paths related:
>  1. L0 emulates a VMExit from L2 to L1 using VMCS01 to reflect VMCS12
>         nested_vm_exit()
>         -> load_vmcs12_host_state()
>               -> nested_vmx_load_cr3()     //check VMCS12's HOST_CR3

This is just a byproduct of using a unified function, i.e.,
nested_vmx_load_cr3() to load CR3 for both nested VM entry and VM exit.

LAM spec says:

VM entry checks the values of the CR3 and CR4 fields in the guest-area
and host-state area of the VMCS. In particular, the bits in these fields
that correspond to bits reserved in the corresponding register are
checked and must be 0.

It doesn't mention any check on VM exit. So, it looks to me that VM exit
doesn't do consistency checks. Then, I think there is no need to call
out this path.
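
(For concreteness, the VM-entry-side software check being discussed would
roughly be the following -- a sketch using the helper proposed in this series,
not the actual diff:)

        /* e.g. in nested_vmx_check_host_state() / nested_vmx_load_cr3() */
        if (!kvm_vcpu_is_legal_cr3(vcpu, vmcs12->host_cr3))
                return -EINVAL;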

>  2. L0 to VMENTER to L2 using VMCS02
>         nested_vmx_run()
>         -> nested_vmx_check_host_state()   //check VMCS12's HOST_CR3
>         -> nested_vmx_enter_non_root_mode()
>            -> prepare_vmcs02()
>               -> nested_vmx_load_cr3()     //check VMCS12's GUEST_CR3
>  Path 2 can fail to VMENTER to L2 of course, but later this should result in
>  path 1.
>
>== Shadow paging handling ==
>When EPT is off, the following changes needed to handle shadow paging:
>- LAM bits should be stripped to perform GFN calculation from guest PGD when
>it
>  is CR3 (for nested EPT case, guest PGD is nested EPTP).
>  To be generic, extract the maximal base address mask from guest PGD since
>the
>  validity of guest PGD has been checked already.
>- Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
>  It could potentially increase root cache misses and MMU reload, however,
>  it's very rare to turn off EPT when performance matters.
>- Active CR3 LAM bits should be kept to form a shadow CR3.
>
>To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
>the bits used to control supported features related to CR3 (e.g. LAM).
>- Supported control bits are set to the field after set cpuid.
>- the field is used in
>  kvm_vcpu_is_legal_cr3() to check CR3.
>  kvm_get_active_cr3_ctrl_bits() to extract active control bits of CR3.
>  Also as a quick check for LAM feature support.
>
>On 4/21/2023 7:43 PM, Huang, Kai wrote:
>> On Fri, 2023-04-21 at 14:35 +0800, Binbin Wu wrote:
>> > On 4/13/2023 5:13 PM, Huang, Kai wrote:
>> > > > On 4/13/2023 10:27 AM, Huang, Kai wrote:
>> > > > > On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
>> > > > > > On 4/12/2023 7:58 PM, Huang, Kai wrote:
>> > > > > > 
>> > > > ...
>> > > > > > > > +	root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>> > > > > > > Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits?  In this
>> > > > > > > way, below
>> > > > > > > mmu_check_root() may potentially catch other invalid bits, but in
>> > > > > > > practice there should be no difference I guess.
>> > > > > > In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
>> > > > > > 
>> > > > > > However, Sean pointed out that the return value of
>> > > > > > mmu->get_guest_pgd(vcpu) could be
>> > > > > > EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
>> > > > > Yes, although EPTP's high bits don't contain any control bits.
>> > > > > 
>> > > > > But perhaps we want to make it future-proof in case some more control
>> > > > > bits are added to EPTP too.
>> > > > > 
>> > > > > > Since the guest pgd has been check for valadity, for both CR3 and
>> > > > > > EPTP, it is safe to mask off non-address bits to get GFN.
>> > > > > > 
>> > > > > > Maybe I should add this CR3 VS. EPTP part to the changelog to make it
>> > > > > > more undertandable.
>> > > > > This isn't necessary, and can/should be done in comments if needed.
>> > > > > 
>> > > > > But IMHO you may want to add more material to explain how nested cases
>> > > > > are handled.
>> > > > Do you mean about CR3 or others?
>> > > > 
>> > > This patch is about CR3, so CR3.
>> > For nested case, I plan to add the following in the changelog:
>> > 
>> >       For nested guest:
>> >       - If CR3 is intercepted, after CR3 handled in L1, CR3 will be
>> > checked in
>> >         nested_vmx_load_cr3() before returning back to L2.
>> >       - For the shadow paging case (SPT02), LAM bits are also be handled
>> > to form
>> >         a new shadow CR3 in vmx_load_mmu_pgd().
>> > 
>> > 
>> I don't know a lot of code detail of KVM nested code, but in general, since your
>> code only changes nested_vmx_load_cr3() and nested_vmx_check_host_state(), the
>> changelog should focus on explaining why modifying these two functions are good
>> enough.
>> 
>> And to explain this, I think we should explain from hardware's perspective
>> rather than from shadow paging's perspective.
>> 
>>  From L0's perspective, we care about:
>> 
>> 	1) L1's CR3 register (VMCS01's GUEST_CR3)
>> 	2) L1's VMCS to run L2 guest
>> 		2.1) VMCS12's HOST_CR3
>> 		2.2) VMCS12's GUEST_CR3
>> 
>> For 1) the current changelog has explained (that we need to catch _active_
>> control bits in guest's CR3 etc).  For nested (case 2)), L1 can choose to
>> intercept CR3 or not.  But this isn't the point because from hardware's point of
>> view we actually care about below two cases instead:
>> 
>> 	1) L0 to emulate a VMExit from L2 to L1 by VMENTER using VMCS01
>> 	   to reflect VMCS12
>> 	2) L0 to VMENTER to L2 using VMCS02 directly
>> 
>> The case 2) can fail due to fail to VMENTER to L2 of course but this should
>> result in L0 to VMENTER to L1 with a emulated VMEXIT from L2 to L1 which is the
>> case 1).
>> 
>> For case 1) we need new code to check VMCS12's HOST_CR3 against guest's _active_
>> CR3 feature control bits.  The key code path is:
>> 
>> 	vmx_handle_exit()
>> 		-> prepare_vmcs12()
>> 		-> load_vmcs12_host_state().
>> 
>> For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
>> GUEST_CR3 against active control bits.  The key code path is
>> 
>> 	nested_vmx_run() ->
>> 		-> nested_vmx_check_host_state()
>> 		-> nested_vmx_enter_non_root_mode()
>> 			-> prepare_vmcs02_early()
>> 			-> prepare_vmcs02()
>> 
>> Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
>> (VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
>> kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
>> active control bits seems just enough.
>> 
>> Also, nested_vmx_check_host_state() (i.e. it is called in nested_vmx_run() to
>> return early if any HOST state is wrong) currently checks
>> kvm_vcpu_is_illegal_gpa(vcpu, cr3) too so we should also change it.
>> 
>> That being said, I do find it's not easy to come up with a "concise" changelog
>> to cover both non-nested and (especially) nested cases, but it seems we can
>> abstract some from my above typing?
>> 
>> My suggestion is to focus on the hardware behaviour's perspective as typed
>> above.  But I believe Sean can easily make a "concise" changelog if he wants to
>> comment here :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-21 15:32                 ` Chao Gao
@ 2023-04-22  4:51                   ` Chao Gao
  2023-04-22  8:14                     ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: Chao Gao @ 2023-04-22  4:51 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, pbonzini, Christopherson,, Sean, binbin.wu, Guo, Xuelian, robert.hu

On Fri, Apr 21, 2023 at 11:32:17PM +0800, Chao Gao wrote:
>>For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
>>GUEST_CR3 against active control bits.  The key code path is 
>
>...
>
>>
>>	nested_vmx_run() -> 
>>		-> nested_vmx_check_host_state()
>>		-> nested_vmx_enter_non_root_mode()
>>			-> prepare_vmcs02_early()
>>			-> prepare_vmcs02()
>>
>>Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
>>(VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
>>kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
>>active control bits seems just enough.
>
>IMO, current KVM relies on hardware to do consistency check on the GUEST_CR3
>during VM entry.

Please disregard my remark on the consistency check on GUEST_CR3. I just took
a closer look at the code. It is indeed done by KVM in nested_vmx_load_cr3().
Sorry for the noise.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-22  4:51                   ` Chao Gao
@ 2023-04-22  8:14                     ` Huang, Kai
  0 siblings, 0 replies; 48+ messages in thread
From: Huang, Kai @ 2023-04-22  8:14 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, pbonzini, robert.hu, Christopherson,, Sean, binbin.wu, Guo, Xuelian

On Sat, 2023-04-22 at 12:51 +0800, Chao Gao wrote:
> On Fri, Apr 21, 2023 at 11:32:17PM +0800, Chao Gao wrote:
> > > For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
> > > GUEST_CR3 against active control bits.  The key code path is 
> > 
> > ...
> > 
> > > 
> > > 	nested_vmx_run() -> 
> > > 		-> nested_vmx_check_host_state()
> > > 		-> nested_vmx_enter_non_root_mode()
> > > 			-> prepare_vmcs02_early()
> > > 			-> prepare_vmcs02()
> > > 
> > > Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
> > > (VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
> > > kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
> > > active control bits seems just enough.
> > 
> > IMO, current KVM relies on hardware to do consistency check on the GUEST_CR3
> > during VM entry.
> 
> Please disregard my remark on the consistency check on GUEST_CR3. I just took
> a closer look at the code. It is indeed done by KVM in nested_vmx_load_cr3().
> Sorry for the noise.

Right.  You cannot simply rely on hardware to do the CR3 check at VMENTER,
because there is at least one nasty case: L1's MAXPHYSADDR can be smaller than
L0's.
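
(To put it concretely: the software check is against the *guest's* MAXPHYSADDR
from guest CPUID, which can be narrower than the host's, so hardware's VMENTER
check alone is not sufficient.  A paraphrased sketch, not the exact KVM code:)

        static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
        {
                /* No bits may be set at or above the guest's MAXPHYSADDR. */
                return !(gpa >> cpuid_maxphyaddr(vcpu));
        }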

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-22  3:32                 ` Binbin Wu
  2023-04-22  4:43                   ` Chao Gao
@ 2023-04-25 22:48                   ` Huang, Kai
  2023-04-26  3:05                     ` Chao Gao
  1 sibling, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-04-25 22:48 UTC (permalink / raw)
  To: kvm, pbonzini, Christopherson,, Sean, binbin.wu
  Cc: Guo, Xuelian, robert.hu, Gao, Chao

On Sat, 2023-04-22 at 11:32 +0800, Binbin Wu wrote:
> Kai,
> 
> Thanks for your inputs.
> 
> I rephrased the changelog as following:

To me it seems there are too many details and it should be trimmed down.  But
putting the changelog aside, ...

> 
> 
> LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
> masking for user mode pointers.
> 
> To support LAM in KVM, CR3 validity checks and shadow paging handling 
> need to be
> modified accordingly.
> 
> == CR3 validity Check ==
> When LAM is supported, CR3 LAM bits are allowed to be set and the check 
> of CR3
> needs to be modified.
> Add a helper kvm_vcpu_is_legal_cr3() and use it instead of 
> kvm_vcpu_is_legal_gpa()
> to do the new CR3 checks in all existing CR3 checks as following:
> When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
> Non-nested case
> - When EPT on, CR3 is fully under control of guest.
> - When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3() 
> during
>    CR3 VMExit handling.

... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
KVM.

Is there any global enable bit in any control register to turn on/off LAM
globally?  It seems there isn't, because AFAICT the bits in CR4 only control
LAM for supervisor-mode linear addresses, not LAM globally?

So if it is true, then it appears hardware depends on CPUID purely to decide
whether to perform LAM or not.

Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
causing any trouble (because the hardware actually supports this feature)?

If it's true, it seems we should trap CR3 (at least loading) when hardware
supports LAM but it's not exposed to the guest, so that KVM can correctly reject
any LAM control bits when guest illegally does so?
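
(A sketch of what such trapping would roughly involve, just to illustrate the
idea -- not proposed code:)

        /*
         * If hardware supports LAM but the guest doesn't, trap CR3 loads so
         * that KVM can reject illegal LAM bits in kvm_set_cr3().
         */
        if (boot_cpu_has(X86_FEATURE_LAM) && !guest_cpuid_has(vcpu, X86_FEATURE_LAM))
                exec_controls_setbit(to_vmx(vcpu), CPU_BASED_CR3_LOAD_EXITING);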


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-25 22:48                   ` Huang, Kai
@ 2023-04-26  3:05                     ` Chao Gao
  2023-04-26  5:13                       ` Binbin Wu
  2023-04-26  8:43                       ` Huang, Kai
  0 siblings, 2 replies; 48+ messages in thread
From: Chao Gao @ 2023-04-26  3:05 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, pbonzini, Christopherson,, Sean, binbin.wu, Guo, Xuelian, robert.hu

On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
>... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
>KVM.
>
>Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
>seems there isn't because AFAICT the bits in CR4 are used to control super mode
>linear address but not LAM in global?

Right.

>
>So if it is true, then it appears hardware depends on CPUID purely to decide
>whether to perform LAM or not.
>
>Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
>hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
>causing any trouble (because the hardware actually supports this feature)?

Yes. But I think it is a non-issue ...

>
>If it's true, it seems we should trap CR3 (at least loading) when hardware
>supports LAM but it's not exposed to the guest, so that KVM can correctly reject
>any LAM control bits when guest illegally does so?
>

Other features which need no explicit enablement (like AVX and other
new instructions) have the same problem.

The impact is that some guests can use features which they are not supposed
to use. Then they might be broken after migration or by KVM's instruction
emulation. But they put themselves at stake; KVM shouldn't be blamed.

The downside of intercepting CR3 is the performance impact on existing
VMs (all with old CPU models and thus all without LAM). If they are
migrated to LAM-capable parts in the future, they will suffer a
performance drop even though they are good tenants (i.e., they won't try to
use LAM).

IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
when EPT=on cannot outweigh the performance impact. So, I vote to
document in the changelog or comments that:
A guest can enable LAM for userspace pointers when EPT=on even if LAM
isn't exposed to it. KVM doesn't prevent this out of performance
considerations.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-26  3:05                     ` Chao Gao
@ 2023-04-26  5:13                       ` Binbin Wu
  2023-04-26  8:44                         ` Huang, Kai
  2023-04-26  8:43                       ` Huang, Kai
  1 sibling, 1 reply; 48+ messages in thread
From: Binbin Wu @ 2023-04-26  5:13 UTC (permalink / raw)
  To: Chao Gao, Huang, Kai
  Cc: kvm, pbonzini, Christopherson,, Sean, Guo, Xuelian, robert.hu



On 4/26/2023 11:05 AM, Chao Gao wrote:
> On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
>> ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
>> KVM.
>>
>> Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
>> seems there isn't because AFAICT the bits in CR4 are used to control super mode
>> linear address but not LAM in global?
> Right.
>
>> So if it is true, then it appears hardware depends on CPUID purely to decide
>> whether to perform LAM or not.
>>
>> Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
>> hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
>> causing any trouble (because the hardware actually supports this feature)?
> Yes. But I think it is a non-issue ...
>
>> If it's true, it seems we should trap CR3 (at least loading) when hardware
>> supports LAM but it's not exposed to the guest, so that KVM can correctly reject
>> any LAM control bits when guest illegally does so?
>>
> Other features which need no explicit enablement (like AVX and other
> new instructions) have the same problem.
>
> The impact is some guests can use features which they are not supposed
> to use. Then they might be broken after migration or kvm's instruction
> emulation. But they put themselves at stake, KVM shouldn't be blamed.
Agree.

>
> The downside of intercepting CR3 is the performance impact on existing
> VMs (all with old CPU models and thus all have no LAM). If they are
> migrated to LAM-capable parts in the future, they will suffer
> performance drop even though they are good tenents (i.e., won't try to
> use LAM).
>
> IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
> when EPT=on cannot outweigh the performance impact. So, I vote to
> document in changelog or comments that:
> A guest can enable LAM for userspace pointers when EPT=on even if LAM
> isn't exposed to it. KVM doens't prevent this out of performance
> consideration

How about adding the comment in the code:

+       /*
+        * A guest can enable LAM for userspace pointers when EPT=on on a
+        * processor supporting LAM even if LAM isn't exposed to it.
+        * KVM doesn't prevent this out of performance considerations.
+        */
        if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
                vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 |
                                            X86_CR3_LAM_U57;



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-26  3:05                     ` Chao Gao
  2023-04-26  5:13                       ` Binbin Wu
@ 2023-04-26  8:43                       ` Huang, Kai
  2023-04-26 10:52                         ` Binbin Wu
  1 sibling, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-04-26  8:43 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, pbonzini, robert.hu, Christopherson,, Sean, binbin.wu, Guo, Xuelian

On Wed, 2023-04-26 at 11:05 +0800, Gao, Chao wrote:
> On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
> > ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
> > KVM.
> > 
> > Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
> > seems there isn't because AFAICT the bits in CR4 are used to control super mode
> > linear address but not LAM in global?
> 
> Right.
> 
> > 
> > So if it is true, then it appears hardware depends on CPUID purely to decide
> > whether to perform LAM or not.
> > 
> > Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
> > hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
> > causing any trouble (because the hardware actually supports this feature)?
> 
> Yes. But I think it is a non-issue ...
> 
> > 
> > If it's true, it seems we should trap CR3 (at least loading) when hardware
> > supports LAM but it's not exposed to the guest, so that KVM can correctly reject
> > any LAM control bits when guest illegally does so?
> > 
> 
> Other features which need no explicit enablement (like AVX and other
> new instructions) have the same problem.

OK.

> 
> The impact is some guests can use features which they are not supposed
> to use. Then they might be broken after migration or kvm's instruction
> emulation. But they put themselves at stake, KVM shouldn't be blamed.
> 
> The downside of intercepting CR3 is the performance impact on existing
> VMs (all with old CPU models and thus all have no LAM). If they are
> migrated to LAM-capable parts in the future, they will suffer
> performance drop even though they are good tenents (i.e., won't try to
> use LAM).
> 
> IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
> when EPT=on cannot outweigh the performance impact. So, I vote to
> document in changelog or comments that:
> A guest can enable LAM for userspace pointers when EPT=on even if LAM
> isn't exposed to it. KVM doens't prevent this out of performance
> consideration

Yeah, the performance impact is the concern.  I agree we can just call this out
in the changelog and/or comments.  I just want to make sure this is
mentioned/discussed.

My main concern is that, as (any) VMEXIT saves the guest's CR3 into VMCS's
GUEST_CR3, KVM may see GUEST_CR3 containing invalid control bits (because KVM
believes the guest doesn't support those feature bits), and if KVM code
carelessly uses WARN() around that code, a malicious guest may be able to
attack the host, which means we need to pay more attention to code review
around GUEST_CR3 in the future.

Anyway, not intercepting CR3 is fine with me, and I will leave this to others.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-26  5:13                       ` Binbin Wu
@ 2023-04-26  8:44                         ` Huang, Kai
  2023-04-26  8:50                           ` Binbin Wu
  0 siblings, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-04-26  8:44 UTC (permalink / raw)
  To: binbin.wu, Gao, Chao
  Cc: kvm, pbonzini, robert.hu, Christopherson,, Sean, Guo, Xuelian

On Wed, 2023-04-26 at 13:13 +0800, Binbin Wu wrote:
> 
> On 4/26/2023 11:05 AM, Chao Gao wrote:
> > On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
> > > ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
> > > KVM.
> > > 
> > > Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
> > > seems there isn't because AFAICT the bits in CR4 are used to control super mode
> > > linear address but not LAM in global?
> > Right.
> > 
> > > So if it is true, then it appears hardware depends on CPUID purely to decide
> > > whether to perform LAM or not.
> > > 
> > > Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
> > > hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
> > > causing any trouble (because the hardware actually supports this feature)?
> > Yes. But I think it is a non-issue ...
> > 
> > > If it's true, it seems we should trap CR3 (at least loading) when hardware
> > > supports LAM but it's not exposed to the guest, so that KVM can correctly reject
> > > any LAM control bits when guest illegally does so?
> > > 
> > Other features which need no explicit enablement (like AVX and other
> > new instructions) have the same problem.
> > 
> > The impact is some guests can use features which they are not supposed
> > to use. Then they might be broken after migration or kvm's instruction
> > emulation. But they put themselves at stake, KVM shouldn't be blamed.
> Agree.
> 
> > 
> > The downside of intercepting CR3 is the performance impact on existing
> > VMs (all with old CPU models and thus all have no LAM). If they are
> > migrated to LAM-capable parts in the future, they will suffer
> > performance drop even though they are good tenents (i.e., won't try to
> > use LAM).
> > 
> > IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
> > when EPT=on cannot outweigh the performance impact. So, I vote to
> > document in changelog or comments that:
> > A guest can enable LAM for userspace pointers when EPT=on even if LAM
> > isn't exposed to it. KVM doens't prevent this out of performance
> > consideration
> 
> How about add the comments on the code:
> 
> +       /*
> +        * A guest can enable LAM for userspace pointers when EPT=on on a
> +        * processor supporting LAM even if LAM isn't exposed to it.
> +        * KVM doesn't prevent this out of performance considerations.
> +        */
>          if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
>                  vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 | 
> X86_CR3_LAM_U57;
> 
> 

I would say we should at least call it out in the changelog, but I don't have
an opinion on whether to have this comment around the code or not.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-26  8:44                         ` Huang, Kai
@ 2023-04-26  8:50                           ` Binbin Wu
  0 siblings, 0 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-26  8:50 UTC (permalink / raw)
  To: Huang, Kai, Gao, Chao
  Cc: kvm, pbonzini, robert.hu, Christopherson,, Sean, Guo, Xuelian



On 4/26/2023 4:44 PM, Huang, Kai wrote:
> On Wed, 2023-04-26 at 13:13 +0800, Binbin Wu wrote:
>> On 4/26/2023 11:05 AM, Chao Gao wrote:
>>> On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
>>>> ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
>>>> KVM.
>>>>
>>>> Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
>>>> seems there isn't because AFAICT the bits in CR4 are used to control super mode
>>>> linear address but not LAM in global?
>>> Right.
>>>
>>>> So if it is true, then it appears hardware depends on CPUID purely to decide
>>>> whether to perform LAM or not.
>>>>
>>>> Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
>>>> hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
>>>> causing any trouble (because the hardware actually supports this feature)?
>>> Yes. But I think it is a non-issue ...
>>>
>>>> If it's true, it seems we should trap CR3 (at least loading) when hardware
>>>> supports LAM but it's not exposed to the guest, so that KVM can correctly reject
>>>> any LAM control bits when guest illegally does so?
>>>>
>>> Other features which need no explicit enablement (like AVX and other
>>> new instructions) have the same problem.
>>>
>>> The impact is some guests can use features which they are not supposed
>>> to use. Then they might be broken after migration or kvm's instruction
>>> emulation. But they put themselves at stake, KVM shouldn't be blamed.
>> Agree.
>>
>>> The downside of intercepting CR3 is the performance impact on existing
>>> VMs (all with old CPU models and thus all have no LAM). If they are
>>> migrated to LAM-capable parts in the future, they will suffer
>>> performance drop even though they are good tenents (i.e., won't try to
>>> use LAM).
>>>
>>> IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
>>> when EPT=on cannot outweigh the performance impact. So, I vote to
>>> document in changelog or comments that:
>>> A guest can enable LAM for userspace pointers when EPT=on even if LAM
>>> isn't exposed to it. KVM doens't prevent this out of performance
>>> consideration
>> How about add the comments on the code:
>>
>> +       /*
>> +        * A guest can enable LAM for userspace pointers when EPT=on on a
>> +        * processor supporting LAM even if LAM isn't exposed to it.
>> +        * KVM doesn't prevent this out of performance considerations.
>> +        */
>>           if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
>>                   vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 |
>> X86_CR3_LAM_U57;
>>
>>
> I would say we should at least call out in the changelog, but I don't have
> opinion on whether to have this comment around this code or not.
OK, I will also add it to the changelog.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-26  8:43                       ` Huang, Kai
@ 2023-04-26 10:52                         ` Binbin Wu
  2023-04-27 13:23                           ` Huang, Kai
  0 siblings, 1 reply; 48+ messages in thread
From: Binbin Wu @ 2023-04-26 10:52 UTC (permalink / raw)
  To: Huang, Kai, Gao, Chao, Christopherson,, Sean
  Cc: kvm, pbonzini, robert.hu, Guo, Xuelian



On 4/26/2023 4:43 PM, Huang, Kai wrote:
> On Wed, 2023-04-26 at 11:05 +0800, Gao, Chao wrote:
>> On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
>>> ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
>>> KVM.
>>>
>>> Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
>>> seems there isn't because AFAICT the bits in CR4 are used to control super mode
>>> linear address but not LAM in global?
>> Right.
>>
>>> So if it is true, then it appears hardware depends on CPUID purely to decide
>>> whether to perform LAM or not.
>>>
>>> Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
>>> hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
>>> causing any trouble (because the hardware actually supports this feature)?
>> Yes. But I think it is a non-issue ...
>>
>>> If it's true, it seems we should trap CR3 (at least loading) when hardware
>>> supports LAM but it's not exposed to the guest, so that KVM can correctly reject
>>> any LAM control bits when guest illegally does so?
>>>
>> Other features which need no explicit enablement (like AVX and other
>> new instructions) have the same problem.
> OK.
>
>> The impact is some guests can use features which they are not supposed
>> to use. Then they might be broken after migration or kvm's instruction
>> emulation. But they put themselves at stake, KVM shouldn't be blamed.
>>
>> The downside of intercepting CR3 is the performance impact on existing
>> VMs (all with old CPU models and thus all have no LAM). If they are
>> migrated to LAM-capable parts in the future, they will suffer
>> performance drop even though they are good tenents (i.e., won't try to
>> use LAM).
>>
>> IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
>> when EPT=on cannot outweigh the performance impact. So, I vote to
>> document in changelog or comments that:
>> A guest can enable LAM for userspace pointers when EPT=on even if LAM
>> isn't exposed to it. KVM doens't prevent this out of performance
>> consideration
> Yeah performance impact is the concern.  I agree we can just call out this in
> changelog and/or comments.  Just want to make sure this is mentioned/discussed.
>
> My main concern is, as (any) VMEXIT saves guest's CR3 to VMCS's GUEST_CR3, KVM
> may see GUEST_CR3 containing invalid control bits (because KVM believes the
> guest doesn't support those feature bits), and if KVM code carelessly uses
> WARN() around those code, a malicious guest may be able to attack host, which
> means we need to pay more attention to code review around GUEST_CR3 in the
> future.

How about doing the guest CR3 LAM bits check based on the capability of the
physical processor instead of the vCPU?
That is, KVM wouldn't prevent the guest from setting CR3 LAM bits if the
processor supports LAM.
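
(Roughly, the suggestion in code form -- a sketch reusing the cr3_ctrl_bits
setup shown earlier in this thread:)

        /*
         * Gate the CR3 LAM bits on host capability rather than guest CPUID,
         * so a VMEXIT never surfaces "unexpected" LAM bits to KVM.
         */
        if (boot_cpu_has(X86_FEATURE_LAM))
                vcpu->arch.cr3_ctrl_bits |= X86_CR3_LAM_U48 | X86_CR3_LAM_U57;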
>
> Anyway not intercepting CR3 is fine to me, and will leave this to others.
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-22  4:43                   ` Chao Gao
@ 2023-04-27 13:19                     ` Huang, Kai
  2023-04-29  4:56                       ` Binbin Wu
  0 siblings, 1 reply; 48+ messages in thread
From: Huang, Kai @ 2023-04-27 13:19 UTC (permalink / raw)
  To: Gao, Chao, binbin.wu
  Cc: kvm, pbonzini, robert.hu, Christopherson,, Sean, Guo, Xuelian

On Sat, 2023-04-22 at 12:43 +0800, Gao, Chao wrote:
> On Sat, Apr 22, 2023 at 11:32:26AM +0800, Binbin Wu wrote:
> > Kai,
> > 
> > Thanks for your inputs.
> > 
> > I rephrased the changelog as following:
> > 
> > 
> > LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
> > masking for user mode pointers.
> > 
> > To support LAM in KVM, CR3 validity checks and shadow paging handling need to
> > be
> > modified accordingly.
> > 
> > == CR3 validity Check ==
> > When LAM is supported, CR3 LAM bits are allowed to be set and the check of
> > CR3
> > needs to be modified.
> 
> it is better to describe the hardware change here:
> 
> On processors that enumerate support for LAM, CR3 register allows
> LAM_U48/U57 to be set and VM entry allows LAM_U48/U57 to be set in both
> GUEST_CR3 and HOST_CR3 fields.
> 
> To emulate LAM hardware behavior, KVM needs to
> 1. allow LAM_U48/U57 to be set to the CR3 register by guest or userspace
> 2. allow LAM_U48/U57 to be set to the GUES_CR3/HOST_CR3 fields in vmcs12

Agreed.  This is clearer.

> 
> > Add a helper kvm_vcpu_is_legal_cr3() and use it instead of
> > kvm_vcpu_is_legal_gpa()
> > to do the new CR3 checks in all existing CR3 checks as following:
> > When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
> > Non-nested case
> > - When EPT on, CR3 is fully under control of guest.
> > - When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3() during
> >   CR3 VMExit handling.
> > Nested case, from L0's perspective, we care about:
> > - L1's CR3 register (VMCS01's GUEST_CR3), it's the same as non-nested case.
> > - L1's VMCS to run L2 guest (i.e. VMCS12's HOST_CR3 and VMCS12's GUEST_CR3)
> >   Two paths related:
> >   1. L0 emulates a VMExit from L2 to L1 using VMCS01 to reflect VMCS12
> >          nested_vm_exit()
> >          -> load_vmcs12_host_state()
> >                -> nested_vmx_load_cr3()     //check VMCS12's HOST_CR3
> 
> This is just a byproduct of using a unified function, i.e.,
> nested_vmx_load_cr3() to load CR3 for both nested VM entry and VM exit.
> 
> LAM spec says:
> 
> VM entry checks the values of the CR3 and CR4 fields in the guest-area
> and host-state area of the VMCS. In particular, the bits in these fields
> that correspond to bits reserved in the corresponding register are
> checked and must be 0.
> 
> It doesn't mention any check on VM exit. So, it looks to me that VM exit
> doesn't do consistency checks. Then, I think there is no need to call
> out this path.

But this isn't a true VMEXIT -- it is indeed a VMENTER from L0 to L1 using
VMCS01 but with an environment that allows L1 to run its VMEXIT handler just
like it received a VMEXIT from L2.

However I fully agree those code paths are details and shouldn't be changelog
material.

How about the changelog below?

Add support to allow the guest to set two new CR3 non-address control bits so
that the guest can enable the new Intel CPU feature Linear Address Masking
(LAM).

LAM modifies the checking that is applied to 64-bit linear addresses, allowing
software to use the untranslated address bits for metadata.  For user mode
linear addresses, LAM uses two new CR3 non-address bits, LAM_U48 (bit 62) and
LAM_U57 (bit 61), to configure the metadata bits for 4-level paging and 5-level
paging respectively.  LAM also changes VMENTER to allow both bits to be set in
VMCS's HOST_CR3 and GUEST_CR3 to support virtualization.

When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set any of
the two LAM control bits.  However when EPT is off, the actual CR3 used by the
guest is generated from the shadow MMU root which is different from the CR3 that
is *set* by the guest, and KVM needs to manually apply any active control bits
to VMCS's GUEST_CR3 based on the cached CR3 *seen* by the guest.

KVM manually checks the guest's CR3 to make sure it points to a valid guest
physical address (i.e. to support a smaller MAXPHYSADDR in the guest).  Extend
this check to allow the two LAM control bits to be set.  And to make such a
check generic, introduce a new field 'cr3_ctrl_bits' in the vcpu to record all
feature control bits that are allowed to be set by the guest.

In case of nested, for a guest which supports LAM, both VMCS12's HOST_CR3 and
GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when L0 enters
L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2 directly.  KVM also
manually checks that VMCS12's HOST_CR3 and GUEST_CR3 are valid physical
addresses.  Extend such checks to allow the new LAM control bits too.

Note, LAM doesn't have a global enable bit in any control register to turn
on/off LAM completely, but purely depends on hardware's CPUID to determine
whether to perform the LAM check or not.  That means, when EPT is on, even when
KVM doesn't expose LAM to the guest, the guest can still set LAM control bits
in CR3 w/o causing problems.  This is an unfortunate virtualization hole.  KVM
could choose to intercept CR3 in this case and inject a fault, but this would
hurt performance when running a normal VM w/o LAM support.  This is
undesirable.  Just choose to let the guest do such an illegal thing, as the
worst case is the guest being killed when KVM eventually finds out about such
illegal behaviour, and then the guest is to blame.
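
(To illustrate the EPT-off part above -- a rough sketch that ignores PCID and
other details, using the helper name proposed in this series:)

        /*
         * vmx_load_mmu_pgd(), EPT off: the shadow MMU root becomes the real
         * CR3, so fold the guest's *active* LAM bits into GUEST_CR3.
         */
        guest_cr3 = root_hpa | kvm_get_active_cr3_ctrl_bits(vcpu);
        vmcs_writel(GUEST_CR3, guest_cr3);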


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-26 10:52                         ` Binbin Wu
@ 2023-04-27 13:23                           ` Huang, Kai
  0 siblings, 0 replies; 48+ messages in thread
From: Huang, Kai @ 2023-04-27 13:23 UTC (permalink / raw)
  To: Christopherson,, Sean, binbin.wu, Gao, Chao
  Cc: kvm, pbonzini, robert.hu, Guo, Xuelian

On Wed, 2023-04-26 at 18:52 +0800, Binbin Wu wrote:
> 
> On 4/26/2023 4:43 PM, Huang, Kai wrote:
> > On Wed, 2023-04-26 at 11:05 +0800, Gao, Chao wrote:
> > > On Wed, Apr 26, 2023 at 06:48:21AM +0800, Huang, Kai wrote:
> > > > ... when EPT is on, as you mentioned guest can update CR3 w/o causing VMEXIT to
> > > > KVM.
> > > > 
> > > > Is there any global enabling bit in any of CR to turn on/off LAM globally?  It
> > > > seems there isn't because AFAICT the bits in CR4 are used to control super mode
> > > > linear address but not LAM in global?
> > > Right.
> > > 
> > > > So if it is true, then it appears hardware depends on CPUID purely to decide
> > > > whether to perform LAM or not.
> > > > 
> > > > Which means, IIRC, when EPT is on, if we don't expose LAM to the guest on the
> > > > hardware that supports LAM, I think guest can still enable LAM in CR3 w/o
> > > > causing any trouble (because the hardware actually supports this feature)?
> > > Yes. But I think it is a non-issue ...
> > > 
> > > > If it's true, it seems we should trap CR3 (at least loading) when hardware
> > > > supports LAM but it's not exposed to the guest, so that KVM can correctly reject
> > > > any LAM control bits when guest illegally does so?
> > > > 
> > > Other features which need no explicit enablement (like AVX and other
> > > new instructions) have the same problem.
> > OK.
> > 
> > > The impact is some guests can use features which they are not supposed
> > > to use. Then they might be broken after migration or kvm's instruction
> > > emulation. But they put themselves at stake, KVM shouldn't be blamed.
> > > 
> > > The downside of intercepting CR3 is the performance impact on existing
> > > VMs (all with old CPU models and thus all have no LAM). If they are
> > > migrated to LAM-capable parts in the future, they will suffer
> > > performance drop even though they are good tenents (i.e., won't try to
> > > use LAM).
> > > 
> > > IMO, the value of preventing some guests from setting LAM_U48/U57 in CR3
> > > when EPT=on cannot outweigh the performance impact. So, I vote to
> > > document in changelog or comments that:
> > > A guest can enable LAM for userspace pointers when EPT=on even if LAM
> > > isn't exposed to it. KVM doens't prevent this out of performance
> > > consideration
> > Yeah performance impact is the concern.  I agree we can just call out this in
> > changelog and/or comments.  Just want to make sure this is mentioned/discussed.
> > 
> > My main concern is, as (any) VMEXIT saves guest's CR3 to VMCS's GUEST_CR3, KVM
> > may see GUEST_CR3 containing invalid control bits (because KVM believes the
> > guest doesn't support those feature bits), and if KVM code carelessly uses
> > WARN() around those code, a malicious guest may be able to attack host, which
> > means we need to pay more attention to code review around GUEST_CR3 in the
> > future.
> 
> How about do guest CR3 LAM bits check based on the capability of 
> physical processor instead of vCPU?
> That is KVM doesn't prevent guest from setting CR3 LAM bits if the 
> processor supports LAM.
> > 
> 

Doesn't seem to have any benefit, as this looks like another hack (which also
breaks the rules of virtualization) in order to (try to) fix a virtualization
hole.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}
  2023-04-27 13:19                     ` Huang, Kai
@ 2023-04-29  4:56                       ` Binbin Wu
  0 siblings, 0 replies; 48+ messages in thread
From: Binbin Wu @ 2023-04-29  4:56 UTC (permalink / raw)
  To: Huang, Kai, Gao, Chao
  Cc: kvm, pbonzini, robert.hu, Christopherson,, Sean, Guo, Xuelian



On 4/27/2023 9:19 PM, Huang, Kai wrote:
> On Sat, 2023-04-22 at 12:43 +0800, Gao, Chao wrote:
>> On Sat, Apr 22, 2023 at 11:32:26AM +0800, Binbin Wu wrote:
>>> Kai,
>>>
>>> Thanks for your inputs.
>>>
>>> I rephrased the changelog as following:
>>>
>>>
>>> LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
>>> masking for user mode pointers.
>>>
>>> To support LAM in KVM, CR3 validity checks and shadow paging handling need to
>>> be
>>> modified accordingly.
>>>
>>> == CR3 validity Check ==
>>> When LAM is supported, CR3 LAM bits are allowed to be set and the check of
>>> CR3
>>> needs to be modified.
>> it is better to describe the hardware change here:
>>
>> On processors that enumerate support for LAM, CR3 register allows
>> LAM_U48/U57 to be set and VM entry allows LAM_U48/U57 to be set in both
>> GUEST_CR3 and HOST_CR3 fields.
>>
>> To emulate LAM hardware behavior, KVM needs to
>> 1. allow LAM_U48/U57 to be set to the CR3 register by guest or userspace
>> 2. allow LAM_U48/U57 to be set to the GUES_CR3/HOST_CR3 fields in vmcs12
> Agreed.  This is more clearer.
>
>>> Add a helper kvm_vcpu_is_legal_cr3() and use it instead of
>>> kvm_vcpu_is_legal_gpa()
>>> to do the new CR3 checks in all existing CR3 checks as following:
>>> When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
>>> Non-nested case
>>> - When EPT on, CR3 is fully under control of guest.
>>> - When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3() during
>>>    CR3 VMExit handling.
>>> Nested case, from L0's perspective, we care about:
>>> - L1's CR3 register (VMCS01's GUEST_CR3), it's the same as non-nested case.
>>> - L1's VMCS to run L2 guest (i.e. VMCS12's HOST_CR3 and VMCS12's GUEST_CR3)
>>>    Two paths related:
>>>    1. L0 emulates a VMExit from L2 to L1 using VMCS01 to reflect VMCS12
>>>           nested_vm_exit()
>>>           -> load_vmcs12_host_state()
>>>                 -> nested_vmx_load_cr3()     //check VMCS12's HOST_CR3
>> This is just a byproduct of using a unified function, i.e.,
>> nested_vmx_load_cr3() to load CR3 for both nested VM entry and VM exit.
>>
>> LAM spec says:
>>
>> VM entry checks the values of the CR3 and CR4 fields in the guest-area
>> and host-state area of the VMCS. In particular, the bits in these fields
>> that correspond to bits reserved in the corresponding register are
>> checked and must be 0.
>>
>> It doesn't mention any check on VM exit. So, it looks to me that VM exit
>> doesn't do consistency checks. Then, I think there is no need to call
>> out this path.
> But this isn't a true VMEXIT -- it is indeed a VMENTER from L0 to L1 using
> VMCS01 but with an environment that allows L1 to run its VMEXIT handler just
> like it received a VMEXIT from L2.
>
> However I fully agree those code paths are details and shouldn't be changelog
> material.
>
> How about below changelog?
>
> Add support to allow guest to set two new CR3 non-address control bits to allow
> guest to enable the new Intel CPU feature Linear Address Masking (LAM).
>
> LAM modifies the checking that is applied to 64-bit linear addresses, allowing
> software to use of the untranslated address bits for metadata.  For user mode
> linear address, LAM uses two new CR3 non-address bits LAM_U48 (bit 62) and
> LAM_U57 (bit 61) to configure the metadata bits for 4-level paging and 5-level
> paging respectively.  LAM also changes VMENTER to allow both bits to be set in
> VMCS's HOST_CR3 and GUEST_CR3 to support virtualization.
>
> When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set any of
> the two LAM control bits.  However when EPT is off, the actual CR3 used by the
> guest is generated from the shadow MMU root which is different from the CR3 that
> is *set* by the guest, and KVM needs to manually apply any active control bits
> to VMCS's GUEST_CR3 based on the cached CR3 *seen* by the guest.
>
> KVM manually checks guest's CR3 to make sure it points to a valid guest physical
> address (i.e. to support smaller MAXPHYSADDR in the guest).  Extend this check
> to allow the two LAM control bits to be set.  And to make such check generic,
> introduce a new field 'cr3_ctrl_bits' to vcpu to record all feature control bits
> that are allowed to be set by the guest.
>
> In case of nested, for a guest which supports LAM, both VMCS12's HOST_CR3 and
> GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when L0 enters
> L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2 directly.  KVM also
> manually checks VMCS12's HOST_CR3 and GUEST_CR3 being valid physical address.
> Extend such check to allow the new LAM control bits too.
>
> Note, LAM doesn't have a global enable bit in any control register to turn
> on/off LAM completely, but purely depends on hardware's CPUID to determine
> whether to perform LAM check or not.  That means, when EPT is on, even when KVM
> doesn't expose LAM to guest, the guest can still set LAM control bits in CR3 w/o
> causing problem.  This is an unfortunate virtualization hole.  KVM could choose
> to intercept CR3 in this case and inject fault but this would hurt performance
> when running a normal VM w/o LAM support.  This is undesirable.  Just choose to
> let the guest do such illegal thing as the worst case is guest being killed when
> KVM eventually find out such illegal behaviour and that is the guest to blame.
Thanks for the advice.



^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2023-04-29  4:56 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-04 13:09 [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling Binbin Wu
2023-04-04 13:09 ` [PATCH v7 1/5] KVM: x86: Virtualize CR4.LAM_SUP Binbin Wu
2023-04-04 13:09 ` [PATCH v7 2/5] KVM: x86: Virtualize CR3.LAM_{U48,U57} Binbin Wu
2023-04-06 12:57   ` Huang, Kai
2023-04-09 11:36     ` Binbin Wu
2023-04-11 23:11       ` Huang, Kai
2023-04-12 11:58   ` Huang, Kai
2023-04-13  1:36     ` Binbin Wu
2023-04-13  2:27       ` Huang, Kai
2023-04-13  4:45         ` Binbin Wu
2023-04-13  9:13           ` Huang, Kai
2023-04-21  6:35             ` Binbin Wu
2023-04-21 11:43               ` Huang, Kai
2023-04-21 15:32                 ` Chao Gao
2023-04-22  4:51                   ` Chao Gao
2023-04-22  8:14                     ` Huang, Kai
2023-04-22  3:32                 ` Binbin Wu
2023-04-22  4:43                   ` Chao Gao
2023-04-27 13:19                     ` Huang, Kai
2023-04-29  4:56                       ` Binbin Wu
2023-04-25 22:48                   ` Huang, Kai
2023-04-26  3:05                     ` Chao Gao
2023-04-26  5:13                       ` Binbin Wu
2023-04-26  8:44                         ` Huang, Kai
2023-04-26  8:50                           ` Binbin Wu
2023-04-26  8:43                       ` Huang, Kai
2023-04-26 10:52                         ` Binbin Wu
2023-04-27 13:23                           ` Huang, Kai
2023-04-17  7:24   ` Chao Gao
2023-04-17  8:02     ` Binbin Wu
2023-04-04 13:09 ` [PATCH v7 3/5] KVM: x86: Introduce untag_addr() in kvm_x86_ops Binbin Wu
2023-04-18  3:08   ` Zeng Guang
2023-04-18  3:34     ` Binbin Wu
2023-04-19  2:30   ` Chao Gao
2023-04-19  3:08     ` Binbin Wu
2023-04-21  7:48       ` Binbin Wu
2023-04-21  8:21         ` Chao Gao
2023-04-04 13:09 ` [PATCH v7 4/5] KVM: x86: Untag address when LAM applicable Binbin Wu
2023-04-06 13:20   ` Huang, Kai
2023-04-10  3:35     ` Binbin Wu
2023-04-18  3:28   ` Zeng Guang
2023-04-18  3:38     ` Binbin Wu
2023-04-19  6:43   ` Chao Gao
2023-04-21  7:57     ` Binbin Wu
2023-04-21  8:36       ` Chao Gao
2023-04-21  9:13         ` Binbin Wu
2023-04-04 13:09 ` [PATCH v7 5/5] KVM: x86: Expose LAM feature to userspace VMM Binbin Wu
2023-04-21  9:40 ` [PATCH v7 0/5] Linear Address Masking (LAM) KVM Enabling Binbin Wu
