* [PATCH v7 0/8] KVM: PKS Virtualization support
@ 2022-04-24 10:15 Lei Wang
  2022-04-24 10:15 ` [PATCH v7 1/8] KVM: VMX: Introduce PKS VMCS fields Lei Wang
                   ` (8 more replies)
  0 siblings, 9 replies; 24+ messages in thread
From: Lei Wang @ 2022-04-24 10:15 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro
  Cc: lei4.wang, chenyi.qiang, kvm, linux-kernel

This patch series is based on top of v10 PKS core support kernel patchset:
https://lore.kernel.org/lkml/20220419170649.1022246-1-ira.weiny@intel.com/

---

Protection Keys for Supervisor Pages (PKS) is a feature that extends the
Protection Keys architecture to support thread-specific permission
restrictions on supervisor pages.

PKS works similarly to the existing PKU feature (Protection Keys for User
pages). Both perform an additional check after the normal paging permission
checks are done. Accesses or writes can be disabled via an MSR update,
without TLB flushes, when permissions change. If this additional check is
violated, a #PF occurs and the PFEC.PK bit is set.

PKS introduces the MSR IA32_PKRS to manage supervisor protection key
rights. The MSR contains 16 pairs of ADi and WDi bits. Each pair governs
the group of pages whose protection key, set in the leaf paging-structure
entries (bits 62:59), matches that index. Currently, IA32_PKRS is not
supported by the XSAVES architecture.
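
As a rough illustration of that layout (a minimal, hypothetical sketch that
is not part of this patchset; the helper names are made up):

/*
 * Illustration only: IA32_PKRS holds 16 (ADi, WDi) pairs in its low 32
 * bits; key N uses bit 2*N for access-disable and bit 2*N+1 for
 * write-disable.  The high 32 bits are reserved.
 */
#include <stdint.h>

#define PKR_AD_BIT(pkey)	(1u << ((pkey) * 2))
#define PKR_WD_BIT(pkey)	(1u << ((pkey) * 2 + 1))

/* Return a PKRS value that additionally write-protects pages tagged @pkey. */
static inline uint32_t pkrs_deny_write(uint32_t pkrs, unsigned int pkey)
{
	return pkrs | PKR_WD_BIT(pkey);
}

/* Return a PKRS value that additionally blocks all data access for @pkey. */
static inline uint32_t pkrs_deny_access(uint32_t pkrs, unsigned int pkey)
{
	return pkrs | PKR_AD_BIT(pkey);
}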

This patchset adds virtualization support for PKS in KVM. It implements
PKS CPUID enumeration, VM-entry/VM-exit configuration, MSR exposure,
nested support, etc. Currently, PKS is not yet supported for shadow
paging.

Detailed information about PKS can be found in the latest Intel 64 and
IA-32 Architectures Software Developer's Manual.

---

Changelogs:

v6->v7
- Add documentation to note that the cache tracking for PKRS is a nice-to-have,
  and we needn't hesitate to rip it out in the future if there's a strong
  reason to drop the caching. (Sean)
- Blindly reading PKRU/PKRS is wrong, fixed. (Sean)
- Add a non-inline helper kvm_mmu_pkr_bits() to read PKR bits. (Sean)
- Delete the comment for exposing the PKS because the pattern is common and the
  behavior is self-explanatory. (Sean)
- Add a helper vmx_set_host_pkrs() for setting host PKRS and rewrite the
  related code for conciseness. (Sean)
- Align an indentation in arch/x86/kvm/vmx/nested.c. (Sean)
- Read the current PKRS if from_vmentry == false under the nested condition.
  (Sean)
- v6: https://lore.kernel.org/lkml/20220221080840.7369-1-chenyi.qiang@intel.com/

v5->v6
- PKRS is preserved on INIT. Add the PKRS reset operation in kvm_vcpu_reset.
  (Sean)
- Track pkrs as a u32. Add code to WARN on bits 63:32 being set in the VMCS
  field. (Sean)
- Adjust the MSR intercept and entry/exit controls in the VMCS according to
  guest CPUID. This resolves the issue where userspace re-enables this feature.
  (Sean)
- Split the VMX restriction on PKS support (entry/exit load controls) out of
  common x86, and put the TDP restriction together with PKU in common x86.
  (Sean)
- Thanks to Sean for revising the comments in mmu.c related to
  update_pkr_bitmask, which clarify the pkr bitmask cache usage.
- v5: https://lore.kernel.org/lkml/20210811101126.8973-1-chenyi.qiang@intel.com/

v4->v5
- Make the setting of the MSR intercept/VMCS control bits not dependent on
  guest CR4.PKS, and set them if PKS is exposed to the guest. (Suggested by Sean)
- Add pkrs to standard register caching mechanism to help update
  vcpu->arch.pkrs on demand. Add related helper functions. (Suggested by Sean)
- Do the real pkrs update in VMCS field in vmx_vcpu_reset and
  vmx_sync_vmcs_host_state(). (Sean)
- Add a new mmu_role cr4_pks instead of smushing PKU and PKS together.
  (Sean & Paolo)
- v4: https://lore.kernel.org/lkml/20210205083706.14146-1-chenyi.qiang@intel.com/

v3->v4
- Make the MSR intercept and load-controls setting depend on the CR4.PKS value
- shadow the guest PKRS and make it usable in PKS emulation
- add the cr4_pke and cr4_pks checks in the pkr_mask update
- squash PATCH 2 and PATCH 5 to make the dependencies easier to follow
- v3: https://lore.kernel.org/lkml/20201105081805.5674-1-chenyi.qiang@intel.com/

v2->v3:
- No functional changes since the last submission
- rebase on the latest PKS kernel support:
  https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
- add MSR_IA32_PKRS to the vmx_possible_passthrough_msrs[]
- RFC v2: https://lore.kernel.org/lkml/20201014021157.18022-1-chenyi.qiang@intel.com/

v1->v2:
- rebase on the latest PKS kernel support:
  https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
- add a kvm-unit-tests case for PKS
- add the check in kvm_init_msr_list for PKRS
- place the X86_CR4_PKS in mmu_role_bits in kvm_set_cr4
- add the support to expose VM_{ENTRY, EXIT}_LOAD_IA32_PKRS in nested
  VMX MSR
- RFC v1: https://lore.kernel.org/lkml/20200807084841.7112-1-chenyi.qiang@intel.com/

---

Chenyi Qiang (7):
  KVM: VMX: Introduce PKS VMCS fields
  KVM: VMX: Add proper cache tracking for PKRS
  KVM: X86: Expose IA32_PKRS MSR
  KVM: MMU: Rename the pkru to pkr
  KVM: MMU: Add support for PKS emulation
  KVM: VMX: Expose PKS to guest
  KVM: VMX: Enable PKS for nested VM

Lei Wang (1):
  KVM: MMU: Add helper function to get pkr bits

 arch/x86/include/asm/kvm_host.h |  17 +++--
 arch/x86/include/asm/vmx.h      |   6 ++
 arch/x86/kvm/cpuid.c            |  13 +++-
 arch/x86/kvm/kvm_cache_regs.h   |   7 ++
 arch/x86/kvm/mmu.h              |  29 +++----
 arch/x86/kvm/mmu/mmu.c          | 130 +++++++++++++++++++++++---------
 arch/x86/kvm/vmx/capabilities.h |   6 ++
 arch/x86/kvm/vmx/nested.c       |  36 ++++++++-
 arch/x86/kvm/vmx/vmcs.h         |   1 +
 arch/x86/kvm/vmx/vmcs12.c       |   2 +
 arch/x86/kvm/vmx/vmcs12.h       |   4 +
 arch/x86/kvm/vmx/vmx.c          |  85 +++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.h          |  14 +++-
 arch/x86/kvm/x86.c              |   9 ++-
 arch/x86/kvm/x86.h              |   8 ++
 arch/x86/mm/pkeys.c             |   6 ++
 include/linux/pks.h             |   7 ++
 17 files changed, 301 insertions(+), 79 deletions(-)

-- 
2.25.1



* [PATCH v7 1/8] KVM: VMX: Introduce PKS VMCS fields
  2022-04-24 10:15 [PATCH v7 0/8] KVM: PKS Virtualization support Lei Wang
@ 2022-04-24 10:15 ` Lei Wang
  2022-05-24 20:55   ` Sean Christopherson
  2022-04-24 10:15 ` [PATCH v7 2/8] KVM: VMX: Add proper cache tracking for PKRS Lei Wang
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Lei Wang @ 2022-04-24 10:15 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro
  Cc: lei4.wang, chenyi.qiang, kvm, linux-kernel

From: Chenyi Qiang <chenyi.qiang@intel.com>

PKS (Protection Keys for Supervisor Pages) is a feature that extends the
Protection Keys architecture to support thread-specific permission
restrictions on supervisor pages.

A new MSR (PKRS) is defined to support PKS; it holds a set of
permissions associated with each protection domain.

Two VMCS fields {HOST,GUEST}_IA32_PKRS are introduced in
{host,guest}-state area to store the respective values of PKRS.

Every VM exit saves PKRS into the guest-state area.
If VM_EXIT_LOAD_IA32_PKRS = 1, VM exit loads PKRS from the host-state
area.
If VM_ENTRY_LOAD_IA32_PKRS = 1, VM entry loads PKRS from the guest-state
area.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
---
 arch/x86/include/asm/vmx.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ffaa3156a4e..7962d506ba91 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -95,6 +95,7 @@
 #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
 #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
 #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
+#define VM_EXIT_LOAD_IA32_PKRS			0x20000000
 
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
@@ -108,6 +109,7 @@
 #define VM_ENTRY_LOAD_BNDCFGS                   0x00010000
 #define VM_ENTRY_PT_CONCEAL_PIP			0x00020000
 #define VM_ENTRY_LOAD_IA32_RTIT_CTL		0x00040000
+#define VM_ENTRY_LOAD_IA32_PKRS			0x00400000
 
 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
 
@@ -245,12 +247,16 @@ enum vmcs_field {
 	GUEST_BNDCFGS_HIGH              = 0x00002813,
 	GUEST_IA32_RTIT_CTL		= 0x00002814,
 	GUEST_IA32_RTIT_CTL_HIGH	= 0x00002815,
+	GUEST_IA32_PKRS			= 0x00002818,
+	GUEST_IA32_PKRS_HIGH		= 0x00002819,
 	HOST_IA32_PAT			= 0x00002c00,
 	HOST_IA32_PAT_HIGH		= 0x00002c01,
 	HOST_IA32_EFER			= 0x00002c02,
 	HOST_IA32_EFER_HIGH		= 0x00002c03,
 	HOST_IA32_PERF_GLOBAL_CTRL	= 0x00002c04,
 	HOST_IA32_PERF_GLOBAL_CTRL_HIGH	= 0x00002c05,
+	HOST_IA32_PKRS			= 0x00002c06,
+	HOST_IA32_PKRS_HIGH		= 0x00002c07,
 	PIN_BASED_VM_EXEC_CONTROL       = 0x00004000,
 	CPU_BASED_VM_EXEC_CONTROL       = 0x00004002,
 	EXCEPTION_BITMAP                = 0x00004004,
-- 
2.25.1



* [PATCH v7 2/8] KVM: VMX: Add proper cache tracking for PKRS
  2022-04-24 10:15 [PATCH v7 0/8] KVM: PKS Virtualization support Lei Wang
  2022-04-24 10:15 ` [PATCH v7 1/8] KVM: VMX: Introduce PKS VMCS fields Lei Wang
@ 2022-04-24 10:15 ` Lei Wang
  2022-05-24 21:00   ` Sean Christopherson
  2022-04-24 10:15 ` [PATCH v7 3/8] KVM: X86: Expose IA32_PKRS MSR Lei Wang
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Lei Wang @ 2022-04-24 10:15 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro
  Cc: lei4.wang, chenyi.qiang, kvm, linux-kernel

From: Chenyi Qiang <chenyi.qiang@intel.com>

Add PKRS caching into the standard register caching mechanism in order
to take advantage of the availability checks provided by regs_avail.

This is because vcpu->arch.pkrs will rarely be accessed by KVM, only for
host userspace MSR reads and GVA->GPA translation in the following
patches. It is unnecessary to keep it up-to-date at all times.

It should also be noted that the potential benefits of this caching are
tenuous because the MSR read is not a hot path. It's a nice-to-have, so
don't hesitate to rip it out in the future if there's a strong reason to
drop the caching.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Co-developed-by: Lei Wang <lei4.wang@intel.com>
Signed-off-by: Lei Wang <lei4.wang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/kvm_cache_regs.h   |  7 +++++++
 arch/x86/kvm/vmx/vmx.c          | 11 +++++++++++
 arch/x86/kvm/vmx/vmx.h          |  3 ++-
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e0c0f0e1f754..f5455bada8cd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -180,6 +180,7 @@ enum kvm_reg {
 	VCPU_EXREG_SEGMENTS,
 	VCPU_EXREG_EXIT_INFO_1,
 	VCPU_EXREG_EXIT_INFO_2,
+	VCPU_EXREG_PKRS,
 };
 
 enum {
@@ -638,6 +639,7 @@ struct kvm_vcpu_arch {
 	unsigned long cr8;
 	u32 host_pkru;
 	u32 pkru;
+	u32 pkrs;
 	u32 hflags;
 	u64 efer;
 	u64 apic_base;
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 3febc342360c..2b2540ca584f 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -177,6 +177,13 @@ static inline u64 kvm_read_edx_eax(struct kvm_vcpu *vcpu)
 		| ((u64)(kvm_rdx_read(vcpu) & -1u) << 32);
 }
 
+static inline u32 kvm_read_pkrs(struct kvm_vcpu *vcpu)
+{
+	if (!kvm_register_is_available(vcpu, VCPU_EXREG_PKRS))
+		static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_PKRS);
+	return vcpu->arch.pkrs;
+}
+
 static inline void enter_guest_mode(struct kvm_vcpu *vcpu)
 {
 	vcpu->arch.hflags |= HF_GUEST_MASK;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 04d170c4b61e..395b2deb76aa 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2258,6 +2258,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 {
 	unsigned long guest_owned_bits;
+	u64 ia32_pkrs;
 
 	kvm_register_mark_available(vcpu, reg);
 
@@ -2292,6 +2293,16 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 		vcpu->arch.cr4 &= ~guest_owned_bits;
 		vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
 		break;
+	case VCPU_EXREG_PKRS:
+		/*
+		 * The high 32 bits of PKRS are reserved and attempting to write
+		 * non-zero value will cause #GP. KVM intentionally drops those
+		 * bits.
+		 */
+		ia32_pkrs = vmcs_read64(GUEST_IA32_PKRS);
+		WARN_ON_ONCE(ia32_pkrs >> 32);
+		vcpu->arch.pkrs = ia32_pkrs;
+		break;
 	default:
 		KVM_BUG_ON(1, vcpu->kvm);
 		break;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 9c6bfcd84008..661df9584b12 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -499,7 +499,8 @@ BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL)
 				(1 << VCPU_EXREG_CR3) |         \
 				(1 << VCPU_EXREG_CR4) |         \
 				(1 << VCPU_EXREG_EXIT_INFO_1) | \
-				(1 << VCPU_EXREG_EXIT_INFO_2))
+				(1 << VCPU_EXREG_EXIT_INFO_2) | \
+				(1 << VCPU_EXREG_PKRS))
 
 static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
 {
-- 
2.25.1



* [PATCH v7 3/8] KVM: X86: Expose IA32_PKRS MSR
  2022-04-24 10:15 [PATCH v7 0/8] KVM: PKS Virtualization support Lei Wang
  2022-04-24 10:15 ` [PATCH v7 1/8] KVM: VMX: Introduce PKS VMCS fields Lei Wang
  2022-04-24 10:15 ` [PATCH v7 2/8] KVM: VMX: Add proper cache tracking for PKRS Lei Wang
@ 2022-04-24 10:15 ` Lei Wang
  2022-05-24 22:11   ` Sean Christopherson
  2022-04-24 10:15 ` [PATCH v7 4/8] KVM: MMU: Rename the pkru to pkr Lei Wang
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Lei Wang @ 2022-04-24 10:15 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro
  Cc: lei4.wang, chenyi.qiang, kvm, linux-kernel

From: Chenyi Qiang <chenyi.qiang@intel.com>

Protection Keys for Supervisor Pages (PKS) uses the IA32_PKRS MSR (PKRS) at
index 0x6E1 to allow software to manage supervisor key rights, i.e. it
can enforce additional permission checks besides the normal paging
protections via an MSR update, without TLB flushes, when permissions
change.

For performance reasons, KVM disables interception of PKRS when PKS is
supported in the guest so that PKRS can be accessed without a VM exit.

PKS introduces dedicated control fields in the VMCS to switch PKRS, which
only handle the restore part. In addition, every VM exit saves PKRS into
the guest-state area of the VMCS, while VM entry won't save the host value
due to the expectation that the host won't change the MSR often. Update
the host's value in the VMCS manually if the MSR has been changed by the
kernel since the last time the VMCS was run.

Introduce a function get_current_pkrs() in arch/x86/mm/pkeys.c to export
the per-CPU variable pkrs_cache and avoid frequent RDMSRs of PKRS.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Co-developed-by: Lei Wang <lei4.wang@intel.com>
Signed-off-by: Lei Wang <lei4.wang@intel.com>
---
 arch/x86/kvm/vmx/vmcs.h |  1 +
 arch/x86/kvm/vmx/vmx.c  | 63 +++++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/vmx.h  |  9 +++++-
 arch/x86/kvm/x86.c      |  9 +++++-
 arch/x86/kvm/x86.h      |  6 ++++
 arch/x86/mm/pkeys.c     |  6 ++++
 include/linux/pks.h     |  7 +++++
 7 files changed, 94 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index e325c290a816..ee37741b2b9d 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -42,6 +42,7 @@ struct vmcs_host_state {
 #ifdef CONFIG_X86_64
 	u16           ds_sel, es_sel;
 #endif
+	u32           pkrs;
 };
 
 struct vmcs_controls_shadow {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 395b2deb76aa..9d0588e85410 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -28,6 +28,7 @@
 #include <linux/tboot.h>
 #include <linux/trace_events.h>
 #include <linux/entry-kvm.h>
+#include <linux/pkeys.h>
 
 #include <asm/apic.h>
 #include <asm/asm.h>
@@ -172,6 +173,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
 	MSR_CORE_C3_RESIDENCY,
 	MSR_CORE_C6_RESIDENCY,
 	MSR_CORE_C7_RESIDENCY,
+	MSR_IA32_PKRS,
 };
 
 /*
@@ -1111,6 +1113,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 #endif
 	unsigned long fs_base, gs_base;
 	u16 fs_sel, gs_sel;
+	u32 host_pkrs;
 	int i;
 
 	vmx->req_immediate_exit = false;
@@ -1146,6 +1149,17 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	 */
 	host_state->ldt_sel = kvm_read_ldt();
 
+	/*
+	 * Update the host pkrs vmcs field before vcpu runs.
+	 * The setting of VM_EXIT_LOAD_IA32_PKRS can ensure
+	 * kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	 * guest_cpuid_has(vcpu, X86_FEATURE_PKS)
+	 */
+	if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS) {
+		host_pkrs = get_current_pkrs();
+		vmx_set_host_pkrs(host_state, host_pkrs);
+	}
+
 #ifdef CONFIG_X86_64
 	savesegment(ds, host_state->ds_sel);
 	savesegment(es, host_state->es_sel);
@@ -1901,6 +1915,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_DEBUGCTLMSR:
 		msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
 		break;
+	case MSR_IA32_PKRS:
+		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
+		    (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
+			return 1;
+		msr_info->data = kvm_read_pkrs(vcpu);
+		break;
 	default:
 	find_uret_msr:
 		msr = vmx_find_uret_msr(vmx, msr_info->index);
@@ -2242,7 +2263,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		}
 		ret = kvm_set_msr_common(vcpu, msr_info);
 		break;
-
+	case MSR_IA32_PKRS:
+		if (!kvm_pkrs_valid(data))
+			return 1;
+		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
+		    (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
+			return 1;
+		vcpu->arch.pkrs = data;
+		kvm_register_mark_available(vcpu, VCPU_EXREG_PKRS);
+		vmcs_write64(GUEST_IA32_PKRS, data);
+		break;
 	default:
 	find_uret_msr:
 		msr = vmx_find_uret_msr(vmx, msr_index);
@@ -2533,7 +2564,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	      VM_EXIT_LOAD_IA32_EFER |
 	      VM_EXIT_CLEAR_BNDCFGS |
 	      VM_EXIT_PT_CONCEAL_PIP |
-	      VM_EXIT_CLEAR_IA32_RTIT_CTL;
+	      VM_EXIT_CLEAR_IA32_RTIT_CTL |
+	      VM_EXIT_LOAD_IA32_PKRS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
 				&_vmexit_control) < 0)
 		return -EIO;
@@ -2557,7 +2589,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	      VM_ENTRY_LOAD_IA32_EFER |
 	      VM_ENTRY_LOAD_BNDCFGS |
 	      VM_ENTRY_PT_CONCEAL_PIP |
-	      VM_ENTRY_LOAD_IA32_RTIT_CTL;
+	      VM_ENTRY_LOAD_IA32_RTIT_CTL |
+	      VM_ENTRY_LOAD_IA32_PKRS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
 				&_vmentry_control) < 0)
 		return -EIO;
@@ -4166,7 +4199,8 @@ static u32 vmx_vmentry_ctrl(void)
 				  VM_ENTRY_LOAD_IA32_RTIT_CTL);
 	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
 	return vmentry_ctrl &
-		~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_EFER);
+		~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_EFER |
+		  VM_ENTRY_LOAD_IA32_PKRS);
 }
 
 static u32 vmx_vmexit_ctrl(void)
@@ -4178,7 +4212,8 @@ static u32 vmx_vmexit_ctrl(void)
 				 VM_EXIT_CLEAR_IA32_RTIT_CTL);
 	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
 	return vmexit_ctrl &
-		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
+		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER |
+		  VM_EXIT_LOAD_IA32_PKRS);
 }
 
 static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
@@ -5923,6 +5958,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 		       vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL));
 	if (vmentry_ctl & VM_ENTRY_LOAD_BNDCFGS)
 		pr_err("BndCfgS = 0x%016llx\n", vmcs_read64(GUEST_BNDCFGS));
+	if (vmentry_ctl & VM_ENTRY_LOAD_IA32_PKRS)
+		pr_err("PKRS = 0x%016llx\n", vmcs_read64(GUEST_IA32_PKRS));
 	pr_err("Interruptibility = %08x  ActivityState = %08x\n",
 	       vmcs_read32(GUEST_INTERRUPTIBILITY_INFO),
 	       vmcs_read32(GUEST_ACTIVITY_STATE));
@@ -5964,6 +6001,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 		       vmcs_read64(HOST_IA32_PERF_GLOBAL_CTRL));
 	if (vmcs_read32(VM_EXIT_MSR_LOAD_COUNT) > 0)
 		vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
+	if (vmexit_ctl & VM_EXIT_LOAD_IA32_PKRS)
+		pr_err("PKRS = 0x%016llx\n", vmcs_read64(HOST_IA32_PKRS));
 
 	pr_err("*** Control State ***\n");
 	pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
@@ -7406,6 +7445,20 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 
 	/* Refresh #PF interception to account for MAXPHYADDR changes. */
 	vmx_update_exception_bitmap(vcpu);
+
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS)) {
+		if (guest_cpuid_has(vcpu, X86_FEATURE_PKS)) {
+			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
+
+			vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
+			vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
+		} else {
+			vmx_enable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
+
+			vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
+			vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
+		}
+	}
 }
 
 static __init void vmx_set_cpu_caps(void)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 661df9584b12..91723a226bf3 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -352,7 +352,7 @@ struct vcpu_vmx {
 	struct lbr_desc lbr_desc;
 
 	/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS	15
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS	16
 	struct {
 		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
@@ -580,4 +580,11 @@ static inline int vmx_get_instr_info_reg2(u32 vmx_instr_info)
 	return (vmx_instr_info >> 28) & 0xf;
 }
 
+static inline void vmx_set_host_pkrs(struct vmcs_host_state *host, u32 pkrs){
+	if (unlikely(pkrs != host->pkrs)) {
+		vmcs_write64(HOST_IA32_PKRS, pkrs);
+		host->pkrs = pkrs;
+	}
+}
+
 #endif /* __KVM_X86_VMX_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 547ba00ef64f..d784bf3a4b3e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1396,7 +1396,7 @@ static const u32 msrs_to_save_all[] = {
 	MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
 	MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
 	MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
-	MSR_IA32_UMWAIT_CONTROL,
+	MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,
 
 	MSR_ARCH_PERFMON_FIXED_CTR0, MSR_ARCH_PERFMON_FIXED_CTR1,
 	MSR_ARCH_PERFMON_FIXED_CTR0 + 2,
@@ -6638,6 +6638,10 @@ static void kvm_init_msr_list(void)
 				intel_pt_validate_hw_cap(PT_CAP_num_address_ranges) * 2)
 				continue;
 			break;
+		case MSR_IA32_PKRS:
+			if (!kvm_cpu_cap_has(X86_FEATURE_PKS))
+				continue;
+			break;
 		case MSR_ARCH_PERFMON_PERFCTR0 ... MSR_ARCH_PERFMON_PERFCTR0 + 17:
 			if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_PERFCTR0 >=
 			    min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp))
@@ -11410,6 +11414,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
 	kvm_rip_write(vcpu, 0xfff0);
 
+	if (!init_event && kvm_cpu_cap_has(X86_FEATURE_PKS))
+		__kvm_set_msr(vcpu, MSR_IA32_PKRS, 0, true);
+
 	vcpu->arch.cr3 = 0;
 	kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 588792f00334..7610f0d40b0f 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -404,6 +404,12 @@ static inline void kvm_machine_check(void)
 #endif
 }
 
+static inline bool kvm_pkrs_valid(u64 data)
+{
+	/* bit[63,32] must be zero */
+	return !(data >> 32);
+}
+
 void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
 void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
 int kvm_spec_ctrl_test_value(u64 value);
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 74ba51b9853b..bd75af62b685 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -495,4 +495,10 @@ void pks_update_exception(struct pt_regs *regs, u8 pkey, u8 protection)
 }
 EXPORT_SYMBOL_GPL(pks_update_exception);
 
+u32 get_current_pkrs(void)
+{
+	return this_cpu_read(pkrs_cache);
+}
+EXPORT_SYMBOL_GPL(get_current_pkrs);
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pks.h b/include/linux/pks.h
index ce8eea81f208..0a71f8f4055d 100644
--- a/include/linux/pks.h
+++ b/include/linux/pks.h
@@ -53,6 +53,8 @@ static inline void pks_set_readwrite(u8 pkey)
 typedef bool (*pks_key_callback)(struct pt_regs *regs, unsigned long address,
 				 bool write);
 
+u32 get_current_pkrs(void);
+
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline bool pks_available(void)
@@ -68,6 +70,11 @@ static inline void pks_update_exception(struct pt_regs *regs,
 					u8 protection)
 { }
 
+static inline u32 get_current_pkrs(void)
+{
+	return 0;
+}
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 #ifdef CONFIG_PKS_TEST
-- 
2.25.1



* [PATCH v7 4/8] KVM: MMU: Rename the pkru to pkr
  2022-04-24 10:15 [PATCH v7 0/8] KVM: PKS Virtualization support Lei Wang
                   ` (2 preceding siblings ...)
  2022-04-24 10:15 ` [PATCH v7 3/8] KVM: X86: Expose IA32_PKRS MSR Lei Wang
@ 2022-04-24 10:15 ` Lei Wang
  2022-04-24 10:15 ` [PATCH v7 5/8] KVM: MMU: Add helper function to get pkr bits Lei Wang
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Lei Wang @ 2022-04-24 10:15 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro
  Cc: lei4.wang, chenyi.qiang, kvm, linux-kernel

From: Chenyi Qiang <chenyi.qiang@intel.com>

PKRU represents the PKU register utilized in the protection key rights
check for user pages. Protection Keys for Supervisor Pages (PKS) extends
the protection key architecture to cover supervisor pages.

Rename the *pkru*-related variables and functions to *pkr*, which stands
for both PKRU and PKRS. This makes sense because PKS and PKU each
have:
 - a single control register (PKRU and PKRS)
 - the same number of keys (16 in total)
 - the same format in control registers (Access and Write disable bits)

PKS and PKU can also share the same bitmap (pkr_mask) that caches the
conditions where protection key checks are needed, because they have
almost the same requirements for PK restrictions to cause a fault, except
that they apply to different pages (supervisor vs. user pages).
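
A minimal sketch of that shared decode (hypothetical helper, not part of the
diff below; u32 is the kernel typedef from <linux/types.h>):

/*
 * Illustration only: PKRU and PKRS have identical layouts, so a single
 * decode helper can serve both; only the source register differs with
 * the address type (user vs. supervisor).
 */
static inline u32 pkr_ad_wd(u32 pkr, unsigned int pkey)
{
	return (pkr >> (pkey * 2)) & 3;	/* bit 0 = AD, bit 1 = WD */
}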

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu.h              | 12 ++++++------
 arch/x86/kvm/mmu/mmu.c          | 10 +++++-----
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f5455bada8cd..1014d6a2b069 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -459,7 +459,7 @@ struct kvm_mmu {
 	* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
 	* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
 	*/
-	u32 pkru_mask;
+	u32 pkr_mask;
 
 	u64 *pae_root;
 	u64 *pml4_root;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e6cae6f22683..cb3f07e63778 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -239,8 +239,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	u32 errcode = PFERR_PRESENT_MASK;
 
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
-	if (unlikely(mmu->pkru_mask)) {
-		u32 pkru_bits, offset;
+	if (unlikely(mmu->pkr_mask)) {
+		u32 pkr_bits, offset;
 
 		/*
 		* PKRU defines 32 bits, there are 16 domains and 2
@@ -248,15 +248,15 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		* index of the protection domain, so pte_pkey * 2 is
 		* is the index of the first bit for the domain.
 		*/
-		pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
 
 		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
 		offset = (pfec & ~1) +
 			((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
 
-		pkru_bits &= mmu->pkru_mask >> offset;
-		errcode |= -pkru_bits & PFERR_PK_MASK;
-		fault |= (pkru_bits != 0);
+		pkr_bits &= mmu->pkr_mask >> offset;
+		errcode |= -pkr_bits & PFERR_PK_MASK;
+		fault |= (pkr_bits != 0);
 	}
 
 	return -(u32)fault & errcode;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f9080ee50ffa..de665361548d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4631,12 +4631,12 @@ static void update_permission_bitmask(struct kvm_mmu *mmu, bool ept)
 * away both AD and WD.  For all reads or if the last condition holds, WD
 * only will be masked away.
 */
-static void update_pkru_bitmask(struct kvm_mmu *mmu)
+static void update_pkr_bitmask(struct kvm_mmu *mmu)
 {
 	unsigned bit;
 	bool wp;
 
-	mmu->pkru_mask = 0;
+	mmu->pkr_mask = 0;
 
 	if (!is_cr4_pke(mmu))
 		return;
@@ -4671,7 +4671,7 @@ static void update_pkru_bitmask(struct kvm_mmu *mmu)
 		/* PKRU.WD stops write access. */
 		pkey_bits |= (!!check_write) << 1;
 
-		mmu->pkru_mask |= (pkey_bits & 3) << pfec;
+		mmu->pkr_mask |= (pkey_bits & 3) << pfec;
 	}
 }
 
@@ -4683,7 +4683,7 @@ static void reset_guest_paging_metadata(struct kvm_vcpu *vcpu,
 
 	reset_rsvds_bits_mask(vcpu, mmu);
 	update_permission_bitmask(mmu, false);
-	update_pkru_bitmask(mmu);
+	update_pkr_bitmask(mmu);
 }
 
 static void paging64_init_context(struct kvm_mmu *context)
@@ -4951,7 +4951,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 		context->root_level = level;
 		context->direct_map = false;
 		update_permission_bitmask(context, true);
-		context->pkru_mask = 0;
+		context->pkr_mask = 0;
 		reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
 		reset_ept_shadow_zero_bits_mask(context, execonly);
 	}
-- 
2.25.1



* [PATCH v7 5/8] KVM: MMU: Add helper function to get pkr bits
  2022-04-24 10:15 [PATCH v7 0/8] KVM: PKS Virtualization support Lei Wang
                   ` (3 preceding siblings ...)
  2022-04-24 10:15 ` [PATCH v7 4/8] KVM: MMU: Rename the pkru to pkr Lei Wang
@ 2022-04-24 10:15 ` Lei Wang
  2022-05-24 23:21   ` Sean Christopherson
  2022-04-24 10:15 ` [PATCH v7 6/8] KVM: MMU: Add support for PKS emulation Lei Wang
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Lei Wang @ 2022-04-24 10:15 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro
  Cc: lei4.wang, chenyi.qiang, kvm, linux-kernel

Extract the PKR logic into a separate, non-inline helper, as a
preparation for introducing PKS support.

Signed-off-by: Lei Wang <lei4.wang@intel.com>
---
 arch/x86/kvm/mmu.h     | 20 +++++---------------
 arch/x86/kvm/mmu/mmu.c | 21 +++++++++++++++++++++
 2 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index cb3f07e63778..cea03053a153 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -204,6 +204,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	return vcpu->arch.mmu->page_fault(vcpu, &fault);
 }
 
+u32 kvm_mmu_pkr_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+		     unsigned pte_access, unsigned pte_pkey, unsigned int pfec);
+
 /*
  * Check if a given access (described through the I/D, W/R and U/S bits of a
  * page fault error code pfec) causes a permission fault with the given PTE
@@ -240,21 +243,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
 	if (unlikely(mmu->pkr_mask)) {
-		u32 pkr_bits, offset;
-
-		/*
-		* PKRU defines 32 bits, there are 16 domains and 2
-		* attribute bits per domain in pkru.  pte_pkey is the
-		* index of the protection domain, so pte_pkey * 2 is
-		* is the index of the first bit for the domain.
-		*/
-		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
-
-		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
-		offset = (pfec & ~1) +
-			((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
-
-		pkr_bits &= mmu->pkr_mask >> offset;
+		u32 pkr_bits =
+			kvm_mmu_pkr_bits(vcpu, mmu, pte_access, pte_pkey, pfec);
 		errcode |= -pkr_bits & PFERR_PK_MASK;
 		fault |= (pkr_bits != 0);
 	}
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index de665361548d..6d3276986102 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6477,3 +6477,24 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 	if (kvm->arch.nx_lpage_recovery_thread)
 		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
 }
+
+u32 kvm_mmu_pkr_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+		     unsigned pte_access, unsigned pte_pkey, unsigned int pfec)
+{
+	u32 pkr_bits, offset;
+
+	/*
+	* PKRU defines 32 bits, there are 16 domains and 2
+	* attribute bits per domain in pkru.  pte_pkey is the
+	* index of the protection domain, so pte_pkey * 2 is
+	* is the index of the first bit for the domain.
+	*/
+	pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+
+	/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
+	offset = (pfec & ~1) + ((pte_access & PT_USER_MASK)
+				<< (PFERR_RSVD_BIT - PT_USER_SHIFT));
+
+	pkr_bits &= mmu->pkr_mask >> offset;
+	return pkr_bits;
+}
-- 
2.25.1



* [PATCH v7 6/8] KVM: MMU: Add support for PKS emulation
  2022-04-24 10:15 [PATCH v7 0/8] KVM: PKS Virtualization support Lei Wang
                   ` (4 preceding siblings ...)
  2022-04-24 10:15 ` [PATCH v7 5/8] KVM: MMU: Add helper function to get pkr bits Lei Wang
@ 2022-04-24 10:15 ` Lei Wang
  2022-05-24 23:28   ` Sean Christopherson
  2022-04-24 10:15 ` [PATCH v7 7/8] KVM: VMX: Expose PKS to guest Lei Wang
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Lei Wang @ 2022-04-24 10:15 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro
  Cc: lei4.wang, chenyi.qiang, kvm, linux-kernel

From: Chenyi Qiang <chenyi.qiang@intel.com>

Up until now, pkr_mask had 0 bits for supervisor pages (the U/S bit in
the page tables replaces PFEC.RSVD in the page fault error code).
For PKS support, fill in the bits using the same algorithm used for
user-mode pages, but with CR4.PKE replaced by CR4.PKS. Because of this
change, CR4.PKS must also be included in the MMU role.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Co-developed-by: Lei Wang <lei4.wang@intel.com>
Signed-off-by: Lei Wang <lei4.wang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  10 +--
 arch/x86/kvm/mmu.h              |   3 +-
 arch/x86/kvm/mmu/mmu.c          | 109 +++++++++++++++++++++-----------
 3 files changed, 80 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1014d6a2b069..a245d9817f72 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -375,6 +375,7 @@ union kvm_mmu_extended_role {
 		unsigned int cr4_smap:1;
 		unsigned int cr4_smep:1;
 		unsigned int cr4_la57:1;
+		unsigned int cr4_pks:1;
 		unsigned int efer_lma:1;
 	};
 };
@@ -454,10 +455,11 @@ struct kvm_mmu {
 	u8 permissions[16];
 
 	/*
-	* The pkru_mask indicates if protection key checks are needed.  It
-	* consists of 16 domains indexed by page fault error code bits [4:1],
-	* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
-	* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
+	* The pkr_mask indicates if protection key checks are needed.
+	* It consists of 16 domains indexed by page fault error code
+	* bits[4:1] with PFEC.RSVD replaced by ACC_USER_MASK from the
+	* page tables. Each domain has 2 bits which are ANDed with AD
+	* and WD from PKRU/PKRS.
 	*/
 	u32 pkr_mask;
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index cea03053a153..6963c641e6ce 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -45,7 +45,8 @@
 #define PT32E_ROOT_LEVEL 3
 
 #define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PSE | X86_CR4_PAE | X86_CR4_LA57 | \
-			       X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE)
+			       X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE | \
+			       X86_CR4_PKS)
 
 #define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)
 #define KVM_MMU_EFER_ROLE_BITS (EFER_LME | EFER_NX)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d3276986102..a6cbc22d3312 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -209,6 +209,7 @@ BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smep, X86_CR4_SMEP);
 BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smap, X86_CR4_SMAP);
 BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pke, X86_CR4_PKE);
 BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, la57, X86_CR4_LA57);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pks, X86_CR4_PKS);
 BUILD_MMU_ROLE_REGS_ACCESSOR(efer, nx, EFER_NX);
 BUILD_MMU_ROLE_REGS_ACCESSOR(efer, lma, EFER_LMA);
 
@@ -231,6 +232,7 @@ BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smep);
 BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smap);
 BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pke);
 BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, la57);
+BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pks);
 BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);
 
 static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
@@ -4608,37 +4610,58 @@ static void update_permission_bitmask(struct kvm_mmu *mmu, bool ept)
 }
 
 /*
-* PKU is an additional mechanism by which the paging controls access to
-* user-mode addresses based on the value in the PKRU register.  Protection
-* key violations are reported through a bit in the page fault error code.
-* Unlike other bits of the error code, the PK bit is not known at the
-* call site of e.g. gva_to_gpa; it must be computed directly in
-* permission_fault based on two bits of PKRU, on some machine state (CR4,
-* CR0, EFER, CPL), and on other bits of the error code and the page tables.
-*
-* In particular the following conditions come from the error code, the
-* page tables and the machine state:
-* - PK is always zero unless CR4.PKE=1 and EFER.LMA=1
-* - PK is always zero if RSVD=1 (reserved bit set) or F=1 (instruction fetch)
-* - PK is always zero if U=0 in the page tables
-* - PKRU.WD is ignored if CR0.WP=0 and the access is a supervisor access.
-*
-* The PKRU bitmask caches the result of these four conditions.  The error
-* code (minus the P bit) and the page table's U bit form an index into the
-* PKRU bitmask.  Two bits of the PKRU bitmask are then extracted and ANDed
-* with the two bits of the PKRU register corresponding to the protection key.
-* For the first three conditions above the bits will be 00, thus masking
-* away both AD and WD.  For all reads or if the last condition holds, WD
-* only will be masked away.
-*/
+ * Protection Key Rights (PKR) is an additional mechanism by which data accesses
+ * with 4-level or 5-level paging (EFER.LMA=1) may be disabled based on the
+ * Protection Key Rights Userspace (PKRU) or Protection Key Rights Supervisor
+ * (PKRS) registers.  The Protection Key (PK) used for an access is a 4-bit
+ * value specified in bits 62:59 of the leaf PTE used to translate the address.
+ *
+ * PKRU and PKRS are 32-bit registers, with 16 2-bit entries consisting of an
+ * access-disable (AD) and write-disable (WD) bit.  The PK from the leaf PTE is
+ * used to index the appropriate PKR (see below), e.g. PK=1 would consume bits
+ * 3:2 (bit 3 == write-disable, bit 2 == access-disable).
+ *
+ * The PK register (PKRU vs. PKRS) indexed by the PK depends on the type of
+ * _address_ (not access type!).  For a user-mode address, PKRU is used; for a
+ * supervisor-mode address, PKRS is used.  An address is supervisor-mode if the
+ * U/S flag (bit 2) is 0 in at least one of the paging-structure entries, i.e.
+ * an address is user-mode if the U/S flag is 1 in _all_ entries.  Again, this
+ * is the address type, not the access type, e.g. a supervisor-mode _access_
+ * will consume PKRU if the _address_ is a user-mode address.
+ *
+ * As alluded to above, PKR checks are only performed for data accesses; code
+ * fetches are not subject to PKR checks.  Terminal page faults (!PRESENT or
+ * PFEC.RSVD=1) are also not subject to PKR checks.
+ *
+ * PKR write-disable checks for supervisor-mode _accesses_ are performed if and
+ * only if CR0.WP=1 (though access-disable checks still apply).
+ *
+ * In summary, PKR checks are based on (a) EFER.LMA, (b) CR4.PKE or CR4.PKS,
+ * (c) CR0.WP, (d) the PK in the leaf PTE, (e) two bits from the corresponding
+ * PKR{S,U} entry, (f) the access type (derived from the other PFEC bits), and
+ * (g) the address type (retrieved from the paging-structure entries).
+ *
+ * To avoid conditional branches in permission_fault(), the PKR bitmask caches
+ * the above inputs, except for (e) the PKR{S,U} entry.  The FETCH, USER, and
+ * WRITE bits of the PFEC and the effective value of the paging-structures' U/S
+ * bit (slotted into the PFEC.RSVD position, bit 3) are used to index into the
+ * PKR bitmask (similar to the 4-bit Protection Key itself).  The two bits of
+ * the PKR bitmask "entry" are then extracted and ANDed with the two bits of
+ * the PKR{S,U} register corresponding to the address type and protection key.
+ *
+ * E.g. for all values where PFEC.FETCH=1, the corresponding pkr_bitmask bits
+ * will be 00b, thus masking away the AD and WD bits from the PKR{S,U} register
+ * to suppress PKR checks on code fetches.
+ */
 static void update_pkr_bitmask(struct kvm_mmu *mmu)
 {
 	unsigned bit;
 	bool wp;
-
+	bool cr4_pke = is_cr4_pke(mmu);
+	bool cr4_pks = is_cr4_pks(mmu);
 	mmu->pkr_mask = 0;
 
-	if (!is_cr4_pke(mmu))
+	if (!cr4_pke && !cr4_pks)
 		return;
 
 	wp = is_cr0_wp(mmu);
@@ -4656,19 +4679,22 @@ static void update_pkr_bitmask(struct kvm_mmu *mmu)
 		pte_user = pfec & PFERR_RSVD_MASK;
 
 		/*
-		 * Only need to check the access which is not an
-		 * instruction fetch and is to a user page.
+		 * need to check the access which is not an
+		 * instruction fetch and
+		 * - if cr4_pke 1-setting when accessing a user page.
+		 * - if cr4_pks 1-setting when accessing a supervisor page.
 		 */
-		check_pkey = (!ff && pte_user);
+		check_pkey = !ff && (pte_user ? cr4_pke : cr4_pks);
+
 		/*
-		 * write access is controlled by PKRU if it is a
-		 * user access or CR0.WP = 1.
+		 * write access is controlled by PKRU/PKRS if
+		 * it is a user access or CR0.WP = 1.
 		 */
 		check_write = check_pkey && wf && (uf || wp);
 
-		/* PKRU.AD stops both read and write access. */
+		/* PKRU/PKRS.AD stops both read and write access. */
 		pkey_bits = !!check_pkey;
-		/* PKRU.WD stops write access. */
+		/* PKRU/PKRS.WD stops write access. */
 		pkey_bits |= (!!check_write) << 1;
 
 		mmu->pkr_mask |= (pkey_bits & 3) << pfec;
@@ -4719,6 +4745,7 @@ static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu,
 		/* PKEY and LA57 are active iff long mode is active. */
 		ext.cr4_pke = ____is_efer_lma(regs) && ____is_cr4_pke(regs);
 		ext.cr4_la57 = ____is_efer_lma(regs) && ____is_cr4_la57(regs);
+		ext.cr4_pks = ____is_efer_lma(regs) && ____is_cr4_pks(regs);
 		ext.efer_lma = ____is_efer_lma(regs);
 	}
 
@@ -6482,14 +6509,22 @@ u32 kvm_mmu_pkr_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		     unsigned pte_access, unsigned pte_pkey, unsigned int pfec)
 {
 	u32 pkr_bits, offset;
+	u32 pkr;
 
 	/*
-	* PKRU defines 32 bits, there are 16 domains and 2
-	* attribute bits per domain in pkru.  pte_pkey is the
-	* index of the protection domain, so pte_pkey * 2 is
-	* is the index of the first bit for the domain.
+	* PKRU and PKRS both define 32 bits. There are 16 domains
+	* and 2 attribute bits per domain in them. pte_pkey is the
+	* index of the protection domain, so pte_pkey * 2 is the
+	* index of the first bit for the domain. The use of PKRU
+	* versus PKRS is selected by the address type, as determined
+	* by the U/S bit in the paging-structure entries.
 	*/
-	pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+	if (pte_access & PT_USER_MASK)
+		pkr = is_cr4_pke(mmu) ? vcpu->arch.pkru : 0;
+	else
+		pkr = is_cr4_pks(mmu) ? kvm_read_pkrs(vcpu) : 0;
+
+	pkr_bits = (pkr >> pte_pkey * 2) & 3;
 
 	/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
 	offset = (pfec & ~1) + ((pte_access & PT_USER_MASK)
-- 
2.25.1



* [PATCH v7 7/8] KVM: VMX: Expose PKS to guest
  2022-04-24 10:15 [PATCH v7 0/8] KVM: PKS Virtualization support Lei Wang
                   ` (5 preceding siblings ...)
  2022-04-24 10:15 ` [PATCH v7 6/8] KVM: MMU: Add support for PKS emulation Lei Wang
@ 2022-04-24 10:15 ` Lei Wang
  2022-05-24 23:34   ` Sean Christopherson
  2022-04-24 10:15 ` [PATCH v7 8/8] KVM: VMX: Enable PKS for nested VM Lei Wang
  2022-05-06  7:32 ` [PATCH v7 0/8] KVM: PKS Virtualization support Wang, Lei
  8 siblings, 1 reply; 24+ messages in thread
From: Lei Wang @ 2022-04-24 10:15 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro
  Cc: lei4.wang, chenyi.qiang, kvm, linux-kernel

From: Chenyi Qiang <chenyi.qiang@intel.com>

The existence of PKS is enumerated via CPUID.(EAX=7,ECX=0):ECX[31]. It is
enabled by setting CR4.PKS when long mode is active. PKS virtualization is
currently only implemented when EPT is enabled and requires support for the
VM_{ENTRY,EXIT}_LOAD_IA32_PKRS VMCS controls.
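
For reference, a rough guest-side sketch of the enumerate/enable flow
described above (illustration only, not part of this patch; CR4.PKS is
assumed to be bit 24 and may only be set while long mode is active):

/*
 * Illustration only: a guest detects PKS via CPUID.(EAX=7,ECX=0):ECX[31]
 * and then opts in by setting CR4.PKS while running in long mode.
 */
#include <cpuid.h>
#include <stdbool.h>

static bool guest_has_pks(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return ecx & (1u << 31);
}

/*
 * A guest kernel would then do, conceptually:
 *	if (guest_has_pks())
 *		cr4_set_bits(X86_CR4_PKS);	/. in-kernel helper, bit 24 ./
 */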

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Co-developed-by: Lei Wang <lei4.wang@intel.com>
Signed-off-by: Lei Wang <lei4.wang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/cpuid.c            | 13 +++++++++----
 arch/x86/kvm/vmx/capabilities.h |  6 ++++++
 arch/x86/kvm/vmx/vmx.c          | 10 +++++++---
 arch/x86/kvm/x86.h              |  2 ++
 5 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a245d9817f72..6f78ed784661 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -117,7 +117,8 @@
 			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
 			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
-			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+			  | X86_CR4_PKS))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index b24ca7f4ed7c..f419bdd7f6af 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -570,18 +570,23 @@ void kvm_set_cpu_caps(void)
 		F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
 		F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
 		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
-		F(SGX_LC) | F(BUS_LOCK_DETECT)
+		F(SGX_LC) | F(BUS_LOCK_DETECT) | F(PKS)
 	);
 	/* Set LA57 based on hardware capability. */
 	if (cpuid_ecx(7) & F(LA57))
 		kvm_cpu_cap_set(X86_FEATURE_LA57);
 
 	/*
-	 * PKU not yet implemented for shadow paging and requires OSPKE
-	 * to be set on the host. Clear it if that is not the case
+	 * Protection Keys are not supported for shadow paging.  PKU further
+	 * requires OSPKE to be set on the host in order to use {RD,WR}PKRU to
+	 * save/restore the guests PKRU.
 	 */
-	if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
+	if (!tdp_enabled) {
 		kvm_cpu_cap_clear(X86_FEATURE_PKU);
+		kvm_cpu_cap_clear(X86_FEATURE_PKS);
+	} else if (!boot_cpu_has(X86_FEATURE_OSPKE)) {
+		kvm_cpu_cap_clear(X86_FEATURE_PKU);
+	}
 
 	kvm_cpu_cap_mask(CPUID_7_EDX,
 		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 3f430e218375..cc9c23ab85fd 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -104,6 +104,12 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
 	       (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL);
 }
 
+static inline bool cpu_has_load_ia32_pkrs(void)
+{
+	return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PKRS) &&
+	       (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PKRS);
+}
+
 static inline bool cpu_has_vmx_mpx(void)
 {
 	return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_BNDCFGS) &&
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9d0588e85410..cbcb0d7b47a4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3250,7 +3250,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		}
 
 		/*
-		 * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
+		 * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
 		 * hardware.  To emulate this behavior, SMEP/SMAP/PKU needs
 		 * to be manually disabled when guest switches to non-paging
 		 * mode.
@@ -3258,10 +3258,11 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		 * If !enable_unrestricted_guest, the CPU is always running
 		 * with CR0.PG=1 and CR4 needs to be modified.
 		 * If enable_unrestricted_guest, the CPU automatically
-		 * disables SMEP/SMAP/PKU when the guest sets CR0.PG=0.
+		 * disables SMEP/SMAP/PKU/PKS when the guest sets CR0.PG=0.
 		 */
 		if (!is_paging(vcpu))
-			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
+			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
+				    X86_CR4_PKS);
 	}
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
@@ -7500,6 +7501,9 @@ static __init void vmx_set_cpu_caps(void)
 
 	if (cpu_has_vmx_waitpkg())
 		kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
+
+	if (cpu_has_load_ia32_pkrs())
+		kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);
 }
 
 static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 7610f0d40b0f..997b85a20962 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -449,6 +449,8 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
 		__reserved_bits |= X86_CR4_VMXE;        \
 	if (!__cpu_has(__c, X86_FEATURE_PCID))          \
 		__reserved_bits |= X86_CR4_PCIDE;       \
+	if (!__cpu_has(__c, X86_FEATURE_PKS))		\
+		__reserved_bits |= X86_CR4_PKS;		\
 	__reserved_bits;                                \
 })
 
-- 
2.25.1



* [PATCH v7 8/8] KVM: VMX: Enable PKS for nested VM
  2022-04-24 10:15 [PATCH v7 0/8] KVM: PKS Virtualization support Lei Wang
                   ` (6 preceding siblings ...)
  2022-04-24 10:15 ` [PATCH v7 7/8] KVM: VMX: Expose PKS to guest Lei Wang
@ 2022-04-24 10:15 ` Lei Wang
  2022-05-20  1:24   ` Sean Christopherson
  2022-05-06  7:32 ` [PATCH v7 0/8] KVM: PKS Virtualization support Wang, Lei
  8 siblings, 1 reply; 24+ messages in thread
From: Lei Wang @ 2022-04-24 10:15 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro
  Cc: lei4.wang, chenyi.qiang, kvm, linux-kernel

From: Chenyi Qiang <chenyi.qiang@intel.com>

The PKS MSR is passed through to the guest directly. Configure the MSR
handling to match the L0/L1 settings so that a nested VM runs PKS properly.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Co-developed-by: Lei Wang <lei4.wang@intel.com>
Signed-off-by: Lei Wang <lei4.wang@intel.com>
---
 arch/x86/kvm/vmx/nested.c | 36 ++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmcs12.c |  2 ++
 arch/x86/kvm/vmx/vmcs12.h |  4 ++++
 arch/x86/kvm/vmx/vmx.c    |  1 +
 arch/x86/kvm/vmx/vmx.h    |  2 ++
 5 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 58a1fa7defc9..dde359dacfcb 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -252,6 +252,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
 	dest->ds_sel = src->ds_sel;
 	dest->es_sel = src->es_sel;
 #endif
+	vmx_set_host_pkrs(dest, src->pkrs);
 }
 
 static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
@@ -685,6 +686,9 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_IA32_PRED_CMD, MSR_TYPE_W);
 
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_PKRS, MSR_TYPE_RW);
+
 	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
 
 	vmx->nested.force_msr_bitmap_recalc = false;
@@ -2433,6 +2437,10 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
 		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
 			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
+
+		if (vmx->nested.nested_run_pending &&
+		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS))
+			vmcs_write64(GUEST_IA32_PKRS, vmcs12->guest_ia32_pkrs);
 	}
 
 	if (nested_cpu_has_xsaves(vmcs12))
@@ -2521,6 +2529,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
 	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
 		vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	    (!vmx->nested.nested_run_pending ||
+	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
+		vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);
+
 	vmx_set_rflags(vcpu, vmcs12->guest_rflags);
 
 	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
@@ -2897,6 +2910,10 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
 					   vmcs12->host_ia32_perf_global_ctrl)))
 		return -EINVAL;
 
+	if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS) &&
+	    CC(!kvm_pkrs_valid(vmcs12->host_ia32_pkrs)))
+		return -EINVAL;
+
 #ifdef CONFIG_X86_64
 	ia32e = !!(vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE);
 #else
@@ -3049,6 +3066,10 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
 	if (nested_check_guest_non_reg_state(vmcs12))
 		return -EINVAL;
 
+	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS) &&
+	    CC(!kvm_pkrs_valid(vmcs12->guest_ia32_pkrs)))
+		return -EINVAL;
+
 	return 0;
 }
 
@@ -3384,6 +3405,10 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	    (!from_vmentry ||
 	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
 		vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS) && 
+	    (!from_vmentry ||
+	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
+		vmx->nested.vmcs01_guest_pkrs = vmcs_read64(GUEST_IA32_PKRS);
 
 	/*
 	 * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
@@ -4029,6 +4054,7 @@ static bool is_vmcs12_ext_field(unsigned long field)
 	case GUEST_IDTR_BASE:
 	case GUEST_PENDING_DBG_EXCEPTIONS:
 	case GUEST_BNDCFGS:
+	case GUEST_IA32_PKRS:
 		return true;
 	default:
 		break;
@@ -4080,6 +4106,8 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 	if (kvm_mpx_supported())
 		vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+	if (vmx->nested.msrs.entry_ctls_high & VM_ENTRY_LOAD_IA32_PKRS)
+		vmcs12->guest_ia32_pkrs = vmcs_read64(GUEST_IA32_PKRS);
 
 	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
 }
@@ -4317,6 +4345,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 		WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
 					 vmcs12->host_ia32_perf_global_ctrl));
 
+	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS)
+		vmcs_write64(GUEST_IA32_PKRS, vmcs12->host_ia32_pkrs);
+
 	/* Set L1 segment info according to Intel SDM
 	    27.5.2 Loading Host Segment and Descriptor-Table Registers */
 	seg = (struct kvm_segment) {
@@ -6559,7 +6590,8 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
 #endif
 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
-		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
+		VM_EXIT_LOAD_IA32_PKRS;
 	msrs->exit_ctls_high |=
 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -6579,7 +6611,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
 		VM_ENTRY_IA32E_MODE |
 #endif
 		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
-		VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+		VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_PKRS;
 	msrs->entry_ctls_high |=
 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER);
 
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 2251b60920f8..7aad1b2f1d81 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -62,9 +62,11 @@ const unsigned short vmcs12_field_offsets[] = {
 	FIELD64(GUEST_PDPTR2, guest_pdptr2),
 	FIELD64(GUEST_PDPTR3, guest_pdptr3),
 	FIELD64(GUEST_BNDCFGS, guest_bndcfgs),
+	FIELD64(GUEST_IA32_PKRS, guest_ia32_pkrs),
 	FIELD64(HOST_IA32_PAT, host_ia32_pat),
 	FIELD64(HOST_IA32_EFER, host_ia32_efer),
 	FIELD64(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl),
+	FIELD64(HOST_IA32_PKRS, host_ia32_pkrs),
 	FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
 	FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
 	FIELD(EXCEPTION_BITMAP, exception_bitmap),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 746129ddd5ae..4f41be3c351c 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -185,6 +185,8 @@ struct __packed vmcs12 {
 	u16 host_gs_selector;
 	u16 host_tr_selector;
 	u16 guest_pml_index;
+	u64 host_ia32_pkrs;
+	u64 guest_ia32_pkrs;
 };
 
 /*
@@ -359,6 +361,8 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(host_gs_selector, 992);
 	CHECK_OFFSET(host_tr_selector, 994);
 	CHECK_OFFSET(guest_pml_index, 996);
+	CHECK_OFFSET(host_ia32_pkrs, 998);
+	CHECK_OFFSET(guest_ia32_pkrs, 1006);
 }
 
 extern const unsigned short vmcs12_field_offsets[];
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cbcb0d7b47a4..a62dc65299d5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7294,6 +7294,7 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
 	cr4_fixed1_update(X86_CR4_PKE,        ecx, feature_bit(PKU));
 	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
 	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
+	cr4_fixed1_update(X86_CR4_PKS,        ecx, feature_bit(PKS));
 
 #undef cr4_fixed1_update
 }
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 91723a226bf3..82f79ac46d7b 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -222,6 +222,8 @@ struct nested_vmx {
 	u64 vmcs01_debugctl;
 	u64 vmcs01_guest_bndcfgs;
 
+	u64 vmcs01_guest_pkrs;
+
 	/* to migrate it to L1 if L2 writes to L1's CR8 directly */
 	int l1_tpr_threshold;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 0/8] KVM: PKS Virtualization support
  2022-04-24 10:15 [PATCH v7 0/8] KVM: PKS Virtualization support Lei Wang
                   ` (7 preceding siblings ...)
  2022-04-24 10:15 ` [PATCH v7 8/8] KVM: VMX: Enable PKS for nested VM Lei Wang
@ 2022-05-06  7:32 ` Wang, Lei
  8 siblings, 0 replies; 24+ messages in thread
From: Wang, Lei @ 2022-05-06  7:32 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, wanpengli, jmattson, joro
  Cc: chenyi.qiang, kvm, linux-kernel

Kindly ping for the comments.

On 4/24/2022 6:15 PM, Lei Wang wrote:
> This patch series is based on top of v10 PKS core support kernel patchset:
> https://lore.kernel.org/lkml/20220419170649.1022246-1-ira.weiny@intel.com/
>
> ---
>
> Protection Keys for Supervisor Pages(PKS) is a feature that extends the
> Protection Keys architecture to support thread-specific permission
> restrictions on supervisor pages.
>
> PKS works similar to an existing feature named PKU(protecting user pages).
> They both perform an additional check after normal paging permission
> checks are done. Access or Writes can be disabled via a MSR update
> without TLB flushes when permissions changes. If violating this
> addional check, #PF occurs and PFEC.PK bit will be set.
>
> PKS introduces MSR IA32_PKRS to manage supervisor protection key
> rights. The MSR contains 16 pairs of ADi and WDi bits. Each pair
> advertises on a group of pages with the same key which is set in the
> leaf paging-structure entries(bits[62:59]). Currently, IA32_PKRS is not
> supported by XSAVES architecture.
>
> This patchset aims to add the virtualization of PKS in KVM. It
> implemented PKS CPUID enumeration, vmentry/vmexit configuration, MSR
> exposure, nested supported etc. Currently, PKS is not yet supported for
> shadow paging.
>
> Detailed information about PKS can be found in the latest Intel 64 and
> IA-32 Architectures Software Developer's Manual.
>
> ---
>
> Changelogs:
>
> v6->v7
> - Add documentation to note that it's nice-to-have cache tracking for PKRS,
>    and we also needn't hesitate to rip it out in the future if there's a strong
>    reason to drop the caching. (Sean)
> - Blindly reading PKRU/PKRS is wrong, fixed. (Sean)
> - Add a non-inline helper kvm_mmu_pkr_bits() to read PKR bits. (Sean)
> - Delete the comment for exposing the PKS because the pattern is common and the
>    behavior is self-explanatory. (Sean)
> - Add a helper vmx_set_host_pkrs() for setting host pkrs and rewrite the
>    related code for concise. (Sean)
> - Align an indentation in arch/x86/kvm/vmx/nested.c. (Sean)
> - Read the current PKRS if from_vmentry == false under the nested condition.
>    (Sean)
> - v6: https://lore.kernel.org/lkml/20220221080840.7369-1-chenyi.qiang@intel.com/
>
> v5->v6
> - PKRS is preserved on INIT. Add the PKRS reset operation in kvm_vcpu_reset.
>    (Sean)
> - Track the pkrs as u32. Add the code WARN on bits 64:32 being set in VMCS field.
>    (Sean)
> - Adjust the MSR intercept and entry/exit control in VMCS according to
>    guest CPUID. This resolve the issue when userspace re-enable this feature.
>    (Sean)
> - Split VMX restriction on PKS support(entry/exit load controls) out of
>    common x86. And put tdp restriction together with PKU in common x86.
>    (Sean)
> - Thanks for Sean to revise the comments in mmu.c related to
>    update_pkr_bitmap, which make it more clear for pkr bitmask cache usage.
> - v5: https://lore.kernel.org/lkml/20210811101126.8973-1-chenyi.qiang@intel.com/
>
> v4->v5
> - Make setting of MSR intercept/vmcs control bits not dependent on guest.CR4.PKS.
>    And set them if PKS is exposed to guest. (Suggested by Sean)
> - Add pkrs to standard register caching mechanism to help update
>    vcpu->arch.pkrs on demand. Add related helper functions. (Suggested by Sean)
> - Do the real pkrs update in VMCS field in vmx_vcpu_reset and
>    vmx_sync_vmcs_host_state(). (Sean)
> - Add a new mmu_role cr4_pks instead of smushing PKU and PKS together.
>    (Sean & Paolo)
> - v4: https://lore.kernel.org/lkml/20210205083706.14146-1-chenyi.qiang@intel.com/
>
> v3->v4
> - Make the MSR intercept and load-controls setting depend on CR4.PKS value
> - shadow the guest pkrs and make it usable in PKS emultion
> - add the cr4_pke and cr4_pks check in pkr_mask update
> - squash PATCH 2 and PATCH 5 to make the dependencies read more clear
> - v3: https://lore.kernel.org/lkml/20201105081805.5674-1-chenyi.qiang@intel.com/
>
> v2->v3:
> - No function changes since last submit
> - rebase on the latest PKS kernel support:
>    https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
> - add MSR_IA32_PKRS to the vmx_possible_passthrough_msrs[]
> - RFC v2: https://lore.kernel.org/lkml/20201014021157.18022-1-chenyi.qiang@intel.com/
>
> v1->v2:
> - rebase on the latest PKS kernel support:
>    https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
> - add a kvm-unit-tests for PKS
> - add the check in kvm_init_msr_list for PKRS
> - place the X86_CR4_PKS in mmu_role_bits in kvm_set_cr4
> - add the support to expose VM_{ENTRY, EXIT}_LOAD_IA32_PKRS in nested
>    VMX MSR
> - RFC v1: https://lore.kernel.org/lkml/20200807084841.7112-1-chenyi.qiang@intel.com/
>
> ---
>
> Chenyi Qiang (7):
>    KVM: VMX: Introduce PKS VMCS fields
>    KVM: VMX: Add proper cache tracking for PKRS
>    KVM: X86: Expose IA32_PKRS MSR
>    KVM: MMU: Rename the pkru to pkr
>    KVM: MMU: Add support for PKS emulation
>    KVM: VMX: Expose PKS to guest
>    KVM: VMX: Enable PKS for nested VM
>
> Lei Wang (1):
>    KVM: MMU: Add helper function to get pkr bits
>
>   arch/x86/include/asm/kvm_host.h |  17 +++--
>   arch/x86/include/asm/vmx.h      |   6 ++
>   arch/x86/kvm/cpuid.c            |  13 +++-
>   arch/x86/kvm/kvm_cache_regs.h   |   7 ++
>   arch/x86/kvm/mmu.h              |  29 +++----
>   arch/x86/kvm/mmu/mmu.c          | 130 +++++++++++++++++++++++---------
>   arch/x86/kvm/vmx/capabilities.h |   6 ++
>   arch/x86/kvm/vmx/nested.c       |  36 ++++++++-
>   arch/x86/kvm/vmx/vmcs.h         |   1 +
>   arch/x86/kvm/vmx/vmcs12.c       |   2 +
>   arch/x86/kvm/vmx/vmcs12.h       |   4 +
>   arch/x86/kvm/vmx/vmx.c          |  85 +++++++++++++++++++--
>   arch/x86/kvm/vmx/vmx.h          |  14 +++-
>   arch/x86/kvm/x86.c              |   9 ++-
>   arch/x86/kvm/x86.h              |   8 ++
>   arch/x86/mm/pkeys.c             |   6 ++
>   include/linux/pks.h             |   7 ++
>   17 files changed, 301 insertions(+), 79 deletions(-)
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 8/8] KVM: VMX: Enable PKS for nested VM
  2022-04-24 10:15 ` [PATCH v7 8/8] KVM: VMX: Enable PKS for nested VM Lei Wang
@ 2022-05-20  1:24   ` Sean Christopherson
  2022-05-27  9:55     ` Wang, Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-05-20  1:24 UTC (permalink / raw)
  To: Lei Wang
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

Nit, use "KVM: nVMX:" for the shortlog scope.

On Sun, Apr 24, 2022, Lei Wang wrote:
> @@ -2433,6 +2437,10 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>  		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>  		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>  			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> +
> +		if (vmx->nested.nested_run_pending &&
> +		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS))
> +			vmcs_write64(GUEST_IA32_PKRS, vmcs12->guest_ia32_pkrs);

As mentioned in the BNDCFGS thread, this does the wrong thing for SMM.  But, after
a lot of thought, handling this in nested_vmx_enter_non_root_mode() would be little
more than a band-aid, and a messy one at that, because KVM's SMM emulation is
horrifically broken with respect to nVMX.

Entry to SMM does not modify _any_ state that is not saved in SMRAM.  That
we're having to deal with this crap is a symptom of KVM doing the complete wrong
thing by piggybacking nested_vmx_vmexit() and nested_vmx_enter_non_root_mode().

The SDM's description of CET spells this out very, very clearly:

  On processors that support CET shadow stacks, when the processor enters SMM,
  the processor saves the SSP register to the SMRAM state save area (see Table 31-3)
  and clears CR4.CET to 0. Thus, the initial execution environment of the SMI handler
  has CET disabled and all of the CET state of the interrupted program is still in the
  machine. An SMM that uses CET is required to save the interrupted program’s CET
  state and restore the CET state prior to exiting SMM.

It mostly works because no guest SMM handler does anything with most of the MSRs,
but it's all wildly wrong.  A concrete example of a lurking bug is if vmcs12 uses
the VM-Exit MSR load list, in which case the forced nested_vmx_vmexit() will load
state that is never undone.

So, my very strong vote is to ignore SMM and let someone who actually cares about
SMM fix that mess properly by adding custom flows for exiting/re-entering L2 on
SMI/RSM.

>  	}
>  
>  	if (nested_cpu_has_xsaves(vmcs12))
> @@ -2521,6 +2529,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  	if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
>  	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
>  		vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
> +	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&

ERROR: trailing whitespace
#85: FILE: arch/x86/kvm/vmx/nested.c:3407:
+^Iif (kvm_cpu_cap_has(X86_FEATURE_PKS) && $

> +	    (!vmx->nested.nested_run_pending ||
> +	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
> +		vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);
> +
>  	vmx_set_rflags(vcpu, vmcs12->guest_rflags);
>  
>  	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
> @@ -2897,6 +2910,10 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
>  					   vmcs12->host_ia32_perf_global_ctrl)))
>  		return -EINVAL;
>  
> +	if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS) &&
> +	    CC(!kvm_pkrs_valid(vmcs12->host_ia32_pkrs)))
> +		return -EINVAL;
> +
>  #ifdef CONFIG_X86_64
>  	ia32e = !!(vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE);
>  #else
> @@ -3049,6 +3066,10 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
>  	if (nested_check_guest_non_reg_state(vmcs12))
>  		return -EINVAL;
>  
> +	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS) &&
> +	    CC(!kvm_pkrs_valid(vmcs12->guest_ia32_pkrs)))
> +		return -EINVAL;
> +
>  	return 0;
>  }
>  
> @@ -3384,6 +3405,10 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>  	    (!from_vmentry ||
>  	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
>  		vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
> +	if (kvm_cpu_cap_has(X86_FEATURE_PKS) && 
> +	    (!from_vmentry ||

This should be "!vmx->nested.nested_run_pending" instead of "!from_vmentry" to
avoid the unnecessary VMREAD when restoring L2 with a pending VM-Enter. 
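
E.g. (untested, just to illustrate the suggested condition):

	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
	    (!vmx->nested.nested_run_pending ||
	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
		vmx->nested.vmcs01_guest_pkrs = vmcs_read64(GUEST_IA32_PKRS);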
	
> +	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
> +		vmx->nested.vmcs01_guest_pkrs = vmcs_read64(GUEST_IA32_PKRS);
>  
>  	/*
>  	 * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*

...

> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 91723a226bf3..82f79ac46d7b 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -222,6 +222,8 @@ struct nested_vmx {
>  	u64 vmcs01_debugctl;
>  	u64 vmcs01_guest_bndcfgs;
>  

Please pack these together, i.e. don't have a blank line between the various
vmcs01_* fields.

> +	u64 vmcs01_guest_pkrs;
> +
>  	/* to migrate it to L1 if L2 writes to L1's CR8 directly */
>  	int l1_tpr_threshold;
>  
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 1/8] KVM: VMX: Introduce PKS VMCS fields
  2022-04-24 10:15 ` [PATCH v7 1/8] KVM: VMX: Introduce PKS VMCS fields Lei Wang
@ 2022-05-24 20:55   ` Sean Christopherson
  2022-05-27  1:55     ` Wang, Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-05-24 20:55 UTC (permalink / raw)
  To: Lei Wang
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On Sun, Apr 24, 2022, Lei Wang wrote:
> From: Chenyi Qiang <chenyi.qiang@intel.com>
> 
> PKS(Protection Keys for Supervisor Pages) is a feature that extends the
> Protection Key architecture to support thread-specific permission
> restrictions on supervisor pages.
> 
> A new PKS MSR(PKRS) is defined in kernel to support PKS, which holds a
> set of permissions associated with each protection domain.
> 
> Two VMCS fields {HOST,GUEST}_IA32_PKRS are introduced in
> {host,guest}-state area to store the respective values of PKRS.
> 
> Every VM exit saves PKRS into guest-state area.

Uber nit, PKRS isn't saved if VMX doesn't support the entry control.

  Every VM exit saves PKRS into guest-state area if VM_ENTRY_LOAD_IA32_PKRS
  is supported by the CPU.

With that tweak,

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 2/8] KVM: VMX: Add proper cache tracking for PKRS
  2022-04-24 10:15 ` [PATCH v7 2/8] KVM: VMX: Add proper cache tracking for PKRS Lei Wang
@ 2022-05-24 21:00   ` Sean Christopherson
  2022-05-27  2:16     ` Wang, Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-05-24 21:00 UTC (permalink / raw)
  To: Lei Wang
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On Sun, Apr 24, 2022, Lei Wang wrote:
> From: Chenyi Qiang <chenyi.qiang@intel.com>
> 
> Add PKRS caching into the standard register caching mechanism in order
> to take advantage of the availability checks provided by regs_avail.
> 
> This is because vcpu->arch.pkrs will be rarely accessed by KVM, only in
> the case of host userspace MSR reads and GVA->GPA translation in
> following patches. It is unnecessary to keep it up-to-date at all times.
> 
> It also should be noted that the potential benefits of this caching are
> tenuous because the MSR read is not a hot path. It's nice-to-have so that
> we don't hesitate to rip it out in the future if there's a strong reason
> to drop the caching.


The patch looks fine, but this needs to be moved to the end of the series.
Definitely after "KVM: VMX: Expose PKS to guest", and maybe even after "KVM: VMX:
Enable PKS for nested VM".

If there's a bug with the caching logic, I want to be able to bisect to that.  By
implementing caching before any of the other PKS support, a bug in either the
caching or the virtualization will bisect to the PKS virtualization

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 3/8] KVM: X86: Expose IA32_PKRS MSR
  2022-04-24 10:15 ` [PATCH v7 3/8] KVM: X86: Expose IA32_PKRS MSR Lei Wang
@ 2022-05-24 22:11   ` Sean Christopherson
  2022-05-27  9:21     ` Wang, Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-05-24 22:11 UTC (permalink / raw)
  To: Lei Wang
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

Nit, something like:

  KVM: X86: Virtualize and pass-through IA32_PKRS MSR when supported

because "expose" doesn't precisely cover the pass-through behavior

On Sun, Apr 24, 2022, Lei Wang wrote:
> @@ -1111,6 +1113,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>  #endif
>  	unsigned long fs_base, gs_base;
>  	u16 fs_sel, gs_sel;
> +	u32 host_pkrs;
>  	int i;
>  
>  	vmx->req_immediate_exit = false;
> @@ -1146,6 +1149,17 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>  	 */
>  	host_state->ldt_sel = kvm_read_ldt();
>  
> +	/*
> +	 * Update the host pkrs vmcs field before vcpu runs.
> +	 * The setting of VM_EXIT_LOAD_IA32_PKRS can ensure
> +	 * kvm_cpu_cap_has(X86_FEATURE_PKS) &&
> +	 * guest_cpuid_has(vcpu, X86_FEATURE_PKS)
> +	 */

Eh, I don't think this comment adds anything.  And practically speaking, whether
or not X86_FEATURE_PKS is reported by kvm_cpu_cap_has() and/or guest_cpuid_has()
is irrelevant.  If KVM is loading PKRS on exit, then the VMCS needs to hold an
up-to-date value.  E.g. if for some reason KVM enables the control even if PKS
isn't exposed to the guest (see below), this code still needs to refresh the value.

> +	if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS) {
> +		host_pkrs = get_current_pkrs();

No need for an intermediate host_pkrs.  I.e. this can simply be:

	if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS)
		vmx_set_host_pkrs(host_state, get_current_pkrs());
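
For reference, I'd expect the helper itself to be something like the below,
i.e. only do the VMWRITE when the cached value changes -- the "pkrs" field in
vmcs_host_state is my assumption, adjust to whatever the patch actually adds:

	static inline void vmx_set_host_pkrs(struct vmcs_host_state *host,
					     u32 pkrs)
	{
		/* Skip the VMWRITE if the cached host value is unchanged. */
		if (pkrs != host->pkrs) {
			vmcs_write64(HOST_IA32_PKRS, pkrs);
			host->pkrs = pkrs;
		}
	}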

> +		vmx_set_host_pkrs(host_state, host_pkrs);
> +	}
> +
>  #ifdef CONFIG_X86_64
>  	savesegment(ds, host_state->ds_sel);
>  	savesegment(es, host_state->es_sel);
> @@ -1901,6 +1915,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_DEBUGCTLMSR:
>  		msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
>  		break;
> +	case MSR_IA32_PKRS:
> +		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
> +		    (!msr_info->host_initiated &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))

Nit, please align the lines that are inside a pair of parantheses, i.e.

		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
		    (!msr_info->host_initiated &&
		     !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
			return 1;

> +			return 1;
> +		msr_info->data = kvm_read_pkrs(vcpu);

A comment on the caching patch, please use kvm_pkrs_read() to follow the GPR, RIP,
PDPTR, etc... terminology (CR0/3/4 got grandfathered in).
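
I.e. the read helper in kvm_cache_regs.h would look something like this
(untested, mirroring the kvm_read_cr3() pattern; VCPU_EXREG_PKRS and
vcpu->arch.pkrs come from the caching patch):

	static inline u32 kvm_pkrs_read(struct kvm_vcpu *vcpu)
	{
		/* Refresh the cached value from the VMCS on first use. */
		if (!kvm_register_is_available(vcpu, VCPU_EXREG_PKRS))
			static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_PKRS);

		return vcpu->arch.pkrs;
	}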

> +		break;
>  	default:
>  	find_uret_msr:
>  		msr = vmx_find_uret_msr(vmx, msr_info->index);
> @@ -2242,7 +2263,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		}
>  		ret = kvm_set_msr_common(vcpu, msr_info);
>  		break;
> -
> +	case MSR_IA32_PKRS:
> +		if (!kvm_pkrs_valid(data))
> +			return 1;

Nit, please move this to after the capability checks.  Does not affect functionality
at all, but logically it doesn't make sense to check for a valid value of a register
that doesn't exist.
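
I.e. just swap the order of the checks (untested):

	case MSR_IA32_PKRS:
		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
		    (!msr_info->host_initiated &&
		     !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
			return 1;
		if (!kvm_pkrs_valid(data))
			return 1;
		vcpu->arch.pkrs = data;
		kvm_register_mark_available(vcpu, VCPU_EXREG_PKRS);
		vmcs_write64(GUEST_IA32_PKRS, data);
		break;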

> +		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
> +		    (!msr_info->host_initiated &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
> +			return 1;
> +		vcpu->arch.pkrs = data;
> +		kvm_register_mark_available(vcpu, VCPU_EXREG_PKRS);
> +		vmcs_write64(GUEST_IA32_PKRS, data);

The caching patch should add kvm_write_pkrs().  And in there, I think I'd vote for
a WARN_ON_ONCE() that the incoming value doesn't set bits 63:32.  It's redundant
with kvm_pkrs_valid()
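
Something like this (untested sketch, assuming pkrs stays tracked as a u32):

	static inline void kvm_write_pkrs(struct kvm_vcpu *vcpu, u64 data)
	{
		/* PKRS bits 63:32 are reserved; the cached value is a u32. */
		WARN_ON_ONCE(data >> 32);

		vcpu->arch.pkrs = (u32)data;
		kvm_register_mark_available(vcpu, VCPU_EXREG_PKRS);
	}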


> +		break;
>  	default:
>  	find_uret_msr:
>  		msr = vmx_find_uret_msr(vmx, msr_index);

...

> @@ -7406,6 +7445,20 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  
>  	/* Refresh #PF interception to account for MAXPHYADDR changes. */
>  	vmx_update_exception_bitmap(vcpu);
> +
> +	if (kvm_cpu_cap_has(X86_FEATURE_PKS)) {
> +		if (guest_cpuid_has(vcpu, X86_FEATURE_PKS)) {
> +			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
> +
> +			vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
> +			vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);

Ugh, toggling the entry/exit controls here won't do the correct thing if L2 is
active.  The MSR intercept logic works because it always operates on vmcs01's bitmap.

Let's keep this as simple as possible and set the controls if they're supported.
KVM will waste a few cycles on entry/exit if PKS is supported in the host but not
exposed to the guest, but that's not the end of the world.  If we want to optimize
that, then we can do that in the future and in a more generic way.

So this becomes:

	if (kvm_cpu_cap_has(X86_FEATURE_PKS))
		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW,
					  !guest_cpuid_has(vcpu, X86_FEATURE_PKS));

And please hoist that up to be next to the handling of X86_FEATURE_XFD to try and
bunch together code that does similar things.

The last edge case to deal with is if only one of the controls is supported, e.g. if
an L0 hypervisor is being evil.  Big surprise, BNDCFGS doesn't get this right and
needs a bug fix patch, e.g. it could retain the guest's value after exit, or host's
value after entry.  It's a silly case because it basically requires broken host
"hardware", but it'd be nice to get it right in KVM.  And that'd also be a good
opportunity to handle all of the pairs, i.e. clear the bits during setup_vmcs_config()
instead of checking both the entry and exit flags in cpu_has_load_*().

So for this series, optimistically assume my idea will pan out and an
adjust_vm_entry_exit_pair() helper will exist by the time this is fully baked.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6927f6e8ec31..53e12e6006af 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2434,6 +2434,16 @@ static bool cpu_has_sgx(void)
        return cpuid_eax(0) >= 0x12 && (cpuid_eax(0x12) & BIT(0));
 }

+static __init void adjust_vm_entry_exit_pair(u32 *entry_controls, u32 entry_bit,
+                                            u32 *exit_controls, u32 exit_bit)
+{
+       if ((*entry_controls & entry_bit) && (*exit_controls & exit_bit))
+               return;
+
+       *entry_controls &= ~entry_bit;
+       *exit_controls &= ~exit_bit;
+}
+
 static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
                                      u32 msr, u32 *result)
 {
@@ -2614,6 +2624,9 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
                                &_vmentry_control) < 0)
                return -EIO;

+       adjust_vm_entry_exit_pair(&_vmentry_control, VM_ENTRY_LOAD_IA32_PKRS,
+                                 &_vmexit_control, VM_EXIT_LOAD_IA32_PKRS);
+
        /*
         * Some cpus support VM_{ENTRY,EXIT}_IA32_PERF_GLOBAL_CTRL but they
         * can't be used due to an errata where VM Exit may incorrectly clear
@@ -7536,6 +7549,9 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
                vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD_ERR, MSR_TYPE_R,
                                          !guest_cpuid_has(vcpu, X86_FEATURE_XFD));

+       if (kvm_cpu_cap_has(X86_FEATURE_PKS))
+               vmx_set_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW,
+                                         !guest_cpuid_has(vcpu, X86_FEATURE_PKS));

        set_cr4_guest_host_mask(vmx);


> +		} else {
> +			vmx_enable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
> +
> +			vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
> +			vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
> +		}
> +	}
>  }
>  

...

> @@ -11410,6 +11414,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  	kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
>  	kvm_rip_write(vcpu, 0xfff0);
>  
> +	if (!init_event && kvm_cpu_cap_has(X86_FEATURE_PKS))
> +		__kvm_set_msr(vcpu, MSR_IA32_PKRS, 0, true);

Please put this with the other !init_event stuff, e.g. look for MSR_IA32_XSS.
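
I.e. roughly (untested, exact placement doesn't matter much):

	if (!init_event) {
		...
		__kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);

		if (kvm_cpu_cap_has(X86_FEATURE_PKS))
			__kvm_set_msr(vcpu, MSR_IA32_PKRS, 0, true);
	}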

>  	vcpu->arch.cr3 = 0;
>  	kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
>  
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 5/8] KVM: MMU: Add helper function to get pkr bits
  2022-04-24 10:15 ` [PATCH v7 5/8] KVM: MMU: Add helper function to get pkr bits Lei Wang
@ 2022-05-24 23:21   ` Sean Christopherson
  2022-05-27  9:28     ` Wang, Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-05-24 23:21 UTC (permalink / raw)
  To: Lei Wang
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On Sun, Apr 24, 2022, Lei Wang wrote:
> Extra the PKR stuff to a separate, non-inline helper, which is a

s/Extra/Extract

> preparation to introduce pks support.

Please provide more justification.  The change is justified, but random readers of
this patch/commit will be clueless.

  Extract getting the effective PKR bits to a helper that lives in mmu.c
  in order to keep the is_cr4_*() helpers contained to mmu.c.  Support for
  PKRS (versus just PKRU) will require querying MMU state to see if the
  relevant CR4 bit is enabled because pkr_mask will be non-zero if _either_
  bit is enabled.

  PKR{U,S} are exposed to the guest if and only if TDP is enabled, and
  while permission_fault() is performance critical for ia32 shadow paging,
  it's a rarely used path when TDP is enabled.  I.e. moving the PKR code
  out-of-line is not a performance concern.

> Signed-off-by: Lei Wang <lei4.wang@intel.com>
> ---
>  arch/x86/kvm/mmu.h     | 20 +++++---------------
>  arch/x86/kvm/mmu/mmu.c | 21 +++++++++++++++++++++
>  2 files changed, 26 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index cb3f07e63778..cea03053a153 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -204,6 +204,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	return vcpu->arch.mmu->page_fault(vcpu, &fault);
>  }
>  
> +u32 kvm_mmu_pkr_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,

kvm_mmu_get_pkr_bits() so that there's a verb in there.

> +		     unsigned pte_access, unsigned pte_pkey, unsigned int pfec);
> +
>  /*
>   * Check if a given access (described through the I/D, W/R and U/S bits of a
>   * page fault error code pfec) causes a permission fault with the given PTE
> @@ -240,21 +243,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  
>  	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
>  	if (unlikely(mmu->pkr_mask)) {
> -		u32 pkr_bits, offset;
> -
> -		/*
> -		* PKRU defines 32 bits, there are 16 domains and 2
> -		* attribute bits per domain in pkru.  pte_pkey is the
> -		* index of the protection domain, so pte_pkey * 2 is
> -		* is the index of the first bit for the domain.
> -		*/
> -		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
> -
> -		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
> -		offset = (pfec & ~1) +
> -			((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
> -
> -		pkr_bits &= mmu->pkr_mask >> offset;
> +		u32 pkr_bits =
> +			kvm_mmu_pkr_bits(vcpu, mmu, pte_access, pte_pkey, pfec);

Nit, I prefer wrapping in the params, that way the first line shows the most
important information, e.g. what variable is being set and how (by a function call).
And then there won't be overflow with the longer helper name:

		u32 pkr_bits = kvm_mmu_get_pkr_bits(vcpu, mmu, pte_access,
						    pte_pkey, pfec);

>  		errcode |= -pkr_bits & PFERR_PK_MASK;
>  		fault |= (pkr_bits != 0);
>  	}
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index de665361548d..6d3276986102 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6477,3 +6477,24 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>  	if (kvm->arch.nx_lpage_recovery_thread)
>  		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
>  }
> +
> +u32 kvm_mmu_pkr_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> +		     unsigned pte_access, unsigned pte_pkey, unsigned int pfec)
> +{
> +	u32 pkr_bits, offset;
> +
> +	/*
> +	* PKRU defines 32 bits, there are 16 domains and 2

Comment needs to be aligned, and it can be adjusted to wrap at 80 chars (its
indentation has changed).


	/*
	 * PKRU and PKRS both define 32 bits. There are 16 domains and 2
	 * attribute bits per domain in them. pte_key is the index of the
	 * protection domain, so pte_pkey * 2 is the index of the first bit for
	 * the domain. The use of PKRU versus PKRS is selected by the address
	 * type, as determined by the U/S bit in the paging-structure entries.
	 */

> +	* attribute bits per domain in pkru.  pte_pkey is the
> +	* index of the protection domain, so pte_pkey * 2 is
> +	* is the index of the first bit for the domain.
> +	*/
> +	pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
> +
> +	/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
> +	offset = (pfec & ~1) + ((pte_access & PT_USER_MASK)
> +				<< (PFERR_RSVD_BIT - PT_USER_SHIFT));
> +
> +	pkr_bits &= mmu->pkr_mask >> offset;
> +	return pkr_bits;
> +}
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 6/8] KVM: MMU: Add support for PKS emulation
  2022-04-24 10:15 ` [PATCH v7 6/8] KVM: MMU: Add support for PKS emulation Lei Wang
@ 2022-05-24 23:28   ` Sean Christopherson
  2022-05-27  9:40     ` Wang, Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-05-24 23:28 UTC (permalink / raw)
  To: Lei Wang
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On Sun, Apr 24, 2022, Lei Wang wrote:
> @@ -454,10 +455,11 @@ struct kvm_mmu {
>  	u8 permissions[16];
>  
>  	/*
> -	* The pkru_mask indicates if protection key checks are needed.  It
> -	* consists of 16 domains indexed by page fault error code bits [4:1],
> -	* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
> -	* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
> +	* The pkr_mask indicates if protection key checks are needed.
> +	* It consists of 16 domains indexed by page fault error code
> +	* bits[4:1] with PFEC.RSVD replaced by ACC_USER_MASK from the
> +	* page tables. Each domain has 2 bits which are ANDed with AD
> +	* and WD from PKRU/PKRS.

Same comments, align and wrap closer to 80 please.

>  	*/
>  	u32 pkr_mask;
>  
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index cea03053a153..6963c641e6ce 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -45,7 +45,8 @@
>  #define PT32E_ROOT_LEVEL 3
>  
>  #define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PSE | X86_CR4_PAE | X86_CR4_LA57 | \
> -			       X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE)
> +			       X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE | \
> +			       X86_CR4_PKS)
>  
>  #define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)
>  #define KVM_MMU_EFER_ROLE_BITS (EFER_LME | EFER_NX)
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6d3276986102..a6cbc22d3312 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -209,6 +209,7 @@ BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smep, X86_CR4_SMEP);
>  BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smap, X86_CR4_SMAP);
>  BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pke, X86_CR4_PKE);
>  BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, la57, X86_CR4_LA57);
> +BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pks, X86_CR4_PKS);
>  BUILD_MMU_ROLE_REGS_ACCESSOR(efer, nx, EFER_NX);
>  BUILD_MMU_ROLE_REGS_ACCESSOR(efer, lma, EFER_LMA);
>  
> @@ -231,6 +232,7 @@ BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smep);
>  BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smap);
>  BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pke);
>  BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, la57);
> +BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pks);
>  BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);
>  
>  static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
> @@ -4608,37 +4610,58 @@ static void update_permission_bitmask(struct kvm_mmu *mmu, bool ept)
>  }
>  
>  /*

...

> + * Protection Key Rights (PKR) is an additional mechanism by which data accesses
> + * with 4-level or 5-level paging (EFER.LMA=1) may be disabled based on the
> + * Protection Key Rights Userspace (PRKU) or Protection Key Rights Supervisor
> + * (PKRS) registers.  The Protection Key (PK) used for an access is a 4-bit
> + * value specified in bits 62:59 of the leaf PTE used to translate the address.
> + *
> + * PKRU and PKRS are 32-bit registers, with 16 2-bit entries consisting of an
> + * access-disable (AD) and write-disable (WD) bit.  The PK from the leaf PTE is
> + * used to index the approriate PKR (see below), e.g. PK=1 would consume bits

s/approriate/appropriate

> + * 3:2 (bit 3 == write-disable, bit 2 == access-disable).
> + *
> + * The PK register (PKRU vs. PKRS) indexed by the PK depends on the type of
> + * _address_ (not access type!).  For a user-mode address, PKRU is used; for a
> + * supervisor-mode address, PKRS is used.  An address is supervisor-mode if the
> + * U/S flag (bit 2) is 0 in at least one of the paging-structure entries, i.e.
> + * an address is user-mode if the U/S flag is 1 in _all_ entries.  Again, this
> + * is the address type, not the the access type, e.g. a supervisor-mode _access_

Double "the the" can be a single "the".

> + * will consume PKRU if the _address_ is a user-mode address.
> + *
> + * As alluded to above, PKR checks are only performed for data accesses; code
> + * fetches are not subject to PKR checks.  Terminal page faults (!PRESENT or
> + * PFEC.RSVD=1) are also not subject to PKR checks.
> + *
> + * PKR write-disable checks for superivsor-mode _accesses_ are performed if and
> + * only if CR0.WP=1 (though access-disable checks still apply).
> + *
> + * In summary, PKR checks are based on (a) EFER.LMA, (b) CR4.PKE or CR4.PKS,
> + * (c) CR0.WP, (d) the PK in the leaf PTE, (e) two bits from the corresponding
> + * PKR{S,U} entry, (f) the access type (derived from the other PFEC bits), and
> + * (g) the address type (retrieved from the paging-structure entries).
> + *
> + * To avoid conditional branches in permission_fault(), the PKR bitmask caches
> + * the above inputs, except for (e) the PKR{S,U} entry.  The FETCH, USER, and
> + * WRITE bits of the PFEC and the effective value of the paging-structures' U/S
> + * bit (slotted into the PFEC.RSVD position, bit 3) are used to index into the
> + * PKR bitmask (similar to the 4-bit Protection Key itself).  The two bits of
> + * the PKR bitmask "entry" are then extracted and ANDed with the two bits of
> + * the PKR{S,U} register corresponding to the address type and protection key.
> + *
> + * E.g. for all values where PFEC.FETCH=1, the corresponding pkr_bitmask bits
> + * will be 00b, thus masking away the AD and WD bits from the PKR{S,U} register
> + * to suppress PKR checks on code fetches.
> + */
>  static void update_pkr_bitmask(struct kvm_mmu *mmu)
>  {
>  	unsigned bit;
>  	bool wp;
> -

Please keep this newline, i.e. after the declaration of the cr4 booleans.  That
helps isolate the clearing of mmu->pkr_mask, which makes the functional effect of
the earlier return more obvious.

Ah, and use reverse fir tree for the variable declarations, i.e.

	bool cr4_pke = is_cr4_pke(mmu);
	bool cr4_pks = is_cr4_pks(mmu);
	unsigned bit;
	bool wp;

	mmu->pkr_mask = 0;

	if (!cr4_pke && !cr4_pks)
		return;

> +	bool cr4_pke = is_cr4_pke(mmu);
> +	bool cr4_pks = is_cr4_pks(mmu);
>  	mmu->pkr_mask = 0;
>  
> -	if (!is_cr4_pke(mmu))
> +	if (!cr4_pke && !cr4_pks)
>  		return;
>  
>  	wp = is_cr0_wp(mmu);
  

  ...

> @@ -6482,14 +6509,22 @@ u32 kvm_mmu_pkr_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  		     unsigned pte_access, unsigned pte_pkey, unsigned int pfec)
>  {
>  	u32 pkr_bits, offset;
> +	u32 pkr;
>  
>  	/*
> -	* PKRU defines 32 bits, there are 16 domains and 2
> -	* attribute bits per domain in pkru.  pte_pkey is the
> -	* index of the protection domain, so pte_pkey * 2 is
> -	* is the index of the first bit for the domain.
> +	* PKRU and PKRS both define 32 bits. There are 16 domains
> +	* and 2 attribute bits per domain in them. pte_key is the
> +	* index of the protection domain, so pte_pkey * 2 is the
> +	* index of the first bit for the domain. The use of PKRU
> +	* versus PKRS is selected by the address type, as determined
> +	* by the U/S bit in the paging-structure entries.


Align and wrap closer to 80 please.

>  	*/
> -	pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
> +	if (pte_access & PT_USER_MASK)
> +		pkr = is_cr4_pke(mmu) ? vcpu->arch.pkru : 0;
> +	else
> +		pkr = is_cr4_pks(mmu) ? kvm_read_pkrs(vcpu) : 0;
> +
> +	pkr_bits = (pkr >> pte_pkey * 2) & 3;
>  
>  	/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
>  	offset = (pfec & ~1) + ((pte_access & PT_USER_MASK)
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 7/8] KVM: VMX: Expose PKS to guest
  2022-04-24 10:15 ` [PATCH v7 7/8] KVM: VMX: Expose PKS to guest Lei Wang
@ 2022-05-24 23:34   ` Sean Christopherson
  2022-05-27  9:42     ` Wang, Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-05-24 23:34 UTC (permalink / raw)
  To: Lei Wang
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On Sun, Apr 24, 2022, Lei Wang wrote:
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9d0588e85410..cbcb0d7b47a4 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3250,7 +3250,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  		}
>  
>  		/*
> -		 * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
> +		 * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
>  		 * hardware.  To emulate this behavior, SMEP/SMAP/PKU needs

Heh, missed one ;-)  Let's reduce future pain and reword this whole comment:

		/*
		 * SMEP/SMAP/PKU/PKS are effectively disabled if the CPU is in
		 * non-paging mode in hardware.  To emulate this behavior,
		 * clear them in the hardware CR4 when the guest switches to
		 * non-paging mode and unrestricted guest is disabled, as KVM
		 * must run the guest with hardware CR0.PG=1.
		 */


>  		 * to be manually disabled when guest switches to non-paging
>  		 * mode.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 1/8] KVM: VMX: Introduce PKS VMCS fields
  2022-05-24 20:55   ` Sean Christopherson
@ 2022-05-27  1:55     ` Wang, Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Wang, Lei @ 2022-05-27  1:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On 5/25/2022 4:55 AM, Sean Christopherson wrote:
> Uber nit, PKRS isn't saved if VMX doesn't support the entry control.
>    Every VM exit saves PKRS into guest-state area if VM_ENTRY_LOAD_IA32_PKRS
>    is supported by the CPU.
>
> With that tweak,

Makes sense, will fix it.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 2/8] KVM: VMX: Add proper cache tracking for PKRS
  2022-05-24 21:00   ` Sean Christopherson
@ 2022-05-27  2:16     ` Wang, Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Wang, Lei @ 2022-05-27  2:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On 5/25/2022 5:00 AM, Sean Christopherson wrote:
> The patch looks fine, but this needs to be moved to the end of the series.
> Definitely after "KVM: VMX: Expose PKS to guest", and maybe even after "KVM: VMX:
> Enable PKS for nested VM".

OK, will do that.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 3/8] KVM: X86: Expose IA32_PKRS MSR
  2022-05-24 22:11   ` Sean Christopherson
@ 2022-05-27  9:21     ` Wang, Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Wang, Lei @ 2022-05-27  9:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On 5/25/2022 6:11 AM, Sean Christopherson wrote:
> Nit, something like:
>
>    KVM: X86: Virtualize and pass-through IA32_PKRS MSR when supported
>
> because "expose" doesn't precisely cover the pass-through behavior

Will change it.

> Eh, I don't think this comment adds anything.  And practically speaking, whether
> or not X86_FEATURE_PKS is reported by kvm_cpu_cap_has() and/or guest_cpuid_has()
> is irrelevant.  If KVM is loading PKRS on exit, then the VMCS needs to hold an
> up-to-date value.  E.g. if for some reason KVM enables the control even if PKS
> isn't exposed to the guest (see below), this code still needs to refresh the value.
Will remove the irrelevant statement and just leave "Update the host
pkrs vmcs field before vcpu runs."

> No need for an intermediate host_pkrs.  I.e. this can simply be:
>
> 	if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS)
> 		vmx_set_host_pkrs(host_state, get_current_pkrs());
OK, will write the code in one line.

> Nit, please align the lines that are inside a pair of parantheses, i.e.
My mistake, will align the lines.

> A comment on the caching patch, please use kvm_pkrs_read() to follow the GPR, RIP,
> PDPTR, etc... terminology (CR0/3/4 got grandfathered in).
Will rename the function.

> Nit, please move this to after the capability checks.  Does not affect functionality
> at all, but logically it doesn't make sense to check for a valid value of a register
> that doesn't exist.
Makes sense, will move it.

> The caching patch should add kvm_write_pkrs().  And in there, I think I'd vote for
> a WARN_ON_ONCE() that the incoming value doesn't set bits 63:32.  It's redundant
> with kvm_pkrs_valid()
Will add a new function in the caching patch.

> Ugh, toggling the entry/exit controls here won't do the correct thing if L2 is
> active.  The MSR intercept logic works because it always operates on vmcs01's bitmap.
>
> Let's keep this as simple as possible and set the controls if they're supported.
> KVM will waste a few cycles on entry/exit if PKS is supported in the host but not
> exposed to the guest, but that's not the end of the world.  If we want to optimize
> that, then we can do that in the future and in a more generic way.
>
> So this becomes:
>
> 	if (kvm_cpu_cap_has(X86_FEATURE_PKS))
> 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW,
> 					  !guest_cpuid_has(vcpu, X86_FEATURE_PKS));
>
> And please hoist that up to be next to the handling of X86_FEATURE_XFD to try and
> bunch together code that does similar things.
>
> The last edge case to deal with is if only one of the controls is supported, e.g. if
> an L0 hypervisor is being evil.  Big surprise, BNDCFGS doesn't get this right and
> needs a bug fix patch, e.g. it could retain the guest's value after exit, or host's
> value after entry.  It's a silly case because it basically requires broken host
> "hardware", but it'd be nice to get it right in KVM.  And that'd also be a good
> opportunity to handle all of the pairs, i.e. clear the bits during setup_vmcs_config()
> instead of checking both the entry and exit flags in cpu_has_load_*().
>
> So for this series, optimistically assume my idea will pan out and an
> adjust_vm_entry_exit_pair() helper will exist by the time this is fully baked.
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 6927f6e8ec31..53e12e6006af 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2434,6 +2434,16 @@ static bool cpu_has_sgx(void)
>          return cpuid_eax(0) >= 0x12 && (cpuid_eax(0x12) & BIT(0));
>   }
>
> +static __init void adjust_vm_entry_exit_pair(u32 *entry_controls, u32 entry_bit,
> +                                            u32 *exit_controls, u32 exit_bit)
> +{
> +       if ((*entry_controls & entry_bit) && (*exit_controls & exit_bit))
> +               return;
> +
> +       *entry_controls &= ~entry_bit;
> +       *exit_controls &= ~exit_bit;
> +}
> +
>   static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
>                                        u32 msr, u32 *result)
>   {
> @@ -2614,6 +2624,9 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>                                  &_vmentry_control) < 0)
>                  return -EIO;
>
> +       adjust_vm_entry_exit_pair(&_vmentry_control, VM_ENTRY_LOAD_IA32_PKRS,
> +                                 &_vmexit_control, VM_EXIT_LOAD_IA32_PKRS);
> +
>          /*
>           * Some cpus support VM_{ENTRY,EXIT}_IA32_PERF_GLOBAL_CTRL but they
>           * can't be used due to an errata where VM Exit may incorrectly clear
> @@ -7536,6 +7549,9 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>                  vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD_ERR, MSR_TYPE_R,
>                                            !guest_cpuid_has(vcpu, X86_FEATURE_XFD));
>
> +       if (kvm_cpu_cap_has(X86_FEATURE_PKS))
> +               vmx_set_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW,
> +                                         !guest_cpuid_has(vcpu, X86_FEATURE_PKS));
>
>          set_cr4_guest_host_mask(vmx);
Thanks for your suggestions, will try your helper.

> Please put this with the other !init_event stuff, e.g. look for MSR_IA32_XSS.
Will do it.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 5/8] KVM: MMU: Add helper function to get pkr bits
  2022-05-24 23:21   ` Sean Christopherson
@ 2022-05-27  9:28     ` Wang, Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Wang, Lei @ 2022-05-27  9:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On 5/25/2022 7:21 AM, Sean Christopherson wrote:
> On Sun, Apr 24, 2022, Lei Wang wrote:
>> Extra the PKR stuff to a separate, non-inline helper, which is a
> s/Extra/Extract

My mistake, will fix it.

>> preparation to introduce pks support.
> Please provide more justification.  The change is justified, but random readers of
> this patch/commit will be clueless.
>
>    Extract getting the effective PKR bits to a helper that lives in mmu.c
>    in order to keep the is_cr4_*() helpers contained to mmu.c.  Support for
>    PKRS (versus just PKRU) will require querying MMU state to see if the
>    relevant CR4 bit is enabled because pkr_mask will be non-zero if _either_
>    bit is enabled.
>
>    PKR{U,S} are exposed to the guest if and only if TDP is enabled, and
>    while permission_fault() is performance critical for ia32 shadow paging,
>    it's a rarely used path with TDP is enabled.  I.e. moving the PKR code
>    out-of-line is not a performance concern.

Will add more justification to make it clearer to random readers.

>> Signed-off-by: Lei Wang <lei4.wang@intel.com>
>> ---
>>   arch/x86/kvm/mmu.h     | 20 +++++---------------
>>   arch/x86/kvm/mmu/mmu.c | 21 +++++++++++++++++++++
>>   2 files changed, 26 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index cb3f07e63778..cea03053a153 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -204,6 +204,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>>   	return vcpu->arch.mmu->page_fault(vcpu, &fault);
>>   }
>>   
>> +u32 kvm_mmu_pkr_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> kvm_mmu_get_pkr_bits() so that there's a verb in there.

Will fix it.

>> +		     unsigned pte_access, unsigned pte_pkey, unsigned int pfec);
>> +
>>   /*
>>    * Check if a given access (described through the I/D, W/R and U/S bits of a
>>    * page fault error code pfec) causes a permission fault with the given PTE
>> @@ -240,21 +243,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>   
>>   	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
>>   	if (unlikely(mmu->pkr_mask)) {
>> -		u32 pkr_bits, offset;
>> -
>> -		/*
>> -		* PKRU defines 32 bits, there are 16 domains and 2
>> -		* attribute bits per domain in pkru.  pte_pkey is the
>> -		* index of the protection domain, so pte_pkey * 2 is
>> -		* is the index of the first bit for the domain.
>> -		*/
>> -		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
>> -
>> -		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
>> -		offset = (pfec & ~1) +
>> -			((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
>> -
>> -		pkr_bits &= mmu->pkr_mask >> offset;
>> +		u32 pkr_bits =
>> +			kvm_mmu_pkr_bits(vcpu, mmu, pte_access, pte_pkey, pfec);
> Nit, I prefer wrapping in the params, that way the first line shows the most
> important information, e.g. what variable is being set and how (by a function call).
> And then there won't be overflow with the longer helper name:
>
> 		u32 pkr_bits = kvm_mmu_get_pkr_bits(vcpu, mmu, pte_access,
> 						    pte_pkey, pfec);

Makes sense, will make the wrapping more reasonable.

> Comment needs to be aligned, and it can be adjusted to wrap at 80 chars (its
> indentation has changed).
>
> 	/*
> 	 * PKRU and PKRS both define 32 bits. There are 16 domains and 2
> 	 * attribute bits per domain in them. pte_key is the index of the
> 	 * protection domain, so pte_pkey * 2 is the index of the first bit for
> 	 * the domain. The use of PKRU versus PKRS is selected by the address
> 	 * type, as determined by the U/S bit in the paging-structure entries.
> 	 */
Will align and adjust it.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 6/8] KVM: MMU: Add support for PKS emulation
  2022-05-24 23:28   ` Sean Christopherson
@ 2022-05-27  9:40     ` Wang, Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Wang, Lei @ 2022-05-27  9:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On 5/25/2022 7:28 AM, Sean Christopherson wrote:
> On Sun, Apr 24, 2022, Lei Wang wrote:
>> @@ -454,10 +455,11 @@ struct kvm_mmu {
>>   	u8 permissions[16];
>>   
>>   	/*
>> -	* The pkru_mask indicates if protection key checks are needed.  It
>> -	* consists of 16 domains indexed by page fault error code bits [4:1],
>> -	* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
>> -	* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
>> +	* The pkr_mask indicates if protection key checks are needed.
>> +	* It consists of 16 domains indexed by page fault error code
>> +	* bits[4:1] with PFEC.RSVD replaced by ACC_USER_MASK from the
>> +	* page tables. Each domain has 2 bits which are ANDed with AD
>> +	* and WD from PKRU/PKRS.
> Same comments, align and wrap closer to 80 please.
Will do it.
>>   	*/
>>   	u32 pkr_mask;
>>   
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index cea03053a153..6963c641e6ce 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -45,7 +45,8 @@
>>   #define PT32E_ROOT_LEVEL 3
>>   
>>   #define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PSE | X86_CR4_PAE | X86_CR4_LA57 | \
>> -			       X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE)
>> +			       X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE | \
>> +			       X86_CR4_PKS)
>>   
>>   #define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)
>>   #define KVM_MMU_EFER_ROLE_BITS (EFER_LME | EFER_NX)
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 6d3276986102..a6cbc22d3312 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -209,6 +209,7 @@ BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smep, X86_CR4_SMEP);
>>   BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smap, X86_CR4_SMAP);
>>   BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pke, X86_CR4_PKE);
>>   BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, la57, X86_CR4_LA57);
>> +BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pks, X86_CR4_PKS);
>>   BUILD_MMU_ROLE_REGS_ACCESSOR(efer, nx, EFER_NX);
>>   BUILD_MMU_ROLE_REGS_ACCESSOR(efer, lma, EFER_LMA);
>>   
>> @@ -231,6 +232,7 @@ BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smep);
>>   BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smap);
>>   BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pke);
>>   BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, la57);
>> +BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pks);
>>   BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);
>>   
>>   static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
>> @@ -4608,37 +4610,58 @@ static void update_permission_bitmask(struct kvm_mmu *mmu, bool ept)
>>   }
>>   
>>   /*
> ...
>
>> + * Protection Key Rights (PKR) is an additional mechanism by which data accesses
>> + * with 4-level or 5-level paging (EFER.LMA=1) may be disabled based on the
>> + * Protection Key Rights Userspace (PRKU) or Protection Key Rights Supervisor
>> + * (PKRS) registers.  The Protection Key (PK) used for an access is a 4-bit
>> + * value specified in bits 62:59 of the leaf PTE used to translate the address.
>> + *
>> + * PKRU and PKRS are 32-bit registers, with 16 2-bit entries consisting of an
>> + * access-disable (AD) and write-disable (WD) bit.  The PK from the leaf PTE is
>> + * used to index the approriate PKR (see below), e.g. PK=1 would consume bits
> s/approriate/appropriate
Will correct it.
>> + * 3:2 (bit 3 == write-disable, bit 2 == access-disable).
>> + *
>> + * The PK register (PKRU vs. PKRS) indexed by the PK depends on the type of
>> + * _address_ (not access type!).  For a user-mode address, PKRU is used; for a
>> + * supervisor-mode address, PKRS is used.  An address is supervisor-mode if the
>> + * U/S flag (bit 2) is 0 in at least one of the paging-structure entries, i.e.
>> + * an address is user-mode if the U/S flag is 1 in _all_ entries.  Again, this
>> + * is the address type, not the the access type, e.g. a supervisor-mode _access_
> Double "the the" can be a single "the".
Will remove it.
>> + * will consume PKRU if the _address_ is a user-mode address.
>> + *
>> + * As alluded to above, PKR checks are only performed for data accesses; code
>> + * fetches are not subject to PKR checks.  Terminal page faults (!PRESENT or
>> + * PFEC.RSVD=1) are also not subject to PKR checks.
>> + *
>> + * PKR write-disable checks for superivsor-mode _accesses_ are performed if and
>> + * only if CR0.WP=1 (though access-disable checks still apply).
>> + *
>> + * In summary, PKR checks are based on (a) EFER.LMA, (b) CR4.PKE or CR4.PKS,
>> + * (c) CR0.WP, (d) the PK in the leaf PTE, (e) two bits from the corresponding
>> + * PKR{S,U} entry, (f) the access type (derived from the other PFEC bits), and
>> + * (g) the address type (retrieved from the paging-structure entries).
>> + *
>> + * To avoid conditional branches in permission_fault(), the PKR bitmask caches
>> + * the above inputs, except for (e) the PKR{S,U} entry.  The FETCH, USER, and
>> + * WRITE bits of the PFEC and the effective value of the paging-structures' U/S
>> + * bit (slotted into the PFEC.RSVD position, bit 3) are used to index into the
>> + * PKR bitmask (similar to the 4-bit Protection Key itself).  The two bits of
>> + * the PKR bitmask "entry" are then extracted and ANDed with the two bits of
>> + * the PKR{S,U} register corresponding to the address type and protection key.
>> + *
>> + * E.g. for all values where PFEC.FETCH=1, the corresponding pkr_bitmask bits
>> + * will be 00b, thus masking away the AD and WD bits from the PKR{S,U} register
>> + * to suppress PKR checks on code fetches.
>> + */
>>   static void update_pkr_bitmask(struct kvm_mmu *mmu)
>>   {
>>   	unsigned bit;
>>   	bool wp;
>> -
> Please keep this newline, i.e. after the declaration of the cr4 booleans.  That
> helps isolate the clearing of mmu->pkr_mask, which makes the functional effect of
> the earlier return more obvious.
>
> Ah, and use reverse fir tree for the variable declarations, i.e.
>
> 	bool cr4_pke = is_cr4_pke(mmu);
> 	bool cr4_pks = is_cr4_pks(mmu);
> 	unsigned bit;
> 	bool wp;
>
> 	mmu->pkr_mask = 0;
>
> 	if (!cr4_pke && !cr4_pks)
> 		return;

Thanks for spelling it out, will use the reverse fir tree ordering for the declarations.

>> +	bool cr4_pke = is_cr4_pke(mmu);
>> +	bool cr4_pks = is_cr4_pks(mmu);
>>   	mmu->pkr_mask = 0;
>>   
>> -	if (!is_cr4_pke(mmu))
>> +	if (!cr4_pke && !cr4_pks)
>>   		return;
>>   
>>   	wp = is_cr0_wp(mmu);
>    
>
>    ...
>
>> @@ -6482,14 +6509,22 @@ u32 kvm_mmu_pkr_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>   		     unsigned pte_access, unsigned pte_pkey, unsigned int pfec)
>>   {
>>   	u32 pkr_bits, offset;
>> +	u32 pkr;
>>   
>>   	/*
>> -	* PKRU defines 32 bits, there are 16 domains and 2
>> -	* attribute bits per domain in pkru.  pte_pkey is the
>> -	* index of the protection domain, so pte_pkey * 2 is
>> -	* is the index of the first bit for the domain.
>> +	* PKRU and PKRS both define 32 bits. There are 16 domains
>> +	* and 2 attribute bits per domain in them. pte_pkey is the
>> +	* index of the protection domain, so pte_pkey * 2 is the
>> +	* index of the first bit for the domain. The use of PKRU
>> +	* versus PKRS is selected by the address type, as determined
>> +	* by the U/S bit in the paging-structure entries.
>
> Align and wrap closer to 80 please.
Will do it.
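
For anyone skimming the thread, below is a rough user-space sketch of the
lookup this comment describes, i.e. the math kvm_mmu_pkr_bits() is meant to
perform. The PFERR_*/ACC_USER_MASK values and the pkr_mask/PKRU/PKRS plumbing
here are simplified stand-ins for illustration only, not necessarily
bit-for-bit what mmu.c or this patch ends up doing:

	#include <stdint.h>
	#include <stdio.h>

	#define PFERR_WRITE_MASK  (1u << 1)
	#define PFERR_USER_MASK   (1u << 2)
	#define PFERR_RSVD_MASK   (1u << 3)  /* repurposed to carry the U/S address type */
	#define PFERR_FETCH_MASK  (1u << 4)
	#define ACC_USER_MASK     (1u << 2)  /* "address is user-mode" in pte_access */

	/* AD/WD bits that apply to the access, or 0 if PKR checks are suppressed. */
	static uint32_t pkr_bits(uint32_t pkr_mask, uint32_t pkru, uint32_t pkrs,
				 unsigned int pte_access, unsigned int pte_pkey,
				 unsigned int pfec)
	{
		/* The address type picks the register: user address => PKRU, else PKRS. */
		uint32_t pkr = (pte_access & ACC_USER_MASK) ? pkru : pkrs;
		/* pte_pkey * 2 is the index of the first of the two bits for this key. */
		uint32_t bits = (pkr >> (pte_pkey * 2)) & 3;
		/* Index the cached bitmask: PFEC bits, with U/S slotted into the RSVD slot. */
		uint32_t offset = (pfec & ~1u) |
				  ((pte_access & ACC_USER_MASK) ? PFERR_RSVD_MASK : 0);

		return bits & (pkr_mask >> offset);
	}

	int main(void)
	{
		/* Illustrative values only: key 1 has WD=1 in PKRU, bitmask passes everything. */
		uint32_t pkru = 2u << (1 * 2);

		printf("%u\n", pkr_bits(0xffffffffu, pkru, 0, ACC_USER_MASK, 1,
					PFERR_USER_MASK | PFERR_WRITE_MASK));
		return 0;
	}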

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 7/8] KVM: VMX: Expose PKS to guest
  2022-05-24 23:34   ` Sean Christopherson
@ 2022-05-27  9:42     ` Wang, Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Wang, Lei @ 2022-05-27  9:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On 5/25/2022 7:34 AM, Sean Christopherson wrote:
> On Sun, Apr 24, 2022, Lei Wang wrote:
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 9d0588e85410..cbcb0d7b47a4 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -3250,7 +3250,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>   		}
>>   
>>   		/*
>> -		 * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
>> +		 * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
>>   		 * hardware.  To emulate this behavior, SMEP/SMAP/PKU needs
> Heh, missed one ;-)  Let's reduce future pain and reword this whole comment:
>
> 		/*
> 		 * SMEP/SMAP/PKU/PKS are effectively disabled if the CPU is in
> 		 * non-paging mode in hardware.  To emulate this behavior,
> 		 * clear them in the hardware CR4 when the guest switches to
> 		 * non-paging mode and unrestricted guest is disabled, as KVM
> 		 * must run the guest with hardware CR0.PG=1.
> 		 */
>
Nice catch, will fix it ;-)
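
Just to confirm the intent: with the reworded comment, the clearing below it
covers PKS as well, roughly like the below (paraphrased from memory, not a
verbatim quote of the patch):

	if (!is_paging(vcpu))
		hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
			    X86_CR4_PKS);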

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v7 8/8] KVM: VMX: Enable PKS for nested VM
  2022-05-20  1:24   ` Sean Christopherson
@ 2022-05-27  9:55     ` Wang, Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Wang, Lei @ 2022-05-27  9:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, vkuznets, wanpengli, jmattson, joro, chenyi.qiang, kvm,
	linux-kernel

On 5/20/2022 9:24 AM, Sean Christopherson wrote:
> Nit, use "KVM: nVMX:" for the shortlog scope.

Will change it.

> On Sun, Apr 24, 2022, Lei Wang wrote:
>> @@ -2433,6 +2437,10 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>>   		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>>   		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>>   			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
>> +
>> +		if (vmx->nested.nested_run_pending &&
>> +		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS))
>> +			vmcs_write64(GUEST_IA32_PKRS, vmcs12->guest_ia32_pkrs);
> As mentioned in the BNDCFGS thread, this does the wrong thing for SMM.  But, after
> a lot of thought, handling this in nested_vmx_enter_non_root_mode() would be little
> more than a band-aid, and a messy one at that, because KVM's SMM emulation is
> horrifically broken with respect to nVMX.
>
> Entry to SMM does not modify _any_ state that is not saved in SMRAM.  That
> we're having to deal with this crap is a symptom of KVM doing the completely
> wrong thing by piggybacking nested_vmx_vmexit() and nested_vmx_enter_non_root_mode().
>
> The SDM's description of CET spells this out very, very clearly:
>
>    On processors that support CET shadow stacks, when the processor enters SMM,
>    the processor saves the SSP register to the SMRAM state save area (see Table 31-3)
>    and clears CR4.CET to 0. Thus, the initial execution environment of the SMI handler
>    has CET disabled and all of the CET state of the interrupted program is still in the
>    machine. An SMM that uses CET is required to save the interrupted program’s CET
>    state and restore the CET state prior to exiting SMM.
>
> It mostly works because no guest SMM handler does anything with most of the MSRs,
> but it's all wildly wrong.  A concrete example of a lurking bug is if vmcs12 uses
> the VM-Exit MSR load list, in which case the forced nested_vmx_vmexit() will load
> state that is never undone.
>
> So, my very strong vote is to ignore SMM and let someone who actually cares about
> SMM fix that mess properly by adding custom flows for exiting/re-entering L2 on
> SMI/RSM.

OK, I will leave the mess alone.

>>   	}
>>   
>>   	if (nested_cpu_has_xsaves(vmcs12))
>> @@ -2521,6 +2529,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>>   	if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
>>   	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
>>   		vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
>> +	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
> ERROR: trailing whitespace
> #85: FILE: arch/x86/kvm/vmx/nested.c:3407:
> +^Iif (kvm_cpu_cap_has(X86_FEATURE_PKS) && $

Sorry for my carelessness, will remove the trailing whitespace.

>> +	    (!vmx->nested.nested_run_pending ||
>> +	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
>> +		vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);
>> +
>>   	vmx_set_rflags(vcpu, vmcs12->guest_rflags);
>>   
>>   	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
>> @@ -2897,6 +2910,10 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
>>   					   vmcs12->host_ia32_perf_global_ctrl)))
>>   		return -EINVAL;
>>   
>> +	if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS) &&
>> +	    CC(!kvm_pkrs_valid(vmcs12->host_ia32_pkrs)))
>> +		return -EINVAL;
>> +
>>   #ifdef CONFIG_X86_64
>>   	ia32e = !!(vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE);
>>   #else
>> @@ -3049,6 +3066,10 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
>>   	if (nested_check_guest_non_reg_state(vmcs12))
>>   		return -EINVAL;
>>   
>> +	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS) &&
>> +	    CC(!kvm_pkrs_valid(vmcs12->guest_ia32_pkrs)))
>> +		return -EINVAL;
>> +
>>   	return 0;
>>   }
>>   
>> @@ -3384,6 +3405,10 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>>   	    (!from_vmentry ||
>>   	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
>>   		vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
>> +	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
>> +	    (!from_vmentry ||
> This should be "!vmx->nested.nested_run_pending" instead of "!from_vmentry" to
> avoid the unnecessary VMREAD when restoring L2 with a pending VM-Enter.

Will fix that.
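
So the hunk would become something like this (sketch only, I will retest
before sending v8):

	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
	    (!vmx->nested.nested_run_pending ||
	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
		vmx->nested.vmcs01_guest_pkrs = vmcs_read64(GUEST_IA32_PKRS);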

>> +	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
>> +		vmx->nested.vmcs01_guest_pkrs = vmcs_read64(GUEST_IA32_PKRS);
>>   
>>   	/*
>>   	 * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
> ...
>
>> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
>> index 91723a226bf3..82f79ac46d7b 100644
>> --- a/arch/x86/kvm/vmx/vmx.h
>> +++ b/arch/x86/kvm/vmx/vmx.h
>> @@ -222,6 +222,8 @@ struct nested_vmx {
>>   	u64 vmcs01_debugctl;
>>   	u64 vmcs01_guest_bndcfgs;
>>   
> Please pack these together, i.e. don't have a blank line between the various
> vmcs01_* fields.

OK, will check them and remove the blank lines.
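
I.e. the vmcs01_* save area will end up packed like this (sketch of the
relevant fields only):

	u64 vmcs01_debugctl;
	u64 vmcs01_guest_bndcfgs;
	u64 vmcs01_guest_pkrs;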


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-05-27  9:55 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-24 10:15 [PATCH v7 0/8] KVM: PKS Virtualization support Lei Wang
2022-04-24 10:15 ` [PATCH v7 1/8] KVM: VMX: Introduce PKS VMCS fields Lei Wang
2022-05-24 20:55   ` Sean Christopherson
2022-05-27  1:55     ` Wang, Lei
2022-04-24 10:15 ` [PATCH v7 2/8] KVM: VMX: Add proper cache tracking for PKRS Lei Wang
2022-05-24 21:00   ` Sean Christopherson
2022-05-27  2:16     ` Wang, Lei
2022-04-24 10:15 ` [PATCH v7 3/8] KVM: X86: Expose IA32_PKRS MSR Lei Wang
2022-05-24 22:11   ` Sean Christopherson
2022-05-27  9:21     ` Wang, Lei
2022-04-24 10:15 ` [PATCH v7 4/8] KVM: MMU: Rename the pkru to pkr Lei Wang
2022-04-24 10:15 ` [PATCH v7 5/8] KVM: MMU: Add helper function to get pkr bits Lei Wang
2022-05-24 23:21   ` Sean Christopherson
2022-05-27  9:28     ` Wang, Lei
2022-04-24 10:15 ` [PATCH v7 6/8] KVM: MMU: Add support for PKS emulation Lei Wang
2022-05-24 23:28   ` Sean Christopherson
2022-05-27  9:40     ` Wang, Lei
2022-04-24 10:15 ` [PATCH v7 7/8] KVM: VMX: Expose PKS to guest Lei Wang
2022-05-24 23:34   ` Sean Christopherson
2022-05-27  9:42     ` Wang, Lei
2022-04-24 10:15 ` [PATCH v7 8/8] KVM: VMX: Enable PKS for nested VM Lei Wang
2022-05-20  1:24   ` Sean Christopherson
2022-05-27  9:55     ` Wang, Lei
2022-05-06  7:32 ` [PATCH v7 0/8] KVM: PKS Virtualization support Wang, Lei
