linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/5] KVM: PKS Virtualization support
@ 2021-02-05  8:37 Chenyi Qiang
  2021-02-05  8:37 ` [PATCH v4 1/5] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Chenyi Qiang @ 2021-02-05  8:37 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

Protection Keys for Supervisor Pages(PKS) is a feature that extends the
Protection Keys architecture to support thread-specific permission
restrictions on supervisor pages.

PKS works similar to an existing feature named PKU(protecting user pages).
They both perform an additional check after all legacy access
permissions checks are done. If violated, #PF occurs and PFEC.PK bit will
be set. PKS introduces MSR IA32_PKRS to manage supervisor protection key
rights. The MSR contains 16 pairs of ADi and WDi bits. Each pair
advertises on a group of pages with the same key which is set in the
leaf paging-structure entries(bits[62:59]). Currently, IA32_PKRS is not
supported by XSAVES architecture.

This patchset aims to add the virtualization of PKS in KVM. It
implemented PKS CPUID enumeration, vmentry/vmexit configuration, MSR
exposure, nested supported etc. Currently, PKS is not yet supported for
shadow paging. 

PKS bare metal support:
https://lore.kernel.org/lkml/20201106232908.364581-1-ira.weiny@intel.com/

Detailed information about PKS can be found in the latest Intel 64 and
IA-32 Architectures Software Developer's Manual.

---

Changelogs:

v3->v4
- Make the MSR intercept and load-controls setting depend on CR4.PKS value
- shadow the guest pkrs and make it usable in PKS emultion
- add the cr4_pke and cr4_pks check in pkr_mask update
- squash PATCH 2 and PATCH 5 to make the dependencies read more clear
- v3: https://lore.kernel.org/lkml/20201105081805.5674-1-chenyi.qiang@intel.com/

v2->v3:
- No function changes since last submit
- rebase on the latest PKS kernel support:
  https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
- add MSR_IA32_PKRS to the vmx_possible_passthrough_msrs[]
- RFC v2: https://lore.kernel.org/lkml/20201014021157.18022-1-chenyi.qiang@intel.com/

v1->v2:
- rebase on the latest PKS kernel support:
  https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
- add a kvm-unit-tests for PKS
- add the check in kvm_init_msr_list for PKRS
- place the X86_CR4_PKS in mmu_role_bits in kvm_set_cr4
- add the support to expose VM_{ENTRY, EXIT}_LOAD_IA32_PKRS in nested
  VMX MSR
- RFC v1: https://lore.kernel.org/lkml/20200807084841.7112-1-chenyi.qiang@intel.com/

---

Chenyi Qiang (5):
  KVM: VMX: Introduce PKS VMCS fields
  KVM: X86: Expose PKS to guest
  KVM: MMU: Rename the pkru to pkr
  KVM: MMU: Add support for PKS emulation
  KVM: VMX: Enable PKS for nested VM

 arch/x86/include/asm/kvm_host.h | 17 +++---
 arch/x86/include/asm/pkeys.h    |  1 +
 arch/x86/include/asm/vmx.h      |  6 ++
 arch/x86/kvm/cpuid.c            |  3 +-
 arch/x86/kvm/mmu.h              | 23 ++++----
 arch/x86/kvm/mmu/mmu.c          | 81 +++++++++++++++------------
 arch/x86/kvm/vmx/capabilities.h |  6 ++
 arch/x86/kvm/vmx/nested.c       | 38 ++++++++++++-
 arch/x86/kvm/vmx/vmcs.h         |  1 +
 arch/x86/kvm/vmx/vmcs12.c       |  2 +
 arch/x86/kvm/vmx/vmcs12.h       |  6 +-
 arch/x86/kvm/vmx/vmx.c          | 99 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.h          |  1 +
 arch/x86/kvm/x86.c              | 11 +++-
 arch/x86/kvm/x86.h              |  8 +++
 arch/x86/mm/pkeys.c             |  6 ++
 include/linux/pkeys.h           |  4 ++
 17 files changed, 249 insertions(+), 64 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 1/5] KVM: VMX: Introduce PKS VMCS fields
  2021-02-05  8:37 [PATCH v4 0/5] KVM: PKS Virtualization support Chenyi Qiang
@ 2021-02-05  8:37 ` Chenyi Qiang
  2021-02-05  8:37 ` [PATCH v4 2/5] KVM: X86: Expose PKS to guest Chenyi Qiang
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Chenyi Qiang @ 2021-02-05  8:37 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

PKS(Protection Keys for Supervisor Pages) is a feature that extends the
Protection Key architecture to support thread-specific permission
restrictions on supervisor pages.

A new PKS MSR(PKRS) is defined in kernel to support PKS, which holds a
set of permissions associated with each protection domian.

Two VMCS fields {HOST,GUEST}_IA32_PKRS are introduced in
{host,guest}-state area to store the value of PKRS.

Every VM exit saves PKRS into guest-state area.
If VM_EXIT_LOAD_IA32_PKRS = 1, VM exit loads PKRS from the host-state
area.
If VM_ENTRY_LOAD_IA32_PKRS = 1, VM entry loads PKRS from the guest-state
area.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
---
 arch/x86/include/asm/vmx.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f8ba5289ecb0..5472859e21b0 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -94,6 +94,7 @@
 #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
 #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
 #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
+#define VM_EXIT_LOAD_IA32_PKRS			0x20000000
 
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
@@ -107,6 +108,7 @@
 #define VM_ENTRY_LOAD_BNDCFGS                   0x00010000
 #define VM_ENTRY_PT_CONCEAL_PIP			0x00020000
 #define VM_ENTRY_LOAD_IA32_RTIT_CTL		0x00040000
+#define VM_ENTRY_LOAD_IA32_PKRS			0x00400000
 
 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
 
@@ -243,12 +245,16 @@ enum vmcs_field {
 	GUEST_BNDCFGS_HIGH              = 0x00002813,
 	GUEST_IA32_RTIT_CTL		= 0x00002814,
 	GUEST_IA32_RTIT_CTL_HIGH	= 0x00002815,
+	GUEST_IA32_PKRS			= 0x00002818,
+	GUEST_IA32_PKRS_HIGH		= 0x00002819,
 	HOST_IA32_PAT			= 0x00002c00,
 	HOST_IA32_PAT_HIGH		= 0x00002c01,
 	HOST_IA32_EFER			= 0x00002c02,
 	HOST_IA32_EFER_HIGH		= 0x00002c03,
 	HOST_IA32_PERF_GLOBAL_CTRL	= 0x00002c04,
 	HOST_IA32_PERF_GLOBAL_CTRL_HIGH	= 0x00002c05,
+	HOST_IA32_PKRS			= 0x00002c06,
+	HOST_IA32_PKRS_HIGH		= 0x00002c07,
 	PIN_BASED_VM_EXEC_CONTROL       = 0x00004000,
 	CPU_BASED_VM_EXEC_CONTROL       = 0x00004002,
 	EXCEPTION_BITMAP                = 0x00004004,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 2/5] KVM: X86: Expose PKS to guest
  2021-02-05  8:37 [PATCH v4 0/5] KVM: PKS Virtualization support Chenyi Qiang
  2021-02-05  8:37 ` [PATCH v4 1/5] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
@ 2021-02-05  8:37 ` Chenyi Qiang
  2021-02-05  9:23   ` Paolo Bonzini
                     ` (2 more replies)
  2021-02-05  8:37 ` [PATCH v4 3/5] KVM: MMU: Rename the pkru to pkr Chenyi Qiang
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 19+ messages in thread
From: Chenyi Qiang @ 2021-02-05  8:37 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

Protection Keys for Supervisor Pages (PKS) is a feature that extends the
Protection Keys architecture to support thread-specific permission
restrictions on supervisor pages, which extends an existing feature
named PKU (for user-mode pages).

PKS uses IA32_PKRS MSR (PKRS) at index 0x6E1 to allow software to manage
supervisor key rights. For performance consideration, PKRS intercept
will be disabled when guest CR4.PKS=1 so that the guest can access the
PKRS without VM exits.

PKS introduces dedicated control fields in VMCS to switch PKRS, which
only does the retore part. In addition, every VM exit saves PKRS into
the guest-state area in VMCS, while VM enter won't save the host value
due to the expectation that the host won't change the MSR often. Update
the host's value in VMCS manually if the MSR has been changed by the
kernel since the last time the VMCS was run.

Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
enabled by setting CR4.PKS when long mode is active. PKS is only
implemented when EPT is enabled and requires the support of VM_{ENTRY,
EXIT}_LOAD_IA32_PKRS currently.

The function get_current_pkrs() in arch/x86/mm/pkeys.c exports the
per-cpu variable pkrs_cache to avoid frequent rdmsr of PKRS.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  4 +-
 arch/x86/include/asm/pkeys.h    |  1 +
 arch/x86/kvm/cpuid.c            |  3 +-
 arch/x86/kvm/vmx/capabilities.h |  6 +++
 arch/x86/kvm/vmx/nested.c       |  1 +
 arch/x86/kvm/vmx/vmcs.h         |  1 +
 arch/x86/kvm/vmx/vmx.c          | 89 ++++++++++++++++++++++++++++++---
 arch/x86/kvm/x86.c              |  8 ++-
 arch/x86/kvm/x86.h              |  8 +++
 arch/x86/mm/pkeys.c             |  6 +++
 include/linux/pkeys.h           |  4 ++
 11 files changed, 122 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d44858b69353..c8b149d9775a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -100,7 +100,8 @@
 			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
 			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
-			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+			  | X86_CR4_PKS))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
@@ -539,6 +540,7 @@ struct kvm_vcpu_arch {
 	unsigned long cr8;
 	u32 host_pkru;
 	u32 pkru;
+	u32 pkrs;
 	u32 hflags;
 	u64 efer;
 	u64 apic_base;
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 990fe9c4787c..610c94b12dbf 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -142,6 +142,7 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags);
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
 __must_check int pks_key_alloc(const char *const pkey_user, int flags);
 void pks_key_free(int pkey);
+u32 get_current_pkrs(void);
 
 void pks_mk_noaccess(int pkey);
 void pks_mk_readonly(int pkey);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 06a278b3701d..4062b83091b9 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -390,7 +390,8 @@ void kvm_set_cpu_caps(void)
 		F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/ | F(RDPID) |
 		F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
 		F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
-		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/
+		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
+		0 /*PKS*/
 	);
 	/* Set LA57 based on hardware capability. */
 	if (cpuid_ecx(7) & F(LA57))
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 3a1861403d73..1cadeaaf9985 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -103,6 +103,12 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
 	       (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL);
 }
 
+static inline bool cpu_has_load_ia32_pkrs(void)
+{
+	return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PKRS) &&
+	       (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PKRS);
+}
+
 static inline bool cpu_has_vmx_mpx(void)
 {
 	return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_BNDCFGS) &&
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 89af692deb7e..2266d98ace6f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -250,6 +250,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
 	dest->ds_sel = src->ds_sel;
 	dest->es_sel = src->es_sel;
 #endif
+	dest->pkrs = src->pkrs;
 }
 
 static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 1472c6c376f7..b5ba6407c5e1 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -40,6 +40,7 @@ struct vmcs_host_state {
 #ifdef CONFIG_X86_64
 	u16           ds_sel, es_sel;
 #endif
+	u32           pkrs;
 };
 
 struct vmcs_controls_shadow {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 47b8357b9751..a3d95492e2b7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -363,6 +363,8 @@ module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);
 static u32 vmx_segment_access_rights(struct kvm_segment *var);
 static __always_inline void vmx_disable_intercept_for_msr(struct kvm_vcpu *vcpu,
 							  u32 msr, int type);
+static __always_inline void vmx_enable_intercept_for_msr(struct kvm_vcpu *vcpu,
+							 u32 msr, int type);
 
 void vmx_vmexit(void);
 
@@ -660,6 +662,8 @@ static bool is_valid_passthrough_msr(u32 msr)
 	case MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B:
 		/* PT MSRs. These are handled in pt_update_intercept_for_msr() */
 		return true;
+	case MSR_IA32_PKRS:
+		return true;
 	}
 
 	r = possible_passthrough_msr_slot(msr) != -ENOENT;
@@ -1201,6 +1205,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 #endif
 	unsigned long fs_base, gs_base;
 	u16 fs_sel, gs_sel;
+	u32 host_pkrs;
 	int i;
 
 	vmx->req_immediate_exit = false;
@@ -1233,6 +1238,21 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	 */
 	host_state->ldt_sel = kvm_read_ldt();
 
+	/*
+	 * Update the host pkrs vmcs field before vcpu runs.
+	 * The setting of VM_EXIT_LOAD_IA32_PKRS can ensure
+	 * kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	 * guest_cpuid_has(vcpu, X86_FEATURE_PKS) &&
+	 * CR4.PKS=1.
+	 */
+	if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS) {
+		host_pkrs = get_current_pkrs();
+		if (unlikely(host_pkrs != host_state->pkrs)) {
+			vmcs_write64(HOST_IA32_PKRS, host_pkrs);
+			host_state->pkrs = host_pkrs;
+		}
+	}
+
 #ifdef CONFIG_X86_64
 	savesegment(ds, host_state->ds_sel);
 	savesegment(es, host_state->es_sel);
@@ -1920,6 +1940,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		else
 			msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
 		break;
+	case MSR_IA32_PKRS:
+		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
+		    (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
+			return 1;
+		msr_info->data = vcpu->arch.pkrs;
+		break;
 	case MSR_TSC_AUX:
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))
@@ -2189,6 +2216,18 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		else
 			vmx->pt_desc.guest.addr_a[index / 2] = data;
 		break;
+	case MSR_IA32_PKRS:
+		if (!kvm_pkrs_valid(data))
+			return 1;
+		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
+		    (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
+			return 1;
+		if (vcpu->arch.pkrs != data) {
+			vcpu->arch.pkrs = data;
+			vmcs_write64(GUEST_IA32_PKRS, data);
+		}
+		break;
 	case MSR_TSC_AUX:
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))
@@ -2479,7 +2518,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	      VM_EXIT_LOAD_IA32_EFER |
 	      VM_EXIT_CLEAR_BNDCFGS |
 	      VM_EXIT_PT_CONCEAL_PIP |
-	      VM_EXIT_CLEAR_IA32_RTIT_CTL;
+	      VM_EXIT_CLEAR_IA32_RTIT_CTL |
+	      VM_EXIT_LOAD_IA32_PKRS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
 				&_vmexit_control) < 0)
 		return -EIO;
@@ -2503,7 +2543,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	      VM_ENTRY_LOAD_IA32_EFER |
 	      VM_ENTRY_LOAD_BNDCFGS |
 	      VM_ENTRY_PT_CONCEAL_PIP |
-	      VM_ENTRY_LOAD_IA32_RTIT_CTL;
+	      VM_ENTRY_LOAD_IA32_RTIT_CTL |
+	      VM_ENTRY_LOAD_IA32_PKRS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
 				&_vmentry_control) < 0)
 		return -EIO;
@@ -3103,8 +3144,9 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	 * is in force while we are in guest mode.  Do not let guests control
 	 * this bit, even if host CR4.MCE == 0.
 	 */
-	unsigned long hw_cr4;
+	unsigned long hw_cr4, old_cr4;
 
+	old_cr4 = kvm_read_cr4(vcpu);
 	hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE);
 	if (is_unrestricted_guest(vcpu))
 		hw_cr4 |= KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST;
@@ -3152,7 +3194,7 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		}
 
 		/*
-		 * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
+		 * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
 		 * hardware.  To emulate this behavior, SMEP/SMAP/PKU needs
 		 * to be manually disabled when guest switches to non-paging
 		 * mode.
@@ -3160,10 +3202,29 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		 * If !enable_unrestricted_guest, the CPU is always running
 		 * with CR0.PG=1 and CR4 needs to be modified.
 		 * If enable_unrestricted_guest, the CPU automatically
-		 * disables SMEP/SMAP/PKU when the guest sets CR0.PG=0.
+		 * disables SMEP/SMAP/PKU/PKS when the guest sets CR0.PG=0.
 		 */
 		if (!is_paging(vcpu))
-			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
+			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
+				    X86_CR4_PKS);
+	}
+
+	if ((hw_cr4 ^ old_cr4) & X86_CR4_PKS) {
+		/* pass through PKRS to guest when CR4.PKS=1 */
+		if (guest_cpuid_has(vcpu, X86_FEATURE_PKS) && hw_cr4 & X86_CR4_PKS) {
+			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
+			vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
+			vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
+			/*
+			 * Every vm exit saves guest pkrs automatically, sync vcpu->arch.pkrs
+			 * to VMCS.GUEST_PKRS to avoid the pollution by host pkrs.
+			 */
+			vmcs_write64(GUEST_IA32_PKRS, vcpu->arch.pkrs);
+		} else {
+			vmx_enable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
+			vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
+			vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
+		}
 	}
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
@@ -5841,6 +5902,8 @@ void dump_vmcs(void)
 		       vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL));
 	if (vmentry_ctl & VM_ENTRY_LOAD_BNDCFGS)
 		pr_err("BndCfgS = 0x%016llx\n", vmcs_read64(GUEST_BNDCFGS));
+	if (vmentry_ctl & VM_ENTRY_LOAD_IA32_PKRS)
+		pr_err("PKRS = 0x%016llx\n", vmcs_read64(GUEST_IA32_PKRS));
 	pr_err("Interruptibility = %08x  ActivityState = %08x\n",
 	       vmcs_read32(GUEST_INTERRUPTIBILITY_INFO),
 	       vmcs_read32(GUEST_ACTIVITY_STATE));
@@ -5876,6 +5939,8 @@ void dump_vmcs(void)
 	    vmexit_ctl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
 		pr_err("PerfGlobCtl = 0x%016llx\n",
 		       vmcs_read64(HOST_IA32_PERF_GLOBAL_CTRL));
+	if (vmexit_ctl & VM_EXIT_LOAD_IA32_PKRS)
+		pr_err("PKRS = 0x%016llx\n", vmcs_read64(HOST_IA32_PKRS));
 
 	pr_err("*** Control State ***\n");
 	pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
@@ -6776,6 +6841,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	pt_guest_exit(vmx);
 
+	if (static_cpu_has(X86_FEATURE_PKS) &&
+	    kvm_read_cr4_bits(vcpu, X86_CR4_PKS))
+		vcpu->arch.pkrs = vmcs_read64(GUEST_IA32_PKRS);
+
 	kvm_load_host_xsave_state(vcpu);
 
 	vmx->nested.nested_run_pending = 0;
@@ -7280,6 +7349,14 @@ static __init void vmx_set_cpu_caps(void)
 	if (vmx_pt_mode_is_host_guest())
 		kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);
 
+	/*
+	 * PKS is not yet implemented for shadow paging.
+	 * If not support VM_{ENTRY, EXIT}_LOAD_IA32_PKRS,
+	 * don't expose the PKS as well.
+	 */
+	if (enable_ept && cpu_has_load_ia32_pkrs())
+		kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);
+
 	if (vmx_umip_emulated())
 		kvm_cpu_cap_set(X86_FEATURE_UMIP);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f5ede41bf9e6..684ef760481c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1213,7 +1213,7 @@ static const u32 msrs_to_save_all[] = {
 	MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
 	MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
 	MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
-	MSR_IA32_UMWAIT_CONTROL,
+	MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,
 
 	MSR_ARCH_PERFMON_FIXED_CTR0, MSR_ARCH_PERFMON_FIXED_CTR1,
 	MSR_ARCH_PERFMON_FIXED_CTR0 + 2, MSR_ARCH_PERFMON_FIXED_CTR0 + 3,
@@ -5718,6 +5718,10 @@ static void kvm_init_msr_list(void)
 				intel_pt_validate_hw_cap(PT_CAP_num_address_ranges) * 2)
 				continue;
 			break;
+		case MSR_IA32_PKRS:
+			if (!kvm_cpu_cap_has(X86_FEATURE_PKS))
+				continue;
+			break;
 		case MSR_ARCH_PERFMON_PERFCTR0 ... MSR_ARCH_PERFMON_PERFCTR0 + 17:
 			if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_PERFCTR0 >=
 			    min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp))
@@ -10026,6 +10030,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 
 	vcpu->arch.ia32_xss = 0;
 
+	vcpu->arch.pkrs = 0;
+
 	kvm_x86_ops.vcpu_reset(vcpu, init_event);
 }
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 3900ab0c6004..b63e32baad00 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -366,6 +366,12 @@ static inline bool kvm_dr6_valid(u64 data)
 	return !(data >> 32);
 }
 
+static inline bool kvm_pkrs_valid(u64 data)
+{
+	/* bit[63,32] must be zero */
+	return !(data >> 32);
+}
+
 void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
 void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
 int kvm_spec_ctrl_test_value(u64 value);
@@ -398,6 +404,8 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
 		__reserved_bits |= X86_CR4_UMIP;        \
 	if (!__cpu_has(__c, X86_FEATURE_VMX))           \
 		__reserved_bits |= X86_CR4_VMXE;        \
+	if (!__cpu_has(__c, X86_FEATURE_PKS))           \
+		__reserved_bits |= X86_CR4_PKS;         \
 	__reserved_bits;                                \
 })
 
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 57718716cc70..8027f854c600 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -390,3 +390,9 @@ void pks_key_free(int pkey)
 	__clear_bit(pkey, &pks_key_allocation_map);
 }
 EXPORT_SYMBOL_GPL(pks_key_free);
+
+u32 get_current_pkrs(void)
+{
+	return this_cpu_read(pkrs_cache);
+}
+EXPORT_SYMBOL_GPL(get_current_pkrs);
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index bed0e293f13b..480429020f4c 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -72,6 +72,10 @@ static inline void pks_mk_readwrite(int pkey)
 {
 	pr_err("%s is not valid without PKS support\n", __func__);
 }
+static inline u32 get_current_pkrs(void)
+{
+	return 0;
+}
 #endif /* ! CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 3/5] KVM: MMU: Rename the pkru to pkr
  2021-02-05  8:37 [PATCH v4 0/5] KVM: PKS Virtualization support Chenyi Qiang
  2021-02-05  8:37 ` [PATCH v4 1/5] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
  2021-02-05  8:37 ` [PATCH v4 2/5] KVM: X86: Expose PKS to guest Chenyi Qiang
@ 2021-02-05  8:37 ` Chenyi Qiang
  2021-02-05  8:37 ` [PATCH v4 4/5] KVM: MMU: Add support for PKS emulation Chenyi Qiang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Chenyi Qiang @ 2021-02-05  8:37 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

PKRU represents the PKU register utilized in the protection key rights
check for user pages. Protection Keys for Superviosr Pages (PKS) extends
the protection key architecture to cover supervisor pages.

Rename the *pkru* related variables and functions to *pkr* which stands
for both of the PKRU and PKRS. It makes sense because both registers
have the same format. PKS and PKU can also share the same bitmap to
cache the conditions where protection key checks are needed.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu.h              | 12 ++++++------
 arch/x86/kvm/mmu/mmu.c          | 18 +++++++++---------
 3 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c8b149d9775a..1909d34cbac8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -383,7 +383,7 @@ struct kvm_mmu {
 	* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
 	* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
 	*/
-	u32 pkru_mask;
+	u32 pkr_mask;
 
 	u64 *pae_root;
 	u64 *lm_root;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 9c4a9c8e43d9..a77bd20c83f9 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -190,8 +190,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	u32 errcode = PFERR_PRESENT_MASK;
 
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
-	if (unlikely(mmu->pkru_mask)) {
-		u32 pkru_bits, offset;
+	if (unlikely(mmu->pkr_mask)) {
+		u32 pkr_bits, offset;
 
 		/*
 		* PKRU defines 32 bits, there are 16 domains and 2
@@ -199,15 +199,15 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		* index of the protection domain, so pte_pkey * 2 is
 		* is the index of the first bit for the domain.
 		*/
-		pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
 
 		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
 		offset = (pfec & ~1) +
 			((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
 
-		pkru_bits &= mmu->pkru_mask >> offset;
-		errcode |= -pkru_bits & PFERR_PK_MASK;
-		fault |= (pkru_bits != 0);
+		pkr_bits &= mmu->pkr_mask >> offset;
+		errcode |= -pkr_bits & PFERR_PK_MASK;
+		fault |= (pkr_bits != 0);
 	}
 
 	return -(u32)fault & errcode;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1f96adff8dc4..d22c0813e4b9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4301,20 +4301,20 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
 * away both AD and WD.  For all reads or if the last condition holds, WD
 * only will be masked away.
 */
-static void update_pkru_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+static void update_pkr_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 				bool ept)
 {
 	unsigned bit;
 	bool wp;
 
 	if (ept) {
-		mmu->pkru_mask = 0;
+		mmu->pkr_mask = 0;
 		return;
 	}
 
 	/* PKEY is enabled only if CR4.PKE and EFER.LMA are both set. */
 	if (!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) || !is_long_mode(vcpu)) {
-		mmu->pkru_mask = 0;
+		mmu->pkr_mask = 0;
 		return;
 	}
 
@@ -4348,7 +4348,7 @@ static void update_pkru_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		/* PKRU.WD stops write access. */
 		pkey_bits |= (!!check_write) << 1;
 
-		mmu->pkru_mask |= (pkey_bits & 3) << pfec;
+		mmu->pkr_mask |= (pkey_bits & 3) << pfec;
 	}
 }
 
@@ -4370,7 +4370,7 @@ static void paging64_init_context_common(struct kvm_vcpu *vcpu,
 
 	reset_rsvds_bits_mask(vcpu, context);
 	update_permission_bitmask(vcpu, context, false);
-	update_pkru_bitmask(vcpu, context, false);
+	update_pkr_bitmask(vcpu, context, false);
 	update_last_nonleaf_level(vcpu, context);
 
 	MMU_WARN_ON(!is_pae(vcpu));
@@ -4400,7 +4400,7 @@ static void paging32_init_context(struct kvm_vcpu *vcpu,
 
 	reset_rsvds_bits_mask(vcpu, context);
 	update_permission_bitmask(vcpu, context, false);
-	update_pkru_bitmask(vcpu, context, false);
+	update_pkr_bitmask(vcpu, context, false);
 	update_last_nonleaf_level(vcpu, context);
 
 	context->page_fault = paging32_page_fault;
@@ -4519,7 +4519,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 	}
 
 	update_permission_bitmask(vcpu, context, false);
-	update_pkru_bitmask(vcpu, context, false);
+	update_pkr_bitmask(vcpu, context, false);
 	update_last_nonleaf_level(vcpu, context);
 	reset_tdp_shadow_zero_bits_mask(vcpu, context);
 }
@@ -4667,7 +4667,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 	context->mmu_role.as_u64 = new_role.as_u64;
 
 	update_permission_bitmask(vcpu, context, true);
-	update_pkru_bitmask(vcpu, context, true);
+	update_pkr_bitmask(vcpu, context, true);
 	update_last_nonleaf_level(vcpu, context);
 	reset_rsvds_bits_mask_ept(vcpu, context, execonly);
 	reset_ept_shadow_zero_bits_mask(vcpu, context, execonly);
@@ -4738,7 +4738,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 	}
 
 	update_permission_bitmask(vcpu, g_context, false);
-	update_pkru_bitmask(vcpu, g_context, false);
+	update_pkr_bitmask(vcpu, g_context, false);
 	update_last_nonleaf_level(vcpu, g_context);
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 4/5] KVM: MMU: Add support for PKS emulation
  2021-02-05  8:37 [PATCH v4 0/5] KVM: PKS Virtualization support Chenyi Qiang
                   ` (2 preceding siblings ...)
  2021-02-05  8:37 ` [PATCH v4 3/5] KVM: MMU: Rename the pkru to pkr Chenyi Qiang
@ 2021-02-05  8:37 ` Chenyi Qiang
  2021-02-05  9:20   ` Paolo Bonzini
  2021-07-29 17:25   ` Sean Christopherson
  2021-02-05  8:37 ` [PATCH v4 5/5] KVM: VMX: Enable PKS for nested VM Chenyi Qiang
  2021-02-05  9:24 ` [PATCH v4 0/5] KVM: PKS Virtualization support Paolo Bonzini
  5 siblings, 2 replies; 19+ messages in thread
From: Chenyi Qiang @ 2021-02-05  8:37 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

In addition to the pkey check for user pages, advertise pkr_mask also to
cache the conditions where protection key checks for supervisor pages
are needed. Add CR4_PKS in mmu_role_bits to track the pkr_mask update on
a per-mmu basis.

In original cache conditions of pkr_mask, U/S bit in page tables is a
judgement condition and replace the PFEC.RSVD in page fault error code
to form the index of 16 domains. PKS support would extend the U/S bits
(if U/S=0, PKS check required). It adds an additional check for
cr4_pke/cr4_pks to ensure the necessity and distinguish PKU and PKS from
each other.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 11 +++---
 arch/x86/kvm/mmu.h              | 13 ++++---
 arch/x86/kvm/mmu/mmu.c          | 63 +++++++++++++++++++--------------
 arch/x86/kvm/x86.c              |  3 +-
 4 files changed, 53 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1909d34cbac8..e515f1cecb88 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -294,7 +294,7 @@ union kvm_mmu_extended_role {
 		unsigned int cr0_pg:1;
 		unsigned int cr4_pae:1;
 		unsigned int cr4_pse:1;
-		unsigned int cr4_pke:1;
+		unsigned int cr4_pkr:1;
 		unsigned int cr4_smap:1;
 		unsigned int cr4_smep:1;
 		unsigned int maxphyaddr:6;
@@ -378,10 +378,11 @@ struct kvm_mmu {
 	u8 permissions[16];
 
 	/*
-	* The pkru_mask indicates if protection key checks are needed.  It
-	* consists of 16 domains indexed by page fault error code bits [4:1],
-	* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
-	* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
+	* The pkr_mask indicates if protection key checks are needed.
+	* It consists of 16 domains indexed by page fault error code
+	* bits[4:1] with PFEC.RSVD replaced by ACC_USER_MASK from the
+	* page tables. Each domain has 2 bits which are ANDed with AD
+	* and WD from PKRU/PKRS.
 	*/
 	u32 pkr_mask;
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index a77bd20c83f9..55b71c28e46e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -192,14 +192,17 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
 	if (unlikely(mmu->pkr_mask)) {
 		u32 pkr_bits, offset;
+		u64 pkr;
 
 		/*
-		* PKRU defines 32 bits, there are 16 domains and 2
-		* attribute bits per domain in pkru.  pte_pkey is the
-		* index of the protection domain, so pte_pkey * 2 is
-		* is the index of the first bit for the domain.
+		* PKRU and PKRS both define 32 bits. There are 16 domains
+		* and 2 attribute bits per domain in them. pte_key is the
+		* index of the protection domain, so pte_pkey * 2 is the
+		* index of the first bit for the domain. The choice of
+		* PKRU and PKRS is determined by the accessed pages.
 		*/
-		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+		pkr = pte_access & PT_USER_MASK ? vcpu->arch.pkru : vcpu->arch.pkrs;
+		pkr_bits = (pkr >> pte_pkey * 2) & 3;
 
 		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
 		offset = (pfec & ~1) +
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d22c0813e4b9..92b24fa71f93 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4278,42 +4278,49 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
 }
 
 /*
-* PKU is an additional mechanism by which the paging controls access to
-* user-mode addresses based on the value in the PKRU register.  Protection
-* key violations are reported through a bit in the page fault error code.
+* Protection Keys (PKEY) is an additional mechanism by which
+* the paging controls access to user-mode/supervisor-mode address
+* based on the values in PKEY registers (PKRU/PKRS). Protection key
+* violations are reported through a bit in the page fault error code.
 * Unlike other bits of the error code, the PK bit is not known at the
 * call site of e.g. gva_to_gpa; it must be computed directly in
-* permission_fault based on two bits of PKRU, on some machine state (CR4,
-* CR0, EFER, CPL), and on other bits of the error code and the page tables.
+* permission_fault based on two bits of PKRU/PKRS, on some machine
+* state (CR4, CR0, EFER, CPL), and on other bits of the error code
+* and the page tables.
 *
 * In particular the following conditions come from the error code, the
 * page tables and the machine state:
-* - PK is always zero unless CR4.PKE=1 and EFER.LMA=1
+* - PK is always zero unless CR4.PKE=1/CR4.PKS=1 and EFER.LMA=1
 * - PK is always zero if RSVD=1 (reserved bit set) or F=1 (instruction fetch)
-* - PK is always zero if U=0 in the page tables
-* - PKRU.WD is ignored if CR0.WP=0 and the access is a supervisor access.
+* - PK is always zero if
+*       - U=0 in the page tables and CR4.PKS=0
+*       - U=1 in the page tables and CR4.PKU=0
+* - (PKRU/PKRS).WD is ignored if CR0.WP=0 and the access is a supervisor access.
 *
-* The PKRU bitmask caches the result of these four conditions.  The error
-* code (minus the P bit) and the page table's U bit form an index into the
-* PKRU bitmask.  Two bits of the PKRU bitmask are then extracted and ANDed
-* with the two bits of the PKRU register corresponding to the protection key.
-* For the first three conditions above the bits will be 00, thus masking
-* away both AD and WD.  For all reads or if the last condition holds, WD
-* only will be masked away.
+* The pkr_mask caches the result of these three conditions. The error
+* code (minus the P bit) and the page table's U bit form an index into
+* the pkr_mask. Two bits of the pkr_mask are then extracted and ANDed with
+* the two bits of the PKEY register corresponding to the protection key.
+* For the first three conditions above the bits will be 00, thus masking away
+* both AD and WD. For all reads or if the last condition holds, WD only will be
+* masked away.
 */
 static void update_pkr_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 				bool ept)
 {
 	unsigned bit;
-	bool wp;
+	bool wp, cr4_pke, cr4_pks;
 
 	if (ept) {
 		mmu->pkr_mask = 0;
 		return;
 	}
 
-	/* PKEY is enabled only if CR4.PKE and EFER.LMA are both set. */
-	if (!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) || !is_long_mode(vcpu)) {
+	cr4_pke = kvm_read_cr4_bits(vcpu, X86_CR4_PKE) != 0;
+	cr4_pks = kvm_read_cr4_bits(vcpu, X86_CR4_PKS) != 0;
+
+	/* PKEY is enabled only if CR4.PKE/CR4.PKS and EFER.LMA are both set. */
+	if ((!cr4_pke && !cr4_pks) || !is_long_mode(vcpu)) {
 		mmu->pkr_mask = 0;
 		return;
 	}
@@ -4333,19 +4340,22 @@ static void update_pkr_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		pte_user = pfec & PFERR_RSVD_MASK;
 
 		/*
-		 * Only need to check the access which is not an
-		 * instruction fetch and is to a user page.
+		 * need to check the access which is not an
+		 * instruction fetch and
+		 * - if cr4_pke 1-setting when accessing a user page.
+		 * - if cr4_pks 1-setting when accessing a supervisor page.
 		 */
-		check_pkey = (!ff && pte_user);
+		check_pkey = !ff && (pte_user ? cr4_pke : cr4_pks);
+
 		/*
-		 * write access is controlled by PKRU if it is a
-		 * user access or CR0.WP = 1.
+		 * write access is controlled by PKRU/PKRS if
+		 * it is a user access or CR0.WP = 1.
 		 */
 		check_write = check_pkey && wf && (uf || wp);
 
-		/* PKRU.AD stops both read and write access. */
+		/* PKRU/PKRS.AD stops both read and write access. */
 		pkey_bits = !!check_pkey;
-		/* PKRU.WD stops write access. */
+		/* PKRU/PKRS.WD stops write access. */
 		pkey_bits |= (!!check_write) << 1;
 
 		mmu->pkr_mask |= (pkey_bits & 3) << pfec;
@@ -4427,7 +4437,8 @@ static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu)
 	ext.cr4_smep = !!kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
 	ext.cr4_smap = !!kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
 	ext.cr4_pse = !!is_pse(vcpu);
-	ext.cr4_pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
+	ext.cr4_pkr = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
+		      !!kvm_read_cr4_bits(vcpu, X86_CR4_PKS);
 	ext.maxphyaddr = cpuid_maxphyaddr(vcpu);
 
 	ext.valid = 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 684ef760481c..aec889a4eb66 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -982,7 +982,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	unsigned long old_cr4 = kvm_read_cr4(vcpu);
 	unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
 				   X86_CR4_SMEP;
-	unsigned long mmu_role_bits = pdptr_bits | X86_CR4_SMAP | X86_CR4_PKE;
+	unsigned long mmu_role_bits = pdptr_bits | X86_CR4_SMAP | X86_CR4_PKE |
+				      X86_CR4_PKS;
 
 	if (kvm_valid_cr4(vcpu, cr4))
 		return 1;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 5/5] KVM: VMX: Enable PKS for nested VM
  2021-02-05  8:37 [PATCH v4 0/5] KVM: PKS Virtualization support Chenyi Qiang
                   ` (3 preceding siblings ...)
  2021-02-05  8:37 ` [PATCH v4 4/5] KVM: MMU: Add support for PKS emulation Chenyi Qiang
@ 2021-02-05  8:37 ` Chenyi Qiang
  2021-02-05  9:24 ` [PATCH v4 0/5] KVM: PKS Virtualization support Paolo Bonzini
  5 siblings, 0 replies; 19+ messages in thread
From: Chenyi Qiang @ 2021-02-05  8:37 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

PKS MSR passes through guest directly. Configure the MSR to match the
L0/L1 settings so that nested VM runs PKS properly.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/kvm/vmx/nested.c | 37 +++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmcs12.c |  2 ++
 arch/x86/kvm/vmx/vmcs12.h |  6 +++++-
 arch/x86/kvm/vmx/vmx.c    | 10 ++++++++++
 arch/x86/kvm/vmx/vmx.h    |  1 +
 5 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2266d98ace6f..a0b0f6fc7808 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -653,6 +653,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 					MSR_IA32_PRED_CMD,
 					MSR_TYPE_W);
 
+	if (!msr_write_intercepted_l01(vcpu, MSR_IA32_PKRS))
+		nested_vmx_disable_intercept_for_msr(
+					msr_bitmap_l1, msr_bitmap_l0,
+					MSR_IA32_PKRS,
+					MSR_TYPE_R | MSR_TYPE_W);
+
 	kvm_vcpu_unmap(vcpu, &to_vmx(vcpu)->nested.msr_bitmap_map, false);
 
 	return true;
@@ -2439,6 +2445,10 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
 		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
 			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
+
+		if (vmx->nested.nested_run_pending &&
+		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS))
+			vmcs_write64(GUEST_IA32_PKRS, vmcs12->guest_ia32_pkrs);
 	}
 
 	if (nested_cpu_has_xsaves(vmcs12))
@@ -2527,6 +2537,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
 	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
 		vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
+
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	    (!vmx->nested.nested_run_pending ||
+	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
+		vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);
 	vmx_set_rflags(vcpu, vmcs12->guest_rflags);
 
 	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
@@ -2867,6 +2882,10 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
 					   vmcs12->host_ia32_perf_global_ctrl)))
 		return -EINVAL;
 
+	if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS) &&
+	    CC(!kvm_pkrs_valid(vmcs12->host_ia32_pkrs)))
+		return -EINVAL;
+
 #ifdef CONFIG_X86_64
 	ia32e = !!(vcpu->arch.efer & EFER_LMA);
 #else
@@ -3016,6 +3035,10 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
 	if (nested_check_guest_non_reg_state(vmcs12))
 		return -EINVAL;
 
+	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS) &&
+	    CC(!kvm_pkrs_valid(vmcs12->guest_ia32_pkrs)))
+		return -EINVAL;
+
 	return 0;
 }
 
@@ -3326,6 +3349,9 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	if (kvm_mpx_supported() &&
 		!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
 		vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS))
+		vmx->nested.vmcs01_guest_pkrs = vmcs_read64(GUEST_IA32_PKRS);
 
 	/*
 	 * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
@@ -3929,6 +3955,7 @@ static bool is_vmcs12_ext_field(unsigned long field)
 	case GUEST_IDTR_BASE:
 	case GUEST_PENDING_DBG_EXCEPTIONS:
 	case GUEST_BNDCFGS:
+	case GUEST_IA32_PKRS:
 		return true;
 	default:
 		break;
@@ -3980,6 +4007,8 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 	if (kvm_mpx_supported())
 		vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+	if (guest_cpuid_has(vcpu, X86_FEATURE_PKS))
+		vmcs12->guest_ia32_pkrs = vmcs_read64(GUEST_IA32_PKRS);
 
 	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
 }
@@ -4215,6 +4244,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 		WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
 					 vmcs12->host_ia32_perf_global_ctrl));
 
+	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS)
+		vmcs_write64(GUEST_IA32_PKRS, vmcs12->host_ia32_pkrs);
+
 	/* Set L1 segment info according to Intel SDM
 	    27.5.2 Loading Host Segment and Descriptor-Table Registers */
 	seg = (struct kvm_segment) {
@@ -6330,7 +6362,8 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
 #endif
 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
-		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
+		VM_EXIT_LOAD_IA32_PKRS;
 	msrs->exit_ctls_high |=
 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -6350,7 +6383,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
 		VM_ENTRY_IA32E_MODE |
 #endif
 		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
-		VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+		VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_PKRS;
 	msrs->entry_ctls_high |=
 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER);
 
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index c8e51c004f78..df7b2143b807 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -61,9 +61,11 @@ const unsigned short vmcs_field_to_offset_table[] = {
 	FIELD64(GUEST_PDPTR2, guest_pdptr2),
 	FIELD64(GUEST_PDPTR3, guest_pdptr3),
 	FIELD64(GUEST_BNDCFGS, guest_bndcfgs),
+	FIELD64(GUEST_IA32_PKRS, guest_ia32_pkrs),
 	FIELD64(HOST_IA32_PAT, host_ia32_pat),
 	FIELD64(HOST_IA32_EFER, host_ia32_efer),
 	FIELD64(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl),
+	FIELD64(HOST_IA32_PKRS, host_ia32_pkrs),
 	FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
 	FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
 	FIELD(EXCEPTION_BITMAP, exception_bitmap),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 80232daf00ff..009b4c317375 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -69,7 +69,9 @@ struct __packed vmcs12 {
 	u64 vm_function_control;
 	u64 eptp_list_address;
 	u64 pml_address;
-	u64 padding64[3]; /* room for future expansion */
+	u64 guest_ia32_pkrs;
+	u64 host_ia32_pkrs;
+	u64 padding64[1]; /* room for future expansion */
 	/*
 	 * To allow migration of L1 (complete with its L2 guests) between
 	 * machines of different natural widths (32 or 64 bit), we cannot have
@@ -256,6 +258,8 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(vm_function_control, 296);
 	CHECK_OFFSET(eptp_list_address, 304);
 	CHECK_OFFSET(pml_address, 312);
+	CHECK_OFFSET(guest_ia32_pkrs, 320);
+	CHECK_OFFSET(host_ia32_pkrs, 328);
 	CHECK_OFFSET(cr0_guest_host_mask, 344);
 	CHECK_OFFSET(cr4_guest_host_mask, 352);
 	CHECK_OFFSET(cr0_read_shadow, 360);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a3d95492e2b7..f3e8e1ba0003 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7198,6 +7198,7 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
 	cr4_fixed1_update(X86_CR4_PKE,        ecx, feature_bit(PKU));
 	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
 	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
+	cr4_fixed1_update(X86_CR4_PKS,        ecx, feature_bit(PKS));
 
 #undef cr4_fixed1_update
 }
@@ -7217,6 +7218,15 @@ static void nested_vmx_entry_exit_ctls_update(struct kvm_vcpu *vcpu)
 			vmx->nested.msrs.exit_ctls_high &= ~VM_EXIT_CLEAR_BNDCFGS;
 		}
 	}
+
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	    guest_cpuid_has(vcpu, X86_FEATURE_PKS)) {
+		vmx->nested.msrs.entry_ctls_high |= VM_ENTRY_LOAD_IA32_PKRS;
+		vmx->nested.msrs.exit_ctls_high |= VM_EXIT_LOAD_IA32_PKRS;
+	} else {
+		vmx->nested.msrs.entry_ctls_high &= ~VM_ENTRY_LOAD_IA32_PKRS;
+		vmx->nested.msrs.exit_ctls_high &= ~VM_EXIT_LOAD_IA32_PKRS;
+	}
 }
 
 static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index f6f66e5c6510..9d3670e45636 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -155,6 +155,7 @@ struct nested_vmx {
 	/* to migrate it to L2 if VM_ENTRY_LOAD_DEBUG_CONTROLS is off */
 	u64 vmcs01_debugctl;
 	u64 vmcs01_guest_bndcfgs;
+	u64 vmcs01_guest_pkrs;
 
 	/* to migrate it to L1 if L2 writes to L1's CR8 directly */
 	int l1_tpr_threshold;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 4/5] KVM: MMU: Add support for PKS emulation
  2021-02-05  8:37 ` [PATCH v4 4/5] KVM: MMU: Add support for PKS emulation Chenyi Qiang
@ 2021-02-05  9:20   ` Paolo Bonzini
  2021-07-29 17:25   ` Sean Christopherson
  1 sibling, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2021-02-05  9:20 UTC (permalink / raw)
  To: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

On 05/02/21 09:37, Chenyi Qiang wrote:
> |In addition to the pkey check for user pages, advertise pkr_mask also 
> to cache the conditions where protection key checks for supervisor pages 
> are needed. Add CR4_PKS in mmu_role_bits to track the pkr_mask update on 
> a per-mmu basis. In original cache conditions of pkr_mask, U/S bit in 
> page tables is a judgement condition and replace the PFEC.RSVD in page 
> fault error code to form the index of 16 domains. PKS support would 
> extend the U/S bits (if U/S=0, PKS check required). It adds an 
> additional check for cr4_pke/cr4_pks to ensure the necessity and 
> distinguish PKU and PKS from each other. |

Slight changes to the commit message:

   Up until now, pkr_mask had 0 bits for supervisor pages (the U/S bit in
   page tables replaces the PFEC.RSVD in page fault error code).
   For PKS support, fill in the bits using the same algorithm used for
   user mode pages, but with CR4.PKE replaced by CR4.PKS.  Because of
   this change, CR4.PKS must also be included in the MMU role.

Paolo


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/5] KVM: X86: Expose PKS to guest
  2021-02-05  8:37 ` [PATCH v4 2/5] KVM: X86: Expose PKS to guest Chenyi Qiang
@ 2021-02-05  9:23   ` Paolo Bonzini
  2021-02-05  9:25   ` Paolo Bonzini
  2021-07-29 16:44   ` Sean Christopherson
  2 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2021-02-05  9:23 UTC (permalink / raw)
  To: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

On 05/02/21 09:37, Chenyi Qiang wrote:

> +	/*
> +	 * PKS is not yet implemented for shadow paging.
> +	 * If not support VM_{ENTRY, EXIT}_LOAD_IA32_PKRS,
> +	 * don't expose the PKS as well.
> +	 */
> +	if (enable_ept && cpu_has_load_ia32_pkrs())
> +		kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);

This piece should be a separate patch, added after patch 4.

Paolo


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 0/5] KVM: PKS Virtualization support
  2021-02-05  8:37 [PATCH v4 0/5] KVM: PKS Virtualization support Chenyi Qiang
                   ` (4 preceding siblings ...)
  2021-02-05  8:37 ` [PATCH v4 5/5] KVM: VMX: Enable PKS for nested VM Chenyi Qiang
@ 2021-02-05  9:24 ` Paolo Bonzini
  5 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2021-02-05  9:24 UTC (permalink / raw)
  To: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

On 05/02/21 09:37, Chenyi Qiang wrote:
> Protection Keys for Supervisor Pages(PKS) is a feature that extends the
> Protection Keys architecture to support thread-specific permission
> restrictions on supervisor pages.
> 
> PKS works similar to an existing feature named PKU(protecting user pages).
> They both perform an additional check after all legacy access
> permissions checks are done. If violated, #PF occurs and PFEC.PK bit will
> be set. PKS introduces MSR IA32_PKRS to manage supervisor protection key
> rights. The MSR contains 16 pairs of ADi and WDi bits. Each pair
> advertises on a group of pages with the same key which is set in the
> leaf paging-structure entries(bits[62:59]). Currently, IA32_PKRS is not
> supported by XSAVES architecture.
> 
> This patchset aims to add the virtualization of PKS in KVM. It
> implemented PKS CPUID enumeration, vmentry/vmexit configuration, MSR
> exposure, nested supported etc. Currently, PKS is not yet supported for
> shadow paging.
> 
> PKS bare metal support:
> https://lore.kernel.org/lkml/20201106232908.364581-1-ira.weiny@intel.com/
> 
> Detailed information about PKS can be found in the latest Intel 64 and
> IA-32 Architectures Software Developer's Manual.
> 
> ---
> 
> Changelogs:
> 
> v3->v4
> - Make the MSR intercept and load-controls setting depend on CR4.PKS value
> - shadow the guest pkrs and make it usable in PKS emultion
> - add the cr4_pke and cr4_pks check in pkr_mask update
> - squash PATCH 2 and PATCH 5 to make the dependencies read more clear
> - v3: https://lore.kernel.org/lkml/20201105081805.5674-1-chenyi.qiang@intel.com/
> 
> v2->v3:
> - No function changes since last submit
> - rebase on the latest PKS kernel support:
>    https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
> - add MSR_IA32_PKRS to the vmx_possible_passthrough_msrs[]
> - RFC v2: https://lore.kernel.org/lkml/20201014021157.18022-1-chenyi.qiang@intel.com/
> 
> v1->v2:
> - rebase on the latest PKS kernel support:
>    https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
> - add a kvm-unit-tests for PKS
> - add the check in kvm_init_msr_list for PKRS
> - place the X86_CR4_PKS in mmu_role_bits in kvm_set_cr4
> - add the support to expose VM_{ENTRY, EXIT}_LOAD_IA32_PKRS in nested
>    VMX MSR
> - RFC v1: https://lore.kernel.org/lkml/20200807084841.7112-1-chenyi.qiang@intel.com/
> 
> ---
> 
> Chenyi Qiang (5):
>    KVM: VMX: Introduce PKS VMCS fields
>    KVM: X86: Expose PKS to guest
>    KVM: MMU: Rename the pkru to pkr
>    KVM: MMU: Add support for PKS emulation
>    KVM: VMX: Enable PKS for nested VM
> 
>   arch/x86/include/asm/kvm_host.h | 17 +++---
>   arch/x86/include/asm/pkeys.h    |  1 +
>   arch/x86/include/asm/vmx.h      |  6 ++
>   arch/x86/kvm/cpuid.c            |  3 +-
>   arch/x86/kvm/mmu.h              | 23 ++++----
>   arch/x86/kvm/mmu/mmu.c          | 81 +++++++++++++++------------
>   arch/x86/kvm/vmx/capabilities.h |  6 ++
>   arch/x86/kvm/vmx/nested.c       | 38 ++++++++++++-
>   arch/x86/kvm/vmx/vmcs.h         |  1 +
>   arch/x86/kvm/vmx/vmcs12.c       |  2 +
>   arch/x86/kvm/vmx/vmcs12.h       |  6 +-
>   arch/x86/kvm/vmx/vmx.c          | 99 +++++++++++++++++++++++++++++++--
>   arch/x86/kvm/vmx/vmx.h          |  1 +
>   arch/x86/kvm/x86.c              | 11 +++-
>   arch/x86/kvm/x86.h              |  8 +++
>   arch/x86/mm/pkeys.c             |  6 ++
>   include/linux/pkeys.h           |  4 ++
>   17 files changed, 249 insertions(+), 64 deletions(-)
> 

Looks mostly good, but I'll only be able to include it after the bare 
metal implementation is in.

Paolo


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/5] KVM: X86: Expose PKS to guest
  2021-02-05  8:37 ` [PATCH v4 2/5] KVM: X86: Expose PKS to guest Chenyi Qiang
  2021-02-05  9:23   ` Paolo Bonzini
@ 2021-02-05  9:25   ` Paolo Bonzini
  2021-02-05  9:56     ` Borislav Petkov
  2021-07-29 16:44   ` Sean Christopherson
  2 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2021-02-05  9:25 UTC (permalink / raw)
  To: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel, x86, Andy Lutomirski, Borislav Petkov

On 05/02/21 09:37, Chenyi Qiang wrote:
> 
> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index 57718716cc70..8027f854c600 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -390,3 +390,9 @@ void pks_key_free(int pkey)
>  	__clear_bit(pkey, &pks_key_allocation_map);
>  }
>  EXPORT_SYMBOL_GPL(pks_key_free);
> +
> +u32 get_current_pkrs(void)
> +{
> +	return this_cpu_read(pkrs_cache);
> +}
> +EXPORT_SYMBOL_GPL(get_current_pkrs);
> diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> index bed0e293f13b..480429020f4c 100644
> --- a/include/linux/pkeys.h
> +++ b/include/linux/pkeys.h
> @@ -72,6 +72,10 @@ static inline void pks_mk_readwrite(int pkey)
>  {
>  	pr_err("%s is not valid without PKS support\n", __func__);
>  }
> +static inline u32 get_current_pkrs(void)
> +{
> +	return 0;
> +}
>  #endif /* ! CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
>  

This would need an ack from the x86 people.  Andy, Boris?

Paolo


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/5] KVM: X86: Expose PKS to guest
  2021-02-05  9:25   ` Paolo Bonzini
@ 2021-02-05  9:56     ` Borislav Petkov
  2021-02-05 10:10       ` Paolo Bonzini
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2021-02-05  9:56 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li, kvm, linux-kernel, x86,
	Andy Lutomirski

On Fri, Feb 05, 2021 at 10:25:48AM +0100, Paolo Bonzini wrote:
> On 05/02/21 09:37, Chenyi Qiang wrote:
> > 
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index 57718716cc70..8027f854c600 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -390,3 +390,9 @@ void pks_key_free(int pkey)
> >  	__clear_bit(pkey, &pks_key_allocation_map);
> >  }
> >  EXPORT_SYMBOL_GPL(pks_key_free);
> > +
> > +u32 get_current_pkrs(void)
> > +{
> > +	return this_cpu_read(pkrs_cache);
> > +}
> > +EXPORT_SYMBOL_GPL(get_current_pkrs);
> > diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> > index bed0e293f13b..480429020f4c 100644
> > --- a/include/linux/pkeys.h
> > +++ b/include/linux/pkeys.h
> > @@ -72,6 +72,10 @@ static inline void pks_mk_readwrite(int pkey)
> >  {
> >  	pr_err("%s is not valid without PKS support\n", __func__);
> >  }
> > +static inline u32 get_current_pkrs(void)
> > +{
> > +	return 0;
> > +}
> >  #endif /* ! CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
> 
> This would need an ack from the x86 people.  Andy, Boris?

This looks like the PKS baremetal pile needs to be upstream first.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/5] KVM: X86: Expose PKS to guest
  2021-02-05  9:56     ` Borislav Petkov
@ 2021-02-05 10:10       ` Paolo Bonzini
  2021-02-05 11:29         ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2021-02-05 10:10 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li, kvm, linux-kernel, x86,
	Andy Lutomirski

On 05/02/21 10:56, Borislav Petkov wrote:
> On Fri, Feb 05, 2021 at 10:25:48AM +0100, Paolo Bonzini wrote:
>> On 05/02/21 09:37, Chenyi Qiang wrote:
>>>
>>> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
>>> index 57718716cc70..8027f854c600 100644
>>> --- a/arch/x86/mm/pkeys.c
>>> +++ b/arch/x86/mm/pkeys.c
>>> @@ -390,3 +390,9 @@ void pks_key_free(int pkey)
>>>   	__clear_bit(pkey, &pks_key_allocation_map);
>>>   }
>>>   EXPORT_SYMBOL_GPL(pks_key_free);
>>> +
>>> +u32 get_current_pkrs(void)
>>> +{
>>> +	return this_cpu_read(pkrs_cache);
>>> +}
>>> +EXPORT_SYMBOL_GPL(get_current_pkrs);
>>> diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
>>> index bed0e293f13b..480429020f4c 100644
>>> --- a/include/linux/pkeys.h
>>> +++ b/include/linux/pkeys.h
>>> @@ -72,6 +72,10 @@ static inline void pks_mk_readwrite(int pkey)
>>>   {
>>>   	pr_err("%s is not valid without PKS support\n", __func__);
>>>   }
>>> +static inline u32 get_current_pkrs(void)
>>> +{
>>> +	return 0;
>>> +}
>>>   #endif /* ! CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
>>
>> This would need an ack from the x86 people.  Andy, Boris?
> 
> This looks like the PKS baremetal pile needs to be upstream first.

Yes, it does.  I would like to have an ack for including the above two 
hunks once PKS is upstream.

I also have CET and bus lock #DB queued and waiting for the bare metal 
functionality, however they do not touch anything outside arch/x86/kvm.

Paolo


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/5] KVM: X86: Expose PKS to guest
  2021-02-05 10:10       ` Paolo Bonzini
@ 2021-02-05 11:29         ` Thomas Gleixner
  2021-02-05 11:51           ` Paolo Bonzini
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2021-02-05 11:29 UTC (permalink / raw)
  To: Paolo Bonzini, Borislav Petkov
  Cc: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li, kvm, linux-kernel, x86,
	Andy Lutomirski

On Fri, Feb 05 2021 at 11:10, Paolo Bonzini wrote:
> On 05/02/21 10:56, Borislav Petkov wrote:
>>> This would need an ack from the x86 people.  Andy, Boris?
>> 
>> This looks like the PKS baremetal pile needs to be upstream first.
>
> Yes, it does.  I would like to have an ack for including the above two 
> hunks once PKS is upstream.
>
> I also have CET and bus lock #DB queued and waiting for the bare metal 
> functionality, however they do not touch anything outside arch/x86/kvm.

What's the exact point of queueing random stuff which lacks bare metal
support?

Once PKS, CET or whatever is merged into tip then it's the point for
resending the KVM patches for inclusion and that's the point where it
gets acked and not $N month ahead when everything is still in flux.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/5] KVM: X86: Expose PKS to guest
  2021-02-05 11:29         ` Thomas Gleixner
@ 2021-02-05 11:51           ` Paolo Bonzini
  0 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2021-02-05 11:51 UTC (permalink / raw)
  To: Thomas Gleixner, Borislav Petkov
  Cc: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li, kvm, linux-kernel, x86,
	Andy Lutomirski

On 05/02/21 12:29, Thomas Gleixner wrote:
> On Fri, Feb 05 2021 at 11:10, Paolo Bonzini wrote:
>> On 05/02/21 10:56, Borislav Petkov wrote:
>>>> This would need an ack from the x86 people.  Andy, Boris?
>>>
>>> This looks like the PKS baremetal pile needs to be upstream first.
>>
>> Yes, it does.  I would like to have an ack for including the above two
>> hunks once PKS is upstream.
>>
>> I also have CET and bus lock #DB queued and waiting for the bare metal
>> functionality, however they do not touch anything outside arch/x86/kvm.
> 
> What's the exact point of queueing random stuff which lacks bare metal
> support?

The code is often completely independent of bare metal support even if 
it depends of it (CET and bus lock for example only share the #defines; 
for PKS this is not the case just because Intel decided not to use 
XSAVES *shrug*).

I prefer to queue early, because it keeps my backlog small and because 
every resend comes with the risk of random changes sneaking in since the 
version that I reviewed.  An early ack would also mean that I don't have 
to bug you in the middle of the merge window.  But it's not a problem, 
I'll ask for acks again once PKS is merged into tip.

Thanks,

Paolo

> Once PKS, CET or whatever is merged into tip then it's the point for
> resending the KVM patches for inclusion and that's the point where it
> gets acked and not $N month ahead when everything is still in flux.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/5] KVM: X86: Expose PKS to guest
  2021-02-05  8:37 ` [PATCH v4 2/5] KVM: X86: Expose PKS to guest Chenyi Qiang
  2021-02-05  9:23   ` Paolo Bonzini
  2021-02-05  9:25   ` Paolo Bonzini
@ 2021-07-29 16:44   ` Sean Christopherson
  2021-08-03  8:50     ` Chenyi Qiang
  2 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2021-07-29 16:44 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Fri, Feb 05, 2021, Chenyi Qiang wrote:
> @@ -539,6 +540,7 @@ struct kvm_vcpu_arch {
>  	unsigned long cr8;
>  	u32 host_pkru;
>  	u32 pkru;
> +	u32 pkrs;
>  	u32 hflags;
>  	u64 efer;
>  	u64 apic_base;

...

> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 89af692deb7e..2266d98ace6f 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -250,6 +250,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
>  	dest->ds_sel = src->ds_sel;
>  	dest->es_sel = src->es_sel;
>  #endif
> +	dest->pkrs = src->pkrs;

This is wrong.  It also arguably belongs in patch 04.

The part that's missing is actually updating vmcs.HOST_IA32_PKRS.  FS/GS are
handled by vmx_set_host_fs_gs(), while LDT/DS/ES are oddballs where the selector
is also the state that's restored.

In other words, this will cause nested VM-Exit to load the wrong PKRS.

>  }
>  
>  static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
> index 1472c6c376f7..b5ba6407c5e1 100644
> --- a/arch/x86/kvm/vmx/vmcs.h
> +++ b/arch/x86/kvm/vmx/vmcs.h
> @@ -40,6 +40,7 @@ struct vmcs_host_state {
>  #ifdef CONFIG_X86_64
>  	u16           ds_sel, es_sel;
>  #endif
> +	u32           pkrs;
>  };
>  
>  struct vmcs_controls_shadow {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 47b8357b9751..a3d95492e2b7 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -363,6 +363,8 @@ module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);
>  static u32 vmx_segment_access_rights(struct kvm_segment *var);
>  static __always_inline void vmx_disable_intercept_for_msr(struct kvm_vcpu *vcpu,
>  							  u32 msr, int type);
> +static __always_inline void vmx_enable_intercept_for_msr(struct kvm_vcpu *vcpu,
> +							 u32 msr, int type);
>  
>  void vmx_vmexit(void);
>  
> @@ -660,6 +662,8 @@ static bool is_valid_passthrough_msr(u32 msr)
>  	case MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B:
>  		/* PT MSRs. These are handled in pt_update_intercept_for_msr() */
>  		return true;
> +	case MSR_IA32_PKRS:
> +		return true;

This is wrong with respect to MSR filtering, as KVM will fail to intercept the
MSR in response to a filter change.  See vmx_msr_filter_changed()..  I also think
that special casing PKRS is a bad idea in general.  More later.

>  	}
>  
>  	r = possible_passthrough_msr_slot(msr) != -ENOENT;

...

> +	case MSR_IA32_PKRS:
> +		if (!kvm_pkrs_valid(data))
> +			return 1;
> +		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
> +		    (!msr_info->host_initiated &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
> +			return 1;
> +		if (vcpu->arch.pkrs != data) {

This will need to be modified if we go with my below proposal.

> +			vcpu->arch.pkrs = data;
> +			vmcs_write64(GUEST_IA32_PKRS, data);
> +		}
> +		break;
>  	case MSR_TSC_AUX:
>  		if (!msr_info->host_initiated &&
>  		    !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))
> @@ -2479,7 +2518,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  	      VM_EXIT_LOAD_IA32_EFER |
>  	      VM_EXIT_CLEAR_BNDCFGS |
>  	      VM_EXIT_PT_CONCEAL_PIP |
> -	      VM_EXIT_CLEAR_IA32_RTIT_CTL;
> +	      VM_EXIT_CLEAR_IA32_RTIT_CTL |
> +	      VM_EXIT_LOAD_IA32_PKRS;
>  	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
>  				&_vmexit_control) < 0)
>  		return -EIO;
> @@ -2503,7 +2543,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  	      VM_ENTRY_LOAD_IA32_EFER |
>  	      VM_ENTRY_LOAD_BNDCFGS |
>  	      VM_ENTRY_PT_CONCEAL_PIP |
> -	      VM_ENTRY_LOAD_IA32_RTIT_CTL;
> +	      VM_ENTRY_LOAD_IA32_RTIT_CTL |
> +	      VM_ENTRY_LOAD_IA32_PKRS;
>  	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
>  				&_vmentry_control) < 0)
>  		return -EIO;
> @@ -3103,8 +3144,9 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  	 * is in force while we are in guest mode.  Do not let guests control
>  	 * this bit, even if host CR4.MCE == 0.
>  	 */
> -	unsigned long hw_cr4;
> +	unsigned long hw_cr4, old_cr4;
>  
> +	old_cr4 = kvm_read_cr4(vcpu);
>  	hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE);
>  	if (is_unrestricted_guest(vcpu))
>  		hw_cr4 |= KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST;
> @@ -3152,7 +3194,7 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  		}
>  
>  		/*
> -		 * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
> +		 * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
>  		 * hardware.  To emulate this behavior, SMEP/SMAP/PKU needs
>  		 * to be manually disabled when guest switches to non-paging
>  		 * mode.
> @@ -3160,10 +3202,29 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  		 * If !enable_unrestricted_guest, the CPU is always running
>  		 * with CR0.PG=1 and CR4 needs to be modified.
>  		 * If enable_unrestricted_guest, the CPU automatically
> -		 * disables SMEP/SMAP/PKU when the guest sets CR0.PG=0.
> +		 * disables SMEP/SMAP/PKU/PKS when the guest sets CR0.PG=0.
>  		 */
>  		if (!is_paging(vcpu))
> -			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
> +			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
> +				    X86_CR4_PKS);
> +	}
> +
> +	if ((hw_cr4 ^ old_cr4) & X86_CR4_PKS) {

Comparing hw_cr4 and old_cr4 is wrong, they are representative of two different
contexts.  old_cr4 is the guest's previous CR4 irrespective of KVM maniuplations,
whereas hw_cr4 does include KVM's modification, e.g. the is_paging() logic above
may clear CR4.PKS and may lead to incorrect behavior.

E.g.:

	guest.CR4.PKS = 1
	guest.CR0.PG  = 0
	guest_hw.CR4.PKS = 0  // KVM
	vmcs.LOAD_PKRS = 0    // KVM
	guest.CR0.PG  = 1
	guest_hw.CR4.PKS = 1  // KVM
	vmcs.LOAD_PKRS not modified because guest.CR4.PKS == guest_hw.CR4.PKS == 1

This logic also fails to handle the case where L1 doesn't intercept CR4.PKS
modifications for L2.

The VM-Exit path that does:

> +     if (static_cpu_has(X86_FEATURE_PKS) &&
> +         kvm_read_cr4_bits(vcpu, X86_CR4_PKS))
> +             vcpu->arch.pkrs = vmcs_read64(GUEST_IA32_PKRS)

is also flawed in that, per this scheme, it may unnecessarily read vmcs.GUEST_PKRS,
though I don't think it can get the wrong value, unless of course it's running L2...

In short, IMO toggling PKRS interception/load on CR4 changes is a really, really
bad idea.  UMIP emulation attempted do fancy toggling and got it wrong multiple
times, and this is more complicated than what UMIP was trying to do.

The only motiviation for toggling PKRS interception/load is to avoid the VMREAD in
the VM-Exit path.  Given that vcpu->arch.pkrs is rarely accessed by KVM, e.g. only
on host userspace MSR reads and and GVA->GPA translation via permission_fault(),
keeping vcpu->arch.pkrs up-to-date at all times is unnecessary, i.e. it can be
synchronized on-demand.  And regarding permission_fault(), that's indirectly gated
by CR4.PKS=1, thus deferring the VMREAD to permission_fault() is guaranteed to be
more performant than reading on every VM-Exit (with CR4.PKS=1).

So:

  - Disable PKRS MSR interception if it's exposed to the guest.
  - Load PKRS on entry/exit if it's exposed to the guest.
  - Add VCPU_EXREG_PKRS and clear its bits in vmx_register_cache_reset().
  - Add helpers to read/write/cache PKRS and use accordingly.

> +		/* pass through PKRS to guest when CR4.PKS=1 */
> +		if (guest_cpuid_has(vcpu, X86_FEATURE_PKS) && hw_cr4 & X86_CR4_PKS) {
> +			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
> +			vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
> +			vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
> +			/*
> +			 * Every vm exit saves guest pkrs automatically, sync vcpu->arch.pkrs
> +			 * to VMCS.GUEST_PKRS to avoid the pollution by host pkrs.
> +			 */
> +			vmcs_write64(GUEST_IA32_PKRS, vcpu->arch.pkrs);
> +		} else {
> +			vmx_enable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
> +			vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
> +			vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
> +		}
>  	}
>  
>  	vmcs_writel(CR4_READ_SHADOW, cr4);

...

> @@ -6776,6 +6841,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  
>  	pt_guest_exit(vmx);
>  
> +	if (static_cpu_has(X86_FEATURE_PKS) &&
> +	    kvm_read_cr4_bits(vcpu, X86_CR4_PKS))
> +		vcpu->arch.pkrs = vmcs_read64(GUEST_IA32_PKRS);
> +
>  	kvm_load_host_xsave_state(vcpu);
>  
>  	vmx->nested.nested_run_pending = 0;
> @@ -7280,6 +7349,14 @@ static __init void vmx_set_cpu_caps(void)
>  	if (vmx_pt_mode_is_host_guest())
>  		kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);
>  
> +	/*
> +	 * PKS is not yet implemented for shadow paging.
> +	 * If not support VM_{ENTRY, EXIT}_LOAD_IA32_PKRS,
> +	 * don't expose the PKS as well.
> +	 */
> +	if (enable_ept && cpu_has_load_ia32_pkrs())
> +		kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);
> +
>  	if (vmx_umip_emulated())
>  		kvm_cpu_cap_set(X86_FEATURE_UMIP);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f5ede41bf9e6..684ef760481c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1213,7 +1213,7 @@ static const u32 msrs_to_save_all[] = {
>  	MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
>  	MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
>  	MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
> -	MSR_IA32_UMWAIT_CONTROL,
> +	MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,
>  
>  	MSR_ARCH_PERFMON_FIXED_CTR0, MSR_ARCH_PERFMON_FIXED_CTR1,
>  	MSR_ARCH_PERFMON_FIXED_CTR0 + 2, MSR_ARCH_PERFMON_FIXED_CTR0 + 3,
> @@ -5718,6 +5718,10 @@ static void kvm_init_msr_list(void)
>  				intel_pt_validate_hw_cap(PT_CAP_num_address_ranges) * 2)
>  				continue;
>  			break;
> +		case MSR_IA32_PKRS:
> +			if (!kvm_cpu_cap_has(X86_FEATURE_PKS))
> +				continue;
> +			break;
>  		case MSR_ARCH_PERFMON_PERFCTR0 ... MSR_ARCH_PERFMON_PERFCTR0 + 17:
>  			if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_PERFCTR0 >=
>  			    min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp))
> @@ -10026,6 +10030,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  
>  	vcpu->arch.ia32_xss = 0;
>  
> +	vcpu->arch.pkrs = 0;

This is wrong and also unreviewable.  It's wrong because the write isn't propagted
to vmcs.GUEST_PKRS, e.g. if the guest enables CR0.PG and CR4.PKS without writing
PKRS, KVM will run the guest with a stale value.

It's unreviewable because the SDM doesn't specify whether PKRS is cleared or
preserved on INIT.  The SDM needs an entry for PRKS in Table 9-1. "IA-32 and Intel
64 Processor States Following Power-up, Reset, or INIT" before this can be merged.

> +
>  	kvm_x86_ops.vcpu_reset(vcpu, init_event);
>  }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 4/5] KVM: MMU: Add support for PKS emulation
  2021-02-05  8:37 ` [PATCH v4 4/5] KVM: MMU: Add support for PKS emulation Chenyi Qiang
  2021-02-05  9:20   ` Paolo Bonzini
@ 2021-07-29 17:25   ` Sean Christopherson
  2021-07-29 17:45     ` Paolo Bonzini
  1 sibling, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2021-07-29 17:25 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Fri, Feb 05, 2021, Chenyi Qiang wrote:
> In addition to the pkey check for user pages, advertise pkr_mask also to
> cache the conditions where protection key checks for supervisor pages
> are needed. Add CR4_PKS in mmu_role_bits to track the pkr_mask update on
> a per-mmu basis.
> 
> In original cache conditions of pkr_mask, U/S bit in page tables is a
> judgement condition and replace the PFEC.RSVD in page fault error code
> to form the index of 16 domains. PKS support would extend the U/S bits
> (if U/S=0, PKS check required). It adds an additional check for
> cr4_pke/cr4_pks to ensure the necessity and distinguish PKU and PKS from
> each other.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 11 +++---
>  arch/x86/kvm/mmu.h              | 13 ++++---
>  arch/x86/kvm/mmu/mmu.c          | 63 +++++++++++++++++++--------------
>  arch/x86/kvm/x86.c              |  3 +-
>  4 files changed, 53 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1909d34cbac8..e515f1cecb88 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -294,7 +294,7 @@ union kvm_mmu_extended_role {
>  		unsigned int cr0_pg:1;
>  		unsigned int cr4_pae:1;
>  		unsigned int cr4_pse:1;
> -		unsigned int cr4_pke:1;
> +		unsigned int cr4_pkr:1;

Smushing these together will not work, as this code (from below)

> -     ext.cr4_pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
> +     ext.cr4_pkr = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
> +                   !!kvm_read_cr4_bits(vcpu, X86_CR4_PKS);

will generate the same mmu_role for CR4.PKE=0,PKS=1 and CR4.PKE=1,PKS=1 (and
other combinations).  I.e. KVM will fail to reconfigure the MMU and thus skip
update_pkr_bitmask() if the guest toggles PKE or PKS while the other PK* bit is set.

>  		unsigned int cr4_smap:1;
>  		unsigned int cr4_smep:1;
>  		unsigned int maxphyaddr:6;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 4/5] KVM: MMU: Add support for PKS emulation
  2021-07-29 17:25   ` Sean Christopherson
@ 2021-07-29 17:45     ` Paolo Bonzini
  2021-08-03  8:52       ` Chenyi Qiang
  0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2021-07-29 17:45 UTC (permalink / raw)
  To: Sean Christopherson, Chenyi Qiang
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Xiaoyao Li, kvm, linux-kernel

On 29/07/21 19:25, Sean Christopherson wrote:
>> -		unsigned int cr4_pke:1;
>> +		unsigned int cr4_pkr:1;
> Smushing these together will not work, as this code (from below)
> 
>> -     ext.cr4_pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
>> +     ext.cr4_pkr = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>> +                   !!kvm_read_cr4_bits(vcpu, X86_CR4_PKS);
> will generate the same mmu_role for CR4.PKE=0,PKS=1 and CR4.PKE=1,PKS=1 (and
> other combinations).  I.e. KVM will fail to reconfigure the MMU and thus skip
> update_pkr_bitmask() if the guest toggles PKE or PKS while the other PK* bit is set.
> 

I'm also not sure why there would be issues in just using cr4_pks.

Paolo


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/5] KVM: X86: Expose PKS to guest
  2021-07-29 16:44   ` Sean Christopherson
@ 2021-08-03  8:50     ` Chenyi Qiang
  0 siblings, 0 replies; 19+ messages in thread
From: Chenyi Qiang @ 2021-08-03  8:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

Thanks Sean for your review.

On 7/30/2021 12:44 AM, Sean Christopherson wrote:
> On Fri, Feb 05, 2021, Chenyi Qiang wrote:
>> @@ -539,6 +540,7 @@ struct kvm_vcpu_arch {
>>   	unsigned long cr8;
>>   	u32 host_pkru;
>>   	u32 pkru;
>> +	u32 pkrs;
>>   	u32 hflags;
>>   	u64 efer;
>>   	u64 apic_base;
> 
> ...
> 
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 89af692deb7e..2266d98ace6f 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -250,6 +250,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
>>   	dest->ds_sel = src->ds_sel;
>>   	dest->es_sel = src->es_sel;
>>   #endif
>> +	dest->pkrs = src->pkrs;
> 
> This is wrong.  It also arguably belongs in patch 04.
> 
> The part that's missing is actually updating vmcs.HOST_IA32_PKRS.  FS/GS are
> handled by vmx_set_host_fs_gs(), while LDT/DS/ES are oddballs where the selector
> is also the state that's restored.
> 

Will fix it. I guess it should belong in patch 05.

> In other words, this will cause nested VM-Exit to load the wrong PKRS.
> 
>>   }
>>   
>>   static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
>> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
>> index 1472c6c376f7..b5ba6407c5e1 100644
>> --- a/arch/x86/kvm/vmx/vmcs.h
>> +++ b/arch/x86/kvm/vmx/vmcs.h
>> @@ -40,6 +40,7 @@ struct vmcs_host_state {
>>   #ifdef CONFIG_X86_64
>>   	u16           ds_sel, es_sel;
>>   #endif
>> +	u32           pkrs;
>>   };
>>   
>>   struct vmcs_controls_shadow {
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 47b8357b9751..a3d95492e2b7 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -363,6 +363,8 @@ module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);
>>   static u32 vmx_segment_access_rights(struct kvm_segment *var);
>>   static __always_inline void vmx_disable_intercept_for_msr(struct kvm_vcpu *vcpu,
>>   							  u32 msr, int type);
>> +static __always_inline void vmx_enable_intercept_for_msr(struct kvm_vcpu *vcpu,
>> +							 u32 msr, int type);
>>   
>>   void vmx_vmexit(void);
>>   
>> @@ -660,6 +662,8 @@ static bool is_valid_passthrough_msr(u32 msr)
>>   	case MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B:
>>   		/* PT MSRs. These are handled in pt_update_intercept_for_msr() */
>>   		return true;
>> +	case MSR_IA32_PKRS:
>> +		return true;
> 
> This is wrong with respect to MSR filtering, as KVM will fail to intercept the
> MSR in response to a filter change.  See vmx_msr_filter_changed()..  I also think
> that special casing PKRS is a bad idea in general.  More later.
> 

Yes, msr filter support for PKRS was missing. Will add the support in 
vmx_msr_filter_changed().

>>   	}
>>   
>>   	r = possible_passthrough_msr_slot(msr) != -ENOENT;
> 
> ...
> 
>> +	case MSR_IA32_PKRS:
>> +		if (!kvm_pkrs_valid(data))
>> +			return 1;
>> +		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
>> +		    (!msr_info->host_initiated &&
>> +		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
>> +			return 1;
>> +		if (vcpu->arch.pkrs != data) {
> 
> This will need to be modified if we go with my below proposal.
> 
>> +			vcpu->arch.pkrs = data;
>> +			vmcs_write64(GUEST_IA32_PKRS, data);
>> +		}
>> +		break;
>>   	case MSR_TSC_AUX:
>>   		if (!msr_info->host_initiated &&
>>   		    !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))
>> @@ -2479,7 +2518,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>>   	      VM_EXIT_LOAD_IA32_EFER |
>>   	      VM_EXIT_CLEAR_BNDCFGS |
>>   	      VM_EXIT_PT_CONCEAL_PIP |
>> -	      VM_EXIT_CLEAR_IA32_RTIT_CTL;
>> +	      VM_EXIT_CLEAR_IA32_RTIT_CTL |
>> +	      VM_EXIT_LOAD_IA32_PKRS;
>>   	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
>>   				&_vmexit_control) < 0)
>>   		return -EIO;
>> @@ -2503,7 +2543,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>>   	      VM_ENTRY_LOAD_IA32_EFER |
>>   	      VM_ENTRY_LOAD_BNDCFGS |
>>   	      VM_ENTRY_PT_CONCEAL_PIP |
>> -	      VM_ENTRY_LOAD_IA32_RTIT_CTL;
>> +	      VM_ENTRY_LOAD_IA32_RTIT_CTL |
>> +	      VM_ENTRY_LOAD_IA32_PKRS;
>>   	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
>>   				&_vmentry_control) < 0)
>>   		return -EIO;
>> @@ -3103,8 +3144,9 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>   	 * is in force while we are in guest mode.  Do not let guests control
>>   	 * this bit, even if host CR4.MCE == 0.
>>   	 */
>> -	unsigned long hw_cr4;
>> +	unsigned long hw_cr4, old_cr4;
>>   
>> +	old_cr4 = kvm_read_cr4(vcpu);
>>   	hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE);
>>   	if (is_unrestricted_guest(vcpu))
>>   		hw_cr4 |= KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST;
>> @@ -3152,7 +3194,7 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>   		}
>>   
>>   		/*
>> -		 * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
>> +		 * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
>>   		 * hardware.  To emulate this behavior, SMEP/SMAP/PKU needs
>>   		 * to be manually disabled when guest switches to non-paging
>>   		 * mode.
>> @@ -3160,10 +3202,29 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>   		 * If !enable_unrestricted_guest, the CPU is always running
>>   		 * with CR0.PG=1 and CR4 needs to be modified.
>>   		 * If enable_unrestricted_guest, the CPU automatically
>> -		 * disables SMEP/SMAP/PKU when the guest sets CR0.PG=0.
>> +		 * disables SMEP/SMAP/PKU/PKS when the guest sets CR0.PG=0.
>>   		 */
>>   		if (!is_paging(vcpu))
>> -			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
>> +			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
>> +				    X86_CR4_PKS);
>> +	}
>> +
>> +	if ((hw_cr4 ^ old_cr4) & X86_CR4_PKS) {
> 
> Comparing hw_cr4 and old_cr4 is wrong, they are representative of two different
> contexts.  old_cr4 is the guest's previous CR4 irrespective of KVM maniuplations,
> whereas hw_cr4 does include KVM's modification, e.g. the is_paging() logic above
> may clear CR4.PKS and may lead to incorrect behavior.
> 
> E.g.:
> 
> 	guest.CR4.PKS = 1
> 	guest.CR0.PG  = 0
> 	guest_hw.CR4.PKS = 0  // KVM
> 	vmcs.LOAD_PKRS = 0    // KVM
> 	guest.CR0.PG  = 1
> 	guest_hw.CR4.PKS = 1  // KVM
> 	vmcs.LOAD_PKRS not modified because guest.CR4.PKS == guest_hw.CR4.PKS == 1
> 
> This logic also fails to handle the case where L1 doesn't intercept CR4.PKS
> modifications for L2.
> 
> The VM-Exit path that does:
> 
>> +     if (static_cpu_has(X86_FEATURE_PKS) &&
>> +         kvm_read_cr4_bits(vcpu, X86_CR4_PKS))
>> +             vcpu->arch.pkrs = vmcs_read64(GUEST_IA32_PKRS)
> 
> is also flawed in that, per this scheme, it may unnecessarily read vmcs.GUEST_PKRS,
> though I don't think it can get the wrong value, unless of course it's running L2...
> 
> In short, IMO toggling PKRS interception/load on CR4 changes is a really, really
> bad idea.  UMIP emulation attempted do fancy toggling and got it wrong multiple
> times, and this is more complicated than what UMIP was trying to do.
> 
> The only motiviation for toggling PKRS interception/load is to avoid the VMREAD in
> the VM-Exit path.  Given that vcpu->arch.pkrs is rarely accessed by KVM, e.g. only
> on host userspace MSR reads and and GVA->GPA translation via permission_fault(),
> keeping vcpu->arch.pkrs up-to-date at all times is unnecessary, i.e. it can be
> synchronized on-demand.  And regarding permission_fault(), that's indirectly gated
> by CR4.PKS=1, thus deferring the VMREAD to permission_fault() is guaranteed to be
> more performant than reading on every VM-Exit (with CR4.PKS=1).
> 
> So:
> 
>    - Disable PKRS MSR interception if it's exposed to the guest.
>    - Load PKRS on entry/exit if it's exposed to the guest.
>    - Add VCPU_EXREG_PKRS and clear its bits in vmx_register_cache_reset().
>    - Add helpers to read/write/cache PKRS and use accordingly.
> 

Make sense. Will refactor all these mentioned in next version.

>> +		/* pass through PKRS to guest when CR4.PKS=1 */
>> +		if (guest_cpuid_has(vcpu, X86_FEATURE_PKS) && hw_cr4 & X86_CR4_PKS) {
>> +			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
>> +			vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
>> +			vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
>> +			/*
>> +			 * Every vm exit saves guest pkrs automatically, sync vcpu->arch.pkrs
>> +			 * to VMCS.GUEST_PKRS to avoid the pollution by host pkrs.
>> +			 */
>> +			vmcs_write64(GUEST_IA32_PKRS, vcpu->arch.pkrs);
>> +		} else {
>> +			vmx_enable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
>> +			vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
>> +			vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
>> +		}
>>   	}
>>   
>>   	vmcs_writel(CR4_READ_SHADOW, cr4);
> 
> ...
> 
>> @@ -6776,6 +6841,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>   
>>   	pt_guest_exit(vmx);
>>   
>> +	if (static_cpu_has(X86_FEATURE_PKS) &&
>> +	    kvm_read_cr4_bits(vcpu, X86_CR4_PKS))
>> +		vcpu->arch.pkrs = vmcs_read64(GUEST_IA32_PKRS);
>> +
>>   	kvm_load_host_xsave_state(vcpu);
>>   
>>   	vmx->nested.nested_run_pending = 0;
>> @@ -7280,6 +7349,14 @@ static __init void vmx_set_cpu_caps(void)
>>   	if (vmx_pt_mode_is_host_guest())
>>   		kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);
>>   
>> +	/*
>> +	 * PKS is not yet implemented for shadow paging.
>> +	 * If not support VM_{ENTRY, EXIT}_LOAD_IA32_PKRS,
>> +	 * don't expose the PKS as well.
>> +	 */
>> +	if (enable_ept && cpu_has_load_ia32_pkrs())
>> +		kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);
>> +
>>   	if (vmx_umip_emulated())
>>   		kvm_cpu_cap_set(X86_FEATURE_UMIP);
>>   
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index f5ede41bf9e6..684ef760481c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1213,7 +1213,7 @@ static const u32 msrs_to_save_all[] = {
>>   	MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
>>   	MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
>>   	MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
>> -	MSR_IA32_UMWAIT_CONTROL,
>> +	MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,
>>   
>>   	MSR_ARCH_PERFMON_FIXED_CTR0, MSR_ARCH_PERFMON_FIXED_CTR1,
>>   	MSR_ARCH_PERFMON_FIXED_CTR0 + 2, MSR_ARCH_PERFMON_FIXED_CTR0 + 3,
>> @@ -5718,6 +5718,10 @@ static void kvm_init_msr_list(void)
>>   				intel_pt_validate_hw_cap(PT_CAP_num_address_ranges) * 2)
>>   				continue;
>>   			break;
>> +		case MSR_IA32_PKRS:
>> +			if (!kvm_cpu_cap_has(X86_FEATURE_PKS))
>> +				continue;
>> +			break;
>>   		case MSR_ARCH_PERFMON_PERFCTR0 ... MSR_ARCH_PERFMON_PERFCTR0 + 17:
>>   			if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_PERFCTR0 >=
>>   			    min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp))
>> @@ -10026,6 +10030,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>   
>>   	vcpu->arch.ia32_xss = 0;
>>   
>> +	vcpu->arch.pkrs = 0;
> 
> This is wrong and also unreviewable.  It's wrong because the write isn't propagted
> to vmcs.GUEST_PKRS, e.g. if the guest enables CR0.PG and CR4.PKS without writing
> PKRS, KVM will run the guest with a stale value.
> 

Yes, should write to vmcs to do reset.
  > It's unreviewable because the SDM doesn't specify whether PKRS is 
cleared or
> preserved on INIT.  The SDM needs an entry for PRKS in Table 9-1. "IA-32 and Intel
> 64 Processor States Following Power-up, Reset, or INIT" before this can be merged.
> 

PKRS is missing in Table 9-1. It will be updated in next version of SDM, 
just let you know to help current review:

"PKRS is cleared on INIT. It should be 0 in all cases."

>> +
>>   	kvm_x86_ops.vcpu_reset(vcpu, init_event);
>>   }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 4/5] KVM: MMU: Add support for PKS emulation
  2021-07-29 17:45     ` Paolo Bonzini
@ 2021-08-03  8:52       ` Chenyi Qiang
  0 siblings, 0 replies; 19+ messages in thread
From: Chenyi Qiang @ 2021-08-03  8:52 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Xiaoyao Li, kvm, linux-kernel



On 7/30/2021 1:45 AM, Paolo Bonzini wrote:
> On 29/07/21 19:25, Sean Christopherson wrote:
>>> -        unsigned int cr4_pke:1;
>>> +        unsigned int cr4_pkr:1;
>> Smushing these together will not work, as this code (from below)
>>
>>> -     ext.cr4_pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
>>> +     ext.cr4_pkr = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
>>> +                   !!kvm_read_cr4_bits(vcpu, X86_CR4_PKS);
>> will generate the same mmu_role for CR4.PKE=0,PKS=1 and 
>> CR4.PKE=1,PKS=1 (and
>> other combinations).  I.e. KVM will fail to reconfigure the MMU and 
>> thus skip
>> update_pkr_bitmask() if the guest toggles PKE or PKS while the other 
>> PK* bit is set.
>>
> 
> I'm also not sure why there would be issues in just using cr4_pks.
> 

Will split out the pke and pks.

Thanks
Chenyi

> Paolo
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-08-03  8:52 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-05  8:37 [PATCH v4 0/5] KVM: PKS Virtualization support Chenyi Qiang
2021-02-05  8:37 ` [PATCH v4 1/5] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
2021-02-05  8:37 ` [PATCH v4 2/5] KVM: X86: Expose PKS to guest Chenyi Qiang
2021-02-05  9:23   ` Paolo Bonzini
2021-02-05  9:25   ` Paolo Bonzini
2021-02-05  9:56     ` Borislav Petkov
2021-02-05 10:10       ` Paolo Bonzini
2021-02-05 11:29         ` Thomas Gleixner
2021-02-05 11:51           ` Paolo Bonzini
2021-07-29 16:44   ` Sean Christopherson
2021-08-03  8:50     ` Chenyi Qiang
2021-02-05  8:37 ` [PATCH v4 3/5] KVM: MMU: Rename the pkru to pkr Chenyi Qiang
2021-02-05  8:37 ` [PATCH v4 4/5] KVM: MMU: Add support for PKS emulation Chenyi Qiang
2021-02-05  9:20   ` Paolo Bonzini
2021-07-29 17:25   ` Sean Christopherson
2021-07-29 17:45     ` Paolo Bonzini
2021-08-03  8:52       ` Chenyi Qiang
2021-02-05  8:37 ` [PATCH v4 5/5] KVM: VMX: Enable PKS for nested VM Chenyi Qiang
2021-02-05  9:24 ` [PATCH v4 0/5] KVM: PKS Virtualization support Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).