* [RFC 0/7] KVM: PKS Virtualization support
@ 2020-08-07  8:48 Chenyi Qiang
  2020-08-07  8:48 ` [RFC 1/7] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
                   ` (6 more replies)
  0 siblings, 7 replies; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-07  8:48 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

This RFC series introduces KVM support for PKS, which depends on the
PKS kernel patches for some definitions. The latest kernel patch set
can be found at
https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/

---

Protection Keys for Supervisor Pages (PKS) is a feature that extends the
Protection Keys architecture to support thread-specific permission
restrictions on supervisor pages.

PKS works similarly to the existing PKU feature, which protects user
pages. Both perform an additional check after all legacy access
permission checks are done; on a violation, a #PF occurs and the
PFEC.PK bit is set. PKS introduces the IA32_PKRS MSR to manage
supervisor protection key rights. The MSR contains 16 pairs of ADi and
WDi bits. Each pair applies to the group of pages whose protection key,
stored in bits [62:59] of the leaf paging-structure entries, matches
the pair's index. Currently, IA32_PKRS is not managed by the XSAVES
architecture.
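
As an illustration of that layout, the helpers below extract the two
permission bits for a given key (a minimal sketch; the helper names are
hypothetical and not part of this series or the kernel):

	/* AD is bit 2*key, WD is bit 2*key + 1, in both PKRU and PKRS. */
	static inline bool pkr_access_disabled(u32 pkr, int key)
	{
		return pkr & (1u << (key * 2));
	}

	static inline bool pkr_write_disabled(u32 pkr, int key)
	{
		return pkr & (1u << (key * 2 + 1));
	}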

This patchset adds virtualization of PKS in KVM. It implements PKS
CPUID enumeration, VM-entry/VM-exit configuration, MSR exposure,
nested support, etc. PKS is not yet supported for shadow paging.

Detailed information about PKS can be found in the latest Intel 64 and
IA-32 Architectures Software Developer's Manual.

Chenyi Qiang (7):
  KVM: VMX: Introduce PKS VMCS fields
  KVM: VMX: Expose IA32_PKRS MSR
  KVM: MMU: Rename the pkru to pkr
  KVM: MMU: Refactor pkr_mask to cache condition
  KVM: MMU: Add support for PKS emulation
  KVM: X86: Expose PKS to guest and userspace
  KVM: VMX: Enable PKS for nested VM

 arch/x86/include/asm/kvm_host.h | 13 ++---
 arch/x86/include/asm/pkeys.h    |  1 +
 arch/x86/include/asm/vmx.h      |  6 +++
 arch/x86/kvm/cpuid.c            |  3 +-
 arch/x86/kvm/mmu.h              | 36 +++++++------
 arch/x86/kvm/mmu/mmu.c          | 78 +++++++++++++++-------------
 arch/x86/kvm/vmx/capabilities.h |  6 +++
 arch/x86/kvm/vmx/nested.c       | 33 ++++++++++++
 arch/x86/kvm/vmx/vmcs.h         |  1 +
 arch/x86/kvm/vmx/vmcs12.c       |  2 +
 arch/x86/kvm/vmx/vmcs12.h       |  6 ++-
 arch/x86/kvm/vmx/vmx.c          | 91 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.h          |  1 +
 arch/x86/kvm/x86.c              |  7 ++-
 arch/x86/kvm/x86.h              |  6 +++
 arch/x86/mm/pkeys.c             |  6 +++
 include/linux/pkeys.h           |  4 ++
 17 files changed, 234 insertions(+), 66 deletions(-)

-- 
2.17.1



* [RFC 1/7] KVM: VMX: Introduce PKS VMCS fields
  2020-08-07  8:48 [RFC 0/7] KVM: PKS Virtualization support Chenyi Qiang
@ 2020-08-07  8:48 ` Chenyi Qiang
  2020-08-10 23:17   ` Jim Mattson
  2020-08-07  8:48 ` [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR Chenyi Qiang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-07  8:48 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

PKS (Protection Keys for Supervisor Pages) is a feature that extends the
Protection Keys architecture to support thread-specific permission
restrictions on supervisor pages.

A new MSR (PKRS) is defined in the kernel to support PKS; it holds a
set of permissions associated with each protection domain.

Two VMCS fields {HOST,GUEST}_IA32_PKRS are introduced in
{host,guest}-state area to store the value of PKRS.

Every VM exit saves PKRS into guest-state area.
If VM_EXIT_LOAD_IA32_PKRS = 1, VM exit loads PKRS from the host-state
area.
If VM_ENTRY_LOAD_IA32_PKRS = 1, VM entry loads PKRS from the guest-state
area.
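
In pseudocode, the architectural behavior described above is (a sketch
of the rules, not code from this patch):

	/* On every VM exit, the guest value is saved unconditionally: */
	vmcs.GUEST_IA32_PKRS = IA32_PKRS;
	if (vmcs.exit_ctls & VM_EXIT_LOAD_IA32_PKRS)
		IA32_PKRS = vmcs.HOST_IA32_PKRS;

	/* On VM entry: */
	if (vmcs.entry_ctls & VM_ENTRY_LOAD_IA32_PKRS)
		IA32_PKRS = vmcs.GUEST_IA32_PKRS;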

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/vmx.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index cd7de4b401fe..425cf81dd722 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -94,6 +94,7 @@
 #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
 #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
 #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
+#define VM_EXIT_LOAD_IA32_PKRS			0x20000000
 
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
@@ -107,6 +108,7 @@
 #define VM_ENTRY_LOAD_BNDCFGS                   0x00010000
 #define VM_ENTRY_PT_CONCEAL_PIP			0x00020000
 #define VM_ENTRY_LOAD_IA32_RTIT_CTL		0x00040000
+#define VM_ENTRY_LOAD_IA32_PKRS			0x00400000
 
 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
 
@@ -243,12 +245,16 @@ enum vmcs_field {
 	GUEST_BNDCFGS_HIGH              = 0x00002813,
 	GUEST_IA32_RTIT_CTL		= 0x00002814,
 	GUEST_IA32_RTIT_CTL_HIGH	= 0x00002815,
+	GUEST_IA32_PKRS			= 0x00002818,
+	GUEST_IA32_PKRS_HIGH		= 0x00002819,
 	HOST_IA32_PAT			= 0x00002c00,
 	HOST_IA32_PAT_HIGH		= 0x00002c01,
 	HOST_IA32_EFER			= 0x00002c02,
 	HOST_IA32_EFER_HIGH		= 0x00002c03,
 	HOST_IA32_PERF_GLOBAL_CTRL	= 0x00002c04,
 	HOST_IA32_PERF_GLOBAL_CTRL_HIGH	= 0x00002c05,
+	HOST_IA32_PKRS			= 0x00002c06,
+	HOST_IA32_PKRS_HIGH		= 0x00002c07,
 	PIN_BASED_VM_EXEC_CONTROL       = 0x00004000,
 	CPU_BASED_VM_EXEC_CONTROL       = 0x00004002,
 	EXCEPTION_BITMAP                = 0x00004004,
-- 
2.17.1



* [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR
  2020-08-07  8:48 [RFC 0/7] KVM: PKS Virtualization support Chenyi Qiang
  2020-08-07  8:48 ` [RFC 1/7] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
@ 2020-08-07  8:48 ` Chenyi Qiang
  2020-08-12 21:21   ` Jim Mattson
  2021-01-26 18:01   ` Paolo Bonzini
  2020-08-07  8:48 ` [RFC 3/7] KVM: MMU: Rename the pkru to pkr Chenyi Qiang
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-07  8:48 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

Protection Keys for Supervisor Pages (PKS) uses the IA32_PKRS MSR (PKRS)
at index 0x6E1 to allow software to manage supervisor protection key
rights. For performance reasons, the PKRS intercept is disabled so that
the guest can access the MSR without VM exits.
PKS introduces dedicated VMCS control fields to switch PKRS, but they
only handle the restore side: every VM exit saves PKRS into the
guest-state area of the VMCS, while VM entry does not save the host
value, on the expectation that the host rarely changes the MSR. Update
the host's value in the VMCS manually if the kernel has changed the MSR
since the last time the VMCS was run.
The function get_current_pkrs() in arch/x86/mm/pkeys.c exports the
per-cpu variable pkrs_cache to avoid frequent rdmsr calls on PKRS.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/pkeys.h    |  1 +
 arch/x86/kvm/vmx/capabilities.h |  6 +++
 arch/x86/kvm/vmx/nested.c       |  1 +
 arch/x86/kvm/vmx/vmcs.h         |  1 +
 arch/x86/kvm/vmx/vmx.c          | 66 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.h              |  6 +++
 arch/x86/mm/pkeys.c             |  6 +++
 include/linux/pkeys.h           |  4 ++
 8 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 097abca7784c..d7c405d6eea6 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -142,6 +142,7 @@ u32 get_new_pkr(u32 old_pkr, int pkey, unsigned long init_val);
 int pks_key_alloc(const char *const pkey_user);
 void pks_key_free(int pkey);
 u32 get_new_pkr(u32 old_pkr, int pkey, unsigned long init_val);
+u32 get_current_pkrs(void);
 
 /*
  * pks_update_protection - Update the protection of the specified key
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 4bbd8b448d22..7099e3105f48 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -103,6 +103,12 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
 	       (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL);
 }
 
+static inline bool cpu_has_load_ia32_pkrs(void)
+{
+	return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PKRS) &&
+	       (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PKRS);
+}
+
 static inline bool cpu_has_vmx_mpx(void)
 {
 	return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_BNDCFGS) &&
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 11e4df560018..df2c2e733549 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -289,6 +289,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
 	dest->ds_sel = src->ds_sel;
 	dest->es_sel = src->es_sel;
 #endif
+	dest->pkrs = src->pkrs;
 }
 
 static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 7a3675fddec2..39ec3d0c844b 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -40,6 +40,7 @@ struct vmcs_host_state {
 #ifdef CONFIG_X86_64
 	u16           ds_sel, es_sel;
 #endif
+	u32           pkrs;
 };
 
 struct vmcs_controls_shadow {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 13745f2a5ecd..d91d59fb46fa 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1131,6 +1131,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 #endif
 	unsigned long fs_base, gs_base;
 	u16 fs_sel, gs_sel;
+	u32 host_pkrs;
 	int i;
 
 	vmx->req_immediate_exit = false;
@@ -1163,6 +1164,20 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	 */
 	host_state->ldt_sel = kvm_read_ldt();
 
+	/*
+	 * Update the host pkrs vmcs field before vcpu runs.
+	 * Setting VM_EXIT_LOAD_IA32_PKRS implies that
+	 * kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	 * guest_cpuid_has(vcpu, X86_FEATURE_PKS) hold.
+	 */
+	if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS) {
+		host_pkrs = get_current_pkrs();
+		if (unlikely(host_pkrs != host_state->pkrs)) {
+			vmcs_write64(HOST_IA32_PKRS, host_pkrs);
+			host_state->pkrs = host_pkrs;
+		}
+	}
+
 #ifdef CONFIG_X86_64
 	savesegment(ds, host_state->ds_sel);
 	savesegment(es, host_state->es_sel);
@@ -1951,6 +1966,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		else
 			msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
 		break;
+	case MSR_IA32_PKRS:
+		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
+		    (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
+			return 1;
+		msr_info->data = vmcs_read64(GUEST_IA32_PKRS);
+		break;
 	case MSR_TSC_AUX:
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))
@@ -2221,6 +2243,15 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		else
 			vmx->pt_desc.guest.addr_a[index / 2] = data;
 		break;
+	case MSR_IA32_PKRS:
+		if (!kvm_pkrs_valid(data))
+			return 1;
+		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
+		    (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
+			return 1;
+		vmcs_write64(GUEST_IA32_PKRS, data);
+		break;
 	case MSR_TSC_AUX:
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))
@@ -2510,7 +2541,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	      VM_EXIT_LOAD_IA32_EFER |
 	      VM_EXIT_CLEAR_BNDCFGS |
 	      VM_EXIT_PT_CONCEAL_PIP |
-	      VM_EXIT_CLEAR_IA32_RTIT_CTL;
+	      VM_EXIT_CLEAR_IA32_RTIT_CTL |
+	      VM_EXIT_LOAD_IA32_PKRS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
 				&_vmexit_control) < 0)
 		return -EIO;
@@ -2534,7 +2566,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	      VM_ENTRY_LOAD_IA32_EFER |
 	      VM_ENTRY_LOAD_BNDCFGS |
 	      VM_ENTRY_PT_CONCEAL_PIP |
-	      VM_ENTRY_LOAD_IA32_RTIT_CTL;
+	      VM_ENTRY_LOAD_IA32_RTIT_CTL |
+	      VM_ENTRY_LOAD_IA32_PKRS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
 				&_vmentry_control) < 0)
 		return -EIO;
@@ -5868,6 +5901,8 @@ void dump_vmcs(void)
 		       vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL));
 	if (vmentry_ctl & VM_ENTRY_LOAD_BNDCFGS)
 		pr_err("BndCfgS = 0x%016llx\n", vmcs_read64(GUEST_BNDCFGS));
+	if (vmentry_ctl & VM_ENTRY_LOAD_IA32_PKRS)
+		pr_err("PKRS = 0x%016llx\n", vmcs_read64(GUEST_IA32_PKRS));
 	pr_err("Interruptibility = %08x  ActivityState = %08x\n",
 	       vmcs_read32(GUEST_INTERRUPTIBILITY_INFO),
 	       vmcs_read32(GUEST_ACTIVITY_STATE));
@@ -5903,6 +5938,8 @@ void dump_vmcs(void)
 	    vmexit_ctl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
 		pr_err("PerfGlobCtl = 0x%016llx\n",
 		       vmcs_read64(HOST_IA32_PERF_GLOBAL_CTRL));
+	if (vmexit_ctl & VM_EXIT_LOAD_IA32_PKRS)
+		pr_err("PKRS = 0x%016llx\n", vmcs_read64(HOST_IA32_PKRS));
 
 	pr_err("*** Control State ***\n");
 	pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
@@ -7230,6 +7267,26 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
 }
 
+static void vmx_update_pkrs_cfg(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
+	bool pks_supported = guest_cpuid_has(vcpu, X86_FEATURE_PKS);
+
+	/*
+	 * Set the PKRS intercept when the guest doesn't support PKS.
+	 */
+	vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PKRS, MSR_TYPE_RW, !pks_supported);
+
+	if (pks_supported) {
+		vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
+		vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
+	} else {
+		vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
+		vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
+	}
+}
+
 static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7251,6 +7308,11 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 			~(FEAT_CTL_VMX_ENABLED_INSIDE_SMX |
 			  FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX);
 
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS))
+		vmx_update_pkrs_cfg(vcpu);
+	else
+		guest_cpuid_clear(vcpu, X86_FEATURE_PKS);
+
 	if (nested_vmx_allowed(vcpu)) {
 		nested_vmx_cr_fixed1_bits_update(vcpu);
 		nested_vmx_entry_exit_ctls_update(vcpu);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 6eb62e97e59f..7fb206f98bed 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -361,6 +361,12 @@ static inline bool kvm_dr7_valid(u64 data)
 	return !(data >> 32);
 }
 
+static inline bool kvm_pkrs_valid(u64 data)
+{
+	/* bits [63:32] must be zero */
+	return !(data >> 32);
+}
+
 void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
 void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
 u64 kvm_spec_ctrl_valid_bits(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index cc9a28a55ba3..b237c54074ba 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -333,6 +333,12 @@ void pks_key_free(int pkey)
 }
 EXPORT_SYMBOL_GPL(pks_key_free);
 
+u32 get_current_pkrs(void)
+{
+	return this_cpu_read(pkrs_cache);
+}
+EXPORT_SYMBOL_GPL(get_current_pkrs);
+
 static int pks_keys_allocated_show(struct seq_file *m, void *p)
 {
 	int i;
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 1d84ab7c12d4..8ac90fae137f 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -66,6 +66,10 @@ static inline void pks_update_protection(int pkey, unsigned long protection)
 {
 	WARN_ON_ONCE(1);
 }
+static inline u32 get_current_pkrs(void)
+{
+	return 0;
+}
 #endif /* ! CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
-- 
2.17.1



* [RFC 3/7] KVM: MMU: Rename the pkru to pkr
  2020-08-07  8:48 [RFC 0/7] KVM: PKS Virtualization support Chenyi Qiang
  2020-08-07  8:48 ` [RFC 1/7] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
  2020-08-07  8:48 ` [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR Chenyi Qiang
@ 2020-08-07  8:48 ` Chenyi Qiang
  2021-01-26 18:16   ` Paolo Bonzini
  2020-08-07  8:48 ` [RFC 4/7] KVM: MMU: Refactor pkr_mask to cache condition Chenyi Qiang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-07  8:48 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

PKRU represents the PKU register used in the protection key rights
check for user pages. Protection Keys for Supervisor Pages (PKS)
extends the protection key architecture to cover supervisor pages.

Rename the *pkru*-related variables and functions to *pkr*, which
stands for both PKRU and PKRS. This makes sense because both registers
have the same format, so PKS and PKU can share the same bitmap that
caches the conditions under which protection key checks are needed.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu.h              | 12 ++++++------
 arch/x86/kvm/mmu/mmu.c          | 18 +++++++++---------
 3 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be5363b21540..6b739d0d1c97 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -427,7 +427,7 @@ struct kvm_mmu {
 	* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
 	* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
 	*/
-	u32 pkru_mask;
+	u32 pkr_mask;
 
 	u64 *pae_root;
 	u64 *lm_root;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 444bb9c54548..0c2fdf0abf22 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -193,8 +193,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	u32 errcode = PFERR_PRESENT_MASK;
 
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
-	if (unlikely(mmu->pkru_mask)) {
-		u32 pkru_bits, offset;
+	if (unlikely(mmu->pkr_mask)) {
+		u32 pkr_bits, offset;
 
 		/*
 		* PKRU defines 32 bits, there are 16 domains and 2
@@ -202,15 +202,15 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		* index of the protection domain, so pte_pkey * 2 is
 		* is the index of the first bit for the domain.
 		*/
-		pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
 
 		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
 		offset = (pfec & ~1) +
 			((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
 
-		pkru_bits &= mmu->pkru_mask >> offset;
-		errcode |= -pkru_bits & PFERR_PK_MASK;
-		fault |= (pkru_bits != 0);
+		pkr_bits &= mmu->pkr_mask >> offset;
+		errcode |= -pkr_bits & PFERR_PK_MASK;
+		fault |= (pkr_bits != 0);
 	}
 
 	return -(u32)fault & errcode;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d6a0ae7800c..481442f5e27a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4716,20 +4716,20 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
 * away both AD and WD.  For all reads or if the last condition holds, WD
 * only will be masked away.
 */
-static void update_pkru_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+static void update_pkr_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 				bool ept)
 {
 	unsigned bit;
 	bool wp;
 
 	if (ept) {
-		mmu->pkru_mask = 0;
+		mmu->pkr_mask = 0;
 		return;
 	}
 
 	/* PKEY is enabled only if CR4.PKE and EFER.LMA are both set. */
 	if (!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) || !is_long_mode(vcpu)) {
-		mmu->pkru_mask = 0;
+		mmu->pkr_mask = 0;
 		return;
 	}
 
@@ -4763,7 +4763,7 @@ static void update_pkru_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		/* PKRU.WD stops write access. */
 		pkey_bits |= (!!check_write) << 1;
 
-		mmu->pkru_mask |= (pkey_bits & 3) << pfec;
+		mmu->pkr_mask |= (pkey_bits & 3) << pfec;
 	}
 }
 
@@ -4785,7 +4785,7 @@ static void paging64_init_context_common(struct kvm_vcpu *vcpu,
 
 	reset_rsvds_bits_mask(vcpu, context);
 	update_permission_bitmask(vcpu, context, false);
-	update_pkru_bitmask(vcpu, context, false);
+	update_pkr_bitmask(vcpu, context, false);
 	update_last_nonleaf_level(vcpu, context);
 
 	MMU_WARN_ON(!is_pae(vcpu));
@@ -4815,7 +4815,7 @@ static void paging32_init_context(struct kvm_vcpu *vcpu,
 
 	reset_rsvds_bits_mask(vcpu, context);
 	update_permission_bitmask(vcpu, context, false);
-	update_pkru_bitmask(vcpu, context, false);
+	update_pkr_bitmask(vcpu, context, false);
 	update_last_nonleaf_level(vcpu, context);
 
 	context->page_fault = paging32_page_fault;
@@ -4925,7 +4925,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 	}
 
 	update_permission_bitmask(vcpu, context, false);
-	update_pkru_bitmask(vcpu, context, false);
+	update_pkr_bitmask(vcpu, context, false);
 	update_last_nonleaf_level(vcpu, context);
 	reset_tdp_shadow_zero_bits_mask(vcpu, context);
 }
@@ -5032,7 +5032,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 	context->mmu_role.as_u64 = new_role.as_u64;
 
 	update_permission_bitmask(vcpu, context, true);
-	update_pkru_bitmask(vcpu, context, true);
+	update_pkr_bitmask(vcpu, context, true);
 	update_last_nonleaf_level(vcpu, context);
 	reset_rsvds_bits_mask_ept(vcpu, context, execonly);
 	reset_ept_shadow_zero_bits_mask(vcpu, context, execonly);
@@ -5103,7 +5103,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 	}
 
 	update_permission_bitmask(vcpu, g_context, false);
-	update_pkru_bitmask(vcpu, g_context, false);
+	update_pkr_bitmask(vcpu, g_context, false);
 	update_last_nonleaf_level(vcpu, g_context);
 }
 
-- 
2.17.1



* [RFC 4/7] KVM: MMU: Refactor pkr_mask to cache condition
  2020-08-07  8:48 [RFC 0/7] KVM: PKS Virtualization support Chenyi Qiang
                   ` (2 preceding siblings ...)
  2020-08-07  8:48 ` [RFC 3/7] KVM: MMU: Rename the pkru to pkr Chenyi Qiang
@ 2020-08-07  8:48 ` Chenyi Qiang
  2021-01-26 18:16   ` Paolo Bonzini
  2020-08-07  8:48 ` [RFC 5/7] KVM: MMU: Add support for PKS emulation Chenyi Qiang
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-07  8:48 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

The pkr_mask bitmap currently indicates whether protection key checks
are needed for user pages. It is indexed by page fault error code bits
[4:1], with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
Refactor it to use PFEC.RSVD directly, so that PKS and PKU can then
share the same bitmap.
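
As a worked example of the refactored lookup (a sketch; PFERR_* masks
as defined in KVM, the concrete fault type chosen only for
illustration):

	/* A user write fault: PFEC.W=1, PFEC.U=1, PFEC.RSVD=0. */
	u32 pfec = PFERR_WRITE_MASK | PFERR_USER_MASK;
	u32 offset = pfec & ~1;		/* drop the present bit */
	u32 pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;

	/* Keep only the 2-bit {WD,AD} gate cached for this PFEC. */
	pkr_bits &= mmu->pkr_mask >> offset;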

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/kvm/mmu.h     | 10 ++++++----
 arch/x86/kvm/mmu/mmu.c | 16 ++++++++++------
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 0c2fdf0abf22..7fb4c63d5704 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -202,11 +202,13 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		* index of the protection domain, so pte_pkey * 2 is
 		* is the index of the first bit for the domain.
 		*/
-		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+		if (pte_access & PT_USER_MASK)
+			pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+		else
+			pkr_bits = 0;
 
-		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
-		offset = (pfec & ~1) +
-			((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
+		/* clear present bit */
+		offset = (pfec & ~1);
 
 		pkr_bits &= mmu->pkr_mask >> offset;
 		errcode |= -pkr_bits & PFERR_PK_MASK;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 481442f5e27a..333b4da739f8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4737,21 +4737,25 @@ static void update_pkr_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 
 	for (bit = 0; bit < ARRAY_SIZE(mmu->permissions); ++bit) {
 		unsigned pfec, pkey_bits;
-		bool check_pkey, check_write, ff, uf, wf, pte_user;
+		bool check_pkey, check_write, ff, uf, wf, rsvdf;
 
 		pfec = bit << 1;
 		ff = pfec & PFERR_FETCH_MASK;
 		uf = pfec & PFERR_USER_MASK;
 		wf = pfec & PFERR_WRITE_MASK;
 
-		/* PFEC.RSVD is replaced by ACC_USER_MASK. */
-		pte_user = pfec & PFERR_RSVD_MASK;
+		/*
+		 * PFERR_RSVD_MASK bit is not set if the
+		 * access is subject to PK restrictions.
+		 */
+		rsvdf = pfec & PFERR_RSVD_MASK;
 
 		/*
-		 * Only need to check the access which is not an
-		 * instruction fetch and is to a user page.
+		 * Need to check accesses that are not instruction
+		 * fetches and not RSVD faults.
 		 */
-		check_pkey = (!ff && pte_user);
+		check_pkey = (!ff && !rsvdf);
+
 		/*
 		 * write access is controlled by PKRU if it is a
 		 * user access or CR0.WP = 1.
-- 
2.17.1



* [RFC 5/7] KVM: MMU: Add support for PKS emulation
  2020-08-07  8:48 [RFC 0/7] KVM: PKS Virtualization support Chenyi Qiang
                   ` (3 preceding siblings ...)
  2020-08-07  8:48 ` [RFC 4/7] KVM: MMU: Refactor pkr_mask to cache condition Chenyi Qiang
@ 2020-08-07  8:48 ` Chenyi Qiang
  2021-01-26 18:23   ` Paolo Bonzini
  2020-08-07  8:48 ` [RFC 6/7] KVM: X86: Expose PKS to guest and userspace Chenyi Qiang
  2020-08-07  8:48 ` [RFC 7/7] KVM: VMX: Enable PKS for nested VM Chenyi Qiang
  6 siblings, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-07  8:48 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

Extend pkr_mask to also cache the conditions under which protection
key checks for supervisor pages are needed. Pages whose translation
has the U/S flag clear in at least one controlling paging-structure
entry are supervisor pages; for accesses to them, PKRS enforces the
access-rights check.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  8 +++---
 arch/x86/kvm/mmu.h              | 12 ++++++---
 arch/x86/kvm/mmu/mmu.c          | 44 +++++++++++++++++----------------
 3 files changed, 35 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6b739d0d1c97..736e56e023d5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -422,10 +422,10 @@ struct kvm_mmu {
 	u8 permissions[16];
 
 	/*
-	* The pkru_mask indicates if protection key checks are needed.  It
-	* consists of 16 domains indexed by page fault error code bits [4:1],
-	* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
-	* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
+	* The pkr_mask indicates if protection key checks are needed.
+	* It consists of 16 domains indexed by page fault error code
+	* bits[4:1]. Each domain has 2 bits which are ANDed with AD
+	* and WD from PKRU/PKRS.
 	*/
 	u32 pkr_mask;
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 7fb4c63d5704..b840b2d9ee9f 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -195,15 +195,19 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
 	if (unlikely(mmu->pkr_mask)) {
 		u32 pkr_bits, offset;
+		u64 pkrs;
 
 		/*
-		* PKRU defines 32 bits, there are 16 domains and 2
-		* attribute bits per domain in pkru.  pte_pkey is the
-		* index of the protection domain, so pte_pkey * 2 is
-		* is the index of the first bit for the domain.
+		* PKRU and PKRS both define 32 bits. There are 16 domains
+		* and 2 attribute bits per domain in them. pte_pkey is the
+		* index of the protection domain, so pte_pkey * 2 is the
+		* index of the first bit for the domain. Whether PKRU or
+		* PKRS applies depends on the accessed page's U/S attribute.
 		*/
 		if (pte_access & PT_USER_MASK)
 			pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+		else if (!kvm_get_msr(vcpu, MSR_IA32_PKRS, &pkrs))
+			pkr_bits = (pkrs >> (pte_pkey * 2)) & 3;
 		else
 			pkr_bits = 0;
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 333b4da739f8..845aea86b138 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4693,28 +4693,29 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
 }
 
 /*
-* PKU is an additional mechanism by which the paging controls access to
-* user-mode addresses based on the value in the PKRU register.  Protection
-* key violations are reported through a bit in the page fault error code.
+* Protection Keys (PKEY) is an additional mechanism by which
+* the paging controls access to user-mode/supervisor-mode address
+* based on the values in PKEY registers (PKRU/PKRS). Protection key
+* violations are reported through a bit in the page fault error code.
 * Unlike other bits of the error code, the PK bit is not known at the
 * call site of e.g. gva_to_gpa; it must be computed directly in
-* permission_fault based on two bits of PKRU, on some machine state (CR4,
-* CR0, EFER, CPL), and on other bits of the error code and the page tables.
+* permission_fault based on two bits of PKRU/PKRS, on some machine
+* state (CR4, CR0, EFER, CPL), and on other bits of the error code
+* and the page tables.
 *
 * In particular the following conditions come from the error code, the
 * page tables and the machine state:
-* - PK is always zero unless CR4.PKE=1 and EFER.LMA=1
+* - PK is always zero unless CR4.PKE=1/CR4.PKS=1 and EFER.LMA=1
 * - PK is always zero if RSVD=1 (reserved bit set) or F=1 (instruction fetch)
-* - PK is always zero if U=0 in the page tables
-* - PKRU.WD is ignored if CR0.WP=0 and the access is a supervisor access.
+* - (PKRU/PKRS).WD is ignored if CR0.WP=0 and the access is a supervisor access.
 *
-* The PKRU bitmask caches the result of these four conditions.  The error
-* code (minus the P bit) and the page table's U bit form an index into the
-* PKRU bitmask.  Two bits of the PKRU bitmask are then extracted and ANDed
-* with the two bits of the PKRU register corresponding to the protection key.
-* For the first three conditions above the bits will be 00, thus masking
-* away both AD and WD.  For all reads or if the last condition holds, WD
-* only will be masked away.
+* The pkr_mask caches the result of these three conditions. The error
+* code (minus the P bit) forms an index into the pkr_mask. PKU and PKS
+* share the same bitmask. Two bits of the pkr_mask are then extracted
+* and ANDed with the two bits of the PKEY register corresponding to
+* the protection key. For the first two conditions above the bits will be 00,
+* thus masking away both AD and WD. For all reads or if the last condition
+* holds, WD only will be masked away.
 */
 static void update_pkr_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 				bool ept)
@@ -4727,8 +4728,9 @@ static void update_pkr_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		return;
 	}
 
-	/* PKEY is enabled only if CR4.PKE and EFER.LMA are both set. */
-	if (!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) || !is_long_mode(vcpu)) {
+	/* PKEY is enabled only if EFER.LMA and CR4.PKE or CR4.PKS are set. */
+	if ((!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) &&
+	    !kvm_read_cr4_bits(vcpu, X86_CR4_PKS)) || !is_long_mode(vcpu)) {
 		mmu->pkr_mask = 0;
 		return;
 	}
@@ -4757,14 +4759,14 @@ static void update_pkr_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		check_pkey = (!ff && !rsvdf);
 
 		/*
-		 * write access is controlled by PKRU if it is a
-		 * user access or CR0.WP = 1.
+		 * write access is controlled by PKRU/PKRS if
+		 * it is a user access or CR0.WP = 1.
 		 */
 		check_write = check_pkey && wf && (uf || wp);
 
-		/* PKRU.AD stops both read and write access. */
+		/* PKRU/PKRS.AD stops both read and write access. */
 		pkey_bits = !!check_pkey;
-		/* PKRU.WD stops write access. */
+		/* PKRU/PKRS.WD stops write access. */
 		pkey_bits |= (!!check_write) << 1;
 
 		mmu->pkr_mask |= (pkey_bits & 3) << pfec;
-- 
2.17.1



* [RFC 6/7] KVM: X86: Expose PKS to guest and userspace
  2020-08-07  8:48 [RFC 0/7] KVM: PKS Virtualization support Chenyi Qiang
                   ` (4 preceding siblings ...)
  2020-08-07  8:48 ` [RFC 5/7] KVM: MMU: Add support for PKS emulation Chenyi Qiang
@ 2020-08-07  8:48 ` Chenyi Qiang
  2020-08-13 19:04   ` Jim Mattson
  2020-08-07  8:48 ` [RFC 7/7] KVM: VMX: Enable PKS for nested VM Chenyi Qiang
  6 siblings, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-07  8:48 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

The existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31].
It is enabled by setting CR4.PKS when long mode is active. PKS is
currently only implemented when EPT is enabled, and it requires
support for VM_{ENTRY,EXIT}_LOAD_IA32_PKRS.
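
For reference, a userspace probe of the enumeration bit could look
like this (a sketch; the bit position follows the description above,
and the helper name is hypothetical):

	#include <cpuid.h>
	#include <stdbool.h>

	static bool cpu_has_pks(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
			return false;
		return ecx & (1u << 31);  /* CPUID.(EAX=7,ECX=0):ECX[31] */
	}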

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/cpuid.c            |  3 ++-
 arch/x86/kvm/vmx/vmx.c          | 15 ++++++++++++---
 arch/x86/kvm/x86.c              |  7 +++++--
 4 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 736e56e023d5..36c7356693c8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -99,7 +99,8 @@
 			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
 			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
-			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+			  | X86_CR4_PKS))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 8a294f9747aa..897749250afd 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -325,7 +325,8 @@ void kvm_set_cpu_caps(void)
 		F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/ | F(RDPID) |
 		F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
 		F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
-		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/
+		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
+		0 /*PKS*/
 	);
 	/* Set LA57 based on hardware capability. */
 	if (cpuid_ecx(7) & F(LA57))
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d91d59fb46fa..5cdf4d3848fb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3216,7 +3216,7 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		}
 
 		/*
-		 * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
+		 * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
 		 * hardware.  To emulate this behavior, SMEP/SMAP/PKU needs
 		 * to be manually disabled when guest switches to non-paging
 		 * mode.
@@ -3224,10 +3224,11 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		 * If !enable_unrestricted_guest, the CPU is always running
 		 * with CR0.PG=1 and CR4 needs to be modified.
 		 * If enable_unrestricted_guest, the CPU automatically
-		 * disables SMEP/SMAP/PKU when the guest sets CR0.PG=0.
+		 * disables SMEP/SMAP/PKU/PKS when the guest sets CR0.PG=0.
 		 */
 		if (!is_paging(vcpu))
-			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
+			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
+				    X86_CR4_PKS);
 	}
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
@@ -7348,6 +7349,14 @@ static __init void vmx_set_cpu_caps(void)
 	if (vmx_pt_mode_is_host_guest())
 		kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);
 
+	/*
+	 * PKS is not yet implemented for shadow paging.
+	 * If VM_{ENTRY, EXIT}_LOAD_IA32_PKRS is not supported,
+	 * don't expose PKS either.
+	 */
+	if (enable_ept && cpu_has_load_ia32_pkrs())
+		kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);
+
 	if (vmx_umip_emulated())
 		kvm_cpu_cap_set(X86_FEATURE_UMIP);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88c593f83b28..8ad9622ee2b2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -949,6 +949,8 @@ EXPORT_SYMBOL_GPL(kvm_set_xcr);
 		__reserved_bits |= X86_CR4_LA57;	\
 	if (!__cpu_has(__c, X86_FEATURE_UMIP))		\
 		__reserved_bits |= X86_CR4_UMIP;	\
+	if (!__cpu_has(__c, X86_FEATURE_PKS))		\
+		__reserved_bits |= X86_CR4_PKS;		\
 	__reserved_bits;				\
 })
 
@@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long old_cr4 = kvm_read_cr4(vcpu);
 	unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
-				   X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
+				   X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
+				   X86_CR4_PKS;
 
 	if (kvm_valid_cr4(vcpu, cr4))
 		return 1;
@@ -1202,7 +1205,7 @@ static const u32 msrs_to_save_all[] = {
 	MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
 	MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
 	MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
-	MSR_IA32_UMWAIT_CONTROL,
+	MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,
 
 	MSR_ARCH_PERFMON_FIXED_CTR0, MSR_ARCH_PERFMON_FIXED_CTR1,
 	MSR_ARCH_PERFMON_FIXED_CTR0 + 2, MSR_ARCH_PERFMON_FIXED_CTR0 + 3,
-- 
2.17.1



* [RFC 7/7] KVM: VMX: Enable PKS for nested VM
  2020-08-07  8:48 [RFC 0/7] KVM: PKS Virtualization support Chenyi Qiang
                   ` (5 preceding siblings ...)
  2020-08-07  8:48 ` [RFC 6/7] KVM: X86: Expose PKS to guest and userspace Chenyi Qiang
@ 2020-08-07  8:48 ` Chenyi Qiang
  2020-08-11  0:05   ` Jim Mattson
  6 siblings, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-07  8:48 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

The PKS MSR is passed through to the guest directly. Configure the MSR
to match the L0/L1 settings so that a nested VM runs PKS properly.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/kvm/vmx/nested.c | 32 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmcs12.c |  2 ++
 arch/x86/kvm/vmx/vmcs12.h |  6 +++++-
 arch/x86/kvm/vmx/vmx.c    | 10 ++++++++++
 arch/x86/kvm/vmx/vmx.h    |  1 +
 5 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index df2c2e733549..1f9823d21ecd 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -647,6 +647,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 					MSR_IA32_PRED_CMD,
 					MSR_TYPE_W);
 
+	if (!msr_write_intercepted_l01(vcpu, MSR_IA32_PKRS))
+		nested_vmx_disable_intercept_for_msr(
+					msr_bitmap_l1, msr_bitmap_l0,
+					MSR_IA32_PKRS,
+					MSR_TYPE_R | MSR_TYPE_W);
+
 	kvm_vcpu_unmap(vcpu, &to_vmx(vcpu)->nested.msr_bitmap_map, false);
 
 	return true;
@@ -2427,6 +2433,10 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
 		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
 			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
+
+		if (vmx->nested.nested_run_pending &&
+		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS))
+			vmcs_write64(GUEST_IA32_PKRS, vmcs12->guest_ia32_pkrs);
 	}
 
 	if (nested_cpu_has_xsaves(vmcs12))
@@ -2509,6 +2519,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
 	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
 		vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
+
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	    (!vmx->nested.nested_run_pending ||
+	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
+		vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);
 	vmx_set_rflags(vcpu, vmcs12->guest_rflags);
 
 	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
@@ -2849,6 +2864,10 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
 					   vmcs12->host_ia32_perf_global_ctrl)))
 		return -EINVAL;
 
+	if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS) &&
+	    CC(!kvm_pkrs_valid(vmcs12->host_ia32_pkrs)))
+		return -EINVAL;
+
 #ifdef CONFIG_X86_64
 	ia32e = !!(vcpu->arch.efer & EFER_LMA);
 #else
@@ -2998,6 +3017,10 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
 	if (nested_check_guest_non_reg_state(vmcs12))
 		return -EINVAL;
 
+	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS) &&
+	    CC(!kvm_pkrs_valid(vmcs12->guest_ia32_pkrs)))
+		return -EINVAL;
+
 	return 0;
 }
 
@@ -3271,6 +3294,9 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	if (kvm_mpx_supported() &&
 		!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
 		vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS))
+		vmx->nested.vmcs01_guest_pkrs = vmcs_read64(GUEST_IA32_PKRS);
 
 	/*
 	 * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
@@ -3865,6 +3891,7 @@ static bool is_vmcs12_ext_field(unsigned long field)
 	case GUEST_IDTR_BASE:
 	case GUEST_PENDING_DBG_EXCEPTIONS:
 	case GUEST_BNDCFGS:
+	case GUEST_IA32_PKRS:
 		return true;
 	default:
 		break;
@@ -3916,6 +3943,8 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 	if (kvm_mpx_supported())
 		vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS))
+		vmcs12->guest_ia32_pkrs = vmcs_read64(GUEST_IA32_PKRS);
 
 	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
 }
@@ -4151,6 +4180,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 		WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
 					 vmcs12->host_ia32_perf_global_ctrl));
 
+	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS)
+		vmcs_write64(GUEST_IA32_PKRS, vmcs12->host_ia32_pkrs);
+
 	/* Set L1 segment info according to Intel SDM
 	    27.5.2 Loading Host Segment and Descriptor-Table Registers */
 	seg = (struct kvm_segment) {
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index c8e51c004f78..df7b2143b807 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -61,9 +61,11 @@ const unsigned short vmcs_field_to_offset_table[] = {
 	FIELD64(GUEST_PDPTR2, guest_pdptr2),
 	FIELD64(GUEST_PDPTR3, guest_pdptr3),
 	FIELD64(GUEST_BNDCFGS, guest_bndcfgs),
+	FIELD64(GUEST_IA32_PKRS, guest_ia32_pkrs),
 	FIELD64(HOST_IA32_PAT, host_ia32_pat),
 	FIELD64(HOST_IA32_EFER, host_ia32_efer),
 	FIELD64(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl),
+	FIELD64(HOST_IA32_PKRS, host_ia32_pkrs),
 	FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
 	FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
 	FIELD(EXCEPTION_BITMAP, exception_bitmap),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 80232daf00ff..009b4c317375 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -69,7 +69,9 @@ struct __packed vmcs12 {
 	u64 vm_function_control;
 	u64 eptp_list_address;
 	u64 pml_address;
-	u64 padding64[3]; /* room for future expansion */
+	u64 guest_ia32_pkrs;
+	u64 host_ia32_pkrs;
+	u64 padding64[1]; /* room for future expansion */
 	/*
 	 * To allow migration of L1 (complete with its L2 guests) between
 	 * machines of different natural widths (32 or 64 bit), we cannot have
@@ -256,6 +258,8 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(vm_function_control, 296);
 	CHECK_OFFSET(eptp_list_address, 304);
 	CHECK_OFFSET(pml_address, 312);
+	CHECK_OFFSET(guest_ia32_pkrs, 320);
+	CHECK_OFFSET(host_ia32_pkrs, 328);
 	CHECK_OFFSET(cr0_guest_host_mask, 344);
 	CHECK_OFFSET(cr4_guest_host_mask, 352);
 	CHECK_OFFSET(cr0_read_shadow, 360);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5cdf4d3848fb..3c3fb554c3fc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7178,6 +7178,7 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
 	cr4_fixed1_update(X86_CR4_PKE,        ecx, feature_bit(PKU));
 	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
 	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
+	cr4_fixed1_update(X86_CR4_PKS,        ecx, feature_bit(PKS));
 
 #undef cr4_fixed1_update
 }
@@ -7197,6 +7198,15 @@ static void nested_vmx_entry_exit_ctls_update(struct kvm_vcpu *vcpu)
 			vmx->nested.msrs.exit_ctls_high &= ~VM_EXIT_CLEAR_BNDCFGS;
 		}
 	}
+
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	    guest_cpuid_has(vcpu, X86_FEATURE_PKS)) {
+		vmx->nested.msrs.entry_ctls_high |= VM_ENTRY_LOAD_IA32_PKRS;
+		vmx->nested.msrs.exit_ctls_high |= VM_EXIT_LOAD_IA32_PKRS;
+	} else {
+		vmx->nested.msrs.entry_ctls_high &= ~VM_ENTRY_LOAD_IA32_PKRS;
+		vmx->nested.msrs.exit_ctls_high &= ~VM_EXIT_LOAD_IA32_PKRS;
+	}
 }
 
 static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 639798e4a6ca..2819d3e030b9 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -176,6 +176,7 @@ struct nested_vmx {
 	/* to migrate it to L2 if VM_ENTRY_LOAD_DEBUG_CONTROLS is off */
 	u64 vmcs01_debugctl;
 	u64 vmcs01_guest_bndcfgs;
+	u64 vmcs01_guest_pkrs;
 
 	/* to migrate it to L1 if L2 writes to L1's CR8 directly */
 	int l1_tpr_threshold;
-- 
2.17.1



* Re: [RFC 1/7] KVM: VMX: Introduce PKS VMCS fields
  2020-08-07  8:48 ` [RFC 1/7] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
@ 2020-08-10 23:17   ` Jim Mattson
  0 siblings, 0 replies; 38+ messages in thread
From: Jim Mattson @ 2020-08-10 23:17 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Fri, Aug 7, 2020 at 1:46 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>
> PKS (Protection Keys for Supervisor Pages) is a feature that extends the
> Protection Keys architecture to support thread-specific permission
> restrictions on supervisor pages.
>
> A new MSR (PKRS) is defined in the kernel to support PKS; it holds a
> set of permissions associated with each protection domain.
>
> Two VMCS fields {HOST,GUEST}_IA32_PKRS are introduced in
> {host,guest}-state area to store the value of PKRS.
>
> Every VM exit saves PKRS into guest-state area.
> If VM_EXIT_LOAD_IA32_PKRS = 1, VM exit loads PKRS from the host-state
> area.
> If VM_ENTRY_LOAD_IA32_PKRS = 1, VM entry loads PKRS from the guest-state
> area.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>


* Re: [RFC 7/7] KVM: VMX: Enable PKS for nested VM
  2020-08-07  8:48 ` [RFC 7/7] KVM: VMX: Enable PKS for nested VM Chenyi Qiang
@ 2020-08-11  0:05   ` Jim Mattson
  2020-08-12 15:00     ` Sean Christopherson
  2020-08-13  4:52     ` Chenyi Qiang
  0 siblings, 2 replies; 38+ messages in thread
From: Jim Mattson @ 2020-08-11  0:05 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>
> The PKS MSR is passed through to the guest directly. Configure the MSR
> to match the L0/L1 settings so that a nested VM runs PKS properly.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 32 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/vmcs12.c |  2 ++
>  arch/x86/kvm/vmx/vmcs12.h |  6 +++++-
>  arch/x86/kvm/vmx/vmx.c    | 10 ++++++++++
>  arch/x86/kvm/vmx/vmx.h    |  1 +
>  5 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index df2c2e733549..1f9823d21ecd 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -647,6 +647,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>                                         MSR_IA32_PRED_CMD,
>                                         MSR_TYPE_W);
>
> +       if (!msr_write_intercepted_l01(vcpu, MSR_IA32_PKRS))
> +               nested_vmx_disable_intercept_for_msr(
> +                                       msr_bitmap_l1, msr_bitmap_l0,
> +                                       MSR_IA32_PKRS,
> +                                       MSR_TYPE_R | MSR_TYPE_W);

What if L1 intercepts only *reads* of MSR_IA32_PKRS?

>         kvm_vcpu_unmap(vcpu, &to_vmx(vcpu)->nested.msr_bitmap_map, false);
>
>         return true;

> @@ -2509,6 +2519,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>         if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
>             !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
>                 vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
> +
> +       if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&

Is the above check superfluous? I would assume that the L1 guest can't
set VM_ENTRY_LOAD_IA32_PKRS unless this is true.

> +           (!vmx->nested.nested_run_pending ||
> +            !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
> +               vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);

This doesn't seem right to me. On the target of a live migration, with
L2 active at the time the snapshot was taken (i.e.,
vmx->nested.nested_run_pending=0), it looks like we're going to try to
overwrite the current L2 PKRS value with L1's PKRS value (except that
in this situation, vmx->nested.vmcs01_guest_pkrs should actually be
0). Am I missing something?

>         vmx_set_rflags(vcpu, vmcs12->guest_rflags);
>
>         /* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the


> @@ -3916,6 +3943,8 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>                 vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>         if (kvm_mpx_supported())
>                 vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
> +       if (kvm_cpu_cap_has(X86_FEATURE_PKS))

Shouldn't we be checking to see if the *virtual* CPU supports PKS
before writing anything into vmcs12->guest_ia32_pkrs?

> +               vmcs12->guest_ia32_pkrs = vmcs_read64(GUEST_IA32_PKRS);
>
>         vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
>  }


* Re: [RFC 7/7] KVM: VMX: Enable PKS for nested VM
  2020-08-11  0:05   ` Jim Mattson
@ 2020-08-12 15:00     ` Sean Christopherson
  2020-08-12 18:32       ` Jim Mattson
  2020-08-13  4:52     ` Chenyi Qiang
  1 sibling, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2020-08-12 15:00 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Chenyi Qiang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Mon, Aug 10, 2020 at 05:05:36PM -0700, Jim Mattson wrote:
> On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> >
> > The PKS MSR is passed through to the guest directly. Configure the MSR
> > to match the L0/L1 settings so that a nested VM runs PKS properly.
> >
> > Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> > ---
> >  arch/x86/kvm/vmx/nested.c | 32 ++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/vmx/vmcs12.c |  2 ++
> >  arch/x86/kvm/vmx/vmcs12.h |  6 +++++-
> >  arch/x86/kvm/vmx/vmx.c    | 10 ++++++++++
> >  arch/x86/kvm/vmx/vmx.h    |  1 +
> >  5 files changed, 50 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index df2c2e733549..1f9823d21ecd 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -647,6 +647,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> >                                         MSR_IA32_PRED_CMD,
> >                                         MSR_TYPE_W);
> >
> > +       if (!msr_write_intercepted_l01(vcpu, MSR_IA32_PKRS))
> > +               nested_vmx_disable_intercept_for_msr(
> > +                                       msr_bitmap_l1, msr_bitmap_l0,
> > +                                       MSR_IA32_PKRS,
> > +                                       MSR_TYPE_R | MSR_TYPE_W);
> 
> What if L1 intercepts only *reads* of MSR_IA32_PKRS?

nested_vmx_disable_intercept_for_msr() handles merging L1's desires, the
(MSR_TYPE_R | MSR_TYPE_W) param is effectively L0's desire for L2.


* Re: [RFC 7/7] KVM: VMX: Enable PKS for nested VM
  2020-08-12 15:00     ` Sean Christopherson
@ 2020-08-12 18:32       ` Jim Mattson
  0 siblings, 0 replies; 38+ messages in thread
From: Jim Mattson @ 2020-08-12 18:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chenyi Qiang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Wed, Aug 12, 2020 at 8:00 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Mon, Aug 10, 2020 at 05:05:36PM -0700, Jim Mattson wrote:
> > On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> > >
> > > The PKS MSR is passed through to the guest directly. Configure the MSR
> > > to match the L0/L1 settings so that a nested VM runs PKS properly.
> > >
> > > Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> > > ---
> > >  arch/x86/kvm/vmx/nested.c | 32 ++++++++++++++++++++++++++++++++
> > >  arch/x86/kvm/vmx/vmcs12.c |  2 ++
> > >  arch/x86/kvm/vmx/vmcs12.h |  6 +++++-
> > >  arch/x86/kvm/vmx/vmx.c    | 10 ++++++++++
> > >  arch/x86/kvm/vmx/vmx.h    |  1 +
> > >  5 files changed, 50 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > index df2c2e733549..1f9823d21ecd 100644
> > > --- a/arch/x86/kvm/vmx/nested.c
> > > +++ b/arch/x86/kvm/vmx/nested.c
> > > @@ -647,6 +647,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > >                                         MSR_IA32_PRED_CMD,
> > >                                         MSR_TYPE_W);
> > >
> > > +       if (!msr_write_intercepted_l01(vcpu, MSR_IA32_PKRS))
> > > +               nested_vmx_disable_intercept_for_msr(
> > > +                                       msr_bitmap_l1, msr_bitmap_l0,
> > > +                                       MSR_IA32_PKRS,
> > > +                                       MSR_TYPE_R | MSR_TYPE_W);
> >
> > What if L1 intercepts only *reads* of MSR_IA32_PKRS?
>
> nested_vmx_disable_intercept_for_msr() handles merging L1's desires, the
> (MSR_TYPE_R | MSR_TYPE_W) param is effectively L0's desire for L2.

I should know better than to assume that a function in kvm actually
does anything like what its name implies, but I never seem to learn.
:-(

Thanks!


* Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR
  2020-08-07  8:48 ` [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR Chenyi Qiang
@ 2020-08-12 21:21   ` Jim Mattson
  2020-08-13  5:42     ` Chenyi Qiang
  2021-01-26 18:01   ` Paolo Bonzini
  1 sibling, 1 reply; 38+ messages in thread
From: Jim Mattson @ 2020-08-12 21:21 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Fri, Aug 7, 2020 at 1:46 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>
> Protection Keys for Supervisor Pages (PKS) uses the IA32_PKRS MSR (PKRS)
> at index 0x6E1 to allow software to manage supervisor protection key
> rights. For performance reasons, the PKRS intercept is disabled so that
> the guest can access the MSR without VM exits.
> PKS introduces dedicated VMCS control fields to switch PKRS, but they
> only handle the restore side: every VM exit saves PKRS into the
> guest-state area of the VMCS, while VM entry does not save the host
> value, on the expectation that the host rarely changes the MSR. Update
> the host's value in the VMCS manually if the kernel has changed the MSR
> since the last time the VMCS was run.
> The function get_current_pkrs() in arch/x86/mm/pkeys.c exports the
> per-cpu variable pkrs_cache to avoid frequent rdmsr calls on PKRS.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---

> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 11e4df560018..df2c2e733549 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -289,6 +289,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
>         dest->ds_sel = src->ds_sel;
>         dest->es_sel = src->es_sel;
>  #endif
> +       dest->pkrs = src->pkrs;

Why isn't this (and other PKRS code) inside the #ifdef CONFIG_X86_64?
PKRS isn't usable outside of long mode, is it?

>  }
>
>  static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
> index 7a3675fddec2..39ec3d0c844b 100644
> --- a/arch/x86/kvm/vmx/vmcs.h
> +++ b/arch/x86/kvm/vmx/vmcs.h
> @@ -40,6 +40,7 @@ struct vmcs_host_state {
>  #ifdef CONFIG_X86_64
>         u16           ds_sel, es_sel;
>  #endif
> +       u32           pkrs;

One thing that does seem odd throughout is that PKRS is a 64-bit
resource, but we store and pass around only 32 bits. Yes, the top 32
bits are currently reserved, but the same can be said of, say, cr4, a
few lines above this. Yet, we store and pass around cr4 as 64-bits.
How do we decide?

>  };
>
>  struct vmcs_controls_shadow {

> @@ -1163,6 +1164,20 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>          */
>         host_state->ldt_sel = kvm_read_ldt();
>
> +       /*
> +        * Update the host pkrs vmcs field before vcpu runs.
> +        * The setting of VM_EXIT_LOAD_IA32_PKRS can ensure
> +        * kvm_cpu_cap_has(X86_FEATURE_PKS) &&
> +        * guest_cpuid_has(vcpu, X86_FEATURE_VMX).
> +        */
> +       if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS) {
> +               host_pkrs = get_current_pkrs();
> +               if (unlikely(host_pkrs != host_state->pkrs)) {
> +                       vmcs_write64(HOST_IA32_PKRS, host_pkrs);
> +                       host_state->pkrs = host_pkrs;
> +               }
> +       }
> +
>  #ifdef CONFIG_X86_64
>         savesegment(ds, host_state->ds_sel);
>         savesegment(es, host_state->es_sel);
> @@ -1951,6 +1966,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>                 else
>                         msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
>                 break;
> +       case MSR_IA32_PKRS:
> +               if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
> +                   (!msr_info->host_initiated &&
> +                   !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))

Could this be simplified if we required
kvm_cpu_cap_has(X86_FEATURE_PKS) as a prerequisite for
guest_cpuid_has(vcpu, X86_FEATURE_PKS)? If not, what is the expected
behavior if guest_cpuid_has(vcpu, X86_FEATURE_PKS) is true and
kvm_cpu_cap_has(X86_FEATURE_PKS)  is false?

> +                       return 1;
> +               msr_info->data = vmcs_read64(GUEST_IA32_PKRS);
> +               break;
>         case MSR_TSC_AUX:
>                 if (!msr_info->host_initiated &&
>                     !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))

> @@ -7230,6 +7267,26 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
>                 vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
>  }
>
> +static void vmx_update_pkrs_cfg(struct kvm_vcpu *vcpu)
> +{
> +       struct vcpu_vmx *vmx = to_vmx(vcpu);
> +       unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
> +       bool pks_supported = guest_cpuid_has(vcpu, X86_FEATURE_PKS);
> +
> +       /*
> +        * set intercept for PKRS when the guest doesn't support pks
> +        */
> +       vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PKRS, MSR_TYPE_RW, !pks_supported);
> +
> +       if (pks_supported) {
> +               vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
> +               vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
> +       } else {
> +               vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
> +               vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
> +       }
> +}

Not just your change, but it is unclear to me what the expected
behavior is when CPUID bits are modified while the guest is running.
For example, it seems that this code results in immediate behavioral
changes for an L1 guest; however, if an L2 guest is active, then there
are no behavioral changes until the next emulated VM-entry from L1 to
L2. Is that right?

On a more general note, do you have any kvm-unit-tests that exercise
the new code?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 7/7] KVM: VMX: Enable PKS for nested VM
  2020-08-11  0:05   ` Jim Mattson
  2020-08-12 15:00     ` Sean Christopherson
@ 2020-08-13  4:52     ` Chenyi Qiang
  2020-08-13 17:52       ` Jim Mattson
  1 sibling, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-13  4:52 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML



On 8/11/2020 8:05 AM, Jim Mattson wrote:
> On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>>
>> PKS MSR passes through guest directly. Configure the MSR to match the
>> L0/L1 settings so that nested VM runs PKS properly.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>>   arch/x86/kvm/vmx/nested.c | 32 ++++++++++++++++++++++++++++++++
>>   arch/x86/kvm/vmx/vmcs12.c |  2 ++
>>   arch/x86/kvm/vmx/vmcs12.h |  6 +++++-
>>   arch/x86/kvm/vmx/vmx.c    | 10 ++++++++++
>>   arch/x86/kvm/vmx/vmx.h    |  1 +
>>   5 files changed, 50 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index df2c2e733549..1f9823d21ecd 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -647,6 +647,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>>                                          MSR_IA32_PRED_CMD,
>>                                          MSR_TYPE_W);
>>
>> +       if (!msr_write_intercepted_l01(vcpu, MSR_IA32_PKRS))
>> +               nested_vmx_disable_intercept_for_msr(
>> +                                       msr_bitmap_l1, msr_bitmap_l0,
>> +                                       MSR_IA32_PKRS,
>> +                                       MSR_TYPE_R | MSR_TYPE_W);
> 
> What if L1 intercepts only *reads* of MSR_IA32_PKRS?
> 
>>          kvm_vcpu_unmap(vcpu, &to_vmx(vcpu)->nested.msr_bitmap_map, false);
>>
>>          return true;
> 
>> @@ -2509,6 +2519,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>>          if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
>>              !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
>>                  vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
>> +
>> +       if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
> 
> Is the above check superfluous? I would assume that the L1 guest can't
> set VM_ENTRY_LOAD_IA32_PKRS unless this is true.
> 

I enforce this check to ensure that the vmcs_write to GUEST_IA32_PKRS 
executes without error. If it were deleted, the vmcs_write to 
GUEST_IA32_PKRS might be executed even when PKS is unsupported.

>> +           (!vmx->nested.nested_run_pending ||
>> +            !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
>> +               vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);
> 
> This doesn't seem right to me. On the target of a live migration, with
> L2 active at the time the snapshot was taken (i.e.,
> vmx->nested.nested_run_pending=0), it looks like we're going to try to
> overwrite the current L2 PKRS value with L1's PKRS value (except that
> in this situation, vmx->nested.vmcs01_guest_pkrs should actually be
> 0). Am I missing something?
> 

We overwrite the L2 PKRS with L1's value when L2 doesn't support PKS. 
Because L1's VM_ENTRY_LOAD_IA32_PKRS is off, we need to migrate L1's 
PKRS to L2.

>>          vmx_set_rflags(vcpu, vmcs12->guest_rflags);
>>
>>          /* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
> 
> 
>> @@ -3916,6 +3943,8 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>>                  vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>>          if (kvm_mpx_supported())
>>                  vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
>> +       if (kvm_cpu_cap_has(X86_FEATURE_PKS))
> 
> Shouldn't we be checking to see if the *virtual* CPU supports PKS
> before writing anything into vmcs12->guest_ia32_pkrs?
> 

Yes, It's reasonable.

>> +               vmcs12->guest_ia32_pkrs = vmcs_read64(GUEST_IA32_PKRS);
>>
>>          vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
>>   }

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR
  2020-08-12 21:21   ` Jim Mattson
@ 2020-08-13  5:42     ` Chenyi Qiang
  2020-08-13 17:31       ` Jim Mattson
  0 siblings, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-13  5:42 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML



On 8/13/2020 5:21 AM, Jim Mattson wrote:
> On Fri, Aug 7, 2020 at 1:46 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>>
>> Protection Keys for Supervisor Pages (PKS) uses IA32_PKRS MSR (PKRS) at
>> index 0x6E1 to allow software to manage supervisor protection key
>> rights. For performance considerations, the PKRS intercept will be disabled
>> so that the guest can access PKRS without VM exits.
>> PKS introduces dedicated control fields in VMCS to switch PKRS, which
>> only does the restore part. In addition, every VM exit saves PKRS into
>> the guest-state area in VMCS, while VM entry won't save the host value
>> due to the expectation that the host won't change the MSR often. Update
>> the host's value in VMCS manually if the MSR has been changed by the
>> kernel since the last time the VMCS was run.
>> The function get_current_pkrs() in arch/x86/mm/pkeys.c exports the
>> per-cpu variable pkrs_cache to avoid frequent rdmsr of PKRS.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
> 
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 11e4df560018..df2c2e733549 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -289,6 +289,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
>>          dest->ds_sel = src->ds_sel;
>>          dest->es_sel = src->es_sel;
>>   #endif
>> +       dest->pkrs = src->pkrs;
> 
> Why isn't this (and other PKRS code) inside the #ifdef CONFIG_X86_64?
> PKRS isn't usable outside of long mode, is it?
> 

Yes, I'm also thinking about whether to put all the PKS code inside 
CONFIG_X86_64. The kernel implementation also wraps its PKS code inside 
CONFIG_ARCH_HAS_SUPERVISOR_PKEYS, which depends on CONFIG_X86_64.
However, keeping it outside may help when the host kernel disables PKS 
but the guest enables it. What do you think about this?


>>   }
>>
>>   static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
>> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
>> index 7a3675fddec2..39ec3d0c844b 100644
>> --- a/arch/x86/kvm/vmx/vmcs.h
>> +++ b/arch/x86/kvm/vmx/vmcs.h
>> @@ -40,6 +40,7 @@ struct vmcs_host_state {
>>   #ifdef CONFIG_X86_64
>>          u16           ds_sel, es_sel;
>>   #endif
>> +       u32           pkrs;
> 
> One thing that does seem odd throughout is that PKRS is a 64-bit
> resource, but we store and pass around only 32 bits. Yes, the top 32
> bits are currently reserved, but the same can be said of, say, cr4, a
> few lines above this. Yet, we store and pass around cr4 as 64-bits.
> How do we decide?
> 

IMO, if the high part of PKRS is reserved as zero, it's OK to use u32. I 
defined it as u32 just to follow the definition of pkrs_cache in the 
kernel code.

>>   };
>>
>>   struct vmcs_controls_shadow {
> 
>> @@ -1163,6 +1164,20 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>>           */
>>          host_state->ldt_sel = kvm_read_ldt();
>>
>> +       /*
>> +        * Update the host pkrs vmcs field before vcpu runs.
>> +        * The setting of VM_EXIT_LOAD_IA32_PKRS can ensure
>> +        * kvm_cpu_cap_has(X86_FEATURE_PKS) &&
>> +        * guest_cpuid_has(vcpu, X86_FEATURE_VMX).
>> +        */
>> +       if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS) {
>> +               host_pkrs = get_current_pkrs();
>> +               if (unlikely(host_pkrs != host_state->pkrs)) {
>> +                       vmcs_write64(HOST_IA32_PKRS, host_pkrs);
>> +                       host_state->pkrs = host_pkrs;
>> +               }
>> +       }
>> +
>>   #ifdef CONFIG_X86_64
>>          savesegment(ds, host_state->ds_sel);
>>          savesegment(es, host_state->es_sel);
>> @@ -1951,6 +1966,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>                  else
>>                          msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
>>                  break;
>> +       case MSR_IA32_PKRS:
>> +               if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
>> +                   (!msr_info->host_initiated &&
>> +                   !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
> 
> Could this be simplified if we required
> kvm_cpu_cap_has(X86_FEATURE_PKS) as a prerequisite for
> guest_cpuid_has(vcpu, X86_FEATURE_PKS)? If not, what is the expected
> behavior if guest_cpuid_has(vcpu, X86_FEATURE_PKS) is true and
> kvm_cpu_cap_has(X86_FEATURE_PKS)  is false?
> 

Yes, kvm_cpu_cap_has() is a prerequisite for guest_cpuid_has(), as we 
already do that check in vmx_cpuid_update(). Here I add the 
kvm_cpu_cap_has() check to ensure that vmcs_read(GUEST_IA32_PKRS) can 
execute correctly in the case of a host-initiated (userspace) access.

>> +                       return 1;
>> +               msr_info->data = vmcs_read64(GUEST_IA32_PKRS);
>> +               break;
>>          case MSR_TSC_AUX:
>>                  if (!msr_info->host_initiated &&
>>                      !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))
> 
>> @@ -7230,6 +7267,26 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
>>                  vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
>>   }
>>
>> +static void vmx_update_pkrs_cfg(struct kvm_vcpu *vcpu)
>> +{
>> +       struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +       unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
>> +       bool pks_supported = guest_cpuid_has(vcpu, X86_FEATURE_PKS);
>> +
>> +       /*
>> +        * set intercept for PKRS when the guest doesn't support pks
>> +        */
>> +       vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PKRS, MSR_TYPE_RW, !pks_supported);
>> +
>> +       if (pks_supported) {
>> +               vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
>> +               vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
>> +       } else {
>> +               vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
>> +               vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
>> +       }
>> +}
> 
> Not just your change, but it is unclear to me what the expected
> behavior is when CPUID bits are modified while the guest is running.
> For example, it seems that this code results in immediate behavioral
> changes for an L1 guest; however, if an L2 guest is active, then there
> are no behavioral changes until the next emulated VM-entry from L1 to
> L2. Is that right?
> 

I don't know if there is a way to deal with CPUID modifications in 
KVM while the guest is running. Some CPUID modifications, like 
X86_FEATURE_OSPKE, happen when the guest sets CR4.PKE. But I'm not 
familiar with your case.

Paolo, what's your opinion?


> On a more general note, do you have any kvm-unit-tests that exercise
> the new code?
> 

Yes, I'll attach the kvm-unit-tests in the next version.
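
For reference, a minimal test might look something like the sketch below
(page, set_page_pkey() and expect_pk_fault_on_read() are hypothetical
helpers, not existing kvm-unit-tests API):

	static void test_pks_access_disable(void)
	{
		u64 old_pkrs = rdmsr(MSR_IA32_PKRS);

		write_cr4(read_cr4() | X86_CR4_PKS);
		set_page_pkey(page, 1);			/* tag a supervisor page with key 1 */
		wrmsr(MSR_IA32_PKRS, 1ull << 2);	/* AD=1 for key 1 */
		expect_pk_fault_on_read(page);		/* #PF with PFEC.PK expected */
		wrmsr(MSR_IA32_PKRS, old_pkrs);
	}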

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR
  2020-08-13  5:42     ` Chenyi Qiang
@ 2020-08-13 17:31       ` Jim Mattson
  2020-08-18  7:27         ` Chenyi Qiang
  0 siblings, 1 reply; 38+ messages in thread
From: Jim Mattson @ 2020-08-13 17:31 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Wed, Aug 12, 2020 at 10:42 PM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>
>
>
> On 8/13/2020 5:21 AM, Jim Mattson wrote:
> > On Fri, Aug 7, 2020 at 1:46 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> >>
> >> Protection Keys for Supervisor Pages (PKS) uses IA32_PKRS MSR (PKRS) at
> >> index 0x6E1 to allow software to manage supervisor protection key
> >> rights. For performance considerations, the PKRS intercept will be disabled
> >> so that the guest can access PKRS without VM exits.
> >> PKS introduces dedicated control fields in VMCS to switch PKRS, which
> >> only does the restore part. In addition, every VM exit saves PKRS into
> >> the guest-state area in VMCS, while VM entry won't save the host value
> >> due to the expectation that the host won't change the MSR often. Update
> >> the host's value in VMCS manually if the MSR has been changed by the
> >> kernel since the last time the VMCS was run.
> >> The function get_current_pkrs() in arch/x86/mm/pkeys.c exports the
> >> per-cpu variable pkrs_cache to avoid frequent rdmsr of PKRS.
> >>
> >> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> >> ---
> >
> >> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> >> index 11e4df560018..df2c2e733549 100644
> >> --- a/arch/x86/kvm/vmx/nested.c
> >> +++ b/arch/x86/kvm/vmx/nested.c
> >> @@ -289,6 +289,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
> >>          dest->ds_sel = src->ds_sel;
> >>          dest->es_sel = src->es_sel;
> >>   #endif
> >> +       dest->pkrs = src->pkrs;
> >
> > Why isn't this (and other PKRS code) inside the #ifdef CONFIG_X86_64?
> > PKRS isn't usable outside of long mode, is it?
> >
>
> Yes, I'm also thinking about whether to put all the PKS code inside
> CONFIG_X86_64. The kernel implementation also wraps its PKS code inside
> CONFIG_ARCH_HAS_SUPERVISOR_PKEYS, which depends on CONFIG_X86_64.
> However, keeping it outside may help when the host kernel disables PKS
> but the guest enables it. What do you think about this?

I see no problem in exposing PKRS to the guest even if the host
doesn't have CONFIG_ARCH_HAS_SUPERVISOR_PKEYS.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 7/7] KVM: VMX: Enable PKS for nested VM
  2020-08-13  4:52     ` Chenyi Qiang
@ 2020-08-13 17:52       ` Jim Mattson
  2020-08-14 10:07         ` Chenyi Qiang
  0 siblings, 1 reply; 38+ messages in thread
From: Jim Mattson @ 2020-08-13 17:52 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Wed, Aug 12, 2020 at 9:54 PM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>
>
>
> On 8/11/2020 8:05 AM, Jim Mattson wrote:
> > On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> >>
> >> PKS MSR passes through guest directly. Configure the MSR to match the
> >> L0/L1 settings so that nested VM runs PKS properly.
> >>
> >> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> >> ---

> >> +           (!vmx->nested.nested_run_pending ||
> >> +            !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
> >> +               vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);
> >
> > This doesn't seem right to me. On the target of a live migration, with
> > L2 active at the time the snapshot was taken (i.e.,
> > vmx->nested.nested_run_pending=0), it looks like we're going to try to
> > overwrite the current L2 PKRS value with L1's PKRS value (except that
> > in this situation, vmx->nested.vmcs01_guest_pkrs should actually be
> > 0). Am I missing something?
> >
>
> We overwrite the L2 PKRS with L1's value when L2 doesn't support PKS.
> Because L1's VM_ENTRY_LOAD_IA32_PKRS is off, we need to migrate L1's
> PKRS to L2.

I'm thinking of the case where vmx->nested.nested_run_pending is
false, and we are processing a KVM_SET_NESTED_STATE ioctl, yet
VM_ENTRY_LOAD_IA32_PKRS *is* set in the vmcs12.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace
  2020-08-07  8:48 ` [RFC 6/7] KVM: X86: Expose PKS to guest and userspace Chenyi Qiang
@ 2020-08-13 19:04   ` Jim Mattson
  2020-08-14  2:33     ` Chenyi Qiang
  2020-09-30  4:36     ` Sean Christopherson
  0 siblings, 2 replies; 38+ messages in thread
From: Jim Mattson @ 2020-08-13 19:04 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>
> Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
> enabled by setting CR4.PKS when long mode is active. PKS is only
> implemented when EPT is enabled and requires the support of VM_{ENTRY,
> EXIT}_LOAD_IA32_PKRS currently.
>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

> @@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  {
>         unsigned long old_cr4 = kvm_read_cr4(vcpu);
>         unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
> -                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
> +                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
> +                                  X86_CR4_PKS;

This list already seems overly long, but I don't think CR4.PKS belongs
here. In volume 3 of the SDM, section 4.4.1, it says:

- If PAE paging would be in use following an execution of MOV to CR0
or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any
of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP;
then the PDPTEs are loaded from the address in CR3.

CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
Since it has no effect on PAE paging, I would be surprised if it did
result in a PDPTE load.

>         if (kvm_valid_cr4(vcpu, cr4))
>                 return 1;
> @@ -1202,7 +1205,7 @@ static const u32 msrs_to_save_all[] = {
>         MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
>         MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
>         MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
> -       MSR_IA32_UMWAIT_CONTROL,
> +       MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,

Should MSR_IA32_PKRS be added to the switch statement in
kvm_init_msr_list()? Something like...

case MSR_IA32_PKRS:
        if (!kvm_cpu_cap_has(X86_FEATURE_PKRS))
                continue;
        break;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace
  2020-08-13 19:04   ` Jim Mattson
@ 2020-08-14  2:33     ` Chenyi Qiang
  2020-09-30  4:36     ` Sean Christopherson
  1 sibling, 0 replies; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-14  2:33 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML



On 8/14/2020 3:04 AM, Jim Mattson wrote:
> On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>>
>> Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
>> enabled by setting CR4.PKS when long mode is active. PKS is only
>> implemented when EPT is enabled and requires the support of VM_{ENTRY,
>> EXIT}_LOAD_IA32_PKRS currently.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> 
>> @@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>   {
>>          unsigned long old_cr4 = kvm_read_cr4(vcpu);
>>          unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
>> -                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
>> +                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
>> +                                  X86_CR4_PKS;
> 
> This list already seems overly long, but I don't think CR4.PKS belongs
> here. In volume 3 of the SDM, section 4.4.1, it says:
> 
> - If PAE paging would be in use following an execution of MOV to CR0
> or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any
> of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP;
> then the PDPTEs are loaded from the address in CR3.
> 
> CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
> Since it has no effect on PAE paging, I would be surprised if it did
> result in a PDPTE load.
> 

Oh, my mistake.

>>          if (kvm_valid_cr4(vcpu, cr4))
>>                  return 1;
>> @@ -1202,7 +1205,7 @@ static const u32 msrs_to_save_all[] = {
>>          MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
>>          MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
>>          MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
>> -       MSR_IA32_UMWAIT_CONTROL,
>> +       MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,
> 
> Should MSR_IA32_PKRS be added to the switch statement in
> kvm_init_msr_list()? Something like...
> 
> case MSR_IA32_PKRS:
>          if (!kvm_cpu_cap_has(X86_FEATURE_PKRS))
>                  continue;
>          break;
> 

Yes, this should be added.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 7/7] KVM: VMX: Enable PKS for nested VM
  2020-08-13 17:52       ` Jim Mattson
@ 2020-08-14 10:07         ` Chenyi Qiang
  2020-08-14 17:34           ` Jim Mattson
  0 siblings, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-14 10:07 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML



On 8/14/2020 1:52 AM, Jim Mattson wrote:
> On Wed, Aug 12, 2020 at 9:54 PM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>>
>>
>>
>> On 8/11/2020 8:05 AM, Jim Mattson wrote:
>>> On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>>>>
>>>> PKS MSR passes through guest directly. Configure the MSR to match the
>>>> L0/L1 settings so that nested VM runs PKS properly.
>>>>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> ---
> 
>>>> +           (!vmx->nested.nested_run_pending ||
>>>> +            !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
>>>> +               vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);
>>>
>>> This doesn't seem right to me. On the target of a live migration, with
>>> L2 active at the time the snapshot was taken (i.e.,
>>> vmx->nested.nested_run_pending=0), it looks like we're going to try to
>>> overwrite the current L2 PKRS value with L1's PKRS value (except that
>>> in this situation, vmx->nested.vmcs01_guest_pkrs should actually be
>>> 0). Am I missing something?
>>>
>>
>> We overwrite the L2 PKRS with L1's value when L2 doesn't support PKS.
>> Because L1's VM_ENTRY_LOAD_IA32_PKRS is off, we need to migrate L1's
>> PKRS to L2.
> 
> I'm thinking of the case where vmx->nested.nested_run_pending is
> false, and we are processing a KVM_SET_NESTED_STATE ioctl, yet
> VM_ENTRY_LOAD_IA32_PKRS *is* set in the vmcs12.
> 

Oh, I missed this case. What still confuses me is that the 
restoration of GUEST_IA32_DEBUGCTL and GUEST_BNDCFGS has the same 
issue, right? Or am I missing something?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 7/7] KVM: VMX: Enable PKS for nested VM
  2020-08-14 10:07         ` Chenyi Qiang
@ 2020-08-14 17:34           ` Jim Mattson
  0 siblings, 0 replies; 38+ messages in thread
From: Jim Mattson @ 2020-08-14 17:34 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Fri, Aug 14, 2020 at 3:09 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>
>
>
> On 8/14/2020 1:52 AM, Jim Mattson wrote:
> > On Wed, Aug 12, 2020 at 9:54 PM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> >>
> >>
> >>
> >> On 8/11/2020 8:05 AM, Jim Mattson wrote:
> >>> On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> >>>>
> >>>> PKS MSR passes through guest directly. Configure the MSR to match the
> >>>> L0/L1 settings so that nested VM runs PKS properly.
> >>>>
> >>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> >>>> ---
> >
> >>>> +           (!vmx->nested.nested_run_pending ||
> >>>> +            !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
> >>>> +               vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);
> >>>
> >>> This doesn't seem right to me. On the target of a live migration, with
> >>> L2 active at the time the snapshot was taken (i.e.,
> >>> vmx->nested.nested_run_pending=0), it looks like we're going to try to
> >>> overwrite the current L2 PKRS value with L1's PKRS value (except that
> >>> in this situation, vmx->nested.vmcs01_guest_pkrs should actually be
> >>> 0). Am I missing something?
> >>>
> >>
> >> We overwrite the L2 PKRS with L1's value when L2 doesn't support PKS.
> >> Because L1's VM_ENTRY_LOAD_IA32_PKRS is off, we need to migrate L1's
> >> PKRS to L2.
> >
> > I'm thinking of the case where vmx->nested.nested_run_pending is
> > false, and we are processing a KVM_SET_NESTED_STATE ioctl, yet
> > VM_ENTRY_LOAD_IA32_PKRS *is* set in the vmcs12.
> >
>
> Oh, I missed this case. What still confuses me is that the
> restoration of GUEST_IA32_DEBUGCTL and GUEST_BNDCFGS has the same
> issue, right? Or am I missing something?

I take it back. This does work, assuming that userspace calls
KVM_SET_MSRS before calling KVM_SET_NESTED_STATE. Assuming L2 is
active when the checkpoint is taken, the MSR values saved will be the
L2 values. When restoring the MSRs with KVM_SET_MSRS, the L2 MSR
values will be written into vmcs01. They don't belong there, but we're
never going to launch vmcs01 with those MSR values. Instead, when
userspace calls KVM_SET_NESTED_STATE, those values will be transferred
first to the vmcs01_<msr> fields of the vmx->nested struct, and then
to vmcs02.

This is subtle, and I don't think it's documented anywhere that
KVM_SET_NESTED_STATE must be called after KVM_SET_MSRS. In fact, there
are probably a number of dependencies among the various KVM_SET_*
functions that aren't documented anywhere.
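
Spelled out, the restore ordering that makes this work is (illustrative
pseudocode of the userspace restore path; vcpu_fd and the state buffers
are assumed):

	/* migration target, L2 active in the snapshot */
	ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);		/* L2's PKRS written into vmcs01 */
	ioctl(vcpu_fd, KVM_SET_NESTED_STATE, &state);	/* vmcs01 -> vmcs01_guest_pkrs -> vmcs02 */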

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR
  2020-08-13 17:31       ` Jim Mattson
@ 2020-08-18  7:27         ` Chenyi Qiang
  2020-08-18 18:23           ` Jim Mattson
  0 siblings, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2020-08-18  7:27 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML



On 8/14/2020 1:31 AM, Jim Mattson wrote:
> On Wed, Aug 12, 2020 at 10:42 PM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>>
>>
>>
>> On 8/13/2020 5:21 AM, Jim Mattson wrote:
>>> On Fri, Aug 7, 2020 at 1:46 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>>>>
>>>> Protection Keys for Supervisor Pages (PKS) uses IA32_PKRS MSR (PKRS) at
>>>> index 0x6E1 to allow software to manage supervisor protection key
>>>> rights. For performance considerations, the PKRS intercept will be disabled
>>>> so that the guest can access PKRS without VM exits.
>>>> PKS introduces dedicated control fields in VMCS to switch PKRS, which
>>>> only does the restore part. In addition, every VM exit saves PKRS into
>>>> the guest-state area in VMCS, while VM entry won't save the host value
>>>> due to the expectation that the host won't change the MSR often. Update
>>>> the host's value in VMCS manually if the MSR has been changed by the
>>>> kernel since the last time the VMCS was run.
>>>> The function get_current_pkrs() in arch/x86/mm/pkeys.c exports the
>>>> per-cpu variable pkrs_cache to avoid frequent rdmsr of PKRS.
>>>>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> ---
>>>
>>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>>>> index 11e4df560018..df2c2e733549 100644
>>>> --- a/arch/x86/kvm/vmx/nested.c
>>>> +++ b/arch/x86/kvm/vmx/nested.c
>>>> @@ -289,6 +289,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
>>>>           dest->ds_sel = src->ds_sel;
>>>>           dest->es_sel = src->es_sel;
>>>>    #endif
>>>> +       dest->pkrs = src->pkrs;
>>>
>>> Why isn't this (and other PKRS code) inside the #ifdef CONFIG_X86_64?
>>> PKRS isn't usable outside of long mode, is it?
>>>
>>
>> Yes, I'm also thinking about whether to put all the PKS code inside
>> CONFIG_X86_64. The kernel implementation also wraps its PKS code inside
>> CONFIG_ARCH_HAS_SUPERVISOR_PKEYS, which depends on CONFIG_X86_64.
>> However, keeping it outside may help when the host kernel disables PKS
>> but the guest enables it. What do you think about this?
> 
> I see no problem in exposing PKRS to the guest even if the host
> doesn't have CONFIG_ARCH_HAS_SUPERVISOR_PKEYS.
> 

Yes, but I would prefer to keep it outside CONFIG_X86_64. The PKS code has 
several code blocks, and putting them all under x86_64 may end up being a 
mess. In addition, the PKU-related KVM code isn't under CONFIG_X86_64 
either. So, is it really necessary to put it inside?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR
  2020-08-18  7:27         ` Chenyi Qiang
@ 2020-08-18 18:23           ` Jim Mattson
  2020-08-22  3:28             ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Jim Mattson @ 2020-08-18 18:23 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Tue, Aug 18, 2020 at 12:28 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>
>
>
> On 8/14/2020 1:31 AM, Jim Mattson wrote:
> > On Wed, Aug 12, 2020 at 10:42 PM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> >>
> >>
> >>
> >> On 8/13/2020 5:21 AM, Jim Mattson wrote:
> >>> On Fri, Aug 7, 2020 at 1:46 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> >>>>
> >>>> Protection Keys for Supervisor Pages (PKS) uses IA32_PKRS MSR (PKRS) at
> >>>> index 0x6E1 to allow software to manage supervisor protection key
> >>>> rights. For performance considerations, the PKRS intercept will be disabled
> >>>> so that the guest can access PKRS without VM exits.
> >>>> PKS introduces dedicated control fields in VMCS to switch PKRS, which
> >>>> only does the restore part. In addition, every VM exit saves PKRS into
> >>>> the guest-state area in VMCS, while VM entry won't save the host value
> >>>> due to the expectation that the host won't change the MSR often. Update
> >>>> the host's value in VMCS manually if the MSR has been changed by the
> >>>> kernel since the last time the VMCS was run.
> >>>> The function get_current_pkrs() in arch/x86/mm/pkeys.c exports the
> >>>> per-cpu variable pkrs_cache to avoid frequent rdmsr of PKRS.
> >>>>
> >>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> >>>> ---
> >>>
> >>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> >>>> index 11e4df560018..df2c2e733549 100644
> >>>> --- a/arch/x86/kvm/vmx/nested.c
> >>>> +++ b/arch/x86/kvm/vmx/nested.c
> >>>> @@ -289,6 +289,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
> >>>>           dest->ds_sel = src->ds_sel;
> >>>>           dest->es_sel = src->es_sel;
> >>>>    #endif
> >>>> +       dest->pkrs = src->pkrs;
> >>>
> >>> Why isn't this (and other PKRS code) inside the #ifdef CONFIG_X86_64?
> >>> PKRS isn't usable outside of long mode, is it?
> >>>
> >>
> >> Yes, I'm also thinking about whether to put all the PKS code inside
> >> CONFIG_X86_64. The kernel implementation also wraps its PKS code inside
> >> CONFIG_ARCH_HAS_SUPERVISOR_PKEYS, which depends on CONFIG_X86_64.
> >> However, keeping it outside may help when the host kernel disables PKS
> >> but the guest enables it. What do you think about this?
> >
> > I see no problem in exposing PKRS to the guest even if the host
> > doesn't have CONFIG_ARCH_HAS_SUPERVISOR_PKEYS.
> >
>
> Yes, but I would prefer to keep it outside CONFIG_X86_64. The PKS code has
> several code blocks, and putting them all under x86_64 may end up being a
> mess. In addition, the PKU-related KVM code isn't under CONFIG_X86_64
> either. So, is it really necessary to put it inside?

I'll let someone who actually cares about the i386 build answer that question.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR
  2020-08-18 18:23           ` Jim Mattson
@ 2020-08-22  3:28             ` Sean Christopherson
  0 siblings, 0 replies; 38+ messages in thread
From: Sean Christopherson @ 2020-08-22  3:28 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Chenyi Qiang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Tue, Aug 18, 2020 at 11:23:47AM -0700, Jim Mattson wrote:
> On Tue, Aug 18, 2020 at 12:28 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> >
> >
> >
> > On 8/14/2020 1:31 AM, Jim Mattson wrote:
> > > On Wed, Aug 12, 2020 at 10:42 PM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> > >>
> > >>
> > >>
> > >> On 8/13/2020 5:21 AM, Jim Mattson wrote:
> > >>> On Fri, Aug 7, 2020 at 1:46 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> > >>>>
> > >>>> Protection Keys for Supervisor Pages (PKS) uses IA32_PKRS MSR (PKRS) at
> > >>>> index 0x6E1 to allow software to manage supervisor protection key
> > >>>> rights. For performance considerations, the PKRS intercept will be disabled
> > >>>> so that the guest can access PKRS without VM exits.
> > >>>> PKS introduces dedicated control fields in VMCS to switch PKRS, which
> > >>>> only does the restore part. In addition, every VM exit saves PKRS into
> > >>>> the guest-state area in VMCS, while VM entry won't save the host value
> > >>>> due to the expectation that the host won't change the MSR often. Update
> > >>>> the host's value in VMCS manually if the MSR has been changed by the
> > >>>> kernel since the last time the VMCS was run.
> > >>>> The function get_current_pkrs() in arch/x86/mm/pkeys.c exports the
> > >>>> per-cpu variable pkrs_cache to avoid frequent rdmsr of PKRS.
> > >>>>
> > >>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> > >>>> ---
> > >>>
> > >>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > >>>> index 11e4df560018..df2c2e733549 100644
> > >>>> --- a/arch/x86/kvm/vmx/nested.c
> > >>>> +++ b/arch/x86/kvm/vmx/nested.c
> > >>>> @@ -289,6 +289,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
> > >>>>           dest->ds_sel = src->ds_sel;
> > >>>>           dest->es_sel = src->es_sel;
> > >>>>    #endif
> > >>>> +       dest->pkrs = src->pkrs;
> > >>>
> > >>> Why isn't this (and other PKRS code) inside the #ifdef CONFIG_X86_64?
> > >>> PKRS isn't usable outside of long mode, is it?
> > >>>
> > >>
> > >> Yes, I'm also thinking about whether to put all the PKS code inside
> > >> CONFIG_X86_64. The kernel implementation also wraps its PKS code inside
> > >> CONFIG_ARCH_HAS_SUPERVISOR_PKEYS, which depends on CONFIG_X86_64.
> > >> However, keeping it outside may help when the host kernel disables PKS
> > >> but the guest enables it. What do you think about this?
> > >
> > > I see no problem in exposing PKRS to the guest even if the host
> > > doesn't have CONFIG_ARCH_HAS_SUPERVISOR_PKEYS.
> > >
> >
> > Yes, but I would prefer to keep it outside CONFIG_X86_64. The PKS code has
> > several code blocks, and putting them all under x86_64 may end up being a
> > mess. In addition, the PKU-related KVM code isn't under CONFIG_X86_64
> > either. So, is it really necessary to put it inside?
> 
> I'll let someone who actually cares about the i386 build answer that question.

Ha, I care about the i386 build to the extent that it doesn't break.  I
don't care at all about shaving cycles/memory for things like this.  Given how
long some KVM i386 bugs have gone unnoticed, I'm not sure there's anyone that
cares about KVM i386 these days :-)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace
  2020-08-13 19:04   ` Jim Mattson
  2020-08-14  2:33     ` Chenyi Qiang
@ 2020-09-30  4:36     ` Sean Christopherson
  2021-01-26 18:24       ` Paolo Bonzini
  1 sibling, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2020-09-30  4:36 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Chenyi Qiang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Thu, Aug 13, 2020 at 12:04:54PM -0700, Jim Mattson wrote:
> On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
> >
> > Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
> > enabled by setting CR4.PKS when long mode is active. PKS is only
> > implemented when EPT is enabled and requires the support of VM_{ENTRY,
> > EXIT}_LOAD_IA32_PKRS currently.
> >
> > Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> 
> > @@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> >  {
> >         unsigned long old_cr4 = kvm_read_cr4(vcpu);
> >         unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
> > -                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
> > +                                  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
> > +                                  X86_CR4_PKS;
> 
> This list already seems overly long, but I don't think CR4.PKS belongs
> here. In volume 3 of the SDM, section 4.4.1, it says:
> 
> - If PAE paging would be in use following an execution of MOV to CR0
> or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any
> of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP;
> then the PDPTEs are loaded from the address in CR3.
> 
> CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
> Since it has no effect on PAE paging, I would be surprised if it did
> result in a PDPTE load.

It does belong in the mmu_role_bits though ;-)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR
  2020-08-07  8:48 ` [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR Chenyi Qiang
  2020-08-12 21:21   ` Jim Mattson
@ 2021-01-26 18:01   ` Paolo Bonzini
  2021-01-27  7:55     ` Chenyi Qiang
  2021-02-01  9:53     ` Chenyi Qiang
  1 sibling, 2 replies; 38+ messages in thread
From: Paolo Bonzini @ 2021-01-26 18:01 UTC (permalink / raw)
  To: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

On 07/08/20 10:48, Chenyi Qiang wrote:
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
> +	bool pks_supported = guest_cpuid_has(vcpu, X86_FEATURE_PKS);
> +
> +	/*
> +	 * set intercept for PKRS when the guest doesn't support pks
> +	 */
> +	vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PKRS, MSR_TYPE_RW, !pks_supported);
> +
> +	if (pks_supported) {
> +		vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
> +		vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
> +	} else {
> +		vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
> +		vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
> +	}

Is the guest expected to do a lot of reads/writes to the MSR (e.g. at 
every context switch)?

Even if this is the case, the MSR intercepts and the entry/exit controls 
should only be done if CR4.PKS=1.  If the guest does not use PKS, KVM 
should behave as if these patches did not exist.
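
Something along these lines, reusing the helpers from the patch but keyed
off CR4.PKS rather than CPUID (just a sketch):

	if (kvm_read_cr4_bits(vcpu, X86_CR4_PKS)) {
		vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PKRS, MSR_TYPE_RW, false);
		vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
		vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
	} else {
		vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PKRS, MSR_TYPE_RW, true);
		vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
		vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
	}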

Paolo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 4/7] KVM: MMU: Refactor pkr_mask to cache condition
  2020-08-07  8:48 ` [RFC 4/7] KVM: MMU: Refactor pkr_mask to cache condition Chenyi Qiang
@ 2021-01-26 18:16   ` Paolo Bonzini
  2021-01-27  3:14     ` Chenyi Qiang
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2021-01-26 18:16 UTC (permalink / raw)
  To: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

On 07/08/20 10:48, Chenyi Qiang wrote:
> 
>  		* index of the protection domain, so pte_pkey * 2 is
>  		* is the index of the first bit for the domain.
>  		*/
> -		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
> +		if (pte_access & PT_USER_MASK)
> +			pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
> +		else
> +			pkr_bits = 0;
>  
> -		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
> -		offset = (pfec & ~1) +
> -			((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
> +		/* clear present bit */
> +		offset = (pfec & ~1);
>  
>  		pkr_bits &= mmu->pkr_mask >> offset;
>  		errcode |= -pkr_bits & PFERR_PK_MASK;

I think this is incorrect.  mmu->pkr_mask must cover both the clear and set 
ACC_USER_MASK cases, in order to cover all combinations of CR4.PKE and CR4.PKS. 
Right now, check_pkey is !ff && pte_user, but you need to make it 
something like

	check_pkey = !ff && (pte_user ? cr4_pke : cr4_pks);
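
i.e., something like this inside update_pkr_bitmask() (a sketch; pte_user
comes from the existing PFEC.RSVD/ACC_USER_MASK trick):

	bool cr4_pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
	bool cr4_pks = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKS);
	...
	/* PFEC.RSVD is replaced by ACC_USER_MASK in the bit index */
	pte_user = pfec & PFERR_RSVD_MASK;
	check_pkey = !ff && (pte_user ? cr4_pke : cr4_pks);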

Paolo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 3/7] KVM: MMU: Rename the pkru to pkr
  2020-08-07  8:48 ` [RFC 3/7] KVM: MMU: Rename the pkru to pkr Chenyi Qiang
@ 2021-01-26 18:16   ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2021-01-26 18:16 UTC (permalink / raw)
  To: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

On 07/08/20 10:48, Chenyi Qiang wrote:
> PKRU represents the PKU register utilized in the protection key rights
> check for user pages. Protection Keys for Supervisor Pages (PKS) extends
> the protection key architecture to cover supervisor pages.
> 
> Rename the *pkru*-related variables and functions to *pkr*, which stands
> for both PKRU and PKRS. It makes sense because both registers
> have the same format. PKS and PKU can also share the same bitmap to
> cache the conditions where protection key checks are needed.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  2 +-
>   arch/x86/kvm/mmu.h              | 12 ++++++------
>   arch/x86/kvm/mmu/mmu.c          | 18 +++++++++---------
>   3 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be5363b21540..6b739d0d1c97 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -427,7 +427,7 @@ struct kvm_mmu {
>   	* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
>   	* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
>   	*/
> -	u32 pkru_mask;
> +	u32 pkr_mask;
>   
>   	u64 *pae_root;
>   	u64 *lm_root;
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 444bb9c54548..0c2fdf0abf22 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -193,8 +193,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>   	u32 errcode = PFERR_PRESENT_MASK;
>   
>   	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
> -	if (unlikely(mmu->pkru_mask)) {
> -		u32 pkru_bits, offset;
> +	if (unlikely(mmu->pkr_mask)) {
> +		u32 pkr_bits, offset;
>   
>   		/*
>   		* PKRU defines 32 bits, there are 16 domains and 2
> @@ -202,15 +202,15 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>   		* index of the protection domain, so pte_pkey * 2 is
>   		* is the index of the first bit for the domain.
>   		*/
> -		pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
> +		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
>   
>   		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
>   		offset = (pfec & ~1) +
>   			((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
>   
> -		pkru_bits &= mmu->pkru_mask >> offset;
> -		errcode |= -pkru_bits & PFERR_PK_MASK;
> -		fault |= (pkru_bits != 0);
> +		pkr_bits &= mmu->pkr_mask >> offset;
> +		errcode |= -pkr_bits & PFERR_PK_MASK;
> +		fault |= (pkr_bits != 0);
>   	}
>   
>   	return -(u32)fault & errcode;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6d6a0ae7800c..481442f5e27a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4716,20 +4716,20 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
>   * away both AD and WD.  For all reads or if the last condition holds, WD
>   * only will be masked away.
>   */
> -static void update_pkru_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> +static void update_pkr_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>   				bool ept)
>   {
>   	unsigned bit;
>   	bool wp;
>   
>   	if (ept) {
> -		mmu->pkru_mask = 0;
> +		mmu->pkr_mask = 0;
>   		return;
>   	}
>   
>   	/* PKEY is enabled only if CR4.PKE and EFER.LMA are both set. */
>   	if (!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) || !is_long_mode(vcpu)) {
> -		mmu->pkru_mask = 0;
> +		mmu->pkr_mask = 0;
>   		return;
>   	}
>   
> @@ -4763,7 +4763,7 @@ static void update_pkru_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>   		/* PKRU.WD stops write access. */
>   		pkey_bits |= (!!check_write) << 1;
>   
> -		mmu->pkru_mask |= (pkey_bits & 3) << pfec;
> +		mmu->pkr_mask |= (pkey_bits & 3) << pfec;
>   	}
>   }
>   
> @@ -4785,7 +4785,7 @@ static void paging64_init_context_common(struct kvm_vcpu *vcpu,
>   
>   	reset_rsvds_bits_mask(vcpu, context);
>   	update_permission_bitmask(vcpu, context, false);
> -	update_pkru_bitmask(vcpu, context, false);
> +	update_pkr_bitmask(vcpu, context, false);
>   	update_last_nonleaf_level(vcpu, context);
>   
>   	MMU_WARN_ON(!is_pae(vcpu));
> @@ -4815,7 +4815,7 @@ static void paging32_init_context(struct kvm_vcpu *vcpu,
>   
>   	reset_rsvds_bits_mask(vcpu, context);
>   	update_permission_bitmask(vcpu, context, false);
> -	update_pkru_bitmask(vcpu, context, false);
> +	update_pkr_bitmask(vcpu, context, false);
>   	update_last_nonleaf_level(vcpu, context);
>   
>   	context->page_fault = paging32_page_fault;
> @@ -4925,7 +4925,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
>   	}
>   
>   	update_permission_bitmask(vcpu, context, false);
> -	update_pkru_bitmask(vcpu, context, false);
> +	update_pkr_bitmask(vcpu, context, false);
>   	update_last_nonleaf_level(vcpu, context);
>   	reset_tdp_shadow_zero_bits_mask(vcpu, context);
>   }
> @@ -5032,7 +5032,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>   	context->mmu_role.as_u64 = new_role.as_u64;
>   
>   	update_permission_bitmask(vcpu, context, true);
> -	update_pkru_bitmask(vcpu, context, true);
> +	update_pkr_bitmask(vcpu, context, true);
>   	update_last_nonleaf_level(vcpu, context);
>   	reset_rsvds_bits_mask_ept(vcpu, context, execonly);
>   	reset_ept_shadow_zero_bits_mask(vcpu, context, execonly);
> @@ -5103,7 +5103,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
>   	}
>   
>   	update_permission_bitmask(vcpu, g_context, false);
> -	update_pkru_bitmask(vcpu, g_context, false);
> +	update_pkr_bitmask(vcpu, g_context, false);
>   	update_last_nonleaf_level(vcpu, g_context);
>   }
>   
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 5/7] KVM: MMU: Add support for PKS emulation
  2020-08-07  8:48 ` [RFC 5/7] KVM: MMU: Add support for PKS emulation Chenyi Qiang
@ 2021-01-26 18:23   ` Paolo Bonzini
  2021-01-27  3:00     ` Chenyi Qiang
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2021-01-26 18:23 UTC (permalink / raw)
  To: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

On 07/08/20 10:48, Chenyi Qiang wrote:
> 
>  		if (pte_access & PT_USER_MASK)
>  			pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
> +		else if (!kvm_get_msr(vcpu, MSR_IA32_PKRS, &pkrs))
> +			pkr_bits = (pkrs >> (pte_pkey * 2)) & 3;

You should be able to always use vcpu->arch.pkrs here.  So

pkr = pte_access & PT_USER_MASK ? vcpu->arch.pkru : vcpu->arch.pkrs;
pkr_bits = (pkr >> pte_pkey * 2) & 3;
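
with the rest of the tail unchanged from the patch (sketch, assuming
vcpu->arch.pkrs shadows the MSR the same way vcpu->arch.pkru does):

	pkr_bits &= mmu->pkr_mask >> offset;
	errcode |= -pkr_bits & PFERR_PK_MASK;
	fault |= (pkr_bits != 0);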

Paolo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace
  2020-09-30  4:36     ` Sean Christopherson
@ 2021-01-26 18:24       ` Paolo Bonzini
  2021-01-26 19:56         ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2021-01-26 18:24 UTC (permalink / raw)
  To: Jim Mattson, Sean Christopherson
  Cc: Chenyi Qiang, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel,
	Xiaoyao Li, kvm list, LKML

On 30/09/20 06:36, Sean Christopherson wrote:
>> CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
>> Since it has no effect on PAE paging, I would be surprised if it did
>> result in a PDPTE load.
> It does belong in the mmu_role_bits though;-)
> 

Does it?  We don't support PKU/PKS for shadow paging, and it's always 
zero for EPT.  We only support enough PKU/PKS for emulation.

Paolo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace
  2021-01-26 18:24       ` Paolo Bonzini
@ 2021-01-26 19:56         ` Sean Christopherson
  2021-01-26 20:05           ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2021-01-26 19:56 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jim Mattson, Chenyi Qiang, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On Tue, Jan 26, 2021, Paolo Bonzini wrote:
> On 30/09/20 06:36, Sean Christopherson wrote:
> > > CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
> > > Since it has no effect on PAE paging, I would be surprised if it did
> > > result in a PDPTE load.
> > It does belong in the mmu_role_bits though;-)
> > 
> 
> Does it?  We don't support PKU/PKS for shadow paging, and it's always zero
> for EPT.  We only support enough PKU/PKS for emulation.

As proposed, yes.  The PKU/PKS mask is tracked on a per-mmu basis, e.g. computed
in update_pkr_bitmask() and consumed in permission_fault() during emulation.
Omitting CR4.PKS from the extended role could let KVM reuse an MMU with the
wrong pkr_mask.

That being said, I don't think it needs a dedicated bit.  IIUC, the logic is
PKU|PKS, with pkr_mask generation being PKU vs. PKS agnostic.  The role could do
the same and smush the bits together, e.g.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3d6616f6f6ef..3bfca34f6ea2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -293,7 +293,7 @@ union kvm_mmu_extended_role {
                unsigned int cr0_pg:1;
                unsigned int cr4_pae:1;
                unsigned int cr4_pse:1;
-               unsigned int cr4_pke:1;
+               unsigned int cr4_pkr:1;
                unsigned int cr4_smap:1;
                unsigned int cr4_smep:1;
                unsigned int maxphyaddr:6;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 79166288ed03..2774b85a36d5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4448,7 +4448,8 @@ static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu)
        ext.cr4_smep = !!kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
        ext.cr4_smap = !!kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
        ext.cr4_pse = !!is_pse(vcpu);
-       ext.cr4_pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
+       ext.cr4_pkr = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
+                     !!kvm_read_cr4_bits(vcpu, X86_CR4_PKS);
        ext.maxphyaddr = cpuid_maxphyaddr(vcpu);

        ext.valid = 1;


Another option would be to move the tracking out of the MMU, e.g. make pkr_mask
per-vCPU and recalculate when CR4 changes.  I think that would "just work", even
when nested VMs are in play?
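
Roughly (hypothetical sketch; pkr_mask_for() is a made-up helper computing
the PKU- or PKS-flavored half of the mask):

	static void kvm_update_pkr_mask(struct kvm_vcpu *vcpu)
	{
		vcpu->arch.pkr_mask = 0;
		if (kvm_read_cr4_bits(vcpu, X86_CR4_PKE))
			vcpu->arch.pkr_mask |= pkr_mask_for(vcpu, true);  /* user pages */
		if (kvm_read_cr4_bits(vcpu, X86_CR4_PKS))
			vcpu->arch.pkr_mask |= pkr_mask_for(vcpu, false); /* supervisor pages */
	}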

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace
  2021-01-26 19:56         ` Sean Christopherson
@ 2021-01-26 20:05           ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2021-01-26 20:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jim Mattson, Chenyi Qiang, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, Xiaoyao Li, kvm list, LKML

On 26/01/21 20:56, Sean Christopherson wrote:
>>> It does belong in the mmu_role_bits though;-)
>>
>> Does it?  We don't support PKU/PKS for shadow paging, and it's always zero
>> for EPT.  We only support enough PKU/PKS for emulation.
>
> As proposed, yes. The PKU/PKS mask is tracked on a per-mmu basis, e.g. 
> computed in update_pkr_bitmask() and consumed in permission_fault() 
> during emulation. Omitting CR4.PKS from the extended role could let KVM 
> reuse an MMU with the wrong pkr_mask.

Right, not for the hash table key but for reuse.

> IIUC, the logic is PKU|PKS, with pkr_mask generation being PKU vs. PKS agnostic.

Not in the patches as submitted, but it's what I suggested indeed (using 
one bit of the PFEC to pick one of CR4.PKE and CR4.PKS).

> Another option would be to move the tracking out of the MMU, e.g. make pkr_mask
> per-vCPU and recalculate when CR4 changes.  I think that would "just work", even
> when nested VMs are in play?

Yeah, pkr_mask is basically one of four constants (depending on CR4.PKE 
and CR4.PKS) so recalculating when CR4 changes would work too.  But I'm 
okay with doing that later, too.
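
For illustration, the four-constant idea could look something like the below,
where PKR_MASK_USER and PKR_MASK_SUPERVISOR are placeholders for whatever
update_pkr_bitmask() would compute for the two halves (untested sketch):

static u32 kvm_calc_pkr_mask(struct kvm_vcpu *vcpu)
{
	u32 mask = 0;

	/*
	 * Sketch: the mask decomposes into a user half gated by CR4.PKE
	 * and a supervisor half gated by CR4.PKS, giving four constants.
	 */
	if (kvm_read_cr4_bits(vcpu, X86_CR4_PKE))
		mask |= PKR_MASK_USER;
	if (kvm_read_cr4_bits(vcpu, X86_CR4_PKS))
		mask |= PKR_MASK_SUPERVISOR;

	return mask;
}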

Paolo



* Re: [RFC 5/7] KVM: MMU: Add support for PKS emulation
  2021-01-26 18:23   ` Paolo Bonzini
@ 2021-01-27  3:00     ` Chenyi Qiang
  2021-01-27  8:37       ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2021-01-27  3:00 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel



On 1/27/2021 2:23 AM, Paolo Bonzini wrote:
> On 07/08/20 10:48, Chenyi Qiang wrote:
>>
>>          if (pte_access & PT_USER_MASK)
>>              pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
>> +        else if (!kvm_get_msr(vcpu, MSR_IA32_PKRS, &pkrs))
>> +            pkr_bits = (pkrs >> (pte_pkey * 2)) & 3;
> 
> You should be able to always use vcpu->arch.pkrs here.  So
> 
> pkr = pte_access & PT_USER_MASK ? vcpu->arch.pkru : vcpu->arch.pkrs;
> pkr_bits = (pkr >> pte_pkey * 2) & 3;
> 
> Paolo

Given that vcpu->arch.pkrs would be its only use case in the currently 
submitted patches, is it still necessary to shadow it?

> 


* Re: [RFC 4/7] KVM: MMU: Refactor pkr_mask to cache condition
  2021-01-26 18:16   ` Paolo Bonzini
@ 2021-01-27  3:14     ` Chenyi Qiang
  0 siblings, 0 replies; 38+ messages in thread
From: Chenyi Qiang @ 2021-01-27  3:14 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel



On 1/27/2021 2:16 AM, Paolo Bonzini wrote:
> On 07/08/20 10:48, Chenyi Qiang wrote:
>>
>>          * index of the protection domain, so pte_pkey * 2
>>          * is the index of the first bit for the domain.
>>          */
>> -        pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
>> +        if (pte_access & PT_USER_MASK)
>> +            pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
>> +        else
>> +            pkr_bits = 0;
>>
>> -        /* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
>> -        offset = (pfec & ~1) +
>> -            ((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - 
>> PT_USER_SHIFT));
>> +        /* clear present bit */
>> +        offset = (pfec & ~1);
>>
>>          pkr_bits &= mmu->pkr_mask >> offset;
>>          errcode |= -pkr_bits & PFERR_PK_MASK;
> 
> I think this is incorrect.  mmu->pkr_mask must cover both clear and set 
> ACC_USER_MASK, in order to cover all combinations of CR4.PKE and CR4.PKS. 
> Right now, check_pkey is !ff && pte_user, but you need to make it 
> something like
> 
>      check_pkey = !ff && (pte_user ? cr4_pke : cr4_pks);
> 
> Paolo

Oh, I didn't distinguish between the cr4_pke and cr4_pks checks. Will fix 
this issue.
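
Something like the below, I suppose (illustrative helper, untested; the
helper name is made up):

static bool pkr_check_applies(struct kvm_vcpu *vcpu, bool ff, bool pte_user)
{
	/* Protection keys never apply to instruction fetches. */
	if (ff)
		return false;

	/*
	 * Data accesses to user pages are gated by CR4.PKE, accesses to
	 * supervisor pages by CR4.PKS.
	 */
	return pte_user ? !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE)
			: !!kvm_read_cr4_bits(vcpu, X86_CR4_PKS);
}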

> 


* Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR
  2021-01-26 18:01   ` Paolo Bonzini
@ 2021-01-27  7:55     ` Chenyi Qiang
  2021-02-01  9:53     ` Chenyi Qiang
  1 sibling, 0 replies; 38+ messages in thread
From: Chenyi Qiang @ 2021-01-27  7:55 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel



On 1/27/2021 2:01 AM, Paolo Bonzini wrote:
> On 07/08/20 10:48, Chenyi Qiang wrote:
>> +{
>> +    struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +    unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
>> +    bool pks_supported = guest_cpuid_has(vcpu, X86_FEATURE_PKS);
>> +
>> +    /*
>> +     * set intercept for PKRS when the guest doesn't support pks
>> +     */
>> +    vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PKRS, MSR_TYPE_RW, 
>> !pks_supported);
>> +
>> +    if (pks_supported) {
>> +        vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
>> +        vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
>> +    } else {
>> +        vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
>> +        vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
>> +    }
> 
> Is the guest expected to do a lot of reads/writes to the MSR (e.g. at 
> every context switch)?
> 

In the current design for PKS, PMEM stray-write protection is the only 
implemented use case, and PKRS is only changed temporarily along specific 
code paths. Thus reads/writes to the MSR should not be frequent, I think.

> Even if this is the case, the MSR intercepts and the entry/exit controls 
> should only be done if CR4.PKS=1.  If the guest does not use PKS, KVM 
> should behave as if these patches did not exist.
> 


I pass through PKRS and enable the entry/exit controls whenever PKS is 
supported, just to narrow the window of the MSR switch during VMX 
transitions. But yeah, I should also consider the guest's PKS enabling 
status according to CR4.PKS; will fix it in the next version.

> Paolo
> 


* Re: [RFC 5/7] KVM: MMU: Add support for PKS emulation
  2021-01-27  3:00     ` Chenyi Qiang
@ 2021-01-27  8:37       ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2021-01-27  8:37 UTC (permalink / raw)
  To: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

On 27/01/21 04:00, Chenyi Qiang wrote:
>>
>>>
>>>          if (pte_access & PT_USER_MASK)
>>>              pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
>>> +        else if (!kvm_get_msr(vcpu, MSR_IA32_PKRS, &pkrs))
>>> +            pkr_bits = (pkrs >> (pte_pkey * 2)) & 3;
>>
>> You should be able to always use vcpu->arch.pkrs here.  So
>>
>> pkr = pte_access & PT_USER_MASK ? vcpu->arch.pkru : vcpu->arch.pkrs;
>> pkr_bits = (pkr >> pte_pkey * 2) & 3;
>>
> Given that vcpu->arch.pkrs would be its only use case in the currently 
> submitted patches, is it still necessary to shadow it?

Yes, please do.  kvm_get_msr is pretty slow, and the code becomes 
simpler too.
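
Roughly like this (sketch only; the pkrs field is the one your series
would add to kvm_vcpu_arch, and the helper name is made up):

static u32 vcpu_pkr_bits(struct kvm_vcpu *vcpu, unsigned pte_access,
			 unsigned pte_pkey)
{
	/* No MSR read on the emulation path: both registers are shadowed. */
	u32 pkr = (pte_access & PT_USER_MASK) ? vcpu->arch.pkru
					      : vcpu->arch.pkrs;

	return (pkr >> (pte_pkey * 2)) & 3;
}

The shadow of course has to be kept coherent with the guest PKRS VMCS
field on guest writes and across vm-entry/vm-exit.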

Paolo



* Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR
  2021-01-26 18:01   ` Paolo Bonzini
  2021-01-27  7:55     ` Chenyi Qiang
@ 2021-02-01  9:53     ` Chenyi Qiang
  2021-02-01 10:05       ` Paolo Bonzini
  1 sibling, 1 reply; 38+ messages in thread
From: Chenyi Qiang @ 2021-02-01  9:53 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel



On 1/27/2021 2:01 AM, Paolo Bonzini wrote:
> On 07/08/20 10:48, Chenyi Qiang wrote:
>> +{
>> +    struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +    unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
>> +    bool pks_supported = guest_cpuid_has(vcpu, X86_FEATURE_PKS);
>> +
>> +    /*
>> +     * set intercept for PKRS when the guest doesn't support pks
>> +     */
>> +    vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PKRS, MSR_TYPE_RW, 
>> !pks_supported);
>> +
>> +    if (pks_supported) {
>> +        vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
>> +        vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
>> +    } else {
>> +        vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
>> +        vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
>> +    }
> 
> Is the guest expected to do a lot of reads/writes to the MSR (e.g. at 
> every context switch)?
> 
> Even if this is the case, the MSR intercepts and the entry/exit controls 
> should only be done if CR4.PKS=1.  If the guest does not use PKS, KVM 
> should behave as if these patches did not exist.
> 

Hi Paolo,

Regarding the MSR intercepts and entry/exit controls: IA32_PKRS access is 
independent of the CR4.PKS bit; it depends only on CPUID enumeration. If 
the guest doesn't set CR4.PKS but still has the CPUID capability, 
modifying PKRS should be supported even though it has no effect. IIUC, we 
cannot ignore these controls when CR4.PKS=0.

Thanks
Chenyi

> Paolo
> 


* Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR
  2021-02-01  9:53     ` Chenyi Qiang
@ 2021-02-01 10:05       ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2021-02-01 10:05 UTC (permalink / raw)
  To: Chenyi Qiang, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

On 01/02/21 10:53, Chenyi Qiang wrote:
>>>
>>
>> Is the guest expected to do a lot of reads/writes to the MSR (e.g. at 
>> every context switch)?
>>
>> Even if this is the case, the MSR intercepts and the entry/exit 
>> controls should only be done if CR4.PKS=1.  If the guest does not use 
>> PKS, KVM should behave as if these patches did not exist.
>>
> 
> Hi Paolo,
> 
> Regarding the MSR intercepts and entry/exit controls: IA32_PKRS access is 
> independent of the CR4.PKS bit; it depends only on CPUID enumeration. If 
> the guest doesn't set CR4.PKS but still has the CPUID capability, 
> modifying PKRS should be supported even though it has no effect. IIUC, we 
> cannot ignore these controls when CR4.PKS=0.

Understood. I wanted to avoid paying the price (if any) of loading PKRS 
on vmentry and vmexit not just if CPUID.PKS=0, but also if CR4.PKS=0. 
If CR4.PKS=0, it would be nicer to enable the MSR intercept and disable 
the vmentry/vmexit controls, and just run the guest with the host value 
of IA32_PKRS.
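
Roughly like this, reusing the helpers from your patch (untested sketch;
the function name is made up, the point is keying off CR4.PKS in addition
to CPUID):

static void vmx_update_pkrs_cfg(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	bool pks_active = guest_cpuid_has(vcpu, X86_FEATURE_PKS) &&
			  kvm_read_cr4_bits(vcpu, X86_CR4_PKS);

	/* Intercept PKRS unless the guest is actively using PKS. */
	vmx_set_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PKRS,
				  MSR_TYPE_RW, !pks_active);

	if (pks_active) {
		vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
		vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
	} else {
		vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
		vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
	}
}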

Paolo




Thread overview: 38+ messages
2020-08-07  8:48 [RFC 0/7] KVM: PKS Virtualization support Chenyi Qiang
2020-08-07  8:48 ` [RFC 1/7] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
2020-08-10 23:17   ` Jim Mattson
2020-08-07  8:48 ` [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR Chenyi Qiang
2020-08-12 21:21   ` Jim Mattson
2020-08-13  5:42     ` Chenyi Qiang
2020-08-13 17:31       ` Jim Mattson
2020-08-18  7:27         ` Chenyi Qiang
2020-08-18 18:23           ` Jim Mattson
2020-08-22  3:28             ` Sean Christopherson
2021-01-26 18:01   ` Paolo Bonzini
2021-01-27  7:55     ` Chenyi Qiang
2021-02-01  9:53     ` Chenyi Qiang
2021-02-01 10:05       ` Paolo Bonzini
2020-08-07  8:48 ` [RFC 3/7] KVM: MMU: Rename the pkru to pkr Chenyi Qiang
2021-01-26 18:16   ` Paolo Bonzini
2020-08-07  8:48 ` [RFC 4/7] KVM: MMU: Refactor pkr_mask to cache condition Chenyi Qiang
2021-01-26 18:16   ` Paolo Bonzini
2021-01-27  3:14     ` Chenyi Qiang
2020-08-07  8:48 ` [RFC 5/7] KVM: MMU: Add support for PKS emulation Chenyi Qiang
2021-01-26 18:23   ` Paolo Bonzini
2021-01-27  3:00     ` Chenyi Qiang
2021-01-27  8:37       ` Paolo Bonzini
2020-08-07  8:48 ` [RFC 6/7] KVM: X86: Expose PKS to guest and userspace Chenyi Qiang
2020-08-13 19:04   ` Jim Mattson
2020-08-14  2:33     ` Chenyi Qiang
2020-09-30  4:36     ` Sean Christopherson
2021-01-26 18:24       ` Paolo Bonzini
2021-01-26 19:56         ` Sean Christopherson
2021-01-26 20:05           ` Paolo Bonzini
2020-08-07  8:48 ` [RFC 7/7] KVM: VMX: Enable PKS for nested VM Chenyi Qiang
2020-08-11  0:05   ` Jim Mattson
2020-08-12 15:00     ` Sean Christopherson
2020-08-12 18:32       ` Jim Mattson
2020-08-13  4:52     ` Chenyi Qiang
2020-08-13 17:52       ` Jim Mattson
2020-08-14 10:07         ` Chenyi Qiang
2020-08-14 17:34           ` Jim Mattson
