* [V3 PATCH 0/9] x86/hvm: pkeys, add memory protection-key support
@ 2015-12-07  9:16 Huaitong Han
  2015-12-07  9:16 ` [V3 PATCH 1/9] x86/hvm: pkeys, add the flag to enable Memory Protection Keys Huaitong Han
                   ` (8 more replies)
  0 siblings, 9 replies; 44+ messages in thread
From: Huaitong Han @ 2015-12-07  9:16 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: Huaitong Han, xen-devel

Changes in v3:
*Make CPUID:OSPKE depend on the guest cpuid instead of the host hardware
capability, and move the cpuid patch to the end of the series.
*Move opt_pku to cpu/common.c.
*Use MASK_EXTR for get_pte_pkeys.
*Add parentheses to the pkru macros, and use static inline pkru_read functions.
*Rebase get_xsave_addr on the updated code, and add uncompressed-format
support for the xsave state.
*Use fpu_xsave instead of vcpu_save_fpu, and adjust the code style for
leaf_pte_pkeys_check.
*Add parentheses for PFEC_prot_key in the gva2gfn functions.

Changes in v2:
*Rebase all patches on the staging branch.
*Disable X86_CR4_PKE in the hypervisor, delete the pkru_read/write functions,
and read the xsave state to get the pkru value.
*Delete the patch that adds pkeys support for do_page_fault.
*Add pkeys support for gva2gfn so that setting _PAGE_PK_BIT in the return
value can get propagated to the guest correctly.

The protection-key feature provides an additional mechanism by which IA-32e
paging controls access to usermode addresses.

Hardware support for protection keys for user pages is enumerated with CPUID
feature flag CPUID.7.0.ECX[3]:PKU. Software support is enumerated with
CPUID.7.0.ECX[4]:OSPKE, which reflects the setting of CR4.PKE (bit 22).

When CR4.PKE = 1, every linear address is associated with the 4-bit protection
key located in bits 62:59 of the paging-structure entry that mapped the page
containing the linear address. The PKRU register determines, for each
protection key, whether user-mode addresses with that protection key may be
read or written.

The PKRU register (protection key rights for user pages) is a 32-bit register
with the following format: for each i (0 ≤ i ≤ 15), PKRU[2i] is the
access-disable bit for protection key i (ADi); PKRU[2i+1] is the write-disable
bit for protection key i (WDi).

Software can use the RDPKRU and WRPKRU instructions with ECX = 0 to read and
write PKRU. In addition, the PKRU register is XSAVE-managed state and can thus
be read and written by instructions in the XSAVE feature set.
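
As an illustration (a minimal sketch, not part of this series), such
accessors can be written in C using the raw opcode bytes, since older
assemblers do not know the RDPKRU/WRPKRU mnemonics:

    static inline unsigned int rdpkru(void)
    {
        unsigned int pkru, tmp;

        /* RDPKRU (0f 01 ee): ECX must be 0; PKRU returned in EAX, EDX zeroed. */
        asm volatile ( ".byte 0x0f,0x01,0xee"
                       : "=a" (pkru), "=d" (tmp) : "c" (0) );
        return pkru;
    }

    static inline void wrpkru(unsigned int pkru)
    {
        /* WRPKRU (0f 01 ef): ECX and EDX must be 0; EAX is written to PKRU. */
        asm volatile ( ".byte 0x0f,0x01,0xef"
                       : : "a" (pkru), "c" (0), "d" (0) );
    }

    /* Hypothetical helpers matching the ADi/WDi layout described above. */
    #define PKRU_AD(pkru, i) (((pkru) >> (2 * (i))) & 1)
    #define PKRU_WD(pkru, i) (((pkru) >> (2 * (i) + 1)) & 1)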

PFEC.PK (bit 5) is set to report protection-key violations.

The specification of protection keys can be found in the Intel SDM (Section 4.6.2, Volume 3):
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf.

Huaitong Han (9):
  x86/hvm: pkeys, add the flag to enable Memory Protection Keys
  x86/hvm: pkeys, add pkeys support when setting CR4
  x86/hvm: pkeys, disable pkeys for guests in non-paging mode
  x86/hvm: pkeys, add functions to get pkeys value from PTE
  x86/hvm: pkeys, add functions to support PKRU access
  x86/hvm: pkeys, add xstate support for pkeys
  x86/hvm: pkeys, add pkeys support for guest_walk_tables
  x86/hvm: pkeys, add pkeys support for gva2gfn functions
  x86/hvm: pkeys, add pkeys support for cpuid handling

 docs/misc/xen-command-line.markdown | 21 +++++++++++
 tools/libxc/xc_cpufeature.h         |  2 +
 tools/libxc/xc_cpuid_x86.c          |  6 ++-
 xen/arch/x86/cpu/common.c           | 10 ++++-
 xen/arch/x86/hvm/hvm.c              | 34 +++++++++++++----
 xen/arch/x86/hvm/vmx/vmx.c          | 11 +++---
 xen/arch/x86/i387.c                 |  2 +-
 xen/arch/x86/mm/guest_walk.c        | 73 +++++++++++++++++++++++++++++++++++++
 xen/arch/x86/xstate.c               |  7 +++-
 xen/include/asm-x86/cpufeature.h    |  6 ++-
 xen/include/asm-x86/guest_pt.h      |  7 ++++
 xen/include/asm-x86/hvm/hvm.h       |  2 +
 xen/include/asm-x86/i387.h          |  1 +
 xen/include/asm-x86/page.h          |  5 +++
 xen/include/asm-x86/processor.h     | 20 ++++++++++
 xen/include/asm-x86/x86_64/page.h   | 12 ++++++
 xen/include/asm-x86/xstate.h        |  4 +-
 17 files changed, 203 insertions(+), 20 deletions(-)

-- 
2.4.3



* [V3 PATCH 1/9] x86/hvm: pkeys, add the flag to enable Memory Protection Keys
  2015-12-07  9:16 [V3 PATCH 0/9] x86/hvm: pkeys, add memory protection-key support Huaitong Han
@ 2015-12-07  9:16 ` Huaitong Han
  2015-12-10 15:37   ` George Dunlap
  2015-12-07  9:16 ` [V3 PATCH 2/9] x86/hvm: pkeys, add pkeys support when setting CR4 Huaitong Han
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Huaitong Han @ 2015-12-07  9:16 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: Huaitong Han, xen-devel

This patch adds the flag to enable Memory Protection Keys.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 docs/misc/xen-command-line.markdown | 21 +++++++++++++++++++++
 xen/arch/x86/cpu/common.c           | 10 +++++++++-
 xen/include/asm-x86/cpufeature.h    |  6 +++++-
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index c103894..ef5ef6c 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1177,6 +1177,27 @@ This option can be specified more than once (up to 8 times at present).
 ### ple\_window
 > `= <integer>`
 
+### pku
+> `= <boolean>`
+
+> Default: `true`
+
+Flag to enable Memory Protection Keys.
+
+The protection-key feature provides an additional mechanism by which IA-32e
+paging controls access to usermode addresses.
+
+When CR4.PKE = 1, every linear address is associated with the 4-bit protection
+key located in bits 62:59 of the paging-structure entry that mapped the page
+containing the linear address. The PKRU register determines, for each
+protection key, whether user-mode addresses with that protection key may be
+read or written.
+
+The PKRU register (protection key rights for user pages) is a 32-bit register
+with the following format: for each i (0 ≤ i ≤ 15), PKRU[2i] is the
+access-disable bit for protection key i (ADi); PKRU[2i+1] is the write-disable
+bit for protection key i (WDi).
+
 ### psr (Intel)
 > `= List of ( cmt:<boolean> | rmid_max:<integer> | cat:<boolean> | cos_max:<integer> | cdp:<boolean> )`
 
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 310ec85..7d03e52 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -22,6 +22,10 @@ boolean_param("xsave", use_xsave);
 bool_t opt_arat = 1;
 boolean_param("arat", opt_arat);
 
+/* pku: Flag to enable Memory Protection Keys (default on). */
+bool_t opt_pku = 1;
+boolean_param("pku", opt_pku);
+
 unsigned int opt_cpuid_mask_ecx = ~0u;
 integer_param("cpuid_mask_ecx", opt_cpuid_mask_ecx);
 unsigned int opt_cpuid_mask_edx = ~0u;
@@ -270,7 +274,8 @@ static void __cpuinit generic_identify(struct cpuinfo_x86 *c)
 	if ( c->cpuid_level >= 0x00000007 )
 		cpuid_count(0x00000007, 0, &tmp,
 			    &c->x86_capability[cpufeat_word(X86_FEATURE_FSGSBASE)],
-			    &tmp, &tmp);
+			    &c->x86_capability[cpufeat_word(X86_FEATURE_PKU)],
+			    &tmp);
 }
 
 /*
@@ -323,6 +328,9 @@ void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
 	if ( cpu_has_xsave )
 		xstate_init(c);
 
+   	if ( !opt_pku )
+		setup_clear_cpu_cap(X86_FEATURE_PKU);
+
 	/*
 	 * The vendor-specific functions might have changed features.  Now
 	 * we do "generic changes."
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index af127cf..ef96514 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -11,7 +11,7 @@
 
 #include <xen/const.h>
 
-#define NCAPINTS	8	/* N 32-bit words worth of info */
+#define NCAPINTS	9	/* N 32-bit words worth of info */
 
 /* Intel-defined CPU features, CPUID level 0x00000001 (edx), word 0 */
 #define X86_FEATURE_FPU		(0*32+ 0) /* Onboard FPU */
@@ -163,6 +163,10 @@
 #define X86_FEATURE_ADX		(7*32+19) /* ADCX, ADOX instructions */
 #define X86_FEATURE_SMAP	(7*32+20) /* Supervisor Mode Access Prevention */
 
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 8 */
+#define X86_FEATURE_PKU	(8*32+ 3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE	(8*32+ 4) /* OS Protection Keys Enable */
+
 #define cpufeat_word(idx)	((idx) / 32)
 #define cpufeat_bit(idx)	((idx) % 32)
 #define cpufeat_mask(idx)	(_AC(1, U) << cpufeat_bit(idx))
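
A usage note: booting Xen with "pku=0" sets opt_pku to 0, so identify_cpu()
above calls setup_clear_cpu_cap(X86_FEATURE_PKU) and the feature bit is
dropped from x86_capability; the default ("pku=1") leaves the
hardware-enumerated bit intact.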
-- 
2.4.3



* [V3 PATCH 2/9] x86/hvm: pkeys, add pkeys support when setting CR4
  2015-12-07  9:16 [V3 PATCH 0/9] x86/hvm: pkeys, add memory protection-key support Huaitong Han
  2015-12-07  9:16 ` [V3 PATCH 1/9] x86/hvm: pkeys, add the flag to enable Memory Protection Keys Huaitong Han
@ 2015-12-07  9:16 ` Huaitong Han
  2015-12-07  9:16 ` [V3 PATCH 3/9] x86/hvm: pkeys, disable pkeys for guests in non-paging mode Huaitong Han
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Huaitong Han @ 2015-12-07  9:16 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: Huaitong Han, xen-devel

This patch adds pkeys support when setting CR4.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/hvm/hvm.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index db0aeba..59916ed 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1924,6 +1924,7 @@ static unsigned long hvm_cr4_guest_reserved_bits(const struct vcpu *v,
         leaf1_edx = boot_cpu_data.x86_capability[cpufeat_word(X86_FEATURE_VME)];
         leaf1_ecx = boot_cpu_data.x86_capability[cpufeat_word(X86_FEATURE_PCID)];
         leaf7_0_ebx = boot_cpu_data.x86_capability[cpufeat_word(X86_FEATURE_FSGSBASE)];
+        leaf7_0_ecx = boot_cpu_data.x86_capability[cpufeat_word(X86_FEATURE_PKU)];
     }
 
     return ~(unsigned long)
@@ -1959,7 +1960,9 @@ static unsigned long hvm_cr4_guest_reserved_bits(const struct vcpu *v,
              (leaf7_0_ebx & cpufeat_mask(X86_FEATURE_SMEP) ?
               X86_CR4_SMEP : 0) |
              (leaf7_0_ebx & cpufeat_mask(X86_FEATURE_SMAP) ?
-              X86_CR4_SMAP : 0));
+              X86_CR4_SMAP : 0) |
+              (leaf7_0_ecx & cpufeat_mask(X86_FEATURE_PKU) ?
+              X86_CR4_PKE : 0));
 }
 
 static int hvm_load_cpu_ctxt(struct domain *d, hvm_domain_context_t *h)
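
The practical effect (a usage note, not part of the patch):
hvm_cr4_guest_reserved_bits() computes the CR4 bits a guest may not set, so
once X86_CR4_PKE is removed from that mask for guests whose CPUID leaf 7
advertises PKU, a guest write of CR4.PKE is accepted rather than rejected as
a reserved bit.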
-- 
2.4.3


* [V3 PATCH 3/9] x86/hvm: pkeys, disable pkeys for guests in non-paging mode
  2015-12-07  9:16 [V3 PATCH 0/9] x86/hvm: pkeys, add memory protection-key support Huaitong Han
  2015-12-07  9:16 ` [V3 PATCH 1/9] x86/hvm: pkeys, add the flag to enable Memory Protection Keys Huaitong Han
  2015-12-07  9:16 ` [V3 PATCH 2/9] x86/hvm: pkeys, add pkeys support when setting CR4 Huaitong Han
@ 2015-12-07  9:16 ` Huaitong Han
  2015-12-07  9:16 ` [V3 PATCH 4/9] x86/hvm: pkeys, add functions to get pkeys value from PTE Huaitong Han
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Huaitong Han @ 2015-12-07  9:16 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: Huaitong Han, xen-devel

This patch disables pkeys for guests in non-paging mode. Hardware disables
pkeys when the CPU is in non-paging mode, but Xen always uses paging mode to
emulate guest non-paging mode; to emulate this behavior, pkeys need to be
manually disabled when the guest switches to non-paging mode.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/hvm/vmx/vmx.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 2581e97..7123912 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1380,12 +1380,13 @@ static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr)
         if ( !hvm_paging_enabled(v) )
         {
             /*
-             * SMEP/SMAP is disabled if CPU is in non-paging mode in hardware.
-             * However Xen always uses paging mode to emulate guest non-paging
-             * mode. To emulate this behavior, SMEP/SMAP needs to be manually
-             * disabled when guest VCPU is in non-paging mode.
+             * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
+             * hardware. However Xen always uses paging mode to emulate guest
+             * non-paging mode. To emulate this behavior, SMEP/SMAP/PKU needs
+             * to be manually disabled when guest VCPU is in non-paging mode.
              */
-            v->arch.hvm_vcpu.hw_cr[4] &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
+            v->arch.hvm_vcpu.hw_cr[4] &=
+                ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
         }
         __vmwrite(GUEST_CR4, v->arch.hvm_vcpu.hw_cr[4]);
         break;
-- 
2.4.3


* [V3 PATCH 4/9] x86/hvm: pkeys, add functions to get pkeys value from PTE
  2015-12-07  9:16 [V3 PATCH 0/9] x86/hvm: pkeys, add memory protection-key support Huaitong Han
                   ` (2 preceding siblings ...)
  2015-12-07  9:16 ` [V3 PATCH 3/9] x86/hvm: pkeys, disable pkeys for guests in non-paging mode Huaitong Han
@ 2015-12-07  9:16 ` Huaitong Han
  2015-12-10 15:48   ` George Dunlap
  2015-12-07  9:16 ` [V3 PATCH 5/9] x86/hvm: pkeys, add functions to support PKRU access Huaitong Han
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Huaitong Han @ 2015-12-07  9:16 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: Huaitong Han, xen-devel

This patch adds functions to get pkeys value from PTE.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
 xen/include/asm-x86/guest_pt.h    |  7 +++++++
 xen/include/asm-x86/page.h        |  5 +++++
 xen/include/asm-x86/x86_64/page.h | 12 ++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/xen/include/asm-x86/guest_pt.h b/xen/include/asm-x86/guest_pt.h
index 3447973..6b0af70 100644
--- a/xen/include/asm-x86/guest_pt.h
+++ b/xen/include/asm-x86/guest_pt.h
@@ -154,6 +154,13 @@ static inline u32 guest_l4e_get_flags(guest_l4e_t gl4e)
 { return l4e_get_flags(gl4e); }
 #endif
 
+static inline u32 guest_l1e_get_pkeys(guest_l1e_t gl1e)
+{ return l1e_get_pkeys(gl1e); }
+static inline u32 guest_l2e_get_pkeys(guest_l2e_t gl2e)
+{ return l2e_get_pkeys(gl2e); }
+static inline u32 guest_l3e_get_pkeys(guest_l3e_t gl3e)
+{ return l3e_get_pkeys(gl3e); }
+
 static inline guest_l1e_t guest_l1e_from_gfn(gfn_t gfn, u32 flags)
 { return l1e_from_pfn(gfn_x(gfn), flags); }
 static inline guest_l2e_t guest_l2e_from_gfn(gfn_t gfn, u32 flags)
diff --git a/xen/include/asm-x86/page.h b/xen/include/asm-x86/page.h
index a095a93..93a0db0 100644
--- a/xen/include/asm-x86/page.h
+++ b/xen/include/asm-x86/page.h
@@ -93,6 +93,11 @@
 #define l3e_get_flags(x)           (get_pte_flags((x).l3))
 #define l4e_get_flags(x)           (get_pte_flags((x).l4))
 
+/* Get pte pkeys (unsigned int). */
+#define l1e_get_pkeys(x)           (get_pte_pkeys((x).l1))
+#define l2e_get_pkeys(x)           (get_pte_pkeys((x).l2))
+#define l3e_get_pkeys(x)           (get_pte_pkeys((x).l3))
+
 /* Construct an empty pte. */
 #define l1e_empty()                ((l1_pgentry_t) { 0 })
 #define l2e_empty()                ((l2_pgentry_t) { 0 })
diff --git a/xen/include/asm-x86/x86_64/page.h b/xen/include/asm-x86/x86_64/page.h
index 19ab4d0..3ca489a 100644
--- a/xen/include/asm-x86/x86_64/page.h
+++ b/xen/include/asm-x86/x86_64/page.h
@@ -134,6 +134,18 @@ typedef l4_pgentry_t root_pgentry_t;
 #define get_pte_flags(x) (((int)((x) >> 40) & ~0xFFF) | ((int)(x) & 0xFFF))
 #define put_pte_flags(x) (((intpte_t)((x) & ~0xFFF) << 40) | ((x) & 0xFFF))
 
+/*
+ * Protection keys define a new 4-bit protection key field
+ * (PKEY) in bits 62:59 of leaf entries of the page tables.
+ * This corresponds to bits 22:19 of the 24-bit flags.
+ *
+ * Notice: Bit 22 is used by _PAGE_GNTTAB which is visible to PV guests,
+ * so Protection keys must be disabled on PV guests.
+ */
+#define _PAGE_PKEY_BITS  (0x780000)	 /* Protection Keys, 22:19 */
+
+#define get_pte_pkeys(x) (MASK_EXTR(get_pte_flags(x), _PAGE_PKEY_BITS))
+
 /* Bit 23 of a 24-bit flag mask. This corresponds to bit 63 of a pte.*/
 #define _PAGE_NX_BIT (1U<<23)
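
As a worked example (not part of the patch, and assuming Xen's MASK_EXTR(v, m)
expands to ((v) & (m)) / ((m) & -(m)), i.e. mask and shift right by the mask's
lowest set bit):

    /* A pte whose bits 62:59 hold protection key 5 has flag bits 22:19 set
     * to 0101b, so get_pte_flags(pte) & _PAGE_PKEY_BITS == 0x280000, and
     * get_pte_pkeys(pte) == 0x280000 / 0x80000 == 5. */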
 
-- 
2.4.3


* [V3 PATCH 5/9] x86/hvm: pkeys, add functions to support PKRU access
  2015-12-07  9:16 [V3 PATCH 0/9] x86/hvm: pkeys, add memory protection-key support Huaitong Han
                   ` (3 preceding siblings ...)
  2015-12-07  9:16 ` [V3 PATCH 4/9] x86/hvm: pkeys, add functions to get pkeys value from PTE Huaitong Han
@ 2015-12-07  9:16 ` Huaitong Han
  2015-12-10 18:48   ` Andrew Cooper
  2015-12-07  9:16 ` [V3 PATCH 6/9] x86/hvm: pkeys, add xstate support for pkeys Huaitong Han
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Huaitong Han @ 2015-12-07  9:16 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: Huaitong Han, xen-devel

This patch adds functions to support PKRU access.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
 xen/include/asm-x86/processor.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 3f8411f..c345787 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -342,6 +342,26 @@ static inline void write_cr4(unsigned long val)
     asm volatile ( "mov %0,%%cr4" : : "r" (val) );
 }
 
+/* Macros for PKRU domain */
+#define PKRU_READ  (0)
+#define PKRU_WRITE (1)
+#define PKRU_ATTRS (2)
+
+/*
+ * PKRU defines 32 bits, there are 16 domains and 2 attribute bits per
+ * domain in pkru, pkeys is index to a defined domain, so the value of
+ * pte_pkeys * PKRU_ATTRS + R/W is offset of a defined domain attribute.
+ */
+static inline bool_t read_pkru_ad(unsigned int pkru, unsigned int pkey)
+{
+    return (pkru >> (pkey * PKRU_ATTRS + PKRU_READ)) & 1;
+}
+
+static inline bool_t read_pkru_wd(unsigned int pkru, unsigned int pkey)
+{
+    return (pkru >> (pkey * PKRU_ATTRS + PKRU_WRITE)) & 1;
+}
+
 /* Clear and set 'TS' bit respectively */
 static inline void clts(void) 
 {
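
A usage example (hypothetical values, for illustration only): with pkru ==
0x0c00, i.e. bits 10 and 11 set, protection key 5 is both access-disabled
and write-disabled, since 5 * PKRU_ATTRS + PKRU_READ == 10 and
5 * PKRU_ATTRS + PKRU_WRITE == 11:

    read_pkru_ad(0x0c00, 5); /* == 1: bit 10 (AD5) is set */
    read_pkru_wd(0x0c00, 5); /* == 1: bit 11 (WD5) is set */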
-- 
2.4.3


* [V3 PATCH 6/9] x86/hvm: pkeys, add xstate support for pkeys
  2015-12-07  9:16 [V3 PATCH 0/9] x86/hvm: pkeys, add memory protection-key support Huaitong Han
                   ` (4 preceding siblings ...)
  2015-12-07  9:16 ` [V3 PATCH 5/9] x86/hvm: pkeys, add functions to support PKRU access Huaitong Han
@ 2015-12-07  9:16 ` Huaitong Han
  2015-12-10 17:39   ` George Dunlap
                     ` (2 more replies)
  2015-12-07  9:16 ` [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables Huaitong Han
                   ` (2 subsequent siblings)
  8 siblings, 3 replies; 44+ messages in thread
From: Huaitong Han @ 2015-12-07  9:16 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: Huaitong Han, xen-devel

This patch adds xstate support for pkeys.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
 xen/arch/x86/xstate.c        | 7 +++++--
 xen/include/asm-x86/xstate.h | 4 +++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index b65da38..db978c4 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -146,12 +146,15 @@ static void __init setup_xstate_comp(void)
     }
 }
 
-static void *get_xsave_addr(void *xsave, unsigned int xfeature_idx)
+void *get_xsave_addr(void *xsave, unsigned int xfeature_idx)
 {
     if ( !((1ul << xfeature_idx) & xfeature_mask) )
         return NULL;
 
-    return xsave + xstate_comp_offsets[xfeature_idx];
+    if ( xsave_area_compressed(xsave) )
+        return xsave + xstate_comp_offsets[xfeature_idx];
+    else
+        return xsave + xstate_offsets[xfeature_idx];
 }
 
 void expand_xsave_states(struct vcpu *v, void *dest, unsigned int size)
diff --git a/xen/include/asm-x86/xstate.h b/xen/include/asm-x86/xstate.h
index 12d939b..6536813 100644
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -34,13 +34,14 @@
 #define XSTATE_OPMASK  (1ULL << 5)
 #define XSTATE_ZMM     (1ULL << 6)
 #define XSTATE_HI_ZMM  (1ULL << 7)
+#define XSTATE_PKRU    (1ULL << 9)
 #define XSTATE_LWP     (1ULL << 62) /* AMD lightweight profiling */
 #define XSTATE_FP_SSE  (XSTATE_FP | XSTATE_SSE)
 #define XCNTXT_MASK    (XSTATE_FP | XSTATE_SSE | XSTATE_YMM | XSTATE_OPMASK | \
                         XSTATE_ZMM | XSTATE_HI_ZMM | XSTATE_NONLAZY)
 
 #define XSTATE_ALL     (~(1ULL << 63))
-#define XSTATE_NONLAZY (XSTATE_LWP | XSTATE_BNDREGS | XSTATE_BNDCSR)
+#define XSTATE_NONLAZY (XSTATE_LWP | XSTATE_BNDREGS | XSTATE_BNDCSR | XSTATE_PKRU)
 #define XSTATE_LAZY    (XSTATE_ALL & ~XSTATE_NONLAZY)
 #define XSTATE_COMPACTION_ENABLED  (1ULL << 63)
 
@@ -90,6 +91,7 @@ uint64_t get_msr_xss(void);
 void xsave(struct vcpu *v, uint64_t mask);
 void xrstor(struct vcpu *v, uint64_t mask);
 bool_t xsave_enabled(const struct vcpu *v);
+void *get_xsave_addr(void *xsave, unsigned int xfeature_idx);
 int __must_check validate_xstate(u64 xcr0, u64 xcr0_accum, u64 xstate_bv);
 int __must_check handle_xsetbv(u32 index, u64 new_bv);
 void expand_xsave_states(struct vcpu *v, void *dest, unsigned int size);
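
For context: in the standard (uncompressed) XSAVE layout, each state
component sits at a fixed, architecturally enumerated offset, which is what
xstate_offsets caches. A sketch of how such a table can be filled, assuming
(per the SDM) that CPUID.(EAX=0xD, ECX=i).EBX reports the standard-format
offset of component i:

    unsigned int eax, ebx, ecx, edx, i;

    for ( i = 2; i < 63; i++ )
        if ( xfeature_mask & (1ULL << i) )
        {
            cpuid_count(0xd, i, &eax, &ebx, &ecx, &edx);
            xstate_offsets[i] = ebx; /* fixed offset in the uncompressed area */
        }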
-- 
2.4.3


* [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-07  9:16 [V3 PATCH 0/9] x86/hvm: pkeys, add memory protection-key support Huaitong Han
                   ` (5 preceding siblings ...)
  2015-12-07  9:16 ` [V3 PATCH 6/9] x86/hvm: pkeys, add xstate support for pkeys Huaitong Han
@ 2015-12-07  9:16 ` Huaitong Han
  2015-12-10 18:19   ` George Dunlap
  2015-12-10 18:59   ` Andrew Cooper
  2015-12-07  9:16 ` [V3 PATCH 8/9] x86/hvm: pkeys, add pkeys support for gva2gfn functions Huaitong Han
  2015-12-07  9:16 ` [V3 PATCH 9/9] x86/hvm: pkeys, add pkeys support for cpuid handling Huaitong Han
  8 siblings, 2 replies; 44+ messages in thread
From: Huaitong Han @ 2015-12-07  9:16 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: Huaitong Han, xen-devel

This patch adds pkeys support for guest_walk_tables.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
 xen/arch/x86/i387.c           |  2 +-
 xen/arch/x86/mm/guest_walk.c  | 73 +++++++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/hvm.h |  2 ++
 xen/include/asm-x86/i387.h    |  1 +
 4 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
index b661d39..83c8465 100644
--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -132,7 +132,7 @@ static inline uint64_t vcpu_xsave_mask(const struct vcpu *v)
 }
 
 /* Save x87 extended state */
-static inline void fpu_xsave(struct vcpu *v)
+void fpu_xsave(struct vcpu *v)
 {
     bool_t ok;
     uint64_t mask = vcpu_xsave_mask(v);
diff --git a/xen/arch/x86/mm/guest_walk.c b/xen/arch/x86/mm/guest_walk.c
index 18d1acf..e79f72f 100644
--- a/xen/arch/x86/mm/guest_walk.c
+++ b/xen/arch/x86/mm/guest_walk.c
@@ -31,6 +31,8 @@ asm(".file \"" __OBJECT_FILE__ "\"");
 #include <xen/sched.h>
 #include <asm/page.h>
 #include <asm/guest_pt.h>
+#include <asm/xstate.h>
+#include <asm/i387.h>
 
 extern const uint32_t gw_page_flags[];
 #if GUEST_PAGING_LEVELS == CONFIG_PAGING_LEVELS
@@ -90,6 +92,61 @@ static uint32_t set_ad_bits(void *guest_p, void *walk_p, int set_dirty)
     return 0;
 }
 
+#if GUEST_PAGING_LEVELS >= CONFIG_PAGING_LEVELS
+bool_t leaf_pte_pkeys_check(struct vcpu *vcpu, uint32_t pfec,
+                uint32_t pte_access, uint32_t pte_pkeys)
+{
+    void *xsave_addr;
+    unsigned int pkru = 0;
+    bool_t pkru_ad, pkru_wd;
+
+    bool_t uf = !!(pfec & PFEC_user_mode);
+    bool_t wf = !!(pfec & PFEC_write_access);
+    bool_t ff = !!(pfec & PFEC_insn_fetch);
+    bool_t rsvdf = !!(pfec & PFEC_reserved_bit);
+    bool_t pkuf  = !!(pfec & PFEC_prot_key);
+
+    if ( !cpu_has_xsave || !pkuf || is_pv_vcpu(vcpu) )
+        return 0;
+
+    /* PKRU dom0 is always zero */
+    if ( likely(!pte_pkeys) )
+        return 0;
+
+    /* Update vcpu xsave area */
+    fpu_xsave(vcpu);
+    xsave_addr = get_xsave_addr(vcpu->arch.xsave_area, fls64(XSTATE_PKRU)-1);
+    if ( !!xsave_addr )
+        memcpy(&pkru, xsave_addr, sizeof(pkru));
+
+    if ( unlikely(pkru) )
+    {
+        /*
+         * PKU:  additional mechanism by which the paging controls
+         * access to user-mode addresses based on the value in the
+         * PKRU register. A fault is considered as a PKU violation if all
+         * of the following conditions are true:
+         * 1.CR4_PKE=1.
+         * 2.EFER_LMA=1.
+         * 3.page is present with no reserved bit violations.
+         * 4.the access is not an instruction fetch.
+         * 5.the access is to a user page.
+         * 6.PKRU.AD=1
+         *       or The access is a data write and PKRU.WD=1
+         *            and either CR0.WP=1 or it is a user access.
+         */
+        pkru_ad = read_pkru_ad(pkru, pte_pkeys);
+        pkru_wd = read_pkru_wd(pkru, pte_pkeys);
+        if ( hvm_pku_enabled(vcpu) && hvm_long_mode_enabled(vcpu) &&
+            !rsvdf && !ff && (pkru_ad ||
+            (pkru_wd && wf && (hvm_wp_enabled(vcpu) || uf))))
+            return 1;
+    }
+
+    return 0;
+}
+#endif
+
 /* Walk the guest pagetables, after the manner of a hardware walker. */
 /* Because the walk is essentially random, it can cause a deadlock 
  * warning in the p2m locking code. Highly unlikely this is an actual
@@ -106,6 +163,7 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
 #if GUEST_PAGING_LEVELS >= 4 /* 64-bit only... */
     guest_l3e_t *l3p = NULL;
     guest_l4e_t *l4p;
+    unsigned int pkeys;
 #endif
     uint32_t gflags, mflags, iflags, rc = 0;
     bool_t smep = 0, smap = 0;
@@ -190,6 +248,7 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
         goto out;
     /* Get the l3e and check its flags*/
     gw->l3e = l3p[guest_l3_table_offset(va)];
+    pkeys = guest_l3e_get_pkeys(gw->l3e);
     gflags = guest_l3e_get_flags(gw->l3e) ^ iflags;
     if ( !(gflags & _PAGE_PRESENT) ) {
         rc |= _PAGE_PRESENT;
@@ -199,6 +258,9 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
     
     pse1G = (gflags & _PAGE_PSE) && guest_supports_1G_superpages(v); 
 
+    if ( pse1G && leaf_pte_pkeys_check(v, pfec, gflags, pkeys) )
+        rc |= _PAGE_PKEY_BITS;
+
     if ( pse1G )
     {
         /* Generate a fake l1 table entry so callers don't all 
@@ -270,6 +332,12 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
 
     pse2M = (gflags & _PAGE_PSE) && guest_supports_superpages(v); 
 
+#if GUEST_PAGING_LEVELS >= 4
+    pkeys = guest_l2e_get_pkeys(gw->l2e);
+    if ( pse2M && leaf_pte_pkeys_check(v, pfec, gflags, pkeys) )
+        rc |= _PAGE_PKEY_BITS;
+#endif
+
     if ( pse2M )
     {
         /* Special case: this guest VA is in a PSE superpage, so there's
@@ -330,6 +398,11 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
             goto out;
         }
         rc |= ((gflags & mflags) ^ mflags);
+#if GUEST_PAGING_LEVELS >= 4
+        pkeys = guest_l1e_get_pkeys(gw->l1e);
+        if ( leaf_pte_pkeys_check(v, pfec, gflags, pkeys) )
+            rc |= _PAGE_PKEY_BITS;
+#endif
     }
 
 #if GUEST_PAGING_LEVELS >= 4 /* 64-bit only... */
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index f80e143..79b3421 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -275,6 +275,8 @@ int hvm_girq_dest_2_vcpu_id(struct domain *d, uint8_t dest, uint8_t dest_mode);
     (hvm_paging_enabled(v) && ((v)->arch.hvm_vcpu.guest_cr[4] & X86_CR4_SMAP))
 #define hvm_nx_enabled(v) \
     (!!((v)->arch.hvm_vcpu.guest_efer & EFER_NX))
+#define hvm_pku_enabled(v) \
+    (hvm_paging_enabled(v) && ((v)->arch.hvm_vcpu.guest_cr[4] & X86_CR4_PKE))
 
 /* Can we use superpages in the HAP p2m table? */
 #define hap_has_1gb (!!(hvm_funcs.hap_capabilities & HVM_HAP_SUPERPAGE_1GB))
diff --git a/xen/include/asm-x86/i387.h b/xen/include/asm-x86/i387.h
index 7cfa215..c4aee70 100644
--- a/xen/include/asm-x86/i387.h
+++ b/xen/include/asm-x86/i387.h
@@ -30,6 +30,7 @@ struct ix87_env {
 
 void vcpu_restore_fpu_eager(struct vcpu *v);
 void vcpu_restore_fpu_lazy(struct vcpu *v);
+void fpu_xsave(struct vcpu *v);
 void vcpu_save_fpu(struct vcpu *v);
 void save_fpu_enable(void);
 
-- 
2.4.3
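
The predicate in leaf_pte_pkeys_check() can be read in one piece; a condensed
restatement (a sketch with the same logic, the vcpu-state checks hoisted into
parameters, names chosen here for illustration):

    static bool_t pkey_fault(bool_t pke, bool_t lma, bool_t rsvdf, bool_t ff,
                             bool_t wp, bool_t uf, bool_t wf,
                             bool_t pkru_ad, bool_t pkru_wd)
    {
        if ( !pke || !lma || rsvdf || ff )
            return 0; /* PKU checks only apply to long-mode data accesses */
        return pkru_ad || (pkru_wd && wf && (wp || uf));
    }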


* [V3 PATCH 8/9] x86/hvm: pkeys, add pkeys support for gva2gfn functions
  2015-12-07  9:16 [V3 PATCH 0/9] x86/hvm: pkeys, add memory protection-key support Huaitong Han
                   ` (6 preceding siblings ...)
  2015-12-07  9:16 ` [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables Huaitong Han
@ 2015-12-07  9:16 ` Huaitong Han
  2015-12-07  9:16 ` [V3 PATCH 9/9] x86/hvm: pkeys, add pkeys support for cpuid handling Huaitong Han
  8 siblings, 0 replies; 44+ messages in thread
From: Huaitong Han @ 2015-12-07  9:16 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: Huaitong Han, xen-devel

This patch adds pkeys support for the gva2gfn functions.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
 xen/arch/x86/hvm/hvm.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 59916ed..b88f381 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4319,7 +4319,8 @@ static enum hvm_copy_result __hvm_clear(paddr_t addr, int size)
     p2m_type_t p2mt;
     char *p;
     int count, todo = size;
-    uint32_t pfec = PFEC_page_present | PFEC_write_access;
+    uint32_t pfec = PFEC_page_present | PFEC_write_access |
+        (hvm_pku_enabled(curr) ? PFEC_prot_key : 0);
 
     /*
      * XXX Disable for 4.1.0: PV-on-HVM drivers will do grant-table ops
@@ -4420,7 +4421,8 @@ enum hvm_copy_result hvm_copy_to_guest_virt(
 {
     return __hvm_copy(buf, vaddr, size,
                       HVMCOPY_to_guest | HVMCOPY_fault | HVMCOPY_virt,
-                      PFEC_page_present | PFEC_write_access | pfec);
+                      PFEC_page_present | PFEC_write_access | pfec |
+                      (hvm_pku_enabled(current) ? PFEC_prot_key : 0));
 }
 
 enum hvm_copy_result hvm_copy_from_guest_virt(
@@ -4428,7 +4430,8 @@ enum hvm_copy_result hvm_copy_from_guest_virt(
 {
     return __hvm_copy(buf, vaddr, size,
                       HVMCOPY_from_guest | HVMCOPY_fault | HVMCOPY_virt,
-                      PFEC_page_present | pfec);
+                      PFEC_page_present | pfec |
+                      (hvm_pku_enabled(current) ? PFEC_prot_key : 0));
 }
 
 enum hvm_copy_result hvm_fetch_from_guest_virt(
@@ -4446,7 +4449,8 @@ enum hvm_copy_result hvm_copy_to_guest_virt_nofault(
 {
     return __hvm_copy(buf, vaddr, size,
                       HVMCOPY_to_guest | HVMCOPY_no_fault | HVMCOPY_virt,
-                      PFEC_page_present | PFEC_write_access | pfec);
+                      PFEC_page_present | PFEC_write_access | pfec |
+                      (hvm_pku_enabled(current) ? PFEC_prot_key : 0));
 }
 
 enum hvm_copy_result hvm_copy_from_guest_virt_nofault(
@@ -4454,7 +4458,8 @@ enum hvm_copy_result hvm_copy_from_guest_virt_nofault(
 {
     return __hvm_copy(buf, vaddr, size,
                       HVMCOPY_from_guest | HVMCOPY_no_fault | HVMCOPY_virt,
-                      PFEC_page_present | pfec);
+                      PFEC_page_present | pfec |
+                      (hvm_pku_enabled(current) ? PFEC_prot_key : 0));
 }
 
 enum hvm_copy_result hvm_fetch_from_guest_virt_nofault(
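
Since the same conditional expression is now repeated in each caller, one
possible refactoring (a sketch, not part of this series) is a small helper:

    /* Hypothetical helper; hvm_pku_enabled() comes from an earlier patch. */
    static inline uint32_t hvm_pfec_prot_key(const struct vcpu *v)
    {
        return hvm_pku_enabled(v) ? PFEC_prot_key : 0;
    }

so that each call site becomes, e.g.,
PFEC_page_present | PFEC_write_access | pfec | hvm_pfec_prot_key(current).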
-- 
2.4.3


* [V3 PATCH 9/9] x86/hvm: pkeys, add pkeys support for cpuid handling
  2015-12-07  9:16 [V3 PATCH 0/9] x86/hvm: pkeys, add memory protection-key support Huaitong Han
                   ` (7 preceding siblings ...)
  2015-12-07  9:16 ` [V3 PATCH 8/9] x86/hvm: pkeys, add pkeys support for gva2gfn functions Huaitong Han
@ 2015-12-07  9:16 ` Huaitong Han
  2015-12-11  9:47   ` Jan Beulich
  8 siblings, 1 reply; 44+ messages in thread
From: Huaitong Han @ 2015-12-07  9:16 UTC (permalink / raw)
  To: jbeulich, andrew.cooper3, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: Huaitong Han, xen-devel

This patch adds pkeys support for cpuid handling.

Hardware support for pkeys is enumerated with CPUID.7.0.ECX[3]:PKU; software
support is CPUID.7.0.ECX[4]:OSPKE, which reflects the setting of CR4.PKE.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
 tools/libxc/xc_cpufeature.h |  2 ++
 tools/libxc/xc_cpuid_x86.c  |  6 ++++--
 xen/arch/x86/hvm/hvm.c      | 14 +++++++++++++-
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/tools/libxc/xc_cpufeature.h b/tools/libxc/xc_cpufeature.h
index c3ddc80..f6a9778 100644
--- a/tools/libxc/xc_cpufeature.h
+++ b/tools/libxc/xc_cpufeature.h
@@ -141,5 +141,7 @@
 #define X86_FEATURE_ADX         19 /* ADCX, ADOX instructions */
 #define X86_FEATURE_SMAP        20 /* Supervisor Mode Access Protection */
 
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx) */
+#define X86_FEATURE_PKU     3
 
 #endif /* __LIBXC_CPUFEATURE_H */
diff --git a/tools/libxc/xc_cpuid_x86.c b/tools/libxc/xc_cpuid_x86.c
index 8882c01..1ce979b 100644
--- a/tools/libxc/xc_cpuid_x86.c
+++ b/tools/libxc/xc_cpuid_x86.c
@@ -427,9 +427,11 @@ static void xc_cpuid_hvm_policy(xc_interface *xch,
                         bitmaskof(X86_FEATURE_ADX)  |
                         bitmaskof(X86_FEATURE_SMAP) |
                         bitmaskof(X86_FEATURE_FSGSBASE));
+            regs[2] &= bitmaskof(X86_FEATURE_PKU);
         } else
-            regs[1] = 0;
-        regs[0] = regs[2] = regs[3] = 0;
+            regs[1] = regs[2] = 0;
+
+        regs[0] = regs[3] = 0;
         break;
 
     case 0x0000000d:
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index b88f381..d7b3b43 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4577,7 +4577,7 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
             __clear_bit(X86_FEATURE_APIC & 31, edx);
 
         /* Fix up OSXSAVE. */
-        if ( cpu_has_xsave )
+        if ( *ecx & cpufeat_mask(X86_FEATURE_XSAVE) )
             *ecx |= (v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_OSXSAVE) ?
                      cpufeat_mask(X86_FEATURE_OSXSAVE) : 0;
 
@@ -4605,6 +4605,18 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
         /* Don't expose INVPCID to non-hap hvm. */
         if ( (count == 0) && !hap_enabled(d) )
             *ebx &= ~cpufeat_mask(X86_FEATURE_INVPCID);
+
+        /* X86_FEATURE_PKU is not yet implemented for shadow paging.
+         *
+         * The hypervisor gets the guest pkru value from the XSAVE state,
+         * because RDPKRU is disabled when X86_CR4_PKE is clear in CR4.
+         */
+        if ( (count == 0) && (!hap_enabled(d) || !cpu_has_xsave) )
+            *ecx &= ~cpufeat_mask(X86_FEATURE_PKU);
+
+        if ( (count == 0) && (*ecx & cpufeat_mask(X86_FEATURE_PKU)) )
+            *ecx |= (v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_PKE) ?
+                     cpufeat_mask(X86_FEATURE_OSPKE) : 0;
         break;
     case 0xb:
         /* Fix the x2APIC identifier. */
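
With the hunks above, the intended guest view of CPUID.7.0:ECX would be, for
example (assuming the toolstack policy exposes PKU in the first place):

    /* HAP guest, xsave present, guest CR4.PKE set:  ECX[3]=1 (PKU), ECX[4]=1 (OSPKE) */
    /* Same guest before it sets CR4.PKE:            ECX[3]=1 (PKU), ECX[4]=0         */
    /* Shadow-paging guest, or no xsave:             ECX[3]=0,       ECX[4]=0         */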
-- 
2.4.3


* Re: [V3 PATCH 1/9] x86/hvm: pkeys, add the flag to enable Memory Protection Keys
  2015-12-07  9:16 ` [V3 PATCH 1/9] x86/hvm: pkeys, add the flag to enable Memory Protection Keys Huaitong Han
@ 2015-12-10 15:37   ` George Dunlap
  0 siblings, 0 replies; 44+ messages in thread
From: George Dunlap @ 2015-12-10 15:37 UTC (permalink / raw)
  To: Huaitong Han, jbeulich, andrew.cooper3, jun.nakajima, eddie.dong,
	kevin.tian, george.dunlap, ian.jackson, stefano.stabellini,
	ian.campbell, wei.liu2, keir
  Cc: xen-devel

On 07/12/15 09:16, Huaitong Han wrote:
> This patch adds the flag to enable Memory Protection Keys.
> 
> Signed-off-by: Huaitong Han <huaitong.han@intel.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  docs/misc/xen-command-line.markdown | 21 +++++++++++++++++++++
>  xen/arch/x86/cpu/common.c           | 10 +++++++++-
>  xen/include/asm-x86/cpufeature.h    |  6 +++++-
>  3 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index c103894..ef5ef6c 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -1177,6 +1177,27 @@ This option can be specified more than once (up to 8 times at present).
>  ### ple\_window
>  > `= <integer>`
>  
> +### pku
> +> `= <boolean>`
> +
> +> Default: `true`
> +
> +Flag to enable Memory Protection Keys.
> +
> +The protection-key feature provides an additional mechanism by which IA-32e
> +paging controls access to usermode addresses.
> +
> +When CR4.PKE = 1, every linear address is associated with the 4-bit protection
> +key located in bits 62:59 of the paging-structure entry that mapped the page
> +containing the linear address. The PKRU register determines, for each
> +protection key, whether user-mode addresses with that protection key may be
> +read or written.
> +
> +The PKRU register (protection key rights for user pages) is a 32-bit register
> +with the following format: for each i (0 ≤ i ≤ 15), PKRU[2i] is the
> +access-disable bit for protection key i (ADi); PKRU[2i+1] is the write-disable
> +bit for protection key i (WDi).

These last two paragraphs are awfully technically detailed for a
command-line reference.  I think the first two paragraphs would be
sufficient.

 -George

> +
>  ### psr (Intel)
>  > `= List of ( cmt:<boolean> | rmid_max:<integer> | cat:<boolean> | cos_max:<integer> | cdp:<boolean> )`
>  
> diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
> index 310ec85..7d03e52 100644
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -22,6 +22,10 @@ boolean_param("xsave", use_xsave);
>  bool_t opt_arat = 1;
>  boolean_param("arat", opt_arat);
>  
> +/* pku: Flag to enable Memory Protection Keys (default on). */
> +bool_t opt_pku = 1;
> +boolean_param("pku", opt_pku);
> +
>  unsigned int opt_cpuid_mask_ecx = ~0u;
>  integer_param("cpuid_mask_ecx", opt_cpuid_mask_ecx);
>  unsigned int opt_cpuid_mask_edx = ~0u;
> @@ -270,7 +274,8 @@ static void __cpuinit generic_identify(struct cpuinfo_x86 *c)
>  	if ( c->cpuid_level >= 0x00000007 )
>  		cpuid_count(0x00000007, 0, &tmp,
>  			    &c->x86_capability[cpufeat_word(X86_FEATURE_FSGSBASE)],
> -			    &tmp, &tmp);
> +			    &c->x86_capability[cpufeat_word(X86_FEATURE_PKU)],
> +			    &tmp);
>  }
>  
>  /*
> @@ -323,6 +328,9 @@ void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
>  	if ( cpu_has_xsave )
>  		xstate_init(c);
>  
> +   	if ( !opt_pku )
> +		setup_clear_cpu_cap(X86_FEATURE_PKU);
> +
>  	/*
>  	 * The vendor-specific functions might have changed features.  Now
>  	 * we do "generic changes."
> diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
> index af127cf..ef96514 100644
> --- a/xen/include/asm-x86/cpufeature.h
> +++ b/xen/include/asm-x86/cpufeature.h
> @@ -11,7 +11,7 @@
>  
>  #include <xen/const.h>
>  
> -#define NCAPINTS	8	/* N 32-bit words worth of info */
> +#define NCAPINTS	9	/* N 32-bit words worth of info */
>  
>  /* Intel-defined CPU features, CPUID level 0x00000001 (edx), word 0 */
>  #define X86_FEATURE_FPU		(0*32+ 0) /* Onboard FPU */
> @@ -163,6 +163,10 @@
>  #define X86_FEATURE_ADX		(7*32+19) /* ADCX, ADOX instructions */
>  #define X86_FEATURE_SMAP	(7*32+20) /* Supervisor Mode Access Prevention */
>  
> +/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 8 */
> +#define X86_FEATURE_PKU	(8*32+ 3) /* Protection Keys for Userspace */
> +#define X86_FEATURE_OSPKE	(8*32+ 4) /* OS Protection Keys Enable */
> +
>  #define cpufeat_word(idx)	((idx) / 32)
>  #define cpufeat_bit(idx)	((idx) % 32)
>  #define cpufeat_mask(idx)	(_AC(1, U) << cpufeat_bit(idx))
> 



* Re: [V3 PATCH 4/9] x86/hvm: pkeys, add functions to get pkeys value from PTE
  2015-12-07  9:16 ` [V3 PATCH 4/9] x86/hvm: pkeys, add functions to get pkeys value from PTE Huaitong Han
@ 2015-12-10 15:48   ` George Dunlap
  2015-12-10 18:47     ` Andrew Cooper
  0 siblings, 1 reply; 44+ messages in thread
From: George Dunlap @ 2015-12-10 15:48 UTC (permalink / raw)
  To: Huaitong Han, jbeulich, andrew.cooper3, jun.nakajima, eddie.dong,
	kevin.tian, george.dunlap, ian.jackson, stefano.stabellini,
	ian.campbell, wei.liu2, keir
  Cc: xen-devel

On 07/12/15 09:16, Huaitong Han wrote:
> This patch adds functions to get pkeys value from PTE.
> 
> Signed-off-by: Huaitong Han <huaitong.han@intel.com>
> ---
>  xen/include/asm-x86/guest_pt.h    |  7 +++++++
>  xen/include/asm-x86/page.h        |  5 +++++
>  xen/include/asm-x86/x86_64/page.h | 12 ++++++++++++
>  3 files changed, 24 insertions(+)
> 
> diff --git a/xen/include/asm-x86/guest_pt.h b/xen/include/asm-x86/guest_pt.h
> index 3447973..6b0af70 100644
> --- a/xen/include/asm-x86/guest_pt.h
> +++ b/xen/include/asm-x86/guest_pt.h
> @@ -154,6 +154,13 @@ static inline u32 guest_l4e_get_flags(guest_l4e_t gl4e)
>  { return l4e_get_flags(gl4e); }
>  #endif
>  
> +static inline u32 guest_l1e_get_pkeys(guest_l1e_t gl1e)
> +{ return l1e_get_pkeys(gl1e); }
> +static inline u32 guest_l2e_get_pkeys(guest_l2e_t gl2e)
> +{ return l2e_get_pkeys(gl2e); }
> +static inline u32 guest_l3e_get_pkeys(guest_l3e_t gl3e)
> +{ return l3e_get_pkeys(gl3e); }
> +
>  static inline guest_l1e_t guest_l1e_from_gfn(gfn_t gfn, u32 flags)
>  { return l1e_from_pfn(gfn_x(gfn), flags); }
>  static inline guest_l2e_t guest_l2e_from_gfn(gfn_t gfn, u32 flags)
> diff --git a/xen/include/asm-x86/page.h b/xen/include/asm-x86/page.h
> index a095a93..93a0db0 100644
> --- a/xen/include/asm-x86/page.h
> +++ b/xen/include/asm-x86/page.h
> @@ -93,6 +93,11 @@
>  #define l3e_get_flags(x)           (get_pte_flags((x).l3))
>  #define l4e_get_flags(x)           (get_pte_flags((x).l4))
>  
> +/* Get pte pkeys (unsigned int). */
> +#define l1e_get_pkeys(x)           (get_pte_pkeys((x).l1))
> +#define l2e_get_pkeys(x)           (get_pte_pkeys((x).l2))
> +#define l3e_get_pkeys(x)           (get_pte_pkeys((x).l3))
> +
>  /* Construct an empty pte. */
>  #define l1e_empty()                ((l1_pgentry_t) { 0 })
>  #define l2e_empty()                ((l2_pgentry_t) { 0 })
> diff --git a/xen/include/asm-x86/x86_64/page.h b/xen/include/asm-x86/x86_64/page.h
> index 19ab4d0..3ca489a 100644
> --- a/xen/include/asm-x86/x86_64/page.h
> +++ b/xen/include/asm-x86/x86_64/page.h
> @@ -134,6 +134,18 @@ typedef l4_pgentry_t root_pgentry_t;
>  #define get_pte_flags(x) (((int)((x) >> 40) & ~0xFFF) | ((int)(x) & 0xFFF))
>  #define put_pte_flags(x) (((intpte_t)((x) & ~0xFFF) << 40) | ((x) & 0xFFF))
>  
> +/*
> + * Protection keys define a new 4-bit protection key field
> + * (PKEY) in bits 62:59 of leaf entries of the page tables.
> + * This corresponds to bits 22:19 of the 24-bit flags.
> + *
> + * Notice: Bit 22 is used by _PAGE_GNTTAB which is visible to PV guests,
> + * so Protection keys must be disabled on PV guests.
> + */
> +#define _PAGE_PKEY_BITS  (0x780000)	 /* Protection Keys, 22:19 */
> +
> +#define get_pte_pkeys(x) (MASK_EXTR(get_pte_flags(x), _PAGE_PKEY_BITS))

Sorry if I'm getting nit-picky here, but any given pte will only have a
single pkey, right?  Would it be better if these were "get_pte_pkey()"
(i.e., singular)?

 -George


* Re: [V3 PATCH 6/9] x86/hvm: pkeys, add xstate support for pkeys
  2015-12-07  9:16 ` [V3 PATCH 6/9] x86/hvm: pkeys, add xstate support for pkeys Huaitong Han
@ 2015-12-10 17:39   ` George Dunlap
  2015-12-10 18:57   ` Andrew Cooper
  2015-12-11  9:36   ` Jan Beulich
  2 siblings, 0 replies; 44+ messages in thread
From: George Dunlap @ 2015-12-10 17:39 UTC (permalink / raw)
  To: Huaitong Han, jbeulich, andrew.cooper3, jun.nakajima, eddie.dong,
	kevin.tian, george.dunlap, ian.jackson, stefano.stabellini,
	ian.campbell, wei.liu2, keir
  Cc: xen-devel

On 07/12/15 09:16, Huaitong Han wrote:
> This patch adds xstate support for pkeys.

Hey Huaitong,

Hope you don't mind me giving you a little feedback on the way you've
broken down your patches here.  The purpose for breaking a change down
into separate patches like this is to make it easier for people
reviewing (and for people later who come and look at the commits) to
figure out what's going on.  But if you break the patch series down too
much, you go back to making things harder again.

Take this patch, for instance.  You're making get_xsave_addr() global
(non-static), and also making it handle the case where the xsave area
was compressed (which it didn't before).

But as a reviewer, I don't know at this point who is calling
get_xsave_addr() or why; I don't know if this change is necessary, or if
it is correct for all the callers (or if the change wrt compressed xsave
areas is even necessary).

So one problem with this patch is that you don't include that
information in the description.  But part of it is that the change
doesn't really make much sense by itself.

I think I'd squash patches 4-7 into a single patch.

More comments about the approach in patch 7.

 -George


> 
> Signed-off-by: Huaitong Han <huaitong.han@intel.com>
> ---
>  xen/arch/x86/xstate.c        | 7 +++++--
>  xen/include/asm-x86/xstate.h | 4 +++-
>  2 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
> index b65da38..db978c4 100644
> --- a/xen/arch/x86/xstate.c
> +++ b/xen/arch/x86/xstate.c
> @@ -146,12 +146,15 @@ static void __init setup_xstate_comp(void)
>      }
>  }
>  
> -static void *get_xsave_addr(void *xsave, unsigned int xfeature_idx)
> +void *get_xsave_addr(void *xsave, unsigned int xfeature_idx)
>  {
>      if ( !((1ul << xfeature_idx) & xfeature_mask) )
>          return NULL;
>  
> -    return xsave + xstate_comp_offsets[xfeature_idx];
> +    if ( xsave_area_compressed(xsave) )
> +        return xsave + xstate_comp_offsets[xfeature_idx];
> +    else
> +        return xsave + xstate_offsets[xfeature_idx];
>  }
>  
>  void expand_xsave_states(struct vcpu *v, void *dest, unsigned int size)
> diff --git a/xen/include/asm-x86/xstate.h b/xen/include/asm-x86/xstate.h
> index 12d939b..6536813 100644
> --- a/xen/include/asm-x86/xstate.h
> +++ b/xen/include/asm-x86/xstate.h
> @@ -34,13 +34,14 @@
>  #define XSTATE_OPMASK  (1ULL << 5)
>  #define XSTATE_ZMM     (1ULL << 6)
>  #define XSTATE_HI_ZMM  (1ULL << 7)
> +#define XSTATE_PKRU    (1ULL << 9)
>  #define XSTATE_LWP     (1ULL << 62) /* AMD lightweight profiling */
>  #define XSTATE_FP_SSE  (XSTATE_FP | XSTATE_SSE)
>  #define XCNTXT_MASK    (XSTATE_FP | XSTATE_SSE | XSTATE_YMM | XSTATE_OPMASK | \
>                          XSTATE_ZMM | XSTATE_HI_ZMM | XSTATE_NONLAZY)
>  
>  #define XSTATE_ALL     (~(1ULL << 63))
> -#define XSTATE_NONLAZY (XSTATE_LWP | XSTATE_BNDREGS | XSTATE_BNDCSR)
> +#define XSTATE_NONLAZY (XSTATE_LWP | XSTATE_BNDREGS | XSTATE_BNDCSR | XSTATE_PKRU)
>  #define XSTATE_LAZY    (XSTATE_ALL & ~XSTATE_NONLAZY)
>  #define XSTATE_COMPACTION_ENABLED  (1ULL << 63)
>  
> @@ -90,6 +91,7 @@ uint64_t get_msr_xss(void);
>  void xsave(struct vcpu *v, uint64_t mask);
>  void xrstor(struct vcpu *v, uint64_t mask);
>  bool_t xsave_enabled(const struct vcpu *v);
> +void *get_xsave_addr(void *xsave, unsigned int xfeature_idx);
>  int __must_check validate_xstate(u64 xcr0, u64 xcr0_accum, u64 xstate_bv);
>  int __must_check handle_xsetbv(u32 index, u64 new_bv);
>  void expand_xsave_states(struct vcpu *v, void *dest, unsigned int size);
> 


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-07  9:16 ` [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables Huaitong Han
@ 2015-12-10 18:19   ` George Dunlap
  2015-12-11  9:16     ` Wu, Feng
                       ` (2 more replies)
  2015-12-10 18:59   ` Andrew Cooper
  1 sibling, 3 replies; 44+ messages in thread
From: George Dunlap @ 2015-12-10 18:19 UTC (permalink / raw)
  To: Huaitong Han, jbeulich, andrew.cooper3, jun.nakajima, eddie.dong,
	kevin.tian, george.dunlap, ian.jackson, stefano.stabellini,
	ian.campbell, wei.liu2, keir
  Cc: xen-devel

On 07/12/15 09:16, Huaitong Han wrote:
> This patch adds pkeys support for guest_walk_tables.
> 
> Signed-off-by: Huaitong Han <huaitong.han@intel.com>
> ---
>  xen/arch/x86/i387.c           |  2 +-
>  xen/arch/x86/mm/guest_walk.c  | 73 +++++++++++++++++++++++++++++++++++++++++++
>  xen/include/asm-x86/hvm/hvm.h |  2 ++
>  xen/include/asm-x86/i387.h    |  1 +
>  4 files changed, 77 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
> index b661d39..83c8465 100644
> --- a/xen/arch/x86/i387.c
> +++ b/xen/arch/x86/i387.c
> @@ -132,7 +132,7 @@ static inline uint64_t vcpu_xsave_mask(const struct vcpu *v)
>  }
>  
>  /* Save x87 extended state */
> -static inline void fpu_xsave(struct vcpu *v)
> +void fpu_xsave(struct vcpu *v)
>  {
>      bool_t ok;
>      uint64_t mask = vcpu_xsave_mask(v);
> diff --git a/xen/arch/x86/mm/guest_walk.c b/xen/arch/x86/mm/guest_walk.c
> index 18d1acf..e79f72f 100644
> --- a/xen/arch/x86/mm/guest_walk.c
> +++ b/xen/arch/x86/mm/guest_walk.c
> @@ -31,6 +31,8 @@ asm(".file \"" __OBJECT_FILE__ "\"");
>  #include <xen/sched.h>
>  #include <asm/page.h>
>  #include <asm/guest_pt.h>
> +#include <asm/xstate.h>
> +#include <asm/i387.h>
>  
>  extern const uint32_t gw_page_flags[];
>  #if GUEST_PAGING_LEVELS == CONFIG_PAGING_LEVELS
> @@ -90,6 +92,61 @@ static uint32_t set_ad_bits(void *guest_p, void *walk_p, int set_dirty)
>      return 0;
>  }
>  
> +#if GUEST_PAGING_LEVELS >= CONFIG_PAGING_LEVELS
> +bool_t leaf_pte_pkeys_check(struct vcpu *vcpu, uint32_t pfec,
> +                uint32_t pte_access, uint32_t pte_pkeys)

pte_access doesn't seem to be used at all.

> +{
> +    void *xsave_addr;
> +    unsigned int pkru = 0;
> +    bool_t pkru_ad, pkru_wd;
> +
> +    bool_t uf = !!(pfec & PFEC_user_mode);
> +    bool_t wf = !!(pfec & PFEC_write_access);
> +    bool_t ff = !!(pfec & PFEC_insn_fetch);
> +    bool_t rsvdf = !!(pfec & PFEC_reserved_bit);
> +    bool_t pkuf  = !!(pfec & PFEC_prot_key);

So I'm just wondering out loud here -- is there actually any situation
in which we would want guest_walk_tables to act differently than the
real hardware?

That is, is there actually any situation where, pku is enabled, the vcpu
is in long mode, PFEC_write_access and/or PFEC_page_present is set, and
the pkey is non-zero, that we want guest_walk_tables() to only check the
write-protect bit for the pte, and not also check the pkru?

Because if not, it seems like it would be much more robust to simply
*always* check for pkru_ad if PFEC_page_present is set, and for pkru_wd
if PFEC_write_access is set.

Then in patch 8, you wouldn't need to go around all the __hvm_copy
functions adding in PFEC_prot; instead, you'd just need to add
PFEC_insn_fetch to the "fetch" (as is already done for SMEP and NX), and
you'd be done.

> +
> +    if ( !cpu_has_xsave || !pkuf || is_pv_vcpu(vcpu) )
> +        return 0;
> +
> +    /* PKRU dom0 is always zero */

"dom0" has a very specific meaning in Xen.  I think this would be better
written "pkey 0 always has full access".

> +    if ( likely(!pte_pkeys) )
> +        return 0;
> +
> +    /* Update vcpu xsave area */
> +    fpu_xsave(vcpu);

Is there a reason you're calling fpu_xsave() directly here, rather than
just calling vcpu_save_fpu()?  That saves you actually doing the xsave
if the fpu hasn't been modified since the last time you read it.

> +    xsave_addr = get_xsave_addr(vcpu->arch.xsave_area, fls64(XSTATE_PKRU)-1);
> +    if ( !!xsave_addr )
> +        memcpy(&pkru, xsave_addr, sizeof(pkru));

There's no need for the !! here.  But in any case, isn't there a better
function for reading the xsave state than manually calculating the
address and doing a memcpy?

> +
> +    if ( unlikely(pkru) )
> +    {
> +        /*
> +         * PKU:  additional mechanism by which the paging controls
> +         * access to user-mode addresses based on the value in the
> +         * PKRU register. A fault is considered as a PKU violation if all
> +         * of the following conditions are true:
> +         * 1.CR4_PKE=1.
> +         * 2.EFER_LMA=1.
> +         * 3.page is present with no reserved bit violations.
> +         * 4.the access is not an instruction fetch.
> +         * 5.the access is to a user page.
> +         * 6.PKRU.AD=1
> +         *       or The access is a data write and PKRU.WD=1
> +         *            and either CR0.WP=1 or it is a user access.
> +         */
> +        pkru_ad = read_pkru_ad(pkru, pte_pkeys);
> +        pkru_wd = read_pkru_wd(pkru, pte_pkeys);
> +        if ( hvm_pku_enabled(vcpu) && hvm_long_mode_enabled(vcpu) &&
> +            !rsvdf && !ff && (pkru_ad ||
> +            (pkru_wd && wf && (hvm_wp_enabled(vcpu) || uf))))
> +            return 1;

This statement here is really difficult to read.  Why don't you put the
checks which don't depend on the pkru up before you read it?  e.g.,
hvm_pku_enabled(), hvm_long_mode_enabled(), rsvdf, ff, &c?

 -George


* Re: [V3 PATCH 4/9] x86/hvm: pkeys, add functions to get pkeys value from PTE
  2015-12-10 15:48   ` George Dunlap
@ 2015-12-10 18:47     ` Andrew Cooper
  0 siblings, 0 replies; 44+ messages in thread
From: Andrew Cooper @ 2015-12-10 18:47 UTC (permalink / raw)
  To: George Dunlap, Huaitong Han, jbeulich, jun.nakajima, eddie.dong,
	kevin.tian, george.dunlap, ian.jackson, stefano.stabellini,
	ian.campbell, wei.liu2, keir
  Cc: xen-devel

On 10/12/15 15:48, George Dunlap wrote:
> On 07/12/15 09:16, Huaitong Han wrote:
>> This patch adds functions to get pkeys value from PTE.
>>
>> Signed-off-by: Huaitong Han <huaitong.han@intel.com>
>> ---
>>  xen/include/asm-x86/guest_pt.h    |  7 +++++++
>>  xen/include/asm-x86/page.h        |  5 +++++
>>  xen/include/asm-x86/x86_64/page.h | 12 ++++++++++++
>>  3 files changed, 24 insertions(+)
>>
>> diff --git a/xen/include/asm-x86/guest_pt.h b/xen/include/asm-x86/guest_pt.h
>> index 3447973..6b0af70 100644
>> --- a/xen/include/asm-x86/guest_pt.h
>> +++ b/xen/include/asm-x86/guest_pt.h
>> @@ -154,6 +154,13 @@ static inline u32 guest_l4e_get_flags(guest_l4e_t gl4e)
>>  { return l4e_get_flags(gl4e); }
>>  #endif
>>  
>> +static inline u32 guest_l1e_get_pkeys(guest_l1e_t gl1e)
>> +{ return l1e_get_pkeys(gl1e); }
>> +static inline u32 guest_l2e_get_pkeys(guest_l2e_t gl2e)
>> +{ return l2e_get_pkeys(gl2e); }
>> +static inline u32 guest_l3e_get_pkeys(guest_l3e_t gl3e)
>> +{ return l3e_get_pkeys(gl3e); }
>> +
>>  static inline guest_l1e_t guest_l1e_from_gfn(gfn_t gfn, u32 flags)
>>  { return l1e_from_pfn(gfn_x(gfn), flags); }
>>  static inline guest_l2e_t guest_l2e_from_gfn(gfn_t gfn, u32 flags)
>> diff --git a/xen/include/asm-x86/page.h b/xen/include/asm-x86/page.h
>> index a095a93..93a0db0 100644
>> --- a/xen/include/asm-x86/page.h
>> +++ b/xen/include/asm-x86/page.h
>> @@ -93,6 +93,11 @@
>>  #define l3e_get_flags(x)           (get_pte_flags((x).l3))
>>  #define l4e_get_flags(x)           (get_pte_flags((x).l4))
>>  
>> +/* Get pte pkeys (unsigned int). */
>> +#define l1e_get_pkeys(x)           (get_pte_pkeys((x).l1))
>> +#define l2e_get_pkeys(x)           (get_pte_pkeys((x).l2))
>> +#define l3e_get_pkeys(x)           (get_pte_pkeys((x).l3))
>> +
>>  /* Construct an empty pte. */
>>  #define l1e_empty()                ((l1_pgentry_t) { 0 })
>>  #define l2e_empty()                ((l2_pgentry_t) { 0 })
>> diff --git a/xen/include/asm-x86/x86_64/page.h b/xen/include/asm-x86/x86_64/page.h
>> index 19ab4d0..3ca489a 100644
>> --- a/xen/include/asm-x86/x86_64/page.h
>> +++ b/xen/include/asm-x86/x86_64/page.h
>> @@ -134,6 +134,18 @@ typedef l4_pgentry_t root_pgentry_t;
>>  #define get_pte_flags(x) (((int)((x) >> 40) & ~0xFFF) | ((int)(x) & 0xFFF))
>>  #define put_pte_flags(x) (((intpte_t)((x) & ~0xFFF) << 40) | ((x) & 0xFFF))
>>  
>> +/*
>> + * Protection keys define a new 4-bit protection key field
>> + * (PKEY) in bits 62:59 of leaf entries of the page tables.
>> + * This corresponds to bits 22:19 of the 24-bit flags.
>> + *
>> + * Notice: Bit 22 is used by _PAGE_GNTTAB which is visible to PV guests,
>> + * so Protection keys must be disabled on PV guests.
>> + */
>> +#define _PAGE_PKEY_BITS  (0x780000)	 /* Protection Keys, 22:19 */
>> +
>> +#define get_pte_pkeys(x) (MASK_EXTR(get_pte_flags(x), _PAGE_PKEY_BITS))
> Sorry if I'm getting nit-picky here, but any given pte will only have a
> single pkey, right?  Would it be better if these were "get_pte_pkey()"
> (i.e., singular)?

Correct.  An individual pte contains 4 bits which is "the protection
key", an index (in the range 0-15) into the protection key register,
which ultimately determines access-denied/write-denied on the linear
address.
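
As a worked example (values invented for illustration; the macros are
the ones from the quoted hunk):

    intpte_t pte = (intpte_t)5 << 59;       /* bits 62:59 = 0b0101 */
    unsigned int pkey = get_pte_pkeys(pte); /* extracts 5 */
    /* PKRU bit 2*5 (AD5) and bit 2*5+1 (WD5) now govern user-mode
     * accesses through this mapping. */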

~Andrew


* Re: [V3 PATCH 5/9] x86/hvm: pkeys, add functions to support PKRU access
  2015-12-07  9:16 ` [V3 PATCH 5/9] x86/hvm: pkeys, add functions to support PKRU access Huaitong Han
@ 2015-12-10 18:48   ` Andrew Cooper
  0 siblings, 0 replies; 44+ messages in thread
From: Andrew Cooper @ 2015-12-10 18:48 UTC (permalink / raw)
  To: Huaitong Han, jbeulich, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: xen-devel

On 07/12/15 09:16, Huaitong Han wrote:
> This patch adds functions to support PKRU access.
>
> Signed-off-by: Huaitong Han <huaitong.han@intel.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>, with a suggestion...

> ---
>  xen/include/asm-x86/processor.h | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
> index 3f8411f..c345787 100644
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -342,6 +342,26 @@ static inline void write_cr4(unsigned long val)
>      asm volatile ( "mov %0,%%cr4" : : "r" (val) );
>  }
>  
> +/* Macros for PKRU domain */
> +#define PKRU_READ  (0)
> +#define PKRU_WRITE (1)
> +#define PKRU_ATTRS (2)
> +
> +/*
> + * PKRU defines 32 bits, there are 16 domains and 2 attribute bits per
> + * domain in pkru, pkeys is index to a defined domain, so the value of
> + * pte_pkeys * PKRU_ATTRS + R/W is offset of a defined domain attribute.
> + */
> +static inline bool_t read_pkru_ad(unsigned int pkru, unsigned int pkey)
> +{

ASSERT(pkey < 16);

> +    return (pkru >> (pkey * PKRU_ATTRS + PKRU_READ)) & 1;
> +}
> +
> +static inline bool_t read_pkru_wd(unsigned int pkru, unsigned int pkey)
> +{

ASSERT(pkey < 16);

~Andrew

> +    return (pkru >> (pkey * PKRU_ATTRS + PKRU_WRITE)) & 1;
> +}
> +
>  /* Clear and set 'TS' bit respectively */
>  static inline void clts(void) 
>  {


* Re: [V3 PATCH 6/9] x86/hvm: pkeys, add xstate support for pkeys
  2015-12-07  9:16 ` [V3 PATCH 6/9] x86/hvm: pkeys, add xstate support for pkeys Huaitong Han
  2015-12-10 17:39   ` George Dunlap
@ 2015-12-10 18:57   ` Andrew Cooper
  2015-12-11  9:36   ` Jan Beulich
  2 siblings, 0 replies; 44+ messages in thread
From: Andrew Cooper @ 2015-12-10 18:57 UTC (permalink / raw)
  To: Huaitong Han, jbeulich, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: xen-devel

On 07/12/15 09:16, Huaitong Han wrote:
> This patch adds xstate support for pkeys.
>
> Signed-off-by: Huaitong Han <huaitong.han@intel.com>
> ---
>  xen/arch/x86/xstate.c        | 7 +++++--
>  xen/include/asm-x86/xstate.h | 4 +++-
>  2 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
> index b65da38..db978c4 100644
> --- a/xen/arch/x86/xstate.c
> +++ b/xen/arch/x86/xstate.c
> @@ -146,12 +146,15 @@ static void __init setup_xstate_comp(void)
>      }
>  }
>  
> -static void *get_xsave_addr(void *xsave, unsigned int xfeature_idx)
> +void *get_xsave_addr(void *xsave, unsigned int xfeature_idx)

This really should take a struct_xsave *xsave, rather than a void pointer.

>  {
>      if ( !((1ul << xfeature_idx) & xfeature_mask) )
>          return NULL;

Now I look at it, this check is bogus.  The check needs to be against
the xsave header in the area, rather than Xen's maximum xfeature_mask. 
A guest might easily have a smaller xcr0 than the maximum Xen is willing
to allow, causing the pointer below to be bogus.
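
I.e. something like this (untested sketch, assuming xstate_bv in the
area's own header reflects what was actually saved there):

    void *get_xsave_addr(struct xsave_struct *xsave,
                         unsigned int xfeature_idx)
    {
        /* Check the area's own header, not Xen's global xfeature_mask. */
        if ( !((1ul << xfeature_idx) & xsave->xsave_hdr.xstate_bv) )
            return NULL;

        return (void *)xsave + (xsave_area_compressed(xsave)
                                ? xstate_comp_offsets
                                : xstate_offsets)[xfeature_idx];
    }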

~Andrew


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-07  9:16 ` [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables Huaitong Han
  2015-12-10 18:19   ` George Dunlap
@ 2015-12-10 18:59   ` Andrew Cooper
  2015-12-11  7:18     ` Han, Huaitong
  1 sibling, 1 reply; 44+ messages in thread
From: Andrew Cooper @ 2015-12-10 18:59 UTC (permalink / raw)
  To: Huaitong Han, jbeulich, jun.nakajima, eddie.dong, kevin.tian,
	george.dunlap, ian.jackson, stefano.stabellini, ian.campbell,
	wei.liu2, keir
  Cc: xen-devel

On 07/12/15 09:16, Huaitong Han wrote:
> +
> +    /* PKRU dom0 is always zero */
> +    if ( likely(!pte_pkeys) )
> +        return 0;

This is not an architectural restriction (as far as I can tell).  Xen
must never make assumptions about how a guest chooses to use a feature.

~Andrew


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-10 18:59   ` Andrew Cooper
@ 2015-12-11  7:18     ` Han, Huaitong
  2015-12-11  8:48       ` Andrew Cooper
  0 siblings, 1 reply; 44+ messages in thread
From: Han, Huaitong @ 2015-12-11  7:18 UTC (permalink / raw)
  To: george.dunlap, ian.campbell, Dong, Eddie, andrew.cooper3,
	wei.liu2, ian.jackson, Tian, Kevin, stefano.stabellini, jbeulich,
	Nakajima, Jun, keir
  Cc: xen-devel

On Thu, 2015-12-10 at 18:59 +0000, Andrew Cooper wrote:
> On 07/12/15 09:16, Huaitong Han wrote:
> > +
> > +    /* PKRU dom0 is always zero */
> > +    if ( likely(!pte_pkeys) )
> > +        return 0;
> 
> This is not an architectural restriction (as far as I can tell).  Xen
> must never make assumptions about how a guest chooses to use a
> feature.
Yes, I will remove the code section.

On my test machine, writing a non-zero value to PKRU domain 0 leads
to a GP fault. This also conforms to the usage scenario, and I will
ask the Intel architect why the SDM does not describe it.

If that is OK, I will send another patch, or a new version of this
patch, to improve the efficiency of the code.
> 
> ~Andrew


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-11  7:18     ` Han, Huaitong
@ 2015-12-11  8:48       ` Andrew Cooper
  0 siblings, 0 replies; 44+ messages in thread
From: Andrew Cooper @ 2015-12-11  8:48 UTC (permalink / raw)
  To: Han, Huaitong, george.dunlap, ian.campbell, Dong, Eddie,
	wei.liu2, ian.jackson, Tian, Kevin, stefano.stabellini, jbeulich,
	Nakajima, Jun, keir
  Cc: xen-devel

On 11/12/2015 07:18, Han, Huaitong wrote:
> On Thu, 2015-12-10 at 18:59 +0000, Andrew Cooper wrote:
>> On 07/12/15 09:16, Huaitong Han wrote:
>>> +
>>> +    /* PKRU dom0 is always zero */
>>> +    if ( likely(!pte_pkeys) )
>>> +        return 0;
>> This is not an architectural restriction (as far as I can tell).  Xen
>> must never make assumptions about how a guest chooses to use a
>> feature.
> Yes, I will remove the code section.
>
> On my test machine, writing non-zero value to PKRU domain 0 will lead
> to a GP fault, this is also conform to the usage scenario, and I will
> ask the intel architect the reason why SDM does not describe.

If this is (or is going to be documented as) an architectural
restriction, then the code can stay, although I would adjust the comment to

/* Architecturally, pkey 0 is always zero. */

Thinking about it, the restriction does make sense.  It equates to "the
default behaviour of things before the PKRU feature came along".

~Andrew


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-10 18:19   ` George Dunlap
@ 2015-12-11  9:16     ` Wu, Feng
  2015-12-11  9:23       ` Jan Beulich
  2015-12-16 15:36       ` George Dunlap
  2015-12-11  9:23     ` Han, Huaitong
  2015-12-11  9:26     ` Jan Beulich
  2 siblings, 2 replies; 44+ messages in thread
From: Wu, Feng @ 2015-12-11  9:16 UTC (permalink / raw)
  To: George Dunlap, Han, Huaitong, jbeulich, andrew.cooper3, Nakajima,
	Jun, Dong, Eddie, Tian, Kevin, george.dunlap, ian.jackson,
	stefano.stabellini, ian.campbell, wei.liu2, keir
  Cc: Wu, Feng, xen-devel



> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> bounces@lists.xen.org] On Behalf Of George Dunlap
> Sent: Friday, December 11, 2015 2:20 AM
> To: Han, Huaitong <huaitong.han@intel.com>; jbeulich@suse.com;
> andrew.cooper3@citrix.com; Nakajima, Jun <jun.nakajima@intel.com>; Dong,
> Eddie <eddie.dong@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> george.dunlap@eu.citrix.com; ian.jackson@eu.citrix.com;
> stefano.stabellini@eu.citrix.com; ian.campbell@citrix.com; wei.liu2@citrix.com;
> keir@xen.org
> Cc: xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for
> guest_walk_tables
> 
> On 07/12/15 09:16, Huaitong Han wrote:
> > This patch adds pkeys support for guest_walk_tables.
> >
> > Signed-off-by: Huaitong Han <huaitong.han@intel.com>
> > ---
> >  xen/arch/x86/i387.c           |  2 +-
> >  xen/arch/x86/mm/guest_walk.c  | 73
> +++++++++++++++++++++++++++++++++++++++++++
> >  xen/include/asm-x86/hvm/hvm.h |  2 ++
> >  xen/include/asm-x86/i387.h    |  1 +
> >  4 files changed, 77 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
> > index b661d39..83c8465 100644
> > --- a/xen/arch/x86/i387.c
> > +++ b/xen/arch/x86/i387.c
> > @@ -132,7 +132,7 @@ static inline uint64_t vcpu_xsave_mask(const struct
> vcpu *v)
> >  }
> >
> >  /* Save x87 extended state */
> > -static inline void fpu_xsave(struct vcpu *v)
> > +void fpu_xsave(struct vcpu *v)
> >  {
> >      bool_t ok;
> >      uint64_t mask = vcpu_xsave_mask(v);
> > diff --git a/xen/arch/x86/mm/guest_walk.c b/xen/arch/x86/mm/guest_walk.c
> > index 18d1acf..e79f72f 100644
> > --- a/xen/arch/x86/mm/guest_walk.c
> > +++ b/xen/arch/x86/mm/guest_walk.c
> > @@ -31,6 +31,8 @@ asm(".file \"" __OBJECT_FILE__ "\"");
> >  #include <xen/sched.h>
> >  #include <asm/page.h>
> >  #include <asm/guest_pt.h>
> > +#include <asm/xstate.h>
> > +#include <asm/i387.h>
> >
> >  extern const uint32_t gw_page_flags[];
> >  #if GUEST_PAGING_LEVELS == CONFIG_PAGING_LEVELS
> > @@ -90,6 +92,61 @@ static uint32_t set_ad_bits(void *guest_p, void *walk_p,
> int set_dirty)
> >      return 0;
> >  }
> >
> > +#if GUEST_PAGING_LEVELS >= CONFIG_PAGING_LEVELS
> > +bool_t leaf_pte_pkeys_check(struct vcpu *vcpu, uint32_t pfec,
> > +                uint32_t pte_access, uint32_t pte_pkeys)
> 
> pte_access doesn't seem to be used at all.
> 
> > +{
> > +    void *xsave_addr;
> > +    unsigned int pkru = 0;
> > +    bool_t pkru_ad, pkru_wd;
> > +
> > +    bool_t uf = !!(pfec & PFEC_user_mode);
> > +    bool_t wf = !!(pfec & PFEC_write_access);
> > +    bool_t ff = !!(pfec & PFEC_insn_fetch);
> > +    bool_t rsvdf = !!(pfec & PFEC_reserved_bit);
> > +    bool_t pkuf  = !!(pfec & PFEC_prot_key);
> 
> So I'm just wondering out loud here -- is there actually any situation
> in which we would want guest_walk_tables to act differently than the
> real hardware?
> 
> That is, is there actually any situation where, pku is enabled, the vcpu
> is in long mode, PFEC_write_access and/or PFEC_page_present is set, and
> the pkey is non-zero, that we want guest_walk_tables() to only check the
> write-protect bit for the pte, and not also check the pkru?
> 
> Because if not, it seems like it would be much more robust to simply
> *always* check for pkru_ad if PFEC_page_present is set, and for pkru_wd
> if PFEC_write_access is set.

guest_walk_tables() is also used by shadow code; though we don't
plan to support pkeys for shadow now, in that case the 'pfec'
is generated by hardware, and the pkuf bit may be 0 or 1 depending
on the real page fault. If we unconditionally check pkeys in
guest_walk_tables(), it is not a good idea for someone who may
want to implement pkeys for shadow in the future, since we only
need to check pkeys when the pkuf bit is set in pfec.

> 
> Then in patch 8, you wouldn't need to go around all the __hvm_copy
> functions adding in PFEC_prot; instead, you'd just need to add
> PFEC_insn_fetch to the "fetch" (as is already done for SMEP and NX), and
> you'd be done.

Pkeys have no business with instruction fetches, so why do we need to
add PFEC_insn_fetch?

Thanks,
Feng


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-10 18:19   ` George Dunlap
  2015-12-11  9:16     ` Wu, Feng
@ 2015-12-11  9:23     ` Han, Huaitong
  2015-12-11  9:50       ` Jan Beulich
  2015-12-11  9:26     ` Jan Beulich
  2 siblings, 1 reply; 44+ messages in thread
From: Han, Huaitong @ 2015-12-11  9:23 UTC (permalink / raw)
  To: george.dunlap, ian.campbell, Dong, Eddie, andrew.cooper3,
	wei.liu2, ian.jackson, Tian, Kevin, george.dunlap, jbeulich,
	Nakajima, Jun, stefano.stabellini, keir
  Cc: xen-devel

On Thu, 2015-12-10 at 18:19 +0000, George Dunlap wrote:
> On 07/12/15 09:16, Huaitong Han wrote:
> > +{
> > +    void *xsave_addr;
> > +    unsigned int pkru = 0;
> > +    bool_t pkru_ad, pkru_wd;
> > +
> > +    bool_t uf = !!(pfec & PFEC_user_mode);
> > +    bool_t wf = !!(pfec & PFEC_write_access);
> > +    bool_t ff = !!(pfec & PFEC_insn_fetch);
> > +    bool_t rsvdf = !!(pfec & PFEC_reserved_bit);
> > +    bool_t pkuf  = !!(pfec & PFEC_prot_key);
> 
> So I'm just wondering out loud here -- is there actually any
> situation
> in which we would want guest_walk_tables to act differently than the
> real hardware?
> 
> That is, is there actually any situation where, pku is enabled, the
> vcpu
> is in long mode, PFEC_write_access and/or PFEC_page_present is set,
> and
> the pkey is non-zero, that we want guest_walk_tables() to only check
> the
> write-protect bit for the pte, and not also check the pkru?
> 
> Because if not, it seems like it would be much more robust to simply
> *always* check for pkru_ad if PFEC_page_present is set, and for
> pkru_wd
> if PFEC_write_access is set.
> Then in patch 8, you wouldn't need to go around all the __hvm_copy
> functions adding in PFEC_prot; instead, you'd just need to add
> PFEC_insn_fetch to the "fetch" (as is already done for SMEP and NX),
> and
> you'd be done.
See the reply email from Feng, who discussed this with me.

> > +
> > +    if ( !cpu_has_xsave || !pkuf || is_pv_vcpu(vcpu) )
> > +        return 0;
> > +
> > +    /* PKRU dom0 is always zero */
> 
> "dom0" has a very specific meaning in Xen.  I think this would be
> better
> written "pkey 0 always has full access".
> 
> > +    if ( likely(!pte_pkeys) )
> > +        return 0;
> > +
> > +    /* Update vcpu xsave area */
> > +    fpu_xsave(vcpu);
> 
> Is there a reason you're calling fpu_xsave() directly here, rather
> than
> just calling vcpu_save_fpu()?  That saves you actually doing the
> xsave
> if the fpu hasn't been modified since the last time you read it.
I use fpu_xsave instead of vcpu_save_fpu because of Jan's comment:
Which is bogus by itself: That function isn't meant to be used for
purposes like the one you have, e.g. due to its side effect of
clearing ->fpu_dirtied. You really ought to be using a lower level
function here (and I don't think the corresponding struct vcpu
should get altered in any way). --Jan

And I can add 
    if ( !vcpu->fpu_dirtied )    
before fpu_xsave(vcpu);

> > +    xsave_addr = get_xsave_addr(vcpu->arch.xsave_area,
> > fls64(XSTATE_PKRU)-1);
> > +    if ( !!xsave_addr )
> > +        memcpy(&pkru, xsave_addr, sizeof(pkru));
> 
> There's no need for the !! here.  But in any case, isn't there a
> better
> function for reading the xsave state than manually calculating the
> address and doing a memcpy?
RDPKRU is disabled by the hypervisor CR4 because PV mode must disable
CR4.PKE, so getting the PKRU value depends only on xsave.
> > +
> > +    if ( unlikely(pkru) )
> > +    {
> > +        /*
> > +         * PKU:  additional mechanism by which the paging controls
> > +         * access to user-mode addresses based on the value in the
> > +         * PKRU register. A fault is considered as a PKU violation
> > if all
> > +         * of the following conditions are true:
> > +         * 1.CR4_PKE=1.
> > +         * 2.EFER_LMA=1.
> > +         * 3.page is present with no reserved bit violations.
> > +         * 4.the access is not an instruction fetch.
> > +         * 5.the access is to a user page.
> > +         * 6.PKRU.AD=1
> > +         *       or The access is a data write and PKRU.WD=1
> > +         *            and either CR0.WP=1 or it is a user access.
> > +         */
> > +        pkru_ad = read_pkru_ad(pkru, pte_pkeys);
> > +        pkru_wd = read_pkru_wd(pkru, pte_pkeys);
> > +        if ( hvm_pku_enabled(vcpu) && hvm_long_mode_enabled(vcpu)
> > &&
> > +            !rsvdf && !ff && (pkru_ad ||
> > +            (pkru_wd && wf && (hvm_wp_enabled(vcpu) || uf))))
> > +            return 1;
> 
> This statement here is really difficult to read.  Why don't you put
> the
> checks which don't depend on the pkru up before you read it?  e.g.,
> hvm_pku_enabled(), hvm_long_mode_enabled(), rsvdf, ff, &c?
> 
>  -George


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-11  9:16     ` Wu, Feng
@ 2015-12-11  9:23       ` Jan Beulich
  2015-12-16 15:36       ` George Dunlap
  1 sibling, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2015-12-11  9:23 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, George Dunlap,
	xen-devel, Jun Nakajima, Huaitong Han, keir

>>> On 11.12.15 at 10:16, <feng.wu@intel.com> wrote:
>> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
>> bounces@lists.xen.org] On Behalf Of George Dunlap
>> Sent: Friday, December 11, 2015 2:20 AM
>> On 07/12/15 09:16, Huaitong Han wrote:
>> > This patch adds pkeys support for guest_walk_tables.
>> >
>> > Signed-off-by: Huaitong Han <huaitong.han@intel.com>
>> > ---
>> >  xen/arch/x86/i387.c           |  2 +-
>> >  xen/arch/x86/mm/guest_walk.c  | 73
>> +++++++++++++++++++++++++++++++++++++++++++
>> >  xen/include/asm-x86/hvm/hvm.h |  2 ++
>> >  xen/include/asm-x86/i387.h    |  1 +
>> >  4 files changed, 77 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c
>> > index b661d39..83c8465 100644
>> > --- a/xen/arch/x86/i387.c
>> > +++ b/xen/arch/x86/i387.c
>> > @@ -132,7 +132,7 @@ static inline uint64_t vcpu_xsave_mask(const struct
>> vcpu *v)
>> >  }
>> >
>> >  /* Save x87 extended state */
>> > -static inline void fpu_xsave(struct vcpu *v)
>> > +void fpu_xsave(struct vcpu *v)
>> >  {
>> >      bool_t ok;
>> >      uint64_t mask = vcpu_xsave_mask(v);
>> > diff --git a/xen/arch/x86/mm/guest_walk.c b/xen/arch/x86/mm/guest_walk.c
>> > index 18d1acf..e79f72f 100644
>> > --- a/xen/arch/x86/mm/guest_walk.c
>> > +++ b/xen/arch/x86/mm/guest_walk.c
>> > @@ -31,6 +31,8 @@ asm(".file \"" __OBJECT_FILE__ "\"");
>> >  #include <xen/sched.h>
>> >  #include <asm/page.h>
>> >  #include <asm/guest_pt.h>
>> > +#include <asm/xstate.h>
>> > +#include <asm/i387.h>
>> >
>> >  extern const uint32_t gw_page_flags[];
>> >  #if GUEST_PAGING_LEVELS == CONFIG_PAGING_LEVELS
>> > @@ -90,6 +92,61 @@ static uint32_t set_ad_bits(void *guest_p, void *walk_p,
>> int set_dirty)
>> >      return 0;
>> >  }
>> >
>> > +#if GUEST_PAGING_LEVELS >= CONFIG_PAGING_LEVELS
>> > +bool_t leaf_pte_pkeys_check(struct vcpu *vcpu, uint32_t pfec,
>> > +                uint32_t pte_access, uint32_t pte_pkeys)
>> 
>> pte_access doesn't seem to be used at all.
>> 
>> > +{
>> > +    void *xsave_addr;
>> > +    unsigned int pkru = 0;
>> > +    bool_t pkru_ad, pkru_wd;
>> > +
>> > +    bool_t uf = !!(pfec & PFEC_user_mode);
>> > +    bool_t wf = !!(pfec & PFEC_write_access);
>> > +    bool_t ff = !!(pfec & PFEC_insn_fetch);
>> > +    bool_t rsvdf = !!(pfec & PFEC_reserved_bit);
>> > +    bool_t pkuf  = !!(pfec & PFEC_prot_key);
>> 
>> So I'm just wondering out loud here -- is there actually any situation
>> in which we would want guest_walk_tables to act differently than the
>> real hardware?
>> 
>> That is, is there actually any situation where, pku is enabled, the vcpu
>> is in long mode, PFEC_write_access and/or PFEC_page_present is set, and
>> the pkey is non-zero, that we want guest_walk_tables() to only check the
>> write-protect bit for the pte, and not also check the pkru?
>> 
>> Because if not, it seems like it would be much more robust to simply
>> *always* check for pkru_ad if PFEC_page_present is set, and for pkru_wd
>> if PFEC_write_access is set.
> 
> guest_walk_tables() is also used by shadow code; though we don't
> plan to support pkeys for shadow now, in that case the 'pfec'
> is generated by hardware, and the pkuf bit may be 0 or 1 depending
> on the real page fault. If we unconditionally check pkeys in
> guest_walk_tables(), it is not a good idea for someone who may
> want to implement pkeys for shadow in the future, since we only
> need to check pkeys when the pkuf bit is set in pfec.

If shadow code isn't to support pkeys, how would an error code ever
have the respective bit set?

Jan


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-10 18:19   ` George Dunlap
  2015-12-11  9:16     ` Wu, Feng
  2015-12-11  9:23     ` Han, Huaitong
@ 2015-12-11  9:26     ` Jan Beulich
  2015-12-15  8:14       ` Han, Huaitong
  2 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2015-12-11  9:26 UTC (permalink / raw)
  To: George Dunlap, Huaitong Han
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, xen-devel,
	jun.nakajima, keir

>>> On 10.12.15 at 19:19, <george.dunlap@citrix.com> wrote:
> On 07/12/15 09:16, Huaitong Han wrote:
>> +    if ( likely(!pte_pkeys) )
>> +        return 0;
>> +
>> +    /* Update vcpu xsave area */
>> +    fpu_xsave(vcpu);
> 
> Is there a reason you're calling fpu_xsave() directly here, rather than
> just calling vcpu_save_fpu()?  That saves you actually doing the xsave
> if the fpu hasn't been modified since the last time you read it.

I've already said on an earlier version that wholesale saving of the
entire XSAVE state is wrong here. It should just be the single piece
that we're actually interested in, and it quite likely shouldn't go into
struct vcpu (but e.g. into a local buffer).

Jan


* Re: [V3 PATCH 6/9] x86/hvm: pkeys, add xstate support for pkeys
  2015-12-07  9:16 ` [V3 PATCH 6/9] x86/hvm: pkeys, add xstate support for pkeys Huaitong Han
  2015-12-10 17:39   ` George Dunlap
  2015-12-10 18:57   ` Andrew Cooper
@ 2015-12-11  9:36   ` Jan Beulich
  2 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2015-12-11  9:36 UTC (permalink / raw)
  To: Huaitong Han
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, xen-devel,
	jun.nakajima, keir

>>> On 07.12.15 at 10:16, <huaitong.han@intel.com> wrote:
> --- a/xen/arch/x86/xstate.c
> +++ b/xen/arch/x86/xstate.c
> @@ -146,12 +146,15 @@ static void __init setup_xstate_comp(void)
>      }
>  }
>  
> -static void *get_xsave_addr(void *xsave, unsigned int xfeature_idx)
> +void *get_xsave_addr(void *xsave, unsigned int xfeature_idx)
>  {
>      if ( !((1ul << xfeature_idx) & xfeature_mask) )
>          return NULL;
>  
> -    return xsave + xstate_comp_offsets[xfeature_idx];
> +    if ( xsave_area_compressed(xsave) )
> +        return xsave + xstate_comp_offsets[xfeature_idx];
> +    else
> +        return xsave + xstate_offsets[xfeature_idx];

For constructs like this I'd really like to ask to avoid as much of the
redundancy as possible. The most compact form would probably be

        return xsave + (xsave_area_compressed(xsave)
                        ? xstate_comp_offsets
                        : xstate_offsets)[xfeature_idx];

But at the very least the "else" should go away.

Jan


* Re: [V3 PATCH 9/9] x86/hvm: pkeys, add pkeys support for cpuid handling
  2015-12-07  9:16 ` [V3 PATCH 9/9] x86/hvm: pkeys, add pkeys support for cpuid handling Huaitong Han
@ 2015-12-11  9:47   ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2015-12-11  9:47 UTC (permalink / raw)
  To: Huaitong Han
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, xen-devel,
	jun.nakajima, keir

>>> On 07.12.15 at 10:16, <huaitong.han@intel.com> wrote:
> @@ -4605,6 +4605,18 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
>          /* Don't expose INVPCID to non-hap hvm. */
>          if ( (count == 0) && !hap_enabled(d) )
>              *ebx &= ~cpufeat_mask(X86_FEATURE_INVPCID);
> +
> +        /* X86_FEATURE_PKU is not yet implemented for shadow paging

Coding style.

> +         *
> +         * Hypervisor gets guest pkru value from XSAVE state, because
> +         * Hypervisor CR4 without X86_CR4_PKE disables RDPKRU instruction.
> +         */
> +        if ( (count == 0) && (!hap_enabled(d) || !cpu_has_xsave) )

It has been said before that you should check the guest property
here, not the host one. Without you doing so I can't even see the
point of you adjusting the logic to set OSXSAVE above.
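
For illustration, something like the following (sketch only, assuming
domain_cpuid() is the right way to read the guest's view, and guessing
at the truncated line that masks PKU out of *ecx):

    if ( count == 0 )
    {
        unsigned int t_eax, t_ebx, t_ecx, t_edx;

        /* Does the guest itself see XSAVE in its leaf 1 ECX? */
        domain_cpuid(d, 0x1, 0, &t_eax, &t_ebx, &t_ecx, &t_edx);
        if ( !hap_enabled(d) ||
             !(t_ecx & cpufeat_mask(X86_FEATURE_XSAVE)) )
            *ecx &= ~cpufeat_mask(X86_FEATURE_PKU);
    }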

Jan


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-11  9:23     ` Han, Huaitong
@ 2015-12-11  9:50       ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2015-12-11  9:50 UTC (permalink / raw)
  To: Huaitong Han
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, george.dunlap,
	xen-devel, Jun Nakajima, keir

>>> On 11.12.15 at 10:23, <huaitong.han@intel.com> wrote:
> On Thu, 2015-12-10 at 18:19 +0000, George Dunlap wrote:
>> On 07/12/15 09:16, Huaitong Han wrote:
>> > +    if ( likely(!pte_pkeys) )
>> > +        return 0;
>> > +
>> > +    /* Update vcpu xsave area */
>> > +    fpu_xsave(vcpu);
>> 
>> Is there a reason you're calling fpu_xsave() directly here, rather
>> than
>> just calling vcpu_save_fpu()?  That saves you actually doing the
>> xsave
>> if the fpu hasn't been modified since the last time you read it.
> I use fpu_xsave instead of vcpu_save_fpu because of Jan's comment:
> Which is bogus by itself: That function isn't meant to be used for
> purposes like the one you have, e.g. due to its side effect of
> clearing ->fpu_dirtied. You really ought to be using a lower level
> function here (and I don't think the corresponding struct vcpu
> should get altered in any way). --Jan
> 
> And I can add 
>     if ( !vcpu->fpu_dirtied )    
> before fpu_xsave(vcpu);

PKRU being non-lazy state, how would fpu_dirtied matter here?

Jan


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-11  9:26     ` Jan Beulich
@ 2015-12-15  8:14       ` Han, Huaitong
  2015-12-15  9:02         ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Han, Huaitong @ 2015-12-15  8:14 UTC (permalink / raw)
  To: JBeulich
  Cc: Tian, Kevin, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, george.dunlap,
	xen-devel, Nakajima, Jun, keir

On Fri, 2015-12-11 at 02:26 -0700, Jan Beulich wrote:
> > > > On 10.12.15 at 19:19, <george.dunlap@citrix.com> wrote:
> > On 07/12/15 09:16, Huaitong Han wrote:
> > > +    if ( likely(!pte_pkeys) )
> > > +        return 0;
> > > +
> > > +    /* Update vcpu xsave area */
> > > +    fpu_xsave(vcpu);
> > 
> > Is there a reason you're calling fpu_xsave() directly here, rather
> > than
> > just calling vcpu_save_fpu()?  That saves you actually doing the
> > xsave
> > if the fpu hasn't been modified since the last time you read it.
> 
> I've already said on an earlier version that wholesale saving of the
> entire XSAVE state is wrong here. It should just be the single piece
> that we're actually interested in, and it quite likely shouldn't go
> into
> struct vcpu (but e.g. into a local buffer).

The comments on the V2 version said using vcpu_save_fpu is wrong because of
v->fpu_dirtied, but why is wholesale saving of the entire XSAVE state
wrong here? I understand xsave may cost a little more. But if we just
save the single piece, many functions are not reused because
xstate_comp_offsets is pointless, and we need to add a new function as
follows, which does not look good:

(just an example):
unsigned int get_xsave_pkru(void)
{
    struct xsave_struct *xstate;
    void *xsave_addr;
    unsigned int pkru = 0;

    xstate = _xzalloc(xsave_cntxt_size, 64);
    if ( xstate == NULL )
        return -ENOMEM;

    asm volatile ( ".byte 0x0f,0xae,0x27" /* XSAVE */
                        : "=m" (*xstate)
                        : "a" (XSTATE_PKRU), "d" (0), "D" (xstate) );

    xsave_addr = get_xsave_addr(xstate, fls64(XSTATE_PKRU)-1);
    if ( xsave_addr )
        memcpy(&pkru, xsave_addr, sizeof(pkru));

    xfree(xstate);

    return pkru;
}

Thanks a lot.
Huaitong

> 
> Jan
> 


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-15  8:14       ` Han, Huaitong
@ 2015-12-15  9:02         ` Jan Beulich
  2015-12-16  8:16           ` Han, Huaitong
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2015-12-15  9:02 UTC (permalink / raw)
  To: Huaitong Han
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, george.dunlap,
	xen-devel, Jun Nakajima, keir

>>> On 15.12.15 at 09:14, <huaitong.han@intel.com> wrote:
> On Fri, 2015-12-11 at 02:26 -0700, Jan Beulich wrote:
>> > > > On 10.12.15 at 19:19, <george.dunlap@citrix.com> wrote:
>> > On 07/12/15 09:16, Huaitong Han wrote:
>> > > +    if ( likely(!pte_pkeys) )
>> > > +        return 0;
>> > > +
>> > > +    /* Update vcpu xsave area */
>> > > +    fpu_xsave(vcpu);
>> > 
>> > Is there a reason you're calling fpu_xsave() directly here, rather
>> > than
>> > just calling vcpu_save_fpu()?  That saves you actually doing the
>> > xsave
>> > if the fpu hasn't been modified since the last time you read it.
>> 
>> I've already said on an earlier version that wholesale saving of the
>> entire XSAVE state is wrong here. It should just be the single piece
>> that we're actually interested in, and it quite likely shouldn't go
>> into
>> struct vcpu (but e.g. into a local buffer).
> 
> The comments on the V2 version said using vcpu_save_fpu is wrong because of
> v->fpu_dirtied, but why is wholesale saving of the entire XSAVE state
> wrong here? I understand xsave may cost a little more.

"A little" is quite a bit of an understatement.

> But if we just
> save the single piece, many functions are not reused because
> xstate_comp_offsets is pointless, and we need to add a new function as
> follows, which does not look good:

Well, I wouldn't want you to introduce a brand new function, but
instead just factor out the necessary piece from xsave() (making
the new one take a struct xsave_struct * instead of a struct vcpu *,
and calling it from what is now xsave()).
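
Roughly like this, perhaps (deliberately simplified sketch -- the real
xsave() also handles XSAVEOPT and the word-size fixups, which the split
would have to preserve):

    /* Low-level helper: save 'mask' state into an arbitrary area. */
    static void xsave_to_area(struct xsave_struct *ptr, uint64_t mask)
    {
        uint32_t lmask = mask, hmask = mask >> 32;

        asm volatile ( ".byte 0x48,0x0f,0xae,0x27" /* xsave64 */
                       : "=m" (*ptr)
                       : "a" (lmask), "d" (hmask), "D" (ptr) );
    }

    void xsave(struct vcpu *v, uint64_t mask)
    {
        xsave_to_area(v->arch.xsave_area, mask);
    }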

Jan


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-15  9:02         ` Jan Beulich
@ 2015-12-16  8:16           ` Han, Huaitong
  2015-12-16  8:32             ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Han, Huaitong @ 2015-12-16  8:16 UTC (permalink / raw)
  To: JBeulich
  Cc: Tian, Kevin, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, george.dunlap,
	xen-devel, Nakajima, Jun, keir

On Tue, 2015-12-15 at 02:02 -0700, Jan Beulich wrote:
> > > > On 15.12.15 at 09:14, <huaitong.han@intel.com> wrote:
> > On Fri, 2015-12-11 at 02:26 -0700, Jan Beulich wrote:
> > > > > > On 10.12.15 at 19:19, <george.dunlap@citrix.com> wrote:
> > > > On 07/12/15 09:16, Huaitong Han wrote:
> > > > > +    if ( likely(!pte_pkeys) )
> > > > > +        return 0;
> > > > > +
> > > > > +    /* Update vcpu xsave area */
> > > > > +    fpu_xsave(vcpu);
> > > > 
> > > > Is there a reason you're calling fpu_xsave() directly here,
> > > > rather
> > > > than
> > > > just calling vcpu_save_fpu()?  That saves you actually doing
> > > > the
> > > > xsave
> > > > if the fpu hasn't been modified since the last time you read
> > > > it.
> > > 
> > > I've already said on an earlier version that wholesale saving of
> > > the
> > > entire XSAVE state is wrong here. It should just be the single
> > > piece
> > > that we're actually interested in, and it quite likely shouldn't
> > > go
> > > into
> > > struct vcpu (but e.g. into a local buffer).
> > 
> > The comments on the V2 version said using vcpu_save_fpu is wrong
> > because of
> > v->fpu_dirtied, but why is wholesale saving of the entire XSAVE state
> > wrong here? I understand xsave may cost a little more.
> 
> "A little" is quite a bit of an understatement.
> 
> > But if we just
> > save the single piece, many functions are not reused because
> > xstate_comp_offsets is pointless, and we need to add a new function
> > as follows, which does not look good:
> 
> Well, I wouldn't want you to introduce a brand new function, but
> instead just factor out the necessary piece from xsave() (making
> the new one take a struct xsave_struct * instead of a struct vcpu *,
> and calling it from what is now xsave()).
So the function looks like this:
unsigned int get_xsave_pkru(struct vcpu *v)
{
    void *offset;
    struct xsave_struct *xsave_area;
    uint64_t mask = XSTATE_PKRU;
    unsigned int index = fls64(mask) - 1;
    unsigned int pkru = 0;

    if ( !cpu_has_xsave )
        return 0;
    
    BUG_ON(xsave_cntxt_size < XSTATE_AREA_MIN_SIZE);
    xsave_area = _xzalloc(xsave_cntxt_size, 64);
    if ( xsave_area == NULL )
        return 0;

    xsave(xsave_area, mask);
    offset = (void *)xsave_area + (xsave_area_compressed(xsave_area) ?
            XSTATE_AREA_MIN_SIZE : xstate_offsets[index] );
    memcpy(&pkru, offset, sizeof(pkru));

    xfree(xsave_area);

    return pkru;
}
> 
> Jan
> 


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-16  8:16           ` Han, Huaitong
@ 2015-12-16  8:32             ` Jan Beulich
  2015-12-16  9:03               ` Han, Huaitong
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2015-12-16  8:32 UTC (permalink / raw)
  To: Huaitong Han
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, george.dunlap,
	xen-devel, Jun Nakajima, keir

>>> On 16.12.15 at 09:16, <huaitong.han@intel.com> wrote:
> On Tue, 2015-12-15 at 02:02 -0700, Jan Beulich wrote:
>> Well, I wouldn't want you to introduce a brand new function, but
>> instead just factor out the necessary piece from xsave() (making
>> the new one take a struct xsave_struct * instead of a struct vcpu *,
>> and calling it from what is now xsave()).
> So the function looks like this:
> unsigned int get_xsave_pkru(struct vcpu *v)
> {
>     void *offset;
>     struct xsave_struct *xsave_area;
>     uint64_t mask = XSTATE_PKRU;
>     unsigned int index = fls64(mask) - 1;
>     unsigned int pkru = 0;
> 
>     if ( !cpu_has_xsave )
>         return 0;
>     
>     BUG_ON(xsave_cntxt_size < XSTATE_AREA_MIN_SIZE);
>     xsave_area = _xzalloc(xsave_cntxt_size, 64);
>     if ( xsave_area == NULL )
>         return 0;
> 
>     xsave(xsave_area, mask);
>     offset = (void *)xsave_area + (xsave_area_compressed(xsave_area) ?
>             XSTATE_AREA_MIN_SIZE : xstate_offsets[index] );
>     memcpy(&pkru, offset, sizeof(pkru));
> 
>     xfree(xsave_area);
> 
>     return pkru;
> }

Depending on how frequently this might get called, the allocation
overhead may not be tolerable. I.e. you may want to set up e.g.
a per-CPU buffer up front. Or you check whether using RDPKRU
(with temporarily setting CR4.PKE) is cheaper than what you
do right now.
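
The RDPKRU route might look like this (sketch only; 0f 01 ee is the
RDPKRU encoding, which requires ECX = 0 and clobbers EDX):

    static unsigned int read_guest_pkru(void)
    {
        unsigned int pkru;
        unsigned long cr4 = read_cr4();

        write_cr4(cr4 | X86_CR4_PKE);           /* first CR4 write */
        asm volatile ( ".byte 0x0f,0x01,0xee"   /* RDPKRU */
                       : "=a" (pkru) : "c" (0) : "dx" );
        write_cr4(cr4);                         /* second CR4 write */

        return pkru;
    }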

Also I don't think the buffer needs to be xsave_cntxt_size in size;
xstate_offsets[index] + sizeof(pkru) (and its equivalent in the
non-compressed case) should suffice.

And finally I don't think returning 0 in the allocation failure case
would be valid, as that - iiuc - means no restrictions at all, and
hence would hamper security inside the guest. But that's of
course moot if the allocation gets moved out of here.

Jan


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-16  8:32             ` Jan Beulich
@ 2015-12-16  9:03               ` Han, Huaitong
  2015-12-16  9:12                 ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Han, Huaitong @ 2015-12-16  9:03 UTC (permalink / raw)
  To: JBeulich, andrew.cooper3
  Cc: Tian, Kevin, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, ian.jackson, george.dunlap, xen-devel, Nakajima,
	Jun, keir

On Wed, 2015-12-16 at 01:32 -0700, Jan Beulich wrote:
> > > > On 16.12.15 at 09:16, <huaitong.han@intel.com> wrote:
> > On Tue, 2015-12-15 at 02:02 -0700, Jan Beulich wrote:
> > > Well, I wouldn't want you to introduce a brand new function, but
> > > instead just factor out the necessary piece from xsave() (making
> > > the new one take a struct xsave_struct * instead of a struct vcpu
> > > *,
> > > and calling it from what is now xsave()).
> > So the function looks like this:
> > unsigned int get_xsave_pkru(struct vcpu *v)
> > {
> >     void *offset;
> >     struct xsave_struct *xsave_area;
> >     uint64_t mask = XSTATE_PKRU;
> >     unsigned int index = fls64(mask) - 1;
> >     unsigned int pkru = 0;
> > 
> >     if ( !cpu_has_xsave )
> >         return 0;
> >     
> >     BUG_ON(xsave_cntxt_size < XSTATE_AREA_MIN_SIZE);
> >     xsave_area = _xzalloc(xsave_cntxt_size, 64);
> >     if ( xsave_area == NULL )
> >         return 0;
> > 
> >     xsave(xsave_area, mask);
> >     offset = (void *)xsave_area + (xsave_area_compressed(xsave_area) ?
> >             XSTATE_AREA_MIN_SIZE : xstate_offsets[index] );
> >     memcpy(&pkru, offset, sizeof(pkru));
> > 
> >     xfree(xsave_area);
> > 
> >     return pkru;
> > }
> 
> Depending on how frequently this might get called, the allocation
> overhead may not be tolerable. I.e. you may want to set up e.g.
> a per-CPU buffer up front. Or you check whether using RDPKRU
> (with temporarily setting CR4.PKE) is cheaper than what you
> do right now.
RDPKRU does cost less than the function, so if temporarily setting
CR4.PKE is acceptable, I will use RDPKRU instead of the function.

Andrew, what is your opinion?


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-16  9:03               ` Han, Huaitong
@ 2015-12-16  9:12                 ` Jan Beulich
  2015-12-17  9:18                   ` Han, Huaitong
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2015-12-16  9:12 UTC (permalink / raw)
  To: Huaitong Han
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, george.dunlap,
	xen-devel, Jun Nakajima, keir

>>> On 16.12.15 at 10:03, <huaitong.han@intel.com> wrote:
> On Wed, 2015-12-16 at 01:32 -0700, Jan Beulich wrote:
>> > > > On 16.12.15 at 09:16, <huaitong.han@intel.com> wrote:
>> > On Tue, 2015-12-15 at 02:02 -0700, Jan Beulich wrote:
>> > > Well, I wouldn't want you to introduce a brand new function, but
>> > > instead just factor out the necessary piece from xsave() (making
>> > > the new one take a struct xsave_struct * instead of a struct vcpu
>> > > *,
>> > > and calling it from what is now xsave()).
>> > So the function looks like this:
>> > unsigned int get_xsave_pkru(struct vcpu *v)
>> > {
>> >     void *offset;
>> >     struct xsave_struct *xsave_area;
>> >     uint64_t mask = XSTATE_PKRU;
>> >     unsigned int index = fls64(mask) - 1;
>> >     unsigned int pkru = 0;
>> > 
>> >     if ( !cpu_has_xsave )
>> >         return 0;
>> >     
>> >     BUG_ON(xsave_cntxt_size < XSTATE_AREA_MIN_SIZE);
>> >     xsave_area = _xzalloc(xsave_cntxt_size, 64);
>> >     if ( xsave_area == NULL )
>> >         return 0;
>> > 
>> >     xsave(xsave_area, mask);
>> >     offset = (void *)xsave_area + (xsave_area_compressed(xsave_area) ?
>> >             XSTATE_AREA_MIN_SIZE : xstate_offsets[index] );
>> >     memcpy(&pkru, offset, sizeof(pkru));
>> > 
>> >     xfree(xsave_area);
>> > 
>> >     return pkru;
>> > }
>> 
>> Depending on how frequently this might get called, the allocation
>> overhead may not be tolerable. I.e. you may want to set up e.g.
>> a per-CPU buffer up front. Or you check whether using RDPKRU
>> (with temporarily setting CR4.PKE) is cheaper than what you
>> do right now.
> RDPKRU does cost less than the function, so if temporarily setting
> CR4.PKE is acceptable, I will use RDPKRU instead of the function.

The question isn't just the RDPKRU cost, but that of the two CR4
writes.

Jan


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-11  9:16     ` Wu, Feng
  2015-12-11  9:23       ` Jan Beulich
@ 2015-12-16 15:36       ` George Dunlap
  2015-12-16 16:28         ` Tim Deegan
  2015-12-18  8:21         ` Han, Huaitong
  1 sibling, 2 replies; 44+ messages in thread
From: George Dunlap @ 2015-12-16 15:36 UTC (permalink / raw)
  To: Wu, Feng, Han, Huaitong, jbeulich, andrew.cooper3, Nakajima, Jun,
	Dong, Eddie, Tian, Kevin, george.dunlap, ian.jackson,
	stefano.stabellini, ian.campbell, wei.liu2, keir
  Cc: Tim Deegan, xen-devel

[Adding Tim, the previous mm maintainer]

On 11/12/15 09:16, Wu, Feng wrote:
>>> +{
>>> +    void *xsave_addr;
>>> +    unsigned int pkru = 0;
>>> +    bool_t pkru_ad, pkru_wd;
>>> +
>>> +    bool_t uf = !!(pfec & PFEC_user_mode);
>>> +    bool_t wf = !!(pfec & PFEC_write_access);
>>> +    bool_t ff = !!(pfec & PFEC_insn_fetch);
>>> +    bool_t rsvdf = !!(pfec & PFEC_reserved_bit);
>>> +    bool_t pkuf  = !!(pfec & PFEC_prot_key);
>>
>> So I'm just wondering out loud here -- is there actually any situation
>> in which we would want guest_walk_tables to act differently than the
>> real hardware?
>>
>> That is, is there actually any situation where, pku is enabled, the vcpu
>> is in long mode, PFEC_write_access and/or PFEC_page_present is set, and
>> the pkey is non-zero, that we want guest_walk_tables() to only check the
>> write-protect bit for the pte, and not also check the pkru?
>>
>> Because if not, it seems like it would be much more robust to simply
>> *always* check for pkru_ad if PFEC_page_present is set, and for pkru_wd
>> if PFEC_write_access is set.
> 
> guest_walk_tables() is also used by shadow code; though we don't
> plan to support pkeys for shadow now, in that case the 'pfec'
> is generated by hardware, and the pkuf bit may be 0 or 1 depending
> on the real page fault. If we unconditionally check pkeys in
> guest_walk_tables(), it is not a good idea for someone who may
> want to implement pkeys for shadow in the future, since we only
> need to check pkeys when the pkuf bit is set in pfec.

So you'll have to forgive me for being a bit slow here -- I stepped away
from the mm code for a while, and a lot has changed since I agreed to
become mm maintainer earlier this year.

So here is what I see in the tree; please correct me if I missed
something important.

The function guest_walk_tables():
 1. Accepts a pfec argument which is meant to describe the *access type*
that is happening
 2. Returns a value with the "wrong" flags (i.e., flags which needed to
be set that were missing, or flags that needed to be clear that were set).
 3. It checks reserved bits in the pagetable regardless of whether
PFEC_reserved_bit is set in the caller, and returns _PAGE_INVALID_BITS
if so.
 4. If PFEC_insn_fetch is set, it will only check for nx and smep if
those features are enabled for the guest.

The main callers are the various gva_to_gfn(), which accept a *pointer*
to a pfec argument.  This pointer is actually bidirectional: its value
passed to guest_walk_tables() to determine what access types are
checked; what is returned is meant to be a pfec value which can be
passed to a guest as part of a fault (and is by several callers).  But
it may also return the Xen-internal flags, PFEC_page_{paged,shared}.
And, importantly, it modifies the pfec value based on the bits that are
returned from guest_walk_tables(): It will clear PFEC_present if
guest_walk_tables() returns _PAGE_PRESENT, and it will set
PFEC_reserved_bit if guest_walk_tables() returns _PAGE_INVALID_BITS.
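
In code, that translation step is roughly (paraphrased, not a verbatim
quote of the tree):

    missing = guest_walk_tables(v, p2m, va, &gw, pfec[0], top_mfn, top_map);
    if ( missing & _PAGE_PRESENT )
        pfec[0] &= ~PFEC_page_present;
    if ( missing & _PAGE_INVALID_BITS )
        pfec[0] |= PFEC_reserved_bit;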

The next logical level up are the
hvm_{copy,fetch}_{to,from}_guest_virt(), which also take a pfec
argument: but really the only purpose of the pfec argument is to allow
the caller to add the PFEC_user_mode flag; the other access-type flags
are set automatically by the functions themselves (e.g., "to" sets
PFEC_write_access, "fetch" sets PFEC_insn_fetch, &c).  Several callers
set PFEC_present as well, but no callers set any other bits.

(hvm_fetch_from_guest_virt() seems to only set PFEC_insn_fetch if nx or
smep are enabled in the guest.  This seems inconsistent to me with the
treatment of PFEC_reserved_bit: it seems like
hvm_fetch_from_guest_virt() should always pass in PFEC_insn_fetch,
particularly as guest_walk_tables() will already gate the checks based
on whether nx or smep is enabled in the guest.  Tim, you know of any
reason for this?)

So there seems to me to be no reason to pass PFEC_prot_key into
guest_walk_tables() or gva_to_gfn().  The pfec value passed *into* those
should simply indicate the type of memory access being done: present,
write, instruction fetch, user.

With your current series, guest_walk_tables() already checks for pkeys
being enabled in the guest before checking for them in the pagetables.
For shadow mode, these will be false, and so no checks will be done.  If
anyone ever implements pkeys for shadow mode, then these will be
enabled, and the checks will be done, without any intervention on the
part of the caller.

The only thing that's missing is:
1. for gva_to_gfn() to check for _PAGE_PKEY_BITS on the return from
guest_walk_tables(), and set PFEC_prot_key if so
2. to set PFEC_insn_fetch when pkeys are enabled in
hvm_fetch_from_guest_virt(), so that guest_walk_tables() knows that it's
an instruction fetch.
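
Concretely, something like this (sketch; hvm_pku_enabled() being the
predicate introduced earlier in this series):

    /* 1. in gva_to_gfn(), next to the existing missing-bit handling: */
    if ( missing & _PAGE_PKEY_BITS )
        pfec[0] |= PFEC_prot_key;

    /* 2. in hvm_fetch_from_guest_virt(): */
    if ( hvm_nx_enabled(v) || hvm_smep_enabled(v) || hvm_pku_enabled(v) )
        pfec |= PFEC_insn_fetch;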

>> Then in patch 8, you wouldn't need to go around all the __hvm_copy
>> functions adding in PFEC_prot; instead, you'd just need to add
>> PFEC_insn_fetch to the "fetch" (as is already done for SMEP and NX), and
>> you'd be done.
> 
> Pkeys have no business with instruction fetches, so why do we need to
> add PFEC_insn_fetch?

PFEC_insn_fetch passed *into* guest_walk_tables() indicates that the
access is an instruction fetch.  Your code treats instruction fetches
differently than normal reads, yes?

At the moment PFEC_insn_fetch is only set in hvm_fetch_from_guest_virt()
if hvm_nx_enabled() or hvm_smep_enabled() are true.  Which means that if
you *don't* have nx or smep enabled, then the patch series as is will
fault on instruction fetches when it shouldn't.  (I don't remember
anyone mentioning nx or smep being enabled as a prerequisite for pkeys.)

(And it seems to me that it would be better to always set
PFEC_insn_fetch in hvm_fetch_from_guest_virt(), but that's a bit beyond
the scope of what you're trying to accomplish here.)

 -George


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-16 15:36       ` George Dunlap
@ 2015-12-16 16:28         ` Tim Deegan
  2015-12-16 16:34           ` Andrew Cooper
  2015-12-16 16:50           ` George Dunlap
  2015-12-18  8:21         ` Han, Huaitong
  1 sibling, 2 replies; 44+ messages in thread
From: Tim Deegan @ 2015-12-16 16:28 UTC (permalink / raw)
  To: George Dunlap
  Cc: wei.liu2, Tian, Kevin, Wu, Feng, jbeulich, stefano.stabellini,
	george.dunlap, andrew.cooper3, Dong, Eddie, xen-devel, Nakajima,
	Jun, Han, Huaitong, keir, ian.jackson, ian.campbell

Hi,

At 15:36 +0000 on 16 Dec (1450280191), George Dunlap wrote:
> (hvm_fetch_from_guest_virt() seems to only set PFEC_insn_fetch if nx or
> smep are enabled in the guest.  This seems inconsistent to me with the
> treatment of PFEC_reserved_bit: it seems like
> hvm_fetch_from_guest_virt() should always pass in PFEC_insn_fetch,
> particularly as guest_walk_tables() will already gate the checks based
> on whether nx or smep is enabled in the guest.  Tim, you know of any
> reason for this?)

This code is following the hardware, IIRC: on real CPUs without NX
enabled, pagefault error codes will not have that bit set even when
caused by instruction fetches.  Looking at the SDM (3A.4.7) the
current rules are for that bit to be set iff ((PAE && NXE) || SMEP).

It seems unfortunate to have this check duplicated between
guest_walk_tables() and its callers, so potentially g_w_t() should
explicitly _clear_ PFEC_insn_fetch in cases where it should not be
returned, and then its callers can set it unconditionally for
all instruction fetches.
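
I.e. something like (sketch; the predicate names are approximate):

    /* In guest_walk_tables(): callers now set PFEC_insn_fetch for every
     * fetch; hide it again where hardware would not report it. */
    if ( !(guest_supports_nx(v) || hvm_smep_enabled(v)) )
        pfec &= ~PFEC_insn_fetch;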

(There are similar rules for the new PK bit, which should not be shown
to the guest if CR4.PKE is clear.)

Cheers,

Tim.


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-16 16:28         ` Tim Deegan
@ 2015-12-16 16:34           ` Andrew Cooper
  2015-12-16 17:33             ` Tim Deegan
  2015-12-16 16:50           ` George Dunlap
  1 sibling, 1 reply; 44+ messages in thread
From: Andrew Cooper @ 2015-12-16 16:34 UTC (permalink / raw)
  To: Tim Deegan, George Dunlap
  Cc: Tian, Kevin, Wu, Feng, Nakajima, Jun, stefano.stabellini,
	george.dunlap, ian.jackson, Dong, Eddie, xen-devel, jbeulich,
	Han, Huaitong, wei.liu2, keir, ian.campbell

On 16/12/15 16:28, Tim Deegan wrote:
> Hi,
>
> At 15:36 +0000 on 16 Dec (1450280191), George Dunlap wrote:
>> (hvm_fetch_from_guest_virt() seems to only set PFEC_insn_fetch if nx or
>> smep are enabled in the guest.  This seems inconsistent to me with the
>> treatment of PFEC_reserved_bit: it seems like
>> hvm_fetch_from_guest_virt() should always pass in PFEC_insn_fetch,
>> particularly as guest_walk_tables() will already gate the checks based
>> on whether nx or smep is enabled in the guest.  Tim, you know of any
>> reason for this?)
> This code is following the hardware, IIRC: on real CPUs without NX
> enabled, pagefault error codes will not have that bit set even when
> caused by instruction fetches.

There is a difference between Intel and AMD.  Intel has the bit reserved
(and therefore faults), while AMD declares the bit as ignored (and
doesn't fault).

These are yet more subtle differences which are not quite accounted for
correctly in Xen (which I am frankly unlikely to even get to in cpuid v2).

~Andrew

>   Looking at the SDM (3A.4.7) the
> current rules are for that bit to be set iff ((PAE && NXE) || SMEP).
>
> It seems unfortunate to have this check duplicated between
> guest_walk_tables() and its callers, so potentially g_w_t() should
> explicitly _clear_ PFEC_insn_fetch in cases where it should not be
> returned, and then its callers can set it unconditionally for
> all instruction fetches.
>
> (There are similar rules for the new PK bit, which should not be shown
> to the guest if CR4.PKE is clear.)
>
> Cheers,
>
> Tim.


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-16 16:28         ` Tim Deegan
  2015-12-16 16:34           ` Andrew Cooper
@ 2015-12-16 16:50           ` George Dunlap
  2015-12-16 17:21             ` Tim Deegan
  1 sibling, 1 reply; 44+ messages in thread
From: George Dunlap @ 2015-12-16 16:50 UTC (permalink / raw)
  To: Tim Deegan
  Cc: wei.liu2, Tian, Kevin, Wu, Feng, jbeulich, stefano.stabellini,
	george.dunlap, andrew.cooper3, Dong, Eddie, xen-devel, Nakajima,
	Jun, Han, Huaitong, keir, ian.jackson, ian.campbell

On 16/12/15 16:28, Tim Deegan wrote:
> Hi,
> 
> At 15:36 +0000 on 16 Dec (1450280191), George Dunlap wrote:
>> (hvm_fetch_from_guest_virt() seems to only set PFEC_insn_fetch if nx or
>> smep are enabled in the guest.  This seems inconsistent to me with the
>> treatment of PFEC_reserved_bit: it seems like
>> hvm_fetch_from_guest_virt() should always pass in PFEC_insn_fetch,
>> particularly as guest_walk_tables() will already gate the checks based
>> on whether nx or smep is enabled in the guest.  Tim, you know of any
>> reason for this?)
> 
> This code is following the hardware, IIRC: on real CPUs without NX
> enabled, pagefault error codes will not have that bit set even when
> caused by instruction fetches.  Looking at the SDM (3A.4.7) the
> current rules are for that bit to be set iff ((PAE && NXE) || SMEP).

Right, but I think your answer points to the fundamental problem with
the way this code has been written.  The pfec value passed to
gva_to_gfn() is used for two separate functions: 1) to tell
guest_walk_tables() what kind of access is happening, so that it can
check the appropriate bits; 2) to be used as a pfec value to pass back to
the guest when delivering a page fault.

These two functions should be clearly separated, but right now they're
not: hvm_fetch_from_guest_virt() is "pre-clearing" PFEC_insn_fetch, so
that guest_walk_tables() doesn't even know that an instruction fetch is
happening; as such, it's likely that the pkey code in this series will
do the wrong thing (fault on an instruction fetch) if pkeys are enabled
but nx and smep are not.

Really those should be two separate parameters, one in and one out; and
it should be gva_to_gfn()'s job to translate the missing bits from
guest_walk_tables() into a pfec value suitable to use when injecting a
fault.  (Or potentially guest_walk_tables()'s job.)
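
For instance (a purely hypothetical signature, just to illustrate the
separation):

    /* in: the access type; the return value is the missing _PAGE_*
     * flags, which gva_to_gfn() folds into the guest-visible PFEC. */
    uint32_t guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
                               unsigned long va, walk_t *gw,
                               uint32_t access_type /* not a PFEC */);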

 -George


* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-16 16:50           ` George Dunlap
@ 2015-12-16 17:21             ` Tim Deegan
  0 siblings, 0 replies; 44+ messages in thread
From: Tim Deegan @ 2015-12-16 17:21 UTC (permalink / raw)
  To: George Dunlap
  Cc: wei.liu2, Tian, Kevin, Wu, Feng, jbeulich, stefano.stabellini,
	george.dunlap, andrew.cooper3, Dong, Eddie, xen-devel, Nakajima,
	Jun, Han, Huaitong, keir, ian.jackson, ian.campbell

At 16:50 +0000 on 16 Dec (1450284615), George Dunlap wrote:
> On 16/12/15 16:28, Tim Deegan wrote:
> > Hi,
> > 
> > At 15:36 +0000 on 16 Dec (1450280191), George Dunlap wrote:
> >> (hvm_fetch_from_guest_virt() seems to only set PFEC_insn_fetch if nx or
> >> smep are enabled in the guest.  This seems inconsistent to me with the
> >> treatment of PFEC_reserved_bit: it seems like
> >> hvm_fetch_from_guest_virt() should always pass in PFEC_insn_fetch,
> >> particularly as guest_walk_tables() will already gate the checks based
> >> on whether nx or smep is enabled in the guest.  Tim, you know of any
> >> reason for this?)
> > 
> > This code is following the hardware, IIRC: on real CPUs without NX
> > enabled, pagefault error codes will not have that bit set even when
> > caused by instruction fetches.  Looking at the SDM (3A.4.7) the
> > current rules are for that bit to be set iff ((PAE && NXE) || SMEP).
> 
> Right, but I think your answer points to the fundamental problem with
> the way this code has been written.  The pfec value passed to
> gva_to_gfn() is used for two separate functions: 1) to tell
> guest_walk_tables() what kind of access is happening, so that it can
> check the appropriate bits 2) to be used as a pfec value to pass back to
> the guest when delivering a page fault.

Yes.  At the time, this made sense: the caller and the walker were
collectively assembling the PFEC, with the caller providing bits that
depended on the kind of access and the walker managing the bits that
depended on the pagetables.  We could have had a separate flags
parameter to describe (read/write/fetch X user/supervisor), but that
would by construction have been identical to the initial set of PFEC bits.

Now that we have a potential case where the hardware behaves
differently for instruction fetches but _doesn't_ flag them in the
PFEC, then we clearly need to track them separately.
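
In other words, something like a separate set of walk flags, decoupled
from the PFEC encoding (sketch only; the WALK_* names are invented):

/* Describe the access to the walker without reusing PFEC bits. */
#define WALK_WRITE  (1u << 0)
#define WALK_USER   (1u << 1)
#define WALK_FETCH  (1u << 2)  /* always set for fetches, even when
                                  neither NX nor SMEP is enabled */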

> Really those should be two separate parameters, one in and one out; and
> it should be gva_to_gfn()'s job to translate the missing bits from
> guest_walk_tables() into a pfec value suitable to use when injecting a
> fault.  (Or potentially guest_walk_tables()'s job.)

Agreed.

Tim.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-16 16:34           ` Andrew Cooper
@ 2015-12-16 17:33             ` Tim Deegan
  0 siblings, 0 replies; 44+ messages in thread
From: Tim Deegan @ 2015-12-16 17:33 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Tian, Kevin, Wu, Feng, Nakajima, Jun, stefano.stabellini,
	george.dunlap, ian.jackson, Dong, Eddie, George Dunlap,
	xen-devel, jbeulich, Han, Huaitong, wei.liu2, keir, ian.campbell

At 16:34 +0000 on 16 Dec (1450283699), Andrew Cooper wrote:
> On 16/12/15 16:28, Tim Deegan wrote:
> > Hi,
> >
> > At 15:36 +0000 on 16 Dec (1450280191), George Dunlap wrote:
> >> (hvm_fetch_from_guest_virt() seems to only set PFEC_insn_fetch if nx or
> >> smep are enabled in the guest.  This seems inconsistent to me with the
> >> treatment of PFEC_reserved_bit: it seems like
> >> hvm_fetch_from_guest_virt() should always pass in PFEC_insn_fetch,
> >> particularly as guest_walk_tables() will already gate the checks based
> >> on whether nx or smep is enabled in the guest.  Tim, you know of any
> >> reason for this?)
> > This code is following the hardware, IIRC: on real CPUs without NX
> > enabled, pagefault error codes will not have that bit set even when
> > caused by instruction fetches.
> 
> There is a difference between Intel and AMD.  Intel has the bit reserved
> (and therefore faults), while AMD declares the bit as ignored (and
> doesn't fault).

I think we are talking about different things.  Like the U/S and W/R
bits, the I/D bit of a PFEC describes what kind of access was being
made, and is not affected by the contents of the pagetables at all.
(cf. the P bit and the new PK bit, which do depend on PT contents).

When neither NX nor SMEP is enabled, we must not set the I/D bit in
the PFEC we give to the guest, since real CPUs would not do so.
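
In code terms the rule is roughly this (sketch, following SDM 3A.4.7;
the helper name is invented):

#include <stdbool.h>

/* May the I/D bit appear in a PFEC delivered to this guest? */
static bool guest_pfec_reports_insn_fetch(bool pae, bool nxe, bool smep)
{
    return (pae && nxe) || smep;
}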

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-16  9:12                 ` Jan Beulich
@ 2015-12-17  9:18                   ` Han, Huaitong
  2015-12-17 10:05                     ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Han, Huaitong @ 2015-12-17  9:18 UTC (permalink / raw)
  To: JBeulich
  Cc: Tian, Kevin, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, george.dunlap,
	xen-devel, Nakajima, Jun, keir

On Wed, 2015-12-16 at 02:12 -0700, Jan Beulich wrote:
> > > > On 16.12.15 at 10:03, <huaitong.han@intel.com> wrote:
> > On Wed, 2015-12-16 at 01:32 -0700, Jan Beulich wrote:
> > > > > > On 16.12.15 at 09:16, <huaitong.han@intel.com> wrote:
> > > > On Tue, 2015-12-15 at 02:02 -0700, Jan Beulich wrote:
> > > > > Well, I wouldn't want you to introduce a brand new function,
> > > > > but
> > > > > instead just factor out the necessary piece from xsave()
> > > > > (making
> > > > > the new one take a struct xsave_struct * instead of a struct
> > > > > vcpu
> > > > > *,
> > > > > and calling it from what is now xsave()).
> > > > So the function looks like this:
> > > > unsigned int get_xsave_pkru(struct vcpu *v)
> > > > {
> > > >     void *offset;
> > > >     struct xsave_struct *xsave_area;
> > > >     uint64_t mask = XSTATE_PKRU;
> > > >     unsigned int index = fls64(mask) - 1;
> > > >     unsigned int pkru = 0;
> > > > 
> > > >     if ( !cpu_has_xsave )
> > > >         return 0;
> > > >     
> > > >     BUG_ON(xsave_cntxt_size < XSTATE_AREA_MIN_SIZE);
> > > >     xsave_area = _xzalloc(xsave_cntxt_size, 64);
> > > >     if ( xsave_area == NULL )
> > > >         return 0;
> > > > 
> > > >     xsave(xsave_area, mask);
> > > >     offset = (void *)xsave_area + (xsave_area_compressed(xsave)
> > > > ? 
> > > >             XSTATE_AREA_MIN_SIZE : xstate_offsets[index] );
> > > >     memcpy(&pkru, offset, sizeof(pkru));
> > > > 
> > > >     xfree(xsave_area);
> > > > 
> > > >     return pkru;
> > > > }
> > > 
> > > Depending on how frequently this might get called, the allocation
> > > overhead may not be tolerable. I.e. you may want to set up e.g.
> > > a per-CPU buffer up front. Or you check whether using RDPKRU
> > > (with temporarily setting CR4.PKE) is cheaper than what you
> > > do right now.
> > RDPKRU does cost less than the function, and if temporarily setting
> > CR4.PKE is accepted, I will use RDPKRU instead of the function.
> 
> The question isn't just the RDPKRU cost, but that of the two CR4
> writes.
Testing results using the NOW() function (time for 10 executions of
each function):
(XEN)xsave time is 1376 ns
(XEN)read_pkru time is 28 ns

So the read_pkru() function does cost less than get_xsave_pkru(), and I
will use read_pkru().


Testing codes:
 static void reboot_machine(unsigned char key, struct cpu_user_regs
*regs)
 {
-    printk("'%c' pressed -> rebooting machine\n", key);
-    machine_restart(0);
+    /* printk("read pkru test\n"); */
+    s_time_t pre, last;
+    unsigned int pkru = 0;
+    unsigned int i;
+    void *offset;
+    struct xsave_struct *save_area = _xzalloc(1024, 64);
+
+    pre = NOW();
+    for ( i = 0; i < 10; i++ )
+    {
+        xsave(current, XSTATE_PKRU, save_area);
+        offset = (void *)save_area + XSTATE_AREA_MIN_SIZE;
+        memcpy(&pkru, offset, sizeof(pkru));
+    }
+    last = NOW();
+    printk("xsave time is %lu\n", (last - pre));
+
+    pre = NOW();
+    for ( i = 0; i < 10; i++ )
+        pkru = read_pkru();
+    last = NOW();
+    printk("read_pkru time is %lu\n", (last - pre));
+    /* machine_restart(0); */
 }

static inline unsigned int read_pkru(void)
{
    unsigned int pkru;

    set_in_cr4(X86_CR4_PKE);
    /* rdpkru (0f 01 ee): reads PKRU into eax; requires ecx = 0 */
    asm volatile (".byte 0x0f,0x01,0xee"
        : "=a" (pkru) : "c" (0) : "dx");
    clear_in_cr4(X86_CR4_PKE);

    return pkru;
}
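
For reference, the write side would presumably be the mirror image
(untested sketch; WRPKRU is opcode 0f 01 ef and requires ecx = edx = 0):

static inline void write_pkru(unsigned int pkru)
{
    set_in_cr4(X86_CR4_PKE);
    asm volatile (".byte 0x0f,0x01,0xef" /* wrpkru */
        : : "a" (pkru), "c" (0), "d" (0));
    clear_in_cr4(X86_CR4_PKE);
}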

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-17  9:18                   ` Han, Huaitong
@ 2015-12-17 10:05                     ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2015-12-17 10:05 UTC (permalink / raw)
  To: Huaitong Han
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, ian.jackson, george.dunlap,
	xen-devel, Jun Nakajima, keir

>>> On 17.12.15 at 10:18, <huaitong.han@intel.com> wrote:
> On Wed, 2015-12-16 at 02:12 -0700, Jan Beulich wrote:
>> > > > On 16.12.15 at 10:03, <huaitong.han@intel.com> wrote:
>> > On Wed, 2015-12-16 at 01:32 -0700, Jan Beulich wrote:
>> > > Depending on how frequently this might get called, the allocation
>> > > overhead may not be tolerable. I.e. you may want to set up e.g.
>> > > a per-CPU buffer up front. Or you check whether using RDPKRU
>> > > (with temporarily setting CR4.PKE) is cheaper than what you
>> > > do right now.
>> > RDPKRU does cost less than the function, and if temporarily setting
>> > CR4.PKE is accepted, I will use RDPKRU instead of the function.
>> 
>> The question isn't just the RDPKRU cost, but that of the two CR4
>> writes.
> Testing result with NOW() function:
> Time of the function 10 times execution
> (XEN)xsave time is 1376 ns
> (XEN)read_pkru time is 28 ns

Wow, that's a huge difference. I'm having trouble believing two
CR4 writes are this fast, and an XSAVE of just a single 32-bit item
is this slow. But well - you measured it.

Jan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-16 15:36       ` George Dunlap
  2015-12-16 16:28         ` Tim Deegan
@ 2015-12-18  8:21         ` Han, Huaitong
  2015-12-18 10:03           ` George Dunlap
  2015-12-18 11:46           ` Tim Deegan
  1 sibling, 2 replies; 44+ messages in thread
From: Han, Huaitong @ 2015-12-18  8:21 UTC (permalink / raw)
  To: Wu, Feng, george.dunlap
  Cc: tim, Tian, Kevin, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, Dong, Eddie, xen-devel, jbeulich,
	Nakajima, Jun, keir, ian.jackson

On Wed, 2015-12-16 at 15:36 +0000, George Dunlap wrote:
> [Adding Tim, the previous mm maintainer]
> 
> On 11/12/15 09:16, Wu, Feng wrote:
> > > > +{
> > > > +    void *xsave_addr;
> > > > +    unsigned int pkru = 0;
> > > > +    bool_t pkru_ad, pkru_wd;
> > > > +
> > > > +    bool_t uf = !!(pfec & PFEC_user_mode);
> > > > +    bool_t wf = !!(pfec & PFEC_write_access);
> > > > +    bool_t ff = !!(pfec & PFEC_insn_fetch);
> > > > +    bool_t rsvdf = !!(pfec & PFEC_reserved_bit);
> > > > +    bool_t pkuf  = !!(pfec & PFEC_prot_key);
> > > 
> > > So I'm just wondering out loud here -- is there actually any
> > > situation
> > > in which we would want guest_walk_tables to act differently than
> > > the
> > > real hardware?
> > > 
> > > That is, is there actually any situation where, pku is enabled,
> > > the vcpu
> > > is in long mode, PFEC_write_access and/or PFEC_page_present is
> > > set, and
> > > the pkey is non-zero, that we want guest_walk_tables() to only
> > > check the
> > > write-protect bit for the pte, and not also check the pkru?
> > > 
> > > Because if not, it seems like it would be much more robust to
> > > simply
> > > *always* check for pkru_ad if PFEC_page_present is set, and for
> > > pkru_wd
> > > if PFEC_write_access is set.
> > 
> > guest_walk_tables() is also used by shadow code. Though we don't
> > plan to support pkeys for shadow now, in that case the 'pfec' is
> > generated by hardware, and the pkuf bit may be 0 or 1 depending
> > on the real page fault. Unconditionally checking pkeys in
> > guest_walk_tables() is not a good idea for someone who may want
> > to implement pkeys for shadow in future, since we only need to
> > check pkeys when pkuf is set in pfec.
> 
> So you'll have to forgive me for being a bit slow here -- I stepped
> away
> from the mm code for a while, and a lot has changed since I agreed to
> become mm maintainer earlier this year.
> 
> So here is what I see in the tree; please correct me if I missed
> something important.
> 
> The function guest_walk_tables():
>  1. Accepts a pfec argument which is meant to describe the *access
> type*
> that is happening
>  2. Returns a value with the "wrong" flags (i.e., flags which needed
> to
> be set that were missing, or flags that needed to be clear that were
> set).
>  3. It checks reserved bits in the pagetable regardless of whether
> PFEC_reserved_bit is set in the caller, and returns
> _PAGE_INVALID_BITS
> if so.
>  4. If PFEC_insn_fetch is set, it will only check for nx and smep if
> those features are enabled for the guest.
> 
> The main callers are the various gva_to_gfn(), which accept a
> *pointer*
> to a pfec argument.  This pointer is actually bidirectional: its
> value
> passed to guest_walk_tables() to determine what access types are
> checked; what is returned is meant to be a pfec value which can be
> passed to a guest as part of a fault (and is by several callers). 
>  But
> it may also return the Xen-internal flags, PFEC_page_{paged,shared}.
> And, importantly, it modifies the pfec value based on the bits that
> are
> returned from guest_walk_tables(): It will clear PFEC_present if
> guest_walk_tables() returns _PAGE_PRESENT, and it will set
> PFEC_reserved_bit if guest_walk_tables() returns _PAGE_INVALID_BIT.
> 
> The next logical level up are the
> hvm_{copy,fetch}_{to,from}_guest_virt(), which also take a pfec
> argument: but really the only purpose of the pfec argument is to
> allow
> the caller to add the PFEC_user_mode flag; the other access-type
> flags
> are set automatically by the functions themselves (e.g., "to" sets
> PFEC_write_access, "fetch" sets PFEC_insn_fetch, &c).  Several
> callers
> set PFEC_present as well, but no callers set any other bits.
> 
> (hvm_fetch_from_guest_virt() seems to only set PFEC_insn_fetch if nx
> or
> smep are enabled in the guest.  This seems inconsistent to me with
> the
> treatment of PFEC_reserved_bit: it seems like
> hvm_fetch_from_guest_virt() should always pass in PFEC_insn_fetch,
> particularly as guest_walk_tables() will already gate the checks
> based
> on whether nx or smep is enabled in the guest.  Tim, you know of any
> reason for this?)
> 
> So there seems to me to be no reason to pass PFEC_prot_key into
> guest_walk_tables() or gva_to_gfn().  The pfec value passed *into*
> those
> should simply indicate the type of memory access being done: present,
> write, instruction fetch, user.
> 
> With your current series, guest_walk_tables() already checks for
> pkeys
> being enabled in the guest before checking for them in the
> pagetables.
> For shadow mode, these will be false, and so no checks will be done. 
>  If
> anyone ever implements pkeys for shadow mode, then these will be
> enabled, and the checks will be done, without any intervention on the
> part of the caller.

I understand that, but the problem with shadow mode is that the pfec
may come from regs->error_code (i.e. from hardware), as in:
rc = sh_walk_guest_tables(v, va, &gw, regs->error_code);
so, when regs->error_code does not have PFEC_prot_key set,
guest_walk_tables() may still check the pkey if the code is written as
you describe, and it may return a different result.
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-18  8:21         ` Han, Huaitong
@ 2015-12-18 10:03           ` George Dunlap
  2015-12-18 11:46           ` Tim Deegan
  1 sibling, 0 replies; 44+ messages in thread
From: George Dunlap @ 2015-12-18 10:03 UTC (permalink / raw)
  To: Han, Huaitong, Wu, Feng
  Cc: tim, Tian, Kevin, wei.liu2, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, Dong, Eddie, xen-devel, jbeulich,
	Nakajima, Jun, keir, ian.jackson

On 18/12/15 08:21, Han, Huaitong wrote:
> On Wed, 2015-12-16 at 15:36 +0000, George Dunlap wrote:
>> [Adding Tim, the previous mm maintainer]
>>
>> On 11/12/15 09:16, Wu, Feng wrote:
>>>>> +{
>>>>> +    void *xsave_addr;
>>>>> +    unsigned int pkru = 0;
>>>>> +    bool_t pkru_ad, pkru_wd;
>>>>> +
>>>>> +    bool_t uf = !!(pfec & PFEC_user_mode);
>>>>> +    bool_t wf = !!(pfec & PFEC_write_access);
>>>>> +    bool_t ff = !!(pfec & PFEC_insn_fetch);
>>>>> +    bool_t rsvdf = !!(pfec & PFEC_reserved_bit);
>>>>> +    bool_t pkuf  = !!(pfec & PFEC_prot_key);
>>>>
>>>> So I'm just wondering out loud here -- is there actually any
>>>> situation
>>>> in which we would want guest_walk_tables to act differently than
>>>> the
>>>> real hardware?
>>>>
>>>> That is, is there actually any situation where, pku is enabled,
>>>> the vcpu
>>>> is in long mode, PFEC_write_access and/or PFEC_page_present is
>>>> set, and
>>>> the pkey is non-zero, that we want guest_walk_tables() to only
>>>> check the
>>>> write-protect bit for the pte, and not also check the pkru?
>>>>
>>>> Because if not, it seems like it would be much more robust to
>>>> simply
>>>> *always* check for pkru_ad if PFEC_page_present is set, and for
>>>> pkru_wd
>>>> if PFEC_write_access is set.
>>>
>>> guest_walk_tables() is also used by shadow code. Though we don't
>>> plan to support pkeys for shadow now, in that case the 'pfec' is
>>> generated by hardware, and the pkuf bit may be 0 or 1 depending
>>> on the real page fault. Unconditionally checking pkeys in
>>> guest_walk_tables() is not a good idea for someone who may want
>>> to implement pkeys for shadow in future, since we only need to
>>> check pkeys when pkuf is set in pfec.
>>
>> So you'll have to forgive me for being a bit slow here -- I stepped
>> away
>> from the mm code for a while, and a lot has changed since I agreed to
>> become mm maintainer earlier this year.
>>
>> So here is what I see in the tree; please correct me if I missed
>> something important.
>>
>> The function guest_walk_tables():
>>  1. Accepts a pfec argument which is meant to describe the *access
>> type*
>> that is happening
>>  2. Returns a value with the "wrong" flags (i.e., flags which needed
>> to
>> be set that were missing, or flags that needed to be clear that were
>> set).
>>  3. It checks reserved bits in the pagetable regardless of whether
>> PFEC_reserved_bit is set in the caller, and returns
>> _PAGE_INVALID_BITS
>> if so.
>>  4. If PFEC_insn_fetch is set, it will only check for nx and smep if
>> those features are enabled for the guest.
>>
>> The main callers are the various gva_to_gfn(), which accept a
>> *pointer*
>> to a pfec argument.  This pointer is actually bidirectional: its
>> value
>> passed to guest_walk_tables() to determine what access types are
>> checked; what is returned is meant to be a pfec value which can be
>> passed to a guest as part of a fault (and is by several callers). 
>>  But
>> it may also return the Xen-internal flags, PFEC_page_{paged,shared}.
>> And, importantly, it modifies the pfec value based on the bits that
>> are
>> returned from guest_walk_tables(): It will clear PFEC_present if
>> guest_walk_tables() returns _PAGE_PRESENT, and it will set
>> PFEC_reserved_bit if guest_walk_tables() returns _PAGE_INVALID_BIT.
>>
>> The next logical level up are the
>> hvm_{copy,fetch}_{to,from}_guest_virt(), which also take a pfec
>> argument: but really the only purpose of the pfec argument is to
>> allow
>> the caller to add the PFEC_user_mode flag; the other access-type
>> flags
>> are set automatically by the functions themselves (e.g., "to" sets
>> PFEC_write_access, "fetch" sets PFEC_insn_fetch, &c).  Several
>> callers
>> set PFEC_present as well, but no callers set any other bits.
>>
>> (hvm_fetch_from_guest_virt() seems to only set PFEC_insn_fetch if nx
>> or
>> smep are enabled in the guest.  This seems inconsistent to me with
>> the
>> treatment of PFEC_reserved_bit: it seems like
>> hvm_fetch_from_guest_virt() should always pass in PFEC_insn_fetch,
>> particularly as guest_walk_tables() will already gate the checks
>> based
>> on whether nx or smep is enabled in the guest.  Tim, you know of any
>> reason for this?)
>>
>> So there seems to me to be no reason to pass PFEC_prot_key into
>> guest_walk_tables() or gva_to_gfn().  The pfec value passed *into*
>> those
>> should simply indicate the type of memory access being done: present,
>> write, instruction fetch, user.
>>
>> With your current series, guest_walk_tables() already checks for
>> pkeys
>> being enabled in the guest before checking for them in the
>> pagetables.
>> For shadow mode, these will be false, and so no checks will be done. 
>>  If
>> anyone ever implements pkeys for shadow mode, then these will be
>> enabled, and the checks will be done, without any intervention on the
>> part of the caller.
> 
> I understand that, but the problem with shadow mode is that the pfec
> may come from regs->error_code (i.e. from hardware), as in:
> rc = sh_walk_guest_tables(v, va, &gw, regs->error_code);
> so, when regs->error_code does not have PFEC_prot_key set,
> guest_walk_tables() may still check the pkey if the code is written as
> you describe, and it may return a different result.

Under what situation would the *hardware* not generate PFEC_prot_key,
but guest_walk_tables() would, under exactly the same conditions,
generate PFEC_prot_key?  Shouldn't guest_walk_tables() be made to behave
identically to the hardware?

The only situation where that might legitimately happen is when the
shadow pagetables and the guest pagetables diverge; for instance, when a
page had been write-protected because Xen thought it was a pagetable.
Then the guest pte might have write permission, but the pkey read-only
permission; but the shadow pte would not have write permission.  In that
case, the hardware might give PFEC_write_access but not PFEC_prot_key, I
suppose (I haven't read the manual to be sure); but then we *want*
guest_walk_tables() to detect and add PFEC_prot_key, don't we?

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables
  2015-12-18  8:21         ` Han, Huaitong
  2015-12-18 10:03           ` George Dunlap
@ 2015-12-18 11:46           ` Tim Deegan
  1 sibling, 0 replies; 44+ messages in thread
From: Tim Deegan @ 2015-12-18 11:46 UTC (permalink / raw)
  To: Han, Huaitong
  Cc: Tian, Kevin, Wu, Feng, ian.campbell, stefano.stabellini,
	george.dunlap, andrew.cooper3, Dong, Eddie, george.dunlap,
	xen-devel, jbeulich, Nakajima, Jun, wei.liu2, keir, ian.jackson

At 08:21 +0000 on 18 Dec (1450426907), Han, Huaitong wrote:
> On Wed, 2015-12-16 at 15:36 +0000, George Dunlap wrote:
> > With your current series, guest_walk_tables() already checks for
> > pkeys
> > being enabled in the guest before checking for them in the
> > pagetables.
> > For shadow mode, these will be false, and so no checks will be done. 
> >  If
> > anyone ever implements pkeys for shadow mode, then these will be
> > enabled, and the checks will be done, without any intervention on the
> > part of the caller.
> 
> I understand that, but the problem with shadow mode is that the pfec
> may come from regs->error_code (i.e. from hardware), as in:
> rc = sh_walk_guest_tables(v, va, &gw, regs->error_code);
> so, when regs->error_code does not have PFEC_prot_key set,
> guest_walk_tables() may still check the pkey if the code is written as
> you describe, and it may return a different result.

That's OK -- in general there's no guarantee that the guest walk will
produce the same result as the original hardware fault, since the
guest may have modified its pagetables.

So flagging something as a pkey error when the original fault didn't
is OK.  OTOH, we shouldn't keep a hardware-supplied pkey bit in the
PFEC unless the guest walk finds a reason for it to be there.
IOW, we should treat it like the PFEC_present bit and have it depend
only on the walk (or go straight to George's plan of separating inputs
from outputs so that's not an issue).
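
I.e. roughly this at the shadow call site (sketch; the mask name is
invented):

/* Keep only the access-type bits from the hardware error code and
 * let the guest walk (re)derive P/RSVD/PK on its own. */
#define PFEC_ACCESS_BITS \
    (PFEC_write_access | PFEC_user_mode | PFEC_insn_fetch)

rc = sh_walk_guest_tables(v, va, &gw,
                          regs->error_code & PFEC_ACCESS_BITS);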

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2015-12-18 11:46 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-07  9:16 [V3 PATCH 0/9] x86/hvm: pkeys, add memory protection-key support Huaitong Han
2015-12-07  9:16 ` [V3 PATCH 1/9] x86/hvm: pkeys, add the flag to enable Memory Protection Keys Huaitong Han
2015-12-10 15:37   ` George Dunlap
2015-12-07  9:16 ` [V3 PATCH 2/9] x86/hvm: pkeys, add pkeys support when setting CR4 Huaitong Han
2015-12-07  9:16 ` [V3 PATCH 3/9] x86/hvm: pkeys, disable pkeys for guests in non-paging mode Huaitong Han
2015-12-07  9:16 ` [V3 PATCH 4/9] x86/hvm: pkeys, add functions to get pkeys value from PTE Huaitong Han
2015-12-10 15:48   ` George Dunlap
2015-12-10 18:47     ` Andrew Cooper
2015-12-07  9:16 ` [V3 PATCH 5/9] x86/hvm: pkeys, add functions to support PKRU access Huaitong Han
2015-12-10 18:48   ` Andrew Cooper
2015-12-07  9:16 ` [V3 PATCH 6/9] x86/hvm: pkeys, add xstate support for pkeys Huaitong Han
2015-12-10 17:39   ` George Dunlap
2015-12-10 18:57   ` Andrew Cooper
2015-12-11  9:36   ` Jan Beulich
2015-12-07  9:16 ` [V3 PATCH 7/9] x86/hvm: pkeys, add pkeys support for guest_walk_tables Huaitong Han
2015-12-10 18:19   ` George Dunlap
2015-12-11  9:16     ` Wu, Feng
2015-12-11  9:23       ` Jan Beulich
2015-12-16 15:36       ` George Dunlap
2015-12-16 16:28         ` Tim Deegan
2015-12-16 16:34           ` Andrew Cooper
2015-12-16 17:33             ` Tim Deegan
2015-12-16 16:50           ` George Dunlap
2015-12-16 17:21             ` Tim Deegan
2015-12-18  8:21         ` Han, Huaitong
2015-12-18 10:03           ` George Dunlap
2015-12-18 11:46           ` Tim Deegan
2015-12-11  9:23     ` Han, Huaitong
2015-12-11  9:50       ` Jan Beulich
2015-12-11  9:26     ` Jan Beulich
2015-12-15  8:14       ` Han, Huaitong
2015-12-15  9:02         ` Jan Beulich
2015-12-16  8:16           ` Han, Huaitong
2015-12-16  8:32             ` Jan Beulich
2015-12-16  9:03               ` Han, Huaitong
2015-12-16  9:12                 ` Jan Beulich
2015-12-17  9:18                   ` Han, Huaitong
2015-12-17 10:05                     ` Jan Beulich
2015-12-10 18:59   ` Andrew Cooper
2015-12-11  7:18     ` Han, Huaitong
2015-12-11  8:48       ` Andrew Cooper
2015-12-07  9:16 ` [V3 PATCH 8/9] x86/hvm: pkeys, add pkeys support for gva2gfn funcitons Huaitong Han
2015-12-07  9:16 ` [V3 PATCH 9/9] x86/hvm: pkeys, add pkeys support for cpuid handling Huaitong Han
2015-12-11  9:47   ` Jan Beulich
