* [PATCH 0/7] Fixes to pagetable handling
@ 2017-02-27 14:03 Andrew Cooper
  2017-02-27 14:03 ` [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses Andrew Cooper
                   ` (8 more replies)
  0 siblings, 9 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-02-27 14:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper

This series has been a long time in preparation (i.e. most of the 4.8 and 4.9
dev cycles).  It started when I tried to make an XTF PoC for XSA-176, and
stumbled upon the _PAGE_PAGED aliasing issue (see patch 7), which caused my
PoC to be descheduled waiting for a (non-existent) paging agent to respond.

This series is built upon:
  1) The switch to using _Bool. I spent rather too long trying to debug why
     CR0.WP wasn't behaving properly, and it was down to static inline bool_t
     guest_wp_enabled().
  2) The series to switch emulation to using system-segment relative memory
     accesses (directly relevant to patches 1 and 2), which in turn resulted
     in the discovery of XSA-191.
  3) The CPUID improvement work to get maxphysaddr into a sensibly audited
     state, and a sensibly accessible location.

Outstanding hardware issues discovered include:
  1) There is an observable delay in AMD Fam 10h processors between loading a
     segment selector, and the results of the LDT/GDT memory access being
     visible in the pagetables (via the Access bits being set).
  2) SMAP provides insufficient information to the pagefault handler (via the
     error code) to fully determine the nature of the access causing the
     pagefault in the first place.  See patch 2 for details.

Andrew Cooper (7):
  x86/hvm: Correctly identify implicit supervisor accesses
  x86/shadow: Try to correctly identify implicit supervisor accesses
  x86/pagewalk: Helpers for reserved bit handling
  x86/hvm: Adjust hvm_nx_enabled() to match how Xen behaves
  x86/shadow: Use the pagewalk reserved bits helpers
  x86/pagewalk: Consistently use guest_walk_*() helpers for translation
  x86/pagewalk: Re-implement the pagetable walker

 xen/arch/x86/hvm/emulate.c        |  18 +-
 xen/arch/x86/mm/guest_walk.c      | 444 ++++++++++++++++++++------------------
 xen/arch/x86/mm/hap/guest_walk.c  |  23 +-
 xen/arch/x86/mm/hap/nested_ept.c  |   2 +-
 xen/arch/x86/mm/p2m.c             |  17 +-
 xen/arch/x86/mm/shadow/multi.c    | 108 +++++++---
 xen/include/asm-x86/guest_pt.h    | 180 +++++++++++++---
 xen/include/asm-x86/hvm/hvm.h     |   4 +-
 xen/include/asm-x86/p2m.h         |   2 +-
 xen/include/asm-x86/page.h        |   3 -
 xen/include/asm-x86/processor.h   |   3 +
 xen/include/asm-x86/x86_64/page.h |   6 -
 12 files changed, 489 insertions(+), 321 deletions(-)

-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses
  2017-02-27 14:03 [PATCH 0/7] Fixes to pagetable handling Andrew Cooper
@ 2017-02-27 14:03 ` Andrew Cooper
  2017-03-01 15:05   ` Jan Beulich
                     ` (4 more replies)
  2017-02-27 14:03 ` [PATCH 2/7] x86/shadow: Try to correctly " Andrew Cooper
                   ` (7 subsequent siblings)
  8 siblings, 5 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-02-27 14:03 UTC (permalink / raw)
  To: Xen-devel
  Cc: George Dunlap, Andrew Cooper, Paul Durrant, Tim Deegan, Jan Beulich

All actions which refer to the active ldt/gdt/idt or task register
(e.g. loading a new segment selector) are known as implicit supervisor
accesses, even when the access originates from user code.

The distinction is necessary in the pagewalk when SMAP is enabled.  Refer to
Intel SDM Vol 3 "Access Rights" for the exact details.

Introduce a new pagewalk input, and make use of the new system segment
references in hvmemul_{read,write}().  While modifying those areas, move the
calculation of the appropriate pagewalk input before its first use.
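
For illustration, a sketch of how this new input is intended to be consumed
later in the series (this is the shape of the SMAP check in patch 7's
rewritten walker; an implicit access never benefits from EFLAGS.AC, so SMAP
applies to it unconditionally):

    if ( !(walk & PFEC_insn_fetch) && guest_smap_enabled(v) &&
         ((walk & PFEC_implicit) ||
          !(guest_cpu_user_regs()->_eflags & X86_EFLAGS_AC)) )
        /* Supervisor data access hit a user mapping under SMAP: fail. */
        goto out;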

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Paul Durrant <paul.durrant@citrix.com>
CC: George Dunlap <george.dunlap@eu.citrix.com>
CC: Tim Deegan <tim@xen.org>
---
 xen/arch/x86/hvm/emulate.c      | 18 ++++++++++--------
 xen/arch/x86/mm/guest_walk.c    |  4 ++++
 xen/include/asm-x86/processor.h |  1 +
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index f24d289..9b32bca 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -783,6 +783,11 @@ static int __hvmemul_read(
     struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
     int rc;
 
+    if ( is_x86_system_segment(seg) )
+        pfec |= PFEC_implicit;
+    else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3 )
+        pfec |= PFEC_user_mode;
+
     rc = hvmemul_virtual_to_linear(
         seg, offset, bytes, &reps, access_type, hvmemul_ctxt, &addr);
     if ( rc != X86EMUL_OKAY || !bytes )
@@ -793,10 +798,6 @@ static int __hvmemul_read(
          (vio->mmio_gla == (addr & PAGE_MASK)) )
         return hvmemul_linear_mmio_read(addr, bytes, p_data, pfec, hvmemul_ctxt, 1);
 
-    if ( (seg != x86_seg_none) &&
-         (hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3) )
-        pfec |= PFEC_user_mode;
-
     rc = ((access_type == hvm_access_insn_fetch) ?
           hvm_fetch_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo) :
           hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo));
@@ -893,6 +894,11 @@ static int hvmemul_write(
     struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
     int rc;
 
+    if ( is_x86_system_segment(seg) )
+        pfec |= PFEC_implicit;
+    else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3 )
+        pfec |= PFEC_user_mode;
+
     rc = hvmemul_virtual_to_linear(
         seg, offset, bytes, &reps, hvm_access_write, hvmemul_ctxt, &addr);
     if ( rc != X86EMUL_OKAY || !bytes )
@@ -902,10 +908,6 @@ static int hvmemul_write(
          (vio->mmio_gla == (addr & PAGE_MASK)) )
         return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 1);
 
-    if ( (seg != x86_seg_none) &&
-         (hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3) )
-        pfec |= PFEC_user_mode;
-
     rc = hvm_copy_to_guest_linear(addr, p_data, bytes, pfec, &pfinfo);
 
     switch ( rc )
diff --git a/xen/arch/x86/mm/guest_walk.c b/xen/arch/x86/mm/guest_walk.c
index faaf70c..4f8d0e3 100644
--- a/xen/arch/x86/mm/guest_walk.c
+++ b/xen/arch/x86/mm/guest_walk.c
@@ -161,6 +161,10 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
     bool_t pse1G = 0, pse2M = 0;
     p2m_query_t qt = P2M_ALLOC | P2M_UNSHARE;
 
+    /* Only implicit supervisor data accesses exist. */
+    ASSERT(!(pfec & PFEC_implicit) ||
+           !(pfec & (PFEC_insn_fetch|PFEC_user_mode)));
+
     perfc_incr(guest_walk);
     memset(gw, 0, sizeof(*gw));
     gw->va = va;
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index dda8b83..d3ba8ea 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -76,6 +76,7 @@
 /* Internally used only flags. */
 #define PFEC_page_paged     (1U<<16)
 #define PFEC_page_shared    (1U<<17)
+#define PFEC_implicit       (1U<<18) /* Pagewalk input for ldt/gdt/idt/tr accesses. */
 
 /* Other exception error code values. */
 #define X86_XEC_EXT         (_AC(1,U) << 0)
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 2/7] x86/shadow: Try to correctly identify implicit supervisor accesses
  2017-02-27 14:03 [PATCH 0/7] Fixes to pagetable handling Andrew Cooper
  2017-02-27 14:03 ` [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses Andrew Cooper
@ 2017-02-27 14:03 ` Andrew Cooper
  2017-03-01 15:11   ` Jan Beulich
                     ` (2 more replies)
  2017-02-27 14:03 ` [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling Andrew Cooper
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-02-27 14:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/arch/x86/mm/shadow/multi.c | 60 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 58 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 128809d..7c6b017 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2857,7 +2857,7 @@ static int sh_page_fault(struct vcpu *v,
     const struct x86_emulate_ops *emul_ops;
     int r;
     p2m_type_t p2mt;
-    uint32_t rc;
+    uint32_t rc, error_code;
     int version;
     const struct npfec access = {
          .read_access = 1,
@@ -3011,13 +3011,69 @@ static int sh_page_fault(struct vcpu *v,
 
  rewalk:
 
+    error_code = regs->error_code;
+
+    /*
+     * When CR4.SMAP is enabled, instructions which have a side effect of
+     * accessing the system data structures (e.g. mov to %ds accessing the
+     * LDT/GDT, or int $n accessing the IDT) are known as implicit supervisor
+     * accesses.
+     *
+     * The distinction between implicit and explicit accesses forms part of the
+     * determination of access rights, controlling whether the access is
+     * successful, or raises a #PF.
+     *
+     * Unfortunately, the processor throws away the implicit/explicit
+     * distinction and does not provide it to the pagefault handler
+     * (i.e. here) in the #PF error code.  Therefore, we must try to
+     * reconstruct the lost state so it can be fed back into our pagewalk
+     * through the guest tables.
+     *
+     * User mode accesses are easy to reconstruct:
+     *
+     *   If we observe a cpl3 data fetch which was a supervisor walk, this
+     *   must have been an implicit access to a system table.
+     *
+     * Supervisor mode accesses are not easy:
+     *
+     *   In principle, we could decode the instruction under %rip and have the
+     *   instruction emulator tell us if there is an implicit access.
+     *   However, this is racy with other vcpus updating the pagetable or
+     *   rewriting the instruction stream under our feet.
+     *
+     *   Therefore, we do nothing.  (If anyone has a sensible suggestion for
+     *   how to distinguish these cases, xen-devel@ is all ears...)
+     *
+     * As a result, one specific corner case will fail.  If a guest OS with
+     * SMAP enabled ends up mapping a system table with user mappings, sets
+     * EFLAGS.AC to allow explicit accesses to user mappings, and implicitly
+     * accesses the user mapping, hardware and the shadow code will disagree
+     * on whether a #PF should be raised.
+     *
+     * Hardware raises #PF because implicit supervisor accesses to user
+     * mappings are strictly disallowed.  As we can't reconstruct the correct
+     * input, the pagewalk is performed as if it were an explicit access,
+     * which concludes that the access should have succeeded and the shadow
+     * pagetables need modifying.  The shadow pagetables are modified (to the
+     * same value), and we re-enter the guest to re-execute the instruction,
+     * which causes another #PF, and the vcpu livelocks, unable to make
+     * forward progress.
+     *
+     * In practice, this is tolerable because no OS would deliberately
+     * construct such a corner case, as doing so would mean it is trivially
+     * root-able by its own userspace.
+     */
+    if ( !(error_code & (PFEC_insn_fetch|PFEC_user_mode)) &&
+         (is_pv_vcpu(v) ? (regs->ss & 3) : hvm_get_cpl(v)) == 3 )
+        error_code |= PFEC_implicit;
+
     /* The walk is done in a lock-free style, with some sanity check
      * postponed after grabbing paging lock later. Those delayed checks
      * will make sure no inconsistent mapping being translated into
      * shadow page table. */
     version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
     rmb();
-    rc = sh_walk_guest_tables(v, va, &gw, regs->error_code);
+    rc = sh_walk_guest_tables(v, va, &gw, error_code);
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
     regs->error_code &= ~PFEC_page_present;
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling
  2017-02-27 14:03 [PATCH 0/7] Fixes to pagetable handling Andrew Cooper
  2017-02-27 14:03 ` [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses Andrew Cooper
  2017-02-27 14:03 ` [PATCH 2/7] x86/shadow: Try to correctly " Andrew Cooper
@ 2017-02-27 14:03 ` Andrew Cooper
  2017-03-01 15:57   ` Jan Beulich
                     ` (2 more replies)
  2017-02-27 14:03 ` [PATCH 4/7] x86/hvm: Adjust hvm_nx_enabled() to match how Xen behaves Andrew Cooper
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-02-27 14:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich

Some bits are unconditionally reserved in pagetable entries, or reserved
because of alignment restrictions.  Other bits are reserved because of control
register configuration.

Introduce helpers which take an individual vcpu and guest pagetable entry, and
calculate whether any reserved bits are set.

While here, add a couple of newlines to aid readability, drop some trailing
whitespace, and bool/const-correct the existing helpers to allow the new
helpers to take const vcpu pointers.
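
As a sketch of the intended call pattern (this is the shape the rewritten
walker in patch 7 uses; a present entry with reserved bits set terminates the
walk with #PF[RSVD]):

    gflags = guest_l2e_get_flags(gw->l2e);
    if ( !(gflags & _PAGE_PRESENT) )
        goto out;                            /* Not present: no translation. */

    if ( guest_l2e_rsvd_bits(v, gw->l2e) )   /* Reserved bits set? */
    {
        gw->pfec |= PFEC_reserved_bit | PFEC_page_present;
        goto out;                            /* Terminal #PF[RSVD]. */
    }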

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: George Dunlap <george.dunlap@eu.citrix.com>
CC: Tim Deegan <tim@xen.org>
---
 xen/include/asm-x86/guest_pt.h | 98 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 90 insertions(+), 8 deletions(-)

diff --git a/xen/include/asm-x86/guest_pt.h b/xen/include/asm-x86/guest_pt.h
index 0bf6cf9..1c3d384 100644
--- a/xen/include/asm-x86/guest_pt.h
+++ b/xen/include/asm-x86/guest_pt.h
@@ -44,6 +44,18 @@ gfn_to_paddr(gfn_t gfn)
 #undef get_gfn
 #define get_gfn(d, g, t) get_gfn_type((d), gfn_x(g), (t), P2M_ALLOC)
 
+/* Mask covering the reserved bits from superpage alignment. */
+#define SUPERPAGE_RSVD(bit)                                             \
+    (((1ULL << (bit)) - 1) & ~(_PAGE_PSE_PAT | (_PAGE_PSE_PAT - 1)))
+
+static inline uint32_t fold_pse36(uint64_t val)
+{
+    return (val & ~(0x1ffULL << 13)) | ((val & (0x1ffULL << 32)) >> (32 - 13));
+}
+static inline uint64_t unfold_pse36(uint32_t val)
+{
+    return (val & ~(0x1ffULL << 13)) | ((val & (0x1ffULL << 13)) << (32 - 13));
+}
 
 /* Types of the guest's page tables and access functions for them */
 
@@ -51,9 +63,13 @@ gfn_to_paddr(gfn_t gfn)
 
 #define GUEST_L1_PAGETABLE_ENTRIES     1024
 #define GUEST_L2_PAGETABLE_ENTRIES     1024
+
 #define GUEST_L1_PAGETABLE_SHIFT         12
 #define GUEST_L2_PAGETABLE_SHIFT         22
 
+#define GUEST_L1_PAGETABLE_RSVD           0
+#define GUEST_L2_PAGETABLE_RSVD           0
+
 typedef uint32_t guest_intpte_t;
 typedef struct { guest_intpte_t l1; } guest_l1e_t;
 typedef struct { guest_intpte_t l2; } guest_l2e_t;
@@ -88,21 +104,39 @@ static inline guest_l2e_t guest_l2e_from_gfn(gfn_t gfn, u32 flags)
 #else /* GUEST_PAGING_LEVELS != 2 */
 
 #if GUEST_PAGING_LEVELS == 3
+
 #define GUEST_L1_PAGETABLE_ENTRIES      512
 #define GUEST_L2_PAGETABLE_ENTRIES      512
 #define GUEST_L3_PAGETABLE_ENTRIES        4
+
 #define GUEST_L1_PAGETABLE_SHIFT         12
 #define GUEST_L2_PAGETABLE_SHIFT         21
 #define GUEST_L3_PAGETABLE_SHIFT         30
+
+#define GUEST_L1_PAGETABLE_RSVD            0x7ff0000000000000UL
+#define GUEST_L2_PAGETABLE_RSVD            0x7ff0000000000000UL
+#define GUEST_L3_PAGETABLE_RSVD                                      \
+    (0xfff0000000000000UL | _PAGE_GLOBAL | _PAGE_PSE | _PAGE_DIRTY | \
+     _PAGE_ACCESSED | _PAGE_USER | _PAGE_RW)
+
 #else /* GUEST_PAGING_LEVELS == 4 */
+
 #define GUEST_L1_PAGETABLE_ENTRIES      512
 #define GUEST_L2_PAGETABLE_ENTRIES      512
 #define GUEST_L3_PAGETABLE_ENTRIES      512
 #define GUEST_L4_PAGETABLE_ENTRIES      512
+
 #define GUEST_L1_PAGETABLE_SHIFT         12
 #define GUEST_L2_PAGETABLE_SHIFT         21
 #define GUEST_L3_PAGETABLE_SHIFT         30
 #define GUEST_L4_PAGETABLE_SHIFT         39
+
+#define GUEST_L1_PAGETABLE_RSVD            0
+#define GUEST_L2_PAGETABLE_RSVD            0
+#define GUEST_L3_PAGETABLE_RSVD            0
+/* NB L4e._PAGE_GLOBAL is reserved for AMD, but ignored for Intel. */
+#define GUEST_L4_PAGETABLE_RSVD            _PAGE_PSE
+
 #endif
 
 typedef l1_pgentry_t guest_l1e_t;
@@ -170,27 +204,30 @@ static inline guest_l4e_t guest_l4e_from_gfn(gfn_t gfn, u32 flags)
 
 /* Which pagetable features are supported on this vcpu? */
 
-static inline int
-guest_supports_superpages(struct vcpu *v)
+static inline bool guest_supports_superpages(const struct vcpu *v)
 {
     /* The _PAGE_PSE bit must be honoured in HVM guests, whenever
-     * CR4.PSE is set or the guest is in PAE or long mode. 
+     * CR4.PSE is set or the guest is in PAE or long mode.
      * It's also used in the dummy PT for vcpus with CR4.PG cleared. */
     return (is_pv_vcpu(v)
             ? opt_allow_superpage
-            : (GUEST_PAGING_LEVELS != 2 
+            : (GUEST_PAGING_LEVELS != 2
                || !hvm_paging_enabled(v)
                || (v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_PSE)));
 }
 
-static inline int
-guest_supports_1G_superpages(struct vcpu *v)
+static inline bool guest_has_pse36(const struct vcpu *v)
+{
+     /* No support for 2-level PV guests. */
+    return is_pv_vcpu(v) ? 0 : paging_mode_hap(v->domain);
+}
+
+static inline bool guest_supports_1G_superpages(const struct vcpu *v)
 {
     return (GUEST_PAGING_LEVELS >= 4 && hvm_pse1gb_supported(v->domain));
 }
 
-static inline int
-guest_supports_nx(struct vcpu *v)
+static inline bool guest_supports_nx(const struct vcpu *v)
 {
     if ( GUEST_PAGING_LEVELS == 2 || !cpu_has_nx )
         return 0;
@@ -213,6 +250,51 @@ guest_supports_nx(struct vcpu *v)
 #define _PAGE_INVALID_BITS _PAGE_INVALID_BIT
 #endif
 
+/* Helpers for identifying whether guest entries have reserved bits set. */
+
+/* Bits reserved because of maxphysaddr, and (lack of) EFER.NX */
+static inline uint64_t guest_rsvd_bits(const struct vcpu *v)
+{
+    return ((PADDR_MASK &
+             ~((1UL << v->domain->arch.cpuid->extd.maxphysaddr) - 1)) |
+            (guest_supports_nx(v) ? 0 : put_pte_flags(_PAGE_NX_BIT)));
+}
+
+static inline bool guest_l1e_rsvd_bits(const struct vcpu *v, guest_l1e_t l1e)
+{
+    return l1e.l1 & (guest_rsvd_bits(v) | GUEST_L1_PAGETABLE_RSVD);
+}
+
+static inline bool guest_l2e_rsvd_bits(const struct vcpu *v, guest_l2e_t l2e)
+{
+    uint64_t rsvd_bits = guest_rsvd_bits(v);
+
+    return ((l2e.l2 & (rsvd_bits | GUEST_L2_PAGETABLE_RSVD |
+                       (guest_supports_superpages(v) ? 0 : _PAGE_PSE))) ||
+            ((l2e.l2 & _PAGE_PSE) &&
+             (l2e.l2 & ((GUEST_PAGING_LEVELS == 2 && guest_has_pse36(v))
+                        ? (fold_pse36(rsvd_bits | (1ULL << 40)))
+                        : SUPERPAGE_RSVD(GUEST_L2_PAGETABLE_SHIFT)))));
+}
+
+#if GUEST_PAGING_LEVELS >= 3
+static inline bool guest_l3e_rsvd_bits(const struct vcpu *v, guest_l3e_t l3e)
+{
+    return ((l3e.l3 & (guest_rsvd_bits(v) | GUEST_L3_PAGETABLE_RSVD |
+                       (guest_supports_1G_superpages(v) ? 0 : _PAGE_PSE))) ||
+            ((l3e.l3 & _PAGE_PSE) &&
+             (l3e.l3 & SUPERPAGE_RSVD(GUEST_L3_PAGETABLE_SHIFT))));
+}
+
+#if GUEST_PAGING_LEVELS >= 4
+static inline bool guest_l4e_rsvd_bits(const struct vcpu *v, guest_l4e_t l4e)
+{
+    return l4e.l4 & (guest_rsvd_bits(v) | GUEST_L4_PAGETABLE_RSVD |
+                     ((v->domain->arch.cpuid->x86_vendor == X86_VENDOR_AMD)
+                      ? _PAGE_GLOBAL : 0));
+}
+#endif /* GUEST_PAGING_LEVELS >= 4 */
+#endif /* GUEST_PAGING_LEVELS >= 3 */
 
 /* Type used for recording a walk through guest pagetables.  It is
  * filled in by the pagetable walk function, and also used as a cache
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 4/7] x86/hvm: Adjust hvm_nx_enabled() to match how Xen behaves
  2017-02-27 14:03 [PATCH 0/7] Fixes to pagetable handling Andrew Cooper
                   ` (2 preceding siblings ...)
  2017-02-27 14:03 ` [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling Andrew Cooper
@ 2017-02-27 14:03 ` Andrew Cooper
  2017-03-01 16:00   ` Jan Beulich
  2017-02-27 14:03 ` [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers Andrew Cooper
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 52+ messages in thread
From: Andrew Cooper @ 2017-02-27 14:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich

On Intel hardware, EFER is not fully switched between host and guest contexts.
In practice, this means that Xen's EFER.NX setting leaks into guest context,
and influences the behaviour of the hardware pagewalker.

When servicing a pagefault, Xen's model of guest behaviour should match
hardware's behaviour, to allow correct interpretation of the pagefault error
code, and to avoid creating an observable difference in behaviour from the
guest's point of view.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>

Fixing this isn't trivial.  On more modern hardware, we can use EFER loading.
On older hardware, we can use general MSR loading if available.  On
older-hardware-yet, we could reload EFER right before/after vmentry/vmexit.
However, doing so would require reloading EFER before any data accesses (as
the NX bit will cause #PF[RSVD]), and that is rather hard given the need to
preserve the GPRs.
---
 xen/include/asm-x86/hvm/hvm.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 87b203a..9907a7a 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -292,8 +292,10 @@ int hvm_girq_dest_2_vcpu_id(struct domain *d, uint8_t dest, uint8_t dest_mode);
     (hvm_paging_enabled(v) && ((v)->arch.hvm_vcpu.guest_cr[4] & X86_CR4_SMEP))
 #define hvm_smap_enabled(v) \
     (hvm_paging_enabled(v) && ((v)->arch.hvm_vcpu.guest_cr[4] & X86_CR4_SMAP))
+/* HVM guests on Intel hardware leak Xen's NX settings into guest context. */
 #define hvm_nx_enabled(v) \
-    (!!((v)->arch.hvm_vcpu.guest_efer & EFER_NX))
+    ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL && cpu_has_nx) || \
+     !!((v)->arch.hvm_vcpu.guest_efer & EFER_NX))
 #define hvm_pku_enabled(v) \
     (hvm_paging_enabled(v) && ((v)->arch.hvm_vcpu.guest_cr[4] & X86_CR4_PKE))
 
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers
  2017-02-27 14:03 [PATCH 0/7] Fixes to pagetable handling Andrew Cooper
                   ` (3 preceding siblings ...)
  2017-02-27 14:03 ` [PATCH 4/7] x86/hvm: Adjust hvm_nx_enabled() to match how Xen behaves Andrew Cooper
@ 2017-02-27 14:03 ` Andrew Cooper
  2017-03-01 16:03   ` Jan Beulich
  2017-03-02 14:33   ` Tim Deegan
  2017-02-27 14:03 ` [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation Andrew Cooper
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-02-27 14:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich

The shadow logic should never create a shadow of a guest PTE which contains
reserved bits from the guest's point of view.  Such a shadowed entry might not
cause #PF[RSVD] when walked by hardware, and thus won't behave architecturally
from the guest's point of view.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/arch/x86/mm/shadow/multi.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 7c6b017..702835b 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2157,7 +2157,8 @@ static int validate_gl4e(struct vcpu *v, void *new_ge, mfn_t sl4mfn, void *se)
 
     perfc_incr(shadow_validate_gl4e_calls);
 
-    if ( guest_l4e_get_flags(new_gl4e) & _PAGE_PRESENT )
+    if ( (guest_l4e_get_flags(new_gl4e) & _PAGE_PRESENT) &&
+         !guest_l4e_rsvd_bits(v, new_gl4e) )
     {
         gfn_t gl3gfn = guest_l4e_get_gfn(new_gl4e);
         mfn_t gl3mfn = get_gfn_query_unlocked(d, gfn_x(gl3gfn), &p2mt);
@@ -2215,7 +2216,8 @@ static int validate_gl3e(struct vcpu *v, void *new_ge, mfn_t sl3mfn, void *se)
 
     perfc_incr(shadow_validate_gl3e_calls);
 
-    if ( guest_l3e_get_flags(new_gl3e) & _PAGE_PRESENT )
+    if ( (guest_l3e_get_flags(new_gl3e) & _PAGE_PRESENT) &&
+         !guest_l3e_rsvd_bits(v, new_gl3e) )
     {
         gfn_t gl2gfn = guest_l3e_get_gfn(new_gl3e);
         mfn_t gl2mfn = get_gfn_query_unlocked(d, gfn_x(gl2gfn), &p2mt);
@@ -2248,7 +2250,8 @@ static int validate_gl2e(struct vcpu *v, void *new_ge, mfn_t sl2mfn, void *se)
 
     perfc_incr(shadow_validate_gl2e_calls);
 
-    if ( guest_l2e_get_flags(new_gl2e) & _PAGE_PRESENT )
+    if ( (guest_l2e_get_flags(new_gl2e) & _PAGE_PRESENT) &&
+         !guest_l2e_rsvd_bits(v, new_gl2e) )
     {
         gfn_t gl1gfn = guest_l2e_get_gfn(new_gl2e);
         if ( guest_supports_superpages(v) &&
@@ -2288,8 +2291,7 @@ static int validate_gl1e(struct vcpu *v, void *new_ge, mfn_t sl1mfn, void *se)
     shadow_l1e_t new_sl1e;
     guest_l1e_t new_gl1e = *(guest_l1e_t *)new_ge;
     shadow_l1e_t *sl1p = se;
-    gfn_t gfn;
-    mfn_t gmfn;
+    mfn_t gmfn = INVALID_MFN;
     p2m_type_t p2mt;
     int result = 0;
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
@@ -2298,8 +2300,13 @@ static int validate_gl1e(struct vcpu *v, void *new_ge, mfn_t sl1mfn, void *se)
 
     perfc_incr(shadow_validate_gl1e_calls);
 
-    gfn = guest_l1e_get_gfn(new_gl1e);
-    gmfn = get_gfn_query_unlocked(d, gfn_x(gfn), &p2mt);
+    if ( (guest_l1e_get_flags(new_gl1e) & _PAGE_PRESENT) &&
+         !guest_l1e_rsvd_bits(v, new_gl1e) )
+    {
+        gfn_t gfn = guest_l1e_get_gfn(new_gl1e);
+
+        gmfn = get_gfn_query_unlocked(d, gfn_x(gfn), &p2mt);
+    }
 
     l1e_propagate_from_guest(v, new_gl1e, gmfn, &new_sl1e, ft_prefetch, p2mt);
     result |= shadow_set_l1e(d, sl1p, new_sl1e, p2mt, sl1mfn);
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation
  2017-02-27 14:03 [PATCH 0/7] Fixes to pagetable handling Andrew Cooper
                   ` (4 preceding siblings ...)
  2017-02-27 14:03 ` [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers Andrew Cooper
@ 2017-02-27 14:03 ` Andrew Cooper
  2017-03-01 16:22   ` Jan Beulich
                     ` (2 more replies)
  2017-02-27 14:03 ` [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker Andrew Cooper
                   ` (2 subsequent siblings)
  8 siblings, 3 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-02-27 14:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich

hap_p2m_ga_to_gfn() and sh_page_fault() currently use guest_l1e_get_gfn() to
obtain the translation of a pagewalk.  This is conceptually wrong (the
semantics of gw.l1e is an internal detail), and will actually be wrong when
PSE36 superpage support is fixed.  Switch them to using guest_walk_to_gfn().

Take the opportunity to const-correct the walk_t parameter of the
guest_walk_to_*() helpers, and implement guest_walk_to_gpa() in terms of
guest_walk_to_gfn() to avoid duplicating the actual translation calculation.

While editing guest_walk_to_gpa(), fix a latent bug by causing it to return
INVALID_PADDR rather than 0 for a failed translation, as 0 is also a valid
successful result.  The sole caller, sh_page_fault(), has already confirmed
that the translation is valid, so this doesn't cause a behavioural change.
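
A sketch of the resulting caller pattern (hap_p2m_ga_to_gfn() and
sh_page_fault() below use the gfn form; the gpa form now fails with
INVALID_PADDR rather than 0):

    gfn_t gfn = guest_walk_to_gfn(&gw);     /* INVALID_GFN if no translation */
    paddr_t gpa = guest_walk_to_gpa(&gw);   /* INVALID_PADDR if no translation */

    if ( gfn_eq(gfn, INVALID_GFN) )
        /* Walk didn't yield a present leaf mapping. */ ;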

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/arch/x86/mm/hap/guest_walk.c |  2 +-
 xen/arch/x86/mm/shadow/multi.c   |  2 +-
 xen/include/asm-x86/guest_pt.h   | 19 +++++++++----------
 3 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/mm/hap/guest_walk.c b/xen/arch/x86/mm/hap/guest_walk.c
index 569a495..313f82f 100644
--- a/xen/arch/x86/mm/hap/guest_walk.c
+++ b/xen/arch/x86/mm/hap/guest_walk.c
@@ -98,7 +98,7 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
     /* Interpret the answer */
     if ( missing == 0 )
     {
-        gfn_t gfn = guest_l1e_get_gfn(gw.l1e);
+        gfn_t gfn = guest_walk_to_gfn(&gw);
         struct page_info *page;
         page = get_page_from_gfn_p2m(p2m->domain, p2m, gfn_x(gfn), &p2mt,
                                      NULL, P2M_ALLOC | P2M_UNSHARE);
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 702835b..f73b553 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3109,7 +3109,7 @@ static int sh_page_fault(struct vcpu *v,
     }
 
     /* What mfn is the guest trying to access? */
-    gfn = guest_l1e_get_gfn(gw.l1e);
+    gfn = guest_walk_to_gfn(&gw);
     gmfn = get_gfn(d, gfn, &p2mt);
 
     if ( shadow_mode_refcounts(d) &&
diff --git a/xen/include/asm-x86/guest_pt.h b/xen/include/asm-x86/guest_pt.h
index 1c3d384..1aa383f 100644
--- a/xen/include/asm-x86/guest_pt.h
+++ b/xen/include/asm-x86/guest_pt.h
@@ -323,8 +323,7 @@ struct guest_pagetable_walk
 
 /* Given a walk_t, translate the gw->va into the guest's notion of the
  * corresponding frame number. */
-static inline gfn_t
-guest_walk_to_gfn(walk_t *gw)
+static inline gfn_t guest_walk_to_gfn(const walk_t *gw)
 {
     if ( !(guest_l1e_get_flags(gw->l1e) & _PAGE_PRESENT) )
         return INVALID_GFN;
@@ -333,19 +332,19 @@ guest_walk_to_gfn(walk_t *gw)
 
 /* Given a walk_t, translate the gw->va into the guest's notion of the
  * corresponding physical address. */
-static inline paddr_t
-guest_walk_to_gpa(walk_t *gw)
+static inline paddr_t guest_walk_to_gpa(const walk_t *gw)
 {
-    if ( !(guest_l1e_get_flags(gw->l1e) & _PAGE_PRESENT) )
-        return 0;
-    return ((paddr_t)gfn_x(guest_l1e_get_gfn(gw->l1e)) << PAGE_SHIFT) +
-           (gw->va & ~PAGE_MASK);
+    gfn_t gfn = guest_walk_to_gfn(gw);
+
+    if ( gfn_eq(gfn, INVALID_GFN) )
+        return INVALID_PADDR;
+
+    return (gfn_x(gfn) << PAGE_SHIFT) | (gw->va & ~PAGE_MASK);
 }
 
 /* Given a walk_t from a successful walk, return the page-order of the 
  * page or superpage that the virtual address is in. */
-static inline unsigned int 
-guest_walk_to_page_order(walk_t *gw)
+static inline unsigned int guest_walk_to_page_order(const walk_t *gw)
 {
     /* This is only valid for successful walks - otherwise the 
      * PSE bits might be invalid. */
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker
  2017-02-27 14:03 [PATCH 0/7] Fixes to pagetable handling Andrew Cooper
                   ` (5 preceding siblings ...)
  2017-02-27 14:03 ` [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation Andrew Cooper
@ 2017-02-27 14:03 ` Andrew Cooper
  2017-03-02 11:52   ` Jan Beulich
                     ` (3 more replies)
  2017-03-01 16:24 ` [PATCH 0/7] Fixes to pagetable handling Jan Beulich
  2017-03-06 16:42 ` [RFC XTF PATCH] Pagetable Emulation testing Andrew Cooper
  8 siblings, 4 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-02-27 14:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich

The existing pagetable walker has complicated return semantics, which squeeze
multiple pieces of information into a single integer.  This would be fine if the
information didn't overlap, but it does.

Specifically, _PAGE_INVALID_BITS for 3-level guests alias _PAGE_PAGED and
_PAGE_SHARED.  A guest which constructs a PTE with bits 52 or 53 set (the
start of the upper software-available range) will create a virtual address
which, when walked by Xen, tricks Xen into believing the frame is paged or
shared.  This behaviour was introduced by XSA-173 (c/s 8b17648).

It is also complicated to turn rc back into a normal pagefault error code.
Instead, change the calling semantics to return a boolean indicating success,
and have the function accumulate a real pagefault error code as it goes
(including synthetic error codes, which do not alias hardware ones).  This
requires an equivalent adjustment to map_domain_gfn().
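
A sketch of the new calling convention (the hap caller below takes this
shape; gw.pfec accumulates both architectural and synthetic bits):

    bool walk_ok = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map);

    if ( !walk_ok )
    {
        /* gw.pfec now holds the architectural (or synthetic) error code. */
        pfec[0] = gw.pfec;
        return gfn_x(INVALID_GFN);
    }

    return gfn_x(guest_walk_to_gfn(&gw));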

Issues fixed:
 * 2-level PSE36 superpages now return the correct translation.
 * 2-level L2 superpages without CR0.PSE now return the correct translation.
 * SMEP now inhibits a user instruction fetch even if NX isn't active.
 * Supervisor writes without CR0.WP now set the leaf dirty bit.
 * L4e._PAGE_GLOBAL is strictly reserved on AMD.
 * 3-level l3 entries have all reserved bits checked.
 * 3-level entries can no longer alias Xen's idea of paged or shared.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/arch/x86/mm/guest_walk.c      | 448 +++++++++++++++++++-------------------
 xen/arch/x86/mm/hap/guest_walk.c  |  21 +-
 xen/arch/x86/mm/hap/nested_ept.c  |   2 +-
 xen/arch/x86/mm/p2m.c             |  17 +-
 xen/arch/x86/mm/shadow/multi.c    |  27 +--
 xen/include/asm-x86/guest_pt.h    |  63 ++++--
 xen/include/asm-x86/p2m.h         |   2 +-
 xen/include/asm-x86/page.h        |   3 -
 xen/include/asm-x86/processor.h   |   2 +
 xen/include/asm-x86/x86_64/page.h |   6 -
 10 files changed, 303 insertions(+), 288 deletions(-)

diff --git a/xen/arch/x86/mm/guest_walk.c b/xen/arch/x86/mm/guest_walk.c
index 4f8d0e3..01bcb1d 100644
--- a/xen/arch/x86/mm/guest_walk.c
+++ b/xen/arch/x86/mm/guest_walk.c
@@ -32,44 +32,6 @@ asm(".file \"" __OBJECT_FILE__ "\"");
 #include <asm/page.h>
 #include <asm/guest_pt.h>
 
-extern const uint32_t gw_page_flags[];
-#if GUEST_PAGING_LEVELS == CONFIG_PAGING_LEVELS
-const uint32_t gw_page_flags[] = {
-    /* I/F -  Usr Wr */
-    /* 0   0   0   0 */ _PAGE_PRESENT,
-    /* 0   0   0   1 */ _PAGE_PRESENT|_PAGE_RW,
-    /* 0   0   1   0 */ _PAGE_PRESENT|_PAGE_USER,
-    /* 0   0   1   1 */ _PAGE_PRESENT|_PAGE_RW|_PAGE_USER,
-    /* 0   1   0   0 */ _PAGE_PRESENT,
-    /* 0   1   0   1 */ _PAGE_PRESENT|_PAGE_RW,
-    /* 0   1   1   0 */ _PAGE_PRESENT|_PAGE_USER,
-    /* 0   1   1   1 */ _PAGE_PRESENT|_PAGE_RW|_PAGE_USER,
-    /* 1   0   0   0 */ _PAGE_PRESENT|_PAGE_NX_BIT,
-    /* 1   0   0   1 */ _PAGE_PRESENT|_PAGE_RW|_PAGE_NX_BIT,
-    /* 1   0   1   0 */ _PAGE_PRESENT|_PAGE_USER|_PAGE_NX_BIT,
-    /* 1   0   1   1 */ _PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_NX_BIT,
-    /* 1   1   0   0 */ _PAGE_PRESENT|_PAGE_NX_BIT,
-    /* 1   1   0   1 */ _PAGE_PRESENT|_PAGE_RW|_PAGE_NX_BIT,
-    /* 1   1   1   0 */ _PAGE_PRESENT|_PAGE_USER|_PAGE_NX_BIT,
-    /* 1   1   1   1 */ _PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_NX_BIT,
-};
-#endif
-
-/* Flags that are needed in a pagetable entry, with the sense of NX inverted */
-static uint32_t mandatory_flags(struct vcpu *v, uint32_t pfec) 
-{
-    /* Don't demand not-NX if the CPU wouldn't enforce it. */
-    if ( !guest_supports_nx(v) )
-        pfec &= ~PFEC_insn_fetch;
-
-    /* Don't demand R/W if the CPU wouldn't enforce it. */
-    if ( is_hvm_vcpu(v) && unlikely(!hvm_wp_enabled(v)) 
-         && !(pfec & PFEC_user_mode) )
-        pfec &= ~PFEC_write_access;
-
-    return gw_page_flags[(pfec & 0x1f) >> 1] | _PAGE_INVALID_BITS;
-}
-
 /* Modify a guest pagetable entry to set the Accessed and Dirty bits.
  * Returns non-zero if it actually writes to guest memory. */
 static uint32_t set_ad_bits(void *guest_p, void *walk_p, int set_dirty)
@@ -90,62 +52,33 @@ static uint32_t set_ad_bits(void *guest_p, void *walk_p, int set_dirty)
     return 0;
 }
 
-#if GUEST_PAGING_LEVELS >= 4
-static bool_t pkey_fault(struct vcpu *vcpu, uint32_t pfec,
-        uint32_t pte_flags, uint32_t pte_pkey)
-{
-    uint32_t pkru;
-
-    /* When page isn't present,  PKEY isn't checked. */
-    if ( !(pfec & PFEC_page_present) || is_pv_vcpu(vcpu) )
-        return 0;
-
-    /*
-     * PKU:  additional mechanism by which the paging controls
-     * access to user-mode addresses based on the value in the
-     * PKRU register. A fault is considered as a PKU violation if all
-     * of the following conditions are true:
-     * 1.CR4_PKE=1.
-     * 2.EFER_LMA=1.
-     * 3.Page is present with no reserved bit violations.
-     * 4.The access is not an instruction fetch.
-     * 5.The access is to a user page.
-     * 6.PKRU.AD=1 or
-     *      the access is a data write and PKRU.WD=1 and
-     *          either CR0.WP=1 or it is a user access.
-     */
-    if ( !hvm_pku_enabled(vcpu) ||
-         !hvm_long_mode_enabled(vcpu) ||
-         (pfec & PFEC_reserved_bit) ||
-         (pfec & PFEC_insn_fetch) ||
-         !(pte_flags & _PAGE_USER) )
-        return 0;
-
-    pkru = read_pkru();
-    if ( unlikely(pkru) )
-    {
-        bool_t pkru_ad = read_pkru_ad(pkru, pte_pkey);
-        bool_t pkru_wd = read_pkru_wd(pkru, pte_pkey);
-
-        /* Condition 6 */
-        if ( pkru_ad ||
-             (pkru_wd && (pfec & PFEC_write_access) &&
-              (hvm_wp_enabled(vcpu) || (pfec & PFEC_user_mode))) )
-            return 1;
-    }
-
-    return 0;
-}
-#endif
-
-/* Walk the guest pagetables, after the manner of a hardware walker. */
-/* Because the walk is essentially random, it can cause a deadlock 
- * warning in the p2m locking code. Highly unlikely this is an actual
- * deadlock, because who would walk page table in the opposite order? */
-uint32_t
+/*
+ * Walk the guest pagetables, after the manner of a hardware walker.
+ *
+ * This is a condensing of the 'Paging' chapters from Intel and AMD software
+ * manuals.  Please refer closely to them.
+ *
+ * A pagetable walk consists of two parts:
+ *   1) to find whether a translation exists, and
+ *   2) if a translation does exist, to check whether the translation's access
+ *      rights permit the access.
+ *
+ * A translation is found by following the pagetable structure (starting at
+ * %cr3) to a leaf entry (an L1 PTE, or a higher level entry with PSE set)
+ * which identifies the physical destination of the access.
+ *
+ * A translation from one level to the next exists if the PTE is both present
+ * and has no reserved bits set.  If the pagewalk encounters a situation where a
+ * translation does not exist, the walk stops at that point.
+ *
+ * The access rights (NX, User, RW bits) are collected as the walk progresses.
+ * If a translation exists, the accumulated access rights are compared to the
+ * requested walk, to see whether the access is permitted.
+ */
+bool
 guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
                   unsigned long va, walk_t *gw, 
-                  uint32_t pfec, mfn_t top_mfn, void *top_map)
+                  uint32_t walk, mfn_t top_mfn, void *top_map)
 {
     struct domain *d = v->domain;
     p2m_type_t p2mt;
@@ -155,64 +88,44 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
     guest_l3e_t *l3p = NULL;
     guest_l4e_t *l4p;
 #endif
-    unsigned int pkey;
-    uint32_t gflags, mflags, iflags, rc = 0;
-    bool_t smep = 0, smap = 0;
+    uint32_t gflags, rc;
     bool_t pse1G = 0, pse2M = 0;
     p2m_query_t qt = P2M_ALLOC | P2M_UNSHARE;
 
-    /* Only implicit supervisor data accesses exist. */
-    ASSERT(!(pfec & PFEC_implicit) ||
-           !(pfec & (PFEC_insn_fetch|PFEC_user_mode)));
+#define AR_ACCUM_AND (_PAGE_USER | _PAGE_RW)
+#define AR_ACCUM_OR  (_PAGE_NX_BIT)
+    /* Start with all AND bits set, all OR bits clear. */
+    uint32_t ar, ar_and = ~0u, ar_or = 0;
 
-    perfc_incr(guest_walk);
-    memset(gw, 0, sizeof(*gw));
-    gw->va = va;
-
-    /* Mandatory bits that must be set in every entry.  We invert NX and
-     * the invalid bits, to calculate as if there were an "X" bit that
-     * allowed access.  We will accumulate, in rc, the set of flags that
-     * are missing/unwanted. */
-    mflags = mandatory_flags(v, pfec);
-    iflags = (_PAGE_NX_BIT | _PAGE_INVALID_BITS);
+    bool walk_ok = false;
 
-    if ( is_hvm_domain(d) && !(pfec & PFEC_user_mode) )
-    {
-        const struct cpu_user_regs *regs = guest_cpu_user_regs();
+    /*
+     * TODO - We should ASSERT() that only the following bits are set as
+     * inputs to a guest walk, but a whole load of code currently passes in
+     * other PFEC_ constants.
+     */
+    walk &= (PFEC_implicit | PFEC_insn_fetch | PFEC_user_mode | PFEC_write_access);
 
-        /* SMEP: kernel-mode instruction fetches from user-mode mappings
-         * should fault.  Unlike NX or invalid bits, we're looking for _all_
-         * entries in the walk to have _PAGE_USER set, so we need to do the
-         * whole walk as if it were a user-mode one and then invert the answer. */
-        smep =  hvm_smep_enabled(v) && (pfec & PFEC_insn_fetch);
+    /* Only implicit supervisor data accesses exist. */
+    ASSERT(!(walk & PFEC_implicit) ||
+           !(walk & (PFEC_insn_fetch|PFEC_user_mode)));
 
-        switch ( v->arch.smap_check_policy )
-        {
-        case SMAP_CHECK_HONOR_CPL_AC:
-            /*
-             * SMAP: kernel-mode data accesses from user-mode mappings
-             * should fault.
-             * A fault is considered as a SMAP violation if the following
-             * conditions come true:
-             *   - X86_CR4_SMAP is set in CR4
-             *   - A user page is accessed
-             *   - CPL = 3 or X86_EFLAGS_AC is clear
-             *   - Page fault in kernel mode
-             */
-            smap = hvm_smap_enabled(v) &&
-                   ((hvm_get_cpl(v) == 3) || !(regs->_eflags & X86_EFLAGS_AC));
-            break;
-        case SMAP_CHECK_ENABLED:
-            smap = hvm_smap_enabled(v);
-            break;
-        default:
-            ASSERT(v->arch.smap_check_policy == SMAP_CHECK_DISABLED);
-            break;
-        }
-    }
+    /*
+     * PFEC_insn_fetch is only used as an input to pagetable walking if NX or
+     * SMEP are enabled.  Otherwise, instruction fetches are indistinguishable
+     * from data reads.
+     *
+     * This property can be demonstrated on real hardware by having NX and
+     * SMEP inactive, but SMAP active, and observing that EFLAGS.AC determines
+     * whether a pagefault occurs for supervisor execution on user mappings.
+     */
+    if ( !(guest_supports_nx(v) || guest_smep_enabled(v)) )
+        walk &= ~PFEC_insn_fetch;
 
-    if ( smep || smap )
-        mflags |= _PAGE_USER;
+    perfc_incr(guest_walk);
+    memset(gw, 0, sizeof(*gw));
+    gw->va = va;
+    gw->pfec = walk & (PFEC_insn_fetch | PFEC_user_mode | PFEC_write_access);
 
 #if GUEST_PAGING_LEVELS >= 3 /* PAE or 64... */
 #if GUEST_PAGING_LEVELS >= 4 /* 64-bit only... */
@@ -221,17 +134,20 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
     gw->l4mfn = top_mfn;
     l4p = (guest_l4e_t *) top_map;
     gw->l4e = l4p[guest_l4_table_offset(va)];
-    gflags = guest_l4e_get_flags(gw->l4e) ^ iflags;
-    if ( !(gflags & _PAGE_PRESENT) ) {
-        rc |= _PAGE_PRESENT;
+    gflags = guest_l4e_get_flags(gw->l4e);
+    if ( !(gflags & _PAGE_PRESENT) )
         goto out;
-    }
-    if ( gflags & _PAGE_PSE )
+
+    /* Check for reserved bits. */
+    if ( guest_l4e_rsvd_bits(v, gw->l4e) )
     {
-        rc |= _PAGE_PSE | _PAGE_INVALID_BIT;
+        gw->pfec |= PFEC_reserved_bit | PFEC_page_present;
         goto out;
     }
-    rc |= ((gflags & mflags) ^ mflags);
+
+    /* Accumulate l4e access rights. */
+    ar_and &= gflags;
+    ar_or  |= gflags;
 
     /* Map the l3 table */
     l3p = map_domain_gfn(p2m, 
@@ -241,17 +157,28 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
                          qt,
                          &rc); 
     if(l3p == NULL)
+    {
+        gw->pfec |= rc & PFEC_synth_mask;
         goto out;
+    }
+
     /* Get the l3e and check its flags*/
     gw->l3e = l3p[guest_l3_table_offset(va)];
-    pkey = guest_l3e_get_pkey(gw->l3e);
-    gflags = guest_l3e_get_flags(gw->l3e) ^ iflags;
-    if ( !(gflags & _PAGE_PRESENT) ) {
-        rc |= _PAGE_PRESENT;
+    gflags = guest_l3e_get_flags(gw->l3e);
+    if ( !(gflags & _PAGE_PRESENT) )
+        goto out;
+
+    /* Check for reserved bits, including possibly _PAGE_PSE. */
+    if ( guest_l3e_rsvd_bits(v, gw->l3e) )
+    {
+        gw->pfec |= PFEC_reserved_bit | PFEC_page_present;
         goto out;
     }
-    rc |= ((gflags & mflags) ^ mflags);
     
+    /* Accumulate l3e access rights. */
+    ar_and &= gflags;
+    ar_or  |= gflags;
+
     pse1G = !!(gflags & _PAGE_PSE);
 
     if ( pse1G )
@@ -272,26 +199,25 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
             /* _PAGE_PSE_PAT not set: remove _PAGE_PAT from flags. */
             flags &= ~_PAGE_PAT;
 
-        if ( !guest_supports_1G_superpages(v) )
-            rc |= _PAGE_PSE | _PAGE_INVALID_BIT;
-        if ( gfn_x(start) & GUEST_L3_GFN_MASK & ~0x1 )
-            rc |= _PAGE_INVALID_BITS;
-
         /* Increment the pfn by the right number of 4k pages. */
         start = _gfn((gfn_x(start) & ~GUEST_L3_GFN_MASK) +
                      ((va >> PAGE_SHIFT) & GUEST_L3_GFN_MASK));
         gw->l1e = guest_l1e_from_gfn(start, flags);
         gw->l2mfn = gw->l1mfn = INVALID_MFN;
-        goto set_ad;
+        goto leaf;
     }
 
 #else /* PAE only... */
 
     /* Get the l3e and check its flag */
     gw->l3e = ((guest_l3e_t *) top_map)[guest_l3_table_offset(va)];
-    if ( !(guest_l3e_get_flags(gw->l3e) & _PAGE_PRESENT) ) 
+    gflags = guest_l3e_get_flags(gw->l3e);
+    if ( !(gflags & _PAGE_PRESENT) )
+        goto out;
+
+    if ( guest_l3e_rsvd_bits(v, gw->l3e) )
     {
-        rc |= _PAGE_PRESENT;
+        gw->pfec |= PFEC_reserved_bit | PFEC_page_present;
         goto out;
     }
 
@@ -305,7 +231,11 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
                          qt,
                          &rc); 
     if(l2p == NULL)
+    {
+        gw->pfec |= rc & PFEC_synth_mask;
         goto out;
+    }
+
     /* Get the l2e */
     gw->l2e = l2p[guest_l2_table_offset(va)];
 
@@ -318,15 +248,32 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
 
 #endif /* All levels... */
 
-    pkey = guest_l2e_get_pkey(gw->l2e);
-    gflags = guest_l2e_get_flags(gw->l2e) ^ iflags;
-    if ( !(gflags & _PAGE_PRESENT) ) {
-        rc |= _PAGE_PRESENT;
+    /* Check the l2e flags. */
+    gflags = guest_l2e_get_flags(gw->l2e);
+    if ( !(gflags & _PAGE_PRESENT) )
+        goto out;
+
+    /*
+     * In 2-level paging without CR0.PSE, there are no reserved bits, and the
+     * PAT/PSE bit is ignored.
+     */
+    if ( GUEST_PAGING_LEVELS == 2 && !guest_supports_superpages(v) )
+    {
+        gw->l2e.l2 &= ~_PAGE_PSE;
+        gflags &= ~_PAGE_PSE;
+    }
+    /* else check for reserved bits, including possibly _PAGE_PSE. */
+    else if ( guest_l2e_rsvd_bits(v, gw->l2e) )
+    {
+        gw->pfec |= PFEC_reserved_bit | PFEC_page_present;
         goto out;
     }
-    rc |= ((gflags & mflags) ^ mflags);
 
-    pse2M = (gflags & _PAGE_PSE) && guest_supports_superpages(v); 
+    /* Accumulate l2e access rights. */
+    ar_and &= gflags;
+    ar_or  |= gflags;
+
+    pse2M = !!(gflags & _PAGE_PSE);
 
     if ( pse2M )
     {
@@ -334,7 +281,11 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
          * no guest l1e.  We make one up so that the propagation code
          * can generate a shadow l1 table.  Start with the gfn of the 
          * first 4k-page of the superpage. */
+#if GUEST_PAGING_LEVELS == 2
+        gfn_t start = _gfn(unfold_pse36(gw->l2e.l2) >> PAGE_SHIFT);
+#else
         gfn_t start = guest_l2e_get_gfn(gw->l2e);
+#endif
         /* Grant full access in the l1e, since all the guest entry's 
          * access controls are enforced in the shadow l2e. */
         int flags = (_PAGE_PRESENT|_PAGE_USER|_PAGE_RW|
@@ -348,70 +299,136 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
             /* _PAGE_PSE_PAT not set: remove _PAGE_PAT from flags. */
             flags &= ~_PAGE_PAT;
 
-        if ( gfn_x(start) & GUEST_L2_GFN_MASK & ~0x1 )
-            rc |= _PAGE_INVALID_BITS;
-
-        /* Increment the pfn by the right number of 4k pages.  
-         * Mask out PAT and invalid bits. */
+        /* Increment the pfn by the right number of 4k pages. */
         start = _gfn((gfn_x(start) & ~GUEST_L2_GFN_MASK) +
                      guest_l1_table_offset(va));
+#if GUEST_PAGING_LEVELS == 2
+         /* Wider than 32 bits if PSE36 superpage. */
+        gw->el1e = (gfn_x(start) << PAGE_SHIFT) | flags;
+#else
         gw->l1e = guest_l1e_from_gfn(start, flags);
+#endif
         gw->l1mfn = INVALID_MFN;
-    } 
-    else 
+        goto leaf;
+    }
+
+    /* Map the l1 table */
+    l1p = map_domain_gfn(p2m,
+                         guest_l2e_get_gfn(gw->l2e),
+                         &gw->l1mfn,
+                         &p2mt,
+                         qt,
+                         &rc);
+    if ( l1p == NULL )
     {
-        /* Not a superpage: carry on and find the l1e. */
-        l1p = map_domain_gfn(p2m, 
-                             guest_l2e_get_gfn(gw->l2e), 
-                             &gw->l1mfn,
-                             &p2mt,
-                             qt,
-                             &rc);
-        if(l1p == NULL)
+        gw->pfec |= rc & PFEC_synth_mask;
+        goto out;
+    }
+    gw->l1e = l1p[guest_l1_table_offset(va)];
+    gflags = guest_l1e_get_flags(gw->l1e);
+    if ( !(gflags & _PAGE_PRESENT) )
+        goto out;
+
+    /* Check for reserved bits. */
+    if ( guest_l1e_rsvd_bits(v, gw->l1e) )
+    {
+        gw->pfec |= PFEC_reserved_bit | PFEC_page_present;
+        goto out;
+    }
+
+    /* Accumulate l1e access rights. */
+    ar_and &= gflags;
+    ar_or  |= gflags;
+
+ leaf:
+    gw->pfec |= PFEC_page_present;
+
+    /*
+     * The pagetable walk has returned a successful translation.  Now check
+     * access rights to see whether the access should succeed.
+     */
+    ar = (ar_and & AR_ACCUM_AND) | (ar_or & AR_ACCUM_OR);
+
+    if ( (walk & PFEC_insn_fetch) && (ar & _PAGE_NX_BIT) )
+        /* Requested an instruction fetch and found NX? Fail. */
+        goto out;
+
+    if ( walk & PFEC_user_mode ) /* Requested a user access. */
+    {
+        if ( !(ar & _PAGE_USER) )
+            /* Got a supervisor walk?  Unconditional fail. */
             goto out;
-        gw->l1e = l1p[guest_l1_table_offset(va)];
-        pkey = guest_l1e_get_pkey(gw->l1e);
-        gflags = guest_l1e_get_flags(gw->l1e) ^ iflags;
-        if ( !(gflags & _PAGE_PRESENT) ) {
-            rc |= _PAGE_PRESENT;
+
+        if ( (walk & PFEC_write_access) && !(ar & _PAGE_RW) )
+            /* Requested a write and only got a read? Fail. */
             goto out;
+    }
+    else /* Requested a supervisor access. */
+    {
+        if ( ar & _PAGE_USER ) /* Got a user walk. */
+        {
+            if ( (walk & PFEC_insn_fetch) && guest_smep_enabled(v) )
+                /* User insn fetch and smep? Fail. */
+                goto out;
+
+            if ( !(walk & PFEC_insn_fetch) && guest_smap_enabled(v) &&
+                 ((walk & PFEC_implicit) ||
+                  !(guest_cpu_user_regs()->_eflags & X86_EFLAGS_AC)) )
+                /* User data access and smap? Fail. */
+                goto out;
         }
-        rc |= ((gflags & mflags) ^ mflags);
+
+        if ( (walk & PFEC_write_access) && !(ar & _PAGE_RW) &&
+             guest_wp_enabled(v) )
+            /* Requested a write, got a read, and CR0.WP is set? Fail. */
+            goto out;
     }
 
 #if GUEST_PAGING_LEVELS >= 4 /* 64-bit only... */
-set_ad:
-    if ( pkey_fault(v, pfec, gflags, pkey) )
-        rc |= _PAGE_PKEY_BITS;
+    /*
+     * If all access checks are thus far ok, check Protection Key for 64bit
+     * user data accesses.
+     *
+     * N.B. In the case that the walk ended with a superpage, the fabricated
+     * gw->l1e contains the appropriate leaf pkey.
+     */
+    if ( (walk & PFEC_user_mode) && !(walk & PFEC_insn_fetch) && guest_pku_enabled(v) )
+    {
+        unsigned int pkey = guest_l1e_get_pkey(gw->l1e);
+        unsigned int pkru = read_pkru();
+
+        if ( read_pkru_ad(pkru, pkey) ||
+             ((walk & PFEC_write_access) && read_pkru_wd(pkru, pkey)) )
+        {
+            gw->pfec |= PFEC_prot_key;
+            goto out;
+        }
+    }
 #endif
-    /* Now re-invert the user-mode requirement for SMEP and SMAP */
-    if ( smep || smap )
-        rc ^= _PAGE_USER;
+
+    walk_ok = true;
 
     /* Go back and set accessed and dirty bits only if the walk was a
      * success.  Although the PRMs say higher-level _PAGE_ACCESSED bits
      * get set whenever a lower-level PT is used, at least some hardware
      * walkers behave this way. */
-    if ( rc == 0 ) 
-    {
 #if GUEST_PAGING_LEVELS == 4 /* 64-bit only... */
-        if ( set_ad_bits(l4p + guest_l4_table_offset(va), &gw->l4e, 0) )
-            paging_mark_dirty(d, gw->l4mfn);
-        if ( set_ad_bits(l3p + guest_l3_table_offset(va), &gw->l3e,
-                         (pse1G && (pfec & PFEC_write_access))) )
-            paging_mark_dirty(d, gw->l3mfn);
+    if ( set_ad_bits(l4p + guest_l4_table_offset(va), &gw->l4e, 0) )
+        paging_mark_dirty(d, gw->l4mfn);
+    if ( set_ad_bits(l3p + guest_l3_table_offset(va), &gw->l3e,
+                     (pse1G && (walk & PFEC_write_access))) )
+        paging_mark_dirty(d, gw->l3mfn);
 #endif
-        if ( !pse1G ) 
+    if ( !pse1G )
+    {
+        if ( set_ad_bits(l2p + guest_l2_table_offset(va), &gw->l2e,
+                         (pse2M && (walk & PFEC_write_access))) )
+            paging_mark_dirty(d, gw->l2mfn);
+        if ( !pse2M )
         {
-            if ( set_ad_bits(l2p + guest_l2_table_offset(va), &gw->l2e,
-                             (pse2M && (pfec & PFEC_write_access))) )
-                paging_mark_dirty(d, gw->l2mfn);
-            if ( !pse2M ) 
-            {
-                if ( set_ad_bits(l1p + guest_l1_table_offset(va), &gw->l1e, 
-                                 (pfec & PFEC_write_access)) )
-                    paging_mark_dirty(d, gw->l1mfn);
-            }
+            if ( set_ad_bits(l1p + guest_l1_table_offset(va), &gw->l1e,
+                             (walk & PFEC_write_access)) )
+                paging_mark_dirty(d, gw->l1mfn);
         }
     }
 
@@ -436,10 +453,5 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
         put_page(mfn_to_page(mfn_x(gw->l1mfn)));
     }
 
-    /* If this guest has a restricted physical address space then the
-     * target GFN must fit within it. */
-    if ( !(rc & _PAGE_PRESENT) && !gfn_valid(d, guest_l1e_get_gfn(gw->l1e)) )
-        rc |= _PAGE_INVALID_BITS;
-
-    return rc;
+    return walk_ok;
 }
diff --git a/xen/arch/x86/mm/hap/guest_walk.c b/xen/arch/x86/mm/hap/guest_walk.c
index 313f82f..97b8ac5 100644
--- a/xen/arch/x86/mm/hap/guest_walk.c
+++ b/xen/arch/x86/mm/hap/guest_walk.c
@@ -50,7 +50,7 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
     struct vcpu *v, struct p2m_domain *p2m, unsigned long cr3,
     paddr_t ga, uint32_t *pfec, unsigned int *page_order)
 {
-    uint32_t missing;
+    bool walk_ok;
     mfn_t top_mfn;
     void *top_map;
     p2m_type_t p2mt;
@@ -91,12 +91,12 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
 #if GUEST_PAGING_LEVELS == 3
     top_map += (cr3 & ~(PAGE_MASK | 31));
 #endif
-    missing = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map);
+    walk_ok = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map);
     unmap_domain_page(top_map);
     put_page(top_page);
 
     /* Interpret the answer */
-    if ( missing == 0 )
+    if ( walk_ok )
     {
         gfn_t gfn = guest_walk_to_gfn(&gw);
         struct page_info *page;
@@ -123,20 +123,7 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
         return gfn_x(gfn);
     }
 
-    if ( missing & _PAGE_PRESENT )
-        pfec[0] &= ~PFEC_page_present;
-
-    if ( missing & _PAGE_INVALID_BITS ) 
-        pfec[0] |= PFEC_reserved_bit;
-
-    if ( missing & _PAGE_PKEY_BITS )
-        pfec[0] |= PFEC_prot_key;
-
-    if ( missing & _PAGE_PAGED )
-        pfec[0] = PFEC_page_paged;
-
-    if ( missing & _PAGE_SHARED )
-        pfec[0] = PFEC_page_shared;
+    pfec[0] = gw.pfec;
 
  out_tweak_pfec:
     /*
diff --git a/xen/arch/x86/mm/hap/nested_ept.c b/xen/arch/x86/mm/hap/nested_ept.c
index 02b27b1..14b1bb0 100644
--- a/xen/arch/x86/mm/hap/nested_ept.c
+++ b/xen/arch/x86/mm/hap/nested_ept.c
@@ -208,7 +208,7 @@ nept_walk_tables(struct vcpu *v, unsigned long l2ga, ept_walk_t *gw)
     goto out;
 
 map_err:
-    if ( rc == _PAGE_PAGED )
+    if ( rc == PFEC_page_paged )
     {
         ret = EPT_TRANSLATE_RETRY;
         goto out;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 2eee9cd..b51421b 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1774,17 +1774,18 @@ unsigned long paging_gva_to_gfn(struct vcpu *v,
 }
 
 /*
- * If the map is non-NULL, we leave this function having
- * acquired an extra ref on mfn_to_page(*mfn).
+ * If the map is non-NULL, we leave this function having acquired an extra ref
+ * on mfn_to_page(*mfn).  In all cases, *pfec contains appropriate
+ * synthetic/structure PFEC_* bits.
  */
 void *map_domain_gfn(struct p2m_domain *p2m, gfn_t gfn, mfn_t *mfn,
-                     p2m_type_t *p2mt, p2m_query_t q, uint32_t *rc)
+                     p2m_type_t *p2mt, p2m_query_t q, uint32_t *pfec)
 {
     struct page_info *page;
 
     if ( !gfn_valid(p2m->domain, gfn) )
     {
-        *rc = _PAGE_INVALID_BIT;
+        *pfec = PFEC_reserved_bit | PFEC_page_present;
         return NULL;
     }
 
@@ -1796,21 +1797,23 @@ void *map_domain_gfn(struct p2m_domain *p2m, gfn_t gfn, mfn_t *mfn,
         if ( page )
             put_page(page);
         p2m_mem_paging_populate(p2m->domain, gfn_x(gfn));
-        *rc = _PAGE_PAGED;
+        *pfec = PFEC_page_paged;
         return NULL;
     }
     if ( p2m_is_shared(*p2mt) )
     {
         if ( page )
             put_page(page);
-        *rc = _PAGE_SHARED;
+        *pfec = PFEC_page_shared;
         return NULL;
     }
     if ( !page )
     {
-        *rc |= _PAGE_PRESENT;
+        *pfec = 0;
         return NULL;
     }
+
+    *pfec = PFEC_page_present;
     *mfn = page_to_mfn(page);
     ASSERT(mfn_valid(*mfn));
 
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index f73b553..62c1fb5 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -171,7 +171,7 @@ delete_shadow_status(struct domain *d, mfn_t gmfn, u32 shadow_type, mfn_t smfn)
 /**************************************************************************/
 /* Functions for walking the guest page tables */
 
-static inline uint32_t
+static inline bool
 sh_walk_guest_tables(struct vcpu *v, unsigned long va, walk_t *gw,
                      uint32_t pfec)
 {
@@ -2865,6 +2865,7 @@ static int sh_page_fault(struct vcpu *v,
     int r;
     p2m_type_t p2mt;
     uint32_t rc, error_code;
+    bool walk_ok;
     int version;
     const struct npfec access = {
          .read_access = 1,
@@ -3080,21 +3081,20 @@ static int sh_page_fault(struct vcpu *v,
      * shadow page table. */
     version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
     rmb();
-    rc = sh_walk_guest_tables(v, va, &gw, error_code);
+    walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
     regs->error_code &= ~PFEC_page_present;
-    if ( !(rc & _PAGE_PRESENT) )
+    if ( gw.pfec & PFEC_page_present )
         regs->error_code |= PFEC_page_present;
 #endif
 
-    if ( rc != 0 )
+    if ( !walk_ok )
     {
         perfc_incr(shadow_fault_bail_real_fault);
         SHADOW_PRINTK("not a shadow fault\n");
         reset_early_unshadow(v);
-        if ( (rc & _PAGE_INVALID_BITS) )
-            regs->error_code |= PFEC_reserved_bit;
+        regs->error_code = gw.pfec & PFEC_arch_mask;
         goto propagate;
     }
 
@@ -3728,7 +3728,7 @@ sh_gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
 {
     walk_t gw;
     gfn_t gfn;
-    uint32_t missing;
+    bool walk_ok;
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB)
     /* Check the vTLB cache first */
@@ -3737,18 +3737,9 @@ sh_gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
         return vtlb_gfn;
 #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
 
-    if ( (missing = sh_walk_guest_tables(v, va, &gw, pfec[0])) != 0 )
+    if ( !(walk_ok = sh_walk_guest_tables(v, va, &gw, pfec[0])) )
     {
-        if ( (missing & _PAGE_PRESENT) )
-            pfec[0] &= ~PFEC_page_present;
-        if ( missing & _PAGE_INVALID_BITS )
-            pfec[0] |= PFEC_reserved_bit;
-        /*
-         * SDM Intel 64 Volume 3, Chapter Paging, PAGE-FAULT EXCEPTIONS:
-         * The PFEC_insn_fetch flag is set only when NX or SMEP are enabled.
-         */
-        if ( is_hvm_vcpu(v) && !hvm_nx_enabled(v) && !hvm_smep_enabled(v) )
-            pfec[0] &= ~PFEC_insn_fetch;
+        pfec[0] = gw.pfec;
         return gfn_x(INVALID_GFN);
     }
     gfn = guest_walk_to_gfn(&gw);
diff --git a/xen/include/asm-x86/guest_pt.h b/xen/include/asm-x86/guest_pt.h
index 1aa383f..44fc097 100644
--- a/xen/include/asm-x86/guest_pt.h
+++ b/xen/include/asm-x86/guest_pt.h
@@ -236,19 +236,26 @@ static inline bool guest_supports_nx(const struct vcpu *v)
     return hvm_nx_enabled(v);
 }
 
+static inline bool guest_wp_enabled(const struct vcpu *v)
+{
+    /* PV guests can't control CR0.WP, and it is unconditionally set by Xen. */
+    return is_pv_vcpu(v) || hvm_wp_enabled(v);
+}
 
-/*
- * Some bits are invalid in any pagetable entry.
- * Normal flags values get represented in 24-bit values (see
- * get_pte_flags() and put_pte_flags()), so set bit 24 in
- * addition to be able to flag out of range frame numbers.
- */
-#if GUEST_PAGING_LEVELS == 3
-#define _PAGE_INVALID_BITS \
-    (_PAGE_INVALID_BIT | get_pte_flags(((1ull << 63) - 1) & ~(PAGE_SIZE - 1)))
-#else /* 2-level and 4-level */
-#define _PAGE_INVALID_BITS _PAGE_INVALID_BIT
-#endif
+static inline bool guest_smep_enabled(const struct vcpu *v)
+{
+    return is_pv_vcpu(v) ? 0 : hvm_smep_enabled(v);
+}
+
+static inline bool guest_smap_enabled(const struct vcpu *v)
+{
+    return is_pv_vcpu(v) ? 0 : hvm_smap_enabled(v);
+}
+
+static inline bool guest_pku_enabled(const struct vcpu *v)
+{
+    return is_pv_vcpu(v) ? 0 : hvm_pku_enabled(v);
+}
 
 /* Helpers for identifying whether guest entries have reserved bits set. */
 
@@ -312,13 +319,19 @@ struct guest_pagetable_walk
     guest_l3e_t l3e;            /* Guest's level 3 entry */
 #endif
     guest_l2e_t l2e;            /* Guest's level 2 entry */
-    guest_l1e_t l1e;            /* Guest's level 1 entry (or fabrication) */
+    union
+    {
+        guest_l1e_t l1e;        /* Guest's level 1 entry (or fabrication). */
+        uint64_t   el1e;        /* L2 PSE36 superpages wider than 32 bits. */
+    };
 #if GUEST_PAGING_LEVELS >= 4
     mfn_t l4mfn;                /* MFN that the level 4 entry was in */
     mfn_t l3mfn;                /* MFN that the level 3 entry was in */
 #endif
     mfn_t l2mfn;                /* MFN that the level 2 entry was in */
     mfn_t l1mfn;                /* MFN that the level 1 entry was in */
+
+    uint32_t pfec;              /* Accumulated PFEC_* error code from walk. */
 };
 
 /* Given a walk_t, translate the gw->va into the guest's notion of the
@@ -327,7 +340,9 @@ static inline gfn_t guest_walk_to_gfn(const walk_t *gw)
 {
     if ( !(guest_l1e_get_flags(gw->l1e) & _PAGE_PRESENT) )
         return INVALID_GFN;
-    return guest_l1e_get_gfn(gw->l1e);
+    return (GUEST_PAGING_LEVELS == 2
+            ? _gfn(gw->el1e >> PAGE_SHIFT)
+            : guest_l1e_get_gfn(gw->l1e));
 }
 
 /* Given a walk_t, translate the gw->va into the guest's notion of the
@@ -372,15 +387,16 @@ static inline unsigned int guest_walk_to_page_order(const walk_t *gw)
  * we go.  For the purposes of reading pagetables we treat all non-RAM
  * memory as contining zeroes.
  * 
- * Returns 0 for success, or the set of permission bits that we failed on 
- * if the walk did not complete. */
+ * Returns a boolean indicating success or failure.  walk_t.pfec contains
+ * the accumulated error code on failure.
+ */
 
 /* Macro-fu so you can call guest_walk_tables() and get the right one. */
 #define GPT_RENAME2(_n, _l) _n ## _ ## _l ## _levels
 #define GPT_RENAME(_n, _l) GPT_RENAME2(_n, _l)
 #define guest_walk_tables GPT_RENAME(guest_walk_tables, GUEST_PAGING_LEVELS)
 
-extern uint32_t 
+bool
 guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m, unsigned long va,
                   walk_t *gw, uint32_t pfec, mfn_t top_mfn, void *top_map);
 
@@ -400,8 +416,21 @@ static inline void print_gw(const walk_t *gw)
 #endif /* All levels... */
     gprintk(XENLOG_INFO, "   l2e=%" PRI_gpte " l2mfn=%" PRI_mfn "\n",
             gw->l2e.l2, mfn_x(gw->l2mfn));
+#if GUEST_PAGING_LEVELS == 2
+    gprintk(XENLOG_INFO, "  el1e=%08" PRIx64 " l1mfn=%" PRI_mfn "\n",
+            gw->el1e, mfn_x(gw->l1mfn));
+#else
     gprintk(XENLOG_INFO, "   l1e=%" PRI_gpte " l1mfn=%" PRI_mfn "\n",
             gw->l1e.l1, mfn_x(gw->l1mfn));
+#endif
+    gprintk(XENLOG_INFO, "   pfec=%02x[%c%c%c%c%c%c]\n", gw->pfec,
+            gw->pfec & PFEC_prot_key     ? 'K' : '-',
+            gw->pfec & PFEC_insn_fetch   ? 'I' : 'd',
+            gw->pfec & PFEC_reserved_bit ? 'R' : '-',
+            gw->pfec & PFEC_user_mode    ? 'U' : 's',
+            gw->pfec & PFEC_write_access ? 'W' : 'r',
+            gw->pfec & PFEC_page_present ? 'P' : '-'
+        );
 }
 
 #endif /* _XEN_ASM_GUEST_PT_H */
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 470d29d..bc189d1 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -679,7 +679,7 @@ int p2m_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
 extern void p2m_pt_init(struct p2m_domain *p2m);
 
 void *map_domain_gfn(struct p2m_domain *p2m, gfn_t gfn, mfn_t *mfn,
-                     p2m_type_t *p2mt, p2m_query_t q, uint32_t *rc);
+                     p2m_type_t *p2mt, p2m_query_t q, uint32_t *pfec);
 
 /* Debugging and auditing of the P2M code? */
 #ifndef NDEBUG
diff --git a/xen/include/asm-x86/page.h b/xen/include/asm-x86/page.h
index 46faffc..bc5946b 100644
--- a/xen/include/asm-x86/page.h
+++ b/xen/include/asm-x86/page.h
@@ -314,9 +314,6 @@ void efi_update_l4_pgtable(unsigned int l4idx, l4_pgentry_t);
 #define _PAGE_PSE_PAT  _AC(0x1000,U)
 #define _PAGE_AVAIL_HIGH (_AC(0x7ff, U) << 12)
 #define _PAGE_NX       (cpu_has_nx ? _PAGE_NX_BIT : 0)
-/* non-architectural flags */
-#define _PAGE_PAGED   0x2000U
-#define _PAGE_SHARED  0x4000U
 
 /*
  * Debug option: Ensure that granted mappings are not implicitly unmapped.
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index d3ba8ea..75632d9 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -73,10 +73,12 @@
 #define PFEC_reserved_bit   (_AC(1,U) << 3)
 #define PFEC_insn_fetch     (_AC(1,U) << 4)
 #define PFEC_prot_key       (_AC(1,U) << 5)
+#define PFEC_arch_mask      (_AC(0xffff,U)) /* Architectural PFEC values. */
 /* Internally used only flags. */
 #define PFEC_page_paged     (1U<<16)
 #define PFEC_page_shared    (1U<<17)
 #define PFEC_implicit       (1U<<18) /* Pagewalk input for ldt/gdt/idt/tr accesses. */
+#define PFEC_synth_mask     (~PFEC_arch_mask) /* Synthetic PFEC values. */
 
 /* Other exception error code values. */
 #define X86_XEC_EXT         (_AC(1,U) << 0)
diff --git a/xen/include/asm-x86/x86_64/page.h b/xen/include/asm-x86/x86_64/page.h
index 5a613bc..1a6cae6 100644
--- a/xen/include/asm-x86/x86_64/page.h
+++ b/xen/include/asm-x86/x86_64/page.h
@@ -152,12 +152,6 @@ typedef l4_pgentry_t root_pgentry_t;
 #define _PAGE_GNTTAB (1U<<22)
 
 /*
- * Bit 24 of a 24-bit flag mask!  This is not any bit of a real pte,
- * and is only used for signalling in variables that contain flags.
- */
-#define _PAGE_INVALID_BIT (1U<<24)
-
-/*
  * Bit 12 of a 24-bit flag mask. This corresponds to bit 52 of a pte.
  * This is needed to distinguish between user and kernel PTEs since _PAGE_USER
  * is asserted for both.
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses
  2017-02-27 14:03 ` [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses Andrew Cooper
@ 2017-03-01 15:05   ` Jan Beulich
  2017-03-02 16:14   ` Tim Deegan
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Jan Beulich @ 2017-03-01 15:05 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Paul Durrant, Tim Deegan, Xen-devel

>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -76,6 +76,7 @@
>  /* Internally used only flags. */
>  #define PFEC_page_paged     (1U<<16)
>  #define PFEC_page_shared    (1U<<17)
> +#define PFEC_implicit       (1U<<18) /* Pagewalk input for ldt/gdt/idt/tr accesses. */

PFEC_implicit makes it kind of implicit what implicit here means, but
since anything more explicit would likely also be quite a bit longer,
let's go with this for now:

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/7] x86/shadow: Try to correctly identify implicit supervisor accesses
  2017-02-27 14:03 ` [PATCH 2/7] x86/shadow: Try to correctly " Andrew Cooper
@ 2017-03-01 15:11   ` Jan Beulich
  2017-03-02 16:14   ` Tim Deegan
  2017-03-07 11:26   ` George Dunlap
  2 siblings, 0 replies; 52+ messages in thread
From: Jan Beulich @ 2017-03-01 15:11 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel

>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling
  2017-02-27 14:03 ` [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling Andrew Cooper
@ 2017-03-01 15:57   ` Jan Beulich
  2017-03-02 12:23     ` Andrew Cooper
  2017-03-02 14:12   ` Tim Deegan
  2017-03-02 16:15   ` Tim Deegan
  2 siblings, 1 reply; 52+ messages in thread
From: Jan Beulich @ 2017-03-01 15:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel

>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
> -static inline int
> -guest_supports_1G_superpages(struct vcpu *v)
> +static inline bool guest_has_pse36(const struct vcpu *v)
> +{
> +     /* No support for 2-level PV guests. */
> +    return is_pv_vcpu(v) ? 0 : paging_mode_hap(v->domain);

Considering the return type ITYM "false" instead of "0" here. But
then why use a conditional expression here at all?

    return !is_pv_vcpu(v) && paging_mode_hap(v->domain);

Furthermore there's no point in passing in a struct vcpu here -
both checked constructs are per-domain, and passing in
struct domain is also a better fit with the guest_ prefix of the
function name.
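
I.e. something along these lines (just a sketch; whether is_pv_domain()
is the right predicate to use is of course your call):

    static inline bool guest_has_pse36(const struct domain *d)
    {
        /* No support for 2-level PV guests. */
        return !is_pv_domain(d) && paging_mode_hap(d);
    }

with the callers then simply passing v->domain.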

> +static inline bool guest_supports_1G_superpages(const struct vcpu *v)
>  {
>      return (GUEST_PAGING_LEVELS >= 4 && hvm_pse1gb_supported(v->domain));

Same here for vcpu vs domain.

> -static inline int
> -guest_supports_nx(struct vcpu *v)
> +static inline bool guest_supports_nx(const struct vcpu *v)
>  {
>      if ( GUEST_PAGING_LEVELS == 2 || !cpu_has_nx )
>          return 0;

The situation is different here - hvm_nx_enabled() can vary across
vCPU-s, so here it's rather the name which doesn't really fit the
purpose (should rather be guest_uses_nx() or some such).

> +static inline bool guest_l2e_rsvd_bits(const struct vcpu *v, guest_l2e_t l2e)
> +{
> +    uint64_t rsvd_bits = guest_rsvd_bits(v);
> +
> +    return ((l2e.l2 & (rsvd_bits | GUEST_L2_PAGETABLE_RSVD |
> +                       (guest_supports_superpages(v) ? 0 : _PAGE_PSE))) ||
> +            ((l2e.l2 & _PAGE_PSE) &&
> +             (l2e.l2 & ((GUEST_PAGING_LEVELS == 2 && guest_has_pse36(v))
> +                        ? (fold_pse36(rsvd_bits | (1ULL << 40)))

While one can look it up in the doc, I strongly think this 40 needs a
comment.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/7] x86/hvm: Adjust hvm_nx_enabled() to match how Xen behaves
  2017-02-27 14:03 ` [PATCH 4/7] x86/hvm: Adjust hvm_nx_enabled() to match how Xen behaves Andrew Cooper
@ 2017-03-01 16:00   ` Jan Beulich
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Beulich @ 2017-03-01 16:00 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
> On Intel hardware, EFER is not fully switched between host and guest 
> contexts.
> In practice, this means that Xen's EFER.NX setting leaks into guest context,
> and influences the behaviour of the hardware pagewalker.
> 
> When servicing a pagefault, Xen's model of guests behaviour should match
> hardware's behaviour, to allow correct interpretation of the pagefault error
> code, and to avoid creating observable difference in behaviour from the 
> guests
> point of view.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
with one nit (see below).

> Fixing this isn't trivial.  On more modern hardware, we can use EFER loading.
> On older hardware, we can use general MSR loading if available.  On
> older-hardware-yet, we could reload EFER right before/after vmentry/vmexit.
> However, doing so would require reloading EFER before any data accesses (as
> the NX bit will cause #PF[RSVD]), and that is rather hard given the need to
> preserve the GPRs.

I think the primary goal should be to get this right on modern hardware.

> --- a/xen/include/asm-x86/hvm/hvm.h
> +++ b/xen/include/asm-x86/hvm/hvm.h
> @@ -292,8 +292,10 @@ int hvm_girq_dest_2_vcpu_id(struct domain *d, uint8_t 
> dest, uint8_t dest_mode);
>      (hvm_paging_enabled(v) && ((v)->arch.hvm_vcpu.guest_cr[4] & X86_CR4_SMEP))
>  #define hvm_smap_enabled(v) \
>      (hvm_paging_enabled(v) && ((v)->arch.hvm_vcpu.guest_cr[4] & X86_CR4_SMAP))
> +/* HVM guests on Intel hardware leak Xen's NX settings into guest context. 
> */
>  #define hvm_nx_enabled(v) \
> -    (!!((v)->arch.hvm_vcpu.guest_efer & EFER_NX))
> +    ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL && cpu_has_nx) || \
> +     !!((v)->arch.hvm_vcpu.guest_efer & EFER_NX))

The !! can now be dropped. When the change mentioned above is
done we'll need to remember that this then also needs tweaking.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers
  2017-02-27 14:03 ` [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers Andrew Cooper
@ 2017-03-01 16:03   ` Jan Beulich
  2017-03-02 12:26     ` Andrew Cooper
  2017-03-02 14:33   ` Tim Deegan
  1 sibling, 1 reply; 52+ messages in thread
From: Jan Beulich @ 2017-03-01 16:03 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel

>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
> The shadow logic should never create a shadow of a guest PTE which contains
> reserved bits from the guests point of view.  Such a shadowed entry might not
> cause #PF[RSVD] when walked by hardware, thus won't behave architecturally
> from the guests point of view.

But are we already or-ing in the RSVD bit accordingly in such cases,
before handing the #PF to the guest? The patch here certainly
doesn't make any change towards that, afaics.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation
  2017-02-27 14:03 ` [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation Andrew Cooper
@ 2017-03-01 16:22   ` Jan Beulich
  2017-03-01 16:33     ` Andrew Cooper
  2017-03-02 16:15   ` Tim Deegan
  2017-03-06 18:25   ` George Dunlap
  2 siblings, 1 reply; 52+ messages in thread
From: Jan Beulich @ 2017-03-01 16:22 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel

>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
> --- a/xen/arch/x86/mm/hap/guest_walk.c
> +++ b/xen/arch/x86/mm/hap/guest_walk.c
> @@ -98,7 +98,7 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
>      /* Interpret the answer */
>      if ( missing == 0 )
>      {
> -        gfn_t gfn = guest_l1e_get_gfn(gw.l1e);
> +        gfn_t gfn = guest_walk_to_gfn(&gw);
>          struct page_info *page;
>          page = get_page_from_gfn_p2m(p2m->domain, p2m, gfn_x(gfn), &p2mt,
>                                       NULL, P2M_ALLOC | P2M_UNSHARE);
> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -3109,7 +3109,7 @@ static int sh_page_fault(struct vcpu *v,
>      }
>  
>      /* What mfn is the guest trying to access? */
> -    gfn = guest_l1e_get_gfn(gw.l1e);
> +    gfn = guest_walk_to_gfn(&gw);
>      gmfn = get_gfn(d, gfn, &p2mt);
>  
>      if ( shadow_mode_refcounts(d) &&

With these adjusted, the remaining guest_l1e_get_gfn() uses are in
shadow's multi.c (presumably okay to stay) and at the end of
guest_walk_tables() itself. Is that latter one not also in need of
conversion, considering the PSE36 reference in the description?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/7] Fixes to pagetable handling
  2017-02-27 14:03 [PATCH 0/7] Fixes to pagetable handling Andrew Cooper
                   ` (6 preceding siblings ...)
  2017-02-27 14:03 ` [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker Andrew Cooper
@ 2017-03-01 16:24 ` Jan Beulich
  2017-03-01 16:32   ` Andrew Cooper
  2017-03-06 16:42 ` [RFC XTF PATCH] Pagetable Emulation testing Andrew Cooper
  8 siblings, 1 reply; 52+ messages in thread
From: Jan Beulich @ 2017-03-01 16:24 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
> Outstanding hardware issues discovered include:
>   1) There is an observable delay in AMD Fam 10h processors between loading a
>      segment selector, and the results of the LDT/GDT memory access being
>      visible in the pagetables (via the Access bits being set).

Are you saying the processor continues executing instructions
while the accessed bits are still clear? Or just that it takes very
long to complete the instruction doing the descriptor table access?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/7] Fixes to pagetable handling
  2017-03-01 16:24 ` [PATCH 0/7] Fixes to pagetable handling Jan Beulich
@ 2017-03-01 16:32   ` Andrew Cooper
  0 siblings, 0 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-03-01 16:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel

On 01/03/17 16:24, Jan Beulich wrote:
>>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
>> Outstanding hardware issues discovered include:
>>   1) There is an observable delay in AMD Fam 10h processors between loading a
>>      segment selector, and the results of the LDT/GDT memory access being
>>      visible in the pagetables (via the Access bits being set).
> Are you saying the processor continues executing instructions
> while the accessed bits are still clear? Or just that it takes very
> long to complete the instruction doing the descriptor table access?

The processor does continue to execute instructions before the access
bit gets set.  I discovered this because my XTF test which checks for
the correct behaviour of the A/D bits tripped over it.  (On that note, I
really need to clean up that test and post it.)

Even a 1000-nop loop isn't always enough of a delay to observe the
access bit becoming set.  OTOH, a serialising instruction, or forcing a
memory access through the newly-loaded segment, reliably causes the
effects of the load to become visible.

The memory access definitely occurs at the point of the implicit load,
because #GP is raised properly for bad segment descriptor settings...
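
For reference, the relevant part of the test is roughly as follows (a
heavily simplified sketch; the helpers and variables here are
illustrative placeholders, not the real XTF code):

    unsigned int data_sel = 0x10;              /* some GDT data selector */
    unsigned int eax = 0, ebx, ecx, edx;

    clear_accessed_bit(pte_mapping_gdt());     /* placeholder helper */
    asm volatile ("invlpg (%0)" :: "r" (gdt_base) : "memory");

    asm volatile ("mov %0, %%fs" :: "r" (data_sel)); /* implicit GDT read */

    for ( unsigned int i = 0; i < 1000; ++i )
        asm volatile ("nop");
    /* On Fam10h, the PTE's Accessed bit can still read as clear here... */

    asm volatile ("cpuid"
                  : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                  :: "memory");
    /* ...but after a serialising instruction (or an access through the
       newly-loaded %fs) it is reliably observed as set. */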

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation
  2017-03-01 16:22   ` Jan Beulich
@ 2017-03-01 16:33     ` Andrew Cooper
  2017-03-01 16:41       ` Jan Beulich
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Cooper @ 2017-03-01 16:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Tim Deegan, Xen-devel

On 01/03/17 16:22, Jan Beulich wrote:
>>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
>> --- a/xen/arch/x86/mm/hap/guest_walk.c
>> +++ b/xen/arch/x86/mm/hap/guest_walk.c
>> @@ -98,7 +98,7 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
>>      /* Interpret the answer */
>>      if ( missing == 0 )
>>      {
>> -        gfn_t gfn = guest_l1e_get_gfn(gw.l1e);
>> +        gfn_t gfn = guest_walk_to_gfn(&gw);
>>          struct page_info *page;
>>          page = get_page_from_gfn_p2m(p2m->domain, p2m, gfn_x(gfn), &p2mt,
>>                                       NULL, P2M_ALLOC | P2M_UNSHARE);
>> --- a/xen/arch/x86/mm/shadow/multi.c
>> +++ b/xen/arch/x86/mm/shadow/multi.c
>> @@ -3109,7 +3109,7 @@ static int sh_page_fault(struct vcpu *v,
>>      }
>>  
>>      /* What mfn is the guest trying to access? */
>> -    gfn = guest_l1e_get_gfn(gw.l1e);
>> +    gfn = guest_walk_to_gfn(&gw);
>>      gmfn = get_gfn(d, gfn, &p2mt);
>>  
>>      if ( shadow_mode_refcounts(d) &&
> With these adjusted, guest_l1e_get_gfn() uses left are in shadow's
> multi.c (presumably okay to stay)

Yes - they are purposefully looking at the L1 entry.

> and at the end of guest_walk_tables() itself. Is that latter one not also in need of
> conversion, considering the PSE36 reference in the description?

Well - it gets swallowed up in "Re-implement the pagetable walker", but
inside guest_walk_tables() itself, it is reasonable to rely on the
internal details.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation
  2017-03-01 16:33     ` Andrew Cooper
@ 2017-03-01 16:41       ` Jan Beulich
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Beulich @ 2017-03-01 16:41 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel

>>> On 01.03.17 at 17:33, <andrew.cooper3@citrix.com> wrote:
> On 01/03/17 16:22, Jan Beulich wrote:
>>>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
>>> --- a/xen/arch/x86/mm/hap/guest_walk.c
>>> +++ b/xen/arch/x86/mm/hap/guest_walk.c
>>> @@ -98,7 +98,7 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
>>>      /* Interpret the answer */
>>>      if ( missing == 0 )
>>>      {
>>> -        gfn_t gfn = guest_l1e_get_gfn(gw.l1e);
>>> +        gfn_t gfn = guest_walk_to_gfn(&gw);
>>>          struct page_info *page;
>>>          page = get_page_from_gfn_p2m(p2m->domain, p2m, gfn_x(gfn), &p2mt,
>>>                                       NULL, P2M_ALLOC | P2M_UNSHARE);
>>> --- a/xen/arch/x86/mm/shadow/multi.c
>>> +++ b/xen/arch/x86/mm/shadow/multi.c
>>> @@ -3109,7 +3109,7 @@ static int sh_page_fault(struct vcpu *v,
>>>      }
>>>  
>>>      /* What mfn is the guest trying to access? */
>>> -    gfn = guest_l1e_get_gfn(gw.l1e);
>>> +    gfn = guest_walk_to_gfn(&gw);
>>>      gmfn = get_gfn(d, gfn, &p2mt);
>>>  
>>>      if ( shadow_mode_refcounts(d) &&
>> With these adjusted, guest_l1e_get_gfn() uses left are in shadow's
>> multi.c (presumably okay to stay)
> 
> Yes - they are purposefully looking at the L1 entry.
> 
>> and at the end of guest_walk_tables() itself. Is that latter one not also in need of
>> conversion, considering the PSE36 reference in the description?
> 
> Well - it gets swallowed up in "Re-implement the pagetable walker", but
> inside guest_walk_tables() itself, it is reasonable to rely on the
> internal details.

Okay, if it's left alone intentionally, then
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker
  2017-02-27 14:03 ` [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker Andrew Cooper
@ 2017-03-02 11:52   ` Jan Beulich
  2017-03-02 12:00     ` Andrew Cooper
  2017-03-02 16:16   ` Tim Deegan
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 52+ messages in thread
From: Jan Beulich @ 2017-03-02 11:52 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel

>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
> @@ -91,12 +91,12 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
>  #if GUEST_PAGING_LEVELS == 3
>      top_map += (cr3 & ~(PAGE_MASK | 31));
>  #endif
> -    missing = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map);
> +    walk_ok = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map);

Since afaict pfec doesn't really point to an array, could I talk you into
adjusting the uses you touch anyway to become *pfec?

> --- a/xen/arch/x86/mm/hap/nested_ept.c
> +++ b/xen/arch/x86/mm/hap/nested_ept.c
> @@ -208,7 +208,7 @@ nept_walk_tables(struct vcpu *v, unsigned long l2ga, ept_walk_t *gw)
>      goto out;
>  
>  map_err:
> -    if ( rc == _PAGE_PAGED )
> +    if ( rc == PFEC_page_paged )

While orthogonal to the patch here, is == (instead of &) really the
right check here?

> @@ -3737,18 +3737,9 @@ sh_gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
>          return vtlb_gfn;
>  #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
>  
> -    if ( (missing = sh_walk_guest_tables(v, va, &gw, pfec[0])) != 0 )
> +    if ( !(walk_ok = sh_walk_guest_tables(v, va, &gw, pfec[0])) )

*pfec again (also further down)?

> --- a/xen/include/asm-x86/guest_pt.h
> +++ b/xen/include/asm-x86/guest_pt.h
> @@ -236,19 +236,26 @@ static inline bool guest_supports_nx(const struct vcpu *v)
>      return hvm_nx_enabled(v);
>  }
>  
> +static inline bool guest_wp_enabled(const struct vcpu *v)
> +{
> +    /* PV guests can't control CR0.WP, and it is unconditionally set by Xen. */
> +    return is_pv_vcpu(v) || hvm_wp_enabled(v);
> +}
>  
> -/*
> - * Some bits are invalid in any pagetable entry.
> - * Normal flags values get represented in 24-bit values (see
> - * get_pte_flags() and put_pte_flags()), so set bit 24 in
> - * addition to be able to flag out of range frame numbers.
> - */
> -#if GUEST_PAGING_LEVELS == 3
> -#define _PAGE_INVALID_BITS \
> -    (_PAGE_INVALID_BIT | get_pte_flags(((1ull << 63) - 1) & ~(PAGE_SIZE - 1)))
> -#else /* 2-level and 4-level */
> -#define _PAGE_INVALID_BITS _PAGE_INVALID_BIT
> -#endif
> +static inline bool guest_smep_enabled(const struct vcpu *v)
> +{
> +    return is_pv_vcpu(v) ? 0 : hvm_smep_enabled(v);
> +}
> +
> +static inline bool guest_smap_enabled(const struct vcpu *v)
> +{
> +    return is_pv_vcpu(v) ? 0 : hvm_smap_enabled(v);
> +}
> +
> +static inline bool guest_pku_enabled(const struct vcpu *v)
> +{
> +    return is_pv_vcpu(v) ? 0 : hvm_pku_enabled(v);
> +}

All three "!is_pv_vcpu(v) && hvm_..._enabled(v)" ? Or otherwise
please use false.

It would be nice if you could adjust the one _eflags use to eflags,
avoiding the need for a follow-up patch.

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker
  2017-03-02 11:52   ` Jan Beulich
@ 2017-03-02 12:00     ` Andrew Cooper
  2017-03-02 12:54       ` Jan Beulich
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Cooper @ 2017-03-02 12:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Tim Deegan, Xen-devel

On 02/03/17 11:52, Jan Beulich wrote:
>>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
>> @@ -91,12 +91,12 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
>>  #if GUEST_PAGING_LEVELS == 3
>>      top_map += (cr3 & ~(PAGE_MASK | 31));
>>  #endif
>> -    missing = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map);
>> +    walk_ok = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map);
> Since afaict pfec doesn't really point to an array, could I talk you into
> adjusting the uses you touch anyway to become *pfec?

I did find the use of an array annoying, but left it as it was.

Any objection if I do a general cleanup of that as a prerequisite patch,
to avoid introducing semi-unrelated changes into this patch?

>
>> --- a/xen/arch/x86/mm/hap/nested_ept.c
>> +++ b/xen/arch/x86/mm/hap/nested_ept.c
>> @@ -208,7 +208,7 @@ nept_walk_tables(struct vcpu *v, unsigned long l2ga, ept_walk_t *gw)
>>      goto out;
>>  
>>  map_err:
>> -    if ( rc == _PAGE_PAGED )
>> +    if ( rc == PFEC_page_paged )
> While orthogonal to the patch here, is == (instead of &) really the
> right check here?

In practice, if PFEC_page_paged is set, it is the only bit set.

But I agree - conceptually it should be &, and it will need to be once I
fix nested pagetable walking.
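
I.e. at that point the check presumably wants to end up as something
like:

    map_err:
        if ( rc & PFEC_page_paged )
        {
            ret = EPT_TRANSLATE_RETRY;
            goto out;
        }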

>
>> @@ -3737,18 +3737,9 @@ sh_gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
>>          return vtlb_gfn;
>>  #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
>>  
>> -    if ( (missing = sh_walk_guest_tables(v, va, &gw, pfec[0])) != 0 )
>> +    if ( !(walk_ok = sh_walk_guest_tables(v, va, &gw, pfec[0])) )
> *pfec again (also further down)?
>
>> --- a/xen/include/asm-x86/guest_pt.h
>> +++ b/xen/include/asm-x86/guest_pt.h
>> @@ -236,19 +236,26 @@ static inline bool guest_supports_nx(const struct vcpu *v)
>>      return hvm_nx_enabled(v);
>>  }
>>  
>> +static inline bool guest_wp_enabled(const struct vcpu *v)
>> +{
>> +    /* PV guests can't control CR0.WP, and it is unconditionally set by Xen. */
>> +    return is_pv_vcpu(v) || hvm_wp_enabled(v);
>> +}
>>  
>> -/*
>> - * Some bits are invalid in any pagetable entry.
>> - * Normal flags values get represented in 24-bit values (see
>> - * get_pte_flags() and put_pte_flags()), so set bit 24 in
>> - * addition to be able to flag out of range frame numbers.
>> - */
>> -#if GUEST_PAGING_LEVELS == 3
>> -#define _PAGE_INVALID_BITS \
>> -    (_PAGE_INVALID_BIT | get_pte_flags(((1ull << 63) - 1) & ~(PAGE_SIZE - 1)))
>> -#else /* 2-level and 4-level */
>> -#define _PAGE_INVALID_BITS _PAGE_INVALID_BIT
>> -#endif
>> +static inline bool guest_smep_enabled(const struct vcpu *v)
>> +{
>> +    return is_pv_vcpu(v) ? 0 : hvm_smep_enabled(v);
>> +}
>> +
>> +static inline bool guest_smap_enabled(const struct vcpu *v)
>> +{
>> +    return is_pv_vcpu(v) ? 0 : hvm_smap_enabled(v);
>> +}
>> +
>> +static inline bool guest_pku_enabled(const struct vcpu *v)
>> +{
>> +    return is_pv_vcpu(v) ? 0 : hvm_pku_enabled(v);
>> +}
> All three "!is_pv_vcpu(v) && hvm_..._enabled(v)" ? Or otherwise
> please use false.

Ok.

>
> It would be nice if you could adjust the one _eflags use to eflags,
> avoiding the need for a follow-up patch.

Will do.

> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks,

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling
  2017-03-01 15:57   ` Jan Beulich
@ 2017-03-02 12:23     ` Andrew Cooper
  0 siblings, 0 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-03-02 12:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Tim Deegan, Xen-devel

On 01/03/17 15:57, Jan Beulich wrote:
>>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
>> -static inline int
>> -guest_supports_1G_superpages(struct vcpu *v)
>> +static inline bool guest_has_pse36(const struct vcpu *v)
>> +{
>> +     /* No support for 2-level PV guests. */
>> +    return is_pv_vcpu(v) ? 0 : paging_mode_hap(v->domain);
> Considering the return type ITYM "false" instead of "0" here.

Yes, sorry.  (This series was what caused me to introduce bool to the
hypervisor in the first place, and I clearly haven't rebased it forwards
properly.)

> But then why use a conditional expression here at all?
>
>     return !is_pv_vcpu(v) && paging_mode_hap(v->domain);

Can do.

> Furthermore there's no point in passing in a struct vcpu here -
> both checked constructs are per-domain, and passing in
> struct domain is also a better fit with the guest_ prefix of the
> function name.

At the moment, all the guest_* functions consistently take a vcpu.

However, they are inconsistent with other terms, and some of the
static-configuration properties probably should take a domain.

I will split out this cleanup into an earlier patch, adjusting some
naming and parameters.

>
>> +static inline bool guest_l2e_rsvd_bits(const struct vcpu *v, guest_l2e_t l2e)
>> +{
>> +    uint64_t rsvd_bits = guest_rsvd_bits(v);
>> +
>> +    return ((l2e.l2 & (rsvd_bits | GUEST_L2_PAGETABLE_RSVD |
>> +                       (guest_supports_superpages(v) ? 0 : _PAGE_PSE))) ||
>> +            ((l2e.l2 & _PAGE_PSE) &&
>> +             (l2e.l2 & ((GUEST_PAGING_LEVELS == 2 && guest_has_pse36(v))
>> +                        ? (fold_pse36(rsvd_bits | (1ULL << 40)))
> While one can look it up in the doc, I strongly think this 40 needs a
> comment.

Thinking about it, having GUEST_L2_PSE36_RSVD is probably a good move.
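
I.e. something along the lines of (the name and placement are only a
sketch; the value is the one currently open-coded above):

    /* PSE36 superpages top out at a 40-bit physical address. */
    #define GUEST_L2_PSE36_RSVD  (1ULL << 40)

with the check then reading fold_pse36(rsvd_bits | GUEST_L2_PSE36_RSVD).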

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers
  2017-03-01 16:03   ` Jan Beulich
@ 2017-03-02 12:26     ` Andrew Cooper
  2017-03-02 12:51       ` Jan Beulich
  2017-03-06  9:26       ` Tim Deegan
  0 siblings, 2 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-03-02 12:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Tim Deegan, Xen-devel

On 01/03/17 16:03, Jan Beulich wrote:
>>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
>> The shadow logic should never create a shadow of a guest PTE which contains
>> reserved bits from the guests point of view.  Such a shadowed entry might not
>> cause #PF[RSVD] when walked by hardware, thus won't behave architecturally
>> from the guests point of view.
> But are we already or-ing in the RSVD bit accordingly in such cases,
> before handing the #PF to the guest? The patch here certainly
> doesn't make any change towards that, afaics.

The purpose of this patch is to ensure we never create a shadow which
risks causing hardware to generate #PF[RSVD] when running on the
shadows, other than the one deliberate case (MMIO fastpath).

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers
  2017-03-02 12:26     ` Andrew Cooper
@ 2017-03-02 12:51       ` Jan Beulich
  2017-03-02 12:56         ` Andrew Cooper
  2017-03-06  9:26       ` Tim Deegan
  1 sibling, 1 reply; 52+ messages in thread
From: Jan Beulich @ 2017-03-02 12:51 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel

>>> On 02.03.17 at 13:26, <andrew.cooper3@citrix.com> wrote:
> On 01/03/17 16:03, Jan Beulich wrote:
>>>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
>>> The shadow logic should never create a shadow of a guest PTE which contains
>>> reserved bits from the guests point of view.  Such a shadowed entry might not
>>> cause #PF[RSVD] when walked by hardware, thus won't behave architecturally
>>> from the guests point of view.
>> But are we already or-ing in the RSVD bit accordingly in such cases,
>> before handing the #PF to the guest? The patch here certainly
>> doesn't make any change towards that, afaics.
> 
> The purpose of this patch is to ensure we never create a shadow which
> risks causing hardware to generate #PF[RSVD] when running on the
> shadows, other than the one deliberate case (MMIO fastpath).

Right, but instead of answering my question this emphasizes the
need for an answer, as what you say basically means we'd never
(except for that one special case) see the RSVD bit set when
getting #PF handed by hardware, yet when forwarding such a fault to the
guest we then need to set that bit ourselves.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker
  2017-03-02 12:00     ` Andrew Cooper
@ 2017-03-02 12:54       ` Jan Beulich
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Beulich @ 2017-03-02 12:54 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel

>>> On 02.03.17 at 13:00, <andrew.cooper3@citrix.com> wrote:
> On 02/03/17 11:52, Jan Beulich wrote:
>>>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
>>> @@ -91,12 +91,12 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
>>>  #if GUEST_PAGING_LEVELS == 3
>>>      top_map += (cr3 & ~(PAGE_MASK | 31));
>>>  #endif
>>> -    missing = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map);
>>> +    walk_ok = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map);
>> Since afaict pfec doesn't really point to an array, could I talk you into
>> adjusting the uses you touch anyway to become *pfec?
> 
> I did find the use of an array annoying, but left it as-was.
> 
> Any objection if I do a general cleanup of that as a prerequisite patch,
> to avoid introducing semi-unrelated changes into this patch?

Certainly not.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers
  2017-03-02 12:51       ` Jan Beulich
@ 2017-03-02 12:56         ` Andrew Cooper
  2017-03-02 13:19           ` Jan Beulich
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Cooper @ 2017-03-02 12:56 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Tim Deegan, Xen-devel

On 02/03/17 12:51, Jan Beulich wrote:
>>>> On 02.03.17 at 13:26, <andrew.cooper3@citrix.com> wrote:
>> On 01/03/17 16:03, Jan Beulich wrote:
>>>>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
>>>> The shadow logic should never create a shadow of a guest PTE which contains
>>>> reserved bits from the guests point of view.  Such a shadowed entry might not
>>>> cause #PF[RSVD] when walked by hardware, thus won't behave architecturally
>>>> from the guests point of view.
>>> But are we already or-ing in the RSVD bit accordingly in such cases,
>>> before handing the #PF to the guest? The patch here certainly
>>> doesn't make any change towards that, afaics.
>> The purpose of this patch is to ensure we never create a shadow which
>> risks causing hardware to generate #PF[RSVD] when running on the
>> shadows, other than the one deliberate case (MMIO fastpath).
> Right, but instead of answering my question this emphasizes the
> need for an answer, as what you say basically means we'd never
> (except for that one special case) see the RSVD bit set when
> getting #PF handed by hardware, yet for forwarding to the guest
> we need to set that bit then in such cases.

This is intentional.

We hand #PF[RSVD] back to the guest based on walking the guest
pagetables, rather than what we find from hardware walking the shadows
we create.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers
  2017-03-02 12:56         ` Andrew Cooper
@ 2017-03-02 13:19           ` Jan Beulich
  2017-03-02 14:32             ` Andrew Cooper
  0 siblings, 1 reply; 52+ messages in thread
From: Jan Beulich @ 2017-03-02 13:19 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel

>>> On 02.03.17 at 13:56, <andrew.cooper3@citrix.com> wrote:
> On 02/03/17 12:51, Jan Beulich wrote:
>>>>> On 02.03.17 at 13:26, <andrew.cooper3@citrix.com> wrote:
>>> On 01/03/17 16:03, Jan Beulich wrote:
>>>>>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
>>>>> The shadow logic should never create a shadow of a guest PTE which contains
>>>>> reserved bits from the guests point of view.  Such a shadowed entry might not
>>>>> cause #PF[RSVD] when walked by hardware, thus won't behave architecturally
>>>>> from the guests point of view.
>>>> But are we already or-ing in the RSVD bit accordingly in such cases,
>>>> before handing the #PF to the guest? The patch here certainly
>>>> doesn't make any change towards that, afaics.
>>> The purpose of this patch is to ensure we never create a shadow which
>>> risks causing hardware to generate #PF[RSVD] when running on the
>>> shadows, other than the one deliberate case (MMIO fastpath).
>> Right, but instead of answering my question this emphasizes the
>> need for an answer, as what you say basically means we'd never
>> (except for that one special case) see the RSVD bit set when
>> getting #PF handed by hardware, yet for forwarding to the guest
>> we need to set that bit then in such cases.
> 
> This is intentional.
> 
> We hand #PF[RSVD] back to the guest based on walking the guest
> pagetables, rather than what we find from hardware walking the shadows
> we create.

Well, is that (always) the case here already, or only after patch 7? My
question after all is whether this works as intended at this point.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling
  2017-02-27 14:03 ` [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling Andrew Cooper
  2017-03-01 15:57   ` Jan Beulich
@ 2017-03-02 14:12   ` Tim Deegan
  2017-03-02 14:17     ` Andrew Cooper
  2017-03-02 16:15   ` Tim Deegan
  2 siblings, 1 reply; 52+ messages in thread
From: Tim Deegan @ 2017-03-02 14:12 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Jan Beulich, Xen-devel

At 14:03 +0000 on 27 Feb (1488204194), Andrew Cooper wrote:
> +static inline bool guest_has_pse36(const struct vcpu *v)
> +{
> +     /* No support for 2-level PV guests. */
> +    return is_pv_vcpu(v) ? 0 : paging_mode_hap(v->domain);
> +}

Should this check the CPUID policy to see whether PSE36 is supported?
There's no way to disable it in a HAP guest anyway, so maybe not, for
consistency?

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling
  2017-03-02 14:12   ` Tim Deegan
@ 2017-03-02 14:17     ` Andrew Cooper
  2017-03-02 15:09       ` Tim Deegan
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Cooper @ 2017-03-02 14:17 UTC (permalink / raw)
  To: Tim Deegan; +Cc: George Dunlap, Jan Beulich, Xen-devel

On 02/03/17 14:12, Tim Deegan wrote:
> At 14:03 +0000 on 27 Feb (1488204194), Andrew Cooper wrote:
>> +static inline bool guest_has_pse36(const struct vcpu *v)
>> +{
>> +     /* No support for 2-level PV guests. */
>> +    return is_pv_vcpu(v) ? 0 : paging_mode_hap(v->domain);
>> +}
> Should this check the CPUID policy to see whether PSE36 is supported?
> There's no way to disable it in a HAP guest anyway, so maybe not, for
> consistency?

Hmm - perhaps I need to rethink this slightly.

CR4.PSE controls this part of the pagewalk, which we can control with
CPUID checks.  However, if CR4.PSE is set, we cannot hide hardware's
preference of PSE36 from the guest, and in reality it will always be
present.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers
  2017-03-02 13:19           ` Jan Beulich
@ 2017-03-02 14:32             ` Andrew Cooper
  0 siblings, 0 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-03-02 14:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Tim Deegan, Xen-devel

On 02/03/17 13:19, Jan Beulich wrote:
>>>> On 02.03.17 at 13:56, <andrew.cooper3@citrix.com> wrote:
>> On 02/03/17 12:51, Jan Beulich wrote:
>>>>>> On 02.03.17 at 13:26, <andrew.cooper3@citrix.com> wrote:
>>>> On 01/03/17 16:03, Jan Beulich wrote:
>>>>>>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
>>>>>> The shadow logic should never create a shadow of a guest PTE which contains
>>>>>> reserved bits from the guests point of view.  Such a shadowed entry might not
>>>>>> cause #PF[RSVD] when walked by hardware, thus won't behave architecturally
>>>>>> from the guests point of view.
>>>>> But are we already or-ing in the RSVD bit accordingly in such cases,
>>>>> before handing the #PF to the guest? The patch here certainly
>>>>> doesn't make any change towards that, afaics.
>>>> The purpose of this patch is to ensure we never create a shadow which
>>>> risks causing hardware to generate #PF[RSVD] when running on the
>>>> shadows, other than the one deliberate case (MMIO fastpath).
>>> Right, but instead of answering my question this emphasizes the
>>> need for an answer, as what you say basically means we'd never
>>> (except for that one special case) see the RSVD bit set when
>>> getting #PF handed by hardware, yet for forwarding to the guest
>>> we need to set that bit then in such cases.
>> This is intentional.
>>
>> We hand #PF[RSVD] back to the guest based on walking the guest
>> pagetables, rather than what we find from hardware walking the shadows
>> we create.
> Well, is that (always) the case here already, or only after patch 7? My
> question after all is whether this works as intended at this point.

I am probably going to defer to Tim on the historical details here.

Most paths to creating a new shadow follow a walk of the guest
pagetables, and reserved guest entries fall out there without creating
shadows.

However, there are definitely paths to creating shadows which don't pass
through a guest walk (probably the adjacent prefetches), and those paths
need protection from creating bad shadows.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers
  2017-02-27 14:03 ` [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers Andrew Cooper
  2017-03-01 16:03   ` Jan Beulich
@ 2017-03-02 14:33   ` Tim Deegan
  1 sibling, 0 replies; 52+ messages in thread
From: Tim Deegan @ 2017-03-02 14:33 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Jan Beulich, Xen-devel

At 14:03 +0000 on 27 Feb (1488204196), Andrew Cooper wrote:
> The shadow logic should never create a shadow of a guest PTE which contains

Nit: a _valid/present_ shadow.

> reserved bits from the guests point of view.  Such a shadowed entry might not
> cause #PF[RSVD] when walked by hardware, thus won't behave architecturally
> from the guests point of view.

The shadow doesn't have to cause #PF(rsvd), any pagefault will do, so
long as the shadow pagefault handler turns it into rsvd.  This patch
uses shadow_l1e_empty(), which doesn't cause #PF(rsvd), and is fine.
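
I.e., conceptually (not the literal code paths):

    /* A guest PTE with reserved bits simply gets no valid shadow... */
    if ( guest_l1e_rsvd_bits(v, gl1e) )
        *sl1e = shadow_l1e_empty();   /* any resulting pagefault will do */

    /* ...and sh_page_fault(), finding the reserved bits when it re-walks
     * the guest tables, synthesises the error code for the guest: */
    if ( rc & _PAGE_INVALID_BITS )
        regs->error_code |= PFEC_reserved_bit;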

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling
  2017-03-02 14:17     ` Andrew Cooper
@ 2017-03-02 15:09       ` Tim Deegan
  2017-03-02 15:14         ` Andrew Cooper
  0 siblings, 1 reply; 52+ messages in thread
From: Tim Deegan @ 2017-03-02 15:09 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Jan Beulich, Xen-devel

At 14:17 +0000 on 02 Mar (1488464221), Andrew Cooper wrote:
> On 02/03/17 14:12, Tim Deegan wrote:
> > At 14:03 +0000 on 27 Feb (1488204194), Andrew Cooper wrote:
> >> +static inline bool guest_has_pse36(const struct vcpu *v)
> >> +{
> >> +     /* No support for 2-level PV guests. */
> >> +    return is_pv_vcpu(v) ? 0 : paging_mode_hap(v->domain);
> >> +}
> > Should this check the CPUID policy to see whether PSE36 is supported?
> > There's no way to disable it in a HAP guest anyway, so maybe not, for
> > consistency?
> 
> Hmm - perhaps I need to rethink this slightly.
> 
> CR4.PSE controls this part of the pagewalk, which we can control with
> CPUID checks.  However, if CR4.PSE it set, we cannot hide hardware’s
> preference of PSE36 from the guest, and in reality it will always be
> present.

Yeah, I don't think there's anything you can do here.  You can hide
PSE36 from CPUID but a guest that _relies_ on PSE36 _not_ being supported
is going to fail on HAP.

I guess you could force PSE36 to be present in CPUID for HAP guests
and absent for PV and shadowed guests...

In any case I think this patch is correct as far as it goes.

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling
  2017-03-02 15:09       ` Tim Deegan
@ 2017-03-02 15:14         ` Andrew Cooper
  0 siblings, 0 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-03-02 15:14 UTC (permalink / raw)
  To: Tim Deegan; +Cc: George Dunlap, Jan Beulich, Xen-devel

On 02/03/17 15:09, Tim Deegan wrote:
> At 14:17 +0000 on 02 Mar (1488464221), Andrew Cooper wrote:
>> On 02/03/17 14:12, Tim Deegan wrote:
>>> At 14:03 +0000 on 27 Feb (1488204194), Andrew Cooper wrote:
>>>> +static inline bool guest_has_pse36(const struct vcpu *v)
>>>> +{
>>>> +     /* No support for 2-level PV guests. */
>>>> +    return is_pv_vcpu(v) ? 0 : paging_mode_hap(v->domain);
>>>> +}
>>> Should this check the CPUID policy to see whether PSE36 is supported?
>>> There's no way to disable it in a HAP guest anyway, so maybe not, for
>>> consistency?
>> Hmm - perhaps I need to rethink this slightly.
>>
>> CR4.PSE controls this part of the pagewalk, which we can control with
>> CPUID checks.  However, if CR4.PSE is set, we cannot hide hardware’s
>> preference of PSE36 from the guest, and in reality it will always be
>> present.
> Yeah, I don't think there's anything you can do here.  You can hide
> PSE36 from CPUID but a guest that _relies_ on PSE36 _not_ being supported
> is going to fail on HAP.
>
> I guess you could force PSE36 to be present in CPUID for HAP guests
> and absent for PV and shadowed guests...

We do this anyway, modulo the fudge factor to make HyperV not BSOD on
boot (which it turns out is now XTF's backdoor way to evaluate whether
it is running on HAP or Shadow, so it can avoid the livelock referenced
in patch 2.)

> In any case I think this patch is correct as far as it goes.

I am going to litter some more comments throughout, because some of
these details are far too subtle to be left unexplained.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses
  2017-02-27 14:03 ` [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses Andrew Cooper
  2017-03-01 15:05   ` Jan Beulich
@ 2017-03-02 16:14   ` Tim Deegan
  2017-03-07 10:46   ` George Dunlap
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Tim Deegan @ 2017-03-02 16:14 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Paul Durrant, Jan Beulich, Xen-devel

At 14:03 +0000 on 27 Feb (1488204192), Andrew Cooper wrote:
> All actions which refer to the active ldt/gdt/idt or task register
> (e.g. loading a new segment selector) are known as implicit supervisor
> accesses, even when the access originates from user code.
> 
> The distinction is necessary in the pagewalk when SMAP is enabled.  Refer to
> Intel SDM Vol 3 "Access Rights" for the exact details.
> 
> Introduce a new pagewalk input, and make use of the new system segment
> references in hvmemul_{read,write}().  While modifying those areas, move the
> calculation of the appropriate pagewalk input before its first use.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Tim Deegan <tim@xen.org>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/7] x86/shadow: Try to correctly identify implicit supervisor accesses
  2017-02-27 14:03 ` [PATCH 2/7] x86/shadow: Try to correctly " Andrew Cooper
  2017-03-01 15:11   ` Jan Beulich
@ 2017-03-02 16:14   ` Tim Deegan
  2017-03-07 11:26   ` George Dunlap
  2 siblings, 0 replies; 52+ messages in thread
From: Tim Deegan @ 2017-03-02 16:14 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Jan Beulich, Xen-devel

At 14:03 +0000 on 27 Feb (1488204193), Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Tim Deegan <tim@xen.org>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling
  2017-02-27 14:03 ` [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling Andrew Cooper
  2017-03-01 15:57   ` Jan Beulich
  2017-03-02 14:12   ` Tim Deegan
@ 2017-03-02 16:15   ` Tim Deegan
  2 siblings, 0 replies; 52+ messages in thread
From: Tim Deegan @ 2017-03-02 16:15 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Jan Beulich, Xen-devel

At 14:03 +0000 on 27 Feb (1488204194), Andrew Cooper wrote:
> Some bits are unconditionally reserved in pagetable entries, or reserved
> because of alignment restrictions.  Other bits are reserved because of control
> register configuration.
> 
> Introduce helpers which take an individual vcpu and guest pagetable entry, and
> calculates whether any reserved bits are set.
> 
> While here, add a couple of newlines to aid readability, drop some trailing
> whitespace and bool/const correct the existing helpers to allow the new
> helpers to take const vcpu pointers.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Tim Deegan <tim@xen.org>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation
  2017-02-27 14:03 ` [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation Andrew Cooper
  2017-03-01 16:22   ` Jan Beulich
@ 2017-03-02 16:15   ` Tim Deegan
  2017-03-06 18:25   ` George Dunlap
  2 siblings, 0 replies; 52+ messages in thread
From: Tim Deegan @ 2017-03-02 16:15 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Jan Beulich, Xen-devel

At 14:03 +0000 on 27 Feb (1488204197), Andrew Cooper wrote:
> hap_p2m_ga_to_gfn() and sh_page_fault() currently use guest_l1e_get_gfn() to
> obtain the translation of a pagewalk.  This is conceptually wrong (the
> semantics of gw.l1e is an internal detail), and will actually be wrong when
> PSE36 superpage support is fixed.  Switch them to using guest_walk_to_gfn().
> 
> Take the opportunity to const-correct the walk_t parameter of the
> guest_walk_to_*() helpers, and implement guest_walk_to_gpa() in terms of
> guest_walk_to_gfn() to avoid duplicating the actual translation calculation.
> 
> While editing guest_walk_to_gpa(), fix a latent bug by causing it to return
> INVALID_PADDR rather than 0 for a failed translation, as 0 is also a valid
> successful result.  The sole caller, sh_page_fault(), has already confirmed
> that the translation is valid, so this doesn't cause a behavioural change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Tim Deegan <tim@xen.org>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker
  2017-02-27 14:03 ` [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker Andrew Cooper
  2017-03-02 11:52   ` Jan Beulich
@ 2017-03-02 16:16   ` Tim Deegan
  2017-03-06 18:28   ` George Dunlap
  2017-03-07 12:57   ` George Dunlap
  3 siblings, 0 replies; 52+ messages in thread
From: Tim Deegan @ 2017-03-02 16:16 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Jan Beulich, Xen-devel

At 14:03 +0000 on 27 Feb (1488204198), Andrew Cooper wrote:
> The existing pagetable walker has complicated return semantics, which squeeze
> multiple pieces of information into a single integer.  This would be fine if the
> information didn't overlap, but it does.
> 
> Specifically, _PAGE_INVALID_BITS for 3-level guests alias _PAGE_PAGED and
> _PAGE_SHARED.  A guest which constructs a PTE with bits 52 or 53 set (the
> start of the upper software-available range) will create a virtual address
> which, when walked by Xen, tricks Xen into believing the frame is paged or
> shared.  This behaviour was introduced by XSA-173 (c/s 8b17648).
> 
> It is also complicated to turn rc back into a normal pagefault error code.
> Instead, change the calling semantics to return a boolean indicating success,
> and have the function accumulate a real pagefault error code as it goes
> (including synthetic error codes, which do not alias hardware ones).  This
> requires an equivalent adjustment to map_domain_gfn().
> 
> Issues fixed:
>  * 2-level PSE36 superpages now return the correct translation.
>  * 2-level L2 superpages without CR0.PSE now return the correct translation.
>  * SMEP now inhibits a user instruction fetch even if NX isn't active.
>  * Supervisor writes without CR0.WP now set the leaf dirty bit.
>  * L4e._PAGE_GLOBAL is strictly reserved on AMD.
>  * 3-level l3 entries have all reserved bits checked.
>  * 3-level entries can no longer alias Xen's idea of paged or shared.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Tim Deegan <tim@xen.org>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers
  2017-03-02 12:26     ` Andrew Cooper
  2017-03-02 12:51       ` Jan Beulich
@ 2017-03-06  9:26       ` Tim Deegan
  1 sibling, 0 replies; 52+ messages in thread
From: Tim Deegan @ 2017-03-06  9:26 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Jan Beulich, Xen-devel

At 12:26 +0000 on 02 Mar (1488457613), Andrew Cooper wrote:
> On 01/03/17 16:03, Jan Beulich wrote:
> >>>> On 27.02.17 at 15:03, <andrew.cooper3@citrix.com> wrote:
> >> The shadow logic should never create a shadow of a guest PTE which contains
> >> reserved bits from the guest's point of view.  Such a shadowed entry might not
> >> cause #PF[RSVD] when walked by hardware, thus won't behave architecturally
> >> from the guest's point of view.
> > But are we already or-ing in the RSVD bit accordingly in such cases,
> > before handing the #PF to the guest? The patch here certainly
> > doesn't make any change towards that, afaics.
> 
> The purpose of this patch is to ensure we never create a shadow which
> risks causing hardware to generate #PF[RSVD] when running on the
> shadows, other than the one deliberate case (MMIO fastpath).

Confusion! AIUI:

 - Shadows installed on demand in the pagefault handler are already
   correct.  If the guest PTE contained invalid bits we'd have injected
   a fault instead of shadowing it.

 - There is no risk of accidentally installing a shadow with reserved
   bits in it even if the guest pte has reserved bits in it.
   _sh_propagate() sanity-checks the flags, and the address bits
   come from the MFN (IOW we'd need a buggy p2m entry).  If that were
   a risk, I don't think this patch would solve it.

 - The potential bug that this patch tries to fix is:
   1. Guest writes a PTE with reserved bits in it.
   2. That gets shadowed by a write-to-pagetable path or a prefetch.
   3. The shadow is a valid PTE, so the guest gets no #PF, instead
      of #PF(rsvd).

Now by the same logic I used above there's probably no path
where a reserved _address_ bit causes a problem, but I see no harm
in centralising the logic and using the same code for these
paths as for the pt walker.

In answering this, I've spotted that the calls to
l1e_propagate_from_guest() in sh_resync_l1() and sh_prefetch()
aren't updated in this patch and should be. 
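
For illustration, the shape of that call-site adjustment would be
roughly as follows (helper name hypothetical, argument list approximate):

    /* In sh_resync_l1() / sh_prefetch(), before re-propagating a guest l1e: */
    if ( guest_l1e_rsvd_bits_set(v, gl1e) )
        sl1e = shadow_l1e_empty();
    else
        l1e_propagate_from_guest(v, gl1e, gmfn, &sl1e, ft_prefetch, p2mt);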

Cheers,

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC XTF PATCH] Pagetable Emulation testing
  2017-02-27 14:03 [PATCH 0/7] Fixes to pagetable handling Andrew Cooper
                   ` (7 preceding siblings ...)
  2017-03-01 16:24 ` [PATCH 0/7] Fixes to pagetable handling Jan Beulich
@ 2017-03-06 16:42 ` Andrew Cooper
  2017-03-13 15:45   ` Jan Beulich
  8 siblings, 1 reply; 52+ messages in thread
From: Andrew Cooper @ 2017-03-06 16:42 UTC (permalink / raw)
  To: Xen-devel; +Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tests/pagetable-emulation/Makefile |   11 +
 tests/pagetable-emulation/main.c   | 1119 ++++++++++++++++++++++++++++++++++++
 tests/pagetable-emulation/stubs.S  |   94 +++
 tests/pagetable-emulation/stubs.h  |   25 +
 4 files changed, 1249 insertions(+)
 create mode 100644 tests/pagetable-emulation/Makefile
 create mode 100644 tests/pagetable-emulation/main.c
 create mode 100644 tests/pagetable-emulation/stubs.S
 create mode 100644 tests/pagetable-emulation/stubs.h

diff --git a/tests/pagetable-emulation/Makefile b/tests/pagetable-emulation/Makefile
new file mode 100644
index 0000000..63f232a
--- /dev/null
+++ b/tests/pagetable-emulation/Makefile
@@ -0,0 +1,11 @@
+include $(ROOT)/build/common.mk
+
+NAME      := pagetable-emulation
+CATEGORY  := functional
+TEST-ENVS := hvm32pse hvm32pae hvm64
+
+VARY-CFG  := hap shadow
+
+obj-perenv += stubs.o main.o
+
+include $(ROOT)/build/gen.mk
diff --git a/tests/pagetable-emulation/main.c b/tests/pagetable-emulation/main.c
new file mode 100644
index 0000000..98f69e0
--- /dev/null
+++ b/tests/pagetable-emulation/main.c
@@ -0,0 +1,1119 @@
+/**
+ * @file tests/pagetable-emulation/main.c
+ * @ref test-pagetable-emulation - TODO.
+ *
+ * @page test-pagetable-emulation TODO
+ *
+ * @sa tests/pagetable-emulation/main.c
+ */
+#include <xtf.h>
+
+#include <arch/decode.h>
+#include <arch/exinfo.h>
+#include <arch/idt.h>
+#include <arch/msr-index.h>
+#include <arch/pagetable.h>
+#include <arch/processor.h>
+#include <arch/symbolic-const.h>
+
+#include "stubs.h"
+
+const char test_title[] = "Test pagetable-emulation";
+
+intpte_t l1t[PAGE_SIZE / sizeof(intpte_t)] __aligned(PAGE_SIZE);
+intpte_t l2t[PAGE_SIZE / sizeof(intpte_t)] __aligned(PAGE_SIZE);
+
+#if CONFIG_PAGING_LEVELS > 3
+intpte_t l3t[PAGE_SIZE / sizeof(intpte_t)] __aligned(PAGE_SIZE);
+#else
+extern intpte_t l3t[PAGE_SIZE / sizeof(intpte_t)];
+#endif
+
+#define LDT_SEL 0x000F /* Entry 0x8, LDT, RPL3 */
+
+#define PFEC_P X86_PFEC_PRESENT
+#define PFEC_W X86_PFEC_WRITE
+#define PFEC_U X86_PFEC_USER
+#define PFEC_R X86_PFEC_RSVD
+#define PFEC_I X86_PFEC_INSN
+#define PFEC(...) TOK_OR(PFEC_, ##__VA_ARGS__)
+
+uint64_t efer;
+unsigned long cr0, cr4;
+bool host_nx_leaked;
+bool amd_fam10_erratum;
+bool shadow_paging;
+
+static struct {
+    unsigned long va;
+    bool active;
+    bool user;
+    const char *desc;
+
+    intpte_t pteval;
+    bool pte_printed;
+} under_test;
+
+static const struct stubs
+{
+    unsigned long (*read)          (unsigned long va);
+    unsigned long (*implicit)      (unsigned long va);
+    unsigned long (*write)         (unsigned long va);
+    unsigned long (*exec)          (unsigned long va);
+    unsigned long (*read_user)     (unsigned long va);
+    unsigned long (*implicit_user) (unsigned long va);
+    unsigned long (*write_user)    (unsigned long va);
+    unsigned long (*exec_user)     (unsigned long va);
+} regular_stubs = {
+    .read          = stub_read,
+    .implicit      = stub_implicit,
+    .write         = stub_write,
+    .exec          = stub_exec,
+    .read_user     = stub_read_user,
+    .implicit_user = stub_implicit_user,
+    .write_user    = stub_write_user,
+    .exec_user     = stub_exec_user,
+}, force_stubs = {
+    .read          = stub_force_read,
+    .implicit      = stub_force_implicit,
+    .write         = stub_force_write,
+    .exec          = stub_force_exec,
+    .read_user     = stub_force_read_user,
+    .implicit_user = stub_force_implicit_user,
+    .write_user    = stub_force_write_user,
+    .exec_user     = stub_force_exec_user,
+};
+
+struct mapping_info
+{
+    unsigned int level, order;
+    void *va;
+    intpte_t *pte, *fe_pte;
+    uint64_t paddr;
+
+    union
+    {
+        intpte_t *ptes[4];
+        struct
+        {
+            intpte_t *l1e, *l2e, *l3e, *l4e;
+        };
+    };
+};
+
+void flush_tlb(bool global)
+{
+    write_cr3(read_cr3());
+
+    if ( global && (cr4 & X86_CR4_PGE) )
+    {
+        write_cr4(cr4 & ~X86_CR4_PGE);
+        write_cr4(cr4);
+    }
+}
+
+bool ex_check_pf(struct cpu_regs *regs,
+                 const struct extable_entry *ex)
+{
+    if ( regs->entry_vector == X86_EXC_PF )
+    {
+        unsigned long cr2 = read_cr2();
+
+        if ( (cr2 != under_test.va) &&
+             (cr2 != under_test.va + (LDT_SEL & ~7)) )
+            xtf_failure("Bad %%cr2: expected %p, got %p\n",
+                        _p(under_test.va), _p(cr2));
+
+        regs->ax = EXINFO_SYM(PF, regs->error_code);
+
+        if ( ex->fixup )
+            regs->ip = ex->fixup;
+        else
+            regs->ip = *(unsigned long *)cpu_regs_sp(regs);
+
+        return true;
+    }
+
+    return false;
+}
+
+void __printf(1, 2) fail(const char *fmt, ...)
+{
+    va_list args;
+
+    if ( !under_test.active )
+        return;
+
+    if ( !under_test.pte_printed )
+    {
+        intpte_t pte = under_test.pteval;
+        printk("  PTE %"PRIpte":%s%s%s%s%s%s%s%s\n", pte,
+               pte & _PAGE_NX ? " Nx" : "",
+               pte & 0x7ff0000000000000ULL ? " Av" : "",
+               pte & ((1ULL << 52) - 1) & ~((1ULL << maxphysaddr) - 1) ? " Rs" : "",
+               pte & _PAGE_GLOBAL ? " G" : "",
+               pte & _PAGE_PSE ? " +" : "",
+               pte & _PAGE_USER ? " U" : "",
+               pte & _PAGE_RW ? " W" : "",
+               pte & _PAGE_PRESENT ? " P" : ""
+            );
+
+        under_test.pte_printed = true;
+    }
+
+    va_start(args, fmt);
+    vprintk(fmt, args);
+    va_end(args);
+    xtf_failure(NULL);
+}
+
+bool unhandled_exception(struct cpu_regs *regs)
+{
+    fail("ERROR: Unhandled exception during %s %s\n",
+         under_test.user ? "User" : "Supervisor",
+         under_test.desc);
+    return false;
+}
+
+static void prepare_mappings(struct mapping_info *m, unsigned int level, bool super, paddr_t paddr)
+{
+    bool pse36 = CONFIG_PAGING_LEVELS == 2 && paddr != (uint32_t)paddr;
+
+    memset(m, 0, sizeof(*m));
+
+#define PAGE_COMMON PF_SYM(AD, U, RW, P)
+    /*
+     * For 4-level paging, we use l4[1/2].
+     */
+    if ( CONFIG_PAGING_LEVELS == 4 )
+    {
+
+        pae_l4_identmap[1] = (unsigned long)l3t | PAGE_COMMON;
+        pae_l4_identmap[2] = (unsigned long)l3t | PAGE_COMMON;
+
+        l3t[0]   = (unsigned long)l2t | PAGE_COMMON;
+        l3t[511] = (unsigned long)l2t | PAGE_COMMON;
+
+        l2t[0]   = (unsigned long)l1t | PAGE_COMMON;
+        l2t[511] = (unsigned long)l1t | PAGE_COMMON;
+
+        l1t[0]   = paddr | PAGE_COMMON;
+        l1t[511] = ((paddr - 1) & ~0xfff) | PAGE_COMMON;
+
+        m->va     = _p(2ULL << PAE_L4_PT_SHIFT);
+        m->l1e    = &l1t[0];
+        m->l2e    = &l2t[0];
+        m->l3e    = &l3t[0];
+        m->l4e    = _p(&pae_l4_identmap[2]);
+        m->fe_pte = &l1t[511];
+
+        asm(_ASM_EXTABLE_HANDLER(2 << PAE_L4_PT_SHIFT, 0, ex_check_pf));
+        under_test.va = (unsigned long)m->va;
+    }
+    else if ( CONFIG_PAGING_LEVELS == 3 )
+    {
+        pae32_l3_identmap[1] = (unsigned long)l2t | _PAGE_PRESENT;
+        pae32_l3_identmap[2] = (unsigned long)l2t | _PAGE_PRESENT;
+
+        l2t[0]   = (unsigned long)l1t | PAGE_COMMON;
+        l2t[511] = (unsigned long)l1t | PAGE_COMMON;
+
+        l1t[0]   = paddr | PAGE_COMMON;
+        l1t[511] = ((paddr - 1) & ~0xfff) | PAGE_COMMON;
+
+        m->va     = _p(2ULL << PAE_L3_PT_SHIFT);
+        m->l1e    = &l1t[0];
+        m->l2e    = &l2t[0];
+        m->l3e    = _p(&pae32_l3_identmap[2]);
+        m->l4e    = NULL;
+        m->fe_pte = &l1t[511];
+
+        asm(_ASM_EXTABLE_HANDLER(2 << PAE_L3_PT_SHIFT, 0, ex_check_pf));
+        under_test.va = (unsigned long)m->va;
+    }
+    else if ( CONFIG_PAGING_LEVELS == 2 )
+    {
+        if ( pse36 )
+        {
+            ASSERT(super);
+            ASSERT(IS_ALIGNED(paddr, MB(4)));
+
+            pse_l2_identmap[511] = fold_pse36((paddr - MB(4)) | PAGE_COMMON | _PAGE_PSE);
+            pse_l2_identmap[512] = fold_pse36(paddr | PAGE_COMMON | _PAGE_PSE);
+        }
+        else
+        {
+            pse_l2_identmap[511] = (unsigned long)l1t | PAGE_COMMON;
+            pse_l2_identmap[512] = (unsigned long)l1t | PAGE_COMMON;
+
+            l1t[0]    = paddr | PAGE_COMMON;
+            l1t[1023] = ((paddr - 1) & ~0xfff) | PAGE_COMMON;
+        }
+
+        m->va     = _p(2ULL << PAE_L3_PT_SHIFT);
+        m->l1e    = pse36 ? NULL : &l1t[0];
+        m->l2e    = _p(&pse_l2_identmap[512]);
+        m->l3e    = NULL;
+        m->l4e    = NULL;
+        m->fe_pte = pse36 ? _p(&pse_l2_identmap[511]) : &l1t[1023];
+
+        asm(_ASM_EXTABLE_HANDLER(2 << PAE_L3_PT_SHIFT, 0, ex_check_pf));
+        under_test.va = (unsigned long)m->va;
+    }
+    else
+        panic("%s() PAGING_LEVELS %u not implemented yet\n",
+              __func__, CONFIG_PAGING_LEVELS);
+
+#undef PAGE_COMMON
+
+    /* Flush the TLB before trying to use the new mappings. */
+    flush_tlb(false);
+
+    /* Put FEP immediately before va, and a ret instruction at va. */
+    memcpy(m->va - 5, "\x0f\x0bxen\xc3", 6);
+    barrier();
+
+    /* Read them back, to confirm that RAM is properly in place. */
+    if ( memcmp(m->va - 5, "\x0f\x0bxen\xc3", 6) )
+        panic("Bad phys or virtual setup\n");
+
+    /* Construct the LDT at va. */
+    user_desc *ldt = m->va;
+
+    ldt[LDT_SEL >> 3] = (typeof(*ldt))INIT_GDTE_SYM(0, 0xfffff, COMMON, DATA, DPL3, B, W);
+    gdt[GDTE_AVAIL0]  = (typeof(*gdt))INIT_GDTE((unsigned long)m->va, PAGE_SIZE, 0x82);
+#if __x86_64__
+    /* For 64bit, put the upper 32 bits of base into the adjacent entry. */
+    gdt[GDTE_AVAIL0 + 1] =
+        (user_desc){{{ .lo = ((unsigned long)m->va) >> 32, .hi = 0 }}};
+#endif
+    lldt(GDTE_AVAIL0 << 3);
+    write_fs(LDT_SEL);
+
+    m->level = level;
+    m->pte = m->ptes[level - 1];
+
+    if ( pse36 )
+    {
+        /* No l1e at all. */
+        m->order = PT_ORDER + PAGE_SHIFT;
+        m->paddr = *m->pte & ~0xfff;
+    }
+    else if ( super && (cr4 & (X86_CR4_PAE|X86_CR4_PSE)) )
+    {
+        /* Superpage in effect. */
+        m->order = ((level - 1) * PT_ORDER) + PAGE_SHIFT;
+        m->paddr =  *m->l1e & ~0xfff;
+    }
+    else
+    {
+        /* Small page, or superpage not in effect. */
+        m->order = 0;
+        m->paddr = *m->pte & ~0xfff;
+    }
+}
+
+void clear_ad(struct mapping_info *m)
+{
+    unsigned int i;
+
+    for ( i = 0; i < ARRAY_SIZE(m->ptes); ++i )
+        if ( m->ptes[i] )
+            *m->ptes[i] &= ~_PAGE_AD;
+
+    invlpg(m->va);
+}
+
+enum modifier
+{
+    /* Calc. */
+    WP       = 1 << 0,
+    NX       = 1 << 1,
+    SMEP     = 1 << 2,
+    SMAP     = 1 << 3,
+    AC       = 1 << 4,
+    IMP      = 1 << 5,
+
+    /* Check only. */
+    WRITE    = 1 << 6,
+};
+
+void check(struct mapping_info *m, exinfo_t actual, exinfo_t expected, enum modifier mod)
+{
+    /* Check that the actual pagefault matched our expectation. */
+    if ( actual != expected )
+    {
+        const char *user_sup = under_test.user ? "User" : "Supervisor";
+        bool ac_fault = !!actual, ex_fault = !!expected;
+        char ac_ec[16], ex_ec[16];
+
+        if ( ac_fault )
+            x86_exc_decode_ec(ac_ec, ARRAY_SIZE(ac_ec),
+                              X86_EXC_PF, (uint16_t)actual);
+        if ( ex_fault )
+            x86_exc_decode_ec(ex_ec, ARRAY_SIZE(ex_ec),
+                              X86_EXC_PF, (uint16_t)expected);
+
+        if ( ac_fault && !ex_fault )
+            fail("    Fail: expected no fault, got #PF[%s] for %s %s\n",
+                 ac_ec, user_sup, under_test.desc);
+        else if ( !ac_fault && ex_fault )
+            fail("    Fail: expected #PF[%s], got no fault for %s %s\n",
+                 ex_ec, user_sup, under_test.desc);
+        else
+            fail("    Fail: expected #PF[%s], got #PF[%s] for %s %s\n",
+                 ex_ec, ac_ec,  user_sup, under_test.desc);
+    }
+
+    /* Check that A/D bits got updated as expected. */
+    unsigned int leaf_level =
+        m->order ? ((m->order - PAGE_SHIFT) / PT_ORDER) : 0;
+    unsigned int i; /* NB - Levels are 0-indexed. */
+
+    if ( amd_fam10_erratum )
+    {
+        /*
+         * AMD Fam10 appears to defer the setting of access bits for implicit
+         * loads.  As a result, the implicit tests (which load %fs) don't
+         * necessarily observe the access bits being set on the pagewalk to
+         * the LDT.
+         *
+         * Experimentally, a serialising instruction fixes things, or a loop
+         * of 1000 nops, but so does forcing the use of the loaded segment.
+         *
+         * If this is an implicit load which didn't fault, read through %fs to
+         * force it to be loaded into the segment cache.
+         */
+        if ( (mod & IMP) && !actual )
+            asm volatile ("mov %%fs:0x1000, %0" : "=r" (i));
+    }
+
+    for ( i = 0; i < ARRAY_SIZE(m->ptes); ++i )
+    {
+        int exp_a, exp_d;
+
+        if ( !m->ptes[i] )
+            continue;
+
+        if ( CONFIG_PAGING_LEVELS == 3 && i == 2 )
+        {
+            /*
+             * 32bit PAE paging is special.  The 4 PDPTE's are read from
+             * memory, cached in the processor and don't strictly count as
+             * pagetables. The A/D bits are not updated.
+             */
+            exp_a = 0;
+            exp_d = 0;
+        }
+        else if ( leaf_level > i )
+        {
+            /*
+             * Logically below a superpage.  Nothing should consider this part
+             * of the pagetable structure, and neither A nor D should be set.
+             */
+            exp_a = 0;
+            exp_d = 0;
+        }
+        else if ( leaf_level == i )
+        {
+            /*
+             * At a leaf page.  If there was no fault, we expect A to be set,
+             * optionally D if a write occurred.
+             */
+            exp_a = (actual == 0);
+            exp_d = exp_a && (mod & WRITE);
+        }
+        else
+        {
+            /*
+             * Higher level translation structure.  A processor is free to
+             * cache the partial translation or not, at its discretion, but
+             * they will never be dirty.
+             */
+            exp_a = -1;
+            exp_d = 0;
+        }
+
+        bool act_a = *m->ptes[i] & _PAGE_ACCESSED;
+        bool act_d = *m->ptes[i] & _PAGE_DIRTY;
+
+        if ( (exp_a >= 0 && exp_a != act_a) || (exp_d != act_d) )
+            fail("    Fail: expected L%u AD = %c%u, got %u%u for %s %s\n",
+                 i + 1, exp_a == 1 ? '1' : exp_a == 0 ? '0' : 'x', exp_d,
+                 act_a, act_d,
+                 under_test.user ? "User" : "Supervisor", under_test.desc);
+    }
+
+    clear_ad(m);
+    write_fs(0);
+}
+
+exinfo_t calc(struct mapping_info *m, uint64_t new, unsigned int walk, enum modifier mod)
+{
+    bool nx_valid   = CONFIG_PAGING_LEVELS >= 3 && (host_nx_leaked || (mod & NX));
+    bool insn_valid = nx_valid || (mod & SMEP);
+    uint64_t rsvd = ((1ULL << 52) - 1) & ~((1ULL << maxphysaddr) - 1);
+
+    /* Accumulate additional bits which are reserved. */
+    if ( !nx_valid )
+        rsvd |= _PAGE_NX;
+
+    if ( m->level == 4 )
+        rsvd |= _PAGE_PSE | (vendor_is_amd ? _PAGE_GLOBAL : 0);
+    else if ( m->level == 3 && !cpu_has_page1gb )
+        rsvd |= _PAGE_PSE;
+
+    if ( m->order )
+    {
+        if ( CONFIG_PAGING_LEVELS > 2 || !cpu_has_pse36 )
+            rsvd |= ((1ULL << m->order) - 1) & ~(_PAGE_PSE_PAT | (_PAGE_PSE_PAT - 1));
+        else
+            rsvd |= (1ULL << 21) | fold_pse36(rsvd);
+    }
+
+    if ( CONFIG_PAGING_LEVELS == 3 )
+        rsvd |= 0x7ff0000000000000ULL;
+
+
+    if ( !insn_valid )
+        walk &= ~PFEC(I);
+
+    exinfo_t base = EXINFO_SYM(PF, walk & PFEC(I, U, W));
+
+    /* Check whether a translation exists. */
+    if ( !(new & _PAGE_PRESENT) )
+        return base;
+    base |= PFEC(P);
+
+    if ( new & rsvd )
+        return base | PFEC(R);
+
+    /* Check access rights. */
+
+    if ( (walk & PFEC(I)) && (new & _PAGE_NX) )
+        /* Insn fetch of NX page? Always fail. */
+        return base;
+
+    if ( walk & PFEC(U) )
+    {
+        /* User walk. */
+
+        if ( !(new & _PAGE_USER) )
+            /* Supervisor page? Always fail. */
+            return base;
+
+        if ( (walk & PFEC(W)) && !(new & _PAGE_RW) )
+            /* Write to a read-only page? */
+            return base;
+    }
+    else
+    {
+        /* Supervisor Walk. */
+
+        if ( new & _PAGE_USER )
+        {
+            /* User page. */
+
+            if ( (walk & PFEC(I)) && (mod & SMEP) )
+                /* Insn fetch with SMEP? */
+                return base;
+
+            if ( !(walk & PFEC(I)) && (mod & SMAP) &&
+                 ((mod & IMP) || !(mod & AC)) )
+                /* data fetch with SMAP and (Implicit or !AC)? */
+                return base;
+        }
+
+        if ( (walk & PFEC(W)) && !(new & _PAGE_RW) && (mod & WP) )
+            /* Write to a read-only page with WP active? */
+            return base;
+    }
+
+    /* Should succeed. */
+    return 0;
+}
+
+void test_pte(const struct stubs *stubs, struct mapping_info *m, uint64_t overlay)
+{
+    uint64_t new = m->paddr | overlay;
+    bool user = false;
+
+    under_test.pteval = *m->pte = new;
+    under_test.pte_printed = false;
+    clear_ad(m);
+
+    under_test.active = true;
+
+    for ( ; ; user = true )
+    {
+        unsigned int base = user ? PFEC(U) : 0;
+
+#define CALL(fn, va)                                                    \
+        (user ? exec_user_param(fn ## _user, (unsigned long)(va))       \
+              : fn((unsigned long)(va)))
+
+        under_test.user = user;
+
+        /* Map the exec FEP stub with suitable permissions. */
+        if ( stubs == &force_stubs )
+        {
+            *m->fe_pte &= ~_PAGE_USER;
+            if ( user )
+                *m->fe_pte |= _PAGE_USER;
+            invlpg(m->va - 5);
+        }
+
+        /* Basic read. */
+        under_test.desc = "Read";
+        check(m, CALL(stubs->read, m->va),
+              calc(m, new, base | PFEC(), 0),
+              0);
+
+        /* Implicit read (always supervisor).  `mov $LDT_SEL, %fs`. */
+        under_test.desc = "Read, Implicit";
+        check(m, CALL(stubs->implicit, _p(LDT_SEL)),
+              calc(m, new, PFEC(), IMP),
+              IMP);
+
+        /* Read, SMAP. */
+        if ( cpu_has_smap )
+        {
+            write_cr4(cr4 | X86_CR4_SMAP);
+
+            asm volatile ("clac");
+
+            under_test.desc = "Read, SMAP AC0";
+            check(m, CALL(stubs->read, m->va),
+                  calc(m, new, base | PFEC(), SMAP),
+                  0);
+
+            under_test.desc = "Read, Implicit, SMAP AC0";
+            check(m, CALL(stubs->implicit, _p(LDT_SEL)),
+                  calc(m, new, PFEC(), SMAP | IMP),
+                  IMP);
+
+            asm volatile ("stac");
+
+            under_test.desc = "Read, SMAP AC1";
+            check(m, CALL(stubs->read, m->va),
+                  calc(m, new, base | PFEC(), SMAP | AC),
+                  0);
+
+            if ( !user && !shadow_paging )
+            {
+                /*
+                 * This corner case loses information in the pagefault error
+                 * code, which the shadow pagetable logic in the hypervisor
+                 * can't account for.
+                 *
+                 * Executing this test in supervisor mode with shadow paging
+                 * will livelock with no further progress made.
+                 */
+                under_test.desc = "Read, Implicit, SMAP AC1";
+                check(m, CALL(stubs->implicit, _p(LDT_SEL)),
+                      calc(m, new, PFEC(), SMAP | AC | IMP),
+                      IMP);
+            }
+
+            asm volatile ("clac");
+
+            write_cr4(cr4);
+        }
+
+        /* Basic write. */
+        under_test.desc = "Write";
+        check(m, CALL(stubs->write, m->va),
+              calc(m, new, base | PFEC(W), 0),
+              WRITE);
+
+        /* Write, WP. */
+        write_cr0(cr0 | X86_CR0_WP);
+
+        under_test.desc = "Write, WP";
+        check(m, CALL(stubs->write, m->va),
+              calc(m, new, base | PFEC(W), WP),
+              WRITE);
+
+        write_cr0(cr0);
+
+        /* Write, SMAP. */
+        if ( cpu_has_smap )
+        {
+            write_cr4(cr4 | X86_CR4_SMAP);
+
+            asm volatile ("clac");
+
+            under_test.desc = "Write, SMAP AC0";
+            check(m, CALL(stubs->write, m->va),
+                  calc(m, new, base | PFEC(W), SMAP),
+                  WRITE);
+
+            asm volatile ("stac");
+
+            under_test.desc = "Write, SMAP AC1";
+            check(m, CALL(stubs->write, m->va),
+                  calc(m, new, base | PFEC(W), SMAP | AC),
+                  WRITE);
+
+            asm volatile ("clac");
+
+
+            /* Write, SMAP + WP. */
+            write_cr0(cr0 | X86_CR0_WP);
+
+            under_test.desc = "Write, SMAP AC0, WP";
+            check(m, CALL(stubs->write, m->va),
+                  calc(m, new, base | PFEC(W), SMAP | WP),
+                  WRITE);
+
+            asm volatile ("stac");
+
+            under_test.desc = "Write, SMAP AC1, WP";
+            check(m, CALL(stubs->write, m->va),
+                  calc(m, new, base | PFEC(W), SMAP | AC | WP),
+                  WRITE);
+
+            asm volatile ("clac");
+
+            write_cr0(cr0);
+            write_cr4(cr4);
+        }
+
+
+        /* Basic exec. */
+        under_test.desc = "Exec";
+        check(m, CALL(stubs->exec, m->va),
+              calc(m, new, base | PFEC(I), 0),
+              0);
+
+        /* Exec, SMEP. */
+        if ( cpu_has_smep )
+        {
+            write_cr4(cr4 | X86_CR4_SMEP);
+
+            under_test.desc = "Exec, SMEP";
+            check(m, CALL(stubs->exec, m->va),
+                  calc(m, new, base | PFEC(I), SMEP),
+                  0);
+
+            write_cr4(cr4);
+        }
+
+        /* Exec, NX. */
+        if ( cpu_has_nx )
+        {
+            wrmsr(MSR_EFER, efer | EFER_NXE);
+
+            under_test.desc = "Exec, NX";
+            check(m, CALL(stubs->exec, m->va),
+                  calc(m, new, base | PFEC(I), NX),
+                  0);
+
+            /* Exec, NX and SMEP. */
+            if ( cpu_has_smep )
+            {
+                write_cr4(cr4 | X86_CR4_SMEP);
+
+                under_test.desc = "Exec, NX, SMEP";
+                check(m, CALL(stubs->exec, m->va),
+                      calc(m, new, base | PFEC(I), NX | SMEP),
+                      0);
+
+                write_cr4(cr4);
+            }
+
+            wrmsr(MSR_EFER, efer);
+        }
+
+        if ( user )
+            break;
+    }
+
+#undef CALL
+
+    under_test.active = false;
+}
+
+void run_test(const struct stubs *stubs, unsigned int level, bool super, paddr_t paddr)
+{
+    const uint64_t base = super ? _PAGE_PSE : 0;
+    struct mapping_info m;
+    struct
+    {
+        bool cond;
+        uint64_t bit;
+    } trans_bits[] =
+        {
+            { 1, 0 },
+            { 1, _PAGE_GLOBAL },
+
+#if CONFIG_PAGING_LEVELS == 2
+
+            { super && (cr4 & X86_CR4_PSE) && !cpu_has_pse36, 1ULL << 13 },
+            { super && (cr4 & X86_CR4_PSE) && !cpu_has_pse36, 1ULL << 21 },
+
+            { super && paddr != (uint32_t)paddr, 1ULL << 21 },
+
+            { super && paddr != (uint32_t)paddr && maxphysaddr < 39,
+              fold_pse36(1ULL << 39) },
+            { super && paddr != (uint32_t)paddr && maxphysaddr < 38,
+              fold_pse36(1ULL << maxphysaddr) },
+
+#else
+
+            { super, 1ULL << 13 },
+            { super, PAGE_SIZE << (((level - 1) * PT_ORDER) - 1) },
+
+            { maxphysaddr < 50, 1ULL << maxphysaddr },
+            { maxphysaddr < 51, 1ULL << 51 },
+            { 1, 1ULL << 52 },
+            { 1, _PAGE_NX },
+
+#endif
+        };
+    uint32_t ar_bits[] =
+    {
+        0,
+        PF_SYM(P),
+        PF_SYM(RW, P),
+        PF_SYM(U, P),
+        PF_SYM(U, RW, P),
+    };
+    unsigned int trans, ar;
+
+    printk("Test%s L%ue%s%s\n",
+           (stubs == &force_stubs) ? " emulated" : "",
+           level, super ? " Superpage" : "",
+           CONFIG_PAGING_LEVELS == 2 && !(cr4 & X86_CR4_PSE) ? " (No PSE)" : ""
+        );
+
+    prepare_mappings(&m, level, super, paddr);
+
+    for ( ar = 0; ar < ARRAY_SIZE(ar_bits); ++ar )
+    {
+        for ( trans = 0; trans < ARRAY_SIZE(trans_bits); ++trans )
+        {
+            if ( trans_bits[trans].cond )
+                test_pte(stubs, &m, base | trans_bits[trans].bit | ar_bits[ar]);
+        }
+    }
+}
+
+static void shatter_console_superpage(void)
+{
+    /*
+     * Shatter the superpage mapping the PV console. We want to test with
+     * CR4.PSE disabled, at which point superpages stop working.
+     */
+    uint64_t raw_pfn;
+
+    if ( hvm_get_param(HVM_PARAM_CONSOLE_PFN, &raw_pfn) == 0 )
+    {
+        unsigned int l2o = l2_table_offset(raw_pfn << PAGE_SHIFT);
+
+        if ( (l2_identmap[l2o] & PF_SYM(PSE, P)) == PF_SYM(PSE, P) )
+        {
+            static intpte_t conl1t[L1_PT_ENTRIES] __aligned(PAGE_SIZE);
+            paddr_t base_gfn = l2_identmap[l2o] >> PAGE_SHIFT;
+            unsigned int i;
+
+            for ( i = 0; i < ARRAY_SIZE(conl1t); ++i )
+                conl1t[i] = pte_from_gfn(base_gfn + i, PF_SYM(AD, RW, P));
+
+            l2_identmap[l2o] = pte_from_virt(conl1t, PF_SYM(AD, U, RW, P));
+        }
+    }
+
+    flush_tlb(true);
+}
+
+static void populate_physmap_around(paddr_t paddr)
+{
+    unsigned long extents[] =
+        {
+            (paddr >> PAGE_SHIFT) - 1,
+            paddr >> PAGE_SHIFT,
+        };
+    struct xen_memory_reservation mr =
+        {
+            .extent_start = extents,
+            .nr_extents = ARRAY_SIZE(extents),
+            .domid = DOMID_SELF,
+        };
+    int rc = hypercall_memory_op(XENMEM_populate_physmap, &mr);
+
+    if ( rc != ARRAY_SIZE(extents) )
+        panic("Failed populate_physmap: %d\n", rc);
+}
+
+static void nx_leak_check(const struct stubs *stubs)
+{
+    struct mapping_info m;
+
+    /*
+     * Always use RAM at 12k, which is present and encodable even in 2-level
+     * paging.
+     */
+    prepare_mappings(&m, 1, false, 0x3000);
+
+    *m.pte &= ~_PAGE_PRESENT;
+    invlpg(m.va);
+
+    exinfo_t res = stubs->exec((unsigned long)m.va);
+
+    if ( !res || exinfo_vec(res) != X86_EXC_PF )
+        panic("Testing for NX leak didn't generate #PF\n");
+
+    host_nx_leaked = exinfo_ec(res) & PFEC(I);
+
+    printk("Host NX %sleaked%s\n",
+           host_nx_leaked    ? ""              : "not ",
+           stubs == &force_stubs ? " in emulation" : "");
+}
+
+void run_tests(const struct stubs *stubs, paddr_t paddr)
+{
+    nx_leak_check(stubs);
+
+    if ( CONFIG_PAGING_LEVELS == 2 )
+    {
+        if ( paddr == (uint32_t)paddr )
+        {
+            /*
+             * If paddr fits within 32bits, run all the tests.
+             */
+            run_test(stubs, 1, false, paddr);
+
+            cr4 &= ~X86_CR4_PSE;
+            write_cr4(cr4);
+
+            run_test(stubs, 2, false, paddr);
+            run_test(stubs, 2, true,  paddr);
+
+            cr4 |= X86_CR4_PSE;
+            write_cr4(cr4);
+
+            run_test(stubs, 2, false, paddr);
+            run_test(stubs, 2, true,  paddr);
+        }
+        else if ( cpu_has_pse36 )
+            /*
+             * Otherwise, paddrs above 32bits can only be encoded with pse36
+             * superpages.
+             */
+            run_test(stubs, 2, true, paddr);
+        else
+            printk("No applicable tests\n");
+    }
+    else
+    {
+        run_test(stubs, 1, false, paddr);
+        run_test(stubs, 2, false, paddr);
+        run_test(stubs, 2, true,  paddr);
+
+        if ( CONFIG_PAGING_LEVELS > 3 )
+        {
+            run_test(stubs, 3, false, paddr);
+            run_test(stubs, 3, true,  paddr);
+            run_test(stubs, 4, false, paddr);
+            run_test(stubs, 4, true,  paddr);
+        }
+    }
+}
+
+static void probe_shadow_paging(void)
+{
+    /*
+     * Shadow paging vs hap should be indistinguishable to guests.
+     *
+     * Shadow paging doesn't support PSE36, so this feature is nominally
+     * hidden from guests.
+     *
+     * Luckily(?), because old versions of HyperV refuse to boot if they don't
+     * see PSE36, it is purposefully leaked once PAE is enabled to keep HyperV
+     * happy.
+     *
+     * As a result, our shadow paging heuristic is that the visibility of the
+     * PSE36 feature changes as we flip in and out of PAE paging mode.
+     */
+    unsigned long tmp;
+    unsigned int _1d;
+
+    switch ( CONFIG_PAGING_LEVELS )
+    {
+    case 2:
+        write_cr0(cr0 & ~X86_CR0_PG);
+
+        write_cr3((unsigned long)pae32_l3_identmap);
+
+        write_cr4(cr4 | X86_CR4_PAE);
+        write_cr0(cr0);
+
+        _1d = cpuid_edx(1);
+
+        write_cr0(cr0 & ~X86_CR0_PG);
+        write_cr4(cr4 & ~X86_CR4_PAE);
+
+        write_cr3((unsigned long)cr3_target);
+
+        write_cr0(cr0);
+        break;
+
+    case 3:
+        write_cr0(cr0 & ~X86_CR0_PG);
+        write_cr4(cr4 & ~X86_CR4_PAE);
+
+        write_cr3((unsigned long)pse_l2_identmap);
+
+        write_cr0(cr0);
+
+        _1d = cpuid_edx(1);
+
+        write_cr0(cr0 & ~X86_CR0_PG);
+
+        write_cr3((unsigned long)cr3_target);
+
+        write_cr4(cr4);
+        write_cr0(cr0);
+        break;
+
+    case 4:
+        asm volatile (/* Drop into a 32bit compat code segment. */
+                      "push $%c[cs32];"
+                      "push $1f;"
+                      "lretq; 1:"
+
+                      ".code32;"
+                      "start_32bit:;"
+
+                      /* Flip %CR4.PAE */
+                      "mov %k[cr0], %%cr0;"
+                      "mov %k[cr4], %%cr4;"
+                      "mov $pse_l2_identmap, %k[cr3];"
+                      "mov %k[cr3], %%cr3;"
+                      "rdmsr;"
+                      "and $~" STR(EFER_LME) ", %k[a];"
+                      "wrmsr;"
+                      "or $" STR(X86_CR0_PG) ", %k[cr0];"
+                      "mov %k[cr0], %%cr0;"
+
+                      "mov $1, %k[a];"
+                      "cpuid;"
+                      "mov %k[d], %k[b];"
+
+                      /* Flip %CR4.PAE back. */
+                      "and $~" STR(X86_CR0_PG) ", %k[cr0];"
+                      "mov %k[cr0], %%cr0;"
+                      "mov $" STR(MSR_EFER) ", %k[c];"
+                      "rdmsr;"
+                      "or $" STR(EFER_LME) ", %k[a];"
+                      "wrmsr;"
+                      "or $" STR(X86_CR4_PAE) ", %k[cr4];"
+                      "mov %k[cr4], %%cr4;"
+                      "mov $pae_l4_identmap, %k[cr3];"
+                      "mov %k[cr3], %%cr3;"
+                      "or $" STR(X86_CR0_PG) ", %k[cr0];"
+                      "mov %k[cr0], %%cr0;"
+
+                      /* Return to 64bit. */
+                      "ljmpl $%c[cs], $1f;"
+                      "end_32bit:;"
+                      ".code64; 1:"
+                      : [a] "=&a" (tmp),
+                        [b] "=&b" (_1d),
+                        [c] "=&c" (tmp),
+                        [d] "=&d" (tmp),
+                        [cr3] "=&r" (tmp)
+                      : "c" (MSR_EFER),
+                        [cr0]  "R" (cr0 & ~X86_CR0_PG),
+                        [cr4]  "R" (cr4 & ~X86_CR4_PAE),
+                        [cs32] "i" (GDTE_CS32_DPL0 * 8),
+                        [cs]   "i" (__KERN_CS));
+        break;
+    }
+
+    shadow_paging = cpu_has_pse36 ^ !!(_1d & cpufeat_mask(X86_FEATURE_PSE36));
+
+    printk("  Paging mode heuristic: %s\n", shadow_paging ? "Shadow" : "Hap");
+}
+
+void test_main(void)
+{
+    xtf_unhandled_exception_hook = unhandled_exception;
+
+    printk("  Info: Vendor %s, Family %u, Model %u, Stepping %u, paddr %u, vaddr %u\n"
+           "  Features:%s%s%s%s%s%s%s%s%s%s%s\n",
+           x86_vendor_name(x86_vendor),
+           x86_family, x86_model, x86_stepping, maxphysaddr, maxvirtaddr,
+           cpu_has_pse     ? " PSE"    : "",
+           cpu_has_pae     ? " PAE"    : "",
+           cpu_has_pge     ? " PGE"    : "",
+           cpu_has_pat     ? " PAT"    : "",
+           cpu_has_pse36   ? " PSE36"  : "",
+           cpu_has_pcid    ? " PCID"   : "",
+           cpu_has_nx      ? " NX"     : "",
+           cpu_has_page1gb ? " PAGE1G" : "",
+           cpu_has_smep    ? " SMEP"   : "",
+           cpu_has_smap    ? " SMAP"   : "",
+           cpu_has_pku     ? " PKU"    : ""
+        );
+
+    if ( CONFIG_PAGING_LEVELS == 2 )
+        shatter_console_superpage();
+
+    if ( !vendor_is_intel && !vendor_is_amd )
+        xtf_warning("Unknown CPU vendor.  Something might go wrong\n");
+    if ( !xtf_has_fep )
+        xtf_skip("FEP support not detected - some tests will be skipped\n");
+
+    if ( vendor_is_amd && x86_family == 0x10 )
+    {
+        amd_fam10_erratum = true;
+        printk("  Working around suspected AMD Fam10h erratum\n");
+    }
+
+    /* Sanitise environment. */
+    efer = rdmsr(MSR_EFER) & ~EFER_NXE;
+    wrmsr(MSR_EFER, efer);
+    cr0 = read_cr0() & ~X86_CR0_WP;
+    write_cr0(cr0);
+    cr4 = read_cr4() & ~(X86_CR4_SMEP | X86_CR4_SMAP);
+    write_cr4(cr4);
+
+    probe_shadow_paging();
+
+    unsigned int i;
+    paddr_t paddrs[] =
+    {
+        GB(1),
+        1ULL << (min(maxphysaddr,
+                     CONFIG_PAGING_LEVELS == 2
+                     ? 40U
+                     : (BITS_PER_LONG + PAGE_SHIFT)) - 1),
+    };
+
+    for ( i = 0; i < ARRAY_SIZE(paddrs); ++i )
+    {
+        paddr_t paddr = paddrs[i];
+
+        printk("Using paddr 0x%"PRIpaddr"\n", paddr);
+        populate_physmap_around(paddr);
+
+        run_tests(&regular_stubs, paddr);
+
+        if ( xtf_has_fep )
+            run_tests(&force_stubs, paddr);
+    }
+
+    xtf_success(NULL);
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tests/pagetable-emulation/stubs.S b/tests/pagetable-emulation/stubs.S
new file mode 100644
index 0000000..068b712
--- /dev/null
+++ b/tests/pagetable-emulation/stubs.S
@@ -0,0 +1,94 @@
+#include <xtf/extable.h>
+#include <xtf/asm_macros.h>
+#include <xen/arch-x86/xen.h>
+
+#include <arch/idt.h>
+#include <arch/page.h>
+#include <arch/processor.h>
+#include <arch/segment.h>
+
+#ifdef __i386__
+# define REG  edx
+# define REG8 dl
+#else
+# define REG  rdi
+# define REG8 dil
+#endif
+
+.macro GEN_SINGLE name, type, force=0
+
+ENTRY(stub_\name)
+
+        xor %eax, %eax
+
+#ifdef __i386__
+        mov 4(%esp), %REG
+#endif
+
+        .ifnc \type, X
+        .if \force
+        _ASM_XEN_FEP
+        .endif
+        .endif
+
+.Lstub_\name\()_fault:
+
+        .ifc \type, R
+        /* Read (and trash) the pointer. */
+        mov (%REG), %REG8
+        .endif
+
+        .ifc \type, I
+        /* Load %fs with the selector in %REG.  Causes implicit GDT/LDT access. */
+        mov %REG, %fs
+        .endif
+
+        .ifc \type, W
+        /* Write `ret` to the pointer. */
+        movb $0xc3, (%REG)
+        .endif
+
+        .ifc \type, X
+        /* Call the pointer. */
+
+        .if \force
+        /* If FORCE, move the pointer to include the force emulation prefix. */
+        sub $5, %REG
+        .endif
+
+        call *%REG
+        .endif
+
+.Lstub_\name\()_fixup:
+        ret
+
+ENDFUNC(stub_\name)
+
+        .ifnc \type, X
+        _ASM_EXTABLE_HANDLER(.Lstub_\name\()_fault, \
+                             .Lstub_\name\()_fixup, \
+                             ex_check_pf)
+        .endif
+.endm
+
+GEN_SINGLE read     R
+GEN_SINGLE implicit I
+GEN_SINGLE write    W
+GEN_SINGLE exec     X
+GEN_SINGLE force_read     R 1
+GEN_SINGLE force_implicit I 1
+GEN_SINGLE force_write    W 1
+GEN_SINGLE force_exec     X 1
+
+.pushsection .text.user, "ax", @progbits
+
+GEN_SINGLE read_user     R
+GEN_SINGLE implicit_user I
+GEN_SINGLE write_user    W
+GEN_SINGLE exec_user     X
+GEN_SINGLE force_read_user     R 1
+GEN_SINGLE force_implicit_user I 1
+GEN_SINGLE force_write_user    W 1
+GEN_SINGLE force_exec_user     X 1
+
+.popsection
diff --git a/tests/pagetable-emulation/stubs.h b/tests/pagetable-emulation/stubs.h
new file mode 100644
index 0000000..00db033
--- /dev/null
+++ b/tests/pagetable-emulation/stubs.h
@@ -0,0 +1,25 @@
+#ifndef _LOWLEVEL_H_
+#define _LOWLEVEL_H_
+
+unsigned long stub_read(unsigned long va);
+unsigned long stub_implicit(unsigned long sel); /* unsigned long sel */
+unsigned long stub_write(unsigned long va);
+unsigned long stub_exec(unsigned long va);
+
+unsigned long stub_force_read(unsigned long va);
+unsigned long stub_force_implicit(unsigned long sel); /* unsigned long sel */
+unsigned long stub_force_write(unsigned long va);
+unsigned long stub_force_exec(unsigned long va);
+
+unsigned long stub_read_user(unsigned long va);
+unsigned long stub_implicit_user(unsigned long sel); /* unsigned long sel */
+unsigned long stub_write_user(unsigned long va);
+unsigned long stub_exec_user(unsigned long va);
+
+unsigned long stub_force_read_user(unsigned long va);
+unsigned long stub_force_implicit_user(unsigned long sel); /* unsigned long sel */
+unsigned long stub_force_write_user(unsigned long va);
+unsigned long stub_force_exec_user(unsigned long va);
+
+#endif
+
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation
  2017-02-27 14:03 ` [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation Andrew Cooper
  2017-03-01 16:22   ` Jan Beulich
  2017-03-02 16:15   ` Tim Deegan
@ 2017-03-06 18:25   ` George Dunlap
  2 siblings, 0 replies; 52+ messages in thread
From: George Dunlap @ 2017-03-06 18:25 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: George Dunlap, Tim Deegan, Jan Beulich

On 27/02/17 14:03, Andrew Cooper wrote:
> hap_p2m_ga_to_gfn() and sh_page_fault() currently use guest_l1e_get_gfn() to
> obtain the translation of a pagewalk.  This is conceptually wrong (the
> semantics of gw.l1e is an internal detail), and will actually be wrong when
> PSE36 superpage support is fixed.  Switch them to using guest_walk_to_gfn().
> 
> Take the opportunity to const-correct the walk_t parameter of the
> guest_walk_to_*() helpers, and implement guest_walk_to_gpa() in terms of
> guest_walk_to_gfn() to avoid duplicating the actual translation calculation.
> 
> While editing guest_walk_to_gpa(), fix a latent bug by causing it to return
> INVALID_PADDR rather than 0 for a failed translation, as 0 is also a valid
> successful result.  The sole caller, sh_page_fault(), has already confirmed
> that the translation is valid, so this doesn't cause a behavioural change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker
  2017-02-27 14:03 ` [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker Andrew Cooper
  2017-03-02 11:52   ` Jan Beulich
  2017-03-02 16:16   ` Tim Deegan
@ 2017-03-06 18:28   ` George Dunlap
  2017-03-06 18:33     ` Andrew Cooper
  2017-03-07 12:57   ` George Dunlap
  3 siblings, 1 reply; 52+ messages in thread
From: George Dunlap @ 2017-03-06 18:28 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: George Dunlap, Tim Deegan, Jan Beulich

On 27/02/17 14:03, Andrew Cooper wrote:
> The existing pagetable walker has complicated return semantics, which squeeze
>> multiple pieces of information into a single integer.  This would be fine if the
> information didn't overlap, but it does.
> 
> Specifically, _PAGE_INVALID_BITS for 3-level guests alias _PAGE_PAGED and
> _PAGE_SHARED.  A guest which constructs a PTE with bits 52 or 53 set (the
> start of the upper software-available range) will create a virtual address
> which, when walked by Xen, tricks Xen into believing the frame is paged or
> shared.  This behaviour was introduced by XSA-173 (c/s 8b17648).
> 
> It is also complicated to turn rc back into a normal pagefault error code.
> Instead, change the calling semantics to return a boolean indicating success,
> and have the function accumulate a real pagefault error code as it goes
> (including synthetic error codes, which do not alias hardware ones).  This
> requires an equivalent adjustment to map_domain_gfn().
> 
> Issues fixed:
>  * 2-level PSE36 superpages now return the correct translation.
>  * 2-level L2 superpages without CR0.PSE now return the correct translation.
>  * SMEP now inhibits a user instruction fetch even if NX isn't active.
>  * Supervisor writes without CR0.WP now set the leaf dirty bit.
>  * L4e._PAGE_GLOBAL is strictly reserved on AMD.
>  * 3-level l3 entries have all reserved bits checked.
>  * 3-level entries can no longer alias Xen's idea of paged or shared.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Andy,

Have you got this in a tree somewhere?  This doesn't seem to apply to
staging, and I don't think I'm going to be able to give it a proper
review without seeing the changes in-situ.  (Or if you're close to
reposting I'll wait for the repost.)

Thanks,
 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker
  2017-03-06 18:28   ` George Dunlap
@ 2017-03-06 18:33     ` Andrew Cooper
  2017-03-06 18:39       ` George Dunlap
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Cooper @ 2017-03-06 18:33 UTC (permalink / raw)
  To: George Dunlap, Xen-devel; +Cc: George Dunlap, Tim Deegan, Jan Beulich

On 06/03/17 18:28, George Dunlap wrote:
> On 27/02/17 14:03, Andrew Cooper wrote:
>> The existing pagetable walker has complicated return semantics, which squeeze
>> multiple pieces of information into a single integer.  This would be fine if the
>> information didn't overlap, but it does.
>>
>> Specifically, _PAGE_INVALID_BITS for 3-level guests alias _PAGE_PAGED and
>> _PAGE_SHARED.  A guest which constructs a PTE with bits 52 or 53 set (the
>> start of the upper software-available range) will create a virtual address
>> which, when walked by Xen, tricks Xen into believing the frame is paged or
>> shared.  This behaviour was introduced by XSA-173 (c/s 8b17648).
>>
>> It is also complicated to turn rc back into a normal pagefault error code.
>> Instead, change the calling semantics to return a boolean indicating success,
>> and have the function accumulate a real pagefault error code as it goes
>> (including synthetic error codes, which do not alias hardware ones).  This
>> requires an equivalent adjustment to map_domain_gfn().
>>
>> Issues fixed:
>>  * 2-level PSE36 superpages now return the correct translation.
>>  * 2-level L2 superpages without CR0.PSE now return the correct translation.
>>  * SMEP now inhibits a user instruction fetch even if NX isn't active.
>>  * Supervisor writes without CR0.WP now set the leaf dirty bit.
>>  * L4e._PAGE_GLOBAL is strictly reserved on AMD.
>>  * 3-level l3 entries have all reserved bits checked.
>>  * 3-level entries can no longer alias Xen's idea of paged or shared.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Andy,
>
> Have you got this in a tree somewhere?  This doesn't seem to apply to
> staging, and I don't think I'm going to be able to give it a proper
> review without seeing the changes in-situ.  (Or if you're close to
> reposting I'll wait for the repost.)

I do have some bits and pieces (although I am missing some of your acks
on the earlier trivial patches).

This patch isn't going to change in any significant way, so here you go.

http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/pagewalk-v1

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker
  2017-03-06 18:33     ` Andrew Cooper
@ 2017-03-06 18:39       ` George Dunlap
  0 siblings, 0 replies; 52+ messages in thread
From: George Dunlap @ 2017-03-06 18:39 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: George Dunlap, Tim Deegan, Jan Beulich

On 06/03/17 18:33, Andrew Cooper wrote:
> On 06/03/17 18:28, George Dunlap wrote:
>> On 27/02/17 14:03, Andrew Cooper wrote:
>>> The existing pagetable walker has complicated return semantics, which squeeze
>>> multiple pieces of information into single integer.  This would be fine if the
>>> information didn't overlap, but it does.
>>>
>>> Specifically, _PAGE_INVALID_BITS for 3-level guests alias _PAGE_PAGED and
>>> _PAGE_SHARED.  A guest which constructs a PTE with bits 52 or 53 set (the
>>> start of the upper software-available range) will create a virtual address
>>> which, when walked by Xen, tricks Xen into believing the frame is paged or
>>> shared.  This behaviour was introduced by XSA-173 (c/s 8b17648).
>>>
>>> It is also complicated to turn rc back into a normal pagefault error code.
>>> Instead, change the calling semantics to return a boolean indicating success,
>>> and have the function accumulate a real pagefault error code as it goes
>>> (including synthetic error codes, which do not alias hardware ones).  This
>>> requires an equivalent adjustment to map_domain_gfn().
>>>
>>> Issues fixed:
>>>  * 2-level PSE36 superpages now return the correct translation.
>>>  * 2-level L2 superpages without CR0.PSE now return the correct translation.
>>>  * SMEP now inhibits a user instruction fetch even if NX isn't active.
>>>  * Supervisor writes without CR0.WP now set the leaf dirty bit.
>>>  * L4e._PAGE_GLOBAL is strictly reserved on AMD.
>>>  * 3-level l3 entries have all reserved bits checked.
>>>  * 3-level entries can no longer alias Xen's idea of paged or shared.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Andy,
>>
>> Have you got this in a tree somewhere?  This doesn't seem to apply to
>> staging, and I don't think I'm going to be able to give it a proper
>> review without seeing the changes in-situ.  (Or if you're close to
>> reposting I'll wait for the repost.)
> 
> I do have some bits and pieces (although I am missing some of your acks on
> the earlier trivial patches).

Yeah, patch 4 it seems has been checked in already, and because of that
I somehow got confused into thinking that more of them were checked in
when I was going through the series.  I'm about to head out tonight, but
I'll give Acks or Reviews to the earlier ones tomorrow when I come back in.

> This patch isn't going to change in any significant way, so here you go.
> 
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/pagewalk-v1

Thanks.

 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses
  2017-02-27 14:03 ` [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses Andrew Cooper
  2017-03-01 15:05   ` Jan Beulich
  2017-03-02 16:14   ` Tim Deegan
@ 2017-03-07 10:46   ` George Dunlap
  2017-03-07 10:51   ` Andrew Cooper
  2017-03-07 15:00   ` Paul Durrant
  4 siblings, 0 replies; 52+ messages in thread
From: George Dunlap @ 2017-03-07 10:46 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: George Dunlap, Paul Durrant, Tim Deegan, Jan Beulich

On 27/02/17 14:03, Andrew Cooper wrote:
> All actions which refer to the active ldt/gdt/idt or task register
> (e.g. loading a new segment selector) are known as implicit supervisor
> accesses, even when the access originates from user code.
> 
> The distinction is necessary in the pagewalk when SMAP is enabled.  Refer to
> Intel SDM Vol 3 "Access Rights" for the exact details.
> 
> Introduce a new pagewalk input, and make use of the new system segment
> references in hvmemul_{read,write}().  While modifying those areas, move the
> calculation of the appropriate pagewalk input before its first use.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>

> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Paul Durrant <paul.durrant@citrix.com>
> CC: George Dunlap <george.dunlap@eu.citrix.com>
> CC: Tim Deegan <tim@xen.org>
> ---
>  xen/arch/x86/hvm/emulate.c      | 18 ++++++++++--------
>  xen/arch/x86/mm/guest_walk.c    |  4 ++++
>  xen/include/asm-x86/processor.h |  1 +
>  3 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
> index f24d289..9b32bca 100644
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -783,6 +783,11 @@ static int __hvmemul_read(
>      struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
>      int rc;
>  
> +    if ( is_x86_system_segment(seg) )
> +        pfec |= PFEC_implicit;
> +    else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3 )
> +        pfec |= PFEC_user_mode;
> +
>      rc = hvmemul_virtual_to_linear(
>          seg, offset, bytes, &reps, access_type, hvmemul_ctxt, &addr);
>      if ( rc != X86EMUL_OKAY || !bytes )
> @@ -793,10 +798,6 @@ static int __hvmemul_read(
>           (vio->mmio_gla == (addr & PAGE_MASK)) )
>          return hvmemul_linear_mmio_read(addr, bytes, p_data, pfec, hvmemul_ctxt, 1);
>  
> -    if ( (seg != x86_seg_none) &&
> -         (hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3) )
> -        pfec |= PFEC_user_mode;
> -
>      rc = ((access_type == hvm_access_insn_fetch) ?
>            hvm_fetch_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo) :
>            hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo));
> @@ -893,6 +894,11 @@ static int hvmemul_write(
>      struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
>      int rc;
>  
> +    if ( is_x86_system_segment(seg) )
> +        pfec |= PFEC_implicit;
> +    else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3 )
> +        pfec |= PFEC_user_mode;
> +
>      rc = hvmemul_virtual_to_linear(
>          seg, offset, bytes, &reps, hvm_access_write, hvmemul_ctxt, &addr);
>      if ( rc != X86EMUL_OKAY || !bytes )
> @@ -902,10 +908,6 @@ static int hvmemul_write(
>           (vio->mmio_gla == (addr & PAGE_MASK)) )
>          return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 1);
>  
> -    if ( (seg != x86_seg_none) &&
> -         (hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3) )
> -        pfec |= PFEC_user_mode;
> -
>      rc = hvm_copy_to_guest_linear(addr, p_data, bytes, pfec, &pfinfo);
>  
>      switch ( rc )
> diff --git a/xen/arch/x86/mm/guest_walk.c b/xen/arch/x86/mm/guest_walk.c
> index faaf70c..4f8d0e3 100644
> --- a/xen/arch/x86/mm/guest_walk.c
> +++ b/xen/arch/x86/mm/guest_walk.c
> @@ -161,6 +161,10 @@ guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
>      bool_t pse1G = 0, pse2M = 0;
>      p2m_query_t qt = P2M_ALLOC | P2M_UNSHARE;
>  
> +    /* Only implicit supervisor data accesses exist. */
> +    ASSERT(!(pfec & PFEC_implicit) ||
> +           !(pfec & (PFEC_insn_fetch|PFEC_user_mode)));
> +
>      perfc_incr(guest_walk);
>      memset(gw, 0, sizeof(*gw));
>      gw->va = va;
> diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
> index dda8b83..d3ba8ea 100644
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -76,6 +76,7 @@
>  /* Internally used only flags. */
>  #define PFEC_page_paged     (1U<<16)
>  #define PFEC_page_shared    (1U<<17)
> +#define PFEC_implicit       (1U<<18) /* Pagewalk input for ldt/gdt/idt/tr accesses. */
>  
>  /* Other exception error code values. */
>  #define X86_XEC_EXT         (_AC(1,U) << 0)
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses
  2017-02-27 14:03 ` [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses Andrew Cooper
                     ` (2 preceding siblings ...)
  2017-03-07 10:46   ` George Dunlap
@ 2017-03-07 10:51   ` Andrew Cooper
  2017-03-07 15:00   ` Paul Durrant
  4 siblings, 0 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-03-07 10:51 UTC (permalink / raw)
  To: Xen-devel; +Cc: George Dunlap, Paul Durrant, Tim Deegan, Jan Beulich

On 27/02/17 14:03, Andrew Cooper wrote:
> All actions which refer to the active ldt/gdt/idt or task register
> (e.g. loading a new segment selector) are known as implicit supervisor
> accesses, even when the access originates from user code.

It turns out that this has a bugfix in it which I hadn't realised.

I have added:

"Right away, this fixes a bug during userspace emulation where a
pagewalk for a system table was (incorrectly) performed as a user
access, causing an access violation in the common case, as system tables
reside on supervisor mappings."
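
To spell out the failure mode: before this change, a system-segment
access emulated on behalf of CPL3 guest code took the

    pfec |= PFEC_user_mode;     /* because SS.DPL == 3 */

path for its pagewalk, so the walk of the (supervisor-only) system table
failed as a user-mode access violation.  With this change such an access
picks up PFEC_implicit instead.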

~Andrew

>
> The distinction is necessary in the pagewalk when SMAP is enabled.  Refer to
> Intel SDM Vol 3 "Access Rights" for the exact details.
>
> Introduce a new pagewalk input, and make use of the new system segment
> references in hvmemul_{read,write}().  While modifying those areas, move the
> calculation of the appropriate pagewalk input before its first use.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/7] x86/shadow: Try to correctly identify implicit supervisor accesses
  2017-02-27 14:03 ` [PATCH 2/7] x86/shadow: Try to correctly " Andrew Cooper
  2017-03-01 15:11   ` Jan Beulich
  2017-03-02 16:14   ` Tim Deegan
@ 2017-03-07 11:26   ` George Dunlap
  2017-03-07 11:55     ` Andrew Cooper
  2 siblings, 1 reply; 52+ messages in thread
From: George Dunlap @ 2017-03-07 11:26 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: George Dunlap, Tim Deegan, Jan Beulich

On 27/02/17 14:03, Andrew Cooper wrote:
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Tim Deegan <tim@xen.org>
> CC: George Dunlap <george.dunlap@eu.citrix.com>
> ---
>  xen/arch/x86/mm/shadow/multi.c | 60 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 58 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
> index 128809d..7c6b017 100644
> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -2857,7 +2857,7 @@ static int sh_page_fault(struct vcpu *v,
>      const struct x86_emulate_ops *emul_ops;
>      int r;
>      p2m_type_t p2mt;
> -    uint32_t rc;
> +    uint32_t rc, error_code;
>      int version;
>      const struct npfec access = {
>           .read_access = 1,
> @@ -3011,13 +3011,69 @@ static int sh_page_fault(struct vcpu *v,
>  
>   rewalk:
>  
> +    error_code = regs->error_code;
> +
> +    /*
> +     * When CR4.SMAP is enabled, instructions which have a side effect of
> +     * accessing the system data structures (e.g. mov to %ds accessing the
> +     * LDT/GDT, or int $n accessing the IDT) are known as implicit supervisor
> +     * accesses.
> +     *
> +     * The distinction between implicit and explicit accesses forms part of the
> +     * determination of access rights, controlling whether the access is
> +     * successful, or raises a #PF.
> +     *
> +     * Unfortunately, the processor throws away the implicit/explicit
> +     * distinction and does not provide it to the pagefault handler
> +     * (i.e. here) in the #PF error code.  Therefore, we must try to
> +     * reconstruct the lost state so it can be fed back into our pagewalk
> +     * through the guest tables.
> +     *
> +     * User mode accesses are easy to reconstruct:
> +     *
> +     *   If we observe a cpl3 data fetch which was a supervisor walk, this
> +     *   must have been an implicit access to a system table.
> +     *
> +     * Supervisor mode accesses are not easy:
> +     *
> +     *   In principle, we could decode the instruction under %rip and have the
> +     *   instruction emulator tell us if there is an implicit access.
> +     *   However, this is racy with other vcpus updating the pagetable or
> +     *   rewriting the instruction stream under our feet.
> +     *
> +     *   Therefore, we do nothing.  (If anyone has a sensible suggestion for
> +     *   how to distinguish these cases, xen-devel@ is all ears...)
> +     *
> +     * As a result, one specific corner case will fail.  If a guest OS with
> +     * SMAP enabled ends up mapping a system table with user mappings, sets
> +     * EFLAGS.AC to allow explicit accesses to user mappings, and implicitly
> +     * accesses the user mapping, hardware and the shadow code will disagree
> +     * on whether a #PF should be raised.
> +     *
> +     * Hardware raises #PF because implicit supervisor accesses to user
> +     * mappings are strictly disallowed.  As we can't reconstruct the correct
> +     * input, the pagewalk is performed as if it were an explicit access,
> +     * which concludes that the access should have succeeded and the shadow
> +     * pagetables need modifying.  The shadow pagetables are modified (to the
> +     * same value), and we re-enter the guest to re-execute the instruction,
> +     * which causes another #PF, and the vcpu livelocks, unable to make
> +     * forward progress.

What you're saying is that if an attacker manages to trigger this
behavior, then it's probably a mistake on her part; she was trying to do
a full privilege escalation and accidentally screwed something up and
turned it into a DoS?

> +     * In practice, this is tolerable because no OS would deliberately
> +     * construct such a corner case, as doing so would mean it is trivially
> +     * root-able by its own userspace.

Just to point out, the problem with 'deliberately' is that the whole
point of SMAP is to protect an OS from its own errors. :-)  But I agree
that at first blush, the scenario above would be pretty difficult to do
accidentally.  (And I certainly don't have any better ideas.)

Would this perhaps be better as:

"In practice, this is tolerable because it is difficult to imagine a
scenario in which an OS would accidentally construct such a corner case;
and if it did, SMAP would not actually protect it from being trivially
rooted by userspace unless the attacker made a mistake and accidentally
triggered the path described here."

But that's just a suggestion.  Either way:

Reviewed-by: George Dunlap <george.dunlap@citrix.com>


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/7] x86/shadow: Try to correctly identify implicit supervisor accesses
  2017-03-07 11:26   ` George Dunlap
@ 2017-03-07 11:55     ` Andrew Cooper
  0 siblings, 0 replies; 52+ messages in thread
From: Andrew Cooper @ 2017-03-07 11:55 UTC (permalink / raw)
  To: George Dunlap, Xen-devel; +Cc: George Dunlap, Tim Deegan, Jan Beulich

On 07/03/17 11:26, George Dunlap wrote:
> On 27/02/17 14:03, Andrew Cooper wrote:
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Tim Deegan <tim@xen.org>
>> CC: George Dunlap <george.dunlap@eu.citrix.com>
>> ---
>>  xen/arch/x86/mm/shadow/multi.c | 60 ++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 58 insertions(+), 2 deletions(-)
>>
>> diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
>> index 128809d..7c6b017 100644
>> --- a/xen/arch/x86/mm/shadow/multi.c
>> +++ b/xen/arch/x86/mm/shadow/multi.c
>> @@ -2857,7 +2857,7 @@ static int sh_page_fault(struct vcpu *v,
>>      const struct x86_emulate_ops *emul_ops;
>>      int r;
>>      p2m_type_t p2mt;
>> -    uint32_t rc;
>> +    uint32_t rc, error_code;
>>      int version;
>>      const struct npfec access = {
>>           .read_access = 1,
>> @@ -3011,13 +3011,69 @@ static int sh_page_fault(struct vcpu *v,
>>  
>>   rewalk:
>>  
>> +    error_code = regs->error_code;
>> +
>> +    /*
>> +     * When CR4.SMAP is enabled, instructions which have a side effect of
>> +     * accessing the system data structures (e.g. mov to %ds accessing the
>> +     * LDT/GDT, or int $n accessing the IDT) are known as implicit supervisor
>> +     * accesses.
>> +     *
>> +     * The distinction between implicit and explicit accesses forms part of the
>> +     * determination of access rights, controlling whether the access is
>> +     * successful, or raises a #PF.
>> +     *
>> +     * Unfortunately, the processor throws away the implicit/explicit
>> +     * distinction and does not provide it to the pagefault handler
>> +     * (i.e. here) in the #PF error code.  Therefore, we must try to
>> +     * reconstruct the lost state so it can be fed back into our pagewalk
>> +     * through the guest tables.
>> +     *
>> +     * User mode accesses are easy to reconstruct:
>> +     *
>> +     *   If we observe a cpl3 data fetch which was a supervisor walk, this
>> +     *   must have been an implicit access to a system table.
>> +     *
>> +     * Supervisor mode accesses are not easy:
>> +     *
>> +     *   In principle, we could decode the instruction under %rip and have the
>> +     *   instruction emulator tell us if there is an implicit access.
>> +     *   However, this is racy with other vcpus updating the pagetable or
>> +     *   rewriting the instruction stream under our feet.
>> +     *
>> +     *   Therefore, we do nothing.  (If anyone has a sensible suggestion for
>> +     *   how to distinguish these cases, xen-devel@ is all ears...)
>> +     *
>> +     * As a result, one specific corner case will fail.  If a guest OS with
>> +     * SMAP enabled ends up mapping a system table with user mappings, sets
>> +     * EFLAGS.AC to allow explicit accesses to user mappings, and implicitly
>> +     * accesses the user mapping, hardware and the shadow code will disagree
>> +     * on whether a #PF should be raised.
>> +     *
>> +     * Hardware raises #PF because implicit supervisor accesses to user
>> +     * mappings are strictly disallowed.  As we can't reconstruct the correct
>> +     * input, the pagewalk is performed as if it were an explicit access,
>> +     * which concludes that the access should have succeeded and the shadow
>> +     * pagetables need modifying.  The shadow pagetables are modified (to the
>> +     * same value), and we re-enter the guest to re-execute the instruction,
>> +     * which causes another #PF, and the vcpu livelocks, unable to make
>> +     * forward progress.
> What you're saying is that if an attacker manages to trigger this
> behavior, then it's probably a mistake on her part; she was trying to do
> a full privilege escalation and accidentally screwed something up and
> turned it into a DoS?

This livelock can only occur if a system table is mapped with user
mappings.  This is not a situation which an OS would ever set up (other
than XTF for testing), as it means userspace can write to the GDT and
trivially root the OS.

It is certainly possible that userspace could use a vulnerability in the
OS to make such a change.  (Chances are that, if it could do that, there
are more interesting things it could do.)

Even if such a situation were set up, the livelock can only be triggered
in supervisor mode, via an implicit access, while userspace accesses are
whitelisted (i.e. with EFLAGS.AC set) - which isn't a situation that will
occur accidentally.

If an attacker had engineered this situation, it would be easy to
sidestep around the livelock.
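
For concreteness, the corner case amounts to a guest kernel doing
something like this (purely illustrative sketch; the variable names are
made up):

    /* CPL0, CR4.SMAP already enabled. */
    gdt_l1e |= _PAGE_USER;         /* Map the GDT with user mappings.    */
    asm volatile ("stac");         /* EFLAGS.AC = 1: explicit accesses
                                    * to user mappings now allowed.      */

    /*
     * Implicit supervisor access to the user-mapped GDT: hardware raises
     * #PF, while the shadow walk concludes the access should succeed.
     */
    asm volatile ("mov %0, %%ds" :: "r" (sel));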

>
>> +     * In practice, this is tolerable because no OS would deliberately
>> +     * construct such a corner case, as doing so would mean it is trivially
>> +     * root-able by its own userspace.
> Just to point out, the problem with 'deliberately' is that the whole
> point of SMAP is to protect an OS from its own errors.

I don't understand your point here.  The point of 'deliberately' is to
suggest that the only way a kernel would ever get into this situation is
if it had made a mistake.

There is no plausible situation (other than testing) where such a
situation would be deliberately constructed.

>  :-)  But I agree
> that at first blush, the scenario above would be pretty difficult to do
> accidentally.  (And I certainly don't have any better ideas.)
>
> Would this perhaps be better as:
>
> "In practice, this is tolerable because it is difficult to imagine a
> scenario in which an OS would accidentally construct such a corner case;
> and if it did, SMAP would not actually protect it from being trivially
> rooted by userspace unless the attacker made a mistake and accidentally
> triggered the path described here."

The issue here is that there are plenty of ways to accidentally
construct this situation during development.  All it takes is a typo in
the pagetable code for example.

My point is that in a production OS, it isn't going to occur for a
properly configured system.  If it were to occur in a production system,
then there is definitely a security-relevant bug which has been tickled.

>
> But that's just a suggestion.  Either way:
>
> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
>

How about:

"In practice, this is tolerable.  No production OS will deliberately
construct this corner case (as doing so would mean that a system table
is directly accessible to userspace, and the OS is trivially rootable).
If this corner case comes about accidentally, then a security-relevant
bug has been tickled."

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker
  2017-02-27 14:03 ` [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker Andrew Cooper
                     ` (2 preceding siblings ...)
  2017-03-06 18:28   ` George Dunlap
@ 2017-03-07 12:57   ` George Dunlap
  3 siblings, 0 replies; 52+ messages in thread
From: George Dunlap @ 2017-03-07 12:57 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: George Dunlap, Tim Deegan, Jan Beulich

On 27/02/17 14:03, Andrew Cooper wrote:
> The existing pagetable walker has complicated return semantics, which squeeze
> multiple pieces of information into a single integer.  This would be fine if the
> information didn't overlap, but it does.
> 
> Specifically, _PAGE_INVALID_BITS for 3-level guests alias _PAGE_PAGED and
> _PAGE_SHARED.  A guest which constructs a PTE with bits 52 or 53 set (the
> start of the upper software-available range) will create a virtual address
> which, when walked by Xen, tricks Xen into believing the frame is paged or
> shared.  This behaviour was introduced by XSA-173 (c/s 8b17648).
> 
> It is also complicated to turn rc back into a normal pagefault error code.
> Instead, change the calling semantics to return a boolean indicating success,
> and have the function accumulate a real pagefault error code as it goes
> (including synthetic error codes, which do not alias hardware ones).  This
> requires an equivalent adjustment to map_domain_gfn().
> 
> Issues fixed:
>  * 2-level PSE36 superpages now return the correct translation.
>  * 2-level L2 superpages without CR0.PSE now return the correct translation.
>  * SMEP now inhibits a user instruction fetch even if NX isn't active.
>  * Supervisor writes without CR0.WP now set the leaf dirty bit.
>  * L4e._PAGE_GLOBAL is strictly reserved on AMD.
>  * 3-level l3 entries have all reserved bits checked.
>  * 3-level entries can no longer alias Xen's idea of paged or shared.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Looks good -- thanks for doing this.

One tiny comment...

> @@ -372,15 +387,16 @@ static inline unsigned int guest_walk_to_page_order(const walk_t *gw)
>   * we go.  For the purposes of reading pagetables we treat all non-RAM
>   * memory as contining zeroes.
>   * 
> - * Returns 0 for success, or the set of permission bits that we failed on 
> - * if the walk did not complete. */
> + * Returns a boolean indicating success or failure.  walk_t.pfec contains
> + * the accumulated error code on failure.
> + */

Nit: You add the "wing" to the bottom of the comment here, but not to
the top.

Other than that:

Reviewed-by: George Dunlap <george.dunlap@citrix.com>


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses
  2017-02-27 14:03 ` [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses Andrew Cooper
                     ` (3 preceding siblings ...)
  2017-03-07 10:51   ` Andrew Cooper
@ 2017-03-07 15:00   ` Paul Durrant
  4 siblings, 0 replies; 52+ messages in thread
From: Paul Durrant @ 2017-03-07 15:00 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Tim (Xen.org), George Dunlap, Jan Beulich

> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: 27 February 2017 14:03
> To: Xen-devel <xen-devel@lists.xen.org>
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Jan Beulich
> <JBeulich@suse.com>; Paul Durrant <Paul.Durrant@citrix.com>; George
> Dunlap <George.Dunlap@citrix.com>; Tim (Xen.org) <tim@xen.org>
> Subject: [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses
> 
> All actions which refer to the active ldt/gdt/idt or task register
> (e.g. loading a new segment selector) are known as implicit supervisor
> accesses, even when the access originates from user code.
> 
> The distinction is necessary in the pagewalk when SMAP is enabled.  Refer to
> Intel SDM Vol 3 "Access Rights" for the exact details.
> 
> Introduce a new pagewalk input, and make use of the new system segment
> references in hvmemul_{read,write}().  While modifying those areas, move
> the
> calculation of the appropriate pagewalk input before its first use.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Paul Durrant <paul.durrant@citrix.com>
> CC: George Dunlap <george.dunlap@eu.citrix.com>
> CC: Tim Deegan <tim@xen.org>
> ---
>  xen/arch/x86/hvm/emulate.c      | 18 ++++++++++--------
>  xen/arch/x86/mm/guest_walk.c    |  4 ++++
>  xen/include/asm-x86/processor.h |  1 +
>  3 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
> index f24d289..9b32bca 100644
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -783,6 +783,11 @@ static int __hvmemul_read(
>      struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
>      int rc;
> 
> +    if ( is_x86_system_segment(seg) )
> +        pfec |= PFEC_implicit;
> +    else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3 )
> +        pfec |= PFEC_user_mode;
> +
>      rc = hvmemul_virtual_to_linear(
>          seg, offset, bytes, &reps, access_type, hvmemul_ctxt, &addr);
>      if ( rc != X86EMUL_OKAY || !bytes )
> @@ -793,10 +798,6 @@ static int __hvmemul_read(
>           (vio->mmio_gla == (addr & PAGE_MASK)) )
>          return hvmemul_linear_mmio_read(addr, bytes, p_data, pfec,
> hvmemul_ctxt, 1);
> 
> -    if ( (seg != x86_seg_none) &&
> -         (hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3) )
> -        pfec |= PFEC_user_mode;
> -
>      rc = ((access_type == hvm_access_insn_fetch) ?
>            hvm_fetch_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo) :
>            hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo));
> @@ -893,6 +894,11 @@ static int hvmemul_write(
>      struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
>      int rc;
> 
> +    if ( is_x86_system_segment(seg) )
> +        pfec |= PFEC_implicit;
> +    else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3 )
> +        pfec |= PFEC_user_mode;
> +
>      rc = hvmemul_virtual_to_linear(
>          seg, offset, bytes, &reps, hvm_access_write, hvmemul_ctxt, &addr);
>      if ( rc != X86EMUL_OKAY || !bytes )
> @@ -902,10 +908,6 @@ static int hvmemul_write(
>           (vio->mmio_gla == (addr & PAGE_MASK)) )
>          return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec,
> hvmemul_ctxt, 1);
> 
> -    if ( (seg != x86_seg_none) &&
> -         (hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3) )
> -        pfec |= PFEC_user_mode;
> -
>      rc = hvm_copy_to_guest_linear(addr, p_data, bytes, pfec, &pfinfo);
> 
>      switch ( rc )
> diff --git a/xen/arch/x86/mm/guest_walk.c
> b/xen/arch/x86/mm/guest_walk.c
> index faaf70c..4f8d0e3 100644
> --- a/xen/arch/x86/mm/guest_walk.c
> +++ b/xen/arch/x86/mm/guest_walk.c
> @@ -161,6 +161,10 @@ guest_walk_tables(struct vcpu *v, struct
> p2m_domain *p2m,
>      bool_t pse1G = 0, pse2M = 0;
>      p2m_query_t qt = P2M_ALLOC | P2M_UNSHARE;
> 
> +    /* Only implicit supervisor data accesses exist. */
> +    ASSERT(!(pfec & PFEC_implicit) ||
> +           !(pfec & (PFEC_insn_fetch|PFEC_user_mode)));
> +
>      perfc_incr(guest_walk);
>      memset(gw, 0, sizeof(*gw));
>      gw->va = va;
> diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-
> x86/processor.h
> index dda8b83..d3ba8ea 100644
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -76,6 +76,7 @@
>  /* Internally used only flags. */
>  #define PFEC_page_paged     (1U<<16)
>  #define PFEC_page_shared    (1U<<17)
> +#define PFEC_implicit       (1U<<18) /* Pagewalk input for ldt/gdt/idt/tr
> accesses. */
> 
>  /* Other exception error code values. */
>  #define X86_XEC_EXT         (_AC(1,U) << 0)
> --
> 2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC XTF PATCH] Pagetable Emulation testing
  2017-03-06 16:42 ` [RFC XTF PATCH] Pagetable Emulation testing Andrew Cooper
@ 2017-03-13 15:45   ` Jan Beulich
  2017-03-13 17:48     ` Andrew Cooper
  0 siblings, 1 reply; 52+ messages in thread
From: Jan Beulich @ 2017-03-13 15:45 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel

>>> On 06.03.17 at 17:42, <andrew.cooper3@citrix.com> wrote:
> +struct mapping_info
> +{
> +    unsigned int level, order;
> +    void *va;
> +    intpte_t *pte, *fe_pte;

Having got quite a bit into the patch, I think I've finally understood that
"fe" here is meant to match up with "fep" - a comment might be helpful.

> +bool ex_check_pf(struct cpu_regs *regs,
> +                 const struct extable_entry *ex)
> +{
> +    if ( regs->entry_vector == X86_EXC_PF )
> +    {
> +        unsigned long cr2 = read_cr2();
> +
> +        if ( (cr2 != under_test.va) &&
> +             (cr2 != under_test.va + (LDT_SEL & ~7)) )
> +            xtf_failure("Bad %%cr2: expected %p, got %p\n",
> +                        _p(under_test.va), _p(cr2));
> +
> +        regs->ax = EXINFO_SYM(PF, regs->error_code);
> +
> +        if ( ex->fixup )
> +            regs->ip = ex->fixup;
> +        else
> +            regs->ip = *(unsigned long *)cpu_regs_sp(regs);

What does this match up with? The single RET you place at the first
byte of the page subject to testing? Surely this is rather fragile wrt
#PF happening anywhere unexpected?

> +static void prepare_mappings(struct mapping_info *m, unsigned int level, bool super, paddr_t paddr)
> +{
> +    bool pse36 = CONFIG_PAGING_LEVELS == 2 && paddr != (uint32_t)paddr;
> +
> +    memset(m, 0, sizeof(*m));
> +
> +#define PAGE_COMMON PF_SYM(AD, U, RW, P)
> +    /*
> +     * For 4-level paging, we use l4[1/2].
> +     */
> +    if ( CONFIG_PAGING_LEVELS == 4 )
> +    {
> +
> +        pae_l4_identmap[1] = (unsigned long)l3t | PAGE_COMMON;
> +        pae_l4_identmap[2] = (unsigned long)l3t | PAGE_COMMON;
> +
> +        l3t[0]   = (unsigned long)l2t | PAGE_COMMON;
> +        l3t[511] = (unsigned long)l2t | PAGE_COMMON;
> +
> +        l2t[0]   = (unsigned long)l1t | PAGE_COMMON;
> +        l2t[511] = (unsigned long)l1t | PAGE_COMMON;
> +
> +        l1t[0]   = paddr | PAGE_COMMON;
> +        l1t[511] = ((paddr - 1) & ~0xfff) | PAGE_COMMON;
> +
> +        m->va     = _p(2ULL << PAE_L4_PT_SHIFT);
> +        m->l1e    = &l1t[0];
> +        m->l2e    = &l2t[0];
> +        m->l3e    = &l3t[0];
> +        m->l4e    = _p(&pae_l4_identmap[2]);
> +        m->fe_pte = &l1t[511];
> +
> +        asm(_ASM_EXTABLE_HANDLER(2 << PAE_L4_PT_SHIFT, 0, ex_check_pf));
> +        under_test.va = (unsigned long)m->va;
> +    }
> +    else if ( CONFIG_PAGING_LEVELS == 3 )
> +    {
> +        pae32_l3_identmap[1] = (unsigned long)l2t | _PAGE_PRESENT;
> +        pae32_l3_identmap[2] = (unsigned long)l2t | _PAGE_PRESENT;
> +
> +        l2t[0]   = (unsigned long)l1t | PAGE_COMMON;
> +        l2t[511] = (unsigned long)l1t | PAGE_COMMON;
> +
> +        l1t[0]   = paddr | PAGE_COMMON;
> +        l1t[511] = ((paddr - 1) & ~0xfff) | PAGE_COMMON;
> +
> +        m->va     = _p(2ULL << PAE_L3_PT_SHIFT);
> +        m->l1e    = &l1t[0];
> +        m->l2e    = &l2t[0];
> +        m->l3e    = _p(&pae32_l3_identmap[2]);
> +        m->l4e    = NULL;
> +        m->fe_pte = &l1t[511];
> +
> +        asm(_ASM_EXTABLE_HANDLER(2 << PAE_L3_PT_SHIFT, 0, ex_check_pf));
> +        under_test.va = (unsigned long)m->va;
> +    }
> +    else if ( CONFIG_PAGING_LEVELS == 2 )
> +    {
> +        if ( pse36 )
> +        {
> +            ASSERT(super);
> +            ASSERT(IS_ALIGNED(paddr, MB(4)));
> +
> +            pse_l2_identmap[511] = fold_pse36((paddr - MB(4)) | PAGE_COMMON | _PAGE_PSE);
> +            pse_l2_identmap[512] = fold_pse36(paddr | PAGE_COMMON | _PAGE_PSE);
> +        }
> +        else
> +        {
> +            pse_l2_identmap[511] = (unsigned long)l1t | PAGE_COMMON;
> +            pse_l2_identmap[512] = (unsigned long)l1t | PAGE_COMMON;
> +
> +            l1t[0]    = paddr | PAGE_COMMON;
> +            l1t[1023] = ((paddr - 1) & ~0xfff) | PAGE_COMMON;
> +        }
> +
> +        m->va     = _p(2ULL << PAE_L3_PT_SHIFT);
> +        m->l1e    = pse36 ? NULL : &l1t[0];
> +        m->l2e    = _p(&pse_l2_identmap[512]);
> +        m->l3e    = NULL;
> +        m->l4e    = NULL;
> +        m->fe_pte = pse36 ? _p(&pse_l2_identmap[511]) : &l1t[1023];

The use of PAE constants here (including the literal 511 / 512) is
mildly confusing here without any comments.

> +        asm(_ASM_EXTABLE_HANDLER(2 << PAE_L3_PT_SHIFT, 0, ex_check_pf));
> +        under_test.va = (unsigned long)m->va;
> +    }
> +    else
> +        panic("%s() PAGING_LEVELS %u not implemented yet\n",
> +              __func__, CONFIG_PAGING_LEVELS);
> +
> +#undef PAGE_COMMON
> +
> +    /* Flush the TLB before trying to use the new mappings. */
> +    flush_tlb(false);
> +
> +    /* Put FEP immediately before va, and a ret instruction at va. */
> +    memcpy(m->va - 5, "\x0f\x0bxen\xc3", 6);
> +    barrier();
> +
> +    /* Read them back, to confirm that RAM is properly in place. */
> +    if ( memcmp(m->va - 5, "\x0f\x0bxen\xc3", 6) )
> +        panic("Bad phys or virtual setup\n");
> +
> +    /* Construct the LDT at va. */
> +    user_desc *ldt = m->va;
> +
> +    ldt[LDT_SEL >> 3] = (typeof(*ldt))INIT_GDTE_SYM(0, 0xfffff, COMMON, DATA, DPL3, B, W);

This dual use of m->va clearly needs a comment, perhaps next to
the definition of LDT_SEL (which can't be freely set to whatever one
may like - it namely can't be using LDT slot 0).

> +    gdt[GDTE_AVAIL0]  = (typeof(*gdt))INIT_GDTE((unsigned long)m->va, PAGE_SIZE, 0x82);
> +#if __x86_64__

#ifdef ?

> +void check(struct mapping_info *m, exinfo_t actual, exinfo_t expected, enum modifier mod)
> +{
> +    /* Check that the actual pagefault matched our expectation. */
> +    if ( actual != expected )
> +    {
> +        const char *user_sup = under_test.user ? "User" : "Supervisor";
> +        bool ac_fault = !!actual, ex_fault = !!expected;

Stray !! ?

> +        char ac_ec[16], ex_ec[16];
> +
> +        if ( ac_fault )
> +            x86_exc_decode_ec(ac_ec, ARRAY_SIZE(ac_ec),
> +                              X86_EXC_PF, (uint16_t)actual);
> +        if ( ex_fault )
> +            x86_exc_decode_ec(ex_ec, ARRAY_SIZE(ex_ec),
> +                              X86_EXC_PF, (uint16_t)expected);

Why the casts on the last function arguments?

> +        if ( ac_fault && !ex_fault )
> +            fail("    Fail: expected no fault, got #PF[%s] for %s %s\n",
> +                 ac_ec, user_sup, under_test.desc);
> +        else if ( !ac_fault && ex_fault )
> +            fail("    Fail: expected #PF[%s], got no fault for %s %s\n",
> +                 ex_ec, user_sup, under_test.desc);
> +        else
> +            fail("    Fail: expected #PF[%s], got #PF[%s] for %s %s\n",
> +                 ex_ec, ac_ec,  user_sup, under_test.desc);

From looking just at this function ex_ec[] and ac_ec[] may be
used uninitialized here (when !ac_fault && !ex_fault). With the
function not being static, I'd actually expect compilers to be
able to spot and warn about this.

> +    }
> +
> +    /* Check that A/D bits got updated as expected. */
> +    unsigned int leaf_level =
> +        m->order ? ((m->order - PAGE_SHIFT) / PT_ORDER) : 0;
> +    unsigned int i; /* NB - Levels are 0-indexed. */

Do you enable C99 mode somewhere? Otherwise I'd expect the
compiler to warn about mixing declarations and code.

> +    if ( amd_fam10_erratum )
> +    {
> +        /*
> +         * AMD Fam10 appears to defer the setting of access bits for implicit
> +         * loads.  As a result, the implicit tests (which load %fs) don't
> +         * necessarily observe the access bits being set on the pagewalk to
> +         * the LDT.
> +         *
> +         * Experimentally, a serialising instruction fixes things, or a loop
> +         * of 1000 nops, but so does forcing the use of the loaded segment.
> +         *
> +         * If this is an implicit load which didn't fault, read through %fs to
> +         * force it to be loaded into the segment cache.

The load into the segment cache certainly has at least started, so
maybe the better (more precise) wording would be something like
"... force the load into the segment cache to complete"?

> +exinfo_t calc(struct mapping_info *m, uint64_t new, unsigned int walk, enum modifier mod)

To anyone other than you, the author of the code, it would help if the
name of the function indicated what it wants to calculate.

> +{
> +    bool nx_valid   = CONFIG_PAGING_LEVELS >= 3 && (host_nx_leaked || (mod & NX));
> +    bool insn_valid = nx_valid || (mod & SMEP);
> +    uint64_t rsvd = ((1ULL << 52) - 1) & ~((1ULL << maxphysaddr) - 1);
> +
> +    /* Accumulate additional bits which are reserved. */
> +    if ( !nx_valid )
> +        rsvd |= _PAGE_NX;
> +
> +    if ( m->level == 4 )
> +        rsvd |= _PAGE_PSE | (vendor_is_amd ? _PAGE_GLOBAL : 0);
> +    else if ( m->level == 3 && !cpu_has_page1gb )
> +        rsvd |= _PAGE_PSE;

Shouldn't this second conditional include CONFIG_PAGING_LEVELS > 3?
Certainly that bit is reserved there too, but there are quite a few
more.

> +void test_pte(const struct stubs *stubs, struct mapping_info *m, uint64_t  overlay)
> +{
> +    uint64_t new = m->paddr | overlay;
> +    bool user = false;
> +
> +    under_test.pteval = *m->pte = new;
> +    under_test.pte_printed = false;
> +    clear_ad(m);
> +
> +    under_test.active = true;
> +
> +    for ( ; ; user = true )
> +    {
> +        unsigned int base = user ? PFEC(U) : 0;
> +
> +#define CALL(fn, va)                                                    \
> +        (user ? exec_user_param(fn ## _user, (unsigned long)(va))       \
> +              : fn((unsigned long)(va)))
> +
> +        under_test.user = user;
> +
> +        /* Map the exec FEP stub with suitable permissions. */
> +        if ( stubs == &force_stubs )
> +        {
> +            *m->fe_pte &= ~_PAGE_USER;
> +            if ( user )
> +                *m->fe_pte |= _PAGE_USER;
> +            invlpg(m->va - 5);

Why do you need to do the for the FEP prefix, but not for the
rest of the stub?

> --- /dev/null
> +++ b/tests/pagetable-emulation/stubs.h
> @@ -0,0 +1,25 @@
> +#ifndef _LOWLEVEL_H_
> +#define _LOWLEVEL_H_
> +
> +unsigned long stub_read(unsigned long va);
> +unsigned long stub_implicit(unsigned long sel); /* unsigned long sel */

What are these comments intended to tell the reader?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC XTF PATCH] Pagetable Emulation testing
  2017-03-13 15:45   ` Jan Beulich
@ 2017-03-13 17:48     ` Andrew Cooper
  2017-03-14 11:17       ` Jan Beulich
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Cooper @ 2017-03-13 17:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Tim Deegan, Xen-devel

On 13/03/17 15:45, Jan Beulich wrote:
>>>> On 06.03.17 at 17:42, <andrew.cooper3@citrix.com> wrote:
>> +struct mapping_info
>> +{
>> +    unsigned int level, order;
>> +    void *va;
>> +    intpte_t *pte, *fe_pte;
> Having got quite a bit into the patch, I think I've finally understood that
> "fe" here is meant to match up with "fep" - a comment might be helpful.

Ah yes. It is a pointer to the PTE which maps the force emulation prefix.

In particular, its _PAGE_USER setting needs to match cpl so we don't
take unrelated instruction fetch faults while setting up an emulated
instruction fetch.
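
Something like this against the declaration, perhaps (wording is only a
sketch):

    intpte_t *pte;    /* Entry under test, mapping va. */
    intpte_t *fe_pte; /* Entry mapping the Force Emulation Prefix; its
                       * _PAGE_USER setting is kept matching cpl, so the
                       * FEP fetch itself never takes an unrelated #PF. */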

>
>> +bool ex_check_pf(struct cpu_regs *regs,
>> +                 const struct extable_entry *ex)
>> +{
>> +    if ( regs->entry_vector == X86_EXC_PF )
>> +    {
>> +        unsigned long cr2 = read_cr2();
>> +
>> +        if ( (cr2 != under_test.va) &&
>> +             (cr2 != under_test.va + (LDT_SEL & ~7)) )
>> +            xtf_failure("Bad %%cr2: expected %p, got %p\n",
>> +                        _p(under_test.va), _p(cr2));
>> +
>> +        regs->ax = EXINFO_SYM(PF, regs->error_code);
>> +
>> +        if ( ex->fixup )
>> +            regs->ip = ex->fixup;
>> +        else
>> +            regs->ip = *(unsigned long *)cpu_regs_sp(regs);
> What does this match up with? The single RET you place at the first
> byte of the page subject to testing? Surely this is rather fragile wrt
> #PF happening anywhere unexpected?

stub exec and user exec take #PF at the target of the call instruction,
rather than on the call instruction itself.

The extable entry is anchored at the target of the call (set up by the
asm() statements in prepare_mappings()), but the fixup needs to follow
the return address on the stack to recover.

>
>> +static void prepare_mappings(struct mapping_info *m, unsigned int level, bool super, paddr_t paddr)
>> +{
>> +    bool pse36 = CONFIG_PAGING_LEVELS == 2 && paddr != (uint32_t)paddr;
>> +
>> +    memset(m, 0, sizeof(*m));
>> +
>> +#define PAGE_COMMON PF_SYM(AD, U, RW, P)
>> +    /*
>> +     * For 4-level paging, we use l4[1/2].
>> +     */
>> +    if ( CONFIG_PAGING_LEVELS == 4 )
>> +    {
>> +
>> +        pae_l4_identmap[1] = (unsigned long)l3t | PAGE_COMMON;
>> +        pae_l4_identmap[2] = (unsigned long)l3t | PAGE_COMMON;
>> +
>> +        l3t[0]   = (unsigned long)l2t | PAGE_COMMON;
>> +        l3t[511] = (unsigned long)l2t | PAGE_COMMON;
>> +
>> +        l2t[0]   = (unsigned long)l1t | PAGE_COMMON;
>> +        l2t[511] = (unsigned long)l1t | PAGE_COMMON;
>> +
>> +        l1t[0]   = paddr | PAGE_COMMON;
>> +        l1t[511] = ((paddr - 1) & ~0xfff) | PAGE_COMMON;
>> +
>> +        m->va     = _p(2ULL << PAE_L4_PT_SHIFT);
>> +        m->l1e    = &l1t[0];
>> +        m->l2e    = &l2t[0];
>> +        m->l3e    = &l3t[0];
>> +        m->l4e    = _p(&pae_l4_identmap[2]);
>> +        m->fe_pte = &l1t[511];
>> +
>> +        asm(_ASM_EXTABLE_HANDLER(2 << PAE_L4_PT_SHIFT, 0, ex_check_pf));
>> +        under_test.va = (unsigned long)m->va;
>> +    }
>> +    else if ( CONFIG_PAGING_LEVELS == 3 )
>> +    {
>> +        pae32_l3_identmap[1] = (unsigned long)l2t | _PAGE_PRESENT;
>> +        pae32_l3_identmap[2] = (unsigned long)l2t | _PAGE_PRESENT;
>> +
>> +        l2t[0]   = (unsigned long)l1t | PAGE_COMMON;
>> +        l2t[511] = (unsigned long)l1t | PAGE_COMMON;
>> +
>> +        l1t[0]   = paddr | PAGE_COMMON;
>> +        l1t[511] = ((paddr - 1) & ~0xfff) | PAGE_COMMON;
>> +
>> +        m->va     = _p(2ULL << PAE_L3_PT_SHIFT);
>> +        m->l1e    = &l1t[0];
>> +        m->l2e    = &l2t[0];
>> +        m->l3e    = _p(&pae32_l3_identmap[2]);
>> +        m->l4e    = NULL;
>> +        m->fe_pte = &l1t[511];
>> +
>> +        asm(_ASM_EXTABLE_HANDLER(2 << PAE_L3_PT_SHIFT, 0, ex_check_pf));
>> +        under_test.va = (unsigned long)m->va;
>> +    }
>> +    else if ( CONFIG_PAGING_LEVELS == 2 )
>> +    {
>> +        if ( pse36 )
>> +        {
>> +            ASSERT(super);
>> +            ASSERT(IS_ALIGNED(paddr, MB(4)));
>> +
>> +            pse_l2_identmap[511] = fold_pse36((paddr - MB(4)) | PAGE_COMMON | _PAGE_PSE);
>> +            pse_l2_identmap[512] = fold_pse36(paddr | PAGE_COMMON | _PAGE_PSE);
>> +        }
>> +        else
>> +        {
>> +            pse_l2_identmap[511] = (unsigned long)l1t | PAGE_COMMON;
>> +            pse_l2_identmap[512] = (unsigned long)l1t | PAGE_COMMON;
>> +
>> +            l1t[0]    = paddr | PAGE_COMMON;
>> +            l1t[1023] = ((paddr - 1) & ~0xfff) | PAGE_COMMON;
>> +        }
>> +
>> +        m->va     = _p(2ULL << PAE_L3_PT_SHIFT);
>> +        m->l1e    = pse36 ? NULL : &l1t[0];
>> +        m->l2e    = _p(&pse_l2_identmap[512]);
>> +        m->l3e    = NULL;
>> +        m->l4e    = NULL;
>> +        m->fe_pte = pse36 ? _p(&pse_l2_identmap[511]) : &l1t[1023];
> The use of PAE constants here (including the literal 511 / 512) is
> mildly confusing here without any comments.

That looks like a copy/paste mistake.  It should be 512 <<
PSE_L2_PT_SHIFT, which should be the same value.

I will extend the l4 comments to the l3 and l2 cases.

>
>> +        asm(_ASM_EXTABLE_HANDLER(2 << PAE_L3_PT_SHIFT, 0, ex_check_pf));
>> +        under_test.va = (unsigned long)m->va;
>> +    }
>> +    else
>> +        panic("%s() PAGING_LEVELS %u not implemented yet\n",
>> +              __func__, CONFIG_PAGING_LEVELS);
>> +
>> +#undef PAGE_COMMON
>> +
>> +    /* Flush the TLB before trying to use the new mappings. */
>> +    flush_tlb(false);
>> +
>> +    /* Put FEP immediately before va, and a ret instruction at va. */
>> +    memcpy(m->va - 5, "\x0f\x0bxen\xc3", 6);
>> +    barrier();
>> +
>> +    /* Read them back, to confirm that RAM is properly in place. */
>> +    if ( memcmp(m->va - 5, "\x0f\x0bxen\xc3", 6) )
>> +        panic("Bad phys or virtual setup\n");
>> +
>> +    /* Construct the LDT at va. */
>> +    user_desc *ldt = m->va;
>> +
>> +    ldt[LDT_SEL >> 3] = (typeof(*ldt))INIT_GDTE_SYM(0, 0xfffff, COMMON, DATA, DPL3, B, W);
> This dual use of m->va clearly needs a comment, perhaps next to
> the definition of LDT_SEL (which can't be freely set to whatever one
> may like - it namely can't be using LDT slot 0).

va is just the virtual address under test.

It gets read/write testing normally, exec testing by calling at it, and
implicit access testing by layering an LDT over the top and loading a
selector.

The choice of LDT selector to use only matters if it fits within the
mapping, is otherwise valid, and is already accessed.  In particular,
the test logic doesn't cope with hardware setting the accessed bit over
a read-only mapping, because it can't distinguish the two different
memory accesses.

>
>> +    gdt[GDTE_AVAIL0]  = (typeof(*gdt))INIT_GDTE((unsigned long)m->va, PAGE_SIZE, 0x82);
>> +#if __x86_64__
> #ifdef ?

Yes, although this variant definitely works.

>
>> +void check(struct mapping_info *m, exinfo_t actual, exinfo_t expected, enum modifier mod)
>> +{
>> +    /* Check that the actual pagefault matched our expectation. */
>> +    if ( actual != expected )
>> +    {
>> +        const char *user_sup = under_test.user ? "User" : "Supervisor";
>> +        bool ac_fault = !!actual, ex_fault = !!expected;
> Stray !! ?

Yes.  This code predates me sorting out the bool_t/bool confusion.

>
>> +        char ac_ec[16], ex_ec[16];
>> +
>> +        if ( ac_fault )
>> +            x86_exc_decode_ec(ac_ec, ARRAY_SIZE(ac_ec),
>> +                              X86_EXC_PF, (uint16_t)actual);
>> +        if ( ex_fault )
>> +            x86_exc_decode_ec(ex_ec, ARRAY_SIZE(ex_ec),
>> +                              X86_EXC_PF, (uint16_t)expected);
> Why the casts on the last function arguments?

Rebasing mistake.  Up until I re-factored the exec_user() infrastructure
(very recently) to be SMEP/SMAP safe, I had a local copy here which took
void * rather than unsigned long.

>
>> +        if ( ac_fault && !ex_fault )
>> +            fail("    Fail: expected no fault, got #PF[%s] for %s %s\n",
>> +                 ac_ec, user_sup, under_test.desc);
>> +        else if ( !ac_fault && ex_fault )
>> +            fail("    Fail: expected #PF[%s], got no fault for %s %s\n",
>> +                 ex_ec, user_sup, under_test.desc);
>> +        else
>> +            fail("    Fail: expected #PF[%s], got #PF[%s] for %s %s\n",
>> +                 ex_ec, ac_ec,  user_sup, under_test.desc);
> From looking just at this function ex_ec[] and ac_ec[] may be
> used uninitialized here (when !ac_fault && !ex_fault). With the
> function not being static, I'd actually expect compilers to be
> able to spot and warn about this.

In that case, we fail the "if ( actual != expected )" precondition, and
don't end up here.

I think it would probably be better to add %p printk() support and
have a formatter for exinfo_t, which would simplify a whole lot of code
looking like this.
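
i.e. something like this at the call sites (the %pe spelling here is
hypothetical - whichever extension we end up choosing):

    fail("    Fail: expected %pe, got %pe for %s %s\n",
         _p(expected), _p(actual), user_sup, under_test.desc);

with the formatter rendering an exinfo_t of 0 as "no fault", which would
collapse the three branches above into one.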

>
>> +    }
>> +
>> +    /* Check that A/D bits got updated as expected. */
>> +    unsigned int leaf_level =
>> +        m->order ? ((m->order - PAGE_SHIFT) / PT_ORDER) : 0;
>> +    unsigned int i; /* NB - Levels are 0-indexed. */
> Do you enable C99 mode somewhere? Otherwise I'd expect the
> compiler to warn about mixing declarations and code.

The std in use is gnu99.

>
>> +    if ( amd_fam10_erratum )
>> +    {
>> +        /*
>> +         * AMD Fam10 appears to defer the setting of access bits for implicit
>> +         * loads.  As a result, the implicit tests (which load %fs) don't
>> +         * necessarily observe the access bits being set on the pagewalk to
>> +         * the LDT.
>> +         *
>> +         * Experimentally, a serialising instruction fixes things, or a loop
>> +         * of 1000 nops, but so does forcing the use of the loaded segment.
>> +         *
>> +         * If this is an implicit load which didn't fault, read through %fs to
>> +         * force it to be loaded into the segment cache.
> The load into the segment cache certainly has at least started, so
> maybe the better (more precise) wording would be something like
> "... force the load into the segment cache to complete"?

The fact that #GP faults for bad entries appear to work properly
would suggest that the access has happened.

I can't fathom what architectural quirks would lead to observable
behaviour like this.

>
>> +exinfo_t calc(struct mapping_info *m, uint64_t new, unsigned int walk, enum modifier mod)
>> To anyone other than you, the author of the code, it would help if the
> name of the function indicated what it wants to calculate.

Fair enough.  I need to do a general documentation sweep across this
patch anyway.

>
>> +{
>> +    bool nx_valid   = CONFIG_PAGING_LEVELS >= 3 && (host_nx_leaked || (mod & NX));
>> +    bool insn_valid = nx_valid || (mod & SMEP);
>> +    uint64_t rsvd = ((1ULL << 52) - 1) & ~((1ULL << maxphysaddr) - 1);
>> +
>> +    /* Accumulate additional bits which are reserved. */
>> +    if ( !nx_valid )
>> +        rsvd |= _PAGE_NX;
>> +
>> +    if ( m->level == 4 )
>> +        rsvd |= _PAGE_PSE | (vendor_is_amd ? _PAGE_GLOBAL : 0);
>> +    else if ( m->level == 3 && !cpu_has_page1gb )
>> +        rsvd |= _PAGE_PSE;
> Shouldn't this second conditional include CONFIG_PAGING_LEVELS > 3?
> Certainly that bit is reserved there too, but there are quite a few
> more.

I don't test 3-level L3 entries.  They don't behave like other bits of
pagetable infrastructure, as they take effect at the point of mov to
%cr3, and get re-read under a number of TLB flush conditions.

They are also broken and cause vmentry failures, as Xen is left to
manage the PDPTRs itself, and does a rather poor job.

>
>> +void test_pte(const struct stubs *stubs, struct mapping_info *m, uint64_t  overlay)
>> +{
>> +    uint64_t new = m->paddr | overlay;
>> +    bool user = false;
>> +
>> +    under_test.pteval = *m->pte = new;
>> +    under_test.pte_printed = false;
>> +    clear_ad(m);
>> +
>> +    under_test.active = true;
>> +
>> +    for ( ; ; user = true )
>> +    {
>> +        unsigned int base = user ? PFEC(U) : 0;
>> +
>> +#define CALL(fn, va)                                                    \
>> +        (user ? exec_user_param(fn ## _user, (unsigned long)(va))       \
>> +              : fn((unsigned long)(va)))
>> +
>> +        under_test.user = user;
>> +
>> +        /* Map the exec FEP stub with suitable permissions. */
>> +        if ( stubs == &force_stubs )
>> +        {
>> +            *m->fe_pte &= ~_PAGE_USER;
>> +            if ( user )
>> +                *m->fe_pte |= _PAGE_USER;
>> +            invlpg(m->va - 5);
> Why do you need to do the for the FEP prefix, but not for the
> rest of the stub?

The FEP is always mapped by a different PTE to m->va.  This was quite
awkward to set up properly, to test instruction fetches.

The FEP needs to start 5 bytes before the linear address under test, as it
doesn't redirect control flow.  The linear address under test needs to be
able to have _PAGE_USER mismatched against cpl, whereas the insn fetch for
the FEP can't tolerate such a mismatch, so the two mappings need to be
separately controllable.  Therefore, they need to be mapped by different
PTEs.

A side effect of needing different PTEs, but adjacent linear addresses
is that the linear addresses must be aligned on the largest superpage
expressible in the current paging mode.
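
Concretely, for the 4-level case set up in prepare_mappings():

    va-5 .. va-1 : 0f 0b 'x' 'e' 'n' (FEP)  walked via l4[1] -> l3t[511] -> l2t[511] -> l1t[511] (fe_pte)
    va           : c3 (ret)                 walked via l4[2] -> l3t[0]   -> l2t[0]   -> l1t[0]

so toggling _PAGE_USER on fe_pte never touches any entry on va's own walk.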

>
>> --- /dev/null
>> +++ b/tests/pagetable-emulation/stubs.h
>> @@ -0,0 +1,25 @@
>> +#ifndef _LOWLEVEL_H_
>> +#define _LOWLEVEL_H_
>> +
>> +unsigned long stub_read(unsigned long va);
>> +unsigned long stub_implicit(unsigned long sel); /* unsigned long sel */
> What are these comments intended to tell the reader?

Stale.  These prototypes used to be void * va.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC XTF PATCH] Pagetable Emulation testing
  2017-03-13 17:48     ` Andrew Cooper
@ 2017-03-14 11:17       ` Jan Beulich
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Beulich @ 2017-03-14 11:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel

>>> On 13.03.17 at 18:48, <andrew.cooper3@citrix.com> wrote:
> On 13/03/17 15:45, Jan Beulich wrote:
>>>>> On 06.03.17 at 17:42, <andrew.cooper3@citrix.com> wrote:
>>> +    /* Put FEP immediately before va, and a ret instruction at va. */
>>> +    memcpy(m->va - 5, "\x0f\x0bxen\xc3", 6);
>>> +    barrier();
>>> +
>>> +    /* Read them back, to confirm that RAM is properly in place. */
>>> +    if ( memcmp(m->va - 5, "\x0f\x0bxen\xc3", 6) )
>>> +        panic("Bad phys or virtual setup\n");
>>> +
>>> +    /* Construct the LDT at va. */
>>> +    user_desc *ldt = m->va;
>>> +
>>> +    ldt[LDT_SEL >> 3] = (typeof(*ldt))INIT_GDTE_SYM(0, 0xfffff, COMMON, DATA, DPL3, B, W);
>> This dual use of m->va clearly needs a comment, perhaps next to
>> the definition of LDT_SEL (which can't be freely set to whatever one
>> may like - it namely can't be using LDT slot 0).
> 
> va is just the virtual address under test.
> 
> It gets read/write testing normally, exec testing by calling at it, and
> implicit access testing by layering an LDT over the top and loading a
> selector.
> 
> The choice of LDT selector only matters insofar as it must fit within the
> mapping, be otherwise valid, and already have its accessed bit set.  In particular,
> the test logic doesn't cope with hardware setting the accessed bit over
> a read-only mapping, because it can't distinguish the two different
> memory accesses.

But that's my point - one can't use any LDT selector one might like,
yet what restrictions there are can only be understood by detailed
reading of the code. A comment clarifying that the chosen LDT slot
must not be slot 0 and has to be within the first page would help.
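
Something along these lines would do (purely an illustrative sketch; the
selector value and the wording are assumptions, not taken from the patch):

/*
 * LDT_SEL is the selector loaded to provoke the implicit LDT access at
 * m->va.  It must not name LDT slot 0, and its descriptor must lie within
 * the first page of the mapping, as only that page is overlaid with the LDT.
 */
#define LDT_SEL 0x000f  /* slot 1, TI set (LDT), RPL 3 - assumed value */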

>>> +    gdt[GDTE_AVAIL0]  = (typeof(*gdt))INIT_GDTE((unsigned long)m->va, PAGE_SIZE, 0x82);
>>> +#if __x86_64__
>> #ifdef ?
> 
> Yes, although this variant definitely works.

Some compilers warn about such constructs when the symbol is
undefined (the symbol will never be defined here yet evaluate to zero).
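
As an aside, a minimal standalone sketch of that difference (assuming GCC or
Clang with -Wundef enabled; not part of the patch):

/* wundef.c - illustrative only.  Compile with -Wundef. */
#include <stdio.h>

#if __x86_64__          /* -Wundef warns here on a build where the macro is undefined */
static const char *via_if = "64-bit";
#else
static const char *via_if = "not 64-bit";
#endif

#ifdef __x86_64__       /* silent whether or not the macro is defined */
static const char *via_ifdef = "64-bit";
#else
static const char *via_ifdef = "not 64-bit";
#endif

int main(void)
{
    printf("%s / %s\n", via_if, via_ifdef);
    return 0;
}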

Jan



Thread overview: 52+ messages
2017-02-27 14:03 [PATCH 0/7] Fixes to pagetable handling Andrew Cooper
2017-02-27 14:03 ` [PATCH 1/7] x86/hvm: Correctly identify implicit supervisor accesses Andrew Cooper
2017-03-01 15:05   ` Jan Beulich
2017-03-02 16:14   ` Tim Deegan
2017-03-07 10:46   ` George Dunlap
2017-03-07 10:51   ` Andrew Cooper
2017-03-07 15:00   ` Paul Durrant
2017-02-27 14:03 ` [PATCH 2/7] x86/shadow: Try to correctly " Andrew Cooper
2017-03-01 15:11   ` Jan Beulich
2017-03-02 16:14   ` Tim Deegan
2017-03-07 11:26   ` George Dunlap
2017-03-07 11:55     ` Andrew Cooper
2017-02-27 14:03 ` [PATCH 3/7] x86/pagewalk: Helpers for reserved bit handling Andrew Cooper
2017-03-01 15:57   ` Jan Beulich
2017-03-02 12:23     ` Andrew Cooper
2017-03-02 14:12   ` Tim Deegan
2017-03-02 14:17     ` Andrew Cooper
2017-03-02 15:09       ` Tim Deegan
2017-03-02 15:14         ` Andrew Cooper
2017-03-02 16:15   ` Tim Deegan
2017-02-27 14:03 ` [PATCH 4/7] x86/hvm: Adjust hvm_nx_enabled() to match how Xen behaves Andrew Cooper
2017-03-01 16:00   ` Jan Beulich
2017-02-27 14:03 ` [PATCH 5/7] x86/shadow: Use the pagewalk reserved bits helpers Andrew Cooper
2017-03-01 16:03   ` Jan Beulich
2017-03-02 12:26     ` Andrew Cooper
2017-03-02 12:51       ` Jan Beulich
2017-03-02 12:56         ` Andrew Cooper
2017-03-02 13:19           ` Jan Beulich
2017-03-02 14:32             ` Andrew Cooper
2017-03-06  9:26       ` Tim Deegan
2017-03-02 14:33   ` Tim Deegan
2017-02-27 14:03 ` [PATCH 6/7] x86/pagewalk: Consistently use guest_walk_*() helpers for translation Andrew Cooper
2017-03-01 16:22   ` Jan Beulich
2017-03-01 16:33     ` Andrew Cooper
2017-03-01 16:41       ` Jan Beulich
2017-03-02 16:15   ` Tim Deegan
2017-03-06 18:25   ` George Dunlap
2017-02-27 14:03 ` [PATCH 7/7] x86/pagewalk: Re-implement the pagetable walker Andrew Cooper
2017-03-02 11:52   ` Jan Beulich
2017-03-02 12:00     ` Andrew Cooper
2017-03-02 12:54       ` Jan Beulich
2017-03-02 16:16   ` Tim Deegan
2017-03-06 18:28   ` George Dunlap
2017-03-06 18:33     ` Andrew Cooper
2017-03-06 18:39       ` George Dunlap
2017-03-07 12:57   ` George Dunlap
2017-03-01 16:24 ` [PATCH 0/7] Fixes to pagetable handling Jan Beulich
2017-03-01 16:32   ` Andrew Cooper
2017-03-06 16:42 ` [RFC XTF PATCH] Pagetable Emulation testing Andrew Cooper
2017-03-13 15:45   ` Jan Beulich
2017-03-13 17:48     ` Andrew Cooper
2017-03-14 11:17       ` Jan Beulich
