* [PATCH 0/6] x86/HVM: implement memory read caching
@ 2018-07-19 10:39 Jan Beulich
  2018-07-19 10:46 ` [PATCH 1/6] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Jan Beulich @ 2018-07-19 10:39 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Paul Durrant

Emulation requiring device model assistance uses a form of instruction
re-execution, assuming that the second (and any further) pass takes
exactly the same path. This is a valid assumption as far as use of CPU
registers goes (as those can't change without any other instruction
executing in between), but is wrong for memory accesses. In particular
it has been observed that Windows might page out buffers underneath an
instruction currently under emulation (hitting between two passes). If
the first pass translated a linear address successfully, any subsequent
pass needs to do so too, yielding the exact same translation.

Introduce a cache (used just by guest page table accesses for now) to
make sure the assumption described above holds. This is a very simplistic
implementation for now: only exact matches are satisfied (no overlaps,
partial reads, or the like).
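
As an illustration (a condensed sketch using placeholder names - table_gfn,
offset, table, entry, level - rather than the exact hunks from patches 2 and
3), each level of a guest page table walk first consults the cache and only
reads guest memory, recording the value afterwards, on a miss:

    gpa = gfn_to_gaddr(table_gfn) + offset * sizeof(entry);
    if ( !cache ||
         !hvmemul_read_cache(cache, gpa, level, &entry, sizeof(entry)) )
    {
        /* Miss: read the entry from guest memory ... */
        entry = table[offset];
        /* ... and record it for any re-execution of the same insn. */
        if ( cache )
            hvmemul_write_cache(cache, gpa, level, &entry, sizeof(entry));
    }

A second emulation pass run with the same cache therefore observes the same
translations, even if the guest has modified or paged out its page tables in
the meantime.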

There's also some seemingly unrelated cleanup here which was found
desirable on the way.

1: x86/mm: add optional cache to GLA->GFN translation
2: x86/mm: use optional cache in guest_walk_tables()
3: x86/HVM: implement memory read caching
4: VMX: correct PDPTE load checks
5: x86/HVM: prefill cache with PDPTEs when possible
6: x86/shadow: a little bit of style cleanup

Jan




* [PATCH 1/6] x86/mm: add optional cache to GLA->GFN translation
  2018-07-19 10:39 [PATCH 0/6] x86/HVM: implement memory read caching Jan Beulich
@ 2018-07-19 10:46 ` Jan Beulich
  2018-07-19 13:12   ` Paul Durrant
  2018-07-19 10:47 ` [PATCH 2/6] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2018-07-19 10:46 UTC (permalink / raw)
  To: xen-devel
  Cc: tamas, Razvan Cojocaru, George Dunlap, Andrew Cooper, Tim Deegan,
	Paul Durrant

The caching isn't actually implemented here; this is just setting the
stage.

While touching these functions anyway, also
- make their return values gfn_t,
- rename gva -> gla in their names,
- name their input arguments gla.

At the use sites, do the conversion to/from gfn_t as appropriate.
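
For example, a typical converted call site ends up looking roughly like this
(a sketch only: curr, addr, pfec and paddr are taken from the context below,
while the error handling shown is merely a placeholder):

    gfn_t gfn = paging_gla_to_gfn(curr, addr, &pfec, NULL);

    if ( gfn_eq(gfn, INVALID_GFN) )
        return X86EMUL_RETRY; /* placeholder; see the real hunks below */

    *paddr = gfn_to_gaddr(gfn) | (addr & ~PAGE_MASK);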

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -51,7 +51,7 @@ dbg_hvm_va2mfn(dbgva_t vaddr, struct dom
 
     DBGP2("vaddr:%lx domid:%d\n", vaddr, dp->domain_id);
 
-    *gfn = _gfn(paging_gva_to_gfn(dp->vcpu[0], vaddr, &pfec));
+    *gfn = paging_gla_to_gfn(dp->vcpu[0], vaddr, &pfec, NULL);
     if ( gfn_eq(*gfn, INVALID_GFN) )
     {
         DBGP2("kdb:bad gfn from gva_to_gfn\n");
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -678,7 +678,8 @@ static int hvmemul_linear_to_phys(
     struct hvm_emulate_ctxt *hvmemul_ctxt)
 {
     struct vcpu *curr = current;
-    unsigned long pfn, npfn, done, todo, i, offset = addr & ~PAGE_MASK;
+    gfn_t gfn, ngfn;
+    unsigned long done, todo, i, offset = addr & ~PAGE_MASK;
     int reverse;
 
     /*
@@ -700,15 +701,17 @@ static int hvmemul_linear_to_phys(
     if ( reverse && ((PAGE_SIZE - offset) < bytes_per_rep) )
     {
         /* Do page-straddling first iteration forwards via recursion. */
-        paddr_t _paddr;
+        paddr_t gaddr;
         unsigned long one_rep = 1;
         int rc = hvmemul_linear_to_phys(
-            addr, &_paddr, bytes_per_rep, &one_rep, pfec, hvmemul_ctxt);
+            addr, &gaddr, bytes_per_rep, &one_rep, pfec, hvmemul_ctxt);
+
         if ( rc != X86EMUL_OKAY )
             return rc;
-        pfn = _paddr >> PAGE_SHIFT;
+        gfn = gaddr_to_gfn(gaddr);
     }
-    else if ( (pfn = paging_gva_to_gfn(curr, addr, &pfec)) == gfn_x(INVALID_GFN) )
+    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, NULL),
+                     INVALID_GFN) )
     {
         if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
             return X86EMUL_RETRY;
@@ -723,11 +726,11 @@ static int hvmemul_linear_to_phys(
     {
         /* Get the next PFN in the range. */
         addr += reverse ? -PAGE_SIZE : PAGE_SIZE;
-        npfn = paging_gva_to_gfn(curr, addr, &pfec);
+        ngfn = paging_gla_to_gfn(curr, addr, &pfec, NULL);
 
         /* Is it contiguous with the preceding PFNs? If not then we're done. */
-        if ( (npfn == gfn_x(INVALID_GFN)) ||
-             (npfn != (pfn + (reverse ? -i : i))) )
+        if ( gfn_eq(ngfn, INVALID_GFN) ||
+             !gfn_eq(ngfn, gfn_add(gfn, reverse ? -i : i)) )
         {
             if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
                 return X86EMUL_RETRY;
@@ -735,7 +738,7 @@ static int hvmemul_linear_to_phys(
             if ( done == 0 )
             {
                 ASSERT(!reverse);
-                if ( npfn != gfn_x(INVALID_GFN) )
+                if ( !gfn_eq(ngfn, INVALID_GFN) )
                     return X86EMUL_UNHANDLEABLE;
                 *reps = 0;
                 x86_emul_pagefault(pfec, addr & PAGE_MASK, &hvmemul_ctxt->ctxt);
@@ -748,7 +751,8 @@ static int hvmemul_linear_to_phys(
         done += PAGE_SIZE;
     }
 
-    *paddr = ((paddr_t)pfn << PAGE_SHIFT) | offset;
+    *paddr = gfn_to_gaddr(gfn) | offset;
+
     return X86EMUL_OKAY;
 }
     
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2684,7 +2684,7 @@ static void *hvm_map_entry(unsigned long
      * treat it as a kernel-mode read (i.e. no access checks).
      */
     pfec = PFEC_page_present;
-    gfn = paging_gva_to_gfn(current, va, &pfec);
+    gfn = gfn_x(paging_gla_to_gfn(current, va, &pfec, NULL));
     if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
         goto fail;
 
@@ -3115,7 +3115,7 @@ enum hvm_translation_result hvm_translat
 
     if ( linear )
     {
-        gfn = _gfn(paging_gva_to_gfn(v, addr, &pfec));
+        gfn = paging_gla_to_gfn(v, addr, &pfec, NULL);
 
         if ( gfn_eq(gfn, INVALID_GFN) )
         {
--- a/xen/arch/x86/hvm/monitor.c
+++ b/xen/arch/x86/hvm/monitor.c
@@ -130,7 +130,7 @@ static inline unsigned long gfn_of_rip(u
 
     hvm_get_segment_register(curr, x86_seg_cs, &sreg);
 
-    return paging_gva_to_gfn(curr, sreg.base + rip, &pfec);
+    return gfn_x(paging_gla_to_gfn(curr, sreg.base + rip, &pfec, NULL));
 }
 
 int hvm_monitor_debug(unsigned long rip, enum hvm_monitor_debug_type type,
--- a/xen/arch/x86/mm/guest_walk.c
+++ b/xen/arch/x86/mm/guest_walk.c
@@ -81,8 +81,9 @@ static bool set_ad_bits(guest_intpte_t *
  */
 bool
 guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
-                  unsigned long va, walk_t *gw,
-                  uint32_t walk, mfn_t top_mfn, void *top_map)
+                  unsigned long gla, walk_t *gw, uint32_t walk,
+                  gfn_t top_gfn, mfn_t top_mfn, void *top_map,
+                  struct hvmemul_cache *cache)
 {
     struct domain *d = v->domain;
     p2m_type_t p2mt;
@@ -116,7 +117,7 @@ guest_walk_tables(struct vcpu *v, struct
 
     perfc_incr(guest_walk);
     memset(gw, 0, sizeof(*gw));
-    gw->va = va;
+    gw->va = gla;
     gw->pfec = walk & (PFEC_user_mode | PFEC_write_access);
 
     /*
@@ -133,7 +134,7 @@ guest_walk_tables(struct vcpu *v, struct
     /* Get the l4e from the top level table and check its flags*/
     gw->l4mfn = top_mfn;
     l4p = (guest_l4e_t *) top_map;
-    gw->l4e = l4p[guest_l4_table_offset(va)];
+    gw->l4e = l4p[guest_l4_table_offset(gla)];
     gflags = guest_l4e_get_flags(gw->l4e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -163,7 +164,7 @@ guest_walk_tables(struct vcpu *v, struct
     }
 
     /* Get the l3e and check its flags*/
-    gw->l3e = l3p[guest_l3_table_offset(va)];
+    gw->l3e = l3p[guest_l3_table_offset(gla)];
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -205,7 +206,7 @@ guest_walk_tables(struct vcpu *v, struct
 
         /* Increment the pfn by the right number of 4k pages. */
         start = _gfn((gfn_x(start) & ~GUEST_L3_GFN_MASK) +
-                     ((va >> PAGE_SHIFT) & GUEST_L3_GFN_MASK));
+                     ((gla >> PAGE_SHIFT) & GUEST_L3_GFN_MASK));
         gw->l1e = guest_l1e_from_gfn(start, flags);
         gw->l2mfn = gw->l1mfn = INVALID_MFN;
         leaf_level = 3;
@@ -215,7 +216,7 @@ guest_walk_tables(struct vcpu *v, struct
 #else /* PAE only... */
 
     /* Get the l3e and check its flag */
-    gw->l3e = ((guest_l3e_t *) top_map)[guest_l3_table_offset(va)];
+    gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -242,14 +243,14 @@ guest_walk_tables(struct vcpu *v, struct
     }
 
     /* Get the l2e */
-    gw->l2e = l2p[guest_l2_table_offset(va)];
+    gw->l2e = l2p[guest_l2_table_offset(gla)];
 
 #else /* 32-bit only... */
 
     /* Get l2e from the top level table */
     gw->l2mfn = top_mfn;
     l2p = (guest_l2e_t *) top_map;
-    gw->l2e = l2p[guest_l2_table_offset(va)];
+    gw->l2e = l2p[guest_l2_table_offset(gla)];
 
 #endif /* All levels... */
 
@@ -310,7 +311,7 @@ guest_walk_tables(struct vcpu *v, struct
 
         /* Increment the pfn by the right number of 4k pages. */
         start = _gfn((gfn_x(start) & ~GUEST_L2_GFN_MASK) +
-                     guest_l1_table_offset(va));
+                     guest_l1_table_offset(gla));
 #if GUEST_PAGING_LEVELS == 2
          /* Wider than 32 bits if PSE36 superpage. */
         gw->el1e = (gfn_x(start) << PAGE_SHIFT) | flags;
@@ -334,7 +335,7 @@ guest_walk_tables(struct vcpu *v, struct
         gw->pfec |= rc & PFEC_synth_mask;
         goto out;
     }
-    gw->l1e = l1p[guest_l1_table_offset(va)];
+    gw->l1e = l1p[guest_l1_table_offset(gla)];
     gflags = guest_l1e_get_flags(gw->l1e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -443,22 +444,22 @@ guest_walk_tables(struct vcpu *v, struct
         break;
 
     case 1:
-        if ( set_ad_bits(&l1p[guest_l1_table_offset(va)].l1, &gw->l1e.l1,
+        if ( set_ad_bits(&l1p[guest_l1_table_offset(gla)].l1, &gw->l1e.l1,
                          (walk & PFEC_write_access)) )
             paging_mark_dirty(d, gw->l1mfn);
         /* Fallthrough */
     case 2:
-        if ( set_ad_bits(&l2p[guest_l2_table_offset(va)].l2, &gw->l2e.l2,
+        if ( set_ad_bits(&l2p[guest_l2_table_offset(gla)].l2, &gw->l2e.l2,
                          (walk & PFEC_write_access) && leaf_level == 2) )
             paging_mark_dirty(d, gw->l2mfn);
         /* Fallthrough */
 #if GUEST_PAGING_LEVELS == 4 /* 64-bit only... */
     case 3:
-        if ( set_ad_bits(&l3p[guest_l3_table_offset(va)].l3, &gw->l3e.l3,
+        if ( set_ad_bits(&l3p[guest_l3_table_offset(gla)].l3, &gw->l3e.l3,
                          (walk & PFEC_write_access) && leaf_level == 3) )
             paging_mark_dirty(d, gw->l3mfn);
 
-        if ( set_ad_bits(&l4p[guest_l4_table_offset(va)].l4, &gw->l4e.l4,
+        if ( set_ad_bits(&l4p[guest_l4_table_offset(gla)].l4, &gw->l4e.l4,
                          false) )
             paging_mark_dirty(d, gw->l4mfn);
 #endif
--- a/xen/arch/x86/mm/hap/guest_walk.c
+++ b/xen/arch/x86/mm/hap/guest_walk.c
@@ -26,8 +26,8 @@ asm(".file \"" __OBJECT_FILE__ "\"");
 #include <xen/sched.h>
 #include "private.h" /* for hap_gva_to_gfn_* */
 
-#define _hap_gva_to_gfn(levels) hap_gva_to_gfn_##levels##_levels
-#define hap_gva_to_gfn(levels) _hap_gva_to_gfn(levels)
+#define _hap_gla_to_gfn(levels) hap_gla_to_gfn_##levels##_levels
+#define hap_gla_to_gfn(levels) _hap_gla_to_gfn(levels)
 
 #define _hap_p2m_ga_to_gfn(levels) hap_p2m_ga_to_gfn_##levels##_levels
 #define hap_p2m_ga_to_gfn(levels) _hap_p2m_ga_to_gfn(levels)
@@ -39,16 +39,10 @@ asm(".file \"" __OBJECT_FILE__ "\"");
 #include <asm/guest_pt.h>
 #include <asm/p2m.h>
 
-unsigned long hap_gva_to_gfn(GUEST_PAGING_LEVELS)(
-    struct vcpu *v, struct p2m_domain *p2m, unsigned long gva, uint32_t *pfec)
-{
-    unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3];
-    return hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(v, p2m, cr3, gva, pfec, NULL);
-}
-
-unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
+static unsigned long ga_to_gfn(
     struct vcpu *v, struct p2m_domain *p2m, unsigned long cr3,
-    paddr_t ga, uint32_t *pfec, unsigned int *page_order)
+    paddr_t ga, uint32_t *pfec, unsigned int *page_order,
+    struct hvmemul_cache *cache)
 {
     bool walk_ok;
     mfn_t top_mfn;
@@ -91,7 +85,8 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
 #if GUEST_PAGING_LEVELS == 3
     top_map += (cr3 & ~(PAGE_MASK | 31));
 #endif
-    walk_ok = guest_walk_tables(v, p2m, ga, &gw, *pfec, top_mfn, top_map);
+    walk_ok = guest_walk_tables(v, p2m, ga, &gw, *pfec,
+                                top_gfn, top_mfn, top_map, cache);
     unmap_domain_page(top_map);
     put_page(top_page);
 
@@ -137,6 +132,21 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
     return gfn_x(INVALID_GFN);
 }
 
+gfn_t hap_gla_to_gfn(GUEST_PAGING_LEVELS)(
+    struct vcpu *v, struct p2m_domain *p2m, unsigned long gla, uint32_t *pfec,
+    struct hvmemul_cache *cache)
+{
+    unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3];
+
+    return _gfn(ga_to_gfn(v, p2m, cr3, gla, pfec, NULL, cache));
+}
+
+unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
+    struct vcpu *v, struct p2m_domain *p2m, unsigned long cr3,
+    paddr_t ga, uint32_t *pfec, unsigned int *page_order)
+{
+    return ga_to_gfn(v, p2m, cr3, ga, pfec, page_order, NULL);
+}
 
 /*
  * Local variables:
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -743,10 +743,11 @@ hap_write_p2m_entry(struct domain *d, un
         p2m_flush_nestedp2m(d);
 }
 
-static unsigned long hap_gva_to_gfn_real_mode(
-    struct vcpu *v, struct p2m_domain *p2m, unsigned long gva, uint32_t *pfec)
+static gfn_t hap_gla_to_gfn_real_mode(
+    struct vcpu *v, struct p2m_domain *p2m, unsigned long gla, uint32_t *pfec,
+    struct hvmemul_cache *cache)
 {
-    return ((paddr_t)gva >> PAGE_SHIFT);
+    return gaddr_to_gfn(gla);
 }
 
 static unsigned long hap_p2m_ga_to_gfn_real_mode(
@@ -762,7 +763,7 @@ static unsigned long hap_p2m_ga_to_gfn_r
 static const struct paging_mode hap_paging_real_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_real_mode,
+    .gla_to_gfn             = hap_gla_to_gfn_real_mode,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_real_mode,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
@@ -773,7 +774,7 @@ static const struct paging_mode hap_pagi
 static const struct paging_mode hap_paging_protected_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_2_levels,
+    .gla_to_gfn             = hap_gla_to_gfn_2_levels,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_2_levels,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
@@ -784,7 +785,7 @@ static const struct paging_mode hap_pagi
 static const struct paging_mode hap_paging_pae_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_3_levels,
+    .gla_to_gfn             = hap_gla_to_gfn_3_levels,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_3_levels,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
@@ -795,7 +796,7 @@ static const struct paging_mode hap_pagi
 static const struct paging_mode hap_paging_long_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_4_levels,
+    .gla_to_gfn             = hap_gla_to_gfn_4_levels,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_4_levels,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
--- a/xen/arch/x86/mm/hap/private.h
+++ b/xen/arch/x86/mm/hap/private.h
@@ -24,18 +24,21 @@
 /********************************************/
 /*          GUEST TRANSLATION FUNCS         */
 /********************************************/
-unsigned long hap_gva_to_gfn_2_levels(struct vcpu *v,
-                                     struct p2m_domain *p2m,
-                                     unsigned long gva, 
-                                     uint32_t *pfec);
-unsigned long hap_gva_to_gfn_3_levels(struct vcpu *v,
-                                     struct p2m_domain *p2m,
-                                     unsigned long gva, 
-                                     uint32_t *pfec);
-unsigned long hap_gva_to_gfn_4_levels(struct vcpu *v,
-                                     struct p2m_domain *p2m,
-                                     unsigned long gva, 
-                                     uint32_t *pfec);
+gfn_t hap_gla_to_gfn_2_levels(struct vcpu *v,
+                              struct p2m_domain *p2m,
+                              unsigned long gla,
+                              uint32_t *pfec,
+                              struct hvmemul_cache *cache);
+gfn_t hap_gla_to_gfn_3_levels(struct vcpu *v,
+                              struct p2m_domain *p2m,
+                              unsigned long gla,
+                              uint32_t *pfec,
+                              struct hvmemul_cache *cache);
+gfn_t hap_gla_to_gfn_4_levels(struct vcpu *v,
+                              struct p2m_domain *p2m,
+                              unsigned long gla,
+                              uint32_t *pfec,
+                              struct hvmemul_cache *cache);
 
 unsigned long hap_p2m_ga_to_gfn_2_levels(struct vcpu *v,
     struct p2m_domain *p2m, unsigned long cr3,
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1968,16 +1968,16 @@ void np2m_schedule(int dir)
     }
 }
 
-unsigned long paging_gva_to_gfn(struct vcpu *v,
-                                unsigned long va,
-                                uint32_t *pfec)
+gfn_t paging_gla_to_gfn(struct vcpu *v, unsigned long gla, uint32_t *pfec,
+                        struct hvmemul_cache *cache)
 {
     struct p2m_domain *hostp2m = p2m_get_hostp2m(v->domain);
     const struct paging_mode *hostmode = paging_get_hostmode(v);
 
     if ( is_hvm_vcpu(v) && paging_mode_hap(v->domain) && nestedhvm_is_n2(v) )
     {
-        unsigned long l2_gfn, l1_gfn;
+        gfn_t l2_gfn;
+        unsigned long l1_gfn;
         struct p2m_domain *p2m;
         const struct paging_mode *mode;
         uint8_t l1_p2ma;
@@ -1987,31 +1987,31 @@ unsigned long paging_gva_to_gfn(struct v
         /* translate l2 guest va into l2 guest gfn */
         p2m = p2m_get_nestedp2m(v);
         mode = paging_get_nestedmode(v);
-        l2_gfn = mode->gva_to_gfn(v, p2m, va, pfec);
+        l2_gfn = mode->gla_to_gfn(v, p2m, gla, pfec, cache);
 
-        if ( l2_gfn == gfn_x(INVALID_GFN) )
-            return gfn_x(INVALID_GFN);
+        if ( gfn_eq(l2_gfn, INVALID_GFN) )
+            return INVALID_GFN;
 
         /* translate l2 guest gfn into l1 guest gfn */
-        rv = nestedhap_walk_L1_p2m(v, l2_gfn, &l1_gfn, &l1_page_order, &l1_p2ma,
-                                   1,
+        rv = nestedhap_walk_L1_p2m(v, gfn_x(l2_gfn), &l1_gfn, &l1_page_order,
+                                   &l1_p2ma, 1,
                                    !!(*pfec & PFEC_write_access),
                                    !!(*pfec & PFEC_insn_fetch));
 
         if ( rv != NESTEDHVM_PAGEFAULT_DONE )
-            return gfn_x(INVALID_GFN);
+            return INVALID_GFN;
 
         /*
          * Sanity check that l1_gfn can be used properly as a 4K mapping, even
          * if it mapped by a nested superpage.
          */
-        ASSERT((l2_gfn & ((1ul << l1_page_order) - 1)) ==
+        ASSERT((gfn_x(l2_gfn) & ((1ul << l1_page_order) - 1)) ==
                (l1_gfn & ((1ul << l1_page_order) - 1)));
 
-        return l1_gfn;
+        return _gfn(l1_gfn);
     }
 
-    return hostmode->gva_to_gfn(v, hostp2m, va, pfec);
+    return hostmode->gla_to_gfn(v, hostp2m, gla, pfec, cache);
 }
 
 /*
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1699,15 +1699,15 @@ static unsigned int shadow_get_allocatio
 static mfn_t emulate_gva_to_mfn(struct vcpu *v, unsigned long vaddr,
                                 struct sh_emulate_ctxt *sh_ctxt)
 {
-    unsigned long gfn;
+    gfn_t gfn;
     struct page_info *page;
     mfn_t mfn;
     p2m_type_t p2mt;
     uint32_t pfec = PFEC_page_present | PFEC_write_access;
 
     /* Translate the VA to a GFN. */
-    gfn = paging_get_hostmode(v)->gva_to_gfn(v, NULL, vaddr, &pfec);
-    if ( gfn == gfn_x(INVALID_GFN) )
+    gfn = paging_get_hostmode(v)->gla_to_gfn(v, NULL, vaddr, &pfec, NULL);
+    if ( gfn_eq(gfn, INVALID_GFN) )
     {
         x86_emul_pagefault(pfec, vaddr, &sh_ctxt->ctxt);
 
@@ -1717,7 +1717,7 @@ static mfn_t emulate_gva_to_mfn(struct v
     /* Translate the GFN to an MFN. */
     ASSERT(!paging_locked_by_me(v->domain));
 
-    page = get_page_from_gfn(v->domain, gfn, &p2mt, P2M_ALLOC);
+    page = get_page_from_gfn(v->domain, gfn_x(gfn), &p2mt, P2M_ALLOC);
 
     /* Sanity checking. */
     if ( page == NULL )
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -173,17 +173,20 @@ delete_shadow_status(struct domain *d, m
 
 static inline bool
 sh_walk_guest_tables(struct vcpu *v, unsigned long va, walk_t *gw,
-                     uint32_t pfec)
+                     uint32_t pfec, struct hvmemul_cache *cache)
 {
     return guest_walk_tables(v, p2m_get_hostp2m(v->domain), va, gw, pfec,
+                             _gfn(paging_mode_external(v->domain)
+                                  ? cr3_pa(v->arch.hvm_vcpu.guest_cr[3]) >> PAGE_SHIFT
+                                  : pagetable_get_pfn(v->arch.guest_table)),
 #if GUEST_PAGING_LEVELS == 3 /* PAE */
                              INVALID_MFN,
-                             v->arch.paging.shadow.gl3e
+                             v->arch.paging.shadow.gl3e,
 #else /* 32 or 64 */
                              pagetable_get_mfn(v->arch.guest_table),
-                             v->arch.paging.shadow.guest_vtable
+                             v->arch.paging.shadow.guest_vtable,
 #endif
-                             );
+                             cache);
 }
 
 /* This validation is called with lock held, and after write permission
@@ -3035,7 +3038,7 @@ static int sh_page_fault(struct vcpu *v,
      * shadow page table. */
     version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
     smp_rmb();
-    walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
+    walk_ok = sh_walk_guest_tables(v, va, &gw, error_code, NULL);
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
     regs->error_code &= ~PFEC_page_present;
@@ -3675,9 +3678,9 @@ static bool sh_invlpg(struct vcpu *v, un
 }
 
 
-static unsigned long
-sh_gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
-    unsigned long va, uint32_t *pfec)
+static gfn_t
+sh_gla_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
+    unsigned long gla, uint32_t *pfec, struct hvmemul_cache *cache)
 /* Called to translate a guest virtual address to what the *guest*
  * pagetables would map it to. */
 {
@@ -3687,24 +3690,25 @@ sh_gva_to_gfn(struct vcpu *v, struct p2m
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB)
     /* Check the vTLB cache first */
-    unsigned long vtlb_gfn = vtlb_lookup(v, va, *pfec);
+    unsigned long vtlb_gfn = vtlb_lookup(v, gla, *pfec);
+
     if ( vtlb_gfn != gfn_x(INVALID_GFN) )
-        return vtlb_gfn;
+        return _gfn(vtlb_gfn);
 #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
 
-    if ( !(walk_ok = sh_walk_guest_tables(v, va, &gw, *pfec)) )
+    if ( !(walk_ok = sh_walk_guest_tables(v, gla, &gw, *pfec, cache)) )
     {
         *pfec = gw.pfec;
-        return gfn_x(INVALID_GFN);
+        return INVALID_GFN;
     }
     gfn = guest_walk_to_gfn(&gw);
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB)
     /* Remember this successful VA->GFN translation for later. */
-    vtlb_insert(v, va >> PAGE_SHIFT, gfn_x(gfn), *pfec);
+    vtlb_insert(v, gla >> PAGE_SHIFT, gfn_x(gfn), *pfec);
 #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
 
-    return gfn_x(gfn);
+    return gfn;
 }
 
 
@@ -4949,7 +4953,7 @@ int sh_audit_l4_table(struct vcpu *v, mf
 const struct paging_mode sh_paging_mode = {
     .page_fault                    = sh_page_fault,
     .invlpg                        = sh_invlpg,
-    .gva_to_gfn                    = sh_gva_to_gfn,
+    .gla_to_gfn                    = sh_gla_to_gfn,
     .update_cr3                    = sh_update_cr3,
     .update_paging_modes           = shadow_update_paging_modes,
     .write_p2m_entry               = shadow_write_p2m_entry,
--- a/xen/arch/x86/mm/shadow/none.c
+++ b/xen/arch/x86/mm/shadow/none.c
@@ -43,11 +43,12 @@ static bool _invlpg(struct vcpu *v, unsi
     return true;
 }
 
-static unsigned long _gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
-                                 unsigned long va, uint32_t *pfec)
+static gfn_t _gla_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
+                         unsigned long gla, uint32_t *pfec,
+                         struct hvmemul_cache *cache)
 {
     ASSERT_UNREACHABLE();
-    return gfn_x(INVALID_GFN);
+    return INVALID_GFN;
 }
 
 static void _update_cr3(struct vcpu *v, int do_locking, bool noflush)
@@ -70,7 +71,7 @@ static void _write_p2m_entry(struct doma
 static const struct paging_mode sh_paging_none = {
     .page_fault                    = _page_fault,
     .invlpg                        = _invlpg,
-    .gva_to_gfn                    = _gva_to_gfn,
+    .gla_to_gfn                    = _gla_to_gfn,
     .update_cr3                    = _update_cr3,
     .update_paging_modes           = _update_paging_modes,
     .write_p2m_entry               = _write_p2m_entry,
--- a/xen/include/asm-x86/guest_pt.h
+++ b/xen/include/asm-x86/guest_pt.h
@@ -425,7 +425,8 @@ static inline unsigned int guest_walk_to
 
 bool
 guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m, unsigned long va,
-                  walk_t *gw, uint32_t pfec, mfn_t top_mfn, void *top_map);
+                  walk_t *gw, uint32_t pfec, gfn_t top_gfn, mfn_t top_mfn,
+                  void *top_map, struct hvmemul_cache *cache);
 
 /* Pretty-print the contents of a guest-walk */
 static inline void print_gw(const walk_t *gw)
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -53,6 +53,8 @@ struct hvm_mmio_cache {
     uint8_t buffer[32];
 };
 
+struct hvmemul_cache;
+
 struct hvm_vcpu_io {
     /* I/O request in flight to device model. */
     enum hvm_io_completion io_completion;
--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -107,10 +107,11 @@ struct paging_mode {
     int           (*page_fault            )(struct vcpu *v, unsigned long va,
                                             struct cpu_user_regs *regs);
     bool          (*invlpg                )(struct vcpu *v, unsigned long va);
-    unsigned long (*gva_to_gfn            )(struct vcpu *v,
+    gfn_t         (*gla_to_gfn            )(struct vcpu *v,
                                             struct p2m_domain *p2m,
-                                            unsigned long va,
-                                            uint32_t *pfec);
+                                            unsigned long gla,
+                                            uint32_t *pfec,
+                                            struct hvmemul_cache *cache);
     unsigned long (*p2m_ga_to_gfn         )(struct vcpu *v,
                                             struct p2m_domain *p2m,
                                             unsigned long cr3,
@@ -246,9 +247,10 @@ void paging_invlpg(struct vcpu *v, unsig
  * SDM Intel 64 Volume 3, Chapter Paging, PAGE-FAULT EXCEPTIONS:
  * The PFEC_insn_fetch flag is set only when NX or SMEP are enabled.
  */
-unsigned long paging_gva_to_gfn(struct vcpu *v,
-                                unsigned long va,
-                                uint32_t *pfec);
+gfn_t paging_gla_to_gfn(struct vcpu *v,
+                        unsigned long va,
+                        uint32_t *pfec,
+                        struct hvmemul_cache *cache);
 
 /* Translate a guest address using a particular CR3 value.  This is used
  * to by nested HAP code, to walk the guest-supplied NPT tables as if





* [PATCH 2/6] x86/mm: use optional cache in guest_walk_tables()
  2018-07-19 10:39 [PATCH 0/6] x86/HVM: implement memory read caching Jan Beulich
  2018-07-19 10:46 ` [PATCH 1/6] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
@ 2018-07-19 10:47 ` Jan Beulich
  2018-07-19 13:22   ` Paul Durrant
  2018-07-19 10:48 ` [PATCH 3/6] x86/HVM: implement memory read caching Jan Beulich
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2018-07-19 10:47 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Paul Durrant

The caching isn't actually implemented here; this is just setting the
stage.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2572,6 +2572,18 @@ void hvm_dump_emulation_state(const char
            hvmemul_ctxt->insn_buf);
 }
 
+bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t gpa,
+                        unsigned int level, void *buffer, unsigned int size)
+{
+    return false;
+}
+
+void hvmemul_write_cache(struct hvmemul_cache *cache, paddr_t gpa,
+                         unsigned int level, const void *buffer,
+                         unsigned int size)
+{
+}
+
 /*
  * Local variables:
  * mode: C
--- a/xen/arch/x86/mm/guest_walk.c
+++ b/xen/arch/x86/mm/guest_walk.c
@@ -94,6 +94,7 @@ guest_walk_tables(struct vcpu *v, struct
     guest_l4e_t *l4p;
 #endif
     uint32_t gflags, rc;
+    paddr_t gpa;
     unsigned int leaf_level;
     p2m_query_t qt = P2M_ALLOC | P2M_UNSHARE;
 
@@ -134,7 +135,15 @@ guest_walk_tables(struct vcpu *v, struct
     /* Get the l4e from the top level table and check its flags*/
     gw->l4mfn = top_mfn;
     l4p = (guest_l4e_t *) top_map;
-    gw->l4e = l4p[guest_l4_table_offset(gla)];
+    gpa = gfn_to_gaddr(top_gfn) +
+          guest_l4_table_offset(gla) * sizeof(guest_l4e_t);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, gpa, 4, &gw->l4e, sizeof(gw->l4e)) )
+    {
+        gw->l4e = l4p[guest_l4_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, gpa, 4, &gw->l4e, sizeof(gw->l4e));
+    }
     gflags = guest_l4e_get_flags(gw->l4e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -164,7 +173,15 @@ guest_walk_tables(struct vcpu *v, struct
     }
 
     /* Get the l3e and check its flags*/
-    gw->l3e = l3p[guest_l3_table_offset(gla)];
+    gpa = gfn_to_gaddr(guest_l4e_get_gfn(gw->l4e)) +
+          guest_l3_table_offset(gla) * sizeof(guest_l3e_t);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
+    {
+        gw->l3e = l3p[guest_l3_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e));
+    }
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -216,7 +233,16 @@ guest_walk_tables(struct vcpu *v, struct
 #else /* PAE only... */
 
     /* Get the l3e and check its flag */
-    gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
+    gpa = gfn_to_gaddr(top_gfn) + ((unsigned long)top_map & ~PAGE_MASK) +
+          guest_l3_table_offset(gla) * sizeof(guest_l3e_t);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
+    {
+        gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e));
+    }
+
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -242,18 +268,24 @@ guest_walk_tables(struct vcpu *v, struct
         goto out;
     }
 
-    /* Get the l2e */
-    gw->l2e = l2p[guest_l2_table_offset(gla)];
-
 #else /* 32-bit only... */
 
-    /* Get l2e from the top level table */
     gw->l2mfn = top_mfn;
     l2p = (guest_l2e_t *) top_map;
-    gw->l2e = l2p[guest_l2_table_offset(gla)];
 
 #endif /* All levels... */
 
+    /* Get the l2e */
+    gpa = gfn_to_gaddr(top_gfn) +
+          guest_l2_table_offset(gla) * sizeof(guest_l2e_t);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, gpa, 2, &gw->l2e, sizeof(gw->l2e)) )
+    {
+        gw->l2e = l2p[guest_l2_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, gpa, 2, &gw->l2e, sizeof(gw->l2e));
+    }
+
     /* Check the l2e flags. */
     gflags = guest_l2e_get_flags(gw->l2e);
     if ( !(gflags & _PAGE_PRESENT) )
@@ -335,7 +367,17 @@ guest_walk_tables(struct vcpu *v, struct
         gw->pfec |= rc & PFEC_synth_mask;
         goto out;
     }
-    gw->l1e = l1p[guest_l1_table_offset(gla)];
+
+    gpa = gfn_to_gaddr(top_gfn) +
+          guest_l1_table_offset(gla) * sizeof(guest_l1e_t);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, gpa, 1, &gw->l1e, sizeof(gw->l1e)) )
+    {
+        gw->l1e = l1p[guest_l1_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, gpa, 1, &gw->l1e, sizeof(gw->l1e));
+    }
+
     gflags = guest_l1e_get_flags(gw->l1e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
--- a/xen/include/asm-x86/hvm/emulate.h
+++ b/xen/include/asm-x86/hvm/emulate.h
@@ -98,6 +98,13 @@ int hvmemul_do_pio_buffer(uint16_t port,
                           uint8_t dir,
                           void *buffer);
 
+struct hvmemul_cache;
+bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
+                        unsigned int level, void *buffer, unsigned int size);
+void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
+                         unsigned int level, const void *buffer,
+                         unsigned int size);
+
 void hvm_dump_emulation_state(const char *loglvl, const char *prefix,
                               struct hvm_emulate_ctxt *hvmemul_ctxt, int rc);
 





* [PATCH 3/6] x86/HVM: implement memory read caching
  2018-07-19 10:39 [PATCH 0/6] x86/HVM: implement memory read caching Jan Beulich
  2018-07-19 10:46 ` [PATCH 1/6] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
  2018-07-19 10:47 ` [PATCH 2/6] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
@ 2018-07-19 10:48 ` Jan Beulich
  2018-07-19 14:20   ` Paul Durrant
  2018-07-23 15:45   ` Tim Deegan
  2018-07-19 10:49 ` [PATCH 4/6] VMX: correct PDPTE load checks Jan Beulich
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Jan Beulich @ 2018-07-19 10:48 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Suravee Suthikulpanit, George Dunlap, Andrew Cooper,
	Tim Deegan, Paul Durrant, Jun Nakajima, Boris Ostrovsky,
	Brian Woods

Emulation requiring device model assistance uses a form of instruction
re-execution, assuming that the second (and any further) pass takes
exactly the same path. This is a valid assumption as far as use of CPU
registers goes (as those can't change without any other instruction
executing in between), but is wrong for memory accesses. In particular
it has been observed that Windows might page out buffers underneath an
instruction currently under emulation (hitting between two passes). If
the first pass translated a linear address successfully, any subsequent
pass needs to do so too, yielding the exact same translation.

Introduce a cache (used just by guest page table accesses for now) to
make sure the assumption described above holds. This is a very simplistic
implementation for now: only exact matches are satisfied (no overlaps,
partial reads, or the like).
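
Roughly, usage then looks like this (the sizing matches the
hvm_vcpu_initialise() hunk below; the lookup part is a simplified sketch with
placeholder variables gpa, gla, l2p and l2e, keyed on an exact
(gpa, level, size) match):

    /*
     * One entry per walk level for up to 8 linear ranges spanning 2 pages
     * each, i.e. 64 entries with 4-level paging.
     */
    struct hvmemul_cache *cache =
        hvmemul_cache_init(CONFIG_PAGING_LEVELS * 8 * 2);
    guest_l2e_t l2e;

    if ( !hvmemul_read_cache(cache, gpa, 2, &l2e, sizeof(l2e)) )
    {
        /* First pass: read the L2 entry from guest memory ... */
        l2e = l2p[guest_l2_table_offset(gla)];
        /* ... and record it so further passes skip the memory read. */
        hvmemul_write_cache(cache, gpa, 2, &l2e, sizeof(l2e));
    }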

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -27,6 +27,18 @@
 #include <asm/hvm/svm/svm.h>
 #include <asm/vm_event.h>
 
+struct hvmemul_cache
+{
+    unsigned int max_ents;
+    unsigned int num_ents;
+    struct {
+        paddr_t gpa:PADDR_BITS;
+        unsigned int size:(BITS_PER_LONG - PADDR_BITS) / 2;
+        unsigned int level:(BITS_PER_LONG - PADDR_BITS) / 2;
+        unsigned long data;
+    } ents[];
+};
+
 static void hvmtrace_io_assist(const ioreq_t *p)
 {
     unsigned int size, event;
@@ -520,7 +532,7 @@ static int hvmemul_do_mmio_addr(paddr_t
  */
 static void *hvmemul_map_linear_addr(
     unsigned long linear, unsigned int bytes, uint32_t pfec,
-    struct hvm_emulate_ctxt *hvmemul_ctxt)
+    struct hvm_emulate_ctxt *hvmemul_ctxt, struct hvmemul_cache *cache)
 {
     struct vcpu *curr = current;
     void *err, *mapping;
@@ -565,7 +577,7 @@ static void *hvmemul_map_linear_addr(
         ASSERT(mfn_x(*mfn) == 0);
 
         res = hvm_translate_get_page(curr, addr, true, pfec,
-                                     &pfinfo, &page, NULL, &p2mt);
+                                     &pfinfo, &page, NULL, &p2mt, cache);
 
         switch ( res )
         {
@@ -681,6 +693,8 @@ static int hvmemul_linear_to_phys(
     gfn_t gfn, ngfn;
     unsigned long done, todo, i, offset = addr & ~PAGE_MASK;
     int reverse;
+    struct hvmemul_cache *cache = pfec & PFEC_insn_fetch
+                                  ? NULL : curr->arch.hvm_vcpu.data_cache;
 
     /*
      * Clip repetitions to a sensible maximum. This avoids extensive looping in
@@ -710,7 +724,7 @@ static int hvmemul_linear_to_phys(
             return rc;
         gfn = gaddr_to_gfn(gaddr);
     }
-    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, NULL),
+    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, cache),
                      INVALID_GFN) )
     {
         if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
@@ -726,7 +740,7 @@ static int hvmemul_linear_to_phys(
     {
         /* Get the next PFN in the range. */
         addr += reverse ? -PAGE_SIZE : PAGE_SIZE;
-        ngfn = paging_gla_to_gfn(curr, addr, &pfec, NULL);
+        ngfn = paging_gla_to_gfn(curr, addr, &pfec, cache);
 
         /* Is it contiguous with the preceding PFNs? If not then we're done. */
         if ( gfn_eq(ngfn, INVALID_GFN) ||
@@ -1046,6 +1060,8 @@ static int __hvmemul_read(
         pfec |= PFEC_implicit;
     else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].dpl == 3 )
         pfec |= PFEC_user_mode;
+    if ( access_type == hvm_access_insn_fetch )
+        pfec |= PFEC_insn_fetch;
 
     rc = hvmemul_virtual_to_linear(
         seg, offset, bytes, &reps, access_type, hvmemul_ctxt, &addr);
@@ -1059,7 +1075,8 @@ static int __hvmemul_read(
 
     rc = ((access_type == hvm_access_insn_fetch) ?
           hvm_fetch_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo) :
-          hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo));
+          hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo,
+                                     curr->arch.hvm_vcpu.data_cache));
 
     switch ( rc )
     {
@@ -1178,7 +1195,8 @@ static int hvmemul_write(
          (vio->mmio_gla == (addr & PAGE_MASK)) )
         return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 1);
 
-    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
+    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
+                                      curr->arch.hvm_vcpu.data_cache);
     if ( IS_ERR(mapping) )
         return ~PTR_ERR(mapping);
 
@@ -1218,7 +1236,8 @@ static int hvmemul_rmw(
     else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].dpl == 3 )
         pfec |= PFEC_user_mode;
 
-    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
+    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
+                                      current->arch.hvm_vcpu.data_cache);
     if ( IS_ERR(mapping) )
         return ~PTR_ERR(mapping);
 
@@ -1375,7 +1394,8 @@ static int hvmemul_cmpxchg(
     else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].dpl == 3 )
         pfec |= PFEC_user_mode;
 
-    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
+    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
+                                      curr->arch.hvm_vcpu.data_cache);
     if ( IS_ERR(mapping) )
         return ~PTR_ERR(mapping);
 
@@ -2282,6 +2302,7 @@ static int _hvm_emulate_one(struct hvm_e
     {
         vio->mmio_cache_count = 0;
         vio->mmio_insn_bytes = 0;
+        curr->arch.hvm_vcpu.data_cache->num_ents = 0;
     }
     else
     {
@@ -2572,9 +2593,35 @@ void hvm_dump_emulation_state(const char
            hvmemul_ctxt->insn_buf);
 }
 
+struct hvmemul_cache *hvmemul_cache_init(unsigned int nents)
+{
+    struct hvmemul_cache *cache = xmalloc_bytes(offsetof(struct hvmemul_cache,
+                                                         ents[nents]));
+
+    if ( cache )
+    {
+        cache->num_ents = 0;
+        cache->max_ents = nents;
+    }
+
+    return cache;
+}
+
 bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t gpa,
                         unsigned int level, void *buffer, unsigned int size)
 {
+    unsigned int i;
+
+    ASSERT(size <= sizeof(cache->ents->data));
+
+    for ( i = 0; i < cache->num_ents; ++i )
+        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
+             cache->ents[i].size == size )
+        {
+            memcpy(buffer, &cache->ents[i].data, size);
+            return true;
+        }
+
     return false;
 }
 
@@ -2582,6 +2629,35 @@ void hvmemul_write_cache(struct hvmemul_
                          unsigned int level, const void *buffer,
                          unsigned int size)
 {
+    unsigned int i;
+
+    if ( size > sizeof(cache->ents->data) )
+    {
+        ASSERT_UNREACHABLE();
+        return;
+    }
+
+    for ( i = 0; i < cache->num_ents; ++i )
+        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
+             cache->ents[i].size == size )
+        {
+            memcpy(&cache->ents[i].data, buffer, size);
+            return;
+        }
+
+    if ( unlikely(i >= cache->max_ents) )
+    {
+        ASSERT_UNREACHABLE();
+        return;
+    }
+
+    cache->ents[i].level = level;
+    cache->ents[i].gpa   = gpa;
+    cache->ents[i].size  = size;
+
+    memcpy(&cache->ents[i].data, buffer, size);
+
+    cache->num_ents = i + 1;
 }
 
 /*
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1530,6 +1530,18 @@ int hvm_vcpu_initialise(struct vcpu *v)
 
     v->arch.hvm_vcpu.inject_event.vector = HVM_EVENT_VECTOR_UNSET;
 
+    /*
+     * Leaving aside the insn fetch, for which we don't use this cache, no
+     * insn can access more than 8 independent linear addresses (AVX2
+     * gathers being the worst). Each such linear range can span a page
+     * boundary, i.e. require two page walks.
+     */
+    v->arch.hvm_vcpu.data_cache = hvmemul_cache_init(CONFIG_PAGING_LEVELS *
+                                                     8 * 2);
+    rc = -ENOMEM;
+    if ( !v->arch.hvm_vcpu.data_cache )
+        goto fail4;
+
     rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
     if ( rc != 0 )
         goto fail4;
@@ -1559,6 +1571,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
  fail5:
     free_compat_arg_xlat(v);
  fail4:
+    hvmemul_cache_destroy(v->arch.hvm_vcpu.data_cache);
     hvm_funcs.vcpu_destroy(v);
  fail3:
     vlapic_destroy(v);
@@ -1581,6 +1594,8 @@ void hvm_vcpu_destroy(struct vcpu *v)
 
     free_compat_arg_xlat(v);
 
+    hvmemul_cache_destroy(v->arch.hvm_vcpu.data_cache);
+
     tasklet_kill(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet);
     hvm_funcs.vcpu_destroy(v);
 
@@ -2949,7 +2964,7 @@ void hvm_task_switch(
     }
 
     rc = hvm_copy_from_guest_linear(
-        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
+        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
     if ( rc == HVMTRANS_bad_linear_to_gfn )
         hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
     if ( rc != HVMTRANS_okay )
@@ -2996,7 +3011,7 @@ void hvm_task_switch(
         goto out;
 
     rc = hvm_copy_from_guest_linear(
-        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
+        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
     if ( rc == HVMTRANS_bad_linear_to_gfn )
         hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
     /*
@@ -3107,7 +3122,7 @@ void hvm_task_switch(
 enum hvm_translation_result hvm_translate_get_page(
     struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
     pagefault_info_t *pfinfo, struct page_info **page_p,
-    gfn_t *gfn_p, p2m_type_t *p2mt_p)
+    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache)
 {
     struct page_info *page;
     p2m_type_t p2mt;
@@ -3115,7 +3130,7 @@ enum hvm_translation_result hvm_translat
 
     if ( linear )
     {
-        gfn = paging_gla_to_gfn(v, addr, &pfec, NULL);
+        gfn = paging_gla_to_gfn(v, addr, &pfec, cache);
 
         if ( gfn_eq(gfn, INVALID_GFN) )
         {
@@ -3187,7 +3202,7 @@ enum hvm_translation_result hvm_translat
 #define HVMCOPY_linear     (1u<<2)
 static enum hvm_translation_result __hvm_copy(
     void *buf, paddr_t addr, int size, struct vcpu *v, unsigned int flags,
-    uint32_t pfec, pagefault_info_t *pfinfo)
+    uint32_t pfec, pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
 {
     gfn_t gfn;
     struct page_info *page;
@@ -3220,8 +3235,8 @@ static enum hvm_translation_result __hvm
 
         count = min_t(int, PAGE_SIZE - gpa, todo);
 
-        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear,
-                                     pfec, pfinfo, &page, &gfn, &p2mt);
+        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear, pfec,
+                                     pfinfo, &page, &gfn, &p2mt, cache);
         if ( res != HVMTRANS_okay )
             return res;
 
@@ -3268,14 +3283,14 @@ enum hvm_translation_result hvm_copy_to_
     paddr_t paddr, void *buf, int size, struct vcpu *v)
 {
     return __hvm_copy(buf, paddr, size, v,
-                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL);
+                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL, NULL);
 }
 
 enum hvm_translation_result hvm_copy_from_guest_phys(
     void *buf, paddr_t paddr, int size)
 {
     return __hvm_copy(buf, paddr, size, current,
-                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL);
+                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL, NULL);
 }
 
 enum hvm_translation_result hvm_copy_to_guest_linear(
@@ -3284,16 +3299,17 @@ enum hvm_translation_result hvm_copy_to_
 {
     return __hvm_copy(buf, addr, size, current,
                       HVMCOPY_to_guest | HVMCOPY_linear,
-                      PFEC_page_present | PFEC_write_access | pfec, pfinfo);
+                      PFEC_page_present | PFEC_write_access | pfec,
+                      pfinfo, NULL);
 }
 
 enum hvm_translation_result hvm_copy_from_guest_linear(
     void *buf, unsigned long addr, int size, uint32_t pfec,
-    pagefault_info_t *pfinfo)
+    pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
 {
     return __hvm_copy(buf, addr, size, current,
                       HVMCOPY_from_guest | HVMCOPY_linear,
-                      PFEC_page_present | pfec, pfinfo);
+                      PFEC_page_present | pfec, pfinfo, cache);
 }
 
 enum hvm_translation_result hvm_fetch_from_guest_linear(
@@ -3302,7 +3318,7 @@ enum hvm_translation_result hvm_fetch_fr
 {
     return __hvm_copy(buf, addr, size, current,
                       HVMCOPY_from_guest | HVMCOPY_linear,
-                      PFEC_page_present | PFEC_insn_fetch | pfec, pfinfo);
+                      PFEC_page_present | PFEC_insn_fetch | pfec, pfinfo, NULL);
 }
 
 unsigned long copy_to_user_hvm(void *to, const void *from, unsigned int len)
@@ -3343,7 +3359,8 @@ unsigned long copy_from_user_hvm(void *t
         return 0;
     }
 
-    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len, 0, NULL);
+    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len,
+                                    0, NULL, NULL);
     return rc ? len : 0; /* fake a copy_from_user() return code */
 }
 
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1358,7 +1358,7 @@ static void svm_emul_swint_injection(str
         goto raise_exception;
 
     rc = hvm_copy_from_guest_linear(&idte, idte_linear_addr, idte_size,
-                                    PFEC_implicit, &pfinfo);
+                                    PFEC_implicit, &pfinfo, NULL);
     if ( rc )
     {
         if ( rc == HVMTRANS_bad_linear_to_gfn )
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -475,7 +475,7 @@ static int decode_vmx_inst(struct cpu_us
         {
             pagefault_info_t pfinfo;
             int rc = hvm_copy_from_guest_linear(poperandS, base, size,
-                                                0, &pfinfo);
+                                                0, &pfinfo, NULL);
 
             if ( rc == HVMTRANS_bad_linear_to_gfn )
                 hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -206,7 +206,7 @@ hvm_read(enum x86_segment seg,
     if ( access_type == hvm_access_insn_fetch )
         rc = hvm_fetch_from_guest_linear(p_data, addr, bytes, 0, &pfinfo);
     else
-        rc = hvm_copy_from_guest_linear(p_data, addr, bytes, 0, &pfinfo);
+        rc = hvm_copy_from_guest_linear(p_data, addr, bytes, 0, &pfinfo, NULL);
 
     switch ( rc )
     {
--- a/xen/include/asm-x86/hvm/emulate.h
+++ b/xen/include/asm-x86/hvm/emulate.h
@@ -99,6 +99,11 @@ int hvmemul_do_pio_buffer(uint16_t port,
                           void *buffer);
 
 struct hvmemul_cache;
+struct hvmemul_cache *hvmemul_cache_init(unsigned int nents);
+static inline void hvmemul_cache_destroy(struct hvmemul_cache *cache)
+{
+    xfree(cache);
+}
 bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
                         unsigned int level, void *buffer, unsigned int size);
 void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
--- a/xen/include/asm-x86/hvm/support.h
+++ b/xen/include/asm-x86/hvm/support.h
@@ -99,7 +99,7 @@ enum hvm_translation_result hvm_copy_to_
     pagefault_info_t *pfinfo);
 enum hvm_translation_result hvm_copy_from_guest_linear(
     void *buf, unsigned long addr, int size, uint32_t pfec,
-    pagefault_info_t *pfinfo);
+    pagefault_info_t *pfinfo, struct hvmemul_cache *cache);
 enum hvm_translation_result hvm_fetch_from_guest_linear(
     void *buf, unsigned long addr, int size, uint32_t pfec,
     pagefault_info_t *pfinfo);
@@ -113,7 +113,7 @@ enum hvm_translation_result hvm_fetch_fr
 enum hvm_translation_result hvm_translate_get_page(
     struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
     pagefault_info_t *pfinfo, struct page_info **page_p,
-    gfn_t *gfn_p, p2m_type_t *p2mt_p);
+    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache);
 
 #define HVM_HCALL_completed  0 /* hypercall completed - no further action */
 #define HVM_HCALL_preempted  1 /* hypercall preempted - re-execute VMCALL */
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -53,8 +53,6 @@ struct hvm_mmio_cache {
     uint8_t buffer[32];
 };
 
-struct hvmemul_cache;
-
 struct hvm_vcpu_io {
     /* I/O request in flight to device model. */
     enum hvm_io_completion io_completion;
@@ -200,6 +198,7 @@ struct hvm_vcpu {
     u8                  cache_mode;
 
     struct hvm_vcpu_io  hvm_io;
+    struct hvmemul_cache *data_cache;
 
     /* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
     struct x86_event     inject_event;





* [PATCH 4/6] VMX: correct PDPTE load checks
  2018-07-19 10:39 [PATCH 0/6] x86/HVM: implement memory read caching Jan Beulich
                   ` (2 preceding siblings ...)
  2018-07-19 10:48 ` [PATCH 3/6] x86/HVM: implement memory read caching Jan Beulich
@ 2018-07-19 10:49 ` Jan Beulich
  2018-08-28 11:59   ` Ping: " Jan Beulich
  2018-08-28 13:12   ` Andrew Cooper
  2018-07-19 10:50 ` [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
  2018-07-19 10:51 ` [PATCH 6/6] x86/shadow: a little bit of style cleanup Jan Beulich
  5 siblings, 2 replies; 28+ messages in thread
From: Jan Beulich @ 2018-07-19 10:49 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Kevin Tian, Jun Nakajima

Checking the low 5 bits of CR3 is not the job of vmx_load_pdptrs().
Instead, the function should inject #GP upon bad PDPTE values, rather
than causing a VM entry failure.
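
Concretely, the check added below boils down to the following (condensed;
maxphysaddr stands for v->domain->arch.cpuid->extd.maxphysaddr):

    valid_mask = ((1ULL << maxphysaddr) - 1) &
                 (PAGE_MASK | _PAGE_AVAIL | _PAGE_PRESENT);
    if ( (guest_pdptes[i] & _PAGE_PRESENT) &&
         (guest_pdptes[i] & ~valid_mask) )
        /* #GP for v == current; log and crash the domain otherwise. */
        hvm_inject_hw_exception(TRAP_gp_fault, 0);

i.e. a present PDPTE may only have the present bit, the software-available
bits, and page frame address bits below maxphysaddr set.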

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1361,20 +1361,18 @@ static void vmx_set_interrupt_shadow(str
     __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
 }
 
-static void vmx_load_pdptrs(struct vcpu *v)
+static bool vmx_load_pdptrs(struct vcpu *v)
 {
     unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3];
-    uint64_t *guest_pdptes;
+    uint64_t *guest_pdptes, valid_mask;
     struct page_info *page;
     p2m_type_t p2mt;
     char *p;
+    unsigned int i;
 
     /* EPT needs to load PDPTRS into VMCS for PAE. */
     if ( !hvm_pae_enabled(v) || (v->arch.hvm_vcpu.guest_efer & EFER_LMA) )
-        return;
-
-    if ( (cr3 & 0x1fUL) && !hvm_pcid_enabled(v) )
-        goto crash;
+        return true;
 
     page = get_page_from_gfn(v->domain, cr3 >> PAGE_SHIFT, &p2mt, P2M_UNSHARE);
     if ( !page )
@@ -1385,34 +1383,47 @@ static void vmx_load_pdptrs(struct vcpu
         gdprintk(XENLOG_ERR,
                  "Bad cr3 on load pdptrs gfn %lx type %d\n",
                  cr3 >> PAGE_SHIFT, (int) p2mt);
-        goto crash;
+        domain_crash(v->domain);
+        return false;
     }
 
     p = __map_domain_page(page);
 
-    guest_pdptes = (uint64_t *)(p + (cr3 & ~PAGE_MASK));
+    guest_pdptes = (uint64_t *)(p + (cr3 & 0xfe0));
 
-    /*
-     * We do not check the PDPTRs for validity. The CPU will do this during
-     * vm entry, and we can handle the failure there and crash the guest.
-     * The only thing we could do better here is #GP instead.
-     */
+    valid_mask = ((1ULL << v->domain->arch.cpuid->extd.maxphysaddr) - 1) &
+                 (PAGE_MASK | _PAGE_AVAIL | _PAGE_PRESENT);
+    for ( i = 0; i < 4; ++i )
+        if ( (guest_pdptes[i] & _PAGE_PRESENT) &&
+             (guest_pdptes[i] & ~valid_mask) )
+        {
+            if ( v == current )
+                hvm_inject_hw_exception(TRAP_gp_fault, 0);
+            else
+            {
+                printk(XENLOG_G_ERR "%pv: bad PDPTE%u: %"PRIx64"\n",
+                       v, i, guest_pdptes[i]);
+                domain_crash(v->domain);
+            }
+            break;
+        }
 
-    vmx_vmcs_enter(v);
+    if ( i == 4 )
+    {
+        vmx_vmcs_enter(v);
 
-    __vmwrite(GUEST_PDPTE(0), guest_pdptes[0]);
-    __vmwrite(GUEST_PDPTE(1), guest_pdptes[1]);
-    __vmwrite(GUEST_PDPTE(2), guest_pdptes[2]);
-    __vmwrite(GUEST_PDPTE(3), guest_pdptes[3]);
+        __vmwrite(GUEST_PDPTE(0), guest_pdptes[0]);
+        __vmwrite(GUEST_PDPTE(1), guest_pdptes[1]);
+        __vmwrite(GUEST_PDPTE(2), guest_pdptes[2]);
+        __vmwrite(GUEST_PDPTE(3), guest_pdptes[3]);
 
-    vmx_vmcs_exit(v);
+        vmx_vmcs_exit(v);
+    }
 
     unmap_domain_page(p);
     put_page(page);
-    return;
 
- crash:
-    domain_crash(v->domain);
+    return i == 4;
 }
 
 static void vmx_update_host_cr3(struct vcpu *v)
@@ -1621,7 +1632,8 @@ static void vmx_update_guest_cr(struct v
             if ( !hvm_paging_enabled(v) && !vmx_unrestricted_guest(v) )
                 v->arch.hvm_vcpu.hw_cr[3] =
                     v->domain->arch.hvm_domain.params[HVM_PARAM_IDENT_PT];
-            vmx_load_pdptrs(v);
+            if ( !vmx_load_pdptrs(v) )
+                break;
         }
 
         __vmwrite(GUEST_CR3, v->arch.hvm_vcpu.hw_cr[3]);






^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible
  2018-07-19 10:39 [PATCH 0/6] x86/HVM: implement memory read caching Jan Beulich
                   ` (3 preceding siblings ...)
  2018-07-19 10:49 ` [PATCH 4/6] VMX: correct PDPTE load checks Jan Beulich
@ 2018-07-19 10:50 ` Jan Beulich
  2018-07-19 11:15   ` Andrew Cooper
  2018-07-19 10:51 ` [PATCH 6/6] x86/shadow: a little bit of style cleanup Jan Beulich
  5 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2018-07-19 10:50 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Kevin Tian, Jun Nakajima

Since strictly speaking it is incorrect for guest_walk_tables() to read
L3 entries during PAE page walks, try to overcome this where possible by
pre-loading the values from hardware into the cache. Sadly the
information is available in the EPT case only. On the positive side, for
NPT the spec spells out that L3 entries are actually read on walks, so
our reading them is consistent with hardware behavior in that case.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2294,6 +2294,23 @@ static int _hvm_emulate_one(struct hvm_e
 
     vio->mmio_retry = 0;
 
+    if ( !curr->arch.hvm_vcpu.data_cache->num_ents &&
+         curr->arch.paging.mode->guest_levels == 3 )
+    {
+        unsigned int i;
+
+        for ( i = 0; i < 4; ++i )
+        {
+            uint64_t pdpte;
+
+            if ( hvm_read_pdpte(curr, i, &pdpte) )
+                hvmemul_write_cache(curr->arch.hvm_vcpu.data_cache,
+                                    (curr->arch.hvm_vcpu.guest_cr[3] &
+                                     (PADDR_MASK & ~0x1f)) + i * sizeof(pdpte),
+                                    3, &pdpte, sizeof(pdpte));
+        }
+    }
+
     rc = x86_emulate(&hvmemul_ctxt->ctxt, ops);
     if ( rc == X86EMUL_OKAY && vio->mmio_retry )
         rc = X86EMUL_RETRY;
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1361,6 +1361,25 @@ static void vmx_set_interrupt_shadow(str
     __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
 }
 
+static bool read_pdpte(struct vcpu *v, unsigned int idx, uint64_t *pdpte)
+{
+    if ( !paging_mode_hap(v->domain) || !hvm_pae_enabled(v) ||
+         (v->arch.hvm_vcpu.guest_efer & EFER_LMA) )
+        return false;
+
+    if ( idx >= 4 )
+    {
+        ASSERT_UNREACHABLE();
+        return false;
+    }
+
+    vmx_vmcs_enter(v);
+    __vmread(GUEST_PDPTE(idx), pdpte);
+    vmx_vmcs_exit(v);
+
+    return true;
+}
+
 static bool vmx_load_pdptrs(struct vcpu *v)
 {
     unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3];
@@ -2464,6 +2483,8 @@ const struct hvm_function_table * __init
         if ( cpu_has_vmx_ept_1gb )
             vmx_function_table.hap_capabilities |= HVM_HAP_SUPERPAGE_1GB;
 
+        vmx_function_table.read_pdpte = read_pdpte;
+
         setup_ept_dump();
     }
 
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -146,6 +146,8 @@ struct hvm_function_table {
 
     void (*fpu_leave)(struct vcpu *v);
 
+    bool (*read_pdpte)(struct vcpu *v, unsigned int index, uint64_t *pdpte);
+
     int  (*get_guest_pat)(struct vcpu *v, u64 *);
     int  (*set_guest_pat)(struct vcpu *v, u64);
 
@@ -383,6 +385,12 @@ static inline unsigned long hvm_get_shad
     return hvm_funcs.get_shadow_gs_base(v);
 }
 
+static inline bool hvm_read_pdpte(struct vcpu *v, unsigned int index, uint64_t *pdpte)
+{
+    return hvm_funcs.read_pdpte &&
+           hvm_funcs.read_pdpte(v, index, pdpte);
+}
+
 static inline bool hvm_get_guest_bndcfgs(struct vcpu *v, u64 *val)
 {
     return hvm_funcs.get_guest_bndcfgs &&






^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 6/6] x86/shadow: a little bit of style cleanup
  2018-07-19 10:39 [PATCH 0/6] x86/HVM: implement memory read caching Jan Beulich
                   ` (4 preceding siblings ...)
  2018-07-19 10:50 ` [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
@ 2018-07-19 10:51 ` Jan Beulich
  2018-07-23 15:05   ` Tim Deegan
  5 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2018-07-19 10:51 UTC (permalink / raw)
  To: xen-devel, Jan Beulich; +Cc: George Dunlap, Andrew Cooper, Tim Deegan

Correct indentation of a piece of code, adjusting comment style at the
same time. Constify gl3e pointers and drop a bogus (and useless once
corrected) cast.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3989,9 +3989,8 @@ sh_update_cr3(struct vcpu *v, int do_loc
     struct domain *d = v->domain;
     mfn_t gmfn;
 #if GUEST_PAGING_LEVELS == 3
-    guest_l3e_t *gl3e;
-    u32 guest_idx=0;
-    int i;
+    const guest_l3e_t *gl3e;
+    unsigned int i, guest_idx;
 #endif
 
     /* Don't do anything on an uninitialised vcpu */
@@ -4057,23 +4056,24 @@ sh_update_cr3(struct vcpu *v, int do_loc
     else
         v->arch.paging.shadow.guest_vtable = __linear_l4_table;
 #elif GUEST_PAGING_LEVELS == 3
-     /* On PAE guests we don't use a mapping of the guest's own top-level
-      * table.  We cache the current state of that table and shadow that,
-      * until the next CR3 write makes us refresh our cache. */
-     ASSERT(v->arch.paging.shadow.guest_vtable == NULL);
-
-     ASSERT(shadow_mode_external(d));
-     /* Find where in the page the l3 table is */
-     guest_idx = guest_index((void *)v->arch.hvm_vcpu.guest_cr[3]);
-
-     // Ignore the low 2 bits of guest_idx -- they are really just
-     // cache control.
-     guest_idx &= ~3;
-
-     gl3e = ((guest_l3e_t *)map_domain_page(gmfn)) + guest_idx;
-     for ( i = 0; i < 4 ; i++ )
-         v->arch.paging.shadow.gl3e[i] = gl3e[i];
-     unmap_domain_page(gl3e);
+    /*
+     * On PAE guests we don't use a mapping of the guest's own top-level
+     * table.  We cache the current state of that table and shadow that,
+     * until the next CR3 write makes us refresh our cache.
+     */
+    ASSERT(v->arch.paging.shadow.guest_vtable == NULL);
+    ASSERT(shadow_mode_external(d));
+
+    /*
+     * Find where in the page the l3 table is, but ignore the low 2 bits of
+     * guest_idx -- they are really just cache control.
+     */
+    guest_idx = guest_index((void *)v->arch.hvm_vcpu.guest_cr[3]) & ~3;
+
+    gl3e = ((guest_l3e_t *)map_domain_page(gmfn)) + guest_idx;
+    for ( i = 0; i < 4 ; i++ )
+        v->arch.paging.shadow.gl3e[i] = gl3e[i];
+    unmap_domain_page(gl3e);
 #elif GUEST_PAGING_LEVELS == 2
     ASSERT(shadow_mode_external(d));
     if ( v->arch.paging.shadow.guest_vtable )
@@ -4106,7 +4106,8 @@ sh_update_cr3(struct vcpu *v, int do_loc
         gfn_t gl2gfn;
         mfn_t gl2mfn;
         p2m_type_t p2mt;
-        guest_l3e_t *gl3e = (guest_l3e_t*)&v->arch.paging.shadow.gl3e;
+        const guest_l3e_t *gl3e = v->arch.paging.shadow.gl3e;
+
         /* First, make all four entries read-only. */
         for ( i = 0; i < 4; i++ )
         {






^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible
  2018-07-19 10:50 ` [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
@ 2018-07-19 11:15   ` Andrew Cooper
  2018-07-19 11:47     ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2018-07-19 11:15 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Kevin Tian, Jun Nakajima

On 19/07/18 11:50, Jan Beulich wrote:
> Since strictly speaking it is incorrect for guest_walk_tables() to read
> L3 entries during PAE page walks, try to overcome this where possible by
> pre-loading the values from hardware into the cache. Sadly the
> information is available in the EPT case only. On the positive side for
> NPT the spec spells out that L3 entries are actually read on walks, so
> us reading them is consistent with hardware behavior in that case.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I'm afraid that this isn't architecturally correct.  It means that an
emulated memory access will read the PDPTE register values, rather than
what is actually in RAM.

As an alternative with the same effect, how about, in the EPT case,
passing in a top_map pointer to a local piece of stack holding the PDPTE
registers read out of hardware?  That way, we don't read memory on the
pagewalk, yet don't interfere with other emulated memory accesses.

~Andrew


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible
  2018-07-19 11:15   ` Andrew Cooper
@ 2018-07-19 11:47     ` Jan Beulich
  2018-07-19 11:55       ` Andrew Cooper
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2018-07-19 11:47 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Kevin Tian, Jun Nakajima

>>> On 19.07.18 at 13:15, <andrew.cooper3@citrix.com> wrote:
> On 19/07/18 11:50, Jan Beulich wrote:
>> Since strictly speaking it is incorrect for guest_walk_tables() to read
>> L3 entries during PAE page walks, try to overcome this where possible by
>> pre-loading the values from hardware into the cache. Sadly the
>> information is available in the EPT case only. On the positive side for
>> NPT the spec spells out that L3 entries are actually read on walks, so
>> us reading them is consistent with hardware behavior in that case.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> I'm afraid that this isn't architecturally correct.  It means that an
> emulated memory access will read the PDPTE register values, rather than
> what is actually in RAM.

I'm afraid I don't understand: A CR3 load loads the PDPTEs into
registers, and walks use those registers, not memory. That's the very
difference between PAE and all other walks.
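
To illustrate what I mean (a conceptual model only, not Xen code, and
all names here are made up):

/* Conceptual model of PAE paging, for illustration only. */
#include <stdint.h>
#include <string.h>

static uint64_t pdpte_reg[4];     /* the CPU's internal PDPTE registers */

/*
 * MOV to CR3 (PAE, outside long mode): all four PDPTEs are read from
 * memory once, at this point.
 */
static void model_load_cr3(const uint64_t *guest_mem, uint64_t cr3)
{
    memcpy(pdpte_reg, guest_mem + ((cr3 & ~0x1fULL) / 8), sizeof(pdpte_reg));
}

/*
 * A subsequent walk indexes the registers; the L3 table in memory is
 * not consulted again until the next PDPTE load.
 */
static uint64_t model_l3_entry(uint32_t gla)
{
    return pdpte_reg[(gla >> 30) & 3];
}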

Jan




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible
  2018-07-19 11:47     ` Jan Beulich
@ 2018-07-19 11:55       ` Andrew Cooper
  2018-07-19 18:37         ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2018-07-19 11:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Kevin Tian, Jun Nakajima

On 19/07/18 12:47, Jan Beulich wrote:
>>>> On 19.07.18 at 13:15, <andrew.cooper3@citrix.com> wrote:
>> On 19/07/18 11:50, Jan Beulich wrote:
>>> Since strictly speaking it is incorrect for guest_walk_tables() to read
>>> L3 entries during PAE page walks, try to overcome this where possible by
>>> pre-loading the values from hardware into the cache. Sadly the
>>> information is available in the EPT case only. On the positive side for
>>> NPT the spec spells out that L3 entries are actually read on walks, so
>>> us reading them is consistent with hardware behavior in that case.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> I'm afraid that this isn't architecturally correct.  It means that an
>> emulated memory access will read the PDPTE register values, rather than
>> what is actually in RAM.
> I'm afraid I don't understand: A CR3 load loads the PDPTEs into
> registers, and walks use those registers, not memory. That's the very
> difference between PAE and all other walks.

Patch 3 causes memory reads to come from the cache.

This patch feeds the PDPTE registers into the cache, which breaks the
architectural correctness of patch 3, because the PDPTE registers may
legitimately be stale WRT the content in memory.

The pagewalk reading of top_map doesn't require that top_map points into
guest space.  If you read the PDPTE registers onto the stack, and pass a
pointer to the stack into the pagewalk in the 3-level case, then you fix
the issue described here without breaking patch 3.
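
Roughly, an untested sketch of the shape I mean, e.g. in hap's
ga_to_gfn() (reusing the hvm_read_pdpte() hook from this patch; the
cache/gpa bookkeeping from patch 2 would still need adjusting for a
stack-based top_map):

#if GUEST_PAGING_LEVELS == 3
    uint64_t pdptes[4];
    unsigned int i;

    /* Pull the PDPTE registers out of hardware (EPT case only). */
    for ( i = 0; i < 4; ++i )
        if ( !hvm_read_pdpte(v, i, &pdptes[i]) )
            break;

    if ( i == 4 )
        /* Walk from the register copies on the stack, not from guest RAM. */
        walk_ok = guest_walk_tables(v, p2m, ga, &gw, *pfec,
                                    top_gfn, INVALID_MFN, pdptes, cache);
    else
#endif
        walk_ok = guest_walk_tables(v, p2m, ga, &gw, *pfec,
                                    top_gfn, top_mfn, top_map, cache);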

~Andrew


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] x86/mm: add optional cache to GLA->GFN translation
  2018-07-19 10:46 ` [PATCH 1/6] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
@ 2018-07-19 13:12   ` Paul Durrant
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Durrant @ 2018-07-19 13:12 UTC (permalink / raw)
  To: 'Jan Beulich', xen-devel
  Cc: Andrew Cooper, tamas, Tim (Xen.org), George Dunlap, Razvan Cojocaru

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 19 July 2018 11:46
> To: xen-devel <xen-devel@lists.xenproject.org>
> Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>;
> George Dunlap <George.Dunlap@citrix.com>; tamas@tklengyel.com; Tim
> (Xen.org) <tim@xen.org>
> Subject: [PATCH 1/6] x86/mm: add optional cache to GLA->GFN translation
> 
> The caching isn't actually implemented here, this is just setting the
> stage.
> 
> Touching these anyway also
> - make their return values gfn_t
> - gva -> gla in their names
> - name their input arguments gla
> 
> At the use sites do the conversion to gfn_t as suitable.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> 
> --- a/xen/arch/x86/debug.c
> +++ b/xen/arch/x86/debug.c
> @@ -51,7 +51,7 @@ dbg_hvm_va2mfn(dbgva_t vaddr, struct dom
> 
>      DBGP2("vaddr:%lx domid:%d\n", vaddr, dp->domain_id);
> 
> -    *gfn = _gfn(paging_gva_to_gfn(dp->vcpu[0], vaddr, &pfec));
> +    *gfn = paging_gla_to_gfn(dp->vcpu[0], vaddr, &pfec, NULL);
>      if ( gfn_eq(*gfn, INVALID_GFN) )
>      {
>          DBGP2("kdb:bad gfn from gva_to_gfn\n");
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -678,7 +678,8 @@ static int hvmemul_linear_to_phys(
>      struct hvm_emulate_ctxt *hvmemul_ctxt)
>  {
>      struct vcpu *curr = current;
> -    unsigned long pfn, npfn, done, todo, i, offset = addr & ~PAGE_MASK;
> +    gfn_t gfn, ngfn;
> +    unsigned long done, todo, i, offset = addr & ~PAGE_MASK;
>      int reverse;
> 
>      /*
> @@ -700,15 +701,17 @@ static int hvmemul_linear_to_phys(
>      if ( reverse && ((PAGE_SIZE - offset) < bytes_per_rep) )
>      {
>          /* Do page-straddling first iteration forwards via recursion. */
> -        paddr_t _paddr;
> +        paddr_t gaddr;
>          unsigned long one_rep = 1;
>          int rc = hvmemul_linear_to_phys(
> -            addr, &_paddr, bytes_per_rep, &one_rep, pfec, hvmemul_ctxt);
> +            addr, &gaddr, bytes_per_rep, &one_rep, pfec, hvmemul_ctxt);
> +
>          if ( rc != X86EMUL_OKAY )
>              return rc;
> -        pfn = _paddr >> PAGE_SHIFT;
> +        gfn = gaddr_to_gfn(gaddr);
>      }
> -    else if ( (pfn = paging_gva_to_gfn(curr, addr, &pfec)) ==
> gfn_x(INVALID_GFN) )
> +    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, NULL),
> +                     INVALID_GFN) )
>      {
>          if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
>              return X86EMUL_RETRY;
> @@ -723,11 +726,11 @@ static int hvmemul_linear_to_phys(
>      {
>          /* Get the next PFN in the range. */
>          addr += reverse ? -PAGE_SIZE : PAGE_SIZE;
> -        npfn = paging_gva_to_gfn(curr, addr, &pfec);
> +        ngfn = paging_gla_to_gfn(curr, addr, &pfec, NULL);
> 
>          /* Is it contiguous with the preceding PFNs? If not then we're done. */
> -        if ( (npfn == gfn_x(INVALID_GFN)) ||
> -             (npfn != (pfn + (reverse ? -i : i))) )
> +        if ( gfn_eq(ngfn, INVALID_GFN) ||
> +             !gfn_eq(ngfn, gfn_add(gfn, reverse ? -i : i)) )
>          {
>              if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
>                  return X86EMUL_RETRY;
> @@ -735,7 +738,7 @@ static int hvmemul_linear_to_phys(
>              if ( done == 0 )
>              {
>                  ASSERT(!reverse);
> -                if ( npfn != gfn_x(INVALID_GFN) )
> +                if ( !gfn_eq(ngfn, INVALID_GFN) )
>                      return X86EMUL_UNHANDLEABLE;
>                  *reps = 0;
>                  x86_emul_pagefault(pfec, addr & PAGE_MASK, &hvmemul_ctxt-
> >ctxt);
> @@ -748,7 +751,8 @@ static int hvmemul_linear_to_phys(
>          done += PAGE_SIZE;
>      }
> 
> -    *paddr = ((paddr_t)pfn << PAGE_SHIFT) | offset;
> +    *paddr = gfn_to_gaddr(gfn) | offset;
> +
>      return X86EMUL_OKAY;
>  }
> 
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -2684,7 +2684,7 @@ static void *hvm_map_entry(unsigned long
>       * treat it as a kernel-mode read (i.e. no access checks).
>       */
>      pfec = PFEC_page_present;
> -    gfn = paging_gva_to_gfn(current, va, &pfec);
> +    gfn = gfn_x(paging_gla_to_gfn(current, va, &pfec, NULL));
>      if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
>          goto fail;
> 
> @@ -3115,7 +3115,7 @@ enum hvm_translation_result hvm_translat
> 
>      if ( linear )
>      {
> -        gfn = _gfn(paging_gva_to_gfn(v, addr, &pfec));
> +        gfn = paging_gla_to_gfn(v, addr, &pfec, NULL);
> 
>          if ( gfn_eq(gfn, INVALID_GFN) )
>          {
> --- a/xen/arch/x86/hvm/monitor.c
> +++ b/xen/arch/x86/hvm/monitor.c
> @@ -130,7 +130,7 @@ static inline unsigned long gfn_of_rip(u
> 
>      hvm_get_segment_register(curr, x86_seg_cs, &sreg);
> 
> -    return paging_gva_to_gfn(curr, sreg.base + rip, &pfec);
> +    return gfn_x(paging_gla_to_gfn(curr, sreg.base + rip, &pfec, NULL));
>  }
> 
>  int hvm_monitor_debug(unsigned long rip, enum
> hvm_monitor_debug_type type,
> --- a/xen/arch/x86/mm/guest_walk.c
> +++ b/xen/arch/x86/mm/guest_walk.c
> @@ -81,8 +81,9 @@ static bool set_ad_bits(guest_intpte_t *
>   */
>  bool
>  guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
> -                  unsigned long va, walk_t *gw,
> -                  uint32_t walk, mfn_t top_mfn, void *top_map)
> +                  unsigned long gla, walk_t *gw, uint32_t walk,
> +                  gfn_t top_gfn, mfn_t top_mfn, void *top_map,
> +                  struct hvmemul_cache *cache)
>  {
>      struct domain *d = v->domain;
>      p2m_type_t p2mt;
> @@ -116,7 +117,7 @@ guest_walk_tables(struct vcpu *v, struct
> 
>      perfc_incr(guest_walk);
>      memset(gw, 0, sizeof(*gw));
> -    gw->va = va;
> +    gw->va = gla;
>      gw->pfec = walk & (PFEC_user_mode | PFEC_write_access);
> 
>      /*
> @@ -133,7 +134,7 @@ guest_walk_tables(struct vcpu *v, struct
>      /* Get the l4e from the top level table and check its flags*/
>      gw->l4mfn = top_mfn;
>      l4p = (guest_l4e_t *) top_map;
> -    gw->l4e = l4p[guest_l4_table_offset(va)];
> +    gw->l4e = l4p[guest_l4_table_offset(gla)];
>      gflags = guest_l4e_get_flags(gw->l4e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> @@ -163,7 +164,7 @@ guest_walk_tables(struct vcpu *v, struct
>      }
> 
>      /* Get the l3e and check its flags*/
> -    gw->l3e = l3p[guest_l3_table_offset(va)];
> +    gw->l3e = l3p[guest_l3_table_offset(gla)];
>      gflags = guest_l3e_get_flags(gw->l3e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> @@ -205,7 +206,7 @@ guest_walk_tables(struct vcpu *v, struct
> 
>          /* Increment the pfn by the right number of 4k pages. */
>          start = _gfn((gfn_x(start) & ~GUEST_L3_GFN_MASK) +
> -                     ((va >> PAGE_SHIFT) & GUEST_L3_GFN_MASK));
> +                     ((gla >> PAGE_SHIFT) & GUEST_L3_GFN_MASK));
>          gw->l1e = guest_l1e_from_gfn(start, flags);
>          gw->l2mfn = gw->l1mfn = INVALID_MFN;
>          leaf_level = 3;
> @@ -215,7 +216,7 @@ guest_walk_tables(struct vcpu *v, struct
>  #else /* PAE only... */
> 
>      /* Get the l3e and check its flag */
> -    gw->l3e = ((guest_l3e_t *) top_map)[guest_l3_table_offset(va)];
> +    gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
>      gflags = guest_l3e_get_flags(gw->l3e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> @@ -242,14 +243,14 @@ guest_walk_tables(struct vcpu *v, struct
>      }
> 
>      /* Get the l2e */
> -    gw->l2e = l2p[guest_l2_table_offset(va)];
> +    gw->l2e = l2p[guest_l2_table_offset(gla)];
> 
>  #else /* 32-bit only... */
> 
>      /* Get l2e from the top level table */
>      gw->l2mfn = top_mfn;
>      l2p = (guest_l2e_t *) top_map;
> -    gw->l2e = l2p[guest_l2_table_offset(va)];
> +    gw->l2e = l2p[guest_l2_table_offset(gla)];
> 
>  #endif /* All levels... */
> 
> @@ -310,7 +311,7 @@ guest_walk_tables(struct vcpu *v, struct
> 
>          /* Increment the pfn by the right number of 4k pages. */
>          start = _gfn((gfn_x(start) & ~GUEST_L2_GFN_MASK) +
> -                     guest_l1_table_offset(va));
> +                     guest_l1_table_offset(gla));
>  #if GUEST_PAGING_LEVELS == 2
>           /* Wider than 32 bits if PSE36 superpage. */
>          gw->el1e = (gfn_x(start) << PAGE_SHIFT) | flags;
> @@ -334,7 +335,7 @@ guest_walk_tables(struct vcpu *v, struct
>          gw->pfec |= rc & PFEC_synth_mask;
>          goto out;
>      }
> -    gw->l1e = l1p[guest_l1_table_offset(va)];
> +    gw->l1e = l1p[guest_l1_table_offset(gla)];
>      gflags = guest_l1e_get_flags(gw->l1e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> @@ -443,22 +444,22 @@ guest_walk_tables(struct vcpu *v, struct
>          break;
> 
>      case 1:
> -        if ( set_ad_bits(&l1p[guest_l1_table_offset(va)].l1, &gw->l1e.l1,
> +        if ( set_ad_bits(&l1p[guest_l1_table_offset(gla)].l1, &gw->l1e.l1,
>                           (walk & PFEC_write_access)) )
>              paging_mark_dirty(d, gw->l1mfn);
>          /* Fallthrough */
>      case 2:
> -        if ( set_ad_bits(&l2p[guest_l2_table_offset(va)].l2, &gw->l2e.l2,
> +        if ( set_ad_bits(&l2p[guest_l2_table_offset(gla)].l2, &gw->l2e.l2,
>                           (walk & PFEC_write_access) && leaf_level == 2) )
>              paging_mark_dirty(d, gw->l2mfn);
>          /* Fallthrough */
>  #if GUEST_PAGING_LEVELS == 4 /* 64-bit only... */
>      case 3:
> -        if ( set_ad_bits(&l3p[guest_l3_table_offset(va)].l3, &gw->l3e.l3,
> +        if ( set_ad_bits(&l3p[guest_l3_table_offset(gla)].l3, &gw->l3e.l3,
>                           (walk & PFEC_write_access) && leaf_level == 3) )
>              paging_mark_dirty(d, gw->l3mfn);
> 
> -        if ( set_ad_bits(&l4p[guest_l4_table_offset(va)].l4, &gw->l4e.l4,
> +        if ( set_ad_bits(&l4p[guest_l4_table_offset(gla)].l4, &gw->l4e.l4,
>                           false) )
>              paging_mark_dirty(d, gw->l4mfn);
>  #endif
> --- a/xen/arch/x86/mm/hap/guest_walk.c
> +++ b/xen/arch/x86/mm/hap/guest_walk.c
> @@ -26,8 +26,8 @@ asm(".file \"" __OBJECT_FILE__ "\"");
>  #include <xen/sched.h>
>  #include "private.h" /* for hap_gva_to_gfn_* */
> 
> -#define _hap_gva_to_gfn(levels) hap_gva_to_gfn_##levels##_levels
> -#define hap_gva_to_gfn(levels) _hap_gva_to_gfn(levels)
> +#define _hap_gla_to_gfn(levels) hap_gla_to_gfn_##levels##_levels
> +#define hap_gla_to_gfn(levels) _hap_gla_to_gfn(levels)
> 
>  #define _hap_p2m_ga_to_gfn(levels)
> hap_p2m_ga_to_gfn_##levels##_levels
>  #define hap_p2m_ga_to_gfn(levels) _hap_p2m_ga_to_gfn(levels)
> @@ -39,16 +39,10 @@ asm(".file \"" __OBJECT_FILE__ "\"");
>  #include <asm/guest_pt.h>
>  #include <asm/p2m.h>
> 
> -unsigned long hap_gva_to_gfn(GUEST_PAGING_LEVELS)(
> -    struct vcpu *v, struct p2m_domain *p2m, unsigned long gva, uint32_t
> *pfec)
> -{
> -    unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3];
> -    return hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(v, p2m, cr3, gva,
> pfec, NULL);
> -}
> -
> -unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
> +static unsigned long ga_to_gfn(
>      struct vcpu *v, struct p2m_domain *p2m, unsigned long cr3,
> -    paddr_t ga, uint32_t *pfec, unsigned int *page_order)
> +    paddr_t ga, uint32_t *pfec, unsigned int *page_order,
> +    struct hvmemul_cache *cache)
>  {
>      bool walk_ok;
>      mfn_t top_mfn;
> @@ -91,7 +85,8 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
>  #if GUEST_PAGING_LEVELS == 3
>      top_map += (cr3 & ~(PAGE_MASK | 31));
>  #endif
> -    walk_ok = guest_walk_tables(v, p2m, ga, &gw, *pfec, top_mfn,
> top_map);
> +    walk_ok = guest_walk_tables(v, p2m, ga, &gw, *pfec,
> +                                top_gfn, top_mfn, top_map, cache);
>      unmap_domain_page(top_map);
>      put_page(top_page);
> 
> @@ -137,6 +132,21 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
>      return gfn_x(INVALID_GFN);
>  }
> 
> +gfn_t hap_gla_to_gfn(GUEST_PAGING_LEVELS)(
> +    struct vcpu *v, struct p2m_domain *p2m, unsigned long gla, uint32_t
> *pfec,
> +    struct hvmemul_cache *cache)
> +{
> +    unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3];
> +
> +    return _gfn(ga_to_gfn(v, p2m, cr3, gla, pfec, NULL, cache));
> +}
> +
> +unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
> +    struct vcpu *v, struct p2m_domain *p2m, unsigned long cr3,
> +    paddr_t ga, uint32_t *pfec, unsigned int *page_order)
> +{
> +    return ga_to_gfn(v, p2m, cr3, ga, pfec, page_order, NULL);
> +}
> 
>  /*
>   * Local variables:
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -743,10 +743,11 @@ hap_write_p2m_entry(struct domain *d, un
>          p2m_flush_nestedp2m(d);
>  }
> 
> -static unsigned long hap_gva_to_gfn_real_mode(
> -    struct vcpu *v, struct p2m_domain *p2m, unsigned long gva, uint32_t
> *pfec)
> +static gfn_t hap_gla_to_gfn_real_mode(
> +    struct vcpu *v, struct p2m_domain *p2m, unsigned long gla, uint32_t
> *pfec,
> +    struct hvmemul_cache *cache)
>  {
> -    return ((paddr_t)gva >> PAGE_SHIFT);
> +    return gaddr_to_gfn(gla);
>  }
> 
>  static unsigned long hap_p2m_ga_to_gfn_real_mode(
> @@ -762,7 +763,7 @@ static unsigned long hap_p2m_ga_to_gfn_r
>  static const struct paging_mode hap_paging_real_mode = {
>      .page_fault             = hap_page_fault,
>      .invlpg                 = hap_invlpg,
> -    .gva_to_gfn             = hap_gva_to_gfn_real_mode,
> +    .gla_to_gfn             = hap_gla_to_gfn_real_mode,
>      .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_real_mode,
>      .update_cr3             = hap_update_cr3,
>      .update_paging_modes    = hap_update_paging_modes,
> @@ -773,7 +774,7 @@ static const struct paging_mode hap_pagi
>  static const struct paging_mode hap_paging_protected_mode = {
>      .page_fault             = hap_page_fault,
>      .invlpg                 = hap_invlpg,
> -    .gva_to_gfn             = hap_gva_to_gfn_2_levels,
> +    .gla_to_gfn             = hap_gla_to_gfn_2_levels,
>      .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_2_levels,
>      .update_cr3             = hap_update_cr3,
>      .update_paging_modes    = hap_update_paging_modes,
> @@ -784,7 +785,7 @@ static const struct paging_mode hap_pagi
>  static const struct paging_mode hap_paging_pae_mode = {
>      .page_fault             = hap_page_fault,
>      .invlpg                 = hap_invlpg,
> -    .gva_to_gfn             = hap_gva_to_gfn_3_levels,
> +    .gla_to_gfn             = hap_gla_to_gfn_3_levels,
>      .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_3_levels,
>      .update_cr3             = hap_update_cr3,
>      .update_paging_modes    = hap_update_paging_modes,
> @@ -795,7 +796,7 @@ static const struct paging_mode hap_pagi
>  static const struct paging_mode hap_paging_long_mode = {
>      .page_fault             = hap_page_fault,
>      .invlpg                 = hap_invlpg,
> -    .gva_to_gfn             = hap_gva_to_gfn_4_levels,
> +    .gla_to_gfn             = hap_gla_to_gfn_4_levels,
>      .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_4_levels,
>      .update_cr3             = hap_update_cr3,
>      .update_paging_modes    = hap_update_paging_modes,
> --- a/xen/arch/x86/mm/hap/private.h
> +++ b/xen/arch/x86/mm/hap/private.h
> @@ -24,18 +24,21 @@
>  /********************************************/
>  /*          GUEST TRANSLATION FUNCS         */
>  /********************************************/
> -unsigned long hap_gva_to_gfn_2_levels(struct vcpu *v,
> -                                     struct p2m_domain *p2m,
> -                                     unsigned long gva,
> -                                     uint32_t *pfec);
> -unsigned long hap_gva_to_gfn_3_levels(struct vcpu *v,
> -                                     struct p2m_domain *p2m,
> -                                     unsigned long gva,
> -                                     uint32_t *pfec);
> -unsigned long hap_gva_to_gfn_4_levels(struct vcpu *v,
> -                                     struct p2m_domain *p2m,
> -                                     unsigned long gva,
> -                                     uint32_t *pfec);
> +gfn_t hap_gla_to_gfn_2_levels(struct vcpu *v,
> +                              struct p2m_domain *p2m,
> +                              unsigned long gla,
> +                              uint32_t *pfec,
> +                              struct hvmemul_cache *cache);
> +gfn_t hap_gla_to_gfn_3_levels(struct vcpu *v,
> +                              struct p2m_domain *p2m,
> +                              unsigned long gla,
> +                              uint32_t *pfec,
> +                              struct hvmemul_cache *cache);
> +gfn_t hap_gla_to_gfn_4_levels(struct vcpu *v,
> +                              struct p2m_domain *p2m,
> +                              unsigned long gla,
> +                              uint32_t *pfec,
> +                              struct hvmemul_cache *cache);
> 
>  unsigned long hap_p2m_ga_to_gfn_2_levels(struct vcpu *v,
>      struct p2m_domain *p2m, unsigned long cr3,
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -1968,16 +1968,16 @@ void np2m_schedule(int dir)
>      }
>  }
> 
> -unsigned long paging_gva_to_gfn(struct vcpu *v,
> -                                unsigned long va,
> -                                uint32_t *pfec)
> +gfn_t paging_gla_to_gfn(struct vcpu *v, unsigned long gla, uint32_t *pfec,
> +                        struct hvmemul_cache *cache)
>  {
>      struct p2m_domain *hostp2m = p2m_get_hostp2m(v->domain);
>      const struct paging_mode *hostmode = paging_get_hostmode(v);
> 
>      if ( is_hvm_vcpu(v) && paging_mode_hap(v->domain) &&
> nestedhvm_is_n2(v) )
>      {
> -        unsigned long l2_gfn, l1_gfn;
> +        gfn_t l2_gfn;
> +        unsigned long l1_gfn;
>          struct p2m_domain *p2m;
>          const struct paging_mode *mode;
>          uint8_t l1_p2ma;
> @@ -1987,31 +1987,31 @@ unsigned long paging_gva_to_gfn(struct v
>          /* translate l2 guest va into l2 guest gfn */
>          p2m = p2m_get_nestedp2m(v);
>          mode = paging_get_nestedmode(v);
> -        l2_gfn = mode->gva_to_gfn(v, p2m, va, pfec);
> +        l2_gfn = mode->gla_to_gfn(v, p2m, gla, pfec, cache);
> 
> -        if ( l2_gfn == gfn_x(INVALID_GFN) )
> -            return gfn_x(INVALID_GFN);
> +        if ( gfn_eq(l2_gfn, INVALID_GFN) )
> +            return INVALID_GFN;
> 
>          /* translate l2 guest gfn into l1 guest gfn */
> -        rv = nestedhap_walk_L1_p2m(v, l2_gfn, &l1_gfn, &l1_page_order,
> &l1_p2ma,
> -                                   1,
> +        rv = nestedhap_walk_L1_p2m(v, gfn_x(l2_gfn), &l1_gfn,
> &l1_page_order,
> +                                   &l1_p2ma, 1,
>                                     !!(*pfec & PFEC_write_access),
>                                     !!(*pfec & PFEC_insn_fetch));
> 
>          if ( rv != NESTEDHVM_PAGEFAULT_DONE )
> -            return gfn_x(INVALID_GFN);
> +            return INVALID_GFN;
> 
>          /*
>           * Sanity check that l1_gfn can be used properly as a 4K mapping, even
>           * if it mapped by a nested superpage.
>           */
> -        ASSERT((l2_gfn & ((1ul << l1_page_order) - 1)) ==
> +        ASSERT((gfn_x(l2_gfn) & ((1ul << l1_page_order) - 1)) ==
>                 (l1_gfn & ((1ul << l1_page_order) - 1)));
> 
> -        return l1_gfn;
> +        return _gfn(l1_gfn);
>      }
> 
> -    return hostmode->gva_to_gfn(v, hostp2m, va, pfec);
> +    return hostmode->gla_to_gfn(v, hostp2m, gla, pfec, cache);
>  }
> 
>  /*
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -1699,15 +1699,15 @@ static unsigned int shadow_get_allocatio
>  static mfn_t emulate_gva_to_mfn(struct vcpu *v, unsigned long vaddr,
>                                  struct sh_emulate_ctxt *sh_ctxt)
>  {
> -    unsigned long gfn;
> +    gfn_t gfn;
>      struct page_info *page;
>      mfn_t mfn;
>      p2m_type_t p2mt;
>      uint32_t pfec = PFEC_page_present | PFEC_write_access;
> 
>      /* Translate the VA to a GFN. */
> -    gfn = paging_get_hostmode(v)->gva_to_gfn(v, NULL, vaddr, &pfec);
> -    if ( gfn == gfn_x(INVALID_GFN) )
> +    gfn = paging_get_hostmode(v)->gla_to_gfn(v, NULL, vaddr, &pfec,
> NULL);
> +    if ( gfn_eq(gfn, INVALID_GFN) )
>      {
>          x86_emul_pagefault(pfec, vaddr, &sh_ctxt->ctxt);
> 
> @@ -1717,7 +1717,7 @@ static mfn_t emulate_gva_to_mfn(struct v
>      /* Translate the GFN to an MFN. */
>      ASSERT(!paging_locked_by_me(v->domain));
> 
> -    page = get_page_from_gfn(v->domain, gfn, &p2mt, P2M_ALLOC);
> +    page = get_page_from_gfn(v->domain, gfn_x(gfn), &p2mt,
> P2M_ALLOC);
> 
>      /* Sanity checking. */
>      if ( page == NULL )
> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -173,17 +173,20 @@ delete_shadow_status(struct domain *d, m
> 
>  static inline bool
>  sh_walk_guest_tables(struct vcpu *v, unsigned long va, walk_t *gw,
> -                     uint32_t pfec)
> +                     uint32_t pfec, struct hvmemul_cache *cache)
>  {
>      return guest_walk_tables(v, p2m_get_hostp2m(v->domain), va, gw, pfec,
> +                             _gfn(paging_mode_external(v->domain)
> +                                  ? cr3_pa(v->arch.hvm_vcpu.guest_cr[3]) >> PAGE_SHIFT
> +                                  : pagetable_get_pfn(v->arch.guest_table)),
>  #if GUEST_PAGING_LEVELS == 3 /* PAE */
>                               INVALID_MFN,
> -                             v->arch.paging.shadow.gl3e
> +                             v->arch.paging.shadow.gl3e,
>  #else /* 32 or 64 */
>                               pagetable_get_mfn(v->arch.guest_table),
> -                             v->arch.paging.shadow.guest_vtable
> +                             v->arch.paging.shadow.guest_vtable,
>  #endif
> -                             );
> +                             cache);
>  }
> 
>  /* This validation is called with lock held, and after write permission
> @@ -3035,7 +3038,7 @@ static int sh_page_fault(struct vcpu *v,
>       * shadow page table. */
>      version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
>      smp_rmb();
> -    walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
> +    walk_ok = sh_walk_guest_tables(v, va, &gw, error_code, NULL);
> 
>  #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
>      regs->error_code &= ~PFEC_page_present;
> @@ -3675,9 +3678,9 @@ static bool sh_invlpg(struct vcpu *v, un
>  }
> 
> 
> -static unsigned long
> -sh_gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
> -    unsigned long va, uint32_t *pfec)
> +static gfn_t
> +sh_gla_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
> +    unsigned long gla, uint32_t *pfec, struct hvmemul_cache *cache)
>  /* Called to translate a guest virtual address to what the *guest*
>   * pagetables would map it to. */
>  {
> @@ -3687,24 +3690,25 @@ sh_gva_to_gfn(struct vcpu *v, struct p2m
> 
>  #if (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB)
>      /* Check the vTLB cache first */
> -    unsigned long vtlb_gfn = vtlb_lookup(v, va, *pfec);
> +    unsigned long vtlb_gfn = vtlb_lookup(v, gla, *pfec);
> +
>      if ( vtlb_gfn != gfn_x(INVALID_GFN) )
> -        return vtlb_gfn;
> +        return _gfn(vtlb_gfn);
>  #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
> 
> -    if ( !(walk_ok = sh_walk_guest_tables(v, va, &gw, *pfec)) )
> +    if ( !(walk_ok = sh_walk_guest_tables(v, gla, &gw, *pfec, cache)) )
>      {
>          *pfec = gw.pfec;
> -        return gfn_x(INVALID_GFN);
> +        return INVALID_GFN;
>      }
>      gfn = guest_walk_to_gfn(&gw);
> 
>  #if (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB)
>      /* Remember this successful VA->GFN translation for later. */
> -    vtlb_insert(v, va >> PAGE_SHIFT, gfn_x(gfn), *pfec);
> +    vtlb_insert(v, gla >> PAGE_SHIFT, gfn_x(gfn), *pfec);
>  #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
> 
> -    return gfn_x(gfn);
> +    return gfn;
>  }
> 
> 
> @@ -4949,7 +4953,7 @@ int sh_audit_l4_table(struct vcpu *v, mf
>  const struct paging_mode sh_paging_mode = {
>      .page_fault                    = sh_page_fault,
>      .invlpg                        = sh_invlpg,
> -    .gva_to_gfn                    = sh_gva_to_gfn,
> +    .gla_to_gfn                    = sh_gla_to_gfn,
>      .update_cr3                    = sh_update_cr3,
>      .update_paging_modes           = shadow_update_paging_modes,
>      .write_p2m_entry               = shadow_write_p2m_entry,
> --- a/xen/arch/x86/mm/shadow/none.c
> +++ b/xen/arch/x86/mm/shadow/none.c
> @@ -43,11 +43,12 @@ static bool _invlpg(struct vcpu *v, unsi
>      return true;
>  }
> 
> -static unsigned long _gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
> -                                 unsigned long va, uint32_t *pfec)
> +static gfn_t _gla_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
> +                         unsigned long gla, uint32_t *pfec,
> +                         struct hvmemul_cache *cache)
>  {
>      ASSERT_UNREACHABLE();
> -    return gfn_x(INVALID_GFN);
> +    return INVALID_GFN;
>  }
> 
>  static void _update_cr3(struct vcpu *v, int do_locking, bool noflush)
> @@ -70,7 +71,7 @@ static void _write_p2m_entry(struct doma
>  static const struct paging_mode sh_paging_none = {
>      .page_fault                    = _page_fault,
>      .invlpg                        = _invlpg,
> -    .gva_to_gfn                    = _gva_to_gfn,
> +    .gla_to_gfn                    = _gla_to_gfn,
>      .update_cr3                    = _update_cr3,
>      .update_paging_modes           = _update_paging_modes,
>      .write_p2m_entry               = _write_p2m_entry,
> --- a/xen/include/asm-x86/guest_pt.h
> +++ b/xen/include/asm-x86/guest_pt.h
> @@ -425,7 +425,8 @@ static inline unsigned int guest_walk_to
> 
>  bool
>  guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m, unsigned long
> va,
> -                  walk_t *gw, uint32_t pfec, mfn_t top_mfn, void *top_map);
> +                  walk_t *gw, uint32_t pfec, gfn_t top_gfn, mfn_t top_mfn,
> +                  void *top_map, struct hvmemul_cache *cache);
> 
>  /* Pretty-print the contents of a guest-walk */
>  static inline void print_gw(const walk_t *gw)
> --- a/xen/include/asm-x86/hvm/vcpu.h
> +++ b/xen/include/asm-x86/hvm/vcpu.h
> @@ -53,6 +53,8 @@ struct hvm_mmio_cache {
>      uint8_t buffer[32];
>  };
> 
> +struct hvmemul_cache;
> +
>  struct hvm_vcpu_io {
>      /* I/O request in flight to device model. */
>      enum hvm_io_completion io_completion;
> --- a/xen/include/asm-x86/paging.h
> +++ b/xen/include/asm-x86/paging.h
> @@ -107,10 +107,11 @@ struct paging_mode {
>      int           (*page_fault            )(struct vcpu *v, unsigned long va,
>                                              struct cpu_user_regs *regs);
>      bool          (*invlpg                )(struct vcpu *v, unsigned long va);
> -    unsigned long (*gva_to_gfn            )(struct vcpu *v,
> +    gfn_t         (*gla_to_gfn            )(struct vcpu *v,
>                                              struct p2m_domain *p2m,
> -                                            unsigned long va,
> -                                            uint32_t *pfec);
> +                                            unsigned long gla,
> +                                            uint32_t *pfec,
> +                                            struct hvmemul_cache *cache);
>      unsigned long (*p2m_ga_to_gfn         )(struct vcpu *v,
>                                              struct p2m_domain *p2m,
>                                              unsigned long cr3,
> @@ -246,9 +247,10 @@ void paging_invlpg(struct vcpu *v, unsig
>   * SDM Intel 64 Volume 3, Chapter Paging, PAGE-FAULT EXCEPTIONS:
>   * The PFEC_insn_fetch flag is set only when NX or SMEP are enabled.
>   */
> -unsigned long paging_gva_to_gfn(struct vcpu *v,
> -                                unsigned long va,
> -                                uint32_t *pfec);
> +gfn_t paging_gla_to_gfn(struct vcpu *v,
> +                        unsigned long va,
> +                        uint32_t *pfec,
> +                        struct hvmemul_cache *cache);
> 
>  /* Translate a guest address using a particular CR3 value.  This is used
>   * to by nested HAP code, to walk the guest-supplied NPT tables as if
> 
> 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] x86/mm: use optional cache in guest_walk_tables()
  2018-07-19 10:47 ` [PATCH 2/6] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
@ 2018-07-19 13:22   ` Paul Durrant
  2018-09-03 15:12     ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Durrant @ 2018-07-19 13:22 UTC (permalink / raw)
  To: 'Jan Beulich', xen-devel; +Cc: Andrew Cooper, George Dunlap

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 19 July 2018 11:47
> To: xen-devel <xen-devel@lists.xenproject.org>
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Paul Durrant
> <Paul.Durrant@citrix.com>; George Dunlap <George.Dunlap@citrix.com>
> Subject: [PATCH 2/6] x86/mm: use optional cache in guest_walk_tables()
> 
> The caching isn't actually implemented here, this is just setting the
> stage.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -2572,6 +2572,18 @@ void hvm_dump_emulation_state(const char
>             hvmemul_ctxt->insn_buf);
>  }
> 
> +bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t
> gpa,
> +                        unsigned int level, void *buffer, unsigned int size)
> +{
> +    return false;
> +}
> +
> +void hvmemul_write_cache(struct hvmemul_cache *cache, paddr_t gpa,
> +                         unsigned int level, const void *buffer,
> +                         unsigned int size)
> +{
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> --- a/xen/arch/x86/mm/guest_walk.c
> +++ b/xen/arch/x86/mm/guest_walk.c
> @@ -94,6 +94,7 @@ guest_walk_tables(struct vcpu *v, struct
>      guest_l4e_t *l4p;
>  #endif
>      uint32_t gflags, rc;
> +    paddr_t gpa;
>      unsigned int leaf_level;
>      p2m_query_t qt = P2M_ALLOC | P2M_UNSHARE;
> 
> @@ -134,7 +135,15 @@ guest_walk_tables(struct vcpu *v, struct
>      /* Get the l4e from the top level table and check its flags*/
>      gw->l4mfn = top_mfn;
>      l4p = (guest_l4e_t *) top_map;
> -    gw->l4e = l4p[guest_l4_table_offset(gla)];
> +    gpa = gfn_to_gaddr(top_gfn) +
> +          guest_l4_table_offset(gla) * sizeof(guest_l4e_t);
> +    if ( !cache ||
> +         !hvmemul_read_cache(cache, gpa, 4, &gw->l4e, sizeof(gw->l4e)) )
> +    {
> +        gw->l4e = l4p[guest_l4_table_offset(gla)];
> +        if ( cache )
> +            hvmemul_write_cache(cache, gpa, 4, &gw->l4e, sizeof(gw->l4e));
> +    }
>      gflags = guest_l4e_get_flags(gw->l4e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> @@ -164,7 +173,15 @@ guest_walk_tables(struct vcpu *v, struct
>      }
> 
>      /* Get the l3e and check its flags*/
> -    gw->l3e = l3p[guest_l3_table_offset(gla)];
> +    gpa = gfn_to_gaddr(guest_l4e_get_gfn(gw->l4e)) +
> +          guest_l3_table_offset(gla) * sizeof(guest_l3e_t);
> +    if ( !cache ||
> +         !hvmemul_read_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
> +    {
> +        gw->l3e = l3p[guest_l3_table_offset(gla)];
> +        if ( cache )
> +            hvmemul_write_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e));
> +    }
>      gflags = guest_l3e_get_flags(gw->l3e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> @@ -216,7 +233,16 @@ guest_walk_tables(struct vcpu *v, struct
>  #else /* PAE only... */
> 
>      /* Get the l3e and check its flag */
> -    gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
> +    gpa = gfn_to_gaddr(top_gfn) + ((unsigned long)top_map &
> ~PAGE_MASK) +
> +          guest_l3_table_offset(gla) * sizeof(guest_l3e_t);
> +    if ( !cache ||
> +         !hvmemul_read_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
> +    {
> +        gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
> +        if ( cache )
> +            hvmemul_write_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e));
> +    }
> +
>      gflags = guest_l3e_get_flags(gw->l3e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> @@ -242,18 +268,24 @@ guest_walk_tables(struct vcpu *v, struct
>          goto out;
>      }
> 
> -    /* Get the l2e */
> -    gw->l2e = l2p[guest_l2_table_offset(gla)];
> -
>  #else /* 32-bit only... */
> 
> -    /* Get l2e from the top level table */
>      gw->l2mfn = top_mfn;
>      l2p = (guest_l2e_t *) top_map;
> -    gw->l2e = l2p[guest_l2_table_offset(gla)];
> 
>  #endif /* All levels... */
> 
> +    /* Get the l2e */
> +    gpa = gfn_to_gaddr(top_gfn) +
> +          guest_l2_table_offset(gla) * sizeof(guest_l2e_t);
> +    if ( !cache ||
> +         !hvmemul_read_cache(cache, gpa, 2, &gw->l2e, sizeof(gw->l2e)) )
> +    {
> +        gw->l2e = l2p[guest_l2_table_offset(gla)];
> +        if ( cache )
> +            hvmemul_write_cache(cache, gpa, 2, &gw->l2e, sizeof(gw->l2e));
> +    }
> +
>      /* Check the l2e flags. */
>      gflags = guest_l2e_get_flags(gw->l2e);
>      if ( !(gflags & _PAGE_PRESENT) )
> @@ -335,7 +367,17 @@ guest_walk_tables(struct vcpu *v, struct
>          gw->pfec |= rc & PFEC_synth_mask;
>          goto out;
>      }
> -    gw->l1e = l1p[guest_l1_table_offset(gla)];
> +
> +    gpa = gfn_to_gaddr(top_gfn) +
> +          guest_l1_table_offset(gla) * sizeof(guest_l1e_t);
> +    if ( !cache ||
> +         !hvmemul_read_cache(cache, gpa, 1, &gw->l1e, sizeof(gw->l1e)) )
> +    {
> +        gw->l1e = l1p[guest_l1_table_offset(gla)];
> +        if ( cache )
> +            hvmemul_write_cache(cache, gpa, 1, &gw->l1e, sizeof(gw->l1e));
> +    }
> +
>      gflags = guest_l1e_get_flags(gw->l1e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> --- a/xen/include/asm-x86/hvm/emulate.h
> +++ b/xen/include/asm-x86/hvm/emulate.h
> @@ -98,6 +98,13 @@ int hvmemul_do_pio_buffer(uint16_t port,
>                            uint8_t dir,
>                            void *buffer);
> 
> +struct hvmemul_cache;
> +bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
> +                        unsigned int level, void *buffer, unsigned int size);
> +void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
> +                         unsigned int level, const void *buffer,
> +                         unsigned int size);
> +
>  void hvm_dump_emulation_state(const char *loglvl, const char *prefix,
>                                struct hvm_emulate_ctxt *hvmemul_ctxt, int rc);
> 
> 
> 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] x86/HVM: implement memory read caching
  2018-07-19 10:48 ` [PATCH 3/6] x86/HVM: implement memory read caching Jan Beulich
@ 2018-07-19 14:20   ` Paul Durrant
  2018-07-24 10:28     ` Jan Beulich
  2018-07-23 15:45   ` Tim Deegan
  1 sibling, 1 reply; 28+ messages in thread
From: Paul Durrant @ 2018-07-19 14:20 UTC (permalink / raw)
  To: 'Jan Beulich', xen-devel
  Cc: Kevin Tian, Suravee Suthikulpanit, Andrew Cooper, Tim (Xen.org),
	George Dunlap, Jun Nakajima, Boris Ostrovsky, Brian Woods

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 19 July 2018 11:49
> To: xen-devel <xen-devel@lists.xenproject.org>
> Cc: Brian Woods <brian.woods@amd.com>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>;
> George Dunlap <George.Dunlap@citrix.com>; Jun Nakajima
> <jun.nakajima@intel.com>; Kevin Tian <kevin.tian@intel.com>; Boris
> Ostrovsky <boris.ostrovsky@oracle.com>; Tim (Xen.org) <tim@xen.org>
> Subject: [PATCH 3/6] x86/HVM: implement memory read caching
> 
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far use of CPU
> registers goes (as those can't change without any other instruction
> executing in between), but is wrong for memory accesses. In particular
> it has been observed that Windows might page out buffers underneath an
> instruction currently under emulation (hitting between two passes). If
> the first pass translated a linear address successfully, any subsequent
> pass needs to do so too, yielding the exact same translation.
> 
> Introduce a cache (used just by guest page table accesses for now) to
> make sure above described assumption holds. This is a very simplistic
> implementation for now: Only exact matches are satisfied (no overlaps or
> partial reads or anything).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -27,6 +27,18 @@
>  #include <asm/hvm/svm/svm.h>
>  #include <asm/vm_event.h>
> 
> +struct hvmemul_cache
> +{
> +    unsigned int max_ents;
> +    unsigned int num_ents;
> +    struct {
> +        paddr_t gpa:PADDR_BITS;
> +        unsigned int size:(BITS_PER_LONG - PADDR_BITS) / 2;
> +        unsigned int level:(BITS_PER_LONG - PADDR_BITS) / 2;
> +        unsigned long data;
> +    } ents[];
> +};
> +
>  static void hvmtrace_io_assist(const ioreq_t *p)
>  {
>      unsigned int size, event;
> @@ -520,7 +532,7 @@ static int hvmemul_do_mmio_addr(paddr_t
>   */
>  static void *hvmemul_map_linear_addr(
>      unsigned long linear, unsigned int bytes, uint32_t pfec,
> -    struct hvm_emulate_ctxt *hvmemul_ctxt)
> +    struct hvm_emulate_ctxt *hvmemul_ctxt, struct hvmemul_cache
> *cache)
>  {
>      struct vcpu *curr = current;
>      void *err, *mapping;
> @@ -565,7 +577,7 @@ static void *hvmemul_map_linear_addr(
>          ASSERT(mfn_x(*mfn) == 0);
> 
>          res = hvm_translate_get_page(curr, addr, true, pfec,
> -                                     &pfinfo, &page, NULL, &p2mt);
> +                                     &pfinfo, &page, NULL, &p2mt, cache);
> 
>          switch ( res )
>          {
> @@ -681,6 +693,8 @@ static int hvmemul_linear_to_phys(
>      gfn_t gfn, ngfn;
>      unsigned long done, todo, i, offset = addr & ~PAGE_MASK;
>      int reverse;
> +    struct hvmemul_cache *cache = pfec & PFEC_insn_fetch
> +                                  ? NULL : curr->arch.hvm_vcpu.data_cache;
> 
>      /*
>       * Clip repetitions to a sensible maximum. This avoids extensive looping in
> @@ -710,7 +724,7 @@ static int hvmemul_linear_to_phys(
>              return rc;
>          gfn = gaddr_to_gfn(gaddr);
>      }
> -    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, NULL),
> +    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, cache),
>                       INVALID_GFN) )
>      {
>          if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
> @@ -726,7 +740,7 @@ static int hvmemul_linear_to_phys(
>      {
>          /* Get the next PFN in the range. */
>          addr += reverse ? -PAGE_SIZE : PAGE_SIZE;
> -        ngfn = paging_gla_to_gfn(curr, addr, &pfec, NULL);
> +        ngfn = paging_gla_to_gfn(curr, addr, &pfec, cache);
> 
>          /* Is it contiguous with the preceding PFNs? If not then we're done. */
>          if ( gfn_eq(ngfn, INVALID_GFN) ||
> @@ -1046,6 +1060,8 @@ static int __hvmemul_read(
>          pfec |= PFEC_implicit;
>      else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].dpl == 3 )
>          pfec |= PFEC_user_mode;
> +    if ( access_type == hvm_access_insn_fetch )
> +        pfec |= PFEC_insn_fetch;

Since you OR the insn_fetch flag in here...

> 
>      rc = hvmemul_virtual_to_linear(
>          seg, offset, bytes, &reps, access_type, hvmemul_ctxt, &addr);
> @@ -1059,7 +1075,8 @@ static int __hvmemul_read(
> 
>      rc = ((access_type == hvm_access_insn_fetch) ?
>            hvm_fetch_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo) :

...could you not just use hvm_copy_from_guest_linear() here regardless of access_type (and just NULL out the cache argument if it is an insn_fetch)?

AFAICT the only reason hvm_fetch_from_guest_linear() exists is to OR the extra flag in.
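
I.e. (just a sketch) something like:

    rc = hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo,
                                    access_type == hvm_access_insn_fetch
                                    ? NULL
                                    : curr->arch.hvm_vcpu.data_cache);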

  Paul

> -          hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo));
> +          hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo,
> +                                     curr->arch.hvm_vcpu.data_cache));
> 
>      switch ( rc )
>      {
> @@ -1178,7 +1195,8 @@ static int hvmemul_write(
>           (vio->mmio_gla == (addr & PAGE_MASK)) )
>          return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec,
> hvmemul_ctxt, 1);
> 
> -    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
> +    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
> +                                      curr->arch.hvm_vcpu.data_cache);
>      if ( IS_ERR(mapping) )
>          return ~PTR_ERR(mapping);
> 
> @@ -1218,7 +1236,8 @@ static int hvmemul_rmw(
>      else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].dpl == 3 )
>          pfec |= PFEC_user_mode;
> 
> -    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
> +    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
> +                                      current->arch.hvm_vcpu.data_cache);
>      if ( IS_ERR(mapping) )
>          return ~PTR_ERR(mapping);
> 
> @@ -1375,7 +1394,8 @@ static int hvmemul_cmpxchg(
>      else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].dpl == 3 )
>          pfec |= PFEC_user_mode;
> 
> -    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
> +    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
> +                                      curr->arch.hvm_vcpu.data_cache);
>      if ( IS_ERR(mapping) )
>          return ~PTR_ERR(mapping);
> 
> @@ -2282,6 +2302,7 @@ static int _hvm_emulate_one(struct hvm_e
>      {
>          vio->mmio_cache_count = 0;
>          vio->mmio_insn_bytes = 0;
> +        curr->arch.hvm_vcpu.data_cache->num_ents = 0;
>      }
>      else
>      {
> @@ -2572,9 +2593,35 @@ void hvm_dump_emulation_state(const char
>             hvmemul_ctxt->insn_buf);
>  }
> 
> +struct hvmemul_cache *hvmemul_cache_init(unsigned int nents)
> +{
> +    struct hvmemul_cache *cache = xmalloc_bytes(offsetof(struct
> hvmemul_cache,
> +                                                         ents[nents]));
> +
> +    if ( cache )
> +    {
> +        cache->num_ents = 0;
> +        cache->max_ents = nents;
> +    }
> +
> +    return cache;
> +}
> +
>  bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t gpa,
>                          unsigned int level, void *buffer, unsigned int size)
>  {
> +    unsigned int i;
> +
> +    ASSERT(size <= sizeof(cache->ents->data));
> +
> +    for ( i = 0; i < cache->num_ents; ++i )
> +        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
> +             cache->ents[i].size == size )
> +        {
> +            memcpy(buffer, &cache->ents[i].data, size);
> +            return true;
> +        }
> +
>      return false;
>  }
> 
> @@ -2582,6 +2629,35 @@ void hvmemul_write_cache(struct hvmemul_
>                           unsigned int level, const void *buffer,
>                           unsigned int size)
>  {
> +    unsigned int i;
> +
> +    if ( size > sizeof(cache->ents->data) )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return;
> +    }
> +
> +    for ( i = 0; i < cache->num_ents; ++i )
> +        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
> +             cache->ents[i].size == size )
> +        {
> +            memcpy(&cache->ents[i].data, buffer, size);
> +            return;
> +        }
> +
> +    if ( unlikely(i >= cache->max_ents) )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return;
> +    }
> +
> +    cache->ents[i].level = level;
> +    cache->ents[i].gpa   = gpa;
> +    cache->ents[i].size  = size;
> +
> +    memcpy(&cache->ents[i].data, buffer, size);
> +
> +    cache->num_ents = i + 1;
>  }
> 
>  /*
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1530,6 +1530,18 @@ int hvm_vcpu_initialise(struct vcpu *v)
> 
>      v->arch.hvm_vcpu.inject_event.vector = HVM_EVENT_VECTOR_UNSET;
> 
> +    /*
> +     * Leaving aside the insn fetch, for which we don't use this cache, no
> +     * insn can access more than 8 independent linear addresses (AVX2
> +     * gathers being the worst). Each such linear range can span a page
> +     * boundary, i.e. require two page walks.
> +     */
> +    v->arch.hvm_vcpu.data_cache = hvmemul_cache_init(CONFIG_PAGING_LEVELS *
> +                                                     8 * 2);
> +    rc = -ENOMEM;
> +    if ( !v->arch.hvm_vcpu.data_cache )
> +        goto fail4;
> +
>      rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
>      if ( rc != 0 )
>          goto fail4;
> @@ -1559,6 +1571,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
>   fail5:
>      free_compat_arg_xlat(v);
>   fail4:
> +    hvmemul_cache_destroy(v->arch.hvm_vcpu.data_cache);
>      hvm_funcs.vcpu_destroy(v);
>   fail3:
>      vlapic_destroy(v);
> @@ -1581,6 +1594,8 @@ void hvm_vcpu_destroy(struct vcpu *v)
> 
>      free_compat_arg_xlat(v);
> 
> +    hvmemul_cache_destroy(v->arch.hvm_vcpu.data_cache);
> +
>      tasklet_kill(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet);
>      hvm_funcs.vcpu_destroy(v);
> 
> @@ -2949,7 +2964,7 @@ void hvm_task_switch(
>      }
> 
>      rc = hvm_copy_from_guest_linear(
> -        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
> +        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
>      if ( rc == HVMTRANS_bad_linear_to_gfn )
>          hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
>      if ( rc != HVMTRANS_okay )
> @@ -2996,7 +3011,7 @@ void hvm_task_switch(
>          goto out;
> 
>      rc = hvm_copy_from_guest_linear(
> -        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
> +        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
>      if ( rc == HVMTRANS_bad_linear_to_gfn )
>          hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
>      /*
> @@ -3107,7 +3122,7 @@ void hvm_task_switch(
>  enum hvm_translation_result hvm_translate_get_page(
>      struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
>      pagefault_info_t *pfinfo, struct page_info **page_p,
> -    gfn_t *gfn_p, p2m_type_t *p2mt_p)
> +    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache)
>  {
>      struct page_info *page;
>      p2m_type_t p2mt;
> @@ -3115,7 +3130,7 @@ enum hvm_translation_result hvm_translat
> 
>      if ( linear )
>      {
> -        gfn = paging_gla_to_gfn(v, addr, &pfec, NULL);
> +        gfn = paging_gla_to_gfn(v, addr, &pfec, cache);
> 
>          if ( gfn_eq(gfn, INVALID_GFN) )
>          {
> @@ -3187,7 +3202,7 @@ enum hvm_translation_result hvm_translat
>  #define HVMCOPY_linear     (1u<<2)
>  static enum hvm_translation_result __hvm_copy(
>      void *buf, paddr_t addr, int size, struct vcpu *v, unsigned int flags,
> -    uint32_t pfec, pagefault_info_t *pfinfo)
> +    uint32_t pfec, pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
>  {
>      gfn_t gfn;
>      struct page_info *page;
> @@ -3220,8 +3235,8 @@ static enum hvm_translation_result __hvm
> 
>          count = min_t(int, PAGE_SIZE - gpa, todo);
> 
> -        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear,
> -                                     pfec, pfinfo, &page, &gfn, &p2mt);
> +        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear, pfec,
> +                                     pfinfo, &page, &gfn, &p2mt, cache);
>          if ( res != HVMTRANS_okay )
>              return res;
> 
> @@ -3268,14 +3283,14 @@ enum hvm_translation_result hvm_copy_to_
>      paddr_t paddr, void *buf, int size, struct vcpu *v)
>  {
>      return __hvm_copy(buf, paddr, size, v,
> -                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL);
> +                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL, NULL);
>  }
> 
>  enum hvm_translation_result hvm_copy_from_guest_phys(
>      void *buf, paddr_t paddr, int size)
>  {
>      return __hvm_copy(buf, paddr, size, current,
> -                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL);
> +                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL, NULL);
>  }
> 
>  enum hvm_translation_result hvm_copy_to_guest_linear(
> @@ -3284,16 +3299,17 @@ enum hvm_translation_result hvm_copy_to_
>  {
>      return __hvm_copy(buf, addr, size, current,
>                        HVMCOPY_to_guest | HVMCOPY_linear,
> -                      PFEC_page_present | PFEC_write_access | pfec, pfinfo);
> +                      PFEC_page_present | PFEC_write_access | pfec,
> +                      pfinfo, NULL);
>  }
> 
>  enum hvm_translation_result hvm_copy_from_guest_linear(
>      void *buf, unsigned long addr, int size, uint32_t pfec,
> -    pagefault_info_t *pfinfo)
> +    pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
>  {
>      return __hvm_copy(buf, addr, size, current,
>                        HVMCOPY_from_guest | HVMCOPY_linear,
> -                      PFEC_page_present | pfec, pfinfo);
> +                      PFEC_page_present | pfec, pfinfo, cache);
>  }
> 
>  enum hvm_translation_result hvm_fetch_from_guest_linear(
> @@ -3302,7 +3318,7 @@ enum hvm_translation_result hvm_fetch_fr
>  {
>      return __hvm_copy(buf, addr, size, current,
>                        HVMCOPY_from_guest | HVMCOPY_linear,
> -                      PFEC_page_present | PFEC_insn_fetch | pfec, pfinfo);
> +                      PFEC_page_present | PFEC_insn_fetch | pfec, pfinfo, NULL);
>  }
> 
>  unsigned long copy_to_user_hvm(void *to, const void *from, unsigned int len)
> @@ -3343,7 +3359,8 @@ unsigned long copy_from_user_hvm(void *t
>          return 0;
>      }
> 
> -    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len, 0, NULL);
> +    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len,
> +                                    0, NULL, NULL);
>      return rc ? len : 0; /* fake a copy_from_user() return code */
>  }
> 
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1358,7 +1358,7 @@ static void svm_emul_swint_injection(str
>          goto raise_exception;
> 
>      rc = hvm_copy_from_guest_linear(&idte, idte_linear_addr, idte_size,
> -                                    PFEC_implicit, &pfinfo);
> +                                    PFEC_implicit, &pfinfo, NULL);
>      if ( rc )
>      {
>          if ( rc == HVMTRANS_bad_linear_to_gfn )
> --- a/xen/arch/x86/hvm/vmx/vvmx.c
> +++ b/xen/arch/x86/hvm/vmx/vvmx.c
> @@ -475,7 +475,7 @@ static int decode_vmx_inst(struct cpu_us
>          {
>              pagefault_info_t pfinfo;
>              int rc = hvm_copy_from_guest_linear(poperandS, base, size,
> -                                                0, &pfinfo);
> +                                                0, &pfinfo, NULL);
> 
>              if ( rc == HVMTRANS_bad_linear_to_gfn )
>                  hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -206,7 +206,7 @@ hvm_read(enum x86_segment seg,
>      if ( access_type == hvm_access_insn_fetch )
>          rc = hvm_fetch_from_guest_linear(p_data, addr, bytes, 0, &pfinfo);
>      else
> -        rc = hvm_copy_from_guest_linear(p_data, addr, bytes, 0, &pfinfo);
> +        rc = hvm_copy_from_guest_linear(p_data, addr, bytes, 0, &pfinfo,
> +                                        NULL);
> 
>      switch ( rc )
>      {
> --- a/xen/include/asm-x86/hvm/emulate.h
> +++ b/xen/include/asm-x86/hvm/emulate.h
> @@ -99,6 +99,11 @@ int hvmemul_do_pio_buffer(uint16_t port,
>                            void *buffer);
> 
>  struct hvmemul_cache;
> +struct hvmemul_cache *hvmemul_cache_init(unsigned int nents);
> +static inline void hvmemul_cache_destroy(struct hvmemul_cache *cache)
> +{
> +    xfree(cache);
> +}
>  bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
>                          unsigned int level, void *buffer, unsigned int size);
>  void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
> --- a/xen/include/asm-x86/hvm/support.h
> +++ b/xen/include/asm-x86/hvm/support.h
> @@ -99,7 +99,7 @@ enum hvm_translation_result hvm_copy_to_
>      pagefault_info_t *pfinfo);
>  enum hvm_translation_result hvm_copy_from_guest_linear(
>      void *buf, unsigned long addr, int size, uint32_t pfec,
> -    pagefault_info_t *pfinfo);
> +    pagefault_info_t *pfinfo, struct hvmemul_cache *cache);
>  enum hvm_translation_result hvm_fetch_from_guest_linear(
>      void *buf, unsigned long addr, int size, uint32_t pfec,
>      pagefault_info_t *pfinfo);
> @@ -113,7 +113,7 @@ enum hvm_translation_result hvm_fetch_fr
>  enum hvm_translation_result hvm_translate_get_page(
>      struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
>      pagefault_info_t *pfinfo, struct page_info **page_p,
> -    gfn_t *gfn_p, p2m_type_t *p2mt_p);
> +    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache);
> 
>  #define HVM_HCALL_completed  0 /* hypercall completed - no further action */
>  #define HVM_HCALL_preempted  1 /* hypercall preempted - re-execute VMCALL */
> --- a/xen/include/asm-x86/hvm/vcpu.h
> +++ b/xen/include/asm-x86/hvm/vcpu.h
> @@ -53,8 +53,6 @@ struct hvm_mmio_cache {
>      uint8_t buffer[32];
>  };
> 
> -struct hvmemul_cache;
> -
>  struct hvm_vcpu_io {
>      /* I/O request in flight to device model. */
>      enum hvm_io_completion io_completion;
> @@ -200,6 +198,7 @@ struct hvm_vcpu {
>      u8                  cache_mode;
> 
>      struct hvm_vcpu_io  hvm_io;
> +    struct hvmemul_cache *data_cache;
> 
>      /* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
>      struct x86_event     inject_event;
> 
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible
  2018-07-19 11:55       ` Andrew Cooper
@ 2018-07-19 18:37         ` Jan Beulich
  2018-07-19 18:47           ` Andrew Cooper
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2018-07-19 18:37 UTC (permalink / raw)
  To: andrew.cooper3; +Cc: George.Dunlap, xen-devel, kevin.tian, jun.nakajima

>>> Andrew Cooper <andrew.cooper3@citrix.com> 07/19/18 1:55 PM >>>
>On 19/07/18 12:47, Jan Beulich wrote:
>>>>> On 19.07.18 at 13:15, <andrew.cooper3@citrix.com> wrote:
>>> On 19/07/18 11:50, Jan Beulich wrote:
>>>> Since strictly speaking it is incorrect for guest_walk_tables() to read
>>>> L3 entries during PAE page walks, try to overcome this where possible by
>>>> pre-loading the values from hardware into the cache. Sadly the
>>>> information is available in the EPT case only. On the positive side for
>>>> NPT the spec spells out that L3 entries are actually read on walks, so
>>>> us reading them is consistent with hardware behavior in that case.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> I'm afraid that this isn't architecturally correct.  It means that an
>>> emulated memory access will read the PDPTE register values, rather than
>>> what is actually in RAM.
>> I'm afraid I don't understand: A CR3 load loads the PDPTEs into
>> registers, and walks use those registers, not memory. That's the very
>> difference between PAE and all other walks.
>
>Patch 3 causes memory reads to come from the cache.
>
>This patch feeds the PDPTE registers into the cache, which breaks the
>architectural correctness of patch 3, because the PDPTE registers may
>legitimately be stale WRT the content in memory.

Exactly. And I want to use the register contents in that case. Hence the filling
of the cache here from the register values.


>The pagewalk reading of top_map doesn't require that top_map points into
>guest space.  If you read the PDPTE registers onto the stack, and pass a
>pointer to the stack into the pagewalk in the 3-level case, then you fix
>the issue described here without breaking patch 3.

I'm afraid I still don't understand: Why "onto the stack"? And anyway - are
you trying to tell me that this oddity is how actual hardware behaves?


Jan




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible
  2018-07-19 18:37         ` Jan Beulich
@ 2018-07-19 18:47           ` Andrew Cooper
  2018-07-19 19:00             ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2018-07-19 18:47 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George.Dunlap, xen-devel, kevin.tian, jun.nakajima

On 19/07/18 19:37, Jan Beulich wrote:
>>>> Andrew Cooper <andrew.cooper3@citrix.com> 07/19/18 1:55 PM >>>
>> On 19/07/18 12:47, Jan Beulich wrote:
>>>>>> On 19.07.18 at 13:15, <andrew.cooper3@citrix.com> wrote:
>>>> On 19/07/18 11:50, Jan Beulich wrote:
>>>>> Since strictly speaking it is incorrect for guest_walk_tables() to read
>>>>> L3 entries during PAE page walks, try to overcome this where possible by
>>>>> pre-loading the values from hardware into the cache. Sadly the
>>>>> information is available in the EPT case only. On the positive side for
>>>>> NPT the spec spells out that L3 entries are actually read on walks, so
>>>>> us reading them is consistent with hardware behavior in that case.
>>>>>
>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> I'm afraid that this isn't architecturally correct.  It means that an
>>>> emulated memory access will read the PDPTE register values, rather than
>>>> what is actually in RAM.
>>> I'm afraid I don't understand: A CR3 load loads the PDPTEs into
>>> registers, and walks use those registers, not memory. That's the very
>>> difference between PAE and all other walks.
>> Patch 3 causes memory reads to come from the cache.
>>
>> This patch feeds the PDPTE registers into the cache, which breaks the
>> architectural correctness of patch 3, because the PDPTE registers may
>> legitimately be stale WRT the content in memory.
> Exactly. And I want to use the register contents in that case. Hence the filling
> of the cache here from the register values.

But using the register values is wrong here.

>
>
>> The pagewalk reading of top_map doesn't require that top_map points into
>> guest space.  If you read the PDPTE registers onto the stack, and pass a
>> pointer to the stack into the pagewalk in the 3-level case, then you fix
>> the issue described here without breaking patch 3.
> I'm afraid I still don't understand: Why "onto the stack"? And anyway - are
> you trying to tell me this odd is how actual hardware behaves?

Consider the following scenario:

; %eax points at some valid PDPTEs

mov %eax, %cr3 ; Copies 32 bytes into the hidden registers.
mov $0xdead, 8(%eax) ; Clobbers PDPTE[1].  Hardware register still intact.
FEP mov 8(%eax), %ebx ; Reads the memory behind the clobbered PDPTE[1]

When emulating the 3rd instruction, you actually need to read memory,
not the cached PDPTEs, because the value can be different.  Therefore,
you must not put the PDPTEs into the cache in the first place, because
it will cause erroneous behaviour.

My point with the top_map pointer is that you can arrange for it to
point to the real PDPTEs (pulled out of the VMCS) rather than reading
the real guest RAM at the physical address pointed to by %cr3.
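
As a purely illustrative sketch (untested; error handling and the exact
call into the walker omitted) of what I mean by reading them onto the stack:

    /* Untested sketch - pull the PDPTE registers into a local array. */
    uint64_t pdptes[4];
    unsigned long val;
    unsigned int i;

    vmx_vmcs_enter(v);
    for ( i = 0; i < 4; ++i )
    {
        __vmread(GUEST_PDPTE(i), &val);
        pdptes[i] = val;
    }
    vmx_vmcs_exit(v);

    /* ... then hand pdptes to the walker as its top_map in the 3-level case. */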

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible
  2018-07-19 18:47           ` Andrew Cooper
@ 2018-07-19 19:00             ` Jan Beulich
  2018-07-19 19:07               ` Andrew Cooper
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2018-07-19 19:00 UTC (permalink / raw)
  To: andrew.cooper3; +Cc: George.Dunlap, xen-devel, kevin.tian, jun.nakajima

>>> Andrew Cooper <andrew.cooper3@citrix.com> 07/19/18 8:47 PM >>>
>On 19/07/18 19:37, Jan Beulich wrote:
>>>> Andrew Cooper <andrew.cooper3@citrix.com> 07/19/18 1:55 PM >>>
>> On 19/07/18 12:47, Jan Beulich wrote:
>>> This patch feeds the PDPTE registers into the cache, which breaks the
>>> architectural correctness of patch 3, because the PDPTE registers may
>>> legitimately be stale WRT the content in memory.
>> Exactly. And I want to use the register contents in that case. Hence the filling
>> of the cache here from the register values.
>
>But using the register values is wrong here.
>
>>> The pagewalk reading of top_map doesn't require that top_map points into
>>> guest space.  If you read the PDPTE registers onto the stack, and pass a
>>> pointer to the stack into the pagewalk in the 3-level case, then you fix
>>> the issue described here without breaking patch 3.
>> I'm afraid I still don't understand: Why "onto the stack"? And anyway - are
>> you trying to tell me this odd is how actual hardware behaves?
>
>Consider the following scenario:
>
>; %eax points at some valid PDPTEs
>
>mov %eax, %cr3 ; Copies 32 bytes into the hidden registers.
>mov $0xdead, 8(%eax) ; Clobbers PDPTE[1].  Hardware register still intact.
>FEP mov 8(%eax), %ebx ; Reads the memory behind the clobbered PDPTE[1]
>
>When emulating the 3rd instruction, you actually need to read memory,
>not the cached PDPTEs, because the value can be different.  Therefore,
>you must not put the PDPTEs into the cache in the first place, because
>it will cause erroneous behaviour.

So what led you to believe the actual data access could be satisfied from the
cache? Cache entries are deliberately tagged by page table level, and no level 0
entries will (currently) ever be produced.
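
Just to illustrate the tagging (helpers as introduced by patch 3):

    /* What the walker stores is e.g. (illustration only): */
    hvmemul_write_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e));

Lookups match only on the exact (gpa, level, size) triple, and nothing
ever writes level-0 (i.e. data) entries, so a hypothetical

    /* "data" is a made-up local here: */
    hvmemul_read_cache(cache, gpa, 0, &data, sizeof(data));

can't be satisfied - the data read in your example goes to guest memory
as usual.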

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible
  2018-07-19 19:00             ` Jan Beulich
@ 2018-07-19 19:07               ` Andrew Cooper
  2018-07-24  7:04                 ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2018-07-19 19:07 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George.Dunlap, xen-devel, kevin.tian, jun.nakajima

On 19/07/18 20:00, Jan Beulich wrote:
>>>> Andrew Cooper <andrew.cooper3@citrix.com> 07/19/18 8:47 PM >>>
>> On 19/07/18 19:37, Jan Beulich wrote:
>>>>> Andrew Cooper <andrew.cooper3@citrix.com> 07/19/18 1:55 PM >>>
>>> On 19/07/18 12:47, Jan Beulich wrote:
>>>> This patch feeds the PDPTE registers into the cache, which breaks the
>>>> architectural correctness of patch 3, because the PDPTE registers may
>>>> legitimately be stale WRT the content in memory.
>>> Exactly. And I want to use the register contents in that case. Hence the filling
>>> of the cache here from the register values.
>> But using the register values is wrong here.
>>
>>>> The pagewalk reading of top_map doesn't require that top_map points into
>>>> guest space.  If you read the PDPTE registers onto the stack, and pass a
>>>> pointer to the stack into the pagewalk in the 3-level case, then you fix
>>>> the issue described here without breaking patch 3.
>>> I'm afraid I still don't understand: Why "onto the stack"? And anyway - are
>>> you trying to tell me this odd is how actual hardware behaves?
>> Consider the following scenario:
>>
>> ; %eax points at some valid PDPTEs
>>
>> mov %eax, %cr3 ; Copies 32 bytes into the hidden registers.
>> mov $0xdead, 8(%eax) ; Clobbers PDPTE[1].  Hardware register still intact.
>> FEP mov 8(%eax), %ebx ; Reads the memory behind the clobbered PDPTE[1]
>>
>> When emulating the 3rd instruction, you actually need to read memory,
>> not the cached PDPTEs, because the value can be different.  Therefore,
>> you must not put the PDPTEs into the cache in the first place, because
>> it will cause erroneous behaviour.
> So what lead you to believe the actual data access could be satisfied from the
> cache? Cache entries are deliberately tagged by page table level, and no level 0
> entries will (currently) ever be produced.

Oh - I'd not looked in that much detail at your algorithm.  As a first
gut feel, tagging by level doesn't sound as if it will interact
correctly with linear pagetables.

Both the Intel and AMD ORMs maintain paging structure caches, so I'd
expect that a linear pagetable entry would be served from that cache
rather than being read twice from RAM.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/6] x86/shadow: a little bit of style cleanup
  2018-07-19 10:51 ` [PATCH 6/6] x86/shadow: a little bit of style cleanup Jan Beulich
@ 2018-07-23 15:05   ` Tim Deegan
  0 siblings, 0 replies; 28+ messages in thread
From: Tim Deegan @ 2018-07-23 15:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Andrew Cooper

At 04:51 -0600 on 19 Jul (1531975863), Jan Beulich wrote:
> Correct indentation of a piece of code, adjusting comment style at the
> same time. Constify gl3e pointers and drop a bogus (and useless once
> corrected) cast.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Tim Deegan <tim@xen.org>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] x86/HVM: implement memory read caching
  2018-07-19 10:48 ` [PATCH 3/6] x86/HVM: implement memory read caching Jan Beulich
  2018-07-19 14:20   ` Paul Durrant
@ 2018-07-23 15:45   ` Tim Deegan
  1 sibling, 0 replies; 28+ messages in thread
From: Tim Deegan @ 2018-07-23 15:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Paul Durrant, Suravee Suthikulpanit, xen-devel, Boris Ostrovsky,
	Brian Woods

Hi,

At 04:48 -0600 on 19 Jul (1531975717), Jan Beulich wrote:
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far use of CPU
> registers goes (as those can't change without any other instruction
> executing in between), but is wrong for memory accesses. In particular
> it has been observed that Windows might page out buffers underneath an
> instruction currently under emulation (hitting between two passes). If
> the first pass translated a linear address successfully, any subsequent
> pass needs to do so too, yielding the exact same translation.
> 
> Introduce a cache (used just by guest page table accesses for now) to
> make sure above described assumption holds. This is a very simplistic
> implementation for now: Only exact matches are satisfied (no overlaps or
> partial reads or anything).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

For the change to shadow code:

Acked-by: Tim Deegan <tim@xen.org>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible
  2018-07-19 19:07               ` Andrew Cooper
@ 2018-07-24  7:04                 ` Jan Beulich
  2018-07-24  7:27                   ` Juergen Gross
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2018-07-24  7:04 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Kevin Tian, Jun Nakajima

>>> On 19.07.18 at 21:07, <andrew.cooper3@citrix.com> wrote:
> Oh - I'd not looked in that much detail at your algorithm.  As a first
> gut feel, tagging by level doesn't sound as if it will interact
> correctly with linear pagetables.
> 
> Both the Intel and AMD ORM's maintain paging structure caches so I'd
> expect that a linear pagetable entry would be served from that cache
> rather than being read twice from RAM.

Is enough detail available anywhere about the actual implementation of
the paging structure caches? I could imagine them being per level. It
wouldn't be very difficult to switch to a tristate here (normal data,
page table, and PAE L3).
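
Just to illustrate what I mean by tristate (naming invented on the spot):

    /* Purely illustrative - names are not from any existing header. */
    enum hvmemul_cache_kind {
        hvmemul_cache_data,
        hvmemul_cache_pt,
        hvmemul_cache_pae_l3,
    };

i.e. the plain level tag would be accompanied or replaced by such a kind.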

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible
  2018-07-24  7:04                 ` Jan Beulich
@ 2018-07-24  7:27                   ` Juergen Gross
  2018-07-24  7:44                     ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Juergen Gross @ 2018-07-24  7:27 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: George Dunlap, xen-devel, Kevin Tian, Jun Nakajima

On 24/07/18 09:04, Jan Beulich wrote:
>>>> On 19.07.18 at 21:07, <andrew.cooper3@citrix.com> wrote:
>> Oh - I'd not looked in that much detail at your algorithm.  As a first
>> gut feel, tagging by level doesn't sound as if it will interact
>> correctly with linear pagetables.
>>
>> Both the Intel and AMD ORM's maintain paging structure caches so I'd
>> expect that a linear pagetable entry would be served from that cache
>> rather than being read twice from RAM.
> 
> Is there anywhere enough detail about the actual implementation of
> the paging structure caches? I could imagine them being per level. It
> wouldn't be very difficult to switch to a tristate here (normal data,
> page table, and PAE L3).
> 

From https://www.cs.rice.edu/CS/Architecture/docs/barr-isca10.pdf :

Therefore, both AMD and Intel have implemented MMU caches for page table
entries from the higher levels of the tree [3, 9]. However, their caches
have quite different structure. For example, AMD’s Page Walk Cache stores
page table entries from any level of the tree, whereas Intel implements
distinct caches for each level of the tree. Also, AMD’s Page Walk Cache is
indexed by the physical address of the cached page table entry, whereas
Intel’s Paging-Structure Caches are indexed by portions of the virtual
address being translated


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible
  2018-07-24  7:27                   ` Juergen Gross
@ 2018-07-24  7:44                     ` Jan Beulich
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2018-07-24  7:44 UTC (permalink / raw)
  To: Andrew Cooper, Juergen Gross
  Cc: George Dunlap, xen-devel, Kevin Tian, Jun Nakajima

>>> On 24.07.18 at 09:27, <jgross@suse.com> wrote:
> On 24/07/18 09:04, Jan Beulich wrote:
>>>>> On 19.07.18 at 21:07, <andrew.cooper3@citrix.com> wrote:
>>> Oh - I'd not looked in that much detail at your algorithm.  As a first
>>> gut feel, tagging by level doesn't sound as if it will interact
>>> correctly with linear pagetables.
>>>
>>> Both the Intel and AMD ORM's maintain paging structure caches so I'd
>>> expect that a linear pagetable entry would be served from that cache
>>> rather than being read twice from RAM.
>> 
>> Is there anywhere enough detail about the actual implementation of
>> the paging structure caches? I could imagine them being per level. It
>> wouldn't be very difficult to switch to a tristate here (normal data,
>> page table, and PAE L3).
>> 
> 
> From https://www.cs.rice.edu/CS/Architecture/docs/barr-isca10.pdf :
> 
> Therefore, both AMD and Intel have implemented MMU caches for page table
> entries from the higher levels of the tree [3, 9]. However, their caches
> have quite different structure. For example, AMD’s Page Walk Cache stores
> page table entries from any level of the tree, whereas Intel implements
> distinct caches for each level of the tree. Also, AMD’s Page Walk Cache is
> indexed by the physical address of the cached page table entry, whereas
> Intel’s Paging-Structure Caches are indexed by portions of the virtual
> address being translated

Oh, interesting, thanks. Basically that means OSes and the like can't
really rely on any specific behavior, and implementations have some
leeway.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] x86/HVM: implement memory read caching
  2018-07-19 14:20   ` Paul Durrant
@ 2018-07-24 10:28     ` Jan Beulich
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2018-07-24 10:28 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Kevin Tian, Suravee Suthikulpanit, Andrew Cooper, Tim Deegan,
	george.dunlap, Jun Nakajima, xen-devel, Boris Ostrovsky,
	Brian Woods

>>> On 19.07.18 at 16:20, <Paul.Durrant@citrix.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: 19 July 2018 11:49
>> @@ -1046,6 +1060,8 @@ static int __hvmemul_read(
>>          pfec |= PFEC_implicit;
>>      else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].dpl == 3 )
>>          pfec |= PFEC_user_mode;
>> +    if ( access_type == hvm_access_insn_fetch )
>> +        pfec |= PFEC_insn_fetch;
> 
> Since you OR the insn_fetch flag in here...
> 
>> 
>>      rc = hvmemul_virtual_to_linear(
>>          seg, offset, bytes, &reps, access_type, hvmemul_ctxt, &addr);
>> @@ -1059,7 +1075,8 @@ static int __hvmemul_read(
>> 
>>      rc = ((access_type == hvm_access_insn_fetch) ?
>>            hvm_fetch_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo) :
> 
> ...could you not just use hvm_copy_from_guest_linear() here regardless of 
> access_type (and just NULL out the cache argument if it is an insn_fetch)?
> 
> AFAICT the only reason hvm_fetch_from_guest_linear() exists is to OR the 
> extra flag in.

Well, technically it looks like I indeed could. I'm not sure that's a good
idea though - the visual separation of "copy" vs "fetch" is helpful, I think.
Let's see whether I get any opinions from others in one direction or the other.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Ping: [PATCH 4/6] VMX: correct PDPTE load checks
  2018-07-19 10:49 ` [PATCH 4/6] VMX: correct PDPTE load checks Jan Beulich
@ 2018-08-28 11:59   ` Jan Beulich
  2018-08-28 13:12   ` Andrew Cooper
  1 sibling, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2018-08-28 11:59 UTC (permalink / raw)
  To: Jun Nakajima, Kevin Tian; +Cc: George Dunlap, Andrew Cooper, xen-devel

>>> On 19.07.18 at 12:49,  wrote:
> Checking the low 5 bits of CR3 is not the job of vmx_load_pdptrs().
> Instead it should #GP upon bad PDPTE values, rather than causing a VM
> entry failure.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1361,20 +1361,18 @@ static void vmx_set_interrupt_shadow(str
>      __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
>  }
>  
> -static void vmx_load_pdptrs(struct vcpu *v)
> +static bool vmx_load_pdptrs(struct vcpu *v)
>  {
>      unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3];
> -    uint64_t *guest_pdptes;
> +    uint64_t *guest_pdptes, valid_mask;
>      struct page_info *page;
>      p2m_type_t p2mt;
>      char *p;
> +    unsigned int i;
>  
>      /* EPT needs to load PDPTRS into VMCS for PAE. */
>      if ( !hvm_pae_enabled(v) || (v->arch.hvm_vcpu.guest_efer & EFER_LMA) )
> -        return;
> -
> -    if ( (cr3 & 0x1fUL) && !hvm_pcid_enabled(v) )
> -        goto crash;
> +        return true;
>  
>      page = get_page_from_gfn(v->domain, cr3 >> PAGE_SHIFT, &p2mt, P2M_UNSHARE);
>      if ( !page )
> @@ -1385,34 +1383,47 @@ static void vmx_load_pdptrs(struct vcpu
>          gdprintk(XENLOG_ERR,
>                   "Bad cr3 on load pdptrs gfn %lx type %d\n",
>                   cr3 >> PAGE_SHIFT, (int) p2mt);
> -        goto crash;
> +        domain_crash(v->domain);
> +        return false;
>      }
>  
>      p = __map_domain_page(page);
>  
> -    guest_pdptes = (uint64_t *)(p + (cr3 & ~PAGE_MASK));
> +    guest_pdptes = (uint64_t *)(p + (cr3 & 0xfe0));
>  
> -    /*
> -     * We do not check the PDPTRs for validity. The CPU will do this during
> -     * vm entry, and we can handle the failure there and crash the guest.
> -     * The only thing we could do better here is #GP instead.
> -     */
> +    valid_mask = ((1ULL << v->domain->arch.cpuid->extd.maxphysaddr) - 1) &
> +                 (PAGE_MASK | _PAGE_AVAIL | _PAGE_PRESENT);
> +    for ( i = 0; i < 4; ++i )
> +        if ( (guest_pdptes[i] & _PAGE_PRESENT) &&
> +             (guest_pdptes[i] & ~valid_mask) )
> +        {
> +            if ( v == current )
> +                hvm_inject_hw_exception(TRAP_gp_fault, 0);
> +            else
> +            {
> +                printk(XENLOG_G_ERR "%pv: bad PDPTE%u: %"PRIx64"\n",
> +                       v, i, guest_pdptes[i]);
> +                domain_crash(v->domain);
> +            }
> +            break;
> +        }
>  
> -    vmx_vmcs_enter(v);
> +    if ( i == 4 )
> +    {
> +        vmx_vmcs_enter(v);
>  
> -    __vmwrite(GUEST_PDPTE(0), guest_pdptes[0]);
> -    __vmwrite(GUEST_PDPTE(1), guest_pdptes[1]);
> -    __vmwrite(GUEST_PDPTE(2), guest_pdptes[2]);
> -    __vmwrite(GUEST_PDPTE(3), guest_pdptes[3]);
> +        __vmwrite(GUEST_PDPTE(0), guest_pdptes[0]);
> +        __vmwrite(GUEST_PDPTE(1), guest_pdptes[1]);
> +        __vmwrite(GUEST_PDPTE(2), guest_pdptes[2]);
> +        __vmwrite(GUEST_PDPTE(3), guest_pdptes[3]);
>  
> -    vmx_vmcs_exit(v);
> +        vmx_vmcs_exit(v);
> +    }
>  
>      unmap_domain_page(p);
>      put_page(page);
> -    return;
>  
> - crash:
> -    domain_crash(v->domain);
> +    return i == 4;
>  }
>  
>  static void vmx_update_host_cr3(struct vcpu *v)
> @@ -1621,7 +1632,8 @@ static void vmx_update_guest_cr(struct v
>              if ( !hvm_paging_enabled(v) && !vmx_unrestricted_guest(v) )
>                  v->arch.hvm_vcpu.hw_cr[3] =
>                      v->domain->arch.hvm_domain.params[HVM_PARAM_IDENT_PT];
> -            vmx_load_pdptrs(v);
> +            if ( !vmx_load_pdptrs(v) )
> +                break;
>          }
>  
>          __vmwrite(GUEST_CR3, v->arch.hvm_vcpu.hw_cr[3]);
> 
> 
> 
> 





_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] VMX: correct PDPTE load checks
  2018-07-19 10:49 ` [PATCH 4/6] VMX: correct PDPTE load checks Jan Beulich
  2018-08-28 11:59   ` Ping: " Jan Beulich
@ 2018-08-28 13:12   ` Andrew Cooper
  2018-08-30  1:24     ` Tian, Kevin
  2018-08-30 13:58     ` Jan Beulich
  1 sibling, 2 replies; 28+ messages in thread
From: Andrew Cooper @ 2018-08-28 13:12 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Kevin Tian, Jun Nakajima

On 19/07/18 11:49, Jan Beulich wrote:
> Checking the low 5 bits of CR3 is not the job of vmx_load_pdptrs().
> Instead it should #GP upon bad PDPTE values, rather than causing a VM
> entry failure.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1361,20 +1361,18 @@ static void vmx_set_interrupt_shadow(str
>      __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
>  }
>  
> -static void vmx_load_pdptrs(struct vcpu *v)
> +static bool vmx_load_pdptrs(struct vcpu *v)
>  {
>      unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3];
> -    uint64_t *guest_pdptes;
> +    uint64_t *guest_pdptes, valid_mask;
>      struct page_info *page;
>      p2m_type_t p2mt;
>      char *p;
> +    unsigned int i;
>  
>      /* EPT needs to load PDPTRS into VMCS for PAE. */
>      if ( !hvm_pae_enabled(v) || (v->arch.hvm_vcpu.guest_efer & EFER_LMA) )
> -        return;
> -
> -    if ( (cr3 & 0x1fUL) && !hvm_pcid_enabled(v) )
> -        goto crash;
> +        return true;
>  
>      page = get_page_from_gfn(v->domain, cr3 >> PAGE_SHIFT, &p2mt, P2M_UNSHARE);

cr3 needs truncating to 32 bits before doing this.  The upper bits of
cr3 can remain set after transitioning away from long mode, which will
cause this emulation to do the wrong thing.
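
I.e. (illustrative only) either truncate explicitly before the lookup, or
simply make the local variable narrower:

    /* Sketch - truncation via the variable's type. */
    uint32_t cr3 = v->arch.hvm_vcpu.guest_cr[3];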

>      if ( !page )
> @@ -1385,34 +1383,47 @@ static void vmx_load_pdptrs(struct vcpu
>          gdprintk(XENLOG_ERR,
>                   "Bad cr3 on load pdptrs gfn %lx type %d\n",
>                   cr3 >> PAGE_SHIFT, (int) p2mt);
> -        goto crash;
> +        domain_crash(v->domain);
> +        return false;
>      }
>  
>      p = __map_domain_page(page);
>  
> -    guest_pdptes = (uint64_t *)(p + (cr3 & ~PAGE_MASK));
> +    guest_pdptes = (uint64_t *)(p + (cr3 & 0xfe0));

You can drop p, and guest_pdptes can lose its prefix.

>  
> -    /*
> -     * We do not check the PDPTRs for validity. The CPU will do this during
> -     * vm entry, and we can handle the failure there and crash the guest.
> -     * The only thing we could do better here is #GP instead.
> -     */
> +    valid_mask = ((1ULL << v->domain->arch.cpuid->extd.maxphysaddr) - 1) &
> +                 (PAGE_MASK | _PAGE_AVAIL | _PAGE_PRESENT);

How did you come across this list?  The only valid bits are Present, PWT
and PCD, while the upper nibble of control bits is documented as ignored
rather than reserved.
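
I.e. (untested sketch, if I'm reading the SDM correctly) I'd have expected
something more like:

    /* Sketch only - P, PWT, PCD plus the ignored bits below maxphysaddr. */
    valid_mask = ((1ULL << v->domain->arch.cpuid->extd.maxphysaddr) - 1) &
                 (PAGE_MASK | _PAGE_AVAIL | _PAGE_PCD | _PAGE_PWT |
                  _PAGE_PRESENT);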

> +    for ( i = 0; i < 4; ++i )
> +        if ( (guest_pdptes[i] & _PAGE_PRESENT) &&
> +             (guest_pdptes[i] & ~valid_mask) )
> +        {
> +            if ( v == current )
> +                hvm_inject_hw_exception(TRAP_gp_fault, 0);

The function is void for the same reason that this isn't correct.  We are
in the hvm_update_* path rather than the set_* path, which is beyond the
point of being able to unwind (and why I haven't yet got around to
fixing this function).

The only way I can see of fixing this is plumbing everything into a new
paging_set_cr3() callback, which can return X86EMUL_EXCEPTION and fail
the hvm_set_cr3() call.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] VMX: correct PDPTE load checks
  2018-08-28 13:12   ` Andrew Cooper
@ 2018-08-30  1:24     ` Tian, Kevin
  2018-08-30 13:58     ` Jan Beulich
  1 sibling, 0 replies; 28+ messages in thread
From: Tian, Kevin @ 2018-08-30  1:24 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, xen-devel; +Cc: George Dunlap, Nakajima, Jun

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Tuesday, August 28, 2018 9:12 PM
> > +    valid_mask = ((1ULL << v->domain->arch.cpuid->extd.maxphysaddr) - 1) &
> > +                 (PAGE_MASK | _PAGE_AVAIL | _PAGE_PRESENT);
> 
> How did you come across this list?  The only valid bits are Present, PWT
> and PCD, while the upper nibble of control bits is documented as ignored
> rather than reserved.

Agreed that PWT/PCD should be added to the mask too. _PAGE_AVAIL is right
here, standing for the ignored bits.

Thanks
Kevin
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] VMX: correct PDPTE load checks
  2018-08-28 13:12   ` Andrew Cooper
  2018-08-30  1:24     ` Tian, Kevin
@ 2018-08-30 13:58     ` Jan Beulich
  1 sibling, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2018-08-30 13:58 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Kevin Tian, Jun Nakajima

>>> On 28.08.18 at 15:12, <andrew.cooper3@citrix.com> wrote:
> On 19/07/18 11:49, Jan Beulich wrote:
>> Checking the low 5 bits of CR3 is not the job of vmx_load_pdptrs().
>> Instead it should #GP upon bad PDPTE values, rather than causing a VM
>> entry failure.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -1361,20 +1361,18 @@ static void vmx_set_interrupt_shadow(str
>>      __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
>>  }
>>  
>> -static void vmx_load_pdptrs(struct vcpu *v)
>> +static bool vmx_load_pdptrs(struct vcpu *v)
>>  {
>>      unsigned long cr3 = v->arch.hvm_vcpu.guest_cr[3];
>> -    uint64_t *guest_pdptes;
>> +    uint64_t *guest_pdptes, valid_mask;
>>      struct page_info *page;
>>      p2m_type_t p2mt;
>>      char *p;
>> +    unsigned int i;
>>  
>>      /* EPT needs to load PDPTRS into VMCS for PAE. */
>>      if ( !hvm_pae_enabled(v) || (v->arch.hvm_vcpu.guest_efer & EFER_LMA) )
>> -        return;
>> -
>> -    if ( (cr3 & 0x1fUL) && !hvm_pcid_enabled(v) )
>> -        goto crash;
>> +        return true;
>>  
>>      page = get_page_from_gfn(v->domain, cr3 >> PAGE_SHIFT, &p2mt, P2M_UNSHARE);
> 
> cr3 needs truncating to 32 bits before doing this.  The upper bits of
> cr3 can remain set after transitioning away from long mode, which will
> cause this emulation to do the wrong thing.

In which case the easiest thing would be to change the variable's
type. I've done this, though it's sort of orthogonal to what the patch
has been meant to do till now.

>> -    /*
>> -     * We do not check the PDPTRs for validity. The CPU will do this during
>> -     * vm entry, and we can handle the failure there and crash the guest.
>> -     * The only thing we could do better here is #GP instead.
>> -     */
>> +    valid_mask = ((1ULL << v->domain->arch.cpuid->extd.maxphysaddr) - 1) &
>> +                 (PAGE_MASK | _PAGE_AVAIL | _PAGE_PRESENT);
> 
> How did you come across this list?  The only valid bits are Present, PWT
> and PCD, while the upper nibble of control bits is documented as ignored
> rather than reserved.

As to the upper nibble (_PAGE_AVAIL I assume, which is only 3 bits) -
that's what the mask sets. As the name says, it's a collection of bits
which are valid to be set here (which necessarily includes ignored bits;
we're only interested in its inverted value for masking purposes).

As to PWT and PCD, the set above was based on the foot note to
section "Checks on Guest Page-Directory-Pointer-Table Entries"
saying

"This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
 (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are
 ignored."

But I now realize that I've wrongly implied this to be a complete
description.

>> +    for ( i = 0; i < 4; ++i )
>> +        if ( (guest_pdptes[i] & _PAGE_PRESENT) &&
>> +             (guest_pdptes[i] & ~valid_mask) )
>> +        {
>> +            if ( v == current )
>> +                hvm_inject_hw_exception(TRAP_gp_fault, 0);
> 
> The function is void for the same reason why this isn't correct.  We are
> in the hvm_update_* path rather than the set_* path, which is beyond the
> point of being able to unwind (and why I haven't yet got around to
> fixing this function).
> 
> The only way I can see of fixing this is plumbing everything into a new
> paging_set_cr3() callback, which can return X86EMUL_EXCEPTION and fail
> the hvm_set_cr3() call.

Hmm, I see. I'll see about doing this. But why would this be a paging
hook? It looks to me as if this wants to be another entry in hvm_funcs,
also taking into consideration that on AMD/NPT the PDPTRs are
documented not to work the "normal" way. If it was a paging one,
where would you call it from?

I'm also anticipating a further complication: If we split the function,
how would we guarantee that the read/check one has always been
invoked before the load one? After all we'd have to latch the values
read from memory in the former in order to use the validated values
in the latter. This is in particular because hvm_update_guest_cr3()
looks to be called from a number of paths not having come through
hvm_set_cr3() (which anyway itself invokes paging_update_cr3()).
A pretty good (or bad) example is hvmop_flush_tlb_all(), in
particular considering the comment there.

Bottom line - I think the patch here is an improvement in its own right,
and further improvements should perhaps be done separately. In
the meantime I wonder whether the failure path in
vmx_update_guest_cr() could attempt some rollback (the prior CR3
value must have been fine after all, leaving aside the fact that
it may have been loaded with CR4.PAE still clear, which is a rather
fragile way of changing paging modes anyway).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] x86/mm: use optional cache in guest_walk_tables()
  2018-07-19 13:22   ` Paul Durrant
@ 2018-09-03 15:12     ` Jan Beulich
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2018-09-03 15:12 UTC (permalink / raw)
  To: Andrew Cooper, Paul Durrant; +Cc: xen-devel, george.dunlap

>>> On 19.07.18 at 15:22, <Paul.Durrant@citrix.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: 19 July 2018 11:47
>> 
>> @@ -134,7 +135,15 @@ guest_walk_tables(struct vcpu *v, struct
>>      /* Get the l4e from the top level table and check its flags*/
>>      gw->l4mfn = top_mfn;
>>      l4p = (guest_l4e_t *) top_map;
>> -    gw->l4e = l4p[guest_l4_table_offset(gla)];
>> +    gpa = gfn_to_gaddr(top_gfn) +
>> +          guest_l4_table_offset(gla) * sizeof(guest_l4e_t);
>> +    if ( !cache ||
>> +         !hvmemul_read_cache(cache, gpa, 4, &gw->l4e, sizeof(gw->l4e)) )
>> +    {
>> +        gw->l4e = l4p[guest_l4_table_offset(gla)];
>> +        if ( cache )
>> +            hvmemul_write_cache(cache, gpa, 4, &gw->l4e, sizeof(gw->l4e));
>> +    }
>>      gflags = guest_l4e_get_flags(gw->l4e);
>>      if ( !(gflags & _PAGE_PRESENT) )
>>          goto out;
>> @@ -164,7 +173,15 @@ guest_walk_tables(struct vcpu *v, struct
>>      }
>> 
>>      /* Get the l3e and check its flags*/
>> -    gw->l3e = l3p[guest_l3_table_offset(gla)];
>> +    gpa = gfn_to_gaddr(guest_l4e_get_gfn(gw->l4e)) +
>> +          guest_l3_table_offset(gla) * sizeof(guest_l3e_t);
>> +    if ( !cache ||
>> +         !hvmemul_read_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
>> +    {
>> +        gw->l3e = l3p[guest_l3_table_offset(gla)];
>> +        if ( cache )
>> +            hvmemul_write_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e));
>> +    }
>>      gflags = guest_l3e_get_flags(gw->l3e);
>>      if ( !(gflags & _PAGE_PRESENT) )
>>          goto out;
>> @@ -216,7 +233,16 @@ guest_walk_tables(struct vcpu *v, struct
>>  #else /* PAE only... */
>> 
>>      /* Get the l3e and check its flag */
>> -    gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
>> +    gpa = gfn_to_gaddr(top_gfn) + ((unsigned long)top_map & ~PAGE_MASK) +
>> +          guest_l3_table_offset(gla) * sizeof(guest_l3e_t);
>> +    if ( !cache ||
>> +         !hvmemul_read_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
>> +    {
>> +        gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
>> +        if ( cache )
>> +            hvmemul_write_cache(cache, gpa, 3, &gw->l3e, sizeof(gw->l3e));
>> +    }
>> +
>>      gflags = guest_l3e_get_flags(gw->l3e);
>>      if ( !(gflags & _PAGE_PRESENT) )
>>          goto out;
>> @@ -242,18 +268,24 @@ guest_walk_tables(struct vcpu *v, struct
>>          goto out;
>>      }
>> 
>> -    /* Get the l2e */
>> -    gw->l2e = l2p[guest_l2_table_offset(gla)];
>> -
>>  #else /* 32-bit only... */
>> 
>> -    /* Get l2e from the top level table */
>>      gw->l2mfn = top_mfn;
>>      l2p = (guest_l2e_t *) top_map;
>> -    gw->l2e = l2p[guest_l2_table_offset(gla)];
>> 
>>  #endif /* All levels... */
>> 
>> +    /* Get the l2e */
>> +    gpa = gfn_to_gaddr(top_gfn) +
>> +          guest_l2_table_offset(gla) * sizeof(guest_l2e_t);
>> +    if ( !cache ||
>> +         !hvmemul_read_cache(cache, gpa, 2, &gw->l2e, sizeof(gw->l2e)) )
>> +    {
>> +        gw->l2e = l2p[guest_l2_table_offset(gla)];
>> +        if ( cache )
>> +            hvmemul_write_cache(cache, gpa, 2, &gw->l2e, sizeof(gw->l2e));
>> +    }
>> +
>>      /* Check the l2e flags. */
>>      gflags = guest_l2e_get_flags(gw->l2e);
>>      if ( !(gflags & _PAGE_PRESENT) )
>> @@ -335,7 +367,17 @@ guest_walk_tables(struct vcpu *v, struct
>>          gw->pfec |= rc & PFEC_synth_mask;
>>          goto out;
>>      }
>> -    gw->l1e = l1p[guest_l1_table_offset(gla)];
>> +
>> +    gpa = gfn_to_gaddr(top_gfn) +
>> +          guest_l1_table_offset(gla) * sizeof(guest_l1e_t);
>> +    if ( !cache ||
>> +         !hvmemul_read_cache(cache, gpa, 1, &gw->l1e, sizeof(gw->l1e)) )
>> +    {
>> +        gw->l1e = l1p[guest_l1_table_offset(gla)];
>> +        if ( cache )
>> +            hvmemul_write_cache(cache, gpa, 1, &gw->l1e, sizeof(gw->l1e));
>> +    }
>> +
>>      gflags = guest_l1e_get_flags(gw->l1e);
>>      if ( !(gflags & _PAGE_PRESENT) )
>>          goto out;

I've just now realized that the cache writes should either be repeated
after having set A/D bits, or the setting of A/D bits should be suppressed
when a read has hit the cache. The latter option would be somewhat
difficult with our late setting of the bits, but I still wanted to ask: Do you
have a preference either way?
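
For the former option the idea would simply be to redo the respective
write_cache() invocation with the updated entry, e.g. (sketch only, with
gpa re-calculated for the entry in question)

    /* Sketch - refresh the cached leaf entry after the A/D update. */
    if ( cache )
        hvmemul_write_cache(cache, gpa, 1, &gw->l1e, sizeof(gw->l1e));

once the A (and maybe D) bit has been set, such that a replayed walk
observes the updated value.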

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2018-09-03 15:12 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-19 10:39 [PATCH 0/6] x86/HVM: implement memory read caching Jan Beulich
2018-07-19 10:46 ` [PATCH 1/6] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
2018-07-19 13:12   ` Paul Durrant
2018-07-19 10:47 ` [PATCH 2/6] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
2018-07-19 13:22   ` Paul Durrant
2018-09-03 15:12     ` Jan Beulich
2018-07-19 10:48 ` [PATCH 3/6] x86/HVM: implement memory read caching Jan Beulich
2018-07-19 14:20   ` Paul Durrant
2018-07-24 10:28     ` Jan Beulich
2018-07-23 15:45   ` Tim Deegan
2018-07-19 10:49 ` [PATCH 4/6] VMX: correct PDPTE load checks Jan Beulich
2018-08-28 11:59   ` Ping: " Jan Beulich
2018-08-28 13:12   ` Andrew Cooper
2018-08-30  1:24     ` Tian, Kevin
2018-08-30 13:58     ` Jan Beulich
2018-07-19 10:50 ` [PATCH 5/6] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
2018-07-19 11:15   ` Andrew Cooper
2018-07-19 11:47     ` Jan Beulich
2018-07-19 11:55       ` Andrew Cooper
2018-07-19 18:37         ` Jan Beulich
2018-07-19 18:47           ` Andrew Cooper
2018-07-19 19:00             ` Jan Beulich
2018-07-19 19:07               ` Andrew Cooper
2018-07-24  7:04                 ` Jan Beulich
2018-07-24  7:27                   ` Juergen Gross
2018-07-24  7:44                     ` Jan Beulich
2018-07-19 10:51 ` [PATCH 6/6] x86/shadow: a little bit of style cleanup Jan Beulich
2018-07-23 15:05   ` Tim Deegan
