* [PATCH v2 0/4] x86/HVM: implement memory read caching
@ 2018-09-11 13:10 Jan Beulich
  2018-09-11 13:13 ` [PATCH v2 1/4] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
                   ` (5 more replies)
  0 siblings, 6 replies; 48+ messages in thread
From: Jan Beulich @ 2018-09-11 13:10 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Paul Durrant

Emulation requiring device model assistance uses a form of instruction
re-execution, assuming that the second (and any further) pass takes
exactly the same path. This is a valid assumption as far as use of CPU
registers goes (as those can't change without any other instruction
executing in between), but is wrong for memory accesses. In particular
it has been observed that Windows might page out buffers underneath
an instruction currently under emulation (hitting between two passes).
If the first pass translated a linear address successfully, any subsequent
pass needs to do so too, yielding the exact same translation.

Introduce a cache (used just by guest page table accesses for now, i.e.
a form of "paging structure cache") to make sure the above-described
assumption holds. This is a very simplistic implementation for now: only
exact matches are satisfied (no overlaps, partial reads, or the like).
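
For anyone skimming the series, here is a minimal standalone sketch of
the intended exact-match semantics (illustrative only; the real
structure and the hvmemul_read_cache() / hvmemul_write_cache()
interfaces are introduced by patches 2 and 3):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Entries are keyed on (gpa, level, size); only exact matches hit,
 * overlapping or partial reads always miss. Callers pass at most
 * sizeof(unsigned long) bytes.
 */
#define CACHE_ENTS 64

struct cache_ent {
    uint64_t gpa;
    unsigned int level, size;
    unsigned long data;
};

struct cache {
    unsigned int num_ents;
    struct cache_ent ents[CACHE_ENTS];
};

static bool read_cache(const struct cache *c, uint64_t gpa,
                       unsigned int level, void *buf, unsigned int size)
{
    unsigned int i;

    for ( i = 0; i < c->num_ents; ++i )
        if ( c->ents[i].gpa == gpa && c->ents[i].level == level &&
             c->ents[i].size == size )
        {
            memcpy(buf, &c->ents[i].data, size);
            return true;
        }

    return false;
}

static void write_cache(struct cache *c, uint64_t gpa, unsigned int level,
                        const void *buf, unsigned int size)
{
    unsigned int i;

    if ( size > sizeof(c->ents->data) )
        return;

    /* Update an existing entry for the same key, else append one. */
    for ( i = 0; i < c->num_ents; ++i )
        if ( c->ents[i].gpa == gpa && c->ents[i].level == level &&
             c->ents[i].size == size )
            break;

    if ( i == CACHE_ENTS )
        return;

    c->ents[i].gpa   = gpa;
    c->ents[i].level = level;
    c->ents[i].size  = size;
    memcpy(&c->ents[i].data, buf, size);

    if ( i == c->num_ents )
        c->num_ents = i + 1;
}

During a walk each paging level then follows a "consult the cache, on a
miss read the guest entry and record it" pattern, so that a re-executed
pass is satisfied with exactly the data the first pass used.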

There's also some seemingly unrelated cleanup here which was found
desirable on the way.

1: x86/mm: add optional cache to GLA->GFN translation
2: x86/mm: use optional cache in guest_walk_tables()
3: x86/HVM: implement memory read caching
4: x86/HVM: prefill cache with PDPTEs when possible

"VMX: correct PDPTE load checks" is omitted from v2, as I can't
currently find enough time to carry out the requested further
rework.

Jan




* [PATCH v2 1/4] x86/mm: add optional cache to GLA->GFN translation
  2018-09-11 13:10 [PATCH v2 0/4] x86/HVM: implement memory read caching Jan Beulich
@ 2018-09-11 13:13 ` Jan Beulich
  2018-09-11 13:40   ` Razvan Cojocaru
  2018-09-19 15:09   ` Wei Liu
  2018-09-11 13:14 ` [PATCH v2 2/4] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 48+ messages in thread
From: Jan Beulich @ 2018-09-11 13:13 UTC (permalink / raw)
  To: xen-devel
  Cc: Tamas K Lengyel, Wei Liu, Razvan Cojocaru, George Dunlap,
	Andrew Cooper, Tim Deegan, Paul Durrant

The caching isn't actually implemented here; this is just setting the
stage.

While touching the affected translation functions anyway, also
- make their return values gfn_t,
- rename gva -> gla in their names,
- name their input arguments gla.

At the use sites, do the conversion to/from gfn_t as appropriate.
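
For context, a simplified, self-contained stand-in for the typesafe
gfn_t pattern the call sites are converted to (the real definitions
come from TYPE_SAFE() in the Xen headers; this is only meant to
illustrate the calling convention):

#include <stdbool.h>

/* Simplified stand-in for Xen's TYPE_SAFE(unsigned long, gfn). */
typedef struct { unsigned long g; } gfn_t;

#define _gfn(x)     ((gfn_t){ .g = (x) })
#define gfn_x(g)    ((g).g)
#define INVALID_GFN _gfn(~0UL)

static inline bool gfn_eq(gfn_t a, gfn_t b)
{
    return gfn_x(a) == gfn_x(b);
}

static inline gfn_t gfn_add(gfn_t g, unsigned long i)
{
    return _gfn(gfn_x(g) + i);
}

Call sites thus move from comparing raw unsigned longs against
gfn_x(INVALID_GFN) to using gfn_eq(..., INVALID_GFN), as visible
throughout the hunks below.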

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
---
v2: Re-base.

--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -51,7 +51,7 @@ dbg_hvm_va2mfn(dbgva_t vaddr, struct dom
 
     DBGP2("vaddr:%lx domid:%d\n", vaddr, dp->domain_id);
 
-    *gfn = _gfn(paging_gva_to_gfn(dp->vcpu[0], vaddr, &pfec));
+    *gfn = paging_gla_to_gfn(dp->vcpu[0], vaddr, &pfec, NULL);
     if ( gfn_eq(*gfn, INVALID_GFN) )
     {
         DBGP2("kdb:bad gfn from gva_to_gfn\n");
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -699,7 +699,8 @@ static int hvmemul_linear_to_phys(
     struct hvm_emulate_ctxt *hvmemul_ctxt)
 {
     struct vcpu *curr = current;
-    unsigned long pfn, npfn, done, todo, i, offset = addr & ~PAGE_MASK;
+    gfn_t gfn, ngfn;
+    unsigned long done, todo, i, offset = addr & ~PAGE_MASK;
     int reverse;
 
     /*
@@ -721,15 +722,17 @@ static int hvmemul_linear_to_phys(
     if ( reverse && ((PAGE_SIZE - offset) < bytes_per_rep) )
     {
         /* Do page-straddling first iteration forwards via recursion. */
-        paddr_t _paddr;
+        paddr_t gaddr;
         unsigned long one_rep = 1;
         int rc = hvmemul_linear_to_phys(
-            addr, &_paddr, bytes_per_rep, &one_rep, pfec, hvmemul_ctxt);
+            addr, &gaddr, bytes_per_rep, &one_rep, pfec, hvmemul_ctxt);
+
         if ( rc != X86EMUL_OKAY )
             return rc;
-        pfn = _paddr >> PAGE_SHIFT;
+        gfn = gaddr_to_gfn(gaddr);
     }
-    else if ( (pfn = paging_gva_to_gfn(curr, addr, &pfec)) == gfn_x(INVALID_GFN) )
+    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, NULL),
+                     INVALID_GFN) )
     {
         if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
             return X86EMUL_RETRY;
@@ -744,11 +747,11 @@ static int hvmemul_linear_to_phys(
     {
         /* Get the next PFN in the range. */
         addr += reverse ? -PAGE_SIZE : PAGE_SIZE;
-        npfn = paging_gva_to_gfn(curr, addr, &pfec);
+        ngfn = paging_gla_to_gfn(curr, addr, &pfec, NULL);
 
         /* Is it contiguous with the preceding PFNs? If not then we're done. */
-        if ( (npfn == gfn_x(INVALID_GFN)) ||
-             (npfn != (pfn + (reverse ? -i : i))) )
+        if ( gfn_eq(ngfn, INVALID_GFN) ||
+             !gfn_eq(ngfn, gfn_add(gfn, reverse ? -i : i)) )
         {
             if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
                 return X86EMUL_RETRY;
@@ -756,7 +759,7 @@ static int hvmemul_linear_to_phys(
             if ( done == 0 )
             {
                 ASSERT(!reverse);
-                if ( npfn != gfn_x(INVALID_GFN) )
+                if ( !gfn_eq(ngfn, INVALID_GFN) )
                     return X86EMUL_UNHANDLEABLE;
                 *reps = 0;
                 x86_emul_pagefault(pfec, addr & PAGE_MASK, &hvmemul_ctxt->ctxt);
@@ -769,7 +772,8 @@ static int hvmemul_linear_to_phys(
         done += PAGE_SIZE;
     }
 
-    *paddr = ((paddr_t)pfn << PAGE_SHIFT) | offset;
+    *paddr = gfn_to_gaddr(gfn) | offset;
+
     return X86EMUL_OKAY;
 }
     
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2682,7 +2682,7 @@ static void *hvm_map_entry(unsigned long
      * treat it as a kernel-mode read (i.e. no access checks).
      */
     pfec = PFEC_page_present;
-    gfn = paging_gva_to_gfn(current, va, &pfec);
+    gfn = gfn_x(paging_gla_to_gfn(current, va, &pfec, NULL));
     if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
         goto fail;
 
@@ -3112,7 +3112,7 @@ enum hvm_translation_result hvm_translat
 
     if ( linear )
     {
-        gfn = _gfn(paging_gva_to_gfn(v, addr, &pfec));
+        gfn = paging_gla_to_gfn(v, addr, &pfec, NULL);
 
         if ( gfn_eq(gfn, INVALID_GFN) )
         {
--- a/xen/arch/x86/hvm/monitor.c
+++ b/xen/arch/x86/hvm/monitor.c
@@ -130,7 +130,7 @@ static inline unsigned long gfn_of_rip(u
 
     hvm_get_segment_register(curr, x86_seg_cs, &sreg);
 
-    return paging_gva_to_gfn(curr, sreg.base + rip, &pfec);
+    return gfn_x(paging_gla_to_gfn(curr, sreg.base + rip, &pfec, NULL));
 }
 
 int hvm_monitor_debug(unsigned long rip, enum hvm_monitor_debug_type type,
--- a/xen/arch/x86/mm/guest_walk.c
+++ b/xen/arch/x86/mm/guest_walk.c
@@ -81,8 +81,9 @@ static bool set_ad_bits(guest_intpte_t *
  */
 bool
 guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
-                  unsigned long va, walk_t *gw,
-                  uint32_t walk, mfn_t top_mfn, void *top_map)
+                  unsigned long gla, walk_t *gw, uint32_t walk,
+                  gfn_t top_gfn, mfn_t top_mfn, void *top_map,
+                  struct hvmemul_cache *cache)
 {
     struct domain *d = v->domain;
     p2m_type_t p2mt;
@@ -116,7 +117,7 @@ guest_walk_tables(struct vcpu *v, struct
 
     perfc_incr(guest_walk);
     memset(gw, 0, sizeof(*gw));
-    gw->va = va;
+    gw->va = gla;
     gw->pfec = walk & (PFEC_user_mode | PFEC_write_access);
 
     /*
@@ -133,7 +134,7 @@ guest_walk_tables(struct vcpu *v, struct
     /* Get the l4e from the top level table and check its flags*/
     gw->l4mfn = top_mfn;
     l4p = (guest_l4e_t *) top_map;
-    gw->l4e = l4p[guest_l4_table_offset(va)];
+    gw->l4e = l4p[guest_l4_table_offset(gla)];
     gflags = guest_l4e_get_flags(gw->l4e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -163,7 +164,7 @@ guest_walk_tables(struct vcpu *v, struct
     }
 
     /* Get the l3e and check its flags*/
-    gw->l3e = l3p[guest_l3_table_offset(va)];
+    gw->l3e = l3p[guest_l3_table_offset(gla)];
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -205,7 +206,7 @@ guest_walk_tables(struct vcpu *v, struct
 
         /* Increment the pfn by the right number of 4k pages. */
         start = _gfn((gfn_x(start) & ~GUEST_L3_GFN_MASK) +
-                     ((va >> PAGE_SHIFT) & GUEST_L3_GFN_MASK));
+                     ((gla >> PAGE_SHIFT) & GUEST_L3_GFN_MASK));
         gw->l1e = guest_l1e_from_gfn(start, flags);
         gw->l2mfn = gw->l1mfn = INVALID_MFN;
         leaf_level = 3;
@@ -215,7 +216,7 @@ guest_walk_tables(struct vcpu *v, struct
 #else /* PAE only... */
 
     /* Get the l3e and check its flag */
-    gw->l3e = ((guest_l3e_t *) top_map)[guest_l3_table_offset(va)];
+    gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -242,14 +243,14 @@ guest_walk_tables(struct vcpu *v, struct
     }
 
     /* Get the l2e */
-    gw->l2e = l2p[guest_l2_table_offset(va)];
+    gw->l2e = l2p[guest_l2_table_offset(gla)];
 
 #else /* 32-bit only... */
 
     /* Get l2e from the top level table */
     gw->l2mfn = top_mfn;
     l2p = (guest_l2e_t *) top_map;
-    gw->l2e = l2p[guest_l2_table_offset(va)];
+    gw->l2e = l2p[guest_l2_table_offset(gla)];
 
 #endif /* All levels... */
 
@@ -310,7 +311,7 @@ guest_walk_tables(struct vcpu *v, struct
 
         /* Increment the pfn by the right number of 4k pages. */
         start = _gfn((gfn_x(start) & ~GUEST_L2_GFN_MASK) +
-                     guest_l1_table_offset(va));
+                     guest_l1_table_offset(gla));
 #if GUEST_PAGING_LEVELS == 2
          /* Wider than 32 bits if PSE36 superpage. */
         gw->el1e = (gfn_x(start) << PAGE_SHIFT) | flags;
@@ -334,7 +335,7 @@ guest_walk_tables(struct vcpu *v, struct
         gw->pfec |= rc & PFEC_synth_mask;
         goto out;
     }
-    gw->l1e = l1p[guest_l1_table_offset(va)];
+    gw->l1e = l1p[guest_l1_table_offset(gla)];
     gflags = guest_l1e_get_flags(gw->l1e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -443,22 +444,22 @@ guest_walk_tables(struct vcpu *v, struct
         break;
 
     case 1:
-        if ( set_ad_bits(&l1p[guest_l1_table_offset(va)].l1, &gw->l1e.l1,
+        if ( set_ad_bits(&l1p[guest_l1_table_offset(gla)].l1, &gw->l1e.l1,
                          (walk & PFEC_write_access)) )
             paging_mark_dirty(d, gw->l1mfn);
         /* Fallthrough */
     case 2:
-        if ( set_ad_bits(&l2p[guest_l2_table_offset(va)].l2, &gw->l2e.l2,
+        if ( set_ad_bits(&l2p[guest_l2_table_offset(gla)].l2, &gw->l2e.l2,
                          (walk & PFEC_write_access) && leaf_level == 2) )
             paging_mark_dirty(d, gw->l2mfn);
         /* Fallthrough */
 #if GUEST_PAGING_LEVELS == 4 /* 64-bit only... */
     case 3:
-        if ( set_ad_bits(&l3p[guest_l3_table_offset(va)].l3, &gw->l3e.l3,
+        if ( set_ad_bits(&l3p[guest_l3_table_offset(gla)].l3, &gw->l3e.l3,
                          (walk & PFEC_write_access) && leaf_level == 3) )
             paging_mark_dirty(d, gw->l3mfn);
 
-        if ( set_ad_bits(&l4p[guest_l4_table_offset(va)].l4, &gw->l4e.l4,
+        if ( set_ad_bits(&l4p[guest_l4_table_offset(gla)].l4, &gw->l4e.l4,
                          false) )
             paging_mark_dirty(d, gw->l4mfn);
 #endif
--- a/xen/arch/x86/mm/hap/guest_walk.c
+++ b/xen/arch/x86/mm/hap/guest_walk.c
@@ -26,8 +26,8 @@ asm(".file \"" __OBJECT_FILE__ "\"");
 #include <xen/sched.h>
 #include "private.h" /* for hap_gva_to_gfn_* */
 
-#define _hap_gva_to_gfn(levels) hap_gva_to_gfn_##levels##_levels
-#define hap_gva_to_gfn(levels) _hap_gva_to_gfn(levels)
+#define _hap_gla_to_gfn(levels) hap_gla_to_gfn_##levels##_levels
+#define hap_gla_to_gfn(levels) _hap_gla_to_gfn(levels)
 
 #define _hap_p2m_ga_to_gfn(levels) hap_p2m_ga_to_gfn_##levels##_levels
 #define hap_p2m_ga_to_gfn(levels) _hap_p2m_ga_to_gfn(levels)
@@ -39,16 +39,10 @@ asm(".file \"" __OBJECT_FILE__ "\"");
 #include <asm/guest_pt.h>
 #include <asm/p2m.h>
 
-unsigned long hap_gva_to_gfn(GUEST_PAGING_LEVELS)(
-    struct vcpu *v, struct p2m_domain *p2m, unsigned long gva, uint32_t *pfec)
-{
-    unsigned long cr3 = v->arch.hvm.guest_cr[3];
-    return hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(v, p2m, cr3, gva, pfec, NULL);
-}
-
-unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
+static unsigned long ga_to_gfn(
     struct vcpu *v, struct p2m_domain *p2m, unsigned long cr3,
-    paddr_t ga, uint32_t *pfec, unsigned int *page_order)
+    paddr_t ga, uint32_t *pfec, unsigned int *page_order,
+    struct hvmemul_cache *cache)
 {
     bool walk_ok;
     mfn_t top_mfn;
@@ -91,7 +85,8 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
 #if GUEST_PAGING_LEVELS == 3
     top_map += (cr3 & ~(PAGE_MASK | 31));
 #endif
-    walk_ok = guest_walk_tables(v, p2m, ga, &gw, *pfec, top_mfn, top_map);
+    walk_ok = guest_walk_tables(v, p2m, ga, &gw, *pfec,
+                                top_gfn, top_mfn, top_map, cache);
     unmap_domain_page(top_map);
     put_page(top_page);
 
@@ -137,6 +132,21 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
     return gfn_x(INVALID_GFN);
 }
 
+gfn_t hap_gla_to_gfn(GUEST_PAGING_LEVELS)(
+    struct vcpu *v, struct p2m_domain *p2m, unsigned long gla, uint32_t *pfec,
+    struct hvmemul_cache *cache)
+{
+    unsigned long cr3 = v->arch.hvm.guest_cr[3];
+
+    return _gfn(ga_to_gfn(v, p2m, cr3, gla, pfec, NULL, cache));
+}
+
+unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
+    struct vcpu *v, struct p2m_domain *p2m, unsigned long cr3,
+    paddr_t ga, uint32_t *pfec, unsigned int *page_order)
+{
+    return ga_to_gfn(v, p2m, cr3, ga, pfec, page_order, NULL);
+}
 
 /*
  * Local variables:
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -744,10 +744,11 @@ hap_write_p2m_entry(struct domain *d, un
         p2m_flush_nestedp2m(d);
 }
 
-static unsigned long hap_gva_to_gfn_real_mode(
-    struct vcpu *v, struct p2m_domain *p2m, unsigned long gva, uint32_t *pfec)
+static gfn_t hap_gla_to_gfn_real_mode(
+    struct vcpu *v, struct p2m_domain *p2m, unsigned long gla, uint32_t *pfec,
+    struct hvmemul_cache *cache)
 {
-    return ((paddr_t)gva >> PAGE_SHIFT);
+    return gaddr_to_gfn(gla);
 }
 
 static unsigned long hap_p2m_ga_to_gfn_real_mode(
@@ -763,7 +764,7 @@ static unsigned long hap_p2m_ga_to_gfn_r
 static const struct paging_mode hap_paging_real_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_real_mode,
+    .gla_to_gfn             = hap_gla_to_gfn_real_mode,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_real_mode,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
@@ -774,7 +775,7 @@ static const struct paging_mode hap_pagi
 static const struct paging_mode hap_paging_protected_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_2_levels,
+    .gla_to_gfn             = hap_gla_to_gfn_2_levels,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_2_levels,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
@@ -785,7 +786,7 @@ static const struct paging_mode hap_pagi
 static const struct paging_mode hap_paging_pae_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_3_levels,
+    .gla_to_gfn             = hap_gla_to_gfn_3_levels,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_3_levels,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
@@ -796,7 +797,7 @@ static const struct paging_mode hap_pagi
 static const struct paging_mode hap_paging_long_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_4_levels,
+    .gla_to_gfn             = hap_gla_to_gfn_4_levels,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_4_levels,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
--- a/xen/arch/x86/mm/hap/private.h
+++ b/xen/arch/x86/mm/hap/private.h
@@ -24,18 +24,21 @@
 /********************************************/
 /*          GUEST TRANSLATION FUNCS         */
 /********************************************/
-unsigned long hap_gva_to_gfn_2_levels(struct vcpu *v,
-                                     struct p2m_domain *p2m,
-                                     unsigned long gva, 
-                                     uint32_t *pfec);
-unsigned long hap_gva_to_gfn_3_levels(struct vcpu *v,
-                                     struct p2m_domain *p2m,
-                                     unsigned long gva, 
-                                     uint32_t *pfec);
-unsigned long hap_gva_to_gfn_4_levels(struct vcpu *v,
-                                     struct p2m_domain *p2m,
-                                     unsigned long gva, 
-                                     uint32_t *pfec);
+gfn_t hap_gla_to_gfn_2_levels(struct vcpu *v,
+                              struct p2m_domain *p2m,
+                              unsigned long gla,
+                              uint32_t *pfec,
+                              struct hvmemul_cache *cache);
+gfn_t hap_gla_to_gfn_3_levels(struct vcpu *v,
+                              struct p2m_domain *p2m,
+                              unsigned long gla,
+                              uint32_t *pfec,
+                              struct hvmemul_cache *cache);
+gfn_t hap_gla_to_gfn_4_levels(struct vcpu *v,
+                              struct p2m_domain *p2m,
+                              unsigned long gla,
+                              uint32_t *pfec,
+                              struct hvmemul_cache *cache);
 
 unsigned long hap_p2m_ga_to_gfn_2_levels(struct vcpu *v,
     struct p2m_domain *p2m, unsigned long cr3,
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1970,16 +1970,16 @@ void np2m_schedule(int dir)
     }
 }
 
-unsigned long paging_gva_to_gfn(struct vcpu *v,
-                                unsigned long va,
-                                uint32_t *pfec)
+gfn_t paging_gla_to_gfn(struct vcpu *v, unsigned long gla, uint32_t *pfec,
+                        struct hvmemul_cache *cache)
 {
     struct p2m_domain *hostp2m = p2m_get_hostp2m(v->domain);
     const struct paging_mode *hostmode = paging_get_hostmode(v);
 
     if ( is_hvm_vcpu(v) && paging_mode_hap(v->domain) && nestedhvm_is_n2(v) )
     {
-        unsigned long l2_gfn, l1_gfn;
+        gfn_t l2_gfn;
+        unsigned long l1_gfn;
         struct p2m_domain *p2m;
         const struct paging_mode *mode;
         uint8_t l1_p2ma;
@@ -1989,31 +1989,31 @@ unsigned long paging_gva_to_gfn(struct v
         /* translate l2 guest va into l2 guest gfn */
         p2m = p2m_get_nestedp2m(v);
         mode = paging_get_nestedmode(v);
-        l2_gfn = mode->gva_to_gfn(v, p2m, va, pfec);
+        l2_gfn = mode->gla_to_gfn(v, p2m, gla, pfec, cache);
 
-        if ( l2_gfn == gfn_x(INVALID_GFN) )
-            return gfn_x(INVALID_GFN);
+        if ( gfn_eq(l2_gfn, INVALID_GFN) )
+            return INVALID_GFN;
 
         /* translate l2 guest gfn into l1 guest gfn */
-        rv = nestedhap_walk_L1_p2m(v, l2_gfn, &l1_gfn, &l1_page_order, &l1_p2ma,
-                                   1,
+        rv = nestedhap_walk_L1_p2m(v, gfn_x(l2_gfn), &l1_gfn, &l1_page_order,
+                                   &l1_p2ma, 1,
                                    !!(*pfec & PFEC_write_access),
                                    !!(*pfec & PFEC_insn_fetch));
 
         if ( rv != NESTEDHVM_PAGEFAULT_DONE )
-            return gfn_x(INVALID_GFN);
+            return INVALID_GFN;
 
         /*
          * Sanity check that l1_gfn can be used properly as a 4K mapping, even
          * if it mapped by a nested superpage.
          */
-        ASSERT((l2_gfn & ((1ul << l1_page_order) - 1)) ==
+        ASSERT((gfn_x(l2_gfn) & ((1ul << l1_page_order) - 1)) ==
                (l1_gfn & ((1ul << l1_page_order) - 1)));
 
-        return l1_gfn;
+        return _gfn(l1_gfn);
     }
 
-    return hostmode->gva_to_gfn(v, hostp2m, va, pfec);
+    return hostmode->gla_to_gfn(v, hostp2m, gla, pfec, cache);
 }
 
 /*
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -313,15 +313,15 @@ const struct x86_emulate_ops hvm_shadow_
 static mfn_t emulate_gva_to_mfn(struct vcpu *v, unsigned long vaddr,
                                 struct sh_emulate_ctxt *sh_ctxt)
 {
-    unsigned long gfn;
+    gfn_t gfn;
     struct page_info *page;
     mfn_t mfn;
     p2m_type_t p2mt;
     uint32_t pfec = PFEC_page_present | PFEC_write_access;
 
     /* Translate the VA to a GFN. */
-    gfn = paging_get_hostmode(v)->gva_to_gfn(v, NULL, vaddr, &pfec);
-    if ( gfn == gfn_x(INVALID_GFN) )
+    gfn = paging_get_hostmode(v)->gla_to_gfn(v, NULL, vaddr, &pfec, NULL);
+    if ( gfn_eq(gfn, INVALID_GFN) )
     {
         x86_emul_pagefault(pfec, vaddr, &sh_ctxt->ctxt);
 
@@ -331,7 +331,7 @@ static mfn_t emulate_gva_to_mfn(struct v
     /* Translate the GFN to an MFN. */
     ASSERT(!paging_locked_by_me(v->domain));
 
-    page = get_page_from_gfn(v->domain, gfn, &p2mt, P2M_ALLOC);
+    page = get_page_from_gfn(v->domain, gfn_x(gfn), &p2mt, P2M_ALLOC);
 
     /* Sanity checking. */
     if ( page == NULL )
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -173,17 +173,20 @@ delete_shadow_status(struct domain *d, m
 
 static inline bool
 sh_walk_guest_tables(struct vcpu *v, unsigned long va, walk_t *gw,
-                     uint32_t pfec)
+                     uint32_t pfec, struct hvmemul_cache *cache)
 {
     return guest_walk_tables(v, p2m_get_hostp2m(v->domain), va, gw, pfec,
+                             _gfn(paging_mode_external(v->domain)
+                                  ? cr3_pa(v->arch.hvm.guest_cr[3]) >> PAGE_SHIFT
+                                  : pagetable_get_pfn(v->arch.guest_table)),
 #if GUEST_PAGING_LEVELS == 3 /* PAE */
                              INVALID_MFN,
-                             v->arch.paging.shadow.gl3e
+                             v->arch.paging.shadow.gl3e,
 #else /* 32 or 64 */
                              pagetable_get_mfn(v->arch.guest_table),
-                             v->arch.paging.shadow.guest_vtable
+                             v->arch.paging.shadow.guest_vtable,
 #endif
-                             );
+                             cache);
 }
 
 /* This validation is called with lock held, and after write permission
@@ -3032,7 +3035,7 @@ static int sh_page_fault(struct vcpu *v,
      * shadow page table. */
     version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
     smp_rmb();
-    walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
+    walk_ok = sh_walk_guest_tables(v, va, &gw, error_code, NULL);
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
     regs->error_code &= ~PFEC_page_present;
@@ -3680,9 +3683,9 @@ static bool sh_invlpg(struct vcpu *v, un
 }
 
 
-static unsigned long
-sh_gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
-    unsigned long va, uint32_t *pfec)
+static gfn_t
+sh_gla_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
+    unsigned long gla, uint32_t *pfec, struct hvmemul_cache *cache)
 /* Called to translate a guest virtual address to what the *guest*
  * pagetables would map it to. */
 {
@@ -3692,24 +3695,25 @@ sh_gva_to_gfn(struct vcpu *v, struct p2m
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB)
     /* Check the vTLB cache first */
-    unsigned long vtlb_gfn = vtlb_lookup(v, va, *pfec);
+    unsigned long vtlb_gfn = vtlb_lookup(v, gla, *pfec);
+
     if ( vtlb_gfn != gfn_x(INVALID_GFN) )
-        return vtlb_gfn;
+        return _gfn(vtlb_gfn);
 #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
 
-    if ( !(walk_ok = sh_walk_guest_tables(v, va, &gw, *pfec)) )
+    if ( !(walk_ok = sh_walk_guest_tables(v, gla, &gw, *pfec, cache)) )
     {
         *pfec = gw.pfec;
-        return gfn_x(INVALID_GFN);
+        return INVALID_GFN;
     }
     gfn = guest_walk_to_gfn(&gw);
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB)
     /* Remember this successful VA->GFN translation for later. */
-    vtlb_insert(v, va >> PAGE_SHIFT, gfn_x(gfn), *pfec);
+    vtlb_insert(v, gla >> PAGE_SHIFT, gfn_x(gfn), *pfec);
 #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
 
-    return gfn_x(gfn);
+    return gfn;
 }
 
 
@@ -4954,7 +4958,7 @@ int sh_audit_l4_table(struct vcpu *v, mf
 const struct paging_mode sh_paging_mode = {
     .page_fault                    = sh_page_fault,
     .invlpg                        = sh_invlpg,
-    .gva_to_gfn                    = sh_gva_to_gfn,
+    .gla_to_gfn                    = sh_gla_to_gfn,
     .update_cr3                    = sh_update_cr3,
     .update_paging_modes           = shadow_update_paging_modes,
     .write_p2m_entry               = shadow_write_p2m_entry,
--- a/xen/arch/x86/mm/shadow/none.c
+++ b/xen/arch/x86/mm/shadow/none.c
@@ -43,11 +43,12 @@ static bool _invlpg(struct vcpu *v, unsi
     return true;
 }
 
-static unsigned long _gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
-                                 unsigned long va, uint32_t *pfec)
+static gfn_t _gla_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
+                         unsigned long gla, uint32_t *pfec,
+                         struct hvmemul_cache *cache)
 {
     ASSERT_UNREACHABLE();
-    return gfn_x(INVALID_GFN);
+    return INVALID_GFN;
 }
 
 static void _update_cr3(struct vcpu *v, int do_locking, bool noflush)
@@ -70,7 +71,7 @@ static void _write_p2m_entry(struct doma
 static const struct paging_mode sh_paging_none = {
     .page_fault                    = _page_fault,
     .invlpg                        = _invlpg,
-    .gva_to_gfn                    = _gva_to_gfn,
+    .gla_to_gfn                    = _gla_to_gfn,
     .update_cr3                    = _update_cr3,
     .update_paging_modes           = _update_paging_modes,
     .write_p2m_entry               = _write_p2m_entry,
--- a/xen/include/asm-x86/guest_pt.h
+++ b/xen/include/asm-x86/guest_pt.h
@@ -425,7 +425,8 @@ static inline unsigned int guest_walk_to
 
 bool
 guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m, unsigned long va,
-                  walk_t *gw, uint32_t pfec, mfn_t top_mfn, void *top_map);
+                  walk_t *gw, uint32_t pfec, gfn_t top_gfn, mfn_t top_mfn,
+                  void *top_map, struct hvmemul_cache *cache);
 
 /* Pretty-print the contents of a guest-walk */
 static inline void print_gw(const walk_t *gw)
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -53,6 +53,8 @@ struct hvm_mmio_cache {
     uint8_t buffer[32];
 };
 
+struct hvmemul_cache;
+
 struct hvm_vcpu_io {
     /* I/O request in flight to device model. */
     enum hvm_io_completion io_completion;
--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -112,10 +112,11 @@ struct paging_mode {
                                             struct cpu_user_regs *regs);
     bool          (*invlpg                )(struct vcpu *v,
                                             unsigned long linear);
-    unsigned long (*gva_to_gfn            )(struct vcpu *v,
+    gfn_t         (*gla_to_gfn            )(struct vcpu *v,
                                             struct p2m_domain *p2m,
-                                            unsigned long va,
-                                            uint32_t *pfec);
+                                            unsigned long gla,
+                                            uint32_t *pfec,
+                                            struct hvmemul_cache *cache);
     unsigned long (*p2m_ga_to_gfn         )(struct vcpu *v,
                                             struct p2m_domain *p2m,
                                             unsigned long cr3,
@@ -251,9 +252,10 @@ void paging_invlpg(struct vcpu *v, unsig
  * SDM Intel 64 Volume 3, Chapter Paging, PAGE-FAULT EXCEPTIONS:
  * The PFEC_insn_fetch flag is set only when NX or SMEP are enabled.
  */
-unsigned long paging_gva_to_gfn(struct vcpu *v,
-                                unsigned long va,
-                                uint32_t *pfec);
+gfn_t paging_gla_to_gfn(struct vcpu *v,
+                        unsigned long va,
+                        uint32_t *pfec,
+                        struct hvmemul_cache *cache);
 
 /* Translate a guest address using a particular CR3 value.  This is used
  * to by nested HAP code, to walk the guest-supplied NPT tables as if





* [PATCH v2 2/4] x86/mm: use optional cache in guest_walk_tables()
  2018-09-11 13:10 [PATCH v2 0/4] x86/HVM: implement memory read caching Jan Beulich
  2018-09-11 13:13 ` [PATCH v2 1/4] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
@ 2018-09-11 13:14 ` Jan Beulich
  2018-09-11 16:17   ` Paul Durrant
  2018-09-19 15:50   ` Wei Liu
  2018-09-11 13:15 ` [PATCH v2 3/4] x86/HVM: implement memory read caching Jan Beulich
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 48+ messages in thread
From: Jan Beulich @ 2018-09-11 13:14 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Paul Durrant, Wei Liu

The caching isn't actually implemented here; this is just setting the
stage.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Don't wrongly use top_gfn for non-root gpa calculation. Re-write
    cache entries after setting A/D bits (an alternative would be to
    suppress their setting upon cache hits).

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2664,6 +2664,18 @@ void hvm_dump_emulation_state(const char
            hvmemul_ctxt->insn_buf);
 }
 
+bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t gpa,
+                        unsigned int level, void *buffer, unsigned int size)
+{
+    return false;
+}
+
+void hvmemul_write_cache(struct hvmemul_cache *cache, paddr_t gpa,
+                         unsigned int level, const void *buffer,
+                         unsigned int size)
+{
+}
+
 /*
  * Local variables:
  * mode: C
--- a/xen/arch/x86/mm/guest_walk.c
+++ b/xen/arch/x86/mm/guest_walk.c
@@ -92,8 +92,13 @@ guest_walk_tables(struct vcpu *v, struct
 #if GUEST_PAGING_LEVELS >= 4 /* 64-bit only... */
     guest_l3e_t *l3p = NULL;
     guest_l4e_t *l4p;
+    paddr_t l4gpa;
+#endif
+#if GUEST_PAGING_LEVELS >= 3 /* PAE or 64... */
+    paddr_t l3gpa;
 #endif
     uint32_t gflags, rc;
+    paddr_t l1gpa = 0, l2gpa = 0;
     unsigned int leaf_level;
     p2m_query_t qt = P2M_ALLOC | P2M_UNSHARE;
 
@@ -134,7 +139,15 @@ guest_walk_tables(struct vcpu *v, struct
     /* Get the l4e from the top level table and check its flags*/
     gw->l4mfn = top_mfn;
     l4p = (guest_l4e_t *) top_map;
-    gw->l4e = l4p[guest_l4_table_offset(gla)];
+    l4gpa = gfn_to_gaddr(top_gfn) +
+            guest_l4_table_offset(gla) * sizeof(gw->l4e);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, l4gpa, 4, &gw->l4e, sizeof(gw->l4e)) )
+    {
+        gw->l4e = l4p[guest_l4_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, l4gpa, 4, &gw->l4e, sizeof(gw->l4e));
+    }
     gflags = guest_l4e_get_flags(gw->l4e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -164,7 +177,15 @@ guest_walk_tables(struct vcpu *v, struct
     }
 
     /* Get the l3e and check its flags*/
-    gw->l3e = l3p[guest_l3_table_offset(gla)];
+    l3gpa = gfn_to_gaddr(guest_l4e_get_gfn(gw->l4e)) +
+            guest_l3_table_offset(gla) * sizeof(gw->l3e);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
+    {
+        gw->l3e = l3p[guest_l3_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e));
+    }
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -216,7 +237,16 @@ guest_walk_tables(struct vcpu *v, struct
 #else /* PAE only... */
 
     /* Get the l3e and check its flag */
-    gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
+    l3gpa = gfn_to_gaddr(top_gfn) + ((unsigned long)top_map & ~PAGE_MASK) +
+            guest_l3_table_offset(gla) * sizeof(gw->l3e);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
+    {
+        gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e));
+    }
+
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -242,18 +272,26 @@ guest_walk_tables(struct vcpu *v, struct
         goto out;
     }
 
-    /* Get the l2e */
-    gw->l2e = l2p[guest_l2_table_offset(gla)];
+    l2gpa = gfn_to_gaddr(guest_l3e_get_gfn(gw->l3e));
 
 #else /* 32-bit only... */
 
-    /* Get l2e from the top level table */
     gw->l2mfn = top_mfn;
     l2p = (guest_l2e_t *) top_map;
-    gw->l2e = l2p[guest_l2_table_offset(gla)];
+    l2gpa = gfn_to_gaddr(top_gfn);
 
 #endif /* All levels... */
 
+    /* Get the l2e */
+    l2gpa += guest_l2_table_offset(gla) * sizeof(gw->l2e);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, l2gpa, 2, &gw->l2e, sizeof(gw->l2e)) )
+    {
+        gw->l2e = l2p[guest_l2_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, l2gpa, 2, &gw->l2e, sizeof(gw->l2e));
+    }
+
     /* Check the l2e flags. */
     gflags = guest_l2e_get_flags(gw->l2e);
     if ( !(gflags & _PAGE_PRESENT) )
@@ -335,7 +373,17 @@ guest_walk_tables(struct vcpu *v, struct
         gw->pfec |= rc & PFEC_synth_mask;
         goto out;
     }
-    gw->l1e = l1p[guest_l1_table_offset(gla)];
+
+    l1gpa = gfn_to_gaddr(guest_l2e_get_gfn(gw->l2e)) +
+            guest_l1_table_offset(gla) * sizeof(gw->l1e);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, l1gpa, 1, &gw->l1e, sizeof(gw->l1e)) )
+    {
+        gw->l1e = l1p[guest_l1_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, l1gpa, 1, &gw->l1e, sizeof(gw->l1e));
+    }
+
     gflags = guest_l1e_get_flags(gw->l1e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -446,22 +494,38 @@ guest_walk_tables(struct vcpu *v, struct
     case 1:
         if ( set_ad_bits(&l1p[guest_l1_table_offset(gla)].l1, &gw->l1e.l1,
                          (walk & PFEC_write_access)) )
+        {
             paging_mark_dirty(d, gw->l1mfn);
+            if ( cache )
+                hvmemul_write_cache(cache, l1gpa, 1, &gw->l1e, sizeof(gw->l1e));
+        }
         /* Fallthrough */
     case 2:
         if ( set_ad_bits(&l2p[guest_l2_table_offset(gla)].l2, &gw->l2e.l2,
                          (walk & PFEC_write_access) && leaf_level == 2) )
+        {
             paging_mark_dirty(d, gw->l2mfn);
+            if ( cache )
+                hvmemul_write_cache(cache, l2gpa, 2, &gw->l2e, sizeof(gw->l2e));
+        }
         /* Fallthrough */
 #if GUEST_PAGING_LEVELS == 4 /* 64-bit only... */
     case 3:
         if ( set_ad_bits(&l3p[guest_l3_table_offset(gla)].l3, &gw->l3e.l3,
                          (walk & PFEC_write_access) && leaf_level == 3) )
+        {
             paging_mark_dirty(d, gw->l3mfn);
+            if ( cache )
+                hvmemul_write_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e));
+        }
 
         if ( set_ad_bits(&l4p[guest_l4_table_offset(gla)].l4, &gw->l4e.l4,
                          false) )
+        {
             paging_mark_dirty(d, gw->l4mfn);
+            if ( cache )
+                hvmemul_write_cache(cache, l4gpa, 4, &gw->l4e, sizeof(gw->l4e));
+        }
 #endif
     }
 
--- a/xen/include/asm-x86/hvm/emulate.h
+++ b/xen/include/asm-x86/hvm/emulate.h
@@ -98,6 +98,13 @@ int hvmemul_do_pio_buffer(uint16_t port,
                           uint8_t dir,
                           void *buffer);
 
+struct hvmemul_cache;
+bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
+                        unsigned int level, void *buffer, unsigned int size);
+void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
+                         unsigned int level, const void *buffer,
+                         unsigned int size);
+
 void hvm_dump_emulation_state(const char *loglvl, const char *prefix,
                               struct hvm_emulate_ctxt *hvmemul_ctxt, int rc);
 





* [PATCH v2 3/4] x86/HVM: implement memory read caching
  2018-09-11 13:10 [PATCH v2 0/4] x86/HVM: implement memory read caching Jan Beulich
  2018-09-11 13:13 ` [PATCH v2 1/4] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
  2018-09-11 13:14 ` [PATCH v2 2/4] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
@ 2018-09-11 13:15 ` Jan Beulich
  2018-09-11 16:20   ` Paul Durrant
  2018-09-19 15:57   ` Wei Liu
  2018-09-11 13:16 ` [PATCH v2 4/4] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 48+ messages in thread
From: Jan Beulich @ 2018-09-11 13:15 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Paul Durrant, Wei Liu

Emulation requiring device model assistance uses a form of instruction
re-execution, assuming that the second (and any further) pass takes
exactly the same path. This is a valid assumption as far as use of CPU
registers goes (as those can't change without any other instruction
executing in between), but is wrong for memory accesses. In particular
it has been observed that Windows might page out buffers underneath an
instruction currently under emulation (hitting between two passes). If
the first pass translated a linear address successfully, any subsequent
pass needs to do so too, yielding the exact same translation.

Introduce a cache (used just by guest page table accesses for now) to
make sure the above-described assumption holds. This is a very
simplistic implementation for now: only exact matches are satisfied (no
overlaps, partial reads, or the like).
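
A quick back-of-the-envelope check of the per-vCPU cache sizing used in
hvm_vcpu_initialise() below (names here are illustrative; the bound
itself comes from the comment added there):

/* Worst case per emulated instruction (insn fetches bypass the cache):
 * up to 8 independent linear ranges (AVX2 gathers), each of which may
 * straddle a page boundary and hence need two walks, with one entry
 * read per paging level per walk.
 */
#define PAGING_LEVELS 4                            /* CONFIG_PAGING_LEVELS */
#define HVMEMUL_CACHE_ENTS (PAGING_LEVELS * 8 * 2) /* = 64 per vCPU */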

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
---
v2: Re-base.

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -27,6 +27,18 @@
 #include <asm/hvm/svm/svm.h>
 #include <asm/vm_event.h>
 
+struct hvmemul_cache
+{
+    unsigned int max_ents;
+    unsigned int num_ents;
+    struct {
+        paddr_t gpa:PADDR_BITS;
+        unsigned int size:(BITS_PER_LONG - PADDR_BITS) / 2;
+        unsigned int level:(BITS_PER_LONG - PADDR_BITS) / 2;
+        unsigned long data;
+    } ents[];
+};
+
 static void hvmtrace_io_assist(const ioreq_t *p)
 {
     unsigned int size, event;
@@ -541,7 +553,7 @@ static int hvmemul_do_mmio_addr(paddr_t
  */
 static void *hvmemul_map_linear_addr(
     unsigned long linear, unsigned int bytes, uint32_t pfec,
-    struct hvm_emulate_ctxt *hvmemul_ctxt)
+    struct hvm_emulate_ctxt *hvmemul_ctxt, struct hvmemul_cache *cache)
 {
     struct vcpu *curr = current;
     void *err, *mapping;
@@ -586,7 +598,7 @@ static void *hvmemul_map_linear_addr(
         ASSERT(mfn_x(*mfn) == 0);
 
         res = hvm_translate_get_page(curr, addr, true, pfec,
-                                     &pfinfo, &page, NULL, &p2mt);
+                                     &pfinfo, &page, NULL, &p2mt, cache);
 
         switch ( res )
         {
@@ -702,6 +714,8 @@ static int hvmemul_linear_to_phys(
     gfn_t gfn, ngfn;
     unsigned long done, todo, i, offset = addr & ~PAGE_MASK;
     int reverse;
+    struct hvmemul_cache *cache = pfec & PFEC_insn_fetch
+                                  ? NULL : curr->arch.hvm.data_cache;
 
     /*
      * Clip repetitions to a sensible maximum. This avoids extensive looping in
@@ -731,7 +745,7 @@ static int hvmemul_linear_to_phys(
             return rc;
         gfn = gaddr_to_gfn(gaddr);
     }
-    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, NULL),
+    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, cache),
                      INVALID_GFN) )
     {
         if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
@@ -747,7 +761,7 @@ static int hvmemul_linear_to_phys(
     {
         /* Get the next PFN in the range. */
         addr += reverse ? -PAGE_SIZE : PAGE_SIZE;
-        ngfn = paging_gla_to_gfn(curr, addr, &pfec, NULL);
+        ngfn = paging_gla_to_gfn(curr, addr, &pfec, cache);
 
         /* Is it contiguous with the preceding PFNs? If not then we're done. */
         if ( gfn_eq(ngfn, INVALID_GFN) ||
@@ -1073,7 +1087,10 @@ static int linear_read(unsigned long add
                        uint32_t pfec, struct hvm_emulate_ctxt *hvmemul_ctxt)
 {
     pagefault_info_t pfinfo;
-    int rc = hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo);
+    int rc = hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo,
+                                        (pfec & PFEC_insn_fetch
+                                         ? NULL
+                                         : current->arch.hvm.data_cache));
 
     switch ( rc )
     {
@@ -1270,7 +1287,8 @@ static int hvmemul_write(
 
     if ( !known_gla(addr, bytes, pfec) )
     {
-        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
+        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
+                                          current->arch.hvm.data_cache);
         if ( IS_ERR(mapping) )
              return ~PTR_ERR(mapping);
     }
@@ -1312,7 +1330,8 @@ static int hvmemul_rmw(
 
     if ( !known_gla(addr, bytes, pfec) )
     {
-        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
+        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
+                                          current->arch.hvm.data_cache);
         if ( IS_ERR(mapping) )
             return ~PTR_ERR(mapping);
     }
@@ -1466,7 +1485,8 @@ static int hvmemul_cmpxchg(
     else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].dpl == 3 )
         pfec |= PFEC_user_mode;
 
-    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
+    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
+                                      curr->arch.hvm.data_cache);
     if ( IS_ERR(mapping) )
         return ~PTR_ERR(mapping);
 
@@ -2373,6 +2393,7 @@ static int _hvm_emulate_one(struct hvm_e
     {
         vio->mmio_cache_count = 0;
         vio->mmio_insn_bytes = 0;
+        curr->arch.hvm.data_cache->num_ents = 0;
     }
     else
     {
@@ -2591,7 +2612,7 @@ void hvm_emulate_init_per_insn(
                                         &addr) &&
              hvm_copy_from_guest_linear(hvmemul_ctxt->insn_buf, addr,
                                         sizeof(hvmemul_ctxt->insn_buf),
-                                        pfec | PFEC_insn_fetch,
+                                        pfec | PFEC_insn_fetch, NULL,
                                         NULL) == HVMTRANS_okay) ?
             sizeof(hvmemul_ctxt->insn_buf) : 0;
     }
@@ -2664,9 +2685,35 @@ void hvm_dump_emulation_state(const char
            hvmemul_ctxt->insn_buf);
 }
 
+struct hvmemul_cache *hvmemul_cache_init(unsigned int nents)
+{
+    struct hvmemul_cache *cache = xmalloc_bytes(offsetof(struct hvmemul_cache,
+                                                         ents[nents]));
+
+    if ( cache )
+    {
+        cache->num_ents = 0;
+        cache->max_ents = nents;
+    }
+
+    return cache;
+}
+
 bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t gpa,
                         unsigned int level, void *buffer, unsigned int size)
 {
+    unsigned int i;
+
+    ASSERT(size <= sizeof(cache->ents->data));
+
+    for ( i = 0; i < cache->num_ents; ++i )
+        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
+             cache->ents[i].size == size )
+        {
+            memcpy(buffer, &cache->ents[i].data, size);
+            return true;
+        }
+
     return false;
 }
 
@@ -2674,6 +2721,35 @@ void hvmemul_write_cache(struct hvmemul_
                          unsigned int level, const void *buffer,
                          unsigned int size)
 {
+    unsigned int i;
+
+    if ( size > sizeof(cache->ents->data) )
+    {
+        ASSERT_UNREACHABLE();
+        return;
+    }
+
+    for ( i = 0; i < cache->num_ents; ++i )
+        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
+             cache->ents[i].size == size )
+        {
+            memcpy(&cache->ents[i].data, buffer, size);
+            return;
+        }
+
+    if ( unlikely(i >= cache->max_ents) )
+    {
+        ASSERT_UNREACHABLE();
+        return;
+    }
+
+    cache->ents[i].level = level;
+    cache->ents[i].gpa   = gpa;
+    cache->ents[i].size  = size;
+
+    memcpy(&cache->ents[i].data, buffer, size);
+
+    cache->num_ents = i + 1;
 }
 
 /*
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1521,6 +1521,17 @@ int hvm_vcpu_initialise(struct vcpu *v)
 
     v->arch.hvm.inject_event.vector = HVM_EVENT_VECTOR_UNSET;
 
+    /*
+     * Leaving aside the insn fetch, for which we don't use this cache, no
+     * insn can access more than 8 independent linear addresses (AVX2
+     * gathers being the worst). Each such linear range can span a page
+     * boundary, i.e. require two page walks.
+     */
+    v->arch.hvm.data_cache = hvmemul_cache_init(CONFIG_PAGING_LEVELS * 8 * 2);
+    rc = -ENOMEM;
+    if ( !v->arch.hvm.data_cache )
+        goto fail4;
+
     rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
     if ( rc != 0 )
         goto fail4;
@@ -1550,6 +1561,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
  fail5:
     free_compat_arg_xlat(v);
  fail4:
+    hvmemul_cache_destroy(v->arch.hvm.data_cache);
     hvm_funcs.vcpu_destroy(v);
  fail3:
     vlapic_destroy(v);
@@ -1572,6 +1584,8 @@ void hvm_vcpu_destroy(struct vcpu *v)
 
     free_compat_arg_xlat(v);
 
+    hvmemul_cache_destroy(v->arch.hvm.data_cache);
+
     tasklet_kill(&v->arch.hvm.assert_evtchn_irq_tasklet);
     hvm_funcs.vcpu_destroy(v);
 
@@ -2946,7 +2960,7 @@ void hvm_task_switch(
     }
 
     rc = hvm_copy_from_guest_linear(
-        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
+        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
     if ( rc == HVMTRANS_bad_linear_to_gfn )
         hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
     if ( rc != HVMTRANS_okay )
@@ -2993,7 +3007,7 @@ void hvm_task_switch(
         goto out;
 
     rc = hvm_copy_from_guest_linear(
-        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
+        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
     if ( rc == HVMTRANS_bad_linear_to_gfn )
         hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
     /*
@@ -3104,7 +3118,7 @@ void hvm_task_switch(
 enum hvm_translation_result hvm_translate_get_page(
     struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
     pagefault_info_t *pfinfo, struct page_info **page_p,
-    gfn_t *gfn_p, p2m_type_t *p2mt_p)
+    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache)
 {
     struct page_info *page;
     p2m_type_t p2mt;
@@ -3112,7 +3126,7 @@ enum hvm_translation_result hvm_translat
 
     if ( linear )
     {
-        gfn = paging_gla_to_gfn(v, addr, &pfec, NULL);
+        gfn = paging_gla_to_gfn(v, addr, &pfec, cache);
 
         if ( gfn_eq(gfn, INVALID_GFN) )
         {
@@ -3184,7 +3198,7 @@ enum hvm_translation_result hvm_translat
 #define HVMCOPY_linear     (1u<<2)
 static enum hvm_translation_result __hvm_copy(
     void *buf, paddr_t addr, int size, struct vcpu *v, unsigned int flags,
-    uint32_t pfec, pagefault_info_t *pfinfo)
+    uint32_t pfec, pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
 {
     gfn_t gfn;
     struct page_info *page;
@@ -3217,8 +3231,8 @@ static enum hvm_translation_result __hvm
 
         count = min_t(int, PAGE_SIZE - gpa, todo);
 
-        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear,
-                                     pfec, pfinfo, &page, &gfn, &p2mt);
+        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear, pfec,
+                                     pfinfo, &page, &gfn, &p2mt, cache);
         if ( res != HVMTRANS_okay )
             return res;
 
@@ -3265,14 +3279,14 @@ enum hvm_translation_result hvm_copy_to_
     paddr_t paddr, void *buf, int size, struct vcpu *v)
 {
     return __hvm_copy(buf, paddr, size, v,
-                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL);
+                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL, NULL);
 }
 
 enum hvm_translation_result hvm_copy_from_guest_phys(
     void *buf, paddr_t paddr, int size)
 {
     return __hvm_copy(buf, paddr, size, current,
-                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL);
+                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL, NULL);
 }
 
 enum hvm_translation_result hvm_copy_to_guest_linear(
@@ -3281,16 +3295,17 @@ enum hvm_translation_result hvm_copy_to_
 {
     return __hvm_copy(buf, addr, size, current,
                       HVMCOPY_to_guest | HVMCOPY_linear,
-                      PFEC_page_present | PFEC_write_access | pfec, pfinfo);
+                      PFEC_page_present | PFEC_write_access | pfec,
+                      pfinfo, NULL);
 }
 
 enum hvm_translation_result hvm_copy_from_guest_linear(
     void *buf, unsigned long addr, int size, uint32_t pfec,
-    pagefault_info_t *pfinfo)
+    pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
 {
     return __hvm_copy(buf, addr, size, current,
                       HVMCOPY_from_guest | HVMCOPY_linear,
-                      PFEC_page_present | pfec, pfinfo);
+                      PFEC_page_present | pfec, pfinfo, cache);
 }
 
 unsigned long copy_to_user_hvm(void *to, const void *from, unsigned int len)
@@ -3331,7 +3346,8 @@ unsigned long copy_from_user_hvm(void *t
         return 0;
     }
 
-    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len, 0, NULL);
+    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len,
+                                    0, NULL, NULL);
     return rc ? len : 0; /* fake a copy_from_user() return code */
 }
 
@@ -3747,7 +3763,7 @@ void hvm_ud_intercept(struct cpu_user_re
                                         sizeof(sig), hvm_access_insn_fetch,
                                         cs, &addr) &&
              (hvm_copy_from_guest_linear(sig, addr, sizeof(sig),
-                                         walk, NULL) == HVMTRANS_okay) &&
+                                         walk, NULL, NULL) == HVMTRANS_okay) &&
              (memcmp(sig, "\xf\xbxen", sizeof(sig)) == 0) )
         {
             regs->rip += sizeof(sig);
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1358,7 +1358,7 @@ static void svm_emul_swint_injection(str
         goto raise_exception;
 
     rc = hvm_copy_from_guest_linear(&idte, idte_linear_addr, idte_size,
-                                    PFEC_implicit, &pfinfo);
+                                    PFEC_implicit, &pfinfo, NULL);
     if ( rc )
     {
         if ( rc == HVMTRANS_bad_linear_to_gfn )
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -475,7 +475,7 @@ static int decode_vmx_inst(struct cpu_us
         {
             pagefault_info_t pfinfo;
             int rc = hvm_copy_from_guest_linear(poperandS, base, size,
-                                                0, &pfinfo);
+                                                0, &pfinfo, NULL);
 
             if ( rc == HVMTRANS_bad_linear_to_gfn )
                 hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -166,7 +166,7 @@ const struct x86_emulate_ops *shadow_ini
             hvm_access_insn_fetch, sh_ctxt, &addr) &&
          !hvm_copy_from_guest_linear(
              sh_ctxt->insn_buf, addr, sizeof(sh_ctxt->insn_buf),
-             PFEC_insn_fetch, NULL))
+             PFEC_insn_fetch, NULL, NULL))
         ? sizeof(sh_ctxt->insn_buf) : 0;
 
     return &hvm_shadow_emulator_ops;
@@ -201,7 +201,7 @@ void shadow_continue_emulation(struct sh
                 hvm_access_insn_fetch, sh_ctxt, &addr) &&
              !hvm_copy_from_guest_linear(
                  sh_ctxt->insn_buf, addr, sizeof(sh_ctxt->insn_buf),
-                 PFEC_insn_fetch, NULL))
+                 PFEC_insn_fetch, NULL, NULL))
             ? sizeof(sh_ctxt->insn_buf) : 0;
         sh_ctxt->insn_buf_eip = regs->rip;
     }
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -125,7 +125,7 @@ hvm_read(enum x86_segment seg,
     rc = hvm_copy_from_guest_linear(p_data, addr, bytes,
                                     (access_type == hvm_access_insn_fetch
                                      ? PFEC_insn_fetch : 0),
-                                    &pfinfo);
+                                    &pfinfo, NULL);
 
     switch ( rc )
     {
--- a/xen/include/asm-x86/hvm/emulate.h
+++ b/xen/include/asm-x86/hvm/emulate.h
@@ -99,6 +99,11 @@ int hvmemul_do_pio_buffer(uint16_t port,
                           void *buffer);
 
 struct hvmemul_cache;
+struct hvmemul_cache *hvmemul_cache_init(unsigned int nents);
+static inline void hvmemul_cache_destroy(struct hvmemul_cache *cache)
+{
+    xfree(cache);
+}
 bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
                         unsigned int level, void *buffer, unsigned int size);
 void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
--- a/xen/include/asm-x86/hvm/support.h
+++ b/xen/include/asm-x86/hvm/support.h
@@ -99,7 +99,7 @@ enum hvm_translation_result hvm_copy_to_
     pagefault_info_t *pfinfo);
 enum hvm_translation_result hvm_copy_from_guest_linear(
     void *buf, unsigned long addr, int size, uint32_t pfec,
-    pagefault_info_t *pfinfo);
+    pagefault_info_t *pfinfo, struct hvmemul_cache *cache);
 
 /*
  * Get a reference on the page under an HVM physical or linear address.  If
@@ -110,7 +110,7 @@ enum hvm_translation_result hvm_copy_fro
 enum hvm_translation_result hvm_translate_get_page(
     struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
     pagefault_info_t *pfinfo, struct page_info **page_p,
-    gfn_t *gfn_p, p2m_type_t *p2mt_p);
+    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache);
 
 #define HVM_HCALL_completed  0 /* hypercall completed - no further action */
 #define HVM_HCALL_preempted  1 /* hypercall preempted - re-execute VMCALL */
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -53,8 +53,6 @@ struct hvm_mmio_cache {
     uint8_t buffer[32];
 };
 
-struct hvmemul_cache;
-
 struct hvm_vcpu_io {
     /* I/O request in flight to device model. */
     enum hvm_io_completion io_completion;
@@ -200,6 +198,7 @@ struct hvm_vcpu {
     u8                  cache_mode;
 
     struct hvm_vcpu_io  hvm_io;
+    struct hvmemul_cache *data_cache;
 
     /* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
     struct x86_event     inject_event;





* [PATCH v2 4/4] x86/HVM: prefill cache with PDPTEs when possible
  2018-09-11 13:10 [PATCH v2 0/4] x86/HVM: implement memory read caching Jan Beulich
                   ` (2 preceding siblings ...)
  2018-09-11 13:15 ` [PATCH v2 3/4] x86/HVM: implement memory read caching Jan Beulich
@ 2018-09-11 13:16 ` Jan Beulich
  2018-09-13  6:30   ` Tian, Kevin
  2018-09-25 14:14 ` [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
  2018-10-12 13:55 ` [PATCH v2 " Andrew Cooper
  5 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2018-09-11 13:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Wei Liu, George Dunlap, Andrew Cooper, Paul Durrant,
	Jun Nakajima

Since strictly speaking it is incorrect for guest_walk_tables() to read
L3 entries during PAE page walks, try to overcome this where possible by
pre-loading the values from hardware into the cache. Sadly the
information is available in the EPT case only. On the positive side for
NPT the spec spells out that L3 entries are actually read on walks, so
us reading them is consistent with hardware behavior in that case.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Re-base.

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2385,6 +2385,23 @@ static int _hvm_emulate_one(struct hvm_e
 
     vio->mmio_retry = 0;
 
+    if ( !curr->arch.hvm.data_cache->num_ents &&
+         curr->arch.paging.mode->guest_levels == 3 )
+    {
+        unsigned int i;
+
+        for ( i = 0; i < 4; ++i )
+        {
+            uint64_t pdpte;
+
+            if ( hvm_read_pdpte(curr, i, &pdpte) )
+                hvmemul_write_cache(curr->arch.hvm.data_cache,
+                                    (curr->arch.hvm.guest_cr[3] &
+                                     (PADDR_MASK & ~0x1f)) + i * sizeof(pdpte),
+                                    3, &pdpte, sizeof(pdpte));
+        }
+    }
+
     rc = x86_emulate(&hvmemul_ctxt->ctxt, ops);
     if ( rc == X86EMUL_OKAY && vio->mmio_retry )
         rc = X86EMUL_RETRY;
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1368,6 +1368,25 @@ static void vmx_set_interrupt_shadow(str
     __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
 }
 
+static bool read_pdpte(struct vcpu *v, unsigned int idx, uint64_t *pdpte)
+{
+    if ( !paging_mode_hap(v->domain) || !hvm_pae_enabled(v) ||
+         (v->arch.hvm.guest_efer & EFER_LMA) )
+        return false;
+
+    if ( idx >= 4 )
+    {
+        ASSERT_UNREACHABLE();
+        return false;
+    }
+
+    vmx_vmcs_enter(v);
+    __vmread(GUEST_PDPTE(idx), pdpte);
+    vmx_vmcs_exit(v);
+
+    return true;
+}
+
 static void vmx_load_pdptrs(struct vcpu *v)
 {
     unsigned long cr3 = v->arch.hvm.guest_cr[3];
@@ -2466,6 +2485,8 @@ const struct hvm_function_table * __init
         if ( cpu_has_vmx_ept_1gb )
             vmx_function_table.hap_capabilities |= HVM_HAP_SUPERPAGE_1GB;
 
+        vmx_function_table.read_pdpte = read_pdpte;
+
         setup_ept_dump();
     }
 
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -146,6 +146,8 @@ struct hvm_function_table {
 
     void (*fpu_leave)(struct vcpu *v);
 
+    bool (*read_pdpte)(struct vcpu *v, unsigned int index, uint64_t *pdpte);
+
     int  (*get_guest_pat)(struct vcpu *v, u64 *);
     int  (*set_guest_pat)(struct vcpu *v, u64);
 
@@ -440,6 +442,12 @@ static inline unsigned long hvm_get_shad
     return hvm_funcs.get_shadow_gs_base(v);
 }
 
+static inline bool hvm_read_pdpte(struct vcpu *v, unsigned int index, uint64_t *pdpte)
+{
+    return hvm_funcs.read_pdpte &&
+           alternative_call(hvm_funcs.read_pdpte, v, index, pdpte);
+}
+
 static inline bool hvm_get_guest_bndcfgs(struct vcpu *v, u64 *val)
 {
     return hvm_funcs.get_guest_bndcfgs &&






* Re: [PATCH v2 1/4] x86/mm: add optional cache to GLA->GFN translation
  2018-09-11 13:13 ` [PATCH v2 1/4] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
@ 2018-09-11 13:40   ` Razvan Cojocaru
  2018-09-19 15:09   ` Wei Liu
  1 sibling, 0 replies; 48+ messages in thread
From: Razvan Cojocaru @ 2018-09-11 13:40 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Tamas K Lengyel, Wei Liu, George Dunlap, Andrew Cooper,
	Tim Deegan, Paul Durrant

On 9/11/18 4:13 PM, Jan Beulich wrote:
> The caching isn't actually implemented here, this is just setting the
> stage.
> 
> Touching these anyway also
> - make their return values gfn_t
> - gva -> gla in their names
> - name their input arguments gla
> 
> At the use sites do the conversion to gfn_t as suitable.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>


Thanks,
Razvan


* Re: [PATCH v2 2/4] x86/mm: use optional cache in guest_walk_tables()
  2018-09-11 13:14 ` [PATCH v2 2/4] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
@ 2018-09-11 16:17   ` Paul Durrant
  2018-09-12  8:30     ` Jan Beulich
  2018-09-19 15:50   ` Wei Liu
  1 sibling, 1 reply; 48+ messages in thread
From: Paul Durrant @ 2018-09-11 16:17 UTC (permalink / raw)
  To: 'Jan Beulich', xen-devel; +Cc: Andrew Cooper, Wei Liu, George Dunlap

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 11 September 2018 14:15
> To: xen-devel <xen-devel@lists.xenproject.org>
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Paul Durrant
> <Paul.Durrant@citrix.com>; Wei Liu <wei.liu2@citrix.com>; George Dunlap
> <George.Dunlap@citrix.com>
> Subject: [PATCH v2 2/4] x86/mm: use optional cache in guest_walk_tables()
> 
> The caching isn't actually implemented here, this is just setting the
> stage.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Don't wrongly use top_gfn for non-root gpa calculation. Re-write
>     cache entries after setting A/D bits (an alternative would be to
>     suppress their setting upon cache hits).
> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -2664,6 +2664,18 @@ void hvm_dump_emulation_state(const char
>             hvmemul_ctxt->insn_buf);
>  }
> 
> +bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t
> gpa,
> +                        unsigned int level, void *buffer, unsigned int size)
> +{
> +    return false;
> +}
> +
> +void hvmemul_write_cache(struct hvmemul_cache *cache, paddr_t gpa,
> +                         unsigned int level, const void *buffer,
> +                         unsigned int size)
> +{
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> --- a/xen/arch/x86/mm/guest_walk.c
> +++ b/xen/arch/x86/mm/guest_walk.c
> @@ -92,8 +92,13 @@ guest_walk_tables(struct vcpu *v, struct
>  #if GUEST_PAGING_LEVELS >= 4 /* 64-bit only... */
>      guest_l3e_t *l3p = NULL;

Shouldn't the above line be...

>      guest_l4e_t *l4p;
> +    paddr_t l4gpa;
> +#endif
> +#if GUEST_PAGING_LEVELS >= 3 /* PAE or 64... */

...here?

> +    paddr_t l3gpa;
>  #endif
>      uint32_t gflags, rc;
> +    paddr_t l1gpa = 0, l2gpa = 0;
>      unsigned int leaf_level;
>      p2m_query_t qt = P2M_ALLOC | P2M_UNSHARE;
> 
> @@ -134,7 +139,15 @@ guest_walk_tables(struct vcpu *v, struct
>      /* Get the l4e from the top level table and check its flags*/
>      gw->l4mfn = top_mfn;
>      l4p = (guest_l4e_t *) top_map;
> -    gw->l4e = l4p[guest_l4_table_offset(gla)];
> +    l4gpa = gfn_to_gaddr(top_gfn) +
> +            guest_l4_table_offset(gla) * sizeof(gw->l4e);
> +    if ( !cache ||
> +         !hvmemul_read_cache(cache, l4gpa, 4, &gw->l4e, sizeof(gw->l4e)) )
> +    {
> +        gw->l4e = l4p[guest_l4_table_offset(gla)];
> +        if ( cache )
> +            hvmemul_write_cache(cache, l4gpa, 4, &gw->l4e, sizeof(gw->l4e));

No need to test cache here or below since neither the read or write functions (yet) dereference it.

  Paul

> +    }
>      gflags = guest_l4e_get_flags(gw->l4e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> @@ -164,7 +177,15 @@ guest_walk_tables(struct vcpu *v, struct
>      }
> 
>      /* Get the l3e and check its flags*/
> -    gw->l3e = l3p[guest_l3_table_offset(gla)];
> +    l3gpa = gfn_to_gaddr(guest_l4e_get_gfn(gw->l4e)) +
> +            guest_l3_table_offset(gla) * sizeof(gw->l3e);
> +    if ( !cache ||
> +         !hvmemul_read_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
> +    {
> +        gw->l3e = l3p[guest_l3_table_offset(gla)];
> +        if ( cache )
> +            hvmemul_write_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e));
> +    }
>      gflags = guest_l3e_get_flags(gw->l3e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> @@ -216,7 +237,16 @@ guest_walk_tables(struct vcpu *v, struct
>  #else /* PAE only... */
> 
>      /* Get the l3e and check its flag */
> -    gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
> +    l3gpa = gfn_to_gaddr(top_gfn) + ((unsigned long)top_map &
> ~PAGE_MASK) +
> +            guest_l3_table_offset(gla) * sizeof(gw->l3e);
> +    if ( !cache ||
> +         !hvmemul_read_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
> +    {
> +        gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
> +        if ( cache )
> +            hvmemul_write_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e));
> +    }
> +
>      gflags = guest_l3e_get_flags(gw->l3e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> @@ -242,18 +272,26 @@ guest_walk_tables(struct vcpu *v, struct
>          goto out;
>      }
> 
> -    /* Get the l2e */
> -    gw->l2e = l2p[guest_l2_table_offset(gla)];
> +    l2gpa = gfn_to_gaddr(guest_l3e_get_gfn(gw->l3e));
> 
>  #else /* 32-bit only... */
> 
> -    /* Get l2e from the top level table */
>      gw->l2mfn = top_mfn;
>      l2p = (guest_l2e_t *) top_map;
> -    gw->l2e = l2p[guest_l2_table_offset(gla)];
> +    l2gpa = gfn_to_gaddr(top_gfn);
> 
>  #endif /* All levels... */
> 
> +    /* Get the l2e */
> +    l2gpa += guest_l2_table_offset(gla) * sizeof(gw->l2e);
> +    if ( !cache ||
> +         !hvmemul_read_cache(cache, l2gpa, 2, &gw->l2e, sizeof(gw->l2e)) )
> +    {
> +        gw->l2e = l2p[guest_l2_table_offset(gla)];
> +        if ( cache )
> +            hvmemul_write_cache(cache, l2gpa, 2, &gw->l2e, sizeof(gw->l2e));
> +    }
> +
>      /* Check the l2e flags. */
>      gflags = guest_l2e_get_flags(gw->l2e);
>      if ( !(gflags & _PAGE_PRESENT) )
> @@ -335,7 +373,17 @@ guest_walk_tables(struct vcpu *v, struct
>          gw->pfec |= rc & PFEC_synth_mask;
>          goto out;
>      }
> -    gw->l1e = l1p[guest_l1_table_offset(gla)];
> +
> +    l1gpa = gfn_to_gaddr(guest_l2e_get_gfn(gw->l2e)) +
> +            guest_l1_table_offset(gla) * sizeof(gw->l1e);
> +    if ( !cache ||
> +         !hvmemul_read_cache(cache, l1gpa, 1, &gw->l1e, sizeof(gw->l1e)) )
> +    {
> +        gw->l1e = l1p[guest_l1_table_offset(gla)];
> +        if ( cache )
> +            hvmemul_write_cache(cache, l1gpa, 1, &gw->l1e, sizeof(gw->l1e));
> +    }
> +
>      gflags = guest_l1e_get_flags(gw->l1e);
>      if ( !(gflags & _PAGE_PRESENT) )
>          goto out;
> @@ -446,22 +494,38 @@ guest_walk_tables(struct vcpu *v, struct
>      case 1:
>          if ( set_ad_bits(&l1p[guest_l1_table_offset(gla)].l1, &gw->l1e.l1,
>                           (walk & PFEC_write_access)) )
> +        {
>              paging_mark_dirty(d, gw->l1mfn);
> +            if ( cache )
> +                hvmemul_write_cache(cache, l1gpa, 1, &gw->l1e, sizeof(gw->l1e));
> +        }
>          /* Fallthrough */
>      case 2:
>          if ( set_ad_bits(&l2p[guest_l2_table_offset(gla)].l2, &gw->l2e.l2,
>                           (walk & PFEC_write_access) && leaf_level == 2) )
> +        {
>              paging_mark_dirty(d, gw->l2mfn);
> +            if ( cache )
> +                hvmemul_write_cache(cache, l2gpa, 2, &gw->l2e, sizeof(gw->l2e));
> +        }
>          /* Fallthrough */
>  #if GUEST_PAGING_LEVELS == 4 /* 64-bit only... */
>      case 3:
>          if ( set_ad_bits(&l3p[guest_l3_table_offset(gla)].l3, &gw->l3e.l3,
>                           (walk & PFEC_write_access) && leaf_level == 3) )
> +        {
>              paging_mark_dirty(d, gw->l3mfn);
> +            if ( cache )
> +                hvmemul_write_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e));
> +        }
> 
>          if ( set_ad_bits(&l4p[guest_l4_table_offset(gla)].l4, &gw->l4e.l4,
>                           false) )
> +        {
>              paging_mark_dirty(d, gw->l4mfn);
> +            if ( cache )
> +                hvmemul_write_cache(cache, l4gpa, 4, &gw->l4e, sizeof(gw->l4e));
> +        }
>  #endif
>      }
> 
> --- a/xen/include/asm-x86/hvm/emulate.h
> +++ b/xen/include/asm-x86/hvm/emulate.h
> @@ -98,6 +98,13 @@ int hvmemul_do_pio_buffer(uint16_t port,
>                            uint8_t dir,
>                            void *buffer);
> 
> +struct hvmemul_cache;
> +bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
> +                        unsigned int level, void *buffer, unsigned int size);
> +void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
> +                         unsigned int level, const void *buffer,
> +                         unsigned int size);
> +
>  void hvm_dump_emulation_state(const char *loglvl, const char *prefix,
>                                struct hvm_emulate_ctxt *hvmemul_ctxt, int rc);
> 
> 
> 



* Re: [PATCH v2 3/4] x86/HVM: implement memory read caching
  2018-09-11 13:15 ` [PATCH v2 3/4] x86/HVM: implement memory read caching Jan Beulich
@ 2018-09-11 16:20   ` Paul Durrant
  2018-09-12  8:38     ` Jan Beulich
  2018-09-19 15:57   ` Wei Liu
  1 sibling, 1 reply; 48+ messages in thread
From: Paul Durrant @ 2018-09-11 16:20 UTC (permalink / raw)
  To: 'Jan Beulich', xen-devel; +Cc: Andrew Cooper, Wei Liu, George Dunlap

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 11 September 2018 14:15
> To: xen-devel <xen-devel@lists.xenproject.org>
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Paul Durrant
> <Paul.Durrant@citrix.com>; Wei Liu <wei.liu2@citrix.com>; George Dunlap
> <George.Dunlap@citrix.com>
> Subject: [PATCH v2 3/4] x86/HVM: implement memory read caching
> 
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far as use of CPU
> registers goes (as those can't change without any other instruction
> executing in between), but is wrong for memory accesses. In particular
> it has been observed that Windows might page out buffers underneath an
> instruction currently under emulation (hitting between two passes). If
> the first pass translated a linear address successfully, any subsequent
> pass needs to do so too, yielding the exact same translation.
> 
> Introduce a cache (used by just guest page table accesses for now) to
> make sure above described assumption holds. This is a very simplistic
> implementation for now: Only exact matches are satisfied (no overlaps or
> partial reads or anything).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Tim Deegan <tim@xen.org>
> ---
> v2: Re-base.
> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -27,6 +27,18 @@
>  #include <asm/hvm/svm/svm.h>
>  #include <asm/vm_event.h>
> 
> +struct hvmemul_cache
> +{
> +    unsigned int max_ents;
> +    unsigned int num_ents;
> +    struct {
> +        paddr_t gpa:PADDR_BITS;
> +        unsigned int size:(BITS_PER_LONG - PADDR_BITS) / 2;
> +        unsigned int level:(BITS_PER_LONG - PADDR_BITS) / 2;
> +        unsigned long data;
> +    } ents[];
> +};
> +
>  static void hvmtrace_io_assist(const ioreq_t *p)
>  {
>      unsigned int size, event;
> @@ -541,7 +553,7 @@ static int hvmemul_do_mmio_addr(paddr_t
>   */
>  static void *hvmemul_map_linear_addr(
>      unsigned long linear, unsigned int bytes, uint32_t pfec,
> -    struct hvm_emulate_ctxt *hvmemul_ctxt)
> +    struct hvm_emulate_ctxt *hvmemul_ctxt, struct hvmemul_cache
> *cache)
>  {
>      struct vcpu *curr = current;
>      void *err, *mapping;
> @@ -586,7 +598,7 @@ static void *hvmemul_map_linear_addr(
>          ASSERT(mfn_x(*mfn) == 0);
> 
>          res = hvm_translate_get_page(curr, addr, true, pfec,
> -                                     &pfinfo, &page, NULL, &p2mt);
> +                                     &pfinfo, &page, NULL, &p2mt, cache);
> 
>          switch ( res )
>          {
> @@ -702,6 +714,8 @@ static int hvmemul_linear_to_phys(
>      gfn_t gfn, ngfn;
>      unsigned long done, todo, i, offset = addr & ~PAGE_MASK;
>      int reverse;
> +    struct hvmemul_cache *cache = pfec & PFEC_insn_fetch
> +                                  ? NULL : curr->arch.hvm.data_cache;
> 
>      /*
>       * Clip repetitions to a sensible maximum. This avoids extensive looping in
> @@ -731,7 +745,7 @@ static int hvmemul_linear_to_phys(
>              return rc;
>          gfn = gaddr_to_gfn(gaddr);
>      }
> -    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, NULL),
> +    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, cache),
>                       INVALID_GFN) )
>      {
>          if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
> @@ -747,7 +761,7 @@ static int hvmemul_linear_to_phys(
>      {
>          /* Get the next PFN in the range. */
>          addr += reverse ? -PAGE_SIZE : PAGE_SIZE;
> -        ngfn = paging_gla_to_gfn(curr, addr, &pfec, NULL);
> +        ngfn = paging_gla_to_gfn(curr, addr, &pfec, cache);
> 
>          /* Is it contiguous with the preceding PFNs? If not then we're done. */
>          if ( gfn_eq(ngfn, INVALID_GFN) ||
> @@ -1073,7 +1087,10 @@ static int linear_read(unsigned long add
>                         uint32_t pfec, struct hvm_emulate_ctxt *hvmemul_ctxt)
>  {
>      pagefault_info_t pfinfo;
> -    int rc = hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo);
> +    int rc = hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo,
> +                                        (pfec & PFEC_insn_fetch
> +                                         ? NULL
> +                                         : current->arch.hvm.data_cache));
> 
>      switch ( rc )
>      {
> @@ -1270,7 +1287,8 @@ static int hvmemul_write(
> 
>      if ( !known_gla(addr, bytes, pfec) )
>      {
> -        mapping = hvmemul_map_linear_addr(addr, bytes, pfec,
> hvmemul_ctxt);
> +        mapping = hvmemul_map_linear_addr(addr, bytes, pfec,
> hvmemul_ctxt,
> +                                          current->arch.hvm.data_cache);
>          if ( IS_ERR(mapping) )
>               return ~PTR_ERR(mapping);
>      }
> @@ -1312,7 +1330,8 @@ static int hvmemul_rmw(
> 
>      if ( !known_gla(addr, bytes, pfec) )
>      {
> -        mapping = hvmemul_map_linear_addr(addr, bytes, pfec,
> hvmemul_ctxt);
> +        mapping = hvmemul_map_linear_addr(addr, bytes, pfec,
> hvmemul_ctxt,
> +                                          current->arch.hvm.data_cache);
>          if ( IS_ERR(mapping) )
>              return ~PTR_ERR(mapping);
>      }
> @@ -1466,7 +1485,8 @@ static int hvmemul_cmpxchg(
>      else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].dpl == 3 )
>          pfec |= PFEC_user_mode;
> 
> -    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
> +    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
> +                                      curr->arch.hvm.data_cache);
>      if ( IS_ERR(mapping) )
>          return ~PTR_ERR(mapping);
> 
> @@ -2373,6 +2393,7 @@ static int _hvm_emulate_one(struct hvm_e
>      {
>          vio->mmio_cache_count = 0;
>          vio->mmio_insn_bytes = 0;
> +        curr->arch.hvm.data_cache->num_ents = 0;
>      }
>      else
>      {
> @@ -2591,7 +2612,7 @@ void hvm_emulate_init_per_insn(
>                                          &addr) &&
>               hvm_copy_from_guest_linear(hvmemul_ctxt->insn_buf, addr,
>                                          sizeof(hvmemul_ctxt->insn_buf),
> -                                        pfec | PFEC_insn_fetch,
> +                                        pfec | PFEC_insn_fetch, NULL,
>                                          NULL) == HVMTRANS_okay) ?
>              sizeof(hvmemul_ctxt->insn_buf) : 0;
>      }
> @@ -2664,9 +2685,35 @@ void hvm_dump_emulation_state(const char
>             hvmemul_ctxt->insn_buf);
>  }
> 
> +struct hvmemul_cache *hvmemul_cache_init(unsigned int nents)
> +{
> +    struct hvmemul_cache *cache = xmalloc_bytes(offsetof(struct
> hvmemul_cache,
> +                                                         ents[nents]));
> +
> +    if ( cache )
> +    {
> +        cache->num_ents = 0;
> +        cache->max_ents = nents;
> +    }
> +
> +    return cache;
> +}
> +
>  bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t
> gpa,
>                          unsigned int level, void *buffer, unsigned int size)
>  {
> +    unsigned int i;
> +

Here you could return false if cache is NULL...

> +    ASSERT(size <= sizeof(cache->ents->data));
> +
> +    for ( i = 0; i < cache->num_ents; ++i )
> +        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
> +             cache->ents[i].size == size )
> +        {
> +            memcpy(buffer, &cache->ents[i].data, size);
> +            return true;
> +        }
> +
>      return false;
>  }
> 
> @@ -2674,6 +2721,35 @@ void hvmemul_write_cache(struct hvmemul_
>                           unsigned int level, const void *buffer,
>                           unsigned int size)
>  {
> +    unsigned int i;
> +

...and here just bail out. Thus making both functions safe to call with a NULL cache.

> +    if ( size > sizeof(cache->ents->data) )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return;
> +    }
> +
> +    for ( i = 0; i < cache->num_ents; ++i )
> +        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
> +             cache->ents[i].size == size )
> +        {
> +            memcpy(&cache->ents[i].data, buffer, size);
> +            return;
> +        }
> +
> +    if ( unlikely(i >= cache->max_ents) )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return;
> +    }
> +
> +    cache->ents[i].level = level;
> +    cache->ents[i].gpa   = gpa;
> +    cache->ents[i].size  = size;
> +
> +    memcpy(&cache->ents[i].data, buffer, size);
> +
> +    cache->num_ents = i + 1;
>  }
> 
>  /*
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1521,6 +1521,17 @@ int hvm_vcpu_initialise(struct vcpu *v)
> 
>      v->arch.hvm.inject_event.vector = HVM_EVENT_VECTOR_UNSET;
> 
> +    /*
> +     * Leaving aside the insn fetch, for which we don't use this cache, no
> +     * insn can access more than 8 independent linear addresses (AVX2
> +     * gathers being the worst). Each such linear range can span a page
> +     * boundary, i.e. require two page walks.
> +     */
> +    v->arch.hvm.data_cache =
> hvmemul_cache_init(CONFIG_PAGING_LEVELS * 8 * 2);
> +    rc = -ENOMEM;
> +    if ( !v->arch.hvm.data_cache )
> +        goto fail4;
> +
>      rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
>      if ( rc != 0 )
>          goto fail4;
> @@ -1550,6 +1561,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
>   fail5:
>      free_compat_arg_xlat(v);
>   fail4:
> +    hvmemul_cache_destroy(v->arch.hvm.data_cache);
>      hvm_funcs.vcpu_destroy(v);
>   fail3:
>      vlapic_destroy(v);
> @@ -1572,6 +1584,8 @@ void hvm_vcpu_destroy(struct vcpu *v)
> 
>      free_compat_arg_xlat(v);
> 
> +    hvmemul_cache_destroy(v->arch.hvm.data_cache);
> +
>      tasklet_kill(&v->arch.hvm.assert_evtchn_irq_tasklet);
>      hvm_funcs.vcpu_destroy(v);
> 
> @@ -2946,7 +2960,7 @@ void hvm_task_switch(
>      }
> 
>      rc = hvm_copy_from_guest_linear(
> -        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
> +        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
>      if ( rc == HVMTRANS_bad_linear_to_gfn )
>          hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
>      if ( rc != HVMTRANS_okay )
> @@ -2993,7 +3007,7 @@ void hvm_task_switch(
>          goto out;
> 
>      rc = hvm_copy_from_guest_linear(
> -        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
> +        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
>      if ( rc == HVMTRANS_bad_linear_to_gfn )
>          hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
>      /*
> @@ -3104,7 +3118,7 @@ void hvm_task_switch(
>  enum hvm_translation_result hvm_translate_get_page(
>      struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
>      pagefault_info_t *pfinfo, struct page_info **page_p,
> -    gfn_t *gfn_p, p2m_type_t *p2mt_p)
> +    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache)
>  {
>      struct page_info *page;
>      p2m_type_t p2mt;
> @@ -3112,7 +3126,7 @@ enum hvm_translation_result hvm_translat
> 
>      if ( linear )
>      {
> -        gfn = paging_gla_to_gfn(v, addr, &pfec, NULL);
> +        gfn = paging_gla_to_gfn(v, addr, &pfec, cache);
> 
>          if ( gfn_eq(gfn, INVALID_GFN) )
>          {
> @@ -3184,7 +3198,7 @@ enum hvm_translation_result hvm_translat
>  #define HVMCOPY_linear     (1u<<2)
>  static enum hvm_translation_result __hvm_copy(
>      void *buf, paddr_t addr, int size, struct vcpu *v, unsigned int flags,
> -    uint32_t pfec, pagefault_info_t *pfinfo)
> +    uint32_t pfec, pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
>  {
>      gfn_t gfn;
>      struct page_info *page;
> @@ -3217,8 +3231,8 @@ static enum hvm_translation_result __hvm
> 
>          count = min_t(int, PAGE_SIZE - gpa, todo);
> 
> -        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear,
> -                                     pfec, pfinfo, &page, &gfn, &p2mt);
> +        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear, pfec,
> +                                     pfinfo, &page, &gfn, &p2mt, cache);
>          if ( res != HVMTRANS_okay )
>              return res;
> 
> @@ -3265,14 +3279,14 @@ enum hvm_translation_result hvm_copy_to_
>      paddr_t paddr, void *buf, int size, struct vcpu *v)
>  {
>      return __hvm_copy(buf, paddr, size, v,
> -                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL);
> +                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL, NULL);
>  }
> 
>  enum hvm_translation_result hvm_copy_from_guest_phys(
>      void *buf, paddr_t paddr, int size)
>  {
>      return __hvm_copy(buf, paddr, size, current,
> -                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL);
> +                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL, NULL);
>  }
> 
>  enum hvm_translation_result hvm_copy_to_guest_linear(
> @@ -3281,16 +3295,17 @@ enum hvm_translation_result hvm_copy_to_
>  {
>      return __hvm_copy(buf, addr, size, current,
>                        HVMCOPY_to_guest | HVMCOPY_linear,
> -                      PFEC_page_present | PFEC_write_access | pfec, pfinfo);
> +                      PFEC_page_present | PFEC_write_access | pfec,
> +                      pfinfo, NULL);
>  }
> 
>  enum hvm_translation_result hvm_copy_from_guest_linear(
>      void *buf, unsigned long addr, int size, uint32_t pfec,
> -    pagefault_info_t *pfinfo)
> +    pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
>  {
>      return __hvm_copy(buf, addr, size, current,
>                        HVMCOPY_from_guest | HVMCOPY_linear,
> -                      PFEC_page_present | pfec, pfinfo);
> +                      PFEC_page_present | pfec, pfinfo, cache);
>  }
> 
>  unsigned long copy_to_user_hvm(void *to, const void *from, unsigned int
> len)
> @@ -3331,7 +3346,8 @@ unsigned long copy_from_user_hvm(void *t
>          return 0;
>      }
> 
> -    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len, 0, NULL);
> +    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len,
> +                                    0, NULL, NULL);
>      return rc ? len : 0; /* fake a copy_from_user() return code */
>  }
> 
> @@ -3747,7 +3763,7 @@ void hvm_ud_intercept(struct cpu_user_re
>                                          sizeof(sig), hvm_access_insn_fetch,
>                                          cs, &addr) &&
>               (hvm_copy_from_guest_linear(sig, addr, sizeof(sig),
> -                                         walk, NULL) == HVMTRANS_okay) &&
> +                                         walk, NULL, NULL) == HVMTRANS_okay) &&
>               (memcmp(sig, "\xf\xbxen", sizeof(sig)) == 0) )
>          {
>              regs->rip += sizeof(sig);
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1358,7 +1358,7 @@ static void svm_emul_swint_injection(str
>          goto raise_exception;
> 
>      rc = hvm_copy_from_guest_linear(&idte, idte_linear_addr, idte_size,
> -                                    PFEC_implicit, &pfinfo);
> +                                    PFEC_implicit, &pfinfo, NULL);
>      if ( rc )
>      {
>          if ( rc == HVMTRANS_bad_linear_to_gfn )
> --- a/xen/arch/x86/hvm/vmx/vvmx.c
> +++ b/xen/arch/x86/hvm/vmx/vvmx.c
> @@ -475,7 +475,7 @@ static int decode_vmx_inst(struct cpu_us
>          {
>              pagefault_info_t pfinfo;
>              int rc = hvm_copy_from_guest_linear(poperandS, base, size,
> -                                                0, &pfinfo);
> +                                                0, &pfinfo, NULL);
> 
>              if ( rc == HVMTRANS_bad_linear_to_gfn )
>                  hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -166,7 +166,7 @@ const struct x86_emulate_ops *shadow_ini
>              hvm_access_insn_fetch, sh_ctxt, &addr) &&
>           !hvm_copy_from_guest_linear(
>               sh_ctxt->insn_buf, addr, sizeof(sh_ctxt->insn_buf),
> -             PFEC_insn_fetch, NULL))
> +             PFEC_insn_fetch, NULL, NULL))
>          ? sizeof(sh_ctxt->insn_buf) : 0;
> 
>      return &hvm_shadow_emulator_ops;
> @@ -201,7 +201,7 @@ void shadow_continue_emulation(struct sh
>                  hvm_access_insn_fetch, sh_ctxt, &addr) &&
>               !hvm_copy_from_guest_linear(
>                   sh_ctxt->insn_buf, addr, sizeof(sh_ctxt->insn_buf),
> -                 PFEC_insn_fetch, NULL))
> +                 PFEC_insn_fetch, NULL, NULL))
>              ? sizeof(sh_ctxt->insn_buf) : 0;
>          sh_ctxt->insn_buf_eip = regs->rip;
>      }
> --- a/xen/arch/x86/mm/shadow/hvm.c
> +++ b/xen/arch/x86/mm/shadow/hvm.c
> @@ -125,7 +125,7 @@ hvm_read(enum x86_segment seg,
>      rc = hvm_copy_from_guest_linear(p_data, addr, bytes,
>                                      (access_type == hvm_access_insn_fetch
>                                       ? PFEC_insn_fetch : 0),
> -                                    &pfinfo);
> +                                    &pfinfo, NULL);
> 
>      switch ( rc )
>      {
> --- a/xen/include/asm-x86/hvm/emulate.h
> +++ b/xen/include/asm-x86/hvm/emulate.h
> @@ -99,6 +99,11 @@ int hvmemul_do_pio_buffer(uint16_t port,
>                            void *buffer);
> 
>  struct hvmemul_cache;
> +struct hvmemul_cache *hvmemul_cache_init(unsigned int nents);
> +static inline void hvmemul_cache_destroy(struct hvmemul_cache *cache)
> +{
> +    xfree(cache);
> +}
>  bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
>                          unsigned int level, void *buffer, unsigned int size);
>  void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
> --- a/xen/include/asm-x86/hvm/support.h
> +++ b/xen/include/asm-x86/hvm/support.h
> @@ -99,7 +99,7 @@ enum hvm_translation_result hvm_copy_to_
>      pagefault_info_t *pfinfo);
>  enum hvm_translation_result hvm_copy_from_guest_linear(
>      void *buf, unsigned long addr, int size, uint32_t pfec,
> -    pagefault_info_t *pfinfo);
> +    pagefault_info_t *pfinfo, struct hvmemul_cache *cache);
> 
>  /*
>   * Get a reference on the page under an HVM physical or linear address.  If
> @@ -110,7 +110,7 @@ enum hvm_translation_result hvm_copy_fro
>  enum hvm_translation_result hvm_translate_get_page(
>      struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
>      pagefault_info_t *pfinfo, struct page_info **page_p,
> -    gfn_t *gfn_p, p2m_type_t *p2mt_p);
> +    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache);
> 
>  #define HVM_HCALL_completed  0 /* hypercall completed - no further
> action */
>  #define HVM_HCALL_preempted  1 /* hypercall preempted - re-execute
> VMCALL */
> --- a/xen/include/asm-x86/hvm/vcpu.h
> +++ b/xen/include/asm-x86/hvm/vcpu.h
> @@ -53,8 +53,6 @@ struct hvm_mmio_cache {
>      uint8_t buffer[32];
>  };
> 
> -struct hvmemul_cache;
> -
>  struct hvm_vcpu_io {
>      /* I/O request in flight to device model. */
>      enum hvm_io_completion io_completion;
> @@ -200,6 +198,7 @@ struct hvm_vcpu {
>      u8                  cache_mode;
> 
>      struct hvm_vcpu_io  hvm_io;
> +    struct hvmemul_cache *data_cache;
> 
>      /* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
>      struct x86_event     inject_event;
> 
> 



* Re: [PATCH v2 2/4] x86/mm: use optional cache in guest_walk_tables()
  2018-09-11 16:17   ` Paul Durrant
@ 2018-09-12  8:30     ` Jan Beulich
  0 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2018-09-12  8:30 UTC (permalink / raw)
  To: Paul Durrant; +Cc: Andrew Cooper, Wei Liu, george.dunlap, xen-devel

>>> On 11.09.18 at 18:17, <Paul.Durrant@citrix.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: 11 September 2018 14:15
>> 
>> --- a/xen/arch/x86/mm/guest_walk.c
>> +++ b/xen/arch/x86/mm/guest_walk.c
>> @@ -92,8 +92,13 @@ guest_walk_tables(struct vcpu *v, struct
>>  #if GUEST_PAGING_LEVELS >= 4 /* 64-bit only... */
>>      guest_l3e_t *l3p = NULL;
> 
> Shouldn't the above line be...
> 
>>      guest_l4e_t *l4p;
>> +    paddr_t l4gpa;
>> +#endif
>> +#if GUEST_PAGING_LEVELS >= 3 /* PAE or 64... */
> 
> ...here?

No (and note that's not code I change).

>> @@ -134,7 +139,15 @@ guest_walk_tables(struct vcpu *v, struct
>>      /* Get the l4e from the top level table and check its flags*/
>>      gw->l4mfn = top_mfn;
>>      l4p = (guest_l4e_t *) top_map;
>> -    gw->l4e = l4p[guest_l4_table_offset(gla)];
>> +    l4gpa = gfn_to_gaddr(top_gfn) +
>> +            guest_l4_table_offset(gla) * sizeof(gw->l4e);
>> +    if ( !cache ||
>> +         !hvmemul_read_cache(cache, l4gpa, 4, &gw->l4e, sizeof(gw->l4e)) )
>> +    {
>> +        gw->l4e = l4p[guest_l4_table_offset(gla)];
>> +        if ( cache )
>> +            hvmemul_write_cache(cache, l4gpa, 4, &gw->l4e, sizeof(gw->l4e));
> 
> No need to test cache here or below since neither the read or write 
> functions (yet) dereference it.

I don't understand: The patch here is to prepare for a fully
implemented set of backing functions, i.e. I in particular don't
want to touch this code again once I add actual bodies to
the functions. Apart from this I don't think callers should make
assumptions like this about callee behavior.

Jan




* Re: [PATCH v2 3/4] x86/HVM: implement memory read caching
  2018-09-11 16:20   ` Paul Durrant
@ 2018-09-12  8:38     ` Jan Beulich
  2018-09-12  8:49       ` Paul Durrant
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2018-09-12  8:38 UTC (permalink / raw)
  To: Paul Durrant; +Cc: Andrew Cooper, Wei Liu, george.dunlap, xen-devel

>>> On 11.09.18 at 18:20, <Paul.Durrant@citrix.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: 11 September 2018 14:15
>> 
>> @@ -2664,9 +2685,35 @@ void hvm_dump_emulation_state(const char
>>             hvmemul_ctxt->insn_buf);
>>  }
>> 
>> +struct hvmemul_cache *hvmemul_cache_init(unsigned int nents)
>> +{
>> +    struct hvmemul_cache *cache = xmalloc_bytes(offsetof(struct
>> hvmemul_cache,
>> +                                                         ents[nents]));
>> +
>> +    if ( cache )
>> +    {
>> +        cache->num_ents = 0;
>> +        cache->max_ents = nents;
>> +    }
>> +
>> +    return cache;
>> +}
>> +
>>  bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t
>> gpa,
>>                          unsigned int level, void *buffer, unsigned int size)
>>  {
>> +    unsigned int i;
>> +
> 
> Here you could return false if cache is NULL...

This one could perhaps be considered, but ...

>> @@ -2674,6 +2721,35 @@ void hvmemul_write_cache(struct hvmemul_
>>                           unsigned int level, const void *buffer,
>>                           unsigned int size)
>>  {
>> +    unsigned int i;
>> +
> 
> ...and here just bail out. Thus making both functions safe to call with a 
> NULL cache.

... I'm pretty much opposed to this: The term "cache" might be slightly
confusing here, but I lack a better idea for a name. Its presence is
required for correctness. After all the series is not a performance
improvement, but a plain bug fix (generalizing what we were able to
special case for the actually observed problem in commit 91afb8139f
["x86/HVM: suppress I/O completion for port output"]). And with
that I'd rather leave the read side as is as well.
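
(For completeness, the read side variant you suggest would amount to no
more than an early bail-out; purely an illustrative sketch, not something
this series introduces:

bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t gpa,
                        unsigned int level, void *buffer, unsigned int size)
{
    unsigned int i;

    if ( !cache )        /* hypothetical NULL-tolerant variant */
        return false;

    ASSERT(size <= sizeof(cache->ents->data));

    for ( i = 0; i < cache->num_ents; ++i )
        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
             cache->ents[i].size == size )
        {
            memcpy(buffer, &cache->ents[i].data, size);
            return true;
        }

    return false;
}

with callers then free to drop their "if ( cache )" guards.)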

Jan




* Re: [PATCH v2 3/4] x86/HVM: implement memory read caching
  2018-09-12  8:38     ` Jan Beulich
@ 2018-09-12  8:49       ` Paul Durrant
  0 siblings, 0 replies; 48+ messages in thread
From: Paul Durrant @ 2018-09-12  8:49 UTC (permalink / raw)
  To: 'Jan Beulich'; +Cc: Andrew Cooper, Wei Liu, George Dunlap, xen-devel

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 12 September 2018 09:38
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; George Dunlap
> <George.Dunlap@citrix.com>; Wei Liu <wei.liu2@citrix.com>; xen-devel
> <xen-devel@lists.xenproject.org>
> Subject: RE: [PATCH v2 3/4] x86/HVM: implement memory read caching
> 
> >>> On 11.09.18 at 18:20, <Paul.Durrant@citrix.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: 11 September 2018 14:15
> >>
> >> @@ -2664,9 +2685,35 @@ void hvm_dump_emulation_state(const char
> >>             hvmemul_ctxt->insn_buf);
> >>  }
> >>
> >> +struct hvmemul_cache *hvmemul_cache_init(unsigned int nents)
> >> +{
> >> +    struct hvmemul_cache *cache = xmalloc_bytes(offsetof(struct
> >> hvmemul_cache,
> >> +                                                         ents[nents]));
> >> +
> >> +    if ( cache )
> >> +    {
> >> +        cache->num_ents = 0;
> >> +        cache->max_ents = nents;
> >> +    }
> >> +
> >> +    return cache;
> >> +}
> >> +
> >>  bool hvmemul_read_cache(const struct hvmemul_cache *cache,
> paddr_t
> >> gpa,
> >>                          unsigned int level, void *buffer, unsigned int size)
> >>  {
> >> +    unsigned int i;
> >> +
> >
> > Here you could return false if cache is NULL...
> 
> This one could perhaps be considered, but ...
> 
> >> @@ -2674,6 +2721,35 @@ void hvmemul_write_cache(struct hvmemul_
> >>                           unsigned int level, const void *buffer,
> >>                           unsigned int size)
> >>  {
> >> +    unsigned int i;
> >> +
> >
> > ...and here just bail out. Thus making both functions safe to call with a
> > NULL cache.
> 
> ... I'm pretty much opposed to this: The term "cache" might be slightly
> confusing here, but I lack a better idea for a name. Its presence is
> required for correctness. After all the series is not a performance
> improvement, but a plain bug fix (generalizing what we were able to
> special case for the actually observed problem in commit 91afb8139f
> ["x86/HVM: suppress I/O completion for port output"]). And with
> that I'd rather leave the read side as is as well.
> 

Ok. I have no strong objection to the code structure as it stands so you can add my R-b to this and patch #2.

  Paul

> Jan
> 



* Re: [PATCH v2 4/4] x86/HVM: prefill cache with PDPTEs when possible
  2018-09-11 13:16 ` [PATCH v2 4/4] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
@ 2018-09-13  6:30   ` Tian, Kevin
  2018-09-13  8:55     ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Tian, Kevin @ 2018-09-13  6:30 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: George Dunlap, Andrew Cooper, Paul Durrant, Wei Liu, Nakajima, Jun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, September 11, 2018 9:16 PM
> 
> Since strictly speaking it is incorrect for guest_walk_tables() to read
> L3 entries during PAE page walks, try to overcome this where possible by

can you elaborate? why it's incorrect to read L3 entries?

> pre-loading the values from hardware into the cache. Sadly the
> information is available in the EPT case only. On the positive side for
> NPT the spec spells out that L3 entries are actually read on walks, so
> us reading them is consistent with hardware behavior in that case.

I'm a little bit confused about the description here. you change
VMX code but using NPT spec as the reference?

> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Re-base.
> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -2385,6 +2385,23 @@ static int _hvm_emulate_one(struct hvm_e
> 
>      vio->mmio_retry = 0;
> 
> +    if ( !curr->arch.hvm.data_cache->num_ents &&
> +         curr->arch.paging.mode->guest_levels == 3 )
> +    {
> +        unsigned int i;
> +
> +        for ( i = 0; i < 4; ++i )
> +        {
> +            uint64_t pdpte;
> +
> +            if ( hvm_read_pdpte(curr, i, &pdpte) )
> +                hvmemul_write_cache(curr->arch.hvm.data_cache,
> +                                    (curr->arch.hvm.guest_cr[3] &
> +                                     (PADDR_MASK & ~0x1f)) + i * sizeof(pdpte),
> +                                    3, &pdpte, sizeof(pdpte));
> +        }
> +    }
> +
>      rc = x86_emulate(&hvmemul_ctxt->ctxt, ops);
>      if ( rc == X86EMUL_OKAY && vio->mmio_retry )
>          rc = X86EMUL_RETRY;
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1368,6 +1368,25 @@ static void vmx_set_interrupt_shadow(str
>      __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
>  }
> 
> +static bool read_pdpte(struct vcpu *v, unsigned int idx, uint64_t *pdpte)
> +{
> +    if ( !paging_mode_hap(v->domain) || !hvm_pae_enabled(v) ||
> +         (v->arch.hvm.guest_efer & EFER_LMA) )
> +        return false;
> +
> +    if ( idx >= 4 )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return false;
> +    }
> +
> +    vmx_vmcs_enter(v);
> +    __vmread(GUEST_PDPTE(idx), pdpte);
> +    vmx_vmcs_exit(v);
> +
> +    return true;
> +}
> +
>  static void vmx_load_pdptrs(struct vcpu *v)
>  {
>      unsigned long cr3 = v->arch.hvm.guest_cr[3];
> @@ -2466,6 +2485,8 @@ const struct hvm_function_table * __init
>          if ( cpu_has_vmx_ept_1gb )
>              vmx_function_table.hap_capabilities |=
> HVM_HAP_SUPERPAGE_1GB;
> 
> +        vmx_function_table.read_pdpte = read_pdpte;
> +
>          setup_ept_dump();
>      }
> 
> --- a/xen/include/asm-x86/hvm/hvm.h
> +++ b/xen/include/asm-x86/hvm/hvm.h
> @@ -146,6 +146,8 @@ struct hvm_function_table {
> 
>      void (*fpu_leave)(struct vcpu *v);
> 
> +    bool (*read_pdpte)(struct vcpu *v, unsigned int index, uint64_t *pdpte);
> +
>      int  (*get_guest_pat)(struct vcpu *v, u64 *);
>      int  (*set_guest_pat)(struct vcpu *v, u64);
> 
> @@ -440,6 +442,12 @@ static inline unsigned long hvm_get_shad
>      return hvm_funcs.get_shadow_gs_base(v);
>  }
> 
> +static inline bool hvm_read_pdpte(struct vcpu *v, unsigned int index,
> uint64_t *pdpte)
> +{
> +    return hvm_funcs.read_pdpte &&
> +           alternative_call(hvm_funcs.read_pdpte, v, index, pdpte);
> +}
> +
>  static inline bool hvm_get_guest_bndcfgs(struct vcpu *v, u64 *val)
>  {
>      return hvm_funcs.get_guest_bndcfgs &&
> 
> 
> 



* Re: [PATCH v2 4/4] x86/HVM: prefill cache with PDPTEs when possible
  2018-09-13  6:30   ` Tian, Kevin
@ 2018-09-13  8:55     ` Jan Beulich
  2018-09-14  2:18       ` Tian, Kevin
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2018-09-13  8:55 UTC (permalink / raw)
  To: Kevin Tian
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Paul Durrant,
	Jun Nakajima, xen-devel

>>> On 13.09.18 at 08:30, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, September 11, 2018 9:16 PM
>> 
>> Since strictly speaking it is incorrect for guest_walk_tables() to read
>> L3 entries during PAE page walks, try to overcome this where possible by
> 
> can you elaborate? why it's incorrect to read L3 entries?

Architectural behavior: In PAE mode they get read upon CR3 loads,
not during page walks.
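
To illustrate with a sketch (names as in patch 4; the information is
available from the VMCS in the EPT case only): the prefill copies the
values hardware latched at the most recent CR3 load into the cache, at
the guest physical addresses of the four 8-byte PDPTEs in the 32-byte
aligned table CR3 points at, i.e. roughly

    paddr_t pdpt = curr->arch.hvm.guest_cr[3] & (PADDR_MASK & ~0x1f);
    unsigned int i;
    uint64_t pdpte;

    for ( i = 0; i < 4; ++i )
        if ( hvm_read_pdpte(curr, i, &pdpte) )
            /* Seed level 3 entries at pdpt, pdpt + 8, + 16, + 24. */
            hvmemul_write_cache(curr->arch.hvm.data_cache,
                                pdpt + i * sizeof(pdpte), 3,
                                &pdpte, sizeof(pdpte));

so that guest_walk_tables() keeps using the latched values even if the
in-memory table changed after the CR3 load.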

>> pre-loading the values from hardware into the cache. Sadly the
>> information is available in the EPT case only. On the positive side for
>> NPT the spec spells out that L3 entries are actually read on walks, so
>> us reading them is consistent with hardware behavior in that case.
> 
> I'm a little bit confused about the description here. you change
> VMX code but using NPT spec as the reference?

I'm trying to explain why there not being a way to do the same on
NPT is not only not a problem, but in line with hardware behavior.

Jan




* Re: [PATCH v2 4/4] x86/HVM: prefill cache with PDPTEs when possible
  2018-09-13  8:55     ` Jan Beulich
@ 2018-09-14  2:18       ` Tian, Kevin
  2018-09-14  8:12         ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Tian, Kevin @ 2018-09-14  2:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Paul Durrant, Nakajima,
	Jun, xen-devel

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, September 13, 2018 4:55 PM
> 
> >>> On 13.09.18 at 08:30, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Tuesday, September 11, 2018 9:16 PM
> >>
> >> Since strictly speaking it is incorrect for guest_walk_tables() to read
> >> L3 entries during PAE page walks, try to overcome this where possible by
> >
> > can you elaborate? why it's incorrect to read L3 entries?
> 
> Architectural behavior: In PAE mode they get read upon CR3 loads,
> not during page walks.

ah yes. can you add "CR3 load" in description which reminds
people easily?

> 
> >> pre-loading the values from hardware into the cache. Sadly the
> >> information is available in the EPT case only. On the positive side for
> >> NPT the spec spells out that L3 entries are actually read on walks, so
> >> us reading them is consistent with hardware behavior in that case.
> >
> > I'm a little bit confused about the description here. you change
> > VMX code but using NPT spec as the reference?
> 
> I'm trying to explain why there not being a way to do the same on
> NPT is not only not a problem, but in line with hardware behavior.
> 

Reviewed-by: Kevin Tian <kevin.tian@intel.com>


* Re: [PATCH v2 4/4] x86/HVM: prefill cache with PDPTEs when possible
  2018-09-14  2:18       ` Tian, Kevin
@ 2018-09-14  8:12         ` Jan Beulich
  0 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2018-09-14  8:12 UTC (permalink / raw)
  To: Kevin Tian
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Paul Durrant,
	Jun Nakajima, xen-devel

>>> On 14.09.18 at 04:18, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, September 13, 2018 4:55 PM
>> 
>> >>> On 13.09.18 at 08:30, <kevin.tian@intel.com> wrote:
>> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Tuesday, September 11, 2018 9:16 PM
>> >>
>> >> Since strictly speaking it is incorrect for guest_walk_tables() to read
>> >> L3 entries during PAE page walks, try to overcome this where possible by
>> >
>> > can you elaborate? why it's incorrect to read L3 entries?
>> 
>> Architectural behavior: In PAE mode they get read upon CR3 loads,
>> not during page walks.
> 
> ah yes. can you add "CR3 load" in description which reminds
> people easily?

Extended text:

Since strictly speaking it is incorrect for guest_walk_tables() to read
L3 entries during PAE page walks (they get loaded from memory only upon
CR3 loads and certain TLB flushes), try to overcome this where possible
by pre-loading the values from hardware into the cache. Sadly the
information is available in the EPT case only. On the positive side for
NPT the spec spells out that L3 entries are actually read on walks, so
us reading them is consistent with hardware behavior in that case.

>> >> pre-loading the values from hardware into the cache. Sadly the
>> >> information is available in the EPT case only. On the positive side for
>> >> NPT the spec spells out that L3 entries are actually read on walks, so
>> >> us reading them is consistent with hardware behavior in that case.
>> >
>> > I'm a little bit confused about the description here. you change
>> > VMX code but using NPT spec as the reference?
>> 
>> I'm trying to explain why there not being a way to do the same on
>> NPT is not only not a problem, but in line with hardware behavior.
>> 
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Thanks.

Jan




* Re: [PATCH v2 1/4] x86/mm: add optional cache to GLA->GFN translation
  2018-09-11 13:13 ` [PATCH v2 1/4] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
  2018-09-11 13:40   ` Razvan Cojocaru
@ 2018-09-19 15:09   ` Wei Liu
  1 sibling, 0 replies; 48+ messages in thread
From: Wei Liu @ 2018-09-19 15:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tamas K Lengyel, Wei Liu, Razvan Cojocaru, George Dunlap,
	Andrew Cooper, Tim Deegan, Paul Durrant, xen-devel

On Tue, Sep 11, 2018 at 07:13:53AM -0600, Jan Beulich wrote:
> The caching isn't actually implemented here, this is just setting the
> stage.
> 
> Touching these anyway also
> - make their return values gfn_t
> - gva -> gla in their names
> - name their input arguments gla
> 
> At the use sites do the conversion to gfn_t as suitable.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

Reviewed-by: Wei Liu <wei.liu2@citrix.com>


* Re: [PATCH v2 2/4] x86/mm: use optional cache in guest_walk_tables()
  2018-09-11 13:14 ` [PATCH v2 2/4] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
  2018-09-11 16:17   ` Paul Durrant
@ 2018-09-19 15:50   ` Wei Liu
  1 sibling, 0 replies; 48+ messages in thread
From: Wei Liu @ 2018-09-19 15:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, xen-devel, Paul Durrant, Wei Liu, Andrew Cooper

On Tue, Sep 11, 2018 at 07:14:49AM -0600, Jan Beulich wrote:
> The caching isn't actually implemented here, this is just setting the
> stage.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Wei Liu <wei.liu2@citrix.com>


* Re: [PATCH v2 3/4] x86/HVM: implement memory read caching
  2018-09-11 13:15 ` [PATCH v2 3/4] x86/HVM: implement memory read caching Jan Beulich
  2018-09-11 16:20   ` Paul Durrant
@ 2018-09-19 15:57   ` Wei Liu
  2018-09-20  6:39     ` Jan Beulich
  1 sibling, 1 reply; 48+ messages in thread
From: Wei Liu @ 2018-09-19 15:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, xen-devel, Paul Durrant, Wei Liu, Andrew Cooper

On Tue, Sep 11, 2018 at 07:15:19AM -0600, Jan Beulich wrote:
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far as use of CPU
> registers goes (as those can't change without any other instruction
> executing in between), but is wrong for memory accesses. In particular
> it has been observed that Windows might page out buffers underneath an
> instruction currently under emulation (hitting between two passes). If
> the first pass translated a linear address successfully, any subsequent
> pass needs to do so too, yielding the exact same translation.

Not sure I follow. If the buffers are paged out between two passes, how
would caching the translation help?  Yes you get the same translation
result but the content of the address pointed to by the translation
result could be different, right?

Wei.


* Re: [PATCH v2 3/4] x86/HVM: implement memory read caching
  2018-09-19 15:57   ` Wei Liu
@ 2018-09-20  6:39     ` Jan Beulich
  2018-09-20 15:42       ` Wei Liu
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2018-09-20  6:39 UTC (permalink / raw)
  To: Wei Liu; +Cc: George Dunlap, Andrew Cooper, Paul Durrant, xen-devel

>>> On 19.09.18 at 17:57, <wei.liu2@citrix.com> wrote:
> On Tue, Sep 11, 2018 at 07:15:19AM -0600, Jan Beulich wrote:
>> Emulation requiring device model assistance uses a form of instruction
>> re-execution, assuming that the second (and any further) pass takes
>> exactly the same path. This is a valid assumption as far as use of CPU
>> registers goes (as those can't change without any other instruction
>> executing in between), but is wrong for memory accesses. In particular
>> it has been observed that Windows might page out buffers underneath an
>> instruction currently under emulation (hitting between two passes). If
>> the first pass translated a linear address successfully, any subsequent
>> pass needs to do so too, yielding the exact same translation.
> 
> Not sure I follow. If the buffers are paged out between two passes, how
> would caching the translation help?  Yes you get the same translation
> result but the content of the address pointed to by the translation
> result could be different, right?

If we accessed that memory, yes. But the whole point here is to avoid
memory accesses during retry processing, when the same access has
already occurred during an earlier round. As noted on another sub-
thread, the term "cache" here may be a little misleading, as it's not
there to improve performance (this, if so, would just be a desirable
side effect), but to guarantee correctness. I've chosen this naming for
the lack of a better alternative.

So during replay/retry, inductively by all previously performed
accesses coming from this cache, the result is going to be the same
as that of the previous run. It's just that, for now, we use _this_
cache only for page table accesses. But don't forget that there is
at least one other cache in place (struct hvm_vcpu_io's
mmio_cache[]).

For the paged-out scenario this means that despite the leaf page
table entry having changed to some non-present one between the
original run through emulation code and the replay/retry after
having received qemu's reply, since that PTE won't be read again
the original translation will be (re)used.
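
In code terms, the pattern applied at each page table level is roughly
the following (a minimal sketch only - the wrapper function is made up
for illustration; only the hvmemul_read_cache() / hvmemul_write_cache()
calls correspond to what patch 2/4 actually adds):

/* Illustration only - not code from the series. */
static guest_l1e_t read_l1e_cached(const guest_l1e_t *l1p, unsigned long gla,
                                   paddr_t l1gpa, struct hvmemul_cache *cache)
{
    guest_l1e_t l1e;

    /* Replay/retry: reuse the value recorded during the first pass. */
    if ( cache && hvmemul_read_cache(cache, l1gpa, 1, &l1e, sizeof(l1e)) )
        return l1e;

    /* First pass: read guest memory and record the result. */
    l1e = l1p[guest_l1_table_offset(gla)];
    if ( cache )
        hvmemul_write_cache(cache, l1gpa, 1, &l1e, sizeof(l1e));

    return l1e;
}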

For the actual data page in this scenario, while you're right that its
contents may have changed, there are a couple of aspects to take
into consideration:
- We must be talking about an insn accessing two locations (two
  memory ones, one of which is MMIO, or a memory and an I/O one).
- If the non I/O / MMIO side is being read, the re-read (if it occurs
  at all) has its result discarded, by taking the shortcut through
  the first switch()'s STATE_IORESP_READY case in hvmemul_do_io().
  Note how, among all the re-issue sanity checks there, we avoid
  comparing the actual data.
- If the non I/O / MMIO side is being written, it is the OS's
  responsibility to avoid actually moving page contents to disk while
  there might still be a write access in flight - this is no different in
  behavior from bare hardware.
- Read-modify-write accesses are, as always, complicated, and
  while we deal with them better nowadays than we did in the past,
  we're still not quite there to guarantee hardware-like behavior in
  all cases anyway. Nothing is getting worse by the changes made
  here, afaict.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 3/4] x86/HVM: implement memory read caching
  2018-09-20  6:39     ` Jan Beulich
@ 2018-09-20 15:42       ` Wei Liu
  0 siblings, 0 replies; 48+ messages in thread
From: Wei Liu @ 2018-09-20 15:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Paul Durrant, Wei Liu, xen-devel

On Thu, Sep 20, 2018 at 12:39:59AM -0600, Jan Beulich wrote:
> >>> On 19.09.18 at 17:57, <wei.liu2@citrix.com> wrote:
> > On Tue, Sep 11, 2018 at 07:15:19AM -0600, Jan Beulich wrote:
> >> Emulation requiring device model assistance uses a form of instruction
> >> re-execution, assuming that the second (and any further) pass takes
> >> exactly the same path. This is a valid assumption as far use of CPU
> >> registers goes (as those can't change without any other instruction
> >> executing in between), but is wrong for memory accesses. In particular
> >> it has been observed that Windows might page out buffers underneath an
> >> instruction currently under emulation (hitting between two passes). If
> >> the first pass translated a linear address successfully, any subsequent
> >> pass needs to do so too, yielding the exact same translation.
> > 
> > Not sure I follow. If the buffers are paged out between two passes, how
> > would caching the translation help?  Yes you get the same translation
> > result but the content of the address pointed to by the translation
> > result could be different, right?
> 
> If we accessed that memory, yes. But the whole point here is to avoid
> memory accesses during retry processing, when the same access has
> already occurred during an earlier round. As noted on another sub-
> thread, the term "cache" here may be a little misleading, as it's not
> there to improve performance (this, if so, would just be a desirable
> side effect), but to guarantee correctness. I've chosen this naming for
> the lack of a better alternative.
> 
> So during replay/retry, inductively by all previously performed
> accesses coming from this cache, the result is going to be the same
> as that of the previous run. It's just that, for now, we use _this_
> cache only for page table accesses. But don't forget that there is
> at least one other cache in place (struct hvm_vcpu_io's
> mmio_cache[]).
> 
> For the paged-out scenario this means that despite the leaf page
> table entry having changed to some non-present one between the
> original run through emulation code and the replay/retry after
> having received qemu's reply, since that PTE won't be read again
> the original translation will be (re)used.

Right. I got your idea up to this point.

I would appreciate it if you could put the following paragraphs into the
commit message.

> 
> For the actual data page in this scenario, while you're right that its
> contents may have changed, there are a couple of aspects to take
> into consideration:
> - We must be talking about an insn accessing two locations (two
>   memory ones, one of which is MMIO, or a memory and an I/O one).
> - If the non I/O / MMIO side is being read, the re-read (if it occurs
>   at all) is having its result discarded, by taking the shortcut through
>   the first switch()'s STATE_IORESP_READY case in hvmemul_do_io().
>   Note how, among all the re-issue sanity checks there, we avoid
>   comparing the actual data.
> - If the non I/O / MMIO side is being written, it is the OSes
>   responsibility to avoid actually moving page contents to disk while
>   there might still be a write access in flight - this is no different in
>   behavior from bare hardware.
> - Read-modify-write accesses are, as always, complicated, and
>   while we deal with them better nowadays than we did in the past,
>   we're still not quite there to guarantee hardware like behavior in
>   all cases anyway. Nothing is getting worse by the changes made
>   here, afaict.
> 
> Jan
> 
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 0/4] x86/HVM: implement memory read caching
  2018-09-11 13:10 [PATCH v2 0/4] x86/HVM: implement memory read caching Jan Beulich
                   ` (3 preceding siblings ...)
  2018-09-11 13:16 ` [PATCH v2 4/4] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
@ 2018-09-25 14:14 ` Jan Beulich
  2018-09-25 14:23   ` [PATCH v3 1/4] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
                     ` (4 more replies)
  2018-10-12 13:55 ` [PATCH v2 " Andrew Cooper
  5 siblings, 5 replies; 48+ messages in thread
From: Jan Beulich @ 2018-09-25 14:14 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Paul Durrant

Emulation requiring device model assistance uses a form of instruction
re-execution, assuming that the second (and any further) pass takes
exactly the same path. This is a valid assumption as far as use of CPU
registers goes (as those can't change without any other instruction
executing in between), but is wrong for memory accesses. In particular
it has been observed that Windows might page out buffers underneath
an instruction currently under emulation (hitting between two passes).
If the first pass translated a linear address successfully, any subsequent
pass needs to do so too, yielding the exact same translation.

Introduce a cache (used just by guest page table accesses for now, i.e.
a form of "paging structure cache") to make sure above described
assumption holds. This is a very simplistic implementation for now: Only
exact matches are satisfied (no overlaps or partial reads or anything).
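
As a purely illustrative example of the exact-match restriction
(hypothetical addresses, using the interface the series introduces; a
lookup hits only when gpa, level and size are all identical):

/* Illustration only. */
struct hvmemul_cache *cache = hvmemul_cache_init(4);
uint64_t pte = 0;

hvmemul_write_cache(cache, 0x1000, 1, &pte, 8); /* record an 8-byte entry */
hvmemul_read_cache(cache, 0x1000, 1, &pte, 8);  /* hit: exact match */
hvmemul_read_cache(cache, 0x1000, 1, &pte, 4);  /* miss: size differs */
hvmemul_read_cache(cache, 0x1004, 1, &pte, 4);  /* miss: no partial reads */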

There's also some seemingly unrelated cleanup here which was found
desirable on the way.

1: x86/mm: add optional cache to GLA->GFN translation
2: x86/mm: use optional cache in guest_walk_tables()
3: x86/HVM: implement memory read caching
4: x86/HVM: prefill cache with PDPTEs when possible

As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
can't currently find enough time to carry out the requested further
rework.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 1/4] x86/mm: add optional cache to GLA->GFN translation
  2018-09-25 14:14 ` [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
@ 2018-09-25 14:23   ` Jan Beulich
  2018-09-25 14:24   ` [PATCH v3 2/4] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2018-09-25 14:23 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Paul Durrant, Tim Deegan

The caching isn't actually implemented here, this is just setting the
stage.

Touching these anyway also
- make their return values gfn_t
- gva -> gla in their names
- name their input arguments gla

At the use sites do the conversion to gfn_t as suitable.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
---
v3: Re-base.
v2: Re-base.

--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -51,7 +51,7 @@ dbg_hvm_va2mfn(dbgva_t vaddr, struct dom
 
     DBGP2("vaddr:%lx domid:%d\n", vaddr, dp->domain_id);
 
-    *gfn = _gfn(paging_gva_to_gfn(dp->vcpu[0], vaddr, &pfec));
+    *gfn = paging_gla_to_gfn(dp->vcpu[0], vaddr, &pfec, NULL);
     if ( gfn_eq(*gfn, INVALID_GFN) )
     {
         DBGP2("kdb:bad gfn from gva_to_gfn\n");
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -699,7 +699,8 @@ static int hvmemul_linear_to_phys(
     struct hvm_emulate_ctxt *hvmemul_ctxt)
 {
     struct vcpu *curr = current;
-    unsigned long pfn, npfn, done, todo, i, offset = addr & ~PAGE_MASK;
+    gfn_t gfn, ngfn;
+    unsigned long done, todo, i, offset = addr & ~PAGE_MASK;
     int reverse;
 
     /*
@@ -721,15 +722,17 @@ static int hvmemul_linear_to_phys(
     if ( reverse && ((PAGE_SIZE - offset) < bytes_per_rep) )
     {
         /* Do page-straddling first iteration forwards via recursion. */
-        paddr_t _paddr;
+        paddr_t gaddr;
         unsigned long one_rep = 1;
         int rc = hvmemul_linear_to_phys(
-            addr, &_paddr, bytes_per_rep, &one_rep, pfec, hvmemul_ctxt);
+            addr, &gaddr, bytes_per_rep, &one_rep, pfec, hvmemul_ctxt);
+
         if ( rc != X86EMUL_OKAY )
             return rc;
-        pfn = _paddr >> PAGE_SHIFT;
+        gfn = gaddr_to_gfn(gaddr);
     }
-    else if ( (pfn = paging_gva_to_gfn(curr, addr, &pfec)) == gfn_x(INVALID_GFN) )
+    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, NULL),
+                     INVALID_GFN) )
     {
         if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
             return X86EMUL_RETRY;
@@ -744,11 +747,11 @@ static int hvmemul_linear_to_phys(
     {
         /* Get the next PFN in the range. */
         addr += reverse ? -PAGE_SIZE : PAGE_SIZE;
-        npfn = paging_gva_to_gfn(curr, addr, &pfec);
+        ngfn = paging_gla_to_gfn(curr, addr, &pfec, NULL);
 
         /* Is it contiguous with the preceding PFNs? If not then we're done. */
-        if ( (npfn == gfn_x(INVALID_GFN)) ||
-             (npfn != (pfn + (reverse ? -i : i))) )
+        if ( gfn_eq(ngfn, INVALID_GFN) ||
+             !gfn_eq(ngfn, gfn_add(gfn, reverse ? -i : i)) )
         {
             if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
                 return X86EMUL_RETRY;
@@ -756,7 +759,7 @@ static int hvmemul_linear_to_phys(
             if ( done == 0 )
             {
                 ASSERT(!reverse);
-                if ( npfn != gfn_x(INVALID_GFN) )
+                if ( !gfn_eq(ngfn, INVALID_GFN) )
                     return X86EMUL_UNHANDLEABLE;
                 *reps = 0;
                 x86_emul_pagefault(pfec, addr & PAGE_MASK, &hvmemul_ctxt->ctxt);
@@ -769,7 +772,8 @@ static int hvmemul_linear_to_phys(
         done += PAGE_SIZE;
     }
 
-    *paddr = ((paddr_t)pfn << PAGE_SHIFT) | offset;
+    *paddr = gfn_to_gaddr(gfn) | offset;
+
     return X86EMUL_OKAY;
 }
     
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2659,7 +2659,7 @@ static void *hvm_map_entry(unsigned long
      * treat it as a kernel-mode read (i.e. no access checks).
      */
     pfec = PFEC_page_present;
-    gfn = paging_gva_to_gfn(current, va, &pfec);
+    gfn = gfn_x(paging_gla_to_gfn(current, va, &pfec, NULL));
     if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
         goto fail;
 
@@ -3089,7 +3089,7 @@ enum hvm_translation_result hvm_translat
 
     if ( linear )
     {
-        gfn = _gfn(paging_gva_to_gfn(v, addr, &pfec));
+        gfn = paging_gla_to_gfn(v, addr, &pfec, NULL);
 
         if ( gfn_eq(gfn, INVALID_GFN) )
         {
--- a/xen/arch/x86/hvm/monitor.c
+++ b/xen/arch/x86/hvm/monitor.c
@@ -130,7 +130,7 @@ static inline unsigned long gfn_of_rip(u
 
     hvm_get_segment_register(curr, x86_seg_cs, &sreg);
 
-    return paging_gva_to_gfn(curr, sreg.base + rip, &pfec);
+    return gfn_x(paging_gla_to_gfn(curr, sreg.base + rip, &pfec, NULL));
 }
 
 int hvm_monitor_debug(unsigned long rip, enum hvm_monitor_debug_type type,
--- a/xen/arch/x86/mm/guest_walk.c
+++ b/xen/arch/x86/mm/guest_walk.c
@@ -81,8 +81,9 @@ static bool set_ad_bits(guest_intpte_t *
  */
 bool
 guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m,
-                  unsigned long va, walk_t *gw,
-                  uint32_t walk, mfn_t top_mfn, void *top_map)
+                  unsigned long gla, walk_t *gw, uint32_t walk,
+                  gfn_t top_gfn, mfn_t top_mfn, void *top_map,
+                  struct hvmemul_cache *cache)
 {
     struct domain *d = v->domain;
     p2m_type_t p2mt;
@@ -116,7 +117,7 @@ guest_walk_tables(struct vcpu *v, struct
 
     perfc_incr(guest_walk);
     memset(gw, 0, sizeof(*gw));
-    gw->va = va;
+    gw->va = gla;
     gw->pfec = walk & (PFEC_user_mode | PFEC_write_access);
 
     /*
@@ -133,7 +134,7 @@ guest_walk_tables(struct vcpu *v, struct
     /* Get the l4e from the top level table and check its flags*/
     gw->l4mfn = top_mfn;
     l4p = (guest_l4e_t *) top_map;
-    gw->l4e = l4p[guest_l4_table_offset(va)];
+    gw->l4e = l4p[guest_l4_table_offset(gla)];
     gflags = guest_l4e_get_flags(gw->l4e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -163,7 +164,7 @@ guest_walk_tables(struct vcpu *v, struct
     }
 
     /* Get the l3e and check its flags*/
-    gw->l3e = l3p[guest_l3_table_offset(va)];
+    gw->l3e = l3p[guest_l3_table_offset(gla)];
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -205,7 +206,7 @@ guest_walk_tables(struct vcpu *v, struct
 
         /* Increment the pfn by the right number of 4k pages. */
         start = _gfn((gfn_x(start) & ~GUEST_L3_GFN_MASK) +
-                     ((va >> PAGE_SHIFT) & GUEST_L3_GFN_MASK));
+                     ((gla >> PAGE_SHIFT) & GUEST_L3_GFN_MASK));
         gw->l1e = guest_l1e_from_gfn(start, flags);
         gw->l2mfn = gw->l1mfn = INVALID_MFN;
         leaf_level = 3;
@@ -215,7 +216,7 @@ guest_walk_tables(struct vcpu *v, struct
 #else /* PAE only... */
 
     /* Get the l3e and check its flag */
-    gw->l3e = ((guest_l3e_t *) top_map)[guest_l3_table_offset(va)];
+    gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -242,14 +243,14 @@ guest_walk_tables(struct vcpu *v, struct
     }
 
     /* Get the l2e */
-    gw->l2e = l2p[guest_l2_table_offset(va)];
+    gw->l2e = l2p[guest_l2_table_offset(gla)];
 
 #else /* 32-bit only... */
 
     /* Get l2e from the top level table */
     gw->l2mfn = top_mfn;
     l2p = (guest_l2e_t *) top_map;
-    gw->l2e = l2p[guest_l2_table_offset(va)];
+    gw->l2e = l2p[guest_l2_table_offset(gla)];
 
 #endif /* All levels... */
 
@@ -310,7 +311,7 @@ guest_walk_tables(struct vcpu *v, struct
 
         /* Increment the pfn by the right number of 4k pages. */
         start = _gfn((gfn_x(start) & ~GUEST_L2_GFN_MASK) +
-                     guest_l1_table_offset(va));
+                     guest_l1_table_offset(gla));
 #if GUEST_PAGING_LEVELS == 2
          /* Wider than 32 bits if PSE36 superpage. */
         gw->el1e = (gfn_x(start) << PAGE_SHIFT) | flags;
@@ -334,7 +335,7 @@ guest_walk_tables(struct vcpu *v, struct
         gw->pfec |= rc & PFEC_synth_mask;
         goto out;
     }
-    gw->l1e = l1p[guest_l1_table_offset(va)];
+    gw->l1e = l1p[guest_l1_table_offset(gla)];
     gflags = guest_l1e_get_flags(gw->l1e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -443,22 +444,22 @@ guest_walk_tables(struct vcpu *v, struct
         break;
 
     case 1:
-        if ( set_ad_bits(&l1p[guest_l1_table_offset(va)].l1, &gw->l1e.l1,
+        if ( set_ad_bits(&l1p[guest_l1_table_offset(gla)].l1, &gw->l1e.l1,
                          (walk & PFEC_write_access)) )
             paging_mark_dirty(d, gw->l1mfn);
         /* Fallthrough */
     case 2:
-        if ( set_ad_bits(&l2p[guest_l2_table_offset(va)].l2, &gw->l2e.l2,
+        if ( set_ad_bits(&l2p[guest_l2_table_offset(gla)].l2, &gw->l2e.l2,
                          (walk & PFEC_write_access) && leaf_level == 2) )
             paging_mark_dirty(d, gw->l2mfn);
         /* Fallthrough */
 #if GUEST_PAGING_LEVELS == 4 /* 64-bit only... */
     case 3:
-        if ( set_ad_bits(&l3p[guest_l3_table_offset(va)].l3, &gw->l3e.l3,
+        if ( set_ad_bits(&l3p[guest_l3_table_offset(gla)].l3, &gw->l3e.l3,
                          (walk & PFEC_write_access) && leaf_level == 3) )
             paging_mark_dirty(d, gw->l3mfn);
 
-        if ( set_ad_bits(&l4p[guest_l4_table_offset(va)].l4, &gw->l4e.l4,
+        if ( set_ad_bits(&l4p[guest_l4_table_offset(gla)].l4, &gw->l4e.l4,
                          false) )
             paging_mark_dirty(d, gw->l4mfn);
 #endif
--- a/xen/arch/x86/mm/hap/guest_walk.c
+++ b/xen/arch/x86/mm/hap/guest_walk.c
@@ -26,8 +26,8 @@ asm(".file \"" __OBJECT_FILE__ "\"");
 #include <xen/sched.h>
 #include "private.h" /* for hap_gva_to_gfn_* */
 
-#define _hap_gva_to_gfn(levels) hap_gva_to_gfn_##levels##_levels
-#define hap_gva_to_gfn(levels) _hap_gva_to_gfn(levels)
+#define _hap_gla_to_gfn(levels) hap_gla_to_gfn_##levels##_levels
+#define hap_gla_to_gfn(levels) _hap_gla_to_gfn(levels)
 
 #define _hap_p2m_ga_to_gfn(levels) hap_p2m_ga_to_gfn_##levels##_levels
 #define hap_p2m_ga_to_gfn(levels) _hap_p2m_ga_to_gfn(levels)
@@ -39,16 +39,10 @@ asm(".file \"" __OBJECT_FILE__ "\"");
 #include <asm/guest_pt.h>
 #include <asm/p2m.h>
 
-unsigned long hap_gva_to_gfn(GUEST_PAGING_LEVELS)(
-    struct vcpu *v, struct p2m_domain *p2m, unsigned long gva, uint32_t *pfec)
-{
-    unsigned long cr3 = v->arch.hvm.guest_cr[3];
-    return hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(v, p2m, cr3, gva, pfec, NULL);
-}
-
-unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
+static unsigned long ga_to_gfn(
     struct vcpu *v, struct p2m_domain *p2m, unsigned long cr3,
-    paddr_t ga, uint32_t *pfec, unsigned int *page_order)
+    paddr_t ga, uint32_t *pfec, unsigned int *page_order,
+    struct hvmemul_cache *cache)
 {
     bool walk_ok;
     mfn_t top_mfn;
@@ -91,7 +85,8 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
 #if GUEST_PAGING_LEVELS == 3
     top_map += (cr3 & ~(PAGE_MASK | 31));
 #endif
-    walk_ok = guest_walk_tables(v, p2m, ga, &gw, *pfec, top_mfn, top_map);
+    walk_ok = guest_walk_tables(v, p2m, ga, &gw, *pfec,
+                                top_gfn, top_mfn, top_map, cache);
     unmap_domain_page(top_map);
     put_page(top_page);
 
@@ -137,6 +132,21 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA
     return gfn_x(INVALID_GFN);
 }
 
+gfn_t hap_gla_to_gfn(GUEST_PAGING_LEVELS)(
+    struct vcpu *v, struct p2m_domain *p2m, unsigned long gla, uint32_t *pfec,
+    struct hvmemul_cache *cache)
+{
+    unsigned long cr3 = v->arch.hvm.guest_cr[3];
+
+    return _gfn(ga_to_gfn(v, p2m, cr3, gla, pfec, NULL, cache));
+}
+
+unsigned long hap_p2m_ga_to_gfn(GUEST_PAGING_LEVELS)(
+    struct vcpu *v, struct p2m_domain *p2m, unsigned long cr3,
+    paddr_t ga, uint32_t *pfec, unsigned int *page_order)
+{
+    return ga_to_gfn(v, p2m, cr3, ga, pfec, page_order, NULL);
+}
 
 /*
  * Local variables:
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -744,10 +744,11 @@ hap_write_p2m_entry(struct domain *d, un
         p2m_flush_nestedp2m(d);
 }
 
-static unsigned long hap_gva_to_gfn_real_mode(
-    struct vcpu *v, struct p2m_domain *p2m, unsigned long gva, uint32_t *pfec)
+static gfn_t hap_gla_to_gfn_real_mode(
+    struct vcpu *v, struct p2m_domain *p2m, unsigned long gla, uint32_t *pfec,
+    struct hvmemul_cache *cache)
 {
-    return ((paddr_t)gva >> PAGE_SHIFT);
+    return gaddr_to_gfn(gla);
 }
 
 static unsigned long hap_p2m_ga_to_gfn_real_mode(
@@ -763,7 +764,7 @@ static unsigned long hap_p2m_ga_to_gfn_r
 static const struct paging_mode hap_paging_real_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_real_mode,
+    .gla_to_gfn             = hap_gla_to_gfn_real_mode,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_real_mode,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
@@ -774,7 +775,7 @@ static const struct paging_mode hap_pagi
 static const struct paging_mode hap_paging_protected_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_2_levels,
+    .gla_to_gfn             = hap_gla_to_gfn_2_levels,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_2_levels,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
@@ -785,7 +786,7 @@ static const struct paging_mode hap_pagi
 static const struct paging_mode hap_paging_pae_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_3_levels,
+    .gla_to_gfn             = hap_gla_to_gfn_3_levels,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_3_levels,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
@@ -796,7 +797,7 @@ static const struct paging_mode hap_pagi
 static const struct paging_mode hap_paging_long_mode = {
     .page_fault             = hap_page_fault,
     .invlpg                 = hap_invlpg,
-    .gva_to_gfn             = hap_gva_to_gfn_4_levels,
+    .gla_to_gfn             = hap_gla_to_gfn_4_levels,
     .p2m_ga_to_gfn          = hap_p2m_ga_to_gfn_4_levels,
     .update_cr3             = hap_update_cr3,
     .update_paging_modes    = hap_update_paging_modes,
--- a/xen/arch/x86/mm/hap/private.h
+++ b/xen/arch/x86/mm/hap/private.h
@@ -24,18 +24,21 @@
 /********************************************/
 /*          GUEST TRANSLATION FUNCS         */
 /********************************************/
-unsigned long hap_gva_to_gfn_2_levels(struct vcpu *v,
-                                     struct p2m_domain *p2m,
-                                     unsigned long gva, 
-                                     uint32_t *pfec);
-unsigned long hap_gva_to_gfn_3_levels(struct vcpu *v,
-                                     struct p2m_domain *p2m,
-                                     unsigned long gva, 
-                                     uint32_t *pfec);
-unsigned long hap_gva_to_gfn_4_levels(struct vcpu *v,
-                                     struct p2m_domain *p2m,
-                                     unsigned long gva, 
-                                     uint32_t *pfec);
+gfn_t hap_gla_to_gfn_2_levels(struct vcpu *v,
+                              struct p2m_domain *p2m,
+                              unsigned long gla,
+                              uint32_t *pfec,
+                              struct hvmemul_cache *cache);
+gfn_t hap_gla_to_gfn_3_levels(struct vcpu *v,
+                              struct p2m_domain *p2m,
+                              unsigned long gla,
+                              uint32_t *pfec,
+                              struct hvmemul_cache *cache);
+gfn_t hap_gla_to_gfn_4_levels(struct vcpu *v,
+                              struct p2m_domain *p2m,
+                              unsigned long gla,
+                              uint32_t *pfec,
+                              struct hvmemul_cache *cache);
 
 unsigned long hap_p2m_ga_to_gfn_2_levels(struct vcpu *v,
     struct p2m_domain *p2m, unsigned long cr3,
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1980,16 +1980,16 @@ void np2m_schedule(int dir)
 }
 #endif
 
-unsigned long paging_gva_to_gfn(struct vcpu *v,
-                                unsigned long va,
-                                uint32_t *pfec)
+gfn_t paging_gla_to_gfn(struct vcpu *v, unsigned long gla, uint32_t *pfec,
+                        struct hvmemul_cache *cache)
 {
     struct p2m_domain *hostp2m = p2m_get_hostp2m(v->domain);
     const struct paging_mode *hostmode = paging_get_hostmode(v);
 
     if ( is_hvm_vcpu(v) && paging_mode_hap(v->domain) && nestedhvm_is_n2(v) )
     {
-        unsigned long l2_gfn, l1_gfn;
+        gfn_t l2_gfn;
+        unsigned long l1_gfn;
         struct p2m_domain *p2m;
         const struct paging_mode *mode;
         uint8_t l1_p2ma;
@@ -1999,31 +1999,31 @@ unsigned long paging_gva_to_gfn(struct v
         /* translate l2 guest va into l2 guest gfn */
         p2m = p2m_get_nestedp2m(v);
         mode = paging_get_nestedmode(v);
-        l2_gfn = mode->gva_to_gfn(v, p2m, va, pfec);
+        l2_gfn = mode->gla_to_gfn(v, p2m, gla, pfec, cache);
 
-        if ( l2_gfn == gfn_x(INVALID_GFN) )
-            return gfn_x(INVALID_GFN);
+        if ( gfn_eq(l2_gfn, INVALID_GFN) )
+            return INVALID_GFN;
 
         /* translate l2 guest gfn into l1 guest gfn */
-        rv = nestedhap_walk_L1_p2m(v, l2_gfn, &l1_gfn, &l1_page_order, &l1_p2ma,
-                                   1,
+        rv = nestedhap_walk_L1_p2m(v, gfn_x(l2_gfn), &l1_gfn, &l1_page_order,
+                                   &l1_p2ma, 1,
                                    !!(*pfec & PFEC_write_access),
                                    !!(*pfec & PFEC_insn_fetch));
 
         if ( rv != NESTEDHVM_PAGEFAULT_DONE )
-            return gfn_x(INVALID_GFN);
+            return INVALID_GFN;
 
         /*
          * Sanity check that l1_gfn can be used properly as a 4K mapping, even
          * if it mapped by a nested superpage.
          */
-        ASSERT((l2_gfn & ((1ul << l1_page_order) - 1)) ==
+        ASSERT((gfn_x(l2_gfn) & ((1ul << l1_page_order) - 1)) ==
                (l1_gfn & ((1ul << l1_page_order) - 1)));
 
-        return l1_gfn;
+        return _gfn(l1_gfn);
     }
 
-    return hostmode->gva_to_gfn(v, hostp2m, va, pfec);
+    return hostmode->gla_to_gfn(v, hostp2m, gla, pfec, cache);
 }
 
 /*
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -313,15 +313,15 @@ const struct x86_emulate_ops hvm_shadow_
 static mfn_t emulate_gva_to_mfn(struct vcpu *v, unsigned long vaddr,
                                 struct sh_emulate_ctxt *sh_ctxt)
 {
-    unsigned long gfn;
+    gfn_t gfn;
     struct page_info *page;
     mfn_t mfn;
     p2m_type_t p2mt;
     uint32_t pfec = PFEC_page_present | PFEC_write_access;
 
     /* Translate the VA to a GFN. */
-    gfn = paging_get_hostmode(v)->gva_to_gfn(v, NULL, vaddr, &pfec);
-    if ( gfn == gfn_x(INVALID_GFN) )
+    gfn = paging_get_hostmode(v)->gla_to_gfn(v, NULL, vaddr, &pfec, NULL);
+    if ( gfn_eq(gfn, INVALID_GFN) )
     {
         x86_emul_pagefault(pfec, vaddr, &sh_ctxt->ctxt);
 
@@ -331,7 +331,7 @@ static mfn_t emulate_gva_to_mfn(struct v
     /* Translate the GFN to an MFN. */
     ASSERT(!paging_locked_by_me(v->domain));
 
-    page = get_page_from_gfn(v->domain, gfn, &p2mt, P2M_ALLOC);
+    page = get_page_from_gfn(v->domain, gfn_x(gfn), &p2mt, P2M_ALLOC);
 
     /* Sanity checking. */
     if ( page == NULL )
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -173,17 +173,20 @@ delete_shadow_status(struct domain *d, m
 
 static inline bool
 sh_walk_guest_tables(struct vcpu *v, unsigned long va, walk_t *gw,
-                     uint32_t pfec)
+                     uint32_t pfec, struct hvmemul_cache *cache)
 {
     return guest_walk_tables(v, p2m_get_hostp2m(v->domain), va, gw, pfec,
+                             _gfn(paging_mode_external(v->domain)
+                                  ? cr3_pa(v->arch.hvm.guest_cr[3]) >> PAGE_SHIFT
+                                  : pagetable_get_pfn(v->arch.guest_table)),
 #if GUEST_PAGING_LEVELS == 3 /* PAE */
                              INVALID_MFN,
-                             v->arch.paging.shadow.gl3e
+                             v->arch.paging.shadow.gl3e,
 #else /* 32 or 64 */
                              pagetable_get_mfn(v->arch.guest_table),
-                             v->arch.paging.shadow.guest_vtable
+                             v->arch.paging.shadow.guest_vtable,
 #endif
-                             );
+                             cache);
 }
 
 /* This validation is called with lock held, and after write permission
@@ -3032,7 +3035,7 @@ static int sh_page_fault(struct vcpu *v,
      * shadow page table. */
     version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
     smp_rmb();
-    walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
+    walk_ok = sh_walk_guest_tables(v, va, &gw, error_code, NULL);
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
     regs->error_code &= ~PFEC_page_present;
@@ -3680,9 +3683,9 @@ static bool sh_invlpg(struct vcpu *v, un
 }
 
 
-static unsigned long
-sh_gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
-    unsigned long va, uint32_t *pfec)
+static gfn_t
+sh_gla_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
+    unsigned long gla, uint32_t *pfec, struct hvmemul_cache *cache)
 /* Called to translate a guest virtual address to what the *guest*
  * pagetables would map it to. */
 {
@@ -3692,24 +3695,25 @@ sh_gva_to_gfn(struct vcpu *v, struct p2m
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB)
     /* Check the vTLB cache first */
-    unsigned long vtlb_gfn = vtlb_lookup(v, va, *pfec);
+    unsigned long vtlb_gfn = vtlb_lookup(v, gla, *pfec);
+
     if ( vtlb_gfn != gfn_x(INVALID_GFN) )
-        return vtlb_gfn;
+        return _gfn(vtlb_gfn);
 #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
 
-    if ( !(walk_ok = sh_walk_guest_tables(v, va, &gw, *pfec)) )
+    if ( !(walk_ok = sh_walk_guest_tables(v, gla, &gw, *pfec, cache)) )
     {
         *pfec = gw.pfec;
-        return gfn_x(INVALID_GFN);
+        return INVALID_GFN;
     }
     gfn = guest_walk_to_gfn(&gw);
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB)
     /* Remember this successful VA->GFN translation for later. */
-    vtlb_insert(v, va >> PAGE_SHIFT, gfn_x(gfn), *pfec);
+    vtlb_insert(v, gla >> PAGE_SHIFT, gfn_x(gfn), *pfec);
 #endif /* (SHADOW_OPTIMIZATIONS & SHOPT_VIRTUAL_TLB) */
 
-    return gfn_x(gfn);
+    return gfn;
 }
 
 
@@ -4954,7 +4958,7 @@ int sh_audit_l4_table(struct vcpu *v, mf
 const struct paging_mode sh_paging_mode = {
     .page_fault                    = sh_page_fault,
     .invlpg                        = sh_invlpg,
-    .gva_to_gfn                    = sh_gva_to_gfn,
+    .gla_to_gfn                    = sh_gla_to_gfn,
     .update_cr3                    = sh_update_cr3,
     .update_paging_modes           = shadow_update_paging_modes,
     .write_p2m_entry               = shadow_write_p2m_entry,
--- a/xen/arch/x86/mm/shadow/none.c
+++ b/xen/arch/x86/mm/shadow/none.c
@@ -43,11 +43,12 @@ static bool _invlpg(struct vcpu *v, unsi
     return true;
 }
 
-static unsigned long _gva_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
-                                 unsigned long va, uint32_t *pfec)
+static gfn_t _gla_to_gfn(struct vcpu *v, struct p2m_domain *p2m,
+                         unsigned long gla, uint32_t *pfec,
+                         struct hvmemul_cache *cache)
 {
     ASSERT_UNREACHABLE();
-    return gfn_x(INVALID_GFN);
+    return INVALID_GFN;
 }
 
 static void _update_cr3(struct vcpu *v, int do_locking, bool noflush)
@@ -70,7 +71,7 @@ static void _write_p2m_entry(struct doma
 static const struct paging_mode sh_paging_none = {
     .page_fault                    = _page_fault,
     .invlpg                        = _invlpg,
-    .gva_to_gfn                    = _gva_to_gfn,
+    .gla_to_gfn                    = _gla_to_gfn,
     .update_cr3                    = _update_cr3,
     .update_paging_modes           = _update_paging_modes,
     .write_p2m_entry               = _write_p2m_entry,
--- a/xen/include/asm-x86/guest_pt.h
+++ b/xen/include/asm-x86/guest_pt.h
@@ -425,7 +425,8 @@ static inline unsigned int guest_walk_to
 
 bool
 guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m, unsigned long va,
-                  walk_t *gw, uint32_t pfec, mfn_t top_mfn, void *top_map);
+                  walk_t *gw, uint32_t pfec, gfn_t top_gfn, mfn_t top_mfn,
+                  void *top_map, struct hvmemul_cache *cache);
 
 /* Pretty-print the contents of a guest-walk */
 static inline void print_gw(const walk_t *gw)
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -53,6 +53,8 @@ struct hvm_mmio_cache {
     uint8_t buffer[32];
 };
 
+struct hvmemul_cache;
+
 struct hvm_vcpu_io {
     /* I/O request in flight to device model. */
     enum hvm_io_completion io_completion;
--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -112,10 +112,11 @@ struct paging_mode {
                                             struct cpu_user_regs *regs);
     bool          (*invlpg                )(struct vcpu *v,
                                             unsigned long linear);
-    unsigned long (*gva_to_gfn            )(struct vcpu *v,
+    gfn_t         (*gla_to_gfn            )(struct vcpu *v,
                                             struct p2m_domain *p2m,
-                                            unsigned long va,
-                                            uint32_t *pfec);
+                                            unsigned long gla,
+                                            uint32_t *pfec,
+                                            struct hvmemul_cache *cache);
     unsigned long (*p2m_ga_to_gfn         )(struct vcpu *v,
                                             struct p2m_domain *p2m,
                                             unsigned long cr3,
@@ -251,9 +252,10 @@ void paging_invlpg(struct vcpu *v, unsig
  * SDM Intel 64 Volume 3, Chapter Paging, PAGE-FAULT EXCEPTIONS:
  * The PFEC_insn_fetch flag is set only when NX or SMEP are enabled.
  */
-unsigned long paging_gva_to_gfn(struct vcpu *v,
-                                unsigned long va,
-                                uint32_t *pfec);
+gfn_t paging_gla_to_gfn(struct vcpu *v,
+                        unsigned long va,
+                        uint32_t *pfec,
+                        struct hvmemul_cache *cache);
 
 /* Translate a guest address using a particular CR3 value.  This is used
  * to by nested HAP code, to walk the guest-supplied NPT tables as if




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 2/4] x86/mm: use optional cache in guest_walk_tables()
  2018-09-25 14:14 ` [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
  2018-09-25 14:23   ` [PATCH v3 1/4] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
@ 2018-09-25 14:24   ` Jan Beulich
  2018-09-25 14:25   ` [PATCH v3 3/4] x86/HVM: implement memory read caching Jan Beulich
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2018-09-25 14:24 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Paul Durrant

The caching isn't actually implemented here, this is just setting the
stage.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
---
v2: Don't wrongly use top_gfn for non-root gpa calculation. Re-write
    cache entries after setting A/D bits (an alternative would be to
    suppress their setting upon cache hits).

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2664,6 +2664,18 @@ void hvm_dump_emulation_state(const char
            hvmemul_ctxt->insn_buf);
 }
 
+bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t gpa,
+                        unsigned int level, void *buffer, unsigned int size)
+{
+    return false;
+}
+
+void hvmemul_write_cache(struct hvmemul_cache *cache, paddr_t gpa,
+                         unsigned int level, const void *buffer,
+                         unsigned int size)
+{
+}
+
 /*
  * Local variables:
  * mode: C
--- a/xen/arch/x86/mm/guest_walk.c
+++ b/xen/arch/x86/mm/guest_walk.c
@@ -92,8 +92,13 @@ guest_walk_tables(struct vcpu *v, struct
 #if GUEST_PAGING_LEVELS >= 4 /* 64-bit only... */
     guest_l3e_t *l3p = NULL;
     guest_l4e_t *l4p;
+    paddr_t l4gpa;
+#endif
+#if GUEST_PAGING_LEVELS >= 3 /* PAE or 64... */
+    paddr_t l3gpa;
 #endif
     uint32_t gflags, rc;
+    paddr_t l1gpa = 0, l2gpa = 0;
     unsigned int leaf_level;
     p2m_query_t qt = P2M_ALLOC | P2M_UNSHARE;
 
@@ -134,7 +139,15 @@ guest_walk_tables(struct vcpu *v, struct
     /* Get the l4e from the top level table and check its flags*/
     gw->l4mfn = top_mfn;
     l4p = (guest_l4e_t *) top_map;
-    gw->l4e = l4p[guest_l4_table_offset(gla)];
+    l4gpa = gfn_to_gaddr(top_gfn) +
+            guest_l4_table_offset(gla) * sizeof(gw->l4e);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, l4gpa, 4, &gw->l4e, sizeof(gw->l4e)) )
+    {
+        gw->l4e = l4p[guest_l4_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, l4gpa, 4, &gw->l4e, sizeof(gw->l4e));
+    }
     gflags = guest_l4e_get_flags(gw->l4e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -164,7 +177,15 @@ guest_walk_tables(struct vcpu *v, struct
     }
 
     /* Get the l3e and check its flags*/
-    gw->l3e = l3p[guest_l3_table_offset(gla)];
+    l3gpa = gfn_to_gaddr(guest_l4e_get_gfn(gw->l4e)) +
+            guest_l3_table_offset(gla) * sizeof(gw->l3e);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
+    {
+        gw->l3e = l3p[guest_l3_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e));
+    }
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -216,7 +237,16 @@ guest_walk_tables(struct vcpu *v, struct
 #else /* PAE only... */
 
     /* Get the l3e and check its flag */
-    gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
+    l3gpa = gfn_to_gaddr(top_gfn) + ((unsigned long)top_map & ~PAGE_MASK) +
+            guest_l3_table_offset(gla) * sizeof(gw->l3e);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e)) )
+    {
+        gw->l3e = ((guest_l3e_t *)top_map)[guest_l3_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e));
+    }
+
     gflags = guest_l3e_get_flags(gw->l3e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -242,18 +272,26 @@ guest_walk_tables(struct vcpu *v, struct
         goto out;
     }
 
-    /* Get the l2e */
-    gw->l2e = l2p[guest_l2_table_offset(gla)];
+    l2gpa = gfn_to_gaddr(guest_l3e_get_gfn(gw->l3e));
 
 #else /* 32-bit only... */
 
-    /* Get l2e from the top level table */
     gw->l2mfn = top_mfn;
     l2p = (guest_l2e_t *) top_map;
-    gw->l2e = l2p[guest_l2_table_offset(gla)];
+    l2gpa = gfn_to_gaddr(top_gfn);
 
 #endif /* All levels... */
 
+    /* Get the l2e */
+    l2gpa += guest_l2_table_offset(gla) * sizeof(gw->l2e);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, l2gpa, 2, &gw->l2e, sizeof(gw->l2e)) )
+    {
+        gw->l2e = l2p[guest_l2_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, l2gpa, 2, &gw->l2e, sizeof(gw->l2e));
+    }
+
     /* Check the l2e flags. */
     gflags = guest_l2e_get_flags(gw->l2e);
     if ( !(gflags & _PAGE_PRESENT) )
@@ -335,7 +373,17 @@ guest_walk_tables(struct vcpu *v, struct
         gw->pfec |= rc & PFEC_synth_mask;
         goto out;
     }
-    gw->l1e = l1p[guest_l1_table_offset(gla)];
+
+    l1gpa = gfn_to_gaddr(guest_l2e_get_gfn(gw->l2e)) +
+            guest_l1_table_offset(gla) * sizeof(gw->l1e);
+    if ( !cache ||
+         !hvmemul_read_cache(cache, l1gpa, 1, &gw->l1e, sizeof(gw->l1e)) )
+    {
+        gw->l1e = l1p[guest_l1_table_offset(gla)];
+        if ( cache )
+            hvmemul_write_cache(cache, l1gpa, 1, &gw->l1e, sizeof(gw->l1e));
+    }
+
     gflags = guest_l1e_get_flags(gw->l1e);
     if ( !(gflags & _PAGE_PRESENT) )
         goto out;
@@ -446,22 +494,38 @@ guest_walk_tables(struct vcpu *v, struct
     case 1:
         if ( set_ad_bits(&l1p[guest_l1_table_offset(gla)].l1, &gw->l1e.l1,
                          (walk & PFEC_write_access)) )
+        {
             paging_mark_dirty(d, gw->l1mfn);
+            if ( cache )
+                hvmemul_write_cache(cache, l1gpa, 1, &gw->l1e, sizeof(gw->l1e));
+        }
         /* Fallthrough */
     case 2:
         if ( set_ad_bits(&l2p[guest_l2_table_offset(gla)].l2, &gw->l2e.l2,
                          (walk & PFEC_write_access) && leaf_level == 2) )
+        {
             paging_mark_dirty(d, gw->l2mfn);
+            if ( cache )
+                hvmemul_write_cache(cache, l2gpa, 2, &gw->l2e, sizeof(gw->l2e));
+        }
         /* Fallthrough */
 #if GUEST_PAGING_LEVELS == 4 /* 64-bit only... */
     case 3:
         if ( set_ad_bits(&l3p[guest_l3_table_offset(gla)].l3, &gw->l3e.l3,
                          (walk & PFEC_write_access) && leaf_level == 3) )
+        {
             paging_mark_dirty(d, gw->l3mfn);
+            if ( cache )
+                hvmemul_write_cache(cache, l3gpa, 3, &gw->l3e, sizeof(gw->l3e));
+        }
 
         if ( set_ad_bits(&l4p[guest_l4_table_offset(gla)].l4, &gw->l4e.l4,
                          false) )
+        {
             paging_mark_dirty(d, gw->l4mfn);
+            if ( cache )
+                hvmemul_write_cache(cache, l4gpa, 4, &gw->l4e, sizeof(gw->l4e));
+        }
 #endif
     }
 
--- a/xen/include/asm-x86/hvm/emulate.h
+++ b/xen/include/asm-x86/hvm/emulate.h
@@ -98,6 +98,13 @@ int hvmemul_do_pio_buffer(uint16_t port,
                           uint8_t dir,
                           void *buffer);
 
+struct hvmemul_cache;
+bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
+                        unsigned int level, void *buffer, unsigned int size);
+void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
+                         unsigned int level, const void *buffer,
+                         unsigned int size);
+
 void hvm_dump_emulation_state(const char *loglvl, const char *prefix,
                               struct hvm_emulate_ctxt *hvmemul_ctxt, int rc);
 




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 3/4] x86/HVM: implement memory read caching
  2018-09-25 14:14 ` [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
  2018-09-25 14:23   ` [PATCH v3 1/4] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
  2018-09-25 14:24   ` [PATCH v3 2/4] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
@ 2018-09-25 14:25   ` Jan Beulich
  2018-09-26 11:05     ` Wei Liu
  2018-10-02 10:39     ` Ping: " Jan Beulich
  2018-09-25 14:26   ` [PATCH v3 4/4] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
  2018-10-02 10:36   ` Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
  4 siblings, 2 replies; 48+ messages in thread
From: Jan Beulich @ 2018-09-25 14:25 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Wei Liu, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, Paul Durrant, Suravee Suthikulpanit, Boris Ostrovsky,
	Brian Woods

Emulation requiring device model assistance uses a form of instruction
re-execution, assuming that the second (and any further) pass takes
exactly the same path. This is a valid assumption as far as use of CPU
registers goes (as those can't change without any other instruction
executing in between), but is wrong for memory accesses. In particular
it has been observed that Windows might page out buffers underneath an
instruction currently under emulation (hitting between two passes). If
the first pass translated a linear address successfully, any subsequent
pass needs to do so too, yielding the exact same translation.

Introduce a cache (used by just guest page table accesses for now) to
make sure above described assumption holds. This is a very simplistic
implementation for now: Only exact matches are satisfied (no overlaps or
partial reads or anything).

As to the actual data page in this scenario, there are a couple of
aspects to take into consideration:
- We must be talking about an insn accessing two locations (two memory
  ones, one of which is MMIO, or a memory and an I/O one).
- If the non I/O / MMIO side is being read, the re-read (if it occurs at
  all) has its result discarded, by taking the shortcut through
  the first switch()'s STATE_IORESP_READY case in hvmemul_do_io(). Note
  how, among all the re-issue sanity checks there, we avoid comparing
  the actual data.
- If the non I/O / MMIO side is being written, it is the OS's
  responsibility to avoid actually moving page contents to disk while
  there might still be a write access in flight - this is no different
  in behavior from bare hardware.
- Read-modify-write accesses are, as always, complicated, and while we
  deal with them better nowadays than we did in the past, we're still
  not quite there to guarantee hardware-like behavior in all cases
  anyway. Nothing is getting worse by the changes made here, afaict.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
---
v3: Add text about the actual data page to the description.
v2: Re-base.

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -27,6 +27,18 @@
 #include <asm/hvm/svm/svm.h>
 #include <asm/vm_event.h>
 
+struct hvmemul_cache
+{
+    unsigned int max_ents;
+    unsigned int num_ents;
+    struct {
+        paddr_t gpa:PADDR_BITS;
+        unsigned int size:(BITS_PER_LONG - PADDR_BITS) / 2;
+        unsigned int level:(BITS_PER_LONG - PADDR_BITS) / 2;
+        unsigned long data;
+    } ents[];
+};
+
 static void hvmtrace_io_assist(const ioreq_t *p)
 {
     unsigned int size, event;
@@ -541,7 +553,7 @@ static int hvmemul_do_mmio_addr(paddr_t
  */
 static void *hvmemul_map_linear_addr(
     unsigned long linear, unsigned int bytes, uint32_t pfec,
-    struct hvm_emulate_ctxt *hvmemul_ctxt)
+    struct hvm_emulate_ctxt *hvmemul_ctxt, struct hvmemul_cache *cache)
 {
     struct vcpu *curr = current;
     void *err, *mapping;
@@ -586,7 +598,7 @@ static void *hvmemul_map_linear_addr(
         ASSERT(mfn_x(*mfn) == 0);
 
         res = hvm_translate_get_page(curr, addr, true, pfec,
-                                     &pfinfo, &page, NULL, &p2mt);
+                                     &pfinfo, &page, NULL, &p2mt, cache);
 
         switch ( res )
         {
@@ -702,6 +714,8 @@ static int hvmemul_linear_to_phys(
     gfn_t gfn, ngfn;
     unsigned long done, todo, i, offset = addr & ~PAGE_MASK;
     int reverse;
+    struct hvmemul_cache *cache = pfec & PFEC_insn_fetch
+                                  ? NULL : curr->arch.hvm.data_cache;
 
     /*
      * Clip repetitions to a sensible maximum. This avoids extensive looping in
@@ -731,7 +745,7 @@ static int hvmemul_linear_to_phys(
             return rc;
         gfn = gaddr_to_gfn(gaddr);
     }
-    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, NULL),
+    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, cache),
                      INVALID_GFN) )
     {
         if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
@@ -747,7 +761,7 @@ static int hvmemul_linear_to_phys(
     {
         /* Get the next PFN in the range. */
         addr += reverse ? -PAGE_SIZE : PAGE_SIZE;
-        ngfn = paging_gla_to_gfn(curr, addr, &pfec, NULL);
+        ngfn = paging_gla_to_gfn(curr, addr, &pfec, cache);
 
         /* Is it contiguous with the preceding PFNs? If not then we're done. */
         if ( gfn_eq(ngfn, INVALID_GFN) ||
@@ -1073,7 +1087,10 @@ static int linear_read(unsigned long add
                        uint32_t pfec, struct hvm_emulate_ctxt *hvmemul_ctxt)
 {
     pagefault_info_t pfinfo;
-    int rc = hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo);
+    int rc = hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo,
+                                        (pfec & PFEC_insn_fetch
+                                         ? NULL
+                                         : current->arch.hvm.data_cache));
 
     switch ( rc )
     {
@@ -1270,7 +1287,8 @@ static int hvmemul_write(
 
     if ( !known_gla(addr, bytes, pfec) )
     {
-        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
+        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
+                                          current->arch.hvm.data_cache);
         if ( IS_ERR(mapping) )
              return ~PTR_ERR(mapping);
     }
@@ -1312,7 +1330,8 @@ static int hvmemul_rmw(
 
     if ( !known_gla(addr, bytes, pfec) )
     {
-        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
+        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
+                                          current->arch.hvm.data_cache);
         if ( IS_ERR(mapping) )
             return ~PTR_ERR(mapping);
     }
@@ -1466,7 +1485,8 @@ static int hvmemul_cmpxchg(
     else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].dpl == 3 )
         pfec |= PFEC_user_mode;
 
-    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
+    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
+                                      curr->arch.hvm.data_cache);
     if ( IS_ERR(mapping) )
         return ~PTR_ERR(mapping);
 
@@ -2373,6 +2393,7 @@ static int _hvm_emulate_one(struct hvm_e
     {
         vio->mmio_cache_count = 0;
         vio->mmio_insn_bytes = 0;
+        curr->arch.hvm.data_cache->num_ents = 0;
     }
     else
     {
@@ -2591,7 +2612,7 @@ void hvm_emulate_init_per_insn(
                                         &addr) &&
              hvm_copy_from_guest_linear(hvmemul_ctxt->insn_buf, addr,
                                         sizeof(hvmemul_ctxt->insn_buf),
-                                        pfec | PFEC_insn_fetch,
+                                        pfec | PFEC_insn_fetch, NULL,
                                         NULL) == HVMTRANS_okay) ?
             sizeof(hvmemul_ctxt->insn_buf) : 0;
     }
@@ -2664,9 +2685,35 @@ void hvm_dump_emulation_state(const char
            hvmemul_ctxt->insn_buf);
 }
 
+struct hvmemul_cache *hvmemul_cache_init(unsigned int nents)
+{
+    struct hvmemul_cache *cache = xmalloc_bytes(offsetof(struct hvmemul_cache,
+                                                         ents[nents]));
+
+    if ( cache )
+    {
+        cache->num_ents = 0;
+        cache->max_ents = nents;
+    }
+
+    return cache;
+}
+
 bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t gpa,
                         unsigned int level, void *buffer, unsigned int size)
 {
+    unsigned int i;
+
+    ASSERT(size <= sizeof(cache->ents->data));
+
+    for ( i = 0; i < cache->num_ents; ++i )
+        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
+             cache->ents[i].size == size )
+        {
+            memcpy(buffer, &cache->ents[i].data, size);
+            return true;
+        }
+
     return false;
 }
 
@@ -2674,6 +2721,35 @@ void hvmemul_write_cache(struct hvmemul_
                          unsigned int level, const void *buffer,
                          unsigned int size)
 {
+    unsigned int i;
+
+    if ( size > sizeof(cache->ents->data) )
+    {
+        ASSERT_UNREACHABLE();
+        return;
+    }
+
+    for ( i = 0; i < cache->num_ents; ++i )
+        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
+             cache->ents[i].size == size )
+        {
+            memcpy(&cache->ents[i].data, buffer, size);
+            return;
+        }
+
+    if ( unlikely(i >= cache->max_ents) )
+    {
+        ASSERT_UNREACHABLE();
+        return;
+    }
+
+    cache->ents[i].level = level;
+    cache->ents[i].gpa   = gpa;
+    cache->ents[i].size  = size;
+
+    memcpy(&cache->ents[i].data, buffer, size);
+
+    cache->num_ents = i + 1;
 }
 
 /*
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1498,6 +1498,17 @@ int hvm_vcpu_initialise(struct vcpu *v)
 
     v->arch.hvm.inject_event.vector = HVM_EVENT_VECTOR_UNSET;
 
+    /*
+     * Leaving aside the insn fetch, for which we don't use this cache, no
+     * insn can access more than 8 independent linear addresses (AVX2
+     * gathers being the worst). Each such linear range can span a page
+     * boundary, i.e. require two page walks.
+     */
+    v->arch.hvm.data_cache = hvmemul_cache_init(CONFIG_PAGING_LEVELS * 8 * 2);
+    rc = -ENOMEM;
+    if ( !v->arch.hvm.data_cache )
+        goto fail4;
+
     rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
     if ( rc != 0 )
         goto fail4;
@@ -1527,6 +1538,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
  fail5:
     free_compat_arg_xlat(v);
  fail4:
+    hvmemul_cache_destroy(v->arch.hvm.data_cache);
     hvm_funcs.vcpu_destroy(v);
  fail3:
     vlapic_destroy(v);
@@ -1549,6 +1561,8 @@ void hvm_vcpu_destroy(struct vcpu *v)
 
     free_compat_arg_xlat(v);
 
+    hvmemul_cache_destroy(v->arch.hvm.data_cache);
+
     tasklet_kill(&v->arch.hvm.assert_evtchn_irq_tasklet);
     hvm_funcs.vcpu_destroy(v);
 
@@ -2923,7 +2937,7 @@ void hvm_task_switch(
     }
 
     rc = hvm_copy_from_guest_linear(
-        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
+        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
     if ( rc == HVMTRANS_bad_linear_to_gfn )
         hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
     if ( rc != HVMTRANS_okay )
@@ -2970,7 +2984,7 @@ void hvm_task_switch(
         goto out;
 
     rc = hvm_copy_from_guest_linear(
-        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
+        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
     if ( rc == HVMTRANS_bad_linear_to_gfn )
         hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
     /*
@@ -3081,7 +3095,7 @@ void hvm_task_switch(
 enum hvm_translation_result hvm_translate_get_page(
     struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
     pagefault_info_t *pfinfo, struct page_info **page_p,
-    gfn_t *gfn_p, p2m_type_t *p2mt_p)
+    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache)
 {
     struct page_info *page;
     p2m_type_t p2mt;
@@ -3089,7 +3103,7 @@ enum hvm_translation_result hvm_translat
 
     if ( linear )
     {
-        gfn = paging_gla_to_gfn(v, addr, &pfec, NULL);
+        gfn = paging_gla_to_gfn(v, addr, &pfec, cache);
 
         if ( gfn_eq(gfn, INVALID_GFN) )
         {
@@ -3161,7 +3175,7 @@ enum hvm_translation_result hvm_translat
 #define HVMCOPY_linear     (1u<<2)
 static enum hvm_translation_result __hvm_copy(
     void *buf, paddr_t addr, int size, struct vcpu *v, unsigned int flags,
-    uint32_t pfec, pagefault_info_t *pfinfo)
+    uint32_t pfec, pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
 {
     gfn_t gfn;
     struct page_info *page;
@@ -3194,8 +3208,8 @@ static enum hvm_translation_result __hvm
 
         count = min_t(int, PAGE_SIZE - gpa, todo);
 
-        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear,
-                                     pfec, pfinfo, &page, &gfn, &p2mt);
+        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear, pfec,
+                                     pfinfo, &page, &gfn, &p2mt, cache);
         if ( res != HVMTRANS_okay )
             return res;
 
@@ -3242,14 +3256,14 @@ enum hvm_translation_result hvm_copy_to_
     paddr_t paddr, void *buf, int size, struct vcpu *v)
 {
     return __hvm_copy(buf, paddr, size, v,
-                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL);
+                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL, NULL);
 }
 
 enum hvm_translation_result hvm_copy_from_guest_phys(
     void *buf, paddr_t paddr, int size)
 {
     return __hvm_copy(buf, paddr, size, current,
-                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL);
+                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL, NULL);
 }
 
 enum hvm_translation_result hvm_copy_to_guest_linear(
@@ -3258,16 +3272,17 @@ enum hvm_translation_result hvm_copy_to_
 {
     return __hvm_copy(buf, addr, size, current,
                       HVMCOPY_to_guest | HVMCOPY_linear,
-                      PFEC_page_present | PFEC_write_access | pfec, pfinfo);
+                      PFEC_page_present | PFEC_write_access | pfec,
+                      pfinfo, NULL);
 }
 
 enum hvm_translation_result hvm_copy_from_guest_linear(
     void *buf, unsigned long addr, int size, uint32_t pfec,
-    pagefault_info_t *pfinfo)
+    pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
 {
     return __hvm_copy(buf, addr, size, current,
                       HVMCOPY_from_guest | HVMCOPY_linear,
-                      PFEC_page_present | pfec, pfinfo);
+                      PFEC_page_present | pfec, pfinfo, cache);
 }
 
 unsigned long copy_to_user_hvm(void *to, const void *from, unsigned int len)
@@ -3308,7 +3323,8 @@ unsigned long copy_from_user_hvm(void *t
         return 0;
     }
 
-    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len, 0, NULL);
+    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len,
+                                    0, NULL, NULL);
     return rc ? len : 0; /* fake a copy_from_user() return code */
 }
 
@@ -3724,7 +3740,7 @@ void hvm_ud_intercept(struct cpu_user_re
                                         sizeof(sig), hvm_access_insn_fetch,
                                         cs, &addr) &&
              (hvm_copy_from_guest_linear(sig, addr, sizeof(sig),
-                                         walk, NULL) == HVMTRANS_okay) &&
+                                         walk, NULL, NULL) == HVMTRANS_okay) &&
              (memcmp(sig, "\xf\xbxen", sizeof(sig)) == 0) )
         {
             regs->rip += sizeof(sig);
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1358,7 +1358,7 @@ static void svm_emul_swint_injection(str
         goto raise_exception;
 
     rc = hvm_copy_from_guest_linear(&idte, idte_linear_addr, idte_size,
-                                    PFEC_implicit, &pfinfo);
+                                    PFEC_implicit, &pfinfo, NULL);
     if ( rc )
     {
         if ( rc == HVMTRANS_bad_linear_to_gfn )
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -475,7 +475,7 @@ static int decode_vmx_inst(struct cpu_us
         {
             pagefault_info_t pfinfo;
             int rc = hvm_copy_from_guest_linear(poperandS, base, size,
-                                                0, &pfinfo);
+                                                0, &pfinfo, NULL);
 
             if ( rc == HVMTRANS_bad_linear_to_gfn )
                 hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -166,7 +166,7 @@ const struct x86_emulate_ops *shadow_ini
             hvm_access_insn_fetch, sh_ctxt, &addr) &&
          !hvm_copy_from_guest_linear(
              sh_ctxt->insn_buf, addr, sizeof(sh_ctxt->insn_buf),
-             PFEC_insn_fetch, NULL))
+             PFEC_insn_fetch, NULL, NULL))
         ? sizeof(sh_ctxt->insn_buf) : 0;
 
     return &hvm_shadow_emulator_ops;
@@ -201,7 +201,7 @@ void shadow_continue_emulation(struct sh
                 hvm_access_insn_fetch, sh_ctxt, &addr) &&
              !hvm_copy_from_guest_linear(
                  sh_ctxt->insn_buf, addr, sizeof(sh_ctxt->insn_buf),
-                 PFEC_insn_fetch, NULL))
+                 PFEC_insn_fetch, NULL, NULL))
             ? sizeof(sh_ctxt->insn_buf) : 0;
         sh_ctxt->insn_buf_eip = regs->rip;
     }
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -125,7 +125,7 @@ hvm_read(enum x86_segment seg,
     rc = hvm_copy_from_guest_linear(p_data, addr, bytes,
                                     (access_type == hvm_access_insn_fetch
                                      ? PFEC_insn_fetch : 0),
-                                    &pfinfo);
+                                    &pfinfo, NULL);
 
     switch ( rc )
     {
--- a/xen/include/asm-x86/hvm/emulate.h
+++ b/xen/include/asm-x86/hvm/emulate.h
@@ -99,6 +99,11 @@ int hvmemul_do_pio_buffer(uint16_t port,
                           void *buffer);
 
 struct hvmemul_cache;
+struct hvmemul_cache *hvmemul_cache_init(unsigned int nents);
+static inline void hvmemul_cache_destroy(struct hvmemul_cache *cache)
+{
+    xfree(cache);
+}
 bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
                         unsigned int level, void *buffer, unsigned int size);
 void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
--- a/xen/include/asm-x86/hvm/support.h
+++ b/xen/include/asm-x86/hvm/support.h
@@ -99,7 +99,7 @@ enum hvm_translation_result hvm_copy_to_
     pagefault_info_t *pfinfo);
 enum hvm_translation_result hvm_copy_from_guest_linear(
     void *buf, unsigned long addr, int size, uint32_t pfec,
-    pagefault_info_t *pfinfo);
+    pagefault_info_t *pfinfo, struct hvmemul_cache *cache);
 
 /*
  * Get a reference on the page under an HVM physical or linear address.  If
@@ -110,7 +110,7 @@ enum hvm_translation_result hvm_copy_fro
 enum hvm_translation_result hvm_translate_get_page(
     struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
     pagefault_info_t *pfinfo, struct page_info **page_p,
-    gfn_t *gfn_p, p2m_type_t *p2mt_p);
+    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache);
 
 #define HVM_HCALL_completed  0 /* hypercall completed - no further action */
 #define HVM_HCALL_preempted  1 /* hypercall preempted - re-execute VMCALL */
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -53,8 +53,6 @@ struct hvm_mmio_cache {
     uint8_t buffer[32];
 };
 
-struct hvmemul_cache;
-
 struct hvm_vcpu_io {
     /* I/O request in flight to device model. */
     enum hvm_io_completion io_completion;
@@ -200,6 +198,7 @@ struct hvm_vcpu {
     u8                  cache_mode;
 
     struct hvm_vcpu_io  hvm_io;
+    struct hvmemul_cache *data_cache;
 
     /* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
     struct x86_event     inject_event;
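
For a sense of the per-vCPU cost of the hvmemul_cache_init() call added to
hvm_vcpu_initialise() above: with CONFIG_PAGING_LEVELS == 4 that is
4 * 8 * 2 = 64 entries. Assuming PADDR_BITS == 52 and an LP64 build (so the
gpa/size/level bitfields of an entry pack into a single 64-bit word next to
the data word), this comes to roughly 1KiB per vCPU. A standalone sketch of
the arithmetic, not part of the patch:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CONFIG_PAGING_LEVELS 4   /* assumption: 4-level host paging */

/* Mirrors the entry layout introduced by this patch (gpa:52, with size and
 * level sharing the remaining 12 bits of a 64-bit word, plus one unsigned
 * long of data); the figures assume PADDR_BITS == 52 and LP64. */
struct ent {
    uint64_t gpa:52, size:6, level:6;
    unsigned long data;
};

struct cache {
    unsigned int max_ents, num_ents;
    struct ent ents[];
};

int main(void)
{
    unsigned int nents = CONFIG_PAGING_LEVELS * 8 * 2; /* 8 ranges x 2 walks */
    size_t bytes = offsetof(struct cache, ents[CONFIG_PAGING_LEVELS * 8 * 2]);

    /* Prints "64 entries, 1032 bytes per vCPU" on LP64 targets. */
    printf("%u entries, %zu bytes per vCPU\n", nents, bytes);
    return 0;
}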




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 4/4] x86/HVM: prefill cache with PDPTEs when possible
  2018-09-25 14:14 ` [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
                     ` (2 preceding siblings ...)
  2018-09-25 14:25   ` [PATCH v3 3/4] x86/HVM: implement memory read caching Jan Beulich
@ 2018-09-25 14:26   ` Jan Beulich
  2018-09-25 14:38     ` Paul Durrant
  2018-10-02 10:36   ` Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
  4 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2018-09-25 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Paul Durrant

Since strictly speaking it is incorrect for guest_walk_tables() to read
L3 entries during PAE page walks (they get loaded from memory only upon
CR3 loads and certain TLB flushes), try to overcome this where possible
by pre-loading the values from hardware into the cache. Sadly the
information is available in the EPT case only. On the positive side for
NPT the spec spells out that L3 entries are actually read on walks, so
us reading them is consistent with hardware behavior in that case.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v2: Re-base.
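
As an aside on the addressing used by the prefill loop below: in PAE mode CR3
holds a 32-byte-aligned pointer to the four-entry page-directory-pointer
table, so PDPTE i lives at (CR3 & (PADDR_MASK & ~0x1f)) + i * 8. That
guest-physical address, together with level 3, is the key under which the
value read from the VMCS gets stored, so a later walk reading the same L3
slot gets an exact-match cache hit instead of a memory read. A minimal
sketch of that computation, with PADDR_MASK assumed to be the 52-bit
physical address mask:

#include <stdint.h>

typedef uint64_t paddr_t;
#define PADDR_MASK ((1ULL << 52) - 1)   /* assumes PADDR_BITS == 52 */

/* Illustration only, not part of the patch: the guest-physical address of
 * PDPTE i (i in [0, 3]) for a PAE guest, i.e. the cache key used by the
 * prefill loop below. */
static paddr_t pae_pdpte_gpa(uint64_t guest_cr3, unsigned int i)
{
    paddr_t pdpt_base = guest_cr3 & (PADDR_MASK & ~0x1fULL);

    return pdpt_base + i * sizeof(uint64_t);
}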

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -2385,6 +2385,23 @@ static int _hvm_emulate_one(struct hvm_e
 
     vio->mmio_retry = 0;
 
+    if ( !curr->arch.hvm.data_cache->num_ents &&
+         curr->arch.paging.mode->guest_levels == 3 )
+    {
+        unsigned int i;
+
+        for ( i = 0; i < 4; ++i )
+        {
+            uint64_t pdpte;
+
+            if ( hvm_read_pdpte(curr, i, &pdpte) )
+                hvmemul_write_cache(curr->arch.hvm.data_cache,
+                                    (curr->arch.hvm.guest_cr[3] &
+                                     (PADDR_MASK & ~0x1f)) + i * sizeof(pdpte),
+                                    3, &pdpte, sizeof(pdpte));
+        }
+    }
+
     rc = x86_emulate(&hvmemul_ctxt->ctxt, ops);
     if ( rc == X86EMUL_OKAY && vio->mmio_retry )
         rc = X86EMUL_RETRY;
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1368,6 +1368,25 @@ static void vmx_set_interrupt_shadow(str
     __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
 }
 
+static bool read_pdpte(struct vcpu *v, unsigned int idx, uint64_t *pdpte)
+{
+    if ( !paging_mode_hap(v->domain) || !hvm_pae_enabled(v) ||
+         (v->arch.hvm.guest_efer & EFER_LMA) )
+        return false;
+
+    if ( idx >= 4 )
+    {
+        ASSERT_UNREACHABLE();
+        return false;
+    }
+
+    vmx_vmcs_enter(v);
+    __vmread(GUEST_PDPTE(idx), pdpte);
+    vmx_vmcs_exit(v);
+
+    return true;
+}
+
 static void vmx_load_pdptrs(struct vcpu *v)
 {
     unsigned long cr3 = v->arch.hvm.guest_cr[3];
@@ -2466,6 +2485,8 @@ const struct hvm_function_table * __init
         if ( cpu_has_vmx_ept_1gb )
             vmx_function_table.hap_capabilities |= HVM_HAP_SUPERPAGE_1GB;
 
+        vmx_function_table.read_pdpte = read_pdpte;
+
         setup_ept_dump();
     }
 
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -146,6 +146,8 @@ struct hvm_function_table {
 
     void (*fpu_leave)(struct vcpu *v);
 
+    bool (*read_pdpte)(struct vcpu *v, unsigned int index, uint64_t *pdpte);
+
     int  (*get_guest_pat)(struct vcpu *v, u64 *);
     int  (*set_guest_pat)(struct vcpu *v, u64);
 
@@ -443,6 +445,12 @@ static inline unsigned long hvm_get_shad
     return hvm_funcs.get_shadow_gs_base(v);
 }
 
+static inline bool hvm_read_pdpte(struct vcpu *v, unsigned int index, uint64_t *pdpte)
+{
+    return hvm_funcs.read_pdpte &&
+           alternative_call(hvm_funcs.read_pdpte, v, index, pdpte);
+}
+
 static inline bool hvm_get_guest_bndcfgs(struct vcpu *v, u64 *val)
 {
     return hvm_funcs.get_guest_bndcfgs &&





_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 4/4] x86/HVM: prefill cache with PDPTEs when possible
  2018-09-25 14:26   ` [PATCH v3 4/4] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
@ 2018-09-25 14:38     ` Paul Durrant
  0 siblings, 0 replies; 48+ messages in thread
From: Paul Durrant @ 2018-09-25 14:38 UTC (permalink / raw)
  To: 'Jan Beulich', xen-devel; +Cc: Andrew Cooper, George Dunlap

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 25 September 2018 15:27
> To: xen-devel <xen-devel@lists.xenproject.org>
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Paul Durrant
> <Paul.Durrant@citrix.com>; George Dunlap <George.Dunlap@citrix.com>
> Subject: [PATCH v3 4/4] x86/HVM: prefill cache with PDPTEs when possible
> 
> Since strictly speaking it is incorrect for guest_walk_tables() to read
> L3 entries during PAE page walks (they get loaded from memory only upon
> CR3 loads and certain TLB flushes), try to overcome this where possible
> by pre-loading the values from hardware into the cache. Sadly the
> information is available in the EPT case only. On the positive side for
> NPT the spec spells out that L3 entries are actually read on walks, so
> us reading them is consistent with hardware behavior in that case.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> ---
> v2: Re-base.
> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -2385,6 +2385,23 @@ static int _hvm_emulate_one(struct hvm_e
> 
>      vio->mmio_retry = 0;
> 
> +    if ( !curr->arch.hvm.data_cache->num_ents &&
> +         curr->arch.paging.mode->guest_levels == 3 )
> +    {
> +        unsigned int i;
> +
> +        for ( i = 0; i < 4; ++i )
> +        {
> +            uint64_t pdpte;
> +
> +            if ( hvm_read_pdpte(curr, i, &pdpte) )
> +                hvmemul_write_cache(curr->arch.hvm.data_cache,
> +                                    (curr->arch.hvm.guest_cr[3] &
> +                                     (PADDR_MASK & ~0x1f)) + i *
> sizeof(pdpte),
> +                                    3, &pdpte, sizeof(pdpte));
> +        }
> +    }
> +
>      rc = x86_emulate(&hvmemul_ctxt->ctxt, ops);
>      if ( rc == X86EMUL_OKAY && vio->mmio_retry )
>          rc = X86EMUL_RETRY;
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1368,6 +1368,25 @@ static void vmx_set_interrupt_shadow(str
>      __vmwrite(GUEST_INTERRUPTIBILITY_INFO, intr_shadow);
>  }
> 
> +static bool read_pdpte(struct vcpu *v, unsigned int idx, uint64_t *pdpte)
> +{
> +    if ( !paging_mode_hap(v->domain) || !hvm_pae_enabled(v) ||
> +         (v->arch.hvm.guest_efer & EFER_LMA) )
> +        return false;
> +
> +    if ( idx >= 4 )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return false;
> +    }
> +
> +    vmx_vmcs_enter(v);
> +    __vmread(GUEST_PDPTE(idx), pdpte);
> +    vmx_vmcs_exit(v);
> +
> +    return true;
> +}
> +
>  static void vmx_load_pdptrs(struct vcpu *v)
>  {
>      unsigned long cr3 = v->arch.hvm.guest_cr[3];
> @@ -2466,6 +2485,8 @@ const struct hvm_function_table * __init
>          if ( cpu_has_vmx_ept_1gb )
>              vmx_function_table.hap_capabilities |= HVM_HAP_SUPERPAGE_1GB;
> 
> +        vmx_function_table.read_pdpte = read_pdpte;
> +
>          setup_ept_dump();
>      }
> 
> --- a/xen/include/asm-x86/hvm/hvm.h
> +++ b/xen/include/asm-x86/hvm/hvm.h
> @@ -146,6 +146,8 @@ struct hvm_function_table {
> 
>      void (*fpu_leave)(struct vcpu *v);
> 
> +    bool (*read_pdpte)(struct vcpu *v, unsigned int index, uint64_t
> *pdpte);
> +
>      int  (*get_guest_pat)(struct vcpu *v, u64 *);
>      int  (*set_guest_pat)(struct vcpu *v, u64);
> 
> @@ -443,6 +445,12 @@ static inline unsigned long hvm_get_shad
>      return hvm_funcs.get_shadow_gs_base(v);
>  }
> 
> +static inline bool hvm_read_pdpte(struct vcpu *v, unsigned int index,
> uint64_t *pdpte)
> +{
> +    return hvm_funcs.read_pdpte &&
> +           alternative_call(hvm_funcs.read_pdpte, v, index, pdpte);
> +}
> +
>  static inline bool hvm_get_guest_bndcfgs(struct vcpu *v, u64 *val)
>  {
>      return hvm_funcs.get_guest_bndcfgs &&
> 
> 
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/4] x86/HVM: implement memory read caching
  2018-09-25 14:25   ` [PATCH v3 3/4] x86/HVM: implement memory read caching Jan Beulich
@ 2018-09-26 11:05     ` Wei Liu
  2018-10-02 10:39     ` Ping: " Jan Beulich
  1 sibling, 0 replies; 48+ messages in thread
From: Wei Liu @ 2018-09-26 11:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, Paul Durrant, Suravee Suthikulpanit, xen-devel,
	Boris Ostrovsky, Brian Woods

On Tue, Sep 25, 2018 at 08:25:50AM -0600, Jan Beulich wrote:
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far as use of CPU
> registers goes (as those can't change without any other instruction
> executing in between), but is wrong for memory accesses. In particular
> it has been observed that Windows might page out buffers underneath an
> instruction currently under emulation (hitting between two passes). If
> the first pass translated a linear address successfully, any subsequent
> pass needs to do so too, yielding the exact same translation.
> 
> Introduce a cache (used by just guest page table accesses for now) to
> make sure above described assumption holds. This is a very simplistic
> implementation for now: Only exact matches are satisfied (no overlaps or
> partial reads or anything).
> 
> As to the actual data page in this scenario, there are a couple of
> aspects to take into consideration:
> - We must be talking about an insn accessing two locations (two memory
>   ones, one of which is MMIO, or a memory and an I/O one).
> - If the non I/O / MMIO side is being read, the re-read (if it occurs at
>   all) is having its result discarded, by taking the shortcut through
>   the first switch()'s STATE_IORESP_READY case in hvmemul_do_io(). Note
>   how, among all the re-issue sanity checks there, we avoid comparing
>   the actual data.
> > - If the non I/O / MMIO side is being written, it is the OS's
>   responsibility to avoid actually moving page contents to disk while
>   there might still be a write access in flight - this is no different
>   in behavior from bare hardware.
> - Read-modify-write accesses are, as always, complicated, and while we
>   deal with them better nowadays than we did in the past, we're still
>   not quite there to guarantee hardware like behavior in all cases
>   anyway. Nothing is getting worse by the changes made here, afaict.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Tim Deegan <tim@xen.org>
> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

FWIW:

Reviewed-by: Wei Liu <wei.liu2@citrix.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
  2018-09-25 14:14 ` [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
                     ` (3 preceding siblings ...)
  2018-09-25 14:26   ` [PATCH v3 4/4] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
@ 2018-10-02 10:36   ` Jan Beulich
  2018-10-02 10:51     ` Andrew Cooper
  4 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2018-10-02 10:36 UTC (permalink / raw)
  To: Andrew Cooper, George Dunlap; +Cc: xen-devel, Paul Durrant

>>> On 25.09.18 at 16:14, <JBeulich@suse.com> wrote:
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far as use of CPU
> registers goes (as those can't change without any other instruction
> executing in between), but is wrong for memory accesses. In particular
> it has been observed that Windows might page out buffers underneath
> an instruction currently under emulation (hitting between two passes).
> If the first pass translated a linear address successfully, any subsequent
> pass needs to do so too, yielding the exact same translation.
> 
> Introduce a cache (used just by guest page table accesses for now, i.e.
> a form of "paging structure cache") to make sure above described
> assumption holds. This is a very simplistic implementation for now: Only
> exact matches are satisfied (no overlaps or partial reads or anything).
> 
> There's also some seemingly unrelated cleanup here which was found
> desirable on the way.
> 
> 1: x86/mm: add optional cache to GLA->GFN translation
> 2: x86/mm: use optional cache in guest_walk_tables()
> 3: x86/HVM: implement memory read caching
> 4: x86/HVM: prefill cache with PDPTEs when possible
> 
> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
> can't currently find enough time to carry out the requested further
> rework.

Andrew, George?

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Ping: [PATCH v3 3/4] x86/HVM: implement memory read caching
  2018-09-25 14:25   ` [PATCH v3 3/4] x86/HVM: implement memory read caching Jan Beulich
  2018-09-26 11:05     ` Wei Liu
@ 2018-10-02 10:39     ` Jan Beulich
  2018-10-02 13:53       ` Boris Ostrovsky
  2018-10-09  5:19       ` Tian, Kevin
  1 sibling, 2 replies; 48+ messages in thread
From: Jan Beulich @ 2018-10-02 10:39 UTC (permalink / raw)
  To: Brian Woods, Suravee Suthikulpanit, Jun Nakajima, Kevin Tian,
	Boris Ostrovsky
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Tim Deegan, Paul Durrant,
	xen-devel

>>> On 25.09.18 at 16:25, <JBeulich@suse.com> wrote:
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far as use of CPU
> registers goes (as those can't change without any other instruction
> executing in between), but is wrong for memory accesses. In particular
> it has been observed that Windows might page out buffers underneath an
> instruction currently under emulation (hitting between two passes). If
> the first pass translated a linear address successfully, any subsequent
> pass needs to do so too, yielding the exact same translation.
> 
> Introduce a cache (used by just guest page table accesses for now) to
> make sure above described assumption holds. This is a very simplistic
> implementation for now: Only exact matches are satisfied (no overlaps or
> partial reads or anything).
> 
> As to the actual data page in this scenario, there are a couple of
> aspects to take into consideration:
> - We must be talking about an insn accessing two locations (two memory
>   ones, one of which is MMIO, or a memory and an I/O one).
> - If the non I/O / MMIO side is being read, the re-read (if it occurs at
>   all) is having its result discarded, by taking the shortcut through
>   the first switch()'s STATE_IORESP_READY case in hvmemul_do_io(). Note
>   how, among all the re-issue sanity checks there, we avoid comparing
>   the actual data.
> > - If the non I/O / MMIO side is being written, it is the OS's
>   responsibility to avoid actually moving page contents to disk while
>   there might still be a write access in flight - this is no different
>   in behavior from bare hardware.
> - Read-modify-write accesses are, as always, complicated, and while we
>   deal with them better nowadays than we did in the past, we're still
>   not quite there to guarantee hardware like behavior in all cases
>   anyway. Nothing is getting worse by the changes made here, afaict.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Tim Deegan <tim@xen.org>
> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

SVM and VMX maintainers?

Thanks, Jan

> ---
> v3: Add text about the actual data page to the description.
> v2: Re-base.
> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -27,6 +27,18 @@
>  #include <asm/hvm/svm/svm.h>
>  #include <asm/vm_event.h>
>  
> +struct hvmemul_cache
> +{
> +    unsigned int max_ents;
> +    unsigned int num_ents;
> +    struct {
> +        paddr_t gpa:PADDR_BITS;
> +        unsigned int size:(BITS_PER_LONG - PADDR_BITS) / 2;
> +        unsigned int level:(BITS_PER_LONG - PADDR_BITS) / 2;
> +        unsigned long data;
> +    } ents[];
> +};
> +
>  static void hvmtrace_io_assist(const ioreq_t *p)
>  {
>      unsigned int size, event;
> @@ -541,7 +553,7 @@ static int hvmemul_do_mmio_addr(paddr_t
>   */
>  static void *hvmemul_map_linear_addr(
>      unsigned long linear, unsigned int bytes, uint32_t pfec,
> -    struct hvm_emulate_ctxt *hvmemul_ctxt)
> +    struct hvm_emulate_ctxt *hvmemul_ctxt, struct hvmemul_cache *cache)
>  {
>      struct vcpu *curr = current;
>      void *err, *mapping;
> @@ -586,7 +598,7 @@ static void *hvmemul_map_linear_addr(
>          ASSERT(mfn_x(*mfn) == 0);
>  
>          res = hvm_translate_get_page(curr, addr, true, pfec,
> -                                     &pfinfo, &page, NULL, &p2mt);
> +                                     &pfinfo, &page, NULL, &p2mt, cache);
>  
>          switch ( res )
>          {
> @@ -702,6 +714,8 @@ static int hvmemul_linear_to_phys(
>      gfn_t gfn, ngfn;
>      unsigned long done, todo, i, offset = addr & ~PAGE_MASK;
>      int reverse;
> +    struct hvmemul_cache *cache = pfec & PFEC_insn_fetch
> +                                  ? NULL : curr->arch.hvm.data_cache;
>  
>      /*
>       * Clip repetitions to a sensible maximum. This avoids extensive 
> looping in
> @@ -731,7 +745,7 @@ static int hvmemul_linear_to_phys(
>              return rc;
>          gfn = gaddr_to_gfn(gaddr);
>      }
> -    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, NULL),
> +    else if ( gfn_eq(gfn = paging_gla_to_gfn(curr, addr, &pfec, cache),
>                       INVALID_GFN) )
>      {
>          if ( pfec & (PFEC_page_paged | PFEC_page_shared) )
> @@ -747,7 +761,7 @@ static int hvmemul_linear_to_phys(
>      {
>          /* Get the next PFN in the range. */
>          addr += reverse ? -PAGE_SIZE : PAGE_SIZE;
> -        ngfn = paging_gla_to_gfn(curr, addr, &pfec, NULL);
> +        ngfn = paging_gla_to_gfn(curr, addr, &pfec, cache);
>  
>          /* Is it contiguous with the preceding PFNs? If not then we're 
> done. */
>          if ( gfn_eq(ngfn, INVALID_GFN) ||
> @@ -1073,7 +1087,10 @@ static int linear_read(unsigned long add
>                         uint32_t pfec, struct hvm_emulate_ctxt 
> *hvmemul_ctxt)
>  {
>      pagefault_info_t pfinfo;
> -    int rc = hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo);
> +    int rc = hvm_copy_from_guest_linear(p_data, addr, bytes, pfec, &pfinfo,
> +                                        (pfec & PFEC_insn_fetch
> +                                         ? NULL
> +                                         : current->arch.hvm.data_cache));
>  
>      switch ( rc )
>      {
> @@ -1270,7 +1287,8 @@ static int hvmemul_write(
>  
>      if ( !known_gla(addr, bytes, pfec) )
>      {
> -        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
> +        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
> +                                          current->arch.hvm.data_cache);
>          if ( IS_ERR(mapping) )
>               return ~PTR_ERR(mapping);
>      }
> @@ -1312,7 +1330,8 @@ static int hvmemul_rmw(
>  
>      if ( !known_gla(addr, bytes, pfec) )
>      {
> -        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
> +        mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
> +                                          current->arch.hvm.data_cache);
>          if ( IS_ERR(mapping) )
>              return ~PTR_ERR(mapping);
>      }
> @@ -1466,7 +1485,8 @@ static int hvmemul_cmpxchg(
>      else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].dpl == 3 )
>          pfec |= PFEC_user_mode;
>  
> -    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
> +    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt,
> +                                      curr->arch.hvm.data_cache);
>      if ( IS_ERR(mapping) )
>          return ~PTR_ERR(mapping);
>  
> @@ -2373,6 +2393,7 @@ static int _hvm_emulate_one(struct hvm_e
>      {
>          vio->mmio_cache_count = 0;
>          vio->mmio_insn_bytes = 0;
> +        curr->arch.hvm.data_cache->num_ents = 0;
>      }
>      else
>      {
> @@ -2591,7 +2612,7 @@ void hvm_emulate_init_per_insn(
>                                          &addr) &&
>               hvm_copy_from_guest_linear(hvmemul_ctxt->insn_buf, addr,
>                                          sizeof(hvmemul_ctxt->insn_buf),
> -                                        pfec | PFEC_insn_fetch,
> +                                        pfec | PFEC_insn_fetch, NULL,
>                                          NULL) == HVMTRANS_okay) ?
>              sizeof(hvmemul_ctxt->insn_buf) : 0;
>      }
> @@ -2664,9 +2685,35 @@ void hvm_dump_emulation_state(const char
>             hvmemul_ctxt->insn_buf);
>  }
>  
> +struct hvmemul_cache *hvmemul_cache_init(unsigned int nents)
> +{
> +    struct hvmemul_cache *cache = xmalloc_bytes(offsetof(struct 
> hvmemul_cache,
> +                                                         ents[nents]));
> +
> +    if ( cache )
> +    {
> +        cache->num_ents = 0;
> +        cache->max_ents = nents;
> +    }
> +
> +    return cache;
> +}
> +
>  bool hvmemul_read_cache(const struct hvmemul_cache *cache, paddr_t gpa,
>                          unsigned int level, void *buffer, unsigned int 
> size)
>  {
> +    unsigned int i;
> +
> +    ASSERT(size <= sizeof(cache->ents->data));
> +
> +    for ( i = 0; i < cache->num_ents; ++i )
> +        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
> +             cache->ents[i].size == size )
> +        {
> +            memcpy(buffer, &cache->ents[i].data, size);
> +            return true;
> +        }
> +
>      return false;
>  }
>  
> @@ -2674,6 +2721,35 @@ void hvmemul_write_cache(struct hvmemul_
>                           unsigned int level, const void *buffer,
>                           unsigned int size)
>  {
> +    unsigned int i;
> +
> +    if ( size > sizeof(cache->ents->data) )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return;
> +    }
> +
> +    for ( i = 0; i < cache->num_ents; ++i )
> +        if ( cache->ents[i].level == level && cache->ents[i].gpa == gpa &&
> +             cache->ents[i].size == size )
> +        {
> +            memcpy(&cache->ents[i].data, buffer, size);
> +            return;
> +        }
> +
> +    if ( unlikely(i >= cache->max_ents) )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return;
> +    }
> +
> +    cache->ents[i].level = level;
> +    cache->ents[i].gpa   = gpa;
> +    cache->ents[i].size  = size;
> +
> +    memcpy(&cache->ents[i].data, buffer, size);
> +
> +    cache->num_ents = i + 1;
>  }
>  
>  /*
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1498,6 +1498,17 @@ int hvm_vcpu_initialise(struct vcpu *v)
>  
>      v->arch.hvm.inject_event.vector = HVM_EVENT_VECTOR_UNSET;
>  
> +    /*
> +     * Leaving aside the insn fetch, for which we don't use this cache, no
> +     * insn can access more than 8 independent linear addresses (AVX2
> +     * gathers being the worst). Each such linear range can span a page
> +     * boundary, i.e. require two page walks.
> +     */
> +    v->arch.hvm.data_cache = hvmemul_cache_init(CONFIG_PAGING_LEVELS * 8 * 
> 2);
> +    rc = -ENOMEM;
> +    if ( !v->arch.hvm.data_cache )
> +        goto fail4;
> +
>      rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
>      if ( rc != 0 )
>          goto fail4;
> @@ -1527,6 +1538,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
>   fail5:
>      free_compat_arg_xlat(v);
>   fail4:
> +    hvmemul_cache_destroy(v->arch.hvm.data_cache);
>      hvm_funcs.vcpu_destroy(v);
>   fail3:
>      vlapic_destroy(v);
> @@ -1549,6 +1561,8 @@ void hvm_vcpu_destroy(struct vcpu *v)
>  
>      free_compat_arg_xlat(v);
>  
> +    hvmemul_cache_destroy(v->arch.hvm.data_cache);
> +
>      tasklet_kill(&v->arch.hvm.assert_evtchn_irq_tasklet);
>      hvm_funcs.vcpu_destroy(v);
>  
> @@ -2923,7 +2937,7 @@ void hvm_task_switch(
>      }
>  
>      rc = hvm_copy_from_guest_linear(
> -        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
> +        &tss, prev_tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
>      if ( rc == HVMTRANS_bad_linear_to_gfn )
>          hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
>      if ( rc != HVMTRANS_okay )
> @@ -2970,7 +2984,7 @@ void hvm_task_switch(
>          goto out;
>  
>      rc = hvm_copy_from_guest_linear(
> -        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo);
> +        &tss, tr.base, sizeof(tss), PFEC_page_present, &pfinfo, NULL);
>      if ( rc == HVMTRANS_bad_linear_to_gfn )
>          hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
>      /*
> @@ -3081,7 +3095,7 @@ void hvm_task_switch(
>  enum hvm_translation_result hvm_translate_get_page(
>      struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
>      pagefault_info_t *pfinfo, struct page_info **page_p,
> -    gfn_t *gfn_p, p2m_type_t *p2mt_p)
> +    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache)
>  {
>      struct page_info *page;
>      p2m_type_t p2mt;
> @@ -3089,7 +3103,7 @@ enum hvm_translation_result hvm_translat
>  
>      if ( linear )
>      {
> -        gfn = paging_gla_to_gfn(v, addr, &pfec, NULL);
> +        gfn = paging_gla_to_gfn(v, addr, &pfec, cache);
>  
>          if ( gfn_eq(gfn, INVALID_GFN) )
>          {
> @@ -3161,7 +3175,7 @@ enum hvm_translation_result hvm_translat
>  #define HVMCOPY_linear     (1u<<2)
>  static enum hvm_translation_result __hvm_copy(
>      void *buf, paddr_t addr, int size, struct vcpu *v, unsigned int flags,
> -    uint32_t pfec, pagefault_info_t *pfinfo)
> +    uint32_t pfec, pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
>  {
>      gfn_t gfn;
>      struct page_info *page;
> @@ -3194,8 +3208,8 @@ static enum hvm_translation_result __hvm
>  
>          count = min_t(int, PAGE_SIZE - gpa, todo);
>  
> -        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear,
> -                                     pfec, pfinfo, &page, &gfn, &p2mt);
> +        res = hvm_translate_get_page(v, addr, flags & HVMCOPY_linear, pfec,
> +                                     pfinfo, &page, &gfn, &p2mt, cache);
>          if ( res != HVMTRANS_okay )
>              return res;
>  
> @@ -3242,14 +3256,14 @@ enum hvm_translation_result hvm_copy_to_
>      paddr_t paddr, void *buf, int size, struct vcpu *v)
>  {
>      return __hvm_copy(buf, paddr, size, v,
> -                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL);
> +                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL, NULL);
>  }
>  
>  enum hvm_translation_result hvm_copy_from_guest_phys(
>      void *buf, paddr_t paddr, int size)
>  {
>      return __hvm_copy(buf, paddr, size, current,
> -                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL);
> +                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL, NULL);
>  }
>  
>  enum hvm_translation_result hvm_copy_to_guest_linear(
> @@ -3258,16 +3272,17 @@ enum hvm_translation_result hvm_copy_to_
>  {
>      return __hvm_copy(buf, addr, size, current,
>                        HVMCOPY_to_guest | HVMCOPY_linear,
> -                      PFEC_page_present | PFEC_write_access | pfec, 
> pfinfo);
> +                      PFEC_page_present | PFEC_write_access | pfec,
> +                      pfinfo, NULL);
>  }
>  
>  enum hvm_translation_result hvm_copy_from_guest_linear(
>      void *buf, unsigned long addr, int size, uint32_t pfec,
> -    pagefault_info_t *pfinfo)
> +    pagefault_info_t *pfinfo, struct hvmemul_cache *cache)
>  {
>      return __hvm_copy(buf, addr, size, current,
>                        HVMCOPY_from_guest | HVMCOPY_linear,
> -                      PFEC_page_present | pfec, pfinfo);
> +                      PFEC_page_present | pfec, pfinfo, cache);
>  }
>  
>  unsigned long copy_to_user_hvm(void *to, const void *from, unsigned int 
> len)
> @@ -3308,7 +3323,8 @@ unsigned long copy_from_user_hvm(void *t
>          return 0;
>      }
>  
> -    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len, 0, NULL);
> +    rc = hvm_copy_from_guest_linear(to, (unsigned long)from, len,
> +                                    0, NULL, NULL);
>      return rc ? len : 0; /* fake a copy_from_user() return code */
>  }
>  
> @@ -3724,7 +3740,7 @@ void hvm_ud_intercept(struct cpu_user_re
>                                          sizeof(sig), hvm_access_insn_fetch,
>                                          cs, &addr) &&
>               (hvm_copy_from_guest_linear(sig, addr, sizeof(sig),
> -                                         walk, NULL) == HVMTRANS_okay) &&
> +                                         walk, NULL, NULL) == 
> HVMTRANS_okay) &&
>               (memcmp(sig, "\xf\xbxen", sizeof(sig)) == 0) )
>          {
>              regs->rip += sizeof(sig);
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1358,7 +1358,7 @@ static void svm_emul_swint_injection(str
>          goto raise_exception;
>  
>      rc = hvm_copy_from_guest_linear(&idte, idte_linear_addr, idte_size,
> -                                    PFEC_implicit, &pfinfo);
> +                                    PFEC_implicit, &pfinfo, NULL);
>      if ( rc )
>      {
>          if ( rc == HVMTRANS_bad_linear_to_gfn )
> --- a/xen/arch/x86/hvm/vmx/vvmx.c
> +++ b/xen/arch/x86/hvm/vmx/vvmx.c
> @@ -475,7 +475,7 @@ static int decode_vmx_inst(struct cpu_us
>          {
>              pagefault_info_t pfinfo;
>              int rc = hvm_copy_from_guest_linear(poperandS, base, size,
> -                                                0, &pfinfo);
> +                                                0, &pfinfo, NULL);
>  
>              if ( rc == HVMTRANS_bad_linear_to_gfn )
>                  hvm_inject_page_fault(pfinfo.ec, pfinfo.linear);
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -166,7 +166,7 @@ const struct x86_emulate_ops *shadow_ini
>              hvm_access_insn_fetch, sh_ctxt, &addr) &&
>           !hvm_copy_from_guest_linear(
>               sh_ctxt->insn_buf, addr, sizeof(sh_ctxt->insn_buf),
> -             PFEC_insn_fetch, NULL))
> +             PFEC_insn_fetch, NULL, NULL))
>          ? sizeof(sh_ctxt->insn_buf) : 0;
>  
>      return &hvm_shadow_emulator_ops;
> @@ -201,7 +201,7 @@ void shadow_continue_emulation(struct sh
>                  hvm_access_insn_fetch, sh_ctxt, &addr) &&
>               !hvm_copy_from_guest_linear(
>                   sh_ctxt->insn_buf, addr, sizeof(sh_ctxt->insn_buf),
> -                 PFEC_insn_fetch, NULL))
> +                 PFEC_insn_fetch, NULL, NULL))
>              ? sizeof(sh_ctxt->insn_buf) : 0;
>          sh_ctxt->insn_buf_eip = regs->rip;
>      }
> --- a/xen/arch/x86/mm/shadow/hvm.c
> +++ b/xen/arch/x86/mm/shadow/hvm.c
> @@ -125,7 +125,7 @@ hvm_read(enum x86_segment seg,
>      rc = hvm_copy_from_guest_linear(p_data, addr, bytes,
>                                      (access_type == hvm_access_insn_fetch
>                                       ? PFEC_insn_fetch : 0),
> -                                    &pfinfo);
> +                                    &pfinfo, NULL);
>  
>      switch ( rc )
>      {
> --- a/xen/include/asm-x86/hvm/emulate.h
> +++ b/xen/include/asm-x86/hvm/emulate.h
> @@ -99,6 +99,11 @@ int hvmemul_do_pio_buffer(uint16_t port,
>                            void *buffer);
>  
>  struct hvmemul_cache;
> +struct hvmemul_cache *hvmemul_cache_init(unsigned int nents);
> +static inline void hvmemul_cache_destroy(struct hvmemul_cache *cache)
> +{
> +    xfree(cache);
> +}
>  bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
>                          unsigned int level, void *buffer, unsigned int 
> size);
>  void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
> --- a/xen/include/asm-x86/hvm/support.h
> +++ b/xen/include/asm-x86/hvm/support.h
> @@ -99,7 +99,7 @@ enum hvm_translation_result hvm_copy_to_
>      pagefault_info_t *pfinfo);
>  enum hvm_translation_result hvm_copy_from_guest_linear(
>      void *buf, unsigned long addr, int size, uint32_t pfec,
> -    pagefault_info_t *pfinfo);
> +    pagefault_info_t *pfinfo, struct hvmemul_cache *cache);
>  
>  /*
>   * Get a reference on the page under an HVM physical or linear address.  If
> @@ -110,7 +110,7 @@ enum hvm_translation_result hvm_copy_fro
>  enum hvm_translation_result hvm_translate_get_page(
>      struct vcpu *v, unsigned long addr, bool linear, uint32_t pfec,
>      pagefault_info_t *pfinfo, struct page_info **page_p,
> -    gfn_t *gfn_p, p2m_type_t *p2mt_p);
> +    gfn_t *gfn_p, p2m_type_t *p2mt_p, struct hvmemul_cache *cache);
>  
>  #define HVM_HCALL_completed  0 /* hypercall completed - no further action 
> */
>  #define HVM_HCALL_preempted  1 /* hypercall preempted - re-execute VMCALL 
> */
> --- a/xen/include/asm-x86/hvm/vcpu.h
> +++ b/xen/include/asm-x86/hvm/vcpu.h
> @@ -53,8 +53,6 @@ struct hvm_mmio_cache {
>      uint8_t buffer[32];
>  };
>  
> -struct hvmemul_cache;
> -
>  struct hvm_vcpu_io {
>      /* I/O request in flight to device model. */
>      enum hvm_io_completion io_completion;
> @@ -200,6 +198,7 @@ struct hvm_vcpu {
>      u8                  cache_mode;
>  
>      struct hvm_vcpu_io  hvm_io;
> +    struct hvmemul_cache *data_cache;
>  
>      /* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
>      struct x86_event     inject_event;
> 
> 
> 
> 




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
  2018-10-02 10:36   ` Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
@ 2018-10-02 10:51     ` Andrew Cooper
  2018-10-02 12:47       ` Jan Beulich
                         ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Andrew Cooper @ 2018-10-02 10:51 UTC (permalink / raw)
  To: Jan Beulich, George Dunlap; +Cc: xen-devel, Paul Durrant

On 02/10/18 11:36, Jan Beulich wrote:
>>>> On 25.09.18 at 16:14, <JBeulich@suse.com> wrote:
>> Emulation requiring device model assistance uses a form of instruction
>> re-execution, assuming that the second (and any further) pass takes
>> exactly the same path. This is a valid assumption as far as use of CPU
>> registers goes (as those can't change without any other instruction
>> executing in between), but is wrong for memory accesses. In particular
>> it has been observed that Windows might page out buffers underneath
>> an instruction currently under emulation (hitting between two passes).
>> If the first pass translated a linear address successfully, any subsequent
>> pass needs to do so too, yielding the exact same translation.
>>
>> Introduce a cache (used just by guest page table accesses for now, i.e.
>> a form of "paging structure cache") to make sure above described
>> assumption holds. This is a very simplistic implementation for now: Only
>> exact matches are satisfied (no overlaps or partial reads or anything).
>>
>> There's also some seemingly unrelated cleanup here which was found
>> desirable on the way.
>>
>> 1: x86/mm: add optional cache to GLA->GFN translation
>> 2: x86/mm: use optional cache in guest_walk_tables()
>> 3: x86/HVM: implement memory read caching
>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>
>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>> can't currently find enough time to carry out the requested further
>> rework.
> Andrew, George?

You've not fixed anything from my concerns with v1.

This doesn't behave like real hardware, and definitely doesn't behave as
named - "struct hvmemul_cache" is simply false.  If it were named
hvmemul_psc (or some other variation on Paging Structure Cache) then it
wouldn't be so bad, as the individual levels do make more sense in that
context (not that it would make the behaviour any closer to how hardware
actually works).

I'm also not overly happy with the conditional nature of the caching, or
that it isn't a transparent read-through cache.  This leads to a large
amount of boilerplate code for every user.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
  2018-10-02 10:51     ` Andrew Cooper
@ 2018-10-02 12:47       ` Jan Beulich
  2018-10-11 15:54         ` George Dunlap
  2018-10-11  6:51       ` Jan Beulich
  2018-10-11 17:36       ` George Dunlap
  2 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2018-10-02 12:47 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Paul Durrant

>>> On 02.10.18 at 12:51, <andrew.cooper3@citrix.com> wrote:
> On 02/10/18 11:36, Jan Beulich wrote:
>>>>> On 25.09.18 at 16:14, <JBeulich@suse.com> wrote:
>>> Emulation requiring device model assistance uses a form of instruction
>>> re-execution, assuming that the second (and any further) pass takes
>>> exactly the same path. This is a valid assumption as far as use of CPU
>>> registers goes (as those can't change without any other instruction
>>> executing in between), but is wrong for memory accesses. In particular
>>> it has been observed that Windows might page out buffers underneath
>>> an instruction currently under emulation (hitting between two passes).
>>> If the first pass translated a linear address successfully, any subsequent
>>> pass needs to do so too, yielding the exact same translation.
>>>
>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>> a form of "paging structure cache") to make sure above described
>>> assumption holds. This is a very simplistic implementation for now: Only
>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>>
>>> There's also some seemingly unrelated cleanup here which was found
>>> desirable on the way.
>>>
>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>> 3: x86/HVM: implement memory read caching
>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>>
>>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>>> can't currently find enough time to carry out the requested further
>>> rework.
>> Andrew, George?
> 
> You've not fixed anything from my concerns with v1.

I've responded to your concerns verbally, and you went silent, as is
(I regret having to say so) the case quite often. This simply blocks
any progress. Hence, after enough time went by, I simply took the
liberty to interpret the silence as "the verbal response took care of
my concerns".

> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false.  If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context

As previously pointed out (without any suggestion coming back from
you), I chose the name "cache" for the lack of a better term. However,
I certainly disagree with naming it PSC or some such, as its level zero
is intentionally there to be eventually used for non-paging-structure
data.

> (not that it would make the behaviour any closer to how hardware
> actually works).

I can certainly appreciate this concern of yours, but the whole issue
the series is aiming to address is something that we can't make
behave like hardware does: Hardware doesn't have the concept of
a device model that it needs to wait for responses from, while trying
to make use of the wait time (i.e. scheduling in another CPU).

Once again (I've said so more than once before) - the use of what
I call cache here is a correctness thing, not a performance
improvement (this, if it so happens, is just a nice side effect).
Nothing like that exists on hardware; I'm merely trying to come
close to paging structure caches. As to them - is it anywhere
spelled out that their data must not come from memory, when
doing a cache fill? We simply have no general purpose cache
that the data could come from.
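
To make concrete what this correctness use amounts to in the walk code
(patch 2, not quoted here): the pattern is a cache lookup before the memory
read and a fill afterwards, so a re-executed pass sees exactly what the
first pass saw. A sketch only - the actual patch differs in detail, and the
helpers other than the two cache accessors are stubs:

#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the Xen types and helpers involved; the prototypes of
 * hvmemul_read_cache()/hvmemul_write_cache() match the ones added by
 * patch 3, the rest is stubbed out for illustration. */
typedef uint64_t paddr_t;
typedef uint64_t guest_intpte_t;
struct hvmemul_cache;

bool hvmemul_read_cache(const struct hvmemul_cache *, paddr_t gpa,
                        unsigned int level, void *buffer, unsigned int size);
void hvmemul_write_cache(struct hvmemul_cache *, paddr_t gpa,
                         unsigned int level, const void *buffer,
                         unsigned int size);
void read_from_guest_phys(void *buf, paddr_t gpa, unsigned int size); /* stub */

/* Sketch only - not the actual guest_walk_tables() change: a repeat
 * emulation pass re-reads the paging-structure entry from the cache and
 * therefore sees exactly what the first pass saw. */
static guest_intpte_t read_pt_entry(struct hvmemul_cache *cache,
                                    paddr_t gpa, unsigned int level)
{
    guest_intpte_t entry;

    if ( cache &&
         hvmemul_read_cache(cache, gpa, level, &entry, sizeof(entry)) )
        return entry;

    read_from_guest_phys(&entry, gpa, sizeof(entry));
    if ( cache )
        hvmemul_write_cache(cache, gpa, level, &entry, sizeof(entry));

    return entry;
}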

> I'm also not overly happy with the conditional nature of the caching, or
> that it isn't a transparent read-through cache.  This leads to a large
> amount of boilerplate code for every user.

That's all fine, and I can understand it. Yet I hope you don't mean
to ask that for this correctness fix to become acceptable, you
demand that I implement a general purpose cache sitting
transparently underneath all read/write operations? The more that
it wouldn't even fulfill the purpose, as what the series introduces
specifically also needs to be used for page tables living in MMIO
(i.e. regardless of cachability of the memory accesses involved; I
do realize that this doesn't currently function for other reasons,
but it really should).

As to the amount of boilerplate code: Besides expressing your
dislike, do you have a concrete suggestion as to how you would
envision this to be avoided in a way covering _all_ cases, i.e. in
particular _all_ callers of guest_walk_tables() et al (i.e. all the
functions the first two patches fiddle with)? If I had seen an
obvious and natural way to achieve this, you may rest assured
that I would have tried to avoid introducing the new function
parameters, for which arguments need to be pushed through
the various layers now.

Furthermore I couldn't convince myself that doing this in a
parameter-less way would be a good idea (and in the end
provably correct): Of course we could make caches hang off
of struct vcpu, but then we would need to find a model for how
to invalidate them often enough, without invalidating (parts
of) them in ways breaking the correctness that I'm trying to
achieve here.

Bottom line - I think there are three options:
1) You accept this model, despite it not being perfect; of
   course I'm then all ears as to bugs you see in the current
   version.
2) You supply a series addressing the correctness issue in a
   way to your liking, within a reasonable time frame (which to
   me would mean in time for 4.12, seeing that this series was
   put together during the 4.11 freeze and posted very soon
   after the tree was re-opened).
3) You provide feedback which is at least constructive enough
   that I can derive from it something that is (a) manageable in
   scope and (b) laying out a way towards addressing the issue
   at hand.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Ping: [PATCH v3 3/4] x86/HVM: implement memory read caching
  2018-10-02 10:39     ` Ping: " Jan Beulich
@ 2018-10-02 13:53       ` Boris Ostrovsky
  2018-10-09  5:19       ` Tian, Kevin
  1 sibling, 0 replies; 48+ messages in thread
From: Boris Ostrovsky @ 2018-10-02 13:53 UTC (permalink / raw)
  To: Jan Beulich, Brian Woods, Suravee Suthikulpanit, Jun Nakajima,
	Kevin Tian
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Tim Deegan, Paul Durrant,
	xen-devel

On 10/2/18 6:39 AM, Jan Beulich wrote:
>>>> On 25.09.18 at 16:25, <JBeulich@suse.com> wrote:
>> Emulation requiring device model assistance uses a form of instruction
>> re-execution, assuming that the second (and any further) pass takes
>> exactly the same path. This is a valid assumption as far as use of CPU
>> registers goes (as those can't change without any other instruction
>> executing in between), but is wrong for memory accesses. In particular
>> it has been observed that Windows might page out buffers underneath an
>> instruction currently under emulation (hitting between two passes). If
>> the first pass translated a linear address successfully, any subsequent
>> pass needs to do so too, yielding the exact same translation.
>>
>> Introduce a cache (used by just guest page table accesses for now) to
>> make sure above described assumption holds. This is a very simplistic
>> implementation for now: Only exact matches are satisfied (no overlaps or
>> partial reads or anything).
>>
>> As to the actual data page in this scenario, there are a couple of
>> aspects to take into consideration:
>> - We must be talking about an insn accessing two locations (two memory
>>   ones, one of which is MMIO, or a memory and an I/O one).
>> - If the non I/O / MMIO side is being read, the re-read (if it occurs at
>>   all) is having its result discarded, by taking the shortcut through
>>   the first switch()'s STATE_IORESP_READY case in hvmemul_do_io(). Note
>>   how, among all the re-issue sanity checks there, we avoid comparing
>>   the actual data.
>> - If the non I/O / MMIO side is being written, it is the OS's
>>   responsibility to avoid actually moving page contents to disk while
>>   there might still be a write access in flight - this is no different
>>   in behavior from bare hardware.
>> - Read-modify-write accesses are, as always, complicated, and while we
>>   deal with them better nowadays than we did in the past, we're still
>>   not quite there to guarantee hardware like behavior in all cases
>>   anyway. Nothing is getting worse by the changes made here, afaict.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Acked-by: Tim Deegan <tim@xen.org>
>> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
> SVM and VMX maintainers?


Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Ping: [PATCH v3 3/4] x86/HVM: implement memory read caching
  2018-10-02 10:39     ` Ping: " Jan Beulich
  2018-10-02 13:53       ` Boris Ostrovsky
@ 2018-10-09  5:19       ` Tian, Kevin
  1 sibling, 0 replies; 48+ messages in thread
From: Tian, Kevin @ 2018-10-09  5:19 UTC (permalink / raw)
  To: Jan Beulich, Brian Woods, Suravee Suthikulpanit, Nakajima, Jun,
	Boris Ostrovsky
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Tim Deegan, Paul Durrant,
	xen-devel

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, October 2, 2018 6:39 PM
> 
> >>> On 25.09.18 at 16:25, <JBeulich@suse.com> wrote:
> > Emulation requiring device model assistance uses a form of instruction
> > re-execution, assuming that the second (and any further) pass takes
> > exactly the same path. This is a valid assumption as far as use of CPU
> > registers goes (as those can't change without any other instruction
> > executing in between), but is wrong for memory accesses. In particular
> > it has been observed that Windows might page out buffers underneath
> an
> > instruction currently under emulation (hitting between two passes). If
> > the first pass translated a linear address successfully, any subsequent
> > pass needs to do so too, yielding the exact same translation.
> >
> > Introduce a cache (used by just guest page table accesses for now) to
> > make sure above described assumption holds. This is a very simplistic
> > implementation for now: Only exact matches are satisfied (no overlaps or
> > partial reads or anything).
> >
> > As to the actual data page in this scenario, there are a couple of
> > aspects to take into consideration:
> > - We must be talking about an insn accessing two locations (two memory
> >   ones, one of which is MMIO, or a memory and an I/O one).
> > - If the non I/O / MMIO side is being read, the re-read (if it occurs at
> >   all) is having its result discarded, by taking the shortcut through
> >   the first switch()'s STATE_IORESP_READY case in hvmemul_do_io(). Note
> >   how, among all the re-issue sanity checks there, we avoid comparing
> >   the actual data.
> > - If the non I/O / MMIO side is being written, it is the OS's
> >   responsibility to avoid actually moving page contents to disk while
> >   there might still be a write access in flight - this is no different
> >   in behavior from bare hardware.
> > - Read-modify-write accesses are, as always, complicated, and while we
> >   deal with them better nowadays than we did in the past, we're still
> >   not quite there to guarantee hardware like behavior in all cases
> >   anyway. Nothing is getting worse by the changes made here, afaict.
> >
> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > Acked-by: Tim Deegan <tim@xen.org>
> > Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
> 
> SVM and VMX maintainers?
> 

Reviewed-by: Kevin Tian <kevin.tian@intel.com>


* Re: Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
  2018-10-02 10:51     ` Andrew Cooper
  2018-10-02 12:47       ` Jan Beulich
@ 2018-10-11  6:51       ` Jan Beulich
  2018-10-11 17:36       ` George Dunlap
  2 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2018-10-11  6:51 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Paul Durrant

>>> On 02.10.18 at 12:51, <andrew.cooper3@citrix.com> wrote:
> On 02/10/18 11:36, Jan Beulich wrote:
>>>>> On 25.09.18 at 16:14, <JBeulich@suse.com> wrote:
>>> Emulation requiring device model assistance uses a form of instruction
>>> re-execution, assuming that the second (and any further) pass takes
>>> exactly the same path. This is a valid assumption as far as use of CPU
>>> registers goes (as those can't change without any other instruction
>>> executing in between), but is wrong for memory accesses. In particular
>>> it has been observed that Windows might page out buffers underneath
>>> an instruction currently under emulation (hitting between two passes).
>>> If the first pass translated a linear address successfully, any subsequent
>>> pass needs to do so too, yielding the exact same translation.
>>>
>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>> a form of "paging structure cache") to make sure above described
>>> assumption holds. This is a very simplistic implementation for now: Only
>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>>
>>> There's also some seemingly unrelated cleanup here which was found
>>> desirable on the way.
>>>
>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>> 3: x86/HVM: implement memory read caching
>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>>
>>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>>> can't currently find enough time to carry out the requested further
>>> rework.
>> Andrew, George?
> 
> You've not fixed anything from my concerns with v1.
> 
> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false.  If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context (not that it would make the behaviour any closer to how hardware
> actually works).
> 
> I'm also not overly happy with the conditional nature of the caching, or
> that it isn't a transparent read-through cache.  This leads to a large
> amount of boilerplate code for every user.

So after the call yesterday I've been thinking about this some more,
coming to the conclusion that it is actually what you ask for that
would end up being architecturally incorrect.

First of all, there's absolutely no way to implement a general purpose
cache that mimics hardware behavior. This is simply because we
cannot observe remote (v)CPUs' writes to the cached areas.

What we instead need is a store where we can retain the result of
every independent memory read. Let me illustrate this using the
example of the gather family of instructions: Let's assume such an
instruction has its [xyz]mm index register all zero. This will produce
up to 16 reads from exactly the same memory location. Each of
these reads is an independent one, and hence each of them is
liable to observe different values (due to the coherent nature of
the processor caches and their protocol), if another CPU updates
the location frequently enough. As a result, to correctly replay
(parts of) such an instruction, we'd have to store up to 16
different values all for the same physical memory slot. Obviously
to distinguish them we'd need to somehow tag them.

This is why the current series does not try to use the "cache" for
other than page table elements (but leaves it possible for this
to be added later, without renaming anything). We'd have to
have the insn emulator layer (or, at the latest, the top level
hvmemul_*() hooks) produce such tags, requiring an extension
to the emulator memory access hooks.
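
To make this more concrete, here's a minimal sketch of such a tagged
store (all names, sizes, and the linear lookup are purely
illustrative, not proposed code):

    /* Each independent read of the first pass records (tag, gpa,
     * size, data); a replay pass looks entries up by the same tag, so
     * e.g. 16 gather element reads from one location each keep the
     * value they originally observed. */
    struct read_rec {
        unsigned int tag;         /* per-insn sequence / operand tag */
        paddr_t      gpa;
        unsigned int size;
        uint64_t     data;
    };

    struct read_store {
        unsigned int nents;
        struct read_rec ents[32]; /* capacity picked arbitrarily */
    };

    static bool read_store_lookup(const struct read_store *s,
                                  unsigned int tag, paddr_t gpa,
                                  unsigned int size, uint64_t *val)
    {
        unsigned int i;

        for ( i = 0; i < s->nents; ++i )
            if ( s->ents[i].tag == tag && s->ents[i].gpa == gpa &&
                 s->ents[i].size == size )
            {
                *val = s->ents[i].data;
                return true;
            }

        return false;
    }

    static void read_store_record(struct read_store *s, unsigned int tag,
                                  paddr_t gpa, unsigned int size,
                                  uint64_t val)
    {
        if ( s->nents < ARRAY_SIZE(s->ents) )
            s->ents[s->nents++] = (struct read_rec)
                { .tag = tag, .gpa = gpa, .size = size, .data = val };
    }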

Now the same consideration applies to page table reads: Each
level's read is an independent one, and therefore may observe
a value different from a higher level's read even if the same
physical slot is referenced (via recursive page tables). Here,
however, we're in the "luxury" position that we have both
"natural" tagging (the page table level), and we don't need to
care about subsequent writes (unlike the general purpose
cache, the paging structure caches don't "snoop" later writes,
but require active invalidation). It is my understanding that
the different page table levels have, in hardware, entirely
independent paging structure caches, i.e. a level 3 read would
not be satisfied by a level 4 PSC entry; in fact I think the spec
does not exclude either option, but suggests such an
independent behavior as the current choice of implementation.

Similarly there's the write behavior: We specifically would
require insn operand writes to _not_ update the "cache", as
during replay we still need to see the original value; the
aforementioned tagging would prevent this. The exception
here is the setting of A/D bits during page table walks: On
replay we _do_ want to observe them set if the first run
through had to set them. (The same would imo apply to
setting descriptor table accessed bits, btw.)
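
To illustrate (names invented; GUEST_PAGING_LEVELS assumed to have
its usual meaning), a per-level store along these lines would be all
the page table case needs - note that the update side is also what
the A/D bit setting would go through, so a replay observes those
bits as set:

    /* One slot per page table level, so a level 3 read can never be
     * satisfied by what a level 4 read stored, even when both
     * reference the same physical slot. */
    struct pt_store {
        struct {
            paddr_t  gpa;
            uint64_t pte;
            bool     valid;
        } lvl[GUEST_PAGING_LEVELS + 1];    /* index 0 unused */
    };

    static bool pt_store_lookup(const struct pt_store *s,
                                unsigned int level, paddr_t gpa,
                                uint64_t *pte)
    {
        if ( !s->lvl[level].valid || s->lvl[level].gpa != gpa )
            return false;

        *pte = s->lvl[level].pte;
        return true;
    }

    /* Used both to record a fresh read and to refresh the entry when
     * A/D bits get set during the walk. */
    static void pt_store_update(struct pt_store *s, unsigned int level,
                                paddr_t gpa, uint64_t pte)
    {
        s->lvl[level].gpa = gpa;
        s->lvl[level].pte = pte;
        s->lvl[level].valid = true;
    }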

As to the term "cache" - would "latch" perhaps be a better name
to reflect the purpose?

Finally a word on your desire of making this a transparent thing
rather than something handed through as function arguments:
Almost everything ends up in __hvm_copy(). Therefore,
without a respective function argument, we'd need to globally
record state on the vCPU as to whether a particular memory
access ought to consult / update the "cache". Memory accesses
done in the context of hypercalls, for example, are specifically
not supposed to touch that "cache" - not only because that's
not how hypercalls are supposed to behave (they are required
to read guest memory exactly once only anyway), but also
because whatever size we'd make that "cache", a large
enough batched hypercall could exceed that size.

I don't mean to say this is impossible to implement correctly,
but I think going the function argument route is far easier to
prove correct, since at the relevant layer(s) of a call tree you
can see whether "caching" is intended to be in effect or not.

Jan



* Re: Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
  2018-10-02 12:47       ` Jan Beulich
@ 2018-10-11 15:54         ` George Dunlap
  2018-10-11 16:15           ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: George Dunlap @ 2018-10-11 15:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Paul Durrant, George Dunlap, xen-devel


> On Oct 2, 2018, at 1:47 PM, Jan Beulich <JBeulich@suse.com> wrote:
> 
>>>> On 02.10.18 at 12:51, <andrew.cooper3@citrix.com> wrote:
> 
>> This doesn't behave like real hardware, and definitely doesn't behave as
>> named - "struct hvmemul_cache" is simply false.  If it were named
>> hvmemul_psc (or some other variation on Paging Structure Cache) then it
>> wouldn't be so bad, as the individual levels do make more sense in that
>> context
> 
> As previously pointed out (without any suggestion coming back from
> you), I chose the name "cache" for the lack of a better term. However,
> I certainly disagree with naming it PSC or some such, as its level zero
> is intentionally there to be eventually used for non-paging-structure
> data.

I can think of lots of descriptive names which could yield unique three-letter acronyms:

Logical Read Sequence
Logical Read Series
Logical Read Record
Read Consistency Structure
Consistent Read Structure
Consistent Read Record
Emulation Read Record
[…]

 -George

* Re: Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
  2018-10-11 15:54         ` George Dunlap
@ 2018-10-11 16:15           ` Jan Beulich
  2018-10-11 16:33             ` George Dunlap
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2018-10-11 16:15 UTC (permalink / raw)
  To: george.dunlap; +Cc: Andrew Cooper, Paul Durrant, xen-devel

>>> On 11.10.18 at 17:54, <George.Dunlap@citrix.com> wrote:

>> On Oct 2, 2018, at 1:47 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> 
>>>>> On 02.10.18 at 12:51, <andrew.cooper3@citrix.com> wrote:
>> 
>>> This doesn't behave like real hardware, and definitely doesn't behave as
>>> named - "struct hvmemul_cache" is simply false.  If it were named
>>> hvmemul_psc (or some other variation on Paging Structure Cache) then it
>>> wouldn't be so bad, as the individual levels do make more sense in that
>>> context
>> 
>> As previously pointed out (without any suggestion coming back from
>> you), I chose the name "cache" for the lack of a better term. However,
>> I certainly disagree with naming it PSC or some such, as its level zero
>> is intentionally there to be eventually used for non-paging-structure
>> data.
> 
> I can think of lots of descriptive names which could yield unique 
> three-letter acronyms:
> 
> Logical Read Sequence
> Logical Read Series
> Logical Read Record
> Read Consistency Structure
> Consistent Read Structure
> Consistent Read Record
> Emulation Read Record
> […]

Well, I'm not sure LRS, LRR, RCS, CRS, CRR, or ERR would be
easily recognizable as what they stand for. To be honest I'd
prefer a non-acronym. Did you see my consideration towards
"latch"?

Jan



* Re: Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
  2018-10-11 16:15           ` Jan Beulich
@ 2018-10-11 16:33             ` George Dunlap
  2018-10-12  6:32               ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: George Dunlap @ 2018-10-11 16:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Paul Durrant, xen-devel



> On Oct 11, 2018, at 5:15 PM, Jan Beulich <JBeulich@suse.com> wrote:
> 
>>>> On 11.10.18 at 17:54, <George.Dunlap@citrix.com> wrote:
> 
>>> On Oct 2, 2018, at 1:47 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>> 
>>>>>> On 02.10.18 at 12:51, <andrew.cooper3@citrix.com> wrote:
>>> 
>>>> This doesn't behave like real hardware, and definitely doesn't behave as
>>>> named - "struct hvmemul_cache" is simply false.  If it were named
>>>> hvmemul_psc (or some other variation on Paging Structure Cache) then it
>>>> wouldn't be so bad, as the individual levels do make more sense in that
>>>> context
>>> 
>>> As previously pointed out (without any suggestion coming back from
>>> you), I chose the name "cache" for the lack of a better term. However,
>>> I certainly disagree with naming it PSC or some such, as its level zero
>>> is intentionally there to be eventually used for non-paging-structure
>>> data.
>> 
>> I can think of lots of descriptive names which could yield unique 
>> three-letter acronyms:
>> 
>> Logical Read Sequence
>> Logical Read Series
>> Logical Read Record
>> Read Consistency Structure
>> Consistent Read Structure
>> Consistent Read Record
>> Emulation Read Record
>> […]
> 
> Well, I'm not sure LRS, LRR, RCS, CRS, CRR, or ERR would be
> easily recognizable as what they stand for. To be honest I'd
> prefer a non-acronym. Did you see my consideration towards
> "latch”?

Of course not; that’s why you put the long form name in a comment near the declaration. :-)

I don’t think I’ve personally used “latch” with that meaning very frequently (at least not in the last 10 years), so to me it sounds a bit obscure.  I would probably go with something else myself but I don’t object to it.

 -George

* Re: Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
  2018-10-02 10:51     ` Andrew Cooper
  2018-10-02 12:47       ` Jan Beulich
  2018-10-11  6:51       ` Jan Beulich
@ 2018-10-11 17:36       ` George Dunlap
  2 siblings, 0 replies; 48+ messages in thread
From: George Dunlap @ 2018-10-11 17:36 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Paul Durrant, George Dunlap, Jan Beulich



> On Oct 2, 2018, at 11:51 AM, Andrew Cooper <Andrew.Cooper3@citrix.com> wrote:
> 
> On 02/10/18 11:36, Jan Beulich wrote:
>>>>> On 25.09.18 at 16:14, <JBeulich@suse.com> wrote:
>>> Emulation requiring device model assistance uses a form of instruction
>>> re-execution, assuming that the second (and any further) pass takes
>>> exactly the same path. This is a valid assumption as far as use of CPU
>>> registers goes (as those can't change without any other instruction
>>> executing in between), but is wrong for memory accesses. In particular
>>> it has been observed that Windows might page out buffers underneath
>>> an instruction currently under emulation (hitting between two passes).
>>> If the first pass translated a linear address successfully, any subsequent
>>> pass needs to do so too, yielding the exact same translation.
>>> 
>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>> a form of "paging structure cache") to make sure above described
>>> assumption holds. This is a very simplistic implementation for now: Only
>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>> 
>>> There's also some seemingly unrelated cleanup here which was found
>>> desirable on the way.
>>> 
>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>> 3: x86/HVM: implement memory read caching
>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>> 
>>> As for v2, I'm omitting "VMX: correct PDPTE load checks" from v3, as I
>>> can't currently find enough time to carry out the requested further
>>> rework.
>> Andrew, George?
> 
> You've not fixed anything from my concerns with v1.
> 
> This doesn't behave like real hardware, and definitely doesn't behave as
> named - "struct hvmemul_cache" is simply false.  If it were named
> hvmemul_psc (or some other variation on Paging Structure Cache) then it
> wouldn't be so bad, as the individual levels do make more sense in that
> context (not that it would make the behaviour any closer to how hardware
> actually works).
> 
> I'm also not overly happy with the conditional nature of the caching, or
> that it isn't a transparent read-through cache.  This leads to a large
> amount of boilerplate code for every user.

What I’m hearing from you are three basic objections:

1. Although it’s closer to real hardware in some ways, it’s still pretty far away

2. The name “cache” is confusing

3. Having it non-transparent adds a lot of boilerplate code to the places that do need it.

#2 is easily dealt with.  The other two are reasons to look for better options, but not reasons to reject Jan’s series if other improvements are a lot of extra work (or it’s not clear they’re better).

Since this is a bug fix, unless you can show that Jan’s series introduces worse bugs, I think Jan’s request that you either 1) fix it yourself by 4.12, or 2) acquiesce to this series (or something close to it) being accepted is reasonable.

If you want to say, “I won’t Ack it but I won’t object if someone else does”, then I’ll get to it when I get a chance (hopefully before November).

 -George

* Re: Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching
  2018-10-11 16:33             ` George Dunlap
@ 2018-10-12  6:32               ` Jan Beulich
  0 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2018-10-12  6:32 UTC (permalink / raw)
  To: george.dunlap; +Cc: Andrew Cooper, Paul Durrant, xen-devel

>>> On 11.10.18 at 18:33, <George.Dunlap@citrix.com> wrote:

> 
>> On Oct 11, 2018, at 5:15 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> 
>>>>> On 11.10.18 at 17:54, <George.Dunlap@citrix.com> wrote:
>> 
>>>> On Oct 2, 2018, at 1:47 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> 
>>>>>>> On 02.10.18 at 12:51, <andrew.cooper3@citrix.com> wrote:
>>>> 
>>>>> This doesn't behave like real hardware, and definitely doesn't behave as
>>>>> named - "struct hvmemul_cache" is simply false.  If it were named
>>>>> hvmemul_psc (or some other variation on Paging Structure Cache) then it
>>>>> wouldn't be so bad, as the individual levels do make more sense in that
>>>>> context
>>>> 
>>>> As previously pointed out (without any suggestion coming back from
>>>> you), I chose the name "cache" for the lack of a better term. However,
>>>> I certainly disagree with naming it PSC or some such, as its level zero
>>>> is intentionally there to be eventually used for non-paging-structure
>>>> data.
>>> 
>>> I can think of lots of descriptive names which could yield unique 
>>> three-letter acronyms:
>>> 
>>> Logical Read Sequence
>>> Logical Read Series
>>> Logical Read Record
>>> Read Consistency Structure
>>> Consistent Read Structure
>>> Consistent Read Record
>>> Emulation Read Record
>>> […]
>> 
>> Well, I'm not sure LRS, LRR, RCS, CRS, CRR, or ERR would be
>> easily recognizable as what they stand for. To be honest I'd
>> prefer a non-acronym. Did you see my consideration towards
>> "latch”?
> 
> Of course not; that’s why you put the long form name in a comment near the 
> declaration. :-)

Of course I would, but I don't think this would help. You don't
want to always go back to the declaration (in a header) when
you look at a function (in a .c file) using the type. Such names
should be at least halfway self-explanatory.

Jan



* Re: [PATCH v2 0/4] x86/HVM: implement memory read caching
  2018-09-11 13:10 [PATCH v2 0/4] x86/HVM: implement memory read caching Jan Beulich
                   ` (4 preceding siblings ...)
  2018-09-25 14:14 ` [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
@ 2018-10-12 13:55 ` Andrew Cooper
  2018-10-12 14:19   ` Jan Beulich
                     ` (2 more replies)
  5 siblings, 3 replies; 48+ messages in thread
From: Andrew Cooper @ 2018-10-12 13:55 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Paul Durrant

On 11/09/18 14:10, Jan Beulich wrote:
> Emulation requiring device model assistance uses a form of instruction
> re-execution, assuming that the second (and any further) pass takes
> exactly the same path. This is a valid assumption as far as use of CPU
> registers goes (as those can't change without any other instruction
> executing in between), but is wrong for memory accesses. In particular
> it has been observed that Windows might page out buffers underneath
> an instruction currently under emulation (hitting between two passes).
> If the first pass translated a linear address successfully, any subsequent
> pass needs to do so too, yielding the exact same translation.
>
> Introduce a cache (used just by guest page table accesses for now, i.e.
> a form of "paging structure cache") to make sure above described
> assumption holds. This is a very simplistic implementation for now: Only
> exact matches are satisfied (no overlaps or partial reads or anything).
>
> There's also some seemingly unrelated cleanup here which was found
> desirable on the way.
>
> 1: x86/mm: add optional cache to GLA->GFN translation
> 2: x86/mm: use optional cache in guest_walk_tables()
> 3: x86/HVM: implement memory read caching
> 4: x86/HVM: prefill cache with PDPTEs when possible
>
> "VMX: correct PDPTE load checks" is omitted from v2, as I can't
> currently find enough time to carry out the requested further
> rework.

Following the x86 call, I've had some thoughts and suggestions about how
to make this work in a reasonable way, without resorting to the full
caching approach.

First and foremost, I'd like to recommend against trying to combine the fix
for repeated PDPTR reading, and repeated PTE reading.  While they are
both repeated reading problems, one really is a knobbly corner case of
32bit PAE paging, and one is a general emulation problem.  Fixing these
problems independently makes the result rather more simple, and far
closer to how real CPUs work.

For naming, how about "access once" in place of cache?  This is the best
description of the purpose I can come up with.

Next, there should be a single hvmemul_read_once(gaddr, bytes, ...)
(name subject to improvement), which does a transparent read-through of
the "access once cache" in terms of a single flag guest physical address
space.  This allows individual callers to opt into using the access-once
semantics, and doesn't hoist them with the substantial boilerplate of
the sole correct way to use this interface.  Furthermore, this behaviour
has the same semantics as the correct longer term fix.
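
A rough sketch of the shape such a helper could take (struct layout,
signature, and error handling here are placeholders rather than a
real interface; the store would presumably live in struct
hvm_emulate_ctxt rather than being passed around explicitly):

    struct once_rec {
        paddr_t gpa;
        unsigned int size;
        uint64_t data;
    };

    struct once_store {
        unsigned int nents;
        struct once_rec ents[16];
    };

    static int hvmemul_read_once(struct once_store *s, paddr_t gpa,
                                 void *buf, unsigned int bytes)
    {
        unsigned int i;

        for ( i = 0; i < s->nents; ++i )
            if ( s->ents[i].gpa == gpa && s->ents[i].size == bytes )
            {
                /* Re-execution: hand back the value seen first time. */
                memcpy(buf, &s->ents[i].data, bytes);
                return X86EMUL_OKAY;
            }

        /* First pass: do the real read, then record it. */
        if ( hvm_copy_from_guest_phys(buf, gpa, bytes) != HVMTRANS_okay )
            return X86EMUL_UNHANDLEABLE;

        if ( s->nents < ARRAY_SIZE(s->ents) &&
             bytes <= sizeof(s->ents[0].data) )
        {
            s->ents[s->nents].gpa = gpa;
            s->ents[s->nents].size = bytes;
            memcpy(&s->ents[s->nents].data, buf, bytes);
            ++s->nents;
        }

        return X86EMUL_OKAY;
    }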

That alone should fix the windows issue, because there is no chance that
windows will ever page out the PDPTRs.

For the PDPTRs, this corner case is special, and should be handled by
the pagewalk code.  I'm still going to go with my previous suggestion of
having top_map point onto the caller stack.  For the VT-x case, the
values can be pulled straight out of the VMCS, while for AMD, the values
can be read through the "access once cache", which matches the behaviour
of being read from memory, but ensures they won't be repeatedly read.

Overall, I think this should be fairly architecturally clean, solve the
underlying bug, and move things in the general direction of the longer
term goal, even if it doesn't get all the way there in one step.

~Andrew


* Re: [PATCH v2 0/4] x86/HVM: implement memory read caching
  2018-10-12 13:55 ` [PATCH v2 " Andrew Cooper
@ 2018-10-12 14:19   ` Jan Beulich
  2018-10-18 15:20     ` George Dunlap
  2018-11-09 10:17   ` Jan Beulich
  2019-02-14 15:14   ` Jan Beulich
  2 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2018-10-12 14:19 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Paul Durrant

>>> On 12.10.18 at 15:55, <andrew.cooper3@citrix.com> wrote:
> On 11/09/18 14:10, Jan Beulich wrote:
>> Emulation requiring device model assistance uses a form of instruction
>> re-execution, assuming that the second (and any further) pass takes
>> exactly the same path. This is a valid assumption as far as use of CPU
>> registers goes (as those can't change without any other instruction
>> executing in between), but is wrong for memory accesses. In particular
>> it has been observed that Windows might page out buffers underneath
>> an instruction currently under emulation (hitting between two passes).
>> If the first pass translated a linear address successfully, any subsequent
>> pass needs to do so too, yielding the exact same translation.
>>
>> Introduce a cache (used just by guest page table accesses for now, i.e.
>> a form of "paging structure cache") to make sure above described
>> assumption holds. This is a very simplistic implementation for now: Only
>> exact matches are satisfied (no overlaps or partial reads or anything).
>>
>> There's also some seemingly unrelated cleanup here which was found
>> desirable on the way.
>>
>> 1: x86/mm: add optional cache to GLA->GFN translation
>> 2: x86/mm: use optional cache in guest_walk_tables()
>> 3: x86/HVM: implement memory read caching
>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>
>> "VMX: correct PDPTE load checks" is omitted from v2, as I can't
>> currently find enough time to carry out the requested further
>> rework.
> 
> Following the x86 call, I've had some thoughts and suggestions about how
> to make this work in a reasonable way, without resorting to the full
> caching approach.

Thanks, but one question before I start thinking about this in
more detail: Before writing this, did you read my mail from the
11th? I ask because what you suggest does not look to match
the behavior I've described there as what I think it ought to be.

Jan




* Re: [PATCH v2 0/4] x86/HVM: implement memory read caching
  2018-10-12 14:19   ` Jan Beulich
@ 2018-10-18 15:20     ` George Dunlap
  2019-05-07 16:22         ` [Xen-devel] " George Dunlap
  0 siblings, 1 reply; 48+ messages in thread
From: George Dunlap @ 2018-10-18 15:20 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: George Dunlap, xen-devel, Paul Durrant

On 10/12/2018 03:19 PM, Jan Beulich wrote:
>>>> On 12.10.18 at 15:55, <andrew.cooper3@citrix.com> wrote:
>> On 11/09/18 14:10, Jan Beulich wrote:
>>> Emulation requiring device model assistance uses a form of instruction
>>> re-execution, assuming that the second (and any further) pass takes
>>> exactly the same path. This is a valid assumption as far as use of CPU
>>> registers goes (as those can't change without any other instruction
>>> executing in between), but is wrong for memory accesses. In particular
>>> it has been observed that Windows might page out buffers underneath
>>> an instruction currently under emulation (hitting between two passes).
>>> If the first pass translated a linear address successfully, any subsequent
>>> pass needs to do so too, yielding the exact same translation.
>>>
>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>> a form of "paging structure cache") to make sure above described
>>> assumption holds. This is a very simplistic implementation for now: Only
>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>>
>>> There's also some seemingly unrelated cleanup here which was found
>>> desirable on the way.
>>>
>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>> 3: x86/HVM: implement memory read caching
>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>>
>>> "VMX: correct PDPTE load checks" is omitted from v2, as I can't
>>> currently find enough time to carry out the requested further
>>> rework.
>>
>> Following the x86 call, I've had some thoughts and suggestions about how
>> to make this work in a reasonable way, without resorting to the full
>> caching approach.
> 
> Thanks, but one question before I start thinking about this in
> more detail: Before writing this, did you read my mail from the
> 11th? I ask because what you suggest does not look to match
> the behavior I've described there as what I think it ought to be.

I'm taking this off my to-review queue for now then -- let me know if
you need me to review it anyway.

 -George


* Re: [PATCH v2 0/4] x86/HVM: implement memory read caching
  2018-10-12 13:55 ` [PATCH v2 " Andrew Cooper
  2018-10-12 14:19   ` Jan Beulich
@ 2018-11-09 10:17   ` Jan Beulich
  2019-02-14 15:14   ` Jan Beulich
  2 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2018-11-09 10:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, xen-devel, Paul Durrant

>>> On 12.10.18 at 15:55, <andrew.cooper3@citrix.com> wrote:

While I haven't heard back on my earlier reply, here are nevertheless a few
more thoughts.

> First and foremost, I'd like to recommend against trying to combine the fix
> for repeated PDPTR reading, and repeated PTE reading.  While they are
> both repeated reading problems, one really is a knobbly corner case of
> 32bit PAE paging, and one is a general emulation problem.  Fixing these
> problems independently makes the result rather more simple, and far
> closer to how real CPUs work.

That's an option, but the approach currently chosen seems to fit
better with how guest_walk_tables() works.

> Next, there should be a single hvmemul_read_once(gaddr, bytes, ...)
> (name subject to improvement), which does a transparent read-through of
> the "access once cache" in terms of a single flag guest physical address
> space.  This allows individual callers to opt into using the access-once
> semantics, and doesn't hoist them with the substantial boilerplate of
> the sole correct way to use this interface.  Furthermore, this behaviour
> has the same semantics as the correct longer term fix.

Except that guest_walk_tables() doesn't invoke any hvmemul_*()
routines, nor does it get passed a struct x86_emulate_ops. And
it shouldn't, or else it couldn't be used for shadowed PV guests
anymore. If anything, we'd have to replace all guest memory
reads in guest_walk_tables() with calls to a caller-provided function
(and writes similarly of course).

A further problem with the suggested approach is the A/D bit
updates: A generic read-once model would, as explained before,
require each logically separate read to have its own entry. Since
replay, when all memory accesses produce identical results to the
initial run, will produce the exact same access pattern, simply
maintaining a counter to index into the array (reset every time a
replay round starts) would do. The A/D bit updates, however,
need to update their respective slots, and hence need to have
a way to identify which slot it is. That's not going to be
transparent, no matter what you do, as the read function
would need to return a token to be passed to the respective
write one.
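
Expressed as code (all names invented, just to show the shape of the
interface), this would look roughly like:

    /* The read side hands back the slot it used as a token;
     * set_ad_bits() style updates pass that token so exactly the
     * recorded slot gets refreshed, and a replay then observes the
     * A/D bits as set by the first run. */
    struct walk_rec {
        paddr_t  gpa;
        uint64_t val;
    };

    struct walk_store {
        unsigned int cur;          /* reset when a replay starts */
        unsigned int nents;
        struct walk_rec ents[8];
    };

    static bool walk_read(struct walk_store *s, paddr_t gpa,
                          uint64_t *val, unsigned int *token)
    {
        *token = s->cur++;

        if ( *token < s->nents )   /* replay: reuse the record */
        {
            *val = s->ents[*token].val;
            return true;
        }

        if ( s->nents >= ARRAY_SIZE(s->ents) ||
             hvm_copy_from_guest_phys(val, gpa, sizeof(*val)) !=
             HVMTRANS_okay )
            return false;

        s->ents[s->nents++] = (struct walk_rec){ .gpa = gpa, .val = *val };
        return true;
    }

    /* The actual guest memory update (e.g. the CMPXCHG setting A/D
     * bits) is omitted here; the point is refreshing the recorded
     * slot identified by the token. */
    static void walk_update(struct walk_store *s, unsigned int token,
                            uint64_t val)
    {
        s->ents[token].val = val;
    }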

The positive side effect of going this route would be that it would
get us at least closer towards allowing guest page tables to live
in MMIO space (because an explicit dependency on being able to
map page tables would go away).

Jan




* Re: [PATCH v2 0/4] x86/HVM: implement memory read caching
  2018-10-12 13:55 ` [PATCH v2 " Andrew Cooper
  2018-10-12 14:19   ` Jan Beulich
  2018-11-09 10:17   ` Jan Beulich
@ 2019-02-14 15:14   ` Jan Beulich
  2 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2019-02-14 15:14 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: George Dunlap, Paul Durrant

>>> On 12.10.18 at 15:55, <andrew.cooper3@citrix.com> wrote:
> On 11/09/18 14:10, Jan Beulich wrote:
>> Emulation requiring device model assistance uses a form of instruction
>> re-execution, assuming that the second (and any further) pass takes
>> exactly the same path. This is a valid assumption as far as use of CPU
>> registers goes (as those can't change without any other instruction
>> executing in between), but is wrong for memory accesses. In particular
>> it has been observed that Windows might page out buffers underneath
>> an instruction currently under emulation (hitting between two passes).
>> If the first pass translated a linear address successfully, any subsequent
>> pass needs to do so too, yielding the exact same translation.
>>
>> Introduce a cache (used just by guest page table accesses for now, i.e.
>> a form of "paging structure cache") to make sure above described
>> assumption holds. This is a very simplistic implementation for now: Only
>> exact matches are satisfied (no overlaps or partial reads or anything).
>>
>> There's also some seemingly unrelated cleanup here which was found
>> desirable on the way.
>>
>> 1: x86/mm: add optional cache to GLA->GFN translation
>> 2: x86/mm: use optional cache in guest_walk_tables()
>> 3: x86/HVM: implement memory read caching
>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>
>> "VMX: correct PDPTE load checks" is omitted from v2, as I can't
>> currently find enough time to carry out the requested further
>> rework.
> 
> Following the x86 call, I've had some thoughts and suggestions about how
> to make this work in a reasonable way, without resorting to the full
> caching approach.

Actually I now think I'll go that full caching (in the sense meant here)
route, see below, but of course things could easily be limited to just
the page table accesses in a first step.

> First and foremost, I'd like to recommend against trying to combine the fix
> for repeated PDPTR reading, and repeated PTE reading.  While they are
> both repeated reading problems, one really is a knobbly corner case of
> 32bit PAE paging, and one is a general emulation problem.  Fixing these
> problems independently makes the result rather more simple, and far
> closer to how real CPUs work.

Well, that's a separate patch anyway. If you disagree with just
that part, then it can be easily left out (or be further refined),
especially since it's (intentionally) last in the series.

> For naming, how about "access once" in place of cache?  This is the best
> description of the purpose I can come up with.

For descriptive purposes that's fine, but I wouldn't want to introduce
a hvmemul_access_once structure type. I also didn't particularly like
the suggestions George made. If "cached" weren't already taken to
mean something else in computing, I still think this would be the right
term to use. Right now I think I'll call the thing hvmemul_data_buf or
some such, along the lines of insn_buf[] used to hold the fetched
instruction bytes.

> Next, there should be a single hvmemul_read_once(gaddr, bytes, ...)
> (name subject to improvement), which does a transparent read-through of
> the "access once cache" in terms of a single flag guest physical address
> space.  This allows individual callers to opt into using the access-once
> semantics, and doesn't hoist them with the substantial boilerplate of
> the sole correct way to use this interface.  Furthermore, this behaviour
> has the same semantics as the correct longer term fix.

That's perhaps an option, but it has the downside of needing to split
apart the combined linear->phys translation and memory access
done by linear_{read,write}(). Plus (again see below)
hvmemul_read_once() doesn't fit the page walking code very well,
due to the open coded reads there, which in turn are helpful for the
logic setting the A/D bits. I'd like to do this slightly differently (in
the page walking code in particular closer to what the implementation
here was, i.e. with buffered data maintenance separated from the
actual memory accesses). But before I go and try to re-implement
this (in as transparent a way as possible while at the same time
covering all memory accesses, not just page table reads) I'd like to
settle on a few principles, after having thought more about our
options, and after having done a few experiments.

The goal of this, as before, is not a performance improvement, but a
correctness guarantee: Upon re-execution of a previously partly
emulated insn (requiring e.g. device model assistance), all memory
accesses done so far have to return (reads) or take (writes) the
exact same data. This means that the original guest memory accesses
have to record their addresses and data. A fundamental assumption
here is that no instruction may update register state if it may
subsequently require re-execution (see below for where this is violated).

The transparency goal suggests that maintenance of this data should be
done at as low a layer as possible. However, not all guest memory
accesses result from instruction emulation, and any that don't may not
consume (and preferably should also not produce) buffered contents. That is,
maintenance has to either live above the
hvm_copy_{to,from}_guest_linear() layer (as that's what also backs
copy_{from,to}_user_hvm()), or there needs to be an indicator (function
parameter or struct vcpu flag) whether to access the buffered data.

As you didn't like the function parameter approach, I assume the
approach to take is the struct vcpu flag one, which could be set and
cleared in _hvm_emulate_one() along the lines of what the current
version of the series does for the num_ents field.
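
In code, and heavily abridged (the flag name is made up, and the real
function does considerably more than shown), the bracketing would
amount to:

    static int _hvm_emulate_one(struct hvm_emulate_ctxt *hvmemul_ctxt,
                                const struct x86_emulate_ops *ops)
    {
        struct vcpu *curr = current;
        int rc;

        curr->arch.hvm.data_buf_active = true;   /* hypothetical flag */
        rc = x86_emulate(&hvmemul_ctxt->ctxt, ops);
        curr->arch.hvm.data_buf_active = false;

        /* __hvm_copy() would consult / update the buffered data only
         * while the flag is set, leaving e.g. hypercall copies alone. */
        return rc;
    }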

Since guest page walks also need to be taken care of (in fact these are
the primary goal, as that's where the issue to fix was noticed), and
since guest_walk_tables() doesn't use lower level routines to access
guest memory, the function either needs to be converted to use them, or
the buffer accesses need to be open coded there, as was done in this
series up to now. Using the lower level routines would in particular
complicate set_ad_bits(), so I'm not currently intending to go that
route.

Now on to the intended behavior of the buffer: If we want to mimic
actual hardware behavior as closely as possible, we have to distinguish
the ordinary data cache from TLB and paging structure caches. While the
former is coherent, the latter aren't. This in particular means e.g.
that while two distinct L<n> page table reads from the same physical
address may return the same data (because there can't be any
invalidation between the start of a single insn's execution and its
completion), the same may not be appropriate for two independent
ordinary data reads. Specifically insns with multiple independent
memory operands (CMPS{B,D,Q,W}, V{,P}GATHER*) can observe different data
for the different accesses, even if all addresses are the same, as long
as another CPU manages to modify the memory location(s) quickly enough.

If we want to retain this behavior in emulation, we'll have to tag
memory accesses such that during re-execution correct association with
their earlier matching accesses is possible, and such that distinct
accesses would not consume data buffered by earlier ones. This tagging
can, I think, still be done transparently at the layer where the
buffered data gets maintained, except that memory accesses resulting
from page walks need to be recognizable, so they won't be treated the
same way. But as per above those accesses go through independent code
paths anyway, so this wouldn't be difficult to arrange for.

But there's one possible caveat here: The way gathers currently get
handled in the insn emulator, X86EMUL_RETRY may have two different
meanings: It may either identify that a read is pending dm assistance,
and hence re-execution will occur without exiting to guest context, or
it may identify what we'd call a continuation if this was a hypercall.
In either case the code updates certain register state (specifically
the register used as operation mask) to record successfully completed
parts. While this is fine when exiting back to guest context, it would
confuse the buffered data access logic, as the access pattern would no
longer match that seen during the first execution run.

As an aside I'd like to note that I may have mis-interpreted the
description of how gathers work, which would mean that the continuation-
like exit-to-guest behavior is wrong altogether. I've requested
clarification from Intel. Should this need to change, we'll run into
capacity problems with struct hvm_vcpu_io's mmio_cache[]. But in the end
I hope to also be able to do away with mmio_cache[].

> For the PDPTRs, this corner case is special, and should be handled by
> the pagewalk code.  I'm still going to go with my previous suggestion of
> having top_map point onto the caller stack.  For the VT-x case, the
> values can be pulled straight out of the VMCS, while for AMD, the values
> can be read through the "access once cache", which matches the behaviour
> of being read from memory, but ensures they won't be repeatedly read.

What is different here from how the last patch already implements it?

Jan



* Re: [PATCH v2 0/4] x86/HVM: implement memory read caching
@ 2019-05-07 16:22         ` George Dunlap
  0 siblings, 0 replies; 48+ messages in thread
From: George Dunlap @ 2019-05-07 16:22 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: George Dunlap, xen-devel, Paul Durrant

On 10/18/18 4:20 PM, George Dunlap wrote:
> On 10/12/2018 03:19 PM, Jan Beulich wrote:
>>>>> On 12.10.18 at 15:55, <andrew.cooper3@citrix.com> wrote:
>>> On 11/09/18 14:10, Jan Beulich wrote:
>>>> Emulation requiring device model assistance uses a form of instruction
>>>> re-execution, assuming that the second (and any further) pass takes
>>>> exactly the same path. This is a valid assumption as far as use of CPU
>>>> registers goes (as those can't change without any other instruction
>>>> executing in between), but is wrong for memory accesses. In particular
>>>> it has been observed that Windows might page out buffers underneath
>>>> an instruction currently under emulation (hitting between two passes).
>>>> If the first pass translated a linear address successfully, any subsequent
>>>> pass needs to do so too, yielding the exact same translation.
>>>>
>>>> Introduce a cache (used just by guest page table accesses for now, i.e.
>>>> a form of "paging structure cache") to make sure above described
>>>> assumption holds. This is a very simplistic implementation for now: Only
>>>> exact matches are satisfied (no overlaps or partial reads or anything).
>>>>
>>>> There's also some seemingly unrelated cleanup here which was found
>>>> desirable on the way.
>>>>
>>>> 1: x86/mm: add optional cache to GLA->GFN translation
>>>> 2: x86/mm: use optional cache in guest_walk_tables()
>>>> 3: x86/HVM: implement memory read caching
>>>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>>>
>>>> "VMX: correct PDPTE load checks" is omitted from v2, as I can't
>>>> currently find enough time to carry out the requested further
>>>> rework.
>>>
>>> Following the x86 call, I've had some thoughts and suggestions about how
>>> to make this work in a reasonable way, without resorting to the full
>>> caching approach.
>>
>> Thanks, but one question before I start thinking about this in
>> more detail: Before writing this, did you read my mail from the
>> 11th? I ask because what you suggest does not look to match
>> the behavior I've described there as what I think it ought to be.
> 
> I'm taking this off my to-review queue for now then -- let me know if
> you need me to review it anyway.

BTW I'm now deleting this from my inbox to avoid clutter.  Jan, at such
time as you want me to review it, please ping or re-send.

 -George



* Re: [PATCH v2 0/4] x86/HVM: implement memory read caching
@ 2019-05-07 16:26           ` Jan Beulich
  0 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2019-05-07 16:26 UTC (permalink / raw)
  To: george.dunlap; +Cc: George Dunlap, Andrew Cooper, Paul Durrant, xen-devel

>>> On 07.05.19 at 18:22, <george.dunlap@citrix.com> wrote:
> BTW I'm now deleting this from my inbox to avoid clutter.  Jan, at such
> time as you want me to review it, please ping or re-send.

That's fine - this is meant to be re-worked. Just didn't get to it yet.

Jan





end of thread, other threads:[~2019-05-07 16:26 UTC | newest]

Thread overview: 48+ messages
2018-09-11 13:10 [PATCH v2 0/4] x86/HVM: implement memory read caching Jan Beulich
2018-09-11 13:13 ` [PATCH v2 1/4] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
2018-09-11 13:40   ` Razvan Cojocaru
2018-09-19 15:09   ` Wei Liu
2018-09-11 13:14 ` [PATCH v2 2/4] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
2018-09-11 16:17   ` Paul Durrant
2018-09-12  8:30     ` Jan Beulich
2018-09-19 15:50   ` Wei Liu
2018-09-11 13:15 ` [PATCH v2 3/4] x86/HVM: implement memory read caching Jan Beulich
2018-09-11 16:20   ` Paul Durrant
2018-09-12  8:38     ` Jan Beulich
2018-09-12  8:49       ` Paul Durrant
2018-09-19 15:57   ` Wei Liu
2018-09-20  6:39     ` Jan Beulich
2018-09-20 15:42       ` Wei Liu
2018-09-11 13:16 ` [PATCH v2 4/4] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
2018-09-13  6:30   ` Tian, Kevin
2018-09-13  8:55     ` Jan Beulich
2018-09-14  2:18       ` Tian, Kevin
2018-09-14  8:12         ` Jan Beulich
2018-09-25 14:14 ` [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
2018-09-25 14:23   ` [PATCH v3 1/4] x86/mm: add optional cache to GLA->GFN translation Jan Beulich
2018-09-25 14:24   ` [PATCH v3 2/4] x86/mm: use optional cache in guest_walk_tables() Jan Beulich
2018-09-25 14:25   ` [PATCH v3 3/4] x86/HVM: implement memory read caching Jan Beulich
2018-09-26 11:05     ` Wei Liu
2018-10-02 10:39     ` Ping: " Jan Beulich
2018-10-02 13:53       ` Boris Ostrovsky
2018-10-09  5:19       ` Tian, Kevin
2018-09-25 14:26   ` [PATCH v3 4/4] x86/HVM: prefill cache with PDPTEs when possible Jan Beulich
2018-09-25 14:38     ` Paul Durrant
2018-10-02 10:36   ` Ping: [PATCH v3 0/4] x86/HVM: implement memory read caching Jan Beulich
2018-10-02 10:51     ` Andrew Cooper
2018-10-02 12:47       ` Jan Beulich
2018-10-11 15:54         ` George Dunlap
2018-10-11 16:15           ` Jan Beulich
2018-10-11 16:33             ` George Dunlap
2018-10-12  6:32               ` Jan Beulich
2018-10-11  6:51       ` Jan Beulich
2018-10-11 17:36       ` George Dunlap
2018-10-12 13:55 ` [PATCH v2 " Andrew Cooper
2018-10-12 14:19   ` Jan Beulich
2018-10-18 15:20     ` George Dunlap
2019-05-07 16:22       ` George Dunlap
2019-05-07 16:22         ` [Xen-devel] " George Dunlap
2019-05-07 16:26         ` Jan Beulich
2019-05-07 16:26           ` [Xen-devel] " Jan Beulich
2018-11-09 10:17   ` Jan Beulich
2019-02-14 15:14   ` Jan Beulich
