* [PATCHv4 0/2] x86/ept: reduce translation invalidation impact
@ 2015-12-14 14:39 David Vrabel
  2015-12-14 14:39 ` [PATCHv4 1/2] x86/ept: invalidate guest physical mappings on VMENTER David Vrabel
  2015-12-14 14:39 ` [PATCHv4 2/2] x86/ept: defer the invalidation until the p2m lock is released David Vrabel
  0 siblings, 2 replies; 10+ messages in thread
From: David Vrabel @ 2015-12-14 14:39 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Jan Beulich

This series improves the performance of EPT by further reducing the
impact of translation invalidations (ept_sync_domain()), by deferring
the invalidations until the p2m write lock is released.

Prior to this change a 16 VCPU guest could not be successfully
migrated on an (admittedly slow) 160 PCPU box because the p2m write
lock was held for such extended periods of time.  This starved the
read lock needed (by the toolstack) to map the domain's memory,
triggering the watchdog.

After this change a 64 VCPU guest could be successfully migrated.

ept_sync_domain() is very expensive because:

a) it uses on_selected_cpus() and the IPI cost can be particularly
   high for a multi-socket machine.

b) on_selected_cpus() is serialized by its own spin lock.

On this particular box, ept_sync_domain() could take ~3-5 ms.

Simply using a fair rw lock was not sufficient to resolve this (though
it was an improvement), as the cost of the ept_sync_domain() calls was
still delaying the read locks enough for the watchdog to trigger (the
toolstack maps a batch of 1024 GFNs at a time, which means trying to
acquire the p2m read lock 1024 times).
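
As a rough, worst-case illustration (using only the figures above, not
an additional measurement): if each of those 1024 read-lock acquisitions
has to wait out a full ~3-5 ms ept_sync_domain(), a single batch can be
stalled for roughly

    1024 * 3-5 ms  ~=  3-5 seconds

which is easily long enough to trip the watchdog.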

Changes in v4:

- __ept_sync_domain() is a no-op -- invalidates are done before VMENTER.
- initialize ept->invalidate to all ones so the initial invalidate is
  always done.

Changes in v3:

- Drop already applied "x86/ept: remove unnecessary sync after
  resolving misconfigured entries".
- Replaced "mm: don't free pages until mm locks are released" with
  "x86/ept: invalidate guest physical mappings on VMENTER".

Changes in v2:

- Use a per-p2m (not per-CPU) list for page table pages to be freed.
- Hold the write lock while updating the synced_mask.

David

* [PATCHv4 1/2] x86/ept: invalidate guest physical mappings on VMENTER
  2015-12-14 14:39 [PATCHv4 0/2] x86/ept: reduce translation invalidation impact David Vrabel
@ 2015-12-14 14:39 ` David Vrabel
  2015-12-14 14:52   ` Andrew Cooper
  2015-12-15 15:59   ` George Dunlap
  2015-12-14 14:39 ` [PATCHv4 2/2] x86/ept: defer the invalidation until the p2m lock is released David Vrabel
  1 sibling, 2 replies; 10+ messages in thread
From: David Vrabel @ 2015-12-14 14:39 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Jan Beulich

If a guest allocates a page and the tlbflush_timestamp on the page
indicates that a TLB flush of the previous owner is required, only the
linear and combined mappings are invalidated.  The guest-physical
mappings are not invalidated.

This is currently safe because the EPT code ensures that the
guest-physical and combined mappings are invalidated /before/ the page
is freed.  However, this prevents us from deferring the EPT invalidate
until after the page is freed (e.g., to defer the invalidate until the
p2m locks are released).

The TLB flush that may be done after allocating a page already causes
the original guest to VMEXIT, thus on VMENTER we can do an INVEPT if
one is pending.

This means __ept_sync_domain() need not do anything, and thus the
on_selected_cpus() call does not need to wait as long.

ept_sync_domain() now marks all PCPUs as needing to be invalidated,
including PCPUs that the domain has not run on.  We still only IPI
those PCPUs that are active so this does not result in any more INVEPT
calls.

We do not attempt to track when PCPUs may have cached translations
because the only safe way to clear this per-CPU state is if
immediately after an invalidate the PCPU is not active (i.e., the PCPU
is not in d->domain_dirty_cpumask).  Since we only invalidate on
VMENTER or by IPIing active PCPUs this can never happen.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
v4:
- __ept_sync_domain() is a no-op -- invalidates are done before VMENTER.
- initialize ept->invalidate to all ones so the initial invalidate is
  always done.
---
 xen/arch/x86/hvm/vmx/vmx.c         | 22 ++++++++++------------
 xen/arch/x86/mm/p2m-ept.c          | 29 ++++++++++++++---------------
 xen/include/asm-x86/hvm/vmx/vmcs.h |  3 +--
 3 files changed, 25 insertions(+), 29 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index f7c5e4f..cca35f2 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -744,24 +744,12 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
 
 static void vmx_ctxt_switch_to(struct vcpu *v)
 {
-    struct domain *d = v->domain;
     unsigned long old_cr4 = read_cr4(), new_cr4 = mmu_cr4_features;
-    struct ept_data *ept_data = &p2m_get_hostp2m(d)->ept;
 
     /* HOST_CR4 in VMCS is always mmu_cr4_features. Sync CR4 now. */
     if ( old_cr4 != new_cr4 )
         write_cr4(new_cr4);
 
-    if ( paging_mode_hap(d) )
-    {
-        unsigned int cpu = smp_processor_id();
-        /* Test-and-test-and-set this CPU in the EPT-is-synced mask. */
-        if ( !cpumask_test_cpu(cpu, ept_get_synced_mask(ept_data)) &&
-             !cpumask_test_and_set_cpu(cpu,
-                                       ept_get_synced_mask(ept_data)) )
-            __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept_data), 0);
-    }
-
     vmx_restore_guest_msrs(v);
     vmx_restore_dr(v);
 }
@@ -3507,6 +3495,16 @@ void vmx_vmenter_helper(const struct cpu_user_regs *regs)
     if ( unlikely(need_flush) )
         vpid_sync_all();
 
+    if ( paging_mode_hap(curr->domain) )
+    {
+        struct ept_data *ept = &p2m_get_hostp2m(curr->domain)->ept;
+        unsigned int cpu = smp_processor_id();
+
+        if ( cpumask_test_cpu(cpu, ept->invalidate)
+             && cpumask_test_and_clear_cpu(cpu, ept->invalidate) )
+            __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept), 0);
+    }
+
  out:
     HVMTRACE_ND(VMENTRY, 0, 1/*cycles*/, 0, 0, 0, 0, 0, 0, 0);
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index eef0372..6e0cf89 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -1089,9 +1089,10 @@ static void ept_memory_type_changed(struct p2m_domain *p2m)
 
 static void __ept_sync_domain(void *info)
 {
-    struct ept_data *ept = &((struct p2m_domain *)info)->ept;
-
-    __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept), 0);
+    /*
+     * The invalidate will be done before VMENTER (see
+     * vmx_vmenter_helper()).
+     */
 }
 
 void ept_sync_domain(struct p2m_domain *p2m)
@@ -1107,16 +1108,10 @@ void ept_sync_domain(struct p2m_domain *p2m)
     if ( nestedhvm_enabled(d) && !p2m_is_nestedp2m(p2m) )
         p2m_flush_nestedp2m(d);
 
-    /*
-     * Flush active cpus synchronously. Flush others the next time this domain
-     * is scheduled onto them. We accept the race of other CPUs adding to
-     * the ept_synced mask before on_selected_cpus() reads it, resulting in
-     * unnecessary extra flushes, to avoid allocating a cpumask_t on the stack.
-     */
-    cpumask_and(ept_get_synced_mask(ept),
-                d->domain_dirty_cpumask, &cpu_online_map);
+    /* May need to invalidate on all PCPUs. */
+    cpumask_setall(ept->invalidate);
 
-    on_selected_cpus(ept_get_synced_mask(ept),
+    on_selected_cpus(d->domain_dirty_cpumask,
                      __ept_sync_domain, p2m, 1);
 }
 
@@ -1182,10 +1177,14 @@ int ept_p2m_init(struct p2m_domain *p2m)
         p2m->flush_hardware_cached_dirty = ept_flush_pml_buffers;
     }
 
-    if ( !zalloc_cpumask_var(&ept->synced_mask) )
+    if ( !zalloc_cpumask_var(&ept->invalidate) )
         return -ENOMEM;
 
-    on_each_cpu(__ept_sync_domain, p2m, 1);
+    /*
+     * Assume an initial invalidation is required, in case an EP4TA is
+     * reused.
+     */
+    cpumask_setall(ept->invalidate);
 
     return 0;
 }
@@ -1193,7 +1192,7 @@ int ept_p2m_init(struct p2m_domain *p2m)
 void ept_p2m_uninit(struct p2m_domain *p2m)
 {
     struct ept_data *ept = &p2m->ept;
-    free_cpumask_var(ept->synced_mask);
+    free_cpumask_var(ept->invalidate);
 }
 
 static void ept_dump_p2m_table(unsigned char key)
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index a8d4d5b..e778d86 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -67,7 +67,7 @@ struct ept_data {
         };
         u64 eptp;
     };
-    cpumask_var_t synced_mask;
+    cpumask_var_t invalidate;
 };
 
 #define _VMX_DOMAIN_PML_ENABLED    0
@@ -97,7 +97,6 @@ struct pi_desc {
 #define ept_get_wl(ept)   ((ept)->ept_wl)
 #define ept_get_asr(ept)  ((ept)->asr)
 #define ept_get_eptp(ept) ((ept)->eptp)
-#define ept_get_synced_mask(ept) ((ept)->synced_mask)
 
 #define NR_PML_ENTRIES   512
 
-- 
2.1.4

* [PATCHv4 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2015-12-14 14:39 [PATCHv4 0/2] x86/ept: reduce translation invalidation impact David Vrabel
  2015-12-14 14:39 ` [PATCHv4 1/2] x86/ept: invalidate guest physical mappings on VMENTER David Vrabel
@ 2015-12-14 14:39 ` David Vrabel
  2015-12-15 16:00   ` George Dunlap
  2015-12-16 17:55   ` David Vrabel
  1 sibling, 2 replies; 10+ messages in thread
From: David Vrabel @ 2015-12-14 14:39 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Jan Beulich

Holding the p2m lock while calling ept_sync_domain() is very expensive
since it does an on_selected_cpus() call.  IPIs on multi-socket machines
can be very slow and on_selected_cpus() is serialized.

Defer the invalidate until the p2m lock is released.  Since the processor
may cache partial translations, we also need to make sure any page table
pages to be freed are not freed until the invalidate is complete.  Such
pages are temporarily stored in a list.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
v2:
- use per-p2m list for deferred pages.
- update synced_mask while holding write lock.
---
 xen/arch/x86/mm/mm-locks.h | 23 ++++++++++++++--------
 xen/arch/x86/mm/p2m-ept.c  | 48 +++++++++++++++++++++++++++++++++++++---------
 xen/arch/x86/mm/p2m.c      | 18 +++++++++++++++++
 xen/include/asm-x86/p2m.h  |  7 +++++++
 4 files changed, 79 insertions(+), 17 deletions(-)

diff --git a/xen/arch/x86/mm/mm-locks.h b/xen/arch/x86/mm/mm-locks.h
index 76c7217..b5eb560 100644
--- a/xen/arch/x86/mm/mm-locks.h
+++ b/xen/arch/x86/mm/mm-locks.h
@@ -263,14 +263,21 @@ declare_mm_lock(altp2mlist)
  */
 
 declare_mm_rwlock(altp2m);
-#define p2m_lock(p)                         \
-{                                           \
-    if ( p2m_is_altp2m(p) )                 \
-        mm_write_lock(altp2m, &(p)->lock);  \
-    else                                    \
-        mm_write_lock(p2m, &(p)->lock);     \
-}
-#define p2m_unlock(p)         mm_write_unlock(&(p)->lock);
+#define p2m_lock(p)                             \
+    do {                                        \
+        if ( p2m_is_altp2m(p) )                 \
+            mm_write_lock(altp2m, &(p)->lock);  \
+        else                                    \
+            mm_write_lock(p2m, &(p)->lock);     \
+        (p)->defer_flush++;                     \
+    } while (0)
+#define p2m_unlock(p)                                                   \
+    do {                                                                \
+        if ( --(p)->defer_flush == 0 && (p)->need_flush )               \
+            (p)->flush_and_unlock(p);                                   \
+        else                                                            \
+            mm_write_unlock(&(p)->lock);                                \
+    } while (0)
 #define gfn_lock(p,g,o)       p2m_lock(p)
 #define gfn_unlock(p,g,o)     p2m_unlock(p)
 #define p2m_read_lock(p)      mm_read_lock(p2m, &(p)->lock)
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 6e0cf89..25575a5 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -263,7 +263,7 @@ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
         unmap_domain_page(epte);
     }
     
-    p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));
+    p2m_free_ptp_defer(p2m, mfn_to_page(ept_entry->mfn));
 }
 
 static bool_t ept_split_super_page(struct p2m_domain *p2m,
@@ -1095,24 +1095,53 @@ static void __ept_sync_domain(void *info)
      */
 }
 
-void ept_sync_domain(struct p2m_domain *p2m)
+static void ept_sync_domain_prepare(struct p2m_domain *p2m)
 {
     struct domain *d = p2m->domain;
     struct ept_data *ept = &p2m->ept;
-    /* Only if using EPT and this domain has some VCPUs to dirty. */
-    if ( !paging_mode_hap(d) || !d->vcpu || !d->vcpu[0] )
-        return;
-
-    ASSERT(local_irq_is_enabled());
 
     if ( nestedhvm_enabled(d) && !p2m_is_nestedp2m(p2m) )
         p2m_flush_nestedp2m(d);
 
     /* May need to invalidate on all PCPUs. */
     cpumask_setall(ept->invalidate);
+}
+
+static void ept_sync_domain_mask(struct p2m_domain *p2m, const cpumask_t *mask)
+{
+    on_selected_cpus(mask, __ept_sync_domain, p2m, 1);
+}
+
+void ept_sync_domain(struct p2m_domain *p2m)
+{
+    struct domain *d = p2m->domain;
+
+    /* Only if using EPT and this domain has some VCPUs to dirty. */
+    if ( !paging_mode_hap(d) || !d->vcpu || !d->vcpu[0] )
+        return;
+
+    ept_sync_domain_prepare(p2m);
+
+    if ( p2m->defer_flush )
+    {
+        p2m->need_flush = 1;
+        return;
+    }
+    p2m->need_flush = 0;
+
+    ept_sync_domain_mask(p2m, d->domain_dirty_cpumask);
+}
+
+static void ept_flush_and_unlock(struct p2m_domain *p2m)
+{
+    PAGE_LIST_HEAD(deferred_pages);
+
+    page_list_move(&deferred_pages, &p2m->deferred_pages);
+
+    mm_write_unlock(&p2m->lock);
 
-    on_selected_cpus(d->domain_dirty_cpumask,
-                     __ept_sync_domain, p2m, 1);
+    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
+    p2m_free_ptp_list(p2m, &deferred_pages);
 }
 
 static void ept_enable_pml(struct p2m_domain *p2m)
@@ -1163,6 +1192,7 @@ int ept_p2m_init(struct p2m_domain *p2m)
     p2m->change_entry_type_range = ept_change_entry_type_range;
     p2m->memory_type_changed = ept_memory_type_changed;
     p2m->audit_p2m = NULL;
+    p2m->flush_and_unlock = ept_flush_and_unlock;
 
     /* Set the memory type used when accessing EPT paging structures. */
     ept->ept_mt = EPT_DEFAULT_MT;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index ed0bbd7..502bdad 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -65,6 +65,7 @@ static int p2m_initialise(struct domain *d, struct p2m_domain *p2m)
     mm_lock_init(&p2m->pod.lock);
     INIT_LIST_HEAD(&p2m->np2m_list);
     INIT_PAGE_LIST_HEAD(&p2m->pages);
+    INIT_PAGE_LIST_HEAD(&p2m->deferred_pages);
     INIT_PAGE_LIST_HEAD(&p2m->pod.super);
     INIT_PAGE_LIST_HEAD(&p2m->pod.single);
 
@@ -504,6 +505,23 @@ void p2m_free_ptp(struct p2m_domain *p2m, struct page_info *pg)
     return;
 }
 
+void p2m_free_ptp_defer(struct p2m_domain *p2m, struct page_info *pg)
+{
+    page_list_del(pg, &p2m->pages);
+    page_list_add(pg, &p2m->deferred_pages);
+}
+
+void p2m_free_ptp_list(struct p2m_domain *p2m, struct page_list_head *list)
+{
+    struct page_info *pg, *tmp;
+
+    page_list_for_each_safe(pg, tmp, list)
+    {
+        page_list_del(pg, list);
+        p2m->domain->arch.paging.free_page(p2m->domain, pg);
+    }
+}
+
 /*
  * Allocate a new p2m table for a domain.
  *
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index fa46dd9..ee03997 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -261,6 +261,11 @@ struct p2m_domain {
                                           unsigned long gfn, l1_pgentry_t *p,
                                           l1_pgentry_t new, unsigned int level);
     long               (*audit_p2m)(struct p2m_domain *p2m);
+    void               (*flush_and_unlock)(struct p2m_domain *p2m);
+
+    struct page_list_head deferred_pages;
+    unsigned int defer_flush;
+    bool_t need_flush;
 
     /* Default P2M access type for each page in the the domain: new pages,
      * swapped in pages, cleared pages, and pages that are ambiguously
@@ -688,6 +693,8 @@ static inline bool_t p2m_mem_access_sanity_check(struct domain *d)
 
 struct page_info *p2m_alloc_ptp(struct p2m_domain *p2m, unsigned long type);
 void p2m_free_ptp(struct p2m_domain *p2m, struct page_info *pg);
+void p2m_free_ptp_defer(struct p2m_domain *p2m, struct page_info *pg);
+void p2m_free_ptp_list(struct p2m_domain *p2m, struct page_list_head *list);
 
 /* Directly set a p2m entry: only for use by p2m code. Does not need
  * a call to put_gfn afterwards/ */
-- 
2.1.4

* Re: [PATCHv4 1/2] x86/ept: invalidate guest physical mappings on VMENTER
  2015-12-14 14:39 ` [PATCHv4 1/2] x86/ept: invalidate guest physical mappings on VMENTER David Vrabel
@ 2015-12-14 14:52   ` Andrew Cooper
  2015-12-14 15:00     ` David Vrabel
  2015-12-15 15:59   ` George Dunlap
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Cooper @ 2015-12-14 14:52 UTC (permalink / raw)
  To: David Vrabel, xen-devel
  Cc: George Dunlap, Kevin Tian, Tim Deegan, Jan Beulich, Jun Nakajima

On 14/12/15 14:39, David Vrabel wrote:
> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
> index eef0372..6e0cf89 100644
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -1089,9 +1089,10 @@ static void ept_memory_type_changed(struct p2m_domain *p2m)
>  
>  static void __ept_sync_domain(void *info)
>  {
> -    struct ept_data *ept = &((struct p2m_domain *)info)->ept;
> -
> -    __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept), 0);
> +    /*
> +     * The invalidate will be done before VMENTER (see
> +     * vmx_vmenter_helper()).
> +     */
>  }
>  
>  void ept_sync_domain(struct p2m_domain *p2m)
> @@ -1107,16 +1108,10 @@ void ept_sync_domain(struct p2m_domain *p2m)
>      if ( nestedhvm_enabled(d) && !p2m_is_nestedp2m(p2m) )
>          p2m_flush_nestedp2m(d);
>  
> -    /*
> -     * Flush active cpus synchronously. Flush others the next time this domain
> -     * is scheduled onto them. We accept the race of other CPUs adding to
> -     * the ept_synced mask before on_selected_cpus() reads it, resulting in
> -     * unnecessary extra flushes, to avoid allocating a cpumask_t on the stack.
> -     */
> -    cpumask_and(ept_get_synced_mask(ept),
> -                d->domain_dirty_cpumask, &cpu_online_map);
> +    /* May need to invalidate on all PCPUs. */
> +    cpumask_setall(ept->invalidate);
>  
> -    on_selected_cpus(ept_get_synced_mask(ept),
> +    on_selected_cpus(d->domain_dirty_cpumask,
>                       __ept_sync_domain, p2m, 1);

You can drop __ept_sync_domain() entirely by using
smp_send_event_check_mask() instead, which is a no-op IPI (and slightly
less overhead while holding the IPI lock).

>  }
>  
> @@ -1182,10 +1177,14 @@ int ept_p2m_init(struct p2m_domain *p2m)
>          p2m->flush_hardware_cached_dirty = ept_flush_pml_buffers;
>      }
>  
> -    if ( !zalloc_cpumask_var(&ept->synced_mask) )
> +    if ( !zalloc_cpumask_var(&ept->invalidate) )
>          return -ENOMEM;
>  
> -    on_each_cpu(__ept_sync_domain, p2m, 1);
> +    /*
> +     * Assume an initial invalidation is required, in case an EP4TA is
> +     * reused.
> +     */
> +    cpumask_setall(ept->invalidate);
>  
>      return 0;
>  }
> @@ -1193,7 +1192,7 @@ int ept_p2m_init(struct p2m_domain *p2m)
>  void ept_p2m_uninit(struct p2m_domain *p2m)
>  {
>      struct ept_data *ept = &p2m->ept;
> -    free_cpumask_var(ept->synced_mask);
> +    free_cpumask_var(ept->invalidate);
>  }
>  
>  static void ept_dump_p2m_table(unsigned char key)
> diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
> index a8d4d5b..e778d86 100644
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> @@ -67,7 +67,7 @@ struct ept_data {
>          };
>          u64 eptp;
>      };
> -    cpumask_var_t synced_mask;
> +    cpumask_var_t invalidate;

Could you include a small comment here to describe the behaviour?  Perhaps:

/* Whether an INVEPT should be issued on VMENTER? */

~Andrew

* Re: [PATCHv4 1/2] x86/ept: invalidate guest physical mappings on VMENTER
  2015-12-14 14:52   ` Andrew Cooper
@ 2015-12-14 15:00     ` David Vrabel
  2015-12-14 15:09       ` Andrew Cooper
  0 siblings, 1 reply; 10+ messages in thread
From: David Vrabel @ 2015-12-14 15:00 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: George Dunlap, Kevin Tian, Tim Deegan, Jan Beulich, Jun Nakajima

On 14/12/15 14:52, Andrew Cooper wrote:
> On 14/12/15 14:39, David Vrabel wrote:
>> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
>> index eef0372..6e0cf89 100644
>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
[...]
>> +    on_selected_cpus(d->domain_dirty_cpumask,
>>                       __ept_sync_domain, p2m, 1);
> 
> You can drop __ept_sync_domain() entirely by using
> smp_send_event_check_mask() instead, which is a no-op IPI (and slightly
> less overhead while holding the IPI lock).

We need to wait until the IPI has been handled on the remote PCPUs since
we may immediately free a page table page.  If a VCPU was still running
it may use paging-structure-cache entries referring to that freed page.
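
To make the ordering concrete, here is a sketch of the requirement only
(reusing the helpers introduced in patch 2, not new code):

    /* INVEPT must have completed on every dirty PCPU first (the
     * synchronous on_selected_cpus() in ept_sync_domain_mask())... */
    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
    /* ...only then is it safe to free/reuse the deferred PT pages. */
    p2m_free_ptp_list(p2m, &deferred_pages);

With a fire-and-forget IPI the free could race against a PCPU that is
still using stale paging-structure-cache entries.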

>> -    cpumask_var_t synced_mask;
>> +    cpumask_var_t invalidate;
> 
> Could you include a small comment here to describe the behaviour?  Perhaps:
> 
> /* Whether an INVEPT should be issued on VMENTER? */

Good point.

David

* Re: [PATCHv4 1/2] x86/ept: invalidate guest physical mappings on VMENTER
  2015-12-14 15:00     ` David Vrabel
@ 2015-12-14 15:09       ` Andrew Cooper
  2015-12-14 15:39         ` George Dunlap
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Cooper @ 2015-12-14 15:09 UTC (permalink / raw)
  To: David Vrabel, xen-devel
  Cc: George Dunlap, Kevin Tian, Tim Deegan, Jan Beulich, Jun Nakajima

On 14/12/15 15:00, David Vrabel wrote:
> On 14/12/15 14:52, Andrew Cooper wrote:
>> On 14/12/15 14:39, David Vrabel wrote:
>>> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
>>> index eef0372..6e0cf89 100644
>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>> +++ b/xen/arch/x86/mm/p2m-ept.c
> [...]
>>> +    on_selected_cpus(d->domain_dirty_cpumask,
>>>                       __ept_sync_domain, p2m, 1);
>> You can drop __ept_sync_domain() entirely by using
>> smp_send_event_check_mask() instead, which is a no-op IPI (and slightly
>> less overhead while holding the IPI lock).
> We need to wait until the IPI has been handled on the remote PCPUs since
> we may immediately free a page table page.  If a VCPU was still running
> it may use paging-structure-cache entries referring to that freed page.

Ah yes.  Better not do that.

As some future cleanup it would be nice to be able to do this without
specifying a function, but that is a minor detail and not relevant to
this patch.

~Andrew

* Re: [PATCHv4 1/2] x86/ept: invalidate guest physical mappings on VMENTER
  2015-12-14 15:09       ` Andrew Cooper
@ 2015-12-14 15:39         ` George Dunlap
  0 siblings, 0 replies; 10+ messages in thread
From: George Dunlap @ 2015-12-14 15:39 UTC (permalink / raw)
  To: Andrew Cooper, David Vrabel, xen-devel
  Cc: George Dunlap, Kevin Tian, Tim Deegan, Jan Beulich, Jun Nakajima

On 14/12/15 15:09, Andrew Cooper wrote:
> On 14/12/15 15:00, David Vrabel wrote:
>> On 14/12/15 14:52, Andrew Cooper wrote:
>>> On 14/12/15 14:39, David Vrabel wrote:
>>>> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
>>>> index eef0372..6e0cf89 100644
>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> [...]
>>>> +    on_selected_cpus(d->domain_dirty_cpumask,
>>>>                       __ept_sync_domain, p2m, 1);
>>> You can drop __ept_sync_domain() entirely by using
>>> smp_send_event_check_mask() instead, which is a no-op IPI (and slightly
>>> less overhead while holding the IPI lock).
>> We need to wait until the IPI has been handled on the remote PCPUs since
>> we may immediately free a page table page.  If a VCPU was still running
>> it may use paging-structure-cache entries referring to that freed page.
> 
> Ah yes.  Better not do that.
> 
> As some future cleanup it would be nice to be able to do this without
> specifying a function, but that is a minor detail and not relevant to
> this patch.

This is exactly the conclusion I came to when reviewing v3 of this
series.  Actually calling __ept_sync_domain() is unnecessary, but at
that point you might as well.

 -George

* Re: [PATCHv4 1/2] x86/ept: invalidate guest physical mappings on VMENTER
  2015-12-14 14:39 ` [PATCHv4 1/2] x86/ept: invalidate guest physical mappings on VMENTER David Vrabel
  2015-12-14 14:52   ` Andrew Cooper
@ 2015-12-15 15:59   ` George Dunlap
  1 sibling, 0 replies; 10+ messages in thread
From: George Dunlap @ 2015-12-15 15:59 UTC (permalink / raw)
  To: David Vrabel, xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, Jan Beulich

On 14/12/15 14:39, David Vrabel wrote:
> If a guest allocates a page and the tlbflush_timestamp on the page
> indicates that a TLB flush of the previous owner is required, only the
> linear and combined mappings are invalidated.  The guest-physical
> mappings are not invalidated.
> 
> This is currently safe because the EPT code ensures that the
> guest-physical and combined mappings are invalidated /before/ the page
> is freed.  However, this prevents us from deferring the EPT invalidate
> until after the page is freed (e.g., to defer the invalidate until the
> p2m locks are released).
> 
> The TLB flush that may be done after allocating a page already causes
> the original guest to VMEXIT, thus on VMENTER we can do an INVEPT if
> one is pending.
> 
> This means __ept_sync_domain() need not do anything, and thus the
> on_selected_cpus() call does not need to wait as long.
> 
> ept_sync_domain() now marks all PCPUs as needing to be invalidated,
> including PCPUs that the domain has not run on.  We still only IPI
> those PCPUs that are active so this does not result in any more INVEPT
> calls.
> 
> We do not attempt to track when PCPUs may have cached translations
> because the only safe way to clear this per-CPU state is if
> immediately after an invalidate the PCPU is not active (i.e., the PCPU
> is not in d->domain_dirty_cpumask).  Since we only invalidate on
> VMENTER or by IPIing active PCPUs this can never happen.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

> ---
> v4:
> - __ept_sync_domain() is a no-op -- invalidates are done before VMENTER.
> - initialize ept->invalidate to all ones so the initial invalidate is
>   always done.
> ---
>  xen/arch/x86/hvm/vmx/vmx.c         | 22 ++++++++++------------
>  xen/arch/x86/mm/p2m-ept.c          | 29 ++++++++++++++---------------
>  xen/include/asm-x86/hvm/vmx/vmcs.h |  3 +--
>  3 files changed, 25 insertions(+), 29 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index f7c5e4f..cca35f2 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -744,24 +744,12 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
>  
>  static void vmx_ctxt_switch_to(struct vcpu *v)
>  {
> -    struct domain *d = v->domain;
>      unsigned long old_cr4 = read_cr4(), new_cr4 = mmu_cr4_features;
> -    struct ept_data *ept_data = &p2m_get_hostp2m(d)->ept;
>  
>      /* HOST_CR4 in VMCS is always mmu_cr4_features. Sync CR4 now. */
>      if ( old_cr4 != new_cr4 )
>          write_cr4(new_cr4);
>  
> -    if ( paging_mode_hap(d) )
> -    {
> -        unsigned int cpu = smp_processor_id();
> -        /* Test-and-test-and-set this CPU in the EPT-is-synced mask. */
> -        if ( !cpumask_test_cpu(cpu, ept_get_synced_mask(ept_data)) &&
> -             !cpumask_test_and_set_cpu(cpu,
> -                                       ept_get_synced_mask(ept_data)) )
> -            __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept_data), 0);
> -    }
> -
>      vmx_restore_guest_msrs(v);
>      vmx_restore_dr(v);
>  }
> @@ -3507,6 +3495,16 @@ void vmx_vmenter_helper(const struct cpu_user_regs *regs)
>      if ( unlikely(need_flush) )
>          vpid_sync_all();
>  
> +    if ( paging_mode_hap(curr->domain) )
> +    {
> +        struct ept_data *ept = &p2m_get_hostp2m(curr->domain)->ept;
> +        unsigned int cpu = smp_processor_id();
> +
> +        if ( cpumask_test_cpu(cpu, ept->invalidate)
> +             && cpumask_test_and_clear_cpu(cpu, ept->invalidate) )
> +            __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept), 0);
> +    }
> +
>   out:
>      HVMTRACE_ND(VMENTRY, 0, 1/*cycles*/, 0, 0, 0, 0, 0, 0, 0);
>  
> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
> index eef0372..6e0cf89 100644
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -1089,9 +1089,10 @@ static void ept_memory_type_changed(struct p2m_domain *p2m)
>  
>  static void __ept_sync_domain(void *info)
>  {
> -    struct ept_data *ept = &((struct p2m_domain *)info)->ept;
> -
> -    __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept), 0);
> +    /*
> +     * The invalidate will be done before VMENTER (see
> +     * vmx_vmenter_helper()).
> +     */
>  }
>  
>  void ept_sync_domain(struct p2m_domain *p2m)
> @@ -1107,16 +1108,10 @@ void ept_sync_domain(struct p2m_domain *p2m)
>      if ( nestedhvm_enabled(d) && !p2m_is_nestedp2m(p2m) )
>          p2m_flush_nestedp2m(d);
>  
> -    /*
> -     * Flush active cpus synchronously. Flush others the next time this domain
> -     * is scheduled onto them. We accept the race of other CPUs adding to
> -     * the ept_synced mask before on_selected_cpus() reads it, resulting in
> -     * unnecessary extra flushes, to avoid allocating a cpumask_t on the stack.
> -     */
> -    cpumask_and(ept_get_synced_mask(ept),
> -                d->domain_dirty_cpumask, &cpu_online_map);
> +    /* May need to invalidate on all PCPUs. */
> +    cpumask_setall(ept->invalidate);
>  
> -    on_selected_cpus(ept_get_synced_mask(ept),
> +    on_selected_cpus(d->domain_dirty_cpumask,
>                       __ept_sync_domain, p2m, 1);
>  }
>  
> @@ -1182,10 +1177,14 @@ int ept_p2m_init(struct p2m_domain *p2m)
>          p2m->flush_hardware_cached_dirty = ept_flush_pml_buffers;
>      }
>  
> -    if ( !zalloc_cpumask_var(&ept->synced_mask) )
> +    if ( !zalloc_cpumask_var(&ept->invalidate) )
>          return -ENOMEM;
>  
> -    on_each_cpu(__ept_sync_domain, p2m, 1);
> +    /*
> +     * Assume an initial invalidation is required, in case an EP4TA is
> +     * reused.
> +     */
> +    cpumask_setall(ept->invalidate);
>  
>      return 0;
>  }
> @@ -1193,7 +1192,7 @@ int ept_p2m_init(struct p2m_domain *p2m)
>  void ept_p2m_uninit(struct p2m_domain *p2m)
>  {
>      struct ept_data *ept = &p2m->ept;
> -    free_cpumask_var(ept->synced_mask);
> +    free_cpumask_var(ept->invalidate);
>  }
>  
>  static void ept_dump_p2m_table(unsigned char key)
> diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
> index a8d4d5b..e778d86 100644
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> @@ -67,7 +67,7 @@ struct ept_data {
>          };
>          u64 eptp;
>      };
> -    cpumask_var_t synced_mask;
> +    cpumask_var_t invalidate;
>  };
>  
>  #define _VMX_DOMAIN_PML_ENABLED    0
> @@ -97,7 +97,6 @@ struct pi_desc {
>  #define ept_get_wl(ept)   ((ept)->ept_wl)
>  #define ept_get_asr(ept)  ((ept)->asr)
>  #define ept_get_eptp(ept) ((ept)->eptp)
> -#define ept_get_synced_mask(ept) ((ept)->synced_mask)
>  
>  #define NR_PML_ENTRIES   512
>  
> 

* Re: [PATCHv4 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2015-12-14 14:39 ` [PATCHv4 2/2] x86/ept: defer the invalidation until the p2m lock is released David Vrabel
@ 2015-12-15 16:00   ` George Dunlap
  2015-12-16 17:55   ` David Vrabel
  1 sibling, 0 replies; 10+ messages in thread
From: George Dunlap @ 2015-12-15 16:00 UTC (permalink / raw)
  To: David Vrabel, xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, Jan Beulich

On 14/12/15 14:39, David Vrabel wrote:
> Holding the p2m lock while calling ept_sync_domain() is very expensive
> since it does an on_selected_cpus() call.  IPIs on multi-socket machines
> can be very slow and on_selected_cpus() is serialized.
> 
> Defer the invalidate until the p2m lock is released.  Since the processor
> may cache partial translations, we also need to make sure any page table
> pages to be freed are not freed until the invalidate is complete.  Such
> pages are temporarily stored in a list.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

> ---
> v2:
> - use per-p2m list for deferred pages.
> - update synced_mask while holding write lock.
> ---
>  xen/arch/x86/mm/mm-locks.h | 23 ++++++++++++++--------
>  xen/arch/x86/mm/p2m-ept.c  | 48 +++++++++++++++++++++++++++++++++++++---------
>  xen/arch/x86/mm/p2m.c      | 18 +++++++++++++++++
>  xen/include/asm-x86/p2m.h  |  7 +++++++
>  4 files changed, 79 insertions(+), 17 deletions(-)
> 
> diff --git a/xen/arch/x86/mm/mm-locks.h b/xen/arch/x86/mm/mm-locks.h
> index 76c7217..b5eb560 100644
> --- a/xen/arch/x86/mm/mm-locks.h
> +++ b/xen/arch/x86/mm/mm-locks.h
> @@ -263,14 +263,21 @@ declare_mm_lock(altp2mlist)
>   */
>  
>  declare_mm_rwlock(altp2m);
> -#define p2m_lock(p)                         \
> -{                                           \
> -    if ( p2m_is_altp2m(p) )                 \
> -        mm_write_lock(altp2m, &(p)->lock);  \
> -    else                                    \
> -        mm_write_lock(p2m, &(p)->lock);     \
> -}
> -#define p2m_unlock(p)         mm_write_unlock(&(p)->lock);
> +#define p2m_lock(p)                             \
> +    do {                                        \
> +        if ( p2m_is_altp2m(p) )                 \
> +            mm_write_lock(altp2m, &(p)->lock);  \
> +        else                                    \
> +            mm_write_lock(p2m, &(p)->lock);     \
> +        (p)->defer_flush++;                     \
> +    } while (0)
> +#define p2m_unlock(p)                                                   \
> +    do {                                                                \
> +        if ( --(p)->defer_flush == 0 && (p)->need_flush )               \
> +            (p)->flush_and_unlock(p);                                   \
> +        else                                                            \
> +            mm_write_unlock(&(p)->lock);                                \
> +    } while (0)
>  #define gfn_lock(p,g,o)       p2m_lock(p)
>  #define gfn_unlock(p,g,o)     p2m_unlock(p)
>  #define p2m_read_lock(p)      mm_read_lock(p2m, &(p)->lock)
> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
> index 6e0cf89..25575a5 100644
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -263,7 +263,7 @@ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
>          unmap_domain_page(epte);
>      }
>      
> -    p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));
> +    p2m_free_ptp_defer(p2m, mfn_to_page(ept_entry->mfn));
>  }
>  
>  static bool_t ept_split_super_page(struct p2m_domain *p2m,
> @@ -1095,24 +1095,53 @@ static void __ept_sync_domain(void *info)
>       */
>  }
>  
> -void ept_sync_domain(struct p2m_domain *p2m)
> +static void ept_sync_domain_prepare(struct p2m_domain *p2m)
>  {
>      struct domain *d = p2m->domain;
>      struct ept_data *ept = &p2m->ept;
> -    /* Only if using EPT and this domain has some VCPUs to dirty. */
> -    if ( !paging_mode_hap(d) || !d->vcpu || !d->vcpu[0] )
> -        return;
> -
> -    ASSERT(local_irq_is_enabled());
>  
>      if ( nestedhvm_enabled(d) && !p2m_is_nestedp2m(p2m) )
>          p2m_flush_nestedp2m(d);
>  
>      /* May need to invalidate on all PCPUs. */
>      cpumask_setall(ept->invalidate);
> +}
> +
> +static void ept_sync_domain_mask(struct p2m_domain *p2m, const cpumask_t *mask)
> +{
> +    on_selected_cpus(mask, __ept_sync_domain, p2m, 1);
> +}
> +
> +void ept_sync_domain(struct p2m_domain *p2m)
> +{
> +    struct domain *d = p2m->domain;
> +
> +    /* Only if using EPT and this domain has some VCPUs to dirty. */
> +    if ( !paging_mode_hap(d) || !d->vcpu || !d->vcpu[0] )
> +        return;
> +
> +    ept_sync_domain_prepare(p2m);
> +
> +    if ( p2m->defer_flush )
> +    {
> +        p2m->need_flush = 1;
> +        return;
> +    }
> +    p2m->need_flush = 0;
> +
> +    ept_sync_domain_mask(p2m, d->domain_dirty_cpumask);
> +}
> +
> +static void ept_flush_and_unlock(struct p2m_domain *p2m)
> +{
> +    PAGE_LIST_HEAD(deferred_pages);
> +
> +    page_list_move(&deferred_pages, &p2m->deferred_pages);
> +
> +    mm_write_unlock(&p2m->lock);
>  
> -    on_selected_cpus(d->domain_dirty_cpumask,
> -                     __ept_sync_domain, p2m, 1);
> +    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
> +    p2m_free_ptp_list(p2m, &deferred_pages);
>  }
>  
>  static void ept_enable_pml(struct p2m_domain *p2m)
> @@ -1163,6 +1192,7 @@ int ept_p2m_init(struct p2m_domain *p2m)
>      p2m->change_entry_type_range = ept_change_entry_type_range;
>      p2m->memory_type_changed = ept_memory_type_changed;
>      p2m->audit_p2m = NULL;
> +    p2m->flush_and_unlock = ept_flush_and_unlock;
>  
>      /* Set the memory type used when accessing EPT paging structures. */
>      ept->ept_mt = EPT_DEFAULT_MT;
> diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
> index ed0bbd7..502bdad 100644
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -65,6 +65,7 @@ static int p2m_initialise(struct domain *d, struct p2m_domain *p2m)
>      mm_lock_init(&p2m->pod.lock);
>      INIT_LIST_HEAD(&p2m->np2m_list);
>      INIT_PAGE_LIST_HEAD(&p2m->pages);
> +    INIT_PAGE_LIST_HEAD(&p2m->deferred_pages);
>      INIT_PAGE_LIST_HEAD(&p2m->pod.super);
>      INIT_PAGE_LIST_HEAD(&p2m->pod.single);
>  
> @@ -504,6 +505,23 @@ void p2m_free_ptp(struct p2m_domain *p2m, struct page_info *pg)
>      return;
>  }
>  
> +void p2m_free_ptp_defer(struct p2m_domain *p2m, struct page_info *pg)
> +{
> +    page_list_del(pg, &p2m->pages);
> +    page_list_add(pg, &p2m->deferred_pages);
> +}
> +
> +void p2m_free_ptp_list(struct p2m_domain *p2m, struct page_list_head *list)
> +{
> +    struct page_info *pg, *tmp;
> +
> +    page_list_for_each_safe(pg, tmp, list)
> +    {
> +        page_list_del(pg, list);
> +        p2m->domain->arch.paging.free_page(p2m->domain, pg);
> +    }
> +}
> +
>  /*
>   * Allocate a new p2m table for a domain.
>   *
> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
> index fa46dd9..ee03997 100644
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -261,6 +261,11 @@ struct p2m_domain {
>                                            unsigned long gfn, l1_pgentry_t *p,
>                                            l1_pgentry_t new, unsigned int level);
>      long               (*audit_p2m)(struct p2m_domain *p2m);
> +    void               (*flush_and_unlock)(struct p2m_domain *p2m);
> +
> +    struct page_list_head deferred_pages;
> +    unsigned int defer_flush;
> +    bool_t need_flush;
>  
>      /* Default P2M access type for each page in the the domain: new pages,
>       * swapped in pages, cleared pages, and pages that are ambiguously
> @@ -688,6 +693,8 @@ static inline bool_t p2m_mem_access_sanity_check(struct domain *d)
>  
>  struct page_info *p2m_alloc_ptp(struct p2m_domain *p2m, unsigned long type);
>  void p2m_free_ptp(struct p2m_domain *p2m, struct page_info *pg);
> +void p2m_free_ptp_defer(struct p2m_domain *p2m, struct page_info *pg);
> +void p2m_free_ptp_list(struct p2m_domain *p2m, struct page_list_head *list);
>  
>  /* Directly set a p2m entry: only for use by p2m code. Does not need
>   * a call to put_gfn afterwards/ */
> 

* Re: [PATCHv4 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2015-12-14 14:39 ` [PATCHv4 2/2] x86/ept: defer the invalidation until the p2m lock is released David Vrabel
  2015-12-15 16:00   ` George Dunlap
@ 2015-12-16 17:55   ` David Vrabel
  1 sibling, 0 replies; 10+ messages in thread
From: David Vrabel @ 2015-12-16 17:55 UTC (permalink / raw)
  To: David Vrabel, xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, Jan Beulich

On 14/12/15 14:39, David Vrabel wrote:
> Holding the p2m lock while calling ept_sync_domain() is very expensive
> since it does an on_selected_cpus() call.  IPIs on multi-socket machines
> can be very slow and on_selected_cpus() is serialized.
> 
> Defer the invalidate until the p2m lock is released.  Since the processor
> may cache partial translations, we also need to make sure any page table
> pages to be freed are not freed until the invalidate is complete.  Such
> pages are temporarily stored in a list.

This breaks PoD because we check for zeroed pages without doing a sync
after the type change.  This allows other VCPUs to write to a page that
is now in the PoD pool.  The write will both be lost and corrupt another
zero page when the page is used to populate another GFN.
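
Roughly the interleaving I have in mind (a sketch of my reading of the
bug, not a traced log):

    zero-check path (p2m lock held)      other vCPU of the same guest
    -------------------------------      ----------------------------
    change p2m type / clear entry
    (INVEPT deferred until unlock)
    check page contents are zero
    reclaim page into the PoD pool
                                         write through stale cached
                                         guest-physical mapping
    later: the "zero" page populates
    another GFN -> write lost, new
    GFN sees non-zero contents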

David
