* [PATCHv6 0/2] x86/ept: reduce translation invalidation impact
@ 2015-12-18 13:50 David Vrabel
  2015-12-18 13:50 ` [PATCHv6 1/2] x86/ept: invalidate guest physical mappings on VMENTER David Vrabel
  2015-12-18 13:50 ` [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released David Vrabel
  0 siblings, 2 replies; 13+ messages in thread
From: David Vrabel @ 2015-12-18 13:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Jan Beulich

This series improves the performance of EPT by further reducing the
impact of translation invalidations (ept_sync_domain()):

a) Deferring invalidations until the p2m write lock is released.

Prior to this change a 16 VCPU guest could not be successfully
migrated on an (admittedly slow) 160 PCPU box because the p2m write
lock was held for such extended periods of time.  This starved the
read lock needed (by the toolstack) to map the domain's memory,
triggering the watchdog.

After this change a 64 VCPU guest could be successfully migrated.

ept_sync_domain() is very expensive because:

a) it uses on_selected_cpus() and the IPI cost can be particularly
   high for a multi-socket machine.

b) on_selected_cpus() is serialized by its own spin lock.

On this particular box, ept_sync_domain() could take ~3-5 ms.

Simply using a fair rw lock was not sufficient to resolve this (but it
was an improvement) as the cost of the ept_sync_domain() calls was
still delaying the read locks enough for the watchdog to trigger (the
toolstack maps a batch of 1024 GFNs at a time, which means trying to
acquire the p2m read lock 1024 times).
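
The crux of the change, collapsed into one place for readability (the
real code is split across mm-locks.h, p2m.c and p2m-ept.c in patch #2
and also handles altp2m and the synchronous-flush cases):

    #define p2m_lock(p)                          \
        do {                                     \
            mm_write_lock(p2m, &(p)->lock);      \
            (p)->defer_flush++;                  \
        } while (0)

    #define p2m_unlock(p)                        \
        do {                                     \
            if ( --(p)->defer_flush == 0 )       \
                p2m_tlb_flush_and_unlock(p);     \
            else                                 \
                mm_write_unlock(&(p)->lock);     \
        } while (0)

    void ept_sync_domain(struct p2m_domain *p2m)
    {
        /* Every PCPU will INVEPT on its next VMENTER (patch #1). */
        cpumask_setall(p2m->ept.invalidate);

        if ( p2m->defer_flush )
            p2m->need_flush = 1;   /* IPI active PCPUs when the lock is dropped */
        else
            ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
    }

p2m_tlb_flush_and_unlock() (see patch #2) drops the lock first and only
then IPIs the active PCPUs, which is what gets the IPI out from under
the p2m write lock.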

Changes in v6:

- Fix performance bug in patch #2.
- Improve comments.

Changes in v5:

- Fix PoD by explicitly doing an invalidation before reclaiming zero
  pages.
- Use the same mechanism for dealing with freeing page table pages.
  This isn't a common path and it's simpler than the deferred list.

Changes in v4:

- __ept_sync_domain() is a no-op -- invalidates are done before VMENTER.
- initialize ept->invalidate to all ones so the initial invalidate is
  always done.

Changes in v3:

- Drop already applied "x86/ept: remove unnecessary sync after
  resolving misconfigured entries".
- Replaced "mm: don't free pages until mm locks are released" with
  "x86/ept: invalidate guest physical mappings on VMENTER".

Changes in v2:

- Use a per-p2m (not per-CPU) list for page table pages to be freed.
- Hold the write lock while updating the synced_mask.

David


* [PATCHv6 1/2] x86/ept: invalidate guest physical mappings on VMENTER
  2015-12-18 13:50 [PATCHv6 0/2] x86/ept: reduce translation invalidation impact David Vrabel
@ 2015-12-18 13:50 ` David Vrabel
  2015-12-18 14:59   ` George Dunlap
  2015-12-20  6:51   ` Tian, Kevin
  2015-12-18 13:50 ` [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released David Vrabel
  1 sibling, 2 replies; 13+ messages in thread
From: David Vrabel @ 2015-12-18 13:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Jan Beulich

If a guest allocates a page and the tlbflush_timestamp on the page
indicates that a TLB flush of the previous owner is required, only the
linear and combined mappings are invalidated.  The guest-physical
mappings are not invalidated.

This is currently safe because the EPT code ensures that the
guest-physical and combined mappings are invalidated /before/ the page
is freed.  However, this prevents us from deferring the EPT invalidate
until after the page is freed (e.g., to defer the invalidate until the
p2m locks are released).

The TLB flush that may be done after allocating a page already causes
the original guest to VMEXIT, thus on VMENTER we can do an INVEPT if
one is pending.

This means __ept_sync_domain() need not do anything and thus the
on_selected_cpus() call does not need to wait for as long.

ept_sync_domain() now marks all PCPUs as needing to be invalidated,
including PCPUs that the domain has not run on.  We still only IPI
those PCPUs that are active so this does not result in any more INVEPT
calls.

We do not attempt to track when PCPUs may have cached translations
because the only safe way to clear this per-CPU state is if
immediately after an invalidate the PCPU is not active (i.e., the PCPU
is not in d->domain_dirty_cpumask).  Since we only invalidate on
VMENTER or by IPIing active PCPUs this can never happen.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
---
v6:
- Add some comments.
- Use test then clear instead of test_and_clear in vmx_vmenter_helper()

v4:
- __ept_sync_domain() is a no-op -- invalidates are done before VMENTER.
- initialize ept->invalidate to all ones so the initial invalidate is
  always done.
---
 xen/arch/x86/hvm/vmx/vmx.c         | 24 ++++++++++++------------
 xen/arch/x86/mm/p2m-ept.c          | 31 ++++++++++++++++++-------------
 xen/include/asm-x86/hvm/vmx/vmcs.h |  4 ++--
 3 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 2922673..23ab737 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -744,24 +744,12 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
 
 static void vmx_ctxt_switch_to(struct vcpu *v)
 {
-    struct domain *d = v->domain;
     unsigned long old_cr4 = read_cr4(), new_cr4 = mmu_cr4_features;
-    struct ept_data *ept_data = &p2m_get_hostp2m(d)->ept;
 
     /* HOST_CR4 in VMCS is always mmu_cr4_features. Sync CR4 now. */
     if ( old_cr4 != new_cr4 )
         write_cr4(new_cr4);
 
-    if ( paging_mode_hap(d) )
-    {
-        unsigned int cpu = smp_processor_id();
-        /* Test-and-test-and-set this CPU in the EPT-is-synced mask. */
-        if ( !cpumask_test_cpu(cpu, ept_get_synced_mask(ept_data)) &&
-             !cpumask_test_and_set_cpu(cpu,
-                                       ept_get_synced_mask(ept_data)) )
-            __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept_data), 0);
-    }
-
     vmx_restore_guest_msrs(v);
     vmx_restore_dr(v);
 }
@@ -3586,6 +3574,18 @@ void vmx_vmenter_helper(const struct cpu_user_regs *regs)
     if ( unlikely(need_flush) )
         vpid_sync_all();
 
+    if ( paging_mode_hap(curr->domain) )
+    {
+        struct ept_data *ept = &p2m_get_hostp2m(curr->domain)->ept;
+        unsigned int cpu = smp_processor_id();
+
+        if ( cpumask_test_cpu(cpu, ept->invalidate) )
+        {
+            cpumask_clear_cpu(cpu, ept->invalidate);
+            __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept), 0);
+        }
+    }
+
  out:
     HVMTRACE_ND(VMENTRY, 0, 1/*cycles*/, 0, 0, 0, 0, 0, 0, 0);
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index eef0372..c094320 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -1089,9 +1089,10 @@ static void ept_memory_type_changed(struct p2m_domain *p2m)
 
 static void __ept_sync_domain(void *info)
 {
-    struct ept_data *ept = &((struct p2m_domain *)info)->ept;
-
-    __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept), 0);
+    /*
+     * The invalidation will be done before VMENTER (see
+     * vmx_vmenter_helper()).
+     */
 }
 
 void ept_sync_domain(struct p2m_domain *p2m)
@@ -1108,15 +1109,15 @@ void ept_sync_domain(struct p2m_domain *p2m)
         p2m_flush_nestedp2m(d);
 
     /*
-     * Flush active cpus synchronously. Flush others the next time this domain
-     * is scheduled onto them. We accept the race of other CPUs adding to
-     * the ept_synced mask before on_selected_cpus() reads it, resulting in
-     * unnecessary extra flushes, to avoid allocating a cpumask_t on the stack.
+     * Need to invalidate on all PCPUs because either:
+     *
+     * a) A VCPU has run and some translations may be cached.
+     * b) A VCPU has not run and the initial invalidation in case
+     *    of an EP4TA reuse is still needed.
      */
-    cpumask_and(ept_get_synced_mask(ept),
-                d->domain_dirty_cpumask, &cpu_online_map);
+    cpumask_setall(ept->invalidate);
 
-    on_selected_cpus(ept_get_synced_mask(ept),
+    on_selected_cpus(d->domain_dirty_cpumask,
                      __ept_sync_domain, p2m, 1);
 }
 
@@ -1182,10 +1183,14 @@ int ept_p2m_init(struct p2m_domain *p2m)
         p2m->flush_hardware_cached_dirty = ept_flush_pml_buffers;
     }
 
-    if ( !zalloc_cpumask_var(&ept->synced_mask) )
+    if ( !zalloc_cpumask_var(&ept->invalidate) )
         return -ENOMEM;
 
-    on_each_cpu(__ept_sync_domain, p2m, 1);
+    /*
+     * Assume an initial invalidation is required, in case an EP4TA is
+     * reused.
+     */
+    cpumask_setall(ept->invalidate);
 
     return 0;
 }
@@ -1193,7 +1198,7 @@ int ept_p2m_init(struct p2m_domain *p2m)
 void ept_p2m_uninit(struct p2m_domain *p2m)
 {
     struct ept_data *ept = &p2m->ept;
-    free_cpumask_var(ept->synced_mask);
+    free_cpumask_var(ept->invalidate);
 }
 
 static void ept_dump_p2m_table(unsigned char key)
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index a8d4d5b..d1496b8 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -67,7 +67,8 @@ struct ept_data {
         };
         u64 eptp;
     };
-    cpumask_var_t synced_mask;
+    /* Set of PCPUs needing an INVEPT before a VMENTER. */
+    cpumask_var_t invalidate;
 };
 
 #define _VMX_DOMAIN_PML_ENABLED    0
@@ -97,7 +98,6 @@ struct pi_desc {
 #define ept_get_wl(ept)   ((ept)->ept_wl)
 #define ept_get_asr(ept)  ((ept)->asr)
 #define ept_get_eptp(ept) ((ept)->eptp)
-#define ept_get_synced_mask(ept) ((ept)->synced_mask)
 
 #define NR_PML_ENTRIES   512
 
-- 
2.1.4


* [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2015-12-18 13:50 [PATCHv6 0/2] x86/ept: reduce translation invalidation impact David Vrabel
  2015-12-18 13:50 ` [PATCHv6 1/2] x86/ept: invalidate guest physical mappings on VMENTER David Vrabel
@ 2015-12-18 13:50 ` David Vrabel
  2015-12-20  6:56   ` Tian, Kevin
  2015-12-22 12:23   ` George Dunlap
  1 sibling, 2 replies; 13+ messages in thread
From: David Vrabel @ 2015-12-18 13:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Jan Beulich

Holding the p2m lock while calling ept_sync_domain() is very expensive
since it does an on_selected_cpus() call.  IPIs on many-socket machines
can be very slow and on_selected_cpus() is serialized.

It is safe to defer the invalidate until the p2m lock is released
except for two cases:

1. When freeing a page table page (since partial translations may be
   cached).
2. When reclaiming a zero page as part of PoD.

For these cases, add p2m_tlb_flush_sync() calls which will immediately
perform the invalidate before the page is freed or reclaimed.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
v6:
- Move p2m_tlb_flush_sync() to immediately before p2m_free_ptp().  It was
  called all the time otherwise.

v5:
- add p2m_tlb_flush_sync() and call it before freeing page table pages
  and reclaiming zeroed PoD pages.

v2:
- use per-p2m list for deferred pages.
- update synced_mask while holding write lock.
---
 xen/arch/x86/mm/mm-locks.h | 23 +++++++++++++++--------
 xen/arch/x86/mm/p2m-ept.c  | 42 ++++++++++++++++++++++++++++++++++--------
 xen/arch/x86/mm/p2m-pod.c  |  2 ++
 xen/arch/x86/mm/p2m.c      | 14 ++++++++++++++
 xen/include/asm-x86/p2m.h  | 10 ++++++++++
 5 files changed, 75 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/mm/mm-locks.h b/xen/arch/x86/mm/mm-locks.h
index 76c7217..b1a92e2 100644
--- a/xen/arch/x86/mm/mm-locks.h
+++ b/xen/arch/x86/mm/mm-locks.h
@@ -263,14 +263,21 @@ declare_mm_lock(altp2mlist)
  */
 
 declare_mm_rwlock(altp2m);
-#define p2m_lock(p)                         \
-{                                           \
-    if ( p2m_is_altp2m(p) )                 \
-        mm_write_lock(altp2m, &(p)->lock);  \
-    else                                    \
-        mm_write_lock(p2m, &(p)->lock);     \
-}
-#define p2m_unlock(p)         mm_write_unlock(&(p)->lock);
+#define p2m_lock(p)                             \
+    do {                                        \
+        if ( p2m_is_altp2m(p) )                 \
+            mm_write_lock(altp2m, &(p)->lock);  \
+        else                                    \
+            mm_write_lock(p2m, &(p)->lock);     \
+        (p)->defer_flush++;                     \
+    } while (0)
+#define p2m_unlock(p)                           \
+    do {                                        \
+        if ( --(p)->defer_flush == 0 )          \
+            p2m_tlb_flush_and_unlock(p);        \
+        else                                    \
+            mm_write_unlock(&(p)->lock);        \
+    } while (0)
 #define gfn_lock(p,g,o)       p2m_lock(p)
 #define gfn_unlock(p,g,o)     p2m_unlock(p)
 #define p2m_read_lock(p)      mm_read_lock(p2m, &(p)->lock)
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index c094320..43c7f1b 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -263,6 +263,7 @@ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
         unmap_domain_page(epte);
     }
     
+    p2m_tlb_flush_sync(p2m);
     p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));
 }
 
@@ -1095,15 +1096,10 @@ static void __ept_sync_domain(void *info)
      */
 }
 
-void ept_sync_domain(struct p2m_domain *p2m)
+static void ept_sync_domain_prepare(struct p2m_domain *p2m)
 {
     struct domain *d = p2m->domain;
     struct ept_data *ept = &p2m->ept;
-    /* Only if using EPT and this domain has some VCPUs to dirty. */
-    if ( !paging_mode_hap(d) || !d->vcpu || !d->vcpu[0] )
-        return;
-
-    ASSERT(local_irq_is_enabled());
 
     if ( nestedhvm_enabled(d) && !p2m_is_nestedp2m(p2m) )
         p2m_flush_nestedp2m(d);
@@ -1116,9 +1112,38 @@ void ept_sync_domain(struct p2m_domain *p2m)
      *    of an EP4TA reuse is still needed.
      */
     cpumask_setall(ept->invalidate);
+}
+
+static void ept_sync_domain_mask(struct p2m_domain *p2m, const cpumask_t *mask)
+{
+    on_selected_cpus(mask, __ept_sync_domain, p2m, 1);
+}
+
+void ept_sync_domain(struct p2m_domain *p2m)
+{
+    struct domain *d = p2m->domain;
 
-    on_selected_cpus(d->domain_dirty_cpumask,
-                     __ept_sync_domain, p2m, 1);
+    /* Only if using EPT and this domain has some VCPUs to dirty. */
+    if ( !paging_mode_hap(d) || !d->vcpu || !d->vcpu[0] )
+        return;
+
+    ept_sync_domain_prepare(p2m);
+
+    if ( p2m->defer_flush )
+    {
+        p2m->need_flush = 1;
+        return;
+    }
+
+    ept_sync_domain_mask(p2m, d->domain_dirty_cpumask);
+}
+
+static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
+{
+    p2m->need_flush = 0;
+    if ( unlock )
+        mm_write_unlock(&p2m->lock);
+    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
 }
 
 static void ept_enable_pml(struct p2m_domain *p2m)
@@ -1169,6 +1194,7 @@ int ept_p2m_init(struct p2m_domain *p2m)
     p2m->change_entry_type_range = ept_change_entry_type_range;
     p2m->memory_type_changed = ept_memory_type_changed;
     p2m->audit_p2m = NULL;
+    p2m->flush_and_unlock = ept_flush_and_unlock;
 
     /* Set the memory type used when accessing EPT paging structures. */
     ept->ept_mt = EPT_DEFAULT_MT;
diff --git a/xen/arch/x86/mm/p2m-pod.c b/xen/arch/x86/mm/p2m-pod.c
index ea16d3e..a5d672e 100644
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -886,6 +886,8 @@ p2m_pod_zero_check(struct p2m_domain *p2m, unsigned long *gfns, int count)
         }
     }
 
+    p2m_tlb_flush_sync(p2m);
+
     /* Now check each page for real */
     for ( i=0; i < count; i++ )
     {
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index ed0bbd7..efb15cd 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -324,6 +324,20 @@ void p2m_flush_hardware_cached_dirty(struct domain *d)
     }
 }
 
+void p2m_tlb_flush_sync(struct p2m_domain *p2m)
+{
+    if ( p2m->need_flush )
+        p2m->flush_and_unlock(p2m, 0);
+}
+
+void p2m_tlb_flush_and_unlock(struct p2m_domain *p2m)
+{
+    if ( p2m->need_flush )
+        p2m->flush_and_unlock(p2m, 1);
+    else
+        mm_write_unlock(&p2m->lock);
+}
+
 mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn,
                     p2m_type_t *t, p2m_access_t *a, p2m_query_t q,
                     unsigned int *page_order, bool_t locked)
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index fa46dd9..9c394c2 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -261,6 +261,10 @@ struct p2m_domain {
                                           unsigned long gfn, l1_pgentry_t *p,
                                           l1_pgentry_t new, unsigned int level);
     long               (*audit_p2m)(struct p2m_domain *p2m);
+    void               (*flush_and_unlock)(struct p2m_domain *p2m, bool_t unlock);
+
+    unsigned int defer_flush;
+    bool_t need_flush;
 
     /* Default P2M access type for each page in the the domain: new pages,
      * swapped in pages, cleared pages, and pages that are ambiguously
@@ -353,6 +357,12 @@ static inline bool_t p2m_is_altp2m(const struct p2m_domain *p2m)
 
 #define p2m_get_pagetable(p2m)  ((p2m)->phys_table)
 
+/*
+ * Ensure any deferred p2m TLB flush has been completed on all VCPUs.
+ */
+void p2m_tlb_flush_sync(struct p2m_domain *p2m);
+void p2m_tlb_flush_and_unlock(struct p2m_domain *p2m);
+
 /**** p2m query accessors. They lock p2m_lock, and thus serialize
  * lookups wrt modifications. They _do not_ release the lock on exit.
  * After calling any of the variants below, caller needs to use
-- 
2.1.4


* Re: [PATCHv6 1/2] x86/ept: invalidate guest physical mappings on VMENTER
  2015-12-18 13:50 ` [PATCHv6 1/2] x86/ept: invalidate guest physical mappings on VMENTER David Vrabel
@ 2015-12-18 14:59   ` George Dunlap
  2015-12-20  6:51   ` Tian, Kevin
  1 sibling, 0 replies; 13+ messages in thread
From: George Dunlap @ 2015-12-18 14:59 UTC (permalink / raw)
  To: David Vrabel, xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, Jan Beulich

On 18/12/15 13:50, David Vrabel wrote:
> If a guest allocates a page and the tlbflush_timestamp on the page
> indicates that a TLB flush of the previous owner is required, only the
> linear and combined mappings are invalidated.  The guest-physical
> mappings are not invalidated.
> 
> This is currently safe because the EPT code ensures that the
> guest-physical and combined mappings are invalidated /before/ the page
> is freed.  However, this prevents us from deferring the EPT invalidate
> until after the page is freed (e.g., to defer the invalidate until the
> p2m locks are released).
> 
> The TLB flush that may be done after allocating page already causes
> the original guest to VMEXIT, thus on VMENTER we can do an INVEPT if
> one is pending.
> 
> This means __ept_sync_domain() need not do anything and the thus the
> on_selected_cpu() call does not need to wait for as long.
> 
> ept_sync_domain() now marks all PCPUs as needing to be invalidated,
> including PCPUs that the domain has not run on.  We still only IPI
> those PCPUs that are active so this does not result in any more INVEPT
> calls.
> 
> We do not attempt to track when PCPUs may have cached translations
> because the only safe way to clear this per-CPU state is if
> immediately after an invalidate the PCPU is not active (i.e., the PCPU
> is not in d->domain_dirty_cpumask).  Since we only invalidate on
> VMENTER or by IPIing active PCPUs this can never happen.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: George Dunlap <george.dunlap@citrix.com>

I'd probably have dropped this, but anyway:

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

> ---
> v6:
> - Add some comments.
> - Use test then clear instead of test_and_clear in vmx_vmenter_helper()
> 
> v4:
> - __ept_sync_domain() is a no-op -- invalidates are done before VMENTER.
> - initialize ept->invalidate to all ones so the initial invalidate is
>   always done.
> ---
>  xen/arch/x86/hvm/vmx/vmx.c         | 24 ++++++++++++------------
>  xen/arch/x86/mm/p2m-ept.c          | 31 ++++++++++++++++++-------------
>  xen/include/asm-x86/hvm/vmx/vmcs.h |  4 ++--
>  3 files changed, 32 insertions(+), 27 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 2922673..23ab737 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -744,24 +744,12 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
>  
>  static void vmx_ctxt_switch_to(struct vcpu *v)
>  {
> -    struct domain *d = v->domain;
>      unsigned long old_cr4 = read_cr4(), new_cr4 = mmu_cr4_features;
> -    struct ept_data *ept_data = &p2m_get_hostp2m(d)->ept;
>  
>      /* HOST_CR4 in VMCS is always mmu_cr4_features. Sync CR4 now. */
>      if ( old_cr4 != new_cr4 )
>          write_cr4(new_cr4);
>  
> -    if ( paging_mode_hap(d) )
> -    {
> -        unsigned int cpu = smp_processor_id();
> -        /* Test-and-test-and-set this CPU in the EPT-is-synced mask. */
> -        if ( !cpumask_test_cpu(cpu, ept_get_synced_mask(ept_data)) &&
> -             !cpumask_test_and_set_cpu(cpu,
> -                                       ept_get_synced_mask(ept_data)) )
> -            __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept_data), 0);
> -    }
> -
>      vmx_restore_guest_msrs(v);
>      vmx_restore_dr(v);
>  }
> @@ -3586,6 +3574,18 @@ void vmx_vmenter_helper(const struct cpu_user_regs *regs)
>      if ( unlikely(need_flush) )
>          vpid_sync_all();
>  
> +    if ( paging_mode_hap(curr->domain) )
> +    {
> +        struct ept_data *ept = &p2m_get_hostp2m(curr->domain)->ept;
> +        unsigned int cpu = smp_processor_id();
> +
> +        if ( cpumask_test_cpu(cpu, ept->invalidate) )
> +        {
> +            cpumask_clear_cpu(cpu, ept->invalidate);
> +            __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept), 0);
> +        }
> +    }
> +
>   out:
>      HVMTRACE_ND(VMENTRY, 0, 1/*cycles*/, 0, 0, 0, 0, 0, 0, 0);
>  
> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
> index eef0372..c094320 100644
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -1089,9 +1089,10 @@ static void ept_memory_type_changed(struct p2m_domain *p2m)
>  
>  static void __ept_sync_domain(void *info)
>  {
> -    struct ept_data *ept = &((struct p2m_domain *)info)->ept;
> -
> -    __invept(INVEPT_SINGLE_CONTEXT, ept_get_eptp(ept), 0);
> +    /*
> +     * The invalidation will be done before VMENTER (see
> +     * vmx_vmenter_helper()).
> +     */
>  }
>  
>  void ept_sync_domain(struct p2m_domain *p2m)
> @@ -1108,15 +1109,15 @@ void ept_sync_domain(struct p2m_domain *p2m)
>          p2m_flush_nestedp2m(d);
>  
>      /*
> -     * Flush active cpus synchronously. Flush others the next time this domain
> -     * is scheduled onto them. We accept the race of other CPUs adding to
> -     * the ept_synced mask before on_selected_cpus() reads it, resulting in
> -     * unnecessary extra flushes, to avoid allocating a cpumask_t on the stack.
> +     * Need to invalidate on all PCPUs because either:
> +     *
> +     * a) A VCPU has run and some translations may be cached.
> +     * b) A VCPU has not run and and the initial invalidation in case
> +     *    of an EP4TA reuse is still needed.
>       */
> -    cpumask_and(ept_get_synced_mask(ept),
> -                d->domain_dirty_cpumask, &cpu_online_map);
> +    cpumask_setall(ept->invalidate);
>  
> -    on_selected_cpus(ept_get_synced_mask(ept),
> +    on_selected_cpus(d->domain_dirty_cpumask,
>                       __ept_sync_domain, p2m, 1);
>  }
>  
> @@ -1182,10 +1183,14 @@ int ept_p2m_init(struct p2m_domain *p2m)
>          p2m->flush_hardware_cached_dirty = ept_flush_pml_buffers;
>      }
>  
> -    if ( !zalloc_cpumask_var(&ept->synced_mask) )
> +    if ( !zalloc_cpumask_var(&ept->invalidate) )
>          return -ENOMEM;
>  
> -    on_each_cpu(__ept_sync_domain, p2m, 1);
> +    /*
> +     * Assume an initial invalidation is required, in case an EP4TA is
> +     * reused.
> +     */
> +    cpumask_setall(ept->invalidate);
>  
>      return 0;
>  }
> @@ -1193,7 +1198,7 @@ int ept_p2m_init(struct p2m_domain *p2m)
>  void ept_p2m_uninit(struct p2m_domain *p2m)
>  {
>      struct ept_data *ept = &p2m->ept;
> -    free_cpumask_var(ept->synced_mask);
> +    free_cpumask_var(ept->invalidate);
>  }
>  
>  static void ept_dump_p2m_table(unsigned char key)
> diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
> index a8d4d5b..d1496b8 100644
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> @@ -67,7 +67,8 @@ struct ept_data {
>          };
>          u64 eptp;
>      };
> -    cpumask_var_t synced_mask;
> +    /* Set of PCPUs needing an INVEPT before a VMENTER. */
> +    cpumask_var_t invalidate;
>  };
>  
>  #define _VMX_DOMAIN_PML_ENABLED    0
> @@ -97,7 +98,6 @@ struct pi_desc {
>  #define ept_get_wl(ept)   ((ept)->ept_wl)
>  #define ept_get_asr(ept)  ((ept)->asr)
>  #define ept_get_eptp(ept) ((ept)->eptp)
> -#define ept_get_synced_mask(ept) ((ept)->synced_mask)
>  
>  #define NR_PML_ENTRIES   512
>  
> 


* Re: [PATCHv6 1/2] x86/ept: invalidate guest physical mappings on VMENTER
  2015-12-18 13:50 ` [PATCHv6 1/2] x86/ept: invalidate guest physical mappings on VMENTER David Vrabel
  2015-12-18 14:59   ` George Dunlap
@ 2015-12-20  6:51   ` Tian, Kevin
  1 sibling, 0 replies; 13+ messages in thread
From: Tian, Kevin @ 2015-12-20  6:51 UTC (permalink / raw)
  To: David Vrabel, xen-devel
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich, Nakajima, Jun

> From: David Vrabel [mailto:david.vrabel@citrix.com]
> Sent: Friday, December 18, 2015 9:51 PM
> 
> If a guest allocates a page and the tlbflush_timestamp on the page
> indicates that a TLB flush of the previous owner is required, only the
> linear and combined mappings are invalidated.  The guest-physical
> mappings are not invalidated.
> 
> This is currently safe because the EPT code ensures that the
> guest-physical and combined mappings are invalidated /before/ the page
> is freed.  However, this prevents us from deferring the EPT invalidate
> until after the page is freed (e.g., to defer the invalidate until the
> p2m locks are released).
> 
> The TLB flush that may be done after allocating page already causes
> the original guest to VMEXIT, thus on VMENTER we can do an INVEPT if
> one is pending.
> 
> This means __ept_sync_domain() need not do anything and the thus the
> on_selected_cpu() call does not need to wait for as long.
> 
> ept_sync_domain() now marks all PCPUs as needing to be invalidated,
> including PCPUs that the domain has not run on.  We still only IPI
> those PCPUs that are active so this does not result in any more INVEPT
> calls.
> 
> We do not attempt to track when PCPUs may have cached translations
> because the only safe way to clear this per-CPU state is if
> immediately after an invalidate the PCPU is not active (i.e., the PCPU
> is not in d->domain_dirty_cpumask).  Since we only invalidate on
> VMENTER or by IPIing active PCPUs this can never happen.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: George Dunlap <george.dunlap@citrix.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>


* Re: [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2015-12-18 13:50 ` [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released David Vrabel
@ 2015-12-20  6:56   ` Tian, Kevin
  2016-02-01 14:50     ` David Vrabel
  2015-12-22 12:23   ` George Dunlap
  1 sibling, 1 reply; 13+ messages in thread
From: Tian, Kevin @ 2015-12-20  6:56 UTC (permalink / raw)
  To: David Vrabel, xen-devel
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich, Nakajima, Jun

> From: David Vrabel [mailto:david.vrabel@citrix.com]
> Sent: Friday, December 18, 2015 9:51 PM
> 
> Holding the p2m lock while calling ept_sync_domain() is very expensive
> since it does a on_selected_cpus() call.  IPIs on many socket machines
> can be very slows and on_selected_cpus() is serialized.
> 
> It is safe to defer the invalidate until the p2m lock is released
> except for two cases:
> 
> 1. When freeing a page table page (since partial translations may be
>    cached).
> 2. When reclaiming a zero page as part of PoD.
> 
> For these cases, add p2m_tlb_flush_sync() calls which will immediately
> perform the invalidate before the page is freed or reclaimed.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

In a quick glimpse the patch is OK. Will find a time to think about it more
carefully since this part is tricky (also allow others to comment before
my ack in EPT part).

Thanks
Kevin


* Re: [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2015-12-18 13:50 ` [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released David Vrabel
  2015-12-20  6:56   ` Tian, Kevin
@ 2015-12-22 12:23   ` George Dunlap
  2015-12-22 14:01     ` Andrew Cooper
  2016-02-01 15:57     ` David Vrabel
  1 sibling, 2 replies; 13+ messages in thread
From: George Dunlap @ 2015-12-22 12:23 UTC (permalink / raw)
  To: David Vrabel, xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, Jan Beulich

On 18/12/15 13:50, David Vrabel wrote:
> Holding the p2m lock while calling ept_sync_domain() is very expensive
> since it does a on_selected_cpus() call.  IPIs on many socket machines
> can be very slows and on_selected_cpus() is serialized.
> 
> It is safe to defer the invalidate until the p2m lock is released
> except for two cases:
> 
> 1. When freeing a page table page (since partial translations may be
>    cached).
> 2. When reclaiming a zero page as part of PoD.
> 
> For these cases, add p2m_tlb_flush_sync() calls which will immediately
> perform the invalidate before the page is freed or reclaimed.

There are at least two other places in the PoD code where the "remove ->
check -> add to cache -> unlock" pattern exists; and it looks to me like
there are other places where races might occur (e.g.,
p2m_paging_evict(), which does remove -> scrub -> put -> unlock;
p2m_altp2m_propagate_change(), which does remove -> put -> unlock).

Part of me wonders whether, rather than making callers that need it
remember to do a flush, it might be better to explicitly pass in
P2M_FLUSH or P2M_CAN_DEFER when calling p2m_set_entry, just to make
people think about the fact that the p2m change may not actually take
effect until later.  Any thoughts on that?
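
Just to illustrate the shape I mean (the flag names and the extra
parameter are made up, of course):

    /* Caller can tolerate the invalidation happening at p2m_unlock() time: */
    p2m_set_entry(p2m, gfn, mfn, page_order, p2mt, p2ma, P2M_CAN_DEFER);

    /* Caller is about to free or recycle the page, so it must flush now: */
    p2m_set_entry(p2m, gfn, mfn, page_order, p2mt, p2ma, P2M_FLUSH);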

Comments on the current approach inline.

> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
> index c094320..43c7f1b 100644
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -263,6 +263,7 @@ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
>          unmap_domain_page(epte);
>      }
>      
> +    p2m_tlb_flush_sync(p2m);
>      p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));

It's probably worth a comment here pointing out that even if this
function is called several times (e.g., if you replace a load of 4k
entries with a 1G entry), the actual flush will only happen the first time.

> +static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
> +{
> +    p2m->need_flush = 0;
> +    if ( unlock )
> +        mm_write_unlock(&p2m->lock);
> +    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
>  }

Having a function called "flush_and_unlock", with a boolean as to
whether to unlock or not, just seems a bit wonky.

Wouldn't it make more sense to have the hook just named "flush_sync()",
and move the unlocking out in the generic p2m code (where you already
have the check for need_flush)?
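
Something along these lines (illustrative only -- "flush_sync" is a
made-up hook name, and this still keeps the actual IPI outside the
lock):

    /* generic p2m code */
    void p2m_tlb_flush_and_unlock(struct p2m_domain *p2m)
    {
        if ( p2m->need_flush )
        {
            p2m->need_flush = 0;
            mm_write_unlock(&p2m->lock);
            p2m->flush_sync(p2m);      /* EPT hook just does the INVEPT IPIs */
        }
        else
            mm_write_unlock(&p2m->lock);
    }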

> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
> index fa46dd9..9c394c2 100644
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -261,6 +261,10 @@ struct p2m_domain {
>                                            unsigned long gfn, l1_pgentry_t *p,
>                                            l1_pgentry_t new, unsigned int level);
>      long               (*audit_p2m)(struct p2m_domain *p2m);
> +    void               (*flush_and_unlock)(struct p2m_domain *p2m, bool_t unlock);
> +
> +    unsigned int defer_flush;
> +    bool_t need_flush;

It's probably worth a comment that at the moment calling
flush_and_unlock() is gated on need_flush; so it's OK not to implement
flush_and_unlock() as long as you never set need_flush.

Thanks,
 -George


* Re: [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2015-12-22 12:23   ` George Dunlap
@ 2015-12-22 14:01     ` Andrew Cooper
  2015-12-22 14:20       ` David Vrabel
  2016-02-01 15:57     ` David Vrabel
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Cooper @ 2015-12-22 14:01 UTC (permalink / raw)
  To: George Dunlap, David Vrabel, xen-devel
  Cc: George Dunlap, Kevin Tian, Tim Deegan, Jan Beulich, Jun Nakajima

On 22/12/15 12:23, George Dunlap wrote:
> On 18/12/15 13:50, David Vrabel wrote:
>> Holding the p2m lock while calling ept_sync_domain() is very expensive
>> since it does a on_selected_cpus() call.  IPIs on many socket machines
>> can be very slows and on_selected_cpus() is serialized.
>>
>> It is safe to defer the invalidate until the p2m lock is released
>> except for two cases:
>>
>> 1. When freeing a page table page (since partial translations may be
>>    cached).
>> 2. When reclaiming a zero page as part of PoD.
>>
>> For these cases, add p2m_tlb_flush_sync() calls which will immediately
>> perform the invalidate before the page is freed or reclaimed.
> There are at least two other places in the PoD code where the "remove ->
> check -> add to cache -> unlock" pattern exist; and it looks to me like
> there are other places where races might occur (e.g.,
> p2m_paging_evict(), which does remove -> scrub -> put -> unlock;
> p2m_altp2m_propagate_change(), which does remove -> put -> unlock).
>
> Part of me wonders whether, rather than making callers that need it
> remember to do a flush, it might be better to explicitly pass in
> P2M_FLUSH or P2M_CAN_DEFER when calling p2m_set_entry, just to make
> people think about the fact that the p2m change may not actually take
> effect until later.  Any thoughts on that?
>
> Comments on the current approach inline.
>
>> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
>> index c094320..43c7f1b 100644
>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> @@ -263,6 +263,7 @@ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
>>          unmap_domain_page(epte);
>>      }
>>      
>> +    p2m_tlb_flush_sync(p2m);
>>      p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));
> It's probably worth a comment here pointing out that even if this
> function is called several times (e.g., if you replace a load of 4k
> entries with a 1G entry), the actual flush will only happen the first time.
>
>> +static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
>> +{
>> +    p2m->need_flush = 0;
>> +    if ( unlock )
>> +        mm_write_unlock(&p2m->lock);
>> +    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
>>  }
> Having a function called "flush_and_unlock", with a boolean as to
> whether to unlock or not, just seems a bit wonky.
>
> Wouldn't it make more sense to have the hook just named "flush_sync()",
> and move the unlocking out in the generic p2m code (where you already
> have the check for need_flush)?
>
>> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
>> index fa46dd9..9c394c2 100644
>> --- a/xen/include/asm-x86/p2m.h
>> +++ b/xen/include/asm-x86/p2m.h
>> @@ -261,6 +261,10 @@ struct p2m_domain {
>>                                            unsigned long gfn, l1_pgentry_t *p,
>>                                            l1_pgentry_t new, unsigned int level);
>>      long               (*audit_p2m)(struct p2m_domain *p2m);
>> +    void               (*flush_and_unlock)(struct p2m_domain *p2m, bool_t unlock);
>> +
>> +    unsigned int defer_flush;
>> +    bool_t need_flush;
> It's probably worth a comment that at the moment calling
> flush_and_unlock() is gated on need_flush; so it's OK not to implement
> flush_and_unlock() as long as you never set need_flush.

This is just one small accident (in code elsewhere) away from a code
injection vulnerability.

Either we should require that all function pointers are filled in, or
BUG() if the pointer is missing when we attempt to use it.
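
e.g. (illustrative only):

    void p2m_tlb_flush_sync(struct p2m_domain *p2m)
    {
        if ( p2m->need_flush )
        {
            BUG_ON(!p2m->flush_and_unlock);
            p2m->flush_and_unlock(p2m, 0);
        }
    }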

~Andrew


* Re: [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2015-12-22 14:01     ` Andrew Cooper
@ 2015-12-22 14:20       ` David Vrabel
  2015-12-22 14:56         ` George Dunlap
  0 siblings, 1 reply; 13+ messages in thread
From: David Vrabel @ 2015-12-22 14:20 UTC (permalink / raw)
  To: Andrew Cooper, George Dunlap, David Vrabel, xen-devel
  Cc: George Dunlap, Kevin Tian, Tim Deegan, Jun Nakajima, Jan Beulich

On 22/12/15 14:01, Andrew Cooper wrote:
> On 22/12/15 12:23, George Dunlap wrote:
>> On 18/12/15 13:50, David Vrabel wrote:
>>> Holding the p2m lock while calling ept_sync_domain() is very expensive
>>> since it does a on_selected_cpus() call.  IPIs on many socket machines
>>> can be very slows and on_selected_cpus() is serialized.
>>>
>>> It is safe to defer the invalidate until the p2m lock is released
>>> except for two cases:
>>>
>>> 1. When freeing a page table page (since partial translations may be
>>>    cached).
>>> 2. When reclaiming a zero page as part of PoD.
>>>
>>> For these cases, add p2m_tlb_flush_sync() calls which will immediately
>>> perform the invalidate before the page is freed or reclaimed.
>> There are at least two other places in the PoD code where the "remove ->
>> check -> add to cache -> unlock" pattern exist; and it looks to me like
>> there are other places where races might occur (e.g.,
>> p2m_paging_evict(), which does remove -> scrub -> put -> unlock;
>> p2m_altp2m_propagate_change(), which does remove -> put -> unlock).
>>
>> Part of me wonders whether, rather than making callers that need it
>> remember to do a flush, it might be better to explicitly pass in
>> P2M_FLUSH or P2M_CAN_DEFER when calling p2m_set_entry, just to make
>> people think about the fact that the p2m change may not actually take
>> effect until later.  Any thoughts on that?
>>
>> Comments on the current approach inline.
>>
>>> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
>>> index c094320..43c7f1b 100644
>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>> @@ -263,6 +263,7 @@ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
>>>          unmap_domain_page(epte);
>>>      }
>>>      
>>> +    p2m_tlb_flush_sync(p2m);
>>>      p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));
>> It's probably worth a comment here pointing out that even if this
>> function is called several times (e.g., if you replace a load of 4k
>> entries with a 1G entry), the actual flush will only happen the first time.
>>
>>> +static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
>>> +{
>>> +    p2m->need_flush = 0;
>>> +    if ( unlock )
>>> +        mm_write_unlock(&p2m->lock);
>>> +    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
>>>  }
>> Having a function called "flush_and_unlock", with a boolean as to
>> whether to unlock or not, just seems a bit wonky.
>>
>> Wouldn't it make more sense to have the hook just named "flush_sync()",
>> and move the unlocking out in the generic p2m code (where you already
>> have the check for need_flush)?
>>
>>> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
>>> index fa46dd9..9c394c2 100644
>>> --- a/xen/include/asm-x86/p2m.h
>>> +++ b/xen/include/asm-x86/p2m.h
>>> @@ -261,6 +261,10 @@ struct p2m_domain {
>>>                                            unsigned long gfn, l1_pgentry_t *p,
>>>                                            l1_pgentry_t new, unsigned int level);
>>>      long               (*audit_p2m)(struct p2m_domain *p2m);
>>> +    void               (*flush_and_unlock)(struct p2m_domain *p2m, bool_t unlock);
>>> +
>>> +    unsigned int defer_flush;
>>> +    bool_t need_flush;
>> It's probably worth a comment that at the moment calling
>> flush_and_unlock() is gated on need_flush; so it's OK not to implement
>> flush_and_unlock() as long as you never set need_flush.
> 
> This is just one small accident (in code elsewhere) away from a code
> injection vulnerability.
> 
> Either we should require that all function pointers are filled in, or
> BUG() if the pointer is missing when we attempt to use it.

Jan asked for the call to be conditional on need_flush and to not test
flush_and_unlock.

David


* Re: [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2015-12-22 14:20       ` David Vrabel
@ 2015-12-22 14:56         ` George Dunlap
  0 siblings, 0 replies; 13+ messages in thread
From: George Dunlap @ 2015-12-22 14:56 UTC (permalink / raw)
  To: David Vrabel, Andrew Cooper, xen-devel
  Cc: George Dunlap, Kevin Tian, Tim Deegan, Jun Nakajima, Jan Beulich

On 22/12/15 14:20, David Vrabel wrote:
> On 22/12/15 14:01, Andrew Cooper wrote:
>> On 22/12/15 12:23, George Dunlap wrote:
>>> On 18/12/15 13:50, David Vrabel wrote:
>>>> Holding the p2m lock while calling ept_sync_domain() is very expensive
>>>> since it does a on_selected_cpus() call.  IPIs on many socket machines
>>>> can be very slows and on_selected_cpus() is serialized.
>>>>
>>>> It is safe to defer the invalidate until the p2m lock is released
>>>> except for two cases:
>>>>
>>>> 1. When freeing a page table page (since partial translations may be
>>>>    cached).
>>>> 2. When reclaiming a zero page as part of PoD.
>>>>
>>>> For these cases, add p2m_tlb_flush_sync() calls which will immediately
>>>> perform the invalidate before the page is freed or reclaimed.
>>> There are at least two other places in the PoD code where the "remove ->
>>> check -> add to cache -> unlock" pattern exist; and it looks to me like
>>> there are other places where races might occur (e.g.,
>>> p2m_paging_evict(), which does remove -> scrub -> put -> unlock;
>>> p2m_altp2m_propagate_change(), which does remove -> put -> unlock).
>>>
>>> Part of me wonders whether, rather than making callers that need it
>>> remember to do a flush, it might be better to explicitly pass in
>>> P2M_FLUSH or P2M_CAN_DEFER when calling p2m_set_entry, just to make
>>> people think about the fact that the p2m change may not actually take
>>> effect until later.  Any thoughts on that?
>>>
>>> Comments on the current approach inline.
>>>
>>>> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
>>>> index c094320..43c7f1b 100644
>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>> @@ -263,6 +263,7 @@ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
>>>>          unmap_domain_page(epte);
>>>>      }
>>>>      
>>>> +    p2m_tlb_flush_sync(p2m);
>>>>      p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));
>>> It's probably worth a comment here pointing out that even if this
>>> function is called several times (e.g., if you replace a load of 4k
>>> entries with a 1G entry), the actual flush will only happen the first time.
>>>
>>>> +static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
>>>> +{
>>>> +    p2m->need_flush = 0;
>>>> +    if ( unlock )
>>>> +        mm_write_unlock(&p2m->lock);
>>>> +    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
>>>>  }
>>> Having a function called "flush_and_unlock", with a boolean as to
>>> whether to unlock or not, just seems a bit wonky.
>>>
>>> Wouldn't it make more sense to have the hook just named "flush_sync()",
>>> and move the unlocking out in the generic p2m code (where you already
>>> have the check for need_flush)?
>>>
>>>> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
>>>> index fa46dd9..9c394c2 100644
>>>> --- a/xen/include/asm-x86/p2m.h
>>>> +++ b/xen/include/asm-x86/p2m.h
>>>> @@ -261,6 +261,10 @@ struct p2m_domain {
>>>>                                            unsigned long gfn, l1_pgentry_t *p,
>>>>                                            l1_pgentry_t new, unsigned int level);
>>>>      long               (*audit_p2m)(struct p2m_domain *p2m);
>>>> +    void               (*flush_and_unlock)(struct p2m_domain *p2m, bool_t unlock);
>>>> +
>>>> +    unsigned int defer_flush;
>>>> +    bool_t need_flush;
>>> It's probably worth a comment that at the moment calling
>>> flush_and_unlock() is gated on need_flush; so it's OK not to implement
>>> flush_and_unlock() as long as you never set need_flush.
>>
>> This is just one small accident (in code elsewhere) away from a code
>> injection vulnerability.
>>
>> Either we should require that all function pointers are filled in, or
>> BUG() if the pointer is missing when we attempt to use it.
> 
> Jan asked for the call to be conditional on need_flush and to not test
> flush_and_unlock.

Then perhaps the other paging modes should point this to a function
which calls BUG()?  Or perhaps a no-op -- no point in crashing a machine
in production when no flush actually needs to be done.

 -George


* Re: [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2015-12-20  6:56   ` Tian, Kevin
@ 2016-02-01 14:50     ` David Vrabel
  2016-02-02  7:58       ` Tian, Kevin
  0 siblings, 1 reply; 13+ messages in thread
From: David Vrabel @ 2016-02-01 14:50 UTC (permalink / raw)
  To: Tian, Kevin, David Vrabel, xen-devel
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Nakajima, Jun, Jan Beulich

On 20/12/15 06:56, Tian, Kevin wrote:
>> From: David Vrabel [mailto:david.vrabel@citrix.com]
>> Sent: Friday, December 18, 2015 9:51 PM
>>
>> Holding the p2m lock while calling ept_sync_domain() is very expensive
>> since it does a on_selected_cpus() call.  IPIs on many socket machines
>> can be very slows and on_selected_cpus() is serialized.
>>
>> It is safe to defer the invalidate until the p2m lock is released
>> except for two cases:
>>
>> 1. When freeing a page table page (since partial translations may be
>>    cached).
>> 2. When reclaiming a zero page as part of PoD.
>>
>> For these cases, add p2m_tlb_flush_sync() calls which will immediately
>> perform the invalidate before the page is freed or reclaimed.
>>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> 
> In a quick glimpse the patch is OK. Will find a time to think about it more
> carefully since this part is tricky (also allow others to comment before
> my ack in EPT part).

Ping?  Did you get more time to think about this patch?

David


* Re: [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2015-12-22 12:23   ` George Dunlap
  2015-12-22 14:01     ` Andrew Cooper
@ 2016-02-01 15:57     ` David Vrabel
  1 sibling, 0 replies; 13+ messages in thread
From: David Vrabel @ 2016-02-01 15:57 UTC (permalink / raw)
  To: George Dunlap, David Vrabel, xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Tim Deegan, Jan Beulich

On 22/12/15 12:23, George Dunlap wrote:
> On 18/12/15 13:50, David Vrabel wrote:
>> Holding the p2m lock while calling ept_sync_domain() is very expensive
>> since it does a on_selected_cpus() call.  IPIs on many socket machines
>> can be very slows and on_selected_cpus() is serialized.
>>
>> It is safe to defer the invalidate until the p2m lock is released
>> except for two cases:
>>
>> 1. When freeing a page table page (since partial translations may be
>>    cached).
>> 2. When reclaiming a zero page as part of PoD.
>>
>> For these cases, add p2m_tlb_flush_sync() calls which will immediately
>> perform the invalidate before the page is freed or reclaimed.
> 
> There are at least two other places in the PoD code where the "remove ->
> check -> add to cache -> unlock" pattern exist; and it looks to me like
> there are other places where races might occur (e.g.,
> p2m_paging_evict(), which does remove -> scrub -> put -> unlock;
> p2m_altp2m_propagate_change(), which does remove -> put -> unlock).

I've fixed the PoD cases.

Freeing pages back to the domheap is protected by the tlb flush
timestamp mechanism.  See patch #1 in this series which makes use of
this for invalidating the guest-physical mappings (in addition to any
linear and combined mappings that were already flushed).

> Part of me wonders whether, rather than making callers that need it
> remember to do a flush, it might be better to explicitly pass in
> P2M_FLUSH or P2M_CAN_DEFER when calling p2m_set_entry, just to make
> people think about the fact that the p2m change may not actually take
> effect until later.  Any thoughts on that?

It is more efficient to batch flushes as much as possible, so explicit
flushes just before they are needed are better.  I also think it's clearer
to have an explicit flush before the operation that actually needs it.
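
i.e. (a sketch only; error handling omitted and the exact call site is
made up) a caller can now do:

    p2m_lock(p2m);                     /* defer_flush++ */
    for ( i = 0; i < count; i++ )
        /* may set need_flush instead of IPIing immediately */
        p2m_set_entry(p2m, gfn + i, _mfn(mfn + i), PAGE_ORDER_4K,
                      p2m_ram_rw, p2m->default_access);
    p2m_unlock(p2m);                   /* at most one INVEPT IPI, issued
                                          after the lock is dropped */

and only the paths that go on to free or recycle a page need an explicit
p2m_tlb_flush_sync() before doing so.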

>> +static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
>> +{
>> +    p2m->need_flush = 0;
>> +    if ( unlock )
>> +        mm_write_unlock(&p2m->lock);
>> +    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
>>  }
> 
> Having a function called "flush_and_unlock", with a boolean as to
> whether to unlock or not, just seems a bit wonky.
> 
> Wouldn't it make more sense to have the hook just named "flush_sync()",
> and move the unlocking out in the generic p2m code (where you already
> have the check for need_flush)?

The ept_sync_domain_mask() call must be outside the lock when unlock is
true or you don't get the performance benefits from this patch.

David


* Re: [PATCHv6 2/2] x86/ept: defer the invalidation until the p2m lock is released
  2016-02-01 14:50     ` David Vrabel
@ 2016-02-02  7:58       ` Tian, Kevin
  0 siblings, 0 replies; 13+ messages in thread
From: Tian, Kevin @ 2016-02-02  7:58 UTC (permalink / raw)
  To: David Vrabel, xen-devel
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Nakajima, Jun, Jan Beulich

> From: David Vrabel [mailto:david.vrabel@citrix.com]
> Sent: Monday, February 01, 2016 10:50 PM
> 
> On 20/12/15 06:56, Tian, Kevin wrote:
> >> From: David Vrabel [mailto:david.vrabel@citrix.com]
> >> Sent: Friday, December 18, 2015 9:51 PM
> >>
> >> Holding the p2m lock while calling ept_sync_domain() is very expensive
> >> since it does a on_selected_cpus() call.  IPIs on many socket machines
> >> can be very slows and on_selected_cpus() is serialized.
> >>
> >> It is safe to defer the invalidate until the p2m lock is released
> >> except for two cases:
> >>
> >> 1. When freeing a page table page (since partial translations may be
> >>    cached).
> >> 2. When reclaiming a zero page as part of PoD.
> >>
> >> For these cases, add p2m_tlb_flush_sync() calls which will immediately
> >> perform the invalidate before the page is freed or reclaimed.
> >>
> >> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> >
> > In a quick glimpse the patch is OK. Will find a time to think about it more
> > carefully since this part is tricky (also allow others to comment before
> > my ack in EPT part).
> 
> Ping?  Did you get more time to think about this patch?
> 

Sorry for the delay on this. I saw you've sent out a v7 on this. Queued to
be reviewed tomorrow.

Thanks
Kevin


