* [PATCHv7 0/3] gnttab: Improve scaleability
@ 2015-04-30 13:28 David Vrabel
  2015-04-30 13:28 ` [PATCHv7 1/3] gnttab: Introduce rwlock to protect updates to grant table state David Vrabel
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: David Vrabel @ 2015-04-30 13:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Jan Beulich, Christoph Egger, Tim Deegan,
	David Vrabel, Matt Wilson, Malcolm Crossley, Ian Campbell

The series makes the grant table locking more fine-grained and adds
per-VCPU maptrack free lists, which greatly improves scalability.

The series builds on the original series by Matt Wilson and Christoph
Egger from Amazon.

Performance results for aggregate intrahost network throughput
(between 20 VM pairs, with 16 dom0 VCPUs) show substantial
improvements.

                                        Throughput (Gbit/s)

Base                                       9.7
Split locks                               25.8
Split locks + per-VCPU maptrack lists     42.9

v7:
  * Re-add grant table rwlock (locking was broken without it and the
    gains are minimal anyway).
  * Remove unneeded locks from grant_table_destroy().
  * Fix get_maptrack_handle() locking.
v6:
  * Remove most uses of the grant table lock.
  * Make the grant table lock a spin lock again (there were only
    writers left after the above)
  * Add per-VCPU maptrack free lists.
v5:
  * Addressed locking issue pointed out by Jan Beulich
  * Fixed git rebase merge issue introduced in v4
    (the lock was acquired twice)
  * Changed the for-loop in grant_map_exists
  * Coding style fixes
v4:
  * Coding style nits from Jan Beulich
  * Fixup read locks pointed out by Jan Beulich
  * Renamed double_gt_(un)lock to double_maptrack_(un)lock
    per request from Jan Beulich
  * Addressed ASSERT()'s from Jan Beulich
  * Addressed locking issues in unmap_common pointed out
    by Jan Beulich
v3:
  * Addressed gnttab_swap_grant_ref() comment from Andrew Cooper
v2:
  * Add arm part per request from Julien Grall

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCHv7 1/3] gnttab: Introduce rwlock to protect updates to grant table state
  2015-04-30 13:28 [PATCHv7 0/3] gnttab: Improve scaleability David Vrabel
@ 2015-04-30 13:28 ` David Vrabel
  2015-05-05 10:42   ` Jan Beulich
  2015-04-30 13:28 ` [PATCHv7 2/3] gnttab: refactor locking for scalability David Vrabel
  2015-04-30 13:28 ` [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists David Vrabel
  2 siblings, 1 reply; 16+ messages in thread
From: David Vrabel @ 2015-04-30 13:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Jan Beulich, Christoph Egger, Tim Deegan,
	David Vrabel, Matt Wilson, Malcolm Crossley, Ian Campbell

From: Christoph Egger <chegger@amazon.de>

Split the grant table lock into two separate locks: a new spinlock
that protects the maptrack state, and the existing lock, which is
changed into an rwlock.

The rwlock is used to prevent readers from accessing inconsistent
grant table state such as current version, partially initialized
active table pages, etc.
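
For illustration, a minimal sketch of this locking split, using POSIX
locks in place of Xen's rwlock_t/spinlock_t. The demo_* structures and
helpers below are made up for illustration and are not the actual Xen
code:

  #include <pthread.h>

  struct demo_grant_table {
      pthread_rwlock_t lock;          /* version, frame counts, ...    */
      pthread_mutex_t  maptrack_lock; /* maptrack free-list state only */
      unsigned int     gt_version;
      unsigned int     nr_grant_frames;
      unsigned int     maptrack_head;
  };

  /* Readers of table-wide state take the read lock. */
  static unsigned int demo_query_frames(struct demo_grant_table *gt)
  {
      unsigned int n;

      pthread_rwlock_rdlock(&gt->lock);
      n = gt->nr_grant_frames;
      pthread_rwlock_unlock(&gt->lock);
      return n;
  }

  /* Changing the version or growing the table takes the write lock. */
  static void demo_set_version(struct demo_grant_table *gt, unsigned int v)
  {
      pthread_rwlock_wrlock(&gt->lock);
      gt->gt_version = v;
      pthread_rwlock_unlock(&gt->lock);
  }

  /* Maptrack free-list updates only need the maptrack lock. */
  static void demo_put_handle(struct demo_grant_table *gt, unsigned int handle)
  {
      pthread_mutex_lock(&gt->maptrack_lock);
      gt->maptrack_head = handle;     /* push onto free list (simplified) */
      pthread_mutex_unlock(&gt->maptrack_lock);
  }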

Signed-off-by: Matt Wilson <msw@amazon.com>
[chegger: ported to xen-staging, split into multiple commits]
Signed-off-by: Christoph Egger <chegger@amazon.de>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 docs/misc/grant-tables.txt    |   28 ++++++++-
 xen/arch/arm/mm.c             |    4 +-
 xen/arch/x86/mm.c             |    4 +-
 xen/common/grant_table.c      |  140 +++++++++++++++++++++++------------------
 xen/include/xen/grant_table.h |   11 +++-
 5 files changed, 118 insertions(+), 69 deletions(-)

diff --git a/docs/misc/grant-tables.txt b/docs/misc/grant-tables.txt
index 19db4ec..c9ae8f2 100644
--- a/docs/misc/grant-tables.txt
+++ b/docs/misc/grant-tables.txt
@@ -74,7 +74,33 @@ is complete.
  matching map track entry is then removed, as if unmap had been invoked.
  These are not used by the transfer mechanism.
   map->domid         : owner of the mapped frame
-  map->ref_and_flags : grant reference, ro/rw, mapped for host or device access
+  map->ref           : grant reference
+  map->flags         : ro/rw, mapped for host or device access
+
+********************************************************************************
+ Locking
+ ~~~~~~~
+ Xen uses several locks to serialize access to the internal grant table state.
+
+  grant_table->lock          : rwlock used to prevent readers from accessing
+                               inconsistent grant table state such as current
+                               version, partially initialized active table pages,
+                               etc.
+  grant_table->maptrack_lock : spinlock used to protect the maptrack state
+
+ The primary lock for the grant table is a read/write spinlock. All
+ functions that access members of struct grant_table must acquire a
+ read lock around critical sections. Any modification to the members
+ of struct grant_table (e.g., nr_status_frames, nr_grant_frames,
+ active frames, etc.) must only be made if the write lock is
+ held. These elements are read-mostly, and read critical sections can
+ be large, which makes a rwlock a good choice.
+
+ The maptrack state is protected by its own spinlock. Any access (read
+ or write) of struct grant_table members that have a "maptrack_"
+ prefix must be made while holding the maptrack lock. The maptrack
+ state can be rapidly modified under some workloads, and the critical
+ sections are very small, thus we use a spinlock to protect them.
 
 ********************************************************************************
 
diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index 11a0be8..f39e978 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -1048,7 +1048,7 @@ int xenmem_add_to_physmap_one(
     switch ( space )
     {
     case XENMAPSPACE_grant_table:
-        spin_lock(&d->grant_table->lock);
+        write_lock(&d->grant_table->lock);
 
         if ( d->grant_table->gt_version == 0 )
             d->grant_table->gt_version = 1;
@@ -1078,7 +1078,7 @@ int xenmem_add_to_physmap_one(
 
         t = p2m_ram_rw;
 
-        spin_unlock(&d->grant_table->lock);
+        write_unlock(&d->grant_table->lock);
         break;
     case XENMAPSPACE_shared_info:
         if ( idx != 0 )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index b724aa5..a46d2ff 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4569,7 +4569,7 @@ int xenmem_add_to_physmap_one(
                 mfn = virt_to_mfn(d->shared_info);
             break;
         case XENMAPSPACE_grant_table:
-            spin_lock(&d->grant_table->lock);
+            write_lock(&d->grant_table->lock);
 
             if ( d->grant_table->gt_version == 0 )
                 d->grant_table->gt_version = 1;
@@ -4591,7 +4591,7 @@ int xenmem_add_to_physmap_one(
                     mfn = virt_to_mfn(d->grant_table->shared_raw[idx]);
             }
 
-            spin_unlock(&d->grant_table->lock);
+            write_unlock(&d->grant_table->lock);
             break;
         case XENMAPSPACE_gmfn_range:
         case XENMAPSPACE_gmfn:
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index dfb45f8..4434a2d 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -229,27 +229,27 @@ static int __get_paged_frame(unsigned long gfn, unsigned long *frame, struct pag
 }
 
 static inline void
-double_gt_lock(struct grant_table *lgt, struct grant_table *rgt)
+double_maptrack_lock(struct grant_table *lgt, struct grant_table *rgt)
 {
     if ( lgt < rgt )
     {
-        spin_lock(&lgt->lock);
-        spin_lock(&rgt->lock);
+        spin_lock(&lgt->maptrack_lock);
+        spin_lock(&rgt->maptrack_lock);
     }
     else
     {
         if ( lgt != rgt )
-            spin_lock(&rgt->lock);
-        spin_lock(&lgt->lock);
+            spin_lock(&rgt->maptrack_lock);
+        spin_lock(&lgt->maptrack_lock);
     }
 }
 
 static inline void
-double_gt_unlock(struct grant_table *lgt, struct grant_table *rgt)
+double_maptrack_unlock(struct grant_table *lgt, struct grant_table *rgt)
 {
-    spin_unlock(&lgt->lock);
+    spin_unlock(&lgt->maptrack_lock);
     if ( lgt != rgt )
-        spin_unlock(&rgt->lock);
+        spin_unlock(&rgt->maptrack_lock);
 }
 
 static inline int
@@ -267,10 +267,10 @@ static inline void
 put_maptrack_handle(
     struct grant_table *t, int handle)
 {
-    spin_lock(&t->lock);
+    spin_lock(&t->maptrack_lock);
     maptrack_entry(t, handle).ref = t->maptrack_head;
     t->maptrack_head = handle;
-    spin_unlock(&t->lock);
+    spin_unlock(&t->maptrack_lock);
 }
 
 static inline int
@@ -282,7 +282,7 @@ get_maptrack_handle(
     struct grant_mapping *new_mt;
     unsigned int          new_mt_limit, nr_frames;
 
-    spin_lock(&lgt->lock);
+    spin_lock(&lgt->maptrack_lock);
 
     while ( unlikely((handle = __get_maptrack_handle(lgt)) == -1) )
     {
@@ -311,7 +311,7 @@ get_maptrack_handle(
                  nr_frames + 1);
     }
 
-    spin_unlock(&lgt->lock);
+    spin_unlock(&lgt->maptrack_lock);
 
     return handle;
 }
@@ -508,7 +508,7 @@ static int grant_map_exists(const struct domain *ld,
     const struct active_grant_entry *act;
     unsigned int ref, max_iter;
     
-    ASSERT(spin_is_locked(&rgt->lock));
+    ASSERT(rw_is_locked(&rgt->lock));
 
     max_iter = min(*ref_count + (1 << GNTTABOP_CONTINUATION_ARG_SHIFT),
                    nr_grant_entries(rgt));
@@ -629,7 +629,7 @@ __gnttab_map_grant_ref(
     }
 
     rgt = rd->grant_table;
-    spin_lock(&rgt->lock);
+    read_lock(&rgt->lock);
 
     if ( rgt->gt_version == 0 )
         PIN_FAIL(unlock_out, GNTST_general_error,
@@ -702,8 +702,6 @@ __gnttab_map_grant_ref(
 
     cache_flags = (shah->flags & (GTF_PAT | GTF_PWT | GTF_PCD) );
 
-    spin_unlock(&rgt->lock);
-
     /* pg may be set, with a refcount included, from __get_paged_frame */
     if ( !pg )
     {
@@ -778,7 +776,7 @@ __gnttab_map_grant_ref(
         goto undo_out;
     }
 
-    double_gt_lock(lgt, rgt);
+    double_maptrack_lock(lgt, rgt);
 
     if ( gnttab_need_iommu_mapping(ld) )
     {
@@ -801,7 +799,7 @@ __gnttab_map_grant_ref(
         }
         if ( err )
         {
-            double_gt_unlock(lgt, rgt);
+            double_maptrack_unlock(lgt, rgt);
             rc = GNTST_general_error;
             goto undo_out;
         }
@@ -814,7 +812,8 @@ __gnttab_map_grant_ref(
     mt->ref   = op->ref;
     mt->flags = op->flags;
 
-    double_gt_unlock(lgt, rgt);
+    double_maptrack_unlock(lgt, rgt);
+    read_unlock(&rgt->lock);
 
     op->dev_bus_addr = (u64)frame << PAGE_SHIFT;
     op->handle       = handle;
@@ -837,7 +836,7 @@ __gnttab_map_grant_ref(
         put_page(pg);
     }
 
-    spin_lock(&rgt->lock);
+    read_lock(&rgt->lock);
 
     act = &active_entry(rgt, op->ref);
 
@@ -857,7 +856,7 @@ __gnttab_map_grant_ref(
         gnttab_clear_flag(_GTF_reading, status);
 
  unlock_out:
-    spin_unlock(&rgt->lock);
+    read_unlock(&rgt->lock);
     op->status = rc;
     put_maptrack_handle(lgt, handle);
     rcu_unlock_domain(rd);
@@ -907,18 +906,19 @@ __gnttab_unmap_common(
     }
 
     op->map = &maptrack_entry(lgt, op->handle);
-    spin_lock(&lgt->lock);
+
+    read_lock(&lgt->lock);
 
     if ( unlikely(!op->map->flags) )
     {
-        spin_unlock(&lgt->lock);
+        read_unlock(&lgt->lock);
         gdprintk(XENLOG_INFO, "Zero flags for handle (%d).\n", op->handle);
         op->status = GNTST_bad_handle;
         return;
     }
 
     dom = op->map->domid;
-    spin_unlock(&lgt->lock);
+    read_unlock(&lgt->lock);
 
     if ( unlikely((rd = rcu_lock_domain_by_id(dom)) == NULL) )
     {
@@ -939,7 +939,9 @@ __gnttab_unmap_common(
     TRACE_1D(TRC_MEM_PAGE_GRANT_UNMAP, dom);
 
     rgt = rd->grant_table;
-    double_gt_lock(lgt, rgt);
+
+    read_lock(&rgt->lock);
+    double_maptrack_lock(lgt, rgt);
 
     op->flags = op->map->flags;
     if ( unlikely(!op->flags) || unlikely(op->map->domid != dom) )
@@ -1009,7 +1011,9 @@ __gnttab_unmap_common(
          gnttab_mark_dirty(rd, op->frame);
 
  unmap_out:
-    double_gt_unlock(lgt, rgt);
+    double_maptrack_unlock(lgt, rgt);
+    read_unlock(&rgt->lock);
+
     op->status = rc;
     rcu_unlock_domain(rd);
 }
@@ -1039,8 +1043,8 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common *op)
 
     rcu_lock_domain(rd);
     rgt = rd->grant_table;
-    spin_lock(&rgt->lock);
 
+    read_lock(&rgt->lock);
     if ( rgt->gt_version == 0 )
         goto unmap_out;
 
@@ -1104,7 +1108,7 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common *op)
         gnttab_clear_flag(_GTF_reading, status);
 
  unmap_out:
-    spin_unlock(&rgt->lock);
+    read_unlock(&rgt->lock);
     if ( put_handle )
     {
         op->map->flags = 0;
@@ -1290,10 +1294,14 @@ gnttab_unpopulate_status_frames(struct domain *d, struct grant_table *gt)
     gt->nr_status_frames = 0;
 }
 
+/* 
+ * Grow the grant table. The caller must hold the grant table's
+ * write lock before calling this function.
+ */
 int
 gnttab_grow_table(struct domain *d, unsigned int req_nr_frames)
 {
-    /* d's grant table lock must be held by the caller */
+    /* d's grant table write lock must be held by the caller */
 
     struct grant_table *gt = d->grant_table;
     unsigned int i;
@@ -1398,7 +1406,7 @@ gnttab_setup_table(
     }
 
     gt = d->grant_table;
-    spin_lock(&gt->lock);
+    write_lock(&gt->lock);
 
     if ( gt->gt_version == 0 )
         gt->gt_version = 1;
@@ -1426,7 +1434,7 @@ gnttab_setup_table(
     }
 
  out3:
-    spin_unlock(&gt->lock);
+    write_unlock(&gt->lock);
  out2:
     rcu_unlock_domain(d);
  out1:
@@ -1468,13 +1476,13 @@ gnttab_query_size(
         goto query_out_unlock;
     }
 
-    spin_lock(&d->grant_table->lock);
+    read_lock(&d->grant_table->lock);
 
     op.nr_frames     = nr_grant_frames(d->grant_table);
     op.max_nr_frames = max_grant_frames;
     op.status        = GNTST_okay;
 
-    spin_unlock(&d->grant_table->lock);
+    read_unlock(&d->grant_table->lock);
 
  
  query_out_unlock:
@@ -1500,7 +1508,7 @@ gnttab_prepare_for_transfer(
     union grant_combo   scombo, prev_scombo, new_scombo;
     int                 retries = 0;
 
-    spin_lock(&rgt->lock);
+    read_lock(&rgt->lock);
 
     if ( rgt->gt_version == 0 )
     {
@@ -1551,11 +1559,11 @@ gnttab_prepare_for_transfer(
         scombo = prev_scombo;
     }
 
-    spin_unlock(&rgt->lock);
+    read_unlock(&rgt->lock);
     return 1;
 
  fail:
-    spin_unlock(&rgt->lock);
+    read_unlock(&rgt->lock);
     return 0;
 }
 
@@ -1748,7 +1756,7 @@ gnttab_transfer(
         TRACE_1D(TRC_MEM_PAGE_GRANT_TRANSFER, e->domain_id);
 
         /* Tell the guest about its new page frame. */
-        spin_lock(&e->grant_table->lock);
+        read_lock(&e->grant_table->lock);
 
         if ( e->grant_table->gt_version == 1 )
         {
@@ -1766,7 +1774,7 @@ gnttab_transfer(
         shared_entry_header(e->grant_table, gop.ref)->flags |=
             GTF_transfer_completed;
 
-        spin_unlock(&e->grant_table->lock);
+        read_unlock(&e->grant_table->lock);
 
         rcu_unlock_domain(e);
 
@@ -1804,7 +1812,7 @@ __release_grant_for_copy(
     released_read = 0;
     released_write = 0;
 
-    spin_lock(&rgt->lock);
+    read_lock(&rgt->lock);
 
     act = &active_entry(rgt, gref);
     sha = shared_entry_header(rgt, gref);
@@ -1845,7 +1853,7 @@ __release_grant_for_copy(
         released_read = 1;
     }
 
-    spin_unlock(&rgt->lock);
+    read_unlock(&rgt->lock);
 
     if ( td != rd )
     {
@@ -1903,7 +1911,7 @@ __acquire_grant_for_copy(
 
     *page = NULL;
 
-    spin_lock(&rgt->lock);
+    read_lock(&rgt->lock);
 
     if ( rgt->gt_version == 0 )
         PIN_FAIL(unlock_out, GNTST_general_error,
@@ -1972,17 +1980,23 @@ __acquire_grant_for_copy(
                 PIN_FAIL(unlock_out_clear, GNTST_general_error,
                          "transitive grant referenced bad domain %d\n",
                          trans_domid);
-            spin_unlock(&rgt->lock);
+
+            /* 
+             * __acquire_grant_for_copy() could take the read lock on
+             * the right table (if rd == td), so we have to drop the
+             * lock here and reacquire
+             */
+            read_unlock(&rgt->lock);
 
             rc = __acquire_grant_for_copy(td, trans_gref, rd->domain_id,
                                           readonly, &grant_frame, page,
                                           &trans_page_off, &trans_length, 0);
 
-            spin_lock(&rgt->lock);
+            read_lock(&rgt->lock);
             if ( rc != GNTST_okay ) {
                 __fixup_status_for_copy_pin(act, status);
+                read_unlock(&rgt->lock);
                 rcu_unlock_domain(td);
-                spin_unlock(&rgt->lock);
                 return rc;
             }
 
@@ -1994,7 +2008,7 @@ __acquire_grant_for_copy(
             {
                 __fixup_status_for_copy_pin(act, status);
                 rcu_unlock_domain(td);
-                spin_unlock(&rgt->lock);
+                read_unlock(&rgt->lock);
                 put_page(*page);
                 return __acquire_grant_for_copy(rd, gref, ldom, readonly,
                                                 frame, page, page_off, length,
@@ -2062,7 +2076,7 @@ __acquire_grant_for_copy(
     *length = act->length;
     *frame = act->frame;
 
-    spin_unlock(&rgt->lock);
+    read_unlock(&rgt->lock);
     return rc;
  
  unlock_out_clear:
@@ -2074,7 +2088,8 @@ __acquire_grant_for_copy(
         gnttab_clear_flag(_GTF_reading, status);
 
  unlock_out:
-    spin_unlock(&rgt->lock);
+    read_unlock(&rgt->lock);
+
     return rc;
 }
 
@@ -2390,7 +2405,7 @@ gnttab_set_version(XEN_GUEST_HANDLE_PARAM(gnttab_set_version_t) uop)
     if ( gt->gt_version == op.version )
         goto out;
 
-    spin_lock(&gt->lock);
+    write_lock(&gt->lock);
     /* Make sure that the grant table isn't currently in use when we
        change the version number, except for the first 8 entries which
        are allowed to be in use (xenstore/xenconsole keeps them mapped).
@@ -2476,7 +2491,7 @@ gnttab_set_version(XEN_GUEST_HANDLE_PARAM(gnttab_set_version_t) uop)
     gt->gt_version = op.version;
 
 out_unlock:
-    spin_unlock(&gt->lock);
+    write_unlock(&gt->lock);
 
 out:
     op.version = gt->gt_version;
@@ -2532,7 +2547,7 @@ gnttab_get_status_frames(XEN_GUEST_HANDLE_PARAM(gnttab_get_status_frames_t) uop,
 
     op.status = GNTST_okay;
 
-    spin_lock(&gt->lock);
+    read_lock(&gt->lock);
 
     for ( i = 0; i < op.nr_frames; i++ )
     {
@@ -2541,7 +2556,7 @@ gnttab_get_status_frames(XEN_GUEST_HANDLE_PARAM(gnttab_get_status_frames_t) uop,
             op.status = GNTST_bad_virt_addr;
     }
 
-    spin_unlock(&gt->lock);
+    read_unlock(&gt->lock);
 out2:
     rcu_unlock_domain(d);
 out1:
@@ -2590,7 +2605,7 @@ __gnttab_swap_grant_ref(grant_ref_t ref_a, grant_ref_t ref_b)
     struct active_grant_entry *act;
     s16 rc = GNTST_okay;
 
-    spin_lock(&gt->lock);
+    write_lock(&gt->lock);
 
     /* Bounds check on the grant refs */
     if ( unlikely(ref_a >= nr_grant_entries(d->grant_table)))
@@ -2630,7 +2645,7 @@ __gnttab_swap_grant_ref(grant_ref_t ref_a, grant_ref_t ref_b)
     }
 
 out:
-    spin_unlock(&gt->lock);
+    write_unlock(&gt->lock);
 
     rcu_unlock_domain(d);
 
@@ -2701,12 +2716,12 @@ static int __gnttab_cache_flush(gnttab_cache_flush_t *cflush,
 
     if ( d != owner )
     {
-        spin_lock(&owner->grant_table->lock);
+        read_lock(&owner->grant_table->lock);
 
         ret = grant_map_exists(d, owner->grant_table, mfn, ref_count);
         if ( ret != 0 )
         {
-            spin_unlock(&owner->grant_table->lock);
+            read_unlock(&owner->grant_table->lock);
             rcu_unlock_domain(d);
             put_page(page);
             return ret;
@@ -2726,7 +2741,7 @@ static int __gnttab_cache_flush(gnttab_cache_flush_t *cflush,
         ret = 0;
 
     if ( d != owner )
-        spin_unlock(&owner->grant_table->lock);
+        read_unlock(&owner->grant_table->lock);
     unmap_domain_page(v);
     put_page(page);
 
@@ -2945,7 +2960,8 @@ grant_table_create(
         goto no_mem_0;
 
     /* Simple stuff. */
-    spin_lock_init(&t->lock);
+    rwlock_init(&t->lock);
+    spin_lock_init(&t->maptrack_lock);
     t->nr_grant_frames = INITIAL_NR_GRANT_FRAMES;
 
     /* Active grant table. */
@@ -3052,7 +3068,8 @@ gnttab_release_mappings(
         }
 
         rgt = rd->grant_table;
-        spin_lock(&rgt->lock);
+        read_lock(&rgt->lock);
+        double_maptrack_lock(gt, rgt);
 
         act = &active_entry(rgt, ref);
         sha = shared_entry_header(rgt, ref);
@@ -3112,7 +3129,8 @@ gnttab_release_mappings(
         if ( act->pin == 0 )
             gnttab_clear_flag(_GTF_reading, status);
 
-        spin_unlock(&rgt->lock);
+        double_maptrack_unlock(gt, rgt);
+        read_unlock(&rgt->lock);
 
         rcu_unlock_domain(rd);
 
@@ -3160,7 +3178,7 @@ static void gnttab_usage_print(struct domain *rd)
     printk("      -------- active --------       -------- shared --------\n");
     printk("[ref] localdom mfn      pin          localdom gmfn     flags\n");
 
-    spin_lock(&gt->lock);
+    read_lock(&gt->lock);
 
     if ( gt->gt_version == 0 )
         goto out;
@@ -3209,7 +3227,7 @@ static void gnttab_usage_print(struct domain *rd)
     }
 
  out:
-    spin_unlock(&gt->lock);
+    read_unlock(&gt->lock);
 
     if ( first )
         printk("grant-table for remote domain:%5d ... "
diff --git a/xen/include/xen/grant_table.h b/xen/include/xen/grant_table.h
index 32f5786..98cc94a 100644
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -64,6 +64,11 @@ struct grant_mapping {
 
 /* Per-domain grant information. */
 struct grant_table {
+    /* 
+     * Lock protecting updates to grant table state (version, active
+     * entry list, etc.)
+     */
+    rwlock_t              lock;
     /* Table size. Number of frames shared with guest */
     unsigned int          nr_grant_frames;
     /* Shared grant table (see include/public/grant_table.h). */
@@ -82,8 +87,8 @@ struct grant_table {
     struct grant_mapping **maptrack;
     unsigned int          maptrack_head;
     unsigned int          maptrack_limit;
-    /* Lock protecting updates to active and shared grant tables. */
-    spinlock_t            lock;
+    /* Lock protecting the maptrack page list, head, and limit */
+    spinlock_t            maptrack_lock;
     /* The defined versions are 1 and 2.  Set to 0 if we don't know
        what version to use yet. */
     unsigned              gt_version;
@@ -101,7 +106,7 @@ gnttab_release_mappings(
     struct domain *d);
 
 /* Increase the size of a domain's grant table.
- * Caller must hold d's grant table lock.
+ * Caller must hold d's grant table write lock.
  */
 int
 gnttab_grow_table(struct domain *d, unsigned int req_nr_frames);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCHv7 2/3] gnttab: refactor locking for scalability
  2015-04-30 13:28 [PATCHv7 0/3] gnttab: Improve scaleability David Vrabel
  2015-04-30 13:28 ` [PATCHv7 1/3] gnttab: Introduce rwlock to protect updates to grant table state David Vrabel
@ 2015-04-30 13:28 ` David Vrabel
  2015-04-30 14:34   ` Tim Deegan
  2015-05-05 11:58   ` Jan Beulich
  2015-04-30 13:28 ` [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists David Vrabel
  2 siblings, 2 replies; 16+ messages in thread
From: David Vrabel @ 2015-04-30 13:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Jan Beulich, Christoph Egger, Tim Deegan,
	David Vrabel, Matt Wilson, Malcolm Crossley, Ian Campbell

From: Matt Wilson <msw@amazon.com>

This patch refactors grant table locking. It splits the previous
single spinlock per grant table into multiple locks. The heavily
modified components of the grant table (the maptrack state and the
active entries) are now protected by their own spinlocks. The
remaining elements of the grant table are read-mostly, so the main
grant table lock is modified to be a rwlock to improve concurrency.

Workloads with high grant table operation rates, especially map/unmap
operations as used by blkback/blkfront when persistent grants are not
supported, show marked improvement with these changes. A domU with 24
VBDs in a streaming 2M write workload achieved 1,400 MB/sec before
this change. Performance more than doubles with this patch, reaching
3,000 MB/sec before tuning and 3,600 MB/sec after adjusting event
channel vCPU bindings.
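
For illustration, a minimal sketch of the acquire/release pattern the
per-entry spinlocks introduce (each entry's lock is taken while the
table-wide rwlock is held), mirroring the active_entry_acquire()/
active_entry_release() helpers this patch adds. The demo_* names below
are made-up stand-ins using POSIX locks, not the patch's actual code:

  #include <pthread.h>

  struct demo_active_entry {
      pthread_mutex_t lock;     /* per-entry lock       */
      unsigned int    pin;
      unsigned long   frame;
  };

  struct demo_table {
      pthread_rwlock_t          lock;    /* table-wide rwlock  */
      struct demo_active_entry *active;  /* active entry array */
  };

  /* Caller must hold gt->lock (read or write) across acquire..release. */
  static struct demo_active_entry *
  demo_active_entry_acquire(struct demo_table *gt, unsigned int ref)
  {
      struct demo_active_entry *act = &gt->active[ref];

      pthread_mutex_lock(&act->lock);
      return act;
  }

  static void demo_active_entry_release(struct demo_active_entry *act)
  {
      pthread_mutex_unlock(&act->lock);
  }

  /* A typical caller, following the documented locking rules. */
  static unsigned long demo_read_frame(struct demo_table *gt, unsigned int ref)
  {
      struct demo_active_entry *act;
      unsigned long frame;

      pthread_rwlock_rdlock(&gt->lock);
      act = demo_active_entry_acquire(gt, ref);
      frame = act->frame;
      demo_active_entry_release(act);
      pthread_rwlock_unlock(&gt->lock);
      return frame;
  }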

Signed-off-by: Matt Wilson <msw@amazon.com>
[chegger: ported to xen-staging, split into multiple commits]
Signed-off-by: Christoph Egger <chegger@amazon.de>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 docs/misc/grant-tables.txt |   21 +++++
 xen/common/grant_table.c   |  213 +++++++++++++++++++++++++++-----------------
 2 files changed, 153 insertions(+), 81 deletions(-)

diff --git a/docs/misc/grant-tables.txt b/docs/misc/grant-tables.txt
index c9ae8f2..1ada018 100644
--- a/docs/misc/grant-tables.txt
+++ b/docs/misc/grant-tables.txt
@@ -63,6 +63,7 @@ is complete.
   act->domid : remote domain being granted rights
   act->frame : machine frame being granted
   act->pin   : used to hold reference counts
+  act->lock  : spinlock used to serialize access to active entry state
 
  Map tracking
  ~~~~~~~~~~~~
@@ -87,6 +88,8 @@ is complete.
                                version, partially initialized active table pages,
                                etc.
   grant_table->maptrack_lock : spinlock used to protect the maptrack state
+  active_grant_entry->lock   : spinlock used to serialize modifications to
+                               active entries
 
  The primary lock for the grant table is a read/write spinlock. All
  functions that access members of struct grant_table must acquire a
@@ -102,6 +105,24 @@ is complete.
  state can be rapidly modified under some workloads, and the critical
  sections are very small, thus we use a spinlock to protect them.
 
+ Active entries are obtained by calling active_entry_acquire(gt, ref).
+ This function returns a pointer to the active entry after locking its
+ spinlock. The caller must hold the rwlock for the gt in question
+ before calling active_entry_acquire(). This is because the grant
+ table can be dynamically extended via gnttab_grow_table() while a
+ domain is running and must be fully initialized. Once all access to
+ the active entry is complete, release the lock by calling
+ active_entry_release(act).
+
+ Summary of rules for locking:
+  active_entry_acquire() and active_entry_release() can only be
+  called when holding the relevant grant table's lock. I.e.:
+    read_lock(&gt->lock);
+    act = active_entry_acquire(gt, ref);
+    ...
+    active_entry_release(act);
+    read_unlock(&gt->lock);
+
 ********************************************************************************
 
  Granting a foreign domain access to frames
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 4434a2d..569b3d3 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -157,10 +157,13 @@ struct active_grant_entry {
                                in the page.                           */
     unsigned      length:16; /* For sub-page grants, the length of the
                                 grant.                                */
+    spinlock_t    lock;      /* lock to protect access of this entry.
+                                see docs/misc/grant-tables.txt for
+                                locking protocol                      */
 };
 
 #define ACGNT_PER_PAGE (PAGE_SIZE / sizeof(struct active_grant_entry))
-#define active_entry(t, e) \
+#define _active_entry(t, e) \
     ((t)->active[(e)/ACGNT_PER_PAGE][(e)%ACGNT_PER_PAGE])
 
 static inline void gnttab_flush_tlb(const struct domain *d)
@@ -188,6 +191,28 @@ nr_active_grant_frames(struct grant_table *gt)
     return num_act_frames_from_sha_frames(nr_grant_frames(gt));
 }
 
+static inline struct active_grant_entry *
+active_entry_acquire(struct grant_table *t, grant_ref_t e)
+{
+    struct active_grant_entry *act;
+
+    /* 
+     * not perfect, but better than nothing for a debug build
+     * sanity check
+     */
+    ASSERT(rw_is_locked(&t->lock));
+
+    act = &_active_entry(t, e);
+    spin_lock(&act->lock);
+
+    return act;
+}
+
+static inline void active_entry_release(struct active_grant_entry *act)
+{
+    spin_unlock(&act->lock);
+}
+    
 /* Check if the page has been paged out, or needs unsharing. 
    If rc == GNTST_okay, *page contains the page struct with a ref taken.
    Caller must do put_page(*page).
@@ -228,30 +253,6 @@ static int __get_paged_frame(unsigned long gfn, unsigned long *frame, struct pag
     return rc;
 }
 
-static inline void
-double_maptrack_lock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    if ( lgt < rgt )
-    {
-        spin_lock(&lgt->maptrack_lock);
-        spin_lock(&rgt->maptrack_lock);
-    }
-    else
-    {
-        if ( lgt != rgt )
-            spin_lock(&rgt->maptrack_lock);
-        spin_lock(&lgt->maptrack_lock);
-    }
-}
-
-static inline void
-double_maptrack_unlock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    spin_unlock(&lgt->maptrack_lock);
-    if ( lgt != rgt )
-        spin_unlock(&rgt->maptrack_lock);
-}
-
 static inline int
 __get_maptrack_handle(
     struct grant_table *t)
@@ -316,7 +317,10 @@ get_maptrack_handle(
     return handle;
 }
 
-/* Number of grant table entries. Caller must hold d's grant table lock. */
+/* 
+ * Number of grant table entries. Caller must hold d's grant table
+ * read lock.
+ */
 static unsigned int nr_grant_entries(struct grant_table *gt)
 {
     ASSERT(gt->gt_version != 0);
@@ -505,26 +509,23 @@ static int grant_map_exists(const struct domain *ld,
                             unsigned long mfn,
                             unsigned int *ref_count)
 {
-    const struct active_grant_entry *act;
+    struct active_grant_entry *act;
     unsigned int ref, max_iter;
     
     ASSERT(rw_is_locked(&rgt->lock));
 
     max_iter = min(*ref_count + (1 << GNTTABOP_CONTINUATION_ARG_SHIFT),
                    nr_grant_entries(rgt));
-    for ( ref = *ref_count; ref < max_iter; ref++ )
+    for ( ref = *ref_count; ref < max_iter; active_entry_release(act), ref++ )
     {
-        act = &active_entry(rgt, ref);
-
-        if ( !act->pin )
-            continue;
+        act = active_entry_acquire(rgt, ref);
 
-        if ( act->domid != ld->domain_id )
-            continue;
-
-        if ( act->frame != mfn )
+        if ( !act->pin ||
+             act->domid != ld->domain_id ||
+             act->frame != mfn )
             continue;
 
+        active_entry_release(act);
         return 0;
     }
 
@@ -537,6 +538,13 @@ static int grant_map_exists(const struct domain *ld,
     return -EINVAL;
 }
 
+/* 
+ * Count the number of mapped ro or rw entries tracked in the left
+ * grant table for an mfn provided by a foreign domain.
+ *
+ * This function takes the maptrack lock from the left grant table and
+ * must be called with the right grant table's rwlock held.
+ */
 static void mapcount(
     struct grant_table *lgt, struct domain *rd, unsigned long mfn,
     unsigned int *wrc, unsigned int *rdc)
@@ -546,15 +554,29 @@ static void mapcount(
 
     *wrc = *rdc = 0;
 
+    /* 
+     * N.B.: while taking the left side maptrack spinlock prevents
+     * any mapping changes, the right side active entries could be
+     * changing while we are counting. The caller has to hold the
+     * grant table write lock, or some other mechanism should be
+     * used to prevent concurrent changes during this operation.
+     * This is tricky because we can't promote a read lock into
+     * a write lock.
+     */
+    ASSERT(rw_is_locked(&rd->grant_table->lock));
+    spin_lock(&lgt->maptrack_lock);
+
     for ( handle = 0; handle < lgt->maptrack_limit; handle++ )
     {
         map = &maptrack_entry(lgt, handle);
         if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) ||
              map->domid != rd->domain_id )
             continue;
-        if ( active_entry(rd->grant_table, map->ref).frame == mfn )
+        if ( _active_entry(rd->grant_table, map->ref).frame == mfn )
             (map->flags & GNTMAP_readonly) ? (*rdc)++ : (*wrc)++;
     }
+
+    spin_unlock(&lgt->maptrack_lock);
 }
 
 /*
@@ -576,7 +598,6 @@ __gnttab_map_grant_ref(
     struct page_info *pg = NULL;
     int            rc = GNTST_okay;
     u32            old_pin;
-    u32            act_pin;
     unsigned int   cache_flags;
     struct active_grant_entry *act = NULL;
     struct grant_mapping *mt;
@@ -639,7 +660,7 @@ __gnttab_map_grant_ref(
     if ( unlikely(op->ref >= nr_grant_entries(rgt)))
         PIN_FAIL(unlock_out, GNTST_bad_gntref, "Bad ref (%d).\n", op->ref);
 
-    act = &active_entry(rgt, op->ref);
+    act = active_entry_acquire(rgt, op->ref);
     shah = shared_entry_header(rgt, op->ref);
     if (rgt->gt_version == 1) {
         sha1 = &shared_entry_v1(rgt, op->ref);
@@ -656,7 +677,7 @@ __gnttab_map_grant_ref(
          ((act->domid != ld->domain_id) ||
           (act->pin & 0x80808080U) != 0 ||
           (act->is_sub_page)) )
-        PIN_FAIL(unlock_out, GNTST_general_error,
+        PIN_FAIL(act_release_out, GNTST_general_error,
                  "Bad domain (%d != %d), or risk of counter overflow %08x, or subpage %d\n",
                  act->domid, ld->domain_id, act->pin, act->is_sub_page);
 
@@ -667,7 +688,7 @@ __gnttab_map_grant_ref(
         if ( (rc = _set_status(rgt->gt_version, ld->domain_id,
                                op->flags & GNTMAP_readonly,
                                1, shah, act, status) ) != GNTST_okay )
-             goto unlock_out;
+            goto act_release_out;
 
         if ( !act->pin )
         {
@@ -698,7 +719,6 @@ __gnttab_map_grant_ref(
             GNTPIN_hstr_inc : GNTPIN_hstw_inc;
 
     frame = act->frame;
-    act_pin = act->pin;
 
     cache_flags = (shah->flags & (GTF_PAT | GTF_PWT | GTF_PCD) );
 
@@ -776,8 +796,6 @@ __gnttab_map_grant_ref(
         goto undo_out;
     }
 
-    double_maptrack_lock(lgt, rgt);
-
     if ( gnttab_need_iommu_mapping(ld) )
     {
         unsigned int wrc, rdc;
@@ -785,21 +803,20 @@ __gnttab_map_grant_ref(
         /* We're not translated, so we know that gmfns and mfns are
            the same things, so the IOMMU entry is always 1-to-1. */
         mapcount(lgt, rd, frame, &wrc, &rdc);
-        if ( (act_pin & (GNTPIN_hstw_mask|GNTPIN_devw_mask)) &&
+        if ( (act->pin & (GNTPIN_hstw_mask|GNTPIN_devw_mask)) &&
              !(old_pin & (GNTPIN_hstw_mask|GNTPIN_devw_mask)) )
         {
             if ( wrc == 0 )
                 err = iommu_map_page(ld, frame, frame,
                                      IOMMUF_readable|IOMMUF_writable);
         }
-        else if ( act_pin && !old_pin )
+        else if ( act->pin && !old_pin )
         {
             if ( (wrc + rdc) == 0 )
                 err = iommu_map_page(ld, frame, frame, IOMMUF_readable);
         }
         if ( err )
         {
-            double_maptrack_unlock(lgt, rgt);
             rc = GNTST_general_error;
             goto undo_out;
         }
@@ -812,7 +829,7 @@ __gnttab_map_grant_ref(
     mt->ref   = op->ref;
     mt->flags = op->flags;
 
-    double_maptrack_unlock(lgt, rgt);
+    active_entry_release(act);
     read_unlock(&rgt->lock);
 
     op->dev_bus_addr = (u64)frame << PAGE_SHIFT;
@@ -836,10 +853,6 @@ __gnttab_map_grant_ref(
         put_page(pg);
     }
 
-    read_lock(&rgt->lock);
-
-    act = &active_entry(rgt, op->ref);
-
     if ( op->flags & GNTMAP_device_map )
         act->pin -= (op->flags & GNTMAP_readonly) ?
             GNTPIN_devr_inc : GNTPIN_devw_inc;
@@ -855,6 +868,9 @@ __gnttab_map_grant_ref(
     if ( !act->pin )
         gnttab_clear_flag(_GTF_reading, status);
 
+ act_release_out:
+    active_entry_release(act);
+
  unlock_out:
     read_unlock(&rgt->lock);
     op->status = rc;
@@ -941,18 +957,17 @@ __gnttab_unmap_common(
     rgt = rd->grant_table;
 
     read_lock(&rgt->lock);
-    double_maptrack_lock(lgt, rgt);
 
     op->flags = op->map->flags;
     if ( unlikely(!op->flags) || unlikely(op->map->domid != dom) )
     {
         gdprintk(XENLOG_WARNING, "Unstable handle %u\n", op->handle);
         rc = GNTST_bad_handle;
-        goto unmap_out;
+        goto read_unlock_out;
     }
 
     op->rd = rd;
-    act = &active_entry(rgt, op->map->ref);
+    act = active_entry_acquire(rgt, op->map->ref);
 
     if ( op->frame == 0 )
     {
@@ -961,7 +976,7 @@ __gnttab_unmap_common(
     else
     {
         if ( unlikely(op->frame != act->frame) )
-            PIN_FAIL(unmap_out, GNTST_general_error,
+            PIN_FAIL(act_release_out, GNTST_general_error,
                      "Bad frame number doesn't match gntref. (%lx != %lx)\n",
                      op->frame, act->frame);
         if ( op->flags & GNTMAP_device_map )
@@ -980,7 +995,7 @@ __gnttab_unmap_common(
         if ( (rc = replace_grant_host_mapping(op->host_addr,
                                               op->frame, op->new_addr, 
                                               op->flags)) < 0 )
-            goto unmap_out;
+            goto act_release_out;
 
         ASSERT(act->pin & (GNTPIN_hstw_mask | GNTPIN_hstr_mask));
         op->map->flags &= ~GNTMAP_host_map;
@@ -1002,7 +1017,7 @@ __gnttab_unmap_common(
         if ( err )
         {
             rc = GNTST_general_error;
-            goto unmap_out;
+            goto act_release_out;
         }
     }
 
@@ -1010,8 +1025,9 @@ __gnttab_unmap_common(
     if ( !(op->flags & GNTMAP_readonly) )
          gnttab_mark_dirty(rd, op->frame);
 
- unmap_out:
-    double_maptrack_unlock(lgt, rgt);
+ act_release_out:
+    active_entry_release(act);
+ read_unlock_out:
     read_unlock(&rgt->lock);
 
     op->status = rc;
@@ -1046,9 +1062,9 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common *op)
 
     read_lock(&rgt->lock);
     if ( rgt->gt_version == 0 )
-        goto unmap_out;
+        goto unlock_out;
 
-    act = &active_entry(rgt, op->map->ref);
+    act = active_entry_acquire(rgt, op->map->ref);
     sha = shared_entry_header(rgt, op->map->ref);
 
     if ( rgt->gt_version == 1 )
@@ -1062,7 +1078,7 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common *op)
          * Suggests that __gntab_unmap_common failed early and so
          * nothing further to do
          */
-        goto unmap_out;
+        goto act_release_out;
     }
 
     pg = mfn_to_page(op->frame);
@@ -1086,7 +1102,7 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common *op)
              * Suggests that __gntab_unmap_common failed in
              * replace_grant_host_mapping() so nothing further to do
              */
-            goto unmap_out;
+            goto act_release_out;
         }
 
         if ( !is_iomem_page(op->frame) ) 
@@ -1107,8 +1123,12 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common *op)
     if ( act->pin == 0 )
         gnttab_clear_flag(_GTF_reading, status);
 
- unmap_out:
+ act_release_out:
+    active_entry_release(act);
+    
+ unlock_out:
     read_unlock(&rgt->lock);
+
     if ( put_handle )
     {
         op->map->flags = 0;
@@ -1304,7 +1324,7 @@ gnttab_grow_table(struct domain *d, unsigned int req_nr_frames)
     /* d's grant table write lock must be held by the caller */
 
     struct grant_table *gt = d->grant_table;
-    unsigned int i;
+    unsigned int i, j;
 
     ASSERT(req_nr_frames <= max_grant_frames);
 
@@ -1319,6 +1339,8 @@ gnttab_grow_table(struct domain *d, unsigned int req_nr_frames)
         if ( (gt->active[i] = alloc_xenheap_page()) == NULL )
             goto active_alloc_failed;
         clear_page(gt->active[i]);
+        for ( j = 0; j < ACGNT_PER_PAGE; j++ )
+            spin_lock_init(&gt->active[i][j].lock);
     }
 
     /* Shared */
@@ -1814,7 +1836,7 @@ __release_grant_for_copy(
 
     read_lock(&rgt->lock);
 
-    act = &active_entry(rgt, gref);
+    act = active_entry_acquire(rgt, gref);
     sha = shared_entry_header(rgt, gref);
     r_frame = act->frame;
 
@@ -1853,6 +1875,7 @@ __release_grant_for_copy(
         released_read = 1;
     }
 
+    active_entry_release(act);
     read_unlock(&rgt->lock);
 
     if ( td != rd )
@@ -1914,14 +1937,14 @@ __acquire_grant_for_copy(
     read_lock(&rgt->lock);
 
     if ( rgt->gt_version == 0 )
-        PIN_FAIL(unlock_out, GNTST_general_error,
+        PIN_FAIL(gnt_unlock_out, GNTST_general_error,
                  "remote grant table not ready\n");
 
     if ( unlikely(gref >= nr_grant_entries(rgt)) )
-        PIN_FAIL(unlock_out, GNTST_bad_gntref,
+        PIN_FAIL(gnt_unlock_out, GNTST_bad_gntref,
                  "Bad grant reference %ld\n", gref);
 
-    act = &active_entry(rgt, gref);
+    act = active_entry_acquire(rgt, gref);
     shah = shared_entry_header(rgt, gref);
     if ( rgt->gt_version == 1 )
     {
@@ -1986,6 +2009,7 @@ __acquire_grant_for_copy(
              * the right table (if rd == td), so we have to drop the
              * lock here and reacquire
              */
+            active_entry_release(act);
             read_unlock(&rgt->lock);
 
             rc = __acquire_grant_for_copy(td, trans_gref, rd->domain_id,
@@ -1993,8 +2017,11 @@ __acquire_grant_for_copy(
                                           &trans_page_off, &trans_length, 0);
 
             read_lock(&rgt->lock);
+            act = active_entry_acquire(rgt, gref);
+
             if ( rc != GNTST_okay ) {
                 __fixup_status_for_copy_pin(act, status);
+                active_entry_release(act);
                 read_unlock(&rgt->lock);
                 rcu_unlock_domain(td);
                 return rc;
@@ -2008,6 +2035,7 @@ __acquire_grant_for_copy(
             {
                 __fixup_status_for_copy_pin(act, status);
                 rcu_unlock_domain(td);
+                active_entry_release(act);
                 read_unlock(&rgt->lock);
                 put_page(*page);
                 return __acquire_grant_for_copy(rd, gref, ldom, readonly,
@@ -2076,6 +2104,7 @@ __acquire_grant_for_copy(
     *length = act->length;
     *frame = act->frame;
 
+    active_entry_release(act);
     read_unlock(&rgt->lock);
     return rc;
  
@@ -2088,6 +2117,9 @@ __acquire_grant_for_copy(
         gnttab_clear_flag(_GTF_reading, status);
 
  unlock_out:
+    active_entry_release(act);
+
+ gnt_unlock_out:
     read_unlock(&rgt->lock);
 
     return rc;
@@ -2414,16 +2446,18 @@ gnttab_set_version(XEN_GUEST_HANDLE_PARAM(gnttab_set_version_t) uop)
     {
         for ( i = GNTTAB_NR_RESERVED_ENTRIES; i < nr_grant_entries(gt); i++ )
         {
-            act = &active_entry(gt, i);
+            act = active_entry_acquire(gt, i);
             if ( act->pin != 0 )
             {
                 gdprintk(XENLOG_WARNING,
                          "tried to change grant table version from %d to %d, but some grant entries still in use\n",
                          gt->gt_version,
                          op.version);
+                active_entry_release(act);
                 res = -EBUSY;
                 goto out_unlock;
             }
+            active_entry_release(act);
         }
     }
 
@@ -2602,9 +2636,17 @@ __gnttab_swap_grant_ref(grant_ref_t ref_a, grant_ref_t ref_b)
 {
     struct domain *d = rcu_lock_current_domain();
     struct grant_table *gt = d->grant_table;
-    struct active_grant_entry *act;
+    struct active_grant_entry *act_a = NULL;
+    struct active_grant_entry *act_b = NULL;
     s16 rc = GNTST_okay;
 
+    if ( ref_a == ref_b )
+        /*
+         * noop, so avoid acquiring the same active entry
+         * twice, which would cause a deadlock.
+         */
+        return rc;
+
     write_lock(&gt->lock);
 
     /* Bounds check on the grant refs */
@@ -2613,12 +2655,12 @@ __gnttab_swap_grant_ref(grant_ref_t ref_a, grant_ref_t ref_b)
     if ( unlikely(ref_b >= nr_grant_entries(d->grant_table)))
         PIN_FAIL(out, GNTST_bad_gntref, "Bad ref-b (%d).\n", ref_b);
 
-    act = &active_entry(gt, ref_a);
-    if ( act->pin )
+    act_a = active_entry_acquire(gt, ref_a);
+    if ( act_a->pin )
         PIN_FAIL(out, GNTST_eagain, "ref a %ld busy\n", (long)ref_a);
 
-    act = &active_entry(gt, ref_b);
-    if ( act->pin )
+    act_b = active_entry_acquire(gt, ref_b);
+    if ( act_b->pin )
         PIN_FAIL(out, GNTST_eagain, "ref b %ld busy\n", (long)ref_b);
 
     if ( gt->gt_version == 1 )
@@ -2645,6 +2687,10 @@ __gnttab_swap_grant_ref(grant_ref_t ref_a, grant_ref_t ref_b)
     }
 
 out:
+    if ( act_b != NULL )
+        active_entry_release(act_b);
+    if ( act_a != NULL )
+        active_entry_release(act_a);
     write_unlock(&gt->lock);
 
     rcu_unlock_domain(d);
@@ -2954,7 +3000,7 @@ grant_table_create(
     struct domain *d)
 {
     struct grant_table *t;
-    int                 i;
+    unsigned int i, j;
 
     if ( (t = xzalloc(struct grant_table)) == NULL )
         goto no_mem_0;
@@ -2974,6 +3020,8 @@ grant_table_create(
         if ( (t->active[i] = alloc_xenheap_page()) == NULL )
             goto no_mem_2;
         clear_page(t->active[i]);
+        for ( j = 0; j < ACGNT_PER_PAGE; j++ )
+            spin_lock_init(&t->active[i][j].lock);
     }
 
     /* Tracking of mapped foreign frames table */
@@ -3069,9 +3117,8 @@ gnttab_release_mappings(
 
         rgt = rd->grant_table;
         read_lock(&rgt->lock);
-        double_maptrack_lock(gt, rgt);
 
-        act = &active_entry(rgt, ref);
+        act = active_entry_acquire(rgt, ref);
         sha = shared_entry_header(rgt, ref);
         if (rgt->gt_version == 1)
             status = &sha->flags;
@@ -3129,7 +3176,7 @@ gnttab_release_mappings(
         if ( act->pin == 0 )
             gnttab_clear_flag(_GTF_reading, status);
 
-        double_maptrack_unlock(gt, rgt);
+        active_entry_release(act);
         read_unlock(&rgt->lock);
 
         rcu_unlock_domain(rd);
@@ -3192,9 +3239,12 @@ static void gnttab_usage_print(struct domain *rd)
         uint16_t status;
         uint64_t frame;
 
-        act = &active_entry(gt, ref);
+        act = active_entry_acquire(gt, ref);
         if ( !act->pin )
+        {
+            active_entry_release(act);
             continue;
+        }
 
         sha = shared_entry_header(gt, ref);
 
@@ -3224,6 +3274,7 @@ static void gnttab_usage_print(struct domain *rd)
         printk("[%3d]    %5d 0x%06lx 0x%08x      %5d 0x%06"PRIx64" 0x%02x\n",
                ref, act->domid, act->frame, act->pin,
                sha->domid, frame, status);
+        active_entry_release(act);
     }
 
  out:
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists
  2015-04-30 13:28 [PATCHv7 0/3] gnttab: Improve scaleability David Vrabel
  2015-04-30 13:28 ` [PATCHv7 1/3] gnttab: Introduce rwlock to protect updates to grant table state David Vrabel
  2015-04-30 13:28 ` [PATCHv7 2/3] gnttab: refactor locking for scalability David Vrabel
@ 2015-04-30 13:28 ` David Vrabel
  2015-04-30 15:12   ` Tim Deegan
  2015-05-05 12:34   ` Jan Beulich
  2 siblings, 2 replies; 16+ messages in thread
From: David Vrabel @ 2015-04-30 13:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Jan Beulich, Christoph Egger, Tim Deegan,
	David Vrabel, Matt Wilson, Malcolm Crossley, Ian Campbell

From: Malcolm Crossley <malcolm.crossley@citrix.com>

Performance analysis of aggregate network throughput with many VMs
shows that performance is significantly limited by contention on the
maptrack lock when obtaining/releasing maptrack handles from the free
list.

Instead of a single free list, use a per-VCPU list. This avoids any
contention when obtaining a handle.  Handles must be released back to
their original list, and since this may occur on a different VCPU,
there is some contention on the destination VCPU's free list tail
pointer (but this is much better than a per-domain lock).

A domain's maptrack limit is multiplied by the number of VCPUs.  This
ensures that a worst-case domain that only performs grant table
operations via one VCPU will not see a lower maptrack limit.
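
For illustration, a minimal sketch of returning a handle to another
VCPU's free list by atomically swinging that list's tail pointer,
using C11 atomics in place of Xen's cmpxchg(). The demo_* names and
layout are made up for illustration, not the patch's actual code:

  #include <stdatomic.h>
  #include <stdint.h>

  #define DEMO_TAIL (~0u)

  struct demo_entry {
      uint32_t next;                 /* next free handle, or DEMO_TAIL */
  };

  struct demo_vcpu {
      _Atomic uint32_t maptrack_tail;
  };

  /*
   * Append 'handle' to the owning VCPU's free list.  Only the tail
   * pointer is contended, so a CAS loop suffices; no per-domain lock.
   */
  static void demo_put_handle(struct demo_entry *entries,
                              struct demo_vcpu *owner, uint32_t handle)
  {
      uint32_t prev_tail;

      entries[handle].next = DEMO_TAIL;   /* new handle terminates the list */
      prev_tail = atomic_load(&owner->maptrack_tail);
      /* On failure, prev_tail is reloaded with the current tail value. */
      while ( !atomic_compare_exchange_weak(&owner->maptrack_tail,
                                            &prev_tail, handle) )
          ;
      entries[prev_tail].next = handle;   /* link the old tail to the new one */
  }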

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/grant_table.c      |  131 ++++++++++++++++++++++++-----------------
 xen/include/xen/grant_table.h |    8 ++-
 xen/include/xen/sched.h       |    5 ++
 3 files changed, 87 insertions(+), 57 deletions(-)

diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 569b3d3..d8e7373 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -118,9 +118,9 @@ struct gnttab_unmap_common {
     ((t)->maptrack[(e)/MAPTRACK_PER_PAGE][(e)%MAPTRACK_PER_PAGE])
 
 static inline unsigned int
-nr_maptrack_frames(struct grant_table *t)
+nr_vcpu_maptrack_frames(struct vcpu *v)
 {
-    return t->maptrack_limit / MAPTRACK_PER_PAGE;
+    return v->maptrack_limit / MAPTRACK_PER_PAGE;
 }
 
 #define MAPTRACK_TAIL (~0u)
@@ -255,66 +255,91 @@ static int __get_paged_frame(unsigned long gfn, unsigned long *frame, struct pag
 
 static inline int
 __get_maptrack_handle(
-    struct grant_table *t)
+    struct grant_table *t,
+    struct vcpu *v)
 {
-    unsigned int h;
-    if ( unlikely((h = t->maptrack_head) == MAPTRACK_TAIL) )
+    unsigned int head, next;
+
+    head = v->maptrack_head;
+    if ( unlikely(head == MAPTRACK_TAIL) )
+        return -1;
+
+    next = maptrack_entry(t, head).ref;
+    if ( unlikely(next == MAPTRACK_TAIL) )
         return -1;
-    t->maptrack_head = maptrack_entry(t, h).ref;
-    return h;
+
+    v->maptrack_head = next;
+
+    return head;
 }
 
 static inline void
 put_maptrack_handle(
     struct grant_table *t, int handle)
 {
-    spin_lock(&t->maptrack_lock);
-    maptrack_entry(t, handle).ref = t->maptrack_head;
-    t->maptrack_head = handle;
-    spin_unlock(&t->maptrack_lock);
+    struct domain *d = current->domain;
+    struct vcpu *v;
+    u32 prev_tail, cur_tail;
+
+    /* Set current handle to have TAIL reference */
+    maptrack_entry(t, handle).ref = MAPTRACK_TAIL;
+    v = d->vcpu[maptrack_entry(t,handle).vcpu];
+    cur_tail = v->maptrack_tail;
+    do {
+        prev_tail = cur_tail;
+        cur_tail = cmpxchg((u32 *)&v->maptrack_tail, prev_tail, handle);
+    } while ( cur_tail != prev_tail );
+    maptrack_entry(t, prev_tail).ref = handle;
 }
 
 static inline int
 get_maptrack_handle(
     struct grant_table *lgt)
 {
+    struct vcpu          *v = current;
     int                   i;
     grant_handle_t        handle;
     struct grant_mapping *new_mt;
-    unsigned int          new_mt_limit, nr_frames;
-
-    spin_lock(&lgt->maptrack_lock);
+    unsigned int          nr_frames;
 
-    while ( unlikely((handle = __get_maptrack_handle(lgt)) == -1) )
-    {
-        nr_frames = nr_maptrack_frames(lgt);
-        if ( nr_frames >= max_maptrack_frames )
-            break;
-
-        new_mt = alloc_xenheap_page();
-        if ( !new_mt )
-            break;
+    handle = __get_maptrack_handle(lgt, v);
+    if ( likely(handle != -1) )
+        return handle;
 
-        clear_page(new_mt);
+    nr_frames = nr_vcpu_maptrack_frames(v);
+    if ( nr_frames >= max_maptrack_frames )
+        return -1;
 
-        new_mt_limit = lgt->maptrack_limit + MAPTRACK_PER_PAGE;
+    new_mt = alloc_xenheap_page();
+    if ( !new_mt )
+        return -1;
+    clear_page(new_mt);
 
-        for ( i = 1; i < MAPTRACK_PER_PAGE; i++ )
-            new_mt[i - 1].ref = lgt->maptrack_limit + i;
-        new_mt[i - 1].ref = lgt->maptrack_head;
-        lgt->maptrack_head = lgt->maptrack_limit;
+    spin_lock(&lgt->maptrack_lock);
 
-        lgt->maptrack[nr_frames] = new_mt;
-        smp_wmb();
-        lgt->maptrack_limit      = new_mt_limit;
+    for ( i = 1; i < MAPTRACK_PER_PAGE; i++ ){
+        new_mt[i - 1].ref = (lgt->maptrack_pages * MAPTRACK_PER_PAGE) + i;
+        new_mt[i - 1].vcpu = v->vcpu_id;
+    }
+    /* Set last entry vcpu and ref */
+    new_mt[i - 1].ref = v->maptrack_head;
+    new_mt[i - 1].vcpu = v->vcpu_id;
 
-        gdprintk(XENLOG_INFO, "Increased maptrack size to %u frames\n",
-                 nr_frames + 1);
+    v->maptrack_head = lgt->maptrack_pages * MAPTRACK_PER_PAGE;
+    if (v->maptrack_tail == MAPTRACK_TAIL)
+    {
+        v->maptrack_tail = (lgt->maptrack_pages * MAPTRACK_PER_PAGE)
+            + MAPTRACK_PER_PAGE - 1;
+        new_mt[i - 1].ref = MAPTRACK_TAIL;
     }
 
+    lgt->maptrack[lgt->maptrack_pages++] = new_mt;
+
     spin_unlock(&lgt->maptrack_lock);
 
-    return handle;
+    v->maptrack_limit += MAPTRACK_PER_PAGE;
+
+    return __get_maptrack_handle(lgt, v);
 }
 
 /* 
@@ -566,7 +591,7 @@ static void mapcount(
     ASSERT(rw_is_locked(&rd->grant_table->lock));
     spin_lock(&lgt->maptrack_lock);
 
-    for ( handle = 0; handle < lgt->maptrack_limit; handle++ )
+    for ( handle = 0; handle < lgt->maptrack_pages * MAPTRACK_PER_PAGE; handle++ )
     {
         map = &maptrack_entry(lgt, handle);
         if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) ||
@@ -914,7 +939,7 @@ __gnttab_unmap_common(
 
     op->frame = (unsigned long)(op->dev_bus_addr >> PAGE_SHIFT);
 
-    if ( unlikely(op->handle >= lgt->maptrack_limit) )
+    if ( unlikely(op->handle >= lgt->maptrack_pages * MAPTRACK_PER_PAGE) )
     {
         gdprintk(XENLOG_INFO, "Bad handle (%d).\n", op->handle);
         op->status = GNTST_bad_handle;
@@ -1389,6 +1414,7 @@ gnttab_setup_table(
     struct gnttab_setup_table op;
     struct domain *d;
     struct grant_table *gt;
+    struct vcpu *v;
     int            i;
     xen_pfn_t  gmfn;
 
@@ -1430,6 +1456,17 @@ gnttab_setup_table(
     gt = d->grant_table;
     write_lock(&gt->lock);
 
+    /* Tracking of mapped foreign frames table */
+    if ( (gt->maptrack = xzalloc_array(struct grant_mapping *,
+                                       max_maptrack_frames * d->max_vcpus)) == NULL )
+        goto out3;
+    for_each_vcpu( d, v )
+    {
+        v->maptrack_head = MAPTRACK_TAIL;
+        v->maptrack_tail = MAPTRACK_TAIL;
+    }
+    gt->maptrack_pages = 0;
+
     if ( gt->gt_version == 0 )
         gt->gt_version = 1;
 
@@ -3024,18 +3061,6 @@ grant_table_create(
             spin_lock_init(&t->active[i][j].lock);
     }
 
-    /* Tracking of mapped foreign frames table */
-    if ( (t->maptrack = xzalloc_array(struct grant_mapping *,
-                                      max_maptrack_frames)) == NULL )
-        goto no_mem_2;
-    if ( (t->maptrack[0] = alloc_xenheap_page()) == NULL )
-        goto no_mem_3;
-    clear_page(t->maptrack[0]);
-    t->maptrack_limit = MAPTRACK_PER_PAGE;
-    for ( i = 1; i < MAPTRACK_PER_PAGE; i++ )
-        t->maptrack[0][i - 1].ref = i;
-    t->maptrack[0][i - 1].ref = MAPTRACK_TAIL;
-
     /* Shared grant table. */
     if ( (t->shared_raw = xzalloc_array(void *, max_grant_frames)) == NULL )
         goto no_mem_3;
@@ -3066,12 +3091,10 @@ grant_table_create(
         free_xenheap_page(t->shared_raw[i]);
     xfree(t->shared_raw);
  no_mem_3:
-    free_xenheap_page(t->maptrack[0]);
-    xfree(t->maptrack);
- no_mem_2:
     for ( i = 0;
           i < num_act_frames_from_sha_frames(INITIAL_NR_GRANT_FRAMES); i++ )
         free_xenheap_page(t->active[i]);
+ no_mem_2:
     xfree(t->active);
  no_mem_1:
     xfree(t);
@@ -3095,7 +3118,7 @@ gnttab_release_mappings(
 
     BUG_ON(!d->is_dying);
 
-    for ( handle = 0; handle < gt->maptrack_limit; handle++ )
+    for ( handle = 0; handle < (gt->maptrack_pages * MAPTRACK_PER_PAGE); handle++ )
     {
         map = &maptrack_entry(gt, handle);
         if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) )
@@ -3200,7 +3223,7 @@ grant_table_destroy(
         free_xenheap_page(t->shared_raw[i]);
     xfree(t->shared_raw);
 
-    for ( i = 0; i < nr_maptrack_frames(t); i++ )
+    for ( i = 0; i < t->maptrack_pages; i++ )
         free_xenheap_page(t->maptrack[i]);
     xfree(t->maptrack);
 
diff --git a/xen/include/xen/grant_table.h b/xen/include/xen/grant_table.h
index 98cc94a..c10464c 100644
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -60,6 +60,8 @@ struct grant_mapping {
     u32      ref;           /* grant ref */
     u16      flags;         /* 0-4: GNTMAP_* ; 5-15: unused */
     domid_t  domid;         /* granting domain */
+    u32      vcpu;          /* vcpu which created the grant mapping */
+    u32      pad;           /* round size to a power of 2 */
 };
 
 /* Per-domain grant information. */
@@ -83,10 +85,10 @@ struct grant_table {
     grant_status_t       **status;
     /* Active grant table. */
     struct active_grant_entry **active;
-    /* Mapping tracking table. */
+    /* Mapping tracking table per vcpu. */
     struct grant_mapping **maptrack;
-    unsigned int          maptrack_head;
-    unsigned int          maptrack_limit;
+    /* Total pages used for mapping tracking table */
+    unsigned int          maptrack_pages;
     /* Lock protecting the maptrack page list, head, and limit */
     spinlock_t            maptrack_lock;
     /* The defined versions are 1 and 2.  Set to 0 if we don't know
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 80c6f62..7f8422e 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -219,6 +219,11 @@ struct vcpu
     /* VCPU paused by system controller. */
     int              controller_pause_count;
 
+    /* Maptrack */
+    unsigned int     maptrack_head;
+    unsigned int     maptrack_tail;
+    unsigned int     maptrack_limit;
+
     /* IRQ-safe virq_lock protects against delivering VIRQ to stale evtchn. */
     evtchn_port_t    virq_to_evtchn[NR_VIRQS];
     spinlock_t       virq_lock;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 2/3] gnttab: refactor locking for scalability
  2015-04-30 13:28 ` [PATCHv7 2/3] gnttab: refactor locking for scalability David Vrabel
@ 2015-04-30 14:34   ` Tim Deegan
  2015-05-01 18:03     ` David Vrabel
  2015-05-05 11:58   ` Jan Beulich
  1 sibling, 1 reply; 16+ messages in thread
From: Tim Deegan @ 2015-04-30 14:34 UTC (permalink / raw)
  To: David Vrabel
  Cc: Keir Fraser, Jan Beulich, Christoph Egger, Matt Wilson,
	xen-devel, Malcolm Crossley, Ian Campbell

At 14:28 +0100 on 30 Apr (1430404124), David Vrabel wrote:
> +    /* 
> +     * N.B.: while taking the left side maptrack spinlock prevents
> +     * any mapping changes, the right side active entries could be
> +     * changing while we are counting.

IIRC, the lX/rX variables are local/remote rather than left/right. 

> @@ -639,7 +660,7 @@ __gnttab_map_grant_ref(
>      if ( unlikely(op->ref >= nr_grant_entries(rgt)))
>          PIN_FAIL(unlock_out, GNTST_bad_gntref, "Bad ref (%d).\n", op->ref);
>  
> -    act = &active_entry(rgt, op->ref);
> +    act = active_entry_acquire(rgt, op->ref);
>      shah = shared_entry_header(rgt, op->ref);
>      if (rgt->gt_version == 1) {
>          sha1 = &shared_entry_v1(rgt, op->ref);
> @@ -656,7 +677,7 @@ __gnttab_map_grant_ref(
>           ((act->domid != ld->domain_id) ||
>            (act->pin & 0x80808080U) != 0 ||
>            (act->is_sub_page)) )
> -        PIN_FAIL(unlock_out, GNTST_general_error,
> +        PIN_FAIL(act_release_out, GNTST_general_error,
>                   "Bad domain (%d != %d), or risk of counter overflow %08x, or subpage %d\n",
>                   act->domid, ld->domain_id, act->pin, act->is_sub_page);
>  
> @@ -667,7 +688,7 @@ __gnttab_map_grant_ref(
>          if ( (rc = _set_status(rgt->gt_version, ld->domain_id,
>                                 op->flags & GNTMAP_readonly,
>                                 1, shah, act, status) ) != GNTST_okay )
> -             goto unlock_out;
> +            goto act_release_out;
>  
>          if ( !act->pin )
>          {
> @@ -698,7 +719,6 @@ __gnttab_map_grant_ref(
>              GNTPIN_hstr_inc : GNTPIN_hstw_inc;
>  
>      frame = act->frame;
> -    act_pin = act->pin;
>  
>      cache_flags = (shah->flags & (GTF_PAT | GTF_PWT | GTF_PCD) );
>  
> @@ -776,8 +796,6 @@ __gnttab_map_grant_ref(
>          goto undo_out;
>      }
>  
> -    double_maptrack_lock(lgt, rgt);
> -

OK, so here we hold rd->grant_table->lock and act->lock (which is in
the rd table) and are going to acquire lgt->maptrack_lock in mapcount().

That means we can't ever have a path holding domA's maptrack lock that
acquires domB's gt lock or one of domB's act->locks.  I think that's
OK because after this patch the only paths left that hold the maptrack
lock can't acquire anything except an act->lock of the same domain.
Do I understand that correctly?

Also: because mapcount() itself doesn't take any act->lock, there's no
path that holds two act->locks at the same time?

I think it would be nice to expand the locking-discipline comment to
make those restrictions clear. 

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists
  2015-04-30 13:28 ` [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists David Vrabel
@ 2015-04-30 15:12   ` Tim Deegan
  2015-05-01 12:29     ` Malcolm Crossley
  2015-05-05 12:34   ` Jan Beulich
  1 sibling, 1 reply; 16+ messages in thread
From: Tim Deegan @ 2015-04-30 15:12 UTC (permalink / raw)
  To: David Vrabel
  Cc: Keir Fraser, Jan Beulich, Christoph Egger, Matt Wilson,
	xen-devel, Malcolm Crossley, Ian Campbell

At 14:28 +0100 on 30 Apr (1430404125), David Vrabel wrote:
> From: Malcolm Crossley <malcolm.crossley@citrix.com>
> 
> Performance analysis of aggregate network throughput with many VMs
> shows that performance is significantly limited by contention on the
> maptrack lock when obtaining/releasing maptrack handles from the free
> list.
> 
> Instead of a single free list use a per-VCPU list. This avoids any
> contention when obtaining a handle.  Handles must be released back to
> their original list and since this may occur on a different VCPU there
> is some contention on the destination VCPU's free list tail pointer
> (but this is much better than a per-domain lock).
> 
> A domain's maptrack limit is multiplied by the number of VCPUs.  This
> ensures that a worst case domain that only performs grant table
> operations via one VCPU will not see a lower map track limit.
> 
> Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  xen/common/grant_table.c      |  131 ++++++++++++++++++++++++-----------------
>  xen/include/xen/grant_table.h |    8 ++-
>  xen/include/xen/sched.h       |    5 ++
>  3 files changed, 87 insertions(+), 57 deletions(-)
> 
> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
> index 569b3d3..d8e7373 100644
> --- a/xen/common/grant_table.c
> +++ b/xen/common/grant_table.c
> @@ -118,9 +118,9 @@ struct gnttab_unmap_common {
>      ((t)->maptrack[(e)/MAPTRACK_PER_PAGE][(e)%MAPTRACK_PER_PAGE])
>  
>  static inline unsigned int
> -nr_maptrack_frames(struct grant_table *t)
> +nr_vcpu_maptrack_frames(struct vcpu *v)
>  {
> -    return t->maptrack_limit / MAPTRACK_PER_PAGE;
> +    return v->maptrack_limit / MAPTRACK_PER_PAGE;
>  }
>  
>  #define MAPTRACK_TAIL (~0u)
> @@ -255,66 +255,91 @@ static int __get_paged_frame(unsigned long gfn, unsigned long *frame, struct pag
>  
>  static inline int
>  __get_maptrack_handle(
> -    struct grant_table *t)
> +    struct grant_table *t,
> +    struct vcpu *v)
>  {
> -    unsigned int h;
> -    if ( unlikely((h = t->maptrack_head) == MAPTRACK_TAIL) )
> +    unsigned int head, next;
> +
> +    head = v->maptrack_head;
> +    if ( unlikely(head == MAPTRACK_TAIL) )
> +        return -1;
> +
> +    next = maptrack_entry(t, head).ref;
> +    if ( unlikely(next == MAPTRACK_TAIL) )
>          return -1;

So this is an extra restriction over the old code: the list shall
never be empty.  Is that to make freeing onto the tail of a remote
freelist easier?  If so can you please add a comment so that nobody
drops this later?

> -    t->maptrack_head = maptrack_entry(t, h).ref;
> -    return h;
> +
> +    v->maptrack_head = next;
> +
> +    return head;
>  }
>  
>  static inline void
>  put_maptrack_handle(
>      struct grant_table *t, int handle)
>  {
> -    spin_lock(&t->maptrack_lock);
> -    maptrack_entry(t, handle).ref = t->maptrack_head;
> -    t->maptrack_head = handle;
> -    spin_unlock(&t->maptrack_lock);
> +    struct domain *d = current->domain;
> +    struct vcpu *v;
> +    u32 prev_tail, cur_tail;
> +
> +    /* Set current hande to have TAIL reference */
> +    maptrack_entry(t, handle).ref = MAPTRACK_TAIL;
> +    v = d->vcpu[maptrack_entry(t,handle).vcpu];
> +    cur_tail = v->maptrack_tail;
> +    do {
> +        prev_tail = cur_tail;
> +        cur_tail = cmpxchg((u32 *)&v->maptrack_tail, prev_tail, handle);
> +    } while ( cur_tail != prev_tail );

Thinking out loud here: this is the matching code that needs a tail
> entry to append to.  Because we advance the tail pointer before
> threading the list in the .ref field, there's a potentially large set
of half-freed entries here that are/were the tail pointer but haven't
had their .ref fields updated yet.  Eventually all freeing threads
will execute the line below and stitch things together.  And cmpxchg()
is a full barrier so we're safe against reordering here: any fragment
of list will end in MAPTRACK_TAIL.  Good.

But this:

> +    maptrack_entry(t, prev_tail).ref = handle;

ought to be a write_atomic(), and the reads in
__get_maptrack_handle() should be atomic too.  
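
An illustrative sketch (not part of the posted series) of the free path
with those accesses made atomic, assuming the read_atomic()/write_atomic()
helpers referred to here and keeping the cmpxchg() loop as posted:

    static inline void
    put_maptrack_handle(
        struct grant_table *t, int handle)
    {
        struct domain *d = current->domain;
        struct vcpu *v;
        unsigned int prev_tail, cur_tail;

        /* Terminate the entry being freed before publishing it. */
        write_atomic(&maptrack_entry(t, handle).ref, MAPTRACK_TAIL);

        /* Return the handle to the free list of the vcpu that allocated it. */
        v = d->vcpu[maptrack_entry(t, handle).vcpu];

        /* Atomically swing the remote tail pointer onto the freed entry... */
        cur_tail = read_atomic(&v->maptrack_tail);
        do {
            prev_tail = cur_tail;
            cur_tail = cmpxchg(&v->maptrack_tail, prev_tail, handle);
        } while ( cur_tail != prev_tail );

        /* ...then stitch the old tail onto it with an atomic store. */
        write_atomic(&maptrack_entry(t, prev_tail).ref, handle);
    }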

>  }
>  
>  static inline int
>  get_maptrack_handle(
>      struct grant_table *lgt)
>  {
> +    struct vcpu          *v = current;
>      int                   i;
>      grant_handle_t        handle;
>      struct grant_mapping *new_mt;
> -    unsigned int          new_mt_limit, nr_frames;
> -
> -    spin_lock(&lgt->maptrack_lock);
> +    unsigned int          nr_frames;
>  
> -    while ( unlikely((handle = __get_maptrack_handle(lgt)) == -1) )
> -    {
> -        nr_frames = nr_maptrack_frames(lgt);
> -        if ( nr_frames >= max_maptrack_frames )
> -            break;
> -
> -        new_mt = alloc_xenheap_page();
> -        if ( !new_mt )
> -            break;
> +    handle = __get_maptrack_handle(lgt, v);
> +    if ( likely(handle != -1) )
> +        return handle;
>  
> -        clear_page(new_mt);
> +    nr_frames = nr_vcpu_maptrack_frames(v);
> +    if ( nr_frames >= max_maptrack_frames )
> +        return -1;
>  
> -        new_mt_limit = lgt->maptrack_limit + MAPTRACK_PER_PAGE;
> +    new_mt = alloc_xenheap_page();
> +    if ( !new_mt )
> +        return -1;
> +    clear_page(new_mt);
>  
> -        for ( i = 1; i < MAPTRACK_PER_PAGE; i++ )
> -            new_mt[i - 1].ref = lgt->maptrack_limit + i;
> -        new_mt[i - 1].ref = lgt->maptrack_head;
> -        lgt->maptrack_head = lgt->maptrack_limit;
> +    spin_lock(&lgt->maptrack_lock);
>  
> -        lgt->maptrack[nr_frames] = new_mt;
> -        smp_wmb();
> -        lgt->maptrack_limit      = new_mt_limit;
> +    for ( i = 1; i < MAPTRACK_PER_PAGE; i++ ){

Style nit: brace on its own line, please. 

> +        new_mt[i - 1].ref = (lgt->maptrack_pages * MAPTRACK_PER_PAGE) + i;
> +        new_mt[i - 1].vcpu = v->vcpu_id;
> +    }
> +    /* Set last entry vcpu and ref */
> +    new_mt[i - 1].ref = v->maptrack_head;
> +    new_mt[i - 1].vcpu = v->vcpu_id;
>  
> -        gdprintk(XENLOG_INFO, "Increased maptrack size to %u frames\n",
> -                 nr_frames + 1);
> +    v->maptrack_head = lgt->maptrack_pages * MAPTRACK_PER_PAGE;
> +    if (v->maptrack_tail == MAPTRACK_TAIL)
> +    {
> +        v->maptrack_tail = (lgt->maptrack_pages * MAPTRACK_PER_PAGE)
> +            + MAPTRACK_PER_PAGE - 1;
> +        new_mt[i - 1].ref = MAPTRACK_TAIL;

This clause was slightly confusing to me until I realised that it's
only for the case where no maptrack entries have been allocated to
this vcpu at all before (because you keep at least one on the
freelist).  And therefore, there's no need to be careful about this
update to maptrack_tail.  I think that deserves a comment.

So I think my only concerns about this whole series are some comments
and using atomic read & writes in the linked-list code.  I'm happy to
believe that you'll DTRT with all that, so for the next spin,

Acked-by: Tim Deegan <tim@xen.org> 

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists
  2015-04-30 15:12   ` Tim Deegan
@ 2015-05-01 12:29     ` Malcolm Crossley
  0 siblings, 0 replies; 16+ messages in thread
From: Malcolm Crossley @ 2015-05-01 12:29 UTC (permalink / raw)
  To: Tim Deegan, David Vrabel
  Cc: Keir Fraser, Jan Beulich, Christoph Egger, Matt Wilson,
	xen-devel, Ian Campbell

On 30/04/15 16:12, Tim Deegan wrote:
> At 14:28 +0100 on 30 Apr (1430404125), David Vrabel wrote:
>> From: Malcolm Crossley <malcolm.crossley@citrix.com>
>>
>> Performance analysis of aggregate network throughput with many VMs
>> shows that performance is significantly limited by contention on the
>> maptrack lock when obtaining/releasing maptrack handles from the free
>> list.
>>
>> Instead of a single free list use a per-VCPU list. This avoids any
>> contention when obtaining a handle.  Handles must be released back to
>> their original list and since this may occur on a different VCPU there
>> is some contention on the destination VCPU's free list tail pointer
>> (but this is much better than a per-domain lock).
>>
>> A domain's maptrack limit is multiplied by the number of VCPUs.  This
>> ensures that a worst case domain that only performs grant table
>> operations via one VCPU will not see a lower map track limit.
>>
>> Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>> ---
>>  xen/common/grant_table.c      |  131 ++++++++++++++++++++++++-----------------
>>  xen/include/xen/grant_table.h |    8 ++-
>>  xen/include/xen/sched.h       |    5 ++
>>  3 files changed, 87 insertions(+), 57 deletions(-)
>>
>> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
>> index 569b3d3..d8e7373 100644
>> --- a/xen/common/grant_table.c
>> +++ b/xen/common/grant_table.c
>> @@ -118,9 +118,9 @@ struct gnttab_unmap_common {
>>      ((t)->maptrack[(e)/MAPTRACK_PER_PAGE][(e)%MAPTRACK_PER_PAGE])
>>  
>>  static inline unsigned int
>> -nr_maptrack_frames(struct grant_table *t)
>> +nr_vcpu_maptrack_frames(struct vcpu *v)
>>  {
>> -    return t->maptrack_limit / MAPTRACK_PER_PAGE;
>> +    return v->maptrack_limit / MAPTRACK_PER_PAGE;
>>  }
>>  
>>  #define MAPTRACK_TAIL (~0u)
>> @@ -255,66 +255,91 @@ static int __get_paged_frame(unsigned long gfn, unsigned long *frame, struct pag
>>  
>>  static inline int
>>  __get_maptrack_handle(
>> -    struct grant_table *t)
>> +    struct grant_table *t,
>> +    struct vcpu *v)
>>  {
>> -    unsigned int h;
>> -    if ( unlikely((h = t->maptrack_head) == MAPTRACK_TAIL) )
>> +    unsigned int head, next;
>> +
>> +    head = v->maptrack_head;
>> +    if ( unlikely(head == MAPTRACK_TAIL) )
>> +        return -1;
>> +
>> +    next = maptrack_entry(t, head).ref;
>> +    if ( unlikely(next == MAPTRACK_TAIL) )
>>          return -1;
> 
> So this is an extra restriction over the old code: the list shall
> never be empty.  Is that to make freeing onto the tail of a remote
> freelist easier?  If so can you please add a comment so that nobody
> drops this later?

Previously we allocated a page at maptrack table init; with per-vCPU
maptrack we would need to allocate one per vCPU. As domains can have
large numbers of vCPUs, this patch changes the logic so that a vCPU's
maptrack entries are only allocated when they are first needed (or when
the existing ones are full). This behavioural change hopefully keeps
the Xen memory overhead for domains with many vCPUs down.

We detect a vCPU without any allocated maptrack pages by setting its
head pointer to the MAPTRACK_TAIL define.
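
An illustrative sketch (not from the patch) of the allocation path with
that behaviour and the never-empty restriction spelled out in comments;
the second comment reflects Tim's reading above:

    static inline int
    __get_maptrack_handle(
        struct grant_table *t, struct vcpu *v)
    {
        unsigned int head, next;

        /* MAPTRACK_TAIL here means this vcpu has no maptrack pages yet. */
        head = read_atomic(&v->maptrack_head);
        if ( unlikely(head == MAPTRACK_TAIL) )
            return -1;

        /*
         * Never hand out the last entry: a handle freed on another vcpu
         * is appended after v->maptrack_tail, so at least one entry must
         * remain on the list for that append to link onto.
         */
        next = read_atomic(&maptrack_entry(t, head).ref);
        if ( unlikely(next == MAPTRACK_TAIL) )
            return -1;

        v->maptrack_head = next;

        return head;
    }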

> 
>> -    t->maptrack_head = maptrack_entry(t, h).ref;
>> -    return h;
>> +
>> +    v->maptrack_head = next;
>> +
>> +    return head;
>>  }
>>  
>>  static inline void
>>  put_maptrack_handle(
>>      struct grant_table *t, int handle)
>>  {
>> -    spin_lock(&t->maptrack_lock);
>> -    maptrack_entry(t, handle).ref = t->maptrack_head;
>> -    t->maptrack_head = handle;
>> -    spin_unlock(&t->maptrack_lock);
>> +    struct domain *d = current->domain;
>> +    struct vcpu *v;
>> +    u32 prev_tail, cur_tail;
>> +
>> +    /* Set current hande to have TAIL reference */
>> +    maptrack_entry(t, handle).ref = MAPTRACK_TAIL;
>> +    v = d->vcpu[maptrack_entry(t,handle).vcpu];
>> +    cur_tail = v->maptrack_tail;
>> +    do {
>> +        prev_tail = cur_tail;
>> +        cur_tail = cmpxchg((u32 *)&v->maptrack_tail, prev_tail, handle);
>> +    } while ( cur_tail != prev_tail );
> 
> Thinking out loud here: this is the matching code that needs a tail
> > entry to append to.  Because we advance the tail pointer before
> > threading the list in the .ref field, there's a potentially large set
> of half-freed entries here that are/were the tail pointer but haven't
> had their .ref fields updated yet.  Eventually all freeing threads
> will execute the line below and stitch things together.  And cmpxchg()
> is a full barrier so we're safe against reordering here: any fragment
> of list will end in MAPTRACK_TAIL.  Good.

That's the theory of how it works, but you are correct that all
accesses to ref must be atomic; otherwise we won't splice the linked
lists back together again.
> 
> But this:
> 
>> +    maptrack_entry(t, prev_tail).ref = handle;
> 
> ought to be a write_atomic(), and the reads in
> __get_maptrack_handle() should be atomic too.  

You are right; we probably get away with it on x86 as it's a 32-bit
write, but we should use the atomic helpers.
> 
>>  }
>>  
>>  static inline int
>>  get_maptrack_handle(
>>      struct grant_table *lgt)
>>  {
>> +    struct vcpu          *v = current;
>>      int                   i;
>>      grant_handle_t        handle;
>>      struct grant_mapping *new_mt;
>> -    unsigned int          new_mt_limit, nr_frames;
>> -
>> -    spin_lock(&lgt->maptrack_lock);
>> +    unsigned int          nr_frames;
>>  
>> -    while ( unlikely((handle = __get_maptrack_handle(lgt)) == -1) )
>> -    {
>> -        nr_frames = nr_maptrack_frames(lgt);
>> -        if ( nr_frames >= max_maptrack_frames )
>> -            break;
>> -
>> -        new_mt = alloc_xenheap_page();
>> -        if ( !new_mt )
>> -            break;
>> +    handle = __get_maptrack_handle(lgt, v);
>> +    if ( likely(handle != -1) )
>> +        return handle;
>>  
>> -        clear_page(new_mt);
>> +    nr_frames = nr_vcpu_maptrack_frames(v);
>> +    if ( nr_frames >= max_maptrack_frames )
>> +        return -1;
>>  
>> -        new_mt_limit = lgt->maptrack_limit + MAPTRACK_PER_PAGE;
>> +    new_mt = alloc_xenheap_page();
>> +    if ( !new_mt )
>> +        return -1;
>> +    clear_page(new_mt);
>>  
>> -        for ( i = 1; i < MAPTRACK_PER_PAGE; i++ )
>> -            new_mt[i - 1].ref = lgt->maptrack_limit + i;
>> -        new_mt[i - 1].ref = lgt->maptrack_head;
>> -        lgt->maptrack_head = lgt->maptrack_limit;
>> +    spin_lock(&lgt->maptrack_lock);
>>  
>> -        lgt->maptrack[nr_frames] = new_mt;
>> -        smp_wmb();
>> -        lgt->maptrack_limit      = new_mt_limit;
>> +    for ( i = 1; i < MAPTRACK_PER_PAGE; i++ ){
> 
> Style nit: brace on its own line, please. 
> 
>> +        new_mt[i - 1].ref = (lgt->maptrack_pages * MAPTRACK_PER_PAGE) + i;
>> +        new_mt[i - 1].vcpu = v->vcpu_id;
>> +    }
>> +    /* Set last entry vcpu and ref */
>> +    new_mt[i - 1].ref = v->maptrack_head;
>> +    new_mt[i - 1].vcpu = v->vcpu_id;
>>  
>> -        gdprintk(XENLOG_INFO, "Increased maptrack size to %u frames\n",
>> -                 nr_frames + 1);
>> +    v->maptrack_head = lgt->maptrack_pages * MAPTRACK_PER_PAGE;
>> +    if (v->maptrack_tail == MAPTRACK_TAIL)
>> +    {
>> +        v->maptrack_tail = (lgt->maptrack_pages * MAPTRACK_PER_PAGE)
>> +            + MAPTRACK_PER_PAGE - 1;
>> +        new_mt[i - 1].ref = MAPTRACK_TAIL;
> 
> This clause was slightly confusing to me until I realised that it's
> only for the case where no maptrack entries have been allocated to
> this vcpu at all before (because you keep at least one on the
> freelist).  And therefore, there's no need to be careful about this
> update to maptrack_tail.  I think that deserves a comment.

As I detailed above, we have changed behaviour slightly so there is a
case where there is nothing allocated on the freelist. We'll add a
comment explaining this.

> 
> So I think my only concerns about this whole series are some comments
> and using atomic read & writes in the linked-list code.  I'm happy to
> believe that you'll DTRT with all that, so for the next spin,
> 
> Acked-by: Tim Deegan <tim@xen.org> 
> 

Thanks for your review.

Malcolm

> Cheers,
> 
> Tim.
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 2/3] gnttab: refactor locking for scalability
  2015-04-30 14:34   ` Tim Deegan
@ 2015-05-01 18:03     ` David Vrabel
  2015-05-01 20:02       ` Tim Deegan
  0 siblings, 1 reply; 16+ messages in thread
From: David Vrabel @ 2015-05-01 18:03 UTC (permalink / raw)
  To: Tim Deegan, David Vrabel
  Cc: Keir Fraser, Jan Beulich, Christoph Egger, Matt Wilson,
	xen-devel, Malcolm Crossley, Ian Campbell

On 30/04/15 15:34, Tim Deegan wrote:
> 
> OK, so here we hold rd->grant_table->lock and act->lock (which is in
> the rd table) and are going to acquire lgt->maptrack_lock in mapcount().
> 
> That means we can't ever have a path holding domA's maptrack lock that
> acquires domB's gt lock or one of domB's act->locks.  I think that's
> OK because after this patch the only paths left that hold the maptrack
> lock can't acquire anything except an act->lock of the same domain.
> Do I understand that correctly?
> 
> Also: because mapcount() itself doesn't take any act->lock, there's no
> path that holds two act->locks at the same time?

Um. I think it's easier if we just say you cannot take another lock
while holding the maptrack_lock, since that's currently the case.

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 2/3] gnttab: refactor locking for scalability
  2015-05-01 18:03     ` David Vrabel
@ 2015-05-01 20:02       ` Tim Deegan
  0 siblings, 0 replies; 16+ messages in thread
From: Tim Deegan @ 2015-05-01 20:02 UTC (permalink / raw)
  To: David Vrabel
  Cc: Keir Fraser, Jan Beulich, Christoph Egger, Matt Wilson,
	xen-devel, Malcolm Crossley, Ian Campbell

At 19:03 +0100 on 01 May (1430506982), David Vrabel wrote:
> On 30/04/15 15:34, Tim Deegan wrote:
> > 
> > OK, so here we hold rd->grant_table->lock and act->lock (which is in
> > the rd table) and are going to acquire lgt->maptrack_lock in mapcount().
> > 
> > That means we can't ever have a path holding domA's maptrack lock that
> > acquires domB's gt lock or one of domB's act->locks.  I think that's
> > OK because after this patch the only paths left that hold the maptrack
> > lock can't acquire anything except an act->lock of the same domain.
> > Do I understand that correctly?
> > 
> > Also: because mapcount() itself doesn't take any act->lock, there's no
> > path that holds two act->locks at the same time?
> 
> Um. I think it's easier if we just say you cannot take another lock
> while holding a maptrack_lock since that's currently the case.

Er, yes.  I had misread the locking order, sorry.  So we just need to
say (a) no locks inside the maptrack lock and (b) never take two
act->locks at once.
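
An illustrative sketch (wording not from the series) of how the
locking-discipline comment could capture those two rules together with
the ordering discussed earlier in the thread:

    /*
     * Locking order, outermost first:
     *
     *   grant_table->lock          rwlock; taken for read on most paths,
     *                              for write when the table itself changes
     *   active_grant_entry->lock   at most one held at any time; never
     *                              acquire two act->locks together
     *   grant_table->maptrack_lock innermost; no other lock may be
     *                              acquired while it is held
     */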

Tim.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 1/3] gnttab: Introduce rwlock to protect updates to grant table state
  2015-04-30 13:28 ` [PATCHv7 1/3] gnttab: Introduce rwlock to protect updates to grant table state David Vrabel
@ 2015-05-05 10:42   ` Jan Beulich
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Beulich @ 2015-05-05 10:42 UTC (permalink / raw)
  To: David Vrabel
  Cc: Keir Fraser, Ian Campbell, Christoph Egger, Tim Deegan,
	Matt Wilson, xen-devel, Malcolm Crossley

>>> On 30.04.15 at 15:28, <david.vrabel@citrix.com> wrote:
> --- a/docs/misc/grant-tables.txt
> +++ b/docs/misc/grant-tables.txt
> @@ -74,7 +74,33 @@ is complete.
>   matching map track entry is then removed, as if unmap had been invoked.
>   These are not used by the transfer mechanism.
>    map->domid         : owner of the mapped frame
> -  map->ref_and_flags : grant reference, ro/rw, mapped for host or device access
> +  map->ref           : grant reference
> +  map->flags         : ro/rw, mapped for host or device access
> +
> +********************************************************************************
> + Locking
> + ~~~~~~~
> + Xen uses several locks to serialize access to the internal grant table state.
> +
> +  grant_table->lock          : rwlock used to prevent readers from accessing
> +                               inconsistent grant table state such as current
> +                               version, partially initialized active table pages,
> +                               etc.
> +  grant_table->maptrack_lock : spinlock used to protect the maptrack state
> +
> + The primary lock for the grant table is a read/write spinlock. All
> + functions that access members of struct grant_table must acquire a
> + read lock around critical sections. Any modification to the members
> + of struct grant_table (e.g., nr_status_frames, nr_grant_frames,
> + active frames, etc.) must only be made if the write lock is
> + held. These elements are read-mostly, and read critical sections can
> + be large, which makes a rwlock a good choice.
> +
> + The maptrack state is protected by its own spinlock. Any access (read
> + or write) of struct grant_table members that have a "maptrack_"
> + prefix must be made while holding the maptrack lock. The maptrack
> + state can be rapidly modified under some workloads, and the critical
> + sections are very small, thus we use a spinlock to protect them.

Still no word about nesting between the two?

> @@ -629,7 +629,7 @@ __gnttab_map_grant_ref(
>      }
>  
>      rgt = rd->grant_table;
> -    spin_lock(&rgt->lock);
> +    read_lock(&rgt->lock);
>  
>      if ( rgt->gt_version == 0 )
>          PIN_FAIL(unlock_out, GNTST_general_error,
> @@ -702,8 +702,6 @@ __gnttab_map_grant_ref(
>  
>      cache_flags = (shah->flags & (GTF_PAT | GTF_PWT | GTF_PCD) );
>  
> -    spin_unlock(&rgt->lock);
> -
>      /* pg may be set, with a refcount included, from __get_paged_frame */
>      if ( !pg )
>      {
> @@ -778,7 +776,7 @@ __gnttab_map_grant_ref(
>          goto undo_out;
>      }
>  
> -    double_gt_lock(lgt, rgt);
> +    double_maptrack_lock(lgt, rgt);

So looking just at the patch context here together with the two earlier
hunks: You now keep holding the rwlock, but the code following the
undo_out label read-locks the lock another time.

> @@ -1290,10 +1294,14 @@ gnttab_unpopulate_status_frames(struct domain *d, 
> struct grant_table *gt)
>      gt->nr_status_frames = 0;
>  }
>  
> +/* 
> + * Grow the grant table. The caller must hold the grant table's
> + * write lock before calling this function.
> + */
>  int
>  gnttab_grow_table(struct domain *d, unsigned int req_nr_frames)
>  {
> -    /* d's grant table lock must be held by the caller */
> +    /* d's grant table write lock must be held by the caller */

This comment is now redundant with the one you add ahead of
the function.

> @@ -3052,7 +3068,8 @@ gnttab_release_mappings(
>          }
>  
>          rgt = rd->grant_table;
> -        spin_lock(&rgt->lock);
> +        read_lock(&rgt->lock);
> +        double_maptrack_lock(gt, rgt);

What is it that requires gt's maptrack lock to also be acquired here?

Jan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 2/3] gnttab: refactor locking for scalability
  2015-04-30 13:28 ` [PATCHv7 2/3] gnttab: refactor locking for scalability David Vrabel
  2015-04-30 14:34   ` Tim Deegan
@ 2015-05-05 11:58   ` Jan Beulich
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Beulich @ 2015-05-05 11:58 UTC (permalink / raw)
  To: David Vrabel
  Cc: Keir Fraser, Ian Campbell, Christoph Egger, Tim Deegan,
	Matt Wilson, xen-devel, Malcolm Crossley

>>> On 30.04.15 at 15:28, <david.vrabel@citrix.com> wrote:
> This patch refactors grant table locking. It splits the previous
> single spinlock per grant table into multiple locks. The heavily
> modified components of the grant table (the maptrack state and the
> active entries) are now protected by their own spinlocks. The
> remaining elements of the grant table are read-mostly, so the main
> grant table lock is modified to be a rwlock to improve concurrency.

This looks to be (partly) stale.

> @@ -812,7 +829,7 @@ __gnttab_map_grant_ref(
>      mt->ref   = op->ref;
>      mt->flags = op->flags;
>  
> -    double_maptrack_unlock(lgt, rgt);
> +    active_entry_release(act);
>      read_unlock(&rgt->lock);
>  
>      op->dev_bus_addr = (u64)frame << PAGE_SHIFT;
> @@ -836,10 +853,6 @@ __gnttab_map_grant_ref(
>          put_page(pg);
>      }
>  
> -    read_lock(&rgt->lock);
> -
> -    act = &active_entry(rgt, op->ref);
> -
>      if ( op->flags & GNTMAP_device_map )
>          act->pin -= (op->flags & GNTMAP_readonly) ?
>              GNTPIN_devr_inc : GNTPIN_devw_inc;
> @@ -855,6 +868,9 @@ __gnttab_map_grant_ref(
>      if ( !act->pin )
>          gnttab_clear_flag(_GTF_reading, status);
>  
> + act_release_out:
> +    active_entry_release(act);
> +
>   unlock_out:
>      read_unlock(&rgt->lock);

This again looks like broken locking: You drop the lock twice, but the
intermediate re-acquire gets deleted. Additionally it looks like you let
go of "act" before being done with modifying it (act->pin adjustment
visible in context above) and reading it (act->pin check further
down), and then you release it another time here. Am I overlooking
something?

> @@ -2414,16 +2446,18 @@ gnttab_set_version(XEN_GUEST_HANDLE_PARAM(gnttab_set_version_t) uop)
>      {
>          for ( i = GNTTAB_NR_RESERVED_ENTRIES; i < nr_grant_entries(gt); i++ )
>          {
> -            act = &active_entry(gt, i);
> +            act = active_entry_acquire(gt, i);
>              if ( act->pin != 0 )
>              {
>                  gdprintk(XENLOG_WARNING,
>                           "tried to change grant table version from %d to %d, but some grant entries still in use\n",
>                           gt->gt_version,
>                           op.version);
> +                active_entry_release(act);
>                  res = -EBUSY;
>                  goto out_unlock;
>              }
> +            active_entry_release(act);
>          }

Is it really useful to acquire the entry here? I.e. wouldn't
read_atomic(act->pin) suffice?
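
An illustrative sketch of the alternative being suggested (assuming the
underlying active_entry() accessor is still usable at this point):

    for ( i = GNTTAB_NR_RESERVED_ENTRIES; i < nr_grant_entries(gt); i++ )
    {
        /* Just peek at pin; no need to take the per-entry lock for this check. */
        if ( read_atomic(&active_entry(gt, i).pin) != 0 )
        {
            gdprintk(XENLOG_WARNING,
                     "tried to change grant table version from %d to %d, but some grant entries still in use\n",
                     gt->gt_version, op.version);
            res = -EBUSY;
            goto out_unlock;
        }
    }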

Jan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists
  2015-04-30 13:28 ` [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists David Vrabel
  2015-04-30 15:12   ` Tim Deegan
@ 2015-05-05 12:34   ` Jan Beulich
  2015-05-12 11:01     ` David Vrabel
  1 sibling, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2015-05-05 12:34 UTC (permalink / raw)
  To: David Vrabel
  Cc: Keir Fraser, Ian Campbell, Christoph Egger, Tim Deegan,
	Matt Wilson, xen-devel, Malcolm Crossley

>>> On 30.04.15 at 15:28, <david.vrabel@citrix.com> wrote:
> From: Malcolm Crossley <malcolm.crossley@citrix.com>
> 
> Performance analysis of aggregate network throughput with many VMs
> shows that performance is significantly limited by contention on the
> maptrack lock when obtaining/releasing maptrack handles from the free
> list.
> 
> Instead of a single free list use a per-VCPU list. This avoids any
> contention when obtaining a handle.  Handles must be released back to
> their original list and since this may occur on a different VCPU there
> is some contention on the destination VCPU's free list tail pointer
> (but this is much better than a per-domain lock).
> 
> A domain's maptrack limit is multiplied by the number of VCPUs.  This
> ensures that a worst case domain that only performs grant table
> operations via one VCPU will not see a lower map track limit.
> 
> Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  xen/common/grant_table.c      |  131 ++++++++++++++++++++++++-----------------
>  xen/include/xen/grant_table.h |    8 ++-
>  xen/include/xen/sched.h       |    5 ++
>  3 files changed, 87 insertions(+), 57 deletions(-)
> 
> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
> index 569b3d3..d8e7373 100644
> --- a/xen/common/grant_table.c
> +++ b/xen/common/grant_table.c
> @@ -118,9 +118,9 @@ struct gnttab_unmap_common {
>      ((t)->maptrack[(e)/MAPTRACK_PER_PAGE][(e)%MAPTRACK_PER_PAGE])
>  
>  static inline unsigned int
> -nr_maptrack_frames(struct grant_table *t)
> +nr_vcpu_maptrack_frames(struct vcpu *v)
>  {
> -    return t->maptrack_limit / MAPTRACK_PER_PAGE;
> +    return v->maptrack_limit / MAPTRACK_PER_PAGE;
>  }
>  
>  #define MAPTRACK_TAIL (~0u)
> @@ -255,66 +255,91 @@ static int __get_paged_frame(unsigned long gfn, 
> unsigned long *frame, struct pag
>  
>  static inline int
>  __get_maptrack_handle(
> -    struct grant_table *t)
> +    struct grant_table *t,
> +    struct vcpu *v)
>  {
> -    unsigned int h;
> -    if ( unlikely((h = t->maptrack_head) == MAPTRACK_TAIL) )
> +    unsigned int head, next;
> +
> +    head = v->maptrack_head;
> +    if ( unlikely(head == MAPTRACK_TAIL) )
> +        return -1;
> +
> +    next = maptrack_entry(t, head).ref;
> +    if ( unlikely(next == MAPTRACK_TAIL) )
>          return -1;
> -    t->maptrack_head = maptrack_entry(t, h).ref;
> -    return h;
> +
> +    v->maptrack_head = next;
> +
> +    return head;
>  }
>  
>  static inline void
>  put_maptrack_handle(
>      struct grant_table *t, int handle)
>  {
> -    spin_lock(&t->maptrack_lock);
> -    maptrack_entry(t, handle).ref = t->maptrack_head;
> -    t->maptrack_head = handle;
> -    spin_unlock(&t->maptrack_lock);
> +    struct domain *d = current->domain;
> +    struct vcpu *v;
> +    u32 prev_tail, cur_tail;
> +
> +    /* Set current hande to have TAIL reference */
> +    maptrack_entry(t, handle).ref = MAPTRACK_TAIL;
> +    v = d->vcpu[maptrack_entry(t,handle).vcpu];

There's a missing blank here; the comment above has a typo and
is missing a stop. (There are further similar issues below.)

> +    cur_tail = v->maptrack_tail;

read_atomic()?

> +    do {
> +        prev_tail = cur_tail;
> +        cur_tail = cmpxchg((u32 *)&v->maptrack_tail, prev_tail, handle);

Please adjust the type of maptrack_tail (or the types of {prev,cur}_tail)
such that no cast is needed here.

> +    } while ( cur_tail != prev_tail );
> +    maptrack_entry(t, prev_tail).ref = handle;
>  }
>  
>  static inline int
>  get_maptrack_handle(
>      struct grant_table *lgt)
>  {
> +    struct vcpu          *v = current;
>      int                   i;
>      grant_handle_t        handle;
>      struct grant_mapping *new_mt;
> -    unsigned int          new_mt_limit, nr_frames;
> -
> -    spin_lock(&lgt->maptrack_lock);
> +    unsigned int          nr_frames;
>  
> -    while ( unlikely((handle = __get_maptrack_handle(lgt)) == -1) )
> -    {
> -        nr_frames = nr_maptrack_frames(lgt);
> -        if ( nr_frames >= max_maptrack_frames )
> -            break;
> -
> -        new_mt = alloc_xenheap_page();
> -        if ( !new_mt )
> -            break;
> +    handle = __get_maptrack_handle(lgt, v);
> +    if ( likely(handle != -1) )
> +        return handle;
>  
> -        clear_page(new_mt);
> +    nr_frames = nr_vcpu_maptrack_frames(v);
> +    if ( nr_frames >= max_maptrack_frames )
> +        return -1;
>  
> -        new_mt_limit = lgt->maptrack_limit + MAPTRACK_PER_PAGE;
> +    new_mt = alloc_xenheap_page();
> +    if ( !new_mt )
> +        return -1;
> +    clear_page(new_mt);
>  
> -        for ( i = 1; i < MAPTRACK_PER_PAGE; i++ )
> -            new_mt[i - 1].ref = lgt->maptrack_limit + i;
> -        new_mt[i - 1].ref = lgt->maptrack_head;
> -        lgt->maptrack_head = lgt->maptrack_limit;
> +    spin_lock(&lgt->maptrack_lock);
>  
> -        lgt->maptrack[nr_frames] = new_mt;
> -        smp_wmb();
> -        lgt->maptrack_limit      = new_mt_limit;
> +    for ( i = 1; i < MAPTRACK_PER_PAGE; i++ ){
> +        new_mt[i - 1].ref = (lgt->maptrack_pages * MAPTRACK_PER_PAGE) + i;
> +        new_mt[i - 1].vcpu = v->vcpu_id;
> +    }
> +    /* Set last entry vcpu and ref */
> +    new_mt[i - 1].ref = v->maptrack_head;
> +    new_mt[i - 1].vcpu = v->vcpu_id;
>  
> -        gdprintk(XENLOG_INFO, "Increased maptrack size to %u frames\n",
> -                 nr_frames + 1);
> +    v->maptrack_head = lgt->maptrack_pages * MAPTRACK_PER_PAGE;
> +    if (v->maptrack_tail == MAPTRACK_TAIL)

Doesn't this imply v->maptrack_head == MAPTRACK_TAIL (before
the assignment right above) and hence ...

> +    {
> +        v->maptrack_tail = (lgt->maptrack_pages * MAPTRACK_PER_PAGE)
> +            + MAPTRACK_PER_PAGE - 1;
> +        new_mt[i - 1].ref = MAPTRACK_TAIL;

... this being redundant with the setting of the last entry above? In
which case this could better be an ASSERT() than an assignment.

> @@ -566,7 +591,7 @@ static void mapcount(
>      ASSERT(rw_is_locked(&rd->grant_table->lock));
>      spin_lock(&lgt->maptrack_lock);
>  
> -    for ( handle = 0; handle < lgt->maptrack_limit; handle++ )
> +    for ( handle = 0; handle < lgt->maptrack_pages * MAPTRACK_PER_PAGE; handle++ )

Long line.

> @@ -1430,6 +1456,17 @@ gnttab_setup_table(
>      gt = d->grant_table;
>      write_lock(&gt->lock);
>  
> +    /* Tracking of mapped foreign frames table */
> +    if ( (gt->maptrack = xzalloc_array(struct grant_mapping *,
> +                                       max_maptrack_frames * d->max_vcpus)) == NULL )
> +        goto out3;

This surely can easily become an allocation of far more than a page,
and hence needs to be broken up (perhaps using vmap() to map
individually allocated pages).

> +    for_each_vcpu( d, v )
> +    {
> +        v->maptrack_head = MAPTRACK_TAIL;
> +        v->maptrack_tail = MAPTRACK_TAIL;
> +    }
> +    gt->maptrack_pages = 0;

I don't think this is needed. Or if it is (i.e. the value can be reset here)
then doing the allocation above unconditionally would look wrong too.
(I suppose it gets moved here because of the dependency on
d->max_vcpus.) Indeed I don't see anything preventing a guest from
invoking GNTTABOP_setup_table more than once.
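
An illustrative sketch (not in the posted patch) of the kind of guard
that point implies, leaving aside the separate question of how the
array itself should be allocated:

    /* GNTTABOP_setup_table may be invoked repeatedly; only allocate once. */
    if ( !gt->maptrack )
    {
        gt->maptrack = xzalloc_array(struct grant_mapping *,
                                     max_maptrack_frames * d->max_vcpus);
        if ( gt->maptrack == NULL )
            goto out3;

        for_each_vcpu ( d, v )
        {
            v->maptrack_head = MAPTRACK_TAIL;
            v->maptrack_tail = MAPTRACK_TAIL;
        }
    }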

> @@ -3066,12 +3091,10 @@ grant_table_create(
>          free_xenheap_page(t->shared_raw[i]);
>      xfree(t->shared_raw);
>   no_mem_3:
> -    free_xenheap_page(t->maptrack[0]);
> -    xfree(t->maptrack);
> - no_mem_2:
>      for ( i = 0;
>            i < num_act_frames_from_sha_frames(INITIAL_NR_GRANT_FRAMES); i++ )
>          free_xenheap_page(t->active[i]);
> + no_mem_2:
>      xfree(t->active);

This movement of the no_mem_2 label looks wrong.

Jan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists
  2015-05-05 12:34   ` Jan Beulich
@ 2015-05-12 11:01     ` David Vrabel
  2015-05-12 11:37       ` Jan Beulich
  0 siblings, 1 reply; 16+ messages in thread
From: David Vrabel @ 2015-05-12 11:01 UTC (permalink / raw)
  To: Jan Beulich, David Vrabel
  Cc: Keir Fraser, Ian Campbell, Christoph Egger, Tim Deegan,
	Matt Wilson, xen-devel, Malcolm Crossley

On 05/05/15 13:34, Jan Beulich wrote:
>>>> On 30.04.15 at 15:28, <david.vrabel@citrix.com> wrote:
>> From: Malcolm Crossley <malcolm.crossley@citrix.com>
>>
>> Performance analysis of aggregate network throughput with many VMs
>> shows that performance is signficantly limited by contention on the
>> maptrack lock when obtaining/releasing maptrack handles from the free
>> list.
>>
>> Instead of a single free list use a per-VCPU list. This avoids any
>> contention when obtaining a handle.  Handles must be released back to
>> their original list and since this may occur on a different VCPU there
>> is some contention on the destination VCPU's free list tail pointer
>> (but this is much better than a per-domain lock).
>>
>> A domain's maptrack limit is multiplied by the number of VCPUs.  This
>> ensures that a worst case domain that only performs grant table
>> operations via one VCPU will not see a lower map track limit.
[...]
>> +    cur_tail = v->maptrack_tail;
> 
> read_atomic()?

It's not required since if this load gets inconsistent state, the
cmpxchg loop will just go around once more.  I've added the
read_atomic() anyway, though.

>> @@ -1430,6 +1456,17 @@ gnttab_setup_table(
>>      gt = d->grant_table;
>>      write_lock(&gt->lock);
>>  
>> +    /* Tracking of mapped foreign frames table */
>> +    if ( (gt->maptrack = xzalloc_array(struct grant_mapping *,
>> +                                       max_maptrack_frames * d->max_vcpus)) == NULL )
>> +        goto out3;
> 
> This surely can easily become an allocation of far more than a page,
> and hence needs to be broken up (perhaps using vmap() to map
> individually allocated pages).

I think there should be a common vzalloc_array() function.  Do you
agree?  This will use xzalloc_array() if the alloc is < PAGE_SIZE to
avoid needlessly using vmap space.
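
A rough sketch of such a helper, assuming the vzalloc() being added and
the existing _xzalloc(); nothing below is in the tree yet:

    /* Zeroed array allocation that only uses vmap space for large requests. */
    void *vzalloc_array(unsigned long size, unsigned long num)
    {
        unsigned long bytes;

        /* Overflow check, mirroring _xmalloc_array(). */
        if ( size && num > (UINT_MAX / size) )
            return NULL;
        bytes = size * num;

        /* Small requests stay on the xenheap to avoid wasting vmap space. */
        if ( bytes <= PAGE_SIZE )
            return _xzalloc(bytes, SMP_CACHE_BYTES);

        return vzalloc(bytes);
    }

A matching free routine would then need to tell the two cases apart,
e.g. by checking whether the pointer lies in vmap space.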

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists
  2015-05-12 11:01     ` David Vrabel
@ 2015-05-12 11:37       ` Jan Beulich
  2015-05-12 12:37         ` David Vrabel
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2015-05-12 11:37 UTC (permalink / raw)
  To: David Vrabel
  Cc: Keir Fraser, Ian Campbell, Christoph Egger, Tim Deegan,
	Matt Wilson, xen-devel, Malcolm Crossley

>>> On 12.05.15 at 13:01, <david.vrabel@citrix.com> wrote:
> On 05/05/15 13:34, Jan Beulich wrote:
>>>>> On 30.04.15 at 15:28, <david.vrabel@citrix.com> wrote:
>>> @@ -1430,6 +1456,17 @@ gnttab_setup_table(
>>>      gt = d->grant_table;
>>>      write_lock(&gt->lock);
>>>  
>>> +    /* Tracking of mapped foreign frames table */
>>> +    if ( (gt->maptrack = xzalloc_array(struct grant_mapping *,
>>> +                                       max_maptrack_frames * d->max_vcpus)) == NULL )
>>> +        goto out3;
>> 
>> This surely can easily become an allocation of far more than a page,
>> and hence needs to be broken up (perhaps using vmap() to map
>> individually allocated pages).
> 
> I think there should be a common vzalloc_array() function.  Do you
> agree?  This will use xzalloc_array() if the alloc is < PAGE_SIZE to
> avoid needlessly using vmap space.

For the _array flavor I'm not sure, but vmalloc()/vzalloc() are
already in the process of being added.

Jan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists
  2015-05-12 11:37       ` Jan Beulich
@ 2015-05-12 12:37         ` David Vrabel
  2015-05-12 12:56           ` Jan Beulich
  0 siblings, 1 reply; 16+ messages in thread
From: David Vrabel @ 2015-05-12 12:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Christoph Egger, Tim Deegan,
	Matt Wilson, xen-devel, Malcolm Crossley

On 12/05/15 12:37, Jan Beulich wrote:
>>>> On 12.05.15 at 13:01, <david.vrabel@citrix.com> wrote:
>> On 05/05/15 13:34, Jan Beulich wrote:
>>>>>> On 30.04.15 at 15:28, <david.vrabel@citrix.com> wrote:
>>>> @@ -1430,6 +1456,17 @@ gnttab_setup_table(
>>>>      gt = d->grant_table;
>>>>      write_lock(&gt->lock);
>>>>  
>>>> +    /* Tracking of mapped foreign frames table */
>>>> +    if ( (gt->maptrack = xzalloc_array(struct grant_mapping *,
>>>> +                                       max_maptrack_frames * d->max_vcpus)) == NULL )
>>>> +        goto out3;
>>>
>>> This surely can easily become an allocation of far more than a page,
>>> and hence needs to be broken up (perhaps using vmap() to map
>>> individually allocated pages).
>>
>> I think there should be a common vzalloc_array() function.  Do you
>> agree?  This will use xzalloc_array() if the alloc is < PAGE_SIZE to
>> avoid needlessly using vmap space.
> 
> For the _array flavor I'm not sure, but vmalloc()/vzalloc() are
> already in the process of being added.

Can I leave this as-is (with the xzalloc_array()) and fix it up once
vzalloc() exists?

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists
  2015-05-12 12:37         ` David Vrabel
@ 2015-05-12 12:56           ` Jan Beulich
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Beulich @ 2015-05-12 12:56 UTC (permalink / raw)
  To: David Vrabel
  Cc: Keir Fraser, Ian Campbell, Christoph Egger, Tim Deegan,
	Matt Wilson, xen-devel, Malcolm Crossley

>>> On 12.05.15 at 14:37, <david.vrabel@citrix.com> wrote:
> On 12/05/15 12:37, Jan Beulich wrote:
>>>>> On 12.05.15 at 13:01, <david.vrabel@citrix.com> wrote:
>>> On 05/05/15 13:34, Jan Beulich wrote:
>>>>>>> On 30.04.15 at 15:28, <david.vrabel@citrix.com> wrote:
>>>>> @@ -1430,6 +1456,17 @@ gnttab_setup_table(
>>>>>      gt = d->grant_table;
>>>>>      write_lock(&gt->lock);
>>>>>  
>>>>> +    /* Tracking of mapped foreign frames table */
>>>>> +    if ( (gt->maptrack = xzalloc_array(struct grant_mapping *,
>>>>> +                                       max_maptrack_frames * d->max_vcpus)) == NULL )
>>>>> +        goto out3;
>>>>
>>>> This surely can easily become an allocation of far more than a page,
>>>> and hence needs to be broken up (perhaps using vmap() to map
>>>> individually allocated pages).
>>>
>>> I think there should be a common vzalloc_array() function.  Do you
>>> agree?  This will use xzalloc_array() if the alloc is < PAGE_SIZE to
>>> avoid needlessly using vmap space.
>> 
>> For the _array flavor I'm not sure, but vmalloc()/vzalloc() are
>> already in the process of being added.
> 
> Can I leave this as-is (with the xzalloc_array()) and fix it up once
> vzalloc() exists?

Sure.

Jan

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-05-12 12:56 UTC | newest]

Thread overview: 16+ messages
2015-04-30 13:28 [PATCHv7 0/3] gnttab: Improve scaleability David Vrabel
2015-04-30 13:28 ` [PATCHv7 1/3] gnttab: Introduce rwlock to protect updates to grant table state David Vrabel
2015-05-05 10:42   ` Jan Beulich
2015-04-30 13:28 ` [PATCHv7 2/3] gnttab: refactor locking for scalability David Vrabel
2015-04-30 14:34   ` Tim Deegan
2015-05-01 18:03     ` David Vrabel
2015-05-01 20:02       ` Tim Deegan
2015-05-05 11:58   ` Jan Beulich
2015-04-30 13:28 ` [PATCHv7 3/3] gnttab: use per-VCPU maptrack free lists David Vrabel
2015-04-30 15:12   ` Tim Deegan
2015-05-01 12:29     ` Malcolm Crossley
2015-05-05 12:34   ` Jan Beulich
2015-05-12 11:01     ` David Vrabel
2015-05-12 11:37       ` Jan Beulich
2015-05-12 12:37         ` David Vrabel
2015-05-12 12:56           ` Jan Beulich
