* [PATCHv2 0/3] Implement per-cpu reader-writer locks
@ 2015-11-20 16:03 Malcolm Crossley
  2015-11-20 16:03 ` [PATCHv2 1/3] rwlock: Add " Malcolm Crossley
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Malcolm Crossley @ 2015-11-20 16:03 UTC (permalink / raw)
  To: malcolm.crossley, JBeulich, ian.campbell, andrew.cooper3,
	Marcos.Matsunaga, keir, konrad.wilk, george.dunlap
  Cc: xen-devel, stefano.stabellini

This patch series adds per-cpu reader-writer locks as a generic lock
implementation and then converts the grant table and p2m rwlocks to
use the percpu rwlocks, in order to improve multi-socket host performance.

CPU profiling has revealed that the rwlocks themselves suffer from severe
cache line bouncing due to the cmpxchg operation used even when taking a read
lock. Multiqueue paravirtualised I/O results in heavy contention of the grant
table and p2m read locks of a specific domain, and so I/O throughput is
bottlenecked by the overhead of the cache line bouncing itself.
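
For illustration, the read path of a conventional rwlock looks roughly like
the sketch below (simplified, not the exact Xen code; the helper name is made
up). The cmpxchg on the shared lock word pulls the cache line exclusive on
every read acquisition, even when no writer is present.

/* Illustrative sketch only, not the exact Xen implementation. */
static inline void conventional_read_lock(rwlock_t *lock)
{
    uint32_t x;

    do {
        /* Wait for any writer to go away. */
        while ( (x = lock->lock) & RW_WRITE_FLAG )
            cpu_relax();
        /* This cmpxchg is what bounces the cache line between sockets. */
    } while ( cmpxchg(&lock->lock, x, x + 1) != x );
}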

Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
area to record that a CPU has taken the read lock. Correctness is enforced
for the write lock by using a per-lock barrier which forces the per-cpu read
lock to revert to using a standard read lock. The write lock then polls all
the per-cpu data areas until the active readers for the lock have exited.
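
In terms of usage, the series boils down to the pattern sketched below (a
sketch only; the macro names come from patch 1 and patch 2 contains the real
grant table conversion):

/* Declare (in a header) and define (in one .c file) the per-cpu data for
 * this lock type. */
DECLARE_PERCPU_RWLOCK_GLOBAL(grant_rwlock);
DEFINE_PERCPU_RWLOCK_GLOBAL(grant_rwlock);

/* Initialise the percpu_rwlock_t embedded in the protected structure. */
percpu_rwlock_resource_init(&d->grant_table->lock);

/* Read side: the fast path only touches this CPU's pointer. */
percpu_read_lock(grant_rwlock, &d->grant_table->lock);
/* ... read-side critical section ... */
percpu_read_unlock(grant_rwlock, &d->grant_table->lock);

/* Write side: forces readers onto the conventional rwlock, then waits for
 * the active per-cpu readers of this particular lock to drain. */
percpu_write_lock(grant_rwlock, &d->grant_table->lock);
/* ... write-side critical section ... */
percpu_write_unlock(grant_rwlock, &d->grant_table->lock);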

Removing the cache line bouncing on a multi-socket Haswell-EP system
dramatically improves performance, with 16 vCPU network IO performance going
from 15 gb/s to 64 gb/s! The host under test was fully utilising all 40
logical CPUs at 64 gb/s, so a host with more logical CPUs may see an even
better IO improvement.

Note: Benchmarking of these performance improvements should be done with the
non-debug version of the hypervisor, otherwise the map_domain_page spinlock
becomes the main bottleneck.

Changes in V2:
- Add Cover letter
- Convert p2m rwlock to percpu rwlock
- Improve percpu rwlock to safely handle simultaneously holding 2 or more 
  locks 
- Move percpu rwlock barrier from global to per lock
- Move write lock cpumask variable to a percpu variable
- Add macros to help initialise and use percpu rwlocks
- Updated IO benchmark results to cover revised locking implementation


 xen/arch/arm/mm.c             |   4 +-
 xen/arch/x86/mm.c             |   4 +-
 xen/arch/x86/mm/mm-locks.h    |  12 +++--
 xen/arch/x86/mm/p2m.c         |   1 +
 xen/common/grant_table.c      | 112 +++++++++++++++++++++---------------------
 xen/common/spinlock.c         |  38 ++++++++++++++
 xen/include/asm-arm/percpu.h  |   5 ++
 xen/include/asm-x86/mm.h      |   2 +-
 xen/include/asm-x86/percpu.h  |   6 +++
 xen/include/xen/grant_table.h |   4 +-
 xen/include/xen/percpu.h      |   4 ++
 xen/include/xen/spinlock.h    |  87 ++++++++++++++++++++++++++++++++
 12 files changed, 211 insertions(+), 68 deletions(-)

-- 
1.7.12.4


* [PATCHv2 1/3] rwlock: Add per-cpu reader-writer locks
  2015-11-20 16:03 [PATCHv2 0/3] Implement per-cpu reader-writer locks Malcolm Crossley
@ 2015-11-20 16:03 ` Malcolm Crossley
  2015-11-25 11:12   ` George Dunlap
  2015-11-26 12:17   ` Jan Beulich
  2015-11-20 16:03 ` [PATCHv2 2/3] grant_table: convert grant table rwlock to percpu rwlock Malcolm Crossley
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 19+ messages in thread
From: Malcolm Crossley @ 2015-11-20 16:03 UTC (permalink / raw)
  To: malcolm.crossley, JBeulich, ian.campbell, andrew.cooper3,
	Marcos.Matsunaga, keir, konrad.wilk, george.dunlap
  Cc: xen-devel, stefano.stabellini

Per-cpu read-write locks allow the fast path read case to have low overhead
by only setting/clearing a per-cpu variable when taking or releasing the read
lock. The per-cpu read fast path also avoids locked compare-and-swap
operations, which can be particularly slow on coherent multi-socket systems,
especially if there is heavy usage of the read lock itself.

The per-cpu reader-writer lock uses a per-lock variable (writer_activating)
to control the read lock fast path. This allows a writer to disable the fast
path and ensures that readers switch to using the underlying read-write lock
implementation instead of the per-cpu variable.

Once the writer has taken the write lock and disabled the fast path, it must
poll the per-cpu variable for all CPUs which have entered the critical
section for the specific read-write lock the writer is attempting to take.
This design allows a single per-cpu variable to be used for read/write locks
belonging to separate data structures. If two or more different per-cpu read
locks are taken simultaneously then the per-cpu data structure is not used
and the implementation takes the read lock of the underlying read-write lock;
this behaviour is equivalent to the slow path in terms of performance. The
per-cpu rwlock is not recursion safe for taking the per-cpu read lock because
there is no recursion count variable; this is functionally equivalent to
standard spin locks.
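
For example (sketch only, using the local/remote grant table names from
patch 2), nesting two read locks which share the same per-cpu variable
demotes the second one to the conventional read lock:

percpu_read_lock(grant_rwlock, &lgt->lock);   /* fast path, pointer set */
percpu_read_lock(grant_rwlock, &rgt->lock);   /* pointer already in use:
                                               * falls back to
                                               * read_lock(&rgt->lock.rwlock) */
/* ... */
percpu_read_unlock(grant_rwlock, &rgt->lock); /* mismatch seen, read_unlock */
percpu_read_unlock(grant_rwlock, &lgt->lock); /* clears the per-cpu pointer */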

Slow path readers which are unblocked set the per-cpu variable and drop the
read lock. This simplifies the implementation and allows the fairness of the
underlying read-write lock to be taken advantage of.

There is more overhead on the per-cpu write lock path due to checking each
CPU's fast path per-cpu variable, but this overhead is likely to be hidden by
the required delay of waiting for readers to exit the critical section. The
loop is optimised to only iterate over the per-cpu data of active readers of
the rwlock. The cpumask_t for tracking the active readers is stored in a
single per-cpu data location and thus the write lock is not pre-emption safe.
Therefore the per-cpu write lock can only be used with interrupts disabled.

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
--
Changes since v1:
- Removed restriction on taking two or more different percpu rw
locks simultaneously
- Moved fast-path/slow-path barrier to be per lock instead of global
- Created separate percpu_rwlock_t type and added macros to
initialise new type
- Added helper macros for using the percpu rwlock itself
- Moved active_readers cpumask off the stack and into a percpu
variable
---
 xen/common/spinlock.c        | 38 +++++++++++++++++++
 xen/include/asm-arm/percpu.h |  5 +++
 xen/include/asm-x86/percpu.h |  6 +++
 xen/include/xen/percpu.h     |  4 ++
 xen/include/xen/spinlock.h   | 87 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 140 insertions(+)

diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
index 7f89694..a0f9df5 100644
--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -10,6 +10,8 @@
 #include <asm/processor.h>
 #include <asm/atomic.h>
 
+DEFINE_PER_CPU(cpumask_t, percpu_rwlock_readers);
+
 #ifndef NDEBUG
 
 static atomic_t spin_debug __read_mostly = ATOMIC_INIT(0);
@@ -492,6 +494,42 @@ int _rw_is_write_locked(rwlock_t *lock)
     return (lock->lock == RW_WRITE_FLAG); /* writer in critical section? */
 }
 
+void _percpu_write_lock(percpu_rwlock_t **per_cpudata,
+	        percpu_rwlock_t *percpu_rwlock)
+{
+    unsigned int cpu;
+
+    /* First take the write lock to protect against other writers. */
+    write_lock(&percpu_rwlock->rwlock);
+
+    /* Now set the global variable so that readers start using read_lock. */
+    percpu_rwlock->writer_activating = 1;
+    smp_mb();
+
+    /* Using a per cpu cpumask is only safe if there is no nesting. */
+    ASSERT(!in_irq());
+    this_cpu(percpu_rwlock_readers) = cpu_online_map;
+
+    /* Check if there are any percpu readers in progress on this rwlock. */
+    for ( ; ; )
+    {
+        for_each_cpu(cpu, &this_cpu(percpu_rwlock_readers))
+        {
+            /* 
+	     * Remove any percpu readers not contending on this rwlock
+             * from our check mask.
+	     */
+            if ( per_cpu_ptr(per_cpudata, cpu) != percpu_rwlock )
+                cpumask_clear_cpu(cpu, &this_cpu(percpu_rwlock_readers));
+        }
+        /* Check if we've cleared all percpu readers from check mask. */
+        if ( cpumask_empty(&this_cpu(percpu_rwlock_readers)) )
+            break;
+        /* Give the coherency fabric a break. */
+        cpu_relax();
+    };
+}
+
 #ifdef LOCK_PROFILE
 
 struct lock_profile_anc {
diff --git a/xen/include/asm-arm/percpu.h b/xen/include/asm-arm/percpu.h
index 71e7649..c308a56 100644
--- a/xen/include/asm-arm/percpu.h
+++ b/xen/include/asm-arm/percpu.h
@@ -27,6 +27,11 @@ void percpu_init_areas(void);
 #define __get_cpu_var(var) \
     (*RELOC_HIDE(&per_cpu__##var, READ_SYSREG(TPIDR_EL2)))
 
+#define per_cpu_ptr(var, cpu)  \
+    (*RELOC_HIDE(&var, __per_cpu_offset[cpu]))
+#define __get_cpu_ptr(var) \
+    (*RELOC_HIDE(&var, READ_SYSREG(TPIDR_EL2)))
+
 #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
 
 DECLARE_PER_CPU(unsigned int, cpu_id);
diff --git a/xen/include/asm-x86/percpu.h b/xen/include/asm-x86/percpu.h
index 604ff0d..51562b9 100644
--- a/xen/include/asm-x86/percpu.h
+++ b/xen/include/asm-x86/percpu.h
@@ -20,4 +20,10 @@ void percpu_init_areas(void);
 
 #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
 
+#define __get_cpu_ptr(var) \
+    (*RELOC_HIDE(var, get_cpu_info()->per_cpu_offset))
+
+#define per_cpu_ptr(var, cpu)  \
+    (*RELOC_HIDE(var, __per_cpu_offset[cpu]))
+
 #endif /* __X86_PERCPU_H__ */
diff --git a/xen/include/xen/percpu.h b/xen/include/xen/percpu.h
index abe0b11..c896863 100644
--- a/xen/include/xen/percpu.h
+++ b/xen/include/xen/percpu.h
@@ -16,6 +16,10 @@
 /* Preferred on Xen. Also see arch-defined per_cpu(). */
 #define this_cpu(var)    __get_cpu_var(var)
 
+#define this_cpu_ptr(ptr)    __get_cpu_ptr(ptr)
+
+#define get_per_cpu_var(var)  (per_cpu__##var)
+
 /* Linux compatibility. */
 #define get_cpu_var(var) this_cpu(var)
 #define put_cpu_var(var)
diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h
index fb0438e..a5a25b6 100644
--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -3,6 +3,7 @@
 
 #include <asm/system.h>
 #include <asm/spinlock.h>
+#include <asm/types.h>
 
 #ifndef NDEBUG
 struct lock_debug {
@@ -261,4 +262,90 @@ int _rw_is_write_locked(rwlock_t *lock);
 #define rw_is_locked(l)               _rw_is_locked(l)
 #define rw_is_write_locked(l)         _rw_is_write_locked(l)
 
+struct percpu_rwlock {
+	rwlock_t rwlock;
+	bool_t   writer_activating;
+};
+
+typedef struct percpu_rwlock percpu_rwlock_t;
+
+#define PERCPU_RW_LOCK_UNLOCKED { RW_LOCK_UNLOCKED, 0 }
+#define DEFINE_PERCPU_RWLOCK_RESOURCE(l) percpu_rwlock_t l = PERCPU_RW_LOCK_UNLOCKED
+#define percpu_rwlock_resource_init(l) (*(l) = (percpu_rwlock_t)PERCPU_RW_LOCK_UNLOCKED)
+
+static inline void _percpu_read_lock(percpu_rwlock_t **per_cpudata,
+                                         percpu_rwlock_t *percpu_rwlock)
+{
+    /* We cannot support recursion on the same lock. */
+    ASSERT(this_cpu_ptr(per_cpudata) != percpu_rwlock);
+    /* 
+     * Detect using a second percpu_rwlock_t simultaneously and fall back
+     * to standard read_lock.
+     */
+    if ( unlikely(this_cpu_ptr(per_cpudata) != NULL ) )
+    {
+        read_lock(&percpu_rwlock->rwlock);
+        return;
+    }
+
+    /* Indicate this cpu is reading. */
+    this_cpu_ptr(per_cpudata) = percpu_rwlock;
+    smp_mb();
+    /* Check if a writer is waiting. */
+    if ( unlikely(percpu_rwlock->writer_activating) )
+    {
+        /* Let the waiting writer know we aren't holding the lock. */
+        this_cpu_ptr(per_cpudata) = NULL;
+        /* Wait using the read lock to keep the lock fair. */
+        read_lock(&percpu_rwlock->rwlock);
+        /* Set the per CPU data again and continue. */
+        this_cpu_ptr(per_cpudata) = percpu_rwlock;
+        /* Drop the read lock because we don't need it anymore. */
+        read_unlock(&percpu_rwlock->rwlock);
+    }
+}
+
+static inline void _percpu_read_unlock(percpu_rwlock_t **per_cpudata,
+                percpu_rwlock_t *percpu_rwlock)
+{
+    ASSERT(this_cpu_ptr(per_cpudata) != NULL);
+    /* 
+     * Detect using a second percpu_rwlock_t simultaneously and fall back
+     * to standard read_unlock.
+     */
+    if ( unlikely(this_cpu_ptr(per_cpudata) != percpu_rwlock ) )
+    {
+        read_unlock(&percpu_rwlock->rwlock);
+        return;
+    }
+    this_cpu_ptr(per_cpudata) = NULL;
+    smp_wmb();
+}
+
+/* Don't inline percpu write lock as it's a complex function. */
+void _percpu_write_lock(percpu_rwlock_t **per_cpudata,
+	        percpu_rwlock_t *percpu_rwlock);
+
+static inline void _percpu_write_unlock(percpu_rwlock_t *percpu_rwlock)
+{
+    ASSERT(percpu_rwlock->writer_activating);
+    percpu_rwlock->writer_activating = 0;
+    write_unlock(&percpu_rwlock->rwlock);
+}
+
+#define percpu_rw_is_write_locked(l)         _rw_is_write_locked(&((l)->rwlock))
+
+#define percpu_read_lock(percpu, lock) ( _percpu_read_lock( \
+                                        &get_per_cpu_var(percpu), lock ) )
+#define percpu_read_unlock(percpu, lock) ( _percpu_read_unlock( \
+                                        &get_per_cpu_var(percpu), lock ) )
+#define percpu_write_lock(percpu, lock) ( _percpu_write_lock( \
+                                        &get_per_cpu_var(percpu), lock ) )
+#define percpu_write_unlock(percpu, lock) ( _percpu_write_unlock( lock ) )
+
+#define DEFINE_PERCPU_RWLOCK_GLOBAL(name) DEFINE_PER_CPU(percpu_rwlock_t *, \
+                                                         name)
+#define DECLARE_PERCPU_RWLOCK_GLOBAL(name) DECLARE_PER_CPU(percpu_rwlock_t *, \
+                                                         name)
+
 #endif /* __SPINLOCK_H__ */
-- 
1.7.12.4


* [PATCHv2 2/3] grant_table: convert grant table rwlock to percpu rwlock
  2015-11-20 16:03 [PATCHv2 0/3] Implement per-cpu reader-writer locks Malcolm Crossley
  2015-11-20 16:03 ` [PATCHv2 1/3] rwlock: Add " Malcolm Crossley
@ 2015-11-20 16:03 ` Malcolm Crossley
  2015-11-25 12:35   ` Jan Beulich
  2015-11-20 16:03 ` [PATCHv2 3/3] p2m: " Malcolm Crossley
  2015-11-24 18:16 ` [PATCHv2 0/3] Implement per-cpu reader-writer locks George Dunlap
  3 siblings, 1 reply; 19+ messages in thread
From: Malcolm Crossley @ 2015-11-20 16:03 UTC (permalink / raw)
  To: malcolm.crossley, JBeulich, ian.campbell, andrew.cooper3,
	Marcos.Matsunaga, keir, konrad.wilk, george.dunlap
  Cc: xen-devel, stefano.stabellini

The per domain grant table read lock suffers from significant contention
when performing multi-queue block or network IO, due to the parallel grant
map/unmap/copy operations occurring on the DomU's grant table.

On multi-socket systems, the contention results in the locked
compare-and-swap operation failing frequently, which results in a tight loop
of retries of the compare-and-swap operation. As the coherency fabric can
only support a specific rate of compare-and-swap operations for a particular
data location, taking the read lock itself becomes a bottleneck for grant
operations.

Standard rwlock performance of a single VIF VM-VM transfer with 16 queues
configured was limited to approximately 15 gbit/s on a 2 socket Haswell-EP
host.

Percpu rwlock performance with the same configuration is approximately
48 gbit/s.

Oprofile was used to determine the initial overhead of the read-write locks
and to confirm the overhead was dramatically reduced by the percpu rwlocks.

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
--
Changes since v1:
 - Used new macros provided in updated percpu rwlock v2 patch
 - Converted grant table rwlock_t to percpu_rwlock_t
 - Patched a missed grant table rwlock_t usage site
---
 xen/arch/arm/mm.c             |   4 +-
 xen/arch/x86/mm.c             |   4 +-
 xen/common/grant_table.c      | 112 +++++++++++++++++++++---------------------
 xen/include/xen/grant_table.h |   4 +-
 4 files changed, 62 insertions(+), 62 deletions(-)

diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index 8b6d915..238bb5f 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -1055,7 +1055,7 @@ int xenmem_add_to_physmap_one(
     switch ( space )
     {
     case XENMAPSPACE_grant_table:
-        write_lock(&d->grant_table->lock);
+        percpu_write_lock(grant_rwlock, &d->grant_table->lock);
 
         if ( d->grant_table->gt_version == 0 )
             d->grant_table->gt_version = 1;
@@ -1085,7 +1085,7 @@ int xenmem_add_to_physmap_one(
 
         t = p2m_ram_rw;
 
-        write_unlock(&d->grant_table->lock);
+        percpu_write_unlock(grant_rwlock, &d->grant_table->lock);
         break;
     case XENMAPSPACE_shared_info:
         if ( idx != 0 )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 92df36f..0ac8e66 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4654,7 +4654,7 @@ int xenmem_add_to_physmap_one(
                 mfn = virt_to_mfn(d->shared_info);
             break;
         case XENMAPSPACE_grant_table:
-            write_lock(&d->grant_table->lock);
+            percpu_write_lock(grant_rwlock, &d->grant_table->lock);
 
             if ( d->grant_table->gt_version == 0 )
                 d->grant_table->gt_version = 1;
@@ -4676,7 +4676,7 @@ int xenmem_add_to_physmap_one(
                     mfn = virt_to_mfn(d->grant_table->shared_raw[idx]);
             }
 
-            write_unlock(&d->grant_table->lock);
+            percpu_write_unlock(grant_rwlock, &d->grant_table->lock);
             break;
         case XENMAPSPACE_gmfn_range:
         case XENMAPSPACE_gmfn:
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 5d52d1e..e6f9c25 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -178,6 +178,8 @@ struct active_grant_entry {
 #define _active_entry(t, e) \
     ((t)->active[(e)/ACGNT_PER_PAGE][(e)%ACGNT_PER_PAGE])
 
+DEFINE_PERCPU_RWLOCK_GLOBAL(grant_rwlock);
+
 static inline void gnttab_flush_tlb(const struct domain *d)
 {
     if ( !paging_mode_external(d) )
@@ -208,8 +210,6 @@ active_entry_acquire(struct grant_table *t, grant_ref_t e)
 {
     struct active_grant_entry *act;
 
-    ASSERT(rw_is_locked(&t->lock));
-
     act = &_active_entry(t, e);
     spin_lock(&act->lock);
 
@@ -270,23 +270,23 @@ double_gt_lock(struct grant_table *lgt, struct grant_table *rgt)
      */
     if ( lgt < rgt )
     {
-        write_lock(&lgt->lock);
-        write_lock(&rgt->lock);
+        percpu_write_lock(grant_rwlock, &lgt->lock);
+        percpu_write_lock(grant_rwlock, &rgt->lock);
     }
     else
     {
         if ( lgt != rgt )
-            write_lock(&rgt->lock);
-        write_lock(&lgt->lock);
+            percpu_write_lock(grant_rwlock, &rgt->lock);
+        percpu_write_lock(grant_rwlock, &lgt->lock);
     }
 }
 
 static inline void
 double_gt_unlock(struct grant_table *lgt, struct grant_table *rgt)
 {
-    write_unlock(&lgt->lock);
+    percpu_write_unlock(grant_rwlock, &lgt->lock);
     if ( lgt != rgt )
-        write_unlock(&rgt->lock);
+        percpu_write_unlock(grant_rwlock, &rgt->lock);
 }
 
 static inline int
@@ -659,8 +659,6 @@ static int grant_map_exists(const struct domain *ld,
                             unsigned int *ref_count)
 {
     unsigned int ref, max_iter;
-    
-    ASSERT(rw_is_locked(&rgt->lock));
 
     max_iter = min(*ref_count + (1 << GNTTABOP_CONTINUATION_ARG_SHIFT),
                    nr_grant_entries(rgt));
@@ -703,12 +701,12 @@ static unsigned int mapkind(
      * Must have the local domain's grant table write lock when
      * iterating over its maptrack entries.
      */
-    ASSERT(rw_is_write_locked(&lgt->lock));
+    ASSERT(percpu_rw_is_write_locked(&lgt->lock));
     /*
      * Must have the remote domain's grant table write lock while
      * counting its active entries.
      */
-    ASSERT(rw_is_write_locked(&rd->grant_table->lock));
+    ASSERT(percpu_rw_is_write_locked(&rd->grant_table->lock));
 
     for ( handle = 0; !(kind & MAPKIND_WRITE) &&
                       handle < lgt->maptrack_limit; handle++ )
@@ -796,7 +794,7 @@ __gnttab_map_grant_ref(
     }
 
     rgt = rd->grant_table;
-    read_lock(&rgt->lock);
+    percpu_read_lock(grant_rwlock, &rgt->lock);
 
     /* Bounds check on the grant ref */
     if ( unlikely(op->ref >= nr_grant_entries(rgt)))
@@ -859,7 +857,7 @@ __gnttab_map_grant_ref(
     cache_flags = (shah->flags & (GTF_PAT | GTF_PWT | GTF_PCD) );
 
     active_entry_release(act);
-    read_unlock(&rgt->lock);
+    percpu_read_unlock(grant_rwlock, &rgt->lock);
 
     /* pg may be set, with a refcount included, from __get_paged_frame */
     if ( !pg )
@@ -1006,7 +1004,7 @@ __gnttab_map_grant_ref(
         put_page(pg);
     }
 
-    read_lock(&rgt->lock);
+    percpu_read_lock(grant_rwlock, &rgt->lock);
 
     act = active_entry_acquire(rgt, op->ref);
 
@@ -1029,7 +1027,7 @@ __gnttab_map_grant_ref(
     active_entry_release(act);
 
  unlock_out:
-    read_unlock(&rgt->lock);
+    percpu_read_unlock(grant_rwlock, &rgt->lock);
     op->status = rc;
     put_maptrack_handle(lgt, handle);
     rcu_unlock_domain(rd);
@@ -1080,18 +1078,18 @@ __gnttab_unmap_common(
 
     op->map = &maptrack_entry(lgt, op->handle);
 
-    read_lock(&lgt->lock);
+    percpu_read_lock(grant_rwlock, &lgt->lock);
 
     if ( unlikely(!read_atomic(&op->map->flags)) )
     {
-        read_unlock(&lgt->lock);
+        percpu_read_unlock(grant_rwlock, &lgt->lock);
         gdprintk(XENLOG_INFO, "Zero flags for handle (%d).\n", op->handle);
         op->status = GNTST_bad_handle;
         return;
     }
 
     dom = op->map->domid;
-    read_unlock(&lgt->lock);
+    percpu_read_unlock(grant_rwlock, &lgt->lock);
 
     if ( unlikely((rd = rcu_lock_domain_by_id(dom)) == NULL) )
     {
@@ -1113,7 +1111,7 @@ __gnttab_unmap_common(
 
     rgt = rd->grant_table;
 
-    read_lock(&rgt->lock);
+    percpu_read_lock(grant_rwlock, &rgt->lock);
 
     op->flags = read_atomic(&op->map->flags);
     if ( unlikely(!op->flags) || unlikely(op->map->domid != dom) )
@@ -1165,7 +1163,7 @@ __gnttab_unmap_common(
  act_release_out:
     active_entry_release(act);
  unmap_out:
-    read_unlock(&rgt->lock);
+    percpu_read_unlock(grant_rwlock, &rgt->lock);
 
     if ( rc == GNTST_okay && gnttab_need_iommu_mapping(ld) )
     {
@@ -1220,7 +1218,7 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common *op)
     rcu_lock_domain(rd);
     rgt = rd->grant_table;
 
-    read_lock(&rgt->lock);
+    percpu_read_lock(grant_rwlock, &rgt->lock);
     if ( rgt->gt_version == 0 )
         goto unlock_out;
 
@@ -1286,7 +1284,7 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common *op)
  act_release_out:
     active_entry_release(act);
  unlock_out:
-    read_unlock(&rgt->lock);
+    percpu_read_unlock(grant_rwlock, &rgt->lock);
 
     if ( put_handle )
     {
@@ -1585,7 +1583,7 @@ gnttab_setup_table(
     }
 
     gt = d->grant_table;
-    write_lock(&gt->lock);
+    percpu_write_lock(grant_rwlock, &gt->lock);
 
     if ( gt->gt_version == 0 )
         gt->gt_version = 1;
@@ -1613,7 +1611,7 @@ gnttab_setup_table(
     }
 
  out3:
-    write_unlock(&gt->lock);
+    percpu_write_unlock(grant_rwlock, &gt->lock);
  out2:
     rcu_unlock_domain(d);
  out1:
@@ -1655,13 +1653,13 @@ gnttab_query_size(
         goto query_out_unlock;
     }
 
-    read_lock(&d->grant_table->lock);
+    percpu_read_lock(grant_rwlock, &d->grant_table->lock);
 
     op.nr_frames     = nr_grant_frames(d->grant_table);
     op.max_nr_frames = max_grant_frames;
     op.status        = GNTST_okay;
 
-    read_unlock(&d->grant_table->lock);
+    percpu_read_unlock(grant_rwlock, &d->grant_table->lock);
 
  
  query_out_unlock:
@@ -1687,7 +1685,7 @@ gnttab_prepare_for_transfer(
     union grant_combo   scombo, prev_scombo, new_scombo;
     int                 retries = 0;
 
-    read_lock(&rgt->lock);
+    percpu_read_lock(grant_rwlock, &rgt->lock);
 
     if ( unlikely(ref >= nr_grant_entries(rgt)) )
     {
@@ -1730,11 +1728,11 @@ gnttab_prepare_for_transfer(
         scombo = prev_scombo;
     }
 
-    read_unlock(&rgt->lock);
+    percpu_read_unlock(grant_rwlock, &rgt->lock);
     return 1;
 
  fail:
-    read_unlock(&rgt->lock);
+    percpu_read_unlock(grant_rwlock, &rgt->lock);
     return 0;
 }
 
@@ -1925,7 +1923,7 @@ gnttab_transfer(
         TRACE_1D(TRC_MEM_PAGE_GRANT_TRANSFER, e->domain_id);
 
         /* Tell the guest about its new page frame. */
-        read_lock(&e->grant_table->lock);
+        percpu_read_lock(grant_rwlock, &e->grant_table->lock);
         act = active_entry_acquire(e->grant_table, gop.ref);
 
         if ( e->grant_table->gt_version == 1 )
@@ -1949,7 +1947,7 @@ gnttab_transfer(
             GTF_transfer_completed;
 
         active_entry_release(act);
-        read_unlock(&e->grant_table->lock);
+        percpu_read_unlock(grant_rwlock, &e->grant_table->lock);
 
         rcu_unlock_domain(e);
 
@@ -1987,7 +1985,7 @@ __release_grant_for_copy(
     released_read = 0;
     released_write = 0;
 
-    read_lock(&rgt->lock);
+    percpu_read_lock(grant_rwlock, &rgt->lock);
 
     act = active_entry_acquire(rgt, gref);
     sha = shared_entry_header(rgt, gref);
@@ -2029,7 +2027,7 @@ __release_grant_for_copy(
     }
 
     active_entry_release(act);
-    read_unlock(&rgt->lock);
+    percpu_read_unlock(grant_rwlock, &rgt->lock);
 
     if ( td != rd )
     {
@@ -2086,7 +2084,7 @@ __acquire_grant_for_copy(
 
     *page = NULL;
 
-    read_lock(&rgt->lock);
+    percpu_read_lock(grant_rwlock, &rgt->lock);
 
     if ( unlikely(gref >= nr_grant_entries(rgt)) )
         PIN_FAIL(gt_unlock_out, GNTST_bad_gntref,
@@ -2168,20 +2166,20 @@ __acquire_grant_for_copy(
              * here and reacquire
              */
             active_entry_release(act);
-            read_unlock(&rgt->lock);
+            percpu_read_unlock(grant_rwlock, &rgt->lock);
 
             rc = __acquire_grant_for_copy(td, trans_gref, rd->domain_id,
                                           readonly, &grant_frame, page,
                                           &trans_page_off, &trans_length, 0);
 
-            read_lock(&rgt->lock);
+            percpu_read_lock(grant_rwlock, &rgt->lock);
             act = active_entry_acquire(rgt, gref);
 
             if ( rc != GNTST_okay ) {
                 __fixup_status_for_copy_pin(act, status);
                 rcu_unlock_domain(td);
                 active_entry_release(act);
-                read_unlock(&rgt->lock);
+                percpu_read_unlock(grant_rwlock, &rgt->lock);
                 return rc;
             }
 
@@ -2194,7 +2192,7 @@ __acquire_grant_for_copy(
                 __fixup_status_for_copy_pin(act, status);
                 rcu_unlock_domain(td);
                 active_entry_release(act);
-                read_unlock(&rgt->lock);
+                percpu_read_unlock(grant_rwlock, &rgt->lock);
                 put_page(*page);
                 return __acquire_grant_for_copy(rd, gref, ldom, readonly,
                                                 frame, page, page_off, length,
@@ -2258,7 +2256,7 @@ __acquire_grant_for_copy(
     *frame = act->frame;
 
     active_entry_release(act);
-    read_unlock(&rgt->lock);
+    percpu_read_unlock(grant_rwlock, &rgt->lock);
     return rc;
  
  unlock_out_clear:
@@ -2273,7 +2271,7 @@ __acquire_grant_for_copy(
     active_entry_release(act);
 
  gt_unlock_out:
-    read_unlock(&rgt->lock);
+    percpu_read_unlock(grant_rwlock, &rgt->lock);
 
     return rc;
 }
@@ -2589,7 +2587,7 @@ gnttab_set_version(XEN_GUEST_HANDLE_PARAM(gnttab_set_version_t) uop)
     if ( gt->gt_version == op.version )
         goto out;
 
-    write_lock(&gt->lock);
+    percpu_write_lock(grant_rwlock, &gt->lock);
     /*
      * Make sure that the grant table isn't currently in use when we
      * change the version number, except for the first 8 entries which
@@ -2702,7 +2700,7 @@ gnttab_set_version(XEN_GUEST_HANDLE_PARAM(gnttab_set_version_t) uop)
     gt->gt_version = op.version;
 
  out_unlock:
-    write_unlock(&gt->lock);
+    percpu_write_unlock(grant_rwlock, &gt->lock);
 
  out:
     op.version = gt->gt_version;
@@ -2758,7 +2756,7 @@ gnttab_get_status_frames(XEN_GUEST_HANDLE_PARAM(gnttab_get_status_frames_t) uop,
 
     op.status = GNTST_okay;
 
-    read_lock(&gt->lock);
+    percpu_read_lock(grant_rwlock, &gt->lock);
 
     for ( i = 0; i < op.nr_frames; i++ )
     {
@@ -2767,7 +2765,7 @@ gnttab_get_status_frames(XEN_GUEST_HANDLE_PARAM(gnttab_get_status_frames_t) uop,
             op.status = GNTST_bad_virt_addr;
     }
 
-    read_unlock(&gt->lock);
+    percpu_read_unlock(grant_rwlock, &gt->lock);
 out2:
     rcu_unlock_domain(d);
 out1:
@@ -2817,7 +2815,7 @@ __gnttab_swap_grant_ref(grant_ref_t ref_a, grant_ref_t ref_b)
     struct active_grant_entry *act_b = NULL;
     s16 rc = GNTST_okay;
 
-    write_lock(&gt->lock);
+    percpu_write_lock(grant_rwlock, &gt->lock);
 
     /* Bounds check on the grant refs */
     if ( unlikely(ref_a >= nr_grant_entries(d->grant_table)))
@@ -2865,7 +2863,7 @@ out:
         active_entry_release(act_b);
     if ( act_a != NULL )
         active_entry_release(act_a);
-    write_unlock(&gt->lock);
+    percpu_write_unlock(grant_rwlock, &gt->lock);
 
     rcu_unlock_domain(d);
 
@@ -2936,12 +2934,12 @@ static int __gnttab_cache_flush(gnttab_cache_flush_t *cflush,
 
     if ( d != owner )
     {
-        read_lock(&owner->grant_table->lock);
+        percpu_read_lock(grant_rwlock, &owner->grant_table->lock);
 
         ret = grant_map_exists(d, owner->grant_table, mfn, ref_count);
         if ( ret != 0 )
         {
-            read_unlock(&owner->grant_table->lock);
+            percpu_read_unlock(grant_rwlock, &owner->grant_table->lock);
             rcu_unlock_domain(d);
             put_page(page);
             return ret;
@@ -2961,7 +2959,7 @@ static int __gnttab_cache_flush(gnttab_cache_flush_t *cflush,
         ret = 0;
 
     if ( d != owner )
-        read_unlock(&owner->grant_table->lock);
+        percpu_read_unlock(grant_rwlock, &owner->grant_table->lock);
     unmap_domain_page(v);
     put_page(page);
 
@@ -3180,7 +3178,7 @@ grant_table_create(
         goto no_mem_0;
 
     /* Simple stuff. */
-    rwlock_init(&t->lock);
+    percpu_rwlock_resource_init(&t->lock);
     spin_lock_init(&t->maptrack_lock);
     t->nr_grant_frames = INITIAL_NR_GRANT_FRAMES;
 
@@ -3282,7 +3280,7 @@ gnttab_release_mappings(
         }
 
         rgt = rd->grant_table;
-        read_lock(&rgt->lock);
+        percpu_read_lock(grant_rwlock, &rgt->lock);
 
         act = active_entry_acquire(rgt, ref);
         sha = shared_entry_header(rgt, ref);
@@ -3343,7 +3341,7 @@ gnttab_release_mappings(
             gnttab_clear_flag(_GTF_reading, status);
 
         active_entry_release(act);
-        read_unlock(&rgt->lock);
+        percpu_read_unlock(grant_rwlock, &rgt->lock);
 
         rcu_unlock_domain(rd);
 
@@ -3360,7 +3358,7 @@ void grant_table_warn_active_grants(struct domain *d)
 
 #define WARN_GRANT_MAX 10
 
-    read_lock(&gt->lock);
+    percpu_read_lock(grant_rwlock, &gt->lock);
 
     for ( ref = 0; ref != nr_grant_entries(gt); ref++ )
     {
@@ -3382,7 +3380,7 @@ void grant_table_warn_active_grants(struct domain *d)
         printk(XENLOG_G_DEBUG "Dom%d has too many (%d) active grants to report\n",
                d->domain_id, nr_active);
 
-    read_unlock(&gt->lock);
+    percpu_read_unlock(grant_rwlock, &gt->lock);
 
 #undef WARN_GRANT_MAX
 }
@@ -3432,7 +3430,7 @@ static void gnttab_usage_print(struct domain *rd)
     printk("      -------- active --------       -------- shared --------\n");
     printk("[ref] localdom mfn      pin          localdom gmfn     flags\n");
 
-    read_lock(&gt->lock);
+    percpu_read_lock(grant_rwlock, &gt->lock);
 
     for ( ref = 0; ref != nr_grant_entries(gt); ref++ )
     {
@@ -3475,7 +3473,7 @@ static void gnttab_usage_print(struct domain *rd)
         active_entry_release(act);
     }
 
-    read_unlock(&gt->lock);
+    percpu_read_unlock(grant_rwlock, &gt->lock);
 
     if ( first )
         printk("grant-table for remote domain:%5d ... "
diff --git a/xen/include/xen/grant_table.h b/xen/include/xen/grant_table.h
index 1c29cee..59d2819 100644
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -51,13 +51,15 @@
 /* The maximum size of a grant table. */
 extern unsigned int max_grant_frames;
 
+DECLARE_PERCPU_RWLOCK_GLOBAL(grant_rwlock);
+
 /* Per-domain grant information. */
 struct grant_table {
     /*
      * Lock protecting updates to grant table state (version, active
      * entry list, etc.)
      */
-    rwlock_t              lock;
+    percpu_rwlock_t       lock;
     /* Table size. Number of frames shared with guest */
     unsigned int          nr_grant_frames;
     /* Shared grant table (see include/public/grant_table.h). */
-- 
1.7.12.4


* [PATCHv2 3/3] p2m: convert p2m rwlock to percpu rwlock
  2015-11-20 16:03 [PATCHv2 0/3] Implement per-cpu reader-writer locks Malcolm Crossley
  2015-11-20 16:03 ` [PATCHv2 1/3] rwlock: Add " Malcolm Crossley
  2015-11-20 16:03 ` [PATCHv2 2/3] grant_table: convert grant table rwlock to percpu rwlock Malcolm Crossley
@ 2015-11-20 16:03 ` Malcolm Crossley
  2015-11-25 12:00   ` George Dunlap
  2015-11-25 12:38   ` Jan Beulich
  2015-11-24 18:16 ` [PATCHv2 0/3] Implement per-cpu reader-writer locks George Dunlap
  3 siblings, 2 replies; 19+ messages in thread
From: Malcolm Crossley @ 2015-11-20 16:03 UTC (permalink / raw)
  To: malcolm.crossley, JBeulich, ian.campbell, andrew.cooper3,
	Marcos.Matsunaga, keir, konrad.wilk, george.dunlap
  Cc: xen-devel, stefano.stabellini

The per domain p2m read lock suffers from significant contention when
performing multi-queue block or network IO, due to the parallel grant
map/unmap/copy operations occurring on the DomU's p2m.

On multi-socket systems, the contention results in the locked
compare-and-swap operation failing frequently, which results in a tight loop
of retries of the compare-and-swap operation. As the coherency fabric can
only support a specific rate of compare-and-swap operations for a particular
data location, taking the read lock itself becomes a bottleneck for p2m
operations.

Percpu rwlock p2m performance with the same configuration is approximately
64 gbit/s vs the 48 gbit/s with grant table percpu rwlocks only.

Oprofile was used to determine the initial overhead of the read-write locks
and to confirm the overhead was dramatically reduced by the percpu rwlocks.

Note: altp2m users will not achieve a gain if they take an altp2m read lock
simultaneously with the main p2m lock.

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
---
 xen/arch/x86/mm/mm-locks.h | 12 +++++++-----
 xen/arch/x86/mm/p2m.c      |  1 +
 xen/include/asm-x86/mm.h   |  2 +-
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/mm/mm-locks.h b/xen/arch/x86/mm/mm-locks.h
index 76c7217..6eb5021 100644
--- a/xen/arch/x86/mm/mm-locks.h
+++ b/xen/arch/x86/mm/mm-locks.h
@@ -31,6 +31,8 @@
 DECLARE_PER_CPU(int, mm_lock_level);
 #define __get_lock_level()  (this_cpu(mm_lock_level))
 
+DECLARE_PERCPU_RWLOCK_GLOBAL(p2m_percpu_rwlock);
+
 static inline void mm_lock_init(mm_lock_t *l)
 {
     spin_lock_init(&l->lock);
@@ -99,7 +101,7 @@ static inline void _mm_enforce_order_lock_post(int level, int *unlock_level,
 
 static inline void mm_rwlock_init(mm_rwlock_t *l)
 {
-    rwlock_init(&l->lock);
+    percpu_rwlock_resource_init(&l->lock);
     l->locker = -1;
     l->locker_function = "nobody";
     l->unlock_level = 0;
@@ -115,7 +117,7 @@ static inline void _mm_write_lock(mm_rwlock_t *l, const char *func, int level)
     if ( !mm_write_locked_by_me(l) )
     {
         __check_lock_level(level);
-        write_lock(&l->lock);
+	percpu_write_lock(p2m_percpu_rwlock, &l->lock);
         l->locker = get_processor_id();
         l->locker_function = func;
         l->unlock_level = __get_lock_level();
@@ -131,20 +133,20 @@ static inline void mm_write_unlock(mm_rwlock_t *l)
     l->locker = -1;
     l->locker_function = "nobody";
     __set_lock_level(l->unlock_level);
-    write_unlock(&l->lock);
+    percpu_write_unlock(p2m_percpu_rwlock, &l->lock);
 }
 
 static inline void _mm_read_lock(mm_rwlock_t *l, int level)
 {
     __check_lock_level(level);
-    read_lock(&l->lock);
+    percpu_read_lock(p2m_percpu_rwlock, &l->lock);
     /* There's nowhere to store the per-CPU unlock level so we can't
      * set the lock level. */
 }
 
 static inline void mm_read_unlock(mm_rwlock_t *l)
 {
-    read_unlock(&l->lock);
+    percpu_read_unlock(p2m_percpu_rwlock, &l->lock);
 }
 
 /* This wrapper uses the line number to express the locking order below */
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index ed0bbd7..a45ee35 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -54,6 +54,7 @@ boolean_param("hap_2mb", opt_hap_2mb);
 #undef page_to_mfn
 #define page_to_mfn(_pg) _mfn(__page_to_mfn(_pg))
 
+DEFINE_PERCPU_RWLOCK_GLOBAL(p2m_percpu_rwlock);
 
 /* Init the datastructures for later use by the p2m code */
 static int p2m_initialise(struct domain *d, struct p2m_domain *p2m)
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 67b34c6..78511dd 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -566,7 +566,7 @@ typedef struct mm_lock {
 } mm_lock_t;
 
 typedef struct mm_rwlock {
-    rwlock_t           lock;
+    percpu_rwlock_t    lock;
     int                unlock_level;
     int                recurse_count;
     int                locker; /* CPU that holds the write lock */
-- 
1.7.12.4


* Re: [PATCHv2 0/3] Implement per-cpu reader-writer locks
  2015-11-20 16:03 [PATCHv2 0/3] Implement per-cpu reader-writer locks Malcolm Crossley
                   ` (2 preceding siblings ...)
  2015-11-20 16:03 ` [PATCHv2 3/3] p2m: " Malcolm Crossley
@ 2015-11-24 18:16 ` George Dunlap
  2015-11-24 18:30   ` George Dunlap
  2015-11-24 18:32   ` Andrew Cooper
  3 siblings, 2 replies; 19+ messages in thread
From: George Dunlap @ 2015-11-24 18:16 UTC (permalink / raw)
  To: Malcolm Crossley, JBeulich, ian.campbell, andrew.cooper3,
	Marcos.Matsunaga, keir, konrad.wilk, george.dunlap
  Cc: xen-devel, stefano.stabellini

On 20/11/15 16:03, Malcolm Crossley wrote:
> This patch series adds per-cpu reader-writer locks as a generic lock
> implementation and then converts the grant table and p2m rwlocks to
> use the percpu rwlocks, in order to improve multi-socket host performance.
> 
> CPU profiling has revealed the rwlocks themselves suffer from severe cache
> line bouncing due to the cmpxchg operation used even when taking a read lock.
> Multiqueue paravirtualised I/O results in heavy contention of the grant table
> and p2m read locks of a specific domain and so I/O throughput is bottlenecked
> by the overhead of the cache line bouncing itself.
> 
> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
> area to record a CPU has taken the read lock. Correctness is enforced for the 
> write lock by using a per lock barrier which forces the per-cpu read lock 
> to revert to using a standard read lock. The write lock then polls all 
> the percpu data area until active readers for the lock have exited.
> 
> Removing the cache line bouncing on a multi-socket Haswell-EP system 
> dramatically improves performance, with 16 vCPU network IO performance going 
> from 15 gb/s to 64 gb/s! The host under test was fully utilising all 40 
> logical CPU's at 64 gb/s, so a bigger logical CPU host may see an even better
> IO improvement.

Impressive -- thanks for doing this work.

One question: Your description here sounds like you've tested with a
single large domain, but what happens with multiple domains?

It looks like the "per-cpu-rwlock" is shared by *all* locks of a
particular type (e.g., all domains share the per-cpu p2m rwlock).
(Correct me if I'm wrong here.)

Which means two things:

 1) *Any* writer will have to wait for the rwlock of that type to be
released on *all* domains before being able to write.  Is there any risk
that on a busy system, this will be an unusually long wait?

2) *All* domains will have to take the slow path for reading when *any*
domain has or is trying to acquire the write lock.  What is the probability
that on a busy system this turns out to be "most of the time"?

#2 is of course no worse than it is now, but #1 could be a bit of a bear.

 -George


* Re: [PATCHv2 0/3] Implement per-cpu reader-writer locks
  2015-11-24 18:16 ` [PATCHv2 0/3] Implement per-cpu reader-writer locks George Dunlap
@ 2015-11-24 18:30   ` George Dunlap
  2015-11-25  8:58     ` Malcolm Crossley
  2015-11-24 18:32   ` Andrew Cooper
  1 sibling, 1 reply; 19+ messages in thread
From: George Dunlap @ 2015-11-24 18:30 UTC (permalink / raw)
  To: Malcolm Crossley, JBeulich, ian.campbell, andrew.cooper3,
	Marcos.Matsunaga, keir, konrad.wilk, george.dunlap
  Cc: xen-devel, stefano.stabellini

On 24/11/15 18:16, George Dunlap wrote:
> On 20/11/15 16:03, Malcolm Crossley wrote:
>> This patch series adds per-cpu reader-writer locks as a generic lock
>> implementation and then converts the grant table and p2m rwlocks to
>> use the percpu rwlocks, in order to improve multi-socket host performance.
>>
>> CPU profiling has revealed the rwlocks themselves suffer from severe cache
>> line bouncing due to the cmpxchg operation used even when taking a read lock.
>> Multiqueue paravirtualised I/O results in heavy contention of the grant table
>> and p2m read locks of a specific domain and so I/O throughput is bottlenecked
>> by the overhead of the cache line bouncing itself.
>>
>> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
>> area to record a CPU has taken the read lock. Correctness is enforced for the 
>> write lock by using a per lock barrier which forces the per-cpu read lock 
>> to revert to using a standard read lock. The write lock then polls all 
>> the percpu data area until active readers for the lock have exited.
>>
>> Removing the cache line bouncing on a multi-socket Haswell-EP system 
>> dramatically improves performance, with 16 vCPU network IO performance going 
>> from 15 gb/s to 64 gb/s! The host under test was fully utilising all 40 
>> logical CPU's at 64 gb/s, so a bigger logical CPU host may see an even better
>> IO improvement.
> 
> Impressive -- thanks for doing this work.
> 
> One question: Your description here sounds like you've tested with a
> single large domain, but what happens with multiple domains?
> 
> It looks like the "per-cpu-rwlock" is shared by *all* locks of a
> particular type (e.g., all domains share the per-cpu p2m rwlock).
> (Correct me if I'm wrong here.)

Sorry, looking in more detail at the code, it seems I am wrong.  The
fast-path stores which "slow" lock has been grabbed in the per-cpu
variable; so the writer only needs to wait for readers that have grabbed
the particular lock it's interested in.  So the scenarios I outline
below shouldn't really be issues.

The description of the algorithm  in the changelog could do with a bit
more detail. :-)

 -George


* Re: [PATCHv2 0/3] Implement per-cpu reader-writer locks
  2015-11-24 18:16 ` [PATCHv2 0/3] Implement per-cpu reader-writer locks George Dunlap
  2015-11-24 18:30   ` George Dunlap
@ 2015-11-24 18:32   ` Andrew Cooper
  1 sibling, 0 replies; 19+ messages in thread
From: Andrew Cooper @ 2015-11-24 18:32 UTC (permalink / raw)
  To: George Dunlap, Malcolm Crossley, JBeulich, ian.campbell,
	Marcos.Matsunaga, keir, konrad.wilk, george.dunlap
  Cc: xen-devel, stefano.stabellini

On 24/11/15 18:16, George Dunlap wrote:
> On 20/11/15 16:03, Malcolm Crossley wrote:
>> This patch series adds per-cpu reader-writer locks as a generic lock
>> implementation and then converts the grant table and p2m rwlocks to
>> use the percpu rwlocks, in order to improve multi-socket host performance.
>>
>> CPU profiling has revealed the rwlocks themselves suffer from severe cache
>> line bouncing due to the cmpxchg operation used even when taking a read lock.
>> Multiqueue paravirtualised I/O results in heavy contention of the grant table
>> and p2m read locks of a specific domain and so I/O throughput is bottlenecked
>> by the overhead of the cache line bouncing itself.
>>
>> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
>> area to record a CPU has taken the read lock. Correctness is enforced for the 
>> write lock by using a per lock barrier which forces the per-cpu read lock 
>> to revert to using a standard read lock. The write lock then polls all 
>> the percpu data area until active readers for the lock have exited.
>>
>> Removing the cache line bouncing on a multi-socket Haswell-EP system 
>> dramatically improves performance, with 16 vCPU network IO performance going 
>> from 15 gb/s to 64 gb/s! The host under test was fully utilising all 40 
>> logical CPU's at 64 gb/s, so a bigger logical CPU host may see an even better
>> IO improvement.
> Impressive -- thanks for doing this work.
>
> One question: Your description here sounds like you've tested with a
> single large domain, but what happens with multiple domains?

The test was with two domU's, doing network traffic between them.

>
> It looks like the "per-cpu-rwlock" is shared by *all* locks of a
> particular type (e.g., all domains share the per-cpu p2m rwlock).
> (Correct me if I'm wrong here.)

The per-pcpu pointer is shared for all locks of a particular type, but
allows the individual lock to be distinguished by following the pointer
back.

Therefore, the locks are not actually shared.

>
> Which means two things:
>
>  1) *Any* writer will have to wait for the rwlock of that type to be
> released on *all* domains before being able to write.  Is there any risk
> that on a busy system, this will be an unusually long wait?

No.  The write-locker looks through all pcpus and ignores any whose
per-cpu pointer is not pointing at the intended percpu_rwlock.

From _percpu_write_lock():

/*
 * Remove any percpu readers not contending on this rwlock
 * from our check mask.
 */
if ( per_cpu_ptr(per_cpudata, cpu) != percpu_rwlock )
    cpumask_clear_cpu(cpu, &this_cpu(percpu_rwlock_readers));

As a result, two calls to _percpu_write_lock() against different
percpu_rwlock_t's will not interact.
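
i.e. (sketch only, with hypothetical domains d1/d2):

/* CPU0 */ percpu_write_lock(grant_rwlock, &d1->grant_table->lock);
/* CPU1 */ percpu_write_lock(grant_rwlock, &d2->grant_table->lock);

Each writer only waits for per-cpu readers whose pointer references its own
lock, so the two writers do not serialise against each other's readers.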

>
> 2) *All* domains will have to take the slow path for reading when a
> *any* domain has or is trying to acquire the write lock.  What is the
> probability that on a busy system that turns out to be "most of the time"?

_percpu_read_lock() will only take the slow path if:

1) The intended lock is (or is in the process of being) taken for writing
2) The callpath has nested _percpu_read_lock() of the same type of lock

By observation, 2) does not occur currently in the code.

~Andrew


* Re: [PATCHv2 0/3] Implement per-cpu reader-writer locks
  2015-11-24 18:30   ` George Dunlap
@ 2015-11-25  8:58     ` Malcolm Crossley
  2015-11-25  9:49       ` George Dunlap
  2015-11-26  9:48       ` Dario Faggioli
  0 siblings, 2 replies; 19+ messages in thread
From: Malcolm Crossley @ 2015-11-25  8:58 UTC (permalink / raw)
  To: George Dunlap, JBeulich, ian.campbell, andrew.cooper3,
	Marcos.Matsunaga, keir, konrad.wilk, george.dunlap
  Cc: xen-devel, stefano.stabellini

On 24/11/15 18:30, George Dunlap wrote:
> On 24/11/15 18:16, George Dunlap wrote:
>> On 20/11/15 16:03, Malcolm Crossley wrote:
>>> This patch series adds per-cpu reader-writer locks as a generic lock
>>> implementation and then converts the grant table and p2m rwlocks to
>>> use the percpu rwlocks, in order to improve multi-socket host performance.
>>>
>>> CPU profiling has revealed the rwlocks themselves suffer from severe cache
>>> line bouncing due to the cmpxchg operation used even when taking a read lock.
>>> Multiqueue paravirtualised I/O results in heavy contention of the grant table
>>> and p2m read locks of a specific domain and so I/O throughput is bottlenecked
>>> by the overhead of the cache line bouncing itself.
>>>
>>> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
>>> area to record a CPU has taken the read lock. Correctness is enforced for the 
>>> write lock by using a per lock barrier which forces the per-cpu read lock 
>>> to revert to using a standard read lock. The write lock then polls all 
>>> the percpu data area until active readers for the lock have exited.
>>>
>>> Removing the cache line bouncing on a multi-socket Haswell-EP system 
>>> dramatically improves performance, with 16 vCPU network IO performance going 
>>> from 15 gb/s to 64 gb/s! The host under test was fully utilising all 40 
>>> logical CPU's at 64 gb/s, so a bigger logical CPU host may see an even better
>>> IO improvement.
>>
>> Impressive -- thanks for doing this work.

Thanks, I think the key to isolating the problem was using profiling tools. The scale
of the overhead would not have been clear without them.

>>
>> One question: Your description here sounds like you've tested with a
>> single large domain, but what happens with multiple domains?
>>
>> It looks like the "per-cpu-rwlock" is shared by *all* locks of a
>> particular type (e.g., all domains share the per-cpu p2m rwlock).
>> (Correct me if I'm wrong here.)
> 
> Sorry, looking in more detail at the code, it seems I am wrong.  The
> fast-path stores which "slow" lock has been grabbed in the per-cpu
> variable; so the writer only needs to wait for readers that have grabbed
> the particular lock it's interested in.  So the scenarios I outline
> below shouldn't really be issues.
> 
> The description of the algorithm  in the changelog could do with a bit
> more detail. :-)

I'll enhance the description to say "per lock local variable" to make it clearer
that not all readers will be affected.

BTW, I added to the "To" list because I need your ACK for the patch to the p2m code.

Do you have any review comments for that patch?

Thanks

Malcolm

> 
>  -George
> 


* Re: [PATCHv2 0/3] Implement per-cpu reader-writer locks
  2015-11-25  8:58     ` Malcolm Crossley
@ 2015-11-25  9:49       ` George Dunlap
  2015-11-26  9:48       ` Dario Faggioli
  1 sibling, 0 replies; 19+ messages in thread
From: George Dunlap @ 2015-11-25  9:49 UTC (permalink / raw)
  To: Malcolm Crossley, JBeulich, ian.campbell, andrew.cooper3,
	Marcos.Matsunaga, keir, konrad.wilk, george.dunlap
  Cc: xen-devel, stefano.stabellini

On 25/11/15 08:58, Malcolm Crossley wrote:
> On 24/11/15 18:30, George Dunlap wrote:
>> On 24/11/15 18:16, George Dunlap wrote:
>>> On 20/11/15 16:03, Malcolm Crossley wrote:
>>>> This patch series adds per-cpu reader-writer locks as a generic lock
>>>> implementation and then converts the grant table and p2m rwlocks to
>>>> use the percpu rwlocks, in order to improve multi-socket host performance.
>>>>
>>>> CPU profiling has revealed the rwlocks themselves suffer from severe cache
>>>> line bouncing due to the cmpxchg operation used even when taking a read lock.
>>>> Multiqueue paravirtualised I/O results in heavy contention of the grant table
>>>> and p2m read locks of a specific domain and so I/O throughput is bottlenecked
>>>> by the overhead of the cache line bouncing itself.
>>>>
>>>> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
>>>> area to record a CPU has taken the read lock. Correctness is enforced for the 
>>>> write lock by using a per lock barrier which forces the per-cpu read lock 
>>>> to revert to using a standard read lock. The write lock then polls all 
>>>> the percpu data area until active readers for the lock have exited.
>>>>
>>>> Removing the cache line bouncing on a multi-socket Haswell-EP system 
>>>> dramatically improves performance, with 16 vCPU network IO performance going 
>>>> from 15 gb/s to 64 gb/s! The host under test was fully utilising all 40 
>>>> logical CPU's at 64 gb/s, so a bigger logical CPU host may see an even better
>>>> IO improvement.
>>>
>>> Impressive -- thanks for doing this work.
> 
> Thanks, I think the key to isolating the problem was using profiling tools. The scale
> of the overhead would not have been clear without them.
> 
>>>
>>> One question: Your description here sounds like you've tested with a
>>> single large domain, but what happens with multiple domains?
>>>
>>> It looks like the "per-cpu-rwlock" is shared by *all* locks of a
>>> particular type (e.g., all domains share the per-cpu p2m rwlock).
>>> (Correct me if I'm wrong here.)
>>
>> Sorry, looking in more detail at the code, it seems I am wrong.  The
>> fast-path stores which "slow" lock has been grabbed in the per-cpu
>> variable; so the writer only needs to wait for readers that have grabbed
>> the particular lock it's interested in.  So the scenarios I outline
>> below shouldn't really be issues.
>>
>> The description of the algorithm  in the changelog could do with a bit
>> more detail. :-)
> 
> I'll enhance the description to say "per lock local variable" to make it clearer
> that not all readers will be affected.
> 
> BTW, I added to the "To" list because I need your ACK for the patch to the p2m code.
> 
> Do you have any review comments for that patch?

Yes, I realize that, and I'll get to it. :-)

 -George


* Re: [PATCHv2 1/3] rwlock: Add per-cpu reader-writer locks
  2015-11-20 16:03 ` [PATCHv2 1/3] rwlock: Add " Malcolm Crossley
@ 2015-11-25 11:12   ` George Dunlap
  2015-11-26 12:17   ` Jan Beulich
  1 sibling, 0 replies; 19+ messages in thread
From: George Dunlap @ 2015-11-25 11:12 UTC (permalink / raw)
  To: Malcolm Crossley, JBeulich, ian.campbell, andrew.cooper3,
	Marcos.Matsunaga, keir, konrad.wilk, george.dunlap
  Cc: xen-devel, stefano.stabellini

On 20/11/15 16:03, Malcolm Crossley wrote:
> Per-cpu read-write locks allow for the fast path read case to have
> low overhead by only setting/clearing a per-cpu variable for using
> the read lock. The per-cpu read fast path also avoids locked
> compare swap operations which can be particularly slow on coherent
> multi-socket systems, particularly if there is heavy usage of the
> read lock itself.
> 
> The per-cpu reader-writer lock uses a local variable to control
> the read lock fast path. This allows a writer to disable the fast
> path and ensures the readers switch to using the underlying
> read-write lock implementation instead of the per-cpu variable.
> 
> Once the writer has taken the write lock and disabled the fast path,
> it must poll the per-cpu variable for all CPU's which have entered
> the critical section for the specific read-write lock the writer is
> attempting to take. This design allows for a single per-cpu variable
> to be used for read/write locks belonging to seperate data structures.
> If a two or more different per-cpu read lock(s) are taken
> simultaneously then the per-cpu data structure is not used and the
> implementation takes the read lock of the underlying read-write lock,
> this behaviour is equivalent to the slow path in terms of performance.
> The per-cpu rwlock is not recursion safe for taking the per-cpu
> read lock because there is no recursion count variable, this is
> functionally equivalent to standard spin locks.
> 
> Slow path readers which are unblocked, set the per-cpu variable and
> drop the read lock. This simplifies the implementation and allows
> for fairness in the underlying read-write lock to be taken
> advantage of.
> 
> There is more overhead on the per-cpu write lock path due to checking
> each CPUs fast path per-cpu variable but this overhead is likely be
> hidden by the required delay of waiting for readers to exit the
> critical section. The loop is optimised to only iterate over
> the per-cpu data of active readers of the rwlock. The cpumask_t for
> tracking the active readers is stored in a single per-cpu data
> location and thus the write lock is not pre-emption safe. Therefore
> the per-cpu write lock can only be used with interrupts disabled.
> 
> Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>

Code looks good to me.  Two comments.

First, it was difficult to review this because there wasn't a clear
description of the algorithm, nor even a comment describing how one
would actually use it.  It wasn't until I looked at the subsequent
patches that I got the basic idea.

A brief comment in the header file describing how one might actually go
about using this functionality would be useful.

Secondly, the way this is designed here opens up the possibility for a
new kind of bug.  In order to work correctly, all *callers* have to
match up the lock they're trying to grab with the correct per-cpu lock.
 You could imagine some point in the future, for example, breaking the
grant lock into two types; at which point there would be a risk that one
of the readers would accidentally use the wrong per-cpu lock, allowing a
writer to grab the lock while a reader is still in the critical section.

Would it make sense to have the per-cpu rwlock pointers be an array, and
then store the index of the array in the lock itself (set on
initialization)?  That would make the calls cleaner (only one argument
instead of two), and eliminate the class of error described above.
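
Roughly (purely hypothetical, just to illustrate the idea -- neither the
array nor the size constant below exist anywhere):

    typedef struct percpu_rwlock {
        rwlock_t     rwlock;
        bool_t       writer_activating;
        unsigned int percpu_idx;   /* set once, at initialisation time */
    } percpu_rwlock_t;

    /* A single fixed per-cpu array shared by all percpu rwlock users. */
    DEFINE_PER_CPU(percpu_rwlock_t *, percpu_rwlock_owner[PERCPU_RWLOCK_TYPES]);

    /* Callers would then pass only the lock; the slot is implied by it: */
    percpu_read_lock(&t->lock);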

 -George

> --
> Changes since v1:
> - Removed restriction on taking two or more different percpu rw
> locks simultaneously
> - Moved fast-path/slow-path barrier to be per lock instead of global
> - Created separate percpu_rwlock_t type and added macros to
> initialise new type
> - Added helper macros for using the percpu rwlock itself
> - Moved active_readers cpumask off the stack and into a percpu
> variable
> ---
>  xen/common/spinlock.c        | 38 +++++++++++++++++++
>  xen/include/asm-arm/percpu.h |  5 +++
>  xen/include/asm-x86/percpu.h |  6 +++
>  xen/include/xen/percpu.h     |  4 ++
>  xen/include/xen/spinlock.h   | 87 ++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 140 insertions(+)
> 
> diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
> index 7f89694..a0f9df5 100644
> --- a/xen/common/spinlock.c
> +++ b/xen/common/spinlock.c
> @@ -10,6 +10,8 @@
>  #include <asm/processor.h>
>  #include <asm/atomic.h>
>  
> +DEFINE_PER_CPU(cpumask_t, percpu_rwlock_readers);
> +
>  #ifndef NDEBUG
>  
>  static atomic_t spin_debug __read_mostly = ATOMIC_INIT(0);
> @@ -492,6 +494,42 @@ int _rw_is_write_locked(rwlock_t *lock)
>      return (lock->lock == RW_WRITE_FLAG); /* writer in critical section? */
>  }
>  
> +void _percpu_write_lock(percpu_rwlock_t **per_cpudata,
> +	        percpu_rwlock_t *percpu_rwlock)
> +{
> +    unsigned int cpu;
> +
> +    /* First take the write lock to protect against other writers. */
> +    write_lock(&percpu_rwlock->rwlock);
> +
> +    /* Now set the global variable so that readers start using read_lock. */
> +    percpu_rwlock->writer_activating = 1;
> +    smp_mb();
> +
> +    /* Using a per cpu cpumask is only safe if there is no nesting. */
> +    ASSERT(!in_irq());
> +    this_cpu(percpu_rwlock_readers) = cpu_online_map;
> +
> +    /* Check if there are any percpu readers in progress on this rwlock. */
> +    for ( ; ; )
> +    {
> +        for_each_cpu(cpu, &this_cpu(percpu_rwlock_readers))
> +        {
> +            /* 
> +	     * Remove any percpu readers not contending on this rwlock
> +             * from our check mask.
> +	     */
> +            if ( per_cpu_ptr(per_cpudata, cpu) != percpu_rwlock )
> +                cpumask_clear_cpu(cpu, &this_cpu(percpu_rwlock_readers));
> +        }
> +        /* Check if we've cleared all percpu readers from check mask. */
> +        if ( cpumask_empty(&this_cpu(percpu_rwlock_readers)) )
> +            break;
> +        /* Give the coherency fabric a break. */
> +        cpu_relax();
> +    };
> +}
> +
>  #ifdef LOCK_PROFILE
>  
>  struct lock_profile_anc {
> diff --git a/xen/include/asm-arm/percpu.h b/xen/include/asm-arm/percpu.h
> index 71e7649..c308a56 100644
> --- a/xen/include/asm-arm/percpu.h
> +++ b/xen/include/asm-arm/percpu.h
> @@ -27,6 +27,11 @@ void percpu_init_areas(void);
>  #define __get_cpu_var(var) \
>      (*RELOC_HIDE(&per_cpu__##var, READ_SYSREG(TPIDR_EL2)))
>  
> +#define per_cpu_ptr(var, cpu)  \
> +    (*RELOC_HIDE(&var, __per_cpu_offset[cpu]))
> +#define __get_cpu_ptr(var) \
> +    (*RELOC_HIDE(&var, READ_SYSREG(TPIDR_EL2)))
> +
>  #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
>  
>  DECLARE_PER_CPU(unsigned int, cpu_id);
> diff --git a/xen/include/asm-x86/percpu.h b/xen/include/asm-x86/percpu.h
> index 604ff0d..51562b9 100644
> --- a/xen/include/asm-x86/percpu.h
> +++ b/xen/include/asm-x86/percpu.h
> @@ -20,4 +20,10 @@ void percpu_init_areas(void);
>  
>  #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
>  
> +#define __get_cpu_ptr(var) \
> +    (*RELOC_HIDE(var, get_cpu_info()->per_cpu_offset))
> +
> +#define per_cpu_ptr(var, cpu)  \
> +    (*RELOC_HIDE(var, __per_cpu_offset[cpu]))
> +
>  #endif /* __X86_PERCPU_H__ */
> diff --git a/xen/include/xen/percpu.h b/xen/include/xen/percpu.h
> index abe0b11..c896863 100644
> --- a/xen/include/xen/percpu.h
> +++ b/xen/include/xen/percpu.h
> @@ -16,6 +16,10 @@
>  /* Preferred on Xen. Also see arch-defined per_cpu(). */
>  #define this_cpu(var)    __get_cpu_var(var)
>  
> +#define this_cpu_ptr(ptr)    __get_cpu_ptr(ptr)
> +
> +#define get_per_cpu_var(var)  (per_cpu__##var)
> +
>  /* Linux compatibility. */
>  #define get_cpu_var(var) this_cpu(var)
>  #define put_cpu_var(var)
> diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h
> index fb0438e..a5a25b6 100644
> --- a/xen/include/xen/spinlock.h
> +++ b/xen/include/xen/spinlock.h
> @@ -3,6 +3,7 @@
>  
>  #include <asm/system.h>
>  #include <asm/spinlock.h>
> +#include <asm/types.h>
>  
>  #ifndef NDEBUG
>  struct lock_debug {
> @@ -261,4 +262,90 @@ int _rw_is_write_locked(rwlock_t *lock);
>  #define rw_is_locked(l)               _rw_is_locked(l)
>  #define rw_is_write_locked(l)         _rw_is_write_locked(l)
>  
> +struct percpu_rwlock {
> +	rwlock_t rwlock;
> +	bool_t   writer_activating;
> +};
> +
> +typedef struct percpu_rwlock percpu_rwlock_t;
> +
> +#define PERCPU_RW_LOCK_UNLOCKED { RW_LOCK_UNLOCKED, 0 }
> +#define DEFINE_PERCPU_RWLOCK_RESOURCE(l) percpu_rwlock_t l = PERCPU_RW_LOCK_UNLOCKED
> +#define percpu_rwlock_resource_init(l) (*(l) = (percpu_rwlock_t)PERCPU_RW_LOCK_UNLOCKED)
> +
> +static inline void _percpu_read_lock(percpu_rwlock_t **per_cpudata,
> +                                         percpu_rwlock_t *percpu_rwlock)
> +{
> +    /* We cannot support recursion on the same lock. */
> +    ASSERT(this_cpu_ptr(per_cpudata) != percpu_rwlock);
> +    /* 
> +     * Detect using a second percpu_rwlock_t simultaneously and fall back
> +     * to standard read_lock.
> +     */
> +    if ( unlikely(this_cpu_ptr(per_cpudata) != NULL ) )
> +    {
> +        read_lock(&percpu_rwlock->rwlock);
> +        return;
> +    }
> +
> +    /* Indicate this cpu is reading. */
> +    this_cpu_ptr(per_cpudata) = percpu_rwlock;
> +    smp_mb();
> +    /* Check if a writer is waiting. */
> +    if ( unlikely(percpu_rwlock->writer_activating) )
> +    {
> +        /* Let the waiting writer know we aren't holding the lock. */
> +        this_cpu_ptr(per_cpudata) = NULL;
> +        /* Wait using the read lock to keep the lock fair. */
> +        read_lock(&percpu_rwlock->rwlock);
> +        /* Set the per CPU data again and continue. */
> +        this_cpu_ptr(per_cpudata) = percpu_rwlock;
> +        /* Drop the read lock because we don't need it anymore. */
> +        read_unlock(&percpu_rwlock->rwlock);
> +    }
> +}
> +
> +static inline void _percpu_read_unlock(percpu_rwlock_t **per_cpudata,
> +                percpu_rwlock_t *percpu_rwlock)
> +{
> +    ASSERT(this_cpu_ptr(per_cpudata) != NULL);
> +    /* 
> +     * Detect using a second percpu_rwlock_t simultaneously and fall back
> +     * to standard read_unlock.
> +     */
> +    if ( unlikely(this_cpu_ptr(per_cpudata) != percpu_rwlock ) )
> +    {
> +        read_unlock(&percpu_rwlock->rwlock);
> +        return;
> +    }
> +    this_cpu_ptr(per_cpudata) = NULL;
> +    smp_wmb();
> +}
> +
> +/* Don't inline percpu write lock as it's a complex function. */
> +void _percpu_write_lock(percpu_rwlock_t **per_cpudata,
> +	        percpu_rwlock_t *percpu_rwlock);
> +
> +static inline void _percpu_write_unlock(percpu_rwlock_t *percpu_rwlock)
> +{
> +    ASSERT(percpu_rwlock->writer_activating);
> +    percpu_rwlock->writer_activating = 0;
> +    write_unlock(&percpu_rwlock->rwlock);
> +}
> +
> +#define percpu_rw_is_write_locked(l)         _rw_is_write_locked(&((l)->rwlock))
> +
> +#define percpu_read_lock(percpu, lock) ( _percpu_read_lock( \
> +                                        &get_per_cpu_var(percpu), lock ) )
> +#define percpu_read_unlock(percpu, lock) ( _percpu_read_unlock( \
> +                                        &get_per_cpu_var(percpu), lock ) )
> +#define percpu_write_lock(percpu, lock) ( _percpu_write_lock( \
> +                                        &get_per_cpu_var(percpu), lock ) )
> +#define percpu_write_unlock(percpu, lock) ( _percpu_write_unlock( lock ) )
> +
> +#define DEFINE_PERCPU_RWLOCK_GLOBAL(name) DEFINE_PER_CPU(percpu_rwlock_t *, \
> +                                                         name)
> +#define DECLARE_PERCPU_RWLOCK_GLOBAL(name) DECLARE_PER_CPU(percpu_rwlock_t *, \
> +                                                         name)
> +
>  #endif /* __SPINLOCK_H__ */
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv2 3/3] p2m: convert grant table rwlock to percpu rwlock
  2015-11-20 16:03 ` [PATCHv2 3/3] p2m: " Malcolm Crossley
@ 2015-11-25 12:00   ` George Dunlap
  2015-11-25 12:38   ` Jan Beulich
  1 sibling, 0 replies; 19+ messages in thread
From: George Dunlap @ 2015-11-25 12:00 UTC (permalink / raw)
  To: Malcolm Crossley, JBeulich, ian.campbell, andrew.cooper3,
	Marcos.Matsunaga, keir, konrad.wilk, george.dunlap
  Cc: xen-devel, stefano.stabellini

On 20/11/15 16:03, Malcolm Crossley wrote:
> The per-domain p2m read lock suffers from significant contention when
> performing multi-queue block or network IO due to the parallel
> grant map/unmap/copy operations occurring on the DomU's p2m.
> 
> On multi-socket systems, the contention results in the locked compare-and-swap
> operation failing frequently, which leads to a tight loop of retries of the
> compare-and-swap operation. As the coherency fabric can only support a specific
> rate of compare-and-swap operations for a particular data location, taking
> the read lock itself becomes a bottleneck for p2m operations.
> 
> Percpu rwlock p2m performance with the same configuration is approximately
> 64 gbit/s vs the 48 gbit/s with grant table percpu rwlocks only.
> 
> Oprofile was used to determine the initial overhead of the read-write locks
> and to confirm the overhead was dramatically reduced by the percpu rwlocks.
> 
> Note: altp2m users will not achieve a gain if they take an altp2m read lock
> simultaneously with the main p2m lock.
> 
> Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>

Looks good to me.  I'll save the reviewed-by until the question about
the calling convention has been discussed.

 -George

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv2 2/3] grant_table: convert grant table rwlock to percpu rwlock
  2015-11-20 16:03 ` [PATCHv2 2/3] grant_table: convert grant table rwlock to percpu rwlock Malcolm Crossley
@ 2015-11-25 12:35   ` Jan Beulich
  2015-11-25 13:43     ` Malcolm Crossley
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2015-11-25 12:35 UTC (permalink / raw)
  To: malcolm.crossley
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3,
	Marcos.Matsunaga, stefano.stabellini, xen-devel

>>> On 20.11.15 at 17:03, <malcolm.crossley@citrix.com> wrote:
> @@ -208,8 +210,6 @@ active_entry_acquire(struct grant_table *t, grant_ref_t e)
>  {
>      struct active_grant_entry *act;
>  
> -    ASSERT(rw_is_locked(&t->lock));

Even if not covering all cases, I don't think this should be dropped,
just like you don't drop rw_is_write_locked() asserts. Would properly
replacing this be rather expensive?

> @@ -3180,7 +3178,7 @@ grant_table_create(
>          goto no_mem_0;
>  
>      /* Simple stuff. */
> -    rwlock_init(&t->lock);
> +    percpu_rwlock_resource_init(&t->lock);

Considering George's general comment (on patch 1), would it perhaps
make sense to store (in debug builds) the per-CPU variable's offset
in the lock structure, having percpu_{read,write}_lock() verify it? Or
at the very least have gt_{read,write}_lock() wrappers, thus
avoiding the need to explicitly pass grant_rwlock at each use site? I
certainly agree with George that we should make it as hard as
possible for someone to get using these constructs wrong.
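
I.e. something as simple as (just a sketch of what I mean):

static inline void gt_read_lock(struct grant_table *gt)
{
    percpu_read_lock(grant_rwlock, &gt->lock);
}

static inline void gt_read_unlock(struct grant_table *gt)
{
    percpu_read_unlock(grant_rwlock, &gt->lock);
}

static inline void gt_write_lock(struct grant_table *gt)
{
    percpu_write_lock(grant_rwlock, &gt->lock);
}

static inline void gt_write_unlock(struct grant_table *gt)
{
    percpu_write_unlock(grant_rwlock, &gt->lock);
}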

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv2 3/3] p2m: convert grant table rwlock to percpu rwlock
  2015-11-20 16:03 ` [PATCHv2 3/3] p2m: " Malcolm Crossley
  2015-11-25 12:00   ` George Dunlap
@ 2015-11-25 12:38   ` Jan Beulich
  2015-11-25 12:54     ` Malcolm Crossley
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2015-11-25 12:38 UTC (permalink / raw)
  To: malcolm.crossley
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3,
	Marcos.Matsunaga, stefano.stabellini, xen-devel

>>> On 20.11.15 at 17:03, <malcolm.crossley@citrix.com> wrote:
> @@ -115,7 +117,7 @@ static inline void _mm_write_lock(mm_rwlock_t *l, const char *func, int level)
>      if ( !mm_write_locked_by_me(l) )
>      {
>          __check_lock_level(level);
> -        write_lock(&l->lock);
> +	percpu_write_lock(p2m_percpu_rwlock, &l->lock);

A hard tab slipped in here.

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv2 3/3] p2m: convert grant table rwlock to percpu rwlock
  2015-11-25 12:38   ` Jan Beulich
@ 2015-11-25 12:54     ` Malcolm Crossley
  0 siblings, 0 replies; 19+ messages in thread
From: Malcolm Crossley @ 2015-11-25 12:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3,
	Marcos.Matsunaga, stefano.stabellini, xen-devel

On 25/11/15 12:38, Jan Beulich wrote:
>>>> On 20.11.15 at 17:03, <malcolm.crossley@citrix.com> wrote:
>> @@ -115,7 +117,7 @@ static inline void _mm_write_lock(mm_rwlock_t *l, const char *func, int level)
>>      if ( !mm_write_locked_by_me(l) )
>>      {
>>          __check_lock_level(level);
>> -        write_lock(&l->lock);
>> +	percpu_write_lock(p2m_percpu_rwlock, &l->lock);
> 
> A hard tab slipped in here.

I'll fix that up; I've also modified my editor setup so this (hopefully) doesn't happen again.

Thanks

Malcolm

> Jan
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv2 2/3] grant_table: convert grant table rwlock to percpu rwlock
  2015-11-25 12:35   ` Jan Beulich
@ 2015-11-25 13:43     ` Malcolm Crossley
  2015-11-25 13:53       ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Malcolm Crossley @ 2015-11-25 13:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3,
	Marcos.Matsunaga, stefano.stabellini, xen-devel

On 25/11/15 12:35, Jan Beulich wrote:
>>>> On 20.11.15 at 17:03, <malcolm.crossley@citrix.com> wrote:
>> @@ -208,8 +210,6 @@ active_entry_acquire(struct grant_table *t, grant_ref_t e)
>>  {
>>      struct active_grant_entry *act;
>>  
>> -    ASSERT(rw_is_locked(&t->lock));
> 
> Even if not covering all cases, I don't think this should be dropped,
> just like you don't drop rw_is_write_locked() asserts. Would properly
> replacing this be rather expensive?

I could add a percpu_rw_is_locked function, but it would be racy because I can't
set the local lock barrier variable (setting/clearing that would race with real
users of the write lock). Therefore I would be left polling the percpu readers,
which is not reliable without the local barrier variable. Should I still add
the function if it won't reliably detect the locked condition?

> 
>> @@ -3180,7 +3178,7 @@ grant_table_create(
>>          goto no_mem_0;
>>  
>>      /* Simple stuff. */
>> -    rwlock_init(&t->lock);
>> +    percpu_rwlock_resource_init(&t->lock);
> 
> Considering George's general comment (on patch 1), would it perhaps
> make sense to store (in debug builds) the per-CPU variable's offset
> in the lock structure, having percpu_{read,write}_lock() verify it? Or
> at the very least have gt_{read,write}_lock() wrappers, thus
> avoiding the need to explicitly pass grant_rwlock at each use site? I
> certainly agree with George that we should make it as hard as
> possible for someone to get using these constructs wrong.

I can add storing the "per-CPU variable" owner in the local lock structure for
debug builds and have the percpu_{read,write} verify it at runtime.

The downsides are that the size of the structures will vary between debug and non-debug
builds and that users may perhaps be confused about what the per-CPU variables are used for.

I'll add the wrappers for the grant table, but will this make it harder to determine
how the locking works?

Thanks for the comments

Malcolm
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv2 2/3] grant_table: convert grant table rwlock to percpu rwlock
  2015-11-25 13:43     ` Malcolm Crossley
@ 2015-11-25 13:53       ` Jan Beulich
  2015-11-25 14:11         ` Malcolm Crossley
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2015-11-25 13:53 UTC (permalink / raw)
  To: Malcolm Crossley
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3,
	Marcos.Matsunaga, stefano.stabellini, xen-devel

>>> On 25.11.15 at 14:43, <malcolm.crossley@citrix.com> wrote:
> On 25/11/15 12:35, Jan Beulich wrote:
>>>>> On 20.11.15 at 17:03, <malcolm.crossley@citrix.com> wrote:
>>> @@ -208,8 +210,6 @@ active_entry_acquire(struct grant_table *t, grant_ref_t 
> e)
>>>  {
>>>      struct active_grant_entry *act;
>>>  
>>> -    ASSERT(rw_is_locked(&t->lock));
>> 
>> Even if not covering all cases, I don't think this should be dropped,
>> just like you don't drop rw_is_write_locked() asserts. Would properly
>> replacing this be rather expensive?
> 
> I could add a percpu_rw_is_locked function but it will be racy because I 
> can't
> set the local lock barrier variable ( set/clearing that would race with real
> users of the write lock ) Therefore I would be left polling the percpu 
> readers
> which is not reliable without the local barrier variable. Should I still add
> the function if it won't reliably detect the locked condition?

Iiuc that would then risk the ASSERT() producing false positives,
which clearly is a no-go. In which case I'd still like you to leave in
some sort of comment (perhaps just the current code converted
to a comment) to document the requirement.
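
E.g. simply converting the current code into a comment:

/*
 * The caller is required to hold t->lock (read or write); this cannot
 * cheaply be ASSERT()ed with the percpu rwlock:
 * ASSERT(rw_is_locked(&t->lock));
 */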

>>> @@ -3180,7 +3178,7 @@ grant_table_create(
>>>          goto no_mem_0;
>>>  
>>>      /* Simple stuff. */
>>> -    rwlock_init(&t->lock);
>>> +    percpu_rwlock_resource_init(&t->lock);
>> 
>> Considering George's general comment (on patch 1), would it perhaps
>> make sense to store (in debug builds) the per-CPU variable's offset
>> in the lock structure, having percpu_{read,write}_lock() verify it? Or
>> at the very least have gt_{read,write}_lock() wrappers, thus
>> avoiding the need to explicitly pass grant_rwlock at each use site? I
>> certainly agree with George that we should make it as hard as
>> possible for someone to get using these constructs wrong.
> 
> I can add storing the "per-CPU variable" owner in the local lock structure for
> debug builds and have the percpu_{read,write} verify it at runtime.
> 
> The downsides are that the size of the structures will vary between debug and non-debug
> builds and that users may perhaps be confused about what the per-CPU variables are
> used for.

I'm not sure what confusion might arise here.

> I'll add the wrappers for the grant table, but will this make it harder to
> determine how the locking works?

Why would it? At the call sites we don't really care about the
mechanism. But perhaps I'm not seeing what you're thinking of...

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv2 2/3] grant_table: convert grant table rwlock to percpu rwlock
  2015-11-25 13:53       ` Jan Beulich
@ 2015-11-25 14:11         ` Malcolm Crossley
  0 siblings, 0 replies; 19+ messages in thread
From: Malcolm Crossley @ 2015-11-25 14:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3,
	Marcos.Matsunaga, stefano.stabellini, xen-devel

On 25/11/15 13:53, Jan Beulich wrote:
>>>> On 25.11.15 at 14:43, <malcolm.crossley@citrix.com> wrote:
>> On 25/11/15 12:35, Jan Beulich wrote:
>>>>>> On 20.11.15 at 17:03, <malcolm.crossley@citrix.com> wrote:
>>>> @@ -208,8 +210,6 @@ active_entry_acquire(struct grant_table *t, grant_ref_t 
>> e)
>>>>  {
>>>>      struct active_grant_entry *act;
>>>>  
>>>> -    ASSERT(rw_is_locked(&t->lock));
>>>
>>> Even if not covering all cases, I don't think this should be dropped,
>>> just like you don't drop rw_is_write_locked() asserts. Would properly
>>> replacing this be rather expensive?
>>
>> I could add a percpu_rw_is_locked function but it will be racy because I 
>> can't
>> set the local lock barrier variable ( set/clearing that would race with real
>> users of the write lock ) Therefore I would be left polling the percpu 
>> readers
>> which is not reliable without the local barrier variable. Should I still add
>> the function if it won't reliably detect the locked condition?
> 
> Iiuc that would then risk the ASSERT() producing false positives,
> which clearly is a no-go. In which case I'd still like you to leave in
> some sort of comment (perhaps just the current code converted
> to a comment) to document the requirement.

After checking the grant table code itself, it looks like the lock is always
taken by the local CPU so we should not get ASSERT false positives.

A generic percpu_rw_is_locked function would still be racy, maybe I should

> 
>>>> @@ -3180,7 +3178,7 @@ grant_table_create(
>>>>          goto no_mem_0;
>>>>  
>>>>      /* Simple stuff. */
>>>> -    rwlock_init(&t->lock);
>>>> +    percpu_rwlock_resource_init(&t->lock);
>>>
>>> Considering George's general comment (on patch 1), would it perhaps
>>> make sense to store (in debug builds) the per-CPU variable's offset
>>> in the lock structure, having percpu_{read,write}_lock() verify it? Or
>>> at the very least have gt_{read,write}_lock() wrappers, thus
>>> avoiding the need to explicitly pass grant_rwlock at each use site? I
>>> certainly agree with George that we should make it as hard as
>>> possible for someone to get using these constructs wrong.
>>
>> I can add storing the "per-CPU variable" owner in the local lock structure for
>> debug builds and have the percpu_{read,write} verify it at runtime.
>>
>> The downsides are that the size of the structures will vary between debug and non-debug
>> builds and that users may perhaps be confused about what the per-CPU variables are
>> used for.
> 
> I'm not sure what confusion might arise here.

I was thinking there's a chance users might believe the per-CPU data are all somehow
stored in the percpu_rwlock_t itself. It's not a problem and I will implement it as
described above.
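
Something along these lines is what I have in mind (a rough sketch only; the
field name and where exactly the owner gets recorded may well change):

struct percpu_rwlock {
    rwlock_t rwlock;
    bool_t   writer_activating;
#ifndef NDEBUG
    /* The per-CPU variable this lock is paired with, i.e. the
     * &get_per_cpu_var(percpu) value the helpers are called with. */
    struct percpu_rwlock **percpu_owner;
#endif
};

with _percpu_read_lock()/_percpu_write_lock() then doing:

#ifndef NDEBUG
    ASSERT(percpu_rwlock->percpu_owner == per_cpudata);
#endif
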
> 
>> I'll add the wrappers for the grant table, but will this make it harder to
>> determine how the locking works?
> 
> Why would it? At the call sites we don't really care about the
> mechanism. But perhaps I'm not seeing what you're thinking of...

I was thinking along the lines that it might appear the lock is for all usages in the
grant table code, i.e. we don't want people using the wrappers for locking of the
maptrack table itself. Ideally we want separate percpu variables for each "type" of
resource so that we don't get nested percpu locking of the different "types", which would
harm performance. Again, it's not a major problem, and I'm happy to add the wrappers.

Malcolm

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv2 0/3] Implement per-cpu reader-writer locks
  2015-11-25  8:58     ` Malcolm Crossley
  2015-11-25  9:49       ` George Dunlap
@ 2015-11-26  9:48       ` Dario Faggioli
  1 sibling, 0 replies; 19+ messages in thread
From: Dario Faggioli @ 2015-11-26  9:48 UTC (permalink / raw)
  To: Malcolm Crossley
  Cc: xen-devel, Andrew Cooper, Ian Jackson, George Dunlap, LarsKurth


[Cc-list edited]

On Wed, 2015-11-25 at 08:58 +0000, Malcolm Crossley wrote:
> On 24/11/15 18:30, George Dunlap wrote:
> > On 24/11/15 18:16, George Dunlap wrote:
> > > On 20/11/15 16:03, Malcolm Crossley wrote:
> > > > 
> > > > Removing the cache line bouncing on a multi-socket Haswell-EP
> > > > system 
> > > > dramatically improves performance, with 16 vCPU network IO
> > > > performance going 
> > > > from 15 gb/s to 64 gb/s! The host under test was fully
> > > > utilising all 40 
> > > > logical CPU's at 64 gb/s, so a bigger logical CPU host may see
> > > > an even better
> > > > IO improvement.
> > > 
> > > Impressive -- thanks for doing this work.
> 
> Thanks, I think the key to isolating the problem was using profiling
> tools. The scale
> of the overhead would not have been clear without them.
> 
As an aside, if it's not too much work, a few hints and instructions on
how such profiling was done would be helpful for others and for
the future.

For example, a post on The Xen Project's blog about that (which could then
be turned into a wiki page) would be awesome. :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv2 1/3] rwlock: Add per-cpu reader-writer locks
  2015-11-20 16:03 ` [PATCHv2 1/3] rwlock: Add " Malcolm Crossley
  2015-11-25 11:12   ` George Dunlap
@ 2015-11-26 12:17   ` Jan Beulich
  1 sibling, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2015-11-26 12:17 UTC (permalink / raw)
  To: malcolm.crossley
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3,
	Marcos.Matsunaga, stefano.stabellini, xen-devel

>>> On 20.11.15 at 17:03, <malcolm.crossley@citrix.com> wrote:
> --- a/xen/common/spinlock.c
> +++ b/xen/common/spinlock.c
> @@ -10,6 +10,8 @@
>  #include <asm/processor.h>
>  #include <asm/atomic.h>
>  
> +DEFINE_PER_CPU(cpumask_t, percpu_rwlock_readers);

static (and perhaps even local to _percpu_write_lock()).

> @@ -492,6 +494,42 @@ int _rw_is_write_locked(rwlock_t *lock)
>      return (lock->lock == RW_WRITE_FLAG); /* writer in critical section? */
>  }
>  
> +void _percpu_write_lock(percpu_rwlock_t **per_cpudata,
> +	        percpu_rwlock_t *percpu_rwlock)
> +{
> +    unsigned int cpu;
> +
> +    /* First take the write lock to protect against other writers. */

... or slow path readers.

> +    write_lock(&percpu_rwlock->rwlock);
> +
> +    /* Now set the global variable so that readers start using read_lock. */
> +    percpu_rwlock->writer_activating = 1;
> +    smp_mb();
> +
> +    /* Using a per cpu cpumask is only safe if there is no nesting. */
> +    ASSERT(!in_irq());
> +    this_cpu(percpu_rwlock_readers) = cpu_online_map;

cpumask_copy() please, to avoid copying perhaps half a kb on a
pretty small system.
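
I.e. something like

    cpumask_copy(&this_cpu(percpu_rwlock_readers), &cpu_online_map);

(modulo the remark further down about using this_cpu() just once).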

> +    /* Check if there are any percpu readers in progress on this rwlock. */
> +    for ( ; ; )
> +    {
> +        for_each_cpu(cpu, &this_cpu(percpu_rwlock_readers))

No more than one this_cpu{,_ptr}() please within one function.

> +        {
> +            /* 
> +	     * Remove any percpu readers not contending on this rwlock
> +             * from our check mask.
> +	     */

Hard tabs.

> +            if ( per_cpu_ptr(per_cpudata, cpu) != percpu_rwlock )
> +                cpumask_clear_cpu(cpu, &this_cpu(percpu_rwlock_readers));

__cpumask_clear_cpu()

> +static inline void _percpu_read_lock(percpu_rwlock_t **per_cpudata,

const?

> +#define percpu_read_lock(percpu, lock) ( _percpu_read_lock( \
> +                                        &get_per_cpu_var(percpu), lock ) )
> +#define percpu_read_unlock(percpu, lock) ( _percpu_read_unlock( \
> +                                        &get_per_cpu_var(percpu), lock ) )
> +#define percpu_write_lock(percpu, lock) ( _percpu_write_lock( \
> +                                        &get_per_cpu_var(percpu), lock ) )
> +#define percpu_write_unlock(percpu, lock) ( _percpu_write_unlock( lock ) )

Perhaps easier to read as

#define percpu_read_lock(percpu, lock) \
    _percpu_read_lock(&get_per_cpu_var(percpu), lock)
#define percpu_read_unlock(percpu, lock) \
    _percpu_read_unlock(&get_per_cpu_var(percpu), lock)
#define percpu_write_lock(percpu, lock) \
    _percpu_write_lock(&get_per_cpu_var(percpu), lock)
#define percpu_write_unlock(percpu, lock) _percpu_write_unlock(lock)

But at the very least the stray blanks and parentheses should be
dropped.

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2015-11-26 12:17 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-20 16:03 [PATCHv2 0/3] Implement per-cpu reader-writer locks Malcolm Crossley
2015-11-20 16:03 ` [PATCHv2 1/3] rwlock: Add " Malcolm Crossley
2015-11-25 11:12   ` George Dunlap
2015-11-26 12:17   ` Jan Beulich
2015-11-20 16:03 ` [PATCHv2 2/3] grant_table: convert grant table rwlock to percpu rwlock Malcolm Crossley
2015-11-25 12:35   ` Jan Beulich
2015-11-25 13:43     ` Malcolm Crossley
2015-11-25 13:53       ` Jan Beulich
2015-11-25 14:11         ` Malcolm Crossley
2015-11-20 16:03 ` [PATCHv2 3/3] p2m: " Malcolm Crossley
2015-11-25 12:00   ` George Dunlap
2015-11-25 12:38   ` Jan Beulich
2015-11-25 12:54     ` Malcolm Crossley
2015-11-24 18:16 ` [PATCHv2 0/3] Implement per-cpu reader-writer locks George Dunlap
2015-11-24 18:30   ` George Dunlap
2015-11-25  8:58     ` Malcolm Crossley
2015-11-25  9:49       ` George Dunlap
2015-11-26  9:48       ` Dario Faggioli
2015-11-24 18:32   ` Andrew Cooper
