xen-devel.lists.xenproject.org archive mirror
* [PATCH RESEND] Performance regression due to XSA-336
@ 2021-03-24 21:05 Boris Ostrovsky
  2021-03-24 21:05 ` [PATCH RESEND] x86/vpt: Replace per-guest pt_migrate lock with per pt lock Boris Ostrovsky
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Boris Ostrovsky @ 2021-03-24 21:05 UTC (permalink / raw)
  To: xen-devel
  Cc: jbeulich, andrew.cooper3, roger.pau, wl, boris.ostrovsky,
	stephen.s.brennan


(Re-sending with Stephen added)


While running performance tests with recent XSA backports to our product we've
discovered a significant regression in TPCC performance. With a particular guest
kernel the numbers dropped by as much as 40%.

We've narrowed that down to the XSA-336 patch, specifically to the pt_migrate rwlock,
and even more specifically to this lock being taken in pt_update_irq().

We have quite a large guest (92 VCPUs) doing lots of VMEXITs, and the theory is
that the lock's cnts atomic is starting to cause lots of coherence traffic. As a
quick test of this, replacing pt_vcpu_lock() in pt_update_irq() with just
spin_lock(&v->arch.hvm_vcpu.tm_lock) gets us almost all of the performance back.
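For reference, the helper in question currently looks like this (upstream field
naming, taken from the hunk the patch below removes); the quick test simply drops
the read_lock() and keeps only the per-vCPU spinlock:

    /* Current code: the rwlock's cnts atomic is updated here on every
     * pt_update_irq() call, i.e. effectively on every VMEXIT. */
    static void pt_vcpu_lock(struct vcpu *v)
    {
        read_lock(&v->domain->arch.hvm.pl_time->pt_migrate);
        spin_lock(&v->arch.hvm.tm_lock);
    }

    /* Quick test in pt_update_irq(): take only the per-vCPU lock. */
    spin_lock(&v->arch.hvm.tm_lock);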

Stephen Brennan came up with a new locking algorithm; I just coded it up.

A couple of notes:

* We have only observed the problem and tested this patch for performance on
  a fairly old Xen version. However, the vpt code is almost identical and I
  expect upstream to suffer from the same issue.

* Stephen provided the following (slightly edited by me) writeup explaining the
  locking algorithm. I would like to include it somewhere but am not sure what
  the right place would be. Commit message perhaps?


Currently, every periodic_time is protected by locking the vcpu it is on. You
can think of the per-vCPU lock (tm_lock) as protecting the fields of every
periodic_time which is attached to that vCPU, as well as the list itself, and so
it must be held whenever those fields are read or written, or when an object is
added to or removed from the list.

It seems that there are three types of access to the periodic_time objects:

1. Functions which read (maybe write) all periodic_time instances attached to a
   particular vCPU. These are functions which use pt_vcpu_lock() after the
   commit, such as pt_restore_timer(), pt_save_timer(), etc.
2. Functions which want to modify a particular periodic_time object. These guys
   lock whichever vCPU the periodic_time is attached to, but since the vCPU
   could be modified without holding any lock, they are vulnerable to the bug.
   Functions in this group use pt_lock(), such as pt_timer_fn() or
   destroy_periodic_time().
3. Functions which not only want to modify the periodic_time, but would also
   like to modify its vcpu field. These are create_periodic_time() and
   pt_adjust_vcpu(). They create the locking imbalance bug for group 2, but we
   can't simply hold two vcpu locks due to the deadlock risk.

My proposed option is to add a per-periodic_time spinlock, which protects only
the periodic_time.vcpu field. Whenever reading the vcpu field of a periodic_time
struct, you must first take that lock. The critical sections of group 1 (your
"fast path" functions) would look like this:

1. lock vcpu
2. do whatever you want with pts currently on the vcpu. It is safe to read or write
   fields of pt, because the vcpu lock protects those fields. You simply cannot
   write pt->vcpu, because somebody holding the pt lock may already be spinning
   waiting for your vcpu lock.
3. unlock vcpu
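In code, this is literally what one of the group 1 functions (pt_migrate())
ends up looking like in the patch below:

    void pt_migrate(struct vcpu *v)
    {
        struct list_head *head = &v->arch.hvm.tm_list;
        struct periodic_time *pt;

        spin_lock(&v->arch.hvm.tm_lock);              /* 1. lock vcpu         */

        list_for_each_entry ( pt, head, list )        /* 2. use pt freely,    */
            migrate_timer(&pt->timer, v->processor);  /*    never write       */
                                                      /*    pt->vcpu          */
        spin_unlock(&v->arch.hvm.tm_lock);            /* 3. unlock vcpu       */
    }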


Note that there is no additional locking in this fast path. For group 2
functions (which are attempting to lock an individual periodic_time), the
critical section would look like this:

1. lock pt lock (stabilizing the vcpu field)
2. lock vcpu
3. feel free to modify any field of the periodic_time
4. unlock vcpu (due to the mutual exclusion of the pt lock, we know that we are
   unlocking the correct vcpu -- we have not been migrated)
5. unlock pt
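This is exactly what pt_lock()/pt_unlock() become in the patch below:

    static void pt_lock(struct periodic_time *pt)
    {
        spin_lock(&pt->lock);                       /* 1. stabilize pt->vcpu  */
        spin_lock(&pt->vcpu->arch.hvm.tm_lock);     /* 2. lock that vcpu      */
    }
                                                    /* 3. modify any pt field */
    static void pt_unlock(struct periodic_time *pt)
    {
        spin_unlock(&pt->vcpu->arch.hvm.tm_lock);   /* 4. still the same vcpu */
        spin_unlock(&pt->lock);                     /* 5. release the pt lock */
    }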

For functions in group 3, the critical section would be:

1. lock pt (stabilizing the vcpu field)
2. lock current vcpu
3. remove from vcpu list
4. unlock vcpu. At this point, you're guaranteed that the vcpu functions
   (callers of pt_vcpu_lock()) are not accessing your pt.
5. assign pt->vcpu  (we still have mutual exclusion against group 2 functions)
6. lock destination vcpu
7. add to vcpu list
8. unlock destination vcpu
9. unlock pt
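In the patch below this is essentially pt_adjust_vcpu() (sketch, the early
NULL-vcpu check omitted):

    static void pt_adjust_vcpu(struct periodic_time *pt, struct vcpu *v)
    {
        spin_lock(&pt->lock);                          /* 1. stabilize pt->vcpu    */

        spin_lock(&pt->vcpu->arch.hvm.tm_lock);        /* 2. lock current vcpu     */
        if ( pt->on_list )
            list_del(&pt->list);                       /* 3. remove from its list  */
        spin_unlock(&pt->vcpu->arch.hvm.tm_lock);      /* 4. unlock old vcpu       */

        pt->vcpu = v;                                  /* 5. safe: pt lock is held */

        spin_lock(&v->arch.hvm.tm_lock);               /* 6. lock destination vcpu */
        if ( pt->on_list )
        {
            list_add(&pt->list, &v->arch.hvm.tm_list); /* 7. add to its list       */
            migrate_timer(&pt->timer, v->processor);
        }
        spin_unlock(&v->arch.hvm.tm_lock);             /* 8. unlock destination    */

        spin_unlock(&pt->lock);                        /* 9. release the pt lock   */
    }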

If functions from groups 2 and 3 are less frequent, then you won't see too much
added lock overhead in this situation! Plus, even if group 2 and 3 accesses are
somewhat common, the performance overhead of an uncontended fine-grained lock is
much smaller than the overhead of a heavily contended coarse-grained lock, like
the per-domain rw lock.


Boris Ostrovsky (1):
  x86/vpt: Replace per-guest pt_migrate lock with per pt lock

 xen/arch/x86/emul-i8254.c     |   2 +
 xen/arch/x86/hvm/hpet.c       |   1 +
 xen/arch/x86/hvm/hvm.c        |   2 -
 xen/arch/x86/hvm/rtc.c        |   1 +
 xen/arch/x86/hvm/vlapic.c     |   1 +
 xen/arch/x86/hvm/vpt.c        | 122 +++++++++++++++++++++++-------------------
 xen/include/asm-x86/hvm/vpt.h |   9 +---
 7 files changed, 74 insertions(+), 64 deletions(-)

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH RESEND] x86/vpt: Replace per-guest pt_migrate lock with per pt lock
  2021-03-24 21:05 [PATCH RESEND] Performance regression due to XSA-336 Boris Ostrovsky
@ 2021-03-24 21:05 ` Boris Ostrovsky
  2021-03-25 14:48   ` Roger Pau Monné
  2021-03-25  7:51 ` [PATCH][4.15] Performance regression due to XSA-336 Jan Beulich
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Boris Ostrovsky @ 2021-03-24 21:05 UTC (permalink / raw)
  To: xen-devel
  Cc: jbeulich, andrew.cooper3, roger.pau, wl, boris.ostrovsky,
	stephen.s.brennan

Commit 8e76aef72820 ("x86/vpt: fix race when migrating timers between
vCPUs") addressed XSA-336 by introducing a per-domain rwlock that was
intended to protect periodic timers during VCPU migration. Since such
migration is an infrequent event, no performance impact was expected.

Unfortunately this turned out not to be the case: on a fairly large
guest (92 VCPUs) we've observed as much as a 40% TPCC performance regression
with some guest kernels. Further investigation pointed to the pt_migrate
read lock taken in pt_update_irq() as the largest contributor to this
regression. With a large number of VCPUs and a large number of VMEXITs
(from where pt_update_irq() is always called), the update of an atomic in
read_lock() is thought to be the main cause.

Stephen Brennan examined the locking pattern and suggested using a
per-timer lock instead. This lock will need to be held whenever there is
a chance that the pt->vcpu field may change (thus avoiding the XSA-336
condition).

Suggested-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
 xen/arch/x86/emul-i8254.c     |   2 +
 xen/arch/x86/hvm/hpet.c       |   1 +
 xen/arch/x86/hvm/hvm.c        |   2 -
 xen/arch/x86/hvm/rtc.c        |   1 +
 xen/arch/x86/hvm/vlapic.c     |   1 +
 xen/arch/x86/hvm/vpt.c        | 122 +++++++++++++++++++++++-------------------
 xen/include/asm-x86/hvm/vpt.h |   9 +---
 7 files changed, 74 insertions(+), 64 deletions(-)

diff --git a/xen/arch/x86/emul-i8254.c b/xen/arch/x86/emul-i8254.c
index 73be4188ad41..d83e727ff35e 100644
--- a/xen/arch/x86/emul-i8254.c
+++ b/xen/arch/x86/emul-i8254.c
@@ -478,6 +478,8 @@ void pit_init(struct domain *d, unsigned long cpu_khz)
     if ( !has_vpit(d) )
         return;
 
+    spin_lock_init(&pit->pt0.lock);
+
     spin_lock_init(&pit->lock);
 
     if ( is_hvm_domain(d) )
diff --git a/xen/arch/x86/hvm/hpet.c b/xen/arch/x86/hvm/hpet.c
index ca94e8b4538c..c7f45412164e 100644
--- a/xen/arch/x86/hvm/hpet.c
+++ b/xen/arch/x86/hvm/hpet.c
@@ -734,6 +734,7 @@ static void hpet_set(HPETState *h)
         h->hpet.timers[i].cmp = ~0ULL;
         h->hpet.comparator64[i] = ~0ULL;
         h->pt[i].source = PTSRC_isa;
+        spin_lock_init(&h->pt[i].lock);
     }
 }
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index e7bcffebc490..b60549a12a33 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -658,8 +658,6 @@ int hvm_domain_initialise(struct domain *d)
     /* need link to containing domain */
     d->arch.hvm.pl_time->domain = d;
 
-    rwlock_init(&d->arch.hvm.pl_time->pt_migrate);
-
     /* Set the default IO Bitmap. */
     if ( is_hardware_domain(d) )
     {
diff --git a/xen/arch/x86/hvm/rtc.c b/xen/arch/x86/hvm/rtc.c
index 3150f5f1479b..6289d972bb67 100644
--- a/xen/arch/x86/hvm/rtc.c
+++ b/xen/arch/x86/hvm/rtc.c
@@ -842,6 +842,7 @@ void rtc_init(struct domain *d)
     }
 
     spin_lock_init(&s->lock);
+    spin_lock_init(&s->pt.lock);
 
     init_timer(&s->update_timer, rtc_update_timer, s, smp_processor_id());
     init_timer(&s->update_timer2, rtc_update_timer2, s, smp_processor_id());
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 5e21fb4937d9..8413e41a7a80 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1626,6 +1626,7 @@ int vlapic_init(struct vcpu *v)
     vlapic_reset(vlapic);
 
     spin_lock_init(&vlapic->esr_lock);
+    spin_lock_init(&vlapic->pt.lock);
 
     tasklet_init(&vlapic->init_sipi.tasklet, vlapic_init_sipi_action, v);
 
diff --git a/xen/arch/x86/hvm/vpt.c b/xen/arch/x86/hvm/vpt.c
index 4c2afe2e9154..36d4699a5de6 100644
--- a/xen/arch/x86/hvm/vpt.c
+++ b/xen/arch/x86/hvm/vpt.c
@@ -153,32 +153,16 @@ static int pt_irq_masked(struct periodic_time *pt)
     return 1;
 }
 
-static void pt_vcpu_lock(struct vcpu *v)
-{
-    read_lock(&v->domain->arch.hvm.pl_time->pt_migrate);
-    spin_lock(&v->arch.hvm.tm_lock);
-}
-
-static void pt_vcpu_unlock(struct vcpu *v)
-{
-    spin_unlock(&v->arch.hvm.tm_lock);
-    read_unlock(&v->domain->arch.hvm.pl_time->pt_migrate);
-}
-
 static void pt_lock(struct periodic_time *pt)
 {
-    /*
-     * We cannot use pt_vcpu_lock here, because we need to acquire the
-     * per-domain lock first and then (re-)fetch the value of pt->vcpu, or
-     * else we might be using a stale value of pt->vcpu.
-     */
-    read_lock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);
+    spin_lock(&pt->lock);
     spin_lock(&pt->vcpu->arch.hvm.tm_lock);
 }
 
 static void pt_unlock(struct periodic_time *pt)
 {
-    pt_vcpu_unlock(pt->vcpu);
+    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
+    spin_unlock(&pt->lock);
 }
 
 static void pt_process_missed_ticks(struct periodic_time *pt)
@@ -228,7 +212,7 @@ void pt_save_timer(struct vcpu *v)
     if ( v->pause_flags & VPF_blocked )
         return;
 
-    pt_vcpu_lock(v);
+    spin_lock(&v->arch.hvm.tm_lock);
 
     list_for_each_entry ( pt, head, list )
         if ( !pt->do_not_freeze )
@@ -236,7 +220,7 @@ void pt_save_timer(struct vcpu *v)
 
     pt_freeze_time(v);
 
-    pt_vcpu_unlock(v);
+    spin_unlock(&v->arch.hvm.tm_lock);
 }
 
 void pt_restore_timer(struct vcpu *v)
@@ -244,7 +228,7 @@ void pt_restore_timer(struct vcpu *v)
     struct list_head *head = &v->arch.hvm.tm_list;
     struct periodic_time *pt;
 
-    pt_vcpu_lock(v);
+    spin_lock(&v->arch.hvm.tm_lock);
 
     list_for_each_entry ( pt, head, list )
     {
@@ -257,7 +241,7 @@ void pt_restore_timer(struct vcpu *v)
 
     pt_thaw_time(v);
 
-    pt_vcpu_unlock(v);
+    spin_unlock(&v->arch.hvm.tm_lock);
 }
 
 static void pt_timer_fn(void *data)
@@ -318,7 +302,7 @@ int pt_update_irq(struct vcpu *v)
     int irq, pt_vector = -1;
     bool level;
 
-    pt_vcpu_lock(v);
+    spin_lock(&v->arch.hvm.tm_lock);
 
     earliest_pt = NULL;
     max_lag = -1ULL;
@@ -348,7 +332,7 @@ int pt_update_irq(struct vcpu *v)
 
     if ( earliest_pt == NULL )
     {
-        pt_vcpu_unlock(v);
+        spin_unlock(&v->arch.hvm.tm_lock);
         return -1;
     }
 
@@ -356,7 +340,7 @@ int pt_update_irq(struct vcpu *v)
     irq = earliest_pt->irq;
     level = earliest_pt->level;
 
-    pt_vcpu_unlock(v);
+    spin_unlock(&v->arch.hvm.tm_lock);
 
     switch ( earliest_pt->source )
     {
@@ -403,7 +387,7 @@ int pt_update_irq(struct vcpu *v)
                 time_cb *cb = NULL;
                 void *cb_priv = NULL;
 
-                pt_vcpu_lock(v);
+                spin_lock(&v->arch.hvm.tm_lock);
                 /* Make sure the timer is still on the list. */
                 list_for_each_entry ( pt, &v->arch.hvm.tm_list, list )
                     if ( pt == earliest_pt )
@@ -413,7 +397,7 @@ int pt_update_irq(struct vcpu *v)
                         cb_priv = pt->priv;
                         break;
                     }
-                pt_vcpu_unlock(v);
+                spin_unlock(&v->arch.hvm.tm_lock);
 
                 if ( cb != NULL )
                     cb(v, cb_priv);
@@ -450,12 +434,12 @@ void pt_intr_post(struct vcpu *v, struct hvm_intack intack)
     if ( intack.source == hvm_intsrc_vector )
         return;
 
-    pt_vcpu_lock(v);
+    spin_lock(&v->arch.hvm.tm_lock);
 
     pt = is_pt_irq(v, intack);
     if ( pt == NULL )
     {
-        pt_vcpu_unlock(v);
+        spin_unlock(&v->arch.hvm.tm_lock);
         return;
     }
 
@@ -464,7 +448,7 @@ void pt_intr_post(struct vcpu *v, struct hvm_intack intack)
     cb = pt->cb;
     cb_priv = pt->priv;
 
-    pt_vcpu_unlock(v);
+    spin_unlock(&v->arch.hvm.tm_lock);
 
     if ( cb != NULL )
         cb(v, cb_priv);
@@ -475,12 +459,34 @@ void pt_migrate(struct vcpu *v)
     struct list_head *head = &v->arch.hvm.tm_list;
     struct periodic_time *pt;
 
-    pt_vcpu_lock(v);
+    spin_lock(&v->arch.hvm.tm_lock);
 
     list_for_each_entry ( pt, head, list )
         migrate_timer(&pt->timer, v->processor);
 
-    pt_vcpu_unlock(v);
+    spin_unlock(&v->arch.hvm.tm_lock);
+}
+
+static void __destroy_periodic_time(struct periodic_time *pt, bool locked)
+{
+    /* Was this structure previously initialised by create_periodic_time()? */
+    if ( pt->vcpu == NULL )
+        return;
+
+    if (!locked)
+        pt_lock(pt);
+    if ( pt->on_list )
+        list_del(&pt->list);
+    pt->on_list = 0;
+    pt->pending_intr_nr = 0;
+    if (!locked)
+        pt_unlock(pt);
+
+    /*
+     * pt_timer_fn() can run until this kill_timer() returns. We must do this
+     * outside pt_lock() otherwise we can deadlock with pt_timer_fn().
+     */
+    kill_timer(&pt->timer);
 }
 
 void create_periodic_time(
@@ -497,9 +503,16 @@ void create_periodic_time(
         return;
     }
 
-    destroy_periodic_time(pt);
+    spin_lock(&pt->lock);
 
-    write_lock(&v->domain->arch.hvm.pl_time->pt_migrate);
+    if ( pt->vcpu )
+    {
+        spin_lock(&pt->vcpu->arch.hvm.tm_lock);
+
+        __destroy_periodic_time(pt, true);
+
+        spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
+    }
 
     pt->pending_intr_nr = 0;
     pt->do_not_freeze = 0;
@@ -543,33 +556,22 @@ void create_periodic_time(
     pt->cb = cb;
     pt->priv = data;
 
+    spin_lock(&pt->vcpu->arch.hvm.tm_lock);
+
     pt->on_list = 1;
     list_add(&pt->list, &v->arch.hvm.tm_list);
 
+    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
+
     init_timer(&pt->timer, pt_timer_fn, pt, v->processor);
     set_timer(&pt->timer, pt->scheduled);
 
-    write_unlock(&v->domain->arch.hvm.pl_time->pt_migrate);
+    spin_unlock(&pt->lock);
 }
 
 void destroy_periodic_time(struct periodic_time *pt)
 {
-    /* Was this structure previously initialised by create_periodic_time()? */
-    if ( pt->vcpu == NULL )
-        return;
-
-    pt_lock(pt);
-    if ( pt->on_list )
-        list_del(&pt->list);
-    pt->on_list = 0;
-    pt->pending_intr_nr = 0;
-    pt_unlock(pt);
-
-    /*
-     * pt_timer_fn() can run until this kill_timer() returns. We must do this
-     * outside pt_lock() otherwise we can deadlock with pt_timer_fn().
-     */
-    kill_timer(&pt->timer);
+    __destroy_periodic_time(pt, false);
 }
 
 static void pt_adjust_vcpu(struct periodic_time *pt, struct vcpu *v)
@@ -579,15 +581,25 @@ static void pt_adjust_vcpu(struct periodic_time *pt, struct vcpu *v)
     if ( pt->vcpu == NULL )
         return;
 
-    write_lock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);
+    spin_lock(&pt->lock);
+    spin_lock(&pt->vcpu->arch.hvm.tm_lock);
+
+    if ( pt->on_list )
+        list_del(&pt->list);
+
+    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
+
     pt->vcpu = v;
+
+    spin_lock(&pt->vcpu->arch.hvm.tm_lock);
     if ( pt->on_list )
     {
-        list_del(&pt->list);
         list_add(&pt->list, &v->arch.hvm.tm_list);
         migrate_timer(&pt->timer, v->processor);
     }
-    write_unlock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);
+
+    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
+    spin_unlock(&pt->lock);
 }
 
 void pt_adjust_global_vcpu_target(struct vcpu *v)
diff --git a/xen/include/asm-x86/hvm/vpt.h b/xen/include/asm-x86/hvm/vpt.h
index 39d26cbda496..b0f4af25828b 100644
--- a/xen/include/asm-x86/hvm/vpt.h
+++ b/xen/include/asm-x86/hvm/vpt.h
@@ -49,6 +49,8 @@ struct periodic_time {
     u64 last_plt_gtime;         /* platform time when last IRQ is injected */
     struct timer timer;         /* ac_timer */
     time_cb *cb;
+    spinlock_t lock;            /* protects vcpu field during PT migration. */
+                                /* Needs to be taken before VCPU's tm_lock. */
     void *priv;                 /* point back to platform time source */
 };
 
@@ -128,13 +130,6 @@ struct pl_time {    /* platform time */
     struct RTCState  vrtc;
     struct HPETState vhpet;
     struct PMTState  vpmt;
-    /*
-     * rwlock to prevent periodic_time vCPU migration. Take the lock in read
-     * mode in order to prevent the vcpu field of periodic_time from changing.
-     * Lock must be taken in write mode when changes to the vcpu field are
-     * performed, as it allows exclusive access to all the timers of a domain.
-     */
-    rwlock_t pt_migrate;
     /* guest_time = Xen sys time + stime_offset */
     int64_t stime_offset;
     /* Ensures monotonicity in appropriate timer modes. */
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH][4.15] Performance regression due to XSA-336
  2021-03-24 21:05 [PATCH RESEND] Performance regression due to XSA-336 Boris Ostrovsky
  2021-03-24 21:05 ` [PATCH RESEND] x86/vpt: Replace per-guest pt_migrate lock with per pt lock Boris Ostrovsky
@ 2021-03-25  7:51 ` Jan Beulich
  2021-03-26 11:53   ` Ian Jackson
  2021-03-25 11:51 ` [PATCH RESEND] " Roger Pau Monné
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2021-03-25  7:51 UTC (permalink / raw)
  To: Boris Ostrovsky, Ian Jackson
  Cc: andrew.cooper3, roger.pau, wl, stephen.s.brennan, xen-devel

On 24.03.2021 22:05, Boris Ostrovsky wrote:
> 
> (Re-sending with Stephen added)
> 
> 
> While running performance tests with recent XSAs backports to our product we've
> discovered significant regression in TPCC performance. With a particular guest
> kernel the numbers dropped by as much as 40%.

While the change is more intrusive than one would like at this point, an
up-to-40% regression imo makes this at least a change to be considered
for 4.15. I will admit though that before next week I won't get around
to look at this in any more detail than just having read through this
cover letter. But perhaps someone else might find time earlier.

> We've narrowed that down to XSA-336 patch, specifically to the pt_migrate rwlock,
> and even more specifically to this lock being taken in pt_update_irq().
> 
> We have quite a large guest (92 VCPUs) doing lots of VMEXITs and the theory is
> that lock's cnts atomic is starting to cause lots of coherence traffic. As a
> quick test of this replacing pt_vcpu_lock() in pt_update_irq() with just
> spin_lock(&v->arch.hvm_vcpu.tm_lock) gets us almost all performance back.
> 
> Stephen Brennan came up with new locking algorithm, I just coded it up.
> 
> A couple of notes:
> 
> * We have only observed the problem and tested this patch for performance on
>   a fairly old Xen version. However, vpt code is almost identical and I expect
>   upstream to suffer from the same issue.
> 
> * Stephen provided the following (slightly edited by me) writeup explaining the
>   locking algorithm. I would like to include it somewhere but not sure what the
>   right place would be. Commit message perhaps?

If nowhere else, then definitely in the commit message. But perhaps it
could (also) sit in some form right ahead of pt_lock() / pt_unlock()?

Jan


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] Performance regression due to XSA-336
  2021-03-24 21:05 [PATCH RESEND] Performance regression due to XSA-336 Boris Ostrovsky
  2021-03-24 21:05 ` [PATCH RESEND] x86/vpt: Replace per-guest pt_migrate lock with per pt lock Boris Ostrovsky
  2021-03-25  7:51 ` [PATCH][4.15] Performance regression due to XSA-336 Jan Beulich
@ 2021-03-25 11:51 ` Roger Pau Monné
  2021-03-31  9:06 ` Hongyan Xia
  2021-03-31  9:53 ` Hongyan Xia
  4 siblings, 0 replies; 12+ messages in thread
From: Roger Pau Monné @ 2021-03-25 11:51 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: xen-devel, jbeulich, andrew.cooper3, wl, stephen.s.brennan

On Wed, Mar 24, 2021 at 05:05:04PM -0400, Boris Ostrovsky wrote:
> 
> (Re-sending with Stephen added)
> 
> 
> While running performance tests with recent XSAs backports to our product we've
> discovered significant regression in TPCC performance. With a particular guest
> kernel the numbers dropped by as much as 40%.
> 
> We've narrowed that down to XSA-336 patch, specifically to the pt_migrate rwlock,
> and even more specifically to this lock being taken in pt_update_irq().
> 
> We have quite a large guest (92 VCPUs) doing lots of VMEXITs and the theory is
> that lock's cnts atomic is starting to cause lots of coherence traffic. As a
> quick test of this replacing pt_vcpu_lock() in pt_update_irq() with just
> spin_lock(&v->arch.hvm_vcpu.tm_lock) gets us almost all performance back.

Right, we were kind of worried about this when doing the original
fix, but no one reported such performance regressions; I guess Oracle
is likely the only one running such big VMs.

I have a pending patch series to remove all this vpt logic from the vm
entry path, as it's not working properly except for very simple
scenarios:

https://lore.kernel.org/xen-devel/20200930104108.35969-1-roger.pau@citrix.com/

I hope I will find time to post a new version of the series soon-ish.

> Stephen Brennan came up with new locking algorithm, I just coded it up.
> 
> A couple of notes:
> 
> * We have only observed the problem and tested this patch for performance on
>   a fairly old Xen version. However, vpt code is almost identical and I expect
>   upstream to suffer from the same issue.
> 
> * Stephen provided the following (slightly edited by me) writeup explaining the
>   locking algorithm. I would like to include it somewhere but not sure what the
>   right place would be. Commit message perhaps?
> 
> 
> Currently, every periodic_time is protected by locking the vcpu it is on. You
> can think of the per-vCPU lock (tm_lock) as protecting the fields of every
> periodic_time which is attached to that vCPU, as well as the list itself, and so
> it must be held when read or written, or when an object is added or removed
> to/from the list.
> 
> It seems that there are three types of access to the peridic_time objects:
> 
> 1. Functions which read (maybe write) all periodic_time instances attached to a
>    particular vCPU. These are functions which use pt_vcpu_lock() after the
>    commit, such as pt_restore_timer(), pt_save_timer(), etc.
> 2. Functions which want to modify a particular periodic_time object. These guys
>    lock whichever vCPU the periodic_time is attached to, but since the vCPU
>    could be modified without holding any lock, they are vulnerable to the bug.
>    Functions in this group use pt_lock(), such as pt_timer_fn() or
>    destroy_periodic_time().
> 3. Functions which not only want to modify the periodic_time, but also would
>    like to modify the =vcpu= fields. These are create_periodic_time() or
>    pt_adjust_vcpu(). They create the locking imbalance bug for group 2, but we
>    can't simply hold 2 vcpu locks due to the deadlock risk.
> 
> My proposed option is to add a per-periodic_time spinlock, which protects only
> the periodic_time.vcpu field.

I wonder whether we really need a per-periodic_time spinlock. It seems
like functions using access type 1 are the only ones that suffer from
the contention caused by the global rwlock, so maybe we could adapt
this proposal to still use a per-domain lock, seeing that type 1
accesses are likely safe by just holding the vcpu lock.

Not that using a per-periodic_time spinlock is wrong, but it's likely
too fine grained (and adds more memory usage) as type 2 and 3 accesses
shouldn't be common anyway.

Let me make some comments on the patch itself.

> Whenever reading the vcpu field of a periodic_time
> struct, you must first take that lock. The critical sections of group 1 (your
> "fast path" functions) would look like this:
> 
> 1. lock vcpu
> 2. do whatever you want with pts currently on the vcpu. It is safe to read or write
>    fields of pt, because the vcpu lock protects those fields. You simply cannot
>    write pt->vcpu, because somebody holding the pt lock may already be spinning
>    waiting for your vcpu lock.
> 3. unlock vcpu
> 
> 
> Note that there is no additional locking in this fast path. For group 2
> functions (which are attempting to lock an individual periodic_time), the
> critical section would look like this:
> 
> 1. lock pt lock (stabilizing the vcpu field)
> 2. lock vcpu
> 3. feel free to modify any field of the periodic_time
> 4. unlock vcpu (due to the mutual exclusion of the pt lock, we know that we are
>    unlocking the correct vcpu -- we have not been migrated)
> 5. unlock pt
> 
> For functions in group 3, the critical section would be:
> 
> 1. lock pt (stabilizing the vcpu field)
> 2. lock current vcpu
> 3. remove from vcpu list
> 4. unlock vcpu. At this point, you're guaranteed that the vcpu functions
>    (callers of pt_vcpu_lock()) are not accessing your pt.
> 5. assign pt->vcpu  (we still have mutual exclusion against group 2 functions)
> 6. lock destination vcpu
> 7. add to vcpu list
> 8. unlock destination vcpu
> 9. unlock pt
> 
> If functions from group 2 and 3 are less frequent, then you won't see too much
> added lock overhead in this situation! Plus, even if group 2 and 3 are somewhat
> common, the performance overhead of an uncontented fine-grained lock is muuch
> smaller than the overhead of a heavily contended coarse-grained lock, like the
> per-domain rw lock.

Thanks, that's a very good description of the different locking
accesses by vpt. The original fix already aimed to make this
distinction by introducing the pt_vcpu_{un}lock and pt_{un}lock
helpers.

This all stems from a very bad design decision of making vpts tied to
a vCPU, which is a broken assumption.

Roger.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] x86/vpt: Replace per-guest pt_migrate lock with per pt lock
  2021-03-24 21:05 ` [PATCH RESEND] x86/vpt: Replace per-guest pt_migrate lock with per pt lock Boris Ostrovsky
@ 2021-03-25 14:48   ` Roger Pau Monné
  2021-03-25 23:17     ` Boris Ostrovsky
  0 siblings, 1 reply; 12+ messages in thread
From: Roger Pau Monné @ 2021-03-25 14:48 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: xen-devel, jbeulich, andrew.cooper3, wl, stephen.s.brennan

On Wed, Mar 24, 2021 at 05:05:05PM -0400, Boris Ostrovsky wrote:
> Commit 8e76aef72820 ("x86/vpt: fix race when migrating timers between
> vCPUs") addressed XSA-336 by introducing a per-domain rwlock that was
> intended to protect periodic timer during VCPU migration. Since such
> migration is an infrequent event no performance impact was expected.
> 
> Unfortunately this turned out not to be the case: on a fairly large
> guest (92 VCPUs) we've observed as much as 40% TPCC performance regression
> with some guest kernels. Further investigation pointed to pt_migrate
> read lock taken in pt_update_irq() as the largest contributor to this
> regression. With large number of VCPUs and large number of VMEXITs
> (from where pt_update_irq() is always called) the update of an atomic in
> read_lock() is thought to be the main cause.

Right, seems like a very plausible analysis.

Since I suspect most (if not all?) of the performance regression is
from the read_lock in pt_update_irq, I think we can remove that without
such a big change to the current locking logic, and instead merge part
of the logic that you detail in the cover letter without moving to a
per-timer lock.

> Stephen Brennan examined the locking pattern and suggested using a
> per-timer lock instead. This lock will need to be held whenever there is
> a chance that pt->vcpu field may change (thus avoiding XSA-336
> condition).
> 
> Suggested-by: Stephen Brennan <stephen.s.brennan@oracle.com>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

So while I think this is indeed a working solution, I'm not convinced
we require a per-timer lock. I think we can find some middle ground
using both a per-domain rwlock and the more fine-grained per-vcpu
lock.

Basically for type 1 accesses (pt_vcpu_{un}lock) I think we can safely
drop the read_{un}lock call, and remove the performance bottleneck
while slightly adjusting the functions that modify the per-vcpu timer
lists to take the per-vcpu locks when doing so.

I've tried to convey that in the comments below, while also pointing
out some suitable places where comments can be added based on the text
from your cover letter.

Overall this should result in a smaller patch that will be both easier
to review and argue in terms of inclusion into 4.15.

> ---
>  xen/arch/x86/emul-i8254.c     |   2 +
>  xen/arch/x86/hvm/hpet.c       |   1 +
>  xen/arch/x86/hvm/hvm.c        |   2 -
>  xen/arch/x86/hvm/rtc.c        |   1 +
>  xen/arch/x86/hvm/vlapic.c     |   1 +
>  xen/arch/x86/hvm/vpt.c        | 122 +++++++++++++++++++++++-------------------
>  xen/include/asm-x86/hvm/vpt.h |   9 +---
>  7 files changed, 74 insertions(+), 64 deletions(-)
> 
> diff --git a/xen/arch/x86/emul-i8254.c b/xen/arch/x86/emul-i8254.c
> index 73be4188ad41..d83e727ff35e 100644
> --- a/xen/arch/x86/emul-i8254.c
> +++ b/xen/arch/x86/emul-i8254.c
> @@ -478,6 +478,8 @@ void pit_init(struct domain *d, unsigned long cpu_khz)
>      if ( !has_vpit(d) )
>          return;
>  
> +    spin_lock_init(&pit->pt0.lock);
> +
>      spin_lock_init(&pit->lock);
>  
>      if ( is_hvm_domain(d) )
> diff --git a/xen/arch/x86/hvm/hpet.c b/xen/arch/x86/hvm/hpet.c
> index ca94e8b4538c..c7f45412164e 100644
> --- a/xen/arch/x86/hvm/hpet.c
> +++ b/xen/arch/x86/hvm/hpet.c
> @@ -734,6 +734,7 @@ static void hpet_set(HPETState *h)
>          h->hpet.timers[i].cmp = ~0ULL;
>          h->hpet.comparator64[i] = ~0ULL;
>          h->pt[i].source = PTSRC_isa;
> +        spin_lock_init(&h->pt[i].lock);
>      }
>  }
>  
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index e7bcffebc490..b60549a12a33 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -658,8 +658,6 @@ int hvm_domain_initialise(struct domain *d)
>      /* need link to containing domain */
>      d->arch.hvm.pl_time->domain = d;
>  
> -    rwlock_init(&d->arch.hvm.pl_time->pt_migrate);
> -
>      /* Set the default IO Bitmap. */
>      if ( is_hardware_domain(d) )
>      {
> diff --git a/xen/arch/x86/hvm/rtc.c b/xen/arch/x86/hvm/rtc.c
> index 3150f5f1479b..6289d972bb67 100644
> --- a/xen/arch/x86/hvm/rtc.c
> +++ b/xen/arch/x86/hvm/rtc.c
> @@ -842,6 +842,7 @@ void rtc_init(struct domain *d)
>      }
>  
>      spin_lock_init(&s->lock);
> +    spin_lock_init(&s->pt.lock);
>  
>      init_timer(&s->update_timer, rtc_update_timer, s, smp_processor_id());
>      init_timer(&s->update_timer2, rtc_update_timer2, s, smp_processor_id());
> diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
> index 5e21fb4937d9..8413e41a7a80 100644
> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -1626,6 +1626,7 @@ int vlapic_init(struct vcpu *v)
>      vlapic_reset(vlapic);
>  
>      spin_lock_init(&vlapic->esr_lock);
> +    spin_lock_init(&vlapic->pt.lock);
>  
>      tasklet_init(&vlapic->init_sipi.tasklet, vlapic_init_sipi_action, v);
>  
> diff --git a/xen/arch/x86/hvm/vpt.c b/xen/arch/x86/hvm/vpt.c
> index 4c2afe2e9154..36d4699a5de6 100644
> --- a/xen/arch/x86/hvm/vpt.c
> +++ b/xen/arch/x86/hvm/vpt.c
> @@ -153,32 +153,16 @@ static int pt_irq_masked(struct periodic_time *pt)
>      return 1;
>  }
>  
> -static void pt_vcpu_lock(struct vcpu *v)
> -{
> -    read_lock(&v->domain->arch.hvm.pl_time->pt_migrate);
> -    spin_lock(&v->arch.hvm.tm_lock);
> -}
> -
> -static void pt_vcpu_unlock(struct vcpu *v)
> -{
> -    spin_unlock(&v->arch.hvm.tm_lock);
> -    read_unlock(&v->domain->arch.hvm.pl_time->pt_migrate);
> -}

I would keep those functions, and just remove the read_{un}lock call
from them. The places where pt_vcpu_{un}lock is used match the type
1 accesses listed in your cover letter, so I would add as a prefix to
the functions:

/*
 * Functions which read (maybe write) all periodic_time instances
 * attached to a particular vCPU use this locking helper.
 *
 * Such users are explicitly forbidden from changing the value of the
 * pt->vcpu field, because another thread holding the pt_migrate lock
 * may already be spinning waiting for your vcpu lock.
 */
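
With the read_{un}lock calls dropped as suggested, the helpers would reduce to
just the per-vCPU spinlock, i.e. something like:

    static void pt_vcpu_lock(struct vcpu *v)
    {
        spin_lock(&v->arch.hvm.tm_lock);
    }

    static void pt_vcpu_unlock(struct vcpu *v)
    {
        spin_unlock(&v->arch.hvm.tm_lock);
    }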

> -
>  static void pt_lock(struct periodic_time *pt)
>  {
> -    /*
> -     * We cannot use pt_vcpu_lock here, because we need to acquire the
> -     * per-domain lock first and then (re-)fetch the value of pt->vcpu, or
> -     * else we might be using a stale value of pt->vcpu.
> -     */
> -    read_lock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);
> +    spin_lock(&pt->lock);
>      spin_lock(&pt->vcpu->arch.hvm.tm_lock);
>  }
>  
>  static void pt_unlock(struct periodic_time *pt)
>  {
> -    pt_vcpu_unlock(pt->vcpu);
> +    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
> +    spin_unlock(&pt->lock);
>  }

I would leave the rwlock on this pair of functions, and add the
following comment as a prefix:

/*
 * Functions which want to modify a particular periodic_time object.
 * These users lock whichever vCPU the periodic_time is attached to,
 * but since the vCPU could be modified without holding any lock, they
 * need to take an additional lock that protects against pt->vcpu
 * changing.
 */

>  
>  static void pt_process_missed_ticks(struct periodic_time *pt)
> @@ -228,7 +212,7 @@ void pt_save_timer(struct vcpu *v)
>      if ( v->pause_flags & VPF_blocked )
>          return;
>  
> -    pt_vcpu_lock(v);
> +    spin_lock(&v->arch.hvm.tm_lock);
>  
>      list_for_each_entry ( pt, head, list )
>          if ( !pt->do_not_freeze )
> @@ -236,7 +220,7 @@ void pt_save_timer(struct vcpu *v)
>  
>      pt_freeze_time(v);
>  
> -    pt_vcpu_unlock(v);
> +    spin_unlock(&v->arch.hvm.tm_lock);
>  }
>  
>  void pt_restore_timer(struct vcpu *v)
> @@ -244,7 +228,7 @@ void pt_restore_timer(struct vcpu *v)
>      struct list_head *head = &v->arch.hvm.tm_list;
>      struct periodic_time *pt;
>  
> -    pt_vcpu_lock(v);
> +    spin_lock(&v->arch.hvm.tm_lock);
>  
>      list_for_each_entry ( pt, head, list )
>      {
> @@ -257,7 +241,7 @@ void pt_restore_timer(struct vcpu *v)
>  
>      pt_thaw_time(v);
>  
> -    pt_vcpu_unlock(v);
> +    spin_unlock(&v->arch.hvm.tm_lock);
>  }
>  
>  static void pt_timer_fn(void *data)
> @@ -318,7 +302,7 @@ int pt_update_irq(struct vcpu *v)
>      int irq, pt_vector = -1;
>      bool level;
>  
> -    pt_vcpu_lock(v);
> +    spin_lock(&v->arch.hvm.tm_lock);
>  
>      earliest_pt = NULL;
>      max_lag = -1ULL;
> @@ -348,7 +332,7 @@ int pt_update_irq(struct vcpu *v)
>  
>      if ( earliest_pt == NULL )
>      {
> -        pt_vcpu_unlock(v);
> +        spin_unlock(&v->arch.hvm.tm_lock);
>          return -1;
>      }
>  
> @@ -356,7 +340,7 @@ int pt_update_irq(struct vcpu *v)
>      irq = earliest_pt->irq;
>      level = earliest_pt->level;
>  
> -    pt_vcpu_unlock(v);
> +    spin_unlock(&v->arch.hvm.tm_lock);
>  
>      switch ( earliest_pt->source )
>      {
> @@ -403,7 +387,7 @@ int pt_update_irq(struct vcpu *v)
>                  time_cb *cb = NULL;
>                  void *cb_priv = NULL;
>  
> -                pt_vcpu_lock(v);
> +                spin_lock(&v->arch.hvm.tm_lock);
>                  /* Make sure the timer is still on the list. */
>                  list_for_each_entry ( pt, &v->arch.hvm.tm_list, list )
>                      if ( pt == earliest_pt )
> @@ -413,7 +397,7 @@ int pt_update_irq(struct vcpu *v)
>                          cb_priv = pt->priv;
>                          break;
>                      }
> -                pt_vcpu_unlock(v);
> +                spin_unlock(&v->arch.hvm.tm_lock);
>  
>                  if ( cb != NULL )
>                      cb(v, cb_priv);
> @@ -450,12 +434,12 @@ void pt_intr_post(struct vcpu *v, struct hvm_intack intack)
>      if ( intack.source == hvm_intsrc_vector )
>          return;
>  
> -    pt_vcpu_lock(v);
> +    spin_lock(&v->arch.hvm.tm_lock);
>  
>      pt = is_pt_irq(v, intack);
>      if ( pt == NULL )
>      {
> -        pt_vcpu_unlock(v);
> +        spin_unlock(&v->arch.hvm.tm_lock);
>          return;
>      }
>  
> @@ -464,7 +448,7 @@ void pt_intr_post(struct vcpu *v, struct hvm_intack intack)
>      cb = pt->cb;
>      cb_priv = pt->priv;
>  
> -    pt_vcpu_unlock(v);
> +    spin_unlock(&v->arch.hvm.tm_lock);
>  
>      if ( cb != NULL )
>          cb(v, cb_priv);
> @@ -475,12 +459,34 @@ void pt_migrate(struct vcpu *v)
>      struct list_head *head = &v->arch.hvm.tm_list;
>      struct periodic_time *pt;
>  
> -    pt_vcpu_lock(v);
> +    spin_lock(&v->arch.hvm.tm_lock);
>  
>      list_for_each_entry ( pt, head, list )
>          migrate_timer(&pt->timer, v->processor);
>  
> -    pt_vcpu_unlock(v);
> +    spin_unlock(&v->arch.hvm.tm_lock);
> +}
> +
> +static void __destroy_periodic_time(struct periodic_time *pt, bool locked)
> +{
> +    /* Was this structure previously initialised by create_periodic_time()? */
> +    if ( pt->vcpu == NULL )
> +        return;
> +
> +    if (!locked)

Coding style: missing spaces.

> +        pt_lock(pt);
> +    if ( pt->on_list )
> +        list_del(&pt->list);
> +    pt->on_list = 0;

Please use false here, since this is a boolean field now.

> +    pt->pending_intr_nr = 0;
> +    if (!locked)
> +        pt_unlock(pt);
> +
> +    /*
> +     * pt_timer_fn() can run until this kill_timer() returns. We must do this
> +     * outside pt_lock() otherwise we can deadlock with pt_timer_fn().
> +     */
> +    kill_timer(&pt->timer);
>  }
>  
>  void create_periodic_time(
> @@ -497,9 +503,16 @@ void create_periodic_time(
>          return;
>      }
>  
> -    destroy_periodic_time(pt);

Why not keep the call to destroy_periodic_time here?

> +    spin_lock(&pt->lock);
>  
> -    write_lock(&v->domain->arch.hvm.pl_time->pt_migrate);

And the write lock here together with the pair of locks that you add
around the addition to the vCPU list. Those would better use
pt_vcpu_{un}lock if you agree to keep those helpers.

> +    if ( pt->vcpu )
> +    {
> +        spin_lock(&pt->vcpu->arch.hvm.tm_lock);
> +
> +        __destroy_periodic_time(pt, true);
> +
> +        spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
> +    }
>  
>      pt->pending_intr_nr = 0;
>      pt->do_not_freeze = 0;
> @@ -543,33 +556,22 @@ void create_periodic_time(
>      pt->cb = cb;
>      pt->priv = data;
>  
> +    spin_lock(&pt->vcpu->arch.hvm.tm_lock);
> +
>      pt->on_list = 1;
>      list_add(&pt->list, &v->arch.hvm.tm_list);
>  
> +    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
> +
>      init_timer(&pt->timer, pt_timer_fn, pt, v->processor);
>      set_timer(&pt->timer, pt->scheduled);
>  
> -    write_unlock(&v->domain->arch.hvm.pl_time->pt_migrate);
> +    spin_unlock(&pt->lock);
>  }
>  
>  void destroy_periodic_time(struct periodic_time *pt)
>  {
> -    /* Was this structure previously initialised by create_periodic_time()? */
> -    if ( pt->vcpu == NULL )
> -        return;
> -
> -    pt_lock(pt);
> -    if ( pt->on_list )
> -        list_del(&pt->list);
> -    pt->on_list = 0;
> -    pt->pending_intr_nr = 0;
> -    pt_unlock(pt);
> -
> -    /*
> -     * pt_timer_fn() can run until this kill_timer() returns. We must do this
> -     * outside pt_lock() otherwise we can deadlock with pt_timer_fn().
> -     */
> -    kill_timer(&pt->timer);
> +    __destroy_periodic_time(pt, false);

I'm not convinced you need the __destroy_periodic_time helper, as I'm
not sure I see what benefit it has over the previous approach. You end
up reading pt->vcpu without holding any lock in the destroy case, just
like before.

>  }
>  
>  static void pt_adjust_vcpu(struct periodic_time *pt, struct vcpu *v)
> @@ -579,15 +581,25 @@ static void pt_adjust_vcpu(struct periodic_time *pt, struct vcpu *v)
>      if ( pt->vcpu == NULL )
>          return;
>  
> -    write_lock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);

As suggested above, I think you could leave the usage of the
pt_migrate lock together with your addition of the per-vcpu vpt
locking.

As this is the most obvious function that changes the vcpu field I
would add:

/*
 * Functions which want to modify the vcpu field of the vpt need to
 * hold the global lock (pt_migrate) in write mode together with the
 * per-vcpu locks of the lists being modified. Note that two vcpu
 * locks cannot be held at the same time, to avoid a deadlock.
 */
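
i.e. (just a sketch) pt_adjust_vcpu() would keep the write_{un}lock pair and
only take the per-vcpu locks around the list manipulations, one vcpu at a time:

    write_lock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);

    spin_lock(&pt->vcpu->arch.hvm.tm_lock);
    if ( pt->on_list )
        list_del(&pt->list);
    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);

    pt->vcpu = v;

    spin_lock(&v->arch.hvm.tm_lock);
    if ( pt->on_list )
    {
        list_add(&pt->list, &v->arch.hvm.tm_list);
        migrate_timer(&pt->timer, v->processor);
    }
    spin_unlock(&v->arch.hvm.tm_lock);

    write_unlock(&v->domain->arch.hvm.pl_time->pt_migrate);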

Thanks, Roger.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] x86/vpt: Replace per-guest pt_migrate lock with per pt lock
  2021-03-25 14:48   ` Roger Pau Monné
@ 2021-03-25 23:17     ` Boris Ostrovsky
  0 siblings, 0 replies; 12+ messages in thread
From: Boris Ostrovsky @ 2021-03-25 23:17 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, jbeulich, andrew.cooper3, wl, stephen.s.brennan


On 3/25/21 10:48 AM, Roger Pau Monné wrote:
> On Wed, Mar 24, 2021 at 05:05:05PM -0400, Boris Ostrovsky wrote:
>> Commit 8e76aef72820 ("x86/vpt: fix race when migrating timers between
>> vCPUs") addressed XSA-336 by introducing a per-domain rwlock that was
>> intended to protect periodic timer during VCPU migration. Since such
>> migration is an infrequent event no performance impact was expected.
>>
>> Unfortunately this turned out not to be the case: on a fairly large
>> guest (92 VCPUs) we've observed as much as 40% TPCC performance regression
>> with some guest kernels. Further investigation pointed to pt_migrate
>> read lock taken in pt_update_irq() as the largest contributor to this
>> regression. With large number of VCPUs and large number of VMEXITs
>> (from where pt_update_irq() is always called) the update of an atomic in
>> read_lock() is thought to be the main cause.
> Right, seems like a very plausible analysis.
>
> Since I suspect most (if not all?) 


pt_restore_timer() (called from the scheduler) also contributes a couple of percent. But yes.


> of the performance regression is
> from the read_lock in pt_update_irq I think we can remove that without
> doing such a big change to the current locking logic, and instead
> merging part of the logic that you detail in the cover letter without
> moving to a per-timer lock.
>
>> Stephen Brennan examined the locking pattern and suggested using a
>> per-timer lock instead. This lock will need to be held whenever there is
>> a chance that pt->vcpu field may change (thus avoiding XSA-336
>> condition).
>>
>> Suggested-by: Stephen Brennan <stephen.s.brennan@oracle.com>
>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> So while I think this is indeed a working solution, I'm not convinced
> we require a per-timer lock, I think we can find some middle ground
> using both a per-domain rwlock and the more fine grained per-vcpu
> lock.
>
> Basically for type 1 accesses (pt_vcpu_{un}lock) I think we can safely
> drop the read_{un}lock call,


Yes, if that's the case then the current rwlock should be fine.


>  and remove the performance bottleneck
> while slightly adjusting the functions that modify the per-vcpu timer
> lists to take the per-vcpu locks when doing so.
>
> I've tried to convey that in the comments below, while also pointing
> out some suitable places where comments can be added based on the text
> from your cover letter.
>
> Overall this should result in a smaller patch that will be both easier
> to review and argue in terms of inclusion into 4.15.


Sure. Thanks for the review/suggestions.


-boris



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][4.15] Performance regression due to XSA-336
  2021-03-25  7:51 ` [PATCH][4.15] Performance regression due to XSA-336 Jan Beulich
@ 2021-03-26 11:53   ` Ian Jackson
  2021-03-26 13:22     ` Jan Beulich
  0 siblings, 1 reply; 12+ messages in thread
From: Ian Jackson @ 2021-03-26 11:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Boris Ostrovsky, andrew.cooper3, roger.pau, wl,
	stephen.s.brennan, xen-devel

Jan Beulich writes ("Re: [PATCH][4.15] Performance regression due to XSA-336"):
> While the change is more intrusive than one would like at this point, an
> up-to-40% regression imo makes this at least a change to be considered
> for 4.15. I will admit though that before next week I won't get around
> to look at this in any more detail than just having read through this
> cover letter. But perhaps someone else might find time earlier.

As I understand it:

This amounts to a request to consider a release ack for "x86/vpt:
Replace per-guest pt_migrate lock with per pt lock".

That patch fixes a regression due to XSA-336.  XSA-336 affected many
versions of Xen.  Therefore this is not a regression between 4.14 and
4.15; rather it's a regression from pre-XSA-336 to post-XSA-336.

I looked at the patch and I am really not comfortable with that kind
of change at this stage of the release.

So I'm afraid the answer is no, unless I have misunderstood something.

Regards,
Ian.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][4.15] Performance regression due to XSA-336
  2021-03-26 11:53   ` Ian Jackson
@ 2021-03-26 13:22     ` Jan Beulich
  2021-03-26 13:25       ` Ian Jackson
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2021-03-26 13:22 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Boris Ostrovsky, andrew.cooper3, roger.pau, wl,
	stephen.s.brennan, xen-devel

On 26.03.2021 12:53, Ian Jackson wrote:
> Jan Beulich writes ("Re: [PATCH][4.15] Performance regression due to XSA-336"):
>> While the change is more intrusive than one would like at this point, an
>> up-to-40% regression imo makes this at least a change to be considered
>> for 4.15. I will admit though that before next week I won't get around
>> to look at this in any more detail than just having read through this
>> cover letter. But perhaps someone else might find time earlier.
> 
> As I understand it:
> 
> This amounts to a request to consider a release ack for "x86/vpt:
> Replace per-guest pt_migrate lock with per pt lock".

Or, as already suggested by Roger, perhaps something simpler. First of
all I wanted you to be aware.

> That patch fixes a regression due to XSA-336.  XSA-336 affected many
> versions of Xen.  Therefore this is not a regression between 4.14 and
> 4.15; rather it's a regresion from pre-XSA-336 to post-XSA-336.
> 
> I looked at the patch and I am really not comfortable with that kind
> of change at this stage of the release.
> 
> So I'm afraid the answer is no, unless I have misunderstood something.

Understood.

Jan


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][4.15] Performance regression due to XSA-336
  2021-03-26 13:22     ` Jan Beulich
@ 2021-03-26 13:25       ` Ian Jackson
  0 siblings, 0 replies; 12+ messages in thread
From: Ian Jackson @ 2021-03-26 13:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Boris Ostrovsky, andrew.cooper3, roger.pau, wl,
	stephen.s.brennan, xen-devel

Jan Beulich writes ("Re: [PATCH][4.15] Performance regression due to XSA-336"):
> On 26.03.2021 12:53, Ian Jackson wrote:
> > This amounts to a request to consider a release ack for "x86/vpt:
> > Replace per-guest pt_migrate lock with per pt lock".
> 
> Or, as already suggested by Roger, perhaps something simpler. First of
> all I wanted you to be aware.

Thank you, yes.  That is very helpful.

> > That patch fixes a regression due to XSA-336.  XSA-336 affected many
> > versions of Xen.  Therefore this is not a regression between 4.14 and
> > 4.15; rather it's a regresion from pre-XSA-336 to post-XSA-336.
> > 
> > I looked at the patch and I am really not comfortable with that kind
> > of change at this stage of the release.
> > 
> > So I'm afraid the answer is no, unless I have misunderstood something.
> 
> Understood.

I might consider "something simpler" but the bar is very high by now
and it seems unlikely to make the cut.

Thanks,
Ian.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] Performance regression due to XSA-336
  2021-03-24 21:05 [PATCH RESEND] Performance regression due to XSA-336 Boris Ostrovsky
                   ` (2 preceding siblings ...)
  2021-03-25 11:51 ` [PATCH RESEND] " Roger Pau Monné
@ 2021-03-31  9:06 ` Hongyan Xia
  2021-03-31  9:53 ` Hongyan Xia
  4 siblings, 0 replies; 12+ messages in thread
From: Hongyan Xia @ 2021-03-31  9:06 UTC (permalink / raw)
  To: Boris Ostrovsky, xen-devel
  Cc: jbeulich, andrew.cooper3, roger.pau, wl, stephen.s.brennan

Thank you for the patch. We observed similar issues.

However, would you mind sharing more details on the test setup? We
tested the hypervisor after injecting XSA-336. There were regressions
in some benchmarks but far from being enough to raise a serious alarm.
In particular, we benchmarked CPU, disk I/O, network and timer
performance on VMs with 64 and even 128 vCPUs and the noticeable
regression was a 60% drop in the ACPI PM timer when all vCPUs are
actively accessing the timer. We did not see concerns in real-world
benchmarks so I am curious how the -40% was seen.

Is it possible that the guest itself was using ACPI timer as the
clocksource? That could explain a lot. It might also be useful to know
what OS that was and other specs around the -40%.

Thank you,
Hongyan

On Wed, 2021-03-24 at 17:05 -0400, Boris Ostrovsky wrote:
> (Re-sending with Stephen added)
> 
> 
> While running performance tests with recent XSAs backports to our
> product we've
> discovered significant regression in TPCC performance. With a
> particular guest
> kernel the numbers dropped by as much as 40%.
> 
> We've narrowed that down to XSA-336 patch, specifically to the
> pt_migrate rwlock,
> and even more specifically to this lock being taken in
> pt_update_irq().
> 
> We have quite a large guest (92 VCPUs) doing lots of VMEXITs and the
> theory is
> that lock's cnts atomic is starting to cause lots of coherence
> traffic. As a
> quick test of this replacing pt_vcpu_lock() in pt_update_irq() with
> just
> spin_lock(&v->arch.hvm_vcpu.tm_lock) gets us almost all performance
> back.
> 
> Stephen Brennan came up with new locking algorithm, I just coded it
> up.
> 
> A couple of notes:
> 
> * We have only observed the problem and tested this patch for
> performance on
>   a fairly old Xen version. However, vpt code is almost identical and
> I expect
>   upstream to suffer from the same issue.
> 
> * Stephen provided the following (slightly edited by me) writeup
> explaining the
>   locking algorithm. I would like to include it somewhere but not
> sure what the
>   right place would be. Commit message perhaps?
> 
> 
> Currently, every periodic_time is protected by locking the vcpu it is
> on. You
> can think of the per-vCPU lock (tm_lock) as protecting the fields of
> every
> periodic_time which is attached to that vCPU, as well as the list
> itself, and so
> it must be held when read or written, or when an object is added or
> removed
> to/from the list.
> 
> It seems that there are three types of access to the peridic_time
> objects:
> 
> 1. Functions which read (maybe write) all periodic_time instances
> attached to a
>    particular vCPU. These are functions which use pt_vcpu_lock()
> after the
>    commit, such as pt_restore_timer(), pt_save_timer(), etc.
> 2. Functions which want to modify a particular periodic_time object.
> These guys
>    lock whichever vCPU the periodic_time is attached to, but since
> the vCPU
>    could be modified without holding any lock, they are vulnerable to
> the bug.
>    Functions in this group use pt_lock(), such as pt_timer_fn() or
>    destroy_periodic_time().
> 3. Functions which not only want to modify the periodic_time, but
> also would
>    like to modify the =vcpu= fields. These are create_periodic_time()
> or
>    pt_adjust_vcpu(). They create the locking imbalance bug for group
> 2, but we
>    can't simply hold 2 vcpu locks due to the deadlock risk.
> 
> My proposed option is to add a per-periodic_time spinlock, which
> protects only
> the periodic_time.vcpu field. Whenever reading the vcpu field of a
> periodic_time
> struct, you must first take that lock. The critical sections of group
> 1 (your
> "fast path" functions) would look like this:
> 
> 1. lock vcpu
> 2. do whatever you want with pts currently on the vcpu. It is safe to
> read or write
>    fields of pt, because the vcpu lock protects those fields. You
> simply cannot
>    write pt->vcpu, because somebody holding the pt lock may already
> be spinning
>    waiting for your vcpu lock.
> 3. unlock vcpu
> 
> 
> Note that there is no additional locking in this fast path. For group
> 2
> functions (which are attempting to lock an individual periodic_time),
> the
> critical section would look like this:
> 
> 1. lock pt lock (stabilizing the vcpu field)
> 2. lock vcpu
> 3. feel free to modify any field of the periodic_time
> 4. unlock vcpu (due to the mutual exclusion of the pt lock, we know
> that we are
>    unlocking the correct vcpu -- we have not been migrated)
> 5. unlock pt
> 
> For functions in group 3, the critical section would be:
> 
> 1. lock pt (stabilizing the vcpu field)
> 2. lock current vcpu
> 3. remove from vcpu list
> 4. unlock vcpu. At this point, you're guaranteed that the vcpu
> functions
>    (callers of pt_vcpu_lock()) are not accessing your pt.
> 5. assign pt->vcpu  (we still have mutual exclusion against group 2
> functions)
> 6. lock destination vcpu
> 7. add to vcpu list
> 8. unlock destination vcpu
> 9. unlock pt
> 
> If functions from group 2 and 3 are less frequent, then you won't see
> too much
> added lock overhead in this situation! Plus, even if group 2 and 3
> are somewhat
> common, the performance overhead of an uncontented fine-grained lock
> is muuch
> smaller than the overhead of a heavily contended coarse-grained lock,
> like the
> per-domain rw lock.
> 
> 
> Boris Ostrovsky (1):
>   x86/vpt: Replace per-guest pt_migrate lock with per pt lock
> 
>  xen/arch/x86/emul-i8254.c     |   2 +
>  xen/arch/x86/hvm/hpet.c       |   1 +
>  xen/arch/x86/hvm/hvm.c        |   2 -
>  xen/arch/x86/hvm/rtc.c        |   1 +
>  xen/arch/x86/hvm/vlapic.c     |   1 +
>  xen/arch/x86/hvm/vpt.c        | 122 +++++++++++++++++++++++---------
> ----------
>  xen/include/asm-x86/hvm/vpt.h |   9 +---
>  7 files changed, 74 insertions(+), 64 deletions(-)
> 



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] Performance regression due to XSA-336
  2021-03-24 21:05 [PATCH RESEND] Performance regression due to XSA-336 Boris Ostrovsky
                   ` (3 preceding siblings ...)
  2021-03-31  9:06 ` Hongyan Xia
@ 2021-03-31  9:53 ` Hongyan Xia
  2021-03-31 14:01   ` Boris Ostrovsky
  4 siblings, 1 reply; 12+ messages in thread
From: Hongyan Xia @ 2021-03-31  9:53 UTC (permalink / raw)
  To: Boris Ostrovsky, xen-devel
  Cc: jbeulich, andrew.cooper3, roger.pau, wl, stephen.s.brennan

Thank you for the patch. We observed similar issues.

However, would you mind sharing more details on the test setup? We
tested the hypervisor after injecting XSA-336. There were regressions
in some benchmarks but far from being enough to raise a serious alarm.
In particular, we benchmarked CPU, disk I/O, network and timer
performance on VMs with 64 and even 128 vCPUs and the noticeable
regression was a 60% drop in the ACPI PM timer when all vCPUs are
actively accessing the timer. We did not see concerns in real-world
benchmarks so I am curious how the -40% was seen.

Is it possible that the guest itself was using the ACPI timer as the
clocksource? That could explain a lot. It might also be useful to know
which guest OS it was and other specs around the -40%.

Thank you,
Hongyan



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] Performance regression due to XSA-336
  2021-03-31  9:53 ` Hongyan Xia
@ 2021-03-31 14:01   ` Boris Ostrovsky
  0 siblings, 0 replies; 12+ messages in thread
From: Boris Ostrovsky @ 2021-03-31 14:01 UTC (permalink / raw)
  To: Hongyan Xia, xen-devel
  Cc: jbeulich, andrew.cooper3, roger.pau, wl, stephen.s.brennan


On 3/31/21 5:53 AM, Hongyan Xia wrote:
> Thank you for the patch. We observed similar issues.
>
> However, would you mind sharing more details on the test setup? We
> tested the hypervisor after injecting XSA-336. There were regressions
> in some benchmarks, but far from enough to raise a serious alarm. In
> particular, we benchmarked CPU, disk I/O, network and timer performance
> on VMs with 64 and even 128 vCPUs, and the most noticeable regression
> was a 60% drop in ACPI PM timer performance when all vCPUs were
> actively accessing the timer. We did not see concerns in real-world
> benchmarks, so I am curious how the -40% was seen.


We tested two guest kernel versions and observed this regression only on the older one. The newer kernel showed almost no change, so this appears to be quite sensitive to the guest kernel.


Unfortunately it is not easy to get access to the test rig, so I wasn't able to do a more thorough investigation of the kernels' behavior. This was mostly done by staring at the code, making some changes, and getting in line to have the test run.


>
> Is it possible that the guest itself was using the ACPI timer as the
> clocksource?


No, we never use it.


> That could explain a lot. It might also be useful to know
> which guest OS it was and other specs around the -40%.


The guest OS is Oracle Linux 6 (I think we also tried OL7 with the same results), and the kernel is a modified 4.1. The host has a Skylake processor and a few hundred GB of memory. The benchmark is TPCC.


-boris



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-03-31 14:01 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-24 21:05 [PATCH RESEND] Performance regression due to XSA-336 Boris Ostrovsky
2021-03-24 21:05 ` [PATCH RESEND] x86/vpt: Replace per-guest pt_migrate lock with per pt lock Boris Ostrovsky
2021-03-25 14:48   ` Roger Pau Monné
2021-03-25 23:17     ` Boris Ostrovsky
2021-03-25  7:51 ` [PATCH][4.15] Performance regression due to XSA-336 Jan Beulich
2021-03-26 11:53   ` Ian Jackson
2021-03-26 13:22     ` Jan Beulich
2021-03-26 13:25       ` Ian Jackson
2021-03-25 11:51 ` [PATCH RESEND] " Roger Pau Monné
2021-03-31  9:06 ` Hongyan Xia
2021-03-31  9:53 ` Hongyan Xia
2021-03-31 14:01   ` Boris Ostrovsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).