* [PATCH] Performance regression due to XSA-336
From: Boris Ostrovsky @ 2021-03-24 20:56 UTC
  To: xen-devel; +Cc: jbeulich, andrew.cooper3, roger.pau, wl, boris.ostrovsky


While running performance tests with recent XSA backports to our product we
discovered a significant regression in TPCC performance. With a particular
guest kernel the numbers dropped by as much as 40%.

We've narrowed that down to the XSA-336 patch, specifically to the pt_migrate
rwlock, and even more specifically to this lock being taken in pt_update_irq().

We have quite a large guest (92 VCPUs) doing lots of VMEXITs, and the theory is
that the lock's cnts atomic is generating lots of cache coherence traffic. As a
quick test of this, replacing pt_vcpu_lock() in pt_update_irq() with just
spin_lock(&v->arch.hvm_vcpu.tm_lock) gets us almost all of the performance back.
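
A minimal sketch of that quick test against the pre-patch helper (upstream
names the field arch.hvm.tm_lock; our older tree uses arch.hvm_vcpu.tm_lock),
simply dropping the rwlock half:

    static void pt_vcpu_lock(struct vcpu *v)
    {
        /* read_lock(&v->domain->arch.hvm.pl_time->pt_migrate); -- dropped */
        spin_lock(&v->arch.hvm.tm_lock);
    }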

Stephen Brennan came up with a new locking algorithm; I just coded it up.

A couple of notes:

* We have only observed the problem and tested this patch for performance on
  a fairly old Xen version. However, vpt code is almost identical and I expect
  upstream to suffer from the same issue.

* Stephen provided the following (slightly edited by me) writeup explaining the
  locking algorithm. I would like to include it somewhere but am not sure what
  the right place would be. The commit message, perhaps?


Currently, every periodic_time is protected by locking the vCPU it is on. You
can think of the per-vCPU lock (tm_lock) as protecting the fields of every
periodic_time which is attached to that vCPU, as well as the list itself, so it
must be held when those fields are read or written, or when an object is added
to or removed from the list.
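
For reference, the relevant relationship, abridged from asm-x86/hvm/vpt.h
(only the fields this writeup talks about are shown):

    struct periodic_time {
        struct list_head list;  /* links the timer into its vCPU's tm_list */
        struct vcpu *vcpu;      /* the vCPU this timer is attached to */
        /* ... remaining fields elided ... */
    };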

It seems that there are three types of access to the periodic_time objects:

1. Functions which read (and maybe write) all periodic_time instances attached
   to a particular vCPU. These are the functions which use pt_vcpu_lock() after
   the XSA-336 fix (commit 8e76aef72820), such as pt_restore_timer(),
   pt_save_timer(), etc.
2. Functions which want to modify a particular periodic_time object. These
   lock whichever vCPU the periodic_time is attached to, but since pt->vcpu
   could be modified without holding any lock, they are vulnerable to the
   XSA-336 race (sketched just after this list). Functions in this group use
   pt_lock(), such as pt_timer_fn() or destroy_periodic_time().
3. Functions which not only want to modify the periodic_time, but also want to
   modify its vcpu field. These are create_periodic_time() and pt_adjust_vcpu().
   They create the race that group 2 is vulnerable to, but we can't simply hold
   two vCPU locks due to the deadlock risk.
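
To make the group 2 hazard concrete, here is the racy pattern in outline (a
sketch, not verbatim code): dereferencing an unstabilized pt->vcpu can end up
taking the wrong lock:

    struct vcpu *v = pt->vcpu;        /* read with no lock held...           */
    spin_lock(&v->arch.hvm.tm_lock);  /* ...pt may have been migrated since,
                                       * so this may be the wrong tm_lock    */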

My proposed option is to add a per-periodic_time spinlock, which protects only
the periodic_time.vcpu field. Whenever reading the vcpu field of a periodic_time
struct, you must first take that lock. The critical sections of group 1 (your
"fast path" functions) would look like this:

1. lock vcpu
2. do whatever you want with pts currently on the vcpu. It is safe to read or write
   fields of pt, because the vcpu lock protects those fields. You simply cannot
   write pt->vcpu, because somebody holding the pt lock may already be spinning
   waiting for your vcpu lock.
3. unlock vcpu
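
In C (matching what pt_update_irq() and friends do with the patch below), a
group 1 critical section is simply:

    spin_lock(&v->arch.hvm.tm_lock);

    list_for_each_entry ( pt, &v->arch.hvm.tm_list, list )
    {
        /* any pt field except pt->vcpu may be read or written here */
    }

    spin_unlock(&v->arch.hvm.tm_lock);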


Note that there is no additional locking in this fast path. For group 2
functions (which are attempting to lock an individual periodic_time), the
critical section would look like this:

1. lock pt lock (stabilizing the vcpu field)
2. lock vcpu
3. feel free to modify any field of the periodic_time
4. unlock vcpu (due to the mutual exclusion of the pt lock, we know that we are
   unlocking the correct vcpu -- we have not been migrated)
5. unlock pt
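
This is exactly the shape of pt_lock()/pt_unlock() in the patch below:

    static void pt_lock(struct periodic_time *pt)
    {
        spin_lock(&pt->lock);                   /* stabilizes pt->vcpu */
        spin_lock(&pt->vcpu->arch.hvm.tm_lock);
    }

    static void pt_unlock(struct periodic_time *pt)
    {
        spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
        spin_unlock(&pt->lock);
    }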

For functions in group 3, the critical section would be:

1. lock pt (stabilizing the vcpu field)
2. lock current vcpu
3. remove from vcpu list
4. unlock vcpu. At this point, you're guaranteed that the vcpu functions
   (callers of pt_vcpu_lock()) are not accessing your pt.
5. assign pt->vcpu  (we still have mutual exclusion against group 2 functions)
6. lock destination vcpu
7. add to vcpu list
8. unlock destination vcpu
9. unlock pt
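
This maps directly onto the reworked pt_adjust_vcpu() in the patch below; the
step numbers from the list above are shown as comments:

    spin_lock(&pt->lock);                      /* 1. stabilize pt->vcpu      */
    spin_lock(&pt->vcpu->arch.hvm.tm_lock);    /* 2. lock current vcpu       */
    if ( pt->on_list )
        list_del(&pt->list);                   /* 3. remove from vcpu list   */
    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);  /* 4. unlock vcpu             */

    pt->vcpu = v;                              /* 5. pt->lock still held     */

    spin_lock(&pt->vcpu->arch.hvm.tm_lock);    /* 6. lock destination vcpu   */
    if ( pt->on_list )
    {
        list_add(&pt->list, &v->arch.hvm.tm_list);   /* 7. add to vcpu list  */
        migrate_timer(&pt->timer, v->processor);
    }
    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);  /* 8. unlock destination vcpu */
    spin_unlock(&pt->lock);                    /* 9. unlock pt               */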

If functions from groups 2 and 3 are infrequent, then you won't see much added
lock overhead in this situation! Plus, even if groups 2 and 3 are somewhat
common, the performance overhead of an uncontended fine-grained lock is much
smaller than the overhead of a heavily contended coarse-grained lock, like the
per-domain rwlock.


Boris Ostrovsky (1):
  x86/vpt: Replace per-guest pt_migrate lock with per pt lock

 xen/arch/x86/emul-i8254.c     |   2 +
 xen/arch/x86/hvm/hpet.c       |   1 +
 xen/arch/x86/hvm/hvm.c        |   2 -
 xen/arch/x86/hvm/rtc.c        |   1 +
 xen/arch/x86/hvm/vlapic.c     |   1 +
 xen/arch/x86/hvm/vpt.c        | 122 +++++++++++++++++++++++-------------------
 xen/include/asm-x86/hvm/vpt.h |   9 +---
 7 files changed, 74 insertions(+), 64 deletions(-)

-- 
1.8.3.1




* [PATCH] x86/vpt: Replace per-guest pt_migrate lock with per pt lock
From: Boris Ostrovsky @ 2021-03-24 20:56 UTC
  To: xen-devel; +Cc: jbeulich, andrew.cooper3, roger.pau, wl, boris.ostrovsky

Commit 8e76aef72820 ("x86/vpt: fix race when migrating timers between
vCPUs") addressed XSA-336 by introducing a per-domain rwlock that was
intended to protect periodic timers during VCPU migration. Since such
migration is an infrequent event, no performance impact was expected.

Unfortunately this turned out not to be the case: on a fairly large guest
(92 VCPUs) we've observed as much as a 40% TPCC performance regression with
some guest kernels. Further investigation pointed to the pt_migrate read
lock taken in pt_update_irq() as the largest contributor to this regression.
With a large number of VCPUs and a large number of VMEXITs (from where
pt_update_irq() is always called), the update of an atomic in read_lock()
is thought to be the main cause.

Stephen Brennan examined the locking pattern and suggested using a per-timer
lock instead. This lock needs to be held whenever there is a chance that the
pt->vcpu field may change (thus avoiding the XSA-336 condition).
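
The resulting nesting rule (also documented in the vpt.h comment below) is
that the per-timer lock is always taken outside the per-vCPU tm_lock, never
the other way around:

    spin_lock(&pt->lock);                    /* first: pin pt->vcpu          */
    spin_lock(&pt->vcpu->arch.hvm.tm_lock);  /* then: that vCPU's timer lock */
    /* ... */
    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
    spin_unlock(&pt->lock);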

Suggested-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
 xen/arch/x86/emul-i8254.c     |   2 +
 xen/arch/x86/hvm/hpet.c       |   1 +
 xen/arch/x86/hvm/hvm.c        |   2 -
 xen/arch/x86/hvm/rtc.c        |   1 +
 xen/arch/x86/hvm/vlapic.c     |   1 +
 xen/arch/x86/hvm/vpt.c        | 122 +++++++++++++++++++++++-------------------
 xen/include/asm-x86/hvm/vpt.h |   9 +---
 7 files changed, 74 insertions(+), 64 deletions(-)

diff --git a/xen/arch/x86/emul-i8254.c b/xen/arch/x86/emul-i8254.c
index 73be4188ad41..d83e727ff35e 100644
--- a/xen/arch/x86/emul-i8254.c
+++ b/xen/arch/x86/emul-i8254.c
@@ -478,6 +478,8 @@ void pit_init(struct domain *d, unsigned long cpu_khz)
     if ( !has_vpit(d) )
         return;
 
+    spin_lock_init(&pit->pt0.lock);
+
     spin_lock_init(&pit->lock);
 
     if ( is_hvm_domain(d) )
diff --git a/xen/arch/x86/hvm/hpet.c b/xen/arch/x86/hvm/hpet.c
index ca94e8b4538c..c7f45412164e 100644
--- a/xen/arch/x86/hvm/hpet.c
+++ b/xen/arch/x86/hvm/hpet.c
@@ -734,6 +734,7 @@ static void hpet_set(HPETState *h)
         h->hpet.timers[i].cmp = ~0ULL;
         h->hpet.comparator64[i] = ~0ULL;
         h->pt[i].source = PTSRC_isa;
+        spin_lock_init(&h->pt[i].lock);
     }
 }
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index e7bcffebc490..b60549a12a33 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -658,8 +658,6 @@ int hvm_domain_initialise(struct domain *d)
     /* need link to containing domain */
     d->arch.hvm.pl_time->domain = d;
 
-    rwlock_init(&d->arch.hvm.pl_time->pt_migrate);
-
     /* Set the default IO Bitmap. */
     if ( is_hardware_domain(d) )
     {
diff --git a/xen/arch/x86/hvm/rtc.c b/xen/arch/x86/hvm/rtc.c
index 3150f5f1479b..6289d972bb67 100644
--- a/xen/arch/x86/hvm/rtc.c
+++ b/xen/arch/x86/hvm/rtc.c
@@ -842,6 +842,7 @@ void rtc_init(struct domain *d)
     }
 
     spin_lock_init(&s->lock);
+    spin_lock_init(&s->pt.lock);
 
     init_timer(&s->update_timer, rtc_update_timer, s, smp_processor_id());
     init_timer(&s->update_timer2, rtc_update_timer2, s, smp_processor_id());
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 5e21fb4937d9..8413e41a7a80 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1626,6 +1626,7 @@ int vlapic_init(struct vcpu *v)
     vlapic_reset(vlapic);
 
     spin_lock_init(&vlapic->esr_lock);
+    spin_lock_init(&vlapic->pt.lock);
 
     tasklet_init(&vlapic->init_sipi.tasklet, vlapic_init_sipi_action, v);
 
diff --git a/xen/arch/x86/hvm/vpt.c b/xen/arch/x86/hvm/vpt.c
index 4c2afe2e9154..36d4699a5de6 100644
--- a/xen/arch/x86/hvm/vpt.c
+++ b/xen/arch/x86/hvm/vpt.c
@@ -153,32 +153,16 @@ static int pt_irq_masked(struct periodic_time *pt)
     return 1;
 }
 
-static void pt_vcpu_lock(struct vcpu *v)
-{
-    read_lock(&v->domain->arch.hvm.pl_time->pt_migrate);
-    spin_lock(&v->arch.hvm.tm_lock);
-}
-
-static void pt_vcpu_unlock(struct vcpu *v)
-{
-    spin_unlock(&v->arch.hvm.tm_lock);
-    read_unlock(&v->domain->arch.hvm.pl_time->pt_migrate);
-}
-
 static void pt_lock(struct periodic_time *pt)
 {
-    /*
-     * We cannot use pt_vcpu_lock here, because we need to acquire the
-     * per-domain lock first and then (re-)fetch the value of pt->vcpu, or
-     * else we might be using a stale value of pt->vcpu.
-     */
-    read_lock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);
+    spin_lock(&pt->lock);
     spin_lock(&pt->vcpu->arch.hvm.tm_lock);
 }
 
 static void pt_unlock(struct periodic_time *pt)
 {
-    pt_vcpu_unlock(pt->vcpu);
+    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
+    spin_unlock(&pt->lock);
 }
 
 static void pt_process_missed_ticks(struct periodic_time *pt)
@@ -228,7 +212,7 @@ void pt_save_timer(struct vcpu *v)
     if ( v->pause_flags & VPF_blocked )
         return;
 
-    pt_vcpu_lock(v);
+    spin_lock(&v->arch.hvm.tm_lock);
 
     list_for_each_entry ( pt, head, list )
         if ( !pt->do_not_freeze )
@@ -236,7 +220,7 @@ void pt_save_timer(struct vcpu *v)
 
     pt_freeze_time(v);
 
-    pt_vcpu_unlock(v);
+    spin_unlock(&v->arch.hvm.tm_lock);
 }
 
 void pt_restore_timer(struct vcpu *v)
@@ -244,7 +228,7 @@ void pt_restore_timer(struct vcpu *v)
     struct list_head *head = &v->arch.hvm.tm_list;
     struct periodic_time *pt;
 
-    pt_vcpu_lock(v);
+    spin_lock(&v->arch.hvm.tm_lock);
 
     list_for_each_entry ( pt, head, list )
     {
@@ -257,7 +241,7 @@ void pt_restore_timer(struct vcpu *v)
 
     pt_thaw_time(v);
 
-    pt_vcpu_unlock(v);
+    spin_unlock(&v->arch.hvm.tm_lock);
 }
 
 static void pt_timer_fn(void *data)
@@ -318,7 +302,7 @@ int pt_update_irq(struct vcpu *v)
     int irq, pt_vector = -1;
     bool level;
 
-    pt_vcpu_lock(v);
+    spin_lock(&v->arch.hvm.tm_lock);
 
     earliest_pt = NULL;
     max_lag = -1ULL;
@@ -348,7 +332,7 @@ int pt_update_irq(struct vcpu *v)
 
     if ( earliest_pt == NULL )
     {
-        pt_vcpu_unlock(v);
+        spin_unlock(&v->arch.hvm.tm_lock);
         return -1;
     }
 
@@ -356,7 +340,7 @@ int pt_update_irq(struct vcpu *v)
     irq = earliest_pt->irq;
     level = earliest_pt->level;
 
-    pt_vcpu_unlock(v);
+    spin_unlock(&v->arch.hvm.tm_lock);
 
     switch ( earliest_pt->source )
     {
@@ -403,7 +387,7 @@ int pt_update_irq(struct vcpu *v)
                 time_cb *cb = NULL;
                 void *cb_priv = NULL;
 
-                pt_vcpu_lock(v);
+                spin_lock(&v->arch.hvm.tm_lock);
                 /* Make sure the timer is still on the list. */
                 list_for_each_entry ( pt, &v->arch.hvm.tm_list, list )
                     if ( pt == earliest_pt )
@@ -413,7 +397,7 @@ int pt_update_irq(struct vcpu *v)
                         cb_priv = pt->priv;
                         break;
                     }
-                pt_vcpu_unlock(v);
+                spin_unlock(&v->arch.hvm.tm_lock);
 
                 if ( cb != NULL )
                     cb(v, cb_priv);
@@ -450,12 +434,12 @@ void pt_intr_post(struct vcpu *v, struct hvm_intack intack)
     if ( intack.source == hvm_intsrc_vector )
         return;
 
-    pt_vcpu_lock(v);
+    spin_lock(&v->arch.hvm.tm_lock);
 
     pt = is_pt_irq(v, intack);
     if ( pt == NULL )
     {
-        pt_vcpu_unlock(v);
+        spin_unlock(&v->arch.hvm.tm_lock);
         return;
     }
 
@@ -464,7 +448,7 @@ void pt_intr_post(struct vcpu *v, struct hvm_intack intack)
     cb = pt->cb;
     cb_priv = pt->priv;
 
-    pt_vcpu_unlock(v);
+    spin_unlock(&v->arch.hvm.tm_lock);
 
     if ( cb != NULL )
         cb(v, cb_priv);
@@ -475,12 +459,34 @@ void pt_migrate(struct vcpu *v)
     struct list_head *head = &v->arch.hvm.tm_list;
     struct periodic_time *pt;
 
-    pt_vcpu_lock(v);
+    spin_lock(&v->arch.hvm.tm_lock);
 
     list_for_each_entry ( pt, head, list )
         migrate_timer(&pt->timer, v->processor);
 
-    pt_vcpu_unlock(v);
+    spin_unlock(&v->arch.hvm.tm_lock);
+}
+
+static void __destroy_periodic_time(struct periodic_time *pt, bool locked)
+{
+    /* Was this structure previously initialised by create_periodic_time()? */
+    if ( pt->vcpu == NULL )
+        return;
+
+    if ( !locked )
+        pt_lock(pt);
+    if ( pt->on_list )
+        list_del(&pt->list);
+    pt->on_list = 0;
+    pt->pending_intr_nr = 0;
+    if ( !locked )
+        pt_unlock(pt);
+
+    /*
+     * pt_timer_fn() can run until this kill_timer() returns. We must do this
+     * outside pt_lock() otherwise we can deadlock with pt_timer_fn().
+     */
+    kill_timer(&pt->timer);
 }
 
 void create_periodic_time(
@@ -497,9 +503,16 @@ void create_periodic_time(
         return;
     }
 
-    destroy_periodic_time(pt);
+    spin_lock(&pt->lock);
 
-    write_lock(&v->domain->arch.hvm.pl_time->pt_migrate);
+    if ( pt->vcpu )
+    {
+        spin_lock(&pt->vcpu->arch.hvm.tm_lock);
+
+        __destroy_periodic_time(pt, true);
+
+        spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
+    }
 
     pt->pending_intr_nr = 0;
     pt->do_not_freeze = 0;
@@ -543,33 +556,22 @@ void create_periodic_time(
     pt->cb = cb;
     pt->priv = data;
 
+    spin_lock(&pt->vcpu->arch.hvm.tm_lock);
+
     pt->on_list = 1;
     list_add(&pt->list, &v->arch.hvm.tm_list);
 
+    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
+
     init_timer(&pt->timer, pt_timer_fn, pt, v->processor);
     set_timer(&pt->timer, pt->scheduled);
 
-    write_unlock(&v->domain->arch.hvm.pl_time->pt_migrate);
+    spin_unlock(&pt->lock);
 }
 
 void destroy_periodic_time(struct periodic_time *pt)
 {
-    /* Was this structure previously initialised by create_periodic_time()? */
-    if ( pt->vcpu == NULL )
-        return;
-
-    pt_lock(pt);
-    if ( pt->on_list )
-        list_del(&pt->list);
-    pt->on_list = 0;
-    pt->pending_intr_nr = 0;
-    pt_unlock(pt);
-
-    /*
-     * pt_timer_fn() can run until this kill_timer() returns. We must do this
-     * outside pt_lock() otherwise we can deadlock with pt_timer_fn().
-     */
-    kill_timer(&pt->timer);
+    __destroy_periodic_time(pt, false);
 }
 
 static void pt_adjust_vcpu(struct periodic_time *pt, struct vcpu *v)
@@ -579,15 +581,25 @@ static void pt_adjust_vcpu(struct periodic_time *pt, struct vcpu *v)
     if ( pt->vcpu == NULL )
         return;
 
-    write_lock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);
+    spin_lock(&pt->lock);
+    spin_lock(&pt->vcpu->arch.hvm.tm_lock);
+
+    if ( pt->on_list )
+        list_del(&pt->list);
+
+    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
+
     pt->vcpu = v;
+
+    spin_lock(&pt->vcpu->arch.hvm.tm_lock);
     if ( pt->on_list )
     {
-        list_del(&pt->list);
         list_add(&pt->list, &v->arch.hvm.tm_list);
         migrate_timer(&pt->timer, v->processor);
     }
-    write_unlock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);
+
+    spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
+    spin_unlock(&pt->lock);
 }
 
 void pt_adjust_global_vcpu_target(struct vcpu *v)
diff --git a/xen/include/asm-x86/hvm/vpt.h b/xen/include/asm-x86/hvm/vpt.h
index 39d26cbda496..b0f4af25828b 100644
--- a/xen/include/asm-x86/hvm/vpt.h
+++ b/xen/include/asm-x86/hvm/vpt.h
@@ -49,6 +49,8 @@ struct periodic_time {
     u64 last_plt_gtime;         /* platform time when last IRQ is injected */
     struct timer timer;         /* ac_timer */
     time_cb *cb;
+    spinlock_t lock;            /* protects vcpu field during PT migration. */
+                                /* Needs to be taken before VCPU's tm_lock. */
     void *priv;                 /* point back to platform time source */
 };
 
@@ -128,13 +130,6 @@ struct pl_time {    /* platform time */
     struct RTCState  vrtc;
     struct HPETState vhpet;
     struct PMTState  vpmt;
-    /*
-     * rwlock to prevent periodic_time vCPU migration. Take the lock in read
-     * mode in order to prevent the vcpu field of periodic_time from changing.
-     * Lock must be taken in write mode when changes to the vcpu field are
-     * performed, as it allows exclusive access to all the timers of a domain.
-     */
-    rwlock_t pt_migrate;
     /* guest_time = Xen sys time + stime_offset */
     int64_t stime_offset;
     /* Ensures monotonicity in appropriate timer modes. */
-- 
1.8.3.1



