linux-kernel.vger.kernel.org archive mirror
* [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock
@ 2015-10-30 23:26 Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 1/6] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

v8->v9:
 - Added a new patch 2 that prefetches the cacheline of the next
   MCS node in order to reduce the MCS unlock latency when it is
   time to do the unlock.
 - Changed the slowpath statistics counters implementation in patch
   4 from atomic_t to per-cpu variables to reduce performance overhead
   and used sysfs instead of debugfs to return the consolidated counts
   and data.

v7->v8:
 - Annotated the use of each _acquire/_release variants in qspinlock.c.
 - Used the available pending bit in the lock stealing patch to disable
   lock stealing when the queue head vCPU is actively spinning on the
   lock to avoid lock starvation.
 - Restructured the lock stealing patch to reduce code duplication.
 - Verified that the waitcnt processing will be compiled away if
   QUEUED_LOCK_STAT isn't enabled.

v6->v7:
 - Removed arch/x86/include/asm/qspinlock.h from patch 1.
 - Removed the unconditional PV kick patch as it has been merged
   into tip.
 - Changed the pvstat_inc() API to add a new condition parameter.
 - Added comments and rearranged code in patch 4 to clarify where
   lock stealing happens.
 - In patch 5, removed the check for pv_wait count when deciding when
   to wait early.
 - Updated copyrights and email address.

v5->v6:
 - Added a new patch 1 to relax the cmpxchg and xchg operations in
   the native code path to reduce performance overhead on non-x86
   architectures.
 - Updated the unconditional PV kick patch as suggested by PeterZ.
 - Added a new patch to allow one lock stealing attempt at slowpath
   entry point to reduce performance penalty due to lock waiter
   preemption.
 - Removed the pending bit and kick-ahead patches as they didn't show
   any noticeable performance improvement on top of the lock stealing
   patch.
 - Simplified the adaptive spinning patch as the lock stealing patch
   allows more aggressive pv_wait() without much performance penalty
   in non-overcommitted VMs.

v4->v5:
 - Rebased the patch to the latest tip tree.
 - Corrected the comments and commit log for patch 1.
 - Removed the v4 patch 5 as PV kick deferment is no longer needed with
   the new tip tree.
 - Simplified the adaptive spinning patch (patch 6) & improved its
   performance a bit further.
 - Re-ran the benchmark test with the new patch.

v3->v4:
 - Patch 1: add a comment about a possible race condition in PV unlock.
 - Patch 2: simplified the pv_pending_lock() function as suggested by
   Davidlohr.
 - Move PV unlock optimization patch forward to patch 4 & rerun
   performance test.

v2->v3:
 - Moved the deferred kicking enablement patch forward & moved back
   the kick-ahead patch to make the effect of kick-ahead more visible.
 - Reworked patch 6 to make it more readable.
 - Reverted to using state as a tri-state variable instead of adding
   an additional bi-state variable.
 - Added performance data for different values of PV_KICK_AHEAD_MAX.
 - Added a new patch to optimize PV unlock code path performance.

v1->v2:
 - Take out the queued unfair lock patches
 - Add a patch to simplify the PV unlock code
 - Move pending bit and statistics collection patches to the front
 - Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
   when appropriate.
 - Change the wait-early patch to use adaptive spinning to better
   balance the differing effects on normal and over-committed guests.
 - Add patch-to-patch performance changes in the patch commit logs.

This patchset tries to improve the performance of both regular and
over-committed VM guests. The adaptive spinning patch was inspired
by the "Do Virtual Machines Really Scale?" blog from Sanidhya Kashyap.

Patch 1 relaxes the memory ordering restriction on atomic operations by
using the less restrictive _acquire and _release variants of cmpxchg()
and xchg(). This reduces the memory barrier overhead when the code is
ported to other non-x86 architectures.
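
In code terms, the change boils down to replacements like the following
(condensed from the trylock fastpath in patch 1 below; only acquire
ordering is needed on success):

/*
 * Condensed from patch 1: the trylock fastpath only needs acquire
 * ordering on success, so the fully ordered atomic_cmpxchg() can be
 * replaced by atomic_cmpxchg_acquire(). On x86 the generated code is
 * unchanged; on architectures with weaker memory models a full
 * barrier is avoided.
 */
static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	if (!atomic_read(&lock->val) &&
	   (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
		return 1;
	return 0;
}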

Patch 2 attempts to prefetch the cacheline of the next MCS node to
reduce latency in the MCS unlock operation.
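
A condensed sketch of the mechanism (the helper name prefetch_next_node()
is used here for illustration only; patch 2 open-codes these lines in
queued_spin_lock_slowpath()):

/*
 * Sketch only: while queued behind a previous node, opportunistically
 * load the next pointer and prefetch its cacheline for write so that
 * the later arch_mcs_spin_unlock_contended(&next->locked) is less
 * likely to stall on a cache miss.
 */
static inline struct mcs_spinlock *prefetch_next_node(struct mcs_spinlock *node)
{
	struct mcs_spinlock *next = READ_ONCE(node->next);

	if (next)
		prefetchw(next);
	return next;
}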

Patch 3 optimizes the PV unlock code path performance for x86-64
architecture.

Patch 4 allows the collection of various slowpath statistics counters
that are useful for seeing what is happening in the system. Per-cpu
counters are used to minimize the performance overhead.
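
The counter scheme follows the usual per-cpu pattern; a condensed sketch
of what patch 4 implements (qstat_sum() is a made-up name here, the
actual summation is done inline in the sysfs show functions):

static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);

/* Hot path: a single per-cpu increment, no atomic or shared-cacheline cost. */
static inline void qstat_inc(enum qlock_stats stat, bool cond)
{
	if (cond)
		this_cpu_inc(qstats[stat]);
}

/* Slow path: only a sysfs read pays for summing across CPUs. */
static u64 qstat_sum(int idx)
{
	u64 sum = 0;
	int cpu;

	for_each_online_cpu(cpu)
		sum += per_cpu(qstats[idx], cpu);
	return sum;
}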

Patch 5 allows one lock stealing attempt at slowpath entry. This causes
a pretty big performance improvement for over-committed VM guests.
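
The stealing attempt itself is a single unfair trylock at slowpath entry
(condensed from patch 5); the queue head sets the pending bit while it is
actively spinning, which shuts stealing off and prevents starvation:

static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	/* Steal only if neither the locked byte nor the pending bit is set. */
	return !(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
		(cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0);
}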

Patch 6 enables adaptive spinning in the queue nodes. This patch
leads to a further performance improvement in over-committed guests,
though the gain is not as big as that of the previous patch.
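
The wait-early check is cheap (condensed from patch 6): every
PV_PREV_CHECK_MASK+1 spin iterations the queued vCPU peeks at the state
of the vCPU ahead of it and stops spinning if that vCPU is not running:

static inline bool pv_wait_early(struct pv_node *prev, int loop)
{
	if ((loop & PV_PREV_CHECK_MASK) != 0)
		return false;

	return READ_ONCE(prev->state) != vcpu_running;
}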

Waiman Long (6):
  locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg
  locking/qspinlock: prefetch next node cacheline
  locking/pvqspinlock, x86: Optimize PV unlock code path
  locking/pvqspinlock: Collect slowpath lock statistics
  locking/pvqspinlock: Allow 1 lock stealing attempt
  locking/pvqspinlock: Queue node adaptive spinning

 arch/x86/Kconfig                          |    8 +
 arch/x86/include/asm/qspinlock_paravirt.h |   59 ++++++
 include/asm-generic/qspinlock.h           |    9 +-
 kernel/locking/qspinlock.c                |   99 +++++++---
 kernel/locking/qspinlock_paravirt.h       |  221 +++++++++++++++++----
 kernel/locking/qspinlock_stat.h           |  310 +++++++++++++++++++++++++++++
 6 files changed, 638 insertions(+), 68 deletions(-)
 create mode 100644 kernel/locking/qspinlock_stat.h



* [PATCH tip/locking/core v9 1/6] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline Waiman Long
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

This patch replaces the cmpxchg() and xchg() calls in the native
qspinlock code with the more relaxed _acquire or _release versions of
those calls to enable other architectures to adopt queued spinlocks
with less memory barrier performance overhead.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/asm-generic/qspinlock.h |    9 +++++----
 kernel/locking/qspinlock.c      |   29 ++++++++++++++++++++++++-----
 2 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index e2aadbc..39e1cb2 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -12,8 +12,9 @@
  * GNU General Public License for more details.
  *
  * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
  *
- * Authors: Waiman Long <waiman.long@hp.com>
+ * Authors: Waiman Long <waiman.long@hpe.com>
  */
 #ifndef __ASM_GENERIC_QSPINLOCK_H
 #define __ASM_GENERIC_QSPINLOCK_H
@@ -62,7 +63,7 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
 	if (!atomic_read(&lock->val) &&
-	   (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+	   (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
 		return 1;
 	return 0;
 }
@@ -77,7 +78,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
 	u32 val;
 
-	val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
+	val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
 	if (likely(val == 0))
 		return;
 	queued_spin_lock_slowpath(lock, val);
@@ -93,7 +94,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
 	/*
 	 * smp_mb__before_atomic() in order to guarantee release semantics
 	 */
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_sub(_Q_LOCKED_VAL, &lock->val);
 }
 #endif
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 87e9ce6..7868418 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -14,8 +14,9 @@
  * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
  * (C) Copyright 2013-2014 Red Hat, Inc.
  * (C) Copyright 2015 Intel Corp.
+ * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
  *
- * Authors: Waiman Long <waiman.long@hp.com>
+ * Authors: Waiman Long <waiman.long@hpe.com>
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
@@ -176,7 +177,12 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
 	struct __qspinlock *l = (void *)lock;
 
-	return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+	/*
+	 * Use release semantics to make sure that the MCS node is properly
+	 * initialized before changing the tail code.
+	 */
+	return (u32)xchg_release(&l->tail,
+				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
 #else /* _Q_PENDING_BITS == 8 */
@@ -208,7 +214,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 
 	for (;;) {
 		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
-		old = atomic_cmpxchg(&lock->val, val, new);
+		/*
+		 * Use release semantics to make sure that the MCS node is
+		 * properly initialized before changing the tail code.
+		 */
+		old = atomic_cmpxchg_release(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -319,7 +329,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		if (val == new)
 			new |= _Q_PENDING_VAL;
 
-		old = atomic_cmpxchg(&lock->val, val, new);
+		/*
+		 * Acquire semantic is required here as the function may
+		 * return immediately if the lock was free.
+		 */
+		old = atomic_cmpxchg_acquire(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -426,7 +440,12 @@ queue:
 			set_locked(lock);
 			break;
 		}
-		old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
+		/*
+		 * The smp_load_acquire() call above has provided the necessary
+		 * acquire semantics required for locking. At most two
+		 * iterations of this loop may be ran.
+		 */
+		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
 		if (old == val)
 			goto release;	/* No contention */
 
-- 
1.7.1



* [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 1/6] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-11-02 16:36   ` Peter Zijlstra
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

A queue head CPU, after acquiring the lock, will have to notify
the next CPU in the wait queue that it has become the new queue
head. This involves loading a new cacheline from the MCS node of the
next CPU. That operation can be expensive and adds to the latency of
the locking operation.

This patch adds code to optimistically prefetch the next MCS node
cacheline if the next pointer is defined and the CPU has been spinning
on the MCS lock for a while. This reduces the locking latency and
improves the system throughput.

Using a locking microbenchmark on a Haswell-EX system, this patch
can improve throughput by about 5%.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7868418..c1c8a1a 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -396,6 +396,7 @@ queue:
 	 * p,*,* -> n,*,*
 	 */
 	old = xchg_tail(lock, tail);
+	next = NULL;
 
 	/*
 	 * if there was a previous node; link it and wait until reaching the
@@ -407,6 +408,16 @@ queue:
 
 		pv_wait_node(node);
 		arch_mcs_spin_lock_contended(&node->locked);
+
+		/*
+		 * While waiting for the MCS lock, the next pointer may have
+		 * been set by another lock waiter. We optimistically load
+		 * the next pointer & prefetch the cacheline for writing
+		 * to reduce latency in the upcoming MCS unlock operation.
+		 */
+		next = READ_ONCE(node->next);
+		if (next)
+			prefetchw(next);
 	}
 
 	/*
@@ -426,6 +437,15 @@ queue:
 		cpu_relax();
 
 	/*
+	 * If the next pointer is defined, we are not tail anymore.
+	 * In this case, claim the spinlock & release the MCS lock.
+	 */
+	if (next) {
+		set_locked(lock);
+		goto mcs_unlock;
+	}
+
+	/*
 	 * claim the lock:
 	 *
 	 * n,0,0 -> 0,0,1 : lock, uncontended
@@ -458,6 +478,7 @@ queue:
 	while (!(next = READ_ONCE(node->next)))
 		cpu_relax();
 
+mcs_unlock:
 	arch_mcs_spin_unlock_contended(&next->locked);
 	pv_kick_node(lock, next);
 
-- 
1.7.1



* [PATCH tip/locking/core v9 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 1/6] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

The unlock function in queued spinlocks was optimized for better
performance on bare metal systems at the expense of virtualized guests.

For x86-64 systems, the unlock call needs to go through a
PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
registers before calling the real __pv_queued_spin_unlock()
function. The thunk code may also be in a separate cacheline from
__pv_queued_spin_unlock().

This patch optimizes the PV unlock code path by:
 1) Moving the unlock slowpath code from the fastpath into a separate
    __pv_queued_spin_unlock_slowpath() function to make the fastpath
    as simple as possible.
 2) For x86-64, hand-coding an assembly function that combines the
    register-saving thunk code with the fastpath code. Only registers
    that are used in the fastpath will be saved and restored. If the
    fastpath fails, the slowpath function will be called via another
    PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
    __pv_queued_spin_unlock() code as the thunk saves and restores
    only one 32-bit register.

With a microbenchmark running 5M lock-unlock loops, the table below
shows the execution times before and after the patch with different
numbers of threads in a VM running on a 32-core Westmere-EX box with
x86-64 4.2-rc1 based kernels:

  Threads	Before patch	After patch	% Change
  -------	------------	-----------	--------
     1		   134.1 ms	  119.3 ms	  -11%
     2		   1286  ms	   953  ms	  -26%
     3		   3715  ms	  3480  ms	  -6.3%
     4		   4092  ms	  3764  ms	  -8.0%
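
The microbenchmark itself is not part of this posting; a minimal sketch
of what one such per-thread 5M lock-unlock loop could look like (all
names below are illustrative only) is:

static DEFINE_SPINLOCK(test_lock);	/* backed by a qspinlock on x86 */

static int lock_unlock_thread(void *data)
{
	u64 start = sched_clock();
	unsigned long i;

	for (i = 0; i < 5000000; i++) {
		spin_lock(&test_lock);
		spin_unlock(&test_lock);
	}
	pr_info("5M lock-unlock loop took %llu ns\n", sched_clock() - start);
	return 0;
}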

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/include/asm/qspinlock_paravirt.h |   59 +++++++++++++++++++++++++++++
 kernel/locking/qspinlock_paravirt.h       |   43 +++++++++++++--------
 2 files changed, 86 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index b002e71..9f92c18 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -1,6 +1,65 @@
 #ifndef __ASM_QSPINLOCK_PARAVIRT_H
 #define __ASM_QSPINLOCK_PARAVIRT_H
 
+/*
+ * For x86-64, PV_CALLEE_SAVE_REGS_THUNK() saves and restores 8 64-bit
+ * registers. For i386, however, only 1 32-bit register needs to be saved
+ * and restored. So an optimized version of __pv_queued_spin_unlock() is
+ * hand-coded for 64-bit, but it isn't worthwhile to do it for 32-bit.
+ */
+#ifdef CONFIG_64BIT
+
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
+#define __pv_queued_spin_unlock	__pv_queued_spin_unlock
+#define PV_UNLOCK		"__raw_callee_save___pv_queued_spin_unlock"
+#define PV_UNLOCK_SLOWPATH	"__raw_callee_save___pv_queued_spin_unlock_slowpath"
+
+/*
+ * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
+ * which combines the registers saving trunk and the body of the following
+ * C code:
+ *
+ * void __pv_queued_spin_unlock(struct qspinlock *lock)
+ * {
+ *	struct __qspinlock *l = (void *)lock;
+ *	u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *
+ *	if (likely(lockval == _Q_LOCKED_VAL))
+ *		return;
+ *	pv_queued_spin_unlock_slowpath(lock, lockval);
+ * }
+ *
+ * For x86-64,
+ *   rdi = lock              (first argument)
+ *   rsi = lockval           (second argument)
+ *   rdx = internal variable (set to 0)
+ */
+asm    (".pushsection .text;"
+	".globl " PV_UNLOCK ";"
+	".align 4,0x90;"
+	PV_UNLOCK ": "
+	"push  %rdx;"
+	"mov   $0x1,%eax;"
+	"xor   %edx,%edx;"
+	"lock cmpxchg %dl,(%rdi);"
+	"cmp   $0x1,%al;"
+	"jne   .slowpath;"
+	"pop   %rdx;"
+	"ret;"
+	".slowpath: "
+	"push   %rsi;"
+	"movzbl %al,%esi;"
+	"call " PV_UNLOCK_SLOWPATH ";"
+	"pop    %rsi;"
+	"pop    %rdx;"
+	"ret;"
+	".size " PV_UNLOCK ", .-" PV_UNLOCK ";"
+	".popsection");
+
+#else /* CONFIG_64BIT */
+
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
 PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock);
 
+#endif /* CONFIG_64BIT */
 #endif
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index f0450ff..4bd323d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -308,23 +308,14 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 }
 
 /*
- * PV version of the unlock function to be used in stead of
- * queued_spin_unlock().
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
  */
-__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+__visible void
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 {
 	struct __qspinlock *l = (void *)lock;
 	struct pv_node *node;
-	u8 locked;
-
-	/*
-	 * We must not unlock if SLOW, because in that case we must first
-	 * unhash. Otherwise it would be possible to have multiple @lock
-	 * entries, which would be BAD.
-	 */
-	locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
-	if (likely(locked == _Q_LOCKED_VAL))
-		return;
 
 	if (unlikely(locked != _Q_SLOW_VAL)) {
 		WARN(!debug_locks_silent,
@@ -363,12 +354,32 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 	 */
 	pv_kick(node->cpu);
 }
+
 /*
  * Include the architecture specific callee-save thunk of the
  * __pv_queued_spin_unlock(). This thunk is put together with
- * __pv_queued_spin_unlock() near the top of the file to make sure
- * that the callee-save thunk and the real unlock function are close
- * to each other sharing consecutive instruction cachelines.
+ * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
+ * function close to each other sharing consecutive instruction cachelines.
+ * Alternatively, architecture specific version of __pv_queued_spin_unlock()
+ * can be defined.
  */
 #include <asm/qspinlock_paravirt.h>
 
+#ifndef __pv_queued_spin_unlock
+__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+	u8 locked;
+
+	/*
+	 * We must not unlock if SLOW, because in that case we must first
+	 * unhash. Otherwise it would be possible to have multiple @lock
+	 * entries, which would be BAD.
+	 */
+	locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+	if (likely(locked == _Q_LOCKED_VAL))
+		return;
+
+	__pv_queued_spin_unlock_slowpath(lock, locked);
+}
+#endif /* __pv_queued_spin_unlock */
-- 
1.7.1



* [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
                   ` (2 preceding siblings ...)
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-11-02 16:40   ` Peter Zijlstra
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
  5 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

This patch enables the accumulation of kicking- and waiting-related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of data that
allows us to calculate the kicking and wakeup latencies, which
depend heavily on the CPUs being used.

The statistical counters are per-cpu variables to minimize the
performance overhead in their updates. These counters are exported
via the sysfs filesystem under the /sys/kernel/qlockstat directory.
When the corresponding sysfs files are read, summation and computing
of the required data are then performed.

The measured latencies for different CPUs are:

	CPU		Wakeup		Kicking
	---		------		-------
	Haswell-EX	63.6us		 7.4us
	Westmere-EX	67.6us		 9.3us

The measured latencies varied a bit from run to run. The wakeup
latency is much higher than the kicking latency.

A sample of statistics counts after system bootup (with vCPU
overcommit) was:

pv_hash_hops=1.00
pv_kick_unlock=1148
pv_kick_wake=1146
pv_latency_kick=11040
pv_latency_wake=194840
pv_spurious_wakeup=7
pv_wait_again=4
pv_wait_head=23
pv_wait_node=1129

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/Kconfig                    |    8 +
 kernel/locking/qspinlock_paravirt.h |   32 ++++-
 kernel/locking/qspinlock_stat.h     |  291 +++++++++++++++++++++++++++++++++++
 3 files changed, 326 insertions(+), 5 deletions(-)
 create mode 100644 kernel/locking/qspinlock_stat.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4a9b9a9..403bfea 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -688,6 +688,14 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer Y.
 
+config QUEUED_LOCK_STAT
+	bool "Paravirt queued spinlock statistics"
+	depends on PARAVIRT_SPINLOCKS && SYSFS && QUEUED_SPINLOCKS
+	---help---
+	  Enable the collection of statistical data on the slowpath
+	  behavior of paravirtualized queued spinlocks and report
+	  them on sysfs.
+
 source "arch/x86/xen/Kconfig"
 
 config KVM_GUEST
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 4bd323d..aaeeefb 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -41,6 +41,11 @@ struct pv_node {
 };
 
 /*
+ * Include queued spinlock statistics code
+ */
+#include "qspinlock_stat.h"
+
+/*
  * Lock and MCS node addresses hash table for fast lookup
  *
  * Hashing is done on a per-cacheline basis to minimize the need to access
@@ -100,10 +105,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
 {
 	unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
 	struct pv_hash_entry *he;
+	int hopcnt = 0;
 
 	for_each_hash_entry(he, offset, hash) {
+		hopcnt++;
 		if (!cmpxchg(&he->lock, NULL, lock)) {
 			WRITE_ONCE(he->node, node);
+			qstat_hop(hopcnt);
 			return &he->lock;
 		}
 	}
@@ -164,9 +172,11 @@ static void pv_init_node(struct mcs_spinlock *node)
 static void pv_wait_node(struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
+	int waitcnt = 0;
 	int loop;
 
-	for (;;) {
+	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
+	for (;; waitcnt++) {
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
 				return;
@@ -184,12 +194,16 @@ static void pv_wait_node(struct mcs_spinlock *node)
 		 */
 		smp_store_mb(pn->state, vcpu_halted);
 
-		if (!READ_ONCE(node->locked))
+		if (!READ_ONCE(node->locked)) {
+			qstat_inc(qstat_pv_wait_node, true);
+			qstat_inc(qstat_pv_wait_again, waitcnt);
 			pv_wait(&pn->state, vcpu_halted);
+		}
 
 		/*
-		 * If pv_kick_node() changed us to vcpu_hashed, retain that value
-		 * so that pv_wait_head() knows to not also try to hash this lock.
+		 * If pv_kick_node() changed us to vcpu_hashed, retain that
+		 * value so that pv_wait_head() knows to not also try to hash
+		 * this lock.
 		 */
 		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
 
@@ -200,6 +214,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
 		 * So it is better to spin for a while in the hope that the
 		 * MCS lock will be released soon.
 		 */
+		qstat_inc(qstat_pv_spurious_wakeup, !READ_ONCE(node->locked));
 	}
 
 	/*
@@ -250,6 +265,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 	struct pv_node *pn = (struct pv_node *)node;
 	struct __qspinlock *l = (void *)lock;
 	struct qspinlock **lp = NULL;
+	int waitcnt = 0;
 	int loop;
 
 	/*
@@ -259,7 +275,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 	if (READ_ONCE(pn->state) == vcpu_hashed)
 		lp = (struct qspinlock **)1;
 
-	for (;;) {
+	for (;; waitcnt++) {
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
 			if (!READ_ONCE(l->locked))
 				return;
@@ -290,14 +306,19 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 				return;
 			}
 		}
+		qstat_inc(qstat_pv_wait_head, true);
+		qstat_inc(qstat_pv_wait_again, waitcnt);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
 
+		if (!READ_ONCE(l->locked))
+			return;
 		/*
 		 * The unlocker should have freed the lock before kicking the
 		 * CPU. So if the lock is still not free, it is a spurious
 		 * wakeup and so the vCPU should wait again after spinning for
 		 * a while.
 		 */
+		qstat_inc(qstat_pv_spurious_wakeup, true);
 	}
 
 	/*
@@ -352,6 +373,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * vCPU is harmless other than the additional latency in completing
 	 * the unlock.
 	 */
+	qstat_inc(qstat_pv_kick_unlock, true);
 	pv_kick(node->cpu);
 }
 
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
new file mode 100644
index 0000000..16b84b2
--- /dev/null
+++ b/kernel/locking/qspinlock_stat.h
@@ -0,0 +1,291 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+
+/*
+ * When queued spinlock statistics is enabled, the following sysfs files
+ * will be created to hold the statistics counters:
+ *
+ * /sys/kernel/qlockstat/
+ *   pv_hash_hops	- average # of hops per hashing operation
+ *   pv_kick_unlock	- # of vCPU kicks issued at unlock time
+ *   pv_kick_wake	- # of vCPU kicks used for computing pv_latency_wake
+ *   pv_latency_kick	- average latency (ns) of vCPU kick operation
+ *   pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
+ *   pv_spurious_wakeup	- # of spurious wakeups
+ *   pv_wait_again	- # of vCPU wait's that happened after a vCPU kick
+ *   pv_wait_head	- # of vCPU wait's at the queue head
+ *   pv_wait_node	- # of vCPU wait's at a non-head queue node
+ *
+ * Writing to the "reset_counters" file will reset all the above counter
+ * values.
+ *
+ * These statistics counters are implemented as per-cpu variables which are
+ * summed and computed whenever the corresponding sysfs files are read. This
+ * minimizes added overhead making the counters usable even in a production
+ * environment.
+ *
+ * There may be slight difference between pv_kick_wake and pv_kick_unlock.
+ */
+enum qlock_stats {
+	qstat_pv_hash_hops,
+	qstat_pv_kick_unlock,
+	qstat_pv_kick_wake,
+	qstat_pv_latency_kick,
+	qstat_pv_latency_wake,
+	qstat_pv_spurious_wakeup,
+	qstat_pv_wait_again,
+	qstat_pv_wait_head,
+	qstat_pv_wait_node,
+	qstat_num,	/* Total number of statistics counters */
+	qstat_reset_cnts = qstat_num,
+};
+
+#ifdef CONFIG_QUEUED_LOCK_STAT
+/*
+ * Collect pvqspinlock statistics
+ */
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/sched.h>
+
+static const char * const qstat_names[qstat_num + 1] = {
+	[qstat_pv_hash_hops]	   = "pv_hash_hops",
+	[qstat_pv_kick_unlock]     = "pv_kick_unlock",
+	[qstat_pv_kick_wake]       = "pv_kick_wake",
+	[qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
+	[qstat_pv_latency_kick]	   = "pv_latency_kick",
+	[qstat_pv_latency_wake]    = "pv_latency_wake",
+	[qstat_pv_wait_again]      = "pv_wait_again",
+	[qstat_pv_wait_head]       = "pv_wait_head",
+	[qstat_pv_wait_node]       = "pv_wait_node",
+	[qstat_reset_cnts]         = "reset_counters",
+};
+
+/*
+ * Per-cpu counters
+ */
+static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);
+static DEFINE_PER_CPU(u64, pv_kick_time);
+
+/*
+ * Sysfs data structures
+ */
+static struct kobj_attribute qstat_kobj_attrs[qstat_num + 1];
+static struct attribute *attrs[qstat_num + 2];
+static struct kobject *qstat_kobj;
+static struct attribute_group attr_group = {
+	.attrs = attrs,
+};
+
+/*
+ * Function to show the qlock statistics count
+ */
+static ssize_t
+qstat_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	int cpu, idx;
+	u64 stat = 0;
+
+	/*
+	 * Compute the index of the kobj_attribute in the array and used
+	 * it as the same index as the per-cpu variable
+	 */
+	idx = attr - qstat_kobj_attrs;
+
+	for_each_online_cpu(cpu)
+		stat += per_cpu(qstats[idx], cpu);
+	return sprintf(buf, "%llu\n", stat);
+}
+
+/*
+ * Return the average kick latency (ns) = pv_latency_kick/pv_kick_unlock
+ */
+static ssize_t
+kick_latency_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	int cpu;
+	u64 latencies = 0, kicks = 0;
+
+	for_each_online_cpu(cpu) {
+		kicks     += per_cpu(qstats[qstat_pv_kick_unlock],  cpu);
+		latencies += per_cpu(qstats[qstat_pv_latency_kick], cpu);
+	}
+
+	/* Rounded to the nearest ns */
+	return sprintf(buf, "%llu\n", kicks ? (latencies + kicks/2)/kicks : 0);
+}
+
+/*
+ * Return the average wake latency (ns) = pv_latency_wake/pv_kick_wake
+ */
+static ssize_t
+wake_latency_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	int cpu;
+	u64 latencies = 0, kicks = 0;
+
+	for_each_online_cpu(cpu) {
+		kicks     += per_cpu(qstats[qstat_pv_kick_wake],    cpu);
+		latencies += per_cpu(qstats[qstat_pv_latency_wake], cpu);
+	}
+
+	/* Rounded to the nearest ns */
+	return sprintf(buf, "%llu\n", kicks ? (latencies + kicks/2)/kicks : 0);
+}
+
+/*
+ * Return the average hops/hash = pv_hash_hops/pv_kick_unlock
+ */
+static ssize_t
+hash_hop_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	int cpu;
+	u64 hops = 0, kicks = 0;
+
+	for_each_online_cpu(cpu) {
+		kicks += per_cpu(qstats[qstat_pv_kick_unlock], cpu);
+		hops  += per_cpu(qstats[qstat_pv_hash_hops],   cpu);
+	}
+
+	if (!kicks)
+		return sprintf(buf, "0\n");
+
+	/*
+	 * Return a X.XX decimal number
+	 */
+	return sprintf(buf, "%llu.%02llu\n", hops/kicks,
+		      ((hops%kicks)*100 + kicks/2)/kicks);
+}
+
+/*
+ * Reset all the counters value
+ *
+ * Since the counter updates aren't atomic, the resetting is done twice
+ * to make sure that the counters are very likely to be all cleared.
+ */
+static ssize_t
+reset_counters_store(struct kobject *kobj, struct kobj_attribute *attr,
+		     const char *buf, size_t count)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		int i;
+		unsigned long *ptr = per_cpu_ptr(qstats, cpu);
+
+		for (i = 0 ; i < qstat_num; i++)
+			WRITE_ONCE(ptr[i], 0);
+		for (i = 0 ; i < qstat_num; i++)
+			WRITE_ONCE(ptr[i], 0);
+	}
+	return count;
+}
+
+/*
+ * Initialize sysfs for the qspinlock statistics
+ */
+static int __init init_qspinlock_stat(void)
+{
+	int i, retval;
+
+	qstat_kobj = kobject_create_and_add("qlockstat", kernel_kobj);
+	if (qstat_kobj == NULL)
+		return -ENOMEM;
+
+	/*
+	 * Initialize the attribute table
+	 *
+	 * As reading from and writing to the stat files can be slow, only
+	 * root is allowed to do the read/write to limit impact to system
+	 * performance.
+	 */
+	for (i = 0; i <= qstat_num; i++) {
+		qstat_kobj_attrs[i].attr.name = qstat_names[i];
+		qstat_kobj_attrs[i].attr.mode = 0400;
+		qstat_kobj_attrs[i].show      = qstat_show;
+		attrs[i]		      = &qstat_kobj_attrs[i].attr;
+	}
+	qstat_kobj_attrs[qstat_pv_hash_hops].show    = hash_hop_show;
+	qstat_kobj_attrs[qstat_pv_latency_kick].show = kick_latency_show;
+	qstat_kobj_attrs[qstat_pv_latency_wake].show = wake_latency_show;
+
+	/*
+	 * Set attributes for reset_counters
+	 */
+	qstat_kobj_attrs[qstat_reset_cnts].attr.mode = 0200;
+	qstat_kobj_attrs[qstat_reset_cnts].show      = NULL;
+	qstat_kobj_attrs[qstat_reset_cnts].store     = reset_counters_store;
+
+	retval = sysfs_create_group(qstat_kobj, &attr_group);
+	if (retval)
+		kobject_put(qstat_kobj);
+
+	return retval;
+}
+fs_initcall(init_qspinlock_stat);
+
+/*
+ * Increment the PV qspinlock statistics counters
+ */
+static inline void qstat_inc(enum qlock_stats stat, bool cond)
+{
+	if (cond)
+		this_cpu_inc(qstats[stat]);
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void qstat_hop(int hopcnt)
+{
+	this_cpu_add(qstats[qstat_pv_hash_hops], hopcnt);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+	u64 start = sched_clock();
+
+	per_cpu(pv_kick_time, cpu) = start;
+	pv_kick(cpu);
+	this_cpu_add(qstats[qstat_pv_latency_kick], sched_clock() - start);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+	u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
+
+	*pkick_time = 0;
+	pv_wait(ptr, val);
+	if (*pkick_time) {
+		this_cpu_add(qstats[qstat_pv_latency_wake],
+			     sched_clock() - *pkick_time);
+		qstat_inc(qstat_pv_kick_wake, true);
+	}
+}
+
+#define pv_kick(c)	__pv_kick(c)
+#define pv_wait(p, v)	__pv_wait(p, v)
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+
+static inline void qstat_inc(enum qlock_stats stat, bool cond)	{ }
+static inline void qstat_hop(int hopcnt)			{ }
+
+#endif /* CONFIG_QUEUED_LOCK_STAT */
-- 
1.7.1



* [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
                   ` (3 preceding siblings ...)
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-11-06 14:50   ` Peter Zijlstra
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
  5 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

This patch allows one attempt for the lock waiter to steal the lock
when entering the PV slowpath. To prevent lock starvation, the pending
bit will be set by the queue head vCPU when it is in the active lock
spinning loop to disable any lock stealing attempt.  This helps to
reduce the performance penalty caused by lock waiter preemption while
avoiding most of the downsides of a real unfair lock.

The pv_wait_head() function was renamed to pv_wait_head_lock() because
it was modified to acquire the lock before returning. This is necessary
because of possible lock stealing attempts from other tasks.

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                    Westmere                    Haswell
  Patch         32 vCPUs    48 vCPUs    32 vCPUs    48 vCPUs
  -----         --------    --------    --------    --------
  Before patch   3m15.6s    10m56.1s     1m44.1s     5m29.1s
  After patch    3m02.3s     5m00.2s     1m43.7s     3m03.5s

For the overcommitted case (48 vCPUs), this patch is able to reduce
kernel build time by more than 54% for Westmere and 44% for Haswell.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock.c          |   52 ++++++++++-------
 kernel/locking/qspinlock_paravirt.h |  111 ++++++++++++++++++++++++++++-------
 kernel/locking/qspinlock_stat.h     |   16 +++++
 3 files changed, 136 insertions(+), 43 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index c1c8a1a..39c6a43 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -251,15 +251,16 @@ static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
 static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
 static __always_inline void __pv_kick_node(struct qspinlock *lock,
 					   struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_head(struct qspinlock *lock,
-					   struct mcs_spinlock *node) { }
+static __always_inline u32  __pv_wait_head_lock(struct qspinlock *lock,
+						struct mcs_spinlock *node)
+						{ return 0; }
 
 #define pv_enabled()		false
 
 #define pv_init_node		__pv_init_node
 #define pv_wait_node		__pv_wait_node
 #define pv_kick_node		__pv_kick_node
-#define pv_wait_head		__pv_wait_head
+#define pv_wait_head_lock	__pv_wait_head_lock
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
@@ -431,35 +432,44 @@ queue:
 	 * sequentiality; this is because the set_locked() function below
 	 * does not imply a full barrier.
 	 *
+	 * The PV pv_wait_head_lock function, if active, will acquire the lock
+	 * and return a non-zero value. So we have to skip the
+	 * smp_load_acquire() call. As the next PV queue head hasn't been
+	 * designated yet, there is no way for the locked value to become
+	 * _Q_SLOW_VAL. So both the redundant set_locked() and the
+	 * atomic_cmpxchg_relaxed() calls will be safe. The cost of the
+	 * redundant set_locked() call below should be negligible, too.
+	 *
+	 * If PV isn't active, 0 will be returned instead.
 	 */
-	pv_wait_head(lock, node);
-	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
-		cpu_relax();
+	val = pv_wait_head_lock(lock, node);
+	if (!val) {
+		while ((val = smp_load_acquire(&lock->val.counter))
+				& _Q_LOCKED_PENDING_MASK)
+			cpu_relax();
+		/*
+		 * Claim the lock now:
+		 *
+		 * 0,0 -> 0,1
+		 */
+		set_locked(lock);
+		val |= _Q_LOCKED_VAL;
+	}
 
 	/*
 	 * If the next pointer is defined, we are not tail anymore.
-	 * In this case, claim the spinlock & release the MCS lock.
 	 */
-	if (next) {
-		set_locked(lock);
+	if (next)
 		goto mcs_unlock;
-	}
 
 	/*
-	 * claim the lock:
-	 *
-	 * n,0,0 -> 0,0,1 : lock, uncontended
-	 * *,0,0 -> *,0,1 : lock, contended
-	 *
 	 * If the queue head is the only one in the queue (lock value == tail),
-	 * clear the tail code and grab the lock. Otherwise, we only need
-	 * to grab the lock.
+	 * we have to clear the tail code.
 	 */
 	for (;;) {
-		if (val != tail) {
-			set_locked(lock);
+		if ((val & _Q_TAIL_MASK) != tail)
 			break;
-		}
+
 		/*
 		 * The smp_load_acquire() call above has provided the necessary
 		 * acquire semantics required for locking. At most two
@@ -502,7 +512,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #undef pv_init_node
 #undef pv_wait_node
 #undef pv_kick_node
-#undef pv_wait_head
+#undef pv_wait_head_lock
 
 #undef  queued_spin_lock_slowpath
 #define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index aaeeefb..24ccc9f 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -41,6 +41,56 @@ struct pv_node {
 };
 
 /*
+ * Allow one unfair trylock when entering the PV slowpath when the pending
+ * bit isn't set to reduce the performance impact of lock waiter preemption
+ *
+ * By replacing the regular queued_spin_trylock() with the function below,
+ * it will be called once when a lock waiter enter the slowpath before being
+ * queued.
+ *
+ * A little bit of unfairness here can improve performance without many
+ * of the downsides of a real unfair lock.
+ */
+#define queued_spin_trylock(l)	pv_queued_spin_trylock_unfair(l)
+static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	return !(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
+		(cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0);
+}
+
+/*
+ * The pending bit is used by the queue head vCPU to indicate that it
+ * is actively spinning on the lock and no lock stealing is allowed.
+ */
+#if _Q_PENDING_BITS == 8
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	WRITE_ONCE(l->pending, 0);
+}
+
+static __always_inline void set_pending(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	WRITE_ONCE(l->pending, 1);
+}
+#else /* _Q_PENDING_BITS == 8 */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	atomic_clear_mask(&lock->val, _Q_PENDING_MASK);
+}
+
+static __always_inline void set_pending(struct qspinlock *lock)
+{
+	atomic_set_mask(&lock->val, _Q_PENDING_MASK);
+}
+#endif /* _Q_PENDING_BITS == 8 */
+
+/*
  * Include queued spinlock statistics code
  */
 #include "qspinlock_stat.h"
@@ -202,8 +252,8 @@ static void pv_wait_node(struct mcs_spinlock *node)
 
 		/*
 		 * If pv_kick_node() changed us to vcpu_hashed, retain that
-		 * value so that pv_wait_head() knows to not also try to hash
-		 * this lock.
+		 * value so that pv_wait_head_lock() knows to not also try
+		 * to hash this lock.
 		 */
 		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
 
@@ -227,8 +277,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
 /*
  * Called after setting next->locked = 1 when we're the lock owner.
  *
- * Instead of waking the waiters stuck in pv_wait_node() advance their state such
- * that they're waiting in pv_wait_head(), this avoids a wake/sleep cycle.
+ * Instead of waking the waiters stuck in pv_wait_node() advance their state
+ * such that they're waiting in pv_wait_head_lock(), this avoids a
+ * wake/sleep cycle.
  */
 static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 {
@@ -257,10 +308,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 }
 
 /*
- * Wait for l->locked to become clear; halt the vcpu after a short spin.
+ * Wait for l->locked to become clear and acquire the lock;
+ * halt the vcpu after a short spin.
  * __pv_queued_spin_unlock() will wake us.
+ *
+ * The current value of the lock will be returned for additional processing.
  */
-static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
+static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
 	struct __qspinlock *l = (void *)lock;
@@ -276,11 +330,24 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 		lp = (struct qspinlock **)1;
 
 	for (;; waitcnt++) {
+		/*
+		 * Set the pending bit in the active lock spinning loop to
+		 * disable lock stealing. However, the pending bit check in
+		 * pv_queued_spin_trylock_unfair() and the setting/clearing
+		 * of pending bit here aren't memory barriers. So a cmpxchg()
+		 * is used to acquire the lock to be sure.
+		 */
+		set_pending(lock);
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
-			if (!READ_ONCE(l->locked))
-				return;
+			if (!READ_ONCE(l->locked) &&
+			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+				clear_pending(lock);
+				goto gotlock;
+			}
 			cpu_relax();
 		}
+		clear_pending(lock);
+
 
 		if (!lp) { /* ONCE */
 			lp = pv_hash(lock, pn);
@@ -296,36 +363,36 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
-			if (!cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
+			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
 				/*
-				 * The lock is free and _Q_SLOW_VAL has never
-				 * been set. Therefore we need to unhash before
-				 * getting the lock.
+				 * The lock was free and now we own the lock.
+				 * Change the lock value back to _Q_LOCKED_VAL
+				 * and unhash the table.
 				 */
+				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
 				WRITE_ONCE(*lp, NULL);
-				return;
+				goto gotlock;
 			}
 		}
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
 
-		if (!READ_ONCE(l->locked))
-			return;
 		/*
 		 * The unlocker should have freed the lock before kicking the
 		 * CPU. So if the lock is still not free, it is a spurious
-		 * wakeup and so the vCPU should wait again after spinning for
-		 * a while.
+		 * wakeup or another vCPU has stolen the lock. The current
+		 * vCPU should spin again.
 		 */
-		qstat_inc(qstat_pv_spurious_wakeup, true);
+		qstat_inc(qstat_pv_spurious_wakeup, READ_ONCE(l->locked));
 	}
 
 	/*
-	 * Lock is unlocked now; the caller will acquire it without waiting.
-	 * As with pv_wait_node() we rely on the caller to do a load-acquire
-	 * for us.
+	 * The cmpxchg() or xchg() call before coming here provides the
+	 * acquire semantics for locking.
 	 */
+gotlock:
+	return (u32)atomic_read(&lock->val);
 }
 
 /*
@@ -350,7 +417,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * so we need a barrier to order the read of the node data in
 	 * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
 	 *
-	 * Matches the cmpxchg() in pv_wait_head() setting _Q_SLOW_VAL.
+	 * Matches the cmpxchg() in pv_wait_head_lock() setting _Q_SLOW_VAL.
 	 */
 	smp_rmb();
 
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 16b84b2..1d87065 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -22,6 +22,7 @@
  *   pv_kick_wake	- # of vCPU kicks used for computing pv_latency_wake
  *   pv_latency_kick	- average latency (ns) of vCPU kick operation
  *   pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
+ *   pv_lock_stealing	- # of lock stealing operations
  *   pv_spurious_wakeup	- # of spurious wakeups
  *   pv_wait_again	- # of vCPU wait's that happened after a vCPU kick
  *   pv_wait_head	- # of vCPU wait's at the queue head
@@ -43,6 +44,7 @@ enum qlock_stats {
 	qstat_pv_kick_wake,
 	qstat_pv_latency_kick,
 	qstat_pv_latency_wake,
+	qstat_pv_lock_stealing,
 	qstat_pv_spurious_wakeup,
 	qstat_pv_wait_again,
 	qstat_pv_wait_head,
@@ -66,6 +68,7 @@ static const char * const qstat_names[qstat_num + 1] = {
 	[qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
 	[qstat_pv_latency_kick]	   = "pv_latency_kick",
 	[qstat_pv_latency_wake]    = "pv_latency_wake",
+	[qstat_pv_lock_stealing]   = "pv_lock_stealing",
 	[qstat_pv_wait_again]      = "pv_wait_again",
 	[qstat_pv_wait_head]       = "pv_wait_head",
 	[qstat_pv_wait_node]       = "pv_wait_node",
@@ -283,6 +286,19 @@ static inline void __pv_wait(u8 *ptr, u8 val)
 #define pv_kick(c)	__pv_kick(c)
 #define pv_wait(p, v)	__pv_wait(p, v)
 
+/*
+ * PV unfair trylock count tracking function
+ */
+static inline int qstat_trylock_unfair(struct qspinlock *lock)
+{
+	int ret = pv_queued_spin_trylock_unfair(lock);
+
+	qstat_inc(qstat_pv_lock_stealing, ret);
+	return ret;
+}
+#undef  queued_spin_trylock
+#define queued_spin_trylock(l)	qstat_trylock_unfair(l)
+
 #else /* CONFIG_QUEUED_LOCK_STAT */
 
 static inline void qstat_inc(enum qlock_stats stat, bool cond)	{ }
-- 
1.7.1



* [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
                   ` (4 preceding siblings ...)
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-11-06 15:01   ` Peter Zijlstra
  5 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue have already been halted. The spinning in the later vCPUs is
then just a waste of precious CPU cycles because they are not going
to get the lock soon, as the earlier ones have to be woken up and
take their turn to get the lock.

This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() if the previous vCPU is not running.

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

		    Westmere			Haswell
  Patch		32 vCPUs    48 vCPUs	32 vCPUs    48 vCPUs
  -----		--------    --------    --------    --------
  Before patch   3m02.3s     5m00.2s     1m43.7s     3m03.5s
  After patch    3m03.0s     4m37.5s	 1m43.0s     2m47.2s

For 32 vCPUs, this patch doesn't cause any noticeable change in
performance. For 48 vCPUs (over-committed), there is about an 8%
performance improvement.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock.c          |    5 ++-
 kernel/locking/qspinlock_paravirt.h |   45 +++++++++++++++++++++++++++++++++-
 kernel/locking/qspinlock_stat.h     |    3 ++
 3 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 39c6a43..685c14e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -248,7 +248,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
  */
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+					   struct mcs_spinlock *prev) { }
 static __always_inline void __pv_kick_node(struct qspinlock *lock,
 					   struct mcs_spinlock *node) { }
 static __always_inline u32  __pv_wait_head_lock(struct qspinlock *lock,
@@ -407,7 +408,7 @@ queue:
 		prev = decode_tail(old);
 		WRITE_ONCE(prev->next, node);
 
-		pv_wait_node(node);
+		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);
 
 		/*
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 24ccc9f..df2dfd5 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -23,6 +23,19 @@
 #define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
 
 /*
+ * Queue Node Adaptive Spinning
+ *
+ * A queue node vCPU will stop spinning if the vCPU in the previous node is
+ * not running. The one lock stealing attempt allowed at slowpath entry
+ * mitigates the slight slowdown for non-overcommitted guest with this
+ * aggressive wait-early mechanism.
+ *
+ * The status of the previous node will be checked at fixed interval
+ * controlled by PV_PREV_CHECK_MASK.
+ */
+#define PV_PREV_CHECK_MASK	0xff
+
+/*
  * Queue node uses: vcpu_running & vcpu_halted.
  * Queue head uses: vcpu_running & vcpu_hashed.
  */
@@ -202,6 +215,20 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
 }
 
 /*
+ * Return true if when it is time to check the previous node which is not
+ * in a running state.
+ */
+static inline bool
+pv_wait_early(struct pv_node *prev, int loop)
+{
+
+	if ((loop & PV_PREV_CHECK_MASK) != 0)
+		return false;
+
+	return READ_ONCE(prev->state) != vcpu_running;
+}
+
+/*
  * Initialize the PV part of the mcs_spinlock node.
  */
 static void pv_init_node(struct mcs_spinlock *node)
@@ -219,17 +246,23 @@ static void pv_init_node(struct mcs_spinlock *node)
  * pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
  * behalf.
  */
-static void pv_wait_node(struct mcs_spinlock *node)
+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 {
 	struct pv_node *pn = (struct pv_node *)node;
+	struct pv_node *pp = (struct pv_node *)prev;
 	int waitcnt = 0;
 	int loop;
+	bool wait_early;
 
 	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
 	for (;; waitcnt++) {
-		for (loop = SPIN_THRESHOLD; loop; loop--) {
+		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
 				return;
+			if (pv_wait_early(pp, loop)) {
+				wait_early = true;
+				break;
+			}
 			cpu_relax();
 		}
 
@@ -247,6 +280,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
 		if (!READ_ONCE(node->locked)) {
 			qstat_inc(qstat_pv_wait_node, true);
 			qstat_inc(qstat_pv_wait_again, waitcnt);
+			qstat_inc(qstat_pv_wait_early, wait_early);
 			pv_wait(&pn->state, vcpu_halted);
 		}
 
@@ -331,6 +365,12 @@ static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 
 	for (;; waitcnt++) {
 		/*
+		 * Set correct vCPU state to be used by queue node wait-early
+		 * mechanism.
+		 */
+		WRITE_ONCE(pn->state, vcpu_running);
+
+		/*
 		 * Set the pending bit in the active lock spinning loop to
 		 * disable lock stealing. However, the pending bit check in
 		 * pv_queued_spin_trylock_unfair() and the setting/clearing
@@ -374,6 +414,7 @@ static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 				goto gotlock;
 			}
 		}
+		WRITE_ONCE(pn->state, vcpu_halted);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 1d87065..937aef3 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -25,6 +25,7 @@
  *   pv_lock_stealing	- # of lock stealing operations
  *   pv_spurious_wakeup	- # of spurious wakeups
  *   pv_wait_again	- # of vCPU wait's that happened after a vCPU kick
+ *   pv_wait_early	- # of early vCPU wait's
  *   pv_wait_head	- # of vCPU wait's at the queue head
  *   pv_wait_node	- # of vCPU wait's at a non-head queue node
  *
@@ -47,6 +48,7 @@ enum qlock_stats {
 	qstat_pv_lock_stealing,
 	qstat_pv_spurious_wakeup,
 	qstat_pv_wait_again,
+	qstat_pv_wait_early,
 	qstat_pv_wait_head,
 	qstat_pv_wait_node,
 	qstat_num,	/* Total number of statistics counters */
@@ -70,6 +72,7 @@ static const char * const qstat_names[qstat_num + 1] = {
 	[qstat_pv_latency_wake]    = "pv_latency_wake",
 	[qstat_pv_lock_stealing]   = "pv_lock_stealing",
 	[qstat_pv_wait_again]      = "pv_wait_again",
+	[qstat_pv_wait_early]      = "pv_wait_early",
 	[qstat_pv_wait_head]       = "pv_wait_head",
 	[qstat_pv_wait_node]       = "pv_wait_node",
 	[qstat_reset_cnts]         = "reset_counters",
-- 
1.7.1
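
For context, the qstat_inc() calls sprinkled through the series are meant to
compile down to nothing when QUEUED_LOCK_STAT is not configured, and to a
per-cpu increment when it is. A minimal sketch of such a helper is shown
below; the qstats[] array and the cond parameter follow the patch, but the
exact implementation here is an assumption, not a quote of the submitted code.

#include <linux/percpu.h>

#ifdef CONFIG_QUEUED_LOCK_STAT
/* one counter per statistics item, per CPU */
static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);

/* increment the given counter on this CPU when cond is true */
static inline void qstat_inc(enum qlock_stats stat, bool cond)
{
	if (cond)
		this_cpu_inc(qstats[stat]);
}
#else
static inline void qstat_inc(enum qlock_stats stat, bool cond) { }
#endif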


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline Waiman Long
@ 2015-11-02 16:36   ` Peter Zijlstra
  2015-11-02 22:54     ` Peter Zijlstra
  2015-11-05 16:06     ` Waiman Long
  0 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-02 16:36 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Oct 30, 2015 at 07:26:33PM -0400, Waiman Long wrote:
> A queue head CPU, after acquiring the lock, will have to notify
> the next CPU in the wait queue that it has become the new queue
> head. This involves loading a new cacheline from the MCS node of the
> next CPU. That operation can be expensive and add to the latency of
> the locking operation.
> 
> This patch adds code to optimistically prefetch the next MCS node
> cacheline if the next pointer is defined and the CPU has been spinning
> for the MCS lock for a while. This reduces the locking latency and
> improves the system throughput.
> 
> Using a locking microbenchmark on a Haswell-EX system, this patch
> can improve throughput by about 5%.

How does it affect IVB-EX (which you were testing earlier IIRC)?

> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  kernel/locking/qspinlock.c |   21 +++++++++++++++++++++
>  1 files changed, 21 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 7868418..c1c8a1a 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -396,6 +396,7 @@ queue:
>  	 * p,*,* -> n,*,*
>  	 */
>  	old = xchg_tail(lock, tail);
> +	next = NULL;
>  
>  	/*
>  	 * if there was a previous node; link it and wait until reaching the
> @@ -407,6 +408,16 @@ queue:
>  
>  		pv_wait_node(node);
>  		arch_mcs_spin_lock_contended(&node->locked);
> +
> +		/*
> +		 * While waiting for the MCS lock, the next pointer may have
> +		 * been set by another lock waiter. We optimistically load
> +		 * the next pointer & prefetch the cacheline for writing
> +		 * to reduce latency in the upcoming MCS unlock operation.
> +		 */
> +		next = READ_ONCE(node->next);
> +		if (next)
> +			prefetchw(next);
>  	}

OK so far I suppose. Since we already read node->locked, which is in the
same cacheline, also reading node->next isn't extra pressure. And we can
then prefetch that cacheline.
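
For reference, the layout this argument relies on is roughly the following (a
sketch of struct mcs_spinlock as found in kernel/locking/mcs_spinlock.h around
this time; shown only to illustrate that next and locked sit in the same
cacheline):

struct mcs_spinlock {
	struct mcs_spinlock *next;	/* successor in the wait queue */
	int locked;			/* non-zero once the MCS lock is granted */
	int count;			/* nesting count, see qspinlock.c */
};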

>  	/*
> @@ -426,6 +437,15 @@ queue:
>  		cpu_relax();
>  
>  	/*
> +	 * If the next pointer is defined, we are not tail anymore.
> +	 * In this case, claim the spinlock & release the MCS lock.
> +	 */
> +	if (next) {
> +		set_locked(lock);
> +		goto mcs_unlock;
> +	}
> +
> +	/*
>  	 * claim the lock:
>  	 *
>  	 * n,0,0 -> 0,0,1 : lock, uncontended
> @@ -458,6 +478,7 @@ queue:
>  	while (!(next = READ_ONCE(node->next)))
>  		cpu_relax();
>  
> +mcs_unlock:
>  	arch_mcs_spin_unlock_contended(&next->locked);
>  	pv_kick_node(lock, next);
>  

This however appears an independent optimization. Is it worth it? Would
we not already have observed a val != tail in this case? At which point
we're just adding extra code for no gain.

That is, if we observe @next, must we then not also observe val != tail?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
@ 2015-11-02 16:40   ` Peter Zijlstra
  2015-11-05 16:29     ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-02 16:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Oct 30, 2015 at 07:26:35PM -0400, Waiman Long wrote:
> This patch enables the accumulation of kicking and waiting related
> PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
> option is selected. It also enables the collection of data which
> enable us to calculate the kicking and wakeup latencies which have
> a heavy dependency on the CPUs being used.
> 
> The statistical counters are per-cpu variables to minimize the
> performance overhead in their updates. These counters are exported
> via the sysfs filesystem under the /sys/kernel/qlockstat directory.
> When the corresponding sysfs files are read, summation and computing
> of the required data are then performed.

Why did you switch to sysfs? You can create custom debugfs files too.

> @@ -259,7 +275,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>  	if (READ_ONCE(pn->state) == vcpu_hashed)
>  		lp = (struct qspinlock **)1;
>  
> -	for (;;) {
> +	for (;; waitcnt++) {
>  		for (loop = SPIN_THRESHOLD; loop; loop--) {
>  			if (!READ_ONCE(l->locked))
>  				return;

Did you check that goes away when !STAT ?

> +/*
> + * Return the average kick latency (ns) = pv_latency_kick/pv_kick_unlock
> + */
> +static ssize_t
> +kick_latency_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +	int cpu;
> +	u64 latencies = 0, kicks = 0;
> +
> +	for_each_online_cpu(cpu) {

I think you need for_each_possible_cpu(), otherwise the results will
change with hotplug operations.

> +		kicks     += per_cpu(qstats[qstat_pv_kick_unlock],  cpu);
> +		latencies += per_cpu(qstats[qstat_pv_latency_kick], cpu);
> +	}
> +
> +	/* Rounded to the nearest ns */
> +	return sprintf(buf, "%llu\n", kicks ? (latencies + kicks/2)/kicks : 0);
> +}
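
For illustration, the hotplug-safe variant being suggested might look like the
sketch below; the qstats[] per-cpu array and counter names are taken from the
patch, only the iterator changes.

static ssize_t
kick_latency_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
	int cpu;
	u64 latencies = 0, kicks = 0;

	/* include offline CPUs so hotplug does not perturb the totals */
	for_each_possible_cpu(cpu) {
		kicks     += per_cpu(qstats[qstat_pv_kick_unlock],  cpu);
		latencies += per_cpu(qstats[qstat_pv_latency_kick], cpu);
	}

	/* average kick latency (ns), rounded to the nearest ns */
	return sprintf(buf, "%llu\n", kicks ? (latencies + kicks/2)/kicks : 0);
}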

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-02 16:36   ` Peter Zijlstra
@ 2015-11-02 22:54     ` Peter Zijlstra
  2015-11-05 16:42       ` Waiman Long
  2015-11-05 16:06     ` Waiman Long
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-02 22:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Mon, Nov 02, 2015 at 05:36:26PM +0100, Peter Zijlstra wrote:
> On Fri, Oct 30, 2015 at 07:26:33PM -0400, Waiman Long wrote:
> > @@ -426,6 +437,15 @@ queue:
> >  		cpu_relax();
> >  
> >  	/*
> > +	 * If the next pointer is defined, we are not tail anymore.
> > +	 * In this case, claim the spinlock & release the MCS lock.
> > +	 */
> > +	if (next) {
> > +		set_locked(lock);
> > +		goto mcs_unlock;
> > +	}
> > +
> > +	/*
> >  	 * claim the lock:
> >  	 *
> >  	 * n,0,0 -> 0,0,1 : lock, uncontended
> > @@ -458,6 +478,7 @@ queue:
> >  	while (!(next = READ_ONCE(node->next)))
> >  		cpu_relax();
> >  
> > +mcs_unlock:
> >  	arch_mcs_spin_unlock_contended(&next->locked);
> >  	pv_kick_node(lock, next);
> >  
> 
> This however appears an independent optimization. Is it worth it? Would
> we not already have observed a val != tail in this case? At which point
> we're just adding extra code for no gain.
> 
> That is, if we observe @next, must we then not also observe val != tail?

Not quite; the ordering is the other way around. If we observe next we
must also observe val != tail. But it's a narrow thing. Is it really
worth it?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-02 16:36   ` Peter Zijlstra
  2015-11-02 22:54     ` Peter Zijlstra
@ 2015-11-05 16:06     ` Waiman Long
  2015-11-05 16:39       ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-05 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/02/2015 11:36 AM, Peter Zijlstra wrote:
> On Fri, Oct 30, 2015 at 07:26:33PM -0400, Waiman Long wrote:
>> A queue head CPU, after acquiring the lock, will have to notify
>> the next CPU in the wait queue that it has become the new queue
>> head. This involves loading a new cacheline from the MCS node of the
>> next CPU. That operation can be expensive and add to the latency of
>> the locking operation.
>>
>> This patch adds code to optimistically prefetch the next MCS node
>> cacheline if the next pointer is defined and the CPU has been spinning
>> for the MCS lock for a while. This reduces the locking latency and
>> improves the system throughput.
>>
>> Using a locking microbenchmark on a Haswell-EX system, this patch
>> can improve throughput by about 5%.
> How does it affect IVB-EX (which you were testing earlier IIRC)?

My testing on IVB-EX indicated that if the critical section is really 
short, the change may actually slow things down a bit in some cases. However, 
when the critical section is long enough that the prefetch overhead can 
be hidden within the lock acquisition loop, there will be a performance 
boost.

>> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
>> ---
>>   kernel/locking/qspinlock.c |   21 +++++++++++++++++++++
>>   1 files changed, 21 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 7868418..c1c8a1a 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -396,6 +396,7 @@ queue:
>>   	 * p,*,* ->  n,*,*
>>   	 */
>>   	old = xchg_tail(lock, tail);
>> +	next = NULL;
>>
>>   	/*
>>   	 * if there was a previous node; link it and wait until reaching the
>> @@ -407,6 +408,16 @@ queue:
>>
>>   		pv_wait_node(node);
>>   		arch_mcs_spin_lock_contended(&node->locked);
>> +
>> +		/*
>> +		 * While waiting for the MCS lock, the next pointer may have
>> +		 * been set by another lock waiter. We optimistically load
>> +		 * the next pointer&  prefetch the cacheline for writing
>> +		 * to reduce latency in the upcoming MCS unlock operation.
>> +		 */
>> +		next = READ_ONCE(node->next);
>> +		if (next)
>> +			prefetchw(next);
>>   	}
> OK so far I suppose. Since we already read node->locked, which is in the
> same cacheline, also reading node->next isn't extra pressure. And we can
> then prefetch that cacheline.
>
>>   	/*
>> @@ -426,6 +437,15 @@ queue:
>>   		cpu_relax();
>>
>>   	/*
>> +	 * If the next pointer is defined, we are not tail anymore.
>> +	 * In this case, claim the spinlock&  release the MCS lock.
>> +	 */
>> +	if (next) {
>> +		set_locked(lock);
>> +		goto mcs_unlock;
>> +	}
>> +
>> +	/*
>>   	 * claim the lock:
>>   	 *
>>   	 * n,0,0 ->  0,0,1 : lock, uncontended
>> @@ -458,6 +478,7 @@ queue:
>>   	while (!(next = READ_ONCE(node->next)))
>>   		cpu_relax();
>>
>> +mcs_unlock:
>>   	arch_mcs_spin_unlock_contended(&next->locked);
>>   	pv_kick_node(lock, next);
>>
> This however appears an independent optimization. Is it worth it? Would
> we not already have observed a val != tail in this case? At which point
> we're just adding extra code for no gain.
>
> That is, if we observe @next, must we then not also observe val != tail?

Observing next implies val != tail, but the reverse may not be true. The 
branch is done before we observe val != tail. Yes, it is an optimization 
to avoid reading node->next again if we have already observed next. I 
have observed a very minor performance boost with that change without 
the prefetch.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-11-02 16:40   ` Peter Zijlstra
@ 2015-11-05 16:29     ` Waiman Long
  2015-11-05 16:43       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-05 16:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/02/2015 11:40 AM, Peter Zijlstra wrote:
> On Fri, Oct 30, 2015 at 07:26:35PM -0400, Waiman Long wrote:
>> This patch enables the accumulation of kicking and waiting related
>> PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
>> option is selected. It also enables the collection of data which
>> enable us to calculate the kicking and wakeup latencies which have
>> a heavy dependency on the CPUs being used.
>>
>> The statistical counters are per-cpu variables to minimize the
>> performance overhead in their updates. These counters are exported
>> via the sysfs filesystem under the /sys/kernel/qlockstat directory.
>> When the corresponding sysfs files are read, summation and computing
>> of the required data are then performed.
> Why did you switch to sysfs? You can create custom debugfs files too.

I was not aware of that capability. So you mean using 
debugfs_create_file() using custom file_operations. Right? That doesn't 
seem to be easier than using sysfs. However, I can use that if you think 
it is better to use debugfs.

>
>> @@ -259,7 +275,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>>   	if (READ_ONCE(pn->state) == vcpu_hashed)
>>   		lp = (struct qspinlock **)1;
>>
>> -	for (;;) {
>> +	for (;; waitcnt++) {
>>   		for (loop = SPIN_THRESHOLD; loop; loop--) {
>>   			if (!READ_ONCE(l->locked))
>>   				return;
> Did you check that goes away when !STAT ?

Yes, the increment code goes away when !STAT. I had added a comment to 
talk about that.

>
>> +/*
>> + * Return the average kick latency (ns) = pv_latency_kick/pv_kick_unlock
>> + */
>> +static ssize_t
>> +kick_latency_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
>> +{
>> +	int cpu;
>> +	u64 latencies = 0, kicks = 0;
>> +
>> +	for_each_online_cpu(cpu) {
> I think you need for_each_possible_cpu(), otherwise the results will
> change with hotplug operations.

Right, I will make the change.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-05 16:06     ` Waiman Long
@ 2015-11-05 16:39       ` Peter Zijlstra
  2015-11-05 16:52         ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-05 16:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Thu, Nov 05, 2015 at 11:06:48AM -0500, Waiman Long wrote:

> >How does it affect IVB-EX (which you were testing earlier IIRC)?
> 
> My testing on IVB-EX indicated that if the critical section is really short,
> the change may actually slow things down a bit in some cases. However, when the
> critical section is long enough that the prefetch overhead can be hidden
> within the lock acquisition loop, there will be a performance boost.

> >>@@ -426,6 +437,15 @@ queue:
> >>  		cpu_relax();
> >>
> >>  	/*
> >>+	 * If the next pointer is defined, we are not tail anymore.
> >>+	 * In this case, claim the spinlock&  release the MCS lock.
> >>+	 */
> >>+	if (next) {
> >>+		set_locked(lock);
> >>+		goto mcs_unlock;
> >>+	}
> >>+
> >>+	/*
> >>  	 * claim the lock:
> >>  	 *
> >>  	 * n,0,0 ->  0,0,1 : lock, uncontended
> >>@@ -458,6 +478,7 @@ queue:
> >>  	while (!(next = READ_ONCE(node->next)))
> >>  		cpu_relax();
> >>
> >>+mcs_unlock:
> >>  	arch_mcs_spin_unlock_contended(&next->locked);
> >>  	pv_kick_node(lock, next);
> >>
> >This however appears an independent optimization. Is it worth it? Would
> >we not already have observed a val != tail in this case? At which point
> >we're just adding extra code for no gain.
> >
> >That is, if we observe @next, must we then not also observe val != tail?
> 
> Observing next implies val != tail, but the reverse may not be true. The
> branch is done before we observe val != tail. Yes, it is an optimization to
> avoid reading node->next again if we have already observed next. I have
> observed a very minor performance boost with that change without the
> prefetch.

This is all good information to have in the Changelog. And since these
are two independent changes, two patches would have been the right
format.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-02 22:54     ` Peter Zijlstra
@ 2015-11-05 16:42       ` Waiman Long
  2015-11-05 16:49         ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-05 16:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/02/2015 05:54 PM, Peter Zijlstra wrote:
> On Mon, Nov 02, 2015 at 05:36:26PM +0100, Peter Zijlstra wrote:
>> On Fri, Oct 30, 2015 at 07:26:33PM -0400, Waiman Long wrote:
>>> @@ -426,6 +437,15 @@ queue:
>>>   		cpu_relax();
>>>
>>>   	/*
>>> +	 * If the next pointer is defined, we are not tail anymore.
>>> +	 * In this case, claim the spinlock&  release the MCS lock.
>>> +	 */
>>> +	if (next) {
>>> +		set_locked(lock);
>>> +		goto mcs_unlock;
>>> +	}
>>> +
>>> +	/*
>>>   	 * claim the lock:
>>>   	 *
>>>   	 * n,0,0 ->  0,0,1 : lock, uncontended
>>> @@ -458,6 +478,7 @@ queue:
>>>   	while (!(next = READ_ONCE(node->next)))
>>>   		cpu_relax();
>>>
>>> +mcs_unlock:
>>>   	arch_mcs_spin_unlock_contended(&next->locked);
>>>   	pv_kick_node(lock, next);
>>>
>> This however appears an independent optimization. Is it worth it? Would
>> we not already have observed a val != tail in this case? At which point
>> we're just adding extra code for no gain.
>>
>> That is, if we observe @next, must we then not also observe val != tail?
> Not quite; the ordering is the other way around. If we observe next we
> must also observe val != tail. But its a narrow thing. Is it really
> worth it?

If we observe next, we will observe val != tail sooner or later. It is 
not possible for it to clear the tail code in the lock. The tail xchg 
will guarantee that.

Another alternative is to do something like

+    if (!next)
          while (!(next = READ_ONCE(node->next)))
             cpu_relax();

Please let me know if that is more acceptable to you.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-11-05 16:29     ` Waiman Long
@ 2015-11-05 16:43       ` Peter Zijlstra
  2015-11-05 16:59         ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-05 16:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Thu, Nov 05, 2015 at 11:29:29AM -0500, Waiman Long wrote:
> On 11/02/2015 11:40 AM, Peter Zijlstra wrote:
> >On Fri, Oct 30, 2015 at 07:26:35PM -0400, Waiman Long wrote:
> >>This patch enables the accumulation of kicking and waiting related
> >>PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
> >>option is selected. It also enables the collection of data which
> >>enable us to calculate the kicking and wakeup latencies which have
> >>a heavy dependency on the CPUs being used.
> >>
> >>The statistical counters are per-cpu variables to minimize the
> >>performance overhead in their updates. These counters are exported
> >>via the sysfs filesystem under the /sys/kernel/qlockstat directory.
> >>When the corresponding sysfs files are read, summation and computing
> >>of the required data are then performed.
> >Why did you switch to sysfs? You can create custom debugfs files too.
> 
> I was not aware of that capability. So you mean using debugfs_create_file()
> using custom file_operations. Right? 

Yep.

> That doesn't seem to be easier than
> using sysfs. However, I can use that if you think it is better to use
> debugfs.

Mostly I just wanted to point out that it was possible; you need not
change to sysfs because debugfs lacks the capability.

But now that you ask, I think debugfs might be the better place, such
statistics (and the proposed CONFIG symbol) are purely for debug
purposes, right?
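
For illustration, a custom debugfs file over the same per-cpu counters could
look something like the sketch below. The qstats[] array and qstat_names[]
follow the patch; the seq_file plumbing and the qlockstat directory name are
assumptions, not the code that was eventually posted.

#include <linux/debugfs.h>
#include <linux/seq_file.h>
#include <linux/percpu.h>

static int qstat_show(struct seq_file *m, void *v)
{
	/* which counter to dump is carried in the file's private data */
	enum qlock_stats stat = (long)m->private;
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(qstats[stat], cpu);

	seq_printf(m, "%lu\n", sum);
	return 0;
}

static int qstat_open(struct inode *inode, struct file *file)
{
	return single_open(file, qstat_show, inode->i_private);
}

static const struct file_operations qstat_fops = {
	.open		= qstat_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= single_release,
};

static int __init qstat_debugfs_init(void)
{
	struct dentry *d = debugfs_create_dir("qlockstat", NULL);
	int i;

	/* one read-only file per counter, each summing over all CPUs */
	for (i = 0; i < qstat_num; i++)
		debugfs_create_file(qstat_names[i], 0400, d,
				    (void *)(long)i, &qstat_fops);
	return 0;
}
fs_initcall(qstat_debugfs_init);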

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-05 16:42       ` Waiman Long
@ 2015-11-05 16:49         ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-05 16:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Thu, Nov 05, 2015 at 11:42:27AM -0500, Waiman Long wrote:
> If we observe next, we will observe val != tail sooner or later. It is not
> possible for it to clear the tail code in the lock. The tail xchg will
> guarantee that.
> 
> Another alternative is to do something like
> 
> +    if (!next)
>          while (!(next = READ_ONCE(node->next)))
>             cpu_relax();
> 

Yes maybe, although the main reason I fell over this was because it was
a separate change (and not mentioned in the Changelog).

Although the above would need braces (per CodingStyle), so:

	if (!next) {
		while (!(next = READ_ONCE(node->next)))
			cpu_relax();
	}




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-05 16:39       ` Peter Zijlstra
@ 2015-11-05 16:52         ` Waiman Long
  0 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2015-11-05 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/05/2015 11:39 AM, Peter Zijlstra wrote:
> On Thu, Nov 05, 2015 at 11:06:48AM -0500, Waiman Long wrote:
>
>>> How does it affect IVB-EX (which you were testing earlier IIRC)?
>> My testing on IVB-EX indicated that if the critical section is really short,
>> the change may actually slow things down a bit in some cases. However, when the
>> critical section is long enough that the prefetch overhead can be hidden
>> within the lock acquisition loop, there will be a performance boost.
>>>> @@ -426,6 +437,15 @@ queue:
>>>>   		cpu_relax();
>>>>
>>>>   	/*
>>>> +	 * If the next pointer is defined, we are not tail anymore.
>>>> +	 * In this case, claim the spinlock&   release the MCS lock.
>>>> +	 */
>>>> +	if (next) {
>>>> +		set_locked(lock);
>>>> +		goto mcs_unlock;
>>>> +	}
>>>> +
>>>> +	/*
>>>>   	 * claim the lock:
>>>>   	 *
>>>>   	 * n,0,0 ->   0,0,1 : lock, uncontended
>>>> @@ -458,6 +478,7 @@ queue:
>>>>   	while (!(next = READ_ONCE(node->next)))
>>>>   		cpu_relax();
>>>>
>>>> +mcs_unlock:
>>>>   	arch_mcs_spin_unlock_contended(&next->locked);
>>>>   	pv_kick_node(lock, next);
>>>>
>>> This however appears an independent optimization. Is it worth it? Would
>>> we not already have observed a val != tail in this case? At which point
>>> we're just adding extra code for no gain.
>>>
>>> That is, if we observe @next, must we then not also observe val != tail?
>> Observing next implies val != tail, but the reverse may not be true. The
>> branch is done before we observe val != tail. Yes, it is an optimization to
>> avoid reading node->next again if we have already observed next. I have
>> observed a very minor performance boost with that change without the
>> prefetch.
> This is all good information to have in the Changelog. And since these
> are two independent changes, two patches would have been the right
> format.

Yep, I will separate it into 2 patches and include additional 
information in the changelog.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-11-05 16:43       ` Peter Zijlstra
@ 2015-11-05 16:59         ` Waiman Long
  2015-11-05 17:09           ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-05 16:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/05/2015 11:43 AM, Peter Zijlstra wrote:
> On Thu, Nov 05, 2015 at 11:29:29AM -0500, Waiman Long wrote:
>> On 11/02/2015 11:40 AM, Peter Zijlstra wrote:
>>> On Fri, Oct 30, 2015 at 07:26:35PM -0400, Waiman Long wrote:
>>>> This patch enables the accumulation of kicking and waiting related
>>>> PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
>>>> option is selected. It also enables the collection of data which
>>>> enable us to calculate the kicking and wakeup latencies which have
>>>> a heavy dependency on the CPUs being used.
>>>>
>>>> The statistical counters are per-cpu variables to minimize the
>>>> performance overhead in their updates. These counters are exported
>>>> via the sysfs filesystem under the /sys/kernel/qlockstat directory.
>>>> When the corresponding sysfs files are read, summation and computing
>>>> of the required data are then performed.
>>> Why did you switch to sysfs? You can create custom debugfs files too.
>> I was not aware of that capability. So you mean using debugfs_create_file()
>> using custom file_operations. Right?
> Yep.
>
>> That doesn't seem to be easier than
>> using sysfs. However, I can use that if you think it is better to use
>> debugfs.
> Mostly I just wanted to point out that it was possible; you need not
> change to sysfs because debugfs lacks the capability.
>
> But now that you ask, I think debugfs might be the better place, such
> statistics (and the proposed CONFIG symbol) are purely for debug
> purposes, right?

Davidlohr had asked me to use per-cpu counters to reduce performance 
overhead so that they can be usable in production systems. That is 
another reason why I moved to sysfs.

BTW, do you have comments on the other patches in the series? I would 
like to collect all the comments before I renew the series.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-11-05 16:59         ` Waiman Long
@ 2015-11-05 17:09           ` Peter Zijlstra
  2015-11-05 17:34             ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-05 17:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Thu, Nov 05, 2015 at 11:59:21AM -0500, Waiman Long wrote:
> >Mostly I just wanted to point out that it was possible; you need not
> >change to sysfs because debugfs lacks the capability.
> >
> >But now that you ask, I think debugfs might be the better place, such
> >statistics (and the proposed CONFIG symbol) are purely for debug
> >purposes, right?
> 
> Davidlohr had asked me to use per-cpu counters to reduce performance
> overhead so that they can be usable in production systems. That is another
> reason why I moved to sysfs.

Yes, the per-cpu thing certainly makes sense. But as said, that does not
require you to move to sysfs.

> BTW, do you have comments on the other patches in the series? I would like
> to collect all the comments before I renew the series.

I still have to look at the last two patches, I've sadly not had time
for that yet.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-11-05 17:09           ` Peter Zijlstra
@ 2015-11-05 17:34             ` Waiman Long
  0 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2015-11-05 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/05/2015 12:09 PM, Peter Zijlstra wrote:
> On Thu, Nov 05, 2015 at 11:59:21AM -0500, Waiman Long wrote:
>>> Mostly I just wanted to point out that it was possible; you need not
>>> change to sysfs because debugfs lacks the capability.
>>>
>>> But now that you ask, I think debugfs might be the better place, such
>>> statistics (and the proposed CONFIG symbol) are purely for debug
>>> purposes, right?
>> Davidlohr had asked me to use per-cpu counters to reduce performance
>> overhead so that they can be usable in production systems. That is another
>> reason why I moved to sysfs.
> Yes, the per-cpu thing certainly makes sense. But as said, that does not
> require you to move to sysfs.
>
>> BTW, do you have comments on the other patches in the series? I would like
>> to collect all the comments before I renew the series.
> I still have to look at the last two patches, I've sadly not had time
> for that yet.

That is what I thought. Just let me know when you are done with the 
review, and I will update the patches and send out a new series.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
@ 2015-11-06 14:50   ` Peter Zijlstra
  2015-11-06 17:47     ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-06 14:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Oct 30, 2015 at 07:26:36PM -0400, Waiman Long wrote:

> @@ -431,35 +432,44 @@ queue:
>  	 * sequentiality; this is because the set_locked() function below
>  	 * does not imply a full barrier.
>  	 *
> +	 * The PV pv_wait_head_lock function, if active, will acquire the lock
> +	 * and return a non-zero value. So we have to skip the
> +	 * smp_load_acquire() call. As the next PV queue head hasn't been
> +	 * designated yet, there is no way for the locked value to become
> +	 * _Q_SLOW_VAL. So both the redundant set_locked() and the
> +	 * atomic_cmpxchg_relaxed() calls will be safe. The cost of the
> +	 * redundant set_locked() call below should be negligible, too.
> +	 *
> +	 * If PV isn't active, 0 will be returned instead.
>  	 */
> -	pv_wait_head(lock, node);
> -	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
> -		cpu_relax();
> +	val = pv_wait_head_lock(lock, node);
> +	if (!val) {
> +		while ((val = smp_load_acquire(&lock->val.counter))
> +				& _Q_LOCKED_PENDING_MASK)
> +			cpu_relax();
> +		/*
> +		 * Claim the lock now:
> +		 *
> +		 * 0,0 -> 0,1
> +		 */
> +		set_locked(lock);
> +		val |= _Q_LOCKED_VAL;
> +	}
>  
>  	/*
>  	 * If the next pointer is defined, we are not tail anymore.
> -	 * In this case, claim the spinlock & release the MCS lock.
>  	 */
> -	if (next) {
> -		set_locked(lock);
> +	if (next)
>  		goto mcs_unlock;
> -	}
>  
>  	/*
> -	 * claim the lock:
> -	 *
> -	 * n,0,0 -> 0,0,1 : lock, uncontended
> -	 * *,0,0 -> *,0,1 : lock, contended
> -	 *
>  	 * If the queue head is the only one in the queue (lock value == tail),
> -	 * clear the tail code and grab the lock. Otherwise, we only need
> -	 * to grab the lock.
> +	 * we have to clear the tail code.
>  	 */
>  	for (;;) {
> -		if (val != tail) {
> -			set_locked(lock);
> +		if ((val & _Q_TAIL_MASK) != tail)
>  			break;
> -		}
> +
>  		/*
>  		 * The smp_load_acquire() call above has provided the necessary
>  		 * acquire semantics required for locking. At most two

*urgh*, last time we had:

+	if (pv_wait_head_or_steal())
+		goto stolen;
	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
		cpu_relax();

	...

+stolen:
	while (!(next = READ_ONCE(node->next)))
		cpu_relax();

	...

Now you completely overhaul the native code.. what happened?

> -static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> +static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>  {
>  	struct pv_node *pn = (struct pv_node *)node;
>  	struct __qspinlock *l = (void *)lock;
> @@ -276,11 +330,24 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>  		lp = (struct qspinlock **)1;
>  
>  	for (;; waitcnt++) {
> +		/*
> +		 * Set the pending bit in the active lock spinning loop to
> +		 * disable lock stealing. However, the pending bit check in
> +		 * pv_queued_spin_trylock_unfair() and the setting/clearing
> +		 * of pending bit here aren't memory barriers. So a cmpxchg()
> +		 * is used to acquire the lock to be sure.
> +		 */
> +		set_pending(lock);

OK, so we mark ourselves 'pending' such that a new lock() will not steal
and is forced to queue behind us.

>  		for (loop = SPIN_THRESHOLD; loop; loop--) {
> -			if (!READ_ONCE(l->locked))
> -				return;
> +			if (!READ_ONCE(l->locked) &&
> +			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> +				clear_pending(lock);
> +				goto gotlock;

Would not: cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL),
make sense to avoid the clear_pending() call?

> +			}
>  			cpu_relax();
>  		}
> +		clear_pending(lock);
> +



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
@ 2015-11-06 15:01   ` Peter Zijlstra
  2015-11-06 17:54     ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-06 15:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Oct 30, 2015 at 07:26:37PM -0400, Waiman Long wrote:
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -23,6 +23,19 @@
>  #define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
>  
>  /*
> + * Queue Node Adaptive Spinning
> + *
> + * A queue node vCPU will stop spinning if the vCPU in the previous node is
> + * not running. The one lock stealing attempt allowed at slowpath entry
> + * mitigates the slight slowdown for non-overcommitted guest with this
> + * aggressive wait-early mechanism.
> + *
> + * The status of the previous node will be checked at fixed interval
> + * controlled by PV_PREV_CHECK_MASK.
> + */
> +#define PV_PREV_CHECK_MASK	0xff
> +
> +/*
>   * Queue node uses: vcpu_running & vcpu_halted.
>   * Queue head uses: vcpu_running & vcpu_hashed.
>   */
> @@ -202,6 +215,20 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
>  }
>  
>  /*
> + * Return true when it is time to check the previous node and that node is
> + * not in a running state.
> + */
> +static inline bool
> +pv_wait_early(struct pv_node *prev, int loop)
> +{
> +
> +	if ((loop & PV_PREV_CHECK_MASK) != 0)
> +		return false;
> +
> +	return READ_ONCE(prev->state) != vcpu_running;
> +}

So it appears to me the sole purpose of PV_PREV_CHECK_MASK is to avoid
touching the prev->state cacheline too hard. Yet that is not mentioned
anywhere above.


> +static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>  {
>  	struct pv_node *pn = (struct pv_node *)node;
> +	struct pv_node *pp = (struct pv_node *)prev;
>  	int waitcnt = 0;
>  	int loop;
> +	bool wait_early;
>  
>  	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
>  	for (;; waitcnt++) {
> -		for (loop = SPIN_THRESHOLD; loop; loop--) {
> +		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
>  			if (READ_ONCE(node->locked))
>  				return;
> +			if (pv_wait_early(pp, loop)) {
> +				wait_early = true;
> +				break;
> +			}
>  			cpu_relax();
>  		}
>  

So if prev points to another node, it will never see vcpu_running. Was
that fully intended?

FYI, I think I've now seen all patches ;-)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-11-06 14:50   ` Peter Zijlstra
@ 2015-11-06 17:47     ` Waiman Long
  2015-11-09 17:29       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-06 17:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/06/2015 09:50 AM, Peter Zijlstra wrote:
> On Fri, Oct 30, 2015 at 07:26:36PM -0400, Waiman Long wrote:
>
>> @@ -431,35 +432,44 @@ queue:
>>   	 * sequentiality; this is because the set_locked() function below
>>   	 * does not imply a full barrier.
>>   	 *
>> +	 * The PV pv_wait_head_lock function, if active, will acquire the lock
>> +	 * and return a non-zero value. So we have to skip the
>> +	 * smp_load_acquire() call. As the next PV queue head hasn't been
>> +	 * designated yet, there is no way for the locked value to become
>> +	 * _Q_SLOW_VAL. So both the redundant set_locked() and the
>> +	 * atomic_cmpxchg_relaxed() calls will be safe. The cost of the
>> +	 * redundant set_locked() call below should be negligible, too.
>> +	 *
>> +	 * If PV isn't active, 0 will be returned instead.
>>   	 */
>> -	pv_wait_head(lock, node);
>> -	while ((val = smp_load_acquire(&lock->val.counter))&  _Q_LOCKED_PENDING_MASK)
>> -		cpu_relax();
>> +	val = pv_wait_head_lock(lock, node);
>> +	if (!val) {
>> +		while ((val = smp_load_acquire(&lock->val.counter))
>> +				&  _Q_LOCKED_PENDING_MASK)
>> +			cpu_relax();
>> +		/*
>> +		 * Claim the lock now:
>> +		 *
>> +		 * 0,0 ->  0,1
>> +		 */
>> +		set_locked(lock);
>> +		val |= _Q_LOCKED_VAL;
>> +	}
>>
>>   	/*
>>   	 * If the next pointer is defined, we are not tail anymore.
>> -	 * In this case, claim the spinlock&  release the MCS lock.
>>   	 */
>> -	if (next) {
>> -		set_locked(lock);
>> +	if (next)
>>   		goto mcs_unlock;
>> -	}
>>
>>   	/*
>> -	 * claim the lock:
>> -	 *
>> -	 * n,0,0 ->  0,0,1 : lock, uncontended
>> -	 * *,0,0 ->  *,0,1 : lock, contended
>> -	 *
>>   	 * If the queue head is the only one in the queue (lock value == tail),
>> -	 * clear the tail code and grab the lock. Otherwise, we only need
>> -	 * to grab the lock.
>> +	 * we have to clear the tail code.
>>   	 */
>>   	for (;;) {
>> -		if (val != tail) {
>> -			set_locked(lock);
>> +		if ((val&  _Q_TAIL_MASK) != tail)
>>   			break;
>> -		}
>> +
>>   		/*
>>   		 * The smp_load_acquire() call above has provided the necessary
>>   		 * acquire semantics required for locking. At most two
> *urgh*, last time we had:
>
> +	if (pv_wait_head_or_steal())
> +		goto stolen;
> 	while ((val = smp_load_acquire(&lock->val.counter))&  _Q_LOCKED_PENDING_MASK)
> 		cpu_relax();
>
> 	...
>
> +stolen:
> 	while (!(next = READ_ONCE(node->next)))
> 		cpu_relax();
>
> 	...
>
> Now you completely overhaul the native code.. what happened?

I want to reuse as much of the existing native code as possible instead 
of duplicating that in the PV function. The only difference now is that 
the PV function will acquire that lock. Semantically, I don't want to 
call the lock acquisition lock stealing, as the queue head is entitled 
to get the lock next. I can rename pv_queued_spin_trylock_unfair() to 
pv_queued_spin_steal_lock() to emphasize the fact that this is the 
routine where lock stealing happens.

>> -static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>> +static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>>   {
>>   	struct pv_node *pn = (struct pv_node *)node;
>>   	struct __qspinlock *l = (void *)lock;
>> @@ -276,11 +330,24 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>>   		lp = (struct qspinlock **)1;
>>
>>   	for (;; waitcnt++) {
>> +		/*
>> +		 * Set the pending bit in the active lock spinning loop to
>> +		 * disable lock stealing. However, the pending bit check in
>> +		 * pv_queued_spin_trylock_unfair() and the setting/clearing
>> +		 * of pending bit here aren't memory barriers. So a cmpxchg()
>> +		 * is used to acquire the lock to be sure.
>> +		 */
>> +		set_pending(lock);
> OK, so we mark ourselves 'pending' such that a new lock() will not steal
> and is forced to queue behind us.

Yes, this ensures that lock starvation will not happen.
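
For illustration, a minimal sketch of what such pending-bit helpers could look
like, assuming the byte-addressable struct __qspinlock layout already used in
qspinlock_paravirt.h for the NR_CPUS < 16K case (the helpers in the actual
patch may differ):

static __always_inline void set_pending(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	WRITE_ONCE(l->pending, 1);
}

static __always_inline void clear_pending(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	WRITE_ONCE(l->pending, 0);
}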

>
>>   		for (loop = SPIN_THRESHOLD; loop; loop--) {
>> -			if (!READ_ONCE(l->locked))
>> -				return;
>> +			if (!READ_ONCE(l->locked)&&
>> +			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>> +				clear_pending(lock);
>> +				goto gotlock;
> Would not: cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL),
> make sense to avoid the clear_pending() call?

I can combine cmpxchg() and clear_pending() into a new helper function 
as its implementation will differ depending on NR_CPUS.
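
One possible shape of such a combined helper, for the NR_CPUS < 16K case where
locked and pending form a 16-bit locked_pending field, is sketched below; this
is an assumption about the next revision, not the posted code.

/*
 * Transition (pending set, unlocked) -> (pending clear, locked) in one
 * atomic step, so a successful acquisition also clears the pending bit.
 */
static __always_inline int trylock_clear_pending(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	return !READ_ONCE(l->locked) &&
	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
			== _Q_PENDING_VAL);
}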

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-11-06 15:01   ` Peter Zijlstra
@ 2015-11-06 17:54     ` Waiman Long
  2015-11-06 20:37       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-06 17:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/06/2015 10:01 AM, Peter Zijlstra wrote:
> On Fri, Oct 30, 2015 at 07:26:37PM -0400, Waiman Long wrote:
>> +++ b/kernel/locking/qspinlock_paravirt.h
>> @@ -23,6 +23,19 @@
>>   #define _Q_SLOW_VAL	(3U<<  _Q_LOCKED_OFFSET)
>>
>>   /*
>> + * Queue Node Adaptive Spinning
>> + *
>> + * A queue node vCPU will stop spinning if the vCPU in the previous node is
>> + * not running. The one lock stealing attempt allowed at slowpath entry
>> + * mitigates the slight slowdown for non-overcommitted guest with this
>> + * aggressive wait-early mechanism.
>> + *
>> + * The status of the previous node will be checked at fixed interval
>> + * controlled by PV_PREV_CHECK_MASK.
>> + */
>> +#define PV_PREV_CHECK_MASK	0xff
>> +
>> +/*
>>    * Queue node uses: vcpu_running&  vcpu_halted.
>>    * Queue head uses: vcpu_running&  vcpu_hashed.
>>    */
>> @@ -202,6 +215,20 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
>>   }
>>
>>   /*
>> + * Return true when it is time to check the previous node and that node is
>> + * not in a running state.
>> + */
>> +static inline bool
>> +pv_wait_early(struct pv_node *prev, int loop)
>> +{
>> +
>> +	if ((loop&  PV_PREV_CHECK_MASK) != 0)
>> +		return false;
>> +
>> +	return READ_ONCE(prev->state) != vcpu_running;
>> +}
> So it appears to me the sole purpose of PV_PREV_CHECK_MASK is to avoid
> touching the prev->state cacheline too hard. Yet that is not mentioned
> anywhere above.

Yes, that is true. I will add a comment to that effect.
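
For illustration, the comment being promised might end up reading something
like this when folded into pv_wait_early(); the wording here is hypothetical.

static inline bool
pv_wait_early(struct pv_node *prev, int loop)
{
	/*
	 * Look at prev->state only once every PV_PREV_CHECK_MASK + 1
	 * iterations to limit read traffic on the previous node's
	 * cacheline while we spin on our own node->locked.
	 */
	if ((loop & PV_PREV_CHECK_MASK) != 0)
		return false;

	return READ_ONCE(prev->state) != vcpu_running;
}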

>
>> +static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>>   {
>>   	struct pv_node *pn = (struct pv_node *)node;
>> +	struct pv_node *pp = (struct pv_node *)prev;
>>   	int waitcnt = 0;
>>   	int loop;
>> +	bool wait_early;
>>
>>   	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
>>   	for (;; waitcnt++) {
>> -		for (loop = SPIN_THRESHOLD; loop; loop--) {
>> +		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
>>   			if (READ_ONCE(node->locked))
>>   				return;
>> +			if (pv_wait_early(pp, loop)) {
>> +				wait_early = true;
>> +				break;
>> +			}
>>   			cpu_relax();
>>   		}
>>
> So if prev points to another node, it will never see vcpu_running. Was
> that fully intended?

I had added code in pv_wait_head_or_lock to set the state appropriately 
for the queue head vCPU.

         for (;; waitcnt++) {
                 /*
+                * Set correct vCPU state to be used by queue node 
wait-early
+                * mechanism.
+                */
+               WRITE_ONCE(pn->state, vcpu_running);
+
+               /*
                  * Set the pending bit in the active lock spinning loop to
                  * disable lock stealing. However, the pending bit check in
                  * pv_queued_spin_trylock_unfair() and the setting/clearing
@@ -374,6 +414,7 @@ static u32 pv_wait_head_lock(struct qspinlock *lock, 
struct mcs_spinlock *node)
                                 goto gotlock;
                         }
                 }
+               WRITE_ONCE(pn->state, vcpu_halted);

> FYI, I think I've now seen all patches ;-)

Thanks for the review. I will work on fixing the issues you identified 
and issue a new patch series next week.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-11-06 17:54     ` Waiman Long
@ 2015-11-06 20:37       ` Peter Zijlstra
  2015-11-09 16:51         ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-06 20:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Nov 06, 2015 at 12:54:06PM -0500, Waiman Long wrote:
> >>+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
> >>  {
> >>  	struct pv_node *pn = (struct pv_node *)node;
> >>+	struct pv_node *pp = (struct pv_node *)prev;
> >>  	int waitcnt = 0;
> >>  	int loop;
> >>+	bool wait_early;
> >>
> >>  	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
> >>  	for (;; waitcnt++) {
> >>-		for (loop = SPIN_THRESHOLD; loop; loop--) {
> >>+		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
> >>  			if (READ_ONCE(node->locked))
> >>  				return;
> >>+			if (pv_wait_early(pp, loop)) {
> >>+				wait_early = true;
> >>+				break;
> >>+			}
> >>  			cpu_relax();
> >>  		}
> >>
> >So if prev points to another node, it will never see vcpu_running. Was
> >that fully intended?
> 
> I had added code in pv_wait_head_or_lock to set the state appropriately for
> the queue head vCPU.

Yes, but that's the head, for nodes we'll always have halted or hashed.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-11-06 20:37       ` Peter Zijlstra
@ 2015-11-09 16:51         ` Waiman Long
  2015-11-09 17:33           ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-09 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/06/2015 03:37 PM, Peter Zijlstra wrote:
> On Fri, Nov 06, 2015 at 12:54:06PM -0500, Waiman Long wrote:
>>>> +static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>>>>   {
>>>>   	struct pv_node *pn = (struct pv_node *)node;
>>>> +	struct pv_node *pp = (struct pv_node *)prev;
>>>>   	int waitcnt = 0;
>>>>   	int loop;
>>>> +	bool wait_early;
>>>>
>>>>   	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
>>>>   	for (;; waitcnt++) {
>>>> -		for (loop = SPIN_THRESHOLD; loop; loop--) {
>>>> +		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
>>>>   			if (READ_ONCE(node->locked))
>>>>   				return;
>>>> +			if (pv_wait_early(pp, loop)) {
>>>> +				wait_early = true;
>>>> +				break;
>>>> +			}
>>>>   			cpu_relax();
>>>>   		}
>>>>
>>> So if prev points to another node, it will never see vcpu_running. Was
>>> that fully intended?
>> I had added code in pv_wait_head_or_lock to set the state appropriately for
>> the queue head vCPU.
> Yes, but that's the head, for nodes we'll always have halted or hashed.

The node state was initialized to be vcpu_running. In pv_wait_node(), it 
will be changed to vcpu_halted before sleeping and back to vcpu_running 
after that. So it is not true that it is either halted or hashed.

In case it was changed to vcpu_hashed, it will be changed back to 
vcpu_running in pv_wait_head_lock before entering the active spinning 
loop. There are definitely small windows of time where the node state 
does not reflect the actual vCPU state, but that is the best we can do 
so far.
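
For reference, the states involved are the three values of the vcpu_state enum
from qspinlock_paravirt.h; the transition notes in the comments below
summarize the discussion above rather than quote the source.

enum vcpu_state {
	vcpu_running = 0,	/* initial value; restored after pv_wait() returns */
	vcpu_halted,		/* set by pv_wait_node() just before it calls pv_wait() */
	vcpu_hashed,		/* set once the node/lock pair has been hashed for unlock */
};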

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-11-06 17:47     ` Waiman Long
@ 2015-11-09 17:29       ` Peter Zijlstra
  2015-11-09 19:53         ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-09 17:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Nov 06, 2015 at 12:47:49PM -0500, Waiman Long wrote:
> On 11/06/2015 09:50 AM, Peter Zijlstra wrote:
> >*urgh*, last time we had:
> >
> >+	if (pv_wait_head_or_steal())
> >+		goto stolen;
> >	while ((val = smp_load_acquire(&lock->val.counter))&  _Q_LOCKED_PENDING_MASK)
> >		cpu_relax();
> >
> >	...
> >
> >+stolen:
> >	while (!(next = READ_ONCE(node->next)))
> >		cpu_relax();
> >
> >	...
> >
> >Now you completely overhaul the native code.. what happened?
> 
> I want to reuse as much of the existing native code as possible instead of
> duplicating that in the PV function. The only difference now is that the PV
> function will acquire that lock.

Right; and while I doubt it hurts the native case (you did benchmark it
I hope), I'm not too keen on the end result code wise.

Maybe just keep the above.

> Semantically, I don't want to call the lock
> acquisition lock stealing, as the queue head is entitled to get the lock
> next. 

Fair enough I suppose, pv_wait_head_or_lock() then?

> I can rename pv_queued_spin_trylock_unfair() to
> pv_queued_spin_steal_lock() to emphasize the fact that this is the routine
> where lock stealing happens.

OK.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-11-09 16:51         ` Waiman Long
@ 2015-11-09 17:33           ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-09 17:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Mon, Nov 09, 2015 at 11:51:20AM -0500, Waiman Long wrote:
> On 11/06/2015 03:37 PM, Peter Zijlstra wrote:
> >On Fri, Nov 06, 2015 at 12:54:06PM -0500, Waiman Long wrote:
> >>>>+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
> >>>>  {
> >>>>  	struct pv_node *pn = (struct pv_node *)node;
> >>>>+	struct pv_node *pp = (struct pv_node *)prev;
> >>>>  	int waitcnt = 0;
> >>>>  	int loop;
> >>>>+	bool wait_early;
> >>>>
> >>>>  	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
> >>>>  	for (;; waitcnt++) {
> >>>>-		for (loop = SPIN_THRESHOLD; loop; loop--) {
> >>>>+		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
> >>>>  			if (READ_ONCE(node->locked))
> >>>>  				return;
> >>>>+			if (pv_wait_early(pp, loop)) {
> >>>>+				wait_early = true;
> >>>>+				break;
> >>>>+			}
> >>>>  			cpu_relax();
> >>>>  		}
> >>>>
> >>>So if prev points to another node, it will never see vcpu_running. Was
> >>>that fully intended?
> >>I had added code in pv_wait_head_or_lock to set the state appropriately for
> >>the queue head vCPU.
> >Yes, but that's the head, for nodes we'll always have halted or hashed.
> 
> The node state was initialized to be vcpu_running. In pv_wait_node(), it
> will be changed to vcpu_halted before sleeping and back to vcpu_running
> after that. So it is not true that it is either halted or hashed.

Durh,.. I mixed up pv_wait_node() and pv_wait_head() I think. Sorry for
the noise.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-11-09 17:29       ` Peter Zijlstra
@ 2015-11-09 19:53         ` Waiman Long
  0 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2015-11-09 19:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/09/2015 12:29 PM, Peter Zijlstra wrote:
> On Fri, Nov 06, 2015 at 12:47:49PM -0500, Waiman Long wrote:
>> On 11/06/2015 09:50 AM, Peter Zijlstra wrote:
>>> *urgh*, last time we had:
>>>
>>> +	if (pv_wait_head_or_steal())
>>> +		goto stolen;
>>> 	while ((val = smp_load_acquire(&lock->val.counter))&   _Q_LOCKED_PENDING_MASK)
>>> 		cpu_relax();
>>>
>>> 	...
>>>
>>> +stolen:
>>> 	while (!(next = READ_ONCE(node->next)))
>>> 		cpu_relax();
>>>
>>> 	...
>>>
>>> Now you completely overhaul the native code.. what happened?
>> I want to reuse as much of the existing native code as possible instead of
>> duplicating that in the PV function. The only difference now is that the PV
>> function will acquire that lock.
> Right; and while I doubt it hurts the native case (you did benchmark it
> I hope), I'm not too keen on the end result code wise.
>
> Maybe just keep the above.

I can jump over the smp_load_acquire() for PV instead of adding an 
additional if block. For the native code, the only thing that was added 
was an additional masking of val with _Q_TAIL_MASK which I don't think 
will make too much of a difference.
>
>> Semantically, I don't want to call the lock
>> acquisition lock stealing, as the queue head is entitled to get the lock
>> next.
> Fair enough I suppose, pv_wait_head_or_lock() then?
>

I am fine with that name.

>> I can rename pv_queued_spin_trylock_unfair() to
>> pv_queued_spin_steal_lock() to emphasize the fact that this is the routine
>> where lock stealing happens.
> OK.
>

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2015-11-09 19:53 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 1/6] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline Waiman Long
2015-11-02 16:36   ` Peter Zijlstra
2015-11-02 22:54     ` Peter Zijlstra
2015-11-05 16:42       ` Waiman Long
2015-11-05 16:49         ` Peter Zijlstra
2015-11-05 16:06     ` Waiman Long
2015-11-05 16:39       ` Peter Zijlstra
2015-11-05 16:52         ` Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
2015-11-02 16:40   ` Peter Zijlstra
2015-11-05 16:29     ` Waiman Long
2015-11-05 16:43       ` Peter Zijlstra
2015-11-05 16:59         ` Waiman Long
2015-11-05 17:09           ` Peter Zijlstra
2015-11-05 17:34             ` Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
2015-11-06 14:50   ` Peter Zijlstra
2015-11-06 17:47     ` Waiman Long
2015-11-09 17:29       ` Peter Zijlstra
2015-11-09 19:53         ` Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
2015-11-06 15:01   ` Peter Zijlstra
2015-11-06 17:54     ` Waiman Long
2015-11-06 20:37       ` Peter Zijlstra
2015-11-09 16:51         ` Waiman Long
2015-11-09 17:33           ` Peter Zijlstra
