linux-kernel.vger.kernel.org archive mirror
* [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock
@ 2015-10-30 23:26 Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 1/6] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

v8->v9:
 - Added a new patch 2 that prefetches the cacheline of the next
   MCS node in order to reduce the MCS unlock latency when it is
   time to do the unlock.
 - Changed the slowpath statistics counters implementation in patch
   4 from atomic_t to per-cpu variables to reduce performance overhead
   and used sysfs instead of debugfs to return the consolidated counts
   and data.

v7->v8:
 - Annotated the use of each _acquire/_release variants in qspinlock.c.
 - Used the available pending bit in the lock stealing patch to disable
   lock stealing when the queue head vCPU is actively spinning on the
   lock to avoid lock starvation.
 - Restructured the lock stealing patch to reduce code duplication.
 - Verified that the waitcnt processing will be compiled away if
   QUEUED_LOCK_STAT isn't enabled.

v6->v7:
 - Removed arch/x86/include/asm/qspinlock.h from patch 1.
 - Removed the unconditional PV kick patch as it has been merged
   into tip.
 - Changed the pvstat_inc() API to add a new condition parameter.
 - Added comments and rearranged code in patch 4 to clarify where
   lock stealing happens.
 - In patch 5, removed the check for pv_wait count when deciding when
   to wait early.
 - Updated copyrights and email address.

v5->v6:
 - Added a new patch 1 to relax the cmpxchg and xchg operations in
   the native code path to reduce performance overhead on non-x86
   architectures.
 - Updated the unconditional PV kick patch as suggested by PeterZ.
 - Added a new patch to allow one lock stealing attempt at slowpath
   entry point to reduce performance penalty due to lock waiter
   preemption.
 - Removed the pending bit and kick-ahead patches as they didn't show
   any noticeable performance improvement on top of the lock stealing
   patch.
 - Simplified the adaptive spinning patch as the lock stealing patch
   allows more aggressive pv_wait() without much performance penalty
   in non-overcommitted VMs.

v4->v5:
 - Rebased the patch to the latest tip tree.
 - Corrected the comments and commit log for patch 1.
 - Removed the v4 patch 5 as PV kick deferment is no longer needed with
   the new tip tree.
 - Simplified the adaptive spinning patch (patch 6) & improved its
   performance a bit further.
 - Re-ran the benchmark test with the new patch.

v3->v4:
 - Patch 1: add a comment about a possible race condition in PV unlock.
 - Patch 2: simplified the pv_pending_lock() function as suggested by
   Davidlohr.
 - Move PV unlock optimization patch forward to patch 4 & rerun
   performance test.

v2->v3:
 - Moved the deferred kicking enablement patch forward & moved back
   the kick-ahead patch to make the effect of kick-ahead more visible.
 - Reworked patch 6 to make it more readable.
 - Reverted to using state as a tri-state variable instead of adding
   an additional bi-state variable.
 - Added performance data for different values of PV_KICK_AHEAD_MAX.
 - Added a new patch to optimize PV unlock code path performance.

v1->v2:
 - Take out the queued unfair lock patches
 - Add a patch to simplify the PV unlock code
 - Move pending bit and statistics collection patches to the front
 - Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
   when appropriate.
 - Change the wait-early patch to use adaptive spinning to better
   balance the differing effects on normal and over-committed guests.
 - Add patch-to-patch performance changes in the patch commit logs.

This patchset tries to improve the performance of both regular and
over-committed VM guests. The adaptive spinning patch was inspired
by the "Do Virtual Machines Really Scale?" blog from Sanidhya Kashyap.

Patch 1 relaxes the memory ordering restriction on atomic operations by
using the less restrictive _acquire and _release variants of cmpxchg()
and xchg(). This reduces the memory barrier overhead when the code is
ported to other non-x86 architectures.
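
In code terms, the change boils down to replacements like the following
(condensed from the trylock fastpath in patch 1 below; only acquire
ordering is needed on success):

/*
 * Condensed from patch 1: the trylock fastpath only needs acquire
 * ordering on success, so the fully ordered atomic_cmpxchg() can be
 * replaced by atomic_cmpxchg_acquire(). On x86 the generated code is
 * unchanged; on architectures with weaker memory models a full
 * barrier is avoided.
 */
static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	if (!atomic_read(&lock->val) &&
	   (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
		return 1;
	return 0;
}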

Patch 2 attempts to prefetch the cacheline of the next MCS node to
reduce latency in the MCS unlock operation.
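
A condensed sketch of the mechanism (the helper name prefetch_next_node()
is used here for illustration only; patch 2 open-codes these lines in
queued_spin_lock_slowpath()):

/*
 * Sketch only: while queued behind a previous node, opportunistically
 * load the next pointer and prefetch its cacheline for write so that
 * the later arch_mcs_spin_unlock_contended(&next->locked) is less
 * likely to stall on a cache miss.
 */
static inline struct mcs_spinlock *prefetch_next_node(struct mcs_spinlock *node)
{
	struct mcs_spinlock *next = READ_ONCE(node->next);

	if (next)
		prefetchw(next);
	return next;
}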

Patch 3 optimizes the PV unlock code path performance for x86-64
architecture.

Patch 4 allows the collection of various slowpath statistics counters
that are useful for seeing what is happening in the system. Per-cpu
counters are used to minimize the performance overhead.
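
The counter scheme follows the usual per-cpu pattern; a condensed sketch
of what patch 4 implements (qstat_sum() is a made-up name here, the
actual summation is done inline in the sysfs show functions):

static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);

/* Hot path: a single per-cpu increment, no atomic or shared-cacheline cost. */
static inline void qstat_inc(enum qlock_stats stat, bool cond)
{
	if (cond)
		this_cpu_inc(qstats[stat]);
}

/* Slow path: only a sysfs read pays for summing across CPUs. */
static u64 qstat_sum(int idx)
{
	u64 sum = 0;
	int cpu;

	for_each_online_cpu(cpu)
		sum += per_cpu(qstats[idx], cpu);
	return sum;
}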

Patch 5 allows one lock stealing attempt at slowpath entry. This causes
a pretty big performance improvement for over-committed VM guests.
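
The stealing attempt itself is a single unfair trylock at slowpath entry
(condensed from patch 5); the queue head sets the pending bit while it is
actively spinning, which shuts stealing off and prevents starvation:

static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	/* Steal only if neither the locked byte nor the pending bit is set. */
	return !(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
		(cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0);
}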

Patch 6 enables adaptive spinning in the queue nodes. This patch
leads to a further performance improvement in over-committed guests,
though the gain is not as big as that of the previous patch.
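
The wait-early check is cheap (condensed from patch 6): every
PV_PREV_CHECK_MASK+1 spin iterations the queued vCPU peeks at the state
of the vCPU ahead of it and stops spinning if that vCPU is not running:

static inline bool pv_wait_early(struct pv_node *prev, int loop)
{
	if ((loop & PV_PREV_CHECK_MASK) != 0)
		return false;

	return READ_ONCE(prev->state) != vcpu_running;
}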

Waiman Long (6):
  locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg
  locking/qspinlock: prefetch next node cacheline
  locking/pvqspinlock, x86: Optimize PV unlock code path
  locking/pvqspinlock: Collect slowpath lock statistics
  locking/pvqspinlock: Allow 1 lock stealing attempt
  locking/pvqspinlock: Queue node adaptive spinning

 arch/x86/Kconfig                          |    8 +
 arch/x86/include/asm/qspinlock_paravirt.h |   59 ++++++
 include/asm-generic/qspinlock.h           |    9 +-
 kernel/locking/qspinlock.c                |   99 +++++++---
 kernel/locking/qspinlock_paravirt.h       |  221 +++++++++++++++++----
 kernel/locking/qspinlock_stat.h           |  310 +++++++++++++++++++++++++++++
 6 files changed, 638 insertions(+), 68 deletions(-)
 create mode 100644 kernel/locking/qspinlock_stat.h



* [PATCH tip/locking/core v9 1/6] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline Waiman Long
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

This patch replaces the cmpxchg() and xchg() calls in the native
qspinlock code with the more relaxed _acquire or _release versions of
those calls to enable other architectures to adopt queued spinlocks
with less memory barrier performance overhead.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/asm-generic/qspinlock.h |    9 +++++----
 kernel/locking/qspinlock.c      |   29 ++++++++++++++++++++++++-----
 2 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index e2aadbc..39e1cb2 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -12,8 +12,9 @@
  * GNU General Public License for more details.
  *
  * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
  *
- * Authors: Waiman Long <waiman.long@hp.com>
+ * Authors: Waiman Long <waiman.long@hpe.com>
  */
 #ifndef __ASM_GENERIC_QSPINLOCK_H
 #define __ASM_GENERIC_QSPINLOCK_H
@@ -62,7 +63,7 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
 	if (!atomic_read(&lock->val) &&
-	   (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+	   (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
 		return 1;
 	return 0;
 }
@@ -77,7 +78,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
 	u32 val;
 
-	val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
+	val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
 	if (likely(val == 0))
 		return;
 	queued_spin_lock_slowpath(lock, val);
@@ -93,7 +94,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
 	/*
 	 * smp_mb__before_atomic() in order to guarantee release semantics
 	 */
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_sub(_Q_LOCKED_VAL, &lock->val);
 }
 #endif
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 87e9ce6..7868418 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -14,8 +14,9 @@
  * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
  * (C) Copyright 2013-2014 Red Hat, Inc.
  * (C) Copyright 2015 Intel Corp.
+ * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
  *
- * Authors: Waiman Long <waiman.long@hp.com>
+ * Authors: Waiman Long <waiman.long@hpe.com>
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
@@ -176,7 +177,12 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
 	struct __qspinlock *l = (void *)lock;
 
-	return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+	/*
+	 * Use release semantics to make sure that the MCS node is properly
+	 * initialized before changing the tail code.
+	 */
+	return (u32)xchg_release(&l->tail,
+				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
 #else /* _Q_PENDING_BITS == 8 */
@@ -208,7 +214,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 
 	for (;;) {
 		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
-		old = atomic_cmpxchg(&lock->val, val, new);
+		/*
+		 * Use release semantics to make sure that the MCS node is
+		 * properly initialized before changing the tail code.
+		 */
+		old = atomic_cmpxchg_release(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -319,7 +329,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		if (val == new)
 			new |= _Q_PENDING_VAL;
 
-		old = atomic_cmpxchg(&lock->val, val, new);
+		/*
+		 * Acquire semantic is required here as the function may
+		 * return immediately if the lock was free.
+		 */
+		old = atomic_cmpxchg_acquire(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -426,7 +440,12 @@ queue:
 			set_locked(lock);
 			break;
 		}
-		old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
+		/*
+		 * The smp_load_acquire() call above has provided the necessary
+		 * acquire semantics required for locking. At most two
+		 * iterations of this loop may be ran.
+		 */
+		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
 		if (old == val)
 			goto release;	/* No contention */
 
-- 
1.7.1



* [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 1/6] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-11-02 16:36   ` Peter Zijlstra
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

A queue head CPU, after acquiring the lock, will have to notify
the next CPU in the wait queue that it has become the new queue
head. This involves loading a new cacheline from the MCS node of the
next CPU. That operation can be expensive and adds to the latency of
the locking operation.

This patch adds code to optimistically prefetch the next MCS node
cacheline if the next pointer is defined and the CPU has been spinning
on the MCS lock for a while. This reduces the locking latency and
improves the system throughput.

Using a locking microbenchmark on a Haswell-EX system, this patch
can improve throughput by about 5%.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7868418..c1c8a1a 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -396,6 +396,7 @@ queue:
 	 * p,*,* -> n,*,*
 	 */
 	old = xchg_tail(lock, tail);
+	next = NULL;
 
 	/*
 	 * if there was a previous node; link it and wait until reaching the
@@ -407,6 +408,16 @@ queue:
 
 		pv_wait_node(node);
 		arch_mcs_spin_lock_contended(&node->locked);
+
+		/*
+		 * While waiting for the MCS lock, the next pointer may have
+		 * been set by another lock waiter. We optimistically load
+		 * the next pointer & prefetch the cacheline for writing
+		 * to reduce latency in the upcoming MCS unlock operation.
+		 */
+		next = READ_ONCE(node->next);
+		if (next)
+			prefetchw(next);
 	}
 
 	/*
@@ -426,6 +437,15 @@ queue:
 		cpu_relax();
 
 	/*
+	 * If the next pointer is defined, we are not tail anymore.
+	 * In this case, claim the spinlock & release the MCS lock.
+	 */
+	if (next) {
+		set_locked(lock);
+		goto mcs_unlock;
+	}
+
+	/*
 	 * claim the lock:
 	 *
 	 * n,0,0 -> 0,0,1 : lock, uncontended
@@ -458,6 +478,7 @@ queue:
 	while (!(next = READ_ONCE(node->next)))
 		cpu_relax();
 
+mcs_unlock:
 	arch_mcs_spin_unlock_contended(&next->locked);
 	pv_kick_node(lock, next);
 
-- 
1.7.1



* [PATCH tip/locking/core v9 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 1/6] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

The unlock function in queued spinlocks was optimized for better
performance on bare metal systems at the expense of virtualized guests.

For x86-64 systems, the unlock call needs to go through a
PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
registers before calling the real __pv_queued_spin_unlock()
function. The thunk code may also be in a separate cacheline from
__pv_queued_spin_unlock().

This patch optimizes the PV unlock code path by:
 1) Moving the unlock slowpath code from the fastpath into a separate
    __pv_queued_spin_unlock_slowpath() function to make the fastpath
    as simple as possible.
 2) For x86-64, hand-coding an assembly function that combines the
    register-saving thunk code with the fastpath code. Only registers
    that are used in the fastpath will be saved and restored. If the
    fastpath fails, the slowpath function will be called via another
    PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
    __pv_queued_spin_unlock() code as the thunk saves and restores
    only one 32-bit register.

With a microbenchmark running 5M lock-unlock loops, the table below
shows the execution times before and after the patch with different
numbers of threads in a VM running on a 32-core Westmere-EX box with
x86-64 4.2-rc1 based kernels:

  Threads	Before patch	After patch	% Change
  -------	------------	-----------	--------
     1		   134.1 ms	  119.3 ms	  -11%
     2		   1286  ms	   953  ms	  -26%
     3		   3715  ms	  3480  ms	  -6.3%
     4		   4092  ms	  3764  ms	  -8.0%
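
The microbenchmark itself is not part of this posting; a minimal sketch
of what one such per-thread 5M lock-unlock loop could look like (all
names below are illustrative only) is:

static DEFINE_SPINLOCK(test_lock);	/* backed by a qspinlock on x86 */

static int lock_unlock_thread(void *data)
{
	u64 start = sched_clock();
	unsigned long i;

	for (i = 0; i < 5000000; i++) {
		spin_lock(&test_lock);
		spin_unlock(&test_lock);
	}
	pr_info("5M lock-unlock loop took %llu ns\n", sched_clock() - start);
	return 0;
}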

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/include/asm/qspinlock_paravirt.h |   59 +++++++++++++++++++++++++++++
 kernel/locking/qspinlock_paravirt.h       |   43 +++++++++++++--------
 2 files changed, 86 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index b002e71..9f92c18 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -1,6 +1,65 @@
 #ifndef __ASM_QSPINLOCK_PARAVIRT_H
 #define __ASM_QSPINLOCK_PARAVIRT_H
 
+/*
+ * For x86-64, PV_CALLEE_SAVE_REGS_THUNK() saves and restores 8 64-bit
+ * registers. For i386, however, only 1 32-bit register needs to be saved
+ * and restored. So an optimized version of __pv_queued_spin_unlock() is
+ * hand-coded for 64-bit, but it isn't worthwhile to do it for 32-bit.
+ */
+#ifdef CONFIG_64BIT
+
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
+#define __pv_queued_spin_unlock	__pv_queued_spin_unlock
+#define PV_UNLOCK		"__raw_callee_save___pv_queued_spin_unlock"
+#define PV_UNLOCK_SLOWPATH	"__raw_callee_save___pv_queued_spin_unlock_slowpath"
+
+/*
+ * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
+ * which combines the registers saving trunk and the body of the following
+ * C code:
+ *
+ * void __pv_queued_spin_unlock(struct qspinlock *lock)
+ * {
+ *	struct __qspinlock *l = (void *)lock;
+ *	u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *
+ *	if (likely(lockval == _Q_LOCKED_VAL))
+ *		return;
+ *	pv_queued_spin_unlock_slowpath(lock, lockval);
+ * }
+ *
+ * For x86-64,
+ *   rdi = lock              (first argument)
+ *   rsi = lockval           (second argument)
+ *   rdx = internal variable (set to 0)
+ */
+asm    (".pushsection .text;"
+	".globl " PV_UNLOCK ";"
+	".align 4,0x90;"
+	PV_UNLOCK ": "
+	"push  %rdx;"
+	"mov   $0x1,%eax;"
+	"xor   %edx,%edx;"
+	"lock cmpxchg %dl,(%rdi);"
+	"cmp   $0x1,%al;"
+	"jne   .slowpath;"
+	"pop   %rdx;"
+	"ret;"
+	".slowpath: "
+	"push   %rsi;"
+	"movzbl %al,%esi;"
+	"call " PV_UNLOCK_SLOWPATH ";"
+	"pop    %rsi;"
+	"pop    %rdx;"
+	"ret;"
+	".size " PV_UNLOCK ", .-" PV_UNLOCK ";"
+	".popsection");
+
+#else /* CONFIG_64BIT */
+
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
 PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock);
 
+#endif /* CONFIG_64BIT */
 #endif
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index f0450ff..4bd323d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -308,23 +308,14 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 }
 
 /*
- * PV version of the unlock function to be used in stead of
- * queued_spin_unlock().
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
  */
-__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+__visible void
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 {
 	struct __qspinlock *l = (void *)lock;
 	struct pv_node *node;
-	u8 locked;
-
-	/*
-	 * We must not unlock if SLOW, because in that case we must first
-	 * unhash. Otherwise it would be possible to have multiple @lock
-	 * entries, which would be BAD.
-	 */
-	locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
-	if (likely(locked == _Q_LOCKED_VAL))
-		return;
 
 	if (unlikely(locked != _Q_SLOW_VAL)) {
 		WARN(!debug_locks_silent,
@@ -363,12 +354,32 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 	 */
 	pv_kick(node->cpu);
 }
+
 /*
  * Include the architecture specific callee-save thunk of the
  * __pv_queued_spin_unlock(). This thunk is put together with
- * __pv_queued_spin_unlock() near the top of the file to make sure
- * that the callee-save thunk and the real unlock function are close
- * to each other sharing consecutive instruction cachelines.
+ * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
+ * function close to each other sharing consecutive instruction cachelines.
+ * Alternatively, architecture specific version of __pv_queued_spin_unlock()
+ * can be defined.
  */
 #include <asm/qspinlock_paravirt.h>
 
+#ifndef __pv_queued_spin_unlock
+__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+	u8 locked;
+
+	/*
+	 * We must not unlock if SLOW, because in that case we must first
+	 * unhash. Otherwise it would be possible to have multiple @lock
+	 * entries, which would be BAD.
+	 */
+	locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+	if (likely(locked == _Q_LOCKED_VAL))
+		return;
+
+	__pv_queued_spin_unlock_slowpath(lock, locked);
+}
+#endif /* __pv_queued_spin_unlock */
-- 
1.7.1



* [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
                   ` (2 preceding siblings ...)
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-11-02 16:40   ` Peter Zijlstra
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
  5 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

This patch enables the accumulation of kicking- and waiting-related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of data that
allows us to calculate the kicking and wakeup latencies, which
depend heavily on the CPUs being used.

The statistical counters are per-cpu variables to minimize the
performance overhead in their updates. These counters are exported
via the sysfs filesystem under the /sys/kernel/qlockstat directory.
When the corresponding sysfs files are read, summation and computing
of the required data are then performed.

The measured latencies for different CPUs are:

	CPU		Wakeup		Kicking
	---		------		-------
	Haswell-EX	63.6us		 7.4us
	Westmere-EX	67.6us		 9.3us

The measured latencies varied a bit from run to run. The wakeup
latency is much higher than the kicking latency.

A sample of statistics counts after system bootup (with vCPU
overcommit) was:

pv_hash_hops=1.00
pv_kick_unlock=1148
pv_kick_wake=1146
pv_latency_kick=11040
pv_latency_wake=194840
pv_spurious_wakeup=7
pv_wait_again=4
pv_wait_head=23
pv_wait_node=1129

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/Kconfig                    |    8 +
 kernel/locking/qspinlock_paravirt.h |   32 ++++-
 kernel/locking/qspinlock_stat.h     |  291 +++++++++++++++++++++++++++++++++++
 3 files changed, 326 insertions(+), 5 deletions(-)
 create mode 100644 kernel/locking/qspinlock_stat.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4a9b9a9..403bfea 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -688,6 +688,14 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer Y.
 
+config QUEUED_LOCK_STAT
+	bool "Paravirt queued spinlock statistics"
+	depends on PARAVIRT_SPINLOCKS && SYSFS && QUEUED_SPINLOCKS
+	---help---
+	  Enable the collection of statistical data on the slowpath
+	  behavior of paravirtualized queued spinlocks and report
+	  them on sysfs.
+
 source "arch/x86/xen/Kconfig"
 
 config KVM_GUEST
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 4bd323d..aaeeefb 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -41,6 +41,11 @@ struct pv_node {
 };
 
 /*
+ * Include queued spinlock statistics code
+ */
+#include "qspinlock_stat.h"
+
+/*
  * Lock and MCS node addresses hash table for fast lookup
  *
  * Hashing is done on a per-cacheline basis to minimize the need to access
@@ -100,10 +105,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
 {
 	unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
 	struct pv_hash_entry *he;
+	int hopcnt = 0;
 
 	for_each_hash_entry(he, offset, hash) {
+		hopcnt++;
 		if (!cmpxchg(&he->lock, NULL, lock)) {
 			WRITE_ONCE(he->node, node);
+			qstat_hop(hopcnt);
 			return &he->lock;
 		}
 	}
@@ -164,9 +172,11 @@ static void pv_init_node(struct mcs_spinlock *node)
 static void pv_wait_node(struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
+	int waitcnt = 0;
 	int loop;
 
-	for (;;) {
+	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
+	for (;; waitcnt++) {
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
 				return;
@@ -184,12 +194,16 @@ static void pv_wait_node(struct mcs_spinlock *node)
 		 */
 		smp_store_mb(pn->state, vcpu_halted);
 
-		if (!READ_ONCE(node->locked))
+		if (!READ_ONCE(node->locked)) {
+			qstat_inc(qstat_pv_wait_node, true);
+			qstat_inc(qstat_pv_wait_again, waitcnt);
 			pv_wait(&pn->state, vcpu_halted);
+		}
 
 		/*
-		 * If pv_kick_node() changed us to vcpu_hashed, retain that value
-		 * so that pv_wait_head() knows to not also try to hash this lock.
+		 * If pv_kick_node() changed us to vcpu_hashed, retain that
+		 * value so that pv_wait_head() knows to not also try to hash
+		 * this lock.
 		 */
 		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
 
@@ -200,6 +214,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
 		 * So it is better to spin for a while in the hope that the
 		 * MCS lock will be released soon.
 		 */
+		qstat_inc(qstat_pv_spurious_wakeup, !READ_ONCE(node->locked));
 	}
 
 	/*
@@ -250,6 +265,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 	struct pv_node *pn = (struct pv_node *)node;
 	struct __qspinlock *l = (void *)lock;
 	struct qspinlock **lp = NULL;
+	int waitcnt = 0;
 	int loop;
 
 	/*
@@ -259,7 +275,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 	if (READ_ONCE(pn->state) == vcpu_hashed)
 		lp = (struct qspinlock **)1;
 
-	for (;;) {
+	for (;; waitcnt++) {
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
 			if (!READ_ONCE(l->locked))
 				return;
@@ -290,14 +306,19 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 				return;
 			}
 		}
+		qstat_inc(qstat_pv_wait_head, true);
+		qstat_inc(qstat_pv_wait_again, waitcnt);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
 
+		if (!READ_ONCE(l->locked))
+			return;
 		/*
 		 * The unlocker should have freed the lock before kicking the
 		 * CPU. So if the lock is still not free, it is a spurious
 		 * wakeup and so the vCPU should wait again after spinning for
 		 * a while.
 		 */
+		qstat_inc(qstat_pv_spurious_wakeup, true);
 	}
 
 	/*
@@ -352,6 +373,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * vCPU is harmless other than the additional latency in completing
 	 * the unlock.
 	 */
+	qstat_inc(qstat_pv_kick_unlock, true);
 	pv_kick(node->cpu);
 }
 
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
new file mode 100644
index 0000000..16b84b2
--- /dev/null
+++ b/kernel/locking/qspinlock_stat.h
@@ -0,0 +1,291 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+
+/*
+ * When queued spinlock statistics is enabled, the following sysfs files
+ * will be created to hold the statistics counters:
+ *
+ * /sys/kernel/qlockstat/
+ *   pv_hash_hops	- average # of hops per hashing operation
+ *   pv_kick_unlock	- # of vCPU kicks issued at unlock time
+ *   pv_kick_wake	- # of vCPU kicks used for computing pv_latency_wake
+ *   pv_latency_kick	- average latency (ns) of vCPU kick operation
+ *   pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
+ *   pv_spurious_wakeup	- # of spurious wakeups
+ *   pv_wait_again	- # of vCPU wait's that happened after a vCPU kick
+ *   pv_wait_head	- # of vCPU wait's at the queue head
+ *   pv_wait_node	- # of vCPU wait's at a non-head queue node
+ *
+ * Writing to the "reset_counters" file will reset all the above counter
+ * values.
+ *
+ * These statistics counters are implemented as per-cpu variables which are
+ * summed and computed whenever the corresponding sysfs files are read. This
+ * minimizes added overhead making the counters usable even in a production
+ * environment.
+ *
+ * There may be slight difference between pv_kick_wake and pv_kick_unlock.
+ */
+enum qlock_stats {
+	qstat_pv_hash_hops,
+	qstat_pv_kick_unlock,
+	qstat_pv_kick_wake,
+	qstat_pv_latency_kick,
+	qstat_pv_latency_wake,
+	qstat_pv_spurious_wakeup,
+	qstat_pv_wait_again,
+	qstat_pv_wait_head,
+	qstat_pv_wait_node,
+	qstat_num,	/* Total number of statistics counters */
+	qstat_reset_cnts = qstat_num,
+};
+
+#ifdef CONFIG_QUEUED_LOCK_STAT
+/*
+ * Collect pvqspinlock statistics
+ */
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/sched.h>
+
+static const char * const qstat_names[qstat_num + 1] = {
+	[qstat_pv_hash_hops]	   = "pv_hash_hops",
+	[qstat_pv_kick_unlock]     = "pv_kick_unlock",
+	[qstat_pv_kick_wake]       = "pv_kick_wake",
+	[qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
+	[qstat_pv_latency_kick]	   = "pv_latency_kick",
+	[qstat_pv_latency_wake]    = "pv_latency_wake",
+	[qstat_pv_wait_again]      = "pv_wait_again",
+	[qstat_pv_wait_head]       = "pv_wait_head",
+	[qstat_pv_wait_node]       = "pv_wait_node",
+	[qstat_reset_cnts]         = "reset_counters",
+};
+
+/*
+ * Per-cpu counters
+ */
+static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);
+static DEFINE_PER_CPU(u64, pv_kick_time);
+
+/*
+ * Sysfs data structures
+ */
+static struct kobj_attribute qstat_kobj_attrs[qstat_num + 1];
+static struct attribute *attrs[qstat_num + 2];
+static struct kobject *qstat_kobj;
+static struct attribute_group attr_group = {
+	.attrs = attrs,
+};
+
+/*
+ * Function to show the qlock statistics count
+ */
+static ssize_t
+qstat_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	int cpu, idx;
+	u64 stat = 0;
+
+	/*
+	 * Compute the index of the kobj_attribute in the array and used
+	 * it as the same index as the per-cpu variable
+	 */
+	idx = attr - qstat_kobj_attrs;
+
+	for_each_online_cpu(cpu)
+		stat += per_cpu(qstats[idx], cpu);
+	return sprintf(buf, "%llu\n", stat);
+}
+
+/*
+ * Return the average kick latency (ns) = pv_latency_kick/pv_kick_unlock
+ */
+static ssize_t
+kick_latency_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	int cpu;
+	u64 latencies = 0, kicks = 0;
+
+	for_each_online_cpu(cpu) {
+		kicks     += per_cpu(qstats[qstat_pv_kick_unlock],  cpu);
+		latencies += per_cpu(qstats[qstat_pv_latency_kick], cpu);
+	}
+
+	/* Rounded to the nearest ns */
+	return sprintf(buf, "%llu\n", kicks ? (latencies + kicks/2)/kicks : 0);
+}
+
+/*
+ * Return the average wake latency (ns) = pv_latency_wake/pv_kick_wake
+ */
+static ssize_t
+wake_latency_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	int cpu;
+	u64 latencies = 0, kicks = 0;
+
+	for_each_online_cpu(cpu) {
+		kicks     += per_cpu(qstats[qstat_pv_kick_wake],    cpu);
+		latencies += per_cpu(qstats[qstat_pv_latency_wake], cpu);
+	}
+
+	/* Rounded to the nearest ns */
+	return sprintf(buf, "%llu\n", kicks ? (latencies + kicks/2)/kicks : 0);
+}
+
+/*
+ * Return the average hops/hash = pv_hash_hops/pv_kick_unlock
+ */
+static ssize_t
+hash_hop_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	int cpu;
+	u64 hops = 0, kicks = 0;
+
+	for_each_online_cpu(cpu) {
+		kicks += per_cpu(qstats[qstat_pv_kick_unlock], cpu);
+		hops  += per_cpu(qstats[qstat_pv_hash_hops],   cpu);
+	}
+
+	if (!kicks)
+		return sprintf(buf, "0\n");
+
+	/*
+	 * Return a X.XX decimal number
+	 */
+	return sprintf(buf, "%llu.%02llu\n", hops/kicks,
+		      ((hops%kicks)*100 + kicks/2)/kicks);
+}
+
+/*
+ * Reset all the counters value
+ *
+ * Since the counter updates aren't atomic, the resetting is done twice
+ * to make sure that the counters are very likely to be all cleared.
+ */
+static ssize_t
+reset_counters_store(struct kobject *kobj, struct kobj_attribute *attr,
+		     const char *buf, size_t count)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		int i;
+		unsigned long *ptr = per_cpu_ptr(qstats, cpu);
+
+		for (i = 0 ; i < qstat_num; i++)
+			WRITE_ONCE(ptr[i], 0);
+		for (i = 0 ; i < qstat_num; i++)
+			WRITE_ONCE(ptr[i], 0);
+	}
+	return count;
+}
+
+/*
+ * Initialize sysfs for the qspinlock statistics
+ */
+static int __init init_qspinlock_stat(void)
+{
+	int i, retval;
+
+	qstat_kobj = kobject_create_and_add("qlockstat", kernel_kobj);
+	if (qstat_kobj == NULL)
+		return -ENOMEM;
+
+	/*
+	 * Initialize the attribute table
+	 *
+	 * As reading from and writing to the stat files can be slow, only
+	 * root is allowed to do the read/write to limit impact to system
+	 * performance.
+	 */
+	for (i = 0; i <= qstat_num; i++) {
+		qstat_kobj_attrs[i].attr.name = qstat_names[i];
+		qstat_kobj_attrs[i].attr.mode = 0400;
+		qstat_kobj_attrs[i].show      = qstat_show;
+		attrs[i]		      = &qstat_kobj_attrs[i].attr;
+	}
+	qstat_kobj_attrs[qstat_pv_hash_hops].show    = hash_hop_show;
+	qstat_kobj_attrs[qstat_pv_latency_kick].show = kick_latency_show;
+	qstat_kobj_attrs[qstat_pv_latency_wake].show = wake_latency_show;
+
+	/*
+	 * Set attributes for reset_counters
+	 */
+	qstat_kobj_attrs[qstat_reset_cnts].attr.mode = 0200;
+	qstat_kobj_attrs[qstat_reset_cnts].show      = NULL;
+	qstat_kobj_attrs[qstat_reset_cnts].store     = reset_counters_store;
+
+	retval = sysfs_create_group(qstat_kobj, &attr_group);
+	if (retval)
+		kobject_put(qstat_kobj);
+
+	return retval;
+}
+fs_initcall(init_qspinlock_stat);
+
+/*
+ * Increment the PV qspinlock statistics counters
+ */
+static inline void qstat_inc(enum qlock_stats stat, bool cond)
+{
+	if (cond)
+		this_cpu_inc(qstats[stat]);
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void qstat_hop(int hopcnt)
+{
+	this_cpu_add(qstats[qstat_pv_hash_hops], hopcnt);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+	u64 start = sched_clock();
+
+	per_cpu(pv_kick_time, cpu) = start;
+	pv_kick(cpu);
+	this_cpu_add(qstats[qstat_pv_latency_kick], sched_clock() - start);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+	u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
+
+	*pkick_time = 0;
+	pv_wait(ptr, val);
+	if (*pkick_time) {
+		this_cpu_add(qstats[qstat_pv_latency_wake],
+			     sched_clock() - *pkick_time);
+		qstat_inc(qstat_pv_kick_wake, true);
+	}
+}
+
+#define pv_kick(c)	__pv_kick(c)
+#define pv_wait(p, v)	__pv_wait(p, v)
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+
+static inline void qstat_inc(enum qlock_stats stat, bool cond)	{ }
+static inline void qstat_hop(int hopcnt)			{ }
+
+#endif /* CONFIG_QUEUED_LOCK_STAT */
-- 
1.7.1



* [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
                   ` (3 preceding siblings ...)
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-11-06 14:50   ` Peter Zijlstra
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
  5 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

This patch allows one attempt for the lock waiter to steal the lock
when entering the PV slowpath. To prevent lock starvation, the pending
bit will be set by the queue head vCPU when it is in the active lock
spinning loop to disable any lock stealing attempt.  This helps to
reduce the performance penalty caused by lock waiter preemption while
avoiding most of the downsides of a real unfair lock.

The pv_wait_head() function was renamed to pv_wait_head_lock() because
it was modified to acquire the lock before returning. This is necessary
because of possible lock stealing attempts from other tasks.

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                    Westmere                    Haswell
  Patch         32 vCPUs    48 vCPUs    32 vCPUs    48 vCPUs
  -----         --------    --------    --------    --------
  Before patch   3m15.6s    10m56.1s     1m44.1s     5m29.1s
  After patch    3m02.3s     5m00.2s     1m43.7s     3m03.5s

For the overcommitted case (48 vCPUs), this patch is able to reduce
kernel build time by more than 54% for Westmere and 44% for Haswell.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock.c          |   52 ++++++++++-------
 kernel/locking/qspinlock_paravirt.h |  111 ++++++++++++++++++++++++++++-------
 kernel/locking/qspinlock_stat.h     |   16 +++++
 3 files changed, 136 insertions(+), 43 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index c1c8a1a..39c6a43 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -251,15 +251,16 @@ static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
 static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
 static __always_inline void __pv_kick_node(struct qspinlock *lock,
 					   struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_head(struct qspinlock *lock,
-					   struct mcs_spinlock *node) { }
+static __always_inline u32  __pv_wait_head_lock(struct qspinlock *lock,
+						struct mcs_spinlock *node)
+						{ return 0; }
 
 #define pv_enabled()		false
 
 #define pv_init_node		__pv_init_node
 #define pv_wait_node		__pv_wait_node
 #define pv_kick_node		__pv_kick_node
-#define pv_wait_head		__pv_wait_head
+#define pv_wait_head_lock	__pv_wait_head_lock
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
@@ -431,35 +432,44 @@ queue:
 	 * sequentiality; this is because the set_locked() function below
 	 * does not imply a full barrier.
 	 *
+	 * The PV pv_wait_head_lock function, if active, will acquire the lock
+	 * and return a non-zero value. So we have to skip the
+	 * smp_load_acquire() call. As the next PV queue head hasn't been
+	 * designated yet, there is no way for the locked value to become
+	 * _Q_SLOW_VAL. So both the redundant set_locked() and the
+	 * atomic_cmpxchg_relaxed() calls will be safe. The cost of the
+	 * redundant set_locked() call below should be negligible, too.
+	 *
+	 * If PV isn't active, 0 will be returned instead.
 	 */
-	pv_wait_head(lock, node);
-	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
-		cpu_relax();
+	val = pv_wait_head_lock(lock, node);
+	if (!val) {
+		while ((val = smp_load_acquire(&lock->val.counter))
+				& _Q_LOCKED_PENDING_MASK)
+			cpu_relax();
+		/*
+		 * Claim the lock now:
+		 *
+		 * 0,0 -> 0,1
+		 */
+		set_locked(lock);
+		val |= _Q_LOCKED_VAL;
+	}
 
 	/*
 	 * If the next pointer is defined, we are not tail anymore.
-	 * In this case, claim the spinlock & release the MCS lock.
 	 */
-	if (next) {
-		set_locked(lock);
+	if (next)
 		goto mcs_unlock;
-	}
 
 	/*
-	 * claim the lock:
-	 *
-	 * n,0,0 -> 0,0,1 : lock, uncontended
-	 * *,0,0 -> *,0,1 : lock, contended
-	 *
 	 * If the queue head is the only one in the queue (lock value == tail),
-	 * clear the tail code and grab the lock. Otherwise, we only need
-	 * to grab the lock.
+	 * we have to clear the tail code.
 	 */
 	for (;;) {
-		if (val != tail) {
-			set_locked(lock);
+		if ((val & _Q_TAIL_MASK) != tail)
 			break;
-		}
+
 		/*
 		 * The smp_load_acquire() call above has provided the necessary
 		 * acquire semantics required for locking. At most two
@@ -502,7 +512,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #undef pv_init_node
 #undef pv_wait_node
 #undef pv_kick_node
-#undef pv_wait_head
+#undef pv_wait_head_lock
 
 #undef  queued_spin_lock_slowpath
 #define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index aaeeefb..24ccc9f 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -41,6 +41,56 @@ struct pv_node {
 };
 
 /*
+ * Allow one unfair trylock when entering the PV slowpath when the pending
+ * bit isn't set to reduce the performance impact of lock waiter preemption
+ *
+ * By replacing the regular queued_spin_trylock() with the function below,
+ * it will be called once when a lock waiter enter the slowpath before being
+ * queued.
+ *
+ * A little bit of unfairness here can improve performance without many
+ * of the downsides of a real unfair lock.
+ */
+#define queued_spin_trylock(l)	pv_queued_spin_trylock_unfair(l)
+static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	return !(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
+		(cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0);
+}
+
+/*
+ * The pending bit is used by the queue head vCPU to indicate that it
+ * is actively spinning on the lock and no lock stealing is allowed.
+ */
+#if _Q_PENDING_BITS == 8
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	WRITE_ONCE(l->pending, 0);
+}
+
+static __always_inline void set_pending(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	WRITE_ONCE(l->pending, 1);
+}
+#else /* _Q_PENDING_BITS == 8 */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	atomic_clear_mask(&lock->val, _Q_PENDING_MASK);
+}
+
+static __always_inline void set_pending(struct qspinlock *lock)
+{
+	atomic_set_mask(&lock->val, _Q_PENDING_MASK);
+}
+#endif /* _Q_PENDING_BITS == 8 */
+
+/*
  * Include queued spinlock statistics code
  */
 #include "qspinlock_stat.h"
@@ -202,8 +252,8 @@ static void pv_wait_node(struct mcs_spinlock *node)
 
 		/*
 		 * If pv_kick_node() changed us to vcpu_hashed, retain that
-		 * value so that pv_wait_head() knows to not also try to hash
-		 * this lock.
+		 * value so that pv_wait_head_lock() knows to not also try
+		 * to hash this lock.
 		 */
 		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
 
@@ -227,8 +277,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
 /*
  * Called after setting next->locked = 1 when we're the lock owner.
  *
- * Instead of waking the waiters stuck in pv_wait_node() advance their state such
- * that they're waiting in pv_wait_head(), this avoids a wake/sleep cycle.
+ * Instead of waking the waiters stuck in pv_wait_node() advance their state
+ * such that they're waiting in pv_wait_head_lock(), this avoids a
+ * wake/sleep cycle.
  */
 static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 {
@@ -257,10 +308,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 }
 
 /*
- * Wait for l->locked to become clear; halt the vcpu after a short spin.
+ * Wait for l->locked to become clear and acquire the lock;
+ * halt the vcpu after a short spin.
  * __pv_queued_spin_unlock() will wake us.
+ *
+ * The current value of the lock will be returned for additional processing.
  */
-static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
+static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
 	struct __qspinlock *l = (void *)lock;
@@ -276,11 +330,24 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 		lp = (struct qspinlock **)1;
 
 	for (;; waitcnt++) {
+		/*
+		 * Set the pending bit in the active lock spinning loop to
+		 * disable lock stealing. However, the pending bit check in
+		 * pv_queued_spin_trylock_unfair() and the setting/clearing
+		 * of pending bit here aren't memory barriers. So a cmpxchg()
+		 * is used to acquire the lock to be sure.
+		 */
+		set_pending(lock);
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
-			if (!READ_ONCE(l->locked))
-				return;
+			if (!READ_ONCE(l->locked) &&
+			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+				clear_pending(lock);
+				goto gotlock;
+			}
 			cpu_relax();
 		}
+		clear_pending(lock);
+
 
 		if (!lp) { /* ONCE */
 			lp = pv_hash(lock, pn);
@@ -296,36 +363,36 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
-			if (!cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
+			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
 				/*
-				 * The lock is free and _Q_SLOW_VAL has never
-				 * been set. Therefore we need to unhash before
-				 * getting the lock.
+				 * The lock was free and now we own the lock.
+				 * Change the lock value back to _Q_LOCKED_VAL
+				 * and unhash the table.
 				 */
+				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
 				WRITE_ONCE(*lp, NULL);
-				return;
+				goto gotlock;
 			}
 		}
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
 
-		if (!READ_ONCE(l->locked))
-			return;
 		/*
 		 * The unlocker should have freed the lock before kicking the
 		 * CPU. So if the lock is still not free, it is a spurious
-		 * wakeup and so the vCPU should wait again after spinning for
-		 * a while.
+		 * wakeup or another vCPU has stolen the lock. The current
+		 * vCPU should spin again.
 		 */
-		qstat_inc(qstat_pv_spurious_wakeup, true);
+		qstat_inc(qstat_pv_spurious_wakeup, READ_ONCE(l->locked));
 	}
 
 	/*
-	 * Lock is unlocked now; the caller will acquire it without waiting.
-	 * As with pv_wait_node() we rely on the caller to do a load-acquire
-	 * for us.
+	 * The cmpxchg() or xchg() call before coming here provides the
+	 * acquire semantics for locking.
 	 */
+gotlock:
+	return (u32)atomic_read(&lock->val);
 }
 
 /*
@@ -350,7 +417,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * so we need a barrier to order the read of the node data in
 	 * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
 	 *
-	 * Matches the cmpxchg() in pv_wait_head() setting _Q_SLOW_VAL.
+	 * Matches the cmpxchg() in pv_wait_head_lock() setting _Q_SLOW_VAL.
 	 */
 	smp_rmb();
 
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 16b84b2..1d87065 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -22,6 +22,7 @@
  *   pv_kick_wake	- # of vCPU kicks used for computing pv_latency_wake
  *   pv_latency_kick	- average latency (ns) of vCPU kick operation
  *   pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
+ *   pv_lock_stealing	- # of lock stealing operations
  *   pv_spurious_wakeup	- # of spurious wakeups
  *   pv_wait_again	- # of vCPU wait's that happened after a vCPU kick
  *   pv_wait_head	- # of vCPU wait's at the queue head
@@ -43,6 +44,7 @@ enum qlock_stats {
 	qstat_pv_kick_wake,
 	qstat_pv_latency_kick,
 	qstat_pv_latency_wake,
+	qstat_pv_lock_stealing,
 	qstat_pv_spurious_wakeup,
 	qstat_pv_wait_again,
 	qstat_pv_wait_head,
@@ -66,6 +68,7 @@ static const char * const qstat_names[qstat_num + 1] = {
 	[qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
 	[qstat_pv_latency_kick]	   = "pv_latency_kick",
 	[qstat_pv_latency_wake]    = "pv_latency_wake",
+	[qstat_pv_lock_stealing]   = "pv_lock_stealing",
 	[qstat_pv_wait_again]      = "pv_wait_again",
 	[qstat_pv_wait_head]       = "pv_wait_head",
 	[qstat_pv_wait_node]       = "pv_wait_node",
@@ -283,6 +286,19 @@ static inline void __pv_wait(u8 *ptr, u8 val)
 #define pv_kick(c)	__pv_kick(c)
 #define pv_wait(p, v)	__pv_wait(p, v)
 
+/*
+ * PV unfair trylock count tracking function
+ */
+static inline int qstat_trylock_unfair(struct qspinlock *lock)
+{
+	int ret = pv_queued_spin_trylock_unfair(lock);
+
+	qstat_inc(qstat_pv_lock_stealing, ret);
+	return ret;
+}
+#undef  queued_spin_trylock
+#define queued_spin_trylock(l)	qstat_trylock_unfair(l)
+
 #else /* CONFIG_QUEUED_LOCK_STAT */
 
 static inline void qstat_inc(enum qlock_stats stat, bool cond)	{ }
-- 
1.7.1



* [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
                   ` (4 preceding siblings ...)
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
@ 2015-10-30 23:26 ` Waiman Long
  2015-11-06 15:01   ` Peter Zijlstra
  5 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-10-30 23:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch,
	Davidlohr Bueso, Waiman Long

In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue have already been halted. The spinning in the later vCPUs is
then just a waste of precious CPU cycles because they are not going
to get the lock soon, as the earlier ones have to be woken up and
take their turn to get the lock.

This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() if the previous vCPU is not running.

Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

		    Westmere			Haswell
  Patch		32 vCPUs    48 vCPUs	32 vCPUs    48 vCPUs
  -----		--------    --------    --------    --------
  Before patch   3m02.3s     5m00.2s     1m43.7s     3m03.5s
  After patch    3m03.0s     4m37.5s	 1m43.0s     2m47.2s

For 32 vCPUs, this patch doesn't cause any noticeable change in
performance. For 48 vCPUs (over-committed), there is about an 8%
performance improvement.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock.c          |    5 ++-
 kernel/locking/qspinlock_paravirt.h |   45 +++++++++++++++++++++++++++++++++-
 kernel/locking/qspinlock_stat.h     |    3 ++
 3 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 39c6a43..685c14e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -248,7 +248,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
  */
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+					   struct mcs_spinlock *prev) { }
 static __always_inline void __pv_kick_node(struct qspinlock *lock,
 					   struct mcs_spinlock *node) { }
 static __always_inline u32  __pv_wait_head_lock(struct qspinlock *lock,
@@ -407,7 +408,7 @@ queue:
 		prev = decode_tail(old);
 		WRITE_ONCE(prev->next, node);
 
-		pv_wait_node(node);
+		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);
 
 		/*
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 24ccc9f..df2dfd5 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -23,6 +23,19 @@
 #define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
 
 /*
+ * Queue Node Adaptive Spinning
+ *
+ * A queue node vCPU will stop spinning if the vCPU in the previous node is
+ * not running. The one lock stealing attempt allowed at slowpath entry
+ * mitigates the slight slowdown for non-overcommitted guest with this
+ * aggressive wait-early mechanism.
+ *
+ * The status of the previous node will be checked at fixed interval
+ * controlled by PV_PREV_CHECK_MASK.
+ */
+#define PV_PREV_CHECK_MASK	0xff
+
+/*
  * Queue node uses: vcpu_running & vcpu_halted.
  * Queue head uses: vcpu_running & vcpu_hashed.
  */
@@ -202,6 +215,20 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
 }
 
 /*
+ * Return true if when it is time to check the previous node which is not
+ * in a running state.
+ */
+static inline bool
+pv_wait_early(struct pv_node *prev, int loop)
+{
+
+	if ((loop & PV_PREV_CHECK_MASK) != 0)
+		return false;
+
+	return READ_ONCE(prev->state) != vcpu_running;
+}
+
+/*
  * Initialize the PV part of the mcs_spinlock node.
  */
 static void pv_init_node(struct mcs_spinlock *node)
@@ -219,17 +246,23 @@ static void pv_init_node(struct mcs_spinlock *node)
  * pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
  * behalf.
  */
-static void pv_wait_node(struct mcs_spinlock *node)
+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 {
 	struct pv_node *pn = (struct pv_node *)node;
+	struct pv_node *pp = (struct pv_node *)prev;
 	int waitcnt = 0;
 	int loop;
+	bool wait_early;
 
 	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
 	for (;; waitcnt++) {
-		for (loop = SPIN_THRESHOLD; loop; loop--) {
+		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
 				return;
+			if (pv_wait_early(pp, loop)) {
+				wait_early = true;
+				break;
+			}
 			cpu_relax();
 		}
 
@@ -247,6 +280,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
 		if (!READ_ONCE(node->locked)) {
 			qstat_inc(qstat_pv_wait_node, true);
 			qstat_inc(qstat_pv_wait_again, waitcnt);
+			qstat_inc(qstat_pv_wait_early, wait_early);
 			pv_wait(&pn->state, vcpu_halted);
 		}
 
@@ -331,6 +365,12 @@ static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 
 	for (;; waitcnt++) {
 		/*
+		 * Set correct vCPU state to be used by queue node wait-early
+		 * mechanism.
+		 */
+		WRITE_ONCE(pn->state, vcpu_running);
+
+		/*
 		 * Set the pending bit in the active lock spinning loop to
 		 * disable lock stealing. However, the pending bit check in
 		 * pv_queued_spin_trylock_unfair() and the setting/clearing
@@ -374,6 +414,7 @@ static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 				goto gotlock;
 			}
 		}
+		WRITE_ONCE(pn->state, vcpu_halted);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 1d87065..937aef3 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -25,6 +25,7 @@
  *   pv_lock_stealing	- # of lock stealing operations
  *   pv_spurious_wakeup	- # of spurious wakeups
  *   pv_wait_again	- # of vCPU wait's that happened after a vCPU kick
+ *   pv_wait_early	- # of early vCPU wait's
  *   pv_wait_head	- # of vCPU wait's at the queue head
  *   pv_wait_node	- # of vCPU wait's at a non-head queue node
  *
@@ -47,6 +48,7 @@ enum qlock_stats {
 	qstat_pv_lock_stealing,
 	qstat_pv_spurious_wakeup,
 	qstat_pv_wait_again,
+	qstat_pv_wait_early,
 	qstat_pv_wait_head,
 	qstat_pv_wait_node,
 	qstat_num,	/* Total number of statistics counters */
@@ -70,6 +72,7 @@ static const char * const qstat_names[qstat_num + 1] = {
 	[qstat_pv_latency_wake]    = "pv_latency_wake",
 	[qstat_pv_lock_stealing]   = "pv_lock_stealing",
 	[qstat_pv_wait_again]      = "pv_wait_again",
+	[qstat_pv_wait_early]      = "pv_wait_early",
 	[qstat_pv_wait_head]       = "pv_wait_head",
 	[qstat_pv_wait_node]       = "pv_wait_node",
 	[qstat_reset_cnts]         = "reset_counters",
-- 
1.7.1
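
For context, the qstat_inc() calls sprinkled through the series are meant to
compile down to nothing when QUEUED_LOCK_STAT is not configured, and to a
per-cpu increment when it is. A minimal sketch of such a helper is shown
below; the qstats[] array and the cond parameter follow the patch, but the
exact implementation here is an assumption, not a quote of the submitted code.

#include <linux/percpu.h>

#ifdef CONFIG_QUEUED_LOCK_STAT
/* one counter per statistics item, per CPU */
static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);

/* increment the given counter on this CPU when cond is true */
static inline void qstat_inc(enum qlock_stats stat, bool cond)
{
	if (cond)
		this_cpu_inc(qstats[stat]);
}
#else
static inline void qstat_inc(enum qlock_stats stat, bool cond) { }
#endif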


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline Waiman Long
@ 2015-11-02 16:36   ` Peter Zijlstra
  2015-11-02 22:54     ` Peter Zijlstra
  2015-11-05 16:06     ` Waiman Long
  0 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-02 16:36 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Oct 30, 2015 at 07:26:33PM -0400, Waiman Long wrote:
> A queue head CPU, after acquiring the lock, will have to notify
> the next CPU in the wait queue that it has become the new queue
> head. This involves loading a new cacheline from the MCS node of the
> next CPU. That operation can be expensive and add to the latency of
> the locking operation.
> 
> This patch adds code to optimistically prefetch the next MCS node
> cacheline if the next pointer is defined and the CPU has been spinning
> for the MCS lock for a while. This reduces the locking latency and
> improves the system throughput.
> 
> Using a locking microbenchmark on a Haswell-EX system, this patch
> can improve throughput by about 5%.

How does it affect IVB-EX (which you were testing earlier IIRC)?

> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  kernel/locking/qspinlock.c |   21 +++++++++++++++++++++
>  1 files changed, 21 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 7868418..c1c8a1a 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -396,6 +396,7 @@ queue:
>  	 * p,*,* -> n,*,*
>  	 */
>  	old = xchg_tail(lock, tail);
> +	next = NULL;
>  
>  	/*
>  	 * if there was a previous node; link it and wait until reaching the
> @@ -407,6 +408,16 @@ queue:
>  
>  		pv_wait_node(node);
>  		arch_mcs_spin_lock_contended(&node->locked);
> +
> +		/*
> +		 * While waiting for the MCS lock, the next pointer may have
> +		 * been set by another lock waiter. We optimistically load
> +		 * the next pointer & prefetch the cacheline for writing
> +		 * to reduce latency in the upcoming MCS unlock operation.
> +		 */
> +		next = READ_ONCE(node->next);
> +		if (next)
> +			prefetchw(next);
>  	}

OK so far I suppose. Since we already read node->locked, which is in the
same cacheline, also reading node->next isn't extra pressure. And we can
then prefetch that cacheline.
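
For reference, the layout this argument relies on is roughly the following (a
sketch of struct mcs_spinlock as found in kernel/locking/mcs_spinlock.h around
this time; shown only to illustrate that next and locked sit in the same
cacheline):

struct mcs_spinlock {
	struct mcs_spinlock *next;	/* successor in the wait queue */
	int locked;			/* non-zero once the MCS lock is granted */
	int count;			/* nesting count, see qspinlock.c */
};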

>  	/*
> @@ -426,6 +437,15 @@ queue:
>  		cpu_relax();
>  
>  	/*
> +	 * If the next pointer is defined, we are not tail anymore.
> +	 * In this case, claim the spinlock & release the MCS lock.
> +	 */
> +	if (next) {
> +		set_locked(lock);
> +		goto mcs_unlock;
> +	}
> +
> +	/*
>  	 * claim the lock:
>  	 *
>  	 * n,0,0 -> 0,0,1 : lock, uncontended
> @@ -458,6 +478,7 @@ queue:
>  	while (!(next = READ_ONCE(node->next)))
>  		cpu_relax();
>  
> +mcs_unlock:
>  	arch_mcs_spin_unlock_contended(&next->locked);
>  	pv_kick_node(lock, next);
>  

This however appears an independent optimization. Is it worth it? Would
we not already have observed a val != tail in this case? At which point
we're just adding extra code for no gain.

That is, if we observe @next, must we then not also observe val != tail?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
@ 2015-11-02 16:40   ` Peter Zijlstra
  2015-11-05 16:29     ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-02 16:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Oct 30, 2015 at 07:26:35PM -0400, Waiman Long wrote:
> This patch enables the accumulation of kicking and waiting related
> PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
> option is selected. It also enables the collection of data which
> enable us to calculate the kicking and wakeup latencies which have
> a heavy dependency on the CPUs being used.
> 
> The statistical counters are per-cpu variables to minimize the
> performance overhead in their updates. These counters are exported
> via the sysfs filesystem under the /sys/kernel/qlockstat directory.
> When the corresponding sysfs files are read, summation and computing
> of the required data are then performed.

Why did you switch to sysfs? You can create custom debugfs files too.

> @@ -259,7 +275,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>  	if (READ_ONCE(pn->state) == vcpu_hashed)
>  		lp = (struct qspinlock **)1;
>  
> -	for (;;) {
> +	for (;; waitcnt++) {
>  		for (loop = SPIN_THRESHOLD; loop; loop--) {
>  			if (!READ_ONCE(l->locked))
>  				return;

Did you check that goes away when !STAT ?

> +/*
> + * Return the average kick latency (ns) = pv_latency_kick/pv_kick_unlock
> + */
> +static ssize_t
> +kick_latency_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +	int cpu;
> +	u64 latencies = 0, kicks = 0;
> +
> +	for_each_online_cpu(cpu) {

I think you need for_each_possible_cpu(), otherwise the results will
change with hotplug operations.

> +		kicks     += per_cpu(qstats[qstat_pv_kick_unlock],  cpu);
> +		latencies += per_cpu(qstats[qstat_pv_latency_kick], cpu);
> +	}
> +
> +	/* Rounded to the nearest ns */
> +	return sprintf(buf, "%llu\n", kicks ? (latencies + kicks/2)/kicks : 0);
> +}
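
For illustration, the hotplug-safe variant being suggested might look like the
sketch below; the qstats[] per-cpu array and counter names are taken from the
patch, only the iterator changes.

static ssize_t
kick_latency_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
	int cpu;
	u64 latencies = 0, kicks = 0;

	/* include offline CPUs so hotplug does not perturb the totals */
	for_each_possible_cpu(cpu) {
		kicks     += per_cpu(qstats[qstat_pv_kick_unlock],  cpu);
		latencies += per_cpu(qstats[qstat_pv_latency_kick], cpu);
	}

	/* average kick latency (ns), rounded to the nearest ns */
	return sprintf(buf, "%llu\n", kicks ? (latencies + kicks/2)/kicks : 0);
}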

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-02 16:36   ` Peter Zijlstra
@ 2015-11-02 22:54     ` Peter Zijlstra
  2015-11-05 16:42       ` Waiman Long
  2015-11-05 16:06     ` Waiman Long
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-02 22:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Mon, Nov 02, 2015 at 05:36:26PM +0100, Peter Zijlstra wrote:
> On Fri, Oct 30, 2015 at 07:26:33PM -0400, Waiman Long wrote:
> > @@ -426,6 +437,15 @@ queue:
> >  		cpu_relax();
> >  
> >  	/*
> > +	 * If the next pointer is defined, we are not tail anymore.
> > +	 * In this case, claim the spinlock & release the MCS lock.
> > +	 */
> > +	if (next) {
> > +		set_locked(lock);
> > +		goto mcs_unlock;
> > +	}
> > +
> > +	/*
> >  	 * claim the lock:
> >  	 *
> >  	 * n,0,0 -> 0,0,1 : lock, uncontended
> > @@ -458,6 +478,7 @@ queue:
> >  	while (!(next = READ_ONCE(node->next)))
> >  		cpu_relax();
> >  
> > +mcs_unlock:
> >  	arch_mcs_spin_unlock_contended(&next->locked);
> >  	pv_kick_node(lock, next);
> >  
> 
> This however appears an independent optimization. Is it worth it? Would
> we not already have observed a val != tail in this case? At which point
> we're just adding extra code for no gain.
> 
> That is, if we observe @next, must we then not also observe val != tail?

Not quite; the ordering is the other way around. If we observe next we
must also observe val != tail. But it's a narrow thing. Is it really
worth it?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-02 16:36   ` Peter Zijlstra
  2015-11-02 22:54     ` Peter Zijlstra
@ 2015-11-05 16:06     ` Waiman Long
  2015-11-05 16:39       ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-05 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/02/2015 11:36 AM, Peter Zijlstra wrote:
> On Fri, Oct 30, 2015 at 07:26:33PM -0400, Waiman Long wrote:
>> A queue head CPU, after acquiring the lock, will have to notify
>> the next CPU in the wait queue that it has become the new queue
>> head. This involves loading a new cacheline from the MCS node of the
>> next CPU. That operation can be expensive and add to the latency of
>> the locking operation.
>>
>> This patch adds code to optimistically prefetch the next MCS node
>> cacheline if the next pointer is defined and the CPU has been spinning
>> for the MCS lock for a while. This reduces the locking latency and
>> improves the system throughput.
>>
>> Using a locking microbenchmark on a Haswell-EX system, this patch
>> can improve throughput by about 5%.
> How does it affect IVB-EX (which you were testing earlier IIRC)?

My testing on IVB-EX indicated that if the critical section is really 
short, the change may actually slow things down a bit in some cases. However, 
when the critical section is long enough that the prefetch overhead can 
be hidden within the lock acquisition loop, there will be a performance 
boost.

>> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
>> ---
>>   kernel/locking/qspinlock.c |   21 +++++++++++++++++++++
>>   1 files changed, 21 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 7868418..c1c8a1a 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -396,6 +396,7 @@ queue:
>>   	 * p,*,* ->  n,*,*
>>   	 */
>>   	old = xchg_tail(lock, tail);
>> +	next = NULL;
>>
>>   	/*
>>   	 * if there was a previous node; link it and wait until reaching the
>> @@ -407,6 +408,16 @@ queue:
>>
>>   		pv_wait_node(node);
>>   		arch_mcs_spin_lock_contended(&node->locked);
>> +
>> +		/*
>> +		 * While waiting for the MCS lock, the next pointer may have
>> +		 * been set by another lock waiter. We optimistically load
>> +		 * the next pointer&  prefetch the cacheline for writing
>> +		 * to reduce latency in the upcoming MCS unlock operation.
>> +		 */
>> +		next = READ_ONCE(node->next);
>> +		if (next)
>> +			prefetchw(next);
>>   	}
> OK so far I suppose. Since we already read node->locked, which is in the
> same cacheline, also reading node->next isn't extra pressure. And we can
> then prefetch that cacheline.
>
>>   	/*
>> @@ -426,6 +437,15 @@ queue:
>>   		cpu_relax();
>>
>>   	/*
>> +	 * If the next pointer is defined, we are not tail anymore.
>> +	 * In this case, claim the spinlock&  release the MCS lock.
>> +	 */
>> +	if (next) {
>> +		set_locked(lock);
>> +		goto mcs_unlock;
>> +	}
>> +
>> +	/*
>>   	 * claim the lock:
>>   	 *
>>   	 * n,0,0 ->  0,0,1 : lock, uncontended
>> @@ -458,6 +478,7 @@ queue:
>>   	while (!(next = READ_ONCE(node->next)))
>>   		cpu_relax();
>>
>> +mcs_unlock:
>>   	arch_mcs_spin_unlock_contended(&next->locked);
>>   	pv_kick_node(lock, next);
>>
> This however appears an independent optimization. Is it worth it? Would
> we not already have observed a val != tail in this case? At which point
> we're just adding extra code for no gain.
>
> That is, if we observe @next, must we then not also observe val != tail?

Observing next implies val != tail, but the reverse may not be true. The 
branch is done before we observe val != tail. Yes, it is an optimization 
to avoid reading node->next again if we have already observed next. I 
have observed a very minor performance boost with that change without 
the prefetch.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-11-02 16:40   ` Peter Zijlstra
@ 2015-11-05 16:29     ` Waiman Long
  2015-11-05 16:43       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-05 16:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/02/2015 11:40 AM, Peter Zijlstra wrote:
> On Fri, Oct 30, 2015 at 07:26:35PM -0400, Waiman Long wrote:
>> This patch enables the accumulation of kicking and waiting related
>> PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
>> option is selected. It also enables the collection of data which
>> enable us to calculate the kicking and wakeup latencies which have
>> a heavy dependency on the CPUs being used.
>>
>> The statistical counters are per-cpu variables to minimize the
>> performance overhead in their updates. These counters are exported
>> via the sysfs filesystem under the /sys/kernel/qlockstat directory.
>> When the corresponding sysfs files are read, summation and computing
>> of the required data are then performed.
> Why did you switch to sysfs? You can create custom debugfs files too.

I was not aware of that capability. So you mean using 
debugfs_create_file() using custom file_operations. Right? That doesn't 
seem to be easier than using sysfs. However, I can use that if you think 
it is better to use debugfs.

>
>> @@ -259,7 +275,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>>   	if (READ_ONCE(pn->state) == vcpu_hashed)
>>   		lp = (struct qspinlock **)1;
>>
>> -	for (;;) {
>> +	for (;; waitcnt++) {
>>   		for (loop = SPIN_THRESHOLD; loop; loop--) {
>>   			if (!READ_ONCE(l->locked))
>>   				return;
> Did you check that goes away when !STAT ?

Yes, the increment code goes away when !STAT. I had added a comment to 
talk about that.

>
>> +/*
>> + * Return the average kick latency (ns) = pv_latency_kick/pv_kick_unlock
>> + */
>> +static ssize_t
>> +kick_latency_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
>> +{
>> +	int cpu;
>> +	u64 latencies = 0, kicks = 0;
>> +
>> +	for_each_online_cpu(cpu) {
> I think you need for_each_possible_cpu(), otherwise the results will
> change with hotplug operations.

Right, I will make the change.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-05 16:06     ` Waiman Long
@ 2015-11-05 16:39       ` Peter Zijlstra
  2015-11-05 16:52         ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-05 16:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Thu, Nov 05, 2015 at 11:06:48AM -0500, Waiman Long wrote:

> >How does it affect IVB-EX (which you were testing earlier IIRC)?
> 
> My testing on IVB-EX indicated that if the critical section is really short,
> the change may actually slow things down a bit in some cases. However, when the
> critical section is long enough that the prefetch overhead can be hidden
> within the lock acquisition loop, there will be a performance boost.

> >>@@ -426,6 +437,15 @@ queue:
> >>  		cpu_relax();
> >>
> >>  	/*
> >>+	 * If the next pointer is defined, we are not tail anymore.
> >>+	 * In this case, claim the spinlock&  release the MCS lock.
> >>+	 */
> >>+	if (next) {
> >>+		set_locked(lock);
> >>+		goto mcs_unlock;
> >>+	}
> >>+
> >>+	/*
> >>  	 * claim the lock:
> >>  	 *
> >>  	 * n,0,0 ->  0,0,1 : lock, uncontended
> >>@@ -458,6 +478,7 @@ queue:
> >>  	while (!(next = READ_ONCE(node->next)))
> >>  		cpu_relax();
> >>
> >>+mcs_unlock:
> >>  	arch_mcs_spin_unlock_contended(&next->locked);
> >>  	pv_kick_node(lock, next);
> >>
> >This however appears an independent optimization. Is it worth it? Would
> >we not already have observed a val != tail in this case? At which point
> >we're just adding extra code for no gain.
> >
> >That is, if we observe @next, must we then not also observe val != tail?
> 
> Observing next implies val != tail, but the reverse may not be true. The
> branch is done before we observe val != tail. Yes, it is an optimization to
> avoid reading node->next again if we have already observed next. I have
> observed a very minor performance boost with that change without the
> prefetch.

This is all good information to have in the Changelog. And since these
are two independent changes, two patches would have been the right
format.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-02 22:54     ` Peter Zijlstra
@ 2015-11-05 16:42       ` Waiman Long
  2015-11-05 16:49         ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-05 16:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/02/2015 05:54 PM, Peter Zijlstra wrote:
> On Mon, Nov 02, 2015 at 05:36:26PM +0100, Peter Zijlstra wrote:
>> On Fri, Oct 30, 2015 at 07:26:33PM -0400, Waiman Long wrote:
>>> @@ -426,6 +437,15 @@ queue:
>>>   		cpu_relax();
>>>
>>>   	/*
>>> +	 * If the next pointer is defined, we are not tail anymore.
>>> +	 * In this case, claim the spinlock&  release the MCS lock.
>>> +	 */
>>> +	if (next) {
>>> +		set_locked(lock);
>>> +		goto mcs_unlock;
>>> +	}
>>> +
>>> +	/*
>>>   	 * claim the lock:
>>>   	 *
>>>   	 * n,0,0 ->  0,0,1 : lock, uncontended
>>> @@ -458,6 +478,7 @@ queue:
>>>   	while (!(next = READ_ONCE(node->next)))
>>>   		cpu_relax();
>>>
>>> +mcs_unlock:
>>>   	arch_mcs_spin_unlock_contended(&next->locked);
>>>   	pv_kick_node(lock, next);
>>>
>> This however appears an independent optimization. Is it worth it? Would
>> we not already have observed a val != tail in this case? At which point
>> we're just adding extra code for no gain.
>>
>> That is, if we observe @next, must we then not also observe val != tail?
> Not quite; the ordering is the other way around. If we observe next we
> must also observe val != tail. But its a narrow thing. Is it really
> worth it?

If we observe next, we will observe val != tail sooner or later. It is 
not possible for it to clear the tail code in the lock. The tail xchg 
will guarantee that.

Another alternative is to do something like

+    if (!next)
          while (!(next = READ_ONCE(node->next)))
             cpu_relax();

Please let me know if that is more acceptable to you.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-11-05 16:29     ` Waiman Long
@ 2015-11-05 16:43       ` Peter Zijlstra
  2015-11-05 16:59         ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-05 16:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Thu, Nov 05, 2015 at 11:29:29AM -0500, Waiman Long wrote:
> On 11/02/2015 11:40 AM, Peter Zijlstra wrote:
> >On Fri, Oct 30, 2015 at 07:26:35PM -0400, Waiman Long wrote:
> >>This patch enables the accumulation of kicking and waiting related
> >>PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
> >>option is selected. It also enables the collection of data which
> >>enable us to calculate the kicking and wakeup latencies which have
> >>a heavy dependency on the CPUs being used.
> >>
> >>The statistical counters are per-cpu variables to minimize the
> >>performance overhead in their updates. These counters are exported
> >>via the sysfs filesystem under the /sys/kernel/qlockstat directory.
> >>When the corresponding sysfs files are read, summation and computing
> >>of the required data are then performed.
> >Why did you switch to sysfs? You can create custom debugfs files too.
> 
> I was not aware of that capability. So you mean using debugfs_create_file()
> using custom file_operations. Right? 

Yep.

> That doesn't seem to be easier than
> using sysfs. However, I can use that if you think it is better to use
> debugfs.

Mostly I just wanted to point out that it was possible; you need not
change to sysfs because debugfs lacks the capability.

But now that you ask, I think debugfs might be the better place, such
statistics (and the proposed CONFIG symbol) are purely for debug
purposes, right?
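
For illustration, a custom debugfs file over the same per-cpu counters could
look something like the sketch below. The qstats[] array and qstat_names[]
follow the patch; the seq_file plumbing and the qlockstat directory name are
assumptions, not the code that was eventually posted.

#include <linux/debugfs.h>
#include <linux/seq_file.h>
#include <linux/percpu.h>

static int qstat_show(struct seq_file *m, void *v)
{
	/* which counter to dump is carried in the file's private data */
	enum qlock_stats stat = (long)m->private;
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(qstats[stat], cpu);

	seq_printf(m, "%lu\n", sum);
	return 0;
}

static int qstat_open(struct inode *inode, struct file *file)
{
	return single_open(file, qstat_show, inode->i_private);
}

static const struct file_operations qstat_fops = {
	.open		= qstat_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= single_release,
};

static int __init qstat_debugfs_init(void)
{
	struct dentry *d = debugfs_create_dir("qlockstat", NULL);
	int i;

	/* one read-only file per counter, each summing over all CPUs */
	for (i = 0; i < qstat_num; i++)
		debugfs_create_file(qstat_names[i], 0400, d,
				    (void *)(long)i, &qstat_fops);
	return 0;
}
fs_initcall(qstat_debugfs_init);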

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-05 16:42       ` Waiman Long
@ 2015-11-05 16:49         ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-05 16:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Thu, Nov 05, 2015 at 11:42:27AM -0500, Waiman Long wrote:
> If we observe next, we will observe val != tail sooner or later. It is not
> possible for it to clear the tail code in the lock. The tail xchg will
> guarantee that.
> 
> Another alternative is to do something like
> 
> +    if (!next)
>          while (!(next = READ_ONCE(node->next)))
>             cpu_relax();
> 

Yes maybe, although the main reason I fell over this was because it was
a separate change (and not mentioned in the Changelog).

Although the above would need braces (per CodingStyle), so:

	if (!next) {
		while (!(next = READ_ONCE(node->next)))
			cpu_relax();
	}




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
  2015-11-05 16:39       ` Peter Zijlstra
@ 2015-11-05 16:52         ` Waiman Long
  0 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2015-11-05 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/05/2015 11:39 AM, Peter Zijlstra wrote:
> On Thu, Nov 05, 2015 at 11:06:48AM -0500, Waiman Long wrote:
>
>>> How does it affect IVB-EX (which you were testing earlier IIRC)?
>> My testing on IVB-EX indicated that if the critical section is really short,
>> the change may actually slow things down a bit in some cases. However, when the
>> critical section is long enough that the prefetch overhead can be hidden
>> within the lock acquisition loop, there will be a performance boost.
>>>> @@ -426,6 +437,15 @@ queue:
>>>>   		cpu_relax();
>>>>
>>>>   	/*
>>>> +	 * If the next pointer is defined, we are not tail anymore.
>>>> +	 * In this case, claim the spinlock&   release the MCS lock.
>>>> +	 */
>>>> +	if (next) {
>>>> +		set_locked(lock);
>>>> +		goto mcs_unlock;
>>>> +	}
>>>> +
>>>> +	/*
>>>>   	 * claim the lock:
>>>>   	 *
>>>>   	 * n,0,0 ->   0,0,1 : lock, uncontended
>>>> @@ -458,6 +478,7 @@ queue:
>>>>   	while (!(next = READ_ONCE(node->next)))
>>>>   		cpu_relax();
>>>>
>>>> +mcs_unlock:
>>>>   	arch_mcs_spin_unlock_contended(&next->locked);
>>>>   	pv_kick_node(lock, next);
>>>>
>>> This however appears an independent optimization. Is it worth it? Would
>>> we not already have observed a val != tail in this case? At which point
>>> we're just adding extra code for no gain.
>>>
>>> That is, if we observe @next, must we then not also observe val != tail?
>> Observing next implies val != tail, but the reverse may not be true. The
>> branch is done before we observe val != tail. Yes, it is an optimization to
>> avoid reading node->next again if we have already observed next. I have
>> observed a very minor performance boost with that change without the
>> prefetch.
> This is all good information to have in the Changelog. And since these
> are two independent changes, two patches would have been the right
> format.

Yep, I will separate it into 2 patches and include additional 
information in the changelog.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-11-05 16:43       ` Peter Zijlstra
@ 2015-11-05 16:59         ` Waiman Long
  2015-11-05 17:09           ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-05 16:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/05/2015 11:43 AM, Peter Zijlstra wrote:
> On Thu, Nov 05, 2015 at 11:29:29AM -0500, Waiman Long wrote:
>> On 11/02/2015 11:40 AM, Peter Zijlstra wrote:
>>> On Fri, Oct 30, 2015 at 07:26:35PM -0400, Waiman Long wrote:
>>>> This patch enables the accumulation of kicking and waiting related
>>>> PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
>>>> option is selected. It also enables the collection of data which
>>>> enable us to calculate the kicking and wakeup latencies which have
>>>> a heavy dependency on the CPUs being used.
>>>>
>>>> The statistical counters are per-cpu variables to minimize the
>>>> performance overhead in their updates. These counters are exported
>>>> via the sysfs filesystem under the /sys/kernel/qlockstat directory.
>>>> When the corresponding sysfs files are read, summation and computing
>>>> of the required data are then performed.
>>> Why did you switch to sysfs? You can create custom debugfs files too.
>> I was not aware of that capability. So you mean using debugfs_create_file()
>> using custom file_operations. Right?
> Yep.
>
>> That doesn't seem to be easier than
>> using sysfs. However, I can use that if you think it is better to use
>> debugfs.
> Mostly I just wanted to point out that it was possible; you need not
> change to sysfs because debugfs lacks the capability.
>
> But now that you ask, I think debugfs might be the better place, such
> statistics (and the proposed CONFIG symbol) are purely for debug
> purposes, right?

Davidlohr had asked me to use per-cpu counters to reduce performance 
overhead so that they can be usable in production systems. That is 
another reason why I moved to sysfs.

BTW, do you have comments on the other patches in the series? I would 
like to collect all the comments before I renew the series.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-11-05 16:59         ` Waiman Long
@ 2015-11-05 17:09           ` Peter Zijlstra
  2015-11-05 17:34             ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-05 17:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Thu, Nov 05, 2015 at 11:59:21AM -0500, Waiman Long wrote:
> >Mostly I just wanted to point out that it was possible; you need not
> >change to sysfs because debugfs lacks the capability.
> >
> >But now that you ask, I think debugfs might be the better place, such
> >statistics (and the proposed CONFIG symbol) are purely for debug
> >purposes, right?
> 
> Davidlohr had asked me to use per-cpu counters to reduce performance
> overhead so that they can be usable in production systems. That is another
> reason why I moved to sysfs.

Yes, the per-cpu thing certainly makes sense. But as said, that does not
require you to move to sysfs.

> BTW, do you have comments on the other patches in the series? I would like
> to collect all the comments before I renew the series.

I still have to look at the last two patches, I've sadly not had time
for that yet.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics
  2015-11-05 17:09           ` Peter Zijlstra
@ 2015-11-05 17:34             ` Waiman Long
  0 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2015-11-05 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/05/2015 12:09 PM, Peter Zijlstra wrote:
> On Thu, Nov 05, 2015 at 11:59:21AM -0500, Waiman Long wrote:
>>> Mostly I just wanted to point out that it was possible; you need not
>>> change to sysfs because debugfs lacks the capability.
>>>
>>> But now that you ask, I think debugfs might be the better place, such
>>> statistics (and the proposed CONFIG symbol) are purely for debug
>>> purposes, right?
>> Davidlohr had asked me to use per-cpu counters to reduce performance
>> overhead so that they can be usable in production systems. That is another
>> reason why I moved to sysfs.
> Yes, the per-cpu thing certainly makes sense. But as said, that does not
> require you to move to sysfs.
>
>> BTW, do you have comments on the other patches in the series? I would like
>> to collect all the comments before I renew the series.
> I still have to look at the last two patches, I've sadly not had time
> for that yet.

That is what I thought. Just let me know when you are done with the 
review, and I will update the patches and send out a new series.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
@ 2015-11-06 14:50   ` Peter Zijlstra
  2015-11-06 17:47     ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-06 14:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Oct 30, 2015 at 07:26:36PM -0400, Waiman Long wrote:

> @@ -431,35 +432,44 @@ queue:
>  	 * sequentiality; this is because the set_locked() function below
>  	 * does not imply a full barrier.
>  	 *
> +	 * The PV pv_wait_head_lock function, if active, will acquire the lock
> +	 * and return a non-zero value. So we have to skip the
> +	 * smp_load_acquire() call. As the next PV queue head hasn't been
> +	 * designated yet, there is no way for the locked value to become
> +	 * _Q_SLOW_VAL. So both the redundant set_locked() and the
> +	 * atomic_cmpxchg_relaxed() calls will be safe. The cost of the
> +	 * redundant set_locked() call below should be negligible, too.
> +	 *
> +	 * If PV isn't active, 0 will be returned instead.
>  	 */
> -	pv_wait_head(lock, node);
> -	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
> -		cpu_relax();
> +	val = pv_wait_head_lock(lock, node);
> +	if (!val) {
> +		while ((val = smp_load_acquire(&lock->val.counter))
> +				& _Q_LOCKED_PENDING_MASK)
> +			cpu_relax();
> +		/*
> +		 * Claim the lock now:
> +		 *
> +		 * 0,0 -> 0,1
> +		 */
> +		set_locked(lock);
> +		val |= _Q_LOCKED_VAL;
> +	}
>  
>  	/*
>  	 * If the next pointer is defined, we are not tail anymore.
> -	 * In this case, claim the spinlock & release the MCS lock.
>  	 */
> -	if (next) {
> -		set_locked(lock);
> +	if (next)
>  		goto mcs_unlock;
> -	}
>  
>  	/*
> -	 * claim the lock:
> -	 *
> -	 * n,0,0 -> 0,0,1 : lock, uncontended
> -	 * *,0,0 -> *,0,1 : lock, contended
> -	 *
>  	 * If the queue head is the only one in the queue (lock value == tail),
> -	 * clear the tail code and grab the lock. Otherwise, we only need
> -	 * to grab the lock.
> +	 * we have to clear the tail code.
>  	 */
>  	for (;;) {
> -		if (val != tail) {
> -			set_locked(lock);
> +		if ((val & _Q_TAIL_MASK) != tail)
>  			break;
> -		}
> +
>  		/*
>  		 * The smp_load_acquire() call above has provided the necessary
>  		 * acquire semantics required for locking. At most two

*urgh*, last time we had:

+	if (pv_wait_head_or_steal())
+		goto stolen;
	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
		cpu_relax();

	...

+stolen:
	while (!(next = READ_ONCE(node->next)))
		cpu_relax();

	...

Now you completely overhaul the native code.. what happened?

> -static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> +static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>  {
>  	struct pv_node *pn = (struct pv_node *)node;
>  	struct __qspinlock *l = (void *)lock;
> @@ -276,11 +330,24 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>  		lp = (struct qspinlock **)1;
>  
>  	for (;; waitcnt++) {
> +		/*
> +		 * Set the pending bit in the active lock spinning loop to
> +		 * disable lock stealing. However, the pending bit check in
> +		 * pv_queued_spin_trylock_unfair() and the setting/clearing
> +		 * of pending bit here aren't memory barriers. So a cmpxchg()
> +		 * is used to acquire the lock to be sure.
> +		 */
> +		set_pending(lock);

OK, so we mark ourselves 'pending' such that a new lock() will not steal
and is forced to queue behind us.

>  		for (loop = SPIN_THRESHOLD; loop; loop--) {
> -			if (!READ_ONCE(l->locked))
> -				return;
> +			if (!READ_ONCE(l->locked) &&
> +			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> +				clear_pending(lock);
> +				goto gotlock;

Would not: cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL),
make sense to avoid the clear_pending() call?

> +			}
>  			cpu_relax();
>  		}
> +		clear_pending(lock);
> +



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-10-30 23:26 ` [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
@ 2015-11-06 15:01   ` Peter Zijlstra
  2015-11-06 17:54     ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-06 15:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Oct 30, 2015 at 07:26:37PM -0400, Waiman Long wrote:
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -23,6 +23,19 @@
>  #define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
>  
>  /*
> + * Queue Node Adaptive Spinning
> + *
> + * A queue node vCPU will stop spinning if the vCPU in the previous node is
> + * not running. The one lock stealing attempt allowed at slowpath entry
> + * mitigates the slight slowdown for non-overcommitted guest with this
> + * aggressive wait-early mechanism.
> + *
> + * The status of the previous node will be checked at fixed interval
> + * controlled by PV_PREV_CHECK_MASK.
> + */
> +#define PV_PREV_CHECK_MASK	0xff
> +
> +/*
>   * Queue node uses: vcpu_running & vcpu_halted.
>   * Queue head uses: vcpu_running & vcpu_hashed.
>   */
> @@ -202,6 +215,20 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
>  }
>  
>  /*
> + * Return true when it is time to check the previous node and that node is
> + * not in a running state.
> + */
> +static inline bool
> +pv_wait_early(struct pv_node *prev, int loop)
> +{
> +
> +	if ((loop & PV_PREV_CHECK_MASK) != 0)
> +		return false;
> +
> +	return READ_ONCE(prev->state) != vcpu_running;
> +}

So it appears to me the sole purpose of PV_PREV_CHECK_MASK is to avoid
touching the prev->state cacheline too hard. Yet that is not mentioned
anywhere above.


> +static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>  {
>  	struct pv_node *pn = (struct pv_node *)node;
> +	struct pv_node *pp = (struct pv_node *)prev;
>  	int waitcnt = 0;
>  	int loop;
> +	bool wait_early;
>  
>  	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
>  	for (;; waitcnt++) {
> -		for (loop = SPIN_THRESHOLD; loop; loop--) {
> +		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
>  			if (READ_ONCE(node->locked))
>  				return;
> +			if (pv_wait_early(pp, loop)) {
> +				wait_early = true;
> +				break;
> +			}
>  			cpu_relax();
>  		}
>  

So if prev points to another node, it will never see vcpu_running. Was
that fully intended?

FYI, I think I've now seen all patches ;-)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-11-06 14:50   ` Peter Zijlstra
@ 2015-11-06 17:47     ` Waiman Long
  2015-11-09 17:29       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-06 17:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/06/2015 09:50 AM, Peter Zijlstra wrote:
> On Fri, Oct 30, 2015 at 07:26:36PM -0400, Waiman Long wrote:
>
>> @@ -431,35 +432,44 @@ queue:
>>   	 * sequentiality; this is because the set_locked() function below
>>   	 * does not imply a full barrier.
>>   	 *
>> +	 * The PV pv_wait_head_lock function, if active, will acquire the lock
>> +	 * and return a non-zero value. So we have to skip the
>> +	 * smp_load_acquire() call. As the next PV queue head hasn't been
>> +	 * designated yet, there is no way for the locked value to become
>> +	 * _Q_SLOW_VAL. So both the redundant set_locked() and the
>> +	 * atomic_cmpxchg_relaxed() calls will be safe. The cost of the
>> +	 * redundant set_locked() call below should be negligible, too.
>> +	 *
>> +	 * If PV isn't active, 0 will be returned instead.
>>   	 */
>> -	pv_wait_head(lock, node);
>> -	while ((val = smp_load_acquire(&lock->val.counter))&  _Q_LOCKED_PENDING_MASK)
>> -		cpu_relax();
>> +	val = pv_wait_head_lock(lock, node);
>> +	if (!val) {
>> +		while ((val = smp_load_acquire(&lock->val.counter))
>> +				&  _Q_LOCKED_PENDING_MASK)
>> +			cpu_relax();
>> +		/*
>> +		 * Claim the lock now:
>> +		 *
>> +		 * 0,0 ->  0,1
>> +		 */
>> +		set_locked(lock);
>> +		val |= _Q_LOCKED_VAL;
>> +	}
>>
>>   	/*
>>   	 * If the next pointer is defined, we are not tail anymore.
>> -	 * In this case, claim the spinlock&  release the MCS lock.
>>   	 */
>> -	if (next) {
>> -		set_locked(lock);
>> +	if (next)
>>   		goto mcs_unlock;
>> -	}
>>
>>   	/*
>> -	 * claim the lock:
>> -	 *
>> -	 * n,0,0 ->  0,0,1 : lock, uncontended
>> -	 * *,0,0 ->  *,0,1 : lock, contended
>> -	 *
>>   	 * If the queue head is the only one in the queue (lock value == tail),
>> -	 * clear the tail code and grab the lock. Otherwise, we only need
>> -	 * to grab the lock.
>> +	 * we have to clear the tail code.
>>   	 */
>>   	for (;;) {
>> -		if (val != tail) {
>> -			set_locked(lock);
>> +		if ((val&  _Q_TAIL_MASK) != tail)
>>   			break;
>> -		}
>> +
>>   		/*
>>   		 * The smp_load_acquire() call above has provided the necessary
>>   		 * acquire semantics required for locking. At most two
> *urgh*, last time we had:
>
> +	if (pv_wait_head_or_steal())
> +		goto stolen;
> 	while ((val = smp_load_acquire(&lock->val.counter))&  _Q_LOCKED_PENDING_MASK)
> 		cpu_relax();
>
> 	...
>
> +stolen:
> 	while (!(next = READ_ONCE(node->next)))
> 		cpu_relax();
>
> 	...
>
> Now you completely overhaul the native code.. what happened?

I want to reuse as much of the existing native code as possible instead 
of duplicating that in the PV function. The only difference now is that 
the PV function will acquire that lock. Semantically, I don't want to 
call the lock acquisition lock stealing, as the queue head is entitled 
to get the lock next. I can rename pv_queued_spin_trylock_unfair() to 
pv_queued_spin_steal_lock() to emphasize the fact that this is the 
routine where lock stealing happens.

>> -static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>> +static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>>   {
>>   	struct pv_node *pn = (struct pv_node *)node;
>>   	struct __qspinlock *l = (void *)lock;
>> @@ -276,11 +330,24 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>>   		lp = (struct qspinlock **)1;
>>
>>   	for (;; waitcnt++) {
>> +		/*
>> +		 * Set the pending bit in the active lock spinning loop to
>> +		 * disable lock stealing. However, the pending bit check in
>> +		 * pv_queued_spin_trylock_unfair() and the setting/clearing
>> +		 * of pending bit here aren't memory barriers. So a cmpxchg()
>> +		 * is used to acquire the lock to be sure.
>> +		 */
>> +		set_pending(lock);
> OK, so we mark ourselves 'pending' such that a new lock() will not steal
> and is forced to queue behind us.

Yes, this ensures that lock starvation will not happen.
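
For illustration, a minimal sketch of what such pending-bit helpers could look
like, assuming the byte-addressable struct __qspinlock layout already used in
qspinlock_paravirt.h for the NR_CPUS < 16K case (the helpers in the actual
patch may differ):

static __always_inline void set_pending(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	WRITE_ONCE(l->pending, 1);
}

static __always_inline void clear_pending(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	WRITE_ONCE(l->pending, 0);
}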

>
>>   		for (loop = SPIN_THRESHOLD; loop; loop--) {
>> -			if (!READ_ONCE(l->locked))
>> -				return;
>> +			if (!READ_ONCE(l->locked)&&
>> +			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>> +				clear_pending(lock);
>> +				goto gotlock;
> Would not: cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL),
> make sense to avoid the clear_pending() call?

I can combine cmpxchg() and clear_pending() into a new helper function 
as its implementation will differ depending on NR_CPUS.
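
One possible shape of such a combined helper, for the NR_CPUS < 16K case where
locked and pending form a 16-bit locked_pending field, is sketched below; this
is an assumption about the next revision, not the posted code.

/*
 * Transition (pending set, unlocked) -> (pending clear, locked) in one
 * atomic step, so a successful acquisition also clears the pending bit.
 */
static __always_inline int trylock_clear_pending(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	return !READ_ONCE(l->locked) &&
	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
			== _Q_PENDING_VAL);
}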

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-11-06 15:01   ` Peter Zijlstra
@ 2015-11-06 17:54     ` Waiman Long
  2015-11-06 20:37       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-06 17:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/06/2015 10:01 AM, Peter Zijlstra wrote:
> On Fri, Oct 30, 2015 at 07:26:37PM -0400, Waiman Long wrote:
>> +++ b/kernel/locking/qspinlock_paravirt.h
>> @@ -23,6 +23,19 @@
>>   #define _Q_SLOW_VAL	(3U<<  _Q_LOCKED_OFFSET)
>>
>>   /*
>> + * Queue Node Adaptive Spinning
>> + *
>> + * A queue node vCPU will stop spinning if the vCPU in the previous node is
>> + * not running. The one lock stealing attempt allowed at slowpath entry
>> + * mitigates the slight slowdown for non-overcommitted guest with this
>> + * aggressive wait-early mechanism.
>> + *
>> + * The status of the previous node will be checked at fixed interval
>> + * controlled by PV_PREV_CHECK_MASK.
>> + */
>> +#define PV_PREV_CHECK_MASK	0xff
>> +
>> +/*
>>    * Queue node uses: vcpu_running&  vcpu_halted.
>>    * Queue head uses: vcpu_running&  vcpu_hashed.
>>    */
>> @@ -202,6 +215,20 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
>>   }
>>
>>   /*
>> + * Return true when it is time to check the previous node and that node is
>> + * not in a running state.
>> + */
>> +static inline bool
>> +pv_wait_early(struct pv_node *prev, int loop)
>> +{
>> +
>> +	if ((loop&  PV_PREV_CHECK_MASK) != 0)
>> +		return false;
>> +
>> +	return READ_ONCE(prev->state) != vcpu_running;
>> +}
> So it appears to me the sole purpose of PV_PREV_CHECK_MASK is to avoid
> touching the prev->state cacheline too hard. Yet that is not mentioned
> anywhere above.

Yes, that is true. I will add a comment to that effect.
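
For illustration, the comment being promised might end up reading something
like this when folded into pv_wait_early(); the wording here is hypothetical.

static inline bool
pv_wait_early(struct pv_node *prev, int loop)
{
	/*
	 * Look at prev->state only once every PV_PREV_CHECK_MASK + 1
	 * iterations to limit read traffic on the previous node's
	 * cacheline while we spin on our own node->locked.
	 */
	if ((loop & PV_PREV_CHECK_MASK) != 0)
		return false;

	return READ_ONCE(prev->state) != vcpu_running;
}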

>
>> +static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>>   {
>>   	struct pv_node *pn = (struct pv_node *)node;
>> +	struct pv_node *pp = (struct pv_node *)prev;
>>   	int waitcnt = 0;
>>   	int loop;
>> +	bool wait_early;
>>
>>   	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
>>   	for (;; waitcnt++) {
>> -		for (loop = SPIN_THRESHOLD; loop; loop--) {
>> +		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
>>   			if (READ_ONCE(node->locked))
>>   				return;
>> +			if (pv_wait_early(pp, loop)) {
>> +				wait_early = true;
>> +				break;
>> +			}
>>   			cpu_relax();
>>   		}
>>
> So if prev points to another node, it will never see vcpu_running. Was
> that fully intended?

I had added code in pv_wait_head_or_lock to set the state appropriately 
for the queue head vCPU.

         for (;; waitcnt++) {
                 /*
+                * Set correct vCPU state to be used by queue node 
wait-early
+                * mechanism.
+                */
+               WRITE_ONCE(pn->state, vcpu_running);
+
+               /*
                  * Set the pending bit in the active lock spinning loop to
                  * disable lock stealing. However, the pending bit check in
                  * pv_queued_spin_trylock_unfair() and the setting/clearing
@@ -374,6 +414,7 @@ static u32 pv_wait_head_lock(struct qspinlock *lock, 
struct mcs_spinlock *node)
                                 goto gotlock;
                         }
                 }
+               WRITE_ONCE(pn->state, vcpu_halted);

> FYI, I think I've now seen all patches ;-)

Thanks for the review. I will work on fixing the issues you identified 
and issue a new patch series next week.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-11-06 17:54     ` Waiman Long
@ 2015-11-06 20:37       ` Peter Zijlstra
  2015-11-09 16:51         ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-06 20:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Nov 06, 2015 at 12:54:06PM -0500, Waiman Long wrote:
> >>+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
> >>  {
> >>  	struct pv_node *pn = (struct pv_node *)node;
> >>+	struct pv_node *pp = (struct pv_node *)prev;
> >>  	int waitcnt = 0;
> >>  	int loop;
> >>+	bool wait_early;
> >>
> >>  	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
> >>  	for (;; waitcnt++) {
> >>-		for (loop = SPIN_THRESHOLD; loop; loop--) {
> >>+		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
> >>  			if (READ_ONCE(node->locked))
> >>  				return;
> >>+			if (pv_wait_early(pp, loop)) {
> >>+				wait_early = true;
> >>+				break;
> >>+			}
> >>  			cpu_relax();
> >>  		}
> >>
> >So if prev points to another node, it will never see vcpu_running. Was
> >that fully intended?
> 
> I had added code in pv_wait_head_or_lock to set the state appropriately for
> the queue head vCPU.

Yes, but that's the head, for nodes we'll always have halted or hashed.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-11-06 20:37       ` Peter Zijlstra
@ 2015-11-09 16:51         ` Waiman Long
  2015-11-09 17:33           ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2015-11-09 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/06/2015 03:37 PM, Peter Zijlstra wrote:
> On Fri, Nov 06, 2015 at 12:54:06PM -0500, Waiman Long wrote:
>>>> +static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>>>>   {
>>>>   	struct pv_node *pn = (struct pv_node *)node;
>>>> +	struct pv_node *pp = (struct pv_node *)prev;
>>>>   	int waitcnt = 0;
>>>>   	int loop;
>>>> +	bool wait_early;
>>>>
>>>>   	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
>>>>   	for (;; waitcnt++) {
>>>> -		for (loop = SPIN_THRESHOLD; loop; loop--) {
>>>> +		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
>>>>   			if (READ_ONCE(node->locked))
>>>>   				return;
>>>> +			if (pv_wait_early(pp, loop)) {
>>>> +				wait_early = true;
>>>> +				break;
>>>> +			}
>>>>   			cpu_relax();
>>>>   		}
>>>>
>>> So if prev points to another node, it will never see vcpu_running. Was
>>> that fully intended?
>> I had added code in pv_wait_head_or_lock to set the state appropriately for
>> the queue head vCPU.
> Yes, but that's the head, for nodes we'll always have halted or hashed.

The node state was initialized to be vcpu_running. In pv_wait_node(), it 
will be changed to vcpu_halted before sleeping and back to vcpu_running 
after that. So it is not true that it is either halted or hashed.

In case it was changed to vcpu_hashed, it will be changed back to 
vcpu_running in pv_wait_head_lock before entering the active spinning 
loop. There are definitely small windows of time where the node state 
does not reflect the actual vCPU state, but that is the best we can do 
so far.
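
For reference, the states involved are the three values of the vcpu_state enum
from qspinlock_paravirt.h; the transition notes in the comments below
summarize the discussion above rather than quote the source.

enum vcpu_state {
	vcpu_running = 0,	/* initial value; restored after pv_wait() returns */
	vcpu_halted,		/* set by pv_wait_node() just before it calls pv_wait() */
	vcpu_hashed,		/* set once the node/lock pair has been hashed for unlock */
};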

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-11-06 17:47     ` Waiman Long
@ 2015-11-09 17:29       ` Peter Zijlstra
  2015-11-09 19:53         ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-09 17:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Fri, Nov 06, 2015 at 12:47:49PM -0500, Waiman Long wrote:
> On 11/06/2015 09:50 AM, Peter Zijlstra wrote:
> >*urgh*, last time we had:
> >
> >+	if (pv_wait_head_or_steal())
> >+		goto stolen;
> >	while ((val = smp_load_acquire(&lock->val.counter))&  _Q_LOCKED_PENDING_MASK)
> >		cpu_relax();
> >
> >	...
> >
> >+stolen:
> >	while (!(next = READ_ONCE(node->next)))
> >		cpu_relax();
> >
> >	...
> >
> >Now you completely overhaul the native code.. what happened?
> 
> I want to reuse as much of the existing native code as possible instead of
> duplicating that in the PV function. The only difference now is that the PV
> function will acquire that lock.

Right; and while I doubt it hurts the native case (you did benchmark it
I hope), I'm not too keen on the end result code wise.

Maybe just keep the above.

> Semantically, I don't want to call the lock
> acquisition lock stealing, as the queue head is entitled to get the lock
> next. 

Fair enough I suppose, pv_wait_head_or_lock() then?

> I can rename pv_queued_spin_trylock_unfair() to
> pv_queued_spin_steal_lock() to emphasize the fact that this is the routine
> where lock stealing happens.

OK.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning
  2015-11-09 16:51         ` Waiman Long
@ 2015-11-09 17:33           ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2015-11-09 17:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On Mon, Nov 09, 2015 at 11:51:20AM -0500, Waiman Long wrote:
> On 11/06/2015 03:37 PM, Peter Zijlstra wrote:
> >On Fri, Nov 06, 2015 at 12:54:06PM -0500, Waiman Long wrote:
> >>>>+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
> >>>>  {
> >>>>  	struct pv_node *pn = (struct pv_node *)node;
> >>>>+	struct pv_node *pp = (struct pv_node *)prev;
> >>>>  	int waitcnt = 0;
> >>>>  	int loop;
> >>>>+	bool wait_early;
> >>>>
> >>>>  	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
> >>>>  	for (;; waitcnt++) {
> >>>>-		for (loop = SPIN_THRESHOLD; loop; loop--) {
> >>>>+		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
> >>>>  			if (READ_ONCE(node->locked))
> >>>>  				return;
> >>>>+			if (pv_wait_early(pp, loop)) {
> >>>>+				wait_early = true;
> >>>>+				break;
> >>>>+			}
> >>>>  			cpu_relax();
> >>>>  		}
> >>>>
> >>>So if prev points to another node, it will never see vcpu_running. Was
> >>>that fully intended?
> >>I had added code in pv_wait_head_or_lock to set the state appropriately for
> >>the queue head vCPU.
> >Yes, but that's the head, for nodes we'll always have halted or hashed.
> 
> The node state was initialized to be vcpu_running. In pv_wait_node(), it
> will be changed to vcpu_halted before sleeping and back to vcpu_running
> after that. So it is not true that it is either halted or hashed.

Durh,.. I mixed up pv_wait_node() and pv_wait_head() I think. Sorry for
the noise.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-11-09 17:29       ` Peter Zijlstra
@ 2015-11-09 19:53         ` Waiman Long
  0 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2015-11-09 19:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
	Scott J Norton, Douglas Hatch, Davidlohr Bueso

On 11/09/2015 12:29 PM, Peter Zijlstra wrote:
> On Fri, Nov 06, 2015 at 12:47:49PM -0500, Waiman Long wrote:
>> On 11/06/2015 09:50 AM, Peter Zijlstra wrote:
>>> *urgh*, last time we had:
>>>
>>> +	if (pv_wait_head_or_steal())
>>> +		goto stolen;
>>> 	while ((val = smp_load_acquire(&lock->val.counter))&   _Q_LOCKED_PENDING_MASK)
>>> 		cpu_relax();
>>>
>>> 	...
>>>
>>> +stolen:
>>> 	while (!(next = READ_ONCE(node->next)))
>>> 		cpu_relax();
>>>
>>> 	...
>>>
>>> Now you completely overhaul the native code.. what happened?
>> I want to reuse as much of the existing native code as possible instead of
>> duplicating that in the PV function. The only difference now is that the PV
>> function will acquire that lock.
> Right; and while I doubt it hurts the native case (you did benchmark it
> I hope), I'm not too keen on the end result code wise.
>
> Maybe just keep the above.

I can jump over the smp_load_acquire() for PV instead of adding an 
additional if block. For the native code, the only thing that was added 
was an additional masking of val with _Q_TAIL_MASK which I don't think 
will make too much of a difference.
>
>> Semantically, I don't want to call the lock
>> acquisition lock stealing, as the queue head is entitled to get the lock
>> next.
> Fair enough I suppose, pv_wait_head_or_lock() then?
>

I am fine with that name.

>> I can rename pv_queued_spin_trylock_unfair() to
>> pv_queued_spin_steal_lock() to emphasize the fact that this is the routine
>> where lock stealing happens.
> OK.
>

Cheers,
Longman

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2015-11-09 19:53 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-30 23:26 [PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 1/6] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline Waiman Long
2015-11-02 16:36   ` Peter Zijlstra
2015-11-02 22:54     ` Peter Zijlstra
2015-11-05 16:42       ` Waiman Long
2015-11-05 16:49         ` Peter Zijlstra
2015-11-05 16:06     ` Waiman Long
2015-11-05 16:39       ` Peter Zijlstra
2015-11-05 16:52         ` Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 3/6] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 4/6] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
2015-11-02 16:40   ` Peter Zijlstra
2015-11-05 16:29     ` Waiman Long
2015-11-05 16:43       ` Peter Zijlstra
2015-11-05 16:59         ` Waiman Long
2015-11-05 17:09           ` Peter Zijlstra
2015-11-05 17:34             ` Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 5/6] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
2015-11-06 14:50   ` Peter Zijlstra
2015-11-06 17:47     ` Waiman Long
2015-11-09 17:29       ` Peter Zijlstra
2015-11-09 19:53         ` Waiman Long
2015-10-30 23:26 ` [PATCH tip/locking/core v9 6/6] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
2015-11-06 15:01   ` Peter Zijlstra
2015-11-06 17:54     ` Waiman Long
2015-11-06 20:37       ` Peter Zijlstra
2015-11-09 16:51         ` Waiman Long
2015-11-09 17:33           ` Peter Zijlstra
