* [PATCH v2 0/5] locking/pvqspinlock: Fix missed PV wakeup & support PPC
@ 2016-05-31 16:53 Waiman Long
  2016-05-31 16:53 ` [PATCH v2 1/5] locking/pvstat: Separate wait_again and spurious wakeup stats Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Waiman Long @ 2016-05-31 16:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Pan Xinhui, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

v1->v2:
 - Adds a patch to make pv_unhash() atomic to prevent racing.
 - Adds 2 more patches to update the pvstat code.

Patch 1 separates the wait_again and spurious wakeup stat counters
so that the former is for the queue head only and the latter is for
the non-head queue nodes.

Patch 2 fixes the missed PV wakeup problem as reported by Pan Xinhui.

Patch 3 makes pv_unhash() atomic to fix another potential racing problem
reported by Boqun Feng.

Patch 4 adds a new pvstat counter to track the rare _Q_SLOW_VAL race.

Patch 5 adds a new argument to pv_wait() to provide better support
for PPC.

Waiman Long (5):
  locking/pvstat: Separate wait_again and spurious wakeup stats
  locking/pvqspinlock: Fix missed PV wakeup problem
  locking/pvqspinlock: Make pv_unhash() atomic
  locking/pvstat: Add stat counter to track _Q_SLOW_VAL race
  locking/pvqspinlock: Add lock holder CPU argument to pv_wait()

 arch/x86/include/asm/paravirt.h       |    4 +-
 arch/x86/include/asm/paravirt_types.h |    2 +-
 arch/x86/kernel/kvm.c                 |    2 +-
 arch/x86/xen/spinlock.c               |    2 +-
 kernel/locking/qspinlock.c            |    5 +-
 kernel/locking/qspinlock_paravirt.h   |  189 ++++++++++++++++++++++++---------
 kernel/locking/qspinlock_stat.h       |   17 ++--
 7 files changed, 159 insertions(+), 62 deletions(-)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/5] locking/pvstat: Separate wait_again and spurious wakeup stats
  2016-05-31 16:53 [PATCH v2 0/5] locking/pvqspinlock: Fix missed PV wakeup & support PPC Waiman Long
@ 2016-05-31 16:53 ` Waiman Long
  2016-08-10 18:07   ` [tip:locking/core] " tip-bot for Waiman Long
  2016-05-31 16:53 ` [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2016-05-31 16:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Pan Xinhui, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

Currently there is overlap between the pvqspinlock wait_again and
spurious_wakeup stat counters. Because of lock stealing, it is no
longer possible to accurately determine if a spurious wakeup has
happened at the queue head. As the counters track both queue node and
queue head status, it is also hard to tell how many of those events
come from the queue head and how many from the queue nodes.

This patch changes the accounting rules so that spurious wakeups are
only tracked for queue nodes. The wait_again count, in turn, is only
tracked for the queue head, and only when the vCPU fails to acquire
the lock after a vCPU kick. This should give a much better indication
of the wait-kick dynamics at the queue nodes and the queue head.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock_paravirt.h |   12 +++---------
 kernel/locking/qspinlock_stat.h     |    4 ++--
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 21ede57..75cc207 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -286,12 +286,10 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 {
 	struct pv_node *pn = (struct pv_node *)node;
 	struct pv_node *pp = (struct pv_node *)prev;
-	int waitcnt = 0;
 	int loop;
 	bool wait_early;
 
-	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
-	for (;; waitcnt++) {
+	for (;;) {
 		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
 				return;
@@ -315,7 +313,6 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 
 		if (!READ_ONCE(node->locked)) {
 			qstat_inc(qstat_pv_wait_node, true);
-			qstat_inc(qstat_pv_wait_again, waitcnt);
 			qstat_inc(qstat_pv_wait_early, wait_early);
 			pv_wait(&pn->state, vcpu_halted);
 		}
@@ -456,12 +453,9 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 		pv_wait(&l->locked, _Q_SLOW_VAL);
 
 		/*
-		 * The unlocker should have freed the lock before kicking the
-		 * CPU. So if the lock is still not free, it is a spurious
-		 * wakeup or another vCPU has stolen the lock. The current
-		 * vCPU should spin again.
+		 * Because of lock stealing, the queue head vCPU may not be
+		 * able to acquire the lock before it has to wait again.
 		 */
-		qstat_inc(qstat_pv_spurious_wakeup, READ_ONCE(l->locked));
 	}
 
 	/*
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 22e0253..3cd07a9 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -24,8 +24,8 @@
  *   pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
  *   pv_lock_slowpath	- # of locking operations via the slowpath
  *   pv_lock_stealing	- # of lock stealing operations
- *   pv_spurious_wakeup	- # of spurious wakeups
- *   pv_wait_again	- # of vCPU wait's that happened after a vCPU kick
+ *   pv_spurious_wakeup	- # of spurious wakeups in non-head vCPUs
+ *   pv_wait_again	- # of wait's after a queue head vCPU kick
  *   pv_wait_early	- # of early vCPU wait's
  *   pv_wait_head	- # of vCPU wait's at the queue head
  *   pv_wait_node	- # of vCPU wait's at a non-head queue node
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-05-31 16:53 [PATCH v2 0/5] locking/pvqspinlock: Fix missed PV wakeup & support PPC Waiman Long
  2016-05-31 16:53 ` [PATCH v2 1/5] locking/pvstat: Separate wait_again and spurious wakeup stats Waiman Long
@ 2016-05-31 16:53 ` Waiman Long
  2016-07-15  8:47   ` Peter Zijlstra
  2016-05-31 16:53 ` [PATCH v2 3/5] locking/pvqspinlock: Make pv_unhash() atomic Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2016-05-31 16:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Pan Xinhui, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

Currently, calling pv_hash() and setting _Q_SLOW_VAL is only
done once for any pv_node. It is either in pv_kick_node() or in
pv_wait_head_or_lock(). Because of lock stealing, a pv_kick'ed node is
not guaranteed to get the lock before the spinning threshold expires
and has to call pv_wait() again. As a result, the new lock holder
won't see _Q_SLOW_VAL and so won't wake up the sleeping vCPU.

This patch fixes this missed PV wakeup problem by allowing multiple
_Q_SLOW_VAL settings within pv_wait_head_or_lock() and matching each
successful setting of _Q_SLOW_VAL to a pv_hash() call.

Reported-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock_paravirt.h |   48 ++++++++++++++++++++++------------
 1 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 75cc207..3df975d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -366,12 +366,16 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	/*
 	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
 	 *
-	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
-	 * the hash table later on at unlock time, no atomic instruction is
-	 * needed.
+	 * It is very unlikely that this will race with the _Q_SLOW_VAL setting
+	 * in pv_wait_head_or_lock(). However, we use cmpxchg() here to be
+	 * sure that we won't do a double pv_hash().
+	 *
+	 * As it is the lock holder, it won't race with
+	 * __pv_queued_spin_unlock().
 	 */
-	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
-	(void)pv_hash(lock, pn);
+	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)
+			== _Q_LOCKED_VAL))
+		pv_hash(lock, pn);
 }
 
 /*
@@ -386,18 +390,10 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
 	struct __qspinlock *l = (void *)lock;
-	struct qspinlock **lp = NULL;
 	int waitcnt = 0;
 	int loop;
 
 	/*
-	 * If pv_kick_node() already advanced our state, we don't need to
-	 * insert ourselves into the hash table anymore.
-	 */
-	if (READ_ONCE(pn->state) == vcpu_hashed)
-		lp = (struct qspinlock **)1;
-
-	/*
 	 * Tracking # of slowpath locking operations
 	 */
 	qstat_inc(qstat_pv_lock_slowpath, true);
@@ -419,11 +415,19 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 				goto gotlock;
 			cpu_relax();
 		}
-		clear_pending(lock);
 
+		/*
+		 * Make sure the lock value check below is executed after
+		 * all the previous loads.
+		 */
+		smp_rmb();
 
-		if (!lp) { /* ONCE */
-			lp = pv_hash(lock, pn);
+		/*
+		 * Set _Q_SLOW_VAL and hash the PV node, if necessary.
+		 */
+		if (READ_ONCE(l->locked) != _Q_SLOW_VAL) {
+			struct qspinlock **lp = pv_hash(lock, pn);
+			u8 locked;
 
 			/*
 			 * We must hash before setting _Q_SLOW_VAL, such that
@@ -436,7 +440,8 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
-			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
+			locked = xchg(&l->locked, _Q_SLOW_VAL);
+			if (locked == 0) {
 				/*
 				 * The lock was free and now we own the lock.
 				 * Change the lock value back to _Q_LOCKED_VAL
@@ -444,9 +449,18 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 				 */
 				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
 				WRITE_ONCE(*lp, NULL);
+				clear_pending(lock);
 				goto gotlock;
+			} else if (unlikely(locked == _Q_SLOW_VAL)) {
+				/*
+				 * Racing with pv_kick_node(), need to undo
+				 * the pv_hash().
+				 */
+				WRITE_ONCE(*lp, NULL);
 			}
 		}
+		clear_pending(lock);	/* Enable lock stealing */
+
 		WRITE_ONCE(pn->state, vcpu_halted);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 3/5] locking/pvqspinlock: Make pv_unhash() atomic
  2016-05-31 16:53 [PATCH v2 0/5] locking/pvqspinlock: Fix missed PV wakeup & support PPC Waiman Long
  2016-05-31 16:53 ` [PATCH v2 1/5] locking/pvstat: Separate wait_again and spurious wakeup stats Waiman Long
  2016-05-31 16:53 ` [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem Waiman Long
@ 2016-05-31 16:53 ` Waiman Long
  2016-05-31 16:53 ` [PATCH v2 4/5] locking/pvstat: Add stat counter to track _Q_SLOW_VAL race Waiman Long
  2016-05-31 16:53 ` [PATCH v2 5/5] locking/pvqspinlock: Add lock holder CPU argument to pv_wait() Waiman Long
  4 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2016-05-31 16:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Pan Xinhui, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

Boqun Feng came up with a scenario where the hashing/unhashing of a
PV node entry may produce unexpected results when the lock holder
vCPU is racing with the queue head vCPU:

  CPU 0 (lock holder)   CPU 1 (queue head)
  ===================   ==================
  spin_lock():          spin_lock():
    pv_kick_node():       pv_wait_head_or_lock():
                          if (READ_ONCE(l->locked) != _Q_SLOW_VAL) {
                            pv_hash();

    cmpxchg(&l->locked,
    _Q_LOCKED_VAL, _Q_SLOW_VAL);
    pv_hash();
                            locked = xchg(&l->locked, _Q_SLOW_VAL);
    do_something();         if(...) {
                            }
  spin_unlock():
    pv_unhash();
                            else if (unlikely(locked == _Q_SLOW_VAL)) {
                                WRITE_ONCE(*lp, NULL);

In this case, both the pv_unhash() and the WRITE_ONCE() will erase the
same entry, leaving the extra entry from the double pv_hash() around.
This may cause an incorrect lookup the next time the same lock is
hashed, leading to a missed wakeup of the queue head.

This patch fixes that particular problem by making pv_unhash() atomic
and replacing the WRITE_ONCE() above with the atomic pv_unhash(). As
long as the number of pv_hash() calls matches that of pv_unhash()
calls, no unwanted entry will be left behind.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock_paravirt.h |   71 ++++++++++++++++++++--------------
 1 files changed, 42 insertions(+), 29 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 3df975d..54e03b9 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -233,10 +233,16 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
 	struct pv_hash_entry *he;
 	struct pv_node *node;
 
+	/*
+	 * As pv_unhash() can be called from both pv_wait_head_or_lock() and
+	 * __pv_queued_spin_unlock_slowpath(), it is made atomic to avoid
+	 * racing.
+	 */
 	for_each_hash_entry(he, offset, hash) {
 		if (READ_ONCE(he->lock) == lock) {
 			node = READ_ONCE(he->node);
-			WRITE_ONCE(he->lock, NULL);
+			if (cmpxchg(&he->lock, lock, NULL) != lock)
+				continue;	/* Entry has been changed */
 			return node;
 		}
 	}
@@ -424,39 +430,46 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 
 		/*
 		 * Set _Q_SLOW_VAL and hash the PV node, if necessary.
+		 *
+		 * We must hash before setting _Q_SLOW_VAL, such that
+		 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
+		 * we'll be sure to be able to observe our hash entry.
+		 *
+		 *   [S] <hash>                 [Rmw] l->locked == _Q_SLOW_VAL
+		 *       MB                           RMB
+		 * [RmW] l->locked = _Q_SLOW_VAL  [L] <unhash>
+		 *
+		 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 		 */
 		if (READ_ONCE(l->locked) != _Q_SLOW_VAL) {
-			struct qspinlock **lp = pv_hash(lock, pn);
-			u8 locked;
+			u8 old;
+
+			pv_hash(lock, pn);
+			old = xchg(&l->locked, _Q_SLOW_VAL);
 
 			/*
-			 * We must hash before setting _Q_SLOW_VAL, such that
-			 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
-			 * we'll be sure to be able to observe our hash entry.
-			 *
-			 *   [S] <hash>                 [Rmw] l->locked == _Q_SLOW_VAL
-			 *       MB                           RMB
-			 * [RmW] l->locked = _Q_SLOW_VAL  [L] <unhash>
-			 *
-			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
+			 * If the old lock value isn't _Q_LOCKED_VAL, it
+			 * will be either 0 or _Q_SLOW_VAL. In both cases,
+			 * this vCPU may be racing with the lock holder vCPU
+			 * in pv_kick_node() and/or __pv_queued_spin_unlock().
+			 * So we should use the atomic pv_unhash() to remove
+			 * the unwanted node entry.
 			 */
-			locked = xchg(&l->locked, _Q_SLOW_VAL);
-			if (locked == 0) {
-				/*
-				 * The lock was free and now we own the lock.
-				 * Change the lock value back to _Q_LOCKED_VAL
-				 * and unhash the table.
-				 */
-				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
-				WRITE_ONCE(*lp, NULL);
-				clear_pending(lock);
-				goto gotlock;
-			} else if (unlikely(locked == _Q_SLOW_VAL)) {
-				/*
-				 * Racing with pv_kick_node(), need to undo
-				 * the pv_hash().
-				 */
-				WRITE_ONCE(*lp, NULL);
+			if (old != _Q_LOCKED_VAL) {
+				pv_unhash(lock);
+
+				if (likely(old == 0)) {
+					/*
+					 * The lock was free and now we own the
+					 * lock. Change the lock value back to
+					 * _Q_LOCKED_VAL as its node has been
+					 * unhashed.
+					 */
+					WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
+					clear_pending(lock);
+					goto gotlock;
+				}
+				/* old == _Q_SLOW_VAL. */
 			}
 		}
 		clear_pending(lock);	/* Enable lock stealing */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 4/5] locking/pvstat: Add stat counter to track _Q_SLOW_VAL race
  2016-05-31 16:53 [PATCH v2 0/5] locking/pvqspinlock: Fix missed PV wakeup & support PPC Waiman Long
                   ` (2 preceding siblings ...)
  2016-05-31 16:53 ` [PATCH v2 3/5] locking/pvqspinlock: Make pv_unhash() atomic Waiman Long
@ 2016-05-31 16:53 ` Waiman Long
  2016-05-31 16:53 ` [PATCH v2 5/5] locking/pvqspinlock: Add lock holder CPU argument to pv_wait() Waiman Long
  4 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2016-05-31 16:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Pan Xinhui, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

This patch adds a new stat counter to track how often the _Q_SLOW_VAL
race between the lock owner running pv_kick_node() and the queue head
running pv_wait_head_or_lock() happens.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock_paravirt.h |    4 +++-
 kernel/locking/qspinlock_stat.h     |    5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 54e03b9..3d57296 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -468,8 +468,10 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 					WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
 					clear_pending(lock);
 					goto gotlock;
+				} else {
+					/* old == _Q_SLOW_VAL. */
+					qstat_inc(qstat_pv_slow_race, true);
 				}
-				/* old == _Q_SLOW_VAL. */
 			}
 		}
 		clear_pending(lock);	/* Enable lock stealing */
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 3cd07a9..a40262e 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -24,6 +24,7 @@
  *   pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
  *   pv_lock_slowpath	- # of locking operations via the slowpath
  *   pv_lock_stealing	- # of lock stealing operations
+ *   pv_slow_race	- # of _Q_SLOW_VAL races between lock owner & qhead
  *   pv_spurious_wakeup	- # of spurious wakeups in non-head vCPUs
  *   pv_wait_again	- # of wait's after a queue head vCPU kick
  *   pv_wait_early	- # of early vCPU wait's
@@ -48,6 +49,7 @@ enum qlock_stats {
 	qstat_pv_latency_wake,
 	qstat_pv_lock_slowpath,
 	qstat_pv_lock_stealing,
+	qstat_pv_slow_race,
 	qstat_pv_spurious_wakeup,
 	qstat_pv_wait_again,
 	qstat_pv_wait_early,
@@ -69,11 +71,12 @@ static const char * const qstat_names[qstat_num + 1] = {
 	[qstat_pv_hash_hops]	   = "pv_hash_hops",
 	[qstat_pv_kick_unlock]     = "pv_kick_unlock",
 	[qstat_pv_kick_wake]       = "pv_kick_wake",
-	[qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
 	[qstat_pv_latency_kick]	   = "pv_latency_kick",
 	[qstat_pv_latency_wake]    = "pv_latency_wake",
 	[qstat_pv_lock_slowpath]   = "pv_lock_slowpath",
 	[qstat_pv_lock_stealing]   = "pv_lock_stealing",
+	[qstat_pv_slow_race]	   = "pv_slow_race",
+	[qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
 	[qstat_pv_wait_again]      = "pv_wait_again",
 	[qstat_pv_wait_early]      = "pv_wait_early",
 	[qstat_pv_wait_head]       = "pv_wait_head",
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 5/5] locking/pvqspinlock: Add lock holder CPU argument to pv_wait()
  2016-05-31 16:53 [PATCH v2 0/5] locking/pvqspinlock: Fix missed PV wakeup & support PPC Waiman Long
                   ` (3 preceding siblings ...)
  2016-05-31 16:53 ` [PATCH v2 4/5] locking/pvstat: Add stat counter to track _Q_SLOW_VAL race Waiman Long
@ 2016-05-31 16:53 ` Waiman Long
  4 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2016-05-31 16:53 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Pan Xinhui, Boqun Feng, Scott J Norton,
	Douglas Hatch, Waiman Long

Pan Xinhui asked for a lock holder CPU argument in pv_wait() to help
the porting of pvqspinlock to PPC. The new argument can help the
hypervisor expedite the execution of the critical section by the lock
holder, if its vCPU isn't running, so that it can release the lock
sooner.

The pv_wait() function is now extended with a third argument that
holds the vCPU number of the lock holder, if it is known.
Architecture-specific hypervisor code can then make use of that
information to make a better decision about which vCPU should run
next.

This patch also adds a new field to the pv_node structure to hold the
vCPU number of the previous queue entry. For the queue head, the
prev_cpu entry is likely to be that of the lock holder, though lock
stealing may render this information inaccurate.

In pv_wait_head_or_lock(), pv_wait() will now be called with that
vCPU number as it is likely to be the lock holder.

In pv_wait_node(), the newly added pv_lookup_hash() function will
be called to look up the queue head and pass in the lock holder vCPU
number stored there, if found.
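
For illustration, a hypothetical architecture back-end could use the
extra argument along the following lines. This is only a sketch;
yield_to_vcpu() and halt_until_kicked() are made-up helper names, not
part of this series or of any existing architecture code.

static void arch_qspin_wait(u8 *ptr, u8 val, int lockcpu)
{
	/* The wait condition may already be stale; don't block if so. */
	if (READ_ONCE(*ptr) != val)
		return;

	if (lockcpu >= 0)
		/* Donate our time slice to the likely lock holder vCPU. */
		yield_to_vcpu(lockcpu);
	else
		/* Lock holder unknown; just halt until kicked. */
		halt_until_kicked();
}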

Suggested-by:  Pan Xinhui <xinhui@linux.vnet.ibm.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/include/asm/paravirt.h       |    4 +-
 arch/x86/include/asm/paravirt_types.h |    2 +-
 arch/x86/kernel/kvm.c                 |    2 +-
 arch/x86/xen/spinlock.c               |    2 +-
 kernel/locking/qspinlock.c            |    5 +-
 kernel/locking/qspinlock_paravirt.h   |   78 +++++++++++++++++++++++++++++++--
 kernel/locking/qspinlock_stat.h       |    8 ++--
 7 files changed, 86 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 2970d22..98f8764 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -674,9 +674,9 @@ static __always_inline void pv_queued_spin_unlock(struct qspinlock *lock)
 	PVOP_VCALLEE1(pv_lock_ops.queued_spin_unlock, lock);
 }
 
-static __always_inline void pv_wait(u8 *ptr, u8 val)
+static __always_inline void pv_wait(u8 *ptr, u8 val, int lockcpu)
 {
-	PVOP_VCALL2(pv_lock_ops.wait, ptr, val);
+	PVOP_VCALL3(pv_lock_ops.wait, ptr, val, lockcpu);
 }
 
 static __always_inline void pv_kick(int cpu)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7fa9e77..300f1d4 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -312,7 +312,7 @@ struct pv_lock_ops {
 	void (*queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
 	struct paravirt_callee_save queued_spin_unlock;
 
-	void (*wait)(u8 *ptr, u8 val);
+	void (*wait)(u8 *ptr, u8 val, int lockcpu);
 	void (*kick)(int cpu);
 #else /* !CONFIG_QUEUED_SPINLOCKS */
 	struct paravirt_callee_save lock_spinning;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index eea2a6f..543271e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -582,7 +582,7 @@ static void kvm_kick_cpu(int cpu)
 
 #include <asm/qspinlock.h>
 
-static void kvm_wait(u8 *ptr, u8 val)
+static void kvm_wait(u8 *ptr, u8 val, int lockcpu)
 {
 	unsigned long flags;
 
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index f42e78d..9150e62 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -39,7 +39,7 @@ static void xen_qlock_kick(int cpu)
 /*
  * Halt the current CPU & release it back to the host
  */
-static void xen_qlock_wait(u8 *byte, u8 val)
+static void xen_qlock_wait(u8 *byte, u8 val, int lockcpu)
 {
 	int irq = __this_cpu_read(lock_kicker_irq);
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ce2f75e..99f31e4 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -248,7 +248,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
  */
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+static __always_inline void __pv_wait_node(struct qspinlock *lock,
+					   struct mcs_spinlock *node,
 					   struct mcs_spinlock *prev) { }
 static __always_inline void __pv_kick_node(struct qspinlock *lock,
 					   struct mcs_spinlock *node) { }
@@ -407,7 +408,7 @@ queue:
 		prev = decode_tail(old);
 		WRITE_ONCE(prev->next, node);
 
-		pv_wait_node(node, prev);
+		pv_wait_node(lock, node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);
 
 		/*
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 3d57296..1e56722 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -51,6 +51,7 @@ struct pv_node {
 	struct mcs_spinlock	__res[3];
 
 	int			cpu;
+	int			prev_cpu; /* vCPU # of previous queue entry */
 	u8			state;
 };
 
@@ -158,7 +159,6 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
  * Since we should not be holding locks from NMI context (very rare indeed) the
  * max load factor is 0.75, which is around the point where open addressing
  * breaks down.
- *
  */
 struct pv_hash_entry {
 	struct qspinlock *lock;
@@ -257,6 +257,51 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
 }
 
 /*
+ * Look up the given lock in the hash table
+ * Return the pv_node if found, NULL otherwise
+ *
+ * This function is used to search for an entry that contains the given lock
+ * without doing any unhashing. It is entirely possible that the required
+ * entry isn't in the hash table at all. The search stops when a null entry
+ * is found after at least a cacheline worth of entries has been scanned, or
+ * when more than 4 cachelines of entries have been searched.
+ *
+ * __ARCH_NEED_PV_HASH_LOOKUP should be set in the asm/qspinlock.h file for
+ * architectures that want to do the hash table lookup in pv_wait_node().
+ */
+#ifdef __ARCH_NEED_PV_HASH_LOOKUP
+static struct pv_node *pv_lookup_hash(struct qspinlock *lock)
+{
+	unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
+	struct pv_hash_entry *he;
+	int idx = 0;
+
+	for_each_hash_entry(he, offset, hash) {
+		struct qspinlock *l = READ_ONCE(he->lock);
+
+		if (l == lock)
+			return READ_ONCE(he->node);
+		else if (++idx < PV_HE_PER_LINE)
+			/* Search at least a cacheline of hash entries */
+			continue;
+		/*
+		 * Stop searching when there is an empty slot or more than
+		 * 4 cachelines of entries are searched. This limits the cost
+		 * of doing the lookup.
+		 */
+		if (!l || (idx >= 4 * PV_HE_PER_LINE))
+			return NULL;
+	}
+	return NULL;
+}
+#else
+static inline struct pv_node *pv_lookup_hash(struct qspinlock *lock)
+{
+	return NULL;
+}
+#endif
+
+/*
  * Return true if when it is time to check the previous node which is not
  * in a running state.
  */
@@ -280,6 +325,7 @@ static void pv_init_node(struct mcs_spinlock *node)
 	BUILD_BUG_ON(sizeof(struct pv_node) > 5*sizeof(struct mcs_spinlock));
 
 	pn->cpu = smp_processor_id();
+	pn->prev_cpu = -1;
 	pn->state = vcpu_running;
 }
 
@@ -288,13 +334,16 @@ static void pv_init_node(struct mcs_spinlock *node)
  * pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
  * behalf.
  */
-static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
+static void pv_wait_node(struct qspinlock *lock, struct mcs_spinlock *node,
+			 struct mcs_spinlock *prev)
 {
 	struct pv_node *pn = (struct pv_node *)node;
 	struct pv_node *pp = (struct pv_node *)prev;
 	int loop;
 	bool wait_early;
 
+	pn->prev_cpu = pp->cpu;	/* Save vCPU # of previous queue entry */
+
 	for (;;) {
 		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
@@ -318,9 +367,22 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 		smp_store_mb(pn->state, vcpu_halted);
 
 		if (!READ_ONCE(node->locked)) {
+			struct pv_node *ph;
+
 			qstat_inc(qstat_pv_wait_node, true);
 			qstat_inc(qstat_pv_wait_early, wait_early);
-			pv_wait(&pn->state, vcpu_halted);
+
+			/*
+			 * If the current queue head is in the hash table,
+			 * the prev_cpu field of its pv_node may contain the
+			 * vCPU # of the lock holder. However, lock stealing
+			 * may make that information inaccurate. Anyway, we
+			 * look up the hash table to try to get the lock
+			 * holder vCPU number.
+			 */
+			ph = pv_lookup_hash(lock);
+			pv_wait(&pn->state, vcpu_halted,
+				ph ? ph->prev_cpu : -1);
 		}
 
 		/*
@@ -479,7 +541,15 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 		WRITE_ONCE(pn->state, vcpu_halted);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
-		pv_wait(&l->locked, _Q_SLOW_VAL);
+
+		/*
+		 * Pass in the previous node vCPU number which is likely to be
+		 * the lock holder vCPU. This additional information may help
+		 * the hypervisor give more resources to that vCPU so that
+		 * it can release the lock faster. With lock stealing,
+		 * however, that vCPU may not be the actual lock holder.
+		 */
+		pv_wait(&l->locked, _Q_SLOW_VAL, pn->prev_cpu);
 
 		/*
 		 * Because of lock stealing, the queue head vCPU may not be
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index a40262e..49243ba 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -275,12 +275,12 @@ static inline void __pv_kick(int cpu)
 /*
  * Replacement function for pv_wait()
  */
-static inline void __pv_wait(u8 *ptr, u8 val)
+static inline void __pv_wait(u8 *ptr, u8 val, int lockcpu)
 {
 	u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
 
 	*pkick_time = 0;
-	pv_wait(ptr, val);
+	pv_wait(ptr, val, lockcpu);
 	if (*pkick_time) {
 		this_cpu_add(qstats[qstat_pv_latency_wake],
 			     sched_clock() - *pkick_time);
@@ -288,8 +288,8 @@ static inline void __pv_wait(u8 *ptr, u8 val)
 	}
 }
 
-#define pv_kick(c)	__pv_kick(c)
-#define pv_wait(p, v)	__pv_wait(p, v)
+#define pv_kick(c)		__pv_kick(c)
+#define pv_wait(p, v, c)	__pv_wait(p, v, c)
 
 #else /* CONFIG_QUEUED_LOCK_STAT */
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-05-31 16:53 ` [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem Waiman Long
@ 2016-07-15  8:47   ` Peter Zijlstra
  2016-07-15  9:39     ` Pan Xinhui
  2016-07-15 19:47     ` Waiman Long
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2016-07-15  8:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Pan Xinhui, Boqun Feng,
	Scott J Norton, Douglas Hatch


So the reason I never get around to this is because the patch stinks.

It simply doesn't make sense... Remember, the harder you make a reviewer
work the less likely the review will be done.

Present things in clear concise language and draw a picture.

On Tue, May 31, 2016 at 12:53:48PM -0400, Waiman Long wrote:
> Currently, calling pv_hash() and setting _Q_SLOW_VAL is only
> done once for any pv_node. It is either in pv_kick_node() or in
> pv_wait_head_or_lock().

So far so good....

> Because of lock stealing, a pv_kick'ed node is
> not guaranteed to get the lock before the spinning threshold expires
> and has to call pv_wait() again. As a result, the new lock holder
> won't see _Q_SLOW_VAL and so won't wake up the sleeping vCPU.

*brain melts* what!? pv_kick'ed node reads like pv_kick_node() and that
doesn't make any kind of sense.

I'm thinking you're trying to say this:


CPU0			CPU1			CPU2

__pv_queued_spin_unlock_slowpath()
  ...
  smp_store_release(&l->locked, 0);
			__pv_queued_spin_lock_slowpath()
			  ...
			  pv_queued_spin_steal_lock()
			    cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0


						pv_wait_head_or_lock()

  pv_kick(node->cpu);  ---------------------->	  pv_wait(&l->locked, _Q_SLOW_VAL);

			__pv_queued_spin_unlock()
			  cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL

						  for () {
						    trylock_clear_pending();
						    cpu_relax();
						  }

						  pv_wait(&l->locked, _Q_SLOW_VAL);


Which is indeed 'bad', but not fatal, note that the later pv_wait() will
not in fact go wait, since l->locked will _not_ be _Q_SLOW_VAL.

Is this indeed the 3 CPU scenario you tried to describe in a scant 4
lines of text, or is there more to it?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-07-15  8:47   ` Peter Zijlstra
@ 2016-07-15  9:39     ` Pan Xinhui
  2016-07-15 10:07       ` Peter Zijlstra
  2016-07-15 19:47     ` Waiman Long
  1 sibling, 1 reply; 19+ messages in thread
From: Pan Xinhui @ 2016-07-15  9:39 UTC (permalink / raw)
  To: Peter Zijlstra, Waiman Long
  Cc: Ingo Molnar, linux-kernel, Boqun Feng, Scott J Norton, Douglas Hatch



On 16/7/15 16:47, Peter Zijlstra wrote:
>
> So the reason I never get around to this is because the patch stinks.
>
> It simply doesn't make sense... Remember, the harder you make a reviewer
> work the less likely the review will be done.
>
> Present things in clear concise language and draw a picture.
>
> On Tue, May 31, 2016 at 12:53:48PM -0400, Waiman Long wrote:
>> Currently, calling pv_hash() and setting _Q_SLOW_VAL is only
>> done once for any pv_node. It is either in pv_kick_node() or in
>> pv_wait_head_or_lock().
>
> So far so good....
>
>> Because of lock stealing, a pv_kick'ed node is
>> not guaranteed to get the lock before the spinning threshold expires
>> and has to call pv_wait() again. As a result, the new lock holder
>> won't see _Q_SLOW_VAL and so won't wake up the sleeping vCPU.
>
Waiman, it might be "as a result, the head node will not really enter the wait state because ->locked is not _Q_SLOW_VAL, so pv_wait will return directly."

> *brain melts* what!? pv_kick'ed node reads like pv_kick_node() and that
> doesn't make any kind of sense.
>
> I'm thinking you're trying to say this:
>
>
> CPU0			CPU1			CPU2
>
> __pv_queued_spin_unlock_slowpath()
>   ...
>   smp_store_release(&l->locked, 0);
> 			__pv_queued_spin_lock_slowpath()
> 			  ...
> 			  pv_queued_spin_steal_lock()
> 			    cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0
>
>
> 						pv_wait_head_or_lock()
>
>   pv_kick(node->cpu);  ---------------------->	  pv_wait(&l->locked, _Q_SLOW_VAL);
>
> 			__pv_queued_spin_unlock()
> 			  cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL
>
> 						  for () {
> 						    trylock_clear_pending();
> 						    cpu_relax();
> 						  }
>
> 						  pv_wait(&l->locked, _Q_SLOW_VAL);
>
>
> Which is indeed 'bad', but not fatal, note that the later pv_wait() will
> not in fact go wait, since l->locked will _not_ be _Q_SLOW_VAL.
>
hi, Peter

the problem is that "this later pv_wait will do nothing as l->locked is not _Q_SLOW_VAL",
so it is not paravirt friendly. We will go into the trylock loop again and again until the lock is unlocked.
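
For reference, a typical pv_wait() implementation (the x86 kvm_wait()
follows this shape) re-checks the watched byte before halting, roughly:

	if (READ_ONCE(*ptr) != val)
		return;		/* value already changed, don't halt */
	halt();			/* otherwise sleep until kicked */

which is why the call returns immediately here instead of sleeping.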

So if we are kicked by the unlock_slowpath, and the lock is stolen by someone else, we need to hash its node again and set l->locked to _Q_SLOW_VAL, then enter pv_wait.

But I am worried about lock stealing. Could a node in the queue starve for a long time? I notice the latency of pv_wait on an over-committed guest can be bigger than 300us. I have not seen such a starvation case, but I think it is possible.

thanks
xinhui

> Is this indeed the 3 CPU scenario you tried to describe in a scant 4
> lines of text, or is there more to it?
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-07-15  9:39     ` Pan Xinhui
@ 2016-07-15 10:07       ` Peter Zijlstra
  2016-07-15 16:35         ` Peter Zijlstra
  2016-07-15 20:06         ` Waiman Long
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2016-07-15 10:07 UTC (permalink / raw)
  To: Pan Xinhui
  Cc: Waiman Long, Ingo Molnar, linux-kernel, Boqun Feng,
	Scott J Norton, Douglas Hatch

On Fri, Jul 15, 2016 at 05:39:46PM +0800, Pan Xinhui wrote:
> >I'm thinking you're trying to say this:
> >
> >
> >CPU0			CPU1			CPU2
> >
> >__pv_queued_spin_unlock_slowpath()
> >  ...
> >  smp_store_release(&l->locked, 0);
> >			__pv_queued_spin_lock_slowpath()
> >			  ...
> >			  pv_queued_spin_steal_lock()
> >			    cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0
> >
> >
> >						pv_wait_head_or_lock()
> >
> >  pv_kick(node->cpu);  ---------------------->	  pv_wait(&l->locked, _Q_SLOW_VAL);
> >
> >			__pv_queued_spin_unlock()
> >			  cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL
> >
> >						  for () {
> >						    trylock_clear_pending();
> >						    cpu_relax();
> >						  }
> >
> >						  pv_wait(&l->locked, _Q_SLOW_VAL);
> >
> >
> >Which is indeed 'bad', but not fatal, note that the later pv_wait() will
> >not in fact go wait, since l->locked will _not_ be _Q_SLOW_VAL.

> 
> the problem is that "this later pv_wait will do nothing as l->locked
> is not _Q_SLOW_VAL", so it is not paravirt friendly. We will go into
> the trylock loop again and again until the lock is unlocked.

Agreed, which is 'bad'. But the patch spoke about a missing wakeup,
which is worse, as that would completely inhibit progress.

> So if we are kicked by the unlock_slowpath, and the lock is stolen by
> someone else, we need to hash its node again and set l->locked to
> _Q_SLOW_VAL, then enter pv_wait.

Right, let me go think about this a bit.

> But I am worried about lock stealing. Could a node in the queue
> starve for a long time? I notice the latency of pv_wait on an
> over-committed guest can be bigger than 300us. I have not seen such a
> starvation case, but I think it is possible.

I share that worry, which is why we limit the steal attempt to one.
But yes, theoretically it's possible to starve things AFAICT.

We've not come up with a sensible way to completely avoid starvation.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-07-15 10:07       ` Peter Zijlstra
@ 2016-07-15 16:35         ` Peter Zijlstra
  2016-07-16  1:16           ` Boqun Feng
                             ` (2 more replies)
  2016-07-15 20:06         ` Waiman Long
  1 sibling, 3 replies; 19+ messages in thread
From: Peter Zijlstra @ 2016-07-15 16:35 UTC (permalink / raw)
  To: Pan Xinhui
  Cc: Waiman Long, Ingo Molnar, linux-kernel, Boqun Feng,
	Scott J Norton, Douglas Hatch

On Fri, Jul 15, 2016 at 12:07:03PM +0200, Peter Zijlstra wrote:
> > So if we are kicked by the unlock_slowpath, and the lock is stolen by
> > someone else, we need to hash its node again and set l->locked to
> > _Q_SLOW_VAL, then enter pv_wait.
> 
> Right, let me go think about this a bit.

Urgh, brain hurt.

So I _think_ the below does for it but I could easily have missed yet
another case.

Waiman's patch has the problem that it can have two pv_hash() calls for
the same lock in progress and I'm thinking that means we can hit the
BUG() in pv_hash() due to that.

If we can't, it still has a problem because it's not telling us either.
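
For context, pv_hash() hard-assumes that a free slot exists, since each
blocked lock is only ever supposed to consume a single entry. Roughly,
it scans the hash buckets and ends in BUG() when the table is
exhausted:

	for_each_hash_entry(he, offset, hash) {
		if (!cmpxchg(&he->lock, NULL, lock)) {
			WRITE_ONCE(he->node, node);
			return &he->lock;
		}
	}
	/* No free entry: the one-entry-per-blocked-lock invariant broke. */
	BUG();

so a second in-flight pv_hash() for the same lock eats an extra slot
and can eventually trip that assumption.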



--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -20,7 +20,8 @@
  * native_queued_spin_unlock().
  */
 
-#define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
+#define _Q_HASH_VAL	(3U << _Q_LOCKED_OFFSET)
+#define _Q_SLOW_VAL	(7U << _Q_LOCKED_OFFSET)
 
 /*
  * Queue Node Adaptive Spinning
@@ -36,14 +37,11 @@
  */
 #define PV_PREV_CHECK_MASK	0xff
 
-/*
- * Queue node uses: vcpu_running & vcpu_halted.
- * Queue head uses: vcpu_running & vcpu_hashed.
- */
 enum vcpu_state {
-	vcpu_running = 0,
-	vcpu_halted,		/* Used only in pv_wait_node */
-	vcpu_hashed,		/* = pv_hash'ed + vcpu_halted */
+	vcpu_node_running = 0,
+	vcpu_node_halted,
+	vcpu_head_running,
+	vcpu_head_halted,
 };
 
 struct pv_node {
@@ -263,7 +261,7 @@ pv_wait_early(struct pv_node *prev, int
 	if ((loop & PV_PREV_CHECK_MASK) != 0)
 		return false;
 
-	return READ_ONCE(prev->state) != vcpu_running;
+	return READ_ONCE(prev->state) & 1;
 }
 
 /*
@@ -311,20 +309,19 @@ static void pv_wait_node(struct mcs_spin
 		 *
 		 * Matches the cmpxchg() from pv_kick_node().
 		 */
-		smp_store_mb(pn->state, vcpu_halted);
+		smp_store_mb(pn->state, vcpu_node_halted);
 
-		if (!READ_ONCE(node->locked)) {
-			qstat_inc(qstat_pv_wait_node, true);
-			qstat_inc(qstat_pv_wait_early, wait_early);
-			pv_wait(&pn->state, vcpu_halted);
-		}
+		if (READ_ONCE(node->locked))
+			return;
+
+		qstat_inc(qstat_pv_wait_node, true);
+		qstat_inc(qstat_pv_wait_early, wait_early);
+		pv_wait(&pn->state, vcpu_node_halted);
 
 		/*
-		 * If pv_kick_node() changed us to vcpu_hashed, retain that
-		 * value so that pv_wait_head_or_lock() knows to not also try
-		 * to hash this lock.
+		 * If pv_kick_node() advanced us, retain that state.
 		 */
-		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
+		cmpxchg(&pn->state, vcpu_node_halted, vcpu_node_running);
 
 		/*
 		 * If the locked flag is still not set after wakeup, it is a
@@ -362,18 +359,17 @@ static void pv_kick_node(struct qspinloc
 	 *
 	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
 	 */
-	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
+	if (cmpxchg(&pn->state, vcpu_node_halted, vcpu_head_running) != vcpu_node_halted)
 		return;
 
 	/*
-	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
-	 *
-	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
-	 * the hash table later on at unlock time, no atomic instruction is
-	 * needed.
+	 * See pv_wait_head_or_lock(). We have to hash and force the unlock
+	 * into the slow path to deliver the actual kick for waking.
 	 */
-	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
-	(void)pv_hash(lock, pn);
+	if (cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL) == _Q_LOCKED_VAL) {
+		(void)pv_hash(lock, pn);
+		smp_store_release(&l->locked, _Q_SLOW_VAL);
+	}
 }
 
 /*
@@ -388,28 +384,22 @@ pv_wait_head_or_lock(struct qspinlock *l
 {
 	struct pv_node *pn = (struct pv_node *)node;
 	struct __qspinlock *l = (void *)lock;
-	struct qspinlock **lp = NULL;
 	int waitcnt = 0;
 	int loop;
 
 	/*
-	 * If pv_kick_node() already advanced our state, we don't need to
-	 * insert ourselves into the hash table anymore.
-	 */
-	if (READ_ONCE(pn->state) == vcpu_hashed)
-		lp = (struct qspinlock **)1;
-
-	/*
 	 * Tracking # of slowpath locking operations
 	 */
 	qstat_inc(qstat_pv_lock_slowpath, true);
 
 	for (;; waitcnt++) {
+		u8 locked;
+
 		/*
 		 * Set correct vCPU state to be used by queue node wait-early
 		 * mechanism.
 		 */
-		WRITE_ONCE(pn->state, vcpu_running);
+		WRITE_ONCE(pn->state, vcpu_head_running);
 
 		/*
 		 * Set the pending bit in the active lock spinning loop to
@@ -423,33 +413,38 @@ pv_wait_head_or_lock(struct qspinlock *l
 		}
 		clear_pending(lock);
 
+		/*
+		 * We want to go sleep; ensure we're hashed so that
+		 * __pv_queued_spin_unlock_slow() can find us for a wakeup.
+		 */
+		locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL);
+		switch (locked) {
+		/*
+		 * We're not hashed yet, either we're fresh from pv_wait_node()
+		 * or __pv_queued_spin_unlock_slow() unhashed us but we lost
+		 * the trylock to a steal and have to re-hash.
+		 */
+		case _Q_LOCKED_VAL:
+			(void)pv_hash(lock, pn);
+			smp_store_release(&l->locked, _Q_SLOW_VAL);
+			break;
 
-		if (!lp) { /* ONCE */
-			lp = pv_hash(lock, pn);
+		/*
+		 * pv_kick_node() is hashing us, wait for it.
+		 */
+		case _Q_HASH_VAL:
+			while (READ_ONCE(l->locked) == _Q_HASH_VAL)
+				cpu_relax();
+			break;
 
-			/*
-			 * We must hash before setting _Q_SLOW_VAL, such that
-			 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
-			 * we'll be sure to be able to observe our hash entry.
-			 *
-			 *   [S] <hash>                 [Rmw] l->locked == _Q_SLOW_VAL
-			 *       MB                           RMB
-			 * [RmW] l->locked = _Q_SLOW_VAL  [L] <unhash>
-			 *
-			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
-			 */
-			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
-				/*
-				 * The lock was free and now we own the lock.
-				 * Change the lock value back to _Q_LOCKED_VAL
-				 * and unhash the table.
-				 */
-				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
-				WRITE_ONCE(*lp, NULL);
-				goto gotlock;
-			}
+		/*
+		 * Ooh, unlocked, try and grab it.
+		 */
+		case 0:
+			continue;
 		}
-		WRITE_ONCE(pn->state, vcpu_hashed);
+
+		WRITE_ONCE(pn->state, vcpu_head_halted);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
@@ -480,7 +475,7 @@ __pv_queued_spin_unlock_slowpath(struct
 	struct __qspinlock *l = (void *)lock;
 	struct pv_node *node;
 
-	if (unlikely(locked != _Q_SLOW_VAL)) {
+	if (unlikely(locked != _Q_SLOW_VAL && locked != _Q_HASH_VAL)) {
 		WARN(!debug_locks_silent,
 		     "pvqspinlock: lock 0x%lx has corrupted value 0x%x!\n",
 		     (unsigned long)lock, atomic_read(&lock->val));
@@ -488,18 +483,17 @@ __pv_queued_spin_unlock_slowpath(struct
 	}
 
 	/*
-	 * A failed cmpxchg doesn't provide any memory-ordering guarantees,
-	 * so we need a barrier to order the read of the node data in
-	 * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
-	 *
-	 * Matches the cmpxchg() in pv_wait_head_or_lock() setting _Q_SLOW_VAL.
+	 * Wait until the hash-bucket is complete.
 	 */
-	smp_rmb();
+	while (READ_ONCE(l->locked) == _Q_HASH_VAL)
+		cpu_relax();
 
 	/*
-	 * Since the above failed to release, this must be the SLOW path.
-	 * Therefore start by looking up the blocked node and unhashing it.
+	 * Must first observe _Q_SLOW_VAL in order to observe
+	 * consistent hash bucket.
 	 */
+	smp_rmb();
+
 	node = pv_unhash(lock);
 
 	/*

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-07-15  8:47   ` Peter Zijlstra
  2016-07-15  9:39     ` Pan Xinhui
@ 2016-07-15 19:47     ` Waiman Long
  1 sibling, 0 replies; 19+ messages in thread
From: Waiman Long @ 2016-07-15 19:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Pan Xinhui, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/15/2016 04:47 AM, Peter Zijlstra wrote:
> So the reason I never get around to this is because the patch stinks.
>
> It simply doesn't make sense... Remember, the harder you make a reviewer
> work the less likely the review will be done.
>
> Present things in clear concise language and draw a picture.
>
> On Tue, May 31, 2016 at 12:53:48PM -0400, Waiman Long wrote:
>> Currently, calling pv_hash() and setting _Q_SLOW_VAL is only
>> done once for any pv_node. It is either in pv_kick_node() or in
>> pv_wait_head_or_lock().
> So far so good....
>
>> Because of lock stealing, a pv_kick'ed node is
>> not guaranteed to get the lock before the spinning threshold expires
>> and has to call pv_wait() again. As a result, the new lock holder
>> won't see _Q_SLOW_VAL and so won't wake up the sleeping vCPU.
> *brain melts* what!? pv_kick'ed node reads like pv_kick_node() and that
> doesn't make any kind of sense.

Sorry for the confusion. I will clean up the commit log to describe
what I actually mean.

> I'm thinking you're trying to say this:
>
>
> CPU0			CPU1			CPU2
>
> __pv_queued_spin_unlock_slowpath()
>    ...
>    smp_store_release(&l->locked, 0);
> 			__pv_queued_spin_lock_slowpath()
> 			  ...
> 			  pv_queued_spin_steal_lock()
> 			    cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0
>
>
> 						pv_wait_head_or_lock()
>
>    pv_kick(node->cpu);  ---------------------->	pv_wait(&l->locked, _Q_SLOW_VAL);
>
> 			__pv_queued_spin_unlock()
> 			  cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL
>
> 						  for () {
> 						    trylock_clear_pending();
> 						    cpu_relax();
> 						  }
>
> 						  pv_wait(&l->locked, _Q_SLOW_VAL);
>

Yes, that is the scenario that I have in mind.

> Which is indeed 'bad', but not fatal, note that the later pv_wait() will
> not in fact go wait, since l->locked will _not_ be _Q_SLOW_VAL.
>
> Is this indeed the 3 CPU scenario you tried to describe in a scant 4
> lines of text, or is there more to it?

You are right. The vCPU won't actually go to wait. It will get out
and spin again. I will correct the patch title. However, it is still
not good as it is not doing what it is supposed to do.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-07-15 10:07       ` Peter Zijlstra
  2016-07-15 16:35         ` Peter Zijlstra
@ 2016-07-15 20:06         ` Waiman Long
  1 sibling, 0 replies; 19+ messages in thread
From: Waiman Long @ 2016-07-15 20:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pan Xinhui, Ingo Molnar, linux-kernel, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/15/2016 06:07 AM, Peter Zijlstra wrote:
> On Fri, Jul 15, 2016 at 05:39:46PM +0800, Pan Xinhui wrote:
>>> I'm thinking you're trying to say this:
>>>
>>>
>>> CPU0			CPU1			CPU2
>>>
>>> __pv_queued_spin_unlock_slowpath()
>>>   ...
>>>   smp_store_release(&l->locked, 0);
>>> 			__pv_queued_spin_lock_slowpath()
>>> 			  ...
>>> 			  pv_queued_spin_steal_lock()
>>> 			    cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0
>>>
>>>
>>> 						pv_wait_head_or_lock()
>>>
>>>   pv_kick(node->cpu);  ---------------------->	pv_wait(&l->locked, _Q_SLOW_VAL);
>>>
>>> 			__pv_queued_spin_unlock()
>>> 			  cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL
>>>
>>> 						  for () {
>>> 						    trylock_clear_pending();
>>> 						    cpu_relax();
>>> 						  }
>>>
>>> 						  pv_wait(&l->locked, _Q_SLOW_VAL);
>>>
>>>
>>> Which is indeed 'bad', but not fatal, note that the later pv_wait() will
>>> not in fact go wait, since l->locked will _not_ be _Q_SLOW_VAL.
>> the problem is that "this later pv_wait will do nothing as l->locked
>> is not _Q_SLOW_VAL", so it is not paravirt friendly. We will go into
>> the trylock loop again and again until the lock is unlocked.
> Agreed, which is 'bad'. But the patch spoke about a missing wakeup,
> which is worse, as that would completely inhibit progress.

Sorry, it is my mistake. There is no missing pv_wait().

>> So if we are kicked by the unlock_slowpath, and the lock is stolen by
>> someone else, we need to hash its node again and set l->locked to
>> _Q_SLOW_VAL, then enter pv_wait.
> Right, let me go think about this a bit.

Yes, the purpose of this patch is to do exactly that. It lets the
queue head vCPU sleep until the lock holder releases the lock and
wakes the queue head vCPU up.

>
>> But I am worried about lock stealing. Could a node in the queue
>> starve for a long time? I notice the latency of pv_wait on an
>> over-committed guest can be bigger than 300us. I have not seen such a
>> starvation case, but I think it is possible.
> I share that worry, which is why we limit the steal attempt to one.
> But yes, theoretically it's possible to starve things AFAICT.
>
> We've not come up with a sensible way to completely avoid starvation.

If you guys are worried about the lock constantly getting stolen
between the pv_kick() of the queue head vCPU and the time it is ready
to take the lock, we can keep the pending bit set across pv_wait() if
it is the 2nd or later time that pv_wait() is called. That will ensure
that no lock stealing can happen and cap the maximum wait time to
about 2x (spin + pv_wait). I will add such a patch to my patch series.
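
A rough sketch of that idea (illustration only, not the actual
follow-up patch): in pv_wait_head_or_lock(), leave the pending bit set
across the sleep on the second and later iterations, so that
pv_queued_spin_steal_lock() cannot succeed while the just-kicked queue
head is on its way to retake the lock:

	for (;; waitcnt++) {
		set_pending(lock);
		for (loop = SPIN_THRESHOLD; loop; loop--) {
			if (trylock_clear_pending(lock))
				goto gotlock;
			cpu_relax();
		}
		if (!waitcnt)
			clear_pending(lock); /* allow stealing on the 1st wait only */

		/* ... hash the node and set _Q_SLOW_VAL as before ... */

		pv_wait(&l->locked, _Q_SLOW_VAL);
	}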

Cheers,
Longman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-07-15 16:35         ` Peter Zijlstra
@ 2016-07-16  1:16           ` Boqun Feng
  2016-07-17 23:07             ` Waiman Long
  2016-07-17 22:52           ` Waiman Long
  2016-07-21  6:40           ` xinhui
  2 siblings, 1 reply; 19+ messages in thread
From: Boqun Feng @ 2016-07-16  1:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pan Xinhui, Waiman Long, Ingo Molnar, linux-kernel,
	Scott J Norton, Douglas Hatch


On Fri, Jul 15, 2016 at 06:35:56PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 15, 2016 at 12:07:03PM +0200, Peter Zijlstra wrote:
> > > So if we are kicked by the unlock_slowpath, and the lock is stolen by
> > > someone else, we need to hash its node again and set l->locked to
> > > _Q_SLOW_VAL, then enter pv_wait.
> > 
> > Right, let me go think about this a bit.
> 
> Urgh, brain hurt.
> 
> So I _think_ the below does for it but I could easily have missed yet
> another case.
> 
> Waiman's patch has the problem that it can have two pv_hash() calls for
> the same lock in progress and I'm thinking that means we can hit the
> BUG() in pv_hash() due to that.
> 

I think Waiman's patch does have the problem of two pv_hash() calls for
the same lock in progress. As I mentioned in the first version:

http://lkml.kernel.org/g/20160527074331.GB8096@insomnia

And he tried to address this in patch #3 of this series. However, I
think the proper thing here is either to reorder patches 2 and 3 or to
merge them; otherwise, we are introducing a bug in the middle of this
series.

Thoughts, Waiman?

That said, I find Peter's way much simpler and easier to understand
;-)

> If we can't, it still has a problem because it's not telling us either.
> 
> 
> 
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -20,7 +20,8 @@
>   * native_queued_spin_unlock().
>   */
>  
> -#define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
> +#define _Q_HASH_VAL	(3U << _Q_LOCKED_OFFSET)
> +#define _Q_SLOW_VAL	(7U << _Q_LOCKED_OFFSET)
>  
>  /*
>   * Queue Node Adaptive Spinning
> @@ -36,14 +37,11 @@
>   */
>  #define PV_PREV_CHECK_MASK	0xff
>  
> -/*
> - * Queue node uses: vcpu_running & vcpu_halted.
> - * Queue head uses: vcpu_running & vcpu_hashed.
> - */
>  enum vcpu_state {
> -	vcpu_running = 0,
> -	vcpu_halted,		/* Used only in pv_wait_node */
> -	vcpu_hashed,		/* = pv_hash'ed + vcpu_halted */
> +	vcpu_node_running = 0,
> +	vcpu_node_halted,
> +	vcpu_head_running,

We actually don't need this extra running state, right? Nobody cares
about the difference between the two running states right now.

> +	vcpu_head_halted,
>  };
>  
>  struct pv_node {
> @@ -263,7 +261,7 @@ pv_wait_early(struct pv_node *prev, int
>  	if ((loop & PV_PREV_CHECK_MASK) != 0)
>  		return false;
>  
> -	return READ_ONCE(prev->state) != vcpu_running;
> +	return READ_ONCE(prev->state) & 1;
>  }
>  
>  /*
> @@ -311,20 +309,19 @@ static void pv_wait_node(struct mcs_spin
>  		 *
>  		 * Matches the cmpxchg() from pv_kick_node().
>  		 */
> -		smp_store_mb(pn->state, vcpu_halted);
> +		smp_store_mb(pn->state, vcpu_node_halted);
>  
> -		if (!READ_ONCE(node->locked)) {
> -			qstat_inc(qstat_pv_wait_node, true);
> -			qstat_inc(qstat_pv_wait_early, wait_early);
> -			pv_wait(&pn->state, vcpu_halted);
> -		}
> +		if (READ_ONCE(node->locked))
> +			return;
> +
> +		qstat_inc(qstat_pv_wait_node, true);
> +		qstat_inc(qstat_pv_wait_early, wait_early);
> +		pv_wait(&pn->state, vcpu_node_halted);
>  
>  		/*
> -		 * If pv_kick_node() changed us to vcpu_hashed, retain that
> -		 * value so that pv_wait_head_or_lock() knows to not also try
> -		 * to hash this lock.
> +		 * If pv_kick_node() advanced us, retain that state.
>  		 */
> -		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
> +		cmpxchg(&pn->state, vcpu_node_halted, vcpu_node_running);
>  
>  		/*
>  		 * If the locked flag is still not set after wakeup, it is a
> @@ -362,18 +359,17 @@ static void pv_kick_node(struct qspinloc
>  	 *
>  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>  	 */
> -	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> +	if (cmpxchg(&pn->state, vcpu_node_halted, vcpu_head_running) != vcpu_node_halted)
>  		return;
>  
>  	/*
> -	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
> -	 *
> -	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
> -	 * the hash table later on at unlock time, no atomic instruction is
> -	 * needed.
> +	 * See pv_wait_head_or_lock(). We have to hash and force the unlock
> +	 * into the slow path to deliver the actual kick for waking.
>  	 */
> -	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> -	(void)pv_hash(lock, pn);
> +	if (cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL) == _Q_LOCKED_VAL) {
> +		(void)pv_hash(lock, pn);
> +		smp_store_release(&l->locked, _Q_SLOW_VAL);
> +	}
>  }
>  
>  /*
> @@ -388,28 +384,22 @@ pv_wait_head_or_lock(struct qspinlock *l
>  {
>  	struct pv_node *pn = (struct pv_node *)node;
>  	struct __qspinlock *l = (void *)lock;
> -	struct qspinlock **lp = NULL;
>  	int waitcnt = 0;
>  	int loop;
>  
>  	/*
> -	 * If pv_kick_node() already advanced our state, we don't need to
> -	 * insert ourselves into the hash table anymore.
> -	 */
> -	if (READ_ONCE(pn->state) == vcpu_hashed)
> -		lp = (struct qspinlock **)1;
> -
> -	/*
>  	 * Tracking # of slowpath locking operations
>  	 */
>  	qstat_inc(qstat_pv_lock_slowpath, true);
>  
>  	for (;; waitcnt++) {
> +		u8 locked;
> +
>  		/*
>  		 * Set correct vCPU state to be used by queue node wait-early
>  		 * mechanism.
>  		 */
> -		WRITE_ONCE(pn->state, vcpu_running);
> +		WRITE_ONCE(pn->state, vcpu_head_running);
>  
>  		/*
>  		 * Set the pending bit in the active lock spinning loop to
> @@ -423,33 +413,38 @@ pv_wait_head_or_lock(struct qspinlock *l
>  		}
>  		clear_pending(lock);
>  
> +		/*
> +		 * We want to go sleep; ensure we're hashed so that
> +		 * __pv_queued_spin_unlock_slow() can find us for a wakeup.
> +		 */
> +		locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL);
> +		switch (locked) {
> +		/*
> +		 * We're not hashed yet, either we're fresh from pv_wait_node()
> +		 * or __pv_queued_spin_unlock_slow() unhashed us but we lost
> +		 * the trylock to a steal and have to re-hash.
> +		 */
> +		case _Q_LOCKED_VAL:
> +			(void)pv_hash(lock, pn);
> +			smp_store_release(&l->locked, _Q_SLOW_VAL);
> +			break;
>  
> -		if (!lp) { /* ONCE */
> -			lp = pv_hash(lock, pn);
> +		/*
> +		 * pv_kick_node() is hashing us, wait for it.
> +		 */
> +		case _Q_HASH_VAL:
> +			while (READ_ONCE(l->locked) == _Q_HASH_VAL)
> +				cpu_relax();
> +			break;
>  
> -			/*
> -			 * We must hash before setting _Q_SLOW_VAL, such that
> -			 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
> -			 * we'll be sure to be able to observe our hash entry.
> -			 *
> -			 *   [S] <hash>                 [Rmw] l->locked == _Q_SLOW_VAL
> -			 *       MB                           RMB
> -			 * [RmW] l->locked = _Q_SLOW_VAL  [L] <unhash>
> -			 *
> -			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
> -			 */
> -			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
> -				/*
> -				 * The lock was free and now we own the lock.
> -				 * Change the lock value back to _Q_LOCKED_VAL
> -				 * and unhash the table.
> -				 */
> -				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
> -				WRITE_ONCE(*lp, NULL);
> -				goto gotlock;
> -			}
> +		/*
> +		 * Ooh, unlocked, try and grab it.
> +		 */
> +		case 0:
> +			continue;
>  		}
> -		WRITE_ONCE(pn->state, vcpu_hashed);
> +
> +		WRITE_ONCE(pn->state, vcpu_head_halted);
>  		qstat_inc(qstat_pv_wait_head, true);
>  		qstat_inc(qstat_pv_wait_again, waitcnt);
>  		pv_wait(&l->locked, _Q_SLOW_VAL);
> @@ -480,7 +475,7 @@ __pv_queued_spin_unlock_slowpath(struct
>  	struct __qspinlock *l = (void *)lock;
>  	struct pv_node *node;
>  
> -	if (unlikely(locked != _Q_SLOW_VAL)) {
> +	if (unlikely(locked != _Q_SLOW_VAL && locked != _Q_HASH_VAL)) {
>  		WARN(!debug_locks_silent,
>  		     "pvqspinlock: lock 0x%lx has corrupted value 0x%x!\n",
>  		     (unsigned long)lock, atomic_read(&lock->val));
> @@ -488,18 +483,17 @@ __pv_queued_spin_unlock_slowpath(struct
>  	}
>  
>  	/*
> -	 * A failed cmpxchg doesn't provide any memory-ordering guarantees,
> -	 * so we need a barrier to order the read of the node data in
> -	 * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
> -	 *
> -	 * Matches the cmpxchg() in pv_wait_head_or_lock() setting _Q_SLOW_VAL.
> +	 * Wait until the hash-bucket is complete.
>  	 */
> -	smp_rmb();
> +	while (READ_ONCE(l->locked) == _Q_HASH_VAL)
> +		cpu_relax();
>  

This does give the lock waiter a chance to block the lock holder, right?
Consider the case where the queue head's vCPU is preempted before it can
set the lock to _Q_SLOW_VAL.

Could we do something like this here:

	if (unlikely(cmpxchg(&l->locked, _Q_HASH_VAL, 0) == _Q_HASH_VAL))
		return;
		
And in pv_wait_head_or_lock()

	locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL);
	switch(locked) {
		case _Q_LOCKED_VAL:
			(void)pv_hash(lock, pn);
			locked = cmpxchg_release(&l->locked, _Q_HASH_VAL, _Q_SLOW_VAL);

			/*
			 * Only the holder will change ->locked from
			 * _Q_HASH_VAL to another value. If that
			 * happens, the holder has already released the
			 * lock without trying to wake the head; in that
			 * case we need to unhash ourselves, and there
			 * is a good chance we can get the lock.
			 */
			if (unlikely(locked != _Q_HASH_VAL)) {
				pv_unhash(lock, pn);
				if (!cmpxchg_relaxed(&l->locked, 0, _Q_LOCKED_VAL))
					goto gotlock;
			}
			break;



I wrote these in my mail client, so I may have missed something.

Thoughts?

Regards,
Boqun

>  	/*
> -	 * Since the above failed to release, this must be the SLOW path.
> -	 * Therefore start by looking up the blocked node and unhashing it.
> +	 * Must first observe _Q_SLOW_VAL in order to observe
> +	 * consistent hash bucket.
>  	 */
> +	smp_rmb();
> +
>  	node = pv_unhash(lock);
>  
>  	/*


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-07-15 16:35         ` Peter Zijlstra
  2016-07-16  1:16           ` Boqun Feng
@ 2016-07-17 22:52           ` Waiman Long
  2016-07-21  6:40           ` xinhui
  2 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2016-07-17 22:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pan Xinhui, Ingo Molnar, linux-kernel, Boqun Feng,
	Scott J Norton, Douglas Hatch

On 07/15/2016 12:35 PM, Peter Zijlstra wrote:
> On Fri, Jul 15, 2016 at 12:07:03PM +0200, Peter Zijlstra wrote:
>>> So if we are kicked by the unlock_slowpath, and the lock is stealed by
>>> someone else,  we need hash its node again and set l->locked to
>>> _Q_SLOW_VAL, then enter pv_wait.
>> Right, let me go think about this a bit.
> Urgh, brain hurt.
>
> So I _think_ the below does for it but I could easily have missed yet
> another case.
>
> Waiman's patch has the problem that it can have two pv_hash() calls for
> the same lock in progress and I'm thinking that means we can hit the
> BUG() in pv_hash() due to that.
>
> If we can't, it still has a problem because its not telling us either.
>
>
>
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -20,7 +20,8 @@
>    * native_queued_spin_unlock().
>    */
>
> -#define _Q_SLOW_VAL	(3U<<  _Q_LOCKED_OFFSET)
> +#define _Q_HASH_VAL	(3U<<  _Q_LOCKED_OFFSET)
> +#define _Q_SLOW_VAL	(7U<<  _Q_LOCKED_OFFSET)

The hash state is transient, which was not obvious to me until I read
through the code. Maybe some comments on the allowable state transitions
would help.
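
For example, something like this above the _Q_HASH_VAL definition (just a
sketch of the transitions as I read your diff, so I may have missed a case):

/*
 * Lock byte (->locked) state transitions:
 *
 *   0             -> _Q_LOCKED_VAL  lock acquired (fastpath, trylock or steal)
 *   _Q_LOCKED_VAL -> 0              native/fastpath unlock
 *   _Q_LOCKED_VAL -> _Q_HASH_VAL    queue head or pv_kick_node() is about to
 *                                   insert the lock into the hash table
 *   _Q_HASH_VAL   -> _Q_SLOW_VAL    hash entry is published; safe to pv_wait()
 *                                   and for the unlocker to pv_unhash()
 *   _Q_SLOW_VAL   -> 0              __pv_queued_spin_unlock_slowpath() releases
 *                                   the lock and kicks the queue head
 *
 * _Q_HASH_VAL is transient: the unlock slowpath spins until it becomes
 * _Q_SLOW_VAL before it touches the hash table.
 */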

>
>   /*
>    * Queue Node Adaptive Spinning
> @@ -36,14 +37,11 @@
>    */
>   #define PV_PREV_CHECK_MASK	0xff
>
> -/*
> - * Queue node uses: vcpu_running&  vcpu_halted.
> - * Queue head uses: vcpu_running&  vcpu_hashed.
> - */
>   enum vcpu_state {
> -	vcpu_running = 0,
> -	vcpu_halted,		/* Used only in pv_wait_node */
> -	vcpu_hashed,		/* = pv_hash'ed + vcpu_halted */
> +	vcpu_node_running = 0,
> +	vcpu_node_halted,
> +	vcpu_head_running,
> +	vcpu_head_halted,
>   };
>
>   struct pv_node {
> @@ -263,7 +261,7 @@ pv_wait_early(struct pv_node *prev, int
>   	if ((loop&  PV_PREV_CHECK_MASK) != 0)
>   		return false;
>
> -	return READ_ONCE(prev->state) != vcpu_running;
> +	return READ_ONCE(prev->state)&  1;
>   }
>   

This relies on the implicit ordering of the vcpu_state enum values. I
think we need a warning comment above it stating that all the halted
states must have bit 0 set and all the running states must have it
cleared, so that future changes to vcpu_state don't silently break that rule.
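
For example (a rough sketch; vcpu_is_halted() and VCPU_STATE_HALTED_BIT are
made-up names, and the BUILD_BUG_ON()s would have to live in function scope,
e.g. in pv_init_node()):

#define VCPU_STATE_HALTED_BIT	1U

/* bit 0 of the state encodes halted vs. running, for both node and head */
static inline bool vcpu_is_halted(u8 state)
{
	return state & VCPU_STATE_HALTED_BIT;
}

	/* e.g. at the top of pv_init_node(): */
	BUILD_BUG_ON(vcpu_node_running & VCPU_STATE_HALTED_BIT);
	BUILD_BUG_ON(!(vcpu_node_halted & VCPU_STATE_HALTED_BIT));
	BUILD_BUG_ON(vcpu_head_running & VCPU_STATE_HALTED_BIT);
	BUILD_BUG_ON(!(vcpu_head_halted & VCPU_STATE_HALTED_BIT));

Then pv_wait_early() could use vcpu_is_halted(READ_ONCE(prev->state))
instead of the bare "& 1", which makes the intent explicit.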

>   /*
> @@ -311,20 +309,19 @@ static void pv_wait_node(struct mcs_spin
>   		 *
>   		 * Matches the cmpxchg() from pv_kick_node().
>   		 */
> -		smp_store_mb(pn->state, vcpu_halted);
> +		smp_store_mb(pn->state, vcpu_node_halted);
>
> -		if (!READ_ONCE(node->locked)) {
> -			qstat_inc(qstat_pv_wait_node, true);
> -			qstat_inc(qstat_pv_wait_early, wait_early);
> -			pv_wait(&pn->state, vcpu_halted);
> -		}
> +		if (READ_ONCE(node->locked))
> +			return;
> +
> +		qstat_inc(qstat_pv_wait_node, true);
> +		qstat_inc(qstat_pv_wait_early, wait_early);
> +		pv_wait(&pn->state, vcpu_node_halted);
>
>   		/*
> -		 * If pv_kick_node() changed us to vcpu_hashed, retain that
> -		 * value so that pv_wait_head_or_lock() knows to not also try
> -		 * to hash this lock.
> +		 * If pv_kick_node() advanced us, retain that state.
>   		 */
> -		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
> +		cmpxchg(&pn->state, vcpu_node_halted, vcpu_node_running);
>
>   		/*
>   		 * If the locked flag is still not set after wakeup, it is a
> @@ -362,18 +359,17 @@ static void pv_kick_node(struct qspinloc
>   	 *
>   	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>   	 */
> -	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> +	if (cmpxchg(&pn->state, vcpu_node_halted, vcpu_head_running) != vcpu_node_halted)
>   		return;
>
>   	/*
> -	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
> -	 *
> -	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
> -	 * the hash table later on at unlock time, no atomic instruction is
> -	 * needed.
> +	 * See pv_wait_head_or_lock(). We have to hash and force the unlock
> +	 * into the slow path to deliver the actual kick for waking.
>   	 */
> -	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> -	(void)pv_hash(lock, pn);
> +	if (cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL) == _Q_LOCKED_VAL) {
> +		(void)pv_hash(lock, pn);
> +		smp_store_release(&l->locked, _Q_SLOW_VAL);
> +	}
>   }
>
>   /*
> @@ -388,28 +384,22 @@ pv_wait_head_or_lock(struct qspinlock *l
>   {
>   	struct pv_node *pn = (struct pv_node *)node;
>   	struct __qspinlock *l = (void *)lock;
> -	struct qspinlock **lp = NULL;
>   	int waitcnt = 0;
>   	int loop;
>
>   	/*
> -	 * If pv_kick_node() already advanced our state, we don't need to
> -	 * insert ourselves into the hash table anymore.
> -	 */
> -	if (READ_ONCE(pn->state) == vcpu_hashed)
> -		lp = (struct qspinlock **)1;
> -
> -	/*
>   	 * Tracking # of slowpath locking operations
>   	 */
>   	qstat_inc(qstat_pv_lock_slowpath, true);
>
>   	for (;; waitcnt++) {
> +		u8 locked;
> +
>   		/*
>   		 * Set correct vCPU state to be used by queue node wait-early
>   		 * mechanism.
>   		 */
> -		WRITE_ONCE(pn->state, vcpu_running);
> +		WRITE_ONCE(pn->state, vcpu_head_running);
>
>   		/*
>   		 * Set the pending bit in the active lock spinning loop to
> @@ -423,33 +413,38 @@ pv_wait_head_or_lock(struct qspinlock *l
>   		}
>   		clear_pending(lock);
>
> +		/*
> +		 * We want to go sleep; ensure we're hashed so that
> +		 * __pv_queued_spin_unlock_slow() can find us for a wakeup.
> +		 */
> +		locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL);
> +		switch (locked) {
> +		/*
> +		 * We're not hashed yet, either we're fresh from pv_wait_node()
> +		 * or __pv_queued_spin_unlock_slow() unhashed us but we lost
> +		 * the trylock to a steal and have to re-hash.
> +		 */
> +		case _Q_LOCKED_VAL:
> +			(void)pv_hash(lock, pn);
> +			smp_store_release(&l->locked, _Q_SLOW_VAL);
> +			break;
>
> -		if (!lp) { /* ONCE */
> -			lp = pv_hash(lock, pn);
> +		/*
> +		 * pv_kick_node() is hashing us, wait for it.
> +		 */
> +		case _Q_HASH_VAL:
> +			while (READ_ONCE(l->locked) == _Q_HASH_VAL)
> +				cpu_relax();
> +			break;
>
> -			/*
> -			 * We must hash before setting _Q_SLOW_VAL, such that
> -			 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
> -			 * we'll be sure to be able to observe our hash entry.
> -			 *
> -			 *   [S]<hash>                  [Rmw] l->locked == _Q_SLOW_VAL
> -			 *       MB                           RMB
> -			 * [RmW] l->locked = _Q_SLOW_VAL  [L]<unhash>
> -			 *
> -			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
> -			 */
> -			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
> -				/*
> -				 * The lock was free and now we own the lock.
> -				 * Change the lock value back to _Q_LOCKED_VAL
> -				 * and unhash the table.
> -				 */
> -				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
> -				WRITE_ONCE(*lp, NULL);
> -				goto gotlock;
> -			}
> +		/*
> +		 * Ooh, unlocked, try and grab it.
> +		 */
> +		case 0:
> +			continue;
>   		}
> -		WRITE_ONCE(pn->state, vcpu_hashed);
> +
> +		WRITE_ONCE(pn->state, vcpu_head_halted);
>   		qstat_inc(qstat_pv_wait_head, true);
>   		qstat_inc(qstat_pv_wait_again, waitcnt);
>   		pv_wait(&l->locked, _Q_SLOW_VAL);
> @@ -480,7 +475,7 @@ __pv_queued_spin_unlock_slowpath(struct
>   	struct __qspinlock *l = (void *)lock;
>   	struct pv_node *node;
>
> -	if (unlikely(locked != _Q_SLOW_VAL)) {
> +	if (unlikely(locked != _Q_SLOW_VAL&&  locked != _Q_HASH_VAL)) {
>   		WARN(!debug_locks_silent,
>   		     "pvqspinlock: lock 0x%lx has corrupted value 0x%x!\n",
>   		     (unsigned long)lock, atomic_read(&lock->val));
> @@ -488,18 +483,17 @@ __pv_queued_spin_unlock_slowpath(struct
>   	}
>
>   	/*
> -	 * A failed cmpxchg doesn't provide any memory-ordering guarantees,
> -	 * so we need a barrier to order the read of the node data in
> -	 * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
> -	 *
> -	 * Matches the cmpxchg() in pv_wait_head_or_lock() setting _Q_SLOW_VAL.
> +	 * Wait until the hash-bucket is complete.
>   	 */
> -	smp_rmb();
> +	while (READ_ONCE(l->locked) == _Q_HASH_VAL)
> +		cpu_relax();
>
>   	/*
> -	 * Since the above failed to release, this must be the SLOW path.
> -	 * Therefore start by looking up the blocked node and unhashing it.
> +	 * Must first observe _Q_SLOW_VAL in order to observe
> +	 * consistent hash bucket.
>   	 */
> +	smp_rmb();
> +
>   	node = pv_unhash(lock);
>
>   	/*

I like your patch. Other than needing a bit more commenting to clarify 
things, I think it is a good one.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-07-16  1:16           ` Boqun Feng
@ 2016-07-17 23:07             ` Waiman Long
  2016-07-17 23:10               ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2016-07-17 23:07 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Pan Xinhui, Ingo Molnar, linux-kernel,
	Scott J Norton, Douglas Hatch

On 07/15/2016 09:16 PM, Boqun Feng wrote:
> On Fri, Jul 15, 2016 at 06:35:56PM +0200, Peter Zijlstra wrote:
>> On Fri, Jul 15, 2016 at 12:07:03PM +0200, Peter Zijlstra wrote:
>>>> So if we are kicked by the unlock_slowpath, and the lock is stealed by
>>>> someone else,  we need hash its node again and set l->locked to
>>>> _Q_SLOW_VAL, then enter pv_wait.
>>> Right, let me go think about this a bit.
>> Urgh, brain hurt.
>>
>> So I _think_ the below does for it but I could easily have missed yet
>> another case.
>>
>> Waiman's patch has the problem that it can have two pv_hash() calls for
>> the same lock in progress and I'm thinking that means we can hit the
>> BUG() in pv_hash() due to that.
>>
> I think Waiman's patch does have the problem of two pv_hash() calls for
> the same lock in progress. As I mentioned in the first version:
>
> http://lkml.kernel.org/g/20160527074331.GB8096@insomnia
>
> And he tried to address this in the patch #3 of this series. However,
> I think what is proper here is either to reorder patch 2 and 3 or to
> merge patch 2 and 3, otherwise, we are introducing a bug in the middle
> of this series.
>
> Thoughts, Waiman?

Patches 2 and 3 can be reversed.

>
> That said, I found Peter's way is much simpler and easier to understand
> ;-)

I agree. Peter's patch is better than mine.

>> If we can't, it still has a problem because its not telling us either.
>>
>>
>>
>> --- a/kernel/locking/qspinlock_paravirt.h
>> +++ b/kernel/locking/qspinlock_paravirt.h
>> @@ -20,7 +20,8 @@
>>    * native_queued_spin_unlock().
>>    */
>>
>> -#define _Q_SLOW_VAL	(3U<<  _Q_LOCKED_OFFSET)
>> +#define _Q_HASH_VAL	(3U<<  _Q_LOCKED_OFFSET)
>> +#define _Q_SLOW_VAL	(7U<<  _Q_LOCKED_OFFSET)
>>
>>   /*
>>    * Queue Node Adaptive Spinning
>> @@ -36,14 +37,11 @@
>>    */
>>   #define PV_PREV_CHECK_MASK	0xff
>>
>> -/*
>> - * Queue node uses: vcpu_running&  vcpu_halted.
>> - * Queue head uses: vcpu_running&  vcpu_hashed.
>> - */
>>   enum vcpu_state {
>> -	vcpu_running = 0,
>> -	vcpu_halted,		/* Used only in pv_wait_node */
>> -	vcpu_hashed,		/* = pv_hash'ed + vcpu_halted */
>> +	vcpu_node_running = 0,
>> +	vcpu_node_halted,
>> +	vcpu_head_running,
> We actually don't need this extra running state, right? Because nobody
> cares about the difference between two running states right now.

That addresses the problem in Xinhui's patch that changed the state from 
halted to hashed. With that separation, that change is no longer necessary.

>> +	vcpu_head_halted,
>>   };
>>
>>   struct pv_node {
>> @@ -263,7 +261,7 @@ pv_wait_early(struct pv_node *prev, int
>>   	if ((loop&  PV_PREV_CHECK_MASK) != 0)
>>   		return false;
>>
>> -	return READ_ONCE(prev->state) != vcpu_running;
>> +	return READ_ONCE(prev->state)&  1;
>>   }
>>
>>   /*
>> @@ -311,20 +309,19 @@ static void pv_wait_node(struct mcs_spin
>>   		 *
>>   		 * Matches the cmpxchg() from pv_kick_node().
>>   		 */
>> -		smp_store_mb(pn->state, vcpu_halted);
>> +		smp_store_mb(pn->state, vcpu_node_halted);
>>
>> -		if (!READ_ONCE(node->locked)) {
>> -			qstat_inc(qstat_pv_wait_node, true);
>> -			qstat_inc(qstat_pv_wait_early, wait_early);
>> -			pv_wait(&pn->state, vcpu_halted);
>> -		}
>> +		if (READ_ONCE(node->locked))
>> +			return;
>> +
>> +		qstat_inc(qstat_pv_wait_node, true);
>> +		qstat_inc(qstat_pv_wait_early, wait_early);
>> +		pv_wait(&pn->state, vcpu_node_halted);
>>
>>   		/*
>> -		 * If pv_kick_node() changed us to vcpu_hashed, retain that
>> -		 * value so that pv_wait_head_or_lock() knows to not also try
>> -		 * to hash this lock.
>> +		 * If pv_kick_node() advanced us, retain that state.
>>   		 */
>> -		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
>> +		cmpxchg(&pn->state, vcpu_node_halted, vcpu_node_running);
>>
>>   		/*
>>   		 * If the locked flag is still not set after wakeup, it is a
>> @@ -362,18 +359,17 @@ static void pv_kick_node(struct qspinloc
>>   	 *
>>   	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>   	 */
>> -	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>> +	if (cmpxchg(&pn->state, vcpu_node_halted, vcpu_head_running) != vcpu_node_halted)
>>   		return;
>>
>>   	/*
>> -	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
>> -	 *
>> -	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
>> -	 * the hash table later on at unlock time, no atomic instruction is
>> -	 * needed.
>> +	 * See pv_wait_head_or_lock(). We have to hash and force the unlock
>> +	 * into the slow path to deliver the actual kick for waking.
>>   	 */
>> -	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>> -	(void)pv_hash(lock, pn);
>> +	if (cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL) == _Q_LOCKED_VAL) {
>> +		(void)pv_hash(lock, pn);
>> +		smp_store_release(&l->locked, _Q_SLOW_VAL);
>> +	}
>>   }
>>
>>   /*
>> @@ -388,28 +384,22 @@ pv_wait_head_or_lock(struct qspinlock *l
>>   {
>>   	struct pv_node *pn = (struct pv_node *)node;
>>   	struct __qspinlock *l = (void *)lock;
>> -	struct qspinlock **lp = NULL;
>>   	int waitcnt = 0;
>>   	int loop;
>>
>>   	/*
>> -	 * If pv_kick_node() already advanced our state, we don't need to
>> -	 * insert ourselves into the hash table anymore.
>> -	 */
>> -	if (READ_ONCE(pn->state) == vcpu_hashed)
>> -		lp = (struct qspinlock **)1;
>> -
>> -	/*
>>   	 * Tracking # of slowpath locking operations
>>   	 */
>>   	qstat_inc(qstat_pv_lock_slowpath, true);
>>
>>   	for (;; waitcnt++) {
>> +		u8 locked;
>> +
>>   		/*
>>   		 * Set correct vCPU state to be used by queue node wait-early
>>   		 * mechanism.
>>   		 */
>> -		WRITE_ONCE(pn->state, vcpu_running);
>> +		WRITE_ONCE(pn->state, vcpu_head_running);
>>
>>   		/*
>>   		 * Set the pending bit in the active lock spinning loop to
>> @@ -423,33 +413,38 @@ pv_wait_head_or_lock(struct qspinlock *l
>>   		}
>>   		clear_pending(lock);
>>
>> +		/*
>> +		 * We want to go sleep; ensure we're hashed so that
>> +		 * __pv_queued_spin_unlock_slow() can find us for a wakeup.
>> +		 */
>> +		locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL);
>> +		switch (locked) {
>> +		/*
>> +		 * We're not hashed yet, either we're fresh from pv_wait_node()
>> +		 * or __pv_queued_spin_unlock_slow() unhashed us but we lost
>> +		 * the trylock to a steal and have to re-hash.
>> +		 */
>> +		case _Q_LOCKED_VAL:
>> +			(void)pv_hash(lock, pn);
>> +			smp_store_release(&l->locked, _Q_SLOW_VAL);
>> +			break;
>>
>> -		if (!lp) { /* ONCE */
>> -			lp = pv_hash(lock, pn);
>> +		/*
>> +		 * pv_kick_node() is hashing us, wait for it.
>> +		 */
>> +		case _Q_HASH_VAL:
>> +			while (READ_ONCE(l->locked) == _Q_HASH_VAL)
>> +				cpu_relax();
>> +			break;
>>
>> -			/*
>> -			 * We must hash before setting _Q_SLOW_VAL, such that
>> -			 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
>> -			 * we'll be sure to be able to observe our hash entry.
>> -			 *
>> -			 *   [S]<hash>                  [Rmw] l->locked == _Q_SLOW_VAL
>> -			 *       MB                           RMB
>> -			 * [RmW] l->locked = _Q_SLOW_VAL  [L]<unhash>
>> -			 *
>> -			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
>> -			 */
>> -			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
>> -				/*
>> -				 * The lock was free and now we own the lock.
>> -				 * Change the lock value back to _Q_LOCKED_VAL
>> -				 * and unhash the table.
>> -				 */
>> -				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
>> -				WRITE_ONCE(*lp, NULL);
>> -				goto gotlock;
>> -			}
>> +		/*
>> +		 * Ooh, unlocked, try and grab it.
>> +		 */
>> +		case 0:
>> +			continue;
>>   		}
>> -		WRITE_ONCE(pn->state, vcpu_hashed);
>> +
>> +		WRITE_ONCE(pn->state, vcpu_head_halted);
>>   		qstat_inc(qstat_pv_wait_head, true);
>>   		qstat_inc(qstat_pv_wait_again, waitcnt);
>>   		pv_wait(&l->locked, _Q_SLOW_VAL);
>> @@ -480,7 +475,7 @@ __pv_queued_spin_unlock_slowpath(struct
>>   	struct __qspinlock *l = (void *)lock;
>>   	struct pv_node *node;
>>
>> -	if (unlikely(locked != _Q_SLOW_VAL)) {
>> +	if (unlikely(locked != _Q_SLOW_VAL&&  locked != _Q_HASH_VAL)) {
>>   		WARN(!debug_locks_silent,
>>   		     "pvqspinlock: lock 0x%lx has corrupted value 0x%x!\n",
>>   		     (unsigned long)lock, atomic_read(&lock->val));
>> @@ -488,18 +483,17 @@ __pv_queued_spin_unlock_slowpath(struct
>>   	}
>>
>>   	/*
>> -	 * A failed cmpxchg doesn't provide any memory-ordering guarantees,
>> -	 * so we need a barrier to order the read of the node data in
>> -	 * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
>> -	 *
>> -	 * Matches the cmpxchg() in pv_wait_head_or_lock() setting _Q_SLOW_VAL.
>> +	 * Wait until the hash-bucket is complete.
>>   	 */
>> -	smp_rmb();
>> +	while (READ_ONCE(l->locked) == _Q_HASH_VAL)
>> +		cpu_relax();
>>
> This does give a chance to let the lock waiter block the lock holder,
> right? Considering a lock queue head's vcpu is preempted before it could
> set the lock to _Q_SLOW_VAL.
>
> Could we do something like this here:
>
> 	if (unlikely(cmpxchg(l->locked, _Q_HASH_VAL, 0) == _Q_HASH_VAL))
> 		return;

I think it should be:

	if (unlikely((locked == _Q_HASH_VAL) &&
		     (cmpxchg(&l->locked, _Q_HASH_VAL, 0) == _Q_HASH_VAL)))
		return;

You don't want to unconditionally do a cmpxchg.
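
In context, the top of __pv_queued_spin_unlock_slowpath() would then look
roughly like this (sketch only, untested, and it assumes your
cmpxchg_release() change in pv_wait_head_or_lock() so the head can recover
when it finds the lock already released):

	/*
	 * If the queue head was preempted before it could publish its hash
	 * entry (->locked still _Q_HASH_VAL), just release the lock and
	 * return without kicking anyone; the head's cmpxchg_release() will
	 * fail, and it will unhash itself and retry the lock.
	 */
	if (unlikely((locked == _Q_HASH_VAL) &&
		     (cmpxchg(&l->locked, _Q_HASH_VAL, 0) == _Q_HASH_VAL)))
		return;

	/* Otherwise wait for the hash entry to become visible, as before. */
	while (READ_ONCE(l->locked) == _Q_HASH_VAL)
		cpu_relax();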

> 		
> And in pv_wait_head_or_lock()
>
> 	locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL);
> 	switch(locked) {
> 		case _Q_LOCKED_VAL:
> 			(void)pv_hash(lock, pn);
> 			locked = cmpxchg_release(&l->locked, _Q_HASH_VAL, _Q_SLOW_VAL);
>
> 			/*
> 			 * Only the holder will change the ->locked from
> 			 * _Q_HASH_VAL to another value, if this
> 			 * happens, the holder has already released the
> 			 * lock without trying to wake the head, in this
> 			 * case, we need to unhash ourselves and there
> 			 * is a great chance we can get the locke.
> 			 */
> 			if (unlikely(locked != _Q_HASH_VAL)) {
> 				pv_unhash(lock, pn);
> 				if (!cmpxchg_relaxed(&l->locked, 0, _Q_LOCKED_VAL)
> 					goto gotlock;
> 			}
> 			break;
>
>
>
> Wrote those in my mailbox, may miss something.
>
> Thoughts?
>
>

Cheers,
Longman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-07-17 23:07             ` Waiman Long
@ 2016-07-17 23:10               ` Waiman Long
  2016-07-17 23:22                 ` Wanpeng Li
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2016-07-17 23:10 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Pan Xinhui, Ingo Molnar, linux-kernel,
	Scott J Norton, Douglas Hatch

On 07/17/2016 07:07 PM, Waiman Long wrote:
> On 07/15/2016 09:16 PM, Boqun Feng wrote:
>> On Fri, Jul 15, 2016 at 06:35:56PM +0200, Peter Zijlstra wrote:
>>> On Fri, Jul 15, 2016 at 12:07:03PM +0200, Peter Zijlstra wrote:
>>>>> So if we are kicked by the unlock_slowpath, and the lock is 
>>>>> stealed by
>>>>> someone else,  we need hash its node again and set l->locked to
>>>>> _Q_SLOW_VAL, then enter pv_wait.
>>>> Right, let me go think about this a bit.
>>> Urgh, brain hurt.
>>>
>>> So I _think_ the below does for it but I could easily have missed yet
>>> another case.
>>>
>>> Waiman's patch has the problem that it can have two pv_hash() calls for
>>> the same lock in progress and I'm thinking that means we can hit the
>>> BUG() in pv_hash() due to that.
>>>
>> I think Waiman's patch does have the problem of two pv_hash() calls for
>> the same lock in progress. As I mentioned in the first version:
>>
>> http://lkml.kernel.org/g/20160527074331.GB8096@insomnia
>>
>> And he tried to address this in the patch #3 of this series. However,
>> I think what is proper here is either to reorder patch 2 and 3 or to
>> merge patch 2 and 3, otherwise, we are introducing a bug in the middle
>> of this series.
>>
>> Thoughts, Waiman?
>
> Patches 2 and 3 can be reversed.
>
>>
>> That said, I found Peter's way is much simpler and easier to understand
>> ;-)
>
> I agree. Peter's patch is better than mine.
>
>>> If we can't, it still has a problem because its not telling us either.
>>>
>>>
>>>
>>> --- a/kernel/locking/qspinlock_paravirt.h
>>> +++ b/kernel/locking/qspinlock_paravirt.h
>>> @@ -20,7 +20,8 @@
>>>    * native_queued_spin_unlock().
>>>    */
>>>
>>> -#define _Q_SLOW_VAL    (3U<<  _Q_LOCKED_OFFSET)
>>> +#define _Q_HASH_VAL    (3U<<  _Q_LOCKED_OFFSET)
>>> +#define _Q_SLOW_VAL    (7U<<  _Q_LOCKED_OFFSET)
>>>
>>>   /*
>>>    * Queue Node Adaptive Spinning
>>> @@ -36,14 +37,11 @@
>>>    */
>>>   #define PV_PREV_CHECK_MASK    0xff
>>>
>>> -/*
>>> - * Queue node uses: vcpu_running&  vcpu_halted.
>>> - * Queue head uses: vcpu_running&  vcpu_hashed.
>>> - */
>>>   enum vcpu_state {
>>> -    vcpu_running = 0,
>>> -    vcpu_halted,        /* Used only in pv_wait_node */
>>> -    vcpu_hashed,        /* = pv_hash'ed + vcpu_halted */
>>> +    vcpu_node_running = 0,
>>> +    vcpu_node_halted,
>>> +    vcpu_head_running,
>> We actually don't need this extra running state, right? Because nobody
>> cares about the difference between two running states right now.
>
> That addresses the problem in Xinhui patch that changed the state from 
> halted to hashed. With that separation, that change is no longer 
> necessary.

Oh, I meant Wanpeng's double hash race patch, not Xinhui's patch.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-07-17 23:10               ` Waiman Long
@ 2016-07-17 23:22                 ` Wanpeng Li
  0 siblings, 0 replies; 19+ messages in thread
From: Wanpeng Li @ 2016-07-17 23:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Boqun Feng, Peter Zijlstra, Pan Xinhui, Ingo Molnar,
	linux-kernel, Scott J Norton, Douglas Hatch

2016-07-18 7:10 GMT+08:00 Waiman Long <waiman.long@hpe.com>:
> On 07/17/2016 07:07 PM, Waiman Long wrote:
>>
>> On 07/15/2016 09:16 PM, Boqun Feng wrote:
>>>
>>> On Fri, Jul 15, 2016 at 06:35:56PM +0200, Peter Zijlstra wrote:
>>>>
>>>> On Fri, Jul 15, 2016 at 12:07:03PM +0200, Peter Zijlstra wrote:
>>>>>>
>>>>>> So if we are kicked by the unlock_slowpath, and the lock is stealed by
>>>>>> someone else,  we need hash its node again and set l->locked to
>>>>>> _Q_SLOW_VAL, then enter pv_wait.
>>>>>
>>>>> Right, let me go think about this a bit.
>>>>
>>>> Urgh, brain hurt.
>>>>
>>>> So I _think_ the below does for it but I could easily have missed yet
>>>> another case.
>>>>
>>>> Waiman's patch has the problem that it can have two pv_hash() calls for
>>>> the same lock in progress and I'm thinking that means we can hit the
>>>> BUG() in pv_hash() due to that.
>>>>
>>> I think Waiman's patch does have the problem of two pv_hash() calls for
>>> the same lock in progress. As I mentioned in the first version:
>>>
>>> http://lkml.kernel.org/g/20160527074331.GB8096@insomnia
>>>
>>> And he tried to address this in the patch #3 of this series. However,
>>> I think what is proper here is either to reorder patch 2 and 3 or to
>>> merge patch 2 and 3, otherwise, we are introducing a bug in the middle
>>> of this series.
>>>
>>> Thoughts, Waiman?
>>
>>
>> Patches 2 and 3 can be reversed.
>>
>>>
>>> That said, I found Peter's way is much simpler and easier to understand
>>> ;-)
>>
>>
>> I agree. Peter's patch is better than mine.
>>
>>>> If we can't, it still has a problem because its not telling us either.
>>>>
>>>>
>>>>
>>>> --- a/kernel/locking/qspinlock_paravirt.h
>>>> +++ b/kernel/locking/qspinlock_paravirt.h
>>>> @@ -20,7 +20,8 @@
>>>>    * native_queued_spin_unlock().
>>>>    */
>>>>
>>>> -#define _Q_SLOW_VAL    (3U<<  _Q_LOCKED_OFFSET)
>>>> +#define _Q_HASH_VAL    (3U<<  _Q_LOCKED_OFFSET)
>>>> +#define _Q_SLOW_VAL    (7U<<  _Q_LOCKED_OFFSET)
>>>>
>>>>   /*
>>>>    * Queue Node Adaptive Spinning
>>>> @@ -36,14 +37,11 @@
>>>>    */
>>>>   #define PV_PREV_CHECK_MASK    0xff
>>>>
>>>> -/*
>>>> - * Queue node uses: vcpu_running&  vcpu_halted.
>>>> - * Queue head uses: vcpu_running&  vcpu_hashed.
>>>> - */
>>>>   enum vcpu_state {
>>>> -    vcpu_running = 0,
>>>> -    vcpu_halted,        /* Used only in pv_wait_node */
>>>> -    vcpu_hashed,        /* = pv_hash'ed + vcpu_halted */
>>>> +    vcpu_node_running = 0,
>>>> +    vcpu_node_halted,
>>>> +    vcpu_head_running,
>>>
>>> We actually don't need this extra running state, right? Because nobody
>>> cares about the difference between two running states right now.
>>
>>
>> That addresses the problem in Xinhui patch that changed the state from
>> halted to hashed. With that separation, that change is no longer necessary.
>
>
> Oh, I meant Wanpeng's double hash race patch, not Xinhui's patch.

This patch can be based on top of mine; I think it already does it this way. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-07-15 16:35         ` Peter Zijlstra
  2016-07-16  1:16           ` Boqun Feng
  2016-07-17 22:52           ` Waiman Long
@ 2016-07-21  6:40           ` xinhui
  2 siblings, 0 replies; 19+ messages in thread
From: xinhui @ 2016-07-21  6:40 UTC (permalink / raw)
  To: Peter Zijlstra, Pan Xinhui
  Cc: Waiman Long, Ingo Molnar, linux-kernel, Boqun Feng,
	Scott J Norton, Douglas Hatch



On 2016-07-16 00:35, Peter Zijlstra wrote:
> On Fri, Jul 15, 2016 at 12:07:03PM +0200, Peter Zijlstra wrote:
>>> So if we are kicked by the unlock_slowpath, and the lock is stealed by
>>> someone else,  we need hash its node again and set l->locked to
>>> _Q_SLOW_VAL, then enter pv_wait.
>>
>> Right, let me go think about this a bit.
>
> Urgh, brain hurt.
>
> So I _think_ the below does for it but I could easily have missed yet
> another case.
>
> Waiman's patch has the problem that it can have two pv_hash() calls for
> the same lock in progress and I'm thinking that means we can hit the
> BUG() in pv_hash() due to that.
>
> If we can't, it still has a problem because its not telling us either.
>
>
>
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -20,7 +20,8 @@
>    * native_queued_spin_unlock().
>    */
>
> -#define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
> +#define _Q_HASH_VAL	(3U << _Q_LOCKED_OFFSET)
> +#define _Q_SLOW_VAL	(7U << _Q_LOCKED_OFFSET)
>
>   /*
>    * Queue Node Adaptive Spinning
> @@ -36,14 +37,11 @@
>    */
>   #define PV_PREV_CHECK_MASK	0xff
>
> -/*
> - * Queue node uses: vcpu_running & vcpu_halted.
> - * Queue head uses: vcpu_running & vcpu_hashed.
> - */
>   enum vcpu_state {
> -	vcpu_running = 0,
> -	vcpu_halted,		/* Used only in pv_wait_node */
> -	vcpu_hashed,		/* = pv_hash'ed + vcpu_halted */
> +	vcpu_node_running = 0,
> +	vcpu_node_halted,
> +	vcpu_head_running,
> +	vcpu_head_halted,
>   };
>
>   struct pv_node {
> @@ -263,7 +261,7 @@ pv_wait_early(struct pv_node *prev, int
>   	if ((loop & PV_PREV_CHECK_MASK) != 0)
>   		return false;
>
> -	return READ_ONCE(prev->state) != vcpu_running;
> +	return READ_ONCE(prev->state) & 1;
>   }
>
>   /*
> @@ -311,20 +309,19 @@ static void pv_wait_node(struct mcs_spin
>   		 *
>   		 * Matches the cmpxchg() from pv_kick_node().
>   		 */
> -		smp_store_mb(pn->state, vcpu_halted);
> +		smp_store_mb(pn->state, vcpu_node_halted);
>
> -		if (!READ_ONCE(node->locked)) {
> -			qstat_inc(qstat_pv_wait_node, true);
> -			qstat_inc(qstat_pv_wait_early, wait_early);
> -			pv_wait(&pn->state, vcpu_halted);
> -		}
> +		if (READ_ONCE(node->locked))
> +			return;
> +
> +		qstat_inc(qstat_pv_wait_node, true);
> +		qstat_inc(qstat_pv_wait_early, wait_early);
> +		pv_wait(&pn->state, vcpu_node_halted);
>
>   		/*
> -		 * If pv_kick_node() changed us to vcpu_hashed, retain that
> -		 * value so that pv_wait_head_or_lock() knows to not also try
> -		 * to hash this lock.
> +		 * If pv_kick_node() advanced us, retain that state.
>   		 */
> -		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
> +		cmpxchg(&pn->state, vcpu_node_halted, vcpu_node_running);
>
>   		/*
>   		 * If the locked flag is still not set after wakeup, it is a
> @@ -362,18 +359,17 @@ static void pv_kick_node(struct qspinloc
>   	 *
>   	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>   	 */
> -	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> +	if (cmpxchg(&pn->state, vcpu_node_halted, vcpu_head_running) != vcpu_node_halted)
Well, I think it should be:

	if (cmpxchg(&pn->state, vcpu_node_halted, vcpu_head_halted) != vcpu_node_halted)

Yes, the node has become the *head*, but its running state does not change.

The rest looks OK to me.

>   		return;
>
>   	/*
> -	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
> -	 *
> -	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
> -	 * the hash table later on at unlock time, no atomic instruction is
> -	 * needed.
> +	 * See pv_wait_head_or_lock(). We have to hash and force the unlock
> +	 * into the slow path to deliver the actual kick for waking.
>   	 */
> -	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> -	(void)pv_hash(lock, pn);
> +	if (cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL) == _Q_LOCKED_VAL) {
> +		(void)pv_hash(lock, pn);
> +		smp_store_release(&l->locked, _Q_SLOW_VAL);
> +	}
>   }
>
>   /*
> @@ -388,28 +384,22 @@ pv_wait_head_or_lock(struct qspinlock *l
>   {
>   	struct pv_node *pn = (struct pv_node *)node;
>   	struct __qspinlock *l = (void *)lock;
> -	struct qspinlock **lp = NULL;
>   	int waitcnt = 0;
>   	int loop;
>
>   	/*
> -	 * If pv_kick_node() already advanced our state, we don't need to
> -	 * insert ourselves into the hash table anymore.
> -	 */
> -	if (READ_ONCE(pn->state) == vcpu_hashed)
> -		lp = (struct qspinlock **)1;
> -
> -	/*
>   	 * Tracking # of slowpath locking operations
>   	 */
>   	qstat_inc(qstat_pv_lock_slowpath, true);
>
>   	for (;; waitcnt++) {
> +		u8 locked;
> +
>   		/*
>   		 * Set correct vCPU state to be used by queue node wait-early
>   		 * mechanism.
>   		 */
> -		WRITE_ONCE(pn->state, vcpu_running);
> +		WRITE_ONCE(pn->state, vcpu_head_running);
>
>   		/*
>   		 * Set the pending bit in the active lock spinning loop to
> @@ -423,33 +413,38 @@ pv_wait_head_or_lock(struct qspinlock *l
>   		}
>   		clear_pending(lock);
>
> +		/*
> +		 * We want to go sleep; ensure we're hashed so that
> +		 * __pv_queued_spin_unlock_slow() can find us for a wakeup.
> +		 */
> +		locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL);
> +		switch (locked) {
> +		/*
> +		 * We're not hashed yet, either we're fresh from pv_wait_node()
> +		 * or __pv_queued_spin_unlock_slow() unhashed us but we lost
> +		 * the trylock to a steal and have to re-hash.
> +		 */
> +		case _Q_LOCKED_VAL:
> +			(void)pv_hash(lock, pn);
> +			smp_store_release(&l->locked, _Q_SLOW_VAL);
> +			break;
>
> -		if (!lp) { /* ONCE */
> -			lp = pv_hash(lock, pn);
> +		/*
> +		 * pv_kick_node() is hashing us, wait for it.
> +		 */
> +		case _Q_HASH_VAL:
> +			while (READ_ONCE(l->locked) == _Q_HASH_VAL)
> +				cpu_relax();
> +			break;
>
> -			/*
> -			 * We must hash before setting _Q_SLOW_VAL, such that
> -			 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
> -			 * we'll be sure to be able to observe our hash entry.
> -			 *
> -			 *   [S] <hash>                 [Rmw] l->locked == _Q_SLOW_VAL
> -			 *       MB                           RMB
> -			 * [RmW] l->locked = _Q_SLOW_VAL  [L] <unhash>
> -			 *
> -			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
> -			 */
> -			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
> -				/*
> -				 * The lock was free and now we own the lock.
> -				 * Change the lock value back to _Q_LOCKED_VAL
> -				 * and unhash the table.
> -				 */
> -				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
> -				WRITE_ONCE(*lp, NULL);
> -				goto gotlock;
> -			}
> +		/*
> +		 * Ooh, unlocked, try and grab it.
> +		 */
> +		case 0:
> +			continue;
>   		}
> -		WRITE_ONCE(pn->state, vcpu_hashed);
> +
> +		WRITE_ONCE(pn->state, vcpu_head_halted);
>   		qstat_inc(qstat_pv_wait_head, true);
>   		qstat_inc(qstat_pv_wait_again, waitcnt);
>   		pv_wait(&l->locked, _Q_SLOW_VAL);
> @@ -480,7 +475,7 @@ __pv_queued_spin_unlock_slowpath(struct
>   	struct __qspinlock *l = (void *)lock;
>   	struct pv_node *node;
>
> -	if (unlikely(locked != _Q_SLOW_VAL)) {
> +	if (unlikely(locked != _Q_SLOW_VAL && locked != _Q_HASH_VAL)) {
>   		WARN(!debug_locks_silent,
>   		     "pvqspinlock: lock 0x%lx has corrupted value 0x%x!\n",
>   		     (unsigned long)lock, atomic_read(&lock->val));
> @@ -488,18 +483,17 @@ __pv_queued_spin_unlock_slowpath(struct
>   	}
>
>   	/*
> -	 * A failed cmpxchg doesn't provide any memory-ordering guarantees,
> -	 * so we need a barrier to order the read of the node data in
> -	 * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
> -	 *
> -	 * Matches the cmpxchg() in pv_wait_head_or_lock() setting _Q_SLOW_VAL.
> +	 * Wait until the hash-bucket is complete.
>   	 */
> -	smp_rmb();
> +	while (READ_ONCE(l->locked) == _Q_HASH_VAL)
> +		cpu_relax();
>
>   	/*
> -	 * Since the above failed to release, this must be the SLOW path.
> -	 * Therefore start by looking up the blocked node and unhashing it.
> +	 * Must first observe _Q_SLOW_VAL in order to observe
> +	 * consistent hash bucket.
>   	 */
> +	smp_rmb();
> +
>   	node = pv_unhash(lock);
>
>   	/*
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [tip:locking/core] locking/pvstat: Separate wait_again and spurious wakeup stats
  2016-05-31 16:53 ` [PATCH v2 1/5] locking/pvstat: Separate wait_again and spurious wakeup stats Waiman Long
@ 2016-08-10 18:07   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 19+ messages in thread
From: tip-bot for Waiman Long @ 2016-08-10 18:07 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: scott.norton, linux-kernel, boqun.feng, torvalds, tglx, hpa,
	paulmck, xinhui, peterz, doug.hatch, Waiman.Long, mingo, akpm

Commit-ID:  08be8f63c40c030b5cf95b4368e314e563a86301
Gitweb:     http://git.kernel.org/tip/08be8f63c40c030b5cf95b4368e314e563a86301
Author:     Waiman Long <Waiman.Long@hpe.com>
AuthorDate: Tue, 31 May 2016 12:53:47 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 10 Aug 2016 14:16:02 +0200

locking/pvstat: Separate wait_again and spurious wakeup stats

Currently there are overlap in the pvqspinlock wait_again and
spurious_wakeup stat counters. Because of lock stealing, it is
no longer possible to accurately determine if spurious wakeup has
happened in the queue head.  As they track both the queue node and
queue head status, it is also hard to tell how many of those comes
from the queue head and how many from the queue node.

This patch changes the accounting rules so that spurious wakeup is
only tracked in the queue node. The wait_again count, however, is
only tracked in the queue head when the vCPU failed to acquire the
lock after a vCPU kick. This should give a much better indication of
the wait-kick dynamics in the queue node and the queue head.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1464713631-1066-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock_paravirt.h | 12 +++---------
 kernel/locking/qspinlock_stat.h     |  4 ++--
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 429c3dc..3acf16d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -288,12 +288,10 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 {
 	struct pv_node *pn = (struct pv_node *)node;
 	struct pv_node *pp = (struct pv_node *)prev;
-	int waitcnt = 0;
 	int loop;
 	bool wait_early;
 
-	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
-	for (;; waitcnt++) {
+	for (;;) {
 		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
 				return;
@@ -317,7 +315,6 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 
 		if (!READ_ONCE(node->locked)) {
 			qstat_inc(qstat_pv_wait_node, true);
-			qstat_inc(qstat_pv_wait_again, waitcnt);
 			qstat_inc(qstat_pv_wait_early, wait_early);
 			pv_wait(&pn->state, vcpu_halted);
 		}
@@ -458,12 +455,9 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 		pv_wait(&l->locked, _Q_SLOW_VAL);
 
 		/*
-		 * The unlocker should have freed the lock before kicking the
-		 * CPU. So if the lock is still not free, it is a spurious
-		 * wakeup or another vCPU has stolen the lock. The current
-		 * vCPU should spin again.
+		 * Because of lock stealing, the queue head vCPU may not be
+		 * able to acquire the lock before it has to wait again.
 		 */
-		qstat_inc(qstat_pv_spurious_wakeup, READ_ONCE(l->locked));
 	}
 
 	/*
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index b9d0315..eb0a599 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -24,8 +24,8 @@
  *   pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
  *   pv_lock_slowpath	- # of locking operations via the slowpath
  *   pv_lock_stealing	- # of lock stealing operations
- *   pv_spurious_wakeup	- # of spurious wakeups
- *   pv_wait_again	- # of vCPU wait's that happened after a vCPU kick
+ *   pv_spurious_wakeup	- # of spurious wakeups in non-head vCPUs
+ *   pv_wait_again	- # of wait's after a queue head vCPU kick
  *   pv_wait_early	- # of early vCPU wait's
  *   pv_wait_head	- # of vCPU wait's at the queue head
  *   pv_wait_node	- # of vCPU wait's at a non-head queue node

^ permalink raw reply related	[flat|nested] 19+ messages in thread

Thread overview: 19+ messages
2016-05-31 16:53 [PATCH v2 0/5] locking/pvqspinlock: Fix missed PV wakeup & support PPC Waiman Long
2016-05-31 16:53 ` [PATCH v2 1/5] locking/pvstat: Separate wait_again and spurious wakeup stats Waiman Long
2016-08-10 18:07   ` [tip:locking/core] " tip-bot for Waiman Long
2016-05-31 16:53 ` [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem Waiman Long
2016-07-15  8:47   ` Peter Zijlstra
2016-07-15  9:39     ` Pan Xinhui
2016-07-15 10:07       ` Peter Zijlstra
2016-07-15 16:35         ` Peter Zijlstra
2016-07-16  1:16           ` Boqun Feng
2016-07-17 23:07             ` Waiman Long
2016-07-17 23:10               ` Waiman Long
2016-07-17 23:22                 ` Wanpeng Li
2016-07-17 22:52           ` Waiman Long
2016-07-21  6:40           ` xinhui
2016-07-15 20:06         ` Waiman Long
2016-07-15 19:47     ` Waiman Long
2016-05-31 16:53 ` [PATCH v2 3/5] locking/pvqspinlock: Make pv_unhash() atomic Waiman Long
2016-05-31 16:53 ` [PATCH v2 4/5] locking/pvstat: Add stat counter to track _Q_SLOW_VAL race Waiman Long
2016-05-31 16:53 ` [PATCH v2 5/5] locking/pvqspinlock: Add lock holder CPU argument to pv_wait() Waiman Long
