linux-kernel.vger.kernel.org archive mirror

* [PATCH 0/2] locking/pvqspinlock: Fix missed PV wakeup & support PPC
@ 2016-05-26 18:21 Waiman Long
  2016-05-26 18:21 ` [PATCH 1/2] locking/pvqspinlock: Fix missed PV wakeup problem Waiman Long
  2016-05-26 18:21 ` [PATCH 2/2] locking/pvqspinlock: Add lock holder CPU argument to pv_wait() Waiman Long
  0 siblings, 2 replies; 5+ messages in thread
From: Waiman Long @ 2016-05-26 18:21 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Pan Xinhui, Scott J Norton, Douglas Hatch, Waiman Long

Patch 1 fixes the missed PV wakeup problem reported by Pan Xinhui.

Patch 2 adds a new argument to pv_wait() to provide better support
for PPC. This patch has been sent out previously.

Waiman Long (2):
  locking/pvqspinlock: Fix missed PV wakeup problem
  locking/pvqspinlock: Add lock holder CPU argument to pv_wait()

 arch/x86/include/asm/paravirt.h       |    4 +-
 arch/x86/include/asm/paravirt_types.h |    2 +-
 arch/x86/kernel/kvm.c                 |    2 +-
 arch/x86/xen/spinlock.c               |    2 +-
 kernel/locking/qspinlock.c            |    5 +-
 kernel/locking/qspinlock_paravirt.h   |  126 +++++++++++++++++++++++++++------
 kernel/locking/qspinlock_stat.h       |    8 +-
 7 files changed, 117 insertions(+), 32 deletions(-)

* [PATCH 1/2] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-05-26 18:21 [PATCH 0/2] locking/pvqspinlock: Fix missed PV wakeup & support PPC Waiman Long
@ 2016-05-26 18:21 ` Waiman Long
  2016-05-27  7:43   ` Boqun Feng
  2016-05-26 18:21 ` [PATCH 2/2] locking/pvqspinlock: Add lock holder CPU argument to pv_wait() Waiman Long
  1 sibling, 1 reply; 5+ messages in thread
From: Waiman Long @ 2016-05-26 18:21 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Pan Xinhui, Scott J Norton, Douglas Hatch, Waiman Long

Currently, calling pv_hash() and setting _Q_SLOW_VAL is only
done once for any pv_node. It is either in pv_kick_node() or in
pv_wait_head_or_lock(). Because of lock stealing, a pv_kick'ed node is
not guaranteed to get the lock before the spinning threshold expires
and has to call pv_wait() again. As a result, the new lock holder
won't see _Q_SLOW_VAL and so won't wake up the sleeping vCPU.
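
For reference, here is a minimal sketch of the unlock-side logic that
the missed wakeup hinges on (paraphrased from qspinlock_paravirt.h,
not an exact copy; treat names and details as approximate):

	/*
	 * If the waiter went back to pv_wait() without _Q_SLOW_VAL being
	 * set, the unlocker takes the early return below and never calls
	 * pv_kick(), which is the missed wakeup described above.
	 */
	static void pv_unlock_sketch(struct qspinlock *lock)
	{
		struct __qspinlock *l = (void *)lock;
		u8 locked;

		/* Fast path: plain locked -> unlocked, no hash lookup */
		locked = cmpxchg_release(&l->locked, _Q_LOCKED_VAL, 0);
		if (likely(locked == _Q_LOCKED_VAL))
			return;		/* no _Q_SLOW_VAL => no wakeup */

		/* Slow path: unhash the waiter node and kick its vCPU */
		pv_kick(pv_unhash(lock)->cpu);
	}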

This patch fixes this missed PV wakeup problem by allowing multiple
_Q_SLOW_VAL settings within pv_wait_head_or_lock() and matching each
successful setting of _Q_SLOW_VAL to a pv_hash() call.

Reported-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock_paravirt.h |   48 ++++++++++++++++++++++------------
 1 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 21ede57..452d06d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -369,12 +369,16 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	/*
 	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
 	 *
-	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
-	 * the hash table later on at unlock time, no atomic instruction is
-	 * needed.
+	 * It is very unlikely that this will race with the _Q_SLOW_VAL setting
+	 * in pv_wait_head_or_lock(). However, we use cmpxchg() here to be
+	 * sure that we won't do a double pv_hash().
+	 *
+	 * As it is the lock holder, it won't race with
+	 * __pv_queued_spin_unlock().
 	 */
-	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
-	(void)pv_hash(lock, pn);
+	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)
+			== _Q_LOCKED_VAL))
+		pv_hash(lock, pn);
 }
 
 /*
@@ -389,18 +393,10 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
 	struct __qspinlock *l = (void *)lock;
-	struct qspinlock **lp = NULL;
 	int waitcnt = 0;
 	int loop;
 
 	/*
-	 * If pv_kick_node() already advanced our state, we don't need to
-	 * insert ourselves into the hash table anymore.
-	 */
-	if (READ_ONCE(pn->state) == vcpu_hashed)
-		lp = (struct qspinlock **)1;
-
-	/*
 	 * Tracking # of slowpath locking operations
 	 */
 	qstat_inc(qstat_pv_lock_slowpath, true);
@@ -422,11 +418,19 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 				goto gotlock;
 			cpu_relax();
 		}
-		clear_pending(lock);
 
+		/*
+		 * Make sure the lock value check below is executed after
+		 * all the previous loads.
+		 */
+		smp_rmb();
 
-		if (!lp) { /* ONCE */
-			lp = pv_hash(lock, pn);
+		/*
+		 * Set _Q_SLOW_VAL and hash the PV node, if necessary.
+		 */
+		if (READ_ONCE(l->locked) != _Q_SLOW_VAL) {
+			struct qspinlock **lp = pv_hash(lock, pn);
+			u8 locked;
 
 			/*
 			 * We must hash before setting _Q_SLOW_VAL, such that
@@ -439,7 +443,8 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
-			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
+			locked = xchg(&l->locked, _Q_SLOW_VAL);
+			if (locked == 0) {
 				/*
 				 * The lock was free and now we own the lock.
 				 * Change the lock value back to _Q_LOCKED_VAL
@@ -447,9 +452,18 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 				 */
 				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
 				WRITE_ONCE(*lp, NULL);
+				clear_pending(lock);
 				goto gotlock;
+			} else if (unlikely(locked == _Q_SLOW_VAL)) {
+				/*
+				 * Racing with pv_kick_node(), need to undo
+				 * the pv_hash().
+				 */
+				WRITE_ONCE(*lp, NULL);
 			}
 		}
+		clear_pending(lock);	/* Enable lock stealing */
+
 		WRITE_ONCE(pn->state, vcpu_halted);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
-- 
1.7.1

* [PATCH 2/2] locking/pvqspinlock: Add lock holder CPU argument to pv_wait()
  2016-05-26 18:21 [PATCH 0/2] locking/pvqspinlock: Fix missed PV wakeup & support PPC Waiman Long
  2016-05-26 18:21 ` [PATCH 1/2] locking/pvqspinlock: Fix missed PV wakeup problem Waiman Long
@ 2016-05-26 18:21 ` Waiman Long
  1 sibling, 0 replies; 5+ messages in thread
From: Waiman Long @ 2016-05-26 18:21 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Pan Xinhui, Scott J Norton, Douglas Hatch, Waiman Long

Pan Xinhui asked for a lock holder CPU argument in pv_wait() to help
with porting pvqspinlock to PPC. The new argument can help the
hypervisor expedite execution of the critical section by the lock
holder, if its vCPU isn't currently running, so that it can release
the lock sooner.

The pv_wait() function is now extended with a third argument that
holds the vCPU number of the lock holder, if known. Architecture-specific
hypervisor code can then make use of that information to make a better
decision about which vCPU should be running next.
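
As an illustration only, a hypothetical architecture backend might
consume the new argument roughly as follows. hyp_yield_to_vcpu() and
hyp_halt_vcpu() are made-up placeholders for whatever directed-yield
and halt primitives a given hypervisor actually provides:

	/* Hypothetical backend sketch -- not part of this patch series */
	static void arch_pv_wait(u8 *ptr, u8 val, int lockcpu)
	{
		/* Re-check the wait condition before giving up the CPU */
		if (READ_ONCE(*ptr) != val)
			return;

		if (lockcpu >= 0)
			/* Donate our time slice to the likely lock holder */
			hyp_yield_to_vcpu(lockcpu);
		else
			/* Holder unknown: just halt until pv_kick() */
			hyp_halt_vcpu();
	}

A lockcpu value of -1 (as passed by pv_wait_node() when the hash
lookup fails) means the lock holder is unknown.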

This patch also adds a new field to the pv_node structure to hold
the vCPU number of the previous queue entry. For the queue head,
the prev_cpu entry is likely to be that of the lock holder, though
lock stealing may render this information inaccurate.

In pv_wait_head_or_lock(), pv_wait() will now be called with that
vCPU number as it is likely to be the lock holder.

In pv_wait_node(), the newly added pv_lookup_hash() function will
be called to look up the queue head and pass in the lock holder vCPU
number stored there, if found.

Suggested-by: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/include/asm/paravirt.h       |    4 +-
 arch/x86/include/asm/paravirt_types.h |    2 +-
 arch/x86/kernel/kvm.c                 |    2 +-
 arch/x86/xen/spinlock.c               |    2 +-
 kernel/locking/qspinlock.c            |    5 +-
 kernel/locking/qspinlock_paravirt.h   |   78 +++++++++++++++++++++++++++++++--
 kernel/locking/qspinlock_stat.h       |    8 ++--
 7 files changed, 86 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 2970d22..98f8764 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -674,9 +674,9 @@ static __always_inline void pv_queued_spin_unlock(struct qspinlock *lock)
 	PVOP_VCALLEE1(pv_lock_ops.queued_spin_unlock, lock);
 }
 
-static __always_inline void pv_wait(u8 *ptr, u8 val)
+static __always_inline void pv_wait(u8 *ptr, u8 val, int lockcpu)
 {
-	PVOP_VCALL2(pv_lock_ops.wait, ptr, val);
+	PVOP_VCALL3(pv_lock_ops.wait, ptr, val, lockcpu);
 }
 
 static __always_inline void pv_kick(int cpu)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7fa9e77..300f1d4 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -312,7 +312,7 @@ struct pv_lock_ops {
 	void (*queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
 	struct paravirt_callee_save queued_spin_unlock;
 
-	void (*wait)(u8 *ptr, u8 val);
+	void (*wait)(u8 *ptr, u8 val, int lockcpu);
 	void (*kick)(int cpu);
 #else /* !CONFIG_QUEUED_SPINLOCKS */
 	struct paravirt_callee_save lock_spinning;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index eea2a6f..543271e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -582,7 +582,7 @@ static void kvm_kick_cpu(int cpu)
 
 #include <asm/qspinlock.h>
 
-static void kvm_wait(u8 *ptr, u8 val)
+static void kvm_wait(u8 *ptr, u8 val, int lockcpu)
 {
 	unsigned long flags;
 
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index f42e78d..9150e62 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -39,7 +39,7 @@ static void xen_qlock_kick(int cpu)
 /*
  * Halt the current CPU & release it back to the host
  */
-static void xen_qlock_wait(u8 *byte, u8 val)
+static void xen_qlock_wait(u8 *byte, u8 val, int lockcpu)
 {
 	int irq = __this_cpu_read(lock_kicker_irq);
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ce2f75e..99f31e4 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -248,7 +248,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
  */
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+static __always_inline void __pv_wait_node(struct qspinlock *lock,
+					   struct mcs_spinlock *node,
 					   struct mcs_spinlock *prev) { }
 static __always_inline void __pv_kick_node(struct qspinlock *lock,
 					   struct mcs_spinlock *node) { }
@@ -407,7 +408,7 @@ queue:
 		prev = decode_tail(old);
 		WRITE_ONCE(prev->next, node);
 
-		pv_wait_node(node, prev);
+		pv_wait_node(lock, node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);
 
 		/*
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 452d06d..b04d977 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -51,6 +51,7 @@ struct pv_node {
 	struct mcs_spinlock	__res[3];
 
 	int			cpu;
+	int			prev_cpu; /* vCPU # of previous queue entry */
 	u8			state;
 };
 
@@ -158,7 +159,6 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
  * Since we should not be holding locks from NMI context (very rare indeed) the
  * max load factor is 0.75, which is around the point where open addressing
  * breaks down.
- *
  */
 struct pv_hash_entry {
 	struct qspinlock *lock;
@@ -251,6 +251,51 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
 }
 
 /*
+ * Look up the given lock in the hash table
+ * Return the pv_node if found, NULL otherwise
+ *
+ * This function searches for an entry that contains the given lock
+ * without doing any unhashing. Since the required entry may not be in
+ * the hash table at all, the search is bounded: it gives up when a null
+ * entry is found after at least a cacheline worth of entries has been
+ * scanned, or when more than 4 cachelines of entries have been searched.
+ *
+ * __ARCH_NEED_PV_HASH_LOOKUP should be set in the asm/qspinlock.h file for
+ * architectures that want to do the hash table lookup in pv_wait_node().
+ */
+#ifdef __ARCH_NEED_PV_HASH_LOOKUP
+static struct pv_node *pv_lookup_hash(struct qspinlock *lock)
+{
+	unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
+	struct pv_hash_entry *he;
+	int idx = 0;
+
+	for_each_hash_entry(he, offset, hash) {
+		struct qspinlock *l = READ_ONCE(he->lock);
+
+		if (l == lock)
+			return READ_ONCE(he->node);
+		else if (++idx < PV_HE_PER_LINE)
+			/* Search at least a cacheline of hash entries */
+			continue;
+		/*
+		 * Stop searching when there is an empty slot or more than
+		 * 4 cachelines of entries are searched. This limits the cost
+		 * of doing the lookup.
+		 */
+		if (!l || (idx >= 4 * PV_HE_PER_LINE))
+			return NULL;
+	}
+	return NULL;
+}
+#else
+static inline struct pv_node *pv_lookup_hash(struct qspinlock *lock)
+{
+	return NULL;
+}
+#endif
+
+/*
  * Return true if when it is time to check the previous node which is not
  * in a running state.
  */
@@ -274,6 +319,7 @@ static void pv_init_node(struct mcs_spinlock *node)
 	BUILD_BUG_ON(sizeof(struct pv_node) > 5*sizeof(struct mcs_spinlock));
 
 	pn->cpu = smp_processor_id();
+	pn->prev_cpu = -1;
 	pn->state = vcpu_running;
 }
 
@@ -282,7 +328,8 @@ static void pv_init_node(struct mcs_spinlock *node)
  * pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
  * behalf.
  */
-static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
+static void pv_wait_node(struct qspinlock *lock, struct mcs_spinlock *node,
+			 struct mcs_spinlock *prev)
 {
 	struct pv_node *pn = (struct pv_node *)node;
 	struct pv_node *pp = (struct pv_node *)prev;
@@ -290,6 +337,8 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 	int loop;
 	bool wait_early;
 
+	pn->prev_cpu = pp->cpu;	/* Save vCPU # of previous queue entry */
+
 	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
 	for (;; waitcnt++) {
 		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
@@ -314,10 +363,23 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 		smp_store_mb(pn->state, vcpu_halted);
 
 		if (!READ_ONCE(node->locked)) {
+			struct pv_node *ph;
+
 			qstat_inc(qstat_pv_wait_node, true);
 			qstat_inc(qstat_pv_wait_again, waitcnt);
 			qstat_inc(qstat_pv_wait_early, wait_early);
-			pv_wait(&pn->state, vcpu_halted);
+
+			/*
+			 * If the current queue head is in the hash table,
+			 * the prev_cpu field of its pv_node may contain the
+			 * vCPU # of the lock holder. However, lock stealing
+			 * may make that information inaccurate. Anyway, we
+			 * look up the hash table to try to get the lock
+			 * holder vCPU number.
+			 */
+			ph = pv_lookup_hash(lock);
+			pv_wait(&pn->state, vcpu_halted,
+				ph ? ph->prev_cpu : -1);
 		}
 
 		/*
@@ -467,7 +529,15 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 		WRITE_ONCE(pn->state, vcpu_halted);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
-		pv_wait(&l->locked, _Q_SLOW_VAL);
+
+		/*
+		 * Pass in the previous node vCPU number which is likely to be
+		 * the lock holder vCPU. This additional information may help
+		 * the hypervisor to give more resource to that vCPU so that
+		 * it can release the lock faster. With lock stealing,
+		 * however, that vCPU may not be the actual lock holder.
+		 */
+		pv_wait(&l->locked, _Q_SLOW_VAL, pn->prev_cpu);
 
 		/*
 		 * The unlocker should have freed the lock before kicking the
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 22e0253..6676efe 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -272,12 +272,12 @@ static inline void __pv_kick(int cpu)
 /*
  * Replacement function for pv_wait()
  */
-static inline void __pv_wait(u8 *ptr, u8 val)
+static inline void __pv_wait(u8 *ptr, u8 val, int lockcpu)
 {
 	u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
 
 	*pkick_time = 0;
-	pv_wait(ptr, val);
+	pv_wait(ptr, val, lockcpu);
 	if (*pkick_time) {
 		this_cpu_add(qstats[qstat_pv_latency_wake],
 			     sched_clock() - *pkick_time);
@@ -285,8 +285,8 @@ static inline void __pv_wait(u8 *ptr, u8 val)
 	}
 }
 
-#define pv_kick(c)	__pv_kick(c)
-#define pv_wait(p, v)	__pv_wait(p, v)
+#define pv_kick(c)		__pv_kick(c)
+#define pv_wait(p, v, c)	__pv_wait(p, v, c)
 
 #else /* CONFIG_QUEUED_LOCK_STAT */
 
-- 
1.7.1

* Re: [PATCH 1/2] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-05-26 18:21 ` [PATCH 1/2] locking/pvqspinlock: Fix missed PV wakeup problem Waiman Long
@ 2016-05-27  7:43   ` Boqun Feng
  2016-05-27 19:28     ` Waiman Long
  0 siblings, 1 reply; 5+ messages in thread
From: Boqun Feng @ 2016-05-27  7:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pan Xinhui,
	Scott J Norton, Douglas Hatch

Hi Waiman,

On Thu, May 26, 2016 at 02:21:57PM -0400, Waiman Long wrote:
> Currently, calling pv_hash() and setting _Q_SLOW_VAL is only
> done once for any pv_node. It is either in pv_kick_node() or in
> pv_wait_head_or_lock(). Because of lock stealing, a pv_kick'ed node is
> not guaranteed to get the lock before the spinning threshold expires
> and has to call pv_wait() again. As a result, the new lock holder
> won't see _Q_SLOW_VAL and so won't wake up the sleeping vCPU.
> 
> This patch fixes this missed PV wakeup problem by allowing multiple
> _Q_SLOW_VAL settings within pv_wait_head_or_lock() and matching each
> successful setting of _Q_SLOW_VAL to a pv_hash() call.
> 
> Reported-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  kernel/locking/qspinlock_paravirt.h |   48 ++++++++++++++++++++++------------
>  1 files changed, 31 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index 21ede57..452d06d 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -369,12 +369,16 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>  	/*
>  	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
>  	 *
> -	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
> -	 * the hash table later on at unlock time, no atomic instruction is
> -	 * needed.
> +	 * It is very unlikely that this will race with the _Q_SLOW_VAL setting
> +	 * in pv_wait_head_or_lock(). However, we use cmpxchg() here to be
> +	 * sure that we won't do a double pv_hash().
> +	 *
> +	 * As it is the lock holder, it won't race with
> +	 * __pv_queued_spin_unlock().
>  	 */
> -	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> -	(void)pv_hash(lock, pn);
> +	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)
> +			== _Q_LOCKED_VAL))
> +		pv_hash(lock, pn);
>  }
>  
>  /*
> @@ -389,18 +393,10 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>  {
>  	struct pv_node *pn = (struct pv_node *)node;
>  	struct __qspinlock *l = (void *)lock;
> -	struct qspinlock **lp = NULL;
>  	int waitcnt = 0;
>  	int loop;
>  
>  	/*
> -	 * If pv_kick_node() already advanced our state, we don't need to
> -	 * insert ourselves into the hash table anymore.
> -	 */
> -	if (READ_ONCE(pn->state) == vcpu_hashed)
> -		lp = (struct qspinlock **)1;
> -
> -	/*
>  	 * Tracking # of slowpath locking operations
>  	 */
>  	qstat_inc(qstat_pv_lock_slowpath, true);
> @@ -422,11 +418,19 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>  				goto gotlock;
>  			cpu_relax();
>  		}
> -		clear_pending(lock);
>  
> +		/*
> +		 * Make sure the lock value check below is executed after
> +		 * all the previous loads.
> +		 */
> +		smp_rmb();
>  
> -		if (!lp) { /* ONCE */
> -			lp = pv_hash(lock, pn);
> +		/*
> +		 * Set _Q_SLOW_VAL and hash the PV node, if necessary.
> +		 */
> +		if (READ_ONCE(l->locked) != _Q_SLOW_VAL) {
> +			struct qspinlock **lp = pv_hash(lock, pn);
> +			u8 locked;
>  

Just out of curiosity, what if the following sequence happens:

	CPU 0			CPU 1
	=================	====================
	spin_lock():		spin_lock():
	  pv_kick_node(): 	  pv_wait_head_or_lock():
	  			  if (READ_ONCE(l->locked) != _Q_SLOW_VAL) { // True
				    pv_hash();

	    cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
	    pv_hash();
				    locked = xchg(&l->locked, _Q_SLOW_VAL);
	do_something();		    if(...) {
				    }
	spin_unlock():
	  pv_unhash();
				    else if (unlikely(locked == _Q_SLOW_VAL)) {
				    	WRITE_ONCE(*lp, NULL);

Because pv_hash() on CPU 1 is called before the one on CPU 0, the
hash entry from CPU 1 precedes the hash entry from CPU 0 in the hash
table. So when pv_unhash() is called, the hash entry from CPU 1 will
be unhashed; however, the WRITE_ONCE(*lp, NULL) on CPU 1 will also
unhash that same entry, leaving the hash entry from CPU 0 still hashed.

This could result in several interesting problems, right? ;-)

Am I missing something here?

Regards,
Boqun

>  			/*
>  			 * We must hash before setting _Q_SLOW_VAL, such that
> @@ -439,7 +443,8 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>  			 *
>  			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
>  			 */
> -			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
> +			locked = xchg(&l->locked, _Q_SLOW_VAL);
> +			if (locked == 0) {
>  				/*
>  				 * The lock was free and now we own the lock.
>  				 * Change the lock value back to _Q_LOCKED_VAL
> @@ -447,9 +452,18 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>  				 */
>  				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
>  				WRITE_ONCE(*lp, NULL);
> +				clear_pending(lock);
>  				goto gotlock;
> +			} else if (unlikely(locked == _Q_SLOW_VAL)) {
> +				/*
> +				 * Racing with pv_kick_node(), need to undo
> +				 * the pv_hash().
> +				 */
> +				WRITE_ONCE(*lp, NULL);
>  			}
>  		}
> +		clear_pending(lock);	/* Enable lock stealing */
> +
>  		WRITE_ONCE(pn->state, vcpu_halted);
>  		qstat_inc(qstat_pv_wait_head, true);
>  		qstat_inc(qstat_pv_wait_again, waitcnt);
> -- 
> 1.7.1
> 

* Re: [PATCH 1/2] locking/pvqspinlock: Fix missed PV wakeup problem
  2016-05-27  7:43   ` Boqun Feng
@ 2016-05-27 19:28     ` Waiman Long
  0 siblings, 0 replies; 5+ messages in thread
From: Waiman Long @ 2016-05-27 19:28 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Pan Xinhui,
	Scott J Norton, Douglas Hatch

On 05/27/2016 03:43 AM, Boqun Feng wrote:
> Hi Waiman,
>
> On Thu, May 26, 2016 at 02:21:57PM -0400, Waiman Long wrote:
>> Currently, calling pv_hash() and setting _Q_SLOW_VAL is only
>> done once for any pv_node. It is either in pv_kick_node() or in
>> pv_wait_head_or_lock(). Because of lock stealing, a pv_kick'ed node is
>> not guaranteed to get the lock before the spinning threshold expires
>> and has to call pv_wait() again. As a result, the new lock holder
>> won't see _Q_SLOW_VAL and so won't wake up the sleeping vCPU.
>>
>> This patch fixes this missed PV wakeup problem by allowing multiple
>> _Q_SLOW_VAL settings within pv_wait_head_or_lock() and matching each
>> successful setting of _Q_SLOW_VAL to a pv_hash() call.
>>
>> Reported-by: Pan Xinhui<xinhui.pan@linux.vnet.ibm.com>
>> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
>> ---
>>   kernel/locking/qspinlock_paravirt.h |   48 ++++++++++++++++++++++------------
>>   1 files changed, 31 insertions(+), 17 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
>> index 21ede57..452d06d 100644
>> --- a/kernel/locking/qspinlock_paravirt.h
>> +++ b/kernel/locking/qspinlock_paravirt.h
>> @@ -369,12 +369,16 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>   	/*
>>   	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
>>   	 *
>> -	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
>> -	 * the hash table later on at unlock time, no atomic instruction is
>> -	 * needed.
>> +	 * It is very unlikely that this will race with the _Q_SLOW_VAL setting
>> +	 * in pv_wait_head_or_lock(). However, we use cmpxchg() here to be
>> +	 * sure that we won't do a double pv_hash().
>> +	 *
>> +	 * As it is the lock holder, it won't race with
>> +	 * __pv_queued_spin_unlock().
>>   	 */
>> -	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>> -	(void)pv_hash(lock, pn);
>> +	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)
>> +			== _Q_LOCKED_VAL))
>> +		pv_hash(lock, pn);
>>   }
>>
>>   /*
>> @@ -389,18 +393,10 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>>   {
>>   	struct pv_node *pn = (struct pv_node *)node;
>>   	struct __qspinlock *l = (void *)lock;
>> -	struct qspinlock **lp = NULL;
>>   	int waitcnt = 0;
>>   	int loop;
>>
>>   	/*
>> -	 * If pv_kick_node() already advanced our state, we don't need to
>> -	 * insert ourselves into the hash table anymore.
>> -	 */
>> -	if (READ_ONCE(pn->state) == vcpu_hashed)
>> -		lp = (struct qspinlock **)1;
>> -
>> -	/*
>>   	 * Tracking # of slowpath locking operations
>>   	 */
>>   	qstat_inc(qstat_pv_lock_slowpath, true);
>> @@ -422,11 +418,19 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>>   				goto gotlock;
>>   			cpu_relax();
>>   		}
>> -		clear_pending(lock);
>>
>> +		/*
>> +		 * Make sure the lock value check below is executed after
>> +		 * all the previous loads.
>> +		 */
>> +		smp_rmb();
>>
>> -		if (!lp) { /* ONCE */
>> -			lp = pv_hash(lock, pn);
>> +		/*
>> +		 * Set _Q_SLOW_VAL and hash the PV node, if necessary.
>> +		 */
>> +		if (READ_ONCE(l->locked) != _Q_SLOW_VAL) {
>> +			struct qspinlock **lp = pv_hash(lock, pn);
>> +			u8 locked;
>>
> Just out of curiosity, what if the following sequence happens:
>
> 	CPU 0			CPU 1
> 	=================	====================
> 	spin_lock():		spin_lock():
> 	  pv_kick_node(): 	  pv_wait_head_or_lock():
> 	  			  if (READ_ONCE(l->locked) != _Q_SLOW_VAL) { // True
> 				    pv_hash();
>
> 	    cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
> 	    pv_hash();
> 				    locked = xchg(&l->locked, _Q_SLOW_VAL);
> 	do_something();		    if(...) {
> 				    }
> 	spin_unlock():
> 	  pv_unhash();
> 				    else if (unlikely(locked == _Q_SLOW_VAL)) {
> 				    	WRITE_ONCE(*lp, NULL);
>
> Because pv_hash() on CPU 1 is called before the one on CPU 0, the
> hash entry from CPU 1 precedes the hash entry from CPU 0 in the hash
> table. So when pv_unhash() is called, the hash entry from CPU 1 will
> be unhashed; however, the WRITE_ONCE(*lp, NULL) on CPU 1 will also
> unhash that same entry, leaving the hash entry from CPU 0 still hashed.
>
> This could result in several interesting problems, right? ;-)

This is a very unlikely scenario, but I agree that it can happen. I
think the only way to close this loophole is to make pv_unhash()
atomic. Using pv_unhash() instead of WRITE_ONCE() should fix this
issue. I will send out an updated patch to fix that.
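
Just to illustrate the direction (a sketch only, not the actual
follow-up patch), the racing branch in pv_wait_head_or_lock() would
then become something like:

		} else if (unlikely(locked == _Q_SLOW_VAL)) {
			/*
			 * Racing with pv_kick_node(): use an atomic
			 * pv_unhash() so that two concurrent removals
			 * each take out one of the two hashed entries
			 * instead of clearing the same slot twice.
			 */
			(void)pv_unhash(lock);
		}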

Cheers,
Longman

Thread overview: 5+ messages
2016-05-26 18:21 [PATCH 0/2] locking/pvqspinlock: Fix missed PV wakeup & support PPC Waiman Long
2016-05-26 18:21 ` [PATCH 1/2] locking/pvqspinlock: Fix missed PV wakeup problem Waiman Long
2016-05-27  7:43   ` Boqun Feng
2016-05-27 19:28     ` Waiman Long
2016-05-26 18:21 ` [PATCH 2/2] locking/pvqspinlock: Add lock holder CPU argument to pv_wait() Waiman Long
