* [PATCH 0/5] locking/qspinlock: Safely handle > 4 nesting levels
@ 2019-01-21  2:49 Waiman Long
  2019-01-21  2:49 ` [PATCH 1/5] " Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Waiman Long @ 2019-01-21  2:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, linux-arch, x86, Zhenzhong Duan, James Morse,
	SRINIVAS, Waiman Long

My first thought on making qspinlocks handle more than 4 slowpath
nesting levels was to use lock stealing when no more MCS nodes are
available. That is easy for PV qspinlocks, as lock stealing is already
supported there. For native qspinlocks, however, we would have to make
setting the locked bit an atomic operation, which adds to the slowpath
lock acquisition latency. Using my locking microbenchmark, I saw up to
a 10% reduction in locking throughput in some cases.
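
As a rough sketch of the difference (illustrative only; set_locked_steal()
is a hypothetical name, not part of this series): the native queue head
currently claims the lock with a plain byte store, whereas allowing lock
stealing would require an atomic RMW on the locked byte, similar to what
the PV unfair trylock already does:

	/* Current native code: nobody else may write the locked byte here. */
	static __always_inline void set_locked(struct qspinlock *lock)
	{
		WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
	}

	/*
	 * Hypothetical variant if lock stealing were allowed: a stealer can
	 * race with the queue head, so an atomic cmpxchg would be needed.
	 */
	static __always_inline bool set_locked_steal(struct qspinlock *lock)
	{
		return cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0;
	}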

So we need a different technique to allow more than 4 slowpath nesting
levels without introducing any noticeable performance degradation for
native qspinlocks. I settled on adding a new waiting bit to the lock
word that allows a CPU which has run out of percpu MCS nodes to insert
itself into the waiting queue, using that bit for synchronization. See
patch 1 for the details of how this works.

Patches 2-4 enhance the locking statistics code to track the new code
path and enable the statistics on other architectures such as ARM64.

Patch 5 is optional; it adds some debug code for testing purposes.

By setting MAX_NODES to 1, the new code path gets exercised during the
boot process, as demonstrated by the stat counter values shown below
from a 1-socket 22-core 44-thread x86-64 system after booting up the
new kernel.
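
(For reference, the MAX_NODES=1 test hack is roughly the following tweak
in kernel/locking/qspinlock.c; illustrative only, not part of the posted
patches:)

	/*
	 * Test-only tweak: leave a single percpu MCS node so that any
	 * nested slowpath acquisition has to take the new no-node path.
	 */
	#define MAX_NODES	1	/* normally 4 */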

  lock_no_node=34
  lock_pending=30027
  lock_slowpath=173174
  lock_waiting=8

The new kernel was booted up a dozen times without seeing any problem.

A similar bootup test was done on a 2-socket 56-core 224-thread ARM64
system, with the following stat counter values:

  lock_no_node=21
  lock_pending=70245
  lock_slowpath=132703
  lock_waiting=3

No problem was seen on the ARM64 system with the new kernel. Two-level
spinlock slowpath nesting happens less frequently on the ARM64 system
than on the x86-64 system.

Waiman Long (5):
  locking/qspinlock: Safely handle > 4 nesting levels
  locking/qspinlock_stat: Track the no MCS node available case
  locking/qspinlock_stat: Separate out the PV specific stat counts
  locking/qspinlock_stat: Allow QUEUED_LOCK_STAT for all archs
  locking/qspinlock: Add some locking debug code

 arch/Kconfig                          |   7 ++
 arch/x86/Kconfig                      |   8 --
 include/asm-generic/qspinlock_types.h |  41 +++++--
 kernel/locking/qspinlock.c            | 212 +++++++++++++++++++++++++++++++---
 kernel/locking/qspinlock_paravirt.h   |  30 ++++-
 kernel/locking/qspinlock_stat.h       | 153 +++++++++++++++---------
 6 files changed, 362 insertions(+), 89 deletions(-)

-- 
1.8.3.1



* [PATCH 1/5] locking/qspinlock: Safely handle > 4 nesting levels
  2019-01-21  2:49 [PATCH 0/5] locking/qspinlock: Safely handle > 4 nesting levels Waiman Long
@ 2019-01-21  2:49 ` Waiman Long
  2019-01-21  9:12   ` Peter Zijlstra
  2019-01-21  2:49 ` [PATCH 2/5] locking/qspinlock_stat: Track the no MCS node available case Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2019-01-21  2:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, linux-arch, x86, Zhenzhong Duan, James Morse,
	SRINIVAS, Waiman Long

Four queue nodes per CPU are allocated to enable up to 4 nesting levels
using the per-cpu nodes. Nested NMIs are possible on some architectures.
Still, it is very unlikely that we will ever hit more than 4 nesting
levels with contention in the slowpath.

When that rare condition happens, however, the system is likely to hang
or crash shortly afterwards. That is not acceptable, so this exceptional
case needs to be handled.

On bare metal systems, a new acquire_lock_no_node() function is added
to deal with this case. The pending bit, together with a new waiting
bit, is overloaded to serve a special function as noted below. The
special tail value _Q_TAIL_WAITING is used as a flag by
acquire_lock_no_node() to signal the next CPU to spin on the waiting
bit, which serves as a separator for the now disjointed queue. Please
see the comments in the code for the details.

On virtual machines, a new and simpler pv_acquire_lock_no_node()
function is used to directly steal the lock, as lock stealing is allowed
in the PV locking path.

By doing so, we are able to support arbitrary nesting levels with no
noticeable performance degradation for the common case of <= 4 nesting
levels.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/asm-generic/qspinlock_types.h |  41 ++++++--
 kernel/locking/qspinlock.c            | 181 +++++++++++++++++++++++++++++++---
 kernel/locking/qspinlock_paravirt.h   |  30 +++++-
 3 files changed, 226 insertions(+), 26 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index d10f1e7..50602dd 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -68,18 +68,20 @@
 /*
  * Bitfields in the atomic value:
  *
- * When NR_CPUS < 16K
+ * When NR_CPUS < 16K-1
  *  0- 7: locked byte
  *     8: pending
- *  9-15: not used
+ *  9-14: not used
+ *    15: waiting
  * 16-17: tail index
  * 18-31: tail cpu (+1)
  *
- * When NR_CPUS >= 16K
+ * When NR_CPUS >= 16K-1
  *  0- 7: locked byte
  *     8: pending
- *  9-10: tail index
- * 11-31: tail cpu (+1)
+ *     9: waiting
+ * 10-11: tail index
+ * 12-31: tail cpu (+1)
  */
 #define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
 				      << _Q_ ## type ## _OFFSET)
@@ -88,14 +90,20 @@
 #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
-#if CONFIG_NR_CPUS < (1U << 14)
-#define _Q_PENDING_BITS		8
+#if CONFIG_NR_CPUS < (1U << 14) - 1
+#define _Q_PENDING_BITS		7
 #else
 #define _Q_PENDING_BITS		1
 #endif
 #define _Q_PENDING_MASK		_Q_SET_MASK(PENDING)
 
-#define _Q_TAIL_IDX_OFFSET	(_Q_PENDING_OFFSET + _Q_PENDING_BITS)
+#define _Q_WAITING_OFFSET	(_Q_PENDING_OFFSET + _Q_PENDING_BITS)
+#define _Q_WAITING_BITS		1
+#define _Q_WAITING_MASK		_Q_SET_MASK(WAITING)
+
+#define _Q_WAIT_PEND_MASK	(_Q_PENDING_MASK | _Q_WAITING_MASK)
+
+#define _Q_TAIL_IDX_OFFSET	(_Q_WAITING_OFFSET + _Q_WAITING_BITS)
 #define _Q_TAIL_IDX_BITS	2
 #define _Q_TAIL_IDX_MASK	_Q_SET_MASK(TAIL_IDX)
 
@@ -109,4 +117,21 @@
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)
 
+/*
+ * The special _Q_WAITING_VAL bit is set to indicate to the next CPU
+ * that it should spin on the waiting bit. It also tells the CPUs
+ * before to ignore the _Q_PENDING_VAL.
+ */
+#define _Q_WAITING_VAL		(1U << _Q_WAITING_OFFSET)
+#define _Q_WAIT_PEND_VAL	(_Q_WAITING_VAL | _Q_PENDING_VAL)
+
+/*
+ * The special _Q_TAIL_WAITING value in tail is used to indicate that
+ * the lock waiter who sees this should spin on the waiting bit instead
+ * and will become queue head once that bit is cleared. This also means
+ * one less cpu is supported in the max NR_CPUS for the more efficient
+ * qspinlock code.
+ */
+#define _Q_TAIL_WAITING	_Q_TAIL_MASK
+
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 8a8c3c2..5bb06df 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -56,13 +56,16 @@
  *
  * Since a spinlock disables recursion of its own context and there is a limit
  * to the contexts that can nest; namely: task, softirq, hardirq, nmi. As there
- * are at most 4 nesting levels, it can be encoded by a 2-bit number. Now
+ * should be at most 4 nesting levels, it can be encoded by a 2-bit number. Now
  * we can encode the tail by combining the 2-bit nesting level with the cpu
  * number. With one byte for the lock value and 3 bytes for the tail, only a
  * 32-bit word is now needed. Even though we only need 1 bit for the lock,
  * we extend it to a full byte to achieve better performance for architectures
  * that support atomic byte write.
  *
+ * If NMI nesting can happen, there is special code to handle more than
+ * 4 nesting levels.
+ *
  * We also change the first spinner to spin on the lock bit instead of its
  * node; whereby avoiding the need to carry a node from lock to unlock, and
  * preserving existing lock API. This also makes the unlock code simpler and
@@ -106,12 +109,15 @@ struct qnode {
 #endif
 
 /*
- * Per-CPU queue node structures; we can never have more than 4 nested
+ * Per-CPU queue node structures; we should not have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
  *
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
  *
  * PV doubles the storage and uses the second cacheline for PV state.
+ *
+ * In the rare case that more than 4 nesting levels are needed, special
+ * code is used to handle that without using any percpu queue node.
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
@@ -147,9 +153,16 @@ struct mcs_spinlock *grab_mcs_node(struct mcs_spinlock *base, int idx)
 	return &((struct qnode *)base + idx)->mcs;
 }
 
-#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+#define _Q_LOCKED_WAIT_PEND_MASK (_Q_LOCKED_MASK | _Q_WAIT_PEND_MASK)
+
+#if _Q_PENDING_BITS > 1
+/*
+ * Note: Both clear_pending() and clear_pending_set_locked() here have
+ * the side effect of clearing the waiting bit. However, both functions
+ * are used by the pending locker only which will not interact with a
+ * waiting waiter and so they shouldn't cause any problem.
+ */
 
-#if _Q_PENDING_BITS == 8
 /**
  * clear_pending - clear the pending bit.
  * @lock: Pointer to queued spinlock structure
@@ -194,7 +207,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
-#else /* _Q_PENDING_BITS == 8 */
+#else /* _Q_PENDING_BITS > 1 */
 
 /**
  * clear_pending - clear the pending bit.
@@ -208,7 +221,7 @@ static __always_inline void clear_pending(struct qspinlock *lock)
 }
 
 /**
- * clear_pending_set_locked - take ownership and clear the pending bit.
+ * clear_pending_set_locked - take ownership and clear waiting/pending bit.
  * @lock: Pointer to queued spinlock structure
  *
  * *,1,0 -> *,0,1
@@ -233,7 +246,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 	u32 old, new, val = atomic_read(&lock->val);
 
 	for (;;) {
-		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
+		new = (val & _Q_LOCKED_WAIT_PEND_MASK) | tail;
 		/*
 		 * We can use relaxed semantics since the caller ensures that
 		 * the MCS node is properly initialized before updating the
@@ -247,7 +260,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 	}
 	return old;
 }
-#endif /* _Q_PENDING_BITS == 8 */
+#endif /* _Q_PENDING_BITS > 1 */
 
 /**
  * queued_fetch_set_pending_acquire - fetch the whole lock value and set pending
@@ -274,6 +287,119 @@ static __always_inline void set_locked(struct qspinlock *lock)
 	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
 }
 
+/**
+ *  acquire_lock_no_node - acquire lock without MCS node
+ *  @lock: Pointer to queued spinlock structure
+ *
+ *  It is extremely unlikely that this function will ever be called.
+ *  Marking it as noinline to not mess with the slowpath code. This
+ *  function is for native qspinlock only. The PV qspinlock code has
+ *  its own simpler version.
+ *
+ *  -----          -----  |   ----      -----          -----
+ * |Tail2| <- ... |Head2| |  |Node| <- |Tail1| <- ... |Head1|
+ *  -----          -----  |   ----      -----          -----
+ *                   |                                   |
+ *                   V                                   V
+ *             Spin on waiting                     Spin on locked
+ *
+ * The waiting and the pending bits will be acquired first which are now
+ * used as a separator for the disjointed queue shown above.
+ *
+ * The current CPU will then be inserted into the queue by placing a special
+ * _Q_TAIL_WAITING value into the tail and making the current tail
+ * point to its own local node. The next incoming CPU will see the special
+ * tail, but it has no way to find the node. Instead, it will spin on the
+ * waiting bit. When that bit is cleared, it means that all the
+ * previous CPUs in the queue are gone and the current CPU is the new lock
+ * holder. That will signal the next CPU (head2) to become the new queue
+ * head and spin on the locked flag.
+ *
+ * This will cause more lock cacheline contention, but performance is
+ * not a concern in this rarely used function.
+ *
+ * The handshake between waiting waiter CPU and the one that sets
+ * waiting bit is as follows:
+ *
+ *  setting CPU                          waiting CPU
+ *  -----------                          -----------
+ *  Set both pending & waiting
+ *  Push _Q_TAIL_WAITING to tail
+ *                                       Get _Q_TAIL_WAITING
+ *                                       Spin on waiting
+ *  If (lock free)
+ *    Set locked=1 && clear waiting
+ *                                       Observe !waiting
+ *                                       Clear pending
+ *                                       Become queue head
+ *
+ * Another no-node CPU can now come in and repeat the cycle.
+ */
+static noinline void acquire_lock_no_node(struct qspinlock *lock)
+{
+	u32 old, new;
+
+	/*
+	 * Acquire both pending and waiting bits first to synchronize
+	 * with both a pending locker and a waiting waiter.
+	 */
+	for (;;) {
+		old = atomic_cond_read_relaxed(&lock->val,
+					      !(VAL & _Q_PENDING_VAL));
+		new = old | _Q_WAIT_PEND_VAL;
+		if (atomic_cmpxchg_acquire(&lock->val, old, new) == old)
+			break;
+	}
+
+	/*
+	 * Put _Q_TAIL_WAITING into tail. The next lock waiter that
+	 * sees it will observe the waiting bit and will wait until the
+	 * waiting bit is cleared before proceeding as the new queue head.
+	 */
+	old = xchg_tail(lock, _Q_TAIL_WAITING);
+	if (old & _Q_TAIL_MASK) {
+		struct mcs_spinlock node, *prev;
+
+		WARN_ON_ONCE((old & _Q_TAIL_MASK) == _Q_TAIL_WAITING);
+		node.locked = 0;
+		prev = decode_tail(old);
+		/*
+		 * Node data needs to be initialized before making the
+		 * previous node point to it.
+		 */
+		smp_store_release(&prev->next, &node);
+		arch_mcs_spin_lock_contended(&node.locked);
+	}
+
+
+	/*
+	 * Acquire the lock, clear the tail (if applicable) and
+	 * the pending bits.
+	 */
+	old = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
+	WARN_ON_ONCE(!(old & _Q_PENDING_MASK));
+
+	if (((old & _Q_TAIL_MASK) == _Q_TAIL_WAITING) &&
+	     atomic_try_cmpxchg_relaxed(&lock->val, &old, _Q_LOCKED_VAL))
+		return;	/* Clear both pending & waiting if no one sees it */
+	smp_mb__after_atomic();
+
+	/*
+	 * Set locked and clear waiting bit only
+	 */
+	atomic_add(_Q_LOCKED_VAL - _Q_WAITING_VAL, &lock->val);
+}
+
+/*
+ * Spin on the waiting bit until it is cleared.
+ */
+static noinline void spin_on_waiting(struct qspinlock *lock)
+{
+	atomic_cond_read_relaxed(&lock->val, !(VAL & _Q_WAITING_VAL));
+
+	/* Clear the pending bit now */
+	atomic_andnot(_Q_PENDING_VAL, &lock->val);
+}
 
 /*
  * Generate the native code for queued_spin_unlock_slowpath(); provide NOPs for
@@ -412,6 +538,19 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
 
+	/*
+	 * 4 nodes are allocated based on the assumption that there will
+	 * not be nested NMIs taking spinlocks. That may not be true in
+	 * some architectures even though the chance of needing more than
+	 * 4 nodes will still be extremely unlikely. When that happens,
+	 * call the special acquire_lock_no_node() function to acquire
+	 * the lock without using any MCS node.
+	 */
+	if (unlikely(idx >= MAX_NODES)) {
+		acquire_lock_no_node(lock);
+		goto release;
+	}
+
 	node = grab_mcs_node(node, idx);
 
 	/*
@@ -460,6 +599,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * head of the waitqueue.
 	 */
 	if (old & _Q_TAIL_MASK) {
+		if (unlikely((old & _Q_TAIL_MASK) == _Q_TAIL_WAITING)) {
+			spin_on_waiting(lock);
+			goto wait_head;
+		}
+
 		prev = decode_tail(old);
 
 		/* Link @node into the waitqueue. */
@@ -500,10 +644,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * If PV isn't active, 0 will be returned instead.
 	 *
 	 */
+wait_head:
 	if ((val = pv_wait_head_or_lock(lock, node)))
 		goto locked;
 
-	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
+	/*
+	 * If both _Q_PENDING_VAL and _Q_WAITING_VAL are set, we can
+	 * ignore _Q_PENDING_VAL.
+	 */
+	val = atomic_cond_read_acquire(&lock->val,
+			!(VAL & _Q_LOCKED_MASK) &&
+			((VAL & _Q_WAIT_PEND_MASK) != _Q_PENDING_VAL));
 
 locked:
 	/*
@@ -521,14 +672,16 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * In the PV case we might already have _Q_LOCKED_VAL set, because
 	 * of lock stealing; therefore we must also allow:
 	 *
-	 * n,0,1 -> 0,0,1
+	 * n,*,1 -> 0,*,1
 	 *
-	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
-	 *       above wait condition, therefore any concurrent setting of
-	 *       PENDING will make the uncontended transition fail.
+	 * If the tail hasn't been changed, either the pending and waiting
+	 * bits are both set or both cleared. We need to preserve their
+	 * bit setting.
 	 */
 	if ((val & _Q_TAIL_MASK) == tail) {
-		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+		u32 new = _Q_LOCKED_VAL | (val & _Q_WAIT_PEND_MASK);
+
+		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new))
 			goto release; /* No contention */
 	}
 
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 8f36c27..5600690 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -87,7 +87,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 	for (;;) {
 		int val = atomic_read(&lock->val);
 
-		if (!(val & _Q_LOCKED_PENDING_MASK) &&
+		if (!(val & _Q_LOCKED_WAIT_PEND_MASK) &&
 		   (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
 			qstat_inc(qstat_pv_lock_stealing, true);
 			return true;
@@ -105,7 +105,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
  * The pending bit is used by the queue head vCPU to indicate that it
  * is actively spinning on the lock and no lock stealing is allowed.
  */
-#if _Q_PENDING_BITS == 8
+#if _Q_PENDING_BITS > 1
 static __always_inline void set_pending(struct qspinlock *lock)
 {
 	WRITE_ONCE(lock->pending, 1);
@@ -122,7 +122,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 	       (cmpxchg_acquire(&lock->locked_pending, _Q_PENDING_VAL,
 				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
 }
-#else /* _Q_PENDING_BITS == 8 */
+#else /* _Q_PENDING_BITS > 1 */
 static __always_inline void set_pending(struct qspinlock *lock)
 {
 	atomic_or(_Q_PENDING_VAL, &lock->val);
@@ -150,7 +150,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 	}
 	return 0;
 }
-#endif /* _Q_PENDING_BITS == 8 */
+#endif /* _Q_PENDING_BITS > 1 */
 
 /*
  * Lock and MCS node addresses hash table for fast lookup
@@ -485,6 +485,28 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 }
 
 /*
+ * Call to set_locked() isn't needed for the PV locking path.
+ */
+#define set_locked(lock)	pv_set_locked(lock)
+static __always_inline void pv_set_locked(struct qspinlock *lock) { }
+
+/*
+ * We don't need to deal with the pending and waiting bits for the
+ * PV locking path as lock stealing is supported. We can simply steal
+ * the lock here without even considering the pending bit and move forward.
+ */
+#define acquire_lock_no_node(lock)	pv_acquire_lock_no_node(lock)
+static noinline void pv_acquire_lock_no_node(struct qspinlock *lock)
+{
+	u8 val;
+
+	do {
+		atomic_cond_read_relaxed(&lock->val, !(VAL & _Q_LOCKED_MASK));
+		val = cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL);
+	} while (val);
+}
+
+/*
  * PV versions of the unlock fastpath and slowpath functions to be used
  * instead of queued_spin_unlock().
  */
-- 
1.8.3.1



* [PATCH 2/5] locking/qspinlock_stat: Track the no MCS node available case
  2019-01-21  2:49 [PATCH 0/5] locking/qspinlock: Safely handle > 4 nesting levels Waiman Long
  2019-01-21  2:49 ` [PATCH 1/5] " Waiman Long
@ 2019-01-21  2:49 ` Waiman Long
  2019-01-21  2:49 ` [PATCH 3/5] locking/qspinlock_stat: Separate out the PV specific stat counts Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2019-01-21  2:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, linux-arch, x86, Zhenzhong Duan, James Morse,
	SRINIVAS, Waiman Long

Track the number of slowpath locking operations that are done without
any MCS node available, and rename lock_index[123] to lock_use_node[234]
to make them more descriptive.

Using these stat counters is one way to find out if a code path is
being exercised.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c      |  4 +++-
 kernel/locking/qspinlock_stat.h | 24 ++++++++++++++++++------
 2 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 5bb06df..8163633 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -395,6 +395,7 @@ static noinline void acquire_lock_no_node(struct qspinlock *lock)
  */
 static noinline void spin_on_waiting(struct qspinlock *lock)
 {
+	qstat_inc(qstat_lock_waiting, true);
 	atomic_cond_read_relaxed(&lock->val, !(VAL & _Q_WAITING_VAL));
 
 	/* Clear the pending bit now */
@@ -548,6 +549,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 */
 	if (unlikely(idx >= MAX_NODES)) {
 		acquire_lock_no_node(lock);
+		qstat_inc(qstat_lock_no_node, true);
 		goto release;
 	}
 
@@ -556,7 +558,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	/*
 	 * Keep counts of non-zero index values:
 	 */
-	qstat_inc(qstat_lock_idx1 + idx - 1, idx);
+	qstat_inc(qstat_lock_use_node2 + idx - 1, idx);
 
 	/*
 	 * Ensure that we increment the head node->count before initialising
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 42d3d8d..4f8ca8c 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -30,6 +30,14 @@
  *   pv_wait_node	- # of vCPU wait's at a non-head queue node
  *   lock_pending	- # of locking operations via pending code
  *   lock_slowpath	- # of locking operations via MCS lock queue
+ *   lock_use_node2	- # of locking operations that use 2nd percpu node
+ *   lock_use_node3	- # of locking operations that use 3rd percpu node
+ *   lock_use_node4	- # of locking operations that use 4th percpu node
+ *   lock_no_node	- # of locking operations without using percpu node
+ *   lock_waiting	- # of locking operations with waiting bit set
+ *
+ * Subtracting lock_use_node[234] from lock_slowpath will give you
+ * lock_use_node1.
  *
  * Writing to the "reset_counters" file will reset all the above counter
  * values.
@@ -55,9 +63,11 @@ enum qlock_stats {
 	qstat_pv_wait_node,
 	qstat_lock_pending,
 	qstat_lock_slowpath,
-	qstat_lock_idx1,
-	qstat_lock_idx2,
-	qstat_lock_idx3,
+	qstat_lock_use_node2,
+	qstat_lock_use_node3,
+	qstat_lock_use_node4,
+	qstat_lock_no_node,
+	qstat_lock_waiting,
 	qstat_num,	/* Total number of statistical counters */
 	qstat_reset_cnts = qstat_num,
 };
@@ -85,9 +95,11 @@ enum qlock_stats {
 	[qstat_pv_wait_node]       = "pv_wait_node",
 	[qstat_lock_pending]       = "lock_pending",
 	[qstat_lock_slowpath]      = "lock_slowpath",
-	[qstat_lock_idx1]	   = "lock_index1",
-	[qstat_lock_idx2]	   = "lock_index2",
-	[qstat_lock_idx3]	   = "lock_index3",
+	[qstat_lock_use_node2]	   = "lock_use_node2",
+	[qstat_lock_use_node3]	   = "lock_use_node3",
+	[qstat_lock_use_node4]	   = "lock_use_node4",
+	[qstat_lock_no_node]	   = "lock_no_node",
+	[qstat_lock_waiting]	   = "lock_waiting",
 	[qstat_reset_cnts]         = "reset_counters",
 };
 
-- 
1.8.3.1



* [PATCH 3/5] locking/qspinlock_stat: Separate out the PV specific stat counts
  2019-01-21  2:49 [PATCH 0/5] locking/qspinlock: Safely handle > 4 nesting levels Waiman Long
  2019-01-21  2:49 ` [PATCH 1/5] " Waiman Long
  2019-01-21  2:49 ` [PATCH 2/5] locking/qspinlock_stat: Track the no MCS node available case Waiman Long
@ 2019-01-21  2:49 ` Waiman Long
  2019-01-21  2:49 ` [PATCH 4/5] locking/qspinlock_stat: Allow QUEUED_LOCK_STAT for all archs Waiman Long
  2019-01-21  2:49 ` [PATCH 5/5] locking/qspinlock: Add some locking debug code Waiman Long
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2019-01-21  2:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, linux-arch, x86, Zhenzhong Duan, James Morse,
	SRINIVAS, Waiman Long

Some of the statistics counts are for PV qspinlocks only and are not
applicable if PARAVIRT_SPINLOCKS isn't configured. So make those counts
dependent on the PARAVIRT_SPINLOCKS config option.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_stat.h | 129 +++++++++++++++++++++++++---------------
 1 file changed, 81 insertions(+), 48 deletions(-)

diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 4f8ca8c..ad2e9f4 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -50,6 +50,7 @@
  * There may be slight difference between pv_kick_wake and pv_kick_unlock.
  */
 enum qlock_stats {
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 	qstat_pv_hash_hops,
 	qstat_pv_kick_unlock,
 	qstat_pv_kick_wake,
@@ -61,6 +62,7 @@ enum qlock_stats {
 	qstat_pv_wait_early,
 	qstat_pv_wait_head,
 	qstat_pv_wait_node,
+#endif
 	qstat_lock_pending,
 	qstat_lock_slowpath,
 	qstat_lock_use_node2,
@@ -82,6 +84,7 @@ enum qlock_stats {
 #include <linux/fs.h>
 
 static const char * const qstat_names[qstat_num + 1] = {
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 	[qstat_pv_hash_hops]	   = "pv_hash_hops",
 	[qstat_pv_kick_unlock]     = "pv_kick_unlock",
 	[qstat_pv_kick_wake]       = "pv_kick_wake",
@@ -93,6 +96,7 @@ enum qlock_stats {
 	[qstat_pv_wait_early]      = "pv_wait_early",
 	[qstat_pv_wait_head]       = "pv_wait_head",
 	[qstat_pv_wait_node]       = "pv_wait_node",
+#endif
 	[qstat_lock_pending]       = "lock_pending",
 	[qstat_lock_slowpath]      = "lock_slowpath",
 	[qstat_lock_use_node2]	   = "lock_use_node2",
@@ -107,6 +111,20 @@ enum qlock_stats {
  * Per-cpu counters
  */
 static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);
+
+/*
+ * Increment the qspinlock statistical counters
+ */
+static inline void qstat_inc(enum qlock_stats stat, bool cond)
+{
+	if (cond)
+		this_cpu_inc(qstats[stat]);
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/*
+ * PV specific per-cpu counters
+ */
 static DEFINE_PER_CPU(u64, pv_kick_time);
 
 /*
@@ -181,6 +199,69 @@ static ssize_t qstat_read(struct file *file, char __user *user_buf,
 }
 
 /*
+ * PV hash hop count
+ */
+static inline void qstat_hop(int hopcnt)
+{
+	this_cpu_add(qstats[qstat_pv_hash_hops], hopcnt);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+	u64 start = sched_clock();
+
+	per_cpu(pv_kick_time, cpu) = start;
+	pv_kick(cpu);
+	this_cpu_add(qstats[qstat_pv_latency_kick], sched_clock() - start);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+	u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
+
+	*pkick_time = 0;
+	pv_wait(ptr, val);
+	if (*pkick_time) {
+		this_cpu_add(qstats[qstat_pv_latency_wake],
+			     sched_clock() - *pkick_time);
+		qstat_inc(qstat_pv_kick_wake, true);
+	}
+}
+
+#define pv_kick(c)	__pv_kick(c)
+#define pv_wait(p, v)	__pv_wait(p, v)
+
+#else /* CONFIG_PARAVIRT_SPINLOCKS */
+static ssize_t qstat_read(struct file *file, char __user *user_buf,
+			  size_t count, loff_t *ppos)
+{
+	char buf[64];
+	int cpu, counter, len;
+	u64 stat = 0;
+
+	/*
+	 * Get the counter ID stored in file->f_inode->i_private
+	 */
+	counter = (long)file_inode(file)->i_private;
+
+	if (counter >= qstat_num)
+		return -EBADF;
+
+	for_each_possible_cpu(cpu)
+		stat += per_cpu(qstats[counter], cpu);
+	len = snprintf(buf, sizeof(buf) - 1, "%llu\n", stat);
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+/*
  * Function to handle write request
  *
  * When counter = reset_cnts, reset all the counter values.
@@ -253,54 +334,6 @@ static int __init init_qspinlock_stat(void)
 }
 fs_initcall(init_qspinlock_stat);
 
-/*
- * Increment the PV qspinlock statistical counters
- */
-static inline void qstat_inc(enum qlock_stats stat, bool cond)
-{
-	if (cond)
-		this_cpu_inc(qstats[stat]);
-}
-
-/*
- * PV hash hop count
- */
-static inline void qstat_hop(int hopcnt)
-{
-	this_cpu_add(qstats[qstat_pv_hash_hops], hopcnt);
-}
-
-/*
- * Replacement function for pv_kick()
- */
-static inline void __pv_kick(int cpu)
-{
-	u64 start = sched_clock();
-
-	per_cpu(pv_kick_time, cpu) = start;
-	pv_kick(cpu);
-	this_cpu_add(qstats[qstat_pv_latency_kick], sched_clock() - start);
-}
-
-/*
- * Replacement function for pv_wait()
- */
-static inline void __pv_wait(u8 *ptr, u8 val)
-{
-	u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
-
-	*pkick_time = 0;
-	pv_wait(ptr, val);
-	if (*pkick_time) {
-		this_cpu_add(qstats[qstat_pv_latency_wake],
-			     sched_clock() - *pkick_time);
-		qstat_inc(qstat_pv_kick_wake, true);
-	}
-}
-
-#define pv_kick(c)	__pv_kick(c)
-#define pv_wait(p, v)	__pv_wait(p, v)
-
 #else /* CONFIG_QUEUED_LOCK_STAT */
 
 static inline void qstat_inc(enum qlock_stats stat, bool cond)	{ }
-- 
1.8.3.1



* [PATCH 4/5] locking/qspinlock_stat: Allow QUEUED_LOCK_STAT for all archs
  2019-01-21  2:49 [PATCH 0/5] locking/qspinlock: Safely handle > 4 nesting levels Waiman Long
                   ` (2 preceding siblings ...)
  2019-01-21  2:49 ` [PATCH 3/5] locking/qspinlock_stat: Separate out the PV specific stat counts Waiman Long
@ 2019-01-21  2:49 ` Waiman Long
  2019-01-21  2:49 ` [PATCH 5/5] locking/qspinlock: Add some locking debug code Waiman Long
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2019-01-21  2:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, linux-arch, x86, Zhenzhong Duan, James Morse,
	SRINIVAS, Waiman Long

The QUEUED_LOCK_STAT option to report queued spinlock statistics was
previously allowed only on the x86 architecture. Now that queued
spinlocks are used on multiple architectures, allow QUEUED_LOCK_STAT to
be enabled on any architecture that uses them. The option is now listed
as part of the general architecture-dependent options.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 arch/Kconfig     | 7 +++++++
 arch/x86/Kconfig | 8 --------
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4cfb6de..c82e32f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -885,6 +885,13 @@ config HAVE_ARCH_PREL32_RELOCATIONS
 	  architectures, and don't require runtime relocation on relocatable
 	  kernels.
 
+config QUEUED_LOCK_STAT
+	bool "Queued spinlock statistics"
+	depends on QUEUED_SPINLOCKS && DEBUG_FS
+	---help---
+	  Enable the collection of statistical data on the slowpath
+	  behavior of queued spinlocks and report them on debugfs.
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4b4a7f3..872e681 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -784,14 +784,6 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer Y.
 
-config QUEUED_LOCK_STAT
-	bool "Paravirt queued spinlock statistics"
-	depends on PARAVIRT_SPINLOCKS && DEBUG_FS
-	---help---
-	  Enable the collection of statistical data on the slowpath
-	  behavior of paravirtualized queued spinlocks and report
-	  them on debugfs.
-
 source "arch/x86/xen/Kconfig"
 
 config KVM_GUEST
-- 
1.8.3.1



* [PATCH 5/5] locking/qspinlock: Add some locking debug code
  2019-01-21  2:49 [PATCH 0/5] locking/qspinlock: Safely handle > 4 nesting levels Waiman Long
                   ` (3 preceding siblings ...)
  2019-01-21  2:49 ` [PATCH 4/5] locking/qspinlock_stat: Allow QUEUED_LOCK_STAT for all archs Waiman Long
@ 2019-01-21  2:49 ` Waiman Long
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2019-01-21  2:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, linux-arch, x86, Zhenzhong Duan, James Morse,
	SRINIVAS, Waiman Long

Add some optionally enabled debug code to check whether more than one
CPU can enter the lock critical section simultaneously.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 8163633..7671dfc 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -97,6 +97,18 @@ struct qnode {
 };
 
 /*
+ * Define _Q_DEBUG_LOCK to verify that no more than one CPU can enter
+ * the lock critical section at the same time.
+ */
+// #define _Q_DEBUG_LOCK
+
+#ifdef _Q_DEBUG_LOCK
+#define _Q_DEBUG_WARN_ON(c)	WARN_ON_ONCE(c)
+#else
+#define _Q_DEBUG_WARN_ON(c)
+#endif
+
+/*
  * The pending bit spinning loop count.
  * This heuristic is used to limit the number of lockword accesses
  * made by atomic_cond_read_relaxed when waiting for the lock to
@@ -184,7 +196,13 @@ static __always_inline void clear_pending(struct qspinlock *lock)
  */
 static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 {
+#ifdef _Q_DEBUG_LOCK
+	u16 old = xchg_relaxed(&lock->locked_pending, _Q_LOCKED_VAL);
+
+	WARN_ON_ONCE((old & _Q_LOCKED_VAL) || !(old & _Q_PENDING_VAL));
+#else
 	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
+#endif
 }
 
 /*
@@ -284,7 +302,13 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
  */
 static __always_inline void set_locked(struct qspinlock *lock)
 {
+#ifdef _Q_DEBUG_LOCK
+	u8 old = xchg_relaxed(&lock->locked, _Q_LOCKED_VAL);
+
+	WARN_ON_ONCE(old);
+#else
 	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
+#endif
 }
 
 /**
@@ -683,6 +707,9 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if ((val & _Q_TAIL_MASK) == tail) {
 		u32 new = _Q_LOCKED_VAL | (val & _Q_WAIT_PEND_MASK);
 
+		_Q_DEBUG_WARN_ON((val & _Q_WAIT_PEND_MASK) &&
+				 (val & _Q_WAIT_PEND_MASK) != _Q_WAIT_PEND_VAL);
+
 		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new))
 			goto release; /* No contention */
 	}
-- 
1.8.3.1



* Re: [PATCH 1/5] locking/qspinlock: Safely handle > 4 nesting levels
  2019-01-21  2:49 ` [PATCH 1/5] " Waiman Long
@ 2019-01-21  9:12   ` Peter Zijlstra
  2019-01-21 13:13     ` Waiman Long
  2019-01-22  5:44     ` Will Deacon
  0 siblings, 2 replies; 9+ messages in thread
From: Peter Zijlstra @ 2019-01-21  9:12 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, linux-arch, x86, Zhenzhong Duan,
	James Morse, SRINIVAS

On Sun, Jan 20, 2019 at 09:49:50PM -0500, Waiman Long wrote:
> +/**
> + *  acquire_lock_no_node - acquire lock without MCS node
> + *  @lock: Pointer to queued spinlock structure
> + *
> + *  It is extremely unlikely that this function will ever be called.
> + *  Marking it as noinline to not mess with the slowpath code. This
> + *  function is for native qspinlock only. The PV qspinlock code has
> + *  its own simpler version.
> + *
> + *  -----          -----  |   ----      -----          -----
> + * |Tail2| <- ... |Head2| |  |Node| <- |Tail1| <- ... |Head1|
> + *  -----          -----  |   ----      -----          -----
> + *                   |                                   |
> + *                   V                                   V
> + *             Spin on waiting                     Spin on locked
> + *
> + * The waiting and the pending bits will be acquired first which are now
> + * used as a separator for the disjointed queue shown above.
> + *
> + * The current CPU will then be inserted into queue by placing a special
> + * _Q_TAIL_WAITING value into the tail and makes the current tail
> + * point to its own local node. The next incoming CPU will see the special
> + * tail, but it has no way to find the node. Instead, it will spin on the
> + * waiting bit. When that bit is cleared, it means that all the the
> + * previous CPUs in the queue are gone and current CPU is the new lock
> + * holder. 

I know it's monday morning and I've not had wake-up juice yet, but I
don't think that's true.

Consider there being two CPUs that ran out of nodes and thus we have two
tail fragments waiting on the one waiting bit.

There is no sane way to recover from this.. and stay fair, why are we
trying?

That is; what's the problem with the below?

Yes it sucks, but it is simple and doesn't introduce 100+ lines of code
that 'never' gets used.

---
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 8a8c3c208c5e..983b49a75826 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -412,6 +412,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
 
+	if (idx >= MAX_NODES) {
+		while (!queued_spin_trylock(lock))
+			cpu_relax();
+		goto release;
+	}
+
 	node = grab_mcs_node(node, idx);
 
 	/*


* Re: [PATCH 1/5] locking/qspinlock: Safely handle > 4 nesting levels
  2019-01-21  9:12   ` Peter Zijlstra
@ 2019-01-21 13:13     ` Waiman Long
  2019-01-22  5:44     ` Will Deacon
  1 sibling, 0 replies; 9+ messages in thread
From: Waiman Long @ 2019-01-21 13:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, linux-arch, x86, Zhenzhong Duan,
	James Morse, SRINIVAS

On 01/21/2019 04:12 AM, Peter Zijlstra wrote:
> On Sun, Jan 20, 2019 at 09:49:50PM -0500, Waiman Long wrote:
>> +/**
>> + *  acquire_lock_no_node - acquire lock without MCS node
>> + *  @lock: Pointer to queued spinlock structure
>> + *
>> + *  It is extremely unlikely that this function will ever be called.
>> + *  Marking it as noinline to not mess with the slowpath code. This
>> + *  function is for native qspinlock only. The PV qspinlock code has
>> + *  its own simpler version.
>> + *
>> + *  -----          -----  |   ----      -----          -----
>> + * |Tail2| <- ... |Head2| |  |Node| <- |Tail1| <- ... |Head1|
>> + *  -----          -----  |   ----      -----          -----
>> + *                   |                                   |
>> + *                   V                                   V
>> + *             Spin on waiting                     Spin on locked
>> + *
>> + * The waiting and the pending bits will be acquired first which are now
>> + * used as a separator for the disjointed queue shown above.
>> + *
>> + * The current CPU will then be inserted into queue by placing a special
>> + * _Q_TAIL_WAITING value into the tail and makes the current tail
>> + * point to its own local node. The next incoming CPU will see the special
>> + * tail, but it has no way to find the node. Instead, it will spin on the
>> + * waiting bit. When that bit is cleared, it means that all the the
>> + * previous CPUs in the queue are gone and current CPU is the new lock
>> + * holder. 
> I know it's monday morning and I've not had wake-up juice yet, but I
> don't think that's true.
>
> Consider there being two CPUs that ran out of nodes and thus we have two
> tail fragments waiting on the one waiting bit.

The waiting bit acts like a bit lock, as no more than one CPU can hold
it at any time. The loser just keeps spinning on it.
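
For reference, this is the acquisition loop in acquire_lock_no_node()
from patch 1 that gives it the bit-lock behaviour; only one CPU can win
the cmpxchg while the pending bit is clear, and any other no-node CPU
keeps iterating here:

	for (;;) {
		/* Wait until nobody else owns the pending/waiting bits. */
		old = atomic_cond_read_relaxed(&lock->val,
					      !(VAL & _Q_PENDING_VAL));
		new = old | _Q_WAIT_PEND_VAL;
		/* Only one contender can make this transition at a time. */
		if (atomic_cmpxchg_acquire(&lock->val, old, new) == old)
			break;
	}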

> There is no sane wait to recover from this.. and stay fair, why are we
> trying?
>
> That is; what's the problem with the below?
>
> Yes it sucks, but it is simple and doesn't introduce 100+ lines of code
> that 'never' gets used.
>
> ---
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 8a8c3c208c5e..983b49a75826 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -412,6 +412,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	idx = node->count++;
>  	tail = encode_tail(smp_processor_id(), idx);
>  
> +	if (idx >= MAX_NODES) {
> +		while (!queued_spin_trylock(lock))
> +			cpu_relax();
> +		goto release;
> +	}
> +
>  	node = grab_mcs_node(node, idx);
>  
>  	/*

Yes, that can work too. Although there is a possibility of live lock, it
should seldom happen when we are talking about NMIs.

Cheers,
Longman




* Re: [PATCH 1/5] locking/qspinlock: Safely handle > 4 nesting levels
  2019-01-21  9:12   ` Peter Zijlstra
  2019-01-21 13:13     ` Waiman Long
@ 2019-01-22  5:44     ` Will Deacon
  1 sibling, 0 replies; 9+ messages in thread
From: Will Deacon @ 2019-01-22  5:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, linux-arch, x86, Zhenzhong Duan,
	James Morse, SRINIVAS

On Mon, Jan 21, 2019 at 10:12:34AM +0100, Peter Zijlstra wrote:
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 8a8c3c208c5e..983b49a75826 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -412,6 +412,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	idx = node->count++;
>  	tail = encode_tail(smp_processor_id(), idx);
>  
> +	if (idx >= MAX_NODES) {
> +		while (!queued_spin_trylock(lock))
> +			cpu_relax();
> +		goto release;
> +	}
> +
>  	node = grab_mcs_node(node, idx);

With an unlikely() and a comment, I /much/ prefer this approach!
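
(A sketch of what that might look like, illustrative only:)

	/*
	 * No MCS node is available for this context (e.g. deeply nested
	 * NMIs taking contended spinlocks); fall back to spinning on the
	 * trylock instead of queueing.
	 */
	if (unlikely(idx >= MAX_NODES)) {
		while (!queued_spin_trylock(lock))
			cpu_relax();
		goto release;
	}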

Will

