linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 00/17] powerpc: alternate queued spinlock implementation
@ 2022-11-26  9:59 Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 01/17] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
                   ` (19 more replies)
  0 siblings, 20 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

This replaces the generic queued spinlock code (like s390 does) with
our own implementation. There is an extra shim patch 1a to get the
series to apply.

Generic PV qspinlock code is causing latency / starvation regressions on
large systems that are resulting in hard lockups reported (mostly in
pathoogical cases).  The generic qspinlock code has a number of issues
important for powerpc hardware and hypervisors that aren't easily solved
without changing code that would impact other architectures. Follow
s390's lead and implement our own for now.

Issues for powerpc using generic qspinlocks:
- The previous lock value should not be loaded with simple loads, and
  need not be passed around from previous loads or cmpxchg results,
  because powerpc uses ll/sc-style atomics which can perform more
  complex operations that do not require this. powerpc implementations
  tend to prefer loads use larx for improved coherency performance.
- The queueing process should absolutely minimise the number of stores
  to the lock word to reduce exclusive coherency probes, important for
  large system scalability. The pending logic is counter productive
  here.
- Non-atomic unlock for paravirt locks is important (atomic instructions
  tend to still be more expensive than x86 CPUs).
- Yielding to the lock owner is important in the oversubscribed paravirt
  case, which requires storing the owner CPU in the lock word.
- More control of lock stealing for the paravirt case is important to
  keep latency down on large systems.
- The lock acquisition operation should always be made with a special
  variant of atomic instructions with the lock hint bit set, including
  (especially) in the queueing paths. This is more a matter of adding
  more arch lock helpers so not an insurmountable problem for generic
  code.

Thanks,
Nick

Since v2:
- Rebase the series on upstream and remove the 1a shim patch.
- Squash in the RFC patches that avoid a few more cmpxchg patterns in
  favour of more optimal larx/stcx implementations and allows the
  non-stealing queueing case to be removed, significantly reducing
  the queuing code.
- Reword some changelogs.

Since v1:
- Change most 'if (cond) return 1 ; return 0;'
- Bug fix: was testing count == MAX, but reentrant NMIs could bring that
  > MAX and crash.
- Fix missing memory barrier lost in asm conversion patch.
- Seperate the release barrier in publish_tail from the acquire barrier
  in get_tail_qnode.
- Moving a few minor things into their logically correct change.
- Make encode_tail_cpu take a cpu argument to match get_tail_cpu.
- Rename get_tail_cpu to decode_tail_cpu to match encode_tail_cpu.
- Rename lock_set_locked to set_locked.
- IS_ENABLED(x) ? 1 : 0 -> IS_ENABLED(x)
- Fix some comments inside inline asm.
- Change tunable names to lowercase.
- Consolidate asm for trylock_clear_tail_cpu and trylock_with_tail_cpu
- Restructure steal/wait loops to be more readable
- Count a failed cmpxchg as an iteration in steal/wait loops to avoid
  theoretical livelock/latency concern.

Nicholas Piggin (17):
  powerpc/qspinlock: add mcs queueing for contended waiters
  powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.
  powerpc/qspinlock: convert atomic operations to assembly
  powerpc/qspinlock: allow new waiters to steal the lock before queueing
  powerpc/qspinlock: theft prevention to control latency
  powerpc/qspinlock: store owner CPU in lock word
  powerpc/qspinlock: paravirt yield to lock owner
  powerpc/qspinlock: implement option to yield to previous node
  powerpc/qspinlock: allow stealing when head of queue yields
  powerpc/qspinlock: allow propagation of yield CPU down the queue
  powerpc/qspinlock: add ability to prod new queue head CPU
  powerpc/qspinlock: allow lock stealing in trylock and lock fastpath
  powerpc/qspinlock: use spin_begin/end API
  powerpc/qspinlock: reduce remote node steal spins
  powerpc/qspinlock: allow indefinite spinning on a preempted owner
  powerpc/qspinlock: provide accounting and options for sleepy locks
  powerpc/qspinlock: add compile-time tuning adjustments

 arch/powerpc/include/asm/qspinlock.h       | 130 ++-
 arch/powerpc/include/asm/qspinlock_types.h |  63 +-
 arch/powerpc/lib/qspinlock.c               | 985 ++++++++++++++++++++-
 3 files changed, 1167 insertions(+), 11 deletions(-)

-- 
2.37.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 01/17] powerpc/qspinlock: add mcs queueing for contended waiters
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 02/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx Nicholas Piggin
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

This forms the basis of the qspinlock slow path.

Like generic qspinlocks and unlike the vanilla MCS algorithm, the lock
owner does not participate in the queue, only waiters. The first waiter
spins on the lock word, then when the lock is released it takes
ownership and unqueues the next waiter. This is how qspinlocks can be
implemented with the spinlock API -- lock owners don't need a node, only
waiters do.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/qspinlock.h       |  10 +-
 arch/powerpc/include/asm/qspinlock_types.h |  23 +++
 arch/powerpc/lib/qspinlock.c               | 187 ++++++++++++++++++++-
 3 files changed, 214 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index b1443aab2145..300c7d2ebe2e 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -18,12 +18,12 @@ static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
 
 static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 {
-	return 0;
+	return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
 }
 
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
-	return atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0;
+	return atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0;
 }
 
 void queued_spin_lock_slowpath(struct qspinlock *lock);
@@ -36,7 +36,11 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 
 static inline void queued_spin_unlock(struct qspinlock *lock)
 {
-	atomic_set_release(&lock->val, 0);
+	for (;;) {
+		int val = atomic_read(&lock->val);
+		if (atomic_cmpxchg_release(&lock->val, val, val & ~_Q_LOCKED_VAL) == val)
+			return;
+	}
 }
 
 #define arch_spin_is_locked(l)		queued_spin_is_locked(l)
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 59606bc0c774..20a36dfb14e2 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -10,4 +10,27 @@ typedef struct qspinlock {
 
 #define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
 
+/*
+ * Bitfields in the lock word:
+ *
+ *     0: locked bit
+ *  1-16: unused bits
+ * 17-31: tail cpu (+1)
+ */
+#define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
+				      << _Q_ ## type ## _OFFSET)
+/* 0x00000001 */
+#define _Q_LOCKED_OFFSET	0
+#define _Q_LOCKED_BITS		1
+#define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
+
+/* 0xfffe0000 */
+#define _Q_TAIL_CPU_OFFSET	17
+#define _Q_TAIL_CPU_BITS	15
+#define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
+
+#if CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)
+#error "qspinlock does not support such large CONFIG_NR_CPUS"
+#endif
+
 #endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 1c669b5b4607..86504628501e 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -1,12 +1,193 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/compiler.h>
 #include <linux/export.h>
-#include <linux/processor.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
 #include <asm/qspinlock.h>
 
-void queued_spin_lock_slowpath(struct qspinlock *lock)
+#define MAX_NODES	4
+
+struct qnode {
+	struct qnode	*next;
+	struct qspinlock *lock;
+	u8		locked; /* 1 if lock acquired */
+};
+
+struct qnodes {
+	int		count;
+	struct qnode nodes[MAX_NODES];
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
+
+static inline int encode_tail_cpu(int cpu)
+{
+	return (cpu + 1) << _Q_TAIL_CPU_OFFSET;
+}
+
+static inline int decode_tail_cpu(int val)
+{
+	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
+}
+
+/*
+ * Try to acquire the lock if it was not already locked. If the tail matches
+ * mytail then clear it, otherwise leave it unchnaged. Return previous value.
+ *
+ * This is used by the head of the queue to acquire the lock and clean up
+ * its tail if it was the last one queued.
+ */
+static __always_inline int set_locked_clean_tail(struct qspinlock *lock, int tail)
+{
+	int val = atomic_read(&lock->val);
+
+	BUG_ON(val & _Q_LOCKED_VAL);
+
+	/* If we're the last queued, must clean up the tail. */
+	if ((val & _Q_TAIL_CPU_MASK) == tail) {
+		if (atomic_cmpxchg_acquire(&lock->val, val, _Q_LOCKED_VAL) == val)
+			return val;
+		/* Another waiter must have enqueued */
+		val = atomic_read(&lock->val);
+		BUG_ON(val & _Q_LOCKED_VAL);
+	}
+
+	/* We must be the owner, just set the lock bit and acquire */
+	atomic_or(_Q_LOCKED_VAL, &lock->val);
+	__atomic_acquire_fence();
+
+	return val;
+}
+
+/*
+ * Publish our tail, replacing previous tail. Return previous value.
+ *
+ * This provides a release barrier for publishing node, this pairs with the
+ * acquire barrier in get_tail_qnode() when the next CPU finds this tail
+ * value.
+ */
+static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)
+{
+	for (;;) {
+		int val = atomic_read(&lock->val);
+		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
+		int old;
+
+		old = atomic_cmpxchg_release(&lock->val, val, newval);
+		if (old == val)
+			return old;
+	}
+}
+
+static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
+{
+	int cpu = decode_tail_cpu(val);
+	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
+	int idx;
+
+	/*
+	 * After publishing the new tail and finding a previous tail in the
+	 * previous val (which is the control dependency), this barrier
+	 * orders the release barrier in publish_tail_cpu performed by the
+	 * last CPU, with subsequently looking at its qnode structures
+	 * after the barrier.
+	 */
+	smp_acquire__after_ctrl_dep();
+
+	for (idx = 0; idx < MAX_NODES; idx++) {
+		struct qnode *qnode = &qnodesp->nodes[idx];
+		if (qnode->lock == lock)
+			return qnode;
+	}
+
+	BUG();
+}
+
+static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 {
-	while (!queued_spin_trylock(lock))
+	struct qnodes *qnodesp;
+	struct qnode *next, *node;
+	int val, old, tail;
+	int idx;
+
+	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+
+	qnodesp = this_cpu_ptr(&qnodes);
+	if (unlikely(qnodesp->count >= MAX_NODES)) {
+		while (!queued_spin_trylock(lock))
+			cpu_relax();
+		return;
+	}
+
+	idx = qnodesp->count++;
+	/*
+	 * Ensure that we increment the head node->count before initialising
+	 * the actual node. If the compiler is kind enough to reorder these
+	 * stores, then an IRQ could overwrite our assignments.
+	 */
+	barrier();
+	node = &qnodesp->nodes[idx];
+	node->next = NULL;
+	node->lock = lock;
+	node->locked = 0;
+
+	tail = encode_tail_cpu(smp_processor_id());
+
+	old = publish_tail_cpu(lock, tail);
+
+	/*
+	 * If there was a previous node; link it and wait until reaching the
+	 * head of the waitqueue.
+	 */
+	if (old & _Q_TAIL_CPU_MASK) {
+		struct qnode *prev = get_tail_qnode(lock, old);
+
+		/* Link @node into the waitqueue. */
+		WRITE_ONCE(prev->next, node);
+
+		/* Wait for mcs node lock to be released */
+		while (!node->locked)
+			cpu_relax();
+
+		smp_rmb(); /* acquire barrier for the mcs lock */
+	}
+
+	/* We're at the head of the waitqueue, wait for the lock. */
+	for (;;) {
+		val = atomic_read(&lock->val);
+		if (!(val & _Q_LOCKED_VAL))
+			break;
+
+		cpu_relax();
+	}
+
+	/* If we're the last queued, must clean up the tail. */
+	old = set_locked_clean_tail(lock, tail);
+	if ((old & _Q_TAIL_CPU_MASK) == tail)
+		goto release; /* Another waiter must have enqueued */
+
+	/* There is a next, must wait for node->next != NULL (MCS protocol) */
+	while (!(next = READ_ONCE(node->next)))
 		cpu_relax();
+
+	/*
+	 * Unlock the next mcs waiter node. Release barrier is not required
+	 * here because the acquirer is only accessing the lock word, and
+	 * the acquire barrier we took the lock with orders that update vs
+	 * this store to locked. The corresponding barrier is the smp_rmb()
+	 * acquire barrier for mcs lock, above.
+	 */
+	WRITE_ONCE(next->locked, 1);
+
+release:
+	qnodesp->count--; /* release the node */
+}
+
+void queued_spin_lock_slowpath(struct qspinlock *lock)
+{
+	queued_spin_lock_mcs_queue(lock);
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 02/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 01/17] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 03/17] powerpc/qspinlock: convert atomic operations to assembly Nicholas Piggin
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

The first 16 bits of the lock are only modified by the owner, and other
modifications always use atomic operations on the entire 32 bits, so
unlocks can use plain stores on the 16 bits. This is the same kind of
optimisation done by core qspinlock code.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/qspinlock.h       |  6 +-----
 arch/powerpc/include/asm/qspinlock_types.h | 19 +++++++++++++++++--
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 300c7d2ebe2e..7bc254c55705 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -36,11 +36,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 
 static inline void queued_spin_unlock(struct qspinlock *lock)
 {
-	for (;;) {
-		int val = atomic_read(&lock->val);
-		if (atomic_cmpxchg_release(&lock->val, val, val & ~_Q_LOCKED_VAL) == val)
-			return;
-	}
+	smp_store_release(&lock->locked, 0);
 }
 
 #define arch_spin_is_locked(l)		queued_spin_is_locked(l)
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 20a36dfb14e2..fe87181c59e5 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -3,12 +3,27 @@
 #define _ASM_POWERPC_QSPINLOCK_TYPES_H
 
 #include <linux/types.h>
+#include <asm/byteorder.h>
 
 typedef struct qspinlock {
-	atomic_t val;
+	union {
+		atomic_t val;
+
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u16	locked;
+			u8	reserved[2];
+		};
+#else
+		struct {
+			u8	reserved[2];
+			u16	locked;
+		};
+#endif
+	};
 } arch_spinlock_t;
 
-#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
 
 /*
  * Bitfields in the lock word:
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 03/17] powerpc/qspinlock: convert atomic operations to assembly
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 01/17] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 02/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 04/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing Nicholas Piggin
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

This uses more optimal ll/sc style access patterns (rather than
cmpxchg), and also sets the EH=1 lock hint on those operations
which acquire ownership of the lock.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/qspinlock.h       | 24 +++++--
 arch/powerpc/include/asm/qspinlock_types.h |  4 +-
 arch/powerpc/lib/qspinlock.c               | 82 +++++++++++++---------
 3 files changed, 68 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 7bc254c55705..7d300e6883a8 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -2,28 +2,42 @@
 #ifndef _ASM_POWERPC_QSPINLOCK_H
 #define _ASM_POWERPC_QSPINLOCK_H
 
-#include <linux/atomic.h>
 #include <linux/compiler.h>
 #include <asm/qspinlock_types.h>
 
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-	return atomic_read(&lock->val);
+	return READ_ONCE(lock->val);
 }
 
 static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
 {
-	return !atomic_read(&lock.val);
+	return !lock.val;
 }
 
 static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 {
-	return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
+	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
 }
 
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
-	return atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0;
+	u32 prev;
+
+	asm volatile(
+"1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"
+"	cmpwi	0,%0,0							\n"
+"	bne-	2f							\n"
+"	stwcx.	%2,0,%1							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+"2:									\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r" (_Q_LOCKED_VAL),
+	  "i" (IS_ENABLED(CONFIG_PPC64))
+	: "cr0", "memory");
+
+	return likely(prev == 0);
 }
 
 void queued_spin_lock_slowpath(struct qspinlock *lock);
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index fe87181c59e5..b9a5a52fa670 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -7,7 +7,7 @@
 
 typedef struct qspinlock {
 	union {
-		atomic_t val;
+		u32 val;
 
 #ifdef __LITTLE_ENDIAN
 		struct {
@@ -23,7 +23,7 @@ typedef struct qspinlock {
 	};
 } arch_spinlock_t;
 
-#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = 0 } }
 
 /*
  * Bitfields in the lock word:
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 86504628501e..645d9affacfd 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -1,5 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
-#include <linux/atomic.h>
 #include <linux/bug.h>
 #include <linux/compiler.h>
 #include <linux/export.h>
@@ -22,12 +21,12 @@ struct qnodes {
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
-static inline int encode_tail_cpu(int cpu)
+static inline u32 encode_tail_cpu(int cpu)
 {
 	return (cpu + 1) << _Q_TAIL_CPU_OFFSET;
 }
 
-static inline int decode_tail_cpu(int val)
+static inline int decode_tail_cpu(u32 val)
 {
 	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
 }
@@ -39,26 +38,34 @@ static inline int decode_tail_cpu(int val)
  * This is used by the head of the queue to acquire the lock and clean up
  * its tail if it was the last one queued.
  */
-static __always_inline int set_locked_clean_tail(struct qspinlock *lock, int tail)
+static __always_inline u32 set_locked_clean_tail(struct qspinlock *lock, u32 tail)
 {
-	int val = atomic_read(&lock->val);
-
-	BUG_ON(val & _Q_LOCKED_VAL);
-
-	/* If we're the last queued, must clean up the tail. */
-	if ((val & _Q_TAIL_CPU_MASK) == tail) {
-		if (atomic_cmpxchg_acquire(&lock->val, val, _Q_LOCKED_VAL) == val)
-			return val;
-		/* Another waiter must have enqueued */
-		val = atomic_read(&lock->val);
-		BUG_ON(val & _Q_LOCKED_VAL);
-	}
-
-	/* We must be the owner, just set the lock bit and acquire */
-	atomic_or(_Q_LOCKED_VAL, &lock->val);
-	__atomic_acquire_fence();
-
-	return val;
+	u32 newval = _Q_LOCKED_VAL;
+	u32 prev, tmp;
+
+	asm volatile(
+"1:	lwarx	%0,0,%2,%6	# set_locked_clean_tail			\n"
+	/* Test whether the lock tail == tail */
+"	and	%1,%0,%5						\n"
+"	cmpw	0,%1,%3							\n"
+	/* Merge the new locked value */
+"	or	%1,%1,%4						\n"
+"	bne	2f							\n"
+	/* If the lock tail matched, then clear it, otherwise leave it. */
+"	andc	%1,%1,%5						\n"
+"2:	stwcx.	%1,0,%2							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+"3:									\n"
+	: "=&r" (prev), "=&r" (tmp)
+	: "r" (&lock->val), "r"(tail), "r" (newval),
+	  "r" (_Q_TAIL_CPU_MASK),
+	  "i" (IS_ENABLED(CONFIG_PPC64))
+	: "cr0", "memory");
+
+	BUG_ON(prev & _Q_LOCKED_VAL);
+
+	return prev;
 }
 
 /*
@@ -68,20 +75,25 @@ static __always_inline int set_locked_clean_tail(struct qspinlock *lock, int tai
  * acquire barrier in get_tail_qnode() when the next CPU finds this tail
  * value.
  */
-static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)
+static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
 {
-	for (;;) {
-		int val = atomic_read(&lock->val);
-		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
-		int old;
-
-		old = atomic_cmpxchg_release(&lock->val, val, newval);
-		if (old == val)
-			return old;
-	}
+	u32 prev, tmp;
+
+	asm volatile(
+"\t"	PPC_RELEASE_BARRIER "						\n"
+"1:	lwarx	%0,0,%2		# publish_tail_cpu			\n"
+"	andc	%1,%0,%4						\n"
+"	or	%1,%1,%3						\n"
+"	stwcx.	%1,0,%2							\n"
+"	bne-	1b							\n"
+	: "=&r" (prev), "=&r"(tmp)
+	: "r" (&lock->val), "r" (tail), "r"(_Q_TAIL_CPU_MASK)
+	: "cr0", "memory");
+
+	return prev;
 }
 
-static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
+static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 {
 	int cpu = decode_tail_cpu(val);
 	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
@@ -109,7 +121,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 {
 	struct qnodes *qnodesp;
 	struct qnode *next, *node;
-	int val, old, tail;
+	u32 val, old, tail;
 	int idx;
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
@@ -156,7 +168,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 
 	/* We're at the head of the waitqueue, wait for the lock. */
 	for (;;) {
-		val = atomic_read(&lock->val);
+		val = READ_ONCE(lock->val);
 		if (!(val & _Q_LOCKED_VAL))
 			break;
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 04/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (2 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 03/17] powerpc/qspinlock: convert atomic operations to assembly Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 05/17] powerpc/qspinlock: theft prevention to control latency Nicholas Piggin
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

Allow new waiters to "steal" the lock before queueing. That is, to
acquire it while other CPUs have queued.

This particularly helps paravirt performance when physical CPUs are
oversubscribed, by keeping the lock from becoming a strict FIFO and
vCPU preemption causing queue train wrecks.

The new __queued_spin_trylock_steal() function is put in qspinlock.h
to save having to move it, because it will be used there by a later
change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/qspinlock.h |  23 ++++++
 arch/powerpc/lib/qspinlock.c         | 110 ++++++++++++++++++++++++---
 2 files changed, 124 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 7d300e6883a8..2a6f12a2c385 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -40,6 +40,29 @@ static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 	return likely(prev == 0);
 }
 
+static __always_inline int __queued_spin_trylock_steal(struct qspinlock *lock)
+{
+	u32 prev, tmp;
+
+	/* Trylock may get ahead of queued nodes if it finds unlocked */
+	asm volatile(
+"1:	lwarx	%0,0,%2,%5	# __queued_spin_trylock_steal		\n"
+"	andc.	%1,%0,%4						\n"
+"	bne-	2f							\n"
+"	and	%1,%0,%4						\n"
+"	or	%1,%1,%3						\n"
+"	stwcx.	%1,0,%2							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+"2:									\n"
+	: "=&r" (prev), "=&r" (tmp)
+	: "r" (&lock->val), "r" (_Q_LOCKED_VAL), "r" (_Q_TAIL_CPU_MASK),
+	  "i" (IS_ENABLED(CONFIG_PPC64))
+	: "cr0", "memory");
+
+	return likely(!(prev & ~_Q_TAIL_CPU_MASK));
+}
+
 void queued_spin_lock_slowpath(struct qspinlock *lock);
 
 static __always_inline void queued_spin_lock(struct qspinlock *lock)
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 645d9affacfd..6d67bc38b122 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -19,8 +19,17 @@ struct qnodes {
 	struct qnode nodes[MAX_NODES];
 };
 
+/* Tuning parameters */
+static int steal_spins __read_mostly = (1<<5);
+static bool maybe_stealers __read_mostly = true;
+
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
+static __always_inline int get_steal_spins(void)
+{
+	return steal_spins;
+}
+
 static inline u32 encode_tail_cpu(int cpu)
 {
 	return (cpu + 1) << _Q_TAIL_CPU_OFFSET;
@@ -38,33 +47,35 @@ static inline int decode_tail_cpu(u32 val)
  * This is used by the head of the queue to acquire the lock and clean up
  * its tail if it was the last one queued.
  */
-static __always_inline u32 set_locked_clean_tail(struct qspinlock *lock, u32 tail)
+static __always_inline u32 trylock_clean_tail(struct qspinlock *lock, u32 tail)
 {
 	u32 newval = _Q_LOCKED_VAL;
 	u32 prev, tmp;
 
 	asm volatile(
-"1:	lwarx	%0,0,%2,%6	# set_locked_clean_tail			\n"
-	/* Test whether the lock tail == tail */
-"	and	%1,%0,%5						\n"
+"1:	lwarx	%0,0,%2,%7	# trylock_clean_tail			\n"
+	/* This test is necessary if there could be stealers */
+"	andi.	%1,%0,%5						\n"
+"	bne	3f							\n"
+	/* Test whether the lock tail == mytail */
+"	and	%1,%0,%6						\n"
 "	cmpw	0,%1,%3							\n"
 	/* Merge the new locked value */
 "	or	%1,%1,%4						\n"
 "	bne	2f							\n"
 	/* If the lock tail matched, then clear it, otherwise leave it. */
-"	andc	%1,%1,%5						\n"
+"	andc	%1,%1,%6						\n"
 "2:	stwcx.	%1,0,%2							\n"
 "	bne-	1b							\n"
 "\t"	PPC_ACQUIRE_BARRIER "						\n"
 "3:									\n"
 	: "=&r" (prev), "=&r" (tmp)
 	: "r" (&lock->val), "r"(tail), "r" (newval),
+	  "i" (_Q_LOCKED_VAL),
 	  "r" (_Q_TAIL_CPU_MASK),
 	  "i" (IS_ENABLED(CONFIG_PPC64))
 	: "cr0", "memory");
 
-	BUG_ON(prev & _Q_LOCKED_VAL);
-
 	return prev;
 }
 
@@ -117,6 +128,30 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
+static inline bool try_to_steal_lock(struct qspinlock *lock)
+{
+	int iters = 0;
+
+	if (!steal_spins)
+		return false;
+
+	/* Attempt to steal the lock */
+	do {
+		u32 val = READ_ONCE(lock->val);
+
+		if (unlikely(!(val & _Q_LOCKED_VAL))) {
+			if (__queued_spin_trylock_steal(lock))
+				return true;
+		} else {
+			cpu_relax();
+		}
+
+		iters++;
+	} while (iters < get_steal_spins());
+
+	return false;
+}
+
 static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 {
 	struct qnodes *qnodesp;
@@ -166,6 +201,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 		smp_rmb(); /* acquire barrier for the mcs lock */
 	}
 
+again:
 	/* We're at the head of the waitqueue, wait for the lock. */
 	for (;;) {
 		val = READ_ONCE(lock->val);
@@ -176,9 +212,14 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 	}
 
 	/* If we're the last queued, must clean up the tail. */
-	old = set_locked_clean_tail(lock, tail);
+	old = trylock_clean_tail(lock, tail);
+	if (unlikely(old & _Q_LOCKED_VAL)) {
+		BUG_ON(!maybe_stealers);
+		goto again; /* Can only be true if maybe_stealers. */
+	}
+
 	if ((old & _Q_TAIL_CPU_MASK) == tail)
-		goto release; /* Another waiter must have enqueued */
+		goto release; /* We were the tail, no next. */
 
 	/* There is a next, must wait for node->next != NULL (MCS protocol) */
 	while (!(next = READ_ONCE(node->next)))
@@ -199,6 +240,9 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 
 void queued_spin_lock_slowpath(struct qspinlock *lock)
 {
+	if (try_to_steal_lock(lock))
+		return;
+
 	queued_spin_lock_mcs_queue(lock);
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
@@ -208,3 +252,51 @@ void pv_spinlocks_init(void)
 {
 }
 #endif
+
+#include <linux/debugfs.h>
+static int steal_spins_set(void *data, u64 val)
+{
+	static DEFINE_MUTEX(lock);
+
+	/*
+	 * The lock slow path has a !maybe_stealers case that can assume
+	 * the head of queue will not see concurrent waiters. That waiter
+	 * is unsafe in the presence of stealers, so must keep them away
+	 * from one another.
+	 */
+
+	mutex_lock(&lock);
+	if (val && !steal_spins) {
+		maybe_stealers = true;
+		/* wait for queue head waiter to go away */
+		synchronize_rcu();
+		steal_spins = val;
+	} else if (!val && steal_spins) {
+		steal_spins = val;
+		/* wait for all possible stealers to go away */
+		synchronize_rcu();
+		maybe_stealers = false;
+	} else {
+		steal_spins = val;
+	}
+	mutex_unlock(&lock);
+
+	return 0;
+}
+
+static int steal_spins_get(void *data, u64 *val)
+{
+	*val = steal_spins;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
+
+static __init int spinlock_debugfs_init(void)
+{
+	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
+
+	return 0;
+}
+device_initcall(spinlock_debugfs_init);
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 05/17] powerpc/qspinlock: theft prevention to control latency
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (3 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 04/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 06/17] powerpc/qspinlock: store owner CPU in lock word Nicholas Piggin
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

Give the queue head the ability to stop stealers. After a number of
spins without sucessfully acquiring the lock, the queue head sets
this, which halts stealing and will assure it is the next owner.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/qspinlock_types.h |  8 +++-
 arch/powerpc/lib/qspinlock.c               | 53 ++++++++++++++++++++++
 2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index b9a5a52fa670..1911a8a16237 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -29,7 +29,8 @@ typedef struct qspinlock {
  * Bitfields in the lock word:
  *
  *     0: locked bit
- *  1-16: unused bits
+ *  1-15: unused bits
+ *    16: must queue bit
  * 17-31: tail cpu (+1)
  */
 #define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
@@ -39,6 +40,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS		1
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
 
+/* 0x00010000 */
+#define _Q_MUST_Q_OFFSET	16
+#define _Q_MUST_Q_BITS		1
+#define _Q_MUST_Q_VAL		(1U << _Q_MUST_Q_OFFSET)
+
 /* 0xfffe0000 */
 #define _Q_TAIL_CPU_OFFSET	17
 #define _Q_TAIL_CPU_BITS	15
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 6d67bc38b122..979b17ac7bd1 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -22,6 +22,7 @@ struct qnodes {
 /* Tuning parameters */
 static int steal_spins __read_mostly = (1<<5);
 static bool maybe_stealers __read_mostly = true;
+static int head_spins __read_mostly = (1<<8);
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -30,6 +31,11 @@ static __always_inline int get_steal_spins(void)
 	return steal_spins;
 }
 
+static __always_inline int get_head_spins(void)
+{
+	return head_spins;
+}
+
 static inline u32 encode_tail_cpu(int cpu)
 {
 	return (cpu + 1) << _Q_TAIL_CPU_OFFSET;
@@ -104,6 +110,22 @@ static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
 	return prev;
 }
 
+static __always_inline u32 set_mustq(struct qspinlock *lock)
+{
+	u32 prev;
+
+	asm volatile(
+"1:	lwarx	%0,0,%1		# set_mustq				\n"
+"	or	%0,%0,%2						\n"
+"	stwcx.	%0,0,%1							\n"
+"	bne-	1b							\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r" (_Q_MUST_Q_VAL)
+	: "cr0", "memory");
+
+	return prev;
+}
+
 static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 {
 	int cpu = decode_tail_cpu(val);
@@ -139,6 +161,9 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
 	do {
 		u32 val = READ_ONCE(lock->val);
 
+		if (val & _Q_MUST_Q_VAL)
+			break;
+
 		if (unlikely(!(val & _Q_LOCKED_VAL))) {
 			if (__queued_spin_trylock_steal(lock))
 				return true;
@@ -157,7 +182,9 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 	struct qnodes *qnodesp;
 	struct qnode *next, *node;
 	u32 val, old, tail;
+	bool mustq = false;
 	int idx;
+	int iters = 0;
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
@@ -209,6 +236,15 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 			break;
 
 		cpu_relax();
+		if (!maybe_stealers)
+			continue;
+		iters++;
+
+		if (!mustq && iters >= get_head_spins()) {
+			mustq = true;
+			set_mustq(lock);
+			val |= _Q_MUST_Q_VAL;
+		}
 	}
 
 	/* If we're the last queued, must clean up the tail. */
@@ -293,9 +329,26 @@ static int steal_spins_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
 
+static int head_spins_set(void *data, u64 val)
+{
+	head_spins = val;
+
+	return 0;
+}
+
+static int head_spins_get(void *data, u64 *val)
+{
+	*val = head_spins;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_head_spins, head_spins_get, head_spins_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
+	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
 
 	return 0;
 }
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 06/17] powerpc/qspinlock: store owner CPU in lock word
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (4 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 05/17] powerpc/qspinlock: theft prevention to control latency Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 07/17] powerpc/qspinlock: paravirt yield to lock owner Nicholas Piggin
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

Store the owner CPU number in the lock word so it may be yielded to,
as powerpc's paravirtualised simple spinlocks do.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/qspinlock.h       | 12 ++++++++++--
 arch/powerpc/include/asm/qspinlock_types.h | 12 +++++++++++-
 arch/powerpc/lib/qspinlock.c               |  2 +-
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 2a6f12a2c385..be53702e56fc 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -20,8 +20,15 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
 }
 
+static __always_inline u32 queued_spin_encode_locked_val(void)
+{
+	/* XXX: make this use lock value in paca like simple spinlocks? */
+	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
+}
+
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
+	u32 new = queued_spin_encode_locked_val();
 	u32 prev;
 
 	asm volatile(
@@ -33,7 +40,7 @@ static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 "\t"	PPC_ACQUIRE_BARRIER "						\n"
 "2:									\n"
 	: "=&r" (prev)
-	: "r" (&lock->val), "r" (_Q_LOCKED_VAL),
+	: "r" (&lock->val), "r" (new),
 	  "i" (IS_ENABLED(CONFIG_PPC64))
 	: "cr0", "memory");
 
@@ -42,6 +49,7 @@ static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 
 static __always_inline int __queued_spin_trylock_steal(struct qspinlock *lock)
 {
+	u32 new = queued_spin_encode_locked_val();
 	u32 prev, tmp;
 
 	/* Trylock may get ahead of queued nodes if it finds unlocked */
@@ -56,7 +64,7 @@ static __always_inline int __queued_spin_trylock_steal(struct qspinlock *lock)
 "\t"	PPC_ACQUIRE_BARRIER "						\n"
 "2:									\n"
 	: "=&r" (prev), "=&r" (tmp)
-	: "r" (&lock->val), "r" (_Q_LOCKED_VAL), "r" (_Q_TAIL_CPU_MASK),
+	: "r" (&lock->val), "r" (new), "r" (_Q_TAIL_CPU_MASK),
 	  "i" (IS_ENABLED(CONFIG_PPC64))
 	: "cr0", "memory");
 
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 1911a8a16237..adfeed4aa495 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -29,7 +29,8 @@ typedef struct qspinlock {
  * Bitfields in the lock word:
  *
  *     0: locked bit
- *  1-15: unused bits
+ *  1-14: lock holder cpu
+ *    15: unused bit
  *    16: must queue bit
  * 17-31: tail cpu (+1)
  */
@@ -40,6 +41,15 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS		1
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
 
+/* 0x00007ffe */
+#define _Q_OWNER_CPU_OFFSET	1
+#define _Q_OWNER_CPU_BITS	14
+#define _Q_OWNER_CPU_MASK	_Q_SET_MASK(OWNER_CPU)
+
+#if CONFIG_NR_CPUS > (1U << _Q_OWNER_CPU_BITS)
+#error "qspinlock does not support such large CONFIG_NR_CPUS"
+#endif
+
 /* 0x00010000 */
 #define _Q_MUST_Q_OFFSET	16
 #define _Q_MUST_Q_BITS		1
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 979b17ac7bd1..a5b2c0377cf9 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -55,7 +55,7 @@ static inline int decode_tail_cpu(u32 val)
  */
 static __always_inline u32 trylock_clean_tail(struct qspinlock *lock, u32 tail)
 {
-	u32 newval = _Q_LOCKED_VAL;
+	u32 newval = queued_spin_encode_locked_val();
 	u32 prev, tmp;
 
 	asm volatile(
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 07/17] powerpc/qspinlock: paravirt yield to lock owner
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (5 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 06/17] powerpc/qspinlock: store owner CPU in lock word Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 08/17] powerpc/qspinlock: implement option to yield to previous node Nicholas Piggin
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

Waiters spinning on the lock word should yield to the lock owner if the
vCPU is preempted. This improves performance when the hypervisor has
oversubscribed physical CPUs.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/lib/qspinlock.c | 99 +++++++++++++++++++++++++++++++-----
 1 file changed, 87 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index a5b2c0377cf9..bada773292af 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -5,6 +5,7 @@
 #include <linux/percpu.h>
 #include <linux/smp.h>
 #include <asm/qspinlock.h>
+#include <asm/paravirt.h>
 
 #define MAX_NODES	4
 
@@ -24,14 +25,16 @@ static int steal_spins __read_mostly = (1<<5);
 static bool maybe_stealers __read_mostly = true;
 static int head_spins __read_mostly = (1<<8);
 
+static bool pv_yield_owner __read_mostly = true;
+
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
-static __always_inline int get_steal_spins(void)
+static __always_inline int get_steal_spins(bool paravirt)
 {
 	return steal_spins;
 }
 
-static __always_inline int get_head_spins(void)
+static __always_inline int get_head_spins(bool paravirt)
 {
 	return head_spins;
 }
@@ -46,6 +49,11 @@ static inline int decode_tail_cpu(u32 val)
 	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
 }
 
+static inline int get_owner_cpu(u32 val)
+{
+	return (val & _Q_OWNER_CPU_MASK) >> _Q_OWNER_CPU_OFFSET;
+}
+
 /*
  * Try to acquire the lock if it was not already locked. If the tail matches
  * mytail then clear it, otherwise leave it unchnaged. Return previous value.
@@ -150,7 +158,45 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
-static inline bool try_to_steal_lock(struct qspinlock *lock)
+static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+{
+	int owner;
+	u32 yield_count;
+
+	BUG_ON(!(val & _Q_LOCKED_VAL));
+
+	if (!paravirt)
+		goto relax;
+
+	if (!pv_yield_owner)
+		goto relax;
+
+	owner = get_owner_cpu(val);
+	yield_count = yield_count_of(owner);
+
+	if ((yield_count & 1) == 0)
+		goto relax; /* owner vcpu is running */
+
+	/*
+	 * Read the lock word after sampling the yield count. On the other side
+	 * there may a wmb because the yield count update is done by the
+	 * hypervisor preemption and the value update by the OS, however this
+	 * ordering might reduce the chance of out of order accesses and
+	 * improve the heuristic.
+	 */
+	smp_rmb();
+
+	if (READ_ONCE(lock->val) == val) {
+		yield_to_preempted(owner, yield_count);
+		/* Don't relax if we yielded. Maybe we should? */
+		return;
+	}
+relax:
+	cpu_relax();
+}
+
+
+static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
 {
 	int iters = 0;
 
@@ -168,16 +214,16 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
 			if (__queued_spin_trylock_steal(lock))
 				return true;
 		} else {
-			cpu_relax();
+			yield_to_locked_owner(lock, val, paravirt);
 		}
 
 		iters++;
-	} while (iters < get_steal_spins());
+	} while (iters < get_steal_spins(paravirt));
 
 	return false;
 }
 
-static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
+static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool paravirt)
 {
 	struct qnodes *qnodesp;
 	struct qnode *next, *node;
@@ -235,12 +281,12 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 		if (!(val & _Q_LOCKED_VAL))
 			break;
 
-		cpu_relax();
+		yield_to_locked_owner(lock, val, paravirt);
 		if (!maybe_stealers)
 			continue;
 		iters++;
 
-		if (!mustq && iters >= get_head_spins()) {
+		if (!mustq && iters >= get_head_spins(paravirt)) {
 			mustq = true;
 			set_mustq(lock);
 			val |= _Q_MUST_Q_VAL;
@@ -276,10 +322,20 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 
 void queued_spin_lock_slowpath(struct qspinlock *lock)
 {
-	if (try_to_steal_lock(lock))
-		return;
-
-	queued_spin_lock_mcs_queue(lock);
+	/*
+	 * This looks funny, but it induces the compiler to inline both
+	 * sides of the branch rather than share code as when the condition
+	 * is passed as the paravirt argument to the functions.
+	 */
+	if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor()) {
+		if (try_to_steal_lock(lock, true))
+			return;
+		queued_spin_lock_mcs_queue(lock, true);
+	} else {
+		if (try_to_steal_lock(lock, false))
+			return;
+		queued_spin_lock_mcs_queue(lock, false);
+	}
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
@@ -345,10 +401,29 @@ static int head_spins_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_head_spins, head_spins_get, head_spins_set, "%llu\n");
 
+static int pv_yield_owner_set(void *data, u64 val)
+{
+	pv_yield_owner = !!val;
+
+	return 0;
+}
+
+static int pv_yield_owner_get(void *data, u64 *val)
+{
+	*val = pv_yield_owner;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
 	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
+	if (is_shared_processor()) {
+		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
+	}
 
 	return 0;
 }
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 08/17] powerpc/qspinlock: implement option to yield to previous node
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (6 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 07/17] powerpc/qspinlock: paravirt yield to lock owner Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 09/17] powerpc/qspinlock: allow stealing when head of queue yields Nicholas Piggin
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

Queued waiters which are not at the head of the queue don't spin on
the lock word but their qnode lock word, waiting for the previous queued
CPU to release them. Add an option which allows these waiters to yield
to the previous CPU if its vCPU is preempted.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/lib/qspinlock.c | 46 +++++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index bada773292af..838be0bc8705 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -26,6 +26,7 @@ static bool maybe_stealers __read_mostly = true;
 static int head_spins __read_mostly = (1<<8);
 
 static bool pv_yield_owner __read_mostly = true;
+static bool pv_yield_prev __read_mostly = true;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -195,6 +196,32 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
 	cpu_relax();
 }
 
+static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, u32 val, bool paravirt)
+{
+	int prev_cpu = decode_tail_cpu(val);
+	u32 yield_count;
+
+	if (!paravirt)
+		goto relax;
+
+	if (!pv_yield_prev)
+		goto relax;
+
+	yield_count = yield_count_of(prev_cpu);
+	if ((yield_count & 1) == 0)
+		goto relax; /* owner vcpu is running */
+
+	smp_rmb(); /* See yield_to_locked_owner comment */
+
+	if (!node->locked) {
+		yield_to_preempted(prev_cpu, yield_count);
+		return;
+	}
+
+relax:
+	cpu_relax();
+}
+
 
 static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
 {
@@ -269,7 +296,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 
 		/* Wait for mcs node lock to be released */
 		while (!node->locked)
-			cpu_relax();
+			yield_to_prev(lock, node, old, paravirt);
 
 		smp_rmb(); /* acquire barrier for the mcs lock */
 	}
@@ -417,12 +444,29 @@ static int pv_yield_owner_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
 
+static int pv_yield_prev_set(void *data, u64 val)
+{
+	pv_yield_prev = !!val;
+
+	return 0;
+}
+
+static int pv_yield_prev_get(void *data, u64 *val)
+{
+	*val = pv_yield_prev;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, pv_yield_prev_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
 	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
 	if (is_shared_processor()) {
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
+		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 	}
 
 	return 0;
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 09/17] powerpc/qspinlock: allow stealing when head of queue yields
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (7 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 08/17] powerpc/qspinlock: implement option to yield to previous node Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 10/17] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

If the head of queue is preventing stealing but it finds the owner vCPU
is preempted, it will yield its cycles to the owner which could cause it
to become preempted. Add an option to re-allow stealers before yielding,
and disallow them again after returning from the yield.

Disable this option by default for now, i.e., no logical change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/lib/qspinlock.c | 59 ++++++++++++++++++++++++++++++++++--
 1 file changed, 56 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 838be0bc8705..1aafafc701da 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -26,6 +26,7 @@ static bool maybe_stealers __read_mostly = true;
 static int head_spins __read_mostly = (1<<8);
 
 static bool pv_yield_owner __read_mostly = true;
+static bool pv_yield_allow_steal __read_mostly = false;
 static bool pv_yield_prev __read_mostly = true;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
@@ -135,6 +136,22 @@ static __always_inline u32 set_mustq(struct qspinlock *lock)
 	return prev;
 }
 
+static __always_inline u32 clear_mustq(struct qspinlock *lock)
+{
+	u32 prev;
+
+	asm volatile(
+"1:	lwarx	%0,0,%1		# clear_mustq				\n"
+"	andc	%0,%0,%2						\n"
+"	stwcx.	%0,0,%1							\n"
+"	bne-	1b							\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r" (_Q_MUST_Q_VAL)
+	: "cr0", "memory");
+
+	return prev;
+}
+
 static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 {
 	int cpu = decode_tail_cpu(val);
@@ -159,7 +176,7 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
-static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool mustq)
 {
 	int owner;
 	u32 yield_count;
@@ -188,7 +205,11 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
 	smp_rmb();
 
 	if (READ_ONCE(lock->val) == val) {
+		if (mustq)
+			clear_mustq(lock);
 		yield_to_preempted(owner, yield_count);
+		if (mustq)
+			set_mustq(lock);
 		/* Don't relax if we yielded. Maybe we should? */
 		return;
 	}
@@ -196,6 +217,21 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
 	cpu_relax();
 }
 
+static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+{
+	__yield_to_locked_owner(lock, val, paravirt, false);
+}
+
+static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+{
+	bool mustq = false;
+
+	if ((val & _Q_MUST_Q_VAL) && pv_yield_allow_steal)
+		mustq = true;
+
+	__yield_to_locked_owner(lock, val, paravirt, mustq);
+}
+
 static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, u32 val, bool paravirt)
 {
 	int prev_cpu = decode_tail_cpu(val);
@@ -211,7 +247,7 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 	if ((yield_count & 1) == 0)
 		goto relax; /* owner vcpu is running */
 
-	smp_rmb(); /* See yield_to_locked_owner comment */
+	smp_rmb(); /* See __yield_to_locked_owner comment */
 
 	if (!node->locked) {
 		yield_to_preempted(prev_cpu, yield_count);
@@ -308,7 +344,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		if (!(val & _Q_LOCKED_VAL))
 			break;
 
-		yield_to_locked_owner(lock, val, paravirt);
+		yield_head_to_locked_owner(lock, val, paravirt);
 		if (!maybe_stealers)
 			continue;
 		iters++;
@@ -444,6 +480,22 @@ static int pv_yield_owner_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
 
+static int pv_yield_allow_steal_set(void *data, u64 val)
+{
+	pv_yield_allow_steal = !!val;
+
+	return 0;
+}
+
+static int pv_yield_allow_steal_get(void *data, u64 *val)
+{
+	*val = pv_yield_allow_steal;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_allow_steal, pv_yield_allow_steal_get, pv_yield_allow_steal_set, "%llu\n");
+
 static int pv_yield_prev_set(void *data, u64 val)
 {
 	pv_yield_prev = !!val;
@@ -466,6 +518,7 @@ static __init int spinlock_debugfs_init(void)
 	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
 	if (is_shared_processor()) {
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
+		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 	}
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 10/17] powerpc/qspinlock: allow propagation of yield CPU down the queue
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (8 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 09/17] powerpc/qspinlock: allow stealing when head of queue yields Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 11/17] powerpc/qspinlock: add ability to prod new queue head CPU Nicholas Piggin
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

Having all CPUs poll the lock word for the owner CPU that should be
yielded to defeats most of the purpose of using MCS queueing for
scalability. Yet it may be desirable for queued waiters to to yield
to a preempted owner.

With this change, queue waiters never sample the owner CPU directly
from the lock word. The queue head (which is spinning on the lock)
propagates the owner CPU back to the next waiter if it finds the owner
has been preempted. That waiter then propagates the owner CPU back to
the next waiter, and so on.

s390 addreses this problem differenty, by having queued waiters sample
the lock word to find the owner at a low frequency. That has the
advantage of being simpler, the advantage of propagation is that the
lock word never has to be accesed by queued waiters, and the transfer
of cache lines to transmit the owner data is only required when lock
holder vCPU preemption occurs.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/lib/qspinlock.c | 79 ++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 1aafafc701da..b044760a05e9 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -12,6 +12,7 @@
 struct qnode {
 	struct qnode	*next;
 	struct qspinlock *lock;
+	int		yield_cpu;
 	u8		locked; /* 1 if lock acquired */
 };
 
@@ -28,6 +29,7 @@ static int head_spins __read_mostly = (1<<8);
 static bool pv_yield_owner __read_mostly = true;
 static bool pv_yield_allow_steal __read_mostly = false;
 static bool pv_yield_prev __read_mostly = true;
+static bool pv_yield_propagate_owner __read_mostly = true;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -232,14 +234,67 @@ static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u
 	__yield_to_locked_owner(lock, val, paravirt, mustq);
 }
 
+static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
+{
+	struct qnode *next;
+	int owner;
+
+	if (!paravirt)
+		return;
+	if (!pv_yield_propagate_owner)
+		return;
+
+	owner = get_owner_cpu(val);
+	if (*set_yield_cpu == owner)
+		return;
+
+	next = READ_ONCE(node->next);
+	if (!next)
+		return;
+
+	if (vcpu_is_preempted(owner)) {
+		next->yield_cpu = owner;
+		*set_yield_cpu = owner;
+	} else if (*set_yield_cpu != -1) {
+		next->yield_cpu = owner;
+		*set_yield_cpu = owner;
+	}
+}
+
 static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, u32 val, bool paravirt)
 {
 	int prev_cpu = decode_tail_cpu(val);
 	u32 yield_count;
+	int yield_cpu;
 
 	if (!paravirt)
 		goto relax;
 
+	if (!pv_yield_propagate_owner)
+		goto yield_prev;
+
+	yield_cpu = READ_ONCE(node->yield_cpu);
+	if (yield_cpu == -1) {
+		/* Propagate back the -1 CPU */
+		if (node->next && node->next->yield_cpu != -1)
+			node->next->yield_cpu = yield_cpu;
+		goto yield_prev;
+	}
+
+	yield_count = yield_count_of(yield_cpu);
+	if ((yield_count & 1) == 0)
+		goto yield_prev; /* owner vcpu is running */
+
+	smp_rmb();
+
+	if (yield_cpu == node->yield_cpu) {
+		if (node->next && node->next->yield_cpu != yield_cpu)
+			node->next->yield_cpu = yield_cpu;
+		yield_to_preempted(yield_cpu, yield_count);
+		return;
+	}
+
+yield_prev:
 	if (!pv_yield_prev)
 		goto relax;
 
@@ -293,6 +348,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	u32 val, old, tail;
 	bool mustq = false;
 	int idx;
+	int set_yield_cpu = -1;
 	int iters = 0;
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
@@ -314,6 +370,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	node = &qnodesp->nodes[idx];
 	node->next = NULL;
 	node->lock = lock;
+	node->yield_cpu = -1;
 	node->locked = 0;
 
 	tail = encode_tail_cpu(smp_processor_id());
@@ -334,6 +391,10 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		while (!node->locked)
 			yield_to_prev(lock, node, old, paravirt);
 
+		/* Clear out stale propagated yield_cpu */
+		if (paravirt && pv_yield_propagate_owner && node->yield_cpu != -1)
+			node->yield_cpu = -1;
+
 		smp_rmb(); /* acquire barrier for the mcs lock */
 	}
 
@@ -344,6 +405,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		if (!(val & _Q_LOCKED_VAL))
 			break;
 
+		propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
 		yield_head_to_locked_owner(lock, val, paravirt);
 		if (!maybe_stealers)
 			continue;
@@ -512,6 +574,22 @@ static int pv_yield_prev_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, pv_yield_prev_set, "%llu\n");
 
+static int pv_yield_propagate_owner_set(void *data, u64 val)
+{
+	pv_yield_propagate_owner = !!val;
+
+	return 0;
+}
+
+static int pv_yield_propagate_owner_get(void *data, u64 *val)
+{
+	*val = pv_yield_propagate_owner;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
@@ -520,6 +598,7 @@ static __init int spinlock_debugfs_init(void)
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
 		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
+		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
 	}
 
 	return 0;
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 11/17] powerpc/qspinlock: add ability to prod new queue head CPU
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (9 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 10/17] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 12/17] powerpc/qspinlock: allow lock stealing in trylock and lock fastpath Nicholas Piggin
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

After the head of the queue acquires the lock, it releases the
next waiter in the queue to become the new head. Add an option
to prod the new head if its vCPU was preempted. This may only
have an effect if queue waiters are yielding.

Disable this option by default for now, i.e., no logical change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/lib/qspinlock.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index b044760a05e9..03329f4ed238 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -12,6 +12,7 @@
 struct qnode {
 	struct qnode	*next;
 	struct qspinlock *lock;
+	int		cpu;
 	int		yield_cpu;
 	u8		locked; /* 1 if lock acquired */
 };
@@ -30,6 +31,7 @@ static bool pv_yield_owner __read_mostly = true;
 static bool pv_yield_allow_steal __read_mostly = false;
 static bool pv_yield_prev __read_mostly = true;
 static bool pv_yield_propagate_owner __read_mostly = true;
+static bool pv_prod_head __read_mostly = false;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -370,10 +372,11 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	node = &qnodesp->nodes[idx];
 	node->next = NULL;
 	node->lock = lock;
+	node->cpu = smp_processor_id();
 	node->yield_cpu = -1;
 	node->locked = 0;
 
-	tail = encode_tail_cpu(smp_processor_id());
+	tail = encode_tail_cpu(node->cpu);
 
 	old = publish_tail_cpu(lock, tail);
 
@@ -439,7 +442,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	 * this store to locked. The corresponding barrier is the smp_rmb()
 	 * acquire barrier for mcs lock, above.
 	 */
-	WRITE_ONCE(next->locked, 1);
+	if (paravirt && pv_prod_head) {
+		int next_cpu = next->cpu;
+		WRITE_ONCE(next->locked, 1);
+		if (vcpu_is_preempted(next_cpu))
+			prod_cpu(next_cpu);
+	} else {
+		WRITE_ONCE(next->locked, 1);
+	}
 
 release:
 	qnodesp->count--; /* release the node */
@@ -590,6 +600,22 @@ static int pv_yield_propagate_owner_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
 
+static int pv_prod_head_set(void *data, u64 val)
+{
+	pv_prod_head = !!val;
+
+	return 0;
+}
+
+static int pv_prod_head_get(void *data, u64 *val)
+{
+	*val = pv_prod_head;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, pv_prod_head_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
@@ -599,6 +625,7 @@ static __init int spinlock_debugfs_init(void)
 		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
+		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);
 	}
 
 	return 0;
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 12/17] powerpc/qspinlock: allow lock stealing in trylock and lock fastpath
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (10 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 11/17] powerpc/qspinlock: add ability to prod new queue head CPU Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 13/17] powerpc/qspinlock: use spin_begin/end API Nicholas Piggin
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

This change allows trylock to steal the lock. It also allows the initial
lock attempt to steal the lock rather than bailing out and going to the
slow path.

This gives trylock more strength: without this a continually-contended
lock will never permit a trylock to succeed. With this change, the
trylock has a small but non-zero chance.

It also also gives the lock fastpath most of the benefit of passing the
reservation back through to the steal loop in the slow path without the
complexity.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/qspinlock.h | 22 ++++++++++++++++++++--
 arch/powerpc/lib/qspinlock.c         |  9 +++++++++
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index be53702e56fc..c9fa83bba1d5 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -5,6 +5,15 @@
 #include <linux/compiler.h>
 #include <asm/qspinlock_types.h>
 
+/*
+ * The trylock itself may steal. This makes trylocks slightly stronger, and
+ * might make spin locks slightly more efficient when stealing.
+ *
+ * This is compile-time, so if true then there may always be stealers, so the
+ * nosteal paths become unused.
+ */
+#define _Q_SPIN_TRY_LOCK_STEAL 1
+
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
 	return READ_ONCE(lock->val);
@@ -26,13 +35,14 @@ static __always_inline u32 queued_spin_encode_locked_val(void)
 	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
 }
 
-static __always_inline int queued_spin_trylock(struct qspinlock *lock)
+static __always_inline int __queued_spin_trylock_nosteal(struct qspinlock *lock)
 {
 	u32 new = queued_spin_encode_locked_val();
 	u32 prev;
 
+	/* Trylock succeeds only when unlocked and no queued nodes */
 	asm volatile(
-"1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"
+"1:	lwarx	%0,0,%1,%3	# __queued_spin_trylock_nosteal		\n"
 "	cmpwi	0,%0,0							\n"
 "	bne-	2f							\n"
 "	stwcx.	%2,0,%1							\n"
@@ -71,6 +81,14 @@ static __always_inline int __queued_spin_trylock_steal(struct qspinlock *lock)
 	return likely(!(prev & ~_Q_TAIL_CPU_MASK));
 }
 
+static __always_inline int queued_spin_trylock(struct qspinlock *lock)
+{
+	if (!_Q_SPIN_TRY_LOCK_STEAL)
+		return __queued_spin_trylock_nosteal(lock);
+	else
+		return __queued_spin_trylock_steal(lock);
+}
+
 void queued_spin_lock_slowpath(struct qspinlock *lock);
 
 static __always_inline void queued_spin_lock(struct qspinlock *lock)
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 03329f4ed238..1cb47a6478a0 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -24,7 +24,11 @@ struct qnodes {
 
 /* Tuning parameters */
 static int steal_spins __read_mostly = (1<<5);
+#if _Q_SPIN_TRY_LOCK_STEAL == 1
+static const bool maybe_stealers = true;
+#else
 static bool maybe_stealers __read_mostly = true;
+#endif
 static int head_spins __read_mostly = (1<<8);
 
 static bool pv_yield_owner __read_mostly = true;
@@ -483,6 +487,10 @@ void pv_spinlocks_init(void)
 #include <linux/debugfs.h>
 static int steal_spins_set(void *data, u64 val)
 {
+#if _Q_SPIN_TRY_LOCK_STEAL == 1
+	/* MAYBE_STEAL remains true */
+	steal_spins = val;
+#else
 	static DEFINE_MUTEX(lock);
 
 	/*
@@ -507,6 +515,7 @@ static int steal_spins_set(void *data, u64 val)
 		steal_spins = val;
 	}
 	mutex_unlock(&lock);
+#endif
 
 	return 0;
 }
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 13/17] powerpc/qspinlock: use spin_begin/end API
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (11 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 12/17] powerpc/qspinlock: allow lock stealing in trylock and lock fastpath Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 14/17] powerpc/qspinlock: reduce remote node steal spins Nicholas Piggin
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

Use the spin_begin/spin_cpu_relax/spin_end APIs in qspinlock, which helps
to prevent threads issuing a lot of expensive priority nops which may not
have much effect due to immediately executing low then medium priority.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/lib/qspinlock.c | 39 ++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 1cb47a6478a0..70f924296b36 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -184,6 +184,7 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
+/* Called inside spin_begin() */
 static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool mustq)
 {
 	int owner;
@@ -203,6 +204,8 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
 	if ((yield_count & 1) == 0)
 		goto relax; /* owner vcpu is running */
 
+	spin_end();
+
 	/*
 	 * Read the lock word after sampling the yield count. On the other side
 	 * there may a wmb because the yield count update is done by the
@@ -218,18 +221,22 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
 		yield_to_preempted(owner, yield_count);
 		if (mustq)
 			set_mustq(lock);
+		spin_begin();
 		/* Don't relax if we yielded. Maybe we should? */
 		return;
 	}
+	spin_begin();
 relax:
-	cpu_relax();
+	spin_cpu_relax();
 }
 
+/* Called inside spin_begin() */
 static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
 {
 	__yield_to_locked_owner(lock, val, paravirt, false);
 }
 
+/* Called inside spin_begin() */
 static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
 {
 	bool mustq = false;
@@ -267,6 +274,7 @@ static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int
 	}
 }
 
+/* Called inside spin_begin() */
 static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, u32 val, bool paravirt)
 {
 	int prev_cpu = decode_tail_cpu(val);
@@ -291,14 +299,18 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 	if ((yield_count & 1) == 0)
 		goto yield_prev; /* owner vcpu is running */
 
+	spin_end();
+
 	smp_rmb();
 
 	if (yield_cpu == node->yield_cpu) {
 		if (node->next && node->next->yield_cpu != yield_cpu)
 			node->next->yield_cpu = yield_cpu;
 		yield_to_preempted(yield_cpu, yield_count);
+		spin_begin();
 		return;
 	}
+	spin_begin();
 
 yield_prev:
 	if (!pv_yield_prev)
@@ -308,15 +320,19 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 	if ((yield_count & 1) == 0)
 		goto relax; /* owner vcpu is running */
 
+	spin_end();
+
 	smp_rmb(); /* See __yield_to_locked_owner comment */
 
 	if (!node->locked) {
 		yield_to_preempted(prev_cpu, yield_count);
+		spin_begin();
 		return;
 	}
+	spin_begin();
 
 relax:
-	cpu_relax();
+	spin_cpu_relax();
 }
 
 
@@ -328,6 +344,8 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 		return false;
 
 	/* Attempt to steal the lock */
+	spin_begin();
+
 	do {
 		u32 val = READ_ONCE(lock->val);
 
@@ -335,8 +353,10 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 			break;
 
 		if (unlikely(!(val & _Q_LOCKED_VAL))) {
+			spin_end();
 			if (__queued_spin_trylock_steal(lock))
 				return true;
+			spin_begin();
 		} else {
 			yield_to_locked_owner(lock, val, paravirt);
 		}
@@ -344,6 +364,8 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 		iters++;
 	} while (iters < get_steal_spins(paravirt));
 
+	spin_end();
+
 	return false;
 }
 
@@ -395,8 +417,10 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		WRITE_ONCE(prev->next, node);
 
 		/* Wait for mcs node lock to be released */
+		spin_begin();
 		while (!node->locked)
 			yield_to_prev(lock, node, old, paravirt);
+		spin_end();
 
 		/* Clear out stale propagated yield_cpu */
 		if (paravirt && pv_yield_propagate_owner && node->yield_cpu != -1)
@@ -407,6 +431,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 
 again:
 	/* We're at the head of the waitqueue, wait for the lock. */
+	spin_begin();
 	for (;;) {
 		val = READ_ONCE(lock->val);
 		if (!(val & _Q_LOCKED_VAL))
@@ -424,6 +449,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 			val |= _Q_MUST_Q_VAL;
 		}
 	}
+	spin_end();
 
 	/* If we're the last queued, must clean up the tail. */
 	old = trylock_clean_tail(lock, tail);
@@ -436,8 +462,13 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		goto release; /* We were the tail, no next. */
 
 	/* There is a next, must wait for node->next != NULL (MCS protocol) */
-	while (!(next = READ_ONCE(node->next)))
-		cpu_relax();
+	next = READ_ONCE(node->next);
+	if (!next) {
+		spin_begin();
+		while (!(next = READ_ONCE(node->next)))
+			cpu_relax();
+		spin_end();
+	}
 
 	/*
 	 * Unlock the next mcs waiter node. Release barrier is not required
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 14/17] powerpc/qspinlock: reduce remote node steal spins
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (12 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 13/17] powerpc/qspinlock: use spin_begin/end API Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 15/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner Nicholas Piggin
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

Allow for a reduction in the number of times a CPU from a different
node than the owner can attempt to steal the lock before queueing.
This could bias the transfer behaviour of the lock across the
machine and reduce NUMA crossings.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/lib/qspinlock.c | 43 +++++++++++++++++++++++++++++++++---
 1 file changed, 40 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 70f924296b36..7f6b41627351 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -4,6 +4,7 @@
 #include <linux/export.h>
 #include <linux/percpu.h>
 #include <linux/smp.h>
+#include <linux/topology.h>
 #include <asm/qspinlock.h>
 #include <asm/paravirt.h>
 
@@ -24,6 +25,7 @@ struct qnodes {
 
 /* Tuning parameters */
 static int steal_spins __read_mostly = (1<<5);
+static int remote_steal_spins __read_mostly = (1<<2);
 #if _Q_SPIN_TRY_LOCK_STEAL == 1
 static const bool maybe_stealers = true;
 #else
@@ -44,6 +46,11 @@ static __always_inline int get_steal_spins(bool paravirt)
 	return steal_spins;
 }
 
+static __always_inline int get_remote_steal_spins(bool paravirt)
+{
+	return remote_steal_spins;
+}
+
 static __always_inline int get_head_spins(bool paravirt)
 {
 	return head_spins;
@@ -335,10 +342,24 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 	spin_cpu_relax();
 }
 
+static __always_inline bool steal_break(u32 val, int iters, bool paravirt)
+{
+	if (iters >= get_steal_spins(paravirt))
+		return true;
+
+	if (IS_ENABLED(CONFIG_NUMA) &&
+			(iters >= get_remote_steal_spins(paravirt))) {
+		int cpu = get_owner_cpu(val);
+		if (numa_node_id() != cpu_to_node(cpu))
+			return true;
+	}
+	return false;
+}
 
 static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
 {
 	int iters = 0;
+	u32 val;
 
 	if (!steal_spins)
 		return false;
@@ -347,8 +368,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 	spin_begin();
 
 	do {
-		u32 val = READ_ONCE(lock->val);
-
+		val = READ_ONCE(lock->val);
 		if (val & _Q_MUST_Q_VAL)
 			break;
 
@@ -362,7 +382,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 		}
 
 		iters++;
-	} while (iters < get_steal_spins(paravirt));
+	} while (!steal_break(val, iters, paravirt));
 
 	spin_end();
 
@@ -560,6 +580,22 @@ static int steal_spins_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
 
+static int remote_steal_spins_set(void *data, u64 val)
+{
+	remote_steal_spins = val;
+
+	return 0;
+}
+
+static int remote_steal_spins_get(void *data, u64 *val)
+{
+	*val = remote_steal_spins;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_remote_steal_spins, remote_steal_spins_get, remote_steal_spins_set, "%llu\n");
+
 static int head_spins_set(void *data, u64 val)
 {
 	head_spins = val;
@@ -659,6 +695,7 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, pv_prod_head_set, "
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
+	debugfs_create_file("qspl_remote_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_remote_steal_spins);
 	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
 	if (is_shared_processor()) {
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 15/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (13 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 14/17] powerpc/qspinlock: reduce remote node steal spins Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 16/17] powerpc/qspinlock: provide accounting and options for sleepy locks Nicholas Piggin
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

Provide an option that holds off queueing indefinitely while the lock
owner is preempted. This could reduce queueing latencies for very
overcommitted vcpu situations.

This is disabled by default.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/lib/qspinlock.c | 77 +++++++++++++++++++++++++++++-------
 1 file changed, 62 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 7f6b41627351..1c9079489b50 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -35,6 +35,7 @@ static int head_spins __read_mostly = (1<<8);
 
 static bool pv_yield_owner __read_mostly = true;
 static bool pv_yield_allow_steal __read_mostly = false;
+static bool pv_spin_on_preempted_owner __read_mostly = false;
 static bool pv_yield_prev __read_mostly = true;
 static bool pv_yield_propagate_owner __read_mostly = true;
 static bool pv_prod_head __read_mostly = false;
@@ -191,11 +192,12 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
-/* Called inside spin_begin() */
-static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool mustq)
+/* Called inside spin_begin(). Returns whether or not the vCPU was preempted. */
+static __always_inline bool __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool mustq)
 {
 	int owner;
 	u32 yield_count;
+	bool preempted = false;
 
 	BUG_ON(!(val & _Q_LOCKED_VAL));
 
@@ -213,6 +215,8 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
 
 	spin_end();
 
+	preempted = true;
+
 	/*
 	 * Read the lock word after sampling the yield count. On the other side
 	 * there may a wmb because the yield count update is done by the
@@ -229,29 +233,32 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
 		if (mustq)
 			set_mustq(lock);
 		spin_begin();
+
 		/* Don't relax if we yielded. Maybe we should? */
-		return;
+		return preempted;
 	}
 	spin_begin();
 relax:
 	spin_cpu_relax();
+
+	return preempted;
 }
 
-/* Called inside spin_begin() */
-static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+/* Called inside spin_begin(). Returns whether or not the vCPU was preempted. */
+static __always_inline bool yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
 {
-	__yield_to_locked_owner(lock, val, paravirt, false);
+	return __yield_to_locked_owner(lock, val, paravirt, false);
 }
 
-/* Called inside spin_begin() */
-static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+/* Called inside spin_begin(). Returns whether or not the vCPU was preempted. */
+static __always_inline bool yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
 {
 	bool mustq = false;
 
 	if ((val & _Q_MUST_Q_VAL) && pv_yield_allow_steal)
 		mustq = true;
 
-	__yield_to_locked_owner(lock, val, paravirt, mustq);
+	return __yield_to_locked_owner(lock, val, paravirt, mustq);
 }
 
 static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
@@ -361,13 +368,16 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 	int iters = 0;
 	u32 val;
 
-	if (!steal_spins)
+	if (!steal_spins) {
+		/* XXX: should spin_on_preempted_owner do anything here? */
 		return false;
+	}
 
 	/* Attempt to steal the lock */
 	spin_begin();
-
 	do {
+		bool preempted = false;
+
 		val = READ_ONCE(lock->val);
 		if (val & _Q_MUST_Q_VAL)
 			break;
@@ -378,10 +388,23 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 				return true;
 			spin_begin();
 		} else {
-			yield_to_locked_owner(lock, val, paravirt);
+			preempted = yield_to_locked_owner(lock, val, paravirt);
 		}
 
-		iters++;
+		if (preempted) {
+			if (!pv_spin_on_preempted_owner)
+				iters++;
+			/*
+			 * pv_spin_on_preempted_owner don't increase iters
+			 * while the owner is preempted -- we won't interfere
+			 * with it by definition. This could introduce some
+			 * latency issue if we continually observe preempted
+			 * owners, but hopefully that's a rare corner case of
+			 * a badly oversubscribed system.
+			 */
+		} else {
+			iters++;
+		}
 	} while (!steal_break(val, iters, paravirt));
 
 	spin_end();
@@ -453,15 +476,22 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	/* We're at the head of the waitqueue, wait for the lock. */
 	spin_begin();
 	for (;;) {
+		bool preempted;
+
 		val = READ_ONCE(lock->val);
 		if (!(val & _Q_LOCKED_VAL))
 			break;
 
 		propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
-		yield_head_to_locked_owner(lock, val, paravirt);
+		preempted = yield_head_to_locked_owner(lock, val, paravirt);
 		if (!maybe_stealers)
 			continue;
-		iters++;
+		if (preempted) {
+			if (!pv_spin_on_preempted_owner)
+				iters++;
+		} else {
+			iters++;
+		}
 
 		if (!mustq && iters >= get_head_spins(paravirt)) {
 			mustq = true;
@@ -644,6 +674,22 @@ static int pv_yield_allow_steal_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_allow_steal, pv_yield_allow_steal_get, pv_yield_allow_steal_set, "%llu\n");
 
+static int pv_spin_on_preempted_owner_set(void *data, u64 val)
+{
+	pv_spin_on_preempted_owner = !!val;
+
+	return 0;
+}
+
+static int pv_spin_on_preempted_owner_get(void *data, u64 *val)
+{
+	*val = pv_spin_on_preempted_owner;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_spin_on_preempted_owner, pv_spin_on_preempted_owner_get, pv_spin_on_preempted_owner_set, "%llu\n");
+
 static int pv_yield_prev_set(void *data, u64 val)
 {
 	pv_yield_prev = !!val;
@@ -700,6 +746,7 @@ static __init int spinlock_debugfs_init(void)
 	if (is_shared_processor()) {
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
 		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
+		debugfs_create_file("qspl_pv_spin_on_preempted_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_spin_on_preempted_owner);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
 		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 16/17] powerpc/qspinlock: provide accounting and options for sleepy locks
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (14 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 15/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-26  9:59 ` [PATCH v3 17/17] powerpc/qspinlock: add compile-time tuning adjustments Nicholas Piggin
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

Finding the owner or a queued waiter on a lock with a preempted vcpu
is indicative of an oversubscribed guest causing the lock to get into
trouble. Provide some options to detect this situation and have new
CPUs avoid queueing for a longer time (more steal iterations) to
minimise the problems caused by vcpu preemption on the queue.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/qspinlock_types.h |   7 +-
 arch/powerpc/lib/qspinlock.c               | 242 +++++++++++++++++++--
 2 files changed, 230 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index adfeed4aa495..4766a7aa03cb 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -30,7 +30,7 @@ typedef struct qspinlock {
  *
  *     0: locked bit
  *  1-14: lock holder cpu
- *    15: unused bit
+ *    15: lock owner or queuer vcpus observed to be preempted bit
  *    16: must queue bit
  * 17-31: tail cpu (+1)
  */
@@ -50,6 +50,11 @@ typedef struct qspinlock {
 #error "qspinlock does not support such large CONFIG_NR_CPUS"
 #endif
 
+/* 0x00008000 */
+#define _Q_SLEEPY_OFFSET	15
+#define _Q_SLEEPY_BITS		1
+#define _Q_SLEEPY_VAL		(1U << _Q_SLEEPY_OFFSET)
+
 /* 0x00010000 */
 #define _Q_MUST_Q_OFFSET	16
 #define _Q_MUST_Q_BITS		1
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 1c9079489b50..9a31b6147a23 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -5,6 +5,7 @@
 #include <linux/percpu.h>
 #include <linux/smp.h>
 #include <linux/topology.h>
+#include <linux/sched/clock.h>
 #include <asm/qspinlock.h>
 #include <asm/paravirt.h>
 
@@ -36,25 +37,56 @@ static int head_spins __read_mostly = (1<<8);
 static bool pv_yield_owner __read_mostly = true;
 static bool pv_yield_allow_steal __read_mostly = false;
 static bool pv_spin_on_preempted_owner __read_mostly = false;
+static bool pv_sleepy_lock __read_mostly = true;
+static bool pv_sleepy_lock_sticky __read_mostly = false;
+static u64 pv_sleepy_lock_interval_ns __read_mostly = 0;
+static int pv_sleepy_lock_factor __read_mostly = 256;
 static bool pv_yield_prev __read_mostly = true;
 static bool pv_yield_propagate_owner __read_mostly = true;
 static bool pv_prod_head __read_mostly = false;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
+static DEFINE_PER_CPU_ALIGNED(u64, sleepy_lock_seen_clock);
 
-static __always_inline int get_steal_spins(bool paravirt)
+static __always_inline bool recently_sleepy(void)
 {
-	return steal_spins;
+	/* pv_sleepy_lock is true when this is called */
+	if (pv_sleepy_lock_interval_ns) {
+		u64 seen = this_cpu_read(sleepy_lock_seen_clock);
+
+		if (seen) {
+			u64 delta = sched_clock() - seen;
+			if (delta < pv_sleepy_lock_interval_ns)
+				return true;
+			this_cpu_write(sleepy_lock_seen_clock, 0);
+		}
+	}
+
+	return false;
 }
 
-static __always_inline int get_remote_steal_spins(bool paravirt)
+static __always_inline int get_steal_spins(bool paravirt, bool sleepy)
 {
-	return remote_steal_spins;
+	if (paravirt && sleepy)
+		return steal_spins * pv_sleepy_lock_factor;
+	else
+		return steal_spins;
 }
 
-static __always_inline int get_head_spins(bool paravirt)
+static __always_inline int get_remote_steal_spins(bool paravirt, bool sleepy)
 {
-	return head_spins;
+	if (paravirt && sleepy)
+		return remote_steal_spins * pv_sleepy_lock_factor;
+	else
+		return remote_steal_spins;
+}
+
+static __always_inline int get_head_spins(bool paravirt, bool sleepy)
+{
+	if (paravirt && sleepy)
+		return head_spins * pv_sleepy_lock_factor;
+	else
+		return head_spins;
 }
 
 static inline u32 encode_tail_cpu(int cpu)
@@ -168,6 +200,56 @@ static __always_inline u32 clear_mustq(struct qspinlock *lock)
 	return prev;
 }
 
+static __always_inline bool try_set_sleepy(struct qspinlock *lock, u32 old)
+{
+	u32 prev;
+	u32 new = old | _Q_SLEEPY_VAL;
+
+	BUG_ON(!(old & _Q_LOCKED_VAL));
+	BUG_ON(old & _Q_SLEEPY_VAL);
+
+	asm volatile(
+"1:	lwarx	%0,0,%1		# try_set_sleepy			\n"
+"	cmpw	0,%0,%2							\n"
+"	bne-	2f							\n"
+"	stwcx.	%3,0,%1							\n"
+"	bne-	1b							\n"
+"2:									\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r"(old), "r" (new)
+	: "cr0", "memory");
+
+	return likely(prev == old);
+}
+
+static __always_inline void seen_sleepy_owner(struct qspinlock *lock, u32 val)
+{
+	if (pv_sleepy_lock) {
+		if (pv_sleepy_lock_interval_ns)
+			this_cpu_write(sleepy_lock_seen_clock, sched_clock());
+		if (!(val & _Q_SLEEPY_VAL))
+			try_set_sleepy(lock, val);
+	}
+}
+
+static __always_inline void seen_sleepy_lock(void)
+{
+	if (pv_sleepy_lock && pv_sleepy_lock_interval_ns)
+		this_cpu_write(sleepy_lock_seen_clock, sched_clock());
+}
+
+static __always_inline void seen_sleepy_node(struct qspinlock *lock, u32 val)
+{
+	if (pv_sleepy_lock) {
+		if (pv_sleepy_lock_interval_ns)
+			this_cpu_write(sleepy_lock_seen_clock, sched_clock());
+		if (val & _Q_LOCKED_VAL) {
+			if (!(val & _Q_SLEEPY_VAL))
+				try_set_sleepy(lock, val);
+		}
+	}
+}
+
 static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 {
 	int cpu = decode_tail_cpu(val);
@@ -215,6 +297,7 @@ static __always_inline bool __yield_to_locked_owner(struct qspinlock *lock, u32
 
 	spin_end();
 
+	seen_sleepy_owner(lock, val);
 	preempted = true;
 
 	/*
@@ -289,11 +372,12 @@ static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int
 }
 
 /* Called inside spin_begin() */
-static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, u32 val, bool paravirt)
+static __always_inline bool yield_to_prev(struct qspinlock *lock, struct qnode *node, u32 val, bool paravirt)
 {
 	int prev_cpu = decode_tail_cpu(val);
 	u32 yield_count;
 	int yield_cpu;
+	bool preempted = false;
 
 	if (!paravirt)
 		goto relax;
@@ -315,6 +399,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 
 	spin_end();
 
+	preempted = true;
+	seen_sleepy_node(lock, val);
+
 	smp_rmb();
 
 	if (yield_cpu == node->yield_cpu) {
@@ -322,7 +409,7 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 			node->next->yield_cpu = yield_cpu;
 		yield_to_preempted(yield_cpu, yield_count);
 		spin_begin();
-		return;
+		return preempted;
 	}
 	spin_begin();
 
@@ -336,26 +423,31 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 
 	spin_end();
 
+	preempted = true;
+	seen_sleepy_node(lock, val);
+
 	smp_rmb(); /* See __yield_to_locked_owner comment */
 
 	if (!node->locked) {
 		yield_to_preempted(prev_cpu, yield_count);
 		spin_begin();
-		return;
+		return preempted;
 	}
 	spin_begin();
 
 relax:
 	spin_cpu_relax();
+
+	return preempted;
 }
 
-static __always_inline bool steal_break(u32 val, int iters, bool paravirt)
+static __always_inline bool steal_break(u32 val, int iters, bool paravirt, bool sleepy)
 {
-	if (iters >= get_steal_spins(paravirt))
+	if (iters >= get_steal_spins(paravirt, sleepy))
 		return true;
 
 	if (IS_ENABLED(CONFIG_NUMA) &&
-			(iters >= get_remote_steal_spins(paravirt))) {
+			(iters >= get_remote_steal_spins(paravirt, sleepy))) {
 		int cpu = get_owner_cpu(val);
 		if (numa_node_id() != cpu_to_node(cpu))
 			return true;
@@ -365,6 +457,8 @@ static __always_inline bool steal_break(u32 val, int iters, bool paravirt)
 
 static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
 {
+	bool seen_preempted = false;
+	bool sleepy = false;
 	int iters = 0;
 	u32 val;
 
@@ -391,7 +485,25 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 			preempted = yield_to_locked_owner(lock, val, paravirt);
 		}
 
+		if (paravirt && pv_sleepy_lock) {
+			if (!sleepy) {
+				if (val & _Q_SLEEPY_VAL) {
+					seen_sleepy_lock();
+					sleepy = true;
+				} else if (recently_sleepy()) {
+					sleepy = true;
+				}
+			}
+			if (pv_sleepy_lock_sticky && seen_preempted &&
+					!(val & _Q_SLEEPY_VAL)) {
+				if (try_set_sleepy(lock, val))
+					val |= _Q_SLEEPY_VAL;
+			}
+		}
+
 		if (preempted) {
+			seen_preempted = true;
+			sleepy = true;
 			if (!pv_spin_on_preempted_owner)
 				iters++;
 			/*
@@ -405,7 +517,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 		} else {
 			iters++;
 		}
-	} while (!steal_break(val, iters, paravirt));
+	} while (!steal_break(val, iters, paravirt, sleepy));
 
 	spin_end();
 
@@ -417,6 +529,8 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	struct qnodes *qnodesp;
 	struct qnode *next, *node;
 	u32 val, old, tail;
+	bool seen_preempted = false;
+	bool sleepy = false;
 	bool mustq = false;
 	int idx;
 	int set_yield_cpu = -1;
@@ -461,8 +575,10 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 
 		/* Wait for mcs node lock to be released */
 		spin_begin();
-		while (!node->locked)
-			yield_to_prev(lock, node, old, paravirt);
+		while (!node->locked) {
+			if (yield_to_prev(lock, node, old, paravirt))
+				seen_preempted = true;
+		}
 		spin_end();
 
 		/* Clear out stale propagated yield_cpu */
@@ -472,8 +588,8 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		smp_rmb(); /* acquire barrier for the mcs lock */
 	}
 
-again:
 	/* We're at the head of the waitqueue, wait for the lock. */
+again:
 	spin_begin();
 	for (;;) {
 		bool preempted;
@@ -482,18 +598,40 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		if (!(val & _Q_LOCKED_VAL))
 			break;
 
+		if (paravirt && pv_sleepy_lock && maybe_stealers) {
+			if (!sleepy) {
+				if (val & _Q_SLEEPY_VAL) {
+					seen_sleepy_lock();
+					sleepy = true;
+				} else if (recently_sleepy()) {
+					sleepy = true;
+				}
+			}
+			if (pv_sleepy_lock_sticky && seen_preempted &&
+					!(val & _Q_SLEEPY_VAL)) {
+				if (try_set_sleepy(lock, val))
+					val |= _Q_SLEEPY_VAL;
+			}
+		}
+
 		propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
 		preempted = yield_head_to_locked_owner(lock, val, paravirt);
 		if (!maybe_stealers)
 			continue;
-		if (preempted) {
+
+		if (preempted)
+			seen_preempted = true;
+
+		if (paravirt && preempted) {
+			sleepy = true;
+
 			if (!pv_spin_on_preempted_owner)
 				iters++;
 		} else {
 			iters++;
 		}
 
-		if (!mustq && iters >= get_head_spins(paravirt)) {
+		if (!mustq && iters >= get_head_spins(paravirt, sleepy)) {
 			mustq = true;
 			set_mustq(lock);
 			val |= _Q_MUST_Q_VAL;
@@ -690,6 +828,70 @@ static int pv_spin_on_preempted_owner_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_spin_on_preempted_owner, pv_spin_on_preempted_owner_get, pv_spin_on_preempted_owner_set, "%llu\n");
 
+static int pv_sleepy_lock_set(void *data, u64 val)
+{
+	pv_sleepy_lock = !!val;
+
+	return 0;
+}
+
+static int pv_sleepy_lock_get(void *data, u64 *val)
+{
+	*val = pv_sleepy_lock;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock, pv_sleepy_lock_get, pv_sleepy_lock_set, "%llu\n");
+
+static int pv_sleepy_lock_sticky_set(void *data, u64 val)
+{
+	pv_sleepy_lock_sticky = !!val;
+
+	return 0;
+}
+
+static int pv_sleepy_lock_sticky_get(void *data, u64 *val)
+{
+	*val = pv_sleepy_lock_sticky;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_sticky, pv_sleepy_lock_sticky_get, pv_sleepy_lock_sticky_set, "%llu\n");
+
+static int pv_sleepy_lock_interval_ns_set(void *data, u64 val)
+{
+	pv_sleepy_lock_interval_ns = val;
+
+	return 0;
+}
+
+static int pv_sleepy_lock_interval_ns_get(void *data, u64 *val)
+{
+	*val = pv_sleepy_lock_interval_ns;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_interval_ns, pv_sleepy_lock_interval_ns_get, pv_sleepy_lock_interval_ns_set, "%llu\n");
+
+static int pv_sleepy_lock_factor_set(void *data, u64 val)
+{
+	pv_sleepy_lock_factor = val;
+
+	return 0;
+}
+
+static int pv_sleepy_lock_factor_get(void *data, u64 *val)
+{
+	*val = pv_sleepy_lock_factor;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_factor, pv_sleepy_lock_factor_get, pv_sleepy_lock_factor_set, "%llu\n");
+
 static int pv_yield_prev_set(void *data, u64 val)
 {
 	pv_yield_prev = !!val;
@@ -747,6 +949,10 @@ static __init int spinlock_debugfs_init(void)
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
 		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
 		debugfs_create_file("qspl_pv_spin_on_preempted_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_spin_on_preempted_owner);
+		debugfs_create_file("qspl_pv_sleepy_lock", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock);
+		debugfs_create_file("qspl_pv_sleepy_lock_sticky", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_sticky);
+		debugfs_create_file("qspl_pv_sleepy_lock_interval_ns", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_interval_ns);
+		debugfs_create_file("qspl_pv_sleepy_lock_factor", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_factor);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
 		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 17/17] powerpc/qspinlock: add compile-time tuning adjustments
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (15 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 16/17] powerpc/qspinlock: provide accounting and options for sleepy locks Nicholas Piggin
@ 2022-11-26  9:59 ` Nicholas Piggin
  2022-11-28  3:11 ` [PATCH v3 real 01/17] powerpc/qspinlock: powerpc qspinlock implementation Nicholas Piggin
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-26  9:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour, Nicholas Piggin

This adds compile-time options that allow the EH lock hint bit to be
enabled or disabled, and adds some new options that may or may not
help matters. To help with experimentation and tuning.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/qspinlock.h | 61 ++++++++++++++++++++++++++--
 arch/powerpc/lib/qspinlock.c         | 39 ++++++++++++++++--
 2 files changed, 94 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index c9fa83bba1d5..9e71f8de7b12 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -5,15 +5,68 @@
 #include <linux/compiler.h>
 #include <asm/qspinlock_types.h>
 
+#ifdef CONFIG_PPC64
+/*
+ * Use the EH=1 hint for accesses that result in the lock being acquired.
+ * The hardware is supposed to optimise this pattern by holding the lock
+ * cacheline longer, and releasing when a store to the same memory (the
+ * unlock) is performed.
+ */
+#define _Q_SPIN_EH_HINT 1
+#else
+#define _Q_SPIN_EH_HINT 0
+#endif
+
 /*
  * The trylock itself may steal. This makes trylocks slightly stronger, and
- * might make spin locks slightly more efficient when stealing.
+ * makes locks slightly more efficient when stealing.
  *
  * This is compile-time, so if true then there may always be stealers, so the
  * nosteal paths become unused.
  */
 #define _Q_SPIN_TRY_LOCK_STEAL 1
 
+/*
+ * Put a speculation barrier after testing the lock/node and finding it
+ * busy. Try to prevent pointless speculation in slow paths.
+ *
+ * Slows down the lockstorm microbenchmark with no stealing, where locking
+ * is purely FIFO through the queue. May have more benefit in real workload
+ * where speculating into the wrong place could have a greater cost.
+ */
+#define _Q_SPIN_SPEC_BARRIER 0
+
+#ifdef CONFIG_PPC64
+/*
+ * Execute a miso instruction after passing the MCS lock ownership to the
+ * queue head. Miso is intended to make stores visible to other CPUs sooner.
+ *
+ * This seems to make the lockstorm microbenchmark nospin test go slightly
+ * faster on POWER10, but disable for now.
+ */
+#define _Q_SPIN_MISO 0
+#else
+#define _Q_SPIN_MISO 0
+#endif
+
+#ifdef CONFIG_PPC64
+/*
+ * This executes miso after an unlock of the lock word, having ownership
+ * pass to the next CPU sooner. This will slow the uncontended path to some
+ * degree. Not evidence it helps yet.
+ */
+#define _Q_SPIN_MISO_UNLOCK 0
+#else
+#define _Q_SPIN_MISO_UNLOCK 0
+#endif
+
+/*
+ * Seems to slow down lockstorm microbenchmark, suspect queue node just
+ * has to become shared again right afterwards when its waiter spins on
+ * the lock field.
+ */
+#define _Q_SPIN_PREFETCH_NEXT 0
+
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
 	return READ_ONCE(lock->val);
@@ -51,7 +104,7 @@ static __always_inline int __queued_spin_trylock_nosteal(struct qspinlock *lock)
 "2:									\n"
 	: "=&r" (prev)
 	: "r" (&lock->val), "r" (new),
-	  "i" (IS_ENABLED(CONFIG_PPC64))
+	  "i" (_Q_SPIN_EH_HINT)
 	: "cr0", "memory");
 
 	return likely(prev == 0);
@@ -75,7 +128,7 @@ static __always_inline int __queued_spin_trylock_steal(struct qspinlock *lock)
 "2:									\n"
 	: "=&r" (prev), "=&r" (tmp)
 	: "r" (&lock->val), "r" (new), "r" (_Q_TAIL_CPU_MASK),
-	  "i" (IS_ENABLED(CONFIG_PPC64))
+	  "i" (_Q_SPIN_EH_HINT)
 	: "cr0", "memory");
 
 	return likely(!(prev & ~_Q_TAIL_CPU_MASK));
@@ -100,6 +153,8 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 static inline void queued_spin_unlock(struct qspinlock *lock)
 {
 	smp_store_release(&lock->locked, 0);
+	if (_Q_SPIN_MISO_UNLOCK)
+		asm volatile("miso" ::: "memory");
 }
 
 #define arch_spin_is_locked(l)		queued_spin_is_locked(l)
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 9a31b6147a23..2eab84774911 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -48,6 +48,12 @@ static bool pv_prod_head __read_mostly = false;
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 static DEFINE_PER_CPU_ALIGNED(u64, sleepy_lock_seen_clock);
 
+#if _Q_SPIN_SPEC_BARRIER == 1
+#define spec_barrier() do { asm volatile("ori 31,31,0" ::: "memory"); } while (0)
+#else
+#define spec_barrier() do { } while (0)
+#endif
+
 static __always_inline bool recently_sleepy(void)
 {
 	/* pv_sleepy_lock is true when this is called */
@@ -137,7 +143,7 @@ static __always_inline u32 trylock_clean_tail(struct qspinlock *lock, u32 tail)
 	: "r" (&lock->val), "r"(tail), "r" (newval),
 	  "i" (_Q_LOCKED_VAL),
 	  "r" (_Q_TAIL_CPU_MASK),
-	  "i" (IS_ENABLED(CONFIG_PPC64))
+	  "i" (_Q_SPIN_EH_HINT)
 	: "cr0", "memory");
 
 	return prev;
@@ -475,6 +481,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 		val = READ_ONCE(lock->val);
 		if (val & _Q_MUST_Q_VAL)
 			break;
+		spec_barrier();
 
 		if (unlikely(!(val & _Q_LOCKED_VAL))) {
 			spin_end();
@@ -540,6 +547,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 
 	qnodesp = this_cpu_ptr(&qnodes);
 	if (unlikely(qnodesp->count >= MAX_NODES)) {
+		spec_barrier();
 		while (!queued_spin_trylock(lock))
 			cpu_relax();
 		return;
@@ -576,9 +584,12 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		/* Wait for mcs node lock to be released */
 		spin_begin();
 		while (!node->locked) {
+			spec_barrier();
+
 			if (yield_to_prev(lock, node, old, paravirt))
 				seen_preempted = true;
 		}
+		spec_barrier();
 		spin_end();
 
 		/* Clear out stale propagated yield_cpu */
@@ -586,6 +597,17 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 			node->yield_cpu = -1;
 
 		smp_rmb(); /* acquire barrier for the mcs lock */
+
+		/*
+		 * Generic qspinlocks have this prefetch here, but it seems
+		 * like it could cause additional line transitions because
+		 * the waiter will keep loading from it.
+		 */
+		if (_Q_SPIN_PREFETCH_NEXT) {
+			next = READ_ONCE(node->next);
+			if (next)
+				prefetchw(next);
+		}
 	}
 
 	/* We're at the head of the waitqueue, wait for the lock. */
@@ -597,6 +619,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		val = READ_ONCE(lock->val);
 		if (!(val & _Q_LOCKED_VAL))
 			break;
+		spec_barrier();
 
 		if (paravirt && pv_sleepy_lock && maybe_stealers) {
 			if (!sleepy) {
@@ -637,6 +660,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 			val |= _Q_MUST_Q_VAL;
 		}
 	}
+	spec_barrier();
 	spin_end();
 
 	/* If we're the last queued, must clean up the tail. */
@@ -657,6 +681,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 			cpu_relax();
 		spin_end();
 	}
+	spec_barrier();
 
 	/*
 	 * Unlock the next mcs waiter node. Release barrier is not required
@@ -668,10 +693,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	if (paravirt && pv_prod_head) {
 		int next_cpu = next->cpu;
 		WRITE_ONCE(next->locked, 1);
+		if (_Q_SPIN_MISO)
+			asm volatile("miso" ::: "memory");
 		if (vcpu_is_preempted(next_cpu))
 			prod_cpu(next_cpu);
 	} else {
 		WRITE_ONCE(next->locked, 1);
+		if (_Q_SPIN_MISO)
+			asm volatile("miso" ::: "memory");
 	}
 
 release:
@@ -686,12 +715,16 @@ void queued_spin_lock_slowpath(struct qspinlock *lock)
 	 * is passed as the paravirt argument to the functions.
 	 */
 	if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor()) {
-		if (try_to_steal_lock(lock, true))
+		if (try_to_steal_lock(lock, true)) {
+			spec_barrier();
 			return;
+		}
 		queued_spin_lock_mcs_queue(lock, true);
 	} else {
-		if (try_to_steal_lock(lock, false))
+		if (try_to_steal_lock(lock, false)) {
+			spec_barrier();
 			return;
+		}
 		queued_spin_lock_mcs_queue(lock, false);
 	}
 }
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 real 01/17] powerpc/qspinlock: powerpc qspinlock implementation
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (16 preceding siblings ...)
  2022-11-26  9:59 ` [PATCH v3 17/17] powerpc/qspinlock: add compile-time tuning adjustments Nicholas Piggin
@ 2022-11-28  3:11 ` Nicholas Piggin
  2022-12-08 12:39 ` (subset) [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Michael Ellerman
  2023-04-13 10:58 ` Shrikanth Hegde
  19 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2022-11-28  3:11 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour

Add a powerpc specific implementation of queued spinlocks. This is the
build framework with a very simple (non-queued) spinlock implementation
to begin with. Later changes add queueing, and other features and
optimisations one-at-a-time. It is done this way to more easily see how
the queued spinlocks are built, and to make performance and correctness
bisects more useful.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
Missed the first patch sending the series :( Here is the real patch 1.

Thanks,
NIck

 arch/powerpc/Kconfig                          |  1 -
 arch/powerpc/include/asm/paravirt.h           |  3 +-
 arch/powerpc/include/asm/processor.h          |  1 +
 arch/powerpc/include/asm/qspinlock.h          | 87 +++++++------------
 arch/powerpc/include/asm/qspinlock_paravirt.h |  7 --
 arch/powerpc/include/asm/qspinlock_types.h    | 13 +++
 arch/powerpc/include/asm/spinlock.h           |  2 +-
 arch/powerpc/include/asm/spinlock_types.h     |  2 +-
 arch/powerpc/lib/Makefile                     |  4 +-
 arch/powerpc/lib/qspinlock.c                  | 17 ++++
 arch/powerpc/platforms/pseries/vas.c          |  1 +
 11 files changed, 67 insertions(+), 71 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
 create mode 100644 arch/powerpc/include/asm/qspinlock_types.h
 create mode 100644 arch/powerpc/lib/qspinlock.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2ca5418457ed..1d5b4f280feb 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -155,7 +155,6 @@ config PPC
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
 	select ARCH_USE_MEMTEST
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
-	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
index f5ba1a3c41f8..119b44b8e81b 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -3,14 +3,13 @@
 #define _ASM_POWERPC_PARAVIRT_H
 
 #include <linux/jump_label.h>
-#include <asm/smp.h>
 #ifdef CONFIG_PPC64
 #include <asm/paca.h>
 #include <asm/hvcall.h>
 #endif
 
 #ifdef CONFIG_PPC_SPLPAR
-#include <linux/smp.h>
+#include <asm/smp.h>
 #include <asm/kvm_guest.h>
 #include <asm/cputhreads.h>
 
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 631802999d59..640d9a35661c 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -39,6 +39,7 @@
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
 #include <linux/thread_info.h>
+#include <asm/paravirt.h>
 #include <asm/ptrace.h>
 #include <asm/hw_breakpoint.h>
 
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index b676c4fb90fd..b1443aab2145 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -2,83 +2,54 @@
 #ifndef _ASM_POWERPC_QSPINLOCK_H
 #define _ASM_POWERPC_QSPINLOCK_H
 
-#include <asm-generic/qspinlock_types.h>
-#include <asm/paravirt.h>
+#include <linux/atomic.h>
+#include <linux/compiler.h>
+#include <asm/qspinlock_types.h>
 
-#define _Q_PENDING_LOOPS	(1 << 9) /* not tuned */
-
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __pv_queued_spin_unlock(struct qspinlock *lock);
-
-static __always_inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-	if (!is_shared_processor())
-		native_queued_spin_lock_slowpath(lock, val);
-	else
-		__pv_queued_spin_lock_slowpath(lock, val);
+	return atomic_read(&lock->val);
 }
 
-#define queued_spin_unlock queued_spin_unlock
-static inline void queued_spin_unlock(struct qspinlock *lock)
+static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
 {
-	if (!is_shared_processor())
-		smp_store_release(&lock->locked, 0);
-	else
-		__pv_queued_spin_unlock(lock);
+	return !atomic_read(&lock.val);
 }
 
-#else
-extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-#endif
-
-static __always_inline void queued_spin_lock(struct qspinlock *lock)
+static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 {
-	u32 val = 0;
-
-	if (likely(arch_atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
-		return;
-
-	queued_spin_lock_slowpath(lock, val);
+	return 0;
 }
-#define queued_spin_lock queued_spin_lock
 
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define SPIN_THRESHOLD (1<<15) /* not tuned */
-
-static __always_inline void pv_wait(u8 *ptr, u8 val)
+static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
-	if (*ptr != val)
-		return;
-	yield_to_any();
-	/*
-	 * We could pass in a CPU here if waiting in the queue and yield to
-	 * the previous CPU in the queue.
-	 */
+	return atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0;
 }
 
-static __always_inline void pv_kick(int cpu)
+void queued_spin_lock_slowpath(struct qspinlock *lock);
+
+static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
-	prod_cpu(cpu);
+	if (!queued_spin_trylock(lock))
+		queued_spin_lock_slowpath(lock);
 }
 
-extern void __pv_init_lock_hash(void);
-
-static inline void pv_spinlocks_init(void)
+static inline void queued_spin_unlock(struct qspinlock *lock)
 {
-	__pv_init_lock_hash();
+	atomic_set_release(&lock->val, 0);
 }
 
-#endif
-
-/*
- * Queued spinlocks rely heavily on smp_cond_load_relaxed() to busy-wait,
- * which was found to have performance problems if implemented with
- * the preferred spin_begin()/spin_end() SMT priority pattern. Use the
- * generic version instead.
- */
+#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queued_spin_lock(l)
+#define arch_spin_trylock(l)		queued_spin_trylock(l)
+#define arch_spin_unlock(l)		queued_spin_unlock(l)
 
-#include <asm-generic/qspinlock.h>
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void pv_spinlocks_init(void);
+#else
+static inline void pv_spinlocks_init(void) { }
+#endif
 
 #endif /* _ASM_POWERPC_QSPINLOCK_H */
diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h
deleted file mode 100644
index 6b60e7736a47..000000000000
--- a/arch/powerpc/include/asm/qspinlock_paravirt.h
+++ /dev/null
@@ -1,7 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-#ifndef _ASM_POWERPC_QSPINLOCK_PARAVIRT_H
-#define _ASM_POWERPC_QSPINLOCK_PARAVIRT_H
-
-EXPORT_SYMBOL(__pv_queued_spin_unlock);
-
-#endif /* _ASM_POWERPC_QSPINLOCK_PARAVIRT_H */
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
new file mode 100644
index 000000000000..59606bc0c774
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _ASM_POWERPC_QSPINLOCK_TYPES_H
+#define _ASM_POWERPC_QSPINLOCK_TYPES_H
+
+#include <linux/types.h>
+
+typedef struct qspinlock {
+	atomic_t val;
+} arch_spinlock_t;
+
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
+
+#endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index bd75872a6334..7dafca8e3f02 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -13,7 +13,7 @@
 /* See include/linux/spinlock.h */
 #define smp_mb__after_spinlock()	smp_mb()
 
-#ifndef CONFIG_PARAVIRT_SPINLOCKS
+#ifndef CONFIG_PPC_QUEUED_SPINLOCKS
 static inline void pv_spinlocks_init(void) { }
 #endif
 
diff --git a/arch/powerpc/include/asm/spinlock_types.h b/arch/powerpc/include/asm/spinlock_types.h
index d5f8a74ed2e8..40b01446cf75 100644
--- a/arch/powerpc/include/asm/spinlock_types.h
+++ b/arch/powerpc/include/asm/spinlock_types.h
@@ -7,7 +7,7 @@
 #endif
 
 #ifdef CONFIG_PPC_QUEUED_SPINLOCKS
-#include <asm-generic/qspinlock_types.h>
+#include <asm/qspinlock_types.h>
 #include <asm-generic/qrwlock_types.h>
 #else
 #include <asm/simple_spinlock_types.h>
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 8560c912186d..b895cbf6a709 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -52,7 +52,9 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o copypage_power7.o \
 obj64-y	+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
 	   memcpy_64.o copy_mc_64.o
 
-ifndef CONFIG_PPC_QUEUED_SPINLOCKS
+ifdef CONFIG_PPC_QUEUED_SPINLOCKS
+obj64-$(CONFIG_SMP)	+= qspinlock.o
+else
 obj64-$(CONFIG_SMP)	+= locks.o
 endif
 
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
new file mode 100644
index 000000000000..1c669b5b4607
--- /dev/null
+++ b/arch/powerpc/lib/qspinlock.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/export.h>
+#include <linux/processor.h>
+#include <asm/qspinlock.h>
+
+void queued_spin_lock_slowpath(struct qspinlock *lock)
+{
+	while (!queued_spin_trylock(lock))
+		cpu_relax();
+}
+EXPORT_SYMBOL(queued_spin_lock_slowpath);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void pv_spinlocks_init(void)
+{
+}
+#endif
diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
index 4ad6e510d405..3ca573f5a0f7 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -15,6 +15,7 @@
 #include <linux/irqdomain.h>
 #include <asm/machdep.h>
 #include <asm/hvcall.h>
+#include <asm/paravirt.h>
 #include <asm/plpar_wrappers.h>
 #include <asm/firmware.h>
 #include <asm/vas.h>
-- 
2.37.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: (subset) [PATCH v3 00/17] powerpc: alternate queued spinlock implementation
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (17 preceding siblings ...)
  2022-11-28  3:11 ` [PATCH v3 real 01/17] powerpc/qspinlock: powerpc qspinlock implementation Nicholas Piggin
@ 2022-12-08 12:39 ` Michael Ellerman
  2023-04-13 10:58 ` Shrikanth Hegde
  19 siblings, 0 replies; 21+ messages in thread
From: Michael Ellerman @ 2022-12-08 12:39 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Jordan Niethe, Laurent Dufour

On Sat, 26 Nov 2022 19:59:15 +1000, Nicholas Piggin wrote:
> This replaces the generic queued spinlock code (like s390 does) with
> our own implementation. There is an extra shim patch 1a to get the
> series to apply.
> 
> Generic PV qspinlock code is causing latency / starvation regressions on
> large systems that are resulting in hard lockups reported (mostly in
> pathoogical cases).  The generic qspinlock code has a number of issues
> important for powerpc hardware and hypervisors that aren't easily solved
> without changing code that would impact other architectures. Follow
> s390's lead and implement our own for now.
> 
> [...]

Applied to powerpc/next.

[00/17] powerpc/qspinlock: powerpc qspinlock implementation
        https://git.kernel.org/powerpc/c/9f61521c7a284e799050cd2adacc9a611bd2b491
[01/17] powerpc/qspinlock: add mcs queueing for contended waiters
        https://git.kernel.org/powerpc/c/84990b169557428c318df87b7836cd15f65b62dc
[02/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.
        https://git.kernel.org/powerpc/c/4c93c2e4b9e8988511c06b9c042f23d4b8f593ad
[03/17] powerpc/qspinlock: convert atomic operations to assembly
        https://git.kernel.org/powerpc/c/b3a73b7db2b6cb3b2e5bfda5518a0e92230ef673
[04/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing
        https://git.kernel.org/powerpc/c/6aa42f883c438ea132a28801bef3f86f3883d14c
[05/17] powerpc/qspinlock: theft prevention to control latency
        https://git.kernel.org/powerpc/c/0944534ef4d5cf39c8133575524be0be3337dd62
[06/17] powerpc/qspinlock: store owner CPU in lock word
        https://git.kernel.org/powerpc/c/e1a31e7fd7130628cfd229253da2b4630e7a809c
[07/17] powerpc/qspinlock: paravirt yield to lock owner
        https://git.kernel.org/powerpc/c/085f03311bcede99550e08a1f7cad41bf758b460
[08/17] powerpc/qspinlock: implement option to yield to previous node
        https://git.kernel.org/powerpc/c/bd48287b2cf4cd6e95576db3a94fd2a7cdf9832d
[09/17] powerpc/qspinlock: allow stealing when head of queue yields
        https://git.kernel.org/powerpc/c/b4c3cdc1a698a2f6168768d0bed4bf062723722e
[10/17] powerpc/qspinlock: allow propagation of yield CPU down the queue
        https://git.kernel.org/powerpc/c/28db61e207ea3890d286cff3141c1ce67346074d
[11/17] powerpc/qspinlock: add ability to prod new queue head CPU
        https://git.kernel.org/powerpc/c/be742c573fdafcfa1752642ca1c7aaf08c258128
[12/17] powerpc/qspinlock: allow lock stealing in trylock and lock fastpath
        https://git.kernel.org/powerpc/c/f61ab43cc1a6146d6eef7e0713a452c3677ad13e
[13/17] powerpc/qspinlock: use spin_begin/end API
        https://git.kernel.org/powerpc/c/71c235027ce7940434acd3f553602ad8b5d36469
[14/17] powerpc/qspinlock: reduce remote node steal spins
        https://git.kernel.org/powerpc/c/cc79701114154efe79663ba47d9e51aad2ed3c78
[15/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner
        https://git.kernel.org/powerpc/c/39dfc73596b48bb50cf7e4f3f54e38427dda5b4e
[16/17] powerpc/qspinlock: provide accounting and options for sleepy locks
        https://git.kernel.org/powerpc/c/12b459a5ebf3308e718bc1dd48acb7c4cf7f1a75
[17/17] powerpc/qspinlock: add compile-time tuning adjustments
        https://git.kernel.org/powerpc/c/0b2199841a7952d01a717b465df028b40b2cf3e9

cheers

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 00/17] powerpc: alternate queued spinlock implementation
  2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (18 preceding siblings ...)
  2022-12-08 12:39 ` (subset) [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Michael Ellerman
@ 2023-04-13 10:58 ` Shrikanth Hegde
  19 siblings, 0 replies; 21+ messages in thread
From: Shrikanth Hegde @ 2023-04-13 10:58 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: Jordan Niethe, Laurent Dufour, linuxppc-dev



On 11/26/22 3:29 PM, Nicholas Piggin wrote:
> This replaces the generic queued spinlock code (like s390 does) with
> our own implementation. There is an extra shim patch 1a to get the
> series to apply.
> 
> Generic PV qspinlock code is causing latency / starvation regressions on
> large systems that are resulting in hard lockups reported (mostly in
> pathoogical cases).  The generic qspinlock code has a number of issues
> important for powerpc hardware and hypervisors that aren't easily solved
> without changing code that would impact other architectures. Follow
> s390's lead and implement our own for now.
> 
> Issues for powerpc using generic qspinlocks:
> - The previous lock value should not be loaded with simple loads, and
>   need not be passed around from previous loads or cmpxchg results,
>   because powerpc uses ll/sc-style atomics which can perform more
>   complex operations that do not require this. powerpc implementations
>   tend to prefer loads use larx for improved coherency performance.
> - The queueing process should absolutely minimise the number of stores
>   to the lock word to reduce exclusive coherency probes, important for
>   large system scalability. The pending logic is counter productive
>   here.
> - Non-atomic unlock for paravirt locks is important (atomic instructions
>   tend to still be more expensive than x86 CPUs).
> - Yielding to the lock owner is important in the oversubscribed paravirt
>   case, which requires storing the owner CPU in the lock word.
> - More control of lock stealing for the paravirt case is important to
>   keep latency down on large systems.
> - The lock acquisition operation should always be made with a special
>   variant of atomic instructions with the lock hint bit set, including
>   (especially) in the queueing paths. This is more a matter of adding
>   more arch lock helpers so not an insurmountable problem for generic
>   code.
> 
> Thanks,
> Nick
> 
> Since v2:
> - Rebase the series on upstream and remove the 1a shim patch.
> - Squash in the RFC patches that avoid a few more cmpxchg patterns in
>   favour of more optimal larx/stcx implementations and allows the
>   non-stealing queueing case to be removed, significantly reducing
>   the queuing code.
> - Reword some changelogs.
> 
> Since v1:
> - Change most 'if (cond) return 1 ; return 0;'
> - Bug fix: was testing count == MAX, but reentrant NMIs could bring that
>   > MAX and crash.
> - Fix missing memory barrier lost in asm conversion patch.
> - Seperate the release barrier in publish_tail from the acquire barrier
>   in get_tail_qnode.
> - Moving a few minor things into their logically correct change.
> - Make encode_tail_cpu take a cpu argument to match get_tail_cpu.
> - Rename get_tail_cpu to decode_tail_cpu to match encode_tail_cpu.
> - Rename lock_set_locked to set_locked.
> - IS_ENABLED(x) ? 1 : 0 -> IS_ENABLED(x)
> - Fix some comments inside inline asm.
> - Change tunable names to lowercase.
> - Consolidate asm for trylock_clear_tail_cpu and trylock_with_tail_cpu
> - Restructure steal/wait loops to be more readable
> - Count a failed cmpxchg as an iteration in steal/wait loops to avoid
>   theoretical livelock/latency concern.
> 
> Nicholas Piggin (17):
>   powerpc/qspinlock: add mcs queueing for contended waiters
>   powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.
>   powerpc/qspinlock: convert atomic operations to assembly
>   powerpc/qspinlock: allow new waiters to steal the lock before queueing
>   powerpc/qspinlock: theft prevention to control latency
>   powerpc/qspinlock: store owner CPU in lock word
>   powerpc/qspinlock: paravirt yield to lock owner
>   powerpc/qspinlock: implement option to yield to previous node
>   powerpc/qspinlock: allow stealing when head of queue yields
>   powerpc/qspinlock: allow propagation of yield CPU down the queue
>   powerpc/qspinlock: add ability to prod new queue head CPU
>   powerpc/qspinlock: allow lock stealing in trylock and lock fastpath
>   powerpc/qspinlock: use spin_begin/end API
>   powerpc/qspinlock: reduce remote node steal spins
>   powerpc/qspinlock: allow indefinite spinning on a preempted owner
>   powerpc/qspinlock: provide accounting and options for sleepy locks
>   powerpc/qspinlock: add compile-time tuning adjustments
> 
>  arch/powerpc/include/asm/qspinlock.h       | 130 ++-
>  arch/powerpc/include/asm/qspinlock_types.h |  63 +-
>  arch/powerpc/lib/qspinlock.c               | 985 ++++++++++++++++++++-
>  3 files changed, 1167 insertions(+), 11 deletions(-)
> 

Hi. 

we are observing hardlockup problems with queued spinlock on powerVM Shared Processor LPAR's(SPLPAR)
LPAR is configured with 320vcpu/160cpu. It is over-committed. 


Scenario: 
Bring up the system. create a CPU cgroup with 50% or 5% CPU bandwidth. Run strss-ng with 160 CPU stressor.
Issue is hit consistently on 5% CPU bandwidth allocation. Issue happens sometime with 50%. Issue was 
not observed when stress-ng is run without any cgroup during testing. Lockups are more easily hit, if
other SPLPAR are doing some work.
The same testing holds good in Dedicated LPAR. So this maybe specific to SPLPAR only.  


Lockup backtrace:
[ 2612.771144] watchdog: CPU 216 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
[ 2612.771158] watchdog: CPU 216 TB:119522648258051, last heartbeat TB:119516305262839 (12388ms ago) 
[ 2612.771160] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables bonding nfnetlink mlx5_ib pseries_rng ib_uverbs ib_core drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw nvme tls vmx_crypto nvme_core psample t10_pi dm_mirror dm_region_hash dm_log dm_mod fuse
[ 2612.771186] CPU: 216 PID: 0 Comm: swapper/216 Not tainted 6.2.0ssh+ #6           
[ 2612.771189] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1030.10 (MH1030_044) hv:phyp pSeries
[ 2612.771191] NIP:  c0000000000f2710 LR: c0000000000ae38c CTR: 0000000000000000
[ 2612.771193] REGS: c00000fffb4c7d60 TRAP: 0900   Not tainted  (6.2.0ssh+)         
[ 2612.771195] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24022222  XER: 00000000
[ 2612.771200] CFAR: 0000000000000000 IRQMASK: 1                                
[ 2612.771200] GPR00: 0000000024022222 c00000fffba47b30 c0000000017f0e00 0000000000000000
[ 2612.771200] GPR04: 0000000000000040 000000000000ebd9 0000000000000081 fffffffffffe0000
[ 2612.771200] GPR08: 0000000000000081 c00000fffbf49c80 0000000000000004 0000000044022222
[ 2612.771200] GPR12: 0000000000000000 c00000fffbee9c80 0000000000000001 c00000000125ef08
[ 2612.771200] GPR16: c00000739f6bd600 c000000002ac78d0 c000000002ac7058 000000000000ebd9
[ 2612.771200] GPR20: 0000000000000200 0000000000000000 fffffffffffe0000 0000000000000001
[ 2612.771200] GPR24: 0000000000000001 0000000000000001 c0000000021f1880 0000000000000001
[ 2612.771200] GPR28: 0000000000000032 c000000002ac7190 0000000000008081 c00000fff5a31f80
[ 2612.771217] NIP [c0000000000f2710] plpar_hcall_norets_notrace+0x18/0x2c      
[ 2612.771221] LR [c0000000000ae38c] queued_spin_lock_slowpath+0x52c/0x1190         
[ 2612.771225] Call Trace:                                                      
[ 2612.771226] [c00000fffba47c40] [c000000000f635c0] _raw_spin_lock+0x70/0x90   
[ 2612.771232] [c00000fffba47c60] [c000000000198f6c] raw_spin_rq_lock_nested+0x2c/0x100
[ 2612.771236] [c00000fffba47ca0] [c0000000001ba900] load_balance+0x5e0/0x9b0   
[ 2612.771239] [c00000fffba47e10] [c0000000001bbbc8] rebalance_domains+0x2e8/0x490
[ 2612.771242] [c00000fffba47ee0] [c000000000f63c8c] __do_softirq+0x15c/0x3dc   
[ 2612.771245] [c00000fffba47fe0] [c000000000017950] do_softirq_own_stack+0x40/0x60
[ 2612.771248] [c000007387c9f9b0] [c0000000001539e8] __irq_exit_rcu+0x158/0x190 
[ 2612.771251] [c000007387c9f9e0] [c000000000154520] irq_exit+0x20/0x40         
[ 2612.771253] [c000007387c9fa00] [c000000000028af4] timer_interrupt+0x174/0x320
[ 2612.771255] [c000007387c9fa60] [c000000000009f8c] decrementer_common_virt+0x28c/0x290


We did a git bisect between 6.1 and 6.2
Git bisect logs:
# git bisect log  
git bisect start
# good: [274d842fa1efd9449e62222c8896e0be11621f1f] powerpc/tlb: Add local flush for page given mm_struct and psize
git bisect good 274d842fa1efd9449e62222c8896e0be11621f1f
# bad: [0b2199841a7952d01a717b465df028b40b2cf3e9] powerpc/qspinlock: add compile-time tuning adjustments
git bisect bad 0b2199841a7952d01a717b465df028b40b2cf3e9
# good: [247f34f7b80357943234f93f247a1ae6b6c3a740] Linux 6.1-rc2
git bisect good 247f34f7b80357943234f93f247a1ae6b6c3a740
# good: [bd48287b2cf4cd6e95576db3a94fd2a7cdf9832d] powerpc/qspinlock: implement option to yield to previous node
git bisect good bd48287b2cf4cd6e95576db3a94fd2a7cdf9832d
# bad: [f61ab43cc1a6146d6eef7e0713a452c3677ad13e] powerpc/qspinlock: allow lock stealing in trylock and lock fastpath
git bisect bad f61ab43cc1a6146d6eef7e0713a452c3677ad13e
# bad: [28db61e207ea3890d286cff3141c1ce67346074d] powerpc/qspinlock: allow propagation of yield CPU down the queue
git bisect bad 28db61e207ea3890d286cff3141c1ce67346074d
# good: [b4c3cdc1a698a2f6168768d0bed4bf062723722e] powerpc/qspinlock: allow stealing when head of queue yields
git bisect good b4c3cdc1a698a2f6168768d0bed4bf062723722e
# first bad commit: [28db61e207ea3890d286cff3141c1ce67346074d] powerpc/qspinlock: allow propagation of yield CPU down the queue


Tried fiddling with these tunable available for qspinlock with their default values on the test system:

qspl_steal_spins = 256
qspl_head_spins = 256
qspl_pv_yield_owner = 1
qspl_pv_yield_prev = 1
qspl_pv_yield_allow_steal = 0
qspl_pv_yield_propagate_owner = 1

No lockup with qspl_pv_yield_propagate_owner = 0. But that would defeat the purpose of the queued spinlock. 
So tried with next one, i.e qspl_pv_yield_allow_steal = 1.  No lockup was seen. Likely we have to tune it for
powerVM or default values need to change based on the system? 

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-04-13 10:59 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-26  9:59 [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 01/17] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 02/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 03/17] powerpc/qspinlock: convert atomic operations to assembly Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 04/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 05/17] powerpc/qspinlock: theft prevention to control latency Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 06/17] powerpc/qspinlock: store owner CPU in lock word Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 07/17] powerpc/qspinlock: paravirt yield to lock owner Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 08/17] powerpc/qspinlock: implement option to yield to previous node Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 09/17] powerpc/qspinlock: allow stealing when head of queue yields Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 10/17] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 11/17] powerpc/qspinlock: add ability to prod new queue head CPU Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 12/17] powerpc/qspinlock: allow lock stealing in trylock and lock fastpath Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 13/17] powerpc/qspinlock: use spin_begin/end API Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 14/17] powerpc/qspinlock: reduce remote node steal spins Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 15/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 16/17] powerpc/qspinlock: provide accounting and options for sleepy locks Nicholas Piggin
2022-11-26  9:59 ` [PATCH v3 17/17] powerpc/qspinlock: add compile-time tuning adjustments Nicholas Piggin
2022-11-28  3:11 ` [PATCH v3 real 01/17] powerpc/qspinlock: powerpc qspinlock implementation Nicholas Piggin
2022-12-08 12:39 ` (subset) [PATCH v3 00/17] powerpc: alternate queued spinlock implementation Michael Ellerman
2023-04-13 10:58 ` Shrikanth Hegde

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).