All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/17] powerpc: alternate queued spinlock implementation
@ 2022-07-28  6:31 Nicholas Piggin
  2022-07-28  6:31 ` [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation Nicholas Piggin
                   ` (17 more replies)
  0 siblings, 18 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

This replaces the generic queued spinlock code (like s390 does) with
our own implementation. There is an extra shim patch 1a here to get the
series to apply.

So far the microbenchmarks look okay, haven't really had time to write
up a good set of results. I hope to get some significant bigger
workloads some testing time in the next week or so so if those turn
out positive I may resubmit with any tweaks and some more details of
numbers.

Thanks,
Nick

Nicholas Piggin (17):
  powerpc/qspinlock: powerpc qspinlock implementation
  powerpc/qspinlock: add mcs queueing for contended waiters
  powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.
  powerpc/qspinlock: convert atomic operations to assembly
  powerpc/qspinlock: allow new waiters to steal the lock before queueing
  powerpc/qspinlock: theft prevention to control latency
  powerpc/qspinlock: store owner CPU in lock word
  powerpc/qspinlock: paravirt yield to lock owner
  powerpc/qspinlock: implement option to yield to previous node
  powerpc/qspinlock: allow stealing when head of queue yields
  powerpc/qspinlock: allow propagation of yield CPU down the queue
  powerpc/qspinlock: add ability to prod new queue head CPU
  powerpc/qspinlock: trylock and initial lock attempt may steal
  powerpc/qspinlock: use spin_begin/end API
  powerpc/qspinlock: reduce remote node steal spins
  powerpc/qspinlock: allow indefinite spinning on a preempted owner
  powerpc/qspinlock: provide accounting and options for sleepy locks

 arch/powerpc/Kconfig                       |    1 -
 arch/powerpc/include/asm/qspinlock.h       |  130 ++-
 arch/powerpc/include/asm/qspinlock_types.h |   70 ++
 arch/powerpc/include/asm/spinlock_types.h  |    2 +-
 arch/powerpc/lib/Makefile                  |    4 +-
 arch/powerpc/lib/qspinlock.c               | 1009 ++++++++++++++++++++
 6 files changed, 1172 insertions(+), 44 deletions(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock_types.h
 create mode 100644 arch/powerpc/lib/qspinlock.c

-- 
2.35.1


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-10  1:52   ` Jordan NIethe
  2022-11-10  0:35   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 1a/17] powerpc/qspinlock: Prepare qspinlock code Nicholas Piggin
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Add a powerpc specific implementation of queued spinlocks. This is the
build framework with a very simple (non-queued) spinlock implementation
to begin with. Later changes add queueing, and other features and
optimisations one-at-a-time. It is done this way to more easily see how
the queued spinlocks are built, and to make performance and correctness
bisects more useful.

Generic PV qspinlock code is causing latency / starvation regressions on
large systems that are resulting in hard lockups reported (mostly in
pathoogical cases).  The generic qspinlock code has a number of issues
important for powerpc hardware and hypervisors that aren't easily solved
without changing code that would impact other architectures. Follow
s390's lead and implement our own for now.

Issues for powerpc using generic qspinlocks:
- The previous lock value should not be loaded with simple loads, and
  need not be passed around from previous loads or cmpxchg results,
  because powerpc uses ll/sc-style atomics which can perform more
  complex operations that do not require this. powerpc implementations
  tend to prefer loads use larx for improved coherency performance.
- The queueing process should absolutely minimise the number of stores
  to the lock word to reduce exclusive coherency probes, important for
  large system scalability. The pending logic is counter productive
  here.
- Non-atomic unlock for paravirt locks is important (atomic instructions
  tend to still be more expensive than x86 CPUs).
- Yielding to the lock owner is important in the oversubscribed paravirt
  case, which requires storing the owner CPU in the lock word.
- More control of lock stealing for the paravirt case is important to
  keep latency down on large systems.
- The lock acquisition operation should always be made with a special
  variant of atomic instructions with the lock hint bit set, including
  (especially) in the queueing paths. This is more a matter of adding
  more arch lock helpers so not an insurmountable problem for generic
  code.

Since the RFC series, I tested this on a 16-socket 1920 thread POWER10
system with some microbenchmarks, and that showed up significant
problems with the previous series. High amount of spinning on the lock
up-front (lock stealing) for SPLPAR mode (paravirt) really hurts
scalability when the guest is not overcommitted. However on smaller
KVM systems with significant overcommit (e.g., 5-10%), this spinning
is very important to avoid performance tanking due to the queueing
problem. So rather than set STEAL_SPINS and HEAD_SPINS based on
SPLPAR at boot-time, I lowered them and do more to dynamically deal
with vCPU preemption. So behaviour of dedicated and shared LPAR mode
is now the same until there is vCPU preemption detected. This seems
to be leading to better results overall, but some worst-case latencies
are significantly up with the lockstorm test (latency is still better
than generic queued spinlocks, but not as good as it previously was or
as good as simple). Statistical fairness is still significantly better.

Thanks,
Nick

---
 arch/powerpc/Kconfig                       |  1 -
 arch/powerpc/include/asm/qspinlock.h       | 78 ++++++++++------------
 arch/powerpc/include/asm/qspinlock_types.h | 13 ++++
 arch/powerpc/include/asm/spinlock_types.h  |  2 +-
 arch/powerpc/lib/Makefile                  |  4 +-
 arch/powerpc/lib/qspinlock.c               | 18 +++++
 6 files changed, 69 insertions(+), 47 deletions(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock_types.h
 create mode 100644 arch/powerpc/lib/qspinlock.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 7aa12e88c580..4838e6c96b20 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -154,7 +154,6 @@ config PPC
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
 	select ARCH_USE_MEMTEST
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
-	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 39c1c7f80579..cb2b4f91e976 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -2,66 +2,56 @@
 #ifndef _ASM_POWERPC_QSPINLOCK_H
 #define _ASM_POWERPC_QSPINLOCK_H
 
-#include <asm-generic/qspinlock_types.h>
-#include <asm/paravirt.h>
+#include <linux/atomic.h>
+#include <linux/compiler.h>
+#include <asm/qspinlock_types.h>
 
-#define _Q_PENDING_LOOPS	(1 << 9) /* not tuned */
-
-void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-void __pv_queued_spin_unlock(struct qspinlock *lock);
-
-static __always_inline void queued_spin_lock(struct qspinlock *lock)
+static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-	u32 val = 0;
+	return atomic_read(&lock->val);
+}
 
-	if (likely(arch_atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
-		return;
+static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
+{
+	return !atomic_read(&lock.val);
+}
 
-	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
-		queued_spin_lock_slowpath(lock, val);
-	else
-		__pv_queued_spin_lock_slowpath(lock, val);
+static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
+{
+	return 0;
 }
-#define queued_spin_lock queued_spin_lock
 
-static inline void queued_spin_unlock(struct qspinlock *lock)
+static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
-	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
-		smp_store_release(&lock->locked, 0);
-	else
-		__pv_queued_spin_unlock(lock);
+	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
+		return 1;
+	return 0;
 }
-#define queued_spin_unlock queued_spin_unlock
 
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define SPIN_THRESHOLD (1<<15) /* not tuned */
+void queued_spin_lock_slowpath(struct qspinlock *lock);
 
-static __always_inline void pv_wait(u8 *ptr, u8 val)
+static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
-	if (*ptr != val)
-		return;
-	yield_to_any();
-	/*
-	 * We could pass in a CPU here if waiting in the queue and yield to
-	 * the previous CPU in the queue.
-	 */
+	if (!queued_spin_trylock(lock))
+		queued_spin_lock_slowpath(lock);
 }
 
-static __always_inline void pv_kick(int cpu)
+static inline void queued_spin_unlock(struct qspinlock *lock)
 {
-	prod_cpu(cpu);
+	atomic_set_release(&lock->val, 0);
 }
 
-#endif
+#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queued_spin_lock(l)
+#define arch_spin_trylock(l)		queued_spin_trylock(l)
+#define arch_spin_unlock(l)		queued_spin_unlock(l)
 
-/*
- * Queued spinlocks rely heavily on smp_cond_load_relaxed() to busy-wait,
- * which was found to have performance problems if implemented with
- * the preferred spin_begin()/spin_end() SMT priority pattern. Use the
- * generic version instead.
- */
-
-#include <asm-generic/qspinlock.h>
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void pv_spinlocks_init(void);
+#else
+static inline void pv_spinlocks_init(void) { }
+#endif
 
 #endif /* _ASM_POWERPC_QSPINLOCK_H */
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
new file mode 100644
index 000000000000..59606bc0c774
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _ASM_POWERPC_QSPINLOCK_TYPES_H
+#define _ASM_POWERPC_QSPINLOCK_TYPES_H
+
+#include <linux/types.h>
+
+typedef struct qspinlock {
+	atomic_t val;
+} arch_spinlock_t;
+
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
+
+#endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
diff --git a/arch/powerpc/include/asm/spinlock_types.h b/arch/powerpc/include/asm/spinlock_types.h
index d5f8a74ed2e8..40b01446cf75 100644
--- a/arch/powerpc/include/asm/spinlock_types.h
+++ b/arch/powerpc/include/asm/spinlock_types.h
@@ -7,7 +7,7 @@
 #endif
 
 #ifdef CONFIG_PPC_QUEUED_SPINLOCKS
-#include <asm-generic/qspinlock_types.h>
+#include <asm/qspinlock_types.h>
 #include <asm-generic/qrwlock_types.h>
 #else
 #include <asm/simple_spinlock_types.h>
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 8560c912186d..b895cbf6a709 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -52,7 +52,9 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o copypage_power7.o \
 obj64-y	+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
 	   memcpy_64.o copy_mc_64.o
 
-ifndef CONFIG_PPC_QUEUED_SPINLOCKS
+ifdef CONFIG_PPC_QUEUED_SPINLOCKS
+obj64-$(CONFIG_SMP)	+= qspinlock.o
+else
 obj64-$(CONFIG_SMP)	+= locks.o
 endif
 
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
new file mode 100644
index 000000000000..8dbce99a373c
--- /dev/null
+++ b/arch/powerpc/lib/qspinlock.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/export.h>
+#include <linux/processor.h>
+#include <asm/qspinlock.h>
+
+void queued_spin_lock_slowpath(struct qspinlock *lock)
+{
+	while (!queued_spin_trylock(lock))
+		cpu_relax();
+}
+EXPORT_SYMBOL(queued_spin_lock_slowpath);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void pv_spinlocks_init(void)
+{
+}
+#endif
+
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 1a/17] powerpc/qspinlock: Prepare qspinlock code
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
  2022-07-28  6:31 ` [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-07-28  6:31 ` [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

I have a bunch of parallel patches that clean up the generic queued
spinlock code, but the powerpc implementation does not use or depend
on any of that except patch collision. I've based the powerpc series
on top of that work, but it's annoying to post or carry around all
those patches as well. This shim patch takes the powerpc changes and
should be applied first. If powerpc series is to go ahead of the
generic series, then this patch would just be merged into patch 1
of this series. This patch won't compile or do anything useful alone.
---
 arch/powerpc/include/asm/qspinlock.h          | 45 ++++++-------------
 arch/powerpc/include/asm/qspinlock_paravirt.h |  7 ---
 arch/powerpc/include/asm/spinlock.h           |  2 +-
 3 files changed, 15 insertions(+), 39 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index b676c4fb90fd..39c1c7f80579 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -7,42 +7,32 @@
 
 #define _Q_PENDING_LOOPS	(1 << 9) /* not tuned */
 
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __pv_queued_spin_unlock(struct qspinlock *lock);
+void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+void __pv_queued_spin_unlock(struct qspinlock *lock);
 
-static __always_inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
-	if (!is_shared_processor())
-		native_queued_spin_lock_slowpath(lock, val);
+	u32 val = 0;
+
+	if (likely(arch_atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
+		return;
+
+	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
+		queued_spin_lock_slowpath(lock, val);
 	else
 		__pv_queued_spin_lock_slowpath(lock, val);
 }
+#define queued_spin_lock queued_spin_lock
 
-#define queued_spin_unlock queued_spin_unlock
 static inline void queued_spin_unlock(struct qspinlock *lock)
 {
-	if (!is_shared_processor())
+	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
 		smp_store_release(&lock->locked, 0);
 	else
 		__pv_queued_spin_unlock(lock);
 }
-
-#else
-extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-#endif
-
-static __always_inline void queued_spin_lock(struct qspinlock *lock)
-{
-	u32 val = 0;
-
-	if (likely(arch_atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
-		return;
-
-	queued_spin_lock_slowpath(lock, val);
-}
-#define queued_spin_lock queued_spin_lock
+#define queued_spin_unlock queued_spin_unlock
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 #define SPIN_THRESHOLD (1<<15) /* not tuned */
@@ -63,13 +53,6 @@ static __always_inline void pv_kick(int cpu)
 	prod_cpu(cpu);
 }
 
-extern void __pv_init_lock_hash(void);
-
-static inline void pv_spinlocks_init(void)
-{
-	__pv_init_lock_hash();
-}
-
 #endif
 
 /*
diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h
deleted file mode 100644
index 6b60e7736a47..000000000000
--- a/arch/powerpc/include/asm/qspinlock_paravirt.h
+++ /dev/null
@@ -1,7 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-#ifndef _ASM_POWERPC_QSPINLOCK_PARAVIRT_H
-#define _ASM_POWERPC_QSPINLOCK_PARAVIRT_H
-
-EXPORT_SYMBOL(__pv_queued_spin_unlock);
-
-#endif /* _ASM_POWERPC_QSPINLOCK_PARAVIRT_H */
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index bd75872a6334..7dafca8e3f02 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -13,7 +13,7 @@
 /* See include/linux/spinlock.h */
 #define smp_mb__after_spinlock()	smp_mb()
 
-#ifndef CONFIG_PARAVIRT_SPINLOCKS
+#ifndef CONFIG_PPC_QUEUED_SPINLOCKS
 static inline void pv_spinlocks_init(void) { }
 #endif
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
  2022-07-28  6:31 ` [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation Nicholas Piggin
  2022-07-28  6:31 ` [PATCH 1a/17] powerpc/qspinlock: Prepare qspinlock code Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-10  2:28   ` Jordan NIethe
  2022-11-10  0:36   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx Nicholas Piggin
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

This forms the basis of the qspinlock slow path.

Like generic qspinlocks and unlike the vanilla MCS algorithm, the lock
owner does not participate in the queue, only waiters. The first waiter
spins on the lock word, then when the lock is released it takes
ownership and unqueues the next waiter. This is how qspinlocks can be
implemented with the spinlock API -- lock owners don't need a node, only
waiters do.
---
 arch/powerpc/include/asm/qspinlock.h       |  10 +-
 arch/powerpc/include/asm/qspinlock_types.h |  21 +++
 arch/powerpc/lib/qspinlock.c               | 166 ++++++++++++++++++++-
 3 files changed, 191 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index cb2b4f91e976..f06117aa60e1 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -18,12 +18,12 @@ static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
 
 static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 {
-	return 0;
+	return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
 }
 
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
-	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
+	if (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0)
 		return 1;
 	return 0;
 }
@@ -38,7 +38,11 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 
 static inline void queued_spin_unlock(struct qspinlock *lock)
 {
-	atomic_set_release(&lock->val, 0);
+	for (;;) {
+		int val = atomic_read(&lock->val);
+		if (atomic_cmpxchg_release(&lock->val, val, val & ~_Q_LOCKED_VAL) == val)
+			return;
+	}
 }
 
 #define arch_spin_is_locked(l)		queued_spin_is_locked(l)
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 59606bc0c774..9630e714c70d 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -10,4 +10,25 @@ typedef struct qspinlock {
 
 #define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
 
+/*
+ * Bitfields in the atomic value:
+ *
+ *     0: locked bit
+ * 16-31: tail cpu (+1)
+ */
+#define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
+				      << _Q_ ## type ## _OFFSET)
+#define _Q_LOCKED_OFFSET	0
+#define _Q_LOCKED_BITS		1
+#define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
+#define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
+
+#define _Q_TAIL_CPU_OFFSET	16
+#define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
+#define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
+
+#if CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)
+#error "qspinlock does not support such large CONFIG_NR_CPUS"
+#endif
+
 #endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 8dbce99a373c..5ebb88d95636 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -1,12 +1,172 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/compiler.h>
 #include <linux/export.h>
-#include <linux/processor.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
 #include <asm/qspinlock.h>
 
-void queued_spin_lock_slowpath(struct qspinlock *lock)
+#define MAX_NODES	4
+
+struct qnode {
+	struct qnode	*next;
+	struct qspinlock *lock;
+	u8		locked; /* 1 if lock acquired */
+};
+
+struct qnodes {
+	int		count;
+	struct qnode nodes[MAX_NODES];
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
+
+static inline int encode_tail_cpu(void)
+{
+	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
+}
+
+static inline int get_tail_cpu(int val)
+{
+	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
+}
+
+/* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
+static __always_inline void lock_set_locked(struct qspinlock *lock)
+{
+	atomic_or(_Q_LOCKED_VAL, &lock->val);
+	__atomic_acquire_fence();
+}
+
+/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
+static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int val)
+{
+	int newval = _Q_LOCKED_VAL;
+
+	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
+		return 1;
+	else
+		return 0;
+}
+
+/*
+ * Publish our tail, replacing previous tail. Return previous value.
+ *
+ * This provides a release barrier for publishing node, and an acquire barrier
+ * for getting the old node.
+ */
+static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)
 {
-	while (!queued_spin_trylock(lock))
+	for (;;) {
+		int val = atomic_read(&lock->val);
+		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
+		int old;
+
+		old = atomic_cmpxchg(&lock->val, val, newval);
+		if (old == val)
+			return old;
+	}
+}
+
+static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
+{
+	int cpu = get_tail_cpu(val);
+	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
+	int idx;
+
+	for (idx = 0; idx < MAX_NODES; idx++) {
+		struct qnode *qnode = &qnodesp->nodes[idx];
+		if (qnode->lock == lock)
+			return qnode;
+	}
+
+	BUG();
+}
+
+static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
+{
+	struct qnodes *qnodesp;
+	struct qnode *next, *node;
+	int val, old, tail;
+	int idx;
+
+	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+
+	qnodesp = this_cpu_ptr(&qnodes);
+	if (unlikely(qnodesp->count == MAX_NODES)) {
+		while (!queued_spin_trylock(lock))
+			cpu_relax();
+		return;
+	}
+
+	idx = qnodesp->count++;
+	/*
+	 * Ensure that we increment the head node->count before initialising
+	 * the actual node. If the compiler is kind enough to reorder these
+	 * stores, then an IRQ could overwrite our assignments.
+	 */
+	barrier();
+	node = &qnodesp->nodes[idx];
+	node->next = NULL;
+	node->lock = lock;
+	node->locked = 0;
+
+	tail = encode_tail_cpu();
+
+	old = publish_tail_cpu(lock, tail);
+
+	/*
+	 * If there was a previous node; link it and wait until reaching the
+	 * head of the waitqueue.
+	 */
+	if (old & _Q_TAIL_CPU_MASK) {
+		struct qnode *prev = get_tail_qnode(lock, old);
+
+		/* Link @node into the waitqueue. */
+		WRITE_ONCE(prev->next, node);
+
+		/* Wait for mcs node lock to be released */
+		while (!node->locked)
+			cpu_relax();
+
+		smp_rmb(); /* acquire barrier for the mcs lock */
+	}
+
+	/* We're at the head of the waitqueue, wait for the lock. */
+	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_VAL)
+		cpu_relax();
+
+	/* If we're the last queued, must clean up the tail. */
+	if ((val & _Q_TAIL_CPU_MASK) == tail) {
+		if (trylock_clear_tail_cpu(lock, val))
+			goto release;
+		/* Another waiter must have enqueued */
+	}
+
+	/* We must be the owner, just set the lock bit and acquire */
+	lock_set_locked(lock);
+
+	/* contended path; must wait for next != NULL (MCS protocol) */
+	while (!(next = READ_ONCE(node->next)))
 		cpu_relax();
+
+	/*
+	 * Unlock the next mcs waiter node. Release barrier is not required
+	 * here because the acquirer is only accessing the lock word, and
+	 * the acquire barrier we took the lock with orders that update vs
+	 * this store to locked. The corresponding barrier is the smp_rmb()
+	 * acquire barrier for mcs lock, above.
+	 */
+	WRITE_ONCE(next->locked, 1);
+
+release:
+	qnodesp->count--; /* release the node */
+}
+
+void queued_spin_lock_slowpath(struct qspinlock *lock)
+{
+	queued_spin_lock_mcs_queue(lock);
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (2 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-10  3:28   ` Jordan Niethe
  2022-11-10  0:39   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly Nicholas Piggin
                   ` (13 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

The first 16 bits of the lock are only modified by the owner, and other
modifications always use atomic operations on the entire 32 bits, so
unlocks can use plain stores on the 16 bits. This is the same kind of
optimisation done by core qspinlock code.
---
 arch/powerpc/include/asm/qspinlock.h       |  6 +-----
 arch/powerpc/include/asm/qspinlock_types.h | 19 +++++++++++++++++--
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index f06117aa60e1..79a1936fb68d 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -38,11 +38,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 
 static inline void queued_spin_unlock(struct qspinlock *lock)
 {
-	for (;;) {
-		int val = atomic_read(&lock->val);
-		if (atomic_cmpxchg_release(&lock->val, val, val & ~_Q_LOCKED_VAL) == val)
-			return;
-	}
+	smp_store_release(&lock->locked, 0);
 }
 
 #define arch_spin_is_locked(l)		queued_spin_is_locked(l)
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 9630e714c70d..3425dab42576 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -3,12 +3,27 @@
 #define _ASM_POWERPC_QSPINLOCK_TYPES_H
 
 #include <linux/types.h>
+#include <asm/byteorder.h>
 
 typedef struct qspinlock {
-	atomic_t val;
+	union {
+		atomic_t val;
+
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u16	locked;
+			u8	reserved[2];
+		};
+#else
+		struct {
+			u8	reserved[2];
+			u16	locked;
+		};
+#endif
+	};
 } arch_spinlock_t;
 
-#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
 
 /*
  * Bitfields in the atomic value:
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (3 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-10  3:54   ` Jordan Niethe
  2022-11-10  0:39   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing Nicholas Piggin
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

This uses more optimal ll/sc style access patterns (rather than
cmpxchg), and also sets the EH=1 lock hint on those operations
which acquire ownership of the lock.
---
 arch/powerpc/include/asm/qspinlock.h       | 25 +++++--
 arch/powerpc/include/asm/qspinlock_types.h |  6 +-
 arch/powerpc/lib/qspinlock.c               | 81 +++++++++++++++-------
 3 files changed, 79 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 79a1936fb68d..3ab354159e5e 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -2,28 +2,43 @@
 #ifndef _ASM_POWERPC_QSPINLOCK_H
 #define _ASM_POWERPC_QSPINLOCK_H
 
-#include <linux/atomic.h>
 #include <linux/compiler.h>
 #include <asm/qspinlock_types.h>
 
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-	return atomic_read(&lock->val);
+	return READ_ONCE(lock->val);
 }
 
 static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
 {
-	return !atomic_read(&lock.val);
+	return !lock.val;
 }
 
 static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 {
-	return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
+	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
 }
 
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
-	if (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0)
+	u32 new = _Q_LOCKED_VAL;
+	u32 prev;
+
+	asm volatile(
+"1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"
+"	cmpwi	0,%0,0							\n"
+"	bne-	2f							\n"
+"	stwcx.	%2,0,%1							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+"2:									\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r" (new),
+	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
+	: "cr0", "memory");
+
+	if (likely(prev == 0))
 		return 1;
 	return 0;
 }
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 3425dab42576..210adf05b235 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -7,7 +7,7 @@
 
 typedef struct qspinlock {
 	union {
-		atomic_t val;
+		u32 val;
 
 #ifdef __LITTLE_ENDIAN
 		struct {
@@ -23,10 +23,10 @@ typedef struct qspinlock {
 	};
 } arch_spinlock_t;
 
-#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = 0 } }
 
 /*
- * Bitfields in the atomic value:
+ * Bitfields in the lock word:
  *
  *     0: locked bit
  * 16-31: tail cpu (+1)
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 5ebb88d95636..7c71e5e287df 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -1,5 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
-#include <linux/atomic.h>
 #include <linux/bug.h>
 #include <linux/compiler.h>
 #include <linux/export.h>
@@ -22,32 +21,59 @@ struct qnodes {
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
-static inline int encode_tail_cpu(void)
+static inline u32 encode_tail_cpu(void)
 {
 	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
 }
 
-static inline int get_tail_cpu(int val)
+static inline int get_tail_cpu(u32 val)
 {
 	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
 }
 
 /* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
+/* Take the lock by setting the lock bit, no other CPUs will touch it. */
 static __always_inline void lock_set_locked(struct qspinlock *lock)
 {
-	atomic_or(_Q_LOCKED_VAL, &lock->val);
-	__atomic_acquire_fence();
+	u32 new = _Q_LOCKED_VAL;
+	u32 prev;
+
+	asm volatile(
+"1:	lwarx	%0,0,%1,%3	# lock_set_locked			\n"
+"	or	%0,%0,%2						\n"
+"	stwcx.	%0,0,%1							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r" (new),
+	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
+	: "cr0", "memory");
 }
 
-/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
-static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int val)
+/* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
+static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
 {
-	int newval = _Q_LOCKED_VAL;
-
-	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
+	u32 new = _Q_LOCKED_VAL;
+	u32 prev;
+
+	BUG_ON(old & _Q_LOCKED_VAL);
+
+	asm volatile(
+"1:	lwarx	%0,0,%1,%4	# trylock_clear_tail_cpu		\n"
+"	cmpw	0,%0,%2							\n"
+"	bne-	2f							\n"
+"	stwcx.	%3,0,%1							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+"2:									\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r"(old), "r" (new),
+	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
+	: "cr0", "memory");
+
+	if (likely(prev == old))
 		return 1;
-	else
-		return 0;
+	return 0;
 }
 
 /*
@@ -56,20 +82,25 @@ static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int va
  * This provides a release barrier for publishing node, and an acquire barrier
  * for getting the old node.
  */
-static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)
+static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
 {
-	for (;;) {
-		int val = atomic_read(&lock->val);
-		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
-		int old;
-
-		old = atomic_cmpxchg(&lock->val, val, newval);
-		if (old == val)
-			return old;
-	}
+	u32 prev, tmp;
+
+	asm volatile(
+"\t"	PPC_RELEASE_BARRIER "						\n"
+"1:	lwarx	%0,0,%2		# publish_tail_cpu			\n"
+"	andc	%1,%0,%4						\n"
+"	or	%1,%1,%3						\n"
+"	stwcx.	%1,0,%2							\n"
+"	bne-	1b							\n"
+	: "=&r" (prev), "=&r"(tmp)
+	: "r" (&lock->val), "r" (tail), "r"(_Q_TAIL_CPU_MASK)
+	: "cr0", "memory");
+
+	return prev;
 }
 
-static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
+static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 {
 	int cpu = get_tail_cpu(val);
 	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
@@ -88,7 +119,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 {
 	struct qnodes *qnodesp;
 	struct qnode *next, *node;
-	int val, old, tail;
+	u32 val, old, tail;
 	int idx;
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
@@ -134,7 +165,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 	}
 
 	/* We're at the head of the waitqueue, wait for the lock. */
-	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_VAL)
+	while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
 		cpu_relax();
 
 	/* If we're the last queued, must clean up the tail. */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (4 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-10  4:31   ` Jordan Niethe
  2022-11-10  0:40   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency Nicholas Piggin
                   ` (11 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Allow new waiters a number of spins on the lock word before queueing,
which particularly helps paravirt performance when physical CPUs are
oversubscribed.
---
 arch/powerpc/lib/qspinlock.c | 152 ++++++++++++++++++++++++++++++++---
 1 file changed, 141 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 7c71e5e287df..1625cce714b2 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -19,8 +19,17 @@ struct qnodes {
 	struct qnode nodes[MAX_NODES];
 };
 
+/* Tuning parameters */
+static int STEAL_SPINS __read_mostly = (1<<5);
+static bool MAYBE_STEALERS __read_mostly = true;
+
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
+static __always_inline int get_steal_spins(void)
+{
+	return STEAL_SPINS;
+}
+
 static inline u32 encode_tail_cpu(void)
 {
 	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
@@ -76,6 +85,39 @@ static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 ol
 	return 0;
 }
 
+static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 old, u32 new)
+{
+	u32 prev;
+
+	BUG_ON(old & _Q_LOCKED_VAL);
+
+	asm volatile(
+"1:	lwarx	%0,0,%1,%4	# queued_spin_trylock_cmpxchg		\n"
+"	cmpw	0,%0,%2							\n"
+"	bne-	2f							\n"
+"	stwcx.	%3,0,%1							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+"2:									\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r"(old), "r" (new),
+	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
+	: "cr0", "memory");
+
+	return prev;
+}
+
+/* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
+static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 val)
+{
+	u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
+
+	if (__trylock_cmpxchg(lock, val, newval) == val)
+		return 1;
+	else
+		return 0;
+}
+
 /*
  * Publish our tail, replacing previous tail. Return previous value.
  *
@@ -115,6 +157,31 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
+static inline bool try_to_steal_lock(struct qspinlock *lock)
+{
+	int iters;
+
+	/* Attempt to steal the lock */
+	for (;;) {
+		u32 val = READ_ONCE(lock->val);
+
+		if (unlikely(!(val & _Q_LOCKED_VAL))) {
+			if (trylock_with_tail_cpu(lock, val))
+				return true;
+			continue;
+		}
+
+		cpu_relax();
+
+		iters++;
+
+		if (iters >= get_steal_spins())
+			break;
+	}
+
+	return false;
+}
+
 static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 {
 	struct qnodes *qnodesp;
@@ -164,20 +231,39 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 		smp_rmb(); /* acquire barrier for the mcs lock */
 	}
 
-	/* We're at the head of the waitqueue, wait for the lock. */
-	while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
-		cpu_relax();
+	if (!MAYBE_STEALERS) {
+		/* We're at the head of the waitqueue, wait for the lock. */
+		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
+			cpu_relax();
 
-	/* If we're the last queued, must clean up the tail. */
-	if ((val & _Q_TAIL_CPU_MASK) == tail) {
-		if (trylock_clear_tail_cpu(lock, val))
-			goto release;
-		/* Another waiter must have enqueued */
-	}
+		/* If we're the last queued, must clean up the tail. */
+		if ((val & _Q_TAIL_CPU_MASK) == tail) {
+			if (trylock_clear_tail_cpu(lock, val))
+				goto release;
+			/* Another waiter must have enqueued. */
+		}
+
+		/* We must be the owner, just set the lock bit and acquire */
+		lock_set_locked(lock);
+	} else {
+again:
+		/* We're at the head of the waitqueue, wait for the lock. */
+		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
+			cpu_relax();
 
-	/* We must be the owner, just set the lock bit and acquire */
-	lock_set_locked(lock);
+		/* If we're the last queued, must clean up the tail. */
+		if ((val & _Q_TAIL_CPU_MASK) == tail) {
+			if (trylock_clear_tail_cpu(lock, val))
+				goto release;
+			/* Another waiter must have enqueued, or lock stolen. */
+		} else {
+			if (trylock_with_tail_cpu(lock, val))
+				goto unlock_next;
+		}
+		goto again;
+	}
 
+unlock_next:
 	/* contended path; must wait for next != NULL (MCS protocol) */
 	while (!(next = READ_ONCE(node->next)))
 		cpu_relax();
@@ -197,6 +283,9 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 
 void queued_spin_lock_slowpath(struct qspinlock *lock)
 {
+	if (try_to_steal_lock(lock))
+		return;
+
 	queued_spin_lock_mcs_queue(lock);
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
@@ -207,3 +296,44 @@ void pv_spinlocks_init(void)
 }
 #endif
 
+#include <linux/debugfs.h>
+static int steal_spins_set(void *data, u64 val)
+{
+	static DEFINE_MUTEX(lock);
+
+	mutex_lock(&lock);
+	if (val && !STEAL_SPINS) {
+		MAYBE_STEALERS = true;
+		/* wait for waiter to go away */
+		synchronize_rcu();
+		STEAL_SPINS = val;
+	} else if (!val && STEAL_SPINS) {
+		STEAL_SPINS = val;
+		/* wait for all possible stealers to go away */
+		synchronize_rcu();
+		MAYBE_STEALERS = false;
+	} else {
+		STEAL_SPINS = val;
+	}
+	mutex_unlock(&lock);
+
+	return 0;
+}
+
+static int steal_spins_get(void *data, u64 *val)
+{
+	*val = STEAL_SPINS;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
+
+static __init int spinlock_debugfs_init(void)
+{
+	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
+
+	return 0;
+}
+device_initcall(spinlock_debugfs_init);
+
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (5 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-10  5:51   ` Jordan Niethe
  2022-11-10  0:40   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word Nicholas Piggin
                   ` (10 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Give the queue head the ability to stop stealers. After a number of
spins without sucessfully acquiring the lock, the queue head employs
this, which will assure it is the next owner.
---
 arch/powerpc/include/asm/qspinlock_types.h | 10 +++-
 arch/powerpc/lib/qspinlock.c               | 56 +++++++++++++++++++++-
 2 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 210adf05b235..8b20f5e22bba 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -29,7 +29,8 @@ typedef struct qspinlock {
  * Bitfields in the lock word:
  *
  *     0: locked bit
- * 16-31: tail cpu (+1)
+ *    16: must queue bit
+ * 17-31: tail cpu (+1)
  */
 #define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
 				      << _Q_ ## type ## _OFFSET)
@@ -38,7 +39,12 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
 
-#define _Q_TAIL_CPU_OFFSET	16
+#define _Q_MUST_Q_OFFSET	16
+#define _Q_MUST_Q_BITS		1
+#define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
+#define _Q_MUST_Q_VAL		(1U << _Q_MUST_Q_OFFSET)
+
+#define _Q_TAIL_CPU_OFFSET	17
 #define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
 
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 1625cce714b2..a906cc8f15fa 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -22,6 +22,7 @@ struct qnodes {
 /* Tuning parameters */
 static int STEAL_SPINS __read_mostly = (1<<5);
 static bool MAYBE_STEALERS __read_mostly = true;
+static int HEAD_SPINS __read_mostly = (1<<8);
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -30,6 +31,11 @@ static __always_inline int get_steal_spins(void)
 	return STEAL_SPINS;
 }
 
+static __always_inline int get_head_spins(void)
+{
+	return HEAD_SPINS;
+}
+
 static inline u32 encode_tail_cpu(void)
 {
 	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
@@ -142,6 +148,23 @@ static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
 	return prev;
 }
 
+static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
+{
+	u32 new = _Q_MUST_Q_VAL;
+	u32 prev;
+
+	asm volatile(
+"1:	lwarx	%0,0,%1		# lock_set_mustq			\n"
+"	or	%0,%0,%2						\n"
+"	stwcx.	%0,0,%1							\n"
+"	bne-	1b							\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r" (new)
+	: "cr0", "memory");
+
+	return prev;
+}
+
 static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 {
 	int cpu = get_tail_cpu(val);
@@ -165,6 +188,9 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
 	for (;;) {
 		u32 val = READ_ONCE(lock->val);
 
+		if (val & _Q_MUST_Q_VAL)
+			break;
+
 		if (unlikely(!(val & _Q_LOCKED_VAL))) {
 			if (trylock_with_tail_cpu(lock, val))
 				return true;
@@ -246,11 +272,22 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 		/* We must be the owner, just set the lock bit and acquire */
 		lock_set_locked(lock);
 	} else {
+		int iters = 0;
+		bool set_mustq = false;
+
 again:
 		/* We're at the head of the waitqueue, wait for the lock. */
-		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
+		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
 			cpu_relax();
 
+			iters++;
+			if (!set_mustq && iters >= get_head_spins()) {
+				set_mustq = true;
+				lock_set_mustq(lock);
+				val |= _Q_MUST_Q_VAL;
+			}
+		}
+
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
 			if (trylock_clear_tail_cpu(lock, val))
@@ -329,9 +366,26 @@ static int steal_spins_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
 
+static int head_spins_set(void *data, u64 val)
+{
+	HEAD_SPINS = val;
+
+	return 0;
+}
+
+static int head_spins_get(void *data, u64 *val)
+{
+	*val = HEAD_SPINS;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_head_spins, head_spins_get, head_spins_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
+	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
 
 	return 0;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (6 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-12  0:50   ` Jordan Niethe
  2022-11-10  0:40   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner Nicholas Piggin
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Store the owner CPU number in the lock word so it may be yielded to,
as powerpc's paravirtualised simple spinlocks do.
---
 arch/powerpc/include/asm/qspinlock.h       |  8 +++++++-
 arch/powerpc/include/asm/qspinlock_types.h | 10 ++++++++++
 arch/powerpc/lib/qspinlock.c               |  6 +++---
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 3ab354159e5e..44601b261e08 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -20,9 +20,15 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
 }
 
+static __always_inline u32 queued_spin_get_locked_val(void)
+{
+	/* XXX: make this use lock value in paca like simple spinlocks? */
+	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
+}
+
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
-	u32 new = _Q_LOCKED_VAL;
+	u32 new = queued_spin_get_locked_val();
 	u32 prev;
 
 	asm volatile(
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 8b20f5e22bba..35f9525381e6 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -29,6 +29,8 @@ typedef struct qspinlock {
  * Bitfields in the lock word:
  *
  *     0: locked bit
+ *  1-14: lock holder cpu
+ *    15: unused bit
  *    16: must queue bit
  * 17-31: tail cpu (+1)
  */
@@ -39,6 +41,14 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
 
+#define _Q_OWNER_CPU_OFFSET	1
+#define _Q_OWNER_CPU_BITS	14
+#define _Q_OWNER_CPU_MASK	_Q_SET_MASK(OWNER_CPU)
+
+#if CONFIG_NR_CPUS > (1U << _Q_OWNER_CPU_BITS)
+#error "qspinlock does not support such large CONFIG_NR_CPUS"
+#endif
+
 #define _Q_MUST_Q_OFFSET	16
 #define _Q_MUST_Q_BITS		1
 #define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index a906cc8f15fa..aa26cfe21f18 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -50,7 +50,7 @@ static inline int get_tail_cpu(u32 val)
 /* Take the lock by setting the lock bit, no other CPUs will touch it. */
 static __always_inline void lock_set_locked(struct qspinlock *lock)
 {
-	u32 new = _Q_LOCKED_VAL;
+	u32 new = queued_spin_get_locked_val();
 	u32 prev;
 
 	asm volatile(
@@ -68,7 +68,7 @@ static __always_inline void lock_set_locked(struct qspinlock *lock)
 /* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
 static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
 {
-	u32 new = _Q_LOCKED_VAL;
+	u32 new = queued_spin_get_locked_val();
 	u32 prev;
 
 	BUG_ON(old & _Q_LOCKED_VAL);
@@ -116,7 +116,7 @@ static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 old, u3
 /* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
 static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 val)
 {
-	u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
+	u32 newval = queued_spin_get_locked_val() | (val & _Q_TAIL_CPU_MASK);
 
 	if (__trylock_cmpxchg(lock, val, newval) == val)
 		return 1;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (7 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-12  2:01   ` Jordan Niethe
  2022-11-10  0:41   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node Nicholas Piggin
                   ` (8 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Waiters spinning on the lock word should yield to the lock owner if the
vCPU is preempted. This improves performance when the hypervisor has
oversubscribed physical CPUs.
---
 arch/powerpc/lib/qspinlock.c | 97 ++++++++++++++++++++++++++++++------
 1 file changed, 83 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index aa26cfe21f18..55286ac91da5 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -5,6 +5,7 @@
 #include <linux/percpu.h>
 #include <linux/smp.h>
 #include <asm/qspinlock.h>
+#include <asm/paravirt.h>
 
 #define MAX_NODES	4
 
@@ -24,14 +25,16 @@ static int STEAL_SPINS __read_mostly = (1<<5);
 static bool MAYBE_STEALERS __read_mostly = true;
 static int HEAD_SPINS __read_mostly = (1<<8);
 
+static bool pv_yield_owner __read_mostly = true;
+
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
-static __always_inline int get_steal_spins(void)
+static __always_inline int get_steal_spins(bool paravirt)
 {
 	return STEAL_SPINS;
 }
 
-static __always_inline int get_head_spins(void)
+static __always_inline int get_head_spins(bool paravirt)
 {
 	return HEAD_SPINS;
 }
@@ -46,7 +49,11 @@ static inline int get_tail_cpu(u32 val)
 	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
 }
 
-/* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
+static inline int get_owner_cpu(u32 val)
+{
+	return (val & _Q_OWNER_CPU_MASK) >> _Q_OWNER_CPU_OFFSET;
+}
+
 /* Take the lock by setting the lock bit, no other CPUs will touch it. */
 static __always_inline void lock_set_locked(struct qspinlock *lock)
 {
@@ -180,7 +187,45 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
-static inline bool try_to_steal_lock(struct qspinlock *lock)
+static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+{
+	int owner;
+	u32 yield_count;
+
+	BUG_ON(!(val & _Q_LOCKED_VAL));
+
+	if (!paravirt)
+		goto relax;
+
+	if (!pv_yield_owner)
+		goto relax;
+
+	owner = get_owner_cpu(val);
+	yield_count = yield_count_of(owner);
+
+	if ((yield_count & 1) == 0)
+		goto relax; /* owner vcpu is running */
+
+	/*
+	 * Read the lock word after sampling the yield count. On the other side
+	 * there may a wmb because the yield count update is done by the
+	 * hypervisor preemption and the value update by the OS, however this
+	 * ordering might reduce the chance of out of order accesses and
+	 * improve the heuristic.
+	 */
+	smp_rmb();
+
+	if (READ_ONCE(lock->val) == val) {
+		yield_to_preempted(owner, yield_count);
+		/* Don't relax if we yielded. Maybe we should? */
+		return;
+	}
+relax:
+	cpu_relax();
+}
+
+
+static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
 {
 	int iters;
 
@@ -197,18 +242,18 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
 			continue;
 		}
 
-		cpu_relax();
+		yield_to_locked_owner(lock, val, paravirt);
 
 		iters++;
 
-		if (iters >= get_steal_spins())
+		if (iters >= get_steal_spins(paravirt))
 			break;
 	}
 
 	return false;
 }
 
-static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
+static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool paravirt)
 {
 	struct qnodes *qnodesp;
 	struct qnode *next, *node;
@@ -260,7 +305,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 	if (!MAYBE_STEALERS) {
 		/* We're at the head of the waitqueue, wait for the lock. */
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
-			cpu_relax();
+			yield_to_locked_owner(lock, val, paravirt);
 
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
@@ -278,10 +323,10 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 again:
 		/* We're at the head of the waitqueue, wait for the lock. */
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
-			cpu_relax();
+			yield_to_locked_owner(lock, val, paravirt);
 
 			iters++;
-			if (!set_mustq && iters >= get_head_spins()) {
+			if (!set_mustq && iters >= get_head_spins(paravirt)) {
 				set_mustq = true;
 				lock_set_mustq(lock);
 				val |= _Q_MUST_Q_VAL;
@@ -320,10 +365,15 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 
 void queued_spin_lock_slowpath(struct qspinlock *lock)
 {
-	if (try_to_steal_lock(lock))
-		return;
-
-	queued_spin_lock_mcs_queue(lock);
+	if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor()) {
+		if (try_to_steal_lock(lock, true))
+			return;
+		queued_spin_lock_mcs_queue(lock, true);
+	} else {
+		if (try_to_steal_lock(lock, false))
+			return;
+		queued_spin_lock_mcs_queue(lock, false);
+	}
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
@@ -382,10 +432,29 @@ static int head_spins_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_head_spins, head_spins_get, head_spins_set, "%llu\n");
 
+static int pv_yield_owner_set(void *data, u64 val)
+{
+	pv_yield_owner = !!val;
+
+	return 0;
+}
+
+static int pv_yield_owner_get(void *data, u64 *val)
+{
+	*val = pv_yield_owner;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
 	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
+	if (is_shared_processor()) {
+		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
+	}
 
 	return 0;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (8 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-12  2:07   ` Jordan Niethe
  2022-11-10  0:41   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields Nicholas Piggin
                   ` (7 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Queued waiters which are not at the head of the queue don't spin on
the lock word but their qnode lock word, waiting for the previous queued
CPU to release them. Add an option which allows these waiters to yield
to the previous CPU if its vCPU is preempted.

Disable this option by default for now, i.e., no logical change.
---
 arch/powerpc/lib/qspinlock.c | 46 +++++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 55286ac91da5..b39f8c5b329c 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
 static int HEAD_SPINS __read_mostly = (1<<8);
 
 static bool pv_yield_owner __read_mostly = true;
+static bool pv_yield_prev __read_mostly = true;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -224,6 +225,31 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
 	cpu_relax();
 }
 
+static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
+{
+	u32 yield_count;
+
+	if (!paravirt)
+		goto relax;
+
+	if (!pv_yield_prev)
+		goto relax;
+
+	yield_count = yield_count_of(prev_cpu);
+	if ((yield_count & 1) == 0)
+		goto relax; /* owner vcpu is running */
+
+	smp_rmb(); /* See yield_to_locked_owner comment */
+
+	if (!node->locked) {
+		yield_to_preempted(prev_cpu, yield_count);
+		return;
+	}
+
+relax:
+	cpu_relax();
+}
+
 
 static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
 {
@@ -291,13 +317,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	 */
 	if (old & _Q_TAIL_CPU_MASK) {
 		struct qnode *prev = get_tail_qnode(lock, old);
+		int prev_cpu = get_tail_cpu(old);
 
 		/* Link @node into the waitqueue. */
 		WRITE_ONCE(prev->next, node);
 
 		/* Wait for mcs node lock to be released */
 		while (!node->locked)
-			cpu_relax();
+			yield_to_prev(lock, node, prev_cpu, paravirt);
 
 		smp_rmb(); /* acquire barrier for the mcs lock */
 	}
@@ -448,12 +475,29 @@ static int pv_yield_owner_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
 
+static int pv_yield_prev_set(void *data, u64 val)
+{
+	pv_yield_prev = !!val;
+
+	return 0;
+}
+
+static int pv_yield_prev_get(void *data, u64 *val)
+{
+	*val = pv_yield_prev;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, pv_yield_prev_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
 	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
 	if (is_shared_processor()) {
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
+		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 	}
 
 	return 0;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (9 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-12  4:06   ` Jordan Niethe
  2022-11-10  0:42   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
                   ` (6 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

If the head of queue is preventing stealing but it finds the owner vCPU
is preempted, it will yield its cycles to the owner which could cause it
to become preempted. Add an option to re-allow stealers before yielding,
and disallow them again after returning from the yield.

Disable this option by default for now, i.e., no logical change.
---
 arch/powerpc/lib/qspinlock.c | 56 ++++++++++++++++++++++++++++++++++--
 1 file changed, 53 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index b39f8c5b329c..94f007f66942 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
 static int HEAD_SPINS __read_mostly = (1<<8);
 
 static bool pv_yield_owner __read_mostly = true;
+static bool pv_yield_allow_steal __read_mostly = false;
 static bool pv_yield_prev __read_mostly = true;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
@@ -173,6 +174,23 @@ static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
 	return prev;
 }
 
+static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
+{
+	u32 new = _Q_MUST_Q_VAL;
+	u32 prev;
+
+	asm volatile(
+"1:	lwarx	%0,0,%1		# lock_clear_mustq			\n"
+"	andc	%0,%0,%2						\n"
+"	stwcx.	%0,0,%1							\n"
+"	bne-	1b							\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r" (new)
+	: "cr0", "memory");
+
+	return prev;
+}
+
 static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 {
 	int cpu = get_tail_cpu(val);
@@ -188,7 +206,7 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
-static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
 {
 	int owner;
 	u32 yield_count;
@@ -217,7 +235,11 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
 	smp_rmb();
 
 	if (READ_ONCE(lock->val) == val) {
+		if (clear_mustq)
+			lock_clear_mustq(lock);
 		yield_to_preempted(owner, yield_count);
+		if (clear_mustq)
+			lock_set_mustq(lock);
 		/* Don't relax if we yielded. Maybe we should? */
 		return;
 	}
@@ -225,6 +247,16 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
 	cpu_relax();
 }
 
+static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+{
+	__yield_to_locked_owner(lock, val, paravirt, false);
+}
+
+static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
+{
+	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
+}
+
 static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
 {
 	u32 yield_count;
@@ -332,7 +364,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	if (!MAYBE_STEALERS) {
 		/* We're at the head of the waitqueue, wait for the lock. */
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
-			yield_to_locked_owner(lock, val, paravirt);
+			yield_head_to_locked_owner(lock, val, paravirt, false);
 
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
@@ -350,7 +382,8 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 again:
 		/* We're at the head of the waitqueue, wait for the lock. */
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
-			yield_to_locked_owner(lock, val, paravirt);
+			yield_head_to_locked_owner(lock, val, paravirt,
+					pv_yield_allow_steal && set_mustq);
 
 			iters++;
 			if (!set_mustq && iters >= get_head_spins(paravirt)) {
@@ -475,6 +508,22 @@ static int pv_yield_owner_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
 
+static int pv_yield_allow_steal_set(void *data, u64 val)
+{
+	pv_yield_allow_steal = !!val;
+
+	return 0;
+}
+
+static int pv_yield_allow_steal_get(void *data, u64 *val)
+{
+	*val = pv_yield_allow_steal;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_allow_steal, pv_yield_allow_steal_get, pv_yield_allow_steal_set, "%llu\n");
+
 static int pv_yield_prev_set(void *data, u64 val)
 {
 	pv_yield_prev = !!val;
@@ -497,6 +546,7 @@ static __init int spinlock_debugfs_init(void)
 	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
 	if (is_shared_processor()) {
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
+		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 	}
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (10 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-12  4:17   ` Jordan Niethe
                     ` (2 more replies)
  2022-07-28  6:31 ` [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU Nicholas Piggin
                   ` (5 subsequent siblings)
  17 siblings, 3 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Having all CPUs poll the lock word for the owner CPU that should be
yielded to defeats most of the purpose of using MCS queueing for
scalability. Yet it may be desirable for queued waiters to to yield
to a preempted owner.

s390 addreses this problem by having queued waiters sample the lock
word to find the owner much less frequently. In this approach, the
waiters never sample it directly, but the queue head propagates the
owner CPU back to the next waiter if it ever finds the owner has
been preempted. Queued waiters then subsequently propagate the owner
CPU back to the next waiter, and so on.

Disable this option by default for now, i.e., no logical change.
---
 arch/powerpc/lib/qspinlock.c | 85 +++++++++++++++++++++++++++++++++++-
 1 file changed, 84 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 94f007f66942..28c85a2d5635 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -12,6 +12,7 @@
 struct qnode {
 	struct qnode	*next;
 	struct qspinlock *lock;
+	int		yield_cpu;
 	u8		locked; /* 1 if lock acquired */
 };
 
@@ -28,6 +29,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
 static bool pv_yield_owner __read_mostly = true;
 static bool pv_yield_allow_steal __read_mostly = false;
 static bool pv_yield_prev __read_mostly = true;
+static bool pv_yield_propagate_owner __read_mostly = true;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -257,13 +259,66 @@ static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u
 	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
 }
 
+static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
+{
+	struct qnode *next;
+	int owner;
+
+	if (!paravirt)
+		return;
+	if (!pv_yield_propagate_owner)
+		return;
+
+	owner = get_owner_cpu(val);
+	if (*set_yield_cpu == owner)
+		return;
+
+	next = READ_ONCE(node->next);
+	if (!next)
+		return;
+
+	if (vcpu_is_preempted(owner)) {
+		next->yield_cpu = owner;
+		*set_yield_cpu = owner;
+	} else if (*set_yield_cpu != -1) {
+		next->yield_cpu = owner;
+		*set_yield_cpu = owner;
+	}
+}
+
 static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
 {
 	u32 yield_count;
+	int yield_cpu;
 
 	if (!paravirt)
 		goto relax;
 
+	if (!pv_yield_propagate_owner)
+		goto yield_prev;
+
+	yield_cpu = READ_ONCE(node->yield_cpu);
+	if (yield_cpu == -1) {
+		/* Propagate back the -1 CPU */
+		if (node->next && node->next->yield_cpu != -1)
+			node->next->yield_cpu = yield_cpu;
+		goto yield_prev;
+	}
+
+	yield_count = yield_count_of(yield_cpu);
+	if ((yield_count & 1) == 0)
+		goto yield_prev; /* owner vcpu is running */
+
+	smp_rmb();
+
+	if (yield_cpu == node->yield_cpu) {
+		if (node->next && node->next->yield_cpu != yield_cpu)
+			node->next->yield_cpu = yield_cpu;
+		yield_to_preempted(yield_cpu, yield_count);
+		return;
+	}
+
+yield_prev:
 	if (!pv_yield_prev)
 		goto relax;
 
@@ -337,6 +392,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	node = &qnodesp->nodes[idx];
 	node->next = NULL;
 	node->lock = lock;
+	node->yield_cpu = -1;
 	node->locked = 0;
 
 	tail = encode_tail_cpu();
@@ -358,13 +414,21 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		while (!node->locked)
 			yield_to_prev(lock, node, prev_cpu, paravirt);
 
+		/* Clear out stale propagated yield_cpu */
+		if (paravirt && pv_yield_propagate_owner && node->yield_cpu != -1)
+			node->yield_cpu = -1;
+
 		smp_rmb(); /* acquire barrier for the mcs lock */
 	}
 
 	if (!MAYBE_STEALERS) {
+		int set_yield_cpu = -1;
+
 		/* We're at the head of the waitqueue, wait for the lock. */
-		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
+		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
+			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
 			yield_head_to_locked_owner(lock, val, paravirt, false);
+		}
 
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
@@ -376,12 +440,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		/* We must be the owner, just set the lock bit and acquire */
 		lock_set_locked(lock);
 	} else {
+		int set_yield_cpu = -1;
 		int iters = 0;
 		bool set_mustq = false;
 
 again:
 		/* We're at the head of the waitqueue, wait for the lock. */
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
+			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
 			yield_head_to_locked_owner(lock, val, paravirt,
 					pv_yield_allow_steal && set_mustq);
 
@@ -540,6 +606,22 @@ static int pv_yield_prev_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, pv_yield_prev_set, "%llu\n");
 
+static int pv_yield_propagate_owner_set(void *data, u64 val)
+{
+	pv_yield_propagate_owner = !!val;
+
+	return 0;
+}
+
+static int pv_yield_propagate_owner_get(void *data, u64 *val)
+{
+	*val = pv_yield_propagate_owner;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
@@ -548,6 +630,7 @@ static __init int spinlock_debugfs_init(void)
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
 		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
+		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
 	}
 
 	return 0;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (11 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-12  4:22   ` Jordan Niethe
  2022-11-10  0:42   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal Nicholas Piggin
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

After the head of the queue acquires the lock, it releases the
next waiter in the queue to become the new head. Add an option
to prod the new head if its vCPU was preempted. This may only
have an effect if queue waiters are yielding.

Disable this option by default for now, i.e., no logical change.
---
 arch/powerpc/lib/qspinlock.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 28c85a2d5635..3b10e31bcf0a 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -12,6 +12,7 @@
 struct qnode {
 	struct qnode	*next;
 	struct qspinlock *lock;
+	int		cpu;
 	int		yield_cpu;
 	u8		locked; /* 1 if lock acquired */
 };
@@ -30,6 +31,7 @@ static bool pv_yield_owner __read_mostly = true;
 static bool pv_yield_allow_steal __read_mostly = false;
 static bool pv_yield_prev __read_mostly = true;
 static bool pv_yield_propagate_owner __read_mostly = true;
+static bool pv_prod_head __read_mostly = false;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -392,6 +394,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	node = &qnodesp->nodes[idx];
 	node->next = NULL;
 	node->lock = lock;
+	node->cpu = smp_processor_id();
 	node->yield_cpu = -1;
 	node->locked = 0;
 
@@ -483,7 +486,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	 * this store to locked. The corresponding barrier is the smp_rmb()
 	 * acquire barrier for mcs lock, above.
 	 */
-	WRITE_ONCE(next->locked, 1);
+	if (paravirt && pv_prod_head) {
+		int next_cpu = next->cpu;
+		WRITE_ONCE(next->locked, 1);
+		if (vcpu_is_preempted(next_cpu))
+			prod_cpu(next_cpu);
+	} else {
+		WRITE_ONCE(next->locked, 1);
+	}
 
 release:
 	qnodesp->count--; /* release the node */
@@ -622,6 +632,22 @@ static int pv_yield_propagate_owner_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
 
+static int pv_prod_head_set(void *data, u64 val)
+{
+	pv_prod_head = !!val;
+
+	return 0;
+}
+
+static int pv_prod_head_get(void *data, u64 *val)
+{
+	*val = pv_prod_head;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, pv_prod_head_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
@@ -631,6 +657,7 @@ static __init int spinlock_debugfs_init(void)
 		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
+		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);
 	}
 
 	return 0;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (12 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-12  4:32   ` Jordan Niethe
  2022-11-10  0:43   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API Nicholas Piggin
                   ` (3 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

This gives trylock slightly more strength, and it also gives most
of the benefit of passing 'val' back through the slowpath without
the complexity.
---
 arch/powerpc/include/asm/qspinlock.h | 39 +++++++++++++++++++++++++++-
 arch/powerpc/lib/qspinlock.c         |  9 +++++++
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 44601b261e08..d3d2039237b2 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -5,6 +5,8 @@
 #include <linux/compiler.h>
 #include <asm/qspinlock_types.h>
 
+#define _Q_SPIN_TRY_LOCK_STEAL 1
+
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
 	return READ_ONCE(lock->val);
@@ -26,11 +28,12 @@ static __always_inline u32 queued_spin_get_locked_val(void)
 	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
 }
 
-static __always_inline int queued_spin_trylock(struct qspinlock *lock)
+static __always_inline int __queued_spin_trylock_nosteal(struct qspinlock *lock)
 {
 	u32 new = queued_spin_get_locked_val();
 	u32 prev;
 
+	/* Trylock succeeds only when unlocked and no queued nodes */
 	asm volatile(
 "1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"
 "	cmpwi	0,%0,0							\n"
@@ -49,6 +52,40 @@ static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 	return 0;
 }
 
+static __always_inline int __queued_spin_trylock_steal(struct qspinlock *lock)
+{
+	u32 new = queued_spin_get_locked_val();
+	u32 prev, tmp;
+
+	/* Trylock may get ahead of queued nodes if it finds unlocked */
+	asm volatile(
+"1:	lwarx	%0,0,%2,%5	# queued_spin_trylock			\n"
+"	andc.	%1,%0,%4						\n"
+"	bne-	2f							\n"
+"	and	%1,%0,%4						\n"
+"	or	%1,%1,%3						\n"
+"	stwcx.	%1,0,%2							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+"2:									\n"
+	: "=&r" (prev), "=&r" (tmp)
+	: "r" (&lock->val), "r" (new), "r" (_Q_TAIL_CPU_MASK),
+	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
+	: "cr0", "memory");
+
+	if (likely(!(prev & ~_Q_TAIL_CPU_MASK)))
+		return 1;
+	return 0;
+}
+
+static __always_inline int queued_spin_trylock(struct qspinlock *lock)
+{
+	if (!_Q_SPIN_TRY_LOCK_STEAL)
+		return __queued_spin_trylock_nosteal(lock);
+	else
+		return __queued_spin_trylock_steal(lock);
+}
+
 void queued_spin_lock_slowpath(struct qspinlock *lock);
 
 static __always_inline void queued_spin_lock(struct qspinlock *lock)
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 3b10e31bcf0a..277aef1fab0a 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -24,7 +24,11 @@ struct qnodes {
 
 /* Tuning parameters */
 static int STEAL_SPINS __read_mostly = (1<<5);
+#if _Q_SPIN_TRY_LOCK_STEAL == 1
+static const bool MAYBE_STEALERS = true;
+#else
 static bool MAYBE_STEALERS __read_mostly = true;
+#endif
 static int HEAD_SPINS __read_mostly = (1<<8);
 
 static bool pv_yield_owner __read_mostly = true;
@@ -522,6 +526,10 @@ void pv_spinlocks_init(void)
 #include <linux/debugfs.h>
 static int steal_spins_set(void *data, u64 val)
 {
+#if _Q_SPIN_TRY_LOCK_STEAL == 1
+	/* MAYBE_STEAL remains true */
+	STEAL_SPINS = val;
+#else
 	static DEFINE_MUTEX(lock);
 
 	mutex_lock(&lock);
@@ -539,6 +547,7 @@ static int steal_spins_set(void *data, u64 val)
 		STEAL_SPINS = val;
 	}
 	mutex_unlock(&lock);
+#endif
 
 	return 0;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (13 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-12  4:36   ` Jordan Niethe
  2022-11-10  0:43   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins Nicholas Piggin
                   ` (2 subsequent siblings)
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Use the spin_begin/spin_cpu_relax/spin_end APIs in qspinlock, which helps
to prevent threads issuing a lot of expensive priority nops which may not
have much effect due to immediately executing low then medium priority.
---
 arch/powerpc/lib/qspinlock.c | 35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 277aef1fab0a..d4594c701f7d 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -233,6 +233,8 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
 	if ((yield_count & 1) == 0)
 		goto relax; /* owner vcpu is running */
 
+	spin_end();
+
 	/*
 	 * Read the lock word after sampling the yield count. On the other side
 	 * there may a wmb because the yield count update is done by the
@@ -248,11 +250,13 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
 		yield_to_preempted(owner, yield_count);
 		if (clear_mustq)
 			lock_set_mustq(lock);
+		spin_begin();
 		/* Don't relax if we yielded. Maybe we should? */
 		return;
 	}
+	spin_begin();
 relax:
-	cpu_relax();
+	spin_cpu_relax();
 }
 
 static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
@@ -315,14 +319,18 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 	if ((yield_count & 1) == 0)
 		goto yield_prev; /* owner vcpu is running */
 
+	spin_end();
+
 	smp_rmb();
 
 	if (yield_cpu == node->yield_cpu) {
 		if (node->next && node->next->yield_cpu != yield_cpu)
 			node->next->yield_cpu = yield_cpu;
 		yield_to_preempted(yield_cpu, yield_count);
+		spin_begin();
 		return;
 	}
+	spin_begin();
 
 yield_prev:
 	if (!pv_yield_prev)
@@ -332,15 +340,19 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 	if ((yield_count & 1) == 0)
 		goto relax; /* owner vcpu is running */
 
+	spin_end();
+
 	smp_rmb(); /* See yield_to_locked_owner comment */
 
 	if (!node->locked) {
 		yield_to_preempted(prev_cpu, yield_count);
+		spin_begin();
 		return;
 	}
+	spin_begin();
 
 relax:
-	cpu_relax();
+	spin_cpu_relax();
 }
 
 
@@ -349,6 +361,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 	int iters;
 
 	/* Attempt to steal the lock */
+	spin_begin();
 	for (;;) {
 		u32 val = READ_ONCE(lock->val);
 
@@ -356,8 +369,10 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 			break;
 
 		if (unlikely(!(val & _Q_LOCKED_VAL))) {
+			spin_end();
 			if (trylock_with_tail_cpu(lock, val))
 				return true;
+			spin_begin();
 			continue;
 		}
 
@@ -368,6 +383,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 		if (iters >= get_steal_spins(paravirt))
 			break;
 	}
+	spin_end();
 
 	return false;
 }
@@ -418,8 +434,10 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		WRITE_ONCE(prev->next, node);
 
 		/* Wait for mcs node lock to be released */
+		spin_begin();
 		while (!node->locked)
 			yield_to_prev(lock, node, prev_cpu, paravirt);
+		spin_end();
 
 		/* Clear out stale propagated yield_cpu */
 		if (paravirt && pv_yield_propagate_owner && node->yield_cpu != -1)
@@ -432,10 +450,12 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		int set_yield_cpu = -1;
 
 		/* We're at the head of the waitqueue, wait for the lock. */
+		spin_begin();
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
 			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
 			yield_head_to_locked_owner(lock, val, paravirt, false);
 		}
+		spin_end();
 
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
@@ -453,6 +473,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 
 again:
 		/* We're at the head of the waitqueue, wait for the lock. */
+		spin_begin();
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
 			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
 			yield_head_to_locked_owner(lock, val, paravirt,
@@ -465,6 +486,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 				val |= _Q_MUST_Q_VAL;
 			}
 		}
+		spin_end();
 
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
@@ -480,8 +502,13 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 
 unlock_next:
 	/* contended path; must wait for next != NULL (MCS protocol) */
-	while (!(next = READ_ONCE(node->next)))
-		cpu_relax();
+	next = READ_ONCE(node->next);
+	if (!next) {
+		spin_begin();
+		while (!(next = READ_ONCE(node->next)))
+			cpu_relax();
+		spin_end();
+	}
 
 	/*
 	 * Unlock the next mcs waiter node. Release barrier is not required
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (14 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-12  4:43   ` Jordan Niethe
  2022-11-10  0:43   ` Jordan Niethe
  2022-07-28  6:31 ` [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner Nicholas Piggin
  2022-07-28  6:31 ` [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks Nicholas Piggin
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Allow for a reduction in the number of times a CPU from a different
node than the owner can attempt to steal the lock before queueing.
This could bias the transfer behaviour of the lock across the
machine and reduce NUMA crossings.
---
 arch/powerpc/lib/qspinlock.c | 34 +++++++++++++++++++++++++++++++---
 1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index d4594c701f7d..24f68bd71e2b 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -4,6 +4,7 @@
 #include <linux/export.h>
 #include <linux/percpu.h>
 #include <linux/smp.h>
+#include <linux/topology.h>
 #include <asm/qspinlock.h>
 #include <asm/paravirt.h>
 
@@ -24,6 +25,7 @@ struct qnodes {
 
 /* Tuning parameters */
 static int STEAL_SPINS __read_mostly = (1<<5);
+static int REMOTE_STEAL_SPINS __read_mostly = (1<<2);
 #if _Q_SPIN_TRY_LOCK_STEAL == 1
 static const bool MAYBE_STEALERS = true;
 #else
@@ -39,9 +41,13 @@ static bool pv_prod_head __read_mostly = false;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
-static __always_inline int get_steal_spins(bool paravirt)
+static __always_inline int get_steal_spins(bool paravirt, bool remote)
 {
-	return STEAL_SPINS;
+	if (remote) {
+		return REMOTE_STEAL_SPINS;
+	} else {
+		return STEAL_SPINS;
+	}
 }
 
 static __always_inline int get_head_spins(bool paravirt)
@@ -380,8 +386,13 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 
 		iters++;
 
-		if (iters >= get_steal_spins(paravirt))
+		if (iters >= get_steal_spins(paravirt, false))
 			break;
+		if (iters >= get_steal_spins(paravirt, true)) {
+			int cpu = get_owner_cpu(val);
+			if (numa_node_id() != cpu_to_node(cpu))
+				break;
+		}
 	}
 	spin_end();
 
@@ -588,6 +599,22 @@ static int steal_spins_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
 
+static int remote_steal_spins_set(void *data, u64 val)
+{
+	REMOTE_STEAL_SPINS = val;
+
+	return 0;
+}
+
+static int remote_steal_spins_get(void *data, u64 *val)
+{
+	*val = REMOTE_STEAL_SPINS;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_remote_steal_spins, remote_steal_spins_get, remote_steal_spins_set, "%llu\n");
+
 static int head_spins_set(void *data, u64 val)
 {
 	HEAD_SPINS = val;
@@ -687,6 +714,7 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, pv_prod_head_set, "
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
+	debugfs_create_file("qspl_remote_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_remote_steal_spins);
 	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
 	if (is_shared_processor()) {
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (15 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-12  4:49   ` Jordan Niethe
                     ` (2 more replies)
  2022-07-28  6:31 ` [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks Nicholas Piggin
  17 siblings, 3 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Provide an option that holds off queueing indefinitely while the lock
owner is preempted. This could reduce queueing latencies for very
overcommitted vcpu situations.

This is disabled by default.
---
 arch/powerpc/lib/qspinlock.c | 91 +++++++++++++++++++++++++++++++-----
 1 file changed, 79 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 24f68bd71e2b..5cfd69931e31 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -35,6 +35,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
 
 static bool pv_yield_owner __read_mostly = true;
 static bool pv_yield_allow_steal __read_mostly = false;
+static bool pv_spin_on_preempted_owner __read_mostly = false;
 static bool pv_yield_prev __read_mostly = true;
 static bool pv_yield_propagate_owner __read_mostly = true;
 static bool pv_prod_head __read_mostly = false;
@@ -220,13 +221,15 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
-static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
+static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
 {
 	int owner;
 	u32 yield_count;
 
 	BUG_ON(!(val & _Q_LOCKED_VAL));
 
+	*preempted = false;
+
 	if (!paravirt)
 		goto relax;
 
@@ -241,6 +244,8 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
 
 	spin_end();
 
+	*preempted = true;
+
 	/*
 	 * Read the lock word after sampling the yield count. On the other side
 	 * there may a wmb because the yield count update is done by the
@@ -265,14 +270,14 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
 	spin_cpu_relax();
 }
 
-static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool *preempted)
 {
-	__yield_to_locked_owner(lock, val, paravirt, false);
+	__yield_to_locked_owner(lock, val, paravirt, false, preempted);
 }
 
-static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
+static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
 {
-	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
+	__yield_to_locked_owner(lock, val, paravirt, clear_mustq, preempted);
 }
 
 static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
@@ -364,12 +369,33 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 
 static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
 {
-	int iters;
+	int iters = 0;
+
+	if (!STEAL_SPINS) {
+		if (paravirt && pv_spin_on_preempted_owner) {
+			spin_begin();
+			for (;;) {
+				u32 val = READ_ONCE(lock->val);
+				bool preempted;
+
+				if (val & _Q_MUST_Q_VAL)
+					break;
+				if (!(val & _Q_LOCKED_VAL))
+					break;
+				if (!vcpu_is_preempted(get_owner_cpu(val)))
+					break;
+				yield_to_locked_owner(lock, val, paravirt, &preempted);
+			}
+			spin_end();
+		}
+		return false;
+	}
 
 	/* Attempt to steal the lock */
 	spin_begin();
 	for (;;) {
 		u32 val = READ_ONCE(lock->val);
+		bool preempted;
 
 		if (val & _Q_MUST_Q_VAL)
 			break;
@@ -382,9 +408,22 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 			continue;
 		}
 
-		yield_to_locked_owner(lock, val, paravirt);
-
-		iters++;
+		yield_to_locked_owner(lock, val, paravirt, &preempted);
+
+		if (paravirt && preempted) {
+			if (!pv_spin_on_preempted_owner)
+				iters++;
+			/*
+			 * pv_spin_on_preempted_owner don't increase iters
+			 * while the owner is preempted -- we won't interfere
+			 * with it by definition. This could introduce some
+			 * latency issue if we continually observe preempted
+			 * owners, but hopefully that's a rare corner case of
+			 * a badly oversubscribed system.
+			 */
+		} else {
+			iters++;
+		}
 
 		if (iters >= get_steal_spins(paravirt, false))
 			break;
@@ -463,8 +502,10 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		/* We're at the head of the waitqueue, wait for the lock. */
 		spin_begin();
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
+			bool preempted;
+
 			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
-			yield_head_to_locked_owner(lock, val, paravirt, false);
+			yield_head_to_locked_owner(lock, val, paravirt, false, &preempted);
 		}
 		spin_end();
 
@@ -486,11 +527,20 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		/* We're at the head of the waitqueue, wait for the lock. */
 		spin_begin();
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
+			bool preempted;
+
 			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
 			yield_head_to_locked_owner(lock, val, paravirt,
-					pv_yield_allow_steal && set_mustq);
+					pv_yield_allow_steal && set_mustq,
+					&preempted);
+
+			if (paravirt && preempted) {
+				if (!pv_spin_on_preempted_owner)
+					iters++;
+			} else {
+				iters++;
+			}
 
-			iters++;
 			if (!set_mustq && iters >= get_head_spins(paravirt)) {
 				set_mustq = true;
 				lock_set_mustq(lock);
@@ -663,6 +713,22 @@ static int pv_yield_allow_steal_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_allow_steal, pv_yield_allow_steal_get, pv_yield_allow_steal_set, "%llu\n");
 
+static int pv_spin_on_preempted_owner_set(void *data, u64 val)
+{
+	pv_spin_on_preempted_owner = !!val;
+
+	return 0;
+}
+
+static int pv_spin_on_preempted_owner_get(void *data, u64 *val)
+{
+	*val = pv_spin_on_preempted_owner;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_spin_on_preempted_owner, pv_spin_on_preempted_owner_get, pv_spin_on_preempted_owner_set, "%llu\n");
+
 static int pv_yield_prev_set(void *data, u64 val)
 {
 	pv_yield_prev = !!val;
@@ -719,6 +785,7 @@ static __init int spinlock_debugfs_init(void)
 	if (is_shared_processor()) {
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
 		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
+		debugfs_create_file("qspl_pv_spin_on_preempted_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_spin_on_preempted_owner);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
 		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks
  2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
                   ` (16 preceding siblings ...)
  2022-07-28  6:31 ` [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner Nicholas Piggin
@ 2022-07-28  6:31 ` Nicholas Piggin
  2022-08-15  1:11   ` Jordan Niethe
  2022-11-10  0:44   ` Jordan Niethe
  17 siblings, 2 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-07-28  6:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Finding the owner or a queued waiter on a lock with a preempted vcpu
is indicative of an oversubscribed guest causing the lock to get into
trouble. Provide some options to detect this situation and have new
CPUs avoid queueing for a longer time (more steal iterations) to
minimise the problems caused by vcpu preemption on the queue.
---
 arch/powerpc/include/asm/qspinlock_types.h |   7 +-
 arch/powerpc/lib/qspinlock.c               | 240 +++++++++++++++++++--
 2 files changed, 232 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 35f9525381e6..4fbcc8a4230b 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -30,7 +30,7 @@ typedef struct qspinlock {
  *
  *     0: locked bit
  *  1-14: lock holder cpu
- *    15: unused bit
+ *    15: lock owner or queuer vcpus observed to be preempted bit
  *    16: must queue bit
  * 17-31: tail cpu (+1)
  */
@@ -49,6 +49,11 @@ typedef struct qspinlock {
 #error "qspinlock does not support such large CONFIG_NR_CPUS"
 #endif
 
+#define _Q_SLEEPY_OFFSET	15
+#define _Q_SLEEPY_BITS		1
+#define _Q_SLEEPY_MASK		_Q_SET_MASK(SLEEPY_OWNER)
+#define _Q_SLEEPY_VAL		(1U << _Q_SLEEPY_OFFSET)
+
 #define _Q_MUST_Q_OFFSET	16
 #define _Q_MUST_Q_BITS		1
 #define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 5cfd69931e31..c18133c01450 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -5,6 +5,7 @@
 #include <linux/percpu.h>
 #include <linux/smp.h>
 #include <linux/topology.h>
+#include <linux/sched/clock.h>
 #include <asm/qspinlock.h>
 #include <asm/paravirt.h>
 
@@ -36,24 +37,54 @@ static int HEAD_SPINS __read_mostly = (1<<8);
 static bool pv_yield_owner __read_mostly = true;
 static bool pv_yield_allow_steal __read_mostly = false;
 static bool pv_spin_on_preempted_owner __read_mostly = false;
+static bool pv_sleepy_lock __read_mostly = true;
+static bool pv_sleepy_lock_sticky __read_mostly = false;
+static u64 pv_sleepy_lock_interval_ns __read_mostly = 0;
+static int pv_sleepy_lock_factor __read_mostly = 256;
 static bool pv_yield_prev __read_mostly = true;
 static bool pv_yield_propagate_owner __read_mostly = true;
 static bool pv_prod_head __read_mostly = false;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
+static DEFINE_PER_CPU_ALIGNED(u64, sleepy_lock_seen_clock);
 
-static __always_inline int get_steal_spins(bool paravirt, bool remote)
+static __always_inline bool recently_sleepy(void)
+{
+	if (pv_sleepy_lock_interval_ns) {
+		u64 seen = this_cpu_read(sleepy_lock_seen_clock);
+
+		if (seen) {
+			u64 delta = sched_clock() - seen;
+			if (delta < pv_sleepy_lock_interval_ns)
+				return true;
+			this_cpu_write(sleepy_lock_seen_clock, 0);
+		}
+	}
+
+	return false;
+}
+
+static __always_inline int get_steal_spins(bool paravirt, bool remote, bool sleepy)
 {
 	if (remote) {
-		return REMOTE_STEAL_SPINS;
+		if (paravirt && sleepy)
+			return REMOTE_STEAL_SPINS * pv_sleepy_lock_factor;
+		else
+			return REMOTE_STEAL_SPINS;
 	} else {
-		return STEAL_SPINS;
+		if (paravirt && sleepy)
+			return STEAL_SPINS * pv_sleepy_lock_factor;
+		else
+			return STEAL_SPINS;
 	}
 }
 
-static __always_inline int get_head_spins(bool paravirt)
+static __always_inline int get_head_spins(bool paravirt, bool sleepy)
 {
-	return HEAD_SPINS;
+	if (paravirt && sleepy)
+		return HEAD_SPINS * pv_sleepy_lock_factor;
+	else
+		return HEAD_SPINS;
 }
 
 static inline u32 encode_tail_cpu(void)
@@ -206,6 +237,60 @@ static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
 	return prev;
 }
 
+static __always_inline bool lock_try_set_sleepy(struct qspinlock *lock, u32 old)
+{
+	u32 prev;
+	u32 new = old | _Q_SLEEPY_VAL;
+
+	BUG_ON(!(old & _Q_LOCKED_VAL));
+	BUG_ON(old & _Q_SLEEPY_VAL);
+
+	asm volatile(
+"1:	lwarx	%0,0,%1		# lock_try_set_sleepy			\n"
+"	cmpw	0,%0,%2							\n"
+"	bne-	2f							\n"
+"	stwcx.	%3,0,%1							\n"
+"	bne-	1b							\n"
+"2:									\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r"(old), "r" (new)
+	: "cr0", "memory");
+
+	if (prev == old)
+		return true;
+	return false;
+}
+
+static __always_inline void seen_sleepy_owner(struct qspinlock *lock, u32 val)
+{
+	if (pv_sleepy_lock) {
+		if (pv_sleepy_lock_interval_ns)
+			this_cpu_write(sleepy_lock_seen_clock, sched_clock());
+		if (!(val & _Q_SLEEPY_VAL))
+			lock_try_set_sleepy(lock, val);
+	}
+}
+
+static __always_inline void seen_sleepy_lock(void)
+{
+	if (pv_sleepy_lock && pv_sleepy_lock_interval_ns)
+		this_cpu_write(sleepy_lock_seen_clock, sched_clock());
+}
+
+static __always_inline void seen_sleepy_node(struct qspinlock *lock)
+{
+	if (pv_sleepy_lock) {
+		u32 val = READ_ONCE(lock->val);
+
+		if (pv_sleepy_lock_interval_ns)
+			this_cpu_write(sleepy_lock_seen_clock, sched_clock());
+		if (val & _Q_LOCKED_VAL) {
+			if (!(val & _Q_SLEEPY_VAL))
+				lock_try_set_sleepy(lock, val);
+		}
+	}
+}
+
 static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 {
 	int cpu = get_tail_cpu(val);
@@ -244,6 +329,7 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
 
 	spin_end();
 
+	seen_sleepy_owner(lock, val);
 	*preempted = true;
 
 	/*
@@ -307,11 +393,13 @@ static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int
 	}
 }
 
-static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
+static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt, bool *preempted)
 {
 	u32 yield_count;
 	int yield_cpu;
 
+	*preempted = false;
+
 	if (!paravirt)
 		goto relax;
 
@@ -332,6 +420,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 
 	spin_end();
 
+	*preempted = true;
+	seen_sleepy_node(lock);
+
 	smp_rmb();
 
 	if (yield_cpu == node->yield_cpu) {
@@ -353,6 +444,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 
 	spin_end();
 
+	*preempted = true;
+	seen_sleepy_node(lock);
+
 	smp_rmb(); /* See yield_to_locked_owner comment */
 
 	if (!node->locked) {
@@ -369,6 +463,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
 
 static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
 {
+	bool preempted;
+	bool seen_preempted = false;
+	bool sleepy = false;
 	int iters = 0;
 
 	if (!STEAL_SPINS) {
@@ -376,7 +473,6 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 			spin_begin();
 			for (;;) {
 				u32 val = READ_ONCE(lock->val);
-				bool preempted;
 
 				if (val & _Q_MUST_Q_VAL)
 					break;
@@ -395,7 +491,6 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 	spin_begin();
 	for (;;) {
 		u32 val = READ_ONCE(lock->val);
-		bool preempted;
 
 		if (val & _Q_MUST_Q_VAL)
 			break;
@@ -408,9 +503,29 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 			continue;
 		}
 
+		if (paravirt && pv_sleepy_lock && !sleepy) {
+			if (!sleepy) {
+				if (val & _Q_SLEEPY_VAL) {
+					seen_sleepy_lock();
+					sleepy = true;
+				} else if (recently_sleepy()) {
+					sleepy = true;
+				}
+			}
+			if (pv_sleepy_lock_sticky && seen_preempted &&
+					!(val & _Q_SLEEPY_VAL)) {
+				if (lock_try_set_sleepy(lock, val))
+					val |= _Q_SLEEPY_VAL;
+			}
+		}
+
 		yield_to_locked_owner(lock, val, paravirt, &preempted);
+		if (preempted)
+			seen_preempted = true;
 
 		if (paravirt && preempted) {
+			sleepy = true;
+
 			if (!pv_spin_on_preempted_owner)
 				iters++;
 			/*
@@ -425,14 +540,15 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 			iters++;
 		}
 
-		if (iters >= get_steal_spins(paravirt, false))
+		if (iters >= get_steal_spins(paravirt, false, sleepy))
 			break;
-		if (iters >= get_steal_spins(paravirt, true)) {
+		if (iters >= get_steal_spins(paravirt, true, sleepy)) {
 			int cpu = get_owner_cpu(val);
 			if (numa_node_id() != cpu_to_node(cpu))
 				break;
 		}
 	}
+
 	spin_end();
 
 	return false;
@@ -443,6 +559,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	struct qnodes *qnodesp;
 	struct qnode *next, *node;
 	u32 val, old, tail;
+	bool seen_preempted = false;
 	int idx;
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
@@ -485,8 +602,13 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 
 		/* Wait for mcs node lock to be released */
 		spin_begin();
-		while (!node->locked)
-			yield_to_prev(lock, node, prev_cpu, paravirt);
+		while (!node->locked) {
+			bool preempted;
+
+			yield_to_prev(lock, node, prev_cpu, paravirt, &preempted);
+			if (preempted)
+				seen_preempted = true;
+		}
 		spin_end();
 
 		/* Clear out stale propagated yield_cpu */
@@ -506,6 +628,8 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 
 			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
 			yield_head_to_locked_owner(lock, val, paravirt, false, &preempted);
+			if (preempted)
+				seen_preempted = true;
 		}
 		spin_end();
 
@@ -521,27 +645,47 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	} else {
 		int set_yield_cpu = -1;
 		int iters = 0;
+		bool sleepy = false;
 		bool set_mustq = false;
+		bool preempted;
 
 again:
 		/* We're at the head of the waitqueue, wait for the lock. */
 		spin_begin();
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
-			bool preempted;
+			if (paravirt && pv_sleepy_lock) {
+				if (!sleepy) {
+					if (val & _Q_SLEEPY_VAL) {
+						seen_sleepy_lock();
+						sleepy = true;
+					} else if (recently_sleepy()) {
+						sleepy = true;
+					}
+				}
+				if (pv_sleepy_lock_sticky && seen_preempted &&
+						!(val & _Q_SLEEPY_VAL)) {
+					if (lock_try_set_sleepy(lock, val))
+						val |= _Q_SLEEPY_VAL;
+				}
+			}
 
 			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
 			yield_head_to_locked_owner(lock, val, paravirt,
 					pv_yield_allow_steal && set_mustq,
 					&preempted);
+			if (preempted)
+				seen_preempted = true;
 
 			if (paravirt && preempted) {
+				sleepy = true;
+
 				if (!pv_spin_on_preempted_owner)
 					iters++;
 			} else {
 				iters++;
 			}
 
-			if (!set_mustq && iters >= get_head_spins(paravirt)) {
+			if (!set_mustq && iters >= get_head_spins(paravirt, sleepy)) {
 				set_mustq = true;
 				lock_set_mustq(lock);
 				val |= _Q_MUST_Q_VAL;
@@ -729,6 +873,70 @@ static int pv_spin_on_preempted_owner_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_spin_on_preempted_owner, pv_spin_on_preempted_owner_get, pv_spin_on_preempted_owner_set, "%llu\n");
 
+static int pv_sleepy_lock_set(void *data, u64 val)
+{
+	pv_sleepy_lock = !!val;
+
+	return 0;
+}
+
+static int pv_sleepy_lock_get(void *data, u64 *val)
+{
+	*val = pv_sleepy_lock;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock, pv_sleepy_lock_get, pv_sleepy_lock_set, "%llu\n");
+
+static int pv_sleepy_lock_sticky_set(void *data, u64 val)
+{
+	pv_sleepy_lock_sticky = !!val;
+
+	return 0;
+}
+
+static int pv_sleepy_lock_sticky_get(void *data, u64 *val)
+{
+	*val = pv_sleepy_lock_sticky;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_sticky, pv_sleepy_lock_sticky_get, pv_sleepy_lock_sticky_set, "%llu\n");
+
+static int pv_sleepy_lock_interval_ns_set(void *data, u64 val)
+{
+	pv_sleepy_lock_interval_ns = val;
+
+	return 0;
+}
+
+static int pv_sleepy_lock_interval_ns_get(void *data, u64 *val)
+{
+	*val = pv_sleepy_lock_interval_ns;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_interval_ns, pv_sleepy_lock_interval_ns_get, pv_sleepy_lock_interval_ns_set, "%llu\n");
+
+static int pv_sleepy_lock_factor_set(void *data, u64 val)
+{
+	pv_sleepy_lock_factor = val;
+
+	return 0;
+}
+
+static int pv_sleepy_lock_factor_get(void *data, u64 *val)
+{
+	*val = pv_sleepy_lock_factor;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_factor, pv_sleepy_lock_factor_get, pv_sleepy_lock_factor_set, "%llu\n");
+
 static int pv_yield_prev_set(void *data, u64 val)
 {
 	pv_yield_prev = !!val;
@@ -786,6 +994,10 @@ static __init int spinlock_debugfs_init(void)
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
 		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
 		debugfs_create_file("qspl_pv_spin_on_preempted_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_spin_on_preempted_owner);
+		debugfs_create_file("qspl_pv_sleepy_lock", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock);
+		debugfs_create_file("qspl_pv_sleepy_lock_sticky", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_sticky);
+		debugfs_create_file("qspl_pv_sleepy_lock_interval_ns", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_interval_ns);
+		debugfs_create_file("qspl_pv_sleepy_lock_factor", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_factor);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
 		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
  2022-07-28  6:31 ` [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation Nicholas Piggin
@ 2022-08-10  1:52   ` Jordan NIethe
  2022-08-10  6:48     ` Christophe Leroy
  2022-11-10  0:35   ` Jordan Niethe
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan NIethe @ 2022-08-10  1:52 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
<snip>
> -#define queued_spin_lock queued_spin_lock
>  
> -static inline void queued_spin_unlock(struct qspinlock *lock)
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> -		smp_store_release(&lock->locked, 0);
> -	else
> -		__pv_queued_spin_unlock(lock);
> +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> +		return 1;
> +	return 0;

optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters
  2022-07-28  6:31 ` [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
@ 2022-08-10  2:28   ` Jordan NIethe
  2022-11-10  0:36   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan NIethe @ 2022-08-10  2:28 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
<snip>

>  
> +/*
> + * Bitfields in the atomic value:
> + *
> + *     0: locked bit
> + * 16-31: tail cpu (+1)
> + */
> +#define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
> +				      << _Q_ ## type ## _OFFSET)
> +#define _Q_LOCKED_OFFSET	0
> +#define _Q_LOCKED_BITS		1
> +#define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
> +#define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
> +
> +#define _Q_TAIL_CPU_OFFSET	16
> +#define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
> +#define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
> +

Just to state the obvious this is:

#define _Q_LOCKED_OFFSET	0
#define _Q_LOCKED_BITS		1
#define _Q_LOCKED_MASK		0x00000001
#define _Q_LOCKED_VAL		1

#define _Q_TAIL_CPU_OFFSET	16
#define _Q_TAIL_CPU_BITS	16
#define _Q_TAIL_CPU_MASK	0xffff0000

> +#if CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)
> +#error "qspinlock does not support such large CONFIG_NR_CPUS"
> +#endif
> +
>  #endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 8dbce99a373c..5ebb88d95636 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -1,12 +1,172 @@
>  // SPDX-License-Identifier: GPL-2.0-or-later
> +#include <linux/atomic.h>
> +#include <linux/bug.h>
> +#include <linux/compiler.h>
>  #include <linux/export.h>
> -#include <linux/processor.h>
> +#include <linux/percpu.h>
> +#include <linux/smp.h>
>  #include <asm/qspinlock.h>
>  
> -void queued_spin_lock_slowpath(struct qspinlock *lock)
> +#define MAX_NODES	4
> +
> +struct qnode {
> +	struct qnode	*next;
> +	struct qspinlock *lock;
> +	u8		locked; /* 1 if lock acquired */
> +};
> +
> +struct qnodes {
> +	int		count;
> +	struct qnode nodes[MAX_NODES];
> +};

I think it could be worth commenting why qnodes::count instead _Q_TAIL_IDX_OFFSET.

> +
> +static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> +
> +static inline int encode_tail_cpu(void)

I think the generic version that takes smp_processor_id() as a parameter is clearer - at least with this function name.

> +{
> +	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> +}
> +
> +static inline int get_tail_cpu(int val)

It seems like there should be a "decode" function to pair up with the "encode" function.

> +{
> +	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
> +}
> +
> +/* Take the lock by setting the bit, no other CPUs may concurrently lock it. */

Does that comment mean it is not necessary to use an atomic_or here?

> +static __always_inline void lock_set_locked(struct qspinlock *lock)

nit: could just be called set_locked()

> +{
> +	atomic_or(_Q_LOCKED_VAL, &lock->val);
> +	__atomic_acquire_fence();
> +}
> +
> +/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
> +static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int val)
> +{
> +	int newval = _Q_LOCKED_VAL;
> +
> +	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
> +		return 1;
> +	else
> +		return 0;

same optional style nit: return (atomic_cmpxchg_acquire(&lock->val, val, newval) == val);

> +}
> +
> +/*
> + * Publish our tail, replacing previous tail. Return previous value.
> + *
> + * This provides a release barrier for publishing node, and an acquire barrier
> + * for getting the old node.
> + */
> +static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)

Did you change from the xchg_tail() name in the generic version because of the release and acquire barriers this provides?
Does "publish" generally imply the old value will be returned?

>  {
> -	while (!queued_spin_trylock(lock))
> +	for (;;) {
> +		int val = atomic_read(&lock->val);
> +		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
> +		int old;
> +
> +		old = atomic_cmpxchg(&lock->val, val, newval);
> +		if (old == val)
> +			return old;
> +	}
> +}
> +
> +static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
> +{
> +	int cpu = get_tail_cpu(val);
> +	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
> +	int idx;
> +
> +	for (idx = 0; idx < MAX_NODES; idx++) {
> +		struct qnode *qnode = &qnodesp->nodes[idx];
> +		if (qnode->lock == lock)
> +			return qnode;
> +	}

In case anyone else is confused by this, Nick explained each cpu can only queue on a unique spinlock once regardless of "idx" level.

> +
> +	BUG();
> +}
> +
> +static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> +{
> +	struct qnodes *qnodesp;
> +	struct qnode *next, *node;
> +	int val, old, tail;
> +	int idx;
> +
> +	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> +
> +	qnodesp = this_cpu_ptr(&qnodes);
> +	if (unlikely(qnodesp->count == MAX_NODES)) {

The comparison is >= in the generic, I guess we've no nested NMI so this is safe?

> +		while (!queued_spin_trylock(lock))
> +			cpu_relax();
> +		return;
> +	}
> +
> +	idx = qnodesp->count++;
> +	/*
> +	 * Ensure that we increment the head node->count before initialising
> +	 * the actual node. If the compiler is kind enough to reorder these
> +	 * stores, then an IRQ could overwrite our assignments.
> +	 */
> +	barrier();
> +	node = &qnodesp->nodes[idx];
> +	node->next = NULL;
> +	node->lock = lock;
> +	node->locked = 0;
> +
> +	tail = encode_tail_cpu();
> +
> +	old = publish_tail_cpu(lock, tail);
> +
> +	/*
> +	 * If there was a previous node; link it and wait until reaching the
> +	 * head of the waitqueue.
> +	 */
> +	if (old & _Q_TAIL_CPU_MASK) {
> +		struct qnode *prev = get_tail_qnode(lock, old);
> +
> +		/* Link @node into the waitqueue. */
> +		WRITE_ONCE(prev->next, node);
> +
> +		/* Wait for mcs node lock to be released */
> +		while (!node->locked)
> +			cpu_relax();
> +
> +		smp_rmb(); /* acquire barrier for the mcs lock */
> +	}
> +
> +	/* We're at the head of the waitqueue, wait for the lock. */
> +	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_VAL)
> +		cpu_relax();
> +
> +	/* If we're the last queued, must clean up the tail. */
> +	if ((val & _Q_TAIL_CPU_MASK) == tail) {
> +		if (trylock_clear_tail_cpu(lock, val))
> +			goto release;
> +		/* Another waiter must have enqueued */
> +	}
> +
> +	/* We must be the owner, just set the lock bit and acquire */
> +	lock_set_locked(lock);
> +
> +	/* contended path; must wait for next != NULL (MCS protocol) */
> +	while (!(next = READ_ONCE(node->next)))
>  		cpu_relax();
> +
> +	/*
> +	 * Unlock the next mcs waiter node. Release barrier is not required
> +	 * here because the acquirer is only accessing the lock word, and
> +	 * the acquire barrier we took the lock with orders that update vs
> +	 * this store to locked. The corresponding barrier is the smp_rmb()
> +	 * acquire barrier for mcs lock, above.
> +	 */
> +	WRITE_ONCE(next->locked, 1);
> +
> +release:
> +	qnodesp->count--; /* release the node */
> +}
> +
> +void queued_spin_lock_slowpath(struct qspinlock *lock)
> +{
> +	queued_spin_lock_mcs_queue(lock);
>  }
>  EXPORT_SYMBOL(queued_spin_lock_slowpath);
>  


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.
  2022-07-28  6:31 ` [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx Nicholas Piggin
@ 2022-08-10  3:28   ` Jordan Niethe
  2022-11-10  0:39   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-10  3:28 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> The first 16 bits of the lock are only modified by the owner, and other
> modifications always use atomic operations on the entire 32 bits, so
> unlocks can use plain stores on the 16 bits. This is the same kind of
> optimisation done by core qspinlock code.
> ---
>  arch/powerpc/include/asm/qspinlock.h       |  6 +-----
>  arch/powerpc/include/asm/qspinlock_types.h | 19 +++++++++++++++++--
>  2 files changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> index f06117aa60e1..79a1936fb68d 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -38,11 +38,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
>  
>  static inline void queued_spin_unlock(struct qspinlock *lock)
>  {
> -	for (;;) {
> -		int val = atomic_read(&lock->val);
> -		if (atomic_cmpxchg_release(&lock->val, val, val & ~_Q_LOCKED_VAL) == val)
> -			return;
> -	}
> +	smp_store_release(&lock->locked, 0);

Is it also possible for lock_set_locked() to use a non-atomic acquire
operation?

>  }
>  
>  #define arch_spin_is_locked(l)		queued_spin_is_locked(l)
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> index 9630e714c70d..3425dab42576 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -3,12 +3,27 @@
>  #define _ASM_POWERPC_QSPINLOCK_TYPES_H
>  
>  #include <linux/types.h>
> +#include <asm/byteorder.h>
>  
>  typedef struct qspinlock {
> -	atomic_t val;
> +	union {
> +		atomic_t val;
> +
> +#ifdef __LITTLE_ENDIAN
> +		struct {
> +			u16	locked;
> +			u8	reserved[2];
> +		};
> +#else
> +		struct {
> +			u8	reserved[2];
> +			u16	locked;
> +		};
> +#endif
> +	};
>  } arch_spinlock_t;

Just to double check we have:

#define _Q_LOCKED_OFFSET	0
#define _Q_LOCKED_BITS		1
#define _Q_LOCKED_MASK		0x00000001
#define _Q_LOCKED_VAL		1

#define _Q_TAIL_CPU_OFFSET	16
#define _Q_TAIL_CPU_BITS	16
#define _Q_TAIL_CPU_MASK	0xffff0000


so the ordering here looks correct.

>  
> -#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
> +#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
>  
>  /*
>   * Bitfields in the atomic value:


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly
  2022-07-28  6:31 ` [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly Nicholas Piggin
@ 2022-08-10  3:54   ` Jordan Niethe
  2022-11-10  0:39   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-10  3:54 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> This uses more optimal ll/sc style access patterns (rather than
> cmpxchg), and also sets the EH=1 lock hint on those operations
> which acquire ownership of the lock.
> ---
>  arch/powerpc/include/asm/qspinlock.h       | 25 +++++--
>  arch/powerpc/include/asm/qspinlock_types.h |  6 +-
>  arch/powerpc/lib/qspinlock.c               | 81 +++++++++++++++-------
>  3 files changed, 79 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> index 79a1936fb68d..3ab354159e5e 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -2,28 +2,43 @@
>  #ifndef _ASM_POWERPC_QSPINLOCK_H
>  #define _ASM_POWERPC_QSPINLOCK_H
>  
> -#include <linux/atomic.h>
>  #include <linux/compiler.h>
>  #include <asm/qspinlock_types.h>
>  
>  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>  {
> -	return atomic_read(&lock->val);
> +	return READ_ONCE(lock->val);
>  }
>  
>  static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
>  {
> -	return !atomic_read(&lock.val);
> +	return !lock.val;
>  }
>  
>  static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
>  {
> -	return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
> +	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
>  }
>  
>  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> -	if (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0)
> +	u32 new = _Q_LOCKED_VAL;
> +	u32 prev;
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"
> +"	cmpwi	0,%0,0							\n"
> +"	bne-	2f							\n"
> +"	stwcx.	%2,0,%1							\n"
> +"	bne-	1b							\n"
> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +"2:									\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r" (new),
> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)

btw IS_ENABLED() already returns 1 or 0

> +	: "cr0", "memory");

This is the ISA's "test and set" atomic primitive. Do you think it would be worth seperating it as a helper?

> +
> +	if (likely(prev == 0))
>  		return 1;
>  	return 0;

same optional style nit: return likely(prev == 0);

>  }
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> index 3425dab42576..210adf05b235 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -7,7 +7,7 @@
>  
>  typedef struct qspinlock {
>  	union {
> -		atomic_t val;
> +		u32 val;
>  
>  #ifdef __LITTLE_ENDIAN
>  		struct {
> @@ -23,10 +23,10 @@ typedef struct qspinlock {
>  	};
>  } arch_spinlock_t;
>  
> -#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
> +#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = 0 } }
>  
>  /*
> - * Bitfields in the atomic value:
> + * Bitfields in the lock word:
>   *
>   *     0: locked bit
>   * 16-31: tail cpu (+1)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 5ebb88d95636..7c71e5e287df 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -1,5 +1,4 @@
>  // SPDX-License-Identifier: GPL-2.0-or-later
> -#include <linux/atomic.h>
>  #include <linux/bug.h>
>  #include <linux/compiler.h>
>  #include <linux/export.h>
> @@ -22,32 +21,59 @@ struct qnodes {
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static inline int encode_tail_cpu(void)
> +static inline u32 encode_tail_cpu(void)
>  {
>  	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
>  }
>  
> -static inline int get_tail_cpu(int val)
> +static inline int get_tail_cpu(u32 val)
>  {
>  	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
>  }
>  
>  /* Take the lock by setting the bit, no other CPUs may concurrently lock it. */

I think you missed deleting the above line.

> +/* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> -	atomic_or(_Q_LOCKED_VAL, &lock->val);
> -	__atomic_acquire_fence();
> +	u32 new = _Q_LOCKED_VAL;
> +	u32 prev;
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1,%3	# lock_set_locked			\n"
> +"	or	%0,%0,%2						\n"
> +"	stwcx.	%0,0,%1							\n"
> +"	bne-	1b							\n"
> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r" (new),
> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> +	: "cr0", "memory");
>  }


This is pretty similar with the DEFINE_TESTOP() pattern from
arch/powerpc/include/asm/bitops.h (such as test_and_set_bits_lock()) except for
word instead of double word. Do you think it's possible / beneficial to make
use of those macros?


>  
> -/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
> -static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int val)
> +/* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
> +static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
>  {
> -	int newval = _Q_LOCKED_VAL;
> -
> -	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
> +	u32 new = _Q_LOCKED_VAL;
> +	u32 prev;
> +
> +	BUG_ON(old & _Q_LOCKED_VAL);

The BUG_ON() could have been introduced in an earlier patch I think.

> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1,%4	# trylock_clear_tail_cpu		\n"
> +"	cmpw	0,%0,%2							\n"
> +"	bne-	2f							\n"
> +"	stwcx.	%3,0,%1							\n"
> +"	bne-	1b							\n"
> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +"2:									\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r"(old), "r" (new),

Could this be like  "r"(_Q_TAIL_CPU_MASK) below?
i.e. "r" (_Q_LOCKED_VAL)? Makes it clear new doesn't change.

> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> +	: "cr0", "memory");
> +
> +	if (likely(prev == old))
>  		return 1;
> -	else
> -		return 0;
> +	return 0;
>  }
>  
>  /*
> @@ -56,20 +82,25 @@ static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int va
>   * This provides a release barrier for publishing node, and an acquire barrier


Does the comment mean there needs to be an acquire barrier in this assembly?


>   * for getting the old node.
>   */
> -static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)
> +static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
>  {
> -	for (;;) {
> -		int val = atomic_read(&lock->val);
> -		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
> -		int old;
> -
> -		old = atomic_cmpxchg(&lock->val, val, newval);
> -		if (old == val)
> -			return old;
> -	}
> +	u32 prev, tmp;
> +
> +	asm volatile(
> +"\t"	PPC_RELEASE_BARRIER "						\n"
> +"1:	lwarx	%0,0,%2		# publish_tail_cpu			\n"
> +"	andc	%1,%0,%4						\n"
> +"	or	%1,%1,%3						\n"
> +"	stwcx.	%1,0,%2							\n"
> +"	bne-	1b							\n"
> +	: "=&r" (prev), "=&r"(tmp)
> +	: "r" (&lock->val), "r" (tail), "r"(_Q_TAIL_CPU_MASK)
> +	: "cr0", "memory");
> +
> +	return prev;
>  }
>  
> -static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
> +static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>  	int cpu = get_tail_cpu(val);
>  	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
> @@ -88,7 +119,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  {
>  	struct qnodes *qnodesp;
>  	struct qnode *next, *node;
> -	int val, old, tail;
> +	u32 val, old, tail;
>  	int idx;
>  
>  	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> @@ -134,7 +165,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  	}
>  
>  	/* We're at the head of the waitqueue, wait for the lock. */
> -	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_VAL)
> +	while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
>  		cpu_relax();
>  
>  	/* If we're the last queued, must clean up the tail. */


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing
  2022-07-28  6:31 ` [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing Nicholas Piggin
@ 2022-08-10  4:31   ` Jordan Niethe
  2022-11-10  0:40   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-10  4:31 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Allow new waiters a number of spins on the lock word before queueing,
> which particularly helps paravirt performance when physical CPUs are
> oversubscribed.
> ---
>  arch/powerpc/lib/qspinlock.c | 152 ++++++++++++++++++++++++++++++++---
>  1 file changed, 141 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 7c71e5e287df..1625cce714b2 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -19,8 +19,17 @@ struct qnodes {
>  	struct qnode nodes[MAX_NODES];
>  };
>  
> +/* Tuning parameters */
> +static int STEAL_SPINS __read_mostly = (1<<5);
> +static bool MAYBE_STEALERS __read_mostly = true;

I can understand why, but macro case variables can be a bit confusing.

> +
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> +static __always_inline int get_steal_spins(void)
> +{
> +	return STEAL_SPINS;
> +}
> +
>  static inline u32 encode_tail_cpu(void)
>  {
>  	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> @@ -76,6 +85,39 @@ static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 ol
>  	return 0;
>  }
>  
> +static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 old, u32 new)
> +{
> +	u32 prev;
> +
> +	BUG_ON(old & _Q_LOCKED_VAL);
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1,%4	# queued_spin_trylock_cmpxchg		\n"

s/queued_spin_trylock_cmpxchg/__trylock_cmpxchg/

btw what is the format you using for the '\n's in the inline asm?

> +"	cmpw	0,%0,%2							\n"
> +"	bne-	2f							\n"
> +"	stwcx.	%3,0,%1							\n"
> +"	bne-	1b							\n"
> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +"2:									\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r"(old), "r" (new),
> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> +	: "cr0", "memory");

This is very similar to trylock_clear_tail_cpu(). So maybe it is worth having
some form of "test and set" primitive helper.

> +
> +	return prev;
> +}
> +
> +/* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
> +static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 val)
> +{
> +	u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
> +
> +	if (__trylock_cmpxchg(lock, val, newval) == val)
> +		return 1;
> +	else
> +		return 0;

same optional style nit: return __trylock_cmpxchg(lock, val, newval) == val

> +}
> +
>  /*
>   * Publish our tail, replacing previous tail. Return previous value.
>   *
> @@ -115,6 +157,31 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  	BUG();
>  }
>  
> +static inline bool try_to_steal_lock(struct qspinlock *lock)
> +{
> +	int iters;
> +
> +	/* Attempt to steal the lock */
> +	for (;;) {
> +		u32 val = READ_ONCE(lock->val);
> +
> +		if (unlikely(!(val & _Q_LOCKED_VAL))) {
> +			if (trylock_with_tail_cpu(lock, val))
> +				return true;
> +			continue;
> +		}

The continue would bypass iters++/cpu_relax but the next time around
  if (unlikely(!(val & _Q_LOCKED_VAL))) {
should fail so everything should be fine?

> +
> +		cpu_relax();
> +
> +		iters++;
> +
> +		if (iters >= get_steal_spins())
> +			break;
> +	}
> +
> +	return false;
> +}
> +
>  static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  {
>  	struct qnodes *qnodesp;
> @@ -164,20 +231,39 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  		smp_rmb(); /* acquire barrier for the mcs lock */
>  	}
>  
> -	/* We're at the head of the waitqueue, wait for the lock. */
> -	while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> -		cpu_relax();
> +	if (!MAYBE_STEALERS) {
> +		/* We're at the head of the waitqueue, wait for the lock. */
> +		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> +			cpu_relax();
>  
> -	/* If we're the last queued, must clean up the tail. */
> -	if ((val & _Q_TAIL_CPU_MASK) == tail) {
> -		if (trylock_clear_tail_cpu(lock, val))
> -			goto release;
> -		/* Another waiter must have enqueued */
> -	}
> +		/* If we're the last queued, must clean up the tail. */
> +		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> +			if (trylock_clear_tail_cpu(lock, val))
> +				goto release;
> +			/* Another waiter must have enqueued. */
> +		}
> +
> +		/* We must be the owner, just set the lock bit and acquire */
> +		lock_set_locked(lock);
> +	} else {
> +again:
> +		/* We're at the head of the waitqueue, wait for the lock. */
> +		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> +			cpu_relax();
>  
> -	/* We must be the owner, just set the lock bit and acquire */
> -	lock_set_locked(lock);
> +		/* If we're the last queued, must clean up the tail. */
> +		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> +			if (trylock_clear_tail_cpu(lock, val))
> +				goto release;
> +			/* Another waiter must have enqueued, or lock stolen. */
> +		} else {
> +			if (trylock_with_tail_cpu(lock, val))
> +				goto unlock_next;
> +		}
> +		goto again;
> +	}
>  
> +unlock_next:
>  	/* contended path; must wait for next != NULL (MCS protocol) */
>  	while (!(next = READ_ONCE(node->next)))
>  		cpu_relax();
> @@ -197,6 +283,9 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  
>  void queued_spin_lock_slowpath(struct qspinlock *lock)
>  {
> +	if (try_to_steal_lock(lock))
> +		return;
> +
>  	queued_spin_lock_mcs_queue(lock);
>  }
>  EXPORT_SYMBOL(queued_spin_lock_slowpath);
> @@ -207,3 +296,44 @@ void pv_spinlocks_init(void)
>  }
>  #endif
>  
> +#include <linux/debugfs.h>
> +static int steal_spins_set(void *data, u64 val)
> +{
> +	static DEFINE_MUTEX(lock);


I just want to check if it would be possible to get rid of the MAYBE_STEALERS
variable completely and do something like:

  bool maybe_stealers() { return STEAL_SPINS > 0; }

I guess based on the below code it wouldn't work, but I'm still not quite sure
why that is.

> +
> +	mutex_lock(&lock);
> +	if (val && !STEAL_SPINS) {
> +		MAYBE_STEALERS = true;
> +		/* wait for waiter to go away */
> +		synchronize_rcu();
> +		STEAL_SPINS = val;
> +	} else if (!val && STEAL_SPINS) {
> +		STEAL_SPINS = val;
> +		/* wait for all possible stealers to go away */
> +		synchronize_rcu();
> +		MAYBE_STEALERS = false;
> +	} else {
> +		STEAL_SPINS = val;
> +	}
> +	mutex_unlock(&lock);

STEAL_SPINS is an int not a u64.

> +
> +	return 0;
> +}
> +
> +static int steal_spins_get(void *data, u64 *val)
> +{
> +	*val = STEAL_SPINS;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
> +
> +static __init int spinlock_debugfs_init(void)
> +{
> +	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
> +
> +	return 0;
> +}
> +device_initcall(spinlock_debugfs_init);
> +


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency
  2022-07-28  6:31 ` [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency Nicholas Piggin
@ 2022-08-10  5:51   ` Jordan Niethe
  2022-11-10  0:40   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-10  5:51 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Give the queue head the ability to stop stealers. After a number of
> spins without sucessfully acquiring the lock, the queue head employs
> this, which will assure it is the next owner.
> ---
>  arch/powerpc/include/asm/qspinlock_types.h | 10 +++-
>  arch/powerpc/lib/qspinlock.c               | 56 +++++++++++++++++++++-
>  2 files changed, 63 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> index 210adf05b235..8b20f5e22bba 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -29,7 +29,8 @@ typedef struct qspinlock {
>   * Bitfields in the lock word:
>   *
>   *     0: locked bit
> - * 16-31: tail cpu (+1)
> + *    16: must queue bit
> + * 17-31: tail cpu (+1)
>   */
>  #define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
>  				      << _Q_ ## type ## _OFFSET)
> @@ -38,7 +39,12 @@ typedef struct qspinlock {
>  #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
>  #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
>  
> -#define _Q_TAIL_CPU_OFFSET	16
> +#define _Q_MUST_Q_OFFSET	16
> +#define _Q_MUST_Q_BITS		1
> +#define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
> +#define _Q_MUST_Q_VAL		(1U << _Q_MUST_Q_OFFSET)
> +
> +#define _Q_TAIL_CPU_OFFSET	17
>  #define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
>  #define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)

Not a big deal but some of these values could be calculated like in the
generic version. e.g.

	#define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET +_Q_LOCKED_BITS)

>  
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 1625cce714b2..a906cc8f15fa 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -22,6 +22,7 @@ struct qnodes {
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
>  static bool MAYBE_STEALERS __read_mostly = true;
> +static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -30,6 +31,11 @@ static __always_inline int get_steal_spins(void)
>  	return STEAL_SPINS;
>  }
>  
> +static __always_inline int get_head_spins(void)
> +{
> +	return HEAD_SPINS;
> +}
> +
>  static inline u32 encode_tail_cpu(void)
>  {
>  	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> @@ -142,6 +148,23 @@ static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
>  	return prev;
>  }
>  
> +static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
> +{
> +	u32 new = _Q_MUST_Q_VAL;
> +	u32 prev;
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1		# lock_set_mustq			\n"

Is the EH bit not set because we don't hold the lock here?

> +"	or	%0,%0,%2						\n"
> +"	stwcx.	%0,0,%1							\n"
> +"	bne-	1b							\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r" (new)
> +	: "cr0", "memory");

This is another usage close to the DEFINE_TESTOP() pattern.

> +
> +	return prev;
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>  	int cpu = get_tail_cpu(val);
> @@ -165,6 +188,9 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
>  	for (;;) {
>  		u32 val = READ_ONCE(lock->val);
>  
> +		if (val & _Q_MUST_Q_VAL)
> +			break;
> +
>  		if (unlikely(!(val & _Q_LOCKED_VAL))) {
>  			if (trylock_with_tail_cpu(lock, val))
>  				return true;
> @@ -246,11 +272,22 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  		/* We must be the owner, just set the lock bit and acquire */
>  		lock_set_locked(lock);
>  	} else {
> +		int iters = 0;
> +		bool set_mustq = false;
> +
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
> -		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> +		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
>  			cpu_relax();
>  
> +			iters++;

It seems instead of using set_mustq, (val & _Q_MUST_Q_VAL) could be checked?

> +			if (!set_mustq && iters >= get_head_spins()) {
> +				set_mustq = true;
> +				lock_set_mustq(lock);
> +				val |= _Q_MUST_Q_VAL;
> +			}
> +		}
> +
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
>  			if (trylock_clear_tail_cpu(lock, val))
> @@ -329,9 +366,26 @@ static int steal_spins_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
>  
> +static int head_spins_set(void *data, u64 val)
> +{
> +	HEAD_SPINS = val;
> +
> +	return 0;
> +}
> +
> +static int head_spins_get(void *data, u64 *val)
> +{
> +	*val = HEAD_SPINS;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_head_spins, head_spins_get, head_spins_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
> +	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
>  
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
  2022-08-10  1:52   ` Jordan NIethe
@ 2022-08-10  6:48     ` Christophe Leroy
  0 siblings, 0 replies; 78+ messages in thread
From: Christophe Leroy @ 2022-08-10  6:48 UTC (permalink / raw)
  To: Jordan NIethe, Nicholas Piggin, linuxppc-dev



Le 10/08/2022 à 03:52, Jordan NIethe a écrit :
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> <snip>
>> -#define queued_spin_lock queued_spin_lock
>>   
>> -static inline void queued_spin_unlock(struct qspinlock *lock)
>> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>>   {
>> -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
>> -		smp_store_release(&lock->locked, 0);
>> -	else
>> -		__pv_queued_spin_unlock(lock);
>> +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
>> +		return 1;
>> +	return 0;
> 
> optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);
> 

The parenthesis are pointless, and ! is usually prefered to == 0, 
something like that:

	return !atomic_cmpxchg_acquire(&lock->val, 0, 1);

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word
  2022-07-28  6:31 ` [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word Nicholas Piggin
@ 2022-08-12  0:50   ` Jordan Niethe
  2022-11-10  0:40   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-12  0:50 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Store the owner CPU number in the lock word so it may be yielded to,
> as powerpc's paravirtualised simple spinlocks do.
> ---
>  arch/powerpc/include/asm/qspinlock.h       |  8 +++++++-
>  arch/powerpc/include/asm/qspinlock_types.h | 10 ++++++++++
>  arch/powerpc/lib/qspinlock.c               |  6 +++---
>  3 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> index 3ab354159e5e..44601b261e08 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -20,9 +20,15 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
>  	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
>  }
>  
> +static __always_inline u32 queued_spin_get_locked_val(void)

Maybe this function should have "encode" in the name to match with
encode_tail_cpu().


> +{
> +	/* XXX: make this use lock value in paca like simple spinlocks? */

Is that the paca's lock_token which is 0x8000?


> +	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
> +}
> +
>  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> -	u32 new = _Q_LOCKED_VAL;
> +	u32 new = queued_spin_get_locked_val();
>  	u32 prev;
>  
>  	asm volatile(
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> index 8b20f5e22bba..35f9525381e6 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -29,6 +29,8 @@ typedef struct qspinlock {
>   * Bitfields in the lock word:
>   *
>   *     0: locked bit
> + *  1-14: lock holder cpu
> + *    15: unused bit
>   *    16: must queue bit
>   * 17-31: tail cpu (+1)

So there is one more bit to store the tail cpu vs the lock holder cpu?

>   */
> @@ -39,6 +41,14 @@ typedef struct qspinlock {
>  #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
>  #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
>  
> +#define _Q_OWNER_CPU_OFFSET	1
> +#define _Q_OWNER_CPU_BITS	14
> +#define _Q_OWNER_CPU_MASK	_Q_SET_MASK(OWNER_CPU)
> +
> +#if CONFIG_NR_CPUS > (1U << _Q_OWNER_CPU_BITS)
> +#error "qspinlock does not support such large CONFIG_NR_CPUS"
> +#endif
> +
>  #define _Q_MUST_Q_OFFSET	16
>  #define _Q_MUST_Q_BITS		1
>  #define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index a906cc8f15fa..aa26cfe21f18 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -50,7 +50,7 @@ static inline int get_tail_cpu(u32 val)
>  /* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> -	u32 new = _Q_LOCKED_VAL;
> +	u32 new = queued_spin_get_locked_val();
>  	u32 prev;
>  
>  	asm volatile(
> @@ -68,7 +68,7 @@ static __always_inline void lock_set_locked(struct qspinlock *lock)
>  /* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
>  static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
>  {
> -	u32 new = _Q_LOCKED_VAL;
> +	u32 new = queued_spin_get_locked_val();
>  	u32 prev;
>  
>  	BUG_ON(old & _Q_LOCKED_VAL);
> @@ -116,7 +116,7 @@ static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 old, u3
>  /* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
>  static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 val)
>  {
> -	u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
> +	u32 newval = queued_spin_get_locked_val() | (val & _Q_TAIL_CPU_MASK);
>  
>  	if (__trylock_cmpxchg(lock, val, newval) == val)
>  		return 1;


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner
  2022-07-28  6:31 ` [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner Nicholas Piggin
@ 2022-08-12  2:01   ` Jordan Niethe
  2022-11-10  0:41   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-12  2:01 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

 On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Waiters spinning on the lock word should yield to the lock owner if the
> vCPU is preempted. This improves performance when the hypervisor has
> oversubscribed physical CPUs.
> ---
>  arch/powerpc/lib/qspinlock.c | 97 ++++++++++++++++++++++++++++++------
>  1 file changed, 83 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index aa26cfe21f18..55286ac91da5 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -5,6 +5,7 @@
>  #include <linux/percpu.h>
>  #include <linux/smp.h>
>  #include <asm/qspinlock.h>
> +#include <asm/paravirt.h>
>  
>  #define MAX_NODES	4
>  
> @@ -24,14 +25,16 @@ static int STEAL_SPINS __read_mostly = (1<<5);
>  static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
> +static bool pv_yield_owner __read_mostly = true;

Not macro case for these globals? To me name does not make it super clear this
is a boolean. What about pv_yield_owner_enabled?

> +
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static __always_inline int get_steal_spins(void)
> +static __always_inline int get_steal_spins(bool paravirt)
>  {
>  	return STEAL_SPINS;
>  }
>  
> -static __always_inline int get_head_spins(void)
> +static __always_inline int get_head_spins(bool paravirt)
>  {
>  	return HEAD_SPINS;
>  }
> @@ -46,7 +49,11 @@ static inline int get_tail_cpu(u32 val)
>  	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
>  }
>  
> -/* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
> +static inline int get_owner_cpu(u32 val)
> +{
> +	return (val & _Q_OWNER_CPU_MASK) >> _Q_OWNER_CPU_OFFSET;
> +}
> +
>  /* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> @@ -180,7 +187,45 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  	BUG();
>  }
>  
> -static inline bool try_to_steal_lock(struct qspinlock *lock)
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)

This name doesn't seem correct for the non paravirt case.

> +{
> +	int owner;
> +	u32 yield_count;
> +
> +	BUG_ON(!(val & _Q_LOCKED_VAL));
> +
> +	if (!paravirt)
> +		goto relax;
> +
> +	if (!pv_yield_owner)
> +		goto relax;
> +
> +	owner = get_owner_cpu(val);
> +	yield_count = yield_count_of(owner);
> +
> +	if ((yield_count & 1) == 0)
> +		goto relax; /* owner vcpu is running */

I wonder why not use vcpu_is_preempted()?

> +
> +	/*
> +	 * Read the lock word after sampling the yield count. On the other side
> +	 * there may a wmb because the yield count update is done by the
> +	 * hypervisor preemption and the value update by the OS, however this
> +	 * ordering might reduce the chance of out of order accesses and
> +	 * improve the heuristic.
> +	 */
> +	smp_rmb();
> +
> +	if (READ_ONCE(lock->val) == val) {
> +		yield_to_preempted(owner, yield_count);
> +		/* Don't relax if we yielded. Maybe we should? */
> +		return;
> +	}
> +relax:
> +	cpu_relax();
> +}
> +
> +
> +static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
>  {
>  	int iters;
>  
> @@ -197,18 +242,18 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
>  			continue;
>  		}
>  
> -		cpu_relax();
> +		yield_to_locked_owner(lock, val, paravirt);
>  
>  		iters++;
>  
> -		if (iters >= get_steal_spins())
> +		if (iters >= get_steal_spins(paravirt))
>  			break;
>  	}
>  
>  	return false;
>  }
>  
> -static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> +static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool paravirt)
>  {
>  	struct qnodes *qnodesp;
>  	struct qnode *next, *node;
> @@ -260,7 +305,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  	if (!MAYBE_STEALERS) {
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> -			cpu_relax();
> +			yield_to_locked_owner(lock, val, paravirt);
>  
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -278,10 +323,10 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> -			cpu_relax();
> +			yield_to_locked_owner(lock, val, paravirt);
>  
>  			iters++;
> -			if (!set_mustq && iters >= get_head_spins()) {
> +			if (!set_mustq && iters >= get_head_spins(paravirt)) {
>  				set_mustq = true;
>  				lock_set_mustq(lock);
>  				val |= _Q_MUST_Q_VAL;
> @@ -320,10 +365,15 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  
>  void queued_spin_lock_slowpath(struct qspinlock *lock)
>  {
> -	if (try_to_steal_lock(lock))
> -		return;
> -
> -	queued_spin_lock_mcs_queue(lock);
> +	if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor()) {
> +		if (try_to_steal_lock(lock, true))
> +			return;
> +		queued_spin_lock_mcs_queue(lock, true);
> +	} else {
> +		if (try_to_steal_lock(lock, false))
> +			return;
> +		queued_spin_lock_mcs_queue(lock, false);
> +	}
>  }

There is not really a need for a conditional: 

bool paravirt = IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
is_shared_processor();

if (try_to_steal_lock(lock, paravirt))
	return;

queued_spin_lock_mcs_queue(lock, paravirt);


The paravirt parameter used by the various functions seems always to be
equivalent to (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor()).
I wonder if it would be simpler testing (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor())
(using a helper function) in those functions instead passing it as a parameter?


>  EXPORT_SYMBOL(queued_spin_lock_slowpath);
>  
> @@ -382,10 +432,29 @@ static int head_spins_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_head_spins, head_spins_get, head_spins_set, "%llu\n");
>  
> +static int pv_yield_owner_set(void *data, u64 val)
> +{
> +	pv_yield_owner = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_yield_owner_get(void *data, u64 *val)
> +{
> +	*val = pv_yield_owner;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
>  	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
> +	if (is_shared_processor()) {
> +		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
> +	}
>  
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node
  2022-07-28  6:31 ` [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node Nicholas Piggin
@ 2022-08-12  2:07   ` Jordan Niethe
  2022-11-10  0:41   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-12  2:07 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Queued waiters which are not at the head of the queue don't spin on
> the lock word but their qnode lock word, waiting for the previous queued
> CPU to release them. Add an option which allows these waiters to yield
> to the previous CPU if its vCPU is preempted.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 46 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 55286ac91da5..b39f8c5b329c 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> +static bool pv_yield_prev __read_mostly = true;

Similiar suggestion, maybe pv_yield_prev_enabled would read better.

Isn't this enabled by default contrary to the commit message?


>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -224,6 +225,31 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
>  	cpu_relax();
>  }
>  
> +static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)

yield_to_locked_owner() takes a raw val and works out the cpu to yield to.
I think for consistency have yield_to_prev() take the raw val and work it out too.

> +{
> +	u32 yield_count;
> +
> +	if (!paravirt)
> +		goto relax;
> +
> +	if (!pv_yield_prev)
> +		goto relax;
> +
> +	yield_count = yield_count_of(prev_cpu);
> +	if ((yield_count & 1) == 0)
> +		goto relax; /* owner vcpu is running */
> +
> +	smp_rmb(); /* See yield_to_locked_owner comment */
> +
> +	if (!node->locked) {
> +		yield_to_preempted(prev_cpu, yield_count);
> +		return;
> +	}
> +
> +relax:
> +	cpu_relax();
> +}
> +
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
>  {
> @@ -291,13 +317,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	 */
>  	if (old & _Q_TAIL_CPU_MASK) {
>  		struct qnode *prev = get_tail_qnode(lock, old);
> +		int prev_cpu = get_tail_cpu(old);

This could then be removed.

>  
>  		/* Link @node into the waitqueue. */
>  		WRITE_ONCE(prev->next, node);
>  
>  		/* Wait for mcs node lock to be released */
>  		while (!node->locked)
> -			cpu_relax();
> +			yield_to_prev(lock, node, prev_cpu, paravirt);

And would have this as:
			yield_to_prev(lock, node, old, paravirt);


>  
>  		smp_rmb(); /* acquire barrier for the mcs lock */
>  	}
> @@ -448,12 +475,29 @@ static int pv_yield_owner_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
>  
> +static int pv_yield_prev_set(void *data, u64 val)
> +{
> +	pv_yield_prev = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_yield_prev_get(void *data, u64 *val)
> +{
> +	*val = pv_yield_prev;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, pv_yield_prev_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
>  	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
>  	if (is_shared_processor()) {
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
> +		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>  	}
>  
>  	return 0;


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields
  2022-07-28  6:31 ` [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields Nicholas Piggin
@ 2022-08-12  4:06   ` Jordan Niethe
  2022-11-10  0:42   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-12  4:06 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> If the head of queue is preventing stealing but it finds the owner vCPU
> is preempted, it will yield its cycles to the owner which could cause it
> to become preempted. Add an option to re-allow stealers before yielding,
> and disallow them again after returning from the yield.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 56 ++++++++++++++++++++++++++++++++++--
>  1 file changed, 53 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index b39f8c5b329c..94f007f66942 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> +static bool pv_yield_allow_steal __read_mostly = false;

To me this one does read as a boolean, but if you go with those other changes
I'd make it pv_yield_steal_enable to be consistent.

>  static bool pv_yield_prev __read_mostly = true;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> @@ -173,6 +174,23 @@ static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
>  	return prev;
>  }
>  
> +static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
> +{
> +	u32 new = _Q_MUST_Q_VAL;
> +	u32 prev;
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1		# lock_clear_mustq			\n"
> +"	andc	%0,%0,%2						\n"
> +"	stwcx.	%0,0,%1							\n"
> +"	bne-	1b							\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r" (new)
> +	: "cr0", "memory");
> +

This is pretty similar to the DEFINE_TESTOP() pattern again with the same llong caveat.


> +	return prev;
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>  	int cpu = get_tail_cpu(val);
> @@ -188,7 +206,7 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  	BUG();
>  }
>  
> -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)

 /* See yield_to_locked_owner comment */ comment needs to be updated now.


>  {
>  	int owner;
>  	u32 yield_count;
> @@ -217,7 +235,11 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
>  	smp_rmb();
>  
>  	if (READ_ONCE(lock->val) == val) {
> +		if (clear_mustq)
> +			lock_clear_mustq(lock);
>  		yield_to_preempted(owner, yield_count);
> +		if (clear_mustq)
> +			lock_set_mustq(lock);
>  		/* Don't relax if we yielded. Maybe we should? */
>  		return;
>  	}
> @@ -225,6 +247,16 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
>  	cpu_relax();
>  }
>  
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> +{
> +	__yield_to_locked_owner(lock, val, paravirt, false);
> +}
> +
> +static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
> +{

The check for pv_yield_allow_steal seems like it could go here instead of
being done by the caller.
__yield_to_locked_owner() checks for pv_yield_owner so it seems more
  consistent.



> +	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> +}
> +
>  static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
>  {
>  	u32 yield_count;
> @@ -332,7 +364,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	if (!MAYBE_STEALERS) {
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> -			yield_to_locked_owner(lock, val, paravirt);
> +			yield_head_to_locked_owner(lock, val, paravirt, false);
>  
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -350,7 +382,8 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> -			yield_to_locked_owner(lock, val, paravirt);
> +			yield_head_to_locked_owner(lock, val, paravirt,
> +					pv_yield_allow_steal && set_mustq);
>  
>  			iters++;
>  			if (!set_mustq && iters >= get_head_spins(paravirt)) {
> @@ -475,6 +508,22 @@ static int pv_yield_owner_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
>  
> +static int pv_yield_allow_steal_set(void *data, u64 val)
> +{
> +	pv_yield_allow_steal = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_yield_allow_steal_get(void *data, u64 *val)
> +{
> +	*val = pv_yield_allow_steal;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_allow_steal, pv_yield_allow_steal_get, pv_yield_allow_steal_set, "%llu\n");
> +
>  static int pv_yield_prev_set(void *data, u64 val)
>  {
>  	pv_yield_prev = !!val;
> @@ -497,6 +546,7 @@ static __init int spinlock_debugfs_init(void)
>  	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
>  	if (is_shared_processor()) {
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
> +		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>  	}
>  


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue
  2022-07-28  6:31 ` [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
@ 2022-08-12  4:17   ` Jordan Niethe
  2022-10-06 17:27   ` Laurent Dufour
  2022-11-10  0:42   ` Jordan Niethe
  2 siblings, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-12  4:17 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Having all CPUs poll the lock word for the owner CPU that should be
> yielded to defeats most of the purpose of using MCS queueing for
> scalability. Yet it may be desirable for queued waiters to to yield
> to a preempted owner.
> 
> s390 addreses this problem by having queued waiters sample the lock
> word to find the owner much less frequently. In this approach, the
> waiters never sample it directly, but the queue head propagates the
> owner CPU back to the next waiter if it ever finds the owner has
> been preempted. Queued waiters then subsequently propagate the owner
> CPU back to the next waiter, and so on.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 85 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 94f007f66942..28c85a2d5635 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -12,6 +12,7 @@
>  struct qnode {
>  	struct qnode	*next;
>  	struct qspinlock *lock;
> +	int		yield_cpu;
>  	u8		locked; /* 1 if lock acquired */
>  };
>  
> @@ -28,6 +29,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
> +static bool pv_yield_propagate_owner __read_mostly = true;

This also seems to be enabled by default.

>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -257,13 +259,66 @@ static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u
>  	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
>  }
>  
> +static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
> +{
> +	struct qnode *next;
> +	int owner;
> +
> +	if (!paravirt)
> +		return;
> +	if (!pv_yield_propagate_owner)
> +		return;
> +
> +	owner = get_owner_cpu(val);
> +	if (*set_yield_cpu == owner)
> +		return;
> +
> +	next = READ_ONCE(node->next);
> +	if (!next)
> +		return;
> +
> +	if (vcpu_is_preempted(owner)) {

Is there a difference about using vcpu_is_preempted() here
vs checking bit 0 in other places?


> +		next->yield_cpu = owner;
> +		*set_yield_cpu = owner;
> +	} else if (*set_yield_cpu != -1) {

It might be worth giving the -1 CPU a #define.

> +		next->yield_cpu = owner;
> +		*set_yield_cpu = owner;
> +	}
> +}

Does this need to pass set_yield_cpu by reference? Couldn't it's new value be
returned? To me it makes it more clear the function is used to change
set_yield_cpu. I think this would work:

int set_yield_cpu = -1;

static __always_inline int propagate_yield_cpu(struct qnode *node, u32 val, int set_yield_cpu, bool paravirt)
{
	struct qnode *next;
	int owner;

	if (!paravirt)
		goto out;
	if (!pv_yield_propagate_owner)
		goto out;

	owner = get_owner_cpu(val);
	if (set_yield_cpu == owner)
		goto out;

	next = READ_ONCE(node->next);
	if (!next)
		goto out;

	if (vcpu_is_preempted(owner)) {
		next->yield_cpu = owner;
		return owner;
	} else if (set_yield_cpu != -1) {
		next->yield_cpu = owner;
		return owner;
	}

out:
	return set_yield_cpu;
}

set_yield_cpu = propagate_yield_cpu(...  set_yield_cpu ...);



> +
>  static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
>  {
>  	u32 yield_count;
> +	int yield_cpu;
>  
>  	if (!paravirt)
>  		goto relax;
>  
> +	if (!pv_yield_propagate_owner)
> +		goto yield_prev;
> +
> +	yield_cpu = READ_ONCE(node->yield_cpu);
> +	if (yield_cpu == -1) {
> +		/* Propagate back the -1 CPU */
> +		if (node->next && node->next->yield_cpu != -1)
> +			node->next->yield_cpu = yield_cpu;
> +		goto yield_prev;
> +	}
> +
> +	yield_count = yield_count_of(yield_cpu);
> +	if ((yield_count & 1) == 0)
> +		goto yield_prev; /* owner vcpu is running */
> +
> +	smp_rmb();
> +
> +	if (yield_cpu == node->yield_cpu) {
> +		if (node->next && node->next->yield_cpu != yield_cpu)
> +			node->next->yield_cpu = yield_cpu;
> +		yield_to_preempted(yield_cpu, yield_count);
> +		return;
> +	}
> +
> +yield_prev:
>  	if (!pv_yield_prev)
>  		goto relax;
>  
> @@ -337,6 +392,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	node = &qnodesp->nodes[idx];
>  	node->next = NULL;
>  	node->lock = lock;
> +	node->yield_cpu = -1;
>  	node->locked = 0;
>  
>  	tail = encode_tail_cpu();
> @@ -358,13 +414,21 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		while (!node->locked)
>  			yield_to_prev(lock, node, prev_cpu, paravirt);
>  
> +		/* Clear out stale propagated yield_cpu */
> +		if (paravirt && pv_yield_propagate_owner && node->yield_cpu != -1)
> +			node->yield_cpu = -1;
> +
>  		smp_rmb(); /* acquire barrier for the mcs lock */
>  	}
>  
>  	if (!MAYBE_STEALERS) {
> +		int set_yield_cpu = -1;
> +
>  		/* We're at the head of the waitqueue, wait for the lock. */
> -		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> +		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt, false);
> +		}
>  
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -376,12 +440,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		/* We must be the owner, just set the lock bit and acquire */
>  		lock_set_locked(lock);
>  	} else {
> +		int set_yield_cpu = -1;
>  		int iters = 0;
>  		bool set_mustq = false;
>  
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt,
>  					pv_yield_allow_steal && set_mustq);
>  
> @@ -540,6 +606,22 @@ static int pv_yield_prev_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, pv_yield_prev_set, "%llu\n");
>  
> +static int pv_yield_propagate_owner_set(void *data, u64 val)
> +{
> +	pv_yield_propagate_owner = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_yield_propagate_owner_get(void *data, u64 *val)
> +{
> +	*val = pv_yield_propagate_owner;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
> @@ -548,6 +630,7 @@ static __init int spinlock_debugfs_init(void)
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
>  		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
> +		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
>  	}
>  
>  	return 0;


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU
  2022-07-28  6:31 ` [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU Nicholas Piggin
@ 2022-08-12  4:22   ` Jordan Niethe
  2022-11-10  0:42   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-12  4:22 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> After the head of the queue acquires the lock, it releases the
> next waiter in the queue to become the new head. Add an option
> to prod the new head if its vCPU was preempted. This may only
> have an effect if queue waiters are yielding.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 28c85a2d5635..3b10e31bcf0a 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -12,6 +12,7 @@
>  struct qnode {
>  	struct qnode	*next;
>  	struct qspinlock *lock;
> +	int		cpu;
>  	int		yield_cpu;
>  	u8		locked; /* 1 if lock acquired */
>  };
> @@ -30,6 +31,7 @@ static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
> +static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -392,6 +394,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	node = &qnodesp->nodes[idx];
>  	node->next = NULL;
>  	node->lock = lock;
> +	node->cpu = smp_processor_id();

I suppose this could be used in some other places too.

For example change:
	yield_to_prev(lock, node, prev, paravirt);

In yield_to_prev() it could then access the prev->cpu.

>  	node->yield_cpu = -1;
>  	node->locked = 0;
>  
> @@ -483,7 +486,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	 * this store to locked. The corresponding barrier is the smp_rmb()
>  	 * acquire barrier for mcs lock, above.
>  	 */
> -	WRITE_ONCE(next->locked, 1);
> +	if (paravirt && pv_prod_head) {
> +		int next_cpu = next->cpu;
> +		WRITE_ONCE(next->locked, 1);
> +		if (vcpu_is_preempted(next_cpu))
> +			prod_cpu(next_cpu);
> +	} else {
> +		WRITE_ONCE(next->locked, 1);
> +	}
>  
>  release:
>  	qnodesp->count--; /* release the node */
> @@ -622,6 +632,22 @@ static int pv_yield_propagate_owner_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
>  
> +static int pv_prod_head_set(void *data, u64 val)
> +{
> +	pv_prod_head = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_prod_head_get(void *data, u64 *val)
> +{
> +	*val = pv_prod_head;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, pv_prod_head_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
> @@ -631,6 +657,7 @@ static __init int spinlock_debugfs_init(void)
>  		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>  		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
> +		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);
>  	}
>  
>  	return 0;


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal
  2022-07-28  6:31 ` [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal Nicholas Piggin
@ 2022-08-12  4:32   ` Jordan Niethe
  2022-11-10  0:43   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-12  4:32 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> This gives trylock slightly more strength, and it also gives most
> of the benefit of passing 'val' back through the slowpath without
> the complexity.
> ---
>  arch/powerpc/include/asm/qspinlock.h | 39 +++++++++++++++++++++++++++-
>  arch/powerpc/lib/qspinlock.c         |  9 +++++++
>  2 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> index 44601b261e08..d3d2039237b2 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -5,6 +5,8 @@
>  #include <linux/compiler.h>
>  #include <asm/qspinlock_types.h>
>  
> +#define _Q_SPIN_TRY_LOCK_STEAL 1

Would this be a config option?

> +
>  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>  {
>  	return READ_ONCE(lock->val);
> @@ -26,11 +28,12 @@ static __always_inline u32 queued_spin_get_locked_val(void)
>  	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
>  }
>  
> -static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> +static __always_inline int __queued_spin_trylock_nosteal(struct qspinlock *lock)
>  {
>  	u32 new = queued_spin_get_locked_val();
>  	u32 prev;
>  
> +	/* Trylock succeeds only when unlocked and no queued nodes */
>  	asm volatile(
>  "1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"

s/queued_spin_trylock/__queued_spin_trylock_nosteal

>  "	cmpwi	0,%0,0							\n"
> @@ -49,6 +52,40 @@ static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  	return 0;
>  }
>  
> +static __always_inline int __queued_spin_trylock_steal(struct qspinlock *lock)
> +{
> +	u32 new = queued_spin_get_locked_val();
> +	u32 prev, tmp;
> +
> +	/* Trylock may get ahead of queued nodes if it finds unlocked */
> +	asm volatile(
> +"1:	lwarx	%0,0,%2,%5	# queued_spin_trylock			\n"

s/queued_spin_trylock/__queued_spin_trylock_steal

> +"	andc.	%1,%0,%4						\n"
> +"	bne-	2f							\n"
> +"	and	%1,%0,%4						\n"
> +"	or	%1,%1,%3						\n"
> +"	stwcx.	%1,0,%2							\n"
> +"	bne-	1b							\n"
> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +"2:									\n"

Just because there's a little bit more going on here...

Q_TAIL_CPU_MASK = 0xFFFE0000
~Q_TAIL_CPU_MASK = 0x1FFFF


1:	lwarx	prev, 0, &lock->val, IS_ENABLED_PPC64
	andc.	tmp, prev, _Q_TAIL_CPU_MASK 	(tmp = prev & ~_Q_TAIL_CPU_MASK)
	bne-	2f 				(exit if locked)
	and	tmp, prev, _Q_TAIL_CPU_MASK 	(tmp = prev & _Q_TAIL_CPU_MASK)
	or	tmp, tmp, new			(tmp |= new)					
	stwcx.	tmp, 0, &lock->val					
		
	bne-	1b							
	PPC_ACQUIRE_BARRIER		
2:

... which seems correct.


> +	: "=&r" (prev), "=&r" (tmp)
> +	: "r" (&lock->val), "r" (new), "r" (_Q_TAIL_CPU_MASK),
> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> +	: "cr0", "memory");
> +
> +	if (likely(!(prev & ~_Q_TAIL_CPU_MASK)))
> +		return 1;
> +	return 0;
> +}
> +
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> +{
> +	if (!_Q_SPIN_TRY_LOCK_STEAL)
> +		return __queued_spin_trylock_nosteal(lock);
> +	else
> +		return __queued_spin_trylock_steal(lock);
> +}
> +
>  void queued_spin_lock_slowpath(struct qspinlock *lock);
>  
>  static __always_inline void queued_spin_lock(struct qspinlock *lock)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 3b10e31bcf0a..277aef1fab0a 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -24,7 +24,11 @@ struct qnodes {
>  
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
> +#if _Q_SPIN_TRY_LOCK_STEAL == 1
> +static const bool MAYBE_STEALERS = true;
> +#else
>  static bool MAYBE_STEALERS __read_mostly = true;
> +#endif
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> @@ -522,6 +526,10 @@ void pv_spinlocks_init(void)
>  #include <linux/debugfs.h>
>  static int steal_spins_set(void *data, u64 val)
>  {
> +#if _Q_SPIN_TRY_LOCK_STEAL == 1
> +	/* MAYBE_STEAL remains true */
> +	STEAL_SPINS = val;
> +#else
>  	static DEFINE_MUTEX(lock);
>  
>  	mutex_lock(&lock);
> @@ -539,6 +547,7 @@ static int steal_spins_set(void *data, u64 val)
>  		STEAL_SPINS = val;
>  	}
>  	mutex_unlock(&lock);
> +#endif
>  
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API
  2022-07-28  6:31 ` [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API Nicholas Piggin
@ 2022-08-12  4:36   ` Jordan Niethe
  2022-11-10  0:43   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-12  4:36 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Use the spin_begin/spin_cpu_relax/spin_end APIs in qspinlock, which helps
> to prevent threads issuing a lot of expensive priority nops which may not
> have much effect due to immediately executing low then medium priority.

Just a general comment regarding the spin_{begin,end} API, more complicated
than something like

	spin_begin()
	for(;;)
		spin_cpu_relax()
	spin_end()

it becomes difficult to keep track of. Unfortunately, I don't have any good
suggestions how to improve it. Hopefully with P10s wait instruction we can
maybe try and move away from this.

It might be useful to comment the functions pre and post conditions regarding
expectations about spin_begin() and spin_end().

> ---
>  arch/powerpc/lib/qspinlock.c | 35 +++++++++++++++++++++++++++++++----
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 277aef1fab0a..d4594c701f7d 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -233,6 +233,8 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  	if ((yield_count & 1) == 0)
>  		goto relax; /* owner vcpu is running */
>  
> +	spin_end();
> +
>  	/*
>  	 * Read the lock word after sampling the yield count. On the other side
>  	 * there may a wmb because the yield count update is done by the
> @@ -248,11 +250,13 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  		yield_to_preempted(owner, yield_count);
>  		if (clear_mustq)
>  			lock_set_mustq(lock);
> +		spin_begin();
>  		/* Don't relax if we yielded. Maybe we should? */
>  		return;
>  	}
> +	spin_begin();
>  relax:
> -	cpu_relax();
> +	spin_cpu_relax();
>  }
>  
>  static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> @@ -315,14 +319,18 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  	if ((yield_count & 1) == 0)
>  		goto yield_prev; /* owner vcpu is running */
>  
> +	spin_end();
> +
>  	smp_rmb();
>  
>  	if (yield_cpu == node->yield_cpu) {
>  		if (node->next && node->next->yield_cpu != yield_cpu)
>  			node->next->yield_cpu = yield_cpu;
>  		yield_to_preempted(yield_cpu, yield_count);
> +		spin_begin();
>  		return;
>  	}
> +	spin_begin();
>  
>  yield_prev:
>  	if (!pv_yield_prev)
> @@ -332,15 +340,19 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  	if ((yield_count & 1) == 0)
>  		goto relax; /* owner vcpu is running */
>  
> +	spin_end();
> +
>  	smp_rmb(); /* See yield_to_locked_owner comment */
>  
>  	if (!node->locked) {
>  		yield_to_preempted(prev_cpu, yield_count);
> +		spin_begin();
>  		return;
>  	}
> +	spin_begin();
>  
>  relax:
> -	cpu_relax();
> +	spin_cpu_relax();
>  }
>  
>  
> @@ -349,6 +361,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  	int iters;
>  
>  	/* Attempt to steal the lock */
> +	spin_begin();
>  	for (;;) {
>  		u32 val = READ_ONCE(lock->val);
>  
> @@ -356,8 +369,10 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  			break;
>  
>  		if (unlikely(!(val & _Q_LOCKED_VAL))) {
> +			spin_end();
>  			if (trylock_with_tail_cpu(lock, val))
>  				return true;
> +			spin_begin();
>  			continue;
>  		}
>  
> @@ -368,6 +383,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  		if (iters >= get_steal_spins(paravirt))
>  			break;
>  	}
> +	spin_end();
>  
>  	return false;
>  }
> @@ -418,8 +434,10 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		WRITE_ONCE(prev->next, node);
>  
>  		/* Wait for mcs node lock to be released */
> +		spin_begin();
>  		while (!node->locked)
>  			yield_to_prev(lock, node, prev_cpu, paravirt);
> +		spin_end();
>  
>  		/* Clear out stale propagated yield_cpu */
>  		if (paravirt && pv_yield_propagate_owner && node->yield_cpu != -1)
> @@ -432,10 +450,12 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		int set_yield_cpu = -1;
>  
>  		/* We're at the head of the waitqueue, wait for the lock. */
> +		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt, false);
>  		}
> +		spin_end();
>  
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -453,6 +473,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
> +		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt,
> @@ -465,6 +486,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  				val |= _Q_MUST_Q_VAL;
>  			}
>  		}
> +		spin_end();
>  
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -480,8 +502,13 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  
>  unlock_next:
>  	/* contended path; must wait for next != NULL (MCS protocol) */
> -	while (!(next = READ_ONCE(node->next)))
> -		cpu_relax();
> +	next = READ_ONCE(node->next);
> +	if (!next) {
> +		spin_begin();
> +		while (!(next = READ_ONCE(node->next)))
> +			cpu_relax();
> +		spin_end();
> +	}
>  
>  	/*
>  	 * Unlock the next mcs waiter node. Release barrier is not required


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins
  2022-07-28  6:31 ` [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins Nicholas Piggin
@ 2022-08-12  4:43   ` Jordan Niethe
  2022-11-10  0:43   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-12  4:43 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Allow for a reduction in the number of times a CPU from a different
> node than the owner can attempt to steal the lock before queueing.
> This could bias the transfer behaviour of the lock across the
> machine and reduce NUMA crossings.
> ---
>  arch/powerpc/lib/qspinlock.c | 34 +++++++++++++++++++++++++++++++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index d4594c701f7d..24f68bd71e2b 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -4,6 +4,7 @@
>  #include <linux/export.h>
>  #include <linux/percpu.h>
>  #include <linux/smp.h>
> +#include <linux/topology.h>
>  #include <asm/qspinlock.h>
>  #include <asm/paravirt.h>
>  
> @@ -24,6 +25,7 @@ struct qnodes {
>  
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
> +static int REMOTE_STEAL_SPINS __read_mostly = (1<<2);
>  #if _Q_SPIN_TRY_LOCK_STEAL == 1
>  static const bool MAYBE_STEALERS = true;
>  #else
> @@ -39,9 +41,13 @@ static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static __always_inline int get_steal_spins(bool paravirt)
> +static __always_inline int get_steal_spins(bool paravirt, bool remote)
>  {
> -	return STEAL_SPINS;
> +	if (remote) {
> +		return REMOTE_STEAL_SPINS;
> +	} else {
> +		return STEAL_SPINS;
> +	}
>  }
>  
>  static __always_inline int get_head_spins(bool paravirt)
> @@ -380,8 +386,13 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  
>  		iters++;
>  
> -		if (iters >= get_steal_spins(paravirt))
> +		if (iters >= get_steal_spins(paravirt, false))
>  			break;
> +		if (iters >= get_steal_spins(paravirt, true)) {

There's no indication of what true and false mean here which is hard to read.
To me it feels like two separate functions would be more clear.


> +			int cpu = get_owner_cpu(val);
> +			if (numa_node_id() != cpu_to_node(cpu))

What about using node_distance() instead?


> +				break;
> +		}
>  	}
>  	spin_end();
>  
> @@ -588,6 +599,22 @@ static int steal_spins_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
>  
> +static int remote_steal_spins_set(void *data, u64 val)
> +{
> +	REMOTE_STEAL_SPINS = val;

REMOTE_STEAL_SPINS is int not u64.

> +
> +	return 0;
> +}
> +
> +static int remote_steal_spins_get(void *data, u64 *val)
> +{
> +	*val = REMOTE_STEAL_SPINS;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_remote_steal_spins, remote_steal_spins_get, remote_steal_spins_set, "%llu\n");
> +
>  static int head_spins_set(void *data, u64 val)
>  {
>  	HEAD_SPINS = val;
> @@ -687,6 +714,7 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, pv_prod_head_set, "
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
> +	debugfs_create_file("qspl_remote_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_remote_steal_spins);
>  	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
>  	if (is_shared_processor()) {
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner
  2022-07-28  6:31 ` [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner Nicholas Piggin
@ 2022-08-12  4:49   ` Jordan Niethe
  2022-09-22 15:02   ` Laurent Dufour
  2022-11-10  0:44   ` Jordan Niethe
  2 siblings, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-12  4:49 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Provide an option that holds off queueing indefinitely while the lock
> owner is preempted. This could reduce queueing latencies for very
> overcommitted vcpu situations.
> 
> This is disabled by default.
> ---
>  arch/powerpc/lib/qspinlock.c | 91 +++++++++++++++++++++++++++++++-----
>  1 file changed, 79 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 24f68bd71e2b..5cfd69931e31 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -35,6 +35,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
> +static bool pv_spin_on_preempted_owner __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
>  static bool pv_prod_head __read_mostly = false;
> @@ -220,13 +221,15 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  	BUG();
>  }
>  
> -static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
>  	int owner;
>  	u32 yield_count;
>  
>  	BUG_ON(!(val & _Q_LOCKED_VAL));
>  
> +	*preempted = false;
> +
>  	if (!paravirt)
>  		goto relax;
>  
> @@ -241,6 +244,8 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  
>  	spin_end();
>  
> +	*preempted = true;
> +
>  	/*
>  	 * Read the lock word after sampling the yield count. On the other side
>  	 * there may a wmb because the yield count update is done by the
> @@ -265,14 +270,14 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  	spin_cpu_relax();
>  }
>  
> -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool *preempted)

It seems like preempted parameter could be the return value of
yield_to_locked_owner(). Then callers that don't use the value returned in
preempted don't need to create an unnecessary variable to pass in.

>  {
> -	__yield_to_locked_owner(lock, val, paravirt, false);
> +	__yield_to_locked_owner(lock, val, paravirt, false, preempted);
>  }
>  
> -static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
> -	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> +	__yield_to_locked_owner(lock, val, paravirt, clear_mustq, preempted);
>  }
>  
>  static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
> @@ -364,12 +369,33 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
>  {
> -	int iters;
> +	int iters = 0;
> +
> +	if (!STEAL_SPINS) {
> +		if (paravirt && pv_spin_on_preempted_owner) {
> +			spin_begin();
> +			for (;;) {
> +				u32 val = READ_ONCE(lock->val);
> +				bool preempted;
> +
> +				if (val & _Q_MUST_Q_VAL)
> +					break;
> +				if (!(val & _Q_LOCKED_VAL))
> +					break;
> +				if (!vcpu_is_preempted(get_owner_cpu(val)))
> +					break;
> +				yield_to_locked_owner(lock, val, paravirt, &preempted);
> +			}
> +			spin_end();
> +		}
> +		return false;
> +	}
>  
>  	/* Attempt to steal the lock */
>  	spin_begin();
>  	for (;;) {
>  		u32 val = READ_ONCE(lock->val);
> +		bool preempted;
>  
>  		if (val & _Q_MUST_Q_VAL)
>  			break;
> @@ -382,9 +408,22 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  			continue;
>  		}
>  
> -		yield_to_locked_owner(lock, val, paravirt);
> -
> -		iters++;
> +		yield_to_locked_owner(lock, val, paravirt, &preempted);
> +
> +		if (paravirt && preempted) {
> +			if (!pv_spin_on_preempted_owner)
> +				iters++;
> +			/*
> +			 * pv_spin_on_preempted_owner don't increase iters
> +			 * while the owner is preempted -- we won't interfere
> +			 * with it by definition. This could introduce some
> +			 * latency issue if we continually observe preempted
> +			 * owners, but hopefully that's a rare corner case of
> +			 * a badly oversubscribed system.
> +			 */
> +		} else {
> +			iters++;
> +		}
>  
>  		if (iters >= get_steal_spins(paravirt, false))
>  			break;
> @@ -463,8 +502,10 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			bool preempted;
> +
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
> -			yield_head_to_locked_owner(lock, val, paravirt, false);
> +			yield_head_to_locked_owner(lock, val, paravirt, false, &preempted);
>  		}
>  		spin_end();
>  
> @@ -486,11 +527,20 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			bool preempted;
> +
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt,
> -					pv_yield_allow_steal && set_mustq);
> +					pv_yield_allow_steal && set_mustq,
> +					&preempted);
> +
> +			if (paravirt && preempted) {
> +				if (!pv_spin_on_preempted_owner)
> +					iters++;
> +			} else {
> +				iters++;
> +			}
>  
> -			iters++;
>  			if (!set_mustq && iters >= get_head_spins(paravirt)) {
>  				set_mustq = true;
>  				lock_set_mustq(lock);
> @@ -663,6 +713,22 @@ static int pv_yield_allow_steal_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_allow_steal, pv_yield_allow_steal_get, pv_yield_allow_steal_set, "%llu\n");
>  
> +static int pv_spin_on_preempted_owner_set(void *data, u64 val)
> +{
> +	pv_spin_on_preempted_owner = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_spin_on_preempted_owner_get(void *data, u64 *val)
> +{
> +	*val = pv_spin_on_preempted_owner;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_spin_on_preempted_owner, pv_spin_on_preempted_owner_get, pv_spin_on_preempted_owner_set, "%llu\n");
> +
>  static int pv_yield_prev_set(void *data, u64 val)
>  {
>  	pv_yield_prev = !!val;
> @@ -719,6 +785,7 @@ static __init int spinlock_debugfs_init(void)
>  	if (is_shared_processor()) {
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
>  		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
> +		debugfs_create_file("qspl_pv_spin_on_preempted_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_spin_on_preempted_owner);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>  		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
>  		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks
  2022-07-28  6:31 ` [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks Nicholas Piggin
@ 2022-08-15  1:11   ` Jordan Niethe
  2022-11-10  0:44   ` Jordan Niethe
  1 sibling, 0 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-08-15  1:11 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Finding the owner or a queued waiter on a lock with a preempted vcpu
> is indicative of an oversubscribed guest causing the lock to get into
> trouble. Provide some options to detect this situation and have new
> CPUs avoid queueing for a longer time (more steal iterations) to
> minimise the problems caused by vcpu preemption on the queue.
> ---
>  arch/powerpc/include/asm/qspinlock_types.h |   7 +-
>  arch/powerpc/lib/qspinlock.c               | 240 +++++++++++++++++++--
>  2 files changed, 232 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> index 35f9525381e6..4fbcc8a4230b 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -30,7 +30,7 @@ typedef struct qspinlock {
>   *
>   *     0: locked bit
>   *  1-14: lock holder cpu
> - *    15: unused bit
> + *    15: lock owner or queuer vcpus observed to be preempted bit
>   *    16: must queue bit
>   * 17-31: tail cpu (+1)
>   */
> @@ -49,6 +49,11 @@ typedef struct qspinlock {
>  #error "qspinlock does not support such large CONFIG_NR_CPUS"
>  #endif
>  
> +#define _Q_SLEEPY_OFFSET	15
> +#define _Q_SLEEPY_BITS		1
> +#define _Q_SLEEPY_MASK		_Q_SET_MASK(SLEEPY_OWNER)
> +#define _Q_SLEEPY_VAL		(1U << _Q_SLEEPY_OFFSET)
> +
>  #define _Q_MUST_Q_OFFSET	16
>  #define _Q_MUST_Q_BITS		1
>  #define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 5cfd69931e31..c18133c01450 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -5,6 +5,7 @@
>  #include <linux/percpu.h>
>  #include <linux/smp.h>
>  #include <linux/topology.h>
> +#include <linux/sched/clock.h>
>  #include <asm/qspinlock.h>
>  #include <asm/paravirt.h>
>  
> @@ -36,24 +37,54 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_spin_on_preempted_owner __read_mostly = false;
> +static bool pv_sleepy_lock __read_mostly = true;
> +static bool pv_sleepy_lock_sticky __read_mostly = false;

The sticky part could potentially be its own patch.

> +static u64 pv_sleepy_lock_interval_ns __read_mostly = 0;
> +static int pv_sleepy_lock_factor __read_mostly = 256;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
>  static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> +static DEFINE_PER_CPU_ALIGNED(u64, sleepy_lock_seen_clock);
>  
> -static __always_inline int get_steal_spins(bool paravirt, bool remote)
> +static __always_inline bool recently_sleepy(void)
> +{

Other users of pv_sleepy_lock_interval_ns first check pv_sleepy_lock.

> +	if (pv_sleepy_lock_interval_ns) {
> +		u64 seen = this_cpu_read(sleepy_lock_seen_clock);
> +
> +		if (seen) {
> +			u64 delta = sched_clock() - seen;
> +			if (delta < pv_sleepy_lock_interval_ns)
> +				return true;
> +			this_cpu_write(sleepy_lock_seen_clock, 0);
> +		}
> +	}
> +
> +	return false;
> +}
> +
> +static __always_inline int get_steal_spins(bool paravirt, bool remote, bool sleepy)

It seems like paravirt is implied by sleepy.

>  {
>  	if (remote) {
> -		return REMOTE_STEAL_SPINS;
> +		if (paravirt && sleepy)
> +			return REMOTE_STEAL_SPINS * pv_sleepy_lock_factor;
> +		else
> +			return REMOTE_STEAL_SPINS;
>  	} else {
> -		return STEAL_SPINS;
> +		if (paravirt && sleepy)
> +			return STEAL_SPINS * pv_sleepy_lock_factor;
> +		else
> +			return STEAL_SPINS;
>  	}
>  }

I think that separate functions would still be nicer but this could get rid of
the nesting conditionals like


	int spins;
	if (remote)
		spins = REMOTE_STEAL_SPINS;
	else
		spins = STEAL_SPINS;

	if (sleepy)
		return spins * pv_sleepy_lock_factor;
	return spins;

>  
> -static __always_inline int get_head_spins(bool paravirt)
> +static __always_inline int get_head_spins(bool paravirt, bool sleepy)
>  {
> -	return HEAD_SPINS;
> +	if (paravirt && sleepy)
> +		return HEAD_SPINS * pv_sleepy_lock_factor;
> +	else
> +		return HEAD_SPINS;
>  }
>  
>  static inline u32 encode_tail_cpu(void)
> @@ -206,6 +237,60 @@ static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
>  	return prev;
>  }
>  
> +static __always_inline bool lock_try_set_sleepy(struct qspinlock *lock, u32 old)
> +{
> +	u32 prev;
> +	u32 new = old | _Q_SLEEPY_VAL;
> +
> +	BUG_ON(!(old & _Q_LOCKED_VAL));
> +	BUG_ON(old & _Q_SLEEPY_VAL);
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1		# lock_try_set_sleepy			\n"
> +"	cmpw	0,%0,%2							\n"
> +"	bne-	2f							\n"
> +"	stwcx.	%3,0,%1							\n"
> +"	bne-	1b							\n"
> +"2:									\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r"(old), "r" (new)
> +	: "cr0", "memory");
> +
> +	if (prev == old)
> +		return true;
> +	return false;
> +}
> +
> +static __always_inline void seen_sleepy_owner(struct qspinlock *lock, u32 val)
> +{
> +	if (pv_sleepy_lock) {
> +		if (pv_sleepy_lock_interval_ns)
> +			this_cpu_write(sleepy_lock_seen_clock, sched_clock());
> +		if (!(val & _Q_SLEEPY_VAL))
> +			lock_try_set_sleepy(lock, val);
> +	}
> +}
> +
> +static __always_inline void seen_sleepy_lock(void)
> +{
> +	if (pv_sleepy_lock && pv_sleepy_lock_interval_ns)
> +		this_cpu_write(sleepy_lock_seen_clock, sched_clock());
> +}
> +
> +static __always_inline void seen_sleepy_node(struct qspinlock *lock)
> +{

If yield_to_prev() was made to take a raw val, that val could be passed to
seen_sleepy_node() and it would not need to get it by itself.

> +	if (pv_sleepy_lock) {
> +		u32 val = READ_ONCE(lock->val);
> +
> +		if (pv_sleepy_lock_interval_ns)
> +			this_cpu_write(sleepy_lock_seen_clock, sched_clock());
> +		if (val & _Q_LOCKED_VAL) {
> +			if (!(val & _Q_SLEEPY_VAL))
> +				lock_try_set_sleepy(lock, val);
> +		}
> +	}
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>  	int cpu = get_tail_cpu(val);
> @@ -244,6 +329,7 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  
>  	spin_end();
>  
> +	seen_sleepy_owner(lock, val);
>  	*preempted = true;
>  
>  	/*
> @@ -307,11 +393,13 @@ static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int
>  	}
>  }
>  
> -static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
> +static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt, bool *preempted)
>  {
>  	u32 yield_count;
>  	int yield_cpu;
>  
> +	*preempted = false;
> +
>  	if (!paravirt)
>  		goto relax;
>  
> @@ -332,6 +420,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  
>  	spin_end();
>  
> +	*preempted = true;
> +	seen_sleepy_node(lock);
> +
>  	smp_rmb();
>  
>  	if (yield_cpu == node->yield_cpu) {
> @@ -353,6 +444,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  
>  	spin_end();
>  
> +	*preempted = true;
> +	seen_sleepy_node(lock);
> +
>  	smp_rmb(); /* See yield_to_locked_owner comment */
>  
>  	if (!node->locked) {
> @@ -369,6 +463,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
>  {
> +	bool preempted;
> +	bool seen_preempted = false;
> +	bool sleepy = false;
>  	int iters = 0;
>  
>  	if (!STEAL_SPINS) {
> @@ -376,7 +473,6 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  			spin_begin();
>  			for (;;) {
>  				u32 val = READ_ONCE(lock->val);
> -				bool preempted;
>  
>  				if (val & _Q_MUST_Q_VAL)
>  					break;
> @@ -395,7 +491,6 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  	spin_begin();
>  	for (;;) {
>  		u32 val = READ_ONCE(lock->val);
> -		bool preempted;
>  
>  		if (val & _Q_MUST_Q_VAL)
>  			break;
> @@ -408,9 +503,29 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  			continue;
>  		}
>  
> +		if (paravirt && pv_sleepy_lock && !sleepy) {
> +			if (!sleepy) {


The enclosing conditional means this would always be true. I think the out conditional should be
if (paravirt && pv_sleepy_lock)
otherwise the pv_sleepy_lock_sticky part wouldn't work properly.


> +				if (val & _Q_SLEEPY_VAL) {
> +					seen_sleepy_lock();
> +					sleepy = true;
> +				} else if (recently_sleepy()) {
> +					sleepy = true;
> +				}
> +
> +			if (pv_sleepy_lock_sticky && seen_preempted &&
> +					!(val & _Q_SLEEPY_VAL)) {
> +				if (lock_try_set_sleepy(lock, val))
> +					val |= _Q_SLEEPY_VAL;
> +			}
> +
> +
>  		yield_to_locked_owner(lock, val, paravirt, &preempted);
> +		if (preempted)
> +			seen_preempted = true;

This could belong to the next if statement, there can not be !paravirt && preempted ?

>  
>  		if (paravirt && preempted) {
> +			sleepy = true;
> +
>  			if (!pv_spin_on_preempted_owner)
>  				iters++;
>  			/*
> @@ -425,14 +540,15 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  			iters++;
>  		}
>  
> -		if (iters >= get_steal_spins(paravirt, false))
> +		if (iters >= get_steal_spins(paravirt, false, sleepy))
>  			break;
> -		if (iters >= get_steal_spins(paravirt, true)) {
> +		if (iters >= get_steal_spins(paravirt, true, sleepy)) {
>  			int cpu = get_owner_cpu(val);
>  			if (numa_node_id() != cpu_to_node(cpu))
>  				break;
>  		}
>  	}
> +
>  	spin_end();
>  
>  	return false;
> @@ -443,6 +559,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	struct qnodes *qnodesp;
>  	struct qnode *next, *node;
>  	u32 val, old, tail;
> +	bool seen_preempted = false;
>  	int idx;
>  
>  	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> @@ -485,8 +602,13 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  
>  		/* Wait for mcs node lock to be released */
>  		spin_begin();
> -		while (!node->locked)
> -			yield_to_prev(lock, node, prev_cpu, paravirt);
> +		while (!node->locked) {
> +			bool preempted;
> +
> +			yield_to_prev(lock, node, prev_cpu, paravirt, &preempted);
> +			if (preempted)
> +				seen_preempted = true;
> +		}
>  		spin_end();
>  
>  		/* Clear out stale propagated yield_cpu */
> @@ -506,6 +628,8 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt, false, &preempted);
> +			if (preempted)
> +				seen_preempted = true;
>  		}
>  		spin_end();
>  
> @@ -521,27 +645,47 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	} else {
>  		int set_yield_cpu = -1;
>  		int iters = 0;
> +		bool sleepy = false;
>  		bool set_mustq = false;
> +		bool preempted;
>  
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> -			bool preempted;
> +			if (paravirt && pv_sleepy_lock) {
> +				if (!sleepy) {
> +					if (val & _Q_SLEEPY_VAL) {
> +						seen_sleepy_lock();
> +						sleepy = true;
> +					} else if (recently_sleepy()) {
> +						sleepy = true;
> +					}
> +				}
> +				if (pv_sleepy_lock_sticky && seen_preempted &&
> +						!(val & _Q_SLEEPY_VAL)) {
> +					if (lock_try_set_sleepy(lock, val))
> +						val |= _Q_SLEEPY_VAL;
> +				}
> +			}
>  
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt,
>  					pv_yield_allow_steal && set_mustq,
>  					&preempted);
> +			if (preempted)
> +				seen_preempted = true;
>  
>  			if (paravirt && preempted) {
> +				sleepy = true;
> +
>  				if (!pv_spin_on_preempted_owner)
>  					iters++;
>  			} else {
>  				iters++;
>  			}
>  
> -			if (!set_mustq && iters >= get_head_spins(paravirt)) {
> +			if (!set_mustq && iters >= get_head_spins(paravirt, sleepy)) {
>  				set_mustq = true;
>  				lock_set_mustq(lock);
>  				val |= _Q_MUST_Q_VAL;
> @@ -729,6 +873,70 @@ static int pv_spin_on_preempted_owner_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_spin_on_preempted_owner, pv_spin_on_preempted_owner_get, pv_spin_on_preempted_owner_set, "%llu\n");
>  
> +static int pv_sleepy_lock_set(void *data, u64 val)
> +{
> +	pv_sleepy_lock = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_sleepy_lock_get(void *data, u64 *val)
> +{
> +	*val = pv_sleepy_lock;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock, pv_sleepy_lock_get, pv_sleepy_lock_set, "%llu\n");
> +
> +static int pv_sleepy_lock_sticky_set(void *data, u64 val)
> +{
> +	pv_sleepy_lock_sticky = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_sleepy_lock_sticky_get(void *data, u64 *val)
> +{
> +	*val = pv_sleepy_lock_sticky;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_sticky, pv_sleepy_lock_sticky_get, pv_sleepy_lock_sticky_set, "%llu\n");
> +
> +static int pv_sleepy_lock_interval_ns_set(void *data, u64 val)
> +{
> +	pv_sleepy_lock_interval_ns = val;
> +
> +	return 0;
> +}
> +
> +static int pv_sleepy_lock_interval_ns_get(void *data, u64 *val)
> +{
> +	*val = pv_sleepy_lock_interval_ns;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_interval_ns, pv_sleepy_lock_interval_ns_get, pv_sleepy_lock_interval_ns_set, "%llu\n");
> +
> +static int pv_sleepy_lock_factor_set(void *data, u64 val)
> +{
> +	pv_sleepy_lock_factor = val;
> +
> +	return 0;
> +}
> +
> +static int pv_sleepy_lock_factor_get(void *data, u64 *val)
> +{
> +	*val = pv_sleepy_lock_factor;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_factor, pv_sleepy_lock_factor_get, pv_sleepy_lock_factor_set, "%llu\n");
> +
>  static int pv_yield_prev_set(void *data, u64 val)
>  {
>  	pv_yield_prev = !!val;
> @@ -786,6 +994,10 @@ static __init int spinlock_debugfs_init(void)
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
>  		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
>  		debugfs_create_file("qspl_pv_spin_on_preempted_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_spin_on_preempted_owner);
> +		debugfs_create_file("qspl_pv_sleepy_lock", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock);
> +		debugfs_create_file("qspl_pv_sleepy_lock_sticky", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_sticky);
> +		debugfs_create_file("qspl_pv_sleepy_lock_interval_ns", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_interval_ns);
> +		debugfs_create_file("qspl_pv_sleepy_lock_factor", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_factor);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>  		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
>  		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner
  2022-07-28  6:31 ` [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner Nicholas Piggin
  2022-08-12  4:49   ` Jordan Niethe
@ 2022-09-22 15:02   ` Laurent Dufour
  2022-09-23  8:16     ` Nicholas Piggin
  2022-11-10  0:44   ` Jordan Niethe
  2 siblings, 1 reply; 78+ messages in thread
From: Laurent Dufour @ 2022-09-22 15:02 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On 28/07/2022 08:31:19, Nicholas Piggin wrote:
> Provide an option that holds off queueing indefinitely while the lock
> owner is preempted. This could reduce queueing latencies for very
> overcommitted vcpu situations.
> 
> This is disabled by default.

Hi Nick,

I should have missed something here.

If this option is turned on, CPU trying to lock when there is a preempted
owner will spin checking the lock->val and yielding the lock owner CPU.
Am I right?

If yes, why not being queued and spin checking its own value, yielding
against the lock owner CPU? This will generate less cache bouncing, which
is what the queued spinlock is trying to address, isn't it?

Thanks,
Laurent.

> ---
>  arch/powerpc/lib/qspinlock.c | 91 +++++++++++++++++++++++++++++++-----
>  1 file changed, 79 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 24f68bd71e2b..5cfd69931e31 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -35,6 +35,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
> +static bool pv_spin_on_preempted_owner __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
>  static bool pv_prod_head __read_mostly = false;
> @@ -220,13 +221,15 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  	BUG();
>  }
>  
> -static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
>  	int owner;
>  	u32 yield_count;
>  
>  	BUG_ON(!(val & _Q_LOCKED_VAL));
>  
> +	*preempted = false;
> +
>  	if (!paravirt)
>  		goto relax;
>  
> @@ -241,6 +244,8 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  
>  	spin_end();
>  
> +	*preempted = true;
> +
>  	/*
>  	 * Read the lock word after sampling the yield count. On the other side
>  	 * there may a wmb because the yield count update is done by the
> @@ -265,14 +270,14 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  	spin_cpu_relax();
>  }
>  
> -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool *preempted)
>  {
> -	__yield_to_locked_owner(lock, val, paravirt, false);
> +	__yield_to_locked_owner(lock, val, paravirt, false, preempted);
>  }
>  
> -static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
> -	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> +	__yield_to_locked_owner(lock, val, paravirt, clear_mustq, preempted);
>  }
>  
>  static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
> @@ -364,12 +369,33 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
>  {
> -	int iters;
> +	int iters = 0;
> +
> +	if (!STEAL_SPINS) {
> +		if (paravirt && pv_spin_on_preempted_owner) {
> +			spin_begin();
> +			for (;;) {
> +				u32 val = READ_ONCE(lock->val);
> +				bool preempted;
> +
> +				if (val & _Q_MUST_Q_VAL)
> +					break;
> +				if (!(val & _Q_LOCKED_VAL))
> +					break;
> +				if (!vcpu_is_preempted(get_owner_cpu(val)))
> +					break;
> +				yield_to_locked_owner(lock, val, paravirt, &preempted);
> +			}
> +			spin_end();
> +		}
> +		return false;
> +	}
>  
>  	/* Attempt to steal the lock */
>  	spin_begin();
>  	for (;;) {
>  		u32 val = READ_ONCE(lock->val);
> +		bool preempted;
>  
>  		if (val & _Q_MUST_Q_VAL)
>  			break;
> @@ -382,9 +408,22 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  			continue;
>  		}
>  
> -		yield_to_locked_owner(lock, val, paravirt);
> -
> -		iters++;
> +		yield_to_locked_owner(lock, val, paravirt, &preempted);
> +
> +		if (paravirt && preempted) {
> +			if (!pv_spin_on_preempted_owner)
> +				iters++;
> +			/*
> +			 * pv_spin_on_preempted_owner don't increase iters
> +			 * while the owner is preempted -- we won't interfere
> +			 * with it by definition. This could introduce some
> +			 * latency issue if we continually observe preempted
> +			 * owners, but hopefully that's a rare corner case of
> +			 * a badly oversubscribed system.
> +			 */
> +		} else {
> +			iters++;
> +		}
>  
>  		if (iters >= get_steal_spins(paravirt, false))
>  			break;
> @@ -463,8 +502,10 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			bool preempted;
> +
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
> -			yield_head_to_locked_owner(lock, val, paravirt, false);
> +			yield_head_to_locked_owner(lock, val, paravirt, false, &preempted);
>  		}
>  		spin_end();
>  
> @@ -486,11 +527,20 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			bool preempted;
> +
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt,
> -					pv_yield_allow_steal && set_mustq);
> +					pv_yield_allow_steal && set_mustq,
> +					&preempted);
> +
> +			if (paravirt && preempted) {
> +				if (!pv_spin_on_preempted_owner)
> +					iters++;
> +			} else {
> +				iters++;
> +			}
>  
> -			iters++;
>  			if (!set_mustq && iters >= get_head_spins(paravirt)) {
>  				set_mustq = true;
>  				lock_set_mustq(lock);
> @@ -663,6 +713,22 @@ static int pv_yield_allow_steal_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_allow_steal, pv_yield_allow_steal_get, pv_yield_allow_steal_set, "%llu\n");
>  
> +static int pv_spin_on_preempted_owner_set(void *data, u64 val)
> +{
> +	pv_spin_on_preempted_owner = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_spin_on_preempted_owner_get(void *data, u64 *val)
> +{
> +	*val = pv_spin_on_preempted_owner;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_spin_on_preempted_owner, pv_spin_on_preempted_owner_get, pv_spin_on_preempted_owner_set, "%llu\n");
> +
>  static int pv_yield_prev_set(void *data, u64 val)
>  {
>  	pv_yield_prev = !!val;
> @@ -719,6 +785,7 @@ static __init int spinlock_debugfs_init(void)
>  	if (is_shared_processor()) {
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
>  		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
> +		debugfs_create_file("qspl_pv_spin_on_preempted_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_spin_on_preempted_owner);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>  		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
>  		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner
  2022-09-22 15:02   ` Laurent Dufour
@ 2022-09-23  8:16     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-09-23  8:16 UTC (permalink / raw)
  To: Laurent Dufour, linuxppc-dev

On Fri Sep 23, 2022 at 1:02 AM AEST, Laurent Dufour wrote:
> On 28/07/2022 08:31:19, Nicholas Piggin wrote:
> > Provide an option that holds off queueing indefinitely while the lock
> > owner is preempted. This could reduce queueing latencies for very
> > overcommitted vcpu situations.
> > 
> > This is disabled by default.
>
> Hi Nick,
>
> I should have missed something here.
>
> If this option is turned on, CPU trying to lock when there is a preempted
> owner will spin checking the lock->val and yielding the lock owner CPU.
> Am I right?

Yes.

> If yes, why not being queued and spin checking its own value, yielding
> against the lock owner CPU?

I guess the idea is that when we start getting vCPU preemption, queueing
behaviour causes this "train wreck" behaviour where lock waiters being
preempted can halt lock transfers to other waiters (whereas with simple
spinlocks only owner vCPU preemption matters). So the heuristics for
paravirt qspinlock basically come down to avoiding queueing and making
waiters behave more like a simple spinlock when it matters. That's the
case for upstream and this rewrite.

> This will generate less cache bouncing, which
> is what the queued spinlock is trying to address, isn't it?

It could. When the owner is preempted it's not going to be modifying
the lock word and probably not surrounding data in the same cache
line, and there won't be a lot of other try-lock operations come in
(because they'll mostly queue up here as well). So cacheline bouncing
shouldn't be the worst problem we face here. But it possibly is a
concern.

I didn't yet meausre any real improvement from this option, and it
possibly has some starvation potential, so it's disabled by default for
now.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue
  2022-07-28  6:31 ` [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
  2022-08-12  4:17   ` Jordan Niethe
@ 2022-10-06 17:27   ` Laurent Dufour
  2022-11-10  0:42   ` Jordan Niethe
  2 siblings, 0 replies; 78+ messages in thread
From: Laurent Dufour @ 2022-10-06 17:27 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On 28/07/2022 08:31:14, Nicholas Piggin wrote:
> Having all CPUs poll the lock word for the owner CPU that should be
> yielded to defeats most of the purpose of using MCS queueing for
> scalability. Yet it may be desirable for queued waiters to to yield
> to a preempted owner.
> 
> s390 addreses this problem by having queued waiters sample the lock
> word to find the owner much less frequently. In this approach, the
> waiters never sample it directly, but the queue head propagates the
> owner CPU back to the next waiter if it ever finds the owner has
> been preempted. Queued waiters then subsequently propagate the owner
> CPU back to the next waiter, and so on.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 85 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 94f007f66942..28c85a2d5635 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -12,6 +12,7 @@
>  struct qnode {
>  	struct qnode	*next;
>  	struct qspinlock *lock;
> +	int		yield_cpu;
>  	u8		locked; /* 1 if lock acquired */
>  };
>  
> @@ -28,6 +29,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
> +static bool pv_yield_propagate_owner __read_mostly = true;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -257,13 +259,66 @@ static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u
>  	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
>  }
>  
> +static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
> +{
> +	struct qnode *next;
> +	int owner;
> +
> +	if (!paravirt)
> +		return;
> +	if (!pv_yield_propagate_owner)
> +		return;
> +
> +	owner = get_owner_cpu(val);
> +	if (*set_yield_cpu == owner)
> +		return;
> +
> +	next = READ_ONCE(node->next);
> +	if (!next)
> +		return;
> +
> +	if (vcpu_is_preempted(owner)) {
> +		next->yield_cpu = owner;
> +		*set_yield_cpu = owner;
> +	} else if (*set_yield_cpu != -1) {
> +		next->yield_cpu = owner;
> +		*set_yield_cpu = owner;
> +	}

This is bit confusing, the else branch is the same as the true one.
This might be written like this:

	if (vcpu_is_preempted(owner) || *set_yield_cpu != -1) {
		next->yield_cpu = owner;
		*set_yield_cpu = owner;
	}

> +}
> +
>  static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
>  {
>  	u32 yield_count;
> +	int yield_cpu;
>  
>  	if (!paravirt)
>  		goto relax;
>  
> +	if (!pv_yield_propagate_owner)
> +		goto yield_prev;
> +
> +	yield_cpu = READ_ONCE(node->yield_cpu);
> +	if (yield_cpu == -1) {
> +		/* Propagate back the -1 CPU */
> +		if (node->next && node->next->yield_cpu != -1)
> +			node->next->yield_cpu = yield_cpu;
> +		goto yield_prev;
> +	}
> +
> +	yield_count = yield_count_of(yield_cpu);
> +	if ((yield_count & 1) == 0)
> +		goto yield_prev; /* owner vcpu is running */
> +
> +	smp_rmb();
> +
> +	if (yield_cpu == node->yield_cpu) {
> +		if (node->next && node->next->yield_cpu != yield_cpu)
> +			node->next->yield_cpu = yield_cpu;
> +		yield_to_preempted(yield_cpu, yield_count);
> +		return;
> +	}
> +

In the case that test is false, this means that the lock owner has probably
changed, why are we yeilding to the previous node instead of reading again
the node->yield_cpu, checking against -1 value etc..?
Yielding to the previous node is valid, but it might be better to yield to
the owner, isn't it?

> +yield_prev:
>  	if (!pv_yield_prev)
>  		goto relax;
>  
> @@ -337,6 +392,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	node = &qnodesp->nodes[idx];
>  	node->next = NULL;
>  	node->lock = lock;
> +	node->yield_cpu = -1;
>  	node->locked = 0;
>  
>  	tail = encode_tail_cpu();
> @@ -358,13 +414,21 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		while (!node->locked)
>  			yield_to_prev(lock, node, prev_cpu, paravirt);
>  
> +		/* Clear out stale propagated yield_cpu */
> +		if (paravirt && pv_yield_propagate_owner && node->yield_cpu != -1)
> +			node->yield_cpu = -1;

Why doing tests and not directly setting node->yield_cpu to -1?
Is the write operation more costly than the 3 tests?

> +
>  		smp_rmb(); /* acquire barrier for the mcs lock */
>  	}
>  
>  	if (!MAYBE_STEALERS) {
> +		int set_yield_cpu = -1;
> +
>  		/* We're at the head of the waitqueue, wait for the lock. */
> -		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> +		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt, false);
> +		}
>  
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -376,12 +440,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		/* We must be the owner, just set the lock bit and acquire */
>  		lock_set_locked(lock);
>  	} else {
> +		int set_yield_cpu = -1;
>  		int iters = 0;
>  		bool set_mustq = false;
>  
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt,
>  					pv_yield_allow_steal && set_mustq);
>  
> @@ -540,6 +606,22 @@ static int pv_yield_prev_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, pv_yield_prev_set, "%llu\n");
>  
> +static int pv_yield_propagate_owner_set(void *data, u64 val)
> +{
> +	pv_yield_propagate_owner = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_yield_propagate_owner_get(void *data, u64 *val)
> +{
> +	*val = pv_yield_propagate_owner;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
> @@ -548,6 +630,7 @@ static __init int spinlock_debugfs_init(void)
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
>  		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
> +		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
>  	}
>  
>  	return 0;


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
  2022-07-28  6:31 ` [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation Nicholas Piggin
  2022-08-10  1:52   ` Jordan NIethe
@ 2022-11-10  0:35   ` Jordan Niethe
  2022-11-10  6:37     ` Christophe Leroy
  2022-11-10  9:09     ` Nicholas Piggin
  1 sibling, 2 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:35 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
<snip>
> -#define queued_spin_lock queued_spin_lock
>  
> -static inline void queued_spin_unlock(struct qspinlock *lock)
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> -		smp_store_release(&lock->locked, 0);
> -	else
> -		__pv_queued_spin_unlock(lock);
> +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> +		return 1;
> +	return 0;

optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);

[resend as utf-8, not utf-7]


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters
  2022-07-28  6:31 ` [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
  2022-08-10  2:28   ` Jordan NIethe
@ 2022-11-10  0:36   ` Jordan Niethe
  2022-11-10  9:21     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:36 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
<snip>
[resend as utf-8, not utf-7]
>  
> +/*
> + * Bitfields in the atomic value:
> + *
> + *     0: locked bit
> + * 16-31: tail cpu (+1)
> + */
> +#define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
> +				      << _Q_ ## type ## _OFFSET)
> +#define _Q_LOCKED_OFFSET	0
> +#define _Q_LOCKED_BITS		1
> +#define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
> +#define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
> +
> +#define _Q_TAIL_CPU_OFFSET	16
> +#define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
> +#define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
> +

Just to state the obvious this is:

#define _Q_LOCKED_OFFSET	0
#define _Q_LOCKED_BITS		1
#define _Q_LOCKED_MASK		0x00000001
#define _Q_LOCKED_VAL		1

#define _Q_TAIL_CPU_OFFSET	16
#define _Q_TAIL_CPU_BITS	16
#define _Q_TAIL_CPU_MASK	0xffff0000

> +#if CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)
> +#error "qspinlock does not support such large CONFIG_NR_CPUS"
> +#endif
> +
>  #endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 8dbce99a373c..5ebb88d95636 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -1,12 +1,172 @@
>  // SPDX-License-Identifier: GPL-2.0-or-later
> +#include <linux/atomic.h>
> +#include <linux/bug.h>
> +#include <linux/compiler.h>
>  #include <linux/export.h>
> -#include <linux/processor.h>
> +#include <linux/percpu.h>
> +#include <linux/smp.h>
>  #include <asm/qspinlock.h>
>  
> -void queued_spin_lock_slowpath(struct qspinlock *lock)
> +#define MAX_NODES	4
> +
> +struct qnode {
> +	struct qnode	*next;
> +	struct qspinlock *lock;
> +	u8		locked; /* 1 if lock acquired */
> +};
> +
> +struct qnodes {
> +	int		count;
> +	struct qnode nodes[MAX_NODES];
> +};

I think it could be worth commenting why qnodes::count instead _Q_TAIL_IDX_OFFSET.

> +
> +static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> +
> +static inline int encode_tail_cpu(void)

I think the generic version that takes smp_processor_id() as a parameter is clearer - at least with this function name.

> +{
> +	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> +}
> +
> +static inline int get_tail_cpu(int val)

It seems like there should be a "decode" function to pair up with the "encode" function.

> +{
> +	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
> +}
> +
> +/* Take the lock by setting the bit, no other CPUs may concurrently lock it. */

Does that comment mean it is not necessary to use an atomic_or here?

> +static __always_inline void lock_set_locked(struct qspinlock *lock)

nit: could just be called set_locked()

> +{
> +	atomic_or(_Q_LOCKED_VAL, &lock->val);
> +	__atomic_acquire_fence();
> +}
> +
> +/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
> +static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int val)
> +{
> +	int newval = _Q_LOCKED_VAL;
> +
> +	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
> +		return 1;
> +	else
> +		return 0;

same optional style nit: return (atomic_cmpxchg_acquire(&lock->val, val, newval) == val);

> +}
> +
> +/*
> + * Publish our tail, replacing previous tail. Return previous value.
> + *
> + * This provides a release barrier for publishing node, and an acquire barrier
> + * for getting the old node.
> + */
> +static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)

Did you change from the xchg_tail() name in the generic version because of the release and acquire barriers this provides?
Does "publish" generally imply the old value will be returned?

>  {
> -	while (!queued_spin_trylock(lock))
> +	for (;;) {
> +		int val = atomic_read(&lock->val);
> +		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
> +		int old;
> +
> +		old = atomic_cmpxchg(&lock->val, val, newval);
> +		if (old == val)
> +			return old;
> +	}
> +}
> +
> +static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
> +{
> +	int cpu = get_tail_cpu(val);
> +	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
> +	int idx;
> +
> +	for (idx = 0; idx < MAX_NODES; idx++) {
> +		struct qnode *qnode = &qnodesp->nodes[idx];
> +		if (qnode->lock == lock)
> +			return qnode;
> +	}

In case anyone else is confused by this, Nick explained each cpu can only queue on a unique spinlock once regardless of "idx" level.

> +
> +	BUG();
> +}
> +
> +static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> +{
> +	struct qnodes *qnodesp;
> +	struct qnode *next, *node;
> +	int val, old, tail;
> +	int idx;
> +
> +	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> +
> +	qnodesp = this_cpu_ptr(&qnodes);
> +	if (unlikely(qnodesp->count == MAX_NODES)) {

The comparison is >= in the generic, I guess we've no nested NMI so this is safe?

> +		while (!queued_spin_trylock(lock))
> +			cpu_relax();
> +		return;
> +	}
> +
> +	idx = qnodesp->count++;
> +	/*
> +	 * Ensure that we increment the head node->count before initialising
> +	 * the actual node. If the compiler is kind enough to reorder these
> +	 * stores, then an IRQ could overwrite our assignments.
> +	 */
> +	barrier();
> +	node = &qnodesp->nodes[idx];
> +	node->next = NULL;
> +	node->lock = lock;
> +	node->locked = 0;
> +
> +	tail = encode_tail_cpu();
> +
> +	old = publish_tail_cpu(lock, tail);
> +
> +	/*
> +	 * If there was a previous node; link it and wait until reaching the
> +	 * head of the waitqueue.
> +	 */
> +	if (old & _Q_TAIL_CPU_MASK) {
> +		struct qnode *prev = get_tail_qnode(lock, old);
> +
> +		/* Link @node into the waitqueue. */
> +		WRITE_ONCE(prev->next, node);
> +
> +		/* Wait for mcs node lock to be released */
> +		while (!node->locked)
> +			cpu_relax();
> +
> +		smp_rmb(); /* acquire barrier for the mcs lock */
> +	}
> +
> +	/* We're at the head of the waitqueue, wait for the lock. */
> +	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_VAL)
> +		cpu_relax();
> +
> +	/* If we're the last queued, must clean up the tail. */
> +	if ((val & _Q_TAIL_CPU_MASK) == tail) {
> +		if (trylock_clear_tail_cpu(lock, val))
> +			goto release;
> +		/* Another waiter must have enqueued */
> +	}
> +
> +	/* We must be the owner, just set the lock bit and acquire */
> +	lock_set_locked(lock);
> +
> +	/* contended path; must wait for next != NULL (MCS protocol) */
> +	while (!(next = READ_ONCE(node->next)))
>  		cpu_relax();
> +
> +	/*
> +	 * Unlock the next mcs waiter node. Release barrier is not required
> +	 * here because the acquirer is only accessing the lock word, and
> +	 * the acquire barrier we took the lock with orders that update vs
> +	 * this store to locked. The corresponding barrier is the smp_rmb()
> +	 * acquire barrier for mcs lock, above.
> +	 */
> +	WRITE_ONCE(next->locked, 1);
> +
> +release:
> +	qnodesp->count--; /* release the node */
> +}
> +
> +void queued_spin_lock_slowpath(struct qspinlock *lock)
> +{
> +	queued_spin_lock_mcs_queue(lock);
>  }
>  EXPORT_SYMBOL(queued_spin_lock_slowpath);
>  


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.
  2022-07-28  6:31 ` [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx Nicholas Piggin
  2022-08-10  3:28   ` Jordan Niethe
@ 2022-11-10  0:39   ` Jordan Niethe
  2022-11-10  9:25     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:39 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> The first 16 bits of the lock are only modified by the owner, and other
> modifications always use atomic operations on the entire 32 bits, so
> unlocks can use plain stores on the 16 bits. This is the same kind of
> optimisation done by core qspinlock code.
> ---
>  arch/powerpc/include/asm/qspinlock.h       |  6 +-----
>  arch/powerpc/include/asm/qspinlock_types.h | 19 +++++++++++++++++--
>  2 files changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> index f06117aa60e1..79a1936fb68d 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -38,11 +38,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
>  
>  static inline void queued_spin_unlock(struct qspinlock *lock)
>  {
> -	for (;;) {
> -		int val = atomic_read(&lock->val);
> -		if (atomic_cmpxchg_release(&lock->val, val, val & ~_Q_LOCKED_VAL) == val)
> -			return;
> -	}
> +	smp_store_release(&lock->locked, 0);

Is it also possible for lock_set_locked() to use a non-atomic acquire
operation?

>  }
>  
>  #define arch_spin_is_locked(l)		queued_spin_is_locked(l)
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> index 9630e714c70d..3425dab42576 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -3,12 +3,27 @@
>  #define _ASM_POWERPC_QSPINLOCK_TYPES_H
>  
>  #include <linux/types.h>
> +#include <asm/byteorder.h>
>  
>  typedef struct qspinlock {
> -	atomic_t val;
> +	union {
> +		atomic_t val;
> +
> +#ifdef __LITTLE_ENDIAN
> +		struct {
> +			u16	locked;
> +			u8	reserved[2];
> +		};
> +#else
> +		struct {
> +			u8	reserved[2];
> +			u16	locked;
> +		};
> +#endif
> +	};
>  } arch_spinlock_t;

Just to double check we have:

#define _Q_LOCKED_OFFSET	0
#define _Q_LOCKED_BITS		1
#define _Q_LOCKED_MASK		0x00000001
#define _Q_LOCKED_VAL		1

#define _Q_TAIL_CPU_OFFSET	16
#define _Q_TAIL_CPU_BITS	16
#define _Q_TAIL_CPU_MASK	0xffff0000


so the ordering here looks correct.

>  
> -#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
> +#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
>  
>  /*
>   * Bitfields in the atomic value:


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly
  2022-07-28  6:31 ` [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly Nicholas Piggin
  2022-08-10  3:54   ` Jordan Niethe
@ 2022-11-10  0:39   ` Jordan Niethe
  2022-11-10  8:36     ` Christophe Leroy
  2022-11-10  9:40     ` Nicholas Piggin
  1 sibling, 2 replies; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:39 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> This uses more optimal ll/sc style access patterns (rather than
> cmpxchg), and also sets the EH=1 lock hint on those operations
> which acquire ownership of the lock.
> ---
>  arch/powerpc/include/asm/qspinlock.h       | 25 +++++--
>  arch/powerpc/include/asm/qspinlock_types.h |  6 +-
>  arch/powerpc/lib/qspinlock.c               | 81 +++++++++++++++-------
>  3 files changed, 79 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> index 79a1936fb68d..3ab354159e5e 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -2,28 +2,43 @@
>  #ifndef _ASM_POWERPC_QSPINLOCK_H
>  #define _ASM_POWERPC_QSPINLOCK_H
>  
> -#include <linux/atomic.h>
>  #include <linux/compiler.h>
>  #include <asm/qspinlock_types.h>
>  
>  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>  {
> -	return atomic_read(&lock->val);
> +	return READ_ONCE(lock->val);
>  }
>  
>  static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
>  {
> -	return !atomic_read(&lock.val);
> +	return !lock.val;
>  }
>  
>  static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
>  {
> -	return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
> +	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
>  }
>  
>  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> -	if (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0)
> +	u32 new = _Q_LOCKED_VAL;
> +	u32 prev;
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"
> +"	cmpwi	0,%0,0							\n"
> +"	bne-	2f							\n"
> +"	stwcx.	%2,0,%1							\n"
> +"	bne-	1b							\n"
> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +"2:									\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r" (new),
> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)

btw IS_ENABLED() already returns 1 or 0

> +	: "cr0", "memory");

This is the ISA's "test and set" atomic primitive. Do you think it would be worth seperating it as a helper?

> +
> +	if (likely(prev == 0))
>  		return 1;
>  	return 0;

same optional style nit: return likely(prev == 0);

>  }
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> index 3425dab42576..210adf05b235 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -7,7 +7,7 @@
>  
>  typedef struct qspinlock {
>  	union {
> -		atomic_t val;
> +		u32 val;
>  
>  #ifdef __LITTLE_ENDIAN
>  		struct {
> @@ -23,10 +23,10 @@ typedef struct qspinlock {
>  	};
>  } arch_spinlock_t;
>  
> -#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
> +#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = 0 } }
>  
>  /*
> - * Bitfields in the atomic value:
> + * Bitfields in the lock word:
>   *
>   *     0: locked bit
>   * 16-31: tail cpu (+1)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 5ebb88d95636..7c71e5e287df 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -1,5 +1,4 @@
>  // SPDX-License-Identifier: GPL-2.0-or-later
> -#include <linux/atomic.h>
>  #include <linux/bug.h>
>  #include <linux/compiler.h>
>  #include <linux/export.h>
> @@ -22,32 +21,59 @@ struct qnodes {
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static inline int encode_tail_cpu(void)
> +static inline u32 encode_tail_cpu(void)
>  {
>  	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
>  }
>  
> -static inline int get_tail_cpu(int val)
> +static inline int get_tail_cpu(u32 val)
>  {
>  	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
>  }
>  
>  /* Take the lock by setting the bit, no other CPUs may concurrently lock it. */

I think you missed deleting the above line.

> +/* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> -	atomic_or(_Q_LOCKED_VAL, &lock->val);
> -	__atomic_acquire_fence();
> +	u32 new = _Q_LOCKED_VAL;
> +	u32 prev;
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1,%3	# lock_set_locked			\n"
> +"	or	%0,%0,%2						\n"
> +"	stwcx.	%0,0,%1							\n"
> +"	bne-	1b							\n"
> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r" (new),
> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> +	: "cr0", "memory");
>  }

This is pretty similar with the DEFINE_TESTOP() pattern from
arch/powerpc/include/asm/bitops.h (such as test_and_set_bits_lock()) except for
word instead of double word. Do you think it's possible / beneficial to make
use of those macros?


>  
> -/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
> -static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int val)
> +/* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
> +static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
>  {
> -	int newval = _Q_LOCKED_VAL;
> -
> -	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
> +	u32 new = _Q_LOCKED_VAL;
> +	u32 prev;
> +
> +	BUG_ON(old & _Q_LOCKED_VAL);

The BUG_ON() could have been introduced in an earlier patch I think.

> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1,%4	# trylock_clear_tail_cpu		\n"
> +"	cmpw	0,%0,%2							\n"
> +"	bne-	2f							\n"
> +"	stwcx.	%3,0,%1							\n"
> +"	bne-	1b							\n"
> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +"2:									\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r"(old), "r" (new),

Could this be like  "r"(_Q_TAIL_CPU_MASK) below?
i.e. "r" (_Q_LOCKED_VAL)? Makes it clear new doesn't change.

> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> +	: "cr0", "memory");
> +
> +	if (likely(prev == old))
>  		return 1;
> -	else
> -		return 0;
> +	return 0;
>  }
>  
>  /*
> @@ -56,20 +82,25 @@ static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int va
>   * This provides a release barrier for publishing node, and an acquire barrier

Does the comment mean there needs to be an acquire barrier in this assembly?


>   * for getting the old node.
>   */
> -static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)
> +static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
>  {
> -	for (;;) {
> -		int val = atomic_read(&lock->val);
> -		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
> -		int old;
> -
> -		old = atomic_cmpxchg(&lock->val, val, newval);
> -		if (old == val)
> -			return old;
> -	}
> +	u32 prev, tmp;
> +
> +	asm volatile(
> +"\t"	PPC_RELEASE_BARRIER "						\n"
> +"1:	lwarx	%0,0,%2		# publish_tail_cpu			\n"
> +"	andc	%1,%0,%4						\n"
> +"	or	%1,%1,%3						\n"
> +"	stwcx.	%1,0,%2							\n"
> +"	bne-	1b							\n"
> +	: "=&r" (prev), "=&r"(tmp)
> +	: "r" (&lock->val), "r" (tail), "r"(_Q_TAIL_CPU_MASK)
> +	: "cr0", "memory");
> +
> +	return prev;
>  }
>  
> -static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
> +static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>  	int cpu = get_tail_cpu(val);
>  	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
> @@ -88,7 +119,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  {
>  	struct qnodes *qnodesp;
>  	struct qnode *next, *node;
> -	int val, old, tail;
> +	u32 val, old, tail;
>  	int idx;
>  
>  	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> @@ -134,7 +165,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  	}
>  
>  	/* We're at the head of the waitqueue, wait for the lock. */
> -	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_VAL)
> +	while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
>  		cpu_relax();
>  
>  	/* If we're the last queued, must clean up the tail. */


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing
  2022-07-28  6:31 ` [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing Nicholas Piggin
  2022-08-10  4:31   ` Jordan Niethe
@ 2022-11-10  0:40   ` Jordan Niethe
  2022-11-10 10:54     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:40 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Allow new waiters a number of spins on the lock word before queueing,
> which particularly helps paravirt performance when physical CPUs are
> oversubscribed.
> ---
>  arch/powerpc/lib/qspinlock.c | 152 ++++++++++++++++++++++++++++++++---
>  1 file changed, 141 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 7c71e5e287df..1625cce714b2 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -19,8 +19,17 @@ struct qnodes {
>  	struct qnode nodes[MAX_NODES];
>  };
>  
> +/* Tuning parameters */
> +static int STEAL_SPINS __read_mostly = (1<<5);
> +static bool MAYBE_STEALERS __read_mostly = true;

I can understand why, but macro case variables can be a bit confusing.

> +
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> +static __always_inline int get_steal_spins(void)
> +{
> +	return STEAL_SPINS;
> +}
> +
>  static inline u32 encode_tail_cpu(void)
>  {
>  	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> @@ -76,6 +85,39 @@ static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 ol
>  	return 0;
>  }
>  
> +static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 old, u32 new)
> +{
> +	u32 prev;
> +
> +	BUG_ON(old & _Q_LOCKED_VAL);
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1,%4	# queued_spin_trylock_cmpxchg		\n"

s/queued_spin_trylock_cmpxchg/__trylock_cmpxchg/

btw what is the format you using for the '\n's in the inline asm?

> +"	cmpw	0,%0,%2							\n"
> +"	bne-	2f							\n"
> +"	stwcx.	%3,0,%1							\n"
> +"	bne-	1b							\n"
> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +"2:									\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r"(old), "r" (new),
> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> +	: "cr0", "memory");

This is very similar to trylock_clear_tail_cpu(). So maybe it is worth having
some form of "test and set" primitive helper.

> +
> +	return prev;
> +}
> +
> +/* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
> +static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 val)
> +{
> +	u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
> +
> +	if (__trylock_cmpxchg(lock, val, newval) == val)
> +		return 1;
> +	else
> +		return 0;

same optional style nit: return __trylock_cmpxchg(lock, val, newval) == val

> +}
> +
>  /*
>   * Publish our tail, replacing previous tail. Return previous value.
>   *
> @@ -115,6 +157,31 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  	BUG();
>  }
>  
> +static inline bool try_to_steal_lock(struct qspinlock *lock)
> +{
> +	int iters;
> +
> +	/* Attempt to steal the lock */
> +	for (;;) {
> +		u32 val = READ_ONCE(lock->val);
> +
> +		if (unlikely(!(val & _Q_LOCKED_VAL))) {
> +			if (trylock_with_tail_cpu(lock, val))
> +				return true;
> +			continue;
> +		}

The continue would bypass iters++/cpu_relax but the next time around
  if (unlikely(!(val & _Q_LOCKED_VAL))) {
should fail so everything should be fine?

> +
> +		cpu_relax();
> +
> +		iters++;
> +
> +		if (iters >= get_steal_spins())
> +			break;
> +	}
> +
> +	return false;
> +}
> +
>  static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  {
>  	struct qnodes *qnodesp;
> @@ -164,20 +231,39 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  		smp_rmb(); /* acquire barrier for the mcs lock */
>  	}
>  
> -	/* We're at the head of the waitqueue, wait for the lock. */
> -	while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> -		cpu_relax();
> +	if (!MAYBE_STEALERS) {
> +		/* We're at the head of the waitqueue, wait for the lock. */
> +		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> +			cpu_relax();
>  
> -	/* If we're the last queued, must clean up the tail. */
> -	if ((val & _Q_TAIL_CPU_MASK) == tail) {
> -		if (trylock_clear_tail_cpu(lock, val))
> -			goto release;
> -		/* Another waiter must have enqueued */
> -	}
> +		/* If we're the last queued, must clean up the tail. */
> +		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> +			if (trylock_clear_tail_cpu(lock, val))
> +				goto release;
> +			/* Another waiter must have enqueued. */
> +		}
> +
> +		/* We must be the owner, just set the lock bit and acquire */
> +		lock_set_locked(lock);
> +	} else {
> +again:
> +		/* We're at the head of the waitqueue, wait for the lock. */
> +		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> +			cpu_relax();
>  
> -	/* We must be the owner, just set the lock bit and acquire */
> -	lock_set_locked(lock);
> +		/* If we're the last queued, must clean up the tail. */
> +		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> +			if (trylock_clear_tail_cpu(lock, val))
> +				goto release;
> +			/* Another waiter must have enqueued, or lock stolen. */
> +		} else {
> +			if (trylock_with_tail_cpu(lock, val))
> +				goto unlock_next;
> +		}
> +		goto again;
> +	}
>  
> +unlock_next:
>  	/* contended path; must wait for next != NULL (MCS protocol) */
>  	while (!(next = READ_ONCE(node->next)))
>  		cpu_relax();
> @@ -197,6 +283,9 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  
>  void queued_spin_lock_slowpath(struct qspinlock *lock)
>  {
> +	if (try_to_steal_lock(lock))
> +		return;
> +
>  	queued_spin_lock_mcs_queue(lock);
>  }
>  EXPORT_SYMBOL(queued_spin_lock_slowpath);
> @@ -207,3 +296,44 @@ void pv_spinlocks_init(void)
>  }
>  #endif
>  
> +#include <linux/debugfs.h>
> +static int steal_spins_set(void *data, u64 val)
> +{
> +	static DEFINE_MUTEX(lock);

I just want to check if it would be possible to get rid of the MAYBE_STEALERS
variable completely and do something like:

  bool maybe_stealers() { return STEAL_SPINS > 0; }

I guess based on the below code it wouldn't work, but I'm still not quite sure
why that is.

> +
> +	mutex_lock(&lock);
> +	if (val && !STEAL_SPINS) {
> +		MAYBE_STEALERS = true;
> +		/* wait for waiter to go away */
> +		synchronize_rcu();
> +		STEAL_SPINS = val;
> +	} else if (!val && STEAL_SPINS) {
> +		STEAL_SPINS = val;
> +		/* wait for all possible stealers to go away */
> +		synchronize_rcu();
> +		MAYBE_STEALERS = false;
> +	} else {
> +		STEAL_SPINS = val;
> +	}
> +	mutex_unlock(&lock);

STEAL_SPINS is an int not a u64.

> +
> +	return 0;
> +}
> +
> +static int steal_spins_get(void *data, u64 *val)
> +{
> +	*val = STEAL_SPINS;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
> +
> +static __init int spinlock_debugfs_init(void)
> +{
> +	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
> +
> +	return 0;
> +}
> +device_initcall(spinlock_debugfs_init);
> +


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency
  2022-07-28  6:31 ` [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency Nicholas Piggin
  2022-08-10  5:51   ` Jordan Niethe
@ 2022-11-10  0:40   ` Jordan Niethe
  2022-11-10 10:57     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:40 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Give the queue head the ability to stop stealers. After a number of
> spins without sucessfully acquiring the lock, the queue head employs
> this, which will assure it is the next owner.
> ---
>  arch/powerpc/include/asm/qspinlock_types.h | 10 +++-
>  arch/powerpc/lib/qspinlock.c               | 56 +++++++++++++++++++++-
>  2 files changed, 63 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> index 210adf05b235..8b20f5e22bba 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -29,7 +29,8 @@ typedef struct qspinlock {
>   * Bitfields in the lock word:
>   *
>   *     0: locked bit
> - * 16-31: tail cpu (+1)
> + *    16: must queue bit
> + * 17-31: tail cpu (+1)
>   */
>  #define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
>  				      << _Q_ ## type ## _OFFSET)
> @@ -38,7 +39,12 @@ typedef struct qspinlock {
>  #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
>  #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
>  
> -#define _Q_TAIL_CPU_OFFSET	16
> +#define _Q_MUST_Q_OFFSET	16
> +#define _Q_MUST_Q_BITS		1
> +#define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
> +#define _Q_MUST_Q_VAL		(1U << _Q_MUST_Q_OFFSET)
> +
> +#define _Q_TAIL_CPU_OFFSET	17
>  #define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
>  #define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)

Not a big deal but some of these values could be calculated like in the
generic version. e.g.

	#define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET +_Q_LOCKED_BITS)

>  
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 1625cce714b2..a906cc8f15fa 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -22,6 +22,7 @@ struct qnodes {
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
>  static bool MAYBE_STEALERS __read_mostly = true;
> +static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -30,6 +31,11 @@ static __always_inline int get_steal_spins(void)
>  	return STEAL_SPINS;
>  }
>  
> +static __always_inline int get_head_spins(void)
> +{
> +	return HEAD_SPINS;
> +}
> +
>  static inline u32 encode_tail_cpu(void)
>  {
>  	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> @@ -142,6 +148,23 @@ static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
>  	return prev;
>  }
>  
> +static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
> +{
> +	u32 new = _Q_MUST_Q_VAL;
> +	u32 prev;
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1		# lock_set_mustq			\n"

Is the EH bit not set because we don't hold the lock here?

> +"	or	%0,%0,%2						\n"
> +"	stwcx.	%0,0,%1							\n"
> +"	bne-	1b							\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r" (new)
> +	: "cr0", "memory");

This is another usage close to the DEFINE_TESTOP() pattern.

> +
> +	return prev;
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>  	int cpu = get_tail_cpu(val);
> @@ -165,6 +188,9 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
>  	for (;;) {
>  		u32 val = READ_ONCE(lock->val);
>  
> +		if (val & _Q_MUST_Q_VAL)
> +			break;
> +
>  		if (unlikely(!(val & _Q_LOCKED_VAL))) {
>  			if (trylock_with_tail_cpu(lock, val))
>  				return true;
> @@ -246,11 +272,22 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  		/* We must be the owner, just set the lock bit and acquire */
>  		lock_set_locked(lock);
>  	} else {
> +		int iters = 0;
> +		bool set_mustq = false;
> +
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
> -		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> +		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
>  			cpu_relax();
>  
> +			iters++;

It seems instead of using set_mustq, (val & _Q_MUST_Q_VAL) could be checked?

> +			if (!set_mustq && iters >= get_head_spins()) {
> +				set_mustq = true;
> +				lock_set_mustq(lock);
> +				val |= _Q_MUST_Q_VAL;
> +			}
> +		}
> +
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
>  			if (trylock_clear_tail_cpu(lock, val))
> @@ -329,9 +366,26 @@ static int steal_spins_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
>  
> +static int head_spins_set(void *data, u64 val)
> +{
> +	HEAD_SPINS = val;
> +
> +	return 0;
> +}
> +
> +static int head_spins_get(void *data, u64 *val)
> +{
> +	*val = HEAD_SPINS;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_head_spins, head_spins_get, head_spins_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
> +	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
>  
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word
  2022-07-28  6:31 ` [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word Nicholas Piggin
  2022-08-12  0:50   ` Jordan Niethe
@ 2022-11-10  0:40   ` Jordan Niethe
  2022-11-10 10:59     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:40 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Store the owner CPU number in the lock word so it may be yielded to,
> as powerpc's paravirtualised simple spinlocks do.
> ---
>  arch/powerpc/include/asm/qspinlock.h       |  8 +++++++-
>  arch/powerpc/include/asm/qspinlock_types.h | 10 ++++++++++
>  arch/powerpc/lib/qspinlock.c               |  6 +++---
>  3 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> index 3ab354159e5e..44601b261e08 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -20,9 +20,15 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
>  	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
>  }
>  
> +static __always_inline u32 queued_spin_get_locked_val(void)

Maybe this function should have "encode" in the name to match with
encode_tail_cpu().


> +{
> +	/* XXX: make this use lock value in paca like simple spinlocks? */

Is that the paca's lock_token which is 0x8000?


> +	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
> +}
> +
>  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> -	u32 new = _Q_LOCKED_VAL;
> +	u32 new = queued_spin_get_locked_val();
>  	u32 prev;
>  
>  	asm volatile(
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> index 8b20f5e22bba..35f9525381e6 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -29,6 +29,8 @@ typedef struct qspinlock {
>   * Bitfields in the lock word:
>   *
>   *     0: locked bit
> + *  1-14: lock holder cpu
> + *    15: unused bit
>   *    16: must queue bit
>   * 17-31: tail cpu (+1)

So there is one more bit to store the tail cpu vs the lock holder cpu?

>   */
> @@ -39,6 +41,14 @@ typedef struct qspinlock {
>  #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
>  #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
>  
> +#define _Q_OWNER_CPU_OFFSET	1
> +#define _Q_OWNER_CPU_BITS	14
> +#define _Q_OWNER_CPU_MASK	_Q_SET_MASK(OWNER_CPU)
> +
> +#if CONFIG_NR_CPUS > (1U << _Q_OWNER_CPU_BITS)
> +#error "qspinlock does not support such large CONFIG_NR_CPUS"
> +#endif
> +
>  #define _Q_MUST_Q_OFFSET	16
>  #define _Q_MUST_Q_BITS		1
>  #define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index a906cc8f15fa..aa26cfe21f18 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -50,7 +50,7 @@ static inline int get_tail_cpu(u32 val)
>  /* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> -	u32 new = _Q_LOCKED_VAL;
> +	u32 new = queued_spin_get_locked_val();
>  	u32 prev;
>  
>  	asm volatile(
> @@ -68,7 +68,7 @@ static __always_inline void lock_set_locked(struct qspinlock *lock)
>  /* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
>  static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
>  {
> -	u32 new = _Q_LOCKED_VAL;
> +	u32 new = queued_spin_get_locked_val();
>  	u32 prev;
>  
>  	BUG_ON(old & _Q_LOCKED_VAL);
> @@ -116,7 +116,7 @@ static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 old, u3
>  /* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
>  static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 val)
>  {
> -	u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
> +	u32 newval = queued_spin_get_locked_val() | (val & _Q_TAIL_CPU_MASK);
>  
>  	if (__trylock_cmpxchg(lock, val, newval) == val)
>  		return 1;


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner
  2022-07-28  6:31 ` [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner Nicholas Piggin
  2022-08-12  2:01   ` Jordan Niethe
@ 2022-11-10  0:41   ` Jordan Niethe
  2022-11-10 11:13     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:41 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

 On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
 [resend as utf-8, not utf-7]
> Waiters spinning on the lock word should yield to the lock owner if the
> vCPU is preempted. This improves performance when the hypervisor has
> oversubscribed physical CPUs.
> ---
>  arch/powerpc/lib/qspinlock.c | 97 ++++++++++++++++++++++++++++++------
>  1 file changed, 83 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index aa26cfe21f18..55286ac91da5 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -5,6 +5,7 @@
>  #include <linux/percpu.h>
>  #include <linux/smp.h>
>  #include <asm/qspinlock.h>
> +#include <asm/paravirt.h>
>  
>  #define MAX_NODES	4
>  
> @@ -24,14 +25,16 @@ static int STEAL_SPINS __read_mostly = (1<<5);
>  static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
> +static bool pv_yield_owner __read_mostly = true;

Not macro case for these globals? To me name does not make it super clear this
is a boolean. What about pv_yield_owner_enabled?

> +
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static __always_inline int get_steal_spins(void)
> +static __always_inline int get_steal_spins(bool paravirt)
>  {
>  	return STEAL_SPINS;
>  }
>  
> -static __always_inline int get_head_spins(void)
> +static __always_inline int get_head_spins(bool paravirt)
>  {
>  	return HEAD_SPINS;
>  }
> @@ -46,7 +49,11 @@ static inline int get_tail_cpu(u32 val)
>  	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
>  }
>  
> -/* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
> +static inline int get_owner_cpu(u32 val)
> +{
> +	return (val & _Q_OWNER_CPU_MASK) >> _Q_OWNER_CPU_OFFSET;
> +}
> +
>  /* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> @@ -180,7 +187,45 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  	BUG();
>  }
>  
> -static inline bool try_to_steal_lock(struct qspinlock *lock)
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)

This name doesn't seem correct for the non paravirt case.

> +{
> +	int owner;
> +	u32 yield_count;
> +
> +	BUG_ON(!(val & _Q_LOCKED_VAL));
> +
> +	if (!paravirt)
> +		goto relax;
> +
> +	if (!pv_yield_owner)
> +		goto relax;
> +
> +	owner = get_owner_cpu(val);
> +	yield_count = yield_count_of(owner);
> +
> +	if ((yield_count & 1) == 0)
> +		goto relax; /* owner vcpu is running */

I wonder why not use vcpu_is_preempted()?

> +
> +	/*
> +	 * Read the lock word after sampling the yield count. On the other side
> +	 * there may a wmb because the yield count update is done by the
> +	 * hypervisor preemption and the value update by the OS, however this
> +	 * ordering might reduce the chance of out of order accesses and
> +	 * improve the heuristic.
> +	 */
> +	smp_rmb();
> +
> +	if (READ_ONCE(lock->val) == val) {
> +		yield_to_preempted(owner, yield_count);
> +		/* Don't relax if we yielded. Maybe we should? */
> +		return;
> +	}
> +relax:
> +	cpu_relax();
> +}
> +
> +
> +static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
>  {
>  	int iters;
>  
> @@ -197,18 +242,18 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
>  			continue;
>  		}
>  
> -		cpu_relax();
> +		yield_to_locked_owner(lock, val, paravirt);
>  
>  		iters++;
>  
> -		if (iters >= get_steal_spins())
> +		if (iters >= get_steal_spins(paravirt))
>  			break;
>  	}
>  
>  	return false;
>  }
>  
> -static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> +static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool paravirt)
>  {
>  	struct qnodes *qnodesp;
>  	struct qnode *next, *node;
> @@ -260,7 +305,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  	if (!MAYBE_STEALERS) {
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> -			cpu_relax();
> +			yield_to_locked_owner(lock, val, paravirt);
>  
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -278,10 +323,10 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> -			cpu_relax();
> +			yield_to_locked_owner(lock, val, paravirt);
>  
>  			iters++;
> -			if (!set_mustq && iters >= get_head_spins()) {
> +			if (!set_mustq && iters >= get_head_spins(paravirt)) {
>  				set_mustq = true;
>  				lock_set_mustq(lock);
>  				val |= _Q_MUST_Q_VAL;
> @@ -320,10 +365,15 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  
>  void queued_spin_lock_slowpath(struct qspinlock *lock)
>  {
> -	if (try_to_steal_lock(lock))
> -		return;
> -
> -	queued_spin_lock_mcs_queue(lock);
> +	if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor()) {
> +		if (try_to_steal_lock(lock, true))
> +			return;
> +		queued_spin_lock_mcs_queue(lock, true);
> +	} else {
> +		if (try_to_steal_lock(lock, false))
> +			return;
> +		queued_spin_lock_mcs_queue(lock, false);
> +	}
>  }

There is not really a need for a conditional: 

bool paravirt = IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
is_shared_processor();

if (try_to_steal_lock(lock, paravirt))
	return;

queued_spin_lock_mcs_queue(lock, paravirt);


The paravirt parameter used by the various functions seems always to be
equivalent to (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor()).
I wonder if it would be simpler testing (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor())
(using a helper function) in those functions instead passing it as a parameter?


>  EXPORT_SYMBOL(queued_spin_lock_slowpath);
>  
> @@ -382,10 +432,29 @@ static int head_spins_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_head_spins, head_spins_get, head_spins_set, "%llu\n");
>  
> +static int pv_yield_owner_set(void *data, u64 val)
> +{
> +	pv_yield_owner = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_yield_owner_get(void *data, u64 *val)
> +{
> +	*val = pv_yield_owner;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
>  	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
> +	if (is_shared_processor()) {
> +		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
> +	}
>  
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node
  2022-07-28  6:31 ` [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node Nicholas Piggin
  2022-08-12  2:07   ` Jordan Niethe
@ 2022-11-10  0:41   ` Jordan Niethe
  2022-11-10 11:14     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:41 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Queued waiters which are not at the head of the queue don't spin on
> the lock word but their qnode lock word, waiting for the previous queued
> CPU to release them. Add an option which allows these waiters to yield
> to the previous CPU if its vCPU is preempted.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 46 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 55286ac91da5..b39f8c5b329c 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> +static bool pv_yield_prev __read_mostly = true;

Similiar suggestion, maybe pv_yield_prev_enabled would read better.

Isn't this enabled by default contrary to the commit message?


>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -224,6 +225,31 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
>  	cpu_relax();
>  }
>  
> +static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)

yield_to_locked_owner() takes a raw val and works out the cpu to yield to.
I think for consistency have yield_to_prev() take the raw val and work it out too.

> +{
> +	u32 yield_count;
> +
> +	if (!paravirt)
> +		goto relax;
> +
> +	if (!pv_yield_prev)
> +		goto relax;
> +
> +	yield_count = yield_count_of(prev_cpu);
> +	if ((yield_count & 1) == 0)
> +		goto relax; /* owner vcpu is running */
> +
> +	smp_rmb(); /* See yield_to_locked_owner comment */
> +
> +	if (!node->locked) {
> +		yield_to_preempted(prev_cpu, yield_count);
> +		return;
> +	}
> +
> +relax:
> +	cpu_relax();
> +}
> +
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
>  {
> @@ -291,13 +317,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	 */
>  	if (old & _Q_TAIL_CPU_MASK) {
>  		struct qnode *prev = get_tail_qnode(lock, old);
> +		int prev_cpu = get_tail_cpu(old);

This could then be removed.

>  
>  		/* Link @node into the waitqueue. */
>  		WRITE_ONCE(prev->next, node);
>  
>  		/* Wait for mcs node lock to be released */
>  		while (!node->locked)
> -			cpu_relax();
> +			yield_to_prev(lock, node, prev_cpu, paravirt);

And would have this as:
			yield_to_prev(lock, node, old, paravirt);


>  
>  		smp_rmb(); /* acquire barrier for the mcs lock */
>  	}
> @@ -448,12 +475,29 @@ static int pv_yield_owner_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
>  
> +static int pv_yield_prev_set(void *data, u64 val)
> +{
> +	pv_yield_prev = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_yield_prev_get(void *data, u64 *val)
> +{
> +	*val = pv_yield_prev;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, pv_yield_prev_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
>  	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
>  	if (is_shared_processor()) {
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
> +		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>  	}
>  
>  	return 0;


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields
  2022-07-28  6:31 ` [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields Nicholas Piggin
  2022-08-12  4:06   ` Jordan Niethe
@ 2022-11-10  0:42   ` Jordan Niethe
  2022-11-10 11:22     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:42 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> If the head of queue is preventing stealing but it finds the owner vCPU
> is preempted, it will yield its cycles to the owner which could cause it
> to become preempted. Add an option to re-allow stealers before yielding,
> and disallow them again after returning from the yield.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 56 ++++++++++++++++++++++++++++++++++--
>  1 file changed, 53 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index b39f8c5b329c..94f007f66942 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> +static bool pv_yield_allow_steal __read_mostly = false;

To me this one does read as a boolean, but if you go with those other changes
I'd make it pv_yield_steal_enable to be consistent.

>  static bool pv_yield_prev __read_mostly = true;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> @@ -173,6 +174,23 @@ static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
>  	return prev;
>  }
>  
> +static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
> +{
> +	u32 new = _Q_MUST_Q_VAL;
> +	u32 prev;
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1		# lock_clear_mustq			\n"
> +"	andc	%0,%0,%2						\n"
> +"	stwcx.	%0,0,%1							\n"
> +"	bne-	1b							\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r" (new)
> +	: "cr0", "memory");
> +

This is pretty similar to the DEFINE_TESTOP() pattern again with the same llong caveat.


> +	return prev;
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>  	int cpu = get_tail_cpu(val);
> @@ -188,7 +206,7 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  	BUG();
>  }
>  
> -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)

 /* See yield_to_locked_owner comment */ comment needs to be updated now.


>  {
>  	int owner;
>  	u32 yield_count;
> @@ -217,7 +235,11 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
>  	smp_rmb();
>  
>  	if (READ_ONCE(lock->val) == val) {
> +		if (clear_mustq)
> +			lock_clear_mustq(lock);
>  		yield_to_preempted(owner, yield_count);
> +		if (clear_mustq)
> +			lock_set_mustq(lock);
>  		/* Don't relax if we yielded. Maybe we should? */
>  		return;
>  	}
> @@ -225,6 +247,16 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
>  	cpu_relax();
>  }
>  
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> +{
> +	__yield_to_locked_owner(lock, val, paravirt, false);
> +}
> +
> +static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
> +{

The check for pv_yield_allow_steal seems like it could go here instead of
being done by the caller.
__yield_to_locked_owner() checks for pv_yield_owner so it seems more
  consistent.



> +	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> +}
> +
>  static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
>  {
>  	u32 yield_count;
> @@ -332,7 +364,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	if (!MAYBE_STEALERS) {
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> -			yield_to_locked_owner(lock, val, paravirt);
> +			yield_head_to_locked_owner(lock, val, paravirt, false);
>  
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -350,7 +382,8 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> -			yield_to_locked_owner(lock, val, paravirt);
> +			yield_head_to_locked_owner(lock, val, paravirt,
> +					pv_yield_allow_steal && set_mustq);
>  
>  			iters++;
>  			if (!set_mustq && iters >= get_head_spins(paravirt)) {
> @@ -475,6 +508,22 @@ static int pv_yield_owner_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
>  
> +static int pv_yield_allow_steal_set(void *data, u64 val)
> +{
> +	pv_yield_allow_steal = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_yield_allow_steal_get(void *data, u64 *val)
> +{
> +	*val = pv_yield_allow_steal;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_allow_steal, pv_yield_allow_steal_get, pv_yield_allow_steal_set, "%llu\n");
> +
>  static int pv_yield_prev_set(void *data, u64 val)
>  {
>  	pv_yield_prev = !!val;
> @@ -497,6 +546,7 @@ static __init int spinlock_debugfs_init(void)
>  	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
>  	if (is_shared_processor()) {
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
> +		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>  	}
>  


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue
  2022-07-28  6:31 ` [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
  2022-08-12  4:17   ` Jordan Niethe
  2022-10-06 17:27   ` Laurent Dufour
@ 2022-11-10  0:42   ` Jordan Niethe
  2022-11-10 11:25     ` Nicholas Piggin
  2 siblings, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:42 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Having all CPUs poll the lock word for the owner CPU that should be
> yielded to defeats most of the purpose of using MCS queueing for
> scalability. Yet it may be desirable for queued waiters to to yield
> to a preempted owner.
> 
> s390 addreses this problem by having queued waiters sample the lock
> word to find the owner much less frequently. In this approach, the
> waiters never sample it directly, but the queue head propagates the
> owner CPU back to the next waiter if it ever finds the owner has
> been preempted. Queued waiters then subsequently propagate the owner
> CPU back to the next waiter, and so on.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 85 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 94f007f66942..28c85a2d5635 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -12,6 +12,7 @@
>  struct qnode {
>  	struct qnode	*next;
>  	struct qspinlock *lock;
> +	int		yield_cpu;
>  	u8		locked; /* 1 if lock acquired */
>  };
>  
> @@ -28,6 +29,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
> +static bool pv_yield_propagate_owner __read_mostly = true;

This also seems to be enabled by default.

>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -257,13 +259,66 @@ static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u
>  	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
>  }
>  
> +static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
> +{
> +	struct qnode *next;
> +	int owner;
> +
> +	if (!paravirt)
> +		return;
> +	if (!pv_yield_propagate_owner)
> +		return;
> +
> +	owner = get_owner_cpu(val);
> +	if (*set_yield_cpu == owner)
> +		return;
> +
> +	next = READ_ONCE(node->next);
> +	if (!next)
> +		return;
> +
> +	if (vcpu_is_preempted(owner)) {

Is there a difference about using vcpu_is_preempted() here
vs checking bit 0 in other places?


> +		next->yield_cpu = owner;
> +		*set_yield_cpu = owner;
> +	} else if (*set_yield_cpu != -1) {

It might be worth giving the -1 CPU a #define.

> +		next->yield_cpu = owner;
> +		*set_yield_cpu = owner;
> +	}
> +}

Does this need to pass set_yield_cpu by reference? Couldn't it's new value be
returned? To me it makes it more clear the function is used to change
set_yield_cpu. I think this would work:

int set_yield_cpu = -1;

static __always_inline int propagate_yield_cpu(struct qnode *node, u32 val, int set_yield_cpu, bool paravirt)
{
	struct qnode *next;
	int owner;

	if (!paravirt)
		goto out;
	if (!pv_yield_propagate_owner)
		goto out;

	owner = get_owner_cpu(val);
	if (set_yield_cpu == owner)
		goto out;

	next = READ_ONCE(node->next);
	if (!next)
		goto out;

	if (vcpu_is_preempted(owner)) {
		next->yield_cpu = owner;
		return owner;
	} else if (set_yield_cpu != -1) {
		next->yield_cpu = owner;
		return owner;
	}

out:
	return set_yield_cpu;
}

set_yield_cpu = propagate_yield_cpu(...  set_yield_cpu ...);



> +
>  static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
>  {
>  	u32 yield_count;
> +	int yield_cpu;
>  
>  	if (!paravirt)
>  		goto relax;
>  
> +	if (!pv_yield_propagate_owner)
> +		goto yield_prev;
> +
> +	yield_cpu = READ_ONCE(node->yield_cpu);
> +	if (yield_cpu == -1) {
> +		/* Propagate back the -1 CPU */
> +		if (node->next && node->next->yield_cpu != -1)
> +			node->next->yield_cpu = yield_cpu;
> +		goto yield_prev;
> +	}
> +
> +	yield_count = yield_count_of(yield_cpu);
> +	if ((yield_count & 1) == 0)
> +		goto yield_prev; /* owner vcpu is running */
> +
> +	smp_rmb();
> +
> +	if (yield_cpu == node->yield_cpu) {
> +		if (node->next && node->next->yield_cpu != yield_cpu)
> +			node->next->yield_cpu = yield_cpu;
> +		yield_to_preempted(yield_cpu, yield_count);
> +		return;
> +	}
> +
> +yield_prev:
>  	if (!pv_yield_prev)
>  		goto relax;
>  
> @@ -337,6 +392,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	node = &qnodesp->nodes[idx];
>  	node->next = NULL;
>  	node->lock = lock;
> +	node->yield_cpu = -1;
>  	node->locked = 0;
>  
>  	tail = encode_tail_cpu();
> @@ -358,13 +414,21 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		while (!node->locked)
>  			yield_to_prev(lock, node, prev_cpu, paravirt);
>  
> +		/* Clear out stale propagated yield_cpu */
> +		if (paravirt && pv_yield_propagate_owner && node->yield_cpu != -1)
> +			node->yield_cpu = -1;
> +
>  		smp_rmb(); /* acquire barrier for the mcs lock */
>  	}
>  
>  	if (!MAYBE_STEALERS) {
> +		int set_yield_cpu = -1;
> +
>  		/* We're at the head of the waitqueue, wait for the lock. */
> -		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> +		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt, false);
> +		}
>  
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -376,12 +440,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		/* We must be the owner, just set the lock bit and acquire */
>  		lock_set_locked(lock);
>  	} else {
> +		int set_yield_cpu = -1;
>  		int iters = 0;
>  		bool set_mustq = false;
>  
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt,
>  					pv_yield_allow_steal && set_mustq);
>  
> @@ -540,6 +606,22 @@ static int pv_yield_prev_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, pv_yield_prev_set, "%llu\n");
>  
> +static int pv_yield_propagate_owner_set(void *data, u64 val)
> +{
> +	pv_yield_propagate_owner = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_yield_propagate_owner_get(void *data, u64 *val)
> +{
> +	*val = pv_yield_propagate_owner;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
> @@ -548,6 +630,7 @@ static __init int spinlock_debugfs_init(void)
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
>  		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
> +		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
>  	}
>  
>  	return 0;


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU
  2022-07-28  6:31 ` [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU Nicholas Piggin
  2022-08-12  4:22   ` Jordan Niethe
@ 2022-11-10  0:42   ` Jordan Niethe
  2022-11-10 11:32     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:42 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> After the head of the queue acquires the lock, it releases the
> next waiter in the queue to become the new head. Add an option
> to prod the new head if its vCPU was preempted. This may only
> have an effect if queue waiters are yielding.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 28c85a2d5635..3b10e31bcf0a 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -12,6 +12,7 @@
>  struct qnode {
>  	struct qnode	*next;
>  	struct qspinlock *lock;
> +	int		cpu;
>  	int		yield_cpu;
>  	u8		locked; /* 1 if lock acquired */
>  };
> @@ -30,6 +31,7 @@ static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
> +static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -392,6 +394,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	node = &qnodesp->nodes[idx];
>  	node->next = NULL;
>  	node->lock = lock;
> +	node->cpu = smp_processor_id();

I suppose this could be used in some other places too.

For example change:
	yield_to_prev(lock, node, prev, paravirt);

In yield_to_prev() it could then access the prev->cpu.

>  	node->yield_cpu = -1;
>  	node->locked = 0;
>  
> @@ -483,7 +486,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	 * this store to locked. The corresponding barrier is the smp_rmb()
>  	 * acquire barrier for mcs lock, above.
>  	 */
> -	WRITE_ONCE(next->locked, 1);
> +	if (paravirt && pv_prod_head) {
> +		int next_cpu = next->cpu;
> +		WRITE_ONCE(next->locked, 1);
> +		if (vcpu_is_preempted(next_cpu))
> +			prod_cpu(next_cpu);
> +	} else {
> +		WRITE_ONCE(next->locked, 1);
> +	}
>  
>  release:
>  	qnodesp->count--; /* release the node */
> @@ -622,6 +632,22 @@ static int pv_yield_propagate_owner_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
>  
> +static int pv_prod_head_set(void *data, u64 val)
> +{
> +	pv_prod_head = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_prod_head_get(void *data, u64 *val)
> +{
> +	*val = pv_prod_head;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, pv_prod_head_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
> @@ -631,6 +657,7 @@ static __init int spinlock_debugfs_init(void)
>  		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>  		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
> +		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);
>  	}
>  
>  	return 0;


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal
  2022-07-28  6:31 ` [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal Nicholas Piggin
  2022-08-12  4:32   ` Jordan Niethe
@ 2022-11-10  0:43   ` Jordan Niethe
  2022-11-10 11:35     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:43 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> This gives trylock slightly more strength, and it also gives most
> of the benefit of passing 'val' back through the slowpath without
> the complexity.
> ---
>  arch/powerpc/include/asm/qspinlock.h | 39 +++++++++++++++++++++++++++-
>  arch/powerpc/lib/qspinlock.c         |  9 +++++++
>  2 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> index 44601b261e08..d3d2039237b2 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -5,6 +5,8 @@
>  #include <linux/compiler.h>
>  #include <asm/qspinlock_types.h>
>  
> +#define _Q_SPIN_TRY_LOCK_STEAL 1

Would this be a config option?

> +
>  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>  {
>  	return READ_ONCE(lock->val);
> @@ -26,11 +28,12 @@ static __always_inline u32 queued_spin_get_locked_val(void)
>  	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
>  }
>  
> -static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> +static __always_inline int __queued_spin_trylock_nosteal(struct qspinlock *lock)
>  {
>  	u32 new = queued_spin_get_locked_val();
>  	u32 prev;
>  
> +	/* Trylock succeeds only when unlocked and no queued nodes */
>  	asm volatile(
>  "1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"

s/queued_spin_trylock/__queued_spin_trylock_nosteal

>  "	cmpwi	0,%0,0							\n"
> @@ -49,6 +52,40 @@ static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  	return 0;
>  }
>  
> +static __always_inline int __queued_spin_trylock_steal(struct qspinlock *lock)
> +{
> +	u32 new = queued_spin_get_locked_val();
> +	u32 prev, tmp;
> +
> +	/* Trylock may get ahead of queued nodes if it finds unlocked */
> +	asm volatile(
> +"1:	lwarx	%0,0,%2,%5	# queued_spin_trylock			\n"

s/queued_spin_trylock/__queued_spin_trylock_steal

> +"	andc.	%1,%0,%4						\n"
> +"	bne-	2f							\n"
> +"	and	%1,%0,%4						\n"
> +"	or	%1,%1,%3						\n"
> +"	stwcx.	%1,0,%2							\n"
> +"	bne-	1b							\n"
> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +"2:									\n"

Just because there's a little bit more going on here...

Q_TAIL_CPU_MASK = 0xFFFE0000
~Q_TAIL_CPU_MASK = 0x1FFFF


1:	lwarx	prev, 0, &lock->val, IS_ENABLED_PPC64
	andc.	tmp, prev, _Q_TAIL_CPU_MASK 	(tmp = prev & ~_Q_TAIL_CPU_MASK)
	bne-	2f 				(exit if locked)
	and	tmp, prev, _Q_TAIL_CPU_MASK 	(tmp = prev & _Q_TAIL_CPU_MASK)
	or	tmp, tmp, new			(tmp |= new)					
	stwcx.	tmp, 0, &lock->val					
		
	bne-	1b							
	PPC_ACQUIRE_BARRIER		
2:

... which seems correct.


> +	: "=&r" (prev), "=&r" (tmp)
> +	: "r" (&lock->val), "r" (new), "r" (_Q_TAIL_CPU_MASK),
> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> +	: "cr0", "memory");
> +
> +	if (likely(!(prev & ~_Q_TAIL_CPU_MASK)))
> +		return 1;
> +	return 0;
> +}
> +
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> +{
> +	if (!_Q_SPIN_TRY_LOCK_STEAL)
> +		return __queued_spin_trylock_nosteal(lock);
> +	else
> +		return __queued_spin_trylock_steal(lock);
> +}
> +
>  void queued_spin_lock_slowpath(struct qspinlock *lock);
>  
>  static __always_inline void queued_spin_lock(struct qspinlock *lock)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 3b10e31bcf0a..277aef1fab0a 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -24,7 +24,11 @@ struct qnodes {
>  
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
> +#if _Q_SPIN_TRY_LOCK_STEAL == 1
> +static const bool MAYBE_STEALERS = true;
> +#else
>  static bool MAYBE_STEALERS __read_mostly = true;
> +#endif
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> @@ -522,6 +526,10 @@ void pv_spinlocks_init(void)
>  #include <linux/debugfs.h>
>  static int steal_spins_set(void *data, u64 val)
>  {
> +#if _Q_SPIN_TRY_LOCK_STEAL == 1
> +	/* MAYBE_STEAL remains true */
> +	STEAL_SPINS = val;
> +#else
>  	static DEFINE_MUTEX(lock);
>  
>  	mutex_lock(&lock);
> @@ -539,6 +547,7 @@ static int steal_spins_set(void *data, u64 val)
>  		STEAL_SPINS = val;
>  	}
>  	mutex_unlock(&lock);
> +#endif
>  
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API
  2022-07-28  6:31 ` [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API Nicholas Piggin
  2022-08-12  4:36   ` Jordan Niethe
@ 2022-11-10  0:43   ` Jordan Niethe
  2022-11-10 11:36     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:43 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Use the spin_begin/spin_cpu_relax/spin_end APIs in qspinlock, which helps
> to prevent threads issuing a lot of expensive priority nops which may not
> have much effect due to immediately executing low then medium priority.

Just a general comment regarding the spin_{begin,end} API, more complicated
than something like

	spin_begin()
	for(;;)
		spin_cpu_relax()
	spin_end()

it becomes difficult to keep track of. Unfortunately, I don't have any good
suggestions how to improve it. Hopefully with P10s wait instruction we can
maybe try and move away from this.

It might be useful to comment the functions pre and post conditions regarding
expectations about spin_begin() and spin_end().

> ---
>  arch/powerpc/lib/qspinlock.c | 35 +++++++++++++++++++++++++++++++----
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 277aef1fab0a..d4594c701f7d 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -233,6 +233,8 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  	if ((yield_count & 1) == 0)
>  		goto relax; /* owner vcpu is running */
>  
> +	spin_end();
> +
>  	/*
>  	 * Read the lock word after sampling the yield count. On the other side
>  	 * there may a wmb because the yield count update is done by the
> @@ -248,11 +250,13 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  		yield_to_preempted(owner, yield_count);
>  		if (clear_mustq)
>  			lock_set_mustq(lock);
> +		spin_begin();
>  		/* Don't relax if we yielded. Maybe we should? */
>  		return;
>  	}
> +	spin_begin();
>  relax:
> -	cpu_relax();
> +	spin_cpu_relax();
>  }
>  
>  static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> @@ -315,14 +319,18 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  	if ((yield_count & 1) == 0)
>  		goto yield_prev; /* owner vcpu is running */
>  
> +	spin_end();
> +
>  	smp_rmb();
>  
>  	if (yield_cpu == node->yield_cpu) {
>  		if (node->next && node->next->yield_cpu != yield_cpu)
>  			node->next->yield_cpu = yield_cpu;
>  		yield_to_preempted(yield_cpu, yield_count);
> +		spin_begin();
>  		return;
>  	}
> +	spin_begin();
>  
>  yield_prev:
>  	if (!pv_yield_prev)
> @@ -332,15 +340,19 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  	if ((yield_count & 1) == 0)
>  		goto relax; /* owner vcpu is running */
>  
> +	spin_end();
> +
>  	smp_rmb(); /* See yield_to_locked_owner comment */
>  
>  	if (!node->locked) {
>  		yield_to_preempted(prev_cpu, yield_count);
> +		spin_begin();
>  		return;
>  	}
> +	spin_begin();
>  
>  relax:
> -	cpu_relax();
> +	spin_cpu_relax();
>  }
>  
>  
> @@ -349,6 +361,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  	int iters;
>  
>  	/* Attempt to steal the lock */
> +	spin_begin();
>  	for (;;) {
>  		u32 val = READ_ONCE(lock->val);
>  
> @@ -356,8 +369,10 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  			break;
>  
>  		if (unlikely(!(val & _Q_LOCKED_VAL))) {
> +			spin_end();
>  			if (trylock_with_tail_cpu(lock, val))
>  				return true;
> +			spin_begin();
>  			continue;
>  		}
>  
> @@ -368,6 +383,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  		if (iters >= get_steal_spins(paravirt))
>  			break;
>  	}
> +	spin_end();
>  
>  	return false;
>  }
> @@ -418,8 +434,10 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		WRITE_ONCE(prev->next, node);
>  
>  		/* Wait for mcs node lock to be released */
> +		spin_begin();
>  		while (!node->locked)
>  			yield_to_prev(lock, node, prev_cpu, paravirt);
> +		spin_end();
>  
>  		/* Clear out stale propagated yield_cpu */
>  		if (paravirt && pv_yield_propagate_owner && node->yield_cpu != -1)
> @@ -432,10 +450,12 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		int set_yield_cpu = -1;
>  
>  		/* We're at the head of the waitqueue, wait for the lock. */
> +		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt, false);
>  		}
> +		spin_end();
>  
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -453,6 +473,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
> +		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt,
> @@ -465,6 +486,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  				val |= _Q_MUST_Q_VAL;
>  			}
>  		}
> +		spin_end();
>  
>  		/* If we're the last queued, must clean up the tail. */
>  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -480,8 +502,13 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  
>  unlock_next:
>  	/* contended path; must wait for next != NULL (MCS protocol) */
> -	while (!(next = READ_ONCE(node->next)))
> -		cpu_relax();
> +	next = READ_ONCE(node->next);
> +	if (!next) {
> +		spin_begin();
> +		while (!(next = READ_ONCE(node->next)))
> +			cpu_relax();
> +		spin_end();
> +	}
>  
>  	/*
>  	 * Unlock the next mcs waiter node. Release barrier is not required


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins
  2022-07-28  6:31 ` [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins Nicholas Piggin
  2022-08-12  4:43   ` Jordan Niethe
@ 2022-11-10  0:43   ` Jordan Niethe
  2022-11-10 11:37     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:43 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Allow for a reduction in the number of times a CPU from a different
> node than the owner can attempt to steal the lock before queueing.
> This could bias the transfer behaviour of the lock across the
> machine and reduce NUMA crossings.
> ---
>  arch/powerpc/lib/qspinlock.c | 34 +++++++++++++++++++++++++++++++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index d4594c701f7d..24f68bd71e2b 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -4,6 +4,7 @@
>  #include <linux/export.h>
>  #include <linux/percpu.h>
>  #include <linux/smp.h>
> +#include <linux/topology.h>
>  #include <asm/qspinlock.h>
>  #include <asm/paravirt.h>
>  
> @@ -24,6 +25,7 @@ struct qnodes {
>  
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
> +static int REMOTE_STEAL_SPINS __read_mostly = (1<<2);
>  #if _Q_SPIN_TRY_LOCK_STEAL == 1
>  static const bool MAYBE_STEALERS = true;
>  #else
> @@ -39,9 +41,13 @@ static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static __always_inline int get_steal_spins(bool paravirt)
> +static __always_inline int get_steal_spins(bool paravirt, bool remote)
>  {
> -	return STEAL_SPINS;
> +	if (remote) {
> +		return REMOTE_STEAL_SPINS;
> +	} else {
> +		return STEAL_SPINS;
> +	}
>  }
>  
>  static __always_inline int get_head_spins(bool paravirt)
> @@ -380,8 +386,13 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  
>  		iters++;
>  
> -		if (iters >= get_steal_spins(paravirt))
> +		if (iters >= get_steal_spins(paravirt, false))
>  			break;
> +		if (iters >= get_steal_spins(paravirt, true)) {

There's no indication of what true and false mean here which is hard to read.
To me it feels like two separate functions would be more clear.


> +			int cpu = get_owner_cpu(val);
> +			if (numa_node_id() != cpu_to_node(cpu))

What about using node_distance() instead?


> +				break;
> +		}
>  	}
>  	spin_end();
>  
> @@ -588,6 +599,22 @@ static int steal_spins_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
>  
> +static int remote_steal_spins_set(void *data, u64 val)
> +{
> +	REMOTE_STEAL_SPINS = val;

REMOTE_STEAL_SPINS is int not u64.

> +
> +	return 0;
> +}
> +
> +static int remote_steal_spins_get(void *data, u64 *val)
> +{
> +	*val = REMOTE_STEAL_SPINS;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_remote_steal_spins, remote_steal_spins_get, remote_steal_spins_set, "%llu\n");
> +
>  static int head_spins_set(void *data, u64 val)
>  {
>  	HEAD_SPINS = val;
> @@ -687,6 +714,7 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, pv_prod_head_set, "
>  static __init int spinlock_debugfs_init(void)
>  {
>  	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
> +	debugfs_create_file("qspl_remote_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_remote_steal_spins);
>  	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
>  	if (is_shared_processor()) {
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner
  2022-07-28  6:31 ` [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner Nicholas Piggin
  2022-08-12  4:49   ` Jordan Niethe
  2022-09-22 15:02   ` Laurent Dufour
@ 2022-11-10  0:44   ` Jordan Niethe
  2022-11-10 11:38     ` Nicholas Piggin
  2 siblings, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:44 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Provide an option that holds off queueing indefinitely while the lock
> owner is preempted. This could reduce queueing latencies for very
> overcommitted vcpu situations.
> 
> This is disabled by default.
> ---
>  arch/powerpc/lib/qspinlock.c | 91 +++++++++++++++++++++++++++++++-----
>  1 file changed, 79 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 24f68bd71e2b..5cfd69931e31 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -35,6 +35,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
> +static bool pv_spin_on_preempted_owner __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
>  static bool pv_prod_head __read_mostly = false;
> @@ -220,13 +221,15 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  	BUG();
>  }
>  
> -static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
>  	int owner;
>  	u32 yield_count;
>  
>  	BUG_ON(!(val & _Q_LOCKED_VAL));
>  
> +	*preempted = false;
> +
>  	if (!paravirt)
>  		goto relax;
>  
> @@ -241,6 +244,8 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  
>  	spin_end();
>  
> +	*preempted = true;
> +
>  	/*
>  	 * Read the lock word after sampling the yield count. On the other side
>  	 * there may a wmb because the yield count update is done by the
> @@ -265,14 +270,14 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  	spin_cpu_relax();
>  }
>  
> -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool *preempted)

It seems like preempted parameter could be the return value of
yield_to_locked_owner(). Then callers that don't use the value returned in
preempted don't need to create an unnecessary variable to pass in.

>  {
> -	__yield_to_locked_owner(lock, val, paravirt, false);
> +	__yield_to_locked_owner(lock, val, paravirt, false, preempted);
>  }
>  
> -static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
> -	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> +	__yield_to_locked_owner(lock, val, paravirt, clear_mustq, preempted);
>  }
>  
>  static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
> @@ -364,12 +369,33 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
>  {
> -	int iters;
> +	int iters = 0;
> +
> +	if (!STEAL_SPINS) {
> +		if (paravirt && pv_spin_on_preempted_owner) {
> +			spin_begin();
> +			for (;;) {
> +				u32 val = READ_ONCE(lock->val);
> +				bool preempted;
> +
> +				if (val & _Q_MUST_Q_VAL)
> +					break;
> +				if (!(val & _Q_LOCKED_VAL))
> +					break;
> +				if (!vcpu_is_preempted(get_owner_cpu(val)))
> +					break;
> +				yield_to_locked_owner(lock, val, paravirt, &preempted);
> +			}
> +			spin_end();
> +		}
> +		return false;
> +	}
>  
>  	/* Attempt to steal the lock */
>  	spin_begin();
>  	for (;;) {
>  		u32 val = READ_ONCE(lock->val);
> +		bool preempted;
>  
>  		if (val & _Q_MUST_Q_VAL)
>  			break;
> @@ -382,9 +408,22 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  			continue;
>  		}
>  
> -		yield_to_locked_owner(lock, val, paravirt);
> -
> -		iters++;
> +		yield_to_locked_owner(lock, val, paravirt, &preempted);
> +
> +		if (paravirt && preempted) {
> +			if (!pv_spin_on_preempted_owner)
> +				iters++;
> +			/*
> +			 * pv_spin_on_preempted_owner don't increase iters
> +			 * while the owner is preempted -- we won't interfere
> +			 * with it by definition. This could introduce some
> +			 * latency issue if we continually observe preempted
> +			 * owners, but hopefully that's a rare corner case of
> +			 * a badly oversubscribed system.
> +			 */
> +		} else {
> +			iters++;
> +		}
>  
>  		if (iters >= get_steal_spins(paravirt, false))
>  			break;
> @@ -463,8 +502,10 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			bool preempted;
> +
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
> -			yield_head_to_locked_owner(lock, val, paravirt, false);
> +			yield_head_to_locked_owner(lock, val, paravirt, false, &preempted);
>  		}
>  		spin_end();
>  
> @@ -486,11 +527,20 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> +			bool preempted;
> +
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt,
> -					pv_yield_allow_steal && set_mustq);
> +					pv_yield_allow_steal && set_mustq,
> +					&preempted);
> +
> +			if (paravirt && preempted) {
> +				if (!pv_spin_on_preempted_owner)
> +					iters++;
> +			} else {
> +				iters++;
> +			}
>  
> -			iters++;
>  			if (!set_mustq && iters >= get_head_spins(paravirt)) {
>  				set_mustq = true;
>  				lock_set_mustq(lock);
> @@ -663,6 +713,22 @@ static int pv_yield_allow_steal_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_allow_steal, pv_yield_allow_steal_get, pv_yield_allow_steal_set, "%llu\n");
>  
> +static int pv_spin_on_preempted_owner_set(void *data, u64 val)
> +{
> +	pv_spin_on_preempted_owner = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_spin_on_preempted_owner_get(void *data, u64 *val)
> +{
> +	*val = pv_spin_on_preempted_owner;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_spin_on_preempted_owner, pv_spin_on_preempted_owner_get, pv_spin_on_preempted_owner_set, "%llu\n");
> +
>  static int pv_yield_prev_set(void *data, u64 val)
>  {
>  	pv_yield_prev = !!val;
> @@ -719,6 +785,7 @@ static __init int spinlock_debugfs_init(void)
>  	if (is_shared_processor()) {
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
>  		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
> +		debugfs_create_file("qspl_pv_spin_on_preempted_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_spin_on_preempted_owner);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>  		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
>  		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks
  2022-07-28  6:31 ` [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks Nicholas Piggin
  2022-08-15  1:11   ` Jordan Niethe
@ 2022-11-10  0:44   ` Jordan Niethe
  2022-11-10 11:41     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Jordan Niethe @ 2022-11-10  0:44 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Finding the owner or a queued waiter on a lock with a preempted vcpu
> is indicative of an oversubscribed guest causing the lock to get into
> trouble. Provide some options to detect this situation and have new
> CPUs avoid queueing for a longer time (more steal iterations) to
> minimise the problems caused by vcpu preemption on the queue.
> ---
>  arch/powerpc/include/asm/qspinlock_types.h |   7 +-
>  arch/powerpc/lib/qspinlock.c               | 240 +++++++++++++++++++--
>  2 files changed, 232 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> index 35f9525381e6..4fbcc8a4230b 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -30,7 +30,7 @@ typedef struct qspinlock {
>   *
>   *     0: locked bit
>   *  1-14: lock holder cpu
> - *    15: unused bit
> + *    15: lock owner or queuer vcpus observed to be preempted bit
>   *    16: must queue bit
>   * 17-31: tail cpu (+1)
>   */
> @@ -49,6 +49,11 @@ typedef struct qspinlock {
>  #error "qspinlock does not support such large CONFIG_NR_CPUS"
>  #endif
>  
> +#define _Q_SLEEPY_OFFSET	15
> +#define _Q_SLEEPY_BITS		1
> +#define _Q_SLEEPY_MASK		_Q_SET_MASK(SLEEPY_OWNER)
> +#define _Q_SLEEPY_VAL		(1U << _Q_SLEEPY_OFFSET)
> +
>  #define _Q_MUST_Q_OFFSET	16
>  #define _Q_MUST_Q_BITS		1
>  #define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 5cfd69931e31..c18133c01450 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -5,6 +5,7 @@
>  #include <linux/percpu.h>
>  #include <linux/smp.h>
>  #include <linux/topology.h>
> +#include <linux/sched/clock.h>
>  #include <asm/qspinlock.h>
>  #include <asm/paravirt.h>
>  
> @@ -36,24 +37,54 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_spin_on_preempted_owner __read_mostly = false;
> +static bool pv_sleepy_lock __read_mostly = true;
> +static bool pv_sleepy_lock_sticky __read_mostly = false;

The sticky part could potentially be its own patch.

> +static u64 pv_sleepy_lock_interval_ns __read_mostly = 0;
> +static int pv_sleepy_lock_factor __read_mostly = 256;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
>  static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> +static DEFINE_PER_CPU_ALIGNED(u64, sleepy_lock_seen_clock);
>  
> -static __always_inline int get_steal_spins(bool paravirt, bool remote)
> +static __always_inline bool recently_sleepy(void)
> +{

Other users of pv_sleepy_lock_interval_ns first check pv_sleepy_lock.

> +	if (pv_sleepy_lock_interval_ns) {
> +		u64 seen = this_cpu_read(sleepy_lock_seen_clock);
> +
> +		if (seen) {
> +			u64 delta = sched_clock() - seen;
> +			if (delta < pv_sleepy_lock_interval_ns)
> +				return true;
> +			this_cpu_write(sleepy_lock_seen_clock, 0);
> +		}
> +	}
> +
> +	return false;
> +}
> +
> +static __always_inline int get_steal_spins(bool paravirt, bool remote, bool sleepy)

It seems like paravirt is implied by sleepy.

>  {
>  	if (remote) {
> -		return REMOTE_STEAL_SPINS;
> +		if (paravirt && sleepy)
> +			return REMOTE_STEAL_SPINS * pv_sleepy_lock_factor;
> +		else
> +			return REMOTE_STEAL_SPINS;
>  	} else {
> -		return STEAL_SPINS;
> +		if (paravirt && sleepy)
> +			return STEAL_SPINS * pv_sleepy_lock_factor;
> +		else
> +			return STEAL_SPINS;
>  	}
>  }

I think that separate functions would still be nicer but this could get rid of
the nesting conditionals like


	int spins;
	if (remote)
		spins = REMOTE_STEAL_SPINS;
	else
		spins = STEAL_SPINS;

	if (sleepy)
		return spins * pv_sleepy_lock_factor;
	return spins;

>  
> -static __always_inline int get_head_spins(bool paravirt)
> +static __always_inline int get_head_spins(bool paravirt, bool sleepy)
>  {
> -	return HEAD_SPINS;
> +	if (paravirt && sleepy)
> +		return HEAD_SPINS * pv_sleepy_lock_factor;
> +	else
> +		return HEAD_SPINS;
>  }
>  
>  static inline u32 encode_tail_cpu(void)
> @@ -206,6 +237,60 @@ static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
>  	return prev;
>  }
>  
> +static __always_inline bool lock_try_set_sleepy(struct qspinlock *lock, u32 old)
> +{
> +	u32 prev;
> +	u32 new = old | _Q_SLEEPY_VAL;
> +
> +	BUG_ON(!(old & _Q_LOCKED_VAL));
> +	BUG_ON(old & _Q_SLEEPY_VAL);
> +
> +	asm volatile(
> +"1:	lwarx	%0,0,%1		# lock_try_set_sleepy			\n"
> +"	cmpw	0,%0,%2							\n"
> +"	bne-	2f							\n"
> +"	stwcx.	%3,0,%1							\n"
> +"	bne-	1b							\n"
> +"2:									\n"
> +	: "=&r" (prev)
> +	: "r" (&lock->val), "r"(old), "r" (new)
> +	: "cr0", "memory");
> +
> +	if (prev == old)
> +		return true;
> +	return false;
> +}
> +
> +static __always_inline void seen_sleepy_owner(struct qspinlock *lock, u32 val)
> +{
> +	if (pv_sleepy_lock) {
> +		if (pv_sleepy_lock_interval_ns)
> +			this_cpu_write(sleepy_lock_seen_clock, sched_clock());
> +		if (!(val & _Q_SLEEPY_VAL))
> +			lock_try_set_sleepy(lock, val);
> +	}
> +}
> +
> +static __always_inline void seen_sleepy_lock(void)
> +{
> +	if (pv_sleepy_lock && pv_sleepy_lock_interval_ns)
> +		this_cpu_write(sleepy_lock_seen_clock, sched_clock());
> +}
> +
> +static __always_inline void seen_sleepy_node(struct qspinlock *lock)
> +{

If yield_to_prev() was made to take a raw val, that val could be passed to
seen_sleepy_node() and it would not need to get it by itself.

> +	if (pv_sleepy_lock) {
> +		u32 val = READ_ONCE(lock->val);
> +
> +		if (pv_sleepy_lock_interval_ns)
> +			this_cpu_write(sleepy_lock_seen_clock, sched_clock());
> +		if (val & _Q_LOCKED_VAL) {
> +			if (!(val & _Q_SLEEPY_VAL))
> +				lock_try_set_sleepy(lock, val);
> +		}
> +	}
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>  	int cpu = get_tail_cpu(val);
> @@ -244,6 +329,7 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
>  
>  	spin_end();
>  
> +	seen_sleepy_owner(lock, val);
>  	*preempted = true;
>  
>  	/*
> @@ -307,11 +393,13 @@ static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int
>  	}
>  }
>  
> -static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
> +static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt, bool *preempted)
>  {
>  	u32 yield_count;
>  	int yield_cpu;
>  
> +	*preempted = false;
> +
>  	if (!paravirt)
>  		goto relax;
>  
> @@ -332,6 +420,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  
>  	spin_end();
>  
> +	*preempted = true;
> +	seen_sleepy_node(lock);
> +
>  	smp_rmb();
>  
>  	if (yield_cpu == node->yield_cpu) {
> @@ -353,6 +444,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  
>  	spin_end();
>  
> +	*preempted = true;
> +	seen_sleepy_node(lock);
> +
>  	smp_rmb(); /* See yield_to_locked_owner comment */
>  
>  	if (!node->locked) {
> @@ -369,6 +463,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
>  {
> +	bool preempted;
> +	bool seen_preempted = false;
> +	bool sleepy = false;
>  	int iters = 0;
>  
>  	if (!STEAL_SPINS) {
> @@ -376,7 +473,6 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  			spin_begin();
>  			for (;;) {
>  				u32 val = READ_ONCE(lock->val);
> -				bool preempted;
>  
>  				if (val & _Q_MUST_Q_VAL)
>  					break;
> @@ -395,7 +491,6 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  	spin_begin();
>  	for (;;) {
>  		u32 val = READ_ONCE(lock->val);
> -		bool preempted;
>  
>  		if (val & _Q_MUST_Q_VAL)
>  			break;
> @@ -408,9 +503,29 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  			continue;
>  		}
>  
> +		if (paravirt && pv_sleepy_lock && !sleepy) {
> +			if (!sleepy) {

The enclosing conditional means this would always be true. I think the out conditional should be
if (paravirt && pv_sleepy_lock)
otherwise the pv_sleepy_lock_sticky part wouldn't work properly.


> +				if (val & _Q_SLEEPY_VAL) {
> +					seen_sleepy_lock();
> +					sleepy = true;
> +				} else if (recently_sleepy()) {
> +					sleepy = true;
> +				}
> +
> +			if (pv_sleepy_lock_sticky && seen_preempted &&
> +					!(val & _Q_SLEEPY_VAL)) {
> +				if (lock_try_set_sleepy(lock, val))
> +					val |= _Q_SLEEPY_VAL;
> +			}
> +
> +
>  		yield_to_locked_owner(lock, val, paravirt, &preempted);
> +		if (preempted)
> +			seen_preempted = true;

This could belong to the next if statement, there can not be !paravirt && preempted ?

>  
>  		if (paravirt && preempted) {
> +			sleepy = true;
> +
>  			if (!pv_spin_on_preempted_owner)
>  				iters++;
>  			/*
> @@ -425,14 +540,15 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
>  			iters++;
>  		}
>  
> -		if (iters >= get_steal_spins(paravirt, false))
> +		if (iters >= get_steal_spins(paravirt, false, sleepy))
>  			break;
> -		if (iters >= get_steal_spins(paravirt, true)) {
> +		if (iters >= get_steal_spins(paravirt, true, sleepy)) {
>  			int cpu = get_owner_cpu(val);
>  			if (numa_node_id() != cpu_to_node(cpu))
>  				break;
>  		}
>  	}
> +
>  	spin_end();
>  
>  	return false;
> @@ -443,6 +559,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	struct qnodes *qnodesp;
>  	struct qnode *next, *node;
>  	u32 val, old, tail;
> +	bool seen_preempted = false;
>  	int idx;
>  
>  	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> @@ -485,8 +602,13 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  
>  		/* Wait for mcs node lock to be released */
>  		spin_begin();
> -		while (!node->locked)
> -			yield_to_prev(lock, node, prev_cpu, paravirt);
> +		while (!node->locked) {
> +			bool preempted;
> +
> +			yield_to_prev(lock, node, prev_cpu, paravirt, &preempted);
> +			if (preempted)
> +				seen_preempted = true;
> +		}
>  		spin_end();
>  
>  		/* Clear out stale propagated yield_cpu */
> @@ -506,6 +628,8 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt, false, &preempted);
> +			if (preempted)
> +				seen_preempted = true;
>  		}
>  		spin_end();
>  
> @@ -521,27 +645,47 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  	} else {
>  		int set_yield_cpu = -1;
>  		int iters = 0;
> +		bool sleepy = false;
>  		bool set_mustq = false;
> +		bool preempted;
>  
>  again:
>  		/* We're at the head of the waitqueue, wait for the lock. */
>  		spin_begin();
>  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> -			bool preempted;
> +			if (paravirt && pv_sleepy_lock) {
> +				if (!sleepy) {
> +					if (val & _Q_SLEEPY_VAL) {
> +						seen_sleepy_lock();
> +						sleepy = true;
> +					} else if (recently_sleepy()) {
> +						sleepy = true;
> +					}
> +				}
> +				if (pv_sleepy_lock_sticky && seen_preempted &&
> +						!(val & _Q_SLEEPY_VAL)) {
> +					if (lock_try_set_sleepy(lock, val))
> +						val |= _Q_SLEEPY_VAL;
> +				}
> +			}
>  
>  			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
>  			yield_head_to_locked_owner(lock, val, paravirt,
>  					pv_yield_allow_steal && set_mustq,
>  					&preempted);
> +			if (preempted)
> +				seen_preempted = true;
>  
>  			if (paravirt && preempted) {
> +				sleepy = true;
> +
>  				if (!pv_spin_on_preempted_owner)
>  					iters++;
>  			} else {
>  				iters++;
>  			}
>  
> -			if (!set_mustq && iters >= get_head_spins(paravirt)) {
> +			if (!set_mustq && iters >= get_head_spins(paravirt, sleepy)) {
>  				set_mustq = true;
>  				lock_set_mustq(lock);
>  				val |= _Q_MUST_Q_VAL;
> @@ -729,6 +873,70 @@ static int pv_spin_on_preempted_owner_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_spin_on_preempted_owner, pv_spin_on_preempted_owner_get, pv_spin_on_preempted_owner_set, "%llu\n");
>  
> +static int pv_sleepy_lock_set(void *data, u64 val)
> +{
> +	pv_sleepy_lock = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_sleepy_lock_get(void *data, u64 *val)
> +{
> +	*val = pv_sleepy_lock;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock, pv_sleepy_lock_get, pv_sleepy_lock_set, "%llu\n");
> +
> +static int pv_sleepy_lock_sticky_set(void *data, u64 val)
> +{
> +	pv_sleepy_lock_sticky = !!val;
> +
> +	return 0;
> +}
> +
> +static int pv_sleepy_lock_sticky_get(void *data, u64 *val)
> +{
> +	*val = pv_sleepy_lock_sticky;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_sticky, pv_sleepy_lock_sticky_get, pv_sleepy_lock_sticky_set, "%llu\n");
> +
> +static int pv_sleepy_lock_interval_ns_set(void *data, u64 val)
> +{
> +	pv_sleepy_lock_interval_ns = val;
> +
> +	return 0;
> +}
> +
> +static int pv_sleepy_lock_interval_ns_get(void *data, u64 *val)
> +{
> +	*val = pv_sleepy_lock_interval_ns;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_interval_ns, pv_sleepy_lock_interval_ns_get, pv_sleepy_lock_interval_ns_set, "%llu\n");
> +
> +static int pv_sleepy_lock_factor_set(void *data, u64 val)
> +{
> +	pv_sleepy_lock_factor = val;
> +
> +	return 0;
> +}
> +
> +static int pv_sleepy_lock_factor_get(void *data, u64 *val)
> +{
> +	*val = pv_sleepy_lock_factor;
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_sleepy_lock_factor, pv_sleepy_lock_factor_get, pv_sleepy_lock_factor_set, "%llu\n");
> +
>  static int pv_yield_prev_set(void *data, u64 val)
>  {
>  	pv_yield_prev = !!val;
> @@ -786,6 +994,10 @@ static __init int spinlock_debugfs_init(void)
>  		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
>  		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
>  		debugfs_create_file("qspl_pv_spin_on_preempted_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_spin_on_preempted_owner);
> +		debugfs_create_file("qspl_pv_sleepy_lock", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock);
> +		debugfs_create_file("qspl_pv_sleepy_lock_sticky", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_sticky);
> +		debugfs_create_file("qspl_pv_sleepy_lock_interval_ns", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_interval_ns);
> +		debugfs_create_file("qspl_pv_sleepy_lock_factor", 0600, arch_debugfs_dir, NULL, &fops_pv_sleepy_lock_factor);
>  		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>  		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
>  		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
  2022-11-10  0:35   ` Jordan Niethe
@ 2022-11-10  6:37     ` Christophe Leroy
  2022-11-10 11:44       ` Nicholas Piggin
  2022-11-10  9:09     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Christophe Leroy @ 2022-11-10  6:37 UTC (permalink / raw)
  To: Jordan Niethe, Nicholas Piggin, linuxppc-dev



Le 10/11/2022 à 01:35, Jordan Niethe a écrit :
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> <snip>
>> -#define queued_spin_lock queued_spin_lock
>>   
>> -static inline void queued_spin_unlock(struct qspinlock *lock)
>> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>>   {
>> -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
>> -		smp_store_release(&lock->locked, 0);
>> -	else
>> -		__pv_queued_spin_unlock(lock);
>> +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
>> +		return 1;
>> +	return 0;
> 
> optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);

No parenthesis.
No == 0

Should be :

	return !atomic_cmpxchg_acquire(&lock->val, 0, 1);

> 
> [resend as utf-8, not utf-7]
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly
  2022-11-10  0:39   ` Jordan Niethe
@ 2022-11-10  8:36     ` Christophe Leroy
  2022-11-10 11:48       ` Nicholas Piggin
  2022-11-10  9:40     ` Nicholas Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Christophe Leroy @ 2022-11-10  8:36 UTC (permalink / raw)
  To: Jordan Niethe, Nicholas Piggin, linuxppc-dev



Le 10/11/2022 à 01:39, Jordan Niethe a écrit :
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
>> This uses more optimal ll/sc style access patterns (rather than
>> cmpxchg), and also sets the EH=1 lock hint on those operations
>> which acquire ownership of the lock.
>> ---
>>   arch/powerpc/include/asm/qspinlock.h       | 25 +++++--
>>   arch/powerpc/include/asm/qspinlock_types.h |  6 +-
>>   arch/powerpc/lib/qspinlock.c               | 81 +++++++++++++++-------
>>   3 files changed, 79 insertions(+), 33 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
>> index 79a1936fb68d..3ab354159e5e 100644
>> --- a/arch/powerpc/include/asm/qspinlock.h
>> +++ b/arch/powerpc/include/asm/qspinlock.h
>> @@ -2,28 +2,43 @@
>>   #ifndef _ASM_POWERPC_QSPINLOCK_H
>>   #define _ASM_POWERPC_QSPINLOCK_H
>>   
>> -#include <linux/atomic.h>
>>   #include <linux/compiler.h>
>>   #include <asm/qspinlock_types.h>
>>   
>>   static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>>   {
>> -	return atomic_read(&lock->val);
>> +	return READ_ONCE(lock->val);
>>   }
>>   
>>   static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
>>   {
>> -	return !atomic_read(&lock.val);
>> +	return !lock.val;
>>   }
>>   
>>   static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
>>   {
>> -	return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
>> +	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
>>   }
>>   
>>   static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>>   {
>> -	if (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0)
>> +	u32 new = _Q_LOCKED_VAL;
>> +	u32 prev;
>> +
>> +	asm volatile(
>> +"1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"
>> +"	cmpwi	0,%0,0							\n"
>> +"	bne-	2f							\n"
>> +"	stwcx.	%2,0,%1							\n"
>> +"	bne-	1b							\n"
>> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
>> +"2:									\n"
>> +	: "=&r" (prev)
>> +	: "r" (&lock->val), "r" (new),
>> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> 
> btw IS_ENABLED() already returns 1 or 0
> 
>> +	: "cr0", "memory");
> 
> This is the ISA's "test and set" atomic primitive. Do you think it would be worth seperating it as a helper?
> 
>> +
>> +	if (likely(prev == 0))
>>   		return 1;
>>   	return 0;
> 
> same optional style nit: return likely(prev == 0);

	return likely(!prev);

> 
>>   }
>> diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
>> index 3425dab42576..210adf05b235 100644
>> --- a/arch/powerpc/include/asm/qspinlock_types.h
>> +++ b/arch/powerpc/include/asm/qspinlock_types.h
>> @@ -7,7 +7,7 @@
>>   
>>   typedef struct qspinlock {
>>   	union {
>> -		atomic_t val;
>> +		u32 val;
>>   
>>   #ifdef __LITTLE_ENDIAN
>>   		struct {
>> @@ -23,10 +23,10 @@ typedef struct qspinlock {
>>   	};
>>   } arch_spinlock_t;
>>   
>> -#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
>> +#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = 0 } }
>>   
>>   /*
>> - * Bitfields in the atomic value:
>> + * Bitfields in the lock word:
>>    *
>>    *     0: locked bit
>>    * 16-31: tail cpu (+1)
>> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
>> index 5ebb88d95636..7c71e5e287df 100644
>> --- a/arch/powerpc/lib/qspinlock.c
>> +++ b/arch/powerpc/lib/qspinlock.c
>> @@ -1,5 +1,4 @@
>>   // SPDX-License-Identifier: GPL-2.0-or-later
>> -#include <linux/atomic.h>
>>   #include <linux/bug.h>
>>   #include <linux/compiler.h>
>>   #include <linux/export.h>
>> @@ -22,32 +21,59 @@ struct qnodes {
>>   
>>   static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>>   
>> -static inline int encode_tail_cpu(void)
>> +static inline u32 encode_tail_cpu(void)
>>   {
>>   	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
>>   }
>>   
>> -static inline int get_tail_cpu(int val)
>> +static inline int get_tail_cpu(u32 val)
>>   {
>>   	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
>>   }
>>   
>>   /* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
> 
> I think you missed deleting the above line.
> 
>> +/* Take the lock by setting the lock bit, no other CPUs will touch it. */
>>   static __always_inline void lock_set_locked(struct qspinlock *lock)
>>   {
>> -	atomic_or(_Q_LOCKED_VAL, &lock->val);
>> -	__atomic_acquire_fence();
>> +	u32 new = _Q_LOCKED_VAL;
>> +	u32 prev;
>> +
>> +	asm volatile(
>> +"1:	lwarx	%0,0,%1,%3	# lock_set_locked			\n"
>> +"	or	%0,%0,%2						\n"
>> +"	stwcx.	%0,0,%1							\n"
>> +"	bne-	1b							\n"
>> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
>> +	: "=&r" (prev)
>> +	: "r" (&lock->val), "r" (new),
>> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
>> +	: "cr0", "memory");
>>   }
> 
> This is pretty similar with the DEFINE_TESTOP() pattern from
> arch/powerpc/include/asm/bitops.h (such as test_and_set_bits_lock()) except for
> word instead of double word. Do you think it's possible / beneficial to make
> use of those macros?
> 
> 
>>   
>> -/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
>> -static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int val)
>> +/* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
>> +static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
>>   {
>> -	int newval = _Q_LOCKED_VAL;
>> -
>> -	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
>> +	u32 new = _Q_LOCKED_VAL;
>> +	u32 prev;
>> +
>> +	BUG_ON(old & _Q_LOCKED_VAL);
> 
> The BUG_ON() could have been introduced in an earlier patch I think.

Can we avoid the BUG_ON() at all and replace by a WARN_ON ?

> 
>> +
>> +	asm volatile(
>> +"1:	lwarx	%0,0,%1,%4	# trylock_clear_tail_cpu		\n"
>> +"	cmpw	0,%0,%2							\n"
>> +"	bne-	2f							\n"
>> +"	stwcx.	%3,0,%1							\n"
>> +"	bne-	1b							\n"
>> +"\t"	PPC_ACQUIRE_BARRIER "						\n"
>> +"2:									\n"
>> +	: "=&r" (prev)
>> +	: "r" (&lock->val), "r"(old), "r" (new),
> 
> Could this be like  "r"(_Q_TAIL_CPU_MASK) below?
> i.e. "r" (_Q_LOCKED_VAL)? Makes it clear new doesn't change.
> 
>> +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
>> +	: "cr0", "memory");
>> +
>> +	if (likely(prev == old))
>>   		return 1;
>> -	else
>> -		return 0;
>> +	return 0;
>>   }
>>   
>>   /*
>> @@ -56,20 +82,25 @@ static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int va
>>    * This provides a release barrier for publishing node, and an acquire barrier
> 
> Does the comment mean there needs to be an acquire barrier in this assembly?
> 
> 
>>    * for getting the old node.
>>    */
>> -static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)
>> +static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
>>   {
>> -	for (;;) {
>> -		int val = atomic_read(&lock->val);
>> -		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
>> -		int old;
>> -
>> -		old = atomic_cmpxchg(&lock->val, val, newval);
>> -		if (old == val)
>> -			return old;
>> -	}
>> +	u32 prev, tmp;
>> +
>> +	asm volatile(
>> +"\t"	PPC_RELEASE_BARRIER "						\n"
>> +"1:	lwarx	%0,0,%2		# publish_tail_cpu			\n"
>> +"	andc	%1,%0,%4						\n"
>> +"	or	%1,%1,%3						\n"
>> +"	stwcx.	%1,0,%2							\n"
>> +"	bne-	1b							\n"
>> +	: "=&r" (prev), "=&r"(tmp)
>> +	: "r" (&lock->val), "r" (tail), "r"(_Q_TAIL_CPU_MASK)
>> +	: "cr0", "memory");
>> +
>> +	return prev;
>>   }
>>   
>> -static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
>> +static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>>   {
>>   	int cpu = get_tail_cpu(val);
>>   	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
>> @@ -88,7 +119,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>>   {
>>   	struct qnodes *qnodesp;
>>   	struct qnode *next, *node;
>> -	int val, old, tail;
>> +	u32 val, old, tail;
>>   	int idx;
>>   
>>   	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
>> @@ -134,7 +165,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>>   	}
>>   
>>   	/* We're at the head of the waitqueue, wait for the lock. */
>> -	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_VAL)
>> +	while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
>>   		cpu_relax();
>>   
>>   	/* If we're the last queued, must clean up the tail. */
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
  2022-11-10  0:35   ` Jordan Niethe
  2022-11-10  6:37     ` Christophe Leroy
@ 2022-11-10  9:09     ` Nicholas Piggin
  1 sibling, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10  9:09 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:35 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> <snip>
> > -#define queued_spin_lock queued_spin_lock
> >  
> > -static inline void queued_spin_unlock(struct qspinlock *lock)
> > +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> >  {
> > -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> > -		smp_store_release(&lock->locked, 0);
> > -	else
> > -		__pv_queued_spin_unlock(lock);
> > +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> > +		return 1;
> > +	return 0;
>
> optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);
>
> [resend as utf-8, not utf-7]

Thanks for the thorough review, apologies again it took me so long to
get back to.

I'm not completely sold on this. I guess it's already side-effects in a
control flow statement though... Maybe I will change it, not sure.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters
  2022-11-10  0:36   ` Jordan Niethe
@ 2022-11-10  9:21     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10  9:21 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:36 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> <snip>
> [resend as utf-8, not utf-7]
> >  
> > +/*
> > + * Bitfields in the atomic value:
> > + *
> > + *     0: locked bit
> > + * 16-31: tail cpu (+1)
> > + */
> > +#define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
> > +				      << _Q_ ## type ## _OFFSET)
> > +#define _Q_LOCKED_OFFSET	0
> > +#define _Q_LOCKED_BITS		1
> > +#define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
> > +#define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
> > +
> > +#define _Q_TAIL_CPU_OFFSET	16
> > +#define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
> > +#define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
> > +
>
> Just to state the obvious this is:
>
> #define _Q_LOCKED_OFFSET	0
> #define _Q_LOCKED_BITS		1
> #define _Q_LOCKED_MASK		0x00000001
> #define _Q_LOCKED_VAL		1
>
> #define _Q_TAIL_CPU_OFFSET	16
> #define _Q_TAIL_CPU_BITS	16
> #define _Q_TAIL_CPU_MASK	0xffff0000

Yeah. I'm wondering if that's a better style in the first place.
Generic qspinlock this can chance so there's slightly more reason to do
it that way.

> > +#if CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)
> > +#error "qspinlock does not support such large CONFIG_NR_CPUS"
> > +#endif
> > +
> >  #endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index 8dbce99a373c..5ebb88d95636 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -1,12 +1,172 @@
> >  // SPDX-License-Identifier: GPL-2.0-or-later
> > +#include <linux/atomic.h>
> > +#include <linux/bug.h>
> > +#include <linux/compiler.h>
> >  #include <linux/export.h>
> > -#include <linux/processor.h>
> > +#include <linux/percpu.h>
> > +#include <linux/smp.h>
> >  #include <asm/qspinlock.h>
> >  
> > -void queued_spin_lock_slowpath(struct qspinlock *lock)
> > +#define MAX_NODES	4
> > +
> > +struct qnode {
> > +	struct qnode	*next;
> > +	struct qspinlock *lock;
> > +	u8		locked; /* 1 if lock acquired */
> > +};
> > +
> > +struct qnodes {
> > +	int		count;
> > +	struct qnode nodes[MAX_NODES];
> > +};
>
> I think it could be worth commenting why qnodes::count instead _Q_TAIL_IDX_OFFSET.

I wasn't sure what you meant by this.

> > +
> > +static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> > +
> > +static inline int encode_tail_cpu(void)
>
> I think the generic version that takes smp_processor_id() as a parameter is clearer - at least with this function name.

Agree.

> > +{
> > +	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> > +}
> > +
> > +static inline int get_tail_cpu(int val)
>
> It seems like there should be a "decode" function to pair up with the "encode" function.

Agree.

> > +{
> > +	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
> > +}
> > +
> > +/* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
>
> Does that comment mean it is not necessary to use an atomic_or here?

No, only that it can't be locked. It can still be modified by another
queuer.

> > +static __always_inline void lock_set_locked(struct qspinlock *lock)
>
> nit: could just be called set_locked()

Yep.

> > +{
> > +	atomic_or(_Q_LOCKED_VAL, &lock->val);
> > +	__atomic_acquire_fence();
> > +}
> > +
> > +/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
> > +static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int val)
> > +{
> > +	int newval = _Q_LOCKED_VAL;
> > +
> > +	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
> > +		return 1;
> > +	else
> > +		return 0;
>
> same optional style nit: return (atomic_cmpxchg_acquire(&lock->val, val, newval) == val);

Am thinking about it :)

> > +}
> > +
> > +/*
> > + * Publish our tail, replacing previous tail. Return previous value.
> > + *
> > + * This provides a release barrier for publishing node, and an acquire barrier
> > + * for getting the old node.
> > + */
> > +static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)
>
> Did you change from the xchg_tail() name in the generic version because of the release and acquire barriers this provides?
> Does "publish" generally imply the old value will be returned?

Yes publish I thought is a bit more obvious that's where it becomes
visible to other CPUs. It doesn't imply return, but I thought those
semantis are the self-documenting part.

>
> >  {
> > -	while (!queued_spin_trylock(lock))
> > +	for (;;) {
> > +		int val = atomic_read(&lock->val);
> > +		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
> > +		int old;
> > +
> > +		old = atomic_cmpxchg(&lock->val, val, newval);
> > +		if (old == val)
> > +			return old;
> > +	}
> > +}
> > +
> > +static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
> > +{
> > +	int cpu = get_tail_cpu(val);
> > +	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
> > +	int idx;
> > +
> > +	for (idx = 0; idx < MAX_NODES; idx++) {
> > +		struct qnode *qnode = &qnodesp->nodes[idx];
> > +		if (qnode->lock == lock)
> > +			return qnode;
> > +	}
>
> In case anyone else is confused by this, Nick explained each cpu can only queue on a unique spinlock once regardless of "idx" level.
>
> > +
> > +	BUG();
> > +}
> > +
> > +static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> > +{
> > +	struct qnodes *qnodesp;
> > +	struct qnode *next, *node;
> > +	int val, old, tail;
> > +	int idx;
> > +
> > +	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> > +
> > +	qnodesp = this_cpu_ptr(&qnodes);
> > +	if (unlikely(qnodesp->count == MAX_NODES)) {
>
> The comparison is >= in the generic, I guess we've no nested NMI so this is safe?

No... we could have nested NMI so this is wrong, good catch.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.
  2022-11-10  0:39   ` Jordan Niethe
@ 2022-11-10  9:25     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10  9:25 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:39 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > The first 16 bits of the lock are only modified by the owner, and other
> > modifications always use atomic operations on the entire 32 bits, so
> > unlocks can use plain stores on the 16 bits. This is the same kind of
> > optimisation done by core qspinlock code.
> > ---
> >  arch/powerpc/include/asm/qspinlock.h       |  6 +-----
> >  arch/powerpc/include/asm/qspinlock_types.h | 19 +++++++++++++++++--
> >  2 files changed, 18 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> > index f06117aa60e1..79a1936fb68d 100644
> > --- a/arch/powerpc/include/asm/qspinlock.h
> > +++ b/arch/powerpc/include/asm/qspinlock.h
> > @@ -38,11 +38,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
> >  
> >  static inline void queued_spin_unlock(struct qspinlock *lock)
> >  {
> > -	for (;;) {
> > -		int val = atomic_read(&lock->val);
> > -		if (atomic_cmpxchg_release(&lock->val, val, val & ~_Q_LOCKED_VAL) == val)
> > -			return;
> > -	}
> > +	smp_store_release(&lock->locked, 0);
>
> Is it also possible for lock_set_locked() to use a non-atomic acquire
> operation?

It has to be atomic as mentioned earlier.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly
  2022-11-10  0:39   ` Jordan Niethe
  2022-11-10  8:36     ` Christophe Leroy
@ 2022-11-10  9:40     ` Nicholas Piggin
  1 sibling, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10  9:40 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:39 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > This uses more optimal ll/sc style access patterns (rather than
> > cmpxchg), and also sets the EH=1 lock hint on those operations
> > which acquire ownership of the lock.
> > ---
> >  arch/powerpc/include/asm/qspinlock.h       | 25 +++++--
> >  arch/powerpc/include/asm/qspinlock_types.h |  6 +-
> >  arch/powerpc/lib/qspinlock.c               | 81 +++++++++++++++-------
> >  3 files changed, 79 insertions(+), 33 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> > index 79a1936fb68d..3ab354159e5e 100644
> > --- a/arch/powerpc/include/asm/qspinlock.h
> > +++ b/arch/powerpc/include/asm/qspinlock.h
> > @@ -2,28 +2,43 @@
> >  #ifndef _ASM_POWERPC_QSPINLOCK_H
> >  #define _ASM_POWERPC_QSPINLOCK_H
> >  
> > -#include <linux/atomic.h>
> >  #include <linux/compiler.h>
> >  #include <asm/qspinlock_types.h>
> >  
> >  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
> >  {
> > -	return atomic_read(&lock->val);
> > +	return READ_ONCE(lock->val);
> >  }
> >  
> >  static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
> >  {
> > -	return !atomic_read(&lock.val);
> > +	return !lock.val;
> >  }
> >  
> >  static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
> >  {
> > -	return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
> > +	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
> >  }
> >  
> >  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> >  {
> > -	if (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0)
> > +	u32 new = _Q_LOCKED_VAL;
> > +	u32 prev;
> > +
> > +	asm volatile(
> > +"1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"
> > +"	cmpwi	0,%0,0							\n"
> > +"	bne-	2f							\n"
> > +"	stwcx.	%2,0,%1							\n"
> > +"	bne-	1b							\n"
> > +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> > +"2:									\n"
> > +	: "=&r" (prev)
> > +	: "r" (&lock->val), "r" (new),
> > +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
>
> btw IS_ENABLED() already returns 1 or 0

I guess we already do that in atomic.h too. Okay.

> > +	: "cr0", "memory");
>
> This is the ISA's "test and set" atomic primitive. Do you think it would be worth seperating it as a helper?

It ends up getting more complex as we go. I might leave some of these
primitives open coded for now, we could possibly look at providing them
or reusing more generic primitives after the series though.

> > +
> > +	if (likely(prev == 0))
> >  		return 1;
> >  	return 0;
>
> same optional style nit: return likely(prev == 0);

Will do.

>
> >  }
> > diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> > index 3425dab42576..210adf05b235 100644
> > --- a/arch/powerpc/include/asm/qspinlock_types.h
> > +++ b/arch/powerpc/include/asm/qspinlock_types.h
> > @@ -7,7 +7,7 @@
> >  
> >  typedef struct qspinlock {
> >  	union {
> > -		atomic_t val;
> > +		u32 val;
> >  
> >  #ifdef __LITTLE_ENDIAN
> >  		struct {
> > @@ -23,10 +23,10 @@ typedef struct qspinlock {
> >  	};
> >  } arch_spinlock_t;
> >  
> > -#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
> > +#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = 0 } }
> >  
> >  /*
> > - * Bitfields in the atomic value:
> > + * Bitfields in the lock word:
> >   *
> >   *     0: locked bit
> >   * 16-31: tail cpu (+1)
> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index 5ebb88d95636..7c71e5e287df 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -1,5 +1,4 @@
> >  // SPDX-License-Identifier: GPL-2.0-or-later
> > -#include <linux/atomic.h>
> >  #include <linux/bug.h>
> >  #include <linux/compiler.h>
> >  #include <linux/export.h>
> > @@ -22,32 +21,59 @@ struct qnodes {
> >  
> >  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> >  
> > -static inline int encode_tail_cpu(void)
> > +static inline u32 encode_tail_cpu(void)
> >  {
> >  	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> >  }
> >  
> > -static inline int get_tail_cpu(int val)
> > +static inline int get_tail_cpu(u32 val)
> >  {
> >  	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
> >  }
> >  
> >  /* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
>
> I think you missed deleting the above line.
>
> > +/* Take the lock by setting the lock bit, no other CPUs will touch it. */
> >  static __always_inline void lock_set_locked(struct qspinlock *lock)
> >  {
> > -	atomic_or(_Q_LOCKED_VAL, &lock->val);
> > -	__atomic_acquire_fence();
> > +	u32 new = _Q_LOCKED_VAL;
> > +	u32 prev;
> > +
> > +	asm volatile(
> > +"1:	lwarx	%0,0,%1,%3	# lock_set_locked			\n"
> > +"	or	%0,%0,%2						\n"
> > +"	stwcx.	%0,0,%1							\n"
> > +"	bne-	1b							\n"
> > +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> > +	: "=&r" (prev)
> > +	: "r" (&lock->val), "r" (new),
> > +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> > +	: "cr0", "memory");
> >  }
>
> This is pretty similar with the DEFINE_TESTOP() pattern from
> arch/powerpc/include/asm/bitops.h (such as test_and_set_bits_lock()) except for
> word instead of double word. Do you think it's possible / beneficial to make
> use of those macros?

If we could pull almost all our atomic primitives into one place and
make them usable by atomics, bitops, locks, etc. might be a good idea.

That function specifically works on a dword so we can't use it here,
and I don't want to modify any files except for the new ones in this
series if possible, but consolidating our primitives a bit more would
be nice.

> > -/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
> > -static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int val)
> > +/* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
> > +static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
> >  {
> > -	int newval = _Q_LOCKED_VAL;
> > -
> > -	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
> > +	u32 new = _Q_LOCKED_VAL;
> > +	u32 prev;
> > +
> > +	BUG_ON(old & _Q_LOCKED_VAL);
>
> The BUG_ON() could have been introduced in an earlier patch I think.

Yes.

> > +
> > +	asm volatile(
> > +"1:	lwarx	%0,0,%1,%4	# trylock_clear_tail_cpu		\n"
> > +"	cmpw	0,%0,%2							\n"
> > +"	bne-	2f							\n"
> > +"	stwcx.	%3,0,%1							\n"
> > +"	bne-	1b							\n"
> > +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> > +"2:									\n"
> > +	: "=&r" (prev)
> > +	: "r" (&lock->val), "r"(old), "r" (new),
>
> Could this be like  "r"(_Q_TAIL_CPU_MASK) below?
> i.e. "r" (_Q_LOCKED_VAL)? Makes it clear new doesn't change.

Sure.

>
> > +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> > +	: "cr0", "memory");
> > +
> > +	if (likely(prev == old))
> >  		return 1;
> > -	else
> > -		return 0;
> > +	return 0;
> >  }
> >  
> >  /*
> > @@ -56,20 +82,25 @@ static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int va
> >   * This provides a release barrier for publishing node, and an acquire barrier
>
> Does the comment mean there needs to be an acquire barrier in this assembly?

Yes, another good catch. What I'm going to do instead is add the acquire
to get_tail_qnode() because that path is only hit when you have multiple
waiters, and I think pairing it that way makes the barriers more
obvious.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing
  2022-11-10  0:40   ` Jordan Niethe
@ 2022-11-10 10:54     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 10:54 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:40 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > Allow new waiters a number of spins on the lock word before queueing,
> > which particularly helps paravirt performance when physical CPUs are
> > oversubscribed.
> > ---
> >  arch/powerpc/lib/qspinlock.c | 152 ++++++++++++++++++++++++++++++++---
> >  1 file changed, 141 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index 7c71e5e287df..1625cce714b2 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -19,8 +19,17 @@ struct qnodes {
> >  	struct qnode nodes[MAX_NODES];
> >  };
> >  
> > +/* Tuning parameters */
> > +static int STEAL_SPINS __read_mostly = (1<<5);
> > +static bool MAYBE_STEALERS __read_mostly = true;
>
> I can understand why, but macro case variables can be a bit confusing.

Yeah they started out as #defines. I'll change them.

> > +
> >  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> >  
> > +static __always_inline int get_steal_spins(void)
> > +{
> > +	return STEAL_SPINS;
> > +}
> > +
> >  static inline u32 encode_tail_cpu(void)
> >  {
> >  	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> > @@ -76,6 +85,39 @@ static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 ol
> >  	return 0;
> >  }
> >  
> > +static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 old, u32 new)
> > +{
> > +	u32 prev;
> > +
> > +	BUG_ON(old & _Q_LOCKED_VAL);
> > +
> > +	asm volatile(
> > +"1:	lwarx	%0,0,%1,%4	# queued_spin_trylock_cmpxchg		\n"
>
> s/queued_spin_trylock_cmpxchg/__trylock_cmpxchg/

Yes.

> btw what is the format you using for the '\n's in the inline asm?

Ah, not really sure :P

> > +"	cmpw	0,%0,%2							\n"
> > +"	bne-	2f							\n"
> > +"	stwcx.	%3,0,%1							\n"
> > +"	bne-	1b							\n"
> > +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> > +"2:									\n"
> > +	: "=&r" (prev)
> > +	: "r" (&lock->val), "r"(old), "r" (new),
> > +	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> > +	: "cr0", "memory");
>
> This is very similar to trylock_clear_tail_cpu(). So maybe it is worth having
> some form of "test and set" primitive helper.

Yes I was able to consolidate these two, good point.

> > +
> > +	return prev;
> > +}
> > +
> > +/* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
> > +static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 val)
> > +{
> > +	u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
> > +
> > +	if (__trylock_cmpxchg(lock, val, newval) == val)
> > +		return 1;
> > +	else
> > +		return 0;
>
> same optional style nit: return __trylock_cmpxchg(lock, val, newval) == val
>
> > +}
> > +
> >  /*
> >   * Publish our tail, replacing previous tail. Return previous value.
> >   *
> > @@ -115,6 +157,31 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
> >  	BUG();
> >  }
> >  
> > +static inline bool try_to_steal_lock(struct qspinlock *lock)
> > +{
> > +	int iters;
> > +
> > +	/* Attempt to steal the lock */
> > +	for (;;) {
> > +		u32 val = READ_ONCE(lock->val);
> > +
> > +		if (unlikely(!(val & _Q_LOCKED_VAL))) {
> > +			if (trylock_with_tail_cpu(lock, val))
> > +				return true;
> > +			continue;
> > +		}
>
> The continue would bypass iters++/cpu_relax but the next time around
>   if (unlikely(!(val & _Q_LOCKED_VAL))) {
> should fail so everything should be fine?

Yes it should. I suppose it could starve in theory though. Maybe
I'll change it to count as an iteration.

> > +#include <linux/debugfs.h>
> > +static int steal_spins_set(void *data, u64 val)
> > +{
> > +	static DEFINE_MUTEX(lock);
>
> I just want to check if it would be possible to get rid of the MAYBE_STEALERS
> variable completely and do something like:
>
>   bool maybe_stealers() { return STEAL_SPINS > 0; }
>
> I guess based on the below code it wouldn't work, but I'm still not quite sure
> why that is.

Because the slowpath has a !maybe_stealers path which assumes the
lock won't be stolen so it doesn't need to cmpxchg the lock bit on,
among other things.

I'll add a bit more comment.

> > +
> > +	mutex_lock(&lock);
> > +	if (val && !STEAL_SPINS) {
> > +		MAYBE_STEALERS = true;
> > +		/* wait for waiter to go away */
> > +		synchronize_rcu();
> > +		STEAL_SPINS = val;
> > +	} else if (!val && STEAL_SPINS) {
> > +		STEAL_SPINS = val;
> > +		/* wait for all possible stealers to go away */
> > +		synchronize_rcu();
> > +		MAYBE_STEALERS = false;
> > +	} else {
> > +		STEAL_SPINS = val;
> > +	}
> > +	mutex_unlock(&lock);
>
> STEAL_SPINS is an int not a u64.

Yeah but that's how the DEFINE_SIMPLE_ATTRIBUTE things seem to work,
unfortunately.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency
  2022-11-10  0:40   ` Jordan Niethe
@ 2022-11-10 10:57     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 10:57 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:40 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > Give the queue head the ability to stop stealers. After a number of
> > spins without sucessfully acquiring the lock, the queue head employs
> > this, which will assure it is the next owner.
> > ---
> >  arch/powerpc/include/asm/qspinlock_types.h | 10 +++-
> >  arch/powerpc/lib/qspinlock.c               | 56 +++++++++++++++++++++-
> >  2 files changed, 63 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> > index 210adf05b235..8b20f5e22bba 100644
> > --- a/arch/powerpc/include/asm/qspinlock_types.h
> > +++ b/arch/powerpc/include/asm/qspinlock_types.h
> > @@ -29,7 +29,8 @@ typedef struct qspinlock {
> >   * Bitfields in the lock word:
> >   *
> >   *     0: locked bit
> > - * 16-31: tail cpu (+1)
> > + *    16: must queue bit
> > + * 17-31: tail cpu (+1)
> >   */
> >  #define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
> >  				      << _Q_ ## type ## _OFFSET)
> > @@ -38,7 +39,12 @@ typedef struct qspinlock {
> >  #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
> >  #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
> >  
> > -#define _Q_TAIL_CPU_OFFSET	16
> > +#define _Q_MUST_Q_OFFSET	16
> > +#define _Q_MUST_Q_BITS		1
> > +#define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
> > +#define _Q_MUST_Q_VAL		(1U << _Q_MUST_Q_OFFSET)
> > +
> > +#define _Q_TAIL_CPU_OFFSET	17
> >  #define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
> >  #define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
>
> Not a big deal but some of these values could be calculated like in the
> generic version. e.g.
>
> 	#define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET +_Q_LOCKED_BITS)

Yeah, we don't *really* have more than one locked bit though. Haven't
made up my mind about these defines yet.

> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index 1625cce714b2..a906cc8f15fa 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -22,6 +22,7 @@ struct qnodes {
> >  /* Tuning parameters */
> >  static int STEAL_SPINS __read_mostly = (1<<5);
> >  static bool MAYBE_STEALERS __read_mostly = true;
> > +static int HEAD_SPINS __read_mostly = (1<<8);
> >  
> >  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> >  
> > @@ -30,6 +31,11 @@ static __always_inline int get_steal_spins(void)
> >  	return STEAL_SPINS;
> >  }
> >  
> > +static __always_inline int get_head_spins(void)
> > +{
> > +	return HEAD_SPINS;
> > +}
> > +
> >  static inline u32 encode_tail_cpu(void)
> >  {
> >  	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> > @@ -142,6 +148,23 @@ static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
> >  	return prev;
> >  }
> >  
> > +static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
> > +{
> > +	u32 new = _Q_MUST_Q_VAL;
> > +	u32 prev;
> > +
> > +	asm volatile(
> > +"1:	lwarx	%0,0,%1		# lock_set_mustq			\n"
>
> Is the EH bit not set because we don't hold the lock here?

Right, we're still waiting for it.

> > +"	or	%0,%0,%2						\n"
> > +"	stwcx.	%0,0,%1							\n"
> > +"	bne-	1b							\n"
> > +	: "=&r" (prev)
> > +	: "r" (&lock->val), "r" (new)
> > +	: "cr0", "memory");
>
> This is another usage close to the DEFINE_TESTOP() pattern.
>
> > +
> > +	return prev;
> > +}
> > +
> >  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
> >  {
> >  	int cpu = get_tail_cpu(val);
> > @@ -165,6 +188,9 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
> >  	for (;;) {
> >  		u32 val = READ_ONCE(lock->val);
> >  
> > +		if (val & _Q_MUST_Q_VAL)
> > +			break;
> > +
> >  		if (unlikely(!(val & _Q_LOCKED_VAL))) {
> >  			if (trylock_with_tail_cpu(lock, val))
> >  				return true;
> > @@ -246,11 +272,22 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> >  		/* We must be the owner, just set the lock bit and acquire */
> >  		lock_set_locked(lock);
> >  	} else {
> > +		int iters = 0;
> > +		bool set_mustq = false;
> > +
> >  again:
> >  		/* We're at the head of the waitqueue, wait for the lock. */
> > -		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> > +		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> >  			cpu_relax();
> >  
> > +			iters++;
>
> It seems instead of using set_mustq, (val & _Q_MUST_Q_VAL) could be checked?

I wanted to give the reader (and compiler for what that's worth) the
idea that it won't change concurrently after we set it.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word
  2022-11-10  0:40   ` Jordan Niethe
@ 2022-11-10 10:59     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 10:59 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:40 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > Store the owner CPU number in the lock word so it may be yielded to,
> > as powerpc's paravirtualised simple spinlocks do.
> > ---
> >  arch/powerpc/include/asm/qspinlock.h       |  8 +++++++-
> >  arch/powerpc/include/asm/qspinlock_types.h | 10 ++++++++++
> >  arch/powerpc/lib/qspinlock.c               |  6 +++---
> >  3 files changed, 20 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> > index 3ab354159e5e..44601b261e08 100644
> > --- a/arch/powerpc/include/asm/qspinlock.h
> > +++ b/arch/powerpc/include/asm/qspinlock.h
> > @@ -20,9 +20,15 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
> >  	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
> >  }
> >  
> > +static __always_inline u32 queued_spin_get_locked_val(void)
>
> Maybe this function should have "encode" in the name to match with
> encode_tail_cpu().

Yep.

> > +{
> > +	/* XXX: make this use lock value in paca like simple spinlocks? */
>
> Is that the paca's lock_token which is 0x8000?

Yes, which AFAIKS is actually unused now with queued spinlocks.

> > +	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
> > +}
> > +
> >  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> >  {
> > -	u32 new = _Q_LOCKED_VAL;
> > +	u32 new = queued_spin_get_locked_val();
> >  	u32 prev;
> >  
> >  	asm volatile(
> > diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> > index 8b20f5e22bba..35f9525381e6 100644
> > --- a/arch/powerpc/include/asm/qspinlock_types.h
> > +++ b/arch/powerpc/include/asm/qspinlock_types.h
> > @@ -29,6 +29,8 @@ typedef struct qspinlock {
> >   * Bitfields in the lock word:
> >   *
> >   *     0: locked bit
> > + *  1-14: lock holder cpu
> > + *    15: unused bit
> >   *    16: must queue bit
> >   * 17-31: tail cpu (+1)
>
> So there is one more bit to store the tail cpu vs the lock holder cpu?

Yeah but the tail has to encode it as CPU+1.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner
  2022-11-10  0:41   ` Jordan Niethe
@ 2022-11-10 11:13     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:13 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:41 AM AEST, Jordan Niethe wrote:
>  On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
>  [resend as utf-8, not utf-7]
> > Waiters spinning on the lock word should yield to the lock owner if the
> > vCPU is preempted. This improves performance when the hypervisor has
> > oversubscribed physical CPUs.
> > ---
> >  arch/powerpc/lib/qspinlock.c | 97 ++++++++++++++++++++++++++++++------
> >  1 file changed, 83 insertions(+), 14 deletions(-)
> > 
> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index aa26cfe21f18..55286ac91da5 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -5,6 +5,7 @@
> >  #include <linux/percpu.h>
> >  #include <linux/smp.h>
> >  #include <asm/qspinlock.h>
> > +#include <asm/paravirt.h>
> >  
> >  #define MAX_NODES	4
> >  
> > @@ -24,14 +25,16 @@ static int STEAL_SPINS __read_mostly = (1<<5);
> >  static bool MAYBE_STEALERS __read_mostly = true;
> >  static int HEAD_SPINS __read_mostly = (1<<8);
> >  
> > +static bool pv_yield_owner __read_mostly = true;
>
> Not macro case for these globals? To me name does not make it super clear this
> is a boolean. What about pv_yield_owner_enabled?

Hmm. Might think about doing a better prefix namespace for these
tunables, which might help.

> > +
> >  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> >  
> > -static __always_inline int get_steal_spins(void)
> > +static __always_inline int get_steal_spins(bool paravirt)
> >  {
> >  	return STEAL_SPINS;
> >  }
> >  
> > -static __always_inline int get_head_spins(void)
> > +static __always_inline int get_head_spins(bool paravirt)
> >  {
> >  	return HEAD_SPINS;
> >  }
> > @@ -46,7 +49,11 @@ static inline int get_tail_cpu(u32 val)
> >  	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
> >  }
> >  
> > -/* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
> > +static inline int get_owner_cpu(u32 val)
> > +{
> > +	return (val & _Q_OWNER_CPU_MASK) >> _Q_OWNER_CPU_OFFSET;
> > +}
> > +
> >  /* Take the lock by setting the lock bit, no other CPUs will touch it. */
> >  static __always_inline void lock_set_locked(struct qspinlock *lock)
> >  {
> > @@ -180,7 +187,45 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
> >  	BUG();
> >  }
> >  
> > -static inline bool try_to_steal_lock(struct qspinlock *lock)
> > +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
>
> This name doesn't seem correct for the non paravirt case.

Well... a yield to a running CPU is just a relax in any case. I think
it's okay.

> > +{
> > +	int owner;
> > +	u32 yield_count;
> > +
> > +	BUG_ON(!(val & _Q_LOCKED_VAL));
> > +
> > +	if (!paravirt)
> > +		goto relax;
> > +
> > +	if (!pv_yield_owner)
> > +		goto relax;
> > +
> > +	owner = get_owner_cpu(val);
> > +	yield_count = yield_count_of(owner);
> > +
> > +	if ((yield_count & 1) == 0)
> > +		goto relax; /* owner vcpu is running */
>
> I wonder why not use vcpu_is_preempted()?

Because we use a particular yield_count for the yield hcall (it
tries to avoid the situation where the owner wakes up and may release
the lock and then we yield to it).

>
> > +
> > +	/*
> > +	 * Read the lock word after sampling the yield count. On the other side
> > +	 * there may a wmb because the yield count update is done by the
> > +	 * hypervisor preemption and the value update by the OS, however this
> > +	 * ordering might reduce the chance of out of order accesses and
> > +	 * improve the heuristic.
> > +	 */
> > +	smp_rmb();
> > +
> > +	if (READ_ONCE(lock->val) == val) {
> > +		yield_to_preempted(owner, yield_count);
> > +		/* Don't relax if we yielded. Maybe we should? */
> > +		return;
> > +	}
> > +relax:
> > +	cpu_relax();
> > +}
> > +
> > +
> > +static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
> >  {
> >  	int iters;
> >  
> > @@ -197,18 +242,18 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
> >  			continue;
> >  		}
> >  
> > -		cpu_relax();
> > +		yield_to_locked_owner(lock, val, paravirt);
> >  
> >  		iters++;
> >  
> > -		if (iters >= get_steal_spins())
> > +		if (iters >= get_steal_spins(paravirt))
> >  			break;
> >  	}
> >  
> >  	return false;
> >  }
> >  
> > -static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> > +static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool paravirt)
> >  {
> >  	struct qnodes *qnodesp;
> >  	struct qnode *next, *node;
> > @@ -260,7 +305,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> >  	if (!MAYBE_STEALERS) {
> >  		/* We're at the head of the waitqueue, wait for the lock. */
> >  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> > -			cpu_relax();
> > +			yield_to_locked_owner(lock, val, paravirt);
> >  
> >  		/* If we're the last queued, must clean up the tail. */
> >  		if ((val & _Q_TAIL_CPU_MASK) == tail) {
> > @@ -278,10 +323,10 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> >  again:
> >  		/* We're at the head of the waitqueue, wait for the lock. */
> >  		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> > -			cpu_relax();
> > +			yield_to_locked_owner(lock, val, paravirt);
> >  
> >  			iters++;
> > -			if (!set_mustq && iters >= get_head_spins()) {
> > +			if (!set_mustq && iters >= get_head_spins(paravirt)) {
> >  				set_mustq = true;
> >  				lock_set_mustq(lock);
> >  				val |= _Q_MUST_Q_VAL;
> > @@ -320,10 +365,15 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> >  
> >  void queued_spin_lock_slowpath(struct qspinlock *lock)
> >  {
> > -	if (try_to_steal_lock(lock))
> > -		return;
> > -
> > -	queued_spin_lock_mcs_queue(lock);
> > +	if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor()) {
> > +		if (try_to_steal_lock(lock, true))
> > +			return;
> > +		queued_spin_lock_mcs_queue(lock, true);
> > +	} else {
> > +		if (try_to_steal_lock(lock, false))
> > +			return;
> > +		queued_spin_lock_mcs_queue(lock, false);
> > +	}
> >  }
>
> There is not really a need for a conditional: 
>
> bool paravirt = IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
> is_shared_processor();
>
> if (try_to_steal_lock(lock, paravirt))
> 	return;
>
> queued_spin_lock_mcs_queue(lock, paravirt);
>
>
> The paravirt parameter used by the various functions seems always to be
> equivalent to (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor()).
> I wonder if it would be simpler testing (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor())
> (using a helper function) in those functions instead passing it as a parameter?

You'd think so and yes semantically that's identical, but with my
version gcc-12 seems to inline each side and with yours they are
more shared. We actually want the separate versions because
is_shared_processor() is set at boot so we always only run one side
or the other so we want best efficiency possible and don't have the
icache pollution concern because the other side never runs.

At least that's the idea, that's what generic qspinlocks do too.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node
  2022-11-10  0:41   ` Jordan Niethe
@ 2022-11-10 11:14     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:14 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:41 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > Queued waiters which are not at the head of the queue don't spin on
> > the lock word but their qnode lock word, waiting for the previous queued
> > CPU to release them. Add an option which allows these waiters to yield
> > to the previous CPU if its vCPU is preempted.
> > 
> > Disable this option by default for now, i.e., no logical change.
> > ---
> >  arch/powerpc/lib/qspinlock.c | 46 +++++++++++++++++++++++++++++++++++-
> >  1 file changed, 45 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index 55286ac91da5..b39f8c5b329c 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
> >  static int HEAD_SPINS __read_mostly = (1<<8);
> >  
> >  static bool pv_yield_owner __read_mostly = true;
> > +static bool pv_yield_prev __read_mostly = true;
>
> Similiar suggestion, maybe pv_yield_prev_enabled would read better.
>
> Isn't this enabled by default contrary to the commit message?

Yeah a few of those changelogs got out of synch.

>
> >  
> >  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> >  
> > @@ -224,6 +225,31 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
> >  	cpu_relax();
> >  }
> >  
> > +static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
>
> yield_to_locked_owner() takes a raw val and works out the cpu to yield to.
> I think for consistency have yield_to_prev() take the raw val and work it out too.

Good thinking.

> > +{
> > +	u32 yield_count;
> > +
> > +	if (!paravirt)
> > +		goto relax;
> > +
> > +	if (!pv_yield_prev)
> > +		goto relax;
> > +
> > +	yield_count = yield_count_of(prev_cpu);
> > +	if ((yield_count & 1) == 0)
> > +		goto relax; /* owner vcpu is running */
> > +
> > +	smp_rmb(); /* See yield_to_locked_owner comment */
> > +
> > +	if (!node->locked) {
> > +		yield_to_preempted(prev_cpu, yield_count);
> > +		return;
> > +	}
> > +
> > +relax:
> > +	cpu_relax();
> > +}
> > +
> >  
> >  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
> >  {
> > @@ -291,13 +317,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
> >  	 */
> >  	if (old & _Q_TAIL_CPU_MASK) {
> >  		struct qnode *prev = get_tail_qnode(lock, old);
> > +		int prev_cpu = get_tail_cpu(old);
>
> This could then be removed.
>
> >  
> >  		/* Link @node into the waitqueue. */
> >  		WRITE_ONCE(prev->next, node);
> >  
> >  		/* Wait for mcs node lock to be released */
> >  		while (!node->locked)
> > -			cpu_relax();
> > +			yield_to_prev(lock, node, prev_cpu, paravirt);
>
> And would have this as:
> 			yield_to_prev(lock, node, old, paravirt);

Yep.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields
  2022-11-10  0:42   ` Jordan Niethe
@ 2022-11-10 11:22     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:22 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:42 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > If the head of queue is preventing stealing but it finds the owner vCPU
> > is preempted, it will yield its cycles to the owner which could cause it
> > to become preempted. Add an option to re-allow stealers before yielding,
> > and disallow them again after returning from the yield.
> > 
> > Disable this option by default for now, i.e., no logical change.
> > ---
> >  arch/powerpc/lib/qspinlock.c | 56 ++++++++++++++++++++++++++++++++++--
> >  1 file changed, 53 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index b39f8c5b329c..94f007f66942 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
> >  static int HEAD_SPINS __read_mostly = (1<<8);
> >  
> >  static bool pv_yield_owner __read_mostly = true;
> > +static bool pv_yield_allow_steal __read_mostly = false;
>
> To me this one does read as a boolean, but if you go with those other changes
> I'd make it pv_yield_steal_enable to be consistent.
>
> >  static bool pv_yield_prev __read_mostly = true;
> >  
> >  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> > @@ -173,6 +174,23 @@ static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
> >  	return prev;
> >  }
> >  
> > +static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
> > +{
> > +	u32 new = _Q_MUST_Q_VAL;
> > +	u32 prev;
> > +
> > +	asm volatile(
> > +"1:	lwarx	%0,0,%1		# lock_clear_mustq			\n"
> > +"	andc	%0,%0,%2						\n"
> > +"	stwcx.	%0,0,%1							\n"
> > +"	bne-	1b							\n"
> > +	: "=&r" (prev)
> > +	: "r" (&lock->val), "r" (new)
> > +	: "cr0", "memory");
> > +
>
> This is pretty similar to the DEFINE_TESTOP() pattern again with the same llong caveat.
>
>
> > +	return prev;
> > +}
> > +
> >  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
> >  {
> >  	int cpu = get_tail_cpu(val);
> > @@ -188,7 +206,7 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
> >  	BUG();
> >  }
> >  
> > -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> > +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
>
>  /* See yield_to_locked_owner comment */ comment needs to be updated now.

Yep.

> >  {
> >  	int owner;
> >  	u32 yield_count;
> > @@ -217,7 +235,11 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
> >  	smp_rmb();
> >  
> >  	if (READ_ONCE(lock->val) == val) {
> > +		if (clear_mustq)
> > +			lock_clear_mustq(lock);
> >  		yield_to_preempted(owner, yield_count);
> > +		if (clear_mustq)
> > +			lock_set_mustq(lock);
> >  		/* Don't relax if we yielded. Maybe we should? */
> >  		return;
> >  	}
> > @@ -225,6 +247,16 @@ static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 va
> >  	cpu_relax();
> >  }
> >  
> > +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> > +{
> > +	__yield_to_locked_owner(lock, val, paravirt, false);
> > +}
> > +
> > +static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
> > +{
>
> The check for pv_yield_allow_steal seems like it could go here instead of
> being done by the caller.
> __yield_to_locked_owner() checks for pv_yield_owner so it seems more
>   consistent.

Yeah that worked and is probably an improvement.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue
  2022-11-10  0:42   ` Jordan Niethe
@ 2022-11-10 11:25     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:25 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:42 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > Having all CPUs poll the lock word for the owner CPU that should be
> > yielded to defeats most of the purpose of using MCS queueing for
> > scalability. Yet it may be desirable for queued waiters to to yield
> > to a preempted owner.
> > 
> > s390 addreses this problem by having queued waiters sample the lock
> > word to find the owner much less frequently. In this approach, the
> > waiters never sample it directly, but the queue head propagates the
> > owner CPU back to the next waiter if it ever finds the owner has
> > been preempted. Queued waiters then subsequently propagate the owner
> > CPU back to the next waiter, and so on.
> > 
> > Disable this option by default for now, i.e., no logical change.
> > ---
> >  arch/powerpc/lib/qspinlock.c | 85 +++++++++++++++++++++++++++++++++++-
> >  1 file changed, 84 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index 94f007f66942..28c85a2d5635 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -12,6 +12,7 @@
> >  struct qnode {
> >  	struct qnode	*next;
> >  	struct qspinlock *lock;
> > +	int		yield_cpu;
> >  	u8		locked; /* 1 if lock acquired */
> >  };
> >  
> > @@ -28,6 +29,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
> >  static bool pv_yield_owner __read_mostly = true;
> >  static bool pv_yield_allow_steal __read_mostly = false;
> >  static bool pv_yield_prev __read_mostly = true;
> > +static bool pv_yield_propagate_owner __read_mostly = true;
>
> This also seems to be enabled by default.
>
> >  
> >  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> >  
> > @@ -257,13 +259,66 @@ static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u
> >  	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> >  }
> >  
> > +static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
> > +{
> > +	struct qnode *next;
> > +	int owner;
> > +
> > +	if (!paravirt)
> > +		return;
> > +	if (!pv_yield_propagate_owner)
> > +		return;
> > +
> > +	owner = get_owner_cpu(val);
> > +	if (*set_yield_cpu == owner)
> > +		return;
> > +
> > +	next = READ_ONCE(node->next);
> > +	if (!next)
> > +		return;
> > +
> > +	if (vcpu_is_preempted(owner)) {
>
> Is there a difference about using vcpu_is_preempted() here
> vs checking bit 0 in other places?

Yeah we weren't yielding here so didn't have to load the yield count
explicitly.

> > +		next->yield_cpu = owner;
> > +		*set_yield_cpu = owner;
> > +	} else if (*set_yield_cpu != -1) {
>
> It might be worth giving the -1 CPU a #define.

It's fairly standard to use -1 as a null value for cpu.

>
> > +		next->yield_cpu = owner;
> > +		*set_yield_cpu = owner;
> > +	}
> > +}
>
> Does this need to pass set_yield_cpu by reference? Couldn't it's new value be
> returned? To me it makes it more clear the function is used to change
> set_yield_cpu. I think this would work:
>
> int set_yield_cpu = -1;
>
> static __always_inline int propagate_yield_cpu(struct qnode *node, u32 val, int set_yield_cpu, bool paravirt)
> {
> 	struct qnode *next;
> 	int owner;
>
> 	if (!paravirt)
> 		goto out;
> 	if (!pv_yield_propagate_owner)
> 		goto out;
>
> 	owner = get_owner_cpu(val);
> 	if (set_yield_cpu == owner)
> 		goto out;
>
> 	next = READ_ONCE(node->next);
> 	if (!next)
> 		goto out;
>
> 	if (vcpu_is_preempted(owner)) {
> 		next->yield_cpu = owner;
> 		return owner;
> 	} else if (set_yield_cpu != -1) {
> 		next->yield_cpu = owner;
> 		return owner;
> 	}
>
> out:
> 	return set_yield_cpu;
> }
>
> set_yield_cpu = propagate_yield_cpu(...  set_yield_cpu ...);

I think I prefer as is, because the caller doesn't use it anywhere. It
looks more like some temporary storage to be used by the function over
multiple calls this way.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU
  2022-11-10  0:42   ` Jordan Niethe
@ 2022-11-10 11:32     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:32 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:42 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > After the head of the queue acquires the lock, it releases the
> > next waiter in the queue to become the new head. Add an option
> > to prod the new head if its vCPU was preempted. This may only
> > have an effect if queue waiters are yielding.
> > 
> > Disable this option by default for now, i.e., no logical change.
> > ---
> >  arch/powerpc/lib/qspinlock.c | 29 ++++++++++++++++++++++++++++-
> >  1 file changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index 28c85a2d5635..3b10e31bcf0a 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -12,6 +12,7 @@
> >  struct qnode {
> >  	struct qnode	*next;
> >  	struct qspinlock *lock;
> > +	int		cpu;
> >  	int		yield_cpu;
> >  	u8		locked; /* 1 if lock acquired */
> >  };
> > @@ -30,6 +31,7 @@ static bool pv_yield_owner __read_mostly = true;
> >  static bool pv_yield_allow_steal __read_mostly = false;
> >  static bool pv_yield_prev __read_mostly = true;
> >  static bool pv_yield_propagate_owner __read_mostly = true;
> > +static bool pv_prod_head __read_mostly = false;
> >  
> >  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> >  
> > @@ -392,6 +394,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
> >  	node = &qnodesp->nodes[idx];
> >  	node->next = NULL;
> >  	node->lock = lock;
> > +	node->cpu = smp_processor_id();
>
> I suppose this could be used in some other places too.
>
> For example change:
> 	yield_to_prev(lock, node, prev, paravirt);
>
> In yield_to_prev() it could then access the prev->cpu.

That case is a bit iffy. As soon as we WRITE_ONCE to prev, the prev lock
holder can go away. It's a statically allocated array and per-CPU so it
should actually give us the right value even if that CPU queued on some
other lock again, but I think it's more straightforward just to not
touch it after that point. This is also a remote and hot cache line, so
avoiding any loads on it is nice (we have the store, but you don't have
to wait for those), and we already have val.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal
  2022-11-10  0:43   ` Jordan Niethe
@ 2022-11-10 11:35     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:35 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:43 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > This gives trylock slightly more strength, and it also gives most
> > of the benefit of passing 'val' back through the slowpath without
> > the complexity.
> > ---
> >  arch/powerpc/include/asm/qspinlock.h | 39 +++++++++++++++++++++++++++-
> >  arch/powerpc/lib/qspinlock.c         |  9 +++++++
> >  2 files changed, 47 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> > index 44601b261e08..d3d2039237b2 100644
> > --- a/arch/powerpc/include/asm/qspinlock.h
> > +++ b/arch/powerpc/include/asm/qspinlock.h
> > @@ -5,6 +5,8 @@
> >  #include <linux/compiler.h>
> >  #include <asm/qspinlock_types.h>
> >  
> > +#define _Q_SPIN_TRY_LOCK_STEAL 1
>
> Would this be a config option?

I think probably not, it's more to keep the other code variant there if
we want to try tune experiment with it. We might end up cutting out a
bunch of these options if we narrow down on a good configuration.

>
> > +
> >  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
> >  {
> >  	return READ_ONCE(lock->val);
> > @@ -26,11 +28,12 @@ static __always_inline u32 queued_spin_get_locked_val(void)
> >  	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
> >  }
> >  
> > -static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> > +static __always_inline int __queued_spin_trylock_nosteal(struct qspinlock *lock)
> >  {
> >  	u32 new = queued_spin_get_locked_val();
> >  	u32 prev;
> >  
> > +	/* Trylock succeeds only when unlocked and no queued nodes */
> >  	asm volatile(
> >  "1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"
>
> s/queued_spin_trylock/__queued_spin_trylock_nosteal

I wanted to keep those because they (can be) inlined into the wider
kernel, so you'd rather see queued_spin_trylock than this internal name.

> >  "	cmpwi	0,%0,0							\n"
> > @@ -49,6 +52,40 @@ static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> >  	return 0;
> >  }
> >  
> > +static __always_inline int __queued_spin_trylock_steal(struct qspinlock *lock)
> > +{
> > +	u32 new = queued_spin_get_locked_val();
> > +	u32 prev, tmp;
> > +
> > +	/* Trylock may get ahead of queued nodes if it finds unlocked */
> > +	asm volatile(
> > +"1:	lwarx	%0,0,%2,%5	# queued_spin_trylock			\n"
>
> s/queued_spin_trylock/__queued_spin_trylock_steal
>
> > +"	andc.	%1,%0,%4						\n"
> > +"	bne-	2f							\n"
> > +"	and	%1,%0,%4						\n"
> > +"	or	%1,%1,%3						\n"
> > +"	stwcx.	%1,0,%2							\n"
> > +"	bne-	1b							\n"
> > +"\t"	PPC_ACQUIRE_BARRIER "						\n"
> > +"2:									\n"
>
> Just because there's a little bit more going on here...
>
> Q_TAIL_CPU_MASK = 0xFFFE0000
> ~Q_TAIL_CPU_MASK = 0x1FFFF
>
>
> 1:	lwarx	prev, 0, &lock->val, IS_ENABLED_PPC64
> 	andc.	tmp, prev, _Q_TAIL_CPU_MASK 	(tmp = prev & ~_Q_TAIL_CPU_MASK)
> 	bne-	2f 				(exit if locked)
> 	and	tmp, prev, _Q_TAIL_CPU_MASK 	(tmp = prev & _Q_TAIL_CPU_MASK)
> 	or	tmp, tmp, new			(tmp |= new)					
> 	stwcx.	tmp, 0, &lock->val					
> 		
> 	bne-	1b							
> 	PPC_ACQUIRE_BARRIER		
> 2:
>
> ... which seems correct.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API
  2022-11-10  0:43   ` Jordan Niethe
@ 2022-11-10 11:36     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:36 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:43 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > Use the spin_begin/spin_cpu_relax/spin_end APIs in qspinlock, which helps
> > to prevent threads issuing a lot of expensive priority nops which may not
> > have much effect due to immediately executing low then medium priority.
>
> Just a general comment regarding the spin_{begin,end} API, more complicated
> than something like
>
> 	spin_begin()
> 	for(;;)
> 		spin_cpu_relax()
> 	spin_end()
>
> it becomes difficult to keep track of. Unfortunately, I don't have any good
> suggestions how to improve it. Hopefully with P10s wait instruction we can
> maybe try and move away from this.
>
> It might be useful to comment the functions pre and post conditions regarding
> expectations about spin_begin() and spin_end().

Yep, added some small comments.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins
  2022-11-10  0:43   ` Jordan Niethe
@ 2022-11-10 11:37     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:37 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:43 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > Allow for a reduction in the number of times a CPU from a different
> > node than the owner can attempt to steal the lock before queueing.
> > This could bias the transfer behaviour of the lock across the
> > machine and reduce NUMA crossings.
> > ---
> >  arch/powerpc/lib/qspinlock.c | 34 +++++++++++++++++++++++++++++++---
> >  1 file changed, 31 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index d4594c701f7d..24f68bd71e2b 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -4,6 +4,7 @@
> >  #include <linux/export.h>
> >  #include <linux/percpu.h>
> >  #include <linux/smp.h>
> > +#include <linux/topology.h>
> >  #include <asm/qspinlock.h>
> >  #include <asm/paravirt.h>
> >  
> > @@ -24,6 +25,7 @@ struct qnodes {
> >  
> >  /* Tuning parameters */
> >  static int STEAL_SPINS __read_mostly = (1<<5);
> > +static int REMOTE_STEAL_SPINS __read_mostly = (1<<2);
> >  #if _Q_SPIN_TRY_LOCK_STEAL == 1
> >  static const bool MAYBE_STEALERS = true;
> >  #else
> > @@ -39,9 +41,13 @@ static bool pv_prod_head __read_mostly = false;
> >  
> >  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> >  
> > -static __always_inline int get_steal_spins(bool paravirt)
> > +static __always_inline int get_steal_spins(bool paravirt, bool remote)
> >  {
> > -	return STEAL_SPINS;
> > +	if (remote) {
> > +		return REMOTE_STEAL_SPINS;
> > +	} else {
> > +		return STEAL_SPINS;
> > +	}
> >  }
> >  
> >  static __always_inline int get_head_spins(bool paravirt)
> > @@ -380,8 +386,13 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
> >  
> >  		iters++;
> >  
> > -		if (iters >= get_steal_spins(paravirt))
> > +		if (iters >= get_steal_spins(paravirt, false))
> >  			break;
> > +		if (iters >= get_steal_spins(paravirt, true)) {
>
> There's no indication of what true and false mean here which is hard to read.
> To me it feels like two separate functions would be more clear.

Good point. I reworked this a bit.

>
>
> > +			int cpu = get_owner_cpu(val);
> > +			if (numa_node_id() != cpu_to_node(cpu))
>
> What about using node_distance() instead?

We don't really need the distance, just whether or not it's the same or
not. I think distance not only has to look up the nodes, but then has to
look up a matrix to get a number.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner
  2022-11-10  0:44   ` Jordan Niethe
@ 2022-11-10 11:38     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:38 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:44 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > Provide an option that holds off queueing indefinitely while the lock
> > owner is preempted. This could reduce queueing latencies for very
> > overcommitted vcpu situations.
> > 
> > This is disabled by default.
> > ---
> >  arch/powerpc/lib/qspinlock.c | 91 +++++++++++++++++++++++++++++++-----
> >  1 file changed, 79 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index 24f68bd71e2b..5cfd69931e31 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -35,6 +35,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
> >  
> >  static bool pv_yield_owner __read_mostly = true;
> >  static bool pv_yield_allow_steal __read_mostly = false;
> > +static bool pv_spin_on_preempted_owner __read_mostly = false;
> >  static bool pv_yield_prev __read_mostly = true;
> >  static bool pv_yield_propagate_owner __read_mostly = true;
> >  static bool pv_prod_head __read_mostly = false;
> > @@ -220,13 +221,15 @@ static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
> >  	BUG();
> >  }
> >  
> > -static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
> > +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
> >  {
> >  	int owner;
> >  	u32 yield_count;
> >  
> >  	BUG_ON(!(val & _Q_LOCKED_VAL));
> >  
> > +	*preempted = false;
> > +
> >  	if (!paravirt)
> >  		goto relax;
> >  
> > @@ -241,6 +244,8 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
> >  
> >  	spin_end();
> >  
> > +	*preempted = true;
> > +
> >  	/*
> >  	 * Read the lock word after sampling the yield count. On the other side
> >  	 * there may a wmb because the yield count update is done by the
> > @@ -265,14 +270,14 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
> >  	spin_cpu_relax();
> >  }
> >  
> > -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
> > +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool *preempted)
>
> It seems like preempted parameter could be the return value of
> yield_to_locked_owner(). Then callers that don't use the value returned in
> preempted don't need to create an unnecessary variable to pass in.

That works.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks
  2022-11-10  0:44   ` Jordan Niethe
@ 2022-11-10 11:41     ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:41 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 10:44 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> [resend as utf-8, not utf-7]
> > Finding the owner or a queued waiter on a lock with a preempted vcpu
> > is indicative of an oversubscribed guest causing the lock to get into
> > trouble. Provide some options to detect this situation and have new
> > CPUs avoid queueing for a longer time (more steal iterations) to
> > minimise the problems caused by vcpu preemption on the queue.
> > ---
> >  arch/powerpc/include/asm/qspinlock_types.h |   7 +-
> >  arch/powerpc/lib/qspinlock.c               | 240 +++++++++++++++++++--
> >  2 files changed, 232 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
> > index 35f9525381e6..4fbcc8a4230b 100644
> > --- a/arch/powerpc/include/asm/qspinlock_types.h
> > +++ b/arch/powerpc/include/asm/qspinlock_types.h
> > @@ -30,7 +30,7 @@ typedef struct qspinlock {
> >   *
> >   *     0: locked bit
> >   *  1-14: lock holder cpu
> > - *    15: unused bit
> > + *    15: lock owner or queuer vcpus observed to be preempted bit
> >   *    16: must queue bit
> >   * 17-31: tail cpu (+1)
> >   */
> > @@ -49,6 +49,11 @@ typedef struct qspinlock {
> >  #error "qspinlock does not support such large CONFIG_NR_CPUS"
> >  #endif
> >  
> > +#define _Q_SLEEPY_OFFSET	15
> > +#define _Q_SLEEPY_BITS		1
> > +#define _Q_SLEEPY_MASK		_Q_SET_MASK(SLEEPY_OWNER)
> > +#define _Q_SLEEPY_VAL		(1U << _Q_SLEEPY_OFFSET)
> > +
> >  #define _Q_MUST_Q_OFFSET	16
> >  #define _Q_MUST_Q_BITS		1
> >  #define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
> > diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> > index 5cfd69931e31..c18133c01450 100644
> > --- a/arch/powerpc/lib/qspinlock.c
> > +++ b/arch/powerpc/lib/qspinlock.c
> > @@ -5,6 +5,7 @@
> >  #include <linux/percpu.h>
> >  #include <linux/smp.h>
> >  #include <linux/topology.h>
> > +#include <linux/sched/clock.h>
> >  #include <asm/qspinlock.h>
> >  #include <asm/paravirt.h>
> >  
> > @@ -36,24 +37,54 @@ static int HEAD_SPINS __read_mostly = (1<<8);
> >  static bool pv_yield_owner __read_mostly = true;
> >  static bool pv_yield_allow_steal __read_mostly = false;
> >  static bool pv_spin_on_preempted_owner __read_mostly = false;
> > +static bool pv_sleepy_lock __read_mostly = true;
> > +static bool pv_sleepy_lock_sticky __read_mostly = false;
>
> The sticky part could potentially be its own patch.

I'll see how that looks.

> > +static u64 pv_sleepy_lock_interval_ns __read_mostly = 0;
> > +static int pv_sleepy_lock_factor __read_mostly = 256;
> >  static bool pv_yield_prev __read_mostly = true;
> >  static bool pv_yield_propagate_owner __read_mostly = true;
> >  static bool pv_prod_head __read_mostly = false;
> >  
> >  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> > +static DEFINE_PER_CPU_ALIGNED(u64, sleepy_lock_seen_clock);
> >  
> > -static __always_inline int get_steal_spins(bool paravirt, bool remote)
> > +static __always_inline bool recently_sleepy(void)
> > +{
>
> Other users of pv_sleepy_lock_interval_ns first check pv_sleepy_lock.

In this case it should be implied, I've added a comment.

>
> > +	if (pv_sleepy_lock_interval_ns) {
> > +		u64 seen = this_cpu_read(sleepy_lock_seen_clock);
> > +
> > +		if (seen) {
> > +			u64 delta = sched_clock() - seen;
> > +			if (delta < pv_sleepy_lock_interval_ns)
> > +				return true;
> > +			this_cpu_write(sleepy_lock_seen_clock, 0);
> > +		}
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +static __always_inline int get_steal_spins(bool paravirt, bool remote, bool sleepy)
>
> It seems like paravirt is implied by sleepy.
>
> >  {
> >  	if (remote) {
> > -		return REMOTE_STEAL_SPINS;
> > +		if (paravirt && sleepy)
> > +			return REMOTE_STEAL_SPINS * pv_sleepy_lock_factor;
> > +		else
> > +			return REMOTE_STEAL_SPINS;
> >  	} else {
> > -		return STEAL_SPINS;
> > +		if (paravirt && sleepy)
> > +			return STEAL_SPINS * pv_sleepy_lock_factor;
> > +		else
> > +			return STEAL_SPINS;
> >  	}
> >  }
>
> I think that separate functions would still be nicer but this could get rid of
> the nesting conditionals like
>
>
> 	int spins;
> 	if (remote)
> 		spins = REMOTE_STEAL_SPINS;
> 	else
> 		spins = STEAL_SPINS;
>
> 	if (sleepy)
> 		return spins * pv_sleepy_lock_factor;
> 	return spins;

Yeah it was getting a bit out of hand.

>
> >  
> > -static __always_inline int get_head_spins(bool paravirt)
> > +static __always_inline int get_head_spins(bool paravirt, bool sleepy)
> >  {
> > -	return HEAD_SPINS;
> > +	if (paravirt && sleepy)
> > +		return HEAD_SPINS * pv_sleepy_lock_factor;
> > +	else
> > +		return HEAD_SPINS;
> >  }
> >  
> >  static inline u32 encode_tail_cpu(void)
> > @@ -206,6 +237,60 @@ static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
> >  	return prev;
> >  }
> >  
> > +static __always_inline bool lock_try_set_sleepy(struct qspinlock *lock, u32 old)
> > +{
> > +	u32 prev;
> > +	u32 new = old | _Q_SLEEPY_VAL;
> > +
> > +	BUG_ON(!(old & _Q_LOCKED_VAL));
> > +	BUG_ON(old & _Q_SLEEPY_VAL);
> > +
> > +	asm volatile(
> > +"1:	lwarx	%0,0,%1		# lock_try_set_sleepy			\n"
> > +"	cmpw	0,%0,%2							\n"
> > +"	bne-	2f							\n"
> > +"	stwcx.	%3,0,%1							\n"
> > +"	bne-	1b							\n"
> > +"2:									\n"
> > +	: "=&r" (prev)
> > +	: "r" (&lock->val), "r"(old), "r" (new)
> > +	: "cr0", "memory");
> > +
> > +	if (prev == old)
> > +		return true;
> > +	return false;
> > +}
> > +
> > +static __always_inline void seen_sleepy_owner(struct qspinlock *lock, u32 val)
> > +{
> > +	if (pv_sleepy_lock) {
> > +		if (pv_sleepy_lock_interval_ns)
> > +			this_cpu_write(sleepy_lock_seen_clock, sched_clock());
> > +		if (!(val & _Q_SLEEPY_VAL))
> > +			lock_try_set_sleepy(lock, val);
> > +	}
> > +}
> > +
> > +static __always_inline void seen_sleepy_lock(void)
> > +{
> > +	if (pv_sleepy_lock && pv_sleepy_lock_interval_ns)
> > +		this_cpu_write(sleepy_lock_seen_clock, sched_clock());
> > +}
> > +
> > +static __always_inline void seen_sleepy_node(struct qspinlock *lock)
> > +{
>
> If yield_to_prev() was made to take a raw val, that val could be passed to
> seen_sleepy_node() and it would not need to get it by itself.

Yep.

>
> > +	if (pv_sleepy_lock) {
> > +		u32 val = READ_ONCE(lock->val);
> > +
> > +		if (pv_sleepy_lock_interval_ns)
> > +			this_cpu_write(sleepy_lock_seen_clock, sched_clock());
> > +		if (val & _Q_LOCKED_VAL) {
> > +			if (!(val & _Q_SLEEPY_VAL))
> > +				lock_try_set_sleepy(lock, val);
> > +		}
> > +	}
> > +}
> > +
> >  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
> >  {
> >  	int cpu = get_tail_cpu(val);
> > @@ -244,6 +329,7 @@ static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, u32
> >  
> >  	spin_end();
> >  
> > +	seen_sleepy_owner(lock, val);
> >  	*preempted = true;
> >  
> >  	/*
> > @@ -307,11 +393,13 @@ static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, int
> >  	}
> >  }
> >  
> > -static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
> > +static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt, bool *preempted)
> >  {
> >  	u32 yield_count;
> >  	int yield_cpu;
> >  
> > +	*preempted = false;
> > +
> >  	if (!paravirt)
> >  		goto relax;
> >  
> > @@ -332,6 +420,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
> >  
> >  	spin_end();
> >  
> > +	*preempted = true;
> > +	seen_sleepy_node(lock);
> > +
> >  	smp_rmb();
> >  
> >  	if (yield_cpu == node->yield_cpu) {
> > @@ -353,6 +444,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
> >  
> >  	spin_end();
> >  
> > +	*preempted = true;
> > +	seen_sleepy_node(lock);
> > +
> >  	smp_rmb(); /* See yield_to_locked_owner comment */
> >  
> >  	if (!node->locked) {
> > @@ -369,6 +463,9 @@ static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *
> >  
> >  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
> >  {
> > +	bool preempted;
> > +	bool seen_preempted = false;
> > +	bool sleepy = false;
> >  	int iters = 0;
> >  
> >  	if (!STEAL_SPINS) {
> > @@ -376,7 +473,6 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
> >  			spin_begin();
> >  			for (;;) {
> >  				u32 val = READ_ONCE(lock->val);
> > -				bool preempted;
> >  
> >  				if (val & _Q_MUST_Q_VAL)
> >  					break;
> > @@ -395,7 +491,6 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
> >  	spin_begin();
> >  	for (;;) {
> >  		u32 val = READ_ONCE(lock->val);
> > -		bool preempted;
> >  
> >  		if (val & _Q_MUST_Q_VAL)
> >  			break;
> > @@ -408,9 +503,29 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
> >  			continue;
> >  		}
> >  
> > +		if (paravirt && pv_sleepy_lock && !sleepy) {
> > +			if (!sleepy) {
>
> The enclosing conditional means this would always be true. I think the out conditional should be
> if (paravirt && pv_sleepy_lock)
> otherwise the pv_sleepy_lock_sticky part wouldn't work properly.

Good catch, I think you're right.
>
>
> > +				if (val & _Q_SLEEPY_VAL) {
> > +					seen_sleepy_lock();
> > +					sleepy = true;
> > +				} else if (recently_sleepy()) {
> > +					sleepy = true;
> > +				}
> > +
> > +			if (pv_sleepy_lock_sticky && seen_preempted &&
> > +					!(val & _Q_SLEEPY_VAL)) {
> > +				if (lock_try_set_sleepy(lock, val))
> > +					val |= _Q_SLEEPY_VAL;
> > +			}
> > +
> > +
> >  		yield_to_locked_owner(lock, val, paravirt, &preempted);
> > +		if (preempted)
> > +			seen_preempted = true;
>
> This could belong to the next if statement, there can not be !paravirt && preempted ?

Yep.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
  2022-11-10  6:37     ` Christophe Leroy
@ 2022-11-10 11:44       ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:44 UTC (permalink / raw)
  To: Christophe Leroy, Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 4:37 PM AEST, Christophe Leroy wrote:
>
>
> Le 10/11/2022 à 01:35, Jordan Niethe a écrit :
> > On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> > <snip>
> >> -#define queued_spin_lock queued_spin_lock
> >>   
> >> -static inline void queued_spin_unlock(struct qspinlock *lock)
> >> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> >>   {
> >> -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> >> -		smp_store_release(&lock->locked, 0);
> >> -	else
> >> -		__pv_queued_spin_unlock(lock);
> >> +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> >> +		return 1;
> >> +	return 0;
> > 
> > optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);
>
> No parenthesis.
> No == 0
>
> Should be :
>
> 	return !atomic_cmpxchg_acquire(&lock->val, 0, 1);

In this case I prefer the == 0 because we're testing against the 0 old
parameter being passed in. This is the recognisable cmpxchg pattern.

The other style of cmpxchg returns true if it succeeded, so it's less
clear we're not using that version if using !.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly
  2022-11-10  8:36     ` Christophe Leroy
@ 2022-11-10 11:48       ` Nicholas Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nicholas Piggin @ 2022-11-10 11:48 UTC (permalink / raw)
  To: Christophe Leroy, Jordan Niethe, linuxppc-dev

On Thu Nov 10, 2022 at 6:36 PM AEST, Christophe Leroy wrote:
>
>
> Le 10/11/2022 à 01:39, Jordan Niethe a écrit :
> >> +static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
> >>   {
> >> -	int newval = _Q_LOCKED_VAL;
> >> -
> >> -	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
> >> +	u32 new = _Q_LOCKED_VAL;
> >> +	u32 prev;
> >> +
> >> +	BUG_ON(old & _Q_LOCKED_VAL);
> > 
> > The BUG_ON() could have been introduced in an earlier patch I think.
>
> Can we avoid the BUG_ON() at all and replace by a WARN_ON ?

Lock has gone wrong here. Critical sections not working means data
corruption and little prospect of continuing to run.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2022-11-10 11:50 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-28  6:31 [PATCH 00/17] powerpc: alternate queued spinlock implementation Nicholas Piggin
2022-07-28  6:31 ` [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation Nicholas Piggin
2022-08-10  1:52   ` Jordan NIethe
2022-08-10  6:48     ` Christophe Leroy
2022-11-10  0:35   ` Jordan Niethe
2022-11-10  6:37     ` Christophe Leroy
2022-11-10 11:44       ` Nicholas Piggin
2022-11-10  9:09     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 1a/17] powerpc/qspinlock: Prepare qspinlock code Nicholas Piggin
2022-07-28  6:31 ` [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
2022-08-10  2:28   ` Jordan NIethe
2022-11-10  0:36   ` Jordan Niethe
2022-11-10  9:21     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx Nicholas Piggin
2022-08-10  3:28   ` Jordan Niethe
2022-11-10  0:39   ` Jordan Niethe
2022-11-10  9:25     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly Nicholas Piggin
2022-08-10  3:54   ` Jordan Niethe
2022-11-10  0:39   ` Jordan Niethe
2022-11-10  8:36     ` Christophe Leroy
2022-11-10 11:48       ` Nicholas Piggin
2022-11-10  9:40     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing Nicholas Piggin
2022-08-10  4:31   ` Jordan Niethe
2022-11-10  0:40   ` Jordan Niethe
2022-11-10 10:54     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency Nicholas Piggin
2022-08-10  5:51   ` Jordan Niethe
2022-11-10  0:40   ` Jordan Niethe
2022-11-10 10:57     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word Nicholas Piggin
2022-08-12  0:50   ` Jordan Niethe
2022-11-10  0:40   ` Jordan Niethe
2022-11-10 10:59     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner Nicholas Piggin
2022-08-12  2:01   ` Jordan Niethe
2022-11-10  0:41   ` Jordan Niethe
2022-11-10 11:13     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node Nicholas Piggin
2022-08-12  2:07   ` Jordan Niethe
2022-11-10  0:41   ` Jordan Niethe
2022-11-10 11:14     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields Nicholas Piggin
2022-08-12  4:06   ` Jordan Niethe
2022-11-10  0:42   ` Jordan Niethe
2022-11-10 11:22     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
2022-08-12  4:17   ` Jordan Niethe
2022-10-06 17:27   ` Laurent Dufour
2022-11-10  0:42   ` Jordan Niethe
2022-11-10 11:25     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU Nicholas Piggin
2022-08-12  4:22   ` Jordan Niethe
2022-11-10  0:42   ` Jordan Niethe
2022-11-10 11:32     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal Nicholas Piggin
2022-08-12  4:32   ` Jordan Niethe
2022-11-10  0:43   ` Jordan Niethe
2022-11-10 11:35     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API Nicholas Piggin
2022-08-12  4:36   ` Jordan Niethe
2022-11-10  0:43   ` Jordan Niethe
2022-11-10 11:36     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins Nicholas Piggin
2022-08-12  4:43   ` Jordan Niethe
2022-11-10  0:43   ` Jordan Niethe
2022-11-10 11:37     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner Nicholas Piggin
2022-08-12  4:49   ` Jordan Niethe
2022-09-22 15:02   ` Laurent Dufour
2022-09-23  8:16     ` Nicholas Piggin
2022-11-10  0:44   ` Jordan Niethe
2022-11-10 11:38     ` Nicholas Piggin
2022-07-28  6:31 ` [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks Nicholas Piggin
2022-08-15  1:11   ` Jordan Niethe
2022-11-10  0:44   ` Jordan Niethe
2022-11-10 11:41     ` Nicholas Piggin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.