All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 00/14] add our own qspinlock implementation
@ 2022-07-11  3:04 Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 01/14] powerpc/qspinlock: powerpc " Nicholas Piggin
                   ` (13 more replies)
  0 siblings, 14 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

The qspinlock conversion resulted in some latency regressions
particularly in the paravirt (SPLPAR) case. I haven't been able to
improve them much so for now I rewrite with a different paravirt
algorithm (as s390 does). This isn't the same as s390 but they have
some of the same concerns by the looks, so possibly if we can't
unify with generic code we could unify with them at some point.

Thanks,
Nick

Nicholas Piggin (14):
  powerpc/qspinlock: powerpc qspinlock implementation
  powerpc/qspinlock: add mcs queueing for contended waiters
  powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.
  powerpc/qspinlock: convert atomic operations to assembly
  powerpc/qspinlock: allow new waiters to steal the lock before queueing
  powerpc/qspinlock: theft prevention to control latency
  powerpc/qspinlock: store owner CPU in lock word
  powerpc/qspinlock: paravirt yield to lock owner
  powerpc/qspinlock: implement option to yield to previous node
  powerpc/qspinlock: allow stealing when head of queue yields
  powerpc/qspinlock: allow propagation of yield CPU down the queue
  powerpc/qspinlock: add ability to prod new queue head CPU
  powerpc/qspinlock: trylock and initial lock attempt may steal
  powerpc/qspinlock: use spin_begin/end API

 arch/powerpc/Kconfig                       |   1 -
 arch/powerpc/include/asm/qspinlock.h       | 130 ++--
 arch/powerpc/include/asm/qspinlock_types.h |  65 ++
 arch/powerpc/include/asm/spinlock_types.h  |   2 +-
 arch/powerpc/lib/Makefile                  |   4 +-
 arch/powerpc/lib/qspinlock.c               | 671 +++++++++++++++++++++
 6 files changed, 829 insertions(+), 44 deletions(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock_types.h
 create mode 100644 arch/powerpc/lib/qspinlock.c

-- 
2.35.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH 01/14] powerpc/qspinlock: powerpc qspinlock implementation
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 02/14] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Add a powerpc specific implementation of queued spinlocks. This is the
build framework with a very simple (non-queued) spinlock implementation
to begin with. Later changes add queueing, and other features and
optimisations one-at-a-time. It is done this way to more easily see how
the queued spinlocks are built, and to make performance and correctness
bisects more useful.

Generic PV qspinlock code is causing latency / starvation regressions on
large systems that are resulting in hard lockups reported (mostly in
pathoogical cases).  The generic qspinlock code has a number of issues
important for powerpc hardware and hypervisors that aren't easily solved
without changing code that would impact other architectures. Follow
s390's lead and implement our own for now.

Issues for powerpc using generic qspinlocks:
- The previous lock value should not be loaded with simple loads, and
  need not be passed around from previous loads or cmpxchg results,
  because powerpc uses ll/sc-style atomics which can perform more
  complex operations that do not require this. powerpc implementations
  tend to prefer loads use larx for improved coherency performance.
- The queueing process should absolutely minimise the number of stores
  to the lock word to reduce exclusive coherency probes, important for
  large system scalability. The pending logic is counter productive
  here.
- Non-atomic unlock for paravirt locks is important (atomic instructions
  tend to still be more expensive than x86 CPUs).
- Yielding to the lock owner is important in the oversubscribed paravirt
  case, which requires storing the owner CPU in the lock word.
- More control of lock stealing for the paravirt case is important to
  keep latency down on large systems.
- The lock acquisition operation should always be made with a special
  variant of atomic instructions with the lock hint bit set, including
  (especially) in the queueing paths. This is more a matter of adding
  more arch lock helpers so not an insurmountable problem for generic
  code.

Testing a paravirt (KVM guest) system with 512 vCPUs and 488 hardware
threads on a 4-socket POWER10. The purpose of these tests is not to show
throughput advantage because they are not representative of a real
workload and quite spinlock bound, but rather to show some latency and
fairness examples in pathological cases for undersubscribed and
oversubscribed guest situations:

will-it-scale/open1_threads -t64
          min      max    total
simple   4897     6066   344628
vanilla  3821     4100   251680
patched  5974     6995   415986

will-it-scale/open1_threads -t512
          min      max    total
simple    109      477   205022
vanilla     2    14675   571786
patched   227      711   267419

lockstorm is a simple in-kernel test that just has threads hammering on
a lock and collecting statistics (wait: arrival->acquire time, starve:
number of times the lock was acquired while we waited, sequential:
number of times we got the lock in a row):

lockstorm cpus=0-63
            min     max    total  max wait  max starve  max sequential
simple   124831  155101  8993621    1866us       16714             500
vanilla   81768   90426  5524856    1167us        8134            1087
patched  134411  161234  9622482    1996us       19023             437

lockstorm lock stress test cpus=0-511
            min     max    total  max wait  max starve  max sequential
simple     1442    3127  1434772   60922us       65320             590
vanilla       6  202616  7090954  580014us     3829074          282107
patched    1801    6296  2292002   53165us      103015             430

dbench 64
        tput (MB/s)  max-lat (ms)
simple      6224.33         24.76
vanilla     6951.91          4.73
patched     8416.77         18.59

dbench 512
        tput (MB/s)  max-lat (ms)
simple      4015.59        503.19
vanilla     4339.17        461.06
patched     5396.53        677.11

- qspin-try.patch
- powerpc-qspl.patch
- qspl-owner-cpu.patch
- owner-cpu-cleanup.patch
- qspl-owner-spin-queueud.patch
- yield-prev-mcs.patch
- qspl-steal.patch
- qspl-clear-mustq.patch
- fixes.patch
- qspl-stealer-no-mustq.patch
- fixes2
- fixes3
- rewrite
- tuning.patch
- qspl-tuning.patch
- yield-clear-mustq.patch
- mcs-no-yield-owner.patch
- propagate-yield-cpu.patch
- pv-tune-spins.patch
- set-must-q-cleanup.patch
- cleanup-publish.patch
- fix
- cleanup.patch
- more-tunables.patch
- tuning-and-propagate.patch
- tail-mask-remove.patch
- debug-qspl.patch
- asm-fixes.patch
- fix-asm2.patch
---
 arch/powerpc/Kconfig                       |  1 -
 arch/powerpc/include/asm/qspinlock.h       | 78 ++++++++++------------
 arch/powerpc/include/asm/qspinlock_types.h | 13 ++++
 arch/powerpc/include/asm/spinlock_types.h  |  2 +-
 arch/powerpc/lib/Makefile                  |  4 +-
 arch/powerpc/lib/qspinlock.c               | 18 +++++
 6 files changed, 69 insertions(+), 47 deletions(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock_types.h
 create mode 100644 arch/powerpc/lib/qspinlock.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 7aa12e88c580..4838e6c96b20 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -154,7 +154,6 @@ config PPC
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
 	select ARCH_USE_MEMTEST
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
-	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 39c1c7f80579..cb2b4f91e976 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -2,66 +2,56 @@
 #ifndef _ASM_POWERPC_QSPINLOCK_H
 #define _ASM_POWERPC_QSPINLOCK_H
 
-#include <asm-generic/qspinlock_types.h>
-#include <asm/paravirt.h>
+#include <linux/atomic.h>
+#include <linux/compiler.h>
+#include <asm/qspinlock_types.h>
 
-#define _Q_PENDING_LOOPS	(1 << 9) /* not tuned */
-
-void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-void __pv_queued_spin_unlock(struct qspinlock *lock);
-
-static __always_inline void queued_spin_lock(struct qspinlock *lock)
+static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-	u32 val = 0;
+	return atomic_read(&lock->val);
+}
 
-	if (likely(arch_atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
-		return;
+static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
+{
+	return !atomic_read(&lock.val);
+}
 
-	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
-		queued_spin_lock_slowpath(lock, val);
-	else
-		__pv_queued_spin_lock_slowpath(lock, val);
+static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
+{
+	return 0;
 }
-#define queued_spin_lock queued_spin_lock
 
-static inline void queued_spin_unlock(struct qspinlock *lock)
+static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
-	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
-		smp_store_release(&lock->locked, 0);
-	else
-		__pv_queued_spin_unlock(lock);
+	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
+		return 1;
+	return 0;
 }
-#define queued_spin_unlock queued_spin_unlock
 
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define SPIN_THRESHOLD (1<<15) /* not tuned */
+void queued_spin_lock_slowpath(struct qspinlock *lock);
 
-static __always_inline void pv_wait(u8 *ptr, u8 val)
+static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
-	if (*ptr != val)
-		return;
-	yield_to_any();
-	/*
-	 * We could pass in a CPU here if waiting in the queue and yield to
-	 * the previous CPU in the queue.
-	 */
+	if (!queued_spin_trylock(lock))
+		queued_spin_lock_slowpath(lock);
 }
 
-static __always_inline void pv_kick(int cpu)
+static inline void queued_spin_unlock(struct qspinlock *lock)
 {
-	prod_cpu(cpu);
+	atomic_set_release(&lock->val, 0);
 }
 
-#endif
+#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queued_spin_lock(l)
+#define arch_spin_trylock(l)		queued_spin_trylock(l)
+#define arch_spin_unlock(l)		queued_spin_unlock(l)
 
-/*
- * Queued spinlocks rely heavily on smp_cond_load_relaxed() to busy-wait,
- * which was found to have performance problems if implemented with
- * the preferred spin_begin()/spin_end() SMT priority pattern. Use the
- * generic version instead.
- */
-
-#include <asm-generic/qspinlock.h>
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void pv_spinlocks_init(void);
+#else
+static inline void pv_spinlocks_init(void) { }
+#endif
 
 #endif /* _ASM_POWERPC_QSPINLOCK_H */
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
new file mode 100644
index 000000000000..59606bc0c774
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _ASM_POWERPC_QSPINLOCK_TYPES_H
+#define _ASM_POWERPC_QSPINLOCK_TYPES_H
+
+#include <linux/types.h>
+
+typedef struct qspinlock {
+	atomic_t val;
+} arch_spinlock_t;
+
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
+
+#endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
diff --git a/arch/powerpc/include/asm/spinlock_types.h b/arch/powerpc/include/asm/spinlock_types.h
index d5f8a74ed2e8..40b01446cf75 100644
--- a/arch/powerpc/include/asm/spinlock_types.h
+++ b/arch/powerpc/include/asm/spinlock_types.h
@@ -7,7 +7,7 @@
 #endif
 
 #ifdef CONFIG_PPC_QUEUED_SPINLOCKS
-#include <asm-generic/qspinlock_types.h>
+#include <asm/qspinlock_types.h>
 #include <asm-generic/qrwlock_types.h>
 #else
 #include <asm/simple_spinlock_types.h>
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 8560c912186d..b895cbf6a709 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -52,7 +52,9 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o copypage_power7.o \
 obj64-y	+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
 	   memcpy_64.o copy_mc_64.o
 
-ifndef CONFIG_PPC_QUEUED_SPINLOCKS
+ifdef CONFIG_PPC_QUEUED_SPINLOCKS
+obj64-$(CONFIG_SMP)	+= qspinlock.o
+else
 obj64-$(CONFIG_SMP)	+= locks.o
 endif
 
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
new file mode 100644
index 000000000000..8dbce99a373c
--- /dev/null
+++ b/arch/powerpc/lib/qspinlock.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/export.h>
+#include <linux/processor.h>
+#include <asm/qspinlock.h>
+
+void queued_spin_lock_slowpath(struct qspinlock *lock)
+{
+	while (!queued_spin_trylock(lock))
+		cpu_relax();
+}
+EXPORT_SYMBOL(queued_spin_lock_slowpath);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void pv_spinlocks_init(void)
+{
+}
+#endif
+
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 02/14] powerpc/qspinlock: add mcs queueing for contended waiters
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 01/14] powerpc/qspinlock: powerpc " Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 03/14] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx Nicholas Piggin
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

This forms the basis of the qspinlock slow path.

Like generic qspinlocks and unlike the vanilla MCS algorithm, the lock
owner does not participate in the queue, only waiters. The first waiter
spins on the lock word, then when the lock is released it takes
ownership and unqueues the next waiter. This is how qspinlocks can be
implemented with the spinlock API -- lock owners don't need a node, only
waiters do.
---
 arch/powerpc/include/asm/qspinlock.h       |  10 +-
 arch/powerpc/include/asm/qspinlock_types.h |  21 +++
 arch/powerpc/lib/qspinlock.c               | 166 ++++++++++++++++++++-
 3 files changed, 191 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index cb2b4f91e976..f06117aa60e1 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -18,12 +18,12 @@ static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
 
 static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 {
-	return 0;
+	return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
 }
 
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
-	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
+	if (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0)
 		return 1;
 	return 0;
 }
@@ -38,7 +38,11 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 
 static inline void queued_spin_unlock(struct qspinlock *lock)
 {
-	atomic_set_release(&lock->val, 0);
+	for (;;) {
+		int val = atomic_read(&lock->val);
+		if (atomic_cmpxchg_release(&lock->val, val, val & ~_Q_LOCKED_VAL) == val)
+			return;
+	}
 }
 
 #define arch_spin_is_locked(l)		queued_spin_is_locked(l)
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 59606bc0c774..9630e714c70d 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -10,4 +10,25 @@ typedef struct qspinlock {
 
 #define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
 
+/*
+ * Bitfields in the atomic value:
+ *
+ *     0: locked bit
+ * 16-31: tail cpu (+1)
+ */
+#define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
+				      << _Q_ ## type ## _OFFSET)
+#define _Q_LOCKED_OFFSET	0
+#define _Q_LOCKED_BITS		1
+#define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
+#define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
+
+#define _Q_TAIL_CPU_OFFSET	16
+#define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
+#define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
+
+#if CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)
+#error "qspinlock does not support such large CONFIG_NR_CPUS"
+#endif
+
 #endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 8dbce99a373c..c65f7719a188 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -1,12 +1,172 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/compiler.h>
 #include <linux/export.h>
-#include <linux/processor.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
 #include <asm/qspinlock.h>
 
-void queued_spin_lock_slowpath(struct qspinlock *lock)
+#define MAX_NODES	4
+
+struct qnode {
+	struct qnode	*next;
+	struct qspinlock *lock;
+	u8		locked; /* 1 if lock acquired */
+};
+
+struct qnodes {
+	int		count;
+	struct qnode nodes[MAX_NODES];
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
+
+static inline int encode_tail_cpu(void)
+{
+	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
+}
+
+static inline int get_tail_cpu(int val)
+{
+	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
+}
+
+/* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
+static __always_inline void lock_set_locked(struct qspinlock *lock)
+{
+	atomic_or(_Q_LOCKED_VAL, &lock->val);
+	__atomic_acquire_fence();
+}
+
+/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
+static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int val)
+{
+	int newval = _Q_LOCKED_VAL;
+
+	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
+		return 1;
+	else
+		return 0;
+}
+
+/*
+ * Publish our tail, replacing previous tail. Return previous value.
+ *
+ * This provides a release barrier for publishing node, and an acquire barrier
+ * for getting the old node.
+ */
+static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)
 {
-	while (!queued_spin_trylock(lock))
+	for (;;) {
+		int val = atomic_read(&lock->val);
+		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
+		int old;
+
+		old = atomic_cmpxchg(&lock->val, val, newval);
+		if (old == val)
+			return old;
+	}
+}
+
+static inline struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
+{
+	int cpu = get_tail_cpu(val);
+	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
+	int idx;
+
+	for (idx = 0; idx < MAX_NODES; idx++) {
+		struct qnode *qnode = &qnodesp->nodes[idx];
+		if (qnode->lock == lock)
+			return qnode;
+	}
+
+	BUG();
+}
+
+static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
+{
+	struct qnodes *qnodesp;
+	struct qnode *next, *node;
+	int val, old, tail;
+	int idx;
+
+	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+
+	qnodesp = this_cpu_ptr(&qnodes);
+	if (unlikely(qnodesp->count == MAX_NODES)) {
+		while (!queued_spin_trylock(lock))
+			cpu_relax();
+		return;
+	}
+
+	idx = qnodesp->count++;
+	/*
+	 * Ensure that we increment the head node->count before initialising
+	 * the actual node. If the compiler is kind enough to reorder these
+	 * stores, then an IRQ could overwrite our assignments.
+	 */
+	barrier();
+	node = &qnodesp->nodes[idx];
+	node->next = NULL;
+	node->lock = lock;
+	node->locked = 0;
+
+	tail = encode_tail_cpu();
+
+	old = publish_tail_cpu(lock, tail);
+
+	/*
+	 * If there was a previous node; link it and wait until reaching the
+	 * head of the waitqueue.
+	 */
+	if (old & _Q_TAIL_CPU_MASK) {
+		struct qnode *prev = get_tail_qnode(lock, old);
+
+		/* Link @node into the waitqueue. */
+		WRITE_ONCE(prev->next, node);
+
+		/* Wait for mcs node lock to be released */
+		while (!node->locked)
+			cpu_relax();
+
+		smp_rmb(); /* acquire barrier for the mcs lock */
+	}
+
+	/* We're at the head of the waitqueue, wait for the lock. */
+	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_VAL)
+		cpu_relax();
+
+	/* If we're the last queued, must clean up the tail. */
+	if ((val & _Q_TAIL_CPU_MASK) == tail) {
+		if (trylock_clear_tail_cpu(lock, val))
+			goto release;
+		/* Another waiter must have enqueued */
+	}
+
+	/* We must be the owner, just set the lock bit and acquire */
+	lock_set_locked(lock);
+
+	/* contended path; must wait for next != NULL (MCS protocol) */
+	while (!(next = READ_ONCE(node->next)))
 		cpu_relax();
+
+	/*
+	 * Unlock the next mcs waiter node. Release barrier is not required
+	 * here because the acquirer is only accessing the lock word, and
+	 * the acquire barrier we took the lock with orders that update vs
+	 * this store to locked. The corresponding barrier is the smp_rmb()
+	 * acquire barrier for mcs lock, above.
+	 */
+	WRITE_ONCE(next->locked, 1);
+
+release:
+	qnodesp->count--; /* release the node */
+}
+
+void queued_spin_lock_slowpath(struct qspinlock *lock)
+{
+	queued_spin_lock_mcs_queue(lock);
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 03/14] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 01/14] powerpc/qspinlock: powerpc " Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 02/14] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 04/14] powerpc/qspinlock: convert atomic operations to assembly Nicholas Piggin
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

The first 16 bits of the lock are only modified by the owner, and other
modifications always use atomic operations on the entire 32 bits, so
unlocks can use plain stores on the 16 bits. This is the same kind of
optimisation done by core qspinlock code.
---
 arch/powerpc/include/asm/qspinlock.h       |  6 +-----
 arch/powerpc/include/asm/qspinlock_types.h | 19 +++++++++++++++++--
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index f06117aa60e1..79a1936fb68d 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -38,11 +38,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 
 static inline void queued_spin_unlock(struct qspinlock *lock)
 {
-	for (;;) {
-		int val = atomic_read(&lock->val);
-		if (atomic_cmpxchg_release(&lock->val, val, val & ~_Q_LOCKED_VAL) == val)
-			return;
-	}
+	smp_store_release(&lock->locked, 0);
 }
 
 #define arch_spin_is_locked(l)		queued_spin_is_locked(l)
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 9630e714c70d..3425dab42576 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -3,12 +3,27 @@
 #define _ASM_POWERPC_QSPINLOCK_TYPES_H
 
 #include <linux/types.h>
+#include <asm/byteorder.h>
 
 typedef struct qspinlock {
-	atomic_t val;
+	union {
+		atomic_t val;
+
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u16	locked;
+			u8	reserved[2];
+		};
+#else
+		struct {
+			u8	reserved[2];
+			u16	locked;
+		};
+#endif
+	};
 } arch_spinlock_t;
 
-#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
 
 /*
  * Bitfields in the atomic value:
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 04/14] powerpc/qspinlock: convert atomic operations to assembly
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
                   ` (2 preceding siblings ...)
  2022-07-11  3:04 ` [RFC PATCH 03/14] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 05/14] powerpc/qspinlock: allow new waiters to steal the lock before queueing Nicholas Piggin
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

This uses more optimal ll/sc style access patterns (rather than
cmpxchg), and also sets the EH=1 lock hint on those operations
which acquire ownership of the lock.
---
 arch/powerpc/include/asm/qspinlock.h       | 25 +++++--
 arch/powerpc/include/asm/qspinlock_types.h |  6 +-
 arch/powerpc/lib/qspinlock.c               | 81 +++++++++++++++-------
 3 files changed, 79 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 79a1936fb68d..3ab354159e5e 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -2,28 +2,43 @@
 #ifndef _ASM_POWERPC_QSPINLOCK_H
 #define _ASM_POWERPC_QSPINLOCK_H
 
-#include <linux/atomic.h>
 #include <linux/compiler.h>
 #include <asm/qspinlock_types.h>
 
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-	return atomic_read(&lock->val);
+	return READ_ONCE(lock->val);
 }
 
 static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
 {
-	return !atomic_read(&lock.val);
+	return !lock.val;
 }
 
 static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 {
-	return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
+	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
 }
 
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
-	if (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0)
+	u32 new = _Q_LOCKED_VAL;
+	u32 prev;
+
+	asm volatile(
+"1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"
+"	cmpwi	0,%0,0							\n"
+"	bne-	2f							\n"
+"	stwcx.	%2,0,%1							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+"2:									\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r" (new),
+	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
+	: "cr0", "memory");
+
+	if (likely(prev == 0))
 		return 1;
 	return 0;
 }
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 3425dab42576..210adf05b235 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -7,7 +7,7 @@
 
 typedef struct qspinlock {
 	union {
-		atomic_t val;
+		u32 val;
 
 #ifdef __LITTLE_ENDIAN
 		struct {
@@ -23,10 +23,10 @@ typedef struct qspinlock {
 	};
 } arch_spinlock_t;
 
-#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = ATOMIC_INIT(0) } }
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ { .val = 0 } }
 
 /*
- * Bitfields in the atomic value:
+ * Bitfields in the lock word:
  *
  *     0: locked bit
  * 16-31: tail cpu (+1)
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index c65f7719a188..76dca922ba71 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -1,5 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
-#include <linux/atomic.h>
 #include <linux/bug.h>
 #include <linux/compiler.h>
 #include <linux/export.h>
@@ -22,32 +21,59 @@ struct qnodes {
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
-static inline int encode_tail_cpu(void)
+static inline u32 encode_tail_cpu(void)
 {
 	return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
 }
 
-static inline int get_tail_cpu(int val)
+static inline int get_tail_cpu(u32 val)
 {
 	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
 }
 
 /* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
+/* Take the lock by setting the lock bit, no other CPUs will touch it. */
 static __always_inline void lock_set_locked(struct qspinlock *lock)
 {
-	atomic_or(_Q_LOCKED_VAL, &lock->val);
-	__atomic_acquire_fence();
+	u32 new = _Q_LOCKED_VAL;
+	u32 prev;
+
+	asm volatile(
+"1:	lwarx	%0,0,%1,%3	# lock_set_locked			\n"
+"	or	%0,%0,%2						\n"
+"	stwcx.	%0,0,%1							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r" (new),
+	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
+	: "cr0", "memory");
 }
 
-/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
-static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int val)
+/* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
+static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
 {
-	int newval = _Q_LOCKED_VAL;
-
-	if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
+	u32 new = _Q_LOCKED_VAL;
+	u32 prev;
+
+	BUG_ON(old & _Q_LOCKED_VAL);
+
+	asm volatile(
+"1:	lwarx	%0,0,%1,%4	# trylock_clear_tail_cpu		\n"
+"	cmpw	0,%0,%2							\n"
+"	bne-	2f							\n"
+"	stwcx.	%3,0,%1							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+"2:									\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r"(old), "r" (new),
+	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
+	: "cr0", "memory");
+
+	if (likely(prev == old))
 		return 1;
-	else
-		return 0;
+	return 0;
 }
 
 /*
@@ -56,20 +82,25 @@ static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, int va
  * This provides a release barrier for publishing node, and an acquire barrier
  * for getting the old node.
  */
-static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)
+static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
 {
-	for (;;) {
-		int val = atomic_read(&lock->val);
-		int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
-		int old;
-
-		old = atomic_cmpxchg(&lock->val, val, newval);
-		if (old == val)
-			return old;
-	}
+	u32 prev, tmp;
+
+	asm volatile(
+"\t"	PPC_RELEASE_BARRIER "						\n"
+"1:	lwarx	%0,0,%2		# publish_tail_cpu			\n"
+"	andc	%1,%0,%4						\n"
+"	or	%1,%1,%3						\n"
+"	stwcx.	%1,0,%2							\n"
+"	bne-	1b							\n"
+	: "=&r" (prev), "=&r"(tmp)
+	: "r" (&lock->val), "r" (tail), "r"(_Q_TAIL_CPU_MASK)
+	: "cr0", "memory");
+
+	return prev;
 }
 
-static inline struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
+static inline struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 {
 	int cpu = get_tail_cpu(val);
 	struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
@@ -88,7 +119,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 {
 	struct qnodes *qnodesp;
 	struct qnode *next, *node;
-	int val, old, tail;
+	u32 val, old, tail;
 	int idx;
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
@@ -134,7 +165,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 	}
 
 	/* We're at the head of the waitqueue, wait for the lock. */
-	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_VAL)
+	while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
 		cpu_relax();
 
 	/* If we're the last queued, must clean up the tail. */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 05/14] powerpc/qspinlock: allow new waiters to steal the lock before queueing
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
                   ` (3 preceding siblings ...)
  2022-07-11  3:04 ` [RFC PATCH 04/14] powerpc/qspinlock: convert atomic operations to assembly Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 06/14] powerpc/qspinlock: theft prevention to control latency Nicholas Piggin
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Allow new waiters a number of spins on the lock word before queueing,
which particularly helps paravirt performance when physical CPUs are
oversubscribed.
---
 arch/powerpc/lib/qspinlock.c | 143 ++++++++++++++++++++++++++++++++---
 1 file changed, 132 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 76dca922ba71..cb87991602ff 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -19,6 +19,10 @@ struct qnodes {
 	struct qnode nodes[MAX_NODES];
 };
 
+/* Tuning parameters */
+static int STEAL_SPINS __read_mostly = (1<<5);
+static bool MAYBE_STEALERS __read_mostly = true;
+
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
 static inline u32 encode_tail_cpu(void)
@@ -76,6 +80,39 @@ static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 ol
 	return 0;
 }
 
+static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 old, u32 new)
+{
+	u32 prev;
+
+	BUG_ON(old & _Q_LOCKED_VAL);
+
+	asm volatile(
+"1:	lwarx	%0,0,%1,%4	# queued_spin_trylock_cmpxchg		\n"
+"	cmpw	0,%0,%2							\n"
+"	bne-	2f							\n"
+"	stwcx.	%3,0,%1							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+"2:									\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r"(old), "r" (new),
+	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
+	: "cr0", "memory");
+
+	return prev;
+}
+
+/* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
+static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 val)
+{
+	u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
+
+	if (__trylock_cmpxchg(lock, val, newval) == val)
+		return 1;
+	else
+		return 0;
+}
+
 /*
  * Publish our tail, replacing previous tail. Return previous value.
  *
@@ -115,6 +152,26 @@ static inline struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
+static inline bool try_to_steal_lock(struct qspinlock *lock)
+{
+	int iters;
+
+	/* Attempt to steal the lock */
+	for (iters = 0; iters < STEAL_SPINS; iters++) {
+		u32 val = READ_ONCE(lock->val);
+
+		if (val & _Q_LOCKED_VAL) {
+			cpu_relax();
+			continue;
+		}
+
+		if (trylock_with_tail_cpu(lock, val))
+			return true;
+	}
+
+	return false;
+}
+
 static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 {
 	struct qnodes *qnodesp;
@@ -164,20 +221,39 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 		smp_rmb(); /* acquire barrier for the mcs lock */
 	}
 
-	/* We're at the head of the waitqueue, wait for the lock. */
-	while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
-		cpu_relax();
+	if (!MAYBE_STEALERS) {
+		/* We're at the head of the waitqueue, wait for the lock. */
+		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
+			cpu_relax();
 
-	/* If we're the last queued, must clean up the tail. */
-	if ((val & _Q_TAIL_CPU_MASK) == tail) {
-		if (trylock_clear_tail_cpu(lock, val))
-			goto release;
-		/* Another waiter must have enqueued */
-	}
+		/* If we're the last queued, must clean up the tail. */
+		if ((val & _Q_TAIL_CPU_MASK) == tail) {
+			if (trylock_clear_tail_cpu(lock, val))
+				goto release;
+			/* Another waiter must have enqueued. */
+		}
+
+		/* We must be the owner, just set the lock bit and acquire */
+		lock_set_locked(lock);
+	} else {
+again:
+		/* We're at the head of the waitqueue, wait for the lock. */
+		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
+			cpu_relax();
 
-	/* We must be the owner, just set the lock bit and acquire */
-	lock_set_locked(lock);
+		/* If we're the last queued, must clean up the tail. */
+		if ((val & _Q_TAIL_CPU_MASK) == tail) {
+			if (trylock_clear_tail_cpu(lock, val))
+				goto release;
+			/* Another waiter must have enqueued, or lock stolen. */
+		} else {
+			if (trylock_with_tail_cpu(lock, val))
+				goto unlock_next;
+		}
+		goto again;
+	}
 
+unlock_next:
 	/* contended path; must wait for next != NULL (MCS protocol) */
 	while (!(next = READ_ONCE(node->next)))
 		cpu_relax();
@@ -197,6 +273,9 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 
 void queued_spin_lock_slowpath(struct qspinlock *lock)
 {
+	if (try_to_steal_lock(lock))
+		return;
+
 	queued_spin_lock_mcs_queue(lock);
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
@@ -204,6 +283,48 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 void pv_spinlocks_init(void)
 {
+	STEAL_SPINS = (1<<15);
 }
 #endif
 
+#include <linux/debugfs.h>
+static int steal_spins_set(void *data, u64 val)
+{
+	static DEFINE_MUTEX(lock);
+
+	mutex_lock(&lock);
+	if (val && !STEAL_SPINS) {
+		MAYBE_STEALERS = true;
+		/* wait for waiter to go away */
+		synchronize_rcu();
+		STEAL_SPINS = val;
+	} else if (!val && STEAL_SPINS) {
+		STEAL_SPINS = val;
+		/* wait for all possible stealers to go away */
+		synchronize_rcu();
+		MAYBE_STEALERS = false;
+	} else {
+		STEAL_SPINS = val;
+	}
+	mutex_unlock(&lock);
+
+	return 0;
+}
+
+static int steal_spins_get(void *data, u64 *val)
+{
+	*val = STEAL_SPINS;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
+
+static __init int spinlock_debugfs_init(void)
+{
+	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
+
+	return 0;
+}
+device_initcall(spinlock_debugfs_init);
+
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 06/14] powerpc/qspinlock: theft prevention to control latency
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
                   ` (4 preceding siblings ...)
  2022-07-11  3:04 ` [RFC PATCH 05/14] powerpc/qspinlock: allow new waiters to steal the lock before queueing Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 07/14] powerpc/qspinlock: store owner CPU in lock word Nicholas Piggin
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Give the queue head the ability to stop stealers. After a number of
spins without sucessfully acquiring the lock, the queue head employs
this, which will assure it is the next owner.
---
 arch/powerpc/include/asm/qspinlock_types.h | 10 ++++-
 arch/powerpc/lib/qspinlock.c               | 45 +++++++++++++++++++++-
 2 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 210adf05b235..8b20f5e22bba 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -29,7 +29,8 @@ typedef struct qspinlock {
  * Bitfields in the lock word:
  *
  *     0: locked bit
- * 16-31: tail cpu (+1)
+ *    16: must queue bit
+ * 17-31: tail cpu (+1)
  */
 #define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
 				      << _Q_ ## type ## _OFFSET)
@@ -38,7 +39,12 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
 
-#define _Q_TAIL_CPU_OFFSET	16
+#define _Q_MUST_Q_OFFSET	16
+#define _Q_MUST_Q_BITS		1
+#define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
+#define _Q_MUST_Q_VAL		(1U << _Q_MUST_Q_OFFSET)
+
+#define _Q_TAIL_CPU_OFFSET	17
 #define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
 
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index cb87991602ff..662a744fa1ee 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -22,6 +22,7 @@ struct qnodes {
 /* Tuning parameters */
 static int STEAL_SPINS __read_mostly = (1<<5);
 static bool MAYBE_STEALERS __read_mostly = true;
+static int HEAD_SPINS __read_mostly = (1<<13);
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -137,6 +138,23 @@ static __always_inline u32 publish_tail_cpu(struct qspinlock *lock, u32 tail)
 	return prev;
 }
 
+static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
+{
+	u32 new = _Q_MUST_Q_VAL;
+	u32 prev;
+
+	asm volatile(
+"1:	lwarx	%0,0,%1		# queued_spin_set_mustq			\n"
+"	or	%0,%0,%2						\n"
+"	stwcx.	%0,0,%1							\n"
+"	bne-	1b							\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r" (new)
+	: "cr0", "memory");
+
+	return prev;
+}
+
 static inline struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 {
 	int cpu = get_tail_cpu(val);
@@ -160,6 +178,9 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
 	for (iters = 0; iters < STEAL_SPINS; iters++) {
 		u32 val = READ_ONCE(lock->val);
 
+		if (val & _Q_MUST_Q_VAL)
+			break;
+
 		if (val & _Q_LOCKED_VAL) {
 			cpu_relax();
 			continue;
@@ -236,10 +257,14 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 		/* We must be the owner, just set the lock bit and acquire */
 		lock_set_locked(lock);
 	} else {
+		int iters = 0;
 again:
 		/* We're at the head of the waitqueue, wait for the lock. */
-		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
+		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
+			if (iters++ == HEAD_SPINS)
+				lock_set_mustq(lock);
 			cpu_relax();
+		}
 
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
@@ -284,6 +309,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 void pv_spinlocks_init(void)
 {
 	STEAL_SPINS = (1<<15);
+	HEAD_SPINS = (1<<13);
 }
 #endif
 
@@ -320,9 +346,26 @@ static int steal_spins_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, "%llu\n");
 
+static int head_spins_set(void *data, u64 val)
+{
+	HEAD_SPINS = val;
+
+	return 0;
+}
+
+static int head_spins_get(void *data, u64 *val)
+{
+	*val = HEAD_SPINS;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_head_spins, head_spins_get, head_spins_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
+	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
 
 	return 0;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 07/14] powerpc/qspinlock: store owner CPU in lock word
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
                   ` (5 preceding siblings ...)
  2022-07-11  3:04 ` [RFC PATCH 06/14] powerpc/qspinlock: theft prevention to control latency Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 08/14] powerpc/qspinlock: paravirt yield to lock owner Nicholas Piggin
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Store the owner CPU number in the lock word so it may be yielded to,
as powerpc's paravirtualised simple spinlocks do.
---
 arch/powerpc/include/asm/qspinlock.h       |  8 +++++++-
 arch/powerpc/include/asm/qspinlock_types.h | 10 ++++++++++
 arch/powerpc/lib/qspinlock.c               |  6 +++---
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 3ab354159e5e..44601b261e08 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -20,9 +20,15 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 	return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
 }
 
+static __always_inline u32 queued_spin_get_locked_val(void)
+{
+	/* XXX: make this use lock value in paca like simple spinlocks? */
+	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
+}
+
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
-	u32 new = _Q_LOCKED_VAL;
+	u32 new = queued_spin_get_locked_val();
 	u32 prev;
 
 	asm volatile(
diff --git a/arch/powerpc/include/asm/qspinlock_types.h b/arch/powerpc/include/asm/qspinlock_types.h
index 8b20f5e22bba..35f9525381e6 100644
--- a/arch/powerpc/include/asm/qspinlock_types.h
+++ b/arch/powerpc/include/asm/qspinlock_types.h
@@ -29,6 +29,8 @@ typedef struct qspinlock {
  * Bitfields in the lock word:
  *
  *     0: locked bit
+ *  1-14: lock holder cpu
+ *    15: unused bit
  *    16: must queue bit
  * 17-31: tail cpu (+1)
  */
@@ -39,6 +41,14 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
 
+#define _Q_OWNER_CPU_OFFSET	1
+#define _Q_OWNER_CPU_BITS	14
+#define _Q_OWNER_CPU_MASK	_Q_SET_MASK(OWNER_CPU)
+
+#if CONFIG_NR_CPUS > (1U << _Q_OWNER_CPU_BITS)
+#error "qspinlock does not support such large CONFIG_NR_CPUS"
+#endif
+
 #define _Q_MUST_Q_OFFSET	16
 #define _Q_MUST_Q_BITS		1
 #define _Q_MUST_Q_MASK		_Q_SET_MASK(MUST_Q)
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 662a744fa1ee..3c6554a02de7 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -40,7 +40,7 @@ static inline int get_tail_cpu(u32 val)
 /* Take the lock by setting the lock bit, no other CPUs will touch it. */
 static __always_inline void lock_set_locked(struct qspinlock *lock)
 {
-	u32 new = _Q_LOCKED_VAL;
+	u32 new = queued_spin_get_locked_val();
 	u32 prev;
 
 	asm volatile(
@@ -58,7 +58,7 @@ static __always_inline void lock_set_locked(struct qspinlock *lock)
 /* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
 static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
 {
-	u32 new = _Q_LOCKED_VAL;
+	u32 new = queued_spin_get_locked_val();
 	u32 prev;
 
 	BUG_ON(old & _Q_LOCKED_VAL);
@@ -106,7 +106,7 @@ static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 old, u3
 /* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
 static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 val)
 {
-	u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
+	u32 newval = queued_spin_get_locked_val() | (val & _Q_TAIL_CPU_MASK);
 
 	if (__trylock_cmpxchg(lock, val, newval) == val)
 		return 1;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 08/14] powerpc/qspinlock: paravirt yield to lock owner
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
                   ` (6 preceding siblings ...)
  2022-07-11  3:04 ` [RFC PATCH 07/14] powerpc/qspinlock: store owner CPU in lock word Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 09/14] powerpc/qspinlock: implement option to yield to previous node Nicholas Piggin
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Waiters spinning on the lock word should yield to the lock owner if the
vCPU is preempted. This improves performance when the hypervisor has
oversubscribed physical CPUs.
---
 arch/powerpc/lib/qspinlock.c | 93 +++++++++++++++++++++++++++++++-----
 1 file changed, 82 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 3c6554a02de7..ee85198d5e28 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -5,6 +5,7 @@
 #include <linux/percpu.h>
 #include <linux/smp.h>
 #include <asm/qspinlock.h>
+#include <asm/paravirt.h>
 
 #define MAX_NODES	4
 
@@ -24,6 +25,8 @@ static int STEAL_SPINS __read_mostly = (1<<5);
 static bool MAYBE_STEALERS __read_mostly = true;
 static int HEAD_SPINS __read_mostly = (1<<13);
 
+static bool pv_yield_owner __read_mostly = true;
+
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
 static inline u32 encode_tail_cpu(void)
@@ -36,7 +39,11 @@ static inline int get_tail_cpu(u32 val)
 	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
 }
 
-/* Take the lock by setting the bit, no other CPUs may concurrently lock it. */
+static inline int get_owner_cpu(u32 val)
+{
+	return (val & _Q_OWNER_CPU_MASK) >> _Q_OWNER_CPU_OFFSET;
+}
+
 /* Take the lock by setting the lock bit, no other CPUs will touch it. */
 static __always_inline void lock_set_locked(struct qspinlock *lock)
 {
@@ -170,7 +177,45 @@ static inline struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
-static inline bool try_to_steal_lock(struct qspinlock *lock)
+static void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+{
+	int owner;
+	u32 yield_count;
+
+	BUG_ON(!(val & _Q_LOCKED_VAL));
+
+	if (!paravirt)
+		goto relax;
+
+	if (!pv_yield_owner)
+		goto relax;
+
+	owner = get_owner_cpu(val);
+	yield_count = yield_count_of(owner);
+
+	if ((yield_count & 1) == 0)
+		goto relax; /* owner vcpu is running */
+
+	/*
+	 * Read the lock word after sampling the yield count. On the other side
+	 * there may a wmb because the yield count update is done by the
+	 * hypervisor preemption and the value update by the OS, however this
+	 * ordering might reduce the chance of out of order accesses and
+	 * improve the heuristic.
+	 */
+	smp_rmb();
+
+	if (READ_ONCE(lock->val) == val) {
+		yield_to_preempted(owner, yield_count);
+		/* Don't relax if we yielded. Maybe we should? */
+		return;
+	}
+relax:
+	cpu_relax();
+}
+
+
+static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
 {
 	int iters;
 
@@ -182,7 +227,7 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
 			break;
 
 		if (val & _Q_LOCKED_VAL) {
-			cpu_relax();
+			yield_to_locked_owner(lock, val, paravirt);
 			continue;
 		}
 
@@ -193,7 +238,7 @@ static inline bool try_to_steal_lock(struct qspinlock *lock)
 	return false;
 }
 
-static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
+static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool paravirt)
 {
 	struct qnodes *qnodesp;
 	struct qnode *next, *node;
@@ -245,7 +290,7 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 	if (!MAYBE_STEALERS) {
 		/* We're at the head of the waitqueue, wait for the lock. */
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
-			cpu_relax();
+			yield_to_locked_owner(lock, val, paravirt);
 
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
@@ -261,9 +306,11 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 again:
 		/* We're at the head of the waitqueue, wait for the lock. */
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
-			if (iters++ == HEAD_SPINS)
+			if (iters++ == HEAD_SPINS) {
 				lock_set_mustq(lock);
-			cpu_relax();
+				val |= _Q_MUST_Q_VAL;
+			}
+			yield_to_locked_owner(lock, val, paravirt);
 		}
 
 		/* If we're the last queued, must clean up the tail. */
@@ -298,10 +345,15 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
 
 void queued_spin_lock_slowpath(struct qspinlock *lock)
 {
-	if (try_to_steal_lock(lock))
-		return;
-
-	queued_spin_lock_mcs_queue(lock);
+	if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && is_shared_processor()) {
+		if (try_to_steal_lock(lock, true))
+			return;
+		queued_spin_lock_mcs_queue(lock, true);
+	} else {
+		if (try_to_steal_lock(lock, false))
+			return;
+		queued_spin_lock_mcs_queue(lock, false);
+	}
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
@@ -362,10 +414,29 @@ static int head_spins_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_head_spins, head_spins_get, head_spins_set, "%llu\n");
 
+static int pv_yield_owner_set(void *data, u64 val)
+{
+	pv_yield_owner = !!val;
+
+	return 0;
+}
+
+static int pv_yield_owner_get(void *data, u64 *val)
+{
+	*val = pv_yield_owner;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
 	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
+	if (is_shared_processor()) {
+		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
+	}
 
 	return 0;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 09/14] powerpc/qspinlock: implement option to yield to previous node
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
                   ` (7 preceding siblings ...)
  2022-07-11  3:04 ` [RFC PATCH 08/14] powerpc/qspinlock: paravirt yield to lock owner Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 10/14] powerpc/qspinlock: allow stealing when head of queue yields Nicholas Piggin
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Queued waiters which are not at the head of the queue don't spin on
the lock word but their qnode lock word, waiting for the previous queued
CPU to release them. Add an option which allows these waiters to yield
to the previous CPU if its vCPU is preempted.

Disable this option by default for now, i.e., no logical change.
---
 arch/powerpc/lib/qspinlock.c | 46 +++++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index ee85198d5e28..1a58ed51c060 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
 static int HEAD_SPINS __read_mostly = (1<<13);
 
 static bool pv_yield_owner __read_mostly = true;
+static bool pv_yield_prev __read_mostly = false;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -214,6 +215,31 @@ static void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt
 	cpu_relax();
 }
 
+static void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
+{
+	u32 yield_count;
+
+	if (!paravirt)
+		goto relax;
+
+	if (!pv_yield_prev)
+		goto relax;
+
+	yield_count = yield_count_of(prev_cpu);
+	if ((yield_count & 1) == 0)
+		goto relax; /* owner vcpu is running */
+
+	smp_rmb(); /* See yield_to_locked_owner comment */
+
+	if (!node->locked) {
+		yield_to_preempted(prev_cpu, yield_count);
+		return;
+	}
+
+relax:
+	cpu_relax();
+}
+
 
 static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool paravirt)
 {
@@ -276,13 +302,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	 */
 	if (old & _Q_TAIL_CPU_MASK) {
 		struct qnode *prev = get_tail_qnode(lock, old);
+		int prev_cpu = get_tail_cpu(old);
 
 		/* Link @node into the waitqueue. */
 		WRITE_ONCE(prev->next, node);
 
 		/* Wait for mcs node lock to be released */
 		while (!node->locked)
-			cpu_relax();
+			yield_to_prev(lock, node, prev_cpu, paravirt);
 
 		smp_rmb(); /* acquire barrier for the mcs lock */
 	}
@@ -430,12 +457,29 @@ static int pv_yield_owner_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
 
+static int pv_yield_prev_set(void *data, u64 val)
+{
+	pv_yield_prev = !!val;
+
+	return 0;
+}
+
+static int pv_yield_prev_get(void *data, u64 *val)
+{
+	*val = pv_yield_prev;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, pv_yield_prev_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
 	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
 	if (is_shared_processor()) {
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
+		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 	}
 
 	return 0;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 10/14] powerpc/qspinlock: allow stealing when head of queue yields
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
                   ` (8 preceding siblings ...)
  2022-07-11  3:04 ` [RFC PATCH 09/14] powerpc/qspinlock: implement option to yield to previous node Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 11/14] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

If the head of queue is preventing stealing but it finds the owner vCPU
is preempted, it will yield its cycles to the owner which could cause it
to become preempted. Add an option to re-allow stealers before yielding,
and disallow them again after returning from the yield.

Disable this option by default for now, i.e., no logical change.
---
 arch/powerpc/lib/qspinlock.c | 49 +++++++++++++++++++++++++++++++++---
 1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 1a58ed51c060..4f1dc3322485 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
 static int HEAD_SPINS __read_mostly = (1<<13);
 
 static bool pv_yield_owner __read_mostly = true;
+static bool pv_yield_allow_steal __read_mostly = false;
 static bool pv_yield_prev __read_mostly = false;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
@@ -163,6 +164,24 @@ static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
 	return prev;
 }
 
+static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
+{
+	u32 new = _Q_MUST_Q_VAL;
+	u32 prev;
+
+	asm volatile(
+"1:	lwarx	%0,0,%1		# queued_spin_set_mustq			\n"
+"	andc	%0,%0,%2						\n"
+"	stwcx.	%0,0,%1							\n"
+"	bne-	1b							\n"
+	: "=&r" (prev)
+	: "r" (&lock->val), "r" (new)
+	: "cr0", "memory");
+
+	return prev;
+}
+
+
 static inline struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 {
 	int cpu = get_tail_cpu(val);
@@ -178,7 +197,7 @@ static inline struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
-static void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+static void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
 {
 	int owner;
 	u32 yield_count;
@@ -207,7 +226,11 @@ static void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt
 	smp_rmb();
 
 	if (READ_ONCE(lock->val) == val) {
+		if (clear_mustq)
+			lock_clear_mustq(lock);
 		yield_to_preempted(owner, yield_count);
+		if (clear_mustq)
+			lock_set_mustq(lock);
 		/* Don't relax if we yielded. Maybe we should? */
 		return;
 	}
@@ -253,7 +276,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 			break;
 
 		if (val & _Q_LOCKED_VAL) {
-			yield_to_locked_owner(lock, val, paravirt);
+			yield_to_locked_owner(lock, val, paravirt, false);
 			continue;
 		}
 
@@ -317,7 +340,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	if (!MAYBE_STEALERS) {
 		/* We're at the head of the waitqueue, wait for the lock. */
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
-			yield_to_locked_owner(lock, val, paravirt);
+			yield_to_locked_owner(lock, val, paravirt, false);
 
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
@@ -337,7 +360,8 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 				lock_set_mustq(lock);
 				val |= _Q_MUST_Q_VAL;
 			}
-			yield_to_locked_owner(lock, val, paravirt);
+			yield_to_locked_owner(lock, val, paravirt,
+				pv_yield_allow_steal && (iters > HEAD_SPINS));
 		}
 
 		/* If we're the last queued, must clean up the tail. */
@@ -457,6 +481,22 @@ static int pv_yield_owner_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, pv_yield_owner_set, "%llu\n");
 
+static int pv_yield_allow_steal_set(void *data, u64 val)
+{
+	pv_yield_allow_steal = !!val;
+
+	return 0;
+}
+
+static int pv_yield_allow_steal_get(void *data, u64 *val)
+{
+	*val = pv_yield_allow_steal;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_allow_steal, pv_yield_allow_steal_get, pv_yield_allow_steal_set, "%llu\n");
+
 static int pv_yield_prev_set(void *data, u64 val)
 {
 	pv_yield_prev = !!val;
@@ -479,6 +519,7 @@ static __init int spinlock_debugfs_init(void)
 	debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, &fops_head_spins);
 	if (is_shared_processor()) {
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
+		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 	}
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 11/14] powerpc/qspinlock: allow propagation of yield CPU down the queue
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
                   ` (9 preceding siblings ...)
  2022-07-11  3:04 ` [RFC PATCH 10/14] powerpc/qspinlock: allow stealing when head of queue yields Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 12/14] powerpc/qspinlock: add ability to prod new queue head CPU Nicholas Piggin
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Having all CPUs poll the lock word for the owner CPU that should be
yielded to defeats most of the purpose of using MCS queueing for
scalability. Yet it may be desirable for queued waiters to to yield
to a preempted owner.

s390 addreses this problem by having queued waiters sample the lock
word to find the owner much less frequently. In this approach, the
waiters never sample it directly, but the queue head propagates the
owner CPU back to the next waiter if it ever finds the owner has
been preempted. Queued waiters then subsequently propagate the owner
CPU back to the next waiter, and so on.

Disable this option by default for now, i.e., no logical change.
---
 arch/powerpc/lib/qspinlock.c | 103 +++++++++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index 4f1dc3322485..c39af19f006e 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -12,6 +12,7 @@
 struct qnode {
 	struct qnode	*next;
 	struct qspinlock *lock;
+	int		yield_cpu;
 	u8		locked; /* 1 if lock acquired */
 };
 
@@ -28,6 +29,7 @@ static int HEAD_SPINS __read_mostly = (1<<13);
 static bool pv_yield_owner __read_mostly = true;
 static bool pv_yield_allow_steal __read_mostly = false;
 static bool pv_yield_prev __read_mostly = false;
+static bool pv_yield_propagate_owner __read_mostly = false;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -197,7 +199,7 @@ static inline struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
 	BUG();
 }
 
-static void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
+static void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
 {
 	int owner;
 	u32 yield_count;
@@ -238,13 +240,76 @@ static void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt
 	cpu_relax();
 }
 
+static void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
+{
+	__yield_to_locked_owner(lock, val, paravirt, false);
+}
+
+static void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
+{
+	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
+}
+
+static void propagate_yield_cpu(struct qnode *node, u32 val, int *set_yield_cpu, bool paravirt)
+{
+	struct qnode *next;
+	int owner;
+
+	if (!paravirt)
+		return;
+	if (!pv_yield_propagate_owner)
+		return;
+
+	owner = get_owner_cpu(val);
+	if (*set_yield_cpu == owner)
+		return;
+
+	next = READ_ONCE(node->next);
+	if (!next)
+		return;
+
+	if (vcpu_is_preempted(owner)) {
+		next->yield_cpu = owner;
+		*set_yield_cpu = owner;
+	} else if (*set_yield_cpu != -1) {
+		next->yield_cpu = owner;
+		*set_yield_cpu = owner;
+	}
+}
+
 static void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_cpu, bool paravirt)
 {
 	u32 yield_count;
+	int yield_cpu;
 
 	if (!paravirt)
 		goto relax;
 
+	if (!pv_yield_propagate_owner)
+		goto yield_prev;
+
+	yield_cpu = READ_ONCE(node->yield_cpu);
+	if (yield_cpu == -1) {
+		/* Propagate back the -1 CPU */
+		if (node->next && node->next->yield_cpu != -1)
+			node->next->yield_cpu = yield_cpu;
+		goto yield_prev;
+	}
+
+	yield_count = yield_count_of(yield_cpu);
+	if ((yield_count & 1) == 0)
+		goto yield_prev; /* owner vcpu is running */
+
+	smp_rmb();
+
+	if (yield_cpu == node->yield_cpu) {
+		if (node->next && node->next->yield_cpu != yield_cpu)
+			node->next->yield_cpu = yield_cpu;
+		yield_to_preempted(yield_cpu, yield_count);
+		return;
+	}
+
+yield_prev:
 	if (!pv_yield_prev)
 		goto relax;
 
@@ -276,7 +341,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 			break;
 
 		if (val & _Q_LOCKED_VAL) {
-			yield_to_locked_owner(lock, val, paravirt, false);
+			yield_to_locked_owner(lock, val, paravirt);
 			continue;
 		}
 
@@ -313,6 +378,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	node = &qnodesp->nodes[idx];
 	node->next = NULL;
 	node->lock = lock;
+	node->yield_cpu = -1;
 	node->locked = 0;
 
 	tail = encode_tail_cpu();
@@ -334,13 +400,21 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		while (!node->locked)
 			yield_to_prev(lock, node, prev_cpu, paravirt);
 
+		/* Clear out stale propagated yield_cpu */
+		if (paravirt && pv_yield_propagate_owner && node->yield_cpu != -1)
+			node->yield_cpu = -1;
+
 		smp_rmb(); /* acquire barrier for the mcs lock */
 	}
 
 	if (!MAYBE_STEALERS) {
+		int set_yield_cpu = -1;
+
 		/* We're at the head of the waitqueue, wait for the lock. */
-		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
-			yield_to_locked_owner(lock, val, paravirt, false);
+		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
+			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
+			yield_head_to_locked_owner(lock, val, paravirt, false);
+		}
 
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
@@ -352,6 +426,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		/* We must be the owner, just set the lock bit and acquire */
 		lock_set_locked(lock);
 	} else {
+		int set_yield_cpu = -1;
 		int iters = 0;
 again:
 		/* We're at the head of the waitqueue, wait for the lock. */
@@ -360,7 +435,8 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 				lock_set_mustq(lock);
 				val |= _Q_MUST_Q_VAL;
 			}
-			yield_to_locked_owner(lock, val, paravirt,
+			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
+			yield_head_to_locked_owner(lock, val, paravirt,
 				pv_yield_allow_steal && (iters > HEAD_SPINS));
 		}
 
@@ -513,6 +589,22 @@ static int pv_yield_prev_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, pv_yield_prev_set, "%llu\n");
 
+static int pv_yield_propagate_owner_set(void *data, u64 val)
+{
+	pv_yield_propagate_owner = !!val;
+
+	return 0;
+}
+
+static int pv_yield_propagate_owner_get(void *data, u64 *val)
+{
+	*val = pv_yield_propagate_owner;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
@@ -521,6 +613,7 @@ static __init int spinlock_debugfs_init(void)
 		debugfs_create_file("qspl_pv_yield_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_owner);
 		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
+		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
 	}
 
 	return 0;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 12/14] powerpc/qspinlock: add ability to prod new queue head CPU
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
                   ` (10 preceding siblings ...)
  2022-07-11  3:04 ` [RFC PATCH 11/14] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 13/14] powerpc/qspinlock: trylock and initial lock attempt may steal Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 14/14] powerpc/qspinlock: use spin_begin/end API Nicholas Piggin
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

After the head of the queue acquires the lock, it releases the
next waiter in the queue to become the new head. Add an option
to prod the new head if its vCPU was preempted. This may only
have an effect if queue waiters are yielding.

Disable this option by default for now, i.e., no logical change.
---
 arch/powerpc/lib/qspinlock.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index c39af19f006e..ce0563c56915 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -12,6 +12,7 @@
 struct qnode {
 	struct qnode	*next;
 	struct qspinlock *lock;
+	int		cpu;
 	int		yield_cpu;
 	u8		locked; /* 1 if lock acquired */
 };
@@ -30,6 +31,7 @@ static bool pv_yield_owner __read_mostly = true;
 static bool pv_yield_allow_steal __read_mostly = false;
 static bool pv_yield_prev __read_mostly = false;
 static bool pv_yield_propagate_owner __read_mostly = false;
+static bool pv_prod_head __read_mostly = false;
 
 static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
 
@@ -378,6 +380,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	node = &qnodesp->nodes[idx];
 	node->next = NULL;
 	node->lock = lock;
+	node->cpu = smp_processor_id();
 	node->yield_cpu = -1;
 	node->locked = 0;
 
@@ -464,7 +467,14 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 	 * this store to locked. The corresponding barrier is the smp_rmb()
 	 * acquire barrier for mcs lock, above.
 	 */
-	WRITE_ONCE(next->locked, 1);
+	if (paravirt && pv_prod_head) {
+		int next_cpu = next->cpu;
+		WRITE_ONCE(next->locked, 1);
+		if (vcpu_is_preempted(next_cpu))
+			prod_cpu(next_cpu);
+	} else {
+		WRITE_ONCE(next->locked, 1);
+	}
 
 release:
 	qnodesp->count--; /* release the node */
@@ -605,6 +615,22 @@ static int pv_yield_propagate_owner_get(void *data, u64 *val)
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
 
+static int pv_prod_head_set(void *data, u64 val)
+{
+	pv_prod_head = !!val;
+
+	return 0;
+}
+
+static int pv_prod_head_get(void *data, u64 *val)
+{
+	*val = pv_prod_head;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, pv_prod_head_set, "%llu\n");
+
 static __init int spinlock_debugfs_init(void)
 {
 	debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, &fops_steal_spins);
@@ -614,6 +640,7 @@ static __init int spinlock_debugfs_init(void)
 		debugfs_create_file("qspl_pv_yield_allow_steal", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
 		debugfs_create_file("qspl_pv_yield_prev", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_prev);
 		debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
+		debugfs_create_file("qspl_pv_prod_head", 0600, arch_debugfs_dir, NULL, &fops_pv_prod_head);
 	}
 
 	return 0;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 13/14] powerpc/qspinlock: trylock and initial lock attempt may steal
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
                   ` (11 preceding siblings ...)
  2022-07-11  3:04 ` [RFC PATCH 12/14] powerpc/qspinlock: add ability to prod new queue head CPU Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  2022-07-11  3:04 ` [RFC PATCH 14/14] powerpc/qspinlock: use spin_begin/end API Nicholas Piggin
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

This gives trylock slightly more strength, and it also gives most
of the benefit of passing 'val' back through the slowpath without
the complexity.
---
 arch/powerpc/include/asm/qspinlock.h | 39 +++++++++++++++++++++++++++-
 arch/powerpc/lib/qspinlock.c         |  9 +++++++
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 44601b261e08..d3d2039237b2 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -5,6 +5,8 @@
 #include <linux/compiler.h>
 #include <asm/qspinlock_types.h>
 
+#define _Q_SPIN_TRY_LOCK_STEAL 1
+
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
 	return READ_ONCE(lock->val);
@@ -26,11 +28,12 @@ static __always_inline u32 queued_spin_get_locked_val(void)
 	return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
 }
 
-static __always_inline int queued_spin_trylock(struct qspinlock *lock)
+static __always_inline int __queued_spin_trylock_nosteal(struct qspinlock *lock)
 {
 	u32 new = queued_spin_get_locked_val();
 	u32 prev;
 
+	/* Trylock succeeds only when unlocked and no queued nodes */
 	asm volatile(
 "1:	lwarx	%0,0,%1,%3	# queued_spin_trylock			\n"
 "	cmpwi	0,%0,0							\n"
@@ -49,6 +52,40 @@ static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 	return 0;
 }
 
+static __always_inline int __queued_spin_trylock_steal(struct qspinlock *lock)
+{
+	u32 new = queued_spin_get_locked_val();
+	u32 prev, tmp;
+
+	/* Trylock may get ahead of queued nodes if it finds unlocked */
+	asm volatile(
+"1:	lwarx	%0,0,%2,%5	# queued_spin_trylock			\n"
+"	andc.	%1,%0,%4						\n"
+"	bne-	2f							\n"
+"	and	%1,%0,%4						\n"
+"	or	%1,%1,%3						\n"
+"	stwcx.	%1,0,%2							\n"
+"	bne-	1b							\n"
+"\t"	PPC_ACQUIRE_BARRIER "						\n"
+"2:									\n"
+	: "=&r" (prev), "=&r" (tmp)
+	: "r" (&lock->val), "r" (new), "r" (_Q_TAIL_CPU_MASK),
+	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
+	: "cr0", "memory");
+
+	if (likely(!(prev & ~_Q_TAIL_CPU_MASK)))
+		return 1;
+	return 0;
+}
+
+static __always_inline int queued_spin_trylock(struct qspinlock *lock)
+{
+	if (!_Q_SPIN_TRY_LOCK_STEAL)
+		return __queued_spin_trylock_nosteal(lock);
+	else
+		return __queued_spin_trylock_steal(lock);
+}
+
 void queued_spin_lock_slowpath(struct qspinlock *lock);
 
 static __always_inline void queued_spin_lock(struct qspinlock *lock)
diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index ce0563c56915..d67b923e4f98 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -24,7 +24,11 @@ struct qnodes {
 
 /* Tuning parameters */
 static int STEAL_SPINS __read_mostly = (1<<5);
+#if _Q_SPIN_TRY_LOCK_STEAL == 1
+static const bool MAYBE_STEALERS = true;
+#else
 static bool MAYBE_STEALERS __read_mostly = true;
+#endif
 static int HEAD_SPINS __read_mostly = (1<<13);
 
 static bool pv_yield_owner __read_mostly = true;
@@ -505,6 +509,10 @@ void pv_spinlocks_init(void)
 #include <linux/debugfs.h>
 static int steal_spins_set(void *data, u64 val)
 {
+#if _Q_SPIN_TRY_LOCK_STEAL == 1
+	/* MAYBE_STEAL remains true */
+	STEAL_SPINS = val;
+#else
 	static DEFINE_MUTEX(lock);
 
 	mutex_lock(&lock);
@@ -522,6 +530,7 @@ static int steal_spins_set(void *data, u64 val)
 		STEAL_SPINS = val;
 	}
 	mutex_unlock(&lock);
+#endif
 
 	return 0;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 14/14] powerpc/qspinlock: use spin_begin/end API
  2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
                   ` (12 preceding siblings ...)
  2022-07-11  3:04 ` [RFC PATCH 13/14] powerpc/qspinlock: trylock and initial lock attempt may steal Nicholas Piggin
@ 2022-07-11  3:04 ` Nicholas Piggin
  13 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2022-07-11  3:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Use the spin_begin/spin_cpu_relax/spin_end APIs in qspinlock, which helps
to prevent threads issuing a lot of expensive priority nops which may not
have much effect due to immediately executing low then medium priority.
---
 arch/powerpc/lib/qspinlock.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
index d67b923e4f98..486423d566d3 100644
--- a/arch/powerpc/lib/qspinlock.c
+++ b/arch/powerpc/lib/qspinlock.c
@@ -243,7 +243,7 @@ static void __yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravi
 		return;
 	}
 relax:
-	cpu_relax();
+	spin_cpu_relax();
 }
 
 static void yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
@@ -331,7 +331,7 @@ static void yield_to_prev(struct qspinlock *lock, struct qnode *node, int prev_c
 	}
 
 relax:
-	cpu_relax();
+	spin_cpu_relax();
 }
 
 
@@ -340,6 +340,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 	int iters;
 
 	/* Attempt to steal the lock */
+	spin_begin();
 	for (iters = 0; iters < STEAL_SPINS; iters++) {
 		u32 val = READ_ONCE(lock->val);
 
@@ -354,6 +355,7 @@ static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool parav
 		if (trylock_with_tail_cpu(lock, val))
 			return true;
 	}
+	spin_end();
 
 	return false;
 }
@@ -404,8 +406,10 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		WRITE_ONCE(prev->next, node);
 
 		/* Wait for mcs node lock to be released */
+		spin_begin();
 		while (!node->locked)
 			yield_to_prev(lock, node, prev_cpu, paravirt);
+		spin_end();
 
 		/* Clear out stale propagated yield_cpu */
 		if (paravirt && pv_yield_propagate_owner && node->yield_cpu != -1)
@@ -418,10 +422,12 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		int set_yield_cpu = -1;
 
 		/* We're at the head of the waitqueue, wait for the lock. */
+		spin_begin();
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
 			propagate_yield_cpu(node, val, &set_yield_cpu, paravirt);
 			yield_head_to_locked_owner(lock, val, paravirt, false);
 		}
+		spin_end();
 
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
@@ -436,6 +442,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 		int set_yield_cpu = -1;
 		int iters = 0;
 again:
+		spin_begin();
 		/* We're at the head of the waitqueue, wait for the lock. */
 		while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
 			if (iters++ == HEAD_SPINS) {
@@ -446,6 +453,7 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 			yield_head_to_locked_owner(lock, val, paravirt,
 				pv_yield_allow_steal && (iters > HEAD_SPINS));
 		}
+		spin_end();
 
 		/* If we're the last queued, must clean up the tail. */
 		if ((val & _Q_TAIL_CPU_MASK) == tail) {
@@ -461,8 +469,13 @@ static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, b
 
 unlock_next:
 	/* contended path; must wait for next != NULL (MCS protocol) */
-	while (!(next = READ_ONCE(node->next)))
-		cpu_relax();
+	next = READ_ONCE(node->next);
+	if (!next) {
+		spin_begin();
+		while (!(next = READ_ONCE(node->next)))
+			cpu_relax();
+		spin_end();
+	}
 
 	/*
 	 * Unlock the next mcs waiter node. Release barrier is not required
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-07-11  3:14 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-11  3:04 [RFC PATCH 00/14] add our own qspinlock implementation Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 01/14] powerpc/qspinlock: powerpc " Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 02/14] powerpc/qspinlock: add mcs queueing for contended waiters Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 03/14] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 04/14] powerpc/qspinlock: convert atomic operations to assembly Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 05/14] powerpc/qspinlock: allow new waiters to steal the lock before queueing Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 06/14] powerpc/qspinlock: theft prevention to control latency Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 07/14] powerpc/qspinlock: store owner CPU in lock word Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 08/14] powerpc/qspinlock: paravirt yield to lock owner Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 09/14] powerpc/qspinlock: implement option to yield to previous node Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 10/14] powerpc/qspinlock: allow stealing when head of queue yields Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 11/14] powerpc/qspinlock: allow propagation of yield CPU down the queue Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 12/14] powerpc/qspinlock: add ability to prod new queue head CPU Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 13/14] powerpc/qspinlock: trylock and initial lock attempt may steal Nicholas Piggin
2022-07-11  3:04 ` [RFC PATCH 14/14] powerpc/qspinlock: use spin_begin/end API Nicholas Piggin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.