* [PATCH v15 0/6] Add NUMA-awareness to qspinlock
@ 2021-05-14 20:07 Alex Kogan
  2021-05-14 20:07 ` [PATCH v15 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Alex Kogan @ 2021-05-14 20:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Changes from v14:
----------------

- Change the way the main queue is scanned and reordered in
cna_wait_head_or_lock(), based on Peter's suggestion.

In detail: instead of inspecting only one queue node, we now scan
(and move nodes into the secondary queue) as long as the lock
remains busy. This simplifies the code quite a bit, as we no longer need
to call cna_order_queue() again from cna_lock_handoff().

- Use local_clock() instead of relying on jiffies to decide when to
flush the secondary queue, per Andy's suggestion.

- Use module_param() for numa_spinlock_threshold_ns, so it can be tweaked
at runtime, per Andy's suggestion.

- Reduce the default value for numa_spinlock_threshold_ns to 1ms based on
the comments from Andy and Peter. The performance numbers below include
results with the new default as well as with the value of 10ms, which was 
the default threshold in previous revisions of the series.

Summary
-------

Lock throughput can be increased by handing a lock to a waiter on the
same NUMA node as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA nodes. This patch series introduces
CNA (compact NUMA-aware lock) as the slow path for qspinlock. It is
enabled through a configuration option (NUMA_AWARE_SPINLOCKS).

CNA is a NUMA-aware version of the MCS lock. Spinning threads are
organized in two queues, a primary queue for threads running on the same
node as the current lock holder, and a secondary queue for threads
running on other nodes. Threads store the ID of the node on which
they are running in their queue nodes. After acquiring the MCS lock and
before acquiring the spinlock, the MCS lock holder checks whether the next
waiter in the primary queue (if one exists) is running on the same NUMA node.
If it is not, that waiter is detached from the main queue and moved into
the tail of the secondary queue. This way, we gradually filter the primary
queue, leaving only waiters running on the same preferred NUMA node. Note
that certain prioritized waiters (e.g., in irq and nmi contexts) are
excluded from being moved to the secondary queue. We change the NUMA node
preference after a waiter at the head of the secondary queue spins for a
certain amount of time. We do that by flushing the secondary queue into
the head of the primary queue, effectively changing the preference to the
NUMA node of the waiter at the head of the secondary queue at the time of
the flush.

More details are available at https://arxiv.org/abs/1810.05600.
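
To make the filtering step concrete, here is a minimal user-space sketch
(our illustration only; the names and the simplified node layout do not
match the kernel structures, and the real code in patch 3 additionally
requires the detached waiter to have a successor so that the primary-queue
tail stays valid):

/* Illustrative user-space model of CNA's queue filtering -- not kernel code. */
#include <stdio.h>

struct waiter {
	struct waiter *next;
	int numa_node;
};

/*
 * If the waiter behind @head runs on a different NUMA node, detach it from
 * the primary queue and append it to the circular secondary queue.
 * Returns the (possibly new) secondary-queue tail.
 */
static struct waiter *filter_next(struct waiter *head, struct waiter *sec_tail)
{
	struct waiter *next = head->next;

	if (!next || next->numa_node == head->numa_node)
		return sec_tail;		/* nothing to move */

	head->next = next->next;		/* detach from the primary queue */

	if (!sec_tail) {
		next->next = next;		/* create a one-element circular list */
	} else {
		next->next = sec_tail->next;	/* link to the old secondary head */
		sec_tail->next = next;
	}
	return next;				/* the moved waiter is the new tail */
}

int main(void)
{
	struct waiter w2   = { NULL, 0 };	/* local waiter */
	struct waiter w1   = { &w2, 1 };	/* remote waiter */
	struct waiter head = { &w1, 0 };	/* current lock holder */
	struct waiter *sec_tail = filter_next(&head, NULL);

	printf("next in primary queue: node %d; secondary tail: node %d\n",
	       head.next->numa_node, sec_tail->numa_node);
	return 0;
}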

We have done some performance evaluation with the locktorture module
as well as with several benchmarks from the will-it-scale repo.
The following locktorture results are from an Oracle X5-4 server
(four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
cores each). Each number represents an average (over 25 runs) of the
total number of ops (x10^7) reported at the end of each run. The 
standard deviation is also reported in parentheses, and is generally
within about 3% of the average. The 'stock' kernel is v5.12.0,
commit 3cf5c8ea3a66, compiled in the default configuration. 
'CNA' is the modified kernel with NUMA_AWARE_SPINLOCKS set and
the new default threshold of 1ms for flushing the secondary queue
(numa_spinlock_threshold_ns); 'CNA-10ms' is the same as CNA, 
but uses the threshold of 10ms. The speedup is calculated by dividing 
the result of 'CNA' and 'CNA-10ms', respectively, by the result
achieved with 'stock'.
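For example, in the 142-thread row of the first table below,
9.477 / 5.784 ≈ 1.638.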

#thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
  1  2.695 (0.108) 2.704 (0.099) / 1.003  2.712 (0.077) / 1.006
  2  2.753 (0.187) 2.785 (0.171) / 1.012  2.822 (0.174) / 1.025
  4  4.355 (0.139) 4.417 (0.179) / 1.014  4.361 (0.181) / 1.001
  8  5.163 (0.119) 7.017 (0.195) / 1.359  7.369 (0.186) / 1.427
 16  5.944 (0.134) 9.110 (0.242) / 1.532  9.187 (0.233) / 1.546
 32  6.310 (0.082) 9.710 (0.156) / 1.539  9.827 (0.161) / 1.557
 36  6.374 (0.112) 9.777 (0.141) / 1.534  9.830 (0.124) / 1.542
 72  6.170 (0.139) 9.922 (0.190) / 1.608  9.945 (0.136) / 1.612
108  6.002 (0.089) 9.651 (0.176) / 1.608  9.847 (0.125) / 1.641
142  5.784 (0.079) 9.477 (0.089) / 1.638  9.641 (0.113) / 1.667

The following tables contain throughput results (ops/us) from the same
setup for will-it-scale/open1_threads: 

#thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
  1  0.503 (0.004) 0.501 (0.001) / 0.996  0.503 (0.002) / 1.000
  2  0.783 (0.014) 0.773 (0.011) / 0.988  0.774 (0.016) / 0.989
  4  1.422 (0.025) 1.398 (0.030) / 0.983  1.403 (0.025) / 0.987
  8  1.753 (0.104) 1.641 (0.132) / 0.936  1.675 (0.134) / 0.956
 16  1.851 (0.097) 1.760 (0.103) / 0.951  1.774 (0.119) / 0.959
 32  0.905 (0.081) 1.708 (0.081) / 1.888  1.738 (0.069) / 1.922
 36  0.895 (0.058) 1.726 (0.065) / 1.928  1.735 (0.081) / 1.938
 72  0.823 (0.033) 1.610 (0.067) / 1.957  1.647 (0.067) / 2.002
108  0.845 (0.035) 1.588 (0.054) / 1.878  1.740 (0.067) / 2.058
142  0.840 (0.030) 1.546 (0.042) / 1.839  1.740 (0.048) / 2.070

and will-it-scale/lock2_threads:

#thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
  1  1.551 (0.003) 1.558 (0.006) / 1.005  1.558 (0.003) / 1.005
  2  2.722 (0.064) 2.704 (0.063) / 0.993  2.727 (0.058) / 1.002
  4  5.286 (0.178) 5.360 (0.151) / 1.014  5.360 (0.135) / 1.014
  8  4.115 (0.297) 3.906 (0.383) / 0.949  4.062 (0.366) / 0.987
 16  4.119 (0.121) 3.950 (0.131) / 0.959  4.009 (0.132) / 0.973
 32  2.508 (0.097) 3.805 (0.106) / 1.517  3.960 (0.091) / 1.579
 36  2.457 (0.101) 3.810 (0.072) / 1.551  3.931 (0.106) / 1.600
 72  1.913 (0.103) 3.530 (0.070) / 1.845  3.860 (0.078) / 2.018
108  1.891 (0.109) 3.410 (0.079) / 1.803  3.881 (0.097) / 2.052
142  1.752 (0.096) 3.236 (0.080) / 1.847  3.774 (0.078) / 2.155

Our evaluation shows that CNA also improves the performance of user
applications that have hot pthread mutexes. Those mutexes are
blocking, and waiting threads park and unpark via the futex
mechanism in the kernel. Given that kernel futex chains, which
are hashed by the mutex address, are each protected by a
chain-specific spinlock, the contention on a user-mode mutex
translates into contention on a kernel-level spinlock.
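
The following minimal user-space sketch (our illustration, not one of the
benchmarks used here) shows the pattern: several threads hammering one
pthread mutex, so that contended acquisitions park and wake through futex()
and serialize on the corresponding kernel hash-bucket spinlock. Build with
"gcc -O2 -pthread".

/* Toy example of a hot pthread mutex; not the benchmarks reported below. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define NITERS   1000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long counter;

static void *worker(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < NITERS; i++) {
		pthread_mutex_lock(&lock);	/* contended case parks via futex() */
		counter++;
		pthread_mutex_unlock(&lock);	/* may wake a waiter via futex() */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	printf("counter = %lu\n", counter);	/* expect NTHREADS * NITERS */
	return 0;
}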

Here are the throughput results (ops/us) for the leveldb ‘readrandom’
benchmark:

#thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
  1  0.533 (0.011) 0.539 (0.014) / 1.012  0.536 (0.013) / 1.006
  2  0.854 (0.022) 0.856 (0.017) / 1.003  0.857 (0.020) / 1.004
  4  1.236 (0.028) 1.238 (0.054) / 1.002  1.217 (0.054) / 0.985
  8  1.207 (0.117) 1.198 (0.122) / 0.993  1.155 (0.138) / 0.957
 16  0.758 (0.055) 1.128 (0.118) / 1.489  1.068 (0.131) / 1.409
 32  0.743 (0.027) 1.153 (0.028) / 1.551  1.147 (0.021) / 1.543
 36  0.708 (0.027) 1.150 (0.024) / 1.623  1.137 (0.026) / 1.605
 72  0.629 (0.016) 1.112 (0.019) / 1.767  1.134 (0.019) / 1.802
108  0.610 (0.012) 1.053 (0.018) / 1.725  1.130 (0.017) / 1.853
142  0.606 (0.013) 1.008 (0.020) / 1.664  1.110 (0.023) / 1.833

Further comments are welcome and appreciated.

Alex Kogan (6):
  locking/qspinlock: Rename mcs lock/unlock macros and make them more
    generic
  locking/qspinlock: Refactor the qspinlock slow path
  locking/qspinlock: Introduce CNA into the slow path of qspinlock
  locking/qspinlock: Introduce starvation avoidance into CNA
  locking/qspinlock: Avoid moving certain threads between waiting queues
    in CNA
  locking/qspinlock: Introduce the shuffle reduction optimization into
    CNA

 .../admin-guide/kernel-parameters.txt         |  18 +
 arch/arm/include/asm/mcs_spinlock.h           |   6 +-
 arch/x86/Kconfig                              |  20 +
 arch/x86/include/asm/qspinlock.h              |   4 +
 arch/x86/kernel/alternative.c                 |   4 +
 include/asm-generic/mcs_spinlock.h            |   4 +-
 kernel/locking/mcs_spinlock.h                 |  20 +-
 kernel/locking/qspinlock.c                    |  82 +++-
 kernel/locking/qspinlock_cna.h                | 425 ++++++++++++++++++
 kernel/locking/qspinlock_paravirt.h           |   2 +-
 10 files changed, 562 insertions(+), 23 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v15 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic
  2021-05-14 20:07 [PATCH v15 0/6] Add NUMA-awareness to qspinlock Alex Kogan
@ 2021-05-14 20:07 ` Alex Kogan
  2021-05-14 20:07 ` [PATCH v15 2/6] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alex Kogan @ 2021-05-14 20:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

The mcs unlock macro (arch_mcs_lock_handoff) should accept the value to be
stored into the lock argument as another argument. This allows using the
same macro in cases where the value to be stored when passing the lock is
different from 1.
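
For example, the CNA variant added later in this series uses the extra
argument to hand off the MCS lock while preserving the encoded tail of its
secondary queue in the @locked field:

	val = node->locked;	/* preserve the secondary queue */
	arch_mcs_lock_handoff(&next->locked, val);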

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 arch/arm/include/asm/mcs_spinlock.h |  6 +++---
 include/asm-generic/mcs_spinlock.h  |  4 ++--
 kernel/locking/mcs_spinlock.h       | 18 +++++++++---------
 kernel/locking/qspinlock.c          |  4 ++--
 kernel/locking/qspinlock_paravirt.h |  2 +-
 5 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf4d06f..1eb4d733459c 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -6,7 +6,7 @@
 #include <asm/spinlock.h>
 
 /* MCS spin-locking. */
-#define arch_mcs_spin_lock_contended(lock)				\
+#define arch_mcs_spin_wait(lock)					\
 do {									\
 	/* Ensure prior stores are observed before we enter wfe. */	\
 	smp_mb();							\
@@ -14,9 +14,9 @@ do {									\
 		wfe();							\
 } while (0)								\
 
-#define arch_mcs_spin_unlock_contended(lock)				\
+#define arch_mcs_lock_handoff(lock, val)				\
 do {									\
-	smp_store_release(lock, 1);					\
+	smp_store_release((lock), (val));				\
 	dsb_sev();							\
 } while (0)
 
diff --git a/include/asm-generic/mcs_spinlock.h b/include/asm-generic/mcs_spinlock.h
index 10cd4ffc6ba2..f933d99c63e0 100644
--- a/include/asm-generic/mcs_spinlock.h
+++ b/include/asm-generic/mcs_spinlock.h
@@ -4,8 +4,8 @@
 /*
  * Architectures can define their own:
  *
- *   arch_mcs_spin_lock_contended(l)
- *   arch_mcs_spin_unlock_contended(l)
+ *   arch_mcs_spin_wait(l)
+ *   arch_mcs_lock_handoff(l, val)
  *
  * See kernel/locking/mcs_spinlock.c.
  */
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 85251d8771d9..e794babc519a 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -21,7 +21,7 @@ struct mcs_spinlock {
 	int count;  /* nesting count, see qspinlock.c */
 };
 
-#ifndef arch_mcs_spin_lock_contended
+#ifndef arch_mcs_spin_wait
 /*
  * Using smp_cond_load_acquire() provides the acquire semantics
  * required so that subsequent operations happen after the
@@ -29,20 +29,20 @@ struct mcs_spinlock {
  * ARM64 would like to do spin-waiting instead of purely
  * spinning, and smp_cond_load_acquire() provides that behavior.
  */
-#define arch_mcs_spin_lock_contended(l)					\
-do {									\
-	smp_cond_load_acquire(l, VAL);					\
+#define arch_mcs_spin_wait(l)					\
+do {								\
+	smp_cond_load_acquire(l, VAL);				\
 } while (0)
 #endif
 
-#ifndef arch_mcs_spin_unlock_contended
+#ifndef arch_mcs_lock_handoff
 /*
  * smp_store_release() provides a memory barrier to ensure all
  * operations in the critical section has been completed before
  * unlocking.
  */
-#define arch_mcs_spin_unlock_contended(l)				\
-	smp_store_release((l), 1)
+#define arch_mcs_lock_handoff(l, val)				\
+	smp_store_release((l), (val))
 #endif
 
 /*
@@ -91,7 +91,7 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	WRITE_ONCE(prev->next, node);
 
 	/* Wait until the lock holder passes the lock down. */
-	arch_mcs_spin_lock_contended(&node->locked);
+	arch_mcs_spin_wait(&node->locked);
 }
 
 /*
@@ -115,7 +115,7 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	}
 
 	/* Pass lock to next waiter. */
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_lock_handoff(&next->locked, 1);
 }
 
 #endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cbff6ba53d56..435d696f9250 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -471,7 +471,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
-		arch_mcs_spin_lock_contended(&node->locked);
+		arch_mcs_spin_wait(&node->locked);
 
 		/*
 		 * While waiting for the MCS lock, the next pointer may have
@@ -550,7 +550,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_lock_handoff(&next->locked, 1);
 	pv_kick_node(lock, next);
 
 release:
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e84d21aa0722..619d80fd5ea8 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -368,7 +368,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	 *
 	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
 	 *
-	 * The write to next->locked in arch_mcs_spin_unlock_contended()
+	 * The write to next->locked in arch_mcs_lock_handoff()
 	 * must be ordered before the read of pn->state in the cmpxchg()
 	 * below for the code to work correctly. To guarantee full ordering
 	 * irrespective of the success or failure of the cmpxchg(),
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v15 2/6] locking/qspinlock: Refactor the qspinlock slow path
  2021-05-14 20:07 [PATCH v15 0/6] Add NUMA-awareness to qspinlock Alex Kogan
  2021-05-14 20:07 ` [PATCH v15 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
@ 2021-05-14 20:07 ` Alex Kogan
  2021-05-14 20:07 ` [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alex Kogan @ 2021-05-14 20:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Move some of the code manipulating the spin lock into separate functions.
This would allow easier integration of alternative ways to manipulate
that lock.
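
As a preview of how this indirection is used, the CNA patch later in the
series overrides the new helpers before re-including qspinlock.c:

	#undef try_clear_tail
	#define try_clear_tail		cna_try_clear_tail

	#undef mcs_lock_handoff
	#define mcs_lock_handoff	cna_lock_handoff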

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 435d696f9250..e3518709ffdc 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -289,6 +289,34 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
 #endif
 
+/*
+ * __try_clear_tail - try to clear tail by setting the lock value to
+ * _Q_LOCKED_VAL.
+ * @lock: Pointer to the queued spinlock structure
+ * @val: Current value of the lock
+ * @node: Pointer to the MCS node of the lock holder
+ */
+static __always_inline bool __try_clear_tail(struct qspinlock *lock,
+					     u32 val,
+					     struct mcs_spinlock *node)
+{
+	return atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL);
+}
+
+/*
+ * __mcs_lock_handoff - pass the MCS lock to the next waiter
+ * @node: Pointer to the MCS node of the lock holder
+ * @next: Pointer to the MCS node of the first waiter in the MCS queue
+ */
+static __always_inline void __mcs_lock_handoff(struct mcs_spinlock *node,
+					       struct mcs_spinlock *next)
+{
+	arch_mcs_lock_handoff(&next->locked, 1);
+}
+
+#define try_clear_tail		__try_clear_tail
+#define mcs_lock_handoff	__mcs_lock_handoff
+
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
 /**
@@ -533,7 +561,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *       PENDING will make the uncontended transition fail.
 	 */
 	if ((val & _Q_TAIL_MASK) == tail) {
-		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+		if (try_clear_tail(lock, val, node))
 			goto release; /* No contention */
 	}
 
@@ -550,7 +578,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_lock_handoff(&next->locked, 1);
+	mcs_lock_handoff(node, next);
 	pv_kick_node(lock, next);
 
 release:
@@ -575,6 +603,12 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #undef pv_kick_node
 #undef pv_wait_head_or_lock
 
+#undef try_clear_tail
+#define try_clear_tail		__try_clear_tail
+
+#undef mcs_lock_handoff
+#define mcs_lock_handoff			__mcs_lock_handoff
+
 #undef  queued_spin_lock_slowpath
 #define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
 
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2021-05-14 20:07 [PATCH v15 0/6] Add NUMA-awareness to qspinlock Alex Kogan
  2021-05-14 20:07 ` [PATCH v15 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
  2021-05-14 20:07 ` [PATCH v15 2/6] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
@ 2021-05-14 20:07 ` Alex Kogan
  2021-09-22 19:25   ` Davidlohr Bueso
                     ` (2 more replies)
  2021-05-14 20:07 ` [PATCH v15 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 28+ messages in thread
From: Alex Kogan @ 2021-05-14 20:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

In CNA, spinning threads are organized in two queues, a primary queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. After acquiring the
MCS lock and before acquiring the spinlock, the MCS lock
holder checks whether the next waiter in the primary queue (if one exists) is
running on the same NUMA node. If it is not, that waiter is detached from
the main queue and moved into the tail of the secondary queue. This way,
we gradually filter the primary queue, leaving only waiters running on
the same preferred NUMA node. For more details, see
https://arxiv.org/abs/1810.05600.

Note that this variant of CNA may introduce starvation by continuously
passing the lock between waiters in the main queue. This issue will be
addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at
boot time only if we run on a multi-node machine in a native environment and
the new config option is enabled. (For the time being, the patching requires
CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
resolved once static_call() is available.) This default behavior can be
overridden with the new kernel boot command-line option
"numa_spinlock=on/off" (default is "auto").

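For example, to keep the stock qspinlock slow path even on a multi-node
machine, one can boot with:

	numa_spinlock=off
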
Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 .../admin-guide/kernel-parameters.txt         |  10 +
 arch/x86/Kconfig                              |  20 ++
 arch/x86/include/asm/qspinlock.h              |   4 +
 arch/x86/kernel/alternative.c                 |   4 +
 kernel/locking/mcs_spinlock.h                 |   2 +-
 kernel/locking/qspinlock.c                    |  42 ++-
 kernel/locking/qspinlock_cna.h                | 325 ++++++++++++++++++
 7 files changed, 402 insertions(+), 5 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a816935d23d4..94d35507560c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3515,6 +3515,16 @@
 			NUMA balancing.
 			Allowed values are enable and disable
 
+	numa_spinlock=	[NUMA, PV_OPS] Select the NUMA-aware variant
+			of spinlock. The options are:
+			auto - Enable this variant if running on a multi-node
+			machine in native environment.
+			on  - Unconditionally enable this variant.
+			off - Unconditionally disable this variant.
+
+			Not specifying this option is equivalent to
+			numa_spinlock=auto.
+
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
 			'node', 'default' can be specified
 			This can be set from sysctl after boot.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0045e1b44190..819c3dad8afc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1564,6 +1564,26 @@ config NUMA
 
 	  Otherwise, you should say N.
 
+config NUMA_AWARE_SPINLOCKS
+	bool "Numa-aware spinlocks"
+	depends on NUMA
+	depends on QUEUED_SPINLOCKS
+	depends on 64BIT
+	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
+	# This is awkward, but hopefully would be resolved once static_call()
+	# is available.
+	depends on PARAVIRT_SPINLOCKS
+	default y
+	help
+	  Introduce NUMA (Non Uniform Memory Access) awareness into
+	  the slow path of spinlocks.
+
+	  In this variant of qspinlock, the kernel will try to keep the lock
+	  on the same node, thus reducing the number of remote cache misses,
+	  while trading some of the short term fairness for better performance.
+
+	  Say N if you want absolute first come first serve fairness.
+
 config AMD_NUMA
 	def_bool y
 	prompt "Old style AMD Opteron NUMA detection"
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index d86ab942219c..21d09e8db979 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -27,6 +27,10 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
 	return val;
 }
 
+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+extern void cna_configure_spin_lock_slowpath(void);
+#endif
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 6974b5174495..6c73c2dc17fb 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -608,6 +608,10 @@ void __init alternative_instructions(void)
 	 */
 	paravirt_set_cap();
 
+#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+	cna_configure_spin_lock_slowpath();
+#endif
+
 	/*
 	 * First patch paravirt functions, such that we overwrite the indirect
 	 * call with the direct call.
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index e794babc519a..3926aad129ed 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,7 +17,7 @@
 
 struct mcs_spinlock {
 	struct mcs_spinlock *next;
-	int locked; /* 1 if lock acquired */
+	unsigned int locked; /* 1 if lock acquired */
 	int count;  /* nesting count, see qspinlock.c */
 };
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index e3518709ffdc..8c1a21b53913 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -11,7 +11,7 @@
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
 
 #include <linux/smp.h>
 #include <linux/bug.h>
@@ -71,7 +71,8 @@
 /*
  * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
  * size and four of them will fit nicely in one 64-byte cacheline. For
- * pvqspinlock, however, we need more space for extra data. To accommodate
+ * pvqspinlock, however, we need more space for extra data. The same also
+ * applies for the NUMA-aware variant of spinlocks (CNA). To accommodate
  * that, we insert two more long words to pad it up to 32 bytes. IOW, only
  * two of them can fit in a cacheline in this case. That is OK as it is rare
  * to have more than 2 levels of slowpath nesting in actual use. We don't
@@ -80,7 +81,7 @@
  */
 struct qnode {
 	struct mcs_spinlock mcs;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 	long reserved[2];
 #endif
 };
@@ -104,6 +105,8 @@ struct qnode {
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
  *
  * PV doubles the storage and uses the second cacheline for PV state.
+ * CNA also doubles the storage and uses the second cacheline for
+ * CNA-specific state.
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
@@ -317,7 +320,7 @@ static __always_inline void __mcs_lock_handoff(struct mcs_spinlock *node,
 #define try_clear_tail		__try_clear_tail
 #define mcs_lock_handoff	__mcs_lock_handoff
 
-#endif /* _GEN_PV_LOCK_SLOWPATH */
+#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
 
 /**
  * queued_spin_lock_slowpath - acquire the queued spinlock
@@ -589,6 +592,37 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
+/*
+ * Generate the code for NUMA-aware spinlocks
+ */
+#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+#define _GEN_CNA_LOCK_SLOWPATH
+
+#undef pv_init_node
+#define pv_init_node			cna_init_node
+
+#undef pv_wait_head_or_lock
+#define pv_wait_head_or_lock		cna_wait_head_or_lock
+
+#undef try_clear_tail
+#define try_clear_tail			cna_try_clear_tail
+
+#undef mcs_lock_handoff
+#define mcs_lock_handoff			cna_lock_handoff
+
+#undef  queued_spin_lock_slowpath
+/*
+ * defer defining queued_spin_lock_slowpath until after the include to
+ * avoid a name clash with the identically named field in pv_ops.lock
+ * (see cna_configure_spin_lock_slowpath())
+ */
+#include "qspinlock_cna.h"
+#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
+
+#include "qspinlock.c"
+
+#endif
+
 /*
  * Generate the paravirt code for queued_spin_unlock_slowpath().
  */
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
new file mode 100644
index 000000000000..ca564e64e5de
--- /dev/null
+++ b/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,325 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+#include <linux/topology.h>
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a primary queue for
+ * threads running on the same NUMA node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. Schematically, it
+ * looks like this:
+ *
+ *    cna_node
+ *   +----------+     +--------+         +--------+
+ *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
+ *   |mcs:locked| -.  +--------+         +--------+
+ *   +----------+  |
+ *                 `----------------------.
+ *                                        v
+ *                 +--------+         +--------+
+ *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
+ *                 +--------+         +--------+
+ *                     ^                    |
+ *                     `--------------------'
+ *
+ * N.B. locked := 1 if secondary queue is absent. Otherwise, it contains the
+ * encoded pointer to the tail of the secondary queue, which is organized as a
+ * circular list.
+ *
+ * After acquiring the MCS lock and before acquiring the spinlock, the MCS lock
+ * holder checks whether the next waiter in the primary queue (if one exists) is
+ * running on the same NUMA node. If it is not, that waiter is detached from the
+ * main queue and moved into the tail of the secondary queue. This way, we
+ * gradually filter the primary queue, leaving only waiters running on the same
+ * preferred NUMA node.
+ *
+ * For more details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <alex.kogan@oracle.com>
+ *          Dave Dice <dave.dice@oracle.com>
+ */
+
+struct cna_node {
+	struct mcs_spinlock	mcs;
+	u16			numa_node;
+	u16			real_numa_node;
+	u32			encoded_tail;	/* self */
+};
+
+static void __init cna_init_nodes_per_cpu(unsigned int cpu)
+{
+	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
+	int numa_node = cpu_to_node(cpu);
+	int i;
+
+	for (i = 0; i < MAX_NODES; i++) {
+		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
+
+		cn->real_numa_node = numa_node;
+		cn->encoded_tail = encode_tail(cpu, i);
+		/*
+		 * make sure @encoded_tail is not confused with other valid
+		 * values for @locked (0 or 1)
+		 */
+		WARN_ON(cn->encoded_tail <= 1);
+	}
+}
+
+static int __init cna_init_nodes(void)
+{
+	unsigned int cpu;
+
+	/*
+	 * this will break on 32bit architectures, so we restrict
+	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
+	 */
+	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+	/* we store an encoded tail word in the node's @locked field */
+	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
+
+	for_each_possible_cpu(cpu)
+		cna_init_nodes_per_cpu(cpu);
+
+	return 0;
+}
+
+static __always_inline void cna_init_node(struct mcs_spinlock *node)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+
+	cn->numa_node = cn->real_numa_node;
+}
+
+/*
+ * cna_splice_head -- splice the entire secondary queue onto the head of the
+ * primary queue.
+ *
+ * Returns the new primary head node or NULL on failure.
+ */
+static struct mcs_spinlock *
+cna_splice_head(struct qspinlock *lock, u32 val,
+		struct mcs_spinlock *node, struct mcs_spinlock *next)
+{
+	struct mcs_spinlock *head_2nd, *tail_2nd;
+	u32 new;
+
+	tail_2nd = decode_tail(node->locked);
+	head_2nd = tail_2nd->next;
+
+	if (next) {
+		/*
+		 * If the primary queue is not empty, the primary tail doesn't
+		 * need to change and we can simply link the secondary tail to
+		 * the old primary head.
+		 */
+		tail_2nd->next = next;
+	} else {
+		/*
+		 * When the primary queue is empty, the secondary tail becomes
+		 * the primary tail.
+		 */
+
+		/*
+		 * Speculatively break the secondary queue's circular link such
+		 * that when the secondary tail becomes the primary tail it all
+		 * works out.
+		 */
+		tail_2nd->next = NULL;
+
+		/*
+		 * tail_2nd->next = NULL;	old = xchg_tail(lock, tail);
+		 *				prev = decode_tail(old);
+		 * try_cmpxchg_release(...);	WRITE_ONCE(prev->next, node);
+		 *
+		 * If the following cmpxchg() succeeds, our stores will not
+		 * collide.
+		 */
+		new = ((struct cna_node *)tail_2nd)->encoded_tail |
+			_Q_LOCKED_VAL;
+		if (!atomic_try_cmpxchg_release(&lock->val, &val, new)) {
+			/* Restore the secondary queue's circular link. */
+			tail_2nd->next = head_2nd;
+			return NULL;
+		}
+	}
+
+	/* The primary queue head now is what was the secondary queue head. */
+	return head_2nd;
+}
+
+static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
+				      struct mcs_spinlock *node)
+{
+	/*
+	 * We're here because the primary queue is empty; check the secondary
+	 * queue for remote waiters.
+	 */
+	if (node->locked > 1) {
+		struct mcs_spinlock *next;
+
+		/*
+		 * When there are waiters on the secondary queue, try to move
+		 * them back onto the primary queue and let them rip.
+		 */
+		next = cna_splice_head(lock, val, node, NULL);
+		if (next) {
+			arch_mcs_lock_handoff(&next->locked, 1);
+			return true;
+		}
+
+		return false;
+	}
+
+	/* Both queues are empty. Do what MCS does. */
+	return __try_clear_tail(lock, val, node);
+}
+
+/*
+ * cna_splice_next -- splice the next node from the primary queue onto
+ * the secondary queue.
+ */
+static void cna_splice_next(struct mcs_spinlock *node,
+			    struct mcs_spinlock *next,
+			    struct mcs_spinlock *nnext)
+{
+	/* remove 'next' from the main queue */
+	node->next = nnext;
+
+	/* stick `next` on the secondary queue tail */
+	if (node->locked <= 1) { /* if secondary queue is empty */
+		/* create secondary queue */
+		next->next = next;
+	} else {
+		/* add to the tail of the secondary queue */
+		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
+		struct mcs_spinlock *head_2nd = tail_2nd->next;
+
+		tail_2nd->next = next;
+		next->next = head_2nd;
+	}
+
+	node->locked = ((struct cna_node *)next)->encoded_tail;
+}
+
+/*
+ * cna_order_queue - check whether the next waiter in the main queue is on
+ * the same NUMA node as the lock holder; if not, and it has a waiter behind
+ * it in the main queue, move the former onto the secondary queue.
+ * Returns 1 if the next waiter runs on the same NUMA node; 0 otherwise.
+ */
+static int cna_order_queue(struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *next = READ_ONCE(node->next);
+	struct cna_node *cn = (struct cna_node *)node;
+	int numa_node, next_numa_node;
+
+	if (!next)
+		return 0;
+
+	numa_node = cn->numa_node;
+	next_numa_node = ((struct cna_node *)next)->numa_node;
+
+	if (next_numa_node != numa_node) {
+		struct mcs_spinlock *nnext = READ_ONCE(next->next);
+
+		if (nnext)
+			cna_splice_next(node, next, nnext);
+
+		return 0;
+	}
+	return 1;
+}
+
+#define LOCK_IS_BUSY(lock) (atomic_read(&(lock)->val) & _Q_LOCKED_PENDING_MASK)
+
+/* Abuse the pv_wait_head_or_lock() hook to get some work done */
+static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
+						 struct mcs_spinlock *node)
+{
+	/*
+	 * Try and put the time otherwise spent spin waiting on
+	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+	 */
+	while (LOCK_IS_BUSY(lock) && !cna_order_queue(node))
+		cpu_relax();
+
+	return 0; /* we lied; we didn't wait, go do so now */
+}
+
+static inline void cna_lock_handoff(struct mcs_spinlock *node,
+				 struct mcs_spinlock *next)
+{
+	u32 val = 1;
+
+	if (node->locked > 1) {
+		struct cna_node *cn = (struct cna_node *)node;
+
+		val = node->locked;	/* preserve secondary queue */
+
+		/*
+		 * We have a local waiter, either real or fake one;
+		 * reload @next in case it was changed by cna_order_queue().
+		 */
+		next = node->next;
+
+		/*
+		 * Pass over NUMA node id of primary queue, to maintain the
+		 * preference even if the next waiter is on a different node.
+		 */
+		((struct cna_node *)next)->numa_node = cn->numa_node;
+	}
+
+	arch_mcs_lock_handoff(&next->locked, val);
+}
+
+/*
+ * Constant (boot-param configurable) flag selecting the NUMA-aware variant
+ * of spinlock.  Possible values: -1 (off) / 0 (auto, default) / 1 (on).
+ */
+static int numa_spinlock_flag;
+
+static int __init numa_spinlock_setup(char *str)
+{
+	if (!strcmp(str, "auto")) {
+		numa_spinlock_flag = 0;
+		return 1;
+	} else if (!strcmp(str, "on")) {
+		numa_spinlock_flag = 1;
+		return 1;
+	} else if (!strcmp(str, "off")) {
+		numa_spinlock_flag = -1;
+		return 1;
+	}
+
+	return 0;
+}
+__setup("numa_spinlock=", numa_spinlock_setup);
+
+void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/*
+ * Switch to the NUMA-friendly slow path for spinlocks when we have
+ * multiple NUMA nodes in native environment, unless the user has
+ * overridden this default behavior by setting the numa_spinlock flag.
+ */
+void __init cna_configure_spin_lock_slowpath(void)
+{
+
+	if (numa_spinlock_flag < 0)
+		return;
+
+	if (numa_spinlock_flag == 0 && (nr_node_ids < 2 ||
+		    pv_ops.lock.queued_spin_lock_slowpath !=
+			native_queued_spin_lock_slowpath))
+		return;
+
+	cna_init_nodes();
+
+	pv_ops.lock.queued_spin_lock_slowpath = __cna_queued_spin_lock_slowpath;
+
+	pr_info("Enabling CNA spinlock\n");
+}
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v15 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-05-14 20:07 [PATCH v15 0/6] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (2 preceding siblings ...)
  2021-05-14 20:07 ` [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
@ 2021-05-14 20:07 ` Alex Kogan
  2021-05-14 20:07 ` [PATCH v15 5/6] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA Alex Kogan
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alex Kogan @ 2021-05-14 20:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Keep track of the time the thread at the head of the secondary queue
has been waiting, and force inter-node handoff once this time passes
a preset threshold. The default value for the threshold (1ms) can be
overridden with the new kernel boot command-line option
"qspinlock.numa_spinlock_threshold_ns".

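For example, the 10ms threshold used in previous revisions of the series
can be requested at boot time with:

	qspinlock.numa_spinlock_threshold_ns=10000000

Because the value is exposed via module_param() with mode 0644, it should
also be adjustable at runtime, presumably under
/sys/module/qspinlock/parameters/numa_spinlock_threshold_ns.
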
Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 .../admin-guide/kernel-parameters.txt         |  8 ++
 kernel/locking/qspinlock_cna.h                | 81 +++++++++++++++----
 2 files changed, 73 insertions(+), 16 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 94d35507560c..e2bcf6f4e048 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4183,6 +4183,14 @@
 			[KNL] Number of legacy pty's. Overwrites compiled-in
 			default number.
 
+	qspinlock.numa_spinlock_threshold_ns=	[NUMA, PV_OPS]
+			Set the time threshold in nanoseconds for how long
+			intra-node lock hand-offs may continue before the
+			NUMA-aware spinlock is forced to be passed to
+			a thread on another NUMA node. Smaller values
+			result in a fairer but less performant spinlock,
+			and vice versa. The default value is 1000000 (=1ms).
+
 	quiet		[KNL] Disable most log messages
 
 	r128=		[HW,DRM]
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index ca564e64e5de..0b991c340fb1 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,6 +4,8 @@
 #endif
 
 #include <linux/topology.h>
+#include <linux/sched/clock.h>
+#include <linux/moduleparam.h>
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -37,19 +39,39 @@
  * gradually filter the primary queue, leaving only waiters running on the same
  * preferred NUMA node.
  *
+ * We change the NUMA node preference after a waiter at the head of the
+ * secondary queue spins for a certain amount of time (1ms, by default).
+ * We do that by flushing the secondary queue into the head of the primary queue,
+ * effectively changing the preference to the NUMA node of the waiter at the head
+ * of the secondary queue at the time of the flush.
+ *
  * For more details, see https://arxiv.org/abs/1810.05600.
  *
  * Authors: Alex Kogan <alex.kogan@oracle.com>
  *          Dave Dice <dave.dice@oracle.com>
  */
 
+#define FLUSH_SECONDARY_QUEUE	1
+
 struct cna_node {
 	struct mcs_spinlock	mcs;
 	u16			numa_node;
 	u16			real_numa_node;
 	u32			encoded_tail;	/* self */
+	u64			start_time;
 };
 
+static ulong numa_spinlock_threshold_ns = 1000000;   /* 1ms, by default */
+module_param(numa_spinlock_threshold_ns, ulong, 0644);
+
+static inline bool intra_node_threshold_reached(struct cna_node *cn)
+{
+	u64 current_time = local_clock();
+	u64 threshold = cn->start_time + numa_spinlock_threshold_ns;
+
+	return current_time > threshold;
+}
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -92,6 +114,7 @@ static __always_inline void cna_init_node(struct mcs_spinlock *node)
 	struct cna_node *cn = (struct cna_node *)node;
 
 	cn->numa_node = cn->real_numa_node;
+	cn->start_time = 0;
 }
 
 /*
@@ -191,8 +214,14 @@ static void cna_splice_next(struct mcs_spinlock *node,
 
 	/* stick `next` on the secondary queue tail */
 	if (node->locked <= 1) { /* if secondary queue is empty */
+		struct cna_node *cn = (struct cna_node *)node;
+
 		/* create secondary queue */
 		next->next = next;
+
+		cn->start_time = local_clock();
+		/* secondary queue is not empty iff start_time != 0 */
+		WARN_ON(!cn->start_time);
 	} else {
 		/* add to the tail of the secondary queue */
 		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
@@ -240,12 +269,18 @@ static int cna_order_queue(struct mcs_spinlock *node)
 static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 						 struct mcs_spinlock *node)
 {
-	/*
-	 * Try and put the time otherwise spent spin waiting on
-	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
-	 */
-	while (LOCK_IS_BUSY(lock) && !cna_order_queue(node))
-		cpu_relax();
+	struct cna_node *cn = (struct cna_node *)node;
+
+	if (!cn->start_time || !intra_node_threshold_reached(cn)) {
+		/*
+		 * Try and put the time otherwise spent spin waiting on
+		 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+		 */
+		while (LOCK_IS_BUSY(lock) && !cna_order_queue(node))
+			cpu_relax();
+	} else {
+		cn->start_time = FLUSH_SECONDARY_QUEUE;
+	}
 
 	return 0; /* we lied; we didn't wait, go do so now */
 }
@@ -253,24 +288,38 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 static inline void cna_lock_handoff(struct mcs_spinlock *node,
 				 struct mcs_spinlock *next)
 {
+	struct cna_node *cn = (struct cna_node *)node;
 	u32 val = 1;
 
-	if (node->locked > 1) {
-		struct cna_node *cn = (struct cna_node *)node;
+	if (cn->start_time != FLUSH_SECONDARY_QUEUE) {
+		if (node->locked > 1) {
+			val = node->locked;	/* preserve secondary queue */
+
+			/*
+			 * We have a local waiter, either real or fake one;
+			 * reload @next in case it was changed by cna_order_queue().
+			 */
+			next = node->next;
 
-		val = node->locked;	/* preserve secondary queue */
+			/*
+			 * Pass over NUMA node id of primary queue, to maintain the
+			 * preference even if the next waiter is on a different node.
+			 */
+			((struct cna_node *)next)->numa_node = cn->numa_node;
 
+			((struct cna_node *)next)->start_time = cn->start_time;
+		}
+	} else {
 		/*
-		 * We have a local waiter, either real or fake one;
-		 * reload @next in case it was changed by cna_order_queue().
+		 * We decided to flush the secondary queue;
+		 * this can only happen if that queue is not empty.
 		 */
-		next = node->next;
-
+		WARN_ON(node->locked <= 1);
 		/*
-		 * Pass over NUMA node id of primary queue, to maintain the
-		 * preference even if the next waiter is on a different node.
+		 * Splice the secondary queue onto the primary queue and pass the lock
+		 * to the longest waiting remote waiter.
 		 */
-		((struct cna_node *)next)->numa_node = cn->numa_node;
+		next = cna_splice_head(NULL, 0, node, next);
 	}
 
 	arch_mcs_lock_handoff(&next->locked, val);
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v15 5/6] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA
  2021-05-14 20:07 [PATCH v15 0/6] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (3 preceding siblings ...)
  2021-05-14 20:07 ` [PATCH v15 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
@ 2021-05-14 20:07 ` Alex Kogan
  2021-05-14 20:07 ` [PATCH v15 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA Alex Kogan
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alex Kogan @ 2021-05-14 20:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Prohibit moving certain threads (e.g., in irq and nmi contexts)
to the secondary queue. Those prioritized threads will always stay
in the primary queue, and so will have a shorter wait time for the lock.
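
Concretely, the patch below treats a waiter as prioritized when

	bool priority = !in_task() || irqs_disabled() || rt_task(current);

i.e., when it runs in interrupt context, with interrupts disabled, or as
part of a realtime task.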

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_cna.h | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 0b991c340fb1..ffc5c3301f0f 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -6,6 +6,7 @@
 #include <linux/topology.h>
 #include <linux/sched/clock.h>
 #include <linux/moduleparam.h>
+#include <linux/sched/rt.h>
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -37,7 +38,8 @@
  * running on the same NUMA node. If it is not, that waiter is detached from the
  * main queue and moved into the tail of the secondary queue. This way, we
  * gradually filter the primary queue, leaving only waiters running on the same
- * preferred NUMA node.
+ * preferred NUMA node. Note that certain prioritized waiters (e.g., in
+ * irq and nmi contexts) are excluded from being moved to the secondary queue.
  *
  * We change the NUMA node preference after a waiter at the head of the
  * secondary queue spins for a certain amount of time (1ms, by default).
@@ -53,6 +55,8 @@
 
 #define FLUSH_SECONDARY_QUEUE	1
 
+#define CNA_PRIORITY_NODE      0xffff
+
 struct cna_node {
 	struct mcs_spinlock	mcs;
 	u16			numa_node;
@@ -111,9 +115,10 @@ static int __init cna_init_nodes(void)
 
 static __always_inline void cna_init_node(struct mcs_spinlock *node)
 {
+	bool priority = !in_task() || irqs_disabled() || rt_task(current);
 	struct cna_node *cn = (struct cna_node *)node;
 
-	cn->numa_node = cn->real_numa_node;
+	cn->numa_node = priority ? CNA_PRIORITY_NODE : cn->real_numa_node;
 	cn->start_time = 0;
 }
 
@@ -252,7 +257,7 @@ static int cna_order_queue(struct mcs_spinlock *node)
 	numa_node = cn->numa_node;
 	next_numa_node = ((struct cna_node *)next)->numa_node;
 
-	if (next_numa_node != numa_node) {
+	if (next_numa_node != numa_node && next_numa_node != CNA_PRIORITY_NODE) {
 		struct mcs_spinlock *nnext = READ_ONCE(next->next);
 
 		if (nnext)
@@ -272,6 +277,13 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 	struct cna_node *cn = (struct cna_node *)node;
 
 	if (!cn->start_time || !intra_node_threshold_reached(cn)) {
+		/*
+		 * We are at the head of the wait queue, no need to use
+		 * the fake NUMA node ID.
+		 */
+		if (cn->numa_node == CNA_PRIORITY_NODE)
+			cn->numa_node = cn->real_numa_node;
+
 		/*
 		 * Try and put the time otherwise spent spin waiting on
 		 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v15 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA
  2021-05-14 20:07 [PATCH v15 0/6] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (4 preceding siblings ...)
  2021-05-14 20:07 ` [PATCH v15 5/6] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA Alex Kogan
@ 2021-05-14 20:07 ` Alex Kogan
  2021-09-30  9:44 ` [PATCH v15 0/6] Add NUMA-awareness to qspinlock Barry Song
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Alex Kogan @ 2021-05-14 20:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

This performance optimization chooses probabilistically to avoid moving
threads from the main queue into the secondary one when the secondary queue
is empty.

It is helpful when the lock is only lightly contended. In particular, it
makes CNA less eager to create a secondary queue, but does not introduce
any extra delays for threads waiting in that queue once it is created.
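
With the chosen parameter of 7 (SHUFFLE_REDUCTION_PROB_ARG below), the
probably() helper returns false with probability 1/2^7, so while the
secondary queue is empty the queue-ordering code runs on average only once
every 128 passes through cna_wait_head_or_lock(); in the other 127 out of
128 cases the waiter skips the reordering and behaves essentially like a
plain MCS waiter.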

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_cna.h | 39 ++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index ffc5c3301f0f..17d56c739e57 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -7,6 +7,7 @@
 #include <linux/sched/clock.h>
 #include <linux/moduleparam.h>
 #include <linux/sched/rt.h>
+#include <linux/random.h>
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -76,6 +77,34 @@ static inline bool intra_node_threshold_reached(struct cna_node *cn)
 	return current_time > threshold;
 }
 
+/*
+ * Controls the probability for enabling the ordering of the main queue
+ * when the secondary queue is empty. The chosen value reduces the amount
+ * of unnecessary shuffling of threads between the two waiting queues
+ * when the contention is low, while responding fast enough and enabling
+ * the shuffling when the contention is high.
+ */
+#define SHUFFLE_REDUCTION_PROB_ARG  (7)
+
+/* Per-CPU pseudo-random number seed */
+static DEFINE_PER_CPU(u32, seed);
+
+/*
+ * Return false with probability 1 / 2^@num_bits.
+ * Intuitively, the larger @num_bits the less likely false is to be returned.
+ * @num_bits must be a number between 0 and 31.
+ */
+static bool probably(unsigned int num_bits)
+{
+	u32 s;
+
+	s = this_cpu_read(seed);
+	s = next_pseudo_random32(s);
+	this_cpu_write(seed, s);
+
+	return s & ((1 << num_bits) - 1);
+}
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -276,6 +305,16 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 {
 	struct cna_node *cn = (struct cna_node *)node;
 
+	if (node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) {
+		/*
+		 * When the secondary queue is empty, skip the calls to
+		 * cna_order_queue() below with high probability. This optimization
+		 * reduces the overhead of unnecessary shuffling of threads
+		 * between waiting queues when the lock is only lightly contended.
+		 */
+		return 0;
+	}
+
 	if (!cn->start_time || !intra_node_threshold_reached(cn)) {
 		/*
 		 * We are at the head of the wait queue, no need to use
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2021-05-14 20:07 ` [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
@ 2021-09-22 19:25   ` Davidlohr Bueso
  2021-09-22 19:52     ` Waiman Long
  2023-08-04  1:49     ` Guo Ren
  2021-09-30 10:05   ` Barry Song
  2023-08-02 23:14   ` Guo Ren
  2 siblings, 2 replies; 28+ messages in thread
From: Davidlohr Bueso @ 2021-09-22 19:25 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Fri, 14 May 2021, Alex Kogan wrote:

>diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>index a816935d23d4..94d35507560c 100644
>--- a/Documentation/admin-guide/kernel-parameters.txt
>+++ b/Documentation/admin-guide/kernel-parameters.txt
>@@ -3515,6 +3515,16 @@
> 			NUMA balancing.
> 			Allowed values are enable and disable
>
>+	numa_spinlock=	[NUMA, PV_OPS] Select the NUMA-aware variant
>+			of spinlock. The options are:
>+			auto - Enable this variant if running on a multi-node
>+			machine in native environment.
>+			on  - Unconditionally enable this variant.

Is there any reason why the user would explicitly pass the on option
when the auto thing already does the multi-node check? Perhaps strange
numa topologies? Otherwise I would say it's not needed and the fewer
options we give the user for low level locking the better.

>+			off - Unconditionally disable this variant.
>+
>+			Not specifying this option is equivalent to
>+			numa_spinlock=auto.
>+
> 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
> 			'node', 'default' can be specified
> 			This can be set from sysctl after boot.
>diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>index 0045e1b44190..819c3dad8afc 100644
>--- a/arch/x86/Kconfig
>+++ b/arch/x86/Kconfig
>@@ -1564,6 +1564,26 @@ config NUMA
>
> 	  Otherwise, you should say N.
>
>+config NUMA_AWARE_SPINLOCKS
>+	bool "Numa-aware spinlocks"
>+	depends on NUMA
>+	depends on QUEUED_SPINLOCKS
>+	depends on 64BIT
>+	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
>+	# This is awkward, but hopefully would be resolved once static_call()
>+	# is available.
>+	depends on PARAVIRT_SPINLOCKS

We now have static_call() - see 9183c3f9ed7.


>+	default y
>+	help
>+	  Introduce NUMA (Non Uniform Memory Access) awareness into
>+	  the slow path of spinlocks.
>+
>+	  In this variant of qspinlock, the kernel will try to keep the lock
>+	  on the same node, thus reducing the number of remote cache misses,
>+	  while trading some of the short term fairness for better performance.
>+
>+	  Say N if you want absolute first come first serve fairness.

This would also need a depends on !PREEMPT_RT, no? Raw spinlocks really want
the determinism.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2021-09-22 19:25   ` Davidlohr Bueso
@ 2021-09-22 19:52     ` Waiman Long
  2023-08-04  1:49     ` Guo Ren
  1 sibling, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-09-22 19:52 UTC (permalink / raw)
  To: Davidlohr Bueso, Alex Kogan
  Cc: linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On 9/22/21 3:25 PM, Davidlohr Bueso wrote:
> On Fri, 14 May 2021, Alex Kogan wrote:
>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
>> b/Documentation/admin-guide/kernel-parameters.txt
>> index a816935d23d4..94d35507560c 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -3515,6 +3515,16 @@
>>             NUMA balancing.
>>             Allowed values are enable and disable
>>
>> +    numa_spinlock=    [NUMA, PV_OPS] Select the NUMA-aware variant
>> +            of spinlock. The options are:
>> +            auto - Enable this variant if running on a multi-node
>> +            machine in native environment.
>> +            on  - Unconditionally enable this variant.
>
> Is there any reason why the user would explicitly pass the on option
> when the auto thing already does the multi-node check? Perhaps strange
> numa topologies? Otherwise I would say it's not needed and the fewer
> options we give the user for low level locking the better.

I asked Alex to put in a command-line option because we may want to
disable it on a multi-socket server.


>
>> +            off - Unconditionally disable this variant.
>> +
>> +            Not specifying this option is equivalent to
>> +            numa_spinlock=auto.
>> +
>>     numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
>>             'node', 'default' can be specified
>>             This can be set from sysctl after boot.
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 0045e1b44190..819c3dad8afc 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1564,6 +1564,26 @@ config NUMA
>>
>>       Otherwise, you should say N.
>>
>> +config NUMA_AWARE_SPINLOCKS
>> +    bool "Numa-aware spinlocks"
>> +    depends on NUMA
>> +    depends on QUEUED_SPINLOCKS
>> +    depends on 64BIT
>> +    # For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
>> +    # This is awkward, but hopefully would be resolved once static_call()
>> +    # is available.
>> +    depends on PARAVIRT_SPINLOCKS
>
> We now have static_call() - see 9183c3f9ed7.
I agree that it is now time to look at using the static call for 
slowpath switching.
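For illustration, here is a minimal sketch of what a static_call()-based
switch could look like. This is not code from the series; cna_enabled() is
a hypothetical stand-in for the existing multi-node/native detection, and
__cna_queued_spin_lock_slowpath is the CNA slow path generated in
qspinlock.c:

#include <linux/static_call.h>

DEFINE_STATIC_CALL(qspinlock_slowpath, native_queued_spin_lock_slowpath);

void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	/* always dispatch through the patchable static call */
	static_call(qspinlock_slowpath)(lock, val);
}

void __init cna_configure_spin_lock_slowpath(void)
{
	/* retarget the slow path once, at boot, if CNA should be used */
	if (cna_enabled())	/* hypothetical predicate */
		static_call_update(qspinlock_slowpath,
				   __cna_queued_spin_lock_slowpath);
}

Something along these lines would avoid piggybacking on pv_ops.lock for the
patching.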
>
>
>> +    default y
>> +    help
>> +      Introduce NUMA (Non Uniform Memory Access) awareness into
>> +      the slow path of spinlocks.
>> +
>> +      In this variant of qspinlock, the kernel will try to keep the lock
>> +      on the same node, thus reducing the number of remote cache misses,
>> +      while trading some of the short term fairness for better performance.
>> +
>> +      Say N if you want absolute first come first serve fairness.
>
> This would also need a depends on !PREEMPT_RT, no? Raw spinlocks really
> want the determinism.

Agreed

Cheers,
Longman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 0/6] Add NUMA-awareness to qspinlock
  2021-05-14 20:07 [PATCH v15 0/6] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (5 preceding siblings ...)
  2021-05-14 20:07 ` [PATCH v15 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA Alex Kogan
@ 2021-09-30  9:44 ` Barry Song
  2021-09-30 16:58   ` Waiman Long
  2021-09-30 23:51   ` Alex Kogan
  2021-12-13 20:37 ` Alex Kogan
  2022-04-11 17:09 ` Alex Kogan
  8 siblings, 2 replies; 28+ messages in thread
From: Barry Song @ 2021-09-30  9:44 UTC (permalink / raw)
  To: alex.kogan
  Cc: arnd, bp, daniel.m.jordan, dave.dice, guohanjun, hpa, jglauber,
	linux-arch, linux-arm-kernel, linux-kernel, linux, longman,
	mingo, peterz, steven.sistare, tglx, will.deacon, x86

> We have done some performance evaluation with the locktorture module
> as well as with several benchmarks from the will-it-scale repo.
> The following locktorture results are from an Oracle X5-4 server
> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
> cores each). Each number represents an average (over 25 runs) of the
> total number of ops (x10^7) reported at the end of each run. The 
> standard deviation is also reported in (), and in general is about 3%
> from the average. The 'stock' kernel is v5.12.0,

I assume x5-4 server has the crossbar topology and its numa diameter is
1hop, and all tests were done on this kind of symmetrical topology. Am
I right? 

    ┌─┐                 ┌─┐
    │ ├─────────────────┤ │
    └─┤1               1└┬┘
      │  1           1   │
      │    1       1     │
      │      1   1       │
      │        1         │
      │      1   1       │
      │     1      1     │
      │   1         1    │
     ┌┼┐1             1  ├─┐
     │┼┼─────────────────┤ │
     └─┘                 └─┘


what if the hardware is using the ring topology and other topologies with
2-hops or even 3-hops such as:

     ┌─┐                 ┌─┐
     │ ├─────────────────┤ │
     └─┤                 └┬┘
       │                  │
       │                  │
       │                  │
       │                  │
       │                  │
       │                  │
       │                  │
      ┌┤                  ├─┐
      │┼┬─────────────────┤ │
      └─┘                 └─┘


or:


    ┌───┐       ┌───┐      ┌────┐      ┌─────┐
    │   │       │   │      │    │      │     │
    │   │       │   │      │    │      │     │
    ├───┼───────┼───┼──────┼────┼──────┼─────┤
    │   │       │   │      │    │      │     │
    └───┘       └───┘      └────┘      └─────┘

do we need to consider the distances of numa nodes in the secondary
queue? does it still make sense to treat everyone else equal in
secondary queue?

Thanks
barry

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2021-05-14 20:07 ` [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
  2021-09-22 19:25   ` Davidlohr Bueso
@ 2021-09-30 10:05   ` Barry Song
  2023-08-02 23:14   ` Guo Ren
  2 siblings, 0 replies; 28+ messages in thread
From: Barry Song @ 2021-09-30 10:05 UTC (permalink / raw)
  To: alex.kogan
  Cc: arnd, bp, daniel.m.jordan, dave.dice, guohanjun, hpa, jglauber,
	linux-arch, linux-arm-kernel, linux-kernel, linux, longman,
	mingo, peterz, steven.sistare, tglx, will.deacon, x86

> +/*
> + * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
> + *
> + * In CNA, spinning threads are organized in two queues, a primary queue for
> + * threads running on the same NUMA node as the current lock holder, and a
> + * secondary queue for threads running on other nodes. Schematically, it
> + * looks like this:
> + *
> + *    cna_node
> + *   +----------+     +--------+         +--------+
> + *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
> + *   |mcs:locked| -.  +--------+         +--------+
> + *   +----------+  |
> + *                 `----------------------.
> + *                                        v
> + *                 +--------+         +--------+
> + *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
> + *                 +--------+         +--------+
> + *                     ^                    |
> + *                     `--------------------'
> + *

This is probably related not only to NUMA but also to the cache topology.
For example, one NUMA node might have a couple of sub-domains, each sharing
some last-level cache. Zen, POWER and some ARM servers all have this kind of
topology.

Lock synchronization within such a smaller domain should be much faster.
Anyway, being aware of NUMA only looks like a good start for the moment.
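Just to make that concrete, a purely hypothetical variant (not part of this
series) could key the primary/secondary split on a finer-grained locality
domain instead of the NUMA node; on x86 (e.g. Zen), the last-level-cache id
could serve as such a domain, assuming the per-CPU cpu_llc_id from
<asm/smp.h> as the identifier:

/*
 * Hypothetical sketch: return a "locality domain" id to be stored in the
 * queue node instead of cpu_to_node(cpu).
 */
static inline u16 cna_locality_domain(unsigned int cpu)
{
	return per_cpu(cpu_llc_id, cpu);
}

Everything else would stay the same; the id stored by
cna_init_nodes_per_cpu() and compared in cna_order_queue() would simply
become this domain id.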

Thanks
barry

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 0/6] Add NUMA-awareness to qspinlock
  2021-09-30  9:44 ` [PATCH v15 0/6] Add NUMA-awareness to qspinlock Barry Song
@ 2021-09-30 16:58   ` Waiman Long
  2021-09-30 22:57     ` Barry Song
  2021-09-30 23:51   ` Alex Kogan
  1 sibling, 1 reply; 28+ messages in thread
From: Waiman Long @ 2021-09-30 16:58 UTC (permalink / raw)
  To: Barry Song, alex.kogan
  Cc: arnd, bp, daniel.m.jordan, dave.dice, guohanjun, hpa, jglauber,
	linux-arch, linux-arm-kernel, linux-kernel, linux, mingo, peterz,
	steven.sistare, tglx, will.deacon, x86

On 9/30/21 5:44 AM, Barry Song wrote:
>> We have done some performance evaluation with the locktorture module
>> as well as with several benchmarks from the will-it-scale repo.
>> The following locktorture results are from an Oracle X5-4 server
>> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
>> cores each). Each number represents an average (over 25 runs) of the
>> total number of ops (x10^7) reported at the end of each run. The
>> standard deviation is also reported in (), and in general is about 3%
>> from the average. The 'stock' kernel is v5.12.0,
> I assume x5-4 server has the crossbar topology and its numa diameter is
> 1hop, and all tests were done on this kind of symmetrical topology. Am
> I right?
>
>      ┌─┐                 ┌─┐
>      │ ├─────────────────┤ │
>      └─┤1               1└┬┘
>        │  1           1   │
>        │    1       1     │
>        │      1   1       │
>        │        1         │
>        │      1   1       │
>        │     1      1     │
>        │   1         1    │
>       ┌┼┐1             1  ├─┐
>       │┼┼─────────────────┤ │
>       └─┘                 └─┘
>
>
> what if the hardware is using the ring topology and other topologies with
> 2-hops or even 3-hops such as:
>
>       ┌─┐                 ┌─┐
>       │ ├─────────────────┤ │
>       └─┤                 └┬┘
>         │                  │
>         │                  │
>         │                  │
>         │                  │
>         │                  │
>         │                  │
>         │                  │
>        ┌┤                  ├─┐
>        │┼┬─────────────────┤ │
>        └─┘                 └─┘
>
>
> or:
>
>
>      ┌───┐       ┌───┐      ┌────┐      ┌─────┐
>      │   │       │   │      │    │      │     │
>      │   │       │   │      │    │      │     │
>      ├───┼───────┼───┼──────┼────┼──────┼─────┤
>      │   │       │   │      │    │      │     │
>      └───┘       └───┘      └────┘      └─────┘
>
> do we need to consider the distances of numa nodes in the secondary
> queue? does it still make sense to treat everyone else equal in
> secondary queue?

The purpose of this patch series is to minimize cacheline transfer from 
one numa node to another. Taking the fine grained detail of the numa 
topology into account will complicate the code without much performance 
benefit from my point of view. Let's keep it simple first. We can always 
improve it later on if one can show real benefit of doing so.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 0/6] Add NUMA-awareness to qspinlock
  2021-09-30 16:58   ` Waiman Long
@ 2021-09-30 22:57     ` Barry Song
  0 siblings, 0 replies; 28+ messages in thread
From: Barry Song @ 2021-09-30 22:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: alex.kogan, Arnd Bergmann, Borislav Petkov, daniel.m.jordan,
	dave.dice, guohanjun, H. Peter Anvin, jglauber, linux-arch,
	linux-arm-kernel, LKML, linux, Ingo Molnar, Peter Zijlstra,
	steven.sistare, Thomas Gleixner, Will Deacon, x86

On Fri, Oct 1, 2021 at 5:58 AM Waiman Long <llong@redhat.com> wrote:
>
> On 9/30/21 5:44 AM, Barry Song wrote:
> >> We have done some performance evaluation with the locktorture module
> >> as well as with several benchmarks from the will-it-scale repo.
> >> The following locktorture results are from an Oracle X5-4 server
> >> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
> >> cores each). Each number represents an average (over 25 runs) of the
> >> total number of ops (x10^7) reported at the end of each run. The
> >> standard deviation is also reported in (), and in general is about 3%
> >> from the average. The 'stock' kernel is v5.12.0,
> > I assume x5-4 server has the crossbar topology and its numa diameter is
> > 1hop, and all tests were done on this kind of symmetrical topology. Am
> > I right?
> >
> >      ┌─┐                 ┌─┐
> >      │ ├─────────────────┤ │
> >      └─┤1               1└┬┘
> >        │  1           1   │
> >        │    1       1     │
> >        │      1   1       │
> >        │        1         │
> >        │      1   1       │
> >        │     1      1     │
> >        │   1         1    │
> >       ┌┼┐1             1  ├─┐
> >       │┼┼─────────────────┤ │
> >       └─┘                 └─┘
> >
> >
> > what if the hardware is using the ring topology and other topologies with
> > 2-hops or even 3-hops such as:
> >
> >       ┌─┐                 ┌─┐
> >       │ ├─────────────────┤ │
> >       └─┤                 └┬┘
> >         │                  │
> >         │                  │
> >         │                  │
> >         │                  │
> >         │                  │
> >         │                  │
> >         │                  │
> >        ┌┤                  ├─┐
> >        │┼┬─────────────────┤ │
> >        └─┘                 └─┘
> >
> >
> > or:
> >
> >
> >      ┌───┐       ┌───┐      ┌────┐      ┌─────┐
> >      │   │       │   │      │    │      │     │
> >      │   │       │   │      │    │      │     │
> >      ├───┼───────┼───┼──────┼────┼──────┼─────┤
> >      │   │       │   │      │    │      │     │
> >      └───┘       └───┘      └────┘      └─────┘
> >
> > do we need to consider the distances of numa nodes in the secondary
> > queue? does it still make sense to treat everyone else equal in
> > secondary queue?
>
> The purpose of this patch series is to minimize cacheline transfer from
> one numa node to another. Taking the fine grained detail of the numa
> topology into account will complicate the code without much performance
> benefit from my point of view. Let's keep it simple first. We can always
> improve it later on if one can show real benefit of doing so.

For sure, I am not expecting the complex NUMA topology to be taken into
account at this moment. I am just curious how things would be different if
the topology isn't a crossbar with 1-hop distances only.

When the primary queue is empty, the distance to the NUMA node the spinlock
will jump to will affect performance, though I am not quite sure by how much.
Just like a disk, bumping back and forth between far-apart cylinders and
sectors might waste a lot of time.

On the other hand, some NUMA nodes might be very close while others might be
very far. For example, if one socket has several dies and the machine has
several sockets, the cacheline coherence overhead between NUMA nodes of dies
within one socket might be much lower than that between NUMA nodes in
different sockets. I assume maintaining the primary/secondary queues has some
overhead, especially when the system has many cores and multiple NUMA nodes;
in that case, letting neighboring NUMA nodes share one primary queue might
win.

Anyway, we need a lot of benchmarking on this before we can really do
anything about it. For the moment, ignoring the complicated topology should
be a better way to start.

>
> Cheers,
> Longman

Thanks
barry

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 0/6] Add NUMA-awareness to qspinlock
  2021-09-30  9:44 ` [PATCH v15 0/6] Add NUMA-awareness to qspinlock Barry Song
  2021-09-30 16:58   ` Waiman Long
@ 2021-09-30 23:51   ` Alex Kogan
  1 sibling, 0 replies; 28+ messages in thread
From: Alex Kogan @ 2021-09-30 23:51 UTC (permalink / raw)
  To: Barry Song
  Cc: Arnd Bergmann, Borislav Petkov, Daniel Jordan, Dave Dice,
	Hanjun Guo, hpa, jglauber, linux-arch, linux-arm-kernel,
	linux-kernel, linux, longman, mingo, peterz, Steven Sistare,
	tglx, will.deacon, x86



> On Sep 30, 2021, at 5:44 AM, Barry Song <21cnbao@gmail.com> wrote:
> 
>> We have done some performance evaluation with the locktorture module
>> as well as with several benchmarks from the will-it-scale repo.
>> The following locktorture results are from an Oracle X5-4 server
>> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
>> cores each). Each number represents an average (over 25 runs) of the
>> total number of ops (x10^7) reported at the end of each run. The 
>> standard deviation is also reported in (), and in general is about 3%
>> from the average. The 'stock' kernel is v5.12.0,
> 
> I assume x5-4 server has the crossbar topology and its numa diameter is
> 1hop, and all tests were done on this kind of symmetrical topology. Am
> I right? 
> 
>    ┌─┐                 ┌─┐
>    │ ├─────────────────┤ │
>    └─┤1               1└┬┘
>      │  1           1   │
>      │    1       1     │
>      │      1   1       │
>      │        1         │
>      │      1   1       │
>      │     1      1     │
>      │   1         1    │
>     ┌┼┐1             1  ├─┐
>     │┼┼─────────────────┤ │
>     └─┘                 └─┘
> 
> 
> what if the hardware is using the ring topology and other topologies
For better or worse, CNA is pretty much agnostic to the physical topology.
So it is true that we might do better by transferring NUMA node preference
to a “closer” node, but it would probably require some sophisticated logic 
to identify such a node. As discussed on another thread, we are opting
for a simple solution that might be refined later if needed.
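Just to illustrate what such logic might look like (a hypothetical sketch,
not something we have implemented or measured): the circular secondary queue
could be scanned for the waiter whose node is nearest according to the
firmware-provided distance table, via node_distance() from <linux/topology.h>:

/*
 * Hypothetical sketch only: return the waiter in the circular secondary
 * queue (tail_2nd->mcs.next is its head) whose NUMA node is closest to
 * @my_node according to node_distance().
 */
static struct cna_node *cna_closest_waiter(struct cna_node *tail_2nd,
					   int my_node)
{
	struct cna_node *cn = (struct cna_node *)tail_2nd->mcs.next;
	struct cna_node *best = cn;
	int best_dist = node_distance(my_node, cn->numa_node);

	while (cn != tail_2nd) {
		int dist;

		cn = (struct cna_node *)cn->mcs.next;
		dist = node_distance(my_node, cn->numa_node);
		if (dist < best_dist) {
			best_dist = dist;
			best = cn;
		}
	}
	return best;
}

Whether the extra pointer chasing, and the bookkeeping to splice that waiter
out, would pay for themselves is exactly the open question.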

Best regards,
— Alex



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 0/6] Add NUMA-awareness to qspinlock
  2021-05-14 20:07 [PATCH v15 0/6] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (6 preceding siblings ...)
  2021-09-30  9:44 ` [PATCH v15 0/6] Add NUMA-awareness to qspinlock Barry Song
@ 2021-12-13 20:37 ` Alex Kogan
  2021-12-15 15:13   ` Alex Kogan
  2022-04-11 17:09 ` Alex Kogan
  8 siblings, 1 reply; 28+ messages in thread
From: Alex Kogan @ 2021-12-13 20:37 UTC (permalink / raw)
  To: LKML
  Cc: linux, Peter Zijlstra, Ingo Molnar, Will Deacon, Arnd Bergmann,
	Waiman Long, linux-arch, linux-arm-kernel, Thomas Gleixner,
	Borislav Petkov, hpa, x86, Hanjun Guo, Jan Glauber,
	Steven Sistare, Daniel Jordan, Dave Dice, Antonio Paolillo,
	Rafael Chehab, Diogo Behrens, Fuming (DRC OS Kernel Lab),
	Davidlohr Bueso

Hi, all.

A couple of quick updates regarding the patch series.

First, I am working to collect more performance data for the patch.
In particular, I measured up to 14% throughput improvement for RocksDB[1]
and 26% improvement for a few sysbench[2] benchmarks. The details appear
below. sysbench also has a number of OLTP-like benchmarks; those do not show
sensitivity to CNA -- no improvements, no regressions.

Second, we became aware of the work by researchers at Huawei Dresden
Research Center who verified the correctness of CNA on weak memory
models: https://arxiv.org/pdf/2111.15240.pdf. The authors found the
CNA patch series (with some minor simplifications) to be correct in the
Linux kernel memory model. For further details, please refer to the paper.

=============

The following results are from an Oracle X5-8 server (eight Intel Xeon
E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded cores each). Note
that this system is different from the one I was normally using for testing,
as the latter is going through repairs. Each number represents average
throughput over 25 runs. The standard deviation is also reported in ().

The throughput results (ops/us) for RocksDB (v6.26.0) ‘readrandom’ benchmark:
#thr  	 stock      CNA          / speedup
  1  0.131 (0.004) 0.130 (0.005) / 0.989
  2  0.232 (0.005) 0.230 (0.004) / 0.991
  4  0.437 (0.008) 0.430 (0.009) / 0.985
  8  0.560 (0.015) 0.563 (0.014) / 1.004
 16  0.699 (0.017) 0.690 (0.032) / 0.986
 32  0.738 (0.029) 0.763 (0.028) / 1.034
 64  0.597 (0.014) 0.631 (0.012) / 1.057
 72  0.591 (0.010) 0.623 (0.012) / 1.054
144  0.569 (0.010) 0.636 (0.012) / 1.117
216  0.540 (0.008) 0.615 (0.009) / 1.139
286  0.527 (0.005) 0.579 (0.009) / 1.099

The throughput results (events/ms) for sysbench ’threads’ benchmark:
#thr  	 stock      CNA          / speedup
  1  2168.245 (11.715) 2169.734 (8.042) / 1.001
  2  1016.283 (48.849) 1097.674 (19.516) / 1.080
  4  1220.588 (17.383) 1229.191 (11.928) / 1.007
  8  1048.581 (46.102) 1036.614 (13.365) / 0.989
 16  804.573 (30.343) 832.324 (26.457) / 1.034
 32  799.307 (39.315) 869.813 (9.352) / 1.088
 64  688.387 (7.312) 812.554 (19.414) / 1.180
 72  681.979 (16.198) 825.458 (8.655) / 1.210
144  606.460 (5.277) 759.814 (7.027) / 1.253
216  580.517 (6.718) 719.856 (8.381) / 1.240
286  527.053 (12.382) 659.322 (12.527) / 1.251

The throughput results (events/ms) for sysbench ’mutex’ benchmark:
#thr  	 stock      CNA          / speedup
  1  15.218 (0.064) 14.986 (0.080) / 0.985
  2  6.043 (0.953) 6.058 (0.856) / 1.003
  4  5.852 (0.132) 5.761 (0.295) / 0.984
  8  3.291 (0.054) 3.241 (0.143) / 0.985
 16  2.931 (0.036) 3.437 (0.150) / 1.173
 32  2.863 (0.058) 3.267 (0.124) / 1.141
 64  2.457 (0.099) 3.101 (0.024) / 1.262
 72  2.438 (0.060) 3.054 (0.079) / 1.253
144  2.399 (0.042) 3.007 (0.056) / 1.253
216  2.413 (0.033) 2.697 (0.044) / 1.118
286  2.029 (0.033) 2.288 (0.015) / 1.128
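(For anyone who wants to reproduce these: the sysbench 'threads' and 'mutex'
tests are invoked roughly as "sysbench threads --threads=<N> run" and
"sysbench mutex --threads=<N> run"; treat this as an illustrative invocation,
the exact options used here may have differed.)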

Best regards,
— Alex

[1] https://github.com/facebook/rocksdb
[2] https://github.com/akopytov/sysbench

> On May 14, 2021, at 4:07 PM, Alex Kogan <alex.kogan@oracle.com> wrote:
> 
> Changes from v14:
> ----------------
> 
> - Change the way the main queue is scanned and reordered in
> cna_wait_head_or_lock(), based on Peter's suggestion.
> 
> In detail: instead of inspecting only one queue node, we now scan
> (and move nodes into the secondary queue) as long as the lock
> remains busy. This simplified the code quite a bit, as we don't need
> to call cna_order_queue() again from cna_lock_handoff(). 
> 
> - Use local_clock() instead of relying on jiffies to decide when to
> flush the secondary queue, per Andy's suggestion.
> 
> - Use module_param() for numa_spinlock_threshold_ns, so it can be tweaked
> at runtime, per Andy's suggestion.
> 
> - Reduce the default value for numa_spinlock_threshold_ns to 1ms based on
> the comments from Andy and Peter. The performance numbers below include
> results with the new default as well as with the value of 10ms, which was 
> the default threshold in previous revisions of the series.
> 
> Summary
> -------
> 
> Lock throughput can be increased by handing a lock to a waiter on the
> same NUMA node as the lock holder, provided care is taken to avoid
> starvation of waiters on other NUMA nodes. This patch introduces CNA
> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
> 
> CNA is a NUMA-aware version of the MCS lock. Spinning threads are
> organized in two queues, a primary queue for threads running on the same
> node as the current lock holder, and a secondary queue for threads
> running on other nodes. Threads store the ID of the node on which
> they are running in their queue nodes. After acquiring the MCS lock and
> before acquiring the spinlock, the MCS lock holder checks whether the next
> waiter in the primary queue (if exists) is running on the same NUMA node.
> If it is not, that waiter is detached from the main queue and moved into
> the tail of the secondary queue. This way, we gradually filter the primary
> queue, leaving only waiters running on the same preferred NUMA node. Note
> that certain priortized waiters (e.g., in irq and nmi contexts) are
> excluded from being moved to the secondary queue. We change the NUMA node
> preference after a waiter at the head of the secondary queue spins for a
> certain amount of time. We do that by flushing the secondary queue into
> the head of the primary queue, effectively changing the preference to the
> NUMA node of the waiter at the head of the secondary queue at the time of
> the flush.
> 
> More details are available at https://arxiv.org/abs/1810.05600.
> 
> We have done some performance evaluation with the locktorture module
> as well as with several benchmarks from the will-it-scale repo.
> The following locktorture results are from an Oracle X5-4 server
> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
> cores each). Each number represents an average (over 25 runs) of the
> total number of ops (x10^7) reported at the end of each run. The 
> standard deviation is also reported in (), and in general is about 3%
> from the average. The 'stock' kernel is v5.12.0,
> commit 3cf5c8ea3a66, compiled in the default configuration. 
> 'CNA' is the modified kernel with NUMA_AWARE_SPINLOCKS set and
> the new default threshold of 1ms for flushing the secondary queue
> (numa_spinlock_threshold_ns); 'CNA-10ms' is the same as CNA, 
> but uses the threshold of 10ms. The speedup is calculated by dividing 
> the result of 'CNA' and 'CNA-10ms', respectively, by the result
> achieved with 'stock'.
> 
> #thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
>  1  2.695 (0.108) 2.704 (0.099) / 1.003  2.712 (0.077) / 1.006
>  2  2.753 (0.187) 2.785 (0.171) / 1.012  2.822 (0.174) / 1.025
>  4  4.355 (0.139) 4.417 (0.179) / 1.014  4.361 (0.181) / 1.001
>  8  5.163 (0.119) 7.017 (0.195) / 1.359  7.369 (0.186) / 1.427
> 16  5.944 (0.134) 9.110 (0.242) / 1.532  9.187 (0.233) / 1.546
> 32  6.310 (0.082) 9.710 (0.156) / 1.539  9.827 (0.161) / 1.557
> 36  6.374 (0.112) 9.777 (0.141) / 1.534  9.830 (0.124) / 1.542
> 72  6.170 (0.139) 9.922 (0.190) / 1.608  9.945 (0.136) / 1.612
> 108  6.002 (0.089) 9.651 (0.176) / 1.608  9.847 (0.125) / 1.641
> 142  5.784 (0.079) 9.477 (0.089) / 1.638  9.641 (0.113) / 1.667
> 
> The following tables contain throughput results (ops/us) from the same
> setup for will-it-scale/open1_threads: 
> 
> #thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
>  1  0.503 (0.004) 0.501 (0.001) / 0.996  0.503 (0.002) / 1.000
>  2  0.783 (0.014) 0.773 (0.011) / 0.988  0.774 (0.016) / 0.989
>  4  1.422 (0.025) 1.398 (0.030) / 0.983  1.403 (0.025) / 0.987
>  8  1.753 (0.104) 1.641 (0.132) / 0.936  1.675 (0.134) / 0.956
> 16  1.851 (0.097) 1.760 (0.103) / 0.951  1.774 (0.119) / 0.959
> 32  0.905 (0.081) 1.708 (0.081) / 1.888  1.738 (0.069) / 1.922
> 36  0.895 (0.058) 1.726 (0.065) / 1.928  1.735 (0.081) / 1.938
> 72  0.823 (0.033) 1.610 (0.067) / 1.957  1.647 (0.067) / 2.002
> 108  0.845 (0.035) 1.588 (0.054) / 1.878  1.740 (0.067) / 2.058
> 142  0.840 (0.030) 1.546 (0.042) / 1.839  1.740 (0.048) / 2.070
> 
> and will-it-scale/lock2_threads:
> 
> #thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
>  1  1.551 (0.003) 1.558 (0.006) / 1.005  1.558 (0.003) / 1.005
>  2  2.722 (0.064) 2.704 (0.063) / 0.993  2.727 (0.058) / 1.002
>  4  5.286 (0.178) 5.360 (0.151) / 1.014  5.360 (0.135) / 1.014
>  8  4.115 (0.297) 3.906 (0.383) / 0.949  4.062 (0.366) / 0.987
> 16  4.119 (0.121) 3.950 (0.131) / 0.959  4.009 (0.132) / 0.973
> 32  2.508 (0.097) 3.805 (0.106) / 1.517  3.960 (0.091) / 1.579
> 36  2.457 (0.101) 3.810 (0.072) / 1.551  3.931 (0.106) / 1.600
> 72  1.913 (0.103) 3.530 (0.070) / 1.845  3.860 (0.078) / 2.018
> 108  1.891 (0.109) 3.410 (0.079) / 1.803  3.881 (0.097) / 2.052
> 142  1.752 (0.096) 3.236 (0.080) / 1.847  3.774 (0.078) / 2.155
> 
> Our evaluation shows that CNA also improves performance of user 
> applications that have hot pthread mutexes. Those mutexes are 
> blocking, and waiting threads park and unpark via the futex 
> mechanism in the kernel. Given that kernel futex chains, which
> are hashed by the mutex address, are each protected by a 
> chain-specific spin lock, the contention on a user-mode mutex 
> translates into contention on a kernel level spinlock. 
> 
> Here are the throughput results (ops/us) for the leveldb ‘readrandom’
> benchmark:
> 
> #thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
>  1  0.533 (0.011) 0.539 (0.014) / 1.012  0.536 (0.013) / 1.006
>  2  0.854 (0.022) 0.856 (0.017) / 1.003  0.857 (0.020) / 1.004
>  4  1.236 (0.028) 1.238 (0.054) / 1.002  1.217 (0.054) / 0.985
>  8  1.207 (0.117) 1.198 (0.122) / 0.993  1.155 (0.138) / 0.957
> 16  0.758 (0.055) 1.128 (0.118) / 1.489  1.068 (0.131) / 1.409
> 32  0.743 (0.027) 1.153 (0.028) / 1.551  1.147 (0.021) / 1.543
> 36  0.708 (0.027) 1.150 (0.024) / 1.623  1.137 (0.026) / 1.605
> 72  0.629 (0.016) 1.112 (0.019) / 1.767  1.134 (0.019) / 1.802
> 108  0.610 (0.012) 1.053 (0.018) / 1.725  1.130 (0.017) / 1.853
> 142  0.606 (0.013) 1.008 (0.020) / 1.664  1.110 (0.023) / 1.833
> 
> Further comments are welcome and appreciated.
> 
> Alex Kogan (6):
>  locking/qspinlock: Rename mcs lock/unlock macros and make them more
>    generic
>  locking/qspinlock: Refactor the qspinlock slow path
>  locking/qspinlock: Introduce CNA into the slow path of qspinlock
>  locking/qspinlock: Introduce starvation avoidance into CNA
>  locking/qspinlock: Avoid moving certain threads between waiting queues
>    in CNA
>  locking/qspinlock: Introduce the shuffle reduction optimization into
>    CNA
> 
> .../admin-guide/kernel-parameters.txt         |  18 +
> arch/arm/include/asm/mcs_spinlock.h           |   6 +-
> arch/x86/Kconfig                              |  20 +
> arch/x86/include/asm/qspinlock.h              |   4 +
> arch/x86/kernel/alternative.c                 |   4 +
> include/asm-generic/mcs_spinlock.h            |   4 +-
> kernel/locking/mcs_spinlock.h                 |  20 +-
> kernel/locking/qspinlock.c                    |  82 +++-
> kernel/locking/qspinlock_cna.h                | 425 ++++++++++++++++++
> kernel/locking/qspinlock_paravirt.h           |   2 +-
> 10 files changed, 562 insertions(+), 23 deletions(-)
> create mode 100644 kernel/locking/qspinlock_cna.h
> 
> -- 
> 2.24.3 (Apple Git-128)
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 0/6] Add NUMA-awareness to qspinlock
  2021-12-13 20:37 ` Alex Kogan
@ 2021-12-15 15:13   ` Alex Kogan
  0 siblings, 0 replies; 28+ messages in thread
From: Alex Kogan @ 2021-12-15 15:13 UTC (permalink / raw)
  To: LKML
  Cc: linux, Peter Zijlstra, Ingo Molnar, Will Deacon, Arnd Bergmann,
	Waiman Long, linux-arch, linux-arm-kernel, Thomas Gleixner,
	Borislav Petkov, hpa, x86, Hanjun Guo, Jan Glauber,
	Steven Sistare, Daniel Jordan, Dave Dice, Antonio Paolillo,
	Rafael Chehab, Diogo Behrens, Fuming (DRC OS Kernel Lab),
	Davidlohr Bueso


> The throughput results (events/ms) for sysbench ’threads’ benchmark:
The correct units are events/s (aka eps) — sorry for the typo.

> The throughput results (events/ms) for sysbench ’mutex’ benchmark:
Same here.

— Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 0/6] Add NUMA-awareness to qspinlock
  2021-05-14 20:07 [PATCH v15 0/6] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (7 preceding siblings ...)
  2021-12-13 20:37 ` Alex Kogan
@ 2022-04-11 17:09 ` Alex Kogan
  8 siblings, 0 replies; 28+ messages in thread
From: Alex Kogan @ 2022-04-11 17:09 UTC (permalink / raw)
  To: LKML
  Cc: linux, Peter Zijlstra, Ingo Molnar, Will Deacon, Arnd Bergmann,
	Waiman Long, linux-arch, linux-arm-kernel, Thomas Gleixner,
	Borislav Petkov, hpa, x86, Hanjun Guo, Jan Glauber,
	Steven Sistare, Daniel Jordan, Dave Dice

Hi, all.

I’ve got around to collect more performance data with macro-benchmarks
from the LKP suite[1]. Specifically, I measured more than 2.5x improvement
with fsmark and up to nearly 2x improvement with AIM7. Note that similar 
improvements have been previously reported by the kernel test robot [2,3],
but it is nice to see that they are reproducible on our system as well. 
The details appear below.

Our performance team also evaluated the performance of TPC-C with 
Oracle DB on the patched kernel with CNA. The evaluation, carried out on a
system equipped with two AMD EPYC 7551 processors, showed
no sensitivity to CNA -- no improvements, no regressions.

=============

The following results are from an Oracle X5-8 server (eight Intel Xeon
E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded cores each). Each
number represents average throughput over 25 runs. The standard deviation
is also reported in (), and it was pretty large in some experiments. 

The following results are from the fsmark benchmark (based on the 
`fsmark-generic-1brd-1BRD_48G-4M-btrfs-1x-*t-NoSync-24G` config).
The reported numbers represent `files-per-sec`.

num-threads  stock      CNA          / speedup
  1  400.216 (7.399) 403.212 (7.169) / 1.007
  2  784.484 (13.809) 778.196 (22.699) / 0.992
  4  1153.792 (54.699) 1130.140 (53.823) / 0.980
  8  1240.424 (160.453) 1287.532 (169.659) / 1.038
 16  1440.720 (182.555) 1489.288 (148.604) / 1.034
 32  1600.432 (59.944) 1860.652 (212.971) / 1.163
 64  1169.740 (40.815) 3027.680 (142.349) / 2.588
 72  1174.240 (39.261) 2682.560 (543.690) / 2.285
144  1164.296 (42.441) 2842.684 (400.664) / 2.442
216  1174.312 (38.392) 2445.168 (575.668) / 2.082
286  1174.732 (52.430) 2598.444 (468.615) / 2.212

We also experimented with another config 
(`fsmark-generic-1brd-1BRD_48G-4M-btrfs-1x-*t-fsyncBeforeClose-24G`)
and measured a more modest, yet robust, improvement:

num-threads  stock      CNA          / speedup
  1  177.304 (3.659) 178.596 (3.299) / 1.007
  2  345.260 (8.398) 349.384 (5.706) / 1.012
  4  647.292 (14.982) 648.164 (11.286) / 1.001
  8  1008.304 (46.342) 1007.108 (37.765) / 0.999
 16  1116.652 (57.816) 1150.752 (49.005) / 1.031
 32  1229.760 (84.932) 1346.132 (80.935) / 1.095
 64  981.564 (59.533) 1313.992 (74.971) / 1.339
 72  995.180 (59.228) 1266.660 (63.287) / 1.273
144  978.448 (69.581) 1290.336 (43.065) / 1.319
216  1011.880 (60.154) 1310.524 (61.631) / 1.295
286  985.164 (67.016) 1305.244 (63.735) / 1.325

The following results are from the AIM7 benchmark (based on the 
`aim7-fs-raid-4BRD_12G-btrfs-*-RAID0-disk_rr` config). Here we vary 
the load, and measure performance in `jobs-per-minute` units. Due to time
limitations, each experiment was repeated only 7 times.

load  stock      CNA          / speedup
  2  965.631 (1.696) 963.969 (0.965) / 0.998
 20  7901.647 (116.671) 8256.576 (93.362) / 1.045
100  16409.144 (280.828) 23380.819 (279.365) / 1.425
500  20355.541 (353.862) 34281.836 (873.357) / 1.684
1000  20733.947 (507.388) 36546.829 (695.273) / 1.763
1500  21570.639 (499.205) 37078.249 (806.669) / 1.719
3000  22068.761 (415.813) 36649.144 (1219.638) / 1.661
9000  19551.681 (337.232) 37998.464 (687.059) / 1.943

Best regards,
— Alex

[1] https://github.com/intel/lkp-tests.git
[2] https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/HGVOCYDEE5KTLYPTAFBD2RXDQOCDPFUJ/
[3] https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/DNMEQPXJRQY2IKHZ3ERGRY6TUPWDTFUN/

> On May 14, 2021, at 4:07 PM, Alex Kogan <alex.kogan@oracle.com> wrote:
> 
> Changes from v14:
> ----------------
> 
> - Change the way the main queue is scanned and reordered in
> cna_wait_head_or_lock(), based on Peter's suggestion.
> 
> In detail: instead of inspecting only one queue node, we now scan
> (and move nodes into the secondary queue) as long as the lock
> remains busy. This simplified the code quite a bit, as we don't need
> to call cna_order_queue() again from cna_lock_handoff(). 
> 
> - Use local_clock() instead of relying on jiffies to decide when to
> flush the secondary queue, per Andy's suggestion.
> 
> - Use module_param() for numa_spinlock_threshold_ns, so it can be tweaked
> at runtime, per Andy's suggestion.
> 
> - Reduce the default value for numa_spinlock_threshold_ns to 1ms based on
> the comments from Andy and Peter. The performance numbers below include
> results with the new default as well as with the value of 10ms, which was 
> the default threshold in previous revisions of the series.
> 
> Summary
> -------
> 
> Lock throughput can be increased by handing a lock to a waiter on the
> same NUMA node as the lock holder, provided care is taken to avoid
> starvation of waiters on other NUMA nodes. This patch introduces CNA
> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
> 
> CNA is a NUMA-aware version of the MCS lock. Spinning threads are
> organized in two queues, a primary queue for threads running on the same
> node as the current lock holder, and a secondary queue for threads
> running on other nodes. Threads store the ID of the node on which
> they are running in their queue nodes. After acquiring the MCS lock and
> before acquiring the spinlock, the MCS lock holder checks whether the next
> waiter in the primary queue (if exists) is running on the same NUMA node.
> If it is not, that waiter is detached from the main queue and moved into
> the tail of the secondary queue. This way, we gradually filter the primary
> queue, leaving only waiters running on the same preferred NUMA node. Note
> that certain priortized waiters (e.g., in irq and nmi contexts) are
> excluded from being moved to the secondary queue. We change the NUMA node
> preference after a waiter at the head of the secondary queue spins for a
> certain amount of time. We do that by flushing the secondary queue into
> the head of the primary queue, effectively changing the preference to the
> NUMA node of the waiter at the head of the secondary queue at the time of
> the flush.
> 
> More details are available at https://arxiv.org/abs/1810.05600.
> 
> We have done some performance evaluation with the locktorture module
> as well as with several benchmarks from the will-it-scale repo.
> The following locktorture results are from an Oracle X5-4 server
> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
> cores each). Each number represents an average (over 25 runs) of the
> total number of ops (x10^7) reported at the end of each run. The 
> standard deviation is also reported in (), and in general is about 3%
> from the average. The 'stock' kernel is v5.12.0,
> commit 3cf5c8ea3a66, compiled in the default configuration. 
> 'CNA' is the modified kernel with NUMA_AWARE_SPINLOCKS set and
> the new default threshold of 1ms for flushing the secondary queue
> (numa_spinlock_threshold_ns); 'CNA-10ms' is the same as CNA, 
> but uses the threshold of 10ms. The speedup is calculated by dividing 
> the result of 'CNA' and 'CNA-10ms', respectively, by the result
> achieved with 'stock'.
> 
> #thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
>  1  2.695 (0.108) 2.704 (0.099) / 1.003  2.712 (0.077) / 1.006
>  2  2.753 (0.187) 2.785 (0.171) / 1.012  2.822 (0.174) / 1.025
>  4  4.355 (0.139) 4.417 (0.179) / 1.014  4.361 (0.181) / 1.001
>  8  5.163 (0.119) 7.017 (0.195) / 1.359  7.369 (0.186) / 1.427
> 16  5.944 (0.134) 9.110 (0.242) / 1.532  9.187 (0.233) / 1.546
> 32  6.310 (0.082) 9.710 (0.156) / 1.539  9.827 (0.161) / 1.557
> 36  6.374 (0.112) 9.777 (0.141) / 1.534  9.830 (0.124) / 1.542
> 72  6.170 (0.139) 9.922 (0.190) / 1.608  9.945 (0.136) / 1.612
> 108  6.002 (0.089) 9.651 (0.176) / 1.608  9.847 (0.125) / 1.641
> 142  5.784 (0.079) 9.477 (0.089) / 1.638  9.641 (0.113) / 1.667
> 
> The following tables contain throughput results (ops/us) from the same
> setup for will-it-scale/open1_threads: 
> 
> #thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
>  1  0.503 (0.004) 0.501 (0.001) / 0.996  0.503 (0.002) / 1.000
>  2  0.783 (0.014) 0.773 (0.011) / 0.988  0.774 (0.016) / 0.989
>  4  1.422 (0.025) 1.398 (0.030) / 0.983  1.403 (0.025) / 0.987
>  8  1.753 (0.104) 1.641 (0.132) / 0.936  1.675 (0.134) / 0.956
> 16  1.851 (0.097) 1.760 (0.103) / 0.951  1.774 (0.119) / 0.959
> 32  0.905 (0.081) 1.708 (0.081) / 1.888  1.738 (0.069) / 1.922
> 36  0.895 (0.058) 1.726 (0.065) / 1.928  1.735 (0.081) / 1.938
> 72  0.823 (0.033) 1.610 (0.067) / 1.957  1.647 (0.067) / 2.002
> 108  0.845 (0.035) 1.588 (0.054) / 1.878  1.740 (0.067) / 2.058
> 142  0.840 (0.030) 1.546 (0.042) / 1.839  1.740 (0.048) / 2.070
> 
> and will-it-scale/lock2_threads:
> 
> #thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
>  1  1.551 (0.003) 1.558 (0.006) / 1.005  1.558 (0.003) / 1.005
>  2  2.722 (0.064) 2.704 (0.063) / 0.993  2.727 (0.058) / 1.002
>  4  5.286 (0.178) 5.360 (0.151) / 1.014  5.360 (0.135) / 1.014
>  8  4.115 (0.297) 3.906 (0.383) / 0.949  4.062 (0.366) / 0.987
> 16  4.119 (0.121) 3.950 (0.131) / 0.959  4.009 (0.132) / 0.973
> 32  2.508 (0.097) 3.805 (0.106) / 1.517  3.960 (0.091) / 1.579
> 36  2.457 (0.101) 3.810 (0.072) / 1.551  3.931 (0.106) / 1.600
> 72  1.913 (0.103) 3.530 (0.070) / 1.845  3.860 (0.078) / 2.018
> 108  1.891 (0.109) 3.410 (0.079) / 1.803  3.881 (0.097) / 2.052
> 142  1.752 (0.096) 3.236 (0.080) / 1.847  3.774 (0.078) / 2.155
> 
> Our evaluation shows that CNA also improves performance of user 
> applications that have hot pthread mutexes. Those mutexes are 
> blocking, and waiting threads park and unpark via the futex 
> mechanism in the kernel. Given that kernel futex chains, which
> are hashed by the mutex address, are each protected by a 
> chain-specific spin lock, the contention on a user-mode mutex 
> translates into contention on a kernel level spinlock. 
> 
> Here are the throughput results (ops/us) for the leveldb ‘readrandom’
> benchmark:
> 
> #thr  	 stock      CNA          / speedup  CNA-10ms    / speedup
>  1  0.533 (0.011) 0.539 (0.014) / 1.012  0.536 (0.013) / 1.006
>  2  0.854 (0.022) 0.856 (0.017) / 1.003  0.857 (0.020) / 1.004
>  4  1.236 (0.028) 1.238 (0.054) / 1.002  1.217 (0.054) / 0.985
>  8  1.207 (0.117) 1.198 (0.122) / 0.993  1.155 (0.138) / 0.957
> 16  0.758 (0.055) 1.128 (0.118) / 1.489  1.068 (0.131) / 1.409
> 32  0.743 (0.027) 1.153 (0.028) / 1.551  1.147 (0.021) / 1.543
> 36  0.708 (0.027) 1.150 (0.024) / 1.623  1.137 (0.026) / 1.605
> 72  0.629 (0.016) 1.112 (0.019) / 1.767  1.134 (0.019) / 1.802
> 108  0.610 (0.012) 1.053 (0.018) / 1.725  1.130 (0.017) / 1.853
> 142  0.606 (0.013) 1.008 (0.020) / 1.664  1.110 (0.023) / 1.833
> 
> Further comments are welcome and appreciated.
> 
> Alex Kogan (6):
>  locking/qspinlock: Rename mcs lock/unlock macros and make them more
>    generic
>  locking/qspinlock: Refactor the qspinlock slow path
>  locking/qspinlock: Introduce CNA into the slow path of qspinlock
>  locking/qspinlock: Introduce starvation avoidance into CNA
>  locking/qspinlock: Avoid moving certain threads between waiting queues
>    in CNA
>  locking/qspinlock: Introduce the shuffle reduction optimization into
>    CNA
> 
> .../admin-guide/kernel-parameters.txt         |  18 +
> arch/arm/include/asm/mcs_spinlock.h           |   6 +-
> arch/x86/Kconfig                              |  20 +
> arch/x86/include/asm/qspinlock.h              |   4 +
> arch/x86/kernel/alternative.c                 |   4 +
> include/asm-generic/mcs_spinlock.h            |   4 +-
> kernel/locking/mcs_spinlock.h                 |  20 +-
> kernel/locking/qspinlock.c                    |  82 +++-
> kernel/locking/qspinlock_cna.h                | 425 ++++++++++++++++++
> kernel/locking/qspinlock_paravirt.h           |   2 +-
> 10 files changed, 562 insertions(+), 23 deletions(-)
> create mode 100644 kernel/locking/qspinlock_cna.h
> 
> -- 
> 2.24.3 (Apple Git-128)
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2021-05-14 20:07 ` [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
  2021-09-22 19:25   ` Davidlohr Bueso
  2021-09-30 10:05   ` Barry Song
@ 2023-08-02 23:14   ` Guo Ren
  2023-08-03  8:50     ` Peter Zijlstra
  2 siblings, 1 reply; 28+ messages in thread
From: Guo Ren @ 2023-08-02 23:14 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Fri, May 14, 2021 at 04:07:40PM -0400, Alex Kogan wrote:
> In CNA, spinning threads are organized in two queues, a primary queue for
> threads running on the same node as the current lock holder, and a
> secondary queue for threads running on other nodes. After acquiring the
> MCS lock and before acquiring the spinlock, the MCS lock
> holder checks whether the next waiter in the primary queue (if exists) is
> running on the same NUMA node. If it is not, that waiter is detached from
> the main queue and moved into the tail of the secondary queue. This way,
> we gradually filter the primary queue, leaving only waiters running on
> the same preferred NUMA node. For more details, see
> https://arxiv.org/abs/1810.05600.
> 
> Note that this variant of CNA may introduce starvation by continuously
> passing the lock between waiters in the main queue. This issue will be
> addressed later in the series.
> 
> Enabling CNA is controlled via a new configuration option
> (NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
> boot time only if we run on a multi-node machine in native environment and
> the new config is enabled. (For the time being, the patching requires
> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
> resolved once static_call() is available.) This default behavior can be
> overridden with the new kernel boot command-line option
> "numa_spinlock=on/off" (default is "auto").
> 
> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Waiman Long <longman@redhat.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  10 +
>  arch/x86/Kconfig                              |  20 ++
>  arch/x86/include/asm/qspinlock.h              |   4 +
>  arch/x86/kernel/alternative.c                 |   4 +
>  kernel/locking/mcs_spinlock.h                 |   2 +-
>  kernel/locking/qspinlock.c                    |  42 ++-
>  kernel/locking/qspinlock_cna.h                | 325 ++++++++++++++++++
>  7 files changed, 402 insertions(+), 5 deletions(-)
>  create mode 100644 kernel/locking/qspinlock_cna.h
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a816935d23d4..94d35507560c 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3515,6 +3515,16 @@
>  			NUMA balancing.
>  			Allowed values are enable and disable
>  
> +	numa_spinlock=	[NUMA, PV_OPS] Select the NUMA-aware variant
> +			of spinlock. The options are:
> +			auto - Enable this variant if running on a multi-node
> +			machine in native environment.
> +			on  - Unconditionally enable this variant.
> +			off - Unconditionally disable this variant.
> +
> +			Not specifying this option is equivalent to
> +			numa_spinlock=auto.
> +
>  	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
>  			'node', 'default' can be specified
>  			This can be set from sysctl after boot.
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0045e1b44190..819c3dad8afc 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1564,6 +1564,26 @@ config NUMA
>  
>  	  Otherwise, you should say N.
>  
> +config NUMA_AWARE_SPINLOCKS
> +	bool "Numa-aware spinlocks"
> +	depends on NUMA
> +	depends on QUEUED_SPINLOCKS
> +	depends on 64BIT
> +	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
> +	# This is awkward, but hopefully would be resolved once static_call()
> +	# is available.
> +	depends on PARAVIRT_SPINLOCKS
> +	default y
> +	help
> +	  Introduce NUMA (Non Uniform Memory Access) awareness into
> +	  the slow path of spinlocks.
> +
> +	  In this variant of qspinlock, the kernel will try to keep the lock
> +	  on the same node, thus reducing the number of remote cache misses,
> +	  while trading some of the short term fairness for better performance.
> +
> +	  Say N if you want absolute first come first serve fairness.
> +
>  config AMD_NUMA
>  	def_bool y
>  	prompt "Old style AMD Opteron NUMA detection"
> diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
> index d86ab942219c..21d09e8db979 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -27,6 +27,10 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
>  	return val;
>  }
>  
> +#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
> +extern void cna_configure_spin_lock_slowpath(void);
> +#endif
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
>  extern void __pv_init_lock_hash(void);
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 6974b5174495..6c73c2dc17fb 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -608,6 +608,10 @@ void __init alternative_instructions(void)
>  	 */
>  	paravirt_set_cap();
>  
> +#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
> +	cna_configure_spin_lock_slowpath();
> +#endif
> +
>  	/*
>  	 * First patch paravirt functions, such that we overwrite the indirect
>  	 * call with the direct call.
> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index e794babc519a..3926aad129ed 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -17,7 +17,7 @@
>  
>  struct mcs_spinlock {
>  	struct mcs_spinlock *next;
> -	int locked; /* 1 if lock acquired */
> +	unsigned int locked; /* 1 if lock acquired */
>  	int count;  /* nesting count, see qspinlock.c */
>  };
>  
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index e3518709ffdc..8c1a21b53913 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -11,7 +11,7 @@
>   *          Peter Zijlstra <peterz@infradead.org>
>   */
>  
> -#ifndef _GEN_PV_LOCK_SLOWPATH
> +#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
>  
>  #include <linux/smp.h>
>  #include <linux/bug.h>
> @@ -71,7 +71,8 @@
>  /*
>   * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
>   * size and four of them will fit nicely in one 64-byte cacheline. For
> - * pvqspinlock, however, we need more space for extra data. To accommodate
> + * pvqspinlock, however, we need more space for extra data. The same also
> + * applies for the NUMA-aware variant of spinlocks (CNA). To accommodate
>   * that, we insert two more long words to pad it up to 32 bytes. IOW, only
>   * two of them can fit in a cacheline in this case. That is OK as it is rare
>   * to have more than 2 levels of slowpath nesting in actual use. We don't
> @@ -80,7 +81,7 @@
>   */
>  struct qnode {
>  	struct mcs_spinlock mcs;
> -#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
>  	long reserved[2];
>  #endif
>  };
> @@ -104,6 +105,8 @@ struct qnode {
>   * Exactly fits one 64-byte cacheline on a 64-bit architecture.
>   *
>   * PV doubles the storage and uses the second cacheline for PV state.
> + * CNA also doubles the storage and uses the second cacheline for
> + * CNA-specific state.
>   */
>  static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
>  
> @@ -317,7 +320,7 @@ static __always_inline void __mcs_lock_handoff(struct mcs_spinlock *node,
>  #define try_clear_tail		__try_clear_tail
>  #define mcs_lock_handoff	__mcs_lock_handoff
>  
> -#endif /* _GEN_PV_LOCK_SLOWPATH */
> +#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
>  
>  /**
>   * queued_spin_lock_slowpath - acquire the queued spinlock
> @@ -589,6 +592,37 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  }
>  EXPORT_SYMBOL(queued_spin_lock_slowpath);
>  
> +/*
> + * Generate the code for NUMA-aware spinlocks
> + */
> +#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
> +#define _GEN_CNA_LOCK_SLOWPATH
> +
> +#undef pv_init_node
> +#define pv_init_node			cna_init_node
> +
> +#undef pv_wait_head_or_lock
> +#define pv_wait_head_or_lock		cna_wait_head_or_lock
> +
> +#undef try_clear_tail
> +#define try_clear_tail			cna_try_clear_tail
> +
> +#undef mcs_lock_handoff
> +#define mcs_lock_handoff			cna_lock_handoff
> +
> +#undef  queued_spin_lock_slowpath
> +/*
> + * defer defining queued_spin_lock_slowpath until after the include to
> + * avoid a name clash with the identically named field in pv_ops.lock
> + * (see cna_configure_spin_lock_slowpath())
> + */
> +#include "qspinlock_cna.h"
> +#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
> +
> +#include "qspinlock.c"
> +
> +#endif
> +
>  /*
>   * Generate the paravirt code for queued_spin_unlock_slowpath().
>   */
> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
> new file mode 100644
> index 000000000000..ca564e64e5de
> --- /dev/null
> +++ b/kernel/locking/qspinlock_cna.h
> @@ -0,0 +1,325 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _GEN_CNA_LOCK_SLOWPATH
> +#error "do not include this file"
> +#endif
> +
> +#include <linux/topology.h>
> +
> +/*
> + * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
> + *
> + * In CNA, spinning threads are organized in two queues, a primary queue for
> + * threads running on the same NUMA node as the current lock holder, and a
> + * secondary queue for threads running on other nodes. Schematically, it
> + * looks like this:
> + *
> + *    cna_node
> + *   +----------+     +--------+         +--------+
> + *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
> + *   |mcs:locked| -.  +--------+         +--------+
> + *   +----------+  |
> + *                 `----------------------.
> + *                                        v
> + *                 +--------+         +--------+
> + *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
> + *                 +--------+         +--------+
> + *                     ^                    |
> + *                     `--------------------'
> + *
> + * N.B. locked := 1 if secondary queue is absent. Otherwise, it contains the
> + * encoded pointer to the tail of the secondary queue, which is organized as a
> + * circular list.
> + *
> + * After acquiring the MCS lock and before acquiring the spinlock, the MCS lock
> + * holder checks whether the next waiter in the primary queue (if exists) is
> + * running on the same NUMA node. If it is not, that waiter is detached from the
> + * main queue and moved into the tail of the secondary queue. This way, we
> + * gradually filter the primary queue, leaving only waiters running on the same
> + * preferred NUMA node.
> + *
> + * For more details, see https://arxiv.org/abs/1810.05600.
> + *
> + * Authors: Alex Kogan <alex.kogan@oracle.com>
> + *          Dave Dice <dave.dice@oracle.com>
> + */
> +
> +struct cna_node {
> +	struct mcs_spinlock	mcs;
> +	u16			numa_node;
> +	u16			real_numa_node;
> +	u32			encoded_tail;	/* self */
> +};
> +
> +static void __init cna_init_nodes_per_cpu(unsigned int cpu)
> +{
> +	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
> +	int numa_node = cpu_to_node(cpu);
> +	int i;
> +
> +	for (i = 0; i < MAX_NODES; i++) {
> +		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
> +
> +		cn->real_numa_node = numa_node;
> +		cn->encoded_tail = encode_tail(cpu, i);
> +		/*
> +		 * make sure @encoded_tail is not confused with other valid
> +		 * values for @locked (0 or 1)
> +		 */
> +		WARN_ON(cn->encoded_tail <= 1);
> +	}
> +}
> +
> +static int __init cna_init_nodes(void)
> +{
> +	unsigned int cpu;
> +
> +	/*
> +	 * this will break on 32bit architectures, so we restrict
> +	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
> +	 */
> +	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
> +	/* we store an encoded tail word in the node's @locked field */
> +	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
> +
> +	for_each_possible_cpu(cpu)
> +		cna_init_nodes_per_cpu(cpu);
> +
> +	return 0;
> +}
> +
> +static __always_inline void cna_init_node(struct mcs_spinlock *node)
> +{
> +	struct cna_node *cn = (struct cna_node *)node;
> +
> +	cn->numa_node = cn->real_numa_node;
> +}
> +
> +/*
> + * cna_splice_head -- splice the entire secondary queue onto the head of the
> + * primary queue.
> + *
> + * Returns the new primary head node or NULL on failure.
> + */
> +static struct mcs_spinlock *
> +cna_splice_head(struct qspinlock *lock, u32 val,
> +		struct mcs_spinlock *node, struct mcs_spinlock *next)
> +{
> +	struct mcs_spinlock *head_2nd, *tail_2nd;
> +	u32 new;
> +
> +	tail_2nd = decode_tail(node->locked);
> +	head_2nd = tail_2nd->next;
> +
> +	if (next) {
> +		/*
> +		 * If the primary queue is not empty, the primary tail doesn't
> +		 * need to change and we can simply link the secondary tail to
> +		 * the old primary head.
> +		 */
> +		tail_2nd->next = next;
> +	} else {
> +		/*
> +		 * When the primary queue is empty, the secondary tail becomes
> +		 * the primary tail.
> +		 */
> +
> +		/*
> +		 * Speculatively break the secondary queue's circular link such
> +		 * that when the secondary tail becomes the primary tail it all
> +		 * works out.
> +		 */
> +		tail_2nd->next = NULL;
> +
> +		/*
> +		 * tail_2nd->next = NULL;	old = xchg_tail(lock, tail);
> +		 *				prev = decode_tail(old);
> +		 * try_cmpxchg_release(...);	WRITE_ONCE(prev->next, node);
> +		 *
> +		 * If the following cmpxchg() succeeds, our stores will not
> +		 * collide.
> +		 */
> +		new = ((struct cna_node *)tail_2nd)->encoded_tail |
> +			_Q_LOCKED_VAL;
> +		if (!atomic_try_cmpxchg_release(&lock->val, &val, new)) {
> +			/* Restore the secondary queue's circular link. */
> +			tail_2nd->next = head_2nd;
> +			return NULL;
> +		}
> +	}
> +
> +	/* The primary queue head now is what was the secondary queue head. */
> +	return head_2nd;
> +}
> +
> +static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
> +				      struct mcs_spinlock *node)
> +{
> +	/*
> +	 * We're here because the primary queue is empty; check the secondary
> +	 * queue for remote waiters.
> +	 */
> +	if (node->locked > 1) {
> +		struct mcs_spinlock *next;
> +
> +		/*
> +		 * When there are waiters on the secondary queue, try to move
> +		 * them back onto the primary queue and let them rip.
> +		 */
> +		next = cna_splice_head(lock, val, node, NULL);
> +		if (next) {
> +			arch_mcs_lock_handoff(&next->locked, 1);
> +			return true;
> +		}
> +
> +		return false;
> +	}
> +
> +	/* Both queues are empty. Do what MCS does. */
> +	return __try_clear_tail(lock, val, node);
> +}
> +
> +/*
> + * cna_splice_next -- splice the next node from the primary queue onto
> + * the secondary queue.
> + */
> +static void cna_splice_next(struct mcs_spinlock *node,
> +			    struct mcs_spinlock *next,
> +			    struct mcs_spinlock *nnext)
> +{
> +	/* remove 'next' from the main queue */
> +	node->next = nnext;
> +
> +	/* stick `next` on the secondary queue tail */
> +	if (node->locked <= 1) { /* if secondary queue is empty */
> +		/* create secondary queue */
> +		next->next = next;
> +	} else {
> +		/* add to the tail of the secondary queue */
> +		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
> +		struct mcs_spinlock *head_2nd = tail_2nd->next;
> +
> +		tail_2nd->next = next;
> +		next->next = head_2nd;
> +	}
> +
> +	node->locked = ((struct cna_node *)next)->encoded_tail;
> +}
> +
> +/*
> + * cna_order_queue - check whether the next waiter in the main queue is on
> + * the same NUMA node as the lock holder; if not, and it has a waiter behind
> + * it in the main queue, move the former onto the secondary queue.
> + * Returns 1 if the next waiter runs on the same NUMA node; 0 otherwise.
> + */
> +static int cna_order_queue(struct mcs_spinlock *node)
> +{
> +	struct mcs_spinlock *next = READ_ONCE(node->next);
> +	struct cna_node *cn = (struct cna_node *)node;
> +	int numa_node, next_numa_node;
> +
> +	if (!next)
> +		return 0;
> +
> +	numa_node = cn->numa_node;
> +	next_numa_node = ((struct cna_node *)next)->numa_node;
> +
> +	if (next_numa_node != numa_node) {
> +		struct mcs_spinlock *nnext = READ_ONCE(next->next);
> +
> +		if (nnext)
> +			cna_splice_next(node, next, nnext);
> +
> +		return 0;
> +	}
> +	return 1;
> +}
> +
> +#define LOCK_IS_BUSY(lock) (atomic_read(&(lock)->val) & _Q_LOCKED_PENDING_MASK)
> +
> +/* Abuse the pv_wait_head_or_lock() hook to get some work done */
> +static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
> +						 struct mcs_spinlock *node)
> +{
> +	/*
> +	 * Try and put the time otherwise spent spin waiting on
> +	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> +	 */
> +	while (LOCK_IS_BUSY(lock) && !cna_order_queue(node))
> +		cpu_relax();
> +
> +	return 0; /* we lied; we didn't wait, go do so now */
> +}
> +
> +static inline void cna_lock_handoff(struct mcs_spinlock *node,
> +				 struct mcs_spinlock *next)
> +{
> +	u32 val = 1;
> +
> +	if (node->locked > 1) {
> +		struct cna_node *cn = (struct cna_node *)node;
> +
> +		val = node->locked;	/* preserve secondary queue */
> +
> +		/*
> +		 * We have a local waiter, either a real or a fake one;
> +		 * reload @next in case it was changed by cna_order_queue().
> +		 */
> +		next = node->next;
> +
> +		/*
> +		 * Pass over NUMA node id of primary queue, to maintain the
> +		 * preference even if the next waiter is on a different node.
> +		 */
> +		((struct cna_node *)next)->numa_node = cn->numa_node;
> +	}
> +
> +	arch_mcs_lock_handoff(&next->locked, val);
> +}
> +
> +/*
> + * Constant (boot-param configurable) flag selecting the NUMA-aware variant
> + * of spinlock.  Possible values: -1 (off) / 0 (auto, default) / 1 (on).
> + */
> +static int numa_spinlock_flag;
> +
> +static int __init numa_spinlock_setup(char *str)
> +{
> +	if (!strcmp(str, "auto")) {
> +		numa_spinlock_flag = 0;
> +		return 1;
> +	} else if (!strcmp(str, "on")) {
> +		numa_spinlock_flag = 1;
> +		return 1;
> +	} else if (!strcmp(str, "off")) {
> +		numa_spinlock_flag = -1;
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +__setup("numa_spinlock=", numa_spinlock_setup);
> +
> +void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
> +
> +/*
> + * Switch to the NUMA-friendly slow path for spinlocks when we have
> + * multiple NUMA nodes in native environment, unless the user has
> + * overridden this default behavior by setting the numa_spinlock flag.
> + */
> +void __init cna_configure_spin_lock_slowpath(void)
> +{
> +
> +	if (numa_spinlock_flag < 0)
> +		return;
> +
> +	if (numa_spinlock_flag == 0 && (nr_node_ids < 2 ||
> +		    pv_ops.lock.queued_spin_lock_slowpath !=
> +			native_queued_spin_lock_slowpath))
> +		return;
> +
> +	cna_init_nodes();
> +
> +	pv_ops.lock.queued_spin_lock_slowpath = __cna_queued_spin_lock_slowpath;
The pv_ops hook belongs to the x86-specific paravirt framework, which
prevents other architectures from wiring up the CNA spinlock.

I'm working on riscv qspinlock for the sg2042, a 64-core platform with
2/4 NUMA nodes. Here are the riscv CNA qspinlock patches:
https://lore.kernel.org/linux-riscv/20230802164701.192791-19-guoren@kernel.org/

What is the next plan for this patch series? I think the two-queue design
is sufficient for most platforms with two NUMA nodes.


> +
> +	pr_info("Enabling CNA spinlock\n");
> +}
> -- 
> 2.24.3 (Apple Git-128)
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2023-08-02 23:14   ` Guo Ren
@ 2023-08-03  8:50     ` Peter Zijlstra
  2023-08-03 10:28       ` Guo Ren
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2023-08-03  8:50 UTC (permalink / raw)
  To: Guo Ren
  Cc: Alex Kogan, linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Wed, Aug 02, 2023 at 07:14:05PM -0400, Guo Ren wrote:

> The pv_ops is belongs to x86 custom frame work, and it prevent other
> architectures connect to the CNA spinlock.

static_call() exists as an arch-neutral variant of this.
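
For illustration, a minimal sketch of the idea (all names below are
illustrative placeholders, not the actual patch):

/*
 * Sketch only: route the slow path through an arch-neutral static call
 * instead of x86's pv_ops, so any architecture can switch it at boot.
 */
DEFINE_STATIC_CALL(qspinlock_slowpath, native_queued_spin_lock_slowpath);

void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	static_call(qspinlock_slowpath)(lock, val);
}

/* boot-time switch, e.g. from cna_configure_spin_lock_slowpath() */
static void __init select_cna_slowpath(void)
{
	static_call_update(qspinlock_slowpath,
			   __cna_queued_spin_lock_slowpath);
}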

> I'm working on riscv qspinlock on sg2042 64 cores 2/4 NUMA nodes
> platforms. Here are the patches about riscv CNA qspinlock:
> https://lore.kernel.org/linux-riscv/20230802164701.192791-19-guoren@kernel.org/
> 
> What's the next plan for this patch series? I think the two-queue design
> has satisfied most platforms with two NUMA nodes.

What has been your reason for working on CNA? What lock has been so
contended you need this?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2023-08-03  8:50     ` Peter Zijlstra
@ 2023-08-03 10:28       ` Guo Ren
  2023-08-03 11:56         ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Guo Ren @ 2023-08-03 10:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Kogan, linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Thu, Aug 3, 2023 at 4:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Aug 02, 2023 at 07:14:05PM -0400, Guo Ren wrote:
>
> > The pv_ops is belongs to x86 custom frame work, and it prevent other
> > architectures connect to the CNA spinlock.
>
> static_call() exists as a arch neutral variant of this.
Emm... we have used static_call() in the riscv queued_spin_lock_:
https://lore.kernel.org/all/20230802164701.192791-20-guoren@kernel.org/

But we met a compile problem:

  GEN     .vmlinux.objs
  MODPOST Module.symvers
ERROR: modpost: "__SCK__pv_queued_spin_unlock" [arch/riscv/kvm/kvm.ko]
undefined!
ERROR: modpost: "__SCK__pv_queued_spin_unlock"
[kernel/locking/locktorture.ko] undefined!
ERROR: modpost: "__SCK__pv_queued_spin_unlock" [mm/z3fold.ko] undefined!
ERROR: modpost: "__SCK__pv_queued_spin_unlock"
[fs/nfs_common/grace.ko] undefined!
ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/quota/quota_v1.ko] undefined!
ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/quota/quota_v2.ko] undefined!
ERROR: modpost: "__SCK__pv_queued_spin_unlock"
[fs/quota/quota_tree.ko] undefined!
ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/fuse/virtiofs.ko] undefined!
ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/dlm/dlm.ko] undefined!
ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/fscache/fscache.ko]
undefined!
WARNING: modpost: suppressed 839 unresolved symbol warnings because
there were too many)
/home/guoren/source/kernel/linux/scripts/Makefile.modpost:144: recipe
for target 'Module.symvers' failed

Our solution is:
EXPORT_SYMBOL(__SCK__pv_queued_spin_unlock);

What do you think about it?

>
> > I'm working on riscv qspinlock on sg2042 64 cores 2/4 NUMA nodes
> > platforms. Here are the patches about riscv CNA qspinlock:
> > https://lore.kernel.org/linux-riscv/20230802164701.192791-19-guoren@kernel.org/
> >
> > What's the next plan for this patch series? I think the two-queue design
> > has satisfied most platforms with two NUMA nodes.
>
> What has been your reason for working on CNA? What lock has been so
> contended you need this?
I wrote the reason here:
https://lore.kernel.org/all/20230802164701.192791-1-guoren@kernel.org/

The target platform is: https://www.sophon.ai/

The two-NUMA-node platform is now available, so we want to measure the
benefit of CNA qspinlock.

Any feedbacks are welcome :)

-- 
Best Regards
 Guo Ren

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2023-08-03 10:28       ` Guo Ren
@ 2023-08-03 11:56         ` Peter Zijlstra
  2023-08-04  1:33           ` Guo Ren
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2023-08-03 11:56 UTC (permalink / raw)
  To: Guo Ren
  Cc: Alex Kogan, linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Thu, Aug 03, 2023 at 06:28:51PM +0800, Guo Ren wrote:
> On Thu, Aug 3, 2023 at 4:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Aug 02, 2023 at 07:14:05PM -0400, Guo Ren wrote:
> >
> > > The pv_ops is belongs to x86 custom frame work, and it prevent other
> > > architectures connect to the CNA spinlock.
> >
> > static_call() exists as a arch neutral variant of this.
> Emm... we have used static_call() in the riscv queued_spin_lock_:
> https://lore.kernel.org/all/20230802164701.192791-20-guoren@kernel.org/

Yeah, I think I saw that land in the INBOX, just haven't had time to
look at it.

> But we met a compile problem:
> 
>   GEN     .vmlinux.objs
>   MODPOST Module.symvers
> ERROR: modpost: "__SCK__pv_queued_spin_unlock" [arch/riscv/kvm/kvm.ko]
> undefined!
> ERROR: modpost: "__SCK__pv_queued_spin_unlock"
> [kernel/locking/locktorture.ko] undefined!
> ERROR: modpost: "__SCK__pv_queued_spin_unlock" [mm/z3fold.ko] undefined!
> ERROR: modpost: "__SCK__pv_queued_spin_unlock"
> [fs/nfs_common/grace.ko] undefined!
> ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/quota/quota_v1.ko] undefined!
> ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/quota/quota_v2.ko] undefined!
> ERROR: modpost: "__SCK__pv_queued_spin_unlock"
> [fs/quota/quota_tree.ko] undefined!
> ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/fuse/virtiofs.ko] undefined!
> ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/dlm/dlm.ko] undefined!
> ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/fscache/fscache.ko]
> undefined!
> WARNING: modpost: suppressed 839 unresolved symbol warnings because
> there were too many)
> /home/guoren/source/kernel/linux/scripts/Makefile.modpost:144: recipe
> for target 'Module.symvers' failed
> 
> Our solution is:
> EXPORT_SYMBOL(__SCK__pv_queued_spin_unlock);
> 
> What do you think about it?

Could be you're not using static_call_mod() to go with
EXPORT_STATIC_CALL_TRAMP()
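
Roughly, assuming the unlock path sits behind a static call (the key
name below is taken from the modpost errors above, purely for
illustration), the pairing could look like this:

/*
 * Core kernel side (sketch): define the key and export only the
 * trampoline, so modules can call the static call but cannot update
 * it; __SCK__pv_queued_spin_unlock then need not be exported.
 * __native_queued_spin_unlock stands in for the default implementation.
 */
DEFINE_STATIC_CALL(pv_queued_spin_unlock, __native_queued_spin_unlock);
EXPORT_STATIC_CALL_TRAMP_GPL(pv_queued_spin_unlock);

/* Header used by modular call sites (sketch): static_call_mod()
 * references the exported trampoline rather than the key. */
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
	static_call_mod(pv_queued_spin_unlock)(lock);
}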

> > > I'm working on riscv qspinlock on sg2042 64 cores 2/4 NUMA nodes
> > > platforms. Here are the patches about riscv CNA qspinlock:
> > > https://lore.kernel.org/linux-riscv/20230802164701.192791-19-guoren@kernel.org/
> > >
> > > What's the next plan for this patch series? I think the two-queue design
> > > has satisfied most platforms with two NUMA nodes.
> >
> > What has been your reason for working on CNA? What lock has been so
> > contended you need this?
> I wrote the reason here:
> https://lore.kernel.org/all/20230802164701.192791-1-guoren@kernel.org/
> 
> The target platform is: https://www.sophon.ai/
> 
> The two NUMA nodes platform has come out, so we want to measure the
> benefit of CNA qspinlock.

CNA should only show a benefit when there is strong inter-node
contention, and in that case it is typically best to fix the kernel side
locking.

Hence the question as to what lock prompted you to look at this.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2023-08-03 11:56         ` Peter Zijlstra
@ 2023-08-04  1:33           ` Guo Ren
  2023-08-04  1:38             ` Guo Ren
  2023-08-04  8:25             ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Guo Ren @ 2023-08-04  1:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Kogan, linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Thu, Aug 3, 2023 at 7:57 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Aug 03, 2023 at 06:28:51PM +0800, Guo Ren wrote:
> > On Thu, Aug 3, 2023 at 4:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Aug 02, 2023 at 07:14:05PM -0400, Guo Ren wrote:
> > >
> > > > The pv_ops is belongs to x86 custom frame work, and it prevent other
> > > > architectures connect to the CNA spinlock.
> > >
> > > static_call() exists as a arch neutral variant of this.
> > Emm... we have used static_call() in the riscv queued_spin_lock_:
> > https://lore.kernel.org/all/20230802164701.192791-20-guoren@kernel.org/
>
> Yeah, I think I saw that land in the INBOX, just haven't had time to
> look at it.
>
> > But we met a compile problem:
> >
> >   GEN     .vmlinux.objs
> >   MODPOST Module.symvers
> > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [arch/riscv/kvm/kvm.ko]
> > undefined!
> > ERROR: modpost: "__SCK__pv_queued_spin_unlock"
> > [kernel/locking/locktorture.ko] undefined!
> > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [mm/z3fold.ko] undefined!
> > ERROR: modpost: "__SCK__pv_queued_spin_unlock"
> > [fs/nfs_common/grace.ko] undefined!
> > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/quota/quota_v1.ko] undefined!
> > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/quota/quota_v2.ko] undefined!
> > ERROR: modpost: "__SCK__pv_queued_spin_unlock"
> > [fs/quota/quota_tree.ko] undefined!
> > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/fuse/virtiofs.ko] undefined!
> > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/dlm/dlm.ko] undefined!
> > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/fscache/fscache.ko]
> > undefined!
> > WARNING: modpost: suppressed 839 unresolved symbol warnings because
> > there were too many)
> > /home/guoren/source/kernel/linux/scripts/Makefile.modpost:144: recipe
> > for target 'Module.symvers' failed
> >
> > Our solution is:
> > EXPORT_SYMBOL(__SCK__pv_queued_spin_unlock);
> >
> > What do you think about it?
>
> Could be you're not using static_call_mod() to go with
> EXPORT_STATIC_CALL_TRAMP()
Thx, that's what I want.

>
> > > > I'm working on riscv qspinlock on sg2042 64 cores 2/4 NUMA nodes
> > > > platforms. Here are the patches about riscv CNA qspinlock:
> > > > https://lore.kernel.org/linux-riscv/20230802164701.192791-19-guoren@kernel.org/
> > > >
> > > > What's the next plan for this patch series? I think the two-queue design
> > > > has satisfied most platforms with two NUMA nodes.
> > >
> > > What has been your reason for working on CNA? What lock has been so
> > > contended you need this?
> > I wrote the reason here:
> > https://lore.kernel.org/all/20230802164701.192791-1-guoren@kernel.org/
> >
> > The target platform is: https://www.sophon.ai/
> >
> > The two NUMA nodes platform has come out, so we want to measure the
> > benefit of CNA qspinlock.
>
> CNA should only show a benefit when there is strong inter-node
> contention, and in that case it is typically best to fix the kernel side
> locking.
>
> Hence the question as to what lock prompted you to look at this.
I ran into the long lock queue situation when the hardware applied an
overly aggressive store-queue merge-buffer delay mechanism. See:
https://lore.kernel.org/linux-riscv/20230802164701.192791-8-guoren@kernel.org/

This also led me to consider improving the efficiency of releasing a
long lock queue. For example, if the queue is like this:

(Node0 cpu0) -> (Node1 cpu64) -> (Node0 cpu1) -> (Node1 cpu65) ->
(Node0 cpu2) -> (Node1 cpu66) -> ...

Then every mcs_unlock would cause a cross-NUMA transaction. But if we
could make the queue like this:

(Node0 cpu0) -> (Node0 cpu1) -> (Node0 cpu2) -> (Node1 cpu65) ->
(Node1 cpu66) -> (Node1 cpu64) -> ...

Only one cross-NUMA transaction is needed. Although this could cause
starvation, qspinlock.numa_spinlock_threshold_ns provides a basic
fairness guarantee.

-- 
Best Regards
 Guo Ren

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2023-08-04  1:33           ` Guo Ren
@ 2023-08-04  1:38             ` Guo Ren
  2023-08-04  8:25             ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Guo Ren @ 2023-08-04  1:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Kogan, linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Fri, Aug 4, 2023 at 9:33 AM Guo Ren <guoren@kernel.org> wrote:
>
> On Thu, Aug 3, 2023 at 7:57 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Aug 03, 2023 at 06:28:51PM +0800, Guo Ren wrote:
> > > On Thu, Aug 3, 2023 at 4:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Wed, Aug 02, 2023 at 07:14:05PM -0400, Guo Ren wrote:
> > > >
> > > > > The pv_ops is belongs to x86 custom frame work, and it prevent other
> > > > > architectures connect to the CNA spinlock.
> > > >
> > > > static_call() exists as a arch neutral variant of this.
> > > Emm... we have used static_call() in the riscv queued_spin_lock_:
> > > https://lore.kernel.org/all/20230802164701.192791-20-guoren@kernel.org/
> >
> > Yeah, I think I saw that land in the INBOX, just haven't had time to
> > look at it.
> >
> > > But we met a compile problem:
> > >
> > >   GEN     .vmlinux.objs
> > >   MODPOST Module.symvers
> > > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [arch/riscv/kvm/kvm.ko]
> > > undefined!
> > > ERROR: modpost: "__SCK__pv_queued_spin_unlock"
> > > [kernel/locking/locktorture.ko] undefined!
> > > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [mm/z3fold.ko] undefined!
> > > ERROR: modpost: "__SCK__pv_queued_spin_unlock"
> > > [fs/nfs_common/grace.ko] undefined!
> > > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/quota/quota_v1.ko] undefined!
> > > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/quota/quota_v2.ko] undefined!
> > > ERROR: modpost: "__SCK__pv_queued_spin_unlock"
> > > [fs/quota/quota_tree.ko] undefined!
> > > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/fuse/virtiofs.ko] undefined!
> > > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/dlm/dlm.ko] undefined!
> > > ERROR: modpost: "__SCK__pv_queued_spin_unlock" [fs/fscache/fscache.ko]
> > > undefined!
> > > WARNING: modpost: suppressed 839 unresolved symbol warnings because
> > > there were too many)
> > > /home/guoren/source/kernel/linux/scripts/Makefile.modpost:144: recipe
> > > for target 'Module.symvers' failed
> > >
> > > Our solution is:
> > > EXPORT_SYMBOL(__SCK__pv_queued_spin_unlock);
> > >
> > > What do you think about it?
> >
> > Could be you're not using static_call_mod() to go with
> > EXPORT_STATIC_CALL_TRAMP()
> Thx, that's what I want.
>
> >
> > > > > I'm working on riscv qspinlock on sg2042 64 cores 2/4 NUMA nodes
> > > > > platforms. Here are the patches about riscv CNA qspinlock:
> > > > > https://lore.kernel.org/linux-riscv/20230802164701.192791-19-guoren@kernel.org/
> > > > >
> > > > > What's the next plan for this patch series? I think the two-queue design
> > > > > has satisfied most platforms with two NUMA nodes.
> > > >
> > > > What has been your reason for working on CNA? What lock has been so
> > > > contended you need this?
> > > I wrote the reason here:
> > > https://lore.kernel.org/all/20230802164701.192791-1-guoren@kernel.org/
> > >
> > > The target platform is: https://www.sophon.ai/
> > >
> > > The two NUMA nodes platform has come out, so we want to measure the
> > > benefit of CNA qspinlock.
> >
> > CNA should only show a benefit when there is strong inter-node
> > contention, and in that case it is typically best to fix the kernel side
> > locking.
> >
> > Hence the question as to what lock prompted you to look at this.
> I met the long lock queue situation when the hardware gave an overly
> aggressive store queue merge buffer delay mechanism. See:
> https://lore.kernel.org/linux-riscv/20230802164701.192791-8-guoren@kernel.org/
>
> This also let me consider improving the efficiency of the long lock
> queue release. For example, if the queue is like this:
>
> (Node0 cpu0) -> (Node1 cpu64) -> (Node0 cpu1) -> (Node1 cpu65) ->
> (Node0 cpu2) -> (Node1 cpu66) -> ...
>
> Then every mcs_unlock would cause a cross-NUMA transaction. But if we
> could make the queue like this:
>
> (Node0 cpu0) -> (Node0 cpu1) -> (Node0 cpu2) -> (Node1 cpu65) ->
> (Node1 cpu66) -> (Node1 cpu64) -> ...
>
> Only one cross-NUMA transaction is needed. Although it would cause
> starvation problems, qspinlock.numa_spinlock_threshold_ns could give a
> basic guarantee.
I see it as a tradeoff between fairness and efficiency.

>
> --
> Best Regards
>  Guo Ren



-- 
Best Regards
 Guo Ren

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2021-09-22 19:25   ` Davidlohr Bueso
  2021-09-22 19:52     ` Waiman Long
@ 2023-08-04  1:49     ` Guo Ren
  1 sibling, 0 replies; 28+ messages in thread
From: Guo Ren @ 2023-08-04  1:49 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, longman,
	linux-arch, linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86,
	guohanjun, jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Thu, Sep 23, 2021 at 3:26 AM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Fri, 14 May 2021, Alex Kogan wrote:
>
> >diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> >index a816935d23d4..94d35507560c 100644
> >--- a/Documentation/admin-guide/kernel-parameters.txt
> >+++ b/Documentation/admin-guide/kernel-parameters.txt
> >@@ -3515,6 +3515,16 @@
> >                       NUMA balancing.
> >                       Allowed values are enable and disable
> >
> >+      numa_spinlock=  [NUMA, PV_OPS] Select the NUMA-aware variant
> >+                      of spinlock. The options are:
> >+                      auto - Enable this variant if running on a multi-node
> >+                      machine in native environment.
> >+                      on  - Unconditionally enable this variant.
>
> Is there any reason why the user would explicitly pass the on option
> when the auto thing already does the multi-node check? Perhaps strange
> numa topologies? Otherwise I would say it's not needed and the fewer
> options we give the user for low level locking the better.
>
> >+                      off - Unconditionally disable this variant.
> >+
> >+                      Not specifying this option is equivalent to
> >+                      numa_spinlock=auto.
> >+
> >       numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
> >                       'node', 'default' can be specified
> >                       This can be set from sysctl after boot.
> >diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> >index 0045e1b44190..819c3dad8afc 100644
> >--- a/arch/x86/Kconfig
> >+++ b/arch/x86/Kconfig
> >@@ -1564,6 +1564,26 @@ config NUMA
> >
> >         Otherwise, you should say N.
> >
> >+config NUMA_AWARE_SPINLOCKS
> >+      bool "Numa-aware spinlocks"
> >+      depends on NUMA
> >+      depends on QUEUED_SPINLOCKS
> >+      depends on 64BIT
> >+      # For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
> >+      # This is awkward, but hopefully would be resolved once static_call()
> >+      # is available.
> >+      depends on PARAVIRT_SPINLOCKS
>
> We now have static_call() - see 9183c3f9ed7.
>
>
> >+      default y
> >+      help
> >+        Introduce NUMA (Non Uniform Memory Access) awareness into
> >+        the slow path of spinlocks.
> >+
> >+        In this variant of qspinlock, the kernel will try to keep the lock
> >+        on the same node, thus reducing the number of remote cache misses,
> >+        while trading some of the short term fairness for better performance.
> >+
> >+        Say N if you want absolute first come first serve fairness.
>
> This would also need a depends on !PREEMPT_RT, no? Raw spinlocks really want
> the determinism.
I'd prefer not to force-disable it in Kconfig. Could we fold this check
into numa_spinlock=auto instead?

>
> Thanks,
> Davidlohr



-- 
Best Regards
 Guo Ren

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2023-08-04  1:33           ` Guo Ren
  2023-08-04  1:38             ` Guo Ren
@ 2023-08-04  8:25             ` Peter Zijlstra
  2023-08-04 14:17               ` Guo Ren
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2023-08-04  8:25 UTC (permalink / raw)
  To: Guo Ren
  Cc: Alex Kogan, linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Fri, Aug 04, 2023 at 09:33:48AM +0800, Guo Ren wrote:
> On Thu, Aug 3, 2023 at 7:57 PM Peter Zijlstra <peterz@infradead.org> wrote:

> > CNA should only show a benefit when there is strong inter-node
> > contention, and in that case it is typically best to fix the kernel side
> > locking.
> >
> > Hence the question as to what lock prompted you to look at this.
> I met the long lock queue situation when the hardware gave an overly
> aggressive store queue merge buffer delay mechanism. See:
> https://lore.kernel.org/linux-riscv/20230802164701.192791-8-guoren@kernel.org/

*groan*, so you're using it to work around 'broken' hardware :-(

Wouldn't that hardware have horrifically bad lock throughput anyway?
Everybody would end up waiting on that store buffer delay.

> This also let me consider improving the efficiency of the long lock
> queue release. For example, if the queue is like this:
> 
> (Node0 cpu0) -> (Node1 cpu64) -> (Node0 cpu1) -> (Node1 cpu65) ->
> (Node0 cpu2) -> (Node1 cpu66) -> ...
> 
> Then every mcs_unlock would cause a cross-NUMA transaction. But if we
> could make the queue like this:

See, this is where the ARM64 WFE would come in handy; I don't suppose
RISC-V has anything like that?

Also, by the time you have 6 waiters, I'd say the lock is terribly
contended and you should look at improving the locking scheme.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2023-08-04  8:25             ` Peter Zijlstra
@ 2023-08-04 14:17               ` Guo Ren
  2023-08-04 18:23                 ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Guo Ren @ 2023-08-04 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Kogan, linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Fri, Aug 4, 2023 at 4:26 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Aug 04, 2023 at 09:33:48AM +0800, Guo Ren wrote:
> > On Thu, Aug 3, 2023 at 7:57 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > CNA should only show a benefit when there is strong inter-node
> > > contention, and in that case it is typically best to fix the kernel side
> > > locking.
> > >
> > > Hence the question as to what lock prompted you to look at this.
> > I met the long lock queue situation when the hardware gave an overly
> > aggressive store queue merge buffer delay mechanism. See:
> > https://lore.kernel.org/linux-riscv/20230802164701.192791-8-guoren@kernel.org/
>
> *groan*, so you're using it to work around 'broken' hardware :-(
Yes, the hardware needs to be improved; it shouldn't have to depend on a
WRITE_ONCE() hack. But from another point of view, if we tell the
hardware that this is a WRITE_ONCE(), i.e. a store that should be
observed by sibling cores immediately, then the hardware could optimize
its behavior in the store queue. (All modern processors have a store
queue beyond the cache, and there is latency between the store queue
and the cache.) So:

How about adding an "st.aqrl" instruction for WRITE_ONCE(), which would
make WRITE_ONCE() an RCsc synchronization point?
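
Just to illustrate the idea (sketch only, not proposed code): plain
RISC-V stores carry no .aq/.rl bits today, so the closest existing
approximation of such an "RCsc store" is an AMO with both bits set,
discarding the old value into the zero register:

static inline void write_once_rcsc(u32 *p, u32 val)
{
	/* Sketch: an RCsc "store" via amoswap.w.aqrl, old value dropped. */
	__asm__ __volatile__ (
		"amoswap.w.aqrl	zero, %1, %0"
		: "+A" (*p)
		: "r" (val)
		: "memory");
}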

>
> Wouldn't that hardware have horrifically bad lock throughput anyway?
> Everybody would end up waiting on that store buffer delay.
So far this problem has only shown up in the lock torture case, and a
single entry left in the store buffer is enough to trigger it. We are
now stress testing userspace parallel applications widely to look for a
second case. Yes, we must treat this carefully.

>
> > This also let me consider improving the efficiency of the long lock
> > queue release. For example, if the queue is like this:
> >
> > (Node0 cpu0) -> (Node1 cpu64) -> (Node0 cpu1) -> (Node1 cpu65) ->
> > (Node0 cpu2) -> (Node1 cpu66) -> ...
> >
> > Then every mcs_unlock would cause a cross-NUMA transaction. But if we
> > could make the queue like this:
>
> See, this is where the ARM64 WFE would come in handy; I don't suppose
> RISC-V has anything like that?
Em... arm64 smp_cond_load can only save power or release the pipeline
resources of an SMT processor. When (Node1 cpu64) is in the WFE state,
it still needs (Node0 cpu1) to write the value to deliver the
cross-NUMA signal. So I don't see how WFE is related to reducing
cross-NUMA transactions, or maybe I missed something. Sorry

>
> Also, by the time you have 6 waiters, I'd say the lock is terribly
> contended and you should look at improving the lockinh scheme.
--
Best Regards
 Guo Ren

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2023-08-04 14:17               ` Guo Ren
@ 2023-08-04 18:23                 ` Peter Zijlstra
  2023-08-05  0:19                   ` Guo Ren
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2023-08-04 18:23 UTC (permalink / raw)
  To: Guo Ren
  Cc: Alex Kogan, linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Fri, Aug 04, 2023 at 10:17:35AM -0400, Guo Ren wrote:

> > See, this is where the ARM64 WFE would come in handy; I don't suppose
> > RISC-V has anything like that?
> Em... arm64 smp_cond_load only could save power consumption or release
> the pipeline resources of an SMT processor. When (Node1 cpu64) is in
> the WFE state, it still needs (Node0 cpu1) to write the value to give
> a cross-NUMA signal. So I didn't see what WFE related to reducing
> cross-Numa transactions, or I missed something. Sorry

The benefit is that WFE significantly reduces the memory traffic. Since
it 'suspends' the core and waits for a write-notification instead of
busy polling the memory location you get a ton less loads.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2023-08-04 18:23                 ` Peter Zijlstra
@ 2023-08-05  0:19                   ` Guo Ren
  0 siblings, 0 replies; 28+ messages in thread
From: Guo Ren @ 2023-08-05  0:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Kogan, linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Fri, Aug 4, 2023 at 2:24 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Aug 04, 2023 at 10:17:35AM -0400, Guo Ren wrote:
>
> > > See, this is where the ARM64 WFE would come in handy; I don't suppose
> > > RISC-V has anything like that?
> > Em... arm64 smp_cond_load only could save power consumption or release
> > the pipeline resources of an SMT processor. When (Node1 cpu64) is in
> > the WFE state, it still needs (Node0 cpu1) to write the value to give
> > a cross-NUMA signal. So I didn't see what WFE related to reducing
> > cross-Numa transactions, or I missed something. Sorry
>
> The benefit is that WFE significantly reduces the memory traffic. Since
> it 'suspends' the core and waits for a write-notification instead of
> busy polling the memory location you get a ton less loads.
Em... I had a different observation: when a long lock queue appeared
because of the store buffer delay problem in the lock torture test, we
observed all interconnects go quiet, with no more memory traffic. All
the cores were loop-loading "different" cachelines from their own L1
cache, which is what queued_spinlock arranges. So I don't see any
memory traffic on the bus.

As for LL + WFE: AFAIK, LL is a load instruction that grabs the
cacheline from the bus into the L1 cache and sets the reservation set
(arm calls it the exclusive monitor). If any cacheline invalidation
request (readunique/cleanunique/...) comes in, WFE retires and the
reservation set is cleared. So from a cacheline perspective, there is
no difference between "LL+WFE" and looping loads.

Let's look at two LL+WFE scenarios: multiple cores, and multiple
hardware threads on one core:
 - In the multi-core case, WFE doesn't give any more benefit than loop
loading, from my perspective. The only thing WFE can do is "suspend the
core" (to borrow your word), but it can't be a deep sleep, because a
fast response out of WFE is the top priority. As you said, we should
prevent "terribly contended" situations, so WFE must keep reacting
quickly in the pipeline rather than sleeping deeply; that's WFI's job.
And loop loading can also reduce power consumption with a proper
micro-architecture design: when the pipeline enters a loop-loading
state, the loop buffer mechanism kicks in, no instruction fetch
happens, the frontend can suspend for a while, and the only working
components are the "loop buffer" and the "LSU load path." Other
components can suspend for a while as well. So loop loading is not as
terrible as you thought.

 - In the multi-threading case (SMT on one core), introducing an ISA
instruction (WFE) to solve the loop-loading problem is worthwhile,
because the waiting thread can release the processor's pipeline
resources.


>


--
Best Regards
 Guo Ren

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-08-05  0:19 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-14 20:07 [PATCH v15 0/6] Add NUMA-awareness to qspinlock Alex Kogan
2021-05-14 20:07 ` [PATCH v15 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
2021-05-14 20:07 ` [PATCH v15 2/6] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
2021-05-14 20:07 ` [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
2021-09-22 19:25   ` Davidlohr Bueso
2021-09-22 19:52     ` Waiman Long
2023-08-04  1:49     ` Guo Ren
2021-09-30 10:05   ` Barry Song
2023-08-02 23:14   ` Guo Ren
2023-08-03  8:50     ` Peter Zijlstra
2023-08-03 10:28       ` Guo Ren
2023-08-03 11:56         ` Peter Zijlstra
2023-08-04  1:33           ` Guo Ren
2023-08-04  1:38             ` Guo Ren
2023-08-04  8:25             ` Peter Zijlstra
2023-08-04 14:17               ` Guo Ren
2023-08-04 18:23                 ` Peter Zijlstra
2023-08-05  0:19                   ` Guo Ren
2021-05-14 20:07 ` [PATCH v15 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
2021-05-14 20:07 ` [PATCH v15 5/6] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA Alex Kogan
2021-05-14 20:07 ` [PATCH v15 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA Alex Kogan
2021-09-30  9:44 ` [PATCH v15 0/6] Add NUMA-awareness to qspinlock Barry Song
2021-09-30 16:58   ` Waiman Long
2021-09-30 22:57     ` Barry Song
2021-09-30 23:51   ` Alex Kogan
2021-12-13 20:37 ` Alex Kogan
2021-12-15 15:13   ` Alex Kogan
2022-04-11 17:09 ` Alex Kogan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).