* [PATCH v14 0/6] Add NUMA-awareness to qspinlock
@ 2021-04-01 15:31 Alex Kogan
  2021-04-01 15:31 ` [PATCH v14 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Alex Kogan @ 2021-04-01 15:31 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Change from v13:
----------------

Fix regression in stress-ng.
Reported-by: kernel test robot <oliver.sang@intel.com>

The fix is to move common-case stores into the 'partial_order' field
of the queue node from cna_wait_head_or_lock() (where those stores
can cause a cache miss during lock handover) to cna_init_node().
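
Doing those stores in cna_init_node() takes them off the lock handoff
critical path, since that function runs before the thread starts
spinning for the lock.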

Summary
-------

Lock throughput can be increased by handing a lock to a waiter on the
same NUMA node as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA nodes. This patch introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock. It is
enabled through a configuration option (NUMA_AWARE_SPINLOCKS).

CNA is a NUMA-aware version of the MCS lock. Spinning threads are
organized in two queues, a primary queue for threads running on the same
node as the current lock holder, and a secondary queue for threads
running on other nodes. Threads store the ID of the node on which
they are running in their queue nodes. After acquiring the MCS lock and
before acquiring the spinlock, the MCS lock holder checks whether the next
waiter in the primary queue (if one exists) is running on the same NUMA node.
If it is not, that waiter is detached from the main queue and moved into
the tail of the secondary queue. This way, we gradually filter the primary
queue, leaving only waiters running on the same preferred NUMA node. Note
that certain prioritized waiters (e.g., in irq and nmi contexts) are
excluded from being moved to the secondary queue. We change the NUMA node
preference after a waiter at the head of the secondary queue spins for a
certain amount of time. We do that by flushing the secondary queue into
the head of the primary queue, effectively changing the preference to the
NUMA node of the waiter at the head of the secondary queue at the time of
the flush.

More details are available at https://arxiv.org/abs/1810.05600.
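
To illustrate, here is a simplified, self-contained sketch of the
filtering step performed by the lock holder (the names here are
illustrative; the actual implementation is in patch 3 and additionally
handles tail encoding, memory ordering and prioritized waiters):

	struct waiter {
		struct waiter *next;
		int numa_node;
	};

	/*
	 * If the next primary-queue waiter is remote and has a waiter
	 * behind it, detach it from the primary queue and append it to
	 * the circular secondary queue.
	 */
	static void filter_next(struct waiter *me, struct waiter **tail_2nd)
	{
		struct waiter *next = me->next;

		if (next && next->next && next->numa_node != me->numa_node) {
			me->next = next->next;	/* detach from primary */
			if (*tail_2nd) {
				/* append after the current secondary tail */
				next->next = (*tail_2nd)->next;
				(*tail_2nd)->next = next;
			} else {
				/* create the circular secondary queue */
				next->next = next;
			}
			*tail_2nd = next;
		}
	}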

We have done some performance evaluation with the locktorture module
as well as with several benchmarks from the will-it-scale repo.
The following locktorture results are from an Oracle X5-4 server
(four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
cores each). Each number represents an average (over 25 runs) of the
total number of ops (x10^7) reported at the end of each run. The
standard deviation is reported in parentheses, and in general is
about 3% of the average. The 'stock' kernel is v5.12.0-rc5,
commit 99d4e8b4e609b, compiled in the default configuration.
'CNA' is the modified kernel with NUMA_AWARE_SPINLOCKS set.
The speedup is calculated by dividing the result of 'CNA' by the
result achieved with 'stock'.

#thr  stock (stddev)  CNA (stddev) / speedup
  1  2.675 (0.088) 2.707 (0.100) / 1.012
  2  2.781 (0.172) 2.790 (0.148) / 1.003
  4  4.314 (0.115) 4.403 (0.205) / 1.021
  8  5.085 (0.140) 7.292 (0.210) / 1.434
 16  5.846 (0.114) 9.340 (0.186) / 1.598
 32  6.282 (0.127) 10.074 (0.197) / 1.603
 36  6.382 (0.144) 10.243 (0.198) / 1.605
 72  6.175 (0.086) 10.482 (0.179) / 1.698
108  6.037 (0.073) 10.354 (0.164) / 1.715
142  5.796 (0.072) 10.275 (0.193) / 1.773

The following tables contain throughput results (ops/us) from the same
setup for will-it-scale/open1_threads: 

#thr  stock (stddev)  CNA (stddev) / speedup
  1  0.515 (0.002) 0.514 (0.001) / 0.999
  2  0.774 (0.018) 0.780 (0.019) / 1.009
  4  1.408 (0.030) 1.421 (0.034) / 1.009
  8  1.785 (0.082) 1.724 (0.115) / 0.965
 16  1.878 (0.118) 1.775 (0.114) / 0.945
 32  0.914 (0.072) 1.755 (0.105) / 1.920
 36  0.877 (0.064) 1.738 (0.100) / 1.981
 72  0.810 (0.047) 1.658 (0.086) / 2.047
108  0.835 (0.036) 1.756 (0.091) / 2.103
142  0.809 (0.038) 1.766 (0.069) / 2.184

and will-it-scale/lock2_threads:

#thr  stock (stddev)  CNA (stddev) / speedup
  1  1.605 (0.003) 1.606 (0.005) / 1.000
  2  2.814 (0.077) 2.793 (0.068) / 0.992
  4  5.318 (0.352) 5.463 (0.357) / 1.027
  8  4.241 (0.265) 4.025 (0.366) / 0.949
 16  4.125 (0.124) 3.914 (0.159) / 0.949
 32  2.458 (0.113) 3.950 (0.131) / 1.607
 36  2.457 (0.075) 3.924 (0.080) / 1.597
 72  1.836 (0.102) 3.909 (0.104) / 2.129
108  1.868 (0.098) 3.871 (0.106) / 2.072
142  1.780 (0.124) 3.894 (0.089) / 2.188

Our evaluation shows that CNA also improves the performance of user
applications that have hot pthread mutexes. Those mutexes are
blocking, and waiting threads park and unpark via the futex
mechanism in the kernel. Given that kernel futex chains, which
are hashed by the mutex address, are each protected by a
chain-specific spin lock, the contention on a user-mode mutex
translates into contention on a kernel-level spinlock.
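
As an illustration of this workload class (not the benchmark we ran),
a minimal user-space loop stressing a single hot mutex could look as
follows; under contention, every lock/unlock pair goes through
futex(2), and hence through the hash-bucket spinlock:

	#include <pthread.h>

	static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
	static unsigned long ops;

	static void *worker(void *arg)
	{
		(void)arg;
		for (;;) {
			pthread_mutex_lock(&m);   /* may block in futex_wait */
			ops++;                    /* short critical section */
			pthread_mutex_unlock(&m); /* may call futex_wake */
		}
		return NULL;
	}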

Here are the throughput results (ops/us) for the leveldb 'readrandom'
benchmark:

#thr  stock (stddev)  CNA (stddev) / speedup
  1  0.493 (0.034) 0.520 (0.026) / 1.053
  2  0.796 (0.046) 0.838 (0.033) / 1.053
  4  1.172 (0.076) 1.211 (0.043) / 1.033
  8  1.215 (0.096) 1.207 (0.116) / 0.993
 16  0.986 (0.136) 1.070 (0.129) / 1.085
 32  0.756 (0.036) 1.148 (0.021) / 1.519
 36  0.701 (0.032) 1.147 (0.017) / 1.636
 72  0.625 (0.016) 1.149 (0.024) / 1.839
108  0.612 (0.015) 1.146 (0.014) / 1.872
142  0.606 (0.013) 1.131 (0.025) / 1.867

Further comments are welcome and appreciated.

Alex Kogan (6):
  locking/qspinlock: Rename mcs lock/unlock macros and make them more
    generic
  locking/qspinlock: Refactor the qspinlock slow path
  locking/qspinlock: Introduce CNA into the slow path of qspinlock
  locking/qspinlock: Introduce starvation avoidance into CNA
  locking/qspinlock: Avoid moving certain threads between waiting queues
    in CNA
  locking/qspinlock: Introduce the shuffle reduction optimization into
    CNA

 .../admin-guide/kernel-parameters.txt         |  19 +
 arch/arm/include/asm/mcs_spinlock.h           |   6 +-
 arch/x86/Kconfig                              |  20 +
 arch/x86/include/asm/qspinlock.h              |   4 +
 arch/x86/kernel/alternative.c                 |   4 +
 include/asm-generic/mcs_spinlock.h            |   4 +-
 kernel/locking/mcs_spinlock.h                 |  20 +-
 kernel/locking/qspinlock.c                    |  82 +++-
 kernel/locking/qspinlock_cna.h                | 455 ++++++++++++++++++
 kernel/locking/qspinlock_paravirt.h           |   2 +-
 10 files changed, 593 insertions(+), 23 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

-- 
2.24.3 (Apple Git-128)



* [PATCH v14 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic
  2021-04-01 15:31 [PATCH v14 0/6] Add NUMA-awareness to qspinlock Alex Kogan
@ 2021-04-01 15:31 ` Alex Kogan
  2021-04-01 15:31 ` [PATCH v14 2/6] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Alex Kogan @ 2021-04-01 15:31 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

The mcs unlock macro (arch_mcs_lock_handoff) should accept the value to be
stored into the lock argument as another argument. This allows using the
same macro in cases where the value to be stored when passing the lock is
different from 1.
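
For example, the CNA variant introduced later in this series passes the
encoded tail of its secondary queue through this argument, calling
arch_mcs_lock_handoff(&next->locked, val) with val > 1.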

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 arch/arm/include/asm/mcs_spinlock.h |  6 +++---
 include/asm-generic/mcs_spinlock.h  |  4 ++--
 kernel/locking/mcs_spinlock.h       | 18 +++++++++---------
 kernel/locking/qspinlock.c          |  4 ++--
 kernel/locking/qspinlock_paravirt.h |  2 +-
 5 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf4d06f..1eb4d733459c 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -6,7 +6,7 @@
 #include <asm/spinlock.h>
 
 /* MCS spin-locking. */
-#define arch_mcs_spin_lock_contended(lock)				\
+#define arch_mcs_spin_wait(lock)					\
 do {									\
 	/* Ensure prior stores are observed before we enter wfe. */	\
 	smp_mb();							\
@@ -14,9 +14,9 @@ do {									\
 		wfe();							\
 } while (0)								\
 
-#define arch_mcs_spin_unlock_contended(lock)				\
+#define arch_mcs_lock_handoff(lock, val)				\
 do {									\
-	smp_store_release(lock, 1);					\
+	smp_store_release((lock), (val));				\
 	dsb_sev();							\
 } while (0)
 
diff --git a/include/asm-generic/mcs_spinlock.h b/include/asm-generic/mcs_spinlock.h
index 10cd4ffc6ba2..f933d99c63e0 100644
--- a/include/asm-generic/mcs_spinlock.h
+++ b/include/asm-generic/mcs_spinlock.h
@@ -4,8 +4,8 @@
 /*
  * Architectures can define their own:
  *
- *   arch_mcs_spin_lock_contended(l)
- *   arch_mcs_spin_unlock_contended(l)
+ *   arch_mcs_spin_wait(l)
+ *   arch_mcs_lock_handoff(l, val)
  *
  * See kernel/locking/mcs_spinlock.c.
  */
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153b4d3c..904ba5d0f3f4 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -21,7 +21,7 @@ struct mcs_spinlock {
 	int count;  /* nesting count, see qspinlock.c */
 };
 
-#ifndef arch_mcs_spin_lock_contended
+#ifndef arch_mcs_spin_wait
 /*
  * Using smp_cond_load_acquire() provides the acquire semantics
  * required so that subsequent operations happen after the
@@ -29,20 +29,20 @@ struct mcs_spinlock {
  * ARM64 would like to do spin-waiting instead of purely
  * spinning, and smp_cond_load_acquire() provides that behavior.
  */
-#define arch_mcs_spin_lock_contended(l)					\
-do {									\
-	smp_cond_load_acquire(l, VAL);					\
+#define arch_mcs_spin_wait(l)					\
+do {								\
+	smp_cond_load_acquire(l, VAL);				\
 } while (0)
 #endif
 
-#ifndef arch_mcs_spin_unlock_contended
+#ifndef arch_mcs_lock_handoff
 /*
  * smp_store_release() provides a memory barrier to ensure all
  * operations in the critical section has been completed before
  * unlocking.
  */
-#define arch_mcs_spin_unlock_contended(l)				\
-	smp_store_release((l), 1)
+#define arch_mcs_lock_handoff(l, val)				\
+	smp_store_release((l), (val))
 #endif
 
 /*
@@ -91,7 +91,7 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	WRITE_ONCE(prev->next, node);
 
 	/* Wait until the lock holder passes the lock down. */
-	arch_mcs_spin_lock_contended(&node->locked);
+	arch_mcs_spin_wait(&node->locked);
 }
 
 /*
@@ -115,7 +115,7 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	}
 
 	/* Pass lock to next waiter. */
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_lock_handoff(&next->locked, 1);
 }
 
 #endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cbff6ba53d56..435d696f9250 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -471,7 +471,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
-		arch_mcs_spin_lock_contended(&node->locked);
+		arch_mcs_spin_wait(&node->locked);
 
 		/*
 		 * While waiting for the MCS lock, the next pointer may have
@@ -550,7 +550,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_lock_handoff(&next->locked, 1);
 	pv_kick_node(lock, next);
 
 release:
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e84d21aa0722..619d80fd5ea8 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -368,7 +368,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	 *
 	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
 	 *
-	 * The write to next->locked in arch_mcs_spin_unlock_contended()
+	 * The write to next->locked in arch_mcs_lock_handoff()
 	 * must be ordered before the read of pn->state in the cmpxchg()
 	 * below for the code to work correctly. To guarantee full ordering
 	 * irrespective of the success or failure of the cmpxchg(),
-- 
2.24.3 (Apple Git-128)



* [PATCH v14 2/6] locking/qspinlock: Refactor the qspinlock slow path
  2021-04-01 15:31 [PATCH v14 0/6] Add NUMA-awareness to qspinlock Alex Kogan
  2021-04-01 15:31 ` [PATCH v14 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
@ 2021-04-01 15:31 ` Alex Kogan
  2021-04-01 15:31 ` [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Alex Kogan @ 2021-04-01 15:31 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Move some of the code manipulating the spin lock into separate functions.
This would allow easier integration of alternative ways to manipulate
that lock.
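
These functions become override points: as with the paravirt variant,
the slow path can then be re-generated by re-including qspinlock.c with
the macros redefined, which is the mechanism the next patch uses, e.g.:

	#undef try_clear_tail
	#define try_clear_tail		cna_try_clear_tail

	#undef mcs_lock_handoff
	#define mcs_lock_handoff	cna_lock_handoff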

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 435d696f9250..e3518709ffdc 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -289,6 +289,34 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
 #endif
 
+/*
+ * __try_clear_tail - try to clear tail by setting the lock value to
+ * _Q_LOCKED_VAL.
+ * @lock: Pointer to the queued spinlock structure
+ * @val: Current value of the lock
+ * @node: Pointer to the MCS node of the lock holder
+ */
+static __always_inline bool __try_clear_tail(struct qspinlock *lock,
+					     u32 val,
+					     struct mcs_spinlock *node)
+{
+	return atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL);
+}
+
+/*
+ * __mcs_lock_handoff - pass the MCS lock to the next waiter
+ * @node: Pointer to the MCS node of the lock holder
+ * @next: Pointer to the MCS node of the first waiter in the MCS queue
+ */
+static __always_inline void __mcs_lock_handoff(struct mcs_spinlock *node,
+					       struct mcs_spinlock *next)
+{
+	arch_mcs_lock_handoff(&next->locked, 1);
+}
+
+#define try_clear_tail		__try_clear_tail
+#define mcs_lock_handoff	__mcs_lock_handoff
+
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
 /**
@@ -533,7 +561,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *       PENDING will make the uncontended transition fail.
 	 */
 	if ((val & _Q_TAIL_MASK) == tail) {
-		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+		if (try_clear_tail(lock, val, node))
 			goto release; /* No contention */
 	}
 
@@ -550,7 +578,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_lock_handoff(&next->locked, 1);
+	mcs_lock_handoff(node, next);
 	pv_kick_node(lock, next);
 
 release:
@@ -575,6 +603,12 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #undef pv_kick_node
 #undef pv_wait_head_or_lock
 
+#undef try_clear_tail
+#define try_clear_tail		__try_clear_tail
+
+#undef mcs_lock_handoff
+#define mcs_lock_handoff			__mcs_lock_handoff
+
 #undef  queued_spin_lock_slowpath
 #define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
 
-- 
2.24.3 (Apple Git-128)



* [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2021-04-01 15:31 [PATCH v14 0/6] Add NUMA-awareness to qspinlock Alex Kogan
  2021-04-01 15:31 ` [PATCH v14 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
  2021-04-01 15:31 ` [PATCH v14 2/6] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
@ 2021-04-01 15:31 ` Alex Kogan
  2021-04-13 11:30   ` Peter Zijlstra
  2021-04-01 15:31 ` [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Alex Kogan @ 2021-04-01 15:31 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

In CNA, spinning threads are organized in two queues, a primary queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. After acquiring the
MCS lock and before acquiring the spinlock, the MCS lock
holder checks whether the next waiter in the primary queue (if one exists) is
running on the same NUMA node. If it is not, that waiter is detached from
the main queue and moved into the tail of the secondary queue. This way,
we gradually filter the primary queue, leaving only waiters running on
the same preferred NUMA node. For more details, see
https://arxiv.org/abs/1810.05600.

Note that this variant of CNA may introduce starvation by continuously
passing the lock between waiters in the main queue. This issue will be
addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at
boot time only if we run on a multi-node machine in a native environment and
the new config is enabled. (For the time being, the patching requires
CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
resolved once static_call() is available.) This default behavior can be
overridden with the new kernel boot command-line option
"numa_spinlock=on/off" (default is "auto").

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 .../admin-guide/kernel-parameters.txt         |  10 +
 arch/x86/Kconfig                              |  20 ++
 arch/x86/include/asm/qspinlock.h              |   4 +
 arch/x86/kernel/alternative.c                 |   4 +
 kernel/locking/mcs_spinlock.h                 |   2 +-
 kernel/locking/qspinlock.c                    |  42 ++-
 kernel/locking/qspinlock_cna.h                | 330 ++++++++++++++++++
 7 files changed, 407 insertions(+), 5 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..ace55afd4441 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3475,6 +3475,16 @@
 	numa_balancing=	[KNL,X86] Enable or disable automatic NUMA balancing.
 			Allowed values are enable and disable
 
+	numa_spinlock=	[NUMA, PV_OPS] Select the NUMA-aware variant
+			of spinlock. The options are:
+			auto - Enable this variant if running on a multi-node
+			machine in native environment.
+			on  - Unconditionally enable this variant.
+			off - Unconditionally disable this variant.
+
+			Not specifying this option is equivalent to
+			numa_spinlock=auto.
+
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
 			'node', 'default' can be specified
 			This can be set from sysctl after boot.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..c3b47077d7e4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1558,6 +1558,26 @@ config NUMA
 
 	  Otherwise, you should say N.
 
+config NUMA_AWARE_SPINLOCKS
+	bool "Numa-aware spinlocks"
+	depends on NUMA
+	depends on QUEUED_SPINLOCKS
+	depends on 64BIT
+	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
+	# This is awkward, but hopefully would be resolved once static_call()
+	# is available.
+	depends on PARAVIRT_SPINLOCKS
+	default y
+	help
+	  Introduce NUMA (Non Uniform Memory Access) awareness into
+	  the slow path of spinlocks.
+
+	  In this variant of qspinlock, the kernel will try to keep the lock
+	  on the same node, thus reducing the number of remote cache misses,
+	  while trading some of the short term fairness for better performance.
+
+	  Say N if you want absolute first come first serve fairness.
+
 config AMD_NUMA
 	def_bool y
 	prompt "Old style AMD Opteron NUMA detection"
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index d86ab942219c..21d09e8db979 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -27,6 +27,10 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
 	return val;
 }
 
+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+extern void cna_configure_spin_lock_slowpath(void);
+#endif
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8d778e46725d..0178ff898d8a 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -741,6 +741,10 @@ void __init alternative_instructions(void)
 	}
 #endif
 
+#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+	cna_configure_spin_lock_slowpath();
+#endif
+
 	apply_paravirt(__parainstructions, __parainstructions_end);
 
 	restart_nmi();
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 904ba5d0f3f4..5e47ffb3f08b 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,7 +17,7 @@
 
 struct mcs_spinlock {
 	struct mcs_spinlock *next;
-	int locked; /* 1 if lock acquired */
+	unsigned int locked; /* 1 if lock acquired */
 	int count;  /* nesting count, see qspinlock.c */
 };
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index e3518709ffdc..8c1a21b53913 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -11,7 +11,7 @@
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
 
 #include <linux/smp.h>
 #include <linux/bug.h>
@@ -71,7 +71,8 @@
 /*
  * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
  * size and four of them will fit nicely in one 64-byte cacheline. For
- * pvqspinlock, however, we need more space for extra data. To accommodate
+ * pvqspinlock, however, we need more space for extra data. The same also
+ * applies for the NUMA-aware variant of spinlocks (CNA). To accommodate
  * that, we insert two more long words to pad it up to 32 bytes. IOW, only
  * two of them can fit in a cacheline in this case. That is OK as it is rare
  * to have more than 2 levels of slowpath nesting in actual use. We don't
@@ -80,7 +81,7 @@
  */
 struct qnode {
 	struct mcs_spinlock mcs;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 	long reserved[2];
 #endif
 };
@@ -104,6 +105,8 @@ struct qnode {
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
  *
  * PV doubles the storage and uses the second cacheline for PV state.
+ * CNA also doubles the storage and uses the second cacheline for
+ * CNA-specific state.
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
@@ -317,7 +320,7 @@ static __always_inline void __mcs_lock_handoff(struct mcs_spinlock *node,
 #define try_clear_tail		__try_clear_tail
 #define mcs_lock_handoff	__mcs_lock_handoff
 
-#endif /* _GEN_PV_LOCK_SLOWPATH */
+#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
 
 /**
  * queued_spin_lock_slowpath - acquire the queued spinlock
@@ -589,6 +592,37 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
+/*
+ * Generate the code for NUMA-aware spinlocks
+ */
+#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+#define _GEN_CNA_LOCK_SLOWPATH
+
+#undef pv_init_node
+#define pv_init_node			cna_init_node
+
+#undef pv_wait_head_or_lock
+#define pv_wait_head_or_lock		cna_wait_head_or_lock
+
+#undef try_clear_tail
+#define try_clear_tail			cna_try_clear_tail
+
+#undef mcs_lock_handoff
+#define mcs_lock_handoff			cna_lock_handoff
+
+#undef  queued_spin_lock_slowpath
+/*
+ * defer defining queued_spin_lock_slowpath until after the include to
+ * avoid a name clash with the identically named field in pv_ops.lock
+ * (see cna_configure_spin_lock_slowpath())
+ */
+#include "qspinlock_cna.h"
+#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
+
+#include "qspinlock.c"
+
+#endif
+
 /*
  * Generate the paravirt code for queued_spin_unlock_slowpath().
  */
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
new file mode 100644
index 000000000000..d689861a7b3d
--- /dev/null
+++ b/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,330 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+#include <linux/topology.h>
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a primary queue for
+ * threads running on the same NUMA node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. Schematically, it
+ * looks like this:
+ *
+ *    cna_node
+ *   +----------+     +--------+         +--------+
+ *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
+ *   |mcs:locked| -.  +--------+         +--------+
+ *   +----------+  |
+ *                 `----------------------.
+ *                                        v
+ *                 +--------+         +--------+
+ *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
+ *                 +--------+         +--------+
+ *                     ^                    |
+ *                     `--------------------'
+ *
+ * N.B. locked := 1 if secondary queue is absent. Otherwise, it contains the
+ * encoded pointer to the tail of the secondary queue, which is organized as a
+ * circular list.
+ *
+ * After acquiring the MCS lock and before acquiring the spinlock, the MCS lock
+ * holder checks whether the next waiter in the primary queue (if one exists)
+ * running on the same NUMA node. If it is not, that waiter is detached from the
+ * main queue and moved into the tail of the secondary queue. This way, we
+ * gradually filter the primary queue, leaving only waiters running on the same
+ * preferred NUMA node.
+ *
+ * For more details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <alex.kogan@oracle.com>
+ *          Dave Dice <dave.dice@oracle.com>
+ */
+
+struct cna_node {
+	struct mcs_spinlock	mcs;
+	u16			numa_node;
+	u16			real_numa_node;
+	u32			encoded_tail;	/* self */
+	u32			partial_order;	/* enum val */
+};
+
+enum {
+	LOCAL_WAITER_FOUND,
+	LOCAL_WAITER_NOT_FOUND,
+};
+
+static void __init cna_init_nodes_per_cpu(unsigned int cpu)
+{
+	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
+	int numa_node = cpu_to_node(cpu);
+	int i;
+
+	for (i = 0; i < MAX_NODES; i++) {
+		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
+
+		cn->real_numa_node = numa_node;
+		cn->encoded_tail = encode_tail(cpu, i);
+		/*
+		 * make sure @encoded_tail is not confused with other valid
+		 * values for @locked (0 or 1)
+		 */
+		WARN_ON(cn->encoded_tail <= 1);
+	}
+}
+
+static int __init cna_init_nodes(void)
+{
+	unsigned int cpu;
+
+	/*
+	 * this will break on 32bit architectures, so we restrict
+	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
+	 */
+	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+	/* we store an encoded tail word in the node's @locked field */
+	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
+
+	for_each_possible_cpu(cpu)
+		cna_init_nodes_per_cpu(cpu);
+
+	return 0;
+}
+
+static __always_inline void cna_init_node(struct mcs_spinlock *node)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+
+	cn->numa_node = cn->real_numa_node;
+	cn->partial_order = LOCAL_WAITER_FOUND;
+}
+
+/*
+ * cna_splice_head -- splice the entire secondary queue onto the head of the
+ * primary queue.
+ *
+ * Returns the new primary head node or NULL on failure.
+ */
+static struct mcs_spinlock *
+cna_splice_head(struct qspinlock *lock, u32 val,
+		struct mcs_spinlock *node, struct mcs_spinlock *next)
+{
+	struct mcs_spinlock *head_2nd, *tail_2nd;
+	u32 new;
+
+	tail_2nd = decode_tail(node->locked);
+	head_2nd = tail_2nd->next;
+
+	if (next) {
+		/*
+		 * If the primary queue is not empty, the primary tail doesn't
+		 * need to change and we can simply link the secondary tail to
+		 * the old primary head.
+		 */
+		tail_2nd->next = next;
+	} else {
+		/*
+		 * When the primary queue is empty, the secondary tail becomes
+		 * the primary tail.
+		 */
+
+		/*
+		 * Speculatively break the secondary queue's circular link such
+		 * that when the secondary tail becomes the primary tail it all
+		 * works out.
+		 */
+		tail_2nd->next = NULL;
+
+		/*
+		 * tail_2nd->next = NULL;	old = xchg_tail(lock, tail);
+		 *				prev = decode_tail(old);
+		 * try_cmpxchg_release(...);	WRITE_ONCE(prev->next, node);
+		 *
+		 * If the following cmpxchg() succeeds, our stores will not
+		 * collide.
+		 */
+		new = ((struct cna_node *)tail_2nd)->encoded_tail |
+			_Q_LOCKED_VAL;
+		if (!atomic_try_cmpxchg_release(&lock->val, &val, new)) {
+			/* Restore the secondary queue's circular link. */
+			tail_2nd->next = head_2nd;
+			return NULL;
+		}
+	}
+
+	/* The primary queue head now is what was the secondary queue head. */
+	return head_2nd;
+}
+
+static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
+				      struct mcs_spinlock *node)
+{
+	/*
+	 * We're here because the primary queue is empty; check the secondary
+	 * queue for remote waiters.
+	 */
+	if (node->locked > 1) {
+		struct mcs_spinlock *next;
+
+		/*
+		 * When there are waiters on the secondary queue, try to move
+		 * them back onto the primary queue and let them rip.
+		 */
+		next = cna_splice_head(lock, val, node, NULL);
+		if (next) {
+			arch_mcs_lock_handoff(&next->locked, 1);
+			return true;
+		}
+
+		return false;
+	}
+
+	/* Both queues are empty. Do what MCS does. */
+	return __try_clear_tail(lock, val, node);
+}
+
+/*
+ * cna_splice_tail -- splice the next node from the primary queue onto
+ * the secondary queue.
+ */
+static void cna_splice_next(struct mcs_spinlock *node,
+			    struct mcs_spinlock *next,
+			    struct mcs_spinlock *nnext)
+{
+	/* remove 'next' from the main queue */
+	node->next = nnext;
+
+	/* stick `next` on the secondary queue tail */
+	if (node->locked <= 1) { /* if secondary queue is empty */
+		/* create secondary queue */
+		next->next = next;
+	} else {
+		/* add to the tail of the secondary queue */
+		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
+		struct mcs_spinlock *head_2nd = tail_2nd->next;
+
+		tail_2nd->next = next;
+		next->next = head_2nd;
+	}
+
+	node->locked = ((struct cna_node *)next)->encoded_tail;
+}
+
+/*
+ * cna_order_queue - check whether the next waiter in the main queue is on
+ * the same NUMA node as the lock holder; if not, and it has a waiter behind
+ * it in the main queue, move the former onto the secondary queue.
+ */
+static void cna_order_queue(struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *next = READ_ONCE(node->next);
+	struct cna_node *cn = (struct cna_node *)node;
+	int numa_node, next_numa_node;
+
+	if (!next) {
+		cn->partial_order = LOCAL_WAITER_NOT_FOUND;
+		return;
+	}
+
+	numa_node = cn->numa_node;
+	next_numa_node = ((struct cna_node *)next)->numa_node;
+
+	if (next_numa_node != numa_node) {
+		struct mcs_spinlock *nnext = READ_ONCE(next->next);
+
+		if (nnext) {
+			cna_splice_next(node, next, nnext);
+			next = nnext;
+		}
+		/*
+		 * Inherit NUMA node id of primary queue, to maintain the
+		 * preference even if the next waiter is on a different node.
+		 */
+		((struct cna_node *)next)->numa_node = numa_node;
+	}
+}
+
+/* Abuse the pv_wait_head_or_lock() hook to get some work done */
+static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
+						 struct mcs_spinlock *node)
+{
+	/*
+	 * Try and put the time otherwise spent spin waiting on
+	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+	 */
+	cna_order_queue(node);
+
+	return 0; /* we lied; we didn't wait, go do so now */
+}
+
+static inline void cna_lock_handoff(struct mcs_spinlock *node,
+				 struct mcs_spinlock *next)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+	u32 val = 1;
+
+	u32 partial_order = cn->partial_order;
+
+	if (partial_order == LOCAL_WAITER_NOT_FOUND)
+		cna_order_queue(node);
+
+	/*
+	 * We have a local waiter, either a real or a fake one;
+	 * reload @next in case it was changed by cna_order_queue().
+	 */
+	next = node->next;
+	if (node->locked > 1)
+		val = node->locked;	/* preserve secondary queue */
+
+	arch_mcs_lock_handoff(&next->locked, val);
+}
+
+/*
+ * Constant (boot-param configurable) flag selecting the NUMA-aware variant
+ * of spinlock.  Possible values: -1 (off) / 0 (auto, default) / 1 (on).
+ */
+static int numa_spinlock_flag;
+
+static int __init numa_spinlock_setup(char *str)
+{
+	if (!strcmp(str, "auto")) {
+		numa_spinlock_flag = 0;
+		return 1;
+	} else if (!strcmp(str, "on")) {
+		numa_spinlock_flag = 1;
+		return 1;
+	} else if (!strcmp(str, "off")) {
+		numa_spinlock_flag = -1;
+		return 1;
+	}
+
+	return 0;
+}
+__setup("numa_spinlock=", numa_spinlock_setup);
+
+void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/*
+ * Switch to the NUMA-friendly slow path for spinlocks when we have
+ * multiple NUMA nodes in native environment, unless the user has
+ * overridden this default behavior by setting the numa_spinlock flag.
+ */
+void __init cna_configure_spin_lock_slowpath(void)
+{
+
+	if (numa_spinlock_flag < 0)
+		return;
+
+	if (numa_spinlock_flag == 0 && (nr_node_ids < 2 ||
+		    pv_ops.lock.queued_spin_lock_slowpath !=
+			native_queued_spin_lock_slowpath))
+		return;
+
+	cna_init_nodes();
+
+	pv_ops.lock.queued_spin_lock_slowpath = __cna_queued_spin_lock_slowpath;
+
+	pr_info("Enabling CNA spinlock\n");
+}
-- 
2.24.3 (Apple Git-128)



* [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-04-01 15:31 [PATCH v14 0/6] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (2 preceding siblings ...)
  2021-04-01 15:31 ` [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
@ 2021-04-01 15:31 ` Alex Kogan
  2021-04-13  6:03   ` Andi Kleen
  2021-04-13 12:03   ` Peter Zijlstra
  2021-04-01 15:31 ` [PATCH v14 5/6] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA Alex Kogan
  2021-04-01 15:31 ` [PATCH v14 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA Alex Kogan
  5 siblings, 2 replies; 21+ messages in thread
From: Alex Kogan @ 2021-04-01 15:31 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Keep track of the time the thread at the head of the secondary queue
has been waiting, and force inter-node handoff once this time passes
a preset threshold. The default value for the threshold (10ms) can be
overridden with the new kernel boot command-line option
"numa_spinlock_threshold". The ms value is translated internally to the
nearest rounded-up jiffies.
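
For example, with HZ=250 (4 ms per jiffy), the default of 10 ms rounds
up to 3 jiffies; with HZ=1000 it maps to exactly 10 jiffies.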

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 .../admin-guide/kernel-parameters.txt         |  9 ++
 kernel/locking/qspinlock_cna.h                | 96 ++++++++++++++++---
 2 files changed, 93 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ace55afd4441..5c959631a8c8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3485,6 +3485,15 @@
 			Not specifying this option is equivalent to
 			numa_spinlock=auto.
 
+	numa_spinlock_threshold=	[NUMA, PV_OPS]
+			Set the time threshold in milliseconds for the
+			number of intra-node lock hand-offs before the
+			NUMA-aware spinlock is forced to be passed to
+			a thread on another NUMA node.	Valid values
+			are in the [1..100] range. Smaller values result
+			in a more fair, but less performant spinlock,
+			and vice versa. The default value is 10.
+
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
 			'node', 'default' can be specified
 			This can be set from sysctl after boot.
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index d689861a7b3d..0513360c11fe 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -37,6 +37,12 @@
  * gradually filter the primary queue, leaving only waiters running on the same
  * preferred NUMA node.
  *
+ * We change the NUMA node preference after a waiter at the head of the
+ * secondary queue spins for a certain amount of time (10ms, by default).
+ * We do that by flushing the secondary queue into the head of the primary queue,
+ * effectively changing the preference to the NUMA node of the waiter at the head
+ * of the secondary queue at the time of the flush.
+ *
  * For more details, see https://arxiv.org/abs/1810.05600.
  *
  * Authors: Alex Kogan <alex.kogan@oracle.com>
@@ -49,13 +55,33 @@ struct cna_node {
 	u16			real_numa_node;
 	u32			encoded_tail;	/* self */
 	u32			partial_order;	/* enum val */
+	s32			start_time;
 };
 
 enum {
 	LOCAL_WAITER_FOUND,
 	LOCAL_WAITER_NOT_FOUND,
+	FLUSH_SECONDARY_QUEUE
 };
 
+/*
+ * Controls the threshold time in ms (default = 10) for intra-node lock
+ * hand-offs before the NUMA-aware variant of spinlock is forced to be
+ * passed to a thread on another NUMA node. The default setting can be
+ * changed with the "numa_spinlock_threshold" boot option.
+ */
+#define MSECS_TO_JIFFIES(m)	\
+	(((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ))
+static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
+
+static inline bool intra_node_threshold_reached(struct cna_node *cn)
+{
+	s32 current_time = (s32)jiffies;
+	s32 threshold = cn->start_time + intra_node_handoff_threshold;
+
+	return current_time - threshold > 0;
+}
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -99,6 +125,7 @@ static __always_inline void cna_init_node(struct mcs_spinlock *node)
 
 	cn->numa_node = cn->real_numa_node;
 	cn->partial_order = LOCAL_WAITER_FOUND;
+	cn->start_time = 0;
 }
 
 /*
@@ -198,8 +225,15 @@ static void cna_splice_next(struct mcs_spinlock *node,
 
 	/* stick `next` on the secondary queue tail */
 	if (node->locked <= 1) { /* if secondary queue is empty */
+		struct cna_node *cn = (struct cna_node *)node;
+
 		/* create secondary queue */
 		next->next = next;
+
+		cn->start_time = (s32)jiffies;
+		/* make sure start_time != 0 iff secondary queue is not empty */
+		if (!cn->start_time)
+			cn->start_time = 1;
 	} else {
 		/* add to the tail of the secondary queue */
 		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
@@ -250,11 +284,17 @@ static void cna_order_queue(struct mcs_spinlock *node)
 static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 						 struct mcs_spinlock *node)
 {
-	/*
-	 * Try and put the time otherwise spent spin waiting on
-	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
-	 */
-	cna_order_queue(node);
+	struct cna_node *cn = (struct cna_node *)node;
+
+	if (!cn->start_time || !intra_node_threshold_reached(cn)) {
+		/*
+		 * Try and put the time otherwise spent spin waiting on
+		 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+		 */
+		cna_order_queue(node);
+	} else {
+		cn->partial_order = FLUSH_SECONDARY_QUEUE;
+	}
 
 	return 0; /* we lied; we didn't wait, go do so now */
 }
@@ -270,13 +310,28 @@ static inline void cna_lock_handoff(struct mcs_spinlock *node,
 	if (partial_order == LOCAL_WAITER_NOT_FOUND)
 		cna_order_queue(node);
 
-	/*
-	 * We have a local waiter, either a real or a fake one;
-	 * reload @next in case it was changed by cna_order_queue().
-	 */
-	next = node->next;
-	if (node->locked > 1)
-		val = node->locked;	/* preserve secondary queue */
+	if (partial_order != FLUSH_SECONDARY_QUEUE) {
+		/*
+		 * We have a local waiter, either a real or a fake one;
+		 * reload @next in case it was changed by cna_order_queue().
+		 */
+		next = node->next;
+		if (node->locked > 1) {
+			val = node->locked;     /* preserve secondary queue */
+			((struct cna_node *)next)->start_time = cn->start_time;
+		}
+	} else {
+		/*
+		 * We decided to flush the secondary queue;
+		 * this can only happen if that queue is not empty.
+		 */
+		WARN_ON(node->locked <= 1);
+		/*
+		 * Splice the secondary queue onto the primary queue and pass the lock
+		 * to the longest waiting remote waiter.
+		 */
+		next = cna_splice_head(NULL, 0, node, next);
+	}
 
 	arch_mcs_lock_handoff(&next->locked, val);
 }
@@ -328,3 +383,20 @@ void __init cna_configure_spin_lock_slowpath(void)
 
 	pr_info("Enabling CNA spinlock\n");
 }
+
+static int __init numa_spinlock_threshold_setup(char *str)
+{
+	int param;
+
+	if (get_option(&str, &param)) {
+		/* valid value is between 1 and 100 */
+		if (param <= 0 || param > 100)
+			return 0;
+
+		intra_node_handoff_threshold = msecs_to_jiffies(param);
+		return 1;
+	}
+
+	return 0;
+}
+__setup("numa_spinlock_threshold=", numa_spinlock_threshold_setup);
-- 
2.24.3 (Apple Git-128)



* [PATCH v14 5/6] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA
  2021-04-01 15:31 [PATCH v14 0/6] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (3 preceding siblings ...)
  2021-04-01 15:31 ` [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
@ 2021-04-01 15:31 ` Alex Kogan
  2021-04-01 15:31 ` [PATCH v14 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA Alex Kogan
  5 siblings, 0 replies; 21+ messages in thread
From: Alex Kogan @ 2021-04-01 15:31 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Prohibit moving certain threads (e.g., in irq and nmi contexts)
to the secondary queue. Those prioritized threads will always stay
in the primary queue, and so will have a shorter wait time for the lock.
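
A waiter is considered prioritized if it does not run in task context,
runs with interrupts disabled, or belongs to a realtime task, i.e.:

	bool priority = !in_task() || irqs_disabled() || rt_task(current);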

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_cna.h | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 0513360c11fe..29c3abbd3d94 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,6 +4,7 @@
 #endif
 
 #include <linux/topology.h>
+#include <linux/sched/rt.h>
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -35,7 +36,8 @@
  * running on the same NUMA node. If it is not, that waiter is detached from the
  * main queue and moved into the tail of the secondary queue. This way, we
  * gradually filter the primary queue, leaving only waiters running on the same
- * preferred NUMA node.
+ * preferred NUMA node. Note that certain prioritized waiters (e.g., in
+ * irq and nmi contexts) are excluded from being moved to the secondary queue.
  *
  * We change the NUMA node preference after a waiter at the head of the
  * secondary queue spins for a certain amount of time (10ms, by default).
@@ -49,6 +51,8 @@
  *          Dave Dice <dave.dice@oracle.com>
  */
 
+#define CNA_PRIORITY_NODE      0xffff
+
 struct cna_node {
 	struct mcs_spinlock	mcs;
 	u16			numa_node;
@@ -121,9 +125,10 @@ static int __init cna_init_nodes(void)
 
 static __always_inline void cna_init_node(struct mcs_spinlock *node)
 {
+	bool priority = !in_task() || irqs_disabled() || rt_task(current);
 	struct cna_node *cn = (struct cna_node *)node;
 
-	cn->numa_node = cn->real_numa_node;
+	cn->numa_node = priority ? CNA_PRIORITY_NODE : cn->real_numa_node;
 	cn->partial_order = LOCAL_WAITER_FOUND;
 	cn->start_time = 0;
 }
@@ -266,11 +271,13 @@ static void cna_order_queue(struct mcs_spinlock *node)
 	next_numa_node = ((struct cna_node *)next)->numa_node;
 
 	if (next_numa_node != numa_node) {
-		struct mcs_spinlock *nnext = READ_ONCE(next->next);
+		if (next_numa_node != CNA_PRIORITY_NODE) {
+			struct mcs_spinlock *nnext = READ_ONCE(next->next);
 
-		if (nnext) {
-			cna_splice_next(node, next, nnext);
-			next = nnext;
+			if (nnext) {
+				cna_splice_next(node, next, nnext);
+				next = nnext;
+			}
 		}
 		/*
 		 * Inherit NUMA node id of primary queue, to maintain the
@@ -287,6 +294,13 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 	struct cna_node *cn = (struct cna_node *)node;
 
 	if (!cn->start_time || !intra_node_threshold_reached(cn)) {
+		/*
+		 * We are at the head of the wait queue, no need to use
+		 * the fake NUMA node ID.
+		 */
+		if (cn->numa_node == CNA_PRIORITY_NODE)
+			cn->numa_node = cn->real_numa_node;
+
 		/*
 		 * Try and put the time otherwise spent spin waiting on
 		 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
-- 
2.24.3 (Apple Git-128)



* [PATCH v14 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA
  2021-04-01 15:31 [PATCH v14 0/6] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (4 preceding siblings ...)
  2021-04-01 15:31 ` [PATCH v14 5/6] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA Alex Kogan
@ 2021-04-01 15:31 ` Alex Kogan
  2021-04-14  7:47   ` Andreas Herrmann
  5 siblings, 1 reply; 21+ messages in thread
From: Alex Kogan @ 2021-04-01 15:31 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

This performance optimization chooses probabilistically to avoid moving
threads from the main queue into the secondary one when the secondary queue
is empty.

It is helpful when the lock is only lightly contended. In particular, it
makes CNA less eager to create a secondary queue, but does not introduce
any extra delays for threads waiting in that queue once it is created.
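
With the chosen parameter value (7), probably(SHUFFLE_REDUCTION_PROB_ARG)
returns false with probability 1/2^7. Thus, when the secondary queue is
empty, the queue-ordering step runs on average only once per 128 lock
handoffs (about 0.8% of the time).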

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_cna.h | 39 ++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 29c3abbd3d94..983c6a47a221 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -5,6 +5,7 @@
 
 #include <linux/topology.h>
 #include <linux/sched/rt.h>
+#include <linux/random.h>
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -86,6 +87,34 @@ static inline bool intra_node_threshold_reached(struct cna_node *cn)
 	return current_time - threshold > 0;
 }
 
+/*
+ * Controls the probability for enabling the ordering of the main queue
+ * when the secondary queue is empty. The chosen value reduces the amount
+ * of unnecessary shuffling of threads between the two waiting queues
+ * when the contention is low, while responding fast enough and enabling
+ * the shuffling when the contention is high.
+ */
+#define SHUFFLE_REDUCTION_PROB_ARG  (7)
+
+/* Per-CPU pseudo-random number seed */
+static DEFINE_PER_CPU(u32, seed);
+
+/*
+ * Return false with probability 1 / 2^@num_bits.
+ * Intuitively, the larger @num_bits the less likely false is to be returned.
+ * @num_bits must be a number between 0 and 31.
+ */
+static bool probably(unsigned int num_bits)
+{
+	u32 s;
+
+	s = this_cpu_read(seed);
+	s = next_pseudo_random32(s);
+	this_cpu_write(seed, s);
+
+	return s & ((1 << num_bits) - 1);
+}
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -293,6 +322,16 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 {
 	struct cna_node *cn = (struct cna_node *)node;
 
+	if (node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) {
+		/*
+		 * When the secondary queue is empty, skip the call to
+		 * cna_order_queue() below with high probability. This optimization
+		 * reduces the overhead of unnecessary shuffling of threads
+		 * between waiting queues when the lock is only lightly contended.
+		 */
+		return 0;
+	}
+
 	if (!cn->start_time || !intra_node_threshold_reached(cn)) {
 		/*
 		 * We are at the head of the wait queue, no need to use
-- 
2.24.3 (Apple Git-128)



* Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-04-01 15:31 ` [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
@ 2021-04-13  6:03   ` Andi Kleen
  2021-04-13  6:12     ` Andi Kleen
  2021-04-13 21:01     ` [External] : " Alex Kogan
  2021-04-13 12:03   ` Peter Zijlstra
  1 sibling, 2 replies; 21+ messages in thread
From: Andi Kleen @ 2021-04-13  6:03 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

Alex Kogan <alex.kogan@oracle.com> writes:
>  
> +	numa_spinlock_threshold=	[NUMA, PV_OPS]
> +			Set the time threshold in milliseconds for the
> +			number of intra-node lock hand-offs before the
> +			NUMA-aware spinlock is forced to be passed to
> +			a thread on another NUMA node.	Valid values
> +			are in the [1..100] range. Smaller values result
> +			in a more fair, but less performant spinlock,
> +			and vice versa. The default value is 10.

ms granularity seems very coarse grained for this. Surely
at some point of spinning you can afford a ktime_get? But ok.

Could you turn that into a module_param which can be changed at runtime?
Would be strange to have to reboot just to play with this parameter

This would also make the code a lot shorter I guess.

-Andi


* Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-04-13  6:03   ` Andi Kleen
@ 2021-04-13  6:12     ` Andi Kleen
  2021-04-13 21:01     ` [External] : " Alex Kogan
  1 sibling, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2021-04-13  6:12 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

Andi Kleen <ak@linux.intel.com> writes:

> Alex Kogan <alex.kogan@oracle.com> writes:
>>  
>> +	numa_spinlock_threshold=	[NUMA, PV_OPS]
>> +			Set the time threshold in milliseconds for the
>> +			number of intra-node lock hand-offs before the
>> +			NUMA-aware spinlock is forced to be passed to
>> +			a thread on another NUMA node.	Valid values
>> +			are in the [1..100] range. Smaller values result
>> +			in a more fair, but less performant spinlock,
>> +			and vice versa. The default value is 10.
>
> ms granularity seems very coarse grained for this. Surely
> at some point of spinning you can afford a ktime_get? But ok.

Actually, thinking about it more, using jiffies is likely broken
anyway, because if interrupts are disabled on the CPU that runs the
main timer interrupt, jiffies won't increase.

cpu_clock (better than ktime_get) or sched_clock would work.
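
E.g. something like this (untested sketch; assumes start_time is
widened to a u64 sched_clock() timestamp in ns):

	static u64 intra_node_handoff_threshold_ns __ro_after_init =
			10 * NSEC_PER_MSEC;

	static inline bool intra_node_threshold_reached(struct cna_node *cn)
	{
		return sched_clock() - cn->start_time >
		       intra_node_handoff_threshold_ns;
	}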

-Andi


* Re: [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2021-04-01 15:31 ` [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
@ 2021-04-13 11:30   ` Peter Zijlstra
  2021-04-14  2:29     ` [External] : " Alex Kogan
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2021-04-13 11:30 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Thu, Apr 01, 2021 at 11:31:53AM -0400, Alex Kogan wrote:

> +/*
> + * cna_splice_tail -- splice the next node from the primary queue onto
> + * the secondary queue.
> + */
> +static void cna_splice_next(struct mcs_spinlock *node,
> +			    struct mcs_spinlock *next,
> +			    struct mcs_spinlock *nnext)

You forgot to update the comment when you changed the name on this
thing.

> +/*
> + * cna_order_queue - check whether the next waiter in the main queue is on
> + * the same NUMA node as the lock holder; if not, and it has a waiter behind
> + * it in the main queue, move the former onto the secondary queue.
> + */
> +static void cna_order_queue(struct mcs_spinlock *node)
> +{
> +	struct mcs_spinlock *next = READ_ONCE(node->next);
> +	struct cna_node *cn = (struct cna_node *)node;
> +	int numa_node, next_numa_node;
> +
> +	if (!next) {
> +		cn->partial_order = LOCAL_WAITER_NOT_FOUND;
> +		return;
> +	}
> +
> +	numa_node = cn->numa_node;
> +	next_numa_node = ((struct cna_node *)next)->numa_node;
> +
> +	if (next_numa_node != numa_node) {
> +		struct mcs_spinlock *nnext = READ_ONCE(next->next);
> +
> +		if (nnext) {
> +			cna_splice_next(node, next, nnext);
> +			next = nnext;
> +		}
> +		/*
> +		 * Inherit NUMA node id of primary queue, to maintain the
> +		 * preference even if the next waiter is on a different node.
> +		 */
> +		((struct cna_node *)next)->numa_node = numa_node;
> +	}
> +}

So the obvious change since last time I looked at this is that it now
only looks 1 entry ahead. Which makes sense I suppose.

I'm not really a fan of the 'partial_order' name combined with that
silly enum { LOCAL_WAITER_FOUND, LOCAL_WAITER_NOT_FOUND }. That's just
really bad naming all around. The enum is about having a waiter while
the variable is about partial order, that doesn't match at all.

If you rename the variable to 'has_waiter' and simply use 0,1 values,
things would be ever so much more readable. But I don't think that makes
sense, see below.

I'm also not sure about that whole numa_node thing, why would you
over-write the numa node, why at this point ?

> +
> +/* Abuse the pv_wait_head_or_lock() hook to get some work done */
> +static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
> +						 struct mcs_spinlock *node)
> +{
> +	/*
> +	 * Try and put the time otherwise spent spin waiting on
> +	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> +	 */
> +	cna_order_queue(node);
> +
> +	return 0; /* we lied; we didn't wait, go do so now */

So here we inspect one entry ahead and then quit. I can't remember, but
did we try something like:

	/*
	 * Try and put the time otherwise spent spin waiting on
	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
	 * Move one entry at a go until either the list is fully
	 * sorted or we run out of spin condition.
	 */
	while (READ_ONCE(lock->val) & _Q_LOCKED_PENDING_MASK &&
	       node->partial_order)
		cna_order_queue(node);

	return 0;

This will keep moving @next to the remote list until such a time that
we're forced to continue or @next is local.

> +}
> +
> +static inline void cna_lock_handoff(struct mcs_spinlock *node,
> +				 struct mcs_spinlock *next)
> +{
> +	struct cna_node *cn = (struct cna_node *)node;
> +	u32 val = 1;
> +
> +	u32 partial_order = cn->partial_order;
> +
> +	if (partial_order == LOCAL_WAITER_NOT_FOUND)
> +		cna_order_queue(node);
> +

AFAICT this is where playing silly games with ->numa_node belongs; but
right along with that goes a comment that describes why any of that
makes sense.

I mean, if you leave your node, for any reason, why bother coming back
to it, why not accept it is a sign of the gods you're overdue for a
node-change?

Was the efficacy of this scheme tested?

> +	/*
> +	 * We have a local waiter, either real or fake one;
> +	 * reload @next in case it was changed by cna_order_queue().
> +	 */
> +	next = node->next;
> +	if (node->locked > 1)
> +		val = node->locked;	/* preserve secondary queue */

IIRC we used to do:

	val |= node->locked;

Which is simpler for not having branches. Why change a good thing?

> +
> +	arch_mcs_lock_handoff(&next->locked, val);
> +}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-04-01 15:31 ` [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
  2021-04-13  6:03   ` Andi Kleen
@ 2021-04-13 12:03   ` Peter Zijlstra
  2021-04-16  2:52     ` [External] : " Alex Kogan
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2021-04-13 12:03 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Thu, Apr 01, 2021 at 11:31:54AM -0400, Alex Kogan wrote:

> @@ -49,13 +55,33 @@ struct cna_node {
>  	u16			real_numa_node;
>  	u32			encoded_tail;	/* self */
>  	u32			partial_order;	/* enum val */
> +	s32			start_time;
>  };

> +/*
> + * Controls the threshold time in ms (default = 10) for intra-node lock
> + * hand-offs before the NUMA-aware variant of spinlock is forced to be
> + * passed to a thread on another NUMA node. The default setting can be
> + * changed with the "numa_spinlock_threshold" boot option.
> + */
> +#define MSECS_TO_JIFFIES(m)	\
> +	(((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ))
> +static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
> +
> +static inline bool intra_node_threshold_reached(struct cna_node *cn)
> +{
> +	s32 current_time = (s32)jiffies;
> +	s32 threshold = cn->start_time + intra_node_handoff_threshold;
> +
> +	return current_time - threshold > 0;
> +}

None of this makes any sense:

 - why do you track time elapsed as a signed entity?
 - why are you using jiffies; that's terrible granularity.

As Andi already said, 10ms is silly large. You've just inflated the
lock-acquire time for every contended lock to stupid land just because
NUMA.

And this also brings me to the whole premise of this series; *why* are
we optimizing this? What locks are so contended that this actually helps
and shouldn't you be spending your time breaking those locks? That would
improve throughput more than this ever can.

>  static void __init cna_init_nodes_per_cpu(unsigned int cpu)
>  {
>  	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);

> @@ -250,11 +284,17 @@ static void cna_order_queue(struct mcs_spinlock *node)
>  static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
>  						 struct mcs_spinlock *node)
>  {
> -	/*
> -	 * Try and put the time otherwise spent spin waiting on
> -	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> -	 */
> -	cna_order_queue(node);
> +	struct cna_node *cn = (struct cna_node *)node;
> +
> +	if (!cn->start_time || !intra_node_threshold_reached(cn)) {
> +		/*
> +		 * Try and put the time otherwise spent spin waiting on
> +		 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> +		 */
> +		cna_order_queue(node);
> +	} else {
> +		cn->partial_order = FLUSH_SECONDARY_QUEUE;

This is even worse naming...

> +	}
>  
>  	return 0; /* we lied; we didn't wait, go do so now */
>  }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-04-13  6:03   ` Andi Kleen
  2021-04-13  6:12     ` Andi Kleen
@ 2021-04-13 21:01     ` Alex Kogan
  2021-04-13 21:22       ` Andi Kleen
  2021-04-14 16:41       ` Waiman Long
  1 sibling, 2 replies; 21+ messages in thread
From: Alex Kogan @ 2021-04-13 21:01 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux, Peter Zijlstra, Ingo Molnar, will.deacon, arnd, longman,
	linux-arch, linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86,
	guohanjun, jglauber, steven.sistare, daniel.m.jordan, dave.dice

Hi, Andi.

Thanks for your comments!

> On Apr 13, 2021, at 2:03 AM, Andi Kleen <ak@linux.intel.com> wrote:
> 
> Alex Kogan <alex.kogan@oracle.com> writes:
>> 
>> +	numa_spinlock_threshold=	[NUMA, PV_OPS]
>> +			Set the time threshold in milliseconds for the
>> +			number of intra-node lock hand-offs before the
>> +			NUMA-aware spinlock is forced to be passed to
>> +			a thread on another NUMA node.	Valid values
>> +			are in the [1..100] range. Smaller values result
>> +			in a more fair, but less performant spinlock,
>> +			and vice versa. The default value is 10.
> 
> ms granularity seems very coarse grained for this. Surely
> at some point of spinning you can afford a ktime_get? But ok.
We are reading time when we are at the head of the (main) queue, but
don’t have the lock yet. Not sure about the latency of ktime_get(), but
anything reasonably fast but not necessarily precise should work.

> Could you turn that into a module_param which can be changed at runtime?
> Would be strange to have to reboot just to play with this parameter.
Yes, good suggestion, thanks.

> This would also make the code a lot shorter I guess.
So you don’t think we need the command-line parameter, just the module_param?

Regards,
— Alex


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-04-13 21:01     ` [External] : " Alex Kogan
@ 2021-04-13 21:22       ` Andi Kleen
  2021-04-14  2:30         ` Alex Kogan
  2021-04-14 16:41       ` Waiman Long
  1 sibling, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2021-04-13 21:22 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, Peter Zijlstra, Ingo Molnar, will.deacon, arnd, longman,
	linux-arch, linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86,
	guohanjun, jglauber, steven.sistare, daniel.m.jordan, dave.dice

> > ms granularity seems very coarse grained for this. Surely
> > at some point of spinning you can afford a ktime_get? But ok.
> We are reading time when we are at the head of the (main) queue, but
> don’t have the lock yet. Not sure about the latency of ktime_get(), but
> anything reasonably fast but not necessarily precise should work.

Actually cpu_clock / sched_clock (see my other email). These should
be fast without corner cases and also monotonic.

> 
> > Could you turn that into a module_param which can be changed at runtime?
> > Would be strange to have to reboot just to play with this parameter.
> Yes, good suggestion, thanks.
> 
> > This would also make the code a lot shorter I guess.
> So you don’t think we need the command-line parameter, just the module_param?

module_params can be changed at the command line too, so yes.

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] : Re: [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2021-04-13 11:30   ` Peter Zijlstra
@ 2021-04-14  2:29     ` Alex Kogan
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Kogan @ 2021-04-14  2:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux, Ingo Molnar, Will Deacon, Arnd Bergmann, longman,
	linux-arch, linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86,
	guohanjun, jglauber, steven.sistare, daniel.m.jordan, dave.dice

Peter, thanks for all the comments and suggestions!

> On Apr 13, 2021, at 7:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Apr 01, 2021 at 11:31:53AM -0400, Alex Kogan wrote:
> 
>> +/*
>> + * cna_splice_tail -- splice the next node from the primary queue onto
>> + * the secondary queue.
>> + */
>> +static void cna_splice_next(struct mcs_spinlock *node,
>> +			    struct mcs_spinlock *next,
>> +			    struct mcs_spinlock *nnext)
> 
> You forgot to update the comment when you changed the name on this
> thing.
Good catch, thanks.

> 
>> +/*
>> + * cna_order_queue - check whether the next waiter in the main queue is on
>> + * the same NUMA node as the lock holder; if not, and it has a waiter behind
>> + * it in the main queue, move the former onto the secondary queue.
>> + */
>> +static void cna_order_queue(struct mcs_spinlock *node)
>> +{
>> +	struct mcs_spinlock *next = READ_ONCE(node->next);
>> +	struct cna_node *cn = (struct cna_node *)node;
>> +	int numa_node, next_numa_node;
>> +
>> +	if (!next) {
>> +		cn->partial_order = LOCAL_WAITER_NOT_FOUND;
>> +		return;
>> +	}
>> +
>> +	numa_node = cn->numa_node;
>> +	next_numa_node = ((struct cna_node *)next)->numa_node;
>> +
>> +	if (next_numa_node != numa_node) {
>> +		struct mcs_spinlock *nnext = READ_ONCE(next->next);
>> +
>> +		if (nnext) {
>> +			cna_splice_next(node, next, nnext);
>> +			next = nnext;
>> +		}
>> +		/*
>> +		 * Inherit NUMA node id of primary queue, to maintain the
>> +		 * preference even if the next waiter is on a different node.
>> +		 */
>> +		((struct cna_node *)next)->numa_node = numa_node;
>> +	}
>> +}
> 
> So the obvious change since last time I looked at this is that it now
> only looks 1 entry ahead. Which makes sense, I suppose.
This is in response to the critique that the worst-case time complexity of
cna_order_queue() was O(n). With this change, the complexity is constant.

> 
> I'm not really a fan of the 'partial_order' name combined with that
> silly enum { LOCAL_WAITER_FOUND, LOCAL_WAITER_NOT_FOUND }. That's just
> really bad naming all around. The enum is about having a waiter while
> the variable is about partial order, that doesn't match at all.
Fair enough.

> If you rename the variable to 'has_waiter' and simply use 0,1 values,
> things would be ever so much more readable. But I don't think that makes
> sense, see below.
> 
> I'm also not sure about that whole numa_node thing; why would you
> overwrite the numa node, and why at this point?
With this move-one-by-one approach, I want to keep the NUMA-node 
preference of the lock holder even if the next-next waiter is on a different
NUMA-node. Otherwise, we will end up switching preference often and
the entire scheme would not perform well. In particular, we might easily
end up with threads from the preferred node waiting in the secondary queue.

> 
>> +
>> +/* Abuse the pv_wait_head_or_lock() hook to get some work done */
>> +static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
>> +						 struct mcs_spinlock *node)
>> +{
>> +	/*
>> +	 * Try and put the time otherwise spent spin waiting on
>> +	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
>> +	 */
>> +	cna_order_queue(node);
>> +
>> +	return 0; /* we lied; we didn't wait, go do so now */
> 
> So here we inspect one entry ahead and then quit. I can't remember, but
> did we try something like:
> 
> 	/*
> 	 * Try and put the time otherwise spent spin waiting on
> 	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> 	 * Move one entry at a go until either the list is fully
> 	 * sorted or we run out of spin condition.
> 	 */
> 	while (READ_ONCE(lock->val) & _Q_LOCKED_PENDING_MASK &&
> 	       node->partial_order)
> 		cna_order_queue(node);
> 
> 	return 0;
> 
> This will keep moving @next to the remote list until such a time that
> we're forced to continue or @next is local.
We have not tried that. This is actually an interesting idea, with its pros and cons.
That is, we are likely to filter out “non-preferred” waiters into the secondary queue
faster, but we are also more likely to run into a situation where the lock becomes
available while we are still running the cna_order_queue() logic, thus prolonging
the handover time.

With this change, however, there is no need to call cna_order_queue() again 
in cna_lock_handoff(), which is another advantage I guess.

All in all, I am fine switching to this alternative if you like it more.

BTW, we can return node->partial_order instead of 0, to skip the call to
atomic_cond_read_acquire() in the generic code.

> 
>> +}
>> +
>> +static inline void cna_lock_handoff(struct mcs_spinlock *node,
>> +				 struct mcs_spinlock *next)
>> +{
>> +	struct cna_node *cn = (struct cna_node *)node;
>> +	u32 val = 1;
>> +
>> +	u32 partial_order = cn->partial_order;
>> +
>> +	if (partial_order == LOCAL_WAITER_NOT_FOUND)
>> +		cna_order_queue(node);
>> +
> 
> AFAICT this is where playing silly games with ->numa_node belongs; but
> right along with that goes a comment that describes why any of that
> makes sense.
> 
> I mean, if you leave your node, for any reason, why bother coming back
> to it, why not accept it is a sign of the gods you're overdue for a
> node-change?
I think there is some misunderstanding here — let me try to clarify.
In the current logic, we first call cna_order_queue() from cna_wait_head_or_lock().
If we find an appropriate “local” waiter (either real or fake), we set it as
next and make sure the numa_node of that node is the same as ours.
If not (i.e., when we do not have any next waiter), we set partial_order to
LOCAL_WAITER_NOT_FOUND (pardon the names) and go spin on the lock (calling
atomic_cond_read_acquire() in the generic code).
During this spin time, new waiters can join the queue.
Hence, we recheck the state of the queue by calling cna_order_queue() again
from cna_lock_handoff().

As just mentioned, if we change the logic as you suggest, there is
really no reason to call cna_order_queue() again.

> 
> Was the efficacy of this scheme tested?
> 
>> +	/*
>> +	 * We have a local waiter, either real or fake one;
>> +	 * reload @next in case it was changed by cna_order_queue().
>> +	 */
>> +	next = node->next;
>> +	if (node->locked > 1)
>> +		val = node->locked;	/* preserve secondary queue */
> 
> IIRC we used to do:
> 
> 	val |= node->locked;
> 
> Which is simpler for not having branches. Why change a good thing?
With |= we might set val to encoded tail+1 (rather than the encoded tail).
This would still work, as decode_tail(val) does not care about the LSB, but it
seems fragile to me.
We talked about that before: 
https://lore.kernel.org/linux-arm-kernel/E32F90E2-F00B-423B-A851-336004EF6593@oracle.com
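
To spell out the concern (a sketch; recall that a valid encoded tail has its
low bits clear, so bit 0 is free):

	u32 val = 1;

	val |= node->locked;	/* 1, or (tail | 1) == tail + 1 */

	/*
	 * decode_tail() shifts/masks the low bits away, so tail + 1 still
	 * decodes to the same CPU/idx -- it works, but only by relying on
	 * that implementation detail.
	 */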

If you think this is a non-issue, I will be happy to adopt your suggestion.

Regards,
— Alex


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-04-13 21:22       ` Andi Kleen
@ 2021-04-14  2:30         ` Alex Kogan
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Kogan @ 2021-04-14  2:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux, Peter Zijlstra, Ingo Molnar, Will Deacon, arnd, longman,
	linux-arch, linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86,
	guohanjun, jglauber, steven.sistare, daniel.m.jordan, dave.dice



> On Apr 13, 2021, at 5:22 PM, Andi Kleen <ak@linux.intel.com> wrote:
> 
>>> ms granularity seems very coarse grained for this. Surely
>>> at some point of spinning you can afford a ktime_get? But ok.
>> We are reading time when we are at the head of the (main) queue, but
>> don’t have the lock yet. Not sure about the latency of ktime_get(), but
>> anything reasonably fast but not necessarily precise should work.
> 
> Actually cpu_clock / sched_clock (see my other email). These should
> be fast without corner cases and also monotonic.
I see, thanks.

> 
>> 
>>> Could you turn that into a module_param which can be changed at runtime?
>>> Would be strange to have to reboot just to play with this parameter.
>> Yes, good suggestion, thanks.
>> 
>>> This would also make the code a lot shorter I guess.
>> So you don’t think we need the command-line parameter, just the module_param?
> 
> module_params can be changed at the command line too, so yes.
Got it, thanks again.

— Alex


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v14 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA
  2021-04-01 15:31 ` [PATCH v14 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA Alex Kogan
@ 2021-04-14  7:47   ` Andreas Herrmann
  2021-04-14 18:18     ` [External] : " Alex Kogan
  0 siblings, 1 reply; 21+ messages in thread
From: Andreas Herrmann @ 2021-04-14  7:47 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On Thu, Apr 01, 2021 at 11:31:56AM -0400, Alex Kogan wrote:
> This performance optimization chooses probabilistically to avoid moving
> threads from the main queue into the secondary one when the secondary queue
> is empty.
> 
> It is helpful when the lock is only lightly contended. In particular, it
> makes CNA less eager to create a secondary queue, but does not introduce
> any extra delays for threads waiting in that queue once it is created.
> 
> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/locking/qspinlock_cna.h | 39 ++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
> index 29c3abbd3d94..983c6a47a221 100644
> --- a/kernel/locking/qspinlock_cna.h
> +++ b/kernel/locking/qspinlock_cna.h
> @@ -5,6 +5,7 @@
>  
>  #include <linux/topology.h>
>  #include <linux/sched/rt.h>
> +#include <linux/random.h>
>  
>  /*
>   * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
> @@ -86,6 +87,34 @@ static inline bool intra_node_threshold_reached(struct cna_node *cn)
>  	return current_time - threshold > 0;
>  }
>  
> +/*
> + * Controls the probability for enabling the ordering of the main queue
> + * when the secondary queue is empty. The chosen value reduces the amount
> + * of unnecessary shuffling of threads between the two waiting queues
> + * when the contention is low, while responding fast enough and enabling
> + * the shuffling when the contention is high.
> + */
> +#define SHUFFLE_REDUCTION_PROB_ARG  (7)

Out of curiosity:

Have you used other values and done measurements what's an efficient
value for SHUFFLE_REDUCTION_PROB_ARG?
Maybe I miscalculated, but if I understand it correctly this value
implies that the probability is 0.9921875 (i.e., 127/128) that the
function below returns true.

My question is probably answered by following statement from
referenced paper:

"In our experiments with the shuffle reduction optimization enabled,
we set THRESHOLD2 to 0xff." (page with figure 5)

> +
> +/* Per-CPU pseudo-random number seed */
> +static DEFINE_PER_CPU(u32, seed);
> +
> +/*
> + * Return false with probability 1 / 2^@num_bits.
> + * Intuitively, the larger @num_bits the less likely false is to be returned.
> + * @num_bits must be a number between 0 and 31.
> + */
> +static bool probably(unsigned int num_bits)
> +{
> +	u32 s;
> +
> +	s = this_cpu_read(seed);
> +	s = next_pseudo_random32(s);
> +	this_cpu_write(seed, s);
> +
> +	return s & ((1 << num_bits) - 1);
> +}
> +
>  static void __init cna_init_nodes_per_cpu(unsigned int cpu)
>  {
>  	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
> @@ -293,6 +322,16 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
>  {
>  	struct cna_node *cn = (struct cna_node *)node;
>  
> +	if (node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) {

Again, if I understand it correctly, with SHUFFLE_REDUCTION_PROB_ARG==7
it's roughly 1 out of 128 cases where probably() returns false.

Why/when is this beneficial?

I assume it has to do with following statement in the referenced
paper:

"The superior performance over MCS at 4 threads is the result of the
shuffling that does take place once in a while, organizing threads’
arrivals to the lock in a way that reduces the inter-socket lock
migration without the need to continuously modify the main queue. ..."
(page with figure 9; the paper has no page numbers)

But OTOH, why this pseudo-randomness?

How about deterministically treating every 100th execution differently
(it also matches "once in a while") and thus entirely removing the
pseudo-randomness?

Have you tried this? If so, why was it worse than pseudo-randomness?

(Or maybe I missed something and pseudo-randomness is required for
other reasons there.)

> +		/*
> +		 * When the secondary queue is empty, skip the call to
> +		 * cna_order_queue() below with high probability. This optimization
> +		 * reduces the overhead of unnecessary shuffling of threads
> +		 * between waiting queues when the lock is only lightly contended.
> +		 */
> +		return 0;
> +	}
> +
>  	if (!cn->start_time || !intra_node_threshold_reached(cn)) {
>  		/*
>  		 * We are at the head of the wait queue, no need to use
> -- 
> 2.24.3 (Apple Git-128)
> 

-- 
Regards,
Andreas


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-04-13 21:01     ` [External] : " Alex Kogan
  2021-04-13 21:22       ` Andi Kleen
@ 2021-04-14 16:41       ` Waiman Long
  2021-04-14 17:26         ` Andi Kleen
  1 sibling, 1 reply; 21+ messages in thread
From: Waiman Long @ 2021-04-14 16:41 UTC (permalink / raw)
  To: Alex Kogan, Andi Kleen
  Cc: linux, Peter Zijlstra, Ingo Molnar, will.deacon, arnd,
	linux-arch, linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86,
	guohanjun, jglauber, steven.sistare, daniel.m.jordan, dave.dice

On 4/13/21 5:01 PM, Alex Kogan wrote:
> Hi, Andi.
>
> Thanks for your comments!
>
>> On Apr 13, 2021, at 2:03 AM, Andi Kleen <ak@linux.intel.com> wrote:
>>
>> Alex Kogan <alex.kogan@oracle.com> writes:
>>> +	numa_spinlock_threshold=	[NUMA, PV_OPS]
>>> +			Set the time threshold in milliseconds for the
>>> +			number of intra-node lock hand-offs before the
>>> +			NUMA-aware spinlock is forced to be passed to
>>> +			a thread on another NUMA node.	Valid values
>>> +			are in the [1..100] range. Smaller values result
>>> +			in a more fair, but less performant spinlock,
>>> +			and vice versa. The default value is 10.
>> ms granularity seems very coarse grained for this. Surely
>> at some point of spinning you can afford a ktime_get? But ok.
> We are reading time when we are at the head of the (main) queue, but
> don’t have the lock yet. Not sure about the latency of ktime_get(), but
> anything reasonably fast but not necessarily precise should work.
>
>> Could you turn that into a module_param which can be changed at runtime?
>> Would be strange to have to reboot just to play with this parameter.
> Yes, good suggestion, thanks.
>
>> This would also make the code a lot shorter I guess.
> So you don’t think we need the command-line parameter, just the module_param?

The CNA code, if enabled, will be in vmlinux, not in a kernel module. As 
a result, I think a module parameter will be no different from a kernel 
command line parameter in this regard.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-04-14 16:41       ` Waiman Long
@ 2021-04-14 17:26         ` Andi Kleen
  2021-04-14 17:31           ` Waiman Long
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2021-04-14 17:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alex Kogan, linux, Peter Zijlstra, Ingo Molnar, will.deacon,
	arnd, linux-arch, linux-arm-kernel, linux-kernel, tglx, bp, hpa,
	x86, guohanjun, jglauber, steven.sistare, daniel.m.jordan,
	dave.dice

> The CNA code, if enabled, will be in vmlinux, not in a kernel module. As a
> result, I think a module parameter will be no different from a kernel
> command line parameter in this regard.

You can still change it in /sys at runtime, even if it's in the vmlinux.
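
A minimal sketch of how that could look here (reusing the existing variable,
which would have to lose __ro_after_init; the exact /sys path depends on the
object name):

	#include <linux/moduleparam.h>

	static int intra_node_handoff_threshold = MSECS_TO_JIFFIES(10);

	/*
	 * With 0644 perms this is writable at runtime via
	 * /sys/module/<name>/parameters/numa_spinlock_threshold,
	 * even for built-in code.
	 */
	module_param_named(numa_spinlock_threshold,
			   intra_node_handoff_threshold, int, 0644);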

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-04-14 17:26         ` Andi Kleen
@ 2021-04-14 17:31           ` Waiman Long
  0 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2021-04-14 17:31 UTC (permalink / raw)
  To: Andi Kleen, Waiman Long
  Cc: Alex Kogan, linux, Peter Zijlstra, Ingo Molnar, will.deacon,
	arnd, linux-arch, linux-arm-kernel, linux-kernel, tglx, bp, hpa,
	x86, guohanjun, jglauber, steven.sistare, daniel.m.jordan,
	dave.dice

On 4/14/21 1:26 PM, Andi Kleen wrote:
>> The CNA code, if enabled, will be in vmlinux, not in a kernel module. As a
>> result, I think a module parameter will be no different from a kernel
>> command line parameter in this regard.
> You can still change it in /sys at runtime, even if it's in the vmlinux.

I see, thanks for the clarification.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] : Re: [PATCH v14 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA
  2021-04-14  7:47   ` Andreas Herrmann
@ 2021-04-14 18:18     ` Alex Kogan
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Kogan @ 2021-04-14 18:18 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: linux, Peter Zijlstra, Ingo Molnar, Will Deacon, arnd, longman,
	linux-arch, linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86,
	guohanjun, jglauber, steven.sistare, daniel.m.jordan, dave.dice

Hi, Andreas.

Thanks for the great questions.

> On Apr 14, 2021, at 3:47 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> 
> On Thu, Apr 01, 2021 at 11:31:56AM -0400, Alex Kogan wrote:
>> This performance optimization chooses probabilistically to avoid moving
>> threads from the main queue into the secondary one when the secondary queue
>> is empty.
>> 
>> It is helpful when the lock is only lightly contended. In particular, it
>> makes CNA less eager to create a secondary queue, but does not introduce
>> any extra delays for threads waiting in that queue once it is created.
>> 
>> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
>> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/locking/qspinlock_cna.h | 39 ++++++++++++++++++++++++++++++++++
>> 1 file changed, 39 insertions(+)
>> 
>> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
>> index 29c3abbd3d94..983c6a47a221 100644
>> --- a/kernel/locking/qspinlock_cna.h
>> +++ b/kernel/locking/qspinlock_cna.h
>> @@ -5,6 +5,7 @@
>> 
>> #include <linux/topology.h>
>> #include <linux/sched/rt.h>
>> +#include <linux/random.h>
>> 
>> /*
>>  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
>> @@ -86,6 +87,34 @@ static inline bool intra_node_threshold_reached(struct cna_node *cn)
>> 	return current_time - threshold > 0;
>> }
>> 
>> +/*
>> + * Controls the probability for enabling the ordering of the main queue
>> + * when the secondary queue is empty. The chosen value reduces the amount
>> + * of unnecessary shuffling of threads between the two waiting queues
>> + * when the contention is low, while responding fast enough and enabling
>> + * the shuffling when the contention is high.
>> + */
>> +#define SHUFFLE_REDUCTION_PROB_ARG  (7)
> 
> Out of curiosity:
> 
> Have you used other values and done measurements what's an efficient
> value for SHUFFLE_REDUCTION_PROB_ARG?
Yes, we did try other values. Small variations do not change the results much,
but if you bump the value significantly (e.g., 20), you end up with a lock that
hardly does any shuffling and thus performs on-par with the (MCS-based)
baseline.

> Maybe I miscalculated, but if I understand it correctly this value
> implies that the probability is 0.9921875 (i.e., 127/128) that the
> function below returns true.
Your analysis is correct. Intuitively, we tried to keep the probability around 1-2%,
so if we do decide to shuffle when we don’t really want to (i.e., when the
contention is low), the overall overhead of such wrong decisions would be small.
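
For what it's worth, the 1/128 rate is easy to confirm in user space with a
throwaway sketch, assuming next_pseudo_random32() is the kernel's
numerical-recipes LCG:

	#include <stdio.h>
	#include <stdint.h>

	static uint32_t next_pseudo_random32(uint32_t seed)
	{
		return seed * 1664525u + 1013904223u;
	}

	int main(void)
	{
		uint32_t s = 1;
		unsigned long i, hits = 0, n = 1UL << 20;

		for (i = 0; i < n; i++) {
			s = next_pseudo_random32(s);
			if (!(s & 0x7f))	/* probably(7) returning false */
				hits++;
		}
		/* the low 7 bits cycle with period 128, so expect exactly n/128 */
		printf("false: %lu out of %lu\n", hits, n);
		return 0;
	}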

> 
> My question is probably answered by following statement from
> referenced paper:
> 
> "In our experiments with the shuffle reduction optimization enabled,
> we set THRESHOLD2 to 0xff." (page with figure 5)
Yeah, just to avoid any confusion — we used a different mechanism to draw
pseudo-random numbers in the paper, so that 0xff number is not directly 
comparable to the range of possible values for SHUFFLE_REDUCTION_PROB_ARG,
but the idea was exactly the same.

> 
>> +
>> +/* Per-CPU pseudo-random number seed */
>> +static DEFINE_PER_CPU(u32, seed);
>> +
>> +/*
>> + * Return false with probability 1 / 2^@num_bits.
>> + * Intuitively, the larger @num_bits the less likely false is to be returned.
>> + * @num_bits must be a number between 0 and 31.
>> + */
>> +static bool probably(unsigned int num_bits)
>> +{
>> +	u32 s;
>> +
>> +	s = this_cpu_read(seed);
>> +	s = next_pseudo_random32(s);
>> +	this_cpu_write(seed, s);
>> +
>> +	return s & ((1 << num_bits) - 1);
>> +}
>> +
>> static void __init cna_init_nodes_per_cpu(unsigned int cpu)
>> {
>> 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
>> @@ -293,6 +322,16 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
>> {
>> 	struct cna_node *cn = (struct cna_node *)node;
>> 
>> +	if (node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) {
> 
> Again, if I understand it correctly, with SHUFFLE_REDUCTION_PROB_ARG==7
> it's roughly 1 out of 128 cases where probably() returns false.
> 
> Why/when is this beneficial?
> 
> I assume it has to do with following statement in the referenced
> paper:
> 
> "The superior performance over MCS at 4 threads is the result of the
> shuffling that does take place once in a while, organizing threads’
> arrivals to the lock in a way that reduces the inter-socket lock
> migration without the need to continuously modify the main queue. ..."
> (page with figure 9; the paper has no page numbers)
This is correct. This performance optimization deals with a small regression
that might occur at the low-medium range of contention. And just to reiterate
what the patch comment says, this optimization has nothing to do with correctness
and does not introduce any extra delays for threads already waiting in the
secondary queue.

> 
> But OTOH, why this pseudo-randomness?
> 
> How about deterministically treating every 100th execution differently
> (it also matches "once in a while") and thus entirely removing the
> pseudo-randomness?
Pseudo-randomness helps to avoid pathological cases where we would make the
same decision, despite having this optimization in place, for every lock
acquisition.

Consider an application in which each thread cycles through X locks, e.g., X=10.
If we count deterministically, then for the first 9 locks we will _always_ avoid
doing any shuffling, missing any opportunity for the benefit derived from the
CNA approach. At the same time, for lock #10 we will always attempt to shuffle,
and so the optimization would not have any effect.
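
For illustration, a sketch of such a deterministic variant (hypothetical
helper, not part of the patch):

	/* "shuffle" (return false) on every 100th acquisition */
	static DEFINE_PER_CPU(u32, acq_count);

	static bool probably_deterministic(void)
	{
		return this_cpu_inc_return(acq_count) % 100 != 0;
	}

With the locks visited round-robin, the acquisitions where the count is
divisible by 100 always land on the same lock (as 100 % 10 == 0), which is
exactly the pathology described above.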

> 
> Have you tried this? If so, why was it worse than pseudo-randomness?
> 
> (Or maybe I missed something and pseudo-randomness is required for
> other reasons there.)

Hope this helps — let me know if you have any further questions!

Best,
— Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
  2021-04-13 12:03   ` Peter Zijlstra
@ 2021-04-16  2:52     ` Alex Kogan
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Kogan @ 2021-04-16  2:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux, Ingo Molnar, Will Deacon, Arnd Bergmann, longman,
	linux-arch, linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86,
	guohanjun, jglauber, steven.sistare, daniel.m.jordan, dave.dice



> On Apr 13, 2021, at 8:03 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Apr 01, 2021 at 11:31:54AM -0400, Alex Kogan wrote:
> 
>> @@ -49,13 +55,33 @@ struct cna_node {
>> 	u16			real_numa_node;
>> 	u32			encoded_tail;	/* self */
>> 	u32			partial_order;	/* enum val */
>> +	s32			start_time;
>> };
> 
>> +/*
>> + * Controls the threshold time in ms (default = 10) for intra-node lock
>> + * hand-offs before the NUMA-aware variant of spinlock is forced to be
>> + * passed to a thread on another NUMA node. The default setting can be
>> + * changed with the "numa_spinlock_threshold" boot option.
>> + */
>> +#define MSECS_TO_JIFFIES(m)	\
>> +	(((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ))
>> +static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
>> +
>> +static inline bool intra_node_threshold_reached(struct cna_node *cn)
>> +{
>> +	s32 current_time = (s32)jiffies;
>> +	s32 threshold = cn->start_time + intra_node_handoff_threshold;
>> +
>> +	return current_time - threshold > 0;
>> +}
> 
> None of this makes any sense:
> 
> - why do you track time elapsed as a signed entity?
> - why are you using jiffies; that's terrible granularity.
Good points. I will address that (see below). I will just mention that 
those suggestions came from senior folks on this mailing list,
and it seemed prudent to take their counsel. 

> 
> As Andi already said, 10ms is silly large. You've just inflated the
> lock-acquire time for every contended lock to stupid land just because
> NUMA.
I just ran a few quick tests — local_clock() (a wrapper around sched_clock()) 
works well, so I will switch to using that.
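
Concretely, something like this (a sketch; keeping the existing names, with
start_time becoming u64 nanoseconds recorded at the head of the main queue):

	static u64 intra_node_handoff_threshold_ns __ro_after_init =
		10 * NSEC_PER_MSEC;

	static inline bool intra_node_threshold_reached(struct cna_node *cn)
	{
		return local_clock() - cn->start_time >
		       intra_node_handoff_threshold_ns;
	}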

I also took a few numbers with different thresholds. Looks like we can drop 
the threshold to 1ms with a minor penalty to performance. However, 
pushing the threshold to 100us has a more significant cost. Here are
the numbers for reference:

will-it-scale/lock2_threads:
threshold:                     10ms     1ms      100us
speedup at 142 threads:       2.184    1.974     1.1418 

will-it-scale/open1_threads:
threshold:                     10ms     1ms      100us
speedup at 142 threads:       2.146    1.974     1.291

Would you be more comfortable with setting the default at 1ms?

> And this also brings me to the whole premise of this series; *why* are
> we optimizing this? What locks are so contended that this actually helps
> and shouldn't you be spending your time breaking those locks? That would
> improve throughput more than this ever can.

I think for the same reason the kernel switched from ticket locks to queue locks
several years back. There will always be applications with contended locks. 
Sometimes the workarounds are easy, but many times they are not, like with 
legacy applications or when the workload is skewed (e.g., every client tries to
update the metadata of the same file protected by the same lock). The results
show that for those cases we leave > 2x performance on the table. Those are not
only our numbers — LKP reports show similar or even better results, 
on a wide range of benchmarks, e.g.:
https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/HGVOCYDEE5KTLYPTAFBD2RXDQOCDPFUJ/
https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/OUPS7MZ3GJA2XYWM52GMU7H7EI25IT37/
https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/DNMEQPXJRQY2IKHZ3ERGRY6TUPWDTFUN/

Regards,
— Alex


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-04-16  2:54 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-01 15:31 [PATCH v14 0/6] Add NUMA-awareness to qspinlock Alex Kogan
2021-04-01 15:31 ` [PATCH v14 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
2021-04-01 15:31 ` [PATCH v14 2/6] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
2021-04-01 15:31 ` [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
2021-04-13 11:30   ` Peter Zijlstra
2021-04-14  2:29     ` [External] : " Alex Kogan
2021-04-01 15:31 ` [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
2021-04-13  6:03   ` Andi Kleen
2021-04-13  6:12     ` Andi Kleen
2021-04-13 21:01     ` [External] : " Alex Kogan
2021-04-13 21:22       ` Andi Kleen
2021-04-14  2:30         ` Alex Kogan
2021-04-14 16:41       ` Waiman Long
2021-04-14 17:26         ` Andi Kleen
2021-04-14 17:31           ` Waiman Long
2021-04-13 12:03   ` Peter Zijlstra
2021-04-16  2:52     ` [External] : " Alex Kogan
2021-04-01 15:31 ` [PATCH v14 5/6] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA Alex Kogan
2021-04-01 15:31 ` [PATCH v14 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA Alex Kogan
2021-04-14  7:47   ` Andreas Herrmann
2021-04-14 18:18     ` [External] : " Alex Kogan
