linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v3 0/5] Add NUMA-awareness to qspinlock
@ 2019-07-15 19:25 Alex Kogan
  2019-07-15 19:25 ` [PATCH v3 1/5] locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic Alex Kogan
                   ` (5 more replies)
  0 siblings, 6 replies; 33+ messages in thread
From: Alex Kogan @ 2019-07-15 19:25 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, rahul.x.yadav, steven.sistare, daniel.m.jordan

Changes from v2:
----------------
- Patch in the NUMA-aware qspinlock at boot time on machines with
multiple NUMA nodes when the kernel is compiled with NUMA_AWARE_SPINLOCKS,
as suggested by Peter and Longman.

- CNA queue nodes encapsulate MCS queue nodes, similarly to paravirt nodes,
as suggested by Peter. MCS queue node size has been increased by 4 bytes.

- Use the existing next_pseudo_random32() instead of a custom xorshift
pseudo-random number generator, as suggested by Peter.

- Use cpu_to_node() to look up the NUMA node of a thread, as suggested
by Hanjun.

- Rewrote cna_pass_mcs_lock(), as suggested by Peter.

- We evaluated the patch on a single-node machine as well as in a paravirt 
environment (with virtme/qemu), as suggested by Peter and Longman.
Details are below.

- Our evaluation shows that CNA also improves performance of user
applications that have hot pthread mutexes, as the latter create contention
on spin locks protecting futex chains in the kernel when waiting threads
park and unpark. Details are below.


Summary
-------

Lock throughput can be increased by handing a lock to a waiter on the
same NUMA node as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA nodes. This patch introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock. It is
enabled through a configuration option (NUMA_AWARE_SPINLOCKS).

CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
organized in two queues, a main queue for threads running on the same
node as the current lock holder, and a secondary queue for threads
running on other nodes. Threads store the ID of the node on which
they are running in their queue nodes. At unlock time, the lock
holder scans the main queue looking for a thread running on the same
node. If one is found (call it thread T), all threads in the main queue
between the current lock holder and T are moved to the end of the
secondary queue, and the lock is passed to T. If no such thread is
found, the lock is passed to the first thread in the secondary queue.
Finally, if the secondary queue is empty, the lock is passed to the
next thread in the main queue. To avoid starvation of threads in the
secondary queue,
those threads are moved back to the head of the main queue
after a certain expected number of intra-node lock hand-offs.

More details are available at https://arxiv.org/abs/1810.05600.
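
For reference, the unlock-time decision can be sketched in C roughly as
follows. This is a simplified paraphrase of cna_pass_mcs_lock() from
patch 3 of this series, not the exact code, and it omits the starvation
avoidance and shuffle reduction added in patches 4 and 5:

/*
 * Simplified sketch of the CNA hand-off at unlock time. @node is the
 * lock holder's queue node and @next its successor in the main queue.
 * A @node->locked value greater than 1 encodes a pointer to the head
 * of the secondary queue.
 */
static void cna_pass_lock_sketch(struct mcs_spinlock *node,
				 struct mcs_spinlock *next)
{
	/* Look for a same-node waiter; skipped waiters move to the
	 * secondary queue. */
	struct cna_node *succ = find_successor(node);

	if (succ) {
		/* Pass the lock, and the secondary queue head stashed
		 * in @locked, to the same-node waiter. */
		arch_mcs_spin_unlock_contended(&succ->mcs.locked,
				node->locked ? node->locked : 1);
	} else if (node->locked > 1) {
		/* No same-node waiter: wake the head of the secondary
		 * queue and splice the rest of the main queue behind
		 * its tail. */
		succ = CNA_NODE(node->locked);
		succ->tail->mcs.next = next;
		arch_mcs_spin_unlock_contended(&succ->mcs.locked, 1);
	} else {
		/* Secondary queue is empty: pass the lock to the
		 * immediate successor in the main queue. */
		arch_mcs_spin_unlock_contended(&next->locked, 1);
	}
}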

We have done some performance evaluation with the locktorture module
as well as with several benchmarks from the will-it-scale repo.
The following locktorture results are from an Oracle X5-4 server
(four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
cores each). Each number represents an average (over 25 runs) of the
total number of ops (x10^7) reported at the end of each run. The 
standard deviation is also reported in parentheses, and is generally
about 3% of the average. The 'stock' kernel is v5.2.0-rc2,
commit f782099a96a0 ("Merge branch 'perf/core'"),
compiled in the default configuration. 'patch' is the modified
kernel compiled with NUMA_AWARE_SPINLOCKS not set; it is included to show
that any performance changes to the existing qspinlock implementation are
essentially noise. 'patch-CNA' is the modified kernel with 
NUMA_AWARE_SPINLOCKS set; the speedup is calculated by dividing
'patch-CNA' by 'stock'.

#thr  	 stock  	patch	     patch-CNA 	 speedup (patch-CNA/stock)
  1  2.687 (0.104)  2.655 (0.099)   2.706 (0.119)  1.007
  2  3.085 (0.104)  3.140 (0.128)   3.111 (0.147)  1.009
  4  4.230 (0.125)  4.217 (0.129)   4.482 (0.121)  1.060
  8  5.480 (0.159)  5.411 (0.183)   7.064 (0.218)  1.289
 16  6.733 (0.196)  6.764 (0.155)   8.666 (0.161)  1.287
 32  7.557 (0.148)  7.488 (0.133)   9.519 (0.253)  1.260
 36  7.667 (0.222)  7.654 (0.211)   9.530 (0.218)  1.243
 72  6.931 (0.172)  6.931 (0.187)  10.030 (0.217)  1.447
108  6.478 (0.098)  6.423 (0.107)  10.157 (0.250)  1.568
142  6.041 (0.102)  6.058 (0.111)  10.102 (0.260)  1.672

The following tables contain throughput results (ops/us) from the same
setup for will-it-scale/open1_threads: 

#thr  	 stock  	patch	     patch-CNA 	 speedup (patch-CNA/stock)
  1  0.536 (0.001)  0.540 (0.003)  0.538 (0.001)  1.002
  2  0.833 (0.020)  0.842 (0.028)  0.827 (0.025)  0.993
  4  1.464 (0.031)  1.473 (0.025)  1.465 (0.033)  1.001
  8  1.685 (0.087)  1.707 (0.078)  1.708 (0.104)  1.013
 16  1.715 (0.091)  1.777 (0.100)  1.766 (0.070)  1.029
 32  0.937 (0.065)  0.930 (0.078)  1.752 (0.072)  1.869
 36  0.930 (0.079)  0.927 (0.092)  1.731 (0.068)  1.862
 72  0.871 (0.037)  0.855 (0.038)  1.758 (0.071)  2.019
108  0.856 (0.044)  0.865 (0.042)  1.747 (0.063)  2.040
142  0.810 (0.051)  0.815 (0.041)  1.776 (0.064)  2.193

and will-it-scale/lock2_threads:

#thr  	 stock  	patch	     patch-CNA 	 speedup (patch-CNA/stock)
  1  1.631 (0.002)  1.638 (0.002)  1.637 (0.002)  1.004
  2  2.756 (0.076)  2.761 (0.063)  2.778 (0.081)  1.008
  4  5.119 (0.411)  5.256 (0.331)  5.138 (0.388)  1.004
  8  4.147 (0.215)  4.299 (0.264)  4.126 (0.322)  0.995
 16  4.214 (0.111)  4.234 (0.133)  4.133 (0.128)  0.981
 32  2.485 (0.095)  2.473 (0.117)  4.015 (0.115)  1.616
 36  2.423 (0.099)  2.451 (0.117)  3.963 (0.129)  1.636
 72  2.026 (0.102)  1.983 (0.108)  4.000 (0.122)  1.975
108  2.102 (0.088)  2.145 (0.080)  3.927 (0.108)  1.868
142  1.923 (0.128)  1.894 (0.100)  3.879 (0.081)  2.018

We also evaluated the patch on a single-node machine (Intel i7-4770 with 
4 hyperthreaded cores) with will-it-scale, and observed no meaningful
performance impact, as expected. For instance, below are results for 
will-it-scale/open1_threads:

#thr  	 stock        patch-CNA   speedup (patch-CNA/stock)
  1  0.861 (0.006)  0.867 (0.005)  1.007
  2  1.481 (0.015)  1.511 (0.017)  1.020
  4  2.671 (0.041)  2.697 (0.049)  1.010
  6  2.889 (0.064)  2.910 (0.060)  1.007

Furthermore, we evaluated the patch in a paravirt setup, booting the
kernel with virtme (qemu) and $(nproc) cores on the same Oracle X5-4 server
as above. We ran will-it-scale benchmarks, and once again observed
no meaningful performance impact. For instance, below are results for
will-it-scale/open1_threads:

#thr  	 stock 	      patch-CNA   speedup (patch-CNA/stock)
  1  0.761 (0.009)  0.763 (0.009)  1.003
  2  0.652 (0.043)  0.666 (0.033)  1.022
  4  0.591 (0.036)  0.596 (0.033)  1.008
  8  0.582 (0.019)  0.575 (0.020)  0.989
 16  0.680 (0.021)  0.685 (0.018)  1.007
 32  0.566 (0.031)  0.548 (0.049)  0.968
 36  0.549 (0.053)  0.531 (0.053)  0.966
 72  0.363 (0.012)  0.364 (0.008)  1.002
108  0.359 (0.010)  0.361 (0.009)  1.004
142  0.355 (0.011)  0.362 (0.011)  1.020

Our evaluation shows that CNA also improves performance of user 
applications that have hot pthread mutexes. Those mutexes are 
blocking, and waiting threads park and unpark via the futex 
mechanism in the kernel. Given that kernel futex chains, which
are hashed by the mutex address, are each protected by a 
chain-specific spin lock, the contention on a user-mode mutex 
translates into contention on a kernel-level spinlock.
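
To illustrate the pattern (this toy program is not one of our benchmarks;
thread and iteration counts are arbitrary), consider the sketch below.
Every contended lock/unlock pair goes through futex(2), and the kernel
takes the spinlock of the futex hash bucket keyed by &m on both the
wait and the wake side:

/* hot_mutex.c - toy example of a hot pthread mutex.
 * Build with: gcc -O2 -pthread hot_mutex.c
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS	64		/* arbitrary */
#define NITERS		1000000		/* arbitrary */

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static unsigned long counter;

static void *worker(void *arg)
{
	for (int i = 0; i < NITERS; i++) {
		pthread_mutex_lock(&m);		/* parks via FUTEX_WAIT when contended */
		counter++;			/* short critical section */
		pthread_mutex_unlock(&m);	/* FUTEX_WAKE when there are waiters */
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	printf("%lu\n", counter);
	return 0;
}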

Here are the results for the leveldb 'readrandom' benchmark:

#thr  	 stock        patch-CNA    speedup (patch-CNA/stock)
  1  0.479 (0.036)  0.533 (0.010)   1.113
  2  0.653 (0.022)  0.680 (0.027)   1.042
  4  0.705 (0.016)  0.701 (0.019)   0.995
  8  0.686 (0.021)  0.690 (0.024)   1.006
 16  0.708 (0.025)  0.719 (0.020)   1.016
 32  0.728 (0.023)  1.011 (0.117)   1.389
 36  0.720 (0.038)  1.073 (0.127)   1.491
 72  0.652 (0.018)  1.195 (0.017)   1.833
108  0.624 (0.016)  1.178 (0.028)   1.888
142  0.604 (0.015)  1.163 (0.024)   1.925

Further comments are welcome and appreciated.

Alex Kogan (5):
  locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic
  locking/qspinlock: Refactor the qspinlock slow path
  locking/qspinlock: Introduce CNA into the slow path of qspinlock
  locking/qspinlock: Introduce starvation avoidance into CNA
  locking/qspinlock: Introduce the shuffle reduction optimization into
    CNA

 arch/arm/include/asm/mcs_spinlock.h |   4 +-
 arch/x86/Kconfig                    |  18 +++
 arch/x86/include/asm/qspinlock.h    |   4 +
 arch/x86/kernel/alternative.c       |  12 ++
 kernel/locking/mcs_spinlock.h       |   8 +-
 kernel/locking/qspinlock.c          |  81 +++++++++++---
 kernel/locking/qspinlock_cna.h      | 218 ++++++++++++++++++++++++++++++++++++
 7 files changed, 326 insertions(+), 19 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

-- 
2.11.0 (Apple Git-81)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 1/5] locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic
  2019-07-15 19:25 [PATCH v3 0/5] Add NUMA-awareness to qspinlock Alex Kogan
@ 2019-07-15 19:25 ` Alex Kogan
  2019-07-16 10:23   ` Peter Zijlstra
  2019-07-15 19:25 ` [PATCH v3 2/5] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 33+ messages in thread
From: Alex Kogan @ 2019-07-15 19:25 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, rahul.x.yadav, steven.sistare, daniel.m.jordan

Make the arch_mcs_spin_unlock_contended() macro accept the value to be
stored into the lock as an additional argument. This allows the same
macro to be used in cases where the value to be stored is different from 1.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 arch/arm/include/asm/mcs_spinlock.h | 4 ++--
 kernel/locking/mcs_spinlock.h       | 6 +++---
 kernel/locking/qspinlock.c          | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf4d06f..ae6d763477f4 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -14,9 +14,9 @@ do {									\
 		wfe();							\
 } while (0)								\
 
-#define arch_mcs_spin_unlock_contended(lock)				\
+#define arch_mcs_spin_unlock_contended(lock, val)			\
 do {									\
-	smp_store_release(lock, 1);					\
+	smp_store_release(lock, (val));					\
 	dsb_sev();							\
 } while (0)
 
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153b4d3c..bc6d3244e1af 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -41,8 +41,8 @@ do {									\
  * operations in the critical section has been completed before
  * unlocking.
  */
-#define arch_mcs_spin_unlock_contended(l)				\
-	smp_store_release((l), 1)
+#define arch_mcs_spin_unlock_contended(l, val)				\
+	smp_store_release((l), (val))
 #endif
 
 /*
@@ -115,7 +115,7 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	}
 
 	/* Pass lock to next waiter. */
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_spin_unlock_contended(&next->locked, 1);
 }
 
 #endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index e14b32c69639..961781624638 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -558,7 +558,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_spin_unlock_contended(&next->locked, 1);
 	pv_kick_node(lock, next);
 
 release:
-- 
2.11.0 (Apple Git-81)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 2/5] locking/qspinlock: Refactor the qspinlock slow path
  2019-07-15 19:25 [PATCH v3 0/5] Add NUMA-awareness to qspinlock Alex Kogan
  2019-07-15 19:25 ` [PATCH v3 1/5] locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic Alex Kogan
@ 2019-07-15 19:25 ` Alex Kogan
  2019-07-16 10:20   ` Peter Zijlstra
  2019-07-15 19:25 ` [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 33+ messages in thread
From: Alex Kogan @ 2019-07-15 19:25 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, rahul.x.yadav, steven.sistare, daniel.m.jordan

Move some of the code manipulating the spin lock into separate functions.
This would allow easier integration of alternative ways to manipulate
that lock.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 kernel/locking/qspinlock.c | 40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 961781624638..5668466b3006 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -297,6 +297,36 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
 #endif
 
+/*
+ * set_locked_empty_mcs - Try to set the spinlock value to _Q_LOCKED_VAL,
+ * and by doing that unlock the MCS lock when its waiting queue is empty
+ * @lock: Pointer to queued spinlock structure
+ * @val: Current value of the lock
+ * @node: Pointer to the MCS node of the lock holder
+ *
+ * *,*,* -> 0,0,1
+ */
+static __always_inline bool __set_locked_empty_mcs(struct qspinlock *lock,
+						   u32 val,
+						   struct mcs_spinlock *node)
+{
+	return atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL);
+}
+
+/*
+ * pass_mcs_lock - pass the MCS lock to the next waiter
+ * @node: Pointer to the MCS node of the lock holder
+ * @next: Pointer to the MCS node of the first waiter in the MCS queue
+ */
+static __always_inline void __pass_mcs_lock(struct mcs_spinlock *node,
+					    struct mcs_spinlock *next)
+{
+	arch_mcs_spin_unlock_contended(&next->locked, 1);
+}
+
+#define set_locked_empty_mcs	__set_locked_empty_mcs
+#define pass_mcs_lock		__pass_mcs_lock
+
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
 /**
@@ -541,7 +571,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *       PENDING will make the uncontended transition fail.
 	 */
 	if ((val & _Q_TAIL_MASK) == tail) {
-		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+		if (set_locked_empty_mcs(lock, val, node))
 			goto release; /* No contention */
 	}
 
@@ -558,7 +588,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_spin_unlock_contended(&next->locked, 1);
+	pass_mcs_lock(node, next);
 	pv_kick_node(lock, next);
 
 release:
@@ -583,6 +613,12 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #undef pv_kick_node
 #undef pv_wait_head_or_lock
 
+#undef set_locked_empty_mcs
+#define set_locked_empty_mcs		__set_locked_empty_mcs
+
+#undef pass_mcs_lock
+#define pass_mcs_lock			__pass_mcs_lock
+
 #undef  queued_spin_lock_slowpath
 #define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
 
-- 
2.11.0 (Apple Git-81)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-15 19:25 [PATCH v3 0/5] Add NUMA-awareness to qspinlock Alex Kogan
  2019-07-15 19:25 ` [PATCH v3 1/5] locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic Alex Kogan
  2019-07-15 19:25 ` [PATCH v3 2/5] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
@ 2019-07-15 19:25 ` Alex Kogan
  2019-07-15 21:30   ` Waiman Long
                     ` (3 more replies)
  2019-07-15 19:25 ` [PATCH v3 4/5] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
                   ` (2 subsequent siblings)
  5 siblings, 4 replies; 33+ messages in thread
From: Alex Kogan @ 2019-07-15 19:25 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, rahul.x.yadav, steven.sistare, daniel.m.jordan

In CNA, spinning threads are organized in two queues, a main queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. At the unlock time,
the lock holder scans the main queue looking for a thread running on
the same node. If found (call it thread T), all threads in the main queue
between the current lock holder and T are moved to the end of the
secondary queue, and the lock is passed to T. If such T is not found, the
lock is passed to the first node in the secondary queue. Finally, if the
secondary queue is empty, the lock is passed to the next thread in the
main queue. For more details, see https://arxiv.org/abs/1810.05600.

Note that this variant of CNA may introduce starvation by continuously
passing the lock to threads running on the same node. This issue
will be addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). The CNA variant is patched in
at the boot time only if we run a multi-node machine, and the new
config is enabled. For the time being, the patching requires
CONFIG_PARAVIRT_SPINLOCKS to be enabled as well.
However, this should be resolved once static_call() is available.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 arch/x86/Kconfig                 |  18 +++++
 arch/x86/include/asm/qspinlock.h |   4 +
 arch/x86/kernel/alternative.c    |  12 +++
 kernel/locking/mcs_spinlock.h    |   2 +-
 kernel/locking/qspinlock.c       |  41 +++++++---
 kernel/locking/qspinlock_cna.h   | 164 +++++++++++++++++++++++++++++++++++++++
 6 files changed, 229 insertions(+), 12 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bbbd4d1ba31..1d8f80c47687 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1548,6 +1548,24 @@ config NUMA
 
 	  Otherwise, you should say N.
 
+config NUMA_AWARE_SPINLOCKS
+	bool "Numa-aware spinlocks"
+	depends on NUMA
+	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
+	# This is awkward, but hopefully would be resolved once static_call()
+	# is available.
+	depends on PARAVIRT_SPINLOCKS
+	default y
+	help
+	  Introduce NUMA (Non Uniform Memory Access) awareness into
+	  the slow path of spinlocks.
+
+	  The kernel will try to keep the lock on the same node,
+	  thus reducing the number of remote cache misses, while
+	  trading some of the short term fairness for better performance.
+
+	  Say N if you want absolute first come first serve fairness.
+
 config AMD_NUMA
 	def_bool y
 	prompt "Old style AMD Opteron NUMA detection"
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index bd5ac6cc37db..d9b6c34d5eb4 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -27,6 +27,10 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
 	return val;
 }
 
+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+extern void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+#endif
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0d57015114e7..1c25f0505ec0 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -649,6 +649,18 @@ void __init alternative_instructions(void)
 				(unsigned long)__smp_locks_end);
 #endif
 
+#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+	/*
+	 * If we have multiple NUMA nodes, switch from native
+	 * to the NUMA-friendly slow path for spin locks.
+	 */
+	if (nr_node_ids > 1 && pv_ops.lock.queued_spin_lock_slowpath ==
+			native_queued_spin_lock_slowpath) {
+		pv_ops.lock.queued_spin_lock_slowpath =
+			__cna_queued_spin_lock_slowpath;
+	}
+#endif
+
 	apply_paravirt(__parainstructions, __parainstructions_end);
 
 	restart_nmi();
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index bc6d3244e1af..36b802babc88 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,7 +17,7 @@
 
 struct mcs_spinlock {
 	struct mcs_spinlock *next;
-	int locked; /* 1 if lock acquired */
+	u64 locked; /* 1 if lock acquired */
 	int count;  /* nesting count, see qspinlock.c */
 };
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 5668466b3006..1ba38f85d0ae 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -20,7 +20,7 @@
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
 
 #include <linux/smp.h>
 #include <linux/bug.h>
@@ -77,18 +77,14 @@
 #define MAX_NODES	4
 
 /*
- * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
- * size and four of them will fit nicely in one 64-byte cacheline. For
- * pvqspinlock, however, we need more space for extra data. To accommodate
- * that, we insert two more long words to pad it up to 32 bytes. IOW, only
- * two of them can fit in a cacheline in this case. That is OK as it is rare
- * to have more than 2 levels of slowpath nesting in actual use. We don't
- * want to penalize pvqspinlocks to optimize for a rare case in native
- * qspinlocks.
+ * On 64-bit architectures, the mcs_spinlock structure will be 20 bytes in
+ * size. For pvqspinlock or the NUMA-aware variant, however, we need more
+ * space for extra data. To accommodate that, we insert two more long words
+ * to pad it up to 36 bytes.
  */
 struct qnode {
 	struct mcs_spinlock mcs;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 	long reserved[2];
 #endif
 };
@@ -327,7 +323,7 @@ static __always_inline void __pass_mcs_lock(struct mcs_spinlock *node,
 #define set_locked_empty_mcs	__set_locked_empty_mcs
 #define pass_mcs_lock		__pass_mcs_lock
 
-#endif /* _GEN_PV_LOCK_SLOWPATH */
+#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
 
 /**
  * queued_spin_lock_slowpath - acquire the queued spinlock
@@ -600,6 +596,29 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
 /*
+ * Generate the code for NUMA-aware spin locks
+ */
+#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+#define _GEN_CNA_LOCK_SLOWPATH
+
+#undef pv_init_node
+#define pv_init_node cna_init_node
+
+#undef set_locked_empty_mcs
+#define set_locked_empty_mcs		cna_set_locked_empty_mcs
+
+#undef pass_mcs_lock
+#define pass_mcs_lock			cna_pass_mcs_lock
+
+#undef  queued_spin_lock_slowpath
+#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
+
+#include "qspinlock_cna.h"
+#include "qspinlock.c"
+
+#endif
+
+/*
  * Generate the paravirt code for queued_spin_unlock_slowpath().
  */
 #if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
new file mode 100644
index 000000000000..efb9b12b2f9b
--- /dev/null
+++ b/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,164 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+#include <linux/topology.h>
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a main queue for
+ * threads running on the same node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. At the unlock time,
+ * the lock holder scans the main queue looking for a thread running on
+ * the same node. If found (call it thread T), all threads in the main queue
+ * between the current lock holder and T are moved to the end of the
+ * secondary queue, and the lock is passed to T. If such T is not found, the
+ * lock is passed to the first node in the secondary queue. Finally, if the
+ * secondary queue is empty, the lock is passed to the next thread in the
+ * main queue. To avoid starvation of threads in the secondary queue,
+ * those threads are moved back to the head of the main queue
+ * after a certain expected number of intra-node lock hand-offs.
+ *
+ * For more details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <alex.kogan@oracle.com>
+ *          Dave Dice <dave.dice@oracle.com>
+ */
+
+struct cna_node {
+	struct	mcs_spinlock mcs;
+	u32	numa_node;
+	u32	encoded_tail;
+	struct	cna_node *tail;    /* points to the secondary queue tail */
+};
+
+#define CNA_NODE(ptr) ((struct cna_node *)(ptr))
+
+static void cna_init_node(struct mcs_spinlock *node)
+{
+	struct cna_node *cn = CNA_NODE(node);
+	struct mcs_spinlock *base_node;
+	int cpuid;
+
+	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+	/* we store a pointer in the node's @locked field */
+	BUILD_BUG_ON(sizeof(uintptr_t) > sizeof_field(struct mcs_spinlock, locked));
+
+	cpuid = smp_processor_id();
+	cn->numa_node = cpu_to_node(cpuid);
+
+	base_node = this_cpu_ptr(&qnodes[0].mcs);
+	cn->encoded_tail = encode_tail(cpuid, base_node->count - 1);
+}
+
+/**
+ * find_successor - Scan the main waiting queue looking for the first
+ * thread running on the same node as the lock holder. If found (call it
+ * thread T), move all threads in the main queue between the lock holder
+ * and T to the end of the secondary queue and return T; otherwise, return NULL.
+ */
+static struct cna_node *find_successor(struct mcs_spinlock *me)
+{
+	struct cna_node *me_cna = CNA_NODE(me);
+	struct cna_node *head_other, *tail_other, *cur;
+	struct cna_node *next = CNA_NODE(READ_ONCE(me->next));
+	int my_node;
+
+	/* @next should be set, else we would not be calling this function. */
+	WARN_ON_ONCE(next == NULL);
+
+	my_node = me_cna->numa_node;
+
+	/*
+	 * Fast path - check whether the immediate successor runs on
+	 * the same node.
+	 */
+	if (next->numa_node == my_node)
+		return next;
+
+	head_other = next;
+	tail_other = next;
+
+	/*
+	 * Traverse the main waiting queue starting from the successor of my
+	 * successor, and look for a thread running on the same node.
+	 */
+	cur = CNA_NODE(READ_ONCE(next->mcs.next));
+	while (cur) {
+		if (cur->numa_node == my_node) {
+			/*
+			 * Found a thread on the same node. Move threads
+			 * between me and that node into the secondary queue.
+			 */
+			if (me->locked > 1)
+				CNA_NODE(me->locked)->tail->mcs.next =
+					(struct mcs_spinlock *)head_other;
+			else
+				me->locked = (uintptr_t)head_other;
+			tail_other->mcs.next = NULL;
+			CNA_NODE(me->locked)->tail = tail_other;
+			return cur;
+		}
+		tail_other = cur;
+		cur = CNA_NODE(READ_ONCE(cur->mcs.next));
+	}
+	return NULL;
+}
+
+static inline bool cna_set_locked_empty_mcs(struct qspinlock *lock, u32 val,
+					struct mcs_spinlock *node)
+{
+	/* Check whether the secondary queue is empty. */
+	if (node->locked <= 1) {
+		if (atomic_try_cmpxchg_relaxed(&lock->val, &val,
+				_Q_LOCKED_VAL))
+			return true; /* No contention */
+	} else {
+		/*
+		 * Pass the lock to the first thread in the secondary
+		 * queue, but first try to update the queue's tail to
+		 * point to the last node in the secondary queue.
+		 */
+		struct cna_node *succ = CNA_NODE(node->locked);
+		u32 new = succ->tail->encoded_tail + _Q_LOCKED_VAL;
+
+		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
+			arch_mcs_spin_unlock_contended(&succ->mcs.locked, 1);
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static inline void cna_pass_mcs_lock(struct mcs_spinlock *node,
+				     struct mcs_spinlock *next)
+{
+	struct cna_node *succ = NULL;
+	u64 *var = &next->locked;
+	u64 val = 1;
+
+	succ = find_successor(node);
+
+	if (succ) {
+		var = &succ->mcs.locked;
+		/*
+		 * We unlock a successor by passing a non-zero value,
+		 * so set @val to 1 iff @locked is 0, which will happen
+		 * if we acquired the MCS lock when its queue was empty
+		 */
+		val = node->locked + (node->locked == 0);
+	} else if (node->locked > 1) { /* if the secondary queue is not empty */
+		/* pass the lock to the first node in that queue */
+		succ = CNA_NODE(node->locked);
+		succ->tail->mcs.next = next;
+		var = &succ->mcs.locked;
+	}	/*
+		 * Otherwise, pass the lock to the immediate successor
+		 * in the main queue.
+		 */
+
+	arch_mcs_spin_unlock_contended(var, val);
+}
-- 
2.11.0 (Apple Git-81)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 4/5] locking/qspinlock: Introduce starvation avoidance into CNA
  2019-07-15 19:25 [PATCH v3 0/5] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (2 preceding siblings ...)
  2019-07-15 19:25 ` [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
@ 2019-07-15 19:25 ` Alex Kogan
  2019-07-16 15:59   ` Peter Zijlstra
  2019-07-15 19:25 ` [PATCH v3 5/5] locking/qspinlock: Introduce the shuffle reduction optimization " Alex Kogan
  2019-07-16 11:47 ` [PATCH v3 0/5] Add NUMA-awareness to qspinlock Nicholas Piggin
  5 siblings, 1 reply; 33+ messages in thread
From: Alex Kogan @ 2019-07-15 19:25 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, rahul.x.yadav, steven.sistare, daniel.m.jordan

Choose the next lock holder among spinning threads running on the same
node with high probability rather than always. With small probability,
hand the lock to the first thread in the secondary queue or, if that
queue is empty, to the immediate successor of the current lock holder
in the main queue.  Thus, assuming no failures while threads hold the
lock, every thread would be able to acquire the lock after a bounded
number of lock transitions, with high probability.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 kernel/locking/qspinlock_cna.h | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index efb9b12b2f9b..3de5be813a46 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,6 +4,7 @@
 #endif
 
 #include <linux/topology.h>
+#include <linux/random.h>
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -36,6 +37,33 @@ struct cna_node {
 
 #define CNA_NODE(ptr) ((struct cna_node *)(ptr))
 
+/* Per-CPU pseudo-random number seed */
+static DEFINE_PER_CPU(u32, seed);
+
+/*
+ * Controls the probability for intra-node lock hand-off. It can be
+ * tuned and depend, e.g., on the number of CPUs per node. For now,
+ * choose a value that provides reasonable long-term fairness without
+ * sacrificing performance compared to a version that does not have any
+ * fairness guarantees.
+ */
+#define INTRA_NODE_HANDOFF_PROB_ARG 0x10000
+
+/*
+ * Return false with probability 1 / @range.
+ * @range must be a power of 2.
+ */
+static bool probably(unsigned int range)
+{
+	u32 s;
+
+	s = this_cpu_read(seed);
+	s = next_pseudo_random32(s);
+	this_cpu_write(seed, s);
+
+	return s & (range - 1);
+}
+
 static void cna_init_node(struct mcs_spinlock *node)
 {
 	struct cna_node *cn = CNA_NODE(node);
@@ -140,7 +168,13 @@ static inline void cna_pass_mcs_lock(struct mcs_spinlock *node,
 	u64 *var = &next->locked;
 	u64 val = 1;
 
-	succ = find_successor(node);
+	/*
+	 * Try to pass the lock to a thread running on the same node.
+	 * For long-term fairness, search for such a thread with high
+	 * probability rather than always.
+	 */
+	if (probably(INTRA_NODE_HANDOFF_PROB_ARG))
+		succ = find_successor(node);
 
 	if (succ) {
 		var = &succ->mcs.locked;
-- 
2.11.0 (Apple Git-81)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 5/5] locking/qspinlock: Introduce the shuffle reduction optimization into CNA
  2019-07-15 19:25 [PATCH v3 0/5] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (3 preceding siblings ...)
  2019-07-15 19:25 ` [PATCH v3 4/5] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
@ 2019-07-15 19:25 ` Alex Kogan
  2019-07-16 11:47 ` [PATCH v3 0/5] Add NUMA-awareness to qspinlock Nicholas Piggin
  5 siblings, 0 replies; 33+ messages in thread
From: Alex Kogan @ 2019-07-15 19:25 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, rahul.x.yadav, steven.sistare, daniel.m.jordan

This optimization reduces the probability threads will be shuffled between
the main and secondary queues when the secondary queue is empty.
It is helpful when the lock is only lightly contended.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 kernel/locking/qspinlock_cna.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 3de5be813a46..9e6bd9e6d82b 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -50,6 +50,15 @@ static DEFINE_PER_CPU(u32, seed);
 #define INTRA_NODE_HANDOFF_PROB_ARG 0x10000
 
 /*
+ * Controls the probability for enabling the scan of the main queue when
+ * the secondary queue is empty. The chosen value reduces the amount of
+ * unnecessary shuffling of threads between the two waiting queues when
+ * the contention is low, while responding fast enough and enabling
+ * the shuffling when the contention is high.
+ */
+#define SHUFFLE_REDUCTION_PROB_ARG  0x80
+
+/*
  * Return false with probability 1 / @range.
  * @range must be a power of 2.
  */
@@ -169,6 +178,16 @@ static inline void cna_pass_mcs_lock(struct mcs_spinlock *node,
 	u64 val = 1;
 
 	/*
+	 * Limit thread shuffling when the secondary queue is empty.
+	 * This copes with the overhead the shuffling creates when the
+	 * lock is only lightly contended, and threads do not stay
+	 * in the secondary queue long enough to reap the benefit of moving
+	 * them there.
+	 */
+	if (node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG))
+		goto pass_lock;
+
+	/*
 	 * Try to pass the lock to a thread running on the same node.
 	 * For long-term fairness, search for such a thread with high
 	 * probability rather than always.
@@ -194,5 +213,6 @@ static inline void cna_pass_mcs_lock(struct mcs_spinlock *node,
 		 * in the main queue.
 		 */
 
+pass_lock:
 	arch_mcs_spin_unlock_contended(var, val);
 }
-- 
2.11.0 (Apple Git-81)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-15 19:25 ` [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
@ 2019-07-15 21:30   ` Waiman Long
  2019-07-16 11:04     ` Peter Zijlstra
                       ` (2 more replies)
  2019-07-16 11:05   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 3 replies; 33+ messages in thread
From: Waiman Long @ 2019-07-15 21:30 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: rahul.x.yadav, dave.dice, steven.sistare, daniel.m.jordan

On 7/15/19 3:25 PM, Alex Kogan wrote:
> In CNA, spinning threads are organized in two queues, a main queue for
> threads running on the same node as the current lock holder, and a
> secondary queue for threads running on other nodes. At the unlock time,
> the lock holder scans the main queue looking for a thread running on
> the same node. If found (call it thread T), all threads in the main queue
> between the current lock holder and T are moved to the end of the
> secondary queue, and the lock is passed to T. If such T is not found, the
> lock is passed to the first node in the secondary queue. Finally, if the
> secondary queue is empty, the lock is passed to the next thread in the
> main queue. For more details, see https://arxiv.org/abs/1810.05600.
>
> Note that this variant of CNA may introduce starvation by continuously
> passing the lock to threads running on the same node. This issue
> will be addressed later in the series.
>
> Enabling CNA is controlled via a new configuration option
> (NUMA_AWARE_SPINLOCKS). The CNA variant is patched in
> at the boot time only if we run a multi-node machine, and the new
> config is enabled. For the time being, the patching requires
> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well.
> However, this should be resolved once static_call() is available.
>
> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  arch/x86/Kconfig                 |  18 +++++
>  arch/x86/include/asm/qspinlock.h |   4 +
>  arch/x86/kernel/alternative.c    |  12 +++
>  kernel/locking/mcs_spinlock.h    |   2 +-
>  kernel/locking/qspinlock.c       |  41 +++++++---
>  kernel/locking/qspinlock_cna.h   | 164 +++++++++++++++++++++++++++++++++++++++
>  6 files changed, 229 insertions(+), 12 deletions(-)
>  create mode 100644 kernel/locking/qspinlock_cna.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2bbbd4d1ba31..1d8f80c47687 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1548,6 +1548,24 @@ config NUMA
>  
>  	  Otherwise, you should say N.
>  
> +config NUMA_AWARE_SPINLOCKS
> +	bool "Numa-aware spinlocks"
> +	depends on NUMA
> +	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
> +	# This is awkward, but hopefully would be resolved once static_call()
> +	# is available.
> +	depends on PARAVIRT_SPINLOCKS
> +	default y
> +	help
> +	  Introduce NUMA (Non Uniform Memory Access) awareness into
> +	  the slow path of spinlocks.
> +
> +	  The kernel will try to keep the lock on the same node,
> +	  thus reducing the number of remote cache misses, while
> +	  trading some of the short term fairness for better performance.
> +
> +	  Say N if you want absolute first come first serve fairness.

You should also add a dependency on QUEUED_SPINLOCKS to highlight the
fact that it is a variant of qspinlock. You should also mention that in
the help text.


> +
>  config AMD_NUMA
>  	def_bool y
>  	prompt "Old style AMD Opteron NUMA detection"
> diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
> index bd5ac6cc37db..d9b6c34d5eb4 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -27,6 +27,10 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
>  	return val;
>  }
>  
> +#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
> +extern void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
> +#endif
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
>  extern void __pv_init_lock_hash(void);
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 0d57015114e7..1c25f0505ec0 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -649,6 +649,18 @@ void __init alternative_instructions(void)
>  				(unsigned long)__smp_locks_end);
>  #endif
>  
> +#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
> +	/*
> +	 * If we have multiple NUMA nodes, switch from native
> +	 * to the NUMA-friendly slow path for spin locks.
> +	 */
> +	if (nr_node_ids > 1 && pv_ops.lock.queued_spin_lock_slowpath ==
> +			native_queued_spin_lock_slowpath) {
> +		pv_ops.lock.queued_spin_lock_slowpath =
> +			__cna_queued_spin_lock_slowpath;
> +	}
> +#endif
> +
>  	apply_paravirt(__parainstructions, __parainstructions_end);
>  
>  	restart_nmi();
> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index bc6d3244e1af..36b802babc88 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -17,7 +17,7 @@
>  
>  struct mcs_spinlock {
>  	struct mcs_spinlock *next;
> -	int locked; /* 1 if lock acquired */
> +	u64 locked; /* 1 if lock acquired */
>  	int count;  /* nesting count, see qspinlock.c */
>  };
>  
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 5668466b3006..1ba38f85d0ae 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -20,7 +20,7 @@
>   *          Peter Zijlstra <peterz@infradead.org>
>   */
>  
> -#ifndef _GEN_PV_LOCK_SLOWPATH
> +#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
>  
>  #include <linux/smp.h>
>  #include <linux/bug.h>
> @@ -77,18 +77,14 @@
>  #define MAX_NODES	4
>  
>  /*
> - * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
> - * size and four of them will fit nicely in one 64-byte cacheline. For
> - * pvqspinlock, however, we need more space for extra data. To accommodate
> - * that, we insert two more long words to pad it up to 32 bytes. IOW, only
> - * two of them can fit in a cacheline in this case. That is OK as it is rare
> - * to have more than 2 levels of slowpath nesting in actual use. We don't
> - * want to penalize pvqspinlocks to optimize for a rare case in native
> - * qspinlocks.
> + * On 64-bit architectures, the mcs_spinlock structure will be 20 bytes in
> + * size. For pvqspinlock or the NUMA-aware variant, however, we need more
> + * space for extra data. To accommodate that, we insert two more long words
> + * to pad it up to 36 bytes.
>   */
The 20 bytes figure is wrong. It is actually 24 bytes for 64-bit as the
mcs_spinlock structure is 8-byte aligned. For better cacheline
alignment, I would like to keep mcs_spinlock at 16 bytes as before.
Instead, you can use encode_tail() to store the CNA node pointer in
"locked". For instance, use (encode_tail() << 1) in locked to
distinguish it from the regular locked=1 value.
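
Something along these lines (rough and untested, helper names are just
for illustration):

/*
 * Keep mcs_spinlock->locked an int; stash the encoded tail of the
 * secondary queue head in it, shifted left by one bit so that it can
 * never be confused with the regular locked=1 value.
 */
static inline void cna_set_secondary_head(struct mcs_spinlock *me,
					  struct cna_node *head)
{
	me->locked = head->encoded_tail << 1;
}

static inline struct cna_node *cna_secondary_head(struct mcs_spinlock *me)
{
	if (me->locked <= 1)	/* 0 or 1: no secondary queue */
		return NULL;

	return CNA_NODE(decode_tail(me->locked >> 1));
}
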
>  struct qnode {
>  	struct mcs_spinlock mcs;
> -#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
>  	long reserved[2];
>  #endif
>  };
> @@ -327,7 +323,7 @@ static __always_inline void __pass_mcs_lock(struct mcs_spinlock *node,
>  #define set_locked_empty_mcs	__set_locked_empty_mcs
>  #define pass_mcs_lock		__pass_mcs_lock
>  
> -#endif /* _GEN_PV_LOCK_SLOWPATH */
> +#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
>  
>  /**
>   * queued_spin_lock_slowpath - acquire the queued spinlock
> @@ -600,6 +596,29 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  EXPORT_SYMBOL(queued_spin_lock_slowpath);
>  
>  /*
> + * Generate the code for NUMA-aware spin locks
> + */
> +#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
> +#define _GEN_CNA_LOCK_SLOWPATH
> +
> +#undef pv_init_node
> +#define pv_init_node cna_init_node
> +
> +#undef set_locked_empty_mcs
> +#define set_locked_empty_mcs		cna_set_locked_empty_mcs
> +
> +#undef pass_mcs_lock
> +#define pass_mcs_lock			cna_pass_mcs_lock
> +
> +#undef  queued_spin_lock_slowpath
> +#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
> +
> +#include "qspinlock_cna.h"
> +#include "qspinlock.c"
> +
> +#endif
> +
> +/*
>   * Generate the paravirt code for queued_spin_unlock_slowpath().
>   */
>  #if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
> new file mode 100644
> index 000000000000..efb9b12b2f9b
> --- /dev/null
> +++ b/kernel/locking/qspinlock_cna.h
> @@ -0,0 +1,164 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _GEN_CNA_LOCK_SLOWPATH
> +#error "do not include this file"
> +#endif
> +
> +#include <linux/topology.h>
> +
> +/*
> + * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
> + *
> + * In CNA, spinning threads are organized in two queues, a main queue for
> + * threads running on the same node as the current lock holder, and a
> + * secondary queue for threads running on other nodes. At the unlock time,
> + * the lock holder scans the main queue looking for a thread running on
> + * the same node. If found (call it thread T), all threads in the main queue
> + * between the current lock holder and T are moved to the end of the
> + * secondary queue, and the lock is passed to T. If such T is not found, the
> + * lock is passed to the first node in the secondary queue. Finally, if the
> + * secondary queue is empty, the lock is passed to the next thread in the
> + * main queue. To avoid starvation of threads in the secondary queue,
> + * those threads are moved back to the head of the main queue
> + * after a certain expected number of intra-node lock hand-offs.
> + *
> + * For more details, see https://arxiv.org/abs/1810.05600.
> + *
> + * Authors: Alex Kogan <alex.kogan@oracle.com>
> + *          Dave Dice <dave.dice@oracle.com>
> + */
> +
> +struct cna_node {
> +	struct	mcs_spinlock mcs;
> +	u32	numa_node;
> +	u32	encoded_tail;
> +	struct	cna_node *tail;    /* points to the secondary queue tail */
> +};
> +
> +#define CNA_NODE(ptr) ((struct cna_node *)(ptr))
> +
> +static void cna_init_node(struct mcs_spinlock *node)
> +{
> +	struct cna_node *cn = CNA_NODE(node);
> +	struct mcs_spinlock *base_node;
> +	int cpuid;
> +
> +	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
> +	/* we store a pointer in the node's @locked field */
> +	BUILD_BUG_ON(sizeof(uintptr_t) > sizeof_field(struct mcs_spinlock, locked));
> +
> +	cpuid = smp_processor_id();
> +	cn->numa_node = cpu_to_node(cpuid);
> +
> +	base_node = this_cpu_ptr(&qnodes[0].mcs);
> +	cn->encoded_tail = encode_tail(cpuid, base_node->count - 1);
> +}


I think you can use an early_init call to initialize the numa_node and
encoded_tail values for all the per-cpu CNA nodes instead of doing it
every time a node is used. If it turns out that pv_qspinlock is used,
the pv_node_init() will properly re-initialize it. The only thing left
to do here is perhaps setting tail to NULL.
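
E.g. something like this (untested, and the exact initcall level is just
a guess):

static int __init cna_init_nodes(void)
{
	unsigned int cpu;

	for_each_possible_cpu(cpu) {
		int i;

		for (i = 0; i < MAX_NODES; i++) {
			struct cna_node *cn =
				CNA_NODE(&per_cpu(qnodes[i].mcs, cpu));

			cn->numa_node = cpu_to_node(cpu);
			cn->encoded_tail = encode_tail(cpu, i);
			cn->tail = NULL;
		}
	}
	return 0;
}
early_initcall(cna_init_nodes);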

-Longman


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 2/5] locking/qspinlock: Refactor the qspinlock slow path
  2019-07-15 19:25 ` [PATCH v3 2/5] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
@ 2019-07-16 10:20   ` Peter Zijlstra
  2019-07-16 14:53     ` Alex Kogan
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-16 10:20 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Mon, Jul 15, 2019 at 03:25:33PM -0400, Alex Kogan wrote:

> +/*
> + * set_locked_empty_mcs - Try to set the spinlock value to _Q_LOCKED_VAL,
> + * and by doing that unlock the MCS lock when its waiting queue is empty
> + * @lock: Pointer to queued spinlock structure
> + * @val: Current value of the lock
> + * @node: Pointer to the MCS node of the lock holder
> + *
> + * *,*,* -> 0,0,1
> + */
> +static __always_inline bool __set_locked_empty_mcs(struct qspinlock *lock,
> +						   u32 val,
> +						   struct mcs_spinlock *node)
> +{
> +	return atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL);
> +}

That name is nonsense. It should be something like:

static __always_inline bool __try_clear_tail(...)


> +/*
> + * pass_mcs_lock - pass the MCS lock to the next waiter
> + * @node: Pointer to the MCS node of the lock holder
> + * @next: Pointer to the MCS node of the first waiter in the MCS queue
> + */
> +static __always_inline void __pass_mcs_lock(struct mcs_spinlock *node,
> +					    struct mcs_spinlock *next)
> +{
> +	arch_mcs_spin_unlock_contended(&next->locked, 1);
> +}

I'm not entirely happy with that name either; but it's not horrible like
the other one. Why not mcs_spin_unlock_contended() ?

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 1/5] locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic
  2019-07-15 19:25 ` [PATCH v3 1/5] locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic Alex Kogan
@ 2019-07-16 10:23   ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-16 10:23 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Mon, Jul 15, 2019 at 03:25:32PM -0400, Alex Kogan wrote:

> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index e14b32c69639..961781624638 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -558,7 +558,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	if (!next)
>  		next = smp_cond_load_relaxed(&node->next, (VAL));
>  
> -	arch_mcs_spin_unlock_contended(&next->locked);
> +	arch_mcs_spin_unlock_contended(&next->locked, 1);
>  	pv_kick_node(lock, next);

My problem with this patch is that the above reads really daft. Should
we rename the whole function? arch_mcs_pass_lock() perhaps?

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-15 21:30   ` Waiman Long
@ 2019-07-16 11:04     ` Peter Zijlstra
  2019-07-16 14:26     ` Alex Kogan
       [not found]     ` <aa73b86d-902a-bb6f-d372-8645c8299a6d@redhat.com>
  2 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-16 11:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, linux-kernel, rahul.x.yadav, mingo, bp, hpa,
	Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On Mon, Jul 15, 2019 at 05:30:01PM -0400, Waiman Long wrote:
> On 7/15/19 3:25 PM, Alex Kogan wrote:
> >  /*
> > - * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
> > - * size and four of them will fit nicely in one 64-byte cacheline. For
> > - * pvqspinlock, however, we need more space for extra data. To accommodate
> > - * that, we insert two more long words to pad it up to 32 bytes. IOW, only
> > - * two of them can fit in a cacheline in this case. That is OK as it is rare
> > - * to have more than 2 levels of slowpath nesting in actual use. We don't
> > - * want to penalize pvqspinlocks to optimize for a rare case in native
> > - * qspinlocks.
> > + * On 64-bit architectures, the mcs_spinlock structure will be 20 bytes in
> > + * size. For pvqspinlock or the NUMA-aware variant, however, we need more
> > + * space for extra data. To accommodate that, we insert two more long words
> > + * to pad it up to 36 bytes.
> >   */

> The 20 bytes figure is wrong. It is actually 24 bytes for 64-bit as the
> mcs_spinlock structure is 8-byte aligned. For better cacheline
> alignment, I would like to keep mcs_spinlock at 16 bytes as before.
> Instead, you can use encode_tail() to store the CNA node pointer in
> "locked". For instance, use (encode_tail() << 1) in locked to
> distinguish it from the regular locked=1 value.

Yes, please don't bloat this. I already don't like what Waiman did for
the paravirt case, but this is horrible.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-15 19:25 ` [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
  2019-07-15 21:30   ` Waiman Long
@ 2019-07-16 11:05   ` Peter Zijlstra
  2019-07-16 14:30     ` Alex Kogan
  2019-07-16 15:50   ` Peter Zijlstra
  2019-07-17  2:16   ` Waiman Long
  3 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-16 11:05 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Mon, Jul 15, 2019 at 03:25:34PM -0400, Alex Kogan wrote:
> +/**
> + * find_successor - Scan the main waiting queue looking for the first
> + * thread running on the same node as the lock holder. If found (call it
> + * thread T), move all threads in the main queue between the lock holder
> + * and T to the end of the secondary queue and return T; otherwise, return NULL.
> + */
> +static struct cna_node *find_successor(struct mcs_spinlock *me)

Either don't use a kernel doc comment, but if you must, you have to
stick to their format, otherwise we'll get endless stupid patches fixing
up the stupid comment.
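
I.e. something like this (wording of the description is up to you):

/**
 * find_successor - Scan the main queue for a waiter on the lock holder's node
 * @me: MCS node of the current lock holder
 *
 * Scan the main waiting queue for the first thread running on the same
 * NUMA node as the lock holder; any threads skipped over are moved to
 * the end of the secondary queue.
 *
 * Return: the cna_node of that thread, or NULL if no such thread exists.
 */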

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/5] Add NUMA-awareness to qspinlock
  2019-07-15 19:25 [PATCH v3 0/5] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (4 preceding siblings ...)
  2019-07-15 19:25 ` [PATCH v3 5/5] locking/qspinlock: Introduce the shuffle reduction optimization " Alex Kogan
@ 2019-07-16 11:47 ` Nicholas Piggin
       [not found]   ` <7D29555E-8F72-4EDD-8A87-B1A59C3945A6@oracle.com>
  5 siblings, 1 reply; 33+ messages in thread
From: Nicholas Piggin @ 2019-07-16 11:47 UTC (permalink / raw)
  To: Alex Kogan, arnd, bp, guohanjun, hpa, jglauber, linux-arch,
	linux-arm-kernel, linux, linux-kernel, longman, mingo, peterz,
	tglx, will.deacon, x86
  Cc: rahul.x.yadav, dave.dice, steven.sistare, daniel.m.jordan

Alex Kogan's on July 16, 2019 5:25 am:
> Our evaluation shows that CNA also improves performance of user 
> applications that have hot pthread mutexes. Those mutexes are 
> blocking, and waiting threads park and unpark via the futex 
> mechanism in the kernel. Given that kernel futex chains, which
> are hashed by the mutex address, are each protected by a 
> chain-specific spin lock, the contention on a user-mode mutex 
> translates into contention on a kernel level spinlock. 

What applications are those, what performance numbers? Arguably that's
much more interesting than microbenchmarks (which are mainly useful to
help ensure the fast paths are not impacted IMO).

Thanks,
Nick


* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-15 21:30   ` Waiman Long
  2019-07-16 11:04     ` Peter Zijlstra
@ 2019-07-16 14:26     ` Alex Kogan
  2019-07-16 14:44       ` Waiman Long
       [not found]     ` <aa73b86d-902a-bb6f-d372-8645c8299a6d@redhat.com>
  2 siblings, 1 reply; 33+ messages in thread
From: Alex Kogan @ 2019-07-16 14:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, arnd, peterz, dave.dice, jglauber, x86,
	will.deacon, linux, linux-kernel, rahul.x.yadav, mingo, bp, hpa,
	steven.sistare, tglx, daniel.m.jordan, linux-arm-kernel


> On Jul 15, 2019, at 5:30 PM, Waiman Long <longman@redhat.com> wrote:
> 
> On 7/15/19 3:25 PM, Alex Kogan wrote:
>> In CNA, spinning threads are organized in two queues, a main queue for
>> threads running on the same node as the current lock holder, and a
>> secondary queue for threads running on other nodes. At the unlock time,
>> the lock holder scans the main queue looking for a thread running on
>> the same node. If found (call it thread T), all threads in the main queue
>> between the current lock holder and T are moved to the end of the
>> secondary queue, and the lock is passed to T. If such T is not found, the
>> lock is passed to the first node in the secondary queue. Finally, if the
>> secondary queue is empty, the lock is passed to the next thread in the
>> main queue. For more details, see https://arxiv.org/abs/1810.05600.
>> 
>> Note that this variant of CNA may introduce starvation by continuously
>> passing the lock to threads running on the same node. This issue
>> will be addressed later in the series.
>> 
>> Enabling CNA is controlled via a new configuration option
>> (NUMA_AWARE_SPINLOCKS). The CNA variant is patched in
>> at the boot time only if we run a multi-node machine, and the new
>> config is enabled. For the time being, the patching requires
>> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well.
>> However, this should be resolved once static_call() is available.
>> 
>> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
>> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> arch/x86/Kconfig                 |  18 +++++
>> arch/x86/include/asm/qspinlock.h |   4 +
>> arch/x86/kernel/alternative.c    |  12 +++
>> kernel/locking/mcs_spinlock.h    |   2 +-
>> kernel/locking/qspinlock.c       |  41 +++++++---
>> kernel/locking/qspinlock_cna.h   | 164 +++++++++++++++++++++++++++++++++++++++
>> 6 files changed, 229 insertions(+), 12 deletions(-)
>> create mode 100644 kernel/locking/qspinlock_cna.h
>> 
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 2bbbd4d1ba31..1d8f80c47687 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1548,6 +1548,24 @@ config NUMA
>> 
>> 	  Otherwise, you should say N.
>> 
>> +config NUMA_AWARE_SPINLOCKS
>> +	bool "Numa-aware spinlocks"
>> +	depends on NUMA
>> +	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
>> +	# This is awkward, but hopefully would be resolved once static_call()
>> +	# is available.
>> +	depends on PARAVIRT_SPINLOCKS
>> +	default y
>> +	help
>> +	  Introduce NUMA (Non Uniform Memory Access) awareness into
>> +	  the slow path of spinlocks.
>> +
>> +	  The kernel will try to keep the lock on the same node,
>> +	  thus reducing the number of remote cache misses, while
>> +	  trading some of the short term fairness for better performance.
>> +
>> +	  Say N if you want absolute first come first serve fairness.
> 
> You should also add a dependency on QUEUED_SPINLOCKS to highlight the
> fact that it is a variant of qspinlock. You should also mention that in
> the help text.
Will do.

> 
> 
>> +
>> config AMD_NUMA
>> 	def_bool y
>> 	prompt "Old style AMD Opteron NUMA detection"
>> diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
>> index bd5ac6cc37db..d9b6c34d5eb4 100644
>> --- a/arch/x86/include/asm/qspinlock.h
>> +++ b/arch/x86/include/asm/qspinlock.h
>> @@ -27,6 +27,10 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
>> 	return val;
>> }
>> 
>> +#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
>> +extern void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
>> +#endif
>> +
>> #ifdef CONFIG_PARAVIRT_SPINLOCKS
>> extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
>> extern void __pv_init_lock_hash(void);
>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>> index 0d57015114e7..1c25f0505ec0 100644
>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -649,6 +649,18 @@ void __init alternative_instructions(void)
>> 				(unsigned long)__smp_locks_end);
>> #endif
>> 
>> +#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
>> +	/*
>> +	 * If we have multiple NUMA nodes, switch from native
>> +	 * to the NUMA-friendly slow path for spin locks.
>> +	 */
>> +	if (nr_node_ids > 1 && pv_ops.lock.queued_spin_lock_slowpath ==
>> +			native_queued_spin_lock_slowpath) {
>> +		pv_ops.lock.queued_spin_lock_slowpath =
>> +			__cna_queued_spin_lock_slowpath;
>> +	}
>> +#endif
>> +
>> 	apply_paravirt(__parainstructions, __parainstructions_end);
>> 
>> 	restart_nmi();
>> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
>> index bc6d3244e1af..36b802babc88 100644
>> --- a/kernel/locking/mcs_spinlock.h
>> +++ b/kernel/locking/mcs_spinlock.h
>> @@ -17,7 +17,7 @@
>> 
>> struct mcs_spinlock {
>> 	struct mcs_spinlock *next;
>> -	int locked; /* 1 if lock acquired */
>> +	u64 locked; /* 1 if lock acquired */
>> 	int count;  /* nesting count, see qspinlock.c */
>> };
>> 
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 5668466b3006..1ba38f85d0ae 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -20,7 +20,7 @@
>>  *          Peter Zijlstra <peterz@infradead.org>
>>  */
>> 
>> -#ifndef _GEN_PV_LOCK_SLOWPATH
>> +#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
>> 
>> #include <linux/smp.h>
>> #include <linux/bug.h>
>> @@ -77,18 +77,14 @@
>> #define MAX_NODES	4
>> 
>> /*
>> - * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
>> - * size and four of them will fit nicely in one 64-byte cacheline. For
>> - * pvqspinlock, however, we need more space for extra data. To accommodate
>> - * that, we insert two more long words to pad it up to 32 bytes. IOW, only
>> - * two of them can fit in a cacheline in this case. That is OK as it is rare
>> - * to have more than 2 levels of slowpath nesting in actual use. We don't
>> - * want to penalize pvqspinlocks to optimize for a rare case in native
>> - * qspinlocks.
>> + * On 64-bit architectures, the mcs_spinlock structure will be 20 bytes in
>> + * size. For pvqspinlock or the NUMA-aware variant, however, we need more
>> + * space for extra data. To accommodate that, we insert two more long words
>> + * to pad it up to 36 bytes.
>>  */
> The 20 bytes figure is wrong. It is actually 24 bytes for 64-bit as the
> mcs_spinlock structure is 8-byte aligned. For better cacheline
> alignment, I will like to keep mcs_spinlock to 16 bytes as before.
> Instead, you can use encode_tail() to store the CNA node pointer in
> "locked". For instance, use (encode_tail() << 1) in locked to
> distinguish it from the regular locked=1 value.
I think this can work.
decode_tail() will get the actual node pointer from the encoded value.
And that would keep the size of mcs_spinlock intact.
Good idea, thanks!

BTW, maybe better change those function names to encode_node() / decode_node() then?
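
For illustration, a minimal sketch of that scheme (purely hypothetical code,
not the actual patch; the helper names are made up, and it assumes the
existing encode_tail()/decode_tail() helpers from qspinlock.c):

/*
 * Keep mcs_spinlock at 16 bytes and store the secondary-queue head in
 * @locked as an encoded tail value. An encoded tail is shifted left by
 * at least 9 bits, so it can never equal the regular locked == 1 value.
 */
static inline void set_secondary_head(struct mcs_spinlock *node,
				      struct cna_node *head)
{
	/* encoded_tail is assumed precomputed via encode_tail() */
	node->locked = head->encoded_tail;
}

static inline struct cna_node *secondary_head(struct mcs_spinlock *node)
{
	if (node->locked <= 1)	/* no secondary queue */
		return NULL;

	/* decode_tail() maps the encoded value back to the per-CPU qnode */
	return (struct cna_node *)decode_tail(node->locked);
}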

>> struct qnode {
>> 	struct mcs_spinlock mcs;
>> -#ifdef CONFIG_PARAVIRT_SPINLOCKS
>> +#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
>> 	long reserved[2];
>> #endif
>> };
>> @@ -327,7 +323,7 @@ static __always_inline void __pass_mcs_lock(struct mcs_spinlock *node,
>> #define set_locked_empty_mcs	__set_locked_empty_mcs
>> #define pass_mcs_lock		__pass_mcs_lock
>> 
>> -#endif /* _GEN_PV_LOCK_SLOWPATH */
>> +#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
>> 
>> /**
>>  * queued_spin_lock_slowpath - acquire the queued spinlock
>> @@ -600,6 +596,29 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>> EXPORT_SYMBOL(queued_spin_lock_slowpath);
>> 
>> /*
>> + * Generate the code for NUMA-aware spin locks
>> + */
>> +#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
>> +#define _GEN_CNA_LOCK_SLOWPATH
>> +
>> +#undef pv_init_node
>> +#define pv_init_node cna_init_node
>> +
>> +#undef set_locked_empty_mcs
>> +#define set_locked_empty_mcs		cna_set_locked_empty_mcs
>> +
>> +#undef pass_mcs_lock
>> +#define pass_mcs_lock			cna_pass_mcs_lock
>> +
>> +#undef  queued_spin_lock_slowpath
>> +#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
>> +
>> +#include "qspinlock_cna.h"
>> +#include "qspinlock.c"
>> +
>> +#endif
>> +
>> +/*
>>  * Generate the paravirt code for queued_spin_unlock_slowpath().
>>  */
>> #if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
>> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
>> new file mode 100644
>> index 000000000000..efb9b12b2f9b
>> --- /dev/null
>> +++ b/kernel/locking/qspinlock_cna.h
>> @@ -0,0 +1,164 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _GEN_CNA_LOCK_SLOWPATH
>> +#error "do not include this file"
>> +#endif
>> +
>> +#include <linux/topology.h>
>> +
>> +/*
>> + * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
>> + *
>> + * In CNA, spinning threads are organized in two queues, a main queue for
>> + * threads running on the same node as the current lock holder, and a
>> + * secondary queue for threads running on other nodes. At the unlock time,
>> + * the lock holder scans the main queue looking for a thread running on
>> + * the same node. If found (call it thread T), all threads in the main queue
>> + * between the current lock holder and T are moved to the end of the
>> + * secondary queue, and the lock is passed to T. If such T is not found, the
>> + * lock is passed to the first node in the secondary queue. Finally, if the
>> + * secondary queue is empty, the lock is passed to the next thread in the
>> + * main queue. To avoid starvation of threads in the secondary queue,
>> + * those threads are moved back to the head of the main queue
>> + * after a certain expected number of intra-node lock hand-offs.
>> + *
>> + * For more details, see https://arxiv.org/abs/1810.05600.
>> + *
>> + * Authors: Alex Kogan <alex.kogan@oracle.com>
>> + *          Dave Dice <dave.dice@oracle.com>
>> + */
>> +
>> +struct cna_node {
>> +	struct	mcs_spinlock mcs;
>> +	u32	numa_node;
>> +	u32	encoded_tail;
>> +	struct	cna_node *tail;    /* points to the secondary queue tail */
>> +};
>> +
>> +#define CNA_NODE(ptr) ((struct cna_node *)(ptr))
>> +
>> +static void cna_init_node(struct mcs_spinlock *node)
>> +{
>> +	struct cna_node *cn = CNA_NODE(node);
>> +	struct mcs_spinlock *base_node;
>> +	int cpuid;
>> +
>> +	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
>> +	/* we store a pointer in the node's @locked field */
>> +	BUILD_BUG_ON(sizeof(uintptr_t) > sizeof_field(struct mcs_spinlock, locked));
>> +
>> +	cpuid = smp_processor_id();
>> +	cn->numa_node = cpu_to_node(cpuid);
>> +
>> +	base_node = this_cpu_ptr(&qnodes[0].mcs);
>> +	cn->encoded_tail = encode_tail(cpuid, base_node->count - 1);
>> +}
> 
> 
> I think you can use an early_init call to initialize the numa_node and
> encoded_tail values for all the per-cpu CNA nodes instead of doing it
> every time a node is used. If it turns out that pv_qspinlock is used,
> the pv_node_init() will properly re-initialize it.
Yes, this should work. Thanks.

BTW, shouldn't we initialize `cpu` in pv_init_node() the same way?

> The only thing left
> to do here is perhaps setting tail to NULL.
There is no need to initialize cna_node.tail — we never access it unless
the node is at the head of the secondary queue, and in that case we 
initialize it before placing the node at the head of that queue 
(see find_successor()).
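
For reference, a rough sketch of the early initialization suggested above
(illustrative only; the initcall name and level are assumptions, and the
qnodes[]/CNA_NODE() definitions are the ones from qspinlock.c and the patch):

/*
 * Precompute the NUMA node and encoded tail for every per-CPU CNA node
 * once at boot, instead of recomputing them in cna_init_node() on each
 * slowpath entry.
 */
static int __init cna_init_nodes(void)
{
	unsigned int cpu;

	for_each_possible_cpu(cpu) {
		int i;

		for (i = 0; i < MAX_NODES; i++) {
			struct cna_node *cn =
				CNA_NODE(per_cpu_ptr(&qnodes[i].mcs, cpu));

			cn->numa_node = cpu_to_node(cpu);
			cn->encoded_tail = encode_tail(cpu, i);
		}
	}
	return 0;
}
early_initcall(cna_init_nodes);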

Best regards,
— Alex

> 
> -Longman
> 



* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-16 11:05   ` Peter Zijlstra
@ 2019-07-16 14:30     ` Alex Kogan
  0 siblings, 0 replies; 33+ messages in thread
From: Alex Kogan @ 2019-07-16 14:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel


> On Jul 16, 2019, at 7:05 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Jul 15, 2019 at 03:25:34PM -0400, Alex Kogan wrote:
>> +/**
>> + * find_successor - Scan the main waiting queue looking for the first
>> + * thread running on the same node as the lock holder. If found (call it
>> + * thread T), move all threads in the main queue between the lock holder
>> + * and T to the end of the secondary queue and return T; otherwise, return NULL.
>> + */
>> +static struct cna_node *find_successor(struct mcs_spinlock *me)
> 
> Either don't use a kernel doc comment, but if you must, you have to
> stick to their format, otherwise we'll get endless stupid patches fixing
> up the stupid comment.
Will fix that.

Thanks,
— Alex



* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-16 14:26     ` Alex Kogan
@ 2019-07-16 14:44       ` Waiman Long
  0 siblings, 0 replies; 33+ messages in thread
From: Waiman Long @ 2019-07-16 14:44 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, peterz, dave.dice, jglauber, x86,
	will.deacon, linux, linux-kernel, rahul.x.yadav, mingo, bp, hpa,
	steven.sistare, tglx, daniel.m.jordan, linux-arm-kernel

On 7/16/19 10:26 AM, Alex Kogan wrote:
>> On Jul 15, 2019, at 5:30 PM, Waiman Long <longman@redhat.com> wrote:
>>
>> On 7/15/19 3:25 PM, Alex Kogan wrote:
>>
>>> /*
>>> - * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
>>> - * size and four of them will fit nicely in one 64-byte cacheline. For
>>> - * pvqspinlock, however, we need more space for extra data. To accommodate
>>> - * that, we insert two more long words to pad it up to 32 bytes. IOW, only
>>> - * two of them can fit in a cacheline in this case. That is OK as it is rare
>>> - * to have more than 2 levels of slowpath nesting in actual use. We don't
>>> - * want to penalize pvqspinlocks to optimize for a rare case in native
>>> - * qspinlocks.
>>> + * On 64-bit architectures, the mcs_spinlock structure will be 20 bytes in
>>> + * size. For pvqspinlock or the NUMA-aware variant, however, we need more
>>> + * space for extra data. To accommodate that, we insert two more long words
>>> + * to pad it up to 36 bytes.
>>>  */
>> The 20 bytes figure is wrong. It is actually 24 bytes for 64-bit as the
>> mcs_spinlock structure is 8-byte aligned. For better cacheline
>> alignment, I will like to keep mcs_spinlock to 16 bytes as before.
>> Instead, you can use encode_tail() to store the CNA node pointer in
>> "locked". For instance, use (encode_tail() << 1) in locked to
>> distinguish it from the regular locked=1 value.
> I think this can work.
> decode_tail() will get the actual node pointer from the encoded value.
> And that would keep the size of mcs_spinlock intact.
> Good idea, thanks!
>
> BTW, maybe better change those function names to encode_node() / decode_node() then?

The names look good to me.


>
>>> s
>>> +
>>> +static void cna_init_node(struct mcs_spinlock *node)
>>> +{
>>> +	struct cna_node *cn = CNA_NODE(node);
>>> +	struct mcs_spinlock *base_node;
>>> +	int cpuid;
>>> +
>>> +	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
>>> +	/* we store a pointer in the node's @locked field */
>>> +	BUILD_BUG_ON(sizeof(uintptr_t) > sizeof_field(struct mcs_spinlock, locked));
>>> +
>>> +	cpuid = smp_processor_id();
>>> +	cn->numa_node = cpu_to_node(cpuid);
>>> +
>>> +	base_node = this_cpu_ptr(&qnodes[0].mcs);
>>> +	cn->encoded_tail = encode_tail(cpuid, base_node->count - 1);
>>> +}
>>
>> I think you can use an early_init call to initialize the numa_node and
>> encoded_tail values for all the per-cpu CNA nodes instead of doing it
>> every time a node is used. If it turns out that pv_qspinlock is used,
>> the pv_node_init() will properly re-initialize it.
> Yes, this should work. Thanks.
>
> BTW, shouldn't we initialize `cpu` in pv_init_node() the same way?

We would also initialize cpu this way in pv_init_node. The
smp_processor_id() call is relatively cheap, but the initialization done
here is more expensive.


>> The only thing left
>> to do here is perhaps setting tail to NULL.
> There is no need to initialize cna_node.tail — we never access it unless
> the node is at the head of the secondary queue, and in that case we 
> initialize it before placing the node at the head of that queue 
> (see find_successor()).

OK.

-Longman



* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
       [not found]       ` <C1C55A40-FDB1-43B5-B551-F9B8BE776DF8@oracle.com>
@ 2019-07-16 14:50         ` Waiman Long
  2019-07-17 17:44           ` Alex Kogan
  0 siblings, 1 reply; 33+ messages in thread
From: Waiman Long @ 2019-07-16 14:50 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, peterz, dave.dice, jglauber, x86,
	will.deacon, linux, linux-kernel, rahul.x.yadav, mingo, bp, hpa,
	steven.sistare, tglx, daniel.m.jordan, linux-arm-kernel

On 7/16/19 10:29 AM, Alex Kogan wrote:
>
>> On Jul 15, 2019, at 7:22 PM, Waiman Long <longman@redhat.com
>> <mailto:longman@redhat.com>> wrote:
>>
>> On 7/15/19 5:30 PM, Waiman Long wrote:
>>>> -#ifndef _GEN_PV_LOCK_SLOWPATH
>>>> +#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
>>>>  
>>>>  #include <linux/smp.h>
>>>>  #include <linux/bug.h>
>>>> @@ -77,18 +77,14 @@
>>>>  #define MAX_NODES	4
>>>>  
>>>>  /*
>>>> - * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
>>>> - * size and four of them will fit nicely in one 64-byte cacheline. For
>>>> - * pvqspinlock, however, we need more space for extra data. To accommodate
>>>> - * that, we insert two more long words to pad it up to 32 bytes. IOW, only
>>>> - * two of them can fit in a cacheline in this case. That is OK as it is rare
>>>> - * to have more than 2 levels of slowpath nesting in actual use. We don't
>>>> - * want to penalize pvqspinlocks to optimize for a rare case in native
>>>> - * qspinlocks.
>>>> + * On 64-bit architectures, the mcs_spinlock structure will be 20 bytes in
>>>> + * size. For pvqspinlock or the NUMA-aware variant, however, we need more
>>>> + * space for extra data. To accommodate that, we insert two more long words
>>>> + * to pad it up to 36 bytes.
>>>>   */
>>> The 20 bytes figure is wrong. It is actually 24 bytes for 64-bit as the
>>> mcs_spinlock structure is 8-byte aligned. For better cacheline
>>> alignment, I will like to keep mcs_spinlock to 16 bytes as before.
>>> Instead, you can use encode_tail() to store the CNA node pointer in
>>> "locked". For instance, use (encode_tail() << 1) in locked to
>>> distinguish it from the regular locked=1 value.
>>
>> Actually, the encoded tail value is already shift left either 16 bits
>> or 9 bits. So there is no need to shift it. You can assigned it directly:
>>
>> mcs->locked = cna->encoded_tail;
>>
>> You do need to change the type of locked to "unsigned int", though,
>> for proper comparison with "1".
>>
> Got it, thanks.
>
I forgot to mention that I would like to see a boot command line option
to force the numa qspinlock code off (and maybe on as well). This can
help in testing, as you don't need to build 2 separate kernels, one with
NUMA_AWARE_SPINLOCKS on and one with it off. For small 2-socket systems,
numa qspinlock may not help much, so an option to turn it off can be
useful. Xen also has an option to turn off PV qspinlock.

-Longman



* Re: [PATCH v3 2/5] locking/qspinlock: Refactor the qspinlock slow path
  2019-07-16 10:20   ` Peter Zijlstra
@ 2019-07-16 14:53     ` Alex Kogan
  2019-07-16 15:58       ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Kogan @ 2019-07-16 14:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Jul 16, 2019, at 6:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Jul 15, 2019 at 03:25:33PM -0400, Alex Kogan wrote:
> 
>> +/*
>> + * set_locked_empty_mcs - Try to set the spinlock value to _Q_LOCKED_VAL,
>> + * and by doing that unlock the MCS lock when its waiting queue is empty
>> + * @lock: Pointer to queued spinlock structure
>> + * @val: Current value of the lock
>> + * @node: Pointer to the MCS node of the lock holder
>> + *
>> + * *,*,* -> 0,0,1
>> + */
>> +static __always_inline bool __set_locked_empty_mcs(struct qspinlock *lock,
>> +						   u32 val,
>> +						   struct mcs_spinlock *node)
>> +{
>> +	return atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL);
>> +}
> 
> That name is nonsense. It should be something like:
> 
> static __always_inline bool __try_clear_tail(…)

We already have set_locked(), so I was trying to convey the fact that we are
doing the same here, but only when the MCS chain is empty.

I can use __try_clear_tail() instead.

> 
> 
>> +/*
>> + * pass_mcs_lock - pass the MCS lock to the next waiter
>> + * @node: Pointer to the MCS node of the lock holder
>> + * @next: Pointer to the MCS node of the first waiter in the MCS queue
>> + */
>> +static __always_inline void __pass_mcs_lock(struct mcs_spinlock *node,
>> +					    struct mcs_spinlock *next)
>> +{
>> +	arch_mcs_spin_unlock_contended(&next->locked, 1);
>> +}
> 
> I'm not entirely happy with that name either; but it's not horrible like
> the other one. Why not mcs_spin_unlock_contended() ?

Sure, I can use mcs_spin_unlock_contended() instead.

Thanks,
— Alex



* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-15 19:25 ` [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
  2019-07-15 21:30   ` Waiman Long
  2019-07-16 11:05   ` Peter Zijlstra
@ 2019-07-16 15:50   ` Peter Zijlstra
  2019-07-16 17:19     ` Alex Kogan
  2019-07-17  2:16   ` Waiman Long
  3 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-16 15:50 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Mon, Jul 15, 2019 at 03:25:34PM -0400, Alex Kogan wrote:
> +static struct cna_node *find_successor(struct mcs_spinlock *me)
> +{
> +	struct cna_node *me_cna = CNA_NODE(me);
> +	struct cna_node *head_other, *tail_other, *cur;
> +	struct cna_node *next = CNA_NODE(READ_ONCE(me->next));
> +	int my_node;
> +
> +	/* @next should be set, else we would not be calling this function. */
> +	WARN_ON_ONCE(next == NULL);
> +
> +	my_node = me_cna->numa_node;
> +
> +	/*
> +	 * Fast path - check whether the immediate successor runs on
> +	 * the same node.
> +	 */
> +	if (next->numa_node == my_node)
> +		return next;
> +
> +	head_other = next;
> +	tail_other = next;
> +
> +	/*
> +	 * Traverse the main waiting queue starting from the successor of my
> +	 * successor, and look for a thread running on the same node.
> +	 */
> +	cur = CNA_NODE(READ_ONCE(next->mcs.next));
> +	while (cur) {
> +		if (cur->numa_node == my_node) {
> +			/*
> +			 * Found a thread on the same node. Move threads
> +			 * between me and that node into the secondary queue.
> +			 */
> +			if (me->locked > 1)
> +				CNA_NODE(me->locked)->tail->mcs.next =
> +					(struct mcs_spinlock *)head_other;
> +			else
> +				me->locked = (uintptr_t)head_other;
> +			tail_other->mcs.next = NULL;
> +			CNA_NODE(me->locked)->tail = tail_other;
> +			return cur;
> +		}
> +		tail_other = cur;
> +		cur = CNA_NODE(READ_ONCE(cur->mcs.next));
> +	}
> +	return NULL;
> +}

static void cna_move(struct cna_node *cn, struct cna_node *cni)
{
	struct cna_node *head, *tail;

	/* remove @cni */
	WRITE_ONCE(cn->mcs.next, cni->mcs.next);

	/* stick @cni on the 'other' list tail */
	cni->mcs.next = NULL;

	if (cn->mcs.locked <= 1) {
		/* head = tail = cni */
		head = cni;
		head->tail = cni;
		cn->mcs.locked = head->encoded_tail;
	} else {
		/* add to tail */
		head = (struct cna_node *)decode_tail(cn->mcs.locked);
		tail = tail->tail;
		tail->next = cni;
	}
}

static struct cna_node *cna_find_next(struct mcs_spinlock *node)
{
	struct cna_node *cni, *cn = (struct cna_node *)node;

	while ((cni = (struct cna_node *)READ_ONCE(cn->mcs.next))) {
		if (likely(cni->node == cn->node))
			break;

		cna_move(cn, cni);
	}

	return cni;
}

> +static inline bool cna_set_locked_empty_mcs(struct qspinlock *lock, u32 val,
> +					struct mcs_spinlock *node)
> +{
> +	/* Check whether the secondary queue is empty. */
> +	if (node->locked <= 1) {
> +		if (atomic_try_cmpxchg_relaxed(&lock->val, &val,
> +				_Q_LOCKED_VAL))
> +			return true; /* No contention */
> +	} else {
> +		/*
> +		 * Pass the lock to the first thread in the secondary
> +		 * queue, but first try to update the queue's tail to
> +		 * point to the last node in the secondary queue.


That comment doesn't make sense; there's at least one conditional
missing.

> +		 */
> +		struct cna_node *succ = CNA_NODE(node->locked);
> +		u32 new = succ->tail->encoded_tail + _Q_LOCKED_VAL;
> +
> +		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
> +			arch_mcs_spin_unlock_contended(&succ->mcs.locked, 1);
> +			return true;
> +		}
> +	}
> +
> +	return false;
> +}

static cna_try_clear_tail(struct qspinlock *lock, u32 val, struct mcs_spinlock *node)
{
	if (node->locked <= 1)
		return __try_clear_tail(lock, val, node);

	/* the other case */
}

> +static inline void cna_pass_mcs_lock(struct mcs_spinlock *node,
> +				     struct mcs_spinlock *next)
> +{
> +	struct cna_node *succ = NULL;
> +	u64 *var = &next->locked;
> +	u64 val = 1;
> +
> +	succ = find_successor(node);

This makes unlock O(n), which is 'funneh' and undocumented.

> +
> +	if (succ) {
> +		var = &succ->mcs.locked;
> +		/*
> +		 * We unlock a successor by passing a non-zero value,
> +		 * so set @val to 1 iff @locked is 0, which will happen
> +		 * if we acquired the MCS lock when its queue was empty
> +		 */
> +		val = node->locked + (node->locked == 0);
> +	} else if (node->locked > 1) { /* if the secondary queue is not empty */
> +		/* pass the lock to the first node in that queue */
> +		succ = CNA_NODE(node->locked);
> +		succ->tail->mcs.next = next;
> +		var = &succ->mcs.locked;

> +	}	/*
> +		 * Otherwise, pass the lock to the immediate successor
> +		 * in the main queue.
> +		 */

I don't think this mis-indented comment can happen. The call-site
guarantees @next is non-null.

Therefore, cna_find_next() will either return it, or place it on the
secondary list. If it (cna_find_next) returns NULL, we must have a
non-empty secondary list.

In no case do I see this tertiary condition being possible.

> +
> +	arch_mcs_spin_unlock_contended(var, val);
> +}

This also renders this @next argument superfluous.

static cna_mcs_pass_lock(struct mcs_spinlock *node, struct mcs_spinlock *next)
{
	next = cna_find_next(node);
	if (!next) {
		BUG_ON(node->locked <= 1);
		next = (struct cna_node *)decode_tail(node->locked);
		node->locked = 1;
	}

	arch_mcs_pass_lock(&next->mcs.locked, node->locked);
}



* Re: [PATCH v3 2/5] locking/qspinlock: Refactor the qspinlock slow path
  2019-07-16 14:53     ` Alex Kogan
@ 2019-07-16 15:58       ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-16 15:58 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Tue, Jul 16, 2019 at 10:53:02AM -0400, Alex Kogan wrote:
> On Jul 16, 2019, at 6:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Mon, Jul 15, 2019 at 03:25:33PM -0400, Alex Kogan wrote:
> > 
> >> +/*
> >> + * set_locked_empty_mcs - Try to set the spinlock value to _Q_LOCKED_VAL,
> >> + * and by doing that unlock the MCS lock when its waiting queue is empty
> >> + * @lock: Pointer to queued spinlock structure
> >> + * @val: Current value of the lock
> >> + * @node: Pointer to the MCS node of the lock holder
> >> + *
> >> + * *,*,* -> 0,0,1
> >> + */
> >> +static __always_inline bool __set_locked_empty_mcs(struct qspinlock *lock,
> >> +						   u32 val,
> >> +						   struct mcs_spinlock *node)
> >> +{
> >> +	return atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL);
> >> +}
> > 
> > That name is nonsense. It should be something like:
> > 
> > static __always_inline bool __try_clear_tail(…)
> 
> We already have set_locked(), so I was trying to convey the fact that we are
> doing the same here, but only when the MCS chain is empty.
> 
> I can use __try_clear_tail() instead.

Thing is, we go into this function with: *,0,1 and are trying to obtain
0,0,1. IOW, we're trying to clear the tail, while preserving pending and
locked.


* Re: [PATCH v3 4/5] locking/qspinlock: Introduce starvation avoidance into CNA
  2019-07-15 19:25 ` [PATCH v3 4/5] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
@ 2019-07-16 15:59   ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-16 15:59 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Mon, Jul 15, 2019 at 03:25:35PM -0400, Alex Kogan wrote:

> @@ -36,6 +37,33 @@ struct cna_node {
>  
>  #define CNA_NODE(ptr) ((struct cna_node *)(ptr))
>  
> +/* Per-CPU pseudo-random number seed */
> +static DEFINE_PER_CPU(u32, seed);
> +
> +/*
> + * Controls the probability for intra-node lock hand-off. It can be
> + * tuned and depend, e.g., on the number of CPUs per node. For now,
> + * choose a value that provides reasonable long-term fairness without
> + * sacrificing performance compared to a version that does not have any
> + * fairness guarantees.
> + */
> +#define INTRA_NODE_HANDOFF_PROB_ARG 0x10000
> +
> +/*
> + * Return false with probability 1 / @range.
> + * @range must be a power of 2.
> + */
> +static bool probably(unsigned int range)
> +{
> +	u32 s;
> +
> +	s = this_cpu_read(seed);
> +	s = next_pseudo_random32(s);
> +	this_cpu_write(seed, s);
> +
> +	return s & (range - 1);

This is fragile, better to take a number of bits as argument.

> +}
> +
>  static void cna_init_node(struct mcs_spinlock *node)
>  {
>  	struct cna_node *cn = CNA_NODE(node);
> @@ -140,7 +168,13 @@ static inline void cna_pass_mcs_lock(struct mcs_spinlock *node,
>  	u64 *var = &next->locked;
>  	u64 val = 1;
>  
> -	succ = find_successor(node);
> +	/*
> +	 * Try to pass the lock to a thread running on the same node.
> +	 * For long-term fairness, search for such a thread with high
> +	 * probability rather than always.
> +	 */
> +	if (probably(INTRA_NODE_HANDOFF_PROB_ARG))
> +		succ = find_successor(node);
>  
>  	if (succ) {
>  		var = &succ->mcs.locked;

And this is where that tertiary condition comes from.. I think.
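
For what it's worth, a bits-based variant along those lines could look
roughly like this (sketch only; it reuses the per-CPU seed and
next_pseudo_random32() from the patch):

/*
 * Return false with probability 1 / 2^num_bits. Taking the number of
 * bits instead of a power-of-2 range avoids a silently skewed
 * distribution if someone passes a non power-of-2 value.
 */
static bool probably(unsigned int num_bits)
{
	u32 s;

	s = this_cpu_read(seed);
	s = next_pseudo_random32(s);
	this_cpu_write(seed, s);

	return s & ((1U << num_bits) - 1);
}

The call site would then pass the shift, e.g. probably(16) to keep the
current 1/0x10000 behaviour.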



* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-16 15:50   ` Peter Zijlstra
@ 2019-07-16 17:19     ` Alex Kogan
  2019-07-16 18:47       ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Kogan @ 2019-07-16 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

Hi, Peter.

Thanks for the review and all the suggestions!

A couple of comments are inlined below.

> On Jul 16, 2019, at 11:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Jul 15, 2019 at 03:25:34PM -0400, Alex Kogan wrote:
>> +static struct cna_node *find_successor(struct mcs_spinlock *me)
>> +{
>> +	struct cna_node *me_cna = CNA_NODE(me);
>> +	struct cna_node *head_other, *tail_other, *cur;
>> +	struct cna_node *next = CNA_NODE(READ_ONCE(me->next));
>> +	int my_node;
>> +
>> +	/* @next should be set, else we would not be calling this function. */
>> +	WARN_ON_ONCE(next == NULL);
>> +
>> +	my_node = me_cna->numa_node;
>> +
>> +	/*
>> +	 * Fast path - check whether the immediate successor runs on
>> +	 * the same node.
>> +	 */
>> +	if (next->numa_node == my_node)
>> +		return next;
>> +
>> +	head_other = next;
>> +	tail_other = next;
>> +
>> +	/*
>> +	 * Traverse the main waiting queue starting from the successor of my
>> +	 * successor, and look for a thread running on the same node.
>> +	 */
>> +	cur = CNA_NODE(READ_ONCE(next->mcs.next));
>> +	while (cur) {
>> +		if (cur->numa_node == my_node) {
>> +			/*
>> +			 * Found a thread on the same node. Move threads
>> +			 * between me and that node into the secondary queue.
>> +			 */
>> +			if (me->locked > 1)
>> +				CNA_NODE(me->locked)->tail->mcs.next =
>> +					(struct mcs_spinlock *)head_other;
>> +			else
>> +				me->locked = (uintptr_t)head_other;
>> +			tail_other->mcs.next = NULL;
>> +			CNA_NODE(me->locked)->tail = tail_other;
>> +			return cur;
>> +		}
>> +		tail_other = cur;
>> +		cur = CNA_NODE(READ_ONCE(cur->mcs.next));
>> +	}
>> +	return NULL;
>> +}
> 
> static void cna_move(struct cna_node *cn, struct cna_node *cni)
> {
> 	struct cna_node *head, *tail;
> 
> 	/* remove @cni */
> 	WRITE_ONCE(cn->mcs.next, cni->mcs.next);
> 
> 	/* stick @cni on the 'other' list tail */
> 	cni->mcs.next = NULL;
> 
> 	if (cn->mcs.locked <= 1) {
> 		/* head = tail = cni */
> 		head = cni;
> 		head->tail = cni;
> 		cn->mcs.locked = head->encoded_tail;
> 	} else {
> 		/* add to tail */
> 		head = (struct cna_node *)decode_tail(cn->mcs.locked);
> 		tail = tail->tail;
> 		tail->next = cni;
> 	}
> }
> 
> static struct cna_node *cna_find_next(struct mcs_spinlock *node)
> {
> 	struct cna_node *cni, *cn = (struct cna_node *)node;
> 
> 	while ((cni = (struct cna_node *)READ_ONCE(cn->mcs.next))) {
> 		if (likely(cni->node == cn->node))
> 			break;
> 
> 		cna_move(cn, cni);
> 	}
> 
> 	return cni;
> }
But then you move nodes from the main list to the ‘other’ list one-by-one.
I’m afraid this would be unnecessarily expensive.
Plus, all this extra work is wasted if you do not find a thread on the same 
NUMA node (you move everyone to the ‘other’ list only to move them back in 
cna_mcs_pass_lock()).

> 
>> +static inline bool cna_set_locked_empty_mcs(struct qspinlock *lock, u32 val,
>> +					struct mcs_spinlock *node)
>> +{
>> +	/* Check whether the secondary queue is empty. */
>> +	if (node->locked <= 1) {
>> +		if (atomic_try_cmpxchg_relaxed(&lock->val, &val,
>> +				_Q_LOCKED_VAL))
>> +			return true; /* No contention */
>> +	} else {
>> +		/*
>> +		 * Pass the lock to the first thread in the secondary
>> +		 * queue, but first try to update the queue's tail to
>> +		 * point to the last node in the secondary queue.
> 
> 
> That comment doesn't make sense; there's at least one conditional
> missing.
In CNA, we cannot just clear the tail when the MCS chain is empty, as 
there might be nodes in the ‘other’ chain. In that case (this is the “else” part),
we want to pass the lock to the first node in the ‘other’ chain, but 
first we need to put the last node from that chain into the tail. Perhaps the
comment should read “…  but first try to update the *primary* queue's tail …”, 
if that makes more sense.

> 
>> +		 */
>> +		struct cna_node *succ = CNA_NODE(node->locked);
>> +		u32 new = succ->tail->encoded_tail + _Q_LOCKED_VAL;
>> +
>> +		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
>> +			arch_mcs_spin_unlock_contended(&succ->mcs.locked, 1);
>> +			return true;
>> +		}
>> +	}
>> +
>> +	return false;
>> +}
> 
> static cna_try_clear_tail(struct qspinlock *lock, u32 val, struct mcs_spinlock *node)
> {
> 	if (node->locked <= 1)
> 		return __try_clear_tail(lock, val, node);
> 
> 	/* the other case */
> }
Good point, thanks.
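
Fleshed out along those lines (a sketch only; it assumes the generic helper
keeps the __try_clear_tail() name suggested earlier and reuses the
secondary-queue handling from the patch), it might read:

static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
				      struct mcs_spinlock *node)
{
	struct cna_node *succ;
	u32 new;

	/* no secondary queue: same as the native case */
	if (node->locked <= 1)
		return __try_clear_tail(lock, val, node);

	/*
	 * Try to pass the lock to the head of the secondary queue, making
	 * the last node of that queue the new tail of the main queue.
	 */
	succ = CNA_NODE(node->locked);
	new = succ->tail->encoded_tail + _Q_LOCKED_VAL;

	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
		arch_mcs_spin_unlock_contended(&succ->mcs.locked, 1);
		return true;
	}

	return false;
}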

> 
>> +static inline void cna_pass_mcs_lock(struct mcs_spinlock *node,
>> +				     struct mcs_spinlock *next)
>> +{
>> +	struct cna_node *succ = NULL;
>> +	u64 *var = &next->locked;
>> +	u64 val = 1;
>> +
>> +	succ = find_successor(node);
> 
> This makes unlock O(n), which is 'funneh' and undocumented.
I will add a comment above the call to find_successor() / cna_find_next().

> 
>> +
>> +	if (succ) {
>> +		var = &succ->mcs.locked;
>> +		/*
>> +		 * We unlock a successor by passing a non-zero value,
>> +		 * so set @val to 1 iff @locked is 0, which will happen
>> +		 * if we acquired the MCS lock when its queue was empty
>> +		 */
>> +		val = node->locked + (node->locked == 0);
>> +	} else if (node->locked > 1) { /* if the secondary queue is not empty */
>> +		/* pass the lock to the first node in that queue */
>> +		succ = CNA_NODE(node->locked);
>> +		succ->tail->mcs.next = next;
>> +		var = &succ->mcs.locked;
> 
>> +	}	/*
>> +		 * Otherwise, pass the lock to the immediate successor
>> +		 * in the main queue.
>> +		 */
> 
> I don't think this mis-indented comment can happen. The call-site
> guarantees @next is non-null.
> 
> Therefore, cna_find_next() will either return it, or place it on the
> secondary list. If it (cna_find_next) returns NULL, we must have a
> non-empty secondary list.
> 
> In no case do I see this tertiary condition being possible.
find_successor() will return NULL if it does not find a thread running on the 
same NUMA node. And the secondary queue might be empty at that time.

> 
>> +
>> +	arch_mcs_spin_unlock_contended(var, val);
>> +}
> 
> This also renders this @next argument superfluous.
> 
> static cna_mcs_pass_lock(struct mcs_spinlock *node, struct mcs_spinlock *next)
> {
> 	next = cna_find_next(node);
> 	if (!next) {
> 		BUG_ON(node->locked <= 1);
> 		next = (struct cna_node *)decode_tail(node->locked);
> 		node->locked = 1;
> 	}
> 
> 	arch_mcs_pass_lock(&next->mcs.locked, node->locked);
> }

@next is passed to save the load from @node.
This is probably most important for the native code (__pass_mcs_lock()).
That function should be inlined, however, and that load should not matter.
Bottom line, I agree that we can remove the @next argument.

Best regards,
— Alex




* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-16 17:19     ` Alex Kogan
@ 2019-07-16 18:47       ` Peter Zijlstra
  2019-07-17  8:39         ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-16 18:47 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Tue, Jul 16, 2019 at 01:19:16PM -0400, Alex Kogan wrote:
> > On Jul 16, 2019, at 11:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > static void cna_move(struct cna_node *cn, struct cna_node *cni)
> > {
> > 	struct cna_node *head, *tail;
> > 
> > 	/* remove @cni */
> > 	WRITE_ONCE(cn->mcs.next, cni->mcs.next);
> > 
> > 	/* stick @cni on the 'other' list tail */
> > 	cni->mcs.next = NULL;
> > 
> > 	if (cn->mcs.locked <= 1) {
> > 		/* head = tail = cni */
> > 		head = cni;
> > 		head->tail = cni;
> > 		cn->mcs.locked = head->encoded_tail;
> > 	} else {
> > 		/* add to tail */
> > 		head = (struct cna_node *)decode_tail(cn->mcs.locked);
> > 		tail = tail->tail;
> > 		tail->next = cni;
> > 	}
> > }
> > 
> > static struct cna_node *cna_find_next(struct mcs_spinlock *node)
> > {
> > 	struct cna_node *cni, *cn = (struct cna_node *)node;
> > 
> > 	while ((cni = (struct cna_node *)READ_ONCE(cn->mcs.next))) {
> > 		if (likely(cni->node == cn->node))
> > 			break;
> > 
> > 		cna_move(cn, cni);
> > 	}
> > 
> > 	return cni;
> > }
> But then you move nodes from the main list to the ‘other’ list one-by-one.
> I’m afraid this would be unnecessarily expensive.
> Plus, all this extra work is wasted if you do not find a thread on the same 
> NUMA node (you move everyone to the ‘other’ list only to move them back in 
> cna_mcs_pass_lock()).

My primary concern was readability; I find the above suggestion much
more readable. Maybe it can be written differently; you'll have to play
around a bit.

> >> +static inline bool cna_set_locked_empty_mcs(struct qspinlock *lock, u32 val,
> >> +					struct mcs_spinlock *node)
> >> +{
> >> +	/* Check whether the secondary queue is empty. */
> >> +	if (node->locked <= 1) {
> >> +		if (atomic_try_cmpxchg_relaxed(&lock->val, &val,
> >> +				_Q_LOCKED_VAL))
> >> +			return true; /* No contention */
> >> +	} else {
> >> +		/*
> >> +		 * Pass the lock to the first thread in the secondary
> >> +		 * queue, but first try to update the queue's tail to
> >> +		 * point to the last node in the secondary queue.
> > 
> > 
> > That comment doesn't make sense; there's at least one conditional
> > missing.
> In CNA, we cannot just clear the tail when the MCS chain is empty, as 
> there might be nodes in the ‘other’ chain. In that case (this is the “else” part),
> we want to pass the lock to the first node in the ‘other’ chain, but 
> first we need to put the last node from that chain into the tail. Perhaps the
> comment should read “…  but first try to update the *primary* queue's tail …”, 
> if that makes more sense.

It is 'try and pass the lock' at best. It is not a
definite/unconditional thing we're doing.

> >> +		 */
> >> +		struct cna_node *succ = CNA_NODE(node->locked);
> >> +		u32 new = succ->tail->encoded_tail + _Q_LOCKED_VAL;
> >> +
> >> +		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
> >> +			arch_mcs_spin_unlock_contended(&succ->mcs.locked, 1);
> >> +			return true;
> >> +		}
> >> +	}
> >> +
> >> +	return false;
> >> +}

> >> +static inline void cna_pass_mcs_lock(struct mcs_spinlock *node,
> >> +				     struct mcs_spinlock *next)
> >> +{
> >> +	struct cna_node *succ = NULL;
> >> +	u64 *var = &next->locked;
> >> +	u64 val = 1;
> >> +
> >> +	succ = find_successor(node);
> >> +
> >> +	if (succ) {
> >> +		var = &succ->mcs.locked;
> >> +		/*
> >> +		 * We unlock a successor by passing a non-zero value,
> >> +		 * so set @val to 1 iff @locked is 0, which will happen
> >> +		 * if we acquired the MCS lock when its queue was empty
> >> +		 */
> >> +		val = node->locked + (node->locked == 0);
> >> +	} else if (node->locked > 1) { /* if the secondary queue is not empty */
> >> +		/* pass the lock to the first node in that queue */
> >> +		succ = CNA_NODE(node->locked);
> >> +		succ->tail->mcs.next = next;
> >> +		var = &succ->mcs.locked;
> > 
> >> +	}	/*
> >> +		 * Otherwise, pass the lock to the immediate successor
> >> +		 * in the main queue.
> >> +		 */
> > 
> > I don't think this mis-indented comment can happen. The call-site
> > guarantees @next is non-null.
> > 
> > Therefore, cna_find_next() will either return it, or place it on the
> > secondary list. If it (cna_find_next) returns NULL, we must have a
> > non-empty secondary list.
> > 
> > In no case do I see this tertiary condition being possible.
> find_successor() will return NULL if it does not find a thread running on the 
> same NUMA node. And the secondary queue might be empty at that time.

See; I couldn't untangle that case from the code. Means readability
needs improving.


* Re: [PATCH v3 0/5] Add NUMA-awareness to qspinlock
       [not found]   ` <7D29555E-8F72-4EDD-8A87-B1A59C3945A6@oracle.com>
@ 2019-07-16 23:07     ` Nicholas Piggin
  0 siblings, 0 replies; 33+ messages in thread
From: Nicholas Piggin @ 2019-07-16 23:07 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, x86, Arnd Bergmann, peterz, linux-kernel, jglauber,
	guohanjun, dave.dice, linux, steven.sistare, daniel.m.jordan,
	rahul.x.yadav, mingo, bp, hpa, will.deacon, longman, tglx,
	linux-arm-kernel

Alex Kogan's on July 17, 2019 12:45 am:
> 
>> On Jul 16, 2019, at 7:47 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>> 
>> Alex Kogan's on July 16, 2019 5:25 am:
>>> Our evaluation shows that CNA also improves performance of user 
>>> applications that have hot pthread mutexes. Those mutexes are 
>>> blocking, and waiting threads park and unpark via the futex 
>>> mechanism in the kernel. Given that kernel futex chains, which
>>> are hashed by the mutex address, are each protected by a 
>>> chain-specific spin lock, the contention on a user-mode mutex 
>>> translates into contention on a kernel level spinlock. 
>> 
>> What applications are those, what performance numbers? Arguably that's
>> much more interesting than microbenchmarks (which are mainly useful to
>> help ensure the fast paths are not impacted IMO).
> 
> Those are applications that use locks in which waiting threads can park (block),
> e.g., pthread mutexes. Under (user-level) contention, the park-unpark mechanism
> in the kernel creates contention on (kernel) spin locks protecting futex chains.
> As an example, we experimented with LevelDB (key-value store), and included
> performance numbers in the patch. Or you are looking for something else?

Oh, no that's good. I confused myself thinking that was another
will-it-scale benchmark. The speedup becomes significant on readrandom; I
wonder if it might be that you're gating which threads get to complete the
futex operation, and so the effect is amplified beyond just the critical
section of the spin lock?

Am I reading the table correctly, this test gets about 2.1x speedup when
scaling from 1 to 142 threads in the patch-CNA case?

Thanks,
Nick


* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-15 19:25 ` [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
                     ` (2 preceding siblings ...)
  2019-07-16 15:50   ` Peter Zijlstra
@ 2019-07-17  2:16   ` Waiman Long
  2019-07-17  7:44     ` Peter Zijlstra
  3 siblings, 1 reply; 33+ messages in thread
From: Waiman Long @ 2019-07-17  2:16 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: rahul.x.yadav, dave.dice, steven.sistare, daniel.m.jordan

On 7/15/19 3:25 PM, Alex Kogan wrote:
> +/*
> + * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
> + *
> + * In CNA, spinning threads are organized in two queues, a main queue for
> + * threads running on the same node as the current lock holder, and a
> + * secondary queue for threads running on other nodes. At the unlock time,
> + * the lock holder scans the main queue looking for a thread running on
> + * the same node. If found (call it thread T), all threads in the main queue
> + * between the current lock holder and T are moved to the end of the
> + * secondary queue, and the lock is passed to T. If such T is not found, the
> + * lock is passed to the first node in the secondary queue. Finally, if the
> + * secondary queue is empty, the lock is passed to the next thread in the
> + * main queue. To avoid starvation of threads in the secondary queue,
> + * those threads are moved back to the head of the main queue
> + * after a certain expected number of intra-node lock hand-offs.
> + *
> + * For more details, see https://arxiv.org/abs/1810.05600.
> + *
> + * Authors: Alex Kogan <alex.kogan@oracle.com>
> + *          Dave Dice <dave.dice@oracle.com>
> + */
> +
> +struct cna_node {
> +	struct	mcs_spinlock mcs;
> +	u32	numa_node;
> +	u32	encoded_tail;
> +	struct	cna_node *tail;    /* points to the secondary queue tail */
> +};
> +
> +#define CNA_NODE(ptr) ((struct cna_node *)(ptr))
> +
> +static void cna_init_node(struct mcs_spinlock *node)
> +{
> +	struct cna_node *cn = CNA_NODE(node);
> +	struct mcs_spinlock *base_node;
> +	int cpuid;
> +
> +	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
> +	/* we store a pointer in the node's @locked field */
> +	BUILD_BUG_ON(sizeof(uintptr_t) > sizeof_field(struct mcs_spinlock, locked));
> +
> +	cpuid = smp_processor_id();
> +	cn->numa_node = cpu_to_node(cpuid);
> +
> +	base_node = this_cpu_ptr(&qnodes[0].mcs);
> +	cn->encoded_tail = encode_tail(cpuid, base_node->count - 1);
> +}
> +
> +/**
> + * find_successor - Scan the main waiting queue looking for the first
> + * thread running on the same node as the lock holder. If found (call it
> + * thread T), move all threads in the main queue between the lock holder
> + * and T to the end of the secondary queue and return T; otherwise, return NULL.
> + */

Here you talk about main and secondary queues. However, there is no
mention of what those queues actually are. As I am familiar with the
qspinlock queue, I can figure out that the main queue is the
mcs_node->next chain that starts from the MCS lock holder. The secondary
queue is a separate MCS node chain with its head stored in the
mcs_node->locked value of the MCS lock holder and its tail stored in
cna->tail. More detailed comments on what and where these queues are will
help to improve the readability of the code.
A simple graphic to illustrate those queues will help too, for example

/*
 * MCS lock holder
 * ===============
 *    mcs_node
 *   +--------+      +----+         +----+
 *   | next   | ---> |next| -> ...  |next| -> NULL  [Main queue]
 *   | locked | -+   +----+         +----+
 *   +--------+  |
 *               |   +----+         +----+
 *               +-> |next| -> ...  |next| -> X     [Secondary queue]
 *    cna_node       +----+         +----+
 *   +--------+                       ^
 *   | tail   | ----------------------+
 *   +--------+
 *
 * N.B. locked = 1 if secondary queue is absent.
 */

> +static struct cna_node *find_successor(struct mcs_spinlock *me)
> +{
> +	struct cna_node *me_cna = CNA_NODE(me);
> +	struct cna_node *head_other, *tail_other, *cur;
As you have talked about the secondary queue, how about head_2nd and
tail_2nd instead of *_other?

Cheers,
Longman



* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-17  2:16   ` Waiman Long
@ 2019-07-17  7:44     ` Peter Zijlstra
  2019-07-17 13:35       ` Waiman Long
  2019-07-17 14:42       ` Alex Kogan
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-17  7:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, linux-kernel, rahul.x.yadav, mingo, bp, hpa,
	Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On Tue, Jul 16, 2019 at 10:16:29PM -0400, Waiman Long wrote:
>  A simple graphic to illustrate those queues will help too, for example

Very much yes!

> /*
>  * MCS lock holder
>  * ===============
>  *    mcs_node
>  *   +--------+      +----+         +----+
>  *   | next   | ---> |next| -> ...  |next| -> NULL  [Main queue]
>  *   | locked | -+   +----+         +----+
>  *   +--------+  |
>  *               |   +----+         +----+
>  *               +-> |next| -> ...  |next| -> X     [Secondary queue]
>  *    cna_node       +----+         +----+
>  *   +--------+                       ^
>  *   | tail   | ----------------------+
>  *   +--------+

Almost; IIUC that cna_node is the same as the one from locked, so you
end up with something like:

>  *    mcs_node
>  *   +--------+      +----+         +----+
>  *   | next   | ---> |next| -> ...  |next| -> NULL  [Main queue]
>  *   | locked | -+   +----+         +----+
>  *   +--------+  |
>  *               |   +---------+         +----+
>  *               +-> |mcs::next| -> ...  |next| -> NULL     [Secondary queue]
>  *                   |cna::tail| -+      +----+
>  *                   +---------+  |        ^
>  *                                +--------+
>  *
>  * N.B. locked = 1 if secondary queue is absent.
>  */


* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-16 18:47       ` Peter Zijlstra
@ 2019-07-17  8:39         ` Peter Zijlstra
  2019-07-17  8:59           ` Peter Zijlstra
       [not found]           ` <FFC2D45A-24B3-40E1-ABBB-1D696E830B23@oracle.com>
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-17  8:39 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Tue, Jul 16, 2019 at 08:47:24PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 16, 2019 at 01:19:16PM -0400, Alex Kogan wrote:
> > > On Jul 16, 2019, at 11:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > static void cna_move(struct cna_node *cn, struct cna_node *cni)
> > > {
> > > 	struct cna_node *head, *tail;
> > > 
> > > 	/* remove @cni */
> > > 	WRITE_ONCE(cn->mcs.next, cni->mcs.next);
> > > 
> > > 	/* stick @cni on the 'other' list tail */
> > > 	cni->mcs.next = NULL;
> > > 
> > > 	if (cn->mcs.locked <= 1) {
> > > 		/* head = tail = cni */
> > > 		head = cni;
> > > 		head->tail = cni;
> > > 		cn->mcs.locked = head->encoded_tail;
> > > 	} else {
> > > 		/* add to tail */
> > > 		head = (struct cna_node *)decode_tail(cn->mcs.locked);
> > > 		tail = head->tail;
> > > 		tail->next = cni;
> > > 		head->tail = cni;
> > > 	}
> > > }
> > > 
> > > static struct cna_node *cna_find_next(struct mcs_spinlock *node)
> > > {
> > > 	struct cna_node *cni, *cn = (struct cna_node *)node;
> > > 
> > > 	while ((cni = (struct cna_node *)READ_ONCE(cn->mcs.next))) {
> > > 		if (likely(cni->node == cn->node))
> > > 			break;
> > > 
> > > 		cna_move(cn, cni);
> > > 	}
> > > 
> > > 	return cni;
> > > }
> > But then you move nodes from the main list to the ‘other’ list one-by-one.
> > I’m afraid this would be unnecessarily expensive.
> > Plus, all this extra work is wasted if you do not find a thread on the same 
> > NUMA node (you move everyone to the ‘other’ list only to move them back in 
> > cna_mcs_pass_lock()).
> 
> My primary concern was readability; I find the above suggestion much
> more readable. Maybe it can be written differently; you'll have to play
> around a bit.

static void cna_splice_tail(struct cna_node *cn, struct cna_node *head, struct cna_node *tail)
{
	struct cna_node *list;

	/* remove [head,tail] */
	WRITE_ONCE(cn->mcs.next, tail->mcs.next);
	tail->mcs.next = NULL;

	/* stick [head,tail] on the secondary list tail */
	if (cn->mcs.locked <= 1) {
		/* create secondary list */
		head->tail = tail;
		cn->mcs.locked = head->encoded_tail;
	} else {
		/* add to tail */
		list = (struct cna_node *)decode_tail(cn->mcs.locked);
		list->tail->next = head;
		list->tail = tail;
	}
}

static struct cna_node *cna_find_next(struct mcs_spinlock *node)
{
	struct cna_node *cni, *cn = (struct cna_node *)node;
	struct cna_node *head, *tail = NULL;

	/* find any next lock from 'our' node */
	for (head = cni = (struct cna_node *)READ_ONCE(cn->mcs.next);
	     cni && cni->node != cn->node;
	     tail = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
		;

	/* when found, splice any skipped locks onto the secondary list */
	if (cni && tail)
		cna_splice_tail(cn, head, tail);

	return cni;
}
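
For example (hypothetical node numbers): if the lock holder runs on node 0
and the main queue behind it is A(node 1) -> B(node 1) -> C(node 0) -> D(node 0),
the loop stops at C with head == A and tail == B; cna_splice_tail() then
unlinks [A,B] from the main queue and appends them to the secondary queue
(creating it if mcs.locked was still 1), and C is returned so the lock stays
on node 0, while D simply remains queued behind C.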

How's that?


* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-17  8:39         ` Peter Zijlstra
@ 2019-07-17  8:59           ` Peter Zijlstra
  2019-07-17 14:52             ` Alex Kogan
       [not found]           ` <FFC2D45A-24B3-40E1-ABBB-1D696E830B23@oracle.com>
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-17  8:59 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Wed, Jul 17, 2019 at 10:39:44AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 16, 2019 at 08:47:24PM +0200, Peter Zijlstra wrote:

> > My primary concern was readability; I find the above suggestion much
> > more readable. Maybe it can be written differently; you'll have to play
> > around a bit.
> 
> static void cna_splice_tail(struct cna_node *cn, struct cna_node *head, struct cna_node *tail)
> {
> 	struct cna_node *list;
> 
> 	/* remove [head,tail] */
> 	WRITE_ONCE(cn->mcs.next, tail->mcs.next);
> 	tail->mcs.next = NULL;
> 
> 	/* stick [head,tail] on the secondary list tail */
> 	if (cn->mcs.locked <= 1) {
> 		/* create secondary list */
> 		head->tail = tail;
> 		cn->mcs.locked = head->encoded_tail;
> 	} else {
> 		/* add to tail */
> 		list = (struct cna_node *)decode_tail(cn->mcs.locked);
> 		list->tail->next = head;
> 		list->tail = tail;
> 	}
> }
> 
> static struct cna_node *cna_find_next(struct mcs_spinlock *node)
> {
> 	struct cna_node *cni, *cn = (struct cna_node *)node;
> 	struct cna_node *head, *tail = NULL;
> 
> 	/* find any next lock from 'our' node */
> 	for (head = cni = (struct cna_node *)READ_ONCE(cn->mcs.next);
> 	     cni && cni->node != cn->node;
> 	     tail = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
> 		;

I think we can do away with those READ_ONCE()s, at this point those
pointers should be stable. But please double check.

> 	/* when found, splice any skipped locks onto the secondary list */
> 	if (cni && tail)
> 		cna_splice_tail(cn, head, tail);
> 
> 	return cni;
> }
> 
> How's that?


* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-17  7:44     ` Peter Zijlstra
@ 2019-07-17 13:35       ` Waiman Long
  2019-07-17 14:42       ` Alex Kogan
  1 sibling, 0 replies; 33+ messages in thread
From: Waiman Long @ 2019-07-17 13:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, linux-kernel, rahul.x.yadav, mingo, bp, hpa,
	Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On 7/17/19 3:44 AM, Peter Zijlstra wrote:
> On Tue, Jul 16, 2019 at 10:16:29PM -0400, Waiman Long wrote:
>>  A simple graphic to illustrate those queues will help too, for example
> Very much yes!
>
>> /*
>>  * MCS lock holder
>>  * ===============
>>  *    mcs_node
>>  *   +--------+      +----+         +----+
>>  *   | next   | ---> |next| -> ...  |next| -> NULL  [Main queue]
>>  *   | locked | -+   +----+         +----+
>>  *   +--------+  |
>>  *               |   +----+         +----+
>>  *               +-> |next| -> ...  |next| -> X     [Secondary queue]
>>  *    cna_node       +----+         +----+
>>  *   +--------+                       ^
>>  *   | tail   | ----------------------+
>>  *   +--------+
> Almost; IIUC that cna_node is the same as the one from locked, so you
> end up with something like:
>
>>  *    mcs_node
>>  *   +--------+      +----+         +----+
>>  *   | next   | ---> |next| -> ...  |next| -> NULL  [Main queue]
>>  *   | locked | -+   +----+         +----+
>>  *   +--------+  |
>>  *               |   +---------+         +----+
>>  *               +-> |mcs::next| -> ...  |next| -> NULL     [Secondary queue]
>>  *                   |cna::tail| -+      +----+
>>  *                   +---------+  |        ^
>>  *                                +--------+
>>  *
>>  * N.B. locked = 1 if secondary queue is absent.
>>  */

Yes, you are right. Thanks for the correction.

Cheers,
Longman



* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-17  7:44     ` Peter Zijlstra
  2019-07-17 13:35       ` Waiman Long
@ 2019-07-17 14:42       ` Alex Kogan
  1 sibling, 0 replies; 33+ messages in thread
From: Alex Kogan @ 2019-07-17 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, Waiman Long, tglx, daniel.m.jordan,
	linux-arm-kernel


>>  *    mcs_node
>>  *   +--------+      +----+         +----+
>>  *   | next   | ---> |next| -> ...  |next| -> NULL  [Main queue]
>>  *   | locked | -+   +----+         +----+
>>  *   +--------+  |
>>  *               |   +---------+         +----+
>>  *               +-> |mcs::next| -> ...  |next| -> NULL     [Secondary queue]
>>  *                   |cna::tail| -+      +----+
>>  *                   +---------+  |        ^
>>  *                                +--------+
>>  *
>>  * N.B. locked = 1 if secondary queue is absent.
>>  */


I would change mcs_node to cna_node, next to mcs::next and locked to mcs::locked
in the very first node. Other than that, this is great. Thanks, Peter and Longman!

I should probably stick this graphic in the comment at the top of the file, 
instead of a comment to find_successor() (or whatever this function ends up 
being called).

— Alex




* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-17  8:59           ` Peter Zijlstra
@ 2019-07-17 14:52             ` Alex Kogan
  0 siblings, 0 replies; 33+ messages in thread
From: Alex Kogan @ 2019-07-17 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel


> On Jul 17, 2019, at 4:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Jul 17, 2019 at 10:39:44AM +0200, Peter Zijlstra wrote:
>> On Tue, Jul 16, 2019 at 08:47:24PM +0200, Peter Zijlstra wrote:
> 
>>> My primary concern was readability; I find the above suggestion much
>>> more readable. Maybe it can be written differently; you'll have to play
>>> around a bit.
>> 
>> static void cna_splice_tail(struct cna_node *cn, struct cna_node *head, struct cna_node *tail)
>> {
>> 	struct cna_node *list;
>> 
>> 	/* remove [head,tail] */
>> 	WRITE_ONCE(cn->mcs.next, tail->mcs.next);
>> 	tail->mcs.next = NULL;
>> 
>> 	/* stick [head,tail] on the secondary list tail */
>> 	if (cn->mcs.locked <= 1) {
>> 		/* create secondary list */
>> 		head->tail = tail;
>> 		cn->mcs.locked = head->encoded_tail;
>> 	} else {
>> 		/* add to tail */
>> 		list = (struct cna_node *)decode_tail(cn->mcs.locked);
>> 		list->tail->next = head;
>> 		list->tail = tail;
>> 	}
>> }
>> 
>> static struct cna_node *cna_find_next(struct mcs_spinlock *node)
>> {
>> 	struct cna_node *cni, *cn = (struct cna_node *)node;
>> 	struct cna_node *head, *tail = NULL;
>> 
>> 	/* find any next lock from 'our' node */
>> 	for (head = cni = (struct cna_node *)READ_ONCE(cn->mcs.next);
>> 	     cni && cni->node != cn->node;
>> 	     tail = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
>> 		;
> 
> I think we can do away with those READ_ONCE()s, at this point those
> pointers should be stable. But please double check.

I think we can get rid of WRITE_ONCE above and the first READ_ONCE, as the 
“first” next pointer (in the node of the current lock holder) is stable at this
point, and is not read / written concurrently. We do need the second READ_ONCE
as we traverse the list and can come across a next pointer being changed.
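
In other words, a sketch of that variant (plain accesses for the holder's own
next pointer, READ_ONCE() kept only for the traversal):

	/* in cna_splice_tail(): the holder's ->next is not accessed concurrently here */
	cn->mcs.next = tail->mcs.next;

	/* in cna_find_next(): only the traversal can race with new arrivals */
	for (head = cni = (struct cna_node *)cn->mcs.next;
	     cni && cni->node != cn->node;
	     tail = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
		;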

— Alex

> 
>> 	/* when found, splice any skipped locks onto the secondary list */
>> 	if (cni && tail)
>> 		cna_splice_tail(cn, head, tail);
>> 
>> 	return cni;
>> }
>> 
>> How's that?



* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
       [not found]           ` <FFC2D45A-24B3-40E1-ABBB-1D696E830B23@oracle.com>
@ 2019-07-17 15:09             ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-17 15:09 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Wed, Jul 17, 2019 at 10:27:30AM -0400, Alex Kogan wrote:
> > On Jul 17, 2019, at 4:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > static void cna_splice_tail(struct cna_node *cn, struct cna_node *head, struct cna_node *tail)
> > {
> > 	struct cna_node *list;
> > 
> > 	/* remove [head,tail] */
> > 	WRITE_ONCE(cn->mcs.next, tail->mcs.next);
> > 	tail->mcs.next = NULL;
> > 
> > 	/* stick [head,tail] on the secondary list tail */
> > 	if (cn->mcs.locked <= 1) {
> > 		/* create secondary list */
> > 		head->tail = tail;
> > 		cn->mcs.locked = head->encoded_tail;
> > 	} else {
> > 		/* add to tail */
> > 		list = (struct cna_node *)decode_tail(cn->mcs.locked);
> > 		list->tail->next = head;
> > 		list->tail = tail;
> > 	}
> > }
> > 
> > static struct cna_node *cna_find_next(struct mcs_spinlock *node)
> > {
> > 	struct cna_node *cni, *cn = (struct cna_node *)node;
> > 	struct cna_node *head, *tail = NULL;
> > 
> > 	/* find any next lock from 'our' node */
> > 	for (head = cni = (struct cna_node *)READ_ONCE(cn->mcs.next);
> > 	     cni && cni->node != cn->node;
> > 	     tail = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
> > 		;
> > 
> > 	/* when found, splice any skipped locks onto the secondary list */
> > 	if (cni && tail)
> > 		cna_splice_tail(cn, head, tail);
> > 
> > 	return cni;
> > }
> > 
> > How's that?
> 
> This is almost perfect!! :)
> 
> The above should work, but I think we should have a specialized fast-path for 
> checking the immediate next node in the main queue. This would be the common
> case, once we splice ‘other’ nodes onto the secondary queue. In the above we
> would go through four branches before returning from cna_find_next(). In the 
> following we would have just one:

Right, but can you measure a difference? ;-) Anyway, no real objection,
just playing devil's advocate here.

> > static struct cna_node *cna_find_next(struct mcs_spinlock *node)
> > {
> > 	struct cna_node *cn = (struct cna_node *)node;
> 	   struct cna_node *cni = (struct cna_node *)READ_ONCE(cn->mcs.next);
> > 
> > 	struct cna_node *head, *tail = NULL;
> > 
> 	   /* fast path */
> 	   if (cni->node == cn->node) 
> 		return cni;
> 
> > 	/* find any next lock from 'our' node */
> > 	for (head = cn->mcs.next;
	     head = cni,

you just did that load :-)

> > 	     cni && cni->node != cn->node;
> > 	     tail = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
> > 		;
> > 
> > 	/* when found, splice any skipped locks onto the secondary list */
> > 	if (cni && tail)
> > 		cna_splice_tail(cn, head, tail);
> > 
> > 	return cni;
> > }
> 
> 
> Also, any reason you say ‘lock’ instead of ’node’ in the comments?
> I.e., I think "when found, splice any skipped locks onto the secondary list” should be
> "when found, splice any skipped nodes onto the secondary list”.

Due to the confusion between lock waiter node and numa node :-)
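
Folding Alex's fast path and the "you just did that load" remark together,
the variant under discussion would look roughly like this (a sketch, not the
final patch; note the NULL check on the first waiter, which the snippet above
does not show):

static struct cna_node *cna_find_next(struct mcs_spinlock *node)
{
	struct cna_node *cni, *cn = (struct cna_node *)node;
	struct cna_node *head, *tail = NULL;

	head = cni = (struct cna_node *)READ_ONCE(cn->mcs.next);

	/* fast path: the very next waiter runs on our numa node */
	if (cni && cni->node == cn->node)
		return cni;

	/* otherwise scan the main queue for a waiter from our node */
	while (cni && cni->node != cn->node) {
		tail = cni;
		cni = (struct cna_node *)READ_ONCE(cni->mcs.next);
	}

	/* when found, splice the skipped waiters onto the secondary queue */
	if (cni && tail)
		cna_splice_tail(cn, head, tail);

	return cni;
}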


* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-16 14:50         ` Waiman Long
@ 2019-07-17 17:44           ` Alex Kogan
  2019-07-17 17:58             ` Waiman Long
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Kogan @ 2019-07-17 17:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, arnd, Peter Zijlstra, dave.dice, jglauber,
	x86, Will Deacon, linux, linux-kernel, rahul.x.yadav,
	Ingo Molnar, bp, hpa, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel


> On Jul 16, 2019, at 10:50 AM, Waiman Long <longman@redhat.com> wrote:
> 
> On 7/16/19 10:29 AM, Alex Kogan wrote:
>> 
>>> On Jul 15, 2019, at 7:22 PM, Waiman Long <longman@redhat.com
>>> <mailto:longman@redhat.com>> wrote:
>>> 
>>> On 7/15/19 5:30 PM, Waiman Long wrote:
>>>>> -#ifndef _GEN_PV_LOCK_SLOWPATH
>>>>> +#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
>>>>> 
>>>>> #include <linux/smp.h>
>>>>> #include <linux/bug.h>
>>>>> @@ -77,18 +77,14 @@
>>>>> #define MAX_NODES	4
>>>>> 
>>>>> /*
>>>>> - * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
>>>>> - * size and four of them will fit nicely in one 64-byte cacheline. For
>>>>> - * pvqspinlock, however, we need more space for extra data. To accommodate
>>>>> - * that, we insert two more long words to pad it up to 32 bytes. IOW, only
>>>>> - * two of them can fit in a cacheline in this case. That is OK as it is rare
>>>>> - * to have more than 2 levels of slowpath nesting in actual use. We don't
>>>>> - * want to penalize pvqspinlocks to optimize for a rare case in native
>>>>> - * qspinlocks.
>>>>> + * On 64-bit architectures, the mcs_spinlock structure will be 20 bytes in
>>>>> + * size. For pvqspinlock or the NUMA-aware variant, however, we need more
>>>>> + * space for extra data. To accommodate that, we insert two more long words
>>>>> + * to pad it up to 36 bytes.
>>>>> */
>>>> The 20 bytes figure is wrong. It is actually 24 bytes for 64-bit as the
>>>> mcs_spinlock structure is 8-byte aligned. For better cacheline
>>>> alignment, I would like to keep mcs_spinlock at 16 bytes as before.
>>>> Instead, you can use encode_tail() to store the CNA node pointer in
>>>> "locked". For instance, use (encode_tail() << 1) in locked to
>>>> distinguish it from the regular locked=1 value.
>>> 
>>> Actually, the encoded tail value is already shifted left by either 16 bits
>>> or 9 bits. So there is no need to shift it. You can assign it directly:
>>> 
>>> mcs->locked = cna->encoded_tail;
>>> 
>>> You do need to change the type of locked to "unsigned int", though,
>>> for proper comparison with "1".
>>> 
>> Got it, thanks.
>> 
> I forgot to mention that I would like to see a boot command line option
> to force the numa qspinlock code off, and maybe on as well. This can help
> in testing as you don't need to build 2 separate kernels, one with
> NUMA_AWARE_SPINLOCKS on and one with it off.
IIUC it should be easy to add a boot option to force off the NUMA-aware spinlock 
even if it is enabled through config, but the other way around would require
compiling in the NUMA-aware spinlock stuff even if the config option is disabled.
Is that ok?

Also, what should the option name be?
"numa_spinlock=on/off” if we want both ways, or “no_numa_spinlock" if we want just the “force off” option?

> For small 2-socket systems,
> numa qspinlock may not help much.
It actually helps quite a bit (e.g., speedups of 42-57% for will-it-scale on a dual-socket x86 system).
We have numbers and plots in our paper on arxiv.

Regards,
— Alex



* Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-07-17 17:44           ` Alex Kogan
@ 2019-07-17 17:58             ` Waiman Long
  0 siblings, 0 replies; 33+ messages in thread
From: Waiman Long @ 2019-07-17 17:58 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, Peter Zijlstra, dave.dice, jglauber,
	x86, Will Deacon, linux, linux-kernel, rahul.x.yadav,
	Ingo Molnar, bp, hpa, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On 7/17/19 1:44 PM, Alex Kogan wrote:
>> On Jul 16, 2019, at 10:50 AM, Waiman Long <longman@redhat.com> wrote:
>>
>> On 7/16/19 10:29 AM, Alex Kogan wrote:
>>>> On Jul 15, 2019, at 7:22 PM, Waiman Long <longman@redhat.com
>>>> <mailto:longman@redhat.com>> wrote:
>>>>
>>>> On 7/15/19 5:30 PM, Waiman Long wrote:
>>>>>> -#ifndef _GEN_PV_LOCK_SLOWPATH
>>>>>> +#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
>>>>>>
>>>>>> #include <linux/smp.h>
>>>>>> #include <linux/bug.h>
>>>>>> @@ -77,18 +77,14 @@
>>>>>> #define MAX_NODES	4
>>>>>>
>>>>>> /*
>>>>>> - * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
>>>>>> - * size and four of them will fit nicely in one 64-byte cacheline. For
>>>>>> - * pvqspinlock, however, we need more space for extra data. To accommodate
>>>>>> - * that, we insert two more long words to pad it up to 32 bytes. IOW, only
>>>>>> - * two of them can fit in a cacheline in this case. That is OK as it is rare
>>>>>> - * to have more than 2 levels of slowpath nesting in actual use. We don't
>>>>>> - * want to penalize pvqspinlocks to optimize for a rare case in native
>>>>>> - * qspinlocks.
>>>>>> + * On 64-bit architectures, the mcs_spinlock structure will be 20 bytes in
>>>>>> + * size. For pvqspinlock or the NUMA-aware variant, however, we need more
>>>>>> + * space for extra data. To accommodate that, we insert two more long words
>>>>>> + * to pad it up to 36 bytes.
>>>>>> */
>>>>> The 20 bytes figure is wrong. It is actually 24 bytes for 64-bit as the
>>>>> mcs_spinlock structure is 8-byte aligned. For better cacheline
>>>>> alignment, I would like to keep mcs_spinlock at 16 bytes as before.
>>>>> Instead, you can use encode_tail() to store the CNA node pointer in
>>>>> "locked". For instance, use (encode_tail() << 1) in locked to
>>>>> distinguish it from the regular locked=1 value.
>>>> Actually, the encoded tail value is already shifted left by either 16 bits
>>>> or 9 bits. So there is no need to shift it. You can assign it directly:
>>>>
>>>> mcs->locked = cna->encoded_tail;
>>>>
>>>> You do need to change the type of locked to "unsigned int", though,
>>>> for proper comparison with "1".
>>>>
>>> Got it, thanks.
>>>
>> I forgot to mention that I would like to see a boot command line option
>> to force the numa qspinlock code off, and maybe on as well. This can help
>> in testing as you don't need to build 2 separate kernels, one with
>> NUMA_AWARE_SPINLOCKS on and one with it off.
> IIUC it should be easy to add a boot option to force off the NUMA-aware spinlock 
> even if it is enabled through config, but the other way around would require
> compiling in the NUMA-aware spinlock stuff even if the config option is disabled.
> Is that ok?

That is not what I am looking for. If the config option is disabled, the
boot command line option is disabled also. For the on case, one possible
usage scenario is with a VM guest where all the vcpus are pinned to
physical CPUs with no over-commit and correct numa information. In that
case, one may want to use numa-qspinlock instead of pv-qspinlock.

> Also, what should the option name be?
> "numa_spinlock=on/off” if we want both ways, or “no_numa_spinlock" if we want just the “force off” option?

I think "numa_spinlock=on/off" will be good. The default is "auto" where
it will be turned on when there is more than one numa node in the system.
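
For illustration, such a three-way switch could be wired up roughly as below
(the parameter name is from this thread; the flag variable and the hook into
the boot-time patching are hypothetical, and early_param() may be needed if
the value has to be known before that patching runs):

/* -1: forced off, 0: auto (default), 1: forced on */
static int numa_spinlock_flag;

static int __init numa_spinlock_setup(char *str)
{
	if (!strcmp(str, "auto")) {
		numa_spinlock_flag = 0;
		return 1;
	}
	if (!strcmp(str, "on")) {
		numa_spinlock_flag = 1;
		return 1;
	}
	if (!strcmp(str, "off")) {
		numa_spinlock_flag = -1;
		return 1;
	}
	return 0;
}
__setup("numa_spinlock=", numa_spinlock_setup);

The boot-time patching would then pick CNA when the flag is 1, or when it is
0 and nr_node_ids > 1, and keep the stock slow path otherwise.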

>> For small 2-socket systems,
>> numa qspinlock may not help much.
> It actually helps quite a bit (e.g., speedups of 42-57% for will-it-scale on a dual-socket x86 system).
> We have numbers and plots in our paper on arxiv.

I am talking about older 2-socket systems where each socket may have
just a few cpus. Also, some Intel CPUs can be configured to have 2 numa
nodes per socket. For AMD EPYC, it can be configured to have 4 numa
nodes per socket.

Cheers,
Longman




end of thread, other threads:[~2019-07-17 17:58 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-15 19:25 [PATCH v3 0/5] Add NUMA-awareness to qspinlock Alex Kogan
2019-07-15 19:25 ` [PATCH v3 1/5] locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic Alex Kogan
2019-07-16 10:23   ` Peter Zijlstra
2019-07-15 19:25 ` [PATCH v3 2/5] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
2019-07-16 10:20   ` Peter Zijlstra
2019-07-16 14:53     ` Alex Kogan
2019-07-16 15:58       ` Peter Zijlstra
2019-07-15 19:25 ` [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
2019-07-15 21:30   ` Waiman Long
2019-07-16 11:04     ` Peter Zijlstra
2019-07-16 14:26     ` Alex Kogan
2019-07-16 14:44       ` Waiman Long
     [not found]     ` <aa73b86d-902a-bb6f-d372-8645c8299a6d@redhat.com>
     [not found]       ` <C1C55A40-FDB1-43B5-B551-F9B8BE776DF8@oracle.com>
2019-07-16 14:50         ` Waiman Long
2019-07-17 17:44           ` Alex Kogan
2019-07-17 17:58             ` Waiman Long
2019-07-16 11:05   ` Peter Zijlstra
2019-07-16 14:30     ` Alex Kogan
2019-07-16 15:50   ` Peter Zijlstra
2019-07-16 17:19     ` Alex Kogan
2019-07-16 18:47       ` Peter Zijlstra
2019-07-17  8:39         ` Peter Zijlstra
2019-07-17  8:59           ` Peter Zijlstra
2019-07-17 14:52             ` Alex Kogan
     [not found]           ` <FFC2D45A-24B3-40E1-ABBB-1D696E830B23@oracle.com>
2019-07-17 15:09             ` Peter Zijlstra
2019-07-17  2:16   ` Waiman Long
2019-07-17  7:44     ` Peter Zijlstra
2019-07-17 13:35       ` Waiman Long
2019-07-17 14:42       ` Alex Kogan
2019-07-15 19:25 ` [PATCH v3 4/5] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
2019-07-16 15:59   ` Peter Zijlstra
2019-07-15 19:25 ` [PATCH v3 5/5] locking/qspinlock: Introduce the shuffle reduction optimization " Alex Kogan
2019-07-16 11:47 ` [PATCH v3 0/5] Add NUMA-awareness to qspinlock Nicholas Piggin
     [not found]   ` <7D29555E-8F72-4EDD-8A87-B1A59C3945A6@oracle.com>
2019-07-16 23:07     ` Nicholas Piggin
