* [PATCH v10 0/5] Add NUMA-awareness to qspinlock
From: Alex Kogan @ 2020-04-03 20:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Changes from v9:
----------------

- Revise the series based on Peter's version, adopting names, style, etc.

- Add a new patch that allows prioritizing certain threads (e.g., in
irq and nmi contexts) and avoids moving them between waiting queues,
based on a suggestion by Longman.

- Drop the shuffle reduction optimization from the series (new performance
data did not justify it).

- Do not call cna_init_nodes() as an early_initcall (call it from
cna_configure_spin_lock_slowpath() instead), based on a comment from
Longman.


Summary
-------

Lock throughput can be increased by handing a lock to a waiter on the
same NUMA node as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA nodes. This series introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock. It is
enabled through a configuration option (NUMA_AWARE_SPINLOCKS).

CNA is a NUMA-aware version of the MCS lock. Spinning threads are
organized in two queues, a primary queue for threads running on the same
node as the current lock holder, and a secondary queue for threads
running on other nodes. Threads store the ID of the node on which
they are running in their queue nodes. After acquiring the MCS lock and
before acquiring the spinlock, the lock holder scans the primary queue
looking for a thread running on the same node (pre-scan). If found (call
it thread T), all threads in the primary queue between the current lock
holder and T are moved to the end of the secondary queue. If no such T
is found, we make another scan of the primary queue when unlocking the
MCS lock (post-scan, after the spinlock has been acquired), starting at
the node where the pre-scan stopped. If both scans fail to find such T, the
MCS lock is passed to the first thread in the secondary queue. If the
secondary queue is empty, the MCS lock is passed to the next thread in the
primary queue. To avoid starvation of threads in the secondary queue, those
threads are moved back to the head of the primary queue after a certain
number of intra-node lock hand-offs. Lastly, certain threads (e.g., those
in irq and nmi contexts) are given preferential treatment -- the scan
stops when such a thread is found, effectively never moving those threads
into the secondary queue.

More details are available at https://arxiv.org/abs/1810.05600.
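
To make the hand-off policy concrete, below is a minimal user-space sketch
of the selection logic. It is illustrative only -- it is not the kernel
implementation in this series (which encodes the secondary-queue tail in the
MCS node's locked field and splits the work between pre-scan and post-scan
as described above), and the names (struct waiter, pick_next, etc.) are made
up for this example:

    #include <stddef.h>

    struct waiter {
    	struct waiter *next;		/* next waiter in the same queue */
    	int numa_node;			/* NUMA node this waiter runs on */
    };

    struct queues {
    	struct waiter *primary;		/* head of the primary queue */
    	struct waiter *sec_head;	/* head of the secondary queue */
    	struct waiter *sec_tail;	/* tail of the secondary queue */
    };

    /* Append one waiter to the tail of the secondary queue. */
    static void move_to_secondary(struct queues *q, struct waiter *w)
    {
    	w->next = NULL;
    	if (q->sec_tail)
    		q->sec_tail->next = w;
    	else
    		q->sec_head = w;
    	q->sec_tail = w;
    }

    /*
     * Pick the next lock holder:
     *  1. the first primary-queue waiter on the current holder's node,
     *     moving any remote waiters skipped over to the secondary queue;
     *  2. otherwise, the head of the secondary queue;
     *  3. otherwise, the head of the primary queue.
     */
    static struct waiter *pick_next(struct queues *q, int holder_node)
    {
    	struct waiter *w, *m;

    	for (w = q->primary; w; w = w->next)
    		if (w->numa_node == holder_node)
    			break;

    	if (w) {
    		for (m = q->primary; m != w; ) {
    			struct waiter *skipped = m;

    			m = m->next;
    			move_to_secondary(q, skipped);
    		}
    		q->primary = w->next;
    		return w;
    	}

    	if (q->sec_head) {		/* no local waiter in the primary queue */
    		w = q->sec_head;
    		q->sec_head = w->next;
    		if (!q->sec_head)
    			q->sec_tail = NULL;
    		return w;
    	}

    	if (q->primary) {		/* secondary queue empty as well */
    		w = q->primary;
    		q->primary = w->next;
    		return w;
    	}

    	return NULL;			/* no waiters at all */
    }

The actual series additionally bounds how long the secondary queue may be
bypassed (patch 4/5) and keeps prioritized (irq/nmi) waiters on the primary
queue (patch 5/5).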

We have done some performance evaluation with the locktorture module
as well as with several benchmarks from the will-it-scale repo.
The following locktorture results are from an Oracle X5-4 server
(four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
cores each). Each number represents an average (over 25 runs) of the
total number of ops (x10^7) reported at the end of each run. The
standard deviation is reported in parentheses and is generally about 3%
of the average. The 'stock' kernel is v5.6.0-rc6,
commit 5ad0ec0b8652, compiled in the default configuration.
'patch-CNA' is the modified kernel with NUMA_AWARE_SPINLOCKS set;
the speedup is calculated by dividing 'patch-CNA' by 'stock'.

#thr  	 stock        patch-CNA   speedup (patch-CNA/stock)
  1  2.702 (0.100)  2.712 (0.122)  1.003
  2  3.691 (0.162)  3.672 (0.138)  0.995
  4  4.285 (0.108)  4.256 (0.124)  0.993
  8  5.117 (0.133)  5.972 (0.258)  1.167
 16  6.273 (0.196)  7.628 (0.274)  1.216
 32  6.757 (0.122)  8.544 (0.225)  1.264
 36  6.761 (0.091)  8.691 (0.170)  1.285
 72  6.569 (0.132)  9.280 (0.225)  1.413
108  6.167 (0.112)  9.410 (0.171)  1.526
142  5.901 (0.117)  9.415 (0.211)  1.595

The following tables contain throughput results (ops/us) from the same
setup for will-it-scale/open1_threads: 

#thr  	 stock        patch-CNA   speedup (patch-CNA/stock)
  1  0.511 (0.002)  0.525 (0.003)  1.027
  2  0.774 (0.018)  0.769 (0.017)  0.993
  4  1.352 (0.023)  1.372 (0.032)  1.014
  8  1.675 (0.090)  1.660 (0.136)  0.991
 16  1.665 (0.114)  1.583 (0.092)  0.951
 32  0.966 (0.038)  1.637 (0.087)  1.694
 36  0.973 (0.066)  1.570 (0.081)  1.613
 72  0.844 (0.040)  1.620 (0.091)  1.919
108  0.836 (0.040)  1.670 (0.084)  1.999
142  0.799 (0.043)  1.699 (0.087)  2.127

and will-it-scale/lock2_threads:

#thr  	 stock        patch-CNA   speedup (patch-CNA/stock)
  1  1.581 (0.004)  1.576 (0.007)  0.997
  2  2.699 (0.059)  2.687 (0.067)  0.996
  4  5.240 (0.234)  5.155 (0.252)  0.984
  8  4.370 (0.241)  4.111 (0.342)  0.941
 16  4.152 (0.112)  4.113 (0.164)  0.991
 32  2.579 (0.099)  4.099 (0.127)  1.589
 36  2.604 (0.066)  4.005 (0.104)  1.538
 72  2.028 (0.091)  4.024 (0.112)  1.984
108  2.079 (0.106)  3.997 (0.093)  1.923
142  1.858 (0.103)  3.955 (0.109)  2.129

Our evaluation shows that CNA also improves the performance of user
applications that have hot pthread mutexes. Those mutexes are
blocking, and waiting threads park and unpark via the futex
mechanism in the kernel. Given that kernel futex chains, which
are hashed by the mutex address, are each protected by a
chain-specific spin lock, the contention on a user-mode mutex
translates into contention on a kernel-level spinlock.
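
As an illustration only (this is a simplified sketch rather than the actual
code in kernel/futex.c, and the names here are hypothetical), the shape of
that contention is roughly:

    #include <linux/hash.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>

    #define BUCKET_BITS	8

    struct futex_bucket {
    	spinlock_t lock;		/* the kernel spinlock that becomes hot */
    	struct list_head chain;		/* waiters hashed to this bucket */
    };

    static struct futex_bucket buckets[1 << BUCKET_BITS];

    /* All waiters on the same user-space mutex map to the same bucket... */
    static struct futex_bucket *bucket_for(unsigned long uaddr)
    {
    	return &buckets[hash_long(uaddr, BUCKET_BITS)];
    }

    /*
     * ...so every park/unpark on a hot pthread mutex ends up taking the same
     * bucket->lock, which is where the qspinlock slow path (and hence CNA) runs.
     */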

Here are the results for the leveldb 'readrandom' benchmark:

#thr  	 stock        patch-CNA   speedup (patch-CNA/stock)
  1  0.530 (0.013)  0.533 (0.011)  1.006
  2  0.839 (0.043)  0.847 (0.031)  1.010
  4  0.758 (0.021)  0.764 (0.018)  1.008
  8  0.677 (0.022)  0.682 (0.016)  1.008
 16  0.714 (0.023)  0.814 (0.027)  1.140
 32  0.765 (0.040)  1.168 (0.032)  1.527
 36  0.706 (0.023)  1.139 (0.066)  1.614
 72  0.624 (0.017)  1.184 (0.026)  1.898
108  0.605 (0.013)  1.147 (0.023)  1.894
142  0.593 (0.012)  1.131 (0.019)  1.908

Further comments are welcome and appreciated.

Alex Kogan (5):
  locking/qspinlock: Rename mcs lock/unlock macros and make them more
    generic
  locking/qspinlock: Refactor the qspinlock slow path
  locking/qspinlock: Introduce CNA into the slow path of qspinlock
  locking/qspinlock: Introduce starvation avoidance into CNA
  locking/qspinlock: Avoid moving certain threads between waiting queues
    in CNA

 .../admin-guide/kernel-parameters.txt         |  18 +
 arch/arm/include/asm/mcs_spinlock.h           |   6 +-
 arch/x86/Kconfig                              |  20 +
 arch/x86/include/asm/qspinlock.h              |   6 +
 arch/x86/kernel/alternative.c                 |   2 +
 include/asm-generic/mcs_spinlock.h            |   4 +-
 kernel/locking/mcs_spinlock.h                 |  20 +-
 kernel/locking/qspinlock.c                    |  82 +++-
 kernel/locking/qspinlock_cna.h                | 407 ++++++++++++++++++
 kernel/locking/qspinlock_paravirt.h           |   2 +-
 10 files changed, 544 insertions(+), 23 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

-- 
2.21.1 (Apple Git-122.3)


* [PATCH v10 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic
From: Alex Kogan @ 2020-04-03 20:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

The mcs unlock macro (arch_mcs_lock_handoff) should accept the value to be
stored into the lock argument as another argument. This allows using the
same macro in cases where the value to be stored when passing the lock is
different from 1.
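
For instance, the generic slow path keeps passing 1, whereas the CNA
hand-off added later in this series may pass a value that encodes the tail
of its secondary queue:

	/* plain MCS/qspinlock hand-off: the lock word is simply set to 1 */
	arch_mcs_lock_handoff(&next->locked, 1);

	/* CNA hand-off (patch 3/5): @val is either _Q_LOCKED_VAL or an
	 * encoded pointer to the tail of the secondary queue
	 */
	arch_mcs_lock_handoff(&next->locked, val);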

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 arch/arm/include/asm/mcs_spinlock.h |  6 +++---
 include/asm-generic/mcs_spinlock.h  |  4 ++--
 kernel/locking/mcs_spinlock.h       | 18 +++++++++---------
 kernel/locking/qspinlock.c          |  4 ++--
 kernel/locking/qspinlock_paravirt.h |  2 +-
 5 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf4d06f..1eb4d733459c 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -6,7 +6,7 @@
 #include <asm/spinlock.h>
 
 /* MCS spin-locking. */
-#define arch_mcs_spin_lock_contended(lock)				\
+#define arch_mcs_spin_wait(lock)					\
 do {									\
 	/* Ensure prior stores are observed before we enter wfe. */	\
 	smp_mb();							\
@@ -14,9 +14,9 @@ do {									\
 		wfe();							\
 } while (0)								\
 
-#define arch_mcs_spin_unlock_contended(lock)				\
+#define arch_mcs_lock_handoff(lock, val)				\
 do {									\
-	smp_store_release(lock, 1);					\
+	smp_store_release((lock), (val));				\
 	dsb_sev();							\
 } while (0)
 
diff --git a/include/asm-generic/mcs_spinlock.h b/include/asm-generic/mcs_spinlock.h
index 10cd4ffc6ba2..f933d99c63e0 100644
--- a/include/asm-generic/mcs_spinlock.h
+++ b/include/asm-generic/mcs_spinlock.h
@@ -4,8 +4,8 @@
 /*
  * Architectures can define their own:
  *
- *   arch_mcs_spin_lock_contended(l)
- *   arch_mcs_spin_unlock_contended(l)
+ *   arch_mcs_spin_wait(l)
+ *   arch_mcs_lock_handoff(l, val)
  *
  * See kernel/locking/mcs_spinlock.c.
  */
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153b4d3c..904ba5d0f3f4 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -21,7 +21,7 @@ struct mcs_spinlock {
 	int count;  /* nesting count, see qspinlock.c */
 };
 
-#ifndef arch_mcs_spin_lock_contended
+#ifndef arch_mcs_spin_wait
 /*
  * Using smp_cond_load_acquire() provides the acquire semantics
  * required so that subsequent operations happen after the
@@ -29,20 +29,20 @@ struct mcs_spinlock {
  * ARM64 would like to do spin-waiting instead of purely
  * spinning, and smp_cond_load_acquire() provides that behavior.
  */
-#define arch_mcs_spin_lock_contended(l)					\
-do {									\
-	smp_cond_load_acquire(l, VAL);					\
+#define arch_mcs_spin_wait(l)					\
+do {								\
+	smp_cond_load_acquire(l, VAL);				\
 } while (0)
 #endif
 
-#ifndef arch_mcs_spin_unlock_contended
+#ifndef arch_mcs_lock_handoff
 /*
  * smp_store_release() provides a memory barrier to ensure all
  * operations in the critical section has been completed before
  * unlocking.
  */
-#define arch_mcs_spin_unlock_contended(l)				\
-	smp_store_release((l), 1)
+#define arch_mcs_lock_handoff(l, val)				\
+	smp_store_release((l), (val))
 #endif
 
 /*
@@ -91,7 +91,7 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	WRITE_ONCE(prev->next, node);
 
 	/* Wait until the lock holder passes the lock down. */
-	arch_mcs_spin_lock_contended(&node->locked);
+	arch_mcs_spin_wait(&node->locked);
 }
 
 /*
@@ -115,7 +115,7 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	}
 
 	/* Pass lock to next waiter. */
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_lock_handoff(&next->locked, 1);
 }
 
 #endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b9515fcc9b29..ac1dedbe0237 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -471,7 +471,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
-		arch_mcs_spin_lock_contended(&node->locked);
+		arch_mcs_spin_wait(&node->locked);
 
 		/*
 		 * While waiting for the MCS lock, the next pointer may have
@@ -550,7 +550,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_lock_handoff(&next->locked, 1);
 	pv_kick_node(lock, next);
 
 release:
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e84d21aa0722..619d80fd5ea8 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -368,7 +368,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	 *
 	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
 	 *
-	 * The write to next->locked in arch_mcs_spin_unlock_contended()
+	 * The write to next->locked in arch_mcs_lock_handoff()
 	 * must be ordered before the read of pn->state in the cmpxchg()
 	 * below for the code to work correctly. To guarantee full ordering
 	 * irrespective of the success or failure of the cmpxchg(),
-- 
2.21.1 (Apple Git-122.3)


* [PATCH v10 2/5] locking/qspinlock: Refactor the qspinlock slow path
From: Alex Kogan @ 2020-04-03 20:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Move some of the code manipulating the spin lock into separate functions.
This would allow easier integration of alternative ways to manipulate
that lock.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ac1dedbe0237..6e63c72e3fbd 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -289,6 +289,34 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
 #endif
 
+/*
+ * __try_clear_tail - try to clear tail by setting the lock value to
+ * _Q_LOCKED_VAL.
+ * @lock: Pointer to the queued spinlock structure
+ * @val: Current value of the lock
+ * @node: Pointer to the MCS node of the lock holder
+ */
+static __always_inline bool __try_clear_tail(struct qspinlock *lock,
+						   u32 val,
+						   struct mcs_spinlock *node)
+{
+	return atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL);
+}
+
+/*
+ * __mcs_lock_handoff - pass the MCS lock to the next waiter
+ * @node: Pointer to the MCS node of the lock holder
+ * @next: Pointer to the MCS node of the first waiter in the MCS queue
+ */
+static __always_inline void __mcs_lock_handoff(struct mcs_spinlock *node,
+					    struct mcs_spinlock *next)
+{
+	arch_mcs_lock_handoff(&next->locked, 1);
+}
+
+#define try_clear_tail		__try_clear_tail
+#define mcs_lock_handoff	__mcs_lock_handoff
+
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
 /**
@@ -533,7 +561,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *       PENDING will make the uncontended transition fail.
 	 */
 	if ((val & _Q_TAIL_MASK) == tail) {
-		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+		if (try_clear_tail(lock, val, node))
 			goto release; /* No contention */
 	}
 
@@ -550,7 +578,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_lock_handoff(&next->locked, 1);
+	mcs_lock_handoff(node, next);
 	pv_kick_node(lock, next);
 
 release:
@@ -575,6 +603,12 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #undef pv_kick_node
 #undef pv_wait_head_or_lock
 
+#undef try_clear_tail
+#define try_clear_tail		__try_clear_tail
+
+#undef mcs_lock_handoff
+#define mcs_lock_handoff			__mcs_lock_handoff
+
 #undef  queued_spin_lock_slowpath
 #define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
 
-- 
2.21.1 (Apple Git-122.3)


* [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
From: Alex Kogan @ 2020-04-03 20:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

In CNA, spinning threads are organized in two queues, a primary queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. After acquiring the
MCS lock and before acquiring the spinlock, the lock holder scans the
primary queue looking for a thread running on the same node (pre-scan). If
found (call it thread T), all threads in the primary queue between the
current lock holder and T are moved to the end of the secondary queue.
If such T is not found, we make another scan of the primary queue when
unlocking the MCS lock (post-scan), starting at the position where
pre-scan stopped. If both scans fail to find such T, the MCS lock is
passed to the first thread in the secondary queue. If the secondary queue
is empty, the lock is passed to the next thread in the primary queue.
For more details, see https://arxiv.org/abs/1810.05600.

Note that this variant of CNA may introduce starvation by continuously
passing the lock to threads running on the same node. This issue
will be addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at
boot time only if we run on a multi-node machine in a native environment and
the new config is enabled. (For the time being, the patching requires
CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
resolved once static_call() is available.) This default behavior can be
overridden with the new kernel boot command-line option
"numa_spinlock=on/off" (default is "auto").

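For example, booting with "numa_spinlock=off" keeps the stock slow path
even when CONFIG_NUMA_AWARE_SPINLOCKS=y, while "numa_spinlock=on" patches
CNA in unconditionally; omitting the option is equivalent to
"numa_spinlock=auto".
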
Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 .../admin-guide/kernel-parameters.txt         |  10 +
 arch/x86/Kconfig                              |  20 +
 arch/x86/include/asm/qspinlock.h              |   6 +
 arch/x86/kernel/alternative.c                 |   2 +
 kernel/locking/mcs_spinlock.h                 |   2 +-
 kernel/locking/qspinlock.c                    |  39 +-
 kernel/locking/qspinlock_cna.h                | 344 ++++++++++++++++++
 7 files changed, 418 insertions(+), 5 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c07815d230bc..cf3ede858e01 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3239,6 +3239,16 @@
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
+	numa_spinlock=	[NUMA, PV_OPS] Select the NUMA-aware variant
+			of spinlock. The options are:
+			auto - Enable this variant if running on a multi-node
+			machine in native environment.
+			on  - Unconditionally enable this variant.
+			off - Unconditionally disable this variant.
+
+			Not specifying this option is equivalent to
+			numa_spinlock=auto.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..196fde22159b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1566,6 +1566,26 @@ config NUMA
 
 	  Otherwise, you should say N.
 
+config NUMA_AWARE_SPINLOCKS
+	bool "Numa-aware spinlocks"
+	depends on NUMA
+	depends on QUEUED_SPINLOCKS
+	depends on 64BIT
+	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
+	# This is awkward, but hopefully would be resolved once static_call()
+	# is available.
+	depends on PARAVIRT_SPINLOCKS
+	default y
+	help
+	  Introduce NUMA (Non Uniform Memory Access) awareness into
+	  the slow path of spinlocks.
+
+	  In this variant of qspinlock, the kernel will try to keep the lock
+	  on the same node, thus reducing the number of remote cache misses,
+	  while trading some of the short term fairness for better performance.
+
+	  Say N if you want absolute first come first serve fairness.
+
 config AMD_NUMA
 	def_bool y
 	prompt "Old style AMD Opteron NUMA detection"
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 444d6fd9a6d8..5a3027d8ea27 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -27,6 +27,12 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
 	return val;
 }
 
+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+extern void cna_configure_spin_lock_slowpath(void);
+#else
+static inline void cna_configure_spin_lock_slowpath(void) { }
+#endif
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 15ac0d5f4b40..50f79a8aa2a9 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -739,6 +739,8 @@ void __init alternative_instructions(void)
 	}
 #endif
 
+	cna_configure_spin_lock_slowpath();
+
 	apply_paravirt(__parainstructions, __parainstructions_end);
 
 	restart_nmi();
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 904ba5d0f3f4..5e47ffb3f08b 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,7 +17,7 @@
 
 struct mcs_spinlock {
 	struct mcs_spinlock *next;
-	int locked; /* 1 if lock acquired */
+	unsigned int locked; /* 1 if lock acquired */
 	int count;  /* nesting count, see qspinlock.c */
 };
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 6e63c72e3fbd..5b01ab0cc944 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -11,7 +11,7 @@
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
 
 #include <linux/smp.h>
 #include <linux/bug.h>
@@ -71,7 +71,8 @@
 /*
  * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
  * size and four of them will fit nicely in one 64-byte cacheline. For
- * pvqspinlock, however, we need more space for extra data. To accommodate
+ * pvqspinlock, however, we need more space for extra data. The same also
+ * applies for the NUMA-aware variant of spinlocks (CNA). To accommodate
  * that, we insert two more long words to pad it up to 32 bytes. IOW, only
  * two of them can fit in a cacheline in this case. That is OK as it is rare
  * to have more than 2 levels of slowpath nesting in actual use. We don't
@@ -80,7 +81,7 @@
  */
 struct qnode {
 	struct mcs_spinlock mcs;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 	long reserved[2];
 #endif
 };
@@ -104,6 +105,8 @@ struct qnode {
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
  *
  * PV doubles the storage and uses the second cacheline for PV state.
+ * CNA also doubles the storage and uses the second cacheline for
+ * CNA-specific state.
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
@@ -317,7 +320,7 @@ static __always_inline void __mcs_lock_handoff(struct mcs_spinlock *node,
 #define try_clear_tail		__try_clear_tail
 #define mcs_lock_handoff	__mcs_lock_handoff
 
-#endif /* _GEN_PV_LOCK_SLOWPATH */
+#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
 
 /**
  * queued_spin_lock_slowpath - acquire the queued spinlock
@@ -589,6 +592,34 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
+/*
+ * Generate the code for NUMA-aware spinlocks
+ */
+#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+#define _GEN_CNA_LOCK_SLOWPATH
+
+#undef pv_wait_head_or_lock
+#define pv_wait_head_or_lock		cna_wait_head_or_lock
+
+#undef try_clear_tail
+#define try_clear_tail			cna_try_clear_tail
+
+#undef mcs_lock_handoff
+#define mcs_lock_handoff			cna_lock_handoff
+
+#undef  queued_spin_lock_slowpath
+/*
+ * defer defining queued_spin_lock_slowpath until after the include to
+ * avoid a name clash with the identically named field in pv_ops.lock
+ * (see cna_configure_spin_lock_slowpath())
+ */
+#include "qspinlock_cna.h"
+#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
+
+#include "qspinlock.c"
+
+#endif
+
 /*
  * Generate the paravirt code for queued_spin_unlock_slowpath().
  */
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
new file mode 100644
index 000000000000..619883f3dfd3
--- /dev/null
+++ b/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,344 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+#include <linux/topology.h>
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a primary queue for
+ * threads running on the same NUMA node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. Schematically, it
+ * looks like this:
+ *
+ *    cna_node
+ *   +----------+     +--------+         +--------+
+ *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
+ *   |mcs:locked| -.  +--------+         +--------+
+ *   +----------+  |
+ *                 `----------------------.
+ *                                        v
+ *                 +--------+         +--------+
+ *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
+ *                 +--------+         +--------+
+ *                     ^                    |
+ *                     `--------------------'
+ *
+ * N.B. locked := 1 if secondary queue is absent. Otherwise, it contains the
+ * encoded pointer to the tail of the secondary queue, which is organized as a
+ * circular list.
+ *
+ * After acquiring the MCS lock and before acquiring the spinlock, the lock
+ * holder scans the primary queue looking for a thread running on the same node
+ * (pre-scan). If found (call it thread T), all threads in the primary queue
+ * between the current lock holder and T are moved to the end of the secondary
+ * queue.  If such T is not found, we make another scan of the primary queue
+ * when unlocking the MCS lock (post-scan), starting at the node where pre-scan
+ * stopped. If both scans fail to find such T, the MCS lock is passed to the
+ * first thread in the secondary queue. If the secondary queue is empty, the
+ * lock is passed to the next thread in the primary queue.
+ *
+ * For more details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <alex.kogan@oracle.com>
+ *          Dave Dice <dave.dice@oracle.com>
+ */
+
+struct cna_node {
+	struct mcs_spinlock	mcs;
+	int			numa_node;
+	u32			encoded_tail;	/* self */
+	u32			partial_order;	/* encoded tail or enum val */
+};
+
+enum {
+	LOCAL_WAITER_FOUND = 2,	/* 0 and 1 are reserved for @locked */
+	MIN_ENCODED_TAIL
+};
+
+static void __init cna_init_nodes_per_cpu(unsigned int cpu)
+{
+	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
+	int numa_node = cpu_to_node(cpu);
+	int i;
+
+	for (i = 0; i < MAX_NODES; i++) {
+		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
+
+		cn->numa_node = numa_node;
+		cn->encoded_tail = encode_tail(cpu, i);
+		/*
+		 * make sure @encoded_tail is not confused with other valid
+		 * values for @locked (0 or 1) or with designated values for
+		 * @partial_order
+		 */
+		WARN_ON(cn->encoded_tail < MIN_ENCODED_TAIL);
+	}
+}
+
+static int __init cna_init_nodes(void)
+{
+	unsigned int cpu;
+
+	/*
+	 * this will break on 32bit architectures, so we restrict
+	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
+	 */
+	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+	/* we store an encoded tail word in the node's @locked field */
+	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
+
+	for_each_possible_cpu(cpu)
+		cna_init_nodes_per_cpu(cpu);
+
+	return 0;
+}
+
+/*
+ * cna_splice_head -- splice the entire secondary queue onto the head of the
+ * primary queue.
+ *
+ * Returns the new primary head node or NULL on failure.
+ */
+static struct mcs_spinlock *
+cna_splice_head(struct qspinlock *lock, u32 val,
+		struct mcs_spinlock *node, struct mcs_spinlock *next)
+{
+	struct mcs_spinlock *head_2nd, *tail_2nd;
+	u32 new;
+
+	tail_2nd = decode_tail(node->locked);
+	head_2nd = tail_2nd->next;
+
+	if (next) {
+		/*
+		 * If the primary queue is not empty, the primary tail doesn't
+		 * need to change and we can simply link the secondary tail to
+		 * the old primary head.
+		 */
+		tail_2nd->next = next;
+	} else {
+		/*
+		 * When the primary queue is empty, the secondary tail becomes
+		 * the primary tail.
+		 */
+
+		/*
+		 * Speculatively break the secondary queue's circular link such
+		 * that when the secondary tail becomes the primary tail it all
+		 * works out.
+		 */
+		tail_2nd->next = NULL;
+
+		/*
+		 * tail_2nd->next = NULL;	old = xchg_tail(lock, tail);
+		 *				prev = decode_tail(old);
+		 * try_cmpxchg_release(...);	WRITE_ONCE(prev->next, node);
+		 *
+		 * If the following cmpxchg() succeeds, our stores will not
+		 * collide.
+		 */
+		new = ((struct cna_node *)tail_2nd)->encoded_tail |
+			_Q_LOCKED_VAL;
+		if (!atomic_try_cmpxchg_release(&lock->val, &val, new)) {
+			/* Restore the secondary queue's circular link. */
+			tail_2nd->next = head_2nd;
+			return NULL;
+		}
+	}
+
+	/* The primary queue head now is what was the secondary queue head. */
+	return head_2nd;
+}
+
+static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
+				      struct mcs_spinlock *node)
+{
+	/*
+	 * We're here because the primary queue is empty; check the secondary
+	 * queue for remote waiters.
+	 */
+	if (node->locked > 1) {
+		struct mcs_spinlock *next;
+
+		/*
+		 * When there are waiters on the secondary queue, try to move
+		 * them back onto the primary queue and let them rip.
+		 */
+		next = cna_splice_head(lock, val, node, NULL);
+		if (next) {
+			arch_mcs_lock_handoff(&next->locked, 1);
+			return true;
+		}
+
+		return false;
+	}
+
+	/* Both queues are empty. Do what MCS does. */
+	return __try_clear_tail(lock, val, node);
+}
+
+/*
+ * cna_splice_tail -- splice nodes in the primary queue between [first, last]
+ * onto the secondary queue.
+ */
+static void cna_splice_tail(struct mcs_spinlock *node,
+			    struct mcs_spinlock *first,
+			    struct mcs_spinlock *last)
+{
+	/* remove [first,last] */
+	node->next = last->next;
+
+	/* stick [first,last] on the secondary queue tail */
+	if (node->locked <= 1) { /* if secondary queue is empty */
+		/* create secondary queue */
+		last->next = first;
+	} else {
+		/* add to the tail of the secondary queue */
+		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
+		struct mcs_spinlock *head_2nd = tail_2nd->next;
+
+		tail_2nd->next = first;
+		last->next = head_2nd;
+	}
+
+	node->locked = ((struct cna_node *)last)->encoded_tail;
+}
+
+/*
+ * cna_order_queue - scan the primary queue looking for the first lock node on
+ * the same NUMA node as the lock holder and move any skipped nodes onto the
+ * secondary queue.
+ *
+ * Returns LOCAL_WAITER_FOUND if a matching node is found; otherwise return
+ * the encoded pointer to the last element inspected (such that a subsequent
+ * scan can continue there).
+ *
+ * The worst case complexity of the scan is O(n), where n is the number
+ * of current waiters. However, the amortized complexity is close to O(1),
+ * as the immediate successor is likely to be running on the same node once
+ * threads from other nodes are moved to the secondary queue.
+ */
+static u32 cna_order_queue(struct mcs_spinlock *node,
+			   struct mcs_spinlock *iter)
+{
+	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
+	struct cna_node *cn = (struct cna_node *)node;
+	int nid = cn->numa_node;
+	struct cna_node *last;
+
+	/* find any next waiter on 'our' NUMA node */
+	for (last = cn;
+	     cni && cni->numa_node != nid;
+	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
+		;
+
+	if (!cni)
+		return last->encoded_tail; /* continue from here */
+
+	if (last != cn)	/* did we skip any waiters? */
+		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
+
+	return LOCAL_WAITER_FOUND;
+}
+
+/* Abuse the pv_wait_head_or_lock() hook to get some work done */
+static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
+						 struct mcs_spinlock *node)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+
+	/*
+	 * Try and put the time otherwise spent spin waiting on
+	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+	 */
+	cn->partial_order = cna_order_queue(node, node);
+
+	return 0; /* we lied; we didn't wait, go do so now */
+}
+
+static inline void cna_lock_handoff(struct mcs_spinlock *node,
+				 struct mcs_spinlock *next)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+	u32 val = _Q_LOCKED_VAL;
+	u32 partial_order = cn->partial_order;
+
+	/*
+	 * check if a successor from the same numa node has not been found in
+	 * pre-scan, and if so, try to find it in post-scan starting from the
+	 * node where pre-scan stopped (stored in @partial_order)
+	 */
+	if (partial_order >= MIN_ENCODED_TAIL)
+		partial_order =
+			cna_order_queue(node, decode_tail(partial_order));
+
+	if (partial_order == LOCAL_WAITER_FOUND) {
+		/*
+		 * We found a local waiter; reload @next in case we called
+		 * cna_order_queue() above.
+		 */
+		next = node->next;
+		if (node->locked > 1)
+			val = node->locked;	/* preserve secondary queue */
+	} else if (node->locked > 1) {
+		/*
+		 * When there are no local waiters on the primary queue, splice
+		 * the secondary queue onto the primary queue and pass the lock
+		 * to the longest waiting remote waiter.
+		 */
+		next = cna_splice_head(NULL, 0, node, next);
+	}
+
+	arch_mcs_lock_handoff(&next->locked, val);
+}
+
+/*
+ * Constant (boot-param configurable) flag selecting the NUMA-aware variant
+ * of spinlock.  Possible values: -1 (off) / 0 (auto, default) / 1 (on).
+ */
+static int numa_spinlock_flag;
+
+static int __init numa_spinlock_setup(char *str)
+{
+	if (!strcmp(str, "auto")) {
+		numa_spinlock_flag = 0;
+		return 1;
+	} else if (!strcmp(str, "on")) {
+		numa_spinlock_flag = 1;
+		return 1;
+	} else if (!strcmp(str, "off")) {
+		numa_spinlock_flag = -1;
+		return 1;
+	}
+
+	return 0;
+}
+__setup("numa_spinlock=", numa_spinlock_setup);
+
+void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/*
+ * Switch to the NUMA-friendly slow path for spinlocks when we have
+ * multiple NUMA nodes in native environment, unless the user has
+ * overridden this default behavior by setting the numa_spinlock flag.
+ */
+void __init cna_configure_spin_lock_slowpath(void)
+{
+
+	if (numa_spinlock_flag < 0)
+		return;
+
+	if (numa_spinlock_flag == 0 && (nr_node_ids < 2 ||
+		    pv_ops.lock.queued_spin_lock_slowpath !=
+			native_queued_spin_lock_slowpath))
+		return;
+
+	cna_init_nodes();
+
+	pv_ops.lock.queued_spin_lock_slowpath = __cna_queued_spin_lock_slowpath;
+
+	pr_info("Enabling CNA spinlock\n");
+}
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
@ 2020-04-03 20:59   ` Alex Kogan
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Kogan @ 2020-04-03 20:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, steven.sistare, daniel.m.jordan

In CNA, spinning threads are organized in two queues, a primary queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. After acquiring the
MCS lock and before acquiring the spinlock, the lock holder scans the
primary queue looking for a thread running on the same node (pre-scan). If
found (call it thread T), all threads in the primary queue between the
current lock holder and T are moved to the end of the secondary queue.
If such T is not found, we make another scan of the primary queue when
unlocking the MCS lock (post-scan), starting at the position where
pre-scan stopped. If both scans fail to find such T, the MCS lock is
passed to the first thread in the secondary queue. If the secondary queue
is empty, the lock is passed to the next thread in the primary queue.
For more details, see https://arxiv.org/abs/1810.05600.
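
As a rough illustration, here is a single-threaded, user-space sketch of the
pre-scan and of splicing skipped remote waiters onto the circular secondary
queue. It is not the kernel code: names are made up, the encoded-tail
representation is replaced by plain pointers, and all concurrency and memory
ordering concerns handled by the real patch are ignored.

#include <stddef.h>
#include <stdio.h>

struct waiter {
	struct waiter *next;	/* primary-queue link */
	int numa_node;
};

/* tail of the circular secondary queue; NULL when it is empty */
static struct waiter *sec_tail;

static void splice_to_secondary(struct waiter *first, struct waiter *last)
{
	if (!sec_tail) {
		last->next = first;		/* create the circular list */
	} else {
		last->next = sec_tail->next;	/* old secondary head */
		sec_tail->next = first;
	}
	sec_tail = last;
}

/* return the first waiter on the holder's node, or NULL if none is queued */
static struct waiter *pre_scan(struct waiter *holder)
{
	struct waiter *prev = holder, *cur = holder->next;

	while (cur && cur->numa_node != holder->numa_node) {
		prev = cur;
		cur = cur->next;
	}
	if (!cur)
		return NULL;	/* post-scan would continue from prev */

	if (prev != holder) {	/* skipped any remote waiters? */
		splice_to_secondary(holder->next, prev);
		holder->next = cur;	/* unlink the skipped waiters */
	}
	return cur;
}

int main(void)
{
	/* holder on node 0, two remote waiters on node 1, one local waiter */
	struct waiter w[4] = {
		{ &w[1], 0 }, { &w[2], 1 }, { &w[3], 1 }, { NULL, 0 }
	};
	struct waiter *local = pre_scan(&w[0]);

	printf("next local waiter runs on node %d\n", local->numa_node);
	return 0;
}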

Note that this variant of CNA may introduce starvation by continuously
passing the lock to threads running on the same node. This issue
will be addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at boot
time only if we run on a multi-node machine in a native environment and
the new config is enabled. (For the time being, the patching requires
CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
resolved once static_call() is available.) This default behavior can be
overridden with the new kernel boot command-line option
"numa_spinlock=on/off" (default is "auto").

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 .../admin-guide/kernel-parameters.txt         |  10 +
 arch/x86/Kconfig                              |  20 +
 arch/x86/include/asm/qspinlock.h              |   6 +
 arch/x86/kernel/alternative.c                 |   2 +
 kernel/locking/mcs_spinlock.h                 |   2 +-
 kernel/locking/qspinlock.c                    |  39 +-
 kernel/locking/qspinlock_cna.h                | 344 ++++++++++++++++++
 7 files changed, 418 insertions(+), 5 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c07815d230bc..cf3ede858e01 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3239,6 +3239,16 @@
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
+	numa_spinlock=	[NUMA, PV_OPS] Select the NUMA-aware variant
+			of spinlock. The options are:
+			auto - Enable this variant if running on a multi-node
+			machine in native environment.
+			on  - Unconditionally enable this variant.
+			off - Unconditionally disable this variant.
+
+			Not specifying this option is equivalent to
+			numa_spinlock=auto.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..196fde22159b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1566,6 +1566,26 @@ config NUMA
 
 	  Otherwise, you should say N.
 
+config NUMA_AWARE_SPINLOCKS
+	bool "NUMA-aware spinlocks"
+	depends on NUMA
+	depends on QUEUED_SPINLOCKS
+	depends on 64BIT
+	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
+	# This is awkward, but hopefully would be resolved once static_call()
+	# is available.
+	depends on PARAVIRT_SPINLOCKS
+	default y
+	help
+	  Introduce NUMA (Non Uniform Memory Access) awareness into
+	  the slow path of spinlocks.
+
+	  In this variant of qspinlock, the kernel will try to keep the lock
+	  on the same node, thus reducing the number of remote cache misses,
+	  while trading some of the short term fairness for better performance.
+
+	  Say N if you want absolute first come first serve fairness.
+
 config AMD_NUMA
 	def_bool y
 	prompt "Old style AMD Opteron NUMA detection"
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 444d6fd9a6d8..5a3027d8ea27 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -27,6 +27,12 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
 	return val;
 }
 
+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+extern void cna_configure_spin_lock_slowpath(void);
+#else
+static inline void cna_configure_spin_lock_slowpath(void) { }
+#endif
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 15ac0d5f4b40..50f79a8aa2a9 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -739,6 +739,8 @@ void __init alternative_instructions(void)
 	}
 #endif
 
+	cna_configure_spin_lock_slowpath();
+
 	apply_paravirt(__parainstructions, __parainstructions_end);
 
 	restart_nmi();
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 904ba5d0f3f4..5e47ffb3f08b 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,7 +17,7 @@
 
 struct mcs_spinlock {
 	struct mcs_spinlock *next;
-	int locked; /* 1 if lock acquired */
+	unsigned int locked; /* 1 if lock acquired */
 	int count;  /* nesting count, see qspinlock.c */
 };
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 6e63c72e3fbd..5b01ab0cc944 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -11,7 +11,7 @@
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
 
 #include <linux/smp.h>
 #include <linux/bug.h>
@@ -71,7 +71,8 @@
 /*
  * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
  * size and four of them will fit nicely in one 64-byte cacheline. For
- * pvqspinlock, however, we need more space for extra data. To accommodate
+ * pvqspinlock, however, we need more space for extra data. The same also
+ * applies for the NUMA-aware variant of spinlocks (CNA). To accommodate
  * that, we insert two more long words to pad it up to 32 bytes. IOW, only
  * two of them can fit in a cacheline in this case. That is OK as it is rare
  * to have more than 2 levels of slowpath nesting in actual use. We don't
@@ -80,7 +81,7 @@
  */
 struct qnode {
 	struct mcs_spinlock mcs;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 	long reserved[2];
 #endif
 };
@@ -104,6 +105,8 @@ struct qnode {
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
  *
  * PV doubles the storage and uses the second cacheline for PV state.
+ * CNA also doubles the storage and uses the second cacheline for
+ * CNA-specific state.
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
@@ -317,7 +320,7 @@ static __always_inline void __mcs_lock_handoff(struct mcs_spinlock *node,
 #define try_clear_tail		__try_clear_tail
 #define mcs_lock_handoff	__mcs_lock_handoff
 
-#endif /* _GEN_PV_LOCK_SLOWPATH */
+#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
 
 /**
  * queued_spin_lock_slowpath - acquire the queued spinlock
@@ -589,6 +592,34 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
+/*
+ * Generate the code for NUMA-aware spinlocks
+ */
+#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+#define _GEN_CNA_LOCK_SLOWPATH
+
+#undef pv_wait_head_or_lock
+#define pv_wait_head_or_lock		cna_wait_head_or_lock
+
+#undef try_clear_tail
+#define try_clear_tail			cna_try_clear_tail
+
+#undef mcs_lock_handoff
+#define mcs_lock_handoff			cna_lock_handoff
+
+#undef  queued_spin_lock_slowpath
+/*
+ * defer defining queued_spin_lock_slowpath until after the include to
+ * avoid a name clash with the identically named field in pv_ops.lock
+ * (see cna_configure_spin_lock_slowpath())
+ */
+#include "qspinlock_cna.h"
+#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
+
+#include "qspinlock.c"
+
+#endif
+
 /*
  * Generate the paravirt code for queued_spin_unlock_slowpath().
  */
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
new file mode 100644
index 000000000000..619883f3dfd3
--- /dev/null
+++ b/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,344 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+#include <linux/topology.h>
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a primary queue for
+ * threads running on the same NUMA node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. Schematically, it
+ * looks like this:
+ *
+ *    cna_node
+ *   +----------+     +--------+         +--------+
+ *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
+ *   |mcs:locked| -.  +--------+         +--------+
+ *   +----------+  |
+ *                 `----------------------.
+ *                                        v
+ *                 +--------+         +--------+
+ *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
+ *                 +--------+         +--------+
+ *                     ^                    |
+ *                     `--------------------'
+ *
+ * N.B. locked := 1 if secondary queue is absent. Otherwise, it contains the
+ * encoded pointer to the tail of the secondary queue, which is organized as a
+ * circular list.
+ *
+ * After acquiring the MCS lock and before acquiring the spinlock, the lock
+ * holder scans the primary queue looking for a thread running on the same node
+ * (pre-scan). If found (call it thread T), all threads in the primary queue
+ * between the current lock holder and T are moved to the end of the secondary
+ * queue.  If such T is not found, we make another scan of the primary queue
+ * when unlocking the MCS lock (post-scan), starting at the node where pre-scan
+ * stopped. If both scans fail to find such T, the MCS lock is passed to the
+ * first thread in the secondary queue. If the secondary queue is empty, the
+ * lock is passed to the next thread in the primary queue.
+ *
+ * For more details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <alex.kogan@oracle.com>
+ *          Dave Dice <dave.dice@oracle.com>
+ */
+
+struct cna_node {
+	struct mcs_spinlock	mcs;
+	int			numa_node;
+	u32			encoded_tail;	/* self */
+	u32			partial_order;	/* encoded tail or enum val */
+};
+
+enum {
+	LOCAL_WAITER_FOUND = 2,	/* 0 and 1 are reserved for @locked */
+	MIN_ENCODED_TAIL
+};
+
+static void __init cna_init_nodes_per_cpu(unsigned int cpu)
+{
+	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
+	int numa_node = cpu_to_node(cpu);
+	int i;
+
+	for (i = 0; i < MAX_NODES; i++) {
+		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
+
+		cn->numa_node = numa_node;
+		cn->encoded_tail = encode_tail(cpu, i);
+		/*
+		 * make sure @encoded_tail is not confused with other valid
+		 * values for @locked (0 or 1) or with designated values for
+		 * @partial_order
+		 */
+		WARN_ON(cn->encoded_tail < MIN_ENCODED_TAIL);
+	}
+}
+
+static int __init cna_init_nodes(void)
+{
+	unsigned int cpu;
+
+	/*
+	 * this will break on 32bit architectures, so we restrict
+	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
+	 */
+	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+	/* we store an encoded tail word in the node's @locked field */
+	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
+
+	for_each_possible_cpu(cpu)
+		cna_init_nodes_per_cpu(cpu);
+
+	return 0;
+}
+
+/*
+ * cna_splice_head -- splice the entire secondary queue onto the head of the
+ * primary queue.
+ *
+ * Returns the new primary head node or NULL on failure.
+ */
+static struct mcs_spinlock *
+cna_splice_head(struct qspinlock *lock, u32 val,
+		struct mcs_spinlock *node, struct mcs_spinlock *next)
+{
+	struct mcs_spinlock *head_2nd, *tail_2nd;
+	u32 new;
+
+	tail_2nd = decode_tail(node->locked);
+	head_2nd = tail_2nd->next;
+
+	if (next) {
+		/*
+		 * If the primary queue is not empty, the primary tail doesn't
+		 * need to change and we can simply link the secondary tail to
+		 * the old primary head.
+		 */
+		tail_2nd->next = next;
+	} else {
+		/*
+		 * When the primary queue is empty, the secondary tail becomes
+		 * the primary tail.
+		 */
+
+		/*
+		 * Speculatively break the secondary queue's circular link such
+		 * that when the secondary tail becomes the primary tail it all
+		 * works out.
+		 */
+		tail_2nd->next = NULL;
+
+		/*
+		 * tail_2nd->next = NULL;	old = xchg_tail(lock, tail);
+		 *				prev = decode_tail(old);
+		 * try_cmpxchg_release(...);	WRITE_ONCE(prev->next, node);
+		 *
+		 * If the following cmpxchg() succeeds, our stores will not
+		 * collide.
+		 */
+		new = ((struct cna_node *)tail_2nd)->encoded_tail |
+			_Q_LOCKED_VAL;
+		if (!atomic_try_cmpxchg_release(&lock->val, &val, new)) {
+			/* Restore the secondary queue's circular link. */
+			tail_2nd->next = head_2nd;
+			return NULL;
+		}
+	}
+
+	/* The primary queue head now is what was the secondary queue head. */
+	return head_2nd;
+}
+
+static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
+				      struct mcs_spinlock *node)
+{
+	/*
+	 * We're here because the primary queue is empty; check the secondary
+	 * queue for remote waiters.
+	 */
+	if (node->locked > 1) {
+		struct mcs_spinlock *next;
+
+		/*
+		 * When there are waiters on the secondary queue, try to move
+		 * them back onto the primary queue and let them rip.
+		 */
+		next = cna_splice_head(lock, val, node, NULL);
+		if (next) {
+			arch_mcs_lock_handoff(&next->locked, 1);
+			return true;
+		}
+
+		return false;
+	}
+
+	/* Both queues are empty. Do what MCS does. */
+	return __try_clear_tail(lock, val, node);
+}
+
+/*
+ * cna_splice_tail -- splice nodes in the primary queue between [first, last]
+ * onto the secondary queue.
+ */
+static void cna_splice_tail(struct mcs_spinlock *node,
+			    struct mcs_spinlock *first,
+			    struct mcs_spinlock *last)
+{
+	/* remove [first,last] */
+	node->next = last->next;
+
+	/* stick [first,last] on the secondary queue tail */
+	if (node->locked <= 1) { /* if secondary queue is empty */
+		/* create secondary queue */
+		last->next = first;
+	} else {
+		/* add to the tail of the secondary queue */
+		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
+		struct mcs_spinlock *head_2nd = tail_2nd->next;
+
+		tail_2nd->next = first;
+		last->next = head_2nd;
+	}
+
+	node->locked = ((struct cna_node *)last)->encoded_tail;
+}
+
+/*
+ * cna_order_queue - scan the primary queue looking for the first lock node on
+ * the same NUMA node as the lock holder and move any skipped nodes onto the
+ * secondary queue.
+ *
+ * Returns LOCAL_WAITER_FOUND if a matching node is found; otherwise return
+ * the encoded pointer to the last element inspected (such that a subsequent
+ * scan can continue there).
+ *
+ * The worst case complexity of the scan is O(n), where n is the number
+ * of current waiters. However, the amortized complexity is close to O(1),
+ * as the immediate successor is likely to be running on the same node once
+ * threads from other nodes are moved to the secondary queue.
+ */
+static u32 cna_order_queue(struct mcs_spinlock *node,
+			   struct mcs_spinlock *iter)
+{
+	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
+	struct cna_node *cn = (struct cna_node *)node;
+	int nid = cn->numa_node;
+	struct cna_node *last;
+
+	/* find any next waiter on 'our' NUMA node */
+	for (last = cn;
+	     cni && cni->numa_node != nid;
+	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
+		;
+
+	if (!cni)
+		return last->encoded_tail; /* continue from here */
+
+	if (last != cn)	/* did we skip any waiters? */
+		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
+
+	return LOCAL_WAITER_FOUND;
+}
+
+/* Abuse the pv_wait_head_or_lock() hook to get some work done */
+static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
+						 struct mcs_spinlock *node)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+
+	/*
+	 * Try and put the time otherwise spent spin waiting on
+	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+	 */
+	cn->partial_order = cna_order_queue(node, node);
+
+	return 0; /* we lied; we didn't wait, go do so now */
+}
+
+static inline void cna_lock_handoff(struct mcs_spinlock *node,
+				 struct mcs_spinlock *next)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+	u32 val = _Q_LOCKED_VAL;
+	u32 partial_order = cn->partial_order;
+
+	/*
+	 * check if a successor from the same NUMA node has not been found in
+	 * pre-scan, and if so, try to find it in post-scan starting from the
+	 * node where pre-scan stopped (stored in @partial_order)
+	 */
+	if (partial_order >= MIN_ENCODED_TAIL)
+		partial_order =
+			cna_order_queue(node, decode_tail(partial_order));
+
+	if (partial_order == LOCAL_WAITER_FOUND) {
+		/*
+		 * We found a local waiter; reload @next in case we called
+		 * cna_order_queue() above.
+		 */
+		next = node->next;
+		if (node->locked > 1)
+			val = node->locked;	/* preserve secondary queue */
+	} else if (node->locked > 1) {
+		/*
+		 * When there are no local waiters on the primary queue, splice
+		 * the secondary queue onto the primary queue and pass the lock
+		 * to the longest waiting remote waiter.
+		 */
+		next = cna_splice_head(NULL, 0, node, next);
+	}
+
+	arch_mcs_lock_handoff(&next->locked, val);
+}
+
+/*
+ * Constant (boot-param configurable) flag selecting the NUMA-aware variant
+ * of spinlock.  Possible values: -1 (off) / 0 (auto, default) / 1 (on).
+ */
+static int numa_spinlock_flag;
+
+static int __init numa_spinlock_setup(char *str)
+{
+	if (!strcmp(str, "auto")) {
+		numa_spinlock_flag = 0;
+		return 1;
+	} else if (!strcmp(str, "on")) {
+		numa_spinlock_flag = 1;
+		return 1;
+	} else if (!strcmp(str, "off")) {
+		numa_spinlock_flag = -1;
+		return 1;
+	}
+
+	return 0;
+}
+__setup("numa_spinlock=", numa_spinlock_setup);
+
+void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/*
+ * Switch to the NUMA-friendly slow path for spinlocks when we have
+ * multiple NUMA nodes in native environment, unless the user has
+ * overridden this default behavior by setting the numa_spinlock flag.
+ */
+void __init cna_configure_spin_lock_slowpath(void)
+{
+
+	if (numa_spinlock_flag < 0)
+		return;
+
+	if (numa_spinlock_flag == 0 && (nr_node_ids < 2 ||
+		    pv_ops.lock.queued_spin_lock_slowpath !=
+			native_queued_spin_lock_slowpath))
+		return;
+
+	cna_init_nodes();
+
+	pv_ops.lock.queued_spin_lock_slowpath = __cna_queued_spin_lock_slowpath;
+
+	pr_info("Enabling CNA spinlock\n");
+}
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v10 4/5] locking/qspinlock: Introduce starvation avoidance into CNA
  2020-04-03 20:59 ` Alex Kogan
@ 2020-04-03 20:59   ` Alex Kogan
  -1 siblings, 0 replies; 27+ messages in thread
From: Alex Kogan @ 2020-04-03 20:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Keep track of the number of intra-node lock handoffs, and force
inter-node handoff once this number reaches a preset threshold.
The default value for the threshold can be overridden with
the new kernel boot command-line option "numa_spinlock_threshold".
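
Note that the boot parameter is an exponent rather than the raw hand-off
count; a minimal sketch of the mapping applied by the setup code in this
patch (the helper name below is only for illustration):

	/* numa_spinlock_threshold=N allows up to 2^N intra-node hand-offs */
	static unsigned int threshold_from_param(int n)
	{
		return 1U << n;		/* default N=16 -> 65536 hand-offs */
	}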

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 .../admin-guide/kernel-parameters.txt         |  8 +++
 kernel/locking/qspinlock.c                    |  3 +
 kernel/locking/qspinlock_cna.h                | 55 ++++++++++++++++---
 3 files changed, 59 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index cf3ede858e01..c23bbf49024b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3249,6 +3249,14 @@
 			Not specifying this option is equivalent to
 			numa_spinlock=auto.
 
+	numa_spinlock_threshold=	[NUMA, PV_OPS]
+			Set the threshold for the number of intra-node
+			lock hand-offs before the NUMA-aware spinlock
+			is forced to be passed to a thread on another NUMA node.
+			The value is the base-2 logarithm of the hand-off count,
+			and must be in the [0..31] range. Smaller values result
+			in a fairer, but less performant spinlock. The default is 16.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 5b01ab0cc944..29e480235cce 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -598,6 +598,9 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 #define _GEN_CNA_LOCK_SLOWPATH
 
+#undef pv_init_node
+#define pv_init_node			cna_init_node
+
 #undef pv_wait_head_or_lock
 #define pv_wait_head_or_lock		cna_wait_head_or_lock
 
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 619883f3dfd3..e3180f6f5cdc 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -38,7 +38,9 @@
  * when unlocking the MCS lock (post-scan), starting at the node where pre-scan
  * stopped. If both scans fail to find such T, the MCS lock is passed to the
  * first thread in the secondary queue. If the secondary queue is empty, the
- * lock is passed to the next thread in the primary queue.
+ * lock is passed to the next thread in the primary queue. To avoid starvation
+ * of threads in the secondary queue, those threads are moved back to the head
+ * of the primary queue after a certain number of intra-node lock hand-offs.
  *
  * For more details, see https://arxiv.org/abs/1810.05600.
  *
@@ -51,13 +53,23 @@ struct cna_node {
 	int			numa_node;
 	u32			encoded_tail;	/* self */
 	u32			partial_order;	/* encoded tail or enum val */
+	u32			intra_count;
 };
 
 enum {
 	LOCAL_WAITER_FOUND = 2,	/* 0 and 1 are reserved for @locked */
+	FLUSH_SECONDARY_QUEUE = 3,
 	MIN_ENCODED_TAIL
 };
 
+/*
+ * Controls the threshold for the number of intra-node lock hand-offs before
+ * the NUMA-aware variant of spinlock is forced to be passed to a thread on
+ * another NUMA node. The default setting can be changed with the
+ * "numa_spinlock_threshold" boot option.
+ */
+unsigned int intra_node_handoff_threshold __ro_after_init = 1 << 16;
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -96,6 +108,11 @@ static int __init cna_init_nodes(void)
 	return 0;
 }
 
+static __always_inline void cna_init_node(struct mcs_spinlock *node)
+{
+	((struct cna_node *)node)->intra_count = 0;
+}
+
 /*
  * cna_splice_head -- splice the entire secondary queue onto the head of the
  * primary queue.
@@ -250,11 +267,15 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 {
 	struct cna_node *cn = (struct cna_node *)node;
 
-	/*
-	 * Try and put the time otherwise spent spin waiting on
-	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
-	 */
-	cn->partial_order = cna_order_queue(node, node);
+	if (cn->intra_count < intra_node_handoff_threshold) {
+		/*
+		 * Try and put the time otherwise spent spin waiting on
+		 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+		 */
+		cn->partial_order = cna_order_queue(node, node);
+	} else {
+		cn->partial_order = FLUSH_SECONDARY_QUEUE;
+	}
 
 	return 0; /* we lied; we didn't wait, go do so now */
 }
@@ -281,8 +302,11 @@ static inline void cna_lock_handoff(struct mcs_spinlock *node,
 		 * cna_order_queue() above.
 		 */
 		next = node->next;
-		if (node->locked > 1)
+		if (node->locked > 1) {
 			val = node->locked;	/* preserve secondary queue */
+			((struct cna_node *)next)->intra_count =
+				cn->intra_count + 1;
+		}
 	} else if (node->locked > 1) {
 		/*
 		 * When there are no local waiters on the primary queue, splice
@@ -342,3 +366,20 @@ void __init cna_configure_spin_lock_slowpath(void)
 
 	pr_info("Enabling CNA spinlock\n");
 }
+
+static int __init numa_spinlock_threshold_setup(char *str)
+{
+	int new_threshold_param;
+
+	if (get_option(&str, &new_threshold_param)) {
+		/* valid value is between 0 and 31 */
+		if (new_threshold_param < 0 || new_threshold_param > 31)
+			return 0;
+
+		intra_node_handoff_threshold = 1 << new_threshold_param;
+		return 1;
+	}
+
+	return 0;
+}
+__setup("numa_spinlock_threshold=", numa_spinlock_threshold_setup);
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v10 5/5] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA
  2020-04-03 20:59 ` Alex Kogan
@ 2020-04-03 20:59   ` Alex Kogan
  -1 siblings, 0 replies; 27+ messages in thread
From: Alex Kogan @ 2020-04-03 20:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice

Prohibit moving certain threads (e.g., in irq and nmi contexts)
to the secondary queue. Those prioritized threads will always stay
in the primary queue, and so will have a shorter wait time for the lock.
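
The priority flag is packed into the low bit of the node's @numa_node field;
a standalone sketch of that encoding (the helper names are illustrative and
not part of the patch):

	#include <stdbool.h>

	/* numa_node stores (nid << 1) | priority */
	static inline int cna_encode_node(int nid, bool priority)
	{
		return (nid << 1) | priority;
	}

	static inline int cna_decoded_nid(int encoded)
	{
		return encoded >> 1;
	}

	static inline bool cna_is_prioritized(int encoded)
	{
		return encoded & 0x1;
	}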

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_cna.h | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index e3180f6f5cdc..b004ce6882b6 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,6 +4,7 @@
 #endif
 
 #include <linux/topology.h>
+#include <linux/sched/rt.h>
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -41,6 +42,9 @@
  * lock is passed to the next thread in the primary queue. To avoid starvation
  * of threads in the secondary queue, those threads are moved back to the head
  * of the primary queue after a certain number of intra-node lock hand-offs.
+ * Lastly, certain threads (e.g., in irq and nmi contexts) are given
+ * preferential treatment -- the scan stops when such a thread is found,
+ * effectively never moving those threads into the secondary queue.
  *
  * For more details, see https://arxiv.org/abs/1810.05600.
  *
@@ -50,7 +54,7 @@
 
 struct cna_node {
 	struct mcs_spinlock	mcs;
-	int			numa_node;
+	int			numa_node;	/* use LSB for priority */
 	u32			encoded_tail;	/* self */
 	u32			partial_order;	/* encoded tail or enum val */
 	u32			intra_count;
@@ -79,7 +83,7 @@ static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 	for (i = 0; i < MAX_NODES; i++) {
 		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
 
-		cn->numa_node = numa_node;
+		cn->numa_node = numa_node << 1;
 		cn->encoded_tail = encode_tail(cpu, i);
 		/*
 		 * make sure @encoded_tail is not confused with other valid
@@ -110,6 +114,14 @@ static int __init cna_init_nodes(void)
 
 static __always_inline void cna_init_node(struct mcs_spinlock *node)
 {
+	/*
+	 * Set the priority bit in @numa_node for threads that should not
+	 * be moved to the secondary queue.
+	 */
+	bool priority = !in_task() || irqs_disabled() || rt_task(current);
+	((struct cna_node *)node)->numa_node =
+		(((struct cna_node *)node)->numa_node & ~1) | priority;
+
 	((struct cna_node *)node)->intra_count = 0;
 }
 
@@ -243,12 +255,16 @@ static u32 cna_order_queue(struct mcs_spinlock *node,
 {
 	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
 	struct cna_node *cn = (struct cna_node *)node;
-	int nid = cn->numa_node;
+	int nid = cn->numa_node >> 1;
 	struct cna_node *last;
 
 	/* find any next waiter on 'our' NUMA node */
 	for (last = cn;
-	     cni && cni->numa_node != nid;
+		 /*
+		  * iterate as long as the current node is not prioritized and
+		  * does not run on 'our' NUMA node
+		  */
+	     cni && !(cni->numa_node & 0x1) && (cni->numa_node >> 1) != nid;
 	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
 		;
 
@@ -258,6 +274,12 @@ static u32 cna_order_queue(struct mcs_spinlock *node,
 	if (last != cn)	/* did we skip any waiters? */
 		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
 
+	/*
+	 * We return LOCAL_WAITER_FOUND here even if we stopped the scan because
+	 * of a prioritized waiter. That waiter will get the lock next even if
+	 * it runs on a different NUMA node, but this is what we wanted when we
+	 * prioritized it.
+	 */
 	return LOCAL_WAITER_FOUND;
 }
 
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v10 5/5] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA
@ 2020-04-03 20:59   ` Alex Kogan
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Kogan @ 2020-04-03 20:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, steven.sistare, daniel.m.jordan

Prohibit moving certain threads (e.g., in irq and nmi contexts)
to the secondary queue. Those prioritized threads will always stay
in the primary queue, and so will have a shorter wait time for the lock.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_cna.h | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index e3180f6f5cdc..b004ce6882b6 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,6 +4,7 @@
 #endif
 
 #include <linux/topology.h>
+#include <linux/sched/rt.h>
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -41,6 +42,9 @@
  * lock is passed to the next thread in the primary queue. To avoid starvation
  * of threads in the secondary queue, those threads are moved back to the head
  * of the primary queue after a certain number of intra-node lock hand-offs.
+ * Lastly, certain threads (e.g., in irq and nmi contexts) are given
+ * preferential treatment -- the scan stops when such a thread is found,
+ * effectively never moving those threads into the secondary queue.
  *
  * For more details, see https://arxiv.org/abs/1810.05600.
  *
@@ -50,7 +54,7 @@
 
 struct cna_node {
 	struct mcs_spinlock	mcs;
-	int			numa_node;
+	int			numa_node;	/* use LSB for priority */
 	u32			encoded_tail;	/* self */
 	u32			partial_order;	/* encoded tail or enum val */
 	u32			intra_count;
@@ -79,7 +83,7 @@ static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 	for (i = 0; i < MAX_NODES; i++) {
 		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
 
-		cn->numa_node = numa_node;
+		cn->numa_node = numa_node << 1;
 		cn->encoded_tail = encode_tail(cpu, i);
 		/*
 		 * make sure @encoded_tail is not confused with other valid
@@ -110,6 +114,14 @@ static int __init cna_init_nodes(void)
 
 static __always_inline void cna_init_node(struct mcs_spinlock *node)
 {
+	/*
+	 * Set the priority bit in @numa_node for threads that should not
+	 * be moved to the secondary queue.
+	 */
+	bool priority = !in_task() || irqs_disabled() || rt_task(current);
+	((struct cna_node *)node)->numa_node =
+		(((struct cna_node *)node)->numa_node & ~1) | priority;
+
 	((struct cna_node *)node)->intra_count = 0;
 }
 
@@ -243,12 +255,16 @@ static u32 cna_order_queue(struct mcs_spinlock *node,
 {
 	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
 	struct cna_node *cn = (struct cna_node *)node;
-	int nid = cn->numa_node;
+	int nid = cn->numa_node >> 1;
 	struct cna_node *last;
 
 	/* find any next waiter on 'our' NUMA node */
 	for (last = cn;
-	     cni && cni->numa_node != nid;
+		 /*
+		  * iterate as long as the current node is not prioritized and
+		  * does not run on 'our' NUMA node
+		  */
+	     cni && !(cni->numa_node & 0x1) && (cni->numa_node >> 1) != nid;
 	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
 		;
 
@@ -258,6 +274,12 @@ static u32 cna_order_queue(struct mcs_spinlock *node,
 	if (last != cn)	/* did we skip any waiters? */
 		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
 
+	/*
+	 * We return LOCAL_WAITER_FOUND here even if we stopped the scan because
+	 * of a prioritized waiter. That waiter will get the lock next even if
+	 * it runs on a different NUMA node, but this is what we wanted when we
+	 * prioritized it.
+	 */
 	return LOCAL_WAITER_FOUND;
 }
 
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-04-03 20:59   ` Alex Kogan
  (?)
  (?)
@ 2020-04-04 23:25   ` kbuild test robot
  2020-04-07 21:57     ` Alex Kogan
  -1 siblings, 1 reply; 27+ messages in thread
From: kbuild test robot @ 2020-04-04 23:25 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1782 bytes --]

Hi Alex,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/locking/core]
[also build test ERROR on arm/for-next linus/master tip/x86/core v5.6 next-20200404]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Alex-Kogan/Add-NUMA-awareness-to-qspinlock/20200405-045543
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git f1e67e355c2aafeddf1eac31335709236996d2fe
config: i386-alldefconfig (attached as .config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/x86/kernel/alternative.c: In function 'alternative_instructions':
>> arch/x86/kernel/alternative.c:742:2: error: implicit declaration of function 'cna_configure_spin_lock_slowpath' [-Werror=implicit-function-declaration]
     cna_configure_spin_lock_slowpath();
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/cna_configure_spin_lock_slowpath +742 arch/x86/kernel/alternative.c

   741	
 > 742		cna_configure_spin_lock_slowpath();
   743	
   744		apply_paravirt(__parainstructions, __parainstructions_end);
   745	
   746		restart_nmi();
   747		alternatives_patched = 1;
   748	}
   749	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 10589 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-04-04 23:25   ` kbuild test robot
@ 2020-04-07 21:57     ` Alex Kogan
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Kogan @ 2020-04-07 21:57 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2700 bytes --]

Thank you for the report.

The cna_configure_spin_lock_slowpath() function is declared in <asm/qspinlock.h>, but I think that header file is not included for ARCH=i386.
The attached patch takes care of that. I will fix this issue in the next revision of the series.
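
In essence, the fix only compiles the call when the config option (and
hence the declaration) is available; a minimal sketch of the guarded
call site, as in the attached patch:

#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
	/* the CNA slow path can only be patched in when the option is set */
	cna_configure_spin_lock_slowpath();
#endif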

Best regards,
— Alex




> On Apr 4, 2020, at 7:25 PM, kbuild test robot <lkp@intel.com> wrote:
> 
> Hi Alex,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on tip/locking/core]
> [also build test ERROR on arm/for-next linus/master tip/x86/core v5.6 next-20200404]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
> 
> url:    https://github.com/0day-ci/linux/commits/Alex-Kogan/Add-NUMA-awareness-to-qspinlock/20200405-045543
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git f1e67e355c2aafeddf1eac31335709236996d2fe
> config: i386-alldefconfig (attached as .config)
> compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
> reproduce:
>        # save the attached .config to linux build tree
>        make ARCH=i386 
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kbuild test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>   arch/x86/kernel/alternative.c: In function 'alternative_instructions':
>>> arch/x86/kernel/alternative.c:742:2: error: implicit declaration of function 'cna_configure_spin_lock_slowpath' [-Werror=implicit-function-declaration]
>     cna_configure_spin_lock_slowpath();
>     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>   cc1: some warnings being treated as errors
> 
> vim +/cna_configure_spin_lock_slowpath +742 arch/x86/kernel/alternative.c
> 
>   741	
>> 742		cna_configure_spin_lock_slowpath();
>   743	
>   744		apply_paravirt(__parainstructions, __parainstructions_end);
>   745	
>   746		restart_nmi();
>   747		alternatives_patched = 1;
>   748	}
>   749	
> 
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
> <.config.gz>


[-- Attachment #2: 0001-locking-qspinlock-Fix-compilation-error-on-ARCH-i386.patch --]
[-- Type: application/octet-stream, Size: 1406 bytes --]

From 6dd7b50183ffbded00d67440abbe57c01f1f620b Mon Sep 17 00:00:00 2001
From: Alex Kogan <alex.kogan@oracle.com>
Date: Tue, 7 Apr 2020 17:21:00 -0400
Subject: [PATCH] locking/qspinlock: Fix compilation error on ARCH=i386

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reported-by: kbuild test robot <lkp@intel.com>
---
 arch/x86/include/asm/qspinlock.h | 2 --
 arch/x86/kernel/alternative.c    | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 5a3027d8ea27..fe4884b6f1b4 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -29,8 +29,6 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
 
 #ifdef CONFIG_NUMA_AWARE_SPINLOCKS
 extern void cna_configure_spin_lock_slowpath(void);
-#else
-static inline void cna_configure_spin_lock_slowpath(void) { }
 #endif
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 50f79a8aa2a9..92e075f09201 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -739,7 +739,9 @@ void __init alternative_instructions(void)
 	}
 #endif
 
+#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 	cna_configure_spin_lock_slowpath();
+#endif
 
 	apply_paravirt(__parainstructions, __parainstructions_end);
 
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v10 0/5] Add NUMA-awareness to qspinlock
  2020-04-03 20:59 ` Alex Kogan
@ 2020-05-04 14:17   ` Alex Kogan
  -1 siblings, 0 replies; 27+ messages in thread
From: Alex Kogan @ 2020-05-04 14:17 UTC (permalink / raw)
  To: Peter Zijlstra, Waiman Long
  Cc: linux, Ingo Molnar, Will Deacon, Arnd Bergmann, linux-arch,
	linux-arm-kernel, linux-kernel, Thomas Gleixner, Borislav Petkov,
	hpa, x86, Hanjun Guo, Jan Glauber, Steven Sistare, Daniel Jordan,
	dave.dice, Alex Kogan

Hi, Peter, Longman (and everyone on this list),

Hope you are doing well.

I was wondering whether you have had a chance to review this series,
and have any further comments.

Thanks,
— Alex

> On Apr 3, 2020, at 4:59 PM, Alex Kogan <alex.kogan@oracle.com> wrote:
> 
> Changes from v9:
> ----------------
> 
> - Revise the series based on Peter's version, adopting names, style, etc.
> 
> - Add a new patch that allows to prioritize certain threads (e.g., in 
> irq and nmi contexts) and avoids moving them between waiting queues,
> based on the suggestion by Longman.
> 
> - Drop the shuffle reduction optimization from the series (new performance
> data did not justify it).
> 
> - Do not call cna_init_nodes() as an early_initcall (call it from 
> cna_configure_spin_lock_slowpath() instead), based on the comment from 
> Longman.
> 
> 
> Summary
> -------
> 
> Lock throughput can be increased by handing a lock to a waiter on the
> same NUMA node as the lock holder, provided care is taken to avoid
> starvation of waiters on other NUMA nodes. This patch introduces CNA
> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
> 
> CNA is a NUMA-aware version of the MCS lock. Spinning threads are
> organized in two queues, a primary queue for threads running on the same
> node as the current lock holder, and a secondary queue for threads
> running on other nodes. Threads store the ID of the node on which
> they are running in their queue nodes. After acquiring the MCS lock and
> before acquiring the spinlock, the lock holder scans the primary queue
> looking for a thread running on the same node (pre-scan). If found (call
> it thread T), all threads in the primary queue between the current lock
> holder and T are moved to the end of the secondary queue.  If such T
> is not found, we make another scan of the primary queue after acquiring 
> the spinlock when unlocking the MCS lock (post-scan), starting at the
> node where pre-scan stopped. If both scans fail to find such T, the
> MCS lock is passed to the first thread in the secondary queue. If the
> secondary queue is empty, the MCS lock is passed to the next thread in the
> primary queue. To avoid starvation of threads in the secondary queue, those
> threads are moved back to the head of the primary queue after a certain
> number of intra-node lock hand-offs. Lastly, certain threads (e.g., in
> in irq and nmi contexts) are given a preferential treatment -- the scan
> stops when such a thread is found, effectively never moving those threads 
> into the secondary queue.
> 
> More details are available at https://arxiv.org/abs/1810.05600.
> 
> We have done some performance evaluation with the locktorture module
> as well as with several benchmarks from the will-it-scale repo.
> The following locktorture results are from an Oracle X5-4 server
> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
> cores each). Each number represents an average (over 25 runs) of the
> total number of ops (x10^7) reported at the end of each run. The 
> standard deviation is also reported in (), and in general is about 3%
> from the average. The 'stock' kernel is v5.6.0-rc6,
> commit 5ad0ec0b8652, compiled in the default configuration. 
> 'patch-CNA' is the modified kernel with NUMA_AWARE_SPINLOCKS set; 
> the speedup is calculated dividing 'patch-CNA' by 'stock'.
> 
> #thr  	 stock        patch-CNA   speedup (patch-CNA/stock)
>  1  2.702 (0.100)  2.712 (0.122)  1.003
>  2  3.691 (0.162)  3.672 (0.138)  0.995
>  4  4.285 (0.108)  4.256 (0.124)  0.993
>  8  5.117 (0.133)  5.972 (0.258)  1.167
> 16  6.273 (0.196)  7.628 (0.274)  1.216
> 32  6.757 (0.122)  8.544 (0.225)  1.264
> 36  6.761 (0.091)  8.691 (0.170)  1.285
> 72  6.569 (0.132)  9.280 (0.225)  1.413
> 108  6.167 (0.112)  9.410 (0.171)  1.526
> 142  5.901 (0.117)  9.415 (0.211)  1.595
> 
> The following tables contain throughput results (ops/us) from the same
> setup for will-it-scale/open1_threads: 
> 
> #thr  	 stock        patch-CNA   speedup (patch-CNA/stock)
>  1  0.511 (0.002)  0.525 (0.003)  1.027
>  2  0.774 (0.018)  0.769 (0.017)  0.993
>  4  1.352 (0.023)  1.372 (0.032)  1.014
>  8  1.675 (0.090)  1.660 (0.136)  0.991
> 16  1.665 (0.114)  1.583 (0.092)  0.951
> 32  0.966 (0.038)  1.637 (0.087)  1.694
> 36  0.973 (0.066)  1.570 (0.081)  1.613
> 72  0.844 (0.040)  1.620 (0.091)  1.919
> 108  0.836 (0.040)  1.670 (0.084)  1.999
> 142  0.799 (0.043)  1.699 (0.087)  2.127
> 
> and will-it-scale/lock2_threads:
> 
> #thr  	 stock        patch-CNA   speedup (patch-CNA/stock)
>  1  1.581 (0.004)  1.576 (0.007)  0.997
>  2  2.699 (0.059)  2.687 (0.067)  0.996
>  4  5.240 (0.234)  5.155 (0.252)  0.984
>  8  4.370 (0.241)  4.111 (0.342)  0.941
> 16  4.152 (0.112)  4.113 (0.164)  0.991
> 32  2.579 (0.099)  4.099 (0.127)  1.589
> 36  2.604 (0.066)  4.005 (0.104)  1.538
> 72  2.028 (0.091)  4.024 (0.112)  1.984
> 108  2.079 (0.106)  3.997 (0.093)  1.923
> 142  1.858 (0.103)  3.955 (0.109)  2.129
> 
> Our evaluation shows that CNA also improves performance of user 
> applications that have hot pthread mutexes. Those mutexes are 
> blocking, and waiting threads park and unpark via the futex 
> mechanism in the kernel. Given that kernel futex chains, which
> are hashed by the mutex address, are each protected by a 
> chain-specific spin lock, the contention on a user-mode mutex 
> translates into contention on a kernel level spinlock. 
> 
> Here are the results for the leveldb ‘readrandom’ benchmark:
> 
> #thr  	 stock        patch-CNA   speedup (patch-CNA/stock)
>  1  0.530 (0.013)  0.533 (0.011)  1.006
>  2  0.839 (0.043)  0.847 (0.031)  1.010
>  4  0.758 (0.021)  0.764 (0.018)  1.008
>  8  0.677 (0.022)  0.682 (0.016)  1.008
> 16  0.714 (0.023)  0.814 (0.027)  1.140
> 32  0.765 (0.040)  1.168 (0.032)  1.527
> 36  0.706 (0.023)  1.139 (0.066)  1.614
> 72  0.624 (0.017)  1.184 (0.026)  1.898
> 108  0.605 (0.013)  1.147 (0.023)  1.894
> 142  0.593 (0.012)  1.131 (0.019)  1.908
> 
> Further comments are welcome and appreciated.
> 
> Alex Kogan (5):
>  locking/qspinlock: Rename mcs lock/unlock macros and make them more
>    generic
>  locking/qspinlock: Refactor the qspinlock slow path
>  locking/qspinlock: Introduce CNA into the slow path of qspinlock
>  locking/qspinlock: Introduce starvation avoidance into CNA
>  locking/qspinlock: Avoid moving certain threads between waiting queues
>    in CNA
> 
> .../admin-guide/kernel-parameters.txt         |  18 +
> arch/arm/include/asm/mcs_spinlock.h           |   6 +-
> arch/x86/Kconfig                              |  20 +
> arch/x86/include/asm/qspinlock.h              |   6 +
> arch/x86/kernel/alternative.c                 |   2 +
> include/asm-generic/mcs_spinlock.h            |   4 +-
> kernel/locking/mcs_spinlock.h                 |  20 +-
> kernel/locking/qspinlock.c                    |  82 +++-
> kernel/locking/qspinlock_cna.h                | 407 ++++++++++++++++++
> kernel/locking/qspinlock_paravirt.h           |   2 +-
> 10 files changed, 544 insertions(+), 23 deletions(-)
> create mode 100644 kernel/locking/qspinlock_cna.h
> 
> -- 
> 2.21.1 (Apple Git-122.3)
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v10 5/5] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA
  2020-04-03 20:59   ` Alex Kogan
@ 2020-07-28 19:34     ` Waiman Long
  -1 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2020-07-28 19:34 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, dave.dice

[-- Attachment #1: Type: text/plain, Size: 4271 bytes --]

On 4/3/20 4:59 PM, Alex Kogan wrote:
> Prohibit moving certain threads (e.g., in irq and nmi contexts)
> to the secondary queue. Those prioritized threads will always stay
> in the primary queue, and so will have a shorter wait time for the lock.
>
> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Waiman Long <longman@redhat.com>
> ---
>   kernel/locking/qspinlock_cna.h | 30 ++++++++++++++++++++++++++----
>   1 file changed, 26 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
> index e3180f6f5cdc..b004ce6882b6 100644
> --- a/kernel/locking/qspinlock_cna.h
> +++ b/kernel/locking/qspinlock_cna.h
> @@ -4,6 +4,7 @@
>   #endif
>   
>   #include <linux/topology.h>
> +#include <linux/sched/rt.h>
>   
>   /*
>    * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
> @@ -41,6 +42,9 @@
>    * lock is passed to the next thread in the primary queue. To avoid starvation
>    * of threads in the secondary queue, those threads are moved back to the head
>    * of the primary queue after a certain number of intra-node lock hand-offs.
> + * Lastly, certain threads (e.g., in irq and nmi contexts) are given
> + * preferential treatment -- the scan stops when such a thread is found,
> + * effectively never moving those threads into the secondary queue.
>    *
>    * For more details, see https://arxiv.org/abs/1810.05600.
>    *
> @@ -50,7 +54,7 @@
>   
>   struct cna_node {
>   	struct mcs_spinlock	mcs;
> -	int			numa_node;
> +	int			numa_node;	/* use LSB for priority */
>   	u32			encoded_tail;	/* self */
>   	u32			partial_order;	/* encoded tail or enum val */
>   	u32			intra_count;
> @@ -79,7 +83,7 @@ static void __init cna_init_nodes_per_cpu(unsigned int cpu)
>   	for (i = 0; i < MAX_NODES; i++) {
>   		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
>   
> -		cn->numa_node = numa_node;
> +		cn->numa_node = numa_node << 1;
>   		cn->encoded_tail = encode_tail(cpu, i);
>   		/*
>   		 * make sure @encoded_tail is not confused with other valid
> @@ -110,6 +114,14 @@ static int __init cna_init_nodes(void)
>   
>   static __always_inline void cna_init_node(struct mcs_spinlock *node)
>   {
> +	/*
> +	 * Set the priority bit in @numa_node for threads that should not
> +	 * be moved to the secondary queue.
> +	 */
> +	bool priority = !in_task() || irqs_disabled() || rt_task(current);
> +	((struct cna_node *)node)->numa_node =
> +		(((struct cna_node *)node)->numa_node & ~1) | priority;
> +
>   	((struct cna_node *)node)->intra_count = 0;
>   }
>   
> @@ -243,12 +255,16 @@ static u32 cna_order_queue(struct mcs_spinlock *node,
>   {
>   	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
>   	struct cna_node *cn = (struct cna_node *)node;
> -	int nid = cn->numa_node;
> +	int nid = cn->numa_node >> 1;
>   	struct cna_node *last;
>   
>   	/* find any next waiter on 'our' NUMA node */
>   	for (last = cn;
> -	     cni && cni->numa_node != nid;
> +		 /*
> +		  * iterate as long as the current node is not prioritized and
> +		  * does not run on 'our' NUMA node
> +		  */
> +	     cni && !(cni->numa_node & 0x1) && (cni->numa_node >> 1) != nid;
>   	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
>   		;
>   
> @@ -258,6 +274,12 @@ static u32 cna_order_queue(struct mcs_spinlock *node,
>   	if (last != cn)	/* did we skip any waiters? */
>   		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
>   
> +	/*
> +	 * We return LOCAL_WAITER_FOUND here even if we stopped the scan because
> +	 * of a prioritized waiter. That waiter will get the lock next even if
> +	 * it runs on a different NUMA node, but this is what we wanted when we
> +	 * prioritized it.
> +	 */
>   	return LOCAL_WAITER_FOUND;
>   }
>   

Sorry for the late review as I was swamped with other tasks earlier.

It is good to have a patch to handle lock waiters that shouldn't be
delayed, but the current way of using bit 0 of numa_node to indicate that
is kind of a hack. Also, it may not be a good idea to change the current
node id like that. I have a patch (attached) that can handle these
issues. What do you think about it?
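
For reference, the bit-0 encoding in the current patch is equivalent to
the following illustrative helpers (not part of either patch): the
priority flag lives in the LSB and the real node id is shifted up by one.

/* illustrative only -- the implicit encoding used by the v10 patch */
static inline int  cna_real_node(int encoded)     { return encoded >> 1; }
static inline bool cna_is_priority(int encoded)   { return encoded & 1; }
static inline int  cna_encode(int nid, bool prio) { return (nid << 1) | prio; }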

Cheers,
Longman



[-- Attachment #2: 0006-locking-qspinlock-Make-CNA-priority-node-inherit-pri.patch --]
[-- Type: text/x-patch, Size: 4211 bytes --]

From d21eae6d7b125253dd57c034b0a648bb11acc472 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Tue, 28 Jul 2020 12:07:54 -0400
Subject: [PATCH 6/9] locking/qspinlock: Make CNA priority node inherit primary
 queue node ID

When a lock waiter is not in a task context, has irqs disabled or is an
RT task, it is considered to be at a higher priority and so will not be
thrown into the secondary queue to suffer additional lock acquisition
latency. However, if its node id is different from the primary queue node
id, the primary queue node id will change mid-stream, and the secondary
queue may then hold nodes with the new node id, causing them to suffer
undue delay.

One way to avoid this situation is to make the priority node inherit
the node id of the primary queue. That will require two numa node
fields in the CNA node - one for real numa node id and the other for
current numa node id. Since 16 bits are more than enough for node id,
we can split the current 32-bit field into two 16-bit ones without
affecting the structure size.
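
As an illustrative, standalone check of the size claim (not part of the
patch): two 16-bit fields pack into the same four bytes that the old int
occupied, so the cna_node layout does not grow.

#include <assert.h>
#include <stdint.h>

struct old_fields { int32_t numa_node; };
struct new_fields { uint16_t numa_node; uint16_t real_numa_node; };

static_assert(sizeof(struct old_fields) == sizeof(struct new_fields),
	      "splitting the 32-bit field must not change the size");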

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_cna.h | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index b004ce6882b6..911c96279494 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -52,9 +52,12 @@
  *          Dave Dice <dave.dice@oracle.com>
  */
 
+#define CNA_PRIORITY_NODE	0xffff
+
 struct cna_node {
 	struct mcs_spinlock	mcs;
-	int			numa_node;	/* use LSB for priority */
+	u16			numa_node;
+	u16			real_numa_node;
 	u32			encoded_tail;	/* self */
 	u32			partial_order;	/* encoded tail or enum val */
 	u32			intra_count;
@@ -83,7 +86,7 @@ static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 	for (i = 0; i < MAX_NODES; i++) {
 		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
 
-		cn->numa_node = numa_node << 1;
+		cn->real_numa_node = numa_node;
 		cn->encoded_tail = encode_tail(cpu, i);
 		/*
 		 * make sure @encoded_tail is not confused with other valid
@@ -119,10 +122,10 @@ static __always_inline void cna_init_node(struct mcs_spinlock *node)
 	 * be moved to the secondary queue.
 	 */
 	bool priority = !in_task() || irqs_disabled() || rt_task(current);
-	((struct cna_node *)node)->numa_node =
-		(((struct cna_node *)node)->numa_node & ~1) | priority;
+	struct cna_node *cn = (struct cna_node *)node;
 
-	((struct cna_node *)node)->intra_count = 0;
+	cn->numa_node = priority ? CNA_PRIORITY_NODE : cn->real_numa_node;
+	cn->intra_count = 0;
 }
 
 /*
@@ -255,7 +258,7 @@ static u32 cna_order_queue(struct mcs_spinlock *node,
 {
 	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
 	struct cna_node *cn = (struct cna_node *)node;
-	int nid = cn->numa_node >> 1;
+	int nid = cn->numa_node;
 	struct cna_node *last;
 
 	/* find any next waiter on 'our' NUMA node */
@@ -264,13 +267,16 @@ static u32 cna_order_queue(struct mcs_spinlock *node,
 		  * iterate as long as the current node is not prioritized and
 		  * does not run on 'our' NUMA node
 		  */
-	     cni && !(cni->numa_node & 0x1) && (cni->numa_node >> 1) != nid;
+	     cni && cni->numa_node != CNA_PRIORITY_NODE && cni->numa_node != nid;
 	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
 		;
 
 	if (!cni)
 		return last->encoded_tail; /* continue from here */
 
+	if (cni->numa_node == CNA_PRIORITY_NODE)
+		cni->numa_node = nid;	/* Inherit node id of primary queue */
+
 	if (last != cn)	/* did we skip any waiters? */
 		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
 
@@ -290,6 +296,13 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 	struct cna_node *cn = (struct cna_node *)node;
 
 	if (cn->intra_count < intra_node_handoff_threshold) {
+		/*
+		 * We are at the head of the wait queue, no need to use
+		 * the fake NUMA node ID.
+		 */
+		if (cn->numa_node == CNA_PRIORITY_NODE)
+			cn->numa_node = cn->real_numa_node;
+
 		/*
 		 * Try and put the time otherwise spent spin waiting on
 		 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v10 4/5] locking/qspinlock: Introduce starvation avoidance into CNA
  2020-04-03 20:59   ` Alex Kogan
  (?)
@ 2020-07-28 19:39   ` Waiman Long
  -1 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2020-07-28 19:39 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, dave.dice

[-- Attachment #1: Type: text/plain, Size: 704 bytes --]

On 4/3/20 4:59 PM, Alex Kogan wrote:
> Keep track of the number of intra-node lock handoffs, and force
> inter-node handoff once this number reaches a preset threshold.
> The default value for the threshold can be overridden with
> the new kernel boot command-line option "numa_spinlock_threshold".
>
> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Waiman Long <longman@redhat.com>
> ---

A major issue with setting a limit on the maximum number of intra-node
lock transfers is that the worst-case latency before the lock is
transferred to another node is indeterminate. How about changing it to a
time-based limit?
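
The core of a time-based cutoff can be a wraparound-safe jiffies
comparison; a minimal sketch along the lines of the attached patch
(parameter names here are illustrative):

static inline bool threshold_reached(s32 start_time, s32 threshold)
{
	/* signed difference so that jiffies wraparound is handled */
	return (s32)jiffies - (start_time + threshold) > 0;
}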

Cheers,
Longman



[-- Attachment #2: 0007-locking-qspinlock-Convert-to-time-based-spinlock-thr.patch --]
[-- Type: text/x-patch, Size: 5800 bytes --]

From 56b7cfcd417e13b4bfeb77ca290185726c05e69a Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Tue, 28 Jul 2020 12:07:56 -0400
Subject: [PATCH 7/9] locking/qspinlock: Convert to time based spinlock
 threshold

The "numa_spinlock_threshold" kernel command line parameter specifies
the maximum number of intra-node lock transfers allowed for a continuously
contended lock. Depending on how long the lock critical section lasts,
the actual maximum latency can vary quite a lot.

Controlling that maximum latency is actually more meaningful than
just the number of intra-node lock transfers. By running a locking
microbenchmark with a null critical section on a 2-socket x86-64 system,
the locking rate was about 4,875 kop/s.  The default threshold of
64k translates to about 13ms in intra-node lock transfers. Of course,
with a slow critical section, the actual maximum latency will increase
substantially.
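(Roughly: 2^16 hand-offs / 4,875,000 hand-offs/s is about 13.4 ms.)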

Change the kernel parameter to be time based (in ms) so that the users
have a better handle on what maximum latency to expect for inter-node
lock transfers. The default is currently set to 10ms. A range of
1-100ms is allowed. That ms value is translated internally to the
nearest rounded-up jiffies.
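For example, with HZ=250 (4 ms per jiffy), the default of 10 ms rounds
up to 3 jiffies, giving an effective threshold of 12 ms.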

Signed-off-by: Waiman Long <longman@redhat.com>
---
 .../admin-guide/kernel-parameters.txt         | 13 +++---
 kernel/locking/qspinlock_cna.h                | 44 +++++++++++++------
 2 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 96eb7aae4f63..945e346c678f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3339,12 +3339,13 @@
 			numa_spinlock=auto.
 
 	numa_spinlock_threshold=	[NUMA, PV_OPS]
-			Set the threshold for the number of intra-node
-			lock hand-offs before the NUMA-aware spinlock
-			is forced to be passed to a thread on another NUMA node.
-			Valid values are in the [0..31] range. Smaller values
-			result in a more fair, but less performant spinlock, and
-			vice versa. The default value is 16.
+			Set the time threshold in milliseconds for
+			continuous intra-node lock hand-offs before the
+			NUMA-aware spinlock is forced to be passed to
+			a thread on another NUMA node.	Valid values
+			are in the [1..100] range. Smaller values result
+			in a more fair, but less performant spinlock,
+			and vice versa. The default value is 10.
 
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 911c96279494..03e8ba71a537 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -60,7 +60,7 @@ struct cna_node {
 	u16			real_numa_node;
 	u32			encoded_tail;	/* self */
 	u32			partial_order;	/* encoded tail or enum val */
-	u32			intra_count;
+	s32			start_time;
 };
 
 enum {
@@ -70,12 +70,22 @@ enum {
 };
 
 /*
- * Controls the threshold for the number of intra-node lock hand-offs before
- * the NUMA-aware variant of spinlock is forced to be passed to a thread on
- * another NUMA node. The default setting can be changed with the
- * "numa_spinlock_threshold" boot option.
+ * Controls the threshold time in ms (default = 10) for intra-node lock
+ * hand-offs before the NUMA-aware variant of spinlock is forced to be
+ * passed to a thread on another NUMA node. The default setting can be
+ * changed with the "numa_spinlock_threshold" boot option.
  */
-unsigned int intra_node_handoff_threshold __ro_after_init = 1 << 16;
+#define MSECS_TO_JIFFIES(m)	\
+	((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ)
+static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
+
+static inline bool intra_node_threshold_reached(struct cna_node *cn)
+{
+	s32 current_time = (s32)jiffies;
+	s32 threshold = cn->start_time + intra_node_handoff_threshold;
+
+	return current_time - threshold > 0;
+}
 
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
@@ -125,7 +135,7 @@ static __always_inline void cna_init_node(struct mcs_spinlock *node)
 	struct cna_node *cn = (struct cna_node *)node;
 
 	cn->numa_node = priority ? CNA_PRIORITY_NODE : cn->real_numa_node;
-	cn->intra_count = 0;
+	cn->start_time = 0;
 }
 
 /*
@@ -295,7 +305,14 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 {
 	struct cna_node *cn = (struct cna_node *)node;
 
-	if (cn->intra_count < intra_node_handoff_threshold) {
+	/*
+	 * In case the previous node passed over the special start_time
+	 * of 0, a refetch of jiffies should be either 0 or 1.
+	 */
+	if (!cn->start_time)
+		cn->start_time = (s32)jiffies;
+
+	if (!intra_node_threshold_reached(cn)) {
 		/*
 		 * We are at the head of the wait queue, no need to use
 		 * the fake NUMA node ID.
@@ -339,8 +356,7 @@ static inline void cna_lock_handoff(struct mcs_spinlock *node,
 		next = node->next;
 		if (node->locked > 1) {
 			val = node->locked;	/* preseve secondary queue */
-			((struct cna_node *)next)->intra_count =
-				cn->intra_count + 1;
+			((struct cna_node *)next)->start_time = cn->start_time;
 		}
 	} else if (node->locked > 1) {
 		/*
@@ -404,14 +420,14 @@ void __init cna_configure_spin_lock_slowpath(void)
 
 static int __init numa_spinlock_threshold_setup(char *str)
 {
-	int new_threshold_param;
+	int param;
 
-	if (get_option(&str, &new_threshold_param)) {
+	if (get_option(&str, &param)) {
-		/* valid value is between 0 and 31 */
-		if (new_threshold_param < 0 || new_threshold_param > 31)
+		/* valid value is between 1 and 100 */
+		if (param <= 0 || param > 100)
 			return 0;
 
-		intra_node_handoff_threshold = 1 << new_threshold_param;
+		intra_node_handoff_threshold = msecs_to_jiffies(param);
 		return 1;
 	}
 
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-04-03 20:59   ` Alex Kogan
@ 2020-07-28 20:00     ` Waiman Long
  -1 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2020-07-28 20:00 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, dave.dice

[-- Attachment #1: Type: text/plain, Size: 2137 bytes --]

On 4/3/20 4:59 PM, Alex Kogan wrote:
> In CNA, spinning threads are organized in two queues, a primary queue for
> threads running on the same node as the current lock holder, and a
> secondary queue for threads running on other nodes. After acquiring the
> MCS lock and before acquiring the spinlock, the lock holder scans the
> primary queue looking for a thread running on the same node (pre-scan). If
> found (call it thread T), all threads in the primary queue between the
> current lock holder and T are moved to the end of the secondary queue.
> If such T is not found, we make another scan of the primary queue when
> unlocking the MCS lock (post-scan), starting at the position where
> pre-scan stopped. If both scans fail to find such T, the MCS lock is
> passed to the first thread in the secondary queue. If the secondary queue
> is empty, the lock is passed to the next thread in the primary queue.
> For more details, see https://arxiv.org/abs/1810.05600.
>
> Note that this variant of CNA may introduce starvation by continuously
> passing the lock to threads running on the same node. This issue
> will be addressed later in the series.
>
> Enabling CNA is controlled via a new configuration option
> (NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
> boot time only if we run on a multi-node machine in native environment and
> the new config is enabled. (For the time being, the patching requires
> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
> resolved once static_call() is available.) This default behavior can be
> overridden with the new kernel boot command-line option
> "numa_spinlock=on/off" (default is "auto").
>
> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Waiman Long <longman@redhat.com>
> ---

There is also a concern that the worst-case latency for a lock transfer
can be close to O(n), where n is the number of lock waiters, which can be
quite large for large SMP systems. I have a patch on top that modifies
the current behavior to limit the number of node scans after the lock is
freed.

Cheers,
Longman



[-- Attachment #2: 0008-locking-qspinlock-Limit-CNA-node-scan-after-the-lock.patch --]
[-- Type: text/x-patch, Size: 8053 bytes --]

From 1e0d1a7c1d04383367b3d7a086ef3becd2fd81d8 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Tue, 28 Jul 2020 15:13:47 -0400
Subject: [PATCH 8/9] locking/qspinlock: Limit CNA node scan after the lock is
 freed

When there is a long list of lock waiters, the worst case scenario
for CNA is that none of them are on the same NUMA node as the lock
holder. All the lock waiters will have to be scanned to figure that out.
If the lock becomes free early in the scanning process, there will be a
long delay between the lock being freed and being re-acquired after the
full list has been scanned. The worst case delay is O(n), where n is the
number of lock waiters, which is bounded by the number of cpus in the
system.

One way to limit the delay is to limit the number of nodes to be scanned
after the lock is freed. Assuming a random distribution of NUMA nodes
among the lock waiters, the chance of not finding a lock waiter on the
same NUMA node is ((n-1)/n)^m, where n is the number of NUMA nodes and m
is the number of lock waiters scanned. If we limit the number of scans to
4n, for example, the chance of not finding a same-node lock waiter will
be 1.4% and 1.6% for 8 and 16-socket servers respectively.
Note that the limit applies only after the lock is freed, so the chance
of not finding the right lock waiter is even lower.
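
As an illustrative sanity check of these numbers (standalone C, not part
of the patch), the miss probability ((n-1)/n)^(4n) can be computed
directly:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* chance of not seeing a same-node waiter within 4*n scans,
	 * assuming waiters' NUMA nodes are uniformly distributed */
	for (int n = 2; n <= 16; n *= 2) {
		double miss = pow((double)(n - 1) / n, 4.0 * n);
		printf("%2d nodes: %.1f%%\n", n, 100.0 * miss);
	}
	return 0;	/* prints ~0.4%, ~1.0%, ~1.4% and ~1.6% */
}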

To make this possible, the lock waiter scanning and the lock spinning
have to be done at the same time. Since the lock waiter scanning no
longer stops until the lock is freed (plus the additional lookahead, if
necessary), the extra scanning done in cna_lock_handoff() isn't needed
anymore.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_cna.h | 130 +++++++++++++++++++++------------
 1 file changed, 83 insertions(+), 47 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 03e8ba71a537..89a3905870ea 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -59,7 +59,7 @@ struct cna_node {
 	u16			numa_node;
 	u16			real_numa_node;
 	u32			encoded_tail;	/* self */
-	u32			partial_order;	/* encoded tail or enum val */
+	u32			order;		/* encoded tail or enum val */
 	s32			start_time;
 };
 
@@ -79,6 +79,12 @@ enum {
 	((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ)
 static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
 
+/*
+ * Controls how many lookaheads for matching numa node after the lock
+ * is freed.
+ */
+static int matching_node_lookahead __ro_after_init;
+
 static inline bool intra_node_threshold_reached(struct cna_node *cn)
 {
 	s32 current_time = (s32)jiffies;
@@ -105,6 +111,11 @@ static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 		 */
 		WARN_ON(cn->encoded_tail < MIN_ENCODED_TAIL);
 	}
+	/*
+	 * With 4x numa-node lookahead and assuming random node distribution,
+	 * the probability of not finding a matching node is less than 2%.
+	 */
+	matching_node_lookahead = 4*nr_node_ids;
 }
 
 static int __init cna_init_nodes(void)
@@ -262,33 +273,67 @@ static void cna_splice_tail(struct mcs_spinlock *node,
  * of current waiters. However, the amortized complexity is close to O(1),
  * as the immediate successor is likely to be running on the same node once
  * threads from other nodes are moved to the secondary queue.
+ *
+ * Since the lock state is monitored while scanning and the number of
+ * additional scans after the lock is freed is limited, the extra lock
+ * acquisition latency is bounded as well.
  */
-static u32 cna_order_queue(struct mcs_spinlock *node,
-			   struct mcs_spinlock *iter)
+static u32 cna_order_queue(struct mcs_spinlock *node, struct qspinlock *lock)
 {
-	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
 	struct cna_node *cn = (struct cna_node *)node;
-	int nid = cn->numa_node;
-	struct cna_node *last;
+	struct cna_node *cni = cn;
+	struct cna_node *found = NULL;
+	int lookahead = matching_node_lookahead;
+	bool locked = true;
 
-	/* find any next waiter on 'our' NUMA node */
-	for (last = cn;
-		 /*
-		  * iterate as long as the current node is not priorizied and
-		  * does not run on 'our' NUMA node
-		  */
-	     cni && cni->numa_node != CNA_PRIORITY_NODE && cni->numa_node != nid;
-	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
-		;
+	/*
+	 * Find any next waiter on 'our' NUMA node while simultaneously
+	 * spinning on the lock. Iterate as long as the current node is not
+	 * prioritized and does not run on 'our' NUMA node, or until up to
+	 * "lookahead" nodes have been checked after the lock was freed.
+	 */
+	while (!found && lookahead) {
+		if (locked) {
+			u32 val = atomic_read(&lock->val);
+			if (!(val & _Q_LOCKED_PENDING_MASK)) {
+				locked = false;
+				continue;
+			}
+			/*
+			 * If we have reached the end of the queue, use
+			 * atomic_cond_read_relaxed() to wait until the
+			 * lock value changes, meaning either the lock
+			 * has been freed or a new waiter has arrived.
+			 */
+			if (cni->encoded_tail == (val & _Q_TAIL_MASK))
+				atomic_cond_read_relaxed(&lock->val, VAL != val);
+		}
+		if (!found) {
+			found = (struct cna_node *)READ_ONCE(cni->mcs.next);
+			if (!found && !locked)
+				break; /* Lock is free with no more nodes to parse */
+
+			if (found && found->numa_node != CNA_PRIORITY_NODE
+				  && found->numa_node != cn->numa_node) {
+				cni = found;
+				found = NULL;
+			}
+			if (!locked)
+				lookahead--;
+		}
+	}
+
+	if (!found)
+		return cni->encoded_tail; /* continue from here */
 
-	if (!cni)
-		return last->encoded_tail; /* continue from here */
+	if (found->numa_node == CNA_PRIORITY_NODE) {
+		int nid = cn->numa_node;
 
-	if (cni->numa_node == CNA_PRIORITY_NODE)
-		cni->numa_node = nid;	/* Inherit node id of primary queue */
+		found->numa_node = nid;	/* Inherit node id of primary queue */
+	}
 
-	if (last != cn)	/* did we skip any waiters? */
-		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
+	if (cni != cn)	/* did we skip any waiters? */
+		cna_splice_tail(node, node->next, (struct mcs_spinlock *)cni);
 
 	/*
 	 * We return LOCAL_WAITER_FOUND here even if we stopped the scan because
@@ -312,23 +357,23 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 	if (!cn->start_time)
 		cn->start_time = (s32)jiffies;
 
-	if (!intra_node_threshold_reached(cn)) {
-		/*
-		 * We are at the head of the wait queue, no need to use
-		 * the fake NUMA node ID.
-		 */
-		if (cn->numa_node == CNA_PRIORITY_NODE)
-			cn->numa_node = cn->real_numa_node;
-
-		/*
-		 * Try and put the time otherwise spent spin waiting on
-		 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
-		 */
-		cn->partial_order = cna_order_queue(node, node);
-	} else {
-		cn->partial_order = FLUSH_SECONDARY_QUEUE;
+	if (intra_node_threshold_reached(cn)) {
+		cn->order = FLUSH_SECONDARY_QUEUE;
+		goto out;
 	}
+	/*
+	 * We are at the head of the wait queue, no need to use
+	 * the fake NUMA node ID.
+	 */
+	if (cn->numa_node == CNA_PRIORITY_NODE)
+		cn->numa_node = cn->real_numa_node;
 
+	/*
+	 * Try and put the time otherwise spent spin waiting on
+	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+	 */
+	cn->order = cna_order_queue(node, lock);
+out:
 	return 0; /* we lied; we didn't wait, go do so now */
 }
 
@@ -337,18 +382,9 @@ static inline void cna_lock_handoff(struct mcs_spinlock *node,
 {
 	struct cna_node *cn = (struct cna_node *)node;
 	u32 val = _Q_LOCKED_VAL;
-	u32 partial_order = cn->partial_order;
-
-	/*
-	 * check if a successor from the same numa node has not been found in
-	 * pre-scan, and if so, try to find it in post-scan starting from the
-	 * node where pre-scan stopped (stored in @pre_scan_result)
-	 */
-	if (partial_order >= MIN_ENCODED_TAIL)
-		partial_order =
-			cna_order_queue(node, decode_tail(partial_order));
+	u32 order = cn->order;
 
-	if (partial_order == LOCAL_WAITER_FOUND) {
+	if (order == LOCAL_WAITER_FOUND) {
 		/*
 		 * We found a local waiter; reload @next in case we called
 		 * cna_order_queue() above.
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread
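
As a quick check of the lookahead math in the commit message above, here is a
minimal user-space sketch (illustrative only, not part of the patch) that
evaluates ((n-1)/n)^(4n) for a few NUMA node counts:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* chance of NOT seeing a same-node waiter within 4*n scanned waiters */
	for (int nodes = 2; nodes <= 16; nodes *= 2) {
		double miss = pow((double)(nodes - 1) / nodes, 4 * nodes);

		printf("%2d nodes: miss probability ~ %.1f%%\n",
		       nodes, 100.0 * miss);
	}
	return 0;
}

Built with -lm, this prints roughly 0.4%, 1.0%, 1.4% and 1.6% for 2, 4, 8 and
16 nodes, in line with the 1.4%/1.6% figures quoted for 8- and 16-socket
servers.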

* Re: [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-07-28 20:00     ` Waiman Long
@ 2020-08-31 21:39       ` Alex Kogan
  -1 siblings, 0 replies; 27+ messages in thread
From: Alex Kogan @ 2020-08-31 21:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

[-- Attachment #1: Type: text/plain, Size: 3757 bytes --]


> On Jul 28, 2020, at 4:00 PM, Waiman Long <longman@redhat.com> wrote:
> 
> On 4/3/20 4:59 PM, Alex Kogan wrote:
>> In CNA, spinning threads are organized in two queues, a primary queue for
>> threads running on the same node as the current lock holder, and a
>> secondary queue for threads running on other nodes. After acquiring the
>> MCS lock and before acquiring the spinlock, the lock holder scans the
>> primary queue looking for a thread running on the same node (pre-scan). If
>> found (call it thread T), all threads in the primary queue between the
>> current lock holder and T are moved to the end of the secondary queue.
>> If such T is not found, we make another scan of the primary queue when
>> unlocking the MCS lock (post-scan), starting at the position where
>> pre-scan stopped. If both scans fail to find such T, the MCS lock is
>> passed to the first thread in the secondary queue. If the secondary queue
>> is empty, the lock is passed to the next thread in the primary queue.
>> For more details, see https://arxiv.org/abs/1810.05600.
>> 
>> Note that this variant of CNA may introduce starvation by continuously
>> passing the lock to threads running on the same node. This issue
>> will be addressed later in the series.
>> 
>> Enabling CNA is controlled via a new configuration option
>> (NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
>> boot time only if we run on a multi-node machine in native environment and
>> the new config is enabled. (For the time being, the patching requires
>> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
>> resolved once static_call() is available.) This default behavior can be
>> overridden with the new kernel boot command-line option
>> "numa_spinlock=on/off" (default is "auto").
>> 
>> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
>> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Waiman Long <longman@redhat.com>
>> ---
> 
> There is also a concern that the worst case latency for a lock transfer can be close to O(n) which can be quite large for large SMP systems. I have a patch on top that modifies the current behavior to limit the number of node scans after the lock is freed.

I understand the concern. While your patch addresses it, I am afraid it makes
the code somewhat more complex, and duplicates some of the slow path
functionality (e.g., the spin loop until the lock value changes to a certain
value).

Let me suggest a different idea for gradually restructuring the main queue
that has some similarity to the way you suggested to handle prioritized waiters.
Basically, instead of scanning the entire chain of main queue waiters, 
we can check only the next waiter and, if present and it runs on a different
node, move it to the secondary queue. In addition, to maintain the preference
for a certain numa node ID, we set the numa node of the next-next waiter, 
if present, to that of the current lock holder. This is the part similar to the
way you suggested to handle prioritized waiters.

This way, the worst case complexity of cna_order_queue() decreases from O(n)
down to O(1), as we always "scan" only one waiter. And as before, we change
the preference (and flush the secondary queue) after X handovers (or after
Y ms, as in your other patch).
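
To illustrate the idea, here is a minimal user-space sketch of the one-step
culling (a toy model only: the secondary queue is a plain singly-linked list
rather than the circular list CNA uses, and prioritized waiters are ignored;
the authoritative version is cna_order_queue()/cna_splice_next() in the
attached patch):

#include <stdio.h>

struct waiter {
	int		numa_node;
	struct waiter	*next;		/* primary-queue link */
};

static void order_one_step(struct waiter *holder, struct waiter **secondary)
{
	struct waiter *next = holder->next;

	if (!next)
		return;

	if (next->numa_node != holder->numa_node) {
		struct waiter *nnext = next->next;

		if (nnext) {
			/* cull 'next' into the toy secondary queue */
			holder->next = nnext;
			next->next = *secondary;
			*secondary = next;
			next = nnext;
		}
		/* let the new successor inherit the holder's node preference */
		next->numa_node = holder->numa_node;
	}
}

int main(void)
{
	/* primary queue: holder(node 0) -> A(node 1) -> B(node 2) */
	struct waiter B = { 2, NULL }, A = { 1, &B };
	struct waiter holder = { 0, &A };
	struct waiter *secondary = NULL;

	order_one_step(&holder, &secondary);

	/* A is culled; B now reports node 0, keeping the node-0 preference */
	printf("new next on 'node' %d, secondary head on node %d\n",
	       holder.next->numa_node, secondary->numa_node);
	return 0;
}

With a holder on node 0 followed by waiters on nodes 1 and 2, the node-1
waiter is moved to the secondary queue and the node-2 waiter is handed the
'fake' node 0, so the next handoff keeps the node-0 preference.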

I attach the patch that applies on top of your patch for prioritized nodes
(0006), but does not include your patch 0007 (time based threshold), 
which I will integrate into the series in the next revision.

Please, let me know what you think.

Thanks,
— Alex


[-- Attachment #2: 0007-locking-qspinlock-Implement-incremental-culling-in-C.patch --]
[-- Type: application/octet-stream, Size: 8858 bytes --]

From 28af255824b92a79fc055e15f5dd2251bc055c10 Mon Sep 17 00:00:00 2001
From: Alex Kogan <alex.kogan@oracle.com>
Date: Thu, 27 Aug 2020 19:06:48 -0400
Subject: [PATCH v10 7/7] locking/qspinlock: Implement incremental culling in
 CNA

Replace the scan of the main queue in cna_order_queue()
with incremental culling. This decreases the worst case
complexity of cna_order_queue() from O(n) down to O(1).
The idea is to move waiters running on other nodes
gradually, one-by-one, into the secondary queue. This is
achieved by probing the numa node of the next waiter only,
hence O(1) complexity. To keep the lock priority on the
same node even if there is a chain of waiters from different
nodes, we overwrite numa nodes of those waiters. Those
'fake' numa nodes will be reset in subsequent lock acquisitions.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
---
 kernel/locking/qspinlock_cna.h | 135 +++++++++++++++------------------
 1 file changed, 63 insertions(+), 72 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 911c96279494..e5faf16ebe29 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -59,14 +59,14 @@ struct cna_node {
 	u16			numa_node;
 	u16			real_numa_node;
 	u32			encoded_tail;	/* self */
-	u32			partial_order;	/* encoded tail or enum val */
+	u32			partial_order;	/* enum val */
 	u32			intra_count;
 };
 
 enum {
-	LOCAL_WAITER_FOUND = 2,	/* 0 and 1 are reserved for @locked */
-	FLUSH_SECONDARY_QUEUE = 3,
-	MIN_ENCODED_TAIL
+	LOCAL_WAITER_FOUND,
+	LOCAL_WAITER_NOT_FOUND,
+	FLUSH_SECONDARY_QUEUE,
 };
 
 /*
@@ -90,10 +90,9 @@ static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 		cn->encoded_tail = encode_tail(cpu, i);
 		/*
 		 * make sure @encoded_tail is not confused with other valid
-		 * values for @locked (0 or 1) or with designated values for
-		 * @pre_scan_result
+		 * values for @locked (0 or 1)
 		 */
-		WARN_ON(cn->encoded_tail < MIN_ENCODED_TAIL);
+		WARN_ON(cn->encoded_tail <= 1);
 	}
 }
 
@@ -117,10 +116,6 @@ static int __init cna_init_nodes(void)
 
 static __always_inline void cna_init_node(struct mcs_spinlock *node)
 {
-	/*
-	 * Set the priority bit in @numa_node for threads that should not
-	 * be moved to the secondary queue.
-	 */
 	bool priority = !in_task() || irqs_disabled() || rt_task(current);
 	struct cna_node *cn = (struct cna_node *)node;
 
@@ -213,79 +208,64 @@ static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
 }
 
 /*
- * cna_splice_tail -- splice nodes in the primary queue between [first, last]
+ * cna_splice_tail -- splice the next node from the primary queue
  * onto the secondary queue.
  */
-static void cna_splice_tail(struct mcs_spinlock *node,
-			    struct mcs_spinlock *first,
-			    struct mcs_spinlock *last)
+static void cna_splice_next(struct mcs_spinlock *node,
+			    struct mcs_spinlock *next,
+			    struct mcs_spinlock *nnext)
 {
-	/* remove [first,last] */
-	node->next = last->next;
+	/* remove 'next' from the main queue */
+	node->next = nnext;
 
-	/* stick [first,last] on the secondary queue tail */
+	/* stick `next` on the secondary queue tail */
 	if (node->locked <= 1) { /* if secondary queue is empty */
 		/* create secondary queue */
-		last->next = first;
+		next->next = next;
 	} else {
 		/* add to the tail of the secondary queue */
 		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
 		struct mcs_spinlock *head_2nd = tail_2nd->next;
 
-		tail_2nd->next = first;
-		last->next = head_2nd;
+		tail_2nd->next = next;
+		next->next = head_2nd;
 	}
 
-	node->locked = ((struct cna_node *)last)->encoded_tail;
+	node->locked = ((struct cna_node *)next)->encoded_tail;
 }
 
 /*
- * cna_order_queue - scan the primary queue looking for the first lock node on
- * the same NUMA node as the lock holder and move any skipped nodes onto the
- * secondary queue.
- *
- * Returns LOCAL_WAITER_FOUND if a matching node is found; otherwise return
- * the encoded pointer to the last element inspected (such that a subsequent
- * scan can continue there).
- *
- * The worst case complexity of the scan is O(n), where n is the number
- * of current waiters. However, the amortized complexity is close to O(1),
- * as the immediate successor is likely to be running on the same node once
- * threads from other nodes are moved to the secondary queue.
+ * cna_order_queue - check whether the next lock waiter
+ * is on the same NUMA node as the lock holder; if not,
+ * and it is not a prioritized waiter, move it onto
+ * the secondary queue.
  */
-static u32 cna_order_queue(struct mcs_spinlock *node,
-			   struct mcs_spinlock *iter)
+static u32 cna_order_queue(struct mcs_spinlock *node)
 {
-	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
-	struct cna_node *cn = (struct cna_node *)node;
-	int nid = cn->numa_node;
-	struct cna_node *last;
-
-	/* find any next waiter on 'our' NUMA node */
-	for (last = cn;
-		 /*
-		  * iterate as long as the current node is not priorizied and
-		  * does not run on 'our' NUMA node
-		  */
-	     cni && cni->numa_node != CNA_PRIORITY_NODE && cni->numa_node != nid;
-	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
-		;
+	struct mcs_spinlock *next = READ_ONCE(node->next);
+	int numa_node, next_numa_node;
 
-	if (!cni)
-		return last->encoded_tail; /* continue from here */
+	if (!next)
+		return LOCAL_WAITER_NOT_FOUND;
 
-	if (cni->numa_node == CNA_PRIORITY_NODE)
-		cni->numa_node = nid;	/* Inherit node id of primary queue */
+	numa_node = ((struct cna_node *)node)->numa_node;
+	next_numa_node = ((struct cna_node *)next)->numa_node;
 
-	if (last != cn)	/* did we skip any waiters? */
-		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
+	if (next_numa_node != numa_node) {
+		if (next_numa_node != CNA_PRIORITY_NODE) {
+			struct mcs_spinlock *nnext = READ_ONCE(next->next);
 
-	/*
-	 * We return LOCAL_WAITER_FOUND here even if we stopped the scan because
-	 * of a prioritized waiter. That waiter will get the lock next even if
-	 * it runs on a different NUMA node, but this is what we wanted when we
-	 * prioritized it.
-	 */
+			if (nnext) {
+				cna_splice_next(node, next, nnext);
+				next = nnext;
+			}
+		}
+		/*
+		 * Inherit numa node id of primary queue, to maintain the
+		 * preference even if the next waiter is on a different node.
+		 */
+		((struct cna_node *)next)->numa_node = numa_node;
+	}
 	return LOCAL_WAITER_FOUND;
 }
 
@@ -307,7 +287,7 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 		 * Try and put the time otherwise spent spin waiting on
 		 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
 		 */
-		cn->partial_order = cna_order_queue(node, node);
+		cn->partial_order = cna_order_queue(node);
 	} else {
 		cn->partial_order = FLUSH_SECONDARY_QUEUE;
 	}
@@ -323,18 +303,23 @@ static inline void cna_lock_handoff(struct mcs_spinlock *node,
 	u32 partial_order = cn->partial_order;
 
 	/*
-	 * check if a successor from the same numa node has not been found in
-	 * pre-scan, and if so, try to find it in post-scan starting from the
-	 * node where pre-scan stopped (stored in @pre_scan_result)
+	 * Check if a successor from the same numa node has not been found in
+	 * pre-scan, and if so, try to find it in post-scan.
 	 */
-	if (partial_order >= MIN_ENCODED_TAIL)
-		partial_order =
-			cna_order_queue(node, decode_tail(partial_order));
+	if (partial_order == LOCAL_WAITER_NOT_FOUND)
+		partial_order = cna_order_queue(node);
+
+	/*
+	 * We have a successor in the main queue ('next'), so
+	 * if we called cna_order_queue() above, it will have found
+	 * a local waiter, either a real or a faked one.
+	 */
+	WARN_ON(partial_order == LOCAL_WAITER_NOT_FOUND);
 
 	if (partial_order == LOCAL_WAITER_FOUND) {
 		/*
-		 * We found a local waiter; reload @next in case we called
-		 * cna_order_queue() above.
+		 * We found a local waiter; reload @next in case it
+		 * was changed by cna_order_queue().
 		 */
 		next = node->next;
 		if (node->locked > 1) {
@@ -342,7 +327,13 @@ static inline void cna_lock_handoff(struct mcs_spinlock *node,
 			((struct cna_node *)next)->intra_count =
 				cn->intra_count + 1;
 		}
-	} else if (node->locked > 1) {
+	} else {
+		WARN_ON(partial_order != FLUSH_SECONDARY_QUEUE);
+		/*
+		 * We decided to flush the secondary queue;
+		 * this can only happen if that queue is not empty.
+		 */
+		WARN_ON(node->locked <= 1);
 		/*
 		 * When there are no local waiters on the primary queue, splice
 		 * the secondary queue onto the primary queue and pass the lock
-- 
2.21.1 (Apple Git-122.3)


[-- Attachment #3: Type: text/plain, Size: 99 bytes --]


> 
> Cheers,
> Longman
> 
> 
> <0008-locking-qspinlock-Limit-CNA-node-scan-after-the-lock.patch>


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-08-31 21:39       ` Alex Kogan
@ 2020-09-01 17:38         ` Waiman Long
  -1 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2020-09-01 17:38 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice

On 8/31/20 5:39 PM, Alex Kogan wrote:
>> On Jul 28, 2020, at 4:00 PM, Waiman Long <longman@redhat.com> wrote:
>>
>> On 4/3/20 4:59 PM, Alex Kogan wrote:
>>> In CNA, spinning threads are organized in two queues, a primary queue for
>>> threads running on the same node as the current lock holder, and a
>>> secondary queue for threads running on other nodes. After acquiring the
>>> MCS lock and before acquiring the spinlock, the lock holder scans the
>>> primary queue looking for a thread running on the same node (pre-scan). If
>>> found (call it thread T), all threads in the primary queue between the
>>> current lock holder and T are moved to the end of the secondary queue.
>>> If such T is not found, we make another scan of the primary queue when
>>> unlocking the MCS lock (post-scan), starting at the position where
>>> pre-scan stopped. If both scans fail to find such T, the MCS lock is
>>> passed to the first thread in the secondary queue. If the secondary queue
>>> is empty, the lock is passed to the next thread in the primary queue.
>>> For more details, see https://arxiv.org/abs/1810.05600.
>>>
>>> Note that this variant of CNA may introduce starvation by continuously
>>> passing the lock to threads running on the same node. This issue
>>> will be addressed later in the series.
>>>
>>> Enabling CNA is controlled via a new configuration option
>>> (NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
>>> boot time only if we run on a multi-node machine in native environment and
>>> the new config is enabled. (For the time being, the patching requires
>>> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
>>> resolved once static_call() is available.) This default behavior can be
>>> overridden with the new kernel boot command-line option
>>> "numa_spinlock=on/off" (default is "auto").
>>>
>>> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
>>> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
>>> Reviewed-by: Waiman Long <longman@redhat.com>
>>> ---
>> There is also a concern that the worst case latency for a lock transfer can be close to O(n) which can be quite large for large SMP systems. I have a patch on top that modifies the current behavior to limit the number of node scans after the lock is freed.
> I understand the concern. While your patch addresses it, I am afraid it makes
> the code somewhat more complex, and duplicates some of the slow path
> functionality (e.g., the spin loop until the lock value changes to a certain
> value).
>
> Let me suggest a different idea for gradually restructuring the main queue
> that has some similarity to the way you suggested to handle prioritized waiters.
> Basically, instead of scanning the entire chain of main queue waiters,
> we can check only the next waiter and, if present and it runs on a different
> node, move it to the secondary queue. In addition, to maintain the preference
> for a certain numa node ID, we set the numa node of the next-next waiter,
> if present, to that of the current lock holder. This is the part similar to the
> way you suggested to handle prioritized waiters.
>
> This way, the worst case complexity of cna_order_queue() decreases from O(n)
> down to O(1), as we always “scan" only one waiter. And as before, we change
> the preference (and flush the secondary queue) after X handovers (or after
> Y ms, as in your in other patch).
>
> I attach the patch that applies on top of your patch for prioritized nodes
> (0006), but does not include your patch 0007 (time based threshold),
> which I will integrate into the series in the next revision.
>
> Please, let me know what you think.
>
That is an interesting idea. I don't have any fundamental objection to 
that. I just wonder how it will impact the kind of performance test that 
you ran before. It would be nice to see the performance impact with that 
change.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2020-09-01 17:40 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-03 20:59 [PATCH v10 0/5] Add NUMA-awareness to qspinlock Alex Kogan
2020-04-03 20:59 ` Alex Kogan
2020-04-03 20:59 ` [PATCH v10 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
2020-04-03 20:59   ` Alex Kogan
2020-04-03 20:59 ` [PATCH v10 2/5] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
2020-04-03 20:59   ` Alex Kogan
2020-04-03 20:59   ` Alex Kogan
2020-04-03 20:59 ` [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
2020-04-03 20:59   ` Alex Kogan
2020-04-03 20:59   ` Alex Kogan
2020-04-04 23:25   ` kbuild test robot
2020-04-07 21:57     ` Alex Kogan
2020-07-28 20:00   ` Waiman Long
2020-07-28 20:00     ` Waiman Long
2020-08-31 21:39     ` Alex Kogan
2020-08-31 21:39       ` Alex Kogan
2020-09-01 17:38       ` Waiman Long
2020-09-01 17:38         ` Waiman Long
2020-04-03 20:59 ` [PATCH v10 4/5] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
2020-04-03 20:59   ` Alex Kogan
2020-07-28 19:39   ` Waiman Long
2020-04-03 20:59 ` [PATCH v10 5/5] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA Alex Kogan
2020-04-03 20:59   ` Alex Kogan
2020-07-28 19:34   ` Waiman Long
2020-07-28 19:34     ` Waiman Long
2020-05-04 14:17 ` [PATCH v10 0/5] Add NUMA-awareness to qspinlock Alex Kogan
2020-05-04 14:17   ` Alex Kogan
