* [PATCH 0/3] Add NUMA-awareness to qspinlock
@ 2019-01-31  3:01 ` Alex Kogan
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Kogan @ 2019-01-31  3:01 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice, rahul.x.yadav

Lock throughput can be increased by handing a lock to a waiter on the
same NUMA socket as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA sockets. This patch introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock.

CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
organized in two queues, a main queue for threads running on the same
socket as the current lock holder, and a secondary queue for threads
running on other sockets. Threads record the ID of the socket on which
they are running in their queue nodes. At unlock time, the lock
holder scans the main queue looking for a thread running on the same
socket. If found (call it thread T), all threads in the main queue
between the current lock holder and T are moved to the end of the
secondary queue, and the lock is passed to T. If no such T is found, the
lock is passed to the first node in the secondary queue. Finally, if the
secondary queue is empty, the lock is passed to the next thread in the
main queue.

Full details are available at https://arxiv.org/abs/1810.05600.
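For illustration, the following is a simplified, stand-alone C sketch of the
unlock-time scan described above (the struct layout, the function name and the
explicit sec_head/sec_tail parameters are assumptions made only for this
example; the actual implementation, which operates on mcs_spinlock nodes and
encodes the secondary queue in the node's locked field, is in patch 2):

	#include <stddef.h>

	struct waiter {
		struct waiter *next;
		int node;	/* NUMA node of the spinning thread */
	};

	/*
	 * Scan the main queue (starting at @head, the waiter right behind
	 * the lock holder) for a waiter on @my_node. Waiters that are
	 * skipped over are appended, as one chunk, to the secondary queue
	 * (@sec_head/@sec_tail). Returns the chosen successor, or NULL if
	 * no waiter runs on @my_node.
	 */
	static struct waiter *pick_successor(struct waiter *head, int my_node,
					     struct waiter **sec_head,
					     struct waiter **sec_tail)
	{
		struct waiter *cur = head, *prev = NULL;

		while (cur) {
			if (cur->node == my_node) {
				if (prev) {
					/* move head..prev to the secondary queue */
					if (*sec_tail)
						(*sec_tail)->next = head;
					else
						*sec_head = head;
					*sec_tail = prev;
					prev->next = NULL;
				}
				return cur;
			}
			prev = cur;
			cur = cur->next;
		}
		return NULL;	/* pass the lock to *sec_head, if any */
	}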

We have done some performance evaluation with the locktorture module
as well as with several benchmarks from the will-it-scale repo.
The following locktorture results are from an Oracle X5-4 server
(four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
cores each). Each number represents an average (over 5 runs) of the
total number of ops (x10^7) reported at the end of each run. The stock
kernel is v4.20.0-rc4+ compiled in the default configuration.

#thr  stock  patched speedup (patched/stock)
  1   2.710   2.715  1.002
  2   3.108   3.001  0.966
  4   4.194   3.919  0.934
  8   5.309   6.894  1.299
 16   6.722   9.094  1.353
 32   7.314   9.885  1.352
 36   7.562   9.855  1.303
 72   6.696  10.358  1.547
108   6.364  10.181  1.600
142   6.179  10.178  1.647

When the kernel is compiled with lockstat enabled, CNA 
achieves even larger speedups:

#thr  stock  patched speedup
  1   2.368   2.399  1.013
  2   2.567   2.492  0.970
  4   2.310   2.534  1.097
  8   2.893   4.468  1.544
 16   3.786   5.611  1.482
 32   4.097   5.578  1.362
 36   4.165   5.661  1.359
 72   3.484   5.841  1.677
108   2.890   5.498  1.903
142   2.695   5.356  1.987

This is because lockstat generates writes into shared variables inside the
critical section to update various stats (e.g., the last CPU on which a
lock was acquired). By keeping the lock local to a socket, CNA reduces the
number of remote cache misses on accesses to the lock itself as well as to
the critical-section data.

The following tables contain throughput results (ops/us) from the same
setup for will-it-scale/open1_threads (with the kernel compiled in the
default configuration):

#thr  stock patched speedup
  1   0.553   0.579  1.046
  2   0.860   0.907  1.054
  4   1.503   1.533  1.020
  8   1.735   1.704  0.983
 16   1.757   1.744  0.992
 32   0.888   1.705  1.921
 36   0.878   1.746  1.988
 72   0.844   1.766  2.094
108   0.812   1.747  2.150
142   0.804   1.767  2.198

and will-it-scale/lock2_threads:

#thr  stock patched speedup
  1   1.714   1.704  0.994
  2   2.919   2.914  0.998
  4   5.024   5.157  1.027
  8   4.101   3.946  0.962
 16   4.113   3.947  0.959
 32   2.618   4.145  1.583
 36   2.561   3.981  1.554
 72   2.062   4.015  1.947
108   2.157   3.977  1.844
142   1.992   3.916  1.966

As part of correctness testing, we performed kernel builds on the patched
kernel with X*NCPU parallelism, for X=1,3,5.

Code reviews and performance testing are welcome and appreciated.


Alex Kogan (3):
  locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic
  locking/qspinlock: Introduce CNA into the slow path of qspinlock
  locking/qspinlock: Introduce starvation avoidance into CNA

 arch/arm/include/asm/mcs_spinlock.h   |   4 +-
 include/asm-generic/qspinlock_types.h |  10 ++
 kernel/locking/mcs_spinlock.h         |  21 +++-
 kernel/locking/qspinlock.c            | 211 ++++++++++++++++++++++++++++++----
 4 files changed, 218 insertions(+), 28 deletions(-)

-- 
2.11.0 (Apple Git-81)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/3] locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic
  2019-01-31  3:01 ` Alex Kogan
@ 2019-01-31  3:01   ` Alex Kogan
  -1 siblings, 0 replies; 30+ messages in thread
From: Alex Kogan @ 2019-01-31  3:01 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice, rahul.x.yadav

The arch_mcs_spin_unlock_contended macro should accept the value to be
stored into the lock argument as another argument. This allows using the
same macro in cases where the value to be stored is different from 1.
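For example, a later patch in this series passes the (possibly pointer-valued)
locked field of the current node instead of the constant 1:

	/* hand over the lock together with the secondary-queue state */
	arch_mcs_spin_unlock_contended(&succ->locked, node->locked);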

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 arch/arm/include/asm/mcs_spinlock.h | 4 ++--
 kernel/locking/mcs_spinlock.h       | 6 +++---
 kernel/locking/qspinlock.c          | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf4d06f..ae6d763477f4 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -14,9 +14,9 @@ do {									\
 		wfe();							\
 } while (0)								\
 
-#define arch_mcs_spin_unlock_contended(lock)				\
+#define arch_mcs_spin_unlock_contended(lock, val)			\
 do {									\
-	smp_store_release(lock, 1);					\
+	smp_store_release(lock, (val));					\
 	dsb_sev();							\
 } while (0)
 
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153b4d3c..bc6d3244e1af 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -41,8 +41,8 @@ do {									\
  * operations in the critical section has been completed before
  * unlocking.
  */
-#define arch_mcs_spin_unlock_contended(l)				\
-	smp_store_release((l), 1)
+#define arch_mcs_spin_unlock_contended(l, val)				\
+	smp_store_release((l), (val))
 #endif
 
 /*
@@ -115,7 +115,7 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	}
 
 	/* Pass lock to next waiter. */
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_spin_unlock_contended(&next->locked, 1);
 }
 
 #endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 8a8c3c208c5e..fc88e9685bdf 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -545,7 +545,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_spin_unlock_contended(&next->locked, 1);
 	pv_kick_node(lock, next);
 
 release:
-- 
2.11.0 (Apple Git-81)


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/3] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-01-31  3:01 ` Alex Kogan
@ 2019-01-31  3:01   ` Alex Kogan
  -1 siblings, 0 replies; 30+ messages in thread
From: Alex Kogan @ 2019-01-31  3:01 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice, rahul.x.yadav

In CNA, spinning threads are organized in two queues, a main queue for
threads running on the same socket as the current lock holder, and a
secondary queue for threads running on other sockets. For details,
see https://arxiv.org/abs/1810.05600.

Note that this variant of CNA may introduce starvation by continuously
passing the lock to threads running on the same socket. This issue
will be addressed later in the series.
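Two MCS-node fields do double duty in this patch: locked still holds 0/1, but
any other value is a pointer to the head of the secondary queue, and
socket_and_count packs the nesting index into bits 0-1 and (socket id + 1)
into the remaining bits, so 0 in the socket bits means "socket not recorded
yet". A stand-alone sketch of that packing, mirroring the decode_socket()/
decode_count()/set_socket() helpers added below (the encode helper and the
sketch_* names here are for illustration only):

	#define IDX_BITS	2
	#define IDX_MASK	((1U << IDX_BITS) - 1)

	static inline unsigned int encode_socket_and_count(int socket,
							   unsigned int count)
	{
		return ((unsigned int)(socket + 1) << IDX_BITS) |
		       (count & IDX_MASK);
	}

	static inline int sketch_decode_socket(unsigned int v)
	{
		return (int)(v >> IDX_BITS) - 1;	/* -1 == not set */
	}

	static inline unsigned int sketch_decode_count(unsigned int v)
	{
		return v & IDX_MASK;
	}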

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/asm-generic/qspinlock_types.h |  10 +++
 kernel/locking/mcs_spinlock.h         |  15 +++-
 kernel/locking/qspinlock.c            | 162 +++++++++++++++++++++++++++++-----
 3 files changed, 164 insertions(+), 23 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index d10f1e7d6ba8..1807389ab0f9 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -109,4 +109,14 @@ typedef struct qspinlock {
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)
 
+/*
+ * Bitfields in the non-atomic socket_and_count value:
+ * 0- 1: queue node index (always < 4)
+ * 2-31: socket id
+ */
+#define _Q_IDX_OFFSET		0
+#define _Q_IDX_BITS		2
+#define _Q_IDX_MASK		_Q_SET_MASK(IDX)
+#define _Q_SOCKET_OFFSET	(_Q_IDX_OFFSET + _Q_IDX_BITS)
+
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index bc6d3244e1af..78a9920cf613 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -9,6 +9,12 @@
  * to acquire the lock spinning on a local variable.
  * It avoids expensive cache bouncings that common test-and-set spin-lock
  * implementations incur.
+ *
+ * This implementation of the MCS spin-lock is NUMA-aware. Spinning
+ * threads are organized in two queues, a main queue for threads running
+ * on the same socket as the current lock holder, and a secondary queue
+ * for threads running on other sockets.
+ * For details, see https://arxiv.org/abs/1810.05600.
  */
 #ifndef __LINUX_MCS_SPINLOCK_H
 #define __LINUX_MCS_SPINLOCK_H
@@ -17,8 +23,13 @@
 
 struct mcs_spinlock {
 	struct mcs_spinlock *next;
-	int locked; /* 1 if lock acquired */
-	int count;  /* nesting count, see qspinlock.c */
+	uintptr_t locked; /* 1 if lock acquired, 0 if not, other values */
+			  /* represent a pointer to the secondary queue head */
+	u32 socket_and_count;	/* socket id on which this thread is running */
+				/* with two lower bits reserved for nesting */
+				/* count, see qspinlock.c */
+	struct mcs_spinlock *tail;    /* points to the secondary queue tail */
+	u32 encoded_tail; /* encoding of this node as the main queue tail */
 };
 
 #ifndef arch_mcs_spin_lock_contended
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fc88e9685bdf..6addc24f219d 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -77,14 +77,11 @@
 #define MAX_NODES	4
 
 /*
- * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
- * size and four of them will fit nicely in one 64-byte cacheline. For
+ * On 64-bit architectures, the mcs_spinlock structure will be 32 bytes in
+ * size and two of them will fit nicely in one 64-byte cacheline. For
  * pvqspinlock, however, we need more space for extra data. To accommodate
- * that, we insert two more long words to pad it up to 32 bytes. IOW, only
- * two of them can fit in a cacheline in this case. That is OK as it is rare
- * to have more than 2 levels of slowpath nesting in actual use. We don't
- * want to penalize pvqspinlocks to optimize for a rare case in native
- * qspinlocks.
+ * that, we insert two more long words to pad it up to 40 bytes. IOW, only
+ * one of them can fit in a cacheline in this case.
  */
 struct qnode {
 	struct mcs_spinlock mcs;
@@ -109,9 +106,9 @@ struct qnode {
  * Per-CPU queue node structures; we can never have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
  *
- * Exactly fits one 64-byte cacheline on a 64-bit architecture.
+ * Exactly fits two 64-byte cachelines on a 64-bit architecture.
  *
- * PV doubles the storage and uses the second cacheline for PV state.
+ * PV adds more storage for PV state, and thus needs three cachelines.
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
@@ -124,9 +121,6 @@ static inline __pure u32 encode_tail(int cpu, int idx)
 {
 	u32 tail;
 
-#ifdef CONFIG_DEBUG_SPINLOCK
-	BUG_ON(idx > 3);
-#endif
 	tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
 	tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
 
@@ -300,6 +294,81 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
 #endif
 
+#define MCS_NODE(ptr) ((struct mcs_spinlock *)(ptr))
+
+static inline __pure int decode_socket(u32 socket_and_count)
+{
+	int socket = (socket_and_count >> _Q_SOCKET_OFFSET) - 1;
+
+	return socket;
+}
+
+static inline __pure int decode_count(u32 socket_and_count)
+{
+	int count = socket_and_count & _Q_IDX_MASK;
+
+	return count;
+}
+
+static inline void set_socket(struct mcs_spinlock *node, int socket)
+{
+	u32 val;
+
+	val  = (socket + 1) << _Q_SOCKET_OFFSET;
+	val |= decode_count(node->socket_and_count);
+
+	node->socket_and_count = val;
+}
+
+static struct mcs_spinlock *find_successor(struct mcs_spinlock *me,
+					   int my_cpuid)
+{
+	int my_socket;
+	struct mcs_spinlock *head_other, *tail_other, *cur;
+
+	struct mcs_spinlock *next = me->next;
+	/* @next should be set, else we would not be calling this function. */
+	WARN_ON_ONCE(next == NULL);
+
+	/* Get socket, which would not be set if we entered an empty queue. */
+	my_socket = decode_socket(me->socket_and_count);
+	if (my_socket == -1)
+		my_socket = numa_cpu_node(my_cpuid);
+	/*
+	 * Fast path - check whether the immediate successor runs on
+	 * the same socket.
+	 */
+	if (decode_socket(next->socket_and_count) == my_socket)
+		return next;
+
+	head_other = next;
+	tail_other = next;
+
+	/*
+	 * Traverse the main waiting queue starting from the successor of my
+	 * successor, and look for a thread running on the same socket.
+	 */
+	cur = READ_ONCE(next->next);
+	while (cur) {
+		if (decode_socket(cur->socket_and_count) == my_socket) {
+			/*
+			 * Found a thread on the same socket. Move threads
+			 * between me and that node into the secondary queue.
+			 */
+			if (me->locked > 1)
+				MCS_NODE(me->locked)->tail->next = head_other;
+			else
+				me->locked = (uintptr_t)head_other;
+			tail_other->next = NULL;
+			MCS_NODE(me->locked)->tail = tail_other;
+			return cur;
+		}
+		tail_other = cur;
+		cur = READ_ONCE(cur->next);
+	}
+	return NULL;
+}
+
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
 /**
@@ -325,9 +394,9 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
  */
 void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
-	struct mcs_spinlock *prev, *next, *node;
-	u32 old, tail;
-	int idx;
+	struct mcs_spinlock *prev, *next, *node, *succ;
+	u32 old, tail, new;
+	int idx, cpuid;
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
@@ -409,8 +478,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	qstat_inc(qstat_lock_slowpath, true);
 pv_queue:
 	node = this_cpu_ptr(&qnodes[0].mcs);
-	idx = node->count++;
-	tail = encode_tail(smp_processor_id(), idx);
+#ifdef CONFIG_DEBUG_SPINLOCK
+	BUG_ON(decode_count(node->socket_and_count) >= 3);
+#endif
+	idx = decode_count(node->socket_and_count++);
+	cpuid = smp_processor_id();
+	tail = encode_tail(cpuid, idx);
 
 	node = grab_mcs_node(node, idx);
 
@@ -428,6 +501,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 
 	node->locked = 0;
 	node->next = NULL;
+	set_socket(node, -1);
+	node->encoded_tail = tail;
 	pv_init_node(node);
 
 	/*
@@ -462,6 +537,14 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (old & _Q_TAIL_MASK) {
 		prev = decode_tail(old);
 
+		/*
+		 * An explicit barrier after the store to @socket
+		 * is not required as making the socket value visible is
+		 * required only for performance, not correctness, and
+		 * we rather avoid the cost of the barrier.
+		 */
+		set_socket(node, numa_cpu_node(cpuid));
+
 		/* Link @node into the waitqueue. */
 		WRITE_ONCE(prev->next, node);
 
@@ -477,6 +560,9 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		next = READ_ONCE(node->next);
 		if (next)
 			prefetchw(next);
+	} else {
+		/* Must pass a non-zero value to successor when we unlock. */
+		node->locked = 1;
 	}
 
 	/*
@@ -528,8 +614,24 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *       PENDING will make the uncontended transition fail.
 	 */
 	if ((val & _Q_TAIL_MASK) == tail) {
-		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
-			goto release; /* No contention */
+		/* Check whether the secondary queue is empty. */
+		if (node->locked == 1) {
+			if (atomic_try_cmpxchg_relaxed(&lock->val, &val,
+					_Q_LOCKED_VAL))
+				goto release; /* No contention */
+		} else {
+			/*
+			 * Pass the lock to the first thread in the secondary
+			 * queue, but first try to update the queue's tail to
+			 * point to the last node in the secondary queue.
+			 */
+			succ = MCS_NODE(node->locked);
+			new = succ->tail->encoded_tail + _Q_LOCKED_VAL;
+			if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
+				arch_mcs_spin_unlock_contended(&succ->locked, 1);
+				goto release;
+			}
+		}
 	}
 
 	/*
@@ -545,14 +647,32 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_spin_unlock_contended(&next->locked, 1);
+	/* Try to pass the lock to a thread running on the same socket. */
+	succ = find_successor(node, cpuid);
+	if (succ) {
+		arch_mcs_spin_unlock_contended(&succ->locked, node->locked);
+	} else if (node->locked > 1) {
+		/*
+		 * If the secondary queue is not empty, pass the lock
+		 * to the first node in that queue.
+		 */
+		succ = MCS_NODE(node->locked);
+		succ->tail->next = next;
+		arch_mcs_spin_unlock_contended(&succ->locked, 1);
+	} else {
+		/*
+		 * Otherwise, pass the lock to the immediate successor
+		 * in the main queue.
+		 */
+		arch_mcs_spin_unlock_contended(&next->locked, 1);
+	}
 	pv_kick_node(lock, next);
 
 release:
 	/*
 	 * release the node
 	 */
-	__this_cpu_dec(qnodes[0].mcs.count);
+	__this_cpu_dec(qnodes[0].mcs.socket_and_count);
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
-- 
2.11.0 (Apple Git-81)


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/3] locking/qspinlock: Introduce starvation avoidance into CNA
  2019-01-31  3:01 ` Alex Kogan
@ 2019-01-31  3:01   ` Alex Kogan
  -1 siblings, 0 replies; 30+ messages in thread
From: Alex Kogan @ 2019-01-31  3:01 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice, rahul.x.yadav

Choose the next lock holder among spinning threads running on the same
socket with high probability rather than always. With small probability,
hand the lock to the first thread in the secondary queue or, if that
queue is empty, to the immediate successor of the current lock holder
in the main queue.  Thus, assuming no failures while threads hold the
lock, every thread would be able to acquire the lock after a bounded
number of lock transitions, with high probability.

Note that we could make the inter-socket transition deterministic,
by sticking a counter of intra-socket transitions in the head node
of the secondary queue. At handoff time, we could increment
the counter and check if it is below a threshold. This adds another
field to queue nodes and a nearly-certain local cache miss to read and
update this counter during the handoff. While still beating stock,
this variant adds some overhead over the probabilistic variant.
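For reference, the constant chosen below makes the inter-socket handoff rare
but not negligible: probably(0x10000) is false only when the low 16 bits of
the per-CPU xorshift value are zero, i.e. with probability 1/65536 (assuming
the generator is roughly uniform), so under sustained contention the lock is
expected to stay on one socket for about 64K consecutive handoffs. The
decision point, as added below:

	succ = NULL;
	if (probably(INTRA_SOCKET_HANDOFF_PROB_ARG))	/* true ~65535/65536 */
		succ = find_successor(node, cpuid);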

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 kernel/locking/qspinlock.c | 53 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 6addc24f219d..d3caef4f84e2 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -31,6 +31,7 @@
 #include <linux/prefetch.h>
 #include <asm/byteorder.h>
 #include <asm/qspinlock.h>
+#include <linux/random.h>
 
 /*
  * Include queued spinlock statistics code
@@ -112,6 +113,18 @@ struct qnode {
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
+/* Per-CPU pseudo-random number seed */
+static DEFINE_PER_CPU(u32, seed);
+
+/*
+ * Controls the probability for intra-socket lock hand-off. It can be
+ * tuned and depend, e.g., on the number of CPUs per socket. For now,
+ * choose a value that provides reasonable long-term fairness without
+ * sacrificing performance compared to a version that does not have any
+ * fairness guarantees.
+ */
+#define INTRA_SOCKET_HANDOFF_PROB_ARG	0x10000
+
 /*
  * We must be able to distinguish between no-tail and the tail at 0:0,
  * therefore increment the cpu number by one.
@@ -369,6 +382,35 @@ static struct mcs_spinlock *find_successor(struct mcs_spinlock *me,
 	return NULL;
 }
 
+/*
+ * xorshift function for generating pseudo-random numbers:
+ * https://en.wikipedia.org/wiki/Xorshift
+ */
+static inline u32 xor_random(void)
+{
+	u32 v;
+
+	v = this_cpu_read(seed);
+	if (v == 0)
+		get_random_bytes(&v, sizeof(u32));
+
+	v ^= v << 6;
+	v ^= v >> 21;
+	v ^= v << 7;
+	this_cpu_write(seed, v);
+
+	return v;
+}
+
+/*
+ * Return false with probability 1 / @range.
+ * @range must be a power of 2.
+ */
+static bool probably(unsigned int range)
+{
+	return xor_random() & (range - 1);
+}
+
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
 /**
@@ -647,8 +689,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	/* Try to pass the lock to a thread running on the same socket. */
-	succ = find_successor(node, cpuid);
+	/*
+	 * Try to pass the lock to a thread running on the same socket.
+	 * For long-term fairness, search for such a thread with high
+	 * probability rather than always.
+	 */
+	succ = NULL;
+	if (probably(INTRA_SOCKET_HANDOFF_PROB_ARG))
+		succ = find_successor(node, cpuid);
+
 	if (succ) {
 		arch_mcs_spin_unlock_contended(&succ->locked, node->locked);
 	} else if (node->locked > 1) {
-- 
2.11.0 (Apple Git-81)


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3] Add NUMA-awareness to qspinlock
  2019-01-31  3:01 ` Alex Kogan
@ 2019-01-31  9:56   ` Peter Zijlstra
  -1 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2019-01-31  9:56 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, steven.sistare, daniel.m.jordan,
	dave.dice, rahul.x.yadav

On Wed, Jan 30, 2019 at 10:01:32PM -0500, Alex Kogan wrote:
> Lock throughput can be increased by handing a lock to a waiter on the
> same NUMA socket as the lock holder, provided care is taken to avoid
> starvation of waiters on other NUMA sockets. This patch introduces CNA
> (compact NUMA-aware lock) as the slow path for qspinlock.

Since you use NUMA, use the term node, not socket. The two are not
strictly related.

> CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
> organized in two queues, a main queue for threads running on the same
> socket as the current lock holder, and a secondary queue for threads
> running on other sockets. Threads record the ID of the socket on which
> they are running in their queue nodes. At unlock time, the lock
> holder scans the main queue looking for a thread running on the same
> socket. If found (call it thread T), all threads in the main queue
> between the current lock holder and T are moved to the end of the
> secondary queue, and the lock is passed to T. If no such T is found, the
> lock is passed to the first node in the secondary queue. Finally, if the
> secondary queue is empty, the lock is passed to the next thread in the
> main queue.
> 
> Full details are available at https://arxiv.org/abs/1810.05600.

Full details really should also be in the Changelog. You can skip much
of the academic bla-bla, but the Changelog should be self contained.

> We have done some performance evaluation with the locktorture module
> as well as with several benchmarks from the will-it-scale repo.
> The following locktorture results are from an Oracle X5-4 server
> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
> cores each). Each number represents an average (over 5 runs) of the
> total number of ops (x10^7) reported at the end of each run. The stock
> kernel is v4.20.0-rc4+ compiled in the default configuration.
> 
> #thr  stock  patched speedup (patched/stock)
>   1   2.710   2.715  1.002
>   2   3.108   3.001  0.966
>   4   4.194   3.919  0.934

So low contention is actually worse. Funnily low contention is the
majority of our locks and is _really_ important.

>   8   5.309   6.894  1.299
>  16   6.722   9.094  1.353
>  32   7.314   9.885  1.352
>  36   7.562   9.855  1.303
>  72   6.696  10.358  1.547
> 108   6.364  10.181  1.600
> 142   6.179  10.178  1.647
> 
> When the kernel is compiled with lockstat enabled, CNA 

I'll ignore that; lockstat/lockdep enabled runs are not what one would
call performance relevant.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/3] locking/qspinlock: Introduce starvation avoidance into CNA
  2019-01-31  3:01   ` Alex Kogan
@ 2019-01-31 10:00     ` Peter Zijlstra
  -1 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2019-01-31 10:00 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, steven.sistare, daniel.m.jordan,
	dave.dice, rahul.x.yadav, Thomas Gleixner

On Wed, Jan 30, 2019 at 10:01:35PM -0500, Alex Kogan wrote:
> Choose the next lock holder among spinning threads running on the same
> socket with high probability rather than always. With small probability,
> hand the lock to the first thread in the secondary queue or, if that
> queue is empty, to the immediate successor of the current lock holder
> in the main queue.  Thus, assuming no failures while threads hold the
> lock, every thread would be able to acquire the lock after a bounded
> number of lock transitions, with high probability.
> 
> Note that we could make the inter-socket transition deterministic,
> by sticking a counter of intra-socket transitions in the head node
> of the secondary queue. At handoff time, we could increment
> the counter and check if it is below a threshold. This adds another
> field to queue nodes and a nearly-certain local cache miss to read and
> update this counter during the handoff. While still beating stock,
> this variant adds some overhead over the probabilistic variant.

(also heavily suffers from the socket == node confusion)

How would you suggest RT 'tunes' this?

RT relies on FIFO fairness of the basic spinlock primitives; you just
completely wrecked that.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/3] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-01-31  3:01   ` Alex Kogan
@ 2019-01-31 17:38     ` Waiman Long
  -1 siblings, 0 replies; 30+ messages in thread
From: Waiman Long @ 2019-01-31 17:38 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel
  Cc: steven.sistare, daniel.m.jordan, dave.dice, rahul.x.yadav

On 01/30/2019 10:01 PM, Alex Kogan wrote:
> In CNA, spinning threads are organized in two queues, a main queue for
> threads running on the same socket as the current lock holder, and a
> secondary queue for threads running on other sockets. For details,
> see https://arxiv.org/abs/1810.05600.
>
> Note that this variant of CNA may introduce starvation by continuously
> passing the lock to threads running on the same socket. This issue
> will be addressed later in the series.
>
> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>

Just wondering if you have tried including the PARAVIRT_SPINLOCKS option to see
if that patch may screw up the PV qspinlock code.

Anyway, I do believe your claim that NUMA-aware qspinlock is good for
large systems with many nodes. However, all these extra code are
overhead for small systems that have a single node/socket, for instance.

I will support doing something similar to what had been done to support
PV qspinlock. IOW, a separate slowpath function that can be patched to
become the default depending on the system being run on or a kernel boot
option setting.

I would like to keep the core slowpath function simple and easy to
understand. So most of the CNA code should be encapsulated into some
helper functions and put into a separate file.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 30+ messages in thread
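
A rough sketch of the boot-time selection suggested here, loosely modelled on
how PV qspinlock installs its own slow path. This is not code from the
series: the "numa_spinlock" parameter name, the cna_* symbols and the plain
function-pointer indirection (rather than patching) are assumptions made for
illustration only.

#include <linux/init.h>
#include <linux/nodemask.h>
#include <linux/spinlock.h>
#include <linux/types.h>

void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
void cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);

/* What the generic code would call; defaults to the existing slow path. */
void (*queued_spin_lock_slowpath_ptr)(struct qspinlock *, u32) =
        native_queued_spin_lock_slowpath;

static bool numa_spinlock_requested __initdata;

static int __init numa_spinlock_setup(char *str)
{
        numa_spinlock_requested = true;
        return 1;
}
__setup("numa_spinlock", numa_spinlock_setup);

void __init cna_configure_spin_lock_slowpath(void)
{
        /* Single-node machines (and anyone who did not ask) keep stock code. */
        if (numa_spinlock_requested && num_possible_nodes() > 1)
                queued_spin_lock_slowpath_ptr = cna_queued_spin_lock_slowpath;
}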

* Re: [PATCH 0/3] Add NUMA-awareness to qspinlock
  2019-01-31  9:56   ` Peter Zijlstra
@ 2019-02-01 21:20     ` Alex Kogan
  -1 siblings, 0 replies; 30+ messages in thread
From: Alex Kogan @ 2019-02-01 21:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, Steven Sistare, Daniel Jordan,
	dave.dice, rahul.x.yadav


> On Jan 31, 2019, at 4:56 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Jan 30, 2019 at 10:01:32PM -0500, Alex Kogan wrote:
>> Lock throughput can be increased by handing a lock to a waiter on the
>> same NUMA socket as the lock holder, provided care is taken to avoid
>> starvation of waiters on other NUMA sockets. This patch introduces CNA
>> (compact NUMA-aware lock) as the slow path for qspinlock.
> 
> Since you use NUMA, use the term node, not socket. The two are not
> strictly related.
Got it, thanks.

> 
>> CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
>> organized in two queues, a main queue for threads running on the same
>> socket as the current lock holder, and a secondary queue for threads
>> running on other sockets. Threads record the ID of the socket on which
>> they are running in their queue nodes. At the unlock time, the lock
>> holder scans the main queue looking for a thread running on the same
>> socket. If found (call it thread T), all threads in the main queue
>> between the current lock holder and T are moved to the end of the
>> secondary queue, and the lock is passed to T. If such T is not found, the
>> lock is passed to the first node in the secondary queue. Finally, if the
>> secondary queue is empty, the lock is passed to the next thread in the
>> main queue.
>> 
>> Full details are available at https://arxiv.org/abs/1810.05600.
> 
> Full details really should also be in the Changelog. You can skip much
> of the academic bla-bla, but the Changelog should be self contained.
> 
>> We have done some performance evaluation with the locktorture module
>> as well as with several benchmarks from the will-it-scale repo.
>> The following locktorture results are from an Oracle X5-4 server
>> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
>> cores each). Each number represents an average (over 5 runs) of the
>> total number of ops (x10^7) reported at the end of each run. The stock
>> kernel is v4.20.0-rc4+ compiled in the default configuration.
>> 
>> #thr  stock  patched speedup (patched/stock)
>>  1   2.710   2.715  1.002
>>  2   3.108   3.001  0.966
>>  4   4.194   3.919  0.934
> 
> So low contention is actually worse. Funnily, low contention is the
> majority of our locks and is _really_ important.
This can most certainly be engineered out, e.g., by caching the ID of the node on which a task is running.
We will look into that.

> 
>>  8   5.309   6.894  1.299
>> 16   6.722   9.094  1.353
>> 32   7.314   9.885  1.352
>> 36   7.562   9.855  1.303
>> 72   6.696  10.358  1.547
>> 108   6.364  10.181  1.600
>> 142   6.179  10.178  1.647
>> 
>> When the kernel is compiled with lockstat enabled, CNA 
> 
> I'll ignore that, lockstat/lockdep enabled runs are not what one would
> call performance relevant.
Please note that only one set of results has lockstat enabled.
The rest of the results (will-it-scale included) do not have it.

Regards,
— Alex


^ permalink raw reply	[flat|nested] 30+ messages in thread
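
A minimal sketch of the node-ID caching idea mentioned above: record each
CPU's NUMA node once at boot so that the lock slow path only pays for a
per-CPU load instead of recomputing topology on every acquisition. The cna_*
names and the use of an early_initcall are assumptions, not code from the
series.

#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/topology.h>

static DEFINE_PER_CPU(int, cna_numa_node);

static int __init cna_init_node_cache(void)
{
        int cpu;

        for_each_possible_cpu(cpu)
                per_cpu(cna_numa_node, cpu) = cpu_to_node(cpu);
        return 0;
}
early_initcall(cna_init_node_cache);

/* A waiter entering the slow path would then tag its queue node with: */
static inline int cna_this_node(void)
{
        return this_cpu_read(cna_numa_node);
}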

* Re: [PATCH 2/3] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-01-31 17:38     ` Waiman Long
@ 2019-02-01 21:26       ` Alex Kogan
  -1 siblings, 0 replies; 30+ messages in thread
From: Alex Kogan @ 2019-02-01 21:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, Steven Sistare, Daniel Jordan,
	dave.dice, Rahul Yadav


> On Jan 31, 2019, at 12:38 PM, Waiman Long <longman@redhat.com> wrote:
> 
> On 01/30/2019 10:01 PM, Alex Kogan wrote:
>> In CNA, spinning threads are organized in two queues, a main queue for
>> threads running on the same socket as the current lock holder, and a
>> secondary queue for threads running on other sockets. For details,
>> see https://arxiv.org/abs/1810.05600.
>> 
>> Note that this variant of CNA may introduce starvation by continuously
>> passing the lock to threads running on the same socket. This issue
>> will be addressed later in the series.
>> 
>> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
>> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Just wondering if you have tried including the PARAVIRT_SPINLOCKS option to see
> if that patch may screw up the PV qspinlock code.
No, I haven’t yet.
The idea was to make it work for non-PV systems first, and then extend to PV.

> 
> Anyway, I do believe your claim that NUMA-aware qspinlock is good for
> large systems with many nodes. However, all this extra code is
> overhead for small systems that have a single node/socket, for instance.
> 
> I will support doing something similar to what had been done to support
> PV qspinlock. IOW, a separate slowpath function that can be patched to
> become the default depending on the system being run on or a kernel boot
> option setting.
> 
> I would like to keep the core slowpath function simple and easy to
> understand. So most of the CNA code should be encapsulated into some
> helper functions and put into a separate file.
Sounds good. 
I think it should be pretty straightforward to encapsulate the CNA code and do what you suggest.
We will look into that.

Thanks,
— Alex

> 
> Thanks,
> Longman
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread
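
A sketch of the encapsulation being agreed on here: the core slow path only
sees a couple of small hooks, and everything CNA-specific lives in its own
file. The struct layout is simplified and every name below is made up for
illustration; it is not the interface the series necessarily ends up with.

struct mcs_spinlock {
        struct mcs_spinlock *next;
        int locked;
        int count;
};

/* Implemented in a separate CNA file; trivial no-ops when CNA is disabled. */
void cna_init_node(struct mcs_spinlock *node);
struct mcs_spinlock *cna_pass_lock(struct mcs_spinlock *node);

/* The unlock-side handoff in the core slow path then stays one call deep. */
void handoff_example(struct mcs_spinlock *node)
{
        struct mcs_spinlock *next = cna_pass_lock(node);

        if (next)
                next->locked = 1;       /* hand the lock to the chosen waiter */
}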

* Re: [PATCH 3/3] locking/qspinlock: Introduce starvation avoidance into CNA
  2019-01-31 10:00     ` Peter Zijlstra
@ 2019-02-05  3:35       ` Alex Kogan
  -1 siblings, 0 replies; 30+ messages in thread
From: Alex Kogan @ 2019-02-05  3:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, steven.sistare, daniel.m.jordan,
	dave.dice, rahul.x.yadav, Thomas Gleixner


> On Jan 31, 2019, at 5:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Jan 30, 2019 at 10:01:35PM -0500, Alex Kogan wrote:
>> Choose the next lock holder among spinning threads running on the same
>> socket with high probability rather than always. With small probability,
>> hand the lock to the first thread in the secondary queue or, if that
>> queue is empty, to the immediate successor of the current lock holder
>> in the main queue.  Thus, assuming no failures while threads hold the
>> lock, every thread would be able to acquire the lock after a bounded
>> number of lock transitions, with high probability.
>> 
>> Note that we could make the inter-socket transition deterministic,
>> by sticking a counter of intra-socket transitions in the head node
>> of the secondary queue. At the handoff time, we could increment
>> the counter and check if it is below a threshold. This adds another
>> field to queue nodes and nearly-certain local cache miss to read and
>> update this counter during the handoff. While still beating stock,
>> this variant adds certain overhead over the probabilistic variant.
> 
> (also heavily suffers from the socket == node confusion)
> 
> How would you suggest RT 'tunes' this?
> 
> RT relies on FIFO fairness of the basic spinlock primitives; you just
> completely wrecked that.

It is true that CNA trades some fairness for shorter lock handover latency, much like any other NUMA-aware lock.

Can you explain, however, what exactly breaks here?
It seems that even today, qspinlock does not support RT_PREEMPT, given that it uses per-CPU queue nodes.

Thank you,
— Alex




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/3] locking/qspinlock: Introduce starvation avoidance into CNA
  2019-02-05  3:35       ` Alex Kogan
@ 2019-02-05  9:22         ` Peter Zijlstra
  -1 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2019-02-05  9:22 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, steven.sistare, daniel.m.jordan,
	dave.dice, rahul.x.yadav, Thomas Gleixner

On Mon, Feb 04, 2019 at 10:35:09PM -0500, Alex Kogan wrote:
> 
> > On Jan 31, 2019, at 5:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Wed, Jan 30, 2019 at 10:01:35PM -0500, Alex Kogan wrote:
> >> Choose the next lock holder among spinning threads running on the same
> >> socket with high probability rather than always. With small probability,
> >> hand the lock to the first thread in the secondary queue or, if that
> >> queue is empty, to the immediate successor of the current lock holder
> >> in the main queue.  Thus, assuming no failures while threads hold the
> >> lock, every thread would be able to acquire the lock after a bounded
> >> number of lock transitions, with high probability.
> >> 
> >> Note that we could make the inter-socket transition deterministic,
> >> by sticking a counter of intra-socket transitions in the head node
> >> of the secondary queue. At the handoff time, we could increment
> >> the counter and check if it is below a threshold. This adds another
> >> field to queue nodes and nearly-certain local cache miss to read and
> >> update this counter during the handoff. While still beating stock,
> >> this variant adds certain overhead over the probabilistic variant.
> > 
> > (also heavily suffers from the socket == node confusion)
> > 
> > How would you suggest RT 'tunes' this?
> > 
> > RT relies on FIFO fairness of the basic spinlock primitives; you just
> > completely wrecked that.
> 
> It is true that CNA trades some fairness for shorter lock handover
> latency, much like any other NUMA-aware lock.
> 
> Can you explain, however, what exactly breaks here?

Timeliness guarantees. FIFO-fair has well defined time behaviour; you
know exactly how long you get to wait before you acquire the lock,
namely however many waiters are in front of you multiplied by the worst
case wait time.

Doing time analysis on a randomized algorithm isn't my idea of fun.

> It seems that even today, qspinlock does not support RT_PREEMPT, given
> that it uses per-CPU queue nodes.

It does work with RT, commit:

  7aa54be29765 ("locking/qspinlock, x86: Provide liveness guarantee")

it is a direct result of RT observing funnies with it. I've no idea why you
think it would not work.



^ permalink raw reply	[flat|nested] 30+ messages in thread
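
As a toy illustration of the bound Peter describes (all numbers invented):
under FIFO handoff, a waiter that queues behind n others waits at most n
times the worst-case per-waiter time, which is exactly the kind of closed-form
bound a randomized handoff no longer provides.

#include <stdio.h>

int main(void)
{
        unsigned int n_ahead = 5;       /* waiters already queued ahead of us */
        unsigned int hold_us = 2;       /* worst-case lock hold time, in us */
        unsigned int handoff_us = 1;    /* worst-case handoff latency, in us */

        printf("worst-case wait: %u us\n", n_ahead * (hold_us + handoff_us));
        return 0;
}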

* Re: [PATCH 3/3] locking/qspinlock: Introduce starvation avoidance into CNA
  2019-02-05  9:22         ` Peter Zijlstra
@ 2019-02-05 13:48           ` Waiman Long
  -1 siblings, 0 replies; 30+ messages in thread
From: Waiman Long @ 2019-02-05 13:48 UTC (permalink / raw)
  To: Peter Zijlstra, Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, linux-arch, linux-arm-kernel,
	linux-kernel, steven.sistare, daniel.m.jordan, dave.dice,
	rahul.x.yadav, Thomas Gleixner

On 02/05/2019 04:22 AM, Peter Zijlstra wrote:
> On Mon, Feb 04, 2019 at 10:35:09PM -0500, Alex Kogan wrote:
>>> On Jan 31, 2019, at 5:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Wed, Jan 30, 2019 at 10:01:35PM -0500, Alex Kogan wrote:
>>>> Choose the next lock holder among spinning threads running on the same
>>>> socket with high probability rather than always. With small probability,
>>>> hand the lock to the first thread in the secondary queue or, if that
>>>> queue is empty, to the immediate successor of the current lock holder
>>>> in the main queue.  Thus, assuming no failures while threads hold the
>>>> lock, every thread would be able to acquire the lock after a bounded
>>>> number of lock transitions, with high probability.
>>>>
>>>> Note that we could make the inter-socket transition deterministic,
>>>> by sticking a counter of intra-socket transitions in the head node
>>>> of the secondary queue. At the handoff time, we could increment
>>>> the counter and check if it is below a threshold. This adds another
>>>> field to queue nodes and nearly-certain local cache miss to read and
>>>> update this counter during the handoff. While still beating stock,
>>>> this variant adds certain overhead over the probabilistic variant.
>>> (also heavily suffers from the socket == node confusion)
>>>
>>> How would you suggest RT 'tunes' this?
>>>
>>> RT relies on FIFO fairness of the basic spinlock primitives; you just
>>> completely wrecked that.
>> It is true that CNA trades some fairness for shorter lock handover
>> latency, much like any other NUMA-aware lock.
>>
>> Can you explain, however, what exactly breaks here?
> Timeliness guarantees. FIFO-fair has well defined time behaviour; you
> know exactly how long you get to wait before you acquire the lock,
> namely however many waiters are in front of you multiplied by the worst
> case wait time.
>
> Doing time analysis on a randomized algorithm isn't my idea of fun.

That RT doesn't work well with the NUMA qspinlock is another reason why I want
it to be a separate slow path. We will disable it on an RT kernel where
guaranteed low latency is a must and throughput isn't as important.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/3] locking/qspinlock: Introduce starvation avoidance into CNA
  2019-02-05  9:22         ` Peter Zijlstra
@ 2019-02-05 21:07           ` Alex Kogan
  -1 siblings, 0 replies; 30+ messages in thread
From: Alex Kogan @ 2019-02-05 21:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, Steven Sistare, Daniel Jordan,
	dave.dice, Rahul Yadav, Thomas Gleixner

[ Resending after correcting an issue with the included URL and correcting a typo 
in Waiman’s name — sorry about that! ]

> On Feb 5, 2019, at 4:22 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Feb 04, 2019 at 10:35:09PM -0500, Alex Kogan wrote:
>> 
>>> On Jan 31, 2019, at 5:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Wed, Jan 30, 2019 at 10:01:35PM -0500, Alex Kogan wrote:
>>>> Choose the next lock holder among spinning threads running on the same
>>>> socket with high probability rather than always. With small probability,
>>>> hand the lock to the first thread in the secondary queue or, if that
>>>> queue is empty, to the immediate successor of the current lock holder
>>>> in the main queue.  Thus, assuming no failures while threads hold the
>>>> lock, every thread would be able to acquire the lock after a bounded
>>>> number of lock transitions, with high probability.
>>>> 
>>>> Note that we could make the inter-socket transition deterministic,
>>>> by sticking a counter of intra-socket transitions in the head node
>>>> of the secondary queue. At the handoff time, we could increment
>>>> the counter and check if it is below a threshold. This adds another
>>>> field to queue nodes and nearly-certain local cache miss to read and
>>>> update this counter during the handoff. While still beating stock,
>>>> this variant adds certain overhead over the probabilistic variant.
>>> 
>>> (also heavily suffers from the socket == node confusion)
>>> 
>>> How would you suggest RT 'tunes' this?
>>> 
>>> RT relies on FIFO fairness of the basic spinlock primitives; you just
>>> completely wrecked that.
>> 
>> It is true that CNA trades some fairness for shorter lock handover
>> latency, much like any other NUMA-aware lock.
>> 
>> Can you explain, however, what exactly breaks here?
> 
> Timeliness guarantees. FIFO-fair has well defined time behaviour; you
> know exactly how long you get to wait before you acquire the lock,
> namely however many waiters are in front of you multiplied by the worst
> case wait time.
Got it — thanks for the clarification!

> 
> Doing time analysis on a randomized algorithm isn't my idea of fun.
> 
>> It seems that even today, qspinlock does not support RT_PREEMPT, given
>> that it uses per-CPU queue nodes.
> 
> It does work with RT, commit:
> 
>  7aa54be29765 ("locking/qspinlock, x86: Provide liveness guarantee")
> 
> it is a direct result of RT observing funnies with it. I've no idea why you
> think it would not work.
Just trying to get to the bottom of it — as of today, qspinlock explicitly assumes
no preemption while waiting for the lock.

Here is what Waiman had to say about that in https://lwn.net/Articles/561775:

"The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock.”

This was back in 2013, but the code still uses per-CPU queue nodes,
and AFAICT, preemption will break things.

So what you are saying is that RT would be fine assuming no preemption in
the spinlock as long as it provides FIFO? Or is there some future code patch
that will take care of the “no preemption” assumption (but still assume FIFO)?

Thanks,
— Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
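
For readers following the per-CPU queue node point: a simplified sketch of
the structure Alex refers to. Each CPU owns a small, fixed array of queue
nodes indexed by lock-nesting level, which is why a waiter must stay on its
CPU (preemption disabled) while it sits in the queue. Field names follow the
real code only loosely; MAX_NODES and the helper are illustrative.

#include <linux/percpu.h>

#define MAX_NODES       4       /* task, softirq, hardirq, nmi contexts */

struct mcs_spinlock {
        struct mcs_spinlock *next;
        int locked;
        int count;              /* nesting count on this CPU */
};

static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, qnodes[MAX_NODES]);

struct mcs_spinlock *grab_mcs_node(void)
{
        struct mcs_spinlock *base = this_cpu_ptr(&qnodes[0]);
        int idx = base->count++;        /* only safe with preemption off */

        return &base[idx];
}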

* Re: [PATCH 3/3] locking/qspinlock: Introduce starvation avoidance into CNA
  2019-02-05 21:07           ` Alex Kogan
@ 2019-02-05 21:12             ` Waiman Long
  -1 siblings, 0 replies; 30+ messages in thread
From: Waiman Long @ 2019-02-05 21:12 UTC (permalink / raw)
  To: Alex Kogan, Peter Zijlstra
  Cc: linux, mingo, will.deacon, arnd, linux-arch, linux-arm-kernel,
	linux-kernel, Steven Sistare, Daniel Jordan, dave.dice,
	Rahul Yadav, Thomas Gleixner

On 02/05/2019 04:07 PM, Alex Kogan wrote:
>> Doing time analysis on a randomized algorithm isn't my idea of fun.
>>
>>> It seems that even today, qspinlock does not support RT_PREEMPT, given
>>> that it uses per-CPU queue nodes.
>> It does work with RT, commit:
>>
>>  7aa54be29765 ("locking/qspinlock, x86: Provide liveness guarantee")
>>
>> it is a direct result of RT observing funnies with it. I've no idea why you
>> think it would not work.
> Just trying to get to the bottom of it — as of today, qspinlock explicitly assumes
> no preemption while waiting for the lock.
>
> Here is what Waiman had to say about that in https://lwn.net/Articles/561775:
>
> "The idea behind this spinlock implementation is the fact that spinlocks
> are acquired with preemption disabled. In other words, the process
> will not be migrated to another CPU while it is trying to get a
> spinlock.”
>
> This was back in 2013, but the code still uses per-CPU queue nodes,
> and AFAICT, preemption will break things up.
>
> So what you are saying is that RT would be fine assuming no preemption in
> the spinlock as long as it provides FIFO? Or there is some future code patch 
> that will take care of the “no preemption” assumption (but still assume FIFO)?
>
> Thanks,
> — Alex

Some of the critical sections protected by spinlocks may have execution
times that are much longer than desired. That is why they are converted
to rt-mutexes in the RT kernel. There is another class of spinlocks called
raw spinlocks. They are the same as regular spinlocks in a non-RT kernel,
but remain spinlocks with no preemption allowed in an RT kernel, as sleeping
locks can't be used in atomic context. This is where the replacement of
the current qspinlock code by your NUMA-aware qspinlock may screw up the
timing guarantee that can be provided by the RT kernel.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 30+ messages in thread
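
A sketch of the distinction Longman describes, not tied to this series: on an
RT kernel, spinlock_t is substituted by a sleeping, rt-mutex-based lock,
while raw_spinlock_t keeps spinning with preemption disabled, so only the
latter is subject to the strict timing argument above. The lock names and the
example function are made up.

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(sleeps_on_rt);           /* becomes an rt-mutex on RT */
static DEFINE_RAW_SPINLOCK(always_spins);       /* stays a spinning lock */

void rt_lock_example(void)
{
        raw_spin_lock(&always_spins);
        /* Atomic context even on RT: keep this critical section short. */
        raw_spin_unlock(&always_spins);

        spin_lock(&sleeps_on_rt);
        /* On an RT kernel this may sleep and follows rt-mutex rules. */
        spin_unlock(&sleeps_on_rt);
}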

Thread overview:
2019-01-31  3:01 [PATCH 0/3] Add NUMA-awareness to qspinlock Alex Kogan
2019-01-31  3:01 ` [PATCH 1/3] locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic Alex Kogan
2019-01-31  3:01 ` [PATCH 2/3] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
2019-01-31 17:38   ` Waiman Long
2019-02-01 21:26     ` Alex Kogan
2019-01-31  3:01 ` [PATCH 3/3] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
2019-01-31 10:00   ` Peter Zijlstra
2019-02-05  3:35     ` Alex Kogan
2019-02-05  9:22       ` Peter Zijlstra
2019-02-05 13:48         ` Waiman Long
2019-02-05 21:07         ` Alex Kogan
2019-02-05 21:12           ` Waiman Long
2019-01-31  9:56 ` [PATCH 0/3] Add NUMA-awareness to qspinlock Peter Zijlstra
2019-02-01 21:20   ` Alex Kogan