Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v9 0/5] Add NUMA-awareness to qspinlock
@ 2020-01-15  3:59 Alex Kogan
  2020-01-15  3:59 ` [PATCH v9 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
                   ` (6 more replies)
  0 siblings, 7 replies; 39+ messages in thread
From: Alex Kogan @ 2020-01-15  3:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, steven.sistare, daniel.m.jordan

Minor changes from v8 based on feedback from Longman:
-----------------------------------------------------

- Add __init to cna_configure_spin_lock_slowpath().

- Fix the comment for cna_scan_main_queue().

- Change the type of intra_node_handoff_threshold to unsigned int.


Summary
-------

Lock throughput can be increased by handing a lock to a waiter on the
same NUMA node as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA nodes. This patch introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock. It is
enabled through a configuration option (NUMA_AWARE_SPINLOCKS).

CNA is a NUMA-aware version of the MCS lock. Spinning threads are
organized in two queues, a main queue for threads running on the same
node as the current lock holder, and a secondary queue for threads
running on other nodes. Threads store the ID of the node on which
they are running in their queue nodes. After acquiring the MCS lock and
before acquiring the spinlock, the lock holder scans the main queue
looking for a thread running on the same node (pre-scan). If found (call
it thread T), all threads in the main queue between the current lock
holder and T are moved to the end of the secondary queue.  If such T
is not found, we make another scan of the main queue after acquiring 
the spinlock when unlocking the MCS lock (post-scan), starting at the
node where pre-scan stopped. If both scans fail to find such T, the
MCS lock is passed to the first thread in the secondary queue. If the
secondary queue is empty, the MCS lock is passed to the next thread in the
main queue. To avoid starvation of threads in the secondary queue, those
threads are moved back to the head of the main queue after a certain
number of intra-node lock hand-offs.

More details are available at https://arxiv.org/abs/1810.05600.

The series applies on top of v5.5.0-rc6, commit b3a987b026.
Performance numbers are available in previous revisions
of the series.

Further comments are welcome and appreciated.

Alex Kogan (5):
  locking/qspinlock: Rename mcs lock/unlock macros and make them more
    generic
  locking/qspinlock: Refactor the qspinlock slow path
  locking/qspinlock: Introduce CNA into the slow path of qspinlock
  locking/qspinlock: Introduce starvation avoidance into CNA
  locking/qspinlock: Introduce the shuffle reduction optimization into
    CNA

 .../admin-guide/kernel-parameters.txt         |  18 +
 arch/arm/include/asm/mcs_spinlock.h           |   6 +-
 arch/x86/Kconfig                              |  20 +
 arch/x86/include/asm/qspinlock.h              |   4 +
 arch/x86/kernel/alternative.c                 |   4 +
 include/asm-generic/mcs_spinlock.h            |   4 +-
 kernel/locking/mcs_spinlock.h                 |  20 +-
 kernel/locking/qspinlock.c                    |  82 +++-
 kernel/locking/qspinlock_cna.h                | 399 ++++++++++++++++++
 kernel/locking/qspinlock_paravirt.h           |   2 +-
 10 files changed, 536 insertions(+), 23 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

-- 
2.21.0 (Apple Git-122.2)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v9 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic
  2020-01-15  3:59 [PATCH v9 0/5] Add NUMA-awareness to qspinlock Alex Kogan
@ 2020-01-15  3:59 ` Alex Kogan
  2020-01-15  3:59 ` [PATCH v9 2/5] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 39+ messages in thread
From: Alex Kogan @ 2020-01-15  3:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, steven.sistare, daniel.m.jordan

The mcs unlock macro (arch_mcs_pass_lock) should accept the value to be
stored into the lock argument as another argument. This allows using the
same macro in cases where the value to be stored when passing the lock is
different from 1.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 arch/arm/include/asm/mcs_spinlock.h |  6 +++---
 include/asm-generic/mcs_spinlock.h  |  4 ++--
 kernel/locking/mcs_spinlock.h       | 18 +++++++++---------
 kernel/locking/qspinlock.c          |  4 ++--
 kernel/locking/qspinlock_paravirt.h |  2 +-
 5 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf4d06f..693fe6ce3c43 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -6,7 +6,7 @@
 #include <asm/spinlock.h>
 
 /* MCS spin-locking. */
-#define arch_mcs_spin_lock_contended(lock)				\
+#define arch_mcs_spin_lock(lock)				\
 do {									\
 	/* Ensure prior stores are observed before we enter wfe. */	\
 	smp_mb();							\
@@ -14,9 +14,9 @@ do {									\
 		wfe();							\
 } while (0)								\
 
-#define arch_mcs_spin_unlock_contended(lock)				\
+#define arch_mcs_pass_lock(lock, val)					\
 do {									\
-	smp_store_release(lock, 1);					\
+	smp_store_release((lock), (val));				\
 	dsb_sev();							\
 } while (0)
 
diff --git a/include/asm-generic/mcs_spinlock.h b/include/asm-generic/mcs_spinlock.h
index 10cd4ffc6ba2..868da43dba7c 100644
--- a/include/asm-generic/mcs_spinlock.h
+++ b/include/asm-generic/mcs_spinlock.h
@@ -4,8 +4,8 @@
 /*
  * Architectures can define their own:
  *
- *   arch_mcs_spin_lock_contended(l)
- *   arch_mcs_spin_unlock_contended(l)
+ *   arch_mcs_spin_lock(l)
+ *   arch_mcs_pass_lock(l, val)
  *
  * See kernel/locking/mcs_spinlock.c.
  */
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153b4d3c..52d06ec6f525 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -21,7 +21,7 @@ struct mcs_spinlock {
 	int count;  /* nesting count, see qspinlock.c */
 };
 
-#ifndef arch_mcs_spin_lock_contended
+#ifndef arch_mcs_spin_lock
 /*
  * Using smp_cond_load_acquire() provides the acquire semantics
  * required so that subsequent operations happen after the
@@ -29,20 +29,20 @@ struct mcs_spinlock {
  * ARM64 would like to do spin-waiting instead of purely
  * spinning, and smp_cond_load_acquire() provides that behavior.
  */
-#define arch_mcs_spin_lock_contended(l)					\
-do {									\
-	smp_cond_load_acquire(l, VAL);					\
+#define arch_mcs_spin_lock(l)					\
+do {								\
+	smp_cond_load_acquire(l, VAL);				\
 } while (0)
 #endif
 
-#ifndef arch_mcs_spin_unlock_contended
+#ifndef arch_mcs_spin_unlock
 /*
  * smp_store_release() provides a memory barrier to ensure all
  * operations in the critical section has been completed before
  * unlocking.
  */
-#define arch_mcs_spin_unlock_contended(l)				\
-	smp_store_release((l), 1)
+#define arch_mcs_pass_lock(l, val)				\
+	smp_store_release((l), (val))
 #endif
 
 /*
@@ -91,7 +91,7 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	WRITE_ONCE(prev->next, node);
 
 	/* Wait until the lock holder passes the lock down. */
-	arch_mcs_spin_lock_contended(&node->locked);
+	arch_mcs_spin_lock(&node->locked);
 }
 
 /*
@@ -115,7 +115,7 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	}
 
 	/* Pass lock to next waiter. */
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_pass_lock(&next->locked, 1);
 }
 
 #endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 2473f10c6956..804c0fbd6328 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -470,7 +470,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
-		arch_mcs_spin_lock_contended(&node->locked);
+		arch_mcs_spin_lock(&node->locked);
 
 		/*
 		 * While waiting for the MCS lock, the next pointer may have
@@ -549,7 +549,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_pass_lock(&next->locked, 1);
 	pv_kick_node(lock, next);
 
 release:
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e84d21aa0722..e98079414671 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -368,7 +368,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	 *
 	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
 	 *
-	 * The write to next->locked in arch_mcs_spin_unlock_contended()
+	 * The write to next->locked in arch_mcs_pass_lock()
 	 * must be ordered before the read of pn->state in the cmpxchg()
 	 * below for the code to work correctly. To guarantee full ordering
 	 * irrespective of the success or failure of the cmpxchg(),
-- 
2.21.0 (Apple Git-122.2)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v9 2/5] locking/qspinlock: Refactor the qspinlock slow path
  2020-01-15  3:59 [PATCH v9 0/5] Add NUMA-awareness to qspinlock Alex Kogan
  2020-01-15  3:59 ` [PATCH v9 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
@ 2020-01-15  3:59 ` Alex Kogan
  2020-01-15  3:59 ` [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 39+ messages in thread
From: Alex Kogan @ 2020-01-15  3:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, steven.sistare, daniel.m.jordan

Move some of the code manipulating the spin lock into separate functions.
This would allow easier integration of alternative ways to manipulate
that lock.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 804c0fbd6328..c06d1e8075d9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -288,6 +288,34 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
 #endif
 
+/*
+ * __try_clear_tail - try to clear tail by setting the lock value to
+ * _Q_LOCKED_VAL.
+ * @lock: Pointer to the queued spinlock structure
+ * @val: Current value of the lock
+ * @node: Pointer to the MCS node of the lock holder
+ */
+static __always_inline bool __try_clear_tail(struct qspinlock *lock,
+						   u32 val,
+						   struct mcs_spinlock *node)
+{
+	return atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL);
+}
+
+/*
+ * __mcs_pass_lock - pass the MCS lock to the next waiter
+ * @node: Pointer to the MCS node of the lock holder
+ * @next: Pointer to the MCS node of the first waiter in the MCS queue
+ */
+static __always_inline void __mcs_pass_lock(struct mcs_spinlock *node,
+					    struct mcs_spinlock *next)
+{
+	arch_mcs_pass_lock(&next->locked, 1);
+}
+
+#define try_clear_tail	__try_clear_tail
+#define mcs_pass_lock		__mcs_pass_lock
+
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
 /**
@@ -532,7 +560,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *       PENDING will make the uncontended transition fail.
 	 */
 	if ((val & _Q_TAIL_MASK) == tail) {
-		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+		if (try_clear_tail(lock, val, node))
 			goto release; /* No contention */
 	}
 
@@ -549,7 +577,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_pass_lock(&next->locked, 1);
+	mcs_pass_lock(node, next);
 	pv_kick_node(lock, next);
 
 release:
@@ -574,6 +602,12 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #undef pv_kick_node
 #undef pv_wait_head_or_lock
 
+#undef try_clear_tail
+#define try_clear_tail		__try_clear_tail
+
+#undef mcs_pass_lock
+#define mcs_pass_lock			__mcs_pass_lock
+
 #undef  queued_spin_lock_slowpath
 #define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
 
-- 
2.21.0 (Apple Git-122.2)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-15  3:59 [PATCH v9 0/5] Add NUMA-awareness to qspinlock Alex Kogan
  2020-01-15  3:59 ` [PATCH v9 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
  2020-01-15  3:59 ` [PATCH v9 2/5] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
@ 2020-01-15  3:59 ` Alex Kogan
  2020-01-23  9:26   ` Peter Zijlstra
  2020-01-23 14:15   ` Waiman Long
  2020-01-15  3:59 ` [PATCH v9 4/5] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 39+ messages in thread
From: Alex Kogan @ 2020-01-15  3:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, steven.sistare, daniel.m.jordan

In CNA, spinning threads are organized in two queues, a main queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. After acquiring the
MCS lock and before acquiring the spinlock, the lock holder scans the
main queue looking for a thread running on the same node (pre-scan). If
found (call it thread T), all threads in the main queue between the
current lock holder and T are moved to the end of the secondary queue.
If such T is not found, we make another scan of the main queue when
unlocking the MCS lock (post-scan), starting at the position where
pre-scan stopped. If both scans fail to find such T, the MCS lock is
passed to the first thread in the secondary queue. If the secondary queue
is empty, the lock is passed to the next thread in the main queue.
For more details, see https://arxiv.org/abs/1810.05600.

Note that this variant of CNA may introduce starvation by continuously
passing the lock to threads running on the same node. This issue
will be addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
boot time only if we run on a multi-node machine in native environment and
the new config is enabled. (For the time being, the patching requires
CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
resolved once static_call() is available.) This default behavior can be
overridden with the new kernel boot command-line option
"numa_spinlock=on/off" (default is "auto").

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 .../admin-guide/kernel-parameters.txt         |  10 +
 arch/x86/Kconfig                              |  20 ++
 arch/x86/include/asm/qspinlock.h              |   4 +
 arch/x86/kernel/alternative.c                 |   4 +
 kernel/locking/mcs_spinlock.h                 |   2 +-
 kernel/locking/qspinlock.c                    |  39 ++-
 kernel/locking/qspinlock_cna.h                | 318 ++++++++++++++++++
 7 files changed, 392 insertions(+), 5 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ade4e6ec23e0..b68cb80e477f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3190,6 +3190,16 @@
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
+	numa_spinlock=	[NUMA, PV_OPS] Select the NUMA-aware variant
+			of spinlock. The options are:
+			auto - Enable this variant if running on a multi-node
+			machine in native environment.
+			on  - Unconditionally enable this variant.
+			off - Unconditionally disable this variant.
+
+			Not specifying this option is equivalent to
+			numa_spinlock=auto.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5e8949953660..26dd29b2d515 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1562,6 +1562,26 @@ config NUMA
 
 	  Otherwise, you should say N.
 
+config NUMA_AWARE_SPINLOCKS
+	bool "Numa-aware spinlocks"
+	depends on NUMA
+	depends on QUEUED_SPINLOCKS
+	depends on 64BIT
+	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
+	# This is awkward, but hopefully would be resolved once static_call()
+	# is available.
+	depends on PARAVIRT_SPINLOCKS
+	default y
+	help
+	  Introduce NUMA (Non Uniform Memory Access) awareness into
+	  the slow path of spinlocks.
+
+	  In this variant of qspinlock, the kernel will try to keep the lock
+	  on the same node, thus reducing the number of remote cache misses,
+	  while trading some of the short term fairness for better performance.
+
+	  Say N if you want absolute first come first serve fairness.
+
 config AMD_NUMA
 	def_bool y
 	prompt "Old style AMD Opteron NUMA detection"
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 444d6fd9a6d8..fe4884b6f1b4 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -27,6 +27,10 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
 	return val;
 }
 
+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+extern void cna_configure_spin_lock_slowpath(void);
+#endif
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 9ec463fe96f2..5a59d06a9d21 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -738,6 +738,10 @@ void __init alternative_instructions(void)
 	}
 #endif
 
+#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+	cna_configure_spin_lock_slowpath();
+#endif
+
 	apply_paravirt(__parainstructions, __parainstructions_end);
 
 	restart_nmi();
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 52d06ec6f525..e40b9538b79f 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,7 +17,7 @@
 
 struct mcs_spinlock {
 	struct mcs_spinlock *next;
-	int locked; /* 1 if lock acquired */
+	unsigned int locked; /* 1 if lock acquired */
 	int count;  /* nesting count, see qspinlock.c */
 };
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index c06d1e8075d9..609980a53841 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -11,7 +11,7 @@
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
 
 #include <linux/smp.h>
 #include <linux/bug.h>
@@ -70,7 +70,8 @@
 /*
  * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
  * size and four of them will fit nicely in one 64-byte cacheline. For
- * pvqspinlock, however, we need more space for extra data. To accommodate
+ * pvqspinlock, however, we need more space for extra data. The same also
+ * applies for the NUMA-aware variant of spinlocks (CNA). To accommodate
  * that, we insert two more long words to pad it up to 32 bytes. IOW, only
  * two of them can fit in a cacheline in this case. That is OK as it is rare
  * to have more than 2 levels of slowpath nesting in actual use. We don't
@@ -79,7 +80,7 @@
  */
 struct qnode {
 	struct mcs_spinlock mcs;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 	long reserved[2];
 #endif
 };
@@ -103,6 +104,8 @@ struct qnode {
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
  *
  * PV doubles the storage and uses the second cacheline for PV state.
+ * CNA also doubles the storage and uses the second cacheline for
+ * CNA-specific state.
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
@@ -316,7 +319,7 @@ static __always_inline void __mcs_pass_lock(struct mcs_spinlock *node,
 #define try_clear_tail	__try_clear_tail
 #define mcs_pass_lock		__mcs_pass_lock
 
-#endif /* _GEN_PV_LOCK_SLOWPATH */
+#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
 
 /**
  * queued_spin_lock_slowpath - acquire the queued spinlock
@@ -588,6 +591,34 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
+/*
+ * Generate the code for NUMA-aware spinlocks
+ */
+#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+#define _GEN_CNA_LOCK_SLOWPATH
+
+#undef pv_wait_head_or_lock
+#define pv_wait_head_or_lock		cna_pre_scan
+
+#undef try_clear_tail
+#define try_clear_tail			cna_try_change_tail
+
+#undef mcs_pass_lock
+#define mcs_pass_lock			cna_pass_lock
+
+#undef  queued_spin_lock_slowpath
+/*
+ * defer defining queued_spin_lock_slowpath until after the include to
+ * avoid a name clash with the identically named field in pv_ops.lock
+ * (see cna_configure_spin_lock_slowpath())
+ */
+#include "qspinlock_cna.h"
+#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
+
+#include "qspinlock.c"
+
+#endif
+
 /*
  * Generate the paravirt code for queued_spin_unlock_slowpath().
  */
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
new file mode 100644
index 000000000000..8000231f3d51
--- /dev/null
+++ b/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,318 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+#include <linux/topology.h>
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a main queue for
+ * threads running on the same NUMA node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. Schematically, it
+ * looks like this:
+ *
+ *    cna_node
+ *   +----------+    +--------+        +--------+
+ *   |mcs:next  | -> |mcs:next| -> ... |mcs:next| -> NULL      [Main queue]
+ *   |mcs:locked| -+ +--------+        +--------+
+ *   +----------+  |
+ *                 +----------------------+
+ *                                        \/
+ *                 +--------+         +--------+
+ *                 |mcs:next| -> ...  |mcs:next|          [Secondary queue]
+ *                 +--------+         +--------+
+ *                     ^                    |
+ *                     +--------------------+
+ *
+ * N.B. locked = 1 if secondary queue is absent. Othewrise, it contains the
+ * encoded pointer to the tail of the secondary queue, which is organized as a
+ * circular list.
+ *
+ * After acquiring the MCS lock and before acquiring the spinlock, the lock
+ * holder scans the main queue looking for a thread running on the same node
+ * (pre-scan). If found (call it thread T), all threads in the main queue
+ * between the current lock holder and T are moved to the end of the secondary
+ * queue.  If such T is not found, we make another scan of the main queue when
+ * unlocking the MCS lock (post-scan), starting at the node where pre-scan
+ * stopped. If both scans fail to find such T, the MCS lock is passed to the
+ * first thread in the secondary queue. If the secondary queue is empty, the
+ * lock is passed to the next thread in the main queue.
+ *
+ * For more details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <alex.kogan@oracle.com>
+ *          Dave Dice <dave.dice@oracle.com>
+ */
+
+struct cna_node {
+	struct mcs_spinlock	mcs;
+	int			numa_node;
+	u32			encoded_tail;
+	u32			pre_scan_result; /* encoded tail or enum val */
+};
+
+enum {
+	LOCAL_WAITER_FOUND = 2,	/* 0 and 1 are reserved for @locked */
+	MIN_ENCODED_TAIL
+};
+
+static void __init cna_init_nodes_per_cpu(unsigned int cpu)
+{
+	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
+	int numa_node = cpu_to_node(cpu);
+	int i;
+
+	for (i = 0; i < MAX_NODES; i++) {
+		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
+
+		cn->numa_node = numa_node;
+		cn->encoded_tail = encode_tail(cpu, i);
+		/*
+		 * make sure @encoded_tail is not confused with other valid
+		 * values for @locked (0 or 1) or with designated values for
+		 * @pre_scan_result
+		 */
+		WARN_ON(cn->encoded_tail < MIN_ENCODED_TAIL);
+	}
+}
+
+static int __init cna_init_nodes(void)
+{
+	unsigned int cpu;
+
+	/*
+	 * this will break on 32bit architectures, so we restrict
+	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
+	 */
+	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+	/* we store an ecoded tail word in the node's @locked field */
+	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
+
+	for_each_possible_cpu(cpu)
+		cna_init_nodes_per_cpu(cpu);
+
+	return 0;
+}
+early_initcall(cna_init_nodes);
+
+/* this function is called only when the primary queue is empty */
+static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
+				       struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *head_2nd, *tail_2nd;
+	u32 new;
+
+	/* If the secondary queue is empty, do what MCS does. */
+	if (node->locked <= 1)
+		return __try_clear_tail(lock, val, node);
+
+	/*
+	 * Try to update the tail value to the last node in the secondary queue.
+	 * If successful, pass the lock to the first thread in the secondary
+	 * queue. Doing those two actions effectively moves all nodes from the
+	 * secondary queue into the main one.
+	 */
+	tail_2nd = decode_tail(node->locked);
+	head_2nd = tail_2nd->next;
+	new = ((struct cna_node *)tail_2nd)->encoded_tail + _Q_LOCKED_VAL;
+
+	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
+		/*
+		 * Try to reset @next in tail_2nd to NULL, but no need to check
+		 * the result - if failed, a new successor has updated it.
+		 */
+		cmpxchg_relaxed(&tail_2nd->next, head_2nd, NULL);
+		arch_mcs_pass_lock(&head_2nd->locked, 1);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * cna_splice_tail -- splice nodes in the main queue between [first, last]
+ * onto the secondary queue.
+ */
+static void cna_splice_tail(struct mcs_spinlock *node,
+			    struct mcs_spinlock *first,
+			    struct mcs_spinlock *last)
+{
+	/* remove [first,last] */
+	node->next = last->next;
+
+	/* stick [first,last] on the secondary queue tail */
+	if (node->locked <= 1) { /* if secondary queue is empty */
+		/* create secondary queue */
+		last->next = first;
+	} else {
+		/* add to the tail of the secondary queue */
+		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
+		struct mcs_spinlock *head_2nd = tail_2nd->next;
+
+		tail_2nd->next = first;
+		last->next = head_2nd;
+	}
+
+	node->locked = ((struct cna_node *)last)->encoded_tail;
+}
+
+/*
+ * cna_scan_main_queue - scan the main waiting queue looking for the first
+ * thread running on the same NUMA node as the lock holder. If found (call it
+ * thread T), move all threads in the main queue between the lock holder and
+ * T to the end of the secondary queue and return LOCAL_WAITER_FOUND;
+ * otherwise, return the encoded pointer of the last scanned node in the
+ * primary queue (so a subsequent scan can be resumed from that node).
+ *
+ * Schematically, this may look like the following (nn stands for numa_node and
+ * et stands for encoded_tail).
+ *
+ *   when cna_scan_main_queue() is called (the secondary queue is empty):
+ *
+ *  A+------------+   B+--------+   C+--------+   T+--------+
+ *   |mcs:next    | -> |mcs:next| -> |mcs:next| -> |mcs:next| -> NULL
+ *   |mcs:locked=1|    |cna:nn=0|    |cna:nn=2|    |cna:nn=1|
+ *   |cna:nn=1    |    +--------+    +--------+    +--------+
+ *   +----------- +
+ *
+ *   when cna_scan_main_queue() returns (the secondary queue contains B and C):
+ *
+ *  A+----------------+    T+--------+
+ *   |mcs:next        | ->  |mcs:next| -> NULL
+ *   |mcs:locked=C.et | -+  |cna:nn=1|
+ *   |cna:nn=1        |  |  +--------+
+ *   +--------------- +  +-----+
+ *                             \/
+ *          B+--------+   C+--------+
+ *           |mcs:next| -> |mcs:next| -+
+ *           |cna:nn=0|    |cna:nn=2|  |
+ *           +--------+    +--------+  |
+ *               ^                     |
+ *               +---------------------+
+ *
+ * The worst case complexity of the scan is O(n), where n is the number
+ * of current waiters. However, the amortized complexity is close to O(1),
+ * as the immediate successor is likely to be running on the same node once
+ * threads from other nodes are moved to the secondary queue.
+ *
+ * @node      : Pointer to the MCS node of the lock holder
+ * @pred_start: Pointer to the MCS node of the waiter whose successor should be
+ *              the first node in the scan
+ * Return     : LOCAL_WAITER_FOUND or encoded tail of the last scanned waiter
+ */
+static u32 cna_scan_main_queue(struct mcs_spinlock *node,
+			       struct mcs_spinlock *pred_start)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+	struct cna_node *cni = (struct cna_node *)READ_ONCE(pred_start->next);
+	struct cna_node *last;
+	int my_numa_node = cn->numa_node;
+
+	/* find any next waiter on 'our' NUMA node */
+	for (last = cn;
+	     cni && cni->numa_node != my_numa_node;
+	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
+		;
+
+	/* if found, splice any skipped waiters onto the secondary queue */
+	if (cni) {
+		if (last != cn)	/* did we skip any waiters? */
+			cna_splice_tail(node, node->next,
+					(struct mcs_spinlock *)last);
+		return LOCAL_WAITER_FOUND;
+	}
+
+	return last->encoded_tail;
+}
+
+__always_inline u32 cna_pre_scan(struct qspinlock *lock,
+				  struct mcs_spinlock *node)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+
+	cn->pre_scan_result = cna_scan_main_queue(node, node);
+
+	return 0;
+}
+
+static inline void cna_pass_lock(struct mcs_spinlock *node,
+				 struct mcs_spinlock *next)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+	struct mcs_spinlock *next_holder = next, *tail_2nd;
+	u32 val = 1;
+
+	u32 scan = cn->pre_scan_result;
+
+	/*
+	 * check if a successor from the same numa node has not been found in
+	 * pre-scan, and if so, try to find it in post-scan starting from the
+	 * node where pre-scan stopped (stored in @pre_scan_result)
+	 */
+	if (scan >= MIN_ENCODED_TAIL)
+		scan = cna_scan_main_queue(node, decode_tail(scan));
+
+	if (scan == LOCAL_WAITER_FOUND) {
+		next_holder = node->next;
+		/*
+		 * we unlock successor by passing a non-zero value,
+		 * so set @val to 1 iff @locked is 0, which will happen
+		 * if we acquired the MCS lock when its queue was empty
+		 */
+		val = node->locked ? node->locked : 1;
+	} else if (node->locked > 1) {	  /* if secondary queue is not empty */
+		/* next holder will be the first node in the secondary queue */
+		tail_2nd = decode_tail(node->locked);
+		/* @tail_2nd->next points to the head of the secondary queue */
+		next_holder = tail_2nd->next;
+		/* splice the secondary queue onto the head of the main queue */
+		tail_2nd->next = next;
+	}
+
+	arch_mcs_pass_lock(&next_holder->locked, val);
+}
+
+/*
+ * Constant (boot-param configurable) flag selecting the NUMA-aware variant
+ * of spinlock.  Possible values: -1 (off) / 0 (auto, default) / 1 (on).
+ */
+static int numa_spinlock_flag;
+
+static int __init numa_spinlock_setup(char *str)
+{
+	if (!strcmp(str, "auto")) {
+		numa_spinlock_flag = 0;
+		return 1;
+	} else if (!strcmp(str, "on")) {
+		numa_spinlock_flag = 1;
+		return 1;
+	} else if (!strcmp(str, "off")) {
+		numa_spinlock_flag = -1;
+		return 1;
+	}
+
+	return 0;
+}
+__setup("numa_spinlock=", numa_spinlock_setup);
+
+void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/*
+ * Switch to the NUMA-friendly slow path for spinlocks when we have
+ * multiple NUMA nodes in native environment, unless the user has
+ * overridden this default behavior by setting the numa_spinlock flag.
+ */
+void __init cna_configure_spin_lock_slowpath(void)
+{
+	if ((numa_spinlock_flag == 1) ||
+	    (numa_spinlock_flag == 0 && nr_node_ids > 1 &&
+		    pv_ops.lock.queued_spin_lock_slowpath ==
+			native_queued_spin_lock_slowpath)) {
+		pv_ops.lock.queued_spin_lock_slowpath =
+		    __cna_queued_spin_lock_slowpath;
+
+		pr_info("Enabling CNA spinlock\n");
+	}
+}
-- 
2.21.0 (Apple Git-122.2)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v9 4/5] locking/qspinlock: Introduce starvation avoidance into CNA
  2020-01-15  3:59 [PATCH v9 0/5] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (2 preceding siblings ...)
  2020-01-15  3:59 ` [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
@ 2020-01-15  3:59 ` Alex Kogan
  2020-01-23 19:55   ` Waiman Long
  2020-01-15  3:59 ` [PATCH v9 5/5] locking/qspinlock: Introduce the shuffle reduction optimization " Alex Kogan
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Alex Kogan @ 2020-01-15  3:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, steven.sistare, daniel.m.jordan

Keep track of the number of intra-node lock handoffs, and force
inter-node handoff once this number reaches a preset threshold.
The default value for the threshold can be overridden with
the new kernel boot command-line option "numa_spinlock_threshold".

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 .../admin-guide/kernel-parameters.txt         |  8 ++++
 kernel/locking/qspinlock.c                    |  3 ++
 kernel/locking/qspinlock_cna.h                | 41 ++++++++++++++++++-
 3 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b68cb80e477f..30d79819a3b0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3200,6 +3200,14 @@
 			Not specifying this option is equivalent to
 			numa_spinlock=auto.
 
+	numa_spinlock_threshold=	[NUMA, PV_OPS]
+			Set the threshold for the number of intra-node
+			lock hand-offs before the NUMA-aware spinlock
+			is forced to be passed to a thread on another NUMA node.
+			Valid values are in the [0..31] range. Smaller values
+			result in a more fair, but less performant spinlock, and
+			vice versa. The default value is 16.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 609980a53841..e382d8946ccc 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -597,6 +597,9 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 #define _GEN_CNA_LOCK_SLOWPATH
 
+#undef pv_init_node
+#define pv_init_node			cna_init_node
+
 #undef pv_wait_head_or_lock
 #define pv_wait_head_or_lock		cna_pre_scan
 
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 8000231f3d51..a2b65f87e6f8 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -51,13 +51,25 @@ struct cna_node {
 	int			numa_node;
 	u32			encoded_tail;
 	u32			pre_scan_result; /* encoded tail or enum val */
+	u32			intra_count;
 };
 
 enum {
 	LOCAL_WAITER_FOUND = 2,	/* 0 and 1 are reserved for @locked */
+	FLUSH_SECONDARY_QUEUE = 3,
 	MIN_ENCODED_TAIL
 };
 
+/*
+ * Controls the threshold for the number of intra-node lock hand-offs before
+ * the NUMA-aware variant of spinlock is forced to be passed to a thread on
+ * another NUMA node. By default, the chosen value provides reasonable
+ * long-term fairness without sacrificing performance compared to a lock
+ * that does not have any fairness guarantees. The default setting can
+ * be changed with the "numa_spinlock_threshold" boot option.
+ */
+unsigned int intra_node_handoff_threshold __ro_after_init = 1 << 16;
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -97,6 +109,11 @@ static int __init cna_init_nodes(void)
 }
 early_initcall(cna_init_nodes);
 
+static __always_inline void cna_init_node(struct mcs_spinlock *node)
+{
+	((struct cna_node *)node)->intra_count = 0;
+}
+
 /* this function is called only when the primary queue is empty */
 static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
 				       struct mcs_spinlock *node)
@@ -232,7 +249,9 @@ __always_inline u32 cna_pre_scan(struct qspinlock *lock,
 {
 	struct cna_node *cn = (struct cna_node *)node;
 
-	cn->pre_scan_result = cna_scan_main_queue(node, node);
+	cn->pre_scan_result =
+		cn->intra_count == intra_node_handoff_threshold ?
+			FLUSH_SECONDARY_QUEUE : cna_scan_main_queue(node, node);
 
 	return 0;
 }
@@ -262,6 +281,9 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 		 * if we acquired the MCS lock when its queue was empty
 		 */
 		val = node->locked ? node->locked : 1;
+		/* inc @intra_count if the secondary queue is not empty */
+		((struct cna_node *)next_holder)->intra_count =
+			cn->intra_count + (node->locked > 1);
 	} else if (node->locked > 1) {	  /* if secondary queue is not empty */
 		/* next holder will be the first node in the secondary queue */
 		tail_2nd = decode_tail(node->locked);
@@ -316,3 +338,20 @@ void __init cna_configure_spin_lock_slowpath(void)
 		pr_info("Enabling CNA spinlock\n");
 	}
 }
+
+static int __init numa_spinlock_threshold_setup(char *str)
+{
+	int new_threshold_param;
+
+	if (get_option(&str, &new_threshold_param)) {
+		/* valid value is between 0 and 31 */
+		if (new_threshold_param < 0 || new_threshold_param > 31)
+			return 0;
+
+		intra_node_handoff_threshold = 1 << new_threshold_param;
+		return 1;
+	}
+
+	return 0;
+}
+__setup("numa_spinlock_threshold=", numa_spinlock_threshold_setup);
-- 
2.21.0 (Apple Git-122.2)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v9 5/5] locking/qspinlock: Introduce the shuffle reduction optimization into CNA
  2020-01-15  3:59 [PATCH v9 0/5] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (3 preceding siblings ...)
  2020-01-15  3:59 ` [PATCH v9 4/5] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
@ 2020-01-15  3:59 ` " Alex Kogan
  2020-01-22 11:45 ` [PATCH v9 0/5] Add NUMA-awareness to qspinlock Lihao Liang
  2020-01-24 22:24 ` Paul E. McKenney
  6 siblings, 0 replies; 39+ messages in thread
From: Alex Kogan @ 2020-01-15  3:59 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, steven.sistare, daniel.m.jordan

This performance optimization reduces the probability threads will be
shuffled between the main and secondary queues when the secondary queue
is empty. It is helpful when the lock is only lightly contended.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_cna.h | 46 ++++++++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index a2b65f87e6f8..f0b0c15dcf9d 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,6 +4,7 @@
 #endif
 
 #include <linux/topology.h>
+#include <linux/random.h>
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -57,6 +58,7 @@ struct cna_node {
 enum {
 	LOCAL_WAITER_FOUND = 2,	/* 0 and 1 are reserved for @locked */
 	FLUSH_SECONDARY_QUEUE = 3,
+	PASS_LOCK_IMMEDIATELY = 4,
 	MIN_ENCODED_TAIL
 };
 
@@ -70,6 +72,34 @@ enum {
  */
 unsigned int intra_node_handoff_threshold __ro_after_init = 1 << 16;
 
+/*
+ * Controls the probability for enabling the scan of the main queue when
+ * the secondary queue is empty. The chosen value reduces the amount of
+ * unnecessary shuffling of threads between the two waiting queues when
+ * the contention is low, while responding fast enough and enabling
+ * the shuffling when the contention is high.
+ */
+#define SHUFFLE_REDUCTION_PROB_ARG  (7)
+
+/* Per-CPU pseudo-random number seed */
+static DEFINE_PER_CPU(u32, seed);
+
+/*
+ * Return false with probability 1 / 2^@num_bits.
+ * Intuitively, the larger @num_bits the less likely false is to be returned.
+ * @num_bits must be a number between 0 and 31.
+ */
+static bool probably(unsigned int num_bits)
+{
+	u32 s;
+
+	s = this_cpu_read(seed);
+	s = next_pseudo_random32(s);
+	this_cpu_write(seed, s);
+
+	return s & ((1 << num_bits) - 1);
+}
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -250,8 +280,11 @@ __always_inline u32 cna_pre_scan(struct qspinlock *lock,
 	struct cna_node *cn = (struct cna_node *)node;
 
 	cn->pre_scan_result =
-		cn->intra_count == intra_node_handoff_threshold ?
-			FLUSH_SECONDARY_QUEUE : cna_scan_main_queue(node, node);
+		(node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) ?
+			PASS_LOCK_IMMEDIATELY :
+			cn->intra_count == intra_node_handoff_threshold ?
+				FLUSH_SECONDARY_QUEUE :
+				cna_scan_main_queue(node, node);
 
 	return 0;
 }
@@ -265,6 +298,14 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 
 	u32 scan = cn->pre_scan_result;
 
+	/*
+	 * perf. optimization - check if we can skip the logic of triaging
+	 * through other possible values in @scan (helps under light lock
+	 * contention)
+	 */
+	if (scan == PASS_LOCK_IMMEDIATELY)
+		goto pass_lock;
+
 	/*
 	 * check if a successor from the same numa node has not been found in
 	 * pre-scan, and if so, try to find it in post-scan starting from the
@@ -293,6 +334,7 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 		tail_2nd->next = next;
 	}
 
+pass_lock:
 	arch_mcs_pass_lock(&next_holder->locked, val);
 }
 
-- 
2.21.0 (Apple Git-122.2)


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-15  3:59 [PATCH v9 0/5] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (4 preceding siblings ...)
  2020-01-15  3:59 ` [PATCH v9 5/5] locking/qspinlock: Introduce the shuffle reduction optimization " Alex Kogan
@ 2020-01-22 11:45 ` Lihao Liang
  2020-01-22 17:24   ` Waiman Long
  2020-01-22 19:29   ` Alex Kogan
  2020-01-24 22:24 ` Paul E. McKenney
  6 siblings, 2 replies; 39+ messages in thread
From: Lihao Liang @ 2020-01-22 11:45 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, peterz, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, mingo, bp, hpa,
	longman, tglx, daniel.m.jordan, Will Deacon, linux-arm-kernel

Hi Alex,

On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@oracle.com> wrote:
>
> Summary
> -------
>
> Lock throughput can be increased by handing a lock to a waiter on the
> same NUMA node as the lock holder, provided care is taken to avoid
> starvation of waiters on other NUMA nodes. This patch introduces CNA
> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
>

Thanks for your patches. The experimental results look promising!

I understand that the new CNA qspinlock uses randomization to achieve
long-term fairness, and provides the numa_spinlock_threshold parameter
for users to tune. As Linux runs extremely diverse workloads, it is not
clear how randomization affects its fairness, and how users with
different requirements are supposed to tune this parameter.

To this end, Will and I consider it beneficial to be able to answer the
following question:

With different values of numa_spinlock_threshold and
SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
sockets have to wait to acquire the lock? This is particularly relevant
in high contention situations when new threads keep arriving on the same
socket as the lock holder.

In this email, I try to provide some formal analysis to address this
question. Let's assume the probability for the lock to stay on the
same socket is *at least* p, which corresponds to the probability for
the function probably(unsigned int num_bits) in the patch to return *false*,
where SHUFFLE_REDUCTION_PROB_ARG is passed as the value of num_bits to the
function.

I noticed that the default value of p in the patch is 1/2^7 = 0.01, which is
somewhat counter-intuitive to me. If we switch sockets 99 times out of 100,
then fairness should be obvious. What I expected is a (much) higher value of
p, which would likely result in better performance, while having some degree
of fairness guarantee. Have you run some experiments by setting a lower
SHUFFLE_REDUCTION_PROB_ARG instead of the default value 7? It would be very
helpful to know the performance numbers.

Now let's do some analysis:

1. What is the probability P for the first thread on a different socket to
acquire the lock after *at most* N consecutive local lock handovers?

Note: N corresponds to the variable intra_node_handoff_threshold in the
patch, which is set to value 1 << numa_spinlock_threshold. Default value
is 1 << 16 = 64K.

Assuming mutual independence [1], we have P is equal to 1 - p^N, where p^N is
the probability of N consecutive threads running on the socket where the lock
was most recently acquired.

If p is 0.99, the probabilities of switching to a different socket after
N local lock handovers are as follows:

63.4% (N = 100)
86.6% (N = 200)
99.3% (N = 500)
99.996% (N = 1000)
99.99999999999933% (N =  64K)

2. We can ask the same question as above for the k-th thread on a different
socket from the lock holder. That is, what is the probability P for the k-th
thread on a different socket to acquire the lock after *at most* N
consecutive local lock handovers, assuming all these k threads in the queue
are running on different sockets (the worst case scenario). The analysis is
as follows (the case when k = 1 reduces to Question 1 above):

The total probability P is the sum of Pi for i = 0, 1, ..., N, where Pi is
the probability of having i *total* local lock handovers before the k-th
thread on a different socket can acquire the lock.

Pi can be calculated using formula Pi =  B_i_k * (p^i) * (1 - p)^k, where

-- B_i_k is the number of ways to put i balls into k buckets, representing
all possible ways the i local handovers occurred in k different sockets.
B_i_k is a multiset number and equal to (i + k - 1)! / (i! * (k-1)!) [2]

-- p^i is the probability of i local lock handovers

-- (1 - p)^k is the probability of k socket switchings

I've written a simple Python script to calculate the value of P.
Let's look at some concrete examples and numbers.

When p = 0.99, k = 3 (e.g. a 4-socket system), P is equal to:

8.5%   (N = 100)
33.2% (N = 200)
87.9% (N = 500)
99.7% (N = 1000)
99.99999999999937% (N = 64K)

When p = 0.99, k = 7 (e.g. an 8-socket system), the values of P are:

0.01% (N = 100)
0.52% (N = 200)
24.7% (N = 500)
87.5% (N = 1000)
99.3% (N = 1500)
99.99999999999871% (N = 64K)

I think this mathematical analysis would help users better understand the
fairness property of the CNA qspinlock. One can use it to plot a graph with
different values of p and N to tune the qspinlock for different platforms
and workloads.

Based on the analysis above, it may be useful to have
SHUFFLE_REDUCTION_PROB_ARG as a tunable parameter as well. Setting
SHUFFLE_REDUCTION_PROB_ARG to a lower value results in a higher value of p,
which would likely increase the performance. Then we can set
intra_node_handoff_threshold to have a bounded degree of fairness.
For instance, a user may want P to be around 90% for N = 100 on a 8-core
system. So they can set p = 0.9 and intra_node_handoff_threshold = ~150,
based on our analysis that P =  91.9% for N = 100, and 99.99% for N = 200,
when k = 7.

I hope this helps and please let me know if you have any comments or
if you spot any mistakes in our analysis.

Best,
Lihao.

References:
[1] https://en.wikipedia.org/wiki/Independence_(probability_theory)#More_than_two_events
[2] Theorem 2, https://en.wikipedia.org/wiki/Stars_and_bars_(combinatorics)

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-22 11:45 ` [PATCH v9 0/5] Add NUMA-awareness to qspinlock Lihao Liang
@ 2020-01-22 17:24   ` Waiman Long
  2020-01-23 11:35     ` Will Deacon
  2020-01-22 19:29   ` Alex Kogan
  1 sibling, 1 reply; 39+ messages in thread
From: Waiman Long @ 2020-01-22 17:24 UTC (permalink / raw)
  To: Lihao Liang, Alex Kogan
  Cc: linux-arch, guohanjun, arnd, peterz, dave.dice, jglauber, x86,
	will.deacon, linux, linux-kernel, mingo, bp, hpa, steven.sistare,
	tglx, daniel.m.jordan, Will Deacon, linux-arm-kernel

On 1/22/20 6:45 AM, Lihao Liang wrote:
> Hi Alex,
>
> On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@oracle.com> wrote:
>> Summary
>> -------
>>
>> Lock throughput can be increased by handing a lock to a waiter on the
>> same NUMA node as the lock holder, provided care is taken to avoid
>> starvation of waiters on other NUMA nodes. This patch introduces CNA
>> (compact NUMA-aware lock) as the slow path for qspinlock. It is
>> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
>>
> Thanks for your patches. The experimental results look promising!
>
> I understand that the new CNA qspinlock uses randomization to achieve
> long-term fairness, and provides the numa_spinlock_threshold parameter
> for users to tune. As Linux runs extremely diverse workloads, it is not
> clear how randomization affects its fairness, and how users with
> different requirements are supposed to tune this parameter.
>
> To this end, Will and I consider it beneficial to be able to answer the
> following question:
>
> With different values of numa_spinlock_threshold and
> SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
> sockets have to wait to acquire the lock? This is particularly relevant
> in high contention situations when new threads keep arriving on the same
> socket as the lock holder.
>
> In this email, I try to provide some formal analysis to address this
> question. Let's assume the probability for the lock to stay on the
> same socket is *at least* p, which corresponds to the probability for
> the function probably(unsigned int num_bits) in the patch to return *false*,
> where SHUFFLE_REDUCTION_PROB_ARG is passed as the value of num_bits to the
> function.

That is not strictly true from my understanding of the code. The
probably() function does not come into play if a secondary queue is
present. Also calling cna_scan_main_queue() doesn't guarantee that a
waiter in the same node can be found. So the simple mathematical
analysis isn't that applicable in this case. One will have to do an
actual simulation to find out what the actual behavior will be.

The comment in the code states that:

/*
 * Controls the probability for enabling the scan of the main queue when
 * the secondary queue is empty. The chosen value reduces the amount of
 * unnecessary shuffling of threads between the two waiting queues when
 * the contention is low, while responding fast enough and enabling
 * the shuffling when the contention is high.
 */
#define SHUFFLE_REDUCTION_PROB_ARG  (7)

Cheers,
Longman




_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-22 11:45 ` [PATCH v9 0/5] Add NUMA-awareness to qspinlock Lihao Liang
  2020-01-22 17:24   ` Waiman Long
@ 2020-01-22 19:29   ` Alex Kogan
  2020-01-26  0:32     ` Lihao Liang
  1 sibling, 1 reply; 39+ messages in thread
From: Alex Kogan @ 2020-01-22 19:29 UTC (permalink / raw)
  To: Lihao Liang
  Cc: linux-arch, guohanjun, arnd, Peter Zijlstra, dave.dice, jglauber,
	x86, will.deacon, linux, steven.sistare, linux-kernel, mingo, bp,
	hpa, longman, tglx, daniel.m.jordan, Will Deacon,
	linux-arm-kernel

Hi, Lihao.

> On Jan 22, 2020, at 6:45 AM, Lihao Liang <lihaoliang@google.com> wrote:
> 
> Hi Alex,
> 
> On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@oracle.com> wrote:
>> 
>> Summary
>> -------
>> 
>> Lock throughput can be increased by handing a lock to a waiter on the
>> same NUMA node as the lock holder, provided care is taken to avoid
>> starvation of waiters on other NUMA nodes. This patch introduces CNA
>> (compact NUMA-aware lock) as the slow path for qspinlock. It is
>> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
>> 
> 
> Thanks for your patches. The experimental results look promising!
> 
> I understand that the new CNA qspinlock uses randomization to achieve
> long-term fairness, and provides the numa_spinlock_threshold parameter
> for users to tune.
This has been the case in the first versions of the series, but is not true anymore.
That is, the long-term fairness is achieved deterministically (and you are correct 
that it is done through the numa_spinlock_threshold parameter).

> As Linux runs extremely diverse workloads, it is not
> clear how randomization affects its fairness, and how users with
> different requirements are supposed to tune this parameter.
> 
> To this end, Will and I consider it beneficial to be able to answer the
> following question:
> 
> With different values of numa_spinlock_threshold and
> SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
> sockets have to wait to acquire the lock?
The SHUFFLE_REDUCTION_PROB_ARG parameter is intended for performance
optimization only, and *does not* affect the long-term fairness (or, at the 
very least, does not make it any worse). As Longman correctly pointed out in 
his response to this email, the shuffle reduction optimization is relevant only
when the secondary queue is empty. In that case, CNA hands-off the lock
exactly as MCS does, i.e., in the FIFO order. Note that when the secondary
queue is not empty, we do not call probably().

> This is particularly relevant
> in high contention situations when new threads keep arriving on the same
> socket as the lock holder.
In this case, the lock will stay on the same NUMA node/socket for 
2^numa_spinlock_threshold times, which is the worst case scenario if we 
consider the long-term fairness. And if we have multiple nodes, it will take 
up to 2^numa_spinlock_threshold X (nr_nodes - 1) + nr_cpus_per_node
lock transitions until any given thread will acquire the lock
(assuming 2^numa_spinlock_threshold > nr_cpus_per_node).

Hopefully, it addresses your concern. Let me know if you have any further 
questions.

Best regards,
— Alex


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-15  3:59 ` [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
@ 2020-01-23  9:26   ` Peter Zijlstra
  2020-01-23 10:06     ` Peter Zijlstra
  2020-01-23 14:15   ` Waiman Long
  1 sibling, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2020-01-23  9:26 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, mingo, bp, hpa,
	longman, tglx, daniel.m.jordan, linux-arm-kernel

On Tue, Jan 14, 2020 at 10:59:18PM -0500, Alex Kogan wrote:
> +/* this function is called only when the primary queue is empty */
> +static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
> +				       struct mcs_spinlock *node)
> +{
> +	struct mcs_spinlock *head_2nd, *tail_2nd;
> +	u32 new;
> +
> +	/* If the secondary queue is empty, do what MCS does. */
> +	if (node->locked <= 1)
> +		return __try_clear_tail(lock, val, node);
> +
> +	/*
> +	 * Try to update the tail value to the last node in the secondary queue.
> +	 * If successful, pass the lock to the first thread in the secondary
> +	 * queue. Doing those two actions effectively moves all nodes from the
> +	 * secondary queue into the main one.
> +	 */
> +	tail_2nd = decode_tail(node->locked);
> +	head_2nd = tail_2nd->next;
> +	new = ((struct cna_node *)tail_2nd)->encoded_tail + _Q_LOCKED_VAL;
> +
> +	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
> +		/*
> +		 * Try to reset @next in tail_2nd to NULL, but no need to check
> +		 * the result - if failed, a new successor has updated it.
> +		 */

I think you actually have an ordering bug here; the load of head_2nd
*must* happen before the atomic_try_cmpxchg(), otherwise it might
observe the new next and clear a valid next pointer.

What would be the best fix for that; I'm thinking:

	head_2nd = smp_load_acquire(&tail_2nd->next);

Will?

> +		cmpxchg_relaxed(&tail_2nd->next, head_2nd, NULL);
> +		arch_mcs_pass_lock(&head_2nd->locked, 1);
> +		return true;
> +	}
> +
> +	return false;
> +}

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-23  9:26   ` Peter Zijlstra
@ 2020-01-23 10:06     ` Peter Zijlstra
  2020-01-23 10:16       ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2020-01-23 10:06 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, mingo, bp, hpa,
	longman, tglx, daniel.m.jordan, linux-arm-kernel

On Thu, Jan 23, 2020 at 10:26:58AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 14, 2020 at 10:59:18PM -0500, Alex Kogan wrote:
> > +/* this function is called only when the primary queue is empty */
> > +static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
> > +				       struct mcs_spinlock *node)
> > +{
> > +	struct mcs_spinlock *head_2nd, *tail_2nd;
> > +	u32 new;
> > +
> > +	/* If the secondary queue is empty, do what MCS does. */
> > +	if (node->locked <= 1)
> > +		return __try_clear_tail(lock, val, node);
> > +
> > +	/*
> > +	 * Try to update the tail value to the last node in the secondary queue.
> > +	 * If successful, pass the lock to the first thread in the secondary
> > +	 * queue. Doing those two actions effectively moves all nodes from the
> > +	 * secondary queue into the main one.
> > +	 */
> > +	tail_2nd = decode_tail(node->locked);
> > +	head_2nd = tail_2nd->next;
> > +	new = ((struct cna_node *)tail_2nd)->encoded_tail + _Q_LOCKED_VAL;
> > +
> > +	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
> > +		/*
> > +		 * Try to reset @next in tail_2nd to NULL, but no need to check
> > +		 * the result - if failed, a new successor has updated it.
> > +		 */
> 
> I think you actually have an ordering bug here; the load of head_2nd
> *must* happen before the atomic_try_cmpxchg(), otherwise it might
> observe the new next and clear a valid next pointer.
> 
> What would be the best fix for that; I'm thinking:
> 
> 	head_2nd = smp_load_acquire(&tail_2nd->next);
> 
> Will?

Hmm, given we've not passed the lock around yet; why wouldn't something
like this work:

	smp_store_release(&tail_2nd->next, NULL);

	if (!atomic_try_cmpxchg_relaxed(&lock, &val, new)) {
		tail_2nd->next = head_2nd;
		return false;
	}

The whole second queue is only ever modified by the lock owner, and that
is us, so we can pre-terminate the secondary queue (break the circular
link), try the cmpxchg and fix it back up when it fails.


> > +		cmpxchg_relaxed(&tail_2nd->next, head_2nd, NULL);
> > +		arch_mcs_pass_lock(&head_2nd->locked, 1);
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-23 10:06     ` Peter Zijlstra
@ 2020-01-23 10:16       ` Peter Zijlstra
  2020-01-23 11:22         ` Will Deacon
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2020-01-23 10:16 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, mingo, bp, hpa,
	longman, tglx, daniel.m.jordan, linux-arm-kernel

On Thu, Jan 23, 2020 at 11:06:35AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2020 at 10:26:58AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 14, 2020 at 10:59:18PM -0500, Alex Kogan wrote:
> > > +/* this function is called only when the primary queue is empty */
> > > +static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
> > > +				       struct mcs_spinlock *node)
> > > +{
> > > +	struct mcs_spinlock *head_2nd, *tail_2nd;
> > > +	u32 new;
> > > +
> > > +	/* If the secondary queue is empty, do what MCS does. */
> > > +	if (node->locked <= 1)
> > > +		return __try_clear_tail(lock, val, node);
> > > +
> > > +	/*
> > > +	 * Try to update the tail value to the last node in the secondary queue.
> > > +	 * If successful, pass the lock to the first thread in the secondary
> > > +	 * queue. Doing those two actions effectively moves all nodes from the
> > > +	 * secondary queue into the main one.
> > > +	 */
> > > +	tail_2nd = decode_tail(node->locked);
> > > +	head_2nd = tail_2nd->next;
> > > +	new = ((struct cna_node *)tail_2nd)->encoded_tail + _Q_LOCKED_VAL;
> > > +
> > > +	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
> > > +		/*
> > > +		 * Try to reset @next in tail_2nd to NULL, but no need to check
> > > +		 * the result - if failed, a new successor has updated it.
> > > +		 */
> > 
> > I think you actually have an ordering bug here; the load of head_2nd
> > *must* happen before the atomic_try_cmpxchg(), otherwise it might
> > observe the new next and clear a valid next pointer.
> > 
> > What would be the best fix for that; I'm thinking:
> > 
> > 	head_2nd = smp_load_acquire(&tail_2nd->next);
> > 
> > Will?
> 
> Hmm, given we've not passed the lock around yet; why wouldn't something
> like this work:
> 
> 	smp_store_release(&tail_2nd->next, NULL);

Argh, make that:

	tail_2nd->next = NULL;

	smp_wmb();

> 	if (!atomic_try_cmpxchg_relaxed(&lock, &val, new)) {
> 		tail_2nd->next = head_2nd;
> 		return false;
> 	}
> 
> The whole second queue is only ever modified by the lock owner, and that
> is us, so we can pre-terminate the secondary queue (break the circular
> link), try the cmpxchg and fix it back up when it fails.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-23 10:16       ` Peter Zijlstra
@ 2020-01-23 11:22         ` Will Deacon
  2020-01-23 13:17           ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Will Deacon @ 2020-01-23 11:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, hpa, arnd, will.deacon, jglauber, x86, dave.dice,
	linux, linux-kernel, mingo, steven.sistare, longman, guohanjun,
	Alex Kogan, bp, tglx, daniel.m.jordan, linux-arm-kernel

On Thu, Jan 23, 2020 at 11:16:49AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2020 at 11:06:35AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 23, 2020 at 10:26:58AM +0100, Peter Zijlstra wrote:
> > > On Tue, Jan 14, 2020 at 10:59:18PM -0500, Alex Kogan wrote:
> > > > +/* this function is called only when the primary queue is empty */
> > > > +static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
> > > > +				       struct mcs_spinlock *node)
> > > > +{
> > > > +	struct mcs_spinlock *head_2nd, *tail_2nd;
> > > > +	u32 new;
> > > > +
> > > > +	/* If the secondary queue is empty, do what MCS does. */
> > > > +	if (node->locked <= 1)
> > > > +		return __try_clear_tail(lock, val, node);
> > > > +
> > > > +	/*
> > > > +	 * Try to update the tail value to the last node in the secondary queue.
> > > > +	 * If successful, pass the lock to the first thread in the secondary
> > > > +	 * queue. Doing those two actions effectively moves all nodes from the
> > > > +	 * secondary queue into the main one.
> > > > +	 */
> > > > +	tail_2nd = decode_tail(node->locked);
> > > > +	head_2nd = tail_2nd->next;
> > > > +	new = ((struct cna_node *)tail_2nd)->encoded_tail + _Q_LOCKED_VAL;
> > > > +
> > > > +	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
> > > > +		/*
> > > > +		 * Try to reset @next in tail_2nd to NULL, but no need to check
> > > > +		 * the result - if failed, a new successor has updated it.
> > > > +		 */
> > > 
> > > I think you actually have an ordering bug here; the load of head_2nd
> > > *must* happen before the atomic_try_cmpxchg(), otherwise it might
> > > observe the new next and clear a valid next pointer.
> > > 
> > > What would be the best fix for that; I'm thinking:
> > > 
> > > 	head_2nd = smp_load_acquire(&tail_2nd->next);
> > > 
> > > Will?
> > 
> > Hmm, given we've not passed the lock around yet; why wouldn't something
> > like this work:
> > 
> > 	smp_store_release(&tail_2nd->next, NULL);
> 
> Argh, make that:
> 
> 	tail_2nd->next = NULL;
> 
> 	smp_wmb();
> 
> > 	if (!atomic_try_cmpxchg_relaxed(&lock, &val, new)) {

... or could you drop the smp_wmb() and make this
atomic_try_cmpxchg_release()?

To be honest, I've failed to understand the code prior to your changes
in this area: it appears to reply on a control-dependency from the two
cmpxchg_relaxed() calls (which isn't sufficient to order the store parts
afaict) and I also don't get how we deal with a transiently circular primary
queue.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-22 17:24   ` Waiman Long
@ 2020-01-23 11:35     ` Will Deacon
  2020-01-23 15:25       ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Will Deacon @ 2020-01-23 11:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, x86, arnd, peterz, dave.dice, jglauber,
	hpa, will.deacon, linux, linux-kernel, mingo, bp, Lihao Liang,
	Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

Hi folks,

(I think Lihao is travelling at the moment, so he may be delayed in his
replies)

On Wed, Jan 22, 2020 at 12:24:58PM -0500, Waiman Long wrote:
> On 1/22/20 6:45 AM, Lihao Liang wrote:
> > On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@oracle.com> wrote:
> >> Summary
> >> -------
> >>
> >> Lock throughput can be increased by handing a lock to a waiter on the
> >> same NUMA node as the lock holder, provided care is taken to avoid
> >> starvation of waiters on other NUMA nodes. This patch introduces CNA
> >> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> >> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
> >>
> > Thanks for your patches. The experimental results look promising!
> >
> > I understand that the new CNA qspinlock uses randomization to achieve
> > long-term fairness, and provides the numa_spinlock_threshold parameter
> > for users to tune. As Linux runs extremely diverse workloads, it is not
> > clear how randomization affects its fairness, and how users with
> > different requirements are supposed to tune this parameter.
> >
> > To this end, Will and I consider it beneficial to be able to answer the
> > following question:
> >
> > With different values of numa_spinlock_threshold and
> > SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
> > sockets have to wait to acquire the lock? This is particularly relevant
> > in high contention situations when new threads keep arriving on the same
> > socket as the lock holder.
> >
> > In this email, I try to provide some formal analysis to address this
> > question. Let's assume the probability for the lock to stay on the
> > same socket is *at least* p, which corresponds to the probability for
> > the function probably(unsigned int num_bits) in the patch to return *false*,
> > where SHUFFLE_REDUCTION_PROB_ARG is passed as the value of num_bits to the
> > function.
> 
> That is not strictly true from my understanding of the code. The
> probably() function does not come into play if a secondary queue is
> present. Also calling cna_scan_main_queue() doesn't guarantee that a
> waiter in the same node can be found. So the simple mathematical
> analysis isn't that applicable in this case. One will have to do an
> actual simulation to find out what the actual behavior will be.

It's certainly true that the analysis is based on the worst-case scenario,
but I think it's still worth considering. For example, the secondary queue
does not exist initially so it seems a bit odd that we only instantiate it
with < 1% probability.

That said, my real concern with any of this is that it makes formal
modelling and analysis of the qspinlock considerably more challenging. I
would /really/ like to see an update to the TLA+ model we have of the
current implementation [1] and preferably also the userspace version I
hacked together [2] so that we can continue to test and validate changes
to the code outside of the usual kernel stress-testing.

Will

[1] https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/
[2] https://mirrors.edge.kernel.org/pub/linux/kernel/people/will/spinbench/

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-23 11:22         ` Will Deacon
@ 2020-01-23 13:17           ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2020-01-23 13:17 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arch, hpa, arnd, will.deacon, jglauber, x86, dave.dice,
	linux, linux-kernel, mingo, steven.sistare, longman, guohanjun,
	Alex Kogan, bp, tglx, daniel.m.jordan, linux-arm-kernel

On Thu, Jan 23, 2020 at 11:22:51AM +0000, Will Deacon wrote:

> > Argh, make that:
> > 
> > 	tail_2nd->next = NULL;
> > 
> > 	smp_wmb();
> > 
> > > 	if (!atomic_try_cmpxchg_relaxed(&lock, &val, new)) {
> 
> ... or could you drop the smp_wmb() and make this
> atomic_try_cmpxchg_release()?

My current code has the smp_wmb(), because most _releases end up being
an smp_mb() (except for powerpc where it is of equal cost to wmb and
arm64, where I have no idea of the costs).

> To be honest, I've failed to understand the code prior to your changes
> in this area: it appears to reply on a control-dependency from the two
> cmpxchg_relaxed() calls (which isn't sufficient to order the store parts
> afaict) and I also don't get how we deal with a transiently circular primary
> queue.

Ha!, yes, so this little piece took me a while too. Let me attempt an
explanation.

+ *    cna_node
+ *   +----------+     +--------+         +--------+
+ *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
+ *   |mcs:locked| -.  +--------+         +--------+
+ *   +----------+  |
+ *                 `----------------------.
+ *                                        v
+ *                 +--------+         +--------+
+ *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
+ *                 +--------+         +--------+
+ *                     ^                    |
+ *                     `--------------------'

So @node is the current lock holder, node->next == NULL (primary queue
is empty) and we're going to try and splice the secondary queue to the
head of the primary.

+       tail_2nd = decode_tail(node->locked);
+       head_2nd = tail_2nd->next;

this gets the secondary head and tail, so far so simple

+       new = ((struct cna_node *)tail_2nd)->encoded_tail + _Q_LOCKED_VAL;

this encodes the new primary tail (as kept in lock->val), still simple

+       if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {

if this here succeeds, we've got the primary tail pointing at the
secondary tail. This is safe because only the lock holder (us) ever
modifies the secondary queue.

+               /*
+                * Try to reset @next in tail_2nd to NULL, but no need to check
+                * the result - if failed, a new successor has updated it.
+                */
+               cmpxchg_relaxed(&tail_2nd->next, head_2nd, NULL);

This is (broken, as per the prior argument) breaking the circular link
the secondary queue has. The trick here is that since we're the lock
holder, nothing will actually iterate the primary ->next chain, so a
bogus value in there is of no concern.

_However_ a new waiter might at this point do:

	old = xchg_tail(lock, node);
	if (old) {
		prev = decode_tail(old);
		WRITE_ONCE(prev->next, node);
		...
	}

which then results in conflicting stores to the one ->next variable.

The cmpxchg() is attempting to terminate the list, while the new waiter
is extending the list, it is therefore paramount the new waiter always
wins this. To that end they're employing the cmpxchg, but it very much
relies on the @head_2nd load to have happened before we exposed the
secondary tail as primary tail, otherwise it can have loaded the new
->next pointer and overwriten it.

+               arch_mcs_pass_lock(&head_2nd->locked, 1);
+               return true;
+       }
+
+       return false;


Did that help, or just make it worse?

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-15  3:59 ` [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
  2020-01-23  9:26   ` Peter Zijlstra
@ 2020-01-23 14:15   ` Waiman Long
  2020-01-23 15:29     ` Peter Zijlstra
  1 sibling, 1 reply; 39+ messages in thread
From: Waiman Long @ 2020-01-23 14:15 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: dave.dice, steven.sistare, daniel.m.jordan

On 1/14/20 10:59 PM, Alex Kogan wrote:
> +static int __init cna_init_nodes(void)
> +{
> +	unsigned int cpu;
> +
> +	/*
> +	 * this will break on 32bit architectures, so we restrict
> +	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
> +	 */
> +	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
> +	/* we store an ecoded tail word in the node's @locked field */
> +	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
> +
> +	for_each_possible_cpu(cpu)
> +		cna_init_nodes_per_cpu(cpu);
> +
> +	return 0;
> +}
> +early_initcall(cna_init_nodes);
> +

I just realized that you shouldn't call cna_init_nodes as an
early_initcall. Instead,

> +/*
> + * Switch to the NUMA-friendly slow path for spinlocks when we have
> + * multiple NUMA nodes in native environment, unless the user has
> + * overridden this default behavior by setting the numa_spinlock flag.
> + */
> +void __init cna_configure_spin_lock_slowpath(void)
> +{
> +	if ((numa_spinlock_flag == 1) ||
> +	    (numa_spinlock_flag == 0 && nr_node_ids > 1 &&
> +		    pv_ops.lock.queued_spin_lock_slowpath ==
> +			native_queued_spin_lock_slowpath)) {
> +		pv_ops.lock.queued_spin_lock_slowpath =
> +		    __cna_queued_spin_lock_slowpath;
> +
> +		pr_info("Enabling CNA spinlock\n");
> +	}
> +}

call it when it is sure that CNA spinlock is going to be used. At this
point, the system is still in UP mode and the slowpath will not be called.

Cheers,
Longman


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-23 11:35     ` Will Deacon
@ 2020-01-23 15:25       ` Waiman Long
  2020-01-23 19:08         ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2020-01-23 15:25 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arch, guohanjun, x86, arnd, peterz, dave.dice, jglauber,
	hpa, will.deacon, linux, linux-kernel, mingo, bp, Lihao Liang,
	Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On 1/23/20 6:35 AM, Will Deacon wrote:
> Hi folks,
>
> (I think Lihao is travelling at the moment, so he may be delayed in his
> replies)
>
> On Wed, Jan 22, 2020 at 12:24:58PM -0500, Waiman Long wrote:
>> On 1/22/20 6:45 AM, Lihao Liang wrote:
>>> On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@oracle.com> wrote:
>>>> Summary
>>>> -------
>>>>
>>>> Lock throughput can be increased by handing a lock to a waiter on the
>>>> same NUMA node as the lock holder, provided care is taken to avoid
>>>> starvation of waiters on other NUMA nodes. This patch introduces CNA
>>>> (compact NUMA-aware lock) as the slow path for qspinlock. It is
>>>> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
>>>>
>>> Thanks for your patches. The experimental results look promising!
>>>
>>> I understand that the new CNA qspinlock uses randomization to achieve
>>> long-term fairness, and provides the numa_spinlock_threshold parameter
>>> for users to tune. As Linux runs extremely diverse workloads, it is not
>>> clear how randomization affects its fairness, and how users with
>>> different requirements are supposed to tune this parameter.
>>>
>>> To this end, Will and I consider it beneficial to be able to answer the
>>> following question:
>>>
>>> With different values of numa_spinlock_threshold and
>>> SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
>>> sockets have to wait to acquire the lock? This is particularly relevant
>>> in high contention situations when new threads keep arriving on the same
>>> socket as the lock holder.
>>>
>>> In this email, I try to provide some formal analysis to address this
>>> question. Let's assume the probability for the lock to stay on the
>>> same socket is *at least* p, which corresponds to the probability for
>>> the function probably(unsigned int num_bits) in the patch to return *false*,
>>> where SHUFFLE_REDUCTION_PROB_ARG is passed as the value of num_bits to the
>>> function.
>> That is not strictly true from my understanding of the code. The
>> probably() function does not come into play if a secondary queue is
>> present. Also calling cna_scan_main_queue() doesn't guarantee that a
>> waiter in the same node can be found. So the simple mathematical
>> analysis isn't that applicable in this case. One will have to do an
>> actual simulation to find out what the actual behavior will be.
> It's certainly true that the analysis is based on the worst-case scenario,
> but I think it's still worth considering. For example, the secondary queue
> does not exist initially so it seems a bit odd that we only instantiate it
> with < 1% probability.
>
> That said, my real concern with any of this is that it makes formal
> modelling and analysis of the qspinlock considerably more challenging. I
> would /really/ like to see an update to the TLA+ model we have of the
> current implementation [1] and preferably also the userspace version I
> hacked together [2] so that we can continue to test and validate changes
> to the code outside of the usual kernel stress-testing.

I do agree that the current CNA code is hard to model. The CNA lock
behaves like a regular qspinlock in many cases. If the lock becomes
fairly contended with waiters from different nodes, it will
opportunistically switch to CNA mode where preference is given to
waiters in the same node.

Cheers,
Longman


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-23 14:15   ` Waiman Long
@ 2020-01-23 15:29     ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2020-01-23 15:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, linux-kernel, mingo, bp, hpa, Alex Kogan,
	steven.sistare, tglx, daniel.m.jordan, linux-arm-kernel

On Thu, Jan 23, 2020 at 09:15:55AM -0500, Waiman Long wrote:
> On 1/14/20 10:59 PM, Alex Kogan wrote:
> > +static int __init cna_init_nodes(void)
> > +{
> > +	unsigned int cpu;
> > +
> > +	/*
> > +	 * this will break on 32bit architectures, so we restrict
> > +	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
> > +	 */
> > +	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
> > +	/* we store an ecoded tail word in the node's @locked field */
> > +	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
> > +
> > +	for_each_possible_cpu(cpu)
> > +		cna_init_nodes_per_cpu(cpu);
> > +
> > +	return 0;
> > +}
> > +early_initcall(cna_init_nodes);
> > +
> 
> I just realized that you shouldn't call cna_init_nodes as an
> early_initcall. Instead,
> 
> > +/*
> > + * Switch to the NUMA-friendly slow path for spinlocks when we have
> > + * multiple NUMA nodes in native environment, unless the user has
> > + * overridden this default behavior by setting the numa_spinlock flag.
> > + */
> > +void __init cna_configure_spin_lock_slowpath(void)
> > +{
> > +	if ((numa_spinlock_flag == 1) ||
> > +	    (numa_spinlock_flag == 0 && nr_node_ids > 1 &&
> > +		    pv_ops.lock.queued_spin_lock_slowpath ==
> > +			native_queued_spin_lock_slowpath)) {
> > +		pv_ops.lock.queued_spin_lock_slowpath =
> > +		    __cna_queued_spin_lock_slowpath;
> > +
> > +		pr_info("Enabling CNA spinlock\n");
> > +	}
> > +}
> 
> call it when it is sure that CNA spinlock is going to be used. At this
> point, the system is still in UP mode and the slowpath will not be called.

Indeed, let me go fix that!

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-23 15:25       ` Waiman Long
@ 2020-01-23 19:08         ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2020-01-23 19:08 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arch, guohanjun, x86, arnd, peterz, dave.dice, jglauber,
	hpa, will.deacon, linux, linux-kernel, mingo, bp, Lihao Liang,
	Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

[-- Attachment #1: Type: text/plain, Size: 4235 bytes --]

On 1/23/20 10:25 AM, Waiman Long wrote:
> On 1/23/20 6:35 AM, Will Deacon wrote:
>> Hi folks,
>>
>> (I think Lihao is travelling at the moment, so he may be delayed in his
>> replies)
>>
>> On Wed, Jan 22, 2020 at 12:24:58PM -0500, Waiman Long wrote:
>>> On 1/22/20 6:45 AM, Lihao Liang wrote:
>>>> On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@oracle.com> wrote:
>>>>> Summary
>>>>> -------
>>>>>
>>>>> Lock throughput can be increased by handing a lock to a waiter on the
>>>>> same NUMA node as the lock holder, provided care is taken to avoid
>>>>> starvation of waiters on other NUMA nodes. This patch introduces CNA
>>>>> (compact NUMA-aware lock) as the slow path for qspinlock. It is
>>>>> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
>>>>>
>>>> Thanks for your patches. The experimental results look promising!
>>>>
>>>> I understand that the new CNA qspinlock uses randomization to achieve
>>>> long-term fairness, and provides the numa_spinlock_threshold parameter
>>>> for users to tune. As Linux runs extremely diverse workloads, it is not
>>>> clear how randomization affects its fairness, and how users with
>>>> different requirements are supposed to tune this parameter.
>>>>
>>>> To this end, Will and I consider it beneficial to be able to answer the
>>>> following question:
>>>>
>>>> With different values of numa_spinlock_threshold and
>>>> SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
>>>> sockets have to wait to acquire the lock? This is particularly relevant
>>>> in high contention situations when new threads keep arriving on the same
>>>> socket as the lock holder.
>>>>
>>>> In this email, I try to provide some formal analysis to address this
>>>> question. Let's assume the probability for the lock to stay on the
>>>> same socket is *at least* p, which corresponds to the probability for
>>>> the function probably(unsigned int num_bits) in the patch to return *false*,
>>>> where SHUFFLE_REDUCTION_PROB_ARG is passed as the value of num_bits to the
>>>> function.
>>> That is not strictly true from my understanding of the code. The
>>> probably() function does not come into play if a secondary queue is
>>> present. Also calling cna_scan_main_queue() doesn't guarantee that a
>>> waiter in the same node can be found. So the simple mathematical
>>> analysis isn't that applicable in this case. One will have to do an
>>> actual simulation to find out what the actual behavior will be.
>> It's certainly true that the analysis is based on the worst-case scenario,
>> but I think it's still worth considering. For example, the secondary queue
>> does not exist initially so it seems a bit odd that we only instantiate it
>> with < 1% probability.
>>
>> That said, my real concern with any of this is that it makes formal
>> modelling and analysis of the qspinlock considerably more challenging. I
>> would /really/ like to see an update to the TLA+ model we have of the
>> current implementation [1] and preferably also the userspace version I
>> hacked together [2] so that we can continue to test and validate changes
>> to the code outside of the usual kernel stress-testing.
> I do agree that the current CNA code is hard to model. The CNA lock
> behaves like a regular qspinlock in many cases. If the lock becomes
> fairly contended with waiters from different nodes, it will
> opportunistically switch to CNA mode where preference is given to
> waiters in the same node.

BTW, I added the attached draft lock_event patch on top of the v9 CNA
patch series to observe the behavior of the CNA lock. Using a 2-socket
96-thread x86-64 server, the lock event output after boot up was:

cna_intra_max=1942
cna_mainscan_hit=134
cna_merge_queue=73
cna_prescan_hit=16662
cna_prescan_miss=268
cna_splice_new=352
cna_splice_old=2415
lock_pending=130090
lock_slowpath=191868
lock_use_node2=135

After resetting the counts and running a 96-thread lock stress test for
10s, I got

cna_intra_max=65536
cna_mainscan_hit=46
cna_merge_queue=661
cna_prescan_hit=42486841
cna_prescan_miss=68
cna_splice_new=676
cna_splice_old=402
lock_pending=11012
lock_slowpath=44332335
lock_use_node2=57203

So the cna_intra_max does go to the maximum of 64k.

Cheers,
Longman


[-- Attachment #2: 0006-locking-qspinlock-Enable-lock-events-tracking-for-CN.patch --]
[-- Type: text/x-patch, Size: 7752 bytes --]

From aa090c6f0a07d48dc4dcb22087cf4c17a25686d6 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Thu, 23 Jan 2020 13:53:12 -0500
Subject: [PATCH 6/6] locking/qspinlock: Enable lock events tracking for CNA
 qspinlock code

Add some lock events for tracking the behavior of the CNA qspinlock
code. A new lockevent_max() function is added to find out the maximum
value that CNA intra_count can reach.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events.c      | 23 +++++++++++++++++++----
 kernel/locking/lock_events.h      | 11 +++++++++++
 kernel/locking/lock_events_list.h | 13 +++++++++++++
 kernel/locking/qspinlock_cna.h    | 21 ++++++++++++++++-----
 kernel/locking/qspinlock_stat.h   | 23 ++++++++++++++++++++++-
 5 files changed, 81 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/lock_events.c b/kernel/locking/lock_events.c
index fa2c2f951c6b..0237cbbc94a2 100644
--- a/kernel/locking/lock_events.c
+++ b/kernel/locking/lock_events.c
@@ -120,14 +120,29 @@ static const struct file_operations fops_lockevent = {
 
 static bool __init skip_lockevent(const char *name)
 {
-	static int pv_on __initdata = -1;
+	static enum {
+		LOCK_UNKNOWN,
+		LOCK_NATIVE,
+		LOCK_PV,
+		LOCK_CNA,
+	} state __initdata = LOCK_UNKNOWN;
+
+	if (state == LOCK_UNKNOWN) {
+		if (pv_ops.lock.queued_spin_lock_slowpath ==
+		    native_queued_spin_lock_slowpath)
+			state = LOCK_NATIVE;
+		else if (pv_ops.lock.queued_spin_lock_slowpath ==
+			 pv_queued_spin_lock_slowpath)
+			state = LOCK_PV;
+		else
+			state = LOCK_CNA;
+	}
 
-	if (pv_on < 0)
-		pv_on = !pv_is_native_spin_unlock();
 	/*
 	 * Skip PV qspinlock events on bare metal.
 	 */
-	if (!pv_on && !memcmp(name, "pv_", 3))
+	if (((state != LOCK_PV)  && !memcmp(name, "pv_", 3)) ||
+	    ((state != LOCK_CNA) && !memcmp(name, "cna_", 4)))
 		return true;
 	return false;
 }
diff --git a/kernel/locking/lock_events.h b/kernel/locking/lock_events.h
index 8c7e7d25f09c..d8528725324c 100644
--- a/kernel/locking/lock_events.h
+++ b/kernel/locking/lock_events.h
@@ -50,11 +50,22 @@ static inline void __lockevent_add(enum lock_events event, int inc)
 
 #define lockevent_add(ev, c)	__lockevent_add(LOCKEVENT_ ##ev, c)
 
+static inline void __lockevent_max(enum lock_events event, unsigned long val)
+{
+	unsigned long max = raw_cpu_read(lockevents[event]);
+
+	if (val > max)
+		raw_cpu_write(lockevents[event], val);
+}
+
+#define lockevent_max(ev, v)	__lockevent_max(LOCKEVENT_ ##ev, v)
+
 #else  /* CONFIG_LOCK_EVENT_COUNTS */
 
 #define lockevent_inc(ev)
 #define lockevent_add(ev, c)
 #define lockevent_cond_inc(ev, c)
+#define lockevent_max(ev, v)
 
 #endif /* CONFIG_LOCK_EVENT_COUNTS */
 #endif /* __LOCKING_LOCK_EVENTS_H */
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 239039d0ce21..df1042bb19e9 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -35,6 +35,19 @@ LOCK_EVENT(pv_wait_head)	/* # of vCPU wait's at the queue head	   */
 LOCK_EVENT(pv_wait_node)	/* # of vCPU wait's at non-head queue node */
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
 
+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+/*
+ * Locking events for CNA qspinlock
+ */
+LOCK_EVENT(cna_prescan_hit)
+LOCK_EVENT(cna_prescan_miss)
+LOCK_EVENT(cna_mainscan_hit)
+LOCK_EVENT(cna_merge_queue)	/* # of queue merges (secondary -> primary) */
+LOCK_EVENT(cna_splice_new)	/* # of splices to new secondary queue	    */
+LOCK_EVENT(cna_splice_old)	/* # of splices to existing secondary queue */
+LOCK_EVENT(cna_intra_max)	/* Maximum intra_count value		    */
+#endif
+
 /*
  * Locking events for qspinlock
  *
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index f0b0c15dcf9d..2c410d67e094 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -193,6 +193,7 @@ static void cna_splice_tail(struct mcs_spinlock *node,
 	if (node->locked <= 1) { /* if secondary queue is empty */
 		/* create secondary queue */
 		last->next = first;
+		lockevent_inc(cna_splice_new);
 	} else {
 		/* add to the tail of the secondary queue */
 		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
@@ -200,6 +201,7 @@ static void cna_splice_tail(struct mcs_spinlock *node,
 
 		tail_2nd->next = first;
 		last->next = head_2nd;
+		lockevent_inc(cna_splice_old);
 	}
 
 	node->locked = ((struct cna_node *)last)->encoded_tail;
@@ -285,14 +287,15 @@ __always_inline u32 cna_pre_scan(struct qspinlock *lock,
 			cn->intra_count == intra_node_handoff_threshold ?
 				FLUSH_SECONDARY_QUEUE :
 				cna_scan_main_queue(node, node);
-
+	lockevent_cond_inc(cna_prescan_hit,
+			   cn->pre_scan_result == LOCAL_WAITER_FOUND);
 	return 0;
 }
 
 static inline void cna_pass_lock(struct mcs_spinlock *node,
 				 struct mcs_spinlock *next)
 {
-	struct cna_node *cn = (struct cna_node *)node;
+	struct cna_node *cn = (struct cna_node *)node, *next_cn;
 	struct mcs_spinlock *next_holder = next, *tail_2nd;
 	u32 val = 1;
 
@@ -311,20 +314,27 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 	 * pre-scan, and if so, try to find it in post-scan starting from the
 	 * node where pre-scan stopped (stored in @pre_scan_result)
 	 */
-	if (scan >= MIN_ENCODED_TAIL)
+	if (scan >= MIN_ENCODED_TAIL) {
 		scan = cna_scan_main_queue(node, decode_tail(scan));
+		lockevent_inc(cna_prescan_miss);
+		lockevent_cond_inc(cna_mainscan_hit,
+				   scan == LOCAL_WAITER_FOUND);
+	}
 
 	if (scan == LOCAL_WAITER_FOUND) {
 		next_holder = node->next;
+		next_cn = (struct cna_node *)next_holder;
+
 		/*
 		 * we unlock successor by passing a non-zero value,
 		 * so set @val to 1 iff @locked is 0, which will happen
 		 * if we acquired the MCS lock when its queue was empty
 		 */
 		val = node->locked ? node->locked : 1;
+
 		/* inc @intra_count if the secondary queue is not empty */
-		((struct cna_node *)next_holder)->intra_count =
-			cn->intra_count + (node->locked > 1);
+		next_cn->intra_count = cn->intra_count + (node->locked > 1);
+		lockevent_max(cna_intra_max, next_cn->intra_count);
 	} else if (node->locked > 1) {	  /* if secondary queue is not empty */
 		/* next holder will be the first node in the secondary queue */
 		tail_2nd = decode_tail(node->locked);
@@ -332,6 +342,7 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 		next_holder = tail_2nd->next;
 		/* splice the secondary queue onto the head of the main queue */
 		tail_2nd->next = next;
+		lockevent_inc(cna_merge_queue);
 	}
 
 pass_lock:
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index e625bb410aa2..530f86477e0f 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -22,6 +22,18 @@
  */
 static DEFINE_PER_CPU(u64, pv_kick_time);
 
+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+static inline bool lock_event_return_max(int id)
+{
+	return id == LOCKEVENT_cna_intra_max;
+}
+#else
+static inline bool lock_event_return_max(int id)
+{
+	return false;
+}
+#endif
+
 /*
  * Function to read and return the PV qspinlock counts.
  *
@@ -38,7 +50,7 @@ ssize_t lockevent_read(struct file *file, char __user *user_buf,
 {
 	char buf[64];
 	int cpu, id, len;
-	u64 sum = 0, kicks = 0;
+	u64 sum = 0, kicks = 0, val;
 
 	/*
 	 * Get the counter ID stored in file->f_inode->i_private
@@ -49,6 +61,15 @@ ssize_t lockevent_read(struct file *file, char __user *user_buf,
 		return -EBADF;
 
 	for_each_possible_cpu(cpu) {
+		val = per_cpu(lockevents[id], cpu);
+		if (lock_event_return_max(id)) {
+			/*
+			 * Find the maximum of all per-cpu values
+			 */
+			if (val > sum)
+				sum = val;
+			continue;
+		}
 		sum += per_cpu(lockevents[id], cpu);
 		/*
 		 * Need to sum additional counters for some of them
-- 
2.18.1


[-- Attachment #3: Type: text/plain, Size: 176 bytes --]

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 4/5] locking/qspinlock: Introduce starvation avoidance into CNA
  2020-01-15  3:59 ` [PATCH v9 4/5] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
@ 2020-01-23 19:55   ` Waiman Long
  2020-01-23 20:39     ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2020-01-23 19:55 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: dave.dice, steven.sistare, daniel.m.jordan

On 1/14/20 10:59 PM, Alex Kogan wrote:
> Keep track of the number of intra-node lock handoffs, and force
> inter-node handoff once this number reaches a preset threshold.
> The default value for the threshold can be overridden with
> the new kernel boot command-line option "numa_spinlock_threshold".
>
> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Waiman Long <longman@redhat.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  8 ++++
>  kernel/locking/qspinlock.c                    |  3 ++
>  kernel/locking/qspinlock_cna.h                | 41 ++++++++++++++++++-
>  3 files changed, 51 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index b68cb80e477f..30d79819a3b0 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3200,6 +3200,14 @@
>  			Not specifying this option is equivalent to
>  			numa_spinlock=auto.
>  
> +	numa_spinlock_threshold=	[NUMA, PV_OPS]
> +			Set the threshold for the number of intra-node
> +			lock hand-offs before the NUMA-aware spinlock
> +			is forced to be passed to a thread on another NUMA node.
> +			Valid values are in the [0..31] range. Smaller values
> +			result in a more fair, but less performant spinlock, and
> +			vice versa. The default value is 16.
> +
>  	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
>  			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
>  			Some features depend on CPU0. Known dependencies are:
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 609980a53841..e382d8946ccc 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -597,6 +597,9 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
>  #if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
>  #define _GEN_CNA_LOCK_SLOWPATH
>  
> +#undef pv_init_node
> +#define pv_init_node			cna_init_node
> +
>  #undef pv_wait_head_or_lock
>  #define pv_wait_head_or_lock		cna_pre_scan
>  
> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
> index 8000231f3d51..a2b65f87e6f8 100644
> --- a/kernel/locking/qspinlock_cna.h
> +++ b/kernel/locking/qspinlock_cna.h
> @@ -51,13 +51,25 @@ struct cna_node {
>  	int			numa_node;
>  	u32			encoded_tail;
>  	u32			pre_scan_result; /* encoded tail or enum val */
> +	u32			intra_count;
>  };
>  
>  enum {
>  	LOCAL_WAITER_FOUND = 2,	/* 0 and 1 are reserved for @locked */
> +	FLUSH_SECONDARY_QUEUE = 3,
>  	MIN_ENCODED_TAIL
>  };
>  
> +/*
> + * Controls the threshold for the number of intra-node lock hand-offs before
> + * the NUMA-aware variant of spinlock is forced to be passed to a thread on
> + * another NUMA node. By default, the chosen value provides reasonable
> + * long-term fairness without sacrificing performance compared to a lock
> + * that does not have any fairness guarantees. The default setting can
> + * be changed with the "numa_spinlock_threshold" boot option.
> + */
> +unsigned int intra_node_handoff_threshold __ro_after_init = 1 << 16;
> +
>  static void __init cna_init_nodes_per_cpu(unsigned int cpu)
>  {
>  	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
> @@ -97,6 +109,11 @@ static int __init cna_init_nodes(void)
>  }
>  early_initcall(cna_init_nodes);
>  
> +static __always_inline void cna_init_node(struct mcs_spinlock *node)
> +{
> +	((struct cna_node *)node)->intra_count = 0;
> +}
> +
>  /* this function is called only when the primary queue is empty */
>  static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
>  				       struct mcs_spinlock *node)
> @@ -232,7 +249,9 @@ __always_inline u32 cna_pre_scan(struct qspinlock *lock,
>  {
>  	struct cna_node *cn = (struct cna_node *)node;
>  
> -	cn->pre_scan_result = cna_scan_main_queue(node, node);
> +	cn->pre_scan_result =
> +		cn->intra_count == intra_node_handoff_threshold ?
> +			FLUSH_SECONDARY_QUEUE : cna_scan_main_queue(node, node);
>  
>  	return 0;
>  }
> @@ -262,6 +281,9 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
>  		 * if we acquired the MCS lock when its queue was empty
>  		 */
>  		val = node->locked ? node->locked : 1;
> +		/* inc @intra_count if the secondary queue is not empty */
> +		((struct cna_node *)next_holder)->intra_count =
> +			cn->intra_count + (node->locked > 1);

Playing with lock event counts, I would like you to change the meaning
intra_count parameter that you are tracking. Instead of tracking the
number of times a lock is passed to a waiter of the same node
consecutively, I would like you to track the number of times the head
waiter in the secondary queue has given up its chance to acquire the
lock because a later waiter has jumped the queue and acquire the lock
before it. This value determines the worst case latency that a secondary
queue waiter can experience. So

@@ -332,8 +334,12 @@ static inline void cna_pass_lock(struct
mcs_spinlock *node,
                 */
                val = node->locked ? node->locked : 1;
 
-               /* inc @intra_count if the secondary queue is not empty */
-               next_cn->intra_count = cn->intra_count + (node->locked > 1);
+               /*
+                * inc @intra_count and pass it down if the secondary queue
+                * is not empty
+                */
+               if (node->locked > 1)
+                       next_cn->intra_count = cn->intra_count + 1;
        } else if (node->locked > 1) {    /* if secondary queue is not
empty */
                /* next holder will be the first node in the secondary
queue */

Maybe rename it to jump_count or some other more meaningful name. With
that change, we could probably reduce the default threshold from 64k to
maybe 256 or 512.

I changed the threshold to 256 and run a 96-thread locking stress test
for 10s, the lock event counts:

cna_flush_queue=15687
cna_intra_max=256
cna_mainscan_hit=13
cna_merge_queue=15691
cna_prescan_hit=4344037
cna_prescan_miss=21
cna_splice_new=15701
cna_splice_old=1289
lock_pending=4384
lock_slowpath=47998292
lock_use_node2=16778

Of the prescan hits, only about 0.4% of that resulted in a queue flush
which I thought is reasonable. I didn't see any noticeable degradation
in the performance of the stress test by reducing the threshold from 64k
to 256.

Cheers,
Longman


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 4/5] locking/qspinlock: Introduce starvation avoidance into CNA
  2020-01-23 19:55   ` Waiman Long
@ 2020-01-23 20:39     ` Waiman Long
  2020-01-23 23:39       ` Alex Kogan
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2020-01-23 20:39 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: dave.dice, steven.sistare, daniel.m.jordan

On 1/23/20 2:55 PM, Waiman Long wrote:
> Playing with lock event counts, I would like you to change the meaning
> intra_count parameter that you are tracking. Instead of tracking the
> number of times a lock is passed to a waiter of the same node
> consecutively, I would like you to track the number of times the head
> waiter in the secondary queue has given up its chance to acquire the
> lock because a later waiter has jumped the queue and acquire the lock
> before it. This value determines the worst case latency that a secondary
> queue waiter can experience. So

Well, that is not strictly true as a a waiter in the middle of the
secondary queue can go back and fro between the queues for a number of
times. Of course, if we can ensure that when a FLUSH_SECONDARY_QUEUE is
issued, those waiters that were in the secondary queue won't be put back
into the secondary queue again. The parameter will then really determine
the worst case latency.

One way to do it is to store the tail of the secondary queue into the
CNA node and passed it down the queue until it matches the current
encoded tail. That will require changing both numa_node and intra_count
into u16 to squeeze out space for another u32.

That will also make the code a bit easier to analyze.

Cheers,
Longman


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 4/5] locking/qspinlock: Introduce starvation avoidance into CNA
  2020-01-23 20:39     ` Waiman Long
@ 2020-01-23 23:39       ` Alex Kogan
  0 siblings, 0 replies; 39+ messages in thread
From: Alex Kogan @ 2020-01-23 23:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, arnd, peterz, dave.dice, jglauber, x86,
	will.deacon, linux, linux-kernel, mingo, bp, hpa, steven.sistare,
	tglx, daniel.m.jordan, linux-arm-kernel

> On Jan 23, 2020, at 3:39 PM, Waiman Long <longman@redhat.com> wrote:
> 
> On 1/23/20 2:55 PM, Waiman Long wrote:
>> Playing with lock event counts, I would like you to change the meaning
>> intra_count parameter that you are tracking. Instead of tracking the
>> number of times a lock is passed to a waiter of the same node
>> consecutively, I would like you to track the number of times the head
>> waiter in the secondary queue has given up its chance to acquire the
>> lock because a later waiter has jumped the queue and acquire the lock
>> before it.
Isn’t that the same thing? Note that we keep track of the number of 
intra-node lock transfers only when the secondary queue is not empty.

>> This value determines the worst case latency that a secondary
>> queue waiter can experience. So
> 
> Well, that is not strictly true as a a waiter in the middle of the
> secondary queue can go back and fro between the queues for a number of
> times. Of course, if we can ensure that when a FLUSH_SECONDARY_QUEUE is
> issued, those waiters that were in the secondary queue won't be put back
> into the secondary queue again.
This will not work as intended when we have more than 2 nodes. That is, if we
have threads from node A & B in the secondary queue, and then the queue
is flushed and its head (say, from node A) gets the lock, we want to push 
threads from node B back into the secondary queue, to keep the lock on node A.

And if we have only 2 nodes, a waiter in the middle of the secondary queue will 
never go back into the secondary queue, even if the threshold is small. 
This is because we flush the secondary queue by putting all its waiters in
the front of the main queue, and the secondary queue will remain empty at least
until we reach a thread from another node.

Regards,
— Alex
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-15  3:59 [PATCH v9 0/5] Add NUMA-awareness to qspinlock Alex Kogan
                   ` (5 preceding siblings ...)
  2020-01-22 11:45 ` [PATCH v9 0/5] Add NUMA-awareness to qspinlock Lihao Liang
@ 2020-01-24 22:24 ` Paul E. McKenney
       [not found]   ` <6AAE7FC6-F5DE-4067-8BC4-77F27948CD09@oracle.com>
  6 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2020-01-24 22:24 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, peterz, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, mingo, bp, hpa,
	longman, tglx, daniel.m.jordan, linux-arm-kernel

On Tue, Jan 14, 2020 at 10:59:15PM -0500, Alex Kogan wrote:
> Minor changes from v8 based on feedback from Longman:
> -----------------------------------------------------
> 
> - Add __init to cna_configure_spin_lock_slowpath().
> 
> - Fix the comment for cna_scan_main_queue().
> 
> - Change the type of intra_node_handoff_threshold to unsigned int.
> 
> 
> Summary
> -------
> 
> Lock throughput can be increased by handing a lock to a waiter on the
> same NUMA node as the lock holder, provided care is taken to avoid
> starvation of waiters on other NUMA nodes. This patch introduces CNA
> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
> 
> CNA is a NUMA-aware version of the MCS lock. Spinning threads are
> organized in two queues, a main queue for threads running on the same
> node as the current lock holder, and a secondary queue for threads
> running on other nodes. Threads store the ID of the node on which
> they are running in their queue nodes. After acquiring the MCS lock and
> before acquiring the spinlock, the lock holder scans the main queue
> looking for a thread running on the same node (pre-scan). If found (call
> it thread T), all threads in the main queue between the current lock
> holder and T are moved to the end of the secondary queue.  If such T
> is not found, we make another scan of the main queue after acquiring 
> the spinlock when unlocking the MCS lock (post-scan), starting at the
> node where pre-scan stopped. If both scans fail to find such T, the
> MCS lock is passed to the first thread in the secondary queue. If the
> secondary queue is empty, the MCS lock is passed to the next thread in the
> main queue. To avoid starvation of threads in the secondary queue, those
> threads are moved back to the head of the main queue after a certain
> number of intra-node lock hand-offs.
> 
> More details are available at https://arxiv.org/abs/1810.05600.
> 
> The series applies on top of v5.5.0-rc6, commit b3a987b026.
> Performance numbers are available in previous revisions
> of the series.
> 
> Further comments are welcome and appreciated.

I ran this on a large system with a version of locktorture that was
modified to print out the maximum and minimum per-CPU lock-acquisition
counts, and with CPU hotplug disabled.  I also modified the LOCK01 and
LOCK04 scenarios to use 220 hardware threads.

Here is what the test ended up with at the end of a one-hour run:

LOCK01 (exclusive):
Writes:  Total: 1241107333  Max/Min: 9206962/60902 ???  Fail: 0

LOCK04 (rwlock):
Writes:  Total: 232991963  Max/Min: 2631574/74582 ???  Fail: 0
Reads :  Total: 216935386  Max/Min: 2735939/28665 ???  Fail: 0

The "???" strings are printed because the ratio of maximum to minimum exceeds
a factor of two.

I also ran 30-minute runs on my laptop, which has 12 hardware threads:

LOCK01 (exclusive):
Writes:  Total: 3992072782  Max/Min: 259368782/97231961 ???  Fail: 0

LOCK04 (rwlock):
Writes:  Total: 131063892  Max/Min: 13136206/5876157 ???  Fail: 0
Reads :  Total: 144876801  Max/Min: 19999535/4873442 ???  Fail: 0

These also exceed the factor-of-two cutoff, but not as dramatically.
The readers for the reader-writer lock fared worst, with a 4-to-1 ratio.

These tests did run within guest OSes.  Is that configuration out of
scope for this locking algorithm?  In addition (as might well also have
been the case for the locktorture runs in your paper), these tests run
a pair of stress-test tasks for each hardware thread.

Is this expected behavior?

							Thanx, Paul

> Alex Kogan (5):
>   locking/qspinlock: Rename mcs lock/unlock macros and make them more
>     generic
>   locking/qspinlock: Refactor the qspinlock slow path
>   locking/qspinlock: Introduce CNA into the slow path of qspinlock
>   locking/qspinlock: Introduce starvation avoidance into CNA
>   locking/qspinlock: Introduce the shuffle reduction optimization into
>     CNA
> 
>  .../admin-guide/kernel-parameters.txt         |  18 +
>  arch/arm/include/asm/mcs_spinlock.h           |   6 +-
>  arch/x86/Kconfig                              |  20 +
>  arch/x86/include/asm/qspinlock.h              |   4 +
>  arch/x86/kernel/alternative.c                 |   4 +
>  include/asm-generic/mcs_spinlock.h            |   4 +-
>  kernel/locking/mcs_spinlock.h                 |  20 +-
>  kernel/locking/qspinlock.c                    |  82 +++-
>  kernel/locking/qspinlock_cna.h                | 399 ++++++++++++++++++
>  kernel/locking/qspinlock_paravirt.h           |   2 +-
>  10 files changed, 536 insertions(+), 23 deletions(-)
>  create mode 100644 kernel/locking/qspinlock_cna.h
> 
> -- 
> 2.21.0 (Apple Git-122.2)
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
       [not found]   ` <6AAE7FC6-F5DE-4067-8BC4-77F27948CD09@oracle.com>
@ 2020-01-25  0:57     ` Paul E. McKenney
  2020-01-25  1:59       ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2020-01-25  0:57 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, Arnd Bergmann, Peter Zijlstra, dave.dice,
	jglauber, x86, Will Deacon, linux, steven.sistare, linux-kernel,
	Ingo Molnar, bp, hpa, Waiman Long, tglx, daniel.m.jordan,
	linux-arm-kernel

On Fri, Jan 24, 2020 at 06:39:02PM -0500, Alex Kogan wrote:
> Hi, Paul.
> 
> Thanks for running those experiments!
> 
> > On Jan 24, 2020, at 5:24 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > 
> > On Tue, Jan 14, 2020 at 10:59:15PM -0500, Alex Kogan wrote:
> >> Minor changes from v8 based on feedback from Longman:
> >> -----------------------------------------------------
> >> 
> >> - Add __init to cna_configure_spin_lock_slowpath().
> >> 
> >> - Fix the comment for cna_scan_main_queue().
> >> 
> >> - Change the type of intra_node_handoff_threshold to unsigned int.
> >> 
> >> 
> >> Summary
> >> -------
> >> 
> >> Lock throughput can be increased by handing a lock to a waiter on the
> >> same NUMA node as the lock holder, provided care is taken to avoid
> >> starvation of waiters on other NUMA nodes. This patch introduces CNA
> >> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> >> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
> >> 
> >> CNA is a NUMA-aware version of the MCS lock. Spinning threads are
> >> organized in two queues, a main queue for threads running on the same
> >> node as the current lock holder, and a secondary queue for threads
> >> running on other nodes. Threads store the ID of the node on which
> >> they are running in their queue nodes. After acquiring the MCS lock and
> >> before acquiring the spinlock, the lock holder scans the main queue
> >> looking for a thread running on the same node (pre-scan). If found (call
> >> it thread T), all threads in the main queue between the current lock
> >> holder and T are moved to the end of the secondary queue.  If such T
> >> is not found, we make another scan of the main queue after acquiring 
> >> the spinlock when unlocking the MCS lock (post-scan), starting at the
> >> node where pre-scan stopped. If both scans fail to find such T, the
> >> MCS lock is passed to the first thread in the secondary queue. If the
> >> secondary queue is empty, the MCS lock is passed to the next thread in the
> >> main queue. To avoid starvation of threads in the secondary queue, those
> >> threads are moved back to the head of the main queue after a certain
> >> number of intra-node lock hand-offs.
> >> 
> >> More details are available at https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1810.05600&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Hvhk3F4omdCk-GE1PTOm3Kn0A7ApWOZ2aZLTuVxFK4k&m=1KUGGZYTHnQ25fgRFppdNvpJfI0rOO_Usdu18RDu_14&s=F12nhHutwnPNt_TQ2ELER0DhtsHlEI9EiW1nDPhm5-Y&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1810.05600&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Hvhk3F4omdCk-GE1PTOm3Kn0A7ApWOZ2aZLTuVxFK4k&m=1KUGGZYTHnQ25fgRFppdNvpJfI0rOO_Usdu18RDu_14&s=F12nhHutwnPNt_TQ2ELER0DhtsHlEI9EiW1nDPhm5-Y&e=> .
> >> 
> >> The series applies on top of v5.5.0-rc6, commit b3a987b026.
> >> Performance numbers are available in previous revisions
> >> of the series.
> >> 
> >> Further comments are welcome and appreciated.
> > 
> > I ran this on a large system with a version of locktorture that was
> > modified to print out the maximum and minimum per-CPU lock-acquisition
> > counts, and with CPU hotplug disabled.  I also modified the LOCK01 and
> > LOCK04 scenarios to use 220 hardware threads.
> > 
> > Here is what the test ended up with at the end of a one-hour run:
> > 
> > LOCK01 (exclusive):
> > Writes:  Total: 1241107333  Max/Min: 9206962/60902 ???  Fail: 0
> > 
> > LOCK04 (rwlock):
> > Writes:  Total: 232991963  Max/Min: 2631574/74582 ???  Fail: 0
> > Reads :  Total: 216935386  Max/Min: 2735939/28665 ???  Fail: 0
> > 
> > The "???" strings are printed because the ratio of maximum to minimum exceeds
> > a factor of two.
> Is this what you expect / have seen with the existing qspinlock?
> 
> > 
> > I also ran 30-minute runs on my laptop, which has 12 hardware threads:
> > 
> > LOCK01 (exclusive):
> > Writes:  Total: 3992072782  Max/Min: 259368782/97231961 ???  Fail: 0
> > 
> > LOCK04 (rwlock):
> > Writes:  Total: 131063892  Max/Min: 13136206/5876157 ???  Fail: 0
> > Reads :  Total: 144876801  Max/Min: 19999535/4873442 ???  Fail: 0
> I assume the system above is multi-socket, but your laptop is probably not?
> 
> If that’s the case, CNA should not be enabled on your laptop (grep
> kernel logs for "Enabling CNA spinlock” to be sure).
> 
> > 
> > These also exceed the factor-of-two cutoff, but not as dramatically.
> > The readers for the reader-writer lock fared worst, with a 4-to-1 ratio.
> > 
> > These tests did run within guest OSes.
> So I really wonder if CNA was enabled here, or whether this is what you get
> with paravirt qspinlock.
> 
> >  Is that configuration out of
> > scope for this locking algorithm?  In addition (as might well also have
> > been the case for the locktorture runs in your paper), these tests run
> > a pair of stress-test tasks for each hardware thread.
> > 
> > Is this expected behavior?
> The results do appear skewed a bit too much, but it would be helpful to know
> what qspinlock we are looking at, and how they compare to the existing qspinlock,
> in case it is indeed CNA.

You called it!  I will play with QEMU's -numa argument to see if I can get
CNA to run for me.  Please accept my apologies for the false alarm.

							Thanx, Paul

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-25  0:57     ` Paul E. McKenney
@ 2020-01-25  1:59       ` Waiman Long
       [not found]         ` <adb4fb09-f374-4d64-096b-ba9ad8b35fd5@redhat.com>
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2020-01-25  1:59 UTC (permalink / raw)
  To: paulmck, Alex Kogan
  Cc: linux-arch, guohanjun, Arnd Bergmann, Peter Zijlstra, dave.dice,
	jglauber, x86, Will Deacon, linux, linux-kernel, Ingo Molnar, bp,
	hpa, steven.sistare, tglx, daniel.m.jordan, linux-arm-kernel

On 1/24/20 7:57 PM, Paul E. McKenney wrote:
> On Fri, Jan 24, 2020 at 06:39:02PM -0500, Alex Kogan wrote:
>> Hi, Paul.
>>
>> Thanks for running those experiments!
>>
>>> On Jan 24, 2020, at 5:24 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
>>>
>>> On Tue, Jan 14, 2020 at 10:59:15PM -0500, Alex Kogan wrote:
>>>> Minor changes from v8 based on feedback from Longman:
>>>> -----------------------------------------------------
>>>>
>>>> - Add __init to cna_configure_spin_lock_slowpath().
>>>>
>>>> - Fix the comment for cna_scan_main_queue().
>>>>
>>>> - Change the type of intra_node_handoff_threshold to unsigned int.
>>>>
>>>>
>>>> Summary
>>>> -------
>>>>
>>>> Lock throughput can be increased by handing a lock to a waiter on the
>>>> same NUMA node as the lock holder, provided care is taken to avoid
>>>> starvation of waiters on other NUMA nodes. This patch introduces CNA
>>>> (compact NUMA-aware lock) as the slow path for qspinlock. It is
>>>> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
>>>>
>>>> CNA is a NUMA-aware version of the MCS lock. Spinning threads are
>>>> organized in two queues, a main queue for threads running on the same
>>>> node as the current lock holder, and a secondary queue for threads
>>>> running on other nodes. Threads store the ID of the node on which
>>>> they are running in their queue nodes. After acquiring the MCS lock and
>>>> before acquiring the spinlock, the lock holder scans the main queue
>>>> looking for a thread running on the same node (pre-scan). If found (call
>>>> it thread T), all threads in the main queue between the current lock
>>>> holder and T are moved to the end of the secondary queue.  If such T
>>>> is not found, we make another scan of the main queue after acquiring 
>>>> the spinlock when unlocking the MCS lock (post-scan), starting at the
>>>> node where pre-scan stopped. If both scans fail to find such T, the
>>>> MCS lock is passed to the first thread in the secondary queue. If the
>>>> secondary queue is empty, the MCS lock is passed to the next thread in the
>>>> main queue. To avoid starvation of threads in the secondary queue, those
>>>> threads are moved back to the head of the main queue after a certain
>>>> number of intra-node lock hand-offs.
>>>>
>>>> More details are available at https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1810.05600&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Hvhk3F4omdCk-GE1PTOm3Kn0A7ApWOZ2aZLTuVxFK4k&m=1KUGGZYTHnQ25fgRFppdNvpJfI0rOO_Usdu18RDu_14&s=F12nhHutwnPNt_TQ2ELER0DhtsHlEI9EiW1nDPhm5-Y&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1810.05600&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Hvhk3F4omdCk-GE1PTOm3Kn0A7ApWOZ2aZLTuVxFK4k&m=1KUGGZYTHnQ25fgRFppdNvpJfI0rOO_Usdu18RDu_14&s=F12nhHutwnPNt_TQ2ELER0DhtsHlEI9EiW1nDPhm5-Y&e=> .
>>>>
>>>> The series applies on top of v5.5.0-rc6, commit b3a987b026.
>>>> Performance numbers are available in previous revisions
>>>> of the series.
>>>>
>>>> Further comments are welcome and appreciated.
>>> I ran this on a large system with a version of locktorture that was
>>> modified to print out the maximum and minimum per-CPU lock-acquisition
>>> counts, and with CPU hotplug disabled.  I also modified the LOCK01 and
>>> LOCK04 scenarios to use 220 hardware threads.
>>>
>>> Here is what the test ended up with at the end of a one-hour run:
>>>
>>> LOCK01 (exclusive):
>>> Writes:  Total: 1241107333  Max/Min: 9206962/60902 ???  Fail: 0
>>>
>>> LOCK04 (rwlock):
>>> Writes:  Total: 232991963  Max/Min: 2631574/74582 ???  Fail: 0
>>> Reads :  Total: 216935386  Max/Min: 2735939/28665 ???  Fail: 0
>>>
>>> The "???" strings are printed because the ratio of maximum to minimum exceeds
>>> a factor of two.
>> Is this what you expect / have seen with the existing qspinlock?
>>
>>> I also ran 30-minute runs on my laptop, which has 12 hardware threads:
>>>
>>> LOCK01 (exclusive):
>>> Writes:  Total: 3992072782  Max/Min: 259368782/97231961 ???  Fail: 0
>>>
>>> LOCK04 (rwlock):
>>> Writes:  Total: 131063892  Max/Min: 13136206/5876157 ???  Fail: 0
>>> Reads :  Total: 144876801  Max/Min: 19999535/4873442 ???  Fail: 0
>> I assume the system above is multi-socket, but your laptop is probably not?
>>
>> If that’s the case, CNA should not be enabled on your laptop (grep
>> kernel logs for "Enabling CNA spinlock” to be sure).
>>
>>> These also exceed the factor-of-two cutoff, but not as dramatically.
>>> The readers for the reader-writer lock fared worst, with a 4-to-1 ratio.
>>>
>>> These tests did run within guest OSes.
>> So I really wonder if CNA was enabled here, or whether this is what you get
>> with paravirt qspinlock.
>>
>>>  Is that configuration out of
>>> scope for this locking algorithm?  In addition (as might well also have
>>> been the case for the locktorture runs in your paper), these tests run
>>> a pair of stress-test tasks for each hardware thread.
>>>
>>> Is this expected behavior?
>> The results do appear skewed a bit too much, but it would be helpful to know
>> what qspinlock we are looking at, and how they compare to the existing qspinlock,
>> in case it is indeed CNA.
> You called it!  I will play with QEMU's -numa argument to see if I can get
> CNA to run for me.  Please accept my apologies for the false alarm.
>
> 							Thanx, Paul
>
CNA is not currently supported in a VM guest simply because the numa
information is not reliable. You will have to run it on baremetal to
test it. Sorry for that.

Regards,
Longman


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
       [not found]         ` <adb4fb09-f374-4d64-096b-ba9ad8b35fd5@redhat.com>
@ 2020-01-25  4:58           ` Paul E. McKenney
  2020-01-25 19:41             ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2020-01-25  4:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, Arnd Bergmann, Peter Zijlstra, dave.dice,
	jglauber, x86, Will Deacon, linux, linux-kernel, Ingo Molnar, bp,
	hpa, Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On Fri, Jan 24, 2020 at 09:17:05PM -0500, Waiman Long wrote:
> On 1/24/20 8:59 PM, Waiman Long wrote:
> >> You called it!  I will play with QEMU's -numa argument to see if I can get
> >> CNA to run for me.  Please accept my apologies for the false alarm.
> >>
> >> 							Thanx, Paul
> >>
> > CNA is not currently supported in a VM guest simply because the numa
> > information is not reliable. You will have to run it on baremetal to
> > test it. Sorry for that.
> 
> Correction. There is a command line option to force CNA lock to be used
> in a VM. Use the "numa_spinlock=on" boot command line parameter.

As I understand it, I need to use a series of -numa arguments to qemu
combined with the numa_spinlock=on (or =1) on the kernel command line.
If the kernel thinks that there is only one NUMA node, it appears to
avoid doing CNA.

Correct?

							Thanx, Paul

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-25  4:58           ` Paul E. McKenney
@ 2020-01-25 19:41             ` Waiman Long
  2020-01-26 15:35               ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2020-01-25 19:41 UTC (permalink / raw)
  To: paulmck
  Cc: linux-arch, guohanjun, Arnd Bergmann, Peter Zijlstra, dave.dice,
	jglauber, x86, Will Deacon, linux, linux-kernel, Ingo Molnar, bp,
	hpa, Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On 1/24/20 11:58 PM, Paul E. McKenney wrote:
> On Fri, Jan 24, 2020 at 09:17:05PM -0500, Waiman Long wrote:
>> On 1/24/20 8:59 PM, Waiman Long wrote:
>>>> You called it!  I will play with QEMU's -numa argument to see if I can get
>>>> CNA to run for me.  Please accept my apologies for the false alarm.
>>>>
>>>> 							Thanx, Paul
>>>>
>>> CNA is not currently supported in a VM guest simply because the numa
>>> information is not reliable. You will have to run it on baremetal to
>>> test it. Sorry for that.
>> Correction. There is a command line option to force CNA lock to be used
>> in a VM. Use the "numa_spinlock=on" boot command line parameter.
> As I understand it, I need to use a series of -numa arguments to qemu
> combined with the numa_spinlock=on (or =1) on the kernel command line.
> If the kernel thinks that there is only one NUMA node, it appears to
> avoid doing CNA.
>
> Correct?
>
> 							Thanx, Paul
>
In auto-detection mode (the default), CNA will only be turned on when
paravirt qspinlock is not enabled first and there are at least 2 numa
nodes. The "numa_spinlock=on" option will force it on even when both of
the above conditions are false.

Cheers,
Longman


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-22 19:29   ` Alex Kogan
@ 2020-01-26  0:32     ` Lihao Liang
  2020-01-26  1:58       ` Lihao Liang
  2020-01-27  6:16       ` Alex Kogan
  0 siblings, 2 replies; 39+ messages in thread
From: Lihao Liang @ 2020-01-26  0:32 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, Peter Zijlstra, dave.dice, jglauber,
	x86, will.deacon, linux, steven.sistare, linux-kernel, mingo, bp,
	hpa, longman, tglx, daniel.m.jordan, Will Deacon,
	linux-arm-kernel

Hi Alex and Waiman,

Thanks a lot for your swift response and clarification.

On Wed, Jan 22, 2020 at 7:30 PM Alex Kogan <alex.kogan@oracle.com> wrote:
>
> Hi, Lihao.
>
> > On Jan 22, 2020, at 6:45 AM, Lihao Liang <lihaoliang@google.com> wrote:
> >
> > Hi Alex,
> >
> > On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@oracle.com> wrote:
> >>
> >> Summary
> >> -------
> >>
> >> Lock throughput can be increased by handing a lock to a waiter on the
> >> same NUMA node as the lock holder, provided care is taken to avoid
> >> starvation of waiters on other NUMA nodes. This patch introduces CNA
> >> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> >> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
> >>
> >
> > Thanks for your patches. The experimental results look promising!
> >
> > I understand that the new CNA qspinlock uses randomization to achieve
> > long-term fairness, and provides the numa_spinlock_threshold parameter
> > for users to tune.
> This has been the case in the first versions of the series, but is not true anymore.
> That is, the long-term fairness is achieved deterministically (and you are correct
> that it is done through the numa_spinlock_threshold parameter).
>
> > As Linux runs extremely diverse workloads, it is not
> > clear how randomization affects its fairness, and how users with
> > different requirements are supposed to tune this parameter.
> >
> > To this end, Will and I consider it beneficial to be able to answer the
> > following question:
> >
> > With different values of numa_spinlock_threshold and
> > SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
> > sockets have to wait to acquire the lock?
> The SHUFFLE_REDUCTION_PROB_ARG parameter is intended for performance
> optimization only, and *does not* affect the long-term fairness (or, at the
> very least, does not make it any worse). As Longman correctly pointed out in
> his response to this email, the shuffle reduction optimization is relevant only
> when the secondary queue is empty. In that case, CNA hands-off the lock
> exactly as MCS does, i.e., in the FIFO order. Note that when the secondary
> queue is not empty, we do not call probably().
>
> > This is particularly relevant
> > in high contention situations when new threads keep arriving on the same
> > socket as the lock holder.
> In this case, the lock will stay on the same NUMA node/socket for
> 2^numa_spinlock_threshold times, which is the worst case scenario if we
> consider the long-term fairness. And if we have multiple nodes, it will take
> up to 2^numa_spinlock_threshold X (nr_nodes - 1) + nr_cpus_per_node
> lock transitions until any given thread will acquire the lock
> (assuming 2^numa_spinlock_threshold > nr_cpus_per_node).
>

You're right that the latest version of the patch handles long-term fairness
deterministically.

As I understand it, the n-th thread in the main queue is guaranteed to
acquire the lock after N lock handovers, where N is bounded by

n - 1 + 2^numa_spinlock_threshold * (nr_nodes - 1)

I'm not sure what role the variable nr_cpus_per_node plays in your analysis.

Do I miss anything?

Many thanks,
Lihao.

> Hopefully, it addresses your concern. Let me know if you have any further
> questions.
>
> Best regards,
> — Alex
>

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-26  0:32     ` Lihao Liang
@ 2020-01-26  1:58       ` Lihao Liang
  2020-01-27 16:01         ` Alex Kogan
  2020-01-27  6:16       ` Alex Kogan
  1 sibling, 1 reply; 39+ messages in thread
From: Lihao Liang @ 2020-01-26  1:58 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, Peter Zijlstra, dave.dice, jglauber,
	x86, will.deacon, linux, steven.sistare, linux-kernel, mingo, bp,
	hpa, longman, tglx, daniel.m.jordan, Will Deacon,
	linux-arm-kernel

On Sun, Jan 26, 2020 at 12:32 AM Lihao Liang <lihaoliang@google.com> wrote:
>
> Hi Alex and Waiman,
>
> Thanks a lot for your swift response and clarification.
>
> On Wed, Jan 22, 2020 at 7:30 PM Alex Kogan <alex.kogan@oracle.com> wrote:
> >
> > Hi, Lihao.
> >
> > > On Jan 22, 2020, at 6:45 AM, Lihao Liang <lihaoliang@google.com> wrote:
> > >
> > > Hi Alex,
> > >
> > > On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@oracle.com> wrote:
> > >>
> > >> Summary
> > >> -------
> > >>
> > >> Lock throughput can be increased by handing a lock to a waiter on the
> > >> same NUMA node as the lock holder, provided care is taken to avoid
> > >> starvation of waiters on other NUMA nodes. This patch introduces CNA
> > >> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> > >> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
> > >>
> > >
> > > Thanks for your patches. The experimental results look promising!
> > >
> > > I understand that the new CNA qspinlock uses randomization to achieve
> > > long-term fairness, and provides the numa_spinlock_threshold parameter
> > > for users to tune.
> > This has been the case in the first versions of the series, but is not true anymore.
> > That is, the long-term fairness is achieved deterministically (and you are correct
> > that it is done through the numa_spinlock_threshold parameter).
> >
> > > As Linux runs extremely diverse workloads, it is not
> > > clear how randomization affects its fairness, and how users with
> > > different requirements are supposed to tune this parameter.
> > >
> > > To this end, Will and I consider it beneficial to be able to answer the
> > > following question:
> > >
> > > With different values of numa_spinlock_threshold and
> > > SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
> > > sockets have to wait to acquire the lock?
> > The SHUFFLE_REDUCTION_PROB_ARG parameter is intended for performance
> > optimization only, and *does not* affect the long-term fairness (or, at the
> > very least, does not make it any worse). As Longman correctly pointed out in
> > his response to this email, the shuffle reduction optimization is relevant only
> > when the secondary queue is empty. In that case, CNA hands-off the lock
> > exactly as MCS does, i.e., in the FIFO order. Note that when the secondary
> > queue is not empty, we do not call probably().
> >
> > > This is particularly relevant
> > > in high contention situations when new threads keep arriving on the same
> > > socket as the lock holder.
> > In this case, the lock will stay on the same NUMA node/socket for
> > 2^numa_spinlock_threshold times, which is the worst case scenario if we
> > consider the long-term fairness. And if we have multiple nodes, it will take
> > up to 2^numa_spinlock_threshold X (nr_nodes - 1) + nr_cpus_per_node
> > lock transitions until any given thread will acquire the lock
> > (assuming 2^numa_spinlock_threshold > nr_cpus_per_node).
> >
>
> You're right that the latest version of the patch handles long-term fairness
> deterministically.
>
> As I understand it, the n-th thread in the main queue is guaranteed to
> acquire the lock after N lock handovers, where N is bounded by
>
> n - 1 + 2^numa_spinlock_threshold * (nr_nodes - 1)
>
> I'm not sure what role the variable nr_cpus_per_node plays in your analysis.
>
> Do I miss anything?
>

If I understand correctly, there are two phases in the algorithm:

MCS phase: when the secondary queue is empty, as explained in your emails,
the algorithm hands the lock to threads in the main queue in an FIFO order.
When probably(SHUFFLE_REDUCTION_PROB_ARG) returns false (with default
probability 1%), if the algorithm finds the first thread running on the same
socket as the lock holder in cna_scan_main_queue(), it enters the following
CNA phase.

CNA phase: when the secondary queue is not empty, the algorithm keeps
handing the lock to threads in the main queue that run on the same socket as
the lock holder. When 2^numa_spinlock_threshold is reached, it splices
the secondary queue to the front of the main queue. And we are back to the
MCS phase above.

For the n-th thread T in the main queue, the MCS phase handles threads that
arrived in the main queue before T. In high contention situations, the CNA
phase handles two kinds of threads:

1. Threads ahead of T that run on the same socket as the lock holder when
a transition from the MCS to CNA phase was made. Assume there are m such
threads.

2. Threads that keep arriving on the same socket as the lock holder. There
are at most 2^numa_spinlock_threshold of them.

Then the number of lock handovers in the CNA phase is max(m,
2^numa_spinlock_threshold). So the total number of lock handovers before T
acquires the lock is at most

n - 1 + 2^numa_spinlock_threshold * (nr_nodes - 1)

Please let me know if I misunderstand anything.

Many thanks,
Lihao.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-25 19:41             ` Waiman Long
@ 2020-01-26 15:35               ` Paul E. McKenney
  2020-01-26 22:42                 ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2020-01-26 15:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, Arnd Bergmann, Peter Zijlstra, dave.dice,
	jglauber, x86, Will Deacon, linux, linux-kernel, Ingo Molnar, bp,
	hpa, Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On Sat, Jan 25, 2020 at 02:41:39PM -0500, Waiman Long wrote:
> On 1/24/20 11:58 PM, Paul E. McKenney wrote:
> > On Fri, Jan 24, 2020 at 09:17:05PM -0500, Waiman Long wrote:
> >> On 1/24/20 8:59 PM, Waiman Long wrote:
> >>>> You called it!  I will play with QEMU's -numa argument to see if I can get
> >>>> CNA to run for me.  Please accept my apologies for the false alarm.
> >>>>
> >>>> 							Thanx, Paul
> >>>>
> >>> CNA is not currently supported in a VM guest simply because the numa
> >>> information is not reliable. You will have to run it on baremetal to
> >>> test it. Sorry for that.
> >> Correction. There is a command line option to force CNA lock to be used
> >> in a VM. Use the "numa_spinlock=on" boot command line parameter.
> > As I understand it, I need to use a series of -numa arguments to qemu
> > combined with the numa_spinlock=on (or =1) on the kernel command line.
> > If the kernel thinks that there is only one NUMA node, it appears to
> > avoid doing CNA.
> >
> > Correct?
> >
> > 							Thanx, Paul
> >
> In auto-detection mode (the default), CNA will only be turned on when
> paravirt qspinlock is not enabled first and there are at least 2 numa
> nodes. The "numa_spinlock=on" option will force it on even when both of
> the above conditions are false.

Hmmm...

Here is my kernel command line taken from the console log:

console=ttyS0 locktorture.onoff_interval=0 numa_spinlock=on locktorture.stat_interval=15 locktorture.shutdown_secs=1800 locktorture.verbose=1

Yet the string "Enabling CNA spinlock" does not appear.

Ah, idiot here needs to enable CONFIG_NUMA_AWARE_SPINLOCKS in his build.
Trying again with "--kconfig "CONFIG_NUMA_AWARE_SPINLOCKS=y"...

							Thanx, Paul

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-26 15:35               ` Paul E. McKenney
@ 2020-01-26 22:42                 ` Paul E. McKenney
  2020-01-26 23:32                   ` Paul E. McKenney
                                     ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Paul E. McKenney @ 2020-01-26 22:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, Arnd Bergmann, Peter Zijlstra, dave.dice,
	jglauber, x86, Will Deacon, linux, linux-kernel, Ingo Molnar, bp,
	hpa, Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On Sun, Jan 26, 2020 at 07:35:35AM -0800, Paul E. McKenney wrote:
> On Sat, Jan 25, 2020 at 02:41:39PM -0500, Waiman Long wrote:
> > On 1/24/20 11:58 PM, Paul E. McKenney wrote:
> > > On Fri, Jan 24, 2020 at 09:17:05PM -0500, Waiman Long wrote:
> > >> On 1/24/20 8:59 PM, Waiman Long wrote:
> > >>>> You called it!  I will play with QEMU's -numa argument to see if I can get
> > >>>> CNA to run for me.  Please accept my apologies for the false alarm.
> > >>>>
> > >>>> 							Thanx, Paul
> > >>>>
> > >>> CNA is not currently supported in a VM guest simply because the numa
> > >>> information is not reliable. You will have to run it on baremetal to
> > >>> test it. Sorry for that.
> > >> Correction. There is a command line option to force CNA lock to be used
> > >> in a VM. Use the "numa_spinlock=on" boot command line parameter.
> > > As I understand it, I need to use a series of -numa arguments to qemu
> > > combined with the numa_spinlock=on (or =1) on the kernel command line.
> > > If the kernel thinks that there is only one NUMA node, it appears to
> > > avoid doing CNA.
> > >
> > > Correct?
> > >
> > > 							Thanx, Paul
> > >
> > In auto-detection mode (the default), CNA will only be turned on when
> > paravirt qspinlock is not enabled first and there are at least 2 numa
> > nodes. The "numa_spinlock=on" option will force it on even when both of
> > the above conditions are false.
> 
> Hmmm...
> 
> Here is my kernel command line taken from the console log:
> 
> console=ttyS0 locktorture.onoff_interval=0 numa_spinlock=on locktorture.stat_interval=15 locktorture.shutdown_secs=1800 locktorture.verbose=1
> 
> Yet the string "Enabling CNA spinlock" does not appear.
> 
> Ah, idiot here needs to enable CONFIG_NUMA_AWARE_SPINLOCKS in his build.
> Trying again with "--kconfig "CONFIG_NUMA_AWARE_SPINLOCKS=y"...

And after fixing that, plus adding the other three Kconfig options required
to enable this, I really do see "Enabling CNA spinlock" in the console log.
Yay!

At the end of the 30-minute locktorture exclusive-lock run, I see this:

Writes:  Total: 572176565  Max/Min: 54167704/10878216 ???  Fail: 0

This is about a five-to-one ratio.  Is this expected behavior, given a
single NUMA node on a single-socket system with 12 hardware threads?

I will try reader-writer lock next.

Again, should I be using qemu's -numa command-line option to create nodes?
If so, what would be a sane configuration given 12 CPUs and 512MB of
memory for the VM?  If not, what is a good way to exercise CNA's NUMA
capabilities within a guest OS?

							Thanx, Paul

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-26 22:42                 ` Paul E. McKenney
@ 2020-01-26 23:32                   ` Paul E. McKenney
  2020-01-27  6:04                   ` Alex Kogan
  2020-01-27 14:11                   ` Waiman Long
  2 siblings, 0 replies; 39+ messages in thread
From: Paul E. McKenney @ 2020-01-26 23:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, Arnd Bergmann, Peter Zijlstra, dave.dice,
	jglauber, x86, Will Deacon, linux, linux-kernel, Ingo Molnar, bp,
	hpa, Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On Sun, Jan 26, 2020 at 02:42:45PM -0800, Paul E. McKenney wrote:
> On Sun, Jan 26, 2020 at 07:35:35AM -0800, Paul E. McKenney wrote:
> > On Sat, Jan 25, 2020 at 02:41:39PM -0500, Waiman Long wrote:
> > > On 1/24/20 11:58 PM, Paul E. McKenney wrote:
> > > > On Fri, Jan 24, 2020 at 09:17:05PM -0500, Waiman Long wrote:
> > > >> On 1/24/20 8:59 PM, Waiman Long wrote:
> > > >>>> You called it!  I will play with QEMU's -numa argument to see if I can get
> > > >>>> CNA to run for me.  Please accept my apologies for the false alarm.
> > > >>>>
> > > >>>> 							Thanx, Paul
> > > >>>>
> > > >>> CNA is not currently supported in a VM guest simply because the numa
> > > >>> information is not reliable. You will have to run it on baremetal to
> > > >>> test it. Sorry for that.
> > > >> Correction. There is a command line option to force CNA lock to be used
> > > >> in a VM. Use the "numa_spinlock=on" boot command line parameter.
> > > > As I understand it, I need to use a series of -numa arguments to qemu
> > > > combined with the numa_spinlock=on (or =1) on the kernel command line.
> > > > If the kernel thinks that there is only one NUMA node, it appears to
> > > > avoid doing CNA.
> > > >
> > > > Correct?
> > > >
> > > > 							Thanx, Paul
> > > >
> > > In auto-detection mode (the default), CNA will only be turned on when
> > > paravirt qspinlock is not enabled first and there are at least 2 numa
> > > nodes. The "numa_spinlock=on" option will force it on even when both of
> > > the above conditions are false.
> > 
> > Hmmm...
> > 
> > Here is my kernel command line taken from the console log:
> > 
> > console=ttyS0 locktorture.onoff_interval=0 numa_spinlock=on locktorture.stat_interval=15 locktorture.shutdown_secs=1800 locktorture.verbose=1
> > 
> > Yet the string "Enabling CNA spinlock" does not appear.
> > 
> > Ah, idiot here needs to enable CONFIG_NUMA_AWARE_SPINLOCKS in his build.
> > Trying again with "--kconfig "CONFIG_NUMA_AWARE_SPINLOCKS=y"...
> 
> And after fixing that, plus adding the other three Kconfig options required
> to enable this, I really do see "Enabling CNA spinlock" in the console log.
> Yay!
> 
> At the end of the 30-minute locktorture exclusive-lock run, I see this:
> 
> Writes:  Total: 572176565  Max/Min: 54167704/10878216 ???  Fail: 0
> 
> This is about a five-to-one ratio.  Is this expected behavior, given a
> single NUMA node on a single-socket system with 12 hardware threads?
> 
> I will try reader-writer lock next.
> 
> Again, should I be using qemu's -numa command-line option to create nodes?
> If so, what would be a sane configuration given 12 CPUs and 512MB of
> memory for the VM?  If not, what is a good way to exercise CNA's NUMA
> capabilities within a guest OS?

And the reader-writer lock run also shows "Enabling CNA spinlock".  Same
hardware and same 30-minute duration.

Writes:  Total: 25556912  Max/Min: 3225161/1664621   Fail: 0
Reads :  Total: 11530762  Max/Min: 1155679/794179   Fail: 0

These are both within a factor of two (1.9 and 1.5), hence no "???".

							Thanx, Paul

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-26 22:42                 ` Paul E. McKenney
  2020-01-26 23:32                   ` Paul E. McKenney
@ 2020-01-27  6:04                   ` Alex Kogan
  2020-01-27 14:11                   ` Waiman Long
  2 siblings, 0 replies; 39+ messages in thread
From: Alex Kogan @ 2020-01-27  6:04 UTC (permalink / raw)
  To: paulmck
  Cc: linux-arch, guohanjun, Arnd Bergmann, Peter Zijlstra, dave.dice,
	jglauber, x86, Will Deacon, linux, steven.sistare, linux-kernel,
	Ingo Molnar, bp, hpa, Waiman Long, tglx, daniel.m.jordan,
	linux-arm-kernel



> On Jan 26, 2020, at 5:42 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Sun, Jan 26, 2020 at 07:35:35AM -0800, Paul E. McKenney wrote:
>> On Sat, Jan 25, 2020 at 02:41:39PM -0500, Waiman Long wrote:
>>> On 1/24/20 11:58 PM, Paul E. McKenney wrote:
>>>> On Fri, Jan 24, 2020 at 09:17:05PM -0500, Waiman Long wrote:
>>>>> On 1/24/20 8:59 PM, Waiman Long wrote:
>>>>>>> You called it!  I will play with QEMU's -numa argument to see if I can get
>>>>>>> CNA to run for me.  Please accept my apologies for the false alarm.
>>>>>>> 
>>>>>>> 							Thanx, Paul
>>>>>>> 
>>>>>> CNA is not currently supported in a VM guest simply because the numa
>>>>>> information is not reliable. You will have to run it on baremetal to
>>>>>> test it. Sorry for that.
>>>>> Correction. There is a command line option to force CNA lock to be used
>>>>> in a VM. Use the "numa_spinlock=on" boot command line parameter.
>>>> As I understand it, I need to use a series of -numa arguments to qemu
>>>> combined with the numa_spinlock=on (or =1) on the kernel command line.
>>>> If the kernel thinks that there is only one NUMA node, it appears to
>>>> avoid doing CNA.
>>>> 
>>>> Correct?
>>>> 
>>>> 							Thanx, Paul
>>>> 
>>> In auto-detection mode (the default), CNA will only be turned on when
>>> paravirt qspinlock is not enabled first and there are at least 2 numa
>>> nodes. The "numa_spinlock=on" option will force it on even when both of
>>> the above conditions are false.
>> 
>> Hmmm...
>> 
>> Here is my kernel command line taken from the console log:
>> 
>> console=ttyS0 locktorture.onoff_interval=0 numa_spinlock=on locktorture.stat_interval=15 locktorture.shutdown_secs=1800 locktorture.verbose=1
>> 
>> Yet the string "Enabling CNA spinlock" does not appear.
>> 
>> Ah, idiot here needs to enable CONFIG_NUMA_AWARE_SPINLOCKS in his build.
>> Trying again with "--kconfig "CONFIG_NUMA_AWARE_SPINLOCKS=y"...
> 
> And after fixing that, plus adding the other three Kconfig options required
> to enable this, I really do see "Enabling CNA spinlock" in the console log.
> Yay!
Great! Your persistence paid off :)

Yet, CNA does not do much interesting here, as it sees only one numa node.

> 
> At the end of the 30-minute locktorture exclusive-lock run, I see this:
> 
> Writes:  Total: 572176565  Max/Min: 54167704/10878216 ???  Fail: 0
> 
> This is about a five-to-one ratio.  Is this expected behavior, given a
> single NUMA node on a single-socket system with 12 hardware threads?
I’m not sure what is expected here.
I’m guessing that if you boot your guest with the default 
(non-CNA/non-paravirt) qspinlock, you will get a similar result.

> 
> I will try reader-writer lock next.
> 
> Again, should I be using qemu's -numa command-line option to create nodes?
> If so, what would be a sane configuration given 12 CPUs and 512MB of
> memory for the VM?  If not, what is a good way to exercise CNA's NUMA
> capabilities within a guest OS?
That’s a good question. Perhaps Longman knows the answer?

Regards,
— Alex
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-26  0:32     ` Lihao Liang
  2020-01-26  1:58       ` Lihao Liang
@ 2020-01-27  6:16       ` Alex Kogan
  1 sibling, 0 replies; 39+ messages in thread
From: Alex Kogan @ 2020-01-27  6:16 UTC (permalink / raw)
  To: Lihao Liang
  Cc: linux-arch, guohanjun, arnd, Peter Zijlstra, dave.dice, jglauber,
	x86, will.deacon, linux, steven.sistare, linux-kernel, mingo, bp,
	hpa, longman, tglx, daniel.m.jordan, Will Deacon,
	linux-arm-kernel


>>> This is particularly relevant
>>> in high contention situations when new threads keep arriving on the same
>>> socket as the lock holder.
>> In this case, the lock will stay on the same NUMA node/socket for
>> 2^numa_spinlock_threshold times, which is the worst case scenario if we
>> consider the long-term fairness. And if we have multiple nodes, it will take
>> up to 2^numa_spinlock_threshold X (nr_nodes - 1) + nr_cpus_per_node
>> lock transitions until any given thread will acquire the lock
>> (assuming 2^numa_spinlock_threshold > nr_cpus_per_node).
>> 
> 
> You're right that the latest version of the patch handles long-term fairness
> deterministically.
> 
> As I understand it, the n-th thread in the main queue is guaranteed to
> acquire the lock after N lock handovers, where N is bounded by
> 
> n - 1 + 2^numa_spinlock_threshold * (nr_nodes - 1)
> 
> I'm not sure what role the variable nr_cpus_per_node plays in your analysis.

Yeah, that’s a minor point, but let me try to clarify.

The "n-th thread in the main queue” is (at most) the nr_cpus_per_node-th thread 
for some node k. So when the node k gets the preference, that thread will
get the lock after at most nr_cpus_per_node-1 lock transitions. As we consider
the upper bound, your analysis is also correct; mine is just a bit tighter.

Makes sense?

Regards,
— Alex


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-26 22:42                 ` Paul E. McKenney
  2020-01-26 23:32                   ` Paul E. McKenney
  2020-01-27  6:04                   ` Alex Kogan
@ 2020-01-27 14:11                   ` Waiman Long
  2020-01-27 15:09                     ` Paul E. McKenney
  2 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2020-01-27 14:11 UTC (permalink / raw)
  To: paulmck
  Cc: linux-arch, guohanjun, Arnd Bergmann, Peter Zijlstra, dave.dice,
	jglauber, x86, Will Deacon, linux, linux-kernel, Ingo Molnar, bp,
	hpa, Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On 1/26/20 5:42 PM, Paul E. McKenney wrote:
> On Sun, Jan 26, 2020 at 07:35:35AM -0800, Paul E. McKenney wrote:
>> On Sat, Jan 25, 2020 at 02:41:39PM -0500, Waiman Long wrote:
>>> On 1/24/20 11:58 PM, Paul E. McKenney wrote:
>>>> On Fri, Jan 24, 2020 at 09:17:05PM -0500, Waiman Long wrote:
>>>>> On 1/24/20 8:59 PM, Waiman Long wrote:
>>>>>>> You called it!  I will play with QEMU's -numa argument to see if I can get
>>>>>>> CNA to run for me.  Please accept my apologies for the false alarm.
>>>>>>>
>>>>>>> 							Thanx, Paul
>>>>>>>
>>>>>> CNA is not currently supported in a VM guest simply because the numa
>>>>>> information is not reliable. You will have to run it on baremetal to
>>>>>> test it. Sorry for that.
>>>>> Correction. There is a command line option to force CNA lock to be used
>>>>> in a VM. Use the "numa_spinlock=on" boot command line parameter.
>>>> As I understand it, I need to use a series of -numa arguments to qemu
>>>> combined with the numa_spinlock=on (or =1) on the kernel command line.
>>>> If the kernel thinks that there is only one NUMA node, it appears to
>>>> avoid doing CNA.
>>>>
>>>> Correct?
>>>>
>>>> 							Thanx, Paul
>>>>
>>> In auto-detection mode (the default), CNA will only be turned on when
>>> paravirt qspinlock is not enabled first and there are at least 2 numa
>>> nodes. The "numa_spinlock=on" option will force it on even when both of
>>> the above conditions are false.
>> Hmmm...
>>
>> Here is my kernel command line taken from the console log:
>>
>> console=ttyS0 locktorture.onoff_interval=0 numa_spinlock=on locktorture.stat_interval=15 locktorture.shutdown_secs=1800 locktorture.verbose=1
>>
>> Yet the string "Enabling CNA spinlock" does not appear.
>>
>> Ah, idiot here needs to enable CONFIG_NUMA_AWARE_SPINLOCKS in his build.
>> Trying again with "--kconfig "CONFIG_NUMA_AWARE_SPINLOCKS=y"...
> And after fixing that, plus adding the other three Kconfig options required
> to enable this, I really do see "Enabling CNA spinlock" in the console log.
> Yay!
>
> At the end of the 30-minute locktorture exclusive-lock run, I see this:
>
> Writes:  Total: 572176565  Max/Min: 54167704/10878216 ???  Fail: 0
>
> This is about a five-to-one ratio.  Is this expected behavior, given a
> single NUMA node on a single-socket system with 12 hardware threads?
Do you mean within the VM, lscpu showed that the system has one node and
12 threads per node? If that is the case, it should behave like regular
qspinlock and be fair.
>
> I will try reader-writer lock next.
>
> Again, should I be using qemu's -numa command-line option to create nodes?
> If so, what would be a sane configuration given 12 CPUs and 512MB of
> memory for the VM?  If not, what is a good way to exercise CNA's NUMA
> capabilities within a guest OS?

You can certainly play around with CNA in a VM. However, it is generally
not recommended to use CNA in a VM unless the VM cpu topology matches
the host with 1-to-1 vcpu pinning and there is no vcpu overcommit. In
this case, one may see some performance improvement using CNA by using
the "numa_spinlock=on" option to explicitly turn it on.

Because of the shuffling of queue entries, CNA is inherently less fair
than the regular qspinlock. However, a ratio of 5 seems excessive to me.
vcpu preemption may be a factor in contributing to this large variation.
My testing on bare metal only showed a throughput variation within
10-20% at most.

Cheers,
Longman


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-27 14:11                   ` Waiman Long
@ 2020-01-27 15:09                     ` Paul E. McKenney
       [not found]                       ` <9b3a3f16-5405-b6d1-d023-b85f4aab46dd@redhat.com>
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2020-01-27 15:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-arch, guohanjun, Arnd Bergmann, Peter Zijlstra, dave.dice,
	jglauber, x86, Will Deacon, linux, linux-kernel, Ingo Molnar, bp,
	hpa, Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On Mon, Jan 27, 2020 at 09:11:43AM -0500, Waiman Long wrote:
> On 1/26/20 5:42 PM, Paul E. McKenney wrote:
> > On Sun, Jan 26, 2020 at 07:35:35AM -0800, Paul E. McKenney wrote:
> >> On Sat, Jan 25, 2020 at 02:41:39PM -0500, Waiman Long wrote:
> >>> On 1/24/20 11:58 PM, Paul E. McKenney wrote:
> >>>> On Fri, Jan 24, 2020 at 09:17:05PM -0500, Waiman Long wrote:
> >>>>> On 1/24/20 8:59 PM, Waiman Long wrote:
> >>>>>>> You called it!  I will play with QEMU's -numa argument to see if I can get
> >>>>>>> CNA to run for me.  Please accept my apologies for the false alarm.
> >>>>>>>
> >>>>>>> 							Thanx, Paul
> >>>>>>>
> >>>>>> CNA is not currently supported in a VM guest simply because the numa
> >>>>>> information is not reliable. You will have to run it on baremetal to
> >>>>>> test it. Sorry for that.
> >>>>> Correction. There is a command line option to force CNA lock to be used
> >>>>> in a VM. Use the "numa_spinlock=on" boot command line parameter.
> >>>> As I understand it, I need to use a series of -numa arguments to qemu
> >>>> combined with the numa_spinlock=on (or =1) on the kernel command line.
> >>>> If the kernel thinks that there is only one NUMA node, it appears to
> >>>> avoid doing CNA.
> >>>>
> >>>> Correct?
> >>>>
> >>>> 							Thanx, Paul
> >>>>
> >>> In auto-detection mode (the default), CNA will only be turned on when
> >>> paravirt qspinlock is not enabled first and there are at least 2 numa
> >>> nodes. The "numa_spinlock=on" option will force it on even when both of
> >>> the above conditions are false.
> >> Hmmm...
> >>
> >> Here is my kernel command line taken from the console log:
> >>
> >> console=ttyS0 locktorture.onoff_interval=0 numa_spinlock=on locktorture.stat_interval=15 locktorture.shutdown_secs=1800 locktorture.verbose=1
> >>
> >> Yet the string "Enabling CNA spinlock" does not appear.
> >>
> >> Ah, idiot here needs to enable CONFIG_NUMA_AWARE_SPINLOCKS in his build.
> >> Trying again with "--kconfig "CONFIG_NUMA_AWARE_SPINLOCKS=y"...
> > And after fixing that, plus adding the other three Kconfig options required
> > to enable this, I really do see "Enabling CNA spinlock" in the console log.
> > Yay!
> >
> > At the end of the 30-minute locktorture exclusive-lock run, I see this:
> >
> > Writes:  Total: 572176565  Max/Min: 54167704/10878216 ???  Fail: 0
> >
> > This is about a five-to-one ratio.  Is this expected behavior, given a
> > single NUMA node on a single-socket system with 12 hardware threads?
> Do you mean within the VM, lscpu showed that the system has one node and
> 12 threads per node? If that is the case, it should behave like regular
> qspinlock and be fair.

I mean that I saw this in dmesg, which I believe to be telling me the
same thing as lscpu saying that there is one node, but you tell me!

[    0.007106] No NUMA configuration found
[    0.007107] Faking a node at [mem 0x0000000000000000-0x000000001ffdefff]
[    0.007111] NODE_DATA(0) allocated [mem 0x1ffdb000-0x1ffdefff]
[    0.007126] Zone ranges:
[    0.007127]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.007128]   DMA32    [mem 0x0000000001000000-0x000000001ffdefff]
[    0.007128]   Normal   empty
[    0.007129] Movable zone start for each node
[    0.007129] Early memory node ranges
[    0.007130]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.007132]   node   0: [mem 0x0000000000100000-0x000000001ffdefff]
[    0.007227] Zeroed struct page in unavailable ranges: 98 pages
[    0.007227] Initmem setup node 0 [mem 0x0000000000001000-0x000000001ffdefff]
[    0.007228] On node 0 totalpages: 130941
[    0.007231]   DMA zone: 64 pages used for memmap
[    0.007231]   DMA zone: 21 pages reserved
[    0.007232]   DMA zone: 3998 pages, LIFO batch:0
[    0.007266]   DMA32 zone: 1984 pages used for memmap
[    0.007267]   DMA32 zone: 126943 pages, LIFO batch:31

> > I will try reader-writer lock next.
> >
> > Again, should I be using qemu's -numa command-line option to create nodes?
> > If so, what would be a sane configuration given 12 CPUs and 512MB of
> > memory for the VM?  If not, what is a good way to exercise CNA's NUMA
> > capabilities within a guest OS?
> 
> You can certainly play around with CNA in a VM. However, it is generally
> not recommended to use CNA in a VM unless the VM cpu topology matches
> the host with 1-to-1 vcpu pinning and there is no vcpu overcommit. In
> this case, one may see some performance improvement using CNA by using
> the "numa_spinlock=on" option to explicitly turn it on.

Sorry, but I will not be booting this on bare metal on the systems that
I currently have access to.  No more than I run rcutorture on bare metal
on them, especially not with newly modified variants of RCU.  ;-)

> Because of the shuffling of queue entries, CNA is inherently less fair
> than the regular qspinlock. However, a ratio of 5 seems excessive to me.
> vcpu preemption may be a factor in contributing to this large variation.
> My testing on bare metal only showed a throughput variation within
> 10-20% at most.

OK.  Any guidance on qemu's -numa, or should I just experiment with it?
The latter will take me some time, as I must focus on other things
this week.

Alternatively, would it make sense for you to give it a spin in a VM?
After all, it is entirely possible that I still have some configuration
or another messed up.

							Thanx, Paul

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-26  1:58       ` Lihao Liang
@ 2020-01-27 16:01         ` Alex Kogan
  2020-01-29  1:39           ` Lihao Liang
  0 siblings, 1 reply; 39+ messages in thread
From: Alex Kogan @ 2020-01-27 16:01 UTC (permalink / raw)
  To: Lihao Liang
  Cc: linux-arch, guohanjun, arnd, Peter Zijlstra, dave.dice, jglauber,
	x86, will.deacon, linux, steven.sistare, linux-kernel, mingo, bp,
	hpa, longman, tglx, daniel.m.jordan, Will Deacon,
	linux-arm-kernel

Hi, Lihao.

>>> 
>>>> This is particularly relevant
>>>> in high contention situations when new threads keep arriving on the same
>>>> socket as the lock holder.
>>> In this case, the lock will stay on the same NUMA node/socket for
>>> 2^numa_spinlock_threshold times, which is the worst case scenario if we
>>> consider the long-term fairness. And if we have multiple nodes, it will take
>>> up to 2^numa_spinlock_threshold X (nr_nodes - 1) + nr_cpus_per_node
>>> lock transitions until any given thread will acquire the lock
>>> (assuming 2^numa_spinlock_threshold > nr_cpus_per_node).
>>> 
>> 
>> You're right that the latest version of the patch handles long-term fairness
>> deterministically.
>> 
>> As I understand it, the n-th thread in the main queue is guaranteed to
>> acquire the lock after N lock handovers, where N is bounded by
>> 
>> n - 1 + 2^numa_spinlock_threshold * (nr_nodes - 1)
>> 
>> I'm not sure what role the variable nr_cpus_per_node plays in your analysis.
>> 
>> Do I miss anything?
>> 
> 
> If I understand correctly, there are two phases in the algorithm:
> 
> MCS phase: when the secondary queue is empty, as explained in your emails,
> the algorithm hands the lock to threads in the main queue in an FIFO order.
> When probably(SHUFFLE_REDUCTION_PROB_ARG) returns false (with default
> probability 1%), if the algorithm finds the first thread running on the same
> socket as the lock holder in cna_scan_main_queue(), it enters the following
> CNA phase
Yep. When probably() returns false, we scan the main queue. If as the result of
this scan the secondary queue becomes not empty, we enter what you call
the CNA phase.

> .
> 
> CNA phase: when the secondary queue is not empty, the algorithm keeps
> handing the lock to threads in the main queue that run on the same socket as
> the lock holder. When 2^numa_spinlock_threshold is reached, it splices
> the secondary queue to the front of the main queue. And we are back to the
> MCS phase above.
Correct.

> For the n-th thread T in the main queue, the MCS phase handles threads that
> arrived in the main queue before T. In high contention situations, the CNA
> phase handles two kinds of threads:
> 
> 1. Threads ahead of T that run on the same socket as the lock holder when
> a transition from the MCS to CNA phase was made. Assume there are m such
> threads.
> 
> 2. Threads that keep arriving on the same socket as the lock holder. There
> are at most 2^numa_spinlock_threshold of them.
> 
> Then the number of lock handovers in the CNA phase is max(m,
> 2^numa_spinlock_threshold). So the total number of lock handovers before T
> acquires the lock is at most
> 
> n - 1 + 2^numa_spinlock_threshold * (nr_nodes - 1)
> 
> Please let me know if I misunderstand anything.
I think you got it right (modulo nr_cpus_per_node instead of n, as mentioned in 
my other response).

Regards,
— Alex
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
       [not found]                       ` <9b3a3f16-5405-b6d1-d023-b85f4aab46dd@redhat.com>
@ 2020-01-27 17:17                         ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2020-01-27 17:17 UTC (permalink / raw)
  To: paulmck
  Cc: linux-arch, guohanjun, Arnd Bergmann, Peter Zijlstra, dave.dice,
	jglauber, x86, Will Deacon, linux, linux-kernel, Ingo Molnar, bp,
	hpa, Alex Kogan, steven.sistare, tglx, daniel.m.jordan,
	linux-arm-kernel

On 1/27/20 12:12 PM, Waiman Long wrote:
> On 1/27/20 10:09 AM, Paul E. McKenney wrote:
>>> Do you mean within the VM, lscpu showed that the system has one node and
>>> 12 threads per node? If that is the case, it should behave like regular
>>> qspinlock and be fair.
>> I mean that I saw this in dmesg, which I believe to be telling me the
>> same thing as lscpu saying that there is one node, but you tell me!
>>
>> [    0.007106] No NUMA configuration found
>> [    0.007107] Faking a node at [mem 0x0000000000000000-0x000000001ffdefff]
>> [    0.007111] NODE_DATA(0) allocated [mem 0x1ffdb000-0x1ffdefff]
>> [    0.007126] Zone ranges:
>> [    0.007127]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
>> [    0.007128]   DMA32    [mem 0x0000000001000000-0x000000001ffdefff]
>> [    0.007128]   Normal   empty
>> [    0.007129] Movable zone start for each node
>> [    0.007129] Early memory node ranges
>> [    0.007130]   node   0: [mem 0x0000000000001000-0x000000000009efff]
>> [    0.007132]   node   0: [mem 0x0000000000100000-0x000000001ffdefff]
>> [    0.007227] Zeroed struct page in unavailable ranges: 98 pages
>> [    0.007227] Initmem setup node 0 [mem 0x0000000000001000-0x000000001ffdefff]
>> [    0.007228] On node 0 totalpages: 130941
>> [    0.007231]   DMA zone: 64 pages used for memmap
>> [    0.007231]   DMA zone: 21 pages reserved
>> [    0.007232]   DMA zone: 3998 pages, LIFO batch:0
>> [    0.007266]   DMA32 zone: 1984 pages used for memmap
>> [    0.007267]   DMA32 zone: 126943 pages, LIFO batch:31
>>
> That does look like just one node.
>>>> I will try reader-writer lock next.
>>>>
>>>> Again, should I be using qemu's -numa command-line option to create nodes?
>>>> If so, what would be a sane configuration given 12 CPUs and 512MB of
>>>> memory for the VM?  If not, what is a good way to exercise CNA's NUMA
>>>> capabilities within a guest OS?
>>> You can certainly play around with CNA in a VM. However, it is generally
>>> not recommended to use CNA in a VM unless the VM cpu topology matches
>>> the host with 1-to-1 vcpu pinning and there is no vcpu overcommit. In
>>> this case, one may see some performance improvement using CNA by using
>>> the "numa_spinlock=on" option to explicitly turn it on.
>> Sorry, but I will not be booting this on bare metal on the systems that
>> I currently have access to.  No more than I run rcutorture on bare metal
>> on them, especially not with newly modified variants of RCU.  ;-)
>>
>>> Because of the shuffling of queue entries, CNA is inherently less fair
>>> than the regular qspinlock. However, a ratio of 5 seems excessive to me.
>>> vcpu preemption may be a factor in contributing to this large variation.
>>> My testing on bare metal only showed a throughput variation within
>>> 10-20% at most.
>> OK.  Any guidance on qemu's -numa, or should I just experiment with it?
>> The latter will take me some time, as I must focus on other things
>> this week.
> To really test it, you should have multiple numa nodes, 2 or 4.
>> Alternatively, would it make sense for you to give it a spin in a VM?
>> After all, it is entirely possible that I still have some configuration
>> or another messed up.
>
> It all depends on what you want to test. If you want just to make sure
> that it won't fail any locking test. Yes, you can certainly do that in
> a VM. If you want to test for performance or fairness of CNA versus
> regular qspinlock, using a VM is not a proper platform. Even with a
> single node, you see 5x difference in locking count. I suspect that
> maybe one or a few vcpus got preempted pretty frequently to perform
> host activities that it screw up the data.
>
BTW, in a VM, the queued_spin_unlock function pointer will still be
pointing to the paravirt version even if CNA is used for the slowpath.
That shouldn't impact correctness, but is not optimal for performance.

Cheers,
Longman


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock
  2020-01-27 16:01         ` Alex Kogan
@ 2020-01-29  1:39           ` Lihao Liang
  0 siblings, 0 replies; 39+ messages in thread
From: Lihao Liang @ 2020-01-29  1:39 UTC (permalink / raw)
  To: Alex Kogan, longman
  Cc: linux-arch, guohanjun, arnd, Peter Zijlstra, dave.dice, jglauber,
	x86, will.deacon, linux, linux-kernel, Lihao Liang, mingo, bp,
	hpa, steven.sistare, tglx, daniel.m.jordan, Will Deacon,
	linux-arm-kernel

Hi Alex and Waiman,

On Mon, Jan 27, 2020 at 4:02 PM Alex Kogan <alex.kogan@oracle.com> wrote:
>
> Hi, Lihao.
>
> >>>
> >>>> This is particularly relevant
> >>>> in high contention situations when new threads keep arriving on the same
> >>>> socket as the lock holder.
> >>> In this case, the lock will stay on the same NUMA node/socket for
> >>> 2^numa_spinlock_threshold times, which is the worst case scenario if we
> >>> consider the long-term fairness. And if we have multiple nodes, it will take
> >>> up to 2^numa_spinlock_threshold X (nr_nodes - 1) + nr_cpus_per_node
> >>> lock transitions until any given thread will acquire the lock
> >>> (assuming 2^numa_spinlock_threshold > nr_cpus_per_node).
> >>>
> >>
> >> You're right that the latest version of the patch handles long-term fairness
> >> deterministically.
> >>
> >> As I understand it, the n-th thread in the main queue is guaranteed to
> >> acquire the lock after N lock handovers, where N is bounded by
> >>
> >> n - 1 + 2^numa_spinlock_threshold * (nr_nodes - 1)
> >>
> >> I'm not sure what role the variable nr_cpus_per_node plays in your analysis.
> >>
> >> Do I miss anything?
> >>
> >
> > If I understand correctly, there are two phases in the algorithm:
> >
> > MCS phase: when the secondary queue is empty, as explained in your emails,
> > the algorithm hands the lock to threads in the main queue in an FIFO order.
> > When probably(SHUFFLE_REDUCTION_PROB_ARG) returns false (with default
> > probability 1%), if the algorithm finds the first thread running on the same
> > socket as the lock holder in cna_scan_main_queue(), it enters the following
> > CNA phase
> Yep. When probably() returns false, we scan the main queue. If as the result of
> this scan the secondary queue becomes not empty, we enter what you call
> the CNA phase.
>

As I understand it, the probability of making a transition from the
MCS to CNA phase
in less than N lock handovers is 1 - p^N, where p is the probability
that probably()
returns true (default 99%). So in high contention situations where N can become
quite large in a relatively short period of time, the probability of
getting into the CNA
phase is high, e.g. 95% when N = 300.

I was wondering whether it would be possible to detect contention and make a
phase transition deterministically, maybe by reusing the intra_count
variable to keep
track of the processing rate in the MCS phase?

As Will pointed out earlier, this would make formal analysis and
verification of the
CNA qspinlock much more feasible.

> > .
> >
> > CNA phase: when the secondary queue is not empty, the algorithm keeps
> > handing the lock to threads in the main queue that run on the same socket as
> > the lock holder. When 2^numa_spinlock_threshold is reached, it splices
> > the secondary queue to the front of the main queue. And we are back to the
> > MCS phase above.
> Correct.
>
> > For the n-th thread T in the main queue, the MCS phase handles threads that
> > arrived in the main queue before T. In high contention situations, the CNA
> > phase handles two kinds of threads:
> >
> > 1. Threads ahead of T that run on the same socket as the lock holder when
> > a transition from the MCS to CNA phase was made. Assume there are m such
> > threads.
> >
> > 2. Threads that keep arriving on the same socket as the lock holder. There
> > are at most 2^numa_spinlock_threshold of them.
> >
> > Then the number of lock handovers in the CNA phase is max(m,
> > 2^numa_spinlock_threshold). So the total number of lock handovers before T
> > acquires the lock is at most
> >
> > n - 1 + 2^numa_spinlock_threshold * (nr_nodes - 1)
> >
> > Please let me know if I misunderstand anything.
> I think you got it right (modulo nr_cpus_per_node instead of n, as mentioned in
> my other response).
>

Make sense. Thanks a lot for the clarification :)

Best,
Lihao.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, back to index

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-15  3:59 [PATCH v9 0/5] Add NUMA-awareness to qspinlock Alex Kogan
2020-01-15  3:59 ` [PATCH v9 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
2020-01-15  3:59 ` [PATCH v9 2/5] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
2020-01-15  3:59 ` [PATCH v9 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
2020-01-23  9:26   ` Peter Zijlstra
2020-01-23 10:06     ` Peter Zijlstra
2020-01-23 10:16       ` Peter Zijlstra
2020-01-23 11:22         ` Will Deacon
2020-01-23 13:17           ` Peter Zijlstra
2020-01-23 14:15   ` Waiman Long
2020-01-23 15:29     ` Peter Zijlstra
2020-01-15  3:59 ` [PATCH v9 4/5] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
2020-01-23 19:55   ` Waiman Long
2020-01-23 20:39     ` Waiman Long
2020-01-23 23:39       ` Alex Kogan
2020-01-15  3:59 ` [PATCH v9 5/5] locking/qspinlock: Introduce the shuffle reduction optimization " Alex Kogan
2020-01-22 11:45 ` [PATCH v9 0/5] Add NUMA-awareness to qspinlock Lihao Liang
2020-01-22 17:24   ` Waiman Long
2020-01-23 11:35     ` Will Deacon
2020-01-23 15:25       ` Waiman Long
2020-01-23 19:08         ` Waiman Long
2020-01-22 19:29   ` Alex Kogan
2020-01-26  0:32     ` Lihao Liang
2020-01-26  1:58       ` Lihao Liang
2020-01-27 16:01         ` Alex Kogan
2020-01-29  1:39           ` Lihao Liang
2020-01-27  6:16       ` Alex Kogan
2020-01-24 22:24 ` Paul E. McKenney
     [not found]   ` <6AAE7FC6-F5DE-4067-8BC4-77F27948CD09@oracle.com>
2020-01-25  0:57     ` Paul E. McKenney
2020-01-25  1:59       ` Waiman Long
     [not found]         ` <adb4fb09-f374-4d64-096b-ba9ad8b35fd5@redhat.com>
2020-01-25  4:58           ` Paul E. McKenney
2020-01-25 19:41             ` Waiman Long
2020-01-26 15:35               ` Paul E. McKenney
2020-01-26 22:42                 ` Paul E. McKenney
2020-01-26 23:32                   ` Paul E. McKenney
2020-01-27  6:04                   ` Alex Kogan
2020-01-27 14:11                   ` Waiman Long
2020-01-27 15:09                     ` Paul E. McKenney
     [not found]                       ` <9b3a3f16-5405-b6d1-d023-b85f4aab46dd@redhat.com>
2020-01-27 17:17                         ` Waiman Long

Linux-ARM-Kernel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-arm-kernel/0 linux-arm-kernel/git/0.git
	git clone --mirror https://lore.kernel.org/linux-arm-kernel/1 linux-arm-kernel/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-arm-kernel linux-arm-kernel/ https://lore.kernel.org/linux-arm-kernel \
		linux-arm-kernel@lists.infradead.org
	public-inbox-index linux-arm-kernel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.infradead.lists.linux-arm-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git