* [PATCH v7 0/5] Add NUMA-awareness to qspinlock
@ 2019-11-25 21:07 ` Alex Kogan
  0 siblings, 0 replies; 48+ messages in thread
From: Alex Kogan @ 2019-11-25 21:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice, rahul.x.yadav

Minor change from v6:
---------------------

- fixed a 32-bit build failure by adding a dependency on 64BIT in Kconfig.
Reported-by: kbuild test robot <lkp@intel.com>


Summary
-------

Lock throughput can be increased by handing a lock to a waiter on the
same NUMA node as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA nodes. This series introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock. It is
enabled through a configuration option (NUMA_AWARE_SPINLOCKS).

CNA is a NUMA-aware version of the MCS lock. Spinning threads are
organized in two queues, a main queue for threads running on the same
node as the current lock holder, and a secondary queue for threads
running on other nodes. Threads store the ID of the node on which
they are running in their queue nodes. After acquiring the MCS lock and
before acquiring the spinlock, the lock holder scans the main queue
looking for a thread running on the same node (pre-scan). If found (call
it thread T), all threads in the main queue between the current lock
holder and T are moved to the end of the secondary queue. If no such T
is found, another scan of the main queue is made when unlocking the MCS
lock (post-scan), after the spinlock has been acquired, starting at the
queue node where the pre-scan stopped. If both scans fail to find such T, the
MCS lock is passed to the first thread in the secondary queue. If the
secondary queue is empty, the MCS lock is passed to the next thread in the
main queue. To avoid starvation of threads in the secondary queue, those
threads are moved back to the head of the main queue after a certain
number of intra-node lock hand-offs.

More details are available at https://arxiv.org/abs/1810.05600.
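
As an illustration only, the hand-off decision described above can be
sketched as follows (the helpers below are hypothetical placeholders;
the actual implementation lives in kernel/locking/qspinlock_cna.h,
added in patch 3):

	/*
	 * Simplified sketch of the CNA hand-off, not the patch code.
	 * Waiters skipped over during the scan are moved to the
	 * secondary queue; periodic splicing of that queue back into
	 * the main one (starvation avoidance) is omitted here.
	 */
	static void cna_handoff_sketch(struct mcs_spinlock *node)
	{
		/* pre-scan/post-scan: look for a waiter on the holder's node */
		struct mcs_spinlock *succ = find_waiter_on_my_numa_node(node);

		if (succ)				/* same-node waiter found */
			pass_mcs_lock_to(succ);
		else if (has_secondary_queue(node))	/* no same-node waiter */
			pass_mcs_lock_to(secondary_queue_head(node));
		else					/* plain MCS hand-off */
			pass_mcs_lock_to(node->next);
	}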

The series applies on top of v5.4.0, commit eae56099de85.
Performance numbers are available in previous revisions
of the series.

Further comments are welcome and appreciated.

Alex Kogan (5):
  locking/qspinlock: Rename mcs lock/unlock macros and make them more
    generic
  locking/qspinlock: Refactor the qspinlock slow path
  locking/qspinlock: Introduce CNA into the slow path of qspinlock
  locking/qspinlock: Introduce starvation avoidance into CNA
  locking/qspinlock: Introduce the shuffle reduction optimization into
    CNA

 .../admin-guide/kernel-parameters.txt         |  18 +
 arch/arm/include/asm/mcs_spinlock.h           |   6 +-
 arch/x86/Kconfig                              |  20 ++
 arch/x86/include/asm/qspinlock.h              |   4 +
 arch/x86/kernel/alternative.c                 |  70 ++++
 include/asm-generic/mcs_spinlock.h            |   4 +-
 kernel/locking/mcs_spinlock.h                 |  20 +-
 kernel/locking/qspinlock.c                    |  77 ++++-
 kernel/locking/qspinlock_cna.h                | 319 ++++++++++++++++++
 kernel/locking/qspinlock_paravirt.h           |   2 +-
 10 files changed, 517 insertions(+), 23 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

-- 
2.21.0 (Apple Git-122.2)


* [PATCH v7 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic
  2019-11-25 21:07 ` Alex Kogan
@ 2019-11-25 21:07   ` Alex Kogan
  -1 siblings, 0 replies; 48+ messages in thread
From: Alex Kogan @ 2019-11-25 21:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice, rahul.x.yadav

The MCS unlock macro (arch_mcs_pass_lock) should accept the value to be
stored into the lock as an additional argument. This allows the same
macro to be used in cases where the value stored when passing the lock
is different from 1.
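
For illustration (not part of this patch), the extra argument lets a
caller hand off an arbitrary value together with the lock; a later patch
in this series uses this to pass an encoded tail, roughly as follows:

	/* Sketch: pass whatever is stored in @locked (or 1) to the successor. */
	u32 val = node->locked ? node->locked : 1;

	arch_mcs_pass_lock(&next->locked, val);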

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 arch/arm/include/asm/mcs_spinlock.h |  6 +++---
 include/asm-generic/mcs_spinlock.h  |  4 ++--
 kernel/locking/mcs_spinlock.h       | 18 +++++++++---------
 kernel/locking/qspinlock.c          |  4 ++--
 kernel/locking/qspinlock_paravirt.h |  2 +-
 5 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf4d06f..693fe6ce3c43 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -6,7 +6,7 @@
 #include <asm/spinlock.h>
 
 /* MCS spin-locking. */
-#define arch_mcs_spin_lock_contended(lock)				\
+#define arch_mcs_spin_lock(lock)				\
 do {									\
 	/* Ensure prior stores are observed before we enter wfe. */	\
 	smp_mb();							\
@@ -14,9 +14,9 @@ do {									\
 		wfe();							\
 } while (0)								\
 
-#define arch_mcs_spin_unlock_contended(lock)				\
+#define arch_mcs_pass_lock(lock, val)					\
 do {									\
-	smp_store_release(lock, 1);					\
+	smp_store_release((lock), (val));				\
 	dsb_sev();							\
 } while (0)
 
diff --git a/include/asm-generic/mcs_spinlock.h b/include/asm-generic/mcs_spinlock.h
index 10cd4ffc6ba2..868da43dba7c 100644
--- a/include/asm-generic/mcs_spinlock.h
+++ b/include/asm-generic/mcs_spinlock.h
@@ -4,8 +4,8 @@
 /*
  * Architectures can define their own:
  *
- *   arch_mcs_spin_lock_contended(l)
- *   arch_mcs_spin_unlock_contended(l)
+ *   arch_mcs_spin_lock(l)
+ *   arch_mcs_pass_lock(l, val)
  *
  * See kernel/locking/mcs_spinlock.c.
  */
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153b4d3c..52d06ec6f525 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -21,7 +21,7 @@ struct mcs_spinlock {
 	int count;  /* nesting count, see qspinlock.c */
 };
 
-#ifndef arch_mcs_spin_lock_contended
+#ifndef arch_mcs_spin_lock
 /*
  * Using smp_cond_load_acquire() provides the acquire semantics
  * required so that subsequent operations happen after the
@@ -29,20 +29,20 @@ struct mcs_spinlock {
  * ARM64 would like to do spin-waiting instead of purely
  * spinning, and smp_cond_load_acquire() provides that behavior.
  */
-#define arch_mcs_spin_lock_contended(l)					\
-do {									\
-	smp_cond_load_acquire(l, VAL);					\
+#define arch_mcs_spin_lock(l)					\
+do {								\
+	smp_cond_load_acquire(l, VAL);				\
 } while (0)
 #endif
 
-#ifndef arch_mcs_spin_unlock_contended
+#ifndef arch_mcs_pass_lock
 /*
  * smp_store_release() provides a memory barrier to ensure all
  * operations in the critical section has been completed before
  * unlocking.
  */
-#define arch_mcs_spin_unlock_contended(l)				\
-	smp_store_release((l), 1)
+#define arch_mcs_pass_lock(l, val)				\
+	smp_store_release((l), (val))
 #endif
 
 /*
@@ -91,7 +91,7 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	WRITE_ONCE(prev->next, node);
 
 	/* Wait until the lock holder passes the lock down. */
-	arch_mcs_spin_lock_contended(&node->locked);
+	arch_mcs_spin_lock(&node->locked);
 }
 
 /*
@@ -115,7 +115,7 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 	}
 
 	/* Pass lock to next waiter. */
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_pass_lock(&next->locked, 1);
 }
 
 #endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 2473f10c6956..804c0fbd6328 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -470,7 +470,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
-		arch_mcs_spin_lock_contended(&node->locked);
+		arch_mcs_spin_lock(&node->locked);
 
 		/*
 		 * While waiting for the MCS lock, the next pointer may have
@@ -549,7 +549,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_spin_unlock_contended(&next->locked);
+	arch_mcs_pass_lock(&next->locked, 1);
 	pv_kick_node(lock, next);
 
 release:
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e84d21aa0722..e98079414671 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -368,7 +368,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	 *
 	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
 	 *
-	 * The write to next->locked in arch_mcs_spin_unlock_contended()
+	 * The write to next->locked in arch_mcs_pass_lock()
 	 * must be ordered before the read of pn->state in the cmpxchg()
 	 * below for the code to work correctly. To guarantee full ordering
 	 * irrespective of the success or failure of the cmpxchg(),
-- 
2.21.0 (Apple Git-122.2)


* [PATCH v7 2/5] locking/qspinlock: Refactor the qspinlock slow path
  2019-11-25 21:07 ` Alex Kogan
@ 2019-11-25 21:07   ` Alex Kogan
  -1 siblings, 0 replies; 48+ messages in thread
From: Alex Kogan @ 2019-11-25 21:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice, rahul.x.yadav

Move some of the code that manipulates the spinlock into separate
functions. This allows alternative implementations of those operations
to be integrated more easily.
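
To illustrate the intent (this mirrors what a later patch in the series
does), an alternative slow path can be generated by redefining these
hooks and re-including the file, roughly:

	/* Sketch of how an alternative implementation plugs in. */
	#undef try_clear_tail
	#define try_clear_tail		cna_try_change_tail

	#undef mcs_pass_lock
	#define mcs_pass_lock		cna_pass_lock

	#include "qspinlock.c"	/* regenerate the slow path with the new hooks */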

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 kernel/locking/qspinlock.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 804c0fbd6328..c06d1e8075d9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -288,6 +288,34 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
 #endif
 
+/*
+ * __try_clear_tail - try to clear tail by setting the lock value to
+ * _Q_LOCKED_VAL.
+ * @lock: Pointer to the queued spinlock structure
+ * @val: Current value of the lock
+ * @node: Pointer to the MCS node of the lock holder
+ */
+static __always_inline bool __try_clear_tail(struct qspinlock *lock,
+						   u32 val,
+						   struct mcs_spinlock *node)
+{
+	return atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL);
+}
+
+/*
+ * __mcs_pass_lock - pass the MCS lock to the next waiter
+ * @node: Pointer to the MCS node of the lock holder
+ * @next: Pointer to the MCS node of the first waiter in the MCS queue
+ */
+static __always_inline void __mcs_pass_lock(struct mcs_spinlock *node,
+					    struct mcs_spinlock *next)
+{
+	arch_mcs_pass_lock(&next->locked, 1);
+}
+
+#define try_clear_tail	__try_clear_tail
+#define mcs_pass_lock		__mcs_pass_lock
+
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
 /**
@@ -532,7 +560,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *       PENDING will make the uncontended transition fail.
 	 */
 	if ((val & _Q_TAIL_MASK) == tail) {
-		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+		if (try_clear_tail(lock, val, node))
 			goto release; /* No contention */
 	}
 
@@ -549,7 +577,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	arch_mcs_pass_lock(&next->locked, 1);
+	mcs_pass_lock(node, next);
 	pv_kick_node(lock, next);
 
 release:
@@ -574,6 +602,12 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #undef pv_kick_node
 #undef pv_wait_head_or_lock
 
+#undef try_clear_tail
+#define try_clear_tail		__try_clear_tail
+
+#undef mcs_pass_lock
+#define mcs_pass_lock			__mcs_pass_lock
+
 #undef  queued_spin_lock_slowpath
 #define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
 
-- 
2.21.0 (Apple Git-122.2)


* [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-11-25 21:07 ` Alex Kogan
@ 2019-11-25 21:07   ` Alex Kogan
  -1 siblings, 0 replies; 48+ messages in thread
From: Alex Kogan @ 2019-11-25 21:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice, rahul.x.yadav

In CNA, spinning threads are organized in two queues, a main queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. After acquiring the
MCS lock and before acquiring the spinlock, the lock holder scans the
main queue looking for a thread running on the same node (pre-scan). If
found (call it thread T), all threads in the main queue between the
current lock holder and T are moved to the end of the secondary queue.
If such T is not found, we make another scan of the main queue when
unlocking the MCS lock (post-scan), starting at the position where
pre-scan stopped. If both scans fail to find such T, the MCS lock is
passed to the first thread in the secondary queue. If the secondary queue
is empty, the lock is passed to the next thread in the main queue.
For more details, see https://arxiv.org/abs/1810.05600.

Note that this variant of CNA may introduce starvation by continuously
passing the lock to threads running on the same node. This issue
will be addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at
boot time only if the kernel runs on a multi-node machine in a native
environment and the new config option is enabled. (For the time being,
the patching requires CONFIG_PARAVIRT_SPINLOCKS to be enabled as well.
However, this should be resolved once static_call() is available.) This
default behavior can be overridden with the new kernel boot command-line
option "numa_spinlock=on/off" (default is "auto").

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 .../admin-guide/kernel-parameters.txt         |  10 +
 arch/x86/Kconfig                              |  20 ++
 arch/x86/include/asm/qspinlock.h              |   4 +
 arch/x86/kernel/alternative.c                 |  43 +++
 kernel/locking/mcs_spinlock.h                 |   2 +-
 kernel/locking/qspinlock.c                    |  34 ++-
 kernel/locking/qspinlock_cna.h                | 264 ++++++++++++++++++
 7 files changed, 372 insertions(+), 5 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8dee8f68fe15..904cb32f592d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3175,6 +3175,16 @@
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
+	numa_spinlock=	[NUMA, PV_OPS] Select the NUMA-aware variant
+			of spinlock. The options are:
+			auto - Enable this variant if running on a multi-node
+			machine in native environment.
+			on  - Unconditionally enable this variant.
+			off - Unconditionally disable this variant.
+
+			Not specifying this option is equivalent to
+			numa_spinlock=auto.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8ef85139553f..0b53e3e008ca 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1573,6 +1573,26 @@ config NUMA
 
 	  Otherwise, you should say N.
 
+config NUMA_AWARE_SPINLOCKS
+	bool "Numa-aware spinlocks"
+	depends on NUMA
+	depends on QUEUED_SPINLOCKS
+	depends on 64BIT
+	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
+	# This is awkward, but hopefully would be resolved once static_call()
+	# is available.
+	depends on PARAVIRT_SPINLOCKS
+	default y
+	help
+	  Introduce NUMA (Non Uniform Memory Access) awareness into
+	  the slow path of spinlocks.
+
+	  In this variant of qspinlock, the kernel will try to keep the lock
+	  on the same node, thus reducing the number of remote cache misses,
+	  while trading some of the short term fairness for better performance.
+
+	  Say N if you want absolute first come first serve fairness.
+
 config AMD_NUMA
 	def_bool y
 	prompt "Old style AMD Opteron NUMA detection"
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 444d6fd9a6d8..6fa8fcc5c7af 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -27,6 +27,10 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
 	return val;
 }
 
+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+extern void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+#endif
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 9d3a971ea364..6a4ccbf4e09c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -698,6 +698,33 @@ static void __init int3_selftest(void)
 	unregister_die_notifier(&int3_exception_nb);
 }
 
+#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+/*
+ * Constant (boot-param configurable) flag selecting the NUMA-aware variant
+ * of spinlock.  Possible values: -1 (off) / 0 (auto, default) / 1 (on).
+ */
+static int numa_spinlock_flag;
+
+static int __init numa_spinlock_setup(char *str)
+{
+	if (!strcmp(str, "auto")) {
+		numa_spinlock_flag = 0;
+		return 1;
+	} else if (!strcmp(str, "on")) {
+		numa_spinlock_flag = 1;
+		return 1;
+	} else if (!strcmp(str, "off")) {
+		numa_spinlock_flag = -1;
+		return 1;
+	}
+
+	return 0;
+}
+
+__setup("numa_spinlock=", numa_spinlock_setup);
+
+#endif
+
 void __init alternative_instructions(void)
 {
 	int3_selftest();
@@ -738,6 +765,22 @@ void __init alternative_instructions(void)
 	}
 #endif
 
+#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+	/*
+	 * By default, switch to the NUMA-friendly slow path for
+	 * spinlocks when we have multiple NUMA nodes in native environment.
+	 */
+	if ((numa_spinlock_flag == 1) ||
+	    (numa_spinlock_flag == 0 && nr_node_ids > 1 &&
+		    pv_ops.lock.queued_spin_lock_slowpath ==
+			native_queued_spin_lock_slowpath)) {
+		pv_ops.lock.queued_spin_lock_slowpath =
+		    __cna_queued_spin_lock_slowpath;
+
+		pr_info("Enabling CNA spinlock\n");
+	}
+#endif
+
 	apply_paravirt(__parainstructions, __parainstructions_end);
 
 	restart_nmi();
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 52d06ec6f525..e40b9538b79f 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,7 +17,7 @@
 
 struct mcs_spinlock {
 	struct mcs_spinlock *next;
-	int locked; /* 1 if lock acquired */
+	unsigned int locked; /* 1 if lock acquired */
 	int count;  /* nesting count, see qspinlock.c */
 };
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index c06d1e8075d9..6d8c4a52e44e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -11,7 +11,7 @@
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
 
 #include <linux/smp.h>
 #include <linux/bug.h>
@@ -70,7 +70,8 @@
 /*
  * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
  * size and four of them will fit nicely in one 64-byte cacheline. For
- * pvqspinlock, however, we need more space for extra data. To accommodate
+ * pvqspinlock, however, we need more space for extra data. The same also
+ * applies for the NUMA-aware variant of spinlocks (CNA). To accommodate
  * that, we insert two more long words to pad it up to 32 bytes. IOW, only
  * two of them can fit in a cacheline in this case. That is OK as it is rare
  * to have more than 2 levels of slowpath nesting in actual use. We don't
@@ -79,7 +80,7 @@
  */
 struct qnode {
 	struct mcs_spinlock mcs;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 	long reserved[2];
 #endif
 };
@@ -103,6 +104,8 @@ struct qnode {
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
  *
  * PV doubles the storage and uses the second cacheline for PV state.
+ * CNA also doubles the storage and uses the second cacheline for
+ * CNA-specific state.
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
@@ -316,7 +319,7 @@ static __always_inline void __mcs_pass_lock(struct mcs_spinlock *node,
 #define try_clear_tail	__try_clear_tail
 #define mcs_pass_lock		__mcs_pass_lock
 
-#endif /* _GEN_PV_LOCK_SLOWPATH */
+#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
 
 /**
  * queued_spin_lock_slowpath - acquire the queued spinlock
@@ -588,6 +591,29 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
+/*
+ * Generate the code for NUMA-aware spinlocks
+ */
+#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+#define _GEN_CNA_LOCK_SLOWPATH
+
+#undef pv_wait_head_or_lock
+#define pv_wait_head_or_lock		cna_pre_scan
+
+#undef try_clear_tail
+#define try_clear_tail			cna_try_change_tail
+
+#undef mcs_pass_lock
+#define mcs_pass_lock			cna_pass_lock
+
+#undef  queued_spin_lock_slowpath
+#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
+
+#include "qspinlock_cna.h"
+#include "qspinlock.c"
+
+#endif
+
 /*
  * Generate the paravirt code for queued_spin_unlock_slowpath().
  */
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
new file mode 100644
index 000000000000..a638336f9560
--- /dev/null
+++ b/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,264 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+#include <linux/topology.h>
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a main queue for
+ * threads running on the same NUMA node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. Schematically, it
+ * looks like this:
+ *
+ *    cna_node
+ *   +----------+    +--------+        +--------+
+ *   |mcs:next  | -> |mcs:next| -> ... |mcs:next| -> NULL      [Main queue]
+ *   |mcs:locked| -+ +--------+        +--------+
+ *   +----------+  |
+ *                 +----------------------+
+ *                                        \/
+ *                 +--------+         +--------+
+ *                 |mcs:next| -> ...  |mcs:next|          [Secondary queue]
+ *                 +--------+         +--------+
+ *                     ^                    |
+ *                     +--------------------+
+ *
+ * N.B. locked = 1 if secondary queue is absent. Otherwise, it contains the
+ * encoded pointer to the tail of the secondary queue, which is organized as a
+ * circular list.
+ *
+ * After acquiring the MCS lock and before acquiring the spinlock, the lock
+ * holder scans the main queue looking for a thread running on the same node
+ * (pre-scan). If found (call it thread T), all threads in the main queue
+ * between the current lock holder and T are moved to the end of the secondary
+ * queue.  If such T is not found, we make another scan of the main queue when
+ * unlocking the MCS lock (post-scan), starting at the node where pre-scan
+ * stopped. If both scans fail to find such T, the MCS lock is passed to the
+ * first thread in the secondary queue. If the secondary queue is empty, the
+ * lock is passed to the next thread in the main queue.
+ *
+ * For more details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <alex.kogan@oracle.com>
+ *          Dave Dice <dave.dice@oracle.com>
+ */
+
+struct cna_node {
+	struct mcs_spinlock	mcs;
+	int			numa_node;
+	u32			encoded_tail;
+	u32			pre_scan_result; /* 0 or encoded tail */
+};
+
+static void __init cna_init_nodes_per_cpu(unsigned int cpu)
+{
+	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
+	int numa_node = cpu_to_node(cpu);
+	int i;
+
+	for (i = 0; i < MAX_NODES; i++) {
+		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
+
+		cn->numa_node = numa_node;
+		cn->encoded_tail = encode_tail(cpu, i);
+		/*
+		 * @encoded_tail has to be larger than 1, so we do not confuse
+		 * it with other valid values for @locked or @pre_scan_result
+		 * (0 or 1)
+		 */
+		WARN_ON(cn->encoded_tail <= 1);
+	}
+}
+
+static int __init cna_init_nodes(void)
+{
+	unsigned int cpu;
+
+	/*
+	 * this will break on 32bit architectures, so we restrict
+	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
+	 */
+	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+	/* we store an encoded tail word in the node's @locked field */
+	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
+
+	for_each_possible_cpu(cpu)
+		cna_init_nodes_per_cpu(cpu);
+
+	return 0;
+}
+early_initcall(cna_init_nodes);
+
+static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
+				       struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *head_2nd, *tail_2nd;
+	u32 new;
+
+	/* If the secondary queue is empty, do what MCS does. */
+	if (node->locked <= 1)
+		return __try_clear_tail(lock, val, node);
+
+	/*
+	 * Try to update the tail value to the last node in the secondary queue.
+	 * If successful, pass the lock to the first thread in the secondary
+	 * queue. Doing those two actions effectively moves all nodes from the
+	 * secondary queue into the main one.
+	 */
+	tail_2nd = decode_tail(node->locked);
+	head_2nd = tail_2nd->next;
+	new = ((struct cna_node *)tail_2nd)->encoded_tail + _Q_LOCKED_VAL;
+
+	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
+		/*
+		 * Try to reset @next in tail_2nd to NULL, but no need to check
+		 * the result - if failed, a new successor has updated it.
+		 */
+		cmpxchg_relaxed(&tail_2nd->next, head_2nd, NULL);
+		arch_mcs_pass_lock(&head_2nd->locked, 1);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * cna_splice_tail -- splice nodes in the main queue between [first, last]
+ * onto the secondary queue.
+ */
+static void cna_splice_tail(struct mcs_spinlock *node,
+			    struct mcs_spinlock *first,
+			    struct mcs_spinlock *last)
+{
+	/* remove [first,last] */
+	node->next = last->next;
+
+	/* stick [first,last] on the secondary queue tail */
+	if (node->locked <= 1) { /* if secondary queue is empty */
+		/* create secondary queue */
+		last->next = first;
+	} else {
+		/* add to the tail of the secondary queue */
+		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
+		struct mcs_spinlock *head_2nd = tail_2nd->next;
+
+		tail_2nd->next = first;
+		last->next = head_2nd;
+	}
+
+	node->locked = ((struct cna_node *)last)->encoded_tail;
+}
+
+/*
+ * cna_scan_main_queue - scan the main waiting queue looking for the first
+ * thread running on the same NUMA node as the lock holder. If found (call it
+ * thread T), move all threads in the main queue between the lock holder and
+ * T to the end of the secondary queue and return 0; otherwise, return the
+ * encoded pointer of the last scanned node in the primary queue (so a
+ * subsequent scan can be resumed from that node)
+ *
+ * Schematically, this may look like the following (nn stands for numa_node and
+ * et stands for encoded_tail).
+ *
+ *   when cna_scan_main_queue() is called (the secondary queue is empty):
+ *
+ *  A+------------+   B+--------+   C+--------+   T+--------+
+ *   |mcs:next    | -> |mcs:next| -> |mcs:next| -> |mcs:next| -> NULL
+ *   |mcs:locked=1|    |cna:nn=0|    |cna:nn=2|    |cna:nn=1|
+ *   |cna:nn=1    |    +--------+    +--------+    +--------+
+ *   +----------- +
+ *
+ *   when cna_scan_main_queue() returns (the secondary queue contains B and C):
+ *
+ *  A+----------------+    T+--------+
+ *   |mcs:next        | ->  |mcs:next| -> NULL
+ *   |mcs:locked=C.et | -+  |cna:nn=1|
+ *   |cna:nn=1        |  |  +--------+
+ *   +--------------- +  +-----+
+ *                             \/
+ *          B+--------+   C+--------+
+ *           |mcs:next| -> |mcs:next| -+
+ *           |cna:nn=0|    |cna:nn=2|  |
+ *           +--------+    +--------+  |
+ *               ^                     |
+ *               +---------------------+
+ *
+ * The worst case complexity of the scan is O(n), where n is the number
+ * of current waiters. However, the amortized complexity is close to O(1),
+ * as the immediate successor is likely to be running on the same node once
+ * threads from other nodes are moved to the secondary queue.
+ */
+static u32 cna_scan_main_queue(struct mcs_spinlock *node,
+			       struct mcs_spinlock *pred_start)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+	struct cna_node *cni = (struct cna_node *)READ_ONCE(pred_start->next);
+	struct cna_node *last;
+	int my_numa_node = cn->numa_node;
+
+	/* find any next waiter on 'our' NUMA node */
+	for (last = cn;
+	     cni && cni->numa_node != my_numa_node;
+	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
+		;
+
+	/* if found, splice any skipped waiters onto the secondary queue */
+	if (cni) {
+		if (last != cn)	/* did we skip any waiters? */
+			cna_splice_tail(node, node->next,
+					(struct mcs_spinlock *)last);
+		return 0;
+	}
+
+	return last->encoded_tail;
+}
+
+__always_inline u32 cna_pre_scan(struct qspinlock *lock,
+				  struct mcs_spinlock *node)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+
+	cn->pre_scan_result = cna_scan_main_queue(node, node);
+
+	return 0;
+}
+
+static inline void cna_pass_lock(struct mcs_spinlock *node,
+				 struct mcs_spinlock *next)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+	struct mcs_spinlock *next_holder = next, *tail_2nd;
+	u32 val = 1;
+
+	u32 scan = cn->pre_scan_result;
+
+	/*
+	 * check if a successor from the same numa node has not been found in
+	 * pre-scan, and if so, try to find it in post-scan starting from the
+	 * node where pre-scan stopped (stored in @pre_scan_result)
+	 */
+	if (scan > 0)
+		scan = cna_scan_main_queue(node, decode_tail(scan));
+
+	if (!scan) { /* if found a successor from the same numa node */
+		next_holder = node->next;
+		/*
+		 * we unlock successor by passing a non-zero value,
+		 * so set @val to 1 iff @locked is 0, which will happen
+		 * if we acquired the MCS lock when its queue was empty
+		 */
+		val = node->locked ? node->locked : 1;
+	} else if (node->locked > 1) {	  /* if secondary queue is not empty */
+		/* next holder will be the first node in the secondary queue */
+		tail_2nd = decode_tail(node->locked);
+		/* @tail_2nd->next points to the head of the secondary queue */
+		next_holder = tail_2nd->next;
+		/* splice the secondary queue onto the head of the main queue */
+		tail_2nd->next = next;
+	}
+
+	arch_mcs_pass_lock(&next_holder->locked, val);
+}
-- 
2.21.0 (Apple Git-122.2)


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
@ 2019-11-25 21:07   ` Alex Kogan
  0 siblings, 0 replies; 48+ messages in thread
From: Alex Kogan @ 2019-11-25 21:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: alex.kogan, dave.dice, rahul.x.yadav, steven.sistare, daniel.m.jordan

In CNA, spinning threads are organized in two queues, a main queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. After acquiring the
MCS lock and before acquiring the spinlock, the lock holder scans the
main queue looking for a thread running on the same node (pre-scan). If
found (call it thread T), all threads in the main queue between the
current lock holder and T are moved to the end of the secondary queue.
If such T is not found, we make another scan of the main queue when
unlocking the MCS lock (post-scan), starting at the position where
pre-scan stopped. If both scans fail to find such T, the MCS lock is
passed to the first thread in the secondary queue. If the secondary queue
is empty, the lock is passed to the next thread in the main queue.
For more details, see https://arxiv.org/abs/1810.05600.

Note that this variant of CNA may introduce starvation by continuously
passing the lock to threads running on the same node. This issue
will be addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
boot time only if we run on a multi-node machine in native environment and
the new config is enabled. (For the time being, the patching requires
CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
resolved once static_call() is available.) This default behavior can be
overridden with the new kernel boot command-line option
"numa_spinlock=on/off" (default is "auto").

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 .../admin-guide/kernel-parameters.txt         |  10 +
 arch/x86/Kconfig                              |  20 ++
 arch/x86/include/asm/qspinlock.h              |   4 +
 arch/x86/kernel/alternative.c                 |  43 +++
 kernel/locking/mcs_spinlock.h                 |   2 +-
 kernel/locking/qspinlock.c                    |  34 ++-
 kernel/locking/qspinlock_cna.h                | 264 ++++++++++++++++++
 7 files changed, 372 insertions(+), 5 deletions(-)
 create mode 100644 kernel/locking/qspinlock_cna.h

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8dee8f68fe15..904cb32f592d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3175,6 +3175,16 @@
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
+	numa_spinlock=	[NUMA, PV_OPS] Select the NUMA-aware variant
+			of spinlock. The options are:
+			auto - Enable this variant if running on a multi-node
+			machine in native environment.
+			on  - Unconditionally enable this variant.
+			off - Unconditionally disable this variant.
+
+			Not specifying this option is equivalent to
+			numa_spinlock=auto.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8ef85139553f..0b53e3e008ca 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1573,6 +1573,26 @@ config NUMA
 
 	  Otherwise, you should say N.
 
+config NUMA_AWARE_SPINLOCKS
+	bool "Numa-aware spinlocks"
+	depends on NUMA
+	depends on QUEUED_SPINLOCKS
+	depends on 64BIT
+	# For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
+	# This is awkward, but hopefully would be resolved once static_call()
+	# is available.
+	depends on PARAVIRT_SPINLOCKS
+	default y
+	help
+	  Introduce NUMA (Non Uniform Memory Access) awareness into
+	  the slow path of spinlocks.
+
+	  In this variant of qspinlock, the kernel will try to keep the lock
+	  on the same node, thus reducing the number of remote cache misses,
+	  while trading some of the short term fairness for better performance.
+
+	  Say N if you want absolute first come first serve fairness.
+
 config AMD_NUMA
 	def_bool y
 	prompt "Old style AMD Opteron NUMA detection"
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 444d6fd9a6d8..6fa8fcc5c7af 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -27,6 +27,10 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
 	return val;
 }
 
+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+extern void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+#endif
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 9d3a971ea364..6a4ccbf4e09c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -698,6 +698,33 @@ static void __init int3_selftest(void)
 	unregister_die_notifier(&int3_exception_nb);
 }
 
+#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+/*
+ * Constant (boot-param configurable) flag selecting the NUMA-aware variant
+ * of spinlock.  Possible values: -1 (off) / 0 (auto, default) / 1 (on).
+ */
+static int numa_spinlock_flag;
+
+static int __init numa_spinlock_setup(char *str)
+{
+	if (!strcmp(str, "auto")) {
+		numa_spinlock_flag = 0;
+		return 1;
+	} else if (!strcmp(str, "on")) {
+		numa_spinlock_flag = 1;
+		return 1;
+	} else if (!strcmp(str, "off")) {
+		numa_spinlock_flag = -1;
+		return 1;
+	}
+
+	return 0;
+}
+
+__setup("numa_spinlock=", numa_spinlock_setup);
+
+#endif
+
 void __init alternative_instructions(void)
 {
 	int3_selftest();
@@ -738,6 +765,22 @@ void __init alternative_instructions(void)
 	}
 #endif
 
+#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+	/*
+	 * By default, switch to the NUMA-friendly slow path for
+	 * spinlocks when we have multiple NUMA nodes in native environment.
+	 */
+	if ((numa_spinlock_flag == 1) ||
+	    (numa_spinlock_flag == 0 && nr_node_ids > 1 &&
+		    pv_ops.lock.queued_spin_lock_slowpath ==
+			native_queued_spin_lock_slowpath)) {
+		pv_ops.lock.queued_spin_lock_slowpath =
+		    __cna_queued_spin_lock_slowpath;
+
+		pr_info("Enabling CNA spinlock\n");
+	}
+#endif
+
 	apply_paravirt(__parainstructions, __parainstructions_end);
 
 	restart_nmi();
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 52d06ec6f525..e40b9538b79f 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,7 +17,7 @@
 
 struct mcs_spinlock {
 	struct mcs_spinlock *next;
-	int locked; /* 1 if lock acquired */
+	unsigned int locked; /* 1 if lock acquired */
 	int count;  /* nesting count, see qspinlock.c */
 };
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index c06d1e8075d9..6d8c4a52e44e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -11,7 +11,7 @@
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
 
 #include <linux/smp.h>
 #include <linux/bug.h>
@@ -70,7 +70,8 @@
 /*
  * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
  * size and four of them will fit nicely in one 64-byte cacheline. For
- * pvqspinlock, however, we need more space for extra data. To accommodate
+ * pvqspinlock, however, we need more space for extra data. The same also
+ * applies for the NUMA-aware variant of spinlocks (CNA). To accommodate
  * that, we insert two more long words to pad it up to 32 bytes. IOW, only
  * two of them can fit in a cacheline in this case. That is OK as it is rare
  * to have more than 2 levels of slowpath nesting in actual use. We don't
@@ -79,7 +80,7 @@
  */
 struct qnode {
 	struct mcs_spinlock mcs;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 	long reserved[2];
 #endif
 };
@@ -103,6 +104,8 @@ struct qnode {
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
  *
  * PV doubles the storage and uses the second cacheline for PV state.
+ * CNA also doubles the storage and uses the second cacheline for
+ * CNA-specific state.
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
@@ -316,7 +319,7 @@ static __always_inline void __mcs_pass_lock(struct mcs_spinlock *node,
 #define try_clear_tail	__try_clear_tail
 #define mcs_pass_lock		__mcs_pass_lock
 
-#endif /* _GEN_PV_LOCK_SLOWPATH */
+#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
 
 /**
  * queued_spin_lock_slowpath - acquire the queued spinlock
@@ -588,6 +591,29 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
+/*
+ * Generate the code for NUMA-aware spinlocks
+ */
+#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+#define _GEN_CNA_LOCK_SLOWPATH
+
+#undef pv_wait_head_or_lock
+#define pv_wait_head_or_lock		cna_pre_scan
+
+#undef try_clear_tail
+#define try_clear_tail			cna_try_change_tail
+
+#undef mcs_pass_lock
+#define mcs_pass_lock			cna_pass_lock
+
+#undef  queued_spin_lock_slowpath
+#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
+
+#include "qspinlock_cna.h"
+#include "qspinlock.c"
+
+#endif
+
 /*
  * Generate the paravirt code for queued_spin_unlock_slowpath().
  */
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
new file mode 100644
index 000000000000..a638336f9560
--- /dev/null
+++ b/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,264 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+#include <linux/topology.h>
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a main queue for
+ * threads running on the same NUMA node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. Schematically, it
+ * looks like this:
+ *
+ *    cna_node
+ *   +----------+    +--------+        +--------+
+ *   |mcs:next  | -> |mcs:next| -> ... |mcs:next| -> NULL      [Main queue]
+ *   |mcs:locked| -+ +--------+        +--------+
+ *   +----------+  |
+ *                 +----------------------+
+ *                                        \/
+ *                 +--------+         +--------+
+ *                 |mcs:next| -> ...  |mcs:next|          [Secondary queue]
+ *                 +--------+         +--------+
+ *                     ^                    |
+ *                     +--------------------+
+ *
+ * N.B. locked = 1 if secondary queue is absent. Othewrise, it contains the
+ * encoded pointer to the tail of the secondary queue, which is organized as a
+ * circular list.
+ *
+ * After acquiring the MCS lock and before acquiring the spinlock, the lock
+ * holder scans the main queue looking for a thread running on the same node
+ * (pre-scan). If found (call it thread T), all threads in the main queue
+ * between the current lock holder and T are moved to the end of the secondary
+ * queue.  If such T is not found, we make another scan of the main queue when
+ * unlocking the MCS lock (post-scan), starting at the node where pre-scan
+ * stopped. If both scans fail to find such T, the MCS lock is passed to the
+ * first thread in the secondary queue. If the secondary queue is empty, the
+ * lock is passed to the next thread in the main queue.
+ *
+ * For more details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <alex.kogan@oracle.com>
+ *          Dave Dice <dave.dice@oracle.com>
+ */
+
+struct cna_node {
+	struct mcs_spinlock	mcs;
+	int			numa_node;
+	u32			encoded_tail;
+	u32			pre_scan_result; /* 0 or encoded tail */
+};
+
+static void __init cna_init_nodes_per_cpu(unsigned int cpu)
+{
+	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
+	int numa_node = cpu_to_node(cpu);
+	int i;
+
+	for (i = 0; i < MAX_NODES; i++) {
+		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
+
+		cn->numa_node = numa_node;
+		cn->encoded_tail = encode_tail(cpu, i);
+		/*
+		 * @encoded_tail has to be larger than 1, so we do not confuse
+		 * it with other valid values for @locked or @pre_scan_result
+		 * (0 or 1)
+		 */
+		WARN_ON(cn->encoded_tail <= 1);
+	}
+}
+
+static int __init cna_init_nodes(void)
+{
+	unsigned int cpu;
+
+	/*
+	 * this will break on 32bit architectures, so we restrict
+	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
+	 */
+	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+	/* we store an encoded tail word in the node's @locked field */
+	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
+
+	for_each_possible_cpu(cpu)
+		cna_init_nodes_per_cpu(cpu);
+
+	return 0;
+}
+early_initcall(cna_init_nodes);
+
+static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
+				       struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *head_2nd, *tail_2nd;
+	u32 new;
+
+	/* If the secondary queue is empty, do what MCS does. */
+	if (node->locked <= 1)
+		return __try_clear_tail(lock, val, node);
+
+	/*
+	 * Try to update the tail value to the last node in the secondary queue.
+	 * If successful, pass the lock to the first thread in the secondary
+	 * queue. Doing those two actions effectively moves all nodes from the
+	 * secondary queue into the main one.
+	 */
+	tail_2nd = decode_tail(node->locked);
+	head_2nd = tail_2nd->next;
+	new = ((struct cna_node *)tail_2nd)->encoded_tail + _Q_LOCKED_VAL;
+
+	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
+		/*
+		 * Try to reset @next in tail_2nd to NULL, but no need to check
+		 * the result - if failed, a new successor has updated it.
+		 */
+		cmpxchg_relaxed(&tail_2nd->next, head_2nd, NULL);
+		arch_mcs_pass_lock(&head_2nd->locked, 1);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * cna_splice_tail -- splice nodes in the main queue between [first, last]
+ * onto the secondary queue.
+ */
+static void cna_splice_tail(struct mcs_spinlock *node,
+			    struct mcs_spinlock *first,
+			    struct mcs_spinlock *last)
+{
+	/* remove [first,last] */
+	node->next = last->next;
+
+	/* stick [first,last] on the secondary queue tail */
+	if (node->locked <= 1) { /* if secondary queue is empty */
+		/* create secondary queue */
+		last->next = first;
+	} else {
+		/* add to the tail of the secondary queue */
+		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
+		struct mcs_spinlock *head_2nd = tail_2nd->next;
+
+		tail_2nd->next = first;
+		last->next = head_2nd;
+	}
+
+	node->locked = ((struct cna_node *)last)->encoded_tail;
+}
+
+/*
+ * cna_scan_main_queue - scan the main waiting queue looking for the first
+ * thread running on the same NUMA node as the lock holder. If found (call it
+ * thread T), move all threads in the main queue between the lock holder and
+ * T to the end of the secondary queue and return 0; otherwise, return the
+ * encoded pointer of the last scanned node in the primary queue (so a
+ * subsequent scan can be resumed from that node)
+ *
+ * Schematically, this may look like the following (nn stands for numa_node and
+ * et stands for encoded_tail).
+ *
+ *   when cna_scan_main_queue() is called (the secondary queue is empty):
+ *
+ *  A+------------+   B+--------+   C+--------+   T+--------+
+ *   |mcs:next    | -> |mcs:next| -> |mcs:next| -> |mcs:next| -> NULL
+ *   |mcs:locked=1|    |cna:nn=0|    |cna:nn=2|    |cna:nn=1|
+ *   |cna:nn=1    |    +--------+    +--------+    +--------+
+ *   +----------- +
+ *
+ *   when cna_scan_main_queue() returns (the secondary queue contains B and C):
+ *
+ *  A+----------------+    T+--------+
+ *   |mcs:next        | ->  |mcs:next| -> NULL
+ *   |mcs:locked=C.et | -+  |cna:nn=1|
+ *   |cna:nn=1        |  |  +--------+
+ *   +--------------- +  +-----+
+ *                             \/
+ *          B+--------+   C+--------+
+ *           |mcs:next| -> |mcs:next| -+
+ *           |cna:nn=0|    |cna:nn=2|  |
+ *           +--------+    +--------+  |
+ *               ^                     |
+ *               +---------------------+
+ *
+ * The worst case complexity of the scan is O(n), where n is the number
+ * of current waiters. However, the amortized complexity is close to O(1),
+ * as the immediate successor is likely to be running on the same node once
+ * threads from other nodes are moved to the secondary queue.
+ */
+static u32 cna_scan_main_queue(struct mcs_spinlock *node,
+			       struct mcs_spinlock *pred_start)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+	struct cna_node *cni = (struct cna_node *)READ_ONCE(pred_start->next);
+	struct cna_node *last;
+	int my_numa_node = cn->numa_node;
+
+	/* find any next waiter on 'our' NUMA node */
+	for (last = cn;
+	     cni && cni->numa_node != my_numa_node;
+	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
+		;
+
+	/* if found, splice any skipped waiters onto the secondary queue */
+	if (cni) {
+		if (last != cn)	/* did we skip any waiters? */
+			cna_splice_tail(node, node->next,
+					(struct mcs_spinlock *)last);
+		return 0;
+	}
+
+	return last->encoded_tail;
+}
+
+__always_inline u32 cna_pre_scan(struct qspinlock *lock,
+				  struct mcs_spinlock *node)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+
+	cn->pre_scan_result = cna_scan_main_queue(node, node);
+
+	return 0;
+}
+
+static inline void cna_pass_lock(struct mcs_spinlock *node,
+				 struct mcs_spinlock *next)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+	struct mcs_spinlock *next_holder = next, *tail_2nd;
+	u32 val = 1;
+
+	u32 scan = cn->pre_scan_result;
+
+	/*
+	 * check if a successor from the same numa node has not been found in
+	 * pre-scan, and if so, try to find it in post-scan starting from the
+	 * node where pre-scan stopped (stored in @pre_scan_result)
+	 */
+	if (scan > 0)
+		scan = cna_scan_main_queue(node, decode_tail(scan));
+
+	if (!scan) { /* if found a successor from the same numa node */
+		next_holder = node->next;
+		/*
+		 * we unlock successor by passing a non-zero value,
+		 * so set @val to 1 iff @locked is 0, which will happen
+		 * if we acquired the MCS lock when its queue was empty
+		 */
+		val = node->locked ? node->locked : 1;
+	} else if (node->locked > 1) {	  /* if secondary queue is not empty */
+		/* next holder will be the first node in the secondary queue */
+		tail_2nd = decode_tail(node->locked);
+		/* @tail_2nd->next points to the head of the secondary queue */
+		next_holder = tail_2nd->next;
+		/* splice the secondary queue onto the head of the main queue */
+		tail_2nd->next = next;
+	}
+
+	arch_mcs_pass_lock(&next_holder->locked, val);
+}
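
The trickiest part above is the secondary queue encoding: @locked holds
either 1 (no secondary queue) or the encoded tail of a circular list whose
tail->next points back to its head. Below is a minimal user-space sketch of
the splice performed by cna_splice_tail(); the struct and names are purely
illustrative and are not the kernel's mcs_spinlock:

	#include <stdio.h>

	struct node {
		struct node *next;
		int id;
	};

	/* tail is the last element; tail->next is the head of the circular list */
	static struct node *splice_tail(struct node *tail, struct node *first,
					struct node *last)
	{
		if (!tail) {
			/* empty secondary queue: [first,last] becomes the circle */
			last->next = first;
		} else {
			/* append [first,last] after the old tail, keep it circular */
			last->next = tail->next;
			tail->next = first;
		}
		return last;	/* the new tail */
	}

	int main(void)
	{
		struct node a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };
		struct node *tail = NULL, *p;
		int i;

		a.next = &b;				/* splice the segment a->b, */
		tail = splice_tail(tail, &a, &b);
		tail = splice_tail(tail, &c, &c);	/* then the single node c */

		for (i = 0, p = tail->next; i < 3; i++, p = p->next)
			printf("%d ", p->id);		/* prints: 1 2 3 */
		printf("\n");
		return 0;
	}
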
-- 
2.21.0 (Apple Git-122.2)


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 4/5] locking/qspinlock: Introduce starvation avoidance into CNA
  2019-11-25 21:07 ` Alex Kogan
@ 2019-11-25 21:07   ` Alex Kogan
  -1 siblings, 0 replies; 48+ messages in thread
From: Alex Kogan @ 2019-11-25 21:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice, rahul.x.yadav

Keep track of the number of intra-node lock handoffs, and force
inter-node handoff once this number reaches a preset threshold.
The default value for the threshold can be overridden with
the new kernel boot command-line option "numa_spinlock_threshold".
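
In other words, with the default setting the lock can be handed off within
one node roughly 2^16 = 65536 consecutive times before it must be passed to
a waiter on another node. A small user-space sketch of the parameter /
threshold mapping and the resulting decision (names are illustrative, not
the kernel code):

	#include <stdio.h>

	static unsigned int intra_node_handoff_threshold = 1u << 16; /* default */

	/* parsed value of numa_spinlock_threshold=; valid range is [0..31] */
	static int set_threshold(int param)
	{
		if (param < 0 || param > 31)
			return 0;	/* reject, keep the default */
		intra_node_handoff_threshold = 1u << param;
		return 1;
	}

	/* force an inter-node handoff once the intra-node count hits the threshold */
	static int must_pass_across_nodes(unsigned int intra_count)
	{
		return intra_count >= intra_node_handoff_threshold;
	}

	int main(void)
	{
		set_threshold(16);
		printf("threshold = %u\n", intra_node_handoff_threshold); /* 65536 */
		printf("%d %d\n", must_pass_across_nodes(65535),	  /* 0 */
				  must_pass_across_nodes(65536));	  /* 1 */
		return 0;
	}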

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 .../admin-guide/kernel-parameters.txt         |  8 ++++++
 arch/x86/kernel/alternative.c                 | 27 +++++++++++++++++++
 kernel/locking/qspinlock.c                    |  3 +++
 kernel/locking/qspinlock_cna.h                | 27 ++++++++++++++++---
 4 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 904cb32f592d..887fbfce701d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3185,6 +3185,14 @@
 			Not specifying this option is equivalent to
 			numa_spinlock=auto.
 
+	numa_spinlock_threshold=	[NUMA, PV_OPS]
+			Set the threshold for the number of intra-node
+			lock hand-offs before the NUMA-aware spinlock
+			is forced to be passed to a thread on another NUMA node.
+			Valid values are in the [0..31] range. Smaller values
+			result in a more fair, but less performant spinlock, and
+			vice versa. The default value is 16.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 6a4ccbf4e09c..28552e0491b5 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -723,6 +723,33 @@ static int __init numa_spinlock_setup(char *str)
 
 __setup("numa_spinlock=", numa_spinlock_setup);
 
+/*
+ * Controls the threshold for the number of intra-node lock hand-offs before
+ * the NUMA-aware variant of spinlock is forced to be passed to a thread on
+ * another NUMA node. By default, the chosen value provides reasonable
+ * long-term fairness without sacrificing performance compared to a lock
+ * that does not have any fairness guarantees.
+ */
+int intra_node_handoff_threshold = 1 << 16;
+
+static int __init numa_spinlock_threshold_setup(char *str)
+{
+	int new_threshold_param;
+
+	if (get_option(&str, &new_threshold_param)) {
+		/* valid value is between 0 and 31 */
+		if (new_threshold_param < 0 || new_threshold_param > 31)
+			return 0;
+
+		intra_node_handoff_threshold = 1 << new_threshold_param;
+		return 1;
+	}
+
+	return 0;
+}
+
+__setup("numa_spinlock_threshold=", numa_spinlock_threshold_setup);
+
 #endif
 
 void __init alternative_instructions(void)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 6d8c4a52e44e..1d0d884308ef 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -597,6 +597,9 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 #define _GEN_CNA_LOCK_SLOWPATH
 
+#undef pv_init_node
+#define pv_init_node			cna_init_node
+
 #undef pv_wait_head_or_lock
 #define pv_wait_head_or_lock		cna_pre_scan
 
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index a638336f9560..dcb2bcfd2d94 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -50,9 +50,16 @@ struct cna_node {
 	struct mcs_spinlock	mcs;
 	int			numa_node;
 	u32			encoded_tail;
-	u32			pre_scan_result; /* 0 or encoded tail */
+	u32			pre_scan_result; /* 0, 1 or encoded tail */
+	u32			intra_count;
 };
 
+/*
+ * Controls the threshold for the number of intra-node lock hand-offs.
+ * See arch/x86/kernel/alternative.c for details.
+ */
+extern int intra_node_handoff_threshold;
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -92,6 +99,11 @@ static int __init cna_init_nodes(void)
 }
 early_initcall(cna_init_nodes);
 
+static __always_inline void cna_init_node(struct mcs_spinlock *node)
+{
+	((struct cna_node *)node)->intra_count = 0;
+}
+
 static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
 				       struct mcs_spinlock *node)
 {
@@ -221,7 +233,13 @@ __always_inline u32 cna_pre_scan(struct qspinlock *lock,
 {
 	struct cna_node *cn = (struct cna_node *)node;
 
-	cn->pre_scan_result = cna_scan_main_queue(node, node);
+	/*
+	 * setting @pre_scan_result to 1 indicates that no post-scan
+	 * should be made in cna_pass_lock()
+	 */
+	cn->pre_scan_result =
+		cn->intra_count == intra_node_handoff_threshold ?
+			1 : cna_scan_main_queue(node, node);
 
 	return 0;
 }
@@ -240,7 +258,7 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 	 * pre-scan, and if so, try to find it in post-scan starting from the
 	 * node where pre-scan stopped (stored in @pre_scan_result)
 	 */
-	if (scan > 0)
+	if (scan > 1)
 		scan = cna_scan_main_queue(node, decode_tail(scan));
 
 	if (!scan) { /* if found a successor from the same numa node */
@@ -251,6 +269,9 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 		 * if we acquired the MCS lock when its queue was empty
 		 */
 		val = node->locked ? node->locked : 1;
+		/* inc @intra_count if the secondary queue is not empty */
+		((struct cna_node *)next_holder)->intra_count =
+			cn->intra_count + (node->locked > 1);
 	} else if (node->locked > 1) {	  /* if secondary queue is not empty */
 		/* next holder will be the first node in the secondary queue */
 		tail_2nd = decode_tail(node->locked);
-- 
2.21.0 (Apple Git-122.2)


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 5/5] locking/qspinlock: Introduce the shuffle reduction optimization into CNA
  2019-11-25 21:07 ` Alex Kogan
@ 2019-11-25 21:07   ` Alex Kogan
  -1 siblings, 0 replies; 48+ messages in thread
From: Alex Kogan @ 2019-11-25 21:07 UTC (permalink / raw)
  To: linux, peterz, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, alex.kogan, dave.dice, rahul.x.yadav

This optimization reduces the probability that threads will be shuffled between
the main and secondary queues when the secondary queue is empty.
It is helpful when the lock is only lightly contended.
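
To give a feel for the numbers: the gate is a per-CPU pseudo-random value
whose low SHUFFLE_REDUCTION_PROB_ARG (7) bits are tested, so while the
secondary queue is empty the (possibly futile) scan of the main queue is
attempted only about once every 2^7 = 128 lock handoffs. A user-space
sketch of that gate, with a plain LCG standing in for the kernel's PRNG
(names are illustrative):

	#include <stdio.h>
	#include <stdbool.h>
	#include <stdint.h>

	static uint32_t seed = 1;	/* per-CPU in the kernel, one global here */

	static uint32_t next_rand32(uint32_t x)
	{
		return x * 1664525u + 1013904223u;	/* simple LCG for the sketch */
	}

	/* false only when the low @num_bits bits are all zero: p = 1/2^num_bits */
	static bool probably(unsigned int num_bits)
	{
		seed = next_rand32(seed);
		return seed & ((1u << num_bits) - 1);
	}

	int main(void)
	{
		unsigned int misses = 0, i;

		for (i = 0; i < (1u << 20); i++)
			if (!probably(7))
				misses++;
		/* expect roughly (1 << 20) / 128 = 8192 "false" results */
		printf("%u\n", misses);
		return 0;
	}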

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 kernel/locking/qspinlock_cna.h | 50 ++++++++++++++++++++++++++++------
 1 file changed, 42 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index dcb2bcfd2d94..f1eef6bece7b 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,6 +4,7 @@
 #endif
 
 #include <linux/topology.h>
+#include <linux/random.h>
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -50,7 +51,7 @@ struct cna_node {
 	struct mcs_spinlock	mcs;
 	int			numa_node;
 	u32			encoded_tail;
-	u32			pre_scan_result; /* 0, 1 or encoded tail */
+	u32			pre_scan_result; /* 0, 1, 2 or encoded tail */
 	u32			intra_count;
 };
 
@@ -60,6 +61,34 @@ struct cna_node {
  */
 extern int intra_node_handoff_threshold;
 
+/*
+ * Controls the probability for enabling the scan of the main queue when
+ * the secondary queue is empty. The chosen value reduces the amount of
+ * unnecessary shuffling of threads between the two waiting queues when
+ * the contention is low, while responding fast enough and enabling
+ * the shuffling when the contention is high.
+ */
+#define SHUFFLE_REDUCTION_PROB_ARG  (7)
+
+/* Per-CPU pseudo-random number seed */
+static DEFINE_PER_CPU(u32, seed);
+
+/*
+ * Return false with probability 1 / 2^@num_bits.
+ * Intuitively, the larger @num_bits the less likely false is to be returned.
+ * @num_bits must be a number between 0 and 31.
+ */
+static bool probably(unsigned int num_bits)
+{
+	u32 s;
+
+	s = this_cpu_read(seed);
+	s = next_pseudo_random32(s);
+	this_cpu_write(seed, s);
+
+	return s & ((1 << num_bits) - 1);
+}
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -72,11 +101,11 @@ static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 		cn->numa_node = numa_node;
 		cn->encoded_tail = encode_tail(cpu, i);
 		/*
-		 * @encoded_tail has to be larger than 1, so we do not confuse
+		 * @encoded_tail has to be larger than 2, so we do not confuse
 		 * it with other valid values for @locked or @pre_scan_result
-		 * (0 or 1)
+		 * (0, 1 or 2)
 		 */
-		WARN_ON(cn->encoded_tail <= 1);
+		WARN_ON(cn->encoded_tail <= 2);
 	}
 }
 
@@ -234,12 +263,13 @@ __always_inline u32 cna_pre_scan(struct qspinlock *lock,
 	struct cna_node *cn = (struct cna_node *)node;
 
 	/*
-	 * setting @pre_scan_result to 1 indicates that no post-scan
+	 * setting @pre_scan_result to 1 or 2 indicates that no post-scan
 	 * should be made in cna_pass_lock()
 	 */
 	cn->pre_scan_result =
-		cn->intra_count == intra_node_handoff_threshold ?
-			1 : cna_scan_main_queue(node, node);
+		(node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) ?
+			1 : cn->intra_count == intra_node_handoff_threshold ?
+			2 : cna_scan_main_queue(node, node);
 
 	return 0;
 }
@@ -253,12 +283,15 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 
 	u32 scan = cn->pre_scan_result;
 
+	if (scan == 1)
+		goto pass_lock;
+
 	/*
 	 * check if a successor from the same numa node has not been found in
 	 * pre-scan, and if so, try to find it in post-scan starting from the
 	 * node where pre-scan stopped (stored in @pre_scan_result)
 	 */
-	if (scan > 1)
+	if (scan > 2)
 		scan = cna_scan_main_queue(node, decode_tail(scan));
 
 	if (!scan) { /* if found a successor from the same numa node */
@@ -281,5 +314,6 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 		tail_2nd->next = next;
 	}
 
+pass_lock:
 	arch_mcs_pass_lock(&next_holder->locked, val);
 }
-- 
2.21.0 (Apple Git-122.2)


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-11-25 21:07   ` Alex Kogan
@ 2019-12-06 17:21     ` Waiman Long
  -1 siblings, 0 replies; 48+ messages in thread
From: Waiman Long @ 2019-12-06 17:21 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, dave.dice, rahul.x.yadav

On 11/25/19 4:07 PM, Alex Kogan wrote:
> In CNA, spinning threads are organized in two queues, a main queue for
> threads running on the same node as the current lock holder, and a
> secondary queue for threads running on other nodes. After acquiring the
> MCS lock and before acquiring the spinlock, the lock holder scans the
> main queue looking for a thread running on the same node (pre-scan). If
> found (call it thread T), all threads in the main queue between the
> current lock holder and T are moved to the end of the secondary queue.
> If such T is not found, we make another scan of the main queue when
> unlocking the MCS lock (post-scan), starting at the position where
> pre-scan stopped. If both scans fail to find such T, the MCS lock is
> passed to the first thread in the secondary queue. If the secondary queue
> is empty, the lock is passed to the next thread in the main queue.
> For more details, see https://arxiv.org/abs/1810.05600.
>
> Note that this variant of CNA may introduce starvation by continuously
> passing the lock to threads running on the same node. This issue
> will be addressed later in the series.
>
> Enabling CNA is controlled via a new configuration option
> (NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
> boot time only if we run on a multi-node machine in native environment and
> the new config is enabled. (For the time being, the patching requires
> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
> resolved once static_call() is available.) This default behavior can be
> overridden with the new kernel boot command-line option
> "numa_spinlock=on/off" (default is "auto").
>
> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  10 +
>  arch/x86/Kconfig                              |  20 ++
>  arch/x86/include/asm/qspinlock.h              |   4 +
>  arch/x86/kernel/alternative.c                 |  43 +++
>  kernel/locking/mcs_spinlock.h                 |   2 +-
>  kernel/locking/qspinlock.c                    |  34 ++-
>  kernel/locking/qspinlock_cna.h                | 264 ++++++++++++++++++
>  7 files changed, 372 insertions(+), 5 deletions(-)
>  create mode 100644 kernel/locking/qspinlock_cna.h
>
  :
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 9d3a971ea364..6a4ccbf4e09c 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -698,6 +698,33 @@ static void __init int3_selftest(void)
>  	unregister_die_notifier(&int3_exception_nb);
>  }
>  
> +#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
> +/*
> + * Constant (boot-param configurable) flag selecting the NUMA-aware variant
> + * of spinlock.  Possible values: -1 (off) / 0 (auto, default) / 1 (on).
> + */
> +static int numa_spinlock_flag;
> +
> +static int __init numa_spinlock_setup(char *str)
> +{
> +	if (!strcmp(str, "auto")) {
> +		numa_spinlock_flag = 0;
> +		return 1;
> +	} else if (!strcmp(str, "on")) {
> +		numa_spinlock_flag = 1;
> +		return 1;
> +	} else if (!strcmp(str, "off")) {
> +		numa_spinlock_flag = -1;
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +__setup("numa_spinlock=", numa_spinlock_setup);
> +
> +#endif
> +

This __init function should be in qspinlock_cna.h. We generally like to
put as much related code into as few places as possible instead of
spreading it around in different places.

>  void __init alternative_instructions(void)
>  {
>  	int3_selftest();
> @@ -738,6 +765,22 @@ void __init alternative_instructions(void)
>  	}
>  #endif
>  
> +#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
> +	/*
> +	 * By default, switch to the NUMA-friendly slow path for
> +	 * spinlocks when we have multiple NUMA nodes in native environment.
> +	 */
> +	if ((numa_spinlock_flag == 1) ||
> +	    (numa_spinlock_flag == 0 && nr_node_ids > 1 &&
> +		    pv_ops.lock.queued_spin_lock_slowpath ==
> +			native_queued_spin_lock_slowpath)) {
> +		pv_ops.lock.queued_spin_lock_slowpath =
> +		    __cna_queued_spin_lock_slowpath;
> +
> +		pr_info("Enabling CNA spinlock\n");
> +	}
> +#endif
> +
>  	apply_paravirt(__parainstructions, __parainstructions_end);

Encapsulate the logic into another __init function in qspinlock_cna.h
and just make a function call here. You can declare the function in
arch/x86/include/asm/qspinlock.h.
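
Something along these lines, completely untested and assuming
numa_spinlock_flag and its __setup() handler move into qspinlock_cna.h as
well (function name is just a suggestion):

	/* kernel/locking/qspinlock_cna.h */
	void __init cna_configure_spin_lock_slowpath(void)
	{
		if ((numa_spinlock_flag == 1) ||
		    (numa_spinlock_flag == 0 && nr_node_ids > 1 &&
		     pv_ops.lock.queued_spin_lock_slowpath ==
			native_queued_spin_lock_slowpath)) {
			pv_ops.lock.queued_spin_lock_slowpath =
				__cna_queued_spin_lock_slowpath;

			pr_info("Enabling CNA spinlock\n");
		}
	}

	/* arch/x86/include/asm/qspinlock.h */
	extern void cna_configure_spin_lock_slowpath(void);

	/* arch/x86/kernel/alternative.c, in alternative_instructions() */
	#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
		cna_configure_spin_lock_slowpath();
	#endif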


>  
>  	restart_nmi();
> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index 52d06ec6f525..e40b9538b79f 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -17,7 +17,7 @@
>  
>  struct mcs_spinlock {
>  	struct mcs_spinlock *next;
> -	int locked; /* 1 if lock acquired */
> +	unsigned int locked; /* 1 if lock acquired */
>  	int count;  /* nesting count, see qspinlock.c */
>  };
>  
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index c06d1e8075d9..6d8c4a52e44e 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -11,7 +11,7 @@
>   *          Peter Zijlstra <peterz@infradead.org>
>   */
>  
> -#ifndef _GEN_PV_LOCK_SLOWPATH
> +#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
>  
>  #include <linux/smp.h>
>  #include <linux/bug.h>
> @@ -70,7 +70,8 @@
>  /*
>   * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
>   * size and four of them will fit nicely in one 64-byte cacheline. For
> - * pvqspinlock, however, we need more space for extra data. To accommodate
> + * pvqspinlock, however, we need more space for extra data. The same also
> + * applies for the NUMA-aware variant of spinlocks (CNA). To accommodate
>   * that, we insert two more long words to pad it up to 32 bytes. IOW, only
>   * two of them can fit in a cacheline in this case. That is OK as it is rare
>   * to have more than 2 levels of slowpath nesting in actual use. We don't
> @@ -79,7 +80,7 @@
>   */
>  struct qnode {
>  	struct mcs_spinlock mcs;
> -#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
>  	long reserved[2];
>  #endif
>  };
> @@ -103,6 +104,8 @@ struct qnode {
>   * Exactly fits one 64-byte cacheline on a 64-bit architecture.
>   *
>   * PV doubles the storage and uses the second cacheline for PV state.
> + * CNA also doubles the storage and uses the second cacheline for
> + * CNA-specific state.
>   */
>  static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
>  
> @@ -316,7 +319,7 @@ static __always_inline void __mcs_pass_lock(struct mcs_spinlock *node,
>  #define try_clear_tail	__try_clear_tail
>  #define mcs_pass_lock		__mcs_pass_lock
>  
> -#endif /* _GEN_PV_LOCK_SLOWPATH */
> +#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
>  
>  /**
>   * queued_spin_lock_slowpath - acquire the queued spinlock
> @@ -588,6 +591,29 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  }
>  EXPORT_SYMBOL(queued_spin_lock_slowpath);
>  
> +/*
> + * Generate the code for NUMA-aware spinlocks
> + */
> +#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
> +#define _GEN_CNA_LOCK_SLOWPATH
> +
> +#undef pv_wait_head_or_lock
> +#define pv_wait_head_or_lock		cna_pre_scan
> +
> +#undef try_clear_tail
> +#define try_clear_tail			cna_try_change_tail
> +
> +#undef mcs_pass_lock
> +#define mcs_pass_lock			cna_pass_lock
> +
> +#undef  queued_spin_lock_slowpath
> +#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
> +
> +#include "qspinlock_cna.h"
> +#include "qspinlock.c"
> +
> +#endif
> +
>  /*
>   * Generate the paravirt code for queued_spin_unlock_slowpath().
>   */
> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
> new file mode 100644
> index 000000000000..a638336f9560
> --- /dev/null
> +++ b/kernel/locking/qspinlock_cna.h
> @@ -0,0 +1,264 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _GEN_CNA_LOCK_SLOWPATH
> +#error "do not include this file"
> +#endif
> +
> +#include <linux/topology.h>
> +
> +/*
> + * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
> + *
> + * In CNA, spinning threads are organized in two queues, a main queue for
> + * threads running on the same NUMA node as the current lock holder, and a
> + * secondary queue for threads running on other nodes. Schematically, it
> + * looks like this:
> + *
> + *    cna_node
> + *   +----------+    +--------+        +--------+
> + *   |mcs:next  | -> |mcs:next| -> ... |mcs:next| -> NULL      [Main queue]
> + *   |mcs:locked| -+ +--------+        +--------+
> + *   +----------+  |
> + *                 +----------------------+
> + *                                        \/
> + *                 +--------+         +--------+
> + *                 |mcs:next| -> ...  |mcs:next|          [Secondary queue]
> + *                 +--------+         +--------+
> + *                     ^                    |
> + *                     +--------------------+
> + *
> + * N.B. locked = 1 if secondary queue is absent. Otherwise, it contains the
> + * encoded pointer to the tail of the secondary queue, which is organized as a
> + * circular list.
> + *
> + * After acquiring the MCS lock and before acquiring the spinlock, the lock
> + * holder scans the main queue looking for a thread running on the same node
> + * (pre-scan). If found (call it thread T), all threads in the main queue
> + * between the current lock holder and T are moved to the end of the secondary
> + * queue.  If such T is not found, we make another scan of the main queue when
> + * unlocking the MCS lock (post-scan), starting at the node where pre-scan
> + * stopped. If both scans fail to find such T, the MCS lock is passed to the
> + * first thread in the secondary queue. If the secondary queue is empty, the
> + * lock is passed to the next thread in the main queue.
> + *
> + * For more details, see https://arxiv.org/abs/1810.05600.
> + *
> + * Authors: Alex Kogan <alex.kogan@oracle.com>
> + *          Dave Dice <dave.dice@oracle.com>
> + */
> +
> +struct cna_node {
> +	struct mcs_spinlock	mcs;
> +	int			numa_node;
> +	u32			encoded_tail;
> +	u32			pre_scan_result; /* 0 or encoded tail */
> +};
> +
> +static void __init cna_init_nodes_per_cpu(unsigned int cpu)
> +{
> +	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
> +	int numa_node = cpu_to_node(cpu);
> +	int i;
> +
> +	for (i = 0; i < MAX_NODES; i++) {
> +		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
> +
> +		cn->numa_node = numa_node;
> +		cn->encoded_tail = encode_tail(cpu, i);
> +		/*
> +		 * @encoded_tail has to be larger than 1, so we do not confuse
> +		 * it with other valid values for @locked or @pre_scan_result
> +		 * (0 or 1)
> +		 */
> +		WARN_ON(cn->encoded_tail <= 1);
> +	}
> +}
> +
> +static int __init cna_init_nodes(void)
> +{
> +	unsigned int cpu;
> +
> +	/*
> +	 * this will break on 32bit architectures, so we restrict
> +	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
> +	 */
> +	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
> +	/* we store an encoded tail word in the node's @locked field */
> +	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
> +
> +	for_each_possible_cpu(cpu)
> +		cna_init_nodes_per_cpu(cpu);
> +
> +	return 0;
> +}
> +early_initcall(cna_init_nodes);
> +

Include a comment here saying that the cna_try_change_tail() function is
only called when the primary queue is empty. That will make the code
easier to read.
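
Something like the following would do; the wording is only a suggestion:

	/*
	 * cna_try_change_tail - called only when the lock holder is the last
	 * node in the main queue (i.e. the main queue is otherwise empty).
	 * Either clear the tail as regular MCS does, or hand the lock over to
	 * the head of the secondary queue, effectively making the secondary
	 * queue the new main queue.
	 */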


> +static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
> +				       struct mcs_spinlock *node)
> +{
> +	struct mcs_spinlock *head_2nd, *tail_2nd;
> +	u32 new;
> +
> +	/* If the secondary queue is empty, do what MCS does. */
> +	if (node->locked <= 1)
> +		return __try_clear_tail(lock, val, node);
> +
> +	/*
> +	 * Try to update the tail value to the last node in the secondary queue.
> +	 * If successful, pass the lock to the first thread in the secondary
> +	 * queue. Doing those two actions effectively moves all nodes from the
> +	 * secondary queue into the main one.
> +	 */
> +	tail_2nd = decode_tail(node->locked);
> +	head_2nd = tail_2nd->next;
> +	new = ((struct cna_node *)tail_2nd)->encoded_tail + _Q_LOCKED_VAL;
> +
> +	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
> +		/*
> +		 * Try to reset @next in tail_2nd to NULL, but no need to check
> +		 * the result - if failed, a new successor has updated it.
> +		 */
> +		cmpxchg_relaxed(&tail_2nd->next, head_2nd, NULL);
> +		arch_mcs_pass_lock(&head_2nd->locked, 1);
> +		return true;
> +	}
> +
> +	return false;
> +}
Cheers,
Longman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 4/5] locking/qspinlock: Introduce starvation avoidance into CNA
  2019-11-25 21:07   ` Alex Kogan
@ 2019-12-06 18:09     ` Waiman Long
  -1 siblings, 0 replies; 48+ messages in thread
From: Waiman Long @ 2019-12-06 18:09 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, dave.dice, rahul.x.yadav

On 11/25/19 4:07 PM, Alex Kogan wrote:
> Keep track of the number of intra-node lock handoffs, and force
> inter-node handoff once this number reaches a preset threshold.
> The default value for the threshold can be overridden with
> the new kernel boot command-line option "numa_spinlock_threshold".
>
> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  8 ++++++
>  arch/x86/kernel/alternative.c                 | 27 +++++++++++++++++++
>  kernel/locking/qspinlock.c                    |  3 +++
>  kernel/locking/qspinlock_cna.h                | 27 ++++++++++++++++---
>  4 files changed, 62 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 904cb32f592d..887fbfce701d 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3185,6 +3185,14 @@
>  			Not specifying this option is equivalent to
>  			numa_spinlock=auto.
>  
> +	numa_spinlock_threshold=	[NUMA, PV_OPS]
> +			Set the threshold for the number of intra-node
> +			lock hand-offs before the NUMA-aware spinlock
> +			is forced to be passed to a thread on another NUMA node.
> +			Valid values are in the [0..31] range. Smaller values
> +			result in a more fair, but less performant spinlock, and
> +			vice versa. The default value is 16.
> +
>  	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
>  			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
>  			Some features depend on CPU0. Known dependencies are:
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 6a4ccbf4e09c..28552e0491b5 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -723,6 +723,33 @@ static int __init numa_spinlock_setup(char *str)
>  
>  __setup("numa_spinlock=", numa_spinlock_setup);
>  
> +/*
> + * Controls the threshold for the number of intra-node lock hand-offs before
> + * the NUMA-aware variant of spinlock is forced to be passed to a thread on
> + * another NUMA node. By default, the chosen value provides reasonable
> + * long-term fairness without sacrificing performance compared to a lock
> + * that does not have any fairness guarantees.
> + */
> +int intra_node_handoff_threshold = 1 << 16;

      ^
      __ro_after_init
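
i.e. (a minimal sketch, keeping the same initial value as in the patch):

/* written only from the __init boot-parameter handler, so RO after init */
int intra_node_handoff_threshold __ro_after_init = 1 << 16;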

> +
> +static int __init numa_spinlock_threshold_setup(char *str)
> +{
> +	int new_threshold_param;
> +
> +	if (get_option(&str, &new_threshold_param)) {
> +		/* valid value is between 0 and 31 */
> +		if (new_threshold_param < 0 || new_threshold_param > 31)
> +			return 0;
> +
> +		intra_node_handoff_threshold = 1 << new_threshold_param;
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +__setup("numa_spinlock_threshold=", numa_spinlock_threshold_setup);
> +
>  #endif
>  

Again, this should be in qspinlock_cna.h, not in alternative.c.
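
Roughly like the sketch below (same code as in the patch, just hosted in
qspinlock_cna.h; the __ro_after_init annotation is the one suggested above):

/* kernel/locking/qspinlock_cna.h */
int intra_node_handoff_threshold __ro_after_init = 1 << 16;

static int __init numa_spinlock_threshold_setup(char *str)
{
	int new_threshold_param;

	if (get_option(&str, &new_threshold_param)) {
		/* valid value is between 0 and 31 */
		if (new_threshold_param < 0 || new_threshold_param > 31)
			return 0;

		intra_node_handoff_threshold = 1 << new_threshold_param;
		return 1;
	}

	return 0;
}
__setup("numa_spinlock_threshold=", numa_spinlock_threshold_setup);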

Cheers,
Longman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-12-06 17:21     ` Waiman Long
@ 2019-12-06 19:50       ` Alex Kogan
  -1 siblings, 0 replies; 48+ messages in thread
From: Alex Kogan @ 2019-12-06 19:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux, Peter Zijlstra, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice,
	rahul.x.yadav

Thanks for the feedback. I will take care of that and resubmit.

Regards,
— Alex

> On Dec 6, 2019, at 12:21 PM, Waiman Long <longman@redhat.com> wrote:
> 
> On 11/25/19 4:07 PM, Alex Kogan wrote:
>> In CNA, spinning threads are organized in two queues, a main queue for
>> threads running on the same node as the current lock holder, and a
>> secondary queue for threads running on other nodes. After acquiring the
>> MCS lock and before acquiring the spinlock, the lock holder scans the
>> main queue looking for a thread running on the same node (pre-scan). If
>> found (call it thread T), all threads in the main queue between the
>> current lock holder and T are moved to the end of the secondary queue.
>> If such T is not found, we make another scan of the main queue when
>> unlocking the MCS lock (post-scan), starting at the position where
>> pre-scan stopped. If both scans fail to find such T, the MCS lock is
>> passed to the first thread in the secondary queue. If the secondary queue
>> is empty, the lock is passed to the next thread in the main queue.
>> For more details, see https://arxiv.org/abs/1810.05600.
>> 
>> Note that this variant of CNA may introduce starvation by continuously
>> passing the lock to threads running on the same node. This issue
>> will be addressed later in the series.
>> 
>> Enabling CNA is controlled via a new configuration option
>> (NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
>> boot time only if we run on a multi-node machine in native environment and
>> the new config is enabled. (For the time being, the patching requires
>> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
>> resolved once static_call() is available.) This default behavior can be
>> overridden with the new kernel boot command-line option
>> "numa_spinlock=on/off" (default is "auto").
>> 
>> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
>> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> .../admin-guide/kernel-parameters.txt         |  10 +
>> arch/x86/Kconfig                              |  20 ++
>> arch/x86/include/asm/qspinlock.h              |   4 +
>> arch/x86/kernel/alternative.c                 |  43 +++
>> kernel/locking/mcs_spinlock.h                 |   2 +-
>> kernel/locking/qspinlock.c                    |  34 ++-
>> kernel/locking/qspinlock_cna.h                | 264 ++++++++++++++++++
>> 7 files changed, 372 insertions(+), 5 deletions(-)
>> create mode 100644 kernel/locking/qspinlock_cna.h
>> 
>   :
>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>> index 9d3a971ea364..6a4ccbf4e09c 100644
>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -698,6 +698,33 @@ static void __init int3_selftest(void)
>> 	unregister_die_notifier(&int3_exception_nb);
>> }
>> 
>> +#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
>> +/*
>> + * Constant (boot-param configurable) flag selecting the NUMA-aware variant
>> + * of spinlock.  Possible values: -1 (off) / 0 (auto, default) / 1 (on).
>> + */
>> +static int numa_spinlock_flag;
>> +
>> +static int __init numa_spinlock_setup(char *str)
>> +{
>> +	if (!strcmp(str, "auto")) {
>> +		numa_spinlock_flag = 0;
>> +		return 1;
>> +	} else if (!strcmp(str, "on")) {
>> +		numa_spinlock_flag = 1;
>> +		return 1;
>> +	} else if (!strcmp(str, "off")) {
>> +		numa_spinlock_flag = -1;
>> +		return 1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +__setup("numa_spinlock=", numa_spinlock_setup);
>> +
>> +#endif
>> +
> 
> This __init function should be in qspinlock_cna.h. We generally like to
> put as much related code into as few places as possible instead of
> spreading them around in different places.
> 
>> void __init alternative_instructions(void)
>> {
>> 	int3_selftest();
>> @@ -738,6 +765,22 @@ void __init alternative_instructions(void)
>> 	}
>> #endif
>> 
>> +#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
>> +	/*
>> +	 * By default, switch to the NUMA-friendly slow path for
>> +	 * spinlocks when we have multiple NUMA nodes in native environment.
>> +	 */
>> +	if ((numa_spinlock_flag == 1) ||
>> +	    (numa_spinlock_flag == 0 && nr_node_ids > 1 &&
>> +		    pv_ops.lock.queued_spin_lock_slowpath ==
>> +			native_queued_spin_lock_slowpath)) {
>> +		pv_ops.lock.queued_spin_lock_slowpath =
>> +		    __cna_queued_spin_lock_slowpath;
>> +
>> +		pr_info("Enabling CNA spinlock\n");
>> +	}
>> +#endif
>> +
>> 	apply_paravirt(__parainstructions, __parainstructions_end);
> 
> Encapsulate the logic into another __init function in qspinlock_cna.h
> and just make a function call here. You can declare the function in
> arch/x86/include/asm/qspinlock.h.
> 
> 
>> 
>> 	restart_nmi();
>> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
>> index 52d06ec6f525..e40b9538b79f 100644
>> --- a/kernel/locking/mcs_spinlock.h
>> +++ b/kernel/locking/mcs_spinlock.h
>> @@ -17,7 +17,7 @@
>> 
>> struct mcs_spinlock {
>> 	struct mcs_spinlock *next;
>> -	int locked; /* 1 if lock acquired */
>> +	unsigned int locked; /* 1 if lock acquired */
>> 	int count;  /* nesting count, see qspinlock.c */
>> };
>> 
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index c06d1e8075d9..6d8c4a52e44e 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -11,7 +11,7 @@
>>  *          Peter Zijlstra <peterz@infradead.org>
>>  */
>> 
>> -#ifndef _GEN_PV_LOCK_SLOWPATH
>> +#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)
>> 
>> #include <linux/smp.h>
>> #include <linux/bug.h>
>> @@ -70,7 +70,8 @@
>> /*
>>  * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
>>  * size and four of them will fit nicely in one 64-byte cacheline. For
>> - * pvqspinlock, however, we need more space for extra data. To accommodate
>> + * pvqspinlock, however, we need more space for extra data. The same also
>> + * applies for the NUMA-aware variant of spinlocks (CNA). To accommodate
>>  * that, we insert two more long words to pad it up to 32 bytes. IOW, only
>>  * two of them can fit in a cacheline in this case. That is OK as it is rare
>>  * to have more than 2 levels of slowpath nesting in actual use. We don't
>> @@ -79,7 +80,7 @@
>>  */
>> struct qnode {
>> 	struct mcs_spinlock mcs;
>> -#ifdef CONFIG_PARAVIRT_SPINLOCKS
>> +#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
>> 	long reserved[2];
>> #endif
>> };
>> @@ -103,6 +104,8 @@ struct qnode {
>>  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
>>  *
>>  * PV doubles the storage and uses the second cacheline for PV state.
>> + * CNA also doubles the storage and uses the second cacheline for
>> + * CNA-specific state.
>>  */
>> static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
>> 
>> @@ -316,7 +319,7 @@ static __always_inline void __mcs_pass_lock(struct mcs_spinlock *node,
>> #define try_clear_tail	__try_clear_tail
>> #define mcs_pass_lock		__mcs_pass_lock
>> 
>> -#endif /* _GEN_PV_LOCK_SLOWPATH */
>> +#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */
>> 
>> /**
>>  * queued_spin_lock_slowpath - acquire the queued spinlock
>> @@ -588,6 +591,29 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>> }
>> EXPORT_SYMBOL(queued_spin_lock_slowpath);
>> 
>> +/*
>> + * Generate the code for NUMA-aware spinlocks
>> + */
>> +#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
>> +#define _GEN_CNA_LOCK_SLOWPATH
>> +
>> +#undef pv_wait_head_or_lock
>> +#define pv_wait_head_or_lock		cna_pre_scan
>> +
>> +#undef try_clear_tail
>> +#define try_clear_tail			cna_try_change_tail
>> +
>> +#undef mcs_pass_lock
>> +#define mcs_pass_lock			cna_pass_lock
>> +
>> +#undef  queued_spin_lock_slowpath
>> +#define queued_spin_lock_slowpath	__cna_queued_spin_lock_slowpath
>> +
>> +#include "qspinlock_cna.h"
>> +#include "qspinlock.c"
>> +
>> +#endif
>> +
>> /*
>>  * Generate the paravirt code for queued_spin_unlock_slowpath().
>>  */
>> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
>> new file mode 100644
>> index 000000000000..a638336f9560
>> --- /dev/null
>> +++ b/kernel/locking/qspinlock_cna.h
>> @@ -0,0 +1,264 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _GEN_CNA_LOCK_SLOWPATH
>> +#error "do not include this file"
>> +#endif
>> +
>> +#include <linux/topology.h>
>> +
>> +/*
>> + * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
>> + *
>> + * In CNA, spinning threads are organized in two queues, a main queue for
>> + * threads running on the same NUMA node as the current lock holder, and a
>> + * secondary queue for threads running on other nodes. Schematically, it
>> + * looks like this:
>> + *
>> + *    cna_node
>> + *   +----------+    +--------+        +--------+
>> + *   |mcs:next  | -> |mcs:next| -> ... |mcs:next| -> NULL      [Main queue]
>> + *   |mcs:locked| -+ +--------+        +--------+
>> + *   +----------+  |
>> + *                 +----------------------+
>> + *                                        \/
>> + *                 +--------+         +--------+
>> + *                 |mcs:next| -> ...  |mcs:next|          [Secondary queue]
>> + *                 +--------+         +--------+
>> + *                     ^                    |
>> + *                     +--------------------+
>> + *
>> + * N.B. locked = 1 if secondary queue is absent. Otherwise, it contains the
>> + * encoded pointer to the tail of the secondary queue, which is organized as a
>> + * circular list.
>> + *
>> + * After acquiring the MCS lock and before acquiring the spinlock, the lock
>> + * holder scans the main queue looking for a thread running on the same node
>> + * (pre-scan). If found (call it thread T), all threads in the main queue
>> + * between the current lock holder and T are moved to the end of the secondary
>> + * queue.  If such T is not found, we make another scan of the main queue when
>> + * unlocking the MCS lock (post-scan), starting at the node where pre-scan
>> + * stopped. If both scans fail to find such T, the MCS lock is passed to the
>> + * first thread in the secondary queue. If the secondary queue is empty, the
>> + * lock is passed to the next thread in the main queue.
>> + *
>> + * For more details, see https://arxiv.org/abs/1810.05600.
>> + *
>> + * Authors: Alex Kogan <alex.kogan@oracle.com>
>> + *          Dave Dice <dave.dice@oracle.com>
>> + */
>> +
>> +struct cna_node {
>> +	struct mcs_spinlock	mcs;
>> +	int			numa_node;
>> +	u32			encoded_tail;
>> +	u32			pre_scan_result; /* 0 or encoded tail */
>> +};
>> +
>> +static void __init cna_init_nodes_per_cpu(unsigned int cpu)
>> +{
>> +	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
>> +	int numa_node = cpu_to_node(cpu);
>> +	int i;
>> +
>> +	for (i = 0; i < MAX_NODES; i++) {
>> +		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
>> +
>> +		cn->numa_node = numa_node;
>> +		cn->encoded_tail = encode_tail(cpu, i);
>> +		/*
>> +		 * @encoded_tail has to be larger than 1, so we do not confuse
>> +		 * it with other valid values for @locked or @pre_scan_result
>> +		 * (0 or 1)
>> +		 */
>> +		WARN_ON(cn->encoded_tail <= 1);
>> +	}
>> +}
>> +
>> +static int __init cna_init_nodes(void)
>> +{
>> +	unsigned int cpu;
>> +
>> +	/*
>> +	 * this will break on 32bit architectures, so we restrict
>> +	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
>> +	 */
>> +	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
>> +	/* we store an encoded tail word in the node's @locked field */
>> +	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
>> +
>> +	for_each_possible_cpu(cpu)
>> +		cna_init_nodes_per_cpu(cpu);
>> +
>> +	return 0;
>> +}
>> +early_initcall(cna_init_nodes);
>> +
> 
> Include a comment here saying that the cna_try_change_tail() function is
> only called when the primary queue is empty. That will make the code
> easier to read.
> 
> 
>> +static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
>> +				       struct mcs_spinlock *node)
>> +{
>> +	struct mcs_spinlock *head_2nd, *tail_2nd;
>> +	u32 new;
>> +
>> +	/* If the secondary queue is empty, do what MCS does. */
>> +	if (node->locked <= 1)
>> +		return __try_clear_tail(lock, val, node);
>> +
>> +	/*
>> +	 * Try to update the tail value to the last node in the secondary queue.
>> +	 * If successful, pass the lock to the first thread in the secondary
>> +	 * queue. Doing those two actions effectively moves all nodes from the
>> +	 * secondary queue into the main one.
>> +	 */
>> +	tail_2nd = decode_tail(node->locked);
>> +	head_2nd = tail_2nd->next;
>> +	new = ((struct cna_node *)tail_2nd)->encoded_tail + _Q_LOCKED_VAL;
>> +
>> +	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
>> +		/*
>> +		 * Try to reset @next in tail_2nd to NULL, but no need to check
>> +		 * the result - if failed, a new successor has updated it.
>> +		 */
>> +		cmpxchg_relaxed(&tail_2nd->next, head_2nd, NULL);
>> +		arch_mcs_pass_lock(&head_2nd->locked, 1);
>> +		return true;
>> +	}
>> +
>> +	return false;
>> +}
> Cheers,
> Longman
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 5/5] locking/qspinlock: Introduce the shuffle reduction optimization into CNA
  2019-11-25 21:07   ` Alex Kogan
@ 2019-12-06 22:00     ` Waiman Long
  -1 siblings, 0 replies; 48+ messages in thread
From: Waiman Long @ 2019-12-06 22:00 UTC (permalink / raw)
  To: Alex Kogan, linux, peterz, mingo, will.deacon, arnd, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber
  Cc: steven.sistare, daniel.m.jordan, dave.dice, rahul.x.yadav

On 11/25/19 4:07 PM, Alex Kogan wrote:
> @@ -234,12 +263,13 @@ __always_inline u32 cna_pre_scan(struct qspinlock *lock,
>  	struct cna_node *cn = (struct cna_node *)node;
>  
>  	/*
> -	 * setting @pre_scan_result to 1 indicates that no post-scan
> +	 * setting @pre_scan_result to 1 or 2 indicates that no post-scan
>  	 * should be made in cna_pass_lock()
>  	 */
>  	cn->pre_scan_result =
> -		cn->intra_count == intra_node_handoff_threshold ?
> -			1 : cna_scan_main_queue(node, node);
> +		(node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) ?
> +			1 : cn->intra_count == intra_node_handoff_threshold ?
> +			2 : cna_scan_main_queue(node, node);
>  
>  	return 0;
>  }
> @@ -253,12 +283,15 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
>  
>  	u32 scan = cn->pre_scan_result;
>  
> +	if (scan == 1)
> +		goto pass_lock;
> +
>  	/*
>  	 * check if a successor from the same numa node has not been found in
>  	 * pre-scan, and if so, try to find it in post-scan starting from the
>  	 * node where pre-scan stopped (stored in @pre_scan_result)
>  	 */
> -	if (scan > 1)
> +	if (scan > 2)
>  		scan = cna_scan_main_queue(node, decode_tail(scan));
>  
>  	if (!scan) { /* if found a successor from the same numa node */
> @@ -281,5 +314,6 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
>  		tail_2nd->next = next;
>  	}
>  
> +pass_lock:
>  	arch_mcs_pass_lock(&next_holder->locked, val);
>  }

I think you might have mishandled the proper accounting of intra_count.
How about something like:

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index f1eef6bece7b..03f8fdec2b80 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -268,7 +268,7 @@ __always_inline u32 cna_pre_scan(struct qspinlock *lock,
 	 */
 	cn->pre_scan_result =
 		(node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) ?
-			1 : cn->intra_count == intra_node_handoff_threshold ?
+			1 : cn->intra_count >= intra_node_handoff_threshold ?
 			2 : cna_scan_main_queue(node, node);
 
 	return 0;
@@ -283,9 +283,6 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 
 	u32 scan = cn->pre_scan_result;
 
-	if (scan == 1)
-		goto pass_lock;
-
 	/*
 	 * check if a successor from the same numa node has not been found in
 	 * pre-scan, and if so, try to find it in post-scan starting from the
@@ -294,7 +291,13 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 	if (scan > 2)
 		scan = cna_scan_main_queue(node, decode_tail(scan));
 
-	if (!scan) { /* if found a successor from the same numa node */
+	if (scan <= 1) { /* if found a successor from the same numa node */
+		/* inc @intra_count if the secondary queue is not empty */
+		((struct cna_node *)next_holder)->intra_count =
+			cn->intra_count + (node->locked > 1);
+		if (scan == 1)
+			goto pass_lock;
+
 		next_holder = node->next;
 		/*
 		 * we unlock successor by passing a non-zero value,
@@ -302,9 +305,6 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 		 * if we acquired the MCS lock when its queue was empty
 		 */
 		val = node->locked ? node->locked : 1;
-		/* inc @intra_count if the secondary queue is not empty */
-		((struct cna_node *)next_holder)->intra_count =
-			cn->intra_count + (node->locked > 1);
 	} else if (node->locked > 1) {	  /* if secondary queue is not empty */
 		/* next holder will be the first node in the secondary queue */
 		tail_2nd = decode_tail(node->locked);

The meaning of scan value:

0 - pass to next cna node, which is in the same numa node. Additional
cna node may or may not be added to the secondary queue

1 - pass to next cna node, which may not be in the same numa node. No
change to secondary queue

2 - exceed intra node handoff threshold, unconditionally merge back the
secondary queue cna nodes, if available

>2 - no cna node of the same numa node found, unconditionally merge back
the secondary queue cna nodes, if available

The code will be easier to read if symbolic names are used instead of just numbers.
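
For instance (the names below are only illustrative, not taken from the patch):

#define LOCAL_WAITER_FOUND	0	/* same-node successor located */
#define PASS_LOCK_IMMEDIATELY	1	/* shuffle reduction: scan was skipped */
#define FLUSH_SECONDARY_QUEUE	2	/* intra-node handoff threshold reached */
/* any value > 2 is the encoded tail at which the post-scan should resume */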

Cheers,
Longman



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2019-11-25 21:07   ` Alex Kogan
  (?)
@ 2020-01-21 20:29     ` Peter Zijlstra
  -1 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2020-01-21 20:29 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice,
	rahul.x.yadav


various notes and changes in the below.

---

Index: linux-2.6/kernel/locking/qspinlock.c
===================================================================
--- linux-2.6.orig/kernel/locking/qspinlock.c
+++ linux-2.6/kernel/locking/qspinlock.c
@@ -598,10 +598,10 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath)
 #define _GEN_CNA_LOCK_SLOWPATH
 
 #undef pv_wait_head_or_lock
-#define pv_wait_head_or_lock		cna_pre_scan
+#define pv_wait_head_or_lock		cna_wait_head_or_lock
 
 #undef try_clear_tail
-#define try_clear_tail			cna_try_change_tail
+#define try_clear_tail			cna_try_clear_tail
 
 #undef mcs_pass_lock
 #define mcs_pass_lock			cna_pass_lock
Index: linux-2.6/kernel/locking/qspinlock_cna.h
===================================================================
--- linux-2.6.orig/kernel/locking/qspinlock_cna.h
+++ linux-2.6/kernel/locking/qspinlock_cna.h
@@ -8,37 +8,37 @@
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
  *
- * In CNA, spinning threads are organized in two queues, a main queue for
+ * In CNA, spinning threads are organized in two queues, a primary queue for
  * threads running on the same NUMA node as the current lock holder, and a
- * secondary queue for threads running on other nodes. Schematically, it
- * looks like this:
+ * secondary queue for threads running on other nodes. Schematically, it looks
+ * like this:
  *
  *    cna_node
- *   +----------+    +--------+        +--------+
- *   |mcs:next  | -> |mcs:next| -> ... |mcs:next| -> NULL      [Main queue]
- *   |mcs:locked| -+ +--------+        +--------+
+ *   +----------+     +--------+         +--------+
+ *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
+ *   |mcs:locked| -.  +--------+         +--------+
  *   +----------+  |
- *                 +----------------------+
- *                                        \/
+ *                 `----------------------.
+ *                                        v
  *                 +--------+         +--------+
- *                 |mcs:next| -> ...  |mcs:next|          [Secondary queue]
+ *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
  *                 +--------+         +--------+
  *                     ^                    |
- *                     +--------------------+
+ *                     `--------------------'
  *
- * N.B. locked = 1 if secondary queue is absent. Othewrise, it contains the
+ * N.B. locked := 1 if secondary queue is absent. Otherwise, it contains the
  * encoded pointer to the tail of the secondary queue, which is organized as a
  * circular list.
  *
  * After acquiring the MCS lock and before acquiring the spinlock, the lock
- * holder scans the main queue looking for a thread running on the same node
- * (pre-scan). If found (call it thread T), all threads in the main queue
+ * holder scans the primary queue looking for a thread running on the same node
+ * (pre-scan). If found (call it thread T), all threads in the primary queue
  * between the current lock holder and T are moved to the end of the secondary
- * queue.  If such T is not found, we make another scan of the main queue when
- * unlocking the MCS lock (post-scan), starting at the node where pre-scan
+ * queue.  If such T is not found, we make another scan of the primary queue
+ * when unlocking the MCS lock (post-scan), starting at the node where pre-scan
  * stopped. If both scans fail to find such T, the MCS lock is passed to the
  * first thread in the secondary queue. If the secondary queue is empty, the
- * lock is passed to the next thread in the main queue.
+ * lock is passed to the next thread in the primary queue.
  *
  * For more details, see https://arxiv.org/abs/1810.05600.
  *
@@ -49,8 +49,8 @@
 struct cna_node {
 	struct mcs_spinlock	mcs;
 	int			numa_node;
-	u32			encoded_tail;
-	u32			pre_scan_result; /* 0 or encoded tail */
+	u32			encoded_tail;    /* self */
+	u32			partial_order;
 };
 
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
@@ -66,7 +66,7 @@ static void __init cna_init_nodes_per_cp
 		cn->encoded_tail = encode_tail(cpu, i);
 		/*
 		 * @encoded_tail has to be larger than 1, so we do not confuse
-		 * it with other valid values for @locked or @pre_scan_result
+		 * it with other valid values for @locked or @partial_order
 		 * (0 or 1)
 		 */
 		WARN_ON(cn->encoded_tail <= 1);
@@ -92,8 +92,8 @@ static int __init cna_init_nodes(void)
 }
 early_initcall(cna_init_nodes);
 
-static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
-				       struct mcs_spinlock *node)
+static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
+				      struct mcs_spinlock *node)
 {
 	struct mcs_spinlock *head_2nd, *tail_2nd;
 	u32 new;
@@ -106,7 +106,9 @@ static inline bool cna_try_change_tail(s
 	 * Try to update the tail value to the last node in the secondary queue.
 	 * If successful, pass the lock to the first thread in the secondary
 	 * queue. Doing those two actions effectively moves all nodes from the
-	 * secondary queue into the main one.
+	 * secondary queue into the primary one.
+	 *
+	 * XXX: reformulate, that's just confusing.
 	 */
 	tail_2nd = decode_tail(node->locked);
 	head_2nd = tail_2nd->next;
@@ -116,6 +118,9 @@ static inline bool cna_try_change_tail(s
 		/*
 		 * Try to reset @next in tail_2nd to NULL, but no need to check
 		 * the result - if failed, a new successor has updated it.
+		 *
+		 * XXX: figure out what how where..
+		 * XXX: cna_pass_lock() does something related
 		 */
 		cmpxchg_relaxed(&tail_2nd->next, head_2nd, NULL);
 		arch_mcs_pass_lock(&head_2nd->locked, 1);
@@ -126,7 +131,7 @@ static inline bool cna_try_change_tail(s
 }
 
 /*
- * cna_splice_tail -- splice nodes in the main queue between [first, last]
+ * cna_splice_tail -- splice nodes in the primary queue between [first, last]
  * onto the secondary queue.
  */
 static void cna_splice_tail(struct mcs_spinlock *node,
@@ -153,77 +158,56 @@ static void cna_splice_tail(struct mcs_s
 }
 
 /*
- * cna_scan_main_queue - scan the main waiting queue looking for the first
- * thread running on the same NUMA node as the lock holder. If found (call it
- * thread T), move all threads in the main queue between the lock holder and
- * T to the end of the secondary queue and return 0; otherwise, return the
- * encoded pointer of the last scanned node in the primary queue (so a
- * subsequent scan can be resumed from that node)
+ * cna_order_queue - scan the primary queue looking for the first lock node on
+ * the same NUMA node as the lock holder and move any skipped nodes onto the
+ * secondary queue.
  *
- * Schematically, this may look like the following (nn stands for numa_node and
- * et stands for encoded_tail).
- *
- *   when cna_scan_main_queue() is called (the secondary queue is empty):
- *
- *  A+------------+   B+--------+   C+--------+   T+--------+
- *   |mcs:next    | -> |mcs:next| -> |mcs:next| -> |mcs:next| -> NULL
- *   |mcs:locked=1|    |cna:nn=0|    |cna:nn=2|    |cna:nn=1|
- *   |cna:nn=1    |    +--------+    +--------+    +--------+
- *   +----------- +
- *
- *   when cna_scan_main_queue() returns (the secondary queue contains B and C):
- *
- *  A+----------------+    T+--------+
- *   |mcs:next        | ->  |mcs:next| -> NULL
- *   |mcs:locked=C.et | -+  |cna:nn=1|
- *   |cna:nn=1        |  |  +--------+
- *   +--------------- +  +-----+
- *                             \/
- *          B+--------+   C+--------+
- *           |mcs:next| -> |mcs:next| -+
- *           |cna:nn=0|    |cna:nn=2|  |
- *           +--------+    +--------+  |
- *               ^                     |
- *               +---------------------+
+ * Returns 0 if a matching node is found; otherwise return the encoded pointer
+ * to the last element inspected (such that a subsequent scan can continue there).
  *
  * The worst case complexity of the scan is O(n), where n is the number
  * of current waiters. However, the amortized complexity is close to O(1),
  * as the immediate successor is likely to be running on the same node once
  * threads from other nodes are moved to the secondary queue.
+ *
+ * XXX does not compute; given equal contention it should average to O(nr_nodes).
  */
-static u32 cna_scan_main_queue(struct mcs_spinlock *node,
-			       struct mcs_spinlock *pred_start)
+static u32 cna_order_queue(struct mcs_spinlock *node,
+			   struct mcs_spinlock *iter)
 {
+	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
 	struct cna_node *cn = (struct cna_node *)node;
-	struct cna_node *cni = (struct cna_node *)READ_ONCE(pred_start->next);
+	int nid = cn->numa_node;
 	struct cna_node *last;
-	int my_numa_node = cn->numa_node;
 
 	/* find any next waiter on 'our' NUMA node */
 	for (last = cn;
-	     cni && cni->numa_node != my_numa_node;
+	     cni && cni->numa_node != nid;
 	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
 		;
 
-	/* if found, splice any skipped waiters onto the secondary queue */
-	if (cni) {
-		if (last != cn)	/* did we skip any waiters? */
-			cna_splice_tail(node, node->next,
-					(struct mcs_spinlock *)last);
-		return 0;
-	}
+	if (!cni)
+		return last->encoded_tail; /* continue from here */
+
+	if (last != cn)	/* did we skip any waiters? */
+		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
 
-	return last->encoded_tail;
+	return 0;
 }
 
-__always_inline u32 cna_pre_scan(struct qspinlock *lock,
-				  struct mcs_spinlock *node)
+/* abuse the pv_wait_head_or_lock() hook to get some work done */
+static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
+						 struct mcs_spinlock *node)
 {
 	struct cna_node *cn = (struct cna_node *)node;
 
-	cn->pre_scan_result = cna_scan_main_queue(node, node);
+	/*
+	 * Try and put the time otherwise spent spin waiting on
+	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+	 */
+	cn->partial_order = cna_order_queue(node, node);
 
-	return 0;
+	return 0; /* we lied; we didn't wait, go do so now */
 }
 
 static inline void cna_pass_lock(struct mcs_spinlock *node,
@@ -231,33 +215,27 @@ static inline void cna_pass_lock(struct
 {
 	struct cna_node *cn = (struct cna_node *)node;
 	struct mcs_spinlock *next_holder = next, *tail_2nd;
-	u32 val = 1;
+	u32 partial_order = cn->partial_order;
+	u32 val = _Q_LOCKED_VAL;
 
-	u32 scan = cn->pre_scan_result;
+	/* cna_wait_head_or_lock() left work for us. */
+	if (partial_order)
+		partial_order = cna_order_queue(node, decode_tail(partial_order));
 
-	/*
-	 * check if a successor from the same numa node has not been found in
-	 * pre-scan, and if so, try to find it in post-scan starting from the
-	 * node where pre-scan stopped (stored in @pre_scan_result)
-	 */
-	if (scan > 0)
-		scan = cna_scan_main_queue(node, decode_tail(scan));
-
-	if (!scan) { /* if found a successor from the same numa node */
+	if (!partial_order) {		/* ordered; use primary queue */
 		next_holder = node->next;
-		/*
-		 * we unlock successor by passing a non-zero value,
-		 * so set @val to 1 iff @locked is 0, which will happen
-		 * if we acquired the MCS lock when its queue was empty
-		 */
-		val = node->locked ? node->locked : 1;
-	} else if (node->locked > 1) {	  /* if secondary queue is not empty */
+		val |= node->locked;	/* preserve secondary queue */
+
+	} else if (node->locked > 1) {	/* try secondary queue */
+
 		/* next holder will be the first node in the secondary queue */
 		tail_2nd = decode_tail(node->locked);
 		/* @tail_2nd->next points to the head of the secondary queue */
 		next_holder = tail_2nd->next;
-		/* splice the secondary queue onto the head of the main queue */
+		/* splice the secondary queue onto the head of the primary queue */
 		tail_2nd->next = next;
+
+		// XXX cna_try_clear_tail also does something like this
 	}
 
 	arch_mcs_pass_lock(&next_holder->locked, val);


^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-21 20:29     ` Peter Zijlstra
@ 2020-01-22  8:58       ` Will Deacon
  -1 siblings, 0 replies; 48+ messages in thread
From: Will Deacon @ 2020-01-22  8:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Kogan, linux-arch, guohanjun, arnd, dave.dice, jglauber,
	x86, will.deacon, linux, steven.sistare, linux-kernel,
	rahul.x.yadav, mingo, bp, hpa, longman, tglx, daniel.m.jordan,
	linux-arm-kernel

On Tue, Jan 21, 2020 at 09:29:19PM +0100, Peter Zijlstra wrote:
> 
> various notes and changes in the below.
> 
> ---
> 
> Index: linux-2.6/kernel/locking/qspinlock.c
> ===================================================================
> --- linux-2.6.orig/kernel/locking/qspinlock.c
> +++ linux-2.6/kernel/locking/qspinlock.c
> @@ -598,10 +598,10 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath)
>  #define _GEN_CNA_LOCK_SLOWPATH
>  
>  #undef pv_wait_head_or_lock
> -#define pv_wait_head_or_lock		cna_pre_scan
> +#define pv_wait_head_or_lock		cna_wait_head_or_lock
>  
>  #undef try_clear_tail
> -#define try_clear_tail			cna_try_change_tail
> +#define try_clear_tail			cna_try_clear_tail
>  
>  #undef mcs_pass_lock
>  #define mcs_pass_lock			cna_pass_lock
> Index: linux-2.6/kernel/locking/qspinlock_cna.h
> ===================================================================
> --- linux-2.6.orig/kernel/locking/qspinlock_cna.h
> +++ linux-2.6/kernel/locking/qspinlock_cna.h
> @@ -8,37 +8,37 @@
>  /*
>   * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
>   *
> - * In CNA, spinning threads are organized in two queues, a main queue for
> + * In CNA, spinning threads are organized in two queues, a primary queue for
>   * threads running on the same NUMA node as the current lock holder, and a
> - * secondary queue for threads running on other nodes. Schematically, it
> - * looks like this:
> + * secondary queue for threads running on other nodes. Schematically, it looks
> + * like this:
>   *
>   *    cna_node
> - *   +----------+    +--------+        +--------+
> - *   |mcs:next  | -> |mcs:next| -> ... |mcs:next| -> NULL      [Main queue]
> - *   |mcs:locked| -+ +--------+        +--------+
> + *   +----------+     +--------+         +--------+
> + *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
> + *   |mcs:locked| -.  +--------+         +--------+
>   *   +----------+  |
> - *                 +----------------------+
> - *                                        \/
> + *                 `----------------------.
> + *                                        v
>   *                 +--------+         +--------+
> - *                 |mcs:next| -> ...  |mcs:next|          [Secondary queue]
> + *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
>   *                 +--------+         +--------+
>   *                     ^                    |
> - *                     +--------------------+
> + *                     `--------------------'
>   *
> - * N.B. locked = 1 if secondary queue is absent. Othewrise, it contains the
> + * N.B. locked := 1 if secondary queue is absent. Othewrise, it contains the

If we're redoing the comment, please can you s/Othewrise/Otherwise/ at the
same time? It catches me every time I read it!

Will

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH v7 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic
  2019-11-25 21:07   ` Alex Kogan
  (?)
@ 2020-01-22  9:15     ` Peter Zijlstra
  -1 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2020-01-22  9:15 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice,
	rahul.x.yadav

On Mon, Nov 25, 2019 at 04:07:05PM -0500, Alex Kogan wrote:

> --- a/arch/arm/include/asm/mcs_spinlock.h
> +++ b/arch/arm/include/asm/mcs_spinlock.h
> @@ -6,7 +6,7 @@
>  #include <asm/spinlock.h>
>  
>  /* MCS spin-locking. */
> -#define arch_mcs_spin_lock_contended(lock)				\
> +#define arch_mcs_spin_lock(lock)				\
>  do {									\
>  	/* Ensure prior stores are observed before we enter wfe. */	\
>  	smp_mb();							\
> @@ -14,9 +14,9 @@ do {									\
>  		wfe();							\
>  } while (0)								\
>  
> -#define arch_mcs_spin_unlock_contended(lock)				\
> +#define arch_mcs_pass_lock(lock, val)					\
>  do {									\
> -	smp_store_release(lock, 1);					\
> +	smp_store_release((lock), (val));				\
>  	dsb_sev();							\
>  } while (0)

So I hate those names; it used to be clear this was the contended path,
not so anymore. arch_mcs_spin_lock() in particular is grossly misnamed
now.

's/arch_mcs_spin_lock/arch_mcs_spin_wait/g' could perhaps work, if you
really want to get rid of the _contended suffix.

Also, pass_lock seems unfortunately named...
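
For example, with that rename the arm helper from the quoted hunk would
read as below; this is only a sketch of the suggestion, the body is
unchanged from what is quoted above:

#define arch_mcs_spin_wait(lock)					\
do {									\
	/* Ensure prior stores are observed before we enter wfe. */	\
	smp_mb();							\
	while (!(smp_load_acquire(lock)))				\
		wfe();							\
} while (0)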

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-21 20:29     ` Peter Zijlstra
  (?)
@ 2020-01-22  9:22       ` Peter Zijlstra
  -1 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2020-01-22  9:22 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice,
	rahul.x.yadav

On Tue, Jan 21, 2020 at 09:29:19PM +0100, Peter Zijlstra wrote:
> @@ -92,8 +92,8 @@ static int __init cna_init_nodes(void)
>  }
>  early_initcall(cna_init_nodes);
>  
> -static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
> -				       struct mcs_spinlock *node)
> +static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
> +				      struct mcs_spinlock *node)
>  {
>  	struct mcs_spinlock *head_2nd, *tail_2nd;
>  	u32 new;

Also, that whole function is placed wrong; it should be between
cna_wait_head_or_lock() and cna_pass_lock(), then it's in the order they
appear in the slow path, i.e. the order they actually run.
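
For instance, a layout along these lines (just a sketch of that
ordering; the hook mapping is the one from the qspinlock.c hunk earlier
in the thread):

	cna_init_nodes_per_cpu() / cna_init_nodes()
	cna_splice_tail()
	cna_order_queue()
	cna_wait_head_or_lock()		/* pv_wait_head_or_lock */
	cna_try_clear_tail()		/* try_clear_tail */
	cna_pass_lock()			/* mcs_pass_lock */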

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-21 20:29     ` Peter Zijlstra
  (?)
@ 2020-01-22  9:51       ` Peter Zijlstra
  -1 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2020-01-22  9:51 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice,
	rahul.x.yadav

On Tue, Jan 21, 2020 at 09:29:19PM +0100, Peter Zijlstra wrote:
> 
> various notes and changes in the below.

Also, sorry for replying to v7 and v8; I forgot to refresh email on the
laptop, had spotty cell service last night, and only found v7 in that
mailbox.

Afaict none of the things I commented on were fundamentally changed
though.

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-22  9:51       ` Peter Zijlstra
  (?)
@ 2020-01-22 17:04         ` Peter Zijlstra
  -1 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2020-01-22 17:04 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice,
	rahul.x.yadav

On Wed, Jan 22, 2020 at 10:51:27AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2020 at 09:29:19PM +0100, Peter Zijlstra wrote:
> > 
> > various notes and changes in the below.
> 
> Also, sorry for replying to v7 and v8, I forgot to refresh email on the
> laptop and had spotty cell service last night and only found v7 in that
> mailbox.
> 
> Afaict none of the things I commented on were fundamentally changed
> though.

But since I was editing, here is the latest version..

---

Index: linux-2.6/kernel/locking/qspinlock_cna.h
===================================================================
--- /dev/null
+++ linux-2.6/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,261 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+#include <linux/topology.h>
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a primary queue for
+ * threads running on the same NUMA node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. Schematically, it looks
+ * like this:
+ *
+ *    cna_node
+ *   +----------+     +--------+         +--------+
+ *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
+ *   |mcs:locked| -.  +--------+         +--------+
+ *   +----------+  |
+ *                 `----------------------.
+ *                                        v
+ *                 +--------+         +--------+
+ *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
+ *                 +--------+         +--------+
+ *                     ^                    |
+ *                     `--------------------'
+ *
+ * N.B. locked := 1 if secondary queue is absent. Otherwise, it contains the
+ * encoded pointer to the tail of the secondary queue, which is organized as a
+ * circular list.
+ *
+ * After acquiring the MCS lock and before acquiring the spinlock, the lock
+ * holder scans the primary queue looking for a thread running on the same node
+ * (pre-scan). If found (call it thread T), all threads in the primary queue
+ * between the current lock holder and T are moved to the end of the secondary
+ * queue.  If such T is not found, we make another scan of the primary queue
+ * when unlocking the MCS lock (post-scan), starting at the node where pre-scan
+ * stopped. If both scans fail to find such T, the MCS lock is passed to the
+ * first thread in the secondary queue. If the secondary queue is empty, the
+ * lock is passed to the next thread in the primary queue.
+ *
+ * For more details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <alex.kogan@oracle.com>
+ *          Dave Dice <dave.dice@oracle.com>
+ */
+
+struct cna_node {
+	struct mcs_spinlock	mcs;
+	int			numa_node;
+	u32			encoded_tail;    /* self */
+	u32			partial_order;
+};
+
+static void __init cna_init_nodes_per_cpu(unsigned int cpu)
+{
+	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
+	int numa_node = cpu_to_node(cpu);
+	int i;
+
+	for (i = 0; i < MAX_NODES; i++) {
+		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
+
+		cn->numa_node = numa_node;
+		cn->encoded_tail = encode_tail(cpu, i);
+		/*
+		 * @encoded_tail has to be larger than 1, so we do not confuse
+		 * it with other valid values for @locked or @partial_order
+		 * (0 or 1)
+		 */
+		WARN_ON(cn->encoded_tail <= 1);
+	}
+}
+
+static int __init cna_init_nodes(void)
+{
+	unsigned int cpu;
+
+	/*
+	 * this will break on 32bit architectures, so we restrict
+	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
+	 */
+	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+	/* we store an encoded tail word in the node's @locked field */
+	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
+
+	for_each_possible_cpu(cpu)
+		cna_init_nodes_per_cpu(cpu);
+
+	return 0;
+}
+early_initcall(cna_init_nodes);
+
+/*
+ * cna_splice_tail -- splice nodes in the primary queue between [first, last]
+ * onto the secondary queue.
+ */
+static void cna_splice_tail(struct mcs_spinlock *node,
+			    struct mcs_spinlock *first,
+			    struct mcs_spinlock *last)
+{
+	/* remove [first,last] */
+	node->next = last->next;
+
+	/* stick [first,last] on the secondary queue tail */
+	if (node->locked <= 1) { /* if secondary queue is empty */
+		/* create secondary queue */
+		last->next = first;
+	} else {
+		/* add to the tail of the secondary queue */
+		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
+		struct mcs_spinlock *head_2nd = tail_2nd->next;
+
+		tail_2nd->next = first;
+		last->next = head_2nd;
+	}
+
+	node->locked = ((struct cna_node *)last)->encoded_tail;
+}
+
+/*
+ * cna_order_queue - scan the primary queue looking for the first lock node on
+ * the same NUMA node as the lock holder and move any skipped nodes onto the
+ * secondary queue.
+ *
+ * Returns 0 if a matching node is found; otherwise return the encoded pointer
+ * to the last element inspected (such that a subsequent scan can continue there).
+ *
+ * The worst case complexity of the scan is O(n), where n is the number
+ * of current waiters. However, the amortized complexity is close to O(1),
+ * as the immediate successor is likely to be running on the same node once
+ * threads from other nodes are moved to the secondary queue.
+ *
+ * XXX does not compute; given equal contention it should average to O(nr_nodes).
+ */
+static u32 cna_order_queue(struct mcs_spinlock *node,
+			   struct mcs_spinlock *iter)
+{
+	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
+	struct cna_node *cn = (struct cna_node *)node;
+	int nid = cn->numa_node;
+	struct cna_node *last;
+
+	/* find any next waiter on 'our' NUMA node */
+	for (last = cn;
+	     cni && cni->numa_node != nid;
+	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
+		;
+
+	if (!cni)
+		return last->encoded_tail; /* continue from here */
+
+	if (last != cn)	/* did we skip any waiters? */
+		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
+
+	return 0;
+}
+
+/*
+ * cna_splice_head -- splice the entire secondary queue onto the head of the
+ * primary queue.
+ */
+static struct mcs_spinlock *cna_splice_head(struct qspinlock *lock, u32 val,
+					    struct mcs_spinlock *node, struct mcs_spinlock *next)
+{
+	struct mcs_spinlock *head_2nd, *tail_2nd;
+
+	tail_2nd = decode_tail(node->locked);
+	head_2nd = tail_2nd->next;
+
+	if (lock) {
+		u32 new = ((struct cna_node *)tail_2nd)->encoded_tail | _Q_LOCKED_VAL;
+		if (!atomic_try_cmpxchg_relaxed(&lock->val, &val, new))
+			return NULL;
+
+		/*
+		 * The moment we've linked the primary tail we can race with
+		 * the WRITE_ONCE(prev->next, node) store from new waiters.
+		 * That store must win.
+		 */
+		cmpxchg_relaxed(&tail_2nd->next, head_2nd, next);
+	} else {
+		tail_2nd->next = next;
+	}
+
+	return head_2nd;
+}
+
+/* Abuse the pv_wait_head_or_lock() hook to get some work done */
+static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
+						 struct mcs_spinlock *node)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+
+	/*
+	 * Try and put the time otherwise spent spin-waiting on
+	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+	 */
+	cn->partial_order = cna_order_queue(node, node);
+
+	return 0; /* we lied; we didn't wait, go do so now */
+}
+
+static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
+				      struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *next;
+
+	/*
+	 * We're here because the primary queue is empty; check the secondary
+	 * queue for remote waiters.
+	 */
+	if (node->locked > 1) {
+		/*
+		 * When there are waiters on the secondary queue move them back
+		 * onto the primary queue and let them rip.
+		 */
+		next = cna_splice_head(lock, val, node, NULL);
+		if (next) {
+			arch_mcs_pass_lock(&next->locked, 1);
+			return true;
+		}
+
+		return false;
+	}
+
+	/* Both queues empty. */
+	return __try_clear_tail(lock, val, node);
+}
+
+static inline void cna_pass_lock(struct mcs_spinlock *node,
+				 struct mcs_spinlock *next)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+	u32 partial_order = cn->partial_order;
+	u32 val = _Q_LOCKED_VAL;
+
+	/* cna_wait_head_or_lock() left work for us. */
+	if (partial_order)
+		partial_order = cna_order_queue(node, decode_tail(partial_order));
+
+	if (!partial_order) {
+		/*
+		 * We found a local waiter; reload @next in case we called
+		 * cna_order_queue() above.
+		 */
+		next = node->next;
+		val |= node->locked;	/* preserve secondary queue */
+
+	} else if (node->locked > 1) {
+		/*
+		 * When there are no local waiters on the primary queue, splice
+		 * the secondary queue onto the primary queue and pass the lock
+		 * to the longest waiting remote waiter.
+		 */
+		next = cna_splice_head(NULL, 0, node, next);
+	}
+
+	arch_mcs_pass_lock(&next->locked, val);
+}
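
As a stand-alone illustration of the reordering done by
cna_order_queue()/cna_splice_tail() above, here is a user-space sketch
with plain pointers instead of encoded tails and no atomics; struct
qnode, splice_to_secondary() and order_queue() are names invented for
the sketch only:

#include <stdio.h>

struct qnode {
	struct qnode *next;
	int numa_node;
	const char *name;
};

/* circular secondary queue, tracked by its tail (NULL if empty) */
static struct qnode *sec_tail;

/* append [first, last], taken from the primary queue, to the secondary queue */
static void splice_to_secondary(struct qnode *first, struct qnode *last)
{
	if (!sec_tail) {
		last->next = first;		/* create the circular list */
	} else {
		last->next = sec_tail->next;	/* old head */
		sec_tail->next = first;
	}
	sec_tail = last;
}

/*
 * Scan the primary queue after @head for the first waiter on NUMA node
 * @nid; waiters that get skipped are moved to the secondary queue.
 * Returns the matching waiter, or NULL if the scan ran off the end.
 */
static struct qnode *order_queue(struct qnode *head, int nid)
{
	struct qnode *first = head->next, *last = head, *cur = head->next;

	while (cur && cur->numa_node != nid) {
		last = cur;
		cur = cur->next;
	}

	if (!cur)
		return NULL;		/* nothing local; a later scan may resume */

	if (last != head) {		/* skipped first..last: move them out */
		head->next = cur;
		splice_to_secondary(first, last);
	}
	return cur;
}

int main(void)
{
	struct qnode A = { .numa_node = 1, .name = "A" };
	struct qnode B = { .numa_node = 0, .name = "B" };
	struct qnode C = { .numa_node = 2, .name = "C" };
	struct qnode T = { .numa_node = 1, .name = "T" };

	A.next = &B; B.next = &C; C.next = &T; T.next = NULL;

	struct qnode *match = order_queue(&A, A.numa_node);

	/* prints: primary: A -> T, secondary head: B */
	printf("primary: %s -> %s, secondary head: %s\n",
	       A.name, match->name, sec_tail->next->name);
	return 0;
}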

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
@ 2020-01-22 17:04         ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2020-01-22 17:04 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux-arch, guohanjun, arnd, dave.dice, jglauber, x86,
	will.deacon, linux, steven.sistare, linux-kernel, rahul.x.yadav,
	mingo, bp, hpa, longman, tglx, daniel.m.jordan, linux-arm-kernel

On Wed, Jan 22, 2020 at 10:51:27AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2020 at 09:29:19PM +0100, Peter Zijlstra wrote:
> > 
> > various notes and changes in the below.
> 
> Also, sorry for replying to v7 and v8, I forgot to refresh email on the
> laptop and had spotty cell service last night and only found v7 in that
> mailbox.
> 
> Afaict none of the things I commented on were fundamentally changed
> though.

But since I was editing, here is the latest version..

---

Index: linux-2.6/kernel/locking/qspinlock_cna.h
===================================================================
--- /dev/null
+++ linux-2.6/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,261 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+#include <linux/topology.h>
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a primary queue for
+ * threads running on the same NUMA node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. Schematically, it looks
+ * like this:
+ *
+ *    cna_node
+ *   +----------+     +--------+         +--------+
+ *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
+ *   |mcs:locked| -.  +--------+         +--------+
+ *   +----------+  |
+ *                 `----------------------.
+ *                                        v
+ *                 +--------+         +--------+
+ *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
+ *                 +--------+         +--------+
+ *                     ^                    |
+ *                     `--------------------'
+ *
+ * N.B. locked := 1 if secondary queue is absent. Otherwise, it contains the
+ * encoded pointer to the tail of the secondary queue, which is organized as a
+ * circular list.
+ *
+ * After acquiring the MCS lock and before acquiring the spinlock, the lock
+ * holder scans the primary queue looking for a thread running on the same node
+ * (pre-scan). If found (call it thread T), all threads in the primary queue
+ * between the current lock holder and T are moved to the end of the secondary
+ * queue.  If such T is not found, we make another scan of the primary queue
+ * when unlocking the MCS lock (post-scan), starting at the node where pre-scan
+ * stopped. If both scans fail to find such T, the MCS lock is passed to the
+ * first thread in the secondary queue. If the secondary queue is empty, the
+ * lock is passed to the next thread in the primary queue.
+ *
+ * For more details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <alex.kogan@oracle.com>
+ *          Dave Dice <dave.dice@oracle.com>
+ */
+
+struct cna_node {
+	struct mcs_spinlock	mcs;
+	int			numa_node;
+	u32			encoded_tail;    /* self */
+	u32			partial_order;
+};
+
+static void __init cna_init_nodes_per_cpu(unsigned int cpu)
+{
+	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
+	int numa_node = cpu_to_node(cpu);
+	int i;
+
+	for (i = 0; i < MAX_NODES; i++) {
+		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
+
+		cn->numa_node = numa_node;
+		cn->encoded_tail = encode_tail(cpu, i);
+		/*
+		 * @encoded_tail has to be larger than 1, so we do not confuse
+		 * it with other valid values for @locked or @partial_order
+		 * (0 or 1)
+		 */
+		WARN_ON(cn->encoded_tail <= 1);
+	}
+}
+
+static int __init cna_init_nodes(void)
+{
+	unsigned int cpu;
+
+	/*
+	 * this will break on 32bit architectures, so we restrict
+	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
+	 */
+	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+	/* we store an encoded tail word in the node's @locked field */
+	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
+
+	for_each_possible_cpu(cpu)
+		cna_init_nodes_per_cpu(cpu);
+
+	return 0;
+}
+early_initcall(cna_init_nodes);
+
+/*
+ * cna_splice_tail -- splice nodes in the primary queue between [first, last]
+ * onto the secondary queue.
+ */
+static void cna_splice_tail(struct mcs_spinlock *node,
+			    struct mcs_spinlock *first,
+			    struct mcs_spinlock *last)
+{
+	/* remove [first,last] */
+	node->next = last->next;
+
+	/* stick [first,last] on the secondary queue tail */
+	if (node->locked <= 1) { /* if secondary queue is empty */
+		/* create secondary queue */
+		last->next = first;
+	} else {
+		/* add to the tail of the secondary queue */
+		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
+		struct mcs_spinlock *head_2nd = tail_2nd->next;
+
+		tail_2nd->next = first;
+		last->next = head_2nd;
+	}
+
+	node->locked = ((struct cna_node *)last)->encoded_tail;
+}
+
+/*
+ * cna_order_queue - scan the primary queue looking for the first lock node on
+ * the same NUMA node as the lock holder and move any skipped nodes onto the
+ * secondary queue.
+ *
+ * Returns 0 if a matching node is found; otherwise return the encoded pointer
+ * to the last element inspected (such that a subsequent scan can continue there).
+ *
+ * The worst case complexity of the scan is O(n), where n is the number
+ * of current waiters. However, the amortized complexity is close to O(1),
+ * as the immediate successor is likely to be running on the same node once
+ * threads from other nodes are moved to the secondary queue.
+ *
+ * XXX does not compute; given equal contention it should average to O(nr_nodes).
+ */
+static u32 cna_order_queue(struct mcs_spinlock *node,
+			   struct mcs_spinlock *iter)
+{
+	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
+	struct cna_node *cn = (struct cna_node *)node;
+	int nid = cn->numa_node;
+	struct cna_node *last;
+
+	/* find any next waiter on 'our' NUMA node */
+	for (last = cn;
+	     cni && cni->numa_node != nid;
+	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
+		;
+
+	if (!cna)
+		return last->encoded_tail; /* continue from here */
+
+	if (last != cn)	/* did we skip any waiters? */
+		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
+
+	return 0;
+}
+
+/*
+ * cna_splice_head -- splice the entire secondary queue onto the head of the
+ * primary queue.
+ */
+static cna_splice_head(struct qspinlock *lock, u32 val,
+		       struct mcs_spinlock *node, struct mcs_spinlock *next)
+{
+	struct mcs_spinlock *head_2nd, *tail_2nd;
+
+	tail_2nd = decode_tail(node->locked);
+	head_2nd = tail_2nd->next;
+
+	if (lock) {
+		u32 new = ((struct cna_node *)tail_2nd)->encoded_tail | _Q_LOCKED_VAL;
+		if (!atomic_try_cmpxchg_relaxed(&lock->val, &val, new))
+			return NULL;
+
+		/*
+		 * The moment we've linked the primary tail we can race with
+		 * the WRITE_ONCE(prev->next, node) store from new waiters.
+		 * That store must win.
+		 */
+		cmpxchg_relaxed(&tail_2nd->next, head_2nd, next);
+	} else {
+		tail_2nd->next = next;
+	}
+
+	return head_2nd;
+}
+
+/* Abuse the pv_wait_head_or_lock() hook to get some work done */
+static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
+						 struct mcs_spinlock *node)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+
+	/*
+	 * Try and put the time otherwise spent spin waiting on
+	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+	 */
+	cn->partial_order = cna_order_queue(node, node);
+
+	return 0; /* we lied; we didn't wait, go do so now */
+}
+
+static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
+				      struct mcs_spinlock *node)
+{
+	struct mcs_spinlock *next;
+
+	/*
+	 * We're here because the primary queue is empty; check the secondary
+	 * queue for remote waiters.
+	 */
+	if (node->locked > 1) {
+		/*
+		 * When there are waiters on the secondary queue move them back
+		 * onto the primary queue and let them rip.
+		 */
+		next = cna_splice_head(lock, val, node, NULL);
+		if (next) {
+			arch_mcs_pass_lock(&head_2nd->locked, 1);
+			return true;
+		}
+
+		return false;
+	}
+
+	/* Both queues empty. */
+	return __try_clear_tail(lock, val, node);
+}
+
+static inline void cna_pass_lock(struct mcs_spinlock *node,
+				 struct mcs_spinlock *next)
+{
+	struct cna_node *cn = (struct cna_node *)node;
+	u32 partial_order = cn->partial_order;
+	u32 val = _Q_LOCKED_VAL;
+
+	/* cna_wait_head_or_lock() left work for us. */
+	if (partial_order)
+		partial_order = cna_order_queue(node, decode_tail(partial_order));
+
+	if (!partial_order) {
+		/*
+		 * We found a local waiter; reload @next in case we called
+		 * cna_order_queue() above.
+		 */
+		next = node->next;
+		val |= node->locked;	/* preserve secondary queue */
+
+	} else if (node->locked > 1) {
+		/*
+		 * When there are no local waiters on the primary queue, splice
+		 * the secondary queue onto the primary queue and pass the lock
+		 * to the longest waiting remote waiter.
+		 */
+		next = cna_splice_head(NULL, 0, node, next);
+	}
+
+	arch_mcs_pass_lock(&next->locked, val);
+}

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-22 17:04         ` Peter Zijlstra
  (?)
@ 2020-01-23  9:00           ` Peter Zijlstra
  -1 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2020-01-23  9:00 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, will.deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice,
	rahul.x.yadav

On Wed, Jan 22, 2020 at 06:04:56PM +0100, Peter Zijlstra wrote:
> +/*
> + * cna_splice_head -- splice the entire secondary queue onto the head of the
> + * primary queue.
> + */
> +static cna_splice_head(struct qspinlock *lock, u32 val,
> +		       struct mcs_spinlock *node, struct mcs_spinlock *next)
> +{
> +	struct mcs_spinlock *head_2nd, *tail_2nd;
> +
> +	tail_2nd = decode_tail(node->locked);
> +	head_2nd = tail_2nd->next;
> +
> +	if (lock) {

That should be: if (!next) {

> +		u32 new = ((struct cna_node *)tail_2nd)->encoded_tail | _Q_LOCKED_VAL;
> +		if (!atomic_try_cmpxchg_relaxed(&lock->val, &val, new))
> +			return NULL;
> +
> +		/*
> +		 * The moment we've linked the primary tail we can race with
> +		 * the WRITE_ONCE(prev->next, node) store from new waiters.
> +		 * That store must win.
> +		 */

And that still is a shit comment; I'll go try again.

> +		cmpxchg_relaxed(&tail_2nd->next, head_2nd, next);
> +	} else {
> +		tail_2nd->next = next;
> +	}
> +
> +	return head_2nd;
> +}

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic
       [not found]     ` <C608A39E-CAFC-4C79-9EB6-3DFD9621E3F6@oracle.com>
@ 2020-01-25 11:23         ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2020-01-25 11:23 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, mingo, Will Deacon, arnd, longman, linux-arch,
	linux-arm-kernel, linux-kernel, tglx, bp, hpa, x86, guohanjun,
	jglauber, steven.sistare, daniel.m.jordan, dave.dice,
	rahul.x.yadav

On Thu, Jan 23, 2020 at 06:51:11PM -0500, Alex Kogan wrote:
> > On Jan 22, 2020, at 4:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > Also, pass_lock seems unfortunately named…

> Well, I know the guy who should be blamed for that one :)
> https://lkml.org/lkml/2019/7/16/212 <https://lkml.org/lkml/2019/7/16/212>

Ha, yes, that guy is an idiot sometimes ;-)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-22 17:04         ` Peter Zijlstra
@ 2020-01-30 22:01           ` Alex Kogan
  -1 siblings, 0 replies; 48+ messages in thread
From: Alex Kogan @ 2020-01-30 22:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux, Ingo Molnar, Will Deacon, Arnd Bergmann, Waiman Long,
	linux-arch, linux-arm-kernel, linux-kernel, Thomas Gleixner,
	Borislav Petkov, hpa, x86, Hanjun Guo, Jan Glauber,
	Steven Sistare, Daniel Jordan, dave.dice

Hi, Peter.

It looks good — thanks for your review!

I have a couple of comments and suggestions.
Please, see below.

> On Jan 22, 2020, at 12:04 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Jan 22, 2020 at 10:51:27AM +0100, Peter Zijlstra wrote:
>> On Tue, Jan 21, 2020 at 09:29:19PM +0100, Peter Zijlstra wrote:
>>> 
>>> various notes and changes in the below.
>> 
>> Also, sorry for replying to v7 and v8, I forgot to refresh email on the
>> laptop and had spotty cell service last night and only found v7 in that
>> mailbox.
>> 
>> Afaict none of the things I commented on were fundamentally changed
>> though.
Nothing fundamental, but some things you may find objectionable, like 
the names of new enum elements :)

> 
> But since I was editing, here is the latest version..
> 
> ---
> 
> Index: linux-2.6/kernel/locking/qspinlock_cna.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6/kernel/locking/qspinlock_cna.h
> @@ -0,0 +1,261 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _GEN_CNA_LOCK_SLOWPATH
> +#error "do not include this file"
> +#endif
> +
> +#include <linux/topology.h>
> +
> +/*
> + * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
> + *
> + * In CNA, spinning threads are organized in two queues, a primary queue for
> + * threads running on the same NUMA node as the current lock holder, and a
> + * secondary queue for threads running on other nodes. Schematically, it looks
> + * like this:
> + *
> + *    cna_node
> + *   +----------+     +--------+         +--------+
> + *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL  [Primary queue]
> + *   |mcs:locked| -.  +--------+         +--------+
> + *   +----------+  |
> + *                 `----------------------.
> + *                                        v
> + *                 +--------+         +--------+
> + *                 |mcs:next| --> ... |mcs:next|            [Secondary queue]
> + *                 +--------+         +--------+
> + *                     ^                    |
> + *                     `--------------------'
> + *
> + * N.B. locked := 1 if secondary queue is absent. Otherwise, it contains the
> + * encoded pointer to the tail of the secondary queue, which is organized as a
> + * circular list.
> + *
> + * After acquiring the MCS lock and before acquiring the spinlock, the lock
> + * holder scans the primary queue looking for a thread running on the same node
> + * (pre-scan). If found (call it thread T), all threads in the primary queue
> + * between the current lock holder and T are moved to the end of the secondary
> + * queue.  If such T is not found, we make another scan of the primary queue
> + * when unlocking the MCS lock (post-scan), starting at the node where pre-scan
> + * stopped. If both scans fail to find such T, the MCS lock is passed to the
> + * first thread in the secondary queue. If the secondary queue is empty, the
> + * lock is passed to the next thread in the primary queue.
> + *
> + * For more details, see https://arxiv.org/abs/1810.05600.
> + *
> + * Authors: Alex Kogan <alex.kogan@oracle.com>
> + *          Dave Dice <dave.dice@oracle.com>
> + */
> +
> +struct cna_node {
> +	struct mcs_spinlock	mcs;
> +	int			numa_node;
> +	u32			encoded_tail;    /* self */
> +	u32			partial_order;
I will not argue about names, just point out that I think pre_scan_result
is more self-explanatory.

> +};
> +
> +static void __init cna_init_nodes_per_cpu(unsigned int cpu)
> +{
> +	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
> +	int numa_node = cpu_to_node(cpu);
> +	int i;
> +
> +	for (i = 0; i < MAX_NODES; i++) {
> +		struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
> +
> +		cn->numa_node = numa_node;
> +		cn->encoded_tail = encode_tail(cpu, i);
> +		/*
> +		 * @encoded_tail has to be larger than 1, so we do not confuse
> +		 * it with other valid values for @locked or @partial_order
> +		 * (0 or 1)
> +		 */
> +		WARN_ON(cn->encoded_tail <= 1);
> +	}
> +}
> +
> +static int __init cna_init_nodes(void)
> +{
> +	unsigned int cpu;
> +
> +	/*
> +	 * this will break on 32bit architectures, so we restrict
> +	 * the use of CNA to 64bit only (see arch/x86/Kconfig)
> +	 */
> +	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
> +	/* we store an ecoded tail word in the node's @locked field */
> +	BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
> +
> +	for_each_possible_cpu(cpu)
> +		cna_init_nodes_per_cpu(cpu);
> +
> +	return 0;
> +}
> +early_initcall(cna_init_nodes);
> +
> +/*
> + * cna_splice_tail -- splice nodes in the primary queue between [first, last]
> + * onto the secondary queue.
> + */
> +static void cna_splice_tail(struct mcs_spinlock *node,
> +			    struct mcs_spinlock *first,
> +			    struct mcs_spinlock *last)
> +{
> +	/* remove [first,last] */
> +	node->next = last->next;
> +
> +	/* stick [first,last] on the secondary queue tail */
> +	if (node->locked <= 1) { /* if secondary queue is empty */
> +		/* create secondary queue */
> +		last->next = first;
> +	} else {
> +		/* add to the tail of the secondary queue */
> +		struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
> +		struct mcs_spinlock *head_2nd = tail_2nd->next;
> +
> +		tail_2nd->next = first;
> +		last->next = head_2nd;
> +	}
> +
> +	node->locked = ((struct cna_node *)last)->encoded_tail;
> +}
> +
> +/*
> + * cna_order_queue - scan the primary queue looking for the first lock node on
> + * the same NUMA node as the lock holder and move any skipped nodes onto the
> + * secondary queue.
Oh man, you took out the ascii figure I was working so hard to put in. Oh well.

> + *
> + * Returns 0 if a matching node is found; otherwise return the encoded pointer
> + * to the last element inspected (such that a subsequent scan can continue there).
> + *
> + * The worst case complexity of the scan is O(n), where n is the number
> + * of current waiters. However, the amortized complexity is close to O(1),
> + * as the immediate successor is likely to be running on the same node once
> + * threads from other nodes are moved to the secondary queue.
> + *
> + * XXX does not compute; given equal contention it should average to O(nr_nodes).
Let me try to convince you. Under contention, the immediate waiter would be
a local one. So the scan would typically take O(1) steps. We need to consider
the extra work/steps we take to move nodes to the secondary queue. The
number of such nodes is O(n) (to be more precise, it is O(n-m), where m
is nr_cpus_per_node), and we spend a constant amount of work, per node, 
to scan those nodes and move them to the secondary queue. So in 2^thresholds
lock handovers, we scan 2^thresholds x 1 + n-m nodes. Assuming 
2^thresholds > n, the amortized complexity of this function then is O(1).
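
(To put made-up numbers on that: with n = 64 waiters, m = 16 cpus per
node and 2^thresholds = 64 intra-node handovers, the scans touch roughly
64 * 1 + (64 - 16) = 112 queue nodes over those 64 handovers, i.e. fewer
than 2 per handover, which is the O(1) amortized cost I mean.)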

> + */
> +static u32 cna_order_queue(struct mcs_spinlock *node,
> +			   struct mcs_spinlock *iter)
> +{
> +	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
> +	struct cna_node *cn = (struct cna_node *)node;
> +	int nid = cn->numa_node;
> +	struct cna_node *last;
> +
> +	/* find any next waiter on 'our' NUMA node */
> +	for (last = cn;
> +	     cni && cni->numa_node != nid;
> +	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
> +		;
> +
> +	if (!cna)
Should be ‘cni’

> +		return last->encoded_tail; /* continue from here */
> +
> +	if (last != cn)	/* did we skip any waiters? */
> +		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
> +
> +	return 0;
> +}
> +
> +/*
> + * cna_splice_head -- splice the entire secondary queue onto the head of the
> + * primary queue.
> + */
> +static cna_splice_head(struct qspinlock *lock, u32 val,
> +		       struct mcs_spinlock *node, struct mcs_spinlock *next)
Missing return value type (struct mcs_spinlock *).

> +{
> +	struct mcs_spinlock *head_2nd, *tail_2nd;
> +
> +	tail_2nd = decode_tail(node->locked);
> +	head_2nd = tail_2nd->next;
I understand that you are trying to avoid repeating those two lines
in both places this function is called from, but you do it at the cost
of adding the following unnecessary if-statement, and in general this function
looks clunky.

Maybe move those two lines into a separate function, e.g.,

static struct mcs_spinlock *cna_extract_2dary_head_tail(unsigned int locked,
								struct mcs_spinlock **tail_2nd)

and then call this function from cna_pass_lock(), while here you can do:

	  struct mcs_spinlock *head_2nd, *tail_2nd;

	  head_2nd = cna_extract_2dary_head_tail(node->locked, &tail_2nd);

	  u32 new = ((struct cna_node *)tail_2nd)->encoded_tail | _Q_LOCKED_VAL;
	  …
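
Something like the following (untested, just a sketch that factors out
those two lines on top of the existing decode_tail()):

	static struct mcs_spinlock *
	cna_extract_2dary_head_tail(unsigned int locked,
				    struct mcs_spinlock **tail_2nd)
	{
		/* @locked holds the encoded tail of the circular secondary queue */
		*tail_2nd = decode_tail(locked);
		/* the tail's ->next is the head of that circular list */
		return (*tail_2nd)->next;
	}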


> +
> +	if (lock) {
> +		u32 new = ((struct cna_node *)tail_2nd)->encoded_tail | _Q_LOCKED_VAL;
> +		if (!atomic_try_cmpxchg_relaxed(&lock->val, &val, new))
> +			return NULL;
> +
> +		/*
> +		 * The moment we've linked the primary tail we can race with
> +		 * the WRITE_ONCE(prev->next, node) store from new waiters.
> +		 * That store must win.
> +		 */
> +		cmpxchg_relaxed(&tail_2nd->next, head_2nd, next);
> +	} else {
> +		tail_2nd->next = next;
> +	}
> +
> +	return head_2nd;
> +}
> +
> +/* Abuse the pv_wait_head_or_lock() hook to get some work done */
> +static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
> +						 struct mcs_spinlock *node)
> +{
> +	struct cna_node *cn = (struct cna_node *)node;
> +
> +	/*
> +	 * Try and put the time otherwise spend spin waiting on
> +	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> +	 */
> +	cn->partial_order = cna_order_queue(node, node);
> +
> +	return 0; /* we lied; we didn't wait, go do so now */
> +}
> +
> +static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
> +				      struct mcs_spinlock *node)
> +{
> +	struct mcs_spinlock *next;
> +
> +	/*
> +	 * We're here because the primary queue is empty; check the secondary
> +	 * queue for remote waiters.
> +	 */
> +	if (node->locked > 1) {
> +		/*
> +		 * When there are waiters on the secondary queue move them back
> +		 * onto the primary queue and let them rip.
> +		 */
> +		next = cna_splice_head(lock, val, node, NULL);
> +		if (next) {
And, again, this if-statement is here only because you moved the rest of the code
into cna_splice_head(). Perhaps, with cna_extract_2dary_head_tail() you can
bring that code back?

> +			arch_mcs_pass_lock(&head_2nd->locked, 1);
Should be next->locked. Better yet, ‘next' should be called ‘head_2nd’.
> +			return true;
> +		}
> +
> +		return false;
> +	}
> +
> +	/* Both queues empty. */
> +	return __try_clear_tail(lock, val, node);
> +}
> +
> +static inline void cna_pass_lock(struct mcs_spinlock *node,
> +				 struct mcs_spinlock *next)
> +{
> +	struct cna_node *cn = (struct cna_node *)node;
> +	u32 partial_order = cn->partial_order;
> +	u32 val = _Q_LOCKED_VAL;
> +
> +	/* cna_wait_head_or_lock() left work for us. */
> +	if (partial_order) {
> +		partial_order = cna_order_queue(node, decode_tail(partial_order));
> +
> +	if (!partial_order) {
> +		/*
> +		 * We found a local waiter; reload @next in case we called
> +		 * cna_order_queue() above.
> +		 */
> +		next = node->next;
> +		val |= node->locked;	/* preseve secondary queue */
This is wrong. node->locked can be 0, 1 or an encoded tail at this point, and
the latter case will be handled incorrectly.

Should be 
		  if (node->locked) val = node->locked;
instead.

> +
> +	} else if (node->locked > 1) {
> +		/*
> +		 * When there are no local waiters on the primary queue, splice
> +		 * the secondary queue onto the primary queue and pass the lock
> +		 * to the longest waiting remote waiter.
> +		 */
> +		next = cna_splice_head(NULL, 0, node, next);
> +	}
> +
> +	arch_mcs_pass_lock(&next->locked, val);
> +}

Regards,
— Alex


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-30 22:01           ` Alex Kogan
@ 2020-01-31 13:35             ` Peter Zijlstra
  -1 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2020-01-31 13:35 UTC (permalink / raw)
  To: Alex Kogan
  Cc: linux, Ingo Molnar, Will Deacon, Arnd Bergmann, Waiman Long,
	linux-arch, linux-arm-kernel, linux-kernel, Thomas Gleixner,
	Borislav Petkov, hpa, x86, Hanjun Guo, Jan Glauber,
	Steven Sistare, Daniel Jordan, dave.dice

On Thu, Jan 30, 2020 at 05:01:15PM -0500, Alex Kogan wrote:

> > +/*
> > + * cna_order_queue - scan the primary queue looking for the first lock node on
> > + * the same NUMA node as the lock holder and move any skipped nodes onto the
> > + * secondary queue.
> Oh man, you took out the ascii figure I was working so hard to put in. Oh well.

Right; sorry about that. The reason it's gone is that it was mostly
identical to the one higher up in the file and didn't consider all
cases this code deals with.

Instead I gifted you cna_splice_head() :-)

> > + *
> > + * Returns 0 if a matching node is found; otherwise return the encoded pointer
> > + * to the last element inspected (such that a subsequent scan can continue there).
> > + *
> > + * The worst case complexity of the scan is O(n), where n is the number
> > + * of current waiters. However, the amortized complexity is close to O(1),
> > + * as the immediate successor is likely to be running on the same node once
> > + * threads from other nodes are moved to the secondary queue.
> > + *
> > + * XXX does not compute; given equal contention it should average to O(nr_nodes).
> Let me try to convince you. Under contention, the immediate waiter would be
> a local one. So the scan would typically take O(1) steps. We need to consider
> the extra work/steps we take to move nodes to the secondary queue. The
> number of such nodes is O(n) (to be more precise, it is O(n-m), where m
> is nr_cpus_per_node), and we spend a constant amount of work, per node, 
> to scan those nodes and move them to the secondary queue. So in 2^thresholds
> lock handovers, we scan 2^thresholds x 1 + n-m nodes. Assuming 
> 2^thresholds > n, the amortized complexity of this function then is O(1).

There is no threshold in this patch. That's the next patch, and
I've been procrastinating on that one, mostly also because of the
'reasonable' claim without data.

But Ah!, I think I see, after nr_cpus tries, all remote waiters are on
the secondary queue and only local waiters will remain. That will indeed
depress the average a lot.

> > + */
> > +static u32 cna_order_queue(struct mcs_spinlock *node,
> > +			   struct mcs_spinlock *iter)
> > +{
> > +	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
> > +	struct cna_node *cn = (struct cna_node *)node;
> > +	int nid = cn->numa_node;
> > +	struct cna_node *last;
> > +
> > +	/* find any next waiter on 'our' NUMA node */
> > +	for (last = cn;
> > +	     cni && cni->numa_node != nid;
> > +	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
> > +		;
> > +
> > +	if (!cna)
> Should be ‘cni’

Yeah, I think the build robots told me the same; in any case, that's
already fixed in the version you can find in my queue.git thing.

> > +		return last->encoded_tail; /* continue from here */
> > +
> > +	if (last != cn)	/* did we skip any waiters? */
> > +		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * cna_splice_head -- splice the entire secondary queue onto the head of the
> > + * primary queue.
> > + */
> > +static cna_splice_head(struct qspinlock *lock, u32 val,
> > +		       struct mcs_spinlock *node, struct mcs_spinlock *next)
> Missing return value type (struct mcs_spinlock *).

Yeah, robots beat you again.

> > +{
> > +	struct mcs_spinlock *head_2nd, *tail_2nd;
> > +
> > +	tail_2nd = decode_tail(node->locked);
> > +	head_2nd = tail_2nd->next;
> I understand that you are trying to avoid repeating those two lines
> in both places this function is called from, but you do it at the cost
> of adding the following unnecessary if-statement, and in general this function
> looks clunky.

Let me show you the current form:

/*
 * cna_splice_head -- splice the entire secondary queue onto the head of the
 * primary queue.
 *
 * Returns the new primary head node or NULL on failure.
 */
static struct mcs_spinlock *
cna_splice_head(struct qspinlock *lock, u32 val,
		struct mcs_spinlock *node, struct mcs_spinlock *next)
{
	struct mcs_spinlock *head_2nd, *tail_2nd;
	u32 new;

	tail_2nd = decode_tail(node->locked);
	head_2nd = tail_2nd->next;

	if (!next) {
		/*
		 * When the primary queue is empty; our tail becomes the primary tail.
		 */

		/*
		 * Speculatively break the secondary queue's circular link; such that when
		 * the secondary tail becomes the primary tail it all works out.
		 */
		tail_2nd->next = NULL;

		/*
		 * tail_2nd->next = NULL		xchg_tail(lock, tail)
		 *
		 * cmpxchg_release(&lock, val, new)	WRITE_ONCE(prev->next, node);
		 *
		 * Such that if the cmpxchg() succeeds, our stores won't collide.
		 */
		new = ((struct cna_node *)tail_2nd)->encoded_tail | _Q_LOCKED_VAL;
		if (!atomic_try_cmpxchg_release(&lock->val, &val, new)) {
			/*
			 * Note; when this cmpxchg fails due to concurrent
			 * queueing of a new waiter, then we'll try again in
			 * cna_pass_lock() if required.
			 *
			 * Restore the secondary queue's circular link.
			 */
			tail_2nd->next = head_2nd;
			return NULL;
		}

	} else {
		/*
		 * If the primary queue is not empty; the primary tail doesn't need
		 * to change and we can simply link our tail to the old head.
		 */
		tail_2nd->next = next;
	}

	/* That which was the secondary queue head, is now the primary queue head */
	return head_2nd;
}

The function as a whole is self-contained and consistent, it deals with
the splice 2nd to 1st queue, for all possible cases. You only have to
think about the list splice in one place, here, instead of two places.

I don't think it will actually result in more branches emitted; the
compiler should be able to use value propagation to eliminate stuff.

> > +static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
> > +				      struct mcs_spinlock *node)
> > +{
> > +	struct mcs_spinlock *next;
> > +
> > +	/*
> > +	 * We're here because the primary queue is empty; check the secondary
> > +	 * queue for remote waiters.
> > +	 */
> > +	if (node->locked > 1) {
> > +		/*
> > +		 * When there are waiters on the secondary queue move them back
> > +		 * onto the primary queue and let them rip.
> > +		 */
> > +		next = cna_splice_head(lock, val, node, NULL);
> > +		if (next) {
> And, again, this if-statement is here only because you moved the rest of the code
> into cna_splice_head(). Perhaps, with cna_extract_2dary_head_tail() you can
> bring that code back?

I don't see the objection, you already had a branch there, from the
cmpxchg(), this is the same branch, the compiler should fold the lot. We
can add an __always_inline if you're worried.

> > +			arch_mcs_pass_lock(&head_2nd->locked, 1);
> Should be next->locked. Better yet, ‘next' should be called ‘head_2nd’.

Yeah, fixed already. Those damn build bots figured it didn't compile.

> > +			return true;
> > +		}
> > +
> > +		return false;
> > +	}
> > +
> > +	/* Both queues empty. */
> > +	return __try_clear_tail(lock, val, node);
> > +}
> > +
> > +static inline void cna_pass_lock(struct mcs_spinlock *node,
> > +				 struct mcs_spinlock *next)
> > +{
> > +	struct cna_node *cn = (struct cna_node *)node;
> > +	u32 partial_order = cn->partial_order;
> > +	u32 val = _Q_LOCKED_VAL;
> > +
> > +	/* cna_wait_head_or_lock() left work for us. */
> > +	if (partial_order) {
> > +		partial_order = cna_order_queue(node, decode_tail(partial_order));
> > +
> > +	if (!partial_order) {
> > +		/*
> > +		 * We found a local waiter; reload @next in case we called
> > +		 * cna_order_queue() above.
> > +		 */
> > +		next = node->next;
> > +		val |= node->locked;	/* preseve secondary queue */
> This is wrong. node->locked can be 0, 1 or an encoded tail at this point, and
> the latter case will be handled incorrectly.
> 
> Should be 
> 		  if (node->locked) val = node->locked;
> instead.

I'm confused, who cares about the locked bit when it has an encoded tail on?

The generic code only cares about it being !0, and the cna code always
checks if it has a tail (>1 , <=1) first.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
  2020-01-31 13:35             ` Peter Zijlstra
@ 2020-01-31 18:33               ` Alex Kogan
  -1 siblings, 0 replies; 48+ messages in thread
From: Alex Kogan @ 2020-01-31 18:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux, Ingo Molnar, Will Deacon, Arnd Bergmann, Waiman Long,
	linux-arch, linux-arm-kernel, linux-kernel, Thomas Gleixner,
	Borislav Petkov, hpa, x86, Hanjun Guo, Jan Glauber,
	Steven Sistare, Daniel Jordan, dave.dice


>>> + *
>>> + * Returns 0 if a matching node is found; otherwise return the encoded pointer
>>> + * to the last element inspected (such that a subsequent scan can continue there).
>>> + *
>>> + * The worst case complexity of the scan is O(n), where n is the number
>>> + * of current waiters. However, the amortized complexity is close to O(1),
>>> + * as the immediate successor is likely to be running on the same node once
>>> + * threads from other nodes are moved to the secondary queue.
>>> + *
>>> + * XXX does not compute; given equal contention it should average to O(nr_nodes).
>> Let me try to convince you. Under contention, the immediate waiter would be
>> a local one. So the scan would typically take O(1) steps. We need to consider
>> the extra work/steps we take to move nodes to the secondary queue. The
>> number of such nodes is O(n) (to be more precise, it is O(n-m), where m
>> is nr_cpus_per_node), and we spend a constant amount of work, per node, 
>> to scan those nodes and move them to the secondary queue. So in 2^thresholds
>> lock handovers, we scan 2^thresholds x 1 + n-m nodes. Assuming 
>> 2^thresholds > n, the amortized complexity of this function then is O(1).
> 
> There is no threshold in this patch.
This does not change the analysis, though.

> That's the next patch, and
> I've been procrastinating on that one, mostly also because of the
> 'reasonable' claim without data.
> 
> But Ah!, I think I see, after nr_cpus tries, all remote waiters are on
> the secondary queue and only local waiters will remain. That will indeed
> depress the average a lot.
Ok, cool.
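
Restating that argument in symbols (with T the hand-off threshold from
the next patch, n the number of waiters and m the waiters on the lock
holder's node), the claim is roughly:

\[
  \text{amortized cost} \approx \frac{2^{T} \cdot O(1) + O(n - m)}{2^{T}}
  = O(1) \quad \text{provided } 2^{T} > n .
\]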

< snip >

> 
>>> +{
>>> +	struct mcs_spinlock *head_2nd, *tail_2nd;
>>> +
>>> +	tail_2nd = decode_tail(node->locked);
>>> +	head_2nd = tail_2nd->next;
>> I understand that you are trying to avoid repeating those two lines
>> in both places this function is called from, but you do it at the cost
>> of adding the following unnecessary if-statement, and in general this function
>> looks clunky.
> 
> Let me show you the current form:
> 
> /*
> * cna_splice_head -- splice the entire secondary queue onto the head of the
> * primary queue.
> *
> * Returns the new primary head node or NULL on failure.
Maybe explain here why failure can happen? E.g., "The latter can happen
due to a race with a waiter joining an empty primary queue."

> */
> static struct mcs_spinlock *
> cna_splice_head(struct qspinlock *lock, u32 val,
> 		struct mcs_spinlock *node, struct mcs_spinlock *next)
> {
> 	struct mcs_spinlock *head_2nd, *tail_2nd;
> 	u32 new;
> 
> 	tail_2nd = decode_tail(node->locked);
> 	head_2nd = tail_2nd->next;
> 
> 	if (!next) {
> 		/*
> 		 * When the primary queue is empty; our tail becomes the primary tail.
> 		 */
> 
> 		/*
> 		 * Speculatively break the secondary queue's circular link; such that when
> 		 * the secondary tail becomes the primary tail it all works out.
> 		 */
> 		tail_2nd->next = NULL;
> 
> 		/*
> 		 * tail_2nd->next = NULL		xchg_tail(lock, tail)
> 		 *
> 		 * cmpxchg_release(&lock, val, new)	WRITE_ONCE(prev->next, node);
> 		 *
> 		 * Such that if the cmpxchg() succeeds, our stores won't collide.
> 		 */
> 		new = ((struct cna_node *)tail_2nd)->encoded_tail | _Q_LOCKED_VAL;
> 		if (!atomic_try_cmpxchg_release(&lock->val, &val, new)) {
> 			/*
> 			 * Note; when this cmpxchg fails due to concurrent
> 			 * queueing of a new waiter, then we'll try again in
> 			 * cna_pass_lock() if required.
> 			 *
> 			 * Restore the secondary queue's circular link.
> 			 */
> 			tail_2nd->next = head_2nd;
> 			return NULL;
> 		}
> 
> 	} else {
> 		/*
> 		 * If the primary queue is not empty; the primary tail doesn't need
> 		 * to change and we can simply link our tail to the old head.
> 		 */
> 		tail_2nd->next = next;
> 	}
> 
> 	/* That which was the secondary queue head, is now the primary queue head */
Rephrase the comment?

> 	return head_2nd;
> }
> 
> The function as a whole is self-contained and consistent, it deals with
> the splice 2nd to 1st queue, for all possible cases. You only have to
> think about the list splice in one place, here, instead of two places.
> 
> I don't think it will actually result in more branches emitted; the
> compiler should be able to use value propagation to eliminate stuff.
Ok, I can see your point.

> 
>>> +static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
>>> +				      struct mcs_spinlock *node)
>>> +{
>>> +	struct mcs_spinlock *next;
>>> +
>>> +	/*
>>> +	 * We're here because the primary queue is empty; check the secondary
>>> +	 * queue for remote waiters.
>>> +	 */
>>> +	if (node->locked > 1) {
>>> +		/*
>>> +		 * When there are waiters on the secondary queue move them back
>>> +		 * onto the primary queue and let them rip.
>>> +		 */
>>> +		next = cna_splice_head(lock, val, node, NULL);
>>> +		if (next) {
>> And, again, this if-statement is here only because you moved the rest of the code
>> into cna_splice_head(). Perhaps, with cna_extract_2dary_head_tail() you can
>> bring that code back?
> 
> I don't see the objection, you already had a branch there, from the
> cmpxchg(), this is the same branch, the compiler should fold the lot.
Now you have the branch from cmpxchg(), and another one from "if (next)".
But you are right that the compiler is likely to optimize out the latter.

> We can add an __always_inline if you're worried.
Let’s do that.

< snip >

>>> +static inline void cna_pass_lock(struct mcs_spinlock *node,
>>> +				 struct mcs_spinlock *next)
>>> +{
>>> +	struct cna_node *cn = (struct cna_node *)node;
>>> +	u32 partial_order = cn->partial_order;
>>> +	u32 val = _Q_LOCKED_VAL;
>>> +
>>> +	/* cna_wait_head_or_lock() left work for us. */
>>> +	if (partial_order) {
>>> +		partial_order = cna_order_queue(node, decode_tail(partial_order));
>>> +
>>> +	if (!partial_order) {
>>> +		/*
>>> +		 * We found a local waiter; reload @next in case we called
>>> +		 * cna_order_queue() above.
>>> +		 */
>>> +		next = node->next;
>>> +		val |= node->locked;	/* preserve secondary queue */
>> This is wrong. node->locked can be 0, 1 or an encoded tail at this point, and
>> the latter case will be handled incorrectly.
>> 
>> Should be 
>> 		  if (node->locked) val = node->locked;
>> instead.
> 
> I'm confused, who cares about the locked bit when it has an encoded tail on?
> 
> The generic code only cares about it being !0, and the cna code always
> checks if it has a tail (>1, <=1) first.
Ah, that may actually work, but not sure if this was your intention.

The code above sets val to 1 or to encoded tail + 1 (rather than the encoded
tail); decode_tail(tail) does not care about the LSB and will do its
calculation correctly. IOW, decode_tail(tail) is the same as decode_tail(tail + 1).

I think this is a bit fragile and depends on the implementation of 
decode_tail(), but if you are fine with that, no objections here.
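
For completeness, mainline's decode_tail() looks roughly like the sketch
below (offsets assumed for the common configuration with an 8-bit pending
byte); the tail fields live in bits 16..31 of the lock word, so bit 0
(_Q_LOCKED_VAL) is shifted or masked away before it can matter:

static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
{
	int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
	int idx = (tail &  _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;

	return per_cpu_ptr(&qnodes[idx].mcs, cpu);
}

/* hence decode_tail(tail) == decode_tail(tail | _Q_LOCKED_VAL) */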

Regards,
— Alex


^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2020-01-31 18:34 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-25 21:07 [PATCH v7 0/5] Add NUMA-awareness to qspinlock Alex Kogan
2019-11-25 21:07 ` Alex Kogan
2019-11-25 21:07 ` [PATCH v7 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
2019-11-25 21:07   ` Alex Kogan
2020-01-22  9:15   ` Peter Zijlstra
2020-01-22  9:15     ` Peter Zijlstra
2020-01-22  9:15     ` Peter Zijlstra
     [not found]     ` <C608A39E-CAFC-4C79-9EB6-3DFD9621E3F6@oracle.com>
2020-01-25 11:23       ` Peter Zijlstra
2020-01-25 11:23         ` Peter Zijlstra
2019-11-25 21:07 ` [PATCH v7 2/5] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
2019-11-25 21:07   ` Alex Kogan
2019-11-25 21:07 ` [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
2019-11-25 21:07   ` Alex Kogan
2019-12-06 17:21   ` Waiman Long
2019-12-06 17:21     ` Waiman Long
2019-12-06 19:50     ` Alex Kogan
2019-12-06 19:50       ` Alex Kogan
2020-01-21 20:29   ` Peter Zijlstra
2020-01-21 20:29     ` Peter Zijlstra
2020-01-21 20:29     ` Peter Zijlstra
2020-01-22  8:58     ` Will Deacon
2020-01-22  8:58       ` Will Deacon
2020-01-22  9:22     ` Peter Zijlstra
2020-01-22  9:22       ` Peter Zijlstra
2020-01-22  9:22       ` Peter Zijlstra
2020-01-22  9:51     ` Peter Zijlstra
2020-01-22  9:51       ` Peter Zijlstra
2020-01-22  9:51       ` Peter Zijlstra
2020-01-22 17:04       ` Peter Zijlstra
2020-01-22 17:04         ` Peter Zijlstra
2020-01-22 17:04         ` Peter Zijlstra
2020-01-23  9:00         ` Peter Zijlstra
2020-01-23  9:00           ` Peter Zijlstra
2020-01-23  9:00           ` Peter Zijlstra
2020-01-30 22:01         ` Alex Kogan
2020-01-30 22:01           ` Alex Kogan
2020-01-31 13:35           ` Peter Zijlstra
2020-01-31 13:35             ` Peter Zijlstra
2020-01-31 18:33             ` Alex Kogan
2020-01-31 18:33               ` Alex Kogan
2019-11-25 21:07 ` [PATCH v7 4/5] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
2019-11-25 21:07   ` Alex Kogan
2019-12-06 18:09   ` Waiman Long
2019-12-06 18:09     ` Waiman Long
2019-11-25 21:07 ` [PATCH v7 5/5] locking/qspinlock: Introduce the shuffle reduction optimization " Alex Kogan
2019-11-25 21:07   ` Alex Kogan
2019-12-06 22:00   ` Waiman Long
2019-12-06 22:00     ` Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.