* [PATCH v3] locking/osq: Use optimized spinning loop for arm64
@ 2020-01-13 15:07 Waiman Long
2020-01-13 16:48 ` Will Deacon
2020-01-17 10:09 ` [tip: locking/core] " tip-bot2 for Waiman Long
0 siblings, 2 replies; 3+ messages in thread
From: Waiman Long @ 2020-01-13 15:07 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Catalin Marinas
Cc: linux-arm-kernel, linux-kernel, maz, yezengruan, Waiman Long
Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
using wfe for spinlock that can boost performance of sibling threads
by putting the current cpu to a wait state that is broken only when
the monitored variable changes or an external event happens.
OSQ has a more complicated spinning loop. Besides the lock value, it
also checks for need_resched() and vcpu_is_preempted(). The check for
need_resched() is not a problem as it is only set by the tick interrupt
handler. That will be detected by the spinning cpu right after iret.
The vcpu_is_preempted() check, however, is a problem as changes to the
preempt state of of previous node will not affect the wait state. For
ARM64, vcpu_is_preempted is not currently defined and so is a no-op.
Will has indicated that he is planning to para-virtualize wfe instead
of defining vcpu_is_preempted for PV support. So just add a comment in
arch/arm64/include/asm/spinlock.h to indicate that vcpu_is_preempted()
should not be defined as suggested.
On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
microbenchmark was run for 10s with and without the patch. The
performance numbers before patch were:
Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s
After patch, the numbers were:
Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s
So there was about 20% performance improvement.
Signed-off-by: Waiman Long <longman@redhat.com>
---
arch/arm64/include/asm/spinlock.h | 9 +++++++++
kernel/locking/osq_lock.c | 17 ++++-------------
2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index b093b287babf..102404dc1e13 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -11,4 +11,13 @@
/* See include/linux/spinlock.h */
#define smp_mb__after_spinlock() smp_mb()
+/*
+ * Changing this will break osq_lock() thanks to the call inside
+ * smp_cond_load_relaxed().
+ *
+ * See:
+ * https://lore.kernel.org/lkml/20200110100612.GC2827@hirez.programming.kicks-ass.net
+ */
+#define vcpu_is_preempted(cpu) false
+
#endif /* __ASM_SPINLOCK_H */
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600aa0f47..e42893f2fcbc 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -134,20 +134,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* cmpxchg in an attempt to undo our queueing.
*/
- while (!READ_ONCE(node->locked)) {
- /*
- * If we need to reschedule bail... so we can block.
- * Use vcpu_is_preempted() to avoid waiting for a preempted
- * lock holder:
- */
- if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
- goto unqueue;
-
- cpu_relax();
- }
- return true;
+ if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
+ vcpu_is_preempted(node_cpu(node->prev))))
+ return true;
-unqueue:
+ /* unqueue */
/*
* Step - A -- stabilize @prev
*
--
2.18.1
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH v3] locking/osq: Use optimized spinning loop for arm64
2020-01-13 15:07 [PATCH v3] locking/osq: Use optimized spinning loop for arm64 Waiman Long
@ 2020-01-13 16:48 ` Will Deacon
2020-01-17 10:09 ` [tip: locking/core] " tip-bot2 for Waiman Long
1 sibling, 0 replies; 3+ messages in thread
From: Will Deacon @ 2020-01-13 16:48 UTC (permalink / raw)
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Catalin Marinas,
linux-arm-kernel, linux-kernel, maz, yezengruan
On Mon, Jan 13, 2020 at 10:07:35AM -0500, Waiman Long wrote:
> Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
> using wfe for spinlock that can boost performance of sibling threads
> by putting the current cpu to a wait state that is broken only when
> the monitored variable changes or an external event happens.
>
> OSQ has a more complicated spinning loop. Besides the lock value, it
> also checks for need_resched() and vcpu_is_preempted(). The check for
> need_resched() is not a problem as it is only set by the tick interrupt
> handler. That will be detected by the spinning cpu right after iret.
>
> The vcpu_is_preempted() check, however, is a problem as changes to the
> preempt state of of previous node will not affect the wait state. For
> ARM64, vcpu_is_preempted is not currently defined and so is a no-op.
> Will has indicated that he is planning to para-virtualize wfe instead
> of defining vcpu_is_preempted for PV support. So just add a comment in
> arch/arm64/include/asm/spinlock.h to indicate that vcpu_is_preempted()
> should not be defined as suggested.
>
> On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
> microbenchmark was run for 10s with and without the patch. The
> performance numbers before patch were:
>
> Running locktest with mutex [runtime = 10s, load = 1]
> Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
> Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s
>
> After patch, the numbers were:
>
> Running locktest with mutex [runtime = 10s, load = 1]
> Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
> Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s
>
> So there was about 20% performance improvement.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> arch/arm64/include/asm/spinlock.h | 9 +++++++++
> kernel/locking/osq_lock.c | 17 ++++-------------
> 2 files changed, 13 insertions(+), 13 deletions(-)
Acked-by: Will Deacon <will@kernel.org>
Thanks,
Will
^ permalink raw reply [flat|nested] 3+ messages in thread
* [tip: locking/core] locking/osq: Use optimized spinning loop for arm64
2020-01-13 15:07 [PATCH v3] locking/osq: Use optimized spinning loop for arm64 Waiman Long
2020-01-13 16:48 ` Will Deacon
@ 2020-01-17 10:09 ` tip-bot2 for Waiman Long
1 sibling, 0 replies; 3+ messages in thread
From: tip-bot2 for Waiman Long @ 2020-01-17 10:09 UTC (permalink / raw)
To: linux-tip-commits
Cc: Waiman Long, Peter Zijlstra (Intel), Will Deacon, x86, LKML
The following commit has been merged into the locking/core branch of tip:
Commit-ID: f5bfdc8e3947a7ae489cf8ae9cfd6b3fb357b952
Gitweb: https://git.kernel.org/tip/f5bfdc8e3947a7ae489cf8ae9cfd6b3fb357b952
Author: Waiman Long <longman@redhat.com>
AuthorDate: Mon, 13 Jan 2020 10:07:35 -05:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 17 Jan 2020 10:19:30 +01:00
locking/osq: Use optimized spinning loop for arm64
Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
using wfe for spinlock that can boost performance of sibling threads
by putting the current cpu to a wait state that is broken only when
the monitored variable changes or an external event happens.
OSQ has a more complicated spinning loop. Besides the lock value, it
also checks for need_resched() and vcpu_is_preempted(). The check for
need_resched() is not a problem as it is only set by the tick interrupt
handler. That will be detected by the spinning cpu right after iret.
The vcpu_is_preempted() check, however, is a problem as changes to the
preempt state of of previous node will not affect the wait state. For
ARM64, vcpu_is_preempted is not currently defined and so is a no-op.
Will has indicated that he is planning to para-virtualize wfe instead
of defining vcpu_is_preempted for PV support. So just add a comment in
arch/arm64/include/asm/spinlock.h to indicate that vcpu_is_preempted()
should not be defined as suggested.
On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
microbenchmark was run for 10s with and without the patch. The
performance numbers before patch were:
Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s
After patch, the numbers were:
Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s
So there was about 20% performance improvement.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200113150735.21956-1-longman@redhat.com
---
arch/arm64/include/asm/spinlock.h | 9 +++++++++
kernel/locking/osq_lock.c | 23 ++++++++++-------------
2 files changed, 19 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index b093b28..102404d 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -11,4 +11,13 @@
/* See include/linux/spinlock.h */
#define smp_mb__after_spinlock() smp_mb()
+/*
+ * Changing this will break osq_lock() thanks to the call inside
+ * smp_cond_load_relaxed().
+ *
+ * See:
+ * https://lore.kernel.org/lkml/20200110100612.GC2827@hirez.programming.kicks-ass.net
+ */
+#define vcpu_is_preempted(cpu) false
+
#endif /* __ASM_SPINLOCK_H */
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600a..1f77349 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -134,20 +134,17 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* cmpxchg in an attempt to undo our queueing.
*/
- while (!READ_ONCE(node->locked)) {
- /*
- * If we need to reschedule bail... so we can block.
- * Use vcpu_is_preempted() to avoid waiting for a preempted
- * lock holder:
- */
- if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
- goto unqueue;
-
- cpu_relax();
- }
- return true;
+ /*
+ * Wait to acquire the lock or cancelation. Note that need_resched()
+ * will come with an IPI, which will wake smp_cond_load_relaxed() if it
+ * is implemented with a monitor-wait. vcpu_is_preempted() relies on
+ * polling, be careful.
+ */
+ if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
+ vcpu_is_preempted(node_cpu(node->prev))))
+ return true;
-unqueue:
+ /* unqueue */
/*
* Step - A -- stabilize @prev
*
^ permalink raw reply related [flat|nested] 3+ messages in thread
end of thread, other threads:[~2020-01-17 10:09 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13 15:07 [PATCH v3] locking/osq: Use optimized spinning loop for arm64 Waiman Long
2020-01-13 16:48 ` Will Deacon
2020-01-17 10:09 ` [tip: locking/core] " tip-bot2 for Waiman Long
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).