* [PATCH v2] locking/osq: Use optimized spinning loop for arm64
@ 2020-01-12 23:58 Waiman Long
2020-01-13 8:32 ` yezengruan
2020-01-13 11:57 ` Will Deacon
0 siblings, 2 replies; 4+ messages in thread
From: Waiman Long @ 2020-01-12 23:58 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Catalin Marinas
Cc: linux-arm-kernel, linux-kernel, Waiman Long
Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
for spinlock that can boost performance of sibling threads by putting
the current cpu to a shallow sleep state that is woken up only when
the monitored variable changes or an external event happens.
OSQ has a more complicated spinning loop. Besides the lock value, it
also checks for need_resched() and vcpu_is_preempted(). The check for
need_resched() is not a problem as it is only set by the tick interrupt
handler. That will be detected by the spinning cpu right after iret.
The vcpu_is_preempted() check, however, is a problem as changes to the
preempt state of the previous node will not affect the sleep state. For
ARM64, vcpu_is_preempted is not defined and so is a no-op. To guard
against future addition of vcpu_is_preempted() to arm64, code is added
to cause build error when vcpu_is_preempted becomes defined in arm64
without the corresponding changes in the OSQ spinning code.
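For readers unfamiliar with the conditional-load pattern, here is a minimal userspace sketch of the loop shape, assuming C11 atomics in place of the kernel primitives. Every name ending in _model is hypothetical and exists only for this illustration; wait_for_change_model() merely marks the spot where an architecture could sleep until the monitored word changes or an event arrives, and is not the arm64 implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the scheduler's "reschedule pending" flag. */
static atomic_bool model_need_resched;

static bool need_resched_model(void)
{
	return atomic_load_explicit(&model_need_resched, memory_order_relaxed);
}

/* Models the arm64 situation described above: the check is always false. */
static bool vcpu_is_preempted_model(int cpu)
{
	(void)cpu;
	return false;
}

/*
 * Placeholder for the architecture's wait primitive. A real one may sleep
 * until *ptr differs from old or an event arrives; this stub returns at
 * once, so the model degrades to a plain busy-wait.
 */
static void wait_for_change_model(atomic_int *ptr, int old)
{
	(void)ptr;
	(void)old;
}

/*
 * Spin until *locked is non-zero or a bail-out condition holds, and return
 * the last value read. Non-zero means the lock was handed over; zero means
 * the caller should go on to unqueue itself.
 */
static int cond_load_relaxed_model(atomic_int *locked, int prev_cpu)
{
	assert(locked);

	for (;;) {
		int val = atomic_load_explicit(locked, memory_order_relaxed);

		if (val || need_resched_model() ||
		    vcpu_is_preempted_model(prev_cpu))
			return val;

		/*
		 * Only a change of *locked (or an event) ends a real wait, so
		 * a predicate that changes silently would go unnoticed here.
		 */
		wait_for_change_model(locked, val);
	}
}
```

The return value is what lets a caller tell "lock acquired" apart from "bail out", which is how the single if statement in the patch can replace the old loop and its goto.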
On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
microbenchmark was run for 10s with and without the patch. The
performance numbers before patch were:
Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s
After patch, the numbers were:
Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s
So there was about 20% performance improvement.
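As a quick check of that figure, the arithmetic can be sketched as below. The helper names are hypothetical; the inputs are the Total Rate values and thread count reported above.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', expressed in percent. */
static double percent_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Average rate per thread, in the same unit as 'total'. */
static double per_cpu_rate(double total, int threads)
{
	assert(threads > 0);
	return total / threads;
}
```

percent_change(2757.0, 3311.0) comes to roughly 20.1, and per_cpu_rate() gives about 12.3 and 14.8 kop/s for 224 threads, in line with the rounded Percpu Rate values of 12 and 15.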
Signed-off-by: Waiman Long <longman@redhat.com>
---
arch/arm64/include/asm/barrier.h | 10 ++++++++++
kernel/locking/osq_lock.c | 25 ++++++++++++-------------
2 files changed, 22 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 7d9cc5ec4971..8eb5f1239885 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -152,6 +152,16 @@ do { \
VAL; \
})
+/*
+ * In osq_lock(), smp_cond_load_relaxed() is called with a condition
+ * that includes vcpu_is_preempted(). For arm64, vcpu_is_preempted is not
+ * currently defined. So it is a no-op. If vcpu_is_preempted is defined in
+ * the future, smp_cond_load_relaxed() will not respond to changes in the
+ * preempt state in a timely manner. So code changes will have to be made
+ * to address this deficiency.
+ */
+#define vcpu_is_preempted_not_used
+
#define smp_cond_load_acquire(ptr, cond_expr) \
({ \
typeof(ptr) __PTR = (ptr); \
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600aa0f47..69ec5161c3cc 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,6 +13,14 @@
*/
static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
+/*
+ * The optimized smp_cond_load_relaxed() spin loop should not be used with
+ * vcpu_is_preempted defined.
+ */
+#if defined(vcpu_is_preempted) && defined(vcpu_is_preempted_not_used)
+#error "vcpu_is_preempted() inside smp_cond_load_relaxed() may not work!"
+#endif
+
/*
* We use the value 0 to represent "no CPU", thus the encoded value
* will be the CPU number incremented by 1.
@@ -134,20 +142,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* cmpxchg in an attempt to undo our queueing.
*/
- while (!READ_ONCE(node->locked)) {
- /*
- * If we need to reschedule bail... so we can block.
- * Use vcpu_is_preempted() to avoid waiting for a preempted
- * lock holder:
- */
- if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
- goto unqueue;
-
- cpu_relax();
- }
- return true;
+ if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
+ vcpu_is_preempted(node_cpu(node->prev))))
+ return true;
-unqueue:
+ /* unqueue */
/*
* Step - A -- stabilize @prev
*
--
2.18.1
* Re: [PATCH v2] locking/osq: Use optimized spinning loop for arm64
2020-01-12 23:58 [PATCH v2] locking/osq: Use optimized spinning loop for arm64 Waiman Long
@ 2020-01-13 8:32 ` yezengruan
2020-01-13 11:57 ` Will Deacon
1 sibling, 0 replies; 4+ messages in thread
From: yezengruan @ 2020-01-13 8:32 UTC (permalink / raw)
To: Waiman Long, Peter Zijlstra, Ingo Molnar, Will Deacon, Catalin Marinas
Cc: linux-kernel, linux-arm-kernel
Hi Waiman,
On 2020/1/13 7:58, Waiman Long wrote:
> Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
> for spinlock that can boost performance of sibling threads by putting
> the current cpu to a shallow sleep state that is woken up only when
> the monitored variable changes or an external event happens.
>
> OSQ has a more complicated spinning loop. Besides the lock value, it
> also checks for need_resched() and vcpu_is_preempted(). The check for
> need_resched() is not a problem as it is only set by the tick interrupt
> handler. That will be detected by the spinning cpu right after iret.
>
> The vcpu_is_preempted() check, however, is a problem as changes to the
> preempt state of of previous node will not affect the sleep state. For
> ARM64, vcpu_is_preempted is not defined and so is a no-op. To guard
> against future addition of vcpu_is_preempted() to arm64, code is added
> to cause build error when vcpu_is_preempted becomes defined in arm64
> without the corresponding changes in the OSQ spinning code.
I have recently been working on vcpu_is_preempted() support for arm64. There is a patch set that does this [1].
[1] https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/
>
> On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
> microbenchmark was run for 10s with and without the patch. The
> performance numbers before patch were:
>
> Running locktest with mutex [runtime = 10s, load = 1]
> Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
> Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s
>
> After patch, the numbers were:
>
> Running locktest with mutex [runtime = 10s, load = 1]
> Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
> Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s
>
> So there was about 20% performance improvement.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> arch/arm64/include/asm/barrier.h | 10 ++++++++++
> kernel/locking/osq_lock.c | 25 ++++++++++++-------------
> 2 files changed, 22 insertions(+), 13 deletions(-)
>
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 7d9cc5ec4971..8eb5f1239885 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -152,6 +152,16 @@ do { \
> VAL; \
> })
>
> +/*
> + * In osq_lock(), smp_cond_load_relaxed() is called with a condition
> + * that includes vcpu_is_preempted(). For arm64, vcpu_is_preempted is not
> + * currently defined. So it is a no-op. If vcpu_is_preempted is defined in
> + * the future, smp_cond_load_relaxed() will not response to changes in the
> + * preempt state in a timely manner. So code changes will have to be made
> + * to address this deficiency.
> + */
> +#define vcpu_is_preempted_not_used
> +
> #define smp_cond_load_acquire(ptr, cond_expr) \
> ({ \
> typeof(ptr) __PTR = (ptr); \
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 6ef600aa0f47..69ec5161c3cc 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,6 +13,14 @@
> */
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>
> +/*
> + * The optimized smp_cond_load_relaxed() spin loop should not be used with
> + * vcpu_is_preempted defined.
> + */
> +#if defined(vcpu_is_preempted) && defined(vcpu_is_preempted_not_used)
> +#error "vcpu_is_preempted() inside smp_cond_load_relaxed() may not work!"
> +#endif
> +
> /*
> * We use the value 0 to represent "no CPU", thus the encoded value
> * will be the CPU number incremented by 1.
> @@ -134,20 +142,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * cmpxchg in an attempt to undo our queueing.
> */
>
> - while (!READ_ONCE(node->locked)) {
> - /*
> - * If we need to reschedule bail... so we can block.
> - * Use vcpu_is_preempted() to avoid waiting for a preempted
> - * lock holder:
> - */
> - if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
> - goto unqueue;
> -
> - cpu_relax();
> - }
> - return true;
> + if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> + vcpu_is_preempted(node_cpu(node->prev))))
> + return true;
>
> -unqueue:
> + /* unqueue */
> /*
> * Step - A -- stabilize @prev
> *
>
Thanks,
Zengruan
* Re: [PATCH v2] locking/osq: Use optimized spinning loop for arm64
2020-01-12 23:58 [PATCH v2] locking/osq: Use optimized spinning loop for arm64 Waiman Long
2020-01-13 8:32 ` yezengruan
@ 2020-01-13 11:57 ` Will Deacon
2020-01-13 13:43 ` Waiman Long
1 sibling, 1 reply; 4+ messages in thread
From: Will Deacon @ 2020-01-13 11:57 UTC (permalink / raw)
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Catalin Marinas,
linux-arm-kernel, linux-kernel, maz
[+Marc]
On Sun, Jan 12, 2020 at 06:58:54PM -0500, Waiman Long wrote:
> Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
> for spinlock that can boost performance of sibling threads by putting
> the current cpu to a shallow sleep state that is woken up only when
> the monitored variable changes or an external event happens.
>
> OSQ has a more complicated spinning loop. Besides the lock value, it
> also checks for need_resched() and vcpu_is_preempted(). The check for
> need_resched() is not a problem as it is only set by the tick interrupt
> handler. That will be detected by the spinning cpu right after iret.
>
> The vcpu_is_preempted() check, however, is a problem as changes to the
> preempt state of of previous node will not affect the sleep state. For
> ARM64, vcpu_is_preempted is not defined and so is a no-op. To guard
> against future addition of vcpu_is_preempted() to arm64, code is added
> to cause build error when vcpu_is_preempted becomes defined in arm64
> without the corresponding changes in the OSQ spinning code.
>
> On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
> microbenchmark was run for 10s with and without the patch. The
> performance numbers before patch were:
>
> Running locktest with mutex [runtime = 10s, load = 1]
> Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
> Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s
>
> After patch, the numbers were:
>
> Running locktest with mutex [runtime = 10s, load = 1]
> Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
> Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s
>
> So there was about 20% performance improvement.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> arch/arm64/include/asm/barrier.h | 10 ++++++++++
> kernel/locking/osq_lock.c | 25 ++++++++++++-------------
> 2 files changed, 22 insertions(+), 13 deletions(-)
>
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 7d9cc5ec4971..8eb5f1239885 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -152,6 +152,16 @@ do { \
> VAL; \
> })
>
> +/*
> + * In osq_lock(), smp_cond_load_relaxed() is called with a condition
> + * that includes vcpu_is_preempted(). For arm64, vcpu_is_preempted is not
> + * currently defined. So it is a no-op. If vcpu_is_preempted is defined in
> + * the future, smp_cond_load_relaxed() will not response to changes in the
> + * preempt state in a timely manner. So code changes will have to be made
> + * to address this deficiency.
> + */
> +#define vcpu_is_preempted_not_used
> +
> #define smp_cond_load_acquire(ptr, cond_expr) \
> ({ \
> typeof(ptr) __PTR = (ptr); \
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 6ef600aa0f47..69ec5161c3cc 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,6 +13,14 @@
> */
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>
> +/*
> + * The optimized smp_cond_load_relaxed() spin loop should not be used with
> + * vcpu_is_preempted defined.
> + */
> +#if defined(vcpu_is_preempted) && defined(vcpu_is_preempted_not_used)
> +#error "vcpu_is_preempted() inside smp_cond_load_relaxed() may not work!"
> +#endif
Although I appreciate you going the extra mile for arm64 (thanks!), I think
this is probably a bit overkill given that I don't plan to merge the series
from Zengruan any time soon. Instead, how about just defining
vcpu_is_preempted in arch/arm64/include/asm/spinlock.h with a comment:
/*
* Changing this will break osq_lock() thanks to the call inside
* smp_cond_load_relaxed().
*
* See:
* https://lore.kernel.org/lkml/20200110100612.GC2827@hirez.programming.kicks-ass.net
*/
#define vcpu_is_preempted(cpu) false
So we'll notice that when somebody tries to change it.
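A minimal userspace sketch of how such a define would interact with a generic fallback, assuming the fallback is guarded by #ifndef vcpu_is_preempted. The helper should_bail_model() and the static_assert are hypothetical additions for illustration only and are not part of the proposal.

```c
#include <assert.h>
#include <stdbool.h>

/* Arch header side, as proposed above: a constant-false macro. */
#define vcpu_is_preempted(cpu)	false

/* Generic side: the fallback is skipped once the arch defines the macro. */
#ifndef vcpu_is_preempted
static inline bool vcpu_is_preempted(int cpu)
{
	(void)cpu;
	return false;
}
#endif

/*
 * Hypothetical extra guard: the build fails if the macro stops being a
 * compile-time constant false.
 */
static_assert(!vcpu_is_preempted(0),
	      "OSQ wait loop assumes vcpu_is_preempted() is constant false");

/* Hypothetical helper: the bail-out part of the OSQ wait condition. */
static bool should_bail_model(bool resched_pending, int prev_cpu)
{
	(void)prev_cpu;	/* the macro discards its argument */
	return resched_pending || vcpu_is_preempted(prev_cpu);
}
```

With the macro in place, the vcpu_is_preempted() term folds away at compile time, so the wait condition depends only on the monitored value and the reschedule flag.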
Will
* Re: [PATCH v2] locking/osq: Use optimized spinning loop for arm64
2020-01-13 11:57 ` Will Deacon
@ 2020-01-13 13:43 ` Waiman Long
0 siblings, 0 replies; 4+ messages in thread
From: Waiman Long @ 2020-01-13 13:43 UTC (permalink / raw)
To: Will Deacon
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Catalin Marinas,
linux-arm-kernel, linux-kernel, maz
On 1/13/20 6:57 AM, Will Deacon wrote:
> [+Marc]
>
> On Sun, Jan 12, 2020 at 06:58:54PM -0500, Waiman Long wrote:
>> Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
>> for spinlock that can boost performance of sibling threads by putting
>> the current cpu to a shallow sleep state that is woken up only when
>> the monitored variable changes or an external event happens.
>>
>> OSQ has a more complicated spinning loop. Besides the lock value, it
>> also checks for need_resched() and vcpu_is_preempted(). The check for
>> need_resched() is not a problem as it is only set by the tick interrupt
>> handler. That will be detected by the spinning cpu right after iret.
>>
>> The vcpu_is_preempted() check, however, is a problem as changes to the
>> preempt state of of previous node will not affect the sleep state. For
>> ARM64, vcpu_is_preempted is not defined and so is a no-op. To guard
>> against future addition of vcpu_is_preempted() to arm64, code is added
>> to cause build error when vcpu_is_preempted becomes defined in arm64
>> without the corresponding changes in the OSQ spinning code.
>>
>> On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
>> microbenchmark was run for 10s with and without the patch. The
>> performance numbers before patch were:
>>
>> Running locktest with mutex [runtime = 10s, load = 1]
>> Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
>> Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s
>>
>> After patch, the numbers were:
>>
>> Running locktest with mutex [runtime = 10s, load = 1]
>> Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
>> Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s
>>
>> So there was about 20% performance improvement.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> arch/arm64/include/asm/barrier.h | 10 ++++++++++
>> kernel/locking/osq_lock.c | 25 ++++++++++++-------------
>> 2 files changed, 22 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
>> index 7d9cc5ec4971..8eb5f1239885 100644
>> --- a/arch/arm64/include/asm/barrier.h
>> +++ b/arch/arm64/include/asm/barrier.h
>> @@ -152,6 +152,16 @@ do { \
>> VAL; \
>> })
>>
>> +/*
>> + * In osq_lock(), smp_cond_load_relaxed() is called with a condition
>> + * that includes vcpu_is_preempted(). For arm64, vcpu_is_preempted is not
>> + * currently defined. So it is a no-op. If vcpu_is_preempted is defined in
>> + * the future, smp_cond_load_relaxed() will not response to changes in the
>> + * preempt state in a timely manner. So code changes will have to be made
>> + * to address this deficiency.
>> + */
>> +#define vcpu_is_preempted_not_used
>> +
>> #define smp_cond_load_acquire(ptr, cond_expr) \
>> ({ \
>> typeof(ptr) __PTR = (ptr); \
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 6ef600aa0f47..69ec5161c3cc 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -13,6 +13,14 @@
>> */
>> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>>
>> +/*
>> + * The optimized smp_cond_load_relaxed() spin loop should not be used with
>> + * vcpu_is_preempted defined.
>> + */
>> +#if defined(vcpu_is_preempted) && defined(vcpu_is_preempted_not_used)
>> +#error "vcpu_is_preempted() inside smp_cond_load_relaxed() may not work!"
>> +#endif
> Although I appreciate you going the extra mile for arm64 (thanks!), I think
> this is probably a bit overkill given that I don't plan to merge the series
> from Zengruan any time soon. Instead, how about just defining
> vcpu_is_preempted in arch/arm64/include/asm/spinlock.h with a comment:
>
>
> /*
> * Changing this will break osq_lock() thanks to the call inside
> * smp_cond_load_relaxed().
> *
> * See:
> * https://lore.kernel.org/lkml/20200110100612.GC2827@hirez.programming.kicks-ass.net
> */
> #define vcpu_is_preempted(cpu) false
>
>
> So we'll notice that when somebody tries to change it.
Yes, that works for me. I just want to make sure that any future change
that adds vcpu_is_preempted to arm64 will get caught.
Cheers,
Longman