linux-kernel.vger.kernel.org archive mirror
* [PATCH] locking/osq: Use more optimized spinning for arm64
@ 2020-01-09 15:38 Waiman Long
  2020-01-10 10:06 ` Peter Zijlstra
  0 siblings, 1 reply; 3+ messages in thread
From: Waiman Long @ 2020-01-09 15:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon; +Cc: linux-kernel, Waiman Long

Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
for spinlocks that can boost the performance of sibling threads by
putting the current cpu into a shallow sleep state from which it is
woken when the monitored variable changes or an external event happens.

OSQ has a more complicated spinning loop. Besides the lock value, it
also checks for need_resched() and vcpu_is_preempted(). The check for
need_resched() is not a problem as it is only set by the tick interrupt
handler. That will be detected by the spinning cpu right after the
interrupt returns.

The vcpu_is_preempted() check, however, is a problem as changes to
the state of the previous node will not affect the sleep state. For
ARM64, vcpu_is_preempted() is not defined, so we can just skip the
vcpu_is_preempted() check and use smp_cond_load_relaxed() instead.
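
For readers unfamiliar with the cond_load pattern, here is a rough
user-space sketch of what a generic polling smp_cond_load_relaxed()
amounts to. It is not the kernel macro: cond_load_relaxed(),
relax_hint(), model_need_resched(), resched_flag and wait_for_lock()
are made-up names for illustration, and it assumes GNU C statement
expressions and C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's cpu_relax(); a no-op here. */
static inline void relax_hint(void) { }

/*
 * User-space model of the polling cond_load pattern: reload *ptr until
 * cond_expr (which may refer to VAL) becomes true, then yield the last
 * value that was loaded.
 */
#define cond_load_relaxed(ptr, cond_expr) ({				\
	int VAL;							\
	for (;;) {							\
		VAL = atomic_load_explicit((ptr), memory_order_relaxed); \
		if (cond_expr)						\
			break;						\
		relax_hint();						\
	}								\
	VAL;								\
})

/* Hypothetical stand-in for need_resched(): a flag set asynchronously. */
atomic_bool resched_flag;

bool model_need_resched(void)
{
	return atomic_load_explicit(&resched_flag, memory_order_relaxed);
}

/* Returns true if the lock was handed over, false if the caller should unqueue. */
bool wait_for_lock(atomic_int *locked)
{
	assert(locked != NULL);
	return cond_load_relaxed(locked, VAL || model_need_resched()) != 0;
}
```

In this model the condition is re-evaluated on every iteration. A
variant that sleeps until the monitored word changes only re-evaluates
it after a wakeup, which is why a condition that depends on unrelated
state (such as another cpu's preemption status) is a concern.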

On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
microbenchmark was run for 10s with and without the patch. The
performance numbers before patch were:

Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s

After patch, the numbers were:

Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s

So there was about a 20% performance improvement.
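
As a quick check of that figure, the relative gain can be computed from
the two "Total Rate" values reported above. percent_gain() is just a
helper name chosen for this sketch.

```c
#include <assert.h>

/* Relative improvement of `after` over `before`, in percent. */
double percent_gain(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

Calling percent_gain(2757.0, 3311.0) with the kop/s totals from the two
runs gives a value just over 20.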

Longer term, we may have to define and use a static_key to indicate
that vcpu_is_preempted is defined and it may return a value of true.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/osq_lock.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600aa0f47..129e8f56ae71 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -134,6 +134,27 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 * cmpxchg in an attempt to undo our queueing.
 	 */
 
+	/*
+	 * If vcpu_is_preempted is not defined, we can skip the check
+	 * and use smp_cond_load_relaxed() instead. For arm64, this
+	 * could lead to the use of the more optimized wfe instruction.
+	 * As need_resched() is set by interrupt handler, it will break
+	 * out and do the unqueue in a timely manner.
+	 *
+	 * TODO: We may need to add a static_key like vcpu_is_preemptible
+	 *	 as vcpu_is_preempted() will always return false with
+	 *	 bare metal even if it is defined.
+	 */
+#ifndef vcpu_is_preempted
+	{
+		int locked = smp_cond_load_relaxed(&node->locked,
+						   VAL || need_resched());
+		if (!locked)
+			goto unqueue;
+		return true;
+	}
+#endif
+
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
-- 
2.18.1



* Re: [PATCH] locking/osq: Use more optimized spinning for arm64
  2020-01-09 15:38 [PATCH] locking/osq: Use more optimized spinning for arm64 Waiman Long
@ 2020-01-10 10:06 ` Peter Zijlstra
  2020-01-10 14:13   ` Waiman Long
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Zijlstra @ 2020-01-10 10:06 UTC (permalink / raw)
  To: Waiman Long; +Cc: Ingo Molnar, Will Deacon, linux-kernel

On Thu, Jan 09, 2020 at 10:38:31AM -0500, Waiman Long wrote:

> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -134,6 +134,27 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 * cmpxchg in an attempt to undo our queueing.
>  	 */
>  
> +	/*
> +	 * If vcpu_is_preempted is not defined, we can skip the check
> +	 * and use smp_cond_load_relaxed() instead. For arm64, this
> +	 * could lead to the use of the more optimized wfe instruction.
> +	 * As need_resched() is set by interrupt handler, it will break
> +	 * out and do the unqueue in a timely manner.
> +	 *
> +	 * TODO: We may need to add a static_key like vcpu_is_preemptible
> +	 *	 as vcpu_is_preempted() will always return false with
> +	 *	 bare metal even if it is defined.
> +	 */
> +#ifndef vcpu_is_preempted
> +	{
> +		int locked = smp_cond_load_relaxed(&node->locked,
> +						   VAL || need_resched());
> +		if (!locked)
> +			goto unqueue;
> +		return true;
> +	}
> +#endif

Much yuck :-/

With ARM64 being the only arch that currently makes use of this; another
approach is doing something like:

That is also rather yuck, and definitely needs a few comments sprinkled
on it, but it should just work for everyone.

It basically relies on an arch having a spinning *cond_load*()
implementation if it has vcpu_is_preempted(), which is true today.

---
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600aa0f47..6e00d7c077ba 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -133,18 +133,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 * guaranteed their existence -- this allows us to apply
 	 * cmpxchg in an attempt to undo our queueing.
 	 */
+	if (!smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
+						  vcpu_is_preempted(node_cpu(node->prev))))
+		goto unqueue;
 
-	while (!READ_ONCE(node->locked)) {
-		/*
-		 * If we need to reschedule bail... so we can block.
-		 * Use vcpu_is_preempted() to avoid waiting for a preempted
-		 * lock holder:
-		 */
-		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
-			goto unqueue;
-
-		cpu_relax();
-	}
 	return true;
 
 unqueue:


* Re: [PATCH] locking/osq: Use more optimized spinning for arm64
  2020-01-10 10:06 ` Peter Zijlstra
@ 2020-01-10 14:13   ` Waiman Long
  0 siblings, 0 replies; 3+ messages in thread
From: Waiman Long @ 2020-01-10 14:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Will Deacon, linux-kernel

On 1/10/20 5:06 AM, Peter Zijlstra wrote:
> On Thu, Jan 09, 2020 at 10:38:31AM -0500, Waiman Long wrote:
>
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -134,6 +134,27 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>  	 * cmpxchg in an attempt to undo our queueing.
>>  	 */
>>  
>> +	/*
>> +	 * If vcpu_is_preempted is not defined, we can skip the check
>> +	 * and use smp_cond_load_relaxed() instead. For arm64, this
>> +	 * could lead to the use of the more optimized wfe instruction.
>> +	 * As need_resched() is set by interrupt handler, it will break
>> +	 * out and do the unqueue in a timely manner.
>> +	 *
>> +	 * TODO: We may need to add a static_key like vcpu_is_preemptible
>> +	 *	 as vcpu_is_preempted() will always return false with
>> +	 *	 bare metal even if it is defined.
>> +	 */
>> +#ifndef vcpu_is_preempted
>> +	{
>> +		int locked = smp_cond_load_relaxed(&node->locked,
>> +						   VAL || need_resched());
>> +		if (!locked)
>> +			goto unqueue;
>> +		return true;
>> +	}
>> +#endif
> Much yuck :-/
>
> With ARM64 being the only arch that currently makes use of this; another
> approach is doing something like:
>
> That is also rather yuck, and definitely needs a few comments sprinkled
> on it, but it should just work for everyone.
>
> It basically relies on an arch having a spinning *cond_load*()
> implementation if it has vcpu_is_preempted(), which is true today.
>
> ---
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 6ef600aa0f47..6e00d7c077ba 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -133,18 +133,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 * guaranteed their existence -- this allows us to apply
>  	 * cmpxchg in an attempt to undo our queueing.
>  	 */
> +	if (!smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> +						  vcpu_is_preempted(node_cpu(node->prev))))
> +		goto unqueue;
>  
> -	while (!READ_ONCE(node->locked)) {
> -		/*
> -		 * If we need to reschedule bail... so we can block.
> -		 * Use vcpu_is_preempted() to avoid waiting for a preempted
> -		 * lock holder:
> -		 */
> -		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
> -			goto unqueue;
> -
> -		cpu_relax();
> -	}
>  	return true;
>  
>  unqueue:
>
Yes, that will work for now. We do need to document that where
smp_cond_load_relaxed() is defined.

In the future, if vcpu_is_preempted() is defined for ARM64, it will
break, because a change in the previous node's vCPU state will not wake
the waiting cpu. How about defining a variant like
smp_cond_load_vcpu_relaxed(p, cond, vcpu)? With that, we can make sure
that the code will be properly updated when vcpu_is_preempted() is
defined for ARM64. I know it is still kind of ugly, but it is a safer
approach.
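
To make the idea concrete, one possible shape for such a variant is
sketched below as a user-space model. This is only an illustration, not
a proposed implementation: cond_load_vcpu_relaxed(),
model_vcpu_is_preempted(), model_preempted[], MODEL_NR_CPUS and
wait_for_handoff() are all hypothetical names, and the polling loop
assumes GNU C statement expressions and C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4

/* Hypothetical per-cpu flags standing in for vcpu_is_preempted(). */
atomic_bool model_preempted[MODEL_NR_CPUS];

bool model_vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return atomic_load_explicit(&model_preempted[cpu],
				    memory_order_relaxed);
}

/*
 * Model of a cond_load variant that takes the vcpu explicitly, so the
 * preemption check lives inside the wait primitive rather than inside
 * the caller's condition expression. This version simply polls.
 */
#define cond_load_vcpu_relaxed(ptr, cond_expr, vcpu) ({			\
	int VAL;							\
	for (;;) {							\
		VAL = atomic_load_explicit((ptr), memory_order_relaxed); \
		if ((cond_expr) || model_vcpu_is_preempted(vcpu))	\
			break;						\
	}								\
	VAL;								\
})

/* Returns true on lock handoff, false if the previous vcpu was preempted. */
bool wait_for_handoff(atomic_int *locked, int prev_cpu)
{
	assert(locked != NULL);
	return cond_load_vcpu_relaxed(locked, VAL, prev_cpu) != 0;
}
```

The point of passing the vcpu as a separate argument is that an arch
with a sleeping wait would have to decide explicitly how to handle it,
instead of silently ignoring a check hidden in the condition.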

Cheers,
Longman

