* [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer
@ 2021-01-27 20:01 Alexander A Sverdlin
  2021-01-27 20:01 ` [PATCH 2/2] ARM: mcs_spinlock: Drop smp_wmb in arch_mcs_spin_lock_contended() Alexander A Sverdlin
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Alexander A Sverdlin @ 2021-01-27 20:01 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, linux-kernel
  Cc: Alexander Sverdlin, Russell King, linux-arm-kernel

From: Alexander Sverdlin <alexander.sverdlin@nokia.com>

Ensure writes are pushed out of the core write buffer to prevent code
waiting on other cores from spinning longer than necessary.

6 threads running a tight spinlock loop, competing for the same lock
on 6 cores of a MIPS/Octeon machine, perform 1000000 iterations...

before the patch:	4.3 sec
after the patch:	1.2 sec

Same 6-core Octeon machine:
sysbench --test=mutex --num-threads=64 --memory-scope=local run

w/o patch:	1.53s
with patch:	1.28s
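
For reference, the lock loop of that test is essentially the following
(a simplified sketch with one kthread per core; the names and the
surrounding instrumentation are illustrative, not the actual test code):

  #include <linux/spinlock.h>
  #include <linux/kthread.h>

  static DEFINE_SPINLOCK(test_lock);
  static unsigned long counter;

  /* Body of each of the 6 kthreads, one pinned to each core. */
  static int lock_loop(void *unused)
  {
  	int i;

  	for (i = 0; i < 1000000; i++) {
  		spin_lock(&test_lock);
  		counter++;
  		spin_unlock(&test_lock);
  	}
  	return 0;
  }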

This will also allow removing the smp_wmb() in
arch/arm/include/asm/mcs_spinlock.h (was it actually addressing the same
issue?).

Finally, our internal, quite diverse test suite covering different
IPC/network aspects didn't detect any regressions on ARM/ARM64/x86_64.

Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
---
 kernel/locking/mcs_spinlock.h | 5 +++++
 kernel/locking/qspinlock.c    | 6 ++++++
 2 files changed, 11 insertions(+)

diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153..10e497a 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
 		return;
 	}
 	WRITE_ONCE(prev->next, node);
+	/*
+	 * This is necessary to make sure that the corresponding "while" in the
+	 * mcs_spin_unlock() doesn't loop forever
+	 */
+	smp_wmb();
 
 	/* Wait until the lock holder passes the lock down. */
 	arch_mcs_spin_lock_contended(&node->locked);
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cbff6ba..577fe01 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 
 		/* Link @node into the waitqueue. */
 		WRITE_ONCE(prev->next, node);
+		/*
+		 * This is necessary to make sure that the corresponding
+		 * smp_cond_load_relaxed() below (running on another core)
+		 * doesn't spin forever.
+		 */
+		smp_wmb();
 
 		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);
-- 
2.10.2



* [PATCH 2/2] ARM: mcs_spinlock: Drop smp_wmb in arch_mcs_spin_lock_contended()
  2021-01-27 20:01 [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer Alexander A Sverdlin
@ 2021-01-27 20:01 ` Alexander A Sverdlin
  2021-01-27 22:22   ` Will Deacon
  2021-01-27 22:21 ` [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer Will Deacon
  2021-01-27 22:43 ` Peter Zijlstra
  2 siblings, 1 reply; 8+ messages in thread
From: Alexander A Sverdlin @ 2021-01-27 20:01 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, linux-kernel
  Cc: Alexander Sverdlin, Russell King, linux-arm-kernel

From: Alexander Sverdlin <alexander.sverdlin@nokia.com>

Drop the smp_wmb() in arch_mcs_spin_lock_contended() after adding it into
the arch-independent code.

Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
---
 arch/arm/include/asm/mcs_spinlock.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf..d8fa921 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -8,8 +8,6 @@
 /* MCS spin-locking. */
 #define arch_mcs_spin_lock_contended(lock)				\
 do {									\
-	/* Ensure prior stores are observed before we enter wfe. */	\
-	smp_mb();							\
 	while (!(smp_load_acquire(lock)))				\
 		wfe();							\
 } while (0)								\
-- 
2.10.2



* Re: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer
  2021-01-27 20:01 [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer Alexander A Sverdlin
  2021-01-27 20:01 ` [PATCH 2/2] ARM: mcs_spinlock: Drop smp_wmb in arch_mcs_spin_lock_contended() Alexander A Sverdlin
@ 2021-01-27 22:21 ` Will Deacon
  2021-01-28  7:36   ` Alexander Sverdlin
  2021-01-27 22:43 ` Peter Zijlstra
  2 siblings, 1 reply; 8+ messages in thread
From: Will Deacon @ 2021-01-27 22:21 UTC (permalink / raw)
  To: Alexander A Sverdlin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Russell King,
	linux-arm-kernel

On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
> From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> 
> Ensure writes are pushed out of the core write buffer to prevent code
> waiting on other cores from spinning longer than necessary.
> 
> 6 threads running a tight spinlock loop, competing for the same lock
> on 6 cores of a MIPS/Octeon machine, perform 1000000 iterations...
> 
> before the patch:	4.3 sec
> after the patch:	1.2 sec

If you only have 6 cores, I'm not sure qspinlock makes any sense...

> Same 6-core Octeon machine:
> sysbench --test=mutex --num-threads=64 --memory-scope=local run
> 
> w/o patch:	1.53s
> with patch:	1.28s
> 
> This will also allow removing the smp_wmb() in
> arch/arm/include/asm/mcs_spinlock.h (was it actually addressing the same
> issue?).
> 
> Finally, our internal, quite diverse test suite covering different
> IPC/network aspects didn't detect any regressions on ARM/ARM64/x86_64.
> 
> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> ---
>  kernel/locking/mcs_spinlock.h | 5 +++++
>  kernel/locking/qspinlock.c    | 6 ++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index 5e10153..10e497a 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
>  		return;
>  	}
>  	WRITE_ONCE(prev->next, node);
> +	/*
> +	 * This is necessary to make sure that the corresponding "while" in the
> +	 * mcs_spin_unlock() doesn't loop forever
> +	 */
> +	smp_wmb();

If it loops forever, that's broken hardware design; store buffers need to
drain. I don't think we should add unconditional barriers to bodge this.

>  	/* Wait until the lock holder passes the lock down. */
>  	arch_mcs_spin_lock_contended(&node->locked);
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index cbff6ba..577fe01 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  
>  		/* Link @node into the waitqueue. */
>  		WRITE_ONCE(prev->next, node);
> +		/*
> +		 * This is necessary to make sure that the corresponding
> +		 * smp_cond_load_relaxed() below (running on another core)
> +		 * doesn't spin forever.
> +		 */
> +		smp_wmb();

Likewise.

Will


* Re: [PATCH 2/2] ARM: mcs_spinlock: Drop smp_wmb in arch_mcs_spin_lock_contended()
  2021-01-27 20:01 ` [PATCH 2/2] ARM: mcs_spinlock: Drop smp_wmb in arch_mcs_spin_lock_contended() Alexander A Sverdlin
@ 2021-01-27 22:22   ` Will Deacon
  0 siblings, 0 replies; 8+ messages in thread
From: Will Deacon @ 2021-01-27 22:22 UTC (permalink / raw)
  To: Alexander A Sverdlin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Russell King,
	linux-arm-kernel

On Wed, Jan 27, 2021 at 09:01:09PM +0100, Alexander A Sverdlin wrote:
> From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> 
> Drop the smp_wmb() in arch_mcs_spin_lock_contended() after adding it into
> the arch-independent code.
> 
> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> ---
>  arch/arm/include/asm/mcs_spinlock.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
> index 529d2cf..d8fa921 100644
> --- a/arch/arm/include/asm/mcs_spinlock.h
> +++ b/arch/arm/include/asm/mcs_spinlock.h
> @@ -8,8 +8,6 @@
>  /* MCS spin-locking. */
>  #define arch_mcs_spin_lock_contended(lock)				\
>  do {									\
> -	/* Ensure prior stores are observed before we enter wfe. */	\
> -	smp_mb();							\

I think this is the right place for the barrier, not in the core code.

Will


* Re: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer
  2021-01-27 20:01 [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer Alexander A Sverdlin
  2021-01-27 20:01 ` [PATCH 2/2] ARM: mcs_spinlock: Drop smp_wmb in arch_mcs_spin_lock_contended() Alexander A Sverdlin
  2021-01-27 22:21 ` [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer Will Deacon
@ 2021-01-27 22:43 ` Peter Zijlstra
  2021-01-28  7:42   ` Alexander Sverdlin
  2 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2021-01-27 22:43 UTC (permalink / raw)
  To: Alexander A Sverdlin
  Cc: Ingo Molnar, Will Deacon, linux-kernel, Russell King, linux-arm-kernel

On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
> From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> 
> Ensure writes are pushed out of the core write buffer to prevent code
> waiting on other cores from spinning longer than necessary.

Our smp_wmb() as defined does not have that property. You're relying on
some arch specific details which do not belong in common code.
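
To be explicit, smp_wmb() orders a CPU's stores against its own later
stores, nothing more. Informally (a sketch in the style of
memory-barriers.txt):

  /*
   *	CPU0				CPU1
   *	WRITE_ONCE(x, 1);		r1 = READ_ONCE(y);
   *	smp_wmb();			smp_rmb();
   *	WRITE_ONCE(y, 1);		r2 = READ_ONCE(x);
   *
   * Forbidden outcome: r1 == 1 && r2 == 0.  Nothing here bounds how
   * long it takes CPU1 to observe either store.
   */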


> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index 5e10153..10e497a 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
>  		return;
>  	}
>  	WRITE_ONCE(prev->next, node);
> +	/*
> +	 * This is necessary to make sure that the corresponding "while" in the
> +	 * mcs_spin_unlock() doesn't loop forever
> +	 */

This comment is utterly inadequate, since it does not describe an
explicit ordering between two (or more) stores.

> +	smp_wmb();
>  
>  	/* Wait until the lock holder passes the lock down. */
>  	arch_mcs_spin_lock_contended(&node->locked);
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index cbff6ba..577fe01 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  
>  		/* Link @node into the waitqueue. */
>  		WRITE_ONCE(prev->next, node);
> +		/*
> +		 * This is necessary to make sure that the corresponding
> +		 * smp_cond_load_relaxed() below (running on another core)
> +		 * doesn't spin forever.
> +		 */
> +		smp_wmb();

That's insane, cache coherency should not allow that to happen in the
first place. Our smp_wmb() cannot help with that.

>  		pv_wait_node(node, prev);
>  		arch_mcs_spin_lock_contended(&node->locked);



* Re: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer
  2021-01-27 22:21 ` [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer Will Deacon
@ 2021-01-28  7:36   ` Alexander Sverdlin
  2021-01-28 11:24     ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Alexander Sverdlin @ 2021-01-28  7:36 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Russell King,
	linux-arm-kernel

Hi!

On 27/01/2021 23:21, Will Deacon wrote:
> On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
>> From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>>
>> Ensure writes are pushed out of the core write buffer to prevent code
>> waiting on other cores from spinning longer than necessary.
>>
>> 6 threads running a tight spinlock loop, competing for the same lock
>> on 6 cores of a MIPS/Octeon machine, perform 1000000 iterations...
>>
>> before the patch:	4.3 sec
>> after the patch:	1.2 sec
> If you only have 6 cores, I'm not sure qspinlock makes any sense...

That's my impression as well and I even proposed to revert qspinlocks for MIPS:
https://lore.kernel.org/linux-mips/20170610002644.8434-1-paul.burton@imgtec.com/T/#ma1506c80472129b2ac41554cc2d863c9a24518c0

>> Same 6-core Octeon machine:
>> sysbench --test=mutex --num-threads=64 --memory-scope=local run
>>
>> w/o patch:	1.53s
>> with patch:	1.28s
>>
>> This will also allow removing the smp_wmb() in
>> arch/arm/include/asm/mcs_spinlock.h (was it actually addressing the same
>> issue?).
>>
>> Finally, our internal, quite diverse test suite covering different
>> IPC/network aspects didn't detect any regressions on ARM/ARM64/x86_64.
>>
>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>> ---
>>  kernel/locking/mcs_spinlock.h | 5 +++++
>>  kernel/locking/qspinlock.c    | 6 ++++++
>>  2 files changed, 11 insertions(+)
>>
>> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
>> index 5e10153..10e497a 100644
>> --- a/kernel/locking/mcs_spinlock.h
>> +++ b/kernel/locking/mcs_spinlock.h
>> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
>>  		return;
>>  	}
>>  	WRITE_ONCE(prev->next, node);
>> +	/*
>> +	 * This is necessary to make sure that the corresponding "while" in the
>> +	 * mcs_spin_unlock() doesn't loop forever
>> +	 */
>> +	smp_wmb();
> If it loops forever, that's broken hardware design; store buffers need to
> drain. I don't think we should add unconditional barriers to bodge this.

The comment exaggerates the situation a bit, but the behaviour is non-deterministic and you
can see the measurements above. Something is wrong in the qspinlock code; please consider
this patch an "RFC", but something has to be done here.

-- 
Best regards,
Alexander Sverdlin.


* Re: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer
  2021-01-27 22:43 ` Peter Zijlstra
@ 2021-01-28  7:42   ` Alexander Sverdlin
  0 siblings, 0 replies; 8+ messages in thread
From: Alexander Sverdlin @ 2021-01-28  7:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, linux-kernel, Russell King, linux-arm-kernel

Hi!

On 27/01/2021 23:43, Peter Zijlstra wrote:
> On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
>> From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>>
>> Ensure writes are pushed out of the core write buffer to prevent code
>> waiting on other cores from spinning longer than necessary.
> Our smp_wmb() as defined does not have that property. You're relying on
> some arch specific details which do not belong in common code.

Yes, my intention was SYNCW on Octeon, which smp_wmb() happens to map to.
Do you think the core write buffer is an Octeon-only feature and there
will be no others?

Should I re-implement arch_mcs_spin_lock_contended() for Octeon only,
as has been done for ARM?
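
If so, a minimal sketch could look like the below (a hypothetical
arch/mips/include/asm/mcs_spinlock.h modelled on the ARM variant; the
file and the exact barrier choice are open):

  #ifndef __ASM_MCS_SPINLOCK_H
  #define __ASM_MCS_SPINLOCK_H

  /*
   * MCS spin-locking, Octeon: flush the core write buffer (SYNCW via
   * smp_wmb()) so that the preceding WRITE_ONCE(prev->next, node)
   * becomes visible before we start spinning on our own node.
   */
  #define arch_mcs_spin_lock_contended(lock)				\
  do {									\
  	smp_wmb();							\
  	while (!(smp_load_acquire(lock)))				\
  		cpu_relax();						\
  } while (0)

  #endif /* __ASM_MCS_SPINLOCK_H */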

>> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
>> index 5e10153..10e497a 100644
>> --- a/kernel/locking/mcs_spinlock.h
>> +++ b/kernel/locking/mcs_spinlock.h
>> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
>>  		return;
>>  	}
>>  	WRITE_ONCE(prev->next, node);
>> +	/*
>> +	 * This is necessary to make sure that the corresponding "while" in the
>> +	 * mcs_spin_unlock() doesn't loop forever
>> +	 */
> This comment is utterly inadequate, since it does not describe an
> explicit ordering between two (or more) stores.
> 
>> +	smp_wmb();
>>  
>>  	/* Wait until the lock holder passes the lock down. */
>>  	arch_mcs_spin_lock_contended(&node->locked);
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index cbff6ba..577fe01 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>  
>>  		/* Link @node into the waitqueue. */
>>  		WRITE_ONCE(prev->next, node);
>> +		/*
>> +		 * This is necessary to make sure that the corresponding
>> +		 * smp_cond_load_relaxed() below (running on another core)
>> +		 * doesn't spin forever.
>> +		 */
>> +		smp_wmb();
> That's insane, cache coherency should not allow that to happen in the
> first place. Our smp_wmb() cannot help with that.
> 

-- 
Best regards,
Alexander Sverdlin.


* Re: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer
  2021-01-28  7:36   ` Alexander Sverdlin
@ 2021-01-28 11:24     ` Peter Zijlstra
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2021-01-28 11:24 UTC (permalink / raw)
  To: Alexander Sverdlin
  Cc: Will Deacon, Ingo Molnar, linux-kernel, Russell King, linux-arm-kernel

On Thu, Jan 28, 2021 at 08:36:24AM +0100, Alexander Sverdlin wrote:

> >> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> >> index 5e10153..10e497a 100644
> >> --- a/kernel/locking/mcs_spinlock.h
> >> +++ b/kernel/locking/mcs_spinlock.h
> >> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> >>  		return;
> >>  	}
> >>  	WRITE_ONCE(prev->next, node);
> >> +	/*
> >> +	 * This is necessary to make sure that the corresponding "while" in the
> >> +	 * mcs_spin_unlock() doesn't loop forever
> >> +	 */
> >> +	smp_wmb();
> > If it loops forever, that's broken hardware design; store buffers need to
> > drain. I don't think we should add unconditional barriers to bodge this.
> 
> The comment exaggerates the situation a bit, but the behaviour is non-deterministic and you
> can see the measurements above. Something is wrong in the qspinlock code; please consider
> this patch an "RFC", but something has to be done here.

The qspinlock code has been TLA+ modelled and has had extensive memory
ordering analysis. It has had lots of runtime on extremely large x86,
arm64, and Power machines. I'm fairly confident there is nothing wrong.

What I think is more likely is that your platform is broken; it
wouldn't be the first MIPS that's got store buffers completely wrong,
see commit:

  a30718868915 ("MIPS: Change definition of cpu_relax() for Loongson-3")

Working around micro-architectural store-buffer issues is not something
the generic code is for.
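
That change amounted to roughly the following arch-local workaround
(quoted from memory, see the commit for the exact form):

  /* arch/mips/include/asm/processor.h */
  #ifdef CONFIG_CPU_LOONGSON3
  /*
   * Loongson-3's store buffer can starve a reader spinning on another
   * core; a full barrier in cpu_relax() forces it to drain.
   */
  #define cpu_relax()	smp_mb()
  #else
  #define cpu_relax()	barrier()
  #endif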


