linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] powerpc/qspinlock: Some tuning updates
@ 2021-03-09  1:59 Davidlohr Bueso
  2021-03-09  1:59 ` [PATCH 1/3] powerpc/spinlock: Define smp_mb__after_spinlock only once Davidlohr Bueso
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Davidlohr Bueso @ 2021-03-09  1:59 UTC (permalink / raw)
  To: npiggin
  Cc: peterz, mingo, will, longman, mpe, benh, paulus, linux-kernel,
	linuxppc-dev, dave

Hi,

A few updates while going through the powerpc port of the qspinlock.

Patches 1 and 2 are straightforward, while patch 3 can be considered
more of an RFC, as I've only tested it on a single machine and there
could be an alternative approach if it doesn't end up being a nearly
universal performance win.

Thanks!

Davidlohr Bueso (3):
  powerpc/spinlock: Define smp_mb__after_spinlock only once
  powerpc/spinlock: Unserialize spin_is_locked
  powerpc/qspinlock: Use generic smp_cond_load_relaxed

 arch/powerpc/include/asm/barrier.h         | 16 ----------------
 arch/powerpc/include/asm/qspinlock.h       | 14 --------------
 arch/powerpc/include/asm/simple_spinlock.h |  6 +-----
 arch/powerpc/include/asm/spinlock.h        |  3 +++
 4 files changed, 4 insertions(+), 35 deletions(-)

--
2.26.2



* [PATCH 1/3] powerpc/spinlock: Define smp_mb__after_spinlock only once
  2021-03-09  1:59 [PATCH 0/3] powerpc/qspinlock: Some tuning updates Davidlohr Bueso
@ 2021-03-09  1:59 ` Davidlohr Bueso
  2021-03-09  1:59 ` [PATCH 2/3] powerpc/spinlock: Unserialize spin_is_locked Davidlohr Bueso
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Davidlohr Bueso @ 2021-03-09  1:59 UTC (permalink / raw)
  To: npiggin
  Cc: peterz, mingo, will, longman, mpe, benh, paulus, linux-kernel,
	linuxppc-dev, dave, Davidlohr Bueso

Instead of having both the queued and simple spinlock variants define
it, move the definition into the arch's spinlock.h.
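
For reference, a minimal sketch of why callers want this barrier; the
scheduler is the canonical user, though the exact call site below is
only illustrative:

	/*
	 * spin_lock() is only an ACQUIRE, so memory accesses before the
	 * lock may be reordered into the critical section. Callers that
	 * need full ordering after taking the lock do:
	 */
	raw_spin_lock(&rq->lock);
	smp_mb__after_spinlock();	/* upgrades the acquire to a full barrier */
	/* ... accesses that rely on the full barrier ... */
	raw_spin_unlock(&rq->lock);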

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 arch/powerpc/include/asm/qspinlock.h       | 2 --
 arch/powerpc/include/asm/simple_spinlock.h | 3 ---
 arch/powerpc/include/asm/spinlock.h        | 3 +++
 3 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index b752d34517b3..3ce1a0bee4fe 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -44,8 +44,6 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 }
 #define queued_spin_lock queued_spin_lock
 
-#define smp_mb__after_spinlock()   smp_mb()
-
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
 	/*
diff --git a/arch/powerpc/include/asm/simple_spinlock.h b/arch/powerpc/include/asm/simple_spinlock.h
index 9c3c30534333..3e87258f73b1 100644
--- a/arch/powerpc/include/asm/simple_spinlock.h
+++ b/arch/powerpc/include/asm/simple_spinlock.h
@@ -282,7 +282,4 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 #define arch_read_relax(lock)	rw_yield(lock)
 #define arch_write_relax(lock)	rw_yield(lock)
 
-/* See include/linux/spinlock.h */
-#define smp_mb__after_spinlock()   smp_mb()
-
 #endif /* _ASM_POWERPC_SIMPLE_SPINLOCK_H */
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 6ec72282888d..bd75872a6334 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -10,6 +10,9 @@
 #include <asm/simple_spinlock.h>
 #endif
 
+/* See include/linux/spinlock.h */
+#define smp_mb__after_spinlock()	smp_mb()
+
 #ifndef CONFIG_PARAVIRT_SPINLOCKS
 static inline void pv_spinlocks_init(void) { }
 #endif
-- 
2.26.2



* [PATCH 2/3] powerpc/spinlock: Unserialize spin_is_locked
  2021-03-09  1:59 [PATCH 0/3] powerpc/qspinlock: Some tuning updates Davidlohr Bueso
  2021-03-09  1:59 ` [PATCH 1/3] powerpc/spinlock: Define smp_mb__after_spinlock only once Davidlohr Bueso
@ 2021-03-09  1:59 ` Davidlohr Bueso
  2021-03-09  1:59 ` [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed Davidlohr Bueso
  2021-03-31  1:09 ` [PATCH 0/3] powerpc/qspinlock: Some tuning updates Michael Ellerman
  3 siblings, 0 replies; 11+ messages in thread
From: Davidlohr Bueso @ 2021-03-09  1:59 UTC (permalink / raw)
  To: npiggin
  Cc: peterz, mingo, will, longman, mpe, benh, paulus, linux-kernel,
	linuxppc-dev, dave, parri.andrea, pabeni, Davidlohr Bueso

c6f5d02b6a0f (locking/spinlocks/arm64: Remove smp_mb() from
arch_spin_is_locked()) made it pretty official that the call's
semantics do not imply any sort of barrier, and any user that
gets creative must do any required serialization explicitly.

This creativity, however, is nowadays pretty limited:

1. spin_unlock_wait() has been removed from the kernel in favor
of a lock/unlock combo. Furthermore, for a number of years now,
queued spinlocks have no longer relied on _Q_LOCKED_VAL for the
call, but on any non-zero value to indicate a locked state. There
were cases where the delayed locked store could break mutual
exclusion with crossed locking, sysv ipc and netfilter being the
most extreme examples.

2. The auditing Andrea did verified that the remaining spin_is_locked()
callers no longer rely on such semantics. Most just use it to assert
that a lock is taken, for debugging purposes. The only user that gets
cute is the NOLOCK qdisc, as of:

   96009c7d500e (sched: replace __QDISC_STATE_RUNNING bit with a spin lock)

... which ironically went in the day after c6f5d02b6a0f. That
change replaced test_bit() with spin_is_locked() to decide whether
to take the busylock heuristic, which reduces contention on the main
qdisc lock (sketched below). So any races against spin_is_locked()
on archs that use LL/SC for spin_lock() will be benign and won't
break mutual exclusion; furthermore, both the seqlock and the
busylock have the same scope.
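
To illustrate, the pattern in question looks roughly like the below.
This is paraphrased from the qdisc xmit path and the qdisc_is_running()
helper of that era, not copied verbatim, so details may differ:

	static inline bool qdisc_is_running(struct Qdisc *qdisc)
	{
		if (qdisc->flags & TCQ_F_NOLOCK)
			return spin_is_locked(&qdisc->seqlock); /* was test_bit() */
		return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
	}

	/* __dev_xmit_skb(): serialize contended enqueues on a separate lock */
	contended = qdisc_is_running(q);
	if (unlikely(contended))
		spin_lock(&q->busylock);

A racy (unserialized) read here only affects whether the busylock
detour is taken; mutual exclusion is still provided by the qdisc
locks themselves.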

Cc: parri.andrea@gmail.com
Cc: pabeni@redhat.com
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 arch/powerpc/include/asm/qspinlock.h       | 12 ------------
 arch/powerpc/include/asm/simple_spinlock.h |  3 +--
 2 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 3ce1a0bee4fe..b052b0624816 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -44,18 +44,6 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 }
 #define queued_spin_lock queued_spin_lock
 
-static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
-{
-	/*
-	 * This barrier was added to simple spinlocks by commit 51d7d5205d338,
-	 * but it should now be possible to remove it, asm arm64 has done with
-	 * commit c6f5d02b6a0f.
-	 */
-	smp_mb();
-	return atomic_read(&lock->val);
-}
-#define queued_spin_is_locked queued_spin_is_locked
-
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 #define SPIN_THRESHOLD (1<<15) /* not tuned */
 
diff --git a/arch/powerpc/include/asm/simple_spinlock.h b/arch/powerpc/include/asm/simple_spinlock.h
index 3e87258f73b1..1b935396522a 100644
--- a/arch/powerpc/include/asm/simple_spinlock.h
+++ b/arch/powerpc/include/asm/simple_spinlock.h
@@ -38,8 +38,7 @@ static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
-	smp_mb();
-	return !arch_spin_value_unlocked(*lock);
+	return !arch_spin_value_unlocked(READ_ONCE(*lock));
 }
 
 /*
-- 
2.26.2



* [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed
  2021-03-09  1:59 [PATCH 0/3] powerpc/qspinlock: Some tuning updates Davidlohr Bueso
  2021-03-09  1:59 ` [PATCH 1/3] powerpc/spinlock: Define smp_mb__after_spinlock only once Davidlohr Bueso
  2021-03-09  1:59 ` [PATCH 2/3] powerpc/spinlock: Unserialize spin_is_locked Davidlohr Bueso
@ 2021-03-09  1:59 ` Davidlohr Bueso
  2021-03-09  9:39   ` Michal Suchánek
  2021-03-16  4:59   ` Nicholas Piggin
  2021-03-31  1:09 ` [PATCH 0/3] powerpc/qspinlock: Some tuning updates Michael Ellerman
  3 siblings, 2 replies; 11+ messages in thread
From: Davidlohr Bueso @ 2021-03-09  1:59 UTC (permalink / raw)
  To: npiggin
  Cc: peterz, mingo, will, longman, mpe, benh, paulus, linux-kernel,
	linuxppc-dev, dave, Davidlohr Bueso

49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
busy-wait pausing with a preferred SMT priority pattern, lowering
the priority (reducing decode cycles) for the duration of the loop
slowpath.

However, data shows that while this pattern works well with simple
spinlocks, queued spinlocks benefit more from being kept at medium
priority, with a cpu_relax() instead, which is a low+medium combo
on powerpc.
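
For reference, the difference between the two waiting styles boils
down to the snippets below, paraphrased from the powerpc headers and
asm-generic/barrier.h (trimmed, so treat them as a sketch rather than
the exact code):

	/* powerpc SMT priority helpers */
	#define HMT_low()	asm volatile("or 1,1,1	# low priority")
	#define HMT_medium()	asm volatile("or 2,2,2	# medium priority")

	/* removed PPC64 variant: stay at low priority for the whole wait */
	spin_begin();			/* HMT_low() */
	do {
		VAL = READ_ONCE(*__PTR);
	} while (!(cond_expr));
	spin_end();			/* HMT_medium() */

	/* generic variant: cpu_relax() per iteration, i.e. low+medium */
	for (;;) {
		VAL = READ_ONCE(*__PTR);
		if (cond_expr)
			break;
		cpu_relax();		/* HMT_low(); HMT_medium(); barrier(); */
	}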

Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with
2 sockets and 8 threads per core.

1. locktorture.

This is data for the lowest and most artificial/pathological level,
with increasing thread counts pounding on the lock. Metrics are total
ops/minute. Despite some small hits in the 4-8 range, scenarios are
either neutral or favorable to this patch.

+=========+==========+==========+=======+
| # tasks | vanilla  | dirty    | %diff |
+=========+==========+==========+=======+
| 2       | 46718565 | 48751350 | 4.35  |
+---------+----------+----------+-------+
| 4       | 51740198 | 50369082 | -2.65 |
+---------+----------+----------+-------+
| 8       | 63756510 | 62568821 | -1.86 |
+---------+----------+----------+-------+
| 16      | 67824531 | 70966546 | 4.63  |
+---------+----------+----------+-------+
| 32      | 53843519 | 61155508 | 13.58 |
+---------+----------+----------+-------+
| 64      | 53005778 | 53104412 | 0.18  |
+---------+----------+----------+-------+
| 128     | 53331980 | 54606910 | 2.39  |
+=========+==========+==========+=======+

2. sockperf (tcp throughput)

Here a client does one-way throughput tests to a localhost server, with
increasing message sizes, exercising the sk_lock. This patch brings the
performance of the qspinlock back to par with that of the simple lock:

		     simple-spinlock           vanilla			dirty
Hmean     14        73.50 (   0.00%)       54.44 * -25.93%*       73.45 * -0.07%*
Hmean     100      654.47 (   0.00%)      385.61 * -41.08%*      771.43 * 17.87%*
Hmean     300     2719.39 (   0.00%)     2181.67 * -19.77%*     2666.50 * -1.94%*
Hmean     500     4400.59 (   0.00%)     3390.77 * -22.95%*     4322.14 * -1.78%*
Hmean     850     6726.21 (   0.00%)     5264.03 * -21.74%*     6863.12 * 2.04%*

3. dbench (tmpfs)

Configured to run with up to ncpus*8 clients, it reports both latency and
throughput metrics. For latency, with the exception of the 64-client case,
there is really nothing to report:
				     vanilla                dirty
Amean     latency-1          1.67 (   0.00%)        1.67 *   0.09%*
Amean     latency-2          2.15 (   0.00%)        2.08 *   3.36%*
Amean     latency-4          2.50 (   0.00%)        2.56 *  -2.27%*
Amean     latency-8          2.49 (   0.00%)        2.48 *   0.31%*
Amean     latency-16         2.69 (   0.00%)        2.72 *  -1.37%*
Amean     latency-32         2.96 (   0.00%)        3.04 *  -2.60%*
Amean     latency-64         7.78 (   0.00%)        8.17 *  -5.07%*
Amean     latency-512      186.91 (   0.00%)      186.41 *   0.27%*

For the dbench4 Throughput (misleading but traditional) there's a small
but fairly consistent improvement:

			     vanilla                dirty
Hmean     1        849.13 (   0.00%)      851.51 *   0.28%*
Hmean     2       1664.03 (   0.00%)     1663.94 *  -0.01%*
Hmean     4       3073.70 (   0.00%)     3104.29 *   1.00%*
Hmean     8       5624.02 (   0.00%)     5694.16 *   1.25%*
Hmean     16      9169.49 (   0.00%)     9324.43 *   1.69%*
Hmean     32     11969.37 (   0.00%)    12127.09 *   1.32%*
Hmean     64     15021.12 (   0.00%)    15243.14 *   1.48%*
Hmean     512    14891.27 (   0.00%)    15162.11 *   1.82%*

Measuring the dbench4 Per-VFS Operation latency shows some very minor
differences within the noise level, in the 0-1% range.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 arch/powerpc/include/asm/barrier.h | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index aecfde829d5d..7ae29cfb06c0 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -80,22 +80,6 @@ do {									\
 	___p1;								\
 })
 
-#ifdef CONFIG_PPC64
-#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
-	typeof(ptr) __PTR = (ptr);				\
-	__unqual_scalar_typeof(*ptr) VAL;			\
-	VAL = READ_ONCE(*__PTR);				\
-	if (unlikely(!(cond_expr))) {				\
-		spin_begin();					\
-		do {						\
-			VAL = READ_ONCE(*__PTR);		\
-		} while (!(cond_expr));				\
-		spin_end();					\
-	}							\
-	(typeof(*ptr))VAL;					\
-})
-#endif
-
 #ifdef CONFIG_PPC_BOOK3S_64
 #define NOSPEC_BARRIER_SLOT   nop
 #elif defined(CONFIG_PPC_FSL_BOOK3E)
-- 
2.26.2



* Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed
  2021-03-09  1:59 ` [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed Davidlohr Bueso
@ 2021-03-09  9:39   ` Michal Suchánek
  2021-03-09 15:46     ` Davidlohr Bueso
  2021-03-16  4:59   ` Nicholas Piggin
  1 sibling, 1 reply; 11+ messages in thread
From: Michal Suchánek @ 2021-03-09  9:39 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: npiggin, Davidlohr Bueso, peterz, linuxppc-dev, linux-kernel,
	mingo, paulus, longman, will

On Mon, Mar 08, 2021 at 05:59:50PM -0800, Davidlohr Bueso wrote:
> 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
> busy-waiting pausing with a preferred SMT priority pattern, lowering
> the priority (reducing decode cycles) during the whole loop slowpath.
> 
> However, data shows that while this pattern works well with simple
                                              ^^^^^^^^^^^^^^^^^^^^^^
> spinlocks, queued spinlocks benefit more being kept in medium priority,
> with a cpu_relax() instead, being a low+medium combo on powerpc.
...
> 
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index aecfde829d5d..7ae29cfb06c0 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -80,22 +80,6 @@ do {									\
>  	___p1;								\
>  })
>  
> -#ifdef CONFIG_PPC64
Maybe it should be kept for the simple spinlock case then?

Thanks

Michal
> -#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
> -	typeof(ptr) __PTR = (ptr);				\
> -	__unqual_scalar_typeof(*ptr) VAL;			\
> -	VAL = READ_ONCE(*__PTR);				\
> -	if (unlikely(!(cond_expr))) {				\
> -		spin_begin();					\
> -		do {						\
> -			VAL = READ_ONCE(*__PTR);		\
> -		} while (!(cond_expr));				\
> -		spin_end();					\
> -	}							\
> -	(typeof(*ptr))VAL;					\
> -})
> -#endif
> -
>  #ifdef CONFIG_PPC_BOOK3S_64
>  #define NOSPEC_BARRIER_SLOT   nop
>  #elif defined(CONFIG_PPC_FSL_BOOK3E)
> -- 
> 2.26.2
> 


* Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed
  2021-03-09  9:39   ` Michal Suchánek
@ 2021-03-09 15:46     ` Davidlohr Bueso
  2021-03-09 17:30       ` Michal Suchánek
  0 siblings, 1 reply; 11+ messages in thread
From: Davidlohr Bueso @ 2021-03-09 15:46 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: npiggin, Davidlohr Bueso, peterz, linuxppc-dev, linux-kernel,
	mingo, paulus, longman, will

On Tue, 09 Mar 2021, Michal Suchánek wrote:

>On Mon, Mar 08, 2021 at 05:59:50PM -0800, Davidlohr Bueso wrote:
>> 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
>> busy-waiting pausing with a preferred SMT priority pattern, lowering
>> the priority (reducing decode cycles) during the whole loop slowpath.
>>
>> However, data shows that while this pattern works well with simple
>                                              ^^^^^^^^^^^^^^^^^^^^^^
>> spinlocks, queued spinlocks benefit more being kept in medium priority,
>> with a cpu_relax() instead, being a low+medium combo on powerpc.
>...
>>
>> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
>> index aecfde829d5d..7ae29cfb06c0 100644
>> --- a/arch/powerpc/include/asm/barrier.h
>> +++ b/arch/powerpc/include/asm/barrier.h
>> @@ -80,22 +80,6 @@ do {									\
>>	___p1;								\
>>  })
>>
>> -#ifdef CONFIG_PPC64
>Maybe it should be kept for the simple spinlock case then?

It is kept; note that simple spinlocks don't use smp_cond_load_relaxed(),
but instead deal with the priorities directly in arch_spin_lock(), so the
lock will spin at low priority until it sees a chance to take the lock,
at which point it switches back to medium (see the sketch below).
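
Roughly, the simple spinlock slowpath does the following (trimmed from
arch/powerpc/include/asm/simple_spinlock.h, so only a sketch):

	while (1) {
		if (likely(__arch_spin_trylock(lock) == 0))
			break;
		do {
			HMT_low();		/* low priority while spinning */
			if (is_shared_processor())
				splpar_spin_yield(lock);
		} while (unlikely(lock->slock != 0));
		HMT_medium();			/* back to medium to retry */
	}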

Thanks,
Davidlohr


* Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed
  2021-03-09 15:46     ` Davidlohr Bueso
@ 2021-03-09 17:30       ` Michal Suchánek
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Suchánek @ 2021-03-09 17:30 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: npiggin, Davidlohr Bueso, peterz, linuxppc-dev, linux-kernel,
	mingo, paulus, longman, will

On Tue, Mar 09, 2021 at 07:46:11AM -0800, Davidlohr Bueso wrote:
> On Tue, 09 Mar 2021, Michal Suchánek wrote:
> 
> > On Mon, Mar 08, 2021 at 05:59:50PM -0800, Davidlohr Bueso wrote:
> > > 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
> > > busy-waiting pausing with a preferred SMT priority pattern, lowering
> > > the priority (reducing decode cycles) during the whole loop slowpath.
> > > 
> > > However, data shows that while this pattern works well with simple
> >                                              ^^^^^^^^^^^^^^^^^^^^^^
> > > spinlocks, queued spinlocks benefit more being kept in medium priority,
> > > with a cpu_relax() instead, being a low+medium combo on powerpc.
> > ...
> > > 
> > > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > > index aecfde829d5d..7ae29cfb06c0 100644
> > > --- a/arch/powerpc/include/asm/barrier.h
> > > +++ b/arch/powerpc/include/asm/barrier.h
> > > @@ -80,22 +80,6 @@ do {									\
> > > 	___p1;								\
> > >  })
> > > 
> > > -#ifdef CONFIG_PPC64
> > Maybe it should be kept for the simple spinlock case then?
> 
> It is kept, note that simple spinlocks don't use smp_cond_load_relaxed,
> but instead deal with the priorities in arch_spin_lock(), so it will
> spin in low priority until it sees a chance to take the lock, where
> it switches back to medium.

Indeed, thanks for the clarification.

Michal


* Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed
  2021-03-09  1:59 ` [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed Davidlohr Bueso
  2021-03-09  9:39   ` Michal Suchánek
@ 2021-03-16  4:59   ` Nicholas Piggin
  2021-03-18 20:02     ` Davidlohr Bueso
  2021-03-18 20:47     ` [PATCH v2] " Davidlohr Bueso
  1 sibling, 2 replies; 11+ messages in thread
From: Nicholas Piggin @ 2021-03-16  4:59 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: benh, Davidlohr Bueso, linux-kernel, linuxppc-dev, longman,
	mingo, mpe, paulus, peterz, will

Excerpts from Davidlohr Bueso's message of March 9, 2021 11:59 am:
> 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
> busy-waiting pausing with a preferred SMT priority pattern, lowering
> the priority (reducing decode cycles) during the whole loop slowpath.
> 
> However, data shows that while this pattern works well with simple
> spinlocks, queued spinlocks benefit more being kept in medium priority,
> with a cpu_relax() instead, being a low+medium combo on powerpc.

Thanks for tracking this down and the comprehensive results, great
work.

It's only a relatively recent patch, so I think the revert is a
good idea (i.e., don't keep it around for other code to possibly
hit problems with).

One request: could you add a comment in place that references
smp_cond_load_relaxed(), so this commit can be found again if
someone looks at it? Something like this:

/*
 * smp_cond_load_relaxed was found to have performance problems if
 * implemented with spin_begin()/spin_end().
 */

I wonder if it should have a Fixes: tag pointing to the original
commit as well.

Otherwise,

Acked-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick

> 
> Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with
> 2 sockets and 8 threads per core.
> 
> 1. locktorture.
> 
> This is data for the lowest and most artificial/pathological level,
> with increasing thread counts pounding on the lock. Metrics are total
> ops/minute. Despite some small hits in the 4-8 range, scenarios are
> either neutral or favorable to this patch.
> 
> +=========+==========+==========+=======+
> | # tasks | vanilla  | dirty    | %diff |
> +=========+==========+==========+=======+
> | 2       | 46718565 | 48751350 | 4.35  |
> +---------+----------+----------+-------+
> | 4       | 51740198 | 50369082 | -2.65 |
> +---------+----------+----------+-------+
> | 8       | 63756510 | 62568821 | -1.86 |
> +---------+----------+----------+-------+
> | 16      | 67824531 | 70966546 | 4.63  |
> +---------+----------+----------+-------+
> | 32      | 53843519 | 61155508 | 13.58 |
> +---------+----------+----------+-------+
> | 64      | 53005778 | 53104412 | 0.18  |
> +---------+----------+----------+-------+
> | 128     | 53331980 | 54606910 | 2.39  |
> +=========+==========+==========+=======+
> 
> 2. sockperf (tcp throughput)
> 
> Here a client will do one-way throughput tests to a localhost server, with
> increasing message sizes, dealing with the sk_lock. This patch shows to put
> the performance of the qspinlock back to par with that of the simple lock:
> 
> 		     simple-spinlock           vanilla			dirty
> Hmean     14        73.50 (   0.00%)       54.44 * -25.93%*       73.45 * -0.07%*
> Hmean     100      654.47 (   0.00%)      385.61 * -41.08%*      771.43 * 17.87%*
> Hmean     300     2719.39 (   0.00%)     2181.67 * -19.77%*     2666.50 * -1.94%*
> Hmean     500     4400.59 (   0.00%)     3390.77 * -22.95%*     4322.14 * -1.78%*
> Hmean     850     6726.21 (   0.00%)     5264.03 * -21.74%*     6863.12 * 2.04%*
> 
> 3. dbench (tmpfs)
> 
> Configured to run with up to ncpusx8 clients, it shows both latency and
> throughput metrics. For the latency, with the exception of the 64 case,
> there is really nothing to go by:
> 				     vanilla                dirty
> Amean     latency-1          1.67 (   0.00%)        1.67 *   0.09%*
> Amean     latency-2          2.15 (   0.00%)        2.08 *   3.36%*
> Amean     latency-4          2.50 (   0.00%)        2.56 *  -2.27%*
> Amean     latency-8          2.49 (   0.00%)        2.48 *   0.31%*
> Amean     latency-16         2.69 (   0.00%)        2.72 *  -1.37%*
> Amean     latency-32         2.96 (   0.00%)        3.04 *  -2.60%*
> Amean     latency-64         7.78 (   0.00%)        8.17 *  -5.07%*
> Amean     latency-512      186.91 (   0.00%)      186.41 *   0.27%*
> 
> For the dbench4 Throughput (misleading but traditional) there's a small
> but rather constant improvement:
> 
> 			     vanilla                dirty
> Hmean     1        849.13 (   0.00%)      851.51 *   0.28%*
> Hmean     2       1664.03 (   0.00%)     1663.94 *  -0.01%*
> Hmean     4       3073.70 (   0.00%)     3104.29 *   1.00%*
> Hmean     8       5624.02 (   0.00%)     5694.16 *   1.25%*
> Hmean     16      9169.49 (   0.00%)     9324.43 *   1.69%*
> Hmean     32     11969.37 (   0.00%)    12127.09 *   1.32%*
> Hmean     64     15021.12 (   0.00%)    15243.14 *   1.48%*
> Hmean     512    14891.27 (   0.00%)    15162.11 *   1.82%*
> 
> Measuring the dbench4 Per-VFS Operation latency, shows some very minor
> differences within the noise level, around the 0-1% ranges.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  arch/powerpc/include/asm/barrier.h | 16 ----------------
>  1 file changed, 16 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index aecfde829d5d..7ae29cfb06c0 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -80,22 +80,6 @@ do {									\
>  	___p1;								\
>  })
>  
> -#ifdef CONFIG_PPC64
> -#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
> -	typeof(ptr) __PTR = (ptr);				\
> -	__unqual_scalar_typeof(*ptr) VAL;			\
> -	VAL = READ_ONCE(*__PTR);				\
> -	if (unlikely(!(cond_expr))) {				\
> -		spin_begin();					\
> -		do {						\
> -			VAL = READ_ONCE(*__PTR);		\
> -		} while (!(cond_expr));				\
> -		spin_end();					\
> -	}							\
> -	(typeof(*ptr))VAL;					\
> -})
> -#endif
> -
>  #ifdef CONFIG_PPC_BOOK3S_64
>  #define NOSPEC_BARRIER_SLOT   nop
>  #elif defined(CONFIG_PPC_FSL_BOOK3E)
> -- 
> 2.26.2
> 
> 


* Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed
  2021-03-16  4:59   ` Nicholas Piggin
@ 2021-03-18 20:02     ` Davidlohr Bueso
  2021-03-18 20:47     ` [PATCH v2] " Davidlohr Bueso
  1 sibling, 0 replies; 11+ messages in thread
From: Davidlohr Bueso @ 2021-03-18 20:02 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: benh, Davidlohr Bueso, linux-kernel, linuxppc-dev, longman,
	mingo, mpe, paulus, peterz, will

On Tue, 16 Mar 2021, Nicholas Piggin wrote:

>One request, could you add a comment in place that references
>smp_cond_load_relaxed() so this commit can be found again if
>someone looks at it? Something like this
>
>/*
> * smp_cond_load_relaxed was found to have performance problems if
> * implemented with spin_begin()/spin_end().
> */

Sure, let me see where I can fit that in and send out a v2.

Similarly, but unrelated to this patch, is there any chance we could
remove the whole spin_until_cond() machinery and make it specific to
powerpc? This was introduced in 2017 and doesn't really have any users
outside of powerpc, except for these:

drivers/firmware/arm_scmi/driver.c:             spin_until_cond(scmi_xfer_done_no_timeout(cinfo, xfer, stop));
drivers/firmware/arm_scmi/shmem.c:      spin_until_cond(ioread32(&shmem->channel_status) &
drivers/net/ethernet/xilinx/ll_temac_main.c:    spin_until_cond(hard_acs_rdy_or_timeout(lp, timeout));

... of which, afaict, only the xilinx one can actually be built on powerpc.
Regardless, these could be converted to smp_cond_load_relaxed(), which is
the more standard way to do optimized busy-waiting and cares more about
the family of barriers than about ad-hoc SMT priorities. Of course, I have
no way of testing any of these changes.
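
For context, the machinery in question is the helper below (paraphrased
from include/linux/processor.h, so treat the exact shape as a sketch):

	#define spin_until_cond(cond)					\
	do {								\
		if (unlikely(!(cond))) {				\
			spin_begin();					\
			do {						\
				spin_cpu_relax();			\
			} while (!(cond));				\
			spin_end();					\
		}							\
	} while (0)

i.e. the same spin_begin()/spin_end() SMT priority pattern, just taking
an arbitrary condition rather than a pointer and an expression.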

>I wonder if it should have a Fixes: tag to the original commit as
>well.

I'm not sure either. I've actually been informed recently of other
workloads that benefit from the revert on large Power9 boxes. So I'll
go ahead and add it.

>
>Otherwise,
>
>Acked-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Davidlohr


* [PATCH v2] powerpc/qspinlock: Use generic smp_cond_load_relaxed
  2021-03-16  4:59   ` Nicholas Piggin
  2021-03-18 20:02     ` Davidlohr Bueso
@ 2021-03-18 20:47     ` Davidlohr Bueso
  1 sibling, 0 replies; 11+ messages in thread
From: Davidlohr Bueso @ 2021-03-18 20:47 UTC (permalink / raw)
  To: npiggin
  Cc: benh, linux-kernel, linuxppc-dev, longman, mingo, mpe, paulus,
	peterz, will, dbueso, Davidlohr Bueso

49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
busy-wait pausing with a preferred SMT priority pattern, lowering
the priority (reducing decode cycles) for the duration of the loop
slowpath.

However, data shows that while this pattern works well with simple
spinlocks, queued spinlocks benefit more from being kept at medium
priority, with a cpu_relax() instead, which is a low+medium combo
on powerpc.

Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with
2 sockets and 8 threads per core.

1. locktorture.

This is data for the lowest and most artificial/pathological level,
with increasing thread counts pounding on the lock. Metrics are total
ops/minute. Despite some small hits in the 4-8 range, scenarios are
either neutral or favorable to this patch.

+=========+==========+==========+=======+
| # tasks | vanilla  | dirty    | %diff |
+=========+==========+==========+=======+
| 2       | 46718565 | 48751350 | 4.35  |
+---------+----------+----------+-------+
| 4       | 51740198 | 50369082 | -2.65 |
+---------+----------+----------+-------+
| 8       | 63756510 | 62568821 | -1.86 |
+---------+----------+----------+-------+
| 16      | 67824531 | 70966546 | 4.63  |
+---------+----------+----------+-------+
| 32      | 53843519 | 61155508 | 13.58 |
+---------+----------+----------+-------+
| 64      | 53005778 | 53104412 | 0.18  |
+---------+----------+----------+-------+
| 128     | 53331980 | 54606910 | 2.39  |
+=========+==========+==========+=======+

2. sockperf (tcp throughput)

Here a client does one-way throughput tests to a localhost server, with
increasing message sizes, exercising the sk_lock. This patch brings the
performance of the qspinlock back to par with that of the simple lock:

		     simple-spinlock           vanilla			dirty
Hmean     14        73.50 (   0.00%)       54.44 * -25.93%*       73.45 * -0.07%*
Hmean     100      654.47 (   0.00%)      385.61 * -41.08%*      771.43 * 17.87%*
Hmean     300     2719.39 (   0.00%)     2181.67 * -19.77%*     2666.50 * -1.94%*
Hmean     500     4400.59 (   0.00%)     3390.77 * -22.95%*     4322.14 * -1.78%*
Hmean     850     6726.21 (   0.00%)     5264.03 * -21.74%*     6863.12 * 2.04%*

3. dbench (tmpfs)

Configured to run with up to ncpus*8 clients, it reports both latency and
throughput metrics. For latency, with the exception of the 64-client case,
there is really nothing to report:
				     vanilla                dirty
Amean     latency-1          1.67 (   0.00%)        1.67 *   0.09%*
Amean     latency-2          2.15 (   0.00%)        2.08 *   3.36%*
Amean     latency-4          2.50 (   0.00%)        2.56 *  -2.27%*
Amean     latency-8          2.49 (   0.00%)        2.48 *   0.31%*
Amean     latency-16         2.69 (   0.00%)        2.72 *  -1.37%*
Amean     latency-32         2.96 (   0.00%)        3.04 *  -2.60%*
Amean     latency-64         7.78 (   0.00%)        8.17 *  -5.07%*
Amean     latency-512      186.91 (   0.00%)      186.41 *   0.27%*

For the dbench4 Throughput (misleading but traditional) there's a small
but fairly consistent improvement:

			     vanilla                dirty
Hmean     1        849.13 (   0.00%)      851.51 *   0.28%*
Hmean     2       1664.03 (   0.00%)     1663.94 *  -0.01%*
Hmean     4       3073.70 (   0.00%)     3104.29 *   1.00%*
Hmean     8       5624.02 (   0.00%)     5694.16 *   1.25%*
Hmean     16      9169.49 (   0.00%)     9324.43 *   1.69%*
Hmean     32     11969.37 (   0.00%)    12127.09 *   1.32%*
Hmean     64     15021.12 (   0.00%)    15243.14 *   1.48%*
Hmean     512    14891.27 (   0.00%)    15162.11 *   1.82%*

Measuring the dbench4 Per-VFS Operation latency shows some very minor
differences within the noise level, in the 0-1% range.

Fixes: 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed())
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
Changes from v1:
Added a small comment describing why smp_cond_load_relaxed uses the generic version, as requested by Nick.
Added Nick's ack.

 arch/powerpc/include/asm/barrier.h   | 16 ----------------
 arch/powerpc/include/asm/qspinlock.h |  7 +++++++
 2 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index aecfde829d5d..7ae29cfb06c0 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -80,22 +80,6 @@ do {									\
 	___p1;								\
 })
 
-#ifdef CONFIG_PPC64
-#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
-	typeof(ptr) __PTR = (ptr);				\
-	__unqual_scalar_typeof(*ptr) VAL;			\
-	VAL = READ_ONCE(*__PTR);				\
-	if (unlikely(!(cond_expr))) {				\
-		spin_begin();					\
-		do {						\
-			VAL = READ_ONCE(*__PTR);		\
-		} while (!(cond_expr));				\
-		spin_end();					\
-	}							\
-	(typeof(*ptr))VAL;					\
-})
-#endif
-
 #ifdef CONFIG_PPC_BOOK3S_64
 #define NOSPEC_BARRIER_SLOT   nop
 #elif defined(CONFIG_PPC_FSL_BOOK3E)
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index b052b0624816..9da649e1a488 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -72,6 +72,13 @@ static inline void pv_spinlocks_init(void)
 
 #endif
 
+/*
+ * Queued spinlocks rely heavily on smp_cond_load_relaxed to busy-wait,
+ * which was found to have performance problems if implemented with
+ * the preferred spin_begin()/spin_end() SMT priority pattern. Use the
+ * generic version instead.
+ */
+
 #include <asm-generic/qspinlock.h>
 
 #endif /* _ASM_POWERPC_QSPINLOCK_H */
-- 
2.26.2



* Re: [PATCH 0/3] powerpc/qspinlock: Some tuning updates
  2021-03-09  1:59 [PATCH 0/3] powerpc/qspinlock: Some tuning updates Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2021-03-09  1:59 ` [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed Davidlohr Bueso
@ 2021-03-31  1:09 ` Michael Ellerman
  3 siblings, 0 replies; 11+ messages in thread
From: Michael Ellerman @ 2021-03-31  1:09 UTC (permalink / raw)
  To: npiggin, Davidlohr Bueso
  Cc: will, linuxppc-dev, longman, linux-kernel, paulus, mingo, peterz

On Mon, 8 Mar 2021 17:59:47 -0800, Davidlohr Bueso wrote:
> A few updates while going through the powerpc port of the qspinlock.
> 
> Patches 1 and 2 are straightforward, while patch 3 can be considered
> more of an rfc as I've only tested on a single machine, and there
> could be an alternative way if it doesn't end up being nearly a
> universal performance win.
> 
> [...]

Applied to powerpc/next.

[1/3] powerpc/spinlock: Define smp_mb__after_spinlock only once
      https://git.kernel.org/powerpc/c/2bf3604c415c9d75311141b8eb6ac8780ef74420
[2/3] powerpc/spinlock: Unserialize spin_is_locked
      https://git.kernel.org/powerpc/c/66f60522138c2e0d8a3518edd4979df11a2d7525
[3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed
      https://git.kernel.org/powerpc/c/deb9b13eb2571fbde164ae012c77985fd14f2f02

cheers

