linux-arm-kernel.lists.infradead.org archive mirror
* [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
@ 2019-10-07 21:44 Sebastian Andrzej Siewior
  2019-10-07 21:44 ` [PATCH 1/3] ARM: Use qrwlock implementation Sebastian Andrzej Siewior
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-10-07 21:44 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Arnd Bergmann, Peter Zijlstra, Russell King, Ingo Molnar,
	Waiman Long, Will Deacon


I added support for queued RW-locks and spinlocks for ARM. I wanted to
remove the current (ticket based) implementation entirely, but that does
not work: the CPU_V6 kernel configuration has no support for xchg() on a
2-byte memory location, which the qspinlock slowpath requires. It is
possible to build a multi-platform kernel (v6+v7+SMP) which would then
lack that function.
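
For context, the qspinlock slowpath publishes the waiter queue by
exchanging only the upper 16-bit "tail" half of the 32-bit lock word. A
rough sketch of the layout (paraphrased from the generic qspinlock
headers for little-endian and NR_CPUS < 16K; exact field names may
differ between kernel versions):

	struct qspinlock {
		union {
			atomic_t val;			/* whole 32-bit lock word */
			struct {
				u8 locked;		/* locked byte */
				u8 pending;		/* pending bit */
			};
			struct {
				u16 locked_pending;
				u16 tail;		/* queue tail: CPU + MCS node index */
			};
		};
	};

The slowpath's xchg_tail() does a relaxed 16-bit xchg() on the tail
field, which is exactly the operation a CPU_V6 build cannot provide.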

I tested the q-lock implementation with
	hackbench -g40 -s 500 -l 500

The numbers in the tables below are the average runtime of 10
invocations. I tested with HZ_100 and HZ_250 and the different
preemption levels on an IMX6q board (quad Cortex-A9) and an AM572x
board (dual Cortex-A15).
"Ticket" is the current implementation on v5.4-rc1, "Q-locks" is the
switch to queued RW-locks and spinlocks, and "Q-locksI" additionally
inlines the locking functions.

IMX6q
~~~~~
HZ_100   | PREEMPT_NONE | PREEMPT_VOLUNTARY | PREEMPT
Ticket   | 52.103       | 52.284            | 60.5681
Q-locks  | 54.1804      | 53.267            | 56.1914
Q-locksI | 52.2985      | 49.398            | 56.7441

HZ_250   | PREEMPT_NONE | PREEMPT_VOLUNTARY | PREEMPT
Ticket   | 54.3888      | 52.7896           | 58.4837
Q-locks  | 52.1027      | 52.2302           | 57.26
Q-locksI | 51.6185      | 51.5856           | 55.327

AM572x
~~~~~~
HZ_100   | PREEMPT_NONE | PREEMPT_VOLUNTARY | PREEMPT
Ticket   | 42.3819      | 42.4821           | 43.2671
Q-locks  | 40.9141      | 40.0269           | 42.65
Q-locksI | 40.0763      | 39.9101           | 40.7811

HZ_250   | PREEMPT_NONE | PREEMPT_VOLUNTARY | PREEMPT
Ticket   | 41.6399      | 42.9386           | 44.5865
Q-locks  | 41.4476      | 43.0836           | 43.1937
Q-locksI | 39.6897      | 41.1746           | 43.1962

Based on these numbers, the q-lock based implementation performs a
little better than the current ticket spinlock implementation. On IMX6q
the improvement only shows once the locking functions are also inlined,
while on AM572x inlining makes hardly any difference.

Here are `size' numbers for the different vmlinux binaries:

   text	   data	    bss	    dec	 dec KiB  variant
8096124	2604932	 198648	10899704 10644.24 5.4-rc1 CONFIG_HZ_100 CONFIG_PREEMPT_NONE
8031639	2605060	 198656	10835355 10581.40 qlocks  CONFIG_HZ_100 CONFIG_PREEMPT_NONE
8319233	2605072	 198656	11122961 10862.27 qlocksI CONFIG_HZ_100 CONFIG_PREEMPT_NONE

8098548	2604932	 198648	10902128 10646.61 5.4-rc1 CONFIG_HZ_100 CONFIG_PREEMPT_VOLUNTARY
8034103	2605060	 198656	10837819 10583.81 qlocks  CONFIG_HZ_100 CONFIG_PREEMPT_VOLUNTARY
8321769	2605072	 198656	11125497 10864.74 qlocksI CONFIG_HZ_100 CONFIG_PREEMPT_VOLUNTARY

8082969	2605468	 198712	10887149 10631.98 5.4-rc1 CONFIG_HZ_100 CONFIG_PREEMPT
8083732	2609692	 198720	10892144 10636.86 qlocks  CONFIG_HZ_100 CONFIG_PREEMPT
8725070	2609704	 198720	11533494 11263.18 qlocksI CONFIG_HZ_100 CONFIG_PREEMPT

8096784	2605188	 198648	10900620 10645.14 5.4-rc1 CONFIG_HZ_250 CONFIG_PREEMPT_NONE
8032307	2605316	 198656	10836279 10582.30 qlocks  CONFIG_HZ_250 CONFIG_PREEMPT_NONE
8319901	2605328	 198656	11123885 10863.17 qlocksI CONFIG_HZ_250 CONFIG_PREEMPT_NONE

8099208	2605188	 198648	10903044 10647.50 5.4-rc1 CONFIG_HZ_250 CONFIG_PREEMPT_VOLUNTARY
8034739	2605316	 198656	10838711 10584.68 qlocks  CONFIG_HZ_250 CONFIG_PREEMPT_VOLUNTARY
8322405	2605328	 198656	11126389 10865.61 qlocksI CONFIG_HZ_250 CONFIG_PREEMPT_VOLUNTARY

8083645	2605724	 198712	10888081 10632.89 5.4-rc1 CONFIG_HZ_250 CONFIG_PREEMPT
8084376	2609948	 198720	10893044 10637.74 qlocks  CONFIG_HZ_250 CONFIG_PREEMPT
8725762	2609960	 198720	11534442 11264.10 qlocksI CONFIG_HZ_250 CONFIG_PREEMPT

On average the q-locksI variant is approx. 200 KiB larger than the
current implementation. With the PREEMPT configurations the size
increases by approx. 600 KiB, which is probably not worth it.

Sebastian



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/3] ARM: Use qrwlock implementation
  2019-10-07 21:44 [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM Sebastian Andrzej Siewior
@ 2019-10-07 21:44 ` Sebastian Andrzej Siewior
  2019-10-07 21:44 ` [PATCH 2/3] ARM: Use qspinlock implementation Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-10-07 21:44 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Arnd Bergmann, Peter Zijlstra, Russell King,
	Sebastian Andrzej Siewior, Ingo Molnar, Waiman Long, Will Deacon

Use the generic qrwlock implementation for the RW-locks. The WFE
mechanism is still used as part of the spinlock implementation, which
the qrwlock slowpath relies on.

Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
---
 arch/arm/Kconfig                      |  1 +
 arch/arm/include/asm/Kbuild           |  1 +
 arch/arm/include/asm/spinlock.h       |  8 ++++++++
 arch/arm/include/asm/spinlock_types.h | 10 +++++++++-
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 8a50efb559f35..2f704531d883a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -34,6 +34,7 @@ config ARM
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF
+	select ARCH_USE_QUEUED_RWLOCKS if !CPU_V6
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select BINFMT_FLAT_ARGVP_ENVP_ON_STACK
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 68ca86f85eb73..5327be7572cd2 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -18,6 +18,7 @@ generic-y += preempt.h
 generic-y += seccomp.h
 generic-y += serial.h
 generic-y += trace_clock.h
+generic-y += qrwlock.h
 
 generated-y += mach-types.h
 generated-y += unistd-nr.h
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 8f009e788ad40..e23f71b2cd43d 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -128,6 +128,7 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 }
 #define arch_spin_is_contended	arch_spin_is_contended
 
+#ifdef CONFIG_CPU_V6
 /*
  * RWLOCKS
  *
@@ -270,4 +271,11 @@ static inline int arch_read_trylock(arch_rwlock_t *rw)
 	}
 }
 
+#else
+
+#include <asm/qrwlock.h>
+#define smp_mb__after_spinlock()       smp_mb()
+
+#endif
+
 #endif /* __ASM_SPINLOCK_H */
diff --git a/arch/arm/include/asm/spinlock_types.h b/arch/arm/include/asm/spinlock_types.h
index 5976958647fe1..c942a75250bb2 100644
--- a/arch/arm/include/asm/spinlock_types.h
+++ b/arch/arm/include/asm/spinlock_types.h
@@ -29,6 +29,14 @@ typedef struct {
 	u32 lock;
 } arch_rwlock_t;
 
-#define __ARCH_RW_LOCK_UNLOCKED		{ 0 }
+#ifdef CONFIG_CPU_V6
+
+#define __ARCH_RW_LOCK_UNLOCKED                { 0 }
+
+#else
+
+#include <asm-generic/qrwlock_types.h>
+
+#endif
 
 #endif
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/3] ARM: Use qspinlock implementation
  2019-10-07 21:44 [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM Sebastian Andrzej Siewior
  2019-10-07 21:44 ` [PATCH 1/3] ARM: Use qrwlock implementation Sebastian Andrzej Siewior
@ 2019-10-07 21:44 ` Sebastian Andrzej Siewior
  2019-10-07 21:44 ` [PATCH 3/3] ARM: Inline locking functions for !PREEMPTION Sebastian Andrzej Siewior
  2019-10-08 11:42 ` [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM Arnd Bergmann
  3 siblings, 0 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-10-07 21:44 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Arnd Bergmann, Peter Zijlstra, Russell King,
	Sebastian Andrzej Siewior, Ingo Molnar, Waiman Long, Will Deacon

Use the generic queued spinlock implementation for the spinlocks. The
WFE mechanism is used as part of arch_mcs_spin_lock_contended().

Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
---
 arch/arm/Kconfig                      | 1 +
 arch/arm/include/asm/Kbuild           | 1 +
 arch/arm/include/asm/spinlock.h       | 5 +++--
 arch/arm/include/asm/spinlock_types.h | 4 ++--
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 2f704531d883a..7eba89bb45755 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -35,6 +35,7 @@ config ARM
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select ARCH_USE_QUEUED_RWLOCKS if !CPU_V6
+	select ARCH_USE_QUEUED_SPINLOCKS if !CPU_V6
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select BINFMT_FLAT_ARGVP_ENVP_ON_STACK
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 5327be7572cd2..f98bcfd9612b6 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -18,6 +18,7 @@ generic-y += preempt.h
 generic-y += seccomp.h
 generic-y += serial.h
 generic-y += trace_clock.h
+generic-y += qspinlock.h
 generic-y += qrwlock.h
 
 generated-y += mach-types.h
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index e23f71b2cd43d..7f3477679eb58 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -45,6 +45,7 @@ static inline void dsb_sev(void)
 	__asm__(SEV);
 }
 
+#ifdef CONFIG_CPU_V6
 /*
  * ARMv6 ticket-based spin-locking.
  *
@@ -128,7 +129,6 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 }
 #define arch_spin_is_contended	arch_spin_is_contended
 
-#ifdef CONFIG_CPU_V6
 /*
  * RWLOCKS
  *
@@ -274,7 +274,8 @@ static inline int arch_read_trylock(arch_rwlock_t *rw)
 #else
 
 #include <asm/qrwlock.h>
-#define smp_mb__after_spinlock()       smp_mb()
+#include <asm/qspinlock.h>
+#define smp_mb__after_spinlock()	smp_mb()
 
 #endif
 
diff --git a/arch/arm/include/asm/spinlock_types.h b/arch/arm/include/asm/spinlock_types.h
index c942a75250bb2..7d5200da6a5f8 100644
--- a/arch/arm/include/asm/spinlock_types.h
+++ b/arch/arm/include/asm/spinlock_types.h
@@ -6,6 +6,7 @@
 # error "please don't include this file directly"
 #endif
 
+#ifdef CONFIG_CPU_V6
 #define TICKET_SHIFT	16
 
 typedef struct {
@@ -29,12 +30,11 @@ typedef struct {
 	u32 lock;
 } arch_rwlock_t;
 
-#ifdef CONFIG_CPU_V6
-
 #define __ARCH_RW_LOCK_UNLOCKED                { 0 }
 
 #else
 
+#include <asm-generic/qspinlock_types.h>
 #include <asm-generic/qrwlock_types.h>
 
 #endif
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/3] ARM: Inline locking functions for !PREEMPTION
  2019-10-07 21:44 [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM Sebastian Andrzej Siewior
  2019-10-07 21:44 ` [PATCH 1/3] ARM: Use qrwlock implementation Sebastian Andrzej Siewior
  2019-10-07 21:44 ` [PATCH 2/3] ARM: Use qspinlock implementation Sebastian Andrzej Siewior
@ 2019-10-07 21:44 ` Sebastian Andrzej Siewior
  2019-10-08 11:42 ` [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM Arnd Bergmann
  3 siblings, 0 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-10-07 21:44 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Arnd Bergmann, Peter Zijlstra, Russell King,
	Sebastian Andrzej Siewior, Ingo Molnar, Waiman Long, Will Deacon

On non-preemptive kernels the locking functions compile to less than 64
bytes each, so it makes sense to inline them. With PREEMPTION the kernel
grows considerably if the locks are inlined.
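
To illustrate the mechanism, a rough sketch of how the generic locking
API reacts to these selects (paraphrased from kernel/Kconfig.locks,
include/linux/spinlock_api_smp.h and kernel/locking/spinlock.c; details
vary between kernel versions):

	/* Without ARCH_INLINE_SPIN_LOCK the call goes out of line: */
	#ifndef CONFIG_INLINE_SPIN_LOCK
	void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
	{
		__raw_spin_lock(lock);
	}
	EXPORT_SYMBOL(_raw_spin_lock);
	#endif

	/* With it, spin_lock() expands directly to the fast path: */
	#ifdef CONFIG_INLINE_SPIN_LOCK
	#define _raw_spin_lock(lock) __raw_spin_lock(lock)
	#endif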

Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
---
 arch/arm/Kconfig | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7eba89bb45755..c1ee04c209e6e 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -26,6 +26,32 @@ config ARM
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAVE_CUSTOM_GPIO_H
 	select ARCH_HAS_GCOV_PROFILE_ALL
+	select ARCH_INLINE_READ_LOCK if !PREEMPTION
+	select ARCH_INLINE_READ_LOCK_BH if !PREEMPTION
+	select ARCH_INLINE_READ_LOCK_IRQ if !PREEMPTION
+	select ARCH_INLINE_READ_LOCK_IRQSAVE if !PREEMPTION
+	select ARCH_INLINE_READ_UNLOCK if !PREEMPTION
+	select ARCH_INLINE_READ_UNLOCK_BH if !PREEMPTION
+	select ARCH_INLINE_READ_UNLOCK_IRQ if !PREEMPTION
+	select ARCH_INLINE_READ_UNLOCK_IRQRESTORE if !PREEMPTION
+	select ARCH_INLINE_WRITE_LOCK if !PREEMPTION
+	select ARCH_INLINE_WRITE_LOCK_BH if !PREEMPTION
+	select ARCH_INLINE_WRITE_LOCK_IRQ if !PREEMPTION
+	select ARCH_INLINE_WRITE_LOCK_IRQSAVE if !PREEMPTION
+	select ARCH_INLINE_WRITE_UNLOCK if !PREEMPTION
+	select ARCH_INLINE_WRITE_UNLOCK_BH if !PREEMPTION
+	select ARCH_INLINE_WRITE_UNLOCK_IRQ if !PREEMPTION
+	select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE if !PREEMPTION
+	select ARCH_INLINE_SPIN_TRYLOCK if !PREEMPTION
+	select ARCH_INLINE_SPIN_TRYLOCK_BH if !PREEMPTION
+	select ARCH_INLINE_SPIN_LOCK if !PREEMPTION
+	select ARCH_INLINE_SPIN_LOCK_BH if !PREEMPTION
+	select ARCH_INLINE_SPIN_LOCK_IRQ if !PREEMPTION
+	select ARCH_INLINE_SPIN_LOCK_IRQSAVE if !PREEMPTION
+	select ARCH_INLINE_SPIN_UNLOCK if !PREEMPTION
+	select ARCH_INLINE_SPIN_UNLOCK_BH if !PREEMPTION
+	select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION
+	select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION
 	select ARCH_KEEP_MEMBLOCK if HAVE_ARCH_PFN_VALID || KEXEC
 	select ARCH_MIGHT_HAVE_PC_PARPORT
 	select ARCH_NO_SG_CHAIN if !ARM_HAS_SG_CHAIN
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-07 21:44 [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2019-10-07 21:44 ` [PATCH 3/3] ARM: Inline locking functions for !PREEMPTION Sebastian Andrzej Siewior
@ 2019-10-08 11:42 ` Arnd Bergmann
  2019-10-08 13:36   ` Waiman Long
  2019-10-08 19:32   ` Sebastian Andrzej Siewior
  3 siblings, 2 replies; 22+ messages in thread
From: Arnd Bergmann @ 2019-10-08 11:42 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, Russell King, Ingo Molnar, Waiman Long,
	Will Deacon, Linux ARM

On Mon, Oct 7, 2019 at 11:45 PM Sebastian Andrzej Siewior
<sebastian@breakpoint.cc> wrote:
>
> I added support for queued-RW and -spinlocks for ARM. I wanted to
> remove the current implementation but this does not work. The CPU_V6
> kernel config does not have support for xchg() on 2 byte memory address.
> This is required by q-lock' slowpath. It is possible to create a
> multi-kernel (with v6+v7+SMP) which then lack the function.
>
> I tested the q-lock implementation with
>         hackbench -g40 -s 500 -l 500
>
> The numbers in the table below represent the average runtime of 10
> invocations. I tested with HZ_100,HZ_250 and the different preemption
> levels on a IMX6q-board (quad Cortex-A9) and an AM572x board (dual
> Cortex-A15).
> "Ticket" means the current implementation on v5.4-rc1, Q-Locks is the
> switch to queued RW and spinlocks and in Q-locksI the locking
> instruction is additionally inlined.

This looks nice, and I don't see anything wrong with the implementation,
but I am slightly worried about switching everything over to a generic
spinlock while keeping the custom ARM version for an exceptionally
rare corner case:

The ARM spinlock is now only used when you build an SMP-enabled
kernel for an ARM1136r0 that is used in OMAP2, i.MX3 and some
of the least common Integrator/Realview variants. I'm not aware
of any binary distros with ARMv6+ kernels, so these would run custom
kernels that are almost always non-SMP as well as no longer getting
kernel upgrades (almost all have been discontinued years ago, the i.MX35
chip itself was the last to get EOLd in 2018).
Raspbian builds an ARMv6K SMP kernel that is not affected by this.

I wonder if we can do something better here and make the
asm-generic/qspinlock.h implementation always degrade into an
equivalent of include/linux/spinlock_up.h when running on uniprocessor
systems, avoiding both the atomic cmpxchg and the slowpath.

That way, an ARMv6+SMP kernel on UP could share the qspinlock
implementation but never actually get into the invalid 16-bit xchg() or
sev()/wfe(). It already shouldn't ever get into the slowpath on a
non-SMP system if I understand it correctly, but avoiding the cmpxchg()
entirely would be an added benefit.
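
For reference, the UP variant this would degrade to is essentially just
a compiler barrier; a rough sketch along the lines of
include/linux/spinlock_up.h (ignoring the CONFIG_DEBUG_SPINLOCK
variant):

	/* Only one CPU: no atomics needed, just keep the compiler honest. */
	static inline void arch_spin_lock(arch_spinlock_t *lock)
	{
		barrier();
	}

	static inline void arch_spin_unlock(arch_spinlock_t *lock)
	{
		barrier();
	}

	static inline int arch_spin_trylock(arch_spinlock_t *lock)
	{
		barrier();
		return 1;	/* always succeeds on UP */
	}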

       Arnd


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-08 11:42 ` [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM Arnd Bergmann
@ 2019-10-08 13:36   ` Waiman Long
  2019-10-08 14:32     ` Arnd Bergmann
  2019-10-08 19:32   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-10-08 13:36 UTC (permalink / raw)
  To: Arnd Bergmann, Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Russell King, Linux ARM

On 10/8/19 7:42 AM, Arnd Bergmann wrote:
> On Mon, Oct 7, 2019 at 11:45 PM Sebastian Andrzej Siewior
> <sebastian@breakpoint.cc> wrote:
>> I added support for queued-RW and -spinlocks for ARM. I wanted to
>> remove the current implementation but this does not work. The CPU_V6
>> kernel config does not have support for xchg() on 2 byte memory address.
>> This is required by q-lock' slowpath. It is possible to create a
>> multi-kernel (with v6+v7+SMP) which then lack the function.
>>
>> I tested the q-lock implementation with
>>         hackbench -g40 -s 500 -l 500
>>
>> The numbers in the table below represent the average runtime of 10
>> invocations. I tested with HZ_100,HZ_250 and the different preemption
>> levels on a IMX6q-board (quad Cortex-A9) and an AM572x board (dual
>> Cortex-A15).
>> "Ticket" means the current implementation on v5.4-rc1, Q-Locks is the
>> switch to queued RW and spinlocks and in Q-locksI the locking
>> instruction is additionally inlined.
> This looks nice, and I don't see anything wrong with the implementation,
> but I am slightly worried about switching everything over to a generic
> spinlock while keeping the custom ARM version for an exceptionally
> rare corner case:
>
> The ARM spinlock is now only used when you build an SMP-enabled
> kernel for an ARM1136r0 that is used in OMAP2, i.MX3 and some
> of the least common Integrator/Realview variants. I'm not aware
> of any binary distros with ARMv6+ kernels, so these would run custom
> kernels that are almost always non-SMP as well as no longer getting
> kernel upgrades (almost all have been discontinued years ago, the i.MX35
> chip itself was the last to get EOLd in 2018).
> Raspbian builds an ARMv6K SMP kernel that is not affected by this.
>
> I wonder if we can do something better here and make the
> asm-generic/qspinlock.h implementation always degrade into an
> equivalent of include/linux/spinlock_up.h when running on uniprocessor
> systems, avoiding both the atomic cmpxchg and the slowpath.

In x86, the lock instruction prefix is patched out when running on a UP
system. This downgrades the atomic cmpxchg to a non-atomic one. We may
be able to do something similar in other architectures.
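
Roughly, the x86 mechanism looks like this (paraphrased from
arch/x86/include/asm/alternative.h; details vary by kernel version):

	/*
	 * Every "lock" prefix records its own location in the .smp_locks
	 * section, so the alternatives code can swap the prefix bytes
	 * with NOPs depending on how many CPUs are online.
	 */
	#define LOCK_PREFIX_HERE \
			".pushsection .smp_locks,\"a\"\n"	\
			".balign 4\n"				\
			".long 671f - .\n"	/* offset */	\
			".popsection\n"				\
			"671:"

	#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "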

Cheers,
Longman



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-08 13:36   ` Waiman Long
@ 2019-10-08 14:32     ` Arnd Bergmann
  2019-10-08 19:47       ` Sebastian Andrzej Siewior
  2019-10-09  8:46       ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Arnd Bergmann @ 2019-10-08 14:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Russell King, Sebastian Andrzej Siewior,
	Ingo Molnar, Will Deacon, Linux ARM

On Tue, Oct 8, 2019 at 3:36 PM Waiman Long <longman@redhat.com> wrote:
> On 10/8/19 7:42 AM, Arnd Bergmann wrote:
> > On Mon, Oct 7, 2019 at 11:45 PM Sebastian Andrzej Siewior <sebastian@breakpoint.cc> wrote:
> > I wonder if we can do something better here and make the
> > asm-generic/qspinlock.h implementation always degrade into an
> > equivalent of include/linux/spinlock_up.h when running on uniprocessor
> > systems, avoiding both the atomic cmpxchg and the slowpath.

I looked at the qspinlock implementation some more, and I think we can
get away with simply defining a special-case smp_xchg16_relaxed()
that is only used for the odd  configuration:

diff --git a/arch/arm/include/asm/cmpxchg.h b/arch/arm/include/asm/cmpxchg.h
index 8b701f8e175c..6bf4964c105c 100644
--- a/arch/arm/include/asm/cmpxchg.h
+++ b/arch/arm/include/asm/cmpxchg.h
@@ -114,6 +114,24 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
        return ret;
 }

+#ifdef CONFIG_CPU_V6
+static inline unsigned short smp_xchg16_relaxed(volatile unsigned short *ptr, unsigned short x)
+{
+       unsigned short ret, tmp;
+       asm volatile("@ smp_xchg16_relaxed\n"
+       ".arch armv6k\n"
+       "1:     ldrexh  %0, [%3]\n"
+       "       strexh  %1, %2, [%3]\n"
+       "       teq     %1, #0\n"
+       "       bne     1b"
+               : "=&r" (ret), "=&r" (tmp)
+               : "r" (x), "r" (ptr)
+               : "memory", "cc");
+       return ret;
+}
+#define smp_xchg16_relaxed smp_xchg16_relaxed
+#endif
+
 #define xchg_relaxed(ptr, x) ({					\
        (__typeof__(*(ptr)))__xchg((unsigned long)(x), (ptr),           \
                                   sizeof(*(ptr)));                     \
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 4c0d009a46f0..4612e27e3330 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -71,6 +71,10 @@
        __ret;                                                          \
 })

+#ifndef smp_xchg16_relaxed
+#define smp_xchg16_relaxed(p, x) xchg_relaxed(p, x)
+#endif
+
 #include <linux/atomic-fallback.h>

 #include <asm-generic/atomic-long.h>
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 2473f10c6956..af1473528edf 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -178,8 +178,8 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
         * We can use relaxed semantics since the caller ensures that the
         * MCS node is properly initialized before updating the tail.
         */
-       return (u32)xchg_relaxed(&lock->tail,
-                                tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+       return (u32)smp_xchg16_relaxed(&lock->tail,
+                                      tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }

 #else /* _Q_PENDING_BITS == 8 */

> In x86, the lock instruction prefix is patched out when running on UP
> system. This downgrades the atomic cmpxchg to non-atomic one. We may do
> something similar in other architectures.

Unfortunately, the atomic macros cannot trivially be made cheaper
on non-SMP systems based on load-locked/store-conditional
based architectures, as there may be an interrupt in-between,
and disabling interrupts would likely be more expensive.

However, there might be a way to take a shortcut out of
queued_spin_lock() using asm-goto combined with the ARM
__ALT_SMP_ASM() macro.

      Arnd


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-08 11:42 ` [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM Arnd Bergmann
  2019-10-08 13:36   ` Waiman Long
@ 2019-10-08 19:32   ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-10-08 19:32 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, Russell King, Ingo Molnar, Waiman Long,
	Will Deacon, Linux ARM

On 2019-10-08 13:42:43 [+0200], Arnd Bergmann wrote:
> This looks nice, and I don't see anything wrong with the implementation,
> but I am slightly worried about switching everything over to a generic
> spinlock while keeping the custom ARM version for an exceptionally
> rare corner case:

I do not want ARM to have two spinlock implementations.

> The ARM spinlock is now only used when you build an SMP-enabled
> kernel for an ARM1136r0 that is used in OMAP2, i.MX3 and some
> of the least common Integrator/Realview variants. I'm not aware
> of any binary distros with ARMv6+ kernels, so these would run custom
> kernels that are almost always non-SMP as well as no longer getting
> kernel upgrades (almost all have been discontinued years ago, the i.MX35
> chip itself was the last to get EOLd in 2018).
> Raspbian builds an ARMv6K SMP kernel that is not affected by this.

I just looked at the Debian configs. The armel kernels look UP-only and
armhf is V7. I am not sure there is an SMP-capable CPU that is plain V6;
I've been looking at a wiki [0] and the first SMP-capable core seems to
be ARMv6K.

> I wonder if we can do something better here and make the
> asm-generic/qspinlock.h implementation always degrade into an
> equivalent of include/linux/spinlock_up.h when running on uniprocessor
> systems, avoiding both the atomic cmpxchg and the slowpath.
> 
> That way, an ARMv6+SMP kernel on UP could share the qspinlock
> implementation but never actually get into the invalid 16-bit xchg() or
> sev()/wfe(). It already shouldn't ever get into the slowpath on a
> non-SMP system if I understand it correctly, but avoiding the cmpxchg()
> entirely would be an added benefit.

The lock should never be contended, so yes.

[0] https://en.wikipedia.org/wiki/List_of_ARM_microarchitectures

>        Arnd

Sebastian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-08 14:32     ` Arnd Bergmann
@ 2019-10-08 19:47       ` Sebastian Andrzej Siewior
  2019-10-08 21:47         ` Arnd Bergmann
  2019-10-09  8:46       ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-10-08 19:47 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, Russell King, Ingo Molnar, Waiman Long,
	Will Deacon, Linux ARM

On 2019-10-08 16:32:27 [+0200], Arnd Bergmann wrote:
> On Tue, Oct 8, 2019 at 3:36 PM Waiman Long <longman@redhat.com> wrote:
> > In x86, the lock instruction prefix is patched out when running on UP
> > system. This downgrades the atomic cmpxchg to non-atomic one. We may do
> > something similar in other architectures.
> 
> Unfortunately, the atomic macros cannot trivially be made cheaper
> on non-SMP systems based on load-locked/store-conditional
> based architectures, as there may be an interrupt in-between,
> and disabling interrupts would likely be more expensive.
> 
> However, there might be a way to take a shortcut out of
> queued_spin_lock() using asm-goto combined with the ARM
> __ALT_SMP_ASM() macro.

The smp_xchg16_relaxed() snippet above looked good. I would buy it :)
Where are you heading with __ALT_SMP_ASM()?

>       Arnd

Sebastian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-08 19:47       ` Sebastian Andrzej Siewior
@ 2019-10-08 21:47         ` Arnd Bergmann
  2019-10-08 22:02           ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2019-10-08 21:47 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, Russell King, Ingo Molnar, Waiman Long,
	Will Deacon, Linux ARM

On Tue, Oct 8, 2019 at 9:47 PM Sebastian Andrzej Siewior
<sebastian@breakpoint.cc> wrote:
>
> On 2019-10-08 16:32:27 [+0200], Arnd Bergmann wrote:
> > On Tue, Oct 8, 2019 at 3:36 PM Waiman Long <longman@redhat.com> wrote:
> > > In x86, the lock instruction prefix is patched out when running on UP
> > > system. This downgrades the atomic cmpxchg to non-atomic one. We may do
> > > something similar in other architectures.
> >
> > Unfortunately, the atomic macros cannot trivially be made cheaper
> > on non-SMP systems based on load-locked/store-conditional
> > based architectures, as there may be an interrupt in-between,
> > and disabling interrupts would likely be more expensive.
> >
> > However, there might be a way to take a shortcut out of
> > queued_spin_lock() using asm-goto combined with the ARM
> > __ALT_SMP_ASM() macro.
>
> The smp_xchg16_relaxed() snippet above looked good. I would buy it :)
> Where are you heading with __ALT_SMP_ASM()?

I was thinking of something along the lines of this:

diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 2c595446cd73..1aa321b45f63 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -45,6 +45,19 @@ static inline void dsb_sev(void)
        __asm__(SEV);
 }

+static __always_inline bool smp_enabled(void)
+{
+       if (IS_ENABLED(CONFIG_SMP))
+               asm_volatile_goto(__ALT_SMP_ASM(WASM(b) " %l[smp_on_up]", WASM(nop))
+                                 :::: smp_on_up);
+
+       return false;
+
+smp_on_up:
+       return true;
+}
+#define smp_enabled smp_enabled
+
 #include <asm/qrwlock.h>
 #include <asm/qspinlock.h>
 #define smp_mb__after_spinlock()       smp_mb()
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index fde943d180e0..3c456ad1661b 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -12,6 +12,10 @@

 #include <asm-generic/qspinlock_types.h>

+#ifndef smp_enabled
+#define smp_enabled (true)
+#endif
+
 /**
  * queued_spin_is_locked - is the spinlock locked?
  * @lock: Pointer to queued spinlock structure
@@ -75,6 +79,11 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
        u32 val = 0;

+       if (!smp_enabled()) {
+               atomic_set(&lock->val, _Q_LOCKED_VAL);
+               return;
+       }
+
        if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
                return;

The above is likely incorrect, non-idiomatic or inefficient, but this is
a way to avoid both a runtime check and the cmpxchg() in each spinlock.

      Arnd


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-08 21:47         ` Arnd Bergmann
@ 2019-10-08 22:02           ` Sebastian Andrzej Siewior
  2019-10-09  8:15             ` Arnd Bergmann
  0 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-10-08 22:02 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, Russell King, Ingo Molnar, Waiman Long,
	Will Deacon, Linux ARM

On 2019-10-08 23:47:31 [+0200], Arnd Bergmann wrote:
…
> diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> index fde943d180e0..3c456ad1661b 100644
> --- a/include/asm-generic/qspinlock.h
> +++ b/include/asm-generic/qspinlock.h
> @@ -75,6 +79,11 @@ static __always_inline void queued_spin_lock(struct
> qspinlock *lock)
>  {
>         u32 val = 0;
> 
> +       if (!smp_enabled()) {
> +               atomic_set(&lock->val, _Q_LOCKED_VAL);
> +               return;
> +       }
> +
>         if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
>                 return;
> 
> The above is likely incorrect, non-idiomatic or inefficient, but this
> is a way to
> avoid both a runtime check and the cmpxchg() in each spinlock.

You would have to put this in arch_spin_trylock() as well, but I get the
idea. The current implementation does cmpxchg() in the try-lock case, so
by switching to q-locks we are not getting worse in the UP case.
Therefore I think this is more of an optimisation for those that run SMP
kernels on UP machines.
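
For reference, the generic trylock fast path is also a single cmpxchg; a
rough sketch along the lines of include/asm-generic/qspinlock.h (not a
verbatim copy, details differ between kernel versions):

	static __always_inline int queued_spin_trylock(struct qspinlock *lock)
	{
		u32 val = atomic_read(&lock->val);

		if (unlikely(val))
			return 0;	/* already locked or queued */

		return likely(atomic_try_cmpxchg_acquire(&lock->val, &val,
							 _Q_LOCKED_VAL));
	}

so any UP shortcut would have to cover this path too.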

>       Arnd

Sebastian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-08 22:02           ` Sebastian Andrzej Siewior
@ 2019-10-09  8:15             ` Arnd Bergmann
  0 siblings, 0 replies; 22+ messages in thread
From: Arnd Bergmann @ 2019-10-09  8:15 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, Russell King, Ingo Molnar, Waiman Long,
	Will Deacon, Linux ARM

On Wed, Oct 9, 2019 at 12:02 AM Sebastian Andrzej Siewior
<sebastian@breakpoint.cc> wrote:
>
> On 2019-10-08 23:47:31 [+0200], Arnd Bergmann wrote:
> …
> > diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> > index fde943d180e0..3c456ad1661b 100644
> > --- a/include/asm-generic/qspinlock.h
> > +++ b/include/asm-generic/qspinlock.h
> …
> > @@ -75,6 +79,11 @@ static __always_inline void queued_spin_lock(struct
> > qspinlock *lock)
> >  {
> >         u32 val = 0;
> >
> > +       if (!smp_enabled()) {
> > +               atomic_set(&lock->val, _Q_LOCKED_VAL);
> > +               return;
> > +       }
> > +
> >         if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
> >                 return;
> >
> > The above is likely incorrect, non-idiomatic or inefficient, but this
> > is a way to
> > avoid both a runtime check and the cmpxchg() in each spinlock.
>
> You would have to put this in arch_spin_trylock() but I get the idea.
> The current implementation does cmpxchg() in the try-lock case so by
> switching to q-locks are not getting worse in the UP case.

Ah right, I dismissed the trylock case because there are few callers of
spin_trylock(), but I failed to notice the __raw_spin_lock() for the
CONFIG_LOCK_STAT case.

> Therefore I this is more of an optimisation for those that run SMP
> kernels on UP machines.

Yes, that was the point. Originally I wasn't sure if there was a case
in which that configuration could still end up in the spinlock slowpath,
but as that won't happen it's just a way to avoid the atomic operation
on UP. It might not actually make much of a difference depending
on the actual overhead of ldrex/strex while running on a single core.

       Arnd


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-08 14:32     ` Arnd Bergmann
  2019-10-08 19:47       ` Sebastian Andrzej Siewior
@ 2019-10-09  8:46       ` Peter Zijlstra
  2019-10-09  8:57         ` Arnd Bergmann
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-10-09  8:46 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Russell King, Sebastian Andrzej Siewior, Ingo Molnar,
	Waiman Long, Will Deacon, Linux ARM

On Tue, Oct 08, 2019 at 04:32:27PM +0200, Arnd Bergmann wrote:
> diff --git a/arch/arm/include/asm/cmpxchg.h b/arch/arm/include/asm/cmpxchg.h
> index 8b701f8e175c..6bf4964c105c 100644
> --- a/arch/arm/include/asm/cmpxchg.h
> +++ b/arch/arm/include/asm/cmpxchg.h
> @@ -114,6 +114,24 @@ static inline unsigned long __xchg(unsigned long
> x, volatile void *ptr, int size
>         return ret;
>  }
> 
> +#ifdef CONFIG_CPU_V6
> +static inline unsigned short smp_xchg16_relaxed(volatile unsigned
> short *ptr, unsigned short x)
> +{
> +       unsigned short ret, tmp;
> +       asm volatile("@ smp_xchg16_relaxed\n"
> +       ".arch armv6k\n"
> +       "1:     ldrexh  %0, [%3]\n"
> +       "       strexh  %1, %2, [%3]\n"
> +       "       teq     %1, #0\n"
> +       "       bne     1b"
> +               : "=&r" (ret), "=&r" (tmp)
> +               : "r" (x), "r" (ptr)
> +               : "memory", "cc");
> +       return ret;
> +}
> +#define smp_xchg16_relaxed smp_xchg16_relaxed
> +#endif

Why is this not in __xchg() as a variant for case 2 ?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-09  8:46       ` Peter Zijlstra
@ 2019-10-09  8:57         ` Arnd Bergmann
  2019-10-09  9:31           ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2019-10-09  8:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Russell King, Sebastian Andrzej Siewior, Ingo Molnar,
	Waiman Long, Will Deacon, Linux ARM

On Wed, Oct 9, 2019 at 10:46 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Oct 08, 2019 at 04:32:27PM +0200, Arnd Bergmann wrote:
> > diff --git a/arch/arm/include/asm/cmpxchg.h b/arch/arm/include/asm/cmpxchg.h
> > index 8b701f8e175c..6bf4964c105c 100644
> > --- a/arch/arm/include/asm/cmpxchg.h
> > +++ b/arch/arm/include/asm/cmpxchg.h
> > @@ -114,6 +114,24 @@ static inline unsigned long __xchg(unsigned long
> > x, volatile void *ptr, int size
> >         return ret;
> >  }
> >
> > +#ifdef CONFIG_CPU_V6
> > +static inline unsigned short smp_xchg16_relaxed(volatile unsigned
> > short *ptr, unsigned short x)
> > +{
> > +       unsigned short ret, tmp;
> > +       asm volatile("@ smp_xchg16_relaxed\n"
> > +       ".arch armv6k\n"
> > +       "1:     ldrexh  %0, [%3]\n"
> > +       "       strexh  %1, %2, [%3]\n"
> > +       "       teq     %1, #0\n"
> > +       "       bne     1b"
> > +               : "=&r" (ret), "=&r" (tmp)
> > +               : "r" (x), "r" (ptr)
> > +               : "memory", "cc");
> > +       return ret;
> > +}
> > +#define smp_xchg16_relaxed smp_xchg16_relaxed
> > +#endif
>
> Why is this not in __xchg() as a variant for case 2 ?

ldrexh/strexh are instructions that are only available on SMP-capable
architecture revisions (ARMv6K or higher). When building a kernel
that runs both on pre-K ARMv6 uniprocessor systems and on later
SMP systems, __xchg() can only do 32-bit  ldrex/strex.

The trick of smp_xchg16_relaxed() is to allow the 16-bit atomics in code
that is only ever executed on SMP systems, and never on the uniprocessor
OMAP2 and i.MX3 chips, which would trap on those instructions.

      Arnd


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-09  8:57         ` Arnd Bergmann
@ 2019-10-09  9:31           ` Peter Zijlstra
  2019-10-09 10:31             ` Arnd Bergmann
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-10-09  9:31 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Russell King, Sebastian Andrzej Siewior, Ingo Molnar,
	Waiman Long, Will Deacon, Linux ARM

On Wed, Oct 09, 2019 at 10:57:25AM +0200, Arnd Bergmann wrote:
> On Wed, Oct 9, 2019 at 10:46 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Oct 08, 2019 at 04:32:27PM +0200, Arnd Bergmann wrote:
> > > diff --git a/arch/arm/include/asm/cmpxchg.h b/arch/arm/include/asm/cmpxchg.h
> > > index 8b701f8e175c..6bf4964c105c 100644
> > > --- a/arch/arm/include/asm/cmpxchg.h
> > > +++ b/arch/arm/include/asm/cmpxchg.h
> > > @@ -114,6 +114,24 @@ static inline unsigned long __xchg(unsigned long
> > > x, volatile void *ptr, int size
> > >         return ret;
> > >  }
> > >
> > > +#ifdef CONFIG_CPU_V6
> > > +static inline unsigned short smp_xchg16_relaxed(volatile unsigned
> > > short *ptr, unsigned short x)
> > > +{
> > > +       unsigned short ret, tmp;
> > > +       asm volatile("@ smp_xchg16_relaxed\n"
> > > +       ".arch armv6k\n"
> > > +       "1:     ldrexh  %0, [%3]\n"
> > > +       "       strexh  %1, %2, [%3]\n"
> > > +       "       teq     %1, #0\n"
> > > +       "       bne     1b"
> > > +               : "=&r" (ret), "=&r" (tmp)
> > > +               : "r" (x), "r" (ptr)
> > > +               : "memory", "cc");
> > > +       return ret;
> > > +}
> > > +#define smp_xchg16_relaxed smp_xchg16_relaxed
> > > +#endif
> >
> > Why is this not in __xchg() as a variant for case 2 ?
> 
> ldrexh/strexh are instructions that are only available on SMP-capable
> architecture revisions (ARMv6K or higher). When building a kernel
> that runs both on pre-K ARMv6 uniprocessor systems and on later
> SMP systems, __xchg() can only do 32-bit  ldrex/strex.

You can do u16 xchg using a u32 ll/sc, see openrisc's xchg_small().


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-09  9:31           ` Peter Zijlstra
@ 2019-10-09 10:31             ` Arnd Bergmann
  2019-10-09 10:56               ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2019-10-09 10:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Russell King, Sebastian Andrzej Siewior, Ingo Molnar,
	Waiman Long, Will Deacon, Linux ARM

On Wed, Oct 9, 2019 at 11:31 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Oct 09, 2019 at 10:57:25AM +0200, Arnd Bergmann wrote:
> > On Wed, Oct 9, 2019 at 10:46 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > Why is this not in __xchg() as a variant for case 2 ?
> >
> > ldrexh/strexh are instructions that are only available on SMP-capable
> > architecture revisions (ARMv6K or higher). When building a kernel
> > that runs both on pre-K ARMv6 uniprocessor systems and on later
> > SMP systems, __xchg() can only do 32-bit  ldrex/strex.
>
> You can do u16 xchg using a u32 ll/sc, see openrisc's xchg_small().

Ah, right. That would be much nicer than my smp_xchg16_relaxed()
hack to get the corner case working, as it avoids the ugly special
case in qspinlock.h.

Would this still have comparable performance characteristics?
I assume the 16-bit xchg_relaxed() in qspinlock.c was meant as
an optimization for x86 and other cmpxchg based architectures but
doesn't actually help on ll/sc based architectures that get the
reservation on the whole cache line anyway?

      Arnd


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-09 10:31             ` Arnd Bergmann
@ 2019-10-09 10:56               ` Peter Zijlstra
  2019-10-09 12:00                 ` Arnd Bergmann
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-10-09 10:56 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Russell King, Sebastian Andrzej Siewior, Ingo Molnar,
	Waiman Long, Will Deacon, Linux ARM

On Wed, Oct 09, 2019 at 12:31:24PM +0200, Arnd Bergmann wrote:
> On Wed, Oct 9, 2019 at 11:31 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Oct 09, 2019 at 10:57:25AM +0200, Arnd Bergmann wrote:
> > > On Wed, Oct 9, 2019 at 10:46 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > Why is this not in __xchg() as a variant for case 2 ?
> > >
> > > ldrexh/strexh are instructions that are only available on SMP-capable
> > > architecture revisions (ARMv6K or higher). When building a kernel
> > > that runs both on pre-K ARMv6 uniprocessor systems and on later
> > > SMP systems, __xchg() can only do 32-bit  ldrex/strex.
> >
> > You can do u16 xchg using a u32 ll/sc, see openrisc's xchg_small().
> 
> Ah, right. That would be much nicer than my smp_xchg16_relaxed()
> hack to get the corner case working, as it avoids the ugly special
> case in qspinlock.h.
> 
> Would this still have comparable performance characteristics?

I suppose so..

> I assume the 16-bit xchg_relaxed() in qspinlock.c was meant as
> an optimization for x86 and other cmpxchg based architectures but
> doesn't actually help on ll/sc based architectures that get the
> reservation on the whole cache line anyway?

It does actually help here too, because it allows other operations to be
regular load/stores.

Look at the #if _Q_PENDING_BITS==8 in qspinlock.c, as opposed to the
#else where they're all atomic_*().
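
Roughly, paraphrased from kernel/locking/qspinlock.c (not a verbatim
copy): with the byte/short layout, clearing the pending bit and taking
the lock is a plain store, while the generic fallback needs an atomic
RMW on the whole word:

	#if _Q_PENDING_BITS == 8
	static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
	{
		/* plain 16-bit store over the locked + pending bytes */
		WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
	}
	#else
	static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
	{
		/* atomic RMW on the whole lock word */
		atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
	}
	#endif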



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-09 10:56               ` Peter Zijlstra
@ 2019-10-09 12:00                 ` Arnd Bergmann
  2019-10-09 12:06                   ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2019-10-09 12:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Russell King, Sebastian Andrzej Siewior, Ingo Molnar,
	Waiman Long, Will Deacon, Linux ARM

On Wed, Oct 9, 2019 at 12:57 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Oct 09, 2019 at 12:31:24PM +0200, Arnd Bergmann wrote:
> > On Wed, Oct 9, 2019 at 11:31 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > I assume the 16-bit xchg_relaxed() in qspinlock.c was meant as
> > an optimization for x86 and other cmpxchg based architectures but
> > doesn't actually help on ll/sc based architectures that get the
> > reservation on the whole cache line anyway?
>
> It does actually help here too, because it allows other operations to be
> regular load/stores.
>
> Look at the #if _Q_PENDING_BITS==8 in qspinlock.c, as opposed to the
> #else where they're all atomic_*().

Oh, is that safe with an xchg() implementation that operates on the whole
32-bit word when a concurrent thread can do a simple store to one half of it?

The ARM architecture reference says "It is UNPREDICTABLE whether the
transition from Exclusive Access to Open Access state occurs when the
Store or StoreExcl is from another observer.", which sounds to me
like the xchg_small() trick would not work with the qspinlock
implementation on ARM. [I see that mips, openrisc and xtensa do this,
but did not try to find out whether they have ll/sc semantics that make
it safe when another thread does a plain store to the reservation]

OTOH, I suppose we could just set _Q_PENDING_BITS to 1
regardless of NR_CPUS on any architecture without a 16-bit
xchg() primitive.

       Arnd


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-09 12:00                 ` Arnd Bergmann
@ 2019-10-09 12:06                   ` Peter Zijlstra
  2019-10-09 12:52                     ` Will Deacon
  2019-10-09 13:50                     ` Arnd Bergmann
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2019-10-09 12:06 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Russell King, Sebastian Andrzej Siewior, Ingo Molnar,
	Waiman Long, Will Deacon, Linux ARM

On Wed, Oct 09, 2019 at 02:00:05PM +0200, Arnd Bergmann wrote:
> On Wed, Oct 9, 2019 at 12:57 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Oct 09, 2019 at 12:31:24PM +0200, Arnd Bergmann wrote:
> > > On Wed, Oct 9, 2019 at 11:31 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > I assume the 16-bit xchg_relaxed() in qspinlock.c was meant as
> > > an optimization for x86 and other cmpxchg based architectures but
> > > doesn't actually help on ll/sc based architectures that get the
> > > reservation on the whole cache line anyway?
> >
> > It does actually help here too, because it allows other operations to be
> > regular load/stores.
> >
> > Look at the #if _Q_PENDING_BITS==8 in qspinlock.c, as opposed to the
> > #else where they're all atomic_*().
> 
> Oh, is that safe with an xchg() implementation that operates on the whole
> 32 bit when a concurrent thread can do a simple store to one half of it?

It had better be, otherwise LL/SC'd be broken. SC _must_ fail when there
is a contending store.

> The ARM architecture reference says "It is UNPREDICTABLE whether the
> transition from Exclusive Access to Open Access state occurs when the
> Store or StoreExcl is from another observer.", which sounds to me
> me like the xchg_small() trick would not work with the qspinlock
> implementation on ARM. [I see that mips, openrisc and xtensa do this,
> but did not try to find out whether they have ll/sc semantics that make
> it safe when another thread does a plain store to the reservation]

Will?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-09 12:06                   ` Peter Zijlstra
@ 2019-10-09 12:52                     ` Will Deacon
  2019-10-09 13:50                     ` Arnd Bergmann
  1 sibling, 0 replies; 22+ messages in thread
From: Will Deacon @ 2019-10-09 12:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnd Bergmann, Russell King, Sebastian Andrzej Siewior,
	Ingo Molnar, Waiman Long, Linux ARM

On Wed, Oct 09, 2019 at 02:06:39PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 09, 2019 at 02:00:05PM +0200, Arnd Bergmann wrote:
> > On Wed, Oct 9, 2019 at 12:57 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Wed, Oct 09, 2019 at 12:31:24PM +0200, Arnd Bergmann wrote:
> > > > On Wed, Oct 9, 2019 at 11:31 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > I assume the 16-bit xchg_relaxed() in qspinlock.c was meant as
> > > > an optimization for x86 and other cmpxchg based architectures but
> > > > doesn't actually help on ll/sc based architectures that get the
> > > > reservation on the whole cache line anyway?
> > >
> > > It does actually help here too, because it allows other operations to be
> > > regular load/stores.
> > >
> > > Look at the #if _Q_PENDING_BITS==8 in qspinlock.c, as opposed to the
> > > #else where they're all atomic_*().
> > 
> > Oh, is that safe with an xchg() implementation that operates on the whole
> > 32 bit when a concurrent thread can do a simple store to one half of it?
> 
> It had better be, otherwise LL/SC'd be broken. SC _must_ fail when there
> is a contending store.
> 
> > The ARM architecture reference says "It is UNPREDICTABLE whether the
> > transition from Exclusive Access to Open Access state occurs when the
> > Store or StoreExcl is from another observer.", which sounds to me
> > me like the xchg_small() trick would not work with the qspinlock
> > implementation on ARM. [I see that mips, openrisc and xtensa do this,
> > but did not try to find out whether they have ll/sc semantics that make
> > it safe when another thread does a plain store to the reservation]
> 
> Will?

I think this is the documentation being unhelpful -- there is a section
about the "local monitor", which this text applies to, but then there's
also a section about the "global monitor", which has a state machine
diagram in the section called "Clear global monitor event". This diagram
shows how the global monitor transitions from "Exclusive access" to "Open
access" in response to a store from a different observer to the "marked
address".

Will


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-09 12:06                   ` Peter Zijlstra
  2019-10-09 12:52                     ` Will Deacon
@ 2019-10-09 13:50                     ` Arnd Bergmann
  2019-10-09 21:42                       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2019-10-09 13:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Russell King, Sebastian Andrzej Siewior, Ingo Molnar,
	Waiman Long, Will Deacon, Linux ARM

On Wed, Oct 9, 2019 at 2:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Oct 09, 2019 at 02:00:05PM +0200, Arnd Bergmann wrote:
> > On Wed, Oct 9, 2019 at 12:57 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Wed, Oct 09, 2019 at 12:31:24PM +0200, Arnd Bergmann wrote:
> > > > On Wed, Oct 9, 2019 at 11:31 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > I assume the 16-bit xchg_relaxed() in qspinlock.c was meant as
> > > > an optimization for x86 and other cmpxchg based architectures but
> > > > doesn't actually help on ll/sc based architectures that get the
> > > > reservation on the whole cache line anyway?
> > >
> > > It does actually help here too, because it allows other operations to be
> > > regular load/stores.
> > >
> > > Look at the #if _Q_PENDING_BITS==8 in qspinlock.c, as opposed to the
> > > #else where they're all atomic_*().
> >
> > Oh, is that safe with an xchg() implementation that operates on the whole
> > 32 bit when a concurrent thread can do a simple store to one half of it?
>
> It had better be, otherwise LL/SC'd be broken. SC _must_ fail when there
> is a contending store.

Ok. I looked a bit at the other implementations that do xchg16() through
cmpxchg32(), and it seems it would be easiest to reuse the SuperH version,
which is fully portable: move arch/sh/include/asm/cmpxchg-xchg.h into
include/asm-generic/. The same change would allow a number of other
architectures to use the generic qspinlock instead of their own locks.
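
The technique in that header is roughly the following: a sketch of a
small xchg() built from a 32-bit cmpxchg(), paraphrased rather than
copied verbatim:

	/*
	 * Exchange a 1- or 2-byte value using only a 32-bit cmpxchg() on
	 * the aligned word that contains it.
	 */
	static inline u32 __xchg_cmpxchg(volatile void *ptr, u32 x, int size)
	{
		int off = (unsigned long)ptr % sizeof(u32);
		volatile u32 *p = ptr - off;
	#ifdef __BIG_ENDIAN
		int bitoff = (sizeof(u32) - size - off) * BITS_PER_BYTE;
	#else
		int bitoff = off * BITS_PER_BYTE;
	#endif
		u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff;
		u32 oldv, newv, ret;

		do {
			oldv = READ_ONCE(*p);
			ret = (oldv & bitmask) >> bitoff;
			newv = (oldv & ~bitmask) | (x << bitoff);
		} while (cmpxchg(p, oldv, newv) != oldv);

		return ret;
	}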

Sebastian, do you want to try doing it that way?

     Arnd


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM
  2019-10-09 13:50                     ` Arnd Bergmann
@ 2019-10-09 21:42                       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-10-09 21:42 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, Russell King, Ingo Molnar, Waiman Long,
	Will Deacon, Linux ARM

On 2019-10-09 15:50:24 [+0200], Arnd Bergmann wrote:
> Ok. I looked a bit at the other implementations that do xchg16() through
> cmpxchg32(), and it seems it would be easiest to reuse the superh version,
> which is fully portable by moving arch/sh/include/asm/cmpxchg-xchg.h
> into include/asm-generic/, the same thing would allow us to change a
> number of other architectures to use the generic qspinlock instead of
> their own locks.
> 
> Sebastian, do you want to try doing it that way?

sounds good, I'm on it.

>      Arnd

Sebastian


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-10-09 21:42 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
2019-10-07 21:44 [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM Sebastian Andrzej Siewior
2019-10-07 21:44 ` [PATCH 1/3] ARM: Use qrwlock implementation Sebastian Andrzej Siewior
2019-10-07 21:44 ` [PATCH 2/3] ARM: Use qspinlock implementation Sebastian Andrzej Siewior
2019-10-07 21:44 ` [PATCH 3/3] ARM: Inline locking functions for !PREEMPTION Sebastian Andrzej Siewior
2019-10-08 11:42 ` [RFC PATCH 0/3] Queued spinlocks/RW-locks for ARM Arnd Bergmann
2019-10-08 13:36   ` Waiman Long
2019-10-08 14:32     ` Arnd Bergmann
2019-10-08 19:47       ` Sebastian Andrzej Siewior
2019-10-08 21:47         ` Arnd Bergmann
2019-10-08 22:02           ` Sebastian Andrzej Siewior
2019-10-09  8:15             ` Arnd Bergmann
2019-10-09  8:46       ` Peter Zijlstra
2019-10-09  8:57         ` Arnd Bergmann
2019-10-09  9:31           ` Peter Zijlstra
2019-10-09 10:31             ` Arnd Bergmann
2019-10-09 10:56               ` Peter Zijlstra
2019-10-09 12:00                 ` Arnd Bergmann
2019-10-09 12:06                   ` Peter Zijlstra
2019-10-09 12:52                     ` Will Deacon
2019-10-09 13:50                     ` Arnd Bergmann
2019-10-09 21:42                       ` Sebastian Andrzej Siewior
2019-10-08 19:32   ` Sebastian Andrzej Siewior
