All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] ARM: findbit assembly updates
@ 2022-10-28 16:47 ` Russell King (Oracle)
  0 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:47 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Linus Torvalds, linux-arm-kernel,
	Linux Kernel Mailing List, Mark Rutland, Will Deacon

Hi,

This series updates the arm32 assembly versions of the findbit
operations:

- Document ARMv5 code that calculates the bit offset
- Provide an updated ARMv7 implementation using the rbit instruction
- Switch to use macros instead of duplicating mostly identical code
- Switch to using word loads rather than byte loads
- Add unwinder information for backtracing

I've had it sitting around in-use for some time, and no issues have
arisen. Tested also outside the kernel tree in userspace and results
are the same with the previous implementation.

Testing with the find_bit benchmark module shows that these operations
coded in assembly are faster than the generic versions (previously
posted), so I believe they're worth keeping.

 arch/arm/include/asm/assembler.h |   6 +
 arch/arm/lib/findbit.S           | 230 +++++++++++++++------------------------
 2 files changed, 94 insertions(+), 142 deletions(-)

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 0/5] ARM: findbit assembly updates
@ 2022-10-28 16:47 ` Russell King (Oracle)
  0 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:47 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Linus Torvalds, linux-arm-kernel,
	Linux Kernel Mailing List, Mark Rutland, Will Deacon

Hi,

This series updates the arm32 assembly versions of the findbit
operations:

- Document ARMv5 code that calculates the bit offset
- Provide an updated ARMv7 implementation using the rbit instruction
- Switch to use macros instead of duplicating mostly identical code
- Switch to using word loads rather than byte loads
- Add unwinder information for backtracing

I've had it sitting around in-use for some time, and no issues have
arisen. Tested also outside the kernel tree in userspace and results
are the same with the previous implementation.

Testing with the find_bit benchmark module shows that these operations
coded in assembly are faster than the generic versions (previously
posted), so I believe they're worth keeping.

 arch/arm/include/asm/assembler.h |   6 +
 arch/arm/lib/findbit.S           | 230 +++++++++++++++------------------------
 2 files changed, 94 insertions(+), 142 deletions(-)

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
  2022-10-28 16:47 ` Russell King (Oracle)
@ 2022-10-28 16:47   ` Russell King (Oracle)
  -1 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:47 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, Linus Torvalds, linux-arm-kernel

Document the ARMv5 bit offset calculation code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm/lib/findbit.S | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index 7fd3600db8ef..4c584bc4704b 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -172,10 +172,10 @@ ENDPROC(_find_next_bit_be)
 .L_found:
 #if __LINUX_ARM_ARCH__ >= 5
 		rsb	r0, r3, #0
-		and	r3, r3, r0
-		clz	r3, r3
-		rsb	r3, r3, #31
-		add	r0, r2, r3
+		and	r3, r3, r0		@ mask out lowest bit set
+		clz	r3, r3			@ count high zero bits
+		rsb	r3, r3, #31		@ offset of first set bit
+		add	r0, r2, r3		@ add offset of first set bit
 #else
 		tst	r3, #0x0f
 		addeq	r2, r2, #4
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
@ 2022-10-28 16:47   ` Russell King (Oracle)
  0 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:47 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, Linus Torvalds, linux-arm-kernel

Document the ARMv5 bit offset calculation code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm/lib/findbit.S | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index 7fd3600db8ef..4c584bc4704b 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -172,10 +172,10 @@ ENDPROC(_find_next_bit_be)
 .L_found:
 #if __LINUX_ARM_ARCH__ >= 5
 		rsb	r0, r3, #0
-		and	r3, r3, r0
-		clz	r3, r3
-		rsb	r3, r3, #31
-		add	r0, r2, r3
+		and	r3, r3, r0		@ mask out lowest bit set
+		clz	r3, r3			@ count high zero bits
+		rsb	r3, r3, #31		@ offset of first set bit
+		add	r0, r2, r3		@ add offset of first set bit
 #else
 		tst	r3, #0x0f
 		addeq	r2, r2, #4
-- 
2.30.2


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/5] ARM: findbit: provide more efficient ARMv7 implementation
  2022-10-28 16:47 ` Russell King (Oracle)
@ 2022-10-28 16:47   ` Russell King (Oracle)
  -1 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:47 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, Linus Torvalds, linux-arm-kernel

Provide a more efficient ARMv7 implementation to determine the first
set bit in the supplied value.

Signed-off-by: Russell King (Oracle) <rmk@armlinux.org.uk>
---
 arch/arm/lib/findbit.S | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index 4c584bc4704b..256e095d490b 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -170,7 +170,11 @@ ENDPROC(_find_next_bit_be)
  * One or more bits in the LSB of r3 are assumed to be set.
  */
 .L_found:
-#if __LINUX_ARM_ARCH__ >= 5
+#if __LINUX_ARM_ARCH__ >= 7
+		rbit	r3, r3			@ reverse bits
+		clz	r3, r3			@ count high zero bits
+		add	r0, r2, r3		@ add offset of first set bit
+#elif __LINUX_ARM_ARCH__ >= 5
 		rsb	r0, r3, #0
 		and	r3, r3, r0		@ mask out lowest bit set
 		clz	r3, r3			@ count high zero bits
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/5] ARM: findbit: provide more efficient ARMv7 implementation
@ 2022-10-28 16:47   ` Russell King (Oracle)
  0 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:47 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, Linus Torvalds, linux-arm-kernel

Provide a more efficient ARMv7 implementation to determine the first
set bit in the supplied value.

Signed-off-by: Russell King (Oracle) <rmk@armlinux.org.uk>
---
 arch/arm/lib/findbit.S | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index 4c584bc4704b..256e095d490b 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -170,7 +170,11 @@ ENDPROC(_find_next_bit_be)
  * One or more bits in the LSB of r3 are assumed to be set.
  */
 .L_found:
-#if __LINUX_ARM_ARCH__ >= 5
+#if __LINUX_ARM_ARCH__ >= 7
+		rbit	r3, r3			@ reverse bits
+		clz	r3, r3			@ count high zero bits
+		add	r0, r2, r3		@ add offset of first set bit
+#elif __LINUX_ARM_ARCH__ >= 5
 		rsb	r0, r3, #0
 		and	r3, r3, r0		@ mask out lowest bit set
 		clz	r3, r3			@ count high zero bits
-- 
2.30.2


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/5] ARM: findbit: convert to macros
  2022-10-28 16:47 ` Russell King (Oracle)
@ 2022-10-28 16:48   ` Russell King (Oracle)
  -1 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:48 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, Linus Torvalds, linux-arm-kernel

Since the pairs of _find_first and _find_next functions are pretty
similar, use macros to generate this code. This commit does not
change the generated code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm/lib/findbit.S | 158 +++++++++++------------------------------
 1 file changed, 42 insertions(+), 116 deletions(-)

diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index 256e095d490b..8280f66d38a5 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -14,155 +14,81 @@
 #include <asm/assembler.h>
                 .text
 
-/*
- * Purpose  : Find a 'zero' bit
- * Prototype: int find_first_zero_bit(void *addr, unsigned int maxbit);
- */
-ENTRY(_find_first_zero_bit_le)
-		teq	r1, #0	
+		.macro	find_first, endian, set, name
+ENTRY(_find_first_\name\()bit_\endian)
+		teq	r1, #0
 		beq	3f
 		mov	r2, #0
 1:
+		.ifc	\endian, be
+		eor	r3, r2, #0x18
+ ARM(		ldrb	r3, [r0, r3, lsr #3]	)
+ THUMB(		lsr	r3, #3			)
+ THUMB(		ldrb	r3, [r0, r3]		)
+		.else
  ARM(		ldrb	r3, [r0, r2, lsr #3]	)
  THUMB(		lsr	r3, r2, #3		)
  THUMB(		ldrb	r3, [r0, r3]		)
+		.endif
+		.ifeq	\set
 		eors	r3, r3, #0xff		@ invert bits
+		.else
+		movs	r3, r3
+		.endif
 		bne	.L_found		@ any now set - found zero bit
 		add	r2, r2, #8		@ next bit pointer
 2:		cmp	r2, r1			@ any more?
 		blo	1b
 3:		mov	r0, r1			@ no free bits
 		ret	lr
-ENDPROC(_find_first_zero_bit_le)
+ENDPROC(_find_first_\name\()bit_\endian)
+		.endm
 
-/*
- * Purpose  : Find next 'zero' bit
- * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset)
- */
-ENTRY(_find_next_zero_bit_le)
+		.macro	find_next, endian, set, name
+ENTRY(_find_next_\name\()bit_\endian)
 		cmp	r2, r1
 		bhs	3b
 		ands	ip, r2, #7
 		beq	1b			@ If new byte, goto old routine
+		.ifc	\endian, be
+		eor	r3, r2, #0x18
+ ARM(		ldrb	r3, [r0, r3, lsr #3]	)
+ THUMB(		lsr	r3, #3			)
+ THUMB(		ldrb	r3, [r0, r3]		)
+		.else
  ARM(		ldrb	r3, [r0, r2, lsr #3]	)
  THUMB(		lsr	r3, r2, #3		)
  THUMB(		ldrb	r3, [r0, r3]		)
+		.endif
+		.ifeq	\set
 		eor	r3, r3, #0xff		@ now looking for a 1 bit
+		.endif
 		movs	r3, r3, lsr ip		@ shift off unused bits
 		bne	.L_found
 		orr	r2, r2, #7		@ if zero, then no bits here
 		add	r2, r2, #1		@ align bit pointer
 		b	2b			@ loop for next bit
-ENDPROC(_find_next_zero_bit_le)
+ENDPROC(_find_next_\name\()bit_\endian)
+		.endm
 
-/*
- * Purpose  : Find a 'one' bit
- * Prototype: int find_first_bit(const unsigned long *addr, unsigned int maxbit);
- */
-ENTRY(_find_first_bit_le)
-		teq	r1, #0	
-		beq	3f
-		mov	r2, #0
-1:
- ARM(		ldrb	r3, [r0, r2, lsr #3]	)
- THUMB(		lsr	r3, r2, #3		)
- THUMB(		ldrb	r3, [r0, r3]		)
-		movs	r3, r3
-		bne	.L_found		@ any now set - found zero bit
-		add	r2, r2, #8		@ next bit pointer
-2:		cmp	r2, r1			@ any more?
-		blo	1b
-3:		mov	r0, r1			@ no free bits
-		ret	lr
-ENDPROC(_find_first_bit_le)
+		.macro	find_bit, endian, set, name
+		find_first \endian, \set, \name
+		find_next  \endian, \set, \name
+		.endm
 
-/*
- * Purpose  : Find next 'one' bit
- * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset)
- */
-ENTRY(_find_next_bit_le)
-		cmp	r2, r1
-		bhs	3b
-		ands	ip, r2, #7
-		beq	1b			@ If new byte, goto old routine
- ARM(		ldrb	r3, [r0, r2, lsr #3]	)
- THUMB(		lsr	r3, r2, #3		)
- THUMB(		ldrb	r3, [r0, r3]		)
-		movs	r3, r3, lsr ip		@ shift off unused bits
-		bne	.L_found
-		orr	r2, r2, #7		@ if zero, then no bits here
-		add	r2, r2, #1		@ align bit pointer
-		b	2b			@ loop for next bit
-ENDPROC(_find_next_bit_le)
+/* _find_first_zero_bit_le and _find_next_zero_bit_le */
+		find_bit le, 0, zero_
 
-#ifdef __ARMEB__
+/* _find_first_bit_le and _find_next_bit_le */
+		find_bit le, 1
 
-ENTRY(_find_first_zero_bit_be)
-		teq	r1, #0
-		beq	3f
-		mov	r2, #0
-1:		eor	r3, r2, #0x18		@ big endian byte ordering
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
-		eors	r3, r3, #0xff		@ invert bits
-		bne	.L_found		@ any now set - found zero bit
-		add	r2, r2, #8		@ next bit pointer
-2:		cmp	r2, r1			@ any more?
-		blo	1b
-3:		mov	r0, r1			@ no free bits
-		ret	lr
-ENDPROC(_find_first_zero_bit_be)
+#ifdef __ARMEB__
 
-ENTRY(_find_next_zero_bit_be)
-		cmp	r2, r1
-		bhs	3b
-		ands	ip, r2, #7
-		beq	1b			@ If new byte, goto old routine
-		eor	r3, r2, #0x18		@ big endian byte ordering
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
-		eor	r3, r3, #0xff		@ now looking for a 1 bit
-		movs	r3, r3, lsr ip		@ shift off unused bits
-		bne	.L_found
-		orr	r2, r2, #7		@ if zero, then no bits here
-		add	r2, r2, #1		@ align bit pointer
-		b	2b			@ loop for next bit
-ENDPROC(_find_next_zero_bit_be)
+/* _find_first_zero_bit_be and _find_next_zero_bit_be */
+		find_bit be, 0, zero_
 
-ENTRY(_find_first_bit_be)
-		teq	r1, #0
-		beq	3f
-		mov	r2, #0
-1:		eor	r3, r2, #0x18		@ big endian byte ordering
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
-		movs	r3, r3
-		bne	.L_found		@ any now set - found zero bit
-		add	r2, r2, #8		@ next bit pointer
-2:		cmp	r2, r1			@ any more?
-		blo	1b
-3:		mov	r0, r1			@ no free bits
-		ret	lr
-ENDPROC(_find_first_bit_be)
-
-ENTRY(_find_next_bit_be)
-		cmp	r2, r1
-		bhs	3b
-		ands	ip, r2, #7
-		beq	1b			@ If new byte, goto old routine
-		eor	r3, r2, #0x18		@ big endian byte ordering
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
-		movs	r3, r3, lsr ip		@ shift off unused bits
-		bne	.L_found
-		orr	r2, r2, #7		@ if zero, then no bits here
-		add	r2, r2, #1		@ align bit pointer
-		b	2b			@ loop for next bit
-ENDPROC(_find_next_bit_be)
+/* _find_first_bit_be and _find_next_bit_be */
+		find_bit be, 1
 
 #endif
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/5] ARM: findbit: convert to macros
@ 2022-10-28 16:48   ` Russell King (Oracle)
  0 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:48 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, Linus Torvalds, linux-arm-kernel

Since the pairs of _find_first and _find_next functions are pretty
similar, use macros to generate this code. This commit does not
change the generated code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm/lib/findbit.S | 158 +++++++++++------------------------------
 1 file changed, 42 insertions(+), 116 deletions(-)

diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index 256e095d490b..8280f66d38a5 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -14,155 +14,81 @@
 #include <asm/assembler.h>
                 .text
 
-/*
- * Purpose  : Find a 'zero' bit
- * Prototype: int find_first_zero_bit(void *addr, unsigned int maxbit);
- */
-ENTRY(_find_first_zero_bit_le)
-		teq	r1, #0	
+		.macro	find_first, endian, set, name
+ENTRY(_find_first_\name\()bit_\endian)
+		teq	r1, #0
 		beq	3f
 		mov	r2, #0
 1:
+		.ifc	\endian, be
+		eor	r3, r2, #0x18
+ ARM(		ldrb	r3, [r0, r3, lsr #3]	)
+ THUMB(		lsr	r3, #3			)
+ THUMB(		ldrb	r3, [r0, r3]		)
+		.else
  ARM(		ldrb	r3, [r0, r2, lsr #3]	)
  THUMB(		lsr	r3, r2, #3		)
  THUMB(		ldrb	r3, [r0, r3]		)
+		.endif
+		.ifeq	\set
 		eors	r3, r3, #0xff		@ invert bits
+		.else
+		movs	r3, r3
+		.endif
 		bne	.L_found		@ any now set - found zero bit
 		add	r2, r2, #8		@ next bit pointer
 2:		cmp	r2, r1			@ any more?
 		blo	1b
 3:		mov	r0, r1			@ no free bits
 		ret	lr
-ENDPROC(_find_first_zero_bit_le)
+ENDPROC(_find_first_\name\()bit_\endian)
+		.endm
 
-/*
- * Purpose  : Find next 'zero' bit
- * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset)
- */
-ENTRY(_find_next_zero_bit_le)
+		.macro	find_next, endian, set, name
+ENTRY(_find_next_\name\()bit_\endian)
 		cmp	r2, r1
 		bhs	3b
 		ands	ip, r2, #7
 		beq	1b			@ If new byte, goto old routine
+		.ifc	\endian, be
+		eor	r3, r2, #0x18
+ ARM(		ldrb	r3, [r0, r3, lsr #3]	)
+ THUMB(		lsr	r3, #3			)
+ THUMB(		ldrb	r3, [r0, r3]		)
+		.else
  ARM(		ldrb	r3, [r0, r2, lsr #3]	)
  THUMB(		lsr	r3, r2, #3		)
  THUMB(		ldrb	r3, [r0, r3]		)
+		.endif
+		.ifeq	\set
 		eor	r3, r3, #0xff		@ now looking for a 1 bit
+		.endif
 		movs	r3, r3, lsr ip		@ shift off unused bits
 		bne	.L_found
 		orr	r2, r2, #7		@ if zero, then no bits here
 		add	r2, r2, #1		@ align bit pointer
 		b	2b			@ loop for next bit
-ENDPROC(_find_next_zero_bit_le)
+ENDPROC(_find_next_\name\()bit_\endian)
+		.endm
 
-/*
- * Purpose  : Find a 'one' bit
- * Prototype: int find_first_bit(const unsigned long *addr, unsigned int maxbit);
- */
-ENTRY(_find_first_bit_le)
-		teq	r1, #0	
-		beq	3f
-		mov	r2, #0
-1:
- ARM(		ldrb	r3, [r0, r2, lsr #3]	)
- THUMB(		lsr	r3, r2, #3		)
- THUMB(		ldrb	r3, [r0, r3]		)
-		movs	r3, r3
-		bne	.L_found		@ any now set - found zero bit
-		add	r2, r2, #8		@ next bit pointer
-2:		cmp	r2, r1			@ any more?
-		blo	1b
-3:		mov	r0, r1			@ no free bits
-		ret	lr
-ENDPROC(_find_first_bit_le)
+		.macro	find_bit, endian, set, name
+		find_first \endian, \set, \name
+		find_next  \endian, \set, \name
+		.endm
 
-/*
- * Purpose  : Find next 'one' bit
- * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset)
- */
-ENTRY(_find_next_bit_le)
-		cmp	r2, r1
-		bhs	3b
-		ands	ip, r2, #7
-		beq	1b			@ If new byte, goto old routine
- ARM(		ldrb	r3, [r0, r2, lsr #3]	)
- THUMB(		lsr	r3, r2, #3		)
- THUMB(		ldrb	r3, [r0, r3]		)
-		movs	r3, r3, lsr ip		@ shift off unused bits
-		bne	.L_found
-		orr	r2, r2, #7		@ if zero, then no bits here
-		add	r2, r2, #1		@ align bit pointer
-		b	2b			@ loop for next bit
-ENDPROC(_find_next_bit_le)
+/* _find_first_zero_bit_le and _find_next_zero_bit_le */
+		find_bit le, 0, zero_
 
-#ifdef __ARMEB__
+/* _find_first_bit_le and _find_next_bit_le */
+		find_bit le, 1
 
-ENTRY(_find_first_zero_bit_be)
-		teq	r1, #0
-		beq	3f
-		mov	r2, #0
-1:		eor	r3, r2, #0x18		@ big endian byte ordering
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
-		eors	r3, r3, #0xff		@ invert bits
-		bne	.L_found		@ any now set - found zero bit
-		add	r2, r2, #8		@ next bit pointer
-2:		cmp	r2, r1			@ any more?
-		blo	1b
-3:		mov	r0, r1			@ no free bits
-		ret	lr
-ENDPROC(_find_first_zero_bit_be)
+#ifdef __ARMEB__
 
-ENTRY(_find_next_zero_bit_be)
-		cmp	r2, r1
-		bhs	3b
-		ands	ip, r2, #7
-		beq	1b			@ If new byte, goto old routine
-		eor	r3, r2, #0x18		@ big endian byte ordering
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
-		eor	r3, r3, #0xff		@ now looking for a 1 bit
-		movs	r3, r3, lsr ip		@ shift off unused bits
-		bne	.L_found
-		orr	r2, r2, #7		@ if zero, then no bits here
-		add	r2, r2, #1		@ align bit pointer
-		b	2b			@ loop for next bit
-ENDPROC(_find_next_zero_bit_be)
+/* _find_first_zero_bit_be and _find_next_zero_bit_be */
+		find_bit be, 0, zero_
 
-ENTRY(_find_first_bit_be)
-		teq	r1, #0
-		beq	3f
-		mov	r2, #0
-1:		eor	r3, r2, #0x18		@ big endian byte ordering
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
-		movs	r3, r3
-		bne	.L_found		@ any now set - found zero bit
-		add	r2, r2, #8		@ next bit pointer
-2:		cmp	r2, r1			@ any more?
-		blo	1b
-3:		mov	r0, r1			@ no free bits
-		ret	lr
-ENDPROC(_find_first_bit_be)
-
-ENTRY(_find_next_bit_be)
-		cmp	r2, r1
-		bhs	3b
-		ands	ip, r2, #7
-		beq	1b			@ If new byte, goto old routine
-		eor	r3, r2, #0x18		@ big endian byte ordering
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
-		movs	r3, r3, lsr ip		@ shift off unused bits
-		bne	.L_found
-		orr	r2, r2, #7		@ if zero, then no bits here
-		add	r2, r2, #1		@ align bit pointer
-		b	2b			@ loop for next bit
-ENDPROC(_find_next_bit_be)
+/* _find_first_bit_be and _find_next_bit_be */
+		find_bit be, 1
 
 #endif
 
-- 
2.30.2


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/5] ARM: findbit: operate by words
  2022-10-28 16:47 ` Russell King (Oracle)
@ 2022-10-28 16:48   ` Russell King (Oracle)
  -1 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:48 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, Linus Torvalds, linux-arm-kernel

Convert the implementations to operate on words rather than bytes
which makes bitmap searching faster.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm/include/asm/assembler.h |  6 +++
 arch/arm/lib/findbit.S           | 78 ++++++++++++++++++--------------
 2 files changed, 50 insertions(+), 34 deletions(-)

diff --git a/arch/arm/include/asm/assembler.h b/arch/arm/include/asm/assembler.h
index 90fbe4a3f9c8..28e18f79c300 100644
--- a/arch/arm/include/asm/assembler.h
+++ b/arch/arm/include/asm/assembler.h
@@ -761,6 +761,12 @@ THUMB(	orr	\reg , \reg , #PSR_T_BIT	)
 	.endif
 	.endm
 
+	.if		__LINUX_ARM_ARCH__ < 6
+	.set		.Lrev_l_uses_tmp, 1
+	.else
+	.set		.Lrev_l_uses_tmp, 0
+	.endif
+
 	/*
 	 * bl_r - branch and link to register
 	 *
diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index 8280f66d38a5..6ec584d16d46 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -14,32 +14,32 @@
 #include <asm/assembler.h>
                 .text
 
+#ifdef __ARMEB__
+#define SWAB_ENDIAN le
+#else
+#define SWAB_ENDIAN be
+#endif
+
 		.macro	find_first, endian, set, name
 ENTRY(_find_first_\name\()bit_\endian)
 		teq	r1, #0
 		beq	3f
 		mov	r2, #0
-1:
-		.ifc	\endian, be
-		eor	r3, r2, #0x18
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
+1:		ldr	r3, [r0], #4
+		.ifeq \set
+		mvns	r3, r3			@ invert/test bits
 		.else
- ARM(		ldrb	r3, [r0, r2, lsr #3]	)
- THUMB(		lsr	r3, r2, #3		)
- THUMB(		ldrb	r3, [r0, r3]		)
+		movs	r3, r3			@ test bits
 		.endif
-		.ifeq	\set
-		eors	r3, r3, #0xff		@ invert bits
+		.ifc \endian, SWAB_ENDIAN
+		bne	.L_found_swab
 		.else
-		movs	r3, r3
+		bne	.L_found		@ found the bit?
 		.endif
-		bne	.L_found		@ any now set - found zero bit
-		add	r2, r2, #8		@ next bit pointer
+		add	r2, r2, #32		@ next index
 2:		cmp	r2, r1			@ any more?
 		blo	1b
-3:		mov	r0, r1			@ no free bits
+3:		mov	r0, r1			@ no more bits
 		ret	lr
 ENDPROC(_find_first_\name\()bit_\endian)
 		.endm
@@ -48,24 +48,25 @@ ENDPROC(_find_first_\name\()bit_\endian)
 ENTRY(_find_next_\name\()bit_\endian)
 		cmp	r2, r1
 		bhs	3b
-		ands	ip, r2, #7
-		beq	1b			@ If new byte, goto old routine
-		.ifc	\endian, be
-		eor	r3, r2, #0x18
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
-		.else
- ARM(		ldrb	r3, [r0, r2, lsr #3]	)
- THUMB(		lsr	r3, r2, #3		)
- THUMB(		ldrb	r3, [r0, r3]		)
+		mov	ip, r2, lsr #5		@ word index
+		add	r0, r0, ip, lsl #2
+		ands	ip, r2, #31		@ bit position
+		beq	1b
+		ldr	r3, [r0], #4
+		.ifeq \set
+		mvn	r3, r3			@ invert bits
+		.endif
+		.ifc \endian, SWAB_ENDIAN
+		rev_l	r3, ip
+		.if	.Lrev_l_uses_tmp
+		@ we need to recompute ip because rev_l will have overwritten
+		@ it.
+		and	ip, r2, #31		@ bit position
 		.endif
-		.ifeq	\set
-		eor	r3, r3, #0xff		@ now looking for a 1 bit
 		.endif
 		movs	r3, r3, lsr ip		@ shift off unused bits
 		bne	.L_found
-		orr	r2, r2, #7		@ if zero, then no bits here
+		orr	r2, r2, #31		@ no zero bits
 		add	r2, r2, #1		@ align bit pointer
 		b	2b			@ loop for next bit
 ENDPROC(_find_next_\name\()bit_\endian)
@@ -95,6 +96,8 @@ ENDPROC(_find_next_\name\()bit_\endian)
 /*
  * One or more bits in the LSB of r3 are assumed to be set.
  */
+.L_found_swab:
+		rev_l	r3, ip
 .L_found:
 #if __LINUX_ARM_ARCH__ >= 7
 		rbit	r3, r3			@ reverse bits
@@ -107,13 +110,20 @@ ENDPROC(_find_next_\name\()bit_\endian)
 		rsb	r3, r3, #31		@ offset of first set bit
 		add	r0, r2, r3		@ add offset of first set bit
 #else
-		tst	r3, #0x0f
+		mov	ip, #~0
+		tst	r3, ip, lsr #16		@ test bits 0-15
+		addeq	r2, r2, #16
+		moveq	r3, r3, lsr #16
+		tst	r3, #0x00ff
+		addeq	r2, r2, #8
+		moveq	r3, r3, lsr #8
+		tst	r3, #0x000f
 		addeq	r2, r2, #4
-		movne	r3, r3, lsl #4
-		tst	r3, #0x30
+		moveq	r3, r3, lsr #4
+		tst	r3, #0x0003
 		addeq	r2, r2, #2
-		movne	r3, r3, lsl #2
-		tst	r3, #0x40
+		moveq	r3, r3, lsr #2
+		tst	r3, #0x0001
 		addeq	r2, r2, #1
 		mov	r0, r2
 #endif
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/5] ARM: findbit: operate by words
@ 2022-10-28 16:48   ` Russell King (Oracle)
  0 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:48 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, Linus Torvalds, linux-arm-kernel

Convert the implementations to operate on words rather than bytes
which makes bitmap searching faster.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm/include/asm/assembler.h |  6 +++
 arch/arm/lib/findbit.S           | 78 ++++++++++++++++++--------------
 2 files changed, 50 insertions(+), 34 deletions(-)

diff --git a/arch/arm/include/asm/assembler.h b/arch/arm/include/asm/assembler.h
index 90fbe4a3f9c8..28e18f79c300 100644
--- a/arch/arm/include/asm/assembler.h
+++ b/arch/arm/include/asm/assembler.h
@@ -761,6 +761,12 @@ THUMB(	orr	\reg , \reg , #PSR_T_BIT	)
 	.endif
 	.endm
 
+	.if		__LINUX_ARM_ARCH__ < 6
+	.set		.Lrev_l_uses_tmp, 1
+	.else
+	.set		.Lrev_l_uses_tmp, 0
+	.endif
+
 	/*
 	 * bl_r - branch and link to register
 	 *
diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index 8280f66d38a5..6ec584d16d46 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -14,32 +14,32 @@
 #include <asm/assembler.h>
                 .text
 
+#ifdef __ARMEB__
+#define SWAB_ENDIAN le
+#else
+#define SWAB_ENDIAN be
+#endif
+
 		.macro	find_first, endian, set, name
 ENTRY(_find_first_\name\()bit_\endian)
 		teq	r1, #0
 		beq	3f
 		mov	r2, #0
-1:
-		.ifc	\endian, be
-		eor	r3, r2, #0x18
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
+1:		ldr	r3, [r0], #4
+		.ifeq \set
+		mvns	r3, r3			@ invert/test bits
 		.else
- ARM(		ldrb	r3, [r0, r2, lsr #3]	)
- THUMB(		lsr	r3, r2, #3		)
- THUMB(		ldrb	r3, [r0, r3]		)
+		movs	r3, r3			@ test bits
 		.endif
-		.ifeq	\set
-		eors	r3, r3, #0xff		@ invert bits
+		.ifc \endian, SWAB_ENDIAN
+		bne	.L_found_swab
 		.else
-		movs	r3, r3
+		bne	.L_found		@ found the bit?
 		.endif
-		bne	.L_found		@ any now set - found zero bit
-		add	r2, r2, #8		@ next bit pointer
+		add	r2, r2, #32		@ next index
 2:		cmp	r2, r1			@ any more?
 		blo	1b
-3:		mov	r0, r1			@ no free bits
+3:		mov	r0, r1			@ no more bits
 		ret	lr
 ENDPROC(_find_first_\name\()bit_\endian)
 		.endm
@@ -48,24 +48,25 @@ ENDPROC(_find_first_\name\()bit_\endian)
 ENTRY(_find_next_\name\()bit_\endian)
 		cmp	r2, r1
 		bhs	3b
-		ands	ip, r2, #7
-		beq	1b			@ If new byte, goto old routine
-		.ifc	\endian, be
-		eor	r3, r2, #0x18
- ARM(		ldrb	r3, [r0, r3, lsr #3]	)
- THUMB(		lsr	r3, #3			)
- THUMB(		ldrb	r3, [r0, r3]		)
-		.else
- ARM(		ldrb	r3, [r0, r2, lsr #3]	)
- THUMB(		lsr	r3, r2, #3		)
- THUMB(		ldrb	r3, [r0, r3]		)
+		mov	ip, r2, lsr #5		@ word index
+		add	r0, r0, ip, lsl #2
+		ands	ip, r2, #31		@ bit position
+		beq	1b
+		ldr	r3, [r0], #4
+		.ifeq \set
+		mvn	r3, r3			@ invert bits
+		.endif
+		.ifc \endian, SWAB_ENDIAN
+		rev_l	r3, ip
+		.if	.Lrev_l_uses_tmp
+		@ we need to recompute ip because rev_l will have overwritten
+		@ it.
+		and	ip, r2, #31		@ bit position
 		.endif
-		.ifeq	\set
-		eor	r3, r3, #0xff		@ now looking for a 1 bit
 		.endif
 		movs	r3, r3, lsr ip		@ shift off unused bits
 		bne	.L_found
-		orr	r2, r2, #7		@ if zero, then no bits here
+		orr	r2, r2, #31		@ no zero bits
 		add	r2, r2, #1		@ align bit pointer
 		b	2b			@ loop for next bit
 ENDPROC(_find_next_\name\()bit_\endian)
@@ -95,6 +96,8 @@ ENDPROC(_find_next_\name\()bit_\endian)
 /*
  * One or more bits in the LSB of r3 are assumed to be set.
  */
+.L_found_swab:
+		rev_l	r3, ip
 .L_found:
 #if __LINUX_ARM_ARCH__ >= 7
 		rbit	r3, r3			@ reverse bits
@@ -107,13 +110,20 @@ ENDPROC(_find_next_\name\()bit_\endian)
 		rsb	r3, r3, #31		@ offset of first set bit
 		add	r0, r2, r3		@ add offset of first set bit
 #else
-		tst	r3, #0x0f
+		mov	ip, #~0
+		tst	r3, ip, lsr #16		@ test bits 0-15
+		addeq	r2, r2, #16
+		moveq	r3, r3, lsr #16
+		tst	r3, #0x00ff
+		addeq	r2, r2, #8
+		moveq	r3, r3, lsr #8
+		tst	r3, #0x000f
 		addeq	r2, r2, #4
-		movne	r3, r3, lsl #4
-		tst	r3, #0x30
+		moveq	r3, r3, lsr #4
+		tst	r3, #0x0003
 		addeq	r2, r2, #2
-		movne	r3, r3, lsl #2
-		tst	r3, #0x40
+		moveq	r3, r3, lsr #2
+		tst	r3, #0x0001
 		addeq	r2, r2, #1
 		mov	r0, r2
 #endif
-- 
2.30.2


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/5] ARM: findbit: add unwinder information
  2022-10-28 16:47 ` Russell King (Oracle)
@ 2022-10-28 16:48   ` Russell King (Oracle)
  -1 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:48 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, Linus Torvalds, linux-arm-kernel

Add unwinder information so oops in the findbit functions can create a
proper backtrace.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm/lib/findbit.S | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index 6ec584d16d46..b7ac2d3c0748 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -12,6 +12,7 @@
  */
 #include <linux/linkage.h>
 #include <asm/assembler.h>
+#include <asm/unwind.h>
                 .text
 
 #ifdef __ARMEB__
@@ -22,6 +23,7 @@
 
 		.macro	find_first, endian, set, name
 ENTRY(_find_first_\name\()bit_\endian)
+	UNWIND(	.fnstart)
 		teq	r1, #0
 		beq	3f
 		mov	r2, #0
@@ -41,11 +43,13 @@ ENTRY(_find_first_\name\()bit_\endian)
 		blo	1b
 3:		mov	r0, r1			@ no more bits
 		ret	lr
+	UNWIND(	.fnend)
 ENDPROC(_find_first_\name\()bit_\endian)
 		.endm
 
 		.macro	find_next, endian, set, name
 ENTRY(_find_next_\name\()bit_\endian)
+	UNWIND(	.fnstart)
 		cmp	r2, r1
 		bhs	3b
 		mov	ip, r2, lsr #5		@ word index
@@ -69,6 +73,7 @@ ENTRY(_find_next_\name\()bit_\endian)
 		orr	r2, r2, #31		@ no zero bits
 		add	r2, r2, #1		@ align bit pointer
 		b	2b			@ loop for next bit
+	UNWIND(	.fnend)
 ENDPROC(_find_next_\name\()bit_\endian)
 		.endm
 
@@ -97,6 +102,7 @@ ENDPROC(_find_next_\name\()bit_\endian)
  * One or more bits in the LSB of r3 are assumed to be set.
  */
 .L_found_swab:
+	UNWIND(	.fnstart)
 		rev_l	r3, ip
 .L_found:
 #if __LINUX_ARM_ARCH__ >= 7
@@ -130,4 +136,4 @@ ENDPROC(_find_next_\name\()bit_\endian)
 		cmp	r1, r0			@ Clamp to maxbit
 		movlo	r0, r1
 		ret	lr
-
+	UNWIND(	.fnend)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/5] ARM: findbit: add unwinder information
@ 2022-10-28 16:48   ` Russell King (Oracle)
  0 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 16:48 UTC (permalink / raw)
  To: Yury Norov
  Cc: Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, Linus Torvalds, linux-arm-kernel

Add unwinder information so oops in the findbit functions can create a
proper backtrace.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm/lib/findbit.S | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S
index 6ec584d16d46..b7ac2d3c0748 100644
--- a/arch/arm/lib/findbit.S
+++ b/arch/arm/lib/findbit.S
@@ -12,6 +12,7 @@
  */
 #include <linux/linkage.h>
 #include <asm/assembler.h>
+#include <asm/unwind.h>
                 .text
 
 #ifdef __ARMEB__
@@ -22,6 +23,7 @@
 
 		.macro	find_first, endian, set, name
 ENTRY(_find_first_\name\()bit_\endian)
+	UNWIND(	.fnstart)
 		teq	r1, #0
 		beq	3f
 		mov	r2, #0
@@ -41,11 +43,13 @@ ENTRY(_find_first_\name\()bit_\endian)
 		blo	1b
 3:		mov	r0, r1			@ no more bits
 		ret	lr
+	UNWIND(	.fnend)
 ENDPROC(_find_first_\name\()bit_\endian)
 		.endm
 
 		.macro	find_next, endian, set, name
 ENTRY(_find_next_\name\()bit_\endian)
+	UNWIND(	.fnstart)
 		cmp	r2, r1
 		bhs	3b
 		mov	ip, r2, lsr #5		@ word index
@@ -69,6 +73,7 @@ ENTRY(_find_next_\name\()bit_\endian)
 		orr	r2, r2, #31		@ no zero bits
 		add	r2, r2, #1		@ align bit pointer
 		b	2b			@ loop for next bit
+	UNWIND(	.fnend)
 ENDPROC(_find_next_\name\()bit_\endian)
 		.endm
 
@@ -97,6 +102,7 @@ ENDPROC(_find_next_\name\()bit_\endian)
  * One or more bits in the LSB of r3 are assumed to be set.
  */
 .L_found_swab:
+	UNWIND(	.fnstart)
 		rev_l	r3, ip
 .L_found:
 #if __LINUX_ARM_ARCH__ >= 7
@@ -130,4 +136,4 @@ ENDPROC(_find_next_\name\()bit_\endian)
 		cmp	r1, r0			@ Clamp to maxbit
 		movlo	r0, r1
 		ret	lr
-
+	UNWIND(	.fnend)
-- 
2.30.2


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
  2022-10-28 16:47   ` Russell King (Oracle)
@ 2022-10-28 17:05     ` Linus Torvalds
  -1 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2022-10-28 17:05 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
<rmk+kernel@armlinux.org.uk> wrote:
>
> Document the ARMv5 bit offset calculation code.

Hmm. Don't the generic bitop functions end up using this? We do have a
comment in the code that says

 * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
 * and produce optimal inlined code in all cases. On ARMv7 it is even
 * better by also using the rbit instruction.

but that 'may' makes me wonder...

IOW, what is it in the hand-written code that doesn't get done by the
generic code these days?

                     Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
@ 2022-10-28 17:05     ` Linus Torvalds
  0 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2022-10-28 17:05 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
<rmk+kernel@armlinux.org.uk> wrote:
>
> Document the ARMv5 bit offset calculation code.

Hmm. Don't the generic bitop functions end up using this? We do have a
comment in the code that says

 * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
 * and produce optimal inlined code in all cases. On ARMv7 it is even
 * better by also using the rbit instruction.

but that 'may' makes me wonder...

IOW, what is it in the hand-written code that doesn't get done by the
generic code these days?

                     Linus

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
  2022-10-28 17:05     ` Linus Torvalds
@ 2022-10-28 17:45       ` Russell King (Oracle)
  -1 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 17:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote:
> On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
> <rmk+kernel@armlinux.org.uk> wrote:
> >
> > Document the ARMv5 bit offset calculation code.
> 
> Hmm. Don't the generic bitop functions end up using this? We do have a
> comment in the code that says
> 
>  * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
>  * and produce optimal inlined code in all cases. On ARMv7 it is even
>  * better by also using the rbit instruction.

It's true that the generic code also makes use of the rbit and clz
instructions - but in terms of the speed of the functions these only
get used once we've found a word that is interesting to locate the
bit we want in.

> but that 'may' makes me wonder...
> 
> IOW, what is it in the hand-written code that doesn't get done by the
> generic code these days?

For the _find_first_bit, there isn't much difference in the number
of instructions or really what is going on, only the organisation
and flow of the code is more inline - but that shouldn't make much
of a difference. Yet, there is a definite repeatable measurable
difference between the two:

random-filled:
arm    : find_first_bit:               17778911 ns,  16448 iterations
generic: find_first_bit:               18596622 ns,  16401 iterations

sparse:
arm    : find_first_bit:                7301363 ns,    656 iterations
generic: find_first_bit:                7589120 ns,    655 iterations

The bigger difference is in the find_next_bit operations, and this
likely comes from the arm32 code not having the hassles of the "_and"
and other conditionals that the generic code has:

random-filled:
arm    : find_next_bit:                 2242618 ns, 163949 iterations
generic: find_next_bit:                 2632859 ns, 163743 iterations

sparse:
arm    : find_next_bit:                   40078 ns,    656 iterations
generic: find_next_bit:                   69943 ns,    655 iterations

find_next_zero_bit show a greater difference:

random-filled:
arm    : find_next_zero_bit:            2049129 ns, 163732 iterations
generic: find_next_zero_bit:            2844221 ns, 163938 iterations

sparse:
arm    : find_next_zero_bit:            3939309 ns, 327025 iterations
generic: find_next_zero_bit:            5529553 ns, 327026 iterations

Here's the disassemblies for comparison. Note that the arm versions
share code paths between the functions which makes the code even more
compact - so the loop in the find_first gets re-used for find_next
after we check the current word.

generic:

00000000 <_find_first_bit>:
   0:   e1a02000        mov     r2, r0
   4:   e2510000        subs    r0, r1, #0
   8:   012fff1e        bxeq    lr
   c:   e2422004        sub     r2, r2, #4
  10:   e3a03000        mov     r3, #0
  14:   ea000002        b       24 <_find_first_bit+0x24>
  18:   e2833020        add     r3, r3, #32
  1c:   e1500003        cmp     r0, r3
  20:   912fff1e        bxls    lr
  24:   e5b2c004        ldr     ip, [r2, #4]!
  28:   e35c0000        cmp     ip, #0
  2c:   0afffff9        beq     18 <_find_first_bit+0x18>
  30:   e6ffcf3c        rbit    ip, ip
  34:   e16fcf1c        clz     ip, ip
  38:   e08c3003        add     r3, ip, r3
  3c:   e1500003        cmp     r0, r3
  40:   21a00003        movcs   r0, r3
  44:   e12fff1e        bx      lr

000000e8 <_find_next_bit>:
  e8:   e92d4070        push    {r4, r5, r6, lr}
  ec:   e1530002        cmp     r3, r2
  f0:   e59d4010        ldr     r4, [sp, #16]
  f4:   e59d5014        ldr     r5, [sp, #20]
  f8:   2a000024        bcs     190 <_find_next_bit+0xa8>
  fc:   e1a0e2a3        lsr     lr, r3, #5
 100:   e3510000        cmp     r1, #0
 104:   e203601f        and     r6, r3, #31
 108:   e3c3301f        bic     r3, r3, #31
 10c:   e790c10e        ldr     ip, [r0, lr, lsl #2]
 110:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
 114:   100cc00e        andne   ip, ip, lr
 118:   e3e0e000        mvn     lr, #0
 11c:   e02cc004        eor     ip, ip, r4
 120:   e3550000        cmp     r5, #0
 124:   e1a0e61e        lsl     lr, lr, r6
 128:   1a00001a        bne     198 <_find_next_bit+0xb0>
 12c:   e01cc00e        ands    ip, ip, lr
 130:   1a000011        bne     17c <_find_next_bit+0x94>
 134:   e2833020        add     r3, r3, #32
 138:   e1530002        cmp     r3, r2
 13c:   3a000003        bcc     150 <_find_next_bit+0x68>
 140:   ea000012        b       190 <_find_next_bit+0xa8>
 144:   e2833020        add     r3, r3, #32
 148:   e1520003        cmp     r2, r3
 14c:   9a00000f        bls     190 <_find_next_bit+0xa8>
 150:   e1a0e2a3        lsr     lr, r3, #5
 154:   e3510000        cmp     r1, #0
 158:   e790c10e        ldr     ip, [r0, lr, lsl #2]
 15c:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
 160:   100cc00e        andne   ip, ip, lr
 164:   e15c0004        cmp     ip, r4
 168:   0afffff5        beq     144 <_find_next_bit+0x5c>
 16c:   e02cc004        eor     ip, ip, r4
 170:   e3550000        cmp     r5, #0
 174:   0a000000        beq     17c <_find_next_bit+0x94>
 178:   e6bfcf3c        rev     ip, ip
 17c:   e6ffcf3c        rbit    ip, ip
 180:   e16fcf1c        clz     ip, ip
 184:   e08c3003        add     r3, ip, r3
 188:   e1520003        cmp     r2, r3
 18c:   21a02003        movcs   r2, r3
 190:   e1a00002        mov     r0, r2
 194:   e8bd8070        pop     {r4, r5, r6, pc}
 198:   e6bfef3e        rev     lr, lr
 19c:   e01cc00e        ands    ip, ip, lr
 1a0:   0affffe3        beq     134 <_find_next_bit+0x4c>
 1a4:   eafffff3        b       178 <_find_next_bit+0x90>

==========
arm optimised:

00000060 <_find_first_bit_le>:
  60:   e3310000        teq     r1, #0
  64:   0a000006        beq     84 <_find_first_bit_le+0x24>
  68:   e3a02000        mov     r2, #0
  6c:   e4903004        ldr     r3, [r0], #4
  70:   e1b03003        movs    r3, r3
  74:   1a000011        bne     c0 <_find_next_bit_le+0x34>
  78:   e2822020        add     r2, r2, #32
  7c:   e1520001        cmp     r2, r1
  80:   3afffff9        bcc     6c <_find_first_bit_le+0xc>
  84:   e1a00001        mov     r0, r1
  88:   e12fff1e        bx      lr

0000008c <_find_next_bit_le>:
  8c:   e1520001        cmp     r2, r1
  90:   2afffffb        bcs     84 <_find_first_bit_le+0x24>
  94:   e1a0c2a2        lsr     ip, r2, #5
  98:   e080010c        add     r0, r0, ip, lsl #2
  9c:   e212c01f        ands    ip, r2, #31
  a0:   0afffff1        beq     6c <_find_first_bit_le+0xc>
  a4:   e4903004        ldr     r3, [r0], #4
  a8:   e1b03c33        lsrs    r3, r3, ip
  ac:   1a000003        bne     c0 <_find_next_bit_le+0x34>
  b0:   e382201f        orr     r2, r2, #31
  b4:   e2822001        add     r2, r2, #1
  b8:   eaffffef        b       7c <_find_first_bit_le+0x1c>
  bc:   e6bf3f33        rev     r3, r3
  c0:   e6ff3f33        rbit    r3, r3
  c4:   e16f3f13        clz     r3, r3
  c8:   e0820003        add     r0, r2, r3
  cc:   e1510000        cmp     r1, r0
  d0:   31a00001        movcc   r0, r1
  d4:   e12fff1e        bx      lr

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
@ 2022-10-28 17:45       ` Russell King (Oracle)
  0 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 17:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote:
> On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
> <rmk+kernel@armlinux.org.uk> wrote:
> >
> > Document the ARMv5 bit offset calculation code.
> 
> Hmm. Don't the generic bitop functions end up using this? We do have a
> comment in the code that says
> 
>  * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
>  * and produce optimal inlined code in all cases. On ARMv7 it is even
>  * better by also using the rbit instruction.

It's true that the generic code also makes use of the rbit and clz
instructions - but in terms of the speed of the functions these only
get used once we've found a word that is interesting to locate the
bit we want in.

> but that 'may' makes me wonder...
> 
> IOW, what is it in the hand-written code that doesn't get done by the
> generic code these days?

For the _find_first_bit, there isn't much difference in the number
of instructions or really what is going on, only the organisation
and flow of the code is more inline - but that shouldn't make much
of a difference. Yet, there is a definite repeatable measurable
difference between the two:

random-filled:
arm    : find_first_bit:               17778911 ns,  16448 iterations
generic: find_first_bit:               18596622 ns,  16401 iterations

sparse:
arm    : find_first_bit:                7301363 ns,    656 iterations
generic: find_first_bit:                7589120 ns,    655 iterations

The bigger difference is in the find_next_bit operations, and this
likely comes from the arm32 code not having the hassles of the "_and"
and other conditionals that the generic code has:

random-filled:
arm    : find_next_bit:                 2242618 ns, 163949 iterations
generic: find_next_bit:                 2632859 ns, 163743 iterations

sparse:
arm    : find_next_bit:                   40078 ns,    656 iterations
generic: find_next_bit:                   69943 ns,    655 iterations

find_next_zero_bit show a greater difference:

random-filled:
arm    : find_next_zero_bit:            2049129 ns, 163732 iterations
generic: find_next_zero_bit:            2844221 ns, 163938 iterations

sparse:
arm    : find_next_zero_bit:            3939309 ns, 327025 iterations
generic: find_next_zero_bit:            5529553 ns, 327026 iterations

Here's the disassemblies for comparison. Note that the arm versions
share code paths between the functions which makes the code even more
compact - so the loop in the find_first gets re-used for find_next
after we check the current word.

generic:

00000000 <_find_first_bit>:
   0:   e1a02000        mov     r2, r0
   4:   e2510000        subs    r0, r1, #0
   8:   012fff1e        bxeq    lr
   c:   e2422004        sub     r2, r2, #4
  10:   e3a03000        mov     r3, #0
  14:   ea000002        b       24 <_find_first_bit+0x24>
  18:   e2833020        add     r3, r3, #32
  1c:   e1500003        cmp     r0, r3
  20:   912fff1e        bxls    lr
  24:   e5b2c004        ldr     ip, [r2, #4]!
  28:   e35c0000        cmp     ip, #0
  2c:   0afffff9        beq     18 <_find_first_bit+0x18>
  30:   e6ffcf3c        rbit    ip, ip
  34:   e16fcf1c        clz     ip, ip
  38:   e08c3003        add     r3, ip, r3
  3c:   e1500003        cmp     r0, r3
  40:   21a00003        movcs   r0, r3
  44:   e12fff1e        bx      lr

000000e8 <_find_next_bit>:
  e8:   e92d4070        push    {r4, r5, r6, lr}
  ec:   e1530002        cmp     r3, r2
  f0:   e59d4010        ldr     r4, [sp, #16]
  f4:   e59d5014        ldr     r5, [sp, #20]
  f8:   2a000024        bcs     190 <_find_next_bit+0xa8>
  fc:   e1a0e2a3        lsr     lr, r3, #5
 100:   e3510000        cmp     r1, #0
 104:   e203601f        and     r6, r3, #31
 108:   e3c3301f        bic     r3, r3, #31
 10c:   e790c10e        ldr     ip, [r0, lr, lsl #2]
 110:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
 114:   100cc00e        andne   ip, ip, lr
 118:   e3e0e000        mvn     lr, #0
 11c:   e02cc004        eor     ip, ip, r4
 120:   e3550000        cmp     r5, #0
 124:   e1a0e61e        lsl     lr, lr, r6
 128:   1a00001a        bne     198 <_find_next_bit+0xb0>
 12c:   e01cc00e        ands    ip, ip, lr
 130:   1a000011        bne     17c <_find_next_bit+0x94>
 134:   e2833020        add     r3, r3, #32
 138:   e1530002        cmp     r3, r2
 13c:   3a000003        bcc     150 <_find_next_bit+0x68>
 140:   ea000012        b       190 <_find_next_bit+0xa8>
 144:   e2833020        add     r3, r3, #32
 148:   e1520003        cmp     r2, r3
 14c:   9a00000f        bls     190 <_find_next_bit+0xa8>
 150:   e1a0e2a3        lsr     lr, r3, #5
 154:   e3510000        cmp     r1, #0
 158:   e790c10e        ldr     ip, [r0, lr, lsl #2]
 15c:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
 160:   100cc00e        andne   ip, ip, lr
 164:   e15c0004        cmp     ip, r4
 168:   0afffff5        beq     144 <_find_next_bit+0x5c>
 16c:   e02cc004        eor     ip, ip, r4
 170:   e3550000        cmp     r5, #0
 174:   0a000000        beq     17c <_find_next_bit+0x94>
 178:   e6bfcf3c        rev     ip, ip
 17c:   e6ffcf3c        rbit    ip, ip
 180:   e16fcf1c        clz     ip, ip
 184:   e08c3003        add     r3, ip, r3
 188:   e1520003        cmp     r2, r3
 18c:   21a02003        movcs   r2, r3
 190:   e1a00002        mov     r0, r2
 194:   e8bd8070        pop     {r4, r5, r6, pc}
 198:   e6bfef3e        rev     lr, lr
 19c:   e01cc00e        ands    ip, ip, lr
 1a0:   0affffe3        beq     134 <_find_next_bit+0x4c>
 1a4:   eafffff3        b       178 <_find_next_bit+0x90>

==========
arm optimised:

00000060 <_find_first_bit_le>:
  60:   e3310000        teq     r1, #0
  64:   0a000006        beq     84 <_find_first_bit_le+0x24>
  68:   e3a02000        mov     r2, #0
  6c:   e4903004        ldr     r3, [r0], #4
  70:   e1b03003        movs    r3, r3
  74:   1a000011        bne     c0 <_find_next_bit_le+0x34>
  78:   e2822020        add     r2, r2, #32
  7c:   e1520001        cmp     r2, r1
  80:   3afffff9        bcc     6c <_find_first_bit_le+0xc>
  84:   e1a00001        mov     r0, r1
  88:   e12fff1e        bx      lr

0000008c <_find_next_bit_le>:
  8c:   e1520001        cmp     r2, r1
  90:   2afffffb        bcs     84 <_find_first_bit_le+0x24>
  94:   e1a0c2a2        lsr     ip, r2, #5
  98:   e080010c        add     r0, r0, ip, lsl #2
  9c:   e212c01f        ands    ip, r2, #31
  a0:   0afffff1        beq     6c <_find_first_bit_le+0xc>
  a4:   e4903004        ldr     r3, [r0], #4
  a8:   e1b03c33        lsrs    r3, r3, ip
  ac:   1a000003        bne     c0 <_find_next_bit_le+0x34>
  b0:   e382201f        orr     r2, r2, #31
  b4:   e2822001        add     r2, r2, #1
  b8:   eaffffef        b       7c <_find_first_bit_le+0x1c>
  bc:   e6bf3f33        rev     r3, r3
  c0:   e6ff3f33        rbit    r3, r3
  c4:   e16f3f13        clz     r3, r3
  c8:   e0820003        add     r0, r2, r3
  cc:   e1510000        cmp     r1, r0
  d0:   31a00001        movcc   r0, r1
  d4:   e12fff1e        bx      lr

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
  2022-10-28 17:45       ` Russell King (Oracle)
@ 2022-10-28 18:37         ` Yury Norov
  -1 siblings, 0 replies; 28+ messages in thread
From: Yury Norov @ 2022-10-28 18:37 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Linus Torvalds, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel, Alexey Klimov

+ Alexey Klimov

On Fri, Oct 28, 2022 at 06:45:50PM +0100, Russell King (Oracle) wrote:
> On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote:
> > On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
> > <rmk+kernel@armlinux.org.uk> wrote:
> > >
> > > Document the ARMv5 bit offset calculation code.
> > 
> > Hmm. Don't the generic bitop functions end up using this? We do have a
> > comment in the code that says
> > 
> >  * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
> >  * and produce optimal inlined code in all cases. On ARMv7 it is even
> >  * better by also using the rbit instruction.
> 
> It's true that the generic code also makes use of the rbit and clz
> instructions - but in terms of the speed of the functions these only
> get used once we've found a word that is interesting to locate the
> bit we want in.
> 
> > but that 'may' makes me wonder...
> > 
> > IOW, what is it in the hand-written code that doesn't get done by the
> > generic code these days?
> 
> For the _find_first_bit, there isn't much difference in the number
> of instructions or really what is going on, only the organisation
> and flow of the code is more inline - but that shouldn't make much
> of a difference. Yet, there is a definite repeatable measurable
> difference between the two:
> 
> random-filled:
> arm    : find_first_bit:               17778911 ns,  16448 iterations
> generic: find_first_bit:               18596622 ns,  16401 iterations
> 
> sparse:
> arm    : find_first_bit:                7301363 ns,    656 iterations
> generic: find_first_bit:                7589120 ns,    655 iterations
> 
> The bigger difference is in the find_next_bit operations, and this
> likely comes from the arm32 code not having the hassles of the "_and"
> and other conditionals that the generic code has:
> 
> random-filled:
> arm    : find_next_bit:                 2242618 ns, 163949 iterations
> generic: find_next_bit:                 2632859 ns, 163743 iterations
> 
> sparse:
> arm    : find_next_bit:                   40078 ns,    656 iterations
> generic: find_next_bit:                   69943 ns,    655 iterations
> 
> find_next_zero_bit show a greater difference:
> 
> random-filled:
> arm    : find_next_zero_bit:            2049129 ns, 163732 iterations
> generic: find_next_zero_bit:            2844221 ns, 163938 iterations
> 
> sparse:
> arm    : find_next_zero_bit:            3939309 ns, 327025 iterations
> generic: find_next_zero_bit:            5529553 ns, 327026 iterations

Those numbers disagree with what Alexey has measured on Odroid board
for A15 but somewhat in line with what he had for A7:

https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/

Can you please share details about your methodology: what CPU did you use;
did you lock cpu frequencies; in addition to mean, can you also show
standard deviation, raw logs?

To Alexey: if you have time, can you please repeat those measurements
on top of v6 + this series? This would help a lot in understanding how
new code performs, both generic and arch.

If generic vs arch code comparison looks different for different CPU
versions, what should we prefer? 
 
> Here's the disassemblies for comparison. Note that the arm versions
> share code paths between the functions which makes the code even more
> compact - so the loop in the find_first gets re-used for find_next
> after we check the current word.
> 
> generic:

[...]
 
> 000000e8 <_find_next_bit>:
>   e8:   e92d4070        push    {r4, r5, r6, lr}
>   ec:   e1530002        cmp     r3, r2
>   f0:   e59d4010        ldr     r4, [sp, #16]
>   f4:   e59d5014        ldr     r5, [sp, #20]
>   f8:   2a000024        bcs     190 <_find_next_bit+0xa8>
>   fc:   e1a0e2a3        lsr     lr, r3, #5
>  100:   e3510000        cmp     r1, #0
>  104:   e203601f        and     r6, r3, #31
>  108:   e3c3301f        bic     r3, r3, #31
>  10c:   e790c10e        ldr     ip, [r0, lr, lsl #2]
>  110:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
>  114:   100cc00e        andne   ip, ip, lr
>  118:   e3e0e000        mvn     lr, #0
>  11c:   e02cc004        eor     ip, ip, r4
>  120:   e3550000        cmp     r5, #0
>  124:   e1a0e61e        lsl     lr, lr, r6
>  128:   1a00001a        bne     198 <_find_next_bit+0xb0>
>  12c:   e01cc00e        ands    ip, ip, lr
>  130:   1a000011        bne     17c <_find_next_bit+0x94>
>  134:   e2833020        add     r3, r3, #32
>  138:   e1530002        cmp     r3, r2
>  13c:   3a000003        bcc     150 <_find_next_bit+0x68>
>  140:   ea000012        b       190 <_find_next_bit+0xa8>
>  144:   e2833020        add     r3, r3, #32
>  148:   e1520003        cmp     r2, r3
>  14c:   9a00000f        bls     190 <_find_next_bit+0xa8>
>  150:   e1a0e2a3        lsr     lr, r3, #5
>  154:   e3510000        cmp     r1, #0
>  158:   e790c10e        ldr     ip, [r0, lr, lsl #2]
>  15c:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
>  160:   100cc00e        andne   ip, ip, lr
>  164:   e15c0004        cmp     ip, r4
>  168:   0afffff5        beq     144 <_find_next_bit+0x5c>
>  16c:   e02cc004        eor     ip, ip, r4
>  170:   e3550000        cmp     r5, #0
>  174:   0a000000        beq     17c <_find_next_bit+0x94>
>  178:   e6bfcf3c        rev     ip, ip
>  17c:   e6ffcf3c        rbit    ip, ip
>  180:   e16fcf1c        clz     ip, ip
>  184:   e08c3003        add     r3, ip, r3
>  188:   e1520003        cmp     r2, r3
>  18c:   21a02003        movcs   r2, r3
>  190:   e1a00002        mov     r0, r2
>  194:   e8bd8070        pop     {r4, r5, r6, pc}
>  198:   e6bfef3e        rev     lr, lr
>  19c:   e01cc00e        ands    ip, ip, lr
>  1a0:   0affffe3        beq     134 <_find_next_bit+0x4c>
>  1a4:   eafffff3        b       178 <_find_next_bit+0x90>

On top of master, my generic _find_next_bit looks different:

000000e4 <_find_next_bit>:
  e4:   e1510002        cmp     r1, r2
  e8:   e1a0c000        mov     ip, r0
  ec:   e1a00001        mov     r0, r1
  f0:   912fff1e        bxls    lr
  f4:   e1a012a2        lsr     r1, r2, #5
  f8:   e92d4010        push    {r4, lr}
  fc:   e202201f        and     r2, r2, #31
 100:   e3e03000        mvn     r3, #0
 104:   e79ce101        ldr     lr, [ip, r1, lsl #2]
 108:   e01ee213        ands    lr, lr, r3, lsl r2
 10c:   1a000012        bne     15c <_find_next_bit+0x78>
 110:   e2813001        add     r3, r1, #1
 114:   e1a04283        lsl     r4, r3, #5
 118:   e1540000        cmp     r4, r0
 11c:   28bd8010        popcs   {r4, pc}
 120:   e08c2101        add     r2, ip, r1, lsl #2
 124:   ea000002        b       134 <_find_next_bit+0x50>
 128:   e1a04283        lsl     r4, r3, #5
 12c:   e1500004        cmp     r0, r4
 130:   98bd8010        popls   {r4, pc}
 134:   e5b2e004        ldr     lr, [r2, #4]!
 138:   e2833001        add     r3, r3, #1
 13c:   e35e0000        cmp     lr, #0
 140:   0afffff8        beq     128 <_find_next_bit+0x44>
 144:   e6ffef3e        rbit    lr, lr
 148:   e16fef1e        clz     lr, lr
 14c:   e084400e        add     r4, r4, lr
 150:   e1500004        cmp     r0, r4
 154:   21a00004        movcs   r0, r4
 158:   e8bd8010        pop     {r4, pc}
 15c:   e1a04281        lsl     r4, r1, #5
 160:   eafffff7        b       144 <_find_next_bit+0x60>

Are you sure you're running latest kernel?

Thanks,
Yury

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
@ 2022-10-28 18:37         ` Yury Norov
  0 siblings, 0 replies; 28+ messages in thread
From: Yury Norov @ 2022-10-28 18:37 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Linus Torvalds, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel, Alexey Klimov

+ Alexey Klimov

On Fri, Oct 28, 2022 at 06:45:50PM +0100, Russell King (Oracle) wrote:
> On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote:
> > On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
> > <rmk+kernel@armlinux.org.uk> wrote:
> > >
> > > Document the ARMv5 bit offset calculation code.
> > 
> > Hmm. Don't the generic bitop functions end up using this? We do have a
> > comment in the code that says
> > 
> >  * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
> >  * and produce optimal inlined code in all cases. On ARMv7 it is even
> >  * better by also using the rbit instruction.
> 
> It's true that the generic code also makes use of the rbit and clz
> instructions - but in terms of the speed of the functions these only
> get used once we've found a word that is interesting to locate the
> bit we want in.
> 
> > but that 'may' makes me wonder...
> > 
> > IOW, what is it in the hand-written code that doesn't get done by the
> > generic code these days?
> 
> For the _find_first_bit, there isn't much difference in the number
> of instructions or really what is going on, only the organisation
> and flow of the code is more inline - but that shouldn't make much
> of a difference. Yet, there is a definite repeatable measurable
> difference between the two:
> 
> random-filled:
> arm    : find_first_bit:               17778911 ns,  16448 iterations
> generic: find_first_bit:               18596622 ns,  16401 iterations
> 
> sparse:
> arm    : find_first_bit:                7301363 ns,    656 iterations
> generic: find_first_bit:                7589120 ns,    655 iterations
> 
> The bigger difference is in the find_next_bit operations, and this
> likely comes from the arm32 code not having the hassles of the "_and"
> and other conditionals that the generic code has:
> 
> random-filled:
> arm    : find_next_bit:                 2242618 ns, 163949 iterations
> generic: find_next_bit:                 2632859 ns, 163743 iterations
> 
> sparse:
> arm    : find_next_bit:                   40078 ns,    656 iterations
> generic: find_next_bit:                   69943 ns,    655 iterations
> 
> find_next_zero_bit show a greater difference:
> 
> random-filled:
> arm    : find_next_zero_bit:            2049129 ns, 163732 iterations
> generic: find_next_zero_bit:            2844221 ns, 163938 iterations
> 
> sparse:
> arm    : find_next_zero_bit:            3939309 ns, 327025 iterations
> generic: find_next_zero_bit:            5529553 ns, 327026 iterations

Those numbers disagree with what Alexey has measured on Odroid board
for A15 but somewhat in line with what he had for A7:

https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/

Can you please share details about your methodology: what CPU did you use;
did you lock cpu frequencies; in addition to mean, can you also show
standard deviation, raw logs?

To Alexey: if you have time, can you please repeat those measurements
on top of v6 + this series? This would help a lot in understanding how
new code performs, both generic and arch.

If generic vs arch code comparison looks different for different CPU
versions, what should we prefer? 
 
> Here's the disassemblies for comparison. Note that the arm versions
> share code paths between the functions which makes the code even more
> compact - so the loop in the find_first gets re-used for find_next
> after we check the current word.
> 
> generic:

[...]
 
> 000000e8 <_find_next_bit>:
>   e8:   e92d4070        push    {r4, r5, r6, lr}
>   ec:   e1530002        cmp     r3, r2
>   f0:   e59d4010        ldr     r4, [sp, #16]
>   f4:   e59d5014        ldr     r5, [sp, #20]
>   f8:   2a000024        bcs     190 <_find_next_bit+0xa8>
>   fc:   e1a0e2a3        lsr     lr, r3, #5
>  100:   e3510000        cmp     r1, #0
>  104:   e203601f        and     r6, r3, #31
>  108:   e3c3301f        bic     r3, r3, #31
>  10c:   e790c10e        ldr     ip, [r0, lr, lsl #2]
>  110:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
>  114:   100cc00e        andne   ip, ip, lr
>  118:   e3e0e000        mvn     lr, #0
>  11c:   e02cc004        eor     ip, ip, r4
>  120:   e3550000        cmp     r5, #0
>  124:   e1a0e61e        lsl     lr, lr, r6
>  128:   1a00001a        bne     198 <_find_next_bit+0xb0>
>  12c:   e01cc00e        ands    ip, ip, lr
>  130:   1a000011        bne     17c <_find_next_bit+0x94>
>  134:   e2833020        add     r3, r3, #32
>  138:   e1530002        cmp     r3, r2
>  13c:   3a000003        bcc     150 <_find_next_bit+0x68>
>  140:   ea000012        b       190 <_find_next_bit+0xa8>
>  144:   e2833020        add     r3, r3, #32
>  148:   e1520003        cmp     r2, r3
>  14c:   9a00000f        bls     190 <_find_next_bit+0xa8>
>  150:   e1a0e2a3        lsr     lr, r3, #5
>  154:   e3510000        cmp     r1, #0
>  158:   e790c10e        ldr     ip, [r0, lr, lsl #2]
>  15c:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
>  160:   100cc00e        andne   ip, ip, lr
>  164:   e15c0004        cmp     ip, r4
>  168:   0afffff5        beq     144 <_find_next_bit+0x5c>
>  16c:   e02cc004        eor     ip, ip, r4
>  170:   e3550000        cmp     r5, #0
>  174:   0a000000        beq     17c <_find_next_bit+0x94>
>  178:   e6bfcf3c        rev     ip, ip
>  17c:   e6ffcf3c        rbit    ip, ip
>  180:   e16fcf1c        clz     ip, ip
>  184:   e08c3003        add     r3, ip, r3
>  188:   e1520003        cmp     r2, r3
>  18c:   21a02003        movcs   r2, r3
>  190:   e1a00002        mov     r0, r2
>  194:   e8bd8070        pop     {r4, r5, r6, pc}
>  198:   e6bfef3e        rev     lr, lr
>  19c:   e01cc00e        ands    ip, ip, lr
>  1a0:   0affffe3        beq     134 <_find_next_bit+0x4c>
>  1a4:   eafffff3        b       178 <_find_next_bit+0x90>

On top of master, my generic _find_next_bit looks different:

000000e4 <_find_next_bit>:
  e4:   e1510002        cmp     r1, r2
  e8:   e1a0c000        mov     ip, r0
  ec:   e1a00001        mov     r0, r1
  f0:   912fff1e        bxls    lr
  f4:   e1a012a2        lsr     r1, r2, #5
  f8:   e92d4010        push    {r4, lr}
  fc:   e202201f        and     r2, r2, #31
 100:   e3e03000        mvn     r3, #0
 104:   e79ce101        ldr     lr, [ip, r1, lsl #2]
 108:   e01ee213        ands    lr, lr, r3, lsl r2
 10c:   1a000012        bne     15c <_find_next_bit+0x78>
 110:   e2813001        add     r3, r1, #1
 114:   e1a04283        lsl     r4, r3, #5
 118:   e1540000        cmp     r4, r0
 11c:   28bd8010        popcs   {r4, pc}
 120:   e08c2101        add     r2, ip, r1, lsl #2
 124:   ea000002        b       134 <_find_next_bit+0x50>
 128:   e1a04283        lsl     r4, r3, #5
 12c:   e1500004        cmp     r0, r4
 130:   98bd8010        popls   {r4, pc}
 134:   e5b2e004        ldr     lr, [r2, #4]!
 138:   e2833001        add     r3, r3, #1
 13c:   e35e0000        cmp     lr, #0
 140:   0afffff8        beq     128 <_find_next_bit+0x44>
 144:   e6ffef3e        rbit    lr, lr
 148:   e16fef1e        clz     lr, lr
 14c:   e084400e        add     r4, r4, lr
 150:   e1500004        cmp     r0, r4
 154:   21a00004        movcs   r0, r4
 158:   e8bd8010        pop     {r4, pc}
 15c:   e1a04281        lsl     r4, r1, #5
 160:   eafffff7        b       144 <_find_next_bit+0x60>

Are you sure you're running latest kernel?

Thanks,
Yury

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
  2022-10-28 17:45       ` Russell King (Oracle)
@ 2022-10-28 19:01         ` Linus Torvalds
  -1 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2022-10-28 19:01 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 10:45 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> For the _find_first_bit, there isn't much difference in the number
> of instructions or really what is going on, only the organisation
> and flow of the code is more inline - but that shouldn't make much
> of a difference. Yet, there is a definite repeatable measurable
> difference between the two:

Hmm. Interestingly, your _find_first_zero_bit_le() (which
find_next_bit ends up using except for the first byte) ends up doing
an optimization that is technically not valid.

In particular, the *generic* code does

                        sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(val)), sz);

for the final result.

In contrast, the arm code doesn't do the "min()" at all, and if there
are bits after the bitmap (in a partial byte), it will just return
those bits.

So the arm code ends up avoiding some operations. Which works most of
the time, because

 (a) usually bitmaps are byte-aligned anyway

 (b) most users do "find_first_bit(.., size) >= size" as the "found no
bits" test

but it actually looks to me like your handcoded arm code is simply
wrong. At least going by our docbook comments for find_first_bit:

 * Returns the bit number of the first set bit.
 * If no bits are set, returns @size.

And look here: bitmap_empty() does

        return find_first_bit(src, nbits) == nbits;

and now imagine that 'nbits' is not a small constant value (which we
handle separately) and is also not byte aligned.

Maybe I'm mis-reading your assembly (I "know" arm assembly, but I
can't read it fluently like x86). But I don't think so.

So I think your code is actually buggy, but probably the bug is quite
hard to trigger in practice due to (a)/(b) above.

We do have bitmaps that aren't byte-aligned. The cpumask ones are the
most common ones. But in the cpumask_first() implementation (which is
just a wrapper for find_first_bit()), our documentation actually says

 * Returns >= nr_cpu_ids if no cpus set.

and I think that may have been what we historically did elsewhere too,
and may be the source of the arm situation.

Anyway, this can be fixed by either

 (a) fixing the arm code

 (b) changing the docs and making that ">= size" be the right thing to do

but I do think this is a very strong example of why having
architecture-specific code like this is very very dangerous. Because
as it stands now, that arm code really looks like it's actively buggy
to me.

         Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
@ 2022-10-28 19:01         ` Linus Torvalds
  0 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2022-10-28 19:01 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 10:45 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> For the _find_first_bit, there isn't much difference in the number
> of instructions or really what is going on, only the organisation
> and flow of the code is more inline - but that shouldn't make much
> of a difference. Yet, there is a definite repeatable measurable
> difference between the two:

Hmm. Interestingly, your _find_first_zero_bit_le() (which
find_next_bit ends up using except for the first byte) ends up doing
an optimization that is technically not valid.

In particular, the *generic* code does

                        sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(val)), sz);

for the final result.

In contrast, the arm code doesn't do the "min()" at all, and if there
are bits after the bitmap (in a partial byte), it will just return
those bits.

So the arm code ends up avoiding some operations. Which works most of
the time, because

 (a) usually bitmaps are byte-aligned anyway

 (b) most users do "find_first_bit(.., size) >= size" as the "found no
bits" test

but it actually looks to me like your handcoded arm code is simply
wrong. At least going by our docbook comments for find_first_bit:

 * Returns the bit number of the first set bit.
 * If no bits are set, returns @size.

And look here: bitmap_empty() does

        return find_first_bit(src, nbits) == nbits;

and now imagine that 'nbits' is not a small constant value (which we
handle separately) and is also not byte aligned.

Maybe I'm mis-reading your assembly (I "know" arm assembly, but I
can't read it fluently like x86). But I don't think so.

So I think your code is actually buggy, but probably the bug is quite
hard to trigger in practice due to (a)/(b) above.

We do have bitmaps that aren't byte-aligned. The cpumask ones are the
most common ones. But in the cpumask_first() implementation (which is
just a wrapper for find_first_bit()), our documentation actually says

 * Returns >= nr_cpu_ids if no cpus set.

and I think that may have been what we historically did elsewhere too,
and may be the source of the arm situation.

Anyway, this can be fixed by either

 (a) fixing the arm code

 (b) changing the docs and making that ">= size" be the right thing to do

but I do think this is a very strong example of why having
architecture-specific code like this is very very dangerous. Because
as it stands now, that arm code really looks like it's actively buggy
to me.

         Linus

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
  2022-10-28 19:01         ` Linus Torvalds
@ 2022-10-28 19:10           ` Linus Torvalds
  -1 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2022-10-28 19:10 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 12:01 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> In contrast, the arm code doesn't do the "min()" at all, and if there
> are bits after the bitmap (in a partial byte), it will just return
> those bits.

No, I did misread the code. It returns 'size' for the no bits case,
and the 'movlo r0, r1'; does the right thing for the "bits past the
end" case too.

I guess I need to get better at reading arm asm.

                Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
@ 2022-10-28 19:10           ` Linus Torvalds
  0 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2022-10-28 19:10 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 12:01 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> In contrast, the arm code doesn't do the "min()" at all, and if there
> are bits after the bitmap (in a partial byte), it will just return
> those bits.

No, I did misread the code. It returns 'size' for the no bits case,
and the 'movlo r0, r1'; does the right thing for the "bits past the
end" case too.

I guess I need to get better at reading arm asm.

                Linus

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
  2022-10-28 18:37         ` Yury Norov
@ 2022-10-28 19:42           ` Russell King (Oracle)
  -1 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 19:42 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel, Alexey Klimov

On Fri, Oct 28, 2022 at 11:37:44AM -0700, Yury Norov wrote:
> + Alexey Klimov
> 
> On Fri, Oct 28, 2022 at 06:45:50PM +0100, Russell King (Oracle) wrote:
> > On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote:
> > > On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
> > > <rmk+kernel@armlinux.org.uk> wrote:
> > > >
> > > > Document the ARMv5 bit offset calculation code.
> > > 
> > > Hmm. Don't the generic bitop functions end up using this? We do have a
> > > comment in the code that says
> > > 
> > >  * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
> > >  * and produce optimal inlined code in all cases. On ARMv7 it is even
> > >  * better by also using the rbit instruction.
> > 
> > It's true that the generic code also makes use of the rbit and clz
> > instructions - but in terms of the speed of the functions these only
> > get used once we've found a word that is interesting to locate the
> > bit we want in.
> > 
> > > but that 'may' makes me wonder...
> > > 
> > > IOW, what is it in the hand-written code that doesn't get done by the
> > > generic code these days?
> > 
> > For the _find_first_bit, there isn't much difference in the number
> > of instructions or really what is going on, only the organisation
> > and flow of the code is more inline - but that shouldn't make much
> > of a difference. Yet, there is a definite repeatable measurable
> > difference between the two:
> > 
> > random-filled:
> > arm    : find_first_bit:               17778911 ns,  16448 iterations
> > generic: find_first_bit:               18596622 ns,  16401 iterations
> > 
> > sparse:
> > arm    : find_first_bit:                7301363 ns,    656 iterations
> > generic: find_first_bit:                7589120 ns,    655 iterations
> > 
> > The bigger difference is in the find_next_bit operations, and this
> > likely comes from the arm32 code not having the hassles of the "_and"
> > and other conditionals that the generic code has:
> > 
> > random-filled:
> > arm    : find_next_bit:                 2242618 ns, 163949 iterations
> > generic: find_next_bit:                 2632859 ns, 163743 iterations
> > 
> > sparse:
> > arm    : find_next_bit:                   40078 ns,    656 iterations
> > generic: find_next_bit:                   69943 ns,    655 iterations
> > 
> > find_next_zero_bit show a greater difference:
> > 
> > random-filled:
> > arm    : find_next_zero_bit:            2049129 ns, 163732 iterations
> > generic: find_next_zero_bit:            2844221 ns, 163938 iterations
> > 
> > sparse:
> > arm    : find_next_zero_bit:            3939309 ns, 327025 iterations
> > generic: find_next_zero_bit:            5529553 ns, 327026 iterations
> 
> Those numbers disagree with what Alexey has measured on Odroid board
> for A15 but somewhat in line with what he had for A7:

Considering no one has seen these patches until I've just posted
them, frankly I don't think there's any point me looking at anyone
elses results.

These changes make substantial improvements to the arm32 assembly
code versions.

If you want a like-for-like comparison, then please get Alexey to
test with these patches applied. I am confident that he will confirm
my results.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
@ 2022-10-28 19:42           ` Russell King (Oracle)
  0 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 19:42 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel, Alexey Klimov

On Fri, Oct 28, 2022 at 11:37:44AM -0700, Yury Norov wrote:
> + Alexey Klimov
> 
> On Fri, Oct 28, 2022 at 06:45:50PM +0100, Russell King (Oracle) wrote:
> > On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote:
> > > On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
> > > <rmk+kernel@armlinux.org.uk> wrote:
> > > >
> > > > Document the ARMv5 bit offset calculation code.
> > > 
> > > Hmm. Don't the generic bitop functions end up using this? We do have a
> > > comment in the code that says
> > > 
> > >  * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
> > >  * and produce optimal inlined code in all cases. On ARMv7 it is even
> > >  * better by also using the rbit instruction.
> > 
> > It's true that the generic code also makes use of the rbit and clz
> > instructions - but in terms of the speed of the functions these only
> > get used once we've found a word that is interesting to locate the
> > bit we want in.
> > 
> > > but that 'may' makes me wonder...
> > > 
> > > IOW, what is it in the hand-written code that doesn't get done by the
> > > generic code these days?
> > 
> > For the _find_first_bit, there isn't much difference in the number
> > of instructions or really what is going on, only the organisation
> > and flow of the code is more inline - but that shouldn't make much
> > of a difference. Yet, there is a definite repeatable measurable
> > difference between the two:
> > 
> > random-filled:
> > arm    : find_first_bit:               17778911 ns,  16448 iterations
> > generic: find_first_bit:               18596622 ns,  16401 iterations
> > 
> > sparse:
> > arm    : find_first_bit:                7301363 ns,    656 iterations
> > generic: find_first_bit:                7589120 ns,    655 iterations
> > 
> > The bigger difference is in the find_next_bit operations, and this
> > likely comes from the arm32 code not having the hassles of the "_and"
> > and other conditionals that the generic code has:
> > 
> > random-filled:
> > arm    : find_next_bit:                 2242618 ns, 163949 iterations
> > generic: find_next_bit:                 2632859 ns, 163743 iterations
> > 
> > sparse:
> > arm    : find_next_bit:                   40078 ns,    656 iterations
> > generic: find_next_bit:                   69943 ns,    655 iterations
> > 
> > find_next_zero_bit show a greater difference:
> > 
> > random-filled:
> > arm    : find_next_zero_bit:            2049129 ns, 163732 iterations
> > generic: find_next_zero_bit:            2844221 ns, 163938 iterations
> > 
> > sparse:
> > arm    : find_next_zero_bit:            3939309 ns, 327025 iterations
> > generic: find_next_zero_bit:            5529553 ns, 327026 iterations
> 
> Those numbers disagree with what Alexey has measured on Odroid board
> for A15 but somewhat in line with what he had for A7:

Considering no one has seen these patches until I've just posted
them, frankly I don't think there's any point me looking at anyone
elses results.

These changes make substantial improvements to the arm32 assembly
code versions.

If you want a like-for-like comparison, then please get Alexey to
test with these patches applied. I am confident that he will confirm
my results.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
  2022-10-28 19:01         ` Linus Torvalds
@ 2022-10-28 19:46           ` Russell King (Oracle)
  -1 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 12:01:00PM -0700, Linus Torvalds wrote:
> Hmm. Interestingly, your _find_first_zero_bit_le() (which
> find_next_bit ends up using except for the first byte) ends up doing
> an optimization that is technically not valid.
> 
> In particular, the *generic* code does
> 
>                         sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(val)), sz);
> 
> for the final result.
> 
> In contrast, the arm code doesn't do the "min()" at all, and if there
> are bits after the bitmap (in a partial byte), it will just return
> those bits.

You've missed how the min() is coded. Specifically, that's handled by:

  cc:   e1510000        cmp     r1, r0
  d0:   31a00001        movcc   r0, r1

which clamps the returned index to the size of the array (held in r1).
So everything is in fact fine - and I think your analysis is incorrect.

Please could you take another look and evaluate whether you think the
arm assembly is incorrect.

I kind'a stopped reading here on the assumption that the remainder of
your email was based on this misinterpretation of the code.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
@ 2022-10-28 19:46           ` Russell King (Oracle)
  0 siblings, 0 replies; 28+ messages in thread
From: Russell King (Oracle) @ 2022-10-28 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 12:01:00PM -0700, Linus Torvalds wrote:
> Hmm. Interestingly, your _find_first_zero_bit_le() (which
> find_next_bit ends up using except for the first byte) ends up doing
> an optimization that is technically not valid.
> 
> In particular, the *generic* code does
> 
>                         sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(val)), sz);
> 
> for the final result.
> 
> In contrast, the arm code doesn't do the "min()" at all, and if there
> are bits after the bitmap (in a partial byte), it will just return
> those bits.

You've missed how the min() is coded. Specifically, that's handled by:

  cc:   e1510000        cmp     r1, r0
  d0:   31a00001        movcc   r0, r1

which clamps the returned index to the size of the array (held in r1).
So everything is in fact fine - and I think your analysis is incorrect.

Please could you take another look and evaluate whether you think the
arm assembly is incorrect.

I kind'a stopped reading here on the assumption that the remainder of
your email was based on this misinterpretation of the code.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
  2022-10-28 19:46           ` Russell King (Oracle)
@ 2022-10-28 20:26             ` Linus Torvalds
  -1 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2022-10-28 20:26 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 12:46 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> You've missed how the min() is coded.

Yes, I realised that and sent another email about my mea culpa.

           Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
@ 2022-10-28 20:26             ` Linus Torvalds
  0 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2022-10-28 20:26 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Yury Norov, Catalin Marinas, Mark Rutland, Will Deacon,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Oct 28, 2022 at 12:46 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> You've missed how the min() is coded.

Yes, I realised that and sent another email about my mea culpa.

           Linus

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-10-28 20:27 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-28 16:47 [PATCH 0/5] ARM: findbit assembly updates Russell King (Oracle)
2022-10-28 16:47 ` Russell King (Oracle)
2022-10-28 16:47 ` [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation Russell King (Oracle)
2022-10-28 16:47   ` Russell King (Oracle)
2022-10-28 17:05   ` Linus Torvalds
2022-10-28 17:05     ` Linus Torvalds
2022-10-28 17:45     ` Russell King (Oracle)
2022-10-28 17:45       ` Russell King (Oracle)
2022-10-28 18:37       ` Yury Norov
2022-10-28 18:37         ` Yury Norov
2022-10-28 19:42         ` Russell King (Oracle)
2022-10-28 19:42           ` Russell King (Oracle)
2022-10-28 19:01       ` Linus Torvalds
2022-10-28 19:01         ` Linus Torvalds
2022-10-28 19:10         ` Linus Torvalds
2022-10-28 19:10           ` Linus Torvalds
2022-10-28 19:46         ` Russell King (Oracle)
2022-10-28 19:46           ` Russell King (Oracle)
2022-10-28 20:26           ` Linus Torvalds
2022-10-28 20:26             ` Linus Torvalds
2022-10-28 16:47 ` [PATCH 2/5] ARM: findbit: provide more efficient ARMv7 implementation Russell King (Oracle)
2022-10-28 16:47   ` Russell King (Oracle)
2022-10-28 16:48 ` [PATCH 3/5] ARM: findbit: convert to macros Russell King (Oracle)
2022-10-28 16:48   ` Russell King (Oracle)
2022-10-28 16:48 ` [PATCH 4/5] ARM: findbit: operate by words Russell King (Oracle)
2022-10-28 16:48   ` Russell King (Oracle)
2022-10-28 16:48 ` [PATCH 5/5] ARM: findbit: add unwinder information Russell King (Oracle)
2022-10-28 16:48   ` Russell King (Oracle)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.