linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] lib: optimize find_bit() functions
@ 2022-08-24  1:26 Yury Norov
  2022-08-24  1:26 ` [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro Yury Norov
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Yury Norov @ 2022-08-24  1:26 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Yury Norov, Guenter Roeck, Dennis Zhou, Russell King,
	Catalin Marinas, Andy Shevchenko, Rasmus Villemoes,
	Alexey Klimov, Kees Cook, Andy Whitcroft

In a recent discussion, it was noticed that the find_next_bit() functions may
be improved by adding wrappers around a common __find_next_bit() in the .c file.

As suggested by Linus, I tried the meta-programming trick with the
EXPRESSION macro, which is passed from the wrappers into the find_bit()
helpers:

  #define BIT_FIND_BODY(addr, size, start, EXPRESSION)		\
  	BIT_FIND_SETUP(addr, size, start)			\
  	BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)	\
  	BIT_WORD_LOOP(addr, size, idx, val, EXPRESSION)		\
  	return size;						\
  found:	BIT_WORD_SWAB(val);				\
  	return min((idx)*BITS_PER_LONG + __ffs(val), size)
  
  unsigned long _find_next_and_bit(const unsigned long *addr1,
  				 const unsigned long *addr2,
  				 unsigned long size,
  				 unsigned long start)
  { BIT_FIND_BODY(addr, size, start, addr1[idx] & addr2[idx]); }

I appreciated the potential of how the EXPRESSION works, but I don't like
that the resulting macro is constructed from pieces, because it makes it
harder to understand what happens behind the ifdefery. Checkpatch isn't
happy either, because the final macro contains a 'return' statement; and I
agree that it's better to avoid that.

I spun the idea one more time, trying to make the FIND helper a more or
less standard-looking macro.

This new approach saves 10-11K of Image size, and is 15% faster in the
performance benchmark. See the last patch for some statistics.

v1: https://lore.kernel.org/all/20220728161208.865420-2-yury.norov@gmail.com/T/

Yury Norov (3):
  lib/find_bit: introduce FIND_FIRST_BIT() macro
  lib/find_bit: create find_first_zero_bit_le()
  lib/find_bit: optimize find_next_bit() functions

 MAINTAINERS          |   2 +
 include/linux/find.h |  46 +++++++++++++------
 lib/Makefile         |   1 +
 lib/find_bit.c       | 104 ++++++++++++-------------------------------
 lib/find_bit.h       |  45 +++++++++++++++++++
 lib/find_bit_be.c    |  42 +++++++++++++++++
 6 files changed, 152 insertions(+), 88 deletions(-)
 create mode 100644 lib/find_bit.h
 create mode 100644 lib/find_bit_be.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro
  2022-08-24  1:26 [PATCH v2 0/4] lib: optimize find_bit() functions Yury Norov
@ 2022-08-24  1:26 ` Yury Norov
  2022-08-24  9:10   ` Andy Shevchenko
  2022-08-24 17:59   ` Linus Torvalds
  2022-08-24  1:26 ` [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le() Yury Norov
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 23+ messages in thread
From: Yury Norov @ 2022-08-24  1:26 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Yury Norov, Guenter Roeck, Dennis Zhou, Russell King,
	Catalin Marinas, Andy Shevchenko, Rasmus Villemoes,
	Alexey Klimov, Kees Cook, Andy Whitcroft

Now that we have many flavors of find_first_bit(), and expect even more,
it's better to have one macro that generates optimal code for all of them and
makes maintaining the slightly different functions simpler.

The logic common to all versions is moved to the new macro, and all the
flavors are generated by providing an EXPRESSION macro-parameter, like
in this example:

  #define FIND_FIRST_BIT(EXPRESSION, size) ...
  
  find_first_ornot_and_bit(addr1, addr2, addr3, size)
  {
  	return FIND_FIRST_BIT(addr1[idx] | ~addr2[idx] & addr3[idx], size);
  }

The EXPRESSION may be of any complexity, as long as it only refers to
the bitmap(s) and the iterator idx.

The 'word_op' hook is here to allow the macro to generate code for the _le
versions on big-endian machines; it is used in the following patches.
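
As a standalone illustration, here is a userspace sketch of the same
technique. This is not the kernel code: __builtin_ctzl() stands in for
__ffs(), the min() clamping is spelled out by hand, and the GNU C
statement expression is the same extension the kernel macro relies on.

  #include <stdio.h>

  #define BITS_PER_LONG (8 * sizeof(unsigned long))

  /*
   * The caller's EXPRESSION is pasted into the loop body and may refer
   * to the iterator 'idx' declared by the macro.
   */
  #define FIND_FIRST_BIT_SKETCH(EXPRESSION, size)			\
  ({									\
  	unsigned long idx, val, sz = (size);				\
  									\
  	for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {		\
  		val = (EXPRESSION);					\
  		if (val) {						\
  			unsigned long bit = idx * BITS_PER_LONG +	\
  					    __builtin_ctzl(val);	\
  			if (bit < sz)					\
  				sz = bit;				\
  			break;						\
  		}							\
  	}								\
  	sz;								\
  })

  static unsigned long find_first_and_bit_sketch(const unsigned long *addr1,
  						 const unsigned long *addr2,
  						 unsigned long size)
  {
  	return FIND_FIRST_BIT_SKETCH(addr1[idx] & addr2[idx], size);
  }

  int main(void)
  {
  	unsigned long a[2] = { 0, 0x30 };	/* bits 68 and 69 set */
  	unsigned long b[2] = { 0, 0x10 };	/* bit 68 set */

  	printf("%lu\n", find_first_and_bit_sketch(a, b, 128));	/* prints 68 */
  	return 0;
  }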

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 MAINTAINERS    |  1 +
 lib/find_bit.c | 30 +++++-------------------------
 lib/find_bit.h | 24 ++++++++++++++++++++++++
 3 files changed, 30 insertions(+), 25 deletions(-)
 create mode 100644 lib/find_bit.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 96f47a7865d6..02e11f2dbafe 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3612,6 +3612,7 @@ F:	include/linux/find.h
 F:	include/linux/nodemask.h
 F:	lib/bitmap.c
 F:	lib/cpumask.c
+F:	lib/find_bit.h
 F:	lib/find_bit.c
 F:	lib/find_bit_benchmark.c
 F:	lib/test_bitmap.c
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 1b8e4b2a9cba..ccc4fb1dfc71 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -19,6 +19,8 @@
 #include <linux/minmax.h>
 #include <linux/swab.h>
 
+#include "find_bit.h"
+
 #if !defined(find_next_bit) || !defined(find_next_zero_bit) ||			\
 	!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) ||	\
 	!defined(find_next_and_bit)
@@ -77,14 +79,7 @@ EXPORT_SYMBOL(_find_next_bit);
  */
 unsigned long _find_first_bit(const unsigned long *addr, unsigned long size)
 {
-	unsigned long idx;
-
-	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
-		if (addr[idx])
-			return min(idx * BITS_PER_LONG + __ffs(addr[idx]), size);
-	}
-
-	return size;
+	return FIND_FIRST_BIT(addr[idx], size);
 }
 EXPORT_SYMBOL(_find_first_bit);
 #endif
@@ -97,15 +92,7 @@ unsigned long _find_first_and_bit(const unsigned long *addr1,
 				  const unsigned long *addr2,
 				  unsigned long size)
 {
-	unsigned long idx, val;
-
-	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
-		val = addr1[idx] & addr2[idx];
-		if (val)
-			return min(idx * BITS_PER_LONG + __ffs(val), size);
-	}
-
-	return size;
+	return FIND_FIRST_BIT(addr1[idx] & addr2[idx], size);
 }
 EXPORT_SYMBOL(_find_first_and_bit);
 #endif
@@ -116,14 +103,7 @@ EXPORT_SYMBOL(_find_first_and_bit);
  */
 unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size)
 {
-	unsigned long idx;
-
-	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
-		if (addr[idx] != ~0UL)
-			return min(idx * BITS_PER_LONG + ffz(addr[idx]), size);
-	}
-
-	return size;
+	return FIND_FIRST_BIT(~addr[idx], size);
 }
 EXPORT_SYMBOL(_find_first_zero_bit);
 #endif
diff --git a/lib/find_bit.h b/lib/find_bit.h
new file mode 100644
index 000000000000..b4b6245ddbf6
--- /dev/null
+++ b/lib/find_bit.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LIB_FIND_BIT_H
+#define _LIB_FIND_BIT_H
+
+#ifndef word_op
+#define word_op
+#endif
+
+#define FIND_FIRST_BIT(EXPRESSION, size)					\
+({										\
+	unsigned long idx, val, sz = (size);					\
+										\
+	for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {			\
+		val = (EXPRESSION);						\
+		if (val) {							\
+			sz = min(idx * BITS_PER_LONG + __ffs(word_op(val)), sz);\
+			break;							\
+		}								\
+	}									\
+										\
+	sz;									\
+})
+
+#endif /* _LIB_FIND_BIT_H */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()
  2022-08-24  1:26 [PATCH v2 0/4] lib: optimize find_bit() functions Yury Norov
  2022-08-24  1:26 ` [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro Yury Norov
@ 2022-08-24  1:26 ` Yury Norov
  2022-08-24  9:22   ` Andy Shevchenko
  2022-08-24 18:11   ` Linus Torvalds
  2022-08-24  1:26 ` [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions Yury Norov
  2022-08-24  9:00 ` [PATCH v2 0/4] lib: optimize find_bit() functions Andy Shevchenko
  3 siblings, 2 replies; 23+ messages in thread
From: Yury Norov @ 2022-08-24  1:26 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Yury Norov, Guenter Roeck, Dennis Zhou, Russell King,
	Catalin Marinas, Andy Shevchenko, Rasmus Villemoes,
	Alexey Klimov, Kees Cook, Andy Whitcroft

find_first_zero_bit_le() is an alias for find_next_zero_bit_le(), even
though 'next' is known to be slower than the 'first' version.

Now that we have a common FIND_FIRST_BIT() macro helper, it's trivial
to implement find_first_zero_bit_le() as a real function.

Moving find_*_le() to a separate file helps adapt FIND_FIRST_BIT() to the
_le case by wiring word_op to swab().
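
Why the swab() matters, as a standalone userspace sketch (the builtins
stand in for the kernel's swab() and __ffs()): an LE bitmap with only
bit 9 set stores 0x02 in byte 1, so a 64-bit big-endian CPU loading
that memory as one word sees the bit in the wrong position until the
word is byte-swapped.

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
  	/* what a 64-bit BE CPU loads from an LE bitmap with bit 9 set */
  	uint64_t word = 0x0002000000000000ULL;

  	/* without the swap, the reported bit number is wrong */
  	printf("no swab: bit %d\n", __builtin_ctzll(word));		/* 49 */

  	/* word_op == swab restores the LE bit numbering */
  	printf("swab:    bit %d\n",
  	       __builtin_ctzll(__builtin_bswap64(word)));		/* 9 */

  	return 0;
  }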

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
Like other find_*_le() functions, the new one takes void *addr, instead
of unsigned long *. This should be fixed for all in a separate series.

 MAINTAINERS          |  1 +
 include/linux/find.h | 23 ++++++++++++++++++-----
 lib/Makefile         |  1 +
 lib/find_bit_be.c    | 23 +++++++++++++++++++++++
 4 files changed, 43 insertions(+), 5 deletions(-)
 create mode 100644 lib/find_bit_be.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 02e11f2dbafe..fd1d1625b053 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3614,6 +3614,7 @@ F:	lib/bitmap.c
 F:	lib/cpumask.c
 F:	lib/find_bit.h
 F:	lib/find_bit.c
+F:	lib/find_bit_be.c
 F:	lib/find_bit_benchmark.c
 F:	lib/test_bitmap.c
 F:	tools/include/linux/bitmap.h
diff --git a/include/linux/find.h b/include/linux/find.h
index 424ef67d4a42..2464bff5de04 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -17,6 +17,10 @@ extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long size);
 
+#ifdef __BIG_ENDIAN
+unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size);
+#endif
+
 #ifndef find_next_bit
 /**
  * find_next_bit - find the next set bit in a memory region
@@ -251,6 +255,20 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
 }
 #endif
 
+#ifndef find_first_zero_bit_le
+static inline
+unsigned long find_first_zero_bit_le(const void *addr, unsigned long size)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val = swab(*(const unsigned long *)addr) | ~GENMASK(size - 1, 0);
+
+		return val == ~0UL ? size : ffz(val);
+	}
+
+	return _find_first_zero_bit_le(addr, size);
+}
+#endif
+
 #ifndef find_next_bit_le
 static inline
 unsigned long find_next_bit_le(const void *addr, unsigned
@@ -270,11 +288,6 @@ unsigned long find_next_bit_le(const void *addr, unsigned
 }
 #endif
 
-#ifndef find_first_zero_bit_le
-#define find_first_zero_bit_le(addr, size) \
-	find_next_zero_bit_le((addr), (size), 0)
-#endif
-
 #else
 #error "Please fix <asm/byteorder.h>"
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index 5927d7fa0806..0f41b76a277e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -49,6 +49,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
 	 percpu-refcount.o rhashtable.o base64.o \
 	 once.o refcount.o usercopy.o errseq.o bucket_locks.o \
 	 generic-radix-tree.o
+obj-$(CONFIG_CPU_BIG_ENDIAN) += find_bit_be.o
 obj-$(CONFIG_STRING_SELFTEST) += test_string.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
diff --git a/lib/find_bit_be.c b/lib/find_bit_be.c
new file mode 100644
index 000000000000..36173cb7e012
--- /dev/null
+++ b/lib/find_bit_be.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Big-endian routines for bit search */
+
+#include <linux/bitops.h>
+#include <linux/bitmap.h>
+#include <linux/export.h>
+#include <linux/math.h>
+#include <linux/minmax.h>
+#include <linux/swab.h>
+
+#define word_op swab
+#include "find_bit.h"
+
+#ifndef find_first_zero_bit_le
+/*
+ * Find the first cleared bit in an LE memory region.
+ */
+unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
+{
+	return FIND_FIRST_BIT(~addr[idx], size);
+}
+EXPORT_SYMBOL(_find_first_zero_bit_le);
+#endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions
  2022-08-24  1:26 [PATCH v2 0/4] lib: optimize find_bit() functions Yury Norov
  2022-08-24  1:26 ` [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro Yury Norov
  2022-08-24  1:26 ` [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le() Yury Norov
@ 2022-08-24  1:26 ` Yury Norov
  2022-08-24  9:19   ` Andy Shevchenko
  2022-08-24  9:00 ` [PATCH v2 0/4] lib: optimize find_bit() functions Andy Shevchenko
  3 siblings, 1 reply; 23+ messages in thread
From: Yury Norov @ 2022-08-24  1:26 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Yury Norov, Guenter Roeck, Dennis Zhou, Russell King,
	Catalin Marinas, Andy Shevchenko, Rasmus Villemoes,
	Alexey Klimov, Kees Cook, Andy Whitcroft

Over the past couple of years, the function _find_next_bit() was extended
with parameters that modify its behavior to implement the and-, zero- and le-
flavors. The parameters are compile-time constants at the call sites, but the
current design prevents the compiler from optimizing out the conditionals:
the helper is compiled once in lib/find_bit.c, so the constants passed by the
inline wrappers cannot be folded into its single shared body.

As find_next_bit() API grows, I expect that more parameterss will be added.
The current design would require more conditional code in _find_next_bit(),
which would bloat the helper even more and make it barely readable.

This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
a set of wrappers, so that the compile-time optimization becomes possible.

The common logic is moved to the new macro, and all flavors may be
generated by providing an EXPRESSION macro parameter, like in this example:

  #define FIND_NEXT_BIT(EXPRESSION, size, start) ...
  
  find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
  {
  	return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx], size, start);
  }

The EXPRESSION may be of any complexity, as long as it only refers to
the bitmap(s) and the iterator idx.

I ran find_bit_benchmark 16 times on top of 6.0-rc2 and 16 times on top
of this series. The results for kvm/x86_64 are:

                      v6.0-rc2  Optimized       Difference  Z-score
Random dense bitmap         ns         ns        ns      %
find_next_bit:          787735     670546    117189   14.9     3.97
find_next_zero_bit:     777492     664208    113284   14.6    10.51
find_last_bit:          830925     687573    143352   17.3     2.35
find_first_bit:        3874366    3306635    567731   14.7     1.84
find_first_and_bit:   40677125   37739887   2937238    7.2     1.36
find_next_and_bit:      347865     304456     43409   12.5     1.35

Random sparse bitmap
find_next_bit:           19816      14021      5795   29.2     6.10
find_next_zero_bit:    1318901    1223794     95107    7.2     1.41
find_last_bit:           14573      13514      1059    7.3     6.92
find_first_bit:        1313321    1249024     64297    4.9     1.53
find_first_and_bit:       8921       8098       823    9.2     4.56
find_next_and_bit:        9796       7176      2620   26.7     5.39

Where the result is statistically significant (z-score > 3), the improvement
is ~15%.

According to bloat-o-meter, the Image size is 10-11K less:

x86_64/defconfig:
add/remove: 32/14 grow/shrink: 61/782 up/down: 6344/-16521 (-10177)

arm64/defconfig:
add/remove: 3/2 grow/shrink: 50/714 up/down: 608/-11556 (-10948)

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
Checkpatch warns that FIND_NEXT_BIT() falls under "Macros with flow
control statements", but in this case I think it's a false positive,
because the label is defined in the same code block as the 'goto'
statement, and control can't flow out of the block.
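
For reference, a reduced standalone sketch of the pattern (userspace;
__builtin_ctzl() in place of __ffs(), and the loop bound simplified, so
this is not the exact kernel macro). It shows that the label and the
'goto' live in the same statement expression:

  #include <stdio.h>

  #define BITS_PER_LONG (8 * sizeof(unsigned long))

  /*
   * GNU C statement expression: the 'out' label is defined inside the
   * same block as the 'goto', so control cannot leave the expression.
   * Each expansion defines the label, so use it once per function.
   */
  #define FIND_NEXT_BIT_SKETCH(EXPRESSION, size, start)			\
  ({									\
  	unsigned long idx, tmp, sz = (size), __start = (start);	\
  									\
  	if (__start >= sz)						\
  		goto out;						\
  									\
  	idx = __start / BITS_PER_LONG;					\
  	/* mask off the bits below 'start' in the first word */	\
  	tmp = (EXPRESSION) & (~0UL << (__start % BITS_PER_LONG));	\
  									\
  	while (!tmp) {							\
  		if ((idx + 1) * BITS_PER_LONG >= sz)			\
  			goto out;					\
  		idx++;							\
  		tmp = (EXPRESSION);					\
  	}								\
  									\
  	tmp = idx * BITS_PER_LONG + __builtin_ctzl(tmp);		\
  	if (tmp < sz)							\
  		sz = tmp;						\
  out:									\
  	sz;								\
  })

  int main(void)
  {
  	unsigned long map[2] = { 1UL << 3, 1UL << 5 };

  	/* searching past bit 3 finds bit 64 + 5 = 69 */
  	printf("%lu\n", FIND_NEXT_BIT_SKETCH(map[idx], 128, 4));
  	return 0;
  }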

 include/linux/find.h | 23 ++++++++-----
 lib/find_bit.c       | 78 +++++++++++++++-----------------------------
 lib/find_bit.h       | 21 ++++++++++++
 lib/find_bit_be.c    | 19 +++++++++++
 4 files changed, 81 insertions(+), 60 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index 2464bff5de04..dead6f53a97b 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -8,9 +8,12 @@
 
 #include <linux/bitops.h>
 
-extern unsigned long _find_next_bit(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le);
+unsigned long _find_next_bit(const unsigned long *addr1, unsigned long nbits,
+				unsigned long start);
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+					unsigned long nbits, unsigned long start);
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+					 unsigned long start);
 extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
@@ -19,6 +22,10 @@ extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long siz
 
 #ifdef __BIG_ENDIAN
 unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size);
+unsigned long _find_next_zero_bit_le(const  unsigned long *addr, unsigned
+					long size, unsigned long offset);
+unsigned long _find_next_bit_le(const unsigned long *addr, unsigned
+				long size, unsigned long offset);
 #endif
 
 #ifndef find_next_bit
@@ -45,7 +52,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+	return _find_next_bit(addr, size, offset);
 }
 #endif
 
@@ -75,7 +82,7 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+	return _find_next_and_bit(addr1, addr2, size, offset);
 }
 #endif
 
@@ -103,7 +110,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+	return _find_next_zero_bit(addr, size, offset);
 }
 #endif
 
@@ -251,7 +258,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+	return _find_next_zero_bit_le(addr, size, offset);
 }
 #endif
 
@@ -284,7 +291,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+	return _find_next_bit_le(addr, size, offset);
 }
 #endif
 
diff --git a/lib/find_bit.c b/lib/find_bit.c
index ccc4fb1dfc71..357750d25ff9 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -21,58 +21,6 @@
 
 #include "find_bit.h"
 
-#if !defined(find_next_bit) || !defined(find_next_zero_bit) ||			\
-	!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) ||	\
-	!defined(find_next_and_bit)
-/*
- * This is a common helper function for find_next_bit, find_next_zero_bit, and
- * find_next_and_bit. The differences are:
- *  - The "invert" argument, which is XORed with each fetched word before
- *    searching it for one bits.
- *  - The optional "addr2", which is anded with "addr1" if present.
- */
-unsigned long _find_next_bit(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le)
-{
-	unsigned long tmp, mask;
-
-	if (unlikely(start >= nbits))
-		return nbits;
-
-	tmp = addr1[start / BITS_PER_LONG];
-	if (addr2)
-		tmp &= addr2[start / BITS_PER_LONG];
-	tmp ^= invert;
-
-	/* Handle 1st word. */
-	mask = BITMAP_FIRST_WORD_MASK(start);
-	if (le)
-		mask = swab(mask);
-
-	tmp &= mask;
-
-	start = round_down(start, BITS_PER_LONG);
-
-	while (!tmp) {
-		start += BITS_PER_LONG;
-		if (start >= nbits)
-			return nbits;
-
-		tmp = addr1[start / BITS_PER_LONG];
-		if (addr2)
-			tmp &= addr2[start / BITS_PER_LONG];
-		tmp ^= invert;
-	}
-
-	if (le)
-		tmp = swab(tmp);
-
-	return min(start + __ffs(tmp), nbits);
-}
-EXPORT_SYMBOL(_find_next_bit);
-#endif
-
 #ifndef find_first_bit
 /*
  * Find the first set bit in a memory region.
@@ -108,6 +56,32 @@ unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size
 EXPORT_SYMBOL(_find_first_zero_bit);
 #endif
 
+#ifndef find_next_bit
+unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
+{
+	return FIND_NEXT_BIT(addr[idx], nbits, start);
+}
+EXPORT_SYMBOL(_find_next_bit);
+#endif
+
+#ifndef find_next_and_bit
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+					unsigned long nbits, unsigned long start)
+{
+	return FIND_NEXT_BIT(addr1[idx] & addr2[idx], nbits, start);
+}
+EXPORT_SYMBOL(_find_next_and_bit);
+#endif
+
+#ifndef find_next_zero_bit
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+					 unsigned long start)
+{
+	return FIND_NEXT_BIT(~addr[idx], nbits, start);
+}
+EXPORT_SYMBOL(_find_next_zero_bit);
+#endif
+
 #ifndef find_last_bit
 unsigned long _find_last_bit(const unsigned long *addr, unsigned long size)
 {
diff --git a/lib/find_bit.h b/lib/find_bit.h
index b4b6245ddbf6..6b6312f93301 100644
--- a/lib/find_bit.h
+++ b/lib/find_bit.h
@@ -21,4 +21,25 @@
 	sz;									\
 })
 
+#define FIND_NEXT_BIT(EXPRESSION, size, start)					\
+({										\
+	unsigned long mask, idx, tmp, sz = (size), __start = (start);		\
+										\
+	if (unlikely(__start >= sz))						\
+		goto out;							\
+										\
+	mask = word_op(BITMAP_FIRST_WORD_MASK(__start));			\
+	idx = __start / BITS_PER_LONG;						\
+										\
+	for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) {		\
+		if (idx > sz / BITS_PER_LONG)					\
+			goto out;						\
+		idx++;								\
+	}									\
+										\
+	sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz);		\
+out:										\
+	sz;									\
+})
+
 #endif /* _LIB_FIND_BIT_H */
diff --git a/lib/find_bit_be.c b/lib/find_bit_be.c
index 36173cb7e012..cbf669aaf3cb 100644
--- a/lib/find_bit_be.c
+++ b/lib/find_bit_be.c
@@ -21,3 +21,22 @@ unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long s
 }
 EXPORT_SYMBOL(_find_first_zero_bit_le);
 #endif
+
+#ifndef find_next_zero_bit_le
+unsigned long _find_next_zero_bit_le(const unsigned long *addr, unsigned
+		long size, unsigned long offset)
+{
+	return FIND_NEXT_BIT(~addr[idx], size, offset);
+}
+EXPORT_SYMBOL(_find_next_zero_bit_le);
+#endif
+
+#ifndef find_next_bit_le
+unsigned long _find_next_bit_le(const unsigned long  *addr, unsigned
+		long size, unsigned long offset)
+{
+	return FIND_NEXT_BIT(addr[idx], size, offset);
+}
+EXPORT_SYMBOL(_find_next_bit_le);
+
+#endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/4] lib: optimize find_bit() functions
  2022-08-24  1:26 [PATCH v2 0/4] lib: optimize find_bit() functions Yury Norov
                   ` (2 preceding siblings ...)
  2022-08-24  1:26 ` [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions Yury Norov
@ 2022-08-24  9:00 ` Andy Shevchenko
  3 siblings, 0 replies; 23+ messages in thread
From: Andy Shevchenko @ 2022-08-24  9:00 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 4:39 AM Yury Norov <yury.norov@gmail.com> wrote:

...

> v1: https://lore.kernel.org/all/20220728161208.865420-2-yury.norov@gmail.com/T/
>
> Yury Norov (3):
>   lib/find_bit: introduce FIND_FIRST_BIT() macro
>   lib/find_bit: create find_first_zero_bit_le()
>   lib/find_bit: optimize find_next_bit() functions
>
>  MAINTAINERS          |   2 +
>  include/linux/find.h |  46 +++++++++++++------
>  lib/Makefile         |   1 +
>  lib/find_bit.c       | 104 ++++++++++++-------------------------------
>  lib/find_bit.h       |  45 +++++++++++++++++++
>  lib/find_bit_be.c    |  42 +++++++++++++++++
>  6 files changed, 152 insertions(+), 88 deletions(-)
>  create mode 100644 lib/find_bit.h
>  create mode 100644 lib/find_bit_be.c
>
> --
> 2.34.1

Seems like the cover letter is generated by Git, but it has 0/4 in the
Subject while the rest of the series is n/3. What happened on your
side with the numbering of the patches?


--
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro
  2022-08-24  1:26 ` [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro Yury Norov
@ 2022-08-24  9:10   ` Andy Shevchenko
  2022-08-24 13:19     ` Yury Norov
  2022-08-24 17:59   ` Linus Torvalds
  1 sibling, 1 reply; 23+ messages in thread
From: Andy Shevchenko @ 2022-08-24  9:10 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 4:51 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> Now that we have many flavors of find_first_bit(), and expect even more,
> it's better to have one macro that generates optimal code for all and makes
> maintaining of slightly different functions simpler.
>
> The logic common to all versions is moved to the new macro, and all the
> flavors are generated by providing an EXPRESSION macro-parameter, like
> in this example:
>
>   #define FIND_FIRST_BIT(EXPRESSION, size) ...
>
>   find_first_ornot_and_bit(addr1, addr2, addr3, size)
>   {
>         return FIND_NEXT_BIT(addr1[idx] | ~addr2[idx] & addr3[idx], size);
>   }
>
> The EXPRESSION may be of any complexity, as soon as it only refers
> the bitmap(s) and an iterator idx.
>
> The 'word_op' is here to allow the macro to generate code for _le
> versions on big-endian machines; used in the following patches.

...

> +#ifndef word_op
> +#define word_op
> +#endif

Not sure about the naming without namespace. Perhaps __ffs_word_op?

> +#define FIND_FIRST_BIT(EXPRESSION, size)                                       \
> +({                                                                             \
> +       unsigned long idx, val, sz = (size);                                    \
> +                                                                               \
> +       for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {                        \

I think we can do slightly better:

for (unsigned long idx = 0; idx < sz; idx += BITS_PER_LONG) {
  unsigned long val;

> +               val = (EXPRESSION);                                             \
> +               if (val) {                                                      \
> +                       sz = min(idx * BITS_PER_LONG + __ffs(word_op(val)), sz);\

sz = min(idx + __ffs(...));

> +                       break;                                                  \
> +               }                                                               \
> +       }                                                                       \
> +                                                                               \
> +       sz;                                                                     \
> +})


-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions
  2022-08-24  1:26 ` [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions Yury Norov
@ 2022-08-24  9:19   ` Andy Shevchenko
  2022-08-24 13:53     ` Yury Norov
  0 siblings, 1 reply; 23+ messages in thread
From: Andy Shevchenko @ 2022-08-24  9:19 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 4:56 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> Over the past couple years, the function _find_next_bit() was extended
> with parameters that modify its behavior to implement and- zero- and le-
> flavors. The parameters are passed at compile time, but current design
> prevents a compiler from optimizing out the conditionals.
>
> As find_next_bit() API grows, I expect that more parameterss will be added.

parameters

> Current designs would require more conditional code in _find_next_bit(),
> which would bloat the helper even more and make it barely readable.
>
> This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
> a set of wrappers, so that the compile-time optimization becomes possible.
>
> The common logic is moved to the new macro, and all flavors may be
> generated by providing an EXPRESSION macro parameter, like in this example:
>
>   #define FIND_NEXT_BIT(EXPRESSION, size, start) ...
>
>   find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
>   {
>         return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx], size, start);
>   }
>
> The EXPRESSION may be of any complexity, as soon as it only refers
> the bitmap(s) and an iterator idx.

...

> +#define FIND_NEXT_BIT(EXPRESSION, size, start)                                 \
> +({                                                                             \
> +       unsigned long mask, idx, tmp, sz = (size), __start = (start);           \
> +                                                                               \
> +       if (unlikely(__start >= sz))                                            \
> +               goto out;                                                       \
> +                                                                               \
> +       mask = word_op(BITMAP_FIRST_WORD_MASK(__start));                        \
> +       idx = __start / BITS_PER_LONG;                                          \
> +                                                                               \
> +       for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) {             \

for (unsigned long tmp ...;
But hey, why not loop over idx (which probably should be named as
offset) as I proposed in the first patch? You will drop a lot of
divisions / multiplications, no?

> +               if (idx > sz / BITS_PER_LONG)                                   \
> +                       goto out;                                               \
> +               idx++;                                                          \
> +       }                                                                       \
> +                                                                               \
> +       sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz);                \
> +out:                                                                           \
> +       sz;                                                                     \
> +})

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()
  2022-08-24  1:26 ` [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le() Yury Norov
@ 2022-08-24  9:22   ` Andy Shevchenko
  2022-08-24  9:24     ` Andy Shevchenko
  2022-08-24 13:37     ` Yury Norov
  2022-08-24 18:11   ` Linus Torvalds
  1 sibling, 2 replies; 23+ messages in thread
From: Andy Shevchenko @ 2022-08-24  9:22 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 5:17 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> find_first_zero_bit_le() is an alias to find_next_zero_bit_le(),
> despite that 'next' is known to be slower than the 'first' version.
>
> Now that we have a common FIND_FIRST_BIT() macro helper, it's trivial
> to implement find_first_zero_bit_le() as a real function.
>
> Moving find_*_le() to a separate file helps to fit the FIND_FIRST_BIT()
> to the _le needs by wiring word_op to swab.
>
> Signed-off-by: Yury Norov <yury.norov@gmail.com>
> ---
> Like other find_*_le() functions, the new one takes void *addr, instead
> of unsigned long *. This should be fixed for all in a separate series.

From this comment it is unclear to me why we can't fix them first and
then apply this with the correct type?

...

> +#define word_op swab
> +#include "find_bit.h"

Looking at this, I would rather always require to define __ffs_word_op
(or whatever name) in the user and replace #ifndef in the find_bit.h
with
#error "The __ffs_word_op must be defined before including find_bit.h!"

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()
  2022-08-24  9:22   ` Andy Shevchenko
@ 2022-08-24  9:24     ` Andy Shevchenko
  2022-08-24 13:37     ` Yury Norov
  1 sibling, 0 replies; 23+ messages in thread
From: Andy Shevchenko @ 2022-08-24  9:24 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 12:22 PM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> On Wed, Aug 24, 2022 at 5:17 AM Yury Norov <yury.norov@gmail.com> wrote:

...

> > +#define word_op swab
> > +#include "find_bit.h"
>
> Looking at this, I would rather always require to define __ffs_word_op
> (or whatever name) in the user and replace #ifndef in the find_bit.h
> with
> #error "The __ffs_word_op must be defined before including find_bit.h!"

The rationale is that the missed definition may give wrong results
while being compiled with no errors. With the above, the developer
must think about what they are doing.
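
A minimal sketch of what I mean (the __ffs_word_op name is just the one
suggested above; the actual name is up to the series):

  /* at the top of lib/find_bit.h */
  #ifndef __ffs_word_op
  #error "__ffs_word_op must be defined before including find_bit.h"
  #endif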

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro
  2022-08-24  9:10   ` Andy Shevchenko
@ 2022-08-24 13:19     ` Yury Norov
  2022-08-24 14:18       ` David Laight
  2022-08-24 17:45       ` Andy Shevchenko
  0 siblings, 2 replies; 23+ messages in thread
From: Yury Norov @ 2022-08-24 13:19 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 12:10:02PM +0300, Andy Shevchenko wrote:
> On Wed, Aug 24, 2022 at 4:51 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > Now that we have many flavors of find_first_bit(), and expect even more,
> > it's better to have one macro that generates optimal code for all and makes
> > maintaining of slightly different functions simpler.
> >
> > The logic common to all versions is moved to the new macro, and all the
> > flavors are generated by providing an EXPRESSION macro-parameter, like
> > in this example:
> >
> >   #define FIND_FIRST_BIT(EXPRESSION, size) ...
> >
> >   find_first_ornot_and_bit(addr1, addr2, addr3, size)
> >   {
> >         return FIND_NEXT_BIT(addr1[idx] | ~addr2[idx] & addr3[idx], size);
> >   }
> >
> > The EXPRESSION may be of any complexity, as soon as it only refers
> > the bitmap(s) and an iterator idx.
> >
> > The 'word_op' is here to allow the macro to generate code for _le
> > versions on big-endian machines; used in the following patches.
> 
> ...
> 
> > +#ifndef word_op
> > +#define word_op
> > +#endif
> 
> Not sure about the naming without namespace. Perhaps __ffs_word_op?
> 
> > +#define FIND_FIRST_BIT(EXPRESSION, size)                                       \
> > +({                                                                             \
> > +       unsigned long idx, val, sz = (size);                                    \
> > +                                                                               \
> > +       for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {                        \
> 
> I think we can do slightly better:
> 
> for (unsigned long idx = 0; idx < sz; idx += BITS_PER_LONG) {
>   unsigned long val;

This will blow up the EXPRESSION. We can mitigate it on the user side:
  find_first_bit(addr, size)
  {
        return FIND_FIRST_BIT(addr[idx/BITS_PER_LONG], size);
  }

But to me it's a wtf++.

And generated code looks almost the same, except that
on x86_64 your version is bigger. Compare before:
0000000000000000 <_find_first_bit>:
   0:   mov    %rsi,%rax
   3:   test   %rsi,%rsi
   6:   je     35 <_find_first_bit+0x35>
   8:   xor    %edx,%edx
   a:   jmp    19 <_find_first_bit+0x19>
   c:   add    $0x40,%rdx               // Track bits and
  10:   add    $0x8,%rdi                // index separately
  14:   cmp    %rax,%rdx
  17:   jae    35 <_find_first_bit+0x35>
  19:   mov    (%rdi),%rcx
  1c:   test   %rcx,%rcx
  1f:   je     c <_find_first_bit+0xc>
  21:   tzcnt  %rcx,%rcx
  26:   add    %rdx,%rcx
  29:   cmp    %rcx,%rax
  2c:   cmova  %rcx,%rax
  30:   jmp    35 <_find_first_bit+0x35>
  35:   jmp    3a <_find_first_bit+0x3a>
  3a:   nopw   0x0(%rax,%rax,1)

And after:
0000000000000000 <_find_first_bit>:
   0:   mov    %rsi,%rax
   3:   test   %rsi,%rsi
   6:   je     39 <_find_first_bit+0x39>
   8:   xor    %edx,%edx
   a:   jmp    15 <_find_first_bit+0x15>
   c:   add    $0x40,%rdx               // Track bits only
  10:   cmp    %rdx,%rax 
  13:   jbe    39 <_find_first_bit+0x39>
  15:   mov    %rdx,%rcx
  18:   shr    $0x6,%rcx                // But divide here
  1c:   mov    (%rdi,%rcx,8),%rcx
  20:   test   %rcx,%rcx
  23:   je     c <_find_first_bit+0xc>
  25:   tzcnt  %rcx,%rcx
  2a:   add    %rcx,%rdx
  2d:   cmp    %rdx,%rax
  30:   cmova  %rdx,%rax
  34:   jmp    39 <_find_first_bit+0x39>
  39:   jmp    3e <_find_first_bit+0x3e>
  3e:   xchg   %ax,%ax                  // Which adds 4 bytes to .text 

Thanks,
Yury

> > +               val = (EXPRESSION);                                             \
> > +               if (val) {                                                      \
> > +                       sz = min(idx * BITS_PER_LONG + __ffs(word_op(val)), sz);\
> 
> sz = min(idx + __ffs(...));
> 
> > +                       break;                                                  \
> > +               }                                                               \
> > +       }                                                                       \
> > +                                                                               \
> > +       sz;                                                                     \
> > +})
> 
> 
> -- 
> With Best Regards,
> Andy Shevchenko

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()
  2022-08-24  9:22   ` Andy Shevchenko
  2022-08-24  9:24     ` Andy Shevchenko
@ 2022-08-24 13:37     ` Yury Norov
  2022-08-24 17:50       ` Andy Shevchenko
  2022-08-24 17:58       ` Russell King (Oracle)
  1 sibling, 2 replies; 23+ messages in thread
From: Yury Norov @ 2022-08-24 13:37 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 12:22:33PM +0300, Andy Shevchenko wrote:
> On Wed, Aug 24, 2022 at 5:17 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > find_first_zero_bit_le() is an alias to find_next_zero_bit_le(),
> > despite that 'next' is known to be slower than the 'first' version.
> >
> > Now that we have a common FIND_FIRST_BIT() macro helper, it's trivial
> > to implement find_first_zero_bit_le() as a real function.
> >
> > Moving find_*_le() to a separate file helps to fit the FIND_FIRST_BIT()
> > to the _le needs by wiring word_op to swab.
> >
> > Signed-off-by: Yury Norov <yury.norov@gmail.com>
> > ---
> > Like other find_*_le() functions, the new one takes void *addr, instead
> > of unsigned long *. This should be fixed for all in a separate series.
> 
> From this comment it is unclear to me why we can't fix them first and
> then apply this with the correct type?

Because there is a codebase that relies on existing types, mostly in
filesystem code. And those fs fixes would require 5 or 6 patches.

This would triple the length of this series, and is completely
unrelated. That's why I think that:
        > > This should be fixed for all in a separate series.

> ...
> 
> > +#define word_op swab
> > +#include "find_bit.h"
> 
> Looking at this, I would rather always require to define __ffs_word_op
> (or whatever name) in the user and replace #ifndef in the find_bit.h
> with
> #error "The __ffs_word_op must be defined before including find_bit.h!"

This is a local header which is not intended to be included anywhere
except lib/find_bit{,_be}.c. I don't expect anyone else would want to
include it, even in lib. So what you suggest is a bit of overthinking to
me. But if you insist, I can do that.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions
  2022-08-24  9:19   ` Andy Shevchenko
@ 2022-08-24 13:53     ` Yury Norov
  2022-08-24 17:54       ` Andy Shevchenko
  0 siblings, 1 reply; 23+ messages in thread
From: Yury Norov @ 2022-08-24 13:53 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 12:19:05PM +0300, Andy Shevchenko wrote:
> On Wed, Aug 24, 2022 at 4:56 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > Over the past couple years, the function _find_next_bit() was extended
> > with parameters that modify its behavior to implement and- zero- and le-
> > flavors. The parameters are passed at compile time, but current design
> > prevents a compiler from optimizing out the conditionals.
> >
> > As find_next_bit() API grows, I expect that more parameterss will be added.
> 
> parameters
> 
> > Current designs would require more conditional code in _find_next_bit(),
> > which would bloat the helper even more and make it barely readable.
> >
> > This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
> > a set of wrappers, so that the compile-time optimization becomes possible.
> >
> > The common logic is moved to the new macro, and all flavors may be
> > generated by providing an EXPRESSION macro parameter, like in this example:
> >
> >   #define FIND_NEXT_BIT(EXPRESSION, size, start) ...
> >
> >   find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
> >   {
> >         return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx], size, start);
> >   }
> >
> > The EXPRESSION may be of any complexity, as soon as it only refers
> > the bitmap(s) and an iterator idx.
> 
> ...
> 
> > +#define FIND_NEXT_BIT(EXPRESSION, size, start)                                 \
> > +({                                                                             \
> > +       unsigned long mask, idx, tmp, sz = (size), __start = (start);           \
> > +                                                                               \
> > +       if (unlikely(__start >= sz))                                            \
> > +               goto out;                                                       \
> > +                                                                               \
> > +       mask = word_op(BITMAP_FIRST_WORD_MASK(__start));                        \
> > +       idx = __start / BITS_PER_LONG;                                          \
> > +                                                                               \
> > +       for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) {             \
> 
> for (unsigned long tmp ...;
> But hey, why not loop over idx (which probably should be named as
> offset)

Offset in a structure, index in an array, isn't it?

> as I proposed in the first patch? You will drop a lot of
> divisions / multiplications, no?

Those divisions and multiplications are optimized away, and
what you suggested blows up the EXPRESSION.

I tried like this:
   mask = word_op(BITMAP_FIRST_WORD_MASK(__start));
   idx = __start / BITS_PER_LONG;
   tmp = (EXPRESSION);

   while (1) {
        if (tmp) {
               sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz);
               break;
        }

        if (++idx > sz)
                break;

        tmp = (EXPRESSION);
   } 

And it generated the same code, but it looks less expressive to me.
If you have some elegant approach in mind, can you please share it,
along with how the generated code looks?

Thanks,
Yury

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro
  2022-08-24 13:19     ` Yury Norov
@ 2022-08-24 14:18       ` David Laight
  2022-08-24 17:45       ` Andy Shevchenko
  1 sibling, 0 replies; 23+ messages in thread
From: David Laight @ 2022-08-24 14:18 UTC (permalink / raw)
  To: 'Yury Norov', Andy Shevchenko
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

...
> And generated code looks almost the same, except that
> on x86_64 your version is bigger. Compare before:
> 0000000000000000 <_find_first_bit>:
>    0:   mov    %rsi,%rax
>    3:   test   %rsi,%rsi
>    6:   je     35 <_find_first_bit+0x35>
>    8:   xor    %edx,%edx
>    a:   jmp    19 <_find_first_bit+0x19>
>    c:   add    $0x40,%rdx               // Track bits and
>   10:   add    $0x8,%rdi                // index separately

That add is free - it happens in parallel with other instructions

>   14:   cmp    %rax,%rdx
>   17:   jae    35 <_find_first_bit+0x35>

The instructions below will (probably/hopefully) be
speculatively executed in parallel with the cmp/jae above

>   19:   mov    (%rdi),%rcx
>   1c:   test   %rcx,%rcx
>   1f:   je     c <_find_first_bit+0xc>
>   21:   tzcnt  %rcx,%rcx
>   26:   add    %rdx,%rcx
>   29:   cmp    %rcx,%rax
>   2c:   cmova  %rcx,%rax
>   30:   jmp    35 <_find_first_bit+0x35>
>   35:   jmp    3a <_find_first_bit+0x3a>
>   3a:   nopw   0x0(%rax,%rax,1)
> 
> And after:
> 0000000000000000 <_find_first_bit>:
>    0:   mov    %rsi,%rax
>    3:   test   %rsi,%rsi
>    6:   je     39 <_find_first_bit+0x39>
>    8:   xor    %edx,%edx
>    a:   jmp    15 <_find_first_bit+0x15>
>    c:   add    $0x40,%rdx               // Track bits only
>   10:   cmp    %rdx,%rax
>   13:   jbe    39 <_find_first_bit+0x39>
>   15:   mov    %rdx,%rcx
>   18:   shr    $0x6,%rcx                // But divide here
>   1c:   mov    (%rdi,%rcx,8),%rcx
>   20:   test   %rcx,%rcx

That is a long register dependency chain involving %cx.
It will limit the execution speed to (at least 6) clocks/iteration.
The older version might be 3 clocks/iteration.
So this could easily run at half the speed.

	David

>   23:   je     c <_find_first_bit+0xc>
>   25:   tzcnt  %rcx,%rcx
>   2a:   add    %rcx,%rdx
>   2d:   cmp    %rdx,%rax
>   30:   cmova  %rdx,%rax
>   34:   jmp    39 <_find_first_bit+0x39>
>   39:   jmp    3e <_find_first_bit+0x3e>
>   3e:   xchg   %ax,%ax                  // Which adds 4 bytes to .text
> 
> Thanks,
> Yury
> 
> > > +               val = (EXPRESSION);                                             \
> > > +               if (val) {                                                      \
> > > +                       sz = min(idx * BITS_PER_LONG + __ffs(word_op(val)), sz);\
> >
> > sz = min(idx + __ffs(...));
> >
> > > +                       break;                                                  \
> > > +               }                                                               \
> > > +       }                                                                       \
> > > +                                                                               \
> > > +       sz;                                                                     \
> > > +})
> >
> >
> > --
> > With Best Regards,
> > Andy Shevchenko

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro
  2022-08-24 13:19     ` Yury Norov
  2022-08-24 14:18       ` David Laight
@ 2022-08-24 17:45       ` Andy Shevchenko
  1 sibling, 0 replies; 23+ messages in thread
From: Andy Shevchenko @ 2022-08-24 17:45 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 4:19 PM Yury Norov <yury.norov@gmail.com> wrote:
> On Wed, Aug 24, 2022 at 12:10:02PM +0300, Andy Shevchenko wrote:
> > On Wed, Aug 24, 2022 at 4:51 AM Yury Norov <yury.norov@gmail.com> wrote:

...

> > > +#define FIND_FIRST_BIT(EXPRESSION, size)                                       \
> > > +({                                                                             \
> > > +       unsigned long idx, val, sz = (size);                                    \
> > > +                                                                               \
> > > +       for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {                        \
> >
> > I think we can do slightly better:
> >
> > for (unsigned long idx = 0; idx < sz; idx += BITS_PER_LONG) {
> >   unsigned long val;
>
> This will blow up the EXPRESSION. We can mitigate it on user side:

I'm not sure I understand how EXPRESSION is involved in all this. What
I proposed is to change the for-loop from stepping one-by-one to stepping
by BITS_PER_LONG. But okay, I have re-read the above patch and now
I see what you are doing: basically, you use the internal variables of the
macro in the EXPRESSION. Hmm...

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()
  2022-08-24 13:37     ` Yury Norov
@ 2022-08-24 17:50       ` Andy Shevchenko
  2022-08-24 17:58       ` Russell King (Oracle)
  1 sibling, 0 replies; 23+ messages in thread
From: Andy Shevchenko @ 2022-08-24 17:50 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 4:37 PM Yury Norov <yury.norov@gmail.com> wrote:
> On Wed, Aug 24, 2022 at 12:22:33PM +0300, Andy Shevchenko wrote:
> > On Wed, Aug 24, 2022 at 5:17 AM Yury Norov <yury.norov@gmail.com> wrote:

...

> > > Like other find_*_le() functions, the new one takes void *addr, instead
> > > of unsigned long *. This should be fixed for all in a separate series.
> >
> > From this comment it is unclear to me why we can't fix them first and
> > then apply this with the correct type?
>
> Because there is a codebase that relies on existing types, mostly in
> filesystem code. And those fs fixes would require 5 or 6 patches.
>
> This would triple the length of this series, and is completely
> unrelated. That's why I think that:
>         > > This should be fixed for all in a separate series.

So comment update then, if a new version is required?

...

> > > +#define word_op swab
> > > +#include "find_bit.h"
> >
> > Looking at this, I would rather always require to define __ffs_word_op
> > (or whatever name) in the user and replace #ifndef in the find_bit.h
> > with
> > #error "The __ffs_word_op must be defined before including find_bit.h!"
>
> This is a local header which is not intended to be included anywhere
> except lib/find_bit{,_be}.c. I don't expect someone else would want to
> include it, even in lib. So what you suggest is a bit overthinking to
> me. But if you insist, I can do that.

Basically, by the above you have assured me that #error is the right
approach; please do it that way and we will definitely catch the
incorrect users (even by `git grep -lw __ffs_word_op`, if they slip
into the kernel somehow).

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions
  2022-08-24 13:53     ` Yury Norov
@ 2022-08-24 17:54       ` Andy Shevchenko
  2022-08-24 17:56         ` Andy Shevchenko
  0 siblings, 1 reply; 23+ messages in thread
From: Andy Shevchenko @ 2022-08-24 17:54 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 4:53 PM Yury Norov <yury.norov@gmail.com> wrote:
> On Wed, Aug 24, 2022 at 12:19:05PM +0300, Andy Shevchenko wrote:
> > On Wed, Aug 24, 2022 at 4:56 AM Yury Norov <yury.norov@gmail.com> wrote:

...

> > > +#define FIND_NEXT_BIT(EXPRESSION, size, start)                                 \
> > > +({                                                                             \
> > > +       unsigned long mask, idx, tmp, sz = (size), __start = (start);           \
> > > +                                                                               \
> > > +       if (unlikely(__start >= sz))                                            \
> > > +               goto out;                                                       \
> > > +                                                                               \
> > > +       mask = word_op(BITMAP_FIRST_WORD_MASK(__start));                        \
> > > +       idx = __start / BITS_PER_LONG;                                          \
> > > +                                                                               \
> > > +       for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) {             \
> >
> > for (unsigned long tmp ...;
> > But hey, why not loop over idx (which probably should be named as
> > offset)
>
> Offset in structure, index in array, isn't?
>
> > as I proposed in the first patch? You will drop a lot of
> > divisions / multiplications, no?
>
> Those divisions and multiplications are optimized away, and
> what you suggested blows up the EXPRESSION.
>
> I tried like this:
>    mask = word_op(BITMAP_FIRST_WORD_MASK(__start));
>    idx = __start / BITS_PER_LONG;
>    tmp = (EXPRESSION);
>
>    while (1) {
>         if (tmp) {
>                sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz);
>                break;
>         }
>
>         if (++idx > sz)
>                 break;
>
>         tmp = (EXPRESSION);
>    }
>
> And it generated the same code, but looks less expressive to me.
> If you have some elegant approach in mind - can you please share
> it, and how the generated code looks?

for (unsigned long idx = 0; idx < sz; idx++) {
  unsigned long tmp;

  tmp = (EXPRESSION);
  if (tmp) {
    ...
  }
}

No?

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions
  2022-08-24 17:54       ` Andy Shevchenko
@ 2022-08-24 17:56         ` Andy Shevchenko
  2022-08-24 21:27           ` Yury Norov
  0 siblings, 1 reply; 23+ messages in thread
From: Andy Shevchenko @ 2022-08-24 17:56 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 8:54 PM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> On Wed, Aug 24, 2022 at 4:53 PM Yury Norov <yury.norov@gmail.com> wrote:
> > On Wed, Aug 24, 2022 at 12:19:05PM +0300, Andy Shevchenko wrote:
> > > On Wed, Aug 24, 2022 at 4:56 AM Yury Norov <yury.norov@gmail.com> wrote:

...

> > > > +#define FIND_NEXT_BIT(EXPRESSION, size, start)                                 \
> > > > +({                                                                             \
> > > > +       unsigned long mask, idx, tmp, sz = (size), __start = (start);           \
> > > > +                                                                               \
> > > > +       if (unlikely(__start >= sz))                                            \
> > > > +               goto out;                                                       \
> > > > +                                                                               \
> > > > +       mask = word_op(BITMAP_FIRST_WORD_MASK(__start));                        \
> > > > +       idx = __start / BITS_PER_LONG;                                          \
> > > > +                                                                               \
> > > > +       for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) {             \
> > >
> > > for (unsigned long tmp ...;
> > > But hey, why not loop over idx (which probably should be named as
> > > offset)
> >
> > Offset in a structure, index in an array, isn't it?
> >
> > > as I proposed in the first patch? You will drop a lot of
> > > divisions / multiplications, no?
> >
> > Those divisions and multiplications are optimized away, and
> > what you suggested blows up the EXPRESSION.
> >
> > I tried like this:
> >    mask = word_op(BITMAP_FIRST_WORD_MASK(__start));
> >    idx = __start / BITS_PER_LONG;
> >    tmp = (EXPRESSION);
> >
> >    while (1) {
> >         if (tmp) {
> >                sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz);
> >                break;
> >         }
> >
> >         if (++idx > sz)
> >                 break;
> >
> >         tmp = (EXPRESSION);
> >    }
> >
> > And it generated the same code, but looks less expressive to me.
> > If you have some elegant approach in mind - can you please share
> > it, and how the generated code looks?
>
> for (unsigned long idx = 0; idx < sz; idx++) {

Of course, 0 should be changed to whatever start you have there.

>   unsigned long tmp;
>
>   tmp = (EXPRESSION);
>   if (tmp) {
>     ...
>   }
> }
>
> No?

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()
  2022-08-24 13:37     ` Yury Norov
  2022-08-24 17:50       ` Andy Shevchenko
@ 2022-08-24 17:58       ` Russell King (Oracle)
  2022-08-24 20:03         ` Yury Norov
  1 sibling, 1 reply; 23+ messages in thread
From: Russell King (Oracle) @ 2022-08-24 17:58 UTC (permalink / raw)
  To: Yury Norov
  Cc: Andy Shevchenko, Linus Torvalds, Linux Kernel Mailing List,
	Guenter Roeck, Dennis Zhou, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 06:37:45AM -0700, Yury Norov wrote:
> Because there is a codebase that relies on existing types, mostly in
> filesystem code. And those fs fixes would require 5 or 6 patches.

Does that mean that there are filesystems passing pointers to the
find_bit functions which are _not_ aligned to an "unsigned long"?

If there are, we should _not_ convert 32-bit ARM to use word loads or
use the generic code, at least not for older ARM CPUs; unaligned loads
are expensive on those.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro
  2022-08-24  1:26 ` [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro Yury Norov
  2022-08-24  9:10   ` Andy Shevchenko
@ 2022-08-24 17:59   ` Linus Torvalds
  1 sibling, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2022-08-24 17:59 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, Guenter Roeck, Dennis Zhou, Russell King,
	Catalin Marinas, Andy Shevchenko, Rasmus Villemoes,
	Alexey Klimov, Kees Cook, Andy Whitcroft

On Tue, Aug 23, 2022 at 6:26 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> +
> +#ifndef word_op
> +#define word_op
> +#endif
> +
> +#define FIND_FIRST_BIT(EXPRESSION, size)                                       \

Please just make 'word_op' a macro argument, the same way 'EXPRESSION' is.

That way the LE/BE cases can be handled without any odd tricks.

              Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()
  2022-08-24  1:26 ` [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le() Yury Norov
  2022-08-24  9:22   ` Andy Shevchenko
@ 2022-08-24 18:11   ` Linus Torvalds
  2022-08-24 22:09     ` Yury Norov
  1 sibling, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2022-08-24 18:11 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, Guenter Roeck, Dennis Zhou, Russell King,
	Catalin Marinas, Andy Shevchenko, Rasmus Villemoes,
	Alexey Klimov, Kees Cook, Andy Whitcroft

Ugh, and here is the "odd trick" immediately afterwards:

On Tue, Aug 23, 2022 at 6:26 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> +++ b/lib/find_bit_be.c
> @@ -0,0 +1,23 @@
> +
> +#define word_op swab
> +#include "find_bit.h"
> +
> +#ifndef find_first_zero_bit_le
> +/*
> + * Find the first cleared bit in an LE memory region.
> + */
> +unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
> +{
> +       return FIND_FIRST_BIT(~addr[idx], size);
> +}
> +EXPORT_SYMBOL(_find_first_zero_bit_le);
> +#endif

I looked at that "_find_first_zero_bit_le()" code for a long time and said
"that can't be right".

Until I noticed the

> +#define word_op swab

that was several lines above it, and not even close to the use of it.

So no.

Just make the macro take that "last word op" as another argument, and
then you don't need these tricks, and you can do the _le() version in
lib/find.c file *exactly* like the regular version, using just

  #ifdef __BIG_ENDIAN
  unsigned long find_first_zero_bit_le(const unsigned long *addr,
unsigned long size)
  { return FIND_FIRST_BIT(~addr[idx], swab(val), size); }
  #else
  unsigned long find_first_zero_bit_le(const unsigned long *addr,
unsigned long size) __alias(find_first_zero_bit);
  #endif

or something like that.

And yes, it means that the "regular" versions need to pass in just
"val" as a last-word expression, but who cares? It's still cleaner,
and easier to explain: "The first expression finds the word that
matches our requirement, the second expression munges the word we
found into the bit order we use".

Or something.

         Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()
  2022-08-24 17:58       ` Russell King (Oracle)
@ 2022-08-24 20:03         ` Yury Norov
  0 siblings, 0 replies; 23+ messages in thread
From: Yury Norov @ 2022-08-24 20:03 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andy Shevchenko, Linus Torvalds, Linux Kernel Mailing List,
	Guenter Roeck, Dennis Zhou, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 06:58:10PM +0100, Russell King (Oracle) wrote:
> On Wed, Aug 24, 2022 at 06:37:45AM -0700, Yury Norov wrote:
> > Because there is a codebase that relies on existing types, mostly in
> > filesystem code. And those fs fixes would require 5 or 6 patches.
> 
> Does that mean that are there filesystems that are passing pointers to
> the find_bit functions which are _not_ aligned to an "unsigned long" ?

Not necessarily. For example, I looked at the ext2/4 code, and it looks
like they need a void *bitmap, because they pass the opaque pointer
->b_data from struct buffer_head:

  struct buffer_head {
        ...
        char *b_data;                   /* pointer to data within the page */
        ...
  }

So, quite probably, the pointer is aligned when it points to a bitmap.
But I have nothing to prove it, except the standard anthropic principle:
"otherwise people would complain".

In the LE case, the find_bit_le() functions are aliased to find_bit(),
which does aligned word accesses, and somehow it works.
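
One way that aliasing can look on little-endian builds, borrowing the
__alias() form Linus shows elsewhere in this thread (a sketch only; the
exact symbol names here are an assumption, not the final patch):

   #ifndef __BIG_ENDIAN
   /* LE bitmaps and native bitmaps have the same layout, so just alias */
   unsigned long _find_first_zero_bit_le(const unsigned long *addr,
					 unsigned long size)
	__alias(_find_first_zero_bit);
   #endif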

> If there are, we should _not_ convert 32-bit ARM to use word loads or
> use the generic code; unaligned loads are expensive on older ARM CPUs,
> at least not the code for older ARM CPUs.

I wonder if there's an arch that throws an exception on unaligned
access. IIRC, ARM can do that, but this option is disabled in the kernel.
Right?

I have a series that adds a runtime parameter checker for bitmaps.
I'll add a test for alignment and see how it works (not very soon).
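
Something along these lines would be enough to catch misaligned callers
at runtime; the helper name is made up here, it's just a sketch of the
check, not the actual series:

   /* Hypothetical check for the bitmap parameter checker */
   static __always_inline void bitmap_check_align(const unsigned long *addr)
   {
	WARN_ON_ONCE(!IS_ALIGNED((unsigned long)addr, sizeof(unsigned long)));
   }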

Thanks,
Yury

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions
  2022-08-24 17:56         ` Andy Shevchenko
@ 2022-08-24 21:27           ` Yury Norov
  0 siblings, 0 replies; 23+ messages in thread
From: Yury Norov @ 2022-08-24 21:27 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Linus Torvalds, Linux Kernel Mailing List, Guenter Roeck,
	Dennis Zhou, Russell King, Catalin Marinas, Andy Shevchenko,
	Rasmus Villemoes, Alexey Klimov, Kees Cook, Andy Whitcroft

On Wed, Aug 24, 2022 at 08:56:02PM +0300, Andy Shevchenko wrote:
> On Wed, Aug 24, 2022 at 8:54 PM Andy Shevchenko
> <andy.shevchenko@gmail.com> wrote:
> > On Wed, Aug 24, 2022 at 4:53 PM Yury Norov <yury.norov@gmail.com> wrote:
> > > On Wed, Aug 24, 2022 at 12:19:05PM +0300, Andy Shevchenko wrote:
> > > > On Wed, Aug 24, 2022 at 4:56 AM Yury Norov <yury.norov@gmail.com> wrote:
> 
> ...
> 
> > > > > +#define FIND_NEXT_BIT(EXPRESSION, size, start)                                 \
> > > > > +({                                                                             \
> > > > > +       unsigned long mask, idx, tmp, sz = (size), __start = (start);           \
> > > > > +                                                                               \
> > > > > +       if (unlikely(__start >= sz))                                            \
> > > > > +               goto out;                                                       \
> > > > > +                                                                               \
> > > > > +       mask = word_op(BITMAP_FIRST_WORD_MASK(__start));                        \
> > > > > +       idx = __start / BITS_PER_LONG;                                          \
> > > > > +                                                                               \
> > > > > +       for (tmp = (EXPRESSION) & mask; !tmp; tmp = (EXPRESSION)) {             \
> > > >
> > > > for (unsigned long tmp ...;
> > > > But hey, why not loop over idx (which probably should be named as
> > > > offset)
> > >
> > > Offset in a structure, index in an array, isn't it?
> > >
> > > > as I proposed in the first patch? You will drop a lot of
> > > > divisions / multiplications, no?
> > >
> > > Those divisions and multiplications are optimized away, and
> > > what you suggested blows up the EXPRESSION.
> > >
> > > I tried like this:
> > >    mask = word_op(BITMAP_FIRST_WORD_MASK(__start));
> > >    idx = __start / BITS_PER_LONG;
> > >    tmp = (EXPRESSION);
> > >
> > >    while (1) {
> > >         if (tmp) {
> > >                sz = min(idx * BITS_PER_LONG + __ffs(word_op(tmp)), sz);
> > >                break;
> > >         }
> > >
> > >         if (++idx > sz)
> > >                 break;
> > >
> > >         tmp = (EXPRESSION);
> > >    }
> > >
> > > And it generated the same code, but looks less expressive to me.
> > > If you have some elegant approach in mind - can you please share
> > > it, and how the generated code looks?
> >
> > for (unsigned long idx = 0; idx < sz; idx++) {
> 
> Of course, 0 should be changed to whatever start you have there.
> 
> >   unsigned long tmp;
> >
> >   tmp = (EXPRESSION);
> >   if (tmp) {
> >     ...
> >   }
> > }
> >
> > No?

No. For the first iteration, tmp can't be calculated inside the loop
(my example above is wrong) because we need to mask off the bits below
'start' first:

   mask = BITMAP_FIRST_WORD_MASK(__start);
   idx = __start / BITS_PER_LONG;
   tmp = (EXPRESSION) & mask;   // First fetch is here

   while (1) {
        if (tmp) {              // Evaluate here
               sz = min(idx * BITS_PER_LONG + __ffs(tmp), sz);
               break;
        }

        if (++idx > sz)         // Increment here
                break;

        tmp = (EXPRESSION);     // Other fetches here
   }

Trying to move the iterator increment into the for-loop, as you suggested,
would break that sequence: the common-case word fetch would happen before
the idx++.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le()
  2022-08-24 18:11   ` Linus Torvalds
@ 2022-08-24 22:09     ` Yury Norov
  0 siblings, 0 replies; 23+ messages in thread
From: Yury Norov @ 2022-08-24 22:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Guenter Roeck, Dennis Zhou, Russell King,
	Catalin Marinas, Andy Shevchenko, Rasmus Villemoes,
	Alexey Klimov, Kees Cook, Andy Whitcroft

> Just make the macro take that "last word op" as another argument, and
> then you don't need these tricks, and you can do the _le() version in
> lib/find.c file *exactly* like the regular version, using just
> 
>   #ifdef __BIG_ENDIAN
>   unsigned long find_first_zero_bit_le(const unsigned long *addr,
> unsigned long size)
>   { return FIND_FIRST_BIT(~addr[idx], swab(val), size); }
>   #else
>   unsigned long find_first_zero_bit_le(const unsigned long *addr,
> unsigned long size) __alias(find_first_zero_bit);
>   #endif
> 
> or something like that.
> 
> And yes, it means that the "regular" versions need to pass in just
> "val" as a last-word expression, but who cares? It's still cleaner,
> and easier to explain: "The first expression finds the word that
> matches our requirement, the second expression munges the word we
> found into the bit order we use".

OK. I'll do it like this:

   #define FIND_FIRST_BIT(FETCH, MUNG, size) ...

   unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
   {
         return FIND_FIRST_BIT(~addr[idx], /* NOP */, size);
   }

   unsigned long find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
   {
         return FIND_FIRST_BIT(~addr[idx], swab, size);
   }

It doesn't actually look that much worse.
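
A rough sketch of how the macro body could compose the two expressions
(not the final patch; the exact loop shape may still change):

   /*
    * Sketch only: FETCH produces the word to scan, MUNG (swab, or an empty
    * argument that expands to nothing) fixes the bit order before __ffs().
    */
   #define FIND_FIRST_BIT(FETCH, MUNG, size)					\
   ({										\
	unsigned long idx, val, sz = (size);					\
										\
	for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {			\
		val = (FETCH);							\
		if (val) {							\
			sz = min(idx * BITS_PER_LONG + __ffs(MUNG(val)), sz);	\
			break;							\
		}								\
	}									\
										\
	sz;									\
   })

With an empty MUNG argument, MUNG(val) simply expands to (val), so the
generic flavors don't pay anything for the extra parameter.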

Thanks,
Yury

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-08-24 22:11 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-24  1:26 [PATCH v2 0/4] lib: optimize find_bit() functions Yury Norov
2022-08-24  1:26 ` [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro Yury Norov
2022-08-24  9:10   ` Andy Shevchenko
2022-08-24 13:19     ` Yury Norov
2022-08-24 14:18       ` David Laight
2022-08-24 17:45       ` Andy Shevchenko
2022-08-24 17:59   ` Linus Torvalds
2022-08-24  1:26 ` [PATCH v2 2/3] lib/find_bit: create find_first_zero_bit_le() Yury Norov
2022-08-24  9:22   ` Andy Shevchenko
2022-08-24  9:24     ` Andy Shevchenko
2022-08-24 13:37     ` Yury Norov
2022-08-24 17:50       ` Andy Shevchenko
2022-08-24 17:58       ` Russell King (Oracle)
2022-08-24 20:03         ` Yury Norov
2022-08-24 18:11   ` Linus Torvalds
2022-08-24 22:09     ` Yury Norov
2022-08-24  1:26 ` [PATCH v2 3/3] lib/find_bit: optimize find_next_bit() functions Yury Norov
2022-08-24  9:19   ` Andy Shevchenko
2022-08-24 13:53     ` Yury Norov
2022-08-24 17:54       ` Andy Shevchenko
2022-08-24 17:56         ` Andy Shevchenko
2022-08-24 21:27           ` Yury Norov
2022-08-24  9:00 ` [PATCH v2 0/4] lib: optimize find_bit() functions Andy Shevchenko
