linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/4] lib: optimize find_bit() functions
@ 2022-09-15  2:07 Yury Norov
  2022-09-15  2:07 ` [PATCH v4 1/4] lib/find_bit: introduce FIND_FIRST_BIT() macro Yury Norov
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Yury Norov @ 2022-09-15  2:07 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Yury Norov, Alexey Klimov, Andy Shevchenko, Andy Whitcroft,
	Catalin Marinas, David Laight, Dennis Zhou, Guenter Roeck,
	Kees Cook, Rasmus Villemoes, Valentin Schneider, Sven Schnelle,
	Russell King

In the recent discussion, it was noticed that find_next_bit() functions may
be improved by adding wrappers around common __find_next_bit() in .c file.

As suggested by Linus, I tried the meta-programming trick with the
EXPRESSION macro, which is passed from wrapper into find_bit()
helpers:

  #define BIT_FIND_BODY(addr, size, start, EXPRESSION)          \
        BIT_FIND_SETUP(addr, size, start)                       \
        BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)      \
        BIT_WORD_LOOP(addr, size, idx, val, EXPRESSION)         \
        return size;                                            \
  found:        BIT_WORD_SWAB(val);                             \
        return min((idx)*BITS_PER_LONG + __ffs(val), size)

  unsigned long _find_next_and_bit(const unsigned long *addr1,
                                 const unsigned long *addr2,
                                 unsigned long size,
                                 unsigned long start)
  { BIT_FIND_BODY(addr, size, start, addr1[idx] & addr2[idx]); }

I appreciated the potential of how the EXPRESSION works, but I don't like
that the resulting macro is constructed from pieces because it makes it
harder to understand what happens behind the ifdefery. Checkpatch isn't
happy as well because the final macro contains 'return' statement; and I
would agree that it's better to avoid it.

I spun the idea one more time, trying to make FIND helper a more or
less standard looking macro.

This new approach saves 10-11K of Image size, and is 15% faster in the
performance benchmark. See the 3rd patch for some statistics.

v1: https://lore.kernel.org/all/20220728161208.865420-2-yury.norov@gmail.com/T/
v2: https://lore.kernel.org/lkml/YwaXvphVpy5A7fSs@yury-laptop/t/
v3: https://lore.kernel.org/all/xhsmhedwnb15r.mognet@vschneid.remote.csb/T/
v4:
 - fix for-loop break condition in FIND_NEXT_BIT;
 - add review tags from Valentin Schneider.

Yury Norov (4):
  lib/find_bit: introduce FIND_FIRST_BIT() macro
  lib/find_bit: create find_first_zero_bit_le()
  lib/find_bit: optimize find_next_bit() functions
  tools: sync find_bit() implementation

 include/linux/find.h       |  46 +++++++---
 lib/find_bit.c             | 178 ++++++++++++++++++++++---------------
 tools/include/linux/find.h |  61 +++----------
 tools/lib/find_bit.c       | 149 ++++++++++++++-----------------
 4 files changed, 220 insertions(+), 214 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v4 1/4] lib/find_bit: introduce FIND_FIRST_BIT() macro
  2022-09-15  2:07 [PATCH v4 0/4] lib: optimize find_bit() functions Yury Norov
@ 2022-09-15  2:07 ` Yury Norov
  2022-09-15  2:07 ` [PATCH v4 2/4] lib/find_bit: create find_first_zero_bit_le() Yury Norov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Yury Norov @ 2022-09-15  2:07 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Yury Norov, Alexey Klimov, Andy Shevchenko, Andy Whitcroft,
	Catalin Marinas, David Laight, Dennis Zhou, Guenter Roeck,
	Kees Cook, Rasmus Villemoes, Valentin Schneider, Sven Schnelle,
	Russell King

Now that we have many flavors of find_first_bit(), and expect even more,
it's better to have one macro that generates optimal code for all and makes
maintaining of slightly different functions simpler.

The logic common to all versions is moved to the new macro, and all the
flavors are generated by providing an FETCH macro-parameter, like
in this example:

  #define FIND_FIRST_BIT(FETCH, MUNGE, size) ...

  find_first_ornot_and_bit(addr1, addr2, addr3, size)
  {
        return FIND_FIRST_BIT(addr1[idx] | ~addr2[idx] & addr3[idx], /* nop */, size);
  }

The FETCH may be of any complexity, as soon as it only refers
the bitmap(s) and an iterator idx.

MUNGE is here to support _le code generation for BE builds. May be
empty.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 lib/find_bit.c | 49 ++++++++++++++++++++++++-------------------------
 1 file changed, 24 insertions(+), 25 deletions(-)

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 1b8e4b2a9cba..894b656f6836 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -19,6 +19,27 @@
 #include <linux/minmax.h>
 #include <linux/swab.h>
 
+/*
+ * Common helper for find_bit() function family
+ * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
+ * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
+ * @size: The bitmap size in bits
+ */
+#define FIND_FIRST_BIT(FETCH, MUNGE, size)					\
+({										\
+	unsigned long idx, val, sz = (size);					\
+										\
+	for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {			\
+		val = (FETCH);							\
+		if (val) {							\
+			sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(val)), sz);	\
+			break;							\
+		}								\
+	}									\
+										\
+	sz;									\
+})
+
 #if !defined(find_next_bit) || !defined(find_next_zero_bit) ||			\
 	!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) ||	\
 	!defined(find_next_and_bit)
@@ -77,14 +98,7 @@ EXPORT_SYMBOL(_find_next_bit);
  */
 unsigned long _find_first_bit(const unsigned long *addr, unsigned long size)
 {
-	unsigned long idx;
-
-	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
-		if (addr[idx])
-			return min(idx * BITS_PER_LONG + __ffs(addr[idx]), size);
-	}
-
-	return size;
+	return FIND_FIRST_BIT(addr[idx], /* nop */, size);
 }
 EXPORT_SYMBOL(_find_first_bit);
 #endif
@@ -97,15 +111,7 @@ unsigned long _find_first_and_bit(const unsigned long *addr1,
 				  const unsigned long *addr2,
 				  unsigned long size)
 {
-	unsigned long idx, val;
-
-	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
-		val = addr1[idx] & addr2[idx];
-		if (val)
-			return min(idx * BITS_PER_LONG + __ffs(val), size);
-	}
-
-	return size;
+	return FIND_FIRST_BIT(addr1[idx] & addr2[idx], /* nop */, size);
 }
 EXPORT_SYMBOL(_find_first_and_bit);
 #endif
@@ -116,14 +122,7 @@ EXPORT_SYMBOL(_find_first_and_bit);
  */
 unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size)
 {
-	unsigned long idx;
-
-	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
-		if (addr[idx] != ~0UL)
-			return min(idx * BITS_PER_LONG + ffz(addr[idx]), size);
-	}
-
-	return size;
+	return FIND_FIRST_BIT(~addr[idx], /* nop */, size);
 }
 EXPORT_SYMBOL(_find_first_zero_bit);
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 2/4] lib/find_bit: create find_first_zero_bit_le()
  2022-09-15  2:07 [PATCH v4 0/4] lib: optimize find_bit() functions Yury Norov
  2022-09-15  2:07 ` [PATCH v4 1/4] lib/find_bit: introduce FIND_FIRST_BIT() macro Yury Norov
@ 2022-09-15  2:07 ` Yury Norov
  2022-09-19 13:45   ` Andy Shevchenko
  2022-09-15  2:07 ` [PATCH v4 3/4] lib/find_bit: optimize find_next_bit() functions Yury Norov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Yury Norov @ 2022-09-15  2:07 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Yury Norov, Alexey Klimov, Andy Shevchenko, Andy Whitcroft,
	Catalin Marinas, David Laight, Dennis Zhou, Guenter Roeck,
	Kees Cook, Rasmus Villemoes, Valentin Schneider, Sven Schnelle,
	Russell King

find_first_zero_bit_le() is an alias to find_next_zero_bit_le(),
despite that 'next' is known to be slower than 'first' version.

Now that we have common FIND_FIRST_BIT() macro helper, it's trivial
to implement find_first_zero_bit_le() as a real function.

Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/find.h | 23 ++++++++++++++++++-----
 lib/find_bit.c       | 14 ++++++++++++++
 2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index 424ef67d4a42..2464bff5de04 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -17,6 +17,10 @@ extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long size);
 
+#ifdef __BIG_ENDIAN
+unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size);
+#endif
+
 #ifndef find_next_bit
 /**
  * find_next_bit - find the next set bit in a memory region
@@ -251,6 +255,20 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
 }
 #endif
 
+#ifndef find_first_zero_bit_le
+static inline
+unsigned long find_first_zero_bit_le(const void *addr, unsigned long size)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val = swab(*(const unsigned long *)addr) | ~GENMASK(size - 1, 0);
+
+		return val == ~0UL ? size : ffz(val);
+	}
+
+	return _find_first_zero_bit_le(addr, size);
+}
+#endif
+
 #ifndef find_next_bit_le
 static inline
 unsigned long find_next_bit_le(const void *addr, unsigned
@@ -270,11 +288,6 @@ unsigned long find_next_bit_le(const void *addr, unsigned
 }
 #endif
 
-#ifndef find_first_zero_bit_le
-#define find_first_zero_bit_le(addr, size) \
-	find_next_zero_bit_le((addr), (size), 0)
-#endif
-
 #else
 #error "Please fix <asm/byteorder.h>"
 #endif
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 894b656f6836..f4d9b9684811 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -160,3 +160,17 @@ unsigned long find_next_clump8(unsigned long *clump, const unsigned long *addr,
 	return offset;
 }
 EXPORT_SYMBOL(find_next_clump8);
+
+#ifdef __BIG_ENDIAN
+#ifndef find_first_zero_bit_le
+/*
+ * Find the first cleared bit in an LE memory region.
+ */
+unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
+{
+	return FIND_FIRST_BIT(~addr[idx], swab, size);
+}
+EXPORT_SYMBOL(_find_first_zero_bit_le);
+#endif
+
+#endif /* __BIG_ENDIAN */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 3/4] lib/find_bit: optimize find_next_bit() functions
  2022-09-15  2:07 [PATCH v4 0/4] lib: optimize find_bit() functions Yury Norov
  2022-09-15  2:07 ` [PATCH v4 1/4] lib/find_bit: introduce FIND_FIRST_BIT() macro Yury Norov
  2022-09-15  2:07 ` [PATCH v4 2/4] lib/find_bit: create find_first_zero_bit_le() Yury Norov
@ 2022-09-15  2:07 ` Yury Norov
  2022-09-15 16:25   ` Valentin Schneider
  2022-09-19 13:45   ` Andy Shevchenko
  2022-09-15  2:07 ` [PATCH v4 4/4] tools: sync find_bit() implementation Yury Norov
  2022-09-21 15:40 ` [PATCH v4 0/4] lib: optimize find_bit() functions Yury Norov
  4 siblings, 2 replies; 12+ messages in thread
From: Yury Norov @ 2022-09-15  2:07 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Yury Norov, Alexey Klimov, Andy Shevchenko, Andy Whitcroft,
	Catalin Marinas, David Laight, Dennis Zhou, Guenter Roeck,
	Kees Cook, Rasmus Villemoes, Valentin Schneider, Sven Schnelle,
	Russell King

Over the past couple years, the function _find_next_bit() was extended
with parameters that modify its behavior to implement and- zero- and le-
flavors. The parameters are passed at compile time, but current design
prevents a compiler from optimizing out the conditionals.

As find_next_bit() API grows, I expect that more parameters will be added.
Current design would require more conditional code in _find_next_bit(),
which would bloat the helper even more and make it barely readable.

This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
a set of wrappers, so that the compile-time optimizations become possible.

The common logic is moved to the new macro, and all flavors may be
generated by providing a FETCH macro parameter, like in this example:

  #define FIND_NEXT_BIT(FETCH, MUNGE, size, start) ...

  find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
  {
	return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx],
				/* nop */, size, start);
  }

The FETCH may be of any complexity, as soon as it only refers the bitmap(s)
and an iterator idx.

MUNGE is here to support _le code generation for BE builds. May be
empty.

I ran find_bit_benchmark 16 times on top of 6.0-rc2 and 16 times on top
of 6.0-rc2 + this series. The results for kvm/x86_64 are:

                      v6.0-rc2  Optimized       Difference  Z-score
Random dense bitmap         ns         ns        ns      %
find_next_bit:          787735     670546    117189   14.9     3.97
find_next_zero_bit:     777492     664208    113284   14.6    10.51
find_last_bit:          830925     687573    143352   17.3     2.35
find_first_bit:        3874366    3306635    567731   14.7     1.84
find_first_and_bit:   40677125   37739887   2937238    7.2     1.36
find_next_and_bit:      347865     304456     43409   12.5     1.35

Random sparse bitmap
find_next_bit:           19816      14021      5795   29.2     6.10
find_next_zero_bit:    1318901    1223794     95107    7.2     1.41
find_last_bit:           14573      13514      1059    7.3     6.92
find_first_bit:        1313321    1249024     64297    4.9     1.53
find_first_and_bit:       8921       8098       823    9.2     4.56
find_next_and_bit:        9796       7176      2620   26.7     5.39

Where the statistics is significant (z-score > 3), the improvement
is ~15%.

According to the bloat-o-meter, the Image size is 10-11K less:

x86_64/defconfig:
add/remove: 32/14 grow/shrink: 61/782 up/down: 6344/-16521 (-10177)

arm64/defconfig:
add/remove: 3/2 grow/shrink: 50/714 up/down: 608/-11556 (-10948)

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/find.h |  23 ++++++---
 lib/find_bit.c       | 119 +++++++++++++++++++++++++------------------
 2 files changed, 85 insertions(+), 57 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index 2464bff5de04..dead6f53a97b 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -8,9 +8,12 @@
 
 #include <linux/bitops.h>
 
-extern unsigned long _find_next_bit(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le);
+unsigned long _find_next_bit(const unsigned long *addr1, unsigned long nbits,
+				unsigned long start);
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+					unsigned long nbits, unsigned long start);
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+					 unsigned long start);
 extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
@@ -19,6 +22,10 @@ extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long siz
 
 #ifdef __BIG_ENDIAN
 unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size);
+unsigned long _find_next_zero_bit_le(const  unsigned long *addr, unsigned
+					long size, unsigned long offset);
+unsigned long _find_next_bit_le(const unsigned long *addr, unsigned
+				long size, unsigned long offset);
 #endif
 
 #ifndef find_next_bit
@@ -45,7 +52,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+	return _find_next_bit(addr, size, offset);
 }
 #endif
 
@@ -75,7 +82,7 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+	return _find_next_and_bit(addr1, addr2, size, offset);
 }
 #endif
 
@@ -103,7 +110,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+	return _find_next_zero_bit(addr, size, offset);
 }
 #endif
 
@@ -251,7 +258,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+	return _find_next_zero_bit_le(addr, size, offset);
 }
 #endif
 
@@ -284,7 +291,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+	return _find_next_bit_le(addr, size, offset);
 }
 #endif
 
diff --git a/lib/find_bit.c b/lib/find_bit.c
index f4d9b9684811..8707b4ef3e5e 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -40,57 +40,33 @@
 	sz;									\
 })
 
-#if !defined(find_next_bit) || !defined(find_next_zero_bit) ||			\
-	!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) ||	\
-	!defined(find_next_and_bit)
 /*
- * This is a common helper function for find_next_bit, find_next_zero_bit, and
- * find_next_and_bit. The differences are:
- *  - The "invert" argument, which is XORed with each fetched word before
- *    searching it for one bits.
- *  - The optional "addr2", which is anded with "addr1" if present.
+ * Common helper for find_next_bit() function family
+ * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
+ * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
+ * @size: The bitmap size in bits
+ * @start: The bitnumber to start searching at
  */
-unsigned long _find_next_bit(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le)
-{
-	unsigned long tmp, mask;
-
-	if (unlikely(start >= nbits))
-		return nbits;
-
-	tmp = addr1[start / BITS_PER_LONG];
-	if (addr2)
-		tmp &= addr2[start / BITS_PER_LONG];
-	tmp ^= invert;
-
-	/* Handle 1st word. */
-	mask = BITMAP_FIRST_WORD_MASK(start);
-	if (le)
-		mask = swab(mask);
-
-	tmp &= mask;
-
-	start = round_down(start, BITS_PER_LONG);
-
-	while (!tmp) {
-		start += BITS_PER_LONG;
-		if (start >= nbits)
-			return nbits;
-
-		tmp = addr1[start / BITS_PER_LONG];
-		if (addr2)
-			tmp &= addr2[start / BITS_PER_LONG];
-		tmp ^= invert;
-	}
-
-	if (le)
-		tmp = swab(tmp);
-
-	return min(start + __ffs(tmp), nbits);
-}
-EXPORT_SYMBOL(_find_next_bit);
-#endif
+#define FIND_NEXT_BIT(FETCH, MUNGE, size, start)				\
+({										\
+	unsigned long mask, idx, tmp, sz = (size), __start = (start);		\
+										\
+	if (unlikely(__start >= sz))						\
+		goto out;							\
+										\
+	mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start));				\
+	idx = __start / BITS_PER_LONG;						\
+										\
+	for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) {			\
+		if ((idx + 1) * BITS_PER_LONG >= sz)				\
+			goto out;						\
+		idx++;								\
+	}									\
+										\
+	sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz);			\
+out:										\
+	sz;									\
+})
 
 #ifndef find_first_bit
 /*
@@ -127,6 +103,32 @@ unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size
 EXPORT_SYMBOL(_find_first_zero_bit);
 #endif
 
+#ifndef find_next_bit
+unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
+{
+	return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
+}
+EXPORT_SYMBOL(_find_next_bit);
+#endif
+
+#ifndef find_next_and_bit
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+					unsigned long nbits, unsigned long start)
+{
+	return FIND_NEXT_BIT(addr1[idx] & addr2[idx], /* nop */, nbits, start);
+}
+EXPORT_SYMBOL(_find_next_and_bit);
+#endif
+
+#ifndef find_next_zero_bit
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+					 unsigned long start)
+{
+	return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
+}
+EXPORT_SYMBOL(_find_next_zero_bit);
+#endif
+
 #ifndef find_last_bit
 unsigned long _find_last_bit(const unsigned long *addr, unsigned long size)
 {
@@ -173,4 +175,23 @@ unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long s
 EXPORT_SYMBOL(_find_first_zero_bit_le);
 #endif
 
+#ifndef find_next_zero_bit_le
+unsigned long _find_next_zero_bit_le(const unsigned long *addr, unsigned
+		long size, unsigned long offset)
+{
+	return FIND_NEXT_BIT(~addr[idx], swab, size, offset);
+}
+EXPORT_SYMBOL(_find_next_zero_bit_le);
+#endif
+
+#ifndef find_next_bit_le
+unsigned long _find_next_bit_le(const unsigned long  *addr, unsigned
+		long size, unsigned long offset)
+{
+	return FIND_NEXT_BIT(addr[idx], swab, size, offset);
+}
+EXPORT_SYMBOL(_find_next_bit_le);
+
+#endif
+
 #endif /* __BIG_ENDIAN */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 4/4] tools: sync find_bit() implementation
  2022-09-15  2:07 [PATCH v4 0/4] lib: optimize find_bit() functions Yury Norov
                   ` (2 preceding siblings ...)
  2022-09-15  2:07 ` [PATCH v4 3/4] lib/find_bit: optimize find_next_bit() functions Yury Norov
@ 2022-09-15  2:07 ` Yury Norov
  2022-09-21 15:40 ` [PATCH v4 0/4] lib: optimize find_bit() functions Yury Norov
  4 siblings, 0 replies; 12+ messages in thread
From: Yury Norov @ 2022-09-15  2:07 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Yury Norov, Alexey Klimov, Andy Shevchenko, Andy Whitcroft,
	Catalin Marinas, David Laight, Dennis Zhou, Guenter Roeck,
	Kees Cook, Rasmus Villemoes, Valentin Schneider, Sven Schnelle,
	Russell King

Sync find_first_bit() and find_next_bit() implementation with the
mother kernel.

Also, drop unused find_last_bit() and find_next_clump8().

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 tools/include/linux/find.h |  61 +++------------
 tools/lib/find_bit.c       | 149 +++++++++++++++++--------------------
 2 files changed, 81 insertions(+), 129 deletions(-)

diff --git a/tools/include/linux/find.h b/tools/include/linux/find.h
index 47e2bd6c5174..38c0a542b0e2 100644
--- a/tools/include/linux/find.h
+++ b/tools/include/linux/find.h
@@ -8,21 +8,23 @@
 
 #include <linux/bitops.h>
 
-extern unsigned long _find_next_bit(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le);
+unsigned long _find_next_bit(const unsigned long *addr1, unsigned long nbits,
+				unsigned long start);
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+					unsigned long nbits, unsigned long start);
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+					 unsigned long start);
 extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
 extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
-extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long size);
 
 #ifndef find_next_bit
 /**
  * find_next_bit - find the next set bit in a memory region
  * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
  * @size: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
  *
  * Returns the bit number for the next set bit
  * If no bits are set, returns @size.
@@ -41,7 +43,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+	return _find_next_bit(addr, size, offset);
 }
 #endif
 
@@ -50,8 +52,8 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
  * find_next_and_bit - find the next set bit in both memory regions
  * @addr1: The first address to base the search on
  * @addr2: The second address to base the search on
- * @offset: The bitnumber to start searching at
  * @size: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
  *
  * Returns the bit number for the next set bit
  * If no bits are set, returns @size.
@@ -71,7 +73,7 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+	return _find_next_and_bit(addr1, addr2, size, offset);
 }
 #endif
 
@@ -79,8 +81,8 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 /**
  * find_next_zero_bit - find the next cleared bit in a memory region
  * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
  * @size: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
  *
  * Returns the bit number of the next zero bit
  * If no bits are zero, returns @size.
@@ -99,7 +101,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+	return _find_next_zero_bit(addr, size, offset);
 }
 #endif
 
@@ -172,43 +174,4 @@ unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
 }
 #endif
 
-#ifndef find_last_bit
-/**
- * find_last_bit - find the last set bit in a memory region
- * @addr: The address to start the search at
- * @size: The number of bits to search
- *
- * Returns the bit number of the last set bit, or size.
- */
-static inline
-unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
-{
-	if (small_const_nbits(size)) {
-		unsigned long val = *addr & GENMASK(size - 1, 0);
-
-		return val ? __fls(val) : size;
-	}
-
-	return _find_last_bit(addr, size);
-}
-#endif
-
-/**
- * find_next_clump8 - find next 8-bit clump with set bits in a memory region
- * @clump: location to store copy of found clump
- * @addr: address to base the search on
- * @size: bitmap size in number of bits
- * @offset: bit offset at which to start searching
- *
- * Returns the bit offset for the next set clump; the found clump value is
- * copied to the location pointed by @clump. If no bits are set, returns @size.
- */
-extern unsigned long find_next_clump8(unsigned long *clump,
-				      const unsigned long *addr,
-				      unsigned long size, unsigned long offset);
-
-#define find_first_clump8(clump, bits, size) \
-	find_next_clump8((clump), (bits), (size), 0)
-
-
 #endif /*__LINUX_FIND_H_ */
diff --git a/tools/lib/find_bit.c b/tools/lib/find_bit.c
index ba4b8d94e004..6a3dc167d30e 100644
--- a/tools/lib/find_bit.c
+++ b/tools/lib/find_bit.c
@@ -18,66 +18,54 @@
 #include <linux/bitmap.h>
 #include <linux/kernel.h>
 
-#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
-		!defined(find_next_and_bit)
-
 /*
- * This is a common helper function for find_next_bit, find_next_zero_bit, and
- * find_next_and_bit. The differences are:
- *  - The "invert" argument, which is XORed with each fetched word before
- *    searching it for one bits.
- *  - The optional "addr2", which is anded with "addr1" if present.
+ * Common helper for find_bit() function family
+ * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
+ * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
+ * @size: The bitmap size in bits
  */
-unsigned long _find_next_bit(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le)
-{
-	unsigned long tmp, mask;
-	(void) le;
-
-	if (unlikely(start >= nbits))
-		return nbits;
-
-	tmp = addr1[start / BITS_PER_LONG];
-	if (addr2)
-		tmp &= addr2[start / BITS_PER_LONG];
-	tmp ^= invert;
-
-	/* Handle 1st word. */
-	mask = BITMAP_FIRST_WORD_MASK(start);
-
-	/*
-	 * Due to the lack of swab() in tools, and the fact that it doesn't
-	 * need little-endian support, just comment it out
-	 */
-#if (0)
-	if (le)
-		mask = swab(mask);
-#endif
-
-	tmp &= mask;
+#define FIND_FIRST_BIT(FETCH, MUNGE, size)					\
+({										\
+	unsigned long idx, val, sz = (size);					\
+										\
+	for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {			\
+		val = (FETCH);							\
+		if (val) {							\
+			sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(val)), sz);	\
+			break;							\
+		}								\
+	}									\
+										\
+	sz;									\
+})
 
-	start = round_down(start, BITS_PER_LONG);
-
-	while (!tmp) {
-		start += BITS_PER_LONG;
-		if (start >= nbits)
-			return nbits;
-
-		tmp = addr1[start / BITS_PER_LONG];
-		if (addr2)
-			tmp &= addr2[start / BITS_PER_LONG];
-		tmp ^= invert;
-	}
-
-#if (0)
-	if (le)
-		tmp = swab(tmp);
-#endif
-
-	return min(start + __ffs(tmp), nbits);
-}
-#endif
+/*
+ * Common helper for find_next_bit() function family
+ * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
+ * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
+ * @size: The bitmap size in bits
+ * @start: The bitnumber to start searching at
+ */
+#define FIND_NEXT_BIT(FETCH, MUNGE, size, start)				\
+({										\
+	unsigned long mask, idx, tmp, sz = (size), __start = (start);		\
+										\
+	if (unlikely(__start >= sz))						\
+		goto out;							\
+										\
+	mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start));				\
+	idx = __start / BITS_PER_LONG;						\
+										\
+	for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) {			\
+		if ((idx + 1) * BITS_PER_LONG >= sz)				\
+			goto out;						\
+		idx++;								\
+	}									\
+										\
+	sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz);			\
+out:										\
+	sz;									\
+})
 
 #ifndef find_first_bit
 /*
@@ -85,14 +73,7 @@ unsigned long _find_next_bit(const unsigned long *addr1,
  */
 unsigned long _find_first_bit(const unsigned long *addr, unsigned long size)
 {
-	unsigned long idx;
-
-	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
-		if (addr[idx])
-			return min(idx * BITS_PER_LONG + __ffs(addr[idx]), size);
-	}
-
-	return size;
+	return FIND_FIRST_BIT(addr[idx], /* nop */, size);
 }
 #endif
 
@@ -104,15 +85,7 @@ unsigned long _find_first_and_bit(const unsigned long *addr1,
 				  const unsigned long *addr2,
 				  unsigned long size)
 {
-	unsigned long idx, val;
-
-	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
-		val = addr1[idx] & addr2[idx];
-		if (val)
-			return min(idx * BITS_PER_LONG + __ffs(val), size);
-	}
-
-	return size;
+	return FIND_FIRST_BIT(addr1[idx] & addr2[idx], /* nop */, size);
 }
 #endif
 
@@ -122,13 +95,29 @@ unsigned long _find_first_and_bit(const unsigned long *addr1,
  */
 unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size)
 {
-	unsigned long idx;
+	return FIND_FIRST_BIT(~addr[idx], /* nop */, size);
+}
+#endif
 
-	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
-		if (addr[idx] != ~0UL)
-			return min(idx * BITS_PER_LONG + ffz(addr[idx]), size);
-	}
+#ifndef find_next_bit
+unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
+{
+	return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
+}
+#endif
 
-	return size;
+#ifndef find_next_and_bit
+unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
+					unsigned long nbits, unsigned long start)
+{
+	return FIND_NEXT_BIT(addr1[idx] & addr2[idx], /* nop */, nbits, start);
+}
+#endif
+
+#ifndef find_next_zero_bit
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+					 unsigned long start)
+{
+	return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
 }
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 3/4] lib/find_bit: optimize find_next_bit() functions
  2022-09-15  2:07 ` [PATCH v4 3/4] lib/find_bit: optimize find_next_bit() functions Yury Norov
@ 2022-09-15 16:25   ` Valentin Schneider
  2022-09-19 13:45   ` Andy Shevchenko
  1 sibling, 0 replies; 12+ messages in thread
From: Valentin Schneider @ 2022-09-15 16:25 UTC (permalink / raw)
  To: Yury Norov, Linus Torvalds, linux-kernel
  Cc: Yury Norov, Alexey Klimov, Andy Shevchenko, Andy Whitcroft,
	Catalin Marinas, David Laight, Dennis Zhou, Guenter Roeck,
	Kees Cook, Rasmus Villemoes, Sven Schnelle, Russell King

On 14/09/22 19:07, Yury Norov wrote:
> Over the past couple years, the function _find_next_bit() was extended
> with parameters that modify its behavior to implement and- zero- and le-
> flavors. The parameters are passed at compile time, but current design
> prevents a compiler from optimizing out the conditionals.
>
> As find_next_bit() API grows, I expect that more parameters will be added.
> Current design would require more conditional code in _find_next_bit(),
> which would bloat the helper even more and make it barely readable.
>
> This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
> a set of wrappers, so that the compile-time optimizations become possible.
>
> The common logic is moved to the new macro, and all flavors may be
> generated by providing a FETCH macro parameter, like in this example:
>
>   #define FIND_NEXT_BIT(FETCH, MUNGE, size, start) ...
>
>   find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
>   {
>       return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx],
>                               /* nop */, size, start);
>   }
>
> The FETCH may be of any complexity, as soon as it only refers the bitmap(s)
> and an iterator idx.
>
> MUNGE is here to support _le code generation for BE builds. May be
> empty.
>
> I ran find_bit_benchmark 16 times on top of 6.0-rc2 and 16 times on top
> of 6.0-rc2 + this series. The results for kvm/x86_64 are:
>
>                       v6.0-rc2  Optimized       Difference  Z-score
> Random dense bitmap         ns         ns        ns      %
> find_next_bit:          787735     670546    117189   14.9     3.97
> find_next_zero_bit:     777492     664208    113284   14.6    10.51
> find_last_bit:          830925     687573    143352   17.3     2.35
> find_first_bit:        3874366    3306635    567731   14.7     1.84
> find_first_and_bit:   40677125   37739887   2937238    7.2     1.36
> find_next_and_bit:      347865     304456     43409   12.5     1.35
>
> Random sparse bitmap
> find_next_bit:           19816      14021      5795   29.2     6.10
> find_next_zero_bit:    1318901    1223794     95107    7.2     1.41
> find_last_bit:           14573      13514      1059    7.3     6.92
> find_first_bit:        1313321    1249024     64297    4.9     1.53
> find_first_and_bit:       8921       8098       823    9.2     4.56
> find_next_and_bit:        9796       7176      2620   26.7     5.39
>
> Where the statistics is significant (z-score > 3), the improvement
> is ~15%.
>
> According to the bloat-o-meter, the Image size is 10-11K less:
>
> x86_64/defconfig:
> add/remove: 32/14 grow/shrink: 61/782 up/down: 6344/-16521 (-10177)
>
> arm64/defconfig:
> add/remove: 3/2 grow/shrink: 50/714 up/down: 608/-11556 (-10948)
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Yury Norov <yury.norov@gmail.com>

Reviewed-by: Valentin Schneider <vschneid@redhat.com>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/4] lib/find_bit: create find_first_zero_bit_le()
  2022-09-15  2:07 ` [PATCH v4 2/4] lib/find_bit: create find_first_zero_bit_le() Yury Norov
@ 2022-09-19 13:45   ` Andy Shevchenko
  0 siblings, 0 replies; 12+ messages in thread
From: Andy Shevchenko @ 2022-09-19 13:45 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, linux-kernel, Alexey Klimov, Andy Whitcroft,
	Catalin Marinas, David Laight, Dennis Zhou, Guenter Roeck,
	Kees Cook, Rasmus Villemoes, Valentin Schneider, Sven Schnelle,
	Russell King

On Wed, Sep 14, 2022 at 07:07:28PM -0700, Yury Norov wrote:
> find_first_zero_bit_le() is an alias to find_next_zero_bit_le(),
> despite that 'next' is known to be slower than 'first' version.
> 
> Now that we have common FIND_FIRST_BIT() macro helper, it's trivial
> to implement find_first_zero_bit_le() as a real function.

...

> +#ifdef __BIG_ENDIAN

Probably you want to add a blank line here.

> +#ifndef find_first_zero_bit_le
> +/*
> + * Find the first cleared bit in an LE memory region.
> + */
> +unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
> +{
> +	return FIND_FIRST_BIT(~addr[idx], swab, size);
> +}
> +EXPORT_SYMBOL(_find_first_zero_bit_le);
> +#endif
> +
> +#endif /* __BIG_ENDIAN */

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 3/4] lib/find_bit: optimize find_next_bit() functions
  2022-09-15  2:07 ` [PATCH v4 3/4] lib/find_bit: optimize find_next_bit() functions Yury Norov
  2022-09-15 16:25   ` Valentin Schneider
@ 2022-09-19 13:45   ` Andy Shevchenko
  2022-09-19 15:23     ` Linus Torvalds
  2022-09-20  1:41     ` Yury Norov
  1 sibling, 2 replies; 12+ messages in thread
From: Andy Shevchenko @ 2022-09-19 13:45 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, linux-kernel, Alexey Klimov, Andy Whitcroft,
	Catalin Marinas, David Laight, Dennis Zhou, Guenter Roeck,
	Kees Cook, Rasmus Villemoes, Valentin Schneider, Sven Schnelle,
	Russell King

On Wed, Sep 14, 2022 at 07:07:29PM -0700, Yury Norov wrote:
> Over the past couple years, the function _find_next_bit() was extended
> with parameters that modify its behavior to implement and- zero- and le-
> flavors. The parameters are passed at compile time, but current design
> prevents a compiler from optimizing out the conditionals.
> 
> As find_next_bit() API grows, I expect that more parameters will be added.
> Current design would require more conditional code in _find_next_bit(),
> which would bloat the helper even more and make it barely readable.
> 
> This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
> a set of wrappers, so that the compile-time optimizations become possible.
> 
> The common logic is moved to the new macro, and all flavors may be
> generated by providing a FETCH macro parameter, like in this example:
> 
>   #define FIND_NEXT_BIT(FETCH, MUNGE, size, start) ...
> 
>   find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
>   {
> 	return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx],
> 				/* nop */, size, start);
>   }
> 
> The FETCH may be of any complexity, as soon as it only refers the bitmap(s)
> and an iterator idx.
> 
> MUNGE is here to support _le code generation for BE builds. May be
> empty.
> 
> I ran find_bit_benchmark 16 times on top of 6.0-rc2 and 16 times on top
> of 6.0-rc2 + this series. The results for kvm/x86_64 are:
> 
>                       v6.0-rc2  Optimized       Difference  Z-score
> Random dense bitmap         ns         ns        ns      %
> find_next_bit:          787735     670546    117189   14.9     3.97
> find_next_zero_bit:     777492     664208    113284   14.6    10.51
> find_last_bit:          830925     687573    143352   17.3     2.35
> find_first_bit:        3874366    3306635    567731   14.7     1.84
> find_first_and_bit:   40677125   37739887   2937238    7.2     1.36
> find_next_and_bit:      347865     304456     43409   12.5     1.35
> 
> Random sparse bitmap
> find_next_bit:           19816      14021      5795   29.2     6.10
> find_next_zero_bit:    1318901    1223794     95107    7.2     1.41
> find_last_bit:           14573      13514      1059    7.3     6.92
> find_first_bit:        1313321    1249024     64297    4.9     1.53
> find_first_and_bit:       8921       8098       823    9.2     4.56
> find_next_and_bit:        9796       7176      2620   26.7     5.39
> 
> Where the statistics is significant (z-score > 3), the improvement
> is ~15%.
> 
> According to the bloat-o-meter, the Image size is 10-11K less:
> 
> x86_64/defconfig:
> add/remove: 32/14 grow/shrink: 61/782 up/down: 6344/-16521 (-10177)
> 
> arm64/defconfig:
> add/remove: 3/2 grow/shrink: 50/714 up/down: 608/-11556 (-10948)

...

>  /*

Seems like you wanted this to be a kernel doc, but it isn't right now.

> - * This is a common helper function for find_next_bit, find_next_zero_bit, and
> - * find_next_and_bit. The differences are:
> - *  - The "invert" argument, which is XORed with each fetched word before
> - *    searching it for one bits.
> - *  - The optional "addr2", which is anded with "addr1" if present.
> + * Common helper for find_next_bit() function family

In such case this should start with a name of the macro

 * FIND_NEXT_BIT - ...

> + * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
> + * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
> + * @size: The bitmap size in bits
> + * @start: The bitnumber to start searching at
>   */

...

> +#define FIND_NEXT_BIT(FETCH, MUNGE, size, start)				\
> +({										\
> +	unsigned long mask, idx, tmp, sz = (size), __start = (start);		\
> +										\
> +	if (unlikely(__start >= sz))						\
> +		goto out;							\
> +										\
> +	mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start));				\
> +	idx = __start / BITS_PER_LONG;						\
> +										\
> +	for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) {			\
> +		if ((idx + 1) * BITS_PER_LONG >= sz)				\
> +			goto out;						\
> +		idx++;								\
> +	}									\
> +										\
> +	sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz);			\
> +out:										\

I dunno if GCC expression limits the scope of goto labels, but on the safe side
you can add a prefix to it, so it becomes:

FIND_NEXT_BIT_out:

(or alike).

> +	sz;									\
> +})

...

> +unsigned long _find_next_zero_bit_le(const unsigned long *addr, unsigned
> +		long size, unsigned long offset)

Usually we don't split parameters between lines.

...

> +unsigned long _find_next_bit_le(const unsigned long  *addr, unsigned
> +		long size, unsigned long offset)

Ditto.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 3/4] lib/find_bit: optimize find_next_bit() functions
  2022-09-19 13:45   ` Andy Shevchenko
@ 2022-09-19 15:23     ` Linus Torvalds
  2022-09-20 11:59       ` Andy Shevchenko
  2022-09-20  1:41     ` Yury Norov
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2022-09-19 15:23 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Yury Norov, linux-kernel, Alexey Klimov, Andy Whitcroft,
	Catalin Marinas, David Laight, Dennis Zhou, Guenter Roeck,
	Kees Cook, Rasmus Villemoes, Valentin Schneider, Sven Schnelle,
	Russell King

On Mon, Sep 19, 2022 at 6:46 AM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> > +#define FIND_NEXT_BIT(FETCH, MUNGE, size, start)                             \
> > +({                                                                           \
[..]
> > +out:                                                                         \
>
> I dunno if GCC expression limits the scope of goto labels

No. Labels are function-global by default. If you need block-scope for
them, you need to declare them explicitly in tha block before use with
"__label__".

> but on the safe side you can add a prefix to it, so it becomes:
>
> FIND_NEXT_BIT_out:

That doesn't really help, since if you were to use the macro twice,
you'd still get a name clash.

That said, I'm not convinced any of this matters, since these macros
aren't supposed to be used anywhere else, and in fact, they aren't
even in any header file that would allow anybody else to use them.

So I think all the normal "make macros safe" rules are simply
irrelevant for these cases - despite the readable name, these macros
are local special cases for code generation and avoiding duplication,
not generic "use this macro to find a bit".

So it's one thing if a macro is in a header file to be used by random
code. It's a different thing entirely if it's a specialized local
macro for a local issue, that nobody else is ever going to even see.

So I don't think it would be wrong to use __label__ to block-scope it,
or to use a longer name, but I also don't think it's really required.

It's not exactly super-common, but we have various cases of macros
that generate full (or partial) function bodies in various places,
where the macro does various things that should *never* be done in a
"regular" macro that gets used by normal code.

You can see one class of this with something like

   git grep '^static.*##.*(.*\\$' -- '*.c'

but to *really* go blind, see the "SYSCALL_DEFINE*()" macros in
<linux/syscalls.h>.

Those will mess with your mind, and go "whoever wrote those macros
needs to be institutionalized". They do some impressive things,
including creating local functions _and_ starting a new function
definition where the actual code then isn't part of the macro, but
actually just continues where the macro was used.

Which is all very natural and actually looks quite nice and readable
in the places that use it, with users looking like

  SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
  {
        int fd;
        struct pid *p;
   ...

which is all pretty legible. But there's no question that that macro
violates every single "normal" macro rule.

The FIND_NEXT_BIT() macro in comparison is pretty tame.

                  Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 3/4] lib/find_bit: optimize find_next_bit() functions
  2022-09-19 13:45   ` Andy Shevchenko
  2022-09-19 15:23     ` Linus Torvalds
@ 2022-09-20  1:41     ` Yury Norov
  1 sibling, 0 replies; 12+ messages in thread
From: Yury Norov @ 2022-09-20  1:41 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Linus Torvalds, linux-kernel, Alexey Klimov, Andy Whitcroft,
	Catalin Marinas, David Laight, Dennis Zhou, Guenter Roeck,
	Kees Cook, Rasmus Villemoes, Valentin Schneider, Sven Schnelle,
	Russell King

On Mon, Sep 19, 2022 at 04:45:54PM +0300, Andy Shevchenko wrote:
> On Wed, Sep 14, 2022 at 07:07:29PM -0700, Yury Norov wrote:
> > Over the past couple years, the function _find_next_bit() was extended
> > with parameters that modify its behavior to implement and- zero- and le-
> > flavors. The parameters are passed at compile time, but current design
> > prevents a compiler from optimizing out the conditionals.
> > 
> > As find_next_bit() API grows, I expect that more parameters will be added.
> > Current design would require more conditional code in _find_next_bit(),
> > which would bloat the helper even more and make it barely readable.
> > 
> > This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
> > a set of wrappers, so that the compile-time optimizations become possible.
> > 
> > The common logic is moved to the new macro, and all flavors may be
> > generated by providing a FETCH macro parameter, like in this example:
> > 
> >   #define FIND_NEXT_BIT(FETCH, MUNGE, size, start) ...
> > 
> >   find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
> >   {
> > 	return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx],
> > 				/* nop */, size, start);
> >   }
> > 
> > The FETCH may be of any complexity, as soon as it only refers the bitmap(s)
> > and an iterator idx.
> > 
> > MUNGE is here to support _le code generation for BE builds. May be
> > empty.
> > 
> > I ran find_bit_benchmark 16 times on top of 6.0-rc2 and 16 times on top
> > of 6.0-rc2 + this series. The results for kvm/x86_64 are:
> > 
> >                       v6.0-rc2  Optimized       Difference  Z-score
> > Random dense bitmap         ns         ns        ns      %
> > find_next_bit:          787735     670546    117189   14.9     3.97
> > find_next_zero_bit:     777492     664208    113284   14.6    10.51
> > find_last_bit:          830925     687573    143352   17.3     2.35
> > find_first_bit:        3874366    3306635    567731   14.7     1.84
> > find_first_and_bit:   40677125   37739887   2937238    7.2     1.36
> > find_next_and_bit:      347865     304456     43409   12.5     1.35
> > 
> > Random sparse bitmap
> > find_next_bit:           19816      14021      5795   29.2     6.10
> > find_next_zero_bit:    1318901    1223794     95107    7.2     1.41
> > find_last_bit:           14573      13514      1059    7.3     6.92
> > find_first_bit:        1313321    1249024     64297    4.9     1.53
> > find_first_and_bit:       8921       8098       823    9.2     4.56
> > find_next_and_bit:        9796       7176      2620   26.7     5.39
> > 
> > Where the statistics is significant (z-score > 3), the improvement
> > is ~15%.
> > 
> > According to the bloat-o-meter, the Image size is 10-11K less:
> > 
> > x86_64/defconfig:
> > add/remove: 32/14 grow/shrink: 61/782 up/down: 6344/-16521 (-10177)
> > 
> > arm64/defconfig:
> > add/remove: 3/2 grow/shrink: 50/714 up/down: 608/-11556 (-10948)
> 
> ...
> 
> >  /*
> 
> Seems like you wanted this to be a kernel doc, but it isn't right now.

No, I didn't. I can remove '@' below, if that concerns you.
 
> > - * This is a common helper function for find_next_bit, find_next_zero_bit, and
> > - * find_next_and_bit. The differences are:
> > - *  - The "invert" argument, which is XORed with each fetched word before
> > - *    searching it for one bits.
> > - *  - The optional "addr2", which is anded with "addr1" if present.
> > + * Common helper for find_next_bit() function family
> 
> In such case this should start with a name of the macro
> 
>  * FIND_NEXT_BIT - ...
> 
> > + * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
> > + * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
> > + * @size: The bitmap size in bits
> > + * @start: The bitnumber to start searching at
> >   */
> 
> ...
> 
> > +#define FIND_NEXT_BIT(FETCH, MUNGE, size, start)				\
> > +({										\
> > +	unsigned long mask, idx, tmp, sz = (size), __start = (start);		\
> > +										\
> > +	if (unlikely(__start >= sz))						\
> > +		goto out;							\
> > +										\
> > +	mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start));				\
> > +	idx = __start / BITS_PER_LONG;						\
> > +										\
> > +	for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) {			\
> > +		if ((idx + 1) * BITS_PER_LONG >= sz)				\
> > +			goto out;						\
> > +		idx++;								\
> > +	}									\
> > +										\
> > +	sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz);			\
> > +out:										\
> 
> I dunno if GCC expression limits the scope of goto labels, but on the safe side
> you can add a prefix to it, so it becomes:
> 
> FIND_NEXT_BIT_out:
> 
> (or alike).

As Linus already said, the 'out' is function-scope. We can make it a
block-scope with __label__, but this would make an impression that we
are OK with stacking many FIND macros in a single function.

I spend some time trying to figure out a legitimate usecase for it, but
nothing came in mind. There are many real cases when we need 2 or more
find functions at once but all that cases would work with regular
wrappers around FIND_BIT(). Check this, for example:

https://lore.kernel.org/lkml/20220919210559.1509179-6-yury.norov@gmail.com/

I don't know how FIND_BIT() machinery will evolve with time. For now
it's a clean and neat local helper with a very straightforward usage.
Lets keep it simple now? If someone will decide to call FIND_BIT()
twice and fail, it would be a good hint that he's doing something
wrong.

> > +	sz;									\
> > +})
> 
> ...
> 
> > +unsigned long _find_next_zero_bit_le(const unsigned long *addr, unsigned
> > +		long size, unsigned long offset)
> 
> Usually we don't split parameters between lines.

Ok

Thanks,
Yury

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 3/4] lib/find_bit: optimize find_next_bit() functions
  2022-09-19 15:23     ` Linus Torvalds
@ 2022-09-20 11:59       ` Andy Shevchenko
  0 siblings, 0 replies; 12+ messages in thread
From: Andy Shevchenko @ 2022-09-20 11:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yury Norov, linux-kernel, Alexey Klimov, Andy Whitcroft,
	Catalin Marinas, David Laight, Dennis Zhou, Guenter Roeck,
	Kees Cook, Rasmus Villemoes, Valentin Schneider, Sven Schnelle,
	Russell King

On Mon, Sep 19, 2022 at 08:23:00AM -0700, Linus Torvalds wrote:
> On Mon, Sep 19, 2022 at 6:46 AM Andy Shevchenko
> <andriy.shevchenko@linux.intel.com> wrote:
> >
> > > +#define FIND_NEXT_BIT(FETCH, MUNGE, size, start)                             \
> > > +({                                                                           \
> [..]
> > > +out:                                                                         \
> >
> > I dunno if GCC expression limits the scope of goto labels
> 
> No. Labels are function-global by default. If you need block-scope for
> them, you need to declare them explicitly in tha block before use with
> "__label__".
> 
> > but on the safe side you can add a prefix to it, so it becomes:
> >
> > FIND_NEXT_BIT_out:
> 
> That doesn't really help, since if you were to use the macro twice,
> you'd still get a name clash.
> 
> That said, I'm not convinced any of this matters, since these macros
> aren't supposed to be used anywhere else, and in fact, they aren't
> even in any header file that would allow anybody else to use them.
> 
> So I think all the normal "make macros safe" rules are simply
> irrelevant for these cases - despite the readable name, these macros
> are local special cases for code generation and avoiding duplication,
> not generic "use this macro to find a bit".
> 
> So it's one thing if a macro is in a header file to be used by random
> code. It's a different thing entirely if it's a specialized local
> macro for a local issue, that nobody else is ever going to even see.
> 
> So I don't think it would be wrong to use __label__ to block-scope it,
> or to use a longer name, but I also don't think it's really required.
> 
> It's not exactly super-common, but we have various cases of macros
> that generate full (or partial) function bodies in various places,
> where the macro does various things that should *never* be done in a
> "regular" macro that gets used by normal code.
> 
> You can see one class of this with something like
> 
>    git grep '^static.*##.*(.*\\$' -- '*.c'
> 
> but to *really* go blind, see the "SYSCALL_DEFINE*()" macros in
> <linux/syscalls.h>.
> 
> Those will mess with your mind, and go "whoever wrote those macros
> needs to be institutionalized". They do some impressive things,
> including creating local functions _and_ starting a new function
> definition where the actual code then isn't part of the macro, but
> actually just continues where the macro was used.
> 
> Which is all very natural and actually looks quite nice and readable
> in the places that use it, with users looking like
> 
>   SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
>   {
>         int fd;
>         struct pid *p;
>    ...
> 
> which is all pretty legible. But there's no question that that macro
> violates every single "normal" macro rule.
> 
> The FIND_NEXT_BIT() macro in comparison is pretty tame.

Thanks for elaboration. It makes a lot of sense and something TIL material
to me.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 0/4] lib: optimize find_bit() functions
  2022-09-15  2:07 [PATCH v4 0/4] lib: optimize find_bit() functions Yury Norov
                   ` (3 preceding siblings ...)
  2022-09-15  2:07 ` [PATCH v4 4/4] tools: sync find_bit() implementation Yury Norov
@ 2022-09-21 15:40 ` Yury Norov
  4 siblings, 0 replies; 12+ messages in thread
From: Yury Norov @ 2022-09-21 15:40 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Alexey Klimov, Andy Shevchenko, Andy Whitcroft, Catalin Marinas,
	David Laight, Dennis Zhou, Guenter Roeck, Kees Cook,
	Rasmus Villemoes, Valentin Schneider, Sven Schnelle,
	Russell King

If no other comments, I'll address Andy's comments on formatting and
move it in bitmap-for-next.

Thanks,
Yury

On Wed, Sep 14, 2022 at 07:07:26PM -0700, Yury Norov wrote:
> In the recent discussion, it was noticed that find_next_bit() functions may
> be improved by adding wrappers around common __find_next_bit() in .c file.
> 
> As suggested by Linus, I tried the meta-programming trick with the
> EXPRESSION macro, which is passed from wrapper into find_bit()
> helpers:
> 
>   #define BIT_FIND_BODY(addr, size, start, EXPRESSION)          \
>         BIT_FIND_SETUP(addr, size, start)                       \
>         BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)      \
>         BIT_WORD_LOOP(addr, size, idx, val, EXPRESSION)         \
>         return size;                                            \
>   found:        BIT_WORD_SWAB(val);                             \
>         return min((idx)*BITS_PER_LONG + __ffs(val), size)
> 
>   unsigned long _find_next_and_bit(const unsigned long *addr1,
>                                  const unsigned long *addr2,
>                                  unsigned long size,
>                                  unsigned long start)
>   { BIT_FIND_BODY(addr, size, start, addr1[idx] & addr2[idx]); }
> 
> I appreciated the potential of how the EXPRESSION works, but I don't like
> that the resulting macro is constructed from pieces because it makes it
> harder to understand what happens behind the ifdefery. Checkpatch isn't
> happy as well because the final macro contains 'return' statement; and I
> would agree that it's better to avoid it.
> 
> I spun the idea one more time, trying to make FIND helper a more or
> less standard looking macro.
> 
> This new approach saves 10-11K of Image size, and is 15% faster in the
> performance benchmark. See the 3rd patch for some statistics.
> 
> v1: https://lore.kernel.org/all/20220728161208.865420-2-yury.norov@gmail.com/T/
> v2: https://lore.kernel.org/lkml/YwaXvphVpy5A7fSs@yury-laptop/t/
> v3: https://lore.kernel.org/all/xhsmhedwnb15r.mognet@vschneid.remote.csb/T/
> v4:
>  - fix for-loop break condition in FIND_NEXT_BIT;
>  - add review tags from Valentin Schneider.
> 
> Yury Norov (4):
>   lib/find_bit: introduce FIND_FIRST_BIT() macro
>   lib/find_bit: create find_first_zero_bit_le()
>   lib/find_bit: optimize find_next_bit() functions
>   tools: sync find_bit() implementation
> 
>  include/linux/find.h       |  46 +++++++---
>  lib/find_bit.c             | 178 ++++++++++++++++++++++---------------
>  tools/include/linux/find.h |  61 +++----------
>  tools/lib/find_bit.c       | 149 ++++++++++++++-----------------
>  4 files changed, 220 insertions(+), 214 deletions(-)
> 
> -- 
> 2.34.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-09-21 15:42 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-15  2:07 [PATCH v4 0/4] lib: optimize find_bit() functions Yury Norov
2022-09-15  2:07 ` [PATCH v4 1/4] lib/find_bit: introduce FIND_FIRST_BIT() macro Yury Norov
2022-09-15  2:07 ` [PATCH v4 2/4] lib/find_bit: create find_first_zero_bit_le() Yury Norov
2022-09-19 13:45   ` Andy Shevchenko
2022-09-15  2:07 ` [PATCH v4 3/4] lib/find_bit: optimize find_next_bit() functions Yury Norov
2022-09-15 16:25   ` Valentin Schneider
2022-09-19 13:45   ` Andy Shevchenko
2022-09-19 15:23     ` Linus Torvalds
2022-09-20 11:59       ` Andy Shevchenko
2022-09-20  1:41     ` Yury Norov
2022-09-15  2:07 ` [PATCH v4 4/4] tools: sync find_bit() implementation Yury Norov
2022-09-21 15:40 ` [PATCH v4 0/4] lib: optimize find_bit() functions Yury Norov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).