All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5] lib: optimize cpumask_next_and()
@ 2017-11-28 13:13 Clement Courbet
  2017-11-29  7:16 ` Yury Norov
  0 siblings, 1 reply; 5+ messages in thread
From: Clement Courbet @ 2017-11-28 13:13 UTC (permalink / raw)
  To: Andrew Morton, Yury Norov; +Cc: Clement Courbet, linux-arch, linux-kernel

We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
It's essentially a joined iteration in search for a non-zero bit, which
is currently implemented as a lookup join (find a nonzero bit on the
lhs, lookup the rhs to see if it's set there).

Implement a direct join (find a nonzero bit on the incrementally built
join).

For cpumask_next_and, direct benchmarking shows that it's 1.17x to 14x
faster with a geometric mean of 2.1 on 32 CPUs [1]. No impact on memory
usage.

Also added generic bitmap benchmarks in the new `test_find_bit` module
for the reference and new implementations (results: see
`find_next_and_bit` and `find_next_and_bit_ref` in [2] and [3] below).
Note that on Arm (), the new c implementation still outperforms the
old one that uses c+ the asm implementation of `find_next_bit` [3].

[1] Approximate benchmark code:

```
  unsigned long src1p[nr_cpumask_longs] = {pattern1};
  unsigned long src2p[nr_cpumask_longs] = {pattern2};
  for (/*a bunch of repetitions*/) {
    for (int n = -1; n <= nr_cpu_ids; ++n) {
      asm volatile("" : "+rm"(src1p)); // prevent any optimization
      asm volatile("" : "+rm"(src2p));
      unsigned long result = cpumask_next_and(n, src1p, src2p);
      asm volatile("" : "+rm"(result));
    }
  }
```

Results:
pattern1    pattern2     time_before/time_after
0x0000ffff  0x0000ffff   1.65
0x0000ffff  0x00005555   2.24
0x0000ffff  0x00001111   2.94
0x0000ffff  0x00000000   14.0
0x00005555  0x0000ffff   1.67
0x00005555  0x00005555   1.71
0x00005555  0x00001111   1.90
0x00005555  0x00000000   6.58
0x00001111  0x0000ffff   1.46
0x00001111  0x00005555   1.49
0x00001111  0x00001111   1.45
0x00001111  0x00000000   3.10
0x00000000  0x0000ffff   1.18
0x00000000  0x00005555   1.18
0x00000000  0x00001111   1.17
0x00000000  0x00000000   1.25
-----------------------------
               geo.mean  2.06

[2] test_find_next_bit, X86 (skylake)

 [ 3913.477422] Start testing find_bit() with random-filled bitmap
 [ 3913.477847] find_next_bit: 160868 cycles, 16484 iterations
 [ 3913.477933] find_next_zero_bit: 169542 cycles, 16285 iterations
 [ 3913.478036] find_last_bit: 201638 cycles, 16483 iterations
 [ 3913.480214] find_first_bit: 4353244 cycles, 16484 iterations
 [ 3913.480216] Start testing find_next_and_bit() with random-filled
 bitmap
 [ 3913.481027] find_next_and_bit_ref: 319444 cycles, 8216 iterations
 [ 3913.481074] find_next_and_bit: 89604 cycles, 8216 iterations
 [ 3913.481075] Start testing find_bit() with sparse bitmap
 [ 3913.481078] find_next_bit: 2536 cycles, 66 iterations
 [ 3913.481252] find_next_zero_bit: 344404 cycles, 32703 iterations
 [ 3913.481255] find_last_bit: 2006 cycles, 66 iterations
 [ 3913.481265] find_first_bit: 17488 cycles, 66 iterations
 [ 3913.481266] Start testing find_next_and_bit() with sparse bitmap
 [ 3913.481270] find_next_and_bit_ref: 2486 cycles, 1 iterations
 [ 3913.481272] find_next_and_bit: 764 cycles, 1 iterations

[3] test_find_next_bit, arm (v7 odroid XU3).

[  267.206928] Start testing find_bit() with random-filled bitmap
[  267.214752] find_next_bit: 4474 cycles, 16419 iterations
[  267.221850] find_next_zero_bit: 5976 cycles, 16350 iterations
[  267.229294] find_last_bit: 4209 cycles, 16419 iterations
[  267.279131] find_first_bit: 1032991 cycles, 16420 iterations
[  267.286265] Start testing find_next_and_bit() with random-filled
bitmap
[  267.294895] find_next_and_bit_ref: 7572 cycles, 8140 iterations
[  267.302386] find_next_and_bit: 2290 cycles, 8140 iterations
[  267.309422] Start testing find_bit() with sparse bitmap
[  267.316054] find_next_bit: 191 cycles, 66 iterations
[  267.322726] find_next_zero_bit: 8758 cycles, 32703 iterations
[  267.329803] find_last_bit: 84 cycles, 66 iterations
[  267.336169] find_first_bit: 4118 cycles, 66 iterations
[  267.342627] Start testing find_next_and_bit() with sparse bitmap
[  267.349992] find_next_and_bit_ref: 193 cycles, 1 iterations
[  267.356919] find_next_and_bit: 91 cycles, 1 iterations

Signed-off-by: Clement Courbet <courbet@google.com>
---
Changes in v2:
 - Refactored _find_next_common_bit into _find_next_bit., as suggested
   by Yury Norov. This has no adverse effects on the performance side,
   as the compiler successfully inlines the code.

Changes in v3:
 - Fixes find_next_and_bit() declaration.
 - Synchronize _find_next_bit_le() with _find_next_bit()
 - Synchronize the code in tools/lib/find_bit.c
 - Add find_next_and_bit to guard code
 - Fix invert value (bad sync with our internal tree on which I'm doing
   the testing).

Changes in v4:
 - Mark _find_next_bit() inline.

Changes in v5:
 - Added benchmarks to test_find_bit.cc
 - Fixed arm compilation: added missing header to arm bitops.h

 arch/arm/include/asm/bitops.h           |  1 +
 arch/unicore32/include/asm/bitops.h     |  2 ++
 include/asm-generic/bitops/find.h       | 20 +++++++++++
 include/linux/bitmap.h                  |  6 +++-
 lib/cpumask.c                           |  9 ++---
 lib/find_bit.c                          | 59 ++++++++++++++++++++++++---------
 lib/test_find_bit.c                     | 47 +++++++++++++++++++++++++-
 tools/include/asm-generic/bitops/find.h | 16 +++++++++
 tools/lib/find_bit.c                    | 40 ++++++++++++++++------
 9 files changed, 168 insertions(+), 32 deletions(-)

diff --git a/arch/arm/include/asm/bitops.h b/arch/arm/include/asm/bitops.h
index ce5ee762ed66..4cab9bb823fb 100644
--- a/arch/arm/include/asm/bitops.h
+++ b/arch/arm/include/asm/bitops.h
@@ -338,6 +338,7 @@ static inline int find_next_bit_le(const void *p, int size, int offset)
 
 #endif
 
+#include <asm-generic/bitops/find.h>
 #include <asm-generic/bitops/le.h>
 
 /*
diff --git a/arch/unicore32/include/asm/bitops.h b/arch/unicore32/include/asm/bitops.h
index 401f597bc38c..c0cbdbe17168 100644
--- a/arch/unicore32/include/asm/bitops.h
+++ b/arch/unicore32/include/asm/bitops.h
@@ -44,4 +44,6 @@ static inline int fls(int x)
 #define find_first_bit		find_first_bit
 #define find_first_zero_bit	find_first_zero_bit
 
+#include <asm-generic/bitops/find.h>
+
 #endif /* __UNICORE_BITOPS_H__ */
diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
index 1ba611e16fa0..8a1ee10014de 100644
--- a/include/asm-generic/bitops/find.h
+++ b/include/asm-generic/bitops/find.h
@@ -16,6 +16,22 @@ extern unsigned long find_next_bit(const unsigned long *addr, unsigned long
 		size, unsigned long offset);
 #endif
 
+#ifndef find_next_and_bit
+/**
+ * find_next_and_bit - find the next set bit in both memory regions
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+extern unsigned long find_next_and_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long size,
+		unsigned long offset);
+#endif
+
 #ifndef find_next_zero_bit
 /**
  * find_next_zero_bit - find the next cleared bit in a memory region
@@ -55,8 +71,12 @@ extern unsigned long find_first_zero_bit(const unsigned long *addr,
 					 unsigned long size);
 #else /* CONFIG_GENERIC_FIND_FIRST_BIT */
 
+#ifndef find_first_bit
 #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
+#endif
+#ifndef find_first_zero_bit
 #define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
+#endif
 
 #endif /* CONFIG_GENERIC_FIND_FIRST_BIT */
 
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 3489253e38fc..45a9e169d0fd 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -83,8 +83,12 @@
  *  test_and_change_bit(bit, addr)      Change bit and return old value
  *  find_first_zero_bit(addr, nbits)    Position first zero bit in *addr
  *  find_first_bit(addr, nbits)         Position first set bit in *addr
- *  find_next_zero_bit(addr, nbits, bit)  Position next zero bit in *addr >= bit
+ *  find_next_zero_bit(addr, nbits, bit)
+ *                                      Position next zero bit in *addr >= bit
  *  find_next_bit(addr, nbits, bit)     Position next set bit in *addr >= bit
+ *  find_next_and_bit(addr1, addr2, nbits, bit)
+ *                                      Same as find_first_bit, but in
+ *                                      (*addr1 & *addr2)
  *
  */
 
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 35fe142ebb5e..beca6244671a 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -33,10 +33,11 @@ EXPORT_SYMBOL(cpumask_next);
 int cpumask_next_and(int n, const struct cpumask *src1p,
 		     const struct cpumask *src2p)
 {
-	while ((n = cpumask_next(n, src1p)) < nr_cpu_ids)
-		if (cpumask_test_cpu(n, src2p))
-			break;
-	return n;
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpumask_check(n);
+	return find_next_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+		nr_cpumask_bits, n + 1);
 }
 EXPORT_SYMBOL(cpumask_next_and);
 
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 6ed74f78380c..ee3df93ba69a 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -21,22 +21,29 @@
 #include <linux/export.h>
 #include <linux/kernel.h>
 
-#if !defined(find_next_bit) || !defined(find_next_zero_bit)
+#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
+		!defined(find_next_and_bit)
 
 /*
- * This is a common helper function for find_next_bit and
- * find_next_zero_bit.  The difference is the "invert" argument, which
- * is XORed with each fetched word before searching it for one bits.
+ * This is a common helper function for find_next_bit, find_next_zero_bit, and
+ * find_next_and_bit. The differences are:
+ *  - The "invert" argument, which is XORed with each fetched word before
+ *    searching it for one bits.
+ *  - The optional "addr2", which is anded with "addr1" if present.
  */
-static unsigned long _find_next_bit(const unsigned long *addr,
-		unsigned long nbits, unsigned long start, unsigned long invert)
+static inline unsigned long _find_next_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long nbits,
+		unsigned long start, unsigned long invert)
 {
 	unsigned long tmp;
 
 	if (unlikely(start >= nbits))
 		return nbits;
 
-	tmp = addr[start / BITS_PER_LONG] ^ invert;
+	tmp = addr1[start / BITS_PER_LONG];
+	if (addr2)
+		tmp &= addr2[start / BITS_PER_LONG];
+	tmp ^= invert;
 
 	/* Handle 1st word. */
 	tmp &= BITMAP_FIRST_WORD_MASK(start);
@@ -47,7 +54,10 @@ static unsigned long _find_next_bit(const unsigned long *addr,
 		if (start >= nbits)
 			return nbits;
 
-		tmp = addr[start / BITS_PER_LONG] ^ invert;
+		tmp = addr1[start / BITS_PER_LONG];
+		if (addr2)
+			tmp &= addr2[start / BITS_PER_LONG];
+		tmp ^= invert;
 	}
 
 	return min(start + __ffs(tmp), nbits);
@@ -61,7 +71,7 @@ static unsigned long _find_next_bit(const unsigned long *addr,
 unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 			    unsigned long offset)
 {
-	return _find_next_bit(addr, size, offset, 0UL);
+	return _find_next_bit(addr, NULL, size, offset, 0UL);
 }
 EXPORT_SYMBOL(find_next_bit);
 #endif
@@ -70,11 +80,21 @@ EXPORT_SYMBOL(find_next_bit);
 unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 				 unsigned long offset)
 {
-	return _find_next_bit(addr, size, offset, ~0UL);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL);
 }
 EXPORT_SYMBOL(find_next_zero_bit);
 #endif
 
+#if !defined(find_next_and_bit)
+unsigned long find_next_and_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long size,
+		unsigned long offset)
+{
+	return _find_next_bit(addr1, addr2, size, offset, 0UL);
+}
+EXPORT_SYMBOL(find_next_and_bit);
+#endif
+
 #ifndef find_first_bit
 /*
  * Find the first set bit in a memory region.
@@ -146,15 +166,19 @@ static inline unsigned long ext2_swab(const unsigned long y)
 }
 
 #if !defined(find_next_bit_le) || !defined(find_next_zero_bit_le)
-static unsigned long _find_next_bit_le(const unsigned long *addr,
-		unsigned long nbits, unsigned long start, unsigned long invert)
+static inline unsigned long _find_next_bit_le(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long nbits,
+		unsigned long start, unsigned long invert)
 {
 	unsigned long tmp;
 
 	if (unlikely(start >= nbits))
 		return nbits;
 
-	tmp = addr[start / BITS_PER_LONG] ^ invert;
+	tmp = addr1[start / BITS_PER_LONG];
+	if (addr2)
+		tmp &= addr2[start / BITS_PER_LONG];
+	tmp ^= invert;
 
 	/* Handle 1st word. */
 	tmp &= ext2_swab(BITMAP_FIRST_WORD_MASK(start));
@@ -165,7 +189,10 @@ static unsigned long _find_next_bit_le(const unsigned long *addr,
 		if (start >= nbits)
 			return nbits;
 
-		tmp = addr[start / BITS_PER_LONG] ^ invert;
+		tmp = addr1[start / BITS_PER_LONG];
+		if (addr2)
+			tmp &= addr2[start / BITS_PER_LONG];
+		tmp ^= invert;
 	}
 
 	return min(start + __ffs(ext2_swab(tmp)), nbits);
@@ -176,7 +203,7 @@ static unsigned long _find_next_bit_le(const unsigned long *addr,
 unsigned long find_next_zero_bit_le(const void *addr, unsigned
 		long size, unsigned long offset)
 {
-	return _find_next_bit_le(addr, size, offset, ~0UL);
+	return _find_next_bit_le(addr, NULL, size, offset, ~0UL);
 }
 EXPORT_SYMBOL(find_next_zero_bit_le);
 #endif
@@ -185,7 +212,7 @@ EXPORT_SYMBOL(find_next_zero_bit_le);
 unsigned long find_next_bit_le(const void *addr, unsigned
 		long size, unsigned long offset)
 {
-	return _find_next_bit_le(addr, size, offset, 0UL);
+	return _find_next_bit_le(addr, NULL, size, offset, 0UL);
 }
 EXPORT_SYMBOL(find_next_bit_le);
 #endif
diff --git a/lib/test_find_bit.c b/lib/test_find_bit.c
index f4394a36f9aa..90773efa4694 100644
--- a/lib/test_find_bit.c
+++ b/lib/test_find_bit.c
@@ -35,6 +35,7 @@
 #define SPARSE		500
 
 static DECLARE_BITMAP(bitmap, BITMAP_LEN) __initdata;
+static DECLARE_BITMAP(bitmap2, BITMAP_LEN) __initdata;
 
 /*
  * This is Schlemiel the Painter's algorithm. It should be called after
@@ -107,6 +108,42 @@ static int __init test_find_last_bit(const void *bitmap, unsigned long len)
 	return 0;
 }
 
+static int __init test_find_next_and_bit(const void *bitmap,
+		const void *bitmap2, unsigned long len)
+{
+	unsigned long i, cnt;
+	cycles_t cycles;
+
+	cycles = get_cycles();
+	for (cnt = i = 0; i < BITMAP_LEN; cnt++)
+		i = find_next_and_bit(bitmap, bitmap2, BITMAP_LEN, i+1);
+	cycles = get_cycles() - cycles;
+	pr_err("find_next_and_bit: %ld cycles, %ld iterations\n", (long)cycles,
+		cnt);
+
+	return 0;
+}
+
+static int __init test_find_next_and_bit_ref(const void *bitmap,
+		const void *bitmap2, unsigned long len)
+{
+	unsigned long i, cnt;
+	cycles_t cycles;
+
+	cycles = get_cycles();
+	for (cnt = i = 0; i < BITMAP_LEN; cnt++)
+		while ((i = find_next_bit(bitmap, BITMAP_LEN, i + 1))
+			< BITMAP_LEN)
+			if (test_bit(i, bitmap2))
+				break;
+
+	cycles = get_cycles() - cycles;
+	pr_err("find_next_and_bit_ref: %ld cycles, %ld iterations\n",
+		(long)cycles, cnt);
+
+	return 0;
+}
+
 static int __init find_bit_test(void)
 {
 	unsigned long nbits = BITMAP_LEN / SPARSE;
@@ -114,23 +151,31 @@ static int __init find_bit_test(void)
 	pr_err("\nStart testing find_bit() with random-filled bitmap\n");
 
 	get_random_bytes(bitmap, sizeof(bitmap));
+	get_random_bytes(bitmap2, sizeof(bitmap2));
 
 	test_find_next_bit(bitmap, BITMAP_LEN);
 	test_find_next_zero_bit(bitmap, BITMAP_LEN);
 	test_find_last_bit(bitmap, BITMAP_LEN);
 	test_find_first_bit(bitmap, BITMAP_LEN);
+	test_find_next_and_bit_ref(bitmap, bitmap2, BITMAP_LEN);
+	test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);
 
 	pr_err("\nStart testing find_bit() with sparse bitmap\n");
 
 	bitmap_zero(bitmap, BITMAP_LEN);
+	bitmap_zero(bitmap2, BITMAP_LEN);
 
-	while (nbits--)
+	while (nbits--) {
 		__set_bit(prandom_u32() % BITMAP_LEN, bitmap);
+		__set_bit(prandom_u32() % BITMAP_LEN, bitmap2);
+	}
 
 	test_find_next_bit(bitmap, BITMAP_LEN);
 	test_find_next_zero_bit(bitmap, BITMAP_LEN);
 	test_find_last_bit(bitmap, BITMAP_LEN);
 	test_find_first_bit(bitmap, BITMAP_LEN);
+	test_find_next_and_bit_ref(bitmap, bitmap2, BITMAP_LEN);
+	test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);
 
 	return 0;
 }
diff --git a/tools/include/asm-generic/bitops/find.h b/tools/include/asm-generic/bitops/find.h
index 9311fadaaab2..16ed1982cb34 100644
--- a/tools/include/asm-generic/bitops/find.h
+++ b/tools/include/asm-generic/bitops/find.h
@@ -16,6 +16,22 @@ extern unsigned long find_next_bit(const unsigned long *addr, unsigned long
 		size, unsigned long offset);
 #endif
 
+#ifndef find_next_and_bit
+/**
+ * find_next_and_bit - find the next set bit in both memory regions
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+extern unsigned long find_next_and_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long size,
+		unsigned long offset);
+#endif
+
 #ifndef find_next_zero_bit
 
 /**
diff --git a/tools/lib/find_bit.c b/tools/lib/find_bit.c
index 42c15f906aac..9474f5856d4e 100644
--- a/tools/lib/find_bit.c
+++ b/tools/lib/find_bit.c
@@ -22,22 +22,29 @@
 #include <linux/bitmap.h>
 #include <linux/kernel.h>
 
-#if !defined(find_next_bit)
+#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
+		!defined(find_next_and_bit)
 
 /*
- * This is a common helper function for find_next_bit and
- * find_next_zero_bit.  The difference is the "invert" argument, which
- * is XORed with each fetched word before searching it for one bits.
+ * This is a common helper function for find_next_bit, find_next_zero_bit, and
+ * find_next_and_bit. The differences are:
+ *  - The "invert" argument, which is XORed with each fetched word before
+ *    searching it for one bits.
+ *  - The optional "addr2", which is anded with "addr1" if present.
  */
-static unsigned long _find_next_bit(const unsigned long *addr,
-		unsigned long nbits, unsigned long start, unsigned long invert)
+static inline unsigned long _find_next_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long nbits,
+		unsigned long start, unsigned long invert)
 {
 	unsigned long tmp;
 
 	if (unlikely(start >= nbits))
 		return nbits;
 
-	tmp = addr[start / BITS_PER_LONG] ^ invert;
+	tmp = addr1[start / BITS_PER_LONG];
+	if (addr2)
+		tmp &= addr2[start / BITS_PER_LONG];
+	tmp ^= invert;
 
 	/* Handle 1st word. */
 	tmp &= BITMAP_FIRST_WORD_MASK(start);
@@ -48,7 +55,10 @@ static unsigned long _find_next_bit(const unsigned long *addr,
 		if (start >= nbits)
 			return nbits;
 
-		tmp = addr[start / BITS_PER_LONG] ^ invert;
+		tmp = addr1[start / BITS_PER_LONG];
+		if (addr2)
+			tmp &= addr2[start / BITS_PER_LONG];
+		tmp ^= invert;
 	}
 
 	return min(start + __ffs(tmp), nbits);
@@ -62,7 +72,7 @@ static unsigned long _find_next_bit(const unsigned long *addr,
 unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 			    unsigned long offset)
 {
-	return _find_next_bit(addr, size, offset, 0UL);
+	return _find_next_bit(addr, NULL, size, offset, 0UL);
 }
 #endif
 
@@ -104,6 +114,16 @@ unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
 unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 				 unsigned long offset)
 {
-	return _find_next_bit(addr, size, offset, ~0UL);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL);
 }
 #endif
+
+#ifndef find_next_and_bit
+unsigned long find_next_and_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long size,
+		unsigned long offset)
+{
+	return _find_next_bit(addr1, addr2, size, offset, 0UL);
+}
+#endif
+
-- 
2.15.0.417.g466bffb3ac-goog

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v5] lib: optimize cpumask_next_and()
  2017-11-28 13:13 [PATCH v5] lib: optimize cpumask_next_and() Clement Courbet
@ 2017-11-29  7:16 ` Yury Norov
  2017-11-29  9:35   ` Clement Courbet
  0 siblings, 1 reply; 5+ messages in thread
From: Yury Norov @ 2017-11-29  7:16 UTC (permalink / raw)
  To: Clement Courbet; +Cc: Andrew Morton, linux-arch, linux-kernel

NACK. 

I'm sorry, but it seems you have to send v6. See comments inline.

On Tue, Nov 28, 2017 at 02:13:34PM +0100, Clement Courbet wrote:
> We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
> It's essentially a joined iteration in search for a non-zero bit, which
> is currently implemented as a lookup join (find a nonzero bit on the
> lhs, lookup the rhs to see if it's set there).
> 
> Implement a direct join (find a nonzero bit on the incrementally built
> join).
> 
> For cpumask_next_and, direct benchmarking shows that it's 1.17x to 14x
> faster with a geometric mean of 2.1 on 32 CPUs [1]. No impact on memory
> usage.
> 
> Also added generic bitmap benchmarks in the new `test_find_bit` module
> for the reference and new implementations (results: see
> `find_next_and_bit` and `find_next_and_bit_ref` in [2] and [3] below).
> Note that on Arm (), the new c implementation still outperforms the
> old one that uses c+ the asm implementation of `find_next_bit` [3].

What is 'c+'? Is it typo?

If you find generic find_bit() on arm faster that asm one, we'd definitely
drop that piece of asm. I have this check it in my long list.

> [1] Approximate benchmark code:
> 
> ```
>   unsigned long src1p[nr_cpumask_longs] = {pattern1};
>   unsigned long src2p[nr_cpumask_longs] = {pattern2};
>   for (/*a bunch of repetitions*/) {
>     for (int n = -1; n <= nr_cpu_ids; ++n) {
>       asm volatile("" : "+rm"(src1p)); // prevent any optimization
>       asm volatile("" : "+rm"(src2p));
>       unsigned long result = cpumask_next_and(n, src1p, src2p);
>       asm volatile("" : "+rm"(result));
>     }
>   }
> ```
> 
> Results:
> pattern1    pattern2     time_before/time_after
> 0x0000ffff  0x0000ffff   1.65
> 0x0000ffff  0x00005555   2.24
> 0x0000ffff  0x00001111   2.94
> 0x0000ffff  0x00000000   14.0
> 0x00005555  0x0000ffff   1.67
> 0x00005555  0x00005555   1.71
> 0x00005555  0x00001111   1.90
> 0x00005555  0x00000000   6.58
> 0x00001111  0x0000ffff   1.46
> 0x00001111  0x00005555   1.49
> 0x00001111  0x00001111   1.45
> 0x00001111  0x00000000   3.10
> 0x00000000  0x0000ffff   1.18
> 0x00000000  0x00005555   1.18
> 0x00000000  0x00001111   1.17
> 0x00000000  0x00000000   1.25
> -----------------------------
>                geo.mean  2.06
> 
> [2] test_find_next_bit, X86 (skylake)
> 
>  [ 3913.477422] Start testing find_bit() with random-filled bitmap
>  [ 3913.477847] find_next_bit: 160868 cycles, 16484 iterations
>  [ 3913.477933] find_next_zero_bit: 169542 cycles, 16285 iterations
>  [ 3913.478036] find_last_bit: 201638 cycles, 16483 iterations
>  [ 3913.480214] find_first_bit: 4353244 cycles, 16484 iterations
>  [ 3913.480216] Start testing find_next_and_bit() with random-filled
>  bitmap
>  [ 3913.481027] find_next_and_bit_ref: 319444 cycles, 8216 iterations
>  [ 3913.481074] find_next_and_bit: 89604 cycles, 8216 iterations
>  [ 3913.481075] Start testing find_bit() with sparse bitmap
>  [ 3913.481078] find_next_bit: 2536 cycles, 66 iterations
>  [ 3913.481252] find_next_zero_bit: 344404 cycles, 32703 iterations
>  [ 3913.481255] find_last_bit: 2006 cycles, 66 iterations
>  [ 3913.481265] find_first_bit: 17488 cycles, 66 iterations
>  [ 3913.481266] Start testing find_next_and_bit() with sparse bitmap
>  [ 3913.481270] find_next_and_bit_ref: 2486 cycles, 1 iterations
>  [ 3913.481272] find_next_and_bit: 764 cycles, 1 iterations
> 
> [3] test_find_next_bit, arm (v7 odroid XU3).
> 
> [  267.206928] Start testing find_bit() with random-filled bitmap
> [  267.214752] find_next_bit: 4474 cycles, 16419 iterations
> [  267.221850] find_next_zero_bit: 5976 cycles, 16350 iterations
> [  267.229294] find_last_bit: 4209 cycles, 16419 iterations
> [  267.279131] find_first_bit: 1032991 cycles, 16420 iterations
> [  267.286265] Start testing find_next_and_bit() with random-filled
> bitmap
> [  267.294895] find_next_and_bit_ref: 7572 cycles, 8140 iterations
> [  267.302386] find_next_and_bit: 2290 cycles, 8140 iterations
> [  267.309422] Start testing find_bit() with sparse bitmap
> [  267.316054] find_next_bit: 191 cycles, 66 iterations
> [  267.322726] find_next_zero_bit: 8758 cycles, 32703 iterations
> [  267.329803] find_last_bit: 84 cycles, 66 iterations
> [  267.336169] find_first_bit: 4118 cycles, 66 iterations
> [  267.342627] Start testing find_next_and_bit() with sparse bitmap
> [  267.349992] find_next_and_bit_ref: 193 cycles, 1 iterations
> [  267.356919] find_next_and_bit: 91 cycles, 1 iterations

This is old version of test based on get_cycles. New one is based on
ktime_get and has other minor changes. I think you'd rerun tests to
not confuse readers. New version is already in linux-next.

> Signed-off-by: Clement Courbet <courbet@google.com>
> ---
> Changes in v2:
>  - Refactored _find_next_common_bit into _find_next_bit., as suggested
>    by Yury Norov. This has no adverse effects on the performance side,
>    as the compiler successfully inlines the code.
> 
> Changes in v3:
>  - Fixes find_next_and_bit() declaration.
>  - Synchronize _find_next_bit_le() with _find_next_bit()
>  - Synchronize the code in tools/lib/find_bit.c
>  - Add find_next_and_bit to guard code
>  - Fix invert value (bad sync with our internal tree on which I'm doing
>    the testing).
> 
> Changes in v4:
>  - Mark _find_next_bit() inline.
> 
> Changes in v5:
>  - Added benchmarks to test_find_bit.cc
>  - Fixed arm compilation: added missing header to arm bitops.h
> 
>  arch/arm/include/asm/bitops.h           |  1 +
>  arch/unicore32/include/asm/bitops.h     |  2 ++
>  include/asm-generic/bitops/find.h       | 20 +++++++++++
>  include/linux/bitmap.h                  |  6 +++-
>  lib/cpumask.c                           |  9 ++---
>  lib/find_bit.c                          | 59 ++++++++++++++++++++++++---------
>  lib/test_find_bit.c                     | 47 +++++++++++++++++++++++++-
>  tools/include/asm-generic/bitops/find.h | 16 +++++++++
>  tools/lib/find_bit.c                    | 40 ++++++++++++++++------
>  9 files changed, 168 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/arm/include/asm/bitops.h b/arch/arm/include/asm/bitops.h
> index ce5ee762ed66..4cab9bb823fb 100644
> --- a/arch/arm/include/asm/bitops.h
> +++ b/arch/arm/include/asm/bitops.h
> @@ -338,6 +338,7 @@ static inline int find_next_bit_le(const void *p, int size, int offset)
>  
>  #endif
>  
> +#include <asm-generic/bitops/find.h>
>  #include <asm-generic/bitops/le.h>
>  
>  /*
> diff --git a/arch/unicore32/include/asm/bitops.h b/arch/unicore32/include/asm/bitops.h
> index 401f597bc38c..c0cbdbe17168 100644
> --- a/arch/unicore32/include/asm/bitops.h
> +++ b/arch/unicore32/include/asm/bitops.h
> @@ -44,4 +44,6 @@ static inline int fls(int x)
>  #define find_first_bit		find_first_bit
>  #define find_first_zero_bit	find_first_zero_bit
>  
> +#include <asm-generic/bitops/find.h>
> +
>  #endif /* __UNICORE_BITOPS_H__ */
> diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
> index 1ba611e16fa0..8a1ee10014de 100644
> --- a/include/asm-generic/bitops/find.h
> +++ b/include/asm-generic/bitops/find.h
> @@ -16,6 +16,22 @@ extern unsigned long find_next_bit(const unsigned long *addr, unsigned long
>  		size, unsigned long offset);
>  #endif
>  
> +#ifndef find_next_and_bit
> +/**
> + * find_next_and_bit - find the next set bit in both memory regions
> + * @addr1: The first address to base the search on
> + * @addr2: The second address to base the search on
> + * @offset: The bitnumber to start searching at
> + * @size: The bitmap size in bits
> + *
> + * Returns the bit number for the next set bit
> + * If no bits are set, returns @size.
> + */
> +extern unsigned long find_next_and_bit(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long size,
> +		unsigned long offset);
> +#endif
> +
>  #ifndef find_next_zero_bit
>  /**
>   * find_next_zero_bit - find the next cleared bit in a memory region
> @@ -55,8 +71,12 @@ extern unsigned long find_first_zero_bit(const unsigned long *addr,
>  					 unsigned long size);
>  #else /* CONFIG_GENERIC_FIND_FIRST_BIT */
>  
> +#ifndef find_first_bit
>  #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
> +#endif
> +#ifndef find_first_zero_bit
>  #define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
> +#endif

How this change related to the find_next_and_bit?

Does unguarded version breaks something? Please elaborate.

>  #endif /* CONFIG_GENERIC_FIND_FIRST_BIT */
>  
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 3489253e38fc..45a9e169d0fd 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -83,8 +83,12 @@
>   *  test_and_change_bit(bit, addr)      Change bit and return old value
>   *  find_first_zero_bit(addr, nbits)    Position first zero bit in *addr
>   *  find_first_bit(addr, nbits)         Position first set bit in *addr
> - *  find_next_zero_bit(addr, nbits, bit)  Position next zero bit in *addr >= bit
> + *  find_next_zero_bit(addr, nbits, bit)
> + *                                      Position next zero bit in *addr >= bit
>   *  find_next_bit(addr, nbits, bit)     Position next set bit in *addr >= bit
> + *  find_next_and_bit(addr1, addr2, nbits, bit)
> + *                                      Same as find_first_bit, but in
> + *                                      (*addr1 & *addr2)

Nit. Addr1, addr2 - doesn't sound self-explainative. What about
find_next_and_bit(addr1, and_map, nbits, bit), or something like this?

>   *
>   */
>  
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index 35fe142ebb5e..beca6244671a 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -33,10 +33,11 @@ EXPORT_SYMBOL(cpumask_next);
>  int cpumask_next_and(int n, const struct cpumask *src1p,
>  		     const struct cpumask *src2p)
>  {
> -	while ((n = cpumask_next(n, src1p)) < nr_cpu_ids)
> -		if (cpumask_test_cpu(n, src2p))
> -			break;
> -	return n;
> +	/* -1 is a legal arg here. */
> +	if (n != -1)
> +		cpumask_check(n);
> +	return find_next_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
> +		nr_cpumask_bits, n + 1);
>  }
>  EXPORT_SYMBOL(cpumask_next_and);
>  
> diff --git a/lib/find_bit.c b/lib/find_bit.c
> index 6ed74f78380c..ee3df93ba69a 100644
> --- a/lib/find_bit.c
> +++ b/lib/find_bit.c
> @@ -21,22 +21,29 @@
>  #include <linux/export.h>
>  #include <linux/kernel.h>
>  
> -#if !defined(find_next_bit) || !defined(find_next_zero_bit)
> +#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
> +		!defined(find_next_and_bit)
>  
>  /*
> - * This is a common helper function for find_next_bit and
> - * find_next_zero_bit.  The difference is the "invert" argument, which
> - * is XORed with each fetched word before searching it for one bits.
> + * This is a common helper function for find_next_bit, find_next_zero_bit, and
> + * find_next_and_bit. The differences are:
> + *  - The "invert" argument, which is XORed with each fetched word before
> + *    searching it for one bits.
> + *  - The optional "addr2", which is anded with "addr1" if present.

The optional "and", which is anded with "addr" if present. - sounds
better, isn't?

>   */
> -static unsigned long _find_next_bit(const unsigned long *addr,
> -		unsigned long nbits, unsigned long start, unsigned long invert)
> +static inline unsigned long _find_next_bit(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long nbits,
> +		unsigned long start, unsigned long invert)
>  {
>  	unsigned long tmp;
>  
>  	if (unlikely(start >= nbits))
>  		return nbits;
>  
> -	tmp = addr[start / BITS_PER_LONG] ^ invert;
> +	tmp = addr1[start / BITS_PER_LONG];
> +	if (addr2)
> +		tmp &= addr2[start / BITS_PER_LONG];
> +	tmp ^= invert;
>  
>  	/* Handle 1st word. */
>  	tmp &= BITMAP_FIRST_WORD_MASK(start);
> @@ -47,7 +54,10 @@ static unsigned long _find_next_bit(const unsigned long *addr,
>  		if (start >= nbits)
>  			return nbits;
>  
> -		tmp = addr[start / BITS_PER_LONG] ^ invert;
> +		tmp = addr1[start / BITS_PER_LONG];
> +		if (addr2)
> +			tmp &= addr2[start / BITS_PER_LONG];
> +		tmp ^= invert;
>  	}
>  
>  	return min(start + __ffs(tmp), nbits);
> @@ -61,7 +71,7 @@ static unsigned long _find_next_bit(const unsigned long *addr,
>  unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>  			    unsigned long offset)
>  {
> -	return _find_next_bit(addr, size, offset, 0UL);
> +	return _find_next_bit(addr, NULL, size, offset, 0UL);
>  }
>  EXPORT_SYMBOL(find_next_bit);
>  #endif
> @@ -70,11 +80,21 @@ EXPORT_SYMBOL(find_next_bit);
>  unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
>  				 unsigned long offset)
>  {
> -	return _find_next_bit(addr, size, offset, ~0UL);
> +	return _find_next_bit(addr, NULL, size, offset, ~0UL);
>  }
>  EXPORT_SYMBOL(find_next_zero_bit);
>  #endif
>  
> +#if !defined(find_next_and_bit)
> +unsigned long find_next_and_bit(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long size,
> +		unsigned long offset)
> +{
> +	return _find_next_bit(addr1, addr2, size, offset, 0UL);
> +}
> +EXPORT_SYMBOL(find_next_and_bit);
> +#endif
> +
>  #ifndef find_first_bit
>  /*
>   * Find the first set bit in a memory region.
> @@ -146,15 +166,19 @@ static inline unsigned long ext2_swab(const unsigned long y)
>  }
>  
>  #if !defined(find_next_bit_le) || !defined(find_next_zero_bit_le)
> -static unsigned long _find_next_bit_le(const unsigned long *addr,
> -		unsigned long nbits, unsigned long start, unsigned long invert)
> +static inline unsigned long _find_next_bit_le(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long nbits,
> +		unsigned long start, unsigned long invert)
>  {
>  	unsigned long tmp;
>  
>  	if (unlikely(start >= nbits))
>  		return nbits;
>  
> -	tmp = addr[start / BITS_PER_LONG] ^ invert;
> +	tmp = addr1[start / BITS_PER_LONG];
> +	if (addr2)
> +		tmp &= addr2[start / BITS_PER_LONG];
> +	tmp ^= invert;
>  
>  	/* Handle 1st word. */
>  	tmp &= ext2_swab(BITMAP_FIRST_WORD_MASK(start));
> @@ -165,7 +189,10 @@ static unsigned long _find_next_bit_le(const unsigned long *addr,
>  		if (start >= nbits)
>  			return nbits;
>  
> -		tmp = addr[start / BITS_PER_LONG] ^ invert;
> +		tmp = addr1[start / BITS_PER_LONG];
> +		if (addr2)
> +			tmp &= addr2[start / BITS_PER_LONG];
> +		tmp ^= invert;
>  	}
>  
>  	return min(start + __ffs(ext2_swab(tmp)), nbits);
> @@ -176,7 +203,7 @@ static unsigned long _find_next_bit_le(const unsigned long *addr,
>  unsigned long find_next_zero_bit_le(const void *addr, unsigned
>  		long size, unsigned long offset)
>  {
> -	return _find_next_bit_le(addr, size, offset, ~0UL);
> +	return _find_next_bit_le(addr, NULL, size, offset, ~0UL);
>  }
>  EXPORT_SYMBOL(find_next_zero_bit_le);
>  #endif
> @@ -185,7 +212,7 @@ EXPORT_SYMBOL(find_next_zero_bit_le);
>  unsigned long find_next_bit_le(const void *addr, unsigned
>  		long size, unsigned long offset)
>  {
> -	return _find_next_bit_le(addr, size, offset, 0UL);
> +	return _find_next_bit_le(addr, NULL, size, offset, 0UL);
>  }
>  EXPORT_SYMBOL(find_next_bit_le);
>  #endif
> diff --git a/lib/test_find_bit.c b/lib/test_find_bit.c
> index f4394a36f9aa..90773efa4694 100644
> --- a/lib/test_find_bit.c
> +++ b/lib/test_find_bit.c
> @@ -35,6 +35,7 @@
>  #define SPARSE		500
>  
>  static DECLARE_BITMAP(bitmap, BITMAP_LEN) __initdata;
> +static DECLARE_BITMAP(bitmap2, BITMAP_LEN) __initdata;
>  
>  /*
>   * This is Schlemiel the Painter's algorithm. It should be called after
> @@ -107,6 +108,42 @@ static int __init test_find_last_bit(const void *bitmap, unsigned long len)
>  	return 0;
>  }
>  
> +static int __init test_find_next_and_bit(const void *bitmap,
> +		const void *bitmap2, unsigned long len)
> +{
> +	unsigned long i, cnt;
> +	cycles_t cycles;
> +
> +	cycles = get_cycles();

Please switch to ktime_get, as other tests do.

> +	for (cnt = i = 0; i < BITMAP_LEN; cnt++)
> +		i = find_next_and_bit(bitmap, bitmap2, BITMAP_LEN, i+1);
> +	cycles = get_cycles() - cycles;
> +	pr_err("find_next_and_bit: %ld cycles, %ld iterations\n", (long)cycles,
> +		cnt);

Use pretty formatting here to align with others tests output.

> +
> +	return 0;
> +}
> +
> +static int __init test_find_next_and_bit_ref(const void *bitmap,
> +		const void *bitmap2, unsigned long len)

> +{
> +	unsigned long i, cnt;
> +	cycles_t cycles;
> +
> +	cycles = get_cycles();
> +	for (cnt = i = 0; i < BITMAP_LEN; cnt++)
> +		while ((i = find_next_bit(bitmap, BITMAP_LEN, i + 1))
> +			< BITMAP_LEN)

Nit:		while (i < BITMAP_LEN) {
                        i = find_next_bit(bitmap, BITMAP_LEN, i + 1);

> +			if (test_bit(i, bitmap2))
> +				break;

                }

> +
> +	cycles = get_cycles() - cycles;
> +	pr_err("find_next_and_bit_ref: %ld cycles, %ld iterations\n",
> +		(long)cycles, cnt);
> +
> +	return 0;
> +}

I don't understand the purpose of this. It's obviously clear that
test_find_next_and_bit cannot be slower than test_find_next_and_bit_ref

We don't have 'reference' implementations for other functions here.
If you still think that you need it, please introduce
find_next_and_bit_ref() to make the purpose of test clear. (I spent
more that a minute to understand what happens here, and my first
suggestion was wrong.)

>  static int __init find_bit_test(void)
>  {
>  	unsigned long nbits = BITMAP_LEN / SPARSE;
> @@ -114,23 +151,31 @@ static int __init find_bit_test(void)
>  	pr_err("\nStart testing find_bit() with random-filled bitmap\n");
>  
>  	get_random_bytes(bitmap, sizeof(bitmap));
> +	get_random_bytes(bitmap2, sizeof(bitmap2));
>  
>  	test_find_next_bit(bitmap, BITMAP_LEN);
>  	test_find_next_zero_bit(bitmap, BITMAP_LEN);
>  	test_find_last_bit(bitmap, BITMAP_LEN);
>  	test_find_first_bit(bitmap, BITMAP_LEN);
> +	test_find_next_and_bit_ref(bitmap, bitmap2, BITMAP_LEN);
> +	test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);
>  
>  	pr_err("\nStart testing find_bit() with sparse bitmap\n");
>  
>  	bitmap_zero(bitmap, BITMAP_LEN);
> +	bitmap_zero(bitmap2, BITMAP_LEN);
>  
> -	while (nbits--)
> +	while (nbits--) {
>  		__set_bit(prandom_u32() % BITMAP_LEN, bitmap);
> +		__set_bit(prandom_u32() % BITMAP_LEN, bitmap2);
> +	}
>  
>  	test_find_next_bit(bitmap, BITMAP_LEN);
>  	test_find_next_zero_bit(bitmap, BITMAP_LEN);
>  	test_find_last_bit(bitmap, BITMAP_LEN);
>  	test_find_first_bit(bitmap, BITMAP_LEN);
> +	test_find_next_and_bit_ref(bitmap, bitmap2, BITMAP_LEN);
> +	test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);

For sparse bitmaps it will be like traversing zero-bitmaps. I doubt
this numbers will be representative. Do we need this test at all?

>  	return 0;
>  }

Yury

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v5] lib: optimize cpumask_next_and()
  2017-11-29  7:16 ` Yury Norov
@ 2017-11-29  9:35   ` Clement Courbet
  2017-11-30  6:21     ` Yury Norov
  0 siblings, 1 reply; 5+ messages in thread
From: Clement Courbet @ 2017-11-29  9:35 UTC (permalink / raw)
  To: Yury Norov; +Cc: linux-kernel, Andrew Morton

> > Note that on Arm (), the new c implementation still outperforms the
> > old one that uses c+ the asm implementation of `find_next_bit` [3].
> What is 'c+'? Is it typo?

I meant "a mix of C and asm" ~(C + asm). Rephrased.

> If you find generic find_bit() on arm faster that asm one, we'd
> definitely drop that piece of asm. I have this check it in my
> long list.

What's faster for sure is the mix (the improvement in this commit minus the
possible hit from not using the ASM implementation). I can't tell whether the
latter is negligible or not (I only have one ARM board to try it out), but
that's definitly something to try.

> This is old version of test based on get_cycles. New one is based on
> ktime_get and has other minor changes. I think you'd rerun tests to
> not confuse readers. New version is already in linux-next.

So I'm not sure whether I should be submitting this against 'linux' or
'linux-next' ? This patch is against 'linux', so I think it should
be consistent with the code around.

> > #ifndef find_first_bit
> > #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
> > #endif
> > #ifndef find_first_zero_bit
> > #define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
> > #endif
> How this change related to the find_next_and_bit?

The arm header defines these symbols. Now that we're including
the generic implementation in the arm headers, we need to guard this to
avoid the duplicate definition.

> > test_find_next_and_bit_ref
> I don't understand the purpose of this. It's obviously clear that
> test_find_next_and_bit cannot be slower than test_find_next_and_bit_ref

Fair enough :) That was to back my claim that this commit is worth it.
I've removed the "_ref" version.

> For sparse bitmaps it will be like traversing zero-bitmaps. I doubt
> this numbers will be representative. Do we need this test at all?

It's just two lines, and gives an interesting data point. Why not
keep it ?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v5] lib: optimize cpumask_next_and()
  2017-11-29  9:35   ` Clement Courbet
@ 2017-11-30  6:21     ` Yury Norov
  2017-11-30 14:00       ` Clement Courbet
  0 siblings, 1 reply; 5+ messages in thread
From: Yury Norov @ 2017-11-30  6:21 UTC (permalink / raw)
  To: Clement Courbet; +Cc: linux-kernel, Andrew Morton

On Wed, Nov 29, 2017 at 10:35:55AM +0100, Clement Courbet wrote:
> > > Note that on Arm (), the new c implementation still outperforms the
> > > old one that uses c+ the asm implementation of `find_next_bit` [3].
> > What is 'c+'? Is it typo?
> 
> I meant "a mix of C and asm" ~(C + asm). Rephrased.
> 
> > If you find generic find_bit() on arm faster that asm one, we'd
> > definitely drop that piece of asm. I have this check it in my
> > long list.
> 
> What's faster for sure is the mix (the improvement in this commit minus the
> possible hit from not using the ASM implementation). I can't tell whether the
> latter is negligible or not (I only have one ARM board to try it out), but
> that's definitly something to try.
> 
> > This is old version of test based on get_cycles. New one is based on
> > ktime_get and has other minor changes. I think you'd rerun tests to
> > not confuse readers. New version is already in linux-next.
> 
> So I'm not sure whether I should be submitting this against 'linux' or
> 'linux-next' ? This patch is against 'linux', so I think it should
> be consistent with the code around.

Linux-next is your choice. 
https://lwn.net/Articles/289013/


> > > #ifndef find_first_bit
> > > #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
> > > #endif
> > > #ifndef find_first_zero_bit
> > > #define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
> > > #endif
> > How this change related to the find_next_and_bit?
> 
> The arm header defines these symbols. Now that we're including
> the generic implementation in the arm headers, we need to guard this to
> avoid the duplicate definition.

So I think it really worth to be separated patch. Really, it's
completely nontrivial why adding new function in lib/find_bit.c
requires including asm-generic/bitops/find.h in arm and uncore32
asm/bitops.h headers (bug?). And why doing that makes you guard
find_first_bit and find_first_zero_bit (another bug?).

Headers are always important. Please elaborate it and also CC arm
and uncore32 communities, linux-arch and Arnd Bergman.

> > > test_find_next_and_bit_ref
> > I don't understand the purpose of this. It's obviously clear that
> > test_find_next_and_bit cannot be slower than test_find_next_and_bit_ref
> 
> Fair enough :) That was to back my claim that this commit is worth it.
> I've removed the "_ref" version.
> 
> > For sparse bitmaps it will be like traversing zero-bitmaps. I doubt
> > this numbers will be representative. Do we need this test at all?
> 
> It's just two lines, and gives an interesting data point. Why not
> keep it ?

---

Again. test_find_next_and_bit is trimmed, but it is still based on
get_cycles and uses tabs in printf(). Please fix it.

Yury

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v5] lib: optimize cpumask_next_and()
  2017-11-30  6:21     ` Yury Norov
@ 2017-11-30 14:00       ` Clement Courbet
  0 siblings, 0 replies; 5+ messages in thread
From: Clement Courbet @ 2017-11-30 14:00 UTC (permalink / raw)
  To: Yury Norov; +Cc: linux-kernel, Andrew Morton

> So I think it really worth to be separated patch. Really, it's
> completely nontrivial why adding new function in lib/find_bit.c
> requires including asm-generic/bitops/find.h in arm and uncore32
> asm/bitops.h headers (bug?). And why doing that makes you guard
> find_first_bit and find_first_zero_bit (another bug?).

OK, I'll send a separate patch for this.

> Linux-next is your choice.
>  [...]
> Again. test_find_next_and_bit is trimmed, but it is still based on
> get_cycles and uses tabs in printf(). Please fix it.

OK, I'll send a version of the patch rebased against linux-next.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-11-30 14:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-28 13:13 [PATCH v5] lib: optimize cpumask_next_and() Clement Courbet
2017-11-29  7:16 ` Yury Norov
2017-11-29  9:35   ` Clement Courbet
2017-11-30  6:21     ` Yury Norov
2017-11-30 14:00       ` Clement Courbet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.