* [PATCH] x86: Change x86 to use generic find_next_bit
@ 2008-03-09 20:01 Alexander van Heukelum
  2008-03-09 20:10 ` Ingo Molnar
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread

From: Alexander van Heukelum @ 2008-03-09 20:01 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, LKML; +Cc: heukelum

x86: Change x86 to use the generic find_next_bit implementation

The versions with inline assembly are in fact slower on the machines I
tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz,
AMD Opteron 270). The i386-version needed a fix similar to 06024f21 to
avoid crashing the benchmark.

Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
1...512, for each possible bitmap with one bit set, for each possible
offset: find the position of the first bit starting at offset. If you
follow ;). Times include setup of the bitmap and checking of the
results.

               Athlon      Xeon        Opteron 32/64bit
x86-specific:  0m3.692s    0m2.820s    0m3.196s / 0m2.480s
generic:       0m2.622s    0m1.662s    0m2.100s / 0m1.572s

If the bitmap size is not a multiple of BITS_PER_LONG, and no set
(cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
value outside of the range [0, size]. The generic version always
returns exactly size. The generic version also uses unsigned long
everywhere, while the x86 versions use a mishmash of int, unsigned
(int), long and unsigned long.

Using the generic version does give a slightly bigger kernel, though.
defconfig:     text     data    bss     dec      hex     filename
x86-specific:  4738555  481232  626688  5846475  5935cb  vmlinux (32 bit)
generic:       4738621  481232  626688  5846541  59360d  vmlinux (32 bit)
x86-specific:  5392395  846568  724424  6963387  6a40bb  vmlinux (64 bit)
generic:       5392458  846568  724424  6963450  6a40fa  vmlinux (64 bit)

 arch/x86/Kconfig            |    3 ++
 arch/x86/lib/Makefile       |    2 +-
 arch/x86/lib/bitops_32.c    |   70 -------------------------------------
 arch/x86/lib/bitops_64.c    |   68 ------------------------------------
 include/asm-x86/bitops.h    |    6 ++++
 include/asm-x86/bitops_32.h |   16 ----------
 include/asm-x86/bitops_64.h |    2 -
 lib/find_next_bit.c         |    2 +
 8 files changed, 12 insertions(+), 157 deletions(-)

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---

Hi,

I think it would be a good idea to use the generic versions of
find_next_bit and find_next_zero_bit. The i386 versions have a bug, and
the generic ones are faster (in my probably not-so-representative
micro-benchmark, that is). Patch is against -x86#testing. It compiles.
Greetings,
	Alexander

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ab85a04..29a5a38 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -83,6 +83,9 @@ config GENERIC_BUG
 	def_bool y
 	depends on BUG

+config GENERIC_FIND_NEXT_BIT
+	def_bool y
+
 config GENERIC_HWEIGHT
 	def_bool y

diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 25df1c1..4360932 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -11,7 +11,7 @@ lib-y += memcpy_$(BITS).o
 ifeq ($(CONFIG_X86_32),y)
         lib-y += checksum_32.o
         lib-y += strstr_32.o
-        lib-y += bitops_32.o semaphore_32.o string_32.o
+        lib-y += semaphore_32.o string_32.o

         lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
 else
diff --git a/arch/x86/lib/bitops_32.c b/arch/x86/lib/bitops_32.c
deleted file mode 100644
index b654404..0000000
--- a/arch/x86/lib/bitops_32.c
+++ /dev/null
@@ -1,70 +0,0 @@
-#include <linux/bitops.h>
-#include <linux/module.h>
-
-/**
- * find_next_bit - find the next set bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The maximum size to search
- */
-int find_next_bit(const unsigned long *addr, int size, int offset)
-{
-	const unsigned long *p = addr + (offset >> 5);
-	int set = 0, bit = offset & 31, res;
-
-	if (bit) {
-		/*
-		 * Look for nonzero in the first 32 bits:
-		 */
-		__asm__("bsfl %1,%0\n\t"
-			"jne 1f\n\t"
-			"movl $32, %0\n"
-			"1:"
-			: "=r" (set)
-			: "r" (*p >> bit));
-		if (set < (32 - bit))
-			return set + offset;
-		set = 32 - bit;
-		p++;
-	}
-	/*
-	 * No set bit yet, search remaining full words for a bit
-	 */
-	res = find_first_bit (p, size - 32 * (p - addr));
-	return (offset + set + res);
-}
-EXPORT_SYMBOL(find_next_bit);
-
-/**
- * find_next_zero_bit - find the first zero bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The maximum size to search
- */
-int find_next_zero_bit(const unsigned long *addr, int size, int offset)
-{
-	const unsigned long *p = addr + (offset >> 5);
-	int set = 0, bit = offset & 31, res;
-
-	if (bit) {
-		/*
-		 * Look for zero in the first 32 bits.
-		 */
-		__asm__("bsfl %1,%0\n\t"
-			"jne 1f\n\t"
-			"movl $32, %0\n"
-			"1:"
-			: "=r" (set)
-			: "r" (~(*p >> bit)));
-		if (set < (32 - bit))
-			return set + offset;
-		set = 32 - bit;
-		p++;
-	}
-	/*
-	 * No zero yet, search remaining full bytes for a zero
-	 */
-	res = find_first_zero_bit(p, size - 32 * (p - addr));
-	return (offset + set + res);
-}
-EXPORT_SYMBOL(find_next_zero_bit);
diff --git a/arch/x86/lib/bitops_64.c b/arch/x86/lib/bitops_64.c
index 0e8f491..0eeb704 100644
--- a/arch/x86/lib/bitops_64.c
+++ b/arch/x86/lib/bitops_64.c
@@ -1,9 +1,7 @@
 #include <linux/bitops.h>

 #undef find_first_zero_bit
-#undef find_next_zero_bit
 #undef find_first_bit
-#undef find_next_bit

 static inline long
 __find_first_zero_bit(const unsigned long * addr, unsigned long size)
@@ -57,39 +55,6 @@ long find_first_zero_bit(const unsigned long * addr, unsigned long size)
 	return __find_first_zero_bit (addr, size);
 }

-/**
- * find_next_zero_bit - find the next zero bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The maximum size to search
- */
-long find_next_zero_bit (const unsigned long * addr, long size, long offset)
-{
-	const unsigned long * p = addr + (offset >> 6);
-	unsigned long set = 0;
-	unsigned long res, bit = offset&63;
-
-	if (bit) {
-		/*
-		 * Look for zero in first word
-		 */
-		asm("bsfq %1,%0\n\t"
-		    "cmoveq %2,%0"
-		    : "=r" (set)
-		    : "r" (~(*p >> bit)), "r"(64L));
-		if (set < (64 - bit))
-			return set + offset;
-		set = 64 - bit;
-		p++;
-	}
-	/*
-	 * No zero yet, search remaining full words for a zero
-	 */
-	res = __find_first_zero_bit (p, size - 64 * (p - addr));
-
-	return (offset + set + res);
-}
-
 static inline long
 __find_first_bit(const unsigned long * addr, unsigned long size)
 {
@@ -136,40 +101,7 @@ long find_first_bit(const unsigned long * addr, unsigned long size)
 	return __find_first_bit(addr,size);
 }

-/**
- * find_next_bit - find the first set bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The maximum size to search
- */
-long find_next_bit(const unsigned long * addr, long size, long offset)
-{
-	const unsigned long * p = addr + (offset >> 6);
-	unsigned long set = 0, bit = offset & 63, res;
-
-	if (bit) {
-		/*
-		 * Look for nonzero in the first 64 bits:
-		 */
-		asm("bsfq %1,%0\n\t"
-		    "cmoveq %2,%0\n\t"
-		    : "=r" (set)
-		    : "r" (*p >> bit), "r" (64L));
-		if (set < (64 - bit))
-			return set + offset;
-		set = 64 - bit;
-		p++;
-	}
-	/*
-	 * No set bit yet, search remaining full words for a bit
-	 */
-	res = __find_first_bit (p, size - 64 * (p - addr));
-	return (offset + set + res);
-}
-
 #include <linux/module.h>

-EXPORT_SYMBOL(find_next_bit);
 EXPORT_SYMBOL(find_first_bit);
 EXPORT_SYMBOL(find_first_zero_bit);
-EXPORT_SYMBOL(find_next_zero_bit);
diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h
index 1a23ce1..7fbf93a 100644
--- a/include/asm-x86/bitops.h
+++ b/include/asm-x86/bitops.h
@@ -312,6 +312,12 @@ static int test_bit(int nr, const volatile unsigned long *addr);

 #undef ADDR

+unsigned long find_next_bit(const unsigned long *addr,
+		unsigned long size, unsigned long offset);
+unsigned long find_next_zero_bit(const unsigned long *addr,
+		unsigned long size, unsigned long offset);
+
+
 #ifdef CONFIG_X86_32
 # include "bitops_32.h"
 #else
diff --git a/include/asm-x86/bitops_32.h b/include/asm-x86/bitops_32.h
index e4d75fc..570f0fa 100644
--- a/include/asm-x86/bitops_32.h
+++ b/include/asm-x86/bitops_32.h
@@ -38,14 +38,6 @@ static inline int find_first_zero_bit(const unsigned long *addr, unsigned size)
 }

 /**
- * find_next_zero_bit - find the first zero bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bit number to start searching at
- * @size: The maximum size to search
- */
-int find_next_zero_bit(const unsigned long *addr, int size, int offset);
-
-/**
  * __ffs - find first bit in word.
  * @word: The word to search
  *
@@ -81,14 +73,6 @@ static inline unsigned find_first_bit(const unsigned long *addr, unsigned size)
 }

 /**
- * find_next_bit - find the first set bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bit number to start searching at
- * @size: The maximum size to search
- */
-int find_next_bit(const unsigned long *addr, int size, int offset);
-
-/**
  * ffz - find first zero in word.
  * @word: The word to search
  *
diff --git a/include/asm-x86/bitops_64.h b/include/asm-x86/bitops_64.h
index aaf1519..40bf0f8 100644
--- a/include/asm-x86/bitops_64.h
+++ b/include/asm-x86/bitops_64.h
@@ -6,9 +6,7 @@
  */

 extern long find_first_zero_bit(const unsigned long *addr, unsigned long size);
-extern long find_next_zero_bit(const unsigned long *addr, long size, long offset);
 extern long find_first_bit(const unsigned long *addr, unsigned long size);
-extern long find_next_bit(const unsigned long *addr, long size, long offset);

 /* return index of first bet set in val or max when no bit is set */
 static inline long __scanbit(unsigned long val, unsigned long max)
diff --git a/lib/find_next_bit.c b/lib/find_next_bit.c
index 78ccd73..5820e07 100644
--- a/lib/find_next_bit.c
+++ b/lib/find_next_bit.c
@@ -15,6 +15,8 @@
 #include <asm/byteorder.h>

 #define BITOP_WORD(nr)		((nr) / BITS_PER_LONG)
+#undef find_next_bit
+#undef find_next_zero_bit

 /**
  * find_next_bit - find the next set bit in a memory region

^ permalink raw reply related	[flat|nested] 19+ messages in thread
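The micro-benchmark described in the changelog ("for each bitmap size
1...512, for each possible bitmap with one bit set, for each possible
offset") can be sketched in user space along the following lines. This is a
hypothetical reconstruction, not the author's actual harness: it uses a
naive bit-by-bit scan as the implementation under test, and caps the size
at 128 to keep the run short.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MAX_SIZE 128
#define WORDS (MAX_SIZE / BITS_PER_LONG)

/* Naive reference implementation: scan one bit at a time, return
 * 'size' when no set bit is found at or after 'offset'. */
static unsigned long naive_fnb(const unsigned long *addr,
                               unsigned long size, unsigned long offset)
{
    for (unsigned long i = offset; i < size; i++)
        if ((addr[i / BITS_PER_LONG] >> (i % BITS_PER_LONG)) & 1UL)
            return i;
    return size;
}

/* The triple loop from the changelog: every size, every one-bit
 * bitmap, every offset; each result is checked against the known
 * position of the single set bit. */
static unsigned long checked_runs(unsigned long (*fnb)(const unsigned long *,
                                                       unsigned long,
                                                       unsigned long))
{
    unsigned long bitmap[WORDS];
    unsigned long runs = 0;

    for (unsigned long size = 1; size <= MAX_SIZE; size++) {
        for (unsigned long bit = 0; bit < size; bit++) {
            memset(bitmap, 0, sizeof(bitmap));
            bitmap[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
            for (unsigned long off = 0; off <= size; off++) {
                unsigned long res = fnb(bitmap, size, off);
                /* expected: 'bit' while still in range, else 'size' */
                assert(res == (off <= bit ? bit : size));
                runs++;
            }
        }
    }
    return runs;
}
```

Timing such a loop with `time` for each candidate implementation gives
numbers of the same shape as the table in the changelog.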
* Re: [PATCH] x86: Change x86 to use generic find_next_bit
  2008-03-09 20:01 [PATCH] x86: Change x86 to use generic find_next_bit Alexander van Heukelum
@ 2008-03-09 20:10 ` Ingo Molnar
  2008-03-09 21:03   ` Andi Kleen
  2008-03-09 21:13   ` Alexander van Heukelum
  2008-03-09 20:11 ` Ingo Molnar
  ` (3 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread

From: Ingo Molnar @ 2008-03-09 20:10 UTC (permalink / raw)
To: Alexander van Heukelum; +Cc: Thomas Gleixner, H. Peter Anvin, LKML, heukelum

* Alexander van Heukelum <heukelum@mailshack.com> wrote:

> x86: Change x86 to use the generic find_next_bit implementation
>
> The versions with inline assembly are in fact slower on the machines I
> tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz,
> AMD Opteron 270). The i386-version needed a fix similar to 06024f21 to
> avoid crashing the benchmark.
>
> Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size
> 1...512, for each possible bitmap with one bit set, for each possible
> offset: find the position of the first bit starting at offset. If you
> follow ;). Times include setup of the bitmap and checking of the
> results.
>
>                Athlon      Xeon        Opteron 32/64bit
> x86-specific:  0m3.692s    0m2.820s    0m3.196s / 0m2.480s
> generic:       0m2.622s    0m1.662s    0m2.100s / 0m1.572s

ok, that's rather convincing.

the generic version in lib/find_next_bit.c is open-coded C which gcc
can optimize pretty nicely.

the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly use
the special x86 'bit search forward' (BSF) instruction - which i know
from the days when the scheduler relied on it has some non-trivial
setup costs. So especially when there's _small_ bitmasks involved, it's
more expensive.

> If the bitmap size is not a multiple of BITS_PER_LONG, and no set
> (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
> value outside of the range [0, size]. The generic version always
> returns exactly size. The generic version also uses unsigned long
> everywhere, while the x86 versions use a mishmash of int, unsigned
> (int), long and unsigned long.

i'm not surprised that the hand-coded assembly versions had a bug ...

[ this means we have to test it quite carefully though, as lots of code
  only ever gets tested on x86 so code could have built dependency on
  the buggy behavior. ]

> Using the generic version does give a slightly bigger kernel, though.
>
> defconfig:     text     data    bss     dec      hex     filename
> x86-specific:  4738555  481232  626688  5846475  5935cb  vmlinux (32 bit)
> generic:       4738621  481232  626688  5846541  59360d  vmlinux (32 bit)
> x86-specific:  5392395  846568  724424  6963387  6a40bb  vmlinux (64 bit)
> generic:       5392458  846568  724424  6963450  6a40fa  vmlinux (64 bit)

i'd not worry about that too much. Have you tried to build with:

  CONFIG_CC_OPTIMIZE_FOR_SIZE=y
  CONFIG_OPTIMIZE_INLINING=y

(the latter only available in x86.git)

> Patch is against -x86#testing. It compiles.

i've picked it up into x86.git, lets see how it goes in practice.

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread
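The open-coded C that Ingo refers to can be rendered in user space as
follows — a sketch of the generic word-at-a-time algorithm from
lib/find_next_bit.c, with `__builtin_ctzl` standing in for the kernel's
`__ffs()`:

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITOP_WORD(nr) ((nr) / BITS_PER_LONG)

/* Mask off the bits below 'offset' in the first word, then scan
 * whole words until a nonzero one is found; the tail word is masked
 * so that 'size' is returned when no set bit exists. */
static unsigned long find_next_bit(const unsigned long *addr,
                                   unsigned long size, unsigned long offset)
{
    const unsigned long *p = addr + BITOP_WORD(offset);
    unsigned long result = offset & ~(BITS_PER_LONG - 1);
    unsigned long tmp;

    if (offset >= size)
        return size;
    size -= result;
    offset %= BITS_PER_LONG;
    if (offset) {
        tmp = *(p++);
        tmp &= (~0UL << offset);        /* clear bits below offset */
        if (size < BITS_PER_LONG)
            goto found_first;
        if (tmp)
            goto found_middle;
        size -= BITS_PER_LONG;
        result += BITS_PER_LONG;
    }
    while (size & ~(BITS_PER_LONG - 1)) {
        if ((tmp = *(p++)))
            goto found_middle;
        result += BITS_PER_LONG;
        size -= BITS_PER_LONG;
    }
    if (!size)
        return result;
    tmp = *p;

found_first:
    tmp &= (~0UL >> (BITS_PER_LONG - size)); /* mask bits past 'size' */
    if (tmp == 0UL)
        return result + size;            /* always exactly 'size' */
found_middle:
    return result + __builtin_ctzl(tmp); /* stand-in for __ffs() */
}
```

Note how the tail masking gives the well-defined "returns exactly size"
behavior discussed above, which the i386 assembly version lacked.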
* Re: [PATCH] x86: Change x86 to use generic find_next_bit
  2008-03-09 20:10 ` Ingo Molnar
@ 2008-03-09 21:03   ` Andi Kleen
  2008-03-09 21:32     ` Andi Kleen
  1 sibling, 1 reply; 19+ messages in thread

From: Andi Kleen @ 2008-03-09 21:03 UTC (permalink / raw)
To: Ingo Molnar
Cc: Alexander van Heukelum, Thomas Gleixner, H. Peter Anvin, LKML,
	heukelum

Ingo Molnar <mingo@elte.hu> writes:
>
> the generic version in lib/find_next_bit.c is open-coded C which gcc can
> optimize pretty nicely.
>
> the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly use
> the special x86 'bit search forward' (BSF) instruction - which i know
> from the days when the scheduler relied on it has some non-trivial setup

~14 cycles on K8 for memory, but if you stay in a register it is 8
cycles.

> costs. So especially when there's _small_ bitmasks involved, it's more
> expensive.

I had a patchkit some time ago to special case the max_bit <= 63 case
and always use directly inlined, streamlined single-instruction
assembler for that. There was still some issue and I dropped it then,
but doing something like that still makes sense.

Even if the BSF is slightly slower than the open-coded version, just
getting rid of the CALL will make it a win, and it could also be kept
in a register so you get the 8-cycle variant (for which I doubt you
can do it faster open coded).

The result would be that a standard for_each_cpu() in a NR_CPUS <= 64
kernel wouldn't have any unnecessary calls.

In general the problem of walking cpu masks is quite different from
searching ext2 bitmaps, so they should likely be special cased and
optimized for each.

-Andi

^ permalink raw reply	[flat|nested] 19+ messages in thread
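The single-instruction, register-resident search Andi describes is
essentially what the x86_64 `__scanbit()` macro does. A user-space sketch
of the idea, with `__builtin_ctzl` (which compiles to BSF/TZCNT on x86) as
a portable stand-in for the inline assembly:

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* __scanbit()-style primitive: index of the lowest set bit, or 'max'
 * when the word is zero (what BSF + CMOV do in one or two insns). */
static long scanbit(unsigned long val, unsigned long max)
{
    return val ? __builtin_ctzl(val) : (long)max;
}

/* find_next_bit for a bitmap that fits in one register: shift out the
 * bits below 'off', scan, clamp the "not found" case to 'size'.
 * Assumes 0 <= off < BITS_PER_LONG and size <= BITS_PER_LONG. */
static long find_next_bit_small(unsigned long word, long size, long off)
{
    return off + scanbit(word >> off, size - off);
}
```

When `size` is a compile-time constant no larger than a word — the
NR_CPUS <= 64 case — this whole thing inlines to a couple of
instructions, with no CALL at all.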
* Re: [PATCH] x86: Change x86 to use generic find_next_bit
  2008-03-09 21:03 ` Andi Kleen
@ 2008-03-09 21:32   ` Andi Kleen
  0 siblings, 0 replies; 19+ messages in thread

From: Andi Kleen @ 2008-03-09 21:32 UTC (permalink / raw)
To: Ingo Molnar
Cc: Alexander van Heukelum, Thomas Gleixner, H. Peter Anvin, LKML,
	heukelum

Andi Kleen <andi@firstfloor.org> writes:
>
> I had a patchkit some time ago to special case the max_bit <= 63 case

... never mind ... the patchkit is actually included. I had only
dropped it at some point, but fixed later.

-Andi

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH] x86: Change x86 to use generic find_next_bit
  2008-03-09 20:10 ` Ingo Molnar
  2008-03-09 21:03 ` Andi Kleen
@ 2008-03-09 21:13 ` Alexander van Heukelum
  2008-03-10  6:29   ` Ingo Molnar
  1 sibling, 1 reply; 19+ messages in thread

From: Alexander van Heukelum @ 2008-03-09 21:13 UTC (permalink / raw)
To: Ingo Molnar, Alexander van Heukelum; +Cc: Thomas Gleixner, H. Peter Anvin, LKML

On Sun, 9 Mar 2008 21:10:16 +0100, "Ingo Molnar" <mingo@elte.hu> said:
> >                Athlon      Xeon        Opteron 32/64bit
> > x86-specific:  0m3.692s    0m2.820s    0m3.196s / 0m2.480s
> > generic:       0m2.622s    0m1.662s    0m2.100s / 0m1.572s
>
> ok, that's rather convincing.
>
> the generic version in lib/find_next_bit.c is open-coded C which gcc can
> optimize pretty nicely.
>
> the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly use
> the special x86 'bit search forward' (BSF) instruction - which i know
> from the days when the scheduler relied on it has some non-trivial setup
> costs. So especially when there's _small_ bitmasks involved, it's more
> expensive.

Hi,

BSF is fine, it doesn't need any special setup. The problem is probably
that the old versions use find_first_bit and find_first_zero_bit, which
are also hand-optimized versions... and they use "repe scasl/q". That's
another little project ;).

> > If the bitmap size is not a multiple of BITS_PER_LONG, and no set
> > (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a
> > value outside of the range [0, size]. The generic version always
> > returns exactly size. The generic version also uses unsigned long
> > everywhere, while the x86 versions use a mishmash of int, unsigned
> > (int), long and unsigned long.
>
> i'm not surprised that the hand-coded assembly versions had a bug ...

Not surprised about the bug, but it was in fact noticed, and fixed in
x86_64!

> [ this means we have to test it quite carefully though, as lots of code
>   only ever gets tested on x86 so code could have built dependency on
>   the buggy behavior. ]

Agreed.

> > Using the generic version does give a slightly bigger kernel, though.
> >
> > defconfig:     text     data    bss     dec      hex     filename
> > x86-specific:  4738555  481232  626688  5846475  5935cb  vmlinux (32 bit)
> > generic:       4738621  481232  626688  5846541  59360d  vmlinux (32 bit)
> > x86-specific:  5392395  846568  724424  6963387  6a40bb  vmlinux (64 bit)
> > generic:       5392458  846568  724424  6963450  6a40fa  vmlinux (64 bit)
>
> i'd not worry about that too much. Have you tried to build with:

I don't, but I needed to compile something to test the build anyhow ;)

> CONFIG_CC_OPTIMIZE_FOR_SIZE=y
> CONFIG_OPTIMIZE_INLINING=y

This was defconfig in -x86#testing, they were both already enabled.
Here is what you get with those options turned off ;).

               text     data    bss     dec      hex     filename
x86-specific:  5543996  481232  626688  6651916  65800c  vmlinux (32 bit)
generic:       5543880  481232  626688  6651800  657f98  vmlinux (32 bit)
x86-specific:  6111834  846568  724424  7682826  753b0a  vmlinux (64 bit)
generic:       6111882  846568  724424  7682874  753b3a  vmlinux (64 bit)

(and I double-checked the i386 results)

> (the latter only available in x86.git)
>
> > Patch is against -x86#testing. It compiles.
>
> i've picked it up into x86.git, lets see how it goes in practice.

Thanks,
	Alexander

> 	Ingo
--
  Alexander van Heukelum
  heukelum@fastmail.fm

--
http://www.fastmail.fm - And now for something completely different

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH] x86: Change x86 to use generic find_next_bit
  2008-03-09 21:13 ` Alexander van Heukelum
@ 2008-03-10  6:29   ` Ingo Molnar
  0 siblings, 0 replies; 19+ messages in thread

From: Ingo Molnar @ 2008-03-10 6:29 UTC (permalink / raw)
To: Alexander van Heukelum
Cc: Alexander van Heukelum, Thomas Gleixner, H. Peter Anvin, LKML

* Alexander van Heukelum <heukelum@fastmail.fm> wrote:

> > ok, that's rather convincing.
> >
> > the generic version in lib/find_next_bit.c is open-coded C which gcc
> > can optimize pretty nicely.
> >
> > the hand-coded assembly versions in arch/x86/lib/bitops_32.c mostly
> > use the special x86 'bit search forward' (BSF) instruction - which i
> > know from the days when the scheduler relied on it has some
> > non-trivial setup costs. So especially when there's _small_ bitmasks
> > involved, it's more expensive.
>
> Hi,
>
> BSF is fine, it doesn't need any special setup. [...]

under "setup costs" i mean cycles spent by the CPU itself - the
instruction itself is simple (of course) and needs no setup. If you
look at BSF performance you'll see that it has nontrivial overhead.

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH] x86: Change x86 to use generic find_next_bit
  2008-03-09 20:01 [PATCH] x86: Change x86 to use generic find_next_bit Alexander van Heukelum
  2008-03-09 20:10 ` Ingo Molnar
@ 2008-03-09 20:11 ` Ingo Molnar
  2008-03-09 20:31   ` Alexander van Heukelum
  2008-03-09 20:28 ` [PATCH] x86: Change x86 to use generic find_next_bit Andi Kleen
  ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread

From: Ingo Molnar @ 2008-03-09 20:11 UTC (permalink / raw)
To: Alexander van Heukelum; +Cc: Thomas Gleixner, H. Peter Anvin, LKML, heukelum

* Alexander van Heukelum <heukelum@mailshack.com> wrote:

> --- a/lib/find_next_bit.c
> +++ b/lib/find_next_bit.c
> @@ -15,6 +15,8 @@
>  #include <asm/byteorder.h>
>
>  #define BITOP_WORD(nr)		((nr) / BITS_PER_LONG)
> +#undef find_next_bit
> +#undef find_next_zero_bit

this bit looks weird - did you need it for testing?

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH] x86: Change x86 to use generic find_next_bit
  2008-03-09 20:11 ` Ingo Molnar
@ 2008-03-09 20:31   ` Alexander van Heukelum
  2008-03-09 20:51     ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread

From: Alexander van Heukelum @ 2008-03-09 20:31 UTC (permalink / raw)
To: Ingo Molnar, Alexander van Heukelum; +Cc: Thomas Gleixner, H. Peter Anvin, LKML

On Sun, 9 Mar 2008 21:11:52 +0100, "Ingo Molnar" <mingo@elte.hu> said:
> * Alexander van Heukelum <heukelum@mailshack.com> wrote:
> > --- a/lib/find_next_bit.c
> > +++ b/lib/find_next_bit.c
> > @@ -15,6 +15,8 @@
> >  #include <asm/byteorder.h>
> >
> >  #define BITOP_WORD(nr)		((nr) / BITS_PER_LONG)
> > +#undef find_next_bit
> > +#undef find_next_zero_bit
>
> this bit looks weird - did you need it for testing?

Worse, it's needed to get x86_64 to compile.

They are defined in include/asm-x86/bitops_64.h (which gets included).
They are used to optimize the case where the bitmap size is known at
compile time and not larger than BITS_PER_LONG. Undeffing them here is
the easiest way to get things to compile, here.

Greetings,
	Alexander

--
http://www.fastmail.fm - The way an email service should be

^ permalink raw reply	[flat|nested] 19+ messages in thread
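The situation Alexander describes — a function-like macro shadowing a real
function of the same name, with an `#undef` at the definition site — can be
demonstrated in miniature. This is a hypothetical single-file illustration
of the bitops_64.h pattern, not the kernel code; `find_bit` and `caller`
are made-up names:

```c
#include <assert.h>

long find_bit(unsigned long w);   /* the real out-of-line version */

/* Shadowing macro, as in include/asm-x86/bitops_64.h: a constant
 * nonzero word is resolved at compile time. The find_bit inside the
 * replacement list is NOT re-expanded (function-like macros never
 * expand recursively), so it calls the real function. */
#define find_bit(w) \
	(__builtin_constant_p(w) && (w) != 0 ? \
		(long)__builtin_ctzl(w) : find_bit(w))

static long caller(unsigned long w)
{
	return find_bit(w);       /* goes through the macro */
}

/* Without this #undef, the definition below would itself look like a
 * macro invocation and be expanded into nonsense - which is exactly
 * why lib/find_next_bit.c needed its #undef pair. */
#undef find_bit
long find_bit(unsigned long w)
{
	return w ? __builtin_ctzl(w) : -1;
}
```

In the kernel the macro lives in a header and the `#undef` in the one
`.c` file that defines the real function, so all other callers still see
the optimizing macro.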
* Re: [PATCH] x86: Change x86 to use generic find_next_bit
  2008-03-09 20:31 ` Alexander van Heukelum
@ 2008-03-09 20:51   ` Ingo Molnar
  2008-03-09 21:29     ` Andi Kleen
  2008-03-10 23:17     ` [RFC/PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps Alexander van Heukelum
  0 siblings, 2 replies; 19+ messages in thread

From: Ingo Molnar @ 2008-03-09 20:51 UTC (permalink / raw)
To: Alexander van Heukelum
Cc: Alexander van Heukelum, Thomas Gleixner, H. Peter Anvin, LKML

* Alexander van Heukelum <heukelum@fastmail.fm> wrote:

> > > #define BITOP_WORD(nr)		((nr) / BITS_PER_LONG)
> > > +#undef find_next_bit
> > > +#undef find_next_zero_bit
> >
> > this bit looks weird - did you need it for testing?
>
> Worse, it's needed to get x86_64 to compile.
>
> They are defined in include/asm-x86/bitops_64.h (which gets included).
> They are used to optimize the case where the bitmap size is known at
> compile time and not larger than BITS_PER_LONG. Undeffing them here is
> the easiest way to get things to compile, here.

ok - but this needs to be solved in a cleaner way. That build-time
optimization needs to be pushed into generic code so that 32-bit x86
and other architectures can make use of it as well. The
lib/find_next_bit.c functions should be named __find_next_bit() or so.

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH] x86: Change x86 to use generic find_next_bit
  2008-03-09 20:51 ` Ingo Molnar
@ 2008-03-09 21:29   ` Andi Kleen
  2008-03-10 23:17   ` [RFC/PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps Alexander van Heukelum
  1 sibling, 0 replies; 19+ messages in thread

From: Andi Kleen @ 2008-03-09 21:29 UTC (permalink / raw)
To: Ingo Molnar
Cc: Alexander van Heukelum, Alexander van Heukelum, Thomas Gleixner,
	H. Peter Anvin, LKML

Ingo Molnar <mingo@elte.hu> writes:
>
> ok - but this needs to be solved in a cleaner way. That build-time
> optimization needs to be pushed into generic code so that 32-bit x86 and

It probably doesn't make sense in generic code because it will be too
large open coded.

-Andi

^ permalink raw reply	[flat|nested] 19+ messages in thread
* [RFC/PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps
  2008-03-09 20:51 ` Ingo Molnar
  2008-03-09 21:29   ` Andi Kleen
@ 2008-03-10 23:17   ` Alexander van Heukelum
  2008-03-11  9:56     ` Ingo Molnar
  1 sibling, 1 reply; 19+ messages in thread

From: Alexander van Heukelum @ 2008-03-10 23:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Alexander van Heukelum, Andi Kleen, Thomas Gleixner,
	H. Peter Anvin, LKML

x86: Optimize find_next_(zero_)bit for small constant-size bitmaps

NOTE: This is not well tested. I'm also quite unsure if this makes the
pre-processor situation any better.

On an i386 defconfig (the x86#testing one), the size of vmlinux hardly
changes with this applied. I have observed only four places where this
optimization avoids a call into find_next_bit:

In the functions return_unused_surplus_pages, alloc_fresh_huge_page,
and adjust_pool_surplus, this patch avoids a call for a 1-bit bitmap.
In __next_cpu a call is avoided for a 32-bit bitmap. That's it.

On x86_64 I observed the following (I did not look closely yet,
though).

Current #testing defconfig:
	146 x bsf, 27 x find_next_*bit
   text     data    bss     dec      hex     filename
   5392637  846592  724424  6963653  6a41c5  vmlinux

After removing the x86_64 specific optimization for find_next_*bit:
	94 x bsf, 79 x find_next_*bit
   text     data    bss     dec      hex     filename
   5392358  846592  724424  6963374  6a40ae  vmlinux

After this patch (making the optimization generic):
	146 x bsf, 27 x find_next_*bit
   text     data    bss     dec      hex     filename
   5392396  846592  724424  6963412  6a40d4  vmlinux

Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---

> ok - but this needs to be solved in a cleaner way. That build-time
> optimization needs to be pushed into generic code so that 32-bit x86 and
> other architectures can make use of it as well. The lib/find_next_bit.c
> functions should be named __find_next_bit() or so.
>
> 	Ingo

I think this is what you had in mind? It seems to work. Alternative
implementation ideas? Comments? In particular: does this break
non-x86?

Greetings,
	Alexander

 include/asm-x86/bitops.h    |    4 +-
 include/asm-x86/bitops_64.h |   10 -------
 include/linux/bitops.h      |   60 +++++++++++++++++++++++++++++++++++++++++++
 lib/find_next_bit.c         |   10 +++----
 4 files changed, 66 insertions(+), 18 deletions(-)

diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h
index 7fbf93a..cbbe329 100644
--- a/include/asm-x86/bitops.h
+++ b/include/asm-x86/bitops.h
@@ -312,9 +312,9 @@ static int test_bit(int nr, const volatile unsigned long *addr);

 #undef ADDR

-unsigned long find_next_bit(const unsigned long *addr,
+unsigned long __find_next_bit(const unsigned long *addr,
 		unsigned long size, unsigned long offset);
-unsigned long find_next_zero_bit(const unsigned long *addr,
+unsigned long __find_next_zero_bit(const unsigned long *addr,
 		unsigned long size, unsigned long offset);

diff --git a/include/asm-x86/bitops_64.h b/include/asm-x86/bitops_64.h
index 40bf0f8..87e1a17 100644
--- a/include/asm-x86/bitops_64.h
+++ b/include/asm-x86/bitops_64.h
@@ -20,21 +20,11 @@ static inline long __scanbit(unsigned long val, unsigned long max)
 	 (__scanbit(*(unsigned long *)addr,(size))) : \
 	  find_first_bit(addr,size)))

-#define find_next_bit(addr,size,off) \
-((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
-  ((off) + (__scanbit((*(unsigned long *)addr) >> (off),(size)-(off)))) : \
-	find_next_bit(addr,size,off)))
-
 #define find_first_zero_bit(addr,size) \
 ((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
 	 (__scanbit(~*(unsigned long *)addr,(size))) : \
 	  find_first_zero_bit(addr,size)))

-#define find_next_zero_bit(addr,size,off) \
-((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \
-  ((off)+(__scanbit(~(((*(unsigned long *)addr)) >> (off)),(size)-(off)))) : \
-	find_next_zero_bit(addr,size,off)))
-
 static inline void set_bit_string(unsigned long *bitmap,
 	unsigned long i, int len)
 {
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 69c1edb..ba78ca1 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -72,4 +72,64 @@ static inline unsigned fls_long(unsigned long l)
 	return fls64(l);
 }

+#ifdef __KERNEL__
+#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
+unsigned long __find_next_bit(const unsigned long *addr,
+		unsigned long size, unsigned long offset);
+
+static __always_inline unsigned long
+find_next_bit(const unsigned long *addr, unsigned long size,
+		unsigned long offset)
+{
+	unsigned long value;
+
+	/* Here we would like to use a version of ffs that */
+	/* returns BITS_PER_LONG if its argument is zero. */
+	if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) {
+		value = (*addr) & ((~0ul) << offset);
+		return (value == 0) ? BITS_PER_LONG : __ffs(value);
+	}
+
+	/* Less than BITS_PER_LONG: put in sentinel, then __ffs */
+	/* Here we would like to have a __set_bit/__ffs combo */
+	if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) {
+		value = (*addr) & ((~0ul) << offset);
+		value |= (1ul << size);
+		return __ffs(value);
+	}
+
+	/* size not constant or too big */
+	return __find_next_bit(addr, size, offset);
+}
+
+unsigned long __find_next_zero_bit(const unsigned long *addr,
+		unsigned long size, unsigned long offset);
+
+static __always_inline unsigned long
+find_next_zero_bit(const unsigned long *addr, unsigned long size,
+		unsigned long offset)
+{
+	unsigned long value;
+
+	/* Here we would like to use a version of ffs that */
+	/* returns BITS_PER_LONG if its argument is zero. */
+	if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) {
+		value = (~(*addr)) & ((~0ul) << offset);
+		return (value == 0) ? BITS_PER_LONG : __ffs(value);
+	}
+
+	/* Less than BITS_PER_LONG: put in sentinel, then __ffs */
+	/* Here we would like to have a __set_bit/__ffs combo. */
+	if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) {
+		value = (~(*addr)) & ((~0ul) << offset);
+		value |= (1ul << size);
+		return __ffs(value);
+	}
+
+	/* size not constant or too big */
+	return __find_next_zero_bit(addr, size, offset);
+}
+
+#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */
+#endif /* __KERNEL__ */
 #endif
diff --git a/lib/find_next_bit.c b/lib/find_next_bit.c
index 5820e07..5249f4a 100644
--- a/lib/find_next_bit.c
+++ b/lib/find_next_bit.c
@@ -15,8 +15,6 @@
 #include <asm/byteorder.h>

 #define BITOP_WORD(nr)		((nr) / BITS_PER_LONG)
-#undef find_next_bit
-#undef find_next_zero_bit

 /**
  * find_next_bit - find the next set bit in a memory region
@@ -24,8 +22,8 @@
  * @offset: The bitnumber to start searching at
  * @size: The maximum size to search
  */
-unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
-		unsigned long offset)
+unsigned long __find_next_bit(const unsigned long *addr,
+		unsigned long size, unsigned long offset)
 {
 	const unsigned long *p = addr + BITOP_WORD(offset);
 	unsigned long result = offset & ~(BITS_PER_LONG-1);
@@ -69,8 +67,8 @@ EXPORT_SYMBOL(find_next_bit);
  * This implementation of find_{first,next}_zero_bit was stolen from
  * Linus' asm-alpha/bitops.h.
  */
-unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
-		unsigned long offset)
+unsigned long __find_next_zero_bit(const unsigned long *addr,
+		unsigned long size, unsigned long offset)
 {
 	const unsigned long *p = addr + BITOP_WORD(offset);
 	unsigned long result = offset & ~(BITS_PER_LONG-1);

^ permalink raw reply related	[flat|nested] 19+ messages in thread
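The sentinel trick used in the `size < BITS_PER_LONG` branch of the patch
can be checked in user space. A sketch with `__builtin_ctzl` standing in
for the kernel's `__ffs()`:

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Sentinel trick from the patch: for size < BITS_PER_LONG, OR in a
 * bit at position 'size'. If no real bit is set at or above 'offset',
 * the scan lands on the sentinel and returns exactly 'size' - no
 * zero-input special case needed. */
static unsigned long next_bit_small(unsigned long word,
                                    unsigned long size, unsigned long offset)
{
    unsigned long value = word & (~0UL << offset); /* drop bits below offset */

    value |= 1UL << size;            /* sentinel; needs size < BITS_PER_LONG */
    return __builtin_ctzl(value);    /* stand-in for __ffs() */
}
```

Because `value` is guaranteed nonzero after the sentinel is set, the
undefined-for-zero behavior of BSF (and of `__ffs`/`__builtin_ctzl`)
never triggers, which is the point of the construction.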
* Re: [RFC/PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps
  2008-03-10 23:17 ` [RFC/PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps Alexander van Heukelum
@ 2008-03-11  9:56   ` Ingo Molnar
  2008-03-11 15:17     ` [PATCH] " Alexander van Heukelum
  0 siblings, 1 reply; 19+ messages in thread

From: Ingo Molnar @ 2008-03-11 9:56 UTC (permalink / raw)
To: Alexander van Heukelum
Cc: Alexander van Heukelum, Andi Kleen, Thomas Gleixner,
	H. Peter Anvin, LKML

* Alexander van Heukelum <heukelum@mailshack.com> wrote:

> x86: Optimize find_next_(zero_)bit for small constant-size bitmaps

thanks, looks nice, except for a small detail:

> +#ifdef CONFIG_GENERIC_FIND_NEXT_BIT
> +unsigned long __find_next_bit(const unsigned long *addr,
> +		unsigned long size, unsigned long offset);
> +
> +static __always_inline unsigned long
> +find_next_bit(const unsigned long *addr, unsigned long size,
> +		unsigned long offset)

> +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */

there should be an #else here i think, which maps find_next_bit to
__find_next_bit / etc.

small syntactic nit:

> +unsigned long __find_next_bit(const unsigned long *addr,
> +		unsigned long size, unsigned long offset);

should be explicitly "extern", so that we see that it's a prototype
declaration.

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread
* [PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps 2008-03-11 9:56 ` Ingo Molnar @ 2008-03-11 15:17 ` Alexander van Heukelum 2008-03-11 15:22 ` [RFC] non-x86: " Alexander van Heukelum 2008-03-11 15:23 ` [PATCH] x86: " Ingo Molnar 0 siblings, 2 replies; 19+ messages in thread From: Alexander van Heukelum @ 2008-03-11 15:17 UTC (permalink / raw) To: Ingo Molnar Cc: Alexander van Heukelum, Andi Kleen, Thomas Gleixner, H. Peter Anvin, LKML x86: Optimize find_next_(zero_)bit for small constant-size bitmaps. This moves an optimization for searching constant-sized small bitmaps from x86_64-specific to generic code. On an i386 defconfig (the x86#testing one), the size of vmlinux hardly changes with this applied. I have observed only four places where this optimization avoids a call into find_next_bit: In the functions return_unused_surplus_pages, alloc_fresh_huge_page, and adjust_pool_surplus, this patch avoids a call for a 1-bit bitmap. In __next_cpu a call is avoided for a 32-bit bitmap. That's it. On x86_64, 52 locations are optimized with a minimal increase in code size: Current #testing defconfig: 146 x bsf, 27 x find_next_*bit text data bss dec hex filename 5392637 846592 724424 6963653 6a41c5 vmlinux After removing the x86_64 specific optimization for find_next_*bit: 94 x bsf, 79 x find_next_*bit text data bss dec hex filename 5392358 846592 724424 6963374 6a40ae vmlinux After this patch (making the optimization generic): 146 x bsf, 27 x find_next_*bit text data bss dec hex filename 5392396 846592 724424 6963412 6a40d4 vmlinux --- Hi Ingo, I have now tested (in user-space) the one-long-versions of the find_next_(zero_)bit in the patch. They give the expected results. 
> > +#ifdef CONFIG_GENERIC_FIND_NEXT_BIT > > +unsigned long __find_next_bit(const unsigned long *addr, > > + unsigned long size, unsigned long offset); > > + > > +static __always_inline unsigned long > > +find_next_bit(const unsigned long *addr, unsigned long size, > > + unsigned long offset) > > > +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */ > > there should be an #else here i think, which maps find_next_bit to > __find_next_bit / etc. No, that is intentional: architectures are expected to either set CONFIG_GENERIC_FIND_NEXT_BIT, or provide their own implementation of find_next_bit. An alternative would be to have all architectures providing __find_next_bit/__find_next_zero_bit and do the optimization unconditionally. > > +unsigned long __find_next_bit(const unsigned long *addr, > > + unsigned long size, unsigned long offset); > > should be explicitly "extern", so that we see that it's a prototype > declaration. ok. This version runs fine on qemu. Greetings, Alexander diff --git a/include/asm-x86/bitops.h b/include/asm-x86/bitops.h index 7fbf93a..1a23ce1 100644 --- a/include/asm-x86/bitops.h +++ b/include/asm-x86/bitops.h @@ -312,12 +312,6 @@ static int test_bit(int nr, const volatile unsigned long *addr); #undef ADDR -unsigned long find_next_bit(const unsigned long *addr, - unsigned long size, unsigned long offset); -unsigned long find_next_zero_bit(const unsigned long *addr, - unsigned long size, unsigned long offset); - - #ifdef CONFIG_X86_32 # include "bitops_32.h" #else diff --git a/include/asm-x86/bitops_64.h b/include/asm-x86/bitops_64.h index 40bf0f8..87e1a17 100644 --- a/include/asm-x86/bitops_64.h +++ b/include/asm-x86/bitops_64.h @@ -20,21 +20,11 @@ static inline long __scanbit(unsigned long val, unsigned long max) (__scanbit(*(unsigned long *)addr,(size))) : \ find_first_bit(addr,size))) -#define find_next_bit(addr,size,off) \ -((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? 
\ - ((off) + (__scanbit((*(unsigned long *)addr) >> (off),(size)-(off)))) : \ - find_next_bit(addr,size,off))) - #define find_first_zero_bit(addr,size) \ ((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \ (__scanbit(~*(unsigned long *)addr,(size))) : \ find_first_zero_bit(addr,size))) -#define find_next_zero_bit(addr,size,off) \ -((__builtin_constant_p(size) && (size) <= BITS_PER_LONG ? \ - ((off)+(__scanbit(~(((*(unsigned long *)addr)) >> (off)),(size)-(off)))) : \ - find_next_zero_bit(addr,size,off))) - static inline void set_bit_string(unsigned long *bitmap, unsigned long i, int len) { diff --git a/include/linux/bitops.h b/include/linux/bitops.h index 69c1edb..15360f9 100644 --- a/include/linux/bitops.h +++ b/include/linux/bitops.h @@ -72,4 +72,81 @@ static inline unsigned fls_long(unsigned long l) return fls64(l); } +#ifdef __KERNEL__ +#ifdef CONFIG_GENERIC_FIND_NEXT_BIT +extern unsigned long __find_next_bit(const unsigned long *addr, + unsigned long size, unsigned long offset); + +/** + * find_next_bit - find the next set bit in a memory region + * @addr: The address to base the search on + * @offset: The bitnumber to start searching at + * @size: The bitmap size in bits + */ +static __always_inline unsigned long +find_next_bit(const unsigned long *addr, unsigned long size, + unsigned long offset) +{ + unsigned long value; + + /* Avoid a function call if the bitmap size is a constant */ + /* and not bigger than BITS_PER_LONG. */ + + /* insert a sentinel so that __ffs returns size if there */ + /* are no set bits in the bitmap */ + if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) { + value = (*addr) & ((~0ul) << offset); + value |= (1ul << size); + return __ffs(value); + } + + /* the result of __ffs(0) is undefined, so it needs to be */ + /* handled separately */ + if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) { + value = (*addr) & ((~0ul) << offset); + return (value == 0) ? 
BITS_PER_LONG : __ffs(value); + } + + /* size is not constant or too big */ + return __find_next_bit(addr, size, offset); +} + +extern unsigned long __find_next_zero_bit(const unsigned long *addr, + unsigned long size, unsigned long offset); + +/** + * find_next_zero_bit - find the next cleared bit in a memory region + * @addr: The address to base the search on + * @offset: The bitnumber to start searching at + * @size: The bitmap size in bits + */ +static __always_inline unsigned long +find_next_zero_bit(const unsigned long *addr, unsigned long size, + unsigned long offset) +{ + unsigned long value; + + /* Avoid a function call if the bitmap size is a constant */ + /* and not bigger than BITS_PER_LONG. */ + + /* insert a sentinel so that __ffs returns size if there */ + /* are no set bits in the bitmap */ + if (__builtin_constant_p(size) && (size < BITS_PER_LONG)) { + value = (~(*addr)) & ((~0ul) << offset); + value |= (1ul << size); + return __ffs(value); + } + + /* the result of __ffs(0) is undefined, so it needs to be */ + /* handled separately */ + if (__builtin_constant_p(size) && (size == BITS_PER_LONG)) { + value = (~(*addr)) & ((~0ul) << offset); + return (value == 0) ? BITS_PER_LONG : __ffs(value); + } + + /* size is not constant or too big */ + return __find_next_zero_bit(addr, size, offset); +} +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */ +#endif /* __KERNEL__ */ #endif diff --git a/lib/find_next_bit.c b/lib/find_next_bit.c index 5820e07..ce94c4c 100644 --- a/lib/find_next_bit.c +++ b/lib/find_next_bit.c @@ -15,17 +15,12 @@ #include <asm/byteorder.h> #define BITOP_WORD(nr) ((nr) / BITS_PER_LONG) -#undef find_next_bit -#undef find_next_zero_bit - -/** - * find_next_bit - find the next set bit in a memory region - * @addr: The address to base the search on - * @offset: The bitnumber to start searching at - * @size: The maximum size to search + +/* + * Find the next set bit in a memory region. 
*/ -unsigned long find_next_bit(const unsigned long *addr, unsigned long size, - unsigned long offset) +unsigned long __find_next_bit(const unsigned long *addr, + unsigned long size, unsigned long offset) { const unsigned long *p = addr + BITOP_WORD(offset); unsigned long result = offset & ~(BITS_PER_LONG-1); @@ -62,15 +57,14 @@ found_first: found_middle: return result + __ffs(tmp); } - -EXPORT_SYMBOL(find_next_bit); +EXPORT_SYMBOL(__find_next_bit); /* * This implementation of find_{first,next}_zero_bit was stolen from * Linus' asm-alpha/bitops.h. */ -unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size, - unsigned long offset) +unsigned long __find_next_zero_bit(const unsigned long *addr, + unsigned long size, unsigned long offset) { const unsigned long *p = addr + BITOP_WORD(offset); unsigned long result = offset & ~(BITS_PER_LONG-1); @@ -107,8 +101,7 @@ found_first: found_middle: return result + ffz(tmp); } - -EXPORT_SYMBOL(find_next_zero_bit); +EXPORT_SYMBOL(__find_next_zero_bit); #ifdef __BIG_ENDIAN ^ permalink raw reply related [flat|nested] 19+ messages in thread
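[The small-constant-size fast path in the patch above hinges on one trick: OR a sentinel bit into position size, so that __ffs never sees a zero word and naturally returns size when no bit in range is set. A minimal userspace sketch of that trick, using GCC's __builtin_ctzl in place of the kernel's __ffs (an assumption; any count-trailing-zeros primitive works):]

```c
#include <assert.h>

/* One-word search with a sentinel bit. Valid only for size strictly
 * less than the word width, which is exactly the case the
 * __always_inline fast path handles: 1UL << size would be undefined
 * for size == BITS_PER_LONG, hence the separate branch in the patch. */
static unsigned long find_next_bit_small(unsigned long word,
                                         unsigned long size,
                                         unsigned long offset)
{
    unsigned long value = word & (~0UL << offset);  /* drop bits below offset */

    value |= 1UL << size;                           /* sentinel at 'size' */
    return (unsigned long)__builtin_ctzl(value);    /* == size if none set */
}
```

Because the sentinel guarantees a non-zero operand, no branch on value == 0 is needed; for a constant size the whole search compiles down to a mask, an OR, and a single bsf instruction.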
* [RFC] non-x86: Optimize find_next_(zero_)bit for small constant-size bitmaps 2008-03-11 15:17 ` [PATCH] " Alexander van Heukelum @ 2008-03-11 15:22 ` Alexander van Heukelum 2008-03-11 15:23 ` [PATCH] x86: " Ingo Molnar 1 sibling, 0 replies; 19+ messages in thread From: Alexander van Heukelum @ 2008-03-11 15:22 UTC (permalink / raw) To: linux-arch Cc: Alexander van Heukelum, Ingo Molnar, Andi Kleen, Thomas Gleixner, H. Peter Anvin, LKML non-x86: Optimize small bitmap searches for the other archs. Concerns archs that have their own implementation of find_next_bit and find_next_zero_bit. For archs that define CONFIG_GENERIC_FIND_NEXT_BIT, a patch is submitted to be included in Ingo's x86-testing tree. It moves an optimization for searching small, constant-sized bitmaps from x86_64-specific to generic code. It works by creating a new __always_inline-marked function in linux/bitops.h, and renaming find_next_(zero_)bit to __find_next_(zero_)bit in the generic code. The new code is currently guarded by CONFIG_GENERIC_FIND_NEXT_BIT, so other archs are expected to work as before. To enable the optimization globally, all other archs need to implement __find_next_bit and __find_next_zero_bit either by exporting these symbols, or by implementing them as inline functions in their asm/bitops.h. It would also be nice if every implementation would have the same declaration: unsigned long __find_next_(zero_)bit(unsigned long *, unsigned long, unsigned long); I added a totally untested patch for reference. Did I miss an arch? Does the assembly still match the (changed) prototype? Can we get consensus on doing the optimization in generic code? 
Greetings, Alexander arch/arm/lib/findbit.S | 6 ++++-- arch/avr32/kernel/avr32_ksyms.c | 4 ++-- arch/avr32/lib/findbit.S | 4 ++-- include/asm-arm/bitops.h | 20 ++++++++++++-------- include/asm-m68k/bitops.h | 8 ++++---- include/asm-powerpc/bitops.h | 4 ---- include/asm-s390/bitops.h | 12 ++++++------ include/linux/bitops.h | 2 -- 8 files changed, 30 insertions(+), 30 deletions(-) diff --git a/arch/arm/lib/findbit.S b/arch/arm/lib/findbit.S index a5ca024..5c92967 100644 --- a/arch/arm/lib/findbit.S +++ b/arch/arm/lib/findbit.S @@ -36,7 +36,8 @@ ENTRY(_find_first_zero_bit_le) /* * Purpose : Find next 'zero' bit - * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset) + * Prototype: unsigned long find_next_zero_bit(unsigned long *addr, + * unsigned long maxbit, unsigned long offset); */ ENTRY(_find_next_zero_bit_le) teq r1, #0 @@ -70,7 +71,8 @@ ENTRY(_find_first_bit_le) /* * Purpose : Find next 'one' bit - * Prototype: int find_next_zero_bit(void *addr, unsigned int maxbit, int offset) + * Prototype: unsigned long find_next_zero_bit(unsigned long *addr, + * unsigned long maxbit, unsigned long offset); */ ENTRY(_find_next_bit_le) teq r1, #0 diff --git a/arch/avr32/kernel/avr32_ksyms.c b/arch/avr32/kernel/avr32_ksyms.c index 80f55f8..c228ed1 100644 --- a/arch/avr32/kernel/avr32_ksyms.c +++ b/arch/avr32/kernel/avr32_ksyms.c @@ -51,9 +51,9 @@ EXPORT_SYMBOL(__const_udelay); /* Bit operations (lib/findbit.S) */ EXPORT_SYMBOL(find_first_zero_bit); -EXPORT_SYMBOL(find_next_zero_bit); +EXPORT_SYMBOL(__find_next_zero_bit); EXPORT_SYMBOL(find_first_bit); -EXPORT_SYMBOL(find_next_bit); +EXPORT_SYMBOL(__find_next_bit); EXPORT_SYMBOL(generic_find_next_zero_le_bit); /* I/O primitives (lib/io-*.S) */ diff --git a/arch/avr32/lib/findbit.S b/arch/avr32/lib/findbit.S index c6b91de..729deab 100644 --- a/arch/avr32/lib/findbit.S +++ b/arch/avr32/lib/findbit.S @@ -29,7 +29,7 @@ ENTRY(find_first_zero_bit) * unsigned long size, * unsigned long offset) */ 
-ENTRY(find_next_zero_bit) +ENTRY(__find_next_zero_bit) lsr r8, r10, 5 sub r9, r11, r10 retle r11 @@ -93,7 +93,7 @@ ENTRY(find_first_bit) * unsigned long size, * unsigned long offset) */ -ENTRY(find_next_bit) +ENTRY(__find_next_bit) lsr r8, r10, 5 sub r9, r11, r10 retle r11 diff --git a/include/asm-arm/bitops.h b/include/asm-arm/bitops.h index 5c60bfc..08b3163 100644 --- a/include/asm-arm/bitops.h +++ b/include/asm-arm/bitops.h @@ -158,9 +158,11 @@ extern int _test_and_set_bit_le(int nr, volatile unsigned long * p); extern int _test_and_clear_bit_le(int nr, volatile unsigned long * p); extern int _test_and_change_bit_le(int nr, volatile unsigned long * p); extern int _find_first_zero_bit_le(const void * p, unsigned size); -extern int _find_next_zero_bit_le(const void * p, int size, int offset); +extern unsigned long _find_next_zero_bit_le(const unsigned long * p, + unsigned long size, unsigned long offset); extern int _find_first_bit_le(const unsigned long *p, unsigned size); -extern int _find_next_bit_le(const unsigned long *p, int size, int offset); +extern unsigned long _find_next_bit_le(const unsigned long *p, + unsigned long size, unsigned long offset); /* * Big endian assembly bitops. nr = 0 -> byte 3 bit 0. 
@@ -172,9 +174,11 @@ extern int _test_and_set_bit_be(int nr, volatile unsigned long * p); extern int _test_and_clear_bit_be(int nr, volatile unsigned long * p); extern int _test_and_change_bit_be(int nr, volatile unsigned long * p); extern int _find_first_zero_bit_be(const void * p, unsigned size); -extern int _find_next_zero_bit_be(const void * p, int size, int offset); +extern unsigned long _find_next_zero_bit_be(const unsigned long * p, + unsigned long size, unsigned long offset); extern int _find_first_bit_be(const unsigned long *p, unsigned size); -extern int _find_next_bit_be(const unsigned long *p, int size, int offset); +extern unsigned long _find_next_bit_be(const unsigned long *p, + unsigned long size, unsigned long offset); #ifndef CONFIG_SMP /* @@ -208,9 +212,9 @@ extern int _find_next_bit_be(const unsigned long *p, int size, int offset); #define test_and_clear_bit(nr,p) ATOMIC_BITOP_LE(test_and_clear_bit,nr,p) #define test_and_change_bit(nr,p) ATOMIC_BITOP_LE(test_and_change_bit,nr,p) #define find_first_zero_bit(p,sz) _find_first_zero_bit_le(p,sz) -#define find_next_zero_bit(p,sz,off) _find_next_zero_bit_le(p,sz,off) +#define __find_next_zero_bit(p,sz,off) _find_next_zero_bit_le(p,sz,off) #define find_first_bit(p,sz) _find_first_bit_le(p,sz) -#define find_next_bit(p,sz,off) _find_next_bit_le(p,sz,off) +#define __find_next_bit(p,sz,off) _find_next_bit_le(p,sz,off) #define WORD_BITOFF_TO_LE(x) ((x)) @@ -226,9 +230,9 @@ extern int _find_next_bit_be(const unsigned long *p, int size, int offset); #define test_and_clear_bit(nr,p) ATOMIC_BITOP_BE(test_and_clear_bit,nr,p) #define test_and_change_bit(nr,p) ATOMIC_BITOP_BE(test_and_change_bit,nr,p) #define find_first_zero_bit(p,sz) _find_first_zero_bit_be(p,sz) -#define find_next_zero_bit(p,sz,off) _find_next_zero_bit_be(p,sz,off) +#define __find_next_zero_bit(p,sz,off) _find_next_zero_bit_be(p,sz,off) #define find_first_bit(p,sz) _find_first_bit_be(p,sz) -#define find_next_bit(p,sz,off) 
_find_next_bit_be(p,sz,off) +#define __find_next_bit(p,sz,off) _find_next_bit_be(p,sz,off) #define WORD_BITOFF_TO_LE(x) ((x) ^ 0x18) diff --git a/include/asm-m68k/bitops.h b/include/asm-m68k/bitops.h index 83d1f28..c320c7a 100644 --- a/include/asm-m68k/bitops.h +++ b/include/asm-m68k/bitops.h @@ -199,8 +199,8 @@ out: return ((long)p - (long)vaddr - 4) * 8 + res; } -static inline int find_next_zero_bit(const unsigned long *vaddr, int size, - int offset) +static inline unsigned long __find_next_zero_bit(const unsigned long *vaddr, + unsigned long size, unsigned long offset) { const unsigned long *p = vaddr + (offset >> 5); int bit = offset & 31UL, res; @@ -246,8 +246,8 @@ out: return ((long)p - (long)vaddr - 4) * 8 + res; } -static inline int find_next_bit(const unsigned long *vaddr, int size, - int offset) +static inline unsigned long __find_next_bit(const unsigned long *vaddr, + unsigned long size, unsigned long offset) { const unsigned long *p = vaddr + (offset >> 5); int bit = offset & 31UL, res; diff --git a/include/asm-powerpc/bitops.h b/include/asm-powerpc/bitops.h index 220d9a7..8ae6667 100644 --- a/include/asm-powerpc/bitops.h +++ b/include/asm-powerpc/bitops.h @@ -317,8 +317,6 @@ static __inline__ int fls(unsigned int x) #include <asm-generic/bitops/hweight.h> #define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0) -unsigned long find_next_zero_bit(const unsigned long *addr, - unsigned long size, unsigned long offset); /** * find_first_bit - find the first set bit in a memory region * @addr: The address to start the search at @@ -328,8 +326,6 @@ unsigned long find_next_zero_bit(const unsigned long *addr, * containing a bit. 
*/ #define find_first_bit(addr, size) find_next_bit((addr), (size), 0) -unsigned long find_next_bit(const unsigned long *addr, - unsigned long size, unsigned long offset); /* Little-endian versions */ diff --git a/include/asm-s390/bitops.h b/include/asm-s390/bitops.h index 965394e..13a5c33 100644 --- a/include/asm-s390/bitops.h +++ b/include/asm-s390/bitops.h @@ -691,9 +691,9 @@ static inline unsigned long find_first_bit(const unsigned long * addr, * @offset: The bitnumber to start searching at * @size: The maximum size to search */ -static inline int find_next_zero_bit (const unsigned long * addr, - unsigned long size, - unsigned long offset) +static inline unsigned long __find_next_zero_bit(const unsigned long * addr, + unsigned long size, + unsigned long offset) { const unsigned long *p; unsigned long bit, set; @@ -727,9 +727,9 @@ static inline int find_next_zero_bit (const unsigned long * addr, * @offset: The bitnumber to start searching at * @size: The maximum size to search */ -static inline int find_next_bit (const unsigned long * addr, - unsigned long size, - unsigned long offset) +static inline unsigned long __find_next_bit(const unsigned long * addr, + unsigned long size, + unsigned long offset) { const unsigned long *p; unsigned long bit, set; diff --git a/include/linux/bitops.h b/include/linux/bitops.h index 15360f9..68be268 100644 --- a/include/linux/bitops.h +++ b/include/linux/bitops.h @@ -73,7 +73,6 @@ static inline unsigned fls_long(unsigned long l) } #ifdef __KERNEL__ -#ifdef CONFIG_GENERIC_FIND_NEXT_BIT extern unsigned long __find_next_bit(const unsigned long *addr, unsigned long size, unsigned long offset); @@ -147,6 +146,5 @@ find_next_zero_bit(const unsigned long *addr, unsigned long size, /* size is not constant or too big */ return __find_next_zero_bit(addr, size, offset); } -#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */ #endif /* __KERNEL__ */ #endif ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] x86: Optimize find_next_(zero_)bit for small constant-size bitmaps 2008-03-11 15:17 ` [PATCH] " Alexander van Heukelum 2008-03-11 15:22 ` [RFC] non-x86: " Alexander van Heukelum @ 2008-03-11 15:23 ` Ingo Molnar 1 sibling, 0 replies; 19+ messages in thread From: Ingo Molnar @ 2008-03-11 15:23 UTC (permalink / raw) To: Alexander van Heukelum Cc: Alexander van Heukelum, Andi Kleen, Thomas Gleixner, H. Peter Anvin, LKML * Alexander van Heukelum <heukelum@mailshack.com> wrote: > > > +#endif /* CONFIG_GENERIC_FIND_NEXT_BIT */ > > > > there should be an #else here i think, which maps find_next_bit to > > __find_next_bit / etc. > > No, that is intentional: architectures are expected to either set > CONFIG_GENERIC_FIND_NEXT_BIT, or provide their own implementation of > find_next_bit. indeed, you are right. > This version runs fine on qemu. great - i've queued it up into x86.git. Ingo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] x86: Change x86 to use generic find_next_bit 2008-03-09 20:01 [PATCH] x86: Change x86 to use generic find_next_bit Alexander van Heukelum 2008-03-09 20:10 ` Ingo Molnar 2008-03-09 20:11 ` Ingo Molnar @ 2008-03-09 20:28 ` Andi Kleen 2008-03-09 21:31 ` Andi Kleen 2008-03-13 12:44 ` Aneesh Kumar K.V 4 siblings, 0 replies; 19+ messages in thread From: Andi Kleen @ 2008-03-09 20:28 UTC (permalink / raw) To: Alexander van Heukelum Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, LKML, heukelum Alexander van Heukelum <heukelum@mailshack.com> writes: > > If the bitmap size is not a multiple of BITS_PER_LONG, and no set > (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a > value outside of the range [0,size]. The generic version always returns > exactly size. With that change it is likely possible to remove the min(NR_CPUS in lib/cpumask.c __first/__next_cpu. iirc it was just a workaround for the x86 quirk. I suspect with such a change it would be possible to inline those again, possibly speeding up some loops who do cpumask walking (iirc the scheduler used to do that frequently) -Andi ^ permalink raw reply [flat|nested] 19+ messages in thread
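[Andi's point can be illustrated with a sketch: lib/cpumask.c clamps the result with min() only because the old x86 find_next_bit could return a value past size, and with the generic contract (return exactly size on failure) the clamp becomes redundant. The types and helper below are simplified stand-ins, not the kernel's definitions, and NR_CPUS is set to a small illustrative value:]

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define NR_CPUS 8                                     /* illustrative only */

typedef struct { unsigned long bits[1]; } cpumask_t;  /* simplified stand-in */

/* Bit scan with the generic contract: returns exactly 'size' on failure. */
static unsigned long scan(const unsigned long *addr, unsigned long size,
                          unsigned long offset)
{
    unsigned long i;

    for (i = offset; i < size; i++)
        if (addr[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            return i;
    return size;
}

/* With that contract, a __next_cpu-style helper needs no
 * min(NR_CPUS, ...) clamp: the scan can never return past NR_CPUS. */
static int next_cpu(int n, const cpumask_t *srcp)
{
    return (int)scan(srcp->bits, NR_CPUS, n + 1);
}
```

With the clamp gone, next_cpu is trivial enough to inline, which is the speedup for cpumask-walking loops that Andi alludes to.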
* Re: [PATCH] x86: Change x86 to use generic find_next_bit 2008-03-09 20:01 [PATCH] x86: Change x86 to use generic find_next_bit Alexander van Heukelum ` (2 preceding siblings ...) 2008-03-09 20:28 ` [PATCH] x86: Change x86 to use generic find_next_bit Andi Kleen @ 2008-03-09 21:31 ` Andi Kleen 2008-03-13 12:44 ` Aneesh Kumar K.V 4 siblings, 0 replies; 19+ messages in thread From: Andi Kleen @ 2008-03-09 21:31 UTC (permalink / raw) To: Alexander van Heukelum Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, LKML, heukelum Alexander van Heukelum <heukelum@mailshack.com> writes: > > Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size > 1...512, for each possible bitmap with one bit set, for each possible > offset: find the position of the first bit starting at offset. If you > follow ;). Times include setup of the bitmap and checking of the > results. BTW another comment: it would be far more sense if you did some profiling on what bitmap sizes a real kernel uses (should be easy enough with some systemtap) and benchmarked only that. I doubt 1 bit bitmap searches are common for example ... -Andi ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] x86: Change x86 to use generic find_next_bit 2008-03-09 20:01 [PATCH] x86: Change x86 to use generic find_next_bit Alexander van Heukelum ` (3 preceding siblings ...) 2008-03-09 21:31 ` Andi Kleen @ 2008-03-13 12:44 ` Aneesh Kumar K.V 2008-03-13 14:27 ` Alexander van Heukelum 4 siblings, 1 reply; 19+ messages in thread From: Aneesh Kumar K.V @ 2008-03-13 12:44 UTC (permalink / raw) To: Alexander van Heukelum Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, LKML, heukelum On Sun, Mar 09, 2008 at 09:01:04PM +0100, Alexander van Heukelum wrote: > x86: Change x86 to use the generic find_next_bit implementation > > The versions with inline assembly are in fact slower on the machines I > tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz, AMD > Opteron 270). The i386-version needed a fix similar to 06024f21 to avoid > crashing the benchmark. > > Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size > 1...512, for each possible bitmap with one bit set, for each possible > offset: find the position of the first bit starting at offset. If you > follow ;). Times include setup of the bitmap and checking of the > results. > > Athlon Xeon Opteron 32/64bit > x86-specific: 0m3.692s 0m2.820s 0m3.196s / 0m2.480s > generic: 0m2.622s 0m1.662s 0m2.100s / 0m1.572s > > If the bitmap size is not a multiple of BITS_PER_LONG, and no set > (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a > value outside of the range [0,size]. The generic version always returns > exactly size. The generic version also uses unsigned long everywhere, > while the x86 versions use a mishmash of int, unsigned (int), long and > unsigned long. > This problem is observed on x86_64 and powerpc also. We need a long-aligned address for test_bit, set_bit, find_bit, etc. In ext4 we have to make sure we align the address passed to ext4_test_bit ext4_set_bit ext4_find_next_zero_bit ext4_find_next_bit fs/ext4/mballoc.c has some examples. 
-aneesh ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] x86: Change x86 to use generic find_next_bit 2008-03-13 12:44 ` Aneesh Kumar K.V @ 2008-03-13 14:27 ` Alexander van Heukelum 0 siblings, 0 replies; 19+ messages in thread From: Alexander van Heukelum @ 2008-03-13 14:27 UTC (permalink / raw) To: Aneesh Kumar K.V, Alexander van Heukelum Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, LKML On Thu, 13 Mar 2008 18:14:24 +0530, "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> said: > On Sun, Mar 09, 2008 at 09:01:04PM +0100, Alexander van Heukelum wrote: > > x86: Change x86 to use the generic find_next_bit implementation > > > > The versions with inline assembly are in fact slower on the machines I > > tested them on (in userspace) (Athlon XP 2800+, p4-like Xeon 2.8GHz, AMD > > Opteron 270). The i386-version needed a fix similar to 06024f21 to avoid > > crashing the benchmark. > > > > Benchmark using: gcc -fomit-frame-pointer -Os. For each bitmap size > > 1...512, for each possible bitmap with one bit set, for each possible > > offset: find the position of the first bit starting at offset. If you > > follow ;). Times include setup of the bitmap and checking of the > > results. > > > > Athlon Xeon Opteron 32/64bit > > x86-specific: 0m3.692s 0m2.820s 0m3.196s / 0m2.480s > > generic: 0m2.622s 0m1.662s 0m2.100s / 0m1.572s > > > > If the bitmap size is not a multiple of BITS_PER_LONG, and no set > > (cleared) bit is found, find_next_bit (find_next_zero_bit) returns a > > value outside of the range [0,size]. The generic version always returns > > exactly size. The generic version also uses unsigned long everywhere, > > while the x86 versions use a mishmash of int, unsigned (int), long and > > unsigned long. > > This problem is observed on x86_64 and powerpc also. I'm not entirely sure if it is a problem. In some cases it certainly is an inconvenience, though ;). I mentioned the difference between the old and generic versions because callers might depend on this behaviour. 
Indeed, I see for example (in fs/ext4/mballoc.c): bit = mb_find_next_zero_bit(bitmap_bh->b_data, end, bit); if (bit >= end) break; next = mb_find_next_bit(bitmap_bh->b_data, end, bit); if (next > end) next = end; free += next - bit; So here it needed to adjust the value. > We need a long > aligned address for test_bit, set_bit find_bit etc. In ext4 we have > to make sure we align the address passed to > > ext4_test_bit > ext4_set_bit > ext4_find_next_zero_bit > ext4_find_next_bit > > fs/ext4/mballoc.c have some examples. This is a different 'problem'. find_next_bit works on arrays of long, while the bitmaps in ext4_find_next_bit are of type void * and seem not to have any alignment restrictions. ext4 implements wrappers around find_next_bit to solve that 'problem'. The question that arises is: do we want find_first_bit, find_next_bit, etc. to always return a value in the range [0, size], or do we want to allow implementations that return [0, size-1] if there is a bit found and something else (roundup(size,bitsperlong) or ulongmax, for example) if none were found? The current x86_64 versions of find_first_bit and find_next_bit return roundup(size,bitsperlong) if no bits were found, but on the other hand I guess most bitmaps are a multiple of bitsperlong bits in size, which hides the difference. Greetings, Alexander > -aneesh -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - Or how I learned to stop worrying and love email again ^ permalink raw reply [flat|nested] 19+ messages in thread
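[The alignment issue raised in this subthread can be handled the way ext4's wrappers handle it: round the byte pointer down to a long boundary and compensate in the offset and the result. The sketch below is hypothetical userspace code, not ext4's actual implementation; the names are illustrative, the scanner is a simple reference loop implementing the generic contract, and the wrapper is correct for little-endian bitmaps only:]

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Reference scanner over an array of longs; returns exactly 'size'
 * when nothing is found, matching the generic kernel contract. */
static unsigned long simple_find_next_bit(const unsigned long *addr,
                                          unsigned long size,
                                          unsigned long offset)
{
    unsigned long i;

    for (i = offset; i < size; i++)
        if (addr[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            return i;
    return size;
}

/* Hypothetical ext4-style wrapper: the caller holds a byte pointer
 * with no alignment guarantee (e.g. into a buffer), so round it down
 * to a long boundary and compensate in offset and result.
 * Little-endian bitmaps only. */
static unsigned long unaligned_find_next_bit(const void *addr,
                                             unsigned long size,
                                             unsigned long offset)
{
    unsigned long align = (uintptr_t)addr & (sizeof(unsigned long) - 1);
    const unsigned long *base =
        (const unsigned long *)((const char *)addr - align);
    unsigned long fix = align * 8;      /* bits skipped by rounding down */

    return simple_find_next_bit(base, size + fix, offset + fix) - fix;
}
```

Because the bits below the original pointer are searched starting at offset + fix, they never appear in the result, and subtracting fix at the end restores the caller's bit numbering.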