linux-kernel.vger.kernel.org archive mirror
* Re: [RFC][PATCH] Faster generic_fls
@ 2003-05-02 10:04 Chuck Ebbert
  2003-05-02 22:55 ` David S. Miller
  0 siblings, 1 reply; 65+ messages in thread
From: Chuck Ebbert @ 2003-05-02 10:04 UTC (permalink / raw)
  To: Peter Osterlund; +Cc: linux-kernel

Peter Osterlund wrote:

>>   mov <mem1>,eax
>>   mov eax,<mem2>
>>   mov <mem1>,eax        ; eax already contains mem1 you stupid compiler
>>   ret
>
> Not necessarily if mem2 == mem1 + 2. Consider this code:

  I realized after sending that last message that I should have added that
there were no volatiles and no aliasing was possible.  This was the kernel code:

  return conf->last_used = new_disk;

(new_disk is a local variable, conf is a function arg.)

------
 Chuck

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-02 10:04 [RFC][PATCH] Faster generic_fls Chuck Ebbert
@ 2003-05-02 22:55 ` David S. Miller
  0 siblings, 0 replies; 65+ messages in thread
From: David S. Miller @ 2003-05-02 22:55 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: Peter Osterlund, linux-kernel

On Fri, 2003-05-02 at 03:04, Chuck Ebbert wrote:
>   I realized after sending that last that I should have added that there
> were no volatiles and no aliasing possible.  This was the kernel code:
> 
>   return conf->last_used = new_disk;
> 
> (new_disk is a local variable, conf is a function arg.)

Since new_disk is on the stack, is there something about 'conf'
that guarantees it is not on the stack too?  F.e. what if
&conf->last_used were one byte into 'new_disk' or something
like that.

Probably this is all illegal...

-- 
David S. Miller <davem@redhat.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
@ 2003-05-03  2:21 Chuck Ebbert
  0 siblings, 0 replies; 65+ messages in thread
From: Chuck Ebbert @ 2003-05-03  2:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

>>   return conf->last_used = new_disk;
>> 
>> (new_disk is a local variable, conf is a function arg.)
>
> Since new_disk is on the stack, is there something about 'conf'
> that guarantees it is not on the stack too?  F.e. what if
> &conf->last_used were one byte into 'new_disk' or something
> like that.

  It could be -- I had thought the -fno-strict-aliasing option
meant exactly the opposite of what it really means.  Since the
compiler has been told to be pessimistic about possibilities
like this, maybe that's what it has to do.
------
 Chuck

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-02  1:47                       ` Thomas Schlichter
@ 2003-05-02 13:24                         ` Willy Tarreau
  0 siblings, 0 replies; 65+ messages in thread
From: Willy Tarreau @ 2003-05-02 13:24 UTC (permalink / raw)
  To: Thomas Schlichter; +Cc: dphillips, hugang, linux-kernel

On Fri, May 02, 2003 at 03:47:37AM +0200, Thomas Schlichter wrote:
 
> That is what I posted in my first message in this thread... The shift
> algorithm only works fine for uniformly distributed input values... But here is
> a version that behaves better for small values, too. I don't think it will
> reach the tree version, but it should be much better than the old version!

I also tried to change your version into this, and at least it's not slower,
so it's good anyway, considering the small code size.

> If this is the case the tree version will surely be the best!
> 
> But I think this topic is not worth any further work as this is not used very 
> often... So this version will be my last one!

I agree. Just for the record, I'll post two other original implementations
which are not interesting for their real-life performance, but may be
interesting to reuse in other projects or in cheap hardware implementations,
or as a base for other algos. One of the downsides is that they need many
registers. They are both about twice as slow as the tree on Athlon, Alpha and
Sparc.

The first one uses no jump at all if the CPU can do CMOV. It's about twice as
slow as the tree, but may be a win on machines with a big jump cost. And since
there's no jump, its execution time is constant:

unsigned fls32_1(unsigned n)
{
    /* this code is totally jumpless on architectures that support CMOV,
       and can execute up to 4 instructions per cycle. However, it uses
       lots of instructions and registers and is not as fast as it should be.
       It's about 20 cycles on a dual-pipeline CPU.
    */
    register unsigned x = n, bits = 0, bits2, a, b;
        
    a = x & 0xffff0000;
    bits2 = bits + 16;

    b = x & 0xff00ff00;
    if (x & a) { bits = bits2;}
    if (x & a) { x &= a;}

    bits2 = bits + 8;
    a = x & 0xf0f0f0f0;

    if (x & b) { bits = bits2;}
    if (x & b) { x &= b;}

    bits2 = bits + 4;
    b = x & 0xcccccccc;

    if (x & a) { bits = bits2;}
    if (x & a) { x &= a;}

    bits2 = bits + 2;
    a = x & 0xaaaaaaaa;

    if (x & b) { bits = bits2;}

    if (x & b) { x &= b;}
    bits2 = bits + 1;

    if (x & a) { bits = bits2;}
    if (x & a) { x &= a;}

    if (x) { bits += 1; }

    return bits;
}

The second one takes a radically different approach. It converges in 5 shifts.
However, each iteration has a non-negligible cost, and its time is nearly
constant too. The code is rather small (about 70 bytes), but it needs a CPU
which can shift and jump fast to be efficient. It consumes about the same
time as above.

unsigned fls32_2(unsigned n)
{
    register unsigned t = 16, r = 0;
    register unsigned m = 0xffff0000;

    // it only needs 5 iterations to complete, and some
    // instructions can be executed in parallel. It's
    // more efficient than the pure shift on small values.
    // But it needs many registers :-(
    if (n) {
        while (t) {
            if (n & m) {
                n >>= t;
                r += t;
                t >>= 1;
                m >>= t;
            } else {
                n <<= t;
                t >>= 1;
                m <<= t ? t : 1;
            }
        }
        return  r + 1;
    }
    else
        return n;
}


These were my last versions, and not particularly performant ones ;-)

Regards,
Willy


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-02  0:54                       ` Andrew Morton
@ 2003-05-02 12:21                         ` Daniel Phillips
  0 siblings, 0 replies; 65+ messages in thread
From: Daniel Phillips @ 2003-05-02 12:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: willy, schlicht, hugang, linux-kernel

On Friday 02 May 2003 02:54, Andrew Morton wrote:
> Would it be churlish to point out that the only significant user of fls()
> is sctp_v6_addr_match_len()?

Maybe.  It's also useful for breaking up an arbitrary IO region optimally into 
binary-sized blocks, which is part of the current 2.4 device-mapper patch 
set, not yet submitted.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-02  8:50 Chuck Ebbert
@ 2003-05-02  9:37 ` Peter Osterlund
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Osterlund @ 2003-05-02  9:37 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel

Chuck Ebbert <76306.1226@compuserve.com> writes:

>   GCC is the strangest combination of utterly brilliant and brain-dead
> stupid that I've ever seen... I've seen it do tail merges that took
> my breath away, followed by this:
> 
>   mov <mem1>,eax
>   mov eax,<mem2>
>   mov <mem1>,eax        ; eax already contains mem1 you stupid compiler
>   ret

Not necessarily if mem2 == mem1 + 2. Consider this code:

        #include <string.h>
        int f(char* a, char* b)
        {
            int t;
            memcpy(&t, a, sizeof(int));
            memcpy(b, &t, sizeof(int));
            memcpy(&t, a, sizeof(int));
            return t;
        }

"gcc -O2 -Wall -S test.c -fomit-frame-pointer" correctly generates:

f:
        movl    4(%esp), %ecx
        movl    (%ecx), %eax
        movl    8(%esp), %edx
        movl    %eax, (%edx)
        movl    (%ecx), %eax
        ret

-- 
Peter Osterlund - petero2@telia.com
http://w1.894.telia.com/~u89404340

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
@ 2003-05-02  8:50 Chuck Ebbert
  2003-05-02  9:37 ` Peter Osterlund
  0 siblings, 1 reply; 65+ messages in thread
From: Chuck Ebbert @ 2003-05-02  8:50 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

David S. Miller wrote:

> I know you think GCC is a pile of dogshit, in many ways I do too, but I
> think it's just as important to point out the good parts where they
> exist.

  GCC is the strangest combination of utterly brilliant and brain-dead
stupid that I've ever seen... I've seen it do tail merges that took
my breath away, followed by this:

  mov <mem1>,eax
  mov eax,<mem2>
  mov <mem1>,eax        ; eax already contains mem1 you stupid compiler
  ret

------
 Chuck

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 12:00       ` David S. Miller
@ 2003-05-02  5:14         ` Linus Torvalds
  0 siblings, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2003-05-02  5:14 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel


On 1 May 2003, David S. Miller wrote:
> 
> This actually is false.  GCC does not know what resources the
> instruction uses, therefore it cannot perform accurate instruction
> scheduling.

Yeah, and sadly the fact that gcc-3.2.x does better instruction scheduling 
has shown itself to not actually _matter_ that much. I'm quite impressed 
by the new scheduler, but gcc-2.x seems to still generate faster code on 
too many examples.

CPU's are getting smarter, not dumber. And that implies, for example, that 
instruction scheduling matters less and less. What matters on the P4, for 
example, seems to be the _choice_ of instructions, not the scheduling of 
them.

			Linus


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-02  0:43                     ` Daniel Phillips
  2003-05-02  0:54                       ` Andrew Morton
@ 2003-05-02  1:47                       ` Thomas Schlichter
  2003-05-02 13:24                         ` Willy Tarreau
  1 sibling, 1 reply; 65+ messages in thread
From: Thomas Schlichter @ 2003-05-02  1:47 UTC (permalink / raw)
  To: dphillips, Willy Tarreau, hugang; +Cc: linux-kernel

[-- Attachment #1: signed data --]
[-- Type: text/plain, Size: 1900 bytes --]

On May 2, Daniel Phillips wrote:
> On Friday 02 May 2003 02:10, Willy Tarreau wrote:
> > At first, I thought you had coded an FFS instead of an FLS. But I
> > realized it's valid, and is fast because one half of the numbers tested
> > will not even take one iteration.
>
> Aha, and duh.  At 1 million iterations, my binary search clobbers the shift
> version by a factor of four.  At 2**31 iterations, my version also wins, by
> 20%.
>
> I should note that the iterations parameter to my benchmark program is
> deeply flawed, since it changes the nature of the input set, not just the
> number of iterations.  Fortunately, all tests I've done have been with
> 2**32 iterations, giving a perfectly uniform distribution of input values.

That is what I posted in my first message in this thread... The shift
algorithm only works fine for uniformly distributed input values... But here is
a version that behaves better for small values, too. I don't think it will
reach the tree version, but it should be much better than the old version!

int fls_shift(int x)
{
	int bit;

	if(x & 0xFFFF0000) {
		bit = 32;
 
		while(x > 0) {
			--bit;
			x <<= 1;
		}
	} else {
		bit = 0;
 
		while(x) {
			++bit;
			x >>= 1;
		}
	}

	return bit;
}

For me this version even speeds up the uniform distributed benchmark...

> For uniformly distributed inputs the simple shift loop does well, but for
> calculating floor(log2(x)) it's much worse than the fancier alternatives. 
> I suspect typical usage leans more to the latter than the former.

If this is the case the tree version will surely be the best!

But I think this topic is not worth any further work as this is not used very 
often... So this version will be my last one!

But this was a good example how suited algorithms can speed up benchmarks ;-)

> Regards,
>
> Daniel

Best regards
  Thomas

[-- Attachment #2: signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-02  0:43                     ` Daniel Phillips
@ 2003-05-02  0:54                       ` Andrew Morton
  2003-05-02 12:21                         ` Daniel Phillips
  2003-05-02  1:47                       ` Thomas Schlichter
  1 sibling, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2003-05-02  0:54 UTC (permalink / raw)
  To: dphillips; +Cc: willy, schlicht, hugang, linux-kernel

Daniel Phillips <dphillips@sistina.com> wrote:
>
> On Friday 02 May 2003 02:10, Willy Tarreau wrote:
> > At first, I thought you had coded an FFS instead of an FLS. But I realized
> > it's valid, and is fast because one half of the numbers tested will not even
> > take one iteration.
> 
> Aha, and duh.  At 1 million iterations, my binary search clobbers the shift 
> version by a factor of four.  At 2**31 iterations, my version also wins, by 
> 20%.
> 

Would it be churlish to point out that the only significant user of fls()
is sctp_v6_addr_match_len()?


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-02  0:10                   ` Willy Tarreau
@ 2003-05-02  0:43                     ` Daniel Phillips
  2003-05-02  0:54                       ` Andrew Morton
  2003-05-02  1:47                       ` Thomas Schlichter
  0 siblings, 2 replies; 65+ messages in thread
From: Daniel Phillips @ 2003-05-02  0:43 UTC (permalink / raw)
  To: Willy Tarreau, Thomas Schlichter
  Cc: Willy TARREAU, hugang, linux-kernel, akpm

On Friday 02 May 2003 02:10, Willy Tarreau wrote:
> At first, I thought you had coded an FFS instead of an FLS. But I realized
> it's valid, and is fast because one half of the numbers tested will not even
> take one iteration.

Aha, and duh.  At 1 million iterations, my binary search clobbers the shift 
version by a factor of four.  At 2**31 iterations, my version also wins, by 
20%.

I should note that the iterations parameter to my benchmark program is deeply 
flawed, since it changes the nature of the input set, not just the number of 
iterations.  Fortunately, all tests I've done have been with 2**32 
iterations, giving a perfectly uniform distribution of input values.

For uniformly distributed inputs the simple shift loop does well, but for 
calculating floor(log2(x)) it's much worse than the fancier alternatives.  I 
suspect typical usage leans more to the latter than the former.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 15:54               ` Thomas Schlichter
@ 2003-05-02  0:33                 ` hugang
  0 siblings, 0 replies; 65+ messages in thread
From: hugang @ 2003-05-02  0:33 UTC (permalink / raw)
  To: Thomas Schlichter; +Cc: willy, linux-kernel, akpm

On Thu, 1 May 2003 17:54:28 +0200
Thomas Schlichter <schlicht@uni-mannheim.de> wrote:

> Hi,
> 
> On May 1, hugang wrote:
> > On Thu, 1 May 2003 15:52:04 +0200
> >
> > Willy TARREAU <willy@w.ods.org> wrote:
> > >  However, I have good news for you, the table code is 15% faster than my
> > > tree on an Alpha EV6 ;-)
> > >  It may be because gcc can use different tricks on this arch.
> > >
> > >  Cheers
> > >  Willy
> >
> > However, the code changes only increase performance a little. Do we
> > really need it?
> >
> > Here is a test on my machine; the fastest is still the table.
> 
> I think the table version is so fast because a uniform distribution of numbers is
> tested. With this distribution half of the occurring values have the MSB set,
> one quarter the next most significant bit, and so on...
> 
> The tree version is approximately equally fast for every number, but the table
> version is fast if a very significant bit is set and slow if a less
> significant bit is set...
> 
> If small values occur more often than big ones the tree version should be
> preferred; otherwise the following version may be very fast, too.
> 
> static inline int fls_shift(int x)
> {
> 	int bit = 32;
>  
> 	while(x > 0) {
> 		--bit;
> 		x <<= 1;
> 	}
> 
> 	return x ? bit : 0;
> }
> 

Yes, the shift is much faster than the others. Here is a test on a P4 1.6G.

model name      : Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)
==== -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2 -march=i686 ====
test_fls.c: In function `fls_tree_fls':
test_fls.c:85: warning: type defaults to `int' in declaration of `t'
4294967295 iterations of nil... checksum = 4294967295

real    0m21.698s
user    0m21.650s
sys     0m0.000s
4294967295 iterations of table... checksum = 4294967265

real    0m41.581s
user    0m41.500s
sys     0m0.000s
4294967295 iterations of tree... checksum = 4294967265

real    1m1.173s
user    1m1.040s
sys     0m0.000s
4294967295 iterations of shift... checksum = 4294967265

real    0m22.050s
user    0m22.000s
sys     0m0.010s



-- 
Hu Gang / Steve
Email        : huagng@soulinfo.com, steve@soulinfo.com
GPG FinePrint: 4099 3F1D AE01 1817 68F7  D499 A6C2 C418 86C8 610E
ICQ#         : 205800361
Registered Linux User : 204016

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 23:27                 ` Thomas Schlichter
  2003-05-02  0:10                   ` Willy Tarreau
@ 2003-05-02  0:13                   ` Daniel Phillips
  1 sibling, 0 replies; 65+ messages in thread
From: Daniel Phillips @ 2003-05-02  0:13 UTC (permalink / raw)
  To: Thomas Schlichter, Willy TARREAU, hugang; +Cc: linux-kernel, akpm

On Friday 02 May 2003 01:27, Thomas Schlichter wrote:
> ...So for me the table version seems to be the slowest one. The BSRL
> instruction on the K6-III seems to be very slow, too. The tree and my shift
> version are faster than the original version here...
>
> That someone else can test my fls_shift version I'll provide it here again:
> static inline int fls_shift(int x)
> {
> 	int bit = 32;
>
> 	while(x > 0) {
> 		--bit;
> 		x <<= 1;
> 	}
>
> 	return x ? bit : 0;
> }

Your shift version is the fastest on the PIII as well, finishing in 45.3 
seconds vs 53.4 for my original, and using only 12 bytes of text.  This was a 
big surprise.  The time was the same, whether I inlined it or not.  I take 
this to mean that -O3 inlines such a short function in either case.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 13:52           ` Willy TARREAU
  2003-05-01 14:14             ` Falk Hueffner
  2003-05-01 14:53             ` hugang
@ 2003-05-02  0:13             ` Lincoln Dale
  2 siblings, 0 replies; 65+ messages in thread
From: Lincoln Dale @ 2003-05-02  0:13 UTC (permalink / raw)
  To: Willy TARREAU; +Cc: hugang, akpm, linux-kernel

At 03:52 PM 1/05/2003 +0200, Willy TARREAU wrote:
>Ok, I recoded the tree myself with if/else, and it's now faster than all 
>others,
[..]
>4294967295 iterations of nil... checksum = 4294967295
>4294967295 iterations of old... checksum = 4294967265
>4294967295 iterations of new... checksum = 4294967265
>4294967295 iterations of fls_table... checksum = 4294967264
>4294967295 iterations of fls_tree... checksum = 4294967265

Yep - faster - but with different results.
Shouldn't the checksums be the same?


cheers,

lincoln.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 23:27                 ` Thomas Schlichter
@ 2003-05-02  0:10                   ` Willy Tarreau
  2003-05-02  0:43                     ` Daniel Phillips
  2003-05-02  0:13                   ` Daniel Phillips
  1 sibling, 1 reply; 65+ messages in thread
From: Willy Tarreau @ 2003-05-02  0:10 UTC (permalink / raw)
  To: Thomas Schlichter; +Cc: Willy TARREAU, hugang, linux-kernel, akpm, dphillips

On Fri, May 02, 2003 at 01:27:16AM +0200, Thomas Schlichter wrote:
 
> So for me the table version seems to be the slowest one. The BSRL instruction 
> on the K6-III seems to be very slow, too. The tree and my shift version are 
> faster than the original version here...
> 
> That someone else can test my fls_shift version I'll provide it here again:

That's interesting, Thomas. I get 18.4 s on the Athlon here vs 32.3 for Daniel's
(I have broken my function at the moment so I cannot re-bench it right now, but
it should be near Daniel's). At first, I thought you had coded an FFS instead of
an FLS. But I realized it's valid, and it is fast because half of the numbers
tested will not even take one iteration.

Regards,
Willy


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 17:16               ` Willy TARREAU
@ 2003-05-01 23:27                 ` Thomas Schlichter
  2003-05-02  0:10                   ` Willy Tarreau
  2003-05-02  0:13                   ` Daniel Phillips
  0 siblings, 2 replies; 65+ messages in thread
From: Thomas Schlichter @ 2003-05-01 23:27 UTC (permalink / raw)
  To: Willy TARREAU, hugang; +Cc: linux-kernel, akpm, dphillips

[-- Attachment #1: signed data --]
[-- Type: text/plain, Size: 2752 bytes --]

On May 1, Willy TARREAU wrote:
> On Thu, May 01, 2003 at 10:53:21PM +0800, hugang wrote:
> > Here is a test on my machine; the fastest is still the table.
>
> because, as Falk said, the tests are incremental and the branch prediction
> works very well. I proposed a simple scrambling function based on bswap.
> Please consider this function :
>
>       f(i) = i ^ htonl(i) ^ htonl(i<<7)
>
> It returns such values :
>
> 0x00000001 => 0x81000001
> 0x00000002 => 0x02010002
> 0x00000003 => 0x83010003
> 0x00000004 => 0x04020004
> 0x00000005 => 0x85020005
> 0x00000006 => 0x06030006
> 0x00000007 => 0x87030007
> 0x00000008 => 0x08040008
> 0x00000009 => 0x89040009
> 0x0000000a => 0x0a05000a
> 0x0000000b => 0x8b05000b
>
> etc...
>
> As you see, high bits move fast enough to throw off a predictor.
>
> The tree function, as well as Daniel's "new" one, resists non-linear
> sequences better. BTW, the latest changes I did show that the convergence
> should be between Daniel's function and mine, because there are common
> concepts. I noticed that the expression used in Daniel's function is too
> complex for gcc to optimize it well enough. In my case, gcc 3.2.3 coded too
> many jumps instead of conditional moves. I saw that playing with
> -mbranch-cost changes the code. A mix between the two is used here (and
> listed after). It's still not optimal, reading the code, because there's
> always one useless jump and move. But in the worst case, it gives exactly
> the same result as Daniel's. But in other cases, it is even slightly
> faster. Honestly, now it's just a matter of taste. Daniel's is easier to
> read, mine produces smaller code. This time, it's faster than the others on
> Athlon, Alpha and Sparc. I don't know about the Pentium III nor the P4.
>
> Here are the results on Athlon, gcc-3.2.3, then Alpha and Sparc.

~~ snip~~

Here are some results with the scrambling function on AMD K6-III 450MHz, 
gcc-2.95.3:

fls_old (generic_fls from linux 2.5.68):
real    3m52.960s
user    3m42.080s
sys     0m0.348s

fls_bsrl:
real    4m39.040s
user    4m25.532s
sys     0m0.401s

fls_table:
real    4m59.622s
user    4m45.511s
sys     0m0.469s

fls_tree:
real    3m3.986s
user    2m55.236s
sys     0m0.272s

fls_shift:
real    2m58.765s
user    2m50.092s
sys     0m0.278s

So for me the table version seems to be the slowest one. The BSRL instruction 
on the K6-III seems to be very slow, too. The tree and my shift version are 
faster than the original version here...

That someone else can test my fls_shift version I'll provide it here again:
static inline int fls_shift(int x)
{
	int bit = 32;
 
	while(x > 0) {
		--bit;
		x <<= 1;
	}

	return x ? bit : 0;
}

> Willy

Thomas

[-- Attachment #2: signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 14:53             ` hugang
  2003-05-01 15:54               ` Thomas Schlichter
@ 2003-05-01 17:16               ` Willy TARREAU
  2003-05-01 23:27                 ` Thomas Schlichter
  1 sibling, 1 reply; 65+ messages in thread
From: Willy TARREAU @ 2003-05-01 17:16 UTC (permalink / raw)
  To: hugang; +Cc: linux-kernel, akpm, dphillips

On Thu, May 01, 2003 at 10:53:21PM +0800, hugang wrote:
 
> Here is a test on my machine; the fastest is still the table.

because, as Falk said, the tests are incremental and the branch prediction
works very well. I proposed a simple scrambling function based on bswap. Please
consider this function :

      f(i) = i ^ htonl(i) ^ htonl(i<<7)

It returns such values :

0x00000001 => 0x81000001
0x00000002 => 0x02010002
0x00000003 => 0x83010003
0x00000004 => 0x04020004
0x00000005 => 0x85020005
0x00000006 => 0x06030006
0x00000007 => 0x87030007
0x00000008 => 0x08040008
0x00000009 => 0x89040009
0x0000000a => 0x0a05000a
0x0000000b => 0x8b05000b

etc...

As you see, high bits move fast enough to throw off a predictor.

The tree function, as well as Daniel's "new" one, resists non-linear sequences
better. BTW, the latest changes I did show that the convergence should be between
Daniel's function and mine, because there are common concepts. I noticed that
the expression used in Daniel's function is too complex for gcc to optimize it
well enough. In my case, gcc 3.2.3 coded too many jumps instead of conditional
moves. I saw that playing with -mbranch-cost changes the code. A mix between
the two is used here (and listed after). It's still not optimal, reading the
code, because there's always one useless jump and move. But in the worst case,
it gives exactly the same result as Daniel's. But in other cases, it is even
slightly faster. Honestly, now it's just a matter of taste. Daniel's is easier
to read, mine produces smaller code. This time, it's faster than the others on
Athlon, Alpha and Sparc. I don't know about the Pentium III nor the P4.

Here are the results on Athlon, gcc-3.2.3, then Alpha and Sparc.

Willy

====

4294967295 iterations of bswap/nil... checksum = 4294967295

real    0m53.133s
user    0m53.114s
sys     0m0.005s

4294967295 iterations of bswap/fls_table... checksum = 4294967262

real    1m44.686s
user    1m44.163s
sys     0m0.487s

4294967295 iterations of bswap/new... checksum = 4294967262

real    1m16.463s
user    1m16.144s
sys     0m0.314s

4294967295 iterations of bswap/fls_tree... checksum = 4294967262

real    1m16.511s
user    1m16.169s
sys     0m0.296s

== alpha 466 MHz , gcc-3.2.3
   gcc -O3 -o testfls -fomit-frame-pointer -mcpu=ev67 ===

4294967295 iterations of bswap/fls_tree... checksum = 4294967262

real    4m14.432s
user    4m13.540s
sys     0m0.038s

4294967295 iterations of bswap/fls_table... checksum = 4294967262

real    4m36.204s
user    4m35.368s
sys     0m0.030s

== ultra-sparc 333 MHz, gcc 3.1.1 (linear only)
   gcc -O3 -mcpu=v9 -fomit-frame-pointer ===

4294967295 iterations of fls_tree... checksum = 4294967265

real    1m52.680s
user    1m52.640s
sys     0m0.010s

4294967295 iterations of fls_table... checksum = 4294967265

real    3m24.514s
user    3m24.430s
sys     0m0.010s

======= here is the function :

unsigned fls_tree_fls(unsigned n) {
    register unsigned t = 0;

    if (n >= (1<<16)) {
	n >>= 16;
	t = 16;
    }

    if (n < (1<<8))
	if (n < (1<<4))
	    if (n < 4)
		return t + n - ((n + 1) >> 2);
	    else
		return t + 3 + (n>>3);
	else
	    if (n < (1<<6))
		return t + 5 + (n>>5);
	    else
		return t + 7 + (n>>7);
    else
	if (n < (1<<12))
	    if (n < (1<<10))
		return t + 9 + (n>>9);
	    else
		return t + 11 + (n>>11);
	else
	    if (n < (1<<14))
		return t + 13 + (n>>13);
	    else
		return t + 15 + (n>>15);
}


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
@ 2003-05-01 17:15 Chuck Ebbert
  0 siblings, 0 replies; 65+ messages in thread
From: Chuck Ebbert @ 2003-05-01 17:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:

>>  BTW, has someone benchmarked BSF/BSR on x86 ? It should be
>> faster, but it's also possible that a poor microcode implements it with a one
>> bit/cycle algo, which will result in one instruction not being as fast as your
>> code.
>
> I think the original i386 did it with a one-bit-per-cycle algorithm,
> anything since should be fine. In particular, on a P4 where I just tested,
> the bsf seems to be 4 cycles over the whole input set (actually, my whole
> loop was 4 cycles per iteration, so 4 cycles is worst-case. I'm assuming
the rest could have been done in parallel).

 
 Just for comparison, the Pentium (Classic) manual says 6-43 clocks for
bsfl and 7-72 (!) for bsrl.

------
 Chuck

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01  9:19           ` Willy Tarreau
@ 2003-05-01 16:02             ` Linus Torvalds
  0 siblings, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2003-05-01 16:02 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Daniel Phillips, Falk Hueffner, linux-kernel


On Thu, 1 May 2003, Willy Tarreau wrote:
>
>  BTW, has someone benchmarked BSF/BSR on x86 ? It should be
> faster, but it's also possible that a poor microcode implements it with a one
> bit/cycle algo, which will result in one instruction not being as fast as your
> code.

I think the original i386 did it with a one-bit-per-cycle algorithm,
anything since should be fine. In particular, on a P4 where I just tested,
the bsf seems to be 4 cycles over the whole input set (actually, my whole
loop was 4 cycles per iteration, so 4 cycles is worst-case. I'm assuming
the rest could have been done in parallel).

		Linus


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 14:53             ` hugang
@ 2003-05-01 15:54               ` Thomas Schlichter
  2003-05-02  0:33                 ` hugang
  2003-05-01 17:16               ` Willy TARREAU
  1 sibling, 1 reply; 65+ messages in thread
From: Thomas Schlichter @ 2003-05-01 15:54 UTC (permalink / raw)
  To: hugang; +Cc: Willy TARREAU, linux-kernel, akpm

[-- Attachment #1: signed data --]
[-- Type: text/plain, Size: 1790 bytes --]

Hi,

On May 1, hugang wrote:
> On Thu, 1 May 2003 15:52:04 +0200
>
> Willy TARREAU <willy@w.ods.org> wrote:
> >  However, I have good news for you, the table code is 15% faster than my
> > tree on an Alpha EV6 ;-)
> >  It may be because gcc can use different tricks on this arch.
> >
> >  Cheers
> >  Willy
>
> However, the code changes only increase performance a little. Do we
> really need it?
>
> Here is a test on my machine; the fastest is still the table.

I think the table version is so fast because a uniform distribution of numbers is
tested. With this distribution half of the occurring values have the MSB set,
one quarter the next most significant bit, and so on...

The tree version is approximately equally fast for every number, but the table
version is fast if a very significant bit is set and slow if a less
significant bit is set...

If small values occur more often than big ones the tree version should be
preferred; otherwise the following version may be very fast, too.

static inline int fls_shift(int x)
{
	int bit = 32;

	/* shift left until the sign bit is set (x goes negative) or x == 0 */
	while (x > 0) {
		--bit;
		x <<= 1;
	}

	return x ? bit : 0;
}

I get the following times on my K6-III (compiled with -O3 -march=k6) for 
4294967296 iterations:

fls_bsrl: (uses the 'bsrl' command via inline assembler)
real    3m39.059s
user    3m27.416s
sys     0m0.345s

fls_shift:
real    1m47.337s
user    1m41.801s
sys     0m0.185s

fls_tree:
real    1m56.746s
user    1m50.681s
sys     0m0.165s

The 32-bit tree is a bit slower here, too:
real    1m59.447s
user    1m53.375s
sys     0m0.200s

I did not test the table version as I think it cannot be faster than the shift 
version...

Best regards
  Thomas Schlichter

P.S.: I'd be happy to see how this performs on an Athlon or P-IV...

[-- Attachment #2: signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 13:52           ` Willy TARREAU
  2003-05-01 14:14             ` Falk Hueffner
@ 2003-05-01 14:53             ` hugang
  2003-05-01 15:54               ` Thomas Schlichter
  2003-05-01 17:16               ` Willy TARREAU
  2003-05-02  0:13             ` Lincoln Dale
  2 siblings, 2 replies; 65+ messages in thread
From: hugang @ 2003-05-01 14:53 UTC (permalink / raw)
  To: Willy TARREAU, linux-kernel, akpm

On Thu, 1 May 2003 15:52:04 +0200
Willy TARREAU <willy@w.ods.org> wrote:

>  However, I have good news for you, the table code is 15% faster than my tree
>  on an Alpha EV6 ;-)
>  It may be because gcc can use different tricks on this arch.
> 
>  Cheers
>  Willy

However, most of the code changes only increase performance a little. Do we really need it?

Here is a test on my machine. The fastest is still the table.

==== -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2 -march=i686 ====
test_fls.c: In function `fls_tree_fls':
test_fls.c:78: warning: type defaults to `int' in declaration of `t'
4294967295 iterations of nil... checksum = 4294967295

real    0m21.852s
user    0m21.500s
sys     0m0.300s
4294967295 iterations of old... checksum = 4294967265

real    0m45.206s
user    0m44.800s
sys     0m0.310s
4294967295 iterations of new... checksum = 4294967265

real    0m45.235s
user    0m44.720s
sys     0m0.420s
4294967295 iterations of new2... checksum = 4294967265

real    0m59.615s
user    0m58.960s
sys     0m0.540s
4294967295 iterations of table... checksum = 4294967265

real    0m41.784s
user    0m41.420s
sys     0m0.280s
4294967295 iterations of tree... checksum = 4294967265

real    0m44.874s
user    0m44.410s
sys     0m0.380s
------------------


-- 
Hu Gang / Steve
Email        : huagng@soulinfo.com, steve@soulinfo.com
GPG FinePrint: 4099 3F1D AE01 1817 68F7  D499 A6C2 C418 86C8 610E
ICQ#         : 205800361
Registered Linux User : 204016

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 14:14             ` Falk Hueffner
@ 2003-05-01 14:26               ` Willy TARREAU
  0 siblings, 0 replies; 65+ messages in thread
From: Willy TARREAU @ 2003-05-01 14:26 UTC (permalink / raw)
  To: Falk Hueffner; +Cc: Willy TARREAU, hugang, akpm, linux-kernel

On Thu, May 01, 2003 at 04:14:17PM +0200, Falk Hueffner wrote:
> Willy TARREAU <willy@w.ods.org> writes:
> 
> > On Thu, May 01, 2003 at 03:05:57PM +0800, hugang wrote:
> > Ok, I recoded the tree myself with if/else, and it's now faster than
> > all others, whatever the compiler.
> 
> Have you tried with not simply increasing, but random numbers? I guess
> this could make quite a difference here because of branch prediction.

I thought about this, and indeed that's what I used in the program that
benchmarked the first function I sent yesterday. The problem with random
numbers is that generating them is so slow that you must build a giant table
first and apply your tests to that table. The bottleneck then shifts to data
cache misses, which cost more than some of the operations being measured. If
you try it, you'll notice that it's difficult to get comparable results twice.

Other solutions include non-linear sequences such as mixing some sequential
values with BSWAP. E.g.: x ^ bswap(x) ^ bswap(x << 4).
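(Editorial sketch, not from the original mail.) The mixing trick can be written with gcc's __builtin_bswap32 standing in for the BSWAP instruction; the function name mix_next is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Scramble a sequential counter so consecutive test values set bits all
 * over the word, defeating branch prediction without the cache misses
 * of a huge precomputed random table. Formula from the mail:
 *   x ^ bswap(x) ^ bswap(x << 4)
 */
static inline uint32_t mix_next(uint32_t x)
{
	return x ^ __builtin_bswap32(x) ^ __builtin_bswap32(x << 4);
}
```

Being a pure function of the counter, the sequence is fully reproducible across runs, unlike a time-seeded random generator.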

Willy


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 13:52           ` Willy TARREAU
@ 2003-05-01 14:14             ` Falk Hueffner
  2003-05-01 14:26               ` Willy TARREAU
  2003-05-01 14:53             ` hugang
  2003-05-02  0:13             ` Lincoln Dale
  2 siblings, 1 reply; 65+ messages in thread
From: Falk Hueffner @ 2003-05-01 14:14 UTC (permalink / raw)
  To: Willy TARREAU; +Cc: hugang, akpm, linux-kernel

Willy TARREAU <willy@w.ods.org> writes:

> On Thu, May 01, 2003 at 03:05:57PM +0800, hugang wrote:
> Ok, I recoded the tree myself with if/else, and it's now faster than
> all others, whatever the compiler.

Have you tried with not simply increasing, but random numbers? I guess
this could make quite a difference here because of branch prediction.

-- 
	Falk

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01  7:05         ` hugang
@ 2003-05-01 13:52           ` Willy TARREAU
  2003-05-01 14:14             ` Falk Hueffner
                               ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Willy TARREAU @ 2003-05-01 13:52 UTC (permalink / raw)
  To: hugang; +Cc: akpm, linux-kernel

On Thu, May 01, 2003 at 03:05:57PM +0800, hugang wrote:
> Hello
> 
> Here is the table version of fls. The previous version of it was wrong.

Ok, I recoded the tree myself with if/else, and it's now faster than all the
others, whatever the compiler. I cut the tree down to 16-bit evaluations
because that was still faster than a full 32-bit tree on x86, probably because
of the code length. However, on Alpha EV6 (gcc-2.95.3), the 32-bit tree is
faster. The 32-bit code is also the same speed as the 16-bit one on gcc-3.2.3
if I specify -mbranch-cost=2 (because gcc otherwise assumes that today's CPUs
need one cycle per jump).

Here is the function, followed by the logs executed on an Athlon.

Disclaimer: Daniel's PentiumIII seems to give very different results than my
Athlon, so every test should be considered for one arch only !!!

To quickly summarize the results: on gcc-3.2.3 it's 45% faster than the table
algo (I don't understand why), and just a little bit faster than Daniel's
function.

On gcc-2.95.3, it's only 20% faster than the table, but 46% faster than Daniel's.

However, I have good news for you, the table code is 15% faster than my tree
on an Alpha EV6 ;-)
It may be because gcc can use different tricks on this arch.

Cheers
Willy

static inline int fls_tree_fls(unsigned n) {
#ifndef REALLY_WANT_A_32BITS_TREE
    /* it seems as if working on half a tree is faster than the
       complete tree.
    */
    register int t = 0;
    if (n >= (1<<16)) {
	n >>= 16;
	t = 16;
    }
    if (n < (1<<8))
	if (n < (1<<4))
	    if (n < 4)
		return t + n - ((n + 1) >> 2);
	    else
		return t + 3 + (n >= 8);
	else
	    if (n < (1<<6))
		return t + 5 + (n >= (1<<5));
	    else
		return t + 7 + (n >= (1<<7));
    else
	if (n < (1<<12))
	    if (n < (1<<10))
		return t + 9 + (n >= (1<<9));
	    else
		return t + 11 + (n >= (1<<11));
	else
	    if (n < (1<<14))
		return t + 13 + (n >= (1<<13));
	    else
		return t + 15 + (n >= (1<<15));
#else
    /* perhaps this is faster on systems with BIG caches, but on
       athlon, at least, it's slower than the previous one
    */
    if (n < (1<<16))
	if (n < (1<<8))
	    if (n < (1<<4))
		if (n < 4)
		    return n - ((n + 1) >> 2);
		else
		    return 3 + (n >= 8);
	    else
		if (n < (1<<6))
		    return 5 + (n >= (1<<5));
		else
		    return 7 + (n >= (1<<7));
	else
	    if (n < (1<<12))
		if (n < (1<<10))
		    return 9 + (n >= (1<<9));
		else
		    return 11 + (n >= (1<<11));
	    else
		if (n < (1<<14))
		    return 13 + (n >= (1<<13));
		else
		    return 15 + (n >= (1<<15));
    else
	if (n < (1<<24))
	    if (n < (1<<20))
		if (n < (1<<18))
		    return 17 + (n >= (1<<17));
		else
		    return 19 + (n >= (1<<19));
	    else
		if (n < (1<<22))
		    return 21 + (n >= (1<<21));
		else
		    return 23 + (n >= (1<<23));
	else
	    if (n < (1<<28))
		if (n < (1<<26))
		    return 25 + (n >= (1<<25));
		else
		    return 27 + (n >= (1<<27));
	    else
		if (n < (1<<30))
		    return 29 + (n >= (1<<29));
		else
		    return 31 + (n >= (1<<31));
#endif
}
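(Editorial note, not part of Willy's mail.) The only non-obvious leaf in the tree above is the n < 4 case, which packs fls(0..3) = {0, 1, 2, 2} into branch-free arithmetic; a standalone sketch of just that trick, with a name of my choosing:

```c
#include <assert.h>

/* valid for n in 0..3: returns 0, 1, 2, 2 without a branch,
 * since (n + 1) >> 2 is 1 only when n == 3 */
static inline int fls_leaf(unsigned n)
{
	return (int)(n - ((n + 1) >> 2));
}
```

This saves one comparison and jump at the deepest, most frequently reached level of the tree.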

== results on Athlon 1800+ (1.53 GHz), gcc-2.95.3 -O3 -march=i686 :

4294967295 iterations of nil... checksum = 4294967295

real    0m5.600s
user    0m5.590s
sys     0m0.007s
4294967295 iterations of old... checksum = 4294967265

real    0m39.640s
user    0m39.499s
sys     0m0.141s
4294967295 iterations of new... checksum = 4294967265

real    0m45.088s
user    0m44.936s
sys     0m0.158s
4294967295 iterations of fls_table... checksum = 4294967264

real    0m35.988s
user    0m35.833s
sys     0m0.159s
4294967295 iterations of fls_tree... checksum = 4294967265

real    0m28.699s
user    0m28.605s
sys     0m0.096s

== results on Athlon 1800+ (1.53 GHz), gcc-3.2.3 -O3 -march=athlon :

4294967295 iterations of nil... checksum = 4294967295

real    0m16.624s
user    0m16.624s
sys     0m0.001s
4294967295 iterations of old... checksum = 4294967265

real    0m57.685s
user    0m57.568s
sys     0m0.125s
4294967295 iterations of new... checksum = 4294967265

real    0m37.513s
user    0m37.415s
sys     0m0.103s
4294967295 iterations of fls_table... checksum = 4294967264

real    0m56.068s
user    0m55.991s
sys     0m0.085s
4294967295 iterations of fls_tree... checksum = 4294967265

real    0m36.636s
user    0m36.516s
sys     0m0.125s

=== alpha EV6 / 466 Mhz, gcc-2.95.3 -O3 ====
4294967295 iterations of new... checksum = 4294967265

real    2m19.951s
user    2m19.543s
sys     0m0.018s
4294967295 iterations of fls_table... checksum = 4294967265

real    1m12.825s
user    1m12.667s
sys     0m0.005s
4294967295 iterations of fls_tree... checksum = 4294967265

real    1m33.469s
user    1m33.242s
sys     0m0.013s


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
@ 2003-05-01 13:31 linux
  0 siblings, 0 replies; 65+ messages in thread
From: linux @ 2003-05-01 13:31 UTC (permalink / raw)
  To: linux-kernel

The Fount of Holy Penguin Pee expressed himself thus:
> Yeah, except if you want best code generation you should probably use
>
>	static inline int fls(int x)
>	{
>		int bit;
>		/* Only well-defined for non-zero */
>		asm("bsrl %1,%0":"=r" (bit):"rm" (x));
>		return x ? bit : 32;
>	}
>
> which should generate (assuming you give "-march=xxxx" to tell gcc to use 
> cmov):
>
>		movl    $32, %eax
>		bsrl %edx,%ecx
>		testl   %edx, %edx
>		cmovne  %ecx, %eax
>
> which looks pretty optimal.

Except that the testl is unnecessary.  The full semantics of
bsrl are:

if (source != 0) {
	destination = <index of most significant set bit in source>;
	ZF = 0;
} else {
	destination = UNDEFINED;
	ZF = 1;
}

Thus, there are two reasonable code sequences:

asm("	bsrl	%2, %1"
"\n	cmovne  %1, %0" : "=r" (bit), "=r" (temp) : "rm" (x), "0" (32));

and (if I've got the earlyclobber syntax right):

asm("	bsrl	%1, %0"
"\n	cmoveq  %2, %0" : "=&r,&r" (bit) : "0,rm" (x), "rm,rm" (32));

Note that in this latter case, I have to list %1 == %0 as an explicit
alternative, because otherwise the & on operand 0 would tell GCC to forbid
that combination and it's only %0 == %2 that's forbidden.

(The first alternative is listed first because it uses fewer registers
and so is preferred, all other things being equal.)

Unfortunately, I'm not sure how to let GCC choose between these two
alternatives...

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01  4:40     ` Linus Torvalds
@ 2003-05-01 12:00       ` David S. Miller
  2003-05-02  5:14         ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: David S. Miller @ 2003-05-01 12:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Wed, 2003-04-30 at 21:40, Linus Torvalds wrote:
> And inline asms schedule as well as any built-in, unless they are marked
> volatile (either explicitly or implicitly by not having any outputs).

This actually is false.  GCC does not know what resources the
instruction uses, therefore it cannot perform accurate instruction
scheduling.

Richard and I discussed at some time providing a way for inline
asms to give the instruction attributes.

An easier way is to provide a per-platform way to get at the
"weird" instructions a cpu has.  This is precisely what GCC currently
provides in the form of __builtin_${CPU}_${WEIRD_INSN}(args...) type
interfaces.  These give you what you want (complete control) yet
also let GCC schedule the thing accurately.

I know you think GCC is a pile of dogshit, in many ways I do too, but I
think it's just as important to point out the good parts where they
exist.

-- 
David S. Miller <davem@redhat.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 11:17       ` hugang
@ 2003-05-01 11:45         ` Willy TARREAU
  0 siblings, 0 replies; 65+ messages in thread
From: Willy TARREAU @ 2003-05-01 11:45 UTC (permalink / raw)
  To: hugang; +Cc: Willy TARREAU, linux-kernel

On Thu, May 01, 2003 at 07:17:30PM +0800, hugang wrote:
 
> Do see the mail by Andrew Morton? -----

No, sorry, I didn't see it, but I've read it now. I agree with him; that's
what I noticed here too. I also tried a binary search through the table in a
tree-like fashion, but the CPU doesn't like going back and forth through the
table at all ! I'm retrying with an "elastic mask".

Cheers
Willy


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01 10:22     ` Willy TARREAU
@ 2003-05-01 11:17       ` hugang
  2003-05-01 11:45         ` Willy TARREAU
  0 siblings, 1 reply; 65+ messages in thread
From: hugang @ 2003-05-01 11:17 UTC (permalink / raw)
  To: Willy TARREAU; +Cc: linux-kernel

On Thu, 1 May 2003 12:22:30 +0200
Willy TARREAU <willy@w.ods.org> wrote:

> On Thu, May 01, 2003 at 01:16:05PM +0800, hugang wrote:
> 
> > Here is a table version of fls. Yes, it is faster than the others.
> 
> Sorry, but this code returns wrong results. Test it with fewer
> iterations; it doesn't return the same result.
> 
> >         unsigned i = 0;
> > 
> >         do {
> >                 i++;
> >         } while (n <= fls_table[i].max && n > fls_table[i].min);
> 
> You never even compare with table[0] !
> 
> > --test log is here----
> 
> I recoded a table based on your idea, and it's clearly faster than the others, and
> even faster than yours :
> 
> ============
> static unsigned tbl[33] = { 2147483648, 1073741824, 536870912, 268435456,
> 			    134217728, 67108864, 33554432, 16777216,
> 			    8388608, 4194304, 2097152, 1048576,
> 			    524288, 262144, 131072, 65536,
> 			    32768, 16384, 8192, 4096,
> 			    2048, 1024, 512, 256,
> 			    128, 64, 32, 16,
> 			    8, 4, 2, 1, 0 };
> 
> 
> static inline int fls_tbl_fls(unsigned n)
> {
>         unsigned i = 0;
> 
> 	while (n < tbl[i])
> 	    i++;
> 
>         return 32 - i;
> }
> ===========

Do see the mail by Andrew Morton? -----

From: Andrew Morton <akpm@digeo.com>
To: hugang <hugang@soulinfo.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [RFC][PATCH] Faster generic_fls
Date: Wed, 30 Apr 2003 22:11:29 -0700
Sender: linux-kernel-owner@vger.kernel.org
X-Mailer: Sylpheed version 0.8.11 (GTK+ 1.2.10; i586-pc-linux-gnu)

 hugang <hugang@soulinfo.com> wrote:
 >
 > Here is a table version of fls. Yes, it is faster than the others.

 nooo..  That has a big cache footprint.  At the very least you should use
 a binary search.  gcc will do it for you:

 	switch (n) {
 	case 0 ... 1:
 		return 1;
 	case 2 ... 3:
 		return 2;
 	case 4 ... 7:
 		return 3;
 	case 8 ... 15:
 		return 4;

 etc.
----------
I agree with him. The latest code I posted works fine; please check it.

-- 
Hu Gang / Steve
Email        : huagng@soulinfo.com, steve@soulinfo.com
GPG FinePrint: 4099 3F1D AE01 1817 68F7  D499 A6C2 C418 86C8 610E
ICQ#         : 205800361
Registered Linux User : 204016

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01  5:16   ` hugang
@ 2003-05-01 10:22     ` Willy TARREAU
  2003-05-01 11:17       ` hugang
  0 siblings, 1 reply; 65+ messages in thread
From: Willy TARREAU @ 2003-05-01 10:22 UTC (permalink / raw)
  To: hugang; +Cc: linux-kernel

On Thu, May 01, 2003 at 01:16:05PM +0800, hugang wrote:

> Here is a table version of fls. Yes, it is faster than the others.

Sorry, but this code returns wrong results. Test it with fewer
iterations; it doesn't return the same result.

>         unsigned i = 0;
> 
>         do {
>                 i++;
>         } while (n <= fls_table[i].max && n > fls_table[i].min);

You never even compare with table[0] !

> --test log is here----

I recoded a table based on your idea, and it's clearly faster than the others, and
even faster than yours :

============
static unsigned tbl[33] = { 2147483648, 1073741824, 536870912, 268435456,
			    134217728, 67108864, 33554432, 16777216,
			    8388608, 4194304, 2097152, 1048576,
			    524288, 262144, 131072, 65536,
			    32768, 16384, 8192, 4096,
			    2048, 1024, 512, 256,
			    128, 64, 32, 16,
			    8, 4, 2, 1, 0 };


static inline int fls_tbl_fls(unsigned n)
{
        unsigned i = 0;

	while (n < tbl[i])
	    i++;

        return 32 - i;
}
===========
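(Editorial sketch, not part of the original mail.) Willy's descending-threshold table can be generated in a loop instead of written out; a hedged reconstruction for checking correctness, with names of my choosing:

```c
#include <assert.h>

/* Willy's table rebuilt programmatically:
 * tbl[i] = 2^(31 - i) for i in 0..31, tbl[32] = 0 */
static unsigned tbl[33];

static void build_tbl(void)
{
	unsigned i;

	for (i = 0; i < 32; i++)
		tbl[i] = 1u << (31 - i);
	tbl[32] = 0;
}

/* scan down the thresholds until n reaches one; i counts
 * the leading zeros, so fls(n) = 32 - i (and fls(0) = 0) */
static int fls_tbl(unsigned n)
{
	unsigned i = 0;

	while (n < tbl[i])
		i++;
	return 32 - (int)i;
}
```

For uniformly distributed inputs the scan exits after about two iterations on average, which is why it benchmarks so well despite being a linear search.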

it returns the right result. Compiled with gcc 2.95.3 -march=i686 -O3, athlon
1.533 GHz (1800+) :

4294967295 iterations of new... checksum = 4294967265

real    1m6.182s
user    1m6.060s
sys     0m0.133s
4294967295 iterations of new2... checksum = 4294967265

real    0m36.504s
user    0m36.294s
sys     0m0.202s
4294967295 iterations of fls_table... checksum = 4294967295

real    0m21.962s
user    0m21.833s
sys     0m0.124s
4294967295 iterations of fls_tbl... checksum = 4294967265

real    0m19.268s
user    0m19.102s
sys     0m0.168s

The same compiled with gcc-3.2.3 :

4294967295 iterations of fls_table... checksum = 4294967295

real    0m14.211s
user    0m14.210s
sys     0m0.000s

Cheers,
Willy


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01  1:02         ` Daniel Phillips
@ 2003-05-01  9:19           ` Willy Tarreau
  2003-05-01 16:02             ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Willy Tarreau @ 2003-05-01  9:19 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Willy Tarreau, Linus Torvalds, Falk Hueffner, linux-kernel

On Thu, May 01, 2003 at 03:02:14AM +0200, Daniel Phillips wrote:
> On Wednesday 30 April 2003 22:59, Willy Tarreau wrote:
> > I must acknowledge that your simple code was not easy to beat ! You can try
> > this one on your PIII, I could only test it on an athlon mobile and a P4.
> > With gcc 2.95.3, it gives me a boost of about 25%, because it seems as gcc
> > cannot optimize shifts efficiently. On 3.2.3, however, it's between 0 and
> > 5% depending on optimization/CPU.
> 
> Was something ifdef'd incorrectly?  Otherwise, there is something the PIII 
> hates about that code.  I got 107 seconds on the PIII, vs 53 seconds for my 
> posted code at O3, and virtually no difference at O2.  (gcc 3.2.3)

No, it was correct. It simply means that some CPUs prefer bit shifting while
others prefer comparisons and sometimes jumps; that mostly depends on the
ALU and the branch prediction unit, it seems. It shows that if we include
such code in the kernel, it has to be arch-specific, so we may end up coding
it in assembler, keeping your code as the default when no asm has been written
for a given arch. BTW, has someone benchmarked BSF/BSR on x86 ? It should be
faster, but it's also possible that a poor microcode implements it with a one
bit/cycle algo, which would result in one instruction not being as fast as your
code.

Cheers,
Willy


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01  5:33       ` hugang
@ 2003-05-01  7:05         ` hugang
  2003-05-01 13:52           ` Willy TARREAU
  0 siblings, 1 reply; 65+ messages in thread
From: hugang @ 2003-05-01  7:05 UTC (permalink / raw)
  To: hugang; +Cc: akpm, linux-kernel

Hello

Here is the table version of fls. The previous version of it was wrong.
-------
static inline int fls_table_fls(unsigned n)
{
	switch (n) {
	case 0: return 0;
	case 1 ... 1: return 1;
	case 2 ... 3: return 2;
	case 4 ... 7: return 3;
	case 8 ... 15: return 4;
	case 16 ... 31: return 5;
	case 32 ... 63: return 6;
	case 64 ... 127: return 7;
	case 128 ... 255: return 8;
	case 256 ... 511: return 9;
	case 512 ... 1023: return 10;
	case 1024 ... 2047: return 11;
	case 2048 ... 4095: return 12;
	case 4096 ... 8191: return 13;
	case 8192 ... 16383: return 14;
	case 16384 ... 32767: return 15;
	case 32768 ... 65535: return 16;
	case 65536 ... 131071: return 17;
	case 131072 ... 262143: return 18;
	case 262144 ... 524287: return 19;
	case 524288 ... 1048575: return 20;
	case 1048576 ... 2097151: return 21;
	case 2097152 ... 4194303: return 22;
	case 4194304 ... 8388607: return 23;
	case 8388608 ... 16777215: return 24;
	case 16777216 ... 33554431: return 25;
	case 33554432 ... 67108863: return 26;
	case 67108864 ... 134217727: return 27;
	case 134217728 ... 268435455: return 28;
	case 268435456 ... 536870911: return 29;
	case 536870912 ... 1073741823: return 30;
	case 1073741824 ... 2147483647: return 31;
	default:return 32;
	}
}
--here is the test log-----
model name	: Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)
==== -march=i686 -O3 ====
4294967295 iterations of nil... checksum = 4294967295

real	0m15.548s
user	0m15.500s
sys	0m0.010s
4294967295 iterations of old... checksum = 4294967265

real	0m34.778s
user	0m34.690s
sys	0m0.020s
4294967295 iterations of new... checksum = 4294967265

real	0m56.330s
user	0m56.190s
sys	0m0.000s
4294967295 iterations of new2... checksum = 4294967265

real	0m40.755s
user	0m40.680s
sys	0m0.000s
4294967295 iterations of table... checksum = 4294967265

real	0m49.928s
user	0m49.770s
sys	0m0.050s
------------------
==== -march=i686 -O2 ====
4294967295 iterations of nil... checksum = 4294967295

real	0m28.350s
user	0m28.280s
sys	0m0.020s
4294967295 iterations of old... checksum = 4294967265

real	0m48.199s
user	0m48.100s
sys	0m0.000s
4294967295 iterations of new... checksum = 4294967265

real	0m50.986s
user	0m50.890s
sys	0m0.000s
4294967295 iterations of new2... checksum = 4294967265

real	1m5.356s
user	1m5.210s
sys	0m0.010s
4294967295 iterations of table... checksum = 4294967265

real	0m45.238s
user	0m45.150s
sys	0m0.000s
------------------
==== -O2 ====
4294967295 iterations of nil... checksum = 4294967295

real	0m33.503s
user	0m33.410s
sys	0m0.020s
4294967295 iterations of old... checksum = 4294967265

real	0m50.571s
user	0m50.450s
sys	0m0.020s
4294967295 iterations of new... checksum = 4294967265

real	0m53.324s
user	0m53.200s
sys	0m0.010s
4294967295 iterations of new2... checksum = 4294967265

real	0m56.360s
user	0m56.250s
sys	0m0.000s
4294967295 iterations of table... checksum = 4294967265

real	0m46.524s
user	0m46.440s
sys	0m0.000s
------------------
==== -O3 ====
4294967295 iterations of nil... checksum = 4294967295

real	0m5.421s
user	0m5.410s
sys	0m0.000s
4294967295 iterations of old... checksum = 4294967265

real	0m29.744s
user	0m29.670s
sys	0m0.010s
4294967295 iterations of new... checksum = 4294967265

real	0m53.160s
user	0m53.050s
sys	0m0.000s
4294967295 iterations of new2... checksum = 4294967265

real	0m31.464s
user	0m31.400s
sys	0m0.010s
4294967295 iterations of table... checksum = 4294967265

real	0m46.546s
user	0m46.440s
sys	0m0.000s
------------------

-- 
Hu Gang / Steve
Email        : huagng@soulinfo.com, steve@soulinfo.com
GPG FinePrint: 4099 3F1D AE01 1817 68F7  D499 A6C2 C418 86C8 610E
ICQ#         : 205800361
Registered Linux User : 204016

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01  5:11     ` Andrew Morton
@ 2003-05-01  5:33       ` hugang
  2003-05-01  7:05         ` hugang
  0 siblings, 1 reply; 65+ messages in thread
From: hugang @ 2003-05-01  5:33 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel

On Wed, 30 Apr 2003 22:11:29 -0700
Andrew Morton <akpm@digeo.com> wrote:

>  nooo..  That has a big cache footprint.  At the very least you should use
>  a binary search.  gcc will do it for you:
> 
>  	switch (n) {
>  	case 0 ... 1:
>  		return 1;
>  	case 2 ... 3:
>  		return 2;
>  	case 4 ... 7:
>  		return 3;
>  	case 8 ... 15:
>  		return 4;
> 
>  etc.
Here it is.
--------------------
static inline int fls_table_fls(unsigned n)
{
    switch (n) {
    case 0:                          return 1;
    case 1:                          return 2;
    case 2 ... 3:                    return 3;
    case 4 ... 7:                    return 4;
    case 8 ... 15:                   return 5;
    case 16 ... 31:                  return 6;
    case 32 ... 63:                  return 7;
    case 64 ... 127:                 return 8;
    case 128 ... 255:                return 9;
    case 256 ... 511:                return 10;
    case 512 ... 1023:               return 11;
    case 1024 ... 2047:              return 12;
    case 2048 ... 4095:              return 13;
    case 4096 ... 8191:              return 14;
    case 8192 ... 16383:             return 15;
    case 16384 ... 32767:            return 16;
    case 32768 ... 65535:            return 17;
    case 65536 ... 131071:           return 18;
    case 131072 ... 262143:          return 19;
    case 262144 ... 524287:          return 20;
    case 524288 ... 1048575:         return 21;
    case 1048576 ... 2097151:        return 22;
    case 2097152 ... 4194303:        return 23;
    case 4194304 ... 8388607:        return 24;
    case 8388608 ... 16777215:       return 25;
    case 16777216 ... 33554431:      return 26;
    case 33554432 ... 67108863:      return 27;
    case 67108864 ... 134217727:     return 28;
    case 134217728 ... 268435455:    return 29;
    case 268435456 ... 536870911:    return 30;
    case 536870912 ... 1073741823:   return 31;
    default:
        break;
    }
    return 32;
}

It is now being tested; I will put the log and files at
http://soulinfo.com/~hugang/kernel/

-- 
Hu Gang / Steve
Email        : huagng@soulinfo.com, steve@soulinfo.com
GPG FinePrint: 4099 3F1D AE01 1817 68F7  D499 A6C2 C418 86C8 610E
ICQ#         : 205800361
Registered Linux User : 204016

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 20:55 ` Andrew Morton
  2003-05-01  5:03   ` hugang
@ 2003-05-01  5:16   ` hugang
  2003-05-01 10:22     ` Willy TARREAU
  1 sibling, 1 reply; 65+ messages in thread
From: hugang @ 2003-05-01  5:16 UTC (permalink / raw)
  To: linux-kernel

On Wed, 30 Apr 2003 13:55:12 -0700
Andrew Morton <akpm@digeo.com> wrote:

> Daniel Phillips <dphillips@sistina.com> wrote:
> >
> > +static inline unsigned fls8(unsigned n)
> > +{
> > +	return n & 0xf0?
> > +	    n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
> > +	    n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
> > +}
> 
> 	return fls_table[n];
> 
> 
> That'll be faster in benchmarks, possibly slower in real world.  As usual.
> 
Here is a table version of fls. Yes, it is faster than the others.
typedef struct fls_table_t {
        unsigned min;
        unsigned max;
        unsigned value;
} fls_table_t;

static fls_table_t fls_table[] = {
        {0, 1, 1},
        {1, 2, 2},
        {2, 4, 3},
        {4, 8, 4},
        {8, 16, 5},
        {16, 32, 6},
        {32, 64, 7},
        {64, 128, 8},
        {128, 256, 9},
        {256, 512, 10},
        {512, 1024, 11},
        {1024, 2048, 12},
        {2048, 4096, 13},
        {4096, 8192, 14},
        {8192, 16384, 15},
        {16384, 32768, 16},
        {32768, 65536, 17},
        {65536, 131072, 18},
        {131072, 262144, 19},
        {262144, 524288, 20},
        {524288, 1048576, 21},
        {1048576, 2097152, 22},
        {2097152, 4194304, 23},
        {4194304, 8388608, 24},
        {8388608, 16777216, 25},
        {16777216, 33554432, 26},
        {33554432, 67108864, 27},
        {67108864, 134217728, 28},
        {134217728, 268435456, 29},
        {268435456, 536870912, 30},
        {536870912, 1073741824, 31},
        {1073741824, 2147483648, 32},

        {0, -1, 32},
};

static inline int fls_table_fls(unsigned n)
{
        unsigned i = 0;

        do {
                i++;
        } while (n <= fls_table[i].max && n > fls_table[i].min);

        return fls_table[i].value;
}
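(Editorial sketch, not part of the original mail.) As Willy points out in his reply, the do/while above increments before checking, so entry 0 is never compared, and the loop continues while an entry *matches* rather than until one does. A corrected form of the same min/max idea, with the table built in a loop, the kernel's fls(0) == 0 convention, and names of my choosing:

```c
#include <assert.h>

struct range { unsigned min, max, value; };

static struct range tbl[32];

/* entry k-1 covers the inclusive range [2^(k-1), 2^k - 1], i.e. fls == k */
static void build_tbl(void)
{
	unsigned k;

	for (k = 1; k <= 32; k++) {
		tbl[k - 1].min   = 1u << (k - 1);
		tbl[k - 1].max   = (k == 32) ? 0xffffffffu : (1u << k) - 1;
		tbl[k - 1].value = k;
	}
}

static int fls_minmax(unsigned n)
{
	unsigned i = 0;

	if (!n)
		return 0;                 /* fls(0) == 0, kernel convention */
	while (!(n >= tbl[i].min && n <= tbl[i].max))
		i++;                      /* entry 0 is checked first */
	return (int)tbl[i].value;
}
```

The linear scan still favours small results, which is the cache-footprint and distribution issue Andrew and Thomas raise elsewhere in the thread.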

--test log is here----
model name	: Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)
==== -march=i686 -O3 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real	0m22.311s
user	0m21.850s
sys	0m0.010s
4294967295 iterations of old... checksum = 4294967265

real	0m34.673s
user	0m33.960s
sys	0m0.000s
4294967295 iterations of new... checksum = 4294967265

real	0m57.702s
user	0m56.530s
sys	0m0.040s
4294967295 iterations of new2... checksum = 4294967265

real	0m41.709s
user	0m40.810s
sys	0m0.000s
4294967295 iterations of table... checksum = 4294967295

real	0m22.274s
user	0m21.640s
sys	0m0.010s
------------------
==== -march=i686 -O2 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real	0m29.501s
user	0m28.880s
sys	0m0.040s
4294967295 iterations of old... checksum = 4294967265

real	0m49.054s
user	0m47.840s
sys	0m0.040s
4294967295 iterations of new... checksum = 4294967265

real	0m52.331s
user	0m51.270s
sys	0m0.010s
4294967295 iterations of new2... checksum = 4294967265

real	1m7.043s
user	1m5.660s
sys	0m0.010s
4294967295 iterations of table... checksum = 4294967295

real	0m45.783s
user	0m44.860s
sys	0m0.010s
------------------
==== -O2 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real	0m34.395s
user	0m33.670s
sys	0m0.000s
4294967295 iterations of old... checksum = 4294967265

real	0m52.277s
user	0m51.230s
sys	0m0.000s
4294967295 iterations of new... checksum = 4294967265

real	0m54.750s
user	0m53.640s
sys	0m0.010s
4294967295 iterations of new2... checksum = 4294967265

real	0m57.728s
user	0m56.590s
sys	0m0.000s
4294967295 iterations of table... checksum = 4294967295

real	0m44.540s
user	0m43.600s
sys	0m0.020s
------------------
==== -O3 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real	0m16.196s
user	0m15.880s
sys	0m0.000s
4294967295 iterations of old... checksum = 4294967265

real	0m34.342s
user	0m33.630s
sys	0m0.010s
4294967295 iterations of new... checksum = 4294967265

real	0m58.554s
user	0m57.380s
sys	0m0.000s
4294967295 iterations of new2... checksum = 4294967265

real	0m37.140s
user	0m36.320s
sys	0m0.040s
4294967295 iterations of table... checksum = 4294967295

real	0m21.723s
user	0m21.270s
sys	0m0.000s
------------------

all of the files can download from:
http://soulinfo.com/~hugang/kernel/
-- 
Hu Gang / Steve
Email        : huagng@soulinfo.com, steve@soulinfo.com
GPG FinePrint: 4099 3F1D AE01 1817 68F7  D499 A6C2 C418 86C8 610E
ICQ#         : 205800361
Registered Linux User : 204016

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01  5:03   ` hugang
@ 2003-05-01  5:11     ` Andrew Morton
  2003-05-01  5:33       ` hugang
  0 siblings, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2003-05-01  5:11 UTC (permalink / raw)
  To: hugang; +Cc: linux-kernel

hugang <hugang@soulinfo.com> wrote:
>
> Here is a table version of fls. Yes, it is faster than the others.

nooo..  That has a big cache footprint.  At the very least you should use
a binary search.  gcc will do it for you:

	switch (n) {
	case 0 ... 1:
		return 1;
	case 2 ... 3:
		return 2;
	case 4 ... 7:
		return 3;
	case 8 ... 15:
		return 4;

etc.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 20:55 ` Andrew Morton
@ 2003-05-01  5:03   ` hugang
  2003-05-01  5:11     ` Andrew Morton
  2003-05-01  5:16   ` hugang
  1 sibling, 1 reply; 65+ messages in thread
From: hugang @ 2003-05-01  5:03 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel

On Wed, 30 Apr 2003 13:55:12 -0700
Andrew Morton <akpm@digeo.com> wrote:

> Daniel Phillips <dphillips@sistina.com> wrote:
> >
> > +static inline unsigned fls8(unsigned n)
> > +{
> > +	return n & 0xf0?
> > +	    n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
> > +	    n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
> > +}
> 
> 	return fls_table[n];
> 
> 
> That'll be faster in benchmarks, possibly slower in real world.  As usual.
> 
Here is a table version of fls. Yes, it is faster than the others.
typedef struct fls_table_t {
        unsigned min;
        unsigned max;
        unsigned value;
} fls_table_t;

static fls_table_t fls_table[] = {
        {0, 1, 1},
        {1, 2, 2},
        {2, 4, 3},
        {4, 8, 4},
        {8, 16, 5},
        {16, 32, 6},
        {32, 64, 7},
        {64, 128, 8},
        {128, 256, 9},
        {256, 512, 10},
        {512, 1024, 11},
        {1024, 2048, 12},
        {2048, 4096, 13},
        {4096, 8192, 14},
        {8192, 16384, 15},
        {16384, 32768, 16},
        {32768, 65536, 17},
        {65536, 131072, 18},
        {131072, 262144, 19},
        {262144, 524288, 20},
        {524288, 1048576, 21},
        {1048576, 2097152, 22},
        {2097152, 4194304, 23},
        {4194304, 8388608, 24},
        {8388608, 16777216, 25},
        {16777216, 33554432, 26},
        {33554432, 67108864, 27},
        {67108864, 134217728, 28},
        {134217728, 268435456, 29},
        {268435456, 536870912, 30},
        {536870912, 1073741824, 31},
        {1073741824, 2147483648, 32},

        {0, -1, 32},
};

static inline int fls_table_fls(unsigned n)
{
        unsigned i = 0;

        /* advance to the first entry whose max bound covers n */
        while (n > fls_table[i].max)
                i++;

        return fls_table[i].value;
}

--test log is here----
model name	: Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)
==== -march=i686 -O3 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real	0m22.311s
user	0m21.850s
sys	0m0.010s
4294967295 iterations of old... checksum = 4294967265

real	0m34.673s
user	0m33.960s
sys	0m0.000s
4294967295 iterations of new... checksum = 4294967265

real	0m57.702s
user	0m56.530s
sys	0m0.040s
4294967295 iterations of new2... checksum = 4294967265

real	0m41.709s
user	0m40.810s
sys	0m0.000s
4294967295 iterations of table... checksum = 4294967295

real	0m22.274s
user	0m21.640s
sys	0m0.010s
------------------
==== -march=i686 -O2 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real	0m29.501s
user	0m28.880s
sys	0m0.040s
4294967295 iterations of old... checksum = 4294967265

real	0m49.054s
user	0m47.840s
sys	0m0.040s
4294967295 iterations of new... checksum = 4294967265

real	0m52.331s
user	0m51.270s
sys	0m0.010s
4294967295 iterations of new2... checksum = 4294967265

real	1m7.043s
user	1m5.660s
sys	0m0.010s
4294967295 iterations of table... checksum = 4294967295

real	0m45.783s
user	0m44.860s
sys	0m0.010s
------------------
==== -O2 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real	0m34.395s
user	0m33.670s
sys	0m0.000s
4294967295 iterations of old... checksum = 4294967265

real	0m52.277s
user	0m51.230s
sys	0m0.000s
4294967295 iterations of new... checksum = 4294967265

real	0m54.750s
user	0m53.640s
sys	0m0.010s
4294967295 iterations of new2... checksum = 4294967265

real	0m57.728s
user	0m56.590s
sys	0m0.000s
4294967295 iterations of table... checksum = 4294967295

real	0m44.540s
user	0m43.600s
sys	0m0.020s
------------------
==== -O3 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295

real	0m16.196s
user	0m15.880s
sys	0m0.000s
4294967295 iterations of old... checksum = 4294967265

real	0m34.342s
user	0m33.630s
sys	0m0.010s
4294967295 iterations of new... checksum = 4294967265

real	0m58.554s
user	0m57.380s
sys	0m0.000s
4294967295 iterations of new2... checksum = 4294967265

real	0m37.140s
user	0m36.320s
sys	0m0.040s
4294967295 iterations of table... checksum = 4294967295

real	0m21.723s
user	0m21.270s
sys	0m0.000s
------------------

-- 
Hu Gang / Steve
Email        : huagng@soulinfo.com, steve@soulinfo.com
GPG FinePrint: 4099 3F1D AE01 1817 68F7  D499 A6C2 C418 86C8 610E
ICQ#         : 205800361
Registered Linux User : 204016

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01  1:46   ` Andi Kleen
@ 2003-05-01  4:40     ` Linus Torvalds
  2003-05-01 12:00       ` David S. Miller
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2003-05-01  4:40 UTC (permalink / raw)
  To: linux-kernel

In article <p73ade730d1.fsf@oldwotan.suse.de>, Andi Kleen  <ak@suse.de> wrote:
>Linus Torvalds <torvalds@transmeta.com> writes:
>
>> Yeah, except if you want best code generation you should probably use
>> 
>> 	static inline int fls(int x)
>> 	{
>> 		int bit;
>> 		/* Only well-defined for non-zero */
>> 		asm("bsrl %1,%0":"=r" (bit):"rm" (x));
>
>"g" should work for the second operand too and it'll give gcc
>slightly more choices with possibly better code.

"g" allows immediates, which is _not_ correct.

>But the __builtin is probably preferable if gcc supports it because
>a builtin can be scheduled, inline assembly can't.

You're over-estimating gcc builtins, and underestimating inline
assembly. 

gcc builtins usually generate _worse_ code than well-written inline
assembly, since builtins try to be generic (at least the ones that are
cross-architecture).

And inline asms schedule as well as any built-in, unless they are marked
volatile (either explicitly or implicitly by not having any outputs).

And the proof is in the pudding: I'll bet you a dollar my code generates
better code. AND my code works on the intel compiler _and_ with older
gcc's.

The fact is, gcc builtins are almost never worth it. 

		Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
       [not found] ` <Pine.LNX.4.44.0304301801210.20283-100000@home.transmeta.com.suse.lists.linux.kernel>
@ 2003-05-01  1:46   ` Andi Kleen
  2003-05-01  4:40     ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2003-05-01  1:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> Yeah, except if you want best code generation you should probably use
> 
> 	static inline int fls(int x)
> 	{
> 		int bit;
> 		/* Only well-defined for non-zero */
> 		asm("bsrl %1,%0":"=r" (bit):"rm" (x));

"g" should work for the second operand too and it'll give gcc
slightly more choices with possibly better code.

But the __builtin is probably preferable if gcc supports it because
a builtin can be scheduled, inline assembly can't.

-Andi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-05-01  0:12                 ` Falk Hueffner
@ 2003-05-01  1:07                   ` Linus Torvalds
  0 siblings, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2003-05-01  1:07 UTC (permalink / raw)
  To: Falk Hueffner; +Cc: Alan Cox, dphillips, Linux Kernel Mailing List


On 1 May 2003, Falk Hueffner wrote:
> 
> There appears to be hardware support for fls, too. This is what gcc
> generates for

Yeah, except if you want best code generation you should probably use

	static inline int fls(int x)
	{
		int bit;
		/* Only well-defined for non-zero */
		asm("bsrl %1,%0":"=r" (bit):"rm" (x));
		return x ? bit : 32;
	}

which should generate (assuming you give "-march=xxxx" to tell gcc to use 
cmov):

		movl    $32, %eax
		bsrl %edx,%ecx
		testl   %edx, %edx
		cmovne  %ecx, %eax

which looks pretty optimal.

		Linus


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 20:59       ` Willy Tarreau
@ 2003-05-01  1:02         ` Daniel Phillips
  2003-05-01  9:19           ` Willy Tarreau
  0 siblings, 1 reply; 65+ messages in thread
From: Daniel Phillips @ 2003-05-01  1:02 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Linus Torvalds, Falk Hueffner, linux-kernel

On Wednesday 30 April 2003 22:59, Willy Tarreau wrote:
> On Wed, Apr 30, 2003 at 09:15:33PM +0200, Daniel Phillips wrote:
> > In the dawn of time, before God gave us Cache, my version would have been
> > the fastest, because it executes the fewest instructions.  In the misty
> > future, as cache continues to scale and processors sprout more parallel
> > execution units, it will be clearly better once again.
>
> Daniel,
>
> I must acknowledge that your simple code was not easy to beat! You can try
> this one on your PIII; I could only test it on an Athlon mobile and a P4.
> With gcc 2.95.3, it gives me a boost of about 25%, because it seems gcc
> cannot optimize shifts efficiently. On 3.2.3, however, it's between 0 and
> 5% depending on optimization/CPU.

Was something ifdef'd incorrectly?  Otherwise, there is something the PIII 
hates about that code.  I got 107 seconds on the PIII, vs 53 seconds for my 
posted code at O3, and virtually no difference at O2.  (gcc 3.2.3)

Regards,

Daniel


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 23:41               ` Linus Torvalds
  2003-04-30 22:47                 ` Alan Cox
@ 2003-05-01  0:12                 ` Falk Hueffner
  2003-05-01  1:07                   ` Linus Torvalds
  1 sibling, 1 reply; 65+ messages in thread
From: Falk Hueffner @ 2003-05-01  0:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan Cox, dphillips, Linux Kernel Mailing List

Linus Torvalds <torvalds@transmeta.com> writes:

> On 30 Apr 2003, Alan Cox wrote:
> > 
> > It ought to be basically the same as ffs because if I remember rightly 
> > 
> > ffs(x^(x-1)) == fls(x)
> 
> So did anybody time this? ffs() we have hardware support for on x86,
> and it's even reasonably efficient on some CPU's ..

There appears to be hardware support for fls, too. This is what gcc
generates for

int fls(int x) {
    return x ? 32 - __builtin_clz(x) : 0;
}

fls:
        pushl   %ebp
        xorl    %edx, %edx
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        testl   %eax, %eax
        je      .L3
        bsrl    %eax, %ecx
        movl    $32, %edx
        xorl    $31, %ecx
        subl    %ecx, %edx
.L3:
        popl    %ebp
        movl    %edx, %eax
        ret

-- 
	Falk

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 20:25             ` Alan Cox
  2003-04-30 21:59               ` Falk Hueffner
@ 2003-04-30 23:41               ` Linus Torvalds
  2003-04-30 22:47                 ` Alan Cox
  2003-05-01  0:12                 ` Falk Hueffner
  1 sibling, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2003-04-30 23:41 UTC (permalink / raw)
  To: Alan Cox; +Cc: Falk Hueffner, dphillips, Linux Kernel Mailing List


On 30 Apr 2003, Alan Cox wrote:
> 
> It ought to be basically the same as ffs because if I remember rightly 
> 
> ffs(x^(x-1)) == fls(x)

So did anybody time this? ffs() we have hardware support for on x86, and 
it's even reasonably efficient on some CPU's .. So this _should_ beat all 
new-comers, and obviously some people already have benchmark programs 
ready and willing..

		Linus


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 23:41               ` Linus Torvalds
@ 2003-04-30 22:47                 ` Alan Cox
  2003-05-01  0:12                 ` Falk Hueffner
  1 sibling, 0 replies; 65+ messages in thread
From: Alan Cox @ 2003-04-30 22:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Falk Hueffner, dphillips, Linux Kernel Mailing List

On Iau, 2003-05-01 at 00:41, Linus Torvalds wrote:
> On 30 Apr 2003, Alan Cox wrote:
> > 
> > It ought to be basically the same as ffs because if I remember rightly 
> > 
> > ffs(x^(x-1)) == fls(x)
> 
> So did anybody time this? ffs() we have hardware support for on x86, and 
> it's even reasonably efficient on some CPU's .. So this _should_ beat all 
> new-comers, and obviously some people already have benchmark programs 
> ready and willing..

Unfortunately, on digging up my notes it seems I got ffs/fls backwards, so it
doesn't help us a great deal.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 21:59               ` Falk Hueffner
@ 2003-04-30 22:22                 ` Alan Cox
  0 siblings, 0 replies; 65+ messages in thread
From: Alan Cox @ 2003-04-30 22:22 UTC (permalink / raw)
  To: Falk Hueffner; +Cc: Linus Torvalds, dphillips, Linux Kernel Mailing List

On Mer, 2003-04-30 at 22:59, Falk Hueffner wrote:
> > ffs(x^(x-1)) == fls(x)
> 
> I don't think so. Maybe you are thinking of ffs(x) == fls(x & -x). So
> you can calculate ffs with fls, but not easily the other way round.

Well I asked my CPU to double check. If I got the code right it thinks that
ffs(i^(i-1))==fls(i) is true for 1->2^32-1

If you think about it, it seems to make sense:

	x-1 turns the lowest 100..0 sequence into 01......1

	The xor removes unchanged higher bits

	so 
		10100
         -1     10011
	XOR	00111



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 20:25             ` Alan Cox
@ 2003-04-30 21:59               ` Falk Hueffner
  2003-04-30 22:22                 ` Alan Cox
  2003-04-30 23:41               ` Linus Torvalds
  1 sibling, 1 reply; 65+ messages in thread
From: Falk Hueffner @ 2003-04-30 21:59 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linus Torvalds, dphillips, Linux Kernel Mailing List

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> On Mer, 2003-04-30 at 17:16, Linus Torvalds wrote:
> > Clearly you're not going to make _one_ load to get fls, since having a 
> > 4GB lookup array for a 32-bit fls would be "somewhat" wasteful.
> 
> It ought to be basically the same as ffs because if I remember rightly 
> 
> ffs(x^(x-1)) == fls(x)

I don't think so. Maybe you are thinking of ffs(x) == fls(x & -x). So
you can calculate ffs with fls, but not easily the other way round.

-- 
	Falk

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 19:15     ` Daniel Phillips
@ 2003-04-30 20:59       ` Willy Tarreau
  2003-05-01  1:02         ` Daniel Phillips
  0 siblings, 1 reply; 65+ messages in thread
From: Willy Tarreau @ 2003-04-30 20:59 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linus Torvalds, Falk Hueffner, linux-kernel

On Wed, Apr 30, 2003 at 09:15:33PM +0200, Daniel Phillips wrote:
 
> In the dawn of time, before God gave us Cache, my version would have been the 
> fastest, because it executes the fewest instructions.  In the misty future, 
> as cache continues to scale and processors sprout more parallel execution 
> units, it will be clearly better once again.

Daniel,

I must acknowledge that your simple code was not easy to beat! You can try this
one on your PIII; I could only test it on an Athlon mobile and a P4. With gcc 2.95.3,
it gives me a boost of about 25%, because it seems gcc cannot optimize shifts
efficiently. On 3.2.3, however, it's between 0 and 5% depending on optimization/CPU.

But the code is a lot smaller (about 105 bytes total for the function). I could also
code a version with parallel execution of several branches, but it was slower,
because every branch not taken was computed for nothing, and the glue around it had
a cost. I tried to implement it with as many booleans as possible so that the compiler
can try to emit conditional instructions and make use of carries. I did operations
on an unsigned char because it has some advantages, such as 0x80 being negative ;-)

I also left the other tests in, so that you can try them if you want. Looking through
the generated code, it's clear that an assembler version would be faster. Not much,
but a bit.

Cheers,
Willy

static inline
unsigned new2_fls32(unsigned n32)
{
        unsigned t = 0;
        unsigned char n;

        if (n32 >> 16) {
                n32 >>= 16;
                t = 16;
        }

        n = n32;
        if (n32 >> 8) {
                n = (unsigned char)(n32 >> 8);
                t += 8;
        }
#if 0
        else
                if (!n) return t;
        // this one uses comparisons and jumps, but is short and fast
        return (n < 0x10) ?
                t + ((n < 0x04) ?  2 - (n < 0x02) : 4 - (n < 0x08)) :
                t + ((n < 0x40) ?  6 - (n < 0x20) : 8 - (n < 0x80)) ;
#endif
#if 0
        // this one compiles into several parallel branches, but is slower
        if (n < 0x10) {
                return t += (n < 0x04) ?  n - ((n + 1) >> 2) : 4 - (n < 8);
        } else {
                return t += (n < 0x40) ?  6 - (n < 0x20) : 8 - (n < 0x80) ;
        }
#endif
#if 0
        // the same as above
        return t += (n < 0x10) ? (n < 0x04) ?  n - ((n + 1) >> 2) : 4 - (n < 8) :
                        (n < 0x40) ?  6 - (n < 0x20) : 8 - (n < 0x80) ;
#endif
#if 1
        // this one uses comparisons and jumps, but is short and the fastest
        return (n < 0x10) ?
                (n < 0x04) ?  n + t - ((n + 1) >> 2) : 4 + t - (n < 8) :
                (n < 0x40) ?  6 + t - (n < 0x20) : 8 + t - (n < 0x80) ;
#endif
}


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30  2:46 Daniel Phillips
  2003-04-30  7:19 ` Willy Tarreau
  2003-04-30 11:14 ` Falk Hueffner
@ 2003-04-30 20:55 ` Andrew Morton
  2003-05-01  5:03   ` hugang
  2003-05-01  5:16   ` hugang
  2 siblings, 2 replies; 65+ messages in thread
From: Andrew Morton @ 2003-04-30 20:55 UTC (permalink / raw)
  To: dphillips; +Cc: linux-kernel

Daniel Phillips <dphillips@sistina.com> wrote:
>
> +static inline unsigned fls8(unsigned n)
> +{
> +	return n & 0xf0?
> +	    n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
> +	    n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
> +}

	return fls_table[n];


That'll be faster in benchmarks, possibly slower in real world.  As usual.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 16:16           ` Linus Torvalds
  2003-04-30 16:43             ` Falk Hueffner
@ 2003-04-30 20:25             ` Alan Cox
  2003-04-30 21:59               ` Falk Hueffner
  2003-04-30 23:41               ` Linus Torvalds
  1 sibling, 2 replies; 65+ messages in thread
From: Alan Cox @ 2003-04-30 20:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Falk Hueffner, dphillips, Linux Kernel Mailing List

On Mer, 2003-04-30 at 17:16, Linus Torvalds wrote:
> Clearly you're not going to make _one_ load to get fls, since having a 
> 4GB lookup array for a 32-bit fls would be "somewhat" wasteful.

It ought to be basically the same as ffs because if I remember rightly 

ffs(x^(x-1)) == fls(x)

Been a long time so I may have it wrong


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 14:11   ` Linus Torvalds
  2003-04-30 14:53     ` Falk Hueffner
@ 2003-04-30 19:15     ` Daniel Phillips
  2003-04-30 20:59       ` Willy Tarreau
  1 sibling, 1 reply; 65+ messages in thread
From: Daniel Phillips @ 2003-04-30 19:15 UTC (permalink / raw)
  To: Linus Torvalds, Falk Hueffner; +Cc: linux-kernel

On Wednesday 30 April 2003 16:11, Linus Torvalds wrote:
> On 30 Apr 2003, Falk Hueffner wrote:
> > gcc 3.4 will have a __builtin_ctz function which can be used for this.
> > It will emit special instructions on CPUs that support it (i386, Alpha
> > EV67), and use a lookup table on others, which is very boring, but
> > also faster.
>
> Classic mistake. Lookup tables are only faster in benchmarks, they are
> almost always slower in real life. You only need to miss in the cache
> _once_ on the lookup to lose all the time you won on the previous one
> hundred calls.
>
> "Small and simple" is almost always better than the alternatives. I
> suspect that's one reason why older versions of gcc often generate code
> that actually runs faster than newer versions: the newer versions _look_
> like they do a better job, but..

I agree that this one lies in a gray area, being linearly faster (PIV 
notwithstanding) but bigger too.

In the dawn of time, before God gave us Cache, my version would have been the 
fastest, because it executes the fewest instructions.  In the misty future, 
as cache continues to scale and processors sprout more parallel execution 
units, it will be clearly better once again.  For now, it's marginal, and 
after all, what's a factor of two between friends?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 16:16           ` Linus Torvalds
@ 2003-04-30 16:43             ` Falk Hueffner
  2003-04-30 20:25             ` Alan Cox
  1 sibling, 0 replies; 65+ messages in thread
From: Falk Hueffner @ 2003-04-30 16:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dphillips, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> Clearly you're not going to make _one_ load to get fls, since having
> a 4GB lookup array for a 32-bit fls would be "somewhat" wasteful.
> 
> So the lookup table would probably look up just the last 8 bits.
> 
> So the lookup table version is several instructions in itself, doing
> about half of what the calculating version needs to do _anyway_.

Right.

> Including those data-dependent branches.

Well, at least on Alpha, gcc can optimize them all away except one.

Note I'm not really arguing Linux should use a lookup table for fls;
I'm just arguing putting it as default into libgcc for architectures
that don't provide a special version (which aren't particularly many)
isn't stupid. But probably I'm just biased, because I put it there :)

-- 
	Falk

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 16:03         ` Falk Hueffner
@ 2003-04-30 16:16           ` Linus Torvalds
  2003-04-30 16:43             ` Falk Hueffner
  2003-04-30 20:25             ` Alan Cox
  0 siblings, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2003-04-30 16:16 UTC (permalink / raw)
  To: Falk Hueffner; +Cc: dphillips, linux-kernel


On 30 Apr 2003, Falk Hueffner wrote:
> Linus Torvalds <torvalds@transmeta.com> writes:
> 
> > There is _never_ any excuse to use a lookup table for something that
> > can be calculated with a few simple instructions. That's just
> > stupid.
> 
> Well, the "few simple instructions" are 28 instructions on Alpha for
> example, including 6 data-dependent branches. So I don't think it's
> *that* stupid.

You're comparing apples to oranges.

Clearly you're not going to make _one_ load to get fls, since having a 
4GB lookup array for a 32-bit fls would be "somewhat" wasteful.

So the lookup table would probably look up just the last 8 bits.

So the lookup table version is several instructions in itself, doing about
half of what the calculating version needs to do _anyway_. Including those
data-dependent branches.

		Linus


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 15:28       ` Linus Torvalds
@ 2003-04-30 16:03         ` Falk Hueffner
  2003-04-30 16:16           ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Falk Hueffner @ 2003-04-30 16:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dphillips, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> There is _never_ any excuse to use a lookup table for something that
> can be calculated with a few simple instructions. That's just
> stupid.

Well, the "few simple instructions" are 28 instructions on Alpha for
example, including 6 data-dependent branches. So I don't think it's
*that* stupid.

But I agree that benchmarking this stand-alone isn't particularly
useful.

-- 
	Falk

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 14:53     ` Falk Hueffner
@ 2003-04-30 15:28       ` Linus Torvalds
  2003-04-30 16:03         ` Falk Hueffner
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2003-04-30 15:28 UTC (permalink / raw)
  To: Falk Hueffner; +Cc: dphillips, linux-kernel


On 30 Apr 2003, Falk Hueffner wrote:
> Linus Torvalds <torvalds@transmeta.com> writes:
> 
> > Classic mistake. Lookup tables are only faster in benchmarks, they
> > are almost always slower in real life. You only need to miss in the
> > cache _once_ on the lookup to lose all the time you won on the
> > previous one hundred calls.
> 
> It seems to me if you call the function so seldom the table drops out
> of the cache, it is irrelevant how long it takes anyway.

Yeah, but then this whole discussion is irrelevant too.

I'm just saying that micro-benchmarks of operations that do not make any
sense on their own are _bad_ measures of real performance. That lookup is 
going to lose to the "natural" code even if it hits the L2, and misses 
only the L1.

Any piece of code that only works well when it sits in (and owns) the L1
is a piece of crap.

> Well, if a lookup table isn't "small and simple", I don't know what
> is.

Something that calculates it directly? Like, perchance, the current code? 

There is _never_ any excuse to use a lookup table for something that can 
be calculated with a few simple instructions. That's just stupid.

		Linus


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 14:11   ` Linus Torvalds
@ 2003-04-30 14:53     ` Falk Hueffner
  2003-04-30 15:28       ` Linus Torvalds
  2003-04-30 19:15     ` Daniel Phillips
  1 sibling, 1 reply; 65+ messages in thread
From: Falk Hueffner @ 2003-04-30 14:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dphillips, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> Classic mistake. Lookup tables are only faster in benchmarks, they
> are almost always slower in real life. You only need to miss in the
> cache _once_ on the lookup to lose all the time you won on the
> previous one hundred calls.

It seems to me if you call the function so seldom the table drops out
of the cache, it is irrelevant how long it takes anyway.

> "Small and simple" is almost always better than the alternatives. 

Well, if a lookup table isn't "small and simple", I don't know what
is.

-- 
	Falk

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 11:14 ` Falk Hueffner
  2003-04-30 13:03   ` Daniel Phillips
@ 2003-04-30 14:11   ` Linus Torvalds
  2003-04-30 14:53     ` Falk Hueffner
  2003-04-30 19:15     ` Daniel Phillips
  1 sibling, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2003-04-30 14:11 UTC (permalink / raw)
  To: Falk Hueffner; +Cc: dphillips, linux-kernel


On 30 Apr 2003, Falk Hueffner wrote:
> 
> gcc 3.4 will have a __builtin_ctz function which can be used for this.
> It will emit special instructions on CPUs that support it (i386, Alpha
> EV67), and use a lookup table on others, which is very boring, but
> also faster.

Classic mistake. Lookup tables are only faster in benchmarks, they are
almost always slower in real life. You only need to miss in the cache
_once_ on the lookup to lose all the time you won on the previous one
hundred calls.

"Small and simple" is almost always better than the alternatives. I 
suspect that's one reason why older versions of gcc often generate code 
that actually runs faster than newer versions: the newer versions _look_ 
like they do a better job, but..

			Linus


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 13:17     ` Falk Hueffner
@ 2003-04-30 14:07       ` Daniel Phillips
  0 siblings, 0 replies; 65+ messages in thread
From: Daniel Phillips @ 2003-04-30 14:07 UTC (permalink / raw)
  To: Falk Hueffner; +Cc: linux-kernel

On Wed 30 Apr 03 15:17, Falk Hueffner wrote:
> Daniel Phillips <dphillips@sistina.com> writes:
> > It's somewhat annoying that __builtin_clz leaves the all-ones case
> > dangling instead of returning -1.
>
> I assume you mean all-zero...

:-)

> Well, the semantics of the corresponding
> instructions simply differ too much. clz(0) is 31 on some, 32 on some
> others, and undefined on some more architectures. So nailing it down
> would incur some performance penalty.

It would only penalize stupidly broken architectures; it would help the ones 
that got it right.  It's hard to imagine how a sane person could think the 
result of clz((u32)0) should be other than 32.

Sigh.

Regards,

Daniel


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30  7:19 ` Willy Tarreau
  2003-04-30  8:36   ` Aaron Lehmann
@ 2003-04-30 13:51   ` Daniel Phillips
  1 sibling, 0 replies; 65+ messages in thread
From: Daniel Phillips @ 2003-04-30 13:51 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Linus Torvalds, linux-kernel

On Wednesday 30 April 2003 09:19, Willy Tarreau wrote:
> Oh, and the new2 function is another variant I have been using by the past,
> but it is slower here.

It's really cute.

> It performed well on older machines with gcc 2.7.2.
> Its main problem may be the code size, due to the 32 bits masks. Your code
> has the advantage of being short and working on small data sets.

My code just looks small.  At -O2 it expands to 4 times the size of the 
existing fls and 5 times at -O3.  The saving grace is that, being faster, it 
doesn't have to be inline, saving code size overall.

> [code]
>
> Here are the results...

The overall trend seems to be that my version is faster with newer compilers 
and higher optimization levels, and overall, the newer compilers are 
producing faster code.  The big anomaly is where the original version turns 
in the best time of the lot, running on PIV, compiled at -O3 with 2.95.3, as 
you pointed out.

Well, it raises interesting questions, but I'm not entirely sure what they are 
;-)

Regards,

Daniel


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 13:03   ` Daniel Phillips
@ 2003-04-30 13:17     ` Falk Hueffner
  2003-04-30 14:07       ` Daniel Phillips
  0 siblings, 1 reply; 65+ messages in thread
From: Falk Hueffner @ 2003-04-30 13:17 UTC (permalink / raw)
  To: dphillips; +Cc: Linus Torvalds, linux-kernel

Daniel Phillips <dphillips@sistina.com> writes:

> It's somewhat annoying that __builtin_clz leaves the all-ones case
> dangling instead of returning -1.

I assume you mean all-zero... Well, the semantics of the corresponding
instructions simply differ too much. clz(0) is 31 on some, 32 on some
others, and undefined on some more architectures. So nailing it down
would incur some performance penalty.

-- 
	Falk

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30 11:14 ` Falk Hueffner
@ 2003-04-30 13:03   ` Daniel Phillips
  2003-04-30 13:17     ` Falk Hueffner
  2003-04-30 14:11   ` Linus Torvalds
  1 sibling, 1 reply; 65+ messages in thread
From: Daniel Phillips @ 2003-04-30 13:03 UTC (permalink / raw)
  To: Falk Hueffner; +Cc: Linus Torvalds, linux-kernel

On Wednesday 30 April 2003 13:14, Falk Hueffner wrote:
> gcc 3.4 will have a __builtin_ctz function which can be used for this.
> It will emit special instructions on CPUs that support it (i386, Alpha
> EV67), and use a lookup table on others, which is very boring, but
> also faster.

Actually, __builtin_clz:

  http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

Not having a gcc 3.4 handy, I couldn't test it, but I did notice that the 
built-in ffs is very fast.  Perhaps all such standard functions will end up 
as built-ins instead of kernel library functions, at some point in the 
distant future.  If old compilers ever do finally fade away, that is.

It's somewhat annoying that __builtin_clz leaves the all-ones case dangling 
instead of returning -1.

Regards,

Daniel





* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30  2:46 Daniel Phillips
  2003-04-30  7:19 ` Willy Tarreau
@ 2003-04-30 11:14 ` Falk Hueffner
  2003-04-30 13:03   ` Daniel Phillips
  2003-04-30 14:11   ` Linus Torvalds
  2003-04-30 20:55 ` Andrew Morton
  2 siblings, 2 replies; 65+ messages in thread
From: Falk Hueffner @ 2003-04-30 11:14 UTC (permalink / raw)
  To: dphillips; +Cc: Linus Torvalds, linux-kernel

Daniel Phillips <dphillips@sistina.com> writes:

> Here's a faster implementation of generic_fls, that I discovered
> accidentally, by not noticing 2.5 already had a generic_fls, and so I
> rolled my own.  Like the incumbent, it's O(log2(bits)), but there's
> not a lot of resemblance beyond that.  I think the new algorithm is
> inherently more parallelizable than the traditional approach.  A
> processor that can speculatively evaluate both sides of a
> conditional would benefit even more than the PIII I tested on.

gcc 3.4 will have a __builtin_ctz function which can be used for this.
It will emit special instructions on CPUs that support it (i386, Alpha
EV67), and use a lookup table on others, which is very boring, but
also faster.

-- 
	Falk


* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30  8:36   ` Aaron Lehmann
  2003-04-30  8:43     ` Aaron Lehmann
@ 2003-04-30  8:52     ` Aaron Lehmann
  2003-04-30  8:51       ` P
  1 sibling, 1 reply; 65+ messages in thread
From: Aaron Lehmann @ 2003-04-30  8:52 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Daniel Phillips, linux-kernel

On Wed, Apr 30, 2003 at 01:36:07AM -0700, Aaron Lehmann wrote:
> > ===== gcc-2.95.3 -march=i686 -O2 / athlon MP/2200 (1.8 GHz) =====
> 
> -march=athlon would give gcc a chance to make better scheduling
> decisions, which could make the results more sensible.

And contrary to my memory, gcc 2.95.3 doesn't seem to support
-march=athlon.  I guess I need to go to sleep.


* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30  8:52     ` Aaron Lehmann
@ 2003-04-30  8:51       ` P
  0 siblings, 0 replies; 65+ messages in thread
From: P @ 2003-04-30  8:51 UTC (permalink / raw)
  To: Aaron Lehmann; +Cc: linux-kernel

Aaron Lehmann wrote:
> On Wed, Apr 30, 2003 at 01:36:07AM -0700, Aaron Lehmann wrote:
> 
>>>===== gcc-2.95.3 -march=i686 -O2 / athlon MP/2200 (1.8 GHz) =====
>>
>>-march=athlon would give gcc a chance to make better scheduling
>>decisions, which could make the results more sensible.
> 
> And contrary to my memory, gcc 2.95.3 doesn't seem to support
> -march=athlon.  I guess I need to go to sleep.

to bypass your memory:
http://www.pixelbeat.org/scripts/gcccpuopt

Pádraig.



* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30  8:36   ` Aaron Lehmann
@ 2003-04-30  8:43     ` Aaron Lehmann
  2003-04-30  8:52     ` Aaron Lehmann
  1 sibling, 0 replies; 65+ messages in thread
From: Aaron Lehmann @ 2003-04-30  8:43 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Daniel Phillips, Linus Torvalds, linux-kernel

On Wed, Apr 30, 2003 at 01:36:07AM -0700, Aaron Lehmann wrote:
> > ==== gcc 3.2.3 -march=i686 -O3 / Pentium 4 / 2.0 GHz ====
> 
> This version of gcc accepts -march=pentium4, which again might make
> gcc's behavior less bizarre.

Oh, you tried that too, sorry.


* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30  7:19 ` Willy Tarreau
@ 2003-04-30  8:36   ` Aaron Lehmann
  2003-04-30  8:43     ` Aaron Lehmann
  2003-04-30  8:52     ` Aaron Lehmann
  2003-04-30 13:51   ` Daniel Phillips
  1 sibling, 2 replies; 65+ messages in thread
From: Aaron Lehmann @ 2003-04-30  8:36 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Daniel Phillips, Linus Torvalds, linux-kernel

On Wed, Apr 30, 2003 at 09:19:07AM +0200, Willy Tarreau wrote:

> But gcc's optimizer seems to get worse with each new version. The NIL
> checksum takes 3 times longer to complete on gcc3/athlon than gcc2/athlon !!!
> And on some tests, the P4 goes faster when the code is optimized for athlon
> than for P4 !

It would be interesting to compare the assembly output of the
compilers to see what's happening. The gcc folks appreciate test cases
and this function seems simple and sensitive enough to make a good one.

> ===== gcc-2.95.3 -march=i686 -O2 / athlon MP/2200 (1.8 GHz) =====

-march=athlon would give gcc a chance to make better scheduling
decisions, which could make the results more sensible.

> ==== gcc 3.2.3 -march=i686 -O3 / Pentium 4 / 2.0 GHz ====

This version of gcc accepts -march=pentium4, which again might make
gcc's behavior less bizarre.


* Re: [RFC][PATCH] Faster generic_fls
  2003-04-30  2:46 Daniel Phillips
@ 2003-04-30  7:19 ` Willy Tarreau
  2003-04-30  8:36   ` Aaron Lehmann
  2003-04-30 13:51   ` Daniel Phillips
  2003-04-30 11:14 ` Falk Hueffner
  2003-04-30 20:55 ` Andrew Morton
  2 siblings, 2 replies; 65+ messages in thread
From: Willy Tarreau @ 2003-04-30  7:19 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel

Hi Daniel,

On Wed, Apr 30, 2003 at 04:46:23AM +0200, Daniel Phillips wrote:
> Here's a faster implementation of generic_fls, that I discovered accidentally,
> by not noticing 2.5 already had a generic_fls, and so I rolled my own.  Like
> the incumbent, it's O(log2(bits)), but there's not a lot of resemblance beyond
> that.  I think the new algorithm is inherently more parallelizable than the
> traditional approach.  A processor that can speculatively evaluate both sides 
> of a conditional would benefit even more than the PIII I tested on.

I did some tests on an Athlon MP2200 (1.8 GHz) and a Pentium 4/2.0 GHz, with
both gcc 2.95.3 and gcc 3.2.3.

The results are interesting, because they show that the compiler used is far
more important than the code !  To summarize quickly, your code is faster than
the old one on an athlon with gcc-2.95.3 -O3, and with any combination involving
gcc-3.2.3.  On a P4, the new code compiled with 3.2.3 is still slower than the
old code with 2.95.3.

But gcc's optimizer seems to get worse with each new version. The NIL
checksum takes 3 times longer to complete on gcc3/athlon than gcc2/athlon !!!
And on some tests, the P4 goes faster when the code is optimized for athlon than
for P4 !

Oh, and the new2 function is another variant I have been using in the past, but
it is slower here.  It performed well on older machines with gcc 2.7.2.  Its main
problem may be the code size, due to the 32-bit masks.  Your code has the
advantage of being short and working on small data sets.

#define new2_fls32(___a) ({ \
    register int ___x, ___bits = 0; \
    if (___a) { \
        ___x = (___a); \
        ++___bits; \
        if (___x & 0xffff0000) { ___x &= 0xffff0000; ___bits += 16;} \
        if (___x & 0xff00ff00) { ___x &= 0xff00ff00; ___bits +=  8;} \
        if (___x & 0xf0f0f0f0) { ___x &= 0xf0f0f0f0; ___bits +=  4;} \
        if (___x & 0xcccccccc) { ___x &= 0xcccccccc; ___bits +=  2;} \
        if (___x & 0xaaaaaaaa) { ___x &= 0xaaaaaaaa; ___bits +=  1;} \
    } \
    ___bits; \
    })



Here are the results.

Cheers,
Willy

===== gcc-2.95.3 -march=i686 -O2 / athlon MP/2200 (1.8 GHz) =====

4294967295 iterations of nil... checksum = 4294967295

real    0m4.778s
user    0m4.770s
sys     0m0.010s
4294967295 iterations of old... checksum = 4294967265

real    0m37.923s
user    0m37.920s
sys     0m0.000s
4294967295 iterations of new... checksum = 4294967265

real    0m47.956s
user    0m47.920s
sys     0m0.030s
4294967295 iterations of new2... checksum = 4294967295

real    0m43.884s
user    0m43.860s
sys     0m0.020s


===== gcc-2.95.3 -march=i686 -O3 / athlon MP/2200 =====

4294967295 iterations of nil... checksum = 4294967295

real    0m4.779s
user    0m4.770s
sys     0m0.010s
4294967295 iterations of old... checksum = 4294967265

real    0m39.235s
user    0m39.180s
sys     0m0.050s
4294967295 iterations of new... checksum = 4294967265

real    0m32.175s
user    0m32.170s
sys     0m0.000s
4294967295 iterations of new2... checksum = 4294967295

real    0m37.560s
user    0m37.520s
sys     0m0.020s

===== gcc-2.95.3 -march=i686 -O3 / Pentium4 2.0 GHz =====

4294967295 iterations of nil... checksum = 4294967295

real    0m4.370s
user    0m4.370s
sys     0m0.000s
4294967295 iterations of old... checksum = 4294967265

real    0m22.148s
user    0m22.150s
sys     0m0.000s
4294967295 iterations of new... checksum = 4294967265

real    0m25.854s
user    0m25.850s
sys     0m0.000s
4294967295 iterations of new2... checksum = 4294967295

real    0m30.743s
user    0m28.930s
sys     0m0.160s


==== gcc 3.2.3 -march=i686 -O3 / Pentium 4 / 2.0 GHz ====

4294967295 iterations of nil... checksum = 4294967295

real    0m6.719s
user    0m6.710s
sys     0m0.000s
4294967295 iterations of old... checksum = 4294967265

real    0m45.571s
user    0m43.510s
sys     0m0.180s
4294967295 iterations of new... checksum = 4294967265

real    0m28.251s
user    0m26.420s
sys     0m0.010s
4294967295 iterations of new2... checksum = 4294967265

real    0m48.881s
user    0m47.080s
sys     0m0.000s
-> awful, 60% more time than with gcc-2.95.3 !

==== gcc 3.2.3 -march=pentium4 -O3 / Pentium 4 / 2.0 GHz ====
(yes, gcc 3.2 may be producing better code... for the eyes, not
for the CPU)

4294967295 iterations of nil... checksum = 4294967295

real    0m4.422s
user    0m4.420s
sys     0m0.000s
4294967295 iterations of old... checksum = 4294967265

real    0m34.950s
user    0m34.950s
sys     0m0.000s
4294967295 iterations of new... checksum = 4294967265

real    0m27.667s
user    0m25.810s
sys     0m0.000s
4294967295 iterations of new2... checksum = 4294967295

real    0m42.945s
user    0m42.760s
sys     0m0.160s

==== gcc 3.2.3 -march=athlon-mp -O3 / Pentium 4 / 2.0 GHz ====

4294967295 iterations of nil... checksum = 4294967295

real    0m6.486s
user    0m6.490s
sys     0m0.000s
4294967295 iterations of old... checksum = 4294967265

real    0m41.240s
user    0m39.220s
sys     0m0.160s
4294967295 iterations of new... checksum = 4294967265

real    0m25.780s
user    0m25.780s
sys     0m0.000s
4294967295 iterations of new2... checksum = 4294967295

real    0m51.789s
user    0m49.750s
sys     0m0.010s

==== gcc 3.2.3 -march=athlon-mp -O3 / Athlon MP2200 / 1.8 GHz ====

4294967295 iterations of nil... checksum = 4294967295

real    0m14.544s
user    0m14.540s
sys     0m0.000s
4294967295 iterations of old... checksum = 4294967265

real    0m44.609s
user    0m44.600s
sys     0m0.000s
4294967295 iterations of new... checksum = 4294967265

real    0m26.765s
user    0m26.760s
sys     0m0.000s
4294967295 iterations of new2... checksum = 4294967295

real    1m6.066s
user    1m6.030s
sys     0m0.030s

==== gcc 3.2.3 -march=pentium4 -O3 / Athlon MP2200 / 1.8 GHz ====

4294967295 iterations of nil... checksum = 4294967295

real    0m10.775s
user    0m10.750s
sys     0m0.030s
4294967295 iterations of old... checksum = 4294967265

real    0m45.837s
user    0m45.830s
sys     0m0.000s
4294967295 iterations of new... checksum = 4294967265

real    0m24.414s
user    0m24.410s
sys     0m0.010s
4294967295 iterations of new2... checksum = 4294967295

real    1m7.299s
user    1m7.280s
sys     0m0.010s


==== gcc 3.2.3 -march=i686 -O3 / Athlon MP2200 / 1.8 GHz ====

4294967295 iterations of nil... checksum = 4294967295

real    0m13.010s
user    0m13.000s
sys     0m0.010s




* [RFC][PATCH] Faster generic_fls
@ 2003-04-30  2:46 Daniel Phillips
  2003-04-30  7:19 ` Willy Tarreau
                   ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Daniel Phillips @ 2003-04-30  2:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Hi Linus,

Here's a faster implementation of generic_fls, that I discovered accidentally,
by not noticing 2.5 already had a generic_fls, and so I rolled my own.  Like
the incumbent, it's O(log2(bits)), but there's not a lot of resemblance beyond
that.  I think the new algorithm is inherently more parallelizable than the
traditional approach.  A processor that can speculatively evaluate both sides 
of a conditional would benefit even more than the PIII I tested on.

The algorithm works roughly as follows: to find the highest bit in a value of
a given size, test the higher half for any one bits, then recursively apply
the algorithm to one of the two halves, depending on the test.  Once down to 8
bits, enumerate all the cases:

   return n & 0xf0? (n & 0xc0? (n & 0x80? 8: 7): (n & 0x20? 6: 5)):
          n & 0x0f? (n & 0x0c? (n & 0x08? 4: 3): (n & 0x02? 2: 1)): 0;

The above expression can be considerably optimized by noting that once we get
down to just two bits, at least one of which is known to be a one bit, it's
faster to shift out the higher of the two bits and add it directly to the
result than to evaluate a conditional.

A sneaky optimization is possible for the lowest two bits: the four values
{0, 1, 2, 3} map directly onto three of the four wanted results {0, 1, 2, 2},
so a little bit bashing takes care of both the conditional mentioned above
and the test that would otherwise be needed for the zero case.  The resulting
optimized code is sufficiently obfuscated for an IOCCC entry, but it's fast:

   return n & 0xf0?
       n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
       n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);

In short, to find the high bit of a 32 bit value, the algorithm enumerates
all 32 possibilities using a binary search.  Inlines clarify the recursion,
and as a fringe benefit, give us a handy set of 8, 16, 32 and 64 bit
function flavors, the shorter versions being a little faster if you can use
them.

The thing that makes it fast (I think) is that the expressions at the leaves
can be evaluated in parallel with the conditional tests - that is, it's
possible to compute the results before we know exactly which one is needed. 
Another thing that may contribute to the speed is that the algorithm is doing
relatively more reading than writing, compared to the current version.
Though I did not test it, I believe the speedup will carry over to assembly
implementations as well.

There are still some optimization possibilities remaining.  For example, in
some of the evaluation cases the zero case doesn't have to be evaluated, so
a little arithmetic can be saved.  But then the helper functions wouldn't be
useable as sub-functions in their own right any more, so I don't think the
small speedup is worth it for the decreased orthogonality.

The improvement on a PIII varies from about 1.43x with gcc -O2 to 2.08x at
-O3.  The benchmark runs 2**32 iterations, evaluating all 32 bit cases.
Roughly speaking, at O3 it takes about 10 cycles to do the job:

       Old       New   Empty loop  Actual old  Actual new   Speedup
O2:  111.267   94.129    54.014      57.253      40.115       1.43
O3:   95.875   53.018    13.200      82.675      39.818       2.08

The test program is here:

   http://people.nl.linux.org/~phillips/test_fls.c

The patch is against 2.5.68-bk9.

Regards,

Daniel

--- 2.5.68.clean/include/linux/bitops.h	2003-04-20 04:51:12.000000000 +0200
+++ 2.5.68/include/linux/bitops.h	2003-04-30 01:29:27.000000000 +0200
@@ -41,33 +41,35 @@
  * fls: find last bit set.
  */
 
-extern __inline__ int generic_fls(int x)
+/* Faster generic_fls */
+/* (c) 2002, D.Phillips and Sistina Software */
+/* Licensed under Version 2 of the GPL */
+
+static inline unsigned fls8(unsigned n)
+{
+	return n & 0xf0?
+	    n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
+	    n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
+}
+
+static inline unsigned fls16(unsigned n)
+{
+	return n & 0xff00? fls8(n >> 8) + 8: fls8(n);
+}
+
+static inline unsigned fls32(unsigned n)
+{
+	return n & 0xffff0000? fls16(n >> 16) + 16: fls16(n);
+}
+
+static inline unsigned fls64(unsigned long long n) /* should be u64 */
 {
-	int r = 32;
+	return n & 0xffffffff00000000? fls32(n >> 32) + 32: fls32(n);
+}
 
-	if (!x)
-		return 0;
-	if (!(x & 0xffff0000u)) {
-		x <<= 16;
-		r -= 16;
-	}
-	if (!(x & 0xff000000u)) {
-		x <<= 8;
-		r -= 8;
-	}
-	if (!(x & 0xf0000000u)) {
-		x <<= 4;
-		r -= 4;
-	}
-	if (!(x & 0xc0000000u)) {
-		x <<= 2;
-		r -= 2;
-	}
-	if (!(x & 0x80000000u)) {
-		x <<= 1;
-		r -= 1;
-	}
-	return r;
+static inline int generic_fls(int n)
+{
+	return fls32(n);
 }
 
 extern __inline__ int get_bitmask_order(unsigned int count)



end of thread, other threads:[~2003-05-03 18:54 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-05-02 10:04 [RFC][PATCH] Faster generic_fls Chuck Ebbert
2003-05-02 22:55 ` David S. Miller
  -- strict thread matches above, loose matches on Subject: below --
2003-05-03  2:21 Chuck Ebbert
2003-05-02  8:50 Chuck Ebbert
2003-05-02  9:37 ` Peter Osterlund
2003-05-01 17:15 Chuck Ebbert
2003-05-01 13:31 linux
     [not found] <87d6j34jad.fsf@student.uni-tuebingen.de.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.44.0304301801210.20283-100000@home.transmeta.com.suse.lists.linux.kernel>
2003-05-01  1:46   ` Andi Kleen
2003-05-01  4:40     ` Linus Torvalds
2003-05-01 12:00       ` David S. Miller
2003-05-02  5:14         ` Linus Torvalds
2003-04-30  2:46 Daniel Phillips
2003-04-30  7:19 ` Willy Tarreau
2003-04-30  8:36   ` Aaron Lehmann
2003-04-30  8:43     ` Aaron Lehmann
2003-04-30  8:52     ` Aaron Lehmann
2003-04-30  8:51       ` P
2003-04-30 13:51   ` Daniel Phillips
2003-04-30 11:14 ` Falk Hueffner
2003-04-30 13:03   ` Daniel Phillips
2003-04-30 13:17     ` Falk Hueffner
2003-04-30 14:07       ` Daniel Phillips
2003-04-30 14:11   ` Linus Torvalds
2003-04-30 14:53     ` Falk Hueffner
2003-04-30 15:28       ` Linus Torvalds
2003-04-30 16:03         ` Falk Hueffner
2003-04-30 16:16           ` Linus Torvalds
2003-04-30 16:43             ` Falk Hueffner
2003-04-30 20:25             ` Alan Cox
2003-04-30 21:59               ` Falk Hueffner
2003-04-30 22:22                 ` Alan Cox
2003-04-30 23:41               ` Linus Torvalds
2003-04-30 22:47                 ` Alan Cox
2003-05-01  0:12                 ` Falk Hueffner
2003-05-01  1:07                   ` Linus Torvalds
2003-04-30 19:15     ` Daniel Phillips
2003-04-30 20:59       ` Willy Tarreau
2003-05-01  1:02         ` Daniel Phillips
2003-05-01  9:19           ` Willy Tarreau
2003-05-01 16:02             ` Linus Torvalds
2003-04-30 20:55 ` Andrew Morton
2003-05-01  5:03   ` hugang
2003-05-01  5:11     ` Andrew Morton
2003-05-01  5:33       ` hugang
2003-05-01  7:05         ` hugang
2003-05-01 13:52           ` Willy TARREAU
2003-05-01 14:14             ` Falk Hueffner
2003-05-01 14:26               ` Willy TARREAU
2003-05-01 14:53             ` hugang
2003-05-01 15:54               ` Thomas Schlichter
2003-05-02  0:33                 ` hugang
2003-05-01 17:16               ` Willy TARREAU
2003-05-01 23:27                 ` Thomas Schlichter
2003-05-02  0:10                   ` Willy Tarreau
2003-05-02  0:43                     ` Daniel Phillips
2003-05-02  0:54                       ` Andrew Morton
2003-05-02 12:21                         ` Daniel Phillips
2003-05-02  1:47                       ` Thomas Schlichter
2003-05-02 13:24                         ` Willy Tarreau
2003-05-02  0:13                   ` Daniel Phillips
2003-05-02  0:13             ` Lincoln Dale
2003-05-01  5:16   ` hugang
2003-05-01 10:22     ` Willy TARREAU
2003-05-01 11:17       ` hugang
2003-05-01 11:45         ` Willy TARREAU

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).