Hi,

On March 1, hugang wrote:
> On Thu, 1 May 2003 15:52:04 +0200
>
> Willy TARREAU <willy@w.ods.org> wrote:
> >  However, I have good news for you, the table code is 15% faster than my
> > tree on an Alpha EV6 ;-)
> >  It may be because gcc can use different tricks on this arch.
> >
> >  Cheers
> >  Willy
>
> However, The most code changed has increase a little peformance, Do we
> really need it?
>
> Here is test in my machine, The faster is still table.

I think the table version is so fast as a normal distribution of numbers is 
tested. With this distribution half of the occuring values have the MSB set, 
one quarter the next significant bit and so on...

The tree version is approximately equally fast for every number, but the table 
version is fast if a very significant bit is set and slow if  a less 
significant bit is set...

If small values occour more often than big ones the tree version should be 
preferred, else following version may be very fast, too.

static inline int fls_shift(int x)
{
	int bit = 32;
 
	while(x > 0) {
		--bit;
		x <<= 1;
	}

	return x ? bit : 0;
}

I get following times on my K6-III (compiled with -O3 -march=k6) for 
4294967296 iterations:

fls_bsrl: (uses the 'bsrl' command via inline assembler)
real    3m39.059s
user    3m27.416s
sys     0m0.345s

fls_shift:
real    1m47.337s
user    1m41.801s
sys     0m0.185s

fls_tree:
real    1m56.746s
user    1m50.681s
sys     0m0.165s

The 32Bit tree is a bit slower here, too:
real    1m59.447s
user    1m53.375s
sys     0m0.200s

I did not test the table version as I think it cannot be faster than the shift 
version...

Best regards
  Thomas Schlichter

P.S.: I'd be happy to see how this performs on an Athlon or P-IV...