linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC][PATCH] Faster generic_fls
@ 2003-05-03  2:21 Chuck Ebbert
  0 siblings, 0 replies; 65+ messages in thread
From: Chuck Ebbert @ 2003-05-03  2:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

>>   return conf->last_used = new_disk;
>> 
>> (new_disk is a local variable, conf is a function arg.)
>
> Since new_disk is on the stack, is there something about 'conf'
> that guarenetes it is not on the stack too?  F.e. what if
> &conf->last_used were one byte into 'new_disk' or something
> like that.

  It could -- I had thought the -fno-strict-aliasing option
meant exactly the opposite of what it really means.  Since the
compiler has been told to be pessimistic about possibilities
like this maybe that's what it has to do.
------
 Chuck

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Faster generic_fls
@ 2003-05-02 10:04 Chuck Ebbert
  2003-05-02 22:55 ` David S. Miller
  0 siblings, 1 reply; 65+ messages in thread
From: Chuck Ebbert @ 2003-05-02 10:04 UTC (permalink / raw)
  To: Peter Osterlund; +Cc: linux-kernel

Peter Osterlund wrote:

>>   mov <mem1>,eax
>>   mov eax,<mem2>
>>   mov <mem1>,eax        ; eax already contains mem1 you stupid compiler
>>   ret
>
> Not necessarily if mem2 == mem1 + 2. Consider this code:

  I realized after sending that last that I should have added that there
were no volatiles and no aliasing possible.  This was the kernel code:

  return conf->last_used = new_disk;

(new_disk is a local variable, conf is a function arg.)

------
 Chuck

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Faster generic_fls
@ 2003-05-02  8:50 Chuck Ebbert
  2003-05-02  9:37 ` Peter Osterlund
  0 siblings, 1 reply; 65+ messages in thread
From: Chuck Ebbert @ 2003-05-02  8:50 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

David S. Miller wrote:

> I know you think GCC is a pile of dogshit, in many ways I do too, but I
> think it's just as important to point out the good parts where they
> exist.

  GCC is the strangest combination of utterly brilliant and brain-dead
stupid that I've ever seen... I've seen it do tail merges that took
my breath away, followed by this:

  mov <mem1>,eax
  mov eax,<mem2>
  mov <mem1>,eax        ; eax already contains mem1 you stupid compiler
  ret

------
 Chuck

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Faster generic_fls
@ 2003-05-01 17:15 Chuck Ebbert
  0 siblings, 0 replies; 65+ messages in thread
From: Chuck Ebbert @ 2003-05-01 17:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:

>>  BTW, has someone benchmarked BSF/BSR on x86 ? It should be
>> faster, it it's also possible that a poor microcode implements it with a one
>> bit/cycle algo, which will result in one instruction not being as fast as your
>> code.
>
> I think the original i386 did it with a one-bit-per-cycle algorithm,
> anything since should be fine. In particular, on a P4 where I just tested,
> the bsf seems to be 4 cycles over the whole input set (actually, my whole
> loop was 4 cycles per iteration, so 4 cycles is worst-case. I'm assuming
> the rest could have been done in parallell).

 
 Just for comparison, the Pentium (Classic) manual says 6-43 clocks for
bsfl and 7-72 (!) for bsrl.

------
 Chuck

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [RFC][PATCH] Faster generic_fls
@ 2003-05-01 13:31 linux
  0 siblings, 0 replies; 65+ messages in thread
From: linux @ 2003-05-01 13:31 UTC (permalink / raw)
  To: linux-kernel

The Fount of Holy Pengun Pee expressed himself thus:
> Yeah, except if you want best code generation you should probably use
>
>	static inline int fls(int x)
>	{
>		int bit;
>		/* Only well-defined for non-zero */
>		asm("bsrl %1,%0":"=r" (bit):"rm" (x));
>		return x ? bit : 32;
>	}
>
> which should generate (assuming you give "-march=xxxx" to tell gcc to use 
> cmov):
>
>		movl    $32, %eax
>		bsrl %edx,%ecx
>		testl   %edx, %edx
>		cmovne  %ecx, %eax
>
> which looks pretty optimal.

Except that the testl is unnecessary.  The full semantics of
bsrl are:

if (source != 0) {
	destination = <index of most significant set bit in source>;
	ZF = 0;
} else {
	destination = UNDEFINED;
	ZF = 1;
}

Thus, there are two reasonable code sequences:

asm("	bsrl	%2, %1"
"\n	cmovne  %1, %0" : "=r" (bit), "=r" (temp) : "rm" (x), "0" (32));

and (if I've got the earlyclobber syntax right):

asm("	bsrl	%1, %0"
"\n	cmoveq  %2, %0" : "=&r,&r" (bit) : "0,rm" (x), "rm,rm" (32));

Note that in this latter case, I have to list %1 == %0 as an explicit
alternative, because otherwise the & on operand 0 would tell GCC to forbid
that combination and it's only %0 == %2 that's forbidden.

(The first alternative is listed first because it uses fewer registers
and so is preferred, all other things being equal.)

Unfortunately, I'm not sure how to let GCC choose between these two
alternatives...

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [RFC][PATCH] Faster generic_fls
@ 2003-04-30  2:46 Daniel Phillips
  2003-04-30  7:19 ` Willy Tarreau
                   ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Daniel Phillips @ 2003-04-30  2:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Hi Linus,

Here's a faster implementation of generic_fls, that I discovered accidently,
by not noticing 2.5 already had a generic_fls, and so I rolled my own.  Like
the incumbent, it's O log2(bits), but there's not a lot of resemblance beyond
that.  I think the new algorithm is inherently more parallelizable than the
traditional approach.  A processor that can speculatively evaluate both sides 
of a conditional would benefit even more than the PIII I tested on.

The algorithm works roughly as follows: to find the highest bit in a value of
a given size, test the higher half for any one bits, then recursively apply
the algorithm to one of the two halves, depending on the test.  Once down to 8
bits, enumerate all the cases:

   return n & 0xf0? (n & 0xc0? (n & 0x80? 8: 7): (n & 0x20? 6: 5)):
          n & 0x0f? (n & 0x0c? (n & 0x08? 4: 3): (n & 0x02? 2: 1)): 0;

The above expression can be considerably optimized by noting that once we get
down to just two bits, at least one of which is known to be a one bit, it's
faster to shift out the higher of the two bits and add it directly to the
result than to evaluate a conditional.

A sneaky optimization is possible for the lowest two bits: the four values
{0, 1, 2, 3} map directly onto three of the four wanted results {0, 1, 2, 2},
so a little bit bashing takes care of both the conditional mentioned above
and the test that would otherwise be needed for the zero case.  The resulting
optimized code is sufficiently obfuscated for an IOCC entry, but it's fast:

   return n & 0xf0?
       n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
       n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);

In short, to find the high bit of a 32 bit value, the algorithm enumerates
all 32 possibilities using a binary search.  Inlines clarify the recursion,
and as a fringe benefit, give us a handy set of 8, 16, 32 and 64 bit
function flavors, the shorter versions being a little faster if you can use
them.

The thing that makes it fast (I think) is that the expressions at the leaves
can be evaluated in parallel with the conditional tests - that is, it's
possible to compute the results before we know exactly which one is needed. 
Another thing that may contribute to the speed is that the algorithm is doing
relatively more reading than writing, compared to the current version.
Though I did not test it, I believe the speedup will carry over to assembly
implementations as well.

There are still some optimization possibilities remaining.  For example, in
some of the evaluation cases the zero case doesn't have to be evaluated, so
a little arithmetic can be saved.  But then the helper functions wouldn't be
useable as sub-functions in their own right any more, so I don't think the
small speedup is worth it for the decreased orthogonality.

The improvement on a PIII varies from about 1.43x with gcc -O2 to 2.08x at
-O3.  The benchmark runs 2**32 iterations, evaluating all 32 bit cases.
Roughly speaking, at O3 it takes about 10 cycles to do the job:

       Old       New   Empty loop  Actual old  Actual new   Speedup
O2:  111.267   94.129    54.014      57.253      40.115       1.43
O3:   95.875   53.018    13.200      82.675      39.818       2.08

The test program is here:

   http://people.nl.linux.org/~phillips/test_fls.c

The patch is against 2.5.68-bk9.

Regards,

Daniel

--- 2.5.68.clean/include/linux/bitops.h	2003-04-20 04:51:12.000000000 +0200
+++ 2.5.68/include/linux/bitops.h	2003-04-30 01:29:27.000000000 +0200
@@ -41,33 +41,35 @@
  * fls: find last bit set.
  */
 
-extern __inline__ int generic_fls(int x)
+/* Faster generic_fls */
+/* (c) 2002, D.Phillips and Sistina Software */
+/* Licensed under Version 2 of the GPL */
+
+static inline unsigned fls8(unsigned n)
+{
+	return n & 0xf0?
+	    n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
+	    n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
+}
+
+static inline unsigned fls16(unsigned n)
+{
+	return n & 0xff00? fls8(n >> 8) + 8: fls8(n);
+}
+
+static inline unsigned fls32(unsigned n)
+{
+	return n & 0xffff0000? fls16(n >> 16) + 16: fls16(n);
+}
+
+static inline unsigned fls64(unsigned long long n) /* should be u64 */
 {
-	int r = 32;
+	return n & 0xffffffff00000000? fls32(n >> 32) + 32: fls32(n);
+}
 
-	if (!x)
-		return 0;
-	if (!(x & 0xffff0000u)) {
-		x <<= 16;
-		r -= 16;
-	}
-	if (!(x & 0xff000000u)) {
-		x <<= 8;
-		r -= 8;
-	}
-	if (!(x & 0xf0000000u)) {
-		x <<= 4;
-		r -= 4;
-	}
-	if (!(x & 0xc0000000u)) {
-		x <<= 2;
-		r -= 2;
-	}
-	if (!(x & 0x80000000u)) {
-		x <<= 1;
-		r -= 1;
-	}
-	return r;
+static inline int generic_fls(int n)
+{
+	return fls32(n);
 }
 
 extern __inline__ int get_bitmask_order(unsigned int count)


^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2003-05-03 18:54 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <87d6j34jad.fsf@student.uni-tuebingen.de.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.44.0304301801210.20283-100000@home.transmeta.com.suse.lists.linux.kernel>
2003-05-01  1:46   ` [RFC][PATCH] Faster generic_fls Andi Kleen
2003-05-01  4:40     ` Linus Torvalds
2003-05-01 12:00       ` David S. Miller
2003-05-02  5:14         ` Linus Torvalds
2003-05-03  2:21 Chuck Ebbert
  -- strict thread matches above, loose matches on Subject: below --
2003-05-02 10:04 Chuck Ebbert
2003-05-02 22:55 ` David S. Miller
2003-05-02  8:50 Chuck Ebbert
2003-05-02  9:37 ` Peter Osterlund
2003-05-01 17:15 Chuck Ebbert
2003-05-01 13:31 linux
2003-04-30  2:46 Daniel Phillips
2003-04-30  7:19 ` Willy Tarreau
2003-04-30  8:36   ` Aaron Lehmann
2003-04-30  8:43     ` Aaron Lehmann
2003-04-30  8:52     ` Aaron Lehmann
2003-04-30  8:51       ` P
2003-04-30 13:51   ` Daniel Phillips
2003-04-30 11:14 ` Falk Hueffner
2003-04-30 13:03   ` Daniel Phillips
2003-04-30 13:17     ` Falk Hueffner
2003-04-30 14:07       ` Daniel Phillips
2003-04-30 14:11   ` Linus Torvalds
2003-04-30 14:53     ` Falk Hueffner
2003-04-30 15:28       ` Linus Torvalds
2003-04-30 16:03         ` Falk Hueffner
2003-04-30 16:16           ` Linus Torvalds
2003-04-30 16:43             ` Falk Hueffner
2003-04-30 20:25             ` Alan Cox
2003-04-30 21:59               ` Falk Hueffner
2003-04-30 22:22                 ` Alan Cox
2003-04-30 23:41               ` Linus Torvalds
2003-04-30 22:47                 ` Alan Cox
2003-05-01  0:12                 ` Falk Hueffner
2003-05-01  1:07                   ` Linus Torvalds
2003-04-30 19:15     ` Daniel Phillips
2003-04-30 20:59       ` Willy Tarreau
2003-05-01  1:02         ` Daniel Phillips
2003-05-01  9:19           ` Willy Tarreau
2003-05-01 16:02             ` Linus Torvalds
2003-04-30 20:55 ` Andrew Morton
2003-05-01  5:03   ` hugang
2003-05-01  5:11     ` Andrew Morton
2003-05-01  5:33       ` hugang
2003-05-01  7:05         ` hugang
2003-05-01 13:52           ` Willy TARREAU
2003-05-01 14:14             ` Falk Hueffner
2003-05-01 14:26               ` Willy TARREAU
2003-05-01 14:53             ` hugang
2003-05-01 15:54               ` Thomas Schlichter
2003-05-02  0:33                 ` hugang
2003-05-01 17:16               ` Willy TARREAU
2003-05-01 23:27                 ` Thomas Schlichter
2003-05-02  0:10                   ` Willy Tarreau
2003-05-02  0:43                     ` Daniel Phillips
2003-05-02  0:54                       ` Andrew Morton
2003-05-02 12:21                         ` Daniel Phillips
2003-05-02  1:47                       ` Thomas Schlichter
2003-05-02 13:24                         ` Willy Tarreau
2003-05-02  0:13                   ` Daniel Phillips
2003-05-02  0:13             ` Lincoln Dale
2003-05-01  5:16   ` hugang
2003-05-01 10:22     ` Willy TARREAU
2003-05-01 11:17       ` hugang
2003-05-01 11:45         ` Willy TARREAU

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).