Re: .../asm-i386/bitops.h performance improvements


* Re: .../asm-i386/bitops.h  performance improvements
       [not found] ` <4fF2j-1Lo-19@gated-at.bofh.it>
@ 2005-06-15 14:57   ` Bodo Eggert
  2005-06-15 15:30     ` Maciej W. Rozycki
  2005-06-15 23:53     ` cutaway
  0 siblings, 2 replies; 14+ messages in thread
From: Bodo Eggert @ 2005-06-15 14:57 UTC (permalink / raw)
  To: Gene Heskett, cutaway, linux-kernel

Gene Heskett <gene.heskett@verizon.net> wrote:

>>leal (%%edx,%%edi,8),%%edx
>>
> To what cpu families does this apply?  eg, this may be true for intel,
> but what about amd, via etc?

lea is an 8086 instruction. All clones have it in its basic form. However,
the multiplier is not documented for i486, therefore it will be an i586
extension.
-- 
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15 14:57   ` .../asm-i386/bitops.h performance improvements Bodo Eggert
@ 2005-06-15 15:30     ` Maciej W. Rozycki
  2005-06-15 16:06       ` Richard B. Johnson
  2005-06-15 23:53     ` cutaway
  1 sibling, 1 reply; 14+ messages in thread
From: Maciej W. Rozycki @ 2005-06-15 15:30 UTC (permalink / raw)
  To: 7eggert; +Cc: Gene Heskett, cutaway, linux-kernel

On Wed, 15 Jun 2005, Bodo Eggert wrote:

> lea is an 8086 instruction. All clones have it in its basic form. However,
> the multiplier is not documented for i486, therefore it will be an i586
> extension.

 Huh?  The SIB byte was added in the original i386, along with 32-bit
addressing.

  Maciej


* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15 15:30     ` Maciej W. Rozycki
@ 2005-06-15 16:06       ` Richard B. Johnson
  2005-06-15 16:29         ` Maciej W. Rozycki
  0 siblings, 1 reply; 14+ messages in thread
From: Richard B. Johnson @ 2005-06-15 16:06 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: 7eggert, Gene Heskett, cutaway, linux-kernel

On Wed, 15 Jun 2005, Maciej W. Rozycki wrote:

> On Wed, 15 Jun 2005, Bodo Eggert wrote:
>
>> lea is an 8086 instruction. All clones have it in its basic form. However,
>> the multiplier is not documented for i486, therefore it will be an i586
>> extension.
>
> Huh?  The SIB byte was added in the original i386, along with 32-bit
> addressing.
>
>  Maciej

Well the __documented__ '486 LEA instruction doesn't
even allow the double-register indirect. It's just

 	LEA r16,m
 	LEA r32,m

... repeated twice

Page 26-190,  Intel486(tm) Microprocessor Programmer's Reference
Manual. ISBN 1-55512-195-4. The instruction may have been one
of those "immature features", read broken.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.11.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by Dictator Bush.
                  98.36% of all statistics are fiction.


* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15 16:06       ` Richard B. Johnson
@ 2005-06-15 16:29         ` Maciej W. Rozycki
  2005-06-15 19:10           ` Bodo Eggert
  0 siblings, 1 reply; 14+ messages in thread
From: Maciej W. Rozycki @ 2005-06-15 16:29 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: 7eggert, Gene Heskett, cutaway, linux-kernel

On Wed, 15 Jun 2005, Richard B. Johnson wrote:

> Well the __documented__ '486 LEA instruction doesn't
> even allow the double-register indirect. It's just
> 
> LEA r16,m
> LEA r32,m
> 
> ... repeated twice
> 
> Page 26-190,  Intel486(tm) Microprocessor Programmer's Reference
> Manual. ISBN 1-55512-195-4. The instruction may have been one
> of those "immature features", read broken.

 And "m" is presumably described in detail elsewhere, as the semantics are
common to all instructions involving address calculation.  There is no
point in repeating the lengthy explanation for every instruction, is there?
Or would you prefer having each possible register and/or value of constant
arguments described for every instruction separately?

  Maciej


* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15 16:29         ` Maciej W. Rozycki
@ 2005-06-15 19:10           ` Bodo Eggert
  2005-06-16  3:26             ` Stephen Rothwell
  2005-06-16  7:10             ` Mikael Pettersson
  0 siblings, 2 replies; 14+ messages in thread
From: Bodo Eggert @ 2005-06-15 19:10 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Richard B. Johnson, Gene Heskett, cutaway, linux-kernel

On Wed, 15 Jun 2005, Maciej W. Rozycki wrote:
> On Wed, 15 Jun 2005, Richard B. Johnson wrote:

> > Well the __documented__ '486 LEA instruction doesn't
> > even allow the double-register indirect. It's just
> > 
> > LEA r16,m
> > LEA r32,m
> > 
> > ... repeated twice
> > 
> > Page 26-190,  Intel486(tm) Microprocessor Programmer's Reference
> > Manual. ISBN 1-55512-195-4. The instruction may have been one
> > of those "immature features", read broken.
> 
>  And "m" is presumably described in detail elsewhere, as the semantics are
> common to all instructions involving address calculation.

My documentation says:

lea reg16, mem
Available on 8086, 80186, 80286, 80386, 80486
32-bit-extension available
Opcode: 8D mod reg r/m

reg will be the target register (AX .. DI), and mod and r/m will select
something like a direct address, a register or a combination like 
BP+DI+ofs (I won't copy the table). A multiplier is not mentioned there.
-- 
Microwave: Signal from a friendly micro... 


* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15 14:57   ` .../asm-i386/bitops.h performance improvements Bodo Eggert
  2005-06-15 15:30     ` Maciej W. Rozycki
@ 2005-06-15 23:53     ` cutaway
  1 sibling, 0 replies; 14+ messages in thread
From: cutaway @ 2005-06-15 23:53 UTC (permalink / raw)
  To: 7eggert, Gene Heskett, linux-kernel

Ummm, in simple terms - this statement is flat out 100% wrong.

LEA with the SIB byte has been around since 386 and is included on every CPU
Linux is capable of running on.

Compile this using -m386 and look at the ASM listing file and convince
yourself.

unsigned int foo(int bar)
{
	return ((bar<<3)+bar);
}

GCC is going to generate a MOV of parm to EAX, then a LEA EAX,[EAX+EAX*8]

Don't trust me - compile this and prove it to yourself.


----- Original Message ----- 
From: "Bodo Eggert" <harvested.in.lkml@posting.7eggert.dyndns.org>
To: "Gene Heskett" <gene.heskett@verizon.net>; <cutaway@bellsouth.net>;
<linux->
>
> However, the multiplicator is not documented for i486, therefore it will
> be a i586 extension.



* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15 19:10           ` Bodo Eggert
@ 2005-06-16  3:26             ` Stephen Rothwell
  2005-06-16  7:10             ` Mikael Pettersson
  1 sibling, 0 replies; 14+ messages in thread
From: Stephen Rothwell @ 2005-06-16  3:26 UTC (permalink / raw)
  To: Bodo Eggert; +Cc: macro, linux-os, gene.heskett, cutaway, linux-kernel


On Wed, 15 Jun 2005 21:10:26 +0200 (CEST) Bodo Eggert <7eggert@gmx.de> wrote:
>
> My documentation says:
> 
> lea reg16, mem
> Available on 8086, 80186, 80286, 80386, 80486
> 32-bit-extension available
> Opcode: 8D mod reg r/m
> 
> reg will be the target register (AX .. DI), and mod and r/m will select
> something like a direct address, a register or a combination like 
> BP+DI+ofs (I won't copy the table). A multiplier is not mentioned there.

In 32 bit mode on the 386 and above, a two byte version of the "mod reg
r/m" is possible which contains the scaling field ...

On the 386, using a second register in the ea calculation costs another
cycle.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15 19:10           ` Bodo Eggert
  2005-06-16  3:26             ` Stephen Rothwell
@ 2005-06-16  7:10             ` Mikael Pettersson
  1 sibling, 0 replies; 14+ messages in thread
From: Mikael Pettersson @ 2005-06-16  7:10 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: Maciej W. Rozycki, Richard B. Johnson, Gene Heskett, cutaway,
	linux-kernel

Bodo Eggert writes:
 > My documentation says:
 > 
 > lea reg16, mem
 > Available on 8086, 80186, 80286, 80386, 80486
 > 32-bit-extension available
 > Opcode: 8D mod reg r/m
 > 
 > reg will be the target register (AX .. DI), and mod and r/m will select
 > something like a direct address, a register or a combination like 
 > BP+DI+ofs (I won't copy the table). A multiplier is not mentioned there.

You're looking at the wrong parts of the documentation. The 16-bit
mode ModR/M doesn't have SIB, but the 32-bit mode does. The SIB includes
the scaled index. All IA32 processors have it. The only LEA-related
quirk is that its ModR/M must not describe a non-memory operand.


* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15 15:34 ` Maciej W. Rozycki
@ 2005-06-15 23:48   ` cutaway
  0 siblings, 0 replies; 14+ messages in thread
From: cutaway @ 2005-06-15 23:48 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: linux-kernel

The only thing I've seen so far that loses is a straight original 486.  Its
pipe is somewhat more simple minded and interlock prone than later models.
PPro, K6 and 386SX are big winners.

On the 386s, which lack a real pipeline, the compactness of the LEA over the
SHL/ADD can be more of an overriding factor.


----- Original Message ----- 
From: "Maciej W. Rozycki" <macro@linux-mips.org>
To: <cutaway@bellsouth.net>
Cc: <linux-kernel@vger.kernel.org>
Sent: Wednesday, June 15, 2005 11:34
Subject: Re: .../asm-i386/bitops.h performance improvements


>
>  Be careful about model-specific penalties from using certain address
> modes and AGIs when using "lea" for such calculations.
>
>   Maciej



* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15 12:18 ` Gene Heskett
  2005-06-15 13:06   ` Richard B. Johnson
@ 2005-06-15 19:18   ` cutaway
  1 sibling, 0 replies; 14+ messages in thread
From: cutaway @ 2005-06-15 19:18 UTC (permalink / raw)
  To: Gene Heskett, linux-kernel

----- Original Message ----- 
From: "Gene Heskett" <gene.heskett@verizon.net>
To: <linux-kernel@vger.kernel.org>
> >
> To what cpu families does this apply?  eg, this may be true for intel,
> but what about amd, via etc?

You tell me -- I've included below a small benchmark that compares them.

These are the results I've gotten so far:

                      LEA       SHL/ADD
---------------------------------------
Pentium Pro 200       88sec     96sec
AMD K6/2-500          29sec     48sec
386SLC(386SX core)  2966sec   4932sec

If LEA isn't fast, those CPUs you mentioned have much bigger problems than
these two inline functions, because GCC always generates (with the kernel
default -O2 at least) an LEA for things like this:

unsigned int foo(unsigned int bar)
{
return ((bar<<3)+bar);
}

----------- LEA vs SHL/ADD ----------

#include <stdio.h>
#include <time.h>

#define ITERATIONS 2000000L

#define START  start = time(&start);
#define STOP   stop = time(&stop); delta = stop - start;
#define SUMMARY(s) printf(s " [%ld] seconds\n",(long)delta);
#define TESTLOOP for (i=0; i<ITERATIONS; i++)

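/* Shift the index left by 3 (multiply by 8), then add it to the accumulator. */
static inline void shl(void)
{
__asm__ __volatile__("shll $3,%edi; addl %edi,%eax");
}

/* Scaled add in one instruction: eax = eax + edi*8 (edi is left unchanged). */
static inline void lea(void)
{
__asm__ __volatile__("leal (%eax,%edi,8),%eax");
}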


int main(int argc, char *argv[], char *envp[])
{
time_t start, stop, delta;
int i;

START;
   TESTLOOP
 {
#undef  T
#define T shl();shl();shl();shl();shl();shl();shl();shl();shl();shl();
#define T100 T T T T T T T T T T
#define T1000 T100 T100 T100 T100 T100 T100 T100 T100 T100 T100

__asm__ __volatile__("pushl %eax");
__asm__ __volatile__("pushl %edi");
 T1000 T1000 T1000 T1000 T1000 T1000
__asm__ __volatile__("popl %edi");
__asm__ __volatile__("popl %eax");
 }
STOP;
SUMMARY("SHL/ADD");


/*---------------------------------------------------*/

START;
   TESTLOOP
 {
#undef  T
#define T lea();lea();lea();lea();lea();lea();lea();lea();lea();lea();
#define T100 T T T T T T T T T T
#define T1000 T100 T100 T100 T100 T100 T100 T100 T100 T100 T100

__asm__ __volatile__("pushl %eax");
__asm__ __volatile__("pushl %edi");
 T1000 T1000 T1000 T1000 T1000 T1000
__asm__ __volatile__("popl %edi");
__asm__ __volatile__("popl %eax");
 }
STOP;
SUMMARY("LEA");

return 0;
}




* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15  8:53 cutaway
  2005-06-15 12:18 ` Gene Heskett
@ 2005-06-15 15:34 ` Maciej W. Rozycki
  2005-06-15 23:48   ` cutaway
  1 sibling, 1 reply; 14+ messages in thread
From: Maciej W. Rozycki @ 2005-06-15 15:34 UTC (permalink / raw)
  To: cutaway; +Cc: linux-kernel

On Wed, 15 Jun 2005 cutaway@bellsouth.net wrote:

> In find_first_bit() there is this sequence:
> 
> shll $3,%%edi
> addl %%edi,%%eax
> 
> LEA knows how to multiply by small powers of 2 and add all in one shot very
> efficiently:
> 
> leal (%%eax,%%edi,8),%%eax

 Be careful about model-specific penalties from using certain address 
modes and AGIs when using "lea" for such calculations.

  Maciej


* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15 12:18 ` Gene Heskett
@ 2005-06-15 13:06   ` Richard B. Johnson
  2005-06-15 19:18   ` cutaway
  1 sibling, 0 replies; 14+ messages in thread
From: Richard B. Johnson @ 2005-06-15 13:06 UTC (permalink / raw)
  To: Gene Heskett; +Cc: Linux kernel, cutaway


LEA was designed for address calculation on ix86 processors.
If it is used to ready the value of an index register for the
next memory access, it can run in parallel with the next operations.
However, if it is just used to put a value into a register, where
the CPU can't proceed until that value is finalized, it does
nothing more useful than shifts and adds.

In other words, don't substitute LEA for INC or ADD just because
you can.

 	leal	0x04(%ebx), %ebx
... and
 	addl	$0x04, %ebx

... are functionally the same if the CPU needs the value in ebx
immediately. In the code sequence....

 	movl	(%ebx), %eax
 	leal	0x04(%ebx), %ebx	# Next address
 	xorl	%ecx, %eax
 	movl	%eax, (%ebx)

... the address calculation for the marked next address can proceed
in parallel with the xorl operation that follows. This makes LEA
helpful. However, in the following...

>> leal (%%eax,%%edi,8),%%eax

... the CPU needs to complete the whole operation before proceeding.
If you measure this form of LEA, with a base and a scaled index register,
you will find that the shift and add is faster, guaranteed.

On Wed, 15 Jun 2005, Gene Heskett wrote:

> On Wednesday 15 June 2005 04:53, cutaway@bellsouth.net wrote:
>> In find_first_bit() there is this sequence:
>>
>> shll $3,%%edi
>> addl %%edi,%%eax
>>
>> LEA knows how to multiply by small powers of 2 and add all in one
>> shot very efficiently:
>>
>> leal (%%eax,%%edi,8),%%eax
>>
>>
>> In find_first_zero_bit() the sequence:
>>
>> shll $3,%%edi
>> addl %%edi,%%edx
>>
>> could similarly become:
>>
>> leal (%%edx,%%edi,8),%%edx
>>
> To what cpu families does this apply?  eg, this may be true for intel,
> but what about amd, via etc?
>>
>>
>
> -- 
> Cheers, Gene
> "There are four boxes to be used in defense of liberty:
> soap, ballot, jury, and ammo. Please use in that order."
> -Ed Howdershelt (Author)
> 99.35% setiathome rank, not too shabby for a WV hillbilly
> Yahoo.com and AOL/TW attorneys please note, additions to the above
> message by Gene Heskett are:
> Copyright 2005 by Maurice Eugene Heskett, all rights reserved.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.11.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by Dictator Bush.
                  98.36% of all statistics are fiction.


* Re: .../asm-i386/bitops.h  performance improvements
  2005-06-15  8:53 cutaway
@ 2005-06-15 12:18 ` Gene Heskett
  2005-06-15 13:06   ` Richard B. Johnson
  2005-06-15 19:18   ` cutaway
  2005-06-15 15:34 ` Maciej W. Rozycki
  1 sibling, 2 replies; 14+ messages in thread
From: Gene Heskett @ 2005-06-15 12:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: cutaway

On Wednesday 15 June 2005 04:53, cutaway@bellsouth.net wrote:
>In find_first_bit() there is this sequence:
>
>shll $3,%%edi
>addl %%edi,%%eax
>
>LEA knows how to multiply by small powers of 2 and add all in one
> shot very efficiently:
>
>leal (%%eax,%%edi,8),%%eax
>
>
>In find_first_zero_bit() the sequence:
>
>shll $3,%%edi
>addl %%edi,%%edx
>
>could similarly become:
>
>leal (%%edx,%%edi,8),%%edx
>
To what cpu families does this apply?  eg, this may be true for intel, 
but what about amd, via etc?
>
>

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.35% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.


* .../asm-i386/bitops.h  performance improvements
@ 2005-06-15  8:53 cutaway
  2005-06-15 12:18 ` Gene Heskett
  2005-06-15 15:34 ` Maciej W. Rozycki
  0 siblings, 2 replies; 14+ messages in thread
From: cutaway @ 2005-06-15  8:53 UTC (permalink / raw)
  To: linux-kernel

In find_first_bit() there is this sequence:

shll $3,%%edi
addl %%edi,%%eax

LEA knows how to multiply by small powers of 2 and add all in one shot very
efficiently:

leal (%%eax,%%edi,8),%%eax


In find_first_zero_bit() the sequence:

shll $3,%%edi
addl %%edi,%%edx

could similarly become:

leal (%%edx,%%edi,8),%%edx



