* Re: .../asm-i386/bitops.h performance improvements
From: Bodo Eggert @ 2005-06-15 14:57 UTC
To: Gene Heskett, cutaway, linux-kernel

Gene Heskett <gene.heskett@verizon.net> wrote:

>> leal (%%edx,%%edi,8),%%edx
>
> To what cpu families does this apply? eg, this may be true for intel,
> but what about amd, via etc?

lea is an 8086 instruction. All clones have it in its basic form. However,
the multiplier is not documented for the i486, therefore it will be an i586
extension.
--
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.
* Re: .../asm-i386/bitops.h performance improvements
From: Maciej W. Rozycki @ 2005-06-15 15:30 UTC
To: 7eggert
Cc: Gene Heskett, cutaway, linux-kernel

On Wed, 15 Jun 2005, Bodo Eggert wrote:

> lea is an 8086 instruction. All clones have it in its basic form. However,
> the multiplier is not documented for the i486, therefore it will be an i586
> extension.

 Huh? The SIB byte has been added in the original i386 with 32-bit
addressing.

  Maciej
* Re: .../asm-i386/bitops.h performance improvements
From: Richard B. Johnson @ 2005-06-15 16:06 UTC
To: Maciej W. Rozycki
Cc: 7eggert, Gene Heskett, cutaway, linux-kernel

On Wed, 15 Jun 2005, Maciej W. Rozycki wrote:

> On Wed, 15 Jun 2005, Bodo Eggert wrote:
>
>> lea is an 8086 instruction. All clones have it in its basic form. However,
>> the multiplier is not documented for the i486, therefore it will be an i586
>> extension.
>
> Huh? The SIB byte has been added in the original i386 with 32-bit
> addressing.
>
>  Maciej

Well the __documented__ '486 LEA instruction doesn't
even allow the double-register indirect. It's just

	LEA r16,m
	LEA r32,m

... repeated twice

Page 26-190, Intel486(tm) Microprocessor Programmer's Reference
Manual. ISBN 1-55512-195-4. The instruction may have been one
of those "immature features", read broken.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.11.9 on an i686 machine (5537.79 BogoMips).
 Notice : All mail here is now cached for review by Dictator Bush.
                 98.36% of all statistics are fiction.
* Re: .../asm-i386/bitops.h performance improvements
From: Maciej W. Rozycki @ 2005-06-15 16:29 UTC
To: Richard B. Johnson
Cc: 7eggert, Gene Heskett, cutaway, linux-kernel

On Wed, 15 Jun 2005, Richard B. Johnson wrote:

> Well the __documented__ '486 LEA instruction doesn't
> even allow the double-register indirect. It's just
>
> 	LEA r16,m
> 	LEA r32,m
>
> ... repeated twice
>
> Page 26-190, Intel486(tm) Microprocessor Programmer's Reference
> Manual. ISBN 1-55512-195-4. The instruction may have been one
> of those "immature features", read broken.

 And "m" is presumably described in detail elsewhere, as the semantics are
common to all instructions involving address calculation. There is no
point in repeating the lengthy explanation for every instruction, is
there? Or would you prefer having each possible register and/or value of
constant arguments described for every instruction separately?

  Maciej
* Re: .../asm-i386/bitops.h performance improvements
From: Bodo Eggert @ 2005-06-15 19:10 UTC
To: Maciej W. Rozycki
Cc: Richard B. Johnson, Gene Heskett, cutaway, linux-kernel

On Wed, 15 Jun 2005, Maciej W. Rozycki wrote:
> On Wed, 15 Jun 2005, Richard B. Johnson wrote:

> > Well the __documented__ '486 LEA instruction doesn't
> > even allow the double-register indirect. It's just
> >
> > 	LEA r16,m
> > 	LEA r32,m
> >
> > ... repeated twice
> >
> > Page 26-190, Intel486(tm) Microprocessor Programmer's Reference
> > Manual. ISBN 1-55512-195-4. The instruction may have been one
> > of those "immature features", read broken.
>
>  And "m" is presumably described in detail elsewhere, as the semantics are
> common to all instructions involving address calculation.

My documentation says:

lea reg16, mem
Available on 8086, 80186, 80286, 80386, 80486
32-bit-extension available
Opcode: 8D mod reg r/m

reg will be the target register (AX .. DI), and mod and r/m will select
something like a direct address, a register or a combination like
BP+DI+ofs (I won't copy the table). A multiplier is not mentioned there.
--
Microwave: Signal from a friendly micro...
* Re: .../asm-i386/bitops.h performance improvements
From: Stephen Rothwell @ 2005-06-16 3:26 UTC
To: Bodo Eggert
Cc: macro, linux-os, gene.heskett, cutaway, linux-kernel

On Wed, 15 Jun 2005 21:10:26 +0200 (CEST) Bodo Eggert <7eggert@gmx.de> wrote:
>
> My documentation says:
>
> lea reg16, mem
> Available on 8086, 80186, 80286, 80386, 80486
> 32-bit-extension available
> Opcode: 8D mod reg r/m
>
> reg will be the target register (AX .. DI), and mod and r/m will select
> something like a direct address, a register or a combination like
> BP+DI+ofs (I won't copy the table). A multiplier is not mentioned there.

In 32 bit mode on the 386 and above, a two byte version of the "mod reg
r/m" is possible which contains the scaling field ...

On the 386, using a second register in the ea calculation costs another
cycle.

--
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
* Re: .../asm-i386/bitops.h performance improvements
From: Mikael Pettersson @ 2005-06-16 7:10 UTC
To: Bodo Eggert
Cc: Maciej W. Rozycki, Richard B. Johnson, Gene Heskett, cutaway, linux-kernel

Bodo Eggert writes:
 > My documentation says:
 >
 > lea reg16, mem
 > Available on 8086, 80186, 80286, 80386, 80486
 > 32-bit-extension available
 > Opcode: 8D mod reg r/m
 >
 > reg will be the target register (AX .. DI), and mod and r/m will select
 > something like a direct address, a register or a combination like
 > BP+DI+ofs (I won't copy the table). A multiplier is not mentioned there.

You're looking at the wrong parts of the documentation.
The 16-bit mode ModR/M doesn't have SIB, but the 32-bit
mode does. The SIB includes the scaled index. All IA32
processors have it.

The only LEA-related quirk is that its ModR/M must not
describe a non-memory operand.
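The scaled index described above sits in the SIB byte that follows ModR/M in 32-bit addressing: bits 7..6 hold the scale, bits 5..3 the index register, bits 2..0 the base register. A minimal sketch in C of that split follows; `sib_fields` and `sib_decode` are hypothetical names chosen for illustration and do not come from any kernel header or from this thread.

```c
#include <stdint.h>

/* Fields of an x86 SIB (scale-index-base) byte, 32-bit addressing.
 * Hypothetical helper type, for illustration only. */
struct sib_fields {
    unsigned scale;  /* multiplier applied to the index register: 1, 2, 4 or 8 */
    unsigned index;  /* index register number (7 = EDI) */
    unsigned base;   /* base register number (0 = EAX) */
};

/* Split a SIB byte into scale (bits 7..6), index (bits 5..3), base (bits 2..0). */
static struct sib_fields sib_decode(uint8_t sib)
{
    struct sib_fields f;
    f.scale = 1u << (sib >> 6);   /* two-bit field encodes log2 of the scale */
    f.index = (sib >> 3) & 7u;
    f.base  = sib & 7u;
    return f;
}
```

For example, `sib_decode(0xF8)` gives scale 8, index 7 (EDI) and base 0 (EAX), which is the SIB byte of an `(%eax,%edi,8)` operand.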
* Re: .../asm-i386/bitops.h performance improvements
From: cutaway @ 2005-06-15 23:53 UTC
To: 7eggert, Gene Heskett, linux-kernel

Ummm, in simple terms - this statement is flat out 100% wrong.

LEA with the SIB byte has been around since the 386 and is included on every
CPU Linux is capable of running on.

Compile this using -m386, look at the ASM listing file, and convince
yourself.

unsigned int foo(int bar)
{
    return ((bar<<3)+bar);
}

GCC is going to generate a MOV of parm to EAX, then a LEA EAX,[EAX+EAX*8]

Don't trust me - compile this and prove it to yourself.

----- Original Message -----
From: "Bodo Eggert" <harvested.in.lkml@posting.7eggert.dyndns.org>
To: "Gene Heskett" <gene.heskett@verizon.net>; <cutaway@bellsouth.net>; <linux->

>
> However, the multiplier is not documented for the i486, therefore it will
> be an i586 extension.
* .../asm-i386/bitops.h performance improvements
From: cutaway @ 2005-06-15 8:53 UTC
To: linux-kernel

In find_first_bit() there exists this sequence:

shll $3,%%edi
addl %%edi,%%eax

LEA knows how to multiply by small powers of 2 and add all in one shot very
efficiently:

leal (%%eax,%%edi,8),%%eax


In find_first_zero_bit() the sequence:

shll $3,%%edi
addl %%edi,%%edx

could similarly become:

leal (%%edx,%%edi,8),%%edx
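To make the arithmetic behind the proposal concrete, here is a small sketch in plain C of what each form leaves in the destination register. The function names `shl_add` and `lea_scaled` are hypothetical, and this models only the 32-bit result, not timing.

```c
#include <stdint.h>

/* Two-instruction form: shift the index left by 3, then add it to the base.
 * Models the value left in EAX by "shll $3,%edi; addl %edi,%eax". */
static uint32_t shl_add(uint32_t base, uint32_t index)
{
    index <<= 3;          /* the shift overwrites the index register */
    return base + index;  /* wraps modulo 2^32, like the 32-bit ADD */
}

/* One-instruction form: base + index*8.
 * Models the value left in EAX by "leal (%eax,%edi,8),%eax". */
static uint32_t lea_scaled(uint32_t base, uint32_t index)
{
    return base + index * 8u;  /* index itself is left unchanged */
}
```

Both functions return the same value for every input. The two instruction sequences still differ in side effects: SHL/ADD overwrites the index register and the flags, while LEA changes neither, so whether the substitution is safe depends on what the surrounding code expects in those places.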
* Re: .../asm-i386/bitops.h performance improvements
From: Gene Heskett @ 2005-06-15 12:18 UTC
To: linux-kernel
Cc: cutaway

On Wednesday 15 June 2005 04:53, cutaway@bellsouth.net wrote:
>In find_first_bit() there exists this sequence:
>
>shll $3,%%edi
>addl %%edi,%%eax
>
>LEA knows how to multiply by small powers of 2 and add all in one
> shot very efficiently:
>
>leal (%%eax,%%edi,8),%%eax
>
>
>In find_first_zero_bit() the sequence:
>
>shll $3,%%edi
>addl %%edi,%%edx
>
>could similarly become:
>
>leal (%%edx,%%edi,8),%%edx
>
To what cpu families does this apply? eg, this may be true for intel,
but what about amd, via etc?
>
>
>-
>To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.35% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.
* Re: .../asm-i386/bitops.h performance improvements
From: Richard B. Johnson @ 2005-06-15 13:06 UTC
To: Gene Heskett
Cc: Linux kernel, cutaway

LEA was designed for address calculation on ix86 processors.
If it is used to ready the value of an index register for
the next memory access, it can run in parallel with the
next operations. However, if it is just used to put a value
into a register, where the CPU can't proceed until that
value is finalized, it does nothing more useful than
shifts and adds. In other words, don't substitute LEA
for INC or ADD just because you can.

	leal	0x04(%ebx), %ebx
... and
	addl	$0x04, %ebx

... are functionally the same if the CPU needs the value in ebx
immediately.

In the code sequence....

	movl	(%ebx), %eax
	leal	0x04(%ebx), %ebx	# Next address
	xorl	%ecx, %eax
	movl	%eax, (%ebx)

... the address calculation for the marked next address can proceed
in parallel with the xorl operation that follows. This makes
LEA helpful. However, in the following...

>> leal (%%eax,%%edi,8),%%eax

... the CPU needs to complete the whole operation before
proceeding. If you measure this LEA with two index registers,
you will find that the shift and add is faster, guaranteed.


On Wed, 15 Jun 2005, Gene Heskett wrote:

> On Wednesday 15 June 2005 04:53, cutaway@bellsouth.net wrote:
>> In find_first_bit() there exists this sequence:
>>
>> shll $3,%%edi
>> addl %%edi,%%eax
>>
>> LEA knows how to multiply by small powers of 2 and add all in one
>> shot very efficiently:
>>
>> leal (%%eax,%%edi,8),%%eax
>>
>>
>> In find_first_zero_bit() the sequence:
>>
>> shll $3,%%edi
>> addl %%edi,%%edx
>>
>> could similarly become:
>>
>> leal (%%edx,%%edi,8),%%edx
>>
> To what cpu families does this apply? eg, this may be true for intel,
> but what about amd, via etc?

Cheers,
Dick Johnson
* Re: .../asm-i386/bitops.h performance improvements
From: cutaway @ 2005-06-15 19:18 UTC
To: Gene Heskett, linux-kernel

----- Original Message -----
From: "Gene Heskett" <gene.heskett@verizon.net>
To: <linux-kernel@vger.kernel.org>
> >
> To what cpu families does this apply? eg, this may be true for intel,
> but what about amd, via etc?

You tell me -- I've included below a small benchmark that compares them.

These are the results I've gotten so far:

                       LEA       SHL/ADD
---------------------------------------
Pentium Pro 200        88sec     96sec
AMD K6/2-500           29sec     48sec
386SLC(386SX core)     2966sec   4932sec

If LEA isn't fast, those CPUs you mentioned have much bigger problems than
these two inline functions, because GCC always generates (with the kernel
default -O2 at least) an LEA for things like this:

unsigned int foo(unsigned int bar)
{
    return ((bar<<3)+bar);
}


----------- LEA vs SHL/ADD ----------

#include <stdio.h>
#include <time.h>

#define ITERATIONS 2000000L
#define START start = time(&start);
#define STOP stop = time(&stop); delta = stop - start;
#define SUMMARY(s) printf(s " [%ld] seconds\n",delta);
#define TESTLOOP for (i=0; i<ITERATIONS; i++)

static void inline shl(void)
{
    __asm__("shll $3,%edi; addl %edi,%eax");
}

static void inline lea(void)
{
    __asm__("leal (%eax,%edi,8),%eax");
}

int main(int argc, char *argv[], char *envp[])
{
    time_t start, stop, delta;
    int i;

    START;
    TESTLOOP
    {
#undef T
#define T shl();shl();shl();shl();shl();shl();shl();shl();shl();shl();
/* eleven T's as posted, so T100 expands to 110 calls (same in both loops) */
#define T100 T T T T T T T T T T T
#define T1000 T100 T100 T100 T100 T100 T100 T100 T100 T100 T100

        __asm__ __volatile__("pushl %eax");
        __asm__ __volatile__("pushl %edi");
        T1000 T1000 T1000 T1000 T1000 T1000
        __asm__ __volatile__("popl %edi");
        __asm__ __volatile__("popl %eax");
    }
    STOP;
    SUMMARY("SHL/ADD");

    /*---------------------------------------------------*/

    START;
    TESTLOOP
    {
#undef T
#define T lea();lea();lea();lea();lea();lea();lea();lea();lea();lea();
#define T100 T T T T T T T T T T T
#define T1000 T100 T100 T100 T100 T100 T100 T100 T100 T100 T100

        __asm__ __volatile__("pushl %eax");
        __asm__ __volatile__("pushl %edi");
        T1000 T1000 T1000 T1000 T1000 T1000
        __asm__ __volatile__("popl %edi");
        __asm__ __volatile__("popl %eax");
    }
    STOP;
    SUMMARY("LEA");

    return 0;
}
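The asm statements in the benchmark above name %eax and %edi directly and do not tell the compiler which registers or flags they touch. As a hedged sketch only (this is not the code that produced the timings above), the same two sequences can be written with GCC extended-asm operands so the compiler picks the registers and knows about the side effects. The names `step_shl_add` and `step_lea` are hypothetical, and a plain C fallback is assumed for non-x86 or non-GNU compilers.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))

/* acc += idx * 8 using SHL then ADD.
 * idx is read and written ("+r"), and the flags are clobbered ("cc"). */
static inline uint32_t step_shl_add(uint32_t acc, uint32_t idx)
{
    __asm__ __volatile__("shll $3, %1\n\t"
                         "addl %1, %0"
                         : "+r"(acc), "+r"(idx)
                         :
                         : "cc");
    return acc;
}

/* acc += idx * 8 using a single LEA with a scaled index.
 * idx is only read, and LEA does not modify the flags. */
static inline uint32_t step_lea(uint32_t acc, uint32_t idx)
{
    __asm__ __volatile__("leal (%0,%1,8), %0"
                         : "+r"(acc)
                         : "r"(idx));
    return acc;
}

#else

/* Portable fallback (assumption: used only where the inline asm above
 * is unavailable); computes the same 32-bit results in plain C. */
static inline uint32_t step_shl_add(uint32_t acc, uint32_t idx)
{
    idx <<= 3;
    return acc + idx;
}

static inline uint32_t step_lea(uint32_t acc, uint32_t idx)
{
    return acc + idx * 8u;
}

#endif
```

Calling `acc = step_lea(acc, idx);` in a loop forms a dependency chain through `acc`, much as the repeated updates of %eax do in the posted benchmark. Absolute timings will still depend on the CPU model and on the code the compiler emits around each asm statement.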
* Re: .../asm-i386/bitops.h performance improvements
From: Maciej W. Rozycki @ 2005-06-15 15:34 UTC
To: cutaway
Cc: linux-kernel

On Wed, 15 Jun 2005 cutaway@bellsouth.net wrote:

> In find_first_bit() there exists this sequence:
>
> shll $3,%%edi
> addl %%edi,%%eax
>
> LEA knows how to multiply by small powers of 2 and add all in one shot very
> efficiently:
>
> leal (%%eax,%%edi,8),%%eax

 Be careful about model-specific penalties from using certain address
modes and AGIs when using "lea" for such calculations.

  Maciej
* Re: .../asm-i386/bitops.h performance improvements
From: cutaway @ 2005-06-15 23:48 UTC
To: Maciej W. Rozycki
Cc: linux-kernel

The only thing I've seen so far that loses is a straight original 486. Its
pipe is somewhat more simple-minded and interlock-prone than later models.

PPro, K6 and 386SX are big winners. On the 386s, which lack an actual
pipeline, the compactness of the LEA over the SHL/ADD can be more of an
overriding factor.

----- Original Message -----
From: "Maciej W. Rozycki" <macro@linux-mips.org>
To: <cutaway@bellsouth.net>
Cc: <linux-kernel@vger.kernel.org>
Sent: Wednesday, June 15, 2005 11:34
Subject: Re: .../asm-i386/bitops.h performance improvements

>
> Be careful about model-specific penalties from using certain address
> modes and AGIs when using "lea" for such calculations.
>
> Maciej