* [BK+PATCH] remove __constant_memcpy
@ 2003-04-17  0:57 Jeff Garzik
  2003-04-17  1:04 ` Jeff Garzik
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-04-17  0:57 UTC
  To: Linus Torvalds; +Cc: LKML

[-- Attachment #1: Type: text/plain, Size: 1763 bytes --]

Linus,

Please review the patch below, then do a

	bk pull http://gkernel.bkbits.net/misc-2.5

Summary:

gcc's __builtin_memcpy performs the same function (and more) as the 
kernel's __constant_memcpy.  So, let's remove __constant_memcpy, and let 
the compiler do it.

Instead of shouldering the burden of the kernel needing to have a 
decently-fast memcpy routine, I would prefer to hand off that 
maintenance burden to the compiler.  For the less common (read: 
non-Intel) processors, I bet this patch shows immediate asm-level 
benefits in instruction scheduling.

The patch below is the conservative, obvious patch.  It only kicks in 
when __builtin_constant_p() is true, and it only applies to the i386 
arch.  I'm currently running w/ 2.5.67+BK+patch and it's stable.  With 
some recently-acquired (but still nascent) x86 asm skills, I diff'd the 
before-and-after x86 asm cases for when a constant memcpy() call was 
made in the kernel code; nothing interesting.  The instruction sequence 
was usually longer once you exceeded ~32 byte memcpy, but it looked like 
it scheduled better on i686.  The small-copy cases looked reasonably 
equivalent.

The more radical direction, where I would eventually like to go, is to 
hand off all memcpy duties to the compiler, and -march=xxx selects the 
best memcpy strategies.  This "radical" direction requires a lot more 
work, benching both the kernel and gcc before and after the memcpy changes.

Finally, on a compiler note, __builtin_memcpy can fall back to emitting 
a memcpy function call.  Given the conservatism of my patch, this is 
unlikely, but it should be mentioned.  This also gives less-capable 
compilers the ability to simplify, by choosing the slow path of 
unconditionally emitting a memcpy call.
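
A minimal user-space sketch of the dispatch described above; it is not taken from the patch. The names copy_dispatch, generic_copy and generic_calls are made up for illustration, and generic_copy is only a byte-loop stand-in for the kernel's out-of-line __memcpy(). It assumes gcc, or another compiler that provides __builtin_constant_p and __builtin_memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts trips through the out-of-line path (illustration only). */
static unsigned int generic_calls;

/* Hypothetical stand-in for the kernel's out-of-line __memcpy(). */
static void *generic_copy(void *to, const void *from, size_t n)
{
	unsigned char *d = to;
	const unsigned char *s = from;

	generic_calls++;
	while (n--)
		*d++ = *s++;
	return to;
}

/*
 * Same shape as the patched macro: sizes the compiler knows to be
 * constant go to the gcc builtin, everything else goes to the
 * hand-written routine.
 */
#define copy_dispatch(t, f, n) \
	(__builtin_constant_p(n) ? \
	 __builtin_memcpy((t), (f), (n)) : \
	 generic_copy((t), (f), (n)))

/* Copy a 6-byte Ethernet address; the size is a literal constant. */
static int copy_mac_ok(void)
{
	unsigned char src[6] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };
	unsigned char dst[6] = { 0 };

	(void)copy_dispatch(dst, src, 6);
	return memcmp(dst, src, sizeof(src)) == 0;
}

/* Copy with a size passed in as a function argument. */
static int copy_runtime_ok(size_t n)
{
	char src[32] = "constant versus variable sizes";
	char dst[32] = { 0 };

	if (n > sizeof(src))
		n = sizeof(src);
	(void)copy_dispatch(dst, src, n);
	return memcmp(dst, src, n) == 0;
}
```

With a literal size the condition is a compile-time 1, so generic_copy is never reached from copy_mac_ok. Which path copy_runtime_ok takes can depend on optimization level and inlining, so the sketch only checks that the bytes arrive.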

[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 3492 bytes --]

# --------------------------------------------
# 03/04/16	jgarzik@redhat.com	1.1067.1.1
# [ia32] remove __constant_memcpy, use __builtin_memcpy
# 
# gcc's memcpy handling already takes into account the cases that
# include/asm-i386/string.h's __constant_memcpy takes into
# account.  __constant_memcpy is removed, and it is replaced
# with references to __builtin_memcpy.
# 
# Compilers that do not/cannot optimize __builtin_memcpy can choose
# the slow path and simply emit a memcpy function call.
# --------------------------------------------
#
diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
--- a/include/asm-i386/string.h	Wed Apr 16 20:19:42 2003
+++ b/include/asm-i386/string.h	Wed Apr 16 20:19:42 2003
@@ -208,75 +208,6 @@
 return (to);
 }
 
-/*
- * This looks horribly ugly, but the compiler can optimize it totally,
- * as the count is constant.
- */
-static inline void * __constant_memcpy(void * to, const void * from, size_t n)
-{
-	switch (n) {
-		case 0:
-			return to;
-		case 1:
-			*(unsigned char *)to = *(const unsigned char *)from;
-			return to;
-		case 2:
-			*(unsigned short *)to = *(const unsigned short *)from;
-			return to;
-		case 3:
-			*(unsigned short *)to = *(const unsigned short *)from;
-			*(2+(unsigned char *)to) = *(2+(const unsigned char *)from);
-			return to;
-		case 4:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			return to;
-		case 6:	/* for Ethernet addresses */
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(2+(unsigned short *)to) = *(2+(const unsigned short *)from);
-			return to;
-		case 8:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			return to;
-		case 12:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			return to;
-		case 16:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			*(3+(unsigned long *)to) = *(3+(const unsigned long *)from);
-			return to;
-		case 20:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			*(3+(unsigned long *)to) = *(3+(const unsigned long *)from);
-			*(4+(unsigned long *)to) = *(4+(const unsigned long *)from);
-			return to;
-	}
-#define COMMON(x) \
-__asm__ __volatile__( \
-	"rep ; movsl" \
-	x \
-	: "=&c" (d0), "=&D" (d1), "=&S" (d2) \
-	: "0" (n/4),"1" ((long) to),"2" ((long) from) \
-	: "memory");
-{
-	int d0, d1, d2;
-	switch (n % 4) {
-		case 0: COMMON(""); return to;
-		case 1: COMMON("\n\tmovsb"); return to;
-		case 2: COMMON("\n\tmovsw"); return to;
-		default: COMMON("\n\tmovsw\n\tmovsb"); return to;
-	}
-}
-  
-#undef COMMON
-}
-
 #define __HAVE_ARCH_MEMCPY
 
 #ifdef CONFIG_X86_USE_3DNOW
@@ -290,7 +221,7 @@
 static inline void * __constant_memcpy3d(void * to, const void * from, size_t len)
 {
 	if (len < 512)
-		return __constant_memcpy(to, from, len);
+		return __builtin_memcpy(to, from, len);
 	return _mmx_memcpy(to, from, len);
 }
 
@@ -314,7 +245,7 @@
  
 #define memcpy(t, f, n) \
 (__builtin_constant_p(n) ? \
- __constant_memcpy((t),(f),(n)) : \
+ __builtin_memcpy((t),(f),(n)) : \
  __memcpy((t),(f),(n)))
 
 #endif


* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17  0:57 [BK+PATCH] remove __constant_memcpy Jeff Garzik
@ 2003-04-17  1:04 ` Jeff Garzik
  2003-04-17  2:06 ` Linus Torvalds
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-04-17  1:04 UTC
  To: Linus Torvalds; +Cc: LKML

Jeff Garzik wrote:
> Linus,
> 
> Please review the patch below, then do a
> 
>     bk pull http://gkernel.bkbits.net/misc-2.5
> 
> Summary:
> 
> gcc's __builtin_memcpy performs the same function (and more) as the 
> kernel's __constant_memcpy.  So, let's remove __constant_memcpy, and let 
> the compiler do it.


Oh, and __builtin_memcpy exists on gcc 2.95.x, which is the current 2.5 
minimum.

	Jeff





* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17  0:57 [BK+PATCH] remove __constant_memcpy Jeff Garzik
  2003-04-17  1:04 ` Jeff Garzik
@ 2003-04-17  2:06 ` Linus Torvalds
  2003-04-17  8:46   ` Arjan van de Ven
  2003-04-17 13:17 ` Alan Cox
  2003-04-17 13:17 ` Alan Cox
  3 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2003-04-17  2:06 UTC
  To: Jeff Garzik; +Cc: LKML


On Wed, 16 Apr 2003, Jeff Garzik wrote:
> 
> gcc's __builtin_memcpy performs the same function (and more) as the 
> kernel's __constant_memcpy.  So, let's remove __constant_memcpy, and let 
> the compiler do it.

Please don't.

There's no way gcc will EVER get the SSE2 cases right. It just cannot do 
it. In fact, I live in fear that we will have to turn off the compiler 
intrinsics entirely some day just because there is always the worry that 
gcc will start using FP.

So the advantage of doing our own memcpy() is not that it's necessarily 
faster than the gcc built-in, but simply because I do not believe that the 
gcc people care enough about the kernel to let them make the choice.

		Linus



* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17  2:06 ` Linus Torvalds
@ 2003-04-17  8:46   ` Arjan van de Ven
  2003-04-17  9:02     ` Roman Zippel
  2003-04-17 16:07     ` Linus Torvalds
  0 siblings, 2 replies; 27+ messages in thread
From: Arjan van de Ven @ 2003-04-17  8:46 UTC
  To: Linus Torvalds; +Cc: Jeff Garzik, LKML

[-- Attachment #1: Type: text/plain, Size: 844 bytes --]

On Thu, 2003-04-17 at 04:06, Linus Torvalds wrote:
> On Wed, 16 Apr 2003, Jeff Garzik wrote:
> > 
> > gcc's __builtin_memcpy performs the same function (and more) as the 
> > kernel's __constant_memcpy.  So, let's remove __constant_memcpy, and let 
> > the compiler do it.
> 
> Please don't.
> 
> There's no way gcc will EVER get the SSE2 cases right. It just cannot do 
> it. In fact, I live in fear that we will have to turn off the compiler 
> intrinsics entirely some day just because there is always the worry that 
> gcc will start using FP.

it can do that ANYWAY for all kinds of things.
We really should ask the gcc folks to add a
-fdontyoudareusefloatingpoint flag (well different name probably) to
make sure it never ever will generate FP code. (would also help catch
abusers of FP code in the kernel as a bonus ;)


* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17  8:46   ` Arjan van de Ven
@ 2003-04-17  9:02     ` Roman Zippel
  2003-04-17  9:04       ` Arjan van de Ven
  2003-04-17 16:07     ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Roman Zippel @ 2003-04-17  9:02 UTC
  To: Arjan van de Ven; +Cc: Linus Torvalds, Jeff Garzik, LKML

Hi,

On 17 Apr 2003, Arjan van de Ven wrote:

> it can do that ANYWAY for all kinds of things.
> We really should ask the gcc folks to add a
> -fdontyoudareusefloatingpoint flag (well different name probably) to
> make sure it never ever will generate FP code. (would also help catch
> abusers of FP code in the kernel as a bonus ;)

-msoft-float?

bye, Roman



* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17  9:02     ` Roman Zippel
@ 2003-04-17  9:04       ` Arjan van de Ven
  2003-04-17  9:11         ` Jakub Jelinek
  0 siblings, 1 reply; 27+ messages in thread
From: Arjan van de Ven @ 2003-04-17  9:04 UTC
  To: Roman Zippel; +Cc: Arjan van de Ven, Linus Torvalds, Jeff Garzik, LKML

On Thu, Apr 17, 2003 at 11:02:43AM +0200, Roman Zippel wrote:
> Hi,
> 
> On 17 Apr 2003, Arjan van de Ven wrote:
> 
> > it can do that ANYWAY for all kinds of things.
> > We really should ask the gcc folks to add a
> > -fdontyoudareusefloatingpoint flag (well different name probably) to
> > make sure it never ever will generate FP code. (would also help catch
> > abusers of FP code in the kernel as a bonus ;)
> 
> -msoft-float?

that is a decent start but has a different effect, e.g. it doesn't actually
forbid gcc from generating FPU code, it just tells it to use emu lib functions
for it.


* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17  9:04       ` Arjan van de Ven
@ 2003-04-17  9:11         ` Jakub Jelinek
  0 siblings, 0 replies; 27+ messages in thread
From: Jakub Jelinek @ 2003-04-17  9:11 UTC
  To: Arjan van de Ven; +Cc: Roman Zippel, Linus Torvalds, Jeff Garzik, LKML

On Thu, Apr 17, 2003 at 09:04:44AM +0000, Arjan van de Ven wrote:
> On Thu, Apr 17, 2003 at 11:02:43AM +0200, Roman Zippel wrote:
> > Hi,
> > 
> > On 17 Apr 2003, Arjan van de Ven wrote:
> > 
> > > it can do that ANYWAY for all kinds of things.
> > > We really should ask the gcc folks to add a
> > > -fdontyoudareusefloatingpoint flag (well different name probably) to
> > > make sure it never ever will generate FP code. (would also help catch
> > > abusers of FP code in the kernel as a bonus ;)
> > 
> > -msoft-float?
> 
> that is a decent start but has a different effect, e.g. it doesn't actually
> forbid gcc from generating FPU code, it just tells it to use emu lib functions
> for it.

But the emu lib functions aren't provided in the kernel, so you'll get
errors anyway.
On some targets the option is -mno-fpu though (e.g. sparc* kernels have
used it for eons).

	Jakub


* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17  0:57 [BK+PATCH] remove __constant_memcpy Jeff Garzik
  2003-04-17  1:04 ` Jeff Garzik
  2003-04-17  2:06 ` Linus Torvalds
@ 2003-04-17 13:17 ` Alan Cox
  2003-04-17 13:17 ` Alan Cox
  3 siblings, 0 replies; 27+ messages in thread
From: Alan Cox @ 2003-04-17 13:17 UTC
  To: Jeff Garzik; +Cc: Linus Torvalds, LKML

On Thu, 2003-04-17 at 01:57, Jeff Garzik wrote:
> The patch below is the conservative, obvious patch.  It only kicks in 
> when __builtin_constant_p() is true, and it only applies to the i386 
> arch.  

You are assuming the compiler is smart about stuff - it doesn't know
SSE/MMX for page copies etc. For small copies it should always win, but
isn't it best if so to use __builtin_memcpy without our existing
macros not just trust the compiler ?



* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 13:17 ` Alan Cox
@ 2003-04-17 14:32   ` Jeff Garzik
  2003-04-17 14:40     ` Jeff Garzik
  2003-04-17 20:01   ` H. Peter Anvin
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Garzik @ 2003-04-17 14:32 UTC
  To: Alan Cox; +Cc: Linus Torvalds, LKML

On Thu, Apr 17, 2003 at 02:17:16PM +0100, Alan Cox wrote:
> On Thu, 2003-04-17 at 01:57, Jeff Garzik wrote:
> > The patch below is the conservative, obvious patch.  It only kicks in 
> > when __builtin_constant_p() is true, and it only applies to the i386 
> > arch.  
> 
> You are assuming the compiler is smart about stuff - it doesn't know
> SSE/MMX for page copies etc. For small copies it should always win, but

Prior to my patch, __constant_memcpy was -already- only used for small,
constant-size copies.

Therefore, my patch applied __builtin_memcpy only to small,
constant-size copies.  The existing kernel custom-memcpy code continued
to perform as expected.

You and Linus both seem to think MMX/SSE/SSE2 is somehow in the
equation, but I do not see that at all.  I left those paths alone.
Clarification/LART requested...


> isn't it best if so to use __builtin_memcpy without our existing
> macros not just trust the compiler ?

hum, I didn't parse this at all:
Use of __builtin_memcpy implies trusting the compiler :)

Maybe you meant s/without/with/ ?

	Jeff





* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 14:32   ` Jeff Garzik
@ 2003-04-17 14:40     ` Jeff Garzik
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-04-17 14:40 UTC
  To: Alan Cox; +Cc: Linus Torvalds, LKML

On Thu, Apr 17, 2003 at 10:32:02AM -0400, Jeff Garzik wrote:
> On Thu, Apr 17, 2003 at 02:17:16PM +0100, Alan Cox wrote:
> > isn't it best if so to use __builtin_memcpy without our existing
> > macros not just trust the compiler ?

> hum, I didn't parse this at all:
> Use of __builtin_memcpy implies trusting the compiler :)
> 
> Maybe you meant s/without/with/ ?

And further, if you did indeed mean s/without/with/ ...
that's -exactly- what my submitted patch did.

	Jeff





* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17  8:46   ` Arjan van de Ven
  2003-04-17  9:02     ` Roman Zippel
@ 2003-04-17 16:07     ` Linus Torvalds
  2003-04-17 19:07       ` Jeff Garzik
  2003-04-17 19:19       ` Jeff Garzik
  1 sibling, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2003-04-17 16:07 UTC
  To: Arjan van de Ven; +Cc: Jeff Garzik, LKML


On 17 Apr 2003, Arjan van de Ven wrote:
> 
> it can do that ANYWAY for all kinds of things.
> We really should ask the gcc folks to add a
> -fdontyoudareusefloatingpoint flag (well different name probably)

Well, _most_ architectures actually have that flag already. It's not 
called -fdontyoudareusefloatingpoint on any of them, though ;)

It's most commonly called "-mno-fpu", but sh calls it "-mno-implicit-fp",
and alpha calls it "-mno-fp-regs".

On x86, gcc doesn't have such an option, although "-mno-sse" and
"-mno-sse2" probably come closest (and we should probably use them, but
since older gcc's don't know about it and it hasn't been an issue yet we
haven't).

HOWEVER, that doesn't fix the memcpy() issue. The fact is, the kernel 
_can_ and does use SSE instructions - it's just that it has to do magic 
crap before it does so. 

		Linus



* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 16:07     ` Linus Torvalds
@ 2003-04-17 19:07       ` Jeff Garzik
  2003-04-17 19:19       ` Jeff Garzik
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-04-17 19:07 UTC
  To: Linus Torvalds; +Cc: Arjan van de Ven, LKML

On Thu, Apr 17, 2003 at 09:07:45AM -0700, Linus Torvalds wrote:
> 
> On 17 Apr 2003, Arjan van de Ven wrote:
> > 
> > it can do that ANYWAY for all kinds of things.
> > We really should ask the gcc folks to add a
> > -fdontyoudareusefloatingpoint flag (well different name probably)
> 
> Well, _most_ architectures actually have that flag already. It's not 
> called -fdontyoudareusefloatingpoint on any of them, though ;)
> 
> It's most commonly called "-mno-fpu", but sh calls it "-mno-implicit-fp",
> and alpha calls it "-mno-fp-regs".
> 
> On x86, gcc doesn't have such an option, although "-mno-sse" and
> "-mno-sse2" probably come closest (and we should probably use them, but
> since older gcc's don't know about it and it hasn't been an issue yet we
> haven't).

gcc on x86 definitely wants a -fdontyoudareusefloatingpoint... The
following snippet from the -msoft-float docs isn't encouraging:

     On machines where a function returns floating point results in the
     80387 register stack, some floating point opcodes may be emitted
     even if `-msoft-float' is used.
[yes, I realize it's an ABI constraint in this case, but IMO it's also
indicative in general of the difficulty in getting gcc to _never_
generate fp opcodes on x86, now or in the future]

Anyway, if you wanna play it safe WRT gcc future, the following
patch works.

-mno-3dnow is probably over-paranoia, because (IIRC) it doesn't contain
any integer instructions, but maybe I'm wrong.  It can't hurt... :)


===== arch/i386/Makefile 1.49 vs edited =====
--- 1.49/arch/i386/Makefile	Thu Apr  3 14:24:40 2003
+++ edited/arch/i386/Makefile	Thu Apr 17 14:55:34 2003
@@ -27,6 +27,9 @@
 # prevent gcc from keeping the stack 16 byte aligned
 CFLAGS += $(call check_gcc,-mpreferred-stack-boundary=2,)
 
+# force gcc to always use general registers (only)
+CFLAGS += $(call check_gcc,-mno-mmx -mno-sse -mno-sse2 -mno-3dnow,)
+
 align := $(subst -functions=0,,$(call check_gcc,-falign-functions=0,-malign-functions=0))
 
 cflags-$(CONFIG_M386)		+= -march=i386


* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 16:07     ` Linus Torvalds
  2003-04-17 19:07       ` Jeff Garzik
@ 2003-04-17 19:19       ` Jeff Garzik
  2003-04-17 19:54         ` Linus Torvalds
  2003-04-17 22:58         ` J.A. Magallon
  1 sibling, 2 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-04-17 19:19 UTC
  To: Linus Torvalds; +Cc: Arjan van de Ven, LKML

On Thu, Apr 17, 2003 at 09:07:45AM -0700, Linus Torvalds wrote:
> HOWEVER, that doesn't fix the memcpy() issue. The fact is, the kernel 
> _can_ and does use SSE instructions - it's just that it has to do magic 
> crap before it does so. 

That issue is completely immaterial to my patch, though...

__constant_memcpy was used for small, constant-sized cases AFTER
the kernel made the decision not to hand the copy duties over to the
kernel's MMX/SSE code.  Take a look at the bottom of the patch below,
and also this snip from a non-hacked string.h, for illustration...

	...

	#define memcpy(t, f, n) \
	(__builtin_constant_p(n) ? \
	 __constant_memcpy3d((t),(f),(n)) : \
	 __memcpy3d((t),(f),(n)))

	#else

	/*
	 *      No 3D Now!
	 */
 
	#define memcpy(t, f, n) \
	(__builtin_constant_p(n) ? \
	 __constant_memcpy((t),(f),(n)) : \
	 __memcpy((t),(f),(n)))

	#endif

I can certainly see your argument about gcc suddenly deciding to use
SSE[2] registers to do the copy on its own, though.  The -mno-foo patch
in my previous message should take care of that.

So, I _do_ see the downsides of my "radical" approach described
in my initial message (now), but I don't see the downsides to the
conservative patch that I actually submitted.  This patch retains
the kernel's decision-making on MMX/SSE/etc., but hands over the easy
and obvious stuff to gcc.

Even if the gcc guys suddenly turn hostile to the kernel, I really
can't see them breaking __builtin_memcpy for the simple case --
which is the only case where I use it below.



diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
--- a/include/asm-i386/string.h	Thu Apr 17 15:08:34 2003
+++ b/include/asm-i386/string.h	Thu Apr 17 15:08:34 2003
@@ -208,75 +208,6 @@
 return (to);
 }
 
-/*
- * This looks horribly ugly, but the compiler can optimize it totally,
- * as the count is constant.
- */
-static inline void * __constant_memcpy(void * to, const void * from, size_t n)
-{
-	switch (n) {
-		case 0:
-			return to;
-		case 1:
-			*(unsigned char *)to = *(const unsigned char *)from;
-			return to;
-		case 2:
-			*(unsigned short *)to = *(const unsigned short *)from;
-			return to;
-		case 3:
-			*(unsigned short *)to = *(const unsigned short *)from;
-			*(2+(unsigned char *)to) = *(2+(const unsigned char *)from);
-			return to;
-		case 4:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			return to;
-		case 6:	/* for Ethernet addresses */
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(2+(unsigned short *)to) = *(2+(const unsigned short *)from);
-			return to;
-		case 8:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			return to;
-		case 12:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			return to;
-		case 16:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			*(3+(unsigned long *)to) = *(3+(const unsigned long *)from);
-			return to;
-		case 20:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			*(3+(unsigned long *)to) = *(3+(const unsigned long *)from);
-			*(4+(unsigned long *)to) = *(4+(const unsigned long *)from);
-			return to;
-	}
-#define COMMON(x) \
-__asm__ __volatile__( \
-	"rep ; movsl" \
-	x \
-	: "=&c" (d0), "=&D" (d1), "=&S" (d2) \
-	: "0" (n/4),"1" ((long) to),"2" ((long) from) \
-	: "memory");
-{
-	int d0, d1, d2;
-	switch (n % 4) {
-		case 0: COMMON(""); return to;
-		case 1: COMMON("\n\tmovsb"); return to;
-		case 2: COMMON("\n\tmovsw"); return to;
-		default: COMMON("\n\tmovsw\n\tmovsb"); return to;
-	}
-}
-  
-#undef COMMON
-}
-
 #define __HAVE_ARCH_MEMCPY
 
 #ifdef CONFIG_X86_USE_3DNOW
@@ -290,7 +221,7 @@
 static inline void * __constant_memcpy3d(void * to, const void * from, size_t len)
 {
 	if (len < 512)
-		return __constant_memcpy(to, from, len);
+		return __builtin_memcpy(to, from, len);
 	return _mmx_memcpy(to, from, len);
 }
 
@@ -314,7 +245,7 @@
  
 #define memcpy(t, f, n) \
 (__builtin_constant_p(n) ? \
- __constant_memcpy((t),(f),(n)) : \
+ __builtin_memcpy((t),(f),(n)) : \
  __memcpy((t),(f),(n)))
 
 #endif


* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 19:19       ` Jeff Garzik
@ 2003-04-17 19:54         ` Linus Torvalds
  2003-04-17 23:49           ` Jeff Garzik
  2003-04-17 22:58         ` J.A. Magallon
  1 sibling, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2003-04-17 19:54 UTC
  To: Jeff Garzik; +Cc: Arjan van de Ven, LKML


On Thu, 17 Apr 2003, Jeff Garzik wrote:
> 
> __constant_memcpy was used for small, constant-sized cases AFTER
> the kernel made the decision not to hand the copy duties over to the
> kernel's MMX/SSE code.  Take a look at the bottom of the patch below,
> and also this snip from a non-hacked string.h, for illustration...

This is the part I don't like

 #define memcpy(t, f, n) \
 (__builtin_constant_p(n) ? \
- __constant_memcpy((t),(f),(n)) : \
+ __builtin_memcpy((t),(f),(n)) : \
  __memcpy((t),(f),(n)))

Notice? Our old __constant_memcpy() would do the right thing for large 
copies. In contrast, I don't know that gcc will do so.

		Linus



* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 13:17 ` Alan Cox
  2003-04-17 14:32   ` Jeff Garzik
@ 2003-04-17 20:01   ` H. Peter Anvin
  1 sibling, 0 replies; 27+ messages in thread
From: H. Peter Anvin @ 2003-04-17 20:01 UTC
  To: linux-kernel

Followup to:  <1050585430.31390.32.camel@dhcp22.swansea.linux.org.uk>
By author:    Alan Cox <alan@lxorguk.ukuu.org.uk>
In newsgroup: linux.dev.kernel
> 
> You are assuming the compiler is smart about stuff - it doesn't know
> SSE/MMX for page copies etc. For small copies it should always win, but
> isn't it best if so to use __builtin_memcpy without our existing
> macros not just trust the compiler ?
> 

For large or variable-sized copies __builtin_memcpy() just generates a
call to memcpy().  What's more, if you don't specify
-fno-builtin-memcpy memcpy() defaults to __builtin_memcpy()
automatically unless you override it (which Linux does with its
macros.)
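
The two situations described above can be put side by side in a small user-space sketch. The names struct pair, copy_pair, copy_bytes and builtin_copies_ok are invented for illustration, and gcc is assumed. Whether gcc expands a given copy inline or emits a call to memcpy() depends on the gcc version, target and flags, so the sketch only checks the copied bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct pair {
	unsigned int a;
	unsigned int b;
};

/* Size is a small compile-time constant: a candidate for inline expansion. */
static void copy_pair(struct pair *dst, const struct pair *src)
{
	__builtin_memcpy(dst, src, sizeof(*dst));
}

/* Size is only a function argument: gcc may emit a call to memcpy(). */
static void copy_bytes(void *dst, const void *src, size_t n)
{
	__builtin_memcpy(dst, src, n);
}

/* Exercise both forms and report whether the data arrived intact. */
static int builtin_copies_ok(size_t n)
{
	struct pair in = { 1, 2 }, out = { 0, 0 };
	char src[64], dst[64];
	size_t i;

	for (i = 0; i < sizeof(src); i++)
		src[i] = (char)i;
	memset(dst, 0, sizeof(dst));
	if (n > sizeof(src))
		n = sizeof(src);

	copy_pair(&out, &in);
	copy_bytes(dst, src, n);

	return out.a == 1 && out.b == 2 && memcmp(dst, src, n) == 0;
}
```

Reading the assembly from gcc -O2 -S is one way to see which form was picked for each function.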

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64


* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 19:19       ` Jeff Garzik
  2003-04-17 19:54         ` Linus Torvalds
@ 2003-04-17 22:58         ` J.A. Magallon
  2003-04-17 23:10           ` Jeff Garzik
  1 sibling, 1 reply; 27+ messages in thread
From: J.A. Magallon @ 2003-04-17 22:58 UTC
  To: Jeff Garzik; +Cc: Linus Torvalds, Arjan van de Ven, LKML


On 04.17, Jeff Garzik wrote:
> On Thu, Apr 17, 2003 at 09:07:45AM -0700, Linus Torvalds wrote:
[...]
> 
> 	#define memcpy(t, f, n) \
> 	(__builtin_constant_p(n) ? \
> 	 __constant_memcpy3d((t),(f),(n)) : \
> 	 __memcpy3d((t),(f),(n)))
> 

Do you know if this is more efficient ? :

__builtin_choose_expr(__builtin_constant_p(n), ... , ....)

Perhaps ?: reaches runtime, and __builtin doesn't.

-- 
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-pre7-jam1 (gcc 3.2.2 (Mandrake Linux 9.2 3.2.2-5mdk))


* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 22:58         ` J.A. Magallon
@ 2003-04-17 23:10           ` Jeff Garzik
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-04-17 23:10 UTC
  To: J.A. Magallon; +Cc: Linus Torvalds, Arjan van de Ven, LKML

J.A. Magallon wrote:
> On 04.17, Jeff Garzik wrote:
> 
>>On Thu, Apr 17, 2003 at 09:07:45AM -0700, Linus Torvalds wrote:
> 
> [...]
> 
>>	#define memcpy(t, f, n) \
>>	(__builtin_constant_p(n) ? \
>>	 __constant_memcpy3d((t),(f),(n)) : \
>>	 __memcpy3d((t),(f),(n)))
>>
> 
> 
> Do you know if this is more efficient ? :
> 
> __builtin_choose_expr(__builtin_constant_p(n), ... , ....)
> 
> Perhaps ?: reaches runtime, and __builtin doesn't.


No.  The existing code (quoted above) is properly evaluated at 
compile-time...

	Jeff





* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 19:54         ` Linus Torvalds
@ 2003-04-17 23:49           ` Jeff Garzik
  2003-04-17 23:52             ` Jeff Garzik
                               ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-04-17 23:49 UTC
  To: Linus Torvalds; +Cc: Arjan van de Ven, LKML

[-- Attachment #1: Type: text/plain, Size: 1719 bytes --]

Linus Torvalds wrote:
> On Thu, 17 Apr 2003, Jeff Garzik wrote:
> 
>>__constant_memcpy was used for small, constant-sized cases AFTER
>>the kernel made the decision not to hand the copy duties over to the
>>kernel's MMX/SSE code.  Take a look at the bottom of the patch below,
>>and also this snip from a non-hacked string.h, for illustration...
> 
> 
> This is the part I don't like
> 
>  #define memcpy(t, f, n) \
>  (__builtin_constant_p(n) ? \
> - __constant_memcpy((t),(f),(n)) : \
> + __builtin_memcpy((t),(f),(n)) : \
>   __memcpy((t),(f),(n)))
> 
> Notice? Our old __constant_memcpy() would do the right thing for large 
> copies. In contrast, I don't know that gcc will do so.


If DTRT means just using the existing code for large copies in general, 
that's easy enough...  patch attached.  I made the definition of "large 
copy" more pessimistic, where <= 128 bytes goes to __builtin_memcpy, 
otherwise to __constant_memcpy.  I carried this over to the MMX side too 
by proxy, as the existing MMX code already called __constant_memcpy. 
Thus, the modification is now only in one place.
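
The 128-byte cutoff described above can be sketched in user space roughly as follows. This is an illustration, not the attached patch: threshold_copy, bulk_copy, bulk_calls and SMALL_COPY_MAX are made-up names, and bulk_copy is a portable byte loop standing in for the "rep ; movsl" tail that the real __constant_memcpy keeps.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SMALL_COPY_MAX 128	/* cutoff quoted in the message above */

static unsigned int bulk_calls;	/* counts trips through the bulk path */

/* Hypothetical portable stand-in for the "rep ; movsl" tail. */
static void *bulk_copy(void *to, const void *from, size_t n)
{
	unsigned char *d = to;
	const unsigned char *s = from;
	size_t i;

	bulk_calls++;
	for (i = 0; i < n; i++)
		d[i] = s[i];
	return to;
}

/* Small copies go to the gcc builtin, larger ones to the bulk routine. */
static inline void *threshold_copy(void *to, const void *from, size_t n)
{
	if (n <= SMALL_COPY_MAX)
		return __builtin_memcpy(to, from, n);
	return bulk_copy(to, from, n);
}

/* Copy n bytes (clamped to the buffer size) and verify the result. */
static int threshold_copy_ok(size_t n)
{
	static unsigned char src[512], dst[512];
	size_t i;

	if (n > sizeof(src))
		n = sizeof(src);
	for (i = 0; i < sizeof(src); i++)
		src[i] = (unsigned char)(i * 7 + 1);
	memset(dst, 0, sizeof(dst));

	threshold_copy(dst, src, n);
	return memcmp(dst, src, n) == 0;
}
```

A 128-byte copy stays on the builtin side and a 129-byte copy crosses over to bulk_copy, which the bulk_calls counter makes visible.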

If DTRT means not using MMX/SSE[2], that's a given considering the above 
code is from the "we don't have MMX/SSE" part of the ifdef.  If gcc 
starts using MMX with -march=i586 it's time for us all to go home and 
write a new compiler ;-)

Three tangents:
1) where did the 486 string ops go?
2) why no sse2-optimized memcpy?  just that no one has done one yet?
3) for copies >512 bytes, should we simply call the un-inlined memcpy? 
I would think that the call would get lost in cache misses of the block 
copy itself, but then again, __constant_memcpy (as it appears in output 
asm) is pretty darn small already.

	Jeff



[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 2015 bytes --]

===== include/asm-i386/string.h 1.8 vs edited =====
--- 1.8/include/asm-i386/string.h	Mon Mar 31 17:29:08 2003
+++ edited/include/asm-i386/string.h	Thu Apr 17 19:45:20 2003
@@ -214,49 +214,9 @@
  */
 static inline void * __constant_memcpy(void * to, const void * from, size_t n)
 {
-	switch (n) {
-		case 0:
-			return to;
-		case 1:
-			*(unsigned char *)to = *(const unsigned char *)from;
-			return to;
-		case 2:
-			*(unsigned short *)to = *(const unsigned short *)from;
-			return to;
-		case 3:
-			*(unsigned short *)to = *(const unsigned short *)from;
-			*(2+(unsigned char *)to) = *(2+(const unsigned char *)from);
-			return to;
-		case 4:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			return to;
-		case 6:	/* for Ethernet addresses */
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(2+(unsigned short *)to) = *(2+(const unsigned short *)from);
-			return to;
-		case 8:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			return to;
-		case 12:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			return to;
-		case 16:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			*(3+(unsigned long *)to) = *(3+(const unsigned long *)from);
-			return to;
-		case 20:
-			*(unsigned long *)to = *(const unsigned long *)from;
-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
-			*(3+(unsigned long *)to) = *(3+(const unsigned long *)from);
-			*(4+(unsigned long *)to) = *(4+(const unsigned long *)from);
-			return to;
-	}
+	if (n <= 128)
+		return __builtin_memcpy(to, from, n);
+
 #define COMMON(x) \
 __asm__ __volatile__( \
 	"rep ; movsl" \


* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 23:49           ` Jeff Garzik
@ 2003-04-17 23:52             ` Jeff Garzik
  2003-04-17 23:59             ` Linus Torvalds
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Jeff Garzik @ 2003-04-17 23:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Arjan van de Ven, LKML

Jeff Garzik wrote:
> If DTRT means not using MMX/SSE[2], that's a given considering the above 
> code is from the "we don't have MMX/SSE" part of the ifdef.  If gcc 
> starts using MMX with -march=i586 it's time for us all to go home and 
> write a new compiler ;-)


nevermind on this point, it's obviously not -march=<old>.

	Jeff





* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 23:49           ` Jeff Garzik
  2003-04-17 23:52             ` Jeff Garzik
@ 2003-04-17 23:59             ` Linus Torvalds
  2003-04-18  0:29               ` H. Peter Anvin
  2003-04-18  9:06             ` Arjan van de Ven
  2003-04-18 14:31             ` Timothy Miller
  3 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2003-04-17 23:59 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Arjan van de Ven, LKML


On Thu, 17 Apr 2003, Jeff Garzik wrote:
> 
> If DTRT means just using the existing code for large copies in general, 
> that's easy enough...  patch attached.

Yeah, this looks more palatable. It gets rid of a rather ugly thing, with
no loss in generality _and_ if gcc ever tries to use sse2 for small
copies, we can file that as a valid bug (there's just no way it's a good
idea considering the costs of FP state even in user space)

> 1) where did the 486 string ops go?

They were disabled forever, because nobody cared enough, and they had 
known (or at least strongly suspected) bugs.

> 2) why no sse2-optimized memcpy?  just that no one has done one yet?

Yes. If you want to, it's definitely the right thing to do. More so than 
the 3dnow stuff that is by now ancient. HOWEVER, I don't think there are 
any really valid large memcpy() calls inside the kernel. All the valid 
ones are either special-cased (ie "copy_page()") or to user space.

So I wouldn't worry about it at a memcpy() level. 

		Linus



* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 23:59             ` Linus Torvalds
@ 2003-04-18  0:29               ` H. Peter Anvin
  0 siblings, 0 replies; 27+ messages in thread
From: H. Peter Anvin @ 2003-04-18  0:29 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.44.0304171654530.14595-100000@home.transmeta.com>
By author:    Linus Torvalds <torvalds@transmeta.com>
In newsgroup: linux.dev.kernel
> 
> > 2) why no sse2-optimized memcpy?  just that no one has done one yet?
> 
> Yes. If you want to, it's definitely the right thing to do. More so than 
> the 3dnow stuff that is by now ancient. HOWEVER, I don't think there are 
> any really valid large memcpy() calls inside the kernel. All the valid 
> ones are either special-cased (ie "copy_page()") or to user space.
> 

It's questionable, though, if SSE-2 buys you anything SSE-1 doesn't
already have.  SSE-2 is basically integer and scalar ops for SSE, but
data moving and even XOR is perfectly well handled by SSE-1.
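
To illustrate the point about data moving, here is a rough userspace
sketch that copies memory using only SSE-1 intrinsics from
<xmmintrin.h>. sse1_copy() is a hypothetical name. The sketch assumes a
compiler that defines __SSE__ when SSE is enabled and falls back to
memcpy() otherwise. It ignores the FP-state save/restore cost that
kernel code would have to pay:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>	/* SSE-1 intrinsics only */
#endif

/* Hypothetical userspace helper: copy n bytes, 16 at a time, via SSE-1. */
static void *sse1_copy(void *to, const void *from, size_t n)
{
	unsigned char *d = to;
	const unsigned char *s = from;

	assert(to != NULL && from != NULL);
#if defined(__SSE__)
	while (n >= 16) {
		/* Unaligned 128-bit load and store; no SSE-2 needed. */
		_mm_storeu_ps((float *)d, _mm_loadu_ps((const float *)s));
		d += 16;
		s += 16;
		n -= 16;
	}
#endif
	memcpy(d, s, n);	/* tail, or everything when SSE is unavailable */
	return to;
}
```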

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64


* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 23:49           ` Jeff Garzik
  2003-04-17 23:52             ` Jeff Garzik
  2003-04-17 23:59             ` Linus Torvalds
@ 2003-04-18  9:06             ` Arjan van de Ven
  2003-04-18 14:31             ` Timothy Miller
  3 siblings, 0 replies; 27+ messages in thread
From: Arjan van de Ven @ 2003-04-18  9:06 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Linus Torvalds, Arjan van de Ven, LKML

On Thu, Apr 17, 2003 at 07:49:03PM -0400, Jeff Garzik wrote:
> Three tangents:
> 1) where did the 486 string ops go?

those were slower even on 80486, and broken as well

> 2) why no sse2-optimized memcpy?  just that no one has done one yet?

I've not moved the userspace prototype to kernelspace



* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-17 23:49           ` Jeff Garzik
                               ` (2 preceding siblings ...)
  2003-04-18  9:06             ` Arjan van de Ven
@ 2003-04-18 14:31             ` Timothy Miller
  2003-04-18 15:07               ` Richard B. Johnson
  3 siblings, 1 reply; 27+ messages in thread
From: Timothy Miller @ 2003-04-18 14:31 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Linus Torvalds, Arjan van de Ven, LKML



Jeff Garzik wrote:

[snip]

>-		case 20:
>-			*(unsigned long *)to = *(const unsigned long *)from;
>-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
>-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
>-			*(3+(unsigned long *)to) = *(3+(const unsigned long *)from);
>-			*(4+(unsigned long *)to) = *(4+(const unsigned long *)from);
>-			return to;
>-	}
>+	if (n <= 128)
>+		return __builtin_memcpy(to, from, n);
>+
> #define COMMON(x) \
> __asm__ __volatile__( \
> 	"rep ; movsl" \
>  
>

Ignorant questions since I haven't been following the discussion:  Does 
this work with unaligned copies?  Does it work well?  What's better, 
letting the CPU do realignment, or writing the code to do bit shifts so 
that both reads and writes are aligned?




* Re: [BK+PATCH] remove __constant_memcpy
  2003-04-18 14:31             ` Timothy Miller
@ 2003-04-18 15:07               ` Richard B. Johnson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard B. Johnson @ 2003-04-18 15:07 UTC (permalink / raw)
  To: Timothy Miller; +Cc: Jeff Garzik, Linus Torvalds, Arjan van de Ven, LKML

On Fri, 18 Apr 2003, Timothy Miller wrote:

>
>
> Jeff Garzik wrote:
>
> [snip]
>
> >-		case 20:
> >-			*(unsigned long *)to = *(const unsigned long *)from;
> >-			*(1+(unsigned long *)to) = *(1+(const unsigned long *)from);
> >-			*(2+(unsigned long *)to) = *(2+(const unsigned long *)from);
> >-			*(3+(unsigned long *)to) = *(3+(const unsigned long *)from);
> >-			*(4+(unsigned long *)to) = *(4+(const unsigned long *)from);
> >-			return to;
> >-	}
> >+	if (n <= 128)
> >+		return __builtin_memcpy(to, from, n);
> >+
> > #define COMMON(x) \
> > __asm__ __volatile__( \
> > 	"rep ; movsl" \
> >
> >
>
> Ignorant questions since I haven't been following the discussion:  Does
> this work with unaligned copies?  Does it work well?  What's better,
> letting the CPU do realignment, or writing the code to do bit shifts so
> that both reads and writes are aligned?
>

Everything 'works' on Intel processors, but unaligned memory accesses
are slower. How slow depends upon the CPU. The CPU will never do
'realignment' on its own. Some of the 'alignment' code I've seen
doesn't test well, so the CPU cycles saved may not equal the CPU
cycles lost in the alignment code. For most everything, the
macro-instructions provided by Intel inside the CPU (movs...) work
"well enough". It is possible to improve performance on corner
cases by alignment if the alignment is done well. Often, though, one
buffer or the other refuses alignment: for instance, an input/output
buffer pair might be 0x80000004 -> 0xC0000003, where the two addresses
differ in their low bits. There is no way to fix this.
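
The buffer-pair example can be checked with a small sketch. The helper
names bytes_to_align() and can_coalign() are hypothetical, and the code
assumes a power-of-two alignment. Two buffers can be aligned together
only when they need the same number of leading bytes skipped:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes to advance before addr is a multiple of align (a power of two). */
static uintptr_t bytes_to_align(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (align - (addr & (align - 1))) & (align - 1);
}

/* True when skipping the same leading byte count aligns both buffers. */
static int can_coalign(uintptr_t src, uintptr_t dst, uintptr_t align)
{
	return bytes_to_align(src, align) == bytes_to_align(dst, align);
}
```

For 0x80000004 and 0xC0000003 with 4-byte alignment, the source needs 0
leading bytes and the destination needs 1, so can_coalign() returns 0:
aligning one side always misaligns the other.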

I sure wish somebody looked at the generated code here:

  *(1+(unsigned long *)to) = *(1+(const unsigned long *)from);

It cannot possibly work better than a 'naked' rep-movsl.

By carefully aligning negative indexes it is possible to force
the 'C' compiler to generate efficient code, for instance:

    while(len >= 0x10)
    {
        in_ptr  += 0x10;
        out_ptr += 0x10;
        len     -= 0x10;
        __asm__ __volatile__(".align 0x08\n");
        out_ptr[-0x10] = in_ptr[-0x10];
        out_ptr[-0x0f] = in_ptr[-0x0f];
        out_ptr[-0x0e] = in_ptr[-0x0e];
        out_ptr[-0x0d] = in_ptr[-0x0d];
        out_ptr[-0x0c] = in_ptr[-0x0c];
        out_ptr[-0x0b] = in_ptr[-0x0b];
        out_ptr[-0x0a] = in_ptr[-0x0a];
        out_ptr[-0x09] = in_ptr[-0x09];
        ...etc..
    }

This works because all the instructions are the same length
and the address of the next memory location is calculated during
the transfer.
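
A compilable version of the same idea, shortened to four longwords per
iteration and with a tail loop added for leftover elements. This is only
a sketch: copy_longs_unrolled() is a hypothetical name, the inline
".align" directive is left out, and whether the generated code is
actually faster depends on the compiler and CPU:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy len longwords, four per loop iteration. */
static void copy_longs_unrolled(unsigned long *out_ptr,
				const unsigned long *in_ptr, size_t len)
{
	assert(out_ptr != NULL && in_ptr != NULL);
	while (len >= 4) {
		/* Bump the pointers first, then copy via negative indexes. */
		in_ptr  += 4;
		out_ptr += 4;
		len     -= 4;
		out_ptr[-4] = in_ptr[-4];
		out_ptr[-3] = in_ptr[-3];
		out_ptr[-2] = in_ptr[-2];
		out_ptr[-1] = in_ptr[-1];
	}
	while (len--)		/* leftover 0..3 longwords */
		*out_ptr++ = *in_ptr++;
}
```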

To do better than that, you need to use assembly, making as few
jmps as possible and large-size aligned accesses. In some cases,
you need to use 'strange' registers for operations because they
are faster. For instance:

		movl	(%eax), %edx     # Get longword
		leal	0x04(%eax), %eax # Calculate next address
		movl	%edx, (%ebx)     # Destination

... This works faster than...

		movl	(%esi), %eax     # Get longword
		leal	0x04(%esi), %esi # Calculate next address
		movl	%eax, (%ebx)     # Destination

Exact same code, different registers. Register EAX was the
last one added that could be used as an index register. Apparently
it uses newer "technology". It is faster than the other index
registers. 'leal' is faster than 'addl $4, %esi'. This is because
the CPU stalls until that math is complete, whereas 'leal' allows the
CPU to continue while the address register is being updated.

Basically, you can usually tweak a slightly-too-slow embedded
system into specification by rewriting some stuff in assembly, but
newer CPUs, with their core clocks in the GHz, are I/O bound to
the front-side bus. Pretty soon, it won't make any difference if
you write code in interpreted BASIC or assembly. The execution
speed will be about the same.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.



* Re: [BK+PATCH] remove __constant_memcpy
@ 2003-04-17 23:50 Chuck Ebbert
  0 siblings, 0 replies; 27+ messages in thread
From: Chuck Ebbert @ 2003-04-17 23:50 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel

Jeff Garzik wrote: 


>> On x86, gcc doesn't have such an option, although "-mno-sse" and
>> "-mno-sse2" probably come closest (and we should probably use them, but
>> since older gcc's don't know about it and it hasn't been an issue yet we
>> haven't).
>
> gcc on x86 definitely wants a -fdontyoudareusefloatingpoint... The
> following snippet from the -msoft-float docs isn't encouraging:
>
>     On machines where a function returns floating point results in the
>     80387 register stack, some floating point opcodes may be emitted
>     even if `-msoft-float' is used.


  -mno-fp-ret-in-387 should fix that.



--
 Chuck


* RE: [BK+PATCH] remove __constant_memcpy
@ 2003-04-17  2:22 Nakajima, Jun
  0 siblings, 0 replies; 27+ messages in thread
From: Nakajima, Jun @ 2003-04-17  2:22 UTC (permalink / raw)
  To: Linus Torvalds, Jeff Garzik; +Cc: LKML

I think the fear is valid. The Intel compiler, for example, uses FP if it's required to optimize the code with a particular option. And it is not necessarily obvious from the option whether it uses FP or not (and it does not matter for most users). This could happen with gcc.

I think we can do it better than the compiler does, because we know the usage better, e.g. alignment, typical size, etc.

Thanks,
Jun

> -----Original Message-----
> From: Linus Torvalds [mailto:torvalds@transmeta.com]
> Sent: Wednesday, April 16, 2003 7:06 PM
> To: Jeff Garzik
> Cc: LKML
> Subject: Re: [BK+PATCH] remove __constant_memcpy
> 
> 
> On Wed, 16 Apr 2003, Jeff Garzik wrote:
> >
> > gcc's __builtin_memcpy performs the same function (and more) as the
> > kernel's __constant_memcpy.  So, let's remove __constant_memcpy, and let
> > the compiler do it.
> 
> Please don't.
> 
> There's no way gcc will EVER get the SSE2 cases right. It just cannot do
> it. In fact, I live in fear that we will have to turn off the compiler
> intrinsics entirely some day just because there is always the worry that
> gcc will start using FP.
> 
> So the advantage of doing our own memcpy() is not that it's necessarily
> faster than the gcc built-in, but simply because I do not believe that the
> gcc people care enough about the kernel to let them make the choice.
> 
> 		Linus
> 


end of thread, other threads:[~2003-04-18 14:53 UTC | newest]

Thread overview: 27+ messages
2003-04-17  0:57 [BK+PATCH] remove __constant_memcpy Jeff Garzik
2003-04-17  1:04 ` Jeff Garzik
2003-04-17  2:06 ` Linus Torvalds
2003-04-17  8:46   ` Arjan van de Ven
2003-04-17  9:02     ` Roman Zippel
2003-04-17  9:04       ` Arjan van de Ven
2003-04-17  9:11         ` Jakub Jelinek
2003-04-17 16:07     ` Linus Torvalds
2003-04-17 19:07       ` Jeff Garzik
2003-04-17 19:19       ` Jeff Garzik
2003-04-17 19:54         ` Linus Torvalds
2003-04-17 23:49           ` Jeff Garzik
2003-04-17 23:52             ` Jeff Garzik
2003-04-17 23:59             ` Linus Torvalds
2003-04-18  0:29               ` H. Peter Anvin
2003-04-18  9:06             ` Arjan van de Ven
2003-04-18 14:31             ` Timothy Miller
2003-04-18 15:07               ` Richard B. Johnson
2003-04-17 22:58         ` J.A. Magallon
2003-04-17 23:10           ` Jeff Garzik
2003-04-17 13:17 ` Alan Cox
2003-04-17 13:17 ` Alan Cox
2003-04-17 14:32   ` Jeff Garzik
2003-04-17 14:40     ` Jeff Garzik
2003-04-17 20:01   ` H. Peter Anvin
2003-04-17  2:22 Nakajima, Jun
2003-04-17 23:50 Chuck Ebbert
