* New x86 warning
@ 2009-04-22  6:46 Jeff Garzik
  2009-04-22  7:01 ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff Garzik @ 2009-04-22  6:46 UTC (permalink / raw)
  To: LKML; +Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

[-- Attachment #1: Type: text/plain, Size: 476 bytes --]


On x86-32, this warning now appears for me in 2.6.30-rc3, and did not 
appear in 2.6.29.

drivers/acpi/acpica/tbfadt.c: In function 'acpi_tb_create_local_fadt':
/spare/repo/linux-2.6/arch/x86/include/asm/string_32.h:75: warning: array subscript is above array bounds

lspci, dmesg and .config attached.

	Jeff



P.S.  It is unclear in MAINTAINERS whether x86@kernel.org should be CC'd 
in addition to the other addresses listed, or as a replacement for 
individual emails.

[-- Attachment #2: lspci.txt --]
[-- Type: text/plain, Size: 1287 bytes --]

00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub (rev 02)
00:01.0 PCI bridge: Intel Corporation 82875P Processor to AGP Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV18GL [Quadro4 380 XGL] (rev a2)
05:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5782 Gigabit Ethernet (rev 02)
05:09.0 Mass storage controller: Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller (rev 01)

[-- Attachment #3: config.txt.gz --]
[-- Type: application/x-gzip, Size: 10699 bytes --]

[-- Attachment #4: dmesg.txt.gz --]
[-- Type: application/x-gzip, Size: 7402 bytes --]


* Re: New x86 warning
  2009-04-22  6:46 New x86 warning Jeff Garzik
@ 2009-04-22  7:01 ` Ingo Molnar
  2009-04-22  8:45   ` [PATCH] X86-32: Let gcc decide whether to inline memcpy was " Andi Kleen
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2009-04-22  7:01 UTC (permalink / raw)
  To: Jeff Garzik, Linus Torvalds
  Cc: LKML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86


* Jeff Garzik <jeff@garzik.org> wrote:

> On x86-32, this warning now appears for me in 2.6.30-rc3, and did 
> not appear in 2.6.29.
>
> drivers/acpi/acpica/tbfadt.c: In function 'acpi_tb_create_local_fadt':
> /spare/repo/linux-2.6/arch/x86/include/asm/string_32.h:75: warning:  
> array subscript is above array bounds

Last I checked it was a GCC bounds-check bogosity. All attempts to 
work around it or annotate it sanely (without changing the assembly 
code) failed. (New ideas welcome.)

The closest I came was the hacklet below to the assembly code 
[with an intentionally corrupted patch header to make it harder to 
apply accidentally].

The hacklet writes the fifth byte by writing two bytes starting at byte 
position 3: byte 3 gets written twice (harmless for non-overlapping 
copies), and the '+ 4' subscript that trips the bogus warning goes away.
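
Spelled out, the two variants of the 5-byte case look like this (a 
simplified sketch of the hacklet below, assuming non-overlapping 
to/from):

	/* plain: a word store plus a byte store at offset 4 -- the
	 * "+ 4" subscript is what trips the bogus warning */
	*(int *)to = *(int *)from;
	*((char *)to + 4) = *((char *)from + 4);

	/* hacklet: a word store plus a two-byte store at offset 3;
	 * byte 3 is written twice, which is harmless here */
	*(int *)to = *(int *)from;
	*(short *)((char *)to + 3) = *(short *)((char *)from + 3);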

> lspci, dmesg and .config attached.
>
> 	Jeff
>
> P.S.  It is unclear in MAINTAINERS whether x86@kernel.org should 
> be CC'd in addition to the other addresses listed, or as a 
> replacement for individual emails.

At your option. Cc:-ing maintainers directly can occasionally get 
faster treatment; using the alias is shorter.

	Ingo

NOT-Signed-off-by-me:

+++ linux/arch/x86/include/asm/string_32.h
@@ -72,7 +72,7 @@ static __always_inline void *__constant_
 		return to;
 	case 5:
 		*(int *)to = *(int *)from;
-		*((char *)to + 4) = *((char *)from + 4);
+		*((short *)(char *)(to + 3)) = *((short *)(char *)(from + 3));
 		return to;
 	case 6:
 		*(int *)to = *(int *)from;


* [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-22  7:01 ` Ingo Molnar
@ 2009-04-22  8:45   ` Andi Kleen
  2009-04-22 18:00     ` [tip:x86/asm] x86: use __builtin_memcpy() on 32 bits tip-bot for Andi Kleen
                       ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Andi Kleen @ 2009-04-22  8:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeff Garzik, Linus Torvalds, LKML, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Ingo Molnar <mingo@elte.hu> writes:

> * Jeff Garzik <jeff@garzik.org> wrote:
>
>> On x86-32, this warning now appears for me in 2.6.30-rc3, and did 
>> not appear in 2.6.29.
>>
>> drivers/acpi/acpica/tbfadt.c: In function 'acpi_tb_create_local_fadt':
>> /spare/repo/linux-2.6/arch/x86/include/asm/string_32.h:75: warning:  
>> array subscript is above array bounds
>
> Last i checked it was a GCC bounds check bogosity. All attempts to 
> work it around or annotate it sanely (without changing the assembly 
> code) failed. (new ideas welcome)
>
> The closest i came was the hacklet below to the assembly code. [with 
> an intentionally corrupted patch header to make it harder to apply 
> accidentally.]

Modern gcc (and that is all that is supported now) should be able to
generate this code on its own already. So if you call __builtin_* it
will just work (that is what 64bit does) without that explicit code.
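
The mechanism boils down to this (a minimal sketch, not the actual
patch -- the real out-of-line fallback below keeps the rep;movsl asm
rather than a byte loop):

	#include <stddef.h>

	/* header side: map memcpy to the builtin and let gcc decide
	 * per call site whether to inline it */
	#define memcpy(t, f, n) __builtin_memcpy((t), (f), (n))

	/* library side: the out-of-line fallback gcc emits a call to
	 * when it decides not to inline; #undef first so the macro
	 * doesn't rewrite the definition itself */
	#undef memcpy
	void *memcpy(void *to, const void *from, size_t n)
	{
		unsigned char *d = to;
		const unsigned char *s = from;

		while (n--)
			*d++ = *s++;
		return to;
	}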

Here's a patch that does that, with some numbers. It makes kernels
about 3k smaller on my configuration.

IMHO it's long overdue to do this for 32bit too.

It's a very attractive patch because it removes a lot of code:

 arch/x86/include/asm/string_32.h |  127 ++-------------------------------------
 arch/x86/lib/memcpy_32.c         |   16 ++++

-Andi

---

X86-32: Let gcc decide whether to inline memcpy 

Modern gccs have their own heuristics to decide whether string functions should
be inlined or not. This was not the case with old gccs, but Linux doesn't
support those anymore. The 64bit kernel always did it this way. Just define
memcpy to __builtin_memcpy and gcc should do the right thing. Also supply
an out-of-line memcpy that gcc can fall back to when it decides
not to inline.

First, this fixes the

arch/x86/include/asm/string_32.h:75: warning: array subscript is above array bounds

warnings that have been creeping up recently, simply by removing
the code that triggers them.

Then trusting gcc actually makes the kernel smaller by nearly 3K:

   text    data     bss     dec     hex filename
5503146  529444 1495040 7527630  72dcce vmlinux
5500373  529444 1495040 7524857  72d1f9 vmlinux-string

Also it removes some quite ugly code and will likely speed up
compilation by a tiny bit by having less inline code to process
for every file.

I did some quick boot tests and everything worked as expected.
I left the 3dnow case alone for now.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/include/asm/string_32.h |  127 ++-------------------------------------
 arch/x86/lib/memcpy_32.c         |   16 ++++
 2 files changed, 23 insertions(+), 120 deletions(-)

Index: linux-2.6.30-rc2-ak/arch/x86/include/asm/string_32.h
===================================================================
--- linux-2.6.30-rc2-ak.orig/arch/x86/include/asm/string_32.h	2009-04-22 10:22:33.000000000 +0200
+++ linux-2.6.30-rc2-ak/arch/x86/include/asm/string_32.h	2009-04-22 10:23:12.000000000 +0200
@@ -29,121 +29,10 @@
 #define __HAVE_ARCH_STRLEN
 extern size_t strlen(const char *s);
 
-static __always_inline void *__memcpy(void *to, const void *from, size_t n)
-{
-	int d0, d1, d2;
-	asm volatile("rep ; movsl\n\t"
-		     "movl %4,%%ecx\n\t"
-		     "andl $3,%%ecx\n\t"
-		     "jz 1f\n\t"
-		     "rep ; movsb\n\t"
-		     "1:"
-		     : "=&c" (d0), "=&D" (d1), "=&S" (d2)
-		     : "0" (n / 4), "g" (n), "1" ((long)to), "2" ((long)from)
-		     : "memory");
-	return to;
-}
-
-/*
- * This looks ugly, but the compiler can optimize it totally,
- * as the count is constant.
- */
-static __always_inline void *__constant_memcpy(void *to, const void *from,
-					       size_t n)
-{
-	long esi, edi;
-	if (!n)
-		return to;
-
-	switch (n) {
-	case 1:
-		*(char *)to = *(char *)from;
-		return to;
-	case 2:
-		*(short *)to = *(short *)from;
-		return to;
-	case 4:
-		*(int *)to = *(int *)from;
-		return to;
-
-	case 3:
-		*(short *)to = *(short *)from;
-		*((char *)to + 2) = *((char *)from + 2);
-		return to;
-	case 5:
-		*(int *)to = *(int *)from;
-		*((char *)to + 4) = *((char *)from + 4);
-		return to;
-	case 6:
-		*(int *)to = *(int *)from;
-		*((short *)to + 2) = *((short *)from + 2);
-		return to;
-	case 8:
-		*(int *)to = *(int *)from;
-		*((int *)to + 1) = *((int *)from + 1);
-		return to;
-	}
-
-	esi = (long)from;
-	edi = (long)to;
-	if (n >= 5 * 4) {
-		/* large block: use rep prefix */
-		int ecx;
-		asm volatile("rep ; movsl"
-			     : "=&c" (ecx), "=&D" (edi), "=&S" (esi)
-			     : "0" (n / 4), "1" (edi), "2" (esi)
-			     : "memory"
-		);
-	} else {
-		/* small block: don't clobber ecx + smaller code */
-		if (n >= 4 * 4)
-			asm volatile("movsl"
-				     : "=&D"(edi), "=&S"(esi)
-				     : "0"(edi), "1"(esi)
-				     : "memory");
-		if (n >= 3 * 4)
-			asm volatile("movsl"
-				     : "=&D"(edi), "=&S"(esi)
-				     : "0"(edi), "1"(esi)
-				     : "memory");
-		if (n >= 2 * 4)
-			asm volatile("movsl"
-				     : "=&D"(edi), "=&S"(esi)
-				     : "0"(edi), "1"(esi)
-				     : "memory");
-		if (n >= 1 * 4)
-			asm volatile("movsl"
-				     : "=&D"(edi), "=&S"(esi)
-				     : "0"(edi), "1"(esi)
-				     : "memory");
-	}
-	switch (n % 4) {
-		/* tail */
-	case 0:
-		return to;
-	case 1:
-		asm volatile("movsb"
-			     : "=&D"(edi), "=&S"(esi)
-			     : "0"(edi), "1"(esi)
-			     : "memory");
-		return to;
-	case 2:
-		asm volatile("movsw"
-			     : "=&D"(edi), "=&S"(esi)
-			     : "0"(edi), "1"(esi)
-			     : "memory");
-		return to;
-	default:
-		asm volatile("movsw\n\tmovsb"
-			     : "=&D"(edi), "=&S"(esi)
-			     : "0"(edi), "1"(esi)
-			     : "memory");
-		return to;
-	}
-}
-
 #define __HAVE_ARCH_MEMCPY
 
+extern void *__memcpy(void *to, const void *from, size_t n);
+
 #ifdef CONFIG_X86_USE_3DNOW
 
 #include <asm/mmx.h>
@@ -155,7 +44,7 @@
 static inline void *__constant_memcpy3d(void *to, const void *from, size_t len)
 {
 	if (len < 512)
-		return __constant_memcpy(to, from, len);
+		return __memcpy(to, from, len);
 	return _mmx_memcpy(to, from, len);
 }
 
@@ -168,20 +57,18 @@
 
 #define memcpy(t, f, n)				\
 	(__builtin_constant_p((n))		\
-	 ? __constant_memcpy3d((t), (f), (n))	\
+	 ? __builtin_memcpy((t), (f), (n))	\
 	 : __memcpy3d((t), (f), (n)))
 
 #else
 
 /*
  *	No 3D Now!
+ *
+ * Let gcc figure it out.
  */
 
-#define memcpy(t, f, n)				\
-	(__builtin_constant_p((n))		\
-	 ? __constant_memcpy((t), (f), (n))	\
-	 : __memcpy((t), (f), (n)))
-
+#define memcpy(t, f, n) __builtin_memcpy(t,f,n)
 #endif
 
 #define __HAVE_ARCH_MEMMOVE
Index: linux-2.6.30-rc2-ak/arch/x86/lib/memcpy_32.c
===================================================================
--- linux-2.6.30-rc2-ak.orig/arch/x86/lib/memcpy_32.c	2009-04-22 10:22:33.000000000 +0200
+++ linux-2.6.30-rc2-ak/arch/x86/lib/memcpy_32.c	2009-04-22 10:23:56.000000000 +0200
@@ -4,6 +4,22 @@
 #undef memcpy
 #undef memset
 
+void *__memcpy(void *to, const void *from, size_t n)
+{
+	int d0, d1, d2;
+	asm volatile("rep ; movsl\n\t"
+		     "movl %4,%%ecx\n\t"
+		     "andl $3,%%ecx\n\t"
+		     "jz 1f\n\t"
+		     "rep ; movsb\n\t"
+		     "1:"
+		     : "=&c" (d0), "=&D" (d1), "=&S" (d2)
+		     : "0" (n / 4), "g" (n), "1" ((long)to), "2" ((long)from)
+		     : "memory");
+	return to;
+}
+EXPORT_SYMBOL(__memcpy);
+
 void *memcpy(void *to, const void *from, size_t n)
 {
 #ifdef CONFIG_X86_USE_3DNOW



-- 
ak@linux.intel.com -- Speaking for myself only.


* [tip:x86/asm] x86: use __builtin_memcpy() on 32 bits
  2009-04-22  8:45   ` [PATCH] X86-32: Let gcc decide whether to inline memcpy was " Andi Kleen
@ 2009-04-22 18:00     ` tip-bot for Andi Kleen
  2009-04-22 20:56     ` [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning Linus Torvalds
  2009-04-22 23:49     ` Joe Damato
  2 siblings, 0 replies; 19+ messages in thread
From: tip-bot for Andi Kleen @ 2009-04-22 18:00 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, andi, ak, tglx

Commit-ID:  1405ae250ec86802b32ca9f7aea977a5ab551b22
Gitweb:     http://git.kernel.org/tip/1405ae250ec86802b32ca9f7aea977a5ab551b22
Author:     Andi Kleen <andi@firstfloor.org>
AuthorDate: Wed, 22 Apr 2009 10:45:15 +0200
Committer:  H. Peter Anvin <hpa@zytor.com>
CommitDate: Wed, 22 Apr 2009 10:55:20 -0700

x86: use __builtin_memcpy() on 32 bits

Modern gccs have their own heuristics to decide whether string functions
should be inlined or not. This was not the case with old gccs,
but Linux doesn't support those anymore. The 64bit kernel always did it
this way. Just define memcpy to __builtin_memcpy and gcc should do the
right thing. Also supply an out-of-line memcpy that gcc can fall back
to when it decides not to inline.

First, this fixes the

arch/x86/include/asm/string_32.h:75: warning: array subscript is above array bounds

warnings that have been creeping up recently, simply by removing
the code that triggers them.

Then trusting gcc actually makes the kernel smaller by nearly 3K:

   text    data     bss     dec     hex filename
5503146  529444 1495040 7527630  72dcce vmlinux
5500373  529444 1495040 7524857  72d1f9 vmlinux-string

Also it removes some quite ugly code and will likely speed up
compilation by a tiny bit by having less inline code to process
for every file.

I did some quick boot tests and everything worked as expected.
I left the 3dnow case alone for now.

[ Impact: fixes warning, reduces code size ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <8763gxoz50.fsf_-_@basil.nowhere.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


---
 arch/x86/include/asm/string_32.h |  127 ++-----------------------------------
 arch/x86/lib/memcpy_32.c         |   16 +++++
 2 files changed, 23 insertions(+), 120 deletions(-)

diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 0e0e3ba..29fff54 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -29,121 +29,10 @@ extern char *strchr(const char *s, int c);
 #define __HAVE_ARCH_STRLEN
 extern size_t strlen(const char *s);
 
-static __always_inline void *__memcpy(void *to, const void *from, size_t n)
-{
-	int d0, d1, d2;
-	asm volatile("rep ; movsl\n\t"
-		     "movl %4,%%ecx\n\t"
-		     "andl $3,%%ecx\n\t"
-		     "jz 1f\n\t"
-		     "rep ; movsb\n\t"
-		     "1:"
-		     : "=&c" (d0), "=&D" (d1), "=&S" (d2)
-		     : "0" (n / 4), "g" (n), "1" ((long)to), "2" ((long)from)
-		     : "memory");
-	return to;
-}
-
-/*
- * This looks ugly, but the compiler can optimize it totally,
- * as the count is constant.
- */
-static __always_inline void *__constant_memcpy(void *to, const void *from,
-					       size_t n)
-{
-	long esi, edi;
-	if (!n)
-		return to;
-
-	switch (n) {
-	case 1:
-		*(char *)to = *(char *)from;
-		return to;
-	case 2:
-		*(short *)to = *(short *)from;
-		return to;
-	case 4:
-		*(int *)to = *(int *)from;
-		return to;
-
-	case 3:
-		*(short *)to = *(short *)from;
-		*((char *)to + 2) = *((char *)from + 2);
-		return to;
-	case 5:
-		*(int *)to = *(int *)from;
-		*((char *)to + 4) = *((char *)from + 4);
-		return to;
-	case 6:
-		*(int *)to = *(int *)from;
-		*((short *)to + 2) = *((short *)from + 2);
-		return to;
-	case 8:
-		*(int *)to = *(int *)from;
-		*((int *)to + 1) = *((int *)from + 1);
-		return to;
-	}
-
-	esi = (long)from;
-	edi = (long)to;
-	if (n >= 5 * 4) {
-		/* large block: use rep prefix */
-		int ecx;
-		asm volatile("rep ; movsl"
-			     : "=&c" (ecx), "=&D" (edi), "=&S" (esi)
-			     : "0" (n / 4), "1" (edi), "2" (esi)
-			     : "memory"
-		);
-	} else {
-		/* small block: don't clobber ecx + smaller code */
-		if (n >= 4 * 4)
-			asm volatile("movsl"
-				     : "=&D"(edi), "=&S"(esi)
-				     : "0"(edi), "1"(esi)
-				     : "memory");
-		if (n >= 3 * 4)
-			asm volatile("movsl"
-				     : "=&D"(edi), "=&S"(esi)
-				     : "0"(edi), "1"(esi)
-				     : "memory");
-		if (n >= 2 * 4)
-			asm volatile("movsl"
-				     : "=&D"(edi), "=&S"(esi)
-				     : "0"(edi), "1"(esi)
-				     : "memory");
-		if (n >= 1 * 4)
-			asm volatile("movsl"
-				     : "=&D"(edi), "=&S"(esi)
-				     : "0"(edi), "1"(esi)
-				     : "memory");
-	}
-	switch (n % 4) {
-		/* tail */
-	case 0:
-		return to;
-	case 1:
-		asm volatile("movsb"
-			     : "=&D"(edi), "=&S"(esi)
-			     : "0"(edi), "1"(esi)
-			     : "memory");
-		return to;
-	case 2:
-		asm volatile("movsw"
-			     : "=&D"(edi), "=&S"(esi)
-			     : "0"(edi), "1"(esi)
-			     : "memory");
-		return to;
-	default:
-		asm volatile("movsw\n\tmovsb"
-			     : "=&D"(edi), "=&S"(esi)
-			     : "0"(edi), "1"(esi)
-			     : "memory");
-		return to;
-	}
-}
-
 #define __HAVE_ARCH_MEMCPY
 
+extern void *__memcpy(void *to, const void *from, size_t n);
+
 #ifdef CONFIG_X86_USE_3DNOW
 
 #include <asm/mmx.h>
@@ -155,7 +44,7 @@ static __always_inline void *__constant_memcpy(void *to, const void *from,
 static inline void *__constant_memcpy3d(void *to, const void *from, size_t len)
 {
 	if (len < 512)
-		return __constant_memcpy(to, from, len);
+		return __memcpy(to, from, len);
 	return _mmx_memcpy(to, from, len);
 }
 
@@ -168,20 +57,18 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
 
 #define memcpy(t, f, n)				\
 	(__builtin_constant_p((n))		\
-	 ? __constant_memcpy3d((t), (f), (n))	\
+	 ? __builtin_memcpy((t), (f), (n))	\
 	 : __memcpy3d((t), (f), (n)))
 
 #else
 
 /*
  *	No 3D Now!
+ *
+ * Let gcc figure it out.
  */
 
-#define memcpy(t, f, n)				\
-	(__builtin_constant_p((n))		\
-	 ? __constant_memcpy((t), (f), (n))	\
-	 : __memcpy((t), (f), (n)))
-
+#define memcpy(t, f, n) __builtin_memcpy(t,f,n)
 #endif
 
 #define __HAVE_ARCH_MEMMOVE
diff --git a/arch/x86/lib/memcpy_32.c b/arch/x86/lib/memcpy_32.c
index 5415a9d..16dc123 100644
--- a/arch/x86/lib/memcpy_32.c
+++ b/arch/x86/lib/memcpy_32.c
@@ -4,6 +4,22 @@
 #undef memcpy
 #undef memset
 
+void *__memcpy(void *to, const void *from, size_t n)
+{
+	int d0, d1, d2;
+	asm volatile("rep ; movsl\n\t"
+		     "movl %4,%%ecx\n\t"
+		     "andl $3,%%ecx\n\t"
+		     "jz 1f\n\t"
+		     "rep ; movsb\n\t"
+		     "1:"
+		     : "=&c" (d0), "=&D" (d1), "=&S" (d2)
+		     : "0" (n / 4), "g" (n), "1" ((long)to), "2" ((long)from)
+		     : "memory");
+	return to;
+}
+EXPORT_SYMBOL(__memcpy);
+
 void *memcpy(void *to, const void *from, size_t n)
 {
 #ifdef CONFIG_X86_USE_3DNOW


* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-22  8:45   ` [PATCH] X86-32: Let gcc decide whether to inline memcpy was " Andi Kleen
  2009-04-22 18:00     ` [tip:x86/asm] x86: use __builtin_memcpy() on 32 bits tip-bot for Andi Kleen
@ 2009-04-22 20:56     ` Linus Torvalds
  2009-04-22 21:15       ` Andi Kleen
  2009-04-22 23:49     ` Joe Damato
  2 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2009-04-22 20:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Jeff Garzik, LKML, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86



On Wed, 22 Apr 2009, Andi Kleen wrote:
> 
> Modern gcc (and that is all that is supported now) should be able to
> generate this code on its own already.  So if you call __builtin_* it
> will  just work (that is what 64bit does) without that explicit code.

Last time we tried that, it wasn't true. Gcc wouldn't inline even trivial 
cases of constant sizes.

I do not recall what gcc version that was all about, and maybe the 
versions we now require are all modern enough that it isn't an issue, but 
I'd want somebody to actually _verify_ that the oldest version we support 
does an ok job, and doesn't do stupid things for constant-sized memcpy's.

> IMHO it's long overdue to do this for 32bit too.

With actual testing, I'll happily merge this.

		Linus


* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-22 20:56     ` [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning Linus Torvalds
@ 2009-04-22 21:15       ` Andi Kleen
  2009-04-22 21:19         ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Andi Kleen @ 2009-04-22 21:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Ingo Molnar, Jeff Garzik, LKML, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Apr 22, 2009 at 01:56:54PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 22 Apr 2009, Andi Kleen wrote:
> > 
> > Modern gcc (and that is all that is supported now) should be able to
> > generate this code on its own already.  So if you call __builtin_* it
> > will  just work (that is what 64bit does) without that explicit code.
> 
> Last time we tried that, it wasn't true. Gcc wouldn't inline even trivial 
> cases of constant sizes.

AFAIK it's all true on 3.2+ when it can figure out the alignment
(but some gcc versions had problems passing the alignment around, e.g.
through inlining), under the assumption that the out-of-line version
can do a better job with unaligned data. That's not true with my patch,
but it could be true in theory.

Quick test here:

#include <string.h>

char a[10];
char b[2];
char c[4];
char d[8];

short x;
long y;

char xyz[100];


void f(void)
{
#define C(x) memcpy(&x, xyz, sizeof(x));
        C(x)
        C(y)
        C(a)
        C(b)
        C(c)
        C(d)
}

and everything gets inlined with gcc 3.2, which is the oldest
we still care about:

gcc version 3.2.3

        movzwl  xyz+8(%rip), %eax
        movzwl  xyz(%rip), %ecx
        movq    xyz(%rip), %rdx
        movw    %ax, a+8(%rip)
        movw    %cx, x(%rip)
        movw    %cx, b(%rip)
        movl    xyz(%rip), %eax
        movq    %rdx, y(%rip)
        movq    %rdx, a(%rip)
        movq    %rdx, d(%rip)
        movl    %eax, c(%rip)
        ret

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-22 21:15       ` Andi Kleen
@ 2009-04-22 21:19         ` Linus Torvalds
  2009-04-22 22:04           ` Andi Kleen
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2009-04-22 21:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Jeff Garzik, LKML, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86



On Wed, 22 Apr 2009, Andi Kleen wrote:
> 
> AFAIK it's all true on 3.2+ when it can figure out the alignment
> (but some gcc versions had problems passing the alignment around, e.g.
> through inlining), under the assumption that the out-of-line version
> can do a better job with unaligned data. That's not true with my patch,
> but it could be true in theory.

Maybe it was the unaligned case that I remember.

Because it's definitely not true that out-of-line code can do any better 
with unaligned data, at least not for small constants. And the case I 
remember was for some silly 8-byte case or similar.

> Quick test here:

How about you just compile the kernel with gcc-3.2 and compare the number 
of calls to memcpy before-and-after instead? That's the real test.

		Linus


* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-22 21:19         ` Linus Torvalds
@ 2009-04-22 22:04           ` Andi Kleen
  2009-04-23  6:08             ` fresh data was " Andi Kleen
  2009-04-23  6:30             ` Ingo Molnar
  0 siblings, 2 replies; 19+ messages in thread
From: Andi Kleen @ 2009-04-22 22:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Ingo Molnar, Jeff Garzik, LKML, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

> > Quick test here:
> 
> How about you just compile the kernel with gcc-3.2 and compare the number 
> of calls to memcpy before-and-after instead? That's the real test.

I waited over 10 minutes for the full vmlinux objdumps to finish; sorry,
I lost patience. If someone has a fast disassembler we can try it. I'll
leave them running overnight; maybe there will be exact numbers tomorrow.

But from a quick check (find -name '*.o' | xargs nm | grep memcpy) there are
very few files that call it with the patch, so there's some
evidence that there isn't a dramatic increase.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re:  New x86 warning
  2009-04-22  8:45   ` [PATCH] X86-32: Let gcc decide whether to inline memcpy was " Andi Kleen
  2009-04-22 18:00     ` [tip:x86/asm] x86: use __builtin_memcpy() on 32 bits tip-bot for Andi Kleen
  2009-04-22 20:56     ` [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning Linus Torvalds
@ 2009-04-22 23:49     ` Joe Damato
  2009-04-23  1:48       ` H. Peter Anvin
  2009-04-23  6:09       ` Andi Kleen
  2 siblings, 2 replies; 19+ messages in thread
From: Joe Damato @ 2009-04-22 23:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Jeff Garzik, Linus Torvalds, LKML, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Apr 22, 2009 at 1:45 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Ingo Molnar <mingo@elte.hu> writes:
>
>> * Jeff Garzik <jeff@garzik.org> wrote:
>>
>>> On x86-32, this warning now appears for me in 2.6.30-rc3, and did
>>> not appear in 2.6.29.
>>>
>>> drivers/acpi/acpica/tbfadt.c: In function 'acpi_tb_create_local_fadt':
>>> /spare/repo/linux-2.6/arch/x86/include/asm/string_32.h:75: warning:
>>> array subscript is above array bounds
>>
>> Last i checked it was a GCC bounds check bogosity. All attempts to
>> work it around or annotate it sanely (without changing the assembly
>> code) failed. (new ideas welcome)
>>
>> The closest i came was the hacklet below to the assembly code. [with
>> an intentionally corrupted patch header to make it harder to apply
>> accidentally.]
>
> Modern gcc (and that is all that is supported now) should be able to
> generate this code on its own already.  So if you call __builtin_* it
> will  just work (that is what 64bit does) without that explicit code.
>
> Here's a patch that does that with some numbers. It gives about 3k
> smaller kernels on my configuration.
>
> IMHO it's long overdue to do this for 32bit too.
>
> It's a very attractive patch because it removes a lot of code:

I think this patch is great. Perhaps this would be a good time to also
clean out memset for x86_32? (If needed, I can start a new email
thread with this patch.) I've built and booted this on my x86_32
hardware.

[danger: a newb's patch below]

Joe

--

Define memset to __builtin_memset and gcc should do the right thing.
Keep around the generic memset routine that gcc can use when it does
not inline. Removes a bunch of ugly code, too.

Signed-off-by: Joe Damato <ice799@gmail.com>
---
 arch/x86/include/asm/string_32.h |  117 +-------------------------------------
 arch/x86/lib/memcpy_32.c         |   12 ++++
 2 files changed, 15 insertions(+), 114 deletions(-)

diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 29fff54..aedee80 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -30,7 +30,6 @@ extern char *strchr(const char *s, int c);
 extern size_t strlen(const char *s);

 #define __HAVE_ARCH_MEMCPY
-
 extern void *__memcpy(void *to, const void *from, size_t n);

 #ifdef CONFIG_X86_USE_3DNOW
@@ -79,42 +78,8 @@ void *memmove(void *dest, const void *src, size_t n);
 #define __HAVE_ARCH_MEMCHR
 extern void *memchr(const void *cs, int c, size_t count);

-static inline void *__memset_generic(void *s, char c, size_t count)
-{
-	int d0, d1;
-	asm volatile("rep\n\t"
-		     "stosb"
-		     : "=&c" (d0), "=&D" (d1)
-		     : "a" (c), "1" (s), "0" (count)
-		     : "memory");
-	return s;
-}
-
-/* we might want to write optimized versions of these later */
-#define __constant_count_memset(s, c, count) __memset_generic((s), (c), (count))
-
-/*
- * memset(x, 0, y) is a reasonably common thing to do, so we want to fill
- * things 32 bits at a time even when we don't know the size of the
- * area at compile-time..
- */
-static __always_inline
-void *__constant_c_memset(void *s, unsigned long c, size_t count)
-{
-	int d0, d1;
-	asm volatile("rep ; stosl\n\t"
-		     "testb $2,%b3\n\t"
-		     "je 1f\n\t"
-		     "stosw\n"
-		     "1:\ttestb $1,%b3\n\t"
-		     "je 2f\n\t"
-		     "stosb\n"
-		     "2:"
-		     : "=&c" (d0), "=&D" (d1)
-		     : "a" (c), "q" (count), "0" (count/4), "1" ((long)s)
-		     : "memory");
-	return s;
-}
+#define __HAVE_ARCH_MEMSET
+extern void *__memset(void *s, char c, size_t count);

 /* Added by Gertjan van Wingerde to make minix and sysv module work */
 #define __HAVE_ARCH_STRNLEN
@@ -124,83 +89,7 @@ extern size_t strnlen(const char *s, size_t count);
 #define __HAVE_ARCH_STRSTR
 extern char *strstr(const char *cs, const char *ct);

-/*
- * This looks horribly ugly, but the compiler can optimize it totally,
- * as we by now know that both pattern and count is constant..
- */
-static __always_inline
-void *__constant_c_and_count_memset(void *s, unsigned long pattern,
-				    size_t count)
-{
-	switch (count) {
-	case 0:
-		return s;
-	case 1:
-		*(unsigned char *)s = pattern & 0xff;
-		return s;
-	case 2:
-		*(unsigned short *)s = pattern & 0xffff;
-		return s;
-	case 3:
-		*(unsigned short *)s = pattern & 0xffff;
-		*((unsigned char *)s + 2) = pattern & 0xff;
-		return s;
-	case 4:
-		*(unsigned long *)s = pattern;
-		return s;
-	}
-
-#define COMMON(x)							\
-	asm volatile("rep ; stosl"					\
-		     x							\
-		     : "=&c" (d0), "=&D" (d1)				\
-		     : "a" (eax), "0" (count/4), "1" ((long)s)	\
-		     : "memory")
-
-	{
-		int d0, d1;
-#if __GNUC__ == 4 && __GNUC_MINOR__ == 0
-		/* Workaround for broken gcc 4.0 */
-		register unsigned long eax asm("%eax") = pattern;
-#else
-		unsigned long eax = pattern;
-#endif
-
-		switch (count % 4) {
-		case 0:
-			COMMON("");
-			return s;
-		case 1:
-			COMMON("\n\tstosb");
-			return s;
-		case 2:
-			COMMON("\n\tstosw");
-			return s;
-		default:
-			COMMON("\n\tstosw\n\tstosb");
-			return s;
-		}
-	}
-
-#undef COMMON
-}
-
-#define __constant_c_x_memset(s, c, count)			\
-	(__builtin_constant_p(count)				\
-	 ? __constant_c_and_count_memset((s), (c), (count))	\
-	 : __constant_c_memset((s), (c), (count)))
-
-#define __memset(s, c, count)				\
-	(__builtin_constant_p(count)			\
-	 ? __constant_count_memset((s), (c), (count))	\
-	 : __memset_generic((s), (c), (count)))
-
-#define __HAVE_ARCH_MEMSET
-#define memset(s, c, count)						\
-	(__builtin_constant_p(c)					\
-	 ? __constant_c_x_memset((s), (0x01010101UL * (unsigned char)(c)), \
-				 (count))				\
-	 : __memset((s), (c), (count)))
+#define memset(s, c, count) __builtin_memset(s, c, count)

 /*
  * find the first occurrence of byte 'c', or 1 past the area if none
diff --git a/arch/x86/lib/memcpy_32.c b/arch/x86/lib/memcpy_32.c
index 16dc123..16b10d9 100644
--- a/arch/x86/lib/memcpy_32.c
+++ b/arch/x86/lib/memcpy_32.c
@@ -30,6 +30,18 @@ void *memcpy(void *to, const void *from, size_t n)
 }
 EXPORT_SYMBOL(memcpy);

+void *__memset(void *s, char c, size_t count)
+{
+	int d0, d1;
+	asm volatile("rep\n\t"
+		     "stosb"
+		     : "=&c" (d0), "=&D" (d1)
+		     : "a" (c), "1" (s), "0" (count)
+		     : "memory");
+	return s;
+}
+EXPORT_SYMBOL(__memset);
+
 void *memset(void *s, int c, size_t count)
 {
 	return __memset(s, c, count);
-- 
1.6.2


* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-22 23:49     ` Joe Damato
@ 2009-04-23  1:48       ` H. Peter Anvin
  2009-04-23 21:22         ` Joe Damato
  2009-04-23  6:09       ` Andi Kleen
  1 sibling, 1 reply; 19+ messages in thread
From: H. Peter Anvin @ 2009-04-23  1:48 UTC (permalink / raw)
  To: Joe Damato
  Cc: Andi Kleen, Ingo Molnar, Jeff Garzik, Linus Torvalds, LKML,
	Thomas Gleixner, Ingo Molnar, x86

Joe Damato wrote:
> 
> I think this patch is great. Perhaps this would be a good time to also
> clean out memset for x86_32? (If needed, I can start a new email
> thread with this patch) I've built and booted this on my x86_32
> hardware.
> 

Under the same conditions, i.e. making sure that with the oldest gcc we
still care about (3.2), we don't end up uninlining any new cases we
shouldn't.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* fresh data was Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-22 22:04           ` Andi Kleen
@ 2009-04-23  6:08             ` Andi Kleen
  2009-04-23  6:36               ` Ingo Molnar
  2009-04-23  6:30             ` Ingo Molnar
  1 sibling, 1 reply; 19+ messages in thread
From: Andi Kleen @ 2009-04-23  6:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Jeff Garzik, LKML, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Andi Kleen <andi@firstfloor.org> writes:

>> > Quick test here:
>> 
>> How about you just compile the kernel with gcc-3.2 and compare the number 
>> of calls to memcpy before-and-after instead? That's the real test.
>
> I waited over 10 minutes for the full vmlinux objdumps to finish; sorry,
> I lost patience. If someone has a fast disassembler we can try it. I'll
> leave them running overnight; maybe there will be exact numbers tomorrow.
>
> But from a quick check (find -name '*.o' | xargs nm | grep memcpy) there are
> very few files that call it with the patch, so there's some
> evidence that there isn't a dramatic increase.

I let the objdumps finish overnight. On my setup (defconfig + some 
additions) there are actually fewer calls to out-of-line memcpy/__memcpy
with the patch. I see only one for my defconfig, while there are 
~10 without the patch. So it makes very little difference.
The code size savings must come from more efficient code generation
for the inline case. I haven't investigated that in detail though.

So the patch seems like an overall win.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-22 23:49     ` Joe Damato
  2009-04-23  1:48       ` H. Peter Anvin
@ 2009-04-23  6:09       ` Andi Kleen
  1 sibling, 0 replies; 19+ messages in thread
From: Andi Kleen @ 2009-04-23  6:09 UTC (permalink / raw)
  To: Joe Damato
  Cc: Andi Kleen, Ingo Molnar, Jeff Garzik, Linus Torvalds, LKML,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

> > It's a very attractive patch because it removes a lot of code:
> 
> I think this patch is great. Perhaps this would be a good time to also
> clean out memset for x86_32? (If needed, I can start a new email

Yes, looks reasonable. You can add

Reviewed-by: Andi Kleen <ak@linux.intel.com>

if you fix the comment below.

In fact it should be synced to do the same as on 64bit, which already
does all that and was originally written for gcc 3.1/3.2.

-Andi
> +void *__memset(void *s, char c, size_t count)
> +{
> +	int d0, d1;
> +	asm volatile("rep\n\t"
> +		     "stosb"
> +		     : "=&c" (d0), "=&D" (d1)
> +		     : "a" (c), "1" (s), "0" (count)
> +		     : "memory");
> +	return s;
> +}
> +EXPORT_SYMBOL(__memset);

I suspect __memset is not needed anymore; just memset() alone
should be enough. So remove the wrapper below.

> +
>  void *memset(void *s, int c, size_t count)
>  {
>  	return __memset(s, c, count);
> -- 
> 1.6.2
> 
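
Collapsed, that could look something like this (an untested sketch
that merges the rep;stosb asm into memset() itself):

	void *memset(void *s, int c, size_t count)
	{
		int d0, d1;
		asm volatile("rep\n\t"
			     "stosb"
			     : "=&c" (d0), "=&D" (d1)
			     : "a" (c), "1" (s), "0" (count)
			     : "memory");
		return s;
	}
	EXPORT_SYMBOL(memset);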

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-22 22:04           ` Andi Kleen
  2009-04-23  6:08             ` fresh data was " Andi Kleen
@ 2009-04-23  6:30             ` Ingo Molnar
  2009-04-23  7:43               ` Andi Kleen
  1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2009-04-23  6:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Jeff Garzik, LKML, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86


* Andi Kleen <andi@firstfloor.org> wrote:

> > > Quick test here:
> > 
> > How about you just compile the kernel with gcc-3.2 and compare 
> > the number of calls to memcpy before-and-after instead? That's 
> > the real test.
> 
> I waited over 10 minutes for the full vmlinux objdumps to finish; 
> sorry, I lost patience. If someone has a fast disassembler we can try 
> it. I'll leave them running overnight; maybe there will be exact 
> numbers tomorrow.

Uhm, the test Linus requested is very simple, it doesn't need 'full' 
objdumps, just a plain defconfig [*] - an objdump takes less than 10 
seconds here even on an old box I tried it on.

I just did this - it all took less than 5 minutes to do the whole 
test with gcc34:

  vmlinux.gcc34.vanilla:       679 calls to memcpy
  vmlinux.gcc34.gcc-memcpy:   1393 calls to memcpy

So your patch more than doubles the number of calls to out-of-line 
memcpy on older GCC. That's not really acceptable, so I'm NAK-ing 
this patch.

Next time you send such patches, please test with older GCCs straight 
away - it's a basic act of testing when doing a patch that 'lets GCC 
decide' anything. GCC has a very bad track record in that area.

	Ingo


* Re: fresh data was Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-23  6:08             ` fresh data was " Andi Kleen
@ 2009-04-23  6:36               ` Ingo Molnar
  2009-04-23  7:37                 ` Andi Kleen
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2009-04-23  6:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Jeff Garzik, LKML, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86


* Andi Kleen <andi@firstfloor.org> wrote:

> Andi Kleen <andi@firstfloor.org> writes:
> 
> >> > Quick test here:
> >> 
> >> How about you just compile the kernel with gcc-3.2 and compare the number 
> >> of calls to memcpy before-and-after instead? That's the real test.
> >
> > I waited over 10 minutes for the full vmlinux objdumps to finish; 
> > sorry, I lost patience. If someone has a fast disassembler we can try 
> > it. I'll leave them running overnight; maybe there will be exact 
> > numbers tomorrow.
> >
> > But from a quick check (find -name '*.o' | xargs nm | grep memcpy) there are
> > very few files that call it with the patch, so there's some
> > evidence that there isn't a dramatic increase.
> I let the objdumps finish overnight. [...]

objdump -d never took me more than a minute - let alone a full 
night. You must be doing something really wrong there. Looking at 
objdump -d is an essential, unavoidable component of my workflow 
with x86 architecture patches; you need to find a way to do it 
efficiently if you want to send patches for this area of the kernel.

> [...] On my setup (defconfig + some additions) there are actually 
> fewer calls to out-of-line memcpy/__memcpy with the patch. I see 
> only one for my defconfig, while there are ~10 without the patch. 
> So it makes very little difference. The code size savings must 
> come from more efficient code generation for the inline case. I 
> haven't investigated that in detail though.
> 
> So the patch seems like an overall win.

It's a clear loss here with GCC 3.4, and it took me less than 5 
minutes to figure that out.

With what precise compiler version did you test (please paste the 
gcc -v output), and could you send me the precise .config you used, 
and describe the method you used to determine the number of 
out-of-line memcpy calls? I'd like to double-check your numbers.

	Ingo


* Re: fresh data was Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-23  6:36               ` Ingo Molnar
@ 2009-04-23  7:37                 ` Andi Kleen
  0 siblings, 0 replies; 19+ messages in thread
From: Andi Kleen @ 2009-04-23  7:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Linus Torvalds, Jeff Garzik, LKML, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Thu, Apr 23, 2009 at 08:36:25AM +0200, Ingo Molnar wrote:
> 
> * Andi Kleen <andi@firstfloor.org> wrote:
> 
> > Andi Kleen <andi@firstfloor.org> writes:
> > 
> > >> > Quick test here:
> > >> 
> > >> How about you just compile the kernel with gcc-3.2 and compare the number 
> > >> of calls to memcpy before-and-after instead? That's the real test.
> > >
> > > I waited over 10 minutes for the full vmlinux objdumps to finish; sorry,
> > > I lost patience. If someone has a fast disassembler we can try it. I'll
> > > leave them running overnight; maybe there will be exact numbers tomorrow.
> > >
> > > But from a quick check (find -name '*.o' | xargs nm | grep memcpy) there are
> > > very few files that call it with the patch, so there's some
> > > evidence that there isn't a dramatic increase.
> > 
> > I let the objdumps finish overnight. [...]
> 
> objdump -d never took me more than a minute - let alone a full 

I use objdump -S. Maybe that's slower than -d.

Hmm, quick test: yes, -S seems to be much slower than -d. Thanks for
the hint. I guess I should switch to -d for these cases; unfortunately
-S seems to be hardcoded in my fingers, and of course it gives much
nicer output if you have debug info.

> night. You must be doing something really wrong there. Looking at 
> objdump -d is an essential, unavoidable component of my workflow 
> with x86 architecture patches; you need to find a way to do it 

I do it all the time too, but only for specific functions, not
for full kernels. I have an objdump-symbol script for that, which
looks up a symbol in the symbol table and only disassembles
the function I'm interested in
(ftp://firstfloor.org/pub/ak/perl/objdump-symbol).
I normally don't look at full listings of the complete kernel.

> > [...] On my setup (defconfig + some additions) there are actually 
> > fewer calls to out-of-line memcpy/__memcpy with the patch. I see 
> > only one for my defconfig, while there are ~10 without the patch. 
> > So it makes very little difference. The code size savings must 
> > come from more efficient code generation for the inline case. I 
> > haven't investigated that in detail though.
> > 
> > So the patch seems like an overall win.
> 
> It's a clear loss here with GCC 3.4, and it took me less than 5 
> minutes to figure that out.

Loss in what way?

> 
> With what precise compiler version did you test (please paste the 
> gcc -v output), and could you send me the precise .config you used, 

See the 2nd previous mail: 3.2.3

I didn't do tests with later versions, assuming there are no 
regressions.

> and describe the method you used to determine the number of 
> out-of-line memcpy calls? I'd like to double-check your numbers.

objdump -S ... | grep call.*memcpy    (gives some false positives,
you have to weed them out)

In addition I did a quick find -name '*.o' | xargs nm | grep 'U.*memcpy$'
to (under)estimate the calls.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-23  6:30             ` Ingo Molnar
@ 2009-04-23  7:43               ` Andi Kleen
  0 siblings, 0 replies; 19+ messages in thread
From: Andi Kleen @ 2009-04-23  7:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Linus Torvalds, Jeff Garzik, LKML, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Thu, Apr 23, 2009 at 08:30:53AM +0200, Ingo Molnar wrote:
> 
> * Andi Kleen <andi@firstfloor.org> wrote:
> 
> > > > Quick test here:
> > > 
> > > How about you just compile the kernel with gcc-3.2 and compare 
> > > the number of calls to memcpy before-and-after instead? That's 
> > > the real test.
> > 
> > I waited over 10 minutes for the full vmlinux objdumps to finish; 
> > sorry, I lost patience. If someone has a fast disassembler we can try 
> > it. I'll leave them running overnight; maybe there will be exact 
> > numbers tomorrow.
> 
> Uhm, the test Linus requested is very simple, it doesn't need 'full' 
> objdumps, just a plain defconfig [*] - an objdump takes less than 10 
> seconds here even on an old box I tried it on.
> 
> I just did this - it all took less than 5 minutes to do the whole 
> test with gcc34:
> 
>   vmlinux.gcc34.vanilla:       679 calls to memcpy
>   vmlinux.gcc34.gcc-memcpy:   1393 calls to memcpy
> 
> So your patch more than doubles the number of calls to out-of-line 
> memcpy on older GCC. That's not really acceptable 

How do you determine it's not acceptable?

It doesn't seem nice to me, but it's not a fatal problem. I think
Linus was more interested in dramatic growth, but a factor of 2 over
a very large code base doesn't seem dramatic to me, especially
since it's very likely most of them are slow-path code.

> Next time you send such patches please test with older GCCs straight 
> away - it's a basic act of testing when doing a patch that 'lets GCC 

I tested with 3.2.3, but not with 3.4.

3.2.3 doesn't do that. AFAIK 3.2 was a pretty common compiler; at least
several SUSE releases shipped with it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re:  New x86 warning
  2009-04-23  1:48       ` H. Peter Anvin
@ 2009-04-23 21:22         ` Joe Damato
  2009-04-23 22:09           ` H. Peter Anvin
  2009-04-24  8:44           ` Andi Kleen
  0 siblings, 2 replies; 19+ messages in thread
From: Joe Damato @ 2009-04-23 21:22 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Ingo Molnar, Jeff Garzik, Linus Torvalds, LKML,
	Thomas Gleixner, Ingo Molnar, x86

On Wed, Apr 22, 2009 at 6:48 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> Joe Damato wrote:
>>
>> I think this patch is great. Perhaps this would be a good time to also
>> clean out memset for x86_32? (If needed, I can start a new email
>> thread with this patch) I've built and booted this on my x86_32
>> hardware.
>>
>
> Under the same conditions, i.e. making sure that with the oldest gcc we
> still care about (3.2) we don't end up uninlining any new cases we
> shouldn't.

Looks like this thread is dead/dying, but I figured I should reply with
my test findings. The number of out-of-line calls (as determined by
make mrproper && make defconfig && make && objdump -d vmlinux | grep
"call.*\<memset" | wc -l):

gcc 4.2.4 - withOUT memset patch: 20
gcc 4.2.4 - with memset patch: 365

gcc 3.4 - withOUT memset patch: 17
gcc 3.4 - with memset patch: 349

I'm guessing this is probably not acceptable, so I won't bother
installing/trying gcc-3.2 unless anyone thinks that a 300+ increase in
out-of-line calls is OK.

Joe


* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-23 21:22         ` Joe Damato
@ 2009-04-23 22:09           ` H. Peter Anvin
  2009-04-24  8:44           ` Andi Kleen
  1 sibling, 0 replies; 19+ messages in thread
From: H. Peter Anvin @ 2009-04-23 22:09 UTC (permalink / raw)
  To: Joe Damato
  Cc: Andi Kleen, Ingo Molnar, Jeff Garzik, Linus Torvalds, LKML,
	Thomas Gleixner, Ingo Molnar, x86

Joe Damato wrote:
> 
> Looks like this thread is dead/dying, but I figured I should reply with
> my test findings. The number of out-of-line calls (as determined by
> make mrproper && make defconfig && make && objdump -d vmlinux | grep
> "call.*\<memset" | wc -l):
> 
> gcc 4.2.4 - withOUT memset patch: 20
> gcc 4.2.4 - with memset patch: 365
> 
> gcc 3.4 - withOUT memset patch: 17
> gcc 3.4 - with memset patch: 349
> 
> I'm guessing this is probably not acceptable, so I won't bother
> installing/trying gcc-3.2 unless anyone thinks that a 300+ increase in
> out-of-line calls is OK.
> 

Not unless it can be proven those calls are in non-performance-critical 
contexts.  That's a lot of work to go through, though.

	-hpa



* Re: [PATCH] X86-32: Let gcc decide whether to inline memcpy was Re: New x86 warning
  2009-04-23 21:22         ` Joe Damato
  2009-04-23 22:09           ` H. Peter Anvin
@ 2009-04-24  8:44           ` Andi Kleen
  1 sibling, 0 replies; 19+ messages in thread
From: Andi Kleen @ 2009-04-24  8:44 UTC (permalink / raw)
  To: Joe Damato
  Cc: H. Peter Anvin, Andi Kleen, Ingo Molnar, Jeff Garzik,
	Linus Torvalds, LKML, Thomas Gleixner, Ingo Molnar, x86

> gcc 4.2.4 - withOUT memset patch: 20
> gcc 4.2.4 - with memset patch: 365
> 
> gcc 3.4 - withOUT memset patch: 17
> gcc 3.4 - with memset patch: 349

Yes, it sounds like 3.4 is worse on that than 3.2. Too bad.

> I'm guessing this is probably not acceptable, so I won't bother

It depends on whether the calls are in critical code, or how big they
are (for a 1K memset it's totally fine to have it out of line).

For example, we wouldn't care about any memsets in __init functions.
You could filter those out, and perhaps eyeball the code.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

