linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH] optimize ia32 memmove
       [not found] <200312300713.hBU7DGC4024213@hera.kernel.org>
@ 2003-12-30  7:32 ` Jeff Garzik
  2003-12-30  7:51   ` Andrew Morton
  2003-12-30 10:21   ` Ed Sweetman
  0 siblings, 2 replies; 12+ messages in thread
From: Jeff Garzik @ 2003-12-30  7:32 UTC (permalink / raw)
  To: manfred, akpm; +Cc: Linux Kernel Mailing List

Linux Kernel Mailing List wrote:
> ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org
> 
> 	[PATCH] optimize ia32 memmove
> 	
> 	From: Manfred Spraul <manfred@colorfullife.com>
> 	
> 	The memmove implementation of i386 is not optimized: it uses movsb, which is
> 	far slower than movsd.  The optimization is trivial: if dest is less than
> 	source, then call memcpy().  markw tried it on a 4xXeon with dbt2, it saved
> 	around 300 million cpu ticks in cache_flusharray():
[...]
> diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
> --- a/include/asm-i386/string.h	Mon Dec 29 23:13:20 2003
> +++ b/include/asm-i386/string.h	Mon Dec 29 23:13:20 2003
> @@ -299,14 +299,9 @@
>  static inline void * memmove(void * dest,const void * src, size_t n)
>  {
>  int d0, d1, d2;
> -if (dest<src)
> -__asm__ __volatile__(
> -	"rep\n\t"
> -	"movsb"
> -	: "=&c" (d0), "=&S" (d1), "=&D" (d2)
> -	:"0" (n),"1" (src),"2" (dest)
> -	: "memory");
> -else
> +if (dest<src) {
> +	memcpy(dest,src,n);
> +} else
>  __asm__ __volatile__(
>  	"std\n\t"
>  	"rep\n\t"

Dumb question, though... what about the overlap case, when dest<src?
It seems to me this change is ignoring that.

	Jeff





* Re: [PATCH] optimize ia32 memmove
  2003-12-30  7:32 ` [PATCH] optimize ia32 memmove Jeff Garzik
@ 2003-12-30  7:51   ` Andrew Morton
  2003-12-30  7:56     ` Jeff Garzik
  2003-12-30 10:21   ` Ed Sweetman
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2003-12-30  7:51 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: manfred, linux-kernel

Jeff Garzik <jgarzik@pobox.com> wrote:
>
> Linux Kernel Mailing List wrote:
> > ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org
> > 
> > 	[PATCH] optimize ia32 memmove
> > 	
> > 	From: Manfred Spraul <manfred@colorfullife.com>
> > 	
> > 	The memmove implementation of i386 is not optimized: it uses movsb, which is
> > 	far slower than movsd.  The optimization is trivial: if dest is less than
> > 	source, then call memcpy().  markw tried it on a 4xXeon with dbt2, it saved
> > 	around 300 million cpu ticks in cache_flusharray():
> [...]
> > diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
> > --- a/include/asm-i386/string.h	Mon Dec 29 23:13:20 2003
> > +++ b/include/asm-i386/string.h	Mon Dec 29 23:13:20 2003
> > @@ -299,14 +299,9 @@
> >  static inline void * memmove(void * dest,const void * src, size_t n)
> >  {
> >  int d0, d1, d2;
> > -if (dest<src)
> > -__asm__ __volatile__(
> > -	"rep\n\t"
> > -	"movsb"
> > -	: "=&c" (d0), "=&S" (d1), "=&D" (d2)
> > -	:"0" (n),"1" (src),"2" (dest)
> > -	: "memory");
> > -else
> > +if (dest<src) {
> > +	memcpy(dest,src,n);
> > +} else
> >  __asm__ __volatile__(
> >  	"std\n\t"
> >  	"rep\n\t"
> 
> Dumb question, though...   what about the overlap case, when dest<src ? 
>   It seems to me this change is ignoring that.
> 

"if dest is less than source, then call memcpy".  If the move is to a
higher address we do it the old way.



* Re: [PATCH] optimize ia32 memmove
  2003-12-30  7:51   ` Andrew Morton
@ 2003-12-30  7:56     ` Jeff Garzik
  2003-12-30  8:11       ` Andrew Morton
                         ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jeff Garzik @ 2003-12-30  7:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: manfred, linux-kernel

Andrew Morton wrote:
> Jeff Garzik <jgarzik@pobox.com> wrote:
> 
>>Linux Kernel Mailing List wrote:
>>
>>>ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org
>>>
>>>	[PATCH] optimize ia32 memmove
>>>	
>>>	From: Manfred Spraul <manfred@colorfullife.com>
>>>	
>>>	The memmove implementation of i386 is not optimized: it uses movsb, which is
>>>	far slower than movsd.  The optimization is trivial: if dest is less than
>>>	source, then call memcpy().  markw tried it on a 4xXeon with dbt2, it saved
>>>	around 300 million cpu ticks in cache_flusharray():
>>
>>[...]
>>
>>>diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
>>>--- a/include/asm-i386/string.h	Mon Dec 29 23:13:20 2003
>>>+++ b/include/asm-i386/string.h	Mon Dec 29 23:13:20 2003
>>>@@ -299,14 +299,9 @@
>>> static inline void * memmove(void * dest,const void * src, size_t n)
>>> {
>>> int d0, d1, d2;
>>>-if (dest<src)
>>>-__asm__ __volatile__(
>>>-	"rep\n\t"
>>>-	"movsb"
>>>-	: "=&c" (d0), "=&S" (d1), "=&D" (d2)
>>>-	:"0" (n),"1" (src),"2" (dest)
>>>-	: "memory");
>>>-else
>>>+if (dest<src) {
>>>+	memcpy(dest,src,n);
>>>+} else
>>> __asm__ __volatile__(
>>> 	"std\n\t"
>>> 	"rep\n\t"
>>
>>Dumb question, though...   what about the overlap case, when dest<src ? 
>>  It seems to me this change is ignoring that.
>>
> 
> 
> "if dest is less than source, then call memcpy".  If the move is to a
> higher address we do it the old way.


I'm confused... that doesn't say anything to me about overlap.

They can still overlap:  Consider if dest is 1 byte less than src, and 
n==128...

	Jeff





* Re: [PATCH] optimize ia32 memmove
  2003-12-30  7:56     ` Jeff Garzik
@ 2003-12-30  8:11       ` Andrew Morton
  2003-12-30  8:11       ` Andreas Dilger
  2003-12-30  9:58       ` Linus Torvalds
  2 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2003-12-30  8:11 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: manfred, linux-kernel

Jeff Garzik <jgarzik@pobox.com> wrote:
>
> > "if dest is less than source, then call memcpy".  If the move is to a
>  > higher address we do it the old way.
> 
> 
>  I'm confused... that doesn't say anything to me about overlap.
> 

Overlap is OK if dest<src, because memcpy starts with the lowest address
and works upwards.
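Andrew's point can be written out as a small sketch (plain C with a hypothetical helper name, not the kernel's inline asm): an ascending copy reads each source byte before the destination pointer reaches it, so overlap with dest below src is harmless.

```c
#include <stddef.h>
#include <string.h>

/* Ascending byte copy, the same order as "rep movsb": walks from
 * the lowest address upwards.  When dest < src and the regions
 * overlap, every source byte is read before the destination
 * pointer catches up and overwrites it. */
static void *copy_ascending(void *dest, const void *src, size_t n)
{
	char *d = dest;
	const char *s = src;

	while (n--)
		*d++ = *s++;
	return dest;
}
```

For example, shifting "abcdefgh" down by one byte with copy_ascending(buf, buf + 1, 7) leaves "bcdefghh": each overlapping byte is read before it is clobbered.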




* Re: [PATCH] optimize ia32 memmove
  2003-12-30  7:56     ` Jeff Garzik
  2003-12-30  8:11       ` Andrew Morton
@ 2003-12-30  8:11       ` Andreas Dilger
  2003-12-30 10:05         ` Linus Torvalds
  2003-12-30  9:58       ` Linus Torvalds
  2 siblings, 1 reply; 12+ messages in thread
From: Andreas Dilger @ 2003-12-30  8:11 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andrew Morton, manfred, linux-kernel

On Dec 30, 2003  02:56 -0500, Jeff Garzik wrote:
> Andrew Morton wrote:
> > Jeff Garzik <jgarzik@pobox.com> wrote:
> > 
> >>Linux Kernel Mailing List wrote:
> >>
> >>>ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org
> >>>
> >>>	[PATCH] optimize ia32 memmove
> >>>	
> >>>	From: Manfred Spraul <manfred@colorfullife.com>
> >>>	
> >>>	The memmove implementation of i386 is not optimized: it uses movsb, which is
> >>>	far slower than movsd.  The optimization is trivial: if dest is less than
> >>>	source, then call memcpy().  markw tried it on a 4xXeon with dbt2, it saved
> >>>	around 300 million cpu ticks in cache_flusharray():
> >>
> >>[...]
> >>
> >>>diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
> >>>--- a/include/asm-i386/string.h	Mon Dec 29 23:13:20 2003
> >>>+++ b/include/asm-i386/string.h	Mon Dec 29 23:13:20 2003
> >>>@@ -299,14 +299,9 @@
> >>> static inline void * memmove(void * dest,const void * src, size_t n)
> >>> {
> >>> int d0, d1, d2;
> >>>-if (dest<src)
> >>>-__asm__ __volatile__(
> >>>-	"rep\n\t"
> >>>-	"movsb"
> >>>-	: "=&c" (d0), "=&S" (d1), "=&D" (d2)
> >>>-	:"0" (n),"1" (src),"2" (dest)
> >>>-	: "memory");
> >>>-else
> >>>+if (dest<src) {
> >>>+	memcpy(dest,src,n);
> >>>+} else
> >>> __asm__ __volatile__(
> >>> 	"std\n\t"
> >>> 	"rep\n\t"
> >>
> >>Dumb question, though...   what about the overlap case, when dest<src ? 
> >>  It seems to me this change is ignoring that.
> >>
> > 
> > 
> > "if dest is less than source, then call memcpy".  If the move is to a
> > higher address we do it the old way.
> 
> 
> I'm confused... that doesn't say anything to me about overlap.
> 
> They can still overlap:  Consider if dest is 1 byte less than src, and 
> n==128...

The non-overlapping cases are probably very common and worth optimizing for:

	if (dest + n < src || src + n < dest) {
		memcpy(dest, src, n);
	} else {
		"old way"
	}

People generally call memmove() if they _think_ the areas might overlap,
but if memcpy() can be so much faster it is probably worth catching the
cases where the two areas don't actually overlap.  The above assumes that
"dest + n" and "src + n" don't wrap, but that is probably already a broken
case in the current code, so it is not worth adding extra checks for.
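Andreas's test can be written as a compilable sketch (hypothetical helper name; this uses the usual non-strict form, which has the same intent as his strict "<" version):

```c
#include <stddef.h>

/* Non-overlap check for the proposed fast path: the regions
 * [dest, dest+n) and [src, src+n) are disjoint iff one ends at or
 * before the point where the other begins.  As in the thread,
 * pointer wraparound at the top of the address space is ignored. */
static int regions_disjoint(const void *dest, const void *src, size_t n)
{
	const char *d = dest;
	const char *s = src;

	return d + n <= s || s + n <= d;
}
```

memmove() could then take the fast path with `if (regions_disjoint(dest, src, n)) memcpy(dest, src, n);` and fall back to the old way otherwise.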

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/



* Re: [PATCH] optimize ia32 memmove
  2003-12-30  7:56     ` Jeff Garzik
  2003-12-30  8:11       ` Andrew Morton
  2003-12-30  8:11       ` Andreas Dilger
@ 2003-12-30  9:58       ` Linus Torvalds
  2003-12-30 10:17         ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2003-12-30  9:58 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andrew Morton, manfred, linux-kernel



On Tue, 30 Dec 2003, Jeff Garzik wrote:
> 
> I'm confused... that doesn't say anything to me about overlap.
> 
> They can still overlap:  Consider if dest is 1 byte less than src, and 
> n==128...

But then anything that does the loads in ascending order is still ok, so
it shouldn't matter - by the time "dest" has been overwritten, the source
data has already been read. And all the "memcpy()"  implementations had
better do that anyway, in order to get nice memory access patterns. "rep
movsl" certainly does.

So assuming we have an ascending "memcpy()", the only case we need to care 
about is "overlap && dest > src".

Now, if we have a non-ascending memcpy(), we have trouble. 
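The two cases can be sketched in plain C (illustrative only, not the i386 inline asm from the patch): copy upwards when dest is below src, downwards when it is above.

```c
#include <stddef.h>

/* Direction-aware memmove sketch.  dest <= src: ascending copy,
 * safe because each source byte is read before being overwritten.
 * dest > src: descending copy, the mirror-image overlap case that
 * the original std/movsb code handled. */
static void *memmove_sketch(void *dest, const void *src, size_t n)
{
	char *d = dest;
	const char *s = src;

	if (d <= s) {
		while (n--)
			*d++ = *s++;	/* lowest address first */
	} else {
		while (n--)
			d[n] = s[n];	/* highest address first */
	}
	return dest;
}
```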

		Linus


* Re: [PATCH] optimize ia32 memmove
  2003-12-30  8:11       ` Andreas Dilger
@ 2003-12-30 10:05         ` Linus Torvalds
  0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2003-12-30 10:05 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Jeff Garzik, Andrew Morton, manfred, linux-kernel



On Tue, 30 Dec 2003, Andreas Dilger wrote:
> 
> The non-overlapping cases are probably very common and worth optimizing for:

No, almost all non-overlapping users already just use "memcpy()". 

So most of the kernel uses of "memmove()" are likely overlapping - and 
just optimizing the non-overlap case probably doesn't help a lot.

That's why you optimize the overlapping case in one direction. We really 
should do the other case right too.

		Linus


* Re: [PATCH] optimize ia32 memmove
  2003-12-30  9:58       ` Linus Torvalds
@ 2003-12-30 10:17         ` Jeremy Fitzhardinge
  2003-12-30 11:12           ` Manfred Spraul
  0 siblings, 1 reply; 12+ messages in thread
From: Jeremy Fitzhardinge @ 2003-12-30 10:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff Garzik, Andrew Morton, manfred, Linux Kernel List

On Tue, 2003-12-30 at 01:58, Linus Torvalds wrote:
> But then anything that does the loads in ascending order is still ok, so
> it shouldn't matter - by the time "dest" has been overwritten, the source
> data has already been read. And all the "memcpy()"  implementations had
> better do that anyway, in order to get nice memory access patterns. "rep
> movsl" certainly does.

A PPC memcpy may end up clearing the destination before reading the
source (using the cache-line zeroing instruction, to prevent the
destination from being spuriously read to populate the cache line).

	J



* Re: [PATCH] optimize ia32 memmove
  2003-12-30  7:32 ` [PATCH] optimize ia32 memmove Jeff Garzik
  2003-12-30  7:51   ` Andrew Morton
@ 2003-12-30 10:21   ` Ed Sweetman
  2003-12-30 10:37     ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Ed Sweetman @ 2003-12-30 10:21 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: manfred, akpm, Linux Kernel Mailing List

Is this the entire patch?  When was the original post?  I can't find
the message in the archives going back over a month, or in any current
posts.


Jeff Garzik wrote:
> Linux Kernel Mailing List wrote:
> 
>> ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org
>>
>>     [PATCH] optimize ia32 memmove
>>     
>>     From: Manfred Spraul <manfred@colorfullife.com>
>>     
>> diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
>> --- a/include/asm-i386/string.h    Mon Dec 29 23:13:20 2003
>> +++ b/include/asm-i386/string.h    Mon Dec 29 23:13:20 2003
>> @@ -299,14 +299,9 @@
>>  static inline void * memmove(void * dest,const void * src, size_t n)
>>  {
>>  int d0, d1, d2;
>> -if (dest<src)
>> -__asm__ __volatile__(
>> -    "rep\n\t"
>> -    "movsb"
>> -    : "=&c" (d0), "=&S" (d1), "=&D" (d2)
>> -    :"0" (n),"1" (src),"2" (dest)
>> -    : "memory");
>> -else
>> +if (dest<src) {
>> +    memcpy(dest,src,n);
>> +} else
>>  __asm__ __volatile__(
>>      "std\n\t"
>>      "rep\n\t"
> 



* Re: [PATCH] optimize ia32 memmove
  2003-12-30 10:21   ` Ed Sweetman
@ 2003-12-30 10:37     ` Andrew Morton
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2003-12-30 10:37 UTC (permalink / raw)
  To: Ed Sweetman; +Cc: jgarzik, manfred, linux-kernel

Ed Sweetman <ed.sweetman@wmich.edu> wrote:
>
> Is this the entire patch?  When was the original post since I cant find 
> the message going back over a month ago in the archives or any current 
> posts.
> 

The original patch is on the 2.5/2.6 commits mailing list,
bk-commits-head@vger.kernel.org.  That list sets
linux-kernel@vger.kernel.org as the From: address, so replies come here.


* Re: [PATCH] optimize ia32 memmove
  2003-12-30 10:17         ` Jeremy Fitzhardinge
@ 2003-12-30 11:12           ` Manfred Spraul
  2003-12-30 20:17             ` H. Peter Anvin
  0 siblings, 1 reply; 12+ messages in thread
From: Manfred Spraul @ 2003-12-30 11:12 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linus Torvalds, Jeff Garzik, Andrew Morton, Linux Kernel List

Jeremy Fitzhardinge wrote:

 >On Tue, 2003-12-30 at 01:58, Linus Torvalds wrote:
 >
 >>But then anything that does the loads in ascending order is still ok, so
 >>it shouldn't matter - by the time "dest" has been overwritten, the source
 >>data has already been read. And all the "memcpy()"  implementations had
 >>better do that anyway, in order to get nice memory access patterns. "rep
 >>movsl" certainly does.

AMD recommends performing bulk copies backwards: that defeats the hw 
prefetcher and results in even better access patterns. It doesn't matter 
in this case; memmove is never used for bulk copies.

 >A PPC memcpy may end up clearing the destination before reading the
 >source (using the cache-line zeroing instruction, to prevent the
 >destination from being spuriously read to populate the cache line).
 >
The change is i386-only; it has no effect on other archs.
I found the unoptimized memmove in oprofiles of dbt2 test runs: slab 
contains a few memmoves to keep its recently used arrays in strict LIFO 
order. Typically perhaps 100 bytes.

--
    Manfred




* Re: [PATCH] optimize ia32 memmove
  2003-12-30 11:12           ` Manfred Spraul
@ 2003-12-30 20:17             ` H. Peter Anvin
  0 siblings, 0 replies; 12+ messages in thread
From: H. Peter Anvin @ 2003-12-30 20:17 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <3FF15DAB.8080203@colorfullife.com>
By author:    Manfred Spraul <manfred@colorfullife.com>
In newsgroup: linux.dev.kernel
> 
> AMD recommends performing bulk copies backwards: that defeats the hw 
> prefetcher and results in even better access patterns. It doesn't matter 
> in this case; memmove is never used for bulk copies.
> 

That's also a micro-optimization for one particular microarchitecture
*bug*.  Hardware prefetchers are becoming omnidirectional going forward.
Additionally, nearly all bulk copies are performed forward (DF=0) in
existing codebases.

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
If you send me mail in HTML format I will assume it's spam.
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64


end of thread

Thread overview: 12+ messages
     [not found] <200312300713.hBU7DGC4024213@hera.kernel.org>
2003-12-30  7:32 ` [PATCH] optimize ia32 memmove Jeff Garzik
2003-12-30  7:51   ` Andrew Morton
2003-12-30  7:56     ` Jeff Garzik
2003-12-30  8:11       ` Andrew Morton
2003-12-30  8:11       ` Andreas Dilger
2003-12-30 10:05         ` Linus Torvalds
2003-12-30  9:58       ` Linus Torvalds
2003-12-30 10:17         ` Jeremy Fitzhardinge
2003-12-30 11:12           ` Manfred Spraul
2003-12-30 20:17             ` H. Peter Anvin
2003-12-30 10:21   ` Ed Sweetman
2003-12-30 10:37     ` Andrew Morton
