* Re: [PATCH] optimize ia32 memmove
[not found] <200312300713.hBU7DGC4024213@hera.kernel.org>
@ 2003-12-30 7:32 ` Jeff Garzik
2003-12-30 7:51 ` Andrew Morton
2003-12-30 10:21 ` Ed Sweetman
0 siblings, 2 replies; 12+ messages in thread
From: Jeff Garzik @ 2003-12-30 7:32 UTC (permalink / raw)
To: manfred, akpm; +Cc: Linux Kernel Mailing List
Linux Kernel Mailing List wrote:
> ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org
>
> [PATCH] optimize ia32 memmove
>
> From: Manfred Spraul <manfred@colorfullife.com>
>
> The memmove implementation of i386 is not optimized: it uses movsb, which is
> far slower than movsd. The optimization is trivial: if dest is less than
> source, then call memcpy(). markw tried it on a 4xXeon with dbt2, it saved
> around 300 million cpu ticks in cache_flusharray():
[...]
> diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
> --- a/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
> +++ b/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
> @@ -299,14 +299,9 @@
> static inline void * memmove(void * dest,const void * src, size_t n)
> {
> int d0, d1, d2;
> -if (dest<src)
> -__asm__ __volatile__(
> - "rep\n\t"
> - "movsb"
> - : "=&c" (d0), "=&S" (d1), "=&D" (d2)
> - :"0" (n),"1" (src),"2" (dest)
> - : "memory");
> -else
> +if (dest<src) {
> + memcpy(dest,src,n);
> +} else
> __asm__ __volatile__(
> "std\n\t"
> "rep\n\t"
Dumb question, though... what about the overlap case, when dest<src ?
It seems to me this change is ignoring that.
Jeff
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove
2003-12-30 7:32 ` [PATCH] optimize ia32 memmove Jeff Garzik
@ 2003-12-30 7:51 ` Andrew Morton
2003-12-30 7:56 ` Jeff Garzik
2003-12-30 10:21 ` Ed Sweetman
1 sibling, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2003-12-30 7:51 UTC (permalink / raw)
To: Jeff Garzik; +Cc: manfred, linux-kernel
Jeff Garzik <jgarzik@pobox.com> wrote:
>
> Linux Kernel Mailing List wrote:
> > ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org
> >
> > [PATCH] optimize ia32 memmove
> >
> > From: Manfred Spraul <manfred@colorfullife.com>
> >
> > The memmove implementation of i386 is not optimized: it uses movsb, which is
> > far slower than movsd. The optimization is trivial: if dest is less than
> > source, then call memcpy(). markw tried it on a 4xXeon with dbt2, it saved
> > around 300 million cpu ticks in cache_flusharray():
> [...]
> > diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
> > --- a/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
> > +++ b/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
> > @@ -299,14 +299,9 @@
> > static inline void * memmove(void * dest,const void * src, size_t n)
> > {
> > int d0, d1, d2;
> > -if (dest<src)
> > -__asm__ __volatile__(
> > - "rep\n\t"
> > - "movsb"
> > - : "=&c" (d0), "=&S" (d1), "=&D" (d2)
> > - :"0" (n),"1" (src),"2" (dest)
> > - : "memory");
> > -else
> > +if (dest<src) {
> > + memcpy(dest,src,n);
> > +} else
> > __asm__ __volatile__(
> > "std\n\t"
> > "rep\n\t"
>
> Dumb question, though... what about the overlap case, when dest<src ?
> It seems to me this change is ignoring that.
>
"if dest is less that source, then call memcpy". If the move is to a
higher address we do it the old way.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove
2003-12-30 7:51 ` Andrew Morton
@ 2003-12-30 7:56 ` Jeff Garzik
2003-12-30 8:11 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Jeff Garzik @ 2003-12-30 7:56 UTC (permalink / raw)
To: Andrew Morton; +Cc: manfred, linux-kernel
Andrew Morton wrote:
> Jeff Garzik <jgarzik@pobox.com> wrote:
>
>>Linux Kernel Mailing List wrote:
>>
>>>ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org
>>>
>>> [PATCH] optimize ia32 memmove
>>>
>>> From: Manfred Spraul <manfred@colorfullife.com>
>>>
>>> The memmove implementation of i386 is not optimized: it uses movsb, which is
>>> far slower than movsd. The optimization is trivial: if dest is less than
>>> source, then call memcpy(). markw tried it on a 4xXeon with dbt2, it saved
>>> around 300 million cpu ticks in cache_flusharray():
>>
>>[...]
>>
>>>diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
>>>--- a/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
>>>+++ b/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
>>>@@ -299,14 +299,9 @@
>>> static inline void * memmove(void * dest,const void * src, size_t n)
>>> {
>>> int d0, d1, d2;
>>>-if (dest<src)
>>>-__asm__ __volatile__(
>>>- "rep\n\t"
>>>- "movsb"
>>>- : "=&c" (d0), "=&S" (d1), "=&D" (d2)
>>>- :"0" (n),"1" (src),"2" (dest)
>>>- : "memory");
>>>-else
>>>+if (dest<src) {
>>>+ memcpy(dest,src,n);
>>>+} else
>>> __asm__ __volatile__(
>>> "std\n\t"
>>> "rep\n\t"
>>
>>Dumb question, though... what about the overlap case, when dest<src ?
>> It seems to me this change is ignoring that.
>>
>
>
> "if dest is less that source, then call memcpy". If the move is to a
> higher address we do it the old way.
I'm confused... that doesn't say anything to me about overlap.
They can still overlap: Consider if dest is 1 byte less than src, and
n==128...
Jeff
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove
2003-12-30 7:56 ` Jeff Garzik
@ 2003-12-30 8:11 ` Andrew Morton
2003-12-30 8:11 ` Andreas Dilger
2003-12-30 9:58 ` Linus Torvalds
2 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2003-12-30 8:11 UTC (permalink / raw)
To: Jeff Garzik; +Cc: manfred, linux-kernel
Jeff Garzik <jgarzik@pobox.com> wrote:
>
> > "if dest is less that source, then call memcpy". If the move is to a
> > higher address we do it the old way.
>
>
> I'm confused... that doesn't say anything to me about overlap.
>
Overlap is OK if dest<src, because memcpy starts with the lowest address
and works upwards.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove
2003-12-30 7:56 ` Jeff Garzik
2003-12-30 8:11 ` Andrew Morton
@ 2003-12-30 8:11 ` Andreas Dilger
2003-12-30 10:05 ` Linus Torvalds
2003-12-30 9:58 ` Linus Torvalds
2 siblings, 1 reply; 12+ messages in thread
From: Andreas Dilger @ 2003-12-30 8:11 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Andrew Morton, manfred, linux-kernel
On Dec 30, 2003 02:56 -0500, Jeff Garzik wrote:
> Andrew Morton wrote:
> > Jeff Garzik <jgarzik@pobox.com> wrote:
> >
> >>Linux Kernel Mailing List wrote:
> >>
> >>>ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org
> >>>
> >>> [PATCH] optimize ia32 memmove
> >>>
> >>> From: Manfred Spraul <manfred@colorfullife.com>
> >>>
> >>> The memmove implementation of i386 is not optimized: it uses movsb, which is
> >>> far slower than movsd. The optimization is trivial: if dest is less than
> >>> source, then call memcpy(). markw tried it on a 4xXeon with dbt2, it saved
> >>> around 300 million cpu ticks in cache_flusharray():
> >>
> >>[...]
> >>
> >>>diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
> >>>--- a/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
> >>>+++ b/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
> >>>@@ -299,14 +299,9 @@
> >>> static inline void * memmove(void * dest,const void * src, size_t n)
> >>> {
> >>> int d0, d1, d2;
> >>>-if (dest<src)
> >>>-__asm__ __volatile__(
> >>>- "rep\n\t"
> >>>- "movsb"
> >>>- : "=&c" (d0), "=&S" (d1), "=&D" (d2)
> >>>- :"0" (n),"1" (src),"2" (dest)
> >>>- : "memory");
> >>>-else
> >>>+if (dest<src) {
> >>>+ memcpy(dest,src,n);
> >>>+} else
> >>> __asm__ __volatile__(
> >>> "std\n\t"
> >>> "rep\n\t"
> >>
> >>Dumb question, though... what about the overlap case, when dest<src ?
> >> It seems to me this change is ignoring that.
> >>
> >
> >
> > "if dest is less that source, then call memcpy". If the move is to a
> > higher address we do it the old way.
>
>
> I'm confused... that doesn't say anything to me about overlap.
>
> They can still overlap: Consider if dest is 1 byte less than src, and
> n==128...
The non-overlapping cases are probably very common and worth optimizing for:
if (dest + n < src || src + n < dest) {
memcpy(dest, src, n);
} else {
"old way"
}
People generally call memmove() if they _think_ the areas might overlap,
but if memcpy() can be so much faster it is probably worth catching the
cases where the two areas don't actually overlap. The above assumes that
"dest + n" and "src + n" don't wrap, but that is probably a broken case
in the current code so not worth adding extra checks in for.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove
2003-12-30 7:56 ` Jeff Garzik
2003-12-30 8:11 ` Andrew Morton
2003-12-30 8:11 ` Andreas Dilger
@ 2003-12-30 9:58 ` Linus Torvalds
2003-12-30 10:17 ` Jeremy Fitzhardinge
2 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2003-12-30 9:58 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Andrew Morton, manfred, linux-kernel
On Tue, 30 Dec 2003, Jeff Garzik wrote:
>
> I'm confused... that doesn't say anything to me about overlap.
>
> They can still overlap: Consider if dest is 1 byte less than src, and
> n==128...
But then anything that does the loads in ascending order is still ok, so
it shouldn't matter - by the time "dest" has been overwritten, the source
data has already been read. And all the "memcpy()" implementations had
better do that anyway, in order to get nice memory access patterns. "rep
movsl" certainly does.
So assuming we have an ascending "memcpy()", the only case we need to care
about is "overlap && dest > src".
Now, if we have a non-ascending memcpy(), we have trouble.
Linus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove
2003-12-30 8:11 ` Andreas Dilger
@ 2003-12-30 10:05 ` Linus Torvalds
0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2003-12-30 10:05 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Jeff Garzik, Andrew Morton, manfred, linux-kernel
On Tue, 30 Dec 2003, Andreas Dilger wrote:
>
> The non-overlapping cases are probably very common and worth optimizing for:
No, almost all non-overlapping users already just use "memcpy()".
So most of the kernel uses of "memmove()" are likely overlapping - and
just optimizing the non-overlap case probably doesn't help a lot.
That's why you optimize the overlapping case in one direction. We really
should do the other case right too.
Linus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove
2003-12-30 9:58 ` Linus Torvalds
@ 2003-12-30 10:17 ` Jeremy Fitzhardinge
2003-12-30 11:12 ` Manfred Spraul
0 siblings, 1 reply; 12+ messages in thread
From: Jeremy Fitzhardinge @ 2003-12-30 10:17 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff Garzik, Andrew Morton, manfred, Linux Kernel List
On Tue, 2003-12-30 at 01:58, Linus Torvalds wrote:
> But then anything that does the loads in ascending order is still ok, so
> it shouldn't matter - by the time "dest" has been overwritten, the source
> data has already been read. And all the "memcpy()" implementations had
> better do that anyway, in order to get nice memory access patterns. "rep
> movsl" certainly does.
A PPC memcpy may end up clearing the destination before reading the
source (using the cache-line zeroing instruction, to prevent the
destination from being spuriously read to populate the cache line).
J
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove
2003-12-30 7:32 ` [PATCH] optimize ia32 memmove Jeff Garzik
2003-12-30 7:51 ` Andrew Morton
@ 2003-12-30 10:21 ` Ed Sweetman
2003-12-30 10:37 ` Andrew Morton
1 sibling, 1 reply; 12+ messages in thread
From: Ed Sweetman @ 2003-12-30 10:21 UTC (permalink / raw)
To: Jeff Garzik; +Cc: manfred, akpm, Linux Kernel Mailing List
Is this the entire patch? When was the original post since I cant find
the message going back over a month ago in the archives or any current
posts.
Jeff Garzik wrote:
> Linux Kernel Mailing List wrote:
>
>> ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@osdl.org
>>
>> [PATCH] optimize ia32 memmove
>>
>> From: Manfred Spraul <manfred@colorfullife.com>
>>
>> diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
>> --- a/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
>> +++ b/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
>> @@ -299,14 +299,9 @@
>> static inline void * memmove(void * dest,const void * src, size_t n)
>> {
>> int d0, d1, d2;
>> -if (dest<src)
>> -__asm__ __volatile__(
>> - "rep\n\t"
>> - "movsb"
>> - : "=&c" (d0), "=&S" (d1), "=&D" (d2)
>> - :"0" (n),"1" (src),"2" (dest)
>> - : "memory");
>> -else
>> +if (dest<src) {
>> + memcpy(dest,src,n);
>> +} else
>> __asm__ __volatile__(
>> "std\n\t"
>> "rep\n\t"
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove
2003-12-30 10:21 ` Ed Sweetman
@ 2003-12-30 10:37 ` Andrew Morton
0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2003-12-30 10:37 UTC (permalink / raw)
To: Ed Sweetman; +Cc: jgarzik, manfred, linux-kernel
Ed Sweetman <ed.sweetman@wmich.edu> wrote:
>
> Is this the entire patch? When was the original post since I cant find
> the message going back over a month ago in the archives or any current
> posts.
>
The original patch is on the 2.5/2.6 commits mailing list,
bk-commits-head@vger.kernel.org. That list sets
linux-kernel@vger.kernel.org as the From: address, so replies come here.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove
2003-12-30 10:17 ` Jeremy Fitzhardinge
@ 2003-12-30 11:12 ` Manfred Spraul
2003-12-30 20:17 ` H. Peter Anvin
0 siblings, 1 reply; 12+ messages in thread
From: Manfred Spraul @ 2003-12-30 11:12 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Linus Torvalds, Jeff Garzik, Andrew Morton, Linux Kernel List
Jeremy Fitzhardinge wrote:
>On Tue, 2003-12-30 at 01:58, Linus Torvalds wrote:
>
>>But then anything that does the loads in ascending order is still ok, so
>>it shouldn't matter - by the time "dest" has been overwritten, the source
>>data has already been read. And all the "memcpy()" implementations had
>>better do that anyway, in order to get nice memory access patterns. "rep
>>movsl" certainly does.
AMD recommends to perform bulk copies backwards: That defeats the hw
prefecher, and results in even better access patterns. Doesn't matter in
this case, memmove is never used for bulk copies.
>A PPC memcpy may end up clearing the destination before reading the
>source (using the cache-line zeroing instruction, to prevent the
>destination from being spuriously read to populate the cache line).
>
The change is i386 only, no effect on other archs.
I found the unoptimized memmove in oprofiles of dbt2 testruns: slab
contains a few memmoves to keep it's recently used arrays in strict LIFO
order. Typically perhaps 100 bytes.
--
Manfred
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] optimize ia32 memmove
2003-12-30 11:12 ` Manfred Spraul
@ 2003-12-30 20:17 ` H. Peter Anvin
0 siblings, 0 replies; 12+ messages in thread
From: H. Peter Anvin @ 2003-12-30 20:17 UTC (permalink / raw)
To: linux-kernel
Followup to: <3FF15DAB.8080203@colorfullife.com>
By author: Manfred Spraul <manfred@colorfullife.com>
In newsgroup: linux.dev.kernel
>
> AMD recommends to perform bulk copies backwards: That defeats the hw
> prefecher, and results in even better access patterns. Doesn't matter in
> this case, memmove is never used for bulk copies.
>
That's also a microoptimization for one particular microarchitecture
*bug*. Hardware prefetchers are going omnidirectional going forward.
Additionally, nearly all bulk copies are performed forward (DF=0) in
existing codebases.
-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
If you send me mail in HTML format I will assume it's spam.
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2003-12-30 20:18 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <200312300713.hBU7DGC4024213@hera.kernel.org>
2003-12-30 7:32 ` [PATCH] optimize ia32 memmove Jeff Garzik
2003-12-30 7:51 ` Andrew Morton
2003-12-30 7:56 ` Jeff Garzik
2003-12-30 8:11 ` Andrew Morton
2003-12-30 8:11 ` Andreas Dilger
2003-12-30 10:05 ` Linus Torvalds
2003-12-30 9:58 ` Linus Torvalds
2003-12-30 10:17 ` Jeremy Fitzhardinge
2003-12-30 11:12 ` Manfred Spraul
2003-12-30 20:17 ` H. Peter Anvin
2003-12-30 10:21 ` Ed Sweetman
2003-12-30 10:37 ` Andrew Morton
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).