linux-kernel.vger.kernel.org archive mirror
* Fast memcpy patch
@ 2011-11-23 11:25 N. Coesel
  2011-11-23 11:45 ` Mihai Donțu
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: N. Coesel @ 2011-11-23 11:25 UTC (permalink / raw)
  To: linux-kernel

Dear readers,
I noticed the Linux kernel still uses a byte-by-byte copy method for 
memcpy. Since most memory allocations are aligned to the integer size 
of a cpu it is often faster to copy by using the CPU's native word 
size. The patch below does that. The code is already at work in many 
16 and 32 bit embedded products. It should also work for 64 bit 
platforms. So far I only tested 16 and 32 bit platforms.


--- lib/string.c.orig   2010-08-20 20:55:55.000000000 +0200
+++ lib/string.c        2011-11-23 12:29:02.000000000 +0100
@@ -565,14 +565,47 @@ EXPORT_SYMBOL(memset);
   * You should not use this function to access IO space, use memcpy_toio()
   * or memcpy_fromio() instead.
   */
-void *memcpy(void *dest, const void *src, size_t count)
+
+void *memcpy(void *dst, const void *src, size_t length)
  {
-       char *tmp = dest;
-       const char *s = src;
+       void *p=dst;

-       while (count--)
-               *tmp++ = *s++;
-       return dest;
+       //check alignment
+       if (( (int) dst & (sizeof(int) -1)) != ( (int) src & (sizeof(int) -1) ))
+               {
+               //unaligned. This will never align so copy byte-by-byte
+               goto copyrest;
+               }
+
+       //seek alignment (lower bits should become 0). Because
+       //we already tested the lower bits are equal, we only need
+       //to test source or destination for matching alignment.
+       while ( (length !=0) && (((int) src & (sizeof(int)-1 ))!=0) )
+               {
+
+                *((char*) dst++)=*((char*)src++);
+               length--;
+               }
+
+       //copy words
+       while(length> (sizeof(int)-1) )
+               {
+               *((int*) dst)=*((int*)src);
+               dst+=sizeof(int);
+               src+=sizeof(int);
+               length-=sizeof(int);
+               }
+
+copyrest:
+
+       //now copy the rest byte-by-byte
+       while(length !=0)
+               {
+               *((char*) dst++)=*((char*) src++);
+               length--;
+               }
+
+       return p;
  }
  EXPORT_SYMBOL(memcpy);
  #endif


Signed-off-by: Nico Coesel <nico@nctdev.nl>


o---------------------------------------------------------------o
|                       N C T  Developments                     |
|Innovative embedded solutions                                  |
o---------------------------------------------------------------o 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Fast memcpy patch
  2011-11-23 11:25 Fast memcpy patch N. Coesel
@ 2011-11-23 11:45 ` Mihai Donțu
  2011-11-23 12:07   ` N. Coesel
  2011-11-23 12:06 ` richard -rw- weinberger
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Mihai Donțu @ 2011-11-23 11:45 UTC (permalink / raw)
  To: N. Coesel; +Cc: linux-kernel

On Wed, 23 Nov 2011 12:25:46 +0100 N. Coesel wrote:
> Dear readers,
> I noticed the Linux kernel still uses a byte-by-byte copy method for 
> memcpy. Since most memory allocations are aligned to the integer size 
> of a cpu it is often faster to copy by using the CPU's native word 
> size. The patch below does that. The code is already at work in many 
> 16 and 32 bit embedded products. It should also work for 64 bit 
> platforms. So far I only tested 16 and 32 bit platforms.
> 

Could you run checkpatch.pl on this and fix all the warnings? Or, if you
wish, I could do it for you.

> --- lib/string.c.orig   2010-08-20 20:55:55.000000000 +0200
> +++ lib/string.c        2011-11-23 12:29:02.000000000 +0100
> @@ -565,14 +565,47 @@ EXPORT_SYMBOL(memset);
>    * You should not use this function to access IO space, use
> memcpy_toio()
>    * or memcpy_fromio() instead.
>    */
> -void *memcpy(void *dest, const void *src, size_t count)
> +
> +void *memcpy(void *dst, const void *src, size_t length)
>   {
> -       char *tmp = dest;
> -       const char *s = src;
> +       void *p=dst;
> 
> -       while (count--)
> -               *tmp++ = *s++;
> -       return dest;
> +       //check alignment
> +       if (( (int) dst & (sizeof(int) -1)) != ( (int) src & (sizeof(int) -1) ))
> +               {
> +               //unaligned. This will never align so copy byte-by-byte
> +               goto copyrest;
> +               }
> +
> +       //seek alignment (lower bits should become 0). Because
> +       //we already tested the lower bits are equal, we only need
> +       //to test source or destination for matching alignment.
> +       while ( (length !=0) && (((int) src & (sizeof(int)-1 ))!=0) )
> +               {
> +
> +                *((char*) dst++)=*((char*)src++);
> +               length--;
> +               }
> +
> +       //copy words
> +       while(length> (sizeof(int)-1) )
> +               {
> +               *((int*) dst)=*((int*)src);
> +               dst+=sizeof(int);
> +               src+=sizeof(int);
> +               length-=sizeof(int);
> +               }
> +
> +copyrest:
> +
> +       //now copy the rest byte-by-byte
> +       while(length !=0)
> +               {
> +               *((char*) dst++)=*((char*) src++);
> +               length--;
> +               }
> +
> +       return p;
>   }
>   EXPORT_SYMBOL(memcpy);
>   #endif
> 

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Fast memcpy patch
  2011-11-23 11:25 Fast memcpy patch N. Coesel
  2011-11-23 11:45 ` Mihai Donțu
@ 2011-11-23 12:06 ` richard -rw- weinberger
  2011-11-23 12:07 ` Cong Wang
  2011-11-23 12:10 ` Sasha Levin
  3 siblings, 0 replies; 9+ messages in thread
From: richard -rw- weinberger @ 2011-11-23 12:06 UTC (permalink / raw)
  To: N. Coesel; +Cc: linux-kernel

On Wed, Nov 23, 2011 at 12:25 PM, N. Coesel <nico@nctdev.nl> wrote:
> Dear readers,
> I noticed the Linux kernel still uses a byte-by-byte copy method for memcpy.
> Since most memory allocations are aligned to the integer size of a cpu it is
> often faster to copy by using the CPU's native word size. The patch below
> does that. The code is already at work in many 16 and 32 bit embedded
> products. It should also work for 64 bit platforms. So far I only tested 16
> and 32 bit platforms.

Please note, this is only the fallback implementation.
Each arch has its own optimized version.

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Fast memcpy patch
  2011-11-23 11:25 Fast memcpy patch N. Coesel
  2011-11-23 11:45 ` Mihai Donțu
  2011-11-23 12:06 ` richard -rw- weinberger
@ 2011-11-23 12:07 ` Cong Wang
  2011-11-23 12:10 ` Sasha Levin
  3 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2011-11-23 12:07 UTC (permalink / raw)
  To: N. Coesel; +Cc: linux-kernel

On Wed, Nov 23, 2011 at 7:25 PM, N. Coesel <nico@nctdev.nl> wrote:
> Dear readers,
> I noticed the Linux kernel still uses a byte-by-byte copy method for memcpy.
> Since most memory allocations are aligned to the integer size of a cpu it is
> often faster to copy by using the CPU's native word size. The patch below
> does that. The code is already at work in many 16 and 32 bit embedded
> products. It should also work for 64 bit platforms. So far I only tested 16
> and 32 bit platforms.

Which arch are you referring to?

At least on x86, it has optimized memcpy(), see arch/x86/lib/memcpy_32.c
and arch/x86/include/asm/string_64.h.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Fast memcpy patch
  2011-11-23 11:45 ` Mihai Donțu
@ 2011-11-23 12:07   ` N. Coesel
  0 siblings, 0 replies; 9+ messages in thread
From: N. Coesel @ 2011-11-23 12:07 UTC (permalink / raw)
  To: Mihai Donțu; +Cc: linux-kernel

Mihai,

At 12:45 23-11-2011, Mihai Donțu wrote:
>On Wed, 23 Nov 2011 12:25:46 +0100 N. Coesel wrote:
> > Dear readers,
> > I noticed the Linux kernel still uses a byte-by-byte copy method for
> > memcpy. Since most memory allocations are aligned to the integer size
> > of a cpu it is often faster to copy by using the CPU's native word
> > size. The patch below does that. The code is already at work in many
> > 16 and 32 bit embedded products. It should also work for 64 bit
> > platforms. So far I only tested 16 and 32 bit platforms.
> >
>
>Could you run checkpatch.pl on this and fix all the warnings? Or, if you
>wish, I could do it for you.

If you can, yes please. I used diff -uprN to make 
the patch. It cross-compiles without warnings using gcc 4.4 for ARM.

Nico Coesel

> > --- lib/string.c.orig   2010-08-20 20:55:55.000000000 +0200
> > +++ lib/string.c        2011-11-23 12:29:02.000000000 +0100
> > @@ -565,14 +565,47 @@ EXPORT_SYMBOL(memset);
> >    * You should not use this function to access IO space, use
> > memcpy_toio()
> >    * or memcpy_fromio() instead.
> >    */
> > -void *memcpy(void *dest, const void *src, size_t count)
> > +
> > +void *memcpy(void *dst, const void *src, size_t length)
> >   {
> > -       char *tmp = dest;
> > -       const char *s = src;
> > +       void *p=dst;
> >
> > -       while (count--)
> > -               *tmp++ = *s++;
> > -       return dest;
> > +       //check alignment
> > +       if (( (int) dst & (sizeof(int) -1)) != ( (int) src & (sizeof(int) -1) ))
> > +               {
> > +               //unaligned. This will never align so copy byte-by-byte
> > +               goto copyrest;
> > +               }
> > +
> > +       //seek alignment (lower bits should become 0). Because
> > +       //we already tested the lower bits are equal, we only need
> > +       //to test source or destination for matching alignment.
> > +       while ( (length !=0) && (((int) src & (sizeof(int)-1 ))!=0) )
> > +               {
> > +
> > +                *((char*) dst++)=*((char*)src++);
> > +               length--;
> > +               }
> > +
> > +       //copy words
> > +       while(length> (sizeof(int)-1) )
> > +               {
> > +               *((int*) dst)=*((int*)src);
> > +               dst+=sizeof(int);
> > +               src+=sizeof(int);
> > +               length-=sizeof(int);
> > +               }
> > +
> > +copyrest:
> > +
> > +       //now copy the rest byte-by-byte
> > +       while(length !=0)
> > +               {
> > +               *((char*) dst++)=*((char*) src++);
> > +               length--;
> > +               }
> > +
> > +       return p;
> >   }
> >   EXPORT_SYMBOL(memcpy);
> >   #endif
> >
>
>--
>Mihai Donțu

o---------------------------------------------------------------o
|                       N C T  Developments                     |
|Innovative embedded solutions                                  |
o---------------------------------------------------------------o 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Fast memcpy patch
  2011-11-23 11:25 Fast memcpy patch N. Coesel
                   ` (2 preceding siblings ...)
  2011-11-23 12:07 ` Cong Wang
@ 2011-11-23 12:10 ` Sasha Levin
  2011-11-23 12:51   ` N. Coesel
  3 siblings, 1 reply; 9+ messages in thread
From: Sasha Levin @ 2011-11-23 12:10 UTC (permalink / raw)
  To: N. Coesel; +Cc: linux-kernel

On Wed, 2011-11-23 at 12:25 +0100, N. Coesel wrote:
> Dear readers,
> I noticed the Linux kernel still uses a byte-by-byte copy method for 
> memcpy. Since most memory allocations are aligned to the integer size 
> of a cpu it is often faster to copy by using the CPU's native word 
> size. The patch below does that. The code is already at work in many 
> 16 and 32 bit embedded products. It should also work for 64 bit 
> platforms. So far I only tested 16 and 32 bit platforms.

[snip]

memcpy (along with the other mem* functions) is arch-specific - for
example, look at arch/x86/lib/memcpy_64.S for the implementation(s) for
x86.

The code under lib/string.c is simple and should work on all platforms
(and is probably not being used anywhere anymore).

-- 

Sasha.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Fast memcpy patch
  2011-11-23 12:10 ` Sasha Levin
@ 2011-11-23 12:51   ` N. Coesel
  2011-11-23 13:04     ` Sasha Levin
  0 siblings, 1 reply; 9+ messages in thread
From: N. Coesel @ 2011-11-23 12:51 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-kernel

Sasha,

At 13:10 23-11-2011, Sasha Levin wrote:
>On Wed, 2011-11-23 at 12:25 +0100, N. Coesel wrote:
> > Dear readers,
> > I noticed the Linux kernel still uses a byte-by-byte copy method for
> > memcpy. Since most memory allocations are aligned to the integer size
> > of a cpu it is often faster to copy by using the CPU's native word
> > size. The patch below does that. The code is already at work in many
> > 16 and 32 bit embedded products. It should also work for 64 bit
> > platforms. So far I only tested 16 and 32 bit platforms.
>
>[snip]
>
>memcpy (along with the other mem* functions) is arch-specific - for
>example, look at arch/x86/lib/memcpy_64.S for the implementation(s) for
>x86.
>
>The code under lib/string.c is simple and should work on all platforms
>(and is probably not being used anywhere anymore).

Thanks for pointing that out. Currently my primary target is ARM. It 
seems the memcpy for that arch also uses byte-by-byte copying, with 
some loop unrolling. I modified the code so it tries word-by-word 
copying if the pointers are aligned on word boundaries; if not, it 
reverts to the old method. For clarity: by word I mean the CPU's 
native bus width, which for ARM is (still) 32 bits.

Nico Coesel



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Fast memcpy patch
  2011-11-23 12:51   ` N. Coesel
@ 2011-11-23 13:04     ` Sasha Levin
  2011-11-23 20:38       ` N. Coesel
  0 siblings, 1 reply; 9+ messages in thread
From: Sasha Levin @ 2011-11-23 13:04 UTC (permalink / raw)
  To: N. Coesel; +Cc: linux-kernel

On Wed, 2011-11-23 at 13:51 +0100, N. Coesel wrote:
> Sasha,
> 
> At 13:10 23-11-2011, Sasha Levin wrote:
> >On Wed, 2011-11-23 at 12:25 +0100, N. Coesel wrote:
> > > Dear readers,
> > > I noticed the Linux kernel still uses a byte-by-byte copy method for
> > > memcpy. Since most memory allocations are aligned to the integer size
> > > of a cpu it is often faster to copy by using the CPU's native word
> > > size. The patch below does that. The code is already at work in many
> > > 16 and 32 bit embedded products. It should also work for 64 bit
> > > platforms. So far I only tested 16 and 32 bit platforms.
> >
> >[snip]
> >
> >memcpy (along with the other mem* functions) is arch-specific - for
> >example, look at arch/x86/lib/memcpy_64.S for the implementation(s) for
> >x86.
> >
> >The code under lib/string.c is simple and should work on all platforms
> >(and is probably not being used anywhere anymore).
> 
> Thanks for pointing that out. Currently my primary target is ARM. It 
> seems the memcpy for that arch uses byte-by-byte copying as well with 
> some loop unrolling. I modified the code so it tries to use 
> word-by-word copy if the pointers are aligned on word boundaries, if 
> not it reverts to the old method. For clarity: by word I mean the 
> CPU's native bus width. In case of ARM that's (still) 32 bit.

I don't think we're looking at the same file.

For arm it's arch/arm/lib/copy_template.S, right? Or are you talking
about something else?

-- 

Sasha.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Fast memcpy patch
  2011-11-23 13:04     ` Sasha Levin
@ 2011-11-23 20:38       ` N. Coesel
  0 siblings, 0 replies; 9+ messages in thread
From: N. Coesel @ 2011-11-23 20:38 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-kernel

Sasha,

At 14:04 23-11-2011, Sasha Levin wrote:
>On Wed, 2011-11-23 at 13:51 +0100, N. Coesel wrote:
> > Sasha,
> >
> > At 13:10 23-11-2011, Sasha Levin wrote:
> > >On Wed, 2011-11-23 at 12:25 +0100, N. Coesel wrote:
> > > > Dear readers,
> > > > I noticed the Linux kernel still uses a byte-by-byte copy method for
> > > > memcpy. Since most memory allocations are aligned to the integer size
> > > > of a cpu it is often faster to copy by using the CPU's native word
> > > > size. The patch below does that. The code is already at work in many
> > > > 16 and 32 bit embedded products. It should also work for 64 bit
> > > > platforms. So far I only tested 16 and 32 bit platforms.
> > >
> > >[snip]
> > >
> > >memcpy (along with the other mem* functions) is arch-specific - for
> > >example, look at arch/x86/lib/memcpy_64.S for the implementation(s) for
> > >x86.
> > >
> > >The code under lib/string.c is simple and should work on all platforms
> > >(and is probably not being used anywhere anymore).
> >
> > Thanks for pointing that out. Currently my primary target is ARM. It
> > seems the memcpy for that arch uses byte-by-byte copying as well with
> > some loop unrolling. I modified the code so it tries to use
> > word-by-word copy if the pointers are aligned on word boundaries, if
> > not it reverts to the old method. For clarity: by word I mean the
> > CPU's native bus width. In case of ARM that's (still) 32 bit.
>
>I don't think we're looking at the same file.
>
>For arm it's arch/arm/lib/copy_template.S, right? Or are you talking
>about something else?

I was looking somewhere else indeed. There are a lot of versions of 
memcpy in the kernel :-) Thanks for pointing me to the right file. 
The asm version looks pretty nifty, using the load/store multiple 
instructions. Too bad. I was hoping to get some extra speed with a 
quick fix. I need to copy several MB from a driver.

Nico Coesel


>--
>
>Sasha.


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2011-11-23 20:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-23 11:25 Fast memcpy patch N. Coesel
2011-11-23 11:45 ` Mihai Donțu
2011-11-23 12:07   ` N. Coesel
2011-11-23 12:06 ` richard -rw- weinberger
2011-11-23 12:07 ` Cong Wang
2011-11-23 12:10 ` Sasha Levin
2011-11-23 12:51   ` N. Coesel
2011-11-23 13:04     ` Sasha Levin
2011-11-23 20:38       ` N. Coesel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).