linuxppc-dev.lists.ozlabs.org archive mirror
* performance: memcpy vs. __copy_tofrom_user
@ 2008-10-08 14:39 Dominik Bozek
  2008-10-08 15:31 ` Minh Tuan Duong
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Dominik Bozek @ 2008-10-08 14:39 UTC (permalink / raw)
  To: linuxppc-embedded

Hi all,

I have done a test of memcpy() and __copy_tofrom_user() on the MPC8313,
and the main conclusion is that __copy_tofrom_user() is more efficient
than memcpy(), sometimes by about 40%.

If I understand correctly, memcpy() just copies the data, while
__copy_tofrom_user() also handles the case where the memory has been
swapped out. So memcpy() should be faster than __copy_tofrom_user().
Am I right?
Is there anybody who can confirm these results and maybe improve memcpy()?


Let's talk about the test.
I prepared two blocks of memory, 64 KB each, and made sure that this
memory is not swapped out (necessary for memcpy() later). Then I ran one
of the memory-copy functions to transfer 32 MB in total and measured the
time. The memory is copied in chunks ranging from 64 KB down to 8 B. I
take care of the cache by calling flush_dcache_range() whenever the whole
64 KB has been used.
I know that memcpy() at the kernel level is not intended to copy memory
blocks in userspace, and __copy_tofrom_user() is not intended to copy
data between two user blocks, but for this performance test it doesn't
matter.
Below is the relevant piece of code from the kernel module.

#define TEST_BUF_SIZE (64*1024)
int function;
char *buf1, *buf2, *buf1_bis, *buf2_bis;
unsigned int size, cnt;
unsigned long ret;

get_user(function, &((TEST_ARG*)(arg))->function);
get_user(buf1, &((TEST_ARG*)(arg))->buf1);
get_user(buf2, &((TEST_ARG*)(arg))->buf2);
get_user(size, &((TEST_ARG*)(arg))->size);

cnt = (32*1024*1024)/size; /* how many copies are needed to transfer 32 MB? */
buf1_bis = buf1;
buf2_bis = buf2;

switch (function)
{
    case MEMCPY_TEST:
        while (cnt-->0)
        {
            if (buf1_bis >= buf1+TEST_BUF_SIZE)
            {
                /* wrap around; flush the data cache as seldom as possible */
                buf1_bis = buf1;
                buf2_bis = buf2;
                flush_dcache_range((unsigned long)buf1,
                                   (unsigned long)(buf1+TEST_BUF_SIZE));
                flush_dcache_range((unsigned long)buf2,
                                   (unsigned long)(buf2+TEST_BUF_SIZE));
            }
            /* memcpy() returns its destination pointer */
            if (buf1_bis != memcpy(buf1_bis, buf2_bis, size))
                break;
            buf1_bis += size;
            buf2_bis += size;
        }
        break;

    case COPY_TOFROM_USER_TEST:
        while (cnt-->0)
        {
            if (buf1_bis >= buf1+TEST_BUF_SIZE)
            {
                /* wrap around; flush the data cache as seldom as possible */
                buf1_bis = buf1;
                buf2_bis = buf2;
                flush_dcache_range((unsigned long)buf1,
                                   (unsigned long)(buf1+TEST_BUF_SIZE));
                flush_dcache_range((unsigned long)buf2,
                                   (unsigned long)(buf2+TEST_BUF_SIZE));
            }
            /* returns the number of bytes NOT copied, so 0 means success */
            ret = __copy_tofrom_user(buf1_bis, buf2_bis, size);
            if (ret != 0)
                break;
            buf1_bis += size;
            buf2_bis += size;
        }
        break;
}


Below are the results:

memcpy()
chunk:  65536 [B] | transfer:     69.2 [MB/s] | time: 1.849727 [s] | size:  128.000 [MB]
chunk:  32768 [B] | transfer:     69.2 [MB/s] | time: 1.849700 [s] | size:  128.000 [MB]
chunk:  16384 [B] | transfer:     69.2 [MB/s] | time: 1.849845 [s] | size:  128.000 [MB]
chunk:   8192 [B] | transfer:     69.2 [MB/s] | time: 1.850535 [s] | size:  128.000 [MB]
chunk:   4096 [B] | transfer:     69.1 [MB/s] | time: 1.853405 [s] | size:  128.000 [MB]
chunk:   2048 [B] | transfer:     69.1 [MB/s] | time: 1.852877 [s] | size:  128.000 [MB]
chunk:   1024 [B] | transfer:     69.2 [MB/s] | time: 1.849963 [s] | size:  128.000 [MB]
chunk:    512 [B] | transfer:     69.0 [MB/s] | time: 1.853793 [s] | size:  128.000 [MB]
chunk:    256 [B] | transfer:     68.6 [MB/s] | time: 1.866222 [s] | size:  128.000 [MB]
chunk:    128 [B] | transfer:     68.0 [MB/s] | time: 1.883002 [s] | size:  128.000 [MB]
chunk:     64 [B] | transfer:     67.2 [MB/s] | time: 1.904073 [s] | size:  128.000 [MB]
chunk:     32 [B] | transfer:     64.7 [MB/s] | time: 1.978109 [s] | size:  128.000 [MB]
chunk:     16 [B] | transfer:     54.5 [MB/s] | time: 2.348682 [s] | size:  128.000 [MB]
chunk:      8 [B] | transfer:     47.4 [MB/s] | time: 2.698635 [s] | size:  128.000 [MB]


__copy_tofrom_user()
chunk:  65536 [B] | transfer:     97.3 [MB/s] | time: 1.315155 [s] | size:  128.000 [MB]
chunk:  32768 [B] | transfer:     97.3 [MB/s] | time: 1.315762 [s] | size:  128.000 [MB]
chunk:  16384 [B] | transfer:     97.2 [MB/s] | time: 1.316946 [s] | size:  128.000 [MB]
chunk:   8192 [B] | transfer:     96.8 [MB/s] | time: 1.321705 [s] | size:  128.000 [MB]
chunk:   4096 [B] | transfer:     96.6 [MB/s] | time: 1.325516 [s] | size:  128.000 [MB]
chunk:   2048 [B] | transfer:     96.6 [MB/s] | time: 1.325570 [s] | size:  128.000 [MB]
chunk:   1024 [B] | transfer:     96.8 [MB/s] | time: 1.322599 [s] | size:  128.000 [MB]
chunk:    512 [B] | transfer:     97.8 [MB/s] | time: 1.308186 [s] | size:  128.000 [MB]
chunk:    256 [B] | transfer:    100.2 [MB/s] | time: 1.277788 [s] | size:  128.000 [MB]
chunk:    128 [B] | transfer:     91.5 [MB/s] | time: 1.398216 [s] | size:  128.000 [MB]
chunk:     64 [B] | transfer:     87.0 [MB/s] | time: 1.471784 [s] | size:  128.000 [MB]
chunk:     32 [B] | transfer:     75.0 [MB/s] | time: 1.706426 [s] | size:  128.000 [MB]
chunk:     16 [B] | transfer:     47.8 [MB/s] | time: 2.678039 [s] | size:  128.000 [MB]
chunk:      8 [B] | transfer:     41.5 [MB/s] | time: 3.084689 [s] | size:  128.000 [MB]

Regards
Dominik Bozek


BTW, memcpy() could perhaps be optimized, as it is on i386, for the case
where the size of the block is known at compile time.
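
For illustration, a minimal sketch of what such compile-time dispatch could
look like (the my_memcpy name, the 64-byte cut-off and the helper are made
up for illustration; this is not the actual i386 or powerpc implementation):

/* Sketch only: divert small, compile-time-constant lengths to an inlined
 * path; everything else goes to the normal out-of-line memcpy().
 * __builtin_constant_p() is the GCC builtin that reports whether its
 * argument is a compile-time constant. */
static inline void *__constant_small_memcpy(void *dst, const void *src,
                                            unsigned long len)
{
        char *d = dst;
        const char *s = src;

        while (len--)           /* placeholder for unrolled word stores */
                *d++ = *s++;
        return dst;
}

#define my_memcpy(dst, src, len)                                  \
        (__builtin_constant_p(len) && (len) <= 64                 \
                ? __constant_small_memcpy((dst), (src), (len))    \
                : memcpy((dst), (src), (len)))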

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-08 14:39 performance: memcpy vs. __copy_tofrom_user Dominik Bozek
@ 2008-10-08 15:31 ` Minh Tuan Duong
  2008-10-08 15:39 ` Bill Gatliff
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Minh Tuan Duong @ 2008-10-08 15:31 UTC (permalink / raw)
  To: Dominik Bozek; +Cc: linuxppc-embedded


http://www.xml.com/ldd/chapter/book/ch03.html
Hope this helps.

On Wed, Oct 8, 2008 at 9:39 PM, Dominik Bozek <domino@mikroswiat.pl> wrote:

> Hi all,
>
> I have done a test of memcpy() and __copy_tofrom_user() on the MPC8313,
> and the main conclusion is that __copy_tofrom_user() is more efficient
> than memcpy(), sometimes by about 40%.
>
> [...]



-- 
Best regards,
Tuan Duong
Mobile: 0983349121


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-08 14:39 performance: memcpy vs. __copy_tofrom_user Dominik Bozek
  2008-10-08 15:31 ` Minh Tuan Duong
@ 2008-10-08 15:39 ` Bill Gatliff
  2008-10-08 15:42 ` Grant Likely
  2008-10-08 17:40 ` Scott Wood
  3 siblings, 0 replies; 25+ messages in thread
From: Bill Gatliff @ 2008-10-08 15:39 UTC (permalink / raw)
  To: Dominik Bozek; +Cc: linuxppc-embedded

Dominik Bozek wrote:
> Hi all,
> 
> I have done a test of memcpy() and __copy_tofrom_user() on the MPC8313,
> and the main conclusion is that __copy_tofrom_user() is more efficient
> than memcpy(), sometimes by about 40%.

Have you looked at the two implementations?  I'm not as well-versed on PPC as
ARM, but I know the latter's __copy_* functions are optimized to be almost
unintelligible.  If your benchmark memcpy() implementation isn't, then you
aren't comparing apples-to-apples.

40% improvement is within what I could imagine getting from a hand-crafted,
no-holds-barred memcpy() implementation.  I'd look more carefully at that code.


b.g.
-- 
Bill Gatliff
bgat@billgatliff.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-08 14:39 performance: memcpy vs. __copy_tofrom_user Dominik Bozek
  2008-10-08 15:31 ` Minh Tuan Duong
  2008-10-08 15:39 ` Bill Gatliff
@ 2008-10-08 15:42 ` Grant Likely
  2008-10-09  2:34   ` Paul Mackerras
  2008-10-08 17:40 ` Scott Wood
  3 siblings, 1 reply; 25+ messages in thread
From: Grant Likely @ 2008-10-08 15:42 UTC (permalink / raw)
  To: Dominik Bozek; +Cc: linuxppc-dev, linuxppc-embedded

Forwarding message to linuxppc-dev@ozlabs.org.  This is an interesting
question for the wider powerpc community, but not many people read
linuxppc-embedded.

On Wed, Oct 08, 2008 at 04:39:13PM +0200, Dominik Bozek wrote:
> Hi all,
>
> I have done a test of memcpy() and __copy_tofrom_user() on the MPC8313,
> and the main conclusion is that __copy_tofrom_user() is more efficient
> than memcpy(), sometimes by about 40%.
>
> If I understand correctly, memcpy() just copies the data, while
> __copy_tofrom_user() also handles the case where the memory has been
> swapped out. So memcpy() should be faster than __copy_tofrom_user().
> Am I right?
> Is there anybody who can confirm these results and maybe improve memcpy()?
>
> [...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-08 14:39 performance: memcpy vs. __copy_tofrom_user Dominik Bozek
                   ` (2 preceding siblings ...)
  2008-10-08 15:42 ` Grant Likely
@ 2008-10-08 17:40 ` Scott Wood
  2008-10-09  2:36   ` Paul Mackerras
  2008-10-11 22:32   ` Benjamin Herrenschmidt
  3 siblings, 2 replies; 25+ messages in thread
From: Scott Wood @ 2008-10-08 17:40 UTC (permalink / raw)
  To: Dominik Bozek; +Cc: linuxppc-dev, linuxppc-embedded

Dominik Bozek wrote:
> I have done a test of memcpy() and __copy_tofrom_user() on the MPC8313,
> and the main conclusion is that __copy_tofrom_user() is more efficient
> than memcpy(), sometimes by about 40%.
>
> If I understand correctly, memcpy() just copies the data, while
> __copy_tofrom_user() also handles the case where the memory has been
> swapped out.

There's not much overhead in dealing with bad pointers; it's mostly 
fixup after the fault.

The performance difference most likely comes from the fact that copy 
to/from user can assume that the memory is cacheable, while memcpy is 
occasionally used on cache-inhibited memory -- so dcbz isn't used.  We 
may be better off handling the alignment fault on those occasions, and 
we should use dcba on chips that support it.

I'm not sure why we don't use dcbt in memcpy(), as it's just ignored if 
the memory is cache-inhibited.
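
To illustrate the dcbz point above, a sketch only (not the kernel's actual
copy routine): it assumes a cacheable, 32-byte-aligned destination and the
32-byte cache lines of the e300 core.

/* Copy whole 32-byte cache lines.  dcbz establishes each destination
 * line in the data cache as zeroes, so the line is never read from
 * memory before being overwritten.  On cache-inhibited memory dcbz
 * raises an alignment exception instead, which is why a general-purpose
 * memcpy() cannot use it blindly. */
static void cacheline_copy(void *dst, const void *src, unsigned long nlines)
{
        unsigned long *d = dst;         /* 8 x 32-bit words per line */
        const unsigned long *s = src;

        while (nlines--) {
                __asm__ __volatile__("dcbz 0,%0" : : "r"(d) : "memory");
                d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
                d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
                d += 8;
                s += 8;
        }
}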

> BTW, memcpy() could perhaps be optimized, as it is on i386, for the case
> where the size of the block is known at compile time.

Yes, that would be nice.

-Scott

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-08 15:42 ` Grant Likely
@ 2008-10-09  2:34   ` Paul Mackerras
  2008-10-09 10:12     ` Dominik Bozek
  0 siblings, 1 reply; 25+ messages in thread
From: Paul Mackerras @ 2008-10-09  2:34 UTC (permalink / raw)
  To: Grant Likely; +Cc: linuxppc-dev, Dominik Bozek, linuxppc-embedded

Grant Likely writes:

> On Wed, Oct 08, 2008 at 04:39:13PM +0200, Dominik Bozek wrote:
> > Hi all,
> > 
> > I have done a test of memcpy() and __copy_tofrom_user() on the MPC8313,
> > and the main conclusion is that __copy_tofrom_user() is more efficient
> > than memcpy(), sometimes by about 40%.
> > 
> > If I understand correctly, memcpy() just copies the data, while
> > __copy_tofrom_user() also handles the case where the memory has been
> > swapped out. So memcpy() should be faster than __copy_tofrom_user().
> > Am I right?
> > Is there anybody who can confirm these results and maybe improve memcpy()?

When I looked at this last (which was a few years ago, I'll admit), I
found that the vast majority of memcpy calls were for small copies,
i.e. less than 128 bytes, whereas __copy_tofrom_user was often used
for larger copies (usually 1 page).  So with memcpy the focus was more
on keeping the startup costs low, while __copy_tofrom_user was
optimized more for bandwidth.

The other point is that the kernel memcpy doesn't consume a noticeable
amount of CPU time (at least not on any workload I've seen), so it
hasn't been a target for aggressive optimization.

Paul.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-08 17:40 ` Scott Wood
@ 2008-10-09  2:36   ` Paul Mackerras
  2008-10-11 22:32   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 25+ messages in thread
From: Paul Mackerras @ 2008-10-09  2:36 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev, Dominik Bozek, linuxppc-embedded

Scott Wood writes:

> I'm not sure why we don't use dcbt in memcpy(), as it's just ignored if 
> the memory is cache-inhibited.

Both dcbt and dcbz tend to slow things down if the relevant block is
already in the cache.  Since the kernel memcpy is mostly used for
copies that are only 1 or a small number of cache lines long, it's not
clear that the benefit of dcbt and/or dcbz would outweigh the cost.
And anyway, I have yet to be convinced that optimizing memcpy would
provide a measurable benefit.

Paul.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-09  2:34   ` Paul Mackerras
@ 2008-10-09 10:12     ` Dominik Bozek
  2008-10-09 11:06       ` Paul Mackerras
  0 siblings, 1 reply; 25+ messages in thread
From: Dominik Bozek @ 2008-10-09 10:12 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linuxppc-embedded

Paul Mackerras wrote:

> When I looked at this last (which was a few years ago, I'll admit), I
> found that the vast majority of memcpy calls were for small copies,
> i.e. less than 128 bytes, whereas __copy_tofrom_user was often used
> for larger copies (usually 1 page).  So with memcpy the focus was more
> on keeping the startup costs low, while __copy_tofrom_user was
> optimized more for bandwidth.
>
> The other point is that the kernel memcpy doesn't consume a noticeable
> amount of CPU time (at least not on any workload I've seen), so it
> hasn't been a target for aggressive optimization.
>   


Actually I made a couple of other tests on that MPC8313. Most of them are
too ugly to publish, but... My problem is that I have to boost the
gigabit interface on the MPC8313. I made a simple substitution and
__copy_tofrom_user() was used instead of memcpy(). I know it's wrong, but
that way I sped up the network interface by about 10%.

I also made some calculations based on the results I had sent. One
__copy_tofrom_user() of 1500 B gains enough over memcpy() to offset the
penalty it pays on 258 copies of 8 B. But of course this is the case for
the MPC8313 (333 MHz core, DDR2 at 266 MHz). On other hardware it may work
differently, and to draw any binding conclusion we need to see results
from other CPUs. Unfortunately, right now I don't have any other PPC to
run such a test on for comparison.
On the other hand, my test does not cover all cases. I believe most small
transfers involve data that is already cached. That is a big point in
favour of the current memcpy().

Maybe there is another solution: the copy method, aggressive or "low setup
cost", would be chosen depending on the size of the copied block and a
fixed limit. The limit would be known at compile time and tuned to the
chosen CPU/platform.
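
As a sketch only (the 256-byte limit and the tuned_memcpy name are
placeholders, and the fast path assumes both buffers are cacheable and
resident), it could look roughly like this:

#define COPY_AGGRESSIVE_LIMIT 256       /* to be tuned per CPU/platform */

static inline void *tuned_memcpy(void *dst, const void *src, unsigned long len)
{
        if (len < COPY_AGGRESSIVE_LIMIT)
                return memcpy(dst, src, len);   /* low-setup-cost path */

        /* aggressive, cache-line-oriented path for large blocks; here it
         * simply reuses __copy_tofrom_user(), which returns the number
         * of uncopied bytes (0 on success) */
        __copy_tofrom_user(dst, src, len);
        return dst;
}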

In case someone asks about other tests: I had optimized memcpy() for cases
where the transfer size is known at compile time and... it's hard to say if
the system was "faster", but I certainly didn't notice any boost at the
network interface. Maybe my optimization was bad. That's possible with me.

Dominik

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-09 10:12     ` Dominik Bozek
@ 2008-10-09 11:06       ` Paul Mackerras
  2008-10-09 11:41         ` Dominik Bozek
                           ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Paul Mackerras @ 2008-10-09 11:06 UTC (permalink / raw)
  To: Dominik Bozek; +Cc: linuxppc-dev, linuxppc-embedded

Dominik Bozek writes:

> Actually I made a couple of other tests on that MPC8313. Most of them are
> too ugly to publish, but... My problem is that I have to boost the
> gigabit interface on the MPC8313. I made a simple substitution and
> __copy_tofrom_user() was used instead of memcpy(). I know it's wrong, but
> that way I sped up the network interface by about 10%.

Very interesting.  Can you work out where memcpy is being called on
the network data?  I wouldn't have expected that.

There is actually no strong reason not to use __copy_tofrom_user as
memcpy, in fact, as long as we are sure that source and destination
are both cacheable.

Paul.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-09 11:06       ` Paul Mackerras
@ 2008-10-09 11:41         ` Dominik Bozek
  2008-10-09 12:04           ` Leon Woestenberg
  2008-10-09 15:37         ` Matt Sealey
  2008-10-10 17:17         ` Dominik Bozek
  2 siblings, 1 reply; 25+ messages in thread
From: Dominik Bozek @ 2008-10-09 11:41 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linuxppc-embedded

Paul Mackerras wrote:
> Dominik Bozek writes:
>
>   
>> Actually I made a couple of other tests on that MPC8313. Most of them are
>> too ugly to publish, but... My problem is that I have to boost the
>> gigabit interface on the MPC8313. I made a simple substitution and
>> __copy_tofrom_user() was used instead of memcpy(). I know it's wrong, but
>> that way I sped up the network interface by about 10%.
>>     
>
> Very interesting.  Can you work out where memcpy is being called on
> the network data?  I wouldn't have expected that.
>   

I'm not the fastest, but I will. Just need some time.

> There is actually no strong reason not to use __copy_tofrom_user as
> memcpy, in fact, as long as we are sure that source and destination
> are both cacheable.
>   

My board doesn't have graphics, sound, etc., so I don't know if and how it
affects those subsystems, but ext2 definitely fails. Interesting, because
it was a ramdisk. Remember that it was a very tricky test, so don't draw
any wrong conclusions from it.

Dominik

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-09 11:41         ` Dominik Bozek
@ 2008-10-09 12:04           ` Leon Woestenberg
  0 siblings, 0 replies; 25+ messages in thread
From: Leon Woestenberg @ 2008-10-09 12:04 UTC (permalink / raw)
  To: Dominik Bozek; +Cc: linuxppc-dev, Paul Mackerras, linuxppc-embedded

Hello all,

On Thu, Oct 9, 2008 at 1:41 PM, Dominik Bozek <domino@mikroswiat.pl> wrote:
> Paul Mackerras wrote:
>> Dominik Bozek writes:
>>> Actually I made a couple of other tests on that MPC8313. Most of them are
>>> too ugly to publish, but... My problem is that I have to boost the
>>> gigabit interface on the MPC8313. I made a simple substitution and
>>
>> Very interesting.  Can you work out where memcpy is being called on
>> the network data?  I wouldn't have expected that.
>>

Also see the recent thread from David Jander on August 25th, "Efficient
memcpy()/memmove() for G2/G3 cores...",
on linuxppc-dev@ozlabs.org.

http://ozlabs.org/pipermail/linuxppc-dev/2008-September/062449.html

BTW, I am interested in this work as well; I'm currently working with
an 8313 and an 8315 design, both of which are e300 cores.

My current goal is PCIe though, but I need fast GbE performance as well.

Also, did you test Freescale's 2.6.24.3 patches that tune the gianfar
interfaces for higher performance?

Regards,
-- 
Leon

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-09 11:06       ` Paul Mackerras
  2008-10-09 11:41         ` Dominik Bozek
@ 2008-10-09 15:37         ` Matt Sealey
  2008-10-11 22:30           ` Benjamin Herrenschmidt
  2008-10-10 17:17         ` Dominik Bozek
  2 siblings, 1 reply; 25+ messages in thread
From: Matt Sealey @ 2008-10-09 15:37 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, Dominik Bozek, linuxppc-embedded

Paul Mackerras wrote:
> Dominik Bozek writes:
> 
>> Actually I made a couple of other tests on that MPC8313. Most of them are
>> too ugly to publish, but... My problem is that I have to boost the
>> gigabit interface on the MPC8313. I made a simple substitution and
>> __copy_tofrom_user() was used instead of memcpy(). I know it's wrong, but
>> that way I sped up the network interface by about 10%.
> 
> Very interesting.  Can you work out where memcpy is being called on
> the network data?  I wouldn't have expected that.

It probably is, somewhere... through some weird and wonderful code path that
needs some serious digging to find. At least in 2.4 memcpy was used, and
optimizing it (see Freescale's libmotovec benchmarks) did produce a sizable
performance improvement. That, and offloading TCP checksumming to AltiVec,
helped a lot.

No help at all on an 8313, but relevant anyway.

Since then, zero-copy networking and other fancy things like the DMA
engine API (for Intel IOAT at least, but there is also fsl dma support)
have come along, so there's less to actually optimize now and you're less
likely to see the same benefits. All of this got into mainline because it's
essential to have this kind of architecture to get reasonable speeds out of
>gigabit network links.

> There is actually no strong reason not to use __copy_tofrom_user as
> memcpy, in fact, as long as we are sure that source and destination
> are both cacheable.

I do think there is probably a good benefit in doing things like zeroing
pages with AltiVec and copying entire pages with AltiVec (for instance
when copy-on-write happens in an application) - NetBSD and QNX implement
at least this, because it's faster than using the cache management and
works fine on uncacheable pages too. Also, since you're always aligned to
a page - zeroing 4 KB aligned to a 4 KB boundary, or whatever your page
size happens to be - the number of errors that can occur is absolutely
tiny and performance can go through the roof.
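
For illustration only, the page-copy inner loop is tiny with AltiVec (a
userspace-style sketch using the standard intrinsics header; in the kernel
it would also need the enable/preemption handling discussed below):

#include <altivec.h>

/* Copy one 4 KB page, 16 bytes per vector move.  Assumes page-aligned
 * (hence 16-byte-aligned) source and destination. */
static void vec_copy_page(void *dst, const void *src)
{
        vector unsigned char *d = dst;
        const vector unsigned char *s = src;
        int i;

        for (i = 0; i < 4096 / 16; i++)
                d[i] = s[i];
}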

Ahem, but nobody here wants AltiVec in the kernel do they?

-- 
Matt Sealey <matt@genesi-usa.com>
Genesi, Manager, Developer Relations

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-09 11:06       ` Paul Mackerras
  2008-10-09 11:41         ` Dominik Bozek
  2008-10-09 15:37         ` Matt Sealey
@ 2008-10-10 17:17         ` Dominik Bozek
  2 siblings, 0 replies; 25+ messages in thread
From: Dominik Bozek @ 2008-10-10 17:17 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linuxppc-embedded

Paul Mackerras wrote:
> Very interesting.  Can you work out where memcpy is being called on
> the network data?  I wouldn't have expected that.

OK, I have some results.
I did two tests with different MTUs. In both cases, about 0.5 GB in total
was transferred over the network, in large blocks.
The test didn't trace "shallow copy" paths, where memcpy() is occasionally
also in use.


1) MTU=1500 (on both host and MPC8313)
* achieved throughput: 22 MB/s (from mpc), 16 MB/s (to mpc)
* total amount of data copied by memcpy(): 37.6 MB
* 96% of that was copied by skb_clone(): 787758 times in blocks of 48 B.
* about 3% of that was copied by skb_copy_bits(): 1013 times; the block
sizes vary but tend to be larger, around 1300 B.
* about 1% of that was copied by eth_header(): 80248 times in
blocks of 6 B (!!!!).

2) MTU=9000 (on both host and MPC8313)
* achieved throughput: 50 MB/s (from mpc), 44 MB/s (to mpc)
* total amount of data copied by memcpy(): 6.4 MB
* 97% of that was copied by skb_clone(): 134260 times in blocks of 48 B.
* 3% (the whole rest) was copied by eth_header(): 32912 times in blocks
of 6 B.

Conclusion: we need an optimized memcpy() for 48 B and 6 B blocks :). Joke.

I said earlier that I got about a 10% boost when I replaced memcpy() with
__copy_tofrom_user(). That was the case with MTU 9000, because I work with
that setting in my environment.

I don't know at what size __copy_tofrom_user() becomes faster than memcpy()
on CPUs other than the MPC8313, but on the MPC8313, 48 B blocks are better
served by __copy_tofrom_user().

Dominik

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-09 15:37         ` Matt Sealey
@ 2008-10-11 22:30           ` Benjamin Herrenschmidt
  2008-10-12  2:05             ` Matt Sealey
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-10-11 22:30 UTC (permalink / raw)
  To: Matt Sealey
  Cc: linuxppc-dev, Dominik Bozek, Paul Mackerras, linuxppc-embedded

On Thu, 2008-10-09 at 10:37 -0500, Matt Sealey wrote:
> 
> Ahem, but nobody here wants AltiVec in the kernel do they?

It depends. We do use altivec in the kernel for example for
RAID accelerations.

The reason we require a -real-good- reason to do it is
simply the drawbacks. The cost of enabling altivec
in the kernel can be high (especially if the user is using it)
and it's not context switched for kernel code (just like the
FPU) for obvious performance reasons. Thus any use of altivec in the
kernel must be done within non-preemptible sections, which can
cause higher latencies in preemptible kernels.
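
For reference, the pattern used by the RAID code looks roughly like this
(a sketch; enable_kernel_altivec() is the existing powerpc helper, and the
vector copy body itself is elided):

static void altivec_copy(void *dst, const void *src, unsigned long len)
{
        /* AltiVec state is not saved/restored for kernel code, so the
         * whole region that touches vector registers must not be
         * preempted. */
        preempt_disable();
        enable_kernel_altivec();

        /* ... vector load/store loop goes here ... */

        preempt_enable();
}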

Ben.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-08 17:40 ` Scott Wood
  2008-10-09  2:36   ` Paul Mackerras
@ 2008-10-11 22:32   ` Benjamin Herrenschmidt
  2008-10-13 15:06     ` Scott Wood
  1 sibling, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-10-11 22:32 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev, Dominik Bozek, linuxppc-embedded

On Wed, 2008-10-08 at 12:40 -0500, Scott Wood wrote:
> 
> The performance difference most likely comes from the fact that copy 
> to/from user can assume that the memory is cacheable, while memcpy is 
> occasionally used on cache-inhibited memory -- so dcbz isn't used.  We 
> may be better off handling the alignment fault on those occasions, and 
> we should use dcba on chips that support it.

Note that the kernel memcpy isn't supposed to be used for non-cacheable
memory. That's what memcpy_to/fromio are for.

But Paul has a point that, for small copies especially, the cost of
the cache instructions outweighs their benefit.
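
For reference, a driver copying to or from such an ioremap()ed,
cache-inhibited region would use those helpers, e.g. (a trivial sketch;
the function names are made up):

/* memcpy() must not be used on __iomem regions; use the io variants. */
static void write_device_buffer(void __iomem *dev, const void *src, size_t len)
{
        memcpy_toio(dev, src, len);
}

static void read_device_buffer(void *dst, const void __iomem *dev, size_t len)
{
        memcpy_fromio(dst, dev, len);
}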

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-11 22:30           ` Benjamin Herrenschmidt
@ 2008-10-12  2:05             ` Matt Sealey
  2008-10-12  4:05               ` Benjamin Herrenschmidt
  2008-10-13 15:20               ` Scott Wood
  0 siblings, 2 replies; 25+ messages in thread
From: Matt Sealey @ 2008-10-12  2:05 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, Dominik Bozek, Paul Mackerras, linuxppc-embedded

Benjamin Herrenschmidt wrote:
> On Thu, 2008-10-09 at 10:37 -0500, Matt Sealey wrote:
>> Ahem, but nobody here wants AltiVec in the kernel do they?
> 
> It depends. We do use altivec in the kernel for example for
> RAID accelerations.
> 
> The reason we require a -real-good- reason to do it is
> simply the drawbacks. The cost of enabling altivec
> in the kernel can be high (especially if the user is using it)
> and it's not context switched for kernel code (just like the
> FPU) for obvious performance reasons. Thus any use of altivec in the
> kernel must be done within non-preemptible sections, which can
> cause higher latencies in preemptible kernels.

Would the examples (page copy, page clear) be an okay place to do it?
These sections can't be preempted anyway (right?), and it's noted that
doing it with AltiVec is a tad faster than using MMU tricks or standard
copies?

In Scott's case, while "optimizing memcpy for 48-byte blocks" was a joke,
that is just 3 vector loads/stores in AltiVec, as long as every SKB is
16-byte aligned (is there any reason why it would not be? :)

skb_clone might not be something you want to dump AltiVec into and would
make a mess if an skb got extended somehow, but the principle is outlined
in a very good document from a very long time ago;

http://www.motorola.com.cn/semiconductors/sndf/conference/PDF/AH1109.pdf

I think a lot of it still holds true as long as you really don't care
about preemption under these circumstances (where network throughput
is more important, and where AltiVec actually *reduces* CPU time, the
overhead of disabling preemption is lower anyway). You could say the
same about the RAID functions - I bet LatencyTOP has a field day when
you're using RAID5 AltiVec. But if you're more concerned about fast disk
access, would you really care (especially since the algorithm is
automatically selected on boot, you've not much chance of having any
choice in the matter anyway)?

Granted it also doesn't help Scott one bit. Sorry :D

-- 
Matt Sealey <matt@genesi-usa.com>
Genesi, Manager, Developer Relations

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-12  2:05             ` Matt Sealey
@ 2008-10-12  4:05               ` Benjamin Herrenschmidt
  2008-10-13 15:20               ` Scott Wood
  1 sibling, 0 replies; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-10-12  4:05 UTC (permalink / raw)
  To: Matt Sealey
  Cc: linuxppc-dev, Dominik Bozek, Paul Mackerras, linuxppc-embedded


> Would the examples (page copy, page clear) be an okay place to do it?
> These sections can't be preempted anyway (right?), and it's noted that
> doing it with AltiVec is a tad faster than using MMU tricks or standard
> copies?

I think typically page copying and clearing -are- preemptible. I'm not
sure what you mean by MMU tricks, but it's not clear whether using
altivec will result in any significant performance gain here,
considering the cost of enabling/disabling altivec (added to handling
the preemption issue).

However, nothing prevents you from trying to do it and we'll see what
the results are with hard numbers.

> In Scott's case, while "optimizing memcpy for 48-byte blocks" was a joke,
> that is just 3 vector loads/stores in AltiVec, as long as every SKB is
> 16-byte aligned (is there any reason why it would not be? :)

In this case, the cost of enabling/saving/restoring altivec will far
outweigh any benefit. In addition, skbs are often not well aligned due
to the alignment tricks done with packet headers.

> skb_clone might not be something you want to dump AltiVec into and would
> make a mess if an skb got extended somehow, but the principle is outlined
> in a very good document from a very long time ago;
> 
> http://www.motorola.com.cn/semiconductors/sndf/conference/PDF/AH1109.pdf
> 
> I think a lot of it still holds true as long as you really don't care
> about preemption under these circumstances (where network throughput
> is more important, and where AltiVec actually *reduces* CPU time, the
> overhead of disabling preemption is lower anyway). You could say the
> same about the RAID functions - I bet LatencyTOP has a field day when
> you're using RAID5 AltiVec.

RAID6 actually :-)

In any case, as I said, people are welcome to implement something that
can be put to the test and measured. If it proves beneficial enough, then
I see no reason not to merge it. Basically, enough talk; just do something
and we'll see whether it proves useful or not.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-11 22:32   ` Benjamin Herrenschmidt
@ 2008-10-13 15:06     ` Scott Wood
  0 siblings, 0 replies; 25+ messages in thread
From: Scott Wood @ 2008-10-13 15:06 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Dominik Bozek, linuxppc-embedded

On Sun, Oct 12, 2008 at 09:32:07AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2008-10-08 at 12:40 -0500, Scott Wood wrote:
> > 
> > The performance difference most likely comes from the fact that copy 
> > to/from user can assume that the memory is cacheable, while memcpy is 
> > occasionally used on cache-inhibited memory -- so dcbz isn't used.  We 
> > may be better off handling the alignment fault on those occasions, and 
> > we should use dcba on chips that support it.
> 
> Note that the kernel memcpy isn't supposed to be used for non-cacheable
> memory. That's what memcpy_to/fromio are for.

I agree that it *shouldn't*, but the presence of cacheable_memcpy (used
only by the EMAC driver, AFAICT) suggests that it was a concern.

> But Paul has a point that, for small copies especially, the cost of
> the cache instructions outweighs their benefit.

Possibly, but what is the overall effect on the system of using them,
even if it hurts small copies slightly?  How many small copies are of
constant size, which could be diverted to another implementation at
compile-time?  Even run-time diversion may help, as the cost of a small
memcpy is only important if you do it many times, in which case the
branch will probably be correctly predicted.

Given the networking results Dominik posted, I think it's worth a look.

-Scott

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-12  2:05             ` Matt Sealey
  2008-10-12  4:05               ` Benjamin Herrenschmidt
@ 2008-10-13 15:20               ` Scott Wood
  2008-10-13 20:50                 ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 25+ messages in thread
From: Scott Wood @ 2008-10-13 15:20 UTC (permalink / raw)
  To: Matt Sealey
  Cc: Dominik Bozek, linuxppc-embedded, Paul Mackerras, linuxppc-dev

On Sat, Oct 11, 2008 at 09:05:49PM -0500, Matt Sealey wrote:
> Benjamin Herrenschmidt wrote:
> >The reason we require a -real-good- reason to do it is
> >simply the drawbacks. The cost of enabling altivec
> >in the kernel can be high (especially if the user is using it)
> >and it's not context switched for kernel code (just like the
> >FPU) for obvious performance reasons. Thus any use of altivec in the
> >kernel must be done within non-preemptible sections, which can
> >cause higher latencies in preemptible kernels.

It doesn't need to be done in non-preemptible sections, if you have a
separate per-thread save area for kernel fp/altivec use (and appropriate
flags so an FP unavailable handler knows which regs to restore), and you
can avoid using it in a preempting context.

In a realtime-configured kernel, preempting contexts should be fairly
minimal, so the loss of altivec use is not of critical performance impact
(other than one branch to determine if it can be used).

In a throughput-configured kernel, do it as you described (disable
preemption), and be able to use altivec memcpy in interrupt handlers, and
reduce the thread size.

> Would the examples (page copy, page clear) be an okay place to do it?
> These sections can't be preempted anyway (right?),

Why can't they be preempted?

-Scott

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-13 15:20               ` Scott Wood
@ 2008-10-13 20:50                 ` Benjamin Herrenschmidt
  2008-10-13 21:03                   ` Scott Wood
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-10-13 20:50 UTC (permalink / raw)
  To: Scott Wood; +Cc: Dominik Bozek, linuxppc-embedded, Paul Mackerras, linuxppc-dev


> It doesn't need to be done in non-preemptible sections, if you have a
> separate per-thread save area for kernel fp/altivec use (and appropriate
> flags so an FP unavailable handler knows which regs to restore), and you
> can avoid using it in a preempting context.

Yuck.

Ben.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-13 20:50                 ` Benjamin Herrenschmidt
@ 2008-10-13 21:03                   ` Scott Wood
  2008-10-14  2:14                     ` Matt Sealey
  0 siblings, 1 reply; 25+ messages in thread
From: Scott Wood @ 2008-10-13 21:03 UTC (permalink / raw)
  To: benh; +Cc: Dominik Bozek, linuxppc-embedded, Paul Mackerras, linuxppc-dev

Benjamin Herrenschmidt wrote:
>> It doesn't need to be done in non-preemptible sections, if you have a
>> separate per-thread save area for kernel fp/altivec use (and appropriate
>> flags so an FP unavailable handler knows which regs to restore), and you
>> can avoid using it in a preempting context.
> 
> Yuck.

Hmm?  It's simple and achieves the desired result (avoiding 
non-preemptible regions without unduly restricting the ability to 
extract performance from the hardware).

Would it be nicer to avoid FP/Altivec in the kernel altogether?  Sure. 
If the benchmarking says that we're better off with it, though, then so 
be it.

-Scott

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-13 21:03                   ` Scott Wood
@ 2008-10-14  2:14                     ` Matt Sealey
  2008-10-14  2:39                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 25+ messages in thread
From: Matt Sealey @ 2008-10-14  2:14 UTC (permalink / raw)
  To: Scott Wood; +Cc: Dominik Bozek, linuxppc-embedded, Paul Mackerras, linuxppc-dev

Scott Wood wrote:
> Benjamin Herrenschmidt wrote:
>>
>> Yuck.
> 
> Hmm?  It's simple and achieves the desired result (avoiding 
> non-preemptible regions without unduly restricting the ability to 
> extract performance from the hardware).
> 
> Would it be nicer to avoid FP/Altivec in the kernel altogether?  Sure. 
> If the benchmarking says that we're better off with it, though, then so 
> be it.

There should definitely be a nice API for an in-kernel AltiVec context
save/restore. When preemption happens doesn't it do some equivalent of
the userspace context switch? Why can't the preemption system take care
of it?

At worst case you make the worst case latency bigger, but at best case
you gain performance across the board.

One thing which is worrying me is that now that Ben has thrown down the
gauntlet (note, I'm not going to be coding a line, but I know a man who
can :) how on earth do we benchmark the differences here?

-- 
Matt Sealey <matt@genesi-usa.com>
Genesi, Manager, Developer Relations

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-14  2:14                     ` Matt Sealey
@ 2008-10-14  2:39                       ` Benjamin Herrenschmidt
  2008-10-14 15:10                         ` Scott Wood
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Herrenschmidt @ 2008-10-14  2:39 UTC (permalink / raw)
  To: Matt Sealey
  Cc: Scott Wood, linuxppc-dev, Dominik Bozek, Paul Mackerras,
	linuxppc-embedded


> There should definitely be a nice API for an in-kernel AltiVec context
> save/restore. When preemption happens doesn't it do some equivalent of
> the userspace context switch? Why can't the preemption system take care
> of it?
> 
> At worst case you make the worst case latency bigger, but at best case
> you gain performance across the board.

Do you? Can you prove this assertion with numbers?

> One thing which is worrying me is that now that Ben has thrown down the
> gauntlet (note, I'm not going to be coding a line, but I know a man who
> can :) how on earth do we benchmark the differences here?

Precisely :-)

So again, let's start by having somebody pick up something that you
believe is worth altivec-ifying, eat the preempt_disable/enable for now,
and if we see that indeed, it's worth the pain, then we can look into
adding a way to context switch altivec in a kernel thread upon explicit
request or something like that.

As to how to benchmark the difference? Well, I would suggest first a
couple of very simple things that give a good indication, and from
there, if it looks promising, we can torture more and see whether we can
find regressions etc.

For example, I personally use kernel compile times (with make -jN on
SMP), I find it a good overall exercise, but if you feel like a network
benchmark might be better at advertising your improvements, then go for
that too, though expect us to also do some other tests to verify they
didn't regress.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-14  2:39                       ` Benjamin Herrenschmidt
@ 2008-10-14 15:10                         ` Scott Wood
  2008-10-15  1:37                           ` Matt Sealey
  0 siblings, 1 reply; 25+ messages in thread
From: Scott Wood @ 2008-10-14 15:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Dominik Bozek, linuxppc-embedded, Paul Mackerras, linuxppc-dev

On Tue, Oct 14, 2008 at 01:39:19PM +1100, Benjamin Herrenschmidt wrote:
> So again, let's start by having somebody pick up something that you
> believe is worth altivec-ifying, eat the preempt_disable/enable for now,
> and if we see that indeed, it's worth the pain, then we can look into
> adding a way to context switch altivec in a kernel thread upon explicit
> request or something like that.

Of course -- my suggestion was predicated on the outcome that the
benchmarks do justify it, and was just pointing out that there's no real
need to disable preemption.

BTW, it's actually simpler than I originally described (I had implemented
this years ago in the TimeSys kernel for x86 and some other arches that
already use FP or similar resources for memcpy, but the memory was a
little fuzzy); the FP restore code doesn't need to test anything, it
always restores from the regular spot.  The kernel code wishing to use FP
saves the user context in an alternate save area (it could even be on the
stack, allowing atomic context to use it as well, if it's not too large),
and restores it when it's done.

-Scott

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: performance: memcpy vs. __copy_tofrom_user
  2008-10-14 15:10                         ` Scott Wood
@ 2008-10-15  1:37                           ` Matt Sealey
  0 siblings, 0 replies; 25+ messages in thread
From: Matt Sealey @ 2008-10-15  1:37 UTC (permalink / raw)
  To: Scott Wood; +Cc: Dominik Bozek, linuxppc-embedded, Paul Mackerras, linuxppc-dev



Scott Wood wrote:
> BTW, it's actually simpler than I originally described (I had implemented
> this years ago in the TimeSys kernel for x86 and some other arches that
> already use FP or similar resources for memcpy, but the memory was a
> little fuzzy); the FP restore code doesn't need to test anything, it
> always restores from the regular spot.  The kernel code wishing to use FP
> saves the user context in an alternate save area (it could even be on the
> stack, allowing atomic context to use it as well, if it's not too large),
> and restores it when it's done.

Sure, it's simple; the problem is that VRSAVE isn't maintained in the
kernel, which means that for AltiVec context switches you need to save and
restore 32 128-bit registers every time. And that takes a LONG time.

Just imagine if you did a ~512 byte memcpy, you could guarantee that it
would take twice as long as it should!

There are ways around it, like assembly and fixed registers, and saving 
the ones you use (this is the sort of thing gcc does for you usually, 
but you can do it by hand just as well) and restoring the ones you 
trashed afterwards, but that makes code messier and less readable.

Not insurmountable problems, but it makes using AltiVec harder. You
would have to really justify a speed increase. I think you could get
that on cryptography functions easily. For page zero/copying, I think 
you would also get the increase to outweigh the prologue/epilogue 
required and also the loss of preemption.

TCP/IP copy with checksum? Probably absolutely definitely..

Straight memcpy? I am not so sure.

Like I said, I am far more worried about how you'd get a reasonable
benchmark out of it. Profiling kernels is a messy business...

-- 
Matt Sealey <matt@genesi-usa.com>
Genesi, Manager, Developer Relations

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2008-10-15  1:37 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-08 14:39 performance: memcpy vs. __copy_tofrom_user Dominik Bozek
2008-10-08 15:31 ` Minh Tuan Duong
2008-10-08 15:39 ` Bill Gatliff
2008-10-08 15:42 ` Grant Likely
2008-10-09  2:34   ` Paul Mackerras
2008-10-09 10:12     ` Dominik Bozek
2008-10-09 11:06       ` Paul Mackerras
2008-10-09 11:41         ` Dominik Bozek
2008-10-09 12:04           ` Leon Woestenberg
2008-10-09 15:37         ` Matt Sealey
2008-10-11 22:30           ` Benjamin Herrenschmidt
2008-10-12  2:05             ` Matt Sealey
2008-10-12  4:05               ` Benjamin Herrenschmidt
2008-10-13 15:20               ` Scott Wood
2008-10-13 20:50                 ` Benjamin Herrenschmidt
2008-10-13 21:03                   ` Scott Wood
2008-10-14  2:14                     ` Matt Sealey
2008-10-14  2:39                       ` Benjamin Herrenschmidt
2008-10-14 15:10                         ` Scott Wood
2008-10-15  1:37                           ` Matt Sealey
2008-10-10 17:17         ` Dominik Bozek
2008-10-08 17:40 ` Scott Wood
2008-10-09  2:36   ` Paul Mackerras
2008-10-11 22:32   ` Benjamin Herrenschmidt
2008-10-13 15:06     ` Scott Wood
