From: "Ma, Ling" <ling.ma@intel.com>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: "mingo@elte.hu" <mingo@elte.hu>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
Date: Mon, 9 Nov 2009 15:24:03 +0800
Message-ID: <8FED46E8A9CA574792FC7AACAC38FE7714FCF772C9@PDSMSX501.ccr.corp.intel.com>
In-Reply-To: <4AF4784C.5090800@zytor.com>


Hi All

Today we ran our benchmark on Core2 and Sandy Bridge:

1. Results on Core2
Speedup on Core2
   Len        Alignment              Speedup
  1024,       0/ 0:                 0.95x 
  2048,       0/ 0:                 1.03x 
  3072,       0/ 0:                 1.02x 
  4096,       0/ 0:                 1.09x 
  5120,       0/ 0:                 1.13x 
  6144,       0/ 0:                 1.13x 
  7168,       0/ 0:                 1.14x 
  8192,       0/ 0:                 1.13x 
  9216,       0/ 0:                 1.14x 
  10240,      0/ 0:                 0.99x 
  11264,      0/ 0:                 1.14x 
  12288,      0/ 0:                 1.14x 
  13312,      0/ 0:                 1.10x 
  14336,      0/ 0:                 1.10x 
  15360,      0/ 0:                 1.13x
The application runs the following loop, measured with perf:
for (i = 1024; i < 1024 * 16; i += 64)
	do_memcpy(0, 0, i);
Run it with 'perf stat --repeat 10 ./static_orig' and 'perf stat --repeat 10 ./static_new'.
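For reference, a minimal sketch of the driver is below. The buffer setup, the repeat count, and the do_memcpy() wrapper are illustrative assumptions (plain libc memcpy stands in for the copy routine under test), not the exact harness:

#include <stdlib.h>
#include <string.h>

#define BUF_SIZE   (1024 * 16 + 64)
#define ITERATIONS 10000		/* illustrative repeat count; tune for run time */

static char *src, *dst;

/* Copy len bytes between the two buffers at the requested alignments. */
static void do_memcpy(size_t dst_align, size_t src_align, size_t len)
{
	memcpy(dst + dst_align, src + src_align, len);
}

int main(void)
{
	size_t i, n;

	src = malloc(BUF_SIZE);
	dst = malloc(BUF_SIZE);
	memset(src, 1, BUF_SIZE);

	for (n = 0; n < ITERATIONS; n++)
		for (i = 1024; i < 1024 * 16; i += 64)
			do_memcpy(0, 0, i);

	free(src);
	free(dst);
	return 0;
}

Built statically and run under 'perf stat --repeat 10', this produces counter output like the runs below.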
Before the patch:
Performance counter stats for './static_orig' (10 runs):

    3323.041832  task-clock-msecs         #      0.998 CPUs  ( +-   0.016% )
             22  context-switches         #      0.000 M/sec ( +-  31.913% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.001 M/sec ( +-   0.003% )
     9921549804  cycles                   #   2985.683 M/sec ( +-   0.016% )
    10863809359  instructions             #      1.095 IPC   ( +-   0.000% )
      972283451  cache-references         #    292.588 M/sec ( +-   0.018% )
          17703  cache-misses             #      0.005 M/sec ( +-   4.304% )

    3.330714469  seconds time elapsed   ( +-   0.021% )
After the patch:
Performance counter stats for './static_new' (10 runs):
    3392.902871  task-clock-msecs         #      0.998 CPUs ( +-   0.226% )
             21  context-switches         #      0.000 M/sec ( +-  30.982% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.001 M/sec ( +-   0.003% )
    10130188030  cycles                   #   2985.699 M/sec ( +-   0.227% )
      391981414  instructions             #      0.039 IPC   ( +-   0.013% )
      874161826  cache-references         #    257.644 M/sec ( +-   3.034% )
          17628  cache-misses             #      0.005 M/sec ( +-   4.577% )

    3.400681174  seconds time elapsed   ( +-   0.219% )

2. Results on Sandy Bridge
  Speedup on Sandy Bridge
  Len        Alignment             Speedup
  1024,       0/ 0:                1.08x 
  2048,       0/ 0:                1.42x 
  3072,       0/ 0:                1.51x 
  4096,       0/ 0:                1.63x 
  5120,       0/ 0:                1.67x 
  6144,       0/ 0:                1.72x 
  7168,       0/ 0:                1.75x 
  8192,       0/ 0:                1.77x 
  9216,       0/ 0:                1.80x 
  10240,      0/ 0:                1.80x 
  11264,      0/ 0:                1.82x 
  12288,      0/ 0:                1.85x 
  13312,      0/ 0:                1.85x 
  14336,      0/ 0:                1.88x 
  15360,      0/ 0:                1.88x 
                                  
The application runs the same loop, measured with perf:
for (i = 1024; i < 1024 * 16; i += 64)
	do_memcpy(0, 0, i);
Run it with 'perf stat --repeat 10 ./static_orig' and 'perf stat --repeat 10 ./static_new'.
Before the patch:
Performance counter stats for './static_orig' (10 runs):

    3787.441240  task-clock-msecs         #      0.995 CPUs  ( +-   0.140% )
              8  context-switches         #      0.000 M/sec ( +-  22.602% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.001 M/sec ( +-   0.003% )
     6053487926  cycles                   #   1598.305 M/sec ( +-   0.140% )
    10861025194  instructions             #      1.794 IPC   ( +-   0.001% )
        2823963  cache-references         #      0.746 M/sec ( +-  69.345% )
         266000  cache-misses             #      0.070 M/sec ( +-   0.980% )

    3.805400837  seconds time elapsed   ( +-   0.139% )
After the patch:
Performance counter stats for './static_new' (10 runs):

    2879.424879  task-clock-msecs         #      0.995 CPUs  ( +-   0.076% )
             10  context-switches         #      0.000 M/sec ( +-  24.761% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.002 M/sec ( +-   0.003% )
     4602155158  cycles                   #   1598.290 M/sec ( +-   0.076% )
      386146993  instructions             #      0.084 IPC   ( +-   0.005% )
         520008  cache-references         #      0.181 M/sec ( +-   8.077% )
         267345  cache-misses             #      0.093 M/sec ( +-   0.792% )

    2.893813235  seconds time elapsed   ( +-   0.085% )

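On the rep movs threshold discussed in the quoted message below: the idea is that copy lengths at or above a cutoff take the CPU fast-string path, while smaller copies keep the existing path. A rough C-level sketch follows; the 1024-byte cutoff, the helper names, and the GCC inline asm (x86-64 only) are illustrative assumptions, not the patch itself:

#include <stddef.h>
#include <string.h>

#define FAST_STRING_THRESHOLD 1024	/* cutoff under discussion; CPU-specific */

/* Bulk copy with the fast-string path: rep movsq for the 8-byte
 * chunks, rep movsb for the remaining tail. */
static void *memcpy_fast_string(void *dst, const void *src, size_t len)
{
	void *ret = dst;
	size_t qwords = len >> 3;
	size_t tail = len & 7;

	asm volatile("rep movsq"
		     : "+D" (dst), "+S" (src), "+c" (qwords)
		     : : "memory");
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (tail)
		     : : "memory");
	return ret;
}

void *my_memcpy(void *dst, const void *src, size_t len)
{
	if (len >= FAST_STRING_THRESHOLD)
		return memcpy_fast_string(dst, src, len);
	return memcpy(dst, src, len);	/* small copies: existing path */
}

The cutoff matters because rep movs has a fixed startup cost that only pays off on larger copies, which is why the threshold is a per-CPU tuning question.
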
Thanks
Ling

>-----Original Message-----
>From: H. Peter Anvin [mailto:hpa@zytor.com]
>Sent: November 7, 2009 3:26
>To: Ma, Ling
>Cc: mingo@elte.hu; tglx@linutronix.de; linux-kernel@vger.kernel.org
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/06/2009 09:07 AM, H. Peter Anvin wrote:
>>
>> Where did the 1024 byte threshold come from?  It seems a bit high to me,
>> and is at the very best a CPU-specific tuning factor.
>>
>> Andi is of course correct that older CPUs might suffer (sadly enough),
>> which is why we'd at the very least need some idea of what the
>> performance impact on those older CPUs would look like -- at that point
>> we can make a decision to just unconditionally do the rep movs or
>> consider some system where we point at different implementations for
>> different processors -- memcpy is probably one of the very few
>> operations for which something like that would make sense.
>>
>
>To be explicit: Ling, would you be willing to run some benchmarks across
>processors to see how this performs on non-Nehalem CPUs?
>
>	-hpa
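A companion sketch of the per-CPU dispatch idea raised above, where memcpy would point at different implementations for different processors. Selecting through a function pointer at init time is only one way to express it; the feature flag, the names, and the stand-in body are illustrative assumptions, not the kernel's actual mechanism:

#include <stddef.h>
#include <string.h>

/* All callers go through this pointer; it is chosen once at startup. */
static void *(*memcpy_impl)(void *dst, const void *src, size_t len) = memcpy;

/* Stand-in for a fast-string variant (see the rep movs sketch earlier
 * in this message); here it just forwards to libc memcpy. */
static void *memcpy_fast_string(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

/* Pick an implementation for the running CPU.  The flag is a
 * placeholder for whatever detection (family/model, feature bits)
 * would be done at boot. */
void memcpy_select(int cpu_prefers_fast_string)
{
	memcpy_impl = cpu_prefers_fast_string ? memcpy_fast_string : memcpy;
}

/* Copies then dispatch through the resolved pointer. */
void *dispatched_memcpy(void *dst, const void *src, size_t len)
{
	return memcpy_impl(dst, src, len);
}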
