Hi All,

Today we ran our benchmark on Core2 and Sandy Bridge:

1. Results on Core2

Speedup on Core2
Len     Alignment       Speedup
 1024,  0/ 0:   0.95x
 2048,  0/ 0:   1.03x
 3072,  0/ 0:   1.02x
 4096,  0/ 0:   1.09x
 5120,  0/ 0:   1.13x
 6144,  0/ 0:   1.13x
 7168,  0/ 0:   1.14x
 8192,  0/ 0:   1.13x
 9216,  0/ 0:   1.14x
10240,  0/ 0:   0.99x
11264,  0/ 0:   1.14x
12288,  0/ 0:   1.14x
13312,  0/ 0:   1.10x
14336,  0/ 0:   1.10x
15360,  0/ 0:   1.13x

Application run under perf:

    for (i = 1024; i < 1024 * 16; i = i + 64)
            do_memcpy(0, 0, i);

Run with 'perf stat --repeat 10 ./static_orig' and 'perf stat --repeat 10 ./static_new'.

Before the patch:

 Performance counter stats for './static_orig' (10 runs):

    3323.041832  task-clock-msecs   #     0.998 CPUs    ( +-  0.016% )
             22  context-switches   #     0.000 M/sec   ( +- 31.913% )
              0  CPU-migrations     #     0.000 M/sec   ( +-    nan% )
           4428  page-faults        #     0.001 M/sec   ( +-  0.003% )
     9921549804  cycles             #  2985.683 M/sec   ( +-  0.016% )
    10863809359  instructions       #     1.095 IPC     ( +-  0.000% )
      972283451  cache-references   #   292.588 M/sec   ( +-  0.018% )
          17703  cache-misses       #     0.005 M/sec   ( +-  4.304% )

    3.330714469  seconds time elapsed   ( +-  0.021% )

After the patch:

 Performance counter stats for './static_new' (10 runs):

    3392.902871  task-clock-msecs   #     0.998 CPUs    ( +-  0.226% )
             21  context-switches   #     0.000 M/sec   ( +- 30.982% )
              0  CPU-migrations     #     0.000 M/sec   ( +-    nan% )
           4428  page-faults        #     0.001 M/sec   ( +-  0.003% )
    10130188030  cycles             #  2985.699 M/sec   ( +-  0.227% )
      391981414  instructions       #     0.039 IPC     ( +-  0.013% )
      874161826  cache-references   #   257.644 M/sec   ( +-  3.034% )
          17628  cache-misses       #     0.005 M/sec   ( +-  4.577% )

    3.400681174  seconds time elapsed   ( +-  0.219% )

2. Results on Sandy Bridge

Speedup on Sandy Bridge
Len     Alignment       Speedup
 1024,  0/ 0:   1.08x
 2048,  0/ 0:   1.42x
 3072,  0/ 0:   1.51x
 4096,  0/ 0:   1.63x
 5120,  0/ 0:   1.67x
 6144,  0/ 0:   1.72x
 7168,  0/ 0:   1.75x
 8192,  0/ 0:   1.77x
 9216,  0/ 0:   1.80x
10240,  0/ 0:   1.80x
11264,  0/ 0:   1.82x
12288,  0/ 0:   1.85x
13312,  0/ 0:   1.85x
14336,  0/ 0:   1.88x
15360,  0/ 0:   1.88x

Application run under perf (same loop as above):

    for (i = 1024; i < 1024 * 16; i = i + 64)
            do_memcpy(0, 0, i);

Run with 'perf stat --repeat 10 ./static_orig' and 'perf stat --repeat 10 ./static_new'.

Before the patch:

 Performance counter stats for './static_orig' (10 runs):

    3787.441240  task-clock-msecs   #     0.995 CPUs    ( +-  0.140% )
              8  context-switches   #     0.000 M/sec   ( +- 22.602% )
              0  CPU-migrations     #     0.000 M/sec   ( +-    nan% )
           4428  page-faults        #     0.001 M/sec   ( +-  0.003% )
     6053487926  cycles             #  1598.305 M/sec   ( +-  0.140% )
    10861025194  instructions       #     1.794 IPC     ( +-  0.001% )
        2823963  cache-references   #     0.746 M/sec   ( +- 69.345% )
         266000  cache-misses       #     0.070 M/sec   ( +-  0.980% )

    3.805400837  seconds time elapsed   ( +-  0.139% )

After the patch:

 Performance counter stats for './static_new' (10 runs):

    2879.424879  task-clock-msecs   #     0.995 CPUs    ( +-  0.076% )
             10  context-switches   #     0.000 M/sec   ( +- 24.761% )
              0  CPU-migrations     #     0.000 M/sec   ( +-    nan% )
           4428  page-faults        #     0.002 M/sec   ( +-  0.003% )
     4602155158  cycles             #  1598.290 M/sec   ( +-  0.076% )
      386146993  instructions       #     0.084 IPC     ( +-  0.005% )
         520008  cache-references   #     0.181 M/sec   ( +-  8.077% )
         267345  cache-misses       #     0.093 M/sec   ( +-  0.792% )

    2.893813235  seconds time elapsed   ( +-  0.085% )
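For reference, a minimal C sketch of the benchmark driver used above. The buffer setup, the REPEAT count, and the do_memcpy() wrapper are assumptions for illustration; only the inner length loop is taken from the test description, and the real binaries exercise the memcpy implementation under test.

/* Sketch of the benchmark driver; not the exact test source. */
#include <string.h>
#include <stddef.h>

#define MAX_LEN (1024 * 16)
#define REPEAT  10000                   /* assumed; original repeat count not given */

static char src[MAX_LEN], dst[MAX_LEN];

/* dst_off/src_off correspond to the "Alignment 0/ 0" column above. */
static void do_memcpy(size_t dst_off, size_t src_off, size_t len)
{
        memcpy(dst + dst_off, src + src_off, len);
}

int main(void)
{
        for (int rep = 0; rep < REPEAT; rep++)
                for (size_t i = 1024; i < MAX_LEN; i += 64)
                        do_memcpy(0, 0, i);
        return 0;
}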
Thanks
Ling

>-----Original Message-----
>From: H. Peter Anvin [mailto:hpa@zytor.com]
>Sent: 2009-11-07 3:26
>To: Ma, Ling
>Cc: mingo@elte.hu; tglx@linutronix.de; linux-kernel@vger.kernel.org
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/06/2009 09:07 AM, H. Peter Anvin wrote:
>>
>> Where did the 1024 byte threshold come from? It seems a bit high to me,
>> and is at the very best a CPU-specific tuning factor.
>>
>> Andi is of course correct that older CPUs might suffer (sadly enough),
>> which is why we'd at the very least need some idea of what the
>> performance impact on those older CPUs would look like -- at that point
>> we can make a decision to just unconditionally do the rep movs or
>> consider some system where we point at different implementations for
>> different processors -- memcpy is probably one of the very few
>> operations for which something like that would make sense.
>>
>
>To be explicit: Ling, would you be willing to run some benchmarks across
>processors to see how this performs on non-Nehalem CPUs?
>
>	-hpa
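For readers following the thread, a rough C sketch of the kind of length-threshold dispatch to the fast-string path being discussed above. The 1024-byte cut-off and this rep movsq/rep movsb split are illustrative assumptions only, not the actual memcpy_64.S patch.

/* Illustrative only -- shows a size threshold that falls back to the
 * CPU's fast-string copy for large lengths (x86-64, GCC inline asm). */
#include <stddef.h>
#include <string.h>

#define FAST_STRING_THRESHOLD 1024      /* the threshold questioned in this thread */

void *sketch_memcpy(void *dst, const void *src, size_t len)
{
        void *ret = dst;

        if (len < FAST_STRING_THRESHOLD)
                return memcpy(dst, src, len);   /* leave small copies to the existing path */

        /* Bulk copy by quadwords, remainder by bytes, using rep movs. */
        size_t qwords = len >> 3;
        size_t bytes  = len & 7;

        asm volatile("rep movsq"
                     : "+D" (dst), "+S" (src), "+c" (qwords)
                     : : "memory");
        asm volatile("rep movsb"
                     : "+D" (dst), "+S" (src), "+c" (bytes)
                     : : "memory");
        return ret;
}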