* [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
@ 2009-11-06  9:41 ling.ma
  2009-11-06 16:51 ` Andi Kleen
  2009-11-06 17:07 ` H. Peter Anvin
  0 siblings, 2 replies; 33+ messages in thread
From: ling.ma @ 2009-11-06  9:41 UTC (permalink / raw)
  To: mingo; +Cc: hpa, tglx, linux-kernel, Ma Ling

From: Ma Ling <ling.ma@intel.com>

Hi All

Intel Nehalem improves the performance of REP strings significantly
over previous microarchitectures in several ways:

1. Startup overhead has been reduced in most cases.
2. Data transfer throughput is improved.
3. REP string can operate in "fast string" mode even if the address is
   not aligned to 16 bytes.

According to our experiments, when the copy size is big enough MOVSQ
can reach almost 16 bytes of throughput per cycle, which approximates
the SSE instruction set. The patch uses this optimization when the copy
size is 1024 bytes or more.
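
In C terms, the dispatch added by the patch is roughly the sketch below
(the name memcpy_sketch and the byte-loop fallback are stand-ins for
illustration only; the real code is the assembly in the patch, whose
small-copy path is the existing unrolled 64-byte loop):

#include <stddef.h>

static void *memcpy_sketch(void *dst, const void *src, size_t len)
{
	char *d = dst;
	const char *s = src;

	if (len >= 1024) {
		size_t qwords = len >> 3;	/* 8-byte chunks for REP MOVSQ */
		size_t tail = len & 7;		/* remaining bytes */

		asm volatile("rep movsq"
			     : "+D" (d), "+S" (s), "+c" (qwords)
			     : : "memory");
		asm volatile("rep movsb"
			     : "+D" (d), "+S" (s), "+c" (tail)
			     : : "memory");
		return dst;
	}

	while (len--)			/* stand-in for the unrolled copy loop */
		*d++ = *s++;
	return dst;
}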

Measured speedup on the Nehalem platform:
  Len    alignment   Speedup
 1024,    0/ 0:      1.04x
 2048,    0/ 0:      1.36x
 3072,    0/ 0:      1.51x
 4096,    0/ 0:      1.60x
 5120,    0/ 0:      1.70x
 6144,    0/ 0:      1.74x
 7168,    0/ 0:      1.77x
 8192,    0/ 0:      1.80x
 9216,    0/ 0:      1.82x
 10240,   0/ 0:      1.83x
 11264,   0/ 0:      1.85x
 12288,   0/ 0:      1.86x
 13312,   0/ 0:      1.92x
 14336,   0/ 0:      1.84x
 15360,   0/ 0:      1.74x

Data before the patch, from 'perf stat --repeat 10 ./static_orig':

 Performance counter stats for './static_orig' (10 runs):

    2835.650105  task-clock-msecs         #      0.999 CPUs    ( +-   0.051% )
              3  context-switches         #      0.000 M/sec   ( +-   6.503% )
              0  CPU-migrations           #      0.000 M/sec   ( +-     nan% )
           4429  page-faults              #      0.002 M/sec   ( +-   0.003% )
     7941098692  cycles                   #   2800.451 M/sec   ( +-   0.051% )
    10848100323  instructions             #      1.366 IPC     ( +-   0.000% )
         322808  cache-references         #      0.114 M/sec   ( +-   1.467% )
         280716  cache-misses             #      0.099 M/sec   ( +-   0.618% )

    2.838006377  seconds time elapsed   ( +-   0.051% )

Data after the patch, from 'perf stat --repeat 10 ./static_new':

 Performance counter stats for './static_new' (10 runs):

    7401.423466  task-clock-msecs         #      0.999 CPUs    ( +-   0.108% )
             10  context-switches         #      0.000 M/sec   ( +-   2.797% )
              0  CPU-migrations           #      0.000 M/sec   ( +-     nan% )
           4428  page-faults              #      0.001 M/sec   ( +-   0.003% )
    20727280183  cycles                   #   2800.445 M/sec   ( +-   0.107% )
     1472673654  instructions             #      0.071 IPC     ( +-   0.013% )
        1092221  cache-references         #      0.148 M/sec   ( +-  12.414% )
         290550  cache-misses             #      0.039 M/sec   ( +-   1.577% )

    7.407006046  seconds time elapsed   ( +-   0.108% )

Appreciate your comments.

Thanks
Ma Ling

---
 arch/x86/lib/memcpy_64.S |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index ad5441e..2ea3561 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -50,6 +50,12 @@ ENTRY(memcpy)
 	movl %edx, %ecx
 	shrl   $6, %ecx
 	jz .Lhandle_tail
+	/*
+	 * If the length is 1024 bytes or more, use REP MOVSQ,
+	 * which has higher throughput.
+	 */
+	cmpl $0x400, %edx
+	jae .Lmore_0x400
 
 	.p2align 4
 .Lloop_64:
@@ -119,6 +125,17 @@ ENTRY(memcpy)
 
 .Lend:
 	ret
+
+	.p2align 4
+.Lmore_0x400:
+	movq %rdi, %rax
+	movl %edx, %ecx
+	shrl $3, %ecx
+	andl $7, %edx
+	rep movsq
+	movl %edx, %ecx
+	rep movsb
+	ret
 	CFI_ENDPROC
 ENDPROC(memcpy)
 ENDPROC(__memcpy)
-- 
1.6.2.5



* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-06  9:41 [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string ling.ma
@ 2009-11-06 16:51 ` Andi Kleen
  2009-11-08 10:18   ` Ingo Molnar
  2009-11-06 17:07 ` H. Peter Anvin
  1 sibling, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2009-11-06 16:51 UTC (permalink / raw)
  To: ling.ma; +Cc: mingo, hpa, tglx, linux-kernel

ling.ma@intel.com writes:

> Intel Nehalem improves the performance of REP strings significantly
> over previous microarchitectures in several ways:

The problem is that it's not necessarily a win on older CPUs to
do it this way.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-06  9:41 [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string ling.ma
  2009-11-06 16:51 ` Andi Kleen
@ 2009-11-06 17:07 ` H. Peter Anvin
  2009-11-06 19:26   ` H. Peter Anvin
  1 sibling, 1 reply; 33+ messages in thread
From: H. Peter Anvin @ 2009-11-06 17:07 UTC (permalink / raw)
  To: ling.ma; +Cc: mingo, tglx, linux-kernel

On 11/06/2009 01:41 AM, ling.ma@intel.com wrote:
> 
>  Performance counter stats for './static_orig' (10 runs):
> 
>     2835.650105  task-clock-msecs         #      0.999 CPUs    ( +-   0.051% )
>               3  context-switches         #      0.000 M/sec   ( +-   6.503% )
>               0  CPU-migrations           #      0.000 M/sec   ( +-     nan% )
>            4429  page-faults              #      0.002 M/sec   ( +-   0.003% )
>      7941098692  cycles                   #   2800.451 M/sec   ( +-   0.051% )
>     10848100323  instructions             #      1.366 IPC     ( +-   0.000% )
>          322808  cache-references         #      0.114 M/sec   ( +-   1.467% )
>          280716  cache-misses             #      0.099 M/sec   ( +-   0.618% )
> 
>     2.838006377  seconds time elapsed   ( +-   0.051% )
> 
> 'perf stat --repeat 10 ./static_new' command get data after patch:
> 
>  Performance counter stats for './static_new' (10 runs):
> 
>     7401.423466  task-clock-msecs         #      0.999 CPUs    ( +-   0.108% )
>              10  context-switches         #      0.000 M/sec   ( +-   2.797% )
>               0  CPU-migrations           #      0.000 M/sec   ( +-     nan% )
>            4428  page-faults              #      0.001 M/sec   ( +-   0.003% )
>     20727280183  cycles                   #   2800.445 M/sec   ( +-   0.107% )
>      1472673654  instructions             #      0.071 IPC     ( +-   0.013% )
>         1092221  cache-references         #      0.148 M/sec   ( +-  12.414% )
>          290550  cache-misses             #      0.039 M/sec   ( +-   1.577% )
> 
>     7.407006046  seconds time elapsed   ( +-   0.108% )
> 

I assume these are backwards?  If so, it's a dramatic performance
improvement.

Where did the 1024 byte threshold come from?  It seems a bit high to me,
and is at the very best a CPU-specific tuning factor.

Andi is of course correct that older CPUs might suffer (sadly enough),
which is why we'd at the very least need some idea of what the
performance impact on those older CPUs would look like -- at that point
we can make a decision to just unconditionally do the rep movs or
consider some system where we point at different implementations for
different processors -- memcpy is probably one of the very few
operations for which something like that would make sense.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-06 17:07 ` H. Peter Anvin
@ 2009-11-06 19:26   ` H. Peter Anvin
  2009-11-09  7:24     ` Ma, Ling
  0 siblings, 1 reply; 33+ messages in thread
From: H. Peter Anvin @ 2009-11-06 19:26 UTC (permalink / raw)
  To: ling.ma; +Cc: mingo, tglx, linux-kernel

On 11/06/2009 09:07 AM, H. Peter Anvin wrote:
> 
> Where did the 1024 byte threshold come from?  It seems a bit high to me,
> and is at the very best a CPU-specific tuning factor.
> 
> Andi is of course correct that older CPUs might suffer (sadly enough),
> which is why we'd at the very least need some idea of what the
> performance impact on those older CPUs would look like -- at that point
> we can make a decision to just unconditionally do the rep movs or
> consider some system where we point at different implementations for
> different processors -- memcpy is probably one of the very few
> operations for which something like that would make sense.
> 

To be explicit: Ling, would you be willing to run some benchmarks across
processors to see how this performs on non-Nehalem CPUs?

	-hpa


* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-06 16:51 ` Andi Kleen
@ 2009-11-08 10:18   ` Ingo Molnar
  0 siblings, 0 replies; 33+ messages in thread
From: Ingo Molnar @ 2009-11-08 10:18 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ling.ma, hpa, tglx, linux-kernel


* Andi Kleen <andi@firstfloor.org> wrote:

> ling.ma@intel.com writes:
> 
> > Intel Nehalem improves the performance of REP strings significantly
> > over previous microarchitectures in several ways:
> 
> The problem is that it's not necessarily a win on older CPUs to do it 
> this way.

I'm wondering, why are you writing such obtuse comments on Intel-submitted
patches? Both you and I know which older CPUs have a slow string
implementation, and you know the rough order of magnitude and significance
as well, and you have ideas about how to solve it all.

Instead you injected just the minimal amount of information into this
thread to derail this patch you can see a problem with, but you didn't at
all explain your full opinion openly and honestly, and you certainly
didn't give enough information to allow Ling Ma to act upon your opinion
with maximum efficiency.

I.e. you are not being helpful at all here and you are obstructing Intel
folks actively, making their workflow and progress as inefficient as you
possibly can. Why are you doing that?

	Ingo


* RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-06 19:26   ` H. Peter Anvin
@ 2009-11-09  7:24     ` Ma, Ling
  2009-11-09  7:36       ` H. Peter Anvin
  2009-11-12 12:16       ` Pavel Machek
  0 siblings, 2 replies; 33+ messages in thread
From: Ma, Ling @ 2009-11-09  7:24 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: mingo, tglx, linux-kernel

[-- Attachment #1: Type: text/plain; charset="gb2312", Size: 6123 bytes --]

Hi All

Today we ran our benchmark on Core2 and Sandy Bridge:

1. Results on Core2
Speedup on Core2
   Len        Alignment              Speedup
  1024,       0/ 0:                 0.95x 
  2048,       0/ 0:                 1.03x 
  3072,       0/ 0:                 1.02x 
  4096,       0/ 0:                 1.09x 
  5120,       0/ 0:                 1.13x 
  6144,       0/ 0:                 1.13x 
  7168,       0/ 0:                 1.14x 
  8192,       0/ 0:                 1.13x 
  9216,       0/ 0:                 1.14x 
  10240,      0/ 0:                 0.99x 
  11264,      0/ 0:                 1.14x 
  12288,      0/ 0:                 1.14x 
  13312,      0/ 0:                 1.10x 
  14336,      0/ 0:                 1.10x 
  15360,      0/ 0:                 1.13x
The benchmark loop run through perf was:
for (i = 1024; i < 1024 * 16; i = i + 64)
	do_memcpy(0, 0, i);
The application was run with 'perf stat --repeat 10 ./static_orig/new'.
Before the patch:
Performance counter stats for './static_orig' (10 runs):

    3323.041832  task-clock-msecs         #      0.998 CPUs  ( +-   0.016% )
             22  context-switches         #      0.000 M/sec ( +-  31.913% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.001 M/sec ( +-   0.003% )
     9921549804  cycles                   #   2985.683 M/sec ( +-   0.016% )
    10863809359  instructions             #      1.095 IPC   ( +-   0.000% )
      972283451  cache-references         #    292.588 M/sec ( +-   0.018% )
          17703  cache-misses             #      0.005 M/sec ( +-   4.304% )

    3.330714469  seconds time elapsed   ( +-   0.021% )
After the patch:
Performance counter stats for './static_new' (10 runs):
    3392.902871  task-clock-msecs         #      0.998 CPUs ( +-   0.226% )
             21  context-switches         #      0.000 M/sec ( +-  30.982% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.001 M/sec ( +-   0.003% )
    10130188030  cycles                   #   2985.699 M/sec ( +-   0.227% )
      391981414  instructions             #      0.039 IPC   ( +-   0.013% )
      874161826  cache-references         #    257.644 M/sec ( +-   3.034% )
          17628  cache-misses             #      0.005 M/sec ( +-   4.577% )

    3.400681174  seconds time elapsed   ( +-   0.219% )

2. Results on Sandy Bridge
  Speedup on Sandy Bridge
  Len        Alignment             Speedup
  1024,       0/ 0:                1.08x 
  2048,       0/ 0:                1.42x 
  3072,       0/ 0:                1.51x 
  4096,       0/ 0:                1.63x 
  5120,       0/ 0:                1.67x 
  6144,       0/ 0:                1.72x 
  7168,       0/ 0:                1.75x 
  8192,       0/ 0:                1.77x 
  9216,       0/ 0:                1.80x 
  10240,      0/ 0:                1.80x 
  11264,      0/ 0:                1.82x 
  12288,      0/ 0:                1.85x 
  13312,      0/ 0:                1.85x 
  14336,      0/ 0:                1.88x 
  15360,      0/ 0:                1.88x 
                                  
The benchmark loop run through perf was:
for (i = 1024; i < 1024 * 16; i = i + 64)
	do_memcpy(0, 0, i);
The application was run with 'perf stat --repeat 10 ./static_orig/new'.
Before the patch:
Performance counter stats for './static_orig' (10 runs):

    3787.441240  task-clock-msecs         #      0.995 CPUs  ( +-   0.140% )
              8  context-switches         #      0.000 M/sec ( +-  22.602% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.001 M/sec ( +-   0.003% )
     6053487926  cycles                   #   1598.305 M/sec ( +-   0.140% )
    10861025194  instructions             #      1.794 IPC   ( +-   0.001% )
        2823963  cache-references         #      0.746 M/sec ( +-  69.345% )
         266000  cache-misses             #      0.070 M/sec ( +-   0.980% )

    3.805400837  seconds time elapsed   ( +-   0.139% )
After the patch:
Performance counter stats for './static_new' (10 runs):

    2879.424879  task-clock-msecs         #      0.995 CPUs  ( +-   0.076% )
             10  context-switches         #      0.000 M/sec ( +-  24.761% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.002 M/sec ( +-   0.003% )
     4602155158  cycles                   #   1598.290 M/sec ( +-   0.076% )
      386146993  instructions             #      0.084 IPC   ( +-   0.005% )
         520008  cache-references         #      0.181 M/sec ( +-   8.077% )
         267345  cache-misses             #      0.093 M/sec ( +-   0.792% )

    2.893813235  seconds time elapsed   ( +-   0.085% )

Thanks
Ling

>-----Original Message-----
>From: H. Peter Anvin [mailto:hpa@zytor.com]
>Sent: November 7, 2009 3:26
>To: Ma, Ling
>Cc: mingo@elte.hu; tglx@linutronix.de; linux-kernel@vger.kernel.org
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/06/2009 09:07 AM, H. Peter Anvin wrote:
>>
>> Where did the 1024 byte threshold come from?  It seems a bit high to me,
>> and is at the very best a CPU-specific tuning factor.
>>
>> Andi is of course correct that older CPUs might suffer (sadly enough),
>> which is why we'd at the very least need some idea of what the
>> performance impact on those older CPUs would look like -- at that point
>> we can make a decision to just unconditionally do the rep movs or
>> consider some system where we point at different implementations for
>> different processors -- memcpy is probably one of the very few
>> operations for which something like that would make sense.
>>
>
>To be expicit: Ling, would you be willing to run some benchmarks across
>processors to see how this performs on non-Nehalem CPUs?
>
>	-hpa


* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-09  7:24     ` Ma, Ling
@ 2009-11-09  7:36       ` H. Peter Anvin
  2009-11-09  8:08         ` Ingo Molnar
  2009-11-09  9:26         ` Andi Kleen
  2009-11-12 12:16       ` Pavel Machek
  1 sibling, 2 replies; 33+ messages in thread
From: H. Peter Anvin @ 2009-11-09  7:36 UTC (permalink / raw)
  To: Ma, Ling; +Cc: Ingo Molnar, Thomas Gleixner, linux-kernel

On 11/08/2009 11:24 PM, Ma, Ling wrote:
> Hi All
> 
> Today we run our benchmark on Core2 and Sandy Bridge:
> 

Hi Ling,

Thanks for doing that.  Do you also have access to any older CPUs?  I
suspect that the CPUs Andi is worried about are older ones like
P4, K8 or Pentium M/Core 1.  (Andi: please do clarify if you have
additional information.)

My personal opinion is that if we can show no significant slowdown on
P4, K8, P-M/Core 1, Core 2, and Nehalem then we can simply use this code
unconditionally.  If one of them is radically worse than baseline, then
we have to do something conditional, which is a lot more complicated.

[Ingo, Thomas: do you agree?]

Thanks,

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-09  7:36       ` H. Peter Anvin
@ 2009-11-09  8:08         ` Ingo Molnar
  2009-11-11  7:05           ` Ma, Ling
  2009-11-12 12:16           ` Pavel Machek
  2009-11-09  9:26         ` Andi Kleen
  1 sibling, 2 replies; 33+ messages in thread
From: Ingo Molnar @ 2009-11-09  8:08 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Ma, Ling, Ingo Molnar, Thomas Gleixner, linux-kernel


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 11/08/2009 11:24 PM, Ma, Ling wrote:
> > Hi All
> > 
> > Today we run our benchmark on Core2 and Sandy Bridge:
> > 
> 
> Hi Ling,
> 
> Thanks for doing that.  Do you also have access to any older CPUs?  I 
> suspect that the CPUs that Andi are worried about are older CPUs like 
> P4, K8 or Pentium M/Core 1.  (Andi: please do clarify if you have 
> additional information.)
> 
> My personal opinion is that if we can show no significant slowdown on 
> P4, K8, P-M/Core 1, Core 2, and Nehalem then we can simply use this 
> code unconditionally.  If one of them is radically worse than 
> baseline, then we have to do something conditional, which is a lot 
> more complicated.
> 
> [Ingo, Thomas: do you agree?]

Yeah. IIRC the worst case was the old P2's, which had really slow,
microcode-based string ops. (Some of them even had errata in early
prototypes, although we can certainly ignore those as string ops get
relied on quite frequently.)

IIRC the original PPro core came up with some nifty, hardwired string 
ops, but those had to be dumbed down and emulated in microcode due to 
SMP bugs - making it an inferior choice in the end.

But that should be ancient history and i'd suggest we ignore the P4 
dead-end too, unless it's some really big slowdown (which i doubt). If 
anyone cares then some optional assembly implementations could be added 
back.

Ling, if you are interested, could you send a user-space test-app to 
this thread that everyone could just compile and run on various older 
boxes, to gather a performance profile of hand-coded versus string ops 
performance?

( And i think we can make a judgement based on cache-hot performance
  alone - if anything, the string ops will perform comparatively better in
  cache-cold scenarios, so the cache-hot numbers would be a conservative
  estimate. )

	Ingo


* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-09  7:36       ` H. Peter Anvin
  2009-11-09  8:08         ` Ingo Molnar
@ 2009-11-09  9:26         ` Andi Kleen
  2009-11-09 16:41           ` H. Peter Anvin
  1 sibling, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2009-11-09  9:26 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Ma, Ling, Ingo Molnar, Thomas Gleixner, linux-kernel

"H. Peter Anvin" <hpa@zytor.com> writes:
>
> My personal opinion is that if we can show no significant slowdown on
> P4, K8, P-M/Core 1, Core 2, and Nehalem then we can simply use this code

The issue is Core 2.

P4 uses a different path, and Core 1 doesn't use the 64bit code.

> unconditionally.  If one of them is radically worse than baseline, then
> we have to do something conditional, which is a lot more complicated.

I have an older patchkit which did this, and some more optimizations
to this code.

There was still one open issue, which is why I didn't post it. If there's
interest I can post it.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-09  9:26         ` Andi Kleen
@ 2009-11-09 16:41           ` H. Peter Anvin
  2009-11-09 18:54             ` Andi Kleen
  0 siblings, 1 reply; 33+ messages in thread
From: H. Peter Anvin @ 2009-11-09 16:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ma, Ling, Ingo Molnar, Thomas Gleixner, linux-kernel

On 11/09/2009 01:26 AM, Andi Kleen wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>>
>> My personal opinion is that if we can show no significant slowdown on
>> P4, K8, P-M/Core 1, Core 2, and Nehalem then we can simply use this code
> 
> The issue is Core 2.
> 
> P4 uses a different path, and Core 1 doesn't use the 64bit code.
> 

Ling's numbers didn't seem to show a significant slowdown on Core 2 (it
was something like 0.95x baseline in the worst case, and most of the
cases were positive) so Core 2 doesn't seem to have a problem.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-09 16:41           ` H. Peter Anvin
@ 2009-11-09 18:54             ` Andi Kleen
  2009-11-09 22:36               ` H. Peter Anvin
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2009-11-09 18:54 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Ma, Ling, Ingo Molnar, Thomas Gleixner, linux-kernel

> Ling's numbers didn't seem to show a significant slowdown on Core 2 (it
> was something like 0.95x baseline in the worst case, and most of the
> cases were positive) so Core 2 doesn't seem to have a problem.

I ran quite a lot of micro benchmarks with various alignments and sizes;
the 'q' variant was not always a win. I haven't checked that particular
version though.

There's also K8 of course.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-09 18:54             ` Andi Kleen
@ 2009-11-09 22:36               ` H. Peter Anvin
  0 siblings, 0 replies; 33+ messages in thread
From: H. Peter Anvin @ 2009-11-09 22:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ma, Ling, Ingo Molnar, Thomas Gleixner, linux-kernel

On 11/09/2009 10:54 AM, Andi Kleen wrote:
>> Ling's numbers didn't seem to show a significant slowdown on Core 2 (it
>> was something like 0.95x baseline in the worst case, and most of the
>> cases were positive) so Core 2 doesn't seem to have a problem.
> 
> I ran quite a lot of micro benchmarks with various alignments and sizes
> the 'q' variant was not always a win. I haven't checked that particular
> version though.

Well, if you have concrete information about what the problem cases are,
then please provide it.  If you don't, but have a hunch where these
potential problems may lie, then please indicate what they might be.
Otherwise, there isn't any actionable information here.

	-hpa


* RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-09  8:08         ` Ingo Molnar
@ 2009-11-11  7:05           ` Ma, Ling
  2009-11-11  7:18             ` Ingo Molnar
  2009-11-11 20:34             ` Cyrill Gorcunov
  2009-11-12 12:16           ` Pavel Machek
  1 sibling, 2 replies; 33+ messages in thread
From: Ma, Ling @ 2009-11-11  7:05 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin; +Cc: Ingo Molnar, Thomas Gleixner, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2415 bytes --]

Hi All
Please use the attached memcpy.c ('cc -o memcpy memcpy.c -O2') to test more cases
if you are interested. In this program we made a simple modification
to the memcpy_new function.

Thanks
Ling


>-----Original Message-----
>From: Ingo Molnar [mailto:mingo@elte.hu]
>Sent: November 9, 2009 16:09
>To: H. Peter Anvin
>Cc: Ma, Ling; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>
>* H. Peter Anvin <hpa@zytor.com> wrote:
>
>> On 11/08/2009 11:24 PM, Ma, Ling wrote:
>> > Hi All
>> >
>> > Today we run our benchmark on Core2 and Sandy Bridge:
>> >
>>
>> Hi Ling,
>>
>> Thanks for doing that.  Do you also have access to any older CPUs?  I
>> suspect that the CPUs that Andi are worried about are older CPUs like
>> P4, K8 or Pentium M/Core 1.  (Andi: please do clarify if you have
>> additional information.)
>>
>> My personal opinion is that if we can show no significant slowdown on
>> P4, K8, P-M/Core 1, Core 2, and Nehalem then we can simply use this
>> code unconditionally.  If one of them is radically worse than
>> baseline, then we have to do something conditional, which is a lot
>> more complicated.
>>
>> [Ingo, Thomas: do you agree?]
>
>Yeah. IIRC the worst-case were the old P2's which had a really slow,
>microcode based string ops. (Some of them even had erratums in early
>prototypes although we can certainly ignore those as string ops get
>relied on quite frequently.)
>
>IIRC the original PPro core came up with some nifty, hardwired string
>ops, but those had to be dumbed down and emulated in microcode due to
>SMP bugs - making it an inferior choice in the end.
>
>But that should be ancient history and i'd suggest we ignore the P4
>dead-end too, unless it's some really big slowdown (which i doubt). If
>anyone cares then some optional assembly implementations could be added
>back.
>
>Ling, if you are interested, could you send a user-space test-app to
>this thread that everyone could just compile and run on various older
>boxes, to gather a performance profile of hand-coded versus string ops
>performance?
>
>( And i think we can make a judgement based on cache-hot performance
>  alone - if then the strings ops will perform comparatively better in
>  cache-cold scenarios, so the cache-hot numbers would be a conservative
>  estimate. )
>
>	Ingo

[-- Attachment #2: memcpy.c --]
[-- Type: text/plain, Size: 5683 bytes --]

#include<stdio.h>
#include <stdlib.h>


typedef unsigned long long int hp_timing_t;
#define  MAXSAMPLESTPT        100000
#define  MAXCOPYSIZE          (1024 * 32)
#define  ORIG  0
#define  NEW   1
static char* buf1 = NULL;
static char* buf2 = NULL;

hp_timing_t _dl_hp_timing_overhead;
# define HP_TIMING_NOW(Var) \
  ({ unsigned long long _hi, _lo; \
     asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
     (Var) = _hi << 32 | _lo; })

#define HP_TIMING_DIFF(Diff, Start, End)	(Diff) = ((End) - (Start))
#define HP_TIMING_TOTAL(total_time, start, end)	\
  do									\
    {									\
      hp_timing_t tmptime;						\
      HP_TIMING_DIFF (tmptime, start + _dl_hp_timing_overhead, end);	\
	total_time += tmptime;						\
    }									\
  while (0)

void memcpy_orig(char *dst, char *src, int len);
void memcpy_new(char *dst, char *src, int len);
void (*do_memcpy)(char *dst, char *src, int len);

static void
do_one_throughput ( char *dst, char *src,
	     size_t len)
{
      __asm__("cpuid" : : : "eax", "ebx", "ecx", "edx");
      size_t i;
      hp_timing_t start __attribute ((unused));
      hp_timing_t stop __attribute ((unused));
      hp_timing_t total_time =  (hp_timing_t) 0;

      __asm__("cpuid" : : : "eax", "ebx", "ecx", "edx");
      for (i = 0; i < MAXSAMPLESTPT; ++i)  {
          HP_TIMING_NOW (start);
		do_memcpy(buf1, buf2, len);
	  HP_TIMING_NOW (stop);
	  HP_TIMING_TOTAL (total_time, start, stop);
      }

      printf ("\t%zd", (size_t) total_time/MAXSAMPLESTPT);

}

static void
do_tpt_test (size_t align1, size_t align2, size_t len)
{
  size_t i, j;
  char *s1, *s2;

  s1 = (char *) (buf1 + align1);
  s2 = (char *) (buf2 + align2);


   printf ("TPT: Len %4zd, alignment %2zd/%2zd:", len, align1, align2);
   do_memcpy = memcpy_orig;
   do_one_throughput (s2, s1, len);
   do_memcpy = memcpy_new;
   do_one_throughput (s2, s1, len);

    putchar ('\n');
}

static void test_init(void)
{
  int i;
  buf1 = valloc(MAXCOPYSIZE);
  buf2 = valloc(MAXCOPYSIZE);

  for (i = 0; i < MAXCOPYSIZE ; i = i + 64) {
        buf1[i] = buf2[i] = i & 0xff;
  }

}

void memcpy_new(char *dst, char *src, int len)
{

	__asm__("movq %rdi, %rax");
	__asm__("movl %edx, %ecx");
	__asm__("shrl   $6, %ecx");
	__asm__("jz 2f");

	__asm__("cmp $0x400, %rdx");
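	/* 0x400 = 1024 bytes: large copies jump to the REP MOVSQ path at label 8 */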
	__asm__("jae 8f");

	__asm__("1:");
	__asm__("decl %ecx");

	__asm__("movq 0*8(%rsi), %r11");
	__asm__("movq 1*8(%rdi), %r8");
	__asm__("movq %r11,	0*8(%rdi)");
	__asm__("movq %r8,	1*8(%rdi)");

	__asm__("movq 2*8(%rsi), %r9");
	__asm__("movq 3*8(%rdi), %r10");
	__asm__("movq %r9,	2*8(%rdi)");
	__asm__("movq %r10,	3*8(%rdi)");

	__asm__("movq 4*8(%rsi), %r11");
	__asm__("movq 5*8(%rdi), %r8");
	__asm__("movq %r11,	4*8(%rdi)");
	__asm__("movq %r8,	5*8(%rdi)");

	__asm__("movq 6*8(%rsi), %r9");
	__asm__("movq 7*8(%rdi), %r10");
	__asm__("movq %r9,	6*8(%rdi)");
	__asm__("movq %r10,	7*8(%rdi)");

	__asm__("leaq 64(%rsi), %rsi");
	__asm__("leaq 64(%rdi), %rdi");

	__asm__("jnz  1b");

	__asm__("2:");
	__asm__("movl %edx, %ecx");
	__asm__("andl $63, %ecx");
	__asm__("shl   $3, %ecx");
	__asm__("jz 5f");

	__asm__("3:");
	__asm__("cmp %edi, %esi");
	__asm__("mov $8, %r9");
	__asm__("jl 4f");
	__asm__("neg %r9");

	__asm__("4:");
	__asm__("decl %ecx");
	__asm__("movq (%rsi),	%r8");
	__asm__("movq %r8,	(%rdi)");
	__asm__("leaq 8(%rdi),	%rdi");
	__asm__("leaq 8(%rsi),	%rsi");
	__asm__("jnz 3b");

	__asm__("5:");
	__asm__("movl %edx,	%ecx");
	__asm__("andl $7,	%ecx");
	__asm__("jz 7f");

	__asm__("6:");
	__asm__("movb (%rsi),	%r8b");
	__asm__("movb %r8b, (%rdi)");
	__asm__("incq %rdi");
	__asm__("incq %rsi");
	__asm__("decl %ecx");
	__asm__("jnz 6b");

	__asm__("7:");
	__asm__("retq");

	__asm__("8:");
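	/* large-copy path: bulk copy 8 bytes at a time with REP MOVSQ, then REP MOVSB for the tail */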
	__asm__("movl %edx, %ecx");
	__asm__ ("shr $3, %ecx");
	__asm__ ("andl $7, %edx");
	__asm__("rep movsq ");
	__asm__ ("jz 9f");
	__asm__("movl %edx, %ecx");
	__asm__("rep movsb");

	__asm__("9:");
}
void memcpy_orig(char *dst, char *src, int len)
{
	__asm__("movq %rdi, %rax");
	__asm__("movl %edx, %ecx");
	__asm__("shrl   $6, %ecx");
	__asm__("jz 2f");

	__asm__("mov $0x80, %r8d  ");  /*aligned case for loop 1 */

	__asm__("1:");
	__asm__("decl %ecx");

	__asm__("movq 0*8(%rsi), %r11");
	__asm__("movq 1*8(%rdi), %r8");
	__asm__("movq %r11,	0*8(%rdi)");
	__asm__("movq %r8,	1*8(%rdi)");

	__asm__("movq 2*8(%rsi), %r9");
	__asm__("movq 3*8(%rdi), %r10");
	__asm__("movq %r9,	2*8(%rdi)");
	__asm__("movq %r10,	3*8(%rdi)");

	__asm__("movq 4*8(%rsi), %r11");
	__asm__("movq 5*8(%rdi), %r8");
	__asm__("movq %r11,	4*8(%rdi)");
	__asm__("movq %r8,	5*8(%rdi)");

	__asm__("movq 6*8(%rsi), %r9");
	__asm__("movq 7*8(%rdi), %r10");
	__asm__("movq %r9,	6*8(%rdi)");
	__asm__("movq %r10,	7*8(%rdi)");

	__asm__("leaq 64(%rsi), %rsi");
	__asm__("leaq 64(%rdi), %rdi");

	__asm__("jnz  1b");

	__asm__("2:");
	__asm__("movl %edx, %ecx");
	__asm__("andl $63, %ecx");
	__asm__("shl   $3, %ecx");
	__asm__("jz 5f");

	__asm__("3:");
	__asm__("cmp %edi, %esi");
	__asm__("mov $8, %r9");
	__asm__("jl 4f");
	__asm__("neg %r9");

	__asm__("4:");
	__asm__("decl %ecx");
	__asm__("movq (%rsi),	%r8");
	__asm__("movq %r8,	(%rdi)");
	__asm__("leaq 8(%rdi),	%rdi");
	__asm__("leaq 8(%rsi),	%rsi");
	__asm__("jnz 3b");

	__asm__("5:");
	__asm__("movl %edx,	%ecx");
	__asm__("andl $7,	%ecx");
	__asm__("jz 7f");

	__asm__("6:");
	__asm__("movb (%rsi),	%r8b");
	__asm__("movb %r8b, (%rdi)");
	__asm__("incq %rdi");
	__asm__("incq %rsi");
	__asm__("decl %ecx");
	__asm__("jnz 6b");

	__asm__("7:");
	__asm__("retq");
}


void main(void)
{
  int i;
  test_init();
  printf ("%23s", "");
  printf ("\t%s\t%s\n", "memcpy_orig", "memcpy_new");

  for (i = 1024; i < 1024 * 16; i = i + 1024)
     do_tpt_test(8, 0, i);

}


* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-11  7:05           ` Ma, Ling
@ 2009-11-11  7:18             ` Ingo Molnar
  2009-11-11  7:57               ` Ma, Ling
  2009-11-11 20:34             ` Cyrill Gorcunov
  1 sibling, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2009-11-11  7:18 UTC (permalink / raw)
  To: Ma, Ling; +Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, linux-kernel


* Ma, Ling <ling.ma@intel.com> wrote:

> Hi All
> Please use the memcpy.c(cc -o memcpy memcpy.c -O2) to test more cases,
> if you have interest. In this program we did simple modification
> on memcpy_new function.

FYI:

earth4:~/s> cc -o memcpy memcpy.c -O2
memcpy.c: In function 'do_one_throughput':
memcpy.c:45: error: impossible register constraint in 'asm'
memcpy.c:53: error: impossible register constraint in 'asm'
memcpy.c:47: error: impossible register constraint in 'asm'
memcpy.c:53: error: impossible register constraint in 'asm'

	Ingo


* RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-11  7:18             ` Ingo Molnar
@ 2009-11-11  7:57               ` Ma, Ling
  2009-11-11 23:21                 ` H. Peter Anvin
  0 siblings, 1 reply; 33+ messages in thread
From: Ma, Ling @ 2009-11-11  7:57 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, linux-kernel

[-- Attachment #1: Type: text/plain; charset="gb2312", Size: 1101 bytes --]

Hi Ingo

This program is the 64-bit version, so please build it with 'cc -o memcpy memcpy.c -O2 -m64'.

Thanks
Ling

>-----Original Message-----
>From: Ingo Molnar [mailto:mingo@elte.hu]
>Sent: November 11, 2009 15:19
>To: Ma, Ling
>Cc: H. Peter Anvin; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>
>* Ma, Ling <ling.ma@intel.com> wrote:
>
>> Hi All
>> Please use the memcpy.c(cc -o memcpy memcpy.c -O2) to test more cases,
>> if you have interest. In this program we did simple modification
>> on memcpy_new function.
>
>FYI:
>
>earth4:~/s> cc -o memcpy memcpy.c -O2
>memcpy.c: In function 'do_one_throughput':
>memcpy.c:45: error: impossible register constraint in 'asm'
>memcpy.c:53: error: impossible register constraint in 'asm'
>memcpy.c:47: error: impossible register constraint in 'asm'
>memcpy.c:53: error: impossible register constraint in 'asm'
>
>	Ingo


* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-11  7:05           ` Ma, Ling
  2009-11-11  7:18             ` Ingo Molnar
@ 2009-11-11 20:34             ` Cyrill Gorcunov
  2009-11-11 22:39               ` H. Peter Anvin
  1 sibling, 1 reply; 33+ messages in thread
From: Cyrill Gorcunov @ 2009-11-11 20:34 UTC (permalink / raw)
  To: Ma, Ling
  Cc: Ingo Molnar, H. Peter Anvin, Ingo Molnar, Thomas Gleixner, linux-kernel

On Wed, Nov 11, 2009 at 03:05:34PM +0800, Ma, Ling wrote:
> Hi All
> Please use the memcpy.c(cc -o memcpy memcpy.c -O2) to test more cases,
> if you have interest. In this program we did simple modification
> on memcpy_new function.
> 
> Thanks
> Ling

Just my 0.2$ :)

	-- Cyrill
---
                       			memcpy_orig	memcpy_new
TPT: Len 1024, alignment  8/ 0:		490		570
TPT: Len 2048, alignment  8/ 0:		826		329
TPT: Len 3072, alignment  8/ 0:		441		464
TPT: Len 4096, alignment  8/ 0:		579		596
TPT: Len 5120, alignment  8/ 0:		723		729
TPT: Len 6144, alignment  8/ 0:		859		861
TPT: Len 7168, alignment  8/ 0:		996		994
TPT: Len 8192, alignment  8/ 0:		1165		1127
TPT: Len 9216, alignment  8/ 0:		1273		1260
TPT: Len 10240, alignment  8/ 0:	1402		1395
TPT: Len 11264, alignment  8/ 0:	1543		1525
TPT: Len 12288, alignment  8/ 0:	1682		1659
TPT: Len 13312, alignment  8/ 0:	1869		1815
TPT: Len 14336, alignment  8/ 0:	1982		1951
TPT: Len 15360, alignment  8/ 0:	2185		2110
---

I've run this test a few times and the results are almost the same;
for lengths 1024, 3072, 4096, 5120, and 6144 the new version is a bit slower.

---
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Duo CPU     T8100  @ 2.10GHz
stepping	: 6
cpu MHz		: 800.000
cache size	: 3072 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm ida tpr_shadow vnmi flexpriority
bogomips	: 4189.60
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Duo CPU     T8100  @ 2.10GHz
stepping	: 6
cpu MHz		: 800.000
cache size	: 3072 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm ida tpr_shadow vnmi flexpriority
bogomips	: 4189.46
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:



* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-11 20:34             ` Cyrill Gorcunov
@ 2009-11-11 22:39               ` H. Peter Anvin
  2009-11-12  4:28                 ` Cyrill Gorcunov
  0 siblings, 1 reply; 33+ messages in thread
From: H. Peter Anvin @ 2009-11-11 22:39 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Ma, Ling, Ingo Molnar, Ingo Molnar, Thomas Gleixner, linux-kernel

On 11/11/2009 12:34 PM, Cyrill Gorcunov wrote:
>                        			memcpy_orig	memcpy_new
> TPT: Len 1024, alignment  8/ 0:		490		570
> TPT: Len 2048, alignment  8/ 0:		826		329
> TPT: Len 3072, alignment  8/ 0:		441		464
> TPT: Len 4096, alignment  8/ 0:		579		596
> TPT: Len 5120, alignment  8/ 0:		723		729
> TPT: Len 6144, alignment  8/ 0:		859		861
> TPT: Len 7168, alignment  8/ 0:		996		994
> TPT: Len 8192, alignment  8/ 0:		1165		1127
> TPT: Len 9216, alignment  8/ 0:		1273		1260
> TPT: Len 10240, alignment  8/ 0:	1402		1395
> TPT: Len 11264, alignment  8/ 0:	1543		1525
> TPT: Len 12288, alignment  8/ 0:	1682		1659
> TPT: Len 13312, alignment  8/ 0:	1869		1815
> TPT: Len 14336, alignment  8/ 0:	1982		1951
> TPT: Len 15360, alignment  8/ 0:	2185		2110
> 
> I've run this test a few times and results almost the same,
> with alignment 1024, 3072, 4096, 5120, 6144, new version a bit slowly.
> 

Was the result for 2048 consistent (it seems odd in the extreme)... the
discrepancy between this result and Ling's results bothers me; perhaps
the right answer is to leave the current code for Core2 and use new code
(with a lower than 1024 threshold?) for NHM and K8?

	-hpa


* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-11  7:57               ` Ma, Ling
@ 2009-11-11 23:21                 ` H. Peter Anvin
  2009-11-12  2:12                   ` Ma, Ling
  0 siblings, 1 reply; 33+ messages in thread
From: H. Peter Anvin @ 2009-11-11 23:21 UTC (permalink / raw)
  To: Ma, Ling; +Cc: Ingo Molnar, Ingo Molnar, Thomas Gleixner, linux-kernel

On 11/10/2009 11:57 PM, Ma, Ling wrote:
> Hi Ingo
> 
> This program is for 64bit version, so please use 'cc -o memcpy  memcpy.c -O2 -m64'
> 

I did some measurements with this program; I added power-of-two
measurements from 1-512 bytes, plus some different alignments, and found
some very interesting results:

Nehalem:
	memcpy_new is a win for 1024+ bytes, but *also* a win for 2-32
	bytes, where the old code apparently performs appallingly bad.

	memcpy_new loses in the 64-512 byte range, so the 1024
	threshold is probably justified.

Core2:
	memcpy_new is a win for <= 512 bytes, but a loss for larger
	copies (possibly a win again for 16K+ copies, but those are
	very rare in the Linux kernel.)  Surprise...

	However, the difference is very small.

However, I had overlooked something much more fundamental about your
patch.  On Nehalem, at least *it will never get executed* (except during
very early startup), because we replace the memcpy code with a jmp to
memcpy_c on any CPU which has X86_FEATURE_REP_GOOD, which includes Nehalem.

So the patch is a no-op on Nehalem, and any other modern CPU.
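
(For the user-space test program, the same kind of selection can be
approximated by checking the "rep_good" flag the kernel exports in
/proc/cpuinfo -- it is visible in the cpuinfo dump earlier in this
thread. A minimal sketch with hypothetical helper names; both copy
variants below are just stand-ins calling libc memcpy:)

#include <stdio.h>
#include <string.h>

static void *copy_unrolled(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);	/* stand-in for the unrolled loop */
}

static void *copy_rep_movsq(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);	/* stand-in for the REP MOVSQ variant */
}

static void *(*resolved_memcpy)(void *, const void *, size_t) = copy_unrolled;

/* Pick the REP MOVSQ variant only when the kernel reports "rep_good". */
void select_memcpy(void)
{
	char line[1024];
	FILE *f = fopen("/proc/cpuinfo", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "flags", 5) && strstr(line, " rep_good")) {
			resolved_memcpy = copy_rep_movsq;
			break;
		}
	}
	fclose(f);
}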

Am I right in guessing that the perf numbers you posted originally were all
from your user-space test program?

	-hpa


* RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-11 23:21                 ` H. Peter Anvin
@ 2009-11-12  2:12                   ` Ma, Ling
  0 siblings, 0 replies; 33+ messages in thread
From: Ma, Ling @ 2009-11-12  2:12 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Ingo Molnar, Ingo Molnar, Thomas Gleixner, linux-kernel

[-- Attachment #1: Type: text/plain; charset="gb2312", Size: 2000 bytes --]

>-----Original Message-----
>From: H. Peter Anvin [mailto:hpa@zytor.com]
>Sent: November 12, 2009 7:21
>To: Ma, Ling
>Cc: Ingo Molnar; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/10/2009 11:57 PM, Ma, Ling wrote:
>> Hi Ingo
>>
>> This program is for 64bit version, so please use 'cc -o memcpy  memcpy.c -O2
>-m64'
>>
>
>I did some measurements with this program; I added power-of-two
>measurements from 1-512 bytes, plus some different alignments, and found
>some very interesting results:
>
>Nehalem:
>	memcpy_new is a win for 1024+ bytes, but *also* a win for 2-32
>	bytes, where the old code apparently performs appallingly bad.
>
>	memcpy_new loses in the 64-512 byte range, so the 1024
>	threshold is probably justified.
>
>Core2:
>	memcpy_new is a win for <= 512 bytes, but a lose for larger
>	copies (possibly a win again for 16K+ copies, but those are
>	very rare in the Linux kernel.)  Surprise...
>
>	However, the difference is very small.
>
>However, I had overlooked something much more fundamental about your
>patch.  On Nehalem, at least *it will never get executed* (except during
>very early startup), because we replace the memcpy code with a jmp to
>memcpy_c on any CPU which has X86_FEATURE_REP_GOOD, which includes Nehalem.
>
>So the patch is a no-op on Nehalem, and any other modern CPU.

[Ma Ling]
It is good for modern CPUs; our original intention was also to introduce MOVSQ for Nehalem, and the above method is smarter.

>Am I guessing that the perf numbers you posted originally were all from
>your user space test program?

[Ma Ling] 
Yes, they are all from this program, and I'm confused about why the measured values differ for only one case across multiple tests
(at least 3 runs on my Core2 platform).

Thanks
Ling


* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast  string.
  2009-11-11 22:39               ` H. Peter Anvin
@ 2009-11-12  4:28                 ` Cyrill Gorcunov
  2009-11-12  4:49                   ` Ma, Ling
  0 siblings, 1 reply; 33+ messages in thread
From: Cyrill Gorcunov @ 2009-11-12  4:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ma, Ling, Ingo Molnar, Ingo Molnar, Thomas Gleixner, linux-kernel

On Thu, Nov 12, 2009 at 1:39 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 11/11/2009 12:34 PM, Cyrill Gorcunov wrote:
>>                                               memcpy_orig     memcpy_new
>> TPT: Len 1024, alignment  8/ 0:               490             570
>> TPT: Len 2048, alignment  8/ 0:               826             329
>> TPT: Len 3072, alignment  8/ 0:               441             464
>> TPT: Len 4096, alignment  8/ 0:               579             596
>> TPT: Len 5120, alignment  8/ 0:               723             729
>> TPT: Len 6144, alignment  8/ 0:               859             861
>> TPT: Len 7168, alignment  8/ 0:               996             994
>> TPT: Len 8192, alignment  8/ 0:               1165            1127
>> TPT: Len 9216, alignment  8/ 0:               1273            1260
>> TPT: Len 10240, alignment  8/ 0:      1402            1395
>> TPT: Len 11264, alignment  8/ 0:      1543            1525
>> TPT: Len 12288, alignment  8/ 0:      1682            1659
>> TPT: Len 13312, alignment  8/ 0:      1869            1815
>> TPT: Len 14336, alignment  8/ 0:      1982            1951
>> TPT: Len 15360, alignment  8/ 0:      2185            2110
>>
>> I've run this test a few times and results almost the same,
>> with alignment 1024, 3072, 4096, 5120, 6144, new version a bit slowly.
>>
>
> Was the result for 2048 consistent (it seems odd in the extreme)... the
> discrepancy between this result and Ling's results bothers me; perhaps
> the right answer is to leave the current code for Core2 and use new code
> (with a lower than 1024 threshold?) for NHM and K8?
>
>        -hpa
>

Hi Peter,

no, the result for 2048 is not repeatable (that is why I didn't mention this number
in my earlier report).

Test1:
TPT: Len 2048, alignment  8/ 0:	826	329
Test2:
TPT: Len 2048, alignment  8/ 0:	359	329
Test3:
TPT: Len 2048, alignment  8/ 0:	306	331
Test4:
TPT: Len 2048, alignment  8/ 0:	415	329

I guess this was due to the CPU frequency changing from 800MHz to 2.1GHz, since
I ran the tests manually rather than using any kind of bash loop to run the
test program.


* RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast  string.
  2009-11-12  4:28                 ` Cyrill Gorcunov
@ 2009-11-12  4:49                   ` Ma, Ling
  2009-11-12  5:26                     ` H. Peter Anvin
  2009-11-12  9:54                     ` Cyrill Gorcunov
  0 siblings, 2 replies; 33+ messages in thread
From: Ma, Ling @ 2009-11-12  4:49 UTC (permalink / raw)
  To: Cyrill Gorcunov, H. Peter Anvin
  Cc: Ingo Molnar, Ingo Molnar, Thomas Gleixner, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2814 bytes --]

Hi All
The attachment is the latest memcpy.c; please build it with
"cc -o memcpy memcpy.c -O2 -m64".

Thanks
Ling


>-----Original Message-----
>From: Cyrill Gorcunov [mailto:gorcunov@gmail.com]
>Sent: November 12, 2009 12:28
>To: H. Peter Anvin
>Cc: Ma, Ling; Ingo Molnar; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On Thu, Nov 12, 2009 at 1:39 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 11/11/2009 12:34 PM, Cyrill Gorcunov wrote:
>>>                                               memcpy_orig     memcpy_new
>>> TPT: Len 1024, alignment  8/ 0:               490             570
>>> TPT: Len 2048, alignment  8/ 0:               826             329
>>> TPT: Len 3072, alignment  8/ 0:               441             464
>>> TPT: Len 4096, alignment  8/ 0:               579             596
>>> TPT: Len 5120, alignment  8/ 0:               723             729
>>> TPT: Len 6144, alignment  8/ 0:               859             861
>>> TPT: Len 7168, alignment  8/ 0:               996             994
>>> TPT: Len 8192, alignment  8/ 0:               1165            1127
>>> TPT: Len 9216, alignment  8/ 0:               1273            1260
>>> TPT: Len 10240, alignment  8/ 0:      1402            1395
>>> TPT: Len 11264, alignment  8/ 0:      1543            1525
>>> TPT: Len 12288, alignment  8/ 0:      1682            1659
>>> TPT: Len 13312, alignment  8/ 0:      1869            1815
>>> TPT: Len 14336, alignment  8/ 0:      1982            1951
>>> TPT: Len 15360, alignment  8/ 0:      2185            2110
>>>
>>> I've run this test a few times and results almost the same,
>>> with alignment 1024, 3072, 4096, 5120, 6144, new version a bit slowly.
>>>
>>
>> Was the result for 2048 consistent (it seems odd in the extreme)... the
>> discrepancy between this result and Ling's results bothers me; perhaps
>> the right answer is to leave the current code for Core2 and use new code
>> (with a lower than 1024 threshold?) for NHM and K8?
>>
>>        -hpa
>>
>
>Hi Peter,
>
>no, results for 2048 is not repeatable (that is why I didn't mention this number
>in a former report).
>
>Test1:
>TPT: Len 2048, alignment  8/ 0:	826	329
>Test2:
>TPT: Len 2048, alignment  8/ 0:	359	329
>Test3:
>TPT: Len 2048, alignment  8/ 0:	306	331
>Test4:
>TPT: Len 2048, alignment  8/ 0:	415	329
>
>I guess this was due to cpu frequency change from 800 to 2.1Ghz since
>I did tests manually
>not using any kind of bash cycle to run the test program.

[-- Attachment #2: memcpy.c --]
[-- Type: text/plain, Size: 5495 bytes --]

#include<stdio.h>
#include <stdlib.h>


typedef unsigned long long int hp_timing_t;
#define  MAXSAMPLESTPT        100000
#define  MAXCOPYSIZE          (1024 * 32)
#define  ORIG  0
#define  NEW   1
static char* buf1 = NULL;
static char* buf2 = NULL;

hp_timing_t _dl_hp_timing_overhead;
# define HP_TIMING_NOW(Var) \
  ({ unsigned long long _hi, _lo; \
     asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
     (Var) = _hi << 32 | _lo; })

#define HP_TIMING_DIFF(Diff, Start, End)	(Diff) = ((End) - (Start))
#define HP_TIMING_TOTAL(total_time, start, end)	\
  do									\
    {									\
      hp_timing_t tmptime;						\
      HP_TIMING_DIFF (tmptime, start + _dl_hp_timing_overhead, end);	\
	total_time += tmptime;						\
    }									\
  while (0)

void memcpy_orig(char *dst, char *src, int len);
void memcpy_new(char *dst, char *src, int len);
void memcpy_c(char *dst, char *src, int len);
void (*do_memcpy)(char *dst, char *src, int len);

static void
do_one_throughput ( char *dst, char *src,
	     size_t len)
{
      __asm__("cpuid" : : : "eax", "ebx", "ecx", "edx");
      size_t i;
      hp_timing_t start __attribute ((unused));
      hp_timing_t stop __attribute ((unused));
      hp_timing_t total_time =  (hp_timing_t) 0;

      __asm__("cpuid" : : : "eax", "ebx", "ecx", "edx");
      for (i = 0; i < MAXSAMPLESTPT; ++i)  {
          HP_TIMING_NOW (start);
		do_memcpy(buf1, buf2, len);
	  HP_TIMING_NOW (stop);
	  HP_TIMING_TOTAL (total_time, start, stop);
      }

      printf ("\t%zd", (size_t) total_time/MAXSAMPLESTPT);

}

static void
do_tpt_test (size_t align1, size_t align2, size_t len)
{
  size_t i, j;
  char *s1, *s2;

  s1 = (char *) (buf1 + align1);
  s2 = (char *) (buf2 + align2);


   printf ("TPT: Len %4zd, alignment %2zd/%2zd:", len, align1, align2);
   do_memcpy = memcpy_orig;
   do_one_throughput (s2, s1, len);
   do_memcpy = memcpy_new;
   do_one_throughput (s2, s1, len);

    putchar ('\n');
}

static void test_init(void)
{
  int i;
  buf1 = valloc(MAXCOPYSIZE);
  buf2 = valloc(MAXCOPYSIZE);

  for (i = 0; i < MAXCOPYSIZE ; i = i + 64) {
        buf1[i] = buf2[i] = i & 0xff;
  }

}

void memcpy_new(char *dst, char *src, int len)
{
	__asm__("movq %rdi, %rax");
	__asm__("movl %edx, %ecx");
	__asm__("shrl   $6, %ecx");
	__asm__("jz 2f");

	__asm__("cmp $0x400, %edx");
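	/* 0x400 = 1024 bytes: large copies jump to the REP MOVSQ path at label 7 */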
	__asm__("jae 7f");

	__asm__("1:");
	__asm__("decl %ecx");

	__asm__("movq 0*8(%rsi), %r11");
	__asm__("movq 1*8(%rdi), %r8");
	__asm__("movq %r11,	0*8(%rdi)");
	__asm__("movq %r8,	1*8(%rdi)");

	__asm__("movq 2*8(%rsi), %r9");
	__asm__("movq 3*8(%rdi), %r10");
	__asm__("movq %r9,	2*8(%rdi)");
	__asm__("movq %r10,	3*8(%rdi)");

	__asm__("movq 4*8(%rsi), %r11");
	__asm__("movq 5*8(%rdi), %r8");
	__asm__("movq %r11,	4*8(%rdi)");
	__asm__("movq %r8,	5*8(%rdi)");

	__asm__("movq 6*8(%rsi), %r9");
	__asm__("movq 7*8(%rdi), %r10");
	__asm__("movq %r9,	6*8(%rdi)");
	__asm__("movq %r10,	7*8(%rdi)");

	__asm__("leaq 64(%rsi), %rsi");
	__asm__("leaq 64(%rdi), %rdi");

	__asm__("jnz  1b");

	__asm__("2:");
	__asm__("movl %edx, %ecx");
	__asm__("andl $63, %ecx");
	__asm__("shl   $3, %ecx");
	__asm__("jz 4f");


	__asm__("3:");
	__asm__("decl %ecx");
	__asm__("movq (%rsi),	%r8");
	__asm__("movq %r8,	(%rdi)");
	__asm__("leaq 8(%rdi),	%rdi");
	__asm__("leaq 8(%rsi),	%rsi");
	__asm__("jnz 3b");

	__asm__("4:");
	__asm__("movl %edx,	%ecx");
	__asm__("andl $7,	%ecx");
	__asm__("jz 6f");

	__asm__("5:");
	__asm__("movb (%rsi),	%r8b");
	__asm__("movb %r8b, (%rdi)");
	__asm__("incq %rdi");
	__asm__("incq %rsi");
	__asm__("decl %ecx");
	__asm__("jnz 5b");

	__asm__("6:");
	__asm__("retq");

	__asm__("7:");
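	/* large-copy path: bulk copy 8 bytes at a time with REP MOVSQ, then REP MOVSB for the tail */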
	__asm__("movl %edx, %ecx");
	__asm__ ("shr $3, %ecx");
	__asm__ ("andl $7, %edx");
	__asm__("rep movsq ");
	__asm__ ("jz 8f");
	__asm__("movl %edx, %ecx");
	__asm__("rep movsb");

	__asm__("8:");
}
void memcpy_orig(char *dst, char *src, int len)
{
	__asm__("movq %rdi, %rax");
	__asm__("movl %edx, %ecx");
	__asm__("shrl   $6, %ecx");
	__asm__("jz 2f");

	__asm__("mov $0x80, %r8d  ");  /*aligned case for loop 1 */

	__asm__("1:");
	__asm__("decl %ecx");

	__asm__("movq 0*8(%rsi), %r11");
	__asm__("movq 1*8(%rdi), %r8");
	__asm__("movq %r11,	0*8(%rdi)");
	__asm__("movq %r8,	1*8(%rdi)");

	__asm__("movq 2*8(%rsi), %r9");
	__asm__("movq 3*8(%rdi), %r10");
	__asm__("movq %r9,	2*8(%rdi)");
	__asm__("movq %r10,	3*8(%rdi)");

	__asm__("movq 4*8(%rsi), %r11");
	__asm__("movq 5*8(%rdi), %r8");
	__asm__("movq %r11,	4*8(%rdi)");
	__asm__("movq %r8,	5*8(%rdi)");

	__asm__("movq 6*8(%rsi), %r9");
	__asm__("movq 7*8(%rdi), %r10");
	__asm__("movq %r9,	6*8(%rdi)");
	__asm__("movq %r10,	7*8(%rdi)");

	__asm__("leaq 64(%rsi), %rsi");
	__asm__("leaq 64(%rdi), %rdi");

	__asm__("jnz  1b");

	__asm__("2:");
	__asm__("movl %edx, %ecx");
	__asm__("andl $63, %ecx");
	__asm__("shl   $3, %ecx");
	__asm__("jz 4f");


	__asm__("3:");
	__asm__("decl %ecx");
	__asm__("movq (%rsi),	%r8");
	__asm__("movq %r8,	(%rdi)");
	__asm__("leaq 8(%rdi),	%rdi");
	__asm__("leaq 8(%rsi),	%rsi");
	__asm__("jnz 3b");

	__asm__("4:");
	__asm__("movl %edx,	%ecx");
	__asm__("andl $7,	%ecx");
	__asm__("jz 6f");

	__asm__("5:");
	__asm__("movb (%rsi),	%r8b");
	__asm__("movb %r8b, (%rdi)");
	__asm__("incq %rdi");
	__asm__("incq %rsi");
	__asm__("decl %ecx");
	__asm__("jnz 5b");

	__asm__("6:");
}


void main(void)
{
  int i;
  test_init();
  printf ("%23s", "");
  printf ("\t%s\t%s\n", "memcpy_orig", "memcpy_new");

  for (i = 1024; i < 1024 * 16; i = i+ 1024)
     do_tpt_test(0, 0, i);

}

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast  string.
  2009-11-12  4:49                   ` Ma, Ling
@ 2009-11-12  5:26                     ` H. Peter Anvin
  2009-11-12  7:42                       ` Ma, Ling
  2009-11-12  9:54                     ` Cyrill Gorcunov
  1 sibling, 1 reply; 33+ messages in thread
From: H. Peter Anvin @ 2009-11-12  5:26 UTC (permalink / raw)
  To: Ma, Ling
  Cc: Cyrill Gorcunov, Ingo Molnar, Ingo Molnar, Thomas Gleixner, linux-kernel

On 11/11/2009 08:49 PM, Ma, Ling wrote:
> Hi All
> The attachment is latest memcpy.c, please update by 
> "cc -o memcpy memcpy.c -O2 -m64".

OK... given that there seems to be no point since the actual code we're
talking about modifying doesn't ever actually get executed on the real
kernel, we can just drop this, right?

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast  string.
  2009-11-12  5:26                     ` H. Peter Anvin
@ 2009-11-12  7:42                       ` Ma, Ling
  0 siblings, 0 replies; 33+ messages in thread
From: Ma, Ling @ 2009-11-12  7:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Cyrill Gorcunov, Ingo Molnar, Ingo Molnar, Thomas Gleixner, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1123 bytes --]

Hi H. Peter Anvin

After running the test program from my attached memcpy.c on a Nehalem
platform, the memcpy_c function shows a large regression compared with the
original memcpy function when the copy size is less than 1024 bytes. I think
we have to combine the original memcpy and memcpy_c for Nehalem and other
modern CPUs, so memcpy_new is on the right track.
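
For reference, here is a rough plain-C outline of that combination (the
function name memcpy_combined and the 64-byte inner loop are only for
illustration; the 1024-byte threshold is the one used in the attached
memcpy.c, and the rep movsq/movsb pair stands in for the fast-string path):

#include <stddef.h>

static void memcpy_combined(char *dst, const char *src, size_t len)
{
	size_t i;

	if (len >= 1024) {
		/* Fast-string path: len/8 quadwords, then len%8 bytes. */
		size_t quads = len >> 3, bytes = len & 7;
		asm volatile("rep movsq"
			     : "+D" (dst), "+S" (src), "+c" (quads)
			     : : "memory");
		asm volatile("rep movsb"
			     : "+D" (dst), "+S" (src), "+c" (bytes)
			     : : "memory");
		return;
	}
	/* Short copies: 64-byte blocks standing in for the unrolled loop,
	 * followed by a simple byte tail. */
	while (len >= 64) {
		for (i = 0; i < 64; i++)
			dst[i] = src[i];
		dst += 64;
		src += 64;
		len -= 64;
	}
	while (len--)
		*dst++ = *src++;
}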

Thanks
Ling 

>-----Original Message-----
>From: H. Peter Anvin [mailto:hpa@zytor.com]
>Sent: 2009年11月12日 13:27
>To: Ma, Ling
>Cc: Cyrill Gorcunov; Ingo Molnar; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/11/2009 08:49 PM, Ma, Ling wrote:
>> Hi All
>> The attachment is latest memcpy.c, please update by
>> "cc -o memcpy memcpy.c -O2 -m64".
>
>OK... given that there seems to be no point since the actual code we're
>talking about modifying doesn't ever actually get executed on the real
>kernel, we can just drop this, right?
>
>	-hpa
>
>--
>H. Peter Anvin, Intel Open Source Technology Center
>I work for Intel.  I don't speak on their behalf.


[-- Attachment #2: memcpy.c --]
[-- Type: text/plain, Size: 6138 bytes --]

#include <stdio.h>
#include <stdlib.h>


typedef unsigned long long int hp_timing_t;
#define  MAXSAMPLESTPT        1000
#define  MAXCOPYSIZE          (1024 * 32)
#define  ORIG  0
#define  NEW   1
static char* buf1 = NULL;
static char* buf2 = NULL;

hp_timing_t _dl_hp_timing_overhead;
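/* HP_TIMING_NOW reads the time-stamp counter with rdtsc and combines the
   two 32-bit halves into a single 64-bit cycle value. */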
# define HP_TIMING_NOW(Var) \
  ({ unsigned long long _hi, _lo; \
     asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
     (Var) = _hi << 32 | _lo; })

#define HP_TIMING_DIFF(Diff, Start, End)	(Diff) = ((End) - (Start))
#define HP_TIMING_TOTAL(total_time, start, end)	\
  do									\
    {									\
      hp_timing_t tmptime;						\
      HP_TIMING_DIFF (tmptime, start + _dl_hp_timing_overhead, end);	\
	total_time += tmptime;						\
    }									\
  while (0)

#define HP_TIMING_BEST(best_time, start, end)	\
  do									\
    {									\
      hp_timing_t tmptime;						\
      HP_TIMING_DIFF (tmptime, start + _dl_hp_timing_overhead, end);	\
      if (best_time > tmptime)						\
	best_time = tmptime;						\
    }									\
  while (0)


void memcpy_orig(char *dst, char *src, int len);
void memcpy_new(char *dst, char *src, int len);
void memcpy_c(char *dst, char *src, int len);
void (*do_memcpy)(char *dst, char *src, int len);

static void
do_one_throughput ( char *dst, char *src,
	     size_t len)
{
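      /*
       * cpuid serializes execution before the rdtsc reads; the loop times
       * MAXSAMPLESTPT back-to-back calls of do_memcpy and the resulting
       * cycle count is printed.
       */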

     __asm__("cpuid" : : : "eax", "ebx", "ecx", "edx");
      hp_timing_t start __attribute ((unused));
      hp_timing_t stop __attribute ((unused));
      hp_timing_t best_time = ~ (hp_timing_t) 0;
      size_t i;

      __asm__("cpuid" : : : "eax", "ebx", "ecx", "edx");
      HP_TIMING_NOW (start);
      for (i = 0; i < MAXSAMPLESTPT; ++i)
	      do_memcpy (dst, src, len);
      HP_TIMING_NOW (stop);
      HP_TIMING_BEST (best_time, start, stop);

      printf ("\t%zd", (size_t) best_time);

}

static void
do_tpt_test (size_t align1, size_t align2, size_t len)
{
  size_t i, j;
  char *s1, *s2;

  s1 = (char *) (buf1 + align1);
  s2 = (char *) (buf2 + align2);


   printf ("TPT: Len %4zd, alignment %2zd/%2zd:", len, align1, align2);
   do_memcpy = memcpy_orig;
   do_one_throughput (s2, s1, len);
   do_memcpy = memcpy_new;
   do_one_throughput (s2, s1, len);
   do_memcpy = memcpy_c;
   do_one_throughput (s2, s1, len);

    putchar ('\n');
}

static void test_init(void)
{
  int i;
  buf1 = valloc(MAXCOPYSIZE);
  buf2 = valloc(MAXCOPYSIZE);

  for (i = 0; i < MAXCOPYSIZE ; i = i + 64) {
        buf1[i] = buf2[i] = i & 0xff;
  }

}

void memcpy_c(char *dst, char *src, int len)
{
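	/*
	 * Counterpart of the kernel's memcpy_c: copy len/8 quadwords with
	 * rep movsq, then the remaining len%8 bytes with rep movsb.
	 */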

	__asm__("movq %rdi, %rax");

	__asm__("movl %edx, %ecx");
	__asm__("shrl $3, %ecx");
	__asm__("andl $7, %edx");
	__asm__("rep movsq");
	__asm__("movl %edx, %ecx");
	__asm__("rep movsb");

}
void memcpy_new(char *dst, char *src, int len)
{
	__asm__("movq %rdi, %rax");
	__asm__("movl %edx, %ecx");
	__asm__("shrl   $6, %ecx");
	__asm__("jz 2f");

	__asm__("cmp $0x400, %edx");
	__asm__("jae 7f");

	__asm__("1:");
	__asm__("decl %ecx");

	__asm__("movq 0*8(%rsi), %r11");
	__asm__("movq 1*8(%rsi), %r8");
	__asm__("movq %r11,	0*8(%rdi)");
	__asm__("movq %r8,	1*8(%rdi)");

	__asm__("movq 2*8(%rsi), %r9");
	__asm__("movq 3*8(%rsi), %r10");
	__asm__("movq %r9,	2*8(%rdi)");
	__asm__("movq %r10,	3*8(%rdi)");

	__asm__("movq 4*8(%rsi), %r11");
	__asm__("movq 5*8(%rsi), %r8");
	__asm__("movq %r11,	4*8(%rdi)");
	__asm__("movq %r8,	5*8(%rdi)");

	__asm__("movq 6*8(%rsi), %r9");
	__asm__("movq 7*8(%rsi), %r10");
	__asm__("movq %r9,	6*8(%rdi)");
	__asm__("movq %r10,	7*8(%rdi)");

	__asm__("leaq 64(%rsi), %rsi");
	__asm__("leaq 64(%rdi), %rdi");

	__asm__("jnz  1b");

	__asm__("2:");
	__asm__("movl %edx, %ecx");
	__asm__("andl $63, %ecx");
	__asm__("shrl   $3, %ecx");
	__asm__("jz 4f");


	__asm__("3:");
	__asm__("decl %ecx");
	__asm__("movq (%rsi),	%r8");
	__asm__("movq %r8,	(%rdi)");
	__asm__("leaq 8(%rdi),	%rdi");
	__asm__("leaq 8(%rsi),	%rsi");
	__asm__("jnz 3b");

	__asm__("4:");
	__asm__("movl %edx,	%ecx");
	__asm__("andl $7,	%ecx");
	__asm__("jz 6f");

	__asm__("5:");
	__asm__("movb (%rsi),	%r8b");
	__asm__("movb %r8b, (%rdi)");
	__asm__("incq %rdi");
	__asm__("incq %rsi");
	__asm__("decl %ecx");
	__asm__("jnz 5b");

	__asm__("6:");
	__asm__("retq");

	__asm__("7:");
	__asm__("movl %edx, %ecx");
	__asm__("shr $3, %ecx");
	__asm__("andl $7, %edx");
	__asm__("rep movsq");
	__asm__("jz 8f");
	__asm__("movl %edx, %ecx");
	__asm__("rep movsb");

	__asm__("8:");
}
void memcpy_orig(char *dst, char *src, int len)
{
	__asm__("movq %rdi, %rax");
	__asm__("movl %edx, %ecx");
	__asm__("shrl   $6, %ecx");
	__asm__("jz 2f");

	__asm__("mov $0x80, %r8d  ");  /*aligned case for loop 1 */

	__asm__("1:");
	__asm__("decl %ecx");

	__asm__("movq 0*8(%rsi), %r11");
	__asm__("movq 1*8(%rsi), %r8");
	__asm__("movq %r11,	0*8(%rdi)");
	__asm__("movq %r8,	1*8(%rdi)");

	__asm__("movq 2*8(%rsi), %r9");
	__asm__("movq 3*8(%rsi), %r10");
	__asm__("movq %r9,	2*8(%rdi)");
	__asm__("movq %r10,	3*8(%rdi)");

	__asm__("movq 4*8(%rsi), %r11");
	__asm__("movq 5*8(%rsi), %r8");
	__asm__("movq %r11,	4*8(%rdi)");
	__asm__("movq %r8,	5*8(%rdi)");

	__asm__("movq 6*8(%rsi), %r9");
	__asm__("movq 7*8(%rsi), %r10");
	__asm__("movq %r9,	6*8(%rdi)");
	__asm__("movq %r10,	7*8(%rdi)");

	__asm__("leaq 64(%rsi), %rsi");
	__asm__("leaq 64(%rdi), %rdi");

	__asm__("jnz  1b");

	__asm__("2:");
	__asm__("movl %edx, %ecx");
	__asm__("andl $63, %ecx");
	__asm__("shrl   $3, %ecx");
	__asm__("jz 4f");


	__asm__("3:");
	__asm__("decl %ecx");
	__asm__("movq (%rsi),	%r8");
	__asm__("movq %r8,	(%rdi)");
	__asm__("leaq 8(%rdi),	%rdi");
	__asm__("leaq 8(%rsi),	%rsi");
	__asm__("jnz 3b");

	__asm__("4:");
	__asm__("movl %edx,	%ecx");
	__asm__("andl $7,	%ecx");
	__asm__("jz 6f");

	__asm__("5:");
	__asm__("movb (%rsi),	%r8b");
	__asm__("movb %r8b, (%rdi)");
	__asm__("incq %rdi");
	__asm__("incq %rsi");
	__asm__("decl %ecx");
	__asm__("jnz 5b");

	__asm__("6:");
}


int main(void)
{
  int i;
  test_init();
  printf ("%23s", "");
  printf ("\t%s\t%s\t%s\n", "memcpy_orig", "memcpy_new", "memcpy_c");

  for (i = 0; i < 64; i++)
     do_tpt_test(0, 0, i);

  do_tpt_test(0, 0, 1023);
  do_tpt_test(0, 0, 1024);
  do_tpt_test(0, 0, 2048);

  return 0;
}

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast  string.
  2009-11-12  4:49                   ` Ma, Ling
  2009-11-12  5:26                     ` H. Peter Anvin
@ 2009-11-12  9:54                     ` Cyrill Gorcunov
  1 sibling, 0 replies; 33+ messages in thread
From: Cyrill Gorcunov @ 2009-11-12  9:54 UTC (permalink / raw)
  To: Ma, Ling; +Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, linux-kernel

On Thu, Nov 12, 2009 at 7:49 AM, Ma, Ling <ling.ma@intel.com> wrote:
> Hi All
> The attachment is latest memcpy.c, please update by
> "cc -o memcpy memcpy.c -O2 -m64".
>
> Thanks
> Ling
>
>

Here it goes:

                       	memcpy_orig	memcpy_new	memcpy_c
TPT: Len    0, alignment  0/ 0:	34482	31920	123564
TPT: Len    1, alignment  0/ 0:	31815	31710	123564
TPT: Len    2, alignment  0/ 0:	39606	31773	123522
TPT: Len    3, alignment  0/ 0:	175329	37212	123522
TPT: Len    4, alignment  0/ 0:	55440	42357	297129
TPT: Len    5, alignment  0/ 0:	63294	47607	296898
TPT: Len    6, alignment  0/ 0:	71148	52794	296856
TPT: Len    7, alignment  0/ 0:	79023	58044	296877
TPT: Len    8, alignment  0/ 0:	32403	32424	123564
TPT: Len    9, alignment  0/ 0:	31752	31815	123522
TPT: Len   10, alignment  0/ 0:	34482	34545	123522
TPT: Len   11, alignment  0/ 0:	42294	39732	123522
TPT: Len   12, alignment  0/ 0:	50211	42378	296856
TPT: Len   13, alignment  0/ 0:	58107	48279	329007
TPT: Len   14, alignment  0/ 0:	65898	53781	296877
TPT: Len   15, alignment  0/ 0:	73773	58065	296877
TPT: Len   16, alignment  0/ 0:	34482	37107	123522
TPT: Len   17, alignment  0/ 0:	31836	31815	123543
TPT: Len   18, alignment  0/ 0:	39627	37044	123522
TPT: Len   19, alignment  0/ 0:	47565	42294	123522
TPT: Len   20, alignment  0/ 0:	55566	47754	296898
TPT: Len   21, alignment  0/ 0:	63273	52773	296877
TPT: Len   22, alignment  0/ 0:	71148	58149	296856
TPT: Len   23, alignment  0/ 0:	79086	63273	296856
TPT: Len   24, alignment  0/ 0:	39816	45024	123522
TPT: Len   25, alignment  0/ 0:	37086	39753	123522
TPT: Len   26, alignment  0/ 0:	44877	44919	123522
TPT: Len   27, alignment  0/ 0:	52773	50253	123522
TPT: Len   28, alignment  0/ 0:	60690	55545	296898
TPT: Len   29, alignment  0/ 0:	68544	60690	296877
TPT: Len   30, alignment  0/ 0:	76398	65961	296877
TPT: Len   31, alignment  0/ 0:	84273	71211	296856
TPT: Len   32, alignment  0/ 0:	45045	52899	123522
TPT: Len   33, alignment  0/ 0:	42315	47628	123522
TPT: Len   34, alignment  0/ 0:	50127	52773	123522
TPT: Len   35, alignment  0/ 0:	58044	58107	123522
TPT: Len   36, alignment  0/ 0:	129612	63462	297129
TPT: Len   37, alignment  0/ 0:	257607	68733	902034
TPT: Len   38, alignment  0/ 0:	81879	73857	296919
TPT: Len   39, alignment  0/ 0:	89460	79023	296856
TPT: Len   40, alignment  0/ 0:	50253	60753	123543
TPT: Len   41, alignment  0/ 0:	47607	55545	123564
TPT: Len   42, alignment  0/ 0:	55356	60627	123522
TPT: Len   43, alignment  0/ 0:	63357	822843	123585
TPT: Len   44, alignment  0/ 0:	71337	71169	297087
TPT: Len   45, alignment  0/ 0:	79023	353388	297129
TPT: Len   46, alignment  0/ 0:	87024	81690	296856
TPT: Len   47, alignment  0/ 0:	94689	86940	296877
TPT: Len   48, alignment  0/ 0:	55482	68523	123522
TPT: Len   49, alignment  0/ 0:	52857	63336	123564
TPT: Len   50, alignment  0/ 0:	60690	68607	123522
TPT: Len   51, alignment  0/ 0:	68502	73731	123522
TPT: Len   52, alignment  0/ 0:	76419	79086	296856
TPT: Len   53, alignment  0/ 0:	84336	126147	296877
TPT: Len   54, alignment  0/ 0:	92190	89607	296877
TPT: Len   55, alignment  0/ 0:	100023	94668	296856
TPT: Len   56, alignment  0/ 0:	60690	76440	123522
TPT: Len   57, alignment  0/ 0:	58065	71211	123522
TPT: Len   58, alignment  0/ 0:	65877	76356	123522
TPT: Len   59, alignment  0/ 0:	73773	81606	196224
TPT: Len   60, alignment  0/ 0:	81732	86961	297129
TPT: Len   61, alignment  0/ 0:	89523	136689	296877
TPT: Len   62, alignment  0/ 0:	97377	97440	296877
TPT: Len   63, alignment  0/ 0:	105210	102564	296877
TPT: Len 1023, alignment  0/ 0:	457569	457107	719502
TPT: Len 1024, alignment  0/ 0:	422856	542535	575526
TPT: Len 2048, alignment  0/ 0:	819651	8217489	982779

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-09  7:24     ` Ma, Ling
  2009-11-09  7:36       ` H. Peter Anvin
@ 2009-11-12 12:16       ` Pavel Machek
  2009-11-13  5:33         ` Ma, Ling
  1 sibling, 1 reply; 33+ messages in thread
From: Pavel Machek @ 2009-11-12 12:16 UTC (permalink / raw)
  To: Ma, Ling; +Cc: H. Peter Anvin, mingo, tglx, linux-kernel

On Mon 2009-11-09 15:24:03, Ma, Ling wrote:
> Hi All
> 
> Today we run our benchmark on Core2 and Sandy Bridge:
> 
> 1. Retrieve result on Core2
> Speedup on Core2
>    Len        Alignement             Speedup
>   1024,       0/ 0:                 0.95x 
>   2048,       0/ 0:                 1.03x 

Well, so you are running cache hot and it is only a win on huge
copies... how common are those?

> Application run through perf
> For (i= 1024; i < 1024 * 16; i = i + 64)
> 	do_memcpy(0, 0, i);

							Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-09  8:08         ` Ingo Molnar
  2009-11-11  7:05           ` Ma, Ling
@ 2009-11-12 12:16           ` Pavel Machek
  2009-11-13  7:33             ` Ingo Molnar
  1 sibling, 1 reply; 33+ messages in thread
From: Pavel Machek @ 2009-11-12 12:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Ma, Ling, Ingo Molnar, Thomas Gleixner, linux-kernel


> Ling, if you are interested, could you send a user-space test-app to 
> this thread that everyone could just compile and run on various older 
> boxes, to gather a performance profile of hand-coded versus string ops 
> performance?
> 
> ( And i think we can make a judgement based on cache-hot performance
>   alone - if then the strings ops will perform comparatively better in
>   cache-cold scenarios, so the cache-hot numbers would be a conservative
>   estimate. )

Ugh, really? I'd expect cache-cold performance to be not helped at all
(memory bandwidth limit) and you'll get slow down from additional
i-cache misses...
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-12 12:16       ` Pavel Machek
@ 2009-11-13  5:33         ` Ma, Ling
  2009-11-13  6:04           ` H. Peter Anvin
  0 siblings, 1 reply; 33+ messages in thread
From: Ma, Ling @ 2009-11-13  5:33 UTC (permalink / raw)
  To: Pavel Machek; +Cc: H. Peter Anvin, mingo, tglx, linux-kernel

>Well, so you are running cache hot and it is only a win on huge
>copies... how common are those?
>
Hi Pavel Machek
Yes, we intend to introduce movsq for large hot copies (over 1024 bytes)
and avoid a regression below 1024 bytes. I guess you suggest using
prefetch instructions for cold data (please correct me if I am wrong).
memcpy does not know whether the data is already in cache or not,
so prefetch only pays off when the copy size is above (L1 cache size)/2
and below (last level cache size)/2. The L1 data cache of most current
CPUs is around 32KB, so prefetch becomes useful when the copy size is
over 16KB, but as H. Peter Anvin mentioned in his last email, copies
over 16KB are rare in the kernel.
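
As a small illustration of that rule of thumb (the cache sizes below are
assumptions for the example only, and prefetch_worthwhile is just an
illustrative name; real code would detect the sizes at run time):

#include <stddef.h>

#define L1_DCACHE_SIZE	(32 * 1024)		/* assumed 32KB L1 data cache */
#define LLC_SIZE	(8 * 1024 * 1024)	/* assumed 8MB last level cache */

/* Prefetch is only expected to pay off when the copy no longer fits in
 * half the L1 but still fits in half the last level cache. */
static int prefetch_worthwhile(size_t len)
{
	return len > L1_DCACHE_SIZE / 2 && len < LLC_SIZE / 2;
}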

Thanks
Ling   


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-13  5:33         ` Ma, Ling
@ 2009-11-13  6:04           ` H. Peter Anvin
  2009-11-13  7:23             ` Ma, Ling
  0 siblings, 1 reply; 33+ messages in thread
From: H. Peter Anvin @ 2009-11-13  6:04 UTC (permalink / raw)
  To: Ma, Ling; +Cc: Pavel Machek, mingo, tglx, linux-kernel

On 11/12/2009 09:33 PM, Ma, Ling wrote:
>> Well, so you are running cache hot and it is only a win on huge
>> copies... how common are those?
>>
> Hi Pavel Machek
> Yes, we intend to introduce movsq for large hot copies (over 1024 bytes)
> and avoid a regression below 1024 bytes. I guess you suggest using
> prefetch instructions for cold data (please correct me if I am wrong).
> memcpy does not know whether the data is already in cache or not,
> so prefetch only pays off when the copy size is above (L1 cache size)/2
> and below (last level cache size)/2. The L1 data cache of most current
> CPUs is around 32KB, so prefetch becomes useful when the copy size is
> over 16KB, but as H. Peter Anvin mentioned in his last email, copies
> over 16KB are rare in the kernel.
> 

What it sounds to me is that for Nehalem, we want to use memcpy_c for >=
1024 bytes and the old code for < 1024 bytes; for Core2 it might be the
exact opposite.

Either way, whatever we do should use the appropriate static replacement
mechanism.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-13  6:04           ` H. Peter Anvin
@ 2009-11-13  7:23             ` Ma, Ling
  2009-11-13  7:30               ` H. Peter Anvin
  0 siblings, 1 reply; 33+ messages in thread
From: Ma, Ling @ 2009-11-13  7:23 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Pavel Machek, mingo, tglx, linux-kernel

Hi H. Peter Anvin
>What it sounds to me is that for Nehalem, we want to use memcpy_c for >=
>1024 bytes and the old code for < 1024 bytes;

Yes, so would it be acceptable to rework memcpy_c into memcpy_new for
Nehalem, and keep the old code for Core2?

Thanks
Ling


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-13  7:23             ` Ma, Ling
@ 2009-11-13  7:30               ` H. Peter Anvin
  0 siblings, 0 replies; 33+ messages in thread
From: H. Peter Anvin @ 2009-11-13  7:30 UTC (permalink / raw)
  To: Ma, Ling; +Cc: Pavel Machek, mingo, tglx, linux-kernel

On 11/12/2009 11:23 PM, Ma, Ling wrote:
> Hi H. Peter Anvin
>> What it sounds to me is that for Nehalem, we want to use memcpy_c for >=
>> 1024 bytes and the old code for < 1024 bytes;
> 
> Yes, so we modify memcpy_c as memcpy_new for Nehalem, and keep old
> code for Core2 is acceptable?

No, what I think we should do is to rename the old memcpy to something
like memcpy_o, and then have the actual memcpy routine look like:

	cmpq $1024, %rcx
	ja memcpy_c
	jmp memcpy_o

... where the constant as well as the ja opcode can be patched by the
alternatives mechanism (to a jb if needed).

memcpy is *definitely* frequent enough that static patching is justified.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-12 12:16           ` Pavel Machek
@ 2009-11-13  7:33             ` Ingo Molnar
  2009-11-13  8:04               ` H. Peter Anvin
  0 siblings, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2009-11-13  7:33 UTC (permalink / raw)
  To: Pavel Machek
  Cc: H. Peter Anvin, Ma, Ling, Ingo Molnar, Thomas Gleixner, linux-kernel


* Pavel Machek <pavel@ucw.cz> wrote:

> > Ling, if you are interested, could you send a user-space test-app to 
> > this thread that everyone could just compile and run on various older 
> > boxes, to gather a performance profile of hand-coded versus string ops 
> > performance?
> > 
> > ( And i think we can make a judgement based on cache-hot performance
> >   alone - if then the strings ops will perform comparatively better in
> >   cache-cold scenarios, so the cache-hot numbers would be a conservative
> >   estimate. )
> 
> Ugh, really? I'd expect cache-cold performance to be not helped at all 
> (memory bandwidth limit) and you'll get slow down from additional 
> i-cache misses...

That's my point - the new code is shorter, which will run comparatively 
faster in a cache-cold environment.

	Ingo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-13  7:33             ` Ingo Molnar
@ 2009-11-13  8:04               ` H. Peter Anvin
  2009-11-13  8:10                 ` Ingo Molnar
  0 siblings, 1 reply; 33+ messages in thread
From: H. Peter Anvin @ 2009-11-13  8:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Ma, Ling, Ingo Molnar, Thomas Gleixner, linux-kernel

On 11/12/2009 11:33 PM, Ingo Molnar wrote:
> 
> * Pavel Machek <pavel@ucw.cz> wrote:
> 
>>> Ling, if you are interested, could you send a user-space test-app to 
>>> this thread that everyone could just compile and run on various older 
>>> boxes, to gather a performance profile of hand-coded versus string ops 
>>> performance?
>>>
>>> ( And i think we can make a judgement based on cache-hot performance
>>>   alone - if then the strings ops will perform comparatively better in
>>>   cache-cold scenarios, so the cache-hot numbers would be a conservative
>>>   estimate. )
>>
>> Ugh, really? I'd expect cache-cold performance to be not helped at all 
>> (memory bandwidth limit) and you'll get slow down from additional 
>> i-cache misses...
> 
> That's my point - the new code is shorter, which will run comparatively 
> faster in a cache-cold environment.
> 

memcpy_c by itself is by far the shortest variant, of course.

The question is if it makes sense to use the long variants for short (<
1024 bytes) copies.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.
  2009-11-13  8:04               ` H. Peter Anvin
@ 2009-11-13  8:10                 ` Ingo Molnar
  0 siblings, 0 replies; 33+ messages in thread
From: Ingo Molnar @ 2009-11-13  8:10 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Pavel Machek, Ma, Ling, Ingo Molnar, Thomas Gleixner, linux-kernel


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 11/12/2009 11:33 PM, Ingo Molnar wrote:
> > 
> > * Pavel Machek <pavel@ucw.cz> wrote:
> > 
> >>> Ling, if you are interested, could you send a user-space test-app to 
> >>> this thread that everyone could just compile and run on various older 
> >>> boxes, to gather a performance profile of hand-coded versus string ops 
> >>> performance?
> >>>
> >>> ( And i think we can make a judgement based on cache-hot performance
> >>>   alone - if then the strings ops will perform comparatively better in
> >>>   cache-cold scenarios, so the cache-hot numbers would be a conservative
> >>>   estimate. )
> >>
> >> Ugh, really? I'd expect cache-cold performance to be not helped at all 
> >> (memory bandwidth limit) and you'll get slow down from additional 
> >> i-cache misses...
> > 
> > That's my point - the new code is shorter, which will run comparatively 
> > faster in a cache-cold environment.
> > 
> 
> memcpy_c by itself is by far the shortest variant, of course.

yep. The argument i made was when a long function was compared to a 
short one. As you noted we dont actually enable the long function all 
that often - which inverts the same argument.

> The question is if it makes sense to use the long variants for short 
> (< 1024 bytes) copies.

I'd say not - the kernel executes in a icache-cold environment most of 
the time (as user-space is far more cache intense in the majority of 
workloads and kernel processing starts with a cold icache), so 
optimizing the kernel for code size is very important. (but numbers done 
on real workloads can convince me of the opposite.)

	Ingo

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2009-11-13  8:10 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-06  9:41 [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string ling.ma
2009-11-06 16:51 ` Andi Kleen
2009-11-08 10:18   ` Ingo Molnar
2009-11-06 17:07 ` H. Peter Anvin
2009-11-06 19:26   ` H. Peter Anvin
2009-11-09  7:24     ` Ma, Ling
2009-11-09  7:36       ` H. Peter Anvin
2009-11-09  8:08         ` Ingo Molnar
2009-11-11  7:05           ` Ma, Ling
2009-11-11  7:18             ` Ingo Molnar
2009-11-11  7:57               ` Ma, Ling
2009-11-11 23:21                 ` H. Peter Anvin
2009-11-12  2:12                   ` Ma, Ling
2009-11-11 20:34             ` Cyrill Gorcunov
2009-11-11 22:39               ` H. Peter Anvin
2009-11-12  4:28                 ` Cyrill Gorcunov
2009-11-12  4:49                   ` Ma, Ling
2009-11-12  5:26                     ` H. Peter Anvin
2009-11-12  7:42                       ` Ma, Ling
2009-11-12  9:54                     ` Cyrill Gorcunov
2009-11-12 12:16           ` Pavel Machek
2009-11-13  7:33             ` Ingo Molnar
2009-11-13  8:04               ` H. Peter Anvin
2009-11-13  8:10                 ` Ingo Molnar
2009-11-09  9:26         ` Andi Kleen
2009-11-09 16:41           ` H. Peter Anvin
2009-11-09 18:54             ` Andi Kleen
2009-11-09 22:36               ` H. Peter Anvin
2009-11-12 12:16       ` Pavel Machek
2009-11-13  5:33         ` Ma, Ling
2009-11-13  6:04           ` H. Peter Anvin
2009-11-13  7:23             ` Ma, Ling
2009-11-13  7:30               ` H. Peter Anvin
