All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Ma, Ling" <ling.ma@intel.com>
To: Cyrill Gorcunov <gorcunov@gmail.com>, "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>, Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast 	string.
Date: Thu, 12 Nov 2009 12:49:04 +0800	[thread overview]
Message-ID: <8FED46E8A9CA574792FC7AACAC38FE7714FE8307DB@PDSMSX501.ccr.corp.intel.com> (raw)
In-Reply-To: <aa79d98a0911112028nf3fc475r30aa8dc37936ea22@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2814 bytes --]

Hi All
The attachment is latest memcpy.c, please update by 
"cc -o memcpy memcpy.c -O2 -m64".

Thanks
Ling


>-----Original Message-----
>From: Cyrill Gorcunov [mailto:gorcunov@gmail.com]
>Sent: 2009年11月12日 12:28
>To: H. Peter Anvin
>Cc: Ma, Ling; Ingo Molnar; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On Thu, Nov 12, 2009 at 1:39 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 11/11/2009 12:34 PM, Cyrill Gorcunov wrote:
>>>                                               memcpy_orig     memc
>py_new
>>> TPT: Len 1024, alignment  8/ 0:               490             570
>>> TPT: Len 2048, alignment  8/ 0:               826             329
>>> TPT: Len 3072, alignment  8/ 0:               441             464
>>> TPT: Len 4096, alignment  8/ 0:               579             596
>>> TPT: Len 5120, alignment  8/ 0:               723             729
>>> TPT: Len 6144, alignment  8/ 0:               859             861
>>> TPT: Len 7168, alignment  8/ 0:               996             994
>>> TPT: Len 8192, alignment  8/ 0:               1165            1127
>>> TPT: Len 9216, alignment  8/ 0:               1273            1260
>>> TPT: Len 10240, alignment  8/ 0:      1402            1395
>>> TPT: Len 11264, alignment  8/ 0:      1543            1525
>>> TPT: Len 12288, alignment  8/ 0:      1682            1659
>>> TPT: Len 13312, alignment  8/ 0:      1869            1815
>>> TPT: Len 14336, alignment  8/ 0:      1982            1951
>>> TPT: Len 15360, alignment  8/ 0:      2185            2110
>>>
>>> I've run this test a few times and results almost the same,
>>> with alignment 1024, 3072, 4096, 5120, 6144, new version a bit slowly.
>>>
>>
>> Was the result for 2048 consistent (it seems odd in the extreme)... the
>> discrepancy between this result and Ling's results bothers me; perhaps
>> the right answer is to leave the current code for Core2 and use new code
>> (with a lower than 1024 threshold?) for NHM and K8?
>>
>>        -hpa
>>
>
>Hi Peter,
>
>no, results for 2048 is not repeatable (that is why I didn't mention this number
>in a former report).
>
>Test1:
>TPT: Len 2048, alignment  8/ 0:	826	329
>Test2:
>TPT: Len 2048, alignment  8/ 0:	359	329
>Test3:
>TPT: Len 2048, alignment  8/ 0:	306	331
>Test4:
>TPT: Len 2048, alignment  8/ 0:	415	329
>
>I guess this was due to cpu frequency change from 800 to 2.1Ghz since
>I did tests manually
>not using any kind of bash cycle to run the test program.

[-- Attachment #2: memcpy.c --]
[-- Type: text/plain, Size: 5495 bytes --]

#include<stdio.h>
#include <stdlib.h>


typedef unsigned long long int hp_timing_t;
#define  MAXSAMPLESTPT        100000
#define  MAXCOPYSIZE          (1024 * 32)
#define  ORIG  0
#define  NEW   1
static char* buf1 = NULL;
static char* buf2 = NULL;

hp_timing_t _dl_hp_timing_overhead;
# define HP_TIMING_NOW(Var) \
  ({ unsigned long long _hi, _lo; \
     asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
     (Var) = _hi << 32 | _lo; })

#define HP_TIMING_DIFF(Diff, Start, End)	(Diff) = ((End) - (Start))
#define HP_TIMING_TOTAL(total_time, start, end)	\
  do									\
    {									\
      hp_timing_t tmptime;						\
      HP_TIMING_DIFF (tmptime, start + _dl_hp_timing_overhead, end);	\
	total_time += tmptime;						\
    }									\
  while (0)

void memcpy_orig(char *dst, char *src, int len);
void memcpy_new(char *dst, char *src, int len);
void memcpy_c(char *dst, char *src, int len);
void (*do_memcpy)(char *dst, char *src, int len);

static void
do_one_throughput ( char *dst, char *src,
	     size_t len)
{
      __asm__("cpuid" : : : "eax", "ebx", "ecx", "edx");
      size_t i;
      hp_timing_t start __attribute ((unused));
      hp_timing_t stop __attribute ((unused));
      hp_timing_t total_time =  (hp_timing_t) 0;

      __asm__("cpuid" : : : "eax", "ebx", "ecx", "edx");
      for (i = 0; i < MAXSAMPLESTPT; ++i)  {
          HP_TIMING_NOW (start);
		do_memcpy(buf1, buf2, len);
	  HP_TIMING_NOW (stop);
	  HP_TIMING_TOTAL (total_time, start, stop);
      }

      printf ("\t%zd", (size_t) total_time/MAXSAMPLESTPT);

}

static void
do_tpt_test (size_t align1, size_t align2, size_t len)
{
  size_t i, j;
  char *s1, *s2;

  s1 = (char *) (buf1 + align1);
  s2 = (char *) (buf2 + align2);


   printf ("TPT: Len %4zd, alignment %2zd/%2zd:", len, align1, align2);
   do_memcpy = memcpy_orig;
   do_one_throughput (s2, s1, len);
   do_memcpy = memcpy_new;
   do_one_throughput (s2, s1, len);

    putchar ('\n');
}

static test_init(void)
{
  int i;
  buf1 = valloc(MAXCOPYSIZE);
  buf2 = valloc(MAXCOPYSIZE);

  for (i = 0; i < MAXCOPYSIZE ; i = i + 64) {
        buf1[i] = buf2[i] = i & 0xff;
  }

}

void memcpy_new(char *dst, char *src, int len)
{
	__asm__("movq %rdi, %rax");
	__asm__("movl %edx, %ecx");
	__asm__("shrl   $6, %ecx");
	__asm__("jz 2f");

	__asm__("cmp $0x400, %edx");
	__asm__("jae 7f");

	__asm__("1:");
	__asm__("decl %ecx");

	__asm__("movq 0*8(%rsi), %r11");
	__asm__("movq 1*8(%rdi), %r8");
	__asm__("movq %r11,	0*8(%rdi)");
	__asm__("movq %r8,	1*8(%rdi)");

	__asm__("movq 2*8(%rsi), %r9");
	__asm__("movq 3*8(%rdi), %r10");
	__asm__("movq %r9,	2*8(%rdi)");
	__asm__("movq %r10,	3*8(%rdi)");

	__asm__("movq 4*8(%rsi), %r11");
	__asm__("movq 5*8(%rdi), %r8");
	__asm__("movq %r11,	4*8(%rdi)");
	__asm__("movq %r8,	5*8(%rdi)");

	__asm__("movq 6*8(%rsi), %r9");
	__asm__("movq 7*8(%rdi), %r10");
	__asm__("movq %r9,	6*8(%rdi)");
	__asm__("movq %r10,	7*8(%rdi)");

	__asm__("leaq 64(%rsi), %rsi");
	__asm__("leaq 64(%rdi), %rdi");

	__asm__("jnz  1b");

	__asm__("2:");
	__asm__("movl %edx, %ecx");
	__asm__("andl $63, %ecx");
	__asm__("shl   $3, %ecx");
	__asm__("jz 4f");


	__asm__("3:");
	__asm__("decl %ecx");
	__asm__("movq (%rsi),	%r8");
	__asm__("movq %r8,	(%rdi)");
	__asm__("leaq 8(%rdi),	%rdi");
	__asm__("leaq 8(%rsi),	%rsi");
	__asm__("jnz 3b");

	__asm__("4:");
	__asm__("movl %edx,	%ecx");
	__asm__("andl $7,	%ecx");
	__asm__("jz 6f");

	__asm__("5:");
	__asm__("movb (%rsi),	%r8b");
	__asm__("movb %r8b, (%rdi)");
	__asm__("incq %rdi");
	__asm__("incq %rsi");
	__asm__("decl %ecx");
	__asm__("jnz 5b");

	__asm__("6:");
	__asm__("retq");

	__asm__("7:");
	__asm__("movl %edx, %ecx");
	__asm__ ("shr $3, %ecx");
	__asm__ ("andl $7, %edx");
	__asm__("rep movsq ");
	__asm__ ("jz 8f");
	__asm__("movl %edx, %ecx");
	__asm__("rep movsb");

	__asm__("8:");
}
void memcpy_orig(char *dst, char *src, int len)
{
	__asm__("movq %rdi, %rax");
	__asm__("movl %edx, %ecx");
	__asm__("shrl   $6, %ecx");
	__asm__("jz 2f");

	__asm__("mov $0x80, %r8d  ");  /*aligned case for loop 1 */

	__asm__("1:");
	__asm__("decl %ecx");

	__asm__("movq 0*8(%rsi), %r11");
	__asm__("movq 1*8(%rdi), %r8");
	__asm__("movq %r11,	0*8(%rdi)");
	__asm__("movq %r8,	1*8(%rdi)");

	__asm__("movq 2*8(%rsi), %r9");
	__asm__("movq 3*8(%rdi), %r10");
	__asm__("movq %r9,	2*8(%rdi)");
	__asm__("movq %r10,	3*8(%rdi)");

	__asm__("movq 4*8(%rsi), %r11");
	__asm__("movq 5*8(%rdi), %r8");
	__asm__("movq %r11,	4*8(%rdi)");
	__asm__("movq %r8,	5*8(%rdi)");

	__asm__("movq 6*8(%rsi), %r9");
	__asm__("movq 7*8(%rdi), %r10");
	__asm__("movq %r9,	6*8(%rdi)");
	__asm__("movq %r10,	7*8(%rdi)");

	__asm__("leaq 64(%rsi), %rsi");
	__asm__("leaq 64(%rdi), %rdi");

	__asm__("jnz  1b");

	__asm__("2:");
	__asm__("movl %edx, %ecx");
	__asm__("andl $63, %ecx");
	__asm__("shl   $3, %ecx");
	__asm__("jz 4f");


	__asm__("3:");
	__asm__("decl %ecx");
	__asm__("movq (%rsi),	%r8");
	__asm__("movq %r8,	(%rdi)");
	__asm__("leaq 8(%rdi),	%rdi");
	__asm__("leaq 8(%rsi),	%rsi");
	__asm__("jnz 3b");

	__asm__("4:");
	__asm__("movl %edx,	%ecx");
	__asm__("andl $7,	%ecx");
	__asm__("jz 6f");

	__asm__("5:");
	__asm__("movb (%rsi),	%r8b");
	__asm__("movb %r8b, (%rdi)");
	__asm__("incq %rdi");
	__asm__("incq %rsi");
	__asm__("decl %ecx");
	__asm__("jnz 5b");

	__asm__("6:");
}


void main(void)
{
  int i;
  test_init();
  printf ("%23s", "");
  printf ("\t%s\t%s\t%s\n", "memcpy_orig", "memcpy_new");

  for (i = 1024; i < 1024 * 16; i = i+ 1024)
     do_tpt_test(0, 0, i);

}

  reply	other threads:[~2009-11-12  4:49 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-11-06  9:41 [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string ling.ma
2009-11-06 16:51 ` Andi Kleen
2009-11-08 10:18   ` Ingo Molnar
2009-11-06 17:07 ` H. Peter Anvin
2009-11-06 19:26   ` H. Peter Anvin
2009-11-09  7:24     ` Ma, Ling
2009-11-09  7:36       ` H. Peter Anvin
2009-11-09  8:08         ` Ingo Molnar
2009-11-11  7:05           ` Ma, Ling
2009-11-11  7:18             ` Ingo Molnar
2009-11-11  7:57               ` Ma, Ling
2009-11-11 23:21                 ` H. Peter Anvin
2009-11-12  2:12                   ` Ma, Ling
2009-11-11 20:34             ` Cyrill Gorcunov
2009-11-11 22:39               ` H. Peter Anvin
2009-11-12  4:28                 ` Cyrill Gorcunov
2009-11-12  4:49                   ` Ma, Ling [this message]
2009-11-12  5:26                     ` H. Peter Anvin
2009-11-12  7:42                       ` Ma, Ling
2009-11-12  9:54                     ` Cyrill Gorcunov
2009-11-12 12:16           ` Pavel Machek
2009-11-13  7:33             ` Ingo Molnar
2009-11-13  8:04               ` H. Peter Anvin
2009-11-13  8:10                 ` Ingo Molnar
2009-11-09  9:26         ` Andi Kleen
2009-11-09 16:41           ` H. Peter Anvin
2009-11-09 18:54             ` Andi Kleen
2009-11-09 22:36               ` H. Peter Anvin
2009-11-12 12:16       ` Pavel Machek
2009-11-13  5:33         ` Ma, Ling
2009-11-13  6:04           ` H. Peter Anvin
2009-11-13  7:23             ` Ma, Ling
2009-11-13  7:30               ` H. Peter Anvin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8FED46E8A9CA574792FC7AACAC38FE7714FE8307DB@PDSMSX501.ccr.corp.intel.com \
    --to=ling.ma@intel.com \
    --cc=gorcunov@gmail.com \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.