All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] perf bench: port memcpy_64.S to perf bench
@ 2010-10-29 16:01 Hitoshi Mitake
  2010-10-29 16:01 ` [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy Hitoshi Mitake
  2010-10-29 19:49 ` [PATCH 1/2] perf bench: port memcpy_64.S to perf bench Peter Zijlstra
  0 siblings, 2 replies; 30+ messages in thread
From: Hitoshi Mitake @ 2010-10-29 16:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mitake, h.mitake, Ma Ling:,
	Zhao Yakui, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Paul Mackerras, Frederic Weisbecker, Steven Rostedt,
	Thomas Gleixner, H. Peter Anvin

This patch ports arch/x86/lib/memcpy_64.S to "perf bench mem".
When PERF_BENCH is defined at preprocessor level,
memcpy_64.S is preprocessed to includable form from the sources
under tools/perf for benchmarking programs.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Ma Ling: <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
---
 arch/x86/lib/memcpy_64.S |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 75ef61e..72c6dfe 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -1,10 +1,23 @@
 /* Copyright 2002 Andi Kleen */
 
+/*
+ * perf bench adoption by Hitoshi Mitake
+ * PERF_BENCH means that this file is included from
+ * the source files under tools/perf/ for benchmark programs.
+ *
+ * You don't have to care about PERF_BENCH when
+ * you are working on the kernel.
+ */
+
+#ifndef PERF_BENCH
+
 #include <linux/linkage.h>
 
 #include <asm/cpufeature.h>
 #include <asm/dwarf2.h>
 
+#endif /* PERF_BENCH */
+
 /*
  * memcpy - Copy a memory block.
  *
@@ -23,8 +36,13 @@
  * This gets patched over the unrolled variant (below) via the
  * alternative instructions framework:
  */
+#ifndef PERF_BENCH
 	.section .altinstr_replacement, "ax", @progbits
 .Lmemcpy_c:
+#else
+	.globl memcpy_x86_64_rep
+memcpy_x86_64_rep:
+#endif
 	movq %rdi, %rax
 
 	movl %edx, %ecx
@@ -34,12 +52,19 @@
 	movl %edx, %ecx
 	rep movsb
 	ret
+#ifndef PERF_BENCH
 .Lmemcpy_e:
 	.previous
+#endif
 
+#ifndef PERF_BENCH
 ENTRY(__memcpy)
 ENTRY(memcpy)
 	CFI_STARTPROC
+#else
+	.globl	memcpy_x86_64_unrolled
+memcpy_x86_64_unrolled:
+#endif
 	movq %rdi, %rax
 
 	/*
@@ -166,6 +191,9 @@ ENTRY(memcpy)
 
 .Lend:
 	retq
+
+#ifndef PERF_BENCH
+
 	CFI_ENDPROC
 ENDPROC(memcpy)
 ENDPROC(__memcpy)
@@ -189,3 +217,5 @@ ENDPROC(__memcpy)
 	.byte .Lmemcpy_e - .Lmemcpy_c
 	.byte .Lmemcpy_e - .Lmemcpy_c
 	.previous
+
+#endif
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy
  2010-10-29 16:01 [PATCH 1/2] perf bench: port memcpy_64.S to perf bench Hitoshi Mitake
@ 2010-10-29 16:01 ` Hitoshi Mitake
  2010-10-30 19:23   ` Ingo Molnar
  2010-10-29 19:49 ` [PATCH 1/2] perf bench: port memcpy_64.S to perf bench Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Hitoshi Mitake @ 2010-10-29 16:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mitake, h.mitake, Ma Ling:,
	Zhao Yakui, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Paul Mackerras, Frederic Weisbecker, Steven Rostedt,
	Thomas Gleixner, H. Peter Anvin

This patch adds new file: mem-memcpy-x86-64-asm.S
for x86-64 specific memcpy() benchmarking.
Added new benchmarks are,
 x86-64-rep:      memcpy() implemented with rep instruction
 x86-64-unrolled: unrolled memcpy()

Original idea of including the source files of kernel
for benchmarking is suggested by Ingo Molnar.
This is more effective than write-once programs for quantitative
evaluation of in-kernel, little and leaf functions called high frequently.
Because perf bench is in kernel source tree and executing it
on various hardwares, especially new model CPUs, is easy.

This way can also be used for other functions of kernel e.g. checksum functions.

Example of usage on Core i3 M330:

| % ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7f911f94c010 to 0x7f913ed4d010 ...
|
|      578.732506 MB/Sec
| % ./perf bench mem memcpy -l 500MB -r x86-64-rep
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7fb4b6fe4010 to 0x7fb4d63e5010 ...
|
|      738.184980 MB/Sec
| % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes from 0x7f6f2e668010 to 0x7f6f4da69010 ...
|
|      767.483269 MB/Sec

This shows clearly that unrolled memcpy() is efficient
than rep version and glibc's one :)

# checkpatch.pl warns about two externs in bench/mem-memcpy.c
# added by this patch. But I think it is no problem.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Ma Ling: <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
---
 tools/perf/Makefile                      |    8 ++++++++
 tools/perf/bench/mem-memcpy-x86-64-asm.S |    4 ++++
 tools/perf/bench/mem-memcpy.c            |   14 ++++++++++++++
 3 files changed, 26 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/bench/mem-memcpy-x86-64-asm.S

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index d1db0f6..540020e 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -183,9 +183,12 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/ -e s/sun4u/sparc64/ \
 # Additional ARCH settings for x86
 ifeq ($(ARCH),i386)
         ARCH := x86
+	ARCH_CFLAGS = -DARCH_X86_64
 endif
 ifeq ($(ARCH),x86_64)
         ARCH := x86
+	ARCH_CFLAGS = -DARCH_X86_64
+	ARCH_INCLUDE = ../../arch/x86/lib/memcpy_64.S
 endif
 
 # CFLAGS and LDFLAGS are for the users to override from the command line.
@@ -417,6 +420,7 @@ LIB_H += util/probe-finder.h
 LIB_H += util/probe-event.h
 LIB_H += util/pstack.h
 LIB_H += util/cpumap.h
+LIB_H += $(ARCH_INCLUDE)
 
 LIB_OBJS += $(OUTPUT)util/abspath.o
 LIB_OBJS += $(OUTPUT)util/alias.o
@@ -472,6 +476,9 @@ BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
 # Benchmark modules
 BUILTIN_OBJS += $(OUTPUT)bench/sched-messaging.o
 BUILTIN_OBJS += $(OUTPUT)bench/sched-pipe.o
+ifeq ($(ARCH),x86)
+BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy-x86-64-asm.o
+endif
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
 
 BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
@@ -898,6 +905,7 @@ BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \
 LIB_OBJS += $(COMPAT_OBJS)
 
 ALL_CFLAGS += $(BASIC_CFLAGS)
+ALL_CFLAGS += $(ARCH_CFLAGS)
 ALL_LDFLAGS += $(BASIC_LDFLAGS)
 
 export TAR INSTALL DESTDIR SHELL_PATH
diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm.S b/tools/perf/bench/mem-memcpy-x86-64-asm.S
new file mode 100644
index 0000000..6246d94
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-x86-64-asm.S
@@ -0,0 +1,4 @@
+
+#define PERF_BENCH
+
+#include "../../../arch/x86/lib/memcpy_64.S"
diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index 38dae74..ba73f39 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -19,6 +19,11 @@
 #include <sys/time.h>
 #include <errno.h>
 
+#ifdef ARCH_X86_64
+extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
+extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
+#endif
+
 #define K 1024
 
 static const char	*length_str	= "1MB";
@@ -47,6 +52,15 @@ struct routine routines[] = {
 	{ "default",
 	  "Default memcpy() provided by glibc",
 	  memcpy },
+#ifdef ARCH_X86_64
+	{ "x86-64-unrolled",
+	  "unrolled memcpy() in arch/x86/lib/memcpy_64.S",
+	  memcpy_x86_64_unrolled },
+	{ "x86-64-rep",
+	  "memcpy() implemented with rep instruction"
+	  " in arch/x86/lib/memcpy_64.S",
+	  memcpy_x86_64_rep },
+#endif
 	{ NULL,
 	  NULL,
 	  NULL   }
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/2] perf bench: port memcpy_64.S to perf bench
  2010-10-29 16:01 [PATCH 1/2] perf bench: port memcpy_64.S to perf bench Hitoshi Mitake
  2010-10-29 16:01 ` [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy Hitoshi Mitake
@ 2010-10-29 19:49 ` Peter Zijlstra
  2010-10-30 19:21   ` Ingo Molnar
       [not found]   ` <20101029210824.GB13385@ghostprotocols.net>
  1 sibling, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2010-10-29 19:49 UTC (permalink / raw)
  To: Hitoshi Mitake
  Cc: Ingo Molnar, linux-kernel, h.mitake, Ma Ling:,
	Zhao Yakui, Arnaldo Carvalho de Melo, Paul Mackerras,
	Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	H. Peter Anvin

On Sat, 2010-10-30 at 01:01 +0900, Hitoshi Mitake wrote:
> This patch ports arch/x86/lib/memcpy_64.S to "perf bench mem".
> When PERF_BENCH is defined at preprocessor level,
> memcpy_64.S is preprocessed to includable form from the sources
> under tools/perf for benchmarking programs.
> 
> Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
> Cc: Ma Ling: <ling.ma@intel.com>
> Cc: Zhao Yakui <yakui.zhao@intel.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: H. Peter Anvin <hpa@zytor.com>
> ---
>  arch/x86/lib/memcpy_64.S |   30 ++++++++++++++++++++++++++++++
>  1 files changed, 30 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
> index 75ef61e..72c6dfe 100644
> --- a/arch/x86/lib/memcpy_64.S
> +++ b/arch/x86/lib/memcpy_64.S
> @@ -1,10 +1,23 @@
>  /* Copyright 2002 Andi Kleen */
>  
> +/*
> + * perf bench adoption by Hitoshi Mitake
> + * PERF_BENCH means that this file is included from
> + * the source files under tools/perf/ for benchmark programs.
> + *
> + * You don't have to care about PERF_BENCH when
> + * you are working on the kernel.
> + */
> +
> +#ifndef PERF_BENCH

I don't like littering the actual kernel code with tools/perf/
ifdeffery..

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/2] perf bench: port memcpy_64.S to perf bench
  2010-10-29 19:49 ` [PATCH 1/2] perf bench: port memcpy_64.S to perf bench Peter Zijlstra
@ 2010-10-30 19:21   ` Ingo Molnar
       [not found]     ` <4D0CE05C.1070600@dcl.info.waseda.ac.jp>
       [not found]   ` <20101029210824.GB13385@ghostprotocols.net>
  1 sibling, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2010-10-30 19:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hitoshi Mitake, linux-kernel, h.mitake, Ma Ling:,
	Zhao Yakui, Arnaldo Carvalho de Melo, Paul Mackerras,
	Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	H. Peter Anvin


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Sat, 2010-10-30 at 01:01 +0900, Hitoshi Mitake wrote:
> > This patch ports arch/x86/lib/memcpy_64.S to "perf bench mem".
> > When PERF_BENCH is defined at preprocessor level,
> > memcpy_64.S is preprocessed to includable form from the sources
> > under tools/perf for benchmarking programs.
> > 
> > Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
> > Cc: Ma Ling: <ling.ma@intel.com>
> > Cc: Zhao Yakui <yakui.zhao@intel.com>
> > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> > Cc: Paul Mackerras <paulus@samba.org>
> > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > ---
> >  arch/x86/lib/memcpy_64.S |   30 ++++++++++++++++++++++++++++++
> >  1 files changed, 30 insertions(+), 0 deletions(-)
> > 
> > diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
> > index 75ef61e..72c6dfe 100644
> > --- a/arch/x86/lib/memcpy_64.S
> > +++ b/arch/x86/lib/memcpy_64.S
> > @@ -1,10 +1,23 @@
> >  /* Copyright 2002 Andi Kleen */
> >  
> > +/*
> > + * perf bench adoption by Hitoshi Mitake
> > + * PERF_BENCH means that this file is included from
> > + * the source files under tools/perf/ for benchmark programs.
> > + *
> > + * You don't have to care about PERF_BENCH when
> > + * you are working on the kernel.
> > + */
> > +
> > +#ifndef PERF_BENCH
> 
> I don't like littering the actual kernel code with tools/perf/
> ifdeffery..


Yeah - could we somehow accept that file into a perf build as-is?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy
  2010-10-29 16:01 ` [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy Hitoshi Mitake
@ 2010-10-30 19:23   ` Ingo Molnar
  2010-11-01  5:36     ` Hitoshi Mitake
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2010-10-30 19:23 UTC (permalink / raw)
  To: Hitoshi Mitake
  Cc: linux-kernel, h.mitake, Ma Ling:,
	Zhao Yakui, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Paul Mackerras, Frederic Weisbecker, Steven Rostedt,
	Thomas Gleixner, H. Peter Anvin


* Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> wrote:

> This patch adds new file: mem-memcpy-x86-64-asm.S
> for x86-64 specific memcpy() benchmarking.
> Added new benchmarks are,
>  x86-64-rep:      memcpy() implemented with rep instruction
>  x86-64-unrolled: unrolled memcpy()
> 
> Original idea of including the source files of kernel
> for benchmarking is suggested by Ingo Molnar.
> This is more effective than write-once programs for quantitative
> evaluation of in-kernel, little and leaf functions called high frequently.
> Because perf bench is in kernel source tree and executing it
> on various hardwares, especially new model CPUs, is easy.
> 
> This way can also be used for other functions of kernel e.g. checksum functions.
> 
> Example of usage on Core i3 M330:
> 
> | % ./perf bench mem memcpy -l 500MB
> | # Running mem/memcpy benchmark...
> | # Copying 500MB Bytes from 0x7f911f94c010 to 0x7f913ed4d010 ...
> |
> |      578.732506 MB/Sec
> | % ./perf bench mem memcpy -l 500MB -r x86-64-rep
> | # Running mem/memcpy benchmark...
> | # Copying 500MB Bytes from 0x7fb4b6fe4010 to 0x7fb4d63e5010 ...
> |
> |      738.184980 MB/Sec
> | % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
> | # Running mem/memcpy benchmark...
> | # Copying 500MB Bytes from 0x7f6f2e668010 to 0x7f6f4da69010 ...
> |
> |      767.483269 MB/Sec
> 
> This shows clearly that unrolled memcpy() is efficient
> than rep version and glibc's one :)

Hey, really cool output :-)

Might also make sense to measure Ma Ling's patched version?

> # checkpatch.pl warns about two externs in bench/mem-memcpy.c
> # added by this patch. But I think it is no problem.

You should put these:

 +#ifdef ARCH_X86_64
 +extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
 +extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
 +#endif

into a .h file - a new one if needed.

That will make both checkpatch and me happier ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy
  2010-10-30 19:23   ` Ingo Molnar
@ 2010-11-01  5:36     ` Hitoshi Mitake
  2010-11-01  9:02       ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Hitoshi Mitake @ 2010-11-01  5:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, h.mitake, Ma Ling, Zhao Yakui, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, H. Peter Anvin

On 2010年10月31日 04:23, Ingo Molnar wrote:
>
> * Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>  wrote:
>
>> This patch adds new file: mem-memcpy-x86-64-asm.S
>> for x86-64 specific memcpy() benchmarking.
>> Added new benchmarks are,
>>   x86-64-rep:      memcpy() implemented with rep instruction
>>   x86-64-unrolled: unrolled memcpy()
>>
>> Original idea of including the source files of kernel
>> for benchmarking is suggested by Ingo Molnar.
>> This is more effective than write-once programs for quantitative
>> evaluation of in-kernel, little and leaf functions called high frequently.
>> Because perf bench is in kernel source tree and executing it
>> on various hardwares, especially new model CPUs, is easy.
>>
>> This way can also be used for other functions of kernel e.g. checksum functions.
>>
>> Example of usage on Core i3 M330:
>>
>> | % ./perf bench mem memcpy -l 500MB
>> | # Running mem/memcpy benchmark...
>> | # Copying 500MB Bytes from 0x7f911f94c010 to 0x7f913ed4d010 ...
>> |
>> |      578.732506 MB/Sec
>> | % ./perf bench mem memcpy -l 500MB -r x86-64-rep
>> | # Running mem/memcpy benchmark...
>> | # Copying 500MB Bytes from 0x7fb4b6fe4010 to 0x7fb4d63e5010 ...
>> |
>> |      738.184980 MB/Sec
>> | % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
>> | # Running mem/memcpy benchmark...
>> | # Copying 500MB Bytes from 0x7f6f2e668010 to 0x7f6f4da69010 ...
>> |
>> |      767.483269 MB/Sec
>>
>> This shows clearly that unrolled memcpy() is efficient
>> than rep version and glibc's one :)
>
> Hey, really cool output :-)
>
> Might also make sense to measure Ma Ling's patched version?

Does Ma Ling's patched version mean,

http://marc.info/?l=linux-kernel&m=128652296500989&w=2

the memcpy applied the patch of the URL?
(It seems that this patch was written by Miao Xie.)

I'll include the result of patched version in the next post.

>
>> # checkpatch.pl warns about two externs in bench/mem-memcpy.c
>> # added by this patch. But I think it is no problem.
>
> You should put these:
>
>   +#ifdef ARCH_X86_64
>   +extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
>   +extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
>   +#endif
>
> into a .h file - a new one if needed.
>
> That will make both checkpatch and me happier ;-)
>

OK, I'll separate these files.

BTW, I found really interesting evaluation result.
Current results of "perf bench mem memcpy" include
the overhead of page faults because the measured memcpy()
is the first access to allocated memory area.

I tested the another version of perf bench mem memcpy,
which does memcpy() before measured memcpy() for removing
the overhead come from page faults.

And this is the result:

% ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f19d488f010 to 0x7f19f3c90010 ...

        4.608340 GB/Sec

% ./perf bench mem memcpy -l 500MB
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f696c3cc010 to 0x7f698b7cd010 ...

        4.856442 GB/Sec

% ./perf bench mem memcpy -l 500MB -r x86-64-rep
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f45d6cff010 to 0x7f45f6100010 ...

        6.024445 GB/Sec

The relation of scores reversed!
I cannot explain the cause of this result, and
this is really interesting phenomenon.

So I'd like to add new command line option,
like "--pre-page-faults" to perf bench mem memcpy,
for doing memcpy() before measured memcpy().

How do you think about this idea?

Thanks,

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy
  2010-11-01  5:36     ` Hitoshi Mitake
@ 2010-11-01  9:02       ` Ingo Molnar
  2010-11-05 17:05         ` Hitoshi Mitake
  2011-01-11 16:27         ` [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy Hitoshi Mitake
  0 siblings, 2 replies; 30+ messages in thread
From: Ingo Molnar @ 2010-11-01  9:02 UTC (permalink / raw)
  To: Hitoshi Mitake
  Cc: linux-kernel, h.mitake, Ma Ling, Zhao Yakui, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, H. Peter Anvin


* Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> wrote:

> On 2010年10月31日 04:23, Ingo Molnar wrote:
> >
> >* Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>  wrote:
> >
> >>This patch adds new file: mem-memcpy-x86-64-asm.S
> >>for x86-64 specific memcpy() benchmarking.
> >>Added new benchmarks are,
> >>  x86-64-rep:      memcpy() implemented with rep instruction
> >>  x86-64-unrolled: unrolled memcpy()
> >>
> >>Original idea of including the source files of kernel
> >>for benchmarking is suggested by Ingo Molnar.
> >>This is more effective than write-once programs for quantitative
> >>evaluation of in-kernel, little and leaf functions called high frequently.
> >>Because perf bench is in kernel source tree and executing it
> >>on various hardwares, especially new model CPUs, is easy.
> >>
> >>This way can also be used for other functions of kernel e.g. checksum functions.
> >>
> >>Example of usage on Core i3 M330:
> >>
> >>| % ./perf bench mem memcpy -l 500MB
> >>| # Running mem/memcpy benchmark...
> >>| # Copying 500MB Bytes from 0x7f911f94c010 to 0x7f913ed4d010 ...
> >>|
> >>|      578.732506 MB/Sec
> >>| % ./perf bench mem memcpy -l 500MB -r x86-64-rep
> >>| # Running mem/memcpy benchmark...
> >>| # Copying 500MB Bytes from 0x7fb4b6fe4010 to 0x7fb4d63e5010 ...
> >>|
> >>|      738.184980 MB/Sec
> >>| % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
> >>| # Running mem/memcpy benchmark...
> >>| # Copying 500MB Bytes from 0x7f6f2e668010 to 0x7f6f4da69010 ...
> >>|
> >>|      767.483269 MB/Sec
> >>
> >>This shows clearly that unrolled memcpy() is efficient
> >>than rep version and glibc's one :)
> >
> >Hey, really cool output :-)
> >
> >Might also make sense to measure Ma Ling's patched version?
> 
> Does Ma Ling's patched version mean,
> 
> http://marc.info/?l=linux-kernel&m=128652296500989&w=2
> 
> the memcpy applied the patch of the URL?
> (It seems that this patch was written by Miao Xie.)
> 
> I'll include the result of patched version in the next post.

(Indeed it is Miao Xie - sorry!)

> >># checkpatch.pl warns about two externs in bench/mem-memcpy.c
> >># added by this patch. But I think it is no problem.
> >
> >You should put these:
> >
> >  +#ifdef ARCH_X86_64
> >  +extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
> >  +extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
> >  +#endif
> >
> >into a .h file - a new one if needed.
> >
> >That will make both checkpatch and me happier ;-)
> >
> 
> OK, I'll separate these files.
> 
> BTW, I found really interesting evaluation result.
> Current results of "perf bench mem memcpy" include
> the overhead of page faults because the measured memcpy()
> is the first access to allocated memory area.
> 
> I tested the another version of perf bench mem memcpy,
> which does memcpy() before measured memcpy() for removing
> the overhead come from page faults.
> 
> And this is the result:
> 
> % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
> # Running mem/memcpy benchmark...
> # Copying 500MB Bytes from 0x7f19d488f010 to 0x7f19f3c90010 ...
> 
>        4.608340 GB/Sec
> 
> % ./perf bench mem memcpy -l 500MB
> # Running mem/memcpy benchmark...
> # Copying 500MB Bytes from 0x7f696c3cc010 to 0x7f698b7cd010 ...
> 
>        4.856442 GB/Sec
> 
> % ./perf bench mem memcpy -l 500MB -r x86-64-rep
> # Running mem/memcpy benchmark...
> # Copying 500MB Bytes from 0x7f45d6cff010 to 0x7f45f6100010 ...
> 
>        6.024445 GB/Sec
> 
> The relation of scores reversed!
> I cannot explain the cause of this result, and
> this is really interesting phenomenon.

Interesting indeed, and it would be nice to analyse that! (It should be possible, 
using various PMU metrics in a clever way, to figure out what's happening inside the 
CPU, right?)

> So I'd like to add new command line option,
> like "--pre-page-faults" to perf bench mem memcpy,
> for doing memcpy() before measured memcpy().
> 
> How do you think about this idea?

Agreed. (Maybe name it --prefault, as 'prefaulting' is the term we generally use for 
things like this.)

An even better solution would be to output _both_ results by default, so that people 
can see both characteristics at a glance?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy
  2010-11-01  9:02       ` Ingo Molnar
@ 2010-11-05 17:05         ` Hitoshi Mitake
  2010-11-10  9:12           ` Ingo Molnar
  2011-01-11 16:27         ` [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy Hitoshi Mitake
  1 sibling, 1 reply; 30+ messages in thread
From: Hitoshi Mitake @ 2010-11-05 17:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, h.mitake, Ma Ling, Zhao Yakui, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, H. Peter Anvin

On 2010年11月01日 18:02, Ingo Molnar wrote:
>
> * Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>  wrote:
>
>> On 2010年10月31日 04:23, Ingo Molnar wrote:
>>>
>>> * Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>   wrote:
>>>
>>>> This patch adds new file: mem-memcpy-x86-64-asm.S
>>>> for x86-64 specific memcpy() benchmarking.
>>>> Added new benchmarks are,
>>>>   x86-64-rep:      memcpy() implemented with rep instruction
>>>>   x86-64-unrolled: unrolled memcpy()
>>>>
>>>> Original idea of including the source files of kernel
>>>> for benchmarking is suggested by Ingo Molnar.
>>>> This is more effective than write-once programs for quantitative
>>>> evaluation of in-kernel, little and leaf functions called high frequently.
>>>> Because perf bench is in kernel source tree and executing it
>>>> on various hardwares, especially new model CPUs, is easy.
>>>>
>>>> This way can also be used for other functions of kernel e.g. checksum functions.
>>>>
>>>> Example of usage on Core i3 M330:
>>>>
>>>> | % ./perf bench mem memcpy -l 500MB
>>>> | # Running mem/memcpy benchmark...
>>>> | # Copying 500MB Bytes from 0x7f911f94c010 to 0x7f913ed4d010 ...
>>>> |
>>>> |      578.732506 MB/Sec
>>>> | % ./perf bench mem memcpy -l 500MB -r x86-64-rep
>>>> | # Running mem/memcpy benchmark...
>>>> | # Copying 500MB Bytes from 0x7fb4b6fe4010 to 0x7fb4d63e5010 ...
>>>> |
>>>> |      738.184980 MB/Sec
>>>> | % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
>>>> | # Running mem/memcpy benchmark...
>>>> | # Copying 500MB Bytes from 0x7f6f2e668010 to 0x7f6f4da69010 ...
>>>> |
>>>> |      767.483269 MB/Sec
>>>>
>>>> This shows clearly that unrolled memcpy() is efficient
>>>> than rep version and glibc's one :)
>>>
>>> Hey, really cool output :-)
>>>
>>> Might also make sense to measure Ma Ling's patched version?
>>
>> Does Ma Ling's patched version mean,
>>
>> http://marc.info/?l=linux-kernel&m=128652296500989&w=2
>>
>> the memcpy applied the patch of the URL?
>> (It seems that this patch was written by Miao Xie.)
>>
>> I'll include the result of patched version in the next post.
>
> (Indeed it is Miao Xie - sorry!)
>
>>>> # checkpatch.pl warns about two externs in bench/mem-memcpy.c
>>>> # added by this patch. But I think it is no problem.
>>>
>>> You should put these:
>>>
>>>   +#ifdef ARCH_X86_64
>>>   +extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
>>>   +extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
>>>   +#endif
>>>
>>> into a .h file - a new one if needed.
>>>
>>> That will make both checkpatch and me happier ;-)
>>>
>>
>> OK, I'll separate these files.
>>
>> BTW, I found really interesting evaluation result.
>> Current results of "perf bench mem memcpy" include
>> the overhead of page faults because the measured memcpy()
>> is the first access to allocated memory area.
>>
>> I tested the another version of perf bench mem memcpy,
>> which does memcpy() before measured memcpy() for removing
>> the overhead come from page faults.
>>
>> And this is the result:
>>
>> % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
>> # Running mem/memcpy benchmark...
>> # Copying 500MB Bytes from 0x7f19d488f010 to 0x7f19f3c90010 ...
>>
>>         4.608340 GB/Sec
>>
>> % ./perf bench mem memcpy -l 500MB
>> # Running mem/memcpy benchmark...
>> # Copying 500MB Bytes from 0x7f696c3cc010 to 0x7f698b7cd010 ...
>>
>>         4.856442 GB/Sec
>>
>> % ./perf bench mem memcpy -l 500MB -r x86-64-rep
>> # Running mem/memcpy benchmark...
>> # Copying 500MB Bytes from 0x7f45d6cff010 to 0x7f45f6100010 ...
>>
>>         6.024445 GB/Sec
>>
>> The relation of scores reversed!
>> I cannot explain the cause of this result, and
>> this is really interesting phenomenon.
>
> Interesting indeed, and it would be nice to analyse that! (It should be possible,
> using various PMU metrics in a clever way, to figure out what's happening inside the
> CPU, right?)
>
>> So I'd like to add new command line option,
>> like "--pre-page-faults" to perf bench mem memcpy,
>> for doing memcpy() before measured memcpy().
>>
>> How do you think about this idea?
>
> Agreed. (Maybe name it --prefault, as 'prefaulting' is the term we generally use for
> things like this.)
>
> An even better solution would be to output _both_ results by default, so that people
> can see both characteristics at a glance?

Outputting both result of prefaulted and non prefaulted will be useful,
but this might be not good for using from scripts.
So I'll implement --prefault option first. If there is request
for outputting both, I'll consider to modify default output.

# Please wait about the result of Miao Xie's patch,
# benchmarking memcpy() of unaligned memory area is
# a little difficult

Thanks,
	Hitoshi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/2] perf bench: port memcpy_64.S to perf bench
       [not found]   ` <20101029210824.GB13385@ghostprotocols.net>
@ 2010-11-05 17:10     ` Hitoshi Mitake
  0 siblings, 0 replies; 30+ messages in thread
From: Hitoshi Mitake @ 2010-11-05 17:10 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Ma Ling:,
	Zhao Yakui, Paul Mackerras, Frederic Weisbecker, Steven Rostedt,
	Thomas Gleixner, H. Peter Anvin, Hitoshi Mitake

On Sat, Oct 30, 2010 at 06:08, Arnaldo Carvalho de Melo
<acme@ghostprotocols.net> wrote:
> Em Fri, Oct 29, 2010 at 09:49:11PM +0200, Peter Zijlstra escreveu:
>> On Sat, 2010-10-30 at 01:01 +0900, Hitoshi Mitake wrote:
>> > +++ b/arch/x86/lib/memcpy_64.S
>> > @@ -1,10 +1,23 @@
>> >  /* Copyright 2002 Andi Kleen */
>> >
>> > +/*
>> > + * perf bench adoption by Hitoshi Mitake
>> > + * PERF_BENCH means that this file is included from
>> > + * the source files under tools/perf/ for benchmark programs.
>> > + *
>> > + * You don't have to care about PERF_BENCH when
>> > + * you are working on the kernel.
>> > + */
>> > +
>> > +#ifndef PERF_BENCH
>>
>> I don't like littering the actual kernel code with tools/perf/
>> ifdeffery..
>
> Yeah, this kind of problem appeared in the past, we can't use things
> that weren't specifically designed to be shared, the discussion about
> how to properly share things between the kernel and things in tools
> still has to happen.

OK, it seems that I have to consider better solution.
Could you tell me about the past problem for reference?
Your experience must be useful for this case.

-- 
Hitoshi Mitake
h.mitake@gmail.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy
  2010-11-05 17:05         ` Hitoshi Mitake
@ 2010-11-10  9:12           ` Ingo Molnar
  2010-11-12 15:01             ` Hitoshi Mitake
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2010-11-10  9:12 UTC (permalink / raw)
  To: Hitoshi Mitake
  Cc: linux-kernel, h.mitake, Ma Ling, Zhao Yakui, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, H. Peter Anvin


* Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> wrote:

> > An even better solution would be to output _both_ results by default, so that 
> > people can see both characteristics at a glance?
> 
> Outputting both result of prefaulted and non prefaulted will be useful, but this 
> might be not good for using from scripts. So I'll implement --prefault option 
> first. If there is request for outputting both, I'll consider to modify default 
> output.

Ok - it should definitely be easily scriptable. The default can be have both flags 
enabled and both results written to the output.

People will try 'perf bench x86' to see performance at a glance - so printing all 
the tests we have is a good idea.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy
  2010-11-10  9:12           ` Ingo Molnar
@ 2010-11-12 15:01             ` Hitoshi Mitake
  2010-11-12 15:02               ` [PATCH] perf bench: print both of prefaulted and no prefaulted results Hitoshi Mitake
  0 siblings, 1 reply; 30+ messages in thread
From: Hitoshi Mitake @ 2010-11-12 15:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, h.mitake, Ma Ling, Zhao Yakui, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, H. Peter Anvin

On 2010年11月10日 18:12, Ingo Molnar wrote:
>
> * Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>  wrote:
>
>>> An even better solution would be to output _both_ results by default, so that
>>> people can see both characteristics at a glance?
>>
>> Outputting both result of prefaulted and non prefaulted will be useful, but this
>> might be not good for using from scripts. So I'll implement --prefault option
>> first. If there is request for outputting both, I'll consider to modify default
>> output.
>
> Ok - it should definitely be easily scriptable. The default can be have both flags
> enabled and both results written to the output.
>
> People will try 'perf bench x86' to see performance at a glance - so printing all
> the tests we have is a good idea.

OK, I added --no-prefault and --only-prefault to perf bench mem memcpy.
As you told, printing both of them is convenient.

I send the updated patch later.

Thanks,

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH] perf bench: print both of prefaulted and no prefaulted results
  2010-11-12 15:01             ` Hitoshi Mitake
@ 2010-11-12 15:02               ` Hitoshi Mitake
  2010-11-18  7:58                 ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Hitoshi Mitake @ 2010-11-12 15:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mitake, h.mitake, Ma Ling, Zhao Yakui,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Paul Mackerras,
	Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	H. Peter Anvin

After applying this patch, perf bench mem memcpy prints
both of prefualted and without prefaulted score of memcpy().

New options --no-prefault and --only-prefault are added
for printing single result, mainly for scripting usage.

Example of usage:
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|      634.969014 MB/Sec
|        4.828062 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|        4.705192 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|      642.725568 MB/Sec

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
---
 tools/perf/bench/mem-memcpy.c |  215 +++++++++++++++++++++++++++++------------
 1 files changed, 152 insertions(+), 63 deletions(-)

diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index be31ddb..61b6ead 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -25,7 +25,8 @@ static const char	*length_str	= "1MB";
 static const char	*routine	= "default";
 static bool		use_clock;
 static int		clock_fd;
-static bool		prefault;
+static bool		only_prefault;
+static bool		no_prefault;
 
 static const struct option options[] = {
 	OPT_STRING('l', "length", &length_str, "1MB",
@@ -35,15 +36,19 @@ static const struct option options[] = {
 		    "Specify routine to copy"),
 	OPT_BOOLEAN('c', "clock", &use_clock,
 		    "Use CPU clock for measuring"),
-	OPT_BOOLEAN('p', "prefault", &prefault,
-		    "Cause page faults before memcpy()"),
+	OPT_BOOLEAN('o', "only-prefault", &only_prefault,
+		    "Show only the result with page faults before memcpy()"),
+	OPT_BOOLEAN('n', "no-prefault", &no_prefault,
+		    "Show only the result without page faults before memcpy()"),
 	OPT_END()
 };
 
+typedef void *(*memcpy_t)(void *, const void *, size_t);
+
 struct routine {
 	const char *name;
 	const char *desc;
-	void * (*fn)(void *dst, const void *src, size_t len);
+	memcpy_t fn;
 };
 
 struct routine routines[] = {
@@ -92,29 +97,98 @@ static double timeval2double(struct timeval *ts)
 		(double)ts->tv_usec / (double)1000000;
 }
 
+static void alloc_mem(void **dst, void **src, size_t length)
+{
+	*dst = zalloc(length);
+	if (!dst)
+		die("memory allocation failed - maybe length is too large?\n");
+
+	*src = zalloc(length);
+	if (!src)
+		die("memory allocation failed - maybe length is too large?\n");
+}
+
+static u64 do_memcpy_clock(memcpy_t fn, size_t len, bool prefault)
+{
+	u64 clock_start = 0ULL, clock_end = 0ULL;
+	void *src = NULL, *dst = NULL;
+
+	alloc_mem(&src, &dst, len);
+
+	if (prefault)
+		fn(dst, src, len);
+
+	clock_start = get_clock();
+	fn(dst, src, len);
+	clock_end = get_clock();
+
+	free(src);
+	free(dst);
+	return clock_end - clock_start;
+}
+
+static double do_memcpy_gettimeofday(memcpy_t fn, size_t len, bool prefault)
+{
+	struct timeval tv_start, tv_end, tv_diff;
+	void *src = NULL, *dst = NULL;
+
+	alloc_mem(&src, &dst, len);
+
+	if (prefault)
+		fn(dst, src, len);
+
+	BUG_ON(gettimeofday(&tv_start, NULL));
+	fn(dst, src, len);
+	BUG_ON(gettimeofday(&tv_end, NULL));
+
+	timersub(&tv_end, &tv_start, &tv_diff);
+
+	free(src);
+	free(dst);
+	return (double)((double)len / timeval2double(&tv_diff));
+}
+
+#define pf (no_prefault ? 0 : 1)
+
+#define print_bps(x) do {					\
+		if (x < K)					\
+			printf(" %14lf B/Sec", x);		\
+		else if (x < K * K)				\
+			printf(" %14lfd KB/Sec", x / K);	\
+		else if (x < K * K * K)				\
+			printf(" %14lf MB/Sec", x / K / K);	\
+		else						\
+			printf(" %14lf GB/Sec", x / K / K / K); \
+	} while (0)
+
 int bench_mem_memcpy(int argc, const char **argv,
 		     const char *prefix __used)
 {
 	int i;
-	void *dst, *src;
-	size_t length;
-	double bps = 0.0;
-	struct timeval tv_start, tv_end, tv_diff;
-	u64 clock_start, clock_end, clock_diff;
+	size_t len;
+	double result_bps[2];
+	u64 result_clock[2];
 
-	clock_start = clock_end = clock_diff = 0ULL;
 	argc = parse_options(argc, argv, options,
 			     bench_mem_memcpy_usage, 0);
 
-	tv_diff.tv_sec = 0;
-	tv_diff.tv_usec = 0;
-	length = (size_t)perf_atoll((char *)length_str);
+	if (use_clock)
+		init_clock();
+
+	len = (size_t)perf_atoll((char *)length_str);
 
-	if ((s64)length <= 0) {
+	result_clock[0] = result_clock[1] = 0ULL;
+	result_bps[0] = result_bps[1] = 0.0;
+
+	if ((s64)len <= 0) {
 		fprintf(stderr, "Invalid length:%s\n", length_str);
 		return 1;
 	}
 
+	/* same to without specifying either of prefault and no-prefault */
+	if (only_prefault && no_prefault)
+		only_prefault = no_prefault = false;
+
 	for (i = 0; routines[i].name; i++) {
 		if (!strcmp(routines[i].name, routine))
 			break;
@@ -129,65 +203,80 @@ int bench_mem_memcpy(int argc, const char **argv,
 		return 1;
 	}
 
-	dst = zalloc(length);
-	if (!dst)
-		die("memory allocation failed - maybe length is too large?\n");
-
-	src = zalloc(length);
-	if (!src)
-		die("memory allocation failed - maybe length is too large?\n");
-
-	if (bench_format == BENCH_FORMAT_DEFAULT) {
-		printf("# Copying %s Bytes from %p to %p ...\n\n",
-		       length_str, src, dst);
-	}
-
-
-	if (prefault)
-		routines[i].fn(dst, src, length);
-
-	if (use_clock) {
-		init_clock();
-		clock_start = get_clock();
-	} else {
-		BUG_ON(gettimeofday(&tv_start, NULL));
-	}
+	if (bench_format == BENCH_FORMAT_DEFAULT)
+		printf("# Copying %s Bytes ...\n\n", length_str);
 
-	routines[i].fn(dst, src, length);
-
-	if (use_clock) {
-		clock_end = get_clock();
-		clock_diff = clock_end - clock_start;
+	if (!only_prefault && !no_prefault) {
+		/* show both of results */
+		if (use_clock) {
+			result_clock[0] =
+				do_memcpy_clock(routines[i].fn, len, false);
+			result_clock[1] =
+				do_memcpy_clock(routines[i].fn, len, true);
+		} else {
+			result_bps[0] =
+				do_memcpy_gettimeofday(routines[i].fn,
+						len, false);
+			result_bps[1] =
+				do_memcpy_gettimeofday(routines[i].fn,
+						len, true);
+		}
 	} else {
-		BUG_ON(gettimeofday(&tv_end, NULL));
-		timersub(&tv_end, &tv_start, &tv_diff);
-		bps = (double)((double)length / timeval2double(&tv_diff));
+		if (use_clock) {
+			result_clock[pf] =
+				do_memcpy_clock(routines[i].fn,
+						len, only_prefault);
+		} else {
+			result_bps[pf] =
+				do_memcpy_gettimeofday(routines[i].fn,
+						len, only_prefault);
+		}
 	}
 
 	switch (bench_format) {
 	case BENCH_FORMAT_DEFAULT:
-		if (use_clock) {
-			printf(" %14lf Clock/Byte\n",
-			       (double)clock_diff / (double)length);
-		} else {
-			if (bps < K)
-				printf(" %14lf B/Sec\n", bps);
-			else if (bps < K * K)
-				printf(" %14lfd KB/Sec\n", bps / 1024);
-			else if (bps < K * K * K)
-				printf(" %14lf MB/Sec\n", bps / 1024 / 1024);
-			else {
-				printf(" %14lf GB/Sec\n",
-				       bps / 1024 / 1024 / 1024);
+		if (!only_prefault && !no_prefault) {
+			if (use_clock) {
+				printf(" %14lf Clock/Byte\n",
+					(double)result_clock[0]
+					/ (double)len);
+				printf(" %14lf Clock/Byte (with prefault)\n",
+					(double)result_clock[1]
+					/ (double)len);
+			} else {
+				print_bps(result_bps[0]);
+				printf("\n");
+				print_bps(result_bps[1]);
+				printf(" (with prefault)\n");
 			}
+		} else {
+			if (use_clock) {
+				printf(" %14lf Clock/Byte",
+					(double)result_clock[pf]
+					/ (double)len);
+			} else
+				print_bps(result_bps[pf]);
+
+			printf("%s\n", only_prefault ? " (with prefault)" : "");
 		}
 		break;
 	case BENCH_FORMAT_SIMPLE:
-		if (use_clock) {
-			printf("%14lf\n",
-			       (double)clock_diff / (double)length);
-		} else
-			printf("%lf\n", bps);
+		if (!only_prefault && !no_prefault) {
+			if (use_clock) {
+				printf("%lf %lf\n",
+					(double)result_clock[0] / (double)len,
+					(double)result_clock[1] / (double)len);
+			} else {
+				printf("%lf %lf\n",
+					result_bps[0], result_bps[1]);
+			}
+		} else {
+			if (use_clock) {
+				printf("%lf\n", (double)result_clock[pf]
+					/ (double)len);
+			} else
+				printf("%lf\n", result_bps[pf]);
+		}
 		break;
 	default:
 		/* reaching this means there's some disaster: */
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] perf bench: print both of prefaulted and no prefaulted results
  2010-11-12 15:02               ` [PATCH] perf bench: print both of prefaulted and no prefaulted results Hitoshi Mitake
@ 2010-11-18  7:58                 ` Ingo Molnar
  2010-11-25  7:04                   ` Hitoshi Mitake
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2010-11-18  7:58 UTC (permalink / raw)
  To: Hitoshi Mitake
  Cc: linux-kernel, h.mitake, Ma Ling, Zhao Yakui, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, H. Peter Anvin


* Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> wrote:

> After applying this patch, perf bench mem memcpy prints
> both of prefualted and without prefaulted score of memcpy().
> 
> New options --no-prefault and --only-prefault are added
> for printing single result, mainly for scripting usage.

Ok. Mind resending the whole series once all review feedback has been incorporated?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] perf bench: print both of prefaulted and no prefaulted results
  2010-11-18  7:58                 ` Ingo Molnar
@ 2010-11-25  7:04                   ` Hitoshi Mitake
  2010-11-25  7:04                     ` [PATCH v2 1/2] " Hitoshi Mitake
  2010-11-25  7:04                     ` [PATCH v2 2/2] perf bench: port arch/x86/lib/memcpy_64.S to perf bench mem memcpy Hitoshi Mitake
  0 siblings, 2 replies; 30+ messages in thread
From: Hitoshi Mitake @ 2010-11-25  7:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, h.mitake, Ma Ling, Zhao Yakui, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, H. Peter Anvin


Really sorry for my late reply..

On 11/18/10 16:58, Ingo Molnar wrote:
 >
 > * Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>  wrote:
 >
 >> After applying this patch, perf bench mem memcpy prints
 >> both of prefualted and without prefaulted score of memcpy().
 >>
 >> New options --no-prefault and --only-prefault are added
 >> for printing single result, mainly for scripting usage.
 >
 > Ok. Mind resending the whole series once all review feedback has been 
incorporated?
 >

OK, I'll send the patch series for prefaulting and
porting memcpy_64.S to perf bench later.
This series do some dirty things especially in Makefile
of perf and defining ENTRY(). So I'd like to hear your comment.
Could you review these?

And I have another problem. I cannot see the name of
memcpy based on rep prefix because the symbol of it is ".Lmemcpy_c".
It seems that the symbol name start from "." cannot be seen
from other object files. So I have to seek the way to
find the name of rep memcpy...

Thanks,
	Hitoshi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 1/2] perf bench: print both of prefaulted and no prefaulted results
  2010-11-25  7:04                   ` Hitoshi Mitake
@ 2010-11-25  7:04                     ` Hitoshi Mitake
  2010-11-26 10:30                       ` [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default tip-bot for Hitoshi Mitake
  2010-11-25  7:04                     ` [PATCH v2 2/2] perf bench: port arch/x86/lib/memcpy_64.S to perf bench mem memcpy Hitoshi Mitake
  1 sibling, 1 reply; 30+ messages in thread
From: Hitoshi Mitake @ 2010-11-25  7:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mitake, h.mitake, Miao Xie, Ma Ling, Zhao Yakui,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Paul Mackerras,
	Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen

After applying this patch, perf bench mem memcpy prints
both of prefualted and without prefaulted score of memcpy().

New options --no-prefault and --only-prefault are added
to print single result, mainly for scripting usage.

Example of usage:
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|      634.969014 MB/Sec
|        4.828062 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|        4.705192 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|      642.725568 MB/Sec

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
---
 tools/perf/bench/mem-memcpy.c |  219 ++++++++++++++++++++++++++++++-----------
 1 files changed, 162 insertions(+), 57 deletions(-)

diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index 38dae74..db82021 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -12,6 +12,7 @@
 #include "../util/parse-options.h"
 #include "../util/header.h"
 #include "bench.h"
+#include "mem-memcpy-arch.h"
 
 #include <stdio.h>
 #include <stdlib.h>
@@ -23,8 +24,10 @@
 
 static const char	*length_str	= "1MB";
 static const char	*routine	= "default";
-static bool		use_clock	= false;
+static bool		use_clock;
 static int		clock_fd;
+static bool		only_prefault;
+static bool		no_prefault;
 
 static const struct option options[] = {
 	OPT_STRING('l', "length", &length_str, "1MB",
@@ -34,19 +37,33 @@ static const struct option options[] = {
 		    "Specify routine to copy"),
 	OPT_BOOLEAN('c', "clock", &use_clock,
 		    "Use CPU clock for measuring"),
+	OPT_BOOLEAN('o', "only-prefault", &only_prefault,
+		    "Show only the result with page faults before memcpy()"),
+	OPT_BOOLEAN('n', "no-prefault", &no_prefault,
+		    "Show only the result without page faults before memcpy()"),
 	OPT_END()
 };
 
+typedef void *(*memcpy_t)(void *, const void *, size_t);
+
 struct routine {
 	const char *name;
 	const char *desc;
-	void * (*fn)(void *dst, const void *src, size_t len);
+	memcpy_t fn;
 };
 
 struct routine routines[] = {
 	{ "default",
 	  "Default memcpy() provided by glibc",
 	  memcpy },
+#ifdef ARCH_X86_64
+
+#define MEMCPY_FN(fn, name, desc) { name, desc, fn },
+#include "mem-memcpy-x86-64-asm-def.h"
+#undef MEMCPY_FN
+
+#endif
+
 	{ NULL,
 	  NULL,
 	  NULL   }
@@ -89,29 +106,98 @@ static double timeval2double(struct timeval *ts)
 		(double)ts->tv_usec / (double)1000000;
 }
 
+static void alloc_mem(void **dst, void **src, size_t length)
+{
+	*dst = zalloc(length);
+	if (!dst)
+		die("memory allocation failed - maybe length is too large?\n");
+
+	*src = zalloc(length);
+	if (!src)
+		die("memory allocation failed - maybe length is too large?\n");
+}
+
+static u64 do_memcpy_clock(memcpy_t fn, size_t len, bool prefault)
+{
+	u64 clock_start = 0ULL, clock_end = 0ULL;
+	void *src = NULL, *dst = NULL;
+
+	alloc_mem(&src, &dst, len);
+
+	if (prefault)
+		fn(dst, src, len);
+
+	clock_start = get_clock();
+	fn(dst, src, len);
+	clock_end = get_clock();
+
+	free(src);
+	free(dst);
+	return clock_end - clock_start;
+}
+
+static double do_memcpy_gettimeofday(memcpy_t fn, size_t len, bool prefault)
+{
+	struct timeval tv_start, tv_end, tv_diff;
+	void *src = NULL, *dst = NULL;
+
+	alloc_mem(&src, &dst, len);
+
+	if (prefault)
+		fn(dst, src, len);
+
+	BUG_ON(gettimeofday(&tv_start, NULL));
+	fn(dst, src, len);
+	BUG_ON(gettimeofday(&tv_end, NULL));
+
+	timersub(&tv_end, &tv_start, &tv_diff);
+
+	free(src);
+	free(dst);
+	return (double)((double)len / timeval2double(&tv_diff));
+}
+
+#define pf (no_prefault ? 0 : 1)
+
+#define print_bps(x) do {					\
+		if (x < K)					\
+			printf(" %14lf B/Sec", x);		\
+		else if (x < K * K)				\
+			printf(" %14lfd KB/Sec", x / K);	\
+		else if (x < K * K * K)				\
+			printf(" %14lf MB/Sec", x / K / K);	\
+		else						\
+			printf(" %14lf GB/Sec", x / K / K / K); \
+	} while (0)
+
 int bench_mem_memcpy(int argc, const char **argv,
 		     const char *prefix __used)
 {
 	int i;
-	void *dst, *src;
-	size_t length;
-	double bps = 0.0;
-	struct timeval tv_start, tv_end, tv_diff;
-	u64 clock_start, clock_end, clock_diff;
+	size_t len;
+	double result_bps[2];
+	u64 result_clock[2];
 
-	clock_start = clock_end = clock_diff = 0ULL;
 	argc = parse_options(argc, argv, options,
 			     bench_mem_memcpy_usage, 0);
 
-	tv_diff.tv_sec = 0;
-	tv_diff.tv_usec = 0;
-	length = (size_t)perf_atoll((char *)length_str);
+	if (use_clock)
+		init_clock();
+
+	len = (size_t)perf_atoll((char *)length_str);
 
-	if ((s64)length <= 0) {
+	result_clock[0] = result_clock[1] = 0ULL;
+	result_bps[0] = result_bps[1] = 0.0;
+
+	if ((s64)len <= 0) {
 		fprintf(stderr, "Invalid length:%s\n", length_str);
 		return 1;
 	}
 
+	/* same to without specifying either of prefault and no-prefault */
+	if (only_prefault && no_prefault)
+		only_prefault = no_prefault = false;
+
 	for (i = 0; routines[i].name; i++) {
 		if (!strcmp(routines[i].name, routine))
 			break;
@@ -126,61 +212,80 @@ int bench_mem_memcpy(int argc, const char **argv,
 		return 1;
 	}
 
-	dst = zalloc(length);
-	if (!dst)
-		die("memory allocation failed - maybe length is too large?\n");
-
-	src = zalloc(length);
-	if (!src)
-		die("memory allocation failed - maybe length is too large?\n");
-
-	if (bench_format == BENCH_FORMAT_DEFAULT) {
-		printf("# Copying %s Bytes from %p to %p ...\n\n",
-		       length_str, src, dst);
-	}
-
-	if (use_clock) {
-		init_clock();
-		clock_start = get_clock();
-	} else {
-		BUG_ON(gettimeofday(&tv_start, NULL));
-	}
-
-	routines[i].fn(dst, src, length);
+	if (bench_format == BENCH_FORMAT_DEFAULT)
+		printf("# Copying %s Bytes ...\n\n", length_str);
 
-	if (use_clock) {
-		clock_end = get_clock();
-		clock_diff = clock_end - clock_start;
+	if (!only_prefault && !no_prefault) {
+		/* show both of results */
+		if (use_clock) {
+			result_clock[0] =
+				do_memcpy_clock(routines[i].fn, len, false);
+			result_clock[1] =
+				do_memcpy_clock(routines[i].fn, len, true);
+		} else {
+			result_bps[0] =
+				do_memcpy_gettimeofday(routines[i].fn,
+						len, false);
+			result_bps[1] =
+				do_memcpy_gettimeofday(routines[i].fn,
+						len, true);
+		}
 	} else {
-		BUG_ON(gettimeofday(&tv_end, NULL));
-		timersub(&tv_end, &tv_start, &tv_diff);
-		bps = (double)((double)length / timeval2double(&tv_diff));
+		if (use_clock) {
+			result_clock[pf] =
+				do_memcpy_clock(routines[i].fn,
+						len, only_prefault);
+		} else {
+			result_bps[pf] =
+				do_memcpy_gettimeofday(routines[i].fn,
+						len, only_prefault);
+		}
 	}
 
 	switch (bench_format) {
 	case BENCH_FORMAT_DEFAULT:
-		if (use_clock) {
-			printf(" %14lf Clock/Byte\n",
-			       (double)clock_diff / (double)length);
-		} else {
-			if (bps < K)
-				printf(" %14lf B/Sec\n", bps);
-			else if (bps < K * K)
-				printf(" %14lfd KB/Sec\n", bps / 1024);
-			else if (bps < K * K * K)
-				printf(" %14lf MB/Sec\n", bps / 1024 / 1024);
-			else {
-				printf(" %14lf GB/Sec\n",
-				       bps / 1024 / 1024 / 1024);
+		if (!only_prefault && !no_prefault) {
+			if (use_clock) {
+				printf(" %14lf Clock/Byte\n",
+					(double)result_clock[0]
+					/ (double)len);
+				printf(" %14lf Clock/Byte (with prefault)\n",
+					(double)result_clock[1]
+					/ (double)len);
+			} else {
+				print_bps(result_bps[0]);
+				printf("\n");
+				print_bps(result_bps[1]);
+				printf(" (with prefault)\n");
 			}
+		} else {
+			if (use_clock) {
+				printf(" %14lf Clock/Byte",
+					(double)result_clock[pf]
+					/ (double)len);
+			} else
+				print_bps(result_bps[pf]);
+
+			printf("%s\n", only_prefault ? " (with prefault)" : "");
 		}
 		break;
 	case BENCH_FORMAT_SIMPLE:
-		if (use_clock) {
-			printf("%14lf\n",
-			       (double)clock_diff / (double)length);
-		} else
-			printf("%lf\n", bps);
+		if (!only_prefault && !no_prefault) {
+			if (use_clock) {
+				printf("%lf %lf\n",
+					(double)result_clock[0] / (double)len,
+					(double)result_clock[1] / (double)len);
+			} else {
+				printf("%lf %lf\n",
+					result_bps[0], result_bps[1]);
+			}
+		} else {
+			if (use_clock) {
+				printf("%lf\n", (double)result_clock[pf]
+					/ (double)len);
+			} else
+				printf("%lf\n", result_bps[pf]);
+		}
 		break;
 	default:
 		/* reaching this means there's some disaster: */
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 2/2] perf bench: port arch/x86/lib/memcpy_64.S to perf bench mem memcpy
  2010-11-25  7:04                   ` Hitoshi Mitake
  2010-11-25  7:04                     ` [PATCH v2 1/2] " Hitoshi Mitake
@ 2010-11-25  7:04                     ` Hitoshi Mitake
  2010-11-26 10:31                       ` [tip:perf/core] perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem' tip-bot for Hitoshi Mitake
  1 sibling, 1 reply; 30+ messages in thread
From: Hitoshi Mitake @ 2010-11-25  7:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mitake, h.mitake, Miao Xie, Ma Ling, Zhao Yakui,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Paul Mackerras,
	Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen

This patch ports arch/x86/lib/memcpy_64.S to perf bench mem memcpy
for benchmarking memcpy() in userland with tricky and dirty way.

util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
util/include/linux/linkage.h are dummy (but do a little work) for
including memcpy_64.S without modification to it (e.g. defining ENTRY()).

This makes checkpatch.pl angry like this:
\#177: FILE: tools/perf/util/include/linux/linkage.h:7:
+#define ENTRY(name)                            \
+       .globl name;                            \
+       name:

WARNING: labels should not be indented
\#179: FILE: tools/perf/util/include/linux/linkage.h:9:
+       name:

because checkpatch.pl treat this file as the file written in C.
But I think this can be forgived because original include/linux/linkage.h
is doing the similar thing.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
---
 tools/perf/Makefile                          |   11 +++++++++++
 tools/perf/bench/mem-memcpy-arch.h           |   12 ++++++++++++
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h |    4 ++++
 tools/perf/bench/mem-memcpy-x86-64-asm.S     |    2 ++
 tools/perf/util/include/asm/cpufeature.h     |    9 +++++++++
 tools/perf/util/include/asm/dwarf2.h         |   11 +++++++++++
 tools/perf/util/include/linux/linkage.h      |   13 +++++++++++++
 7 files changed, 62 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/bench/mem-memcpy-arch.h
 create mode 100644 tools/perf/bench/mem-memcpy-x86-64-asm-def.h
 create mode 100644 tools/perf/bench/mem-memcpy-x86-64-asm.S
 create mode 100644 tools/perf/util/include/asm/cpufeature.h
 create mode 100644 tools/perf/util/include/asm/dwarf2.h
 create mode 100644 tools/perf/util/include/linux/linkage.h

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 2d414b3..b3e6bc6 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -185,7 +185,10 @@ ifeq ($(ARCH),i386)
         ARCH := x86
 endif
 ifeq ($(ARCH),x86_64)
+	RAW_ARCH := x86_64
         ARCH := x86
+	ARCH_CFLAGS := -DARCH_X86_64
+	ARCH_INCLUDE = ../../arch/x86/lib/memcpy_64.S
 endif
 
 # CFLAGS and LDFLAGS are for the users to override from the command line.
@@ -375,6 +378,7 @@ LIB_H += util/include/linux/prefetch.h
 LIB_H += util/include/linux/rbtree.h
 LIB_H += util/include/linux/string.h
 LIB_H += util/include/linux/types.h
+LIB_H += util/include/linux/linkage.h
 LIB_H += util/include/asm/asm-offsets.h
 LIB_H += util/include/asm/bug.h
 LIB_H += util/include/asm/byteorder.h
@@ -383,6 +387,8 @@ LIB_H += util/include/asm/swab.h
 LIB_H += util/include/asm/system.h
 LIB_H += util/include/asm/uaccess.h
 LIB_H += util/include/dwarf-regs.h
+LIB_H += util/include/asm/dwarf2.h
+LIB_H += util/include/asm/cpufeature.h
 LIB_H += perf.h
 LIB_H += util/cache.h
 LIB_H += util/callchain.h
@@ -417,6 +423,7 @@ LIB_H += util/probe-finder.h
 LIB_H += util/probe-event.h
 LIB_H += util/pstack.h
 LIB_H += util/cpumap.h
+LIB_H += $(ARCH_INCLUDE)
 
 LIB_OBJS += $(OUTPUT)util/abspath.o
 LIB_OBJS += $(OUTPUT)util/alias.o
@@ -472,6 +479,9 @@ BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
 # Benchmark modules
 BUILTIN_OBJS += $(OUTPUT)bench/sched-messaging.o
 BUILTIN_OBJS += $(OUTPUT)bench/sched-pipe.o
+ifeq ($(RAW_ARCH),x86_64)
+BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy-x86-64-asm.o
+endif
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
 
 BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
@@ -909,6 +919,7 @@ BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \
 LIB_OBJS += $(COMPAT_OBJS)
 
 ALL_CFLAGS += $(BASIC_CFLAGS)
+ALL_CFLAGS += $(ARCH_CFLAGS)
 ALL_LDFLAGS += $(BASIC_LDFLAGS)
 
 export TAR INSTALL DESTDIR SHELL_PATH
diff --git a/tools/perf/bench/mem-memcpy-arch.h b/tools/perf/bench/mem-memcpy-arch.h
new file mode 100644
index 0000000..a72e36c
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-arch.h
@@ -0,0 +1,12 @@
+
+#ifdef ARCH_X86_64
+
+#define MEMCPY_FN(fn, name, desc)		\
+	extern void *fn(void *, const void *, size_t);
+
+#include "mem-memcpy-x86-64-asm-def.h"
+
+#undef MEMCPY_FN
+
+#endif
+
diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
new file mode 100644
index 0000000..d588b87
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
@@ -0,0 +1,4 @@
+
+MEMCPY_FN(__memcpy,
+	"x86-64-unrolled",
+	"unrolled memcpy() in arch/x86/lib/memcpy_64.S")
diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm.S b/tools/perf/bench/mem-memcpy-x86-64-asm.S
new file mode 100644
index 0000000..a57b66e
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-x86-64-asm.S
@@ -0,0 +1,2 @@
+
+#include "../../../arch/x86/lib/memcpy_64.S"
diff --git a/tools/perf/util/include/asm/cpufeature.h b/tools/perf/util/include/asm/cpufeature.h
new file mode 100644
index 0000000..acffd5e
--- /dev/null
+++ b/tools/perf/util/include/asm/cpufeature.h
@@ -0,0 +1,9 @@
+
+#ifndef PERF_CPUFEATURE_H
+#define PERF_CPUFEATURE_H
+
+/* cpufeature.h ... dummy header file for including arch/x86/lib/memcpy_64.S */
+
+#define X86_FEATURE_REP_GOOD 0
+
+#endif	/* PERF_CPUFEATURE_H */
diff --git a/tools/perf/util/include/asm/dwarf2.h b/tools/perf/util/include/asm/dwarf2.h
new file mode 100644
index 0000000..bb4198e
--- /dev/null
+++ b/tools/perf/util/include/asm/dwarf2.h
@@ -0,0 +1,11 @@
+
+#ifndef PERF_DWARF2_H
+#define PERF_DWARF2_H
+
+/* dwarf2.h ... dummy header file for including arch/x86/lib/memcpy_64.S */
+
+#define CFI_STARTPROC
+#define CFI_ENDPROC
+
+#endif	/* PERF_DWARF2_H */
+
diff --git a/tools/perf/util/include/linux/linkage.h b/tools/perf/util/include/linux/linkage.h
new file mode 100644
index 0000000..06387cf
--- /dev/null
+++ b/tools/perf/util/include/linux/linkage.h
@@ -0,0 +1,13 @@
+
+#ifndef PERF_LINUX_LINKAGE_H_
+#define PERF_LINUX_LINKAGE_H_
+
+/* linkage.h ... for including arch/x86/lib/memcpy_64.S */
+
+#define ENTRY(name)				\
+	.globl name;				\
+	name:
+
+#define ENDPROC(name)
+
+#endif	/* PERF_LINUX_LINKAGE_H_ */
-- 
1.6.5.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default
  2010-11-25  7:04                     ` [PATCH v2 1/2] " Hitoshi Mitake
@ 2010-11-26 10:30                       ` tip-bot for Hitoshi Mitake
       [not found]                         ` <4D03B1AD.7000606@dcl.info.waseda.ac.jp>
  0 siblings, 1 reply; 30+ messages in thread
From: tip-bot for Hitoshi Mitake @ 2010-11-26 10:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, acme, hpa, mingo, andi, a.p.zijlstra,
	yakui.zhao, mitake, fweisbec, rostedt, ling.ma, tglx, miaox,
	mingo

Commit-ID:  49ce8fc651794878189fd5f273228832cdfb5be9
Gitweb:     http://git.kernel.org/tip/49ce8fc651794878189fd5f273228832cdfb5be9
Author:     Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
AuthorDate: Thu, 25 Nov 2010 16:04:52 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 26 Nov 2010 08:15:57 +0100

perf bench: Print both of prefaulted and no prefaulted results by default

After applying this patch, perf bench mem memcpy prints
both of prefualted and without prefaulted score of memcpy().

New options --no-prefault and --only-prefault are added
to print single result, mainly for scripting usage.

Usage example:

 | mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
 | # Running mem/memcpy benchmark...
 | # Copying 500MB Bytes ...
 |
 |      634.969014 MB/Sec
 |        4.828062 GB/Sec (with prefault)
 | mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
 | # Running mem/memcpy benchmark...
 | # Copying 500MB Bytes ...
 |
 |        4.705192 GB/Sec (with prefault)
 | mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
 | # Running mem/memcpy benchmark...
 | # Copying 500MB Bytes ...
 |
 |      642.725568 MB/Sec

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: h.mitake@gmail.com
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <1290668693-27068-1-git-send-email-mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 tools/perf/bench/mem-memcpy.c |  219 ++++++++++++++++++++++++++++++-----------
 1 files changed, 162 insertions(+), 57 deletions(-)

diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index 38dae74..db82021 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -12,6 +12,7 @@
 #include "../util/parse-options.h"
 #include "../util/header.h"
 #include "bench.h"
+#include "mem-memcpy-arch.h"
 
 #include <stdio.h>
 #include <stdlib.h>
@@ -23,8 +24,10 @@
 
 static const char	*length_str	= "1MB";
 static const char	*routine	= "default";
-static bool		use_clock	= false;
+static bool		use_clock;
 static int		clock_fd;
+static bool		only_prefault;
+static bool		no_prefault;
 
 static const struct option options[] = {
 	OPT_STRING('l', "length", &length_str, "1MB",
@@ -34,19 +37,33 @@ static const struct option options[] = {
 		    "Specify routine to copy"),
 	OPT_BOOLEAN('c', "clock", &use_clock,
 		    "Use CPU clock for measuring"),
+	OPT_BOOLEAN('o', "only-prefault", &only_prefault,
+		    "Show only the result with page faults before memcpy()"),
+	OPT_BOOLEAN('n', "no-prefault", &no_prefault,
+		    "Show only the result without page faults before memcpy()"),
 	OPT_END()
 };
 
+typedef void *(*memcpy_t)(void *, const void *, size_t);
+
 struct routine {
 	const char *name;
 	const char *desc;
-	void * (*fn)(void *dst, const void *src, size_t len);
+	memcpy_t fn;
 };
 
 struct routine routines[] = {
 	{ "default",
 	  "Default memcpy() provided by glibc",
 	  memcpy },
+#ifdef ARCH_X86_64
+
+#define MEMCPY_FN(fn, name, desc) { name, desc, fn },
+#include "mem-memcpy-x86-64-asm-def.h"
+#undef MEMCPY_FN
+
+#endif
+
 	{ NULL,
 	  NULL,
 	  NULL   }
@@ -89,29 +106,98 @@ static double timeval2double(struct timeval *ts)
 		(double)ts->tv_usec / (double)1000000;
 }
 
+static void alloc_mem(void **dst, void **src, size_t length)
+{
+	*dst = zalloc(length);
+	if (!dst)
+		die("memory allocation failed - maybe length is too large?\n");
+
+	*src = zalloc(length);
+	if (!src)
+		die("memory allocation failed - maybe length is too large?\n");
+}
+
+static u64 do_memcpy_clock(memcpy_t fn, size_t len, bool prefault)
+{
+	u64 clock_start = 0ULL, clock_end = 0ULL;
+	void *src = NULL, *dst = NULL;
+
+	alloc_mem(&src, &dst, len);
+
+	if (prefault)
+		fn(dst, src, len);
+
+	clock_start = get_clock();
+	fn(dst, src, len);
+	clock_end = get_clock();
+
+	free(src);
+	free(dst);
+	return clock_end - clock_start;
+}
+
+static double do_memcpy_gettimeofday(memcpy_t fn, size_t len, bool prefault)
+{
+	struct timeval tv_start, tv_end, tv_diff;
+	void *src = NULL, *dst = NULL;
+
+	alloc_mem(&src, &dst, len);
+
+	if (prefault)
+		fn(dst, src, len);
+
+	BUG_ON(gettimeofday(&tv_start, NULL));
+	fn(dst, src, len);
+	BUG_ON(gettimeofday(&tv_end, NULL));
+
+	timersub(&tv_end, &tv_start, &tv_diff);
+
+	free(src);
+	free(dst);
+	return (double)((double)len / timeval2double(&tv_diff));
+}
+
+#define pf (no_prefault ? 0 : 1)
+
+#define print_bps(x) do {					\
+		if (x < K)					\
+			printf(" %14lf B/Sec", x);		\
+		else if (x < K * K)				\
+			printf(" %14lfd KB/Sec", x / K);	\
+		else if (x < K * K * K)				\
+			printf(" %14lf MB/Sec", x / K / K);	\
+		else						\
+			printf(" %14lf GB/Sec", x / K / K / K); \
+	} while (0)
+
 int bench_mem_memcpy(int argc, const char **argv,
 		     const char *prefix __used)
 {
 	int i;
-	void *dst, *src;
-	size_t length;
-	double bps = 0.0;
-	struct timeval tv_start, tv_end, tv_diff;
-	u64 clock_start, clock_end, clock_diff;
+	size_t len;
+	double result_bps[2];
+	u64 result_clock[2];
 
-	clock_start = clock_end = clock_diff = 0ULL;
 	argc = parse_options(argc, argv, options,
 			     bench_mem_memcpy_usage, 0);
 
-	tv_diff.tv_sec = 0;
-	tv_diff.tv_usec = 0;
-	length = (size_t)perf_atoll((char *)length_str);
+	if (use_clock)
+		init_clock();
+
+	len = (size_t)perf_atoll((char *)length_str);
 
-	if ((s64)length <= 0) {
+	result_clock[0] = result_clock[1] = 0ULL;
+	result_bps[0] = result_bps[1] = 0.0;
+
+	if ((s64)len <= 0) {
 		fprintf(stderr, "Invalid length:%s\n", length_str);
 		return 1;
 	}
 
+	/* same to without specifying either of prefault and no-prefault */
+	if (only_prefault && no_prefault)
+		only_prefault = no_prefault = false;
+
 	for (i = 0; routines[i].name; i++) {
 		if (!strcmp(routines[i].name, routine))
 			break;
@@ -126,61 +212,80 @@ int bench_mem_memcpy(int argc, const char **argv,
 		return 1;
 	}
 
-	dst = zalloc(length);
-	if (!dst)
-		die("memory allocation failed - maybe length is too large?\n");
-
-	src = zalloc(length);
-	if (!src)
-		die("memory allocation failed - maybe length is too large?\n");
-
-	if (bench_format == BENCH_FORMAT_DEFAULT) {
-		printf("# Copying %s Bytes from %p to %p ...\n\n",
-		       length_str, src, dst);
-	}
-
-	if (use_clock) {
-		init_clock();
-		clock_start = get_clock();
-	} else {
-		BUG_ON(gettimeofday(&tv_start, NULL));
-	}
-
-	routines[i].fn(dst, src, length);
+	if (bench_format == BENCH_FORMAT_DEFAULT)
+		printf("# Copying %s Bytes ...\n\n", length_str);
 
-	if (use_clock) {
-		clock_end = get_clock();
-		clock_diff = clock_end - clock_start;
+	if (!only_prefault && !no_prefault) {
+		/* show both of results */
+		if (use_clock) {
+			result_clock[0] =
+				do_memcpy_clock(routines[i].fn, len, false);
+			result_clock[1] =
+				do_memcpy_clock(routines[i].fn, len, true);
+		} else {
+			result_bps[0] =
+				do_memcpy_gettimeofday(routines[i].fn,
+						len, false);
+			result_bps[1] =
+				do_memcpy_gettimeofday(routines[i].fn,
+						len, true);
+		}
 	} else {
-		BUG_ON(gettimeofday(&tv_end, NULL));
-		timersub(&tv_end, &tv_start, &tv_diff);
-		bps = (double)((double)length / timeval2double(&tv_diff));
+		if (use_clock) {
+			result_clock[pf] =
+				do_memcpy_clock(routines[i].fn,
+						len, only_prefault);
+		} else {
+			result_bps[pf] =
+				do_memcpy_gettimeofday(routines[i].fn,
+						len, only_prefault);
+		}
 	}
 
 	switch (bench_format) {
 	case BENCH_FORMAT_DEFAULT:
-		if (use_clock) {
-			printf(" %14lf Clock/Byte\n",
-			       (double)clock_diff / (double)length);
-		} else {
-			if (bps < K)
-				printf(" %14lf B/Sec\n", bps);
-			else if (bps < K * K)
-				printf(" %14lfd KB/Sec\n", bps / 1024);
-			else if (bps < K * K * K)
-				printf(" %14lf MB/Sec\n", bps / 1024 / 1024);
-			else {
-				printf(" %14lf GB/Sec\n",
-				       bps / 1024 / 1024 / 1024);
+		if (!only_prefault && !no_prefault) {
+			if (use_clock) {
+				printf(" %14lf Clock/Byte\n",
+					(double)result_clock[0]
+					/ (double)len);
+				printf(" %14lf Clock/Byte (with prefault)\n",
+					(double)result_clock[1]
+					/ (double)len);
+			} else {
+				print_bps(result_bps[0]);
+				printf("\n");
+				print_bps(result_bps[1]);
+				printf(" (with prefault)\n");
 			}
+		} else {
+			if (use_clock) {
+				printf(" %14lf Clock/Byte",
+					(double)result_clock[pf]
+					/ (double)len);
+			} else
+				print_bps(result_bps[pf]);
+
+			printf("%s\n", only_prefault ? " (with prefault)" : "");
 		}
 		break;
 	case BENCH_FORMAT_SIMPLE:
-		if (use_clock) {
-			printf("%14lf\n",
-			       (double)clock_diff / (double)length);
-		} else
-			printf("%lf\n", bps);
+		if (!only_prefault && !no_prefault) {
+			if (use_clock) {
+				printf("%lf %lf\n",
+					(double)result_clock[0] / (double)len,
+					(double)result_clock[1] / (double)len);
+			} else {
+				printf("%lf %lf\n",
+					result_bps[0], result_bps[1]);
+			}
+		} else {
+			if (use_clock) {
+				printf("%lf\n", (double)result_clock[pf]
+					/ (double)len);
+			} else
+				printf("%lf\n", result_bps[pf]);
+		}
 		break;
 	default:
 		/* reaching this means there's some disaster: */

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [tip:perf/core] perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'
  2010-11-25  7:04                     ` [PATCH v2 2/2] perf bench: port arch/x86/lib/memcpy_64.S to perf bench mem memcpy Hitoshi Mitake
@ 2010-11-26 10:31                       ` tip-bot for Hitoshi Mitake
  2010-11-29 13:26                         ` Hitoshi Mitake
  0 siblings, 1 reply; 30+ messages in thread
From: tip-bot for Hitoshi Mitake @ 2010-11-26 10:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, acme, hpa, mingo, andi, a.p.zijlstra,
	yakui.zhao, mitake, fweisbec, rostedt, ling.ma, tglx, miaox,
	mingo

Commit-ID:  ea7872b9d6a81101f6ba0ec141544a62fea35876
Gitweb:     http://git.kernel.org/tip/ea7872b9d6a81101f6ba0ec141544a62fea35876
Author:     Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
AuthorDate: Thu, 25 Nov 2010 16:04:53 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 26 Nov 2010 08:15:57 +0100

perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'

This patch ports arch/x86/lib/memcpy_64.S to perf bench mem
memcpy for benchmarking memcpy() in userland with tricky and
dirty way.

util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
util/include/linux/linkage.h are mostly dummy files with small
wrappers, so that we are able to include memcpy_64.S
unmodified.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: h.mitake@gmail.com
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <1290668693-27068-2-git-send-email-mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 tools/perf/Makefile                          |   11 +++++++++++
 tools/perf/bench/mem-memcpy-arch.h           |   12 ++++++++++++
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h |    4 ++++
 tools/perf/bench/mem-memcpy-x86-64-asm.S     |    2 ++
 tools/perf/util/include/asm/cpufeature.h     |    9 +++++++++
 tools/perf/util/include/asm/dwarf2.h         |   11 +++++++++++
 tools/perf/util/include/linux/linkage.h      |   13 +++++++++++++
 7 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 74b684d..e0db197 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -185,7 +185,10 @@ ifeq ($(ARCH),i386)
         ARCH := x86
 endif
 ifeq ($(ARCH),x86_64)
+	RAW_ARCH := x86_64
         ARCH := x86
+	ARCH_CFLAGS := -DARCH_X86_64
+	ARCH_INCLUDE = ../../arch/x86/lib/memcpy_64.S
 endif
 
 # CFLAGS and LDFLAGS are for the users to override from the command line.
@@ -375,6 +378,7 @@ LIB_H += util/include/linux/prefetch.h
 LIB_H += util/include/linux/rbtree.h
 LIB_H += util/include/linux/string.h
 LIB_H += util/include/linux/types.h
+LIB_H += util/include/linux/linkage.h
 LIB_H += util/include/asm/asm-offsets.h
 LIB_H += util/include/asm/bug.h
 LIB_H += util/include/asm/byteorder.h
@@ -383,6 +387,8 @@ LIB_H += util/include/asm/swab.h
 LIB_H += util/include/asm/system.h
 LIB_H += util/include/asm/uaccess.h
 LIB_H += util/include/dwarf-regs.h
+LIB_H += util/include/asm/dwarf2.h
+LIB_H += util/include/asm/cpufeature.h
 LIB_H += perf.h
 LIB_H += util/cache.h
 LIB_H += util/callchain.h
@@ -417,6 +423,7 @@ LIB_H += util/probe-finder.h
 LIB_H += util/probe-event.h
 LIB_H += util/pstack.h
 LIB_H += util/cpumap.h
+LIB_H += $(ARCH_INCLUDE)
 
 LIB_OBJS += $(OUTPUT)util/abspath.o
 LIB_OBJS += $(OUTPUT)util/alias.o
@@ -472,6 +479,9 @@ BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
 # Benchmark modules
 BUILTIN_OBJS += $(OUTPUT)bench/sched-messaging.o
 BUILTIN_OBJS += $(OUTPUT)bench/sched-pipe.o
+ifeq ($(RAW_ARCH),x86_64)
+BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy-x86-64-asm.o
+endif
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
 
 BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
@@ -898,6 +908,7 @@ BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \
 LIB_OBJS += $(COMPAT_OBJS)
 
 ALL_CFLAGS += $(BASIC_CFLAGS)
+ALL_CFLAGS += $(ARCH_CFLAGS)
 ALL_LDFLAGS += $(BASIC_LDFLAGS)
 
 export TAR INSTALL DESTDIR SHELL_PATH
diff --git a/tools/perf/bench/mem-memcpy-arch.h b/tools/perf/bench/mem-memcpy-arch.h
new file mode 100644
index 0000000..a72e36c
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-arch.h
@@ -0,0 +1,12 @@
+
+#ifdef ARCH_X86_64
+
+#define MEMCPY_FN(fn, name, desc)		\
+	extern void *fn(void *, const void *, size_t);
+
+#include "mem-memcpy-x86-64-asm-def.h"
+
+#undef MEMCPY_FN
+
+#endif
+
diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
new file mode 100644
index 0000000..d588b87
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
@@ -0,0 +1,4 @@
+
+MEMCPY_FN(__memcpy,
+	"x86-64-unrolled",
+	"unrolled memcpy() in arch/x86/lib/memcpy_64.S")
diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm.S b/tools/perf/bench/mem-memcpy-x86-64-asm.S
new file mode 100644
index 0000000..a57b66e
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-x86-64-asm.S
@@ -0,0 +1,2 @@
+
+#include "../../../arch/x86/lib/memcpy_64.S"
diff --git a/tools/perf/util/include/asm/cpufeature.h b/tools/perf/util/include/asm/cpufeature.h
new file mode 100644
index 0000000..acffd5e
--- /dev/null
+++ b/tools/perf/util/include/asm/cpufeature.h
@@ -0,0 +1,9 @@
+
+#ifndef PERF_CPUFEATURE_H
+#define PERF_CPUFEATURE_H
+
+/* cpufeature.h ... dummy header file for including arch/x86/lib/memcpy_64.S */
+
+#define X86_FEATURE_REP_GOOD 0
+
+#endif	/* PERF_CPUFEATURE_H */
diff --git a/tools/perf/util/include/asm/dwarf2.h b/tools/perf/util/include/asm/dwarf2.h
new file mode 100644
index 0000000..bb4198e
--- /dev/null
+++ b/tools/perf/util/include/asm/dwarf2.h
@@ -0,0 +1,11 @@
+
+#ifndef PERF_DWARF2_H
+#define PERF_DWARF2_H
+
+/* dwarf2.h ... dummy header file for including arch/x86/lib/memcpy_64.S */
+
+#define CFI_STARTPROC
+#define CFI_ENDPROC
+
+#endif	/* PERF_DWARF2_H */
+
diff --git a/tools/perf/util/include/linux/linkage.h b/tools/perf/util/include/linux/linkage.h
new file mode 100644
index 0000000..06387cf
--- /dev/null
+++ b/tools/perf/util/include/linux/linkage.h
@@ -0,0 +1,13 @@
+
+#ifndef PERF_LINUX_LINKAGE_H_
+#define PERF_LINUX_LINKAGE_H_
+
+/* linkage.h ... for including arch/x86/lib/memcpy_64.S */
+
+#define ENTRY(name)				\
+	.globl name;				\
+	name:
+
+#define ENDPROC(name)
+
+#endif	/* PERF_LINUX_LINKAGE_H_ */

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [tip:perf/core] perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'
  2010-11-26 10:31                       ` [tip:perf/core] perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem' tip-bot for Hitoshi Mitake
@ 2010-11-29 13:26                         ` Hitoshi Mitake
  0 siblings, 0 replies; 30+ messages in thread
From: Hitoshi Mitake @ 2010-11-29 13:26 UTC (permalink / raw)
  To: mingo, hpa, acme, paulus, linux-kernel, andi, a.p.zijlstra,
	yakui.zhao, mitake, fweisbec, ling.ma, rostedt, miaox, tglx,
	mingo
  Cc: linux-tip-commits

On 2010年11月26日 19:31, tip-bot for Hitoshi Mitake wrote:
 > Commit-ID:  ea7872b9d6a81101f6ba0ec141544a62fea35876
 > Gitweb: 
http://git.kernel.org/tip/ea7872b9d6a81101f6ba0ec141544a62fea35876
 > Author:     Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>
 > AuthorDate: Thu, 25 Nov 2010 16:04:53 +0900
 > Committer:  Ingo Molnar<mingo@elte.hu>
 > CommitDate: Fri, 26 Nov 2010 08:15:57 +0100
 >
 > perf bench: Add feature that measures the performance of the 
arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'
 >
 > This patch ports arch/x86/lib/memcpy_64.S to perf bench mem
 > memcpy for benchmarking memcpy() in userland with tricky and
 > dirty way.
 >
 > util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
 > util/include/linux/linkage.h are mostly dummy files with small
 > wrappers, so that we are able to include memcpy_64.S
 > unmodified.
 >
 > Signed-off-by: Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>
 > Cc: h.mitake@gmail.com
 > Cc: Miao Xie<miaox@cn.fujitsu.com>
 > Cc: Ma Ling<ling.ma@intel.com>
 > Cc: Zhao Yakui<yakui.zhao@intel.com>
 > Cc: Peter Zijlstra<a.p.zijlstra@chello.nl>
 > Cc: Arnaldo Carvalho de Melo<acme@redhat.com>
 > Cc: Paul Mackerras<paulus@samba.org>
 > Cc: Frederic Weisbecker<fweisbec@gmail.com>
 > Cc: Steven Rostedt<rostedt@goodmis.org>
 > Cc: Andi Kleen<andi@firstfloor.org>
 > 
LKML-Reference:<1290668693-27068-2-git-send-email-mitake@dcl.info.waseda.ac.jp>
 > Signed-off-by: Ingo Molnar<mingo@elte.hu>
 > ---
 >   tools/perf/Makefile                          |   11 +++++++++++
 >   tools/perf/bench/mem-memcpy-arch.h           |   12 ++++++++++++
 >   tools/perf/bench/mem-memcpy-x86-64-asm-def.h |    4 ++++
 >   tools/perf/bench/mem-memcpy-x86-64-asm.S     |    2 ++
 >   tools/perf/util/include/asm/cpufeature.h     |    9 +++++++++
 >   tools/perf/util/include/asm/dwarf2.h         |   11 +++++++++++
 >   tools/perf/util/include/linux/linkage.h      |   13 +++++++++++++
 >   7 files changed, 62 insertions(+), 0 deletions(-)
 >
 > diff --git a/tools/perf/Makefile b/tools/perf/Makefile
 > index 74b684d..e0db197 100644
 > --- a/tools/perf/Makefile
 > +++ b/tools/perf/Makefile
 > @@ -185,7 +185,10 @@ ifeq ($(ARCH),i386)
 >           ARCH := x86
 >   endif
 >   ifeq ($(ARCH),x86_64)
 > +	RAW_ARCH := x86_64
 >           ARCH := x86
 > +	ARCH_CFLAGS := -DARCH_X86_64
 > +	ARCH_INCLUDE = ../../arch/x86/lib/memcpy_64.S
 >   endif
 >
 >   # CFLAGS and LDFLAGS are for the users to override from the command 
line.
 > @@ -375,6 +378,7 @@ LIB_H += util/include/linux/prefetch.h
 >   LIB_H += util/include/linux/rbtree.h
 >   LIB_H += util/include/linux/string.h
 >   LIB_H += util/include/linux/types.h
 > +LIB_H += util/include/linux/linkage.h
 >   LIB_H += util/include/asm/asm-offsets.h
 >   LIB_H += util/include/asm/bug.h
 >   LIB_H += util/include/asm/byteorder.h
 > @@ -383,6 +387,8 @@ LIB_H += util/include/asm/swab.h
 >   LIB_H += util/include/asm/system.h
 >   LIB_H += util/include/asm/uaccess.h
 >   LIB_H += util/include/dwarf-regs.h
 > +LIB_H += util/include/asm/dwarf2.h
 > +LIB_H += util/include/asm/cpufeature.h
 >   LIB_H += perf.h
 >   LIB_H += util/cache.h
 >   LIB_H += util/callchain.h
 > @@ -417,6 +423,7 @@ LIB_H += util/probe-finder.h
 >   LIB_H += util/probe-event.h
 >   LIB_H += util/pstack.h
 >   LIB_H += util/cpumap.h
 > +LIB_H += $(ARCH_INCLUDE)
 >
 >   LIB_OBJS += $(OUTPUT)util/abspath.o
 >   LIB_OBJS += $(OUTPUT)util/alias.o
 > @@ -472,6 +479,9 @@ BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
 >   # Benchmark modules
 >   BUILTIN_OBJS += $(OUTPUT)bench/sched-messaging.o
 >   BUILTIN_OBJS += $(OUTPUT)bench/sched-pipe.o
 > +ifeq ($(RAW_ARCH),x86_64)
 > +BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy-x86-64-asm.o
 > +endif
 >   BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
 >
 >   BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
 > @@ -898,6 +908,7 @@ BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \
 >   LIB_OBJS += $(COMPAT_OBJS)
 >
 >   ALL_CFLAGS += $(BASIC_CFLAGS)
 > +ALL_CFLAGS += $(ARCH_CFLAGS)
 >   ALL_LDFLAGS += $(BASIC_LDFLAGS)
 >
 >   export TAR INSTALL DESTDIR SHELL_PATH
 > diff --git a/tools/perf/bench/mem-memcpy-arch.h 
b/tools/perf/bench/mem-memcpy-arch.h
 > new file mode 100644
 > index 0000000..a72e36c
 > --- /dev/null
 > +++ b/tools/perf/bench/mem-memcpy-arch.h
 > @@ -0,0 +1,12 @@
 > +
 > +#ifdef ARCH_X86_64
 > +
 > +#define MEMCPY_FN(fn, name, desc)		\
 > +	extern void *fn(void *, const void *, size_t);
 > +
 > +#include "mem-memcpy-x86-64-asm-def.h"
 > +
 > +#undef MEMCPY_FN
 > +
 > +#endif
 > +
 > diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h 
b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
 > new file mode 100644
 > index 0000000..d588b87
 > --- /dev/null
 > +++ b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
 > @@ -0,0 +1,4 @@
 > +
 > +MEMCPY_FN(__memcpy,
 > +	"x86-64-unrolled",
 > +	"unrolled memcpy() in arch/x86/lib/memcpy_64.S")
 > diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm.S 
b/tools/perf/bench/mem-memcpy-x86-64-asm.S
 > new file mode 100644
 > index 0000000..a57b66e
 > --- /dev/null
 > +++ b/tools/perf/bench/mem-memcpy-x86-64-asm.S
 > @@ -0,0 +1,2 @@
 > +
 > +#include "../../../arch/x86/lib/memcpy_64.S"
 > diff --git a/tools/perf/util/include/asm/cpufeature.h 
b/tools/perf/util/include/asm/cpufeature.h
 > new file mode 100644
 > index 0000000..acffd5e
 > --- /dev/null
 > +++ b/tools/perf/util/include/asm/cpufeature.h
 > @@ -0,0 +1,9 @@
 > +
 > +#ifndef PERF_CPUFEATURE_H
 > +#define PERF_CPUFEATURE_H
 > +
 > +/* cpufeature.h ... dummy header file for including 
arch/x86/lib/memcpy_64.S */
 > +
 > +#define X86_FEATURE_REP_GOOD 0
 > +
 > +#endif	/* PERF_CPUFEATURE_H */
 > diff --git a/tools/perf/util/include/asm/dwarf2.h 
b/tools/perf/util/include/asm/dwarf2.h
 > new file mode 100644
 > index 0000000..bb4198e
 > --- /dev/null
 > +++ b/tools/perf/util/include/asm/dwarf2.h
 > @@ -0,0 +1,11 @@
 > +
 > +#ifndef PERF_DWARF2_H
 > +#define PERF_DWARF2_H
 > +
 > +/* dwarf2.h ... dummy header file for including 
arch/x86/lib/memcpy_64.S */
 > +
 > +#define CFI_STARTPROC
 > +#define CFI_ENDPROC
 > +
 > +#endif	/* PERF_DWARF2_H */
 > +
 > diff --git a/tools/perf/util/include/linux/linkage.h 
b/tools/perf/util/include/linux/linkage.h
 > new file mode 100644
 > index 0000000..06387cf
 > --- /dev/null
 > +++ b/tools/perf/util/include/linux/linkage.h
 > @@ -0,0 +1,13 @@
 > +
 > +#ifndef PERF_LINUX_LINKAGE_H_
 > +#define PERF_LINUX_LINKAGE_H_
 > +
 > +/* linkage.h ... for including arch/x86/lib/memcpy_64.S */
 > +
 > +#define ENTRY(name)				\
 > +	.globl name;				\
 > +	name:
 > +
 > +#define ENDPROC(name)
 > +
 > +#endif	/* PERF_LINUX_LINKAGE_H_ */
 >

Thanks for your applying, Ingo!

BTW, I have a question.
Why does the symbol name of rep prefix memcpy() start from '.'?
The symbol name starts from '.' like ".Lmemcpy_c" cannot seen as
symbol name after compile.

I couldn't find the reason why .Lmemcpy_c has to start from '.'.
For example, clear_page in arch/x86/lib/clear_page_64.S
doesn't start from '.' but it is alternative function.

If there is no special reason, I'd like to rename it.

Thanks,
	Hitoshi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default
       [not found]                         ` <4D03B1AD.7000606@dcl.info.waseda.ac.jp>
@ 2010-12-12 13:46                           ` Arnaldo Carvalho de Melo
  2010-12-13 11:14                             ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Arnaldo Carvalho de Melo @ 2010-12-12 13:46 UTC (permalink / raw)
  To: Hitoshi Mitake
  Cc: mingo, hpa, paulus, linux-kernel, andi, a.p.zijlstra, yakui.zhao,
	fweisbec, ling.ma, rostedt, miaox, tglx, mingo, acme

Em Sun, Dec 12, 2010 at 02:15:25AM +0900, Hitoshi Mitake escreveu:
> BTW, I found that measuring performance of prefaulted memcpy()
> with perf stat is difficult. Because current perf stat monitors
> whole execution of program or range of perf stat lifetime.

> If perf stat and monitored program can interact and work
> synchronously, it will be better.

> For example, if perf stat waits on the unix domain socket
> before create_perf_stat_counter() and monitored program wakes perf stat
> up through the socket, more fine grain monitoring will be possible.

> I imagine the execution will be like this:
> perf stat --wait-on /tmp/perf_wait perf bench mem memcpy --wake-up  
> /tmp/perf_wait

> --wait-on is imaginaly option of perf stat, and the way of waking up
> perf stat is left to monitored program (in this case, --wake-up is
> used for specifying the name of the socket).

> I'd like to implement such a option to perf stat, how do you think?

Looks interesting, and also interesting would be to be able to place
probes that would wake up it too, for unmodified binaries to have
something similar.

Other kinds of triggers may be to hook on syscalls and when some
expression matches, like connecting to host 1.2.3.4, start monitoring,
stop when the socket is closed, i.e. monitor a connection lifetime, etc.

I think it is worth pursuing and encourage you to work on it :-)

- Arnaldo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default
  2010-12-12 13:46                           ` perf monitoring triggers Was: " Arnaldo Carvalho de Melo
@ 2010-12-13 11:14                             ` Peter Zijlstra
  2010-12-13 12:38                               ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2010-12-13 11:14 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Hitoshi Mitake, mingo, hpa, paulus, linux-kernel, andi,
	yakui.zhao, fweisbec, ling.ma, rostedt, miaox, tglx, mingo, acme

On Sun, 2010-12-12 at 11:46 -0200, Arnaldo Carvalho de Melo wrote:
> Em Sun, Dec 12, 2010 at 02:15:25AM +0900, Hitoshi Mitake escreveu:
> > BTW, I found that measuring performance of prefaulted memcpy()
> > with perf stat is difficult. Because current perf stat monitors
> > whole execution of program or range of perf stat lifetime.
> 
> > If perf stat and monitored program can interact and work
> > synchronously, it will be better.
> 
> > For example, if perf stat waits on the unix domain socket
> > before create_perf_stat_counter() and monitored program wakes perf stat
> > up through the socket, more fine grain monitoring will be possible.
> 
> > I imagine the execution will be like this:
> > perf stat --wait-on /tmp/perf_wait perf bench mem memcpy --wake-up  
> > /tmp/perf_wait
> 
> > --wait-on is imaginaly option of perf stat, and the way of waking up
> > perf stat is left to monitored program (in this case, --wake-up is
> > used for specifying the name of the socket).
> 
> > I'd like to implement such a option to perf stat, how do you think?
> 
> Looks interesting, and also interesting would be to be able to place
> probes that would wake up it too, for unmodified binaries to have
> something similar.
> 
> Other kinds of triggers may be to hook on syscalls and when some
> expression matches, like connecting to host 1.2.3.4, start monitoring,
> stop when the socket is closed, i.e. monitor a connection lifetime, etc.
> 
> I think it is worth pursuing and encourage you to work on it :-)

Sounds to me like you want something like a library with self-monitoring
stuff.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default
  2010-12-13 11:14                             ` Peter Zijlstra
@ 2010-12-13 12:38                               ` Arnaldo Carvalho de Melo
  2010-12-13 12:40                                 ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Arnaldo Carvalho de Melo @ 2010-12-13 12:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hitoshi Mitake, mingo, hpa, paulus, linux-kernel, andi,
	yakui.zhao, fweisbec, ling.ma, rostedt, miaox, tglx, mingo

Em Mon, Dec 13, 2010 at 12:14:33PM +0100, Peter Zijlstra escreveu:
> On Sun, 2010-12-12 at 11:46 -0200, Arnaldo Carvalho de Melo wrote:
> > Looks interesting, and also interesting would be to be able to place
> > probes that would wake up it too, for unmodified binaries to have
> > something similar.

> > Other kinds of triggers may be to hook on syscalls and when some
> > expression matches, like connecting to host 1.2.3.4, start monitoring,
> > stop when the socket is closed, i.e. monitor a connection lifetime, etc.
 
> Sounds to me like you want something like a library with self-monitoring
> stuff.

Yeah, that could be a way, an LD_PRELOAD thingy that would intercept
library calls, setup counters, start a monitoring thread, etc.

Along the lines of:

http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git;a=blob;f=libautocork.c

This one just intercepts calls, but the __init function could do the
rest.

To make it easier we could move the counter setup we have in record/top
to a library, etc.

- Arnaldo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default
  2010-12-13 12:38                               ` Arnaldo Carvalho de Melo
@ 2010-12-13 12:40                                 ` Peter Zijlstra
  2010-12-13 13:12                                   ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2010-12-13 12:40 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Hitoshi Mitake, mingo, hpa, paulus, linux-kernel, andi,
	yakui.zhao, fweisbec, ling.ma, rostedt, miaox, tglx, mingo

On Mon, 2010-12-13 at 10:38 -0200, Arnaldo Carvalho de Melo wrote:
> Em Mon, Dec 13, 2010 at 12:14:33PM +0100, Peter Zijlstra escreveu:
> > On Sun, 2010-12-12 at 11:46 -0200, Arnaldo Carvalho de Melo wrote:
> > > Looks interesting, and also interesting would be to be able to place
> > > probes that would wake up it too, for unmodified binaries to have
> > > something similar.
> 
> > > Other kinds of triggers may be to hook on syscalls and when some
> > > expression matches, like connecting to host 1.2.3.4, start monitoring,
> > > stop when the socket is closed, i.e. monitor a connection lifetime, etc.
>  
> > Sounds to me like you want something like a library with self-monitoring
> > stuff.
> 
> Yeah, that could be a way, an LD_PRELOAD thingy that would intercept
> library calls, setup counters, start a monitoring thread, etc.
> 
> Along the lines of:
> 
> http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git;a=blob;f=libautocork.c
> 
> This one just intercepts calls, but the __init function could do the
> rest.
> 
> To make it easier we could move the counter setup we have in record/top
> to a library, etc.

Nah, I was more thinking of something along the lines of libPAPI and
libpfmon. A library that contains the needed building blocks for apps to
profile themselves.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default
  2010-12-13 12:40                                 ` Peter Zijlstra
@ 2010-12-13 13:12                                   ` Arnaldo Carvalho de Melo
  2010-12-13 17:37                                     ` Hitoshi Mitake
  0 siblings, 1 reply; 30+ messages in thread
From: Arnaldo Carvalho de Melo @ 2010-12-13 13:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hitoshi Mitake, mingo, hpa, paulus, linux-kernel, andi,
	yakui.zhao, fweisbec, ling.ma, rostedt, miaox, tglx, mingo

Em Mon, Dec 13, 2010 at 01:40:59PM +0100, Peter Zijlstra escreveu:
> On Mon, 2010-12-13 at 10:38 -0200, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Dec 13, 2010 at 12:14:33PM +0100, Peter Zijlstra escreveu:
> > > Sounds to me like you want something like a library with self-monitoring
> > > stuff.

> > Yeah, that could be a way, an LD_PRELOAD thingy that would intercept
> > library calls, setup counters, start a monitoring thread, etc.

> > To make it easier we could move the counter setup we have in record/top
> > to a library, etc.
> 
> Nah, I was more thinking of something along the lines of libPAPI and
> libpfmon. A library that contains the needed building blocks for apps to
> profile themselves.

Ok, you mean for the case where you can modify the app, I was thinking
about when you can't.

In both cases its good to move the counter creation, etc routines from
record/top to a lib, that then could be used in the way you mention, and
in the way I mention too. Two different usecases :-)

- Arnaldo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default
  2010-12-13 13:12                                   ` Arnaldo Carvalho de Melo
@ 2010-12-13 17:37                                     ` Hitoshi Mitake
  2010-12-14  5:46                                       ` [RFC PATCH 1/2] perf stat: wait on unix domain socket before calling sys_perf_event_open() Hitoshi Mitake
  2010-12-14  5:46                                       ` [RFC PATCH 2/2] perf bench: more fine grain monitoring for prefault memcpy() Hitoshi Mitake
  0 siblings, 2 replies; 30+ messages in thread
From: Hitoshi Mitake @ 2010-12-13 17:37 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, mingo, hpa, paulus, linux-kernel, andi,
	yakui.zhao, fweisbec, ling.ma, rostedt, miaox, tglx, mingo

On 2010年12月13日 22:12, Arnaldo Carvalho de Melo wrote:
> Em Mon, Dec 13, 2010 at 01:40:59PM +0100, Peter Zijlstra escreveu:
>> On Mon, 2010-12-13 at 10:38 -0200, Arnaldo Carvalho de Melo wrote:
>>> Em Mon, Dec 13, 2010 at 12:14:33PM +0100, Peter Zijlstra escreveu:
>>>> Sounds to me like you want something like a library with self-monitoring
>>>> stuff.
>
>>> Yeah, that could be a way, an LD_PRELOAD thingy that would intercept
>>> library calls, setup counters, start a monitoring thread, etc.
>
>>> To make it easier we could move the counter setup we have in record/top
>>> to a library, etc.
>>
>> Nah, I was more thinking of something along the lines of libPAPI and
>> libpfmon. A library that contains the needed building blocks for apps to
>> profile themselves.
>
> Ok, you mean for the case where you can modify the app, I was thinking
> about when you can't.
>
> In both cases its good to move the counter creation, etc routines from
> record/top to a lib, that then could be used in the way you mention, and
> in the way I mention too. Two different usecases :-)

Thanks for your comments, Arnaldo, Peter.

I implement basic feature of my proposal,
and found that communicating perf stat and benchmarking programs
via socket is really dirty. As you said, unified form,
interception for unmodified binary and library for modifiable binary,
will be ideal for fine grain monitoring.

But I believe that measuring performance of some sort of programs
like in kernel routines requires more fine grain perf stating,
so I'll seek the unified way.

Anyway, I'll send my proof of concept patch later.

Thanks,
         Hitoshi


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/2] perf stat: wait on unix domain socket before calling sys_perf_event_open()
  2010-12-13 17:37                                     ` Hitoshi Mitake
@ 2010-12-14  5:46                                       ` Hitoshi Mitake
  2010-12-14  5:46                                       ` [RFC PATCH 2/2] perf bench: more fine grain monitoring for prefault memcpy() Hitoshi Mitake
  1 sibling, 0 replies; 30+ messages in thread
From: Hitoshi Mitake @ 2010-12-14  5:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mitake, h.mitake, Miao Xie, Ma Ling, Zhao Yakui,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Paul Mackerras,
	Frederic Weisbecker, Steven Rostedt, Andi Kleen

This patch adds new option "--wait-on" option to perf stat.

Current perf stat can monitor
 1) lifetime of program specified as command line argument, or
 2) lifetime of perf stat. Target process is specified with pid,
    and end of monitoring is triggered with signal.
1) is too coarse grain. And 2) is difficult to distinguish the range to monitor.

This patch makes it possible to wait before sys_perf_event_open().
Monitored process can wake up perf stat via unix domain socket,
and terminate monitoring via signal.

New option --wait-on requires the string as the path of unix domain socket.
perf stat read the pid from the socket for target_pid. Monitored program
should write the pid of itself to it.
perf stat replies the pid of itself to monitored program. The monitored program
should send signal SIGINT to perf stat with this pid. Then monitoring is terminated.

I feel current implementation is really dirty. As Arnaldo and Peter suggested,
more unified way like interception or self monitoring library is ideal.
This is the proof of concept version. I'd like to hear your comments.

Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
---
 tools/perf/builtin-stat.c |   63 ++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 7ff746d..4cc10a1 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -51,6 +51,8 @@
 #include <sys/prctl.h>
 #include <math.h>
 #include <locale.h>
+#include <sys/socket.h>
+#include <sys/un.h>
 
 #define DEFAULT_SEPARATOR	" "
 
@@ -90,11 +92,15 @@ static const char		*cpu_list;
 static const char		*csv_sep			= NULL;
 static bool			csv_output			= false;
 
+static const char		*wait_path;
 
 static int			*fd[MAX_NR_CPUS][MAX_COUNTERS];
 
 static int			event_scaled[MAX_COUNTERS];
 
+static int			wait_fd				= -1;
+static struct sockaddr_un	wait_addr;
+
 static struct {
 	u64 val;
 	u64 ena;
@@ -342,7 +348,7 @@ static int run_perf_stat(int argc __used, const char **argv)
 	unsigned long long t0, t1;
 	int status = 0;
 	int counter, ncreated = 0;
-	int child_ready_pipe[2], go_pipe[2];
+	int child_ready_pipe[2], go_pipe[2], accepted_fd;
 	bool perm_err = false;
 	const bool forks = (argc > 0);
 	char buf;
@@ -401,6 +407,43 @@ static int run_perf_stat(int argc __used, const char **argv)
 		close(child_ready_pipe[0]);
 	}
 
+	if (wait_path) {
+		int sock_err;
+		struct sockaddr accepted_addr;
+		socklen_t accepted_len = sizeof(accepted_addr);
+
+		wait_fd = socket(PF_UNIX, SOCK_STREAM, 0);
+		if (wait_fd < 0)
+			die("unable to create socket for sync\n");
+
+		memset(&wait_addr, 0, sizeof(wait_addr));
+		wait_addr.sun_family = PF_UNIX;
+		strncpy(wait_addr.sun_path, wait_path,
+			sizeof(wait_addr.sun_path));
+
+		sock_err = bind(wait_fd, (struct sockaddr *)&wait_addr,
+				sizeof(wait_addr));
+		if (sock_err < 0)
+			die("bind() failed\n");
+
+		sock_err = listen(wait_fd, 1);
+		if (sock_err < 0)
+			die("listen() failed\n");
+
+		accepted_fd = accept(wait_fd, &accepted_addr, &accepted_len);
+		if (accepted_fd < 0)
+			die("accept() failed\n");
+
+		if (read(accepted_fd, &target_pid, sizeof(target_pid))
+			!= sizeof(target_pid))
+			die("read() pid from socket failed\n");
+
+		target_tid = target_pid;
+		thread_num = find_all_tid(target_pid, &all_tids);
+		if (thread_num <= 0)
+			die("couldn't find threads of %d\n", target_pid);
+	}
+
 	for (counter = 0; counter < nr_counters; counter++)
 		ncreated += create_perf_stat_counter(counter, &perm_err);
 
@@ -425,6 +468,14 @@ static int run_perf_stat(int argc __used, const char **argv)
 		close(go_pipe[1]);
 		wait(&status);
 	} else {
+		if (wait_path) {
+			pid_t myself = getpid();
+			if (write(accepted_fd, &myself, sizeof(myself))
+				!= sizeof(myself))
+				die("write() my pid failed\n");
+			close(accepted_fd);
+		}
+
 		while(!done) sleep(1);
 	}
 
@@ -670,6 +721,9 @@ static void sig_atexit(void)
 	if (signr == -1)
 		return;
 
+	if (wait_path)
+		unlink(wait_path);
+
 	signal(signr, SIG_DFL);
 	kill(getpid(), signr);
 }
@@ -715,6 +769,8 @@ static const struct option options[] = {
 		    "disable CPU count aggregation"),
 	OPT_STRING('x', "field-separator", &csv_sep, "separator",
 		   "print counts with custom separator"),
+	OPT_STRING('w', "wait-on", &wait_path, "path",
+		   "path of unix domain socket to wait on"),
 	OPT_END()
 };
 
@@ -746,7 +802,7 @@ int cmd_stat(int argc, const char **argv, const char *prefix __used)
 	} else if (big_num_opt == 0) /* User passed --no-big-num */
 		big_num = false;
 
-	if (!argc && target_pid == -1 && target_tid == -1)
+	if (!argc && target_pid == -1 && target_tid == -1 && !wait_path)
 		usage_with_options(stat_usage, options);
 	if (run_count <= 0)
 		usage_with_options(stat_usage, options);
@@ -769,7 +825,8 @@ int cmd_stat(int argc, const char **argv, const char *prefix __used)
 	if (nr_cpus < 1)
 		usage_with_options(stat_usage, options);
 
-	if (target_pid != -1) {
+	/* if wait_path is specified, we read pid to monitor from it later */
+	if (target_pid != -1 && !wait_path) {
 		target_tid = target_pid;
 		thread_num = find_all_tid(target_pid, &all_tids);
 		if (thread_num <= 0) {
-- 
1.7.3.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 2/2] perf bench: more fine grain monitoring for prefault memcpy()
  2010-12-13 17:37                                     ` Hitoshi Mitake
  2010-12-14  5:46                                       ` [RFC PATCH 1/2] perf stat: wait on unix domain socket before calling sys_perf_event_open() Hitoshi Mitake
@ 2010-12-14  5:46                                       ` Hitoshi Mitake
  1 sibling, 0 replies; 30+ messages in thread
From: Hitoshi Mitake @ 2010-12-14  5:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mitake, h.mitake, Miao Xie, Ma Ling, Zhao Yakui,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Paul Mackerras,
	Frederic Weisbecker, Steven Rostedt, Andi Kleen

This patch makes perf bench mem memcpy to use the new feature of perf stat.

New option --wake-up requires path name of unix domain socket.
If --only-prefault or --no-prefault is specified, the pid of itself is written
to this socket before actual memcpy() to be monitored. And the pid of perf stat
is read from it. The pid of perf stat is used for signaling perf stat
to terminate monitoring.

With this feature, the detailed performance monitoring of prefaulted
(or non prefaulted only) memcpy() will be possible.

Example of use, non prefaulted version:
| mitake@x201i:~/linux/.../tools/perf% sudo ./perf stat -w /tmp/perf-stat-wait
|

After execution, perf stat waits the pid...

|  Performance counter stats for process id '27109':
|
|         440.534943 task-clock-msecs         #      0.997 CPUs
|                 44 context-switches         #      0.000 M/sec
|                  5 CPU-migrations           #      0.000 M/sec
|            256,002 page-faults              #      0.581 M/sec
|        934,443,072 cycles                   #   2121.155 M/sec
|        780,408,435 instructions             #      0.835 IPC
|        111,756,558 branches                 #    253.684 M/sec
|            392,170 branch-misses            #      0.351 %
|          8,611,308 cache-references         #     19.547 M/sec
|          8,533,588 cache-misses             #     19.371 M/sec
|
|         0.441803031  seconds time elapsed

in another shell,

| mitake@x201i:~/linux/.../tools/perf% sudo ./perf bench mem memcpy -l 500MB --no-prefault -w /tmp/perf-stat-wait
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|        1.105722 GB/Sec

Example of use, prefaulted version:

| mitake@x201i:~/linux/.../tools/perf% sudo ./perf stat -w /tmp/perf-stat-wait
| Performance counter stats for process id '27112':
|
|         105.001542 task-clock-msecs         #      0.997 CPUs
|                 11 context-switches         #      0.000 M/sec
|                  0 CPU-migrations           #      0.000 M/sec
|                  2 page-faults              #      0.000 M/sec
|        223,273,425 cycles                   #   2126.382 M/sec
|        197,992,585 instructions             #      0.887 IPC
|         16,657,288 branches                 #    158.639 M/sec
|              1,942 branch-misses            #      0.012 %
|          3,105,619 cache-references         #     29.577 M/sec
|          3,082,390 cache-misses             #     29.356 M/sec
|
|         0.105316101  seconds time elapsed

in another shell,

| mitake@x201i:~/linux/.../tools/perf% sudo ./perf bench mem memcpy -l 500MB --only-prefault -w /tmp/perf-stat-wait
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
|        4.640927 GB/Sec (with prefault)

The result shows that the difference between non-prefaulted memcpy() and prefaulted one.
And this will be useful for detailed performance analysis of various memcpy()s
like Miao Xie's one and rep prefix version.

But this is too adhoc and dirty... :(

Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ma Ling <ling.ma@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
---
 tools/perf/bench/mem-memcpy.c |   56 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index ac88f52..7d0bcea 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -21,6 +21,10 @@
 #include <errno.h>
 #include <unistd.h>
 
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+
 #define K 1024
 
 static const char	*length_str	= "1MB";
@@ -31,6 +35,7 @@ static bool		only_prefault;
 static bool		no_prefault;
 static int		src_align;
 static int		dst_align;
+static const char	*wake_path;
 
 static const struct option options[] = {
 	OPT_STRING('l', "length", &length_str, "1MB",
@@ -48,6 +53,9 @@ static const struct option options[] = {
 		    "Alignment of source memory region (in byte)"),
 	OPT_INTEGER('d', "dst-alignment", &dst_align,
 		    "Alignment of destination memory region (in byte)"),
+	OPT_STRING('w', "wake-up", &wake_path, "default",
+		    "Path of unix domain socket for waking up perf stat"
+		    " (use with only_prefault option)"),
 	OPT_END()
 };
 
@@ -116,6 +124,33 @@ static double timeval2double(struct timeval *ts)
 		(double)ts->tv_usec / (double)1000000;
 }
 
+static pid_t perf_stat_pid;
+
+static void wake_up_perf_stat(void)
+{
+	int wake_fd;
+	struct sockaddr_un wake_addr;
+	pid_t myself = getpid();
+
+	wake_fd = socket(PF_UNIX, SOCK_STREAM, 0);
+	if (wake_fd < 0)
+		die("unable to create socket for sync\n");
+
+	memset(&wake_addr, 0, sizeof(wake_addr));
+	wake_addr.sun_family = PF_UNIX;
+	strncpy(wake_addr.sun_path, wake_path, sizeof(wake_addr.sun_path));
+
+	if (connect(wake_fd, (struct sockaddr *)&wake_addr, sizeof(wake_addr)))
+		die("connect() failed\n");
+
+	if (write(wake_fd, &myself, sizeof(myself)) != sizeof(myself))
+		die("write() my pid to socket failed\n");
+
+	if (read(wake_fd, &perf_stat_pid, sizeof(perf_stat_pid))
+		!= sizeof(perf_stat_pid))
+		die("read() pid of perf stat from socket\n");
+}
+
 static void alloc_mem(void **dst, void **src, size_t length)
 {
 	int ret;
@@ -139,10 +174,16 @@ static u64 do_memcpy_clock(memcpy_t fn, size_t len, bool prefault)
 	if (prefault)
 		fn(dst + dst_align, src + src_align, len);
 
+	if (wake_path)
+		wake_up_perf_stat();
+
 	clock_start = get_clock();
 	fn(dst + dst_align, src + src_align, len);
 	clock_end = get_clock();
 
+	if (wake_path)		/* kill perf stat */
+		kill(perf_stat_pid, SIGINT);
+
 	free(src);
 	free(dst);
 	return clock_end - clock_start;
@@ -158,12 +199,18 @@ static double do_memcpy_gettimeofday(memcpy_t fn, size_t len, bool prefault)
 	if (prefault)
 		fn(dst + dst_align, src + src_align, len);
 
+	if (wake_path)
+		wake_up_perf_stat();
+
 	BUG_ON(gettimeofday(&tv_start, NULL));
 	fn(dst + dst_align, src + src_align, len);
 	BUG_ON(gettimeofday(&tv_end, NULL));
 
 	timersub(&tv_end, &tv_start, &tv_diff);
 
+	if (wake_path)		/* kill perf stat */
+		kill(perf_stat_pid, SIGINT);
+
 	free(src);
 	free(dst);
 	return (double)((double)len / timeval2double(&tv_diff));
@@ -235,6 +282,15 @@ int bench_mem_memcpy(int argc, const char **argv,
 
 	if (!only_prefault && !no_prefault) {
 		/* show both of results */
+		if (wake_path) {
+			fprintf(stderr, "Meaningless combination of option, "
+				"you should not use wake_path alone.\n"
+				"Use it with --only-prefault"
+				" or --no-prefault\n");
+			return 1;
+		}
+
+
 		if (use_clock) {
 			result_clock[0] =
 				do_memcpy_clock(routines[i].fn, len, false);
-- 
1.7.3.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/2] perf bench: port memcpy_64.S to perf bench
       [not found]     ` <4D0CE05C.1070600@dcl.info.waseda.ac.jp>
@ 2010-12-20  6:30       ` Miao Xie
  2010-12-20 15:34         ` Hitoshi Mitake
  0 siblings, 1 reply; 30+ messages in thread
From: Miao Xie @ 2010-12-20  6:30 UTC (permalink / raw)
  To: Hitoshi Mitake
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, h.mitake, Ma,
	"Ling@dcl.info.waseda.ac.jp":,
	Zhao Yakui, Arnaldo Carvalho de Melo, Paul Mackerras,
	Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	H. Peter Anvin

On Sun, 19 Dec 2010 01:25:00 +0900, Hitoshi Mitake wrote:
> On 2010年10月31日 04:21, Ingo Molnar wrote:
>>
>> * Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
>>
>>> On Sat, 2010-10-30 at 01:01 +0900, Hitoshi Mitake wrote:
>>>> This patch ports arch/x86/lib/memcpy_64.S to "perf bench mem".
>>>> When PERF_BENCH is defined at preprocessor level,
>>>> memcpy_64.S is preprocessed to includable form from the sources
>>>> under tools/perf for benchmarking programs.
>>>>
>>>> Signed-off-by: Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>
>>>> Cc: Ma Ling:<ling.ma@intel.com>
>>>> Cc: Zhao Yakui<yakui.zhao@intel.com>
>>>> Cc: Peter Zijlstra<a.p.zijlstra@chello.nl>
>>>> Cc: Arnaldo Carvalho de Melo<acme@redhat.com>
>>>> Cc: Paul Mackerras<paulus@samba.org>
>>>> Cc: Frederic Weisbecker<fweisbec@gmail.com>
>>>> Cc: Steven Rostedt<rostedt@goodmis.org>
>>>> Cc: Thomas Gleixner<tglx@linutronix.de>
>>>> Cc: H. Peter Anvin<hpa@zytor.com>
>>>> ---
>>>> arch/x86/lib/memcpy_64.S | 30 ++++++++++++++++++++++++++++++
>>>> 1 files changed, 30 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
>>>> index 75ef61e..72c6dfe 100644
>>>> --- a/arch/x86/lib/memcpy_64.S
>>>> +++ b/arch/x86/lib/memcpy_64.S
>>>> @@ -1,10 +1,23 @@
>>>> /* Copyright 2002 Andi Kleen */
>>>>
>>>> +/*
>>>> + * perf bench adoption by Hitoshi Mitake
>>>> + * PERF_BENCH means that this file is included from
>>>> + * the source files under tools/perf/ for benchmark programs.
>>>> + *
>>>> + * You don't have to care about PERF_BENCH when
>>>> + * you are working on the kernel.
>>>> + */
>>>> +
>>>> +#ifndef PERF_BENCH
>>>
>>> I don't like littering the actual kernel code with tools/perf/
>>> ifdeffery..
>>
>>
>> Yeah - could we somehow accept that file into a perf build as-is?
>>
>> Thanks,
>>
>> Ingo
>>
>
> Really sorry for my slow work...
>
> BTW, I have a question for Miao and Ingo.
> We are planning to implement new memcpy() of Miao,
> and the important point is not removing previous memcpy()
> for future architectures and benchmarkings.
>
> I feel that adding new CPU feature flag (like X86_FEATURE_REP_GOOD)
> and switching memcpy() with alternative mechanism is good way.
> (So we will have three memcpy()s: rep based, unrolled, and new
> unaligned oriented one)
> But there is another way: #ifdef. Which do you prefer?

I agree with your idea, but Ma Ling said this way may cause the i-cache
miss problem.
   http://marc.info/?l=linux-kernel&m=128746120107953&w=2
(The size of the i-cache is 32K, the size of memcpy() in my patch is 560Byte,
and the size of the last version in tip tree is 400Byte).

But I have not tested it, so I don't know the real result. Maybe we should
try to implement the new memcpy() first.

> And could you tell me the detail of CPU family information
> you are targeting, Miao?

They are  Core2 Duo E7300(Core name: Wolfdale) and Xeon X5260(Core name: Wolfdale-DP).

The following is the detailed information of these two CPU:
Core2 Duo E7300:
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Duo CPU     E7300  @ 2.66GHz
stepping	: 6
cpu MHz		: 1603.000
cache size	: 3072 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm dts
bogomips	: 5319.70
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

Xeon X5260:
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           X5260  @ 3.33GHz
stepping	: 6
cpu MHz		: 1999.000
cache size	: 6144 KB
physical id	: 3
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm dts tpr_shadow vnmi flexpriority
bogomips	: 6649.07
clflush size	: 64
cache_alignment	: 64
address sizes	: 38 bits physical, 48 bits virtual
power management:

Thanks
Miao

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/2] perf bench: port memcpy_64.S to perf bench
  2010-12-20  6:30       ` Miao Xie
@ 2010-12-20 15:34         ` Hitoshi Mitake
  0 siblings, 0 replies; 30+ messages in thread
From: Hitoshi Mitake @ 2010-12-20 15:34 UTC (permalink / raw)
  To: miaox
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, ling.ma, Zhao Yakui,
	Arnaldo Carvalho de Melo, Paul Mackerras, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, H. Peter Anvin

On Mon, Dec 20, 2010 at 15:30, Miao Xie <miaox@cn.fujitsu.com> wrote:
> On Sun, 19 Dec 2010 01:25:00 +0900, Hitoshi Mitake wrote:
>>
>> On 2010年10月31日 04:21, Ingo Molnar wrote:
>>>
>>> * Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
>>>
>>>> On Sat, 2010-10-30 at 01:01 +0900, Hitoshi Mitake wrote:
>>>>>
>>>>> This patch ports arch/x86/lib/memcpy_64.S to "perf bench mem".
>>>>> When PERF_BENCH is defined at preprocessor level,
>>>>> memcpy_64.S is preprocessed to includable form from the sources
>>>>> under tools/perf for benchmarking programs.
>>>>>
>>>>> Signed-off-by: Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>
>>>>> Cc: Ma Ling:<ling.ma@intel.com>
>>>>> Cc: Zhao Yakui<yakui.zhao@intel.com>
>>>>> Cc: Peter Zijlstra<a.p.zijlstra@chello.nl>
>>>>> Cc: Arnaldo Carvalho de Melo<acme@redhat.com>
>>>>> Cc: Paul Mackerras<paulus@samba.org>
>>>>> Cc: Frederic Weisbecker<fweisbec@gmail.com>
>>>>> Cc: Steven Rostedt<rostedt@goodmis.org>
>>>>> Cc: Thomas Gleixner<tglx@linutronix.de>
>>>>> Cc: H. Peter Anvin<hpa@zytor.com>
>>>>> ---
>>>>> arch/x86/lib/memcpy_64.S | 30 ++++++++++++++++++++++++++++++
>>>>> 1 files changed, 30 insertions(+), 0 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
>>>>> index 75ef61e..72c6dfe 100644
>>>>> --- a/arch/x86/lib/memcpy_64.S
>>>>> +++ b/arch/x86/lib/memcpy_64.S
>>>>> @@ -1,10 +1,23 @@
>>>>> /* Copyright 2002 Andi Kleen */
>>>>>
>>>>> +/*
>>>>> + * perf bench adoption by Hitoshi Mitake
>>>>> + * PERF_BENCH means that this file is included from
>>>>> + * the source files under tools/perf/ for benchmark programs.
>>>>> + *
>>>>> + * You don't have to care about PERF_BENCH when
>>>>> + * you are working on the kernel.
>>>>> + */
>>>>> +
>>>>> +#ifndef PERF_BENCH
>>>>
>>>> I don't like littering the actual kernel code with tools/perf/
>>>> ifdeffery..
>>>
>>>
>>> Yeah - could we somehow accept that file into a perf build as-is?
>>>
>>> Thanks,
>>>
>>> Ingo
>>>
>>
>> Really sorry for my slow work...
>>
>> BTW, I have a question for Miao and Ingo.
>> We are planning to implement new memcpy() of Miao,
>> and the important point is not removing previous memcpy()
>> for future architectures and benchmarkings.
>>
>> I feel that adding new CPU feature flag (like X86_FEATURE_REP_GOOD)
>> and switching memcpy() with alternative mechanism is good way.
>> (So we will have three memcpy()s: rep based, unrolled, and new
>> unaligned oriented one)
>> But there is another way: #ifdef. Which do you prefer?
>
> I agree with your idea, but Ma Ling said this way may cause the i-cache
> miss problem.
>  http://marc.info/?l=linux-kernel&m=128746120107953&w=2
> (The size of the i-cache is 32K, the size of memcpy() in my patch is
> 560Byte,
> and the size of the last version in tip tree is 400Byte).
>
> But I have not tested it, so I don't know the real result. Maybe we should
> try to implement the new memcpy() first.

I compared memcpy()'s icache miss behaviour with my new
--wait-on patch ( https://patchwork.kernel.org/patch/408801/ ).
And the result is,

default of tip tree

% sudo ./perf stat -w /tmp/perf-stat-wait -e L1-icache-load-misses

 Performance counter stats for process id '12559':

            64,328 L1-icache-load-misses

        0.106513157  seconds time elapsed

Miao Xie's memcpy()

% sudo ./perf stat -w /tmp/perf-stat-wait -e L1-icache-misses

 Performance counter stats for process id '13159':

            64,559 L1-icache-load-misses

        0.107057925  seconds time elapsed

It seems that there is no fatal icache miss.
# I tested perf bench mem memcpy with Core i3 M 330 processor.

But I don't understand well about cache characteristics of intel processor.
I have to look at this problem more deeply.

>
>> And could you tell me the detail of CPU family information
>> you are targeting, Miao?
>
> They are  Core2 Duo E7300(Core name: Wolfdale) and Xeon X5260(Core name:
> Wolfdale-DP).
>
> The following is the detailed information of these two CPU:
> Core2 Duo E7300:
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 23
> model name      : Intel(R) Core(TM)2 Duo CPU     E7300  @ 2.66GHz
> stepping        : 6
> cpu MHz         : 1603.000
> cache size      : 3072 KB
> physical id     : 0
> siblings        : 2
> core id         : 1
> cpu cores       : 2
> apicid          : 1
> initial apicid  : 1
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
> constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor
> ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm dts
> bogomips        : 5319.70
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
>
> Xeon X5260:
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 23
> model name      : Intel(R) Xeon(R) CPU           X5260  @ 3.33GHz
> stepping        : 6
> cpu MHz         : 1999.000
> cache size      : 6144 KB
> physical id     : 3
> siblings        : 2
> core id         : 1
> cpu cores       : 2
> apicid          : 7
> initial apicid  : 7
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm
> constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor
> ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm dts tpr_shadow
> vnmi flexpriority
> bogomips        : 6649.07
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 38 bits physical, 48 bits virtual
> power management:
>

Thanks for your information!

Thanks,
        Hitoshi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy
  2010-11-01  9:02       ` Ingo Molnar
  2010-11-05 17:05         ` Hitoshi Mitake
@ 2011-01-11 16:27         ` Hitoshi Mitake
  1 sibling, 0 replies; 30+ messages in thread
From: Hitoshi Mitake @ 2011-01-11 16:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, h.mitake, Ma Ling, Zhao Yakui, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, H. Peter Anvin

On 2010年11月01日 18:02, Ingo Molnar wrote:
>
> * Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>  wrote:
>
>> On 2010年10月31日 04:23, Ingo Molnar wrote:
>>>
>>> * Hitoshi Mitake<mitake@dcl.info.waseda.ac.jp>   wrote:
>>>
>>>> This patch adds new file: mem-memcpy-x86-64-asm.S
>>>> for x86-64 specific memcpy() benchmarking.
>>>> Added new benchmarks are,
>>>>   x86-64-rep:      memcpy() implemented with rep instruction
>>>>   x86-64-unrolled: unrolled memcpy()
>>>>
>>>> Original idea of including the source files of kernel
>>>> for benchmarking is suggested by Ingo Molnar.
>>>> This is more effective than write-once programs for quantitative
>>>> evaluation of in-kernel, little and leaf functions called high frequently.
>>>> Because perf bench is in kernel source tree and executing it
>>>> on various hardwares, especially new model CPUs, is easy.
>>>>
>>>> This way can also be used for other functions of kernel e.g. checksum functions.
>>>>
>>>> Example of usage on Core i3 M330:
>>>>
>>>> | % ./perf bench mem memcpy -l 500MB
>>>> | # Running mem/memcpy benchmark...
>>>> | # Copying 500MB Bytes from 0x7f911f94c010 to 0x7f913ed4d010 ...
>>>> |
>>>> |      578.732506 MB/Sec
>>>> | % ./perf bench mem memcpy -l 500MB -r x86-64-rep
>>>> | # Running mem/memcpy benchmark...
>>>> | # Copying 500MB Bytes from 0x7fb4b6fe4010 to 0x7fb4d63e5010 ...
>>>> |
>>>> |      738.184980 MB/Sec
>>>> | % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
>>>> | # Running mem/memcpy benchmark...
>>>> | # Copying 500MB Bytes from 0x7f6f2e668010 to 0x7f6f4da69010 ...
>>>> |
>>>> |      767.483269 MB/Sec
>>>>
>>>> This shows clearly that unrolled memcpy() is efficient
>>>> than rep version and glibc's one :)
>>>
>>> Hey, really cool output :-)
>>>
>>> Might also make sense to measure Ma Ling's patched version?
>>
>> Does Ma Ling's patched version mean,
>>
>> http://marc.info/?l=linux-kernel&m=128652296500989&w=2
>>
>> the memcpy applied the patch of the URL?
>> (It seems that this patch was written by Miao Xie.)
>>
>> I'll include the result of patched version in the next post.
>
> (Indeed it is Miao Xie - sorry!)
>
>>>> # checkpatch.pl warns about two externs in bench/mem-memcpy.c
>>>> # added by this patch. But I think it is no problem.
>>>
>>> You should put these:
>>>
>>>   +#ifdef ARCH_X86_64
>>>   +extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
>>>   +extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
>>>   +#endif
>>>
>>> into a .h file - a new one if needed.
>>>
>>> That will make both checkpatch and me happier ;-)
>>>
>>
>> OK, I'll separate these files.
>>
>> BTW, I found really interesting evaluation result.
>> Current results of "perf bench mem memcpy" include
>> the overhead of page faults because the measured memcpy()
>> is the first access to allocated memory area.
>>
>> I tested the another version of perf bench mem memcpy,
>> which does memcpy() before measured memcpy() for removing
>> the overhead come from page faults.
>>
>> And this is the result:
>>
>> % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
>> # Running mem/memcpy benchmark...
>> # Copying 500MB Bytes from 0x7f19d488f010 to 0x7f19f3c90010 ...
>>
>>         4.608340 GB/Sec
>>
>> % ./perf bench mem memcpy -l 500MB
>> # Running mem/memcpy benchmark...
>> # Copying 500MB Bytes from 0x7f696c3cc010 to 0x7f698b7cd010 ...
>>
>>         4.856442 GB/Sec
>>
>> % ./perf bench mem memcpy -l 500MB -r x86-64-rep
>> # Running mem/memcpy benchmark...
>> # Copying 500MB Bytes from 0x7f45d6cff010 to 0x7f45f6100010 ...
>>
>>         6.024445 GB/Sec
>>
>> The relation of scores reversed!
>> I cannot explain the cause of this result, and
>> this is really interesting phenomenon.
>
> Interesting indeed, and it would be nice to analyse that! (It should be possible,
> using various PMU metrics in a clever way, to figure out what's happening inside the
> CPU, right?)
>

I corrected the PMU information of the each case of memcpy,
below is the result:

(I used partial monitoring patch I posted before: 
https://patchwork.kernel.org/patch/408801/,
and my local modification for testing rep based memcpy)

                  no prefault benchmarking

unrolled

Score:      685.812729 MB/Sec
Stat:
  Performance counter stats for process id '4139':

         725.939831 task-clock-msecs         #      0.995 CPUs
                 74 context-switches         #      0.000 M/sec
                  2 CPU-migrations           #      0.000 M/sec
            256,002 page-faults              #      0.353 M/sec
      1,535,468,702 cycles                   #   2115.146 M/sec
      1,691,516,817 instructions             #      1.102 IPC
        291,260,006 branches                 #    401.218 M/sec
          1,487,762 branch-misses            #      0.511 %
          8,470,560 cache-references         #     11.668 M/sec
          8,364,176 cache-misses             #     11.522 M/sec

         0.729488573  seconds time elapsed

rep based

Score:     670.172114 MB/Sec
Stat:
  Performance counter stats for process id '5539':

         742.943772 task-clock-msecs         #      0.995 CPUs
                 77 context-switches         #      0.000 M/sec
                  2 CPU-migrations           #      0.000 M/sec
            256,002 page-faults              #      0.345 M/sec
      1,578,787,149 cycles                   #   2125.043 M/sec
      1,499,144,628 instructions             #      0.950 IPC
        275,684,806 branches                 #    371.071 M/sec
          1,522,326 branch-misses            #      0.552 %
          8,503,747 cache-references         #     11.446 M/sec
          8,386,673 cache-misses             #     11.288 M/sec

         0.746320411  seconds time elapsed

                  prefaulted benchmarking

unrolled

Score:       4.485941 GB/Sec
Stat:
  Performance counter stats for process id '4279':

         108.466761 task-clock-msecs         #      0.994 CPUs
                 11 context-switches         #      0.000 M/sec
                  2 CPU-migrations           #      0.000 M/sec
                  2 page-faults              #      0.000 M/sec
        218,260,432 cycles                   #   2012.233 M/sec
        199,520,023 instructions             #      0.914 IPC
         16,963,327 branches                 #    156.392 M/sec
              8,169 branch-misses            #      0.048 %
          2,955,221 cache-references         #     27.245 M/sec
          2,916,018 cache-misses             #     26.884 M/sec

         0.109115820  seconds time elapsed

rep based

Score:       5.972859 GB/Sec
Stat:
  Performance counter stats for process id '5535':

          81.609445 task-clock-msecs         #      0.995 CPUs
                  8 context-switches         #      0.000 M/sec
                  0 CPU-migrations           #      0.000 M/sec
                  2 page-faults              #      0.000 M/sec
        173,888,853 cycles                   #   2130.744 M/sec
          3,034,096 instructions             #      0.017 IPC
            607,897 branches                 #      7.449 M/sec
              5,874 branch-misses            #      0.966 %
          8,276,533 cache-references         #    101.416 M/sec
          8,274,865 cache-misses             #    101.396 M/sec

         0.082030877  seconds time

Again, the surprising point is the reverse of the score relation.
I cannot find the direct reason of this reverse,
but it seems that the count of branch-miss is refrecting it.

I have to look into this more deeply...

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2011-01-11 16:27 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-29 16:01 [PATCH 1/2] perf bench: port memcpy_64.S to perf bench Hitoshi Mitake
2010-10-29 16:01 ` [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy Hitoshi Mitake
2010-10-30 19:23   ` Ingo Molnar
2010-11-01  5:36     ` Hitoshi Mitake
2010-11-01  9:02       ` Ingo Molnar
2010-11-05 17:05         ` Hitoshi Mitake
2010-11-10  9:12           ` Ingo Molnar
2010-11-12 15:01             ` Hitoshi Mitake
2010-11-12 15:02               ` [PATCH] perf bench: print both of prefaulted and no prefaulted results Hitoshi Mitake
2010-11-18  7:58                 ` Ingo Molnar
2010-11-25  7:04                   ` Hitoshi Mitake
2010-11-25  7:04                     ` [PATCH v2 1/2] " Hitoshi Mitake
2010-11-26 10:30                       ` [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default tip-bot for Hitoshi Mitake
     [not found]                         ` <4D03B1AD.7000606@dcl.info.waseda.ac.jp>
2010-12-12 13:46                           ` perf monitoring triggers Was: " Arnaldo Carvalho de Melo
2010-12-13 11:14                             ` Peter Zijlstra
2010-12-13 12:38                               ` Arnaldo Carvalho de Melo
2010-12-13 12:40                                 ` Peter Zijlstra
2010-12-13 13:12                                   ` Arnaldo Carvalho de Melo
2010-12-13 17:37                                     ` Hitoshi Mitake
2010-12-14  5:46                                       ` [RFC PATCH 1/2] perf stat: wait on unix domain socket before calling sys_perf_event_open() Hitoshi Mitake
2010-12-14  5:46                                       ` [RFC PATCH 2/2] perf bench: more fine grain monitoring for prefault memcpy() Hitoshi Mitake
2010-11-25  7:04                     ` [PATCH v2 2/2] perf bench: port arch/x86/lib/memcpy_64.S to perf bench mem memcpy Hitoshi Mitake
2010-11-26 10:31                       ` [tip:perf/core] perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem' tip-bot for Hitoshi Mitake
2010-11-29 13:26                         ` Hitoshi Mitake
2011-01-11 16:27         ` [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy Hitoshi Mitake
2010-10-29 19:49 ` [PATCH 1/2] perf bench: port memcpy_64.S to perf bench Peter Zijlstra
2010-10-30 19:21   ` Ingo Molnar
     [not found]     ` <4D0CE05C.1070600@dcl.info.waseda.ac.jp>
2010-12-20  6:30       ` Miao Xie
2010-12-20 15:34         ` Hitoshi Mitake
     [not found]   ` <20101029210824.GB13385@ghostprotocols.net>
2010-11-05 17:10     ` Hitoshi Mitake

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.