linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register
@ 2012-10-12 21:02 George Spelvin
  2012-10-12 23:17 ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: George Spelvin @ 2012-10-12 21:02 UTC
  To: linux-kernel; +Cc: linux

Here are some Phenom results for that benchmark.  The patched copy_page
is slower here: the average time increases from about 700 to 760 cycles
(+8.6%).

vendor_id       : AuthenticAMD
cpu family      : 16
model           : 2
model name      : AMD Phenom(tm) 9850 Quad-Core Processor
stepping        : 3
microcode       : 0x1000083
cpu MHz         : 2500.210
cache size      : 512 KB
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs hw_pstate npt lbrv svm_lock
bogomips        : 5000.42
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64

                       	copy_page_org	copy_page_new	
TPT: Len 4096, alignment  0/ 0:	678	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
                       	copy_page_org	copy_page_new	
TPT: Len 4096, alignment  0/ 0:	667	760
TPT: Len 4096, alignment  0/ 0:	673	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
                       	copy_page_org	copy_page_new	
TPT: Len 4096, alignment  0/ 0:	667	760
TPT: Len 4096, alignment  0/ 0:	673	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
                       	copy_page_org	copy_page_new	
TPT: Len 4096, alignment  0/ 0:	671	760
TPT: Len 4096, alignment  0/ 0:	673	760
TPT: Len 4096, alignment  0/ 0:	671	760
TPT: Len 4096, alignment  0/ 0:	709	760
TPT: Len 4096, alignment  0/ 0:	708	760
                       	copy_page_org	copy_page_new	
TPT: Len 4096, alignment  0/ 0:	667	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
                       	copy_page_org	copy_page_new	
TPT: Len 4096, alignment  0/ 0:	671	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
                       	copy_page_org	copy_page_new	
TPT: Len 4096, alignment  0/ 0:	678	760
TPT: Len 4096, alignment  0/ 0:	709	758
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	709	759
TPT: Len 4096, alignment  0/ 0:	710	760
                       	copy_page_org	copy_page_new	
TPT: Len 4096, alignment  0/ 0:	680	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
                       	copy_page_org	copy_page_new	
TPT: Len 4096, alignment  0/ 0:	667	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	709	760
TPT: Len 4096, alignment  0/ 0:	709	759
TPT: Len 4096, alignment  0/ 0:	710	760
                       	copy_page_org	copy_page_new	
TPT: Len 4096, alignment  0/ 0:	678	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
TPT: Len 4096, alignment  0/ 0:	710	760
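
For readers who want to recompute the summary figures from the table, here is a minimal C sketch. The arrays are transcribed from the two columns above; the helper names (`mean_cycles`, `percent_change`) are hypothetical and not part of the benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Samples transcribed from the copy_page_org column, one row per run. */
static const unsigned org_cycles[] = {
    678, 710, 710, 710, 710,
    667, 673, 710, 710, 710,
    667, 673, 710, 710, 710,
    671, 673, 671, 709, 708,
    667, 710, 710, 710, 710,
    671, 710, 710, 710, 710,
    678, 709, 710, 709, 710,
    680, 710, 710, 710, 710,
    667, 710, 709, 709, 710,
    678, 710, 710, 710, 710,
};

/* Samples transcribed from the copy_page_new column, one row per run. */
static const unsigned new_cycles[] = {
    760, 760, 760, 760, 760,
    760, 760, 760, 760, 760,
    760, 760, 760, 760, 760,
    760, 760, 760, 760, 760,
    760, 760, 760, 760, 760,
    760, 760, 760, 760, 760,
    760, 758, 760, 759, 760,
    760, 760, 760, 760, 760,
    760, 760, 760, 759, 760,
    760, 760, 760, 760, 760,
};

/* Arithmetic mean of `count` cycle samples. */
static double mean_cycles(const unsigned *samples, size_t count)
{
    unsigned long sum = 0;
    size_t i;

    assert(count > 0);
    for (i = 0; i < count; i++)
        sum += samples[i];
    return (double)sum / (double)count;
}

/* Relative change from `before` to `after`, in percent. */
static double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Feeding both arrays through `mean_cycles` and the two means through `percent_change` gives means of roughly 699 and 760 cycles and a slowdown close to the quoted +8.6%.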

* [PATCH RFC 2/2] [x86] Optimize copy_page by re-arranging instruction sequence and saving register
@ 2012-10-11 12:29 ling.ma
  2012-10-11 13:40 ` Andi Kleen
  2012-10-11 14:35 ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 16+ messages in thread
From: ling.ma @ 2012-10-11 12:29 UTC
  To: mingo; +Cc: hpa, tglx, linux-kernel, Ma Ling

From: Ma Ling <ling.ma@intel.com>

Load and store operations account for about 35% and 10% of instructions,
respectively, in most industry benchmarks. A 16-byte-aligned fetch block
holds about 4 instructions, which implies about 1.4 (0.35 * 4) loads and
0.4 (0.10 * 4) stores per fetch. Modern CPUs support 2 loads and 1 store
per cycle, so store throughput is the bottleneck for memcpy or copy_page,
and some smaller CPUs support only one memory operation per cycle. It is
therefore enough to issue one load and one store instruction per cycle,
which also lets us save registers.
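
As a rough check of the reasoning above, here is a minimal C sketch. The helper names (`ops_per_fetch`, `copy_cycles`) are made up for illustration, and the port model is deliberately idealized: it assumes every load and store hits the cache and ignores decode, issue and retirement limits.

```c
#include <assert.h>

/*
 * Expected number of memory operations of one kind in a fetch block,
 * given the fraction of instructions of that kind.
 */
static double ops_per_fetch(double fraction, int insns_per_fetch)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(insns_per_fetch > 0);
    return fraction * insns_per_fetch;
}

/*
 * Idealized cycle count for copying `qwords` 8-byte words when the CPU
 * can issue `load_ports` loads and `store_ports` stores per cycle.
 * A copy needs one store per load, so the slower side sets the pace.
 */
static unsigned copy_cycles(unsigned qwords, unsigned load_ports,
                            unsigned store_ports)
{
    unsigned load_cycles, store_cycles;

    assert(load_ports > 0 && store_ports > 0);
    load_cycles = (qwords + load_ports - 1) / load_ports;    /* round up */
    store_cycles = (qwords + store_ports - 1) / store_ports; /* round up */
    return load_cycles > store_cycles ? load_cycles : store_cycles;
}
```

Under this model, `copy_cycles(8, 2, 1)` and `copy_cycles(8, 1, 1)` both come out at 8 cycles for one 64-byte line: the second load port does not help a pure copy, which is the argument for issuing only one load per cycle.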

In this patch we also re-arrange the instruction sequence to improve
performance. On Atom, performance improves by about 11% in the hot-cache
case and about 9% in the cold-cache case.

Signed-off-by: Ma Ling <ling.ma@intel.com>

---
 arch/x86/lib/copy_page_64.S |  103 +++++++++++++++++-------------------------
 1 files changed, 42 insertions(+), 61 deletions(-)

diff --git a/arch/x86/lib/copy_page_64.S b/arch/x86/lib/copy_page_64.S
index 3da5527..13c97f4 100644
--- a/arch/x86/lib/copy_page_64.S
+++ b/arch/x86/lib/copy_page_64.S
@@ -20,76 +20,57 @@ ENDPROC(copy_page_rep)
 
 ENTRY(copy_page)
 	CFI_STARTPROC
-	subq	$2*8,	%rsp
-	CFI_ADJUST_CFA_OFFSET 2*8
-	movq	%rbx,	(%rsp)
-	CFI_REL_OFFSET rbx, 0
-	movq	%r12,	1*8(%rsp)
-	CFI_REL_OFFSET r12, 1*8
+	mov	$(4096/64)-5, %ecx
 
-	movl	$(4096/64)-5,	%ecx
-	.p2align 4
 .Loop64:
-  	dec	%rcx
-
-	movq	0x8*0(%rsi), %rax
-	movq	0x8*1(%rsi), %rbx
-	movq	0x8*2(%rsi), %rdx
-	movq	0x8*3(%rsi), %r8
-	movq	0x8*4(%rsi), %r9
-	movq	0x8*5(%rsi), %r10
-	movq	0x8*6(%rsi), %r11
-	movq	0x8*7(%rsi), %r12
-
 	prefetcht0 5*64(%rsi)
-
-	movq	%rax, 0x8*0(%rdi)
-	movq	%rbx, 0x8*1(%rdi)
-	movq	%rdx, 0x8*2(%rdi)
-	movq	%r8,  0x8*3(%rdi)
-	movq	%r9,  0x8*4(%rdi)
-	movq	%r10, 0x8*5(%rdi)
-	movq	%r11, 0x8*6(%rdi)
-	movq	%r12, 0x8*7(%rdi)
-
-	leaq	64 (%rsi), %rsi
-	leaq	64 (%rdi), %rdi
-
+	decb	%cl
+
+	movq	0x8*0(%rsi), %r10
+	movq	0x8*1(%rsi), %rax
+	movq	0x8*2(%rsi), %r8
+	movq	0x8*3(%rsi), %r9
+	movq	%r10, 0x8*0(%rdi)
+	movq	%rax, 0x8*1(%rdi)
+	movq	%r8, 0x8*2(%rdi)
+	movq	%r9, 0x8*3(%rdi)
+
+	movq	0x8*4(%rsi), %r10
+	movq	0x8*5(%rsi), %rax
+	movq	0x8*6(%rsi), %r8
+	movq	0x8*7(%rsi), %r9
+	leaq	64(%rsi), %rsi
+	movq	%r10, 0x8*4(%rdi)
+	movq	%rax, 0x8*5(%rdi)
+	movq	%r8, 0x8*6(%rdi)
+	movq	%r9, 0x8*7(%rdi)
+	leaq	64(%rdi), %rdi
 	jnz	.Loop64
 
-	movl	$5, %ecx
-	.p2align 4
+	mov	$5, %dl
 .Loop2:
-	decl	%ecx
-
-	movq	0x8*0(%rsi), %rax
-	movq	0x8*1(%rsi), %rbx
-	movq	0x8*2(%rsi), %rdx
-	movq	0x8*3(%rsi), %r8
-	movq	0x8*4(%rsi), %r9
-	movq	0x8*5(%rsi), %r10
-	movq	0x8*6(%rsi), %r11
-	movq	0x8*7(%rsi), %r12
-
-	movq	%rax, 0x8*0(%rdi)
-	movq	%rbx, 0x8*1(%rdi)
-	movq	%rdx, 0x8*2(%rdi)
-	movq	%r8,  0x8*3(%rdi)
-	movq	%r9,  0x8*4(%rdi)
-	movq	%r10, 0x8*5(%rdi)
-	movq	%r11, 0x8*6(%rdi)
-	movq	%r12, 0x8*7(%rdi)
-
-	leaq	64(%rdi), %rdi
+	decb	%dl
+	movq	0x8*0(%rsi), %r10
+	movq	0x8*1(%rsi), %rax
+	movq	0x8*2(%rsi), %r8
+	movq	0x8*3(%rsi), %r9
+	movq	%r10, 0x8*0(%rdi)
+	movq	%rax, 0x8*1(%rdi)
+	movq	%r8, 0x8*2(%rdi)
+	movq	%r9, 0x8*3(%rdi)
+
+	movq	0x8*4(%rsi), %r10
+	movq	0x8*5(%rsi), %rax
+	movq	0x8*6(%rsi), %r8
+	movq	0x8*7(%rsi), %r9
 	leaq	64(%rsi), %rsi
+	movq	%r10, 0x8*4(%rdi)
+	movq	%rax, 0x8*5(%rdi)
+	movq	%r8, 0x8*6(%rdi)
+	movq	%r9, 0x8*7(%rdi)
+	leaq	64(%rdi), %rdi
 	jnz	.Loop2
 
-	movq	(%rsp), %rbx
-	CFI_RESTORE rbx
-	movq	1*8(%rsp), %r12
-	CFI_RESTORE r12
-	addq	$2*8, %rsp
-	CFI_ADJUST_CFA_OFFSET -2*8
 	ret
 .Lcopy_page_end:
 	CFI_ENDPROC
-- 
1.6.5.2
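
For readers less used to AT&T assembly, the loop structure of the patched routine can be modelled in C roughly as follows. This is only a sketch of the control flow and the four-register load/store grouping, not the kernel code: the names (`copy_line`, `copy_page_model`, `copy_page_model_selftest`) are hypothetical, `__builtin_prefetch` is a GCC/Clang builtin standing in for `prefetcht0`, and a C compiler is free to schedule the loads and stores differently from the hand-written sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES      4096u
#define LINE_BYTES      64u
#define QWORDS_PER_LINE (LINE_BYTES / sizeof(uint64_t))
#define PREFETCH_LINES  5u

/* Copy one 64-byte line as two groups of four loads, then four stores. */
static void copy_line(uint64_t *dst, const uint64_t *src)
{
    uint64_t a, b, c, d;   /* stand-ins for %r10, %rax, %r8, %r9 */

    a = src[0]; b = src[1]; c = src[2]; d = src[3];
    dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;

    a = src[4]; b = src[5]; c = src[6]; d = src[7];
    dst[4] = a; dst[5] = b; dst[6] = c; dst[7] = d;
}

static void copy_page_model(uint64_t *dst, const uint64_t *src)
{
    unsigned lines = PAGE_BYTES / LINE_BYTES;
    unsigned i;

    assert(dst != NULL && src != NULL);

    /* Like .Loop64: 64 - 5 = 59 lines, prefetching 5 lines ahead. */
    for (i = 0; i < lines - PREFETCH_LINES; i++) {
#if defined(__GNUC__)
        __builtin_prefetch(src + PREFETCH_LINES * QWORDS_PER_LINE);
#endif
        copy_line(dst, src);
        src += QWORDS_PER_LINE;
        dst += QWORDS_PER_LINE;
    }

    /* Like .Loop2: the last 5 lines, without prefetching past the page. */
    for (i = 0; i < PREFETCH_LINES; i++) {
        copy_line(dst, src);
        src += QWORDS_PER_LINE;
        dst += QWORDS_PER_LINE;
    }
}

/* Returns 1 if a patterned page is copied byte-for-byte, else 0. */
static int copy_page_model_selftest(void)
{
    static uint64_t src[PAGE_BYTES / sizeof(uint64_t)];
    static uint64_t dst[PAGE_BYTES / sizeof(uint64_t)];
    size_t n = sizeof(src) / sizeof(src[0]);
    size_t i;

    for (i = 0; i < n; i++) {
        src[i] = (uint64_t)i * 0x9E3779B97F4A7C15ull + 1;
        dst[i] = 0;
    }
    copy_page_model(dst, src);
    for (i = 0; i < n; i++)
        if (dst[i] != src[i])
            return 0;
    return 1;
}
```

The model needs only four temporaries per group, which mirrors how the patch avoids the callee-saved %rbx and %r12 and with them the stack save/restore around the loops.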


