* [PATCH v2 0/6] powerpc32: replace memcpy and memset by cacheable alternatives
From: Christophe Leroy @ 2015-05-19 10:07 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund
This patchset makes memset() and memcpy() use their cacheable
(dcbz-based) variants. This is safe because, when the destination is
not cacheable, memset_io() and memcpy_toio() are used instead.
On MPC885, we observe a 7% transfer rate increase on FTP reception.
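For context, a hedged C sketch (not part of the series; dst_ram and
regs are hypothetical variables) of the convention the series relies
on: plain memset()/memcpy() are only called on cacheable RAM, while
__iomem regions go through the _io accessors, so a dcbz-based
implementation is safe as the default.

    #include <linux/io.h>
    #include <linux/string.h>

    /* Illustration only: the calling convention assumed above. */
    static void copy_examples(void *dst_ram, void __iomem *regs,
                              const void *src, size_t n)
    {
            memcpy(dst_ram, src, n);   /* RAM: may legitimately use dcbz */
            memcpy_toio(regs, src, n); /* I/O: accessor stores, no dcbz */
            memset_io(regs, 0, n);     /* I/O-safe clear */
    }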
Christophe Leroy (6):
powerpc: use memset_io() to clear CPM Muram
Partially revert "powerpc: Remove duplicate cacheable_memcpy/memzero
functions"
powerpc32: memset(0): use cacheable_memzero
powerpc32: Merge the new memset() with the old one
powerpc32: cacheable_memcpy becomes memcpy
powerpc32: Few optimisations in memcpy
arch/powerpc/lib/copy_32.S | 109 ++++++++++++++++++++++++++++++++++++++-
arch/powerpc/sysdev/cpm_common.c | 2 +-
2 files changed, 109 insertions(+), 2 deletions(-)
--
2.1.0
* [PATCH v2 1/6] powerpc: use memset_io() to clear CPM Muram
From: Christophe Leroy @ 2015-05-19 10:07 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund
CPM muram is not cached, so use memset_io() instead of memset().
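As a hedged sketch (not the actual powerpc implementation),
memset_io() behaves roughly like a loop of I/O byte stores, which is
legal on a cache-inhibited mapping where dcbz is not:

    /* Conceptual sketch only: store bytes through the __iomem
     * mapping with writeb(); never touches the data cache. */
    static void memset_io_sketch(volatile void __iomem *addr,
                                 int c, size_t n)
    {
            while (n--)
                    writeb(c, addr++);
    }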
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/sysdev/cpm_common.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/sysdev/cpm_common.c b/arch/powerpc/sysdev/cpm_common.c
index 4f78695..e2ea519 100644
--- a/arch/powerpc/sysdev/cpm_common.c
+++ b/arch/powerpc/sysdev/cpm_common.c
@@ -147,7 +147,7 @@ unsigned long cpm_muram_alloc(unsigned long size, unsigned long align)
spin_lock_irqsave(&cpm_muram_lock, flags);
cpm_muram_info.alignment = align;
start = rh_alloc(&cpm_muram_info, size, "commproc");
- memset(cpm_muram_addr(start), 0, size);
+ memset_io(cpm_muram_addr(start), 0, size);
spin_unlock_irqrestore(&cpm_muram_lock, flags);
return start;
--
2.1.0
* [PATCH v2 2/6] Partially revert "powerpc: Remove duplicate cacheable_memcpy/memzero functions"
From: Christophe Leroy @ 2015-05-19 10:07 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund
This partially reverts
commit b05ae4ee602b7dc90771408ccf0972e1b3801a35 ("powerpc: Remove
duplicate cacheable_memcpy/memzero functions").
The functions cacheable_memcpy/cacheable_memzero are more efficient
than memcpy/memset because they use the dcbz instruction, which
avoids refilling cache lines with data that is about to be
overwritten.
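The key property of dcbz is that it establishes a zeroed cache line
without first fetching the old line contents from memory. A minimal
sketch (assumes a cacheable buffer already aligned to L1_CACHE_BYTES;
not the code in this patch):

    #include <linux/cache.h>

    /* Sketch only: zero 'nlines' whole cache lines starting at the
     * cache-line-aligned address p. dcbz zeroes the line in the data
     * cache without a memory read, saving the refill traffic. */
    static void zero_lines(void *p, unsigned long nlines)
    {
            while (nlines--) {
                    __asm__ __volatile__("dcbz 0,%0" : : "r"(p) : "memory");
                    p += L1_CACHE_BYTES;
            }
    }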
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/lib/copy_32.S | 127 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 127 insertions(+)
diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index 6813f80..55f19f9 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -69,6 +69,54 @@ CACHELINE_BYTES = L1_CACHE_BYTES
LG_CACHELINE_BYTES = L1_CACHE_SHIFT
CACHELINE_MASK = (L1_CACHE_BYTES-1)
+/*
+ * Use dcbz on the complete cache lines in the destination
+ * to set them to zero. This requires that the destination
+ * area is cacheable. -- paulus
+ */
+_GLOBAL(cacheable_memzero)
+ mr r5,r4
+ li r4,0
+ addi r6,r3,-4
+ cmplwi 0,r5,4
+ blt 7f
+ stwu r4,4(r6)
+ beqlr
+ andi. r0,r6,3
+ add r5,r0,r5
+ subf r6,r0,r6
+ clrlwi r7,r6,32-LG_CACHELINE_BYTES
+ add r8,r7,r5
+ srwi r9,r8,LG_CACHELINE_BYTES
+ addic. r9,r9,-1 /* total number of complete cachelines */
+ ble 2f
+ xori r0,r7,CACHELINE_MASK & ~3
+ srwi. r0,r0,2
+ beq 3f
+ mtctr r0
+4: stwu r4,4(r6)
+ bdnz 4b
+3: mtctr r9
+ li r7,4
+10: dcbz r7,r6
+ addi r6,r6,CACHELINE_BYTES
+ bdnz 10b
+ clrlwi r5,r8,32-LG_CACHELINE_BYTES
+ addi r5,r5,4
+2: srwi r0,r5,2
+ mtctr r0
+ bdz 6f
+1: stwu r4,4(r6)
+ bdnz 1b
+6: andi. r5,r5,3
+7: cmpwi 0,r5,0
+ beqlr
+ mtctr r5
+ addi r6,r6,3
+8: stbu r4,1(r6)
+ bdnz 8b
+ blr
+
_GLOBAL(memset)
rlwimi r4,r4,8,16,23
rlwimi r4,r4,16,0,15
@@ -94,6 +142,85 @@ _GLOBAL(memset)
bdnz 8b
blr
+/*
+ * This version uses dcbz on the complete cache lines in the
+ * destination area to reduce memory traffic. This requires that
+ * the destination area is cacheable.
+ * We only use this version if the source and dest don't overlap.
+ * -- paulus.
+ */
+_GLOBAL(cacheable_memcpy)
+ add r7,r3,r5 /* test if the src & dst overlap */
+ add r8,r4,r5
+ cmplw 0,r4,r7
+ cmplw 1,r3,r8
+ crand 0,0,4 /* cr0.lt &= cr1.lt */
+ blt memcpy /* if regions overlap */
+
+ addi r4,r4,-4
+ addi r6,r3,-4
+ neg r0,r3
+ andi. r0,r0,CACHELINE_MASK /* # bytes to start of cache line */
+ beq 58f
+
+ cmplw 0,r5,r0 /* is this more than total to do? */
+ blt 63f /* if not much to do */
+ andi. r8,r0,3 /* get it word-aligned first */
+ subf r5,r0,r5
+ mtctr r8
+ beq+ 61f
+70: lbz r9,4(r4) /* do some bytes */
+ stb r9,4(r6)
+ addi r4,r4,1
+ addi r6,r6,1
+ bdnz 70b
+61: srwi. r0,r0,2
+ mtctr r0
+ beq 58f
+72: lwzu r9,4(r4) /* do some words */
+ stwu r9,4(r6)
+ bdnz 72b
+
+58: srwi. r0,r5,LG_CACHELINE_BYTES /* # complete cachelines */
+ clrlwi r5,r5,32-LG_CACHELINE_BYTES
+ li r11,4
+ mtctr r0
+ beq 63f
+53:
+ dcbz r11,r6
+ COPY_16_BYTES
+#if L1_CACHE_BYTES >= 32
+ COPY_16_BYTES
+#if L1_CACHE_BYTES >= 64
+ COPY_16_BYTES
+ COPY_16_BYTES
+#if L1_CACHE_BYTES >= 128
+ COPY_16_BYTES
+ COPY_16_BYTES
+ COPY_16_BYTES
+ COPY_16_BYTES
+#endif
+#endif
+#endif
+ bdnz 53b
+
+63: srwi. r0,r5,2
+ mtctr r0
+ beq 64f
+30: lwzu r0,4(r4)
+ stwu r0,4(r6)
+ bdnz 30b
+
+64: andi. r0,r5,3
+ mtctr r0
+ beq+ 65f
+40: lbz r0,4(r4)
+ stb r0,4(r6)
+ addi r4,r4,1
+ addi r6,r6,1
+ bdnz 40b
+65: blr
+
_GLOBAL(memmove)
cmplw 0,r3,r4
bgt backwards_memcpy
--
2.1.0
* [PATCH v2 3/6] powerpc32: memset(0): use cacheable_memzero
From: Christophe Leroy @ 2015-05-19 10:07 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund
cacheable_memzero uses the dcbz instruction and is more efficient
than memset(0) when the destination is in RAM.
This patch renames memset as generic_memset, and defines memset
as a prolog to cacheable_memzero. This prolog checks whether the
fill byte is 0. If not, it falls back to generic_memset().
The cacheable_memzero symbol disappears, as it is no longer
referenced anywhere.
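In C terms, the new entry point is roughly the following sketch
(cacheable_memzero_body() is a hypothetical name for the dcbz path
that the prolog falls into):

    #include <linux/types.h>

    void *generic_memset(void *s, int c, size_t n);  /* renamed old memset */
    void *cacheable_memzero_body(void *s, size_t n); /* hypothetical name */

    /* Approximate C equivalent of the new dispatch; sketch only. */
    void *memset(void *s, int c, size_t n)
    {
            if (c != 0)
                    return generic_memset(s, c, n); /* old word/byte loop */
            return cacheable_memzero_body(s, n);    /* dcbz fast path */
    }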
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/lib/copy_32.S | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index 55f19f9..0b4f954 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -74,9 +74,9 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
* to set them to zero. This requires that the destination
* area is cacheable. -- paulus
*/
-_GLOBAL(cacheable_memzero)
- mr r5,r4
- li r4,0
+_GLOBAL(memset)
+ cmplwi r4,0
+ bne- generic_memset
addi r6,r3,-4
cmplwi 0,r5,4
blt 7f
@@ -117,7 +117,7 @@ _GLOBAL(cacheable_memzero)
bdnz 8b
blr
-_GLOBAL(memset)
+_GLOBAL(generic_memset)
rlwimi r4,r4,8,16,23
rlwimi r4,r4,16,0,15
addi r6,r3,-4
--
2.1.0
* [PATCH v2 4/6] powerpc32: Merge the new memset() with the old one
From: Christophe Leroy @ 2015-05-19 10:07 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund
cacheable_memzero(), which has become the new memset(), and the old
memset() are quite similar, so just merge them.
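As a hedged C sketch of the merged flow (helper names hypothetical):
the fill byte is first replicated across a word, which is what the
two rlwimi instructions do, alignment is handled, and only then does
a zero test pick the dcbz path:

    #include <linux/types.h>

    void *zero_with_dcbz(void *s, size_t n); /* hypothetical helpers */
    void *store_words(void *s, u32 w, size_t n);

    static void *memset_sketch(void *s, int c, size_t n)
    {
            u32 w = (u8)c;

            w |= w << 8;    /* rlwimi r4,r4,8,16,23 */
            w |= w << 16;   /* rlwimi r4,r4,16,0,15 */
            /* ... word-align s, adjusting n ... */
            if (w == 0)
                    return zero_with_dcbz(s, n); /* cache-line path */
            return store_words(s, w, n);         /* ordinary stw loop */
    }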
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/lib/copy_32.S | 34 +++++++---------------------------
1 file changed, 7 insertions(+), 27 deletions(-)
diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index 0b4f954..9262071 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -75,8 +75,9 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
* area is cacheable. -- paulus
*/
_GLOBAL(memset)
- cmplwi r4,0
- bne- generic_memset
+ rlwimi r4,r4,8,16,23
+ rlwimi r4,r4,16,0,15
+
addi r6,r3,-4
cmplwi 0,r5,4
blt 7f
@@ -85,6 +86,9 @@ _GLOBAL(memset)
andi. r0,r6,3
add r5,r0,r5
subf r6,r0,r6
+ cmplwi 0,r4,0
+ bne 2f /* Use normal procedure if r4 is not zero */
+
clrlwi r7,r6,32-LG_CACHELINE_BYTES
add r8,r7,r5
srwi r9,r8,LG_CACHELINE_BYTES
@@ -103,32 +107,8 @@ _GLOBAL(memset)
bdnz 10b
clrlwi r5,r8,32-LG_CACHELINE_BYTES
addi r5,r5,4
-2: srwi r0,r5,2
- mtctr r0
- bdz 6f
-1: stwu r4,4(r6)
- bdnz 1b
-6: andi. r5,r5,3
-7: cmpwi 0,r5,0
- beqlr
- mtctr r5
- addi r6,r6,3
-8: stbu r4,1(r6)
- bdnz 8b
- blr
-_GLOBAL(generic_memset)
- rlwimi r4,r4,8,16,23
- rlwimi r4,r4,16,0,15
- addi r6,r3,-4
- cmplwi 0,r5,4
- blt 7f
- stwu r4,4(r6)
- beqlr
- andi. r0,r6,3
- add r5,r0,r5
- subf r6,r0,r6
- srwi r0,r5,2
+2: srwi r0,r5,2
mtctr r0
bdz 6f
1: stwu r4,4(r6)
--
2.1.0
* [PATCH v2 5/6] powerpc32: cacheable_memcpy becomes memcpy
From: Christophe Leroy @ 2015-05-19 10:07 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund
cacheable_memcpy uses the dcbz instruction and is more efficient
than memcpy when the destination is in RAM. If the destination is
in an I/O area, memcpy_toio() is normally used, not memcpy.
This patch renames memcpy as generic_memcpy, and renames
cacheable_memcpy as memcpy.
On MPC885, we get approximately a 7% increase of the transfer rate
on FTP reception.
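A hedged C rendering of the overlap test at the top of the new
memcpy() (helper names hypothetical): the dcbz path is only taken
when the regions do not overlap, otherwise it falls back to the
renamed old implementation:

    #include <linux/types.h>

    void *generic_memcpy(void *dst, const void *src, size_t n);
    void *cacheable_copy(void *dst, const void *src, size_t n);

    static void *memcpy_sketch(void *dst, const void *src, size_t n)
    {
            if ((const char *)src < (char *)dst + n &&
                (char *)dst < (const char *)src + n)
                    return generic_memcpy(dst, src, n); /* overlap */
            return cacheable_copy(dst, src, n);         /* dcbz fast path */
    }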
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/lib/copy_32.S | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index 9262071..1d49c74 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -129,13 +129,18 @@ _GLOBAL(memset)
* We only use this version if the source and dest don't overlap.
* -- paulus.
*/
-_GLOBAL(cacheable_memcpy)
+_GLOBAL(memmove)
+ cmplw 0,r3,r4
+ bgt backwards_memcpy
+ /* fall through */
+
+_GLOBAL(memcpy)
add r7,r3,r5 /* test if the src & dst overlap */
add r8,r4,r5
cmplw 0,r4,r7
cmplw 1,r3,r8
crand 0,0,4 /* cr0.lt &= cr1.lt */
- blt memcpy /* if regions overlap */
+ blt generic_memcpy /* if regions overlap */
addi r4,r4,-4
addi r6,r3,-4
@@ -201,12 +206,7 @@ _GLOBAL(cacheable_memcpy)
bdnz 40b
65: blr
-_GLOBAL(memmove)
- cmplw 0,r3,r4
- bgt backwards_memcpy
- /* fall through */
-
-_GLOBAL(memcpy)
+_GLOBAL(generic_memcpy)
srwi. r7,r5,3
addi r6,r3,-4
addi r4,r4,-4
--
2.1.0
* [PATCH v2 6/6] powerpc32: Few optimisations in memcpy
From: Christophe Leroy @ 2015-05-19 10:07 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund
This patch adds a few optimisations to the memcpy functions: it uses
lbzu/stbu instead of lbz/stb, and reorders instructions inside a
loop to reduce latency due to loads.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
arch/powerpc/lib/copy_32.S | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index 1d49c74..2ef50c6 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -155,9 +155,9 @@ _GLOBAL(memcpy)
mtctr r8
beq+ 61f
70: lbz r9,4(r4) /* do some bytes */
- stb r9,4(r6)
addi r4,r4,1
addi r6,r6,1
+ stb r9,3(r6)
bdnz 70b
61: srwi. r0,r0,2
mtctr r0
@@ -199,10 +199,10 @@ _GLOBAL(memcpy)
64: andi. r0,r5,3
mtctr r0
beq+ 65f
-40: lbz r0,4(r4)
- stb r0,4(r6)
- addi r4,r4,1
- addi r6,r6,1
+ addi r4,r4,3
+ addi r6,r6,3
+40: lbzu r0,1(r4)
+ stbu r0,1(r6)
bdnz 40b
65: blr
--
2.1.0
* RE: [PATCH v2 0/6] powerpc32: replace memcpy and memset by cacheable alternatives
From: David Laight @ 2015-05-19 11:43 UTC (permalink / raw)
To: 'Christophe Leroy',
Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
scottwood
Cc: linuxppc-dev, linux-kernel
From: Christophe Leroy
> Sent: 19 May 2015 11:08
>
> This patchset implements use of cacheable versions of memset and
> memcpy since when the destination is not cacheable, memset_io
> and memcpy_toio are used.
This isn't the right list to ask, but:
Can someone fix the x86 versions of memset/memcpy (and the _io variants)
so that they don't end up being 'rep movsb' on new intel cpus?
I've a C2558 Atom which has the optimised 'rep movsb' hardware.
Copies to/from uncached locations are now done 'byte by byte'.
As well as kernel code, this affects userspace copying to/from
mmap()ed PCIe space.
64-bit reads are slow enough; making them 8 times slower is horrid.
I suspect this affects some network drivers as well.
David
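A hedged sketch of the kind of wider-access MMIO copy being asked
for (copy_from_bar64() is a hypothetical helper; assumes a 64-bit
platform and a length that is a multiple of 8):

    #include <linux/io.h>

    /* Sketch: copy from an ioremap()ed BAR with 64-bit reads
     * instead of byte-granular rep movsb. */
    static void copy_from_bar64(void *dst, const void __iomem *src,
                                size_t len)
    {
            u64 *d = dst;

            while (len >= 8) {
                    *d++ = readq(src);
                    src += 8;
                    len -= 8;
            }
    }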