* [PATCH 0/4] powerpc32: use cacheable alternatives of memcpy and memset
From: Christophe Leroy @ 2015-05-12 13:32 UTC
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund, Kyle Moffett

This patchset makes use of the cacheable versions of memset and
memcpy when the length is at least the cache line size and the
destination is in RAM.

On MPC885, we observe a 7% increase in the transfer rate on FTP
transfers.
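
In rough C terms, the dispatch added by the series looks like the
following sketch (addr_is_ram() is a placeholder for the max_pfn
check done in assembly, not an existing helper):

	void *memset(void *s, int c, size_t n)
	{
		if (c == 0 && n >= L1_CACHE_BYTES && addr_is_ram(s))
			return cacheable_memzero(s, n);	/* dcbz-based */
		return generic_memset(s, c, n);		/* former memset */
	}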

Christophe Leroy (4):
  Partially revert "powerpc: Remove duplicate cacheable_memcpy/memzero
    functions"
  powerpc32: swap r4 and r5 in cacheable_memzero
  powerpc32: memset(0): use cacheable_memzero
  powerpc32: memcpy: use cacheable_memcpy

 arch/powerpc/lib/copy_32.S | 148 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 148 insertions(+)

-- 
2.1.0


* [PATCH 1/4] Partially revert "powerpc: Remove duplicate cacheable_memcpy/memzero functions"
From: Christophe Leroy @ 2015-05-12 13:32 UTC
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund, Kyle Moffett

This partially reverts commit f909a35bdfb7cb350d078a2cf888162eeb20381c
("powerpc: Remove duplicate cacheable_memcpy/memzero functions").

Functions cacheable_memcpy/memzero are more efficient than
memcpy/memset as they use the dcbz instruction, which avoids
refilling the cache line with the data that we are about to
overwrite.
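
For illustration, the saving comes from dcbz establishing a zeroed,
valid cache line without first reading the old contents from memory.
A minimal inline-assembly sketch of zeroing one line (illustrative
only, not part of the patch; the pointer must be cache-line aligned
and map cacheable memory):

	/* Zero one cache line in the data cache, with no memory read. */
	static inline void zero_cacheline(void *p)
	{
		asm volatile("dcbz 0,%0" : : "r" (p) : "memory");
	}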

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/lib/copy_32.S | 127 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 127 insertions(+)

diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index 6813f80..55f19f9 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -69,6 +69,54 @@ CACHELINE_BYTES = L1_CACHE_BYTES
 LG_CACHELINE_BYTES = L1_CACHE_SHIFT
 CACHELINE_MASK = (L1_CACHE_BYTES-1)
 
+/*
+ * Use dcbz on the complete cache lines in the destination
+ * to set them to zero.  This requires that the destination
+ * area is cacheable.  -- paulus
+ */
+_GLOBAL(cacheable_memzero)
+	mr	r5,r4
+	li	r4,0
+	addi	r6,r3,-4
+	cmplwi	0,r5,4
+	blt	7f
+	stwu	r4,4(r6)
+	beqlr
+	andi.	r0,r6,3
+	add	r5,r0,r5
+	subf	r6,r0,r6
+	clrlwi	r7,r6,32-LG_CACHELINE_BYTES
+	add	r8,r7,r5
+	srwi	r9,r8,LG_CACHELINE_BYTES
+	addic.	r9,r9,-1	/* total number of complete cachelines */
+	ble	2f
+	xori	r0,r7,CACHELINE_MASK & ~3
+	srwi.	r0,r0,2
+	beq	3f
+	mtctr	r0
+4:	stwu	r4,4(r6)
+	bdnz	4b
+3:	mtctr	r9
+	li	r7,4
+10:	dcbz	r7,r6
+	addi	r6,r6,CACHELINE_BYTES
+	bdnz	10b
+	clrlwi	r5,r8,32-LG_CACHELINE_BYTES
+	addi	r5,r5,4
+2:	srwi	r0,r5,2
+	mtctr	r0
+	bdz	6f
+1:	stwu	r4,4(r6)
+	bdnz	1b
+6:	andi.	r5,r5,3
+7:	cmpwi	0,r5,0
+	beqlr
+	mtctr	r5
+	addi	r6,r6,3
+8:	stbu	r4,1(r6)
+	bdnz	8b
+	blr
+
 _GLOBAL(memset)
 	rlwimi	r4,r4,8,16,23
 	rlwimi	r4,r4,16,0,15
@@ -94,6 +142,85 @@ _GLOBAL(memset)
 	bdnz	8b
 	blr
 
+/*
+ * This version uses dcbz on the complete cache lines in the
+ * destination area to reduce memory traffic.  This requires that
+ * the destination area is cacheable.
+ * We only use this version if the source and dest don't overlap.
+ * -- paulus.
+ */
+_GLOBAL(cacheable_memcpy)
+	add	r7,r3,r5		/* test if the src & dst overlap */
+	add	r8,r4,r5
+	cmplw	0,r4,r7
+	cmplw	1,r3,r8
+	crand	0,0,4			/* cr0.lt &= cr1.lt */
+	blt	memcpy			/* if regions overlap */
+
+	addi	r4,r4,-4
+	addi	r6,r3,-4
+	neg	r0,r3
+	andi.	r0,r0,CACHELINE_MASK	/* # bytes to start of cache line */
+	beq	58f
+
+	cmplw	0,r5,r0			/* is this more than total to do? */
+	blt	63f			/* if not much to do */
+	andi.	r8,r0,3			/* get it word-aligned first */
+	subf	r5,r0,r5
+	mtctr	r8
+	beq+	61f
+70:	lbz	r9,4(r4)		/* do some bytes */
+	stb	r9,4(r6)
+	addi	r4,r4,1
+	addi	r6,r6,1
+	bdnz	70b
+61:	srwi.	r0,r0,2
+	mtctr	r0
+	beq	58f
+72:	lwzu	r9,4(r4)		/* do some words */
+	stwu	r9,4(r6)
+	bdnz	72b
+
+58:	srwi.	r0,r5,LG_CACHELINE_BYTES /* # complete cachelines */
+	clrlwi	r5,r5,32-LG_CACHELINE_BYTES
+	li	r11,4
+	mtctr	r0
+	beq	63f
+53:
+	dcbz	r11,r6
+	COPY_16_BYTES
+#if L1_CACHE_BYTES >= 32
+	COPY_16_BYTES
+#if L1_CACHE_BYTES >= 64
+	COPY_16_BYTES
+	COPY_16_BYTES
+#if L1_CACHE_BYTES >= 128
+	COPY_16_BYTES
+	COPY_16_BYTES
+	COPY_16_BYTES
+	COPY_16_BYTES
+#endif
+#endif
+#endif
+	bdnz	53b
+
+63:	srwi.	r0,r5,2
+	mtctr	r0
+	beq	64f
+30:	lwzu	r0,4(r4)
+	stwu	r0,4(r6)
+	bdnz	30b
+
+64:	andi.	r0,r5,3
+	mtctr	r0
+	beq+	65f
+40:	lbz	r0,4(r4)
+	stb	r0,4(r6)
+	addi	r4,r4,1
+	addi	r6,r6,1
+	bdnz	40b
+65:	blr
+
 _GLOBAL(memmove)
 	cmplw	0,r3,r4
 	bgt	backwards_memcpy
-- 
2.1.0


* [PATCH 2/4] powerpc32: swap r4 and r5 in cacheable_memzero
From: Christophe Leroy @ 2015-05-12 13:32 UTC
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund, Kyle Moffett

We swap r4 and r5: this avoids having to move the length, passed
in r4, into r5.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/lib/copy_32.S | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index 55f19f9..cbca76c 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -75,18 +75,17 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
  * area is cacheable.  -- paulus
  */
 _GLOBAL(cacheable_memzero)
-	mr	r5,r4
-	li	r4,0
+	li	r5,0
 	addi	r6,r3,-4
-	cmplwi	0,r5,4
+	cmplwi	0,r4,4
 	blt	7f
-	stwu	r4,4(r6)
+	stwu	r5,4(r6)
 	beqlr
 	andi.	r0,r6,3
-	add	r5,r0,r5
+	add	r4,r0,r4
 	subf	r6,r0,r6
 	clrlwi	r7,r6,32-LG_CACHELINE_BYTES
-	add	r8,r7,r5
+	add	r8,r7,r4
 	srwi	r9,r8,LG_CACHELINE_BYTES
 	addic.	r9,r9,-1	/* total number of complete cachelines */
 	ble	2f
@@ -94,26 +93,26 @@ _GLOBAL(cacheable_memzero)
 	srwi.	r0,r0,2
 	beq	3f
 	mtctr	r0
-4:	stwu	r4,4(r6)
+4:	stwu	r5,4(r6)
 	bdnz	4b
 3:	mtctr	r9
 	li	r7,4
 10:	dcbz	r7,r6
 	addi	r6,r6,CACHELINE_BYTES
 	bdnz	10b
-	clrlwi	r5,r8,32-LG_CACHELINE_BYTES
-	addi	r5,r5,4
-2:	srwi	r0,r5,2
+	clrlwi	r4,r8,32-LG_CACHELINE_BYTES
+	addi	r4,r4,4
+2:	srwi	r0,r4,2
 	mtctr	r0
 	bdz	6f
-1:	stwu	r4,4(r6)
+1:	stwu	r5,4(r6)
 	bdnz	1b
-6:	andi.	r5,r5,3
-7:	cmpwi	0,r5,0
+6:	andi.	r4,r4,3
+7:	cmpwi	0,r4,0
 	beqlr
-	mtctr	r5
+	mtctr	r4
 	addi	r6,r6,3
-8:	stbu	r4,1(r6)
+8:	stbu	r5,1(r6)
 	bdnz	8b
 	blr
 
-- 
2.1.0


* [PATCH 3/4] powerpc32: memset(0): use cacheable_memzero
From: Christophe Leroy @ 2015-05-12 13:32 UTC
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund, Kyle Moffett

cacheable_memzero uses the dcbz instruction and is more efficient
than memset(0) when the destination is in RAM.

This patch renames memset as generic_memset, and defines memset
as a prolog to cacheable_memzero. This prolog checks that the byte
to set is 0 and that the buffer is in RAM. If not, it falls back to
generic_memset().
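
In rough C terms, the buffer-in-RAM test of the prolog is the
following sketch, where __pa() plays the role of the tophys macro
used in the assembly and max_pfn is the same symbol the prolog reads:

	/* Sketch: treat the destination as RAM when its page frame
	 * number is below max_pfn, mirroring page_is_ram() on 32-bit. */
	static int dest_is_ram(const void *p)
	{
		return (__pa(p) >> PAGE_SHIFT) < max_pfn;
	}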

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/lib/copy_32.S | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index cbca76c..d8a9a86 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -12,6 +12,7 @@
 #include <asm/cache.h>
 #include <asm/errno.h>
 #include <asm/ppc_asm.h>
+#include <asm/page.h>
 
 #define COPY_16_BYTES		\
 	lwz	r7,4(r4);	\
@@ -74,6 +75,18 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
  * to set them to zero.  This requires that the destination
  * area is cacheable.  -- paulus
  */
+_GLOBAL(memset)
+	cmplwi	r4,0
+	bne-	generic_memset
+	cmplwi	r5,L1_CACHE_BYTES
+	blt-	generic_memset
+	lis	r8,max_pfn@ha
+	lwz	r8,max_pfn@l(r8)
+	tophys	(r9,r3)
+	srwi	r9,r9,PAGE_SHIFT
+	cmplw	r9,r8
+	bge-	generic_memset
+	mr	r4,r5
 _GLOBAL(cacheable_memzero)
 	li	r5,0
 	addi	r6,r3,-4
@@ -116,7 +129,7 @@ _GLOBAL(cacheable_memzero)
 	bdnz	8b
 	blr
 
-_GLOBAL(memset)
+_GLOBAL(generic_memset)
 	rlwimi	r4,r4,8,16,23
 	rlwimi	r4,r4,16,0,15
 	addi	r6,r3,-4
-- 
2.1.0


* [PATCH 4/4] powerpc32: memcpy: use cacheable_memcpy
From: Christophe Leroy @ 2015-05-12 13:32 UTC
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, Joakim Tjernlund, Kyle Moffett

cacheable_memcpy uses the dcbz instruction and is more efficient
than memcpy when the destination is in RAM.

This patch renames memcpy as generic_memcpy, and defines memcpy as a
prolog to cacheable_memcpy. This prolog checks that the destination
buffer is in RAM. If not, it falls back to generic_memcpy().

On MPC885, we get an approximately 7% increase in the transfer rate
on FTP reception.
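
Note that cacheable_memcpy itself additionally falls back to
generic_memcpy when the regions overlap, since the dcbz path is only
safe for non-overlapping buffers. In rough C terms, its entry test is
the following sketch:

	/* Sketch of the overlap test done with cmplw/crand at the
	 * cacheable_memcpy entry point. */
	static int regions_overlap(const char *dst, const char *src,
				   size_t n)
	{
		return dst < src + n && src < dst + n;
	}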

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/lib/copy_32.S | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index d8a9a86..8f76d49 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -161,13 +161,27 @@ _GLOBAL(generic_memset)
  * We only use this version if the source and dest don't overlap.
  * -- paulus.
  */
+_GLOBAL(memmove)
+	cmplw	0,r3,r4
+	bgt	backwards_memcpy
+	/* fall through */
+
+_GLOBAL(memcpy)
+	cmplwi	r5,L1_CACHE_BYTES
+	blt-	generic_memcpy
+	lis	r8,max_pfn@ha
+	lwz	r8,max_pfn@l(r8)
+	tophys	(r9,r3)
+	srwi	r9,r9,PAGE_SHIFT
+	cmplw	r9,r8
+	bge-	generic_memcpy
 _GLOBAL(cacheable_memcpy)
 	add	r7,r3,r5		/* test if the src & dst overlap */
 	add	r8,r4,r5
 	cmplw	0,r4,r7
 	cmplw	1,r3,r8
 	crand	0,0,4			/* cr0.lt &= cr1.lt */
-	blt	memcpy			/* if regions overlap */
+	blt	generic_memcpy		/* if regions overlap */
 
 	addi	r4,r4,-4
 	addi	r6,r3,-4
@@ -233,12 +247,7 @@ _GLOBAL(cacheable_memcpy)
 	bdnz	40b
 65:	blr
 
-_GLOBAL(memmove)
-	cmplw	0,r3,r4
-	bgt	backwards_memcpy
-	/* fall through */
-
-_GLOBAL(memcpy)
+_GLOBAL(generic_memcpy)
 	srwi.	r7,r5,3
 	addi	r6,r3,-4
 	addi	r4,r4,-4
-- 
2.1.0


* Re: [PATCH 1/4] Partially revert "powerpc: Remove duplicate cacheable_memcpy/memzero functions"
From: Scott Wood @ 2015-05-14  0:49 UTC
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linux-kernel, linuxppc-dev, Joakim Tjernlund, Kyle Moffett

On Tue, 2015-05-12 at 15:32 +0200, Christophe Leroy wrote:
> This partially reverts commit f909a35bdfb7cb350d078a2cf888162eeb20381c
> ("powerpc: Remove duplicate cacheable_memcpy/memzero functions").

I don't have that SHA.  Do you mean
b05ae4ee602b7dc90771408ccf0972e1b3801a35?

> Functions cacheable_memcpy/memzero are more efficient than
> memcpy/memset as they use the dcbz instruction, which avoids
> refilling the cache line with the data that we are about to
> overwrite.

I don't see anything in this patchset that addresses the "NOTE: The old
routines are just flat buggy on kernels that support hardware with
different cacheline sizes" comment.

-Scott



* Re: [PATCH 3/4] powerpc32: memset(0): use cacheable_memzero
From: Scott Wood @ 2015-05-14  0:55 UTC
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linux-kernel, linuxppc-dev, Joakim Tjernlund, Kyle Moffett

On Tue, 2015-05-12 at 15:32 +0200, Christophe Leroy wrote:
> cacheable_memzero uses the dcbz instruction and is more efficient
> than memset(0) when the destination is in RAM.
>
> This patch renames memset as generic_memset, and defines memset
> as a prolog to cacheable_memzero. This prolog checks that the byte
> to set is 0 and that the buffer is in RAM. If not, it falls back to
> generic_memset().
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  arch/powerpc/lib/copy_32.S | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
> index cbca76c..d8a9a86 100644
> --- a/arch/powerpc/lib/copy_32.S
> +++ b/arch/powerpc/lib/copy_32.S
> @@ -12,6 +12,7 @@
>  #include <asm/cache.h>
>  #include <asm/errno.h>
>  #include <asm/ppc_asm.h>
> +#include <asm/page.h>
>  
>  #define COPY_16_BYTES		\
>  	lwz	r7,4(r4);	\
> @@ -74,6 +75,18 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
>   * to set them to zero.  This requires that the destination
>   * area is cacheable.  -- paulus
>   */
> +_GLOBAL(memset)
> +	cmplwi	r4,0
> +	bne-	generic_memset
> +	cmplwi	r5,L1_CACHE_BYTES
> +	blt-	generic_memset
> +	lis	r8,max_pfn@ha
> +	lwz	r8,max_pfn@l(r8)
> +	tophys	(r9,r3)
> +	srwi	r9,r9,PAGE_SHIFT
> +	cmplw	r9,r8
> +	bge-	generic_memset
> +	mr	r4,r5

max_pfn includes highmem, and tophys only works on normal kernel
addresses.

If we were to point memset_io, memcpy_toio, etc. at noncacheable
versions, are there any other callers left that can reasonably point at
uncacheable memory?
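
A noncacheable memset_io could be as simple as the following sketch
(powerpc already has _memset_io in arch/powerpc/kernel/io.c doing
roughly this):

	/* Sketch: byte-wise, noncacheable fill for I/O memory. */
	static void noncacheable_memset_io(volatile void __iomem *addr,
					   int c, unsigned long n)
	{
		while (n--)
			writeb(c, addr++);
	}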

-Scott



* Re: [PATCH 3/4] powerpc32: memset(0): use cacheable_memzero
From: christophe leroy @ 2015-05-14  8:50 UTC
  To: Scott Wood
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linux-kernel, linuxppc-dev, Joakim Tjernlund, Kyle Moffett



On 14/05/2015 02:55, Scott Wood wrote:
> On Tue, 2015-05-12 at 15:32 +0200, Christophe Leroy wrote:
>> cacheable_memzero uses the dcbz instruction and is more efficient
>> than memset(0) when the destination is in RAM.
>>
>> This patch renames memset as generic_memset, and defines memset
>> as a prolog to cacheable_memzero. This prolog checks that the byte
>> to set is 0 and that the buffer is in RAM. If not, it falls back to
>> generic_memset().
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>>   arch/powerpc/lib/copy_32.S | 15 ++++++++++++++-
>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
>> index cbca76c..d8a9a86 100644
>> --- a/arch/powerpc/lib/copy_32.S
>> +++ b/arch/powerpc/lib/copy_32.S
>> @@ -12,6 +12,7 @@
>>   #include <asm/cache.h>
>>   #include <asm/errno.h>
>>   #include <asm/ppc_asm.h>
>> +#include <asm/page.h>
>>   
>>   #define COPY_16_BYTES		\
>>   	lwz	r7,4(r4);	\
>> @@ -74,6 +75,18 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
>>    * to set them to zero.  This requires that the destination
>>    * area is cacheable.  -- paulus
>>    */
>> +_GLOBAL(memset)
>> +	cmplwi	r4,0
>> +	bne-	generic_memset
>> +	cmplwi	r5,L1_CACHE_BYTES
>> +	blt-	generic_memset
>> +	lis	r8,max_pfn@ha
>> +	lwz	r8,max_pfn@l(r8)
>> +	tophys	(r9,r3)
>> +	srwi	r9,r9,PAGE_SHIFT
>> +	cmplw	r9,r8
>> +	bge-	generic_memset
>> +	mr	r4,r5
> max_pfn includes highmem, and tophys only works on normal kernel
> addresses.
Is there any other simple way to determine whether an address is in
RAM or not?

I did that because of the function below, from mm/mem.c:

int page_is_ram(unsigned long pfn)
{
#ifndef CONFIG_PPC64	/* XXX for now */
	return pfn < max_pfn;
#else
	unsigned long paddr = (pfn << PAGE_SHIFT);
	struct memblock_region *reg;

	for_each_memblock(memory, reg)
		if (paddr >= reg->base && paddr < (reg->base + reg->size))
			return 1;
	return 0;
#endif
}



>
> If we were to point memset_io, memcpy_toio, etc. at noncacheable
> versions, are there any other callers left that can reasonably point at
> uncacheable memory?
Do you mean we could just consider that memcpy() and memset() are
called only with a destination in RAM, and thus we could avoid the
check? copy_tofrom_user() already makes this assumption (although a
user app could possibly provide a buffer located in an ALSA-mapped
IO area).

Christophe


* Re: [PATCH 3/4] powerpc32: memset(0): use cacheable_memzero
From: Scott Wood @ 2015-05-14 20:18 UTC
  To: christophe leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linux-kernel, linuxppc-dev, Joakim Tjernlund, Kyle Moffett

On Thu, 2015-05-14 at 10:50 +0200, christophe leroy wrote:
> 
> On 14/05/2015 02:55, Scott Wood wrote:
> > On Tue, 2015-05-12 at 15:32 +0200, Christophe Leroy wrote:
> >> cacheable_memzero uses the dcbz instruction and is more efficient
> >> than memset(0) when the destination is in RAM.
> >>
> >> This patch renames memset as generic_memset, and defines memset
> >> as a prolog to cacheable_memzero. This prolog checks that the byte
> >> to set is 0 and that the buffer is in RAM. If not, it falls back to
> >> generic_memset().
> >>
> >> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> >> ---
> >>   arch/powerpc/lib/copy_32.S | 15 ++++++++++++++-
> >>   1 file changed, 14 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
> >> index cbca76c..d8a9a86 100644
> >> --- a/arch/powerpc/lib/copy_32.S
> >> +++ b/arch/powerpc/lib/copy_32.S
> >> @@ -12,6 +12,7 @@
> >>   #include <asm/cache.h>
> >>   #include <asm/errno.h>
> >>   #include <asm/ppc_asm.h>
> >> +#include <asm/page.h>
> >>   
> >>   #define COPY_16_BYTES		\
> >>   	lwz	r7,4(r4);	\
> >> @@ -74,6 +75,18 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
> >>    * to set them to zero.  This requires that the destination
> >>    * area is cacheable.  -- paulus
> >>    */
> >> +_GLOBAL(memset)
> >> +	cmplwi	r4,0
> >> +	bne-	generic_memset
> >> +	cmplwi	r5,L1_CACHE_BYTES
> >> +	blt-	generic_memset
> >> +	lis	r8,max_pfn@ha
> >> +	lwz	r8,max_pfn@l(r8)
> >> +	tophys	(r9,r3)
> >> +	srwi	r9,r9,PAGE_SHIFT
> >> +	cmplw	r9,r8
> >> +	bge-	generic_memset
> >> +	mr	r4,r5
> > max_pfn includes highmem, and tophys only works on normal kernel
> > addresses.
> Is there any other simple way to determine whether an address is in
> RAM or not?

If you want to do it based on the virtual address, rather than doing a
tablewalk or TLB search, you need to limit it to lowmem.
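
Something along these lines would be the lowmem-only test (a sketch;
high_memory is the kernel's existing upper bound for lowmem):

	/* Sketch: accept only lowmem linear-map addresses, excluding
	 * highmem and vmalloc/ioremap space. */
	static int addr_is_lowmem(const void *p)
	{
		unsigned long a = (unsigned long)p;

		return a >= PAGE_OFFSET && a < (unsigned long)high_memory;
	}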

> I did that because of the function below, from mm/mem.c:
>
> int page_is_ram(unsigned long pfn)
> {
> #ifndef CONFIG_PPC64	/* XXX for now */
> 	return pfn < max_pfn;
> #else
> 	unsigned long paddr = (pfn << PAGE_SHIFT);
> 	struct memblock_region *reg;
>
> 	for_each_memblock(memory, reg)
> 		if (paddr >= reg->base && paddr < (reg->base + reg->size))
> 			return 1;
> 	return 0;
> #endif
> }

Right, the problem is figuring out the pfn in the first place.

> > If we were to point memset_io, memcpy_toio, etc. at noncacheable
> > versions, are there any other callers left that can reasonably point at
> > uncacheable memory?
> Do you mean we could just consider that memcpy() and memset() are
> called only with a destination in RAM, and thus we could avoid the
> check?

Maybe.  If that's not a safe assumption I hope someone will point it
out.

> copy_tofrom_user() already makes this assumption (although a user
> app could possibly provide a buffer located in an ALSA-mapped IO
> area).

The user could also pass in NULL.  That's what the fixups are for. :-)

-Scott



* Re: [PATCH 1/4] Partially revert "powerpc: Remove duplicate cacheable_memcpy/memzero functions"
From: christophe leroy @ 2015-05-15 17:58 UTC
  To: Scott Wood
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linux-kernel, linuxppc-dev, Joakim Tjernlund, Kyle Moffett


On 14/05/2015 02:49, Scott Wood wrote:
> On Tue, 2015-05-12 at 15:32 +0200, Christophe Leroy wrote:
>> This partially reverts commit f909a35bdfb7cb350d078a2cf888162eeb20381c
>> ("powerpc: Remove duplicate cacheable_memcpy/memzero functions").
> I don't have that SHA.  Do you mean
> b05ae4ee602b7dc90771408ccf0972e1b3801a35?
Right, I took it from the wrong tree, sorry.
>
>> Functions cacheable_memcpy/memzero are more efficient than
>> memcpy/memset as they use the dcbz instruction, which avoids
>> refilling the cache line with the data that we are about to
>> overwrite.
> I don't see anything in this patchset that addresses the "NOTE: The old
> routines are just flat buggy on kernels that support hardware with
> different cacheline sizes" comment.
I believe the NOTE means that if a kernel is compiled for several CPUs
having different cache line sizes, then it will not work. But the same
is true of other functions that use the dcbz instruction, such as
copy_page(), clear_page() and copy_tofrom_user().

And indeed, this seems only possible in three cases:

1/ With CONFIG_44x, as 47x has a different cache line size than 44x
and 46x. However, arch/powerpc/platforms/44x/Kconfig explicitly
states: "config PPC_47x This option enables support for the 47x
family of processors and is not currently compatible with other 44x
or 46x varients"

2/ With CONFIG_PPC_85xx, as PPC_E500MC has a different cache line
size than other E500. However, arch/powerpc/platforms/Kconfig.cputype
explicitly states: "config PPC_E500MC This must be enabled for
running on e500mc (and derivatives such as e5500/e6500), and must be
disabled for running on e500v1 or e500v2."

3/ With CONFIG_403GCX, as 403GCX has a different cache line size than
other 40x. However, there seems to be no way to select CONFIG_403GCX
from arch/powerpc/platforms/40x/Kconfig.

Christophe

Thread overview: 10 messages
2015-05-12 13:32 [PATCH 0/4] powerpc32: use cacheable alternatives of memcpy and memset Christophe Leroy
2015-05-12 13:32 ` [PATCH 1/4] Partially revert "powerpc: Remove duplicate cacheable_memcpy/memzero functions" Christophe Leroy
2015-05-14  0:49   ` Scott Wood
2015-05-15 17:58     ` christophe leroy
2015-05-12 13:32 ` [PATCH 2/4] powerpc32: swap r4 and r5 in cacheable_memzero Christophe Leroy
2015-05-12 13:32 ` [PATCH 3/4] powerpc32: memset(0): use cacheable_memzero Christophe Leroy
2015-05-14  0:55   ` Scott Wood
2015-05-14  8:50     ` christophe leroy
2015-05-14 20:18       ` Scott Wood
2015-05-12 13:32 ` [PATCH 4/4] powerpc32: memcpy: use cacheable_memcpy Christophe Leroy
