linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH 0/9] powerpc32: set of optimisation of network checksum functions
@ 2015-09-22 14:34 Christophe Leroy
  2015-09-22 14:34 ` [PATCH 1/9] powerpc: unexport csum_tcpudp_magic Christophe Leroy
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, netdev

This patch series gathers patches related to the checksum functions on
powerpc. Some of these patches have already been submitted individually.

Christophe Leroy (9):
  powerpc: unexport csum_tcpudp_magic
  powerpc: mark xer clobbered in csum_add()
  powerpc32: checksum_wrappers_64 becomes checksum_wrappers
  powerpc: inline ip_fast_csum()
  powerpc32: rewrite csum_partial_copy_generic() based on
    copy_tofrom_user()
  powerpc32: optimise a few instructions in csum_partial()
  powerpc32: optimise csum_partial() loop
  powerpc: simplify csum_add(a, b) in case a or b is constant 0
  powerpc: optimise csum_partial() call when len is constant

 arch/powerpc/include/asm/checksum.h                | 143 +++++---
 arch/powerpc/lib/Makefile                          |   3 +-
 arch/powerpc/lib/checksum_32.S                     | 398 +++++++++++++--------
 arch/powerpc/lib/checksum_64.S                     |  31 +-
 ...{checksum_wrappers_64.c => checksum_wrappers.c} |   0
 arch/powerpc/lib/ppc_ksyms.c                       |   4 +-
 6 files changed, 350 insertions(+), 229 deletions(-)
 rename arch/powerpc/lib/{checksum_wrappers_64.c => checksum_wrappers.c} (100%)

-- 
2.1.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/9] powerpc: unexport csum_tcpudp_magic
  2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
  2015-09-22 14:34 ` [PATCH 2/9] powerpc: mark xer clobbered in csum_add() Christophe Leroy
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, netdev

csum_tcpudp_magic() is now an inline function, so there is
nothing left to export.
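
For context, the inline that replaced the exported assembly routine looks
roughly like this (simplified sketch based on asm/checksum.h; it builds on
the csum_fold() and csum_tcpudp_nofold() helpers already declared there):

	static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
						unsigned short len,
						unsigned short proto,
						__wsum sum)
	{
		/* sum the pseudo-header, then fold the 32-bit result to 16 bits */
		return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
	}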

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/lib/ppc_ksyms.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/lib/ppc_ksyms.c b/arch/powerpc/lib/ppc_ksyms.c
index c7f8e95..f5e427e 100644
--- a/arch/powerpc/lib/ppc_ksyms.c
+++ b/arch/powerpc/lib/ppc_ksyms.c
@@ -20,7 +20,6 @@ EXPORT_SYMBOL(strncmp);
 EXPORT_SYMBOL(csum_partial);
 EXPORT_SYMBOL(csum_partial_copy_generic);
 EXPORT_SYMBOL(ip_fast_csum);
-EXPORT_SYMBOL(csum_tcpudp_magic);
 #endif
 
 EXPORT_SYMBOL(__copy_tofrom_user);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/9] powerpc: mark xer clobbered in csum_add()
  2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
  2015-09-22 14:34 ` [PATCH 1/9] powerpc: unexport csum_tcpudp_magic Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
  2015-09-22 14:34 ` [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers Christophe Leroy
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, netdev

addc uses the carry bit, so xer must be marked as clobbered in csum_add().
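
For reference, a plain-C sketch of what the asm computes (illustrative name
and code, not the actual implementation): addc performs the 32-bit add and
records the carry in XER[CA], and addze folds that carry back in, so the
register really is modified:

	static inline __wsum csum_add_sketch(__wsum csum, __wsum addend)
	{
		u32 res = (__force u32)csum + (__force u32)addend;

		/* end-around carry: what addc/addze do via XER[CA] */
		if (res < (__force u32)addend)
			res++;
		return (__force __wsum)res;
	}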

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/include/asm/checksum.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index e8d9ef4..d2ca07b 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -141,7 +141,7 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
 #else
 	asm("addc %0,%0,%1;"
 	    "addze %0,%0;"
-	    : "+r" (csum) : "r" (addend));
+	    : "+r" (csum) : "r" (addend) : "xer");
 	return csum;
 #endif
 }
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers
  2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
  2015-09-22 14:34 ` [PATCH 1/9] powerpc: unexport csum_tcpudp_magic Christophe Leroy
  2015-09-22 14:34 ` [PATCH 2/9] powerpc: mark xer clobbered in csum_add() Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
  2015-10-23  3:26   ` Scott Wood
  2015-09-22 14:34 ` [PATCH 4/9] powerpc: inline ip_fast_csum() Christophe Leroy
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, netdev

The powerpc64 checksum wrapper functions add csum_and_copy_to_user(),
which is otherwise implemented in include/net/checksum.h by using
csum_partial() followed by copy_to_user().

Those two wrapper functions are also applicable to powerpc32, as they
are based on csum_partial_copy_generic(), which also exists on
powerpc32.

This patch renames arch/powerpc/lib/checksum_wrappers_64.c to
arch/powerpc/lib/checksum_wrappers.c and builds it regardless of
CONFIG_WORD_SIZE.
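
For reference, when an architecture does not provide HAVE_CSUM_COPY_USER,
include/net/checksum.h falls back to roughly the following (simplified
sketch), which traverses the data twice instead of once:

	static inline __wsum csum_and_copy_to_user(const void *src,
						   void __user *dst, int len,
						   __wsum sum, int *err_ptr)
	{
		sum = csum_partial(src, len, sum);	/* pass 1: checksum */

		if (copy_to_user(dst, src, len) == 0)	/* pass 2: copy */
			return sum;
		if (len)
			*err_ptr = -EFAULT;

		return (__force __wsum)-1;		/* invalid checksum */
	}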

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/include/asm/checksum.h                              | 9 ---------
 arch/powerpc/lib/Makefile                                        | 3 +--
 arch/powerpc/lib/{checksum_wrappers_64.c => checksum_wrappers.c} | 0
 3 files changed, 1 insertion(+), 11 deletions(-)
 rename arch/powerpc/lib/{checksum_wrappers_64.c => checksum_wrappers.c} (100%)

diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index d2ca07b..afa6722 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -47,21 +47,12 @@ extern __wsum csum_partial_copy_generic(const void *src, void *dst,
 					      int len, __wsum sum,
 					      int *src_err, int *dst_err);
 
-#ifdef __powerpc64__
 #define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER
 extern __wsum csum_and_copy_from_user(const void __user *src, void *dst,
 				      int len, __wsum sum, int *err_ptr);
 #define HAVE_CSUM_COPY_USER
 extern __wsum csum_and_copy_to_user(const void *src, void __user *dst,
 				    int len, __wsum sum, int *err_ptr);
-#else
-/*
- * the same as csum_partial, but copies from src to dst while it
- * checksums.
- */
-#define csum_partial_copy_from_user(src, dst, len, sum, errp)   \
-        csum_partial_copy_generic((__force const void *)(src), (dst), (len), (sum), (errp), NULL)
-#endif
 
 #define csum_partial_copy_nocheck(src, dst, len, sum)   \
         csum_partial_copy_generic((src), (dst), (len), (sum), NULL, NULL)
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index a47e142..e46b068 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -22,8 +22,7 @@ obj64-$(CONFIG_SMP)	+= locks.o
 obj64-$(CONFIG_ALTIVEC)	+= vmx-helper.o
 
 ifeq ($(CONFIG_GENERIC_CSUM),)
-obj-y			+= checksum_$(CONFIG_WORD_SIZE).o
-obj-$(CONFIG_PPC64)	+= checksum_wrappers_64.o
+obj-y			+= checksum_$(CONFIG_WORD_SIZE).o checksum_wrappers.o
 endif
 
 obj-$(CONFIG_PPC_EMULATE_SSTEP)	+= sstep.o ldstfp.o
diff --git a/arch/powerpc/lib/checksum_wrappers_64.c b/arch/powerpc/lib/checksum_wrappers.c
similarity index 100%
rename from arch/powerpc/lib/checksum_wrappers_64.c
rename to arch/powerpc/lib/checksum_wrappers.c
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/9] powerpc: inline ip_fast_csum()
  2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
                   ` (2 preceding siblings ...)
  2015-09-22 14:34 ` [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
  2015-09-23  5:43   ` Denis Kirjanov
  2016-03-05  3:50   ` [4/9] " Scott Wood
  2015-09-22 14:34 ` [PATCH 5/9] powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user() Christophe Leroy
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, netdev

In several architectures, ip_fast_csum() is inlined.
There are functions like ip_send_check() which do little more
than call ip_fast_csum().
Inlining ip_fast_csum() allows the compiler to optimise such
callers better.
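
ip_send_check() is a good example of such a caller; it is essentially just
(as in net/ipv4/ip_output.c):

	void ip_send_check(struct iphdr *iph)
	{
		iph->check = 0;
		iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
	}

With ip_fast_csum() inlined, the whole function can be compiled as a small
leaf routine without an out-of-line call.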

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/include/asm/checksum.h | 46 +++++++++++++++++++++++++++++++------
 arch/powerpc/lib/checksum_32.S      | 21 -----------------
 arch/powerpc/lib/checksum_64.S      | 27 ----------------------
 arch/powerpc/lib/ppc_ksyms.c        |  1 -
 4 files changed, 39 insertions(+), 56 deletions(-)

diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index afa6722..56deea8 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -9,16 +9,9 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-/*
- * This is a version of ip_compute_csum() optimized for IP headers,
- * which always checksum on 4 octet boundaries.  ihl is the number
- * of 32-bit words and is always >= 5.
- */
 #ifdef CONFIG_GENERIC_CSUM
 #include <asm-generic/checksum.h>
 #else
-extern __sum16 ip_fast_csum(const void *iph, unsigned int ihl);
-
 /*
  * computes the checksum of a memory block at buff, length len,
  * and adds in "sum" (32-bit)
@@ -137,6 +130,45 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
 #endif
 }
 
+/*
+ * This is a version of ip_compute_csum() optimized for IP headers,
+ * which always checksum on 4 octet boundaries.  ihl is the number
+ * of 32-bit words and is always >= 5.
+ */
+static inline __wsum ip_fast_csum_nofold(const void *iph, unsigned int ihl)
+{
+	u32 *ptr = (u32 *)iph + 1;
+#ifdef __powerpc64__
+	unsigned int i;
+	u64 s = *(__force u32 *)iph;
+
+	for (i = 0; i < ihl - 1; i++, ptr++)
+		s += *ptr;
+	s += (s >> 32);
+	return (__force __wsum)s;
+
+#else
+	__wsum sum, tmp;
+
+	asm("mtctr %3;"
+	    "addc %0,%4,%5;"
+	    "1:lwzu %1, 4(%2);"
+	    "adde %0,%0,%1;"
+	    "bdnz 1b;"
+	    "addze %0,%0;"
+	    : "=r"(sum), "=r"(tmp), "+b"(ptr)
+	    : "r"(ihl - 2), "r"(*(u32 *)iph), "r"(*ptr)
+	    : "ctr", "xer", "memory");
+
+	return sum;
+#endif
+}
+
+static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
+{
+	return csum_fold(ip_fast_csum_nofold(iph, ihl));
+}
+
 #endif
 #endif /* __KERNEL__ */
 #endif
diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 6d67e05..0d7eba3 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -20,27 +20,6 @@
 	.text
 
 /*
- * ip_fast_csum(buf, len) -- Optimized for IP header
- * len is in words and is always >= 5.
- */
-_GLOBAL(ip_fast_csum)
-	lwz	r0,0(r3)
-	lwzu	r5,4(r3)
-	addic.	r4,r4,-2
-	addc	r0,r0,r5
-	mtctr	r4
-	blelr-
-1:	lwzu	r4,4(r3)
-	adde	r0,r0,r4
-	bdnz	1b
-	addze	r0,r0		/* add in final carry */
-	rlwinm	r3,r0,16,0,31	/* fold two halves together */
-	add	r3,r0,r3
-	not	r3,r3
-	srwi	r3,r3,16
-	blr
-
-/*
  * computes the checksum of a memory block at buff, length len,
  * and adds in "sum" (32-bit)
  *
diff --git a/arch/powerpc/lib/checksum_64.S b/arch/powerpc/lib/checksum_64.S
index f3ef354..f53f4ab 100644
--- a/arch/powerpc/lib/checksum_64.S
+++ b/arch/powerpc/lib/checksum_64.S
@@ -18,33 +18,6 @@
 #include <asm/ppc_asm.h>
 
 /*
- * ip_fast_csum(r3=buf, r4=len) -- Optimized for IP header
- * len is in words and is always >= 5.
- *
- * In practice len == 5, but this is not guaranteed.  So this code does not
- * attempt to use doubleword instructions.
- */
-_GLOBAL(ip_fast_csum)
-	lwz	r0,0(r3)
-	lwzu	r5,4(r3)
-	addic.	r4,r4,-2
-	addc	r0,r0,r5
-	mtctr	r4
-	blelr-
-1:	lwzu	r4,4(r3)
-	adde	r0,r0,r4
-	bdnz	1b
-	addze	r0,r0		/* add in final carry */
-        rldicl  r4,r0,32,0      /* fold two 32-bit halves together */
-        add     r0,r0,r4
-        srdi    r0,r0,32
-	rlwinm	r3,r0,16,0,31	/* fold two halves together */
-	add	r3,r0,r3
-	not	r3,r3
-	srwi	r3,r3,16
-	blr
-
-/*
  * Computes the checksum of a memory block at buff, length len,
  * and adds in "sum" (32-bit).
  *
diff --git a/arch/powerpc/lib/ppc_ksyms.c b/arch/powerpc/lib/ppc_ksyms.c
index f5e427e..8cd5c0b 100644
--- a/arch/powerpc/lib/ppc_ksyms.c
+++ b/arch/powerpc/lib/ppc_ksyms.c
@@ -19,7 +19,6 @@ EXPORT_SYMBOL(strncmp);
 #ifndef CONFIG_GENERIC_CSUM
 EXPORT_SYMBOL(csum_partial);
 EXPORT_SYMBOL(csum_partial_copy_generic);
-EXPORT_SYMBOL(ip_fast_csum);
 #endif
 
 EXPORT_SYMBOL(__copy_tofrom_user);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/9] powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user()
  2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
                   ` (3 preceding siblings ...)
  2015-09-22 14:34 ` [PATCH 4/9] powerpc: inline ip_fast_csum() Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
  2015-09-22 14:34 ` [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial() Christophe Leroy
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, netdev

csum_partial_copy_generic() does the same as copy_tofrom_user() and also
calculates the checksum during the copy. Unlike copy_tofrom_user(),
the existing version of csum_partial_copy_generic() doesn't take
advantage of the cache.

This patch is a rewrite of csum_partial_copy_generic() based on
copy_tofrom_user().
The previous version of csum_partial_copy_generic() handled the errors
itself. Now that we have the checksum wrapper functions to handle the
error case, as on powerpc64, the error path can be kept simple:
just return -EFAULT.
copy_tofrom_user() only leaves r12 available, so we use it for the
checksum. r7 and r8, which contain the pointers used for error
feedback, are still needed, so we save them on the stack.

On a TCP benchmark using socklib on the loopback interface, with
checksum offload and scatter/gather deactivated, we get about a 20%
performance increase.
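
The contract of the routine is unchanged; conceptually (sketch only, with an
illustrative name, ignoring fault handling and the fact that the asm
interleaves copy and sum inside the cacheline loop) it computes:

	static __wsum csum_copy_sketch(const void *src, void *dst, int len,
				       __wsum sum)
	{
		memcpy(dst, src, len);			/* copy ... */
		return csum_partial(dst, len, sum);	/* ... and checksum */
	}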

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/lib/checksum_32.S | 320 +++++++++++++++++++++++++++--------------
 1 file changed, 209 insertions(+), 111 deletions(-)

diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 0d7eba3..3472372 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -14,6 +14,7 @@
 
 #include <linux/sys.h>
 #include <asm/processor.h>
+#include <asm/cache.h>
 #include <asm/errno.h>
 #include <asm/ppc_asm.h>
 
@@ -66,123 +67,220 @@ _GLOBAL(csum_partial)
  *
  * csum_partial_copy_generic(src, dst, len, sum, src_err, dst_err)
  */
+#define CSUM_COPY_16_BYTES_WITHEX(n)	\
+8 ## n ## 0:			\
+	lwz	r7,4(r4);	\
+8 ## n ## 1:			\
+	lwz	r8,8(r4);	\
+8 ## n ## 2:			\
+	lwz	r9,12(r4);	\
+8 ## n ## 3:			\
+	lwzu	r10,16(r4);	\
+8 ## n ## 4:			\
+	stw	r7,4(r6);	\
+	adde	r12,r12,r7;	\
+8 ## n ## 5:			\
+	stw	r8,8(r6);	\
+	adde	r12,r12,r8;	\
+8 ## n ## 6:			\
+	stw	r9,12(r6);	\
+	adde	r12,r12,r9;	\
+8 ## n ## 7:			\
+	stwu	r10,16(r6);	\
+	adde	r12,r12,r10
+
+#define CSUM_COPY_16_BYTES_EXCODE(n)		\
+.section __ex_table,"a";		\
+	.align	2;			\
+	.long	8 ## n ## 0b,src_error;	\
+	.long	8 ## n ## 1b,src_error;	\
+	.long	8 ## n ## 2b,src_error;	\
+	.long	8 ## n ## 3b,src_error;	\
+	.long	8 ## n ## 4b,dst_error;	\
+	.long	8 ## n ## 5b,dst_error;	\
+	.long	8 ## n ## 6b,dst_error;	\
+	.long	8 ## n ## 7b,dst_error;	\
+	.text
+
+	.text
+	.stabs	"arch/powerpc/lib/",N_SO,0,0,0f
+	.stabs	"checksum_32.S",N_SO,0,0,0f
+0:
+
+CACHELINE_BYTES = L1_CACHE_BYTES
+LG_CACHELINE_BYTES = L1_CACHE_SHIFT
+CACHELINE_MASK = (L1_CACHE_BYTES-1)
+
 _GLOBAL(csum_partial_copy_generic)
-	addic	r0,r6,0
-	subi	r3,r3,4
-	subi	r4,r4,4
-	srwi.	r6,r5,2
-	beq	3f		/* if we're doing < 4 bytes */
-	andi.	r9,r4,2		/* Align dst to longword boundary */
-	beq+	1f
-81:	lhz	r6,4(r3)	/* do 2 bytes to get aligned */
-	addi	r3,r3,2
-	subi	r5,r5,2
-91:	sth	r6,4(r4)
-	addi	r4,r4,2
-	addc	r0,r0,r6
-	srwi.	r6,r5,2		/* # words to do */
-	beq	3f
-1:	srwi.	r6,r5,4		/* # groups of 4 words to do */
-	beq	10f
-	mtctr	r6
-71:	lwz	r6,4(r3)
-72:	lwz	r9,8(r3)
-73:	lwz	r10,12(r3)
-74:	lwzu	r11,16(r3)
-	adde	r0,r0,r6
-75:	stw	r6,4(r4)
-	adde	r0,r0,r9
-76:	stw	r9,8(r4)
-	adde	r0,r0,r10
-77:	stw	r10,12(r4)
-	adde	r0,r0,r11
-78:	stwu	r11,16(r4)
-	bdnz	71b
-10:	rlwinm.	r6,r5,30,30,31	/* # words left to do */
-	beq	13f
-	mtctr	r6
-82:	lwzu	r9,4(r3)
-92:	stwu	r9,4(r4)
-	adde	r0,r0,r9
-	bdnz	82b
-13:	andi.	r5,r5,3
-3:	cmpwi	0,r5,2
-	blt+	4f
-83:	lhz	r6,4(r3)
-	addi	r3,r3,2
-	subi	r5,r5,2
-93:	sth	r6,4(r4)
+	stwu	r1,-16(r1)
+	stw	r7,12(r1)
+	stw	r8,8(r1)
+
+	andi.	r0,r4,1			/* is destination address even ? */
+	cmplwi	cr7,r0,0
+	addic	r12,r6,0
+	addi	r6,r4,-4
+	neg	r0,r4
+	addi	r4,r3,-4
+	andi.	r0,r0,CACHELINE_MASK	/* # bytes to start of cache line */
+	beq	58f
+
+	cmplw	0,r5,r0			/* is this more than total to do? */
+	blt	63f			/* if not much to do */
+	andi.	r8,r0,3			/* get it word-aligned first */
+	mtctr	r8
+	beq+	61f
+	li	r3,0
+70:	lbz	r9,4(r4)		/* do some bytes */
+	addi	r4,r4,1
+	slwi	r3,r3,8
+	rlwimi	r3,r9,0,24,31
+71:	stb	r9,4(r6)
+	addi	r6,r6,1
+	bdnz	70b
+	adde	r12,r12,r3
+61:	subf	r5,r0,r5
+	srwi.	r0,r0,2
+	mtctr	r0
+	beq	58f
+72:	lwzu	r9,4(r4)		/* do some words */
+	adde	r12,r12,r9
+73:	stwu	r9,4(r6)
+	bdnz	72b
+
+58:	srwi.	r0,r5,LG_CACHELINE_BYTES /* # complete cachelines */
+	clrlwi	r5,r5,32-LG_CACHELINE_BYTES
+	li	r11,4
+	beq	63f
+
+	/* Here we decide how far ahead to prefetch the source */
+	li	r3,4
+	cmpwi	r0,1
+	li	r7,0
+	ble	114f
+	li	r7,1
+#if MAX_COPY_PREFETCH > 1
+	/* Heuristically, for large transfers we prefetch
+	   MAX_COPY_PREFETCH cachelines ahead.  For small transfers
+	   we prefetch 1 cacheline ahead. */
+	cmpwi	r0,MAX_COPY_PREFETCH
+	ble	112f
+	li	r7,MAX_COPY_PREFETCH
+112:	mtctr	r7
+111:	dcbt	r3,r4
+	addi	r3,r3,CACHELINE_BYTES
+	bdnz	111b
+#else
+	dcbt	r3,r4
+	addi	r3,r3,CACHELINE_BYTES
+#endif /* MAX_COPY_PREFETCH > 1 */
+
+114:	subf	r8,r7,r0
+	mr	r0,r7
+	mtctr	r8
+
+53:	dcbt	r3,r4
+54:	dcbz	r11,r6
+/* the main body of the cacheline loop */
+	CSUM_COPY_16_BYTES_WITHEX(0)
+#if L1_CACHE_BYTES >= 32
+	CSUM_COPY_16_BYTES_WITHEX(1)
+#if L1_CACHE_BYTES >= 64
+	CSUM_COPY_16_BYTES_WITHEX(2)
+	CSUM_COPY_16_BYTES_WITHEX(3)
+#if L1_CACHE_BYTES >= 128
+	CSUM_COPY_16_BYTES_WITHEX(4)
+	CSUM_COPY_16_BYTES_WITHEX(5)
+	CSUM_COPY_16_BYTES_WITHEX(6)
+	CSUM_COPY_16_BYTES_WITHEX(7)
+#endif
+#endif
+#endif
+	bdnz	53b
+	cmpwi	r0,0
+	li	r3,4
+	li	r7,0
+	bne	114b
+
+63:	srwi.	r0,r5,2
+	mtctr	r0
+	beq	64f
+30:	lwzu	r0,4(r4)
+	adde	r12,r12,r0
+31:	stwu	r0,4(r6)
+	bdnz	30b
+
+64:	andi.	r0,r5,2
+	beq+	65f
+40:	lhz	r0,4(r4)
 	addi	r4,r4,2
-	adde	r0,r0,r6
-4:	cmpwi	0,r5,1
-	bne+	5f
-84:	lbz	r6,4(r3)
-94:	stb	r6,4(r4)
-	slwi	r6,r6,8		/* Upper byte of word */
-	adde	r0,r0,r6
-5:	addze	r3,r0		/* add in final carry */
+41:	sth	r0,4(r6)
+	adde	r12,r12,r0
+	addi	r6,r6,2
+65:	andi.	r0,r5,1
+	beq+	66f
+50:	lbz	r0,4(r4)
+51:	stb	r0,4(r6)
+	slwi	r0,r0,8
+	adde	r12,r12,r0
+66:	addze	r3,r12
+	addi	r1,r1,16
+	beqlr+	cr7
+	rlwinm	r3,r3,8,0,31	/* swap bytes for odd destination */
 	blr
 
-/* These shouldn't go in the fixup section, since that would
-   cause the ex_table addresses to get out of order. */
-
-src_error_4:
-	mfctr	r6		/* update # bytes remaining from ctr */
-	rlwimi	r5,r6,4,0,27
-	b	79f
-src_error_1:
-	li	r6,0
-	subi	r5,r5,2
-95:	sth	r6,4(r4)
-	addi	r4,r4,2
-79:	srwi.	r6,r5,2
-	beq	3f
-	mtctr	r6
-src_error_2:
-	li	r6,0
-96:	stwu	r6,4(r4)
-	bdnz	96b
-3:	andi.	r5,r5,3
-	beq	src_error
-src_error_3:
-	li	r6,0
-	mtctr	r5
-	addi	r4,r4,3
-97:	stbu	r6,1(r4)
-	bdnz	97b
+/* read fault */
 src_error:
-	cmpwi	0,r7,0
-	beq	1f
-	li	r6,-EFAULT
-	stw	r6,0(r7)
-1:	addze	r3,r0
+	lwz	r7,12(r1)
+	addi	r1,r1,16
+	cmpwi	cr0,r7,0
+	beqlr
+	li	r0,-EFAULT
+	stw	r0,0(r7)
 	blr
-
+/* write fault */
 dst_error:
-	cmpwi	0,r8,0
-	beq	1f
-	li	r6,-EFAULT
-	stw	r6,0(r8)
-1:	addze	r3,r0
+	lwz	r8,8(r1)
+	addi	r1,r1,16
+	cmpwi	cr0,r8,0
+	beqlr
+	li	r0,-EFAULT
+	stw	r0,0(r8)
 	blr
 
-.section __ex_table,"a"
-	.long	81b,src_error_1
-	.long	91b,dst_error
-	.long	71b,src_error_4
-	.long	72b,src_error_4
-	.long	73b,src_error_4
-	.long	74b,src_error_4
-	.long	75b,dst_error
-	.long	76b,dst_error
-	.long	77b,dst_error
-	.long	78b,dst_error
-	.long	82b,src_error_2
-	.long	92b,dst_error
-	.long	83b,src_error_3
-	.long	93b,dst_error
-	.long	84b,src_error_3
-	.long	94b,dst_error
-	.long	95b,dst_error
-	.long	96b,dst_error
-	.long	97b,dst_error
+	.section __ex_table,"a"
+	.align	2
+	.long	70b,src_error
+	.long	71b,dst_error
+	.long	72b,src_error
+	.long	73b,dst_error
+	.long	54b,dst_error
+	.text
+
+/*
+ * this stuff handles faults in the cacheline loop and branches to either
+ * src_error (if in read part) or dst_error (if in write part)
+ */
+	CSUM_COPY_16_BYTES_EXCODE(0)
+#if L1_CACHE_BYTES >= 32
+	CSUM_COPY_16_BYTES_EXCODE(1)
+#if L1_CACHE_BYTES >= 64
+	CSUM_COPY_16_BYTES_EXCODE(2)
+	CSUM_COPY_16_BYTES_EXCODE(3)
+#if L1_CACHE_BYTES >= 128
+	CSUM_COPY_16_BYTES_EXCODE(4)
+	CSUM_COPY_16_BYTES_EXCODE(5)
+	CSUM_COPY_16_BYTES_EXCODE(6)
+	CSUM_COPY_16_BYTES_EXCODE(7)
+#endif
+#endif
+#endif
+
+	.section __ex_table,"a"
+	.align	2
+	.long	30b,src_error
+	.long	31b,dst_error
+	.long	40b,src_error
+	.long	41b,dst_error
+	.long	50b,src_error
+	.long	51b,dst_error
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial()
  2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
                   ` (4 preceding siblings ...)
  2015-09-22 14:34 ` [PATCH 5/9] powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user() Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
  2015-10-23  3:30   ` Scott Wood
  2015-09-22 14:34 ` [PATCH 7/9] powerpc32: optimise csum_partial() loop Christophe Leroy
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, netdev

r5 already contains the value to be updated, so let's use r5 all the
way through for that. It makes the code more readable.

To avoid confusion, it is better to use adde instead of addc.

The first addition is useless: its only purpose is to clear the carry.
As r4 is a signed int that is always positive, this can be done by
using srawi instead of srwi.

Let's also remove the comment claiming that bdnz has no overhead, as
this is not correct on all powerpc, at least not on the MPC8xx.

In the last part, the remaining number of bytes to process is between
0 and 3, so we can base that part on the value of bits 30 and 31 of r4
instead of anding r4 with 3 and then proceeding with comparisons and
subtractions.
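
In C terms, the new tail handling amounts to testing the two low bits of
the length rather than comparing and subtracting; a sketch (illustrative
name, assuming p points at the remaining 0 to 3 bytes):

	static inline __wsum csum_tail_sketch(const u8 *p, int len, __wsum sum)
	{
		if (len & 2) {		/* bit 30 of r4: a halfword is left */
			sum = csum_add(sum, (__force __wsum)*(const u16 *)p);
			p += 2;
		}
		if (len & 1)		/* bit 31 of r4: one final byte */
			sum = csum_add(sum, (__force __wsum)(u32)(*p << 8));
		return sum;
	}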

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/lib/checksum_32.S | 37 +++++++++++++++++--------------------
 1 file changed, 17 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 3472372..9c12602 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -27,35 +27,32 @@
  * csum_partial(buff, len, sum)
  */
 _GLOBAL(csum_partial)
-	addic	r0,r5,0
 	subi	r3,r3,4
-	srwi.	r6,r4,2
+	srawi.	r6,r4,2		/* Divide len by 4 and also clear carry */
 	beq	3f		/* if we're doing < 4 bytes */
-	andi.	r5,r3,2		/* Align buffer to longword boundary */
+	andi.	r0,r3,2		/* Align buffer to longword boundary */
 	beq+	1f
-	lhz	r5,4(r3)	/* do 2 bytes to get aligned */
-	addi	r3,r3,2
+	lhz	r0,4(r3)	/* do 2 bytes to get aligned */
 	subi	r4,r4,2
-	addc	r0,r0,r5
+	addi	r3,r3,2
 	srwi.	r6,r4,2		/* # words to do */
+	adde	r5,r5,r0
 	beq	3f
 1:	mtctr	r6
-2:	lwzu	r5,4(r3)	/* the bdnz has zero overhead, so it should */
-	adde	r0,r0,r5	/* be unnecessary to unroll this loop */
+2:	lwzu	r0,4(r3)
+	adde	r5,r5,r0
 	bdnz	2b
-	andi.	r4,r4,3
-3:	cmpwi	0,r4,2
-	blt+	4f
-	lhz	r5,4(r3)
+3:	andi.	r0,r4,2
+	beq+	4f
+	lhz	r0,4(r3)
 	addi	r3,r3,2
-	subi	r4,r4,2
-	adde	r0,r0,r5
-4:	cmpwi	0,r4,1
-	bne+	5f
-	lbz	r5,4(r3)
-	slwi	r5,r5,8		/* Upper byte of word */
-	adde	r0,r0,r5
-5:	addze	r3,r0		/* add in final carry */
+	adde	r5,r5,r0
+4:	andi.	r0,r4,1
+	beq+	5f
+	lbz	r0,4(r3)
+	slwi	r0,r0,8		/* Upper byte of word */
+	adde	r5,r5,r0
+5:	addze	r3,r5		/* add in final carry */
 	blr
 
 /*
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 7/9] powerpc32: optimise csum_partial() loop
  2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
                   ` (5 preceding siblings ...)
  2015-09-22 14:34 ` [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial() Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
  2015-09-22 14:34 ` [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0 Christophe Leroy
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, netdev

On the 8xx, load latency is 2 cycles and taking a branch also takes
2 cycles. So let's unroll the loop.

This patch improves csum_partial() speed by around 10% on both:
* the 8xx (single-issue processor with parallel execution)
* the 83xx (superscalar 6xx processor with dual instruction fetch
and parallel execution)
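
The shape of the unrolled loop, expressed in C (sketch with an illustrative
name; the asm keeps the running sum in r5 and uses adde), is roughly:

	static inline __wsum csum_words_by_4(const u32 *p, unsigned int nwords,
					     __wsum sum)
	{
		for (; nwords >= 4; nwords -= 4, p += 4) {
			/* four loads can be in flight before the adds need them */
			sum = csum_add(sum, (__force __wsum)p[0]);
			sum = csum_add(sum, (__force __wsum)p[1]);
			sum = csum_add(sum, (__force __wsum)p[2]);
			sum = csum_add(sum, (__force __wsum)p[3]);
		}
		return sum;
	}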

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/lib/checksum_32.S | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 9c12602..0d34f47 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -38,10 +38,24 @@ _GLOBAL(csum_partial)
 	srwi.	r6,r4,2		/* # words to do */
 	adde	r5,r5,r0
 	beq	3f
-1:	mtctr	r6
+1:	andi.	r6,r6,3		/* Prepare to handle words 4 by 4 */
+	beq	21f
+	mtctr	r6
 2:	lwzu	r0,4(r3)
 	adde	r5,r5,r0
 	bdnz	2b
+21:	srwi.	r6,r4,4		/* # blocks of 4 words to do */
+	beq	3f
+	mtctr	r6
+22:	lwz	r0,4(r3)
+	lwz	r6,8(r3)
+	lwz	r7,12(r3)
+	lwzu	r8,16(r3)
+	adde	r5,r5,r0
+	adde	r5,r5,r6
+	adde	r5,r5,r7
+	adde	r5,r5,r8
+	bdnz	22b
 3:	andi.	r0,r4,2
 	beq+	4f
 	lhz	r0,4(r3)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0
  2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
                   ` (6 preceding siblings ...)
  2015-09-22 14:34 ` [PATCH 7/9] powerpc32: optimise csum_partial() loop Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
  2015-10-23  3:33   ` Scott Wood
  2015-09-22 14:34 ` [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant Christophe Leroy
  2015-09-23 22:38 ` [PATCH 0/9] powerpc32: set of optimisation of network checksum functions David Miller
  9 siblings, 1 reply; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, netdev

Simplify csum_add(a, b) when a or b is a constant 0, so that the compiler
can drop the addition entirely.
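
This case shows up once csum_partial() becomes an inline wrapper (patch
9/9): many callers pass a compile-time constant 0 as the initial sum, for
example (sketch, illustrative name):

	static inline __wsum sketch_sum8(const void *buff)
	{
		/* with this patch, the final csum_add(sum1, 0) inside the
		 * csum_partial() wrapper folds away at compile time, so no
		 * addc/addze pair is emitted for it */
		return csum_partial(buff, 8, 0);
	}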

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/include/asm/checksum.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index 56deea8..f8a9704 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -119,7 +119,13 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
 {
 #ifdef __powerpc64__
 	u64 res = (__force u64)csum;
+#endif
+	if (__builtin_constant_p(csum) && csum == 0)
+		return addend;
+	if (__builtin_constant_p(addend) && addend == 0)
+		return csum;
 
+#ifdef __powerpc64__
 	res += (__force u64)addend;
 	return (__force __wsum)((u32)res + (res >> 32));
 #else
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant
  2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
                   ` (7 preceding siblings ...)
  2015-09-22 14:34 ` [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0 Christophe Leroy
@ 2015-09-22 14:34 ` Christophe Leroy
  2015-10-23  3:32   ` Scott Wood
  2016-03-05  5:29   ` [9/9] " Scott Wood
  2015-09-23 22:38 ` [PATCH 0/9] powerpc32: set of optimisation of network checksum functions David Miller
  9 siblings, 2 replies; 22+ messages in thread
From: Christophe Leroy @ 2015-09-22 14:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, scottwood
  Cc: linux-kernel, linuxppc-dev, netdev

csum_partial() is often called for small fixed-length packets
for which it is suboptimal to use the generic csum_partial()
function.

For instance, in my configuration, I got:
* One place calling it with constant len 4
* Seven places calling it with constant len 8
* Three places calling it with constant len 14
* One place calling it with constant len 20
* One place calling it with constant len 24
* One place calling it with constant len 32

This patch renames csum_partial() to __csum_partial() and
implements csum_partial() as an inline wrapper function which
* uses csum_add() for small constant lengths that are a multiple of 16 bits
* uses ip_fast_csum() for other constant lengths that are a multiple of 32 bits
* uses __csum_partial() in all other cases (an example call site is
sketched below)
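
As an example of the effect (sketch with an illustrative name), a constant
14-byte length, typical of an Ethernet header, no longer calls out to
__csum_partial():

	static inline __wsum sketch_csum14(const void *hdr)
	{
		/* expands to three word loads, one halfword load and
		 * three csum_add()s */
		return csum_partial(hdr, 14, 0);
	}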

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 arch/powerpc/include/asm/checksum.h | 80 ++++++++++++++++++++++++++-----------
 arch/powerpc/lib/checksum_32.S      |  4 +-
 arch/powerpc/lib/checksum_64.S      |  4 +-
 arch/powerpc/lib/ppc_ksyms.c        |  2 +-
 4 files changed, 62 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index f8a9704..25c4657f 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -13,20 +13,6 @@
 #include <asm-generic/checksum.h>
 #else
 /*
- * computes the checksum of a memory block at buff, length len,
- * and adds in "sum" (32-bit)
- *
- * returns a 32-bit number suitable for feeding into itself
- * or csum_tcpudp_magic
- *
- * this function must be called with even lengths, except
- * for the last fragment, which may be odd
- *
- * it's best to have buff aligned on a 32-bit boundary
- */
-extern __wsum csum_partial(const void *buff, int len, __wsum sum);
-
-/*
  * Computes the checksum of a memory block at src, length len,
  * and adds in "sum" (32-bit), while copying the block to dst.
  * If an access exception occurs on src or dst, it stores -EFAULT
@@ -67,15 +53,6 @@ static inline __sum16 csum_fold(__wsum sum)
 	return (__force __sum16)(~((__force u32)sum + tmp) >> 16);
 }
 
-/*
- * this routine is used for miscellaneous IP-like checksums, mainly
- * in icmp.c
- */
-static inline __sum16 ip_compute_csum(const void *buff, int len)
-{
-	return csum_fold(csum_partial(buff, len, 0));
-}
-
 static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
                                      unsigned short len,
                                      unsigned short proto,
@@ -175,6 +152,63 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 	return csum_fold(ip_fast_csum_nofold(iph, ihl));
 }
 
+/*
+ * computes the checksum of a memory block at buff, length len,
+ * and adds in "sum" (32-bit)
+ *
+ * returns a 32-bit number suitable for feeding into itself
+ * or csum_tcpudp_magic
+ *
+ * this function must be called with even lengths, except
+ * for the last fragment, which may be odd
+ *
+ * it's best to have buff aligned on a 32-bit boundary
+ */
+__wsum __csum_partial(const void *buff, int len, __wsum sum);
+
+static inline __wsum csum_partial(const void *buff, int len, __wsum sum)
+{
+	if (__builtin_constant_p(len) && len == 0)
+		return sum;
+
+	if (__builtin_constant_p(len) && len <= 16 && (len & 1) == 0) {
+		__wsum sum1;
+
+		if (len == 2)
+			sum1 = (__force u32)*(u16 *)buff;
+		if (len >= 4)
+			sum1 = *(u32 *)buff;
+		if (len == 6)
+			sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 4));
+		if (len >= 8)
+			sum1 = csum_add(sum1, *(u32 *)(buff + 4));
+		if (len == 10)
+			sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 8));
+		if (len >= 12)
+			sum1 = csum_add(sum1, *(u32 *)(buff + 8));
+		if (len == 14)
+			sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 12));
+		if (len >= 16)
+			sum1 = csum_add(sum1, *(u32 *)(buff + 12));
+
+		sum = csum_add(sum1, sum);
+	} else if (__builtin_constant_p(len) && (len & 3) == 0) {
+		sum = csum_add(ip_fast_csum_nofold(buff, len >> 2), sum);
+	} else {
+		sum = __csum_partial(buff, len, sum);
+	}
+	return sum;
+}
+
+/*
+ * this routine is used for miscellaneous IP-like checksums, mainly
+ * in icmp.c
+ */
+static inline __sum16 ip_compute_csum(const void *buff, int len)
+{
+	return csum_fold(csum_partial(buff, len, 0));
+}
+
 #endif
 #endif /* __KERNEL__ */
 #endif
diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
index 0d34f47..043d0088 100644
--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -24,9 +24,9 @@
  * computes the checksum of a memory block at buff, length len,
  * and adds in "sum" (32-bit)
  *
- * csum_partial(buff, len, sum)
+ * __csum_partial(buff, len, sum)
  */
-_GLOBAL(csum_partial)
+_GLOBAL(__csum_partial)
 	subi	r3,r3,4
 	srawi.	r6,r4,2		/* Divide len by 4 and also clear carry */
 	beq	3f		/* if we're doing < 4 bytes */
diff --git a/arch/powerpc/lib/checksum_64.S b/arch/powerpc/lib/checksum_64.S
index f53f4ab..4ab562d 100644
--- a/arch/powerpc/lib/checksum_64.S
+++ b/arch/powerpc/lib/checksum_64.S
@@ -21,9 +21,9 @@
  * Computes the checksum of a memory block at buff, length len,
  * and adds in "sum" (32-bit).
  *
- * csum_partial(r3=buff, r4=len, r5=sum)
+ * __csum_partial(r3=buff, r4=len, r5=sum)
  */
-_GLOBAL(csum_partial)
+_GLOBAL(__csum_partial)
 	addic	r0,r5,0			/* clear carry */
 
 	srdi.	r6,r4,3			/* less than 8 bytes? */
diff --git a/arch/powerpc/lib/ppc_ksyms.c b/arch/powerpc/lib/ppc_ksyms.c
index 8cd5c0b..c422812 100644
--- a/arch/powerpc/lib/ppc_ksyms.c
+++ b/arch/powerpc/lib/ppc_ksyms.c
@@ -17,7 +17,7 @@ EXPORT_SYMBOL(strcmp);
 EXPORT_SYMBOL(strncmp);
 
 #ifndef CONFIG_GENERIC_CSUM
-EXPORT_SYMBOL(csum_partial);
+EXPORT_SYMBOL(__csum_partial);
 EXPORT_SYMBOL(csum_partial_copy_generic);
 #endif
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/9] powerpc: inline ip_fast_csum()
  2015-09-22 14:34 ` [PATCH 4/9] powerpc: inline ip_fast_csum() Christophe Leroy
@ 2015-09-23  5:43   ` Denis Kirjanov
  2016-02-29  7:25     ` Christophe Leroy
  2016-03-05  3:50   ` [4/9] " Scott Wood
  1 sibling, 1 reply; 22+ messages in thread
From: Denis Kirjanov @ 2015-09-23  5:43 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	scottwood, linux-kernel, linuxppc-dev, netdev

On 9/22/15, Christophe Leroy <christophe.leroy@c-s.fr> wrote:
> In several architectures, ip_fast_csum() is inlined
> There are functions like ip_send_check() which do nothing
> much more than calling ip_fast_csum().
> Inlining ip_fast_csum() allows the compiler to optimise better

Hi Christophe,
I did try it and see no difference on ppc64. Did you test with socklib
with modified loopback and if so do you have any numbers?
>
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  arch/powerpc/include/asm/checksum.h | 46
> +++++++++++++++++++++++++++++++------
>  arch/powerpc/lib/checksum_32.S      | 21 -----------------
>  arch/powerpc/lib/checksum_64.S      | 27 ----------------------
>  arch/powerpc/lib/ppc_ksyms.c        |  1 -
>  4 files changed, 39 insertions(+), 56 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/checksum.h
> b/arch/powerpc/include/asm/checksum.h
> index afa6722..56deea8 100644
> --- a/arch/powerpc/include/asm/checksum.h
> +++ b/arch/powerpc/include/asm/checksum.h
> @@ -9,16 +9,9 @@
>   * 2 of the License, or (at your option) any later version.
>   */
>
> -/*
> - * This is a version of ip_compute_csum() optimized for IP headers,
> - * which always checksum on 4 octet boundaries.  ihl is the number
> - * of 32-bit words and is always >= 5.
> - */
>  #ifdef CONFIG_GENERIC_CSUM
>  #include <asm-generic/checksum.h>
>  #else
> -extern __sum16 ip_fast_csum(const void *iph, unsigned int ihl);
> -
>  /*
>   * computes the checksum of a memory block at buff, length len,
>   * and adds in "sum" (32-bit)
> @@ -137,6 +130,45 @@ static inline __wsum csum_add(__wsum csum, __wsum
> addend)
>  #endif
>  }
>
> +/*
> + * This is a version of ip_compute_csum() optimized for IP headers,
> + * which always checksum on 4 octet boundaries.  ihl is the number
> + * of 32-bit words and is always >= 5.
> + */
> +static inline __wsum ip_fast_csum_nofold(const void *iph, unsigned int
> ihl)
> +{
> +	u32 *ptr = (u32 *)iph + 1;
> +#ifdef __powerpc64__
> +	unsigned int i;
> +	u64 s = *(__force u32 *)iph;
> +
> +	for (i = 0; i < ihl - 1; i++, ptr++)
> +		s += *ptr;
> +	s += (s >> 32);
> +	return (__force __wsum)s;
> +
> +#else
> +	__wsum sum, tmp;
> +
> +	asm("mtctr %3;"
> +	    "addc %0,%4,%5;"
> +	    "1:lwzu %1, 4(%2);"
> +	    "adde %0,%0,%1;"
> +	    "bdnz 1b;"
> +	    "addze %0,%0;"
> +	    : "=r"(sum), "=r"(tmp), "+b"(ptr)
> +	    : "r"(ihl - 2), "r"(*(u32 *)iph), "r"(*ptr)
> +	    : "ctr", "xer", "memory");
> +
> +	return sum;
> +#endif
> +}
> +
> +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
> +{
> +	return csum_fold(ip_fast_csum_nofold(iph, ihl));
> +}
> +
>  #endif
>  #endif /* __KERNEL__ */
>  #endif
> diff --git a/arch/powerpc/lib/checksum_32.S
> b/arch/powerpc/lib/checksum_32.S
> index 6d67e05..0d7eba3 100644
> --- a/arch/powerpc/lib/checksum_32.S
> +++ b/arch/powerpc/lib/checksum_32.S
> @@ -20,27 +20,6 @@
>  	.text
>
>  /*
> - * ip_fast_csum(buf, len) -- Optimized for IP header
> - * len is in words and is always >= 5.
> - */
> -_GLOBAL(ip_fast_csum)
> -	lwz	r0,0(r3)
> -	lwzu	r5,4(r3)
> -	addic.	r4,r4,-2
> -	addc	r0,r0,r5
> -	mtctr	r4
> -	blelr-
> -1:	lwzu	r4,4(r3)
> -	adde	r0,r0,r4
> -	bdnz	1b
> -	addze	r0,r0		/* add in final carry */
> -	rlwinm	r3,r0,16,0,31	/* fold two halves together */
> -	add	r3,r0,r3
> -	not	r3,r3
> -	srwi	r3,r3,16
> -	blr
> -
> -/*
>   * computes the checksum of a memory block at buff, length len,
>   * and adds in "sum" (32-bit)
>   *
> diff --git a/arch/powerpc/lib/checksum_64.S
> b/arch/powerpc/lib/checksum_64.S
> index f3ef354..f53f4ab 100644
> --- a/arch/powerpc/lib/checksum_64.S
> +++ b/arch/powerpc/lib/checksum_64.S
> @@ -18,33 +18,6 @@
>  #include <asm/ppc_asm.h>
>
>  /*
> - * ip_fast_csum(r3=buf, r4=len) -- Optimized for IP header
> - * len is in words and is always >= 5.
> - *
> - * In practice len == 5, but this is not guaranteed.  So this code does
> not
> - * attempt to use doubleword instructions.
> - */
> -_GLOBAL(ip_fast_csum)
> -	lwz	r0,0(r3)
> -	lwzu	r5,4(r3)
> -	addic.	r4,r4,-2
> -	addc	r0,r0,r5
> -	mtctr	r4
> -	blelr-
> -1:	lwzu	r4,4(r3)
> -	adde	r0,r0,r4
> -	bdnz	1b
> -	addze	r0,r0		/* add in final carry */
> -        rldicl  r4,r0,32,0      /* fold two 32-bit halves together */
> -        add     r0,r0,r4
> -        srdi    r0,r0,32
> -	rlwinm	r3,r0,16,0,31	/* fold two halves together */
> -	add	r3,r0,r3
> -	not	r3,r3
> -	srwi	r3,r3,16
> -	blr
> -
> -/*
>   * Computes the checksum of a memory block at buff, length len,
>   * and adds in "sum" (32-bit).
>   *
> diff --git a/arch/powerpc/lib/ppc_ksyms.c b/arch/powerpc/lib/ppc_ksyms.c
> index f5e427e..8cd5c0b 100644
> --- a/arch/powerpc/lib/ppc_ksyms.c
> +++ b/arch/powerpc/lib/ppc_ksyms.c
> @@ -19,7 +19,6 @@ EXPORT_SYMBOL(strncmp);
>  #ifndef CONFIG_GENERIC_CSUM
>  EXPORT_SYMBOL(csum_partial);
>  EXPORT_SYMBOL(csum_partial_copy_generic);
> -EXPORT_SYMBOL(ip_fast_csum);
>  #endif
>
>  EXPORT_SYMBOL(__copy_tofrom_user);
> --
> 2.1.0
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/9] powerpc32: set of optimisation of network checksum functions
  2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
                   ` (8 preceding siblings ...)
  2015-09-22 14:34 ` [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant Christophe Leroy
@ 2015-09-23 22:38 ` David Miller
  9 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2015-09-23 22:38 UTC (permalink / raw)
  To: christophe.leroy
  Cc: benh, paulus, mpe, scottwood, linux-kernel, linuxppc-dev, netdev

From: Christophe Leroy <christophe.leroy@c-s.fr>
Date: Tue, 22 Sep 2015 16:34:17 +0200 (CEST)

> This patch serie gather patches related to checksum functions on powerpc.
> Some of those patches have already been submitted individually.

I'm assuming that the powerpc folks will integrate this series.

Let me know if I should take it into net-next instead.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers
  2015-09-22 14:34 ` [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers Christophe Leroy
@ 2015-10-23  3:26   ` Scott Wood
  2015-10-28 11:11     ` Anton Blanchard
  0 siblings, 1 reply; 22+ messages in thread
From: Scott Wood @ 2015-10-23  3:26 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linux-kernel, linuxppc-dev, netdev, anton

On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
> The powerpc64 checksum wrapper functions adds csum_and_copy_to_user()
> which otherwise is implemented in include/net/checksum.h by using
> csum_partial() then copy_to_user()
> 
> Those two wrapper fonctions are also applicable to powerpc32 as it is
> based on the use of csum_partial_copy_generic() which also
> exists on powerpc32
> 
> This patch renames arch/powerpc/lib/checksum_wrappers_64.c to
> arch/powerpc/lib/checksum_wrappers.c and
> makes it non-conditional to CONFIG_WORD_SIZE
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  arch/powerpc/include/asm/checksum.h                              | 9 ------
> ---
>  arch/powerpc/lib/Makefile                                        | 3 +--
>  arch/powerpc/lib/{checksum_wrappers_64.c => checksum_wrappers.c} | 0
>  3 files changed, 1 insertion(+), 11 deletions(-)
>  rename arch/powerpc/lib/{checksum_wrappers_64.c => checksum_wrappers.c} 
> (100%)

I wonder why it was 64-bit specific in the first place.

CCing Anton Blanchard.  

-Scott

> 
> diff --git a/arch/powerpc/include/asm/checksum.h 
> b/arch/powerpc/include/asm/checksum.h
> index d2ca07b..afa6722 100644
> --- a/arch/powerpc/include/asm/checksum.h
> +++ b/arch/powerpc/include/asm/checksum.h
> @@ -47,21 +47,12 @@ extern __wsum csum_partial_copy_generic(const void 
> *src, void *dst,
>                                             int len, __wsum sum,
>                                             int *src_err, int *dst_err);
>  
> -#ifdef __powerpc64__
>  #define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER
>  extern __wsum csum_and_copy_from_user(const void __user *src, void *dst,
>                                     int len, __wsum sum, int *err_ptr);
>  #define HAVE_CSUM_COPY_USER
>  extern __wsum csum_and_copy_to_user(const void *src, void __user *dst,
>                                   int len, __wsum sum, int *err_ptr);
> -#else
> -/*
> - * the same as csum_partial, but copies from src to dst while it
> - * checksums.
> - */
> -#define csum_partial_copy_from_user(src, dst, len, sum, errp)   \
> -        csum_partial_copy_generic((__force const void *)(src), (dst), 
> (len), (sum), (errp), NULL)
> -#endif
>  
>  #define csum_partial_copy_nocheck(src, dst, len, sum)   \
>          csum_partial_copy_generic((src), (dst), (len), (sum), NULL, NULL)
> diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
> index a47e142..e46b068 100644
> --- a/arch/powerpc/lib/Makefile
> +++ b/arch/powerpc/lib/Makefile
> @@ -22,8 +22,7 @@ obj64-$(CONFIG_SMP) += locks.o
>  obj64-$(CONFIG_ALTIVEC)      += vmx-helper.o
>  
>  ifeq ($(CONFIG_GENERIC_CSUM),)
> -obj-y                        += checksum_$(CONFIG_WORD_SIZE).o
> -obj-$(CONFIG_PPC64)  += checksum_wrappers_64.o
> +obj-y                        += checksum_$(CONFIG_WORD_SIZE).o checksum_wrappers.o
>  endif
>  
>  obj-$(CONFIG_PPC_EMULATE_SSTEP)      += sstep.o ldstfp.o
> diff --git a/arch/powerpc/lib/checksum_wrappers_64.c 
> b/arch/powerpc/lib/checksum_wrappers.c
> similarity index 100%
> rename from arch/powerpc/lib/checksum_wrappers_64.c
> rename to arch/powerpc/lib/checksum_wrappers.c

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial()
  2015-09-22 14:34 ` [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial() Christophe Leroy
@ 2015-10-23  3:30   ` Scott Wood
  2016-02-29 12:53     ` Christophe Leroy
  0 siblings, 1 reply; 22+ messages in thread
From: Scott Wood @ 2015-10-23  3:30 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linux-kernel, linuxppc-dev, netdev

On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
> r5 does contain the value to be updated, so lets use r5 all way long
> for that. It makes the code more readable.
> 
> To avoid confusion, it is better to use adde instead of addc
> 
> The first addition is useless. Its only purpose is to clear carry.
> As r4 is a signed int that is always positive, this can be done by
> using srawi instead of srwi
> 
> Let's also remove the comment about bdnz having no overhead as it
> is not correct on all powerpc, at least on MPC8xx
> 
> In the last part, in our situation, the remaining quantity of bytes
> to be proceeded is between 0 and 3. Therefore, we can base that part
> on the value of bit 31 and bit 30 of r4 instead of anding r4 with 3
> then proceding on comparisons and substractions.
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  arch/powerpc/lib/checksum_32.S | 37 +++++++++++++++++--------------------
>  1 file changed, 17 insertions(+), 20 deletions(-)

Do you have benchmarks for these optimizations?

-Scott

> 
> diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
> index 3472372..9c12602 100644
> --- a/arch/powerpc/lib/checksum_32.S
> +++ b/arch/powerpc/lib/checksum_32.S
> @@ -27,35 +27,32 @@
>   * csum_partial(buff, len, sum)
>   */
>  _GLOBAL(csum_partial)
> -     addic   r0,r5,0
>       subi    r3,r3,4
> -     srwi.   r6,r4,2
> +     srawi.  r6,r4,2         /* Divide len by 4 and also clear carry */
>       beq     3f              /* if we're doing < 4 bytes */
> -     andi.   r5,r3,2         /* Align buffer to longword boundary */
> +     andi.   r0,r3,2         /* Align buffer to longword boundary */
>       beq+    1f
> -     lhz     r5,4(r3)        /* do 2 bytes to get aligned */
> -     addi    r3,r3,2
> +     lhz     r0,4(r3)        /* do 2 bytes to get aligned */
>       subi    r4,r4,2
> -     addc    r0,r0,r5
> +     addi    r3,r3,2
>       srwi.   r6,r4,2         /* # words to do */
> +     adde    r5,r5,r0
>       beq     3f
>  1:   mtctr   r6
> -2:   lwzu    r5,4(r3)        /* the bdnz has zero overhead, so it should */
> -     adde    r0,r0,r5        /* be unnecessary to unroll this loop */
> +2:   lwzu    r0,4(r3)
> +     adde    r5,r5,r0
>       bdnz    2b
> -     andi.   r4,r4,3
> -3:   cmpwi   0,r4,2
> -     blt+    4f
> -     lhz     r5,4(r3)
> +3:   andi.   r0,r4,2
> +     beq+    4f
> +     lhz     r0,4(r3)
>       addi    r3,r3,2
> -     subi    r4,r4,2
> -     adde    r0,r0,r5
> -4:   cmpwi   0,r4,1
> -     bne+    5f
> -     lbz     r5,4(r3)
> -     slwi    r5,r5,8         /* Upper byte of word */
> -     adde    r0,r0,r5
> -5:   addze   r3,r0           /* add in final carry */
> +     adde    r5,r5,r0
> +4:   andi.   r0,r4,1
> +     beq+    5f
> +     lbz     r0,4(r3)
> +     slwi    r0,r0,8         /* Upper byte of word */
> +     adde    r5,r5,r0
> +5:   addze   r3,r5           /* add in final carry */
>       blr
>  
>  /*

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant
  2015-09-22 14:34 ` [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant Christophe Leroy
@ 2015-10-23  3:32   ` Scott Wood
  2016-03-05  5:29   ` [9/9] " Scott Wood
  1 sibling, 0 replies; 22+ messages in thread
From: Scott Wood @ 2015-10-23  3:32 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linux-kernel, linuxppc-dev, netdev

On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
> csum_partial is often called for small fixed length packets
> for which it is suboptimal to use the generic csum_partial()
> function.
> 
> For instance, in my configuration, I got:
> * One place calling it with constant len 4
> * Seven places calling it with constant len 8
> * Three places calling it with constant len 14
> * One place calling it with constant len 20
> * One place calling it with constant len 24
> * One place calling it with constant len 32
> 
> This patch renames csum_partial() to __csum_partial() and
> implements csum_partial() as a wrapper inline function which
> * uses csum_add() for small 16bits multiple constant length
> * uses ip_fast_csum() for other 32bits multiple constant
> * uses __csum_partial() in all other cases
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  arch/powerpc/include/asm/checksum.h | 80 ++++++++++++++++++++++++++--------
> ---
>  arch/powerpc/lib/checksum_32.S      |  4 +-
>  arch/powerpc/lib/checksum_64.S      |  4 +-
>  arch/powerpc/lib/ppc_ksyms.c        |  2 +-
>  4 files changed, 62 insertions(+), 28 deletions(-)

Benchmarks?

-Scott

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0
  2015-09-22 14:34 ` [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0 Christophe Leroy
@ 2015-10-23  3:33   ` Scott Wood
  2016-02-29  7:26     ` Christophe Leroy
  0 siblings, 1 reply; 22+ messages in thread
From: Scott Wood @ 2015-10-23  3:33 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linux-kernel, linuxppc-dev, netdev

On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
> Simplify csum_add(a, b) in case a or b is constant 0
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
>  arch/powerpc/include/asm/checksum.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/checksum.h 
> b/arch/powerpc/include/asm/checksum.h
> index 56deea8..f8a9704 100644
> --- a/arch/powerpc/include/asm/checksum.h
> +++ b/arch/powerpc/include/asm/checksum.h
> @@ -119,7 +119,13 @@ static inline __wsum csum_add(__wsum csum, __wsum 
> addend)
>  {
>  #ifdef __powerpc64__
>       u64 res = (__force u64)csum;
> +#endif
> +     if (__builtin_constant_p(csum) && csum == 0)
> +             return addend;
> +     if (__builtin_constant_p(addend) && addend == 0)
> +             return csum;
>  
> +#ifdef __powerpc64__
>       res += (__force u64)addend;
>       return (__force __wsum)((u32)res + (res >> 32));
>  #else

How often does this happen?

-Scott

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers
  2015-10-23  3:26   ` Scott Wood
@ 2015-10-28 11:11     ` Anton Blanchard
  0 siblings, 0 replies; 22+ messages in thread
From: Anton Blanchard @ 2015-10-28 11:11 UTC (permalink / raw)
  To: Scott Wood
  Cc: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, linux-kernel, linuxppc-dev, netdev

Hi Scott,

> I wonder why it was 64-bit specific in the first place.

I think it was part of a series where I added my 64bit assembly checksum
routines, and I didn't step back and think that the wrapper code would
be useful on 32 bit.

Anton

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/9] powerpc: inline ip_fast_csum()
  2015-09-23  5:43   ` Denis Kirjanov
@ 2016-02-29  7:25     ` Christophe Leroy
  0 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2016-02-29  7:25 UTC (permalink / raw)
  To: Denis Kirjanov
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	scottwood, linux-kernel, linuxppc-dev, netdev



On 23/09/2015 07:43, Denis Kirjanov wrote:
> On 9/22/15, Christophe Leroy <christophe.leroy@c-s.fr> wrote:
>> In several architectures, ip_fast_csum() is inlined
>> There are functions like ip_send_check() which do nothing
>> much more than calling ip_fast_csum().
>> Inlining ip_fast_csum() allows the compiler to optimise better
> Hi Christophe,
> I did try it and see no difference on ppc64. Did you test with socklib
> with modified loopback and if so do you have any numbers?

Hi Denis,

I put an mftbl at the start and end of ip_send_check() and tested on an MPC885:
* without ip_fast_csum() inlined, approximately 7 timebase ticks are spent in
ip_send_check()
* with ip_fast_csum() inlined, approximately 5.4 timebase ticks are spent in
ip_send_check()

So it is about a 23% time reduction.
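
The measurement was along these lines (sketch only; the actual probes were
placed inside ip_send_check() itself, get_tbl() is the ppc32 helper that
reads the low 32 bits of the timebase, iph is the header being checksummed,
and the pr_info() is just illustrative):

	unsigned long t0, t1;

	t0 = get_tbl();
	ip_send_check(iph);
	t1 = get_tbl();
	pr_info("ip_send_check: %lu tb ticks\n", t1 - t0);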

Christophe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0
  2015-10-23  3:33   ` Scott Wood
@ 2016-02-29  7:26     ` Christophe Leroy
  0 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2016-02-29  7:26 UTC (permalink / raw)
  To: Scott Wood
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linux-kernel, linuxppc-dev, netdev



On 23/10/2015 05:33, Scott Wood wrote:
> On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
>> Simplify csum_add(a, b) in case a or b is constant 0
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>>   arch/powerpc/include/asm/checksum.h | 6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/checksum.h
>> b/arch/powerpc/include/asm/checksum.h
>> index 56deea8..f8a9704 100644
>> --- a/arch/powerpc/include/asm/checksum.h
>> +++ b/arch/powerpc/include/asm/checksum.h
>> @@ -119,7 +119,13 @@ static inline __wsum csum_add(__wsum csum, __wsum
>> addend)
>>   {
>>   #ifdef __powerpc64__
>>        u64 res = (__force u64)csum;
>> +#endif
>> +     if (__builtin_constant_p(csum) && csum == 0)
>> +             return addend;
>> +     if (__builtin_constant_p(addend) && addend == 0)
>> +             return csum;
>>
>> +#ifdef __powerpc64__
>>        res += (__force u64)addend;
>>        return (__force __wsum)((u32)res + (res >> 32));
>>   #else
> How often does this happen?
>
>
In the following patch (9/9), csum_add() is used to implement 
csum_partial() for small blocks.
In several places in the networking code, csum_partial() is called with
0 as the initial sum.
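
One such caller is visible in patch 9/9 itself, where the initial sum is a
constant 0:

	/* from patch 9/9 */
	static inline __sum16 ip_compute_csum(const void *buff, int len)
	{
		return csum_fold(csum_partial(buff, len, 0));
	}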

Christophe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial()
  2015-10-23  3:30   ` Scott Wood
@ 2016-02-29 12:53     ` Christophe Leroy
  0 siblings, 0 replies; 22+ messages in thread
From: Christophe Leroy @ 2016-02-29 12:53 UTC (permalink / raw)
  To: Scott Wood
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linux-kernel, linuxppc-dev, netdev



Le 23/10/2015 05:30, Scott Wood a écrit :
> On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
>> r5 contains the value to be updated, so let's use r5 all the way
>> for that. It makes the code more readable.
>>
>> To avoid confusion, it is better to use adde instead of addc
>>
>> The first addition is useless. Its only purpose is to clear carry.
>> As r4 is a signed int that is always positive, this can be done by
>> using srawi instead of srwi
>>
>> Let's also remove the comment about bdnz having no overhead as it
>> is not correct on all powerpc, at least on MPC8xx
>>
>> In the last part, in our situation, the remaining number of bytes
>> to be processed is between 0 and 3. Therefore, we can base that part
>> on the values of bits 30 and 31 of r4 instead of ANDing r4 with 3
>> and then proceeding with comparisons and subtractions.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>>   arch/powerpc/lib/checksum_32.S | 37 +++++++++++++++++--------------------
>>   1 file changed, 17 insertions(+), 20 deletions(-)
> Do you have benchmarks for these optimizations?
>
> -Scott
Using mftbl() to get the timebase just before and after the call to
csum_partial(), I get the following on an MPC885:
* 78-byte packets: 9% faster (11.5 to 10.4 TB ticks)
* 328-byte packets: 3% faster (47.9 to 46.5 TB ticks)

Christophe
>
>> diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
>> index 3472372..9c12602 100644
>> --- a/arch/powerpc/lib/checksum_32.S
>> +++ b/arch/powerpc/lib/checksum_32.S
>> @@ -27,35 +27,32 @@
>>    * csum_partial(buff, len, sum)
>>    */
>>   _GLOBAL(csum_partial)
>> -     addic   r0,r5,0
>>        subi    r3,r3,4
>> -     srwi.   r6,r4,2
>> +     srawi.  r6,r4,2         /* Divide len by 4 and also clear carry */
>>        beq     3f              /* if we're doing < 4 bytes */
>> -     andi.   r5,r3,2         /* Align buffer to longword boundary */
>> +     andi.   r0,r3,2         /* Align buffer to longword boundary */
>>        beq+    1f
>> -     lhz     r5,4(r3)        /* do 2 bytes to get aligned */
>> -     addi    r3,r3,2
>> +     lhz     r0,4(r3)        /* do 2 bytes to get aligned */
>>        subi    r4,r4,2
>> -     addc    r0,r0,r5
>> +     addi    r3,r3,2
>>        srwi.   r6,r4,2         /* # words to do */
>> +     adde    r5,r5,r0
>>        beq     3f
>>   1:   mtctr   r6
>> -2:   lwzu    r5,4(r3)        /* the bdnz has zero overhead, so it should */
>> -     adde    r0,r0,r5        /* be unnecessary to unroll this loop */
>> +2:   lwzu    r0,4(r3)
>> +     adde    r5,r5,r0
>>        bdnz    2b
>> -     andi.   r4,r4,3
>> -3:   cmpwi   0,r4,2
>> -     blt+    4f
>> -     lhz     r5,4(r3)
>> +3:   andi.   r0,r4,2
>> +     beq+    4f
>> +     lhz     r0,4(r3)
>>        addi    r3,r3,2
>> -     subi    r4,r4,2
>> -     adde    r0,r0,r5
>> -4:   cmpwi   0,r4,1
>> -     bne+    5f
>> -     lbz     r5,4(r3)
>> -     slwi    r5,r5,8         /* Upper byte of word */
>> -     adde    r0,r0,r5
>> -5:   addze   r3,r0           /* add in final carry */
>> +     adde    r5,r5,r0
>> +4:   andi.   r0,r4,1
>> +     beq+    5f
>> +     lbz     r0,4(r3)
>> +     slwi    r0,r0,8         /* Upper byte of word */
>> +     adde    r5,r5,r0
>> +5:   addze   r3,r5           /* add in final carry */
>>        blr
>>   
>>   /*

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [4/9] powerpc: inline ip_fast_csum()
  2015-09-22 14:34 ` [PATCH 4/9] powerpc: inline ip_fast_csum() Christophe Leroy
  2015-09-23  5:43   ` Denis Kirjanov
@ 2016-03-05  3:50   ` Scott Wood
  1 sibling, 0 replies; 22+ messages in thread
From: Scott Wood @ 2016-03-05  3:50 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	scottwood, netdev, linuxppc-dev, linux-kernel

On Tue, Sep 22, 2015 at 04:34:25PM +0200, Christophe Leroy wrote:
> @@ -137,6 +130,45 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
>  #endif
>  }
>  
> +/*
> + * This is a version of ip_compute_csum() optimized for IP headers,
> + * which always checksum on 4 octet boundaries.  ihl is the number
> + * of 32-bit words and is always >= 5.
> + */
> +static inline __wsum ip_fast_csum_nofold(const void *iph, unsigned int ihl)
> +{
> +	u32 *ptr = (u32 *)iph + 1;

const?

> +#ifdef __powerpc64__
> +	unsigned int i;
> +	u64 s = *(__force u32 *)iph;

const?
Why __force?

> +	s += (s >> 32);
> +	return (__force __wsum)s;
> +
> +#else
> +	__wsum sum, tmp;
> +
> +	asm("mtctr %3;"
> +	    "addc %0,%4,%5;"
> +	    "1:lwzu %1, 4(%2);"
> +	    "adde %0,%0,%1;"
> +	    "bdnz 1b;"
> +	    "addze %0,%0;"
> +	    : "=r"(sum), "=r"(tmp), "+b"(ptr)
> +	    : "r"(ihl - 2), "r"(*(u32 *)iph), "r"(*ptr)
> +	    : "ctr", "xer", "memory");

Space between " and (
Space after :
const in cast

I've fixed these up while applying.
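
With those fixups, the 32-bit branch looks roughly like this (an
illustrative sketch of the comments above, not a quote of the applied
patch; the __powerpc64__ half is omitted):

static inline __wsum ip_fast_csum_nofold(const void *iph, unsigned int ihl)
{
	const u32 *ptr = (const u32 *)iph + 1;	/* point at word 1 */
	__wsum sum, tmp;

	/* Seed the sum with words 0 and 1, then add the remaining
	 * ihl - 2 words with carry and fold the final carry back in. */
	asm("mtctr %3;"
	    "addc %0, %4, %5;"
	    "1: lwzu %1, 4(%2);"
	    "adde %0, %0, %1;"
	    "bdnz 1b;"
	    "addze %0, %0;"
	    : "=r" (sum), "=r" (tmp), "+b" (ptr)
	    : "r" (ihl - 2), "r" (*(const u32 *)iph), "r" (*ptr)
	    : "ctr", "xer", "memory");

	return sum;
}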

-Scott

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [9/9] powerpc: optimise csum_partial() call when len is constant
  2015-09-22 14:34 ` [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant Christophe Leroy
  2015-10-23  3:32   ` Scott Wood
@ 2016-03-05  5:29   ` Scott Wood
  1 sibling, 0 replies; 22+ messages in thread
From: Scott Wood @ 2016-03-05  5:29 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	scottwood, netdev, linuxppc-dev, linux-kernel

On Tue, Sep 22, 2015 at 04:34:36PM +0200, Christophe Leroy wrote:
> +/*
> + * computes the checksum of a memory block at buff, length len,
> + * and adds in "sum" (32-bit)
> + *
> + * returns a 32-bit number suitable for feeding into itself
> + * or csum_tcpudp_magic
> + *
> + * this function must be called with even lengths, except
> + * for the last fragment, which may be odd
> + *
> + * it's best to have buff aligned on a 32-bit boundary
> + */
> +__wsum __csum_partial(const void *buff, int len, __wsum sum);
> +
> +static inline __wsum csum_partial(const void *buff, int len, __wsum sum)
> +{
> +	if (__builtin_constant_p(len) && len == 0)
> +		return sum;
> +
> +	if (__builtin_constant_p(len) && len <= 16 && (len & 1) == 0) {
> +		__wsum sum1;
> +
> +		if (len == 2)
> +			sum1 = (__force u32)*(u16 *)buff;
> +		if (len >= 4)
> +			sum1 = *(u32 *)buff;
> +		if (len == 6)
> +			sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 4));
> +		if (len >= 8)
> +			sum1 = csum_add(sum1, *(u32 *)(buff + 4));
> +		if (len == 10)
> +			sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 8));
> +		if (len >= 12)
> +			sum1 = csum_add(sum1, *(u32 *)(buff + 8));
> +		if (len == 14)
> +			sum1 = csum_add(sum1, (__force u32)*(u16 *)(buff + 12));
> +		if (len >= 16)
> +			sum1 = csum_add(sum1, *(u32 *)(buff + 12));
> +
> +		sum = csum_add(sum1, sum);

Why the final csum_add instead of s/sum1/sum/ and putting csum_add in the
"len == 2" and "len >= 4" cases?

The (__force u32) casts are unnecessary.  Or rather, it should be
(__force __wsum) -- on all of them, not just the 16-bit ones.

The pointer casts should be const.

> +	} else if (__builtin_constant_p(len) && (len & 3) == 0) {
> +		sum = csum_add(ip_fast_csum_nofold(buff, len >> 2), sum);

It may not make a functional difference, but based on the csum_add()
argument names and other csum_add() usage, sum should come first
and the new content second.
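
Concretely, the suggested shape would be roughly the following (a sketch
of the comments above, not the code as applied; the helper name is made
up):

static inline __wsum csum_partial_const_small(const void *buff, int len,
					      __wsum sum)
{
	/* Accumulate directly into 'sum' (no sum1), use const pointer
	 * casts, and cast the loaded values through (__force __wsum)
	 * rather than (__force u32). */
	if (len == 2)
		sum = csum_add(sum, (__force __wsum)*(const u16 *)buff);
	if (len >= 4)
		sum = csum_add(sum, (__force __wsum)*(const u32 *)buff);
	if (len == 6)
		sum = csum_add(sum, (__force __wsum)*(const u16 *)(buff + 4));
	if (len >= 8)
		sum = csum_add(sum, (__force __wsum)*(const u32 *)(buff + 4));
	/* ... continue the same pattern up to len == 16 ... */
	return sum;
}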

-Scott

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-03-05  5:29 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-22 14:34 [PATCH 0/9] powerpc32: set of optimisation of network checksum functions Christophe Leroy
2015-09-22 14:34 ` [PATCH 1/9] powerpc: unexport csum_tcpudp_magic Christophe Leroy
2015-09-22 14:34 ` [PATCH 2/9] powerpc: mark xer clobbered in csum_add() Christophe Leroy
2015-09-22 14:34 ` [PATCH 3/9] powerpc32: checksum_wrappers_64 becomes checksum_wrappers Christophe Leroy
2015-10-23  3:26   ` Scott Wood
2015-10-28 11:11     ` Anton Blanchard
2015-09-22 14:34 ` [PATCH 4/9] powerpc: inline ip_fast_csum() Christophe Leroy
2015-09-23  5:43   ` Denis Kirjanov
2016-02-29  7:25     ` Christophe Leroy
2016-03-05  3:50   ` [4/9] " Scott Wood
2015-09-22 14:34 ` [PATCH 5/9] powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user() Christophe Leroy
2015-09-22 14:34 ` [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial() Christophe Leroy
2015-10-23  3:30   ` Scott Wood
2016-02-29 12:53     ` Christophe Leroy
2015-09-22 14:34 ` [PATCH 7/9] powerpc32: optimise csum_partial() loop Christophe Leroy
2015-09-22 14:34 ` [PATCH 8/9] powerpc: simplify csum_add(a, b) in case a or b is constant 0 Christophe Leroy
2015-10-23  3:33   ` Scott Wood
2016-02-29  7:26     ` Christophe Leroy
2015-09-22 14:34 ` [PATCH 9/9] powerpc: optimise csum_partial() call when len is constant Christophe Leroy
2015-10-23  3:32   ` Scott Wood
2016-03-05  5:29   ` [9/9] " Scott Wood
2015-09-23 22:38 ` [PATCH 0/9] powerpc32: set of optimisation of network checksum functions David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).